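Background: This post scrapes the result of every Fucai 3D (福彩3D, the China Welfare Lottery 3D game) draw and analyses which digits appear most often each year, as well as which digit is most frequent in the hundreds, tens, and units positions. The code below counts across all scraped pages rather than splitting the results by year.
Environment: Python 3.x
Features: crawler URL generation, page parsing, statistics, saving to Excel
Page URL: http://kaijiang.zhcw.com/zhcw/html/3d/list.html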
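Module 1: URL generation
import urllib.request

def get_3d_html():
    # The draw history is spread over list_1.html ... list_21.html.
    page_num = range(1, 22)
    b = ''
    for page in page_num:
        url = 'http://kaijiang.zhcw.com/zhcw/html/3d/list_' + str(page) + '.html'
        # Download one page and append its decoded HTML to the result.
        with urllib.request.urlopen(url) as a:
            html = a.read().decode('utf-8')
        b = b + html
    return b
Module 2: Page parsing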
import re

def get_3d_num():
    html = get_3d_html()
    # Five groups per table row: date, draw number, then the three digits.
    # re.S lets '.' match newlines so one pattern can span a whole <tr>.
    reg = re.compile(r'<tr>.*?<td align="center">'
                     r'(.*?)</td>.*?<td align="center">'
                     r'(.*?)</td>.*?<td align="center" '
                     r'style="padding-left:20px;">'
                     r'<em>(.*?)</em>.*?<em>(.*?)</em>.*?'
                     r'<em>(.*?)</em></td>', re.S)
    # findall returns a list of 5-tuples of strings.
    it = re.findall(reg, html)
    return it
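To make the shape of the parsed data concrete, the following is a minimal sketch that applies the same pattern to a small inline HTML string instead of the live site. The sample markup, the name `parse_rows`, and the dates and digits are all invented for illustration; the real pages may be laid out differently.

```python
import re

# Same pattern as get_3d_num(): date, draw number, then three <em> digits.
ROW_RE = re.compile(
    r'<tr>.*?<td align="center">(.*?)</td>'
    r'.*?<td align="center">(.*?)</td>'
    r'.*?<td align="center" style="padding-left:20px;">'
    r'<em>(.*?)</em>.*?<em>(.*?)</em>.*?<em>(.*?)</em></td>',
    re.S,
)

# Hypothetical markup written to fit the pattern; not copied from the site.
SAMPLE_HTML = """
<tr>
  <td align="center">2018-01-01</td>
  <td align="center">2018001</td>
  <td align="center" style="padding-left:20px;"><em>1</em> <em>2</em> <em>3</em></td>
</tr>
<tr>
  <td align="center">2018-01-02</td>
  <td align="center">2018002</td>
  <td align="center" style="padding-left:20px;"><em>4</em> <em>0</em> <em>4</em></td>
</tr>
"""

def parse_rows(html):
    """Return one (date, batch, hundred, ten, unit) tuple per matched row."""
    return ROW_RE.findall(html)

for row in parse_rows(SAMPLE_HTML):
    print(row)
```

Note that the pattern requires the first `<em>` to follow the opening `<td ...>` directly and the last `</em>` to be followed directly by `</td>`, so any extra whitespace at those two points would make a row fail to match.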
Module 3: Result analysis
import collections

def analyze_num(w):
    all_nums = []
    hundred_nums = []
    ten_nums = []
    unit_nums = []
    for each in w:
        # The last three fields of each row are the drawn digits.
        for n in each[-3:]:
            all_nums.append(n)
        hundred_nums.append(each[2])
        ten_nums.append(each[3])
        unit_nums.append(each[4])
    # most_common(3) returns the three most frequent digits with their counts.
    print('most popular number:', collections.Counter(all_nums).most_common(3))
    print('top 3 popular number in hundred is:', collections.Counter(hundred_nums).most_common(3))
    print('top 3 popular number in ten is:', collections.Counter(ten_nums).most_common(3))
    print('top 3 popular number in unit is:', collections.Counter(unit_nums).most_common(3))
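As a rough illustration of the counting step, here is a variant that returns the counters instead of printing them, run on three made-up rows. The function name `digit_frequencies` and the sample rows are assumptions for this sketch, not part of the original script.

```python
from collections import Counter

def digit_frequencies(rows):
    """Return (overall, hundreds, tens, units) Counters for parsed rows."""
    hundreds = Counter(r[2] for r in rows)
    tens = Counter(r[3] for r in rows)
    units = Counter(r[4] for r in rows)
    # Adding Counters sums the per-digit counts across all three positions.
    overall = hundreds + tens + units
    return overall, hundreds, tens, units

# Made-up rows in the (date, batch, hundred, ten, unit) layout.
rows = [
    ('2018-01-01', '2018001', '1', '2', '3'),
    ('2018-01-02', '2018002', '4', '0', '4'),
    ('2018-01-03', '2018003', '1', '4', '3'),
]

overall, hundreds, tens, units = digit_frequencies(rows)
print(overall.most_common(1))   # [('4', 3)]
print(hundreds.most_common(1))  # [('1', 2)]
```

Returning the counters keeps the statistics reusable, for example to write them into the spreadsheet alongside the raw draws.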
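Module 4: Saving to Excel
import xlwt

def excel_create(test):
    # xlwt only writes the legacy .xls format, so the file is saved as .xls.
    newTabel = '/usr/local/bin/python_data/caipiao_2.xls'
    wb = xlwt.Workbook(encoding='utf-8')
    ws = wb.add_sheet('test')
    headData = ['date', 'batch', 'hundred', 'ten', 'unit']
    # Header row: one title per cell.
    for i in range(0, 5):
        ws.write(0, i, headData[i])
    # Data rows start at row 1; each j is a (date, batch, hundred, ten, unit) tuple.
    index = 1
    for j in test:
        for i in range(0, 5):
            ws.write(index, i, j[i])
        index += 1
    wb.save(newTabel)
Final output: [screenshot not preserved]
Excel result: [screenshot not preserved]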
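Going back to the spreadsheet step in Module 4: each call to `ws.write` takes a row index, a column index, and a single cell value, so the header and every data tuple have to be unpacked cell by cell. The sketch below shows only that coordinate layout and does not need xlwt installed. The helper name `iter_cells` and the sample row are hypothetical.

```python
def iter_cells(header, rows):
    """Yield (row_index, col_index, value) for a header plus data rows."""
    # Row 0 holds the column titles.
    for col, title in enumerate(header):
        yield 0, col, title
    # Data rows follow, starting at row 1.
    for row_idx, row in enumerate(rows, start=1):
        for col, value in enumerate(row):
            yield row_idx, col, value

header = ['date', 'batch', 'hundred', 'ten', 'unit']
rows = [('2018-01-01', '2018001', '1', '2', '3')]  # made-up sample row

cells = list(iter_cells(header, rows))
for cell in cells:
    print(cell)
```

With a real worksheet, each triple could be passed straight through, as in `ws.write(r, c, value)`.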