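A Python script that scrapes the first 100 pages of Qiushibaike (糗事百科)
Language: Python. Tags: BeautifulSoup, 美丽的汤 (Beautiful Soup), 糗事百科 (Qiushibaike). Posted 2008/05/23, updated 5 months ago. Author: 半瓶墨水. 1077 views, 7 comments, 1 favorite.
Python code: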
#coding=utf-8
# Python 2 script (uses urllib2 and the print statement).
# Requires BeautifulSoup: http://crummy.com/software/BeautifulSoup

import urllib
import urllib2
from xml.sax.saxutils import unescape
from BeautifulSoup import BeautifulSoup  # For processing HTML

def formalize(text):
    # Strip every line, drop the empty ones, and follow each kept line
    # with a blank line.
    result = u''
    lines = text.split(u'\n')
    for line in lines:
        line = line.strip()
        if len(line) == 0:
            continue
        result += line + u'\n\n'
    return result

outfile = open("qiushi.txt", "w")
count = 0  # running number of stories written so far
for i in range(1, 101):
    url = "http://qiushibaike.com/qiushi/best/all/page/%d" % i
    data = urllib2.urlopen(url).readlines()
    soup = BeautifulSoup("".join(data))
    # Each story sits in a <div class="content"> element.
    contents = soup.findAll('div', "content")
    stories = [str(text) for text in contents]
    for story in stories:
        count += 1
        print "processing page %d, %d items added" % (i, count)
        minisoup = BeautifulSoup(story)
        # Keep only the text nodes, discarding all tags.
        text = ''.join([e for e in minisoup.recursiveChildGenerator() if isinstance(e, unicode)])
        # Decode &quot; (plus the default &amp;, &lt;, &gt;) and any %XX escapes.
        text = urllib.unquote(unescape(text, {'&quot;': '"'}))
        text = formalize(text).encode("utf-8")
        # Write a numbered separator, then the story itself.
        print >> outfile, '-' * 20 + " %05d " % count + '-' * 20 + "\n"
        print >> outfile, text + "\r\n"
outfile.close()
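The script above targets Python 2 and BeautifulSoup 3. As a rough illustration of the same extract-and-clean steps, here is a minimal sketch that uses only the Python 3 standard library and makes no network requests. It swaps BeautifulSoup for `html.parser`, so it is not the author's method. The names `ContentDivText` and `extract_stories` are hypothetical and do not appear in the original. It also assumes the page still wraps each story in a `<div class="content">` element, as the original script does.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class ContentDivText(HTMLParser):
    """Collect the text found inside every <div class="content"> element."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &quot;
        super().__init__(convert_charrefs=True)
        self.stories = []   # one raw string per content div
        self._depth = 0     # <div> nesting depth inside the current content div
        self._chunks = []   # text pieces of the div being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            # A nested <div>: track it so the matching </div> is not
            # mistaken for the end of the story.
            self._depth += 1
        elif "content" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.stories.append("".join(self._chunks))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def formalize(text):
    """Strip each line, drop empty ones, and end each kept line with a blank line."""
    lines = (line.strip() for line in text.split("\n"))
    return "".join(line + "\n\n" for line in lines if line)


def extract_stories(html_text):
    """Return the cleaned text of every content div in html_text."""
    parser = ContentDivText()
    parser.feed(html_text)
    parser.close()
    # Same order as the original: entities first (done by the parser),
    # then %XX escapes, then line normalization.
    return [formalize(unquote(story)) for story in parser.stories]
```

Calling `extract_stories(page_html)` on the decoded HTML of one page returns a list of cleaned story strings, which could then be written to a file with numbered separators as in the original loop.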
All comments (7):
| 1 | Bruce, 5 months ago
| 2 | @Bruce
| 3 | @2: Testing inline links in replies
| 4 | Keep it up! I find it quite useful.
| 5 | Very nice code, a valuable reference.
| 6 |
| 7 | liuxin9023, 12 days ago: #糗事百科完美下载版 code, nice, nice :)