I already have the URLs, but how do I actually go about crawling the article content in Scrapy?
```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from myproject.items import ScrapyBlogItem  # adjust to your project's items module


class BlogSpider(CrawlSpider):
    name = 'bee'
    allowed_domains = ['xxx.com']
    start_urls = ['https://xxxxxx.com/blogs?']
    rules = (
        # Follow pagination links and hand each listing page to parse_item.
        Rule(LinkExtractor(allow="page=[0-9]{1,20}"), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        item = ScrapyBlogItem()
        title = response.xpath('//h2[@class="title"]/a/text()').extract()
        url = response.xpath('//h2[@class="title"]/a/@href').extract()
        item_dict = dict(zip(url, title))  # {url: title} pairs for this listing page
        item['page'] = item_dict
        # yield item
        return item
```
I've since finished writing this and can crawl the articles now. GitHub link: https://github.com/ixizi/scrapy_segmentfault.git
But my implementation doesn't feel right; any pointers appreciated.
1
naomhan 2016-04-21 17:57:36 +08:00
It's the same approach as before: use a CSS selector or XPath to grab the region that contains the article body, then strip out the noise and extract the text.
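A minimal sketch of that idea (the `div.post-body` selector and `parse_article` name are placeholders; use whatever actually wraps the article on the target site):

```python
# Inside your spider; 'div.post-body' is a hypothetical selector for the body container.
def parse_article(self, response):
    body = response.css('div.post-body')
    # Keep only text nodes, skipping <script>/<style> noise.
    texts = body.xpath('.//text()[not(ancestor::script or ancestor::style)]').extract()
    content = '\n'.join(t.strip() for t in texts if t.strip())
    yield {'url': response.url, 'content': content}
```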
2
icybee 2016-04-21 17:59:25 +08:00
What the commenter above said is right.
3
ChiChou 2016-04-21 18:21:42 +08:00
Follow each link and scrape the full content from there. Below is an example that crawls WooYun vulnerability reports, for reference only:
```python
import re

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request

RE_VULID = r'\-(\d+\-\d+)'


class WooyunSpider(CrawlSpider):
    name = 'wooyun'
    allowed_domains = ['www.wooyun.org']
    start_urls = ['http://www.wooyun.org/bugs/new_public/page/1']

    rules = (
        Rule(LinkExtractor(allow=(r'\/bugs\/new_public\/page\/\d+', ), )),
        Rule(LinkExtractor(allow=(r'\/bugs\/wooyun\-\d+\-\d+', )), callback='parse_vul'),
    )

    def __init__(self, *args, **kwarg):
        super(WooyunSpider, self).__init__(*args, **kwarg)
        self.finished = set()

    def make_requests_from_url(self, url):
        match = re.findall(RE_VULID, url)
        if match:
            vulid, = match
            if vulid in self.finished:
                return
            else:
                self.finished.add(vulid)
        return Request(url, dont_filter=False)

    def parse_vul(self, response):
        item = {key: ''.join([text.strip() for text in extracted])
                for key, extracted in {
            'title': response.css('h3.wybug_title::text').re(ur'\t\t(\S+)'),
            'vulid': response.xpath('//h3/a[starts-with(@href,"/bugs/wooyun")]/@href').re(RE_VULID),
            'vendor': response.css('h3.wybug_corp a::text').extract(),
            'author': response.css('h3.wybug_author a::text').extract(),
            'submitted': response.css('h3.wybug_date::text').re('\t\t(\d+\-\d+\-\d+\s+\d+:\d+)'),
            'published': response.css('h3.wybug_open_date::text').re('\t\t(\d+\-\d+\-\d+\s+\d+:\d+)'),
            'detail': response.css('.wybug_detail').xpath('./node()').extract(),
            'patch': response.css('.wybug_patch .detail').xpath('./node()').extract(),
            'rank': response.css('.bug_result .detail').re(r'Rank[\s\S]?(\d*)'),
            'description': response.css('p.detail.wybug_description::text').extract(),
            'vultype': response.css('h3.wybug_type::text').re('\t\t(\S+)'),
            'level': response.css('.bug_result .detailTitle + p.detail::text').re(ur'\uff1a(\S+)'),
        }.iteritems()}
        yield item
```
4
ChiChou 2016-04-21 18:22:18 +08:00
So how the hell do you paste code in a reply?
5
v2014 2016-04-21 18:24:27 +08:00
The commenters above missed the OP's point: he already has the url and title, but the content isn't on this page, so the question is how to crawl the content.
Just return a Request(url, callback=self.parse) with a callback that handles your content page. When you return it is up to you; that depends on whether you want BFS or DFS.
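Applied to the OP's spider, that might look roughly like the sketch below (`parse_article` and the `.article-body` selector are guesses; the real selector depends on the site's markup):

```python
from scrapy.spiders import CrawlSpider
from scrapy.http import Request


class BlogSpider(CrawlSpider):
    # ... name, allowed_domains, start_urls, rules as in the OP ...

    def parse_item(self, response):
        # Listing page: pair each article URL with its title, then follow each link.
        titles = response.xpath('//h2[@class="title"]/a/text()').extract()
        urls = response.xpath('//h2[@class="title"]/a/@href').extract()
        for url, title in zip(urls, titles):
            # meta carries the title over to the detail-page callback.
            yield Request(response.urljoin(url),
                          callback=self.parse_article,
                          meta={'title': title})

    def parse_article(self, response):
        # Detail page: '.article-body' is a placeholder selector.
        yield {
            'title': response.meta['title'],
            'url': response.url,
            'content': response.css('.article-body').extract_first(),
        }
```

With CrawlSpider the same effect can also be had from a second Rule whose LinkExtractor matches the article URLs, as in the WooYun example above.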
10
ChiChou 2016-04-21 20:51:35 +08:00
@Ixizi pastebin: http://pastebin.com/stRTs7Yj