How do I handle pagination inside detail pages in Scrapy?

import scrapy
from scrapy.selector import Selector
from scrapy.loader import ItemLoader
from itemloaders.processors import Identity  # older Scrapy: scrapy.loader.processors
# MeizituItem is defined in the project's own items module.

class MeizituSpider(scrapy.Spider):
    name = "meizitu"
    allowed_domains = ["27270.com"]
    start_urls = []
    # Collect all the list-page (pagination) URLs
    for pn in range(2, 3):
        url = 'http://www.27270.com/ent/meinvtupian/list_11_%s.html' % pn
        start_urls.append(url)

    def parse(self, response):
        sel = Selector(response)
        for link in sel.xpath('//html/body/div[2]/div[10]/ul/li/a[2]/@href').extract():
            request = scrapy.Request(link, callback=self.parse_item)
            yield request

    def parse_item(self, response):
        l = ItemLoader(item=MeizituItem(), response=response)
        l.add_xpath('name', '//html/head/title/text()')
        l.add_xpath('tags', '//*[@id="body"]/div[1]/div[4]/div[3]/a/text()')
        l.add_xpath('image_urls', '//*[@id="RightUrl"]/img/@src', Identity())
        l.add_value('url', response.url)
        return l.load_item()

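That is my current code. Should the paginated content of each detail page be scraped inside parse_item, or should I set up a separate class for it?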

So far I haven't found any ready-made code for scraping pagination inside detail pages. Any help is appreciated.
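One detail in the code above: `range(2, 3)` yields only the number 2, so `start_urls` ends up holding a single list page. Below is a minimal sketch of covering a wider range, assuming the list pages keep the same `list_11_<n>.html` pattern. The helper name `build_start_urls` and the last page number 5 are made up for illustration.

```python
LIST_URL = "http://www.27270.com/ent/meinvtupian/list_11_%s.html"


def build_start_urls(first_page, last_page):
    """Return the list-page URLs for first_page..last_page, inclusive."""
    return [LIST_URL % pn for pn in range(first_page, last_page + 1)]


# What the spider above builds: range(2, 3) stops before 3.
current = [LIST_URL % pn for pn in range(2, 3)]

# Hypothetical wider range; the real last page number is not known here.
wider = build_start_urls(2, 5)
```

The resulting list could be assigned to `start_urls` in the spider class in place of the loop.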

leopku

2016-08-22 21:54:12 +08:00

Just handle them separately.

```python
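def parse(self, response):
    sel = scrapy.Selector(response)

    # Item links: each detail page is handled by parse_item
    for item_link in sel.xpath('put the XPath for the individual item links here <----').extract():
        yield scrapy.Request(item_link, callback=self.parse_item)

    # Pagination links: each further list page comes back to parse
    for next_page in sel.xpath('//html/body/div[2]/div[10]/ul/li/a[2]/@href').extract():
        yield scrapy.Request(next_page, callback=self.parse)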

```

xiaoyu9527

2016-08-23 10:25:13 +08:00

Solved.

I'm posting my code here in the hope that it helps someone else.

import scrapy
from scrapy.selector import Selector
from scrapy.loader import ItemLoader
from itemloaders.processors import Identity  # older Scrapy: scrapy.loader.processors
# MeizituItem is defined in the project's own items module.

class MeizituSpider(scrapy.Spider):
    name = "meizitu"
    allowed_domains = ["27270.com"]
    start_urls = []
    # Collect all the list-page (pagination) URLs
    for pn in range(2, 3):
        url = 'http://www.27270.com/ent/meinvtupian/list_11_%s.html' % pn
        start_urls.append(url)

    def parse(self, response):
        sel = Selector(response)
        for link in sel.xpath('//html/body/div[2]/div[10]/ul/li/a[2]/@href').extract():
            request = scrapy.Request(link, callback=self.parse_item)
            yield request

    def parse_item(self, response):
        sel = Selector(response)
        l = ItemLoader(item=MeizituItem(), response=response)
        l.add_xpath('name', '//html/body/div[3]/div[4]/div[1]/h1/text()')
        l.add_xpath('tags', '//html/body/div[3]/div[5]/dl/dd/a/text()')
        l.add_xpath('image_urls', '//*[@id="RightUrl"]/img/@src', Identity())
        l.add_value('url', response.url)
        yield l.load_item()
        # Follow the in-page pagination link with the same callback
        next_pages = sel.xpath('//*[@id="nl"]/a/@href').extract()
        if next_pages:
            full_url = response.urljoin(next_pages[0])
            self.logger.info('Full URL: %s', full_url)
            yield scrapy.Request(full_url, callback=self.parse_item)

Once I'm done with this, I'll write a short tutorial on my blog.
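A note on the `response.urljoin(next_pages[0])` step in the solution: pagination hrefs are often relative, and `scrapy.Request` needs an absolute URL. As far as I know, `response.urljoin` builds on the standard library's `urljoin` with the page URL as the base, so the behaviour can be sketched with the standard library alone. The page URL and hrefs below are hypothetical and chosen only to show the resolution rules; `absolutize` is an illustrative helper, not part of Scrapy.

```python
from urllib.parse import urljoin

# Hypothetical detail-page URL, used only as the base for resolution.
PAGE_URL = "http://www.27270.com/ent/meinvtupian/2016/100.html"


def absolutize(base_url, href):
    """Resolve an href found on base_url into an absolute URL."""
    return urljoin(base_url, href)


# A bare file name replaces the last path segment of the base URL.
sibling = absolutize(PAGE_URL, "100_2.html")

# A leading slash resolves against the site root.
rooted = absolutize(PAGE_URL, "/ent/meinvtupian/list_11_2.html")

# An already absolute URL is returned unchanged.
absolute = absolutize(PAGE_URL, "http://www.27270.com/other.html")
```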
