gain, a new Python web-crawling framework built on asyncio, uvloop, and aiohttp

June 2, 2017

prasanta

Gain

GitHub: https://github.com/gaojiuli/gain/

gain is meant to let everyone write Python crawlers easily. It uses asyncio, uvloop, and aiohttp.

Requirements

Python 3.5+

Installation

pip install gain

Usage

Write spider.py:

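from gain import Css, Item, Parser, Spider


class Post(Item):
    # Fields are extracted from each matched page by CSS selector.
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        # Append the post title to a text file, one title per line.
        with open('scrapinghub.txt', 'a+') as f:
            f.write(self.results['title'] + '\n')


class MySpider(Spider):
    start_url = 'https://blog.scrapinghub.com/'
    # Raw strings keep regex escapes such as \d intact.
    # Pages matching the second pattern are parsed into Post items.
    parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/'),
               Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]


MySpider.run()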

Run: python spider.py
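
To get a feel for which URLs the two patterns in the parsers list are meant to pick up, here is a small sketch that uses only the standard re module. The classify helper and the sample URLs are hypothetical, and full-match semantics are an assumption; gain itself may apply the patterns differently.

```python
import re

# Patterns copied from the parsers list above, written as raw strings.
LISTING = r'https://blog.scrapinghub.com/page/\d+/'
POST = r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/'


def classify(url):
    """Return 'post', 'listing', or None for a URL (hypothetical helper)."""
    # Assumption: the whole URL must match; gain may match differently.
    if re.fullmatch(POST, url):
        return 'post'
    if re.fullmatch(LISTING, url):
        return 'listing'
    return None


if __name__ == '__main__':
    # Made-up sample URLs, for illustration only.
    for url in ('https://blog.scrapinghub.com/page/2/',
                'https://blog.scrapinghub.com/2017/05/30/some-post/',
                'https://blog.scrapinghub.com/about/'):
        print(url, '->', classify(url))
```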

Examples

Examples are in the /example/ directory.



20 replies

awolfly9

June 2, 2017

mark

MIROKY

June 2, 2017

wow, mark

charove

June 2, 2017

This looks pretty cool...

June 2, 2017

uvloop isn't supported on Windows. I'd suggest adding a fallback for it.
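
A minimal sketch of what such a fallback could look like, assuming uvloop is treated as an optional dependency. The function name install_best_event_loop_policy is hypothetical and this is not how gain itself handles it; the idea is just to try the import and keep asyncio's default loop when it fails.

```python
import asyncio


def install_best_event_loop_policy():
    """Use uvloop when it can be imported, otherwise keep asyncio's default.

    Hypothetical helper; returns the name of the implementation chosen.
    """
    try:
        import uvloop  # not available on Windows
    except ImportError:
        # Leave the default asyncio event loop policy in place.
        return 'asyncio'
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
    return 'uvloop'


if __name__ == '__main__':
    print('event loop implementation:', install_best_event_loop_policy())
```

Calling this once before the spider starts would leave the rest of the code unchanged on either platform.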

prasanta

June 2, 2017

@qs Thanks for testing.

maze1024

June 3, 2017

aiohttp's HTTP parsing isn't very efficient when paired with uvloop. I'd suggest taking a look at the implementation on the uvloop side: https://github.com/MagicStack/httptools

isaced

June 3, 2017

I'd suggest wrapping the code that writes results to a file; it would be more comfortable to use that way.

PythonAnswer

June 3, 2017

uvloop doesn't run on Windows.

prasanta

June 3, 2017

@maze1024 Thanks for the suggestion, I'll go test it.

prasanta

June 3, 2017

@isaced Do you have a good idea in mind? Right now I let users define their own save() function. What exactly do you mean by wrapping it?
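
One possible reading of the suggestion, as a sketch only: a small coroutine that appends a line to a file, so that save() shrinks to a single await. The name append_line is hypothetical and not part of gain; the blocking write is pushed to the default thread pool executor so it does not stall the event loop.

```python
import asyncio
from functools import partial


def _append_line(path, text, encoding='utf-8'):
    # Plain blocking append; called from a worker thread below.
    with open(path, 'a', encoding=encoding) as f:
        f.write(text + '\n')


async def append_line(path, text):
    """Append one line to a file without blocking the event loop.

    Hypothetical helper, not part of gain.
    """
    loop = asyncio.get_event_loop()
    await loop.run_in_executor(None, partial(_append_line, path, text))
```

Inside an Item, save() could then be written as: await append_line('scrapinghub.txt', self.results['title'])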

prasanta

June 3, 2017

@PythonAnswer After reading the comments above, I plan to remove uvloop for now.

chuanqirenwu

June 3, 2017

Nice, thanks to the author for sharing. I've reposted this to the Pythonzhcn community. I hope that's okay?

prasanta

June 3, 2017

@chuanqirenwu Sure.

pythonee

June 3, 2017

mark

hellogbk

June 4, 2017

I ran into a problem while using pyquery these past few days: if a page starts with
<?xml version="1.0" encoding="UTF-8"?>
pyquery fails. Has the OP run into this?

prasanta

June 4, 2017

@hellogbk pyquery doesn't support XML files, does it?

hbmask

June 4, 2017

mark

hellogbk

June 4, 2017

@prasanta #16
What I hit was an HTML file, but the file begins with <?xml version="1.0" encoding="UTF-8"?>
and then pyquery fails.

prasanta

June 4, 2017

@hellogbk pyquery is already good enough. For a case like yours I'd suggest using lxml.
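
The thread doesn't say which error pyquery raised, so the following is only a sketch under the assumption that the leading XML declaration is what triggers it. It strips that declaration with the standard re module before the markup is handed to pyquery or lxml; strip_xml_declaration is a hypothetical helper name.

```python
import re

# Matches a leading XML declaration such as
# <?xml version="1.0" encoding="UTF-8"?> plus surrounding whitespace.
_XML_DECL = re.compile(r'^\s*<\?xml[^>]*\?>\s*', re.IGNORECASE)


def strip_xml_declaration(markup):
    """Return markup without a leading XML declaration (hypothetical helper)."""
    return _XML_DECL.sub('', markup, count=1)


if __name__ == '__main__':
    page = ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<html><body><p>hi</p></body></html>')
    print(strip_xml_declaration(page))
```

Markup without a declaration passes through unchanged, so the helper can be applied to every page before parsing.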

pb941129

June 4, 2017

mark


https://www.v2ex.com/t/365505
