
Gitbook2pdf: a tool that crawls GitBook-generated sites and turns them into PDF files

    fuergaosi · 2019-03-07 10:21:36 +08:00 · 6761 views

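    Introduction

    I often come across high-quality books published with GitBook and want to read them offline. However, the PDFs that GitBook generates cannot be copied from and are very large, and some sites do not offer a download option at all. So a friend and I built a tool that crawls a GitBook-generated site, parses the pages, and then uses weasyprint to generate the file.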
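As a rough illustration of the last step, the sketch below joins already-parsed chapters into one HTML string and hands it to WeasyPrint. The helper names `combine_chapters` and `write_pdf` are hypothetical and are not taken from the gitbook2pdf source; only `weasyprint.HTML(string=...).write_pdf(...)` is WeasyPrint's own API.

```python
from html import escape


def combine_chapters(chapters):
    """Join (title, body_html) pairs into a single HTML document string."""
    sections = [
        f"<section><h1>{escape(title)}</h1>{body}</section>"
        for title, body in chapters
    ]
    return "<html><body>" + "".join(sections) + "</body></html>"


def write_pdf(html_text, target):
    """Render an HTML string to a PDF file with WeasyPrint."""
    # Imported lazily so combine_chapters can be used even when WeasyPrint's
    # native dependencies (cairo, Pango) are not installed.
    import weasyprint

    weasyprint.HTML(string=html_text).write_pdf(target)
```

Calling `write_pdf(combine_chapters(chapters), "book.pdf")` would then produce a single file; how the real tool builds its table of contents is not shown here.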

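    Features

    • Asynchronous crawling: pages are fetched with aiohttp, so crawling a site's content usually finishes within seconds

    • Copyable text

    • Preserves the original table-of-contents structure

    • Preserves the original links

    • Faithfully reproduces the styling of the original HTML pages
    • Small output: a PDF of 800+ pages takes up only 4.6 MB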
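The asynchronous crawling feature can be sketched roughly as follows. This is a minimal illustration, not the project's code: `crawl_all`, `crawl_with_aiohttp`, and the `limit` parameter are hypothetical names, and the fetch function is passed in so the concurrency logic can be tried without a network connection.

```python
import asyncio


async def crawl_all(urls, fetch, limit=10):
    """Fetch every URL concurrently and return the results in input order."""
    semaphore = asyncio.Semaphore(limit)  # cap the number of requests in flight

    async def fetch_one(url):
        async with semaphore:
            return await fetch(url)

    return await asyncio.gather(*(fetch_one(url) for url in urls))


async def crawl_with_aiohttp(urls):
    """Example wiring with aiohttp (imported lazily; needs `pip install aiohttp`)."""
    import aiohttp

    async with aiohttp.ClientSession() as session:

        async def fetch(url):
            async with session.get(url) as response:
                return await response.text()

        return await crawl_all(urls, fetch)
```

Because `asyncio.gather` keeps results in the order of its arguments, the chapters come back in table-of-contents order even though the downloads overlap.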
    Addendum 1 · 2019-03-07 19:15:28 +08:00

    Project page: gitbook2pdf

    33 replies · 2019-05-07 23:51:07 +08:00
    #1 · fuergaosi (OP) · 2019-03-07 10:25:20 +08:00
    Stars appreciated!
    #2 · magicZ · 2019-03-07 10:28:11 +08:00
    Could you post a link?
    #3 · fuergaosi (OP) · 2019-03-07 10:31:40 +08:00
    I forgot to include the link.
    gitbook2pdf: https://github.com/fuergaosi233/gitbook2pdf
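    #4 · 22k · 2019-03-07 10:32:00 +08:00
    Just yesterday I was wondering whether there was a way to download GitBook books. Bookmarking this. OP, if you can share it, please update the original post. Thanks!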
    #5 · fuergaosi (OP) · 2019-03-07 10:50:09 +08:00
    @22k See reply #3.
    #6 · changjiangzzZ · 2019-03-07 11:22:48 +08:00
    Starred :)
    #7 · newmind · 2019-03-07 11:27:17 +08:00
    Works really well. Upvoted.
    #8 · newmind · 2019-03-07 11:28:13 +08:00
    An online version would make it even better.
    #9 · jasonslyvia · 2019-03-07 11:55:25 +08:00
    Nice, I have wanted a tool like this for a long time. I hope you keep polishing it!
    #10 · FakeLeung · 2019-03-07 11:59:18 +08:00
    Is there no usage documentation?
    From the code, it looks like you have to edit the URL passed to run in main directly?

    PS: you can append the GitHub URL to the post.
    #11 · fffflyfish · 2019-03-07 12:19:46 +08:00
    Upvoted! Glad to see someone finally built this.
    #12 · mseasons · 2019-03-07 14:31:23 +08:00
    aiohttp.client_exceptions.ClientConnectorCertificateError: Cannot connect to host wizardforcel.gitbooks.io:443 ssl:True [SSLCertVerificationError: (1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)')]
    #13 · d5 · 2019-03-07 14:34:09 +08:00
    OP, you could consider building an online version, with the backend hosted on a server elsewhere~
    #14 · privil · 2019-03-07 16:32:32 +08:00
    ...It seems fairly memory-hungry; the process got killed.
    #15 · tongdongdong · 2019-03-07 18:59:15 +08:00
    C:\Users\TDD\Desktop>python -m weasyprint https://ts.xcatliu.com ts.pdf
    WARNING: Ignored `text-rendering:auto` at 4:620, unknown property.
    WARNING: Ignored `filter:none` at 4:2882, unknown property.
    WARNING: Expected a media type, got (max-width:600px)
    WARNING: Invalid media type " (max-width:600px)" the whole @media rule was ignored at 9:83.
    WARNING: Expected a media type, got (max-width:600px)
    WARNING: Invalid media type " (max-width:600px)" the whole @media rule was ignored at 9:669.
    WARNING: Ignored `box-shadow:none` at 9:1092, unknown property.
    WARNING: Ignored `text-overflow:ellipsis` at 9:1686, unknown property.
    WARNING: Expected a media type, got (max-width:1000px)
    WARNING: Invalid media type " (max-width:1000px)" the whole @media rule was ignored at 9:1805.
    WARNING: Ignored `box-shadow:0 6px 12px rgba(0,0,0,.175)` at 9:2336, unknown property.
    WARNING: Ignored `overflow-y:auto` at 9:3908, unknown property.
    WARNING: Ignored `text-overflow:ellipsis` at 9:4934, unknown property.
    WARNING: Expected a media type, got (max-width:600px)
    WARNING: Invalid media type " (max-width:600px)" the whole @media rule was ignored at 9:5254.
    WARNING: Expected a media type, got (min-width:600px)
    WARNING: Invalid media type " (min-width:600px)" the whole @media rule was ignored at 9:5583.
    WARNING: Expected a media type, got (max-width:600px)
    WARNING: Invalid media type " (max-width:600px)" the whole @media rule was ignored at 9:5650.
    WARNING: Ignored `overflow-y:auto` at 9:6180, unknown property.
    WARNING: Ignored `overflow-y:auto` at 9:6418, unknown property.
    WARNING: Expected a media type, got (max-width:1240px)
    WARNING: Invalid media type " (max-width:1240px)" the whole @media rule was ignored at 9:6434.
    WARNING: Ignored `text-size-adjust:100%` at 9:7377, unknown property.
    WARNING: Expected a media type, got (max-width:1240px)
    WARNING: Invalid media type " (max-width:1240px)" the whole @media rule was ignored at 9:11595.
    WARNING: Ignored `box-shadow:none` at 9:12111, unknown property.
    WARNING: Ignored `text-size-adjust:100%` at 9:12512, unknown property.
    WARNING: Ignored `text-rendering:optimizeLegibility` at 9:20972, unknown property.
    WARNING: Ignored `font-smoothing:antialiased` at 9:21006, unknown property.
    WARNING: Ignored `text-size-adjust:100%` at 9:21124, unknown property.
    WARNING: Ignored `box-shadow: none` at 235:3, unknown property.
    WARNING: Ignored `box-shadow: none` at 272:3, unknown property.
    And then only the home page was converted successfully!!!
    #16 · changjiangzzZ · 2019-03-07 19:02:54 +08:00
    @tongdongdong Please have a look at the docs first~
    #17 · changjiangzzZ · 2019-03-07 19:04:38 +08:00
    @mseasons Network conditions in mainland China are not great, so the connection timed out. Try adding a proxy.
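    #18 · fuergaosi (OP) · 2019-03-07 19:13:55 +08:00
    @privil The memory usage is a `weasyprint` issue. I am trying to output the PDF in chunks.
    @tongdongdong That one belongs in the `weasyprint` issue tracker.
    @mseasons I cannot access that URL and do not know how you reached it. Please post the problem, along with the URL you crawled, in the `issues` section.
    @FakeLeung Thanks for the reminder; I could not find the append button earlier ╮(╯_╰)╭ Also, for now you do have to edit the URL to use it. I will change the usage shortly; I had always tested it this way and did not pay attention to this.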
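One way the "edit the URL in main" workflow could be replaced is with a small command-line wrapper. This is only a sketch: the `-o/--output` option and the `book.pdf` default are hypothetical, and the commented-out line assumes the `Gitbook2PDF(url).run()` call that shows up in a traceback further down the thread.

```python
import argparse


def parse_args(argv=None):
    """Parse the site URL (and a hypothetical output path) from the command line."""
    parser = argparse.ArgumentParser(
        description="Convert a GitBook-generated site to a PDF file"
    )
    parser.add_argument("url", help="root URL of the GitBook-generated site")
    parser.add_argument(
        "-o", "--output", default="book.pdf",
        help="output PDF path (hypothetical option, not in the original tool)",
    )
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Assumed entry point, taken from the traceback in this thread:
    # Gitbook2PDF(args.url).run()
    return args
```

With something like this in place, the tool could be run as `python gitbook.py <url>` instead of editing the source for every book.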
    #19 · Ahs · 2019-03-07 19:14:26 +08:00 via Android
    Starred
    #20 · fuergaosi (OP) · 2019-03-07 19:21:27 +08:00
    @d5 @newmind It is a bit memory-hungry. Once that is solved I will consider building an online version.
    #21 · aWangami · 2019-03-07 19:27:16 +08:00
    (Python3) ➜ gitbook2pdf python gitbook.py
    Traceback (most recent call last):
    File "gitbook.py", line 5, in <module>
    import weasyprint
    File "/Users/Python3/lib/python3.7/site-packages/weasyprint/__init__.py", line 393, in <module>
    from .css import preprocess_stylesheet # noqa
    File "/Users/Python3/lib/python3.7/site-packages/weasyprint/css/__init__.py", line 26, in <module>
    from . import computed_values
    File "/Users/Python3/lib/python3.7/site-packages/weasyprint/css/computed_values.py", line 17, in <module>
    from .. import text
    File "/Users/Python3/lib/python3.7/site-packages/weasyprint/text.py", line 14, in <module>
    import cairocffi as cairo
    File "/Users/Python3/lib/python3.7/site-packages/cairocffi/__init__.py", line 39, in <module>
    cairo = dlopen(ffi, 'cairo', 'cairo-2', 'cairo-gobject-2', 'cairo.so.2')
    File "/Users/Python3/lib/python3.7/site-packages/cairocffi/__init__.py", line 36, in dlopen
    raise OSError("dlopen() failed to load a library: %s" % ' / '.join(names))
    OSError: dlopen() failed to load a library: cairo / cairo-2 / cairo-gobject-2 / cairo.so.2

    What is going on here?
    #22 · privil · 2019-03-07 19:28:17 +08:00
    @fuergaosi #18 Crawling also threw an error. That said, my VPS has very little memory, only 512 MB, so the k8s handbook I tried originally is out of the question.

    https://funhacks.gitbooks.io/explore-python
    crawling : https://funhacks.gitbooks.io/explore-python/Conclusion/reference_material.html
    Traceback (most recent call last):
    File "gitbook.py", line 298, in <module>
    Gitbook2PDF("https://funhacks.gitbooks.io/explore-python/").run()
    File "gitbook.py", line 190, in run
    loop.run_until_complete(self.crawl_main_content(content_urls))
    File "/usr/local/python3.7.2/lib/python3.7/asyncio/base_events.py", line 584, in run_until_complete
    return future.result()
    File "gitbook.py", line 212, in crawl_main_content
    await asyncio.gather(*tasks)
    File "gitbook.py", line 233, in gettext
    text = ChapterParser(metatext, level).parser()
    File "gitbook.py", line 95, in parser
    if len(context.find('footer')):
    TypeError: object of type 'NoneType' has no len()
    #23 · privil · 2019-03-07 19:30:23 +08:00
    #24 · hooych · 2019-03-07 19:38:27 +08:00
    @aWangami Same here on a Mac with Python 3: OSError: dlopen() failed to load a library: cairo / cairo-2 / cairo-gobject-2 / cairo.so.2
    @fuergaosi What is going on?
    #25 · hooych · 2019-03-07 19:40:01 +08:00
    @fuergaosi #23 Thanks, I will install it and try again.
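The `dlopen()` failure reported in #21 and #24 means cairocffi could not locate the native cairo library on the system. A small diagnostic sketch, using only the standard library, can show which shared libraries are missing before WeasyPrint is imported. The function name and the default list of library names are assumptions for illustration, not part of gitbook2pdf or WeasyPrint.

```python
from ctypes.util import find_library


def missing_native_libs(names=("cairo", "pango-1.0")):
    """Return the library names the system loader cannot locate."""
    # find_library returns None when no matching shared library is found.
    return [name for name in names if find_library(name) is None]


if __name__ == "__main__":
    missing = missing_native_libs()
    if missing:
        print("Missing native libraries:", ", ".join(missing))
    else:
        print("All checked native libraries were found")
```

If anything is listed as missing, installing it through the system package manager is the usual remedy; the exact package names depend on the platform.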
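    #26 · fuergaosi (OP) · 2019-03-07 19:40:07 +08:00
    @privil I cannot reproduce it. This error is the fault of an official recommendation: I originally did not use len, but when I ran the code today an official warning told me that testing the result directly with `if` may not be allowed in the future and recommended writing it this way instead. That turned out to be a bug. I will fix it right away.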
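The traceback in #22 fails on `len(context.find('footer'))` because `find` returned `None` for a page without a footer. A minimal sketch of the safer check is below. It uses the standard library's `xml.etree.ElementTree` as a stand-in, on the assumption that the project's parser exposes the same `find` behavior; the thread does not say which parser is used, and `has_footer` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def has_footer(context):
    """Return True if `context` contains a <footer> child, even an empty one."""
    footer = context.find("footer")
    # find() returns None when nothing matches, so len(footer) would raise
    # TypeError. A childless element has len() == 0, so a len() test alone
    # would also skip an empty footer. Comparing against None avoids both.
    return footer is not None


with_footer = ET.fromstring("<div><p>text</p><footer/></div>")
without_footer = ET.fromstring("<div><p>text</p></div>")
```

Here `has_footer(with_footer)` is true and `has_footer(without_footer)` is false, with no exception in either case.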
    #27 · mseasons · 2019-03-07 22:15:15 +08:00
    @changjiangzzZ It is not a timeout; it looks like an HTTPS certificate verification problem. Adding verify=False to the parameters of every get request fixed it.
    #28 · mseasons · 2019-03-07 22:18:23 +08:00
    @fuergaosi I did not change the URL; I ran the source straight from git clone. I later checked the docs, and adding verify=False to all the get requests made it pass.
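For reference, `verify=False` is the spelling used by the `requests` library; aiohttp's request methods take an `ssl` argument instead, which accepts either `False` or an `ssl.SSLContext`. The thread does not show which calls were changed, so the following is only a sketch with hypothetical helper names. It builds a context that either verifies against a chosen CA bundle or, as a last resort, skips verification.

```python
import ssl


def make_ssl_context(verify=True, cafile=None):
    """Build an SSL context; pass verify=False only for hosts you trust."""
    context = ssl.create_default_context(cafile=cafile)
    if not verify:
        # check_hostname must be disabled before verify_mode can be CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


async def fetch_text(url, context):
    """Fetch a page with aiohttp using the given context (lazy import)."""
    import aiohttp

    async with aiohttp.ClientSession() as session:
        async with session.get(url, ssl=context) as response:
            return await response.text()
```

Disabling verification removes protection against tampered responses, so pointing `cafile` at an up-to-date CA bundle is usually the better fix for "unable to get local issuer certificate".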
    #29 · dyxang · 2019-03-07 22:24:18 +08:00 via Android
    I would love to just use it directly. Why not package it with py2exe?
    #30 · leesymbol · 2019-03-08 08:22:04 +08:00 via iPhone
    Bump
    #31 · cye3s · 2019-03-08 11:25:50 +08:00
    I tried one and the table-of-contents structure was not preserved, for example this one:
    https://go.tanglei.name/content
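    #32 · fuergaosi (OP) · 2019-03-08 14:18:46 +08:00
    @cye3s I tested it and the table-of-contents structure was preserved, but two pages returned 404, so two chapters are missing. ![kz37f1.png]( https://s2.ax1x.com/2019/03/08/kz37f1.png) Also, please post problems directly in the issues section.
    @dyxang Because I do not have Windows ┑( ̄Д  ̄)┍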
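    #33 · soulteary · 2019-05-07 23:51:07 +08:00
    @fuergaosi Your tool is really nice to use, but I noticed some people struggling to set up the environment, so I packaged it as a container image. The code is here: https://github.com/soulteary/docker-gitbook-pdf-generator

    If you are willing to adjust the project directory layout slightly and tag releases, later upgrades and maintenance would be easier, for example customizing the e-book style, etc.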