A crawler question for fellow V2EX users

2018-03-21 14:49:10 +08:00

yangzhezjgs

The drug administration (药监局) website
http://app2.sfda.gov.cn/datasearchp/gzcxSearch.do?formRender=cx&page=1

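I need to scrape some data from the drug administration website, but I can't get past its anti-crawler protection: the server returns HTTP 202 and I never receive the real HTML. Does anyone know a way around this?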

What I have already tried:
1. requests + fake_useragent
2. PhantomJS + selenium
3. render() from requests_html (pyppeteer + Chromium)
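
When comparing attempts like the ones listed, it can help to tell the challenge page apart from real HTML automatically. The following is only a minimal sketch: the function name `looks_like_challenge` and both heuristics (HTTP 202, or a page with script but no visible text) are assumptions based on the symptoms described in this thread, not confirmed behaviour of the site.

```python
import re

# Assumed heuristics, taken from the symptoms reported in the thread:
# the blocked response has status 202 and carries script instead of content.
_SCRIPT_BLOCK = re.compile(r"<script\b.*?</script>", re.IGNORECASE | re.DOTALL)
_ANY_TAG = re.compile(r"<[^>]+>")


def looks_like_challenge(status_code: int, body: str) -> bool:
    """Return True when a response resembles the anti-crawler challenge page."""
    if status_code == 202:
        return True
    # Remove script blocks, then all remaining tags, and see if any text is left.
    without_scripts = _SCRIPT_BLOCK.sub("", body)
    visible_text = _ANY_TAG.sub("", without_scripts).strip()
    had_script = without_scripts != body
    return had_script and not visible_text
```

With requests, this could be called as `looks_like_challenge(resp.status_code, resp.text)` after each attempt, so a failed bypass is detected instead of being parsed as if it were data.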


23 replies

Itoktsnhc

2018-03-21 18:00:45 +08:00

Oops, I hit Ctrl+Enter by mistake and submitted too early.
Continuing from my previous post:
<script type="text/javascript">window.onload=function(){var _$rU=document.getElementById(_$f7('ondoal'));_$cc(_$rU.name,_$tY(_$rU,_$zo('mdpf2UXqQG')));};</script>
It looks like this writes a Cookie?
The returned HTML also contains a <meta>

Itoktsnhc

2018-03-21 18:06:38 +08:00

http://app2.sfda.gov.cn/datasearchp/gzcxSearch.do?formRender=cx&page=1 returns 202; request sent with empty headers
http://app2.sfda.gov.cn/4QbVtADbnLVIc/c.FxJzG50F.js?D9PVtGL=db0daa JS file; manipulates the cookie?
http://app2.sfda.gov.cn/datasearchp/gzcxSearch.do?formRender=cx&page=1 returns the actual HTML; request sent with the Cookie
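
If the third request really only differs by the Cookie it carries, one thing to try is copying cookies from a browser session into plain HTTP requests. This is a rough sketch under that assumption; the helper names are made up for illustration, and nothing in this thread confirms that the cookies stay valid outside the browser that ran the script.

```python
from typing import Any, Dict, Iterable, Mapping


def cookies_to_dict(browser_cookies: Iterable[Mapping[str, Any]]) -> Dict[str, str]:
    """Reduce browser cookie records to a plain name -> value mapping.

    Assumes each record has "name" and "value" keys, which is the shape
    selenium's driver.get_cookies() returns; other keys are ignored.
    """
    return {str(c["name"]): str(c["value"]) for c in browser_cookies}


def cookie_header(cookies: Mapping[str, str]) -> str:
    """Build the value of a Cookie request header from a name -> value mapping."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())
```

The resulting dict can be passed as `cookies=` to `requests.get`, or the header string set as `Cookie` manually; either way, sending the same User-Agent as the browser is probably necessary too.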

benzzz

2018-03-21 18:07:16 +08:00

@xlrtx Could you post a link?


https://www.v2ex.com/t/440071
