Python + selenium + phantomjs help needed: scraping information from a website

2017-02-26 21:53:43 +08:00

doulmi

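My company recently asked me to scrape the purchasing information from http://www.kuajingjishi.com/Purchase/SearchPurchase#pageIndex=10. I assumed it would be fairly simple, but every time I scrape it I get the contents of the first page.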

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('http://www.kuajingjishi.com/Purchase/SearchPurchase#pageIndex=2')
result = driver.find_element_by_tag_name('table')
print(result.get_attribute('innerHTML'))

Any help appreciated~

6 replies

aploium

2017-02-26 22:02:57 +08:00

```python
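import requests

# Sent as form data, exactly as in the original post. X-Requested-With is
# normally an HTTP header, so passing it via headers= may be what was meant.
data = {
    'X-Requested-With': 'XMLHttpRequest',
}

# Request pages 1 through 9; the page number goes in the query string.
for i in range(1, 10):
    url = 'http://www.kuajingjishi.com/Purchase/SearchPurchase?pageIndex={page}'.format(page=i)
    r = requests.post(url, data=data)
    print(r.text)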
```

aploium

2017-02-26 22:17:25 +08:00

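The page loads its data via ajax. Press F12 in the browser, pick the Network tab, and click through a few pages by hand. You will see the requests it sends, and then you can just send identical ones yourself.
This site doesn't even validate its requests yet......orz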

As for why selenium can't get it......heaven knows :)
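
One possible factor, offered as a guess rather than a diagnosis: everything after `#` in the original URL is a fragment, and a fragment is never sent to the server. The server therefore sees the same request whatever `pageIndex` says, and whether the requested page ever shows up depends on the page's own script running, and finishing, before the table is read. The standard-library sketch below only shows how the URL splits; it does not contact the site.

```python
from urllib.parse import urlsplit

url = "http://www.kuajingjishi.com/Purchase/SearchPurchase#pageIndex=2"
parts = urlsplit(url)

# Only the path and query go into the HTTP request line;
# the fragment stays on the client side.
request_target = parts.path + ("?" + parts.query if parts.query else "")

print(request_target)   # /Purchase/SearchPurchase
print(parts.fragment)   # pageIndex=2
```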

Tip: right-click the request -> Copy -> Copy as cURL, then paste it into this site to convert it straight into requests code: https://curl.trillworks.com/
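
To give a rough idea of what such a conversion involves, here is a minimal sketch that pulls the method, URL, headers, and body out of a simple cURL command. `parse_curl` is a hypothetical helper, not how the linked site works, and `SAMPLE` is an invented command, not one captured from the site. Only `-X`, `-H`, and `-d`/`--data`/`--data-raw` are handled.

```python
import shlex


def parse_curl(command):
    """Extract method, URL, headers and body from a simple cURL command."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "curl":
        raise ValueError("not a curl command")
    parsed = {"method": None, "url": None, "headers": {}, "data": None}
    i = 1
    while i < len(tokens):
        tok = tokens[i]
        if tok in ("-X", "--request"):
            parsed["method"] = tokens[i + 1]
            i += 2
        elif tok in ("-H", "--header"):
            # Header values may contain ':' themselves, so split only once.
            name, _, value = tokens[i + 1].partition(":")
            parsed["headers"][name.strip()] = value.strip()
            i += 2
        elif tok in ("-d", "--data", "--data-raw"):
            parsed["data"] = tokens[i + 1]
            i += 2
        elif tok.startswith("-"):
            i += 1  # unknown flag: skipped (any argument it takes is not handled)
        else:
            parsed["url"] = tok
            i += 1
    # Like curl itself, default to POST when a body is present.
    if parsed["method"] is None:
        parsed["method"] = "POST" if parsed["data"] is not None else "GET"
    return parsed


# Hypothetical sample command, written by hand for illustration.
SAMPLE = (
    "curl 'http://www.kuajingjishi.com/Purchase/SearchPurchase' "
    "-H 'X-Requested-With: XMLHttpRequest' "
    "--data 'pageIndex=2'"
)

result = parse_curl(SAMPLE)
print(result)
```

The resulting dictionary maps fairly directly onto the arguments of `requests.request(method, url, headers=..., data=...)`.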

doulmi

2017-02-26 23:39:58 +08:00

@aploium Thank you!

mingyun

2017-02-26 23:44:20 +08:00

@aploium That's a nice site.

MyFaith

2017-02-27 08:38:03 +08:00

pageIndex:9
Just include this parameter in the POST; it's only an AJAX load.
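
A minimal sketch of that suggestion, assuming the endpoint accepts `pageIndex` as a form-encoded field in the POST body (the thread does not confirm the exact encoding). `build_page_request` and `SEARCH_URL` are illustrative names. The standard library is used so the sketch has no dependencies, and the request is only built, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

SEARCH_URL = "http://www.kuajingjishi.com/Purchase/SearchPurchase"


def build_page_request(page_index):
    """Build (but do not send) a POST request for one page of results."""
    # Assumption: the server reads pageIndex from a form-encoded body.
    body = urlencode({"pageIndex": page_index}).encode("ascii")
    return Request(
        SEARCH_URL,
        data=body,
        headers={"X-Requested-With": "XMLHttpRequest"},
        method="POST",
    )


req = build_page_request(9)
print(req.get_method(), req.full_url, req.data)
```

Passing `req` to `urllib.request.urlopen` would actually send it; with requests, the equivalent call would be `requests.post(SEARCH_URL, data={"pageIndex": 9}, headers=...)`.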

cranelee13

2017-02-27 10:32:01 +08:00

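For Python scraping, reach for selenium + phantomjs only when there is truly no other way and the anti-scraping measures are really tough. In the vast majority of cases requests is enough.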


https://www.v2ex.com/t/343342
