Python - Requests scraper: fetching an Amazon product page, but the headers get identified as a robot

2022-10-14 14:52:08 +08:00
 wyzh97

I'm trying to scrape an Amazon product page (https://www.amazon.com/dp/B0B6TR2GTJ). The code is as follows:


import requests
from pyquery import PyQuery as pq  # pq() is used below to parse the HTML

url = "https://www.amazon.com/dp/B0B6TR2GTJ"

# Browser-like headers sent with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'Accept-Language': 'en-US, en;q=0.5',
}
r = requests.get(url, headers=headers)

print(r.status_code)
print("-------------------")
doc = pq(r.text)  # parse the response body

print(doc("title"))
print("-------------------")
print(r.text)

The result is below (I'm being detected as a robot). I've tried all kinds of variations of the headers, and the result is always the same.

503
-------------------
<title>Sorry! Something went wrong!</title>
  
-------------------
<!--
        To discuss automated access to Amazon data please contact api-services-support@amazon.com.
        For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
-->
<!doctype html>
......

I'm still a beginner at web scraping. Could any experienced folks help me out? Many thanks.
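For anyone debugging the same symptom, here is a minimal sketch of how a script could at least recognise the block page and retry with a growing delay instead of parsing the error HTML. The marker strings are copied from the output above; the function names `looks_blocked` and `fetch_with_backoff`, the retry count, and the delays are all hypothetical choices for illustration, not anything Amazon or Requests prescribes. Retrying is not guaranteed to get past the check.

```python
import time

# Marker strings taken from the block page shown in the output above.
BLOCK_MARKERS = (
    "Sorry! Something went wrong!",
    "To discuss automated access to Amazon data",
)


def looks_blocked(status_code, body):
    """Return True if the response resembles the robot-check page above."""
    if status_code == 503:
        return True
    return any(marker in body for marker in BLOCK_MARKERS)


def fetch_with_backoff(get, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call get(url) up to `attempts` times, doubling the wait after each block.

    `get` is any callable returning an object with .status_code and .text
    (hypothetical helper; a requests call can be wrapped to fit this shape).
    Returns the first unblocked response, or None if every attempt is blocked.
    """
    for attempt in range(attempts):
        response = get(url)
        if not looks_blocked(response.status_code, response.text):
            return response
        if attempt < attempts - 1:
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    return None
```

With the code from the post, this could be called as `fetch_with_backoff(lambda u: requests.get(u, headers=headers), url)`. Passing the request function in as a parameter keeps the retry logic independent of Requests, so it can be tried out with a fake response object before hitting the real site. Note that the block page itself points to Amazon's official APIs as the supported route for automated access.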

4177 views
Node: Python
21 replies
blinkdr
2022-10-15 16:27:34 +08:00
RPA would work too, though it's a bit clunky.
