ScrapydWeb v1.2.0: Possibly the easiest-to-use tool for scheduled crawling?!

2019-03-12 20:40:28 +08:00
 my8100

https://github.com/my8100/scrapydweb

2447 views
Node: Python
9 replies
oIMOo
2019-03-12 20:50:28 +08:00
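Hello,

I haven't written many Python requests scripts for pages that rely on JS.
Could you take a look at how to detect whether the enroll button returned by that JS shows that a course is available?
(Enrollment is currently closed.)
Thanks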
jenlors
2019-03-13 11:14:14 +08:00
Showing my support.
tonywangcn
2019-03-13 12:23:35 +08:00
my8100
2019-03-13 14:38:28 +08:00
@tonywangcn I just confirmed that both links open fine. Please first check that your network can reach https://medium.com/
tonywangcn
2019-03-13 17:25:18 +08:00
$ https_proxy=localhost:6152 curl -vvv https://medium.com/@my8100/https-medium-com-my8100-how-to-efficiently-manage-your-distributed-web-scraping-projects-55ab13309820

* Trying ::1...
* TCP_NODELAY set
* Connection failed
* connect to ::1 port 6152 failed: Connection refused
* Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 6152 (#0)
* Establish HTTP proxy tunnel to medium.com:443
> CONNECT medium.com:443 HTTP/1.1
> Host: medium.com:443
> User-Agent: curl/7.54.0
> Proxy-Connection: Keep-Alive
>
< HTTP/1.1 200 Connection established
<
* Proxy replied OK to CONNECT request
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
* CAfile: /etc/ssl/cert.pem
CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-CHACHA20-POLY1305
* ALPN, server accepted to use h2
* Server certificate:
* subject: businessCategory=Private Organization; jurisdictionCountryName=US; jurisdictionStateOrProvinceName=Delaware; serialNumber=5010624; street=760 Market Street; postalCode=94102; C=US; ST=California; L=San Francisco; O=A Medium Corporation; CN=medium.com
* start date: Jun 1 00:00:00 2017 GMT
* expire date: Aug 30 12:00:00 2019 GMT
* subjectAltName: host "medium.com" matched cert's "medium.com"
* issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=DigiCert SHA2 Extended Validation Server CA
* SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x7fa7b5806600)
> GET /@my8100/https-medium-com-my8100-how-to-efficiently-manage-your-distributed-web-scraping-projects-55ab13309820 HTTP/2
> Host: medium.com
> User-Agent: curl/7.54.0
> Accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
< HTTP/2 302
< date: Wed, 13 Mar 2019 09:23:05 GMT
< content-type: application/octet-stream
< set-cookie: __cfduid=d800d3f4d7ffa024ead64e91a29e1ebb41552468985; expires=Thu, 12-Mar-20 09:23:05 GMT; path=/; domain=.medium.com; HttpOnly
< set-cookie: uid=lo_rj0lT6mjVKUE; Expires=Thu, 12-Mar-20 09:23:05 GMT; Domain=.medium.com; Path=/; Secure; HttpOnly
< content-security-policy: default-src 'self'; connect-src https://localhost https://*.instapaper.com https://*.stripe.com https://glyph.medium.com https://*.paypal.com https://getpocket.com https://medium.com:443 https://*.medium.com:443 https://*.medium.com https://medium.com https://*.medium.com https://*.algolia.net https://cdn-static-1.medium.com https://dnqgz544uhbo8.cloudfront.net https://cdn-videos-1.medium.com https://cdn-audio-1.medium.com https://*.lightstep.com https://*.branch.io https://app.zencoder.com 'self'; font-src data: https://*.amazonaws.com https://*.medium.com https://glyph.medium.com https://medium.com https://*.gstatic.com https://dnqgz544uhbo8.cloudfront.net https://use.typekit.net https://cdn-static-1.medium.com 'self'; frame-src chromenull: https: webviewprogressproxy: medium: 'self'; img-src blob: data: https: 'self'; media-src https://*.cdn.vine.co https://d1fcbxp97j4nb2.cloudfront.net https://d262ilb51hltx0.cloudfront.net https://*.medium.com https://gomiro.medium.com https://miro.medium.com https://pbs.twimg.com 'self' blob:; object-src 'self'; script-src 'unsafe-eval' 'unsafe-inline' about: https: 'self'; style-src 'unsafe-inline' data: https: 'self'; report-uri https://csp.medium.com
< x-frame-options: sameorigin
< x-content-type-options: nosniff
< x-xss-protection: 1; mode=block
< x-ua-compatible: IE=edge, Chrome=1
< x-powered-by: Medium
< x-obvious-tid: 1552468985229:d47a5d7da221
< x-obvious-info: 36855-3d9334e,3d9334ed6db
< link: <https://medium.com/humans.txt>; rel="humans"
< cache-control: no-cache, no-store, max-age=0, must-revalidate
< expires: Thu, 09 Sep 1999 09:09:09 GMT
< pragma: no-cache
< set-cookie: sid=1:1Yj8mG1saeQMx1r5h/kFLMw3J77PPMa784rb2HRk3z9J8bnuZYy18oGwoDGakmHV; path=/; expires=Thu, 12 Mar 2020 09:23:05 GMT; domain=.medium.com; secure; httponly
< tk: T
< location: /suspended
< strict-transport-security: max-age=15552000; includeSubDomains; preload
< expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
< server: cloudflare
< cf-ray: 4b6cf274ef5132fb-HKG
<
* Connection #0 to host localhost left intact


Look here:

location: /suspended
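
For anyone who wants to spot this kind of redirect without reading the whole log, here is a minimal Python sketch that pulls the final status code and response headers out of saved `curl -v` output. The function name `parse_curl_response` and the shortened sample text are illustrative assumptions, not part of curl or ScrapydWeb.

```python
import re


def parse_curl_response(verbose_output):
    """Return (status_code, headers) for the last response in `curl -v` output.

    Only lines starting with "<" (response lines) are read. Headers are
    reset whenever a new status line appears, so the proxy's reply to
    CONNECT does not leak into the result. Repeated headers such as
    set-cookie keep only their last value.
    """
    status = None
    headers = {}
    for line in verbose_output.splitlines():
        if not line.startswith("<"):
            continue
        text = line[1:].strip()
        if not text:
            continue  # blank "<" line marks the end of a header block
        match = re.match(r"HTTP/\S+\s+(\d{3})", text)
        if match:
            status = int(match.group(1))
            headers = {}
            continue
        name, sep, value = text.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return status, headers


# Shortened sample built from the log above (illustrative only).
sample = """\
< HTTP/1.1 200 Connection established
<
< HTTP/2 302
< location: /suspended
< server: cloudflare
"""

status, headers = parse_curl_response(sample)
print(status, headers.get("location"))  # prints: 302 /suspended
```

A 3xx status together with a `location` header means the server redirected the request rather than serving the article; here the target is `/suspended`.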
my8100
2019-03-13 17:54:57 +08:00
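@tonywangcn All I can say is that this is quite "unscientific". I'd suggest reading the Chinese version hosted in mainland China instead: https://juejin.im/post/5bebc5fd6fb9a04a053f3a0e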
my8100
2019-03-13 18:17:59 +08:00
@tonywangcn I need to find time to read through your articles properly: https://medium.com/@tonywangcn
tonywangcn
2019-03-13 20:10:28 +08:00
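@my8100 Hahaha, mine fall far short of yours. I've recently been planning to integrate scrapy into k8s and need exactly this kind of control panel. If it's convenient, could we connect on WeChat so I can learn from you: NTMyNDcyODQx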
my8100
2019-03-16 22:58:07 +08:00
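@tonywangcn Today I found that once I log out of my Medium account, I no longer show up in search. My account was suspended for no apparent reason. So I simply moved the article to https://github.com/my8100/files/blob/master/scrapydweb/README.md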
