Can a Python HTTP library (requests, etc.) return the complete HTML response packet, the way a socket does?

jxxz

October 9, 2019

How about assembling it yourself?
Concatenate rsp.status_code, rsp.headers, and rsp.text.
Note that headers comes back as a dict, though, so it needs a bit more processing.
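
A minimal sketch of that reassembly, for illustration only. The helper name `rebuild_response` is hypothetical, and the `SimpleNamespace` object merely stands in for a `requests.Response` by carrying the same attribute names (`status_code`, `reason`, `headers`, `content`). The `HTTP/1.1` prefix is an assumption, since the high-level response object does not expose the protocol version.

```python
from types import SimpleNamespace


def rebuild_response(rsp):
    """Join status line, headers and body into one HTTP-style byte string."""
    # Assumption: the version is hard-coded, it is not read from the wire.
    status_line = "HTTP/1.1 %d %s" % (rsp.status_code, rsp.reason)
    # headers is dict-like, so flatten it back into "Name: value" lines.
    header_lines = ["%s: %s" % (name, value) for name, value in rsp.headers.items()]
    head = "\r\n".join([status_line] + header_lines) + "\r\n\r\n"
    return head.encode("iso-8859-1") + rsp.content


# Stand-in for a real response object (hypothetical values).
fake = SimpleNamespace(
    status_code=200,
    reason="OK",
    headers={"Content-Type": "text/html", "Content-Length": "5"},
    content=b"hello",
)
raw = rebuild_response(fake)
print(raw)
```

The result is only an approximation of what was on the wire: header order, header casing, and repeated headers may differ, and a body that requests has already decompressed will no longer match the original Content-Length.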

arrow8899

October 9, 2019

r = requests.get('https://www.example.com', stream=True)
r.raw.read(10)  # reads the raw byte stream, but you have to deal with the various encoding issues yourself
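
As a rough illustration of one such encoding issue, here is a sketch that undoes body compression by hand. It assumes "encoding issues" mostly means the Content-Encoding header (gzip or deflate); character-set decoding is a separate step not shown. The function name `decode_body` is made up for this example.

```python
import gzip
import zlib


def decode_body(raw_body, content_encoding=""):
    """Undo the compression named by a Content-Encoding header value."""
    encoding = content_encoding.strip().lower()
    if encoding == "gzip":
        return gzip.decompress(raw_body)
    if encoding == "deflate":
        try:
            return zlib.decompress(raw_body)  # zlib-wrapped deflate
        except zlib.error:
            # Some servers send raw deflate without the zlib header.
            return zlib.decompress(raw_body, -zlib.MAX_WBITS)
    return raw_body  # identity or unknown encoding: return unchanged


compressed = gzip.compress(b"<html>hi</html>")
print(decode_body(compressed, "gzip"))
```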

akmonde

October 9, 2019

@arrow8899 Yeah, I tried r.raw.readlines yesterday. It doesn't seem to give me the headers, only the body.

@jxxz I thought about that too, but it's too much hassle..

littlespider89

October 9, 2019

That's not called a complete HTML response packet; that's HTTP.

jxxz

October 9, 2019

@akmonde I looked at the source. requests wraps the urllib3 response; the exact spot is requests --> adapters.py --> HTTPAdapter --> build_response(). You could check whether modifying the source would work for you.

```
def build_response(self, req, resp):
    """Builds a :class:`Response <requests.Response>` object from a urllib3
    response. This should not be called from user code, and is only exposed
    for use when subclassing the
    :class:`HTTPAdapter <requests.adapters.HTTPAdapter>`

    :param req: The :class:`PreparedRequest <PreparedRequest>` used to generate the response.
    :param resp: The urllib3 response object.
    :rtype: requests.Response
    """
```

akmonde

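October 9, 2019

@littlespider89 Close enough, close enough.. that's what I meant..

@jxxz This project has to be migrated, so modifying the library itself isn't really appropriate..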

wwqgtxx

October 9, 2019

How about monkey-patching the standard library's socket.socket yourself? Intercept every recv call, record the data, then pass it on to the upper layer.

est

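October 9, 2019

@jxxz requests returns the urllib3.HTTPResponse it calls into, and when that is instantiated the headers are read with the httplib.HTTPMessage class, which inherits from rfc822.Message. So the simplest way is:

a = requests.get('http://jd.com')
print(a.raw._fp.msg)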

est

October 9, 2019

After that you need to handle the version, status, and reason part from HTTPResponse.begin yourself.

if version == 'HTTP/1.0':
    self.version = 10
elif version.startswith('HTTP/1.'):
    self.version = 11  # use HTTP/1.1 code for HTTP/1.x where x>=1
elif version == 'HTTP/0.9':
    self.version = 9
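
To make the version/status/reason step concrete, here is a sketch using only the standard library's http.client (the Python 3 name for httplib). The helper `rebuild_head`, the `VERSION_TEXT` table, and the local `DemoHandler` server are all invented for this example so that it runs without network access; none of them come from requests or urllib3.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Maps http.client's integer version codes back to their wire form.
VERSION_TEXT = {9: "HTTP/0.9", 10: "HTTP/1.0", 11: "HTTP/1.1"}


def rebuild_head(resp):
    """Rebuild status line plus header block from an http.client.HTTPResponse."""
    version = VERSION_TEXT.get(resp.version, "HTTP/1.1")
    status_line = "%s %d %s" % (version, resp.status, resp.reason)
    header_lines = ["%s: %s" % (k, v) for k, v in resp.getheaders()]
    return "\r\n".join([status_line] + header_lines) + "\r\n\r\n"


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local endpoint used only for this demonstration."""

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_head_and_body():
    """Start a throwaway server on localhost, fetch once, return (head, body)."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
        conn.request("GET", "/")
        resp = conn.getresponse()
        head, body = rebuild_head(resp), resp.read()
        conn.close()
        return head, body
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    head, body = fetch_head_and_body()
    print(head + body.decode())
```

With requests, the urllib3 response in r.raw should carry the same three pieces as r.raw.version, r.raw.status, and r.raw.reason, but treat that as something to verify against the installed version.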

akmonde

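October 9, 2019

@est Thanks. In principle that is somewhat better than reassembling the headers, but it doesn't seem to make much of a difference...
@jxxz If I really can't find anything, I'll just have to reassemble them..

@wwqgtxx The reason I'm not using socket is that I no longer have the original packets, only the URL, so I need an HTTP library. I can't really apply a monkey patch in that case, can I?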

zhuangzhuang1988

October 9, 2019

```python
import socket
from contextlib import contextmanager
from socket import SocketIO as _SocketIO

import requests


@contextmanager
def hook_socket_context():
    """Temporarily replace socket.SocketIO and collect every chunk read through it."""
    buffer = []

    class SocketIO(_SocketIO):
        def readinto(self, b):
            print('hooking')
            res = super().readinto(b)
            if res:  # skip None (would block) and 0 (EOF)
                buffer.append(bytes(b[:res]))
            return res

    socket.SocketIO = SocketIO
    try:
        yield buffer
    finally:
        socket.SocketIO = _SocketIO  # restore even if the request raises


with hook_socket_context() as buf:
    requests.get('https://v2ex.com/t/607316#reply10')
    print(b''.join(buf))

print('without hook')
requests.get('https://v2ex.com/t/607316#reply10')
print('done!')

```
Tested on Python 3.7 without problems.

zhuangzhuang1988

October 9, 2019

https://bitbucket.org/snippets/supermouse/onRK7x
The syntax highlighting was broken here, so I put it on Bitbucket.

cz5424

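October 9, 2019

I'm with the poster in reply #8:
```python
import requests

a = requests.get('http://jd.com')
# _fp is a private attribute, so this may break between versions
str(a.raw._fp.msg) + a.text
```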

wwqgtxx

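October 9, 2019

@akmonde Both http.client and requests ultimately call the system socket library (unless you use a wrapper around a C library such as libcurl), so you can monkey-patch the socket library the way @zhuangzhuang1988 showed and capture all the data on the raw socket stream.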

akmonde

October 9, 2019

@wwqgtxx @cz5424 @zhuangzhuang1988 Thanks, I'll try it tomorrow~

ClericPy

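October 9, 2019

So many "talk is cheap" experts upstairs....... You live and learn.

The greatest joy of Python is that, monkey or not, as long as it isn't a built-in I'll stitch a patch onto your backside.

requests has already been hacked into several variants.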

wwqgtxx

October 9, 2019

@ClericPy Builtins can be patched too; gevent did that a long time ago.

ClericPy

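October 9, 2019

@wwqgtxx They are they and I am me...
I've only ever done it by replacement: import builtins and rebind the name. Patching an existing object directly raises an error on the spot...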

walleL

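October 10, 2019

I'm curious why you need to switch to an HTTP client library and yet still get at the raw TCP data.

Could the OP describe the requirement in more detail? It might be easier to solve starting from the original need.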

akmonde

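October 10, 2019

@walleL Here's the situation: I'm reproducing data for QA, and QA's analysis rules were originally written against complete HTTP response packets obtained from traffic analysis.
On my side, I have to start from scattered data such as the original URL and fetch everything again, so what I get is not a complete HTTP response packet but the data in separate pieces. That can cause discrepancies when the rules are applied, so I need to make my output compatible.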