各位大佬帮忙分析一个 net.Conn read 延迟的问题

2024-07-10 16:14:43 +08:00
 zhangsanfeng2012
最近使用 golang 写了一个 tls 的测试程序,连接特定服务器接收数据会有 500ms 的延迟,起初以为是网络问题,但是通过 wireshark 抓包查看每次服务端回包都很快,最后发现每次 tls 连接握手成功后,再使用 net.Conn 的 read 接口读取数据时都会有 500ms 延迟。
相同的代码,相同运行环境,连接其他服务器时,不会增加延迟。

大佬帮忙分析下原因,这 500ms 增加的原因

测试环境:win11 22H2

go 版本:go version go1.22.5 windows/amd64

wireshark 抓包:(可以看到发送数据后,到服务端回包只有 100ms 延迟)



打印日志:(可以看到握手成功后,每次读取数据都会有 500ms 延迟,比 100ms 多很多)
2024-07-10 16:13:24.2993355 +0800 CST m=+0.027776101 net.Conn start writing
2024-07-10 16:13:24.3118851 +0800 CST m=+0.040325701 net.Conn last write 12.6118ms
2024-07-10 16:13:24.3119473 +0800 CST m=+0.040387901 net.Conn start reading
2024-07-10 16:13:24.3299688 +0800 CST m=+0.058409401 net.Conn last read 18.0215ms
2024-07-10 16:13:24.3299688 +0800 CST m=+0.058409401 net.Conn start reading
2024-07-10 16:13:24.3305847 +0800 CST m=+0.059025301 net.Conn last read 615.9µs
2024-07-10 16:13:24.3307758 +0800 CST m=+0.059216401 net.Conn start writing
2024-07-10 16:13:24.3307758 +0800 CST m=+0.059216401 net.Conn last write 0s
TLS handshake complete
2024-07-10 16:13:24.3313401 +0800 CST m=+0.059780701 net.Conn start writing
2024-07-10 16:13:24.3313401 +0800 CST m=+0.059780701 net.Conn last write 0s
2024-07-10 16:13:24.3313401 +0800 CST m=+0.059780701 net.Conn start reading
2024-07-10 16:13:24.8493165 +0800 CST m=+0.577757101 net.Conn last read 517.9764ms
2024-07-10 16:13:24.850016 +0800 CST m=+0.578456601 net.Conn start writing
2024-07-10 16:13:24.8505629 +0800 CST m=+0.579003501 net.Conn last write 546.9µs
2024-07-10 16:13:24.8510947 +0800 CST m=+0.579535301 net.Conn start reading
2024-07-10 16:13:25.3798195 +0800 CST m=+1.108260101 net.Conn last read 528.7248ms
2024-07-10 16:13:25.3798195 +0800 CST m=+1.108260101 net.Conn start writing
2024-07-10 16:13:25.3804916 +0800 CST m=+1.108932201 net.Conn last write 672.1µs
2024-07-10 16:13:25.3811492 +0800 CST m=+1.109589801 net.Conn start reading
2024-07-10 16:13:25.9070306 +0800 CST m=+1.635471201 net.Conn last read 525.8814ms
2024-07-10 16:13:25.90721 +0800 CST m=+1.635650601 net.Conn start writing
2024-07-10 16:13:25.9077158 +0800 CST m=+1.636156401 net.Conn last write 505.8µs


代码 demo:
https://gist.github.com/kratos1918/6884f248e20130af93a8adf5d2716851
1331 次点击
所在节点    Go 编程语言
15 条回复
tool2dx
2024-07-10 16:24:59 +08:00
既然同样代码,连其他服务器没问题,那只能优先怀疑这个网站加密协议的问题了。

针对网站用 https://www.ssllabs.com/ssltest/先测试一下,看看支持那些握手协议。

然后再把具体的握手和加密部分都打印出来,找原因。你 log 文件太简单了,看不出什么东西。
zhangsanfeng2012
2024-07-10 16:41:30 +08:00
@tool2dx
TLSv1.2 Record Layer: Handshake Protocol: Client Hello
Content Type: Handshake (22)
Version: TLS 1.0 (0x0301)
Length: 188
Handshake Protocol: Client Hello
Handshake Type: Client Hello (1)
Length: 184
Version: TLS 1.2 (0x0303)
Random: 02c98e6340c1998429ef68ab84aca85febeda83f31f8c107885898bcc872b1e2
Session ID Length: 32
Session ID: 25b2f6c1e19320ece8ef49afdd86135278b795644c7875df6b177c5bb0c4b7bb
Cipher Suites Length: 32
Cipher Suites (16 suites)
Cipher Suite: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256 (0xc02b)
Cipher Suite: TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (0xc02f)
Cipher Suite: TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384 (0xc02c)
Cipher Suite: TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (0xc030)
Cipher Suite: TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256 (0xcca9)
Cipher Suite: TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256 (0xcca8)
Cipher Suite: TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA (0xc009)
Cipher Suite: TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA (0xc013)
Cipher Suite: TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA (0xc00a)
Cipher Suite: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA (0xc014)
Cipher Suite: TLS_RSA_WITH_AES_128_GCM_SHA256 (0x009c)
Cipher Suite: TLS_RSA_WITH_AES_256_GCM_SHA384 (0x009d)
Cipher Suite: TLS_RSA_WITH_AES_128_CBC_SHA (0x002f)
Cipher Suite: TLS_RSA_WITH_AES_256_CBC_SHA (0x0035)
Cipher Suite: TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA (0xc012)
Cipher Suite: TLS_RSA_WITH_3DES_EDE_CBC_SHA (0x000a)
Compression Methods Length: 1
Compression Methods (1 method)
Extensions Length: 79
Extension: status_request (len=5)
Extension: supported_groups (len=10)
Extension: ec_point_formats (len=2)
Extension: signature_algorithms (len=26)
Extension: renegotiation_info (len=1)
Extension: extended_master_secret (len=0)
Extension: signed_certificate_timestamp (len=0)
Extension: supported_versions (len=3) TLS 1.2
[JA4: t12i160800_8cdfa2d4673b_824bd91ab3e2]
[JA4_r: t12i160800_000a,002f,0035,009c,009d,c009,c00a,c012,c013,c014,c02b,c02c,c02f,c030,cca8,cca9_0005,000a,000b,000d,0012,0017,002b,ff01_0804,0403,0807,0805,0806,0401,0501,0601,0503,0603,0201,0203]
[JA3 Fullstring: 771,49195-49199-49196-49200-52393-52392-49161-49171-49162-49172-156-157-47-53-49170-10,5-10-11-13-65281-23-18-43,29-23-24-25,0]
[JA3: debde53b165c575b8eddf8ed24fd9c97]



TLSv1.2 Record Layer: Handshake Protocol: Server Hello
Content Type: Handshake (22)
Version: TLS 1.2 (0x0303)
Length: 93
Handshake Protocol: Server Hello
Handshake Type: Server Hello (2)
Length: 89
Version: TLS 1.2 (0x0303)
Random: 1575f62a5c52b96571721293311553c226a534045fce72c1444f574e47524401
Session ID Length: 32
Session ID: 0ec01f8dea82283f54f62e7e0ecc70d39c9edba3ae7a80c198eb2ad5cd91a6a7
Cipher Suite: TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (0xc02f)
Compression Method: null (0)
Extensions Length: 17
Extension: renegotiation_info (len=1)
Type: renegotiation_info (65281)
Length: 1
Renegotiation Info extension
Renegotiation info extension length: 0
Extension: ec_point_formats (len=4)
Type: ec_point_formats (11)
Length: 4
EC point formats Length: 3
Elliptic curves point formats (3)
EC point format: uncompressed (0)
EC point format: ansiX962_compressed_prime (1)
EC point format: ansiX962_compressed_char2 (2)
Extension: extended_master_secret (len=0)
Type: extended_master_secret (23)
Length: 0
[JA3S Fullstring: 771,49199,65281-11-23]
[JA3S: 76c691f46143bf86e2d1bb73c6187767]
tool2dx
2024-07-10 17:09:49 +08:00
"Cipher Suite: TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (0xc02f)"

这协议常见的,ECDHE 是椭圆曲线 Diffie-Hellman ,好像是写死的 secp256r1 曲线,256 位计算很快的。

AES 128 GCM 就比普通的 AES 128 多了一个 authTag ,也慢不到哪里去。

最后就算服务器返回 RSA 的是 8192 位或者 16384 位,那也只慢握手一次,也不可能后续每一次 read 都卡。看不出啥问题,汗。
1423
2024-07-10 17:15:36 +08:00
wireshark 也打出时间戳吧
zhangsanfeng2012
2024-07-10 19:02:24 +08:00
wisej
2024-07-13 00:44:15 +08:00
@zhangsanfeng2012 一种可能性是 go scheduler 调度延迟。我看你的 trace 里 P 确实是为 1 ( PS:你的 trace 结果和 demo 代码似乎不是对应的)
但是如果是这种情况的话,不会像你说的 [连接其他服务器时,不会增加延迟]
zhangsanfeng2012
2024-07-13 06:51:01 +08:00
@wisej 确实,trace 的时候设置了 GOMAXPROCS=1 ,现象还是一样的
zhangsanfeng2012
2024-07-13 06:51:43 +08:00
网上搜到了相似的问题,但是没找到解决方法 https://stackoverflow.com/questions/15588961/windows-tcp-socket-recv-delay
wisej
2024-07-13 11:52:00 +08:00
@zhangsanfeng2012 GOMAXPROCS>1 情况依旧?能不能来个纯 demo code 的 trace ?
zhangsanfeng2012
2024-07-15 09:12:33 +08:00
@wisej 这次 trace 的 maxprocs 是默认值,https://www.123pan.com/s/WWYTTd-briQA.html 提取码:v2ex

特定服务器执行结果:
2024-07-15 09:10:07.6549564 +0800 CST m=+0.026422301 net.Conn start writing
2024-07-15 09:10:07.668557 +0800 CST m=+0.040022901 net.Conn last write 13.6006ms
2024-07-15 09:10:07.668557 +0800 CST m=+0.040022901 net.Conn start reading
2024-07-15 09:10:07.6874251 +0800 CST m=+0.058891001 net.Conn last read 18.8681ms
2024-07-15 09:10:07.6877773 +0800 CST m=+0.059243201 net.Conn start reading
2024-07-15 09:10:07.6877773 +0800 CST m=+0.059243201 net.Conn last read 0s
2024-07-15 09:10:07.6883967 +0800 CST m=+0.059862601 net.Conn start writing
2024-07-15 09:10:07.6883967 +0800 CST m=+0.059862601 net.Conn last write 0s
2024-07-15 09:10:07.6889185 +0800 CST m=+0.060384401 net.Conn start reading
2024-07-15 09:10:08.2198472 +0800 CST m=+0.591313101 net.Conn last read 530.9287ms
TLS handshake complete
TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
wisej
2024-07-15 22:33:28 +08:00
@zhangsanfeng2012 从 trace 看就是纯粹的过了 500ms 左右, goroutine main.main 被 network 事件唤醒。
也就是 1.调用 Read ,syscall 返回没有可读数据 -> 2.调用 runtime_pollWait 由 runtime 等待 netfd 有可读数据(同时 gouroutine main.main blocked) 3. 500ms 后,runtime 通过 syscall 知道 netfd 有数据可读,唤醒 goroutine
---------------------
所以从 trace 看问题点应该不在 Go 。我自己本机跑也没有出现你的情况。
voidmnwzp
2024-07-16 10:28:47 +08:00
startTime := time.Now()
fmt.Println(startTime, "net.Conn start writing")
n, err = t.Conn.Write(b)
t.lastWrite = time.Now()
你这样只是计算的等待 netpoll 可写事件触发的阻塞时间
zhangsanfeng2012
2024-07-16 10:49:02 +08:00
@wisej 我也觉得应该不是 go 的问题,我用 wireshar 抓包看回包很快,但是应用读不上来。
我在当前 Read 之前,调用一次 Read(nil),就不会有 500ms 的问题了,但是换了另一台 pc ,这个方法并不能解决问题。
wisej
2024-07-16 13:34:04 +08:00
@zhangsanfeng2012 wireshark(应该)是网卡驱动层捕获 packet ,go 得经过 OS 网络栈,这之间可能有时间差,但也不可能到 500ms...

在你给的信息无误的前提下,是我我可能会用其他语言实现同样功能的代码和 Go 同时去跑一次,看它的延迟
zhangsanfeng2012
2024-07-16 14:19:44 +08:00
@wisej 后面试过通过 golang 直接调用 winsock api ,延迟发生的点也是一样的

这是一个专为移动设备优化的页面(即为了让你能够在 Google 搜索结果里秒开这个页面),如果你希望参与 V2EX 社区的讨论,你可以继续到 V2EX 上打开本讨论主题的完整版本。

https://www.v2ex.com/t/1056293

V2EX 是创意工作者们的社区,是一个分享自己正在做的有趣事物、交流想法,可以遇见新朋友甚至新机会的地方。

V2EX is a community of developers, designers and creative people.

© 2021 V2EX