V200 for anyone who can help me track down a bug — I'll add more if that's not enough

2022-10-27 19:52:08 +08:00
 sadfQED2

Since I can't pin down where the problem actually is, I don't know which project's community to ask in, so I'm posting it here in the hope that some passing guru can take a look. If you can help me solve it, dinner at Haidilao is on me.

Background: an HTTP service written in Go. The program occasionally becomes completely unresponsive (Stop-The-World unresponsive), so TCP socket file handles are never released and the server's open-file-handle count climbs rapidly. There is no known way to reproduce it: it may show up about once a week, or not at all for a month. We run in two data centers and it has occurred in both, always on a single machine at a time, never across the fleet. Restarting the service recovers it.

PS: We already have a ping liveness endpoint plus automatic traffic removal and restart. It has happened roughly ten times over the past six months and the automatic restart recovered every one of them, but in the spirit of exploration I'd still like to find the root cause.
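
(For context, the probe is nothing fancy — a minimal sketch of that kind of liveness endpoint; the path and port here are assumptions, not our real config:)

package main

import "net/http"

func main() {
    // Liveness probe: if this stops answering, the platform pulls the
    // instance out of rotation and restarts it.
    http.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("pong"))
    })
    http.ListenAndServe(":8080", nil)
}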

Environment: CentOS 7.2 + Go 1.18.3

Debugging so far:

1. Monitoring flagged the service as abnormal with a runaway file-handle count. My first reaction was a code bug — a missing close() somewhere. But hang on: this is a Go project, not C. The HTTP service is started with the net/http package; I never manage the TCP connections by hand, so there is no close() for me to forget.

2. Suspected some other API call in the project was leaking connections. Dug through it for a long time and found nothing.

3. Noticed that while the program is wedged, it stops writing logs entirely, and its debug endpoint goes down too (it listens on a different port from the business traffic and normally serves GC stats, goroutine counts, and so on). At this point I started to doubt the missing-close theory.
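
(The debug endpoint follows the usual pattern of serving net/http/pprof on a side port — a minimal sketch, with the address being an assumption:)

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
    // Debug server on its own port, separate from business traffic.
    go func() {
        log.Println(http.ListenAndServe("127.0.0.1:6060", nil))
    }()
    select {} // stand-in for the business server's main loop
}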

4. Before the main program logic starts, I added this bit of test code:

// Heartbeat: depends on no business logic whatsoever, so it should
// keep printing once a minute no matter what the rest of the service does.
go func() {
    for {
        time.Sleep(1 * time.Minute)
        fmt.Println("Hello World!")
    }
}()

In principle this is unaffected by any other logic and should print forever. I added the code, redeployed, and after half a month the problem finally recurred. Sure enough, on the machine I found that while the program was wedged, even this Hello World stopped printing.

5. Since the program was completely unresponsive and executing no logic at all, I pulled the faulty machine out of traffic and attached dlv to the running process to debug (dlv attach <pid>, then grs — dlv's alias for the goroutines command). Findings:

(1) The total goroutine count is abnormal: 50k+ goroutines at the time, where normal is under 500.

(2) A huge number of goroutines are deadlocked, all stuck at exactly the same place: Goroutine 19024799 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684384197s]

(3) The call stack of the deadlocked goroutines has nothing to do with our business code — each one is a net/http connection handler trying to fetch a reader from a sync.Pool and parked on a mutex inside sync.(*Pool).pinSlow:

 0  0x000000000043a556 in runtime.gopark
    at /usr/local/go1.18.3/src/runtime/proc.go:362
 1  0x000000000044b0b3 in runtime.goparkunlock
    at /usr/local/go1.18.3/src/runtime/proc.go:367
 2  0x000000000044b0b3 in runtime.semacquire1
    at /usr/local/go1.18.3/src/runtime/sema.go:144
 3  0x00000000004665a5 in sync.runtime_SemacquireMutex
    at /usr/local/go1.18.3/src/runtime/sema.go:71
 4  0x0000000000473785 in sync.(*Mutex).lockSlow
    at /usr/local/go1.18.3/src/sync/mutex.go:162
 5  0x00000000004740bf in sync.(*Mutex).Lock
    at /usr/local/go1.18.3/src/sync/mutex.go:81
 6  0x00000000004740bf in sync.(*Pool).pinSlow
    at /usr/local/go1.18.3/src/sync/pool.go:213
 7  0x000000000047400e in sync.(*Pool).pin
    at /usr/local/go1.18.3/src/sync/pool.go:206
 8  0x0000000000473d45 in sync.(*Pool).Get
    at /usr/local/go1.18.3/src/sync/pool.go:128
 9  0x0000000000711e29 in net/http.newTextprotoReader
    at /usr/local/go1.18.3/src/net/http/request.go:994
10  0x0000000000711fd9 in net/http.readRequest
    at /usr/local/go1.18.3/src/net/http/request.go:1024
11  0x00000000007181ea in net/http.(*conn).readRequest
    at /usr/local/go1.18.3/src/net/http/server.go:988
12  0x000000000071c58b in net/http.(*conn).serve
    at /usr/local/go1.18.3/src/net/http/server.go:1891
13  0x0000000000721cce in net/http.(*Server).Serve.func3
    at /usr/local/go1.18.3/src/net/http/server.go:3071
14  0x000000000046a641 in runtime.goexit
    at /usr/local/go1.18.3/src/runtime/asm_amd64.s:1571

6. At this point I no longer know how to proceed. Server network I/O, disk I/O, CPU, memory, and so on all show no obvious anomaly.

7. Had ops capture traffic at the gateway: nothing unusual in the minute before the failure, so an attack looks unlikely.

8. Can any guru analyze this? And while you're at it, enlighten me: why did even my Hello World log stop printing?

The full dlv grs output is attached below (some package names and paths are redacted):

(dlv) grs
  Goroutine 1 - User: /home/xxxxxx/golibs/rpcserver/v2@v2.0.0/rpcserver/server.go:392 xxxx/golibs/rpcserver/v2/rpcserver.(*Server).waitHttp (0xd0c588) [select 452427h39m57.846100461s]
  Goroutine 2 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [force gc (idle) 452385h49m20.252083363s]
  Goroutine 3 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [finalizer wait 452427h39m57.869422736s]
  Goroutine 4 - User: /home/xxxxxx/devops/statsd@v0.0.13/client_golang/statsdlib/push.go:39 xxxxxx/devops/statsd/client_golang/statsdlib.statsdPushLoop (0x87b17e) [chan receive]
  Goroutine 9 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)]
  Goroutine 10 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)]
  Goroutine 15 - User: /home/xxxxxx/xxxxxx/gobiz/logger@v1.3.4/file.go:309 xxxxxx/gobiz/logger.(*fileLog).writeLoop (0x80a545) [select 452427h39m57.846215549s]
  Goroutine 16 - User: /home/xxxxxx/xxxxxx/rpc/rpc-go-spl-v2@v0.1.35/rpc/common/watcher.go:49 xxxxxx/rpc/rpc-go-spl-v2/rpc/common.(*DirWatcher).run.func1 (0x84a9f9) [chan receive]
  Goroutine 18 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC sweep wait]
  Goroutine 19 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC scavenge wait]
  Goroutine 27 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)]
  Goroutine 28 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)]
  Goroutine 29 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)]
  Goroutine 30 - User: /home/xxxxxx/xxxxxx/platform-ha/onekey-degrade_sdk_go/v3@v3.2.11/degrade/metric.go:96 xxxxxx/platform-ha/onekey-degrade_sdk_go/v3/degrade.(*metricSender).sendMetricFromChan (0xb05e65) [chan receive 452427h39m57.86959434s]
  Goroutine 31 - User: /home/xxxxxx/xxxxxx/platform-ha/onekey-degrade_sdk_go/v3@v3.2.11/degrade/metric.go:96 xxxxxx/platform-ha/onekey-degrade_sdk_go/v3/degrade.(*metricSender).sendMetricFromChan (0xb05e65) [chan receive 452427h39m57.869609575s]
  Goroutine 34 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.683989625s]
  Goroutine 35 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684006698s]
  Goroutine 36 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684023352s]
  Goroutine 37 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684037091s]
  Goroutine 38 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684051841s]
  Goroutine 39 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684066197s]
  Goroutine 40 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684082165s]
  Goroutine 41 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684095326s]
  Goroutine 42 - User: /home/xxxxxx/xxxxxx/xxxx/golibs/metrics@v0.2.3/metrics.go:280 xxxxxx/xxxx/golibs/metrics.(*Sender).asyncSendLoop.func1 (0x880a5b) [select 452391h19m51.199497773s]
  Goroutine 43 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.68413238s]
  Goroutine 45 - User: /usr/local/go1.18.3/src/runtime/time.go:194 time.Sleep (0x46742e) [sleep]
  Goroutine 46 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.8518662s]
  Goroutine 48 - User: /gaia/workspace-job/xxxxxx/xxxxx/common/handlers/apollo/apollo.go:50 xxxxx/common/handlers/apollo.OnConfigUpdate.func1 (0x884859) [chan receive 452427h39m57.863062571s]
  Goroutine 49 - User: /gaia/workspace-job/xxxxxx/xxxxx/common/handlers/apollo/apollo.go:50 xxxxx/common/handlers/apollo.OnConfigUpdate.func1 (0x884859) [chan receive 452427h39m57.846560172s]
  Goroutine 67 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)]
  Goroutine 68 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)]
  Goroutine 69 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)]
  Goroutine 72 - User: /home/xxxxxx/github.com/getsentry/sentry-go@v0.10.0/transport.go:378 github.com/getsentry/sentry-go.(*HTTPTransport).worker (0x8010ec) [chan receive 452391h19m51.973327187s]
  Goroutine 73 - User: /home/xxxxxx/xxxxxx/rpc/rpc-go-spl-v2@v0.1.35/rpc/common/watcher.go:49 xxxxxx/rpc/rpc-go-spl-v2/rpc/common.(*DirWatcher).run.func1 (0x84a9f9) [chan receive]
  Goroutine 75 - User: /usr/local/go1.18.3/src/database/sql/sql.go:1226 database/sql.(*DB).connectionOpener (0xa3d38d) [select 452427h39m57.863183594s]
  Goroutine 84 - User: /usr/local/go1.18.3/src/runtime/sigqueue.go:151 os/signal.signal_recv (0x466bef) (thread 1238)
  Goroutine 98 - User: /usr/local/go1.18.3/src/net/fd_unix.go:172 net.(*netFD).accept (0x522b15) [IO wait]
  Goroutine 99 - User: /usr/local/go1.18.3/src/net/fd_unix.go:172 net.(*netFD).accept (0x522b15) [IO wait 452427h39m57.846699823s]
  Goroutine 100 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684348233s]
  Goroutine 101 - User: /usr/local/go1.18.3/src/net/fd_unix.go:172 net.(*netFD).accept (0x522b15) [IO wait]
  Goroutine 102 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [select 452427h39m57.846735781s]
* Goroutine 19024799 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684384197s]
  Goroutine 19168258 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684396903s]
  Goroutine 19913951 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684409102s]
  Goroutine 19958391 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684421539s]
  Goroutine 20081541 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684433615s]
  Goroutine 20116495 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684446142s]
  Goroutine 20172239 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.68446185s]
......(the remaining entries are all the same, omitted)
zhanglintc
2022-10-27 23:53:34 +08:00
Not really sure, just replying on the off chance my rambling sparks something.

> A huge number of goroutines are deadlocked

Reading it all the way through, the cause is just what it says — a deadlock: too many goroutines, resources exhausted, everything wedged solid.
Maybe start with the places in your code that could be taking locks?

If it really isn't a problem in your own code, then consider whether it's a bug in Go itself?
xyjincan
2022-10-28 07:54:50 +08:00
Consider whether the machine has a hardware problem.
saberlong
2022-10-28 08:32:38 +08:00
Seeing p.pinSlow, my guess is this is a sync.Pool problem — possibly sync.Pool misuse, with large numbers of sync.Pools being created and thrown away. sync.Pool keeps a globally maintained slice, allPools, with a global mutex guarding concurrent reads and writes of it.
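
(For reference, the globals in question look roughly like this — paraphrased from Go 1.18's src/sync/pool.go:)

var (
    allPoolsMu Mutex

    // allPools is the set of pools that have non-empty primary
    // caches. Protected by allPoolsMu (plus pinning) or by
    // stopping the world.
    allPools []*Pool

    // oldPools is the set of pools that may have non-empty victim
    // caches. Protected by stopping the world.
    oldPools []*Pool
)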
saberlong
2022-10-28 08:35:09 +08:00
Looking at the request path again: my guess is that some business code constructs a brand-new http.Client for every outgoing HTTP request, causing pools to be recreated every time.
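
(A sketch of the general misuse pattern being described — not the OP's actual code, all names hypothetical. A pool allocated per call never warms up, so its first Get always takes the pinSlow path through the global allPoolsMu, whereas a shared pool stays on the lock-free per-P fast path:)

package pooldemo

import "sync"

// Bad: a fresh Pool per call. The first Get on every new Pool goes through
// pinSlow, which locks the global allPoolsMu and appends to allPools.
func handleBad(data []byte) {
    pool := &sync.Pool{New: func() any { return make([]byte, 4096) }}
    buf := pool.Get().([]byte)
    defer pool.Put(buf)
    _ = append(buf[:0], data...)
}

// Good: one long-lived, package-level Pool shared by all callers.
var bufPool = sync.Pool{New: func() any { return make([]byte, 4096) }}

func handleGood(data []byte) {
    buf := bufPool.Get().([]byte)
    defer bufPool.Put(buf)
    _ = append(buf[:0], data...)
}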
hopingtop
2022-10-28 10:16:05 +08:00
First: your Hello World goroutine not printing means the scheduler never picked that task up at all, or the scheduler is completely occupied by other goroutines — the latter being more likely.
Second: focus on the functions whose names carry "loop", e.g. logger.writeLoop and metrics.asyncSendLoop, especially the latter. Since its name starts with async, check whether it spawns large numbers of goroutines that are never released, leaving them permanently resident.

Take a look at those two places first.
hopingtop
2022-10-28 10:21:09 +08:00
That said, 50k goroutines should normally be no big deal,
hopingtop
2022-10-28 10:22:37 +08:00
@hopingtop but if those goroutines are holding a lot of locks, that could very well produce exactly the behavior you're seeing.
Zach369
2022-10-28 10:58:04 +08:00
It's probably a problem in your code. Take a look with pprof.
brucemaclin
2022-10-28 11:22:52 +08:00
Looks like a deadlock; check it with trace or pprof.
sadfQED2
2022-10-28 11:41:49 +08:00
@Zach369 #8
@brucemaclin #9 As I said above, the program's debug endpoint went down too — that debug endpoint is our pprof endpoint, so once it hung, pprof was unavailable as well.
sadfQED2
2022-10-28 11:53:30 +08:00
@hopingtop #5 asyncSendLoop does indeed start a goroutine, but it's third-party code; let me first work out what its logic is. One question though: why would a goroutine holding a lock keep other goroutines from running? Isn't Go's scheduler preemptive these days? Why can't my Hello World goroutine ever get scheduled?
hopingtop
2022-10-28 12:38:05 +08:00
@sadfQED2 If there are too many useless goroutines and they're all competing, how do you guarantee your helloWorld wins? That said, I can't tell much more from the information above. If the third-party code is open source, post the GitHub link and I'll try to pinpoint the relevant implementation.
runningman
2022-10-28 12:43:02 +08:00
This kind of thing needs code reading, analysis, log instrumentation, and so on. A description alone won't crack it.
alexsunxl
2022-10-28 12:52:21 +08:00
The problem appears after an accumulation period of variable length (roughly two weeks). Most likely some corner of the logic has a deadlock that wedges the goroutines.
sadfQED2
2022-10-28 12:58:10 +08:00
@hopingtop It's internal, not open source.

If everyone is competing, then in theory, given enough time, it should win eventually — but I watched for a whole day and it never printed.
hopingtop
2022-10-28 14:17:20 +08:00
@sadfQED2 It doesn't work that way — and it's a 1-in-50k shot. Unless it's guaranteed to execute, you can wait a day or N days and still get nothing. So read the code first, starting from the area I pointed at.
For a problem like this, without the code and breakpoints, I really can't offer much more.
sadfQED2
2022-10-28 14:26:17 +08:00
https://gist.github.com/Jinnrry/d11a8c7dc74da724832206749afa12d8

That's the asyncSendLoop code. It looks like it maintains a single loop that keeps reading from a chan — it appears to start only one goroutine — and then calls a counting function. I haven't fully followed the counter's internals, but it looks like purely local operations.

@hopingtop
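
(Purely for illustration — a generic sketch of the single-consumer shape described above; the names are hypothetical, not the gist's actual code:)

package metricsdemo

type Sender struct {
    ch chan int
}

// asyncSendLoop starts one resident consumer goroutine that drains the
// channel and does local bookkeeping.
func (s *Sender) asyncSendLoop() {
    go func() {
        for v := range s.ch {
            s.count(v)
        }
    }()
}

func (s *Sender) count(v int) { /* local counting only, no I/O */ }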
Y9qn1d
2022-10-28 15:43:19 +08:00
Probably not a deadlock but an infinite loop: lots of goroutines stuck spinning in a for loop would keep the hello world from printing.
yongbozou
2022-10-28 16:08:15 +08:00
https://cs.opensource.google/go/go/+/refs/tags/go1.18.3:src/sync/pool.go;l=213

Looking at the 50k stuck goroutines, they should fall into two groups:
1. Those blocked on allPoolsMu.Lock().
2. The goroutine that actually holds the lock. Search all the goroutines sitting in pinSlow for one that is not waiting on allPoolsMu.Lock(), then follow that goroutine and find out why it is stuck.

func (p *Pool) pinSlow() (*poolLocal, int) {
    // Retry under the mutex.
    // Can not lock the mutex while pinned.
    runtime_procUnpin()
    allPoolsMu.Lock()
    defer allPoolsMu.Unlock()
    pid := runtime_procPin()
    // poolCleanup won't be called while we are pinned.
    s := p.localSize
    l := p.local
    if uintptr(pid) < s {
        return indexLocal(l, pid), pid
    }
    if p.local == nil {
        allPools = append(allPools, p)
    }
    // If GOMAXPROCS changes between GCs, we re-allocate the array and lose the old one.
    size := runtime.GOMAXPROCS(0)
    local := make([]poolLocal, size)
    atomic.StorePointer(&p.local, unsafe.Pointer(&local[0])) // store-release
    runtime_StoreReluintptr(&p.localSize, uintptr(size))     // store-release
    return &local[pid], pid
}
yongbozou
2022-10-28 16:10:37 +08:00
The http package keeps a sync.Pool to reuse textproto.Reader objects and cut down GC pressure.
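
(Concretely, that is the newTextprotoReader at the top of the stack trace above — roughly as it appears in Go 1.18's src/net/http/request.go:)

var textprotoReaderPool sync.Pool

func newTextprotoReader(br *bufio.Reader) *textproto.Reader {
    if v := textprotoReaderPool.Get(); v != nil {
        tr := v.(*textproto.Reader)
        tr.R = br
        return tr
    }
    return textproto.NewReader(br)
}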

Original thread: https://www.v2ex.com/t/890479