V200,找大佬帮忙看个 BUG,不够我加钱

2022-10-27 19:52:08 +08:00
 sadfQED2

由于找不到具体问题在哪,没办法去对应的社区提问,因此我就只能发这里了,希望有划水路过的大佬帮忙看看,能帮忙解决的话就当我请大佬吃个海底捞。

背景:go 语言写的 http 服务,程序偶然出现无响应( Stop The World 那种无响应),进而导致 Tcp Socket 句柄文件不释放,进而服务器文件句柄数疯涨。故障复现方法未知,可能一周左右出现一次,也可能一个月都不出现。线上 2 个机房,均有发生,发生时都是单机,没有大量出现,重启服务后恢复

PS:目前配置了 ping 接口探活+自动摘流重启。大概最近半年就出现了十次左右,都自动重启恢复了,但是本着探索的精神,还是想找到问题根源

环境:centos7.2+golang1.18.3

调试过程:

1 、监控发现服务异常,文件句柄异常,第一反应是代码问题,没有执行 close 函数导致。但是仔细一想,我 tm 这是 go 项目啊,又不是 c 语言项目,go 项目用的 net.http 包启动的 http 服务,又不是我手动管理的 tcp ,哪特么来的 close 啊。

2 、怀疑是项目其他地方有什么 api 调用,没有关闭连接,排查了很久,并没有发现问题。

3 、发现程序出问题这段时间,连日志都不输出了,程序 debug 接口也挂了(和业务接口不是同一个端口,debug 接口本来用于监控程序 gc 、协程数量等信息的)。开始怀疑不是 close 的问题。

4 、在程序主逻辑启动前,加了一段测试代码,代码如下

go func() {
    for {
        time.Sleep(1 * time.Minute)
        fmt.Println("Hello World!")
    }
}()

讲道理,这应该不受任何逻辑影响,始终都该输出吧。加上这段代码,重新发布项目,等了半个月,终于又出问题了。上服务器一看,出问题的时候,这个 HelloWorld 也不输出了。

5 、由于程序完全没响应,任何逻辑都不执行,因此从线上把故障机器摘流,然后使用 dlv 注入进程进行调试。结果如下:

( 1 )、协程总数异常,此时有 5w+个协程,正常情况应该不超过 500 个协程

( 2 )、大量协程被死锁,都是卡在同一个地方 Goroutine 19024799 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684384197s]

( 3 )、查看死锁协程的调用栈,结果与业务代码无关

 0  0x000000000043a556 in runtime.gopark
    at /usr/local/go1.18.3/src/runtime/proc.go:362
 1  0x000000000044b0b3 in runtime.goparkunlock
    at /usr/local/go1.18.3/src/runtime/proc.go:367
 2  0x000000000044b0b3 in runtime.semacquire1
    at /usr/local/go1.18.3/src/runtime/sema.go:144
 3  0x00000000004665a5 in sync.runtime_SemacquireMutex
    at /usr/local/go1.18.3/src/runtime/sema.go:71
 4  0x0000000000473785 in sync.(*Mutex).lockSlow
    at /usr/local/go1.18.3/src/sync/mutex.go:162
 5  0x00000000004740bf in sync.(*Mutex).Lock
    at /usr/local/go1.18.3/src/sync/mutex.go:81
 6  0x00000000004740bf in sync.(*Pool).pinSlow
    at /usr/local/go1.18.3/src/sync/pool.go:213
 7  0x000000000047400e in sync.(*Pool).pin
    at /usr/local/go1.18.3/src/sync/pool.go:206
 8  0x0000000000473d45 in sync.(*Pool).Get
    at /usr/local/go1.18.3/src/sync/pool.go:128
 9  0x0000000000711e29 in net/http.newTextprotoReader
    at /usr/local/go1.18.3/src/net/http/request.go:994
10  0x0000000000711fd9 in net/http.readRequest
    at /usr/local/go1.18.3/src/net/http/request.go:1024
11  0x00000000007181ea in net/http.(*conn).readRequest
    at /usr/local/go1.18.3/src/net/http/server.go:988
12  0x000000000071c58b in net/http.(*conn).serve
    at /usr/local/go1.18.3/src/net/http/server.go:1891
13  0x0000000000721cce in net/http.(*Server).Serve.func3
    at /usr/local/go1.18.3/src/net/http/server.go:3071
14  0x000000000046a641 in runtime.goexit
    at /usr/local/go1.18.3/src/runtime/asm_amd64.s:1571

6 、到这来我已经不知道该怎么继续排查了,服务器网络 io 、磁盘 io 、cpu 、内存等待均无明显异常。

7 、找运维在网关上面录制了流量,故障前一分钟都没发现异常流量,应该不存在被攻击的问题

8 、哪位大佬能给分析下呢,顺便给小弟讲解讲解,为啥我那个 HelloWorld 日志也不输出了?

下面附上完整的 dlv grs 结果(部分包名、路径打码了):

(dlv) grs
  Goroutine 1 - User: /home/xxxxxx/golibs/rpcserver/v2@v2.0.0/rpcserver/server.go:392 xxxx/golibs/rpcserver/v2/rpcserver.(*Server).waitHttp (0xd0c588) [select 452427h39m57.846100461s]
  Goroutine 2 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [force gc (idle) 452385h49m20.252083363s]
  Goroutine 3 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [finalizer wait 452427h39m57.869422736s]
  Goroutine 4 - User: /home/xxxxxx/devops/statsd@v0.0.13/client_golang/statsdlib/push.go:39 xxxxxx/devops/statsd/client_golang/statsdlib.statsdPushLoop (0x87b17e) [chan receive]
  Goroutine 9 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)]
  Goroutine 10 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)]
  Goroutine 15 - User: /home/xxxxxx/xxxxxx/gobiz/logger@v1.3.4/file.go:309 xxxxxx/gobiz/logger.(*fileLog).writeLoop (0x80a545) [select 452427h39m57.846215549s]
  Goroutine 16 - User: /home/xxxxxx/xxxxxx/rpc/rpc-go-spl-v2@v0.1.35/rpc/common/watcher.go:49 xxxxxx/rpc/rpc-go-spl-v2/rpc/common.(*DirWatcher).run.func1 (0x84a9f9) [chan receive]
  Goroutine 18 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC sweep wait]
  Goroutine 19 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC scavenge wait]
  Goroutine 27 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)]
  Goroutine 28 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)]
  Goroutine 29 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)]
  Goroutine 30 - User: /home/xxxxxx/xxxxxx/platform-ha/onekey-degrade_sdk_go/v3@v3.2.11/degrade/metric.go:96 xxxxxx/platform-ha/onekey-degrade_sdk_go/v3/degrade.(*metricSender).sendMetricFromChan (0xb05e65) [chan receive 452427h39m57.86959434s]
  Goroutine 31 - User: /home/xxxxxx/xxxxxx/platform-ha/onekey-degrade_sdk_go/v3@v3.2.11/degrade/metric.go:96 xxxxxx/platform-ha/onekey-degrade_sdk_go/v3/degrade.(*metricSender).sendMetricFromChan (0xb05e65) [chan receive 452427h39m57.869609575s]
  Goroutine 34 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.683989625s]
  Goroutine 35 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684006698s]
  Goroutine 36 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684023352s]
  Goroutine 37 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684037091s]
  Goroutine 38 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684051841s]
  Goroutine 39 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684066197s]
  Goroutine 40 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684082165s]
  Goroutine 41 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684095326s]
  Goroutine 42 - User: /home/xxxxxx/xxxxxx/xxxx/golibs/metrics@v0.2.3/metrics.go:280 xxxxxx/xxxx/golibs/metrics.(*Sender).asyncSendLoop.func1 (0x880a5b) [select 452391h19m51.199497773s]
  Goroutine 43 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.68413238s]
  Goroutine 45 - User: /usr/local/go1.18.3/src/runtime/time.go:194 time.Sleep (0x46742e) [sleep]
  Goroutine 46 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.8518662s]
  Goroutine 48 - User: /gaia/workspace-job/xxxxxx/xxxxx/common/handlers/apollo/apollo.go:50 xxxxx/common/handlers/apollo.OnConfigUpdate.func1 (0x884859) [chan receive 452427h39m57.863062571s]
  Goroutine 49 - User: /gaia/workspace-job/xxxxxx/xxxxx/common/handlers/apollo/apollo.go:50 xxxxx/common/handlers/apollo.OnConfigUpdate.func1 (0x884859) [chan receive 452427h39m57.846560172s]
  Goroutine 67 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)]
  Goroutine 68 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)]
  Goroutine 69 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)]
  Goroutine 72 - User: /home/xxxxxx/github.com/getsentry/sentry-go@v0.10.0/transport.go:378 github.com/getsentry/sentry-go.(*HTTPTransport).worker (0x8010ec) [chan receive 452391h19m51.973327187s]
  Goroutine 73 - User: /home/xxxxxx/xxxxxx/rpc/rpc-go-spl-v2@v0.1.35/rpc/common/watcher.go:49 xxxxxx/rpc/rpc-go-spl-v2/rpc/common.(*DirWatcher).run.func1 (0x84a9f9) [chan receive]
  Goroutine 75 - User: /usr/local/go1.18.3/src/database/sql/sql.go:1226 database/sql.(*DB).connectionOpener (0xa3d38d) [select 452427h39m57.863183594s]
  Goroutine 84 - User: /usr/local/go1.18.3/src/runtime/sigqueue.go:151 os/signal.signal_recv (0x466bef) (thread 1238)
  Goroutine 98 - User: /usr/local/go1.18.3/src/net/fd_unix.go:172 net.(*netFD).accept (0x522b15) [IO wait]
  Goroutine 99 - User: /usr/local/go1.18.3/src/net/fd_unix.go:172 net.(*netFD).accept (0x522b15) [IO wait 452427h39m57.846699823s]
  Goroutine 100 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684348233s]
  Goroutine 101 - User: /usr/local/go1.18.3/src/net/fd_unix.go:172 net.(*netFD).accept (0x522b15) [IO wait]
  Goroutine 102 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [select 452427h39m57.846735781s]
* Goroutine 19024799 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684384197s]
  Goroutine 19168258 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684396903s]
  Goroutine 19913951 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684409102s]
  Goroutine 19958391 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684421539s]
  Goroutine 20081541 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684433615s]
  Goroutine 20116495 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684446142s]
  Goroutine 20172239 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.68446185s]
  ......(后面都是重复的,就省略了)
2731 次点击
所在节点    Go 编程语言
28 条回复
hopingtop
2022-10-28 16:38:18 +08:00
当前你贴的代码,确实没得什么问题,asyncSendMetricLoop 这个方法调用,只会有一次吗?是 sync.Once 吗?
Counter 里面 会涉及到锁的问题吗?
asyncSendMetricLoop 因为 metricCell chan 不关闭,这个 goroutine 就会常驻, 所以这个方法如何使用就很关键。
hopingtop
2022-10-28 16:39:18 +08:00
最终也并不一定这里有问题,这只是一个 排查方向 @sadfQED2
yougg
2022-10-28 17:50:38 +08:00
如果 debug 接口挂起抓不到 pprof, 就把启动进程的标准和错误输出重定向到文件, 服务 hang 住的时候 kill -3 <pid>来查看输出重定向文件中的 goroutine 栈信息.
alphayan
2022-10-30 19:07:30 +08:00
私聊
Jhon11
2022-10-31 09:05:38 +08:00
看看 lock 和 unlock 是不是都有释放, 还有就是一个 lock 后面代码,看看是不是有死循环了
bugfan
2022-10-31 09:19:14 +08:00
卧槽,前段时间我的一个在线上运行的服务出现了和 op 很相似的情况。。。最后发现是 goroutine 太多了(另外我的服务器配置也是非常 mini 的那种),我改了改代码,现在已经不出现了。
原本我是在 443 端口上面接收 tcp 流量,如果发现包里面有 sni 会校验权限,如果匹配会做 sni 代理,否则会把流量发到 https server ,这期间很多地方的逻辑用了 goroutinte 并发实现。我最后把 sni 代理去掉了,443 端口的流量直接用 http handerl 处理,就再也没复线了。
怀疑 op 也是类似情况,程序无响应的时候,我从云平台看到内存几乎占满了,磁盘 io 也非常大,大量 野 goroutine 导致内存增长
lysS
2022-11-01 09:44:54 +08:00
你这个 stw 了,内存、cpu 又正常;我感觉是 go 中调度的一个 bug 。要是能稳定复现就好了
DevKuan
2022-11-01 17:25:48 +08:00
goroutine 太多就容易出问题,要用 context 控制

这是一个专为移动设备优化的页面(即为了让你能够在 Google 搜索结果里秒开这个页面),如果你希望参与 V2EX 社区的讨论,你可以继续到 V2EX 上打开本讨论主题的完整版本。

https://www.v2ex.com/t/890479

V2EX 是创意工作者们的社区,是一个分享自己正在做的有趣事物、交流想法,可以遇见新朋友甚至新机会的地方。

V2EX is a community of developers, designers and creative people.

© 2021 V2EX