etcd 断电数据文件损坏如何恢复?

2022-04-22 10:30:50 +08:00
 seagull7558

虽然提前通知会断电重启机器,但是因为之前也遇到过断电的问题,并且一般情况下都是会自己启动 kubelet 和 docker 及相关上层服务,我就没在意,也没有说提前关闭集群,可是万万没想到啊,今天开机后发现 Master 节点居然都挂掉了

排查过程

  1. 首先登陆主节点,查看各节点状态
[root@k8s-master wal]# kubectl get nodes
The connection to the server 192.168.113.152:6443 was refused - did you specify the right host or port?

  1. 发现 kubelet 好像是有问题,去看看状态
systemctl status kubelet
[root@k8s-master wal]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Fri 2022-04-22 09:18:34 CST; 1h 4min ago
     Docs: https://kubernetes.io/docs/
 Main PID: 1017 (kubelet)
    Tasks: 17 (limit: 75136)
   Memory: 152.0M
   CGroup: /system.slice/kubelet.service
           └─1017 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-plugin=cni --pod-infra-c>

4 月 22 10:22:55 k8s-master kubelet[1017]: E0421 22:22:55.289217    1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
4 月 22 10:22:55 k8s-master kubelet[1017]: E0421 22:22:55.390603    1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
4 月 22 10:22:55 k8s-master kubelet[1017]: E0421 22:22:55.491934    1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
4 月 22 10:22:55 k8s-master kubelet[1017]: E0421 22:22:55.593267    1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
4 月 22 10:22:55 k8s-master kubelet[1017]: E0421 22:22:55.693429    1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
4 月 22 10:22:55 k8s-master kubelet[1017]: E0421 22:22:55.793744    1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
4 月 22 10:22:55 k8s-master kubelet[1017]: E0421 22:22:55.894905    1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
4 月 22 10:22:55 k8s-master kubelet[1017]: E0421 22:22:55.995215    1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
4 月 22 10:22:56 k8s-master kubelet[1017]: E0421 22:22:56.095825    1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
4 月 22 10:22:56 k8s-master kubelet[1017]: E0421 22:22:56.196740    1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
lines 1-22/22 (END)

  1. 获取主节点失败,看了眼各组件容器日志,发现 apiserver 有点问题 查看对应的 apiserver 容器日志
2022-04-22T02:21:31.203199623Z I0422 02:21:31.202994       1 server.go:565] external host was not specified, using 192.168.113.152
2022-04-22T02:21:31.231588697Z I0422 02:21:31.231374       1 server.go:172] Version: v1.23.5
2022-04-22T02:21:31.389807666Z I0422 02:21:31.389666       1 shared_informer.go:240] Waiting for caches to sync for node_authorizer
2022-04-22T02:21:31.390038398Z I0422 02:21:31.389913       1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
2022-04-22T02:21:31.390047004Z I0422 02:21:31.389934       1 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
2022-04-22T02:21:31.390767833Z I0422 02:21:31.390608       1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
2022-04-22T02:21:31.390778978Z I0422 02:21:31.390639       1 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
2022-04-22T02:21:31.392738574Z W0422 02:21:31.392578       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
2022-04-22T02:21:32.391488027Z W0422 02:21:32.391315       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
2022-04-22T02:21:32.394099214Z W0422 02:21:32.393918       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
2022-04-22T02:21:33.393065852Z W0422 02:21:33.392633       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
2022-04-22T02:21:33.835046849Z W0422 02:21:33.833992       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
2022-04-22T02:21:35.158712152Z W0422 02:21:35.158257       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
2022-04-22T02:21:36.424633514Z W0422 02:21:36.424302       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
2022-04-22T02:21:38.117720953Z W0422 02:21:38.117282       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
2022-04-22T02:21:41.173338381Z W0422 02:21:41.172823       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
2022-04-22T02:21:42.894615034Z W0422 02:21:42.894174       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
2022-04-22T02:21:47.511910137Z W0422 02:21:47.511585       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
2022-04-22T02:21:50.459764076Z W0422 02:21:50.459450       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
2022-04-22T02:21:51.393624485Z E0422 02:21:51.393263       1 run.go:74] "command failed" err="context deadline exceeded"

  1. 端口号 2379 接着往下排查 很明显是连接 etcd 报错,这时候我特么很慌啊,etcd 存储了很多 k8s 配置信息,这玩意要是修不好,那许多服务还得重新搭建
2022-04-22T02:26:23.131382248Z {"level":"info","ts":"2022-04-22T02:26:23.130Z","caller":"etcdmain/etcd.go:72","msg":"Running: ","args":["etcd","--advertise-client-urls=https://192.168.113.152:2379","--cert-file=/etc/kubernetes/pki/etcd/server.crt","--client-cert-auth=true","--data-dir=/var/lib/etcd","--initial-advertise-peer-urls=https://192.168.113.152:2380","--initial-cluster=k8s-master=https://192.168.113.152:2380","--key-file=/etc/kubernetes/pki/etcd/server.key","--listen-client-urls=https://127.0.0.1:2379,https://192.168.113.152:2379","--listen-metrics-urls=http://127.0.0.1:2381","--listen-peer-urls=https://192.168.113.152:2380","--name=k8s-master","--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt","--peer-client-cert-auth=true","--peer-key-file=/etc/kubernetes/pki/etcd/peer.key","--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt","--snapshot-count=10000","--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt"]}
2022-04-22T02:26:23.131412789Z {"level":"info","ts":"2022-04-22T02:26:23.130Z","caller":"etcdmain/etcd.go:115","msg":"server has been already initialized","data-dir":"/var/lib/etcd","dir-type":"member"}
2022-04-22T02:26:23.131416304Z {"level":"info","ts":"2022-04-22T02:26:23.130Z","caller":"embed/etcd.go:131","msg":"configuring peer listeners","listen-peer-urls":["https://192.168.113.152:2380"]}
2022-04-22T02:26:23.131418664Z {"level":"info","ts":"2022-04-22T02:26:23.130Z","caller":"embed/etcd.go:478","msg":"starting with peer TLS","tls-info":"cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, client-cert=, client-key=, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
2022-04-22T02:26:23.131421030Z {"level":"info","ts":"2022-04-22T02:26:23.130Z","caller":"embed/etcd.go:139","msg":"configuring client listeners","listen-client-urls":["https://127.0.0.1:2379","https://192.168.113.152:2379"]}
2022-04-22T02:26:23.131423955Z {"level":"info","ts":"2022-04-22T02:26:23.130Z","caller":"embed/etcd.go:307","msg":"starting an etcd server","etcd-version":"3.5.1","git-sha":"e8732fb5f","go-version":"go1.16.3","go-os":"linux","go-arch":"amd64","max-cpu-set":4,"max-cpu-available":4,"member-initialized":true,"name":"k8s-master","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":10000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://192.168.113.152:2380"],"listen-peer-urls":["https://192.168.113.152:2380"],"advertise-client-urls":["https://192.168.113.152:2379"],"listen-client-urls":["https://127.0.0.1:2379","https://192.168.113.152:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
2022-04-22T02:26:23.133621480Z panic: freepages: failed to get all reachable pages (page 478: multiple references)
2022-04-22T02:26:23.133635080Z 
2022-04-22T02:26:23.133637664Z goroutine 120 [running]:
2022-04-22T02:26:23.133639420Z go.etcd.io/bbolt.(*DB).freepages.func2(0xc00047e540)
2022-04-22T02:26:23.133648190Z 	/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:1056 +0xe9
2022-04-22T02:26:23.133650552Z created by go.etcd.io/bbolt.(*DB).freepages
2022-04-22T02:26:23.133652184Z 	/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:1054 +0x1cd

然后重启 etcd 也不好使,一番搜索发现是装在 vmware 里会有问题,可能导致 etcd 文件损坏,寄!

关联 issues

2593 次点击
所在节点    Kubernetes
9 条回复
rrfeng
2022-04-22 10:41:43 +08:00
用之前的 snapshot 恢复

这玩意也要定时离线备份的…
seagull7558
2022-04-22 10:43:40 +08:00
@rrfeng 老哥是指删掉 /var/lib/etcd/member/snap 文件夹下最新的.snap 文件吗?
FullBridgeRect
2022-04-22 10:57:26 +08:00
etcd 有做高可用么,达到容灾上限之前可以直接开新节点重建吧?
seagull7558
2022-04-22 11:05:26 +08:00
@FullBridgeRect 没做高可用 本来就是开发环境 为了模拟配置的 确实偷懒没做高可用 也没做备份
FullBridgeRect
2022-04-22 11:06:34 +08:00
@seagull7558 那没了啊😑
disk
2022-04-22 11:14:54 +08:00
备份下,*.wal 或*.snap 删了看好不好使,不行看看能不能从最新的.snap 恢复,再不行就没必要再试图恢复了,成本太高。
seagull7558
2022-04-22 11:29:39 +08:00
@disk 试过直接删掉 wal 和 snap 文件夹 会丢失 K8S 集群下的所有配置信息
zhaoyeye
2022-04-23 00:07:41 +08:00
@seagull7558 我前两天刚通过 snapshot 恢复了一次,反正是挺多坑的
stevefan1999
2022-07-17 09:10:31 +08:00
etcd 背後用的是 BoltDB 理論上是毋須數據恢復的 如果集羣有其他 HA master 那把有問題的節點完整地刪了 再不行找個備份

这是一个专为移动设备优化的页面(即为了让你能够在 Google 搜索结果里秒开这个页面),如果你希望参与 V2EX 社区的讨论,你可以继续到 V2EX 上打开本讨论主题的完整版本。

https://www.v2ex.com/t/848516

V2EX 是创意工作者们的社区,是一个分享自己正在做的有趣事物、交流想法,可以遇见新朋友甚至新机会的地方。

V2EX is a community of developers, designers and creative people.

© 2021 V2EX