kubeadm multi-master deployment question: does high availability require at least 2 masters online?

2021-11-20 22:29:43 +08:00
 caicaiwoshishui

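I've been fighting with this for a whole day. I have three master nodes in total, with keepalived providing a virtual IP and lvsf enabled. When I shut down any one of them, the other two keep working fine, but as soon as I shut down 2 of them, the service becomes unavailable.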

[root@master-1 ~]# kubectl get nodes

The connection to the server 192.168.0.8:6443 was refused - did you specify the right host or port?
[root@master-1 ~]# netstat -ntlp |grep 6443

Detailed logs

[root@master-1 ~]# docker ps -a |grep kube-api|grep -v pause
0c1c0042b8c2   53224b502ea4                                        "kube-apiserver --ad…"   About a minute ago   Exited (1) 54 seconds ago                 k8s_kube-apiserver_kube-apiserver-master-1.host.com_kube-system_464df844856c9d5461cb184edc4974c9_45
[root@master-1 ~]# docker logs -f 0c1c0042b8c2
I1120 14:25:26.120729       1 server.go:553] external host was not specified, using 192.168.0.11
I1120 14:25:26.122152       1 server.go:161] Version: v1.22.3
I1120 14:25:26.836619       1 shared_informer.go:240] Waiting for caches to sync for node_authorizer
I1120 14:25:26.838689       1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I1120 14:25:26.838721       1 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
I1120 14:25:26.840979       1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I1120 14:25:26.841003       1 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
Error: context deadline exceeded
[root@master-1 ~]# docker ps -a |grep etcd
dfd6026ae3fd   004811815584                                        "etcd --advertise-cl…"   3 minutes ago    Up 3 minutes                          k8s_etcd_etcd-master-1.host.com_kube-system_a23c864b52d59788909994fe31a97f5e_8
13c6e65046d6   004811815584                                        "etcd --advertise-cl…"   7 minutes ago    Exited (2) 3 minutes ago              k8s_etcd_etcd-master-1.host.com_kube-system_a23c864b52d59788909994fe31a97f5e_7
5ca2f134f743   registry.aliyuncs.com/google_containers/pause:3.5   "/pause"                 22 minutes ago   Up 22 minutes                         k8s_POD_etcd-master-1.host.com_kube-system_a23c864b52d59788909994fe31a97f5e_1
[root@master-1 ~]# docker logs -n 10 13c6e65046d6
{"level":"warn","ts":"2021-11-20T14:24:39.911Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"ad7fc708963cf6f3","rtt":"0s","error":"dial tcp 192.168.0.9:2380: i/o timeout"}
{"level":"warn","ts":"2021-11-20T14:24:39.915Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"c68a49f4a0c3cea9","rtt":"0s","error":"dial tcp 192.168.0.10:2380: connect: no route to host"}
{"level":"warn","ts":"2021-11-20T14:24:39.915Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"c68a49f4a0c3cea9","rtt":"0s","error":"dial tcp 192.168.0.10:2380: connect: no route to host"}
{"level":"info","ts":"2021-11-20T14:24:40.658Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"cb18584c4f4dbfc is starting a new election at term 7"}
{"level":"info","ts":"2021-11-20T14:24:40.658Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"cb18584c4f4dbfc became pre-candidate at term 7"}
{"level":"info","ts":"2021-11-20T14:24:40.658Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"cb18584c4f4dbfc received MsgPreVoteResp from cb18584c4f4dbfc at term 7"}
{"level":"info","ts":"2021-11-20T14:24:40.658Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"cb18584c4f4dbfc [logterm: 7, index: 3988] sent MsgPreVote request to ad7fc708963cf6f3 at term 7"}
{"level":"info","ts":"2021-11-20T14:24:40.658Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"cb18584c4f4dbfc [logterm: 7, index: 3988] sent MsgPreVote request to c68a49f4a0c3cea9 at term 7"}
{"level":"warn","ts":"2021-11-20T14:24:41.729Z","caller":"etcdhttp/metrics.go:166","msg":"serving /health false; no leader"}
{"level":"warn","ts":"2021-11-20T14:24:41.729Z","caller":"etcdhttp/metrics.go:78","msg":"/health error","output":"{\"health\":\"false\",\"reason\":\"RAFT NO LEADER\"}","status-code":503}

Conclusion

Did etcd fail to elect a leader? Can't a single etcd member work on its own? Any advice would be appreciated.
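The etcd log excerpt above already shows why no leader was elected: the probes to both other peers fail, so this member cannot collect a majority of votes. As a rough sketch (the function name `unreachable_peers` is made up for illustration, and the pattern only assumes the `"error":"dial tcp <ip>:<port>: ..."` shape seen in these particular log lines), the unreachable peer addresses can be pulled out of such logs like this:

```shell
#!/bin/sh
# unreachable_peers: read etcd JSON log lines on stdin and print each
# distinct peer address mentioned in a failed "dial tcp" probe.
# Hypothetical helper; it relies only on the log format shown in this thread.
unreachable_peers() {
    grep 'prober detected unhealthy status' \
        | sed -n 's/.*"error":"dial tcp \([0-9.]*:[0-9]*\):.*/\1/p' \
        | sort -u
}

# Demo with one log line copied from the post.
unreachable_peers <<'EOF'
{"level":"warn","ts":"2021-11-20T14:24:39.911Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"ad7fc708963cf6f3","rtt":"0s","error":"dial tcp 192.168.0.9:2380: i/o timeout"}
EOF
```

Against a live container it could be fed with something like `docker logs -n 10 13c6e65046d6 2>&1 | unreachable_peers`. If it lists both of the other two members, the surviving one is alone and Raft cannot elect a leader.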

11 replies
suifengdang666
2021-11-20 23:00:28 +08:00
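To avoid split-brain, etcd uses the Raft algorithm, which only serves requests while more than half of the members are online. In other words, floor(N/2)+1 members must be online to elect a leader (2 out of 3 in your case).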
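To make the majority rule concrete, here is a small sketch of the arithmetic (the function names `quorum` and `tolerated` are just illustrative, not part of etcd or kubeadm):

```shell
#!/bin/sh
# quorum N: smallest number of members that forms a majority of N,
# i.e. floor(N/2) + 1 (shell integer division already rounds down).
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# tolerated N: how many members can be offline while a majority remains.
tolerated() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
    echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated "$n")"
done
```

For 3 members this gives a quorum of 2 and one tolerated failure, which matches what the original post observed: one master down is fine, two down is not. It also shows why even member counts add little: 4 members tolerate no more failures than 3.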
cs419
2021-11-20 23:35:23 +08:00
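That is simply how a high-availability cluster is designed.
While all the nodes are alive, they take requests in turn and share the load.
Once half or more of the nodes are down, so that no majority remains, the cluster refuses service.

The reason is simple: the high-availability mechanism has been broken.
If the cluster refuses service at this point, it can work normally again once you repair the nodes.

But if it kept serving, and the requests then overwhelmed the remaining nodes,
the data could no longer be fully repaired.

If you want a single node to be usable, start with a single node from the beginning and don't create a cluster.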
limao693
2021-11-20 23:49:04 +08:00
Raft works normally as long as a majority of the members is up.
chih758
2021-11-21 01:15:30 +08:00
In a test environment, run etcdctl member remove to delete two members from the cluster, and then it can run as a single node.
caicaiwoshishui
2021-11-21 10:08:46 +08:00
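@cs419 Thanks a lot. A follow-up question: if more than half of the nodes are down and a restart cannot bring them back, can new machines be added to the cluster? The problem is that kubectl no longer works and kubeadm cannot reach the master either, so how would I go about it?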
caicaiwoshishui
2021-11-21 10:13:47 +08:00
@chih758 I just tested this. I shut down 2 machines, leaving one, and used docker exec -it to get a shell inside the etcd container.

Configure the etcdctl certificates
sh-5.0# export ETCDCTL_API=3
sh-5.0# alias etcdctl='etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key'
sh-5.0# etcdctl member list

Run
sh-5.0# etcdctl member list

{"level":"warn","ts":"2021-11-21T02:11:18.722Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003a4700/#initially=[https://127.0.0.1:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded

It times out. In other words, etcd on the one remaining machine times out, and its Docker container exits.
caicaiwoshishui
2021-11-21 11:09:22 +08:00
@suifengdang666 In production, for a k8s cluster created with kubeadm, is etcd split out into its own cluster, or do you use the etcd that kubeadm deploys?
suifengdang666
2021-11-21 21:56:12 +08:00
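@caicaiwoshishui The one kubeadm creates is fine. If you are worried that high load on the masters could make etcd misbehave, you can set up a separate etcd cluster on a few dedicated VMs.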
pmispig
2021-11-21 23:41:28 +08:00
Deploy etcd and kube-apiserver on separate servers.
0x208
2021-12-13 16:45:28 +08:00
OP, are you looking for a job? Take a look at my hiring post.
caicaiwoshishui
2021-12-13 21:53:34 +08:00
@0x208 Is remote work possible? I'm not in Beijing.

https://www.v2ex.com/t/816843