Without HPA enabled, under what circumstances would a Deployment automatically have its replicas set to 0?

2022-05-09 15:55:46 +08:00
 rabbitz

I recently ran into a problem: a Deployment running nexus3 gets its replicas set to 0 after running for a while. HPA is not enabled, restartPolicy is set to Always, the kube-apiserver audit logs don't show any human operation, and the nexus3 logs show no errors.

deployment yaml:

kind: Deployment
apiVersion: apps/v1
metadata:
  name: service-nexus3-deployment
  namespace: service
  annotations:
    deployment.kubernetes.io/revision: '6'
spec:
  replicas: 1
  selector:
    matchLabels:
      app: service-nexus3
      envronment: test
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: service-nexus3
        envronment: test
      annotations:
        kubesphere.io/restartedAt: '2022-02-16T01:11:44.479Z'
    spec:
      volumes:
        - name: service-nexus3-volume
          persistentVolumeClaim:
            claimName: service-nexus3-pvc
        - name: docker-proxy
          configMap:
            name: docker-proxy
            defaultMode: 493
      containers:
        - name: nexus3
          # Alibaba Cloud image registry; repository name removed
          image: 'registry.cn-hangzhou.aliyuncs.com/nexus3-latest'
          ports:
            - name: tcp8081
              containerPort: 8081
              protocol: TCP
          resources:
            limits:
              cpu: '4'
              memory: 8Gi
            requests:
              cpu: 500m
              memory: 1Gi
          volumeMounts:
            - name: service-nexus3-volume
              mountPath: /data/server/nexus3/
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: Always
        - name: docker-proxy
          # Alibaba Cloud image registry; repository name removed
          image: 'registry.cn-hangzhou.aliyuncs.com/nginx-latest'
          ports:
            - name: tcp80
              containerPort: 80
              protocol: TCP
          resources:
            limits:
              cpu: '2'
              memory: 4Gi
            requests:
              cpu: 500m
              memory: 1Gi
          volumeMounts:
            - name: docker-proxy
              mountPath: /usr/local/nginx/conf/vhosts/
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: Always
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      nodeSelector:
        disktype: raid1
      securityContext: {}
      imagePullSecrets:
        - name: registrysecret
      schedulerName: default-scheduler
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600

HPA:

# kubectl get hpa -A
No resources found

deployment describe:

...
...
Events:
  Type    Reason             Age                From                   Message
  ----    ------             ----               ----                   -------
  Normal  ScalingReplicaSet  34m (x2 over 38h)  deployment-controller  Scaled down replica set service-nexus3-deployment-57995fcd76 to 0

kube-controller-manager logs:

# kubectl logs kube-controller-manager-k8s-130 -n kube-system|grep nexus
I0509 10:49:11.687356       1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"service", Name:"service-nexus3-deployment", UID:"e0c4abba-bbe5-4c19-9853-de63ee571124", APIVersion:"apps/v1", ResourceVersion:"126342143", FieldPath:""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled down replica set service-nexus3-deployment-57995fcd76 to 0
I0509 10:49:11.701642       1 event.go:281] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"service", Name:"service-nexus3-deployment-57995fcd76", UID:"9f96fdf1-1e20-4c83-ad18-1b3640d52493", APIVersion:"apps/v1", ResourceVersion:"126342151", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted pod: service-nexus3-deployment-57995fcd76-t6bhx

Relevant kube-apiserver audit logs (posted as a screenshot):

nexus3 logs:

This has happened several times now. I've gone through all the logs and still have no leads, so I'm posting here hoping someone can offer some pointers.
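A minimal sketch of one way to narrow down which client last wrote spec.replicas, assuming a reasonably recent kubectl/API server (managedFields need roughly 1.18+, and the subresource field only appears on roughly 1.22+); the jq step is just for readability:

# List the field managers recorded on the Deployment. The entry whose fieldsV1
# covers f:spec -> f:replicas shows which client last set replicas, and when;
# subresource would read "scale" if the change went through the scale subresource.
kubectl get deployment service-nexus3-deployment -n service \
  -o json --show-managed-fields \
  | jq '.metadata.managedFields[] | {manager, operation, subresource, time}'

A manager name like kubectl-scale, a CI tool, or some operator in that list would point to the client that issued the scale.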

anonydmer
2022-05-09 16:53:58 +08:00
Check whether the service is unstable and the container keeps failing and restarting.
rabbitz
2022-05-09 17:10:33 +08:00
RESTARTS had stayed at 0 the whole time before replicas went to 0.
rabbitz
2022-05-09 17:11:35 +08:00
Sorry, the image above was the wrong one; the one below is the right one.
wubowen
2022-05-09 17:38:14 +08:00
I'm a bit skeptical that the content in the audit-log screenshot proves it wasn't a human operation. Even a manual scale would ultimately be carried out by the ReplicaSet controller deleting the pods, right? Maybe search the audit logs directly for operations tied to kubeconfig users and check whether anyone scaled it manually.
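A minimal sketch of that kind of audit-log search, assuming the log is written as JSON lines and that the audit policy records requests against deployments and their scale subresource; the path /var/log/kubernetes/audit.log is a placeholder, use whatever --audit-log-path points to:

# Print who updated/patched the Deployment (including via the scale subresource) and when
jq -c 'select(.objectRef.resource == "deployments"
              and .objectRef.name == "service-nexus3-deployment"
              and (.verb == "update" or .verb == "patch"))
       | {time: .requestReceivedTimestamp,
          user: .user.username,
          verb: .verb,
          subresource: .objectRef.subresource}' \
  /var/log/kubernetes/audit.log

An entry whose user.username is a regular kubeconfig user rather than a system component would confirm a manual scale.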
defunct9
2022-05-09 17:46:28 +08:00
Open up SSH and let me take a look.
basefas
2022-05-09 17:48:48 +08:00
Monitor this Deployment's replicas, alert when the value changes, then look at the events.
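For example, a minimal Prometheus alert rule sketch, assuming kube-state-metrics is installed and scraped (the group and alert names are placeholders):

groups:
  - name: nexus3-replicas
    rules:
      - alert: Nexus3SpecReplicasZero
        # kube_deployment_spec_replicas is the desired replica count from the Deployment spec
        expr: kube_deployment_spec_replicas{namespace="service", deployment="service-nexus3-deployment"} == 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "service-nexus3-deployment spec.replicas dropped to 0"

Firing within a minute of the change also keeps you inside the default event retention window, so the scale-down event is still queryable when the alert arrives.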
hwdef
2022-05-09 17:53:01 +08:00
Hello Kitty, huh? Kind of cute.
rabbitz
2022-05-10 10:25:48 +08:00
@wubowen #4 Right, you can't really tell from that. I'll make the logging more complete while I keep monitoring.
rabbitz
2022-05-10 10:28:45 +08:00
@basefas #6 I've now added collection of all the relevant log data to the monitoring.
rabbitz
2022-05-10 10:30:03 +08:00
@defunct9 #5 Can't open SSH for a production workload.
basefas
2022-05-10 11:11:43 +08:00
@rabbitz #9 If you're already collecting and storing cluster events, you can look up the events from that time based on the details.
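A small sketch for pulling the live events tied to the Deployment; note the API server keeps events only for a limited TTL (1 hour by default), so historical lookups really do need them exported somewhere:

# Events recorded against the Deployment (ScalingReplicaSet etc.), newest last
kubectl get events -n service \
  --field-selector involvedObject.name=service-nexus3-deployment \
  --sort-by=.lastTimestamp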
rabbitz
2022-05-10 11:45:26 +08:00
@basefas #11 There's an issue with how cluster events are persisted, so I can't query historical data for now.

https://www.v2ex.com/t/851766
