[Cluster] K8s Cluster Setup Notes: Troubleshooting "xxx:6443 was refused"
💡Introduction
While building a VM-based k8s-on-k8s setup, the underlying cluster broke. This post records the troubleshooting and recovery process.
The root cause: while setting up the upper-layer k8s inside the VM, the lower-layer cluster's configuration files were modified.
Going forward, the VM should use its own directories rather than sharing directories with the underlying host.
🖼️Background
Every kubectl command against the cluster failed with an error:
```bash
kubectl get pods
The connection to the server xxx.xxx.xxx.xxx:6443 was refused - did you specify the right host or port?
```
(where xxx.xxx.xxx.xxx is the server's IP address, redacted here)
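Before tracing further, it helps to confirm which API endpoint kubectl is actually talking to. A minimal check using standard kubectl commands (my addition, not from the original post):

```bash
# Print the API server address from the active kubeconfig context;
# it should be the https://xxx.xxx.xxx.xxx:6443 endpoint that is refusing connections
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'
```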
🧠Approach
- First, trace the symptom back upstream to find the real problem.
- Then work out how to fix that problem.
⛳️Root-Cause Tracing
🤔Question 1) Which service has crashed? (What listens on port 6443?)
- According to reference [1], port 6443 is served by the Kubernetes API Server: by default, the Kubernetes API server listens on port 6443 on the first non-localhost network interface, protected by TLS.
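On the control-plane node itself, a quick way to see whether anything is actually listening on 6443 (a hedged check with standard tools, not part of the original post):

```bash
# On a healthy control plane this shows a kube-apiserver process bound to :6443;
# an empty result is consistent with "connection refused"
ss -lntp | grep 6443
```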
🤔Question 2) Why did it crash?
Work bottom-up: first check whether the basic services (docker and kubelet) are running normally, then check whether the upper-layer k8s components running on docker, such as the api-server, are healthy.
Basic service status
`systemctl status ${SERVER_NAME}` shows the status of the service named SERVER_NAME, and `journalctl -fu ${SERVER_NAME}` follows that service's logs.
- docker
```bash
systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: active (running) since 一 2023-10-30 16:32:15 CST; 19h ago
     Docs: https://docs.docker.com
 Main PID: 1097921 (dockerd)
    Tasks: 17
   Memory: 50.2M
   CGroup: /system.slice/docker.service
           └─1097921 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
```

```bash
journalctl -fu docker
...
Oct 31 11:57:22 xx-xxx-1 dockerd[1097921]: time="2023-10-31T11:57:22.087791624+08:00" level=info msg="ignoring event" container=1063c7297c759788eaa79e92aade6c96ab23d6156b10982ea13b38fc5734afae module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Oct 31 11:58:21 xx-xxx-1 dockerd[1097921]: time="2023-10-31T11:58:21.993922952+08:00" level=info msg="ignoring event" container=781c78be1d7f785fbb02bca8a148e4dc75a70054f41a24f51cb62095bb0c9d79 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
...
```

- kubelet
```bash
systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since 一 2023-10-30 16:32:22 CST; 19h ago
     Docs: https://kubernetes.io/docs/
 Main PID: 1098656 (kubelet)
    Tasks: 16
   Memory: 45.2M
   CGroup: /system.slice/kubelet.service
           └─1098656 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/...
```

```bash
journalctl -fu kubelet
...
Oct 31 11:58:33 xx-xx-1 kubelet[1098656]: E1031 11:58:33.666668 1098656 kubelet.go:2263] node "xx-xx-1" not found
...
Oct 31 11:58:33 xx-xx-1 kubelet[1098656]: I1031 11:58:33.723526 1098656 scope.go:111] [topologymanager] RemoveContainer - Container ID: bebfb084373c836501ce9c3791089c0ef0945b50bc0364f1e5bf83853b70c64e
Oct 31 11:58:33 xx-xx-1 kubelet[1098656]: E1031 11:58:33.723897 1098656 pod_workers.go:191] Error syncing pod c560f766bceb51cbaca458fd334576d0 ("etcd-xx-xx-1_kube-system(c560f766bceb51cbaca458fd334576d0)"), skipping: failed to "StartContainer" for "etcd" with CrashLoopBackOff: "back-off 5m0s restarting failed container=etcd pod=etcd-xx-xx-1_kube-system(c560f766bceb51cbaca458fd334576d0)"
...
Oct 31 11:58:34 xx-xx-1 kubelet[1098656]: I1031 11:58:34.171287 1098656 kubelet_node_status.go:71] Attempting to register node xx-xx-1
Oct 31 11:58:34 xx-xx-1 kubelet[1098656]: E1031 11:58:34.210170 1098656 kubelet_node_status.go:93] Unable to register node "xx-xx-1" with API server: Post "https://10.244.3.3:6443/api/v1/nodes": dial tcp 10.244.3.3:6443: connect: connection refused
...
Oct 31 11:58:37 xx-xx-1 kubelet[1098656]: E1031 11:58:37.409610 1098656 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://10.244.3.3:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/xx-xx-1?timeout=10s": dial tcp 10.244.3.3:6443: connect: connection refused
...
Oct 31 11:58:38 xx-xx-1 kubelet[1098656]: E1031 11:58:38.779304 1098656 eviction_manager.go:260] eviction manager: failed to get summary stats: failed to get node info: node "xx-xx-1" not found
...
Oct 31 11:58:39 xx-xx-1 kubelet[1098656]: E1031 11:58:39.037767 1098656 pod_workers.go:191] Error syncing pod c64e8855ca75d2d53d6ae0abfc5b2e24 ("kube-controller-manager-xx-xx-1_kube-system(c64e8855ca75d2d53d6ae0abfc5b2e24)"), skipping: failed to "StartContainer" for "kube-controller-manager" with CrashLoopBackOff: "back-off 5m0s restarting failed container=kube-controller-manager pod=kube-controller-manager-xx-xx-1_kube-system(c64e8855ca75d2d53d6ae0abfc5b2e24)"
Oct 31 11:58:39 xx-xx-1 kubelet[1098656]: I1031 11:58:39.068118 1098656 scope.go:111] [topologymanager] RemoveContainer - Container ID: ca74b5519cfe860b04885270fe4b820c6ac2448f1bb0365a6ee8b8e15ae8fcdc
Oct 31 11:58:39 xx-xx-1 kubelet[1098656]: E1031 11:58:39.070622 1098656 remote_runtime.go:294] RemoveContainer "ca74b5519cfe860b04885270fe4b820c6ac2448f1bb0365a6ee8b8e15ae8fcdc" from runtime service failed: rpc error: code = Unknown desc = failed to get container "ca74b5519cfe860b04885270fe4b820c6ac2448f1bb0365a6ee8b8e15ae8fcdc" log path: failed to inspect container "ca74b5519cfe860b04885270fe4b820c6ac2448f1bb0365a6ee8b8e15ae8fcdc": Error: No such container: ca74b5519cfe860b04885270fe4b820c6ac2448f1bb0365a6ee8b8e15ae8fcdc
Oct 31 11:58:39 xx-xx-1 kubelet[1098656]: E1031 11:58:39.070645 1098656 kuberuntime_gc.go:146] Failed to remove container "ca74b5519cfe860b04885270fe4b820c6ac2448f1bb0365a6ee8b8e15ae8fcdc": rpc error: code = Unknown desc = failed to get container "ca74b5519cfe860b04885270fe4b820c6ac2448f1bb0365a6ee8b8e15ae8fcdc" log path: failed to inspect container "ca74b5519cfe860b04885270fe4b820c6ac2448f1bb0365a6ee8b8e15ae8fcdc": Error: No such container: ca74b5519cfe860b04885270fe4b820c6ac2448f1bb0365a6ee8b8e15ae8fcdc
Oct 31 11:58:39 xx-xx-1 kubelet[1098656]: W1031 11:58:39.072016 1098656 status_manager.go:550] Failed to get status for pod "kube-controller-manager-xx-xx-1_kube-system(c64e8855ca75d2d53d6ae0abfc5b2e24)": Get "https://10.244.3.3:6443/api/v1/namespaces/kube-system/pods/kube-controller-manager-xx-xx-1": dial tcp 10.244.3.3:6443: connect: connection refused
...
```
- From the above, the basic services themselves are running normally, but the kubelet likewise cannot connect to the corresponding k8s API service (dial tcp 10.244.3.3:6443: connection refused).
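As a cross-check, the advertised endpoint can be probed directly from the node. A small sketch (my addition, not from the original post):

```bash
# Probe the API endpoint the kubelet is trying to reach; at this point it fails with
# "curl: (7) Failed to connect to 10.244.3.3 port 6443: Connection refused"
curl -k https://10.244.3.3:6443/healthz
```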
Upper-layer k8s service status
`docker ps -a` lists all containers on Docker and their status, and `docker logs ${CONTAINER_NAME}` shows the logs of the container named CONTAINER_NAME.
- Overall container status on docker
```bash
docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
6e0350f7b16b 7a5d9d67a13f "kube-scheduler --au…" 24 seconds ago Exited (1) 23 seconds ago k8s_kube-scheduler_kube-scheduler-xx-xx-1_kube-system_472d87a90d4ce654500e8615a31c0f8b_242
fc786f3197c6 cdcab12b2dd1 "kube-apiserver --ad…" About a minute ago Exited (1) About a minute ago k8s_kube-apiserver_kube-apiserver-xx-xx-1_kube-system_e5cc1328f715a4eba870289759ed1c1f_242
39e716367013 0369cf4303ff "etcd --advertise-cl…" 5 minutes ago Exited (1) 5 minutes ago k8s_etcd_etcd-xx-xx-1_kube-system_c560f766bceb51cbaca458fd334576d0_241
ad141f2d1721 55f13c92defb "kube-controller-man…" 5 minutes ago Exited (1) 5 minutes ago k8s_kube-controller-manager_kube-controller-manager-xx-xx-1_kube-system_c64e8855ca75d2d53d6ae0abfc5b2e24_241
41d5994c0b93 registry.aliyuncs.com/google_containers/pause:3.2 "/pause" 20 hours ago Up 20 hours k8s_POD_kube-controller-manager-xx-xx-1_kube-system_c64e8855ca75d2d53d6ae0abfc5b2e24_1
7ff1e80d9ddb registry.aliyuncs.com/google_containers/pause:3.2 "/pause" 20 hours ago Up 20 hours k8s_POD_kube-scheduler-xx-xx-1_kube-system_472d87a90d4ce654500e8615a31c0f8b_1
6370376fe05b registry.aliyuncs.com/google_containers/pause:3.2 "/pause" 20 hours ago Up 20 hours k8s_POD_etcd-xx-xx-1_kube-system_c560f766bceb51cbaca458fd334576d0_1
abce82c8bd1e registry.aliyuncs.com/google_containers/pause:3.2 "/pause" 20 hours ago Up 20 hours k8s_POD_kube-apiserver-xx-xx-1_kube-system_e5cc1328f715a4eba870289759ed1c1f_1
e6751f222a36 ubuntu "/sbin/init" 23 hours ago Up 20 hours strange_wu
```

- Detailed kube-apiserver logs
```bash
docker logs fc786f3197c6
Error: unknown flag: --insecure-port
```
- From this we can conclude that the problem lies in the components' startup flags: the running kube-apiserver rejects `--insecure-port` as an unknown flag.
🏹Searching for a Fix
The problem has been narrowed down to incorrect k8s startup flags, so the next step is to look up how those flags can be changed.
According to reference [2], in a cluster installed with kubeadm the Kubernetes control-plane components are started as static pods, whose YAML manifests live under /etc/kubernetes/manifests/. So check whether these files contain the problem.
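A quick way to scan all static-pod manifests at once for the rejected flag and for unexpected IPs (a suggestion of mine, not from the original post):

```bash
# Search every static-pod manifest for the unknown flag and the suspicious address
grep -rn -e "insecure-port" -e "10.244.3.3" /etc/kubernetes/manifests/
```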
```bash
vim /etc/kubernetes/manifests/kube-apiserver.yaml
```

```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubeadm.kubernetes.io/kube-apiserver.advertise-address.endpoint: 10.244.3.3:6443
  creationTimestamp: null
  labels:
    component: kube-apiserver
    tier: control-plane
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-apiserver
    - --advertise-address=10.244.3.3
    - --allow-privileged=true
    - --authorization-mode=Node,RBAC
    - --client-ca-file=/etc/kubernetes/pki/ca.crt
    - --enable-admission-plugins=NodeRestriction
    - --enable-bootstrap-token-auth=true
    - --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt
    - --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt
    - --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key
    - --etcd-servers=https://127.0.0.1:2379
    - --insecure-port=0
    - --kubelet-client-certificate=/etc/kubernetes/pki/apiserver-kubelet-client.crt
    - --kubelet-client-key=/etc/kubernetes/pki/apiserver-kubelet-client.key
    - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
    - --proxy-client-cert-file=/etc/kubernetes/pki/front-proxy-client.crt
    - --proxy-client-key-file=/etc/kubernetes/pki/front-proxy-client.key
    - --requestheader-allowed-names=front-proxy-client
    - --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
    - --requestheader-extra-headers-prefix=X-Remote-Extra-
    - --requestheader-group-headers=X-Remote-Group
    - --requestheader-username-headers=X-Remote-User
    - --secure-port=6443
    - --service-account-issuer=https://kubernetes.default.svc.cluster.local
    - --service-account-key-file=/etc/kubernetes/pki/sa.pub
    - --service-account-signing-key-file=/etc/kubernetes/pki/sa.key
    - --service-cluster-ip-range=10.96.0.0/16
    - --tls-cert-file=/etc/kubernetes/pki/apiserver.crt
    - --tls-private-key-file=/etc/kubernetes/pki/apiserver.key
    image: registry.aliyuncs.com/google_containers/kube-apiserver:v1.28.2
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 10.244.3.3
        path: /livez
        port: 6443
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: kube-apiserver
    readinessProbe:
      failureThreshold: 3
      httpGet:
        host: 10.244.3.3
        path: /readyz
        port: 6443
        scheme: HTTPS
      periodSeconds: 1
      timeoutSeconds: 15
    resources:
      requests:
        cpu: 250m
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 10.244.3.3
        path: /livez
        port: 6443
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/ssl/certs
      name: ca-certs
      readOnly: true
    - mountPath: /etc/pki
      name: etc-pki
      readOnly: true
    - mountPath: /etc/kubernetes/pki
      name: k8s-certs
      readOnly: true
  hostNetwork: true
  priorityClassName: system-node-critical
  volumes:
  - hostPath:
      path: /etc/ssl/certs
      type: DirectoryOrCreate
    name: ca-certs
  - hostPath:
      path: /etc/pki
      type: DirectoryOrCreate
    name: etc-pki
  - hostPath:
      path: /etc/kubernetes/pki
      type: DirectoryOrCreate
    name: k8s-certs
status: {}
```

Notice that all the IPs in this file have been changed to 10.244.3.3. After some discussion it turned out that this IP belongs to a virtual machine running on the server, so the likely cause is that the file was modified from inside the VM.
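One way to support this hypothesis is to compare the manifests' modification times with when the VM-side setup was run. A small check with standard tools (my assumption, not in the original post):

```bash
# Show the last-modified time of each static-pod manifest (GNU stat)
stat -c '%y %n' /etc/kubernetes/manifests/*.yaml
```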
🔨Fix
- ❌ Change the IPs in the file back to the host's own IP (and make the same change in the manifests of the other components: etcd, kube-controller-manager and kube-scheduler); a sketch of this edit is shown below.
- This did not work, possibly because more than just these files were corrupted, so the cluster was reset and rebuilt instead.
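A minimal sketch of what this first attempt looked like, assuming the host IP is xxx.xxx.xxx.xxx (my reconstruction, not the author's exact commands):

```bash
# Hypothetical: put the host IP back into all static-pod manifests;
# the kubelet re-reads static-pod manifests automatically
sed -i 's/10\.244\.3\.3/xxx.xxx.xxx.xxx/g' /etc/kubernetes/manifests/*.yaml
```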
- master node
```bash
# Reset the cluster
kubeadm reset
[reset] Reading configuration from the cluster...
[reset] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
W1031 12:34:20.990883 2223588 reset.go:99] [reset] Unable to fetch the kubeadm-config ConfigMap from cluster: failed to get config map: Get "https://10.244.3.3:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config?timeout=10s": dial tcp 10.244.3.3:6443: connect: connection refused
[reset] WARNING: Changes made to this host by 'kubeadm init' or 'kubeadm join' will be reverted.
[reset] Are you sure you want to proceed? [y/N]: y
[preflight] Running pre-flight checks
W1031 12:34:25.107252 2223588 removeetcdmember.go:79] [reset] No kubeadm config, using etcd pod spec to get data directory
[reset] Stopping the kubelet service
[reset] Unmounting mounted directories in "/var/lib/kubelet"
[reset] Deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]
[reset] Deleting contents of stateful directories: [/var/lib/etcd /var/lib/kubelet /var/lib/dockershim /var/run/kubernetes /var/lib/cni]
The reset process does not clean CNI configuration. To do so, you must remove /etc/cni/net.d
The reset process does not reset or clean up iptables rules or IPVS tables.
If you wish to reset iptables, you must do so manually by using the "iptables" command.
If your cluster was setup to utilize IPVS, run ipvsadm --clear (or similar)
to reset your system's IPVS tables.
The reset process does not clean your kubeconfig files and you must remove them manually.
Please, check the contents of the $HOME/.kube/config file.
```

Then clear the leftover configuration, and fetch and modify the kubeadm-config.yaml before re-initializing the cluster. (The exact commands for these two steps are collapsed in the original post; a hypothetical reconstruction is sketched after the network-plugin note below.)
For the network plugin, one option is to follow reference [4] and use flannel.
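A hypothetical reconstruction of what the collapsed master-node steps typically look like (an assumption of mine, not the author's exact commands):

```bash
# Hypothetical master-node re-init after kubeadm reset
rm -rf /root/.kube /etc/cni/net.d          # clear leftover config
kubeadm init --config kubeadm-config.yaml  # re-create the control plane from a saved config
mkdir -p $HOME/.kube
cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
```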
worker node
```bash
kubeadm reset
# Clear configuration
rm -rf /root/.kube
rm -rf /etc/cni/net.d
rm -rf /etc/kubernetes/*
yum install -y ipvsadm
ipvsadm -C
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
kubeadm join xxx.xxx.xxx.xxx:6443 --token xxx \
    --discovery-token-ca-cert-hash sha256:xxx
```

- Side issue: on a worker node hosted by a different cloud provider, the flannel pod fails to start
Error:

```bash
I1031 13:24:29.544239 1 main.go:212] CLI flags config: {etcdEndpoints:http://127.0.0.1:4001,http://127.0.0.1:2379 etcdPrefix:/coreos.com/network etcdKeyfile: etcdCertfile: etcdCAFile: etcdUsername: etcdPassword: version:false kubeSubnetMgr:true kubeApiUrl: kubeAnnotationPrefix:flannel.alpha.coreos.com kubeConfigFile: iface:[eth0] ifaceRegex:[] ipMasq:true ifaceCanReach: subnetFile:/run/flannel/subnet.env publicIP:116.63.136.133 publicIPv6: subnetLeaseRenewMargin:60 healthzIP:0.0.0.0 healthzPort:0 iptablesResyncSeconds:5 iptablesForwardRules:true netConfPath:/etc/kube-flannel/net-conf.json setNodeNetworkUnavailable:true useMultiClusterCidr:false}
W1031 13:24:29.544342 1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
E1031 13:24:59.574617 1 main.go:229] Failed to create SubnetManager: error retrieving pod spec for 'kube-flannel/kube-flannel-ds-p95w2': Get "https://10.96.0.1:443/api/v1/namespaces/kube-flannel/pods/kube-flannel-ds-p95w2": dial tcp 10.96.0.1:443: i/o timeout
```
Solution: according to reference [5], add the following entries to the `env` of the DaemonSet `kube-flannel-ds` in the flannel.yaml file, then re-create flannel.

```yaml
- name: KUBERNETES_SERVICE_HOST
  value: "xxx.xxx.xxx.xxx" # IP address of the host where kube-apiserver is running
- name: KUBERNETES_SERVICE_PORT
  value: "6443"
```
🏥Reflections
- Based on recent experience, both docker containers and kvm virtual machines used as "pods" are prone to configuration conflicts between the upper and lower layers. It is worth digging further into the underlying implementations of these virtualization technologies: why exactly do such conflicts arise, can they be avoided, and are they an inherent limitation of the virtualization technology or just a flaw in the engineering?
- I hope this post helps! If you have any questions or need further help, feel free to ask.
- If you liked this article, a follow or star would be much appreciated.
🗺References
[2] Modifying Kubernetes apiserver startup parameters
- Title: [Cluster] K8s Cluster Setup Notes: Troubleshooting "xxx:6443 was refused"
- Author: Fre5h1nd
- Created: 2023-10-31 10:44:20
- Updated: 2024-03-08 15:37:07
- Link: https://freshwlnd.github.io/2023/10/31/k8s/k8s-apiserver/
- License: This article is licensed under CC BY-NC-SA 4.0.