1. Background

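Troubleshooting a k8s cluster can be a real headache.

Today a colleague came to me about our cluster: nodes were reporting "PLEG is not healthy" and some of them had gone NotReady. What causes this?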

2. Kubernetes source code analysis

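"PLEG is not healthy" is a problem that comes up fairly often.

PLEG stands for Pod Lifecycle Event Generator.

First, note that the PLEG code lives in the kubelet. Here is the comment from the kubelet source: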

// GenericPLEG is an extremely simple generic PLEG that relies solely on
// periodic listing to discover container changes. It should be used
// as temporary replacement for container runtimes do not support a proper
// event generator yet.
//
// Note that GenericPLEG assumes that a container would not be created,
// terminated, and garbage collected within one relist period. If such an
// incident happens, GenenricPLEG would miss all events regarding this
// container. In the case of relisting failure, the window may become longer.
// Note that this assumption is not unique -- many kubelet internal components
// rely on terminated containers as tombstones for bookkeeping purposes. The
// garbage collector is implemented to work with such situations. However, to
// guarantee that kubelet can handle missing container events, it is
// recommended to set the relist period short and have an auxiliary, longer
// periodic sync in kubelet as the safety net.
type GenericPLEG struct {
	// The period for relisting.
	relistPeriod time.Duration
	// The container runtime.
	runtime kubecontainer.Runtime
	// The channel from which the subscriber listens events.
	eventChannel chan *PodLifecycleEvent
	// The internal cache for pod/container information.
	podRecords podRecords
	// Time of the last relisting.
	relistTime atomic.Value
	// Cache for storing the runtime states required for syncing pods.
	cache kubecontainer.Cache
	// For testability.
	clock clock.Clock
	// Pods that failed to have their status retrieved during a relist. These pods will be
	// retried during the next relisting.
	podsToReinspect map[types.UID]*kubecontainer.Pod
}

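In other words, the kubelet periodically pulls the list of pods and records the result.

Once started, it runs a periodic task that calls the relist function at a fixed interval: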
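To make "discover container changes by periodic listing" concrete, here is a minimal sketch of how two consecutive snapshots could be compared to produce events. It is not the kubelet's implementation: the State and Event types and the diff function are hypothetical simplifications, and the event names are only modeled on the kubelet's.

```go
package main

import (
	"fmt"
	"sort"
)

// State is a simplified container state (hypothetical; the kubelet tracks more).
type State string

const (
	Running State = "running"
	Exited  State = "exited"
)

// Event is a hypothetical, simplified stand-in for PodLifecycleEvent.
type Event struct {
	ContainerID string
	Type        string
}

// diff compares the previous snapshot with the current one and returns
// one event per container whose state changed, sorted by container ID.
func diff(old, cur map[string]State) []Event {
	var events []Event
	for id, s := range cur {
		if old[id] == s {
			continue // unchanged since the last relist
		}
		switch s {
		case Running:
			events = append(events, Event{id, "ContainerStarted"})
		case Exited:
			events = append(events, Event{id, "ContainerDied"})
		}
	}
	// Containers that disappeared between the two listings.
	for id := range old {
		if _, ok := cur[id]; !ok {
			events = append(events, Event{id, "ContainerRemoved"})
		}
	}
	sort.Slice(events, func(i, j int) bool {
		return events[i].ContainerID < events[j].ContainerID
	})
	return events
}

func main() {
	old := map[string]State{"a": Running, "b": Running}
	cur := map[string]State{"a": Running, "b": Exited, "c": Running}
	for _, e := range diff(old, cur) {
		fmt.Println(e.ContainerID, e.Type)
	}
}
```

The sketch also shows the limitation the source comment warns about: a container that is created, exits, and is garbage collected between two snapshots never appears in either map, so no event is produced for it.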

// Start spawns a goroutine to relist periodically.
func (g *GenericPLEG) Start() {
	go wait.Until(g.relist, g.relistPeriod, wait.NeverStop)
}

The key code in the relist function:

// Get all the pods.
podList, err := g.runtime.GetPods(true)
if err != nil {
	klog.ErrorS(err, "GenericPLEG: Unable to retrieve pods")
	return
}

g.updateRelistTime(timestamp)

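We can see that the kubelet periodically goes through the CRI, over docker.sock or containerd.sock, to pull the pod list from the container runtime. Only when that call succeeds does it update the relist time; on error it logs and returns early.

Next, look at the Healthy function, the health-check handler that is called periodically: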

// Healthy check if PLEG work properly.
// relistThreshold is the maximum interval between two relist.
func (g *GenericPLEG) Healthy() (bool, error) {
	relistTime := g.getRelistTime()
	if relistTime.IsZero() {
		return false, fmt.Errorf("pleg has yet to be successful")
	}
	// Expose as metric so you can alert on `time()-pleg_last_seen_seconds > nn`
	metrics.PLEGLastSeen.Set(float64(relistTime.Unix()))
	elapsed := g.clock.Since(relistTime)
	if elapsed > relistThreshold {
		return false, fmt.Errorf("pleg was last seen active %v ago; threshold is %v", elapsed, relistThreshold)
	}
	return true, nil
}

It subtracts the last relist time from the current time. If the elapsed time exceeds relistThreshold, PLEG is reported as unhealthy.

// The threshold needs to be greater than the relisting period + the
// relisting time, which can vary significantly. Set a conservative
// threshold to avoid flipping between healthy and unhealthy.
relistThreshold = 3 * time.Minute

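Thinking this through, we can narrow the problem down to the CRI, that is, the container runtime.

3. Pinning down the error

The root cause is the container runtime timing out, which means dockerd or containerd is in trouble. On the affected machine, the kubelet log is full of errors showing the CRI timing out or being unavailable: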

Nov 02 13:41:43 app04 kubelet[8411]: E1102 13:41:43.111882 8411 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = Cannot connect to the
Nov 02 13:41:44 app04 kubelet[8411]: E1102 13:41:44.036729 8411 kubelet.go:2396] "Container runtime not ready" runtimeReady="RuntimeReady=false reason:DockerDaemonNotReady messag
Nov 02 13:41:44 app04 kubelet[8411]: E1102 13:41:44.112993 8411 remote_runtime.go:351] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unknown des
Nov 02 13:41:44 app04 kubelet[8411]: E1102 13:41:44.113027 8411 kuberuntime_sandbox.go:285] "Failed to list pod sandboxes" err="rpc error: code = Unknown desc = Cannot connect to
Nov 02 13:41:44 app04 kubelet[8411]: E1102 13:41:44.113041 8411 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = Cannot connect to the
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.114281 8411 remote_runtime.go:351] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unknown des
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.114319 8411 kuberuntime_sandbox.go:285] "Failed to list pod sandboxes" err="rpc error: code = Unknown desc = Cannot connect to
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.114335 8411 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = Cannot connect to the
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.344912 8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.345214 8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.345501 8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.630715 8411 kubelet.go:2040] "Skipping pod synchronization" err="[container runtime is down, PLEG is not healthy: pleg was las
Nov 02 13:41:46 app04 kubelet[8411]: E1102 13:41:46.115226 8411 remote_runtime.go:351] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unknown des
Nov 02 13:41:46 app04 kubelet[8411]: E1102 13:41:46.115265 8411 kuberuntime_sandbox.go:285] "Failed to list pod sandboxes" err="rpc error: code = Unknown desc = Cannot connect to
Nov 02 13:41:46 app04 kubelet[8411]: E1102 13:41:46.115280 8411 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = Cannot connect to the
Nov 02 13:41:47 app04 kubelet[8411]: E1102 13:41:47.116608 8411 remote_runtime.go:351] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unknown des
Nov 02 13:41:47 app04 kubelet[8411]: E1102 13:41:47.116647 8411 kuberuntime_sandbox.go:285] "Failed to list pod sandboxes" err="rpc error: code = Unknown desc = Cannot connect to
Nov 02 13:41:47 app04 kubelet[8411]: E1102 13:41:47.116667 8411 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = Cannot connect to the
Nov 02 13:41:48 app04 kubelet[8411]: E1102 13:41:48.081612 8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:48 app04 kubelet[8411]: E1102 13:41:48.081611 8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:48 app04 kubelet[8411]: E1102 13:41:48.082134 8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:48 app04 kubelet[8411]: E1102 13:41:48.082201 8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:48 app04 kubelet[8411]: E1102 13:41:48.082378 8411 remote_runtime.go:6

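The options are to restart the runtime, or to investigate the runtime itself (dockerd or containerd). Here, the dockerd log shows: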

Nov 02 12:58:45 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:46 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:47 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:48 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:49 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:50 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:51 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:52 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:53 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:54 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:55 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:56 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:57 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:58 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:59 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:59:00 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:59:01 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:59:02 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:59:03 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:59:04 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s

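So the runtime's server side (dockerd here) has run out of file descriptors: with too many sockets and files open, accept fails with "too many open files". Raising the daemon's open-file limit (ulimit -n) to a suitable value addresses this.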
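To check how close a process is to its open-file limit, a small Linux-only sketch along these lines can help. It inspects the process that runs it ("self"), not dockerd. The nearLimit and countFDs helpers and the 0.9 warning fraction are illustrative choices made for this example. Pointing countFDs at another process's PID generally requires sufficient privileges, and that process's own limit is listed in /proc/<pid>/limits rather than returned by Getrlimit.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// nearLimit reports whether the number of open descriptors has reached
// the given fraction (for example 0.9) of the soft limit.
func nearLimit(open, limit uint64, fraction float64) bool {
	if limit == 0 {
		return true // no descriptors allowed at all
	}
	return float64(open) >= fraction*float64(limit)
}

// countFDs counts the entries in /proc/<pid>/fd (Linux only).
func countFDs(pid string) (uint64, error) {
	entries, err := os.ReadDir("/proc/" + pid + "/fd")
	if err != nil {
		return 0, err
	}
	return uint64(len(entries)), nil
}

func main() {
	// RLIMIT_NOFILE is the limit that `ulimit -n` shows and sets.
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		fmt.Println("getrlimit:", err)
		return
	}
	open, err := countFDs("self")
	if err != nil {
		fmt.Println("count fds:", err)
		return
	}
	fmt.Printf("open=%d soft=%d hard=%d near=%v\n",
		open, lim.Cur, lim.Max, nearLimit(open, uint64(lim.Cur), 0.9))
}
```

If the count keeps climbing toward the limit over time, raising the limit only buys time, so it is worth finding out what is holding the descriptors open as well.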
