Don't Let Your Kubelet Go Dark: Proactive TLS Certificate Rotation Alerts
You've built a robust Kubernetes cluster, your applications are humming, and everything feels stable. Then, without warning, a node drops to NotReady. Pods fail to schedule. kubectl logs and exec commands start returning cryptic errors. What happened? More often than not, a silent killer is at play: an expired TLS certificate on your kubelet.
The kubelet is the agent that runs on each node in your Kubernetes cluster, ensuring containers are running in a pod. It communicates with the Kubernetes API server, manages pods, reports node status, and serves metrics. For these critical operations, secure communication is paramount, and that means TLS certificates. When these certificates expire, your node effectively loses its voice, leading to cluster instability and potential outages.
This isn't just a theoretical problem; it's a common pitfall in Kubernetes operations. In this article, we'll dive into why kubelet certificates are so important, how to identify and prevent their expiry, and how proactive monitoring can save you from a production-crippling incident.
Understanding Kubelet TLS Certificates
The kubelet uses two primary types of TLS certificates:
- Client Certificates: The kubelet uses these to authenticate itself to the Kubernetes API server. Whenever the kubelet reports its status, requests pod configurations, or updates resource usage, it presents this client certificate. These certificates are typically signed by the cluster's Certificate Authority (CA). The Common Name (CN) identifies the node (e.g., system:node:<node-name>), and the Organization (O) field carries its group membership (system:nodes).
- Serving Certificates: The kubelet itself exposes an HTTPS endpoint (usually on port 10250) for various purposes, including:
  - Serving metrics (e.g., /metrics, /stats/summary).
  - Providing logs for pods (/containerLogs).
  - Handling kubectl exec, attach, and port-forward requests.
  - Health checks (/healthz).
  This endpoint requires a serving certificate to establish a secure connection. These can be self-signed, signed by a custom CA, or even by the cluster's CA, depending on your cluster's configuration and how it was provisioned.
Where do these certificates live?
On most Linux-based Kubernetes nodes, you'll find these certificates and their corresponding private keys in a directory like /var/lib/kubelet/pki/. The exact paths can vary based on your Kubernetes distribution and configuration, but common filenames include kubelet-client-current.pem for the client certificate, and kubelet.crt (the self-signed default) or kubelet-server-current.pem (when serving certificate rotation is enabled) for the serving certificate.
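To see how much lifetime a certificate file has left, you can read its expiry date directly on the node. The following is a minimal sketch, assuming the openssl CLI is installed and that the file lives at a path such as /var/lib/kubelet/pki/kubelet-client-current.pem. The helper names (parse_not_after, days_remaining, cert_days_remaining) are made up for illustration.

```python
import ssl
import subprocess
import time
from typing import Optional


def parse_not_after(openssl_output: str) -> float:
    """Convert a line like 'notAfter=Jan  1 00:00:00 2030 GMT' to epoch seconds."""
    _, _, date_text = openssl_output.strip().partition("=")
    return float(ssl.cert_time_to_seconds(date_text))


def days_remaining(not_after: float, now: Optional[float] = None) -> float:
    """Days until expiry; negative if the certificate has already expired."""
    current = time.time() if now is None else now
    return (not_after - current) / 86400


def cert_days_remaining(path: str) -> float:
    """Ask the openssl CLI for the expiry date of the first certificate in `path`."""
    result = subprocess.run(
        ["openssl", "x509", "-noout", "-enddate", "-in", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return days_remaining(parse_not_after(result.stdout))
```

Calling cert_days_remaining("/var/lib/kubelet/pki/kubelet-client-current.pem") on a node would return the number of days left, adjusted for whatever path your distribution actually uses.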
How are they managed?
Modern Kubernetes clusters, especially those provisioned with tools like kubeadm or managed by cloud providers (EKS, GKE, AKS), heavily automate the management and rotation of these certificates. They often leverage the Kubernetes Certificate Signing Request (CSR) API. The kubelet can generate a CSR, submit it to the API server, and if approved by a controller, retrieve a signed certificate. This process is designed to be self-healing and reduce manual intervention.
However, "designed to be self-healing" doesn't mean "always self-healing."
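One place where rotation can stall is a CSR that is submitted but never approved, which in some setups happens with serving certificate requests. Below is a rough sketch of how you might list such requests, assuming kubectl is configured for the cluster. The function names are hypothetical; the two signer names are the ones Kubernetes defines for kubelet client and serving certificates.

```python
import json
import subprocess
from typing import List

# Signer names Kubernetes defines for kubelet client and serving certificates.
KUBELET_SIGNERS = {
    "kubernetes.io/kube-apiserver-client-kubelet",
    "kubernetes.io/kubelet-serving",
}


def pending_kubelet_csrs(csr_list: dict) -> List[str]:
    """Return names of kubelet CSRs that have been neither approved nor denied."""
    pending = []
    for item in csr_list.get("items", []):
        if item.get("spec", {}).get("signerName") not in KUBELET_SIGNERS:
            continue  # not a kubelet certificate request
        conditions = item.get("status", {}).get("conditions") or []
        decided = any(c.get("type") in ("Approved", "Denied") for c in conditions)
        if not decided:
            pending.append(item["metadata"]["name"])
    return pending


def fetch_csrs() -> dict:
    """Load all CSRs through kubectl (assumes a working kubeconfig)."""
    out = subprocess.run(
        ["kubectl", "get", "csr", "-o", "json"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return json.loads(out)
```

Running pending_kubelet_csrs(fetch_csrs()) on a schedule and alerting when the list stays non-empty is one simple way to notice that automation has quietly stopped.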
The Silent Killer: Why Kubelet Certificates Expire Unnoticed
Despite automation, kubelet certificate expiry remains a significant operational risk due to several factors:
- Automation Gaps: The CSR bootstrap process, while robust, can sometimes fail. This could be due to network issues, misconfigurations in the API server's certificate controller, or even the cluster's root CA itself expiring (a much larger, but related, problem).
- Manual Overrides and Drift: In some cases, operators might manually place certificates on nodes, bypassing the automated rotation. Over time, these manual interventions are forgotten, and the certificates expire.
- Long-Lived Clusters: The CAs that sign these certificates have expiry dates of their own. Once a CA expires, every certificate it signed fails validation, even if the kubelet's individual rotation mechanism is working correctly.
- Lack of Centralized Visibility: With dozens or hundreds of nodes, manually checking each kubelet's certificate status is impractical and error-prone. Without a centralized monitoring solution, an expired certificate on a single node can easily go unnoticed until it causes an outage.
- Configuration Complexity: Different Kubernetes components might rely on different certificates. The kubelet client certificate, serving certificate, API server's serving certificate, etcd certificates – each has its own lifecycle, and managing them without a unified approach is challenging.
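The visibility gap described above comes down to turning many per-node expiry values into one ranked view. Here is a small sketch of that idea, assuming you have already collected the remaining days for each node by some means. The thresholds (30 and 7 days) and the function names are illustrative assumptions, not recommended values.

```python
from typing import Dict, List, Tuple

# Hypothetical thresholds, in days; tune them to your own rotation schedule.
WARN_DAYS = 30
CRIT_DAYS = 7


def classify(days_left: float) -> str:
    """Map remaining certificate lifetime to a severity label."""
    if days_left <= 0:
        return "expired"
    if days_left <= CRIT_DAYS:
        return "critical"
    if days_left <= WARN_DAYS:
        return "warning"
    return "ok"


def fleet_report(days_by_node: Dict[str, float]) -> List[Tuple[str, float, str]]:
    """List nodes soonest-expiry first, each with its severity label."""
    return [
        (node, days, classify(days))
        for node, days in sorted(days_by_node.items(), key=lambda kv: kv[1])
    ]
```

Sorting by remaining lifetime puts the nodes that need attention at the top, so a single glance (or a single alert rule on the first entry) covers the whole fleet.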
Identifying Kubelet Certificate Expiry
The first sign of an expired kubelet certificate is usually a node exhibiting erratic behavior or dropping out of the cluster.
Common Symptoms:
- Node NotReady: Your kubectl get nodes output will show one or more nodes in a NotReady state.
- Pod Scheduling Failures: New pods won't schedule on the affected node(s).
- kubectl logs/exec/attach Failing: Commands like kubectl logs <pod-name> -n <namespace> or kubectl exec -it <pod-name> -- bash will fail for pods running on the affected node, often with TLS handshake errors or connection refused messages.
- Metrics Scraping Failures: Monitoring systems relying on kube