cert-manager + Prometheus Alert Pattern: Proactive Certificate Expiry Monitoring
Managing SSL/TLS certificates can feel like walking a tightrope. One misstep, and your services go down, users see scary browser warnings, and trust evaporates. cert-manager has been a game-changer for Kubernetes users, automating the lifecycle of certificates, from issuance to renewal. It's incredibly powerful, but automation doesn't mean you can completely set it and forget it. You still need visibility and, crucially, proactive alerts when things aren't going as planned.
This article will guide you through setting up a robust monitoring and alerting strategy for your cert-manager managed certificates using Prometheus and Alertmanager. We'll cover how to tap into cert-manager's metrics, craft effective PromQL queries, define alerts, and discuss common pitfalls to ensure your certificates renew smoothly and silently.
Understanding cert-manager Metrics
cert-manager exposes a /metrics endpoint that Prometheus can scrape, offering valuable insights into your certificate landscape. The most critical metric for expiry monitoring is certmanager_certificate_expiration_timestamp_seconds. This gauge metric provides the Unix timestamp (in seconds) when a certificate is expected to expire. It's labeled with useful metadata like:
- name: The name of the Certificate resource.
- namespace: The namespace where the Certificate resource resides.
- issuer_name: The name of the Issuer or ClusterIssuer used.
- issuer_kind: The kind of issuer (e.g., Issuer, ClusterIssuer).
- common_name: The common name of the certificate.
- dns_names: A comma-separated list of DNS names on the certificate.
To verify these metrics are available, you can usually port-forward to the cert-manager controller service and curl its metrics endpoint.
Example 1: Accessing cert-manager metrics
First, find your cert-manager controller pod:
kubectl get pods -n cert-manager -l app=cert-manager -o jsonpath='{.items[0].metadata.name}'
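Then port-forward the metrics port. The command below targets the cert-manager Service rather than the individual pod (you might need to adjust the service name or port if your setup is different):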
kubectl port-forward -n cert-manager service/cert-manager 9402:9402 &
(Note: The default metrics port for cert-manager is 9402, but it's exposed via a service, so port-forwarding the service is often more robust than a specific pod.)
Now, curl the metrics endpoint:
curl http://localhost:9402/metrics | grep certmanager_certificate_expiration_timestamp_seconds
You should see output similar to this, indicating your certificates and their expiry timestamps:
# HELP certmanager_certificate_expiration_timestamp_seconds Unix timestamp of the certificate expiration.
# TYPE certmanager_certificate_expiration_timestamp_seconds gauge
certmanager_certificate_expiration_timestamp_seconds{common_name="example.com",dns_names="example.com",issuer_kind="ClusterIssuer",issuer_name="letsencrypt-prod",name="example-com-tls",namespace="default"} 1678886400
certmanager_certificate_expiration_timestamp_seconds{common_name="another.net",dns_names="another.net",issuer_kind="Issuer",issuer_name="selfsigned-issuer",name="another-net-tls",namespace="dev"} 1708886400
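If you want to sanity-check the raw output outside Prometheus, the sample lines can be parsed with a few lines of Python. This is a minimal sketch, not part of cert-manager: the helper names parse_expiry_samples and days_remaining are made up for illustration, and the regex assumes label values do not contain a closing brace.

```python
import re
import time

# Matches one sample line of the expiry gauge in the Prometheus text format.
# Assumption: label values never contain "}" (true for the labels shown above).
SAMPLE_RE = re.compile(
    r'^certmanager_certificate_expiration_timestamp_seconds'
    r'\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_expiry_samples(text):
    """Return a list of (labels, expiry_timestamp) tuples from /metrics output."""
    samples = []
    for line in text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match is None:
            continue  # skips HELP/TYPE comments and unrelated metrics
        labels = dict(LABEL_RE.findall(match.group("labels")))
        samples.append((labels, float(match.group("value"))))
    return samples


def days_remaining(expiry_timestamp, now=None):
    """Days until expiry; negative if the certificate has already expired."""
    now = time.time() if now is None else now
    return (expiry_timestamp - now) / 86400
```

Feeding it the curl output (for example, saved to a file) gives a quick per-certificate view of how many days are left, which is handy for checking that the numbers Prometheus will alert on look plausible.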
Crafting Your Prometheus Query
With the certmanager_certificate_expiration_timestamp_seconds metric in hand, we can build a PromQL query to identify certificates nearing expiry. The key is to compare the certificate's expiry timestamp with the current time. Prometheus provides the time() function, which returns the current Unix timestamp.
To get the remaining time until expiry in seconds, you'd use:
certmanager_certificate_expiration_timestamp_seconds - time()
Now, let's say you want to be alerted 30 days before a certificate expires. 30 days is 30 * 24 * 60 * 60 = 2,592,000 seconds. So, your query would look like this:
certmanager_certificate_expiration_timestamp_seconds - time() < 2592000
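This query returns one series for each certificate that expires within the next 30 days, along with all its labels. Because the comparison acts as a filter (there is no bool modifier), the value of each returned series is the remaining time in seconds, not 1. You can adjust the threshold (e.g., 60 days, 14 days) based on your operational needs and how long your cert-manager renewal process typically takes.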
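To make the arithmetic and the filtering behavior concrete, here is a small Python sketch that mimics what the expression does. It is an illustration only: expiring_within is a hypothetical helper, and the dictionary of expiry timestamps stands in for the series Prometheus would scrape.

```python
import time

THIRTY_DAYS = 30 * 24 * 60 * 60  # 2,592,000 seconds


def expiring_within(expiries, threshold_seconds, now=None):
    """Mimic `metric - time() < threshold`: keep series under the threshold.

    `expiries` maps a certificate name to its expiry Unix timestamp.
    The result maps each matching name to its remaining seconds, which is
    what a PromQL comparison without `bool` leaves as the sample value.
    """
    now = time.time() if now is None else now
    remaining = {name: ts - now for name, ts in expiries.items()}
    return {name: secs for name, secs in remaining.items() if secs < threshold_seconds}
```

A certificate with 10 days left would come back with a value of 864000, while one with 90 days left would be dropped from the result entirely.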
You can also refine this query further. For instance, if you only care about certificates from a specific issuer or in a particular namespace:
- By Issuer:
certmanager_certificate_expiration_timestamp_seconds{issuer_name="letsencrypt-prod"} - time() < 2592000
- By Namespace:
certmanager_certificate_expiration_timestamp_seconds{namespace="production"} - time() < 2592000
It's often a good idea to filter out self-signed certificates or those used for internal-only purposes if they have different expiry requirements or are less critical for external services.
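If you maintain several variants of this query (different thresholds, namespaces, or excluded issuers), generating them from one place keeps them consistent. The following is a minimal sketch under stated assumptions: expiry_query is a hypothetical helper, label values are assumed not to need escaping, and selfsigned-issuer is simply the issuer name from the sample output earlier.

```python
METRIC = "certmanager_certificate_expiration_timestamp_seconds"


def expiry_query(days, equals=None, not_equals=None):
    """Build the expiry PromQL expression with optional label matchers.

    `equals` and `not_equals` map label names to values and become
    `label="value"` and `label!="value"` matchers respectively.
    Assumption: values contain no quotes or backslashes (no escaping is done).
    """
    matchers = [f'{label}="{value}"' for label, value in sorted((equals or {}).items())]
    matchers += [f'{label}!="{value}"' for label, value in sorted((not_equals or {}).items())]
    selector = METRIC
    if matchers:
        selector += "{" + ",".join(matchers) + "}"
    return f"{selector} - time() < {days * 86400}"


if __name__ == "__main__":
    # 14-day warning for production, ignoring the self-signed issuer.
    print(expiry_query(14, equals={"namespace": "production"},
                       not_equals={"issuer_name": "selfsigned-issuer"}))
```

The negative matcher (!=) is the piece that excludes self-signed or internal-only issuers; the generated string can be pasted into the Prometheus expression browser to check it before it goes into a rule.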
Building Robust Prometheus Alerts
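Once you have a working PromQL query, the next step is to turn it into an alerting rule. Prometheus evaluates the rule and sends any firing alerts to Alertmanager. You'll typically define rules in a rules file that Prometheus loads through the rule_files setting in its configuration.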
Example 2: Prometheus Alert Rule for Certificate Expiry
Here's an example of an alert.rules file snippet:
groups:
  - name: cert-manager-alerts
    rules:
      - alert: CertManagerCertificateExpiringSoon
        expr: |
          certmanager_certificate_expiration_timestamp_seconds - time() < 2592000
        for: 5m # Wait 5 minutes to ensure the metric is stable
        labels:
          severity: warning
          team: infra
        annotations:
          summary: "Certificate {{ $labels.name }} in namespace {{ $labels.namespace }} expires soon"
          description: |
            The certificate '{{ $labels.name }}' (Common Name: {{ $labels.common_name }}, DNS Names: {{ $labels.dns_names }})
            issued by '{{ $labels.issuer_name }}' in namespace '{{ $labels.namespace }}' is expiring in less than 30 days.
            Time remaining: {{ $value | humanizeDuration }}.
            Please verify cert-manager renewal status and intervene if necessary.
Let's break down this alert rule:
- alert: CertManagerCertificateExpiringSoon: A unique name for your alert.
- expr: Your PromQL query. We're using a multi-line literal here for readability.
- for: 5m: This ensures the alert only fires if the condition (< 30 days) has been true for at least 5 minutes. This helps prevent flapping alerts due to scrape intervals or transient issues.
- labels: Key-value pairs that help categorize and route your alerts in Alertmanager. severity and team are common examples.
- annotations: More descriptive information about the alert, often used in alert notifications. We use templating ({{ $labels.<label_name> }}) to include dynamic information from the metric labels. Note that a metric's value is not a label: the result of expr (here, the remaining seconds) is available as {{ $value }}. Prometheus renders these templates before sending the alert to Alertmanager and provides helper functions such as humanizeDuration and humanizeTimestamp to make durations and timestamps readable.
Once Prometheus picks up this rule, if the condition is met, it will send an alert to Alertmanager, which then routes it to your configured receivers (e.g., Slack, email, PagerDuty).
Pitfalls and Edge Cases
While this Prometheus-based approach is powerful, it's essential to be aware of its limitations and potential pitfalls:
- Silent Renewal Failures: The certmanager_certificate_expiration_timestamp_seconds metric reflects the expiry of the currently active certificate. If cert-manager fails to renew a certificate, this metric won't immediately tell you that the renewal process itself is stuck. You'll only get an alert when the old certificate is nearing expiry, which might be too late for proactive intervention if the renewal has been failing for a while. For deeper insights into cert-manager's internal state, you might need to monitor certmanager_certificate_ready_status or certmanager_certificate_renewal_timestamp_seconds as well.
- Prometheus Scrape Issues: If Prometheus itself can't scrape the cert-manager metrics endpoint (e.g., due to networking