cert-manager + Prometheus Alert Pattern: Proactive Certificate Expiry Monitoring

Managing SSL/TLS certificates can feel like walking a tightrope. One misstep, and your services go down, users see scary browser warnings, and trust evaporates. cert-manager has been a game-changer for Kubernetes users, automating the lifecycle of certificates, from issuance to renewal. It's incredibly powerful, but automation doesn't mean you can completely set it and forget it. You still need visibility and, crucially, proactive alerts when things aren't going as planned.

This article will guide you through setting up a robust monitoring and alerting strategy for your cert-manager managed certificates using Prometheus and Alertmanager. We'll cover how to tap into cert-manager's metrics, craft effective PromQL queries, define alerts, and discuss common pitfalls to ensure your certificates renew smoothly and silently.

Understanding cert-manager Metrics

cert-manager exposes a /metrics endpoint that Prometheus can scrape, offering valuable insights into your certificate landscape. The most critical metric for expiry monitoring is certmanager_certificate_expiration_timestamp_seconds. This gauge metric provides the Unix timestamp (in seconds) when a certificate is expected to expire. It's labeled with useful metadata like:

  • name: The name of the Certificate resource.
  • namespace: The namespace where the Certificate resource resides.
  • issuer_name: The name of the Issuer or ClusterIssuer used.
  • issuer_kind: The kind of issuer (e.g., Issuer, ClusterIssuer).
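  • issuer_group: The API group of the issuer (e.g., cert-manager.io).

Note that the certificate's common name and DNS names are not exposed as labels on this metric; the Certificate's name and namespace are what identify it.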

To verify these metrics are available, you can usually port-forward to the cert-manager controller service and curl its metrics endpoint.

Example 1: Accessing cert-manager metrics

First, confirm that the cert-manager controller pod is running (this prints its name):

kubectl get pods -n cert-manager -l app=cert-manager -o jsonpath='{.items[0].metadata.name}'

Then, port-forward to the cert-manager service (adjust the service name or port if your installation differs):

kubectl port-forward -n cert-manager service/cert-manager 9402:9402 &

(Note: cert-manager serves metrics on port 9402 by default. Forwarding to the service rather than to a specific pod saves you from looking up a pod name that changes on every restart.)

Now, curl the metrics endpoint:

curl -s http://localhost:9402/metrics | grep certmanager_certificate_expiration_timestamp_seconds

You should see output similar to this, indicating your certificates and their expiry timestamps:

# HELP certmanager_certificate_expiration_timestamp_seconds Unix timestamp of the certificate expiration.
# TYPE certmanager_certificate_expiration_timestamp_seconds gauge
certmanager_certificate_expiration_timestamp_seconds{issuer_group="cert-manager.io",issuer_kind="ClusterIssuer",issuer_name="letsencrypt-prod",name="example-com-tls",namespace="default"} 1678886400
certmanager_certificate_expiration_timestamp_seconds{issuer_group="cert-manager.io",issuer_kind="Issuer",issuer_name="selfsigned-issuer",name="another-net-tls",namespace="dev"} 1708886400
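
If you would rather read the timestamps as dates, the following Python sketch parses text in the Prometheus exposition format and converts each expiry to a UTC datetime. It is an illustration only: the parse_expiries helper and the sample text are hypothetical, and the regular expressions handle only simple lines like the ones shown here.

```python
import re
from datetime import datetime, timezone

METRIC = "certmanager_certificate_expiration_timestamp_seconds"
# Matches one sample line of the form: metric{labels} value
SAMPLE_RE = re.compile(
    r"^" + re.escape(METRIC) + r"\{(?P<labels>.*)\}\s+(?P<value>\S+)$"
)
# Matches individual label="value" pairs inside the braces
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_expiries(metrics_text):
    """Return a list of (labels_dict, expiry_datetime_utc), one per sample."""
    results = []
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # skips "# HELP" / "# TYPE" lines and other metrics
        labels = dict(LABEL_RE.findall(match.group("labels")))
        expiry = datetime.fromtimestamp(float(match.group("value")), tz=timezone.utc)
        results.append((labels, expiry))
    return results


# Hypothetical sample input, shaped like the output of the curl command.
sample = """# TYPE certmanager_certificate_expiration_timestamp_seconds gauge
certmanager_certificate_expiration_timestamp_seconds{issuer_kind="ClusterIssuer",issuer_name="letsencrypt-prod",name="example-com-tls",namespace="default"} 1678886400
"""

for labels, expiry in parse_expiries(sample):
    print(labels["namespace"], labels["name"], expiry.isoformat())
```

In practice you would pass the text returned by the metrics endpoint to parse_expiries instead of the inline sample.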

Crafting Your Prometheus Query

With the certmanager_certificate_expiration_timestamp_seconds metric in hand, we can build a PromQL query to identify certificates nearing expiry. The key is to compare the certificate's expiry timestamp with the current time. Prometheus provides the time() function, which returns the current Unix timestamp.

To get the remaining time until expiry in seconds, you'd use: certmanager_certificate_expiration_timestamp_seconds - time()

Now, let's say you want to be alerted 30 days before a certificate expires. 30 days is 30 * 24 * 60 * 60 = 2,592,000 seconds. So, your query would look like this:

certmanager_certificate_expiration_timestamp_seconds - time() < 2592000

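This query returns one series, with all its labels, for every certificate that expires within the next 30 days. The value of each series is the number of seconds remaining, not 1: without the bool modifier, a PromQL comparison filters series rather than producing 0 or 1. You can adjust the threshold (e.g., 60 days, 14 days) based on your operational needs and how long your cert-manager renewal process typically takes. Keep it below the point at which cert-manager itself starts renewing (by default, when one third of the certificate's lifetime remains), or the alert will fire during normal operation.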
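
To make the arithmetic concrete, here is a small Python sketch that mirrors the comparison outside of Prometheus: it subtracts a fixed "now" from each expiry timestamp and keeps only the entries below the threshold. The expiring_within helper, the certificate names, and the timestamps are all made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def expiring_within(expiries, now, days):
    """Keep only certificates whose remaining lifetime is below the threshold.

    expiries maps a certificate name to its expiry as a Unix timestamp.
    Returns a dict of name -> seconds remaining for the matching entries.
    """
    threshold = days * SECONDS_PER_DAY
    remaining = {name: ts - now for name, ts in expiries.items()}
    return {name: left for name, left in remaining.items() if left < threshold}


now = 1_700_000_000  # fixed "current time" so the example is reproducible
expiries = {
    "example-com-tls": now + 10 * SECONDS_PER_DAY,  # 10 days left
    "another-net-tls": now + 90 * SECONDS_PER_DAY,  # 90 days left
}

print(30 * SECONDS_PER_DAY)                # 2592000
print(expiring_within(expiries, now, 30))  # {'example-com-tls': 864000}
```

Only the certificate with 10 days left passes the 30-day threshold; the one with 90 days left is dropped from the result entirely.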

You can also refine this query further. For instance, if you only care about certificates from a specific issuer or in a particular namespace:

  • By Issuer: certmanager_certificate_expiration_timestamp_seconds{issuer_name="letsencrypt-prod"} - time() < 2592000
  • By Namespace: certmanager_certificate_expiration_timestamp_seconds{namespace="production"} - time() < 2592000

It's often a good idea to filter out self-signed certificates or those used for internal-only purposes if they have different expiry requirements or are less critical for external services.
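
If you generate these queries from scripts, a sketch like the following can assemble the expression with include and exclude matchers and build the URL for an instant query against Prometheus's HTTP API (/api/v1/query). The build_expr and build_query_url helpers, the localhost:9090 address, and the label values are assumptions for illustration; the sketch does not escape quotes inside label values.

```python
from urllib.parse import urlencode

METRIC = "certmanager_certificate_expiration_timestamp_seconds"


def build_expr(days, include=None, exclude=None):
    """Build the expiry expression with optional label matchers.

    include: labels that must equal a value (rendered as label="value")
    exclude: labels that must not equal a value (rendered as label!="value")
    """
    matchers = [f'{k}="{v}"' for k, v in sorted((include or {}).items())]
    matchers += [f'{k}!="{v}"' for k, v in sorted((exclude or {}).items())]
    selector = METRIC + ("{" + ",".join(matchers) + "}" if matchers else "")
    return f"{selector} - time() < {days * 24 * 60 * 60}"


def build_query_url(base_url, expr):
    """Return the instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


# Production certificates only, skipping a (hypothetical) self-signed issuer.
expr = build_expr(
    30,
    include={"namespace": "production"},
    exclude={"issuer_name": "selfsigned-issuer"},
)
print(expr)
print(build_query_url("http://localhost:9090", expr))
```

The != matcher is what filters out the self-signed issuer; the printed URL can be fetched with any HTTP client to run the query.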

Building Robust Prometheus Alerts

Once you have a working PromQL query, the next step is to turn it into an alerting rule. Prometheus evaluates the rule and sends any firing alerts to Alertmanager. You'll typically define rules in a YAML file that Prometheus loads through the rule_files setting in its configuration.

Example 2: Prometheus Alert Rule for Certificate Expiry

Here's an example snippet of such a rules file:

groups:
- name: cert-manager-alerts
  rules:
  - alert: CertManagerCertificateExpiringSoon
    expr: |
      certmanager_certificate_expiration_timestamp_seconds - time() < 2592000
    for: 5m # Wait 5 minutes to ensure the metric is stable
    labels:
      severity: warning
      team: infra
    annotations:
      summary: "Certificate {{ $labels.name }} in namespace {{ $labels.namespace }} expires soon"
      description: |
        The certificate '{{ $labels.name }}' in namespace '{{ $labels.namespace }}', issued by
        {{ $labels.issuer_kind }} '{{ $labels.issuer_name }}', expires in {{ $value | humanizeDuration }}
        (alert threshold: 30 days).
        Please verify cert-manager renewal status and intervene if necessary.

Let's break down this alert rule:

  • alert: CertManagerCertificateExpiringSoon: A unique name for your alert.
  • expr:: Your PromQL query. We're using a multi-line literal here for readability.
  • for: 5m: This ensures the alert only fires if the condition (< 30 days) has been true for at least 5 minutes. This helps prevent flapping alerts due to scrape intervals or transient issues.
  • labels:: Key-value pairs that help categorize and route your alerts in Alertmanager. severity and team are common examples.
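  • annotations:: More descriptive information about the alert, often used in alert notifications. We use templating ({{ $labels.<label_name> }}) to include dynamic information from the metric labels; the value of the expression is available as {{ $value }}. These templates are expanded by Prometheus when it evaluates the rule, not by Alertmanager, and Prometheus provides helper functions such as humanizeDuration and humanizeTimestamp to make durations and timestamps readable.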
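
The effect of the for clause can be sketched as a small state machine. The Python model below is a simplification written for this article, not Prometheus code: the alert_states helper is hypothetical, and it assumes an alert is pending while the condition holds, starts firing once the condition has held continuously for the for duration, and resets as soon as the condition stops holding.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state after each rule evaluation.

    evaluations: list of (unix_time, condition_is_true) in time order.
    """
    states = []
    active_since = None  # when the condition became true in the current streak
    for ts, condition in evaluations:
        if not condition:
            active_since = None  # any false evaluation resets the streak
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held_for = ts - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Evaluations every 60 s; the condition drops out once at t=120.
evals = [(0, True), (60, True), (120, False), (180, True), (240, True),
         (300, True), (360, True), (420, True), (480, True)]
print(alert_states(evals, for_seconds=300))
```

In this run the single false evaluation at t=120 restarts the clock, so the alert only reaches firing at t=480, five minutes after the condition became true again at t=180.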

Once Prometheus picks up this rule, if the condition is met, it will send an alert to Alertmanager, which then routes it to your configured receivers (e.g., Slack, email, PagerDuty).

Pitfalls and Edge Cases

While this Prometheus-based approach is powerful, it's essential to be aware of its limitations and potential pitfalls:

  • Silent Renewal Failures: The certmanager_certificate_expiration_timestamp_seconds metric reflects the expiry of the currently issued certificate. If cert-manager fails to renew a certificate, this metric won't tell you that the renewal process itself is stuck. You'll only get an alert once the old certificate drops below your expiry threshold, which may leave little time to intervene if renewal has been failing for a while. For deeper insight into cert-manager's internal state, also monitor certmanager_certificate_ready_status and certmanager_certificate_renewal_timestamp_seconds.
  • Prometheus Scrape Issues: If Prometheus itself can't scrape the cert-manager metrics endpoint (e.g., due to networking