Observability From Zero to Hero - Part III: Metrics
IMO metrics are THE way to signal problems and measure the user experience of your application. Here is my mental model:
- Metrics are for measuring at the service level, and signalling any potential issues.
- Logs & traces are for drilling down into the root cause of the issues.
In this article we will speed-run through the RED methodology: how to instrument your application, how to collect the metrics without pulling your hair out, and last but not least how to secure your metrics endpoint if you absolutely need to.
In the examples we will be using Go and Prometheus to demonstrate how to do it. The metrics will be pushed (aka remote write) to Grafana Cloud.
Hopefully by the end of this article you will have a good idea of how to effectively instrument and collect your application metrics.
Prerequisites
This article assumes that you have a basic mental model of how Prometheus scraping works, and that you have read the previous article in the series.
RED method
IMO the best way to measure the user experience of your application is through the RED method, in which you break your application's performance characteristics into three categories:
- Rate
- Errors
- Duration
In my case, if I think back on my past experience of dealing with production issues, the vast majority of the signals can be attributed to either a drop in requests per second (Rate), an increase in error rates (Errors), or an increase in response time (Duration).
The boilerplate middleware
For measuring the RED metrics I often use this middleware pattern: a metrics package that collects `http_requests_total` as a counter and `http_request_duration_seconds` as a histogram, which covers the RED method from the get-go:
- Rate:
sum(rate(http_requests_total{job="my-app", path="/api/THE_PATH"}[$__rate_interval]))
- Errors:
sum(rate(http_requests_total{job="my-app", path="/api/THE_PATH", code=~"5.."}[$__rate_interval]))
- Duration:
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{job="my-app", path="/api/THE_PATH"}[$__rate_interval])))
To be honest, `http_request_duration_seconds` comes with an `http_request_duration_seconds_count` series, which kinda makes the `http_requests_total` counter redundant. I still like to keep them separate, but that's just a personal preference :)
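For illustration, the Rate query above could equally be derived from the histogram's `_count` series (same job and path labels assumed as in the earlier queries):
sum(rate(http_request_duration_seconds_count{job="my-app", path="/api/THE_PATH"}[$__rate_interval]))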
In your `main.go` you can then call `metrics.Init(ctx)` to start the metrics server.
package metrics

import (
	"context"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Registry is a dedicated registry, so only explicitly registered collectors are exposed.
	Registry = prometheus.NewRegistry()

	HttpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"code", "method", "path"},
	)

	HttpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "Duration of HTTP requests",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"code", "method", "path"},
	)
)

// Init registers the collectors and serves /metrics in the background.
func Init(ctx context.Context) {
	Registry.MustRegister(HttpRequestsTotal)
	Registry.MustRegister(HttpRequestDuration)
	// Go runtime metrics from the collectors package.
	Registry.MustRegister(collectors.NewGoCollector())

	http.Handle("/metrics", promhttp.HandlerFor(
		Registry,
		promhttp.HandlerOpts{
			EnableOpenMetrics: true,
		},
	))

	// logger.G is the context-aware logger from the logging setup used earlier in this series.
	logger.G(ctx).Info("Starting metrics server on 127.0.0.1:8081")
	go http.ListenAndServe("127.0.0.1:8081", nil)
}
On the middleware level I often use a middleware called `Observe` that wraps the HTTP handlers. Here is an example written for gin:
func Observe(path string) gin.HandlerFunc {
	return func(c *gin.Context) {
		start := time.Now()
		defer func() {
			// Record the final status code and the request duration
			// once the handler chain has finished.
			status := c.Writer.Status()
			metrics.HttpRequestsTotal.WithLabelValues(strconv.Itoa(status), c.Request.Method, path).Inc()
			metrics.HttpRequestDuration.WithLabelValues(strconv.Itoa(status), c.Request.Method, path).Observe(time.Since(start).Seconds())
		}()
		c.Next()
	}
}
With the Observe middleware in place you can instrument pretty much any of the HTTP handlers by adding it to the middleware chain, e.g.
r.GET("/api/user/:id", metrics.Observe("/api/user/:id"), handler)
One thing you might be wondering is why on earth I'm passing the `path` into the `Observe` middleware instead of just using the request path from `c.Request.URL.Path`. The reason is that with the raw request path, a route like `/api/user/:id` produces a separate label value for every ID, so not only does the cardinality go out of control, Prometheus itself will also have a hard time counting and summing the series up. As a result I much prefer passing a static path string as a parameter.
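To make the cardinality point concrete, here is roughly what the resulting series would look like (hypothetical values) with the raw request path versus the static path label:
# raw request path: a new series for every user ID
http_requests_total{code="200", method="GET", path="/api/user/123"} 1
http_requests_total{code="200", method="GET", path="/api/user/456"} 1
http_requests_total{code="200", method="GET", path="/api/user/789"} 1

# static path label: a single series per route
http_requests_total{code="200", method="GET", path="/api/user/:id"} 3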
How to collect the metrics?
There are several options to collect the metrics.
Option 1: PodMonitor/ServiceMonitor CRDs
Assuming your application is deployed as a pod in the Kubernetes cluster, you can create a PodMonitor CRD to collect the metrics. PS: I just use PodMonitor, but ServiceMonitor should work as well.
# by default it will be scraping the `/metrics` endpoint
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-app
  namespace: the-namespace
spec:
  selector:
    matchLabels:
      app: my-app
  podMetricsEndpoints:
    - port: metrics
There are a few caveats with this approach:
- The port must be referenced by name instead of by number.
- Because of that, you must declare a named port in your container definition, like below.
- The users of the Kubernetes cluster must have adequate permissions to create PodMonitor resources.
containers:
  - name: my-app
    ports:
      - name: metrics
        containerPort: 8081
        protocol: TCP
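One extra thing worth double-checking, since it bites people in practice: the PodMonitor's `selector.matchLabels` must match the labels on the pod template itself (not the Deployment's own labels), e.g.:
# Deployment pod template (illustrative)
spec:
  template:
    metadata:
      labels:
        app: my-app   # must match the PodMonitor's selector.matchLabels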
Option 2: Undocumented Alloy magic annotations
I couldn't find any documentation on this, but apparently, based on this, there is a way to add annotations to your pod spec that tell Grafana Alloy to scrape the metrics based on pod annotations.
Which looks like this:
annotations:
  k8s.grafana.com/job: my-app
  k8s.grafana.com/scrape: "true"
  k8s.grafana.com/metrics.path: /metrics
  k8s.grafana.com/metrics.port: metrics
  k8s.grafana.com/scrape.scheme: http
  k8s.grafana.com/metrics.scrapeInterval: 60s
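Note that, assuming this discovery mechanism reads pod annotations, they need to land on the pods themselves: with a Deployment that means the pod template metadata, not the Deployment's own metadata. A minimal sketch:
# Deployment (illustrative): annotations go on the pod template metadata
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    metadata:
      annotations:
        k8s.grafana.com/scrape: "true"
        k8s.grafana.com/metrics.port: metrics
      labels:
        app: my-app
    spec:
      # ... containers as defined earlier ...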
What if I need to secure my metrics endpoint?
Disclaimer: this is advanced usage; feel free to skip this section if securing the metrics endpoint is not a requirement for you.
Sometimes, due to the sensitive nature of the metrics data (e.g. PII, or just generally business-sensitive data), you might want to secure your metrics endpoint, just so that randos can't scrape it.
The tool that fits the bill is kube-rbac-proxy, which I've been using for a while. Essentially it is deployed as a proxy sidecar, and the Alloy agent/Prometheus server scrapes the metrics from the proxy sidecar instead. To make it work you need to:
- Run your metrics endpoint on `127.0.0.1:8081` instead of `0.0.0.0:8081`.
- Deploy kube-rbac-proxy as a sidecar in your application pod, listening on `0.0.0.0:8080`, with `127.0.0.1:8081` set as the upstream address.
- Grant your pod's service account the ability to perform token reviews against the Kubernetes API, which kube-rbac-proxy requires in order to work.
- Grant `alloy-agent` or the Prometheus server a cluster role that allows access to the `/metrics` endpoint as a non-resource URL.
The configuration is pretty fiddly, and I won't go too deep into the details; besides, this is a pretty good starting point: https://github.com/brancz/kube-rbac-proxy/tree/master/examples/non-resource-url.
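For orientation only, here is a rough sketch of the two main pieces, loosely adapted from that example (the image tag, names, and exact flag set are assumptions to verify against the upstream docs):
# Sidecar container, added next to the application container
- name: kube-rbac-proxy
  image: quay.io/brancz/kube-rbac-proxy:v0.18.0   # placeholder tag
  args:
    - "--secure-listen-address=0.0.0.0:8080"
    - "--upstream=http://127.0.0.1:8081/"
  ports:
    - name: metrics
      containerPort: 8080
---
# ClusterRole letting the scraper's service account hit /metrics as a non-resource URL
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: metrics-reader
rules:
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
Keep in mind that kube-rbac-proxy serves TLS on its secure listen address, so the scrape config needs to use HTTPS and present a service account bearer token.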
Practically it is so fiddly that I ended up writing a sidecar injector that automatically injects the kube-rbac-proxy sidecar for me when I add the a.b.com/sidecar-profiles: metrics annotation to my pod spec.
Conclusion
In this article we have covered the basics of the RED method, how to instrument your application, and how to collect the metrics the easy way. Hopefully you found it useful.