Observability From Zero to Hero - Part III: Metrics
IMO metrics are THE way to signal problems and measure the user experience of your application. Here is my mental model:
- Metrics are for measuring at the service level, and signalling any potential issues.
- Logs & traces are for drilling down into the root cause of the issues.
In this article we will speed-run through the RED methodology: how to instrument your application, how to collect the metrics without pulling your hair out, and last but not least how to secure your metrics endpoint if you absolutely need to.
In the examples we will be using Go and Prometheus to demonstrate how to do it. The metrics will be pushed (aka remote write) to Grafana Cloud.
Hopefully by the end of this article you will have a good idea of how to effectively instrument and collect your application metrics.
Prerequisites
This article assumes that you have a basic mental model of how Prometheus scraping works, and that you have read the previous article in the series.
RED method
IMO the best way to measure the user experience of your application is through the RED method, in which you break your application's performance characteristics into three categories:
- Rate
- Errors
- Duration
In my case, if I think back on my past experience of dealing with production issues, the vast majority of the signals can be attributed to either a drop in requests per second (Rate), an increase in error rates (Errors), or an increase in response time (Duration).
The boilerplate middleware
For measuring the RED metrics I often use this middleware pattern: a metrics package that collects `http_requests_total` as a counter and `http_request_duration_seconds` as a histogram, which covers the RED method from the get-go:
- Rate:
sum(rate(http_requests_total{job="my-app", path="/api/THE_PATH"}[$__rate_interval]))
- Errors:
sum(rate(http_requests_total{job="my-app", path="/api/THE_PATH", code=~"5.."}[$__rate_interval]))
- Duration:
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{job="my-app", path="/api/THE_PATH"}[$__rate_interval])))
To be honest, `http_request_duration_seconds` comes with an `http_request_duration_seconds_count` series, which kinda makes the `http_requests_total` counter redundant. I still like to keep them separate, but that's just a personal preference :)
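For illustration, the Rate query above could equally be derived from the histogram's `_count` series (same job and path labels assumed as in the earlier queries):
sum(rate(http_request_duration_seconds_count{job="my-app", path="/api/THE_PATH"}[$__rate_interval]))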
In your `main.go` you can then call `metrics.Init(ctx)` to start the metrics server.
package metrics

import (
	"context"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Registry is a dedicated registry, so only explicitly registered collectors are exposed.
	Registry = prometheus.NewRegistry()

	HttpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"code", "method", "path"},
	)

	HttpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "Duration of HTTP requests",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"code", "method", "path"},
	)
)

// Init registers the collectors and serves /metrics in the background.
func Init(ctx context.Context) {
	Registry.MustRegister(HttpRequestsTotal)
	Registry.MustRegister(HttpRequestDuration)
	// Go runtime metrics from the collectors package.
	Registry.MustRegister(collectors.NewGoCollector())

	http.Handle("/metrics", promhttp.HandlerFor(
		Registry,
		promhttp.HandlerOpts{
			EnableOpenMetrics: true,
		},
	))

	// logger.G is the context-aware logger from the logging setup used earlier in this series.
	logger.G(ctx).Info("Starting metrics server on 127.0.0.1:8081")
	go http.ListenAndServe("127.0.0.1:8081", nil)
}
On the middleware level I often use a middleware called `Observe` that wraps the HTTP handlers. Here is an example written for gin:
func Observe(path string) gin.HandlerFunc {
	return func(c *gin.Context) {
		start := time.Now()
		defer func() {
			// Record the final status code and the request duration
			// once the handler chain has finished.
			status := c.Writer.Status()
			metrics.HttpRequestsTotal.WithLabelValues(strconv.Itoa(status), c.Request.Method, path).Inc()
			metrics.HttpRequestDuration.WithLabelValues(strconv.Itoa(status), c.Request.Method, path).Observe(time.Since(start).Seconds())
		}()
		c.Next()
	}
}
With the Observe middleware in place you can instrument pretty much any of the HTTP handlers by adding it to the middleware chain, e.g.
r.GET("/api/user/:id", metrics.Observe("/api/user/:id"), handler)
One thing you might be wondering is why on earth I'm passing the `path` into the `Observe` middleware instead of just using the request path from `c.Request.URL.Path`. The reason is that with the raw request path, a route like `/api/user/:id` produces a separate label value for every ID, so not only does the cardinality go out of control, Prometheus itself will also have a hard time counting and summing the series up. As a result I much prefer passing a static path string as a parameter.
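To make the cardinality point concrete, here is roughly what the resulting series would look like (hypothetical values) with the raw request path versus the static path label:
# raw request path: a new series for every user ID
http_requests_total{code="200", method="GET", path="/api/user/123"} 1
http_requests_total{code="200", method="GET", path="/api/user/456"} 1
http_requests_total{code="200", method="GET", path="/api/user/789"} 1

# static path label: a single series per route
http_requests_total{code="200", method="GET", path="/api/user/:id"} 3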
How to collect the metrics?
There are several options to collect the metrics.
Option 1: PodMonitor/ServiceMonitor CRDs
Assuming your application is deployed as a pod in the Kubernetes cluster, you can create a PodMonitor CRD to collect the metrics. PS: I just use PodMonitor, but ServiceMonitor should work as well.
# by default it will be scraping the `/metrics` endpoint
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-app
  namespace: the-namespace
spec:
  selector:
    matchLabels:
      app: my-app
  podMetricsEndpoints:
    - port: metrics
There are a few caveats with this approach:
- The port must be referenced by name instead of by number.
- Because of that, you must declare a named port in your container definition, like below.
- The users of the Kubernetes cluster must have adequate permissions to create PodMonitor resources.
containers:
  - name: my-app
    ports:
      - name: metrics
        containerPort: 8081
        protocol: TCP
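One extra thing worth double-checking, since it bites people in practice: the PodMonitor's `selector.matchLabels` must match the labels on the pod template itself (not the Deployment's own labels), e.g.:
# Deployment pod template (illustrative)
spec:
  template:
    metadata:
      labels:
        app: my-app   # must match the PodMonitor's selector.matchLabels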
Option 2: Undocumented Alloy magic annotations
I couldn't find any documentation on this, but apparently, based on this, there is a way to add annotations to your pod spec that tell Grafana Alloy to scrape the metrics based on pod annotations.
Which looks like this:
annotations:
  k8s.grafana.com/job: my-app
  k8s.grafana.com/scrape: "true"
  k8s.grafana.com/metrics.path: /metrics
  k8s.grafana.com/metrics.port: metrics
  k8s.grafana.com/scrape.scheme: http
  k8s.grafana.com/metrics.scrapeInterval: 60s
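Note that, assuming this discovery mechanism reads pod annotations, they need to land on the pods themselves: with a Deployment that means the pod template metadata, not the Deployment's own metadata. A minimal sketch:
# Deployment (illustrative): annotations go on the pod template metadata
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    metadata:
      annotations:
        k8s.grafana.com/scrape: "true"
        k8s.grafana.com/metrics.port: metrics
      labels:
        app: my-app
    spec:
      # ... containers as defined earlier ...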
What if I need to secure my metrics endpoint?
Disclaimer: this is advanced usage; feel free to skip this section if securing the metrics endpoint is not a requirement for you.
Sometimes, due to the sensitive nature of the metrics data (e.g. PII, or just generally business-sensitive data), you might want to secure your metrics endpoint, just so that randos can't scrape it.
The tool that fits the bill is kube-rbac-proxy, which I've been using for a while. Essentially it is deployed as a proxy sidecar, and the Alloy agent/Prometheus server scrapes the metrics from the proxy sidecar instead. To make it work you need to:
- Run your metrics endpoint on `127.0.0.1:8081` instead of `0.0.0.0:8081`.
- Deploy kube-rbac-proxy as a sidecar in your application pod, listening on `0.0.0.0:8080`, with `127.0.0.1:8081` set as the upstream address.
- Grant your pod's service account the ability to perform token reviews against the Kubernetes API, which kube-rbac-proxy requires in order to work.
- Grant `alloy-agent` or the Prometheus server a cluster role that allows access to the `/metrics` endpoint as a non-resource URL.
The configuration is pretty fiddly, and I won't go too deep into the details; besides, this is a pretty good starting point: https://github.com/brancz/kube-rbac-proxy/tree/master/examples/non-resource-url.
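For orientation only, here is a rough sketch of the two main pieces, loosely adapted from that example (the image tag, names, and exact flag set are assumptions to verify against the upstream docs):
# Sidecar container, added next to the application container
- name: kube-rbac-proxy
  image: quay.io/brancz/kube-rbac-proxy:v0.18.0   # placeholder tag
  args:
    - "--secure-listen-address=0.0.0.0:8080"
    - "--upstream=http://127.0.0.1:8081/"
  ports:
    - name: metrics
      containerPort: 8080
---
# ClusterRole letting the scraper's service account hit /metrics as a non-resource URL
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: metrics-reader
rules:
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
Keep in mind that kube-rbac-proxy serves TLS on its secure listen address, so the scrape config needs to use HTTPS and present a service account bearer token.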
Practically it is so fiddly that I ended up writing a sidecar injector that automatically injects the kube-rbac-proxy sidecar for me when I add the a.b.com/sidecar-profiles: metrics annotation to my pod spec.
Conclusion
In this article we have covered the basics of the RED method, how to instrument your application, and how to collect the metrics the easy way. Hopefully you found it useful.