Observability From Zero to Hero - Part II: Structured Logging
For the first six years of my career I did logging in a rather undisciplined manner. It was fairly free-form and unstructured, and would look like this:
logger.info "User #{user.id} has made a payment of #{amount.inspect} with card ending in #{last_digits}"
When it came to querying it in the log engine I would often go with something brute-force that looked like:
message:"User has made a payment of"
Funnily enough, in my first job this type of query would often grind Elasticsearch to a halt when the query range spanned more than a month.
Fast forward to 2020: I was lucky enough to work with Boran, who introduced me to the concept of structured logging, and it completely opened my mind. I have not looked back since.
In this article I would like to share with you the things I have learned about structured logging over the years.
What is structured logging?
As the name suggests, structured logging is about emitting structured data into the logging system that:
- Can be easily ingested and indexed by the log engine.
- Can be easily queried and analysed by humans.
What are the structured logging formats?
There are several structured logging formats that are widely used. The most common ones I have used are:
- JSON - Which is very machine friendly but not particularly human readable in its raw form.
- Logfmt - Which is both human and machine readable. (The same event is shown in both formats below.)
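To make the difference concrete, here is a hypothetical payment event rendered in each format (the field names are illustrative, not from any particular library):
{"timestamp":"2024-03-01T12:34:56.789Z","logLevel":"info","message":"Payment made","user_id":"1234","payment_id":"pay_42","amount":100}
timestamp=2024-03-01T12:34:56.789Z logLevel=info message="Payment made" user_id=1234 payment_id=pay_42 amount=100
Both lines carry exactly the same fields; the JSON form is what a log engine ingests most easily, while the logfmt form is easier to scan by eye in a terminal.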
Quite often I settle on JSON because over the past few years I have predominantly worked on projects that run on GCP, where JSON logs are indexed out of the box by Google Cloud Logging. That being said, it has a few issues:
- It is not the most human readable format especially when each emitted event comes with many fields.
- JSON marshalling can be pretty CPU intensive when you emit an excessive amount of logs - it can be nauseatingly slow when it happens within a single HTTP request.
Personally I have not used logfmt, so I am probably not qualified to comment on it. As far as I can tell Loki supports both JSON and logfmt, which comes in really handy.
How I structure my logs
For the past few years I have been predominantly writing software in Go and Python. When it comes to Go I rigidly use a context-based pattern:
package logger
import (
	"context"
	"time"

	"github.com/sirupsen/logrus"
)
var (
G = GetLogger
L = logrus.NewEntry(newLogger())
)
type (
loggerKey struct{}
)
func WithLogger(ctx context.Context, logger *logrus.Entry) context.Context {
e := logger.WithContext(ctx)
return context.WithValue(ctx, loggerKey{}, e)
}
func GetLogger(ctx context.Context) *logrus.Entry {
logger := ctx.Value(loggerKey{})
if logger == nil {
return L.WithContext(ctx)
}
return logger.(*logrus.Entry)
}
func newLogger() *logrus.Logger {
l := logrus.New()
l.SetReportCaller(true) // Include file and line number in logs
l.Formatter = &logrus.JSONFormatter{
FieldMap: logrus.FieldMap{
logrus.FieldKeyTime: "timestamp", // @timestamp for Logstash - Can't remember why...
logrus.FieldKeyLevel: "logLevel",
logrus.FieldKeyMsg: "message",
},
TimestampFormat: time.RFC3339Nano,
}
return l
}
The example above uses the logrus logger, but the same pattern can be applied to any other logging library that supports structured logging.
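For instance, here is a rough sketch of the same context-based helpers built on Go's standard library log/slog instead of logrus (the helper names mirror the ones above; this is an illustration of the pattern, not the exact code I run):
package logger

import (
	"context"
	"log/slog"
	"os"
)

type loggerKey struct{}

// L is the default logger: JSON to stderr, with the source file and line included.
var L = slog.New(slog.NewJSONHandler(os.Stderr, &slog.HandlerOptions{AddSource: true}))

// WithLogger embeds an enriched logger into the context.
func WithLogger(ctx context.Context, logger *slog.Logger) context.Context {
	return context.WithValue(ctx, loggerKey{}, logger)
}

// G retrieves the logger from the context, falling back to the default.
func G(ctx context.Context) *slog.Logger {
	if logger, ok := ctx.Value(loggerKey{}).(*slog.Logger); ok {
		return logger
	}
	return L
}
Enrichment then becomes log = log.With("user_id", user.ID) rather than WithField, but everything else stays the same.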
Note that by default all the logs are shipped to stderr. Typically I do not use any vendor-specific SDKs that emit logs directly to the logging system; instead I write logs to stdout and stderr and rely on collectors such as Fluentd, Filebeat or Grafana Alloy to collect the container logs and ship them to the centralised logging system. The reasons being:
- You just need to write the structured logs to stdout/stderr, which decouples emitting logs from collecting them.
- When you use SDKs to emit logs, in the event of any network issue those logs essentially go down the drain, whereas with stdout and stderr it is more resilient since collectors such as Filebeat will simply play catch-up later.
Here are some of the common patterns built on the micro library above:
func (svc *Service) initRouter(router *gin.Engine) {
router.POST("/payments",
svc.Track(),
svc.FetchUser(),
svc.MakePayment(),
)
}
func (svc *Service) Track() gin.HandlerFunc {
return func(c *gin.Context) {
ctx := c.Request.Context()
start := time.Now()
correlationID := c.GetHeader("X-Correlation-ID")
log := logger.G(ctx).WithFields(logrus.Fields{
"correlation_id": correlationID,
"method": c.Request.Method,
"path": c.Request.URL.Path,
"remote_ip": c.ClientIP(),
})
ctx = logger.WithLogger(ctx, log)
c.Request = c.Request.WithContext(ctx)
defer func() {
log.WithField("duration", time.Since(start)).
WithField("status_code", c.Writer.Status()).
Info("Request completed") // emits correlation_id=xxx, method=xxx, path=/payments, remote_ip=xxx, duration=xxx, status_code=xxx, message="Request completed", logLevel=info, timestamp=xxx
}()
c.Next()
}
}
func (svc *Service) FetchUser() gin.HandlerFunc {
	return func(c *gin.Context) {
		ctx := c.Request.Context()
		user := svc.getUser(ctx, c.Param("user_id")) // getUser is the underlying lookup, named differently from this middleware to avoid a collision
		log := logger.G(ctx).
			WithField("user_id", user.ID).
			WithField("email", user.Email)
		log.Info("User fetched") // emits correlation_id=xxx, remote_ip=xxx, method=xxx, path=/payments, user_id=xxx, email=xxx, message="User fetched", logLevel=info, timestamp=xxx
		c.Request = c.Request.WithContext(logger.WithLogger(ctx, log)) // embed the enriched logger into the context
		c.Next()
	}
}
func (svc *Service) MakePayment() gin.HandlerFunc {
	return func(c *gin.Context) {
		ctx := c.Request.Context()
		log := logger.G(ctx).WithField("amount", c.Param("amount"))
		payment, err := svc.createPayment(ctx, c.Param("user_id"), c.Param("amount")) // createPayment is the underlying call, named differently from this middleware to avoid a collision
		if err != nil {
			// all the context from the previous middleware is preserved
			log.WithError(err).Error("Failed to make payment") // emits correlation_id=xxx, remote_ip=xxx, method=xxx, path=/payments, amount=xxx, user_id=xxx, email=xxx, error="xxx", message="Failed to make payment", logLevel=error, timestamp=xxx
			c.AbortWithStatus(http.StatusBadRequest)
			return
		}
		log.WithField("payment_id", payment.ID).Info("Payment made") // emits correlation_id=xxx, remote_ip=xxx, method=xxx, path=/payments, payment_id=xxx, amount=xxx, user_id=xxx, email=xxx, message="Payment made", logLevel=info, timestamp=xxx
	}
}
As you can see, the log is enriched in each middleware, and the context from the previous middleware is passed down to the next, so the logs are not only structured but also correlated.
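The same enrichment carries beyond the HTTP layer: any function that receives the context can pull out the already-enriched logger. A hypothetical service-layer sketch (createPayment matches the call in the middleware above; the provider field and svc.provider.Charge are illustrative):
func (svc *Service) createPayment(ctx context.Context, userID, amount string) (*Payment, error) {
	// The entry already carries correlation_id, user_id, email, etc. from the middleware.
	log := logger.G(ctx).WithField("provider", "acme-payments")
	log.Info("Calling payment provider")
	payment, err := svc.provider.Charge(ctx, userID, amount)
	if err != nil {
		log.WithError(err).Error("Payment provider call failed")
		return nil, err
	}
	return payment, nil
}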
On the live-operations side you can now effectively query the area of interest. For example, if a customer complains about an issue with a payment you can query:
{service="payments"} | json | method="POST" | user_id="1234" | status_code != "201"
which allows you to narrow down the problematic transaction. Moreover, if you want to dig further into what caused the issue you can simply query:
{service="payments"} | json | correlation_id="xxx"
which allows you to look into the issue end to end.
Wait, isn't this what OpenTelemetry Traces are for?
You are quite right. OTel Traces are pretty much a superset of what I have shown above. In fact, traces are a more powerful form of structured logging, because they allow you to correlate events across different services.
That being said, there are cases where tracing is not suitable, or structured logs are simply a better fit:
- Your application is a big reconciliation loop (e.g. a Kubernetes operator reconciler), where traces simply don't fit into the flow (see the sketch after this list).
- Your application is a big monolith, and you are not interested in intra-service correlation.
- Your team's logging is a mess - It makes more sense to fix what's broken instead of introducing a new shiny thing.
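To illustrate the reconciler case: a minimal sketch of the same context-based logger inside a controller-runtime reconciler, assuming the logger package from earlier (ctrl is sigs.k8s.io/controller-runtime; the PaymentReconciler type and reconcilePayment helper are hypothetical):
func (r *PaymentReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Enrich once per reconciliation; every helper that receives ctx inherits these fields.
	log := logger.G(ctx).
		WithField("namespace", req.Namespace).
		WithField("name", req.Name)
	ctx = logger.WithLogger(ctx, log)

	if err := r.reconcilePayment(ctx, req.NamespacedName); err != nil {
		log.WithError(err).Error("Reconciliation failed")
		return ctrl.Result{}, err
	}
	log.Info("Reconciliation complete")
	return ctrl.Result{}, nil
}
Each reconciliation then produces a small set of correlated, queryable events rather than free-form lines.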
Final thoughts
In this article I have shown you how I do structured logging, and demonstrated how it can be useful in diagnosing issues. Practically, I find it a very low-effort, high-impact way to get your application's observability off the ground, along with some basic metrics. Essentially:
- Metrics help you understand what is wrong at the service level (e.g. a 40% error rate over a 15-minute rolling window).
- Structured logs allow you to slice and dice the events, and help you diagnose why something is wrong at the transaction level (e.g. aha, the customer's payment was declined due to a transient issue with the payment provider).
We have also briefly touched on Traces and the seemingly big overlap between them and (properly done) structured logging.
I hope you found this article useful.