Observability From Zero to Hero - Part IV: Tracing
In the previous part we highlighted how to instrument your application with structured logging. While it is a good way to make your events more searchable, filterable and slice-and-diceable, it has some limitations:
- Although you can correlate logs between services using `x-request-id` or `x-correlation-id`, it is very much a linear correlation that doesn't fly with divergent paths.
- You can also time each log event, but it is a laborious process (e.g. in Python you might use a timer decorator, but there is no such magic in Golang).
- It generally isn't that great when it comes to capturing the flow of the request through the services.
To address some of these pain points we will now cover tracing, specifically how to do tracing using OTel.
What we won't cover
- How to do logging and metrics with OTEL - Both are indeed offered by OTel, but they are not the focus of this article. Personally, I also find logging and metrics in OTel overly committee-driven, and the implementations can be quite heavy-handed and complex, so I avoid them if I can.
- How to set up a Byzantine OTEL pipeline. In this article we will focus more on instrumenting the application; besides, if you use the opinionated setup from Part I you rarely need anything extra.
Setup OTEL SDK
In this section we will set up a Golang application using OTEL in the most vendor-agnostic way.
Typically I will more or less set up a `traces` package as follows:
package traces
import (
"context"
"errors"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/exporters/otlp/otlptrace"
"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
"go.opentelemetry.io/otel/propagation"
"go.opentelemetry.io/otel/sdk/trace"
)
var Tracer = otel.Tracer("my-shiny-app")
// SetupOTelSDK bootstraps the OpenTelemetry pipeline.
// If it does not return an error, make sure to call shutdown for proper cleanup.
func SetupOTelSDK(ctx context.Context) (shutdown func(context.Context) error, err error) {
var shutdownFuncs []func(context.Context) error
// shutdown calls cleanup functions registered via shutdownFuncs.
// The errors from the calls are joined.
// Each registered cleanup will be invoked once.
shutdown = func(ctx context.Context) error {
var err error
for _, fn := range shutdownFuncs {
err = errors.Join(err, fn(ctx))
}
shutdownFuncs = nil
return err
}
prop := propagation.NewCompositeTextMapPropagator(
propagation.TraceContext{},
propagation.Baggage{},
)
otel.SetTextMapPropagator(prop)
traceExporter, err := otlptrace.New(ctx, otlptracehttp.NewClient())
if err != nil {
return nil, err
}
tracerProvider := trace.NewTracerProvider(trace.WithBatcher(traceExporter))
shutdownFuncs = append(shutdownFuncs, tracerProvider.Shutdown)
otel.SetTracerProvider(tracerProvider)
return
}
PS: I very much copy 'n' pasted this from either Grafana Tempo or the opentelemetry-go official documentation and stripped off the metrics part; that being said, I can no longer find the original source. As you can see, the code is very much vendor agnostic, since everything is abstracted away behind the OTLP protocol.
In the main function we start the OTel SDK via:
// Set up OpenTelemetry.
otelShutdown, err := traces.SetupOTelSDK(ctx)
if err != nil {
logger.G(ctx).Warn("Cannot setup Otel")
}
// Handle shutdown properly so nothing leaks.
defer func() {
	// Guard against a failed setup, in which case otelShutdown is nil.
	if otelShutdown == nil {
		return
	}
	ctx := context.TODO()
	if err := otelShutdown(ctx); err != nil {
		logger.G(ctx).Error("error with shutting down the otel collector")
	}
}()
A note on the OTel startup error: we only soft-warn, as it is not critical to the up-and-running of the application.
To make sure that the traces are properly shipped to the tracing backend, you need to make sure that the right OTLP endpoint is specified via the OTLP environment variables. If you are using the opinionated setup from Part I, all you need is to populate the following env vars into your container's env var config.
- name: OTEL_EXPORTER_OTLP_INSECURE
  value: "true"
- name: OTEL_EXPORTER_OTLP_ENDPOINT
  value: http://grafana-k8s-monitoring-grafana-agent.observability.svc.cluster.local:4318
- name: OTEL_EXPORTER_OTLP_PROTOCOL
  value: http/protobuf
- name: OTEL_SERVICE_NAME
  valueFrom:
    fieldRef:
      apiVersion: v1
      fieldPath: metadata.labels['app']
PS: `OTEL_SERVICE_NAME` is the name under which the service appears in the tracing backend. I often just populate the app label as the service name via the Kubernetes downward API, as shown above. Note that the endpoint above is the OTLP/HTTP port (4318), matching the `otlptracehttp` exporter used in the code.
Instrumenting the application
The Boilerplate
Instrumenting the application is as simple as adding the following boilerplate to each function you want to trace:
// attribute, codes and trace below come from go.opentelemetry.io/otel/attribute,
// go.opentelemetry.io/otel/codes and go.opentelemetry.io/otel/trace respectively.
func TheFunction(ctx context.Context, ...) error {
ctx, span := traces.Tracer.Start(ctx, "xxxpackage.TheFunction", trace.WithAttributes(
attribute.String("somekey", "somevalue"),
attribute.Int("someotherkey", 123),
))
defer span.End()
// Do some work
importantStuff, err := runImportantStuff()
if err != nil {
span.SetStatus(codes.Error, err.Error())
return err
}
if span.IsRecording() {
span.SetAttributes(attribute.String("important_stuff", importantStuff))
}
return nil
}
But to make sure that the spans are properly correlated, the context must ALWAYS be propagated to the downstream calls.
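To illustrate (with made-up function names), passing the ctx returned by Start into the next call is what makes the child span nest under the parent:
func Parent(ctx context.Context) error {
	// The returned ctx carries Parent's span; pass THIS ctx downstream.
	ctx, span := traces.Tracer.Start(ctx, "xxxpackage.Parent")
	defer span.End()
	// Child starts its span from the propagated ctx, so it becomes a
	// child span of Parent within the same trace.
	return Child(ctx)
}

func Child(ctx context.Context) error {
	_, span := traces.Tracer.Start(ctx, "xxxpackage.Child")
	defer span.End()
	// Do some work
	return nil
}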
Libraries that support OTEL
Besides adding the boilerplate above and propagating the trace-correlated context, we also need to use libraries that support OTEL so that spans propagate across service boundaries.
Here are some of the libraries you are likely to use:
OTEL HTTP
otelhttp is part of the opentelemetry-go-contrib project. Practically, using it is as simple as injecting the transport into the `http.Client` from `net/http`.
client := http.Client{
Transport: otelhttp.NewTransport(http.DefaultTransport),
}
// **ALWAYS** use `http.NewRequestWithContext` to create the request.
req, err := http.NewRequestWithContext(ctx, "GET", "https://example.com", nil)
if err != nil {
return err
}
As the comment above suggests, ALWAYS use `http.NewRequestWithContext` to make sure that the spans are properly correlated.
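For completeness, a minimal sketch of the rest of the call (the response handling is illustrative):
// Because req carries ctx, the otelhttp transport creates a client span
// under whatever span is in ctx and injects the traceparent header for
// the downstream service.
resp, err := client.Do(req)
if err != nil {
	return err
}
defer resp.Body.Close()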
Gin
otelgin is also part of the opentelemetry-go-contrib project. Utilising it is very much a no-brainer: just inject the middleware into your Gin engine.
r := gin.New()
r.Use(otelgin.Middleware("my-web-app"))
Later down the middleware chain you can pretty much instrument your handler as follows:
func TheHandler(c *gin.Context) {
ctx, span := traces.Tracer.Start(c.Request.Context(), "xxxpackage.TheHandler")
defer span.End()
	// Do some work, passing ctx to downstream calls so their spans stay correlated
}
gRPC
otelgrpc is also part of the opentelemetry-go-contrib project; again, utilising it is pretty much a no-brainer:
// for client
conn, err := grpc.NewClient(address,
grpc.WithStatsHandler(otelgrpc.NewClientHandler()), // This is the important part
// ...
)
// for server
grpc.NewServer(
grpc.StatsHandler(otelgrpc.NewServerHandler()), // This is the important part
// ...
)
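As with HTTP, passing the per-call context is what links the client span to the surrounding trace. A quick sketch with a hypothetical generated client (`pb.NewGreeterClient` and `SayHello` are placeholders, not from this article):
// `pb`, `NewGreeterClient` and `SayHello` stand in for your own generated
// gRPC client and method.
client := pb.NewGreeterClient(conn)
// Passing ctx here ties this RPC's span to the caller's span.
resp, err := client.SayHello(ctx, &pb.HelloRequest{Name: "otel"})
if err != nil {
	return err
}
_ = resp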
Instrumenting the database
These are very straightforward; just follow the instructions in the link above.
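As one example (an assumption on my part, not necessarily the wrapper the link refers to), the github.com/XSAM/otelsql package wraps database/sql so that queries executed with a context show up as spans:
// Assumes github.com/XSAM/otelsql and a registered "pgx" driver
// (github.com/jackc/pgx/v5/stdlib); both are assumptions for this sketch.
// semconv is one of the go.opentelemetry.io/otel/semconv versions that
// define DBSystemPostgreSQL.
db, err := otelsql.Open("pgx", dsn,
	otelsql.WithAttributes(semconv.DBSystemPostgreSQL),
)
if err != nil {
	return err
}
// Use the *Context query variants so the DB spans join the trace.
row := db.QueryRowContext(ctx, "SELECT now()")
_ = row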
Sampling
Once the number of traces reaches a certain critical mass, it becomes impractical to store all of them. A trade-off likely needs to be made among accuracy, cost and performance.
Sampling is a technique of selecting a representative subset of trace data to collect, store and analyse, rather than capturing every single trace event. Statistically speaking, with a large enough base of data the sampled subset will still be representative of the whole.
Several sampling strategies are described in the Observability Engineering book (yes, it's free on the Honeycomb website). The sampling chapter is a joy to read and I highly recommend it, but I won't repeat the content here.
Practically there are two types of sampling:
- Head-based sampling - This is where the sampling decision is made up front, based on the trace ID of the root span. Once it's been decided that a trace will be sampled, all the spans within the trace will be sampled.
- Tail-based sampling - This is where the sampling decision is made once all the spans within the trace have been collected. In practice I have yet to see this being used.
By default OTel uses `trace.ParentBased(trace.AlwaysSample())`, which samples all traces. To do head-based sampling at a fixed ratio you can do the following:
tracerProvider := trace.NewTracerProvider(
// ...
// Sample 10% of the traces
trace.WithSampler(trace.ParentBased(trace.TraceIDRatioBased(0.1))),
// ...
)
What to use when you have to do it from scratch on a big system
The sections above highlighted how to instrument an application with OTel in an artisan manner. However, if you have a large estate of microservices that you want to start instrumenting with OTel, this might not be the approach that gives you the best momentum.
From my experience, to get the OTel initiative off the ground and get buy-in and value from day one, I would recommend the following approach:
- Have an opinionated setup that provides Otel capabilities out of the box, like the setup in Part I
- Start with auto-instrumentation using libraries like opentelemetry-go-instrumentation
- Enable OTel emission at the edge, such as Nginx or Traefik, so that you have a legit root span for every request. Both have OTel support out of the box, and it is exposed by their Helm charts:
- https://kubernetes.github.io/ingress-nginx/user-guide/third-party-addons/opentelemetry/
- https://doc.traefik.io/traefik/observability/tracing/opentelemetry/
- From the edge slowly start to manually instrument the services using the artisan approach in this article.
The OTel ecosystem is overwhelmingly large, and it is very easy to drown in the details. The whole point of this article is to highlight an opinionated approach, and to encourage you to stop overthinking and just do it :).
Final Thoughts
Tracing is a very powerful tool to have in your observability stack. From my personal experience it almost makes logging obsolete. At the same time it goes hand in hand with metrics:
- Metrics provide you with a signal that is biased towards action. However, due to their low-cardinality nature they only tell you what, but not why.
- Tracing, on the other hand, provides you with high-cardinality attributes/context attached to the data flow, which allows you to slice and dice, and makes debugging production issues so much easier. That being said, by its nature it also comes with a "too much information" problem, which can be compensated for by the presence of metrics.
I hope you find this guide useful. If you have any questions or feedback please reach out to me on jingkai@hey.com or hit me on X.