Observability From Zero to Hero - Part I: Up and Running

Over the years I've helped three different customers build out a Grafana LGTM (Loki, Grafana, Tempo, Mimir, plus the more recently introduced Pyroscope) based observability solution. In retrospect it is a fairly easy feat, yet I have also observed that clients quite often get off the beaten track because they:

  1. Don't have the capability to implement it in house.
  2. Get distracted by the hundreds of options and possible configurations.
  3. Focus on tackling a "problem" that is not a problem in the first place.
  4. Spend too much time setting up the infra instead of having something that provides business value from the get-go.

In the next few weeks I will be posting a series of articles on how to build an observability solution from zero. Here is a non-exhaustive list of topics I will cover:

  • The golden path to getting the Grafana LGTM-based observability stack up and running on your Kubernetes cluster.
  • Idiot-proof ways to get new services onboarded onto the observability platform.
  • Some advanced usage of the LGTM stack that gives you superpowers (spoiler alert: some of it isn't even covered in the documentation).

Areas that I will not cover

To save you time, here are the areas I will not be covering:

  • Building and operating Prometheus/Mimir, Loki and Tempo - that is a completely different beast that deserves its own series. We will focus solely on effectively collecting and managing metrics, logs, traces and profiles on the Kubernetes cluster (which is arguably challenging enough for most teams). I assume you are using an LGTM service that is either provided by Grafana Cloud or already pre-configured for you.
  • The Vector aggregator. In this opinionated approach we push logs, metrics, traces and profiles directly to the LGTM stack, since I assume the audience are responsible adults who make sure to push structured events to the log facility and have basic fluency in managing metrics cardinality. That said, the setup is fully compatible with Vector, apart from the Pyroscope profiles (which I'm not sure Vector supports).

In this first part we will focus on getting the observability stack up and running.

Part I: Up and Running

Prerequisites

  • Have a Grafana Cloud account set up. They have a free tier which is more than enough to get started. Personally I use their pay-as-you-go Pro tier subscription for my bare-metal estate's observability; in practice it costs $0 a month.
  • Have a Kubernetes cluster up and running. If you are not a Kubernetes expert, I would recommend using a managed Kubernetes offering such as GKE or EKS.
  • The examples below use Terraform for managing the infrastructure.
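
For completeness, here is a minimal sketch of the provider wiring the snippets below assume (helm provider v2 block syntax; the project ID and kubeconfig path are placeholders - point them at your own GCP project and cluster credentials):

terraform {
  required_providers {
    google     = { source = "hashicorp/google" }
    kubernetes = { source = "hashicorp/kubernetes" }
    helm       = { source = "hashicorp/helm" }
  }
}

provider "google" {
  project = "your-gcp-project" # placeholder
}

provider "kubernetes" {
  config_path = "~/.kube/config" # or wire in your GKE/EKS cluster credentials
}

provider "helm" {
  kubernetes {
    config_path = "~/.kube/config"
  }
}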

Step 1: Create the tokens

In my opinion Grafana absolutely nailed it in terms of its tech. My only complaints are:

  • They just keep changing the UI, which is hard to keep up with.
  • Their documentation has always been playing catch-up with product development.

As it stands this is how you get the tokens:

  • On the left-hand side of https://${your-org}.grafana.net, click "Infrastructure -> Kubernetes -> Configuration".
  • Go straight to the Access Policy Token section, give the token a name and click "Create token". Note this only gives you metrics read & write, logs write and traces write access. NB: you can absolutely use the Helm or Terraform snippets on the same page to get the stack up and running on your Kubernetes cluster, however the capability is fairly lacking IMO; besides, it assumes you will write the secrets in plain text as IaC, which is suboptimal.
  • On the same page, take note of:
    • The name of the token.
    • The token that has just been created.
    • The Prometheus, Tempo and Loki endpoints.
    • The Prometheus, Tempo and Loki usernames.
  • Go to the Cloud Access Policies page at https://${your-org}.grafana.net/a/grafana-auth-app.
  • Find stack-xxxx-YOUR-TOKEN-NAME in the list and grant it the profiles:write scope so that it can write profiles to Pyroscope.
  • Go to https://${your-org}.grafana.net/connections/datasources/edit/grafanacloud-profiles, where you will find the username for the Pyroscope datasource. Take note of it.
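
The Terraform module in Step 2 expects this Access Policy token to live in GCP Secret Manager under the name grafana-creds. Here is a minimal sketch of seeding it, assuming a recent hashicorp/google provider (the variable and resource names are my own):

variable "grafana_cloud_token" {
  description = "The Access Policy token created above"
  type        = string
  sensitive   = true
}

resource "google_secret_manager_secret" "grafana_creds" {
  secret_id = "grafana-creds" # must match the secret name the module reads

  replication {
    auto {}
  }
}

resource "google_secret_manager_secret_version" "grafana_creds" {
  secret      = google_secret_manager_secret.grafana_creds.id
  secret_data = var.grafana_cloud_token
}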

Step 2: Deploy the Grafana Alloy stack into your Kubernetes cluster

Here is a fairly opinionated config to fully leverage the LGTM stack:

variable "gcp_project" {
  type        = string
  description = "The GCP project to use"
}

variable "cluster_name" {
  type        = string
  description = "The name of the cluster"
}

variable "k8s_monitoring_chart_repo" {
  type    = string
  default = "https://grafana.github.io/helm-charts"
}

variable "k8s_monitoring_chart_version" {
  description = "version of the k8s monitoring chart"
  type        = string
  default     = "1.5.0"
}

variable "prom_host" {
  type        = string
  description = "The host for the Prometheus instance"
  default     = "https://prometheus-prod-01-eu-west-0.grafana.net"
}

variable "loki_host" {
  type        = string
  description = "The host for the Loki instance"
  default     = "https://logs-prod-eu-west-0.grafana.net"
}

variable "tempo_host" {
  type        = string
  description = "The host for the Tempo instance"
  default     = "https://tempo-eu-west-0.grafana.net:443"
}

variable "pyroscope_host" {
  type        = string
  description = "The host for the Pyroscope instance"
  default     = "https://profiles-prod-010.grafana.net"
}

variable "grafana_usernames" {
  type = object({
    prometheus = string
    tempo      = string
    loki       = string
    pyroscope  = string
  })
  description = "The usernames for the Grafana instance"
}

variable "metrics_external_labels" {
  description = "external labels"
  type        = map(string)

  default = {}
}

data "google_secret_manager_secret_version" "grafana_creds" {
  project = var.gcp_project
  secret  = "grafana-creds"
}

resource "kubernetes_namespace" "observability" {
  metadata {
    name = "observability"
  }
}

resource "kubernetes_secret" "prom_creds" {
  metadata {
    name      = "prom-creds"
    namespace = kubernetes_namespace.observability.metadata.0.name
  }

  data = {
    host     = var.prom_host
    username = var.grafana_usernames.prometheus
    password = data.google_secret_manager_secret_version.grafana_creds.secret_data
  }
}

resource "kubernetes_secret" "tempo_creds" {
  metadata {
    name      = "tempo-creds"
    namespace = kubernetes_namespace.observability.metadata.0.name
  }

  data = {
    host     = var.tempo_host
    username = var.grafana_usernames.tempo
    password = data.google_secret_manager_secret_version.grafana_creds.secret_data
  }
}

resource "kubernetes_secret" "loki_creds" {
  metadata {
    name      = "loki-creds"
    namespace = kubernetes_namespace.observability.metadata.0.name
  }

  data = {
    host     = var.loki_host
    username = var.grafana_usernames.loki
    password = data.google_secret_manager_secret_version.grafana_creds.secret_data
  }
}

resource "kubernetes_secret" "pyroscope_creds" {
  metadata {
    name      = "pyroscope-creds"
    namespace = kubernetes_namespace.observability.metadata.0.name
  }

  data = {
    host     = var.pyroscope_host
    username = var.grafana_usernames.pyroscope
    password = data.google_secret_manager_secret_version.grafana_creds.secret_data
  }
}

// ******************** RBAC start ********************
// Extra RBAC so the Alloy agents can scrape protected non-resource URLs
// (/metrics and /debug/pprof/*) - see the summary notes after the module.
resource "kubernetes_service_account" "alloy" {
  metadata {
    name      = "alloy"
    namespace = kubernetes_namespace.observability.metadata.0.name
  }
}

resource "kubernetes_cluster_role" "metrics" {
  metadata {
    name = "metrics"
  }
  rule {
    non_resource_urls = ["/metrics"]
    verbs             = ["get"]
  }
}

resource "kubernetes_cluster_role_binding" "alloy" {
  metadata {
    name = "alloy"
  }
  role_ref {
    kind      = "ClusterRole"
    name      = kubernetes_cluster_role.metrics.metadata.0.name
    api_group = "rbac.authorization.k8s.io"
  }
  subject {
    kind      = "ServiceAccount"
    name      = kubernetes_service_account.alloy.metadata.0.name
    namespace = kubernetes_namespace.observability.metadata.0.name
  }
}

resource "kubernetes_service_account" "alloy_profiler" {
  metadata {
    name      = "alloy-profiler"
    namespace = kubernetes_namespace.observability.metadata.0.name
  }
}

resource "kubernetes_cluster_role" "profiles" {
  metadata {
    name = "profiles"
  }
  rule {
    non_resource_urls = ["/debug/pprof/", "/debug/pprof/*"]
    verbs             = ["get"]
  }
}

resource "kubernetes_cluster_role_binding" "alloy_profiler" {
  metadata {
    name = "alloy-profiler"
  }
  role_ref {
    kind      = "ClusterRole"
    name      = kubernetes_cluster_role.profiles.metadata.0.name
    api_group = "rbac.authorization.k8s.io"
  }
  subject {
    kind      = "ServiceAccount"
    name      = kubernetes_service_account.alloy_profiler.metadata.0.name
    namespace = kubernetes_namespace.observability.metadata.0.name
  }
}
// ******************** RBAC done ********************

resource "helm_release" "grafana-k8s-monitoring" {
  name             = "grafana-k8s-monitoring"
  repository       = var.k8s_monitoring_chart_repo
  version          = var.k8s_monitoring_chart_version
  chart            = "k8s-monitoring"
  namespace        = kubernetes_namespace.observability.metadata[0].name
  create_namespace = true
  atomic           = true
  timeout          = 300

  values = [
    yamlencode({
      cluster = {
        name = var.cluster_name
      },
      alloy-profiles = {
        serviceAccount = {
          create = false
          name   = kubernetes_service_account.alloy_profiler.metadata.0.name
        }
      },
      alloy = {
        serviceAccount = {
          create = false
          name   = kubernetes_service_account.alloy.metadata.0.name
        }
      },
      externalServices = {
        prometheus = {
          host      = var.prom_host
          basicAuth = {}
          secret = {
            create    = false
            name      = kubernetes_secret.prom_creds.metadata[0].name
            namespace = kubernetes_secret.prom_creds.metadata[0].namespace
          }
        },
        loki = {
          host      = var.loki_host
          basicAuth = {}
          secret = {
            create    = false
            name      = kubernetes_secret.loki_creds.metadata[0].name
            namespace = kubernetes_secret.loki_creds.metadata[0].namespace
          }
        },
        tempo = {
          host      = var.tempo_host
          basicAuth = {}
          secret = {
            create    = false
            name      = kubernetes_secret.tempo_creds.metadata[0].name
            namespace = kubernetes_secret.tempo_creds.metadata[0].namespace
          }
        },
        pyroscope = {
          host      = var.pyroscope_host
          basicAuth = {}
          secret = {
            create    = false
            name      = kubernetes_secret.pyroscope_creds.metadata[0].name
            namespace = kubernetes_secret.pyroscope_creds.metadata[0].namespace
          }
        }
      },
      metrics = {
        enabled = true
        cost = {
          enabled = false
        },
        node-exporter = {
          enabled = true
        }
      },
      logs = {
        enabled = true
        pod_logs = {
          enabled = true
        },
        cluster_events = {
          enabled = true
        }
      },
      traces = {
        enabled = true
      },
      profiles = {
        enabled = true
        java = {
          enabled = false
        }
        ebpf = {
          enabled = true
        }
      },
      receivers = {
        grpc = {
          enabled = true
        },
        http = {
          enabled = true
        },
        zipkin = {
          enabled = false
        }
      },
      opencost = {
        enabled = false
      },
      kube-state-metrics = {
        enabled = true
      },
      prometheus-node-exporter = {
        enabled = true
      },
      prometheus-operator-crds = {
        enabled = true
      }
    })
  ]
}

This module uses GCP Secret Manager to store the Grafana API token in a secret called grafana-creds. It can easily be swapped for something else, such as the External Secrets Operator or HashiCorp Vault, as desired.
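
For example, swapping to Vault is a small change - a sketch assuming the hashicorp/vault provider and a KV v2 secret at secret/grafana-creds with a token key:

data "vault_kv_secret_v2" "grafana_creds" {
  mount = "secret"        # assumed KV v2 mount
  name  = "grafana-creds" # assumed secret path
}

# Then, inside the module, reference it instead of the Secret Manager data source:
# password = data.vault_kv_secret_v2.grafana_creds.data["token"]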

To deploy it:

module "grafana-alloy" {
  source       = "../../modules/grafana-alloy"
  gcp_project  = "$YOUR_GCP_PROJECT" # for accessing secret manager only
  cluster_name = "$YOUR_CLUSTER_NAME"
  grafana_usernames = {
    prometheus = "$PROM_USERNAME_YOU_NOTED_DOWN"
    tempo      = "$TEMPO_USERNAME_YOU_NOTED_DOWN"
    loki       = "$LOKI_USERNAME_YOU_NOTED_DOWN"
    pyroscope  = "$PYRO_USERNAME_YOU_NOTED_DOWN"
  }
}

To summarise what the module does:

  • It enables Prometheus, Loki, Tempo and Pyroscope data collection in your Kubernetes cluster.
  • For metrics, it collects node-exporter, cAdvisor and kube-state-metrics metrics out of the box. It also takes care of installing node-exporter and kube-state-metrics.
  • All logs and cluster events are collected into Loki.
  • On each node in the cluster it deploys the alloy-profiles DaemonSet agent for application profiling. In the example above it:
    • Disables the default Java collection while leaving pprof collection enabled.
    • Enables eBPF collection, meaning the majority of pods in the cluster will be auto-instrumented.
    • In practice you might need some label dropping and namespace allow-listing; we will cover that in future articles.
  • If you have a keen eye you might have noticed the extra non-resource URL RBAC added for the alloy and alloy-profiler service accounts. This is for collecting sensitive data such as pprof profiles and metrics endpoints that expose sensitive information. This too we will cover in greater detail in the future.
  • We have also enabled the "legacy" Prometheus Operator CRDs to support ServiceMonitor- and PodMonitor-based metrics scraping; see the sketch below.
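
As a taste of what those CRDs enable, here is a minimal ServiceMonitor sketch declared through Terraform's kubernetes_manifest resource (the app name, namespace and port are hypothetical placeholders):

resource "kubernetes_manifest" "my_app_service_monitor" {
  manifest = {
    apiVersion = "monitoring.coreos.com/v1"
    kind       = "ServiceMonitor"
    metadata = {
      name      = "my-app" # hypothetical application name
      namespace = "default"
    }
    spec = {
      selector = {
        matchLabels = {
          app = "my-app" # must match your Service's labels
        }
      }
      endpoints = [
        {
          port     = "http-metrics" # named port on the Service
          path     = "/metrics"
          interval = "30s"
        }
      ]
    }
  }
}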

Step 3: Verify the deployment

Go to https://${YOUR_ORG}.grafana.net/explore to explore the metrics, logs, traces and profiles.

Final thoughts

In this article we have covered how to set up the observability foundation on your Kubernetes cluster. Judging by the config it might seem easy, but behind the scenes there was a lot of trial and error: over the past two years Grafana has evolved the stack from Grafana Agent to Agent Flow and then to Alloy, and as a result there are a lot of moving parts.

In the next article I will cover some of the features that can massively accelerate your service metrics onboarding.