Platform Engineering

Microservices Observability Dashboard

Real-time visibility across 40+ services

Tech Stack

OpenTelemetry, Grafana, Tempo, Loki, Prometheus, Kubernetes, Alertmanager, PagerDuty, Go

Overview

A centralised observability platform that unifies metrics, traces, and logs for a distributed system of 40+ microservices, cutting MTTR from about 4 hours to under 15 minutes.

The Problem

Engineers were flying blind across a complex microservices landscape. Incidents took 3–6 hours to diagnose because traces lived in one tool, logs in another, and there was no correlation between them. On-call engineers needed a single source of truth.

The Solution

Designed and shipped a Grafana-based observability platform with services auto-instrumented via OpenTelemetry, centralised log correlation through structured logging conventions, and a custom alerting rules engine integrated with Slack and PagerDuty.
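
The rules engine itself is not shown here, so the following is only a minimal sketch of how threshold rules might be evaluated and routed. It assumes critical alerts page through PagerDuty and warnings go to Slack; the `Rule`, `Alert`, and `Evaluate` names, the metric names, and the thresholds are all hypothetical.

```go
package main

import "fmt"

// Severity decides which channel an alert is routed to (assumed policy).
type Severity int

const (
	Warning Severity = iota
	Critical
)

// Rule is a hypothetical threshold rule: fire when the metric exceeds Threshold.
type Rule struct {
	Name      string
	Metric    string
	Threshold float64
	Severity  Severity
}

// Alert is a fired rule together with the observed value and target channel.
type Alert struct {
	Rule    Rule
	Value   float64
	Channel string
}

// route maps a severity to a notification channel.
// Assumption: critical alerts page via PagerDuty, everything else goes to Slack.
func route(s Severity) string {
	if s == Critical {
		return "pagerduty"
	}
	return "slack"
}

// Evaluate checks each rule against the latest samples and returns the alerts
// that fired, in rule order. Rules whose metric has no sample are skipped.
func Evaluate(rules []Rule, samples map[string]float64) []Alert {
	var alerts []Alert
	for _, r := range rules {
		v, ok := samples[r.Metric]
		if !ok || v <= r.Threshold {
			continue
		}
		alerts = append(alerts, Alert{Rule: r, Value: v, Channel: route(r.Severity)})
	}
	return alerts
}

func main() {
	rules := []Rule{
		{Name: "HighErrorRate", Metric: "http_error_ratio", Threshold: 0.05, Severity: Critical},
		{Name: "SlowP99", Metric: "p99_latency_seconds", Threshold: 1.5, Severity: Warning},
	}
	samples := map[string]float64{"http_error_ratio": 0.12, "p99_latency_seconds": 0.8}

	for _, a := range Evaluate(rules, samples) {
		fmt.Printf("%s fired (value=%.2f) -> %s\n", a.Rule.Name, a.Value, a.Channel)
	}
	// prints: HighErrorRate fired (value=0.12) -> pagerduty
}
```

Keeping evaluation separate from delivery means the routing policy can be tested without calling Slack or PagerDuty.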

Architecture

OpenTelemetry SDKs in all services → Collector agents → Tempo (traces) + Loki (logs) + Prometheus (metrics) → Grafana dashboards. A lightweight correlation service tags log lines with trace IDs, enabling one-click navigation from a slow trace to its corresponding log stream.
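
As a rough illustration of the correlation step, the sketch below pulls the trace ID out of a W3C `traceparent` value and adds it to a JSON log line. It assumes logs are JSON objects and that the field is called `trace_id`; both are assumptions, as are the function names, and the actual correlation service may work differently.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// traceIDFromTraceparent extracts the trace ID from a W3C traceparent value,
// which has the form "version-traceid-spanid-flags" with a 32-hex-char trace ID.
func traceIDFromTraceparent(header string) (string, error) {
	parts := strings.Split(header, "-")
	if len(parts) != 4 || len(parts[1]) != 32 {
		return "", errors.New("malformed traceparent value")
	}
	return parts[1], nil
}

// tagLogLine adds a trace_id field to a JSON-encoded log line.
// The field name "trace_id" is an assumption, not a confirmed schema.
func tagLogLine(line []byte, traceID string) ([]byte, error) {
	var fields map[string]any
	if err := json.Unmarshal(line, &fields); err != nil {
		return nil, fmt.Errorf("log line is not a JSON object: %w", err)
	}
	if fields == nil { // input was the JSON literal null
		return nil, errors.New("log line is not a JSON object")
	}
	fields["trace_id"] = traceID
	return json.Marshal(fields) // map keys are emitted in sorted order
}

func main() {
	traceID, err := traceIDFromTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		panic(err)
	}
	tagged, err := tagLogLine([]byte(`{"level":"error","msg":"upstream timeout"}`), traceID)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(tagged))
	// prints: {"level":"error","msg":"upstream timeout","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736"}
}
```

With the trace ID present as a structured field, a log query can filter on it directly, which is what makes the jump from a slow trace to its log stream possible.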

Outcome & Impact

MTTR dropped from ~4 hours to under 15 minutes. The on-call engineer satisfaction score rose from 3.2 to 4.7 out of 5. Engineering leadership now uses the dashboards in weekly reliability reviews.
