Microservices Observability Dashboard
Real-time visibility across 40+ services
Overview
A centralised observability platform unifying metrics, traces, and logs for a distributed system of 40+ microservices, reducing MTTR from hours to minutes.
The Problem
Engineers were flying blind across a complex microservices landscape. Incidents took 3–6 hours to diagnose because traces lived in one tool, logs lived in another, and nothing linked the two. On-call engineers needed a single source of truth.
The Solution
Designed and shipped a custom Grafana-based observability platform with three parts: services auto-instrumented with OpenTelemetry, centralised log correlation built on structured logging conventions, and a custom alerting rules engine integrated with Slack and PagerDuty.
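To make the alerting piece concrete, here is a minimal sketch of how a rules engine could evaluate metric samples and pick a notification channel. The project's actual engine is not described here, so the rule names, metric names, thresholds, and the severity-to-channel mapping below are all hypothetical, and nothing is actually sent to Slack or PagerDuty.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Rule:
    name: str         # hypothetical rule name
    metric: str       # hypothetical metric name
    threshold: float  # rule fires when the sample is strictly above this
    severity: str     # "page" or "notify" (assumed convention)


# Illustrative rules only; not taken from the real platform.
RULES: List[Rule] = [
    Rule("HighErrorRate", "http_error_ratio", 0.05, "page"),
    Rule("SlowCheckout", "checkout_p99_seconds", 2.0, "notify"),
]


def evaluate(rules: List[Rule], samples: Dict[str, float]) -> List[Rule]:
    """Return the rules whose metric is present and exceeds its threshold."""
    return [
        r for r in rules
        if samples.get(r.metric, float("-inf")) > r.threshold
    ]


def route(rule: Rule) -> str:
    """Map severity to a channel name: pages go to PagerDuty, the rest to Slack."""
    return "pagerduty" if rule.severity == "page" else "slack"


def fired_routes(samples: Dict[str, float]) -> List[Tuple[str, str]]:
    """Evaluate all rules and pair each fired rule with its channel."""
    return [(r.name, route(r)) for r in evaluate(RULES, samples)]


if __name__ == "__main__":
    print(fired_routes({"http_error_ratio": 0.12, "checkout_p99_seconds": 1.1}))
```

Keeping evaluation separate from routing means a rule can be unit-tested against sample data without touching any external integration.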
Architecture
OpenTelemetry SDKs in all services → Collector agents → Tempo (traces) + Loki (logs) + Prometheus (metrics) → Grafana dashboards. A lightweight correlation service tags each log line with the trace ID of the request that produced it, enabling one-click navigation from a slow trace to its corresponding log stream.
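The trace-ID tagging described above can be illustrated with a short sketch using only the Python standard library. This is not the project's correlation service: the `current_trace_id` context variable, the `TraceIdFilter` and `JsonFormatter` classes, the logger name, and the JSON field names are assumptions made for illustration. In a real service the trace ID would come from the active OpenTelemetry span rather than being set by hand.

```python
import contextvars
import io
import json
import logging

# Hypothetical holder for the active trace ID; set per request.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True  # never drop the record


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so the trace ID is a queryable field."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })


def build_logger(stream, name="checkout"):
    """Build a logger that writes trace-tagged JSON lines to the stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_with_trace(trace_id, message):
    """Log one message under the given trace ID and return the parsed line."""
    stream = io.StringIO()
    logger = build_logger(stream)
    token = current_trace_id.set(trace_id)
    try:
        logger.info(message)
    finally:
        current_trace_id.reset(token)
    return json.loads(stream.getvalue())


if __name__ == "__main__":
    print(log_with_trace("4bf92f3577b34da6a3ce929d0e0e4736", "payment declined"))
```

Once every log line carries a `trace_id` field, a log backend can be queried by that field. Grafana's Loki data source, for example, supports derived fields that turn an extracted trace ID into a link to a tracing data source such as Tempo, which is one plausible way to get the one-click navigation mentioned above.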
Outcome & Impact
MTTR dropped from ~4 hours to under 15 minutes. The on-call engineer satisfaction score improved from 3.2 to 4.7 out of 5. Engineering leadership now uses the dashboards in weekly reliability reviews.