
Observability Engineering: ELK Stack, Prometheus, and OpenTelemetry in Practice

A practitioner's guide to building real observability — structured logging with ELK, metrics with Prometheus/Grafana, and distributed tracing with OpenTelemetry — drawn from 8 years of production experience at CDAC and HDFC Bank.

#observability
#elk-stack
#prometheus
#opentelemetry
#spring-boot

Why Observability is an Engineering Discipline, Not a Tool

After 8 years working on distributed systems — first at CDAC and now at HDFC Bank — I've found that the single biggest factor separating teams that recover from incidents in minutes from teams that take hours is observability. Not the specific tools. The discipline.

Observability means you can answer "what is my system doing right now?" without needing a new deployment or a code change. This post covers the three pillars — logs, metrics, traces — with concrete patterns from production.

Pillar 1: Structured Logging with ELK

At CDAC, our first instinct when something broke was to ssh into a server and grep through /var/log. This doesn't scale past two services. The ELK stack (Elasticsearch, Logstash, Kibana) centralises logs and makes them queryable.

The most important investment: structured logging. Plain-text log lines are for humans to read; JSON log lines are for machines to index and query.

// kv(...) is a static import of net.logstash.logback.argument.StructuredArguments.kv

// Bad: unstructured — only grep-able
log.error("Payment failed for user 12345 after 3 retries: timeout");

// Good: structured — filterable in Kibana on any field
log.error("Payment processing failed",
    kv("userId", userId),
    kv("transactionId", txnId),
    kv("attemptCount", 3),
    kv("failureReason", "TIMEOUT"),
    kv("durationMs", elapsed)
);

Configure Logback to emit JSON with the logstash-logback-encoder:

<appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder">
        <includeMdcKeyName>correlationId</includeMdcKeyName>
        <includeMdcKeyName>userId</includeMdcKeyName>
        <includeMdcKeyName>traceId</includeMdcKeyName>
    </encoder>
</appender>

Every log line automatically includes the MDC fields, so you can filter in Kibana with correlationId: "abc-123" and see every log event across every service for that request.

Correlation IDs are mandatory. Set them in a servlet filter at the gateway and propagate via HTTP headers:

// jakarta.* packages for Spring Boot 3; on Boot 2 these are javax.servlet.*
import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import org.slf4j.MDC;
import org.springframework.stereotype.Component;

import java.io.IOException;
import java.util.Optional;
import java.util.UUID;

@Component
public class CorrelationIdFilter implements Filter {
    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        String id = Optional.ofNullable(((HttpServletRequest) req)
            .getHeader("X-Correlation-ID"))
            .orElse(UUID.randomUUID().toString());

        MDC.put("correlationId", id);
        ((HttpServletResponse) res).setHeader("X-Correlation-ID", id);
        try {
            chain.doFilter(req, res);
        } finally {
            MDC.clear();
        }
    }
}
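Propagating the ID outbound matters just as much: every HTTP call to a downstream service should carry the same header. Here is a minimal sketch using the JDK's HttpClient request builder, with a ThreadLocal standing in for SLF4J's MDC (CorrelationContext and withCorrelation are illustrative names, not part of any library):

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Optional;
import java.util.UUID;

// Illustrative stand-in for SLF4J's MDC: a per-thread correlation store.
final class CorrelationContext {
    private static final ThreadLocal<String> ID = new ThreadLocal<>();

    static String getOrCreate() {
        return Optional.ofNullable(ID.get()).orElseGet(() -> {
            String fresh = UUID.randomUUID().toString();
            ID.set(fresh);
            return fresh;
        });
    }

    static void set(String id) { ID.set(id); }
    static void clear() { ID.remove(); }
}

final class CorrelatedRequests {
    // Attach the current thread's correlation ID to an outbound request,
    // so the downstream service logs under the same ID.
    static HttpRequest withCorrelation(HttpRequest.Builder builder) {
        return builder
            .header("X-Correlation-ID", CorrelationContext.getOrCreate())
            .build();
    }
}
```

In a Spring app the same effect comes from a ClientHttpRequestInterceptor on RestTemplate or an ExchangeFilterFunction on WebClient, reading the ID from MDC instead of a hand-rolled ThreadLocal.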

Pillar 2: Metrics with Prometheus and Grafana

Logs tell you what happened. Metrics tell you how your system is behaving over time. At HDFC Bank, configuring Prometheus/Grafana dashboards was one of my first deliverables.

Spring Boot Actuator + Micrometer gives you metrics for free:

management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  metrics:
    export:
      prometheus:
        enabled: true   # Spring Boot 2.x; on Boot 3.x this moved to management.prometheus.metrics.export.enabled
  endpoint:
    health:
      show-details: always

Prometheus scrapes /actuator/prometheus every 15 seconds. You get JVM metrics (heap, GC, threads), HTTP metrics (request count, latency histograms), and HikariCP pool metrics — all without a line of custom instrumentation.

For business metrics, add custom counters:

@Service
public class PaymentService {
    private final Counter paymentSuccessCounter;
    private final Counter paymentFailureCounter;
    private final Timer paymentTimer;

    public PaymentService(MeterRegistry registry) {
        paymentSuccessCounter = Counter.builder("payment.processed")
            .tag("status", "success").register(registry);
        paymentFailureCounter = Counter.builder("payment.processed")
            .tag("status", "failure").register(registry);
        paymentTimer = Timer.builder("payment.duration")
            .publishPercentiles(0.5, 0.95, 0.99)
            .register(registry);
    }

    public void process(PaymentRequest request) {
        paymentTimer.record(() -> {
            try {
                executePayment(request);   // placeholder for the actual payment call
                paymentSuccessCounter.increment();
            } catch (RuntimeException e) {
                paymentFailureCounter.increment();
                throw e;
            }
        });
    }
}

The publishPercentiles on the timer is key — you want P99 latency visible in Grafana, not just average latency. Averages hide tail latency misery.
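To see just how much an average can hide, here is a small self-contained demo with invented numbers (LatencyStats is an illustrative class, using the nearest-rank percentile definition):

```java
import java.util.Arrays;

final class LatencyStats {
    // Nearest-rank percentile: the value at 1-based position ceil(q * n)
    // in the sorted samples, i.e. index ceil(q * n) - 1.
    static double percentile(double[] samples, double q) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int index = (int) Math.ceil(q * sorted.length) - 1;
        return sorted[Math.max(index, 0)];
    }

    static double mean(double[] samples) {
        return Arrays.stream(samples).average().orElse(0.0);
    }
}
```

With 95 requests at 20 ms and 5 at 2000 ms, the mean is 119 ms while the P99 is 2000 ms: a dashboard showing only the average reports a healthy-looking service while one user in twenty waits two full seconds.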

Key Grafana panels every service should have:

  • Request rate (RPS) — rate(http_server_requests_seconds_count[1m])
  • Error rate — filter by status=~"5.."
  • P99 latency — histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket[5m])) by (le))
  • JVM heap usage — jvm_memory_used_bytes / jvm_memory_max_bytes
  • DB pool active connections — hikaricp_connections_active

Pillar 3: Distributed Tracing with OpenTelemetry

Logs and metrics tell you that something is wrong. Traces tell you where in a multi-service flow it went wrong. At HDFC Bank, I integrated OpenTelemetry tracing into the standardized observability library so every service gets tracing for free.

Spring Boot 3 supports OpenTelemetry via Micrometer Tracing:

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>

And the matching configuration in application.yml:

management:
  tracing:
    sampling:
      probability: 0.1   # 10% in prod; 1.0 in dev/staging
  otlp:
    tracing:
      endpoint: http://otel-collector:4318/v1/traces

Trace context propagates automatically across RestTemplate, WebClient, and Spring MVC. For Kafka, you need to manually propagate headers (the library we built at HDFC Bank handles this via a custom ProducerInterceptor).
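What actually travels in those headers is the W3C Trace Context traceparent value: a version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and flags. A minimal sketch of encoding and parsing it (TraceParent is an illustrative class; in practice the OpenTelemetry propagator API does this for you):

```java
// Encode/parse a W3C Trace Context "traceparent" header, version 00:
//   00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>
final class TraceParent {
    final String traceId;   // 32 lowercase hex chars
    final String parentId;  // 16 lowercase hex chars
    final boolean sampled;  // bit 0 of the flags byte

    TraceParent(String traceId, String parentId, boolean sampled) {
        this.traceId = traceId;
        this.parentId = parentId;
        this.sampled = sampled;
    }

    String encode() {
        return "00-" + traceId + "-" + parentId + "-" + (sampled ? "01" : "00");
    }

    static TraceParent parse(String header) {
        String[] parts = header.split("-");
        if (parts.length != 4 || !"00".equals(parts[0])) {
            throw new IllegalArgumentException("unsupported traceparent: " + header);
        }
        return new TraceParent(parts[1], parts[2],
                (Integer.parseInt(parts[3], 16) & 1) == 1);
    }
}
```

A ProducerInterceptor then only needs to write this value into a Kafka record header (conventionally named traceparent) on send, and the consumer extracts it before starting its own span, linking both sides into one trace.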

Sampling strategy matters. 100% sampling at production TPS is cost-prohibitive. We sample:

  • 100% of all error responses (5xx)
  • 100% of slow requests (above P95 threshold)
  • 10% of normal traffic

This means incidents are always captured, while routine traffic is sampled economically.
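Note that keeping 100% of errors and slow requests means deciding after the response is known, which is typically done with tail-based sampling in the OpenTelemetry Collector rather than in the application. The decision logic itself is simple; a sketch of the policy above (SamplingPolicy and shouldKeep are illustrative names, with the P95 threshold passed in):

```java
import java.util.concurrent.ThreadLocalRandom;

final class SamplingPolicy {
    // Keep every error and every slow request; sample the rest at 10%.
    // 'roll' is a uniform random draw in [0, 1), injected for testability.
    static boolean shouldKeep(int httpStatus, long durationMs,
                              long p95ThresholdMs, double roll) {
        if (httpStatus >= 500) return true;           // 100% of 5xx responses
        if (durationMs > p95ThresholdMs) return true; // 100% of slow requests
        return roll < 0.10;                           // 10% of normal traffic
    }

    static boolean shouldKeep(int httpStatus, long durationMs, long p95ThresholdMs) {
        return shouldKeep(httpStatus, durationMs, p95ThresholdMs,
                ThreadLocalRandom.current().nextDouble());
    }
}
```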

The Compound Effect

The real value emerges when all three pillars are connected. A Grafana alert fires on elevated error rate → you click the trace ID embedded in the alert annotation → the tracing UI (Zipkin, in our stack) shows the exact span where the failure started → you filter Kibana by the correlation ID attached to that trace → you see the full log context, including the SQL query that timed out.

That end-to-end journey — from alert to root cause — is what took 45 minutes before observability and takes under 10 minutes after. The tools matter less than the discipline of wiring them together from day one.