Observability in Microservices

When you have 50 services, “tailing the logs” is impossible. You need centralized observability.

1. The Three Pillars

  1. Logs: Immutable record of discrete events. (“Error at 10:00 PM”).
  2. Metrics: Aggregated data over time. (“CPU usage is 80%”, “Requests per second: 50”).
  3. Tracing: The path of a single request across multiple services.

2. Distributed Tracing with Zipkin/Micrometer

Spring Boot 3 uses Micrometer Tracing, the successor to Spring Cloud Sleuth.
Dependencies:
  • io.micrometer:micrometer-tracing-bridge-brave
  • io.zipkin.reporter2:zipkin-reporter-brave
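How it works: Every request gets a unique Trace ID (global, shared by every service that handles the request), and each service's unit of work gets its own Span ID (local). These IDs are propagated to downstream services via HTTP headers (traceparent).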
Running Zipkin:
docker run -d -p 9411:9411 openzipkin/zipkin
Config (application.yml):
management:
  tracing:
    sampling:
      probability: 1.0 # Sample 100% of requests (Don't do this in prod!)
Now, when you hit Order Service, which calls Inventory Service, you can see the full timeline in Zipkin UI (http://localhost:9411).
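To make the header propagation concrete, here is a minimal JDK-only sketch of the W3C traceparent value (version, trace ID, span ID, flags). The TraceParent record and its methods are hypothetical names for illustration; in a real application Micrometer Tracing and its bridge do this for you.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Illustrative model of a W3C traceparent header: 00-<32 hex traceId>-<16 hex spanId>-<2 hex flags>. */
record TraceParent(String traceId, String spanId, boolean sampled) {

    /** Parses an incoming header value and rejects anything that does not match the expected shape. */
    static TraceParent parse(String header) {
        String[] parts = header.trim().split("-");
        if (parts.length != 4
                || !parts[0].equals("00")
                || !parts[1].matches("[0-9a-f]{32}")
                || !parts[2].matches("[0-9a-f]{16}")
                || !parts[3].matches("[0-9a-f]{2}")) {
            throw new IllegalArgumentException("Malformed traceparent: " + header);
        }
        // The lowest bit of the flags byte carries the sampling decision.
        boolean sampled = (Integer.parseInt(parts[3], 16) & 0x01) == 1;
        return new TraceParent(parts[1], parts[2], sampled);
    }

    /** What a downstream service does: keep the trace ID, mint a new span ID. */
    TraceParent newChild() {
        long id;
        do {
            id = ThreadLocalRandom.current().nextLong();
        } while (id == 0L); // an all-zero span ID is not valid
        return new TraceParent(traceId, String.format("%016x", id), sampled);
    }

    /** Serializes the context back into a header value for the next outgoing call. */
    String toHeader() {
        return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
    }
}
```

Order Service would send toHeader() on its outgoing call; Inventory Service would parse it and call newChild(), so both spans share one Trace ID.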

3. Metrics with Prometheus & Grafana

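Actuator exposes individual metrics as JSON at /actuator/metrics. With the Prometheus registry on the classpath, it also exposes them in Prometheus text format at /actuator/prometheus, which is the endpoint Prometheus scrapes.
Dependency: io.micrometer:micrometer-registry-prometheus.
Config: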
management:
  endpoints:
    web:
      exposure:
        include: prometheus
Prometheus Config (prometheus.yml):
scrape_configs:
  - job_name: 'spring_micrometer'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['host.docker.internal:8080']
Grafana: Connect Grafana to Prometheus and import a standard Spring Boot Dashboard (ID: 4701). You’ll get instant graphs for JVM memory, GC pauses, and HTTP throughput.

4. Centralized Logging (ELK / Loki)

Don’t write logs to files. Write to the console (STDOUT) and use a log shipper (Fluentd/Promtail) to send them to Elasticsearch or Loki.
Lombok Logging:
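import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

@Slf4j
@Service
public class OrderService {
    public void createOrder() {
        // Micrometer Tracing puts the Trace ID and Span ID in the logging MDC.
        // Recent Spring Boot 3 versions print them by default; on older ones,
        // you may need to add them to the log pattern yourself.
        log.info("Creating order...");
    }
}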
Prometheus scrapes /actuator/prometheus every 15s. Grafana visualizes the data.

5. Deep Dive: Spring Boot Actuator

Actuator exposes operational information about your running application.

Enabling Actuator

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
By default, only the health endpoint is exposed over HTTP. Most other endpoints are enabled but not exposed, for security. Expose all:
management:
  endpoints:
    web:
      exposure:
        include: "*" # WARNING: Don't do this in production without security

Key Endpoints

| Endpoint | Description |
| --- | --- |
| /actuator/health | Application health (UP/DOWN). Includes DB, Disk, etc. |
| /actuator/info | Application metadata (version, Git commit). |
| /actuator/metrics | All available metrics. |
| /actuator/metrics/{name} | Specific metric (e.g., jvm.memory.used). |
| /actuator/env | Environment properties. |
| /actuator/loggers | View/Change log levels at runtime. |
| /actuator/prometheus | Prometheus-formatted metrics. |

Securing Actuator

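import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.Customizer;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
public class SecurityConfig {
    @Bean
    public SecurityFilterChain actuatorSecurity(HttpSecurity http) throws Exception {
        http.authorizeHttpRequests(auth -> auth
                // Only users with the ADMIN role may call Actuator endpoints
                .requestMatchers("/actuator/**").hasRole("ADMIN")
                .anyRequest().authenticated()
        )
        // An authentication mechanism is required, or no caller can ever log in
        .httpBasic(Customizer.withDefaults());
        return http.build();
    }
}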

6. Custom Metrics with Micrometer

Track your own business KPIs.
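import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.util.concurrent.atomic.AtomicInteger;
import org.springframework.stereotype.Service;

@Service
public class OrderService {

    private final MeterRegistry meterRegistry;
    // Gauge state: registered once, then mutated. Micrometer reads it on each scrape.
    private final AtomicInteger pendingOrders;

    public OrderService(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        // gauge(...) returns the state object it was given. Keep a strong reference to it.
        this.pendingOrders = meterRegistry.gauge("orders.pending", new AtomicInteger(0));
    }

    public void placeOrder(Order order) {
        pendingOrders.incrementAndGet();
        // Record time
        Timer.Sample sample = Timer.start(meterRegistry);
        try {
            processOrder(order);
            // Increment counter only once the order has actually succeeded
            meterRegistry.counter("orders.placed", "status", "success").increment();
        } finally {
            sample.stop(meterRegistry.timer("order.processing.time"));
            pendingOrders.decrementAndGet();
        }
    }
}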
Metric Types:
  • Counter: Monotonically increasing (e.g., requests served).
  • Gauge: Current value (e.g., active connections).
  • Timer: Duration of events (e.g., request latency).
  • Distribution Summary: Statistical summary (e.g., request size).
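The difference between the first three types is easier to see in code. The following is a JDK-only sketch of their semantics, not Micrometer's implementation; SimpleCounter, SimpleGauge, and SimpleTimer are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Counter: a value that can only go up. */
final class SimpleCounter {
    private final AtomicLong count = new AtomicLong();

    void increment() { count.incrementAndGet(); }

    long count() { return count.get(); }
}

/** Gauge: stores nothing itself; it samples a source every time it is read. */
final class SimpleGauge {
    private final LongSupplier source;

    SimpleGauge(LongSupplier source) { this.source = source; }

    long value() { return source.getAsLong(); }
}

/** Timer: tracks how many events occurred and how long they took in total. */
final class SimpleTimer {
    private final AtomicLong count = new AtomicLong();
    private final AtomicLong totalNanos = new AtomicLong();

    void record(long nanos) {
        count.incrementAndGet();
        totalNanos.addAndGet(nanos);
    }

    long count() { return count.get(); }

    double meanMillis() {
        long n = count.get();
        return n == 0 ? 0.0 : totalNanos.get() / 1_000_000.0 / n;
    }
}
```

Because a gauge samples its source on read, it should be registered once against a long-lived object rather than re-created with a fresh value on every request.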

Monitoring Flow (Complete Picture)


Interview Deep-Dive

Strong Answer:
  • Logs are discrete, immutable records of events. They tell you WHAT happened at a specific point in time: “OrderService threw NullPointerException at 14:32:05 processing order 12345.” Logs are high-cardinality (every event is unique) and high-volume. They are the raw evidence.
  • Metrics are aggregated numerical measurements over time. They tell you HOW MUCH and HOW FAST: “P99 latency is 2.3 seconds, error rate is 12%, CPU usage is 85%.” Metrics are low-cardinality (pre-aggregated) and cheap to store. They are the smoke detector — they tell you something is wrong before you know exactly what.
  • Traces are the path of a single request across service boundaries. They tell you WHERE time was spent: “Request 123 spent 50ms in the Gateway, 200ms in OrderService, and 3.5 seconds in InventoryService’s database query.” Traces provide the causal chain.
  • Debugging workflow: (1) Metrics alert fires — “P99 latency for /orders exceeded 3 seconds.” This is your smoke detector. (2) You open Grafana and see latency spiked at 14:30. You filter by status code — 500 errors also spiked. (3) You grab a trace ID from a slow request. In Zipkin/Jaeger, the trace shows InventoryService’s span took 3.4 seconds, specifically a database query. (4) You search logs by the trace ID in Kibana/Loki and find: “Slow query warning: SELECT * FROM inventory WHERE sku LIKE ‘%ABC%’ — 3400ms.” Now you know the root cause: a missing index on the sku column causing full table scan.
  • The complement: metrics tell you WHEN and HOW BAD, traces tell you WHERE in the call chain, logs tell you EXACTLY WHAT happened. Without metrics, you do not know there is a problem. Without traces, you cannot find which service is responsible. Without logs, you cannot determine the root cause.
Follow-up: Should you sample 100% of traces in production? What are the trade-offs?
No, and this is a common mistake. 100% sampling means every request generates trace data that is shipped to Zipkin/Jaeger. At 10,000 requests/second, that is 10,000 traces/second, each with multiple spans. The storage and network cost is enormous, and the tracing backend becomes a bottleneck. Most production systems sample 1-10%; Spring Boot’s default sampling probability is 0.1 (10%). The trade-off: with 1% sampling, if a bug occurs once per 10,000 requests, you have only a 1% chance of capturing that specific trace. Two techniques help. Tail-based (adaptive) sampling keeps every slow or failed trace and only a fraction of fast successes; because it needs the outcome, the decision is made after the request completes, typically in a collector rather than in the application. Head-based sampling decides once at the gateway and propagates the decision to all downstream services, so you get complete traces, not partial ones. Both Brave and OpenTelemetry, the tracers behind Micrometer Tracing, accept a custom Sampler for head-based decisions.
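A head-based decision can be made deterministic by deriving it from the trace ID, so every service reaches the same answer for the same request. The sketch below shows that idea using only the JDK; RatioSampler is a hypothetical class, not the Brave or OpenTelemetry Sampler type.

```java
/** Illustrative probabilistic sampler keyed on the trace ID. */
final class RatioSampler {
    private final double probability;
    private final long threshold;

    RatioSampler(double probability) {
        if (probability < 0.0 || probability > 1.0) {
            throw new IllegalArgumentException("probability must be in [0, 1]");
        }
        this.probability = probability;
        // Scale the probability onto the non-negative long range.
        this.threshold = (long) (probability * Long.MAX_VALUE);
    }

    /** Returns the same decision for the same trace ID, on any service. */
    boolean shouldSample(String traceIdHex) {
        if (probability >= 1.0) {
            return true;
        }
        // Use the low 64 bits (last 16 hex characters) of the trace ID as the random source.
        String low = traceIdHex.substring(traceIdHex.length() - 16);
        long value = Long.parseUnsignedLong(low, 16) & Long.MAX_VALUE; // clear the sign bit
        return value < threshold;
    }
}
```

The design choice here is that no coordination is needed: the gateway and every downstream service can evaluate the same function, although in practice the decision is usually made once and carried in the traceparent flags.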
Strong Answer:
  • When a request enters the first service (usually the API Gateway), Micrometer Tracing generates two identifiers: a Trace ID (globally unique, shared across all services for this request) and a Span ID (unique to this service’s processing of the request). Together, these form the tracing context.
  • The context is propagated to downstream services via HTTP headers. The W3C Trace Context standard uses traceparent: 00-{traceId}-{spanId}-{flags}. The older B3 format (Zipkin) uses X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId. Spring Boot 3 defaults to W3C format but supports both.
  • When OrderService calls InventoryService via WebClient or Feign, Micrometer’s instrumentation automatically adds the traceparent header to the outgoing HTTP request. InventoryService’s instrumentation reads this header, extracts the Trace ID, creates a new Span ID (with the parent set to OrderService’s Span ID), and continues the trace. This creates a parent-child relationship visible in the trace UI as a tree or waterfall.
  • Each span records: start time, end time, service name, operation name, status (OK or ERROR), and optional tags (HTTP method, URL, status code). Spans are reported asynchronously to the tracing backend (Zipkin, Jaeger) — the reporting never blocks the request processing thread.
  • The “bridge” architecture in Spring Boot 3: Micrometer Tracing is the API layer (vendor-neutral). The actual trace generation is delegated to a bridge — either Brave (Zipkin-native) or OpenTelemetry. You choose by including the appropriate dependency (micrometer-tracing-bridge-brave or micrometer-tracing-bridge-otel). This means you can switch tracing backends without changing application code.
Follow-up: How do you ensure trace context propagates across asynchronous boundaries: message queues (Kafka), @Async methods, and CompletableFuture chains?
For Kafka: with observation enabled on the KafkaTemplate and the listener container, Spring Kafka (and Spring Cloud Stream on top of it) injects trace headers into Kafka message headers when producing and extracts them when consuming. The trace ID spans the producer and consumer, showing the full async flow in the trace UI. For @Async: by default, the trace context is ThreadLocal-bound and does not cross thread boundaries. You need a TaskDecorator that copies the tracing context to the worker thread. Spring Framework 6.1+ ships ContextPropagatingTaskDecorator for this, built on ContextSnapshotFactory and ContextSnapshot from Micrometer’s context-propagation library. For CompletableFuture: wrap the async operation in an Observation (Observation.createNotStarted(name, registry).observe(() -> ...)), or configure a custom executor with context propagation. The general principle: any time work crosses a thread boundary, you need explicit context propagation. Micrometer’s context-propagation library (io.micrometer:context-propagation) automates this for Reactor, but for raw thread pools, you must instrument the executor.
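The capture-and-restore pattern behind a TaskDecorator can be shown with the JDK alone. This is a sketch of the principle only: TraceContextHolder, wrap, and readOnWorkerThread are hypothetical names, not Spring or Micrometer API.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Illustrative ThreadLocal-bound trace context with an explicit propagation wrapper. */
final class TraceContextHolder {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    static void set(String traceId) { TRACE_ID.set(traceId); }

    static String get() { return TRACE_ID.get(); }

    /** Captures the caller's trace ID now and restores it around the task on whatever thread runs it. */
    static Runnable wrap(Runnable task) {
        String captured = TRACE_ID.get(); // read on the submitting thread
        return () -> {
            String previous = TRACE_ID.get();
            TRACE_ID.set(captured);
            try {
                task.run();
            } finally {
                // Restore the worker's previous state so pooled threads do not leak context.
                if (previous == null) {
                    TRACE_ID.remove();
                } else {
                    TRACE_ID.set(previous);
                }
            }
        };
    }

    /** Demo helper: runs a task on a fresh thread and reports which trace ID it saw there. */
    static String readOnWorkerThread(boolean propagate) {
        AtomicReference<String> seen = new AtomicReference<>();
        Runnable task = () -> seen.set(get());
        Thread worker = new Thread(propagate ? wrap(task) : task);
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.get();
    }
}
```

Without the wrapper the worker thread sees no trace ID at all, which is why log lines from @Async methods lose their correlation unless the executor is decorated.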
Strong Answer:
  • Minute 0-2 (Triage): Open Grafana. Check the service dashboard for the checkout flow. Identify which service’s latency spiked — is it the Gateway, Order Service, Payment Service, or Inventory Service? Look at the latency heatmap to see if it is ALL requests or just a subset (e.g., only requests to a specific endpoint or from a specific region).
  • Minute 2-5 (Narrow scope): Check error rate metrics alongside latency. If error rate is also up, check the HTTP status code breakdown — are these 500s (server errors), 504s (timeouts), or 429s (rate limiting)? If it is only latency with no errors, the service is slow but not failing. Check resource metrics: CPU, memory, GC pauses (JVM), thread pool utilization, connection pool utilization. A spike in GC pause time or connection pool exhaustion points to a resource bottleneck.
  • Minute 5-10 (Find the trace): Grab a trace ID from a recent slow request (Zipkin search: service=order-service, minDuration=3s). Open the trace waterfall. Identify the slowest span. If InventoryService’s database call went from 50ms to 3 seconds, the problem is in the database layer of that service.
  • Minute 10-15 (Root cause): Search logs by trace ID in Loki/Kibana. Look for slow query warnings, connection timeout errors, or exception stack traces within that span’s time window. Common findings: a new code deployment introduced a query without an index (check git log for recent deploys to InventoryService), a database connection pool is exhausted (max connections reached), or an external dependency (Redis, third-party API) is responding slowly.
  • Minute 15-20 (Mitigate): If it is a bad deployment, roll back. If it is a database query, add the index or kill the problematic query. If it is resource exhaustion, scale up the affected service (increase replicas) or increase the connection pool temporarily. Communicate status to the team.
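The "identify the slowest span" step in the timeline above can be sketched as a self-time calculation over the spans of one trace. This is an illustration only: the Span record and TraceAnalyzer class are hypothetical, and real tracing UIs compute this for you from the waterfall.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Illustrative span: parentId is null for the root span. */
record Span(String spanId, String parentId, String service, String operation, long durationMillis) {}

final class TraceAnalyzer {

    /** Self time: the span's own duration minus its direct children, floored at zero for parallel children. */
    static long selfTimeMillis(Span span, List<Span> trace) {
        long childrenMillis = trace.stream()
                .filter(s -> span.spanId().equals(s.parentId()))
                .mapToLong(Span::durationMillis)
                .sum();
        return Math.max(0, span.durationMillis() - childrenMillis);
    }

    /** The span where the most time was actually spent, as opposed to spent waiting on callees. */
    static Optional<Span> slowestBySelfTime(List<Span> trace) {
        return trace.stream()
                .max(Comparator.comparingLong((Span s) -> selfTimeMillis(s, trace)));
    }
}
```

Ranking by self time rather than total duration matters because the root span is always the longest; the useful answer is the leaf, such as a database query, that the callers were waiting on.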
Follow-up: After the incident, how do you prevent the same category of failure from happening again?
Post-incident actions: (1) Add a Grafana alert for the specific metric that was the earliest signal (e.g., database query latency P99 for InventoryService). The alert should fire BEFORE users notice: for example, set the threshold at 500ms when the SLO is 1 second. (2) Add a performance regression test to CI: run a load test that verifies P99 stays under the SLO. If a new query is added without an index, the load test fails before deployment. (3) Add a circuit breaker between Order Service and Inventory Service if one does not exist, so InventoryService slowness returns a degraded response instead of cascading. (4) Write the postmortem with timeline, root cause, and action items. Share it broadly, so the next on-call engineer does not have to rediscover this.
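The P99 check in a CI load test might look like the following sketch. It assumes latency samples have already been collected in milliseconds and uses the nearest-rank percentile definition; LatencyCheck and its methods are hypothetical names.

```java
import java.util.Arrays;

/** Illustrative P99 gate for a load-test result. */
final class LatencyCheck {

    /** Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it. */
    static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0.0 || p > 100.0) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMillis.clone(); // do not reorder the caller's array
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** True when the 99th percentile is at or below the SLO. */
    static boolean withinSlo(long[] samplesMillis, long sloMillis) {
        return percentile(samplesMillis, 99.0) <= sloMillis;
    }
}
```

With 100 samples, a single 3400ms outlier still passes a 1000ms SLO at P99, while two such outliers fail it, which is the sensitivity you want from a regression gate.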