Observability in Microservices
When you have 50 services, “tailing the logs” is impossible. You need centralized observability.
1. The Three Pillars
- Logs: Immutable record of discrete events. (“Error at 10:00 PM”).
- Metrics: Aggregated data over time. (“CPU usage is 80%”, “Requests per second: 50”).
- Tracing: The path of a single request across multiple services.
2. Distributed Tracing with Zipkin/Micrometer
Spring Boot 3 uses Micrometer Tracing (the successor to Spring Cloud Sleuth).
Dependencies:
- io.micrometer:micrometer-tracing-bridge-brave
- io.zipkin.reporter2:zipkin-reporter-brave
Each request gets a Trace ID (global, shared by every service that handles the request), and each unit of work within it gets a Span ID (local). These IDs are propagated via HTTP headers (traceparent).
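To make the header concrete, here is a minimal sketch that splits a W3C traceparent value into its four fields. `TraceParent` is a hypothetical helper written for illustration, not a Micrometer or Brave class. It checks only the field lengths and hex alphabet, and does not reject the all-zero IDs that the specification also forbids.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a Micrometer class) that splits a W3C traceparent header. */
final class TraceParent {
    // version-traceId-parentId-flags in lowercase hex: 2, 32, 16 and 2 characters.
    private static final Pattern FORMAT = Pattern.compile(
            "([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    final String version;
    final String traceId;   // global: identical in every service the request touches
    final String parentId;  // local: the span ID of the caller that sent this header
    final String flags;

    private TraceParent(String version, String traceId, String parentId, String flags) {
        this.version = version;
        this.traceId = traceId;
        this.parentId = parentId;
        this.flags = flags;
    }

    /** Parses the header value, rejecting anything that does not match the expected shape. */
    static TraceParent parse(String header) {
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Malformed traceparent: " + header);
        }
        return new TraceParent(m.group(1), m.group(2), m.group(3), m.group(4));
    }

    /** The lowest bit of the flags byte marks the request as sampled. */
    boolean isSampled() {
        return (Integer.parseInt(flags, 16) & 0x01) != 0;
    }
}
```

In a real Spring Boot 3 service you would not parse this yourself; the tracing bridge reads and writes the header for you.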
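Running Zipkin (for example with Docker): docker run -d -p 9411:9411 openzipkin/zipkin
Configuration (application.yml): set management.tracing.sampling.probability (1.0 samples every request, which suits local development) and, if Zipkin is not at its default address, management.zipkin.tracing.endpoint.
When a request hits Order Service, which calls Inventory Service, you can see the full timeline in the Zipkin UI (http://localhost:9411).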
3. Metrics with Prometheus & Grafana
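Actuator exposes metrics at /actuator/metrics. With the Prometheus registry on the classpath, it also serves them in Prometheus format at /actuator/prometheus, which is the endpoint Prometheus scrapes.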
Dependency: io.micrometer:micrometer-registry-prometheus.
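Config: expose the endpoint by adding prometheus to management.endpoints.web.exposure.include (for example health,info,prometheus).
Prometheus config (prometheus.yml): add a scrape job whose metrics_path is /actuator/prometheus and whose scrape_interval is 15s. Prometheus then scrapes /actuator/prometheus every 15s, and Grafana visualizes the data.
4. Centralized Logging (ELK / Loki)
Don’t write logs to files. Write to Console (STDOUT). Use a log shipper (Fluentd/Promtail) to send them to Elasticsearch or Loki.
Lombok logging: annotate the class with @Slf4j to get a ready-made log field.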
5. Deep Dive: Spring Boot Actuator
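Actuator exposes operational information about your running application.
Enabling Actuator
Add the spring-boot-starter-actuator dependency, then choose which endpoints to expose over HTTP with management.endpoints.web.exposure.include.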
Key Endpoints
| Endpoint | Description |
|---|---|
| /actuator/health | Application health (UP/DOWN). Includes DB, Disk, etc. |
| /actuator/info | Application metadata (version, Git commit). |
| /actuator/metrics | All available metrics. |
| /actuator/metrics/{name} | Specific metric (e.g., jvm.memory.used). |
| /actuator/env | Environment properties. |
| /actuator/loggers | View/Change log levels at runtime. |
| /actuator/prometheus | Prometheus-formatted metrics. |
Securing Actuator
6. Custom Metrics with Micrometer
Track your own business KPIs.
- Counter: Monotonically increasing (e.g., requests served).
- Gauge: Current value (e.g., active connections).
- Timer: Duration of events (e.g., request latency).
- Distribution Summary: Statistical summary (e.g., request size).
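The four meter types differ mainly in what they remember. The sketch below illustrates those semantics with plain JDK classes so it runs without any dependency. It is not Micrometer's API, and `MeterSemantics` and its field names are hypothetical. In a real service you would register Micrometer's Counter, Gauge, Timer, and DistributionSummary with a MeterRegistry instead.

```java
import java.util.LongSummaryStatistics;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

/** Dependency-free illustration of meter semantics; NOT Micrometer's API. */
final class MeterSemantics {
    // Counter: only ever goes up; a rate is derived from it at query time.
    final LongAdder ordersCreated = new LongAdder();

    // Gauge: a current value that can rise and fall; only the latest reading matters.
    final AtomicInteger activeConnections = new AtomicInteger();

    // Timer: count, total and max of observed durations (nanoseconds here).
    final LongSummaryStatistics checkoutNanos = new LongSummaryStatistics();

    // Distribution summary: the same statistics for values that are not durations.
    final LongSummaryStatistics payloadBytes = new LongSummaryStatistics();

    /** Runs the work and records how long it took, even if it throws. */
    <T> T timeCheckout(Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            long elapsed = System.nanoTime() - start;
            synchronized (checkoutNanos) {  // LongSummaryStatistics is not thread-safe
                checkoutNanos.accept(elapsed);
            }
        }
    }

    /** Records the size of one request payload. */
    void recordPayload(long bytes) {
        synchronized (payloadBytes) {
            payloadBytes.accept(bytes);
        }
    }
}
```

A counter suits "how many orders were created", while a gauge suits "how many connections are open right now". Recording the latter as a counter would lose every decrement.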
Monitoring Flow (Complete Picture)
Interview Deep-Dive
Explain the three pillars of observability -- logs, metrics, and traces. When would you use each to debug a production issue, and how do they complement each other?
Strong Answer:
- Logs are discrete, immutable records of events. They tell you WHAT happened at a specific point in time: “OrderService threw NullPointerException at 14:32:05 processing order 12345.” Logs are high-cardinality (every event is unique) and high-volume. They are the raw evidence.
- Metrics are aggregated numerical measurements over time. They tell you HOW MUCH and HOW FAST: “P99 latency is 2.3 seconds, error rate is 12%, CPU usage is 85%.” Metrics are low-cardinality (pre-aggregated) and cheap to store. They are the smoke detector — they tell you something is wrong before you know exactly what.
- Traces are the path of a single request across service boundaries. They tell you WHERE time was spent: “Request 123 spent 50ms in the Gateway, 200ms in OrderService, and 3.5 seconds in InventoryService’s database query.” Traces provide the causal chain.
- Debugging workflow: (1) A metrics alert fires: “P99 latency for /orders exceeded 3 seconds.” This is your smoke detector. (2) You open Grafana and see latency spiked at 14:30. You filter by status code and see that 500 errors also spiked. (3) You grab a trace ID from a slow request. In Zipkin/Jaeger, the trace shows InventoryService’s span took 3.4 seconds, specifically a database query. (4) You search logs by the trace ID in Kibana/Loki and find: “Slow query warning: SELECT * FROM inventory WHERE sku LIKE ‘%ABC%’, 3400ms.” Now you know the root cause: a leading-wildcard LIKE on the sku column, which a standard B-tree index cannot serve, is forcing a full table scan.
- The complement: metrics tell you WHEN and HOW BAD, traces tell you WHERE in the call chain, logs tell you EXACTLY WHAT happened. Without metrics, you do not know there is a problem. Without traces, you cannot find which service is responsible. Without logs, you cannot determine the root cause.
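The P99 figure that triggers the alert is a percentile over observed latencies. Here is a minimal nearest-rank sketch, assuming raw samples are available in memory. `Percentiles` is a hypothetical helper. Prometheus itself estimates quantiles from histogram buckets rather than from raw samples, so its numbers are approximations of what this computes.

```java
import java.util.Arrays;

/** Hypothetical helper: nearest-rank percentile over raw latency samples. */
final class Percentiles {
    /** Returns the smallest sample that has at least p percent of all samples at or below it. */
    static long percentile(long[] samplesMillis, int p) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("No samples");
        }
        if (p < 1 || p > 100) {
            throw new IllegalArgumentException("p must be between 1 and 100");
        }
        long[] sorted = samplesMillis.clone();  // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);  // 1-based rank
        return sorted[rank - 1];
    }
}
```

With ten samples where nine are around 10 ms and one is 3400 ms, the median stays near 10 ms while P99 jumps to 3400 ms, which is why tail percentiles rather than averages are used for latency alerts.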
Sampler implementations for exactly this.
How does distributed tracing work under the hood with Micrometer Tracing in Spring Boot 3? What actually propagates between services?
Strong Answer:
- When a request enters the first service (usually the API Gateway), Micrometer Tracing generates two identifiers: a Trace ID (globally unique, shared across all services for this request) and a Span ID (unique to this service’s processing of the request). Together, these form the tracing context.
- The context is propagated to downstream services via HTTP headers. The W3C Trace Context standard uses traceparent: 00-{traceId}-{spanId}-{flags}. The older B3 format (Zipkin) uses X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId. Spring Boot 3 defaults to W3C format but supports both.
- When OrderService calls InventoryService via WebClient or Feign, Micrometer’s instrumentation automatically adds the traceparent header to the outgoing HTTP request. InventoryService’s instrumentation reads this header, extracts the Trace ID, creates a new Span ID (with the parent set to OrderService’s Span ID), and continues the trace. This creates a parent-child relationship visible in the trace UI as a tree or waterfall.
- Each span records: start time, end time, service name, operation name, status (OK or ERROR), and optional tags (HTTP method, URL, status code). Spans are reported asynchronously to the tracing backend (Zipkin, Jaeger), so reporting does not block the request processing thread.
- The “bridge” architecture in Spring Boot 3: Micrometer Tracing is the API layer (vendor-neutral). The actual trace generation is delegated to a bridge, either Brave (Zipkin-native) or OpenTelemetry. You choose by including the appropriate dependency (micrometer-tracing-bridge-brave or micrometer-tracing-bridge-otel). This means you can switch tracing backends without changing application code.
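The hand-off between two services can be sketched in a few lines: keep the trace ID, mint a new span ID, and remember the caller's span ID as the parent. `SpanContext` below is a hypothetical class written for illustration, assuming the W3C header layout described above. The real work is done by the Brave or OpenTelemetry bridge, which also handles validation and sampling decisions.

```java
import java.security.SecureRandom;

/** Hypothetical sketch of how a downstream service continues an incoming trace. */
final class SpanContext {
    private static final SecureRandom RANDOM = new SecureRandom();

    final String traceId;       // unchanged across every service in the request
    final String spanId;        // freshly generated for this service's unit of work
    final String parentSpanId;  // the caller's span ID, taken from the incoming header
    final String flags;         // carried through so the sampling decision is preserved

    private SpanContext(String traceId, String spanId, String parentSpanId, String flags) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
        this.flags = flags;
    }

    /** Builds the receiving service's context from an incoming traceparent value. */
    static SpanContext continueFrom(String traceparent) {
        String[] parts = traceparent.trim().split("-");  // version, trace ID, caller span ID, flags
        if (parts.length != 4) {
            throw new IllegalArgumentException("Malformed traceparent: " + traceparent);
        }
        return new SpanContext(parts[1], newSpanId(), parts[2], parts[3]);
    }

    /** Header for the next outgoing call: same trace ID, this service's span ID as the parent. */
    String toTraceparent() {
        return "00-" + traceId + "-" + spanId + "-" + flags;
    }

    /** Generates 16 lowercase hex characters, skipping the all-zero ID, which is invalid. */
    private static String newSpanId() {
        long id;
        do {
            id = RANDOM.nextLong();
        } while (id == 0L);
        return String.format("%016x", id);
    }
}
```

Calling `continueFrom` in InventoryService and then `toTraceparent` before its own outgoing call yields a chain of spans that share one trace ID, which is what the trace UI renders as a waterfall.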
For @Async: by default, the trace context is ThreadLocal-bound and does not cross thread boundaries. You need a TaskDecorator that copies the tracing context to the worker thread. Spring Framework 6.1+ provides ContextPropagatingTaskDecorator for this, built on ContextSnapshotFactory and ContextSnapshot from Micrometer’s context-propagation library. For CompletableFuture: use Observation.createNotStarted(...).observe(() -> ...) to wrap the async operation, or configure a custom executor with context propagation. The general principle: any time work crosses a thread boundary, you need explicit context propagation. Micrometer’s context-propagation library (io.micrometer:context-propagation) automates this for Reactor, but for raw thread pools, you must instrument the executor.
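The thread-boundary problem and its fix can be reproduced with the JDK alone. In this sketch, `TraceContextHolder` is a hypothetical stand-in for a tracer's thread-bound context, and `propagating` plays the role that a TaskDecorator plays in Spring: it captures the context on the submitting thread and restores it on the worker.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical stand-in for a tracer's ThreadLocal-bound context. */
final class TraceContextHolder {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    static void set(String traceId) { TRACE_ID.set(traceId); }
    static String get() { return TRACE_ID.get(); }
    static void clear() { TRACE_ID.remove(); }

    /** Captures the caller's trace ID now and restores it on whichever thread runs the task. */
    static <T> Callable<T> propagating(Callable<T> task) {
        String captured = get();  // evaluated on the submitting thread
        return () -> {
            String previous = get();  // whatever the worker thread held before
            set(captured);
            try {
                return task.call();
            } finally {
                // Put the worker back as it was so pooled threads do not leak context.
                if (previous == null) clear(); else set(previous);
            }
        };
    }
}

final class PropagationDemo {
    /** Returns the trace ID seen by a plain task and by a wrapped task, in that order. */
    static String[] run() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            TraceContextHolder.set("4bf92f3577b34da6a3ce929d0e0e4736");
            Callable<String> readTraceId = TraceContextHolder::get;
            String plain = pool.submit(readTraceId).get();
            String wrapped = pool.submit(TraceContextHolder.propagating(readTraceId)).get();
            return new String[] {plain, wrapped};
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            TraceContextHolder.clear();
            pool.shutdown();
        }
    }
}
```

The plain task sees no trace ID because the worker thread has its own empty ThreadLocal, while the wrapped task sees the caller's ID. Restoring the previous value in `finally` matters with pooled threads, which would otherwise carry one request's context into the next.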
You are on-call and get paged at 3 AM: P99 latency for the checkout flow has tripled. Walk me through your incident response using the observability stack.
Strong Answer:
- Minute 0-2 (Triage): Open Grafana. Check the service dashboard for the checkout flow. Identify which service’s latency spiked — is it the Gateway, Order Service, Payment Service, or Inventory Service? Look at the latency heatmap to see if it is ALL requests or just a subset (e.g., only requests to a specific endpoint or from a specific region).
- Minute 2-5 (Narrow scope): Check error rate metrics alongside latency. If error rate is also up, check the HTTP status code breakdown — are these 500s (server errors), 504s (timeouts), or 429s (rate limiting)? If it is only latency with no errors, the service is slow but not failing. Check resource metrics: CPU, memory, GC pauses (JVM), thread pool utilization, connection pool utilization. A spike in GC pause time or connection pool exhaustion points to a resource bottleneck.
- Minute 5-10 (Find the trace): Grab a trace ID from a recent slow request (Zipkin search: service=order-service, minDuration=3s). Open the trace waterfall. Identify the slowest span. If InventoryService’s database call went from 50ms to 3 seconds, the problem is in the database layer of that service.
- Minute 10-15 (Root cause): Search logs by trace ID in Loki/Kibana. Look for slow query warnings, connection timeout errors, or exception stack traces within that span’s time window. Common findings: a new code deployment introduced a query without an index (check git log for recent deploys to InventoryService), a database connection pool is exhausted (max connections reached), or an external dependency (Redis, third-party API) is responding slowly.
- Minute 15-20 (Mitigate): If it is a bad deployment, roll back. If it is a database query, add the index or kill the problematic query. If it is resource exhaustion, scale up the affected service (increase replicas) or increase the connection pool temporarily. Communicate status to the team.