Monitoring = Knowing your application is broken
Observability = Knowing WHY your application is broken
Think of it like a car:
Monitoring (Dashboard lights):
Check Engine Light ✅ (Something is wrong!)
Temperature Gauge: HIGH ⚠️ (Engine is overheating!)
Fuel Gauge: EMPTY ⚠️ (You’re out of gas!)
What you know: The car has a problem
What you DON’T know: Why the engine is overheating (radiator leak? broken fan? low coolant?)
Observability (Diagnostic tools):
Metrics: Temperature readings every second (shows temperature spiked at 3:15 PM)
Logs: “Coolant level low” warning at 3:10 PM → “Fan belt broke” error at 3:14 PM
Traces: Complete timeline from “fan belt snapped” → “fan stopped” → “engine overheated” → “check engine light”
Result: You know EXACTLY what failed, when, and why!
What it tells you: Exactly WHAT happened (payment gateway timed out)
What it DOESN’T tell you: Which service caused the timeout?
3. Traces = Detective’s Timeline (Request journey)
What you see:

    Request ID: abc123 (Customer #1234's checkout)

    Frontend (50ms)
      ↓
    API Gateway (20ms)
      ↓
    Order Service (100ms)
      ↓
    Payment Service (5000ms) ← BOTTLENECK!
      ↓
    Timeout (payment gateway never responded)

    Total time: 5,170ms (should be ~200ms)

What it tells you: WHERE the problem occurred (Payment Service → payment gateway)
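The "where" question a trace answers comes down to a small calculation: add up the span durations and see which span dominates. Below is a minimal Python sketch using the durations from the abc123 trace above. The `Span` type and `find_bottleneck` helper are illustrative names, not part of any tracing SDK.

```python
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    duration_ms: int


# Span durations copied from the abc123 trace shown above.
TRACE = [
    Span("Frontend", 50),
    Span("API Gateway", 20),
    Span("Order Service", 100),
    Span("Payment Service", 5000),
]


def find_bottleneck(spans):
    """Return (slowest span, its share of total time, total time in ms)."""
    total_ms = sum(s.duration_ms for s in spans)
    slowest = max(spans, key=lambda s: s.duration_ms)
    return slowest, slowest.duration_ms / total_ms, total_ms


slowest, share, total_ms = find_bottleneck(TRACE)
print(f"{slowest.name}: {slowest.duration_ms} ms of {total_ms} ms ({share:.0%})")
# prints: Payment Service: 5000 ms of 5170 ms (97%)
```

A real trace has nested spans, so a tool would compare each span's self time instead of a flat list, but the idea is the same.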
Problem: “Checkout is slow for some users”
Monitoring approach (limited):

    Dashboard shows:
    - Average response time: 250ms ✅ (Looks fine!)
    - CPU usage: 40% ✅ (Looks fine!)
    - Error rate: 0.5% ✅ (Looks fine!)

    Conclusion: Everything looks normal, but users are still complaining!

Observability approach (powerful):
KQL Query:

    requests
    | where name contains "checkout"
    | where duration > 5000 // > 5 seconds
    | summarize count() by client_City

Results:

    London: 2 slow requests
    Tokyo: 1,543 slow requests ← FOUND IT!

Cause: Database replica in Asia is down!
Observability lets you ask questions you didn’t think of when building dashboards.
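To see why a healthy-looking average can coexist with very slow requests, here is a small Python sketch. The sample data is hypothetical (it is not real telemetry) and was chosen so the average lands on the 250 ms shown on the dashboard. The grouping step mirrors what `summarize count() by client_City` does in the query above.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample of (client city, duration in ms); not real telemetry.
REQUESTS = [("London", 100)] * 97 + [("Tokyo", 5100)] * 3

SLOW_MS = 5000  # same threshold as the KQL query above

# The dashboard view: one number for everyone.
average_ms = mean([duration for _, duration in REQUESTS])

# The observability view: filter to slow requests, then group by a dimension.
slow_by_city = Counter(city for city, duration in REQUESTS if duration > SLOW_MS)

print(f"average: {average_ms} ms")
print(f"slow requests by city: {dict(slow_by_city)}")
```

The average comes out at 250 ms even though every Tokyo request in the sample takes more than five seconds, which is the gap that slicing by dimension exposes.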
[!TIP]
Jargon Alert: Observability vs Monitoring
Monitoring: Tells you the system is dead. (“CPU is 100%”)
Observability: Tells you why the system is dead. (“The database query from line 42 is hanging.”)
[!WARNING]
Gotcha: Log Retention Costs
Log Analytics charges you to ingest data and to keep it. Storing debug logs for 365 days is expensive. Set retention to 30 days for dev/test and use “Data Export” to move old logs to Blob Storage (Archive Tier) for long-term compliance.

    // Program.cs
    var builder = WebApplication.CreateBuilder(args);

    // This single line instruments your entire app -- it automatically captures:
    // - All incoming HTTP requests (duration, status code, URL)
    // - All outgoing HTTP calls (dependencies like SQL, Redis, APIs)
    // - Unhandled exceptions (with full stack traces)
    // - Performance counters (CPU, memory, GC pressure)
    // No manual logging required for basic observability.
    //
    // Cost impact: This generates ~1-5 KB of telemetry per request.
    // At 100K requests/day and 5 KB each, that is up to ~500 MB/day or ~15 GB/month.
    // First 5 GB/month is free, remainder costs ~$28/month.
    // Enable sampling (below) to reduce this by 90% while keeping
    // statistical accuracy for most debugging scenarios.
    builder.Services.AddApplicationInsightsTelemetry();

    var app = builder.Build();


    {
      "ApplicationInsights": {
        // Use Connection String (not Instrumentation Key alone) -- Connection Strings
        // support regional ingestion endpoints, reducing latency and ensuring
        // data residency compliance (EU data stays in EU).
        // NEVER hardcode this in source control -- use Key Vault or environment variables.
        "ConnectionString": "InstrumentationKey=xxx;IngestionEndpoint=https://xxx"
      }
    }

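Cost Tip: Without sampling, Application Insights ingests 100% of telemetry. For high-traffic apps (1M+ requests/day), adaptive sampling automatically reduces volume while preserving statistically accurate data. The ASP.NET Core SDK turns adaptive sampling on by default; setting the option explicitly makes the choice visible in code: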

    builder.Services.AddApplicationInsightsTelemetry(options =>
    {
        // Adaptive sampling keeps enough data for accurate metrics
        // while reducing ingestion costs by 80-95% for high-traffic apps.
        options.EnableAdaptiveSampling = true;
    });

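The cost figures quoted in the `Program.cs` comments can be reproduced with a few lines of arithmetic. This is a rough Python sketch under stated assumptions: the upper end of the 1-5 KB range, a 30-day month, decimal gigabytes, and the $2.76/GB rate that appears later in this document. Actual pricing varies by region and plan, so treat the numbers as an estimate only. `monthly_cost` is an illustrative helper, not part of any SDK.

```python
KB_PER_REQUEST = 5           # upper end of the ~1-5 KB range quoted above
REQUESTS_PER_DAY = 100_000
FREE_GB_PER_MONTH = 5
PRICE_PER_GB = 2.76          # assumed pay-as-you-go rate; check current pricing
DAYS_PER_MONTH = 30


def monthly_cost(kept_fraction=1.0):
    """Estimate (GB ingested per month, USD billed) for a given sampling fraction."""
    gb_per_day = KB_PER_REQUEST * REQUESTS_PER_DAY * kept_fraction / 1_000_000
    gb_per_month = gb_per_day * DAYS_PER_MONTH
    billable_gb = max(0.0, gb_per_month - FREE_GB_PER_MONTH)
    return gb_per_month, billable_gb * PRICE_PER_GB


full_gb, full_cost = monthly_cost()
sampled_gb, sampled_cost = monthly_cost(kept_fraction=0.10)  # keep 10% of telemetry

print(f"no sampling:  {full_gb:.1f} GB/month, ${full_cost:.2f}")
print(f"90% sampled:  {sampled_gb:.1f} GB/month, ${sampled_cost:.2f}")
```

At this traffic level a 90% reduction drops the volume under the free allowance, so sampling pays off mainly at the higher volumes the tip above mentions.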

    // Find slowest requests
    requests
    | where timestamp > ago(24h)
    | summarize count=count(),
        avg_duration=avg(duration),
        p50=percentile(duration, 50),
        p95=percentile(duration, 95),
        p99=percentile(duration, 99)
        by operation_Name
    | order by p95 desc

    // Find requests slower than SLA
    requests
    | where duration > 1000 // > 1 second
    | project timestamp, name, url, duration, resultCode
    | order by duration desc


    // Find top errors
    requests
    | where timestamp > ago(1h)
    | where success == false
    | summarize count() by resultCode, operation_Name
    | order by count_ desc

    // Error rate over time
    requests
    | where timestamp > ago(24h)
    | summarize total=count(), failed=countif(success == false) by bin(timestamp, 5m)
    | extend errorRate = (failed * 100.0) / total
    | render timechart

    // Exceptions with stack traces
    exceptions
    | where timestamp > ago(1h)
    | project timestamp, type, outerMessage, innermostMessage
    | order by timestamp desc


    // Failed dependencies
    dependencies
    | where timestamp > ago(1h)
    | where success == false
    | summarize count() by name, type, resultCode
    | order by count_ desc

    // Database query performance
    dependencies
    | where type == "SQL"
    | where timestamp > ago(24h)
    | summarize count(), avg(duration), p95=percentile(duration, 95) by name
    | order by p95 desc


    // Active users
    pageViews
    | where timestamp > ago(7d)
    | summarize dau=dcount(user_Id) by bin(timestamp, 1d)
    | render timechart

    // Most popular pages
    pageViews
    | where timestamp > ago(7d)
    | summarize count() by name
    | order by count_ desc
    | take 10

Checkout conversion drops 40% but no alerts fire. CPU, memory, and error rates look normal. How do you diagnose this?
Strong Candidate Answer:
Why monitoring missed it: CPU, memory, and HTTP 500 error rates are infrastructure metrics. A conversion drop is a business metric: the app is technically working, but something is wrong with the user experience.
Diagnosis: Open Application Map for dependency latency. Check Performance blade for /checkout P95 response time — if it jumped from 800ms to 4 seconds, users abandon due to slowness (200 OK response, but slow). Use distributed tracing to find which dependency call is the bottleneck. Check Failures blade for dependency 429s (rate limiting) causing retries.
The likely culprit: A third-party dependency (payment gateway, fraud detection) responding slowly. Application Insights dependency tracking shows exact call, latency, and success rate.
What was missing: No alerts on business KPIs. Create custom metrics tracking checkout start vs order confirmation events. Alert when conversion drops 20% vs same hour last week.
Follow-up: How do you alert on business problems without alert fatigue?
Use anomaly-based alerting (Smart Detection) instead of static thresholds. Dynamic threshold alerts compare current values against 7-day rolling averages. Route infrastructure alerts to platform team, business metrics to product team. Implement suppression during maintenance windows.
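The business-KPI alert described in this answer (conversion compared with the same hour last week, alerting on a 20% drop) can be sketched as a simple comparison. This is a hedged Python illustration only: the function names and the event counts are hypothetical, and in practice the counts would come from custom events or metrics that your application emits.

```python
def conversion_rate(checkouts_started, orders_confirmed):
    """Fraction of started checkouts that ended in a confirmed order."""
    if checkouts_started == 0:
        return 0.0
    return orders_confirmed / checkouts_started


def should_alert(current_rate, baseline_rate, max_relative_drop=0.20):
    """True when the current rate is more than max_relative_drop below the baseline."""
    if baseline_rate == 0:
        return False  # no baseline to compare against
    relative_drop = (baseline_rate - current_rate) / baseline_rate
    return relative_drop > max_relative_drop


# Hypothetical counts for the same hour, one week apart.
baseline = conversion_rate(checkouts_started=1000, orders_confirmed=300)  # 30%
current = conversion_rate(checkouts_started=1000, orders_confirmed=180)   # 18%

print(should_alert(current, baseline))  # prints: True (a 40% relative drop)
```

Comparing against the same hour last week, as the answer suggests, keeps normal daily and weekly traffic patterns from triggering the alert.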
Your Application Insights bill jumped from $500 to $8,000/month. How do you reduce it without losing critical observability?
Strong Candidate Answer:
What happened: Ingestion went from ~185 GB to ~2,900 GB/month at $2.76/GB. Common causes: verbose DEBUG logging in production, trace telemetry for every request, dependency tracking logging full SQL queries, or snapshot debugging enabled.
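Immediate fix: Enable adaptive sampling (1-in-5 requests, 80% cost reduction while preserving statistical accuracy). Set a daily cap (for example 200 GB) as a backstop against runaway costs. Note that 200 GB/day still permits roughly 6,000 GB/month, so lower the cap once sampling has brought volume down.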
Medium-term: Use “Usage and estimated costs” blade to find which telemetry type is largest. Filter out health check endpoints with TelemetryProcessors. Move verbose logs to separate workspace with 7-day retention.
Never cut: Distributed tracing and exception tracking. A 1-hour outage costs $50K+ for most e-commerce sites — far more than monthly telemetry costs.
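The numbers in this answer can be sanity-checked with a few lines of Python. This is a back-of-the-envelope sketch: it assumes the $2.76/GB rate quoted above, ignores the free allowance, uses a 30-day month, and assumes sampling applies uniformly to all telemetry. None of these will hold exactly in a real subscription.

```python
PRICE_PER_GB = 2.76  # rate quoted in the answer above


def billable_gb(monthly_bill_usd):
    """Ingestion volume implied by a monthly bill at a flat per-GB rate."""
    return monthly_bill_usd / PRICE_PER_GB


before_gb = billable_gb(500)      # volume behind the old bill
after_gb = billable_gb(8_000)     # volume behind the new bill

# Keeping 1 request in 5 cuts volume (and cost) to 20% if sampling is uniform.
sampled_gb = after_gb / 5
sampled_bill = sampled_gb * PRICE_PER_GB

# Average daily ingestion at the new volume, useful when sizing a daily cap.
daily_gb = after_gb / 30

print(f"before: {before_gb:.0f} GB, after: {after_gb:.0f} GB")
print(f"with 1-in-5 sampling: {sampled_gb:.0f} GB, ${sampled_bill:.0f}/month")
print(f"average daily ingestion: {daily_gb:.0f} GB/day")
```

The implied daily average is a useful reference point when choosing a daily cap, since a cap far above it will not constrain spending.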
Follow-up: Compare Application Insights pricing with Datadog and New Relic for 50 microservices.
At 500 GB/month: Application Insights ~$1,400 (commitment tier), Datadog ~$2,050 ($31/host + logs), New Relic ~$1,130 ($0.30/GB + $49/user for 20 engineers). New Relic cheapest per-GB, Datadog best dashboarding, Application Insights best Azure integration. Real differentiator is team skills: if engineers know KQL, Application Insights is fastest to value.
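Two of the vendor estimates above can be partly reproduced from the unit prices quoted in the answer. The Python sketch below only checks that arithmetic. It assumes one Datadog host per microservice, which the answer does not state, and the unit prices are the ones given above, not verified list prices.

```python
GB_PER_MONTH = 500
ENGINEERS = 20
MICROSERVICES = 50

# New Relic: per-GB ingestion plus per-user licences, as quoted above.
new_relic = GB_PER_MONTH * 0.30 + ENGINEERS * 49

# Datadog: per-host fee, ASSUMING one host per microservice.
datadog_hosts = MICROSERVICES * 31
# Whatever remains of the quoted ~$2,050 total is the implied log cost.
datadog_logs_implied = 2050 - datadog_hosts

print(f"New Relic: ${new_relic:.0f}/month")
print(f"Datadog hosts: ${datadog_hosts}/month, implied logs: ${datadog_logs_implied}/month")
```

Changing the team size or host count moves these totals quickly, which is why the per-user and per-host components deserve as much attention as the per-GB rate.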
A user says checkout is slow but your API returns 200 OK in 300ms. Where is the problem?
Strong Candidate Answer:
The disconnect: Server measures 300ms. User experiences: DNS (50ms) + TCP (100ms) + TLS (100ms) + TTFB (300ms) + download (200ms) + JS rendering (2s) + third-party scripts (analytics, chat). Total perceived: 2.75 seconds.
How distributed tracing reveals this: Application Insights JavaScript SDK captures browser timing. End-to-end trace shows: Browser (2.75s) -> CDN (50ms) -> Front Door (20ms) -> API (300ms). 2 seconds are client-side rendering and third-party scripts.
The fix: Defer non-critical third-party scripts after checkout. Lazy-load below-fold content. Pre-connect to payment API domain. Reduces perceived time from 2.75s to under 1 second without backend changes.
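The gap between the server's 300ms and what the user perceives is easiest to see by adding up the phases listed in the answer. Below is a small Python sketch using those figures; the phase labels are paraphrased from the answer and the dictionary is purely illustrative.

```python
# Phase timings in milliseconds, taken from the breakdown in the answer above.
PHASES_MS = {
    "DNS lookup": 50,
    "TCP connect": 100,
    "TLS handshake": 100,
    "Time to first byte (server work)": 300,
    "Download": 200,
    "JavaScript rendering": 2000,
}

total_ms = sum(PHASES_MS.values())
server_share = PHASES_MS["Time to first byte (server work)"] / total_ms

print(f"perceived time: {total_ms / 1000:.2f} s")
print(f"server share:   {server_share:.0%}")
# prints: perceived time: 2.75 s
# prints: server share:   11%
```

With the server responsible for only about a tenth of the perceived time, backend tuning alone cannot fix this checkout, which is why the fixes above target the client.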
Follow-up: How do you implement distributed tracing across polyglot microservices (C#, Node.js, Python)?
Use OpenTelemetry as the instrumentation standard. Each service exports traces to Application Insights via Azure Monitor OTel exporter. W3C Trace Context headers propagate automatically across HTTP calls regardless of language. Unified traces in one Application Insights instance across all services. OpenTelemetry is vendor-neutral, avoiding lock-in to Azure-specific SDKs.
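What makes cross-language propagation work is the W3C `traceparent` header: every service keeps the same trace ID and substitutes its own span ID before calling the next service. The Python sketch below shows only that mechanic, using the standard library. In a real service the OpenTelemetry SDK does this for you. The helper names are illustrative, and the sample header is the example value from the W3C Trace Context specification.

```python
import re
import secrets

# traceparent format: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header):
    """Split a traceparent header into its four fields, rejecting invalid values."""
    match = TRACEPARENT_RE.fullmatch(header)
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("all-zero trace or parent ID is not allowed")
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}


def child_traceparent(incoming):
    """Build the header for an outgoing call: same trace ID, new span ID."""
    ctx = parse_traceparent(incoming)
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"{ctx['version']}-{ctx['trace_id']}-{new_span_id}-{ctx['flags']}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print(outgoing)
```

Because the trace ID survives every hop, the backend can stitch spans from C#, Node.js, and Python services into one end-to-end trace.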