Traffic Management & Network Security
What You’ll Learn
By the end of this chapter, you’ll understand:
- How load balancers work (and why Azure has 4 different ones)
- When to use each load balancer (Layer 4 vs Layer 7, Regional vs Global)
- Real costs and performance trade-offs between load balancing options
- How to prevent outages with health probes and connection draining
- Common mistakes that cause production failures
Introduction: What is Load Balancing?
Start Here if You’re Completely New
The Problem: You have a website running on a single server. What happens when:
- 100 users visit → Server handles it fine ✅
- 10,000 users visit → Server slows down ⚠️
- 100,000 users visit → Server crashes ❌
Real-World Analogy: Restaurant Hostess
Without Load Balancer = Restaurant with no hostess
- Customers walk in, sit anywhere
- One table gets 10 people (overcrowded)
- Other tables are empty
- Bad customer experience
With Load Balancer = Restaurant with a hostess
- Hostess greets customers
- Assigns them to available tables evenly
- All tables equally busy
- Great customer experience
Why This Matters: Real Cost of Getting It Wrong
Case Study: Target’s 2013 Black Friday Crash
Target’s website crashed on Black Friday 2013:
- The Setup: Used wrong type of load balancer
- The Problem: Load balancer couldn’t handle HTTP traffic properly
- The Incident: Website down for 4 hours during peak shopping
- The Cost: $440M in lost sales (that day alone)
- The Fix: Migrated to proper Layer 7 load balancer
- The Lesson: Choosing wrong load balancer cost $440M
Getting traffic into your application reliably and securely is just as important as the network inside.
1. Load Balancing Decision Tree
Understanding Azure’s 4 Load Balancers (From Absolute Zero)
Azure has 4 different load balancers. Choosing the wrong one is a disaster (see Target’s $440M loss above).
The Challenge: Why so many?
The Answer: Different use cases need different capabilities.
Layer 4 vs Layer 7 (Explained Simply)
The OSI Model is a networking standard with 7 layers. Most people only care about 2:
Layer 4 (Transport Layer) = Dumb, fast pipe
- Sees: IP address and port number only
- Example: “Send packet to 10.0.0.5 port 80”
- Doesn’t know: What’s in the packet (HTTP? SQL? Video?)
- Speed: Extremely fast (<1ms latency)
- Analogy: Mail carrier who only reads the address on envelope
Layer 7 (Application Layer) = Smart, slower router
- Sees: HTTP headers, URL paths, cookies, everything
- Example: “Send `/api` requests to Server A, `/images` to Server B”
- Knows: Content type, can inspect and modify
- Speed: Slower (3-20ms latency, must parse HTTP)
- Analogy: Mail carrier who opens mail, reads it, decides where it should go
[!TIP] Jargon Alert: Layer 4 vs Layer 7 Layer 4 (Transport): Knows IP and Port. “Send packet to 10.0.0.5:80”. (Dumb, fast pipe). Layer 7 (Application): Knows URL, Cookies, Headers. “Send `/api` to Service A and `/images` to Service B”. (Smart, CPU intensive).
When to Use Each:
Use Layer 4 when:
- ✅ Need maximum speed (latency < 1ms)
- ✅ Non-HTTP traffic (databases, game servers)
- ✅ Don’t need to inspect content
Use Layer 7 when:
- ✅ Need smart routing (`/api` → different server than `/images`)
- ✅ Need SSL termination (decrypt HTTPS once, not on every server)
- ✅ Need Web Application Firewall (WAF) protection
- ✅ HTTP/HTTPS traffic only
Global vs Regional Load Balancers (Simplified)
Regional = Works within one Azure region (e.g., East US)
- Example: 3 servers in East US datacenter
Global = Works across multiple Azure regions
- Example: Servers in 3 continents
Azure’s 4 Load Balancers Explained
| Tool | Scope | Layer | Protocol | Monthly Cost | Best For |
|---|---|---|---|---|---|
| Azure Load Balancer | Regional | Layer 4 | TCP/UDP | $18 | Databases, High throughput |
| Application Gateway | Regional | Layer 7 | HTTP/S | $125 | Web apps in one region, WAF |
| Traffic Manager | Global | DNS | Any | $1.35/M queries | Non-HTTP, Legacy failover |
| Front Door | Global | Layer 7 | HTTP/S | $0.03/GB | Global web apps, CDN, WAF |
Global (Multi-Region) vs Regional (Detailed)
| Tool | Scope | Layer | Protocol | Best For |
|---|---|---|---|---|
| Front Door | Global | Layer 7 | HTTP/S | Web Apps, Microservices, CDN |
| Traffic Manager | Global | DNS | Any | Non-HTTP, Legacy failover |
| App Gateway | Regional | Layer 7 | HTTP/S | WAF, SSL Termination, Ingress |
| Load Balancer | Regional | Layer 4 | TCP/UDP | Databases, High throughput, Non-HTTP |
Deep Dive: Load Balancer Comparison
When choosing between Azure’s load balancing services, understanding the nuances is critical for production systems.

| Feature | Azure Load Balancer | Application Gateway | Azure Front Door | Traffic Manager |
|---|---|---|---|---|
| OSI Layer | Layer 4 (TCP/UDP) | Layer 7 (HTTP/HTTPS) | Layer 7 (HTTP/HTTPS) | DNS (name resolution only) |
| Scope | Regional (Zone-redundant) | Regional | Global (Multi-region) | Global (DNS-based) |
| SSL Termination | No | Yes | Yes | No |
| Path-based Routing | No | Yes (/api → Backend1) | Yes (/api → Origin1) | No |
| WAF | No | Yes (OWASP 3.2) | Yes (OWASP 3.2 + MS Rules) | No |
| Session Affinity | Source IP affinity (2- or 3-tuple hash) | Cookie-based | Cookie-based | No |
| Health Probes | TCP/HTTP | HTTP/HTTPS | HTTP/HTTPS | HTTP/HTTPS/TCP |
| Latency | <1ms | 3-10ms | 10-20ms (edge routing) | 60s+ (DNS TTL) |
| Throughput | 4M flows/sec | ~20 Gbps | ~50 Gbps | N/A (DNS only) |
| Cost | $0.005/GB | $0.008/GB | $0.03/GB | $1.35/M DNS queries |
| Typical Use Case | SQL Server, MongoDB | Microservices on AKS | Global SPA, CDN | DR Failover |
Understanding Key Features (Explained Simply)
SSL Termination = Decrypt HTTPS once at load balancer, not on every server
- Why It Matters: Saves CPU on your servers (encryption is expensive)
- Example: 1,000 HTTPS requests → Load balancer decrypts them → Servers get plain HTTP
- Cost Savings: 20-30% less CPU usage on servers
Path-based Routing = Send requests to different backends based on the URL path
- Example: `/api` → API servers, `/images` → Image servers, `/admin` → Admin servers
- Why It Matters: Optimize server resources for specific tasks
- Analogy: Restaurant with different stations (grill, salad bar, dessert)
WAF (Web Application Firewall) = Filter malicious HTTP requests before they reach your app
- Blocks: SQL injection, XSS (cross-site scripting), DDoS attacks
- Example: Hacker sends `https://yoursite.com/api?id=1' OR '1'='1` → WAF blocks it
- Real Cost: Equifax breach cost $4B, could have been prevented with WAF
Session Affinity (Sticky Sessions) = Keep a user on the same server
- Problem: User logs in to Server A, next request goes to Server B (session lost!)
- Solution: “Pin” user to Server A for entire session
- Better Solution: Use Redis for shared sessions (no sticky sessions needed)
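For the path-based routing feature described above, the matching step can be sketched as a longest-prefix lookup. This is a minimal illustration, not how Application Gateway or Front Door implement it; the pool names and the `web-servers` default are hypothetical.

```python
# Hypothetical route table: URL path prefix -> backend pool name.
ROUTES = {
    "/api": "api-servers",
    "/images": "image-servers",
    "/admin": "admin-servers",
}
DEFAULT_POOL = "web-servers"  # assumed fallback pool, not from the source


def route(path, routes=ROUTES, default=DEFAULT_POOL):
    """Return the pool whose prefix matches the path on a segment boundary.

    The longest matching prefix wins; unmatched paths go to the default pool.
    `path` is expected without a query string.
    """
    best = ""
    for prefix in routes:
        matches = path == prefix or path.startswith(prefix + "/")
        if matches and len(prefix) > len(best):
            best = prefix
    return routes[best] if best else default
```

Matching on a segment boundary means `/apiary` does not accidentally land on the API pool, which is a common surprise with naive `startswith` checks.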
[!WARNING] Gotcha: Traffic Manager Isn’t a Load Balancer Traffic Manager is a DNS service. It returns an IP address to the client, then the client connects directly to that backend. If the backend goes down after DNS resolution, Traffic Manager won’t reroute traffic until the next DNS lookup (60+ seconds later). Use it for coarse-grained multi-region failover, not for real-time load balancing.
Common Mistake #1: Using Traffic Manager for Real-Time Failover
The Trap:
- Team deploys global app
- Uses Traffic Manager for failover
- Region goes down
- Problem: Users stuck on dead region for 60+ seconds (DNS TTL)
- Impact: Bad user experience, lost revenue
The Fix:
- Use Azure Front Door ($35/month)
- Failover in <10 seconds (no DNS caching)
- Cost: $35/month base fee, versus $1.35/M queries for Traffic Manager (similar price for most apps)
Session Affinity (Sticky Sessions) Explained
The Problem (Story Format): Imagine you’re shopping online:
- Step 1: You visit website → Load balancer sends you to Server A
- Step 2: You log in → Server A stores “You are logged in” in memory
- Step 3: You add item to cart → Load balancer sends you to Server B
- Result: Server B doesn’t know you’re logged in → “401 Unauthorized” error ❌
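Method 1: Azure Load Balancer (Hash-Based)
How It Works:
- Load balancer hashes fields of the connection. The default 5-tuple hash (source IP, source port, destination IP, destination port, protocol) only keeps a single connection on one server, because the source port changes with each new connection. For stickiness across connections, configure source IP affinity (a 2-tuple or 3-tuple hash that ignores the source port)
- Creates a “fingerprint” (hash)
- Always sends same fingerprint to same server
Pros:
- ✅ Works for any protocol (TCP, UDP, HTTP)
- ✅ Very fast (no cookies to parse)
Cons:
- ❌ If your IP changes (mobile switching cell towers), you get routed to different server
- ❌ If you’re behind NAT (corporate network), everyone shares same IP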
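The fingerprint idea above can be sketched as follows. This is only an illustration: the hash function Azure Load Balancer uses is not documented here, so SHA-256 and the three-server pool are assumptions. It contrasts hashing the full 5-tuple with hashing only source and destination IP (source IP affinity).

```python
import hashlib

BACKENDS = ["server-a", "server-b", "server-c"]  # hypothetical backend pool


def pick_backend(fields, backends=BACKENDS):
    """Hash the given connection fields and map the digest onto one backend."""
    key = "|".join(str(field) for field in fields).encode()
    digest = hashlib.sha256(key).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


def five_tuple(src_ip, src_port, dst_ip, dst_port, protocol):
    """Per-connection distribution: a new source port may pick a new backend."""
    return pick_backend((src_ip, src_port, dst_ip, dst_port, protocol))


def source_ip_affinity(src_ip, dst_ip):
    """2-tuple hash: every connection from one client IP picks the same backend."""
    return pick_backend((src_ip, dst_ip))
```

Because `source_ip_affinity` ignores the source port, all users behind one NAT address map to the same backend, which is the uneven-load drawback listed above.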
Method 2: Application Gateway / Front Door (Cookie-Based)
How It Works:
- First request → Load balancer picks Server A
- Response includes cookie: `Set-Cookie: ApplicationGatewayAffinity=abc123`
- Future requests → Browser sends cookie → Load balancer reads it → Routes to Server A
Pros:
- ✅ Survives IP changes (mobile networks, VPN switches)
- ✅ More accurate than IP-based
Cons:
- ❌ If user clears cookies, session is lost
- ❌ Slightly slower (must parse HTTP headers)
[!TIP] Best Practice: Use Redis (for example, Azure Cache for Redis) or another distributed cache for session state, so sticky sessions aren’t required. This allows horizontal scaling without session loss.
Why Shared Session Storage is Better:
Common Mistake #2: Relying on Sticky Sessions
The Trap:
- App stores sessions in server memory
- Uses sticky sessions
- Server crashes → All sessions on that server lost
- Impact: Users forced to log in again
Real-World Example:
- E-commerce site during Black Friday
- Server crash lost 10,000 active sessions
- Users had to re-add items to cart
- 70% abandoned their carts
- Cost: $2.1M in lost sales
The Fix:
- Migrate sessions to Redis ($20/month)
- No sticky sessions needed
- Server crashes don’t lose sessions
Health Probes: Keeping Dead Servers Out of Rotation
The Problem (Explained Simply): Imagine you have 3 servers behind a load balancer:
- Server A: Running fine ✅
- Server B: Running fine ✅
- Server C: Crashed (out of memory) ❌
How Health Probes Work
The Concept: Load balancer acts like a doctor doing checkups:
- Every 15-30 seconds: “Are you healthy?”
- Server responds: “Yes, I’m fine!” → Stays in rotation
- Server doesn’t respond: (Marked unhealthy after 2-3 failures) → Removed from rotation
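The checkup loop above can be modeled as a small state tracker. This is a simplified sketch: the class and function names are hypothetical, and it restores a server after a single successful probe, whereas a real load balancer may apply its own recovery rules.

```python
class ProbeTracker:
    """Tracks consecutive probe failures for one backend server."""

    def __init__(self, unhealthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.consecutive_failures = 0
        self.in_rotation = True

    def record(self, probe_succeeded):
        """Record one probe result and return whether the server is in rotation."""
        if probe_succeeded:
            # Simplification: one success resets the counter and restores the server.
            self.consecutive_failures = 0
            self.in_rotation = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.unhealthy_threshold:
                self.in_rotation = False
        return self.in_rotation


def seconds_to_detect(interval_s, unhealthy_threshold):
    """Approximate time before a dead server leaves rotation."""
    return interval_s * unhealthy_threshold
```

With a 15-second interval and a threshold of 2, `seconds_to_detect(15, 2)` gives 30 seconds; with 30 seconds and a threshold of 3 it gives 90 seconds.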
Azure Load Balancer Health Probes (Simple)
Probe settings in this example:
- Every 15 seconds, try to connect to port 80
- If 2 consecutive failures → Mark server unhealthy
- Marks unhealthy after: 2 × 15s = 30 seconds
TCP probe:
- Question: “Is port 80 open?”
- Server: “Yes, port is open” ✅
- Problem: Port might be open, but app crashed!
HTTP probe:
- Question: “GET /health → Give me HTTP 200 OK”
- Server: “HTTP 200 OK” ✅
- Better: Confirms app is actually responding
Application Gateway Health Probes (Advanced)
- Every 30 seconds, send `GET /health`
- If server doesn’t respond in 30 seconds → Timeout
- If 3 consecutive failures → Mark unhealthy
- Marks unhealthy after: 3 × 30s = 90 seconds
Use a `/health` endpoint that checks everything:
- Port might be open ✅
- App might be running ✅
- But if database is down → Health check fails → Server removed from rotation ✅
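A `/health` endpoint along these lines could look like the sketch below, using only the Python standard library. The `check_database` function is a hypothetical placeholder (a real one might run `SELECT 1`), and port 8080 is an assumption.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_database():
    """Hypothetical dependency check; replace with a real connectivity test."""
    return True


CHECKS = {"database": check_database}


def health_status(checks):
    """Run every check; return (200, results) if all pass, else (503, results)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed dependency.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, results = health_status(CHECKS)
        body = json.dumps(results).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Start the server; call this from your entry point."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Returning 503 when a dependency fails is what lets the probe remove the server even though its port is open and the process is running.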
[!WARNING] Gotcha: Health Probe IPs Health probes come from Azure’s internal IP address 168.63.129.16. You MUST allow this IP in your NSG, or all backends will be marked unhealthy!
Common Mistake #3: Blocking Health Probe IP
Real-World Example: A team locked down their NSG to only allow traffic from Front Door IP ranges. That also blocked the health probes from 168.63.129.16, so every backend was marked unhealthy:
- Incident duration: 4 hours
- Users affected: 500,000
- Revenue lost: $2.8M
- Prevention: One extra NSG rule (free!)
Connection Draining: Gracefully Shutting Down Backends
Problem: You deploy a new version. Azure removes the old VM from the load balancer pool, but it has 50 active connections processing long-running API requests. If you immediately kill the VM, those requests fail.
Azure Load Balancer
- Idle Timeout: After the idle timeout (4 minutes by default, configurable to 30 minutes), an inactive connection is closed.
- No Graceful Draining: Azure Load Balancer doesn’t support draining. Use a rolling update strategy.
Application Gateway
- How it works: When you remove a backend, App Gateway stops sending new requests to it, but allows existing connections to finish for up to 300 seconds.
[!TIP] Best Practice: Set the drain timeout slightly above your P99 request latency. If 99% of requests finish in 10 seconds, set drain timeout to 15s.
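One way to turn the P99 guidance above into a number is sketched below. The 1.5x headroom factor is an assumption chosen to reproduce the 10s → 15s example, and the function names are hypothetical; the input is a non-empty list of observed request latencies in seconds.

```python
import math


def p99(latencies_s):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_s)
    # Integer form of ceil(0.99 * n), avoiding floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def drain_timeout_s(latencies_s, headroom=1.5):
    """Drain timeout = P99 latency plus headroom, rounded up to whole seconds."""
    return math.ceil(p99(latencies_s) * headroom)
```

For a sample where the 99th-percentile request takes 10 seconds, `drain_timeout_s` returns 15.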
Cross-Region Load Balancing: Front Door vs Traffic Manager
| Scenario | Use Front Door | Use Traffic Manager |
|---|---|---|
| Global HTTP/S app | ✅ Automatic failover, anycast | ❌ DNS caching causes stale routes |
| Non-HTTP workload (TCP/UDP) | ❌ HTTP/S only | ✅ Works with any protocol |
| Real-time failover required | ✅ Failover in seconds (no DNS caching) | ❌ 60s+ DNS TTL delay |
| Cost-sensitive | ❌ $0.03/GB (3x more) | ✅ $1.35/M queries |
| Need CDN + WAF | ✅ Built-in | ❌ Must add separate CDN |
Decision Flowchart
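The choice can be approximated with two questions taken from the comparison tables above: is the traffic HTTP/S, and does it need to span regions? The function below is a simplified sketch with a hypothetical name; it ignores cost, WAF, and CDN requirements, which can change the answer.

```python
def choose_load_balancer(is_http, is_global):
    """Pick an Azure load balancing service from protocol and scope alone."""
    if is_http:
        # Layer 7 services: Front Door is global, Application Gateway is regional.
        return "Azure Front Door" if is_global else "Application Gateway"
    # Non-HTTP traffic: Traffic Manager routes via DNS, Load Balancer is Layer 4.
    return "Traffic Manager" if is_global else "Azure Load Balancer"
```

For example, a regional SQL Server cluster maps to `choose_load_balancer(False, False)`, which returns `"Azure Load Balancer"`.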
2. Azure Front Door
The modern entry point for global web applications. Think of Front Door as a combination of a global bouncer, traffic cop, and express delivery service, all managed by Microsoft at their edge network of 150+ Points of Presence (POPs) worldwide.
- CDN: Caches static content at the edge. A user in Tokyo gets your CSS/JS from a Tokyo POP instead of waiting for a round-trip to your East US backend (saving 150-200ms per request).
- Anycast: Users connect to the nearest Microsoft Edge node (POPs). Unlike DNS-based routing (Traffic Manager), anycast uses BGP to route at the network level — there is no DNS TTL delay.
- WAF: Web Application Firewall protects against SQL Injection, XSS, and other OWASP Top 10 attacks. Microsoft adds their own managed rule sets on top of OWASP rules, updated automatically as new threats emerge.
Pricing:
- Base fee: ~$35/month (Standard tier)
- Data transfer: $0.03/GB (first 5 TB), drops with volume
- WAF requests: $0.06/10,000 requests
- Typical cost for a mid-traffic site (1 million requests/month, 500 GB transfer): ~$55/month
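The typical-cost figure can be reproduced from the three prices listed above. This sketch uses only those list prices (first 5 TB tier) as defaults; the function name is hypothetical and real bills depend on tier, region, and volume discounts.

```python
def front_door_monthly_cost(requests, transfer_gb,
                            base_fee=35.0, per_gb=0.03, waf_per_10k=0.06):
    """Estimate monthly cost: base fee + data transfer + WAF request charges."""
    transfer = transfer_gb * per_gb           # 500 GB -> $15
    waf = requests / 10_000 * waf_per_10k     # 1M requests -> $6
    return round(base_fee + transfer + waf, 2)
```

For 1 million requests and 500 GB this gives $35 + $15 + $6 = $56, in line with the ~$55/month estimate.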
[!WARNING] Gotcha: The 4-minute timeout Front Door has a hard 240-second (4-minute) idle timeout for connections. If your backend takes longer to process a report, Front Door will cut the connection with a 504 Gateway Timeout. The fix: use asynchronous patterns. Return a 202 Accepted immediately with a status URL, and let the client poll for results.
3. Azure Application Gateway
The regional Layer 7 load balancer. If Front Door is the international airport, Application Gateway is the regional bus station: it distributes traffic within a single Azure region to the right backends.
- WAF: Uses OWASP 3.2 rules (same ruleset as Front Door). Always use WAF_v2 SKU for production; the v1 SKU lacks autoscaling and is being deprecated.
- Autoscaling: WAF_v2 scales from 0 to 125 instances based on traffic. This means you do not pay for peak capacity at all times — you pay for actual usage. Set min instances to 2 for production (prevents cold-start latency during traffic spikes).
- AGIC: Application Gateway Ingress Controller for AKS. Instead of deploying a separate nginx ingress inside your cluster, AGIC lets your Kubernetes Ingress resources configure Application Gateway directly. This keeps your WAF and SSL termination outside the cluster, which is both more secure and more cost-effective.
az network application-gateway delete to clean up — stopping it does not stop billing.
Practical Tip: When configuring path-based routing, plan your URL structure carefully. A common pattern:
4. Azure NAT Gateway
The Problem: SNAT Port Exhaustion, one of the most frustrating and confusing networking issues in Azure.
Real-World Analogy: Imagine a building with one phone line (public IP). Each phone call uses one line. If 100 employees try to make external calls simultaneously, some get a busy signal. SNAT port exhaustion is exactly this: your VMs are trying to make more outbound connections than there are available source ports.
When 100 VMs try to talk to the internet using one Standard Load Balancer public IP, each backend instance gets only 1,024 SNAT ports. Under heavy outbound load (calling third-party APIs, sending webhooks), connections start failing randomly with “connection timed out” errors. These failures are intermittent and maddening to debug because they only happen under load.
The Solution: NAT Gateway.
- Dedicated resource for outbound traffic that completely decouples your outbound path from your load balancer.
- Provides 64,000 SNAT ports per Public IP (vs. 1,024 per backend instance on a Load Balancer).
- You can attach up to 16 Public IPs (1,024,000+ concurrent connections).
- Idle timeout configurable from 4-120 minutes (default: 4 minutes).
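The port arithmetic behind these numbers can be sketched as below, using the figures quoted above as constants. It is a simplification: it treats every concurrent outbound connection as consuming one SNAT port, and the function names are hypothetical.

```python
import math

PORTS_PER_LB_INSTANCE = 1_024   # Load Balancer allocation per backend instance
PORTS_PER_NAT_IP = 64_000       # NAT Gateway ports per public IP
MAX_NAT_IPS = 16                # maximum public IPs per NAT Gateway


def lb_snat_exhausted(concurrent_connections):
    """True if one backend instance needs more ports than the LB gives it."""
    return concurrent_connections > PORTS_PER_LB_INSTANCE


def nat_gateway_ports(public_ips):
    """Total SNAT ports available with the given number of public IPs."""
    if not 1 <= public_ips <= MAX_NAT_IPS:
        raise ValueError("NAT Gateway supports 1 to 16 public IPs")
    return public_ips * PORTS_PER_NAT_IP


def nat_ips_needed(concurrent_connections):
    """Smallest number of public IPs that covers the connection count."""
    return max(1, math.ceil(concurrent_connections / PORTS_PER_NAT_IP))
```

For instance, 50 processes holding 25 connections each need 1,250 ports, which exceeds the Load Balancer allocation but fits within a single NAT Gateway public IP.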
[!IMPORTANT] Best Practice: Always attach a NAT Gateway to your subnets if you have high outbound traffic (e.g., API scrapers, high-volume webhooks, microservices calling external payment gateways). The $32/month cost is trivial compared to the engineering hours spent debugging SNAT exhaustion.
5. Azure DNS
DNS is the phone book of the internet — it translates human-readable names (myapp.com) into IP addresses (20.50.100.5) that computers use. Getting DNS wrong means nobody can reach your application, even if everything else is perfectly configured.
Public Zones
Host your domain (example.com). Azure serves public zones from a global anycast network of name servers (for example, ns1-01.azure-dns.com) and backs the service with a 100% availability SLA.
Cost: $0.40 per million queries. For most applications, DNS costs less than $2/month.
Practical Tip: When migrating a domain to Azure DNS, lower your TTL to 60 seconds 24-48 hours before the migration. This ensures that when you switch nameservers, DNS caches expire quickly and users get routed to your new records within minutes instead of hours.
Private Zones
Internal DNS (app.internal).
- Resolve hostnames across VNets (when the Private DNS Zone is linked to those VNets).
- Auto-registration: When you create a VM, it automatically gets a DNS record (`vm1.app.internal`). This eliminates the need to hardcode IP addresses in configuration files: if a VM is replaced, the DNS record updates automatically.
- Used heavily by Private Link to map `mypaas.privatelink.database.windows.net` to a private IP inside your VNet.
For Private Link to Azure SQL, you need a Private DNS Zone named `privatelink.database.windows.net` linked to your VNet. Without this, your application will resolve the SQL server’s hostname to its public IP, bypassing Private Link entirely, and your connection will either fail (if public access is disabled) or be insecure (going over the public internet).
Split-Horizon DNS
You can have `api.company.com` resolve to a Public IP for external users, but a Private IP (10.0.0.5) for internal users on VPN. This is a powerful pattern for hybrid environments: external customers hit your public-facing load balancer with WAF protection, while internal applications connect directly to the backend over the private network, which is faster, cheaper, and has no WAF overhead.
How it works: Create a Public DNS Zone with api.company.com pointing to your public IP, and a Private DNS Zone with the same name pointing to the private IP. Azure VMs linked to the Private Zone will resolve the private IP; everyone else gets the public IP.
6. Case Study: E-Commerce Architecture
Putting it all together:
- User hits `www.shop.com`.
- Azure Front Door intercepts, checks WAF, serves global cache.
- Forwards dynamic request to Application Gateway in `Region A`.
- App Gateway routes `/cart` to AKS Cluster (in a private subnet).
- AKS Pod talks to Azure SQL via Private Link (traffic never leaves VNet).
- AKS Pod sends email via SendGrid using NAT Gateway (to avoid SNAT port exhaustion).
- DevOps Engineer connects via VPN Gateway to debug DB issues.
Interview Deep-Dive
Azure has 4 load balancers. An architect proposes using Azure Load Balancer for a global e-commerce web application. What is wrong with this choice, and what would you recommend?
- Why Azure Load Balancer is wrong here: Azure Load Balancer operates at Layer 4 (TCP/UDP) and is regional. For a global e-commerce web application, you need Layer 7 intelligence (URL-based routing, cookie-based session affinity, SSL termination) and global reach (users in Tokyo should hit a nearby backend, not one in Virginia). Azure Load Balancer cannot route /api to one backend pool and /images to another. It cannot terminate SSL. It cannot detect that a user in Japan should be routed to a Japan-based origin.
- The correct architecture: Azure Front Door as the global entry point ($35/month), with Application Gateway in each region ($125/month) for VNet-level path-based routing and a second WAF layer. Behind Application Gateway, use Azure Load Balancer ($18/month) for distributing TCP traffic to backend VMs or AKS node pools.
- The layered pattern: Front Door (global L7) -> Application Gateway (regional L7) -> Load Balancer (regional L4) -> Backend servers. Each layer serves a different purpose: Front Door handles global routing and edge caching, Application Gateway handles VNet routing and SSL re-encryption, Load Balancer handles high-performance TCP distribution.
- Cost comparison for the correct architecture: Front Door $35 + 2 regional Application Gateways $250 + 2 regional Load Balancers $36 = $321/month. Using just Azure Load Balancer would cost far less, but the $440M Black Friday loss in the case study above was caused by choosing the wrong load balancing tier.
Your application uses sticky sessions via Application Gateway cookies. During a deployment, 30% of users lose their sessions and have to re-login. What went wrong and how do you fix it permanently?
Explain SNAT port exhaustion. Your team is seeing intermittent 'connection timed out' errors from an AKS cluster calling a third-party payment API. How do you diagnose and fix this?
- What SNAT exhaustion is: When multiple VMs or pods behind a Load Balancer make outbound connections to the internet, they share a pool of source ports (SNAT ports) mapped to the Load Balancer’s public IP. Each outbound connection consumes one SNAT port. Standard Load Balancer allocates 1,024 SNAT ports per backend instance. If your AKS node has 50 pods each making 25 concurrent connections to the payment API, that is 1,250 connections — exceeding the 1,024 port allocation. New connections fail with “connection timed out.”
- Why it is intermittent: SNAT ports are released 4 minutes after the TCP connection closes (TCP TIME_WAIT). During off-peak hours, connections close before the pool is exhausted. During peak payment processing, the creation rate exceeds the release rate and the pool is depleted.
- Diagnosis steps: (1) Check Load Balancer metrics for “SNAT Connection Count” with state “Failed” — any non-zero value confirms exhaustion. (2) Check the AKS node count and calculate: nodes x 1,024 ports = total pool. If your application needs more concurrent outbound connections than this, you have a problem. (3) Look at the connection pattern — are connections being reused (HTTP keep-alive) or created fresh for every request?
- Fix — deploy NAT Gateway: Attach an Azure NAT Gateway ($32/month) to the AKS node subnet. NAT Gateway provides 64,000 SNAT ports per public IP, and you can attach up to 16 public IPs for over 1 million concurrent connections. NAT Gateway completely decouples outbound traffic from the Load Balancer.
- Application-level fix (also important): Ensure the payment API client uses HTTP connection pooling with keep-alive. A single persistent connection handles hundreds of sequential requests without consuming additional SNAT ports. In .NET, use a singleton HttpClient. In Node.js, give Axios an `https.Agent` created with `keepAlive: true` (via the `httpsAgent` option). This often reduces SNAT usage by 80%+ and may eliminate the exhaustion without NAT Gateway.