Chapter 17: Managing the Bill - Cost Management and FinOps
In the cloud, cost is not just a line item for finance; it is an engineering metric. A poorly architected system isn’t just slow or insecure; it is expensive. In Google Cloud, managing costs requires a deep understanding of the billing hierarchy, the discount models, and the automation tools available to enforce “FinOps” (Financial Operations) principles.

Think of FinOps like managing a household budget. You would not give every family member an unlimited credit card and hope for the best. You set budgets, track spending, look for waste (that streaming subscription nobody uses), and negotiate discounts (annual plans instead of monthly). Cloud cost management follows the same logic, just at a much larger scale. AWS has similar tools (Cost Explorer, Savings Plans, Reserved Instances), and Azure has Cost Management + Advisor. The concepts are universal; only the implementation details differ.

1. The GCP Billing Hierarchy
To manage costs at scale, you must first understand how they are attributed.

- Cloud Billing Account: The top-level resource linked to a payment method.
- Projects: All resources live in projects, and each project is linked to one billing account.
- Labels: These are key-value pairs (e.g., `team:search`, `env:prod`) attached to resources. Labels are the single most important tool for cost allocation. They are exported into your billing data, allowing you to see exactly how much each team is spending.
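
To illustrate how labels drive cost allocation, here is a minimal sketch that rolls up cost by one label key. The row shape (a `cost` number plus a `labels` list of key/value pairs) only loosely mirrors the billing export and is an assumption; the sample rows and the `cost_by_label` helper are hypothetical.

```python
from collections import defaultdict


def cost_by_label(rows, label_key, unlabeled="(unlabeled)"):
    """Sum cost per value of `label_key`; rows lacking the label are grouped together."""
    totals = defaultdict(float)
    for row in rows:
        labels = {item["key"]: item["value"] for item in row.get("labels", [])}
        totals[labels.get(label_key, unlabeled)] += row["cost"]
    return dict(totals)


# Hypothetical sample rows, not real billing data.
rows = [
    {"cost": 120.0, "labels": [{"key": "team", "value": "search"},
                               {"key": "env", "value": "prod"}]},
    {"cost": 30.0, "labels": [{"key": "team", "value": "search"},
                              {"key": "env", "value": "dev"}]},
    {"cost": 45.5, "labels": [{"key": "team", "value": "ads"}]},
    {"cost": 10.0, "labels": []},
]

print(cost_by_label(rows, "team"))
```

Grouping unlabeled rows into their own bucket makes the labeling gap visible: whatever lands there is spend nobody currently owns.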
2. Discount Models: CUDs and SUDs
GCP offers several ways to reduce your “list price” spend.

Sustained Use Discounts (SUDs)
SUDs are automatic. If you run an eligible Compute Engine instance for more than 25% of a month, Google automatically starts applying a discount. For a full month, the discount can reach up to 30%, depending on the machine family. AWS and Azure do not offer an equivalent automatic discount for sustained usage; in AWS, for example, you must explicitly purchase Savings Plans or Reserved Instances. In GCP, you can get up to 30% off just by running a VM all month, with zero commitment.

Why this matters: For organizations that are new to cloud cost optimization, GCP’s SUDs provide an immediate cost reduction without any planning or purchasing decisions. On AWS, the equivalent savings require analyzing usage patterns, choosing between Standard and Convertible Reserved Instances, and making a commitment.

Committed Use Discounts (CUDs)
CUDs require a commitment (1 or 3 years) in exchange for deep discounts (up to 70%). AWS’s equivalents are Savings Plans (spend-based) and Reserved Instances (resource-based). Azure uses Reserved Virtual Machine Instances.

- Resource-based CUDs: You commit to a specific amount of vCPU and RAM in a specific region. Best for predictable, steady-state workloads like always-on databases or baseline web servers.
- Flexible (Spend-based) CUDs: You commit to a specific hourly spend (e.g., “$10/hour”). This applies across multiple regions and even multiple products (Compute Engine, Cloud Run, Spanner). Best for dynamic organizations that change machine types or regions frequently.
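
The arithmetic behind the two discount models can be sketched as follows. This is a rough illustration, not Google's billing logic: the tier schedule, the 730-hour month, the $0.10 hourly rate, and the 37% commitment discount are all assumptions, chosen so that a full month of sustained use lands on the 30% figure mentioned above.

```python
HOURS_PER_MONTH = 730  # assumed average month length in hours

# Assumed incremental schedule: each quarter of the month is billed at a
# decreasing fraction of the base rate, giving 30% off for a full month.
SUD_TIERS = [(0.25, 1.0), (0.50, 0.8), (0.75, 0.6), (1.00, 0.4)]


def sud_cost(hourly_rate, usage_fraction):
    """Monthly cost under tiered sustained-use pricing."""
    cost, prev_upper = 0.0, 0.0
    for upper, multiplier in SUD_TIERS:
        # Portion of the month that falls inside this tier (never negative).
        portion = max(0.0, min(usage_fraction, upper) - prev_upper)
        cost += portion * HOURS_PER_MONTH * hourly_rate * multiplier
        prev_upper = upper
    return cost


def cud_cost(hourly_rate, discount):
    """Monthly cost under a commitment: every hour is billed, used or not."""
    return HOURS_PER_MONTH * hourly_rate * (1 - discount)


rate = 0.10  # hypothetical on-demand price per hour
for usage in (0.5, 1.0):
    on_demand = HOURS_PER_MONTH * rate * usage
    print(usage, round(on_demand, 2), round(sud_cost(rate, usage), 2),
          round(cud_cost(rate, 0.37), 2))
```

Under these assumptions, the commitment is cheaper than the automatic discount when the VM runs all month, but more expensive when it runs only half the month, because committed hours are paid for whether or not they are used.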
3. Cost Optimization Strategies
The Recommender API
Google uses ML to analyze your resource usage and provides “Recommendations.”

- Rightsizing: It might suggest moving a VM from `n2-standard-4` to `n2-standard-2` if the CPU usage is consistently below 10%.
- Idle Resources: It identifies unattached Persistent Disks, idle IP addresses, and unused Load Balancers that are costing you money every hour.
Spot VMs (formerly Preemptible)
Spot VMs offer a 60-91% discount compared to on-demand prices.

- The Catch: Google can take them back at any time with a 30-second notice.
- Best Use: Batch processing, CI/CD runners, and fault-tolerant GKE node pools.
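
Fault tolerance on Spot VMs usually comes down to checkpointing. The sketch below shows one way a batch worker might resume after being interrupted. It assumes the process receives SIGTERM as part of the OS shutdown sequence during preemption; `PreemptionFlag`, `run_batch`, and the in-memory checkpoint dict are hypothetical names used for illustration, and a real worker would persist the checkpoint to durable storage.

```python
import signal


class PreemptionFlag:
    """Set when the process is asked to terminate."""

    def __init__(self):
        self.stop = False

    def install(self):
        # Flip the flag on SIGTERM instead of exiting immediately.
        signal.signal(signal.SIGTERM,
                      lambda signum, frame: setattr(self, "stop", True))


def run_batch(items, checkpoint, flag, handle):
    """Process items from the checkpointed index; return True when all are done."""
    i = checkpoint.get("next", 0)
    while i < len(items) and not flag.stop:
        handle(items[i])
        i += 1
        checkpoint["next"] = i  # in practice, write this somewhere durable
    return i == len(items)
```

Because progress is recorded after every item, a replacement VM that loads the same checkpoint picks up where the preempted one stopped instead of starting over.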
4. Advanced Visibility: Billing Export to BigQuery
The standard billing console is fine for small projects, but for enterprises, you must enable the Billing Export to BigQuery.

- Granularity: You get per-hour, per-resource cost data.
- Custom Dashboards: Point Looker Studio at your BigQuery billing dataset to build custom dashboards for every team lead.
- Anomaly Detection: You can write SQL queries to detect “Cost Spikes” (e.g., “Alert me if any project spends 20% more today than it did yesterday”).
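
The day-over-day spike rule quoted above can be sketched in a few lines. In practice this logic would more likely live in a SQL query against the export; the version below works on an in-memory dict of daily totals so the rule itself is easy to see. The data shape, project names, and dates are hypothetical.

```python
def find_cost_spikes(daily_costs, today, yesterday, threshold=0.20):
    """Return {project: fractional_increase} for projects that rose more than `threshold`."""
    spikes = {}
    for project, by_day in daily_costs.items():
        prev = by_day.get(yesterday, 0.0)
        curr = by_day.get(today, 0.0)
        # Projects with no spend yesterday are skipped; they need a separate rule.
        if prev > 0 and (curr - prev) / prev > threshold:
            spikes[project] = round((curr - prev) / prev, 4)
    return spikes


# Hypothetical daily totals per project.
daily_costs = {
    "web-prod": {"2024-05-01": 100.0, "2024-05-02": 130.0},
    "batch-dev": {"2024-05-01": 50.0, "2024-05-02": 55.0},
    "new-project": {"2024-05-02": 12.0},
}

print(find_cost_spikes(daily_costs, today="2024-05-02", yesterday="2024-05-01"))
```

A pure percentage rule is noisy for very small projects, so a real alert would probably combine it with a minimum absolute amount.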
5. GKE Cost Optimization
Kubernetes is a major cost driver. GKE offers specialized tools to keep it under control:

- GKE Autopilot: You pay only for the Pods you run. Google handles the “bin-packing” (fitting as many pods onto a node as possible), eliminating the cost of idle node capacity.
- Cost Allocation: GKE can attribute costs down to the Namespace or even the Label level within a cluster. This is essential for chargebacks in a shared cluster environment.
6. Budgets and Programmatic Alerts
A “Budget” in GCP does not stop your services; it only sends alerts.

- Thresholds: Set alerts at 50%, 90%, and 100% of your expected spend.
- Pub/Sub Integration: You can send a budget alert to a Pub/Sub topic. This can trigger a Cloud Function that automatically shuts down non-production environments if they exceed their monthly limit.
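
The decision logic inside such a function might look like the sketch below. It assumes the Pub/Sub message carries a base64-encoded JSON payload with `costAmount`, `budgetAmount`, and `budgetDisplayName` fields, which matches the budget notification format as I understand it but should be checked against the current schema. The `environment` argument and the `shut_down` callback are hypothetical stand-ins for however you tag and stop your environments.

```python
import base64
import json


def decode_budget_event(event):
    """Decode the base64-encoded JSON payload carried in a Pub/Sub message."""
    return json.loads(base64.b64decode(event["data"]).decode("utf-8"))


def handle_budget_alert(event, environment, shut_down):
    """Call `shut_down` when a non-production budget is fully spent.

    Returns True if the shutdown callback was triggered.
    """
    payload = decode_budget_event(event)
    # Compare the amounts explicitly rather than treating the arrival
    # of a message as proof that the budget was exceeded.
    over_budget = payload["costAmount"] >= payload["budgetAmount"]
    if over_budget and environment != "prod":
        shut_down(payload["budgetDisplayName"])
        return True
    return False


# Hypothetical message, built locally to exercise the handler.
sample = {"budgetDisplayName": "dev-sandbox", "costAmount": 1050.0,
          "budgetAmount": 1000.0, "currencyCode": "USD"}
event = {"data": base64.b64encode(json.dumps(sample).encode("utf-8")).decode("ascii")}

stopped = []
handle_budget_alert(event, "dev", stopped.append)
print(stopped)
```

Keeping the decision separate from the shutdown action makes the handler easy to test without touching real resources, and the explicit production guard keeps an alert from taking down a live service.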
6. Advanced FinOps: Egress and Orphans
6.1 Identifying “Orphaned” Resources
A common source of waste is “orphaned” resources: disks or IPs left behind after a VM is deleted.

- BigQuery SQL: Use the billing export to find resources with `cost > 0` but `usage = 0` or no associated labels.
- Automation: Use the Recommender API to identify these orphans automatically, then script their deletion in non-production projects.
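
The filter described above can be sketched as a small function. The row fields (`resource`, `cost`, `usage`, `labels`) are assumptions standing in for whatever columns your export query returns, and the sample rows are made up.

```python
def find_orphan_candidates(rows):
    """Flag resources that cost money but show no usage or carry no labels."""
    flagged = []
    for row in rows:
        if row["cost"] <= 0:
            continue  # free resources are not a cost problem
        reasons = []
        if row.get("usage", 0) == 0:
            reasons.append("no usage")
        if not row.get("labels"):
            reasons.append("no labels")
        if reasons:
            flagged.append((row["resource"], reasons))
    return flagged


# Hypothetical rows, not real billing data.
rows = [
    {"resource": "disk-old-vm", "cost": 4.0, "usage": 0, "labels": {}},
    {"resource": "ip-unused", "cost": 7.2, "usage": 0, "labels": {"team": "search"}},
    {"resource": "vm-web-1", "cost": 90.0, "usage": 720, "labels": {"team": "search"}},
]

for resource, reasons in find_orphan_candidates(rows):
    print(resource, reasons)
```

The output is a list of candidates, not a deletion list; an unlabeled resource may still be in use, so a review step before deletion is advisable.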
6.2 Network Egress Analysis
Egress is often the most misunderstood cost. Think of it like shipping charges: storing goods in a warehouse is cheap, but every time you ship something out, you pay delivery fees. Most teams budget for compute and storage but are blindsided by egress.

| Egress Type | Approximate Cost | Example |
|---|---|---|
| Internet Egress | $0.23/GB | Serving images to users worldwide |
| Cross-Region Egress | $0.01/GB | Replicating data from us-central1 to europe-west1 |
| Cross-Zone Egress | $0.01/GB | GKE pods talking across zones (hidden cost) |
| Same-Zone | Free | VMs in the same zone communicating |
- Tip: Use VPC Flow Logs joined with BigQuery billing to identify which specific service is driving high egress costs. AWS has similar egress pricing, but Azure offers free cross-zone traffic within a region — a significant cost advantage for zone-distributed workloads.
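
To see how quickly egress adds up, here is a small estimator that applies the approximate per-GB rates from the table to a monthly traffic profile. The rates are the table's rough figures, not current list prices, and the traffic volumes are hypothetical.

```python
# Approximate per-GB rates taken from the table above (USD).
EGRESS_RATES_PER_GB = {
    "internet": 0.23,
    "cross_region": 0.01,
    "cross_zone": 0.01,
    "same_zone": 0.0,
}


def estimate_egress_cost(traffic_gb):
    """Return (per-type cost dict, total) for a {egress_type: GB} traffic profile."""
    costs = {
        kind: round(gb * EGRESS_RATES_PER_GB[kind], 2)
        for kind, gb in traffic_gb.items()
    }
    return costs, round(sum(costs.values()), 2)


# Hypothetical monthly traffic in GB.
traffic = {"internet": 5000, "cross_zone": 20000, "same_zone": 80000}

costs, total = estimate_egress_cost(traffic)
print(costs, total)
```

In this made-up profile the cross-zone chatter is four times the internet volume yet a fraction of the cost, while the large same-zone volume is free; the per-GB rate matters as much as the traffic itself.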
7. Interview Preparation
1. Q: What are “Committed Use Discounts” (CUDs) and how do they differ from “Sustained Use Discounts” (SUDs)?
A: SUDs are automatic; you get them just by running a VM for more than 25% of a month. CUDs require a commitment (1 or 3 years) but offer much deeper discounts (up to 70%). CUDs can be Resource-based (fixed vCPU/RAM in one region) or Flexible (spend-based, applying across multiple regions and products like Cloud Run and Spanner).

2. Q: Why is “Billing Export to BigQuery” considered a mandatory FinOps practice?
A: The standard Cloud Console only provides high-level views. Billing Export provides granular, per-resource, hourly cost data. By exporting to BigQuery, you can:

- Join costs with Labels to create accurate department-level chargebacks.
- Build custom dashboards in Looker Studio.
- Write SQL queries to detect “Cost Spikes” or “Zombie Resources” (idle disks/IPs) programmatically.
- VM Rightsizing: Suggesting a smaller machine type if CPU is low.
- Idle Resources: Identifying unattached Persistent Disks or unassigned Static IPs.
- CUD Recommendations: Identifying where a commitment would save money based on steady-state usage.
- Disable Billing for the project (shuts down all resources).
- Scale down GKE deployments to zero.
- Remove external IP addresses.