Chapter 17: Managing the Bill - Cost Management and FinOps

In the cloud, cost is not just a line item for finance; it is an engineering metric. A poorly architected system isn’t just slow or insecure — it is expensive. In Google Cloud, managing costs requires a deep understanding of the billing hierarchy, the discount models, and the automation tools available to enforce “FinOps” (Financial Operations) principles. Think of FinOps like managing a household budget. You would not give every family member an unlimited credit card and hope for the best. You set budgets, track spending, look for waste (that streaming subscription nobody uses), and negotiate discounts (annual plans instead of monthly). Cloud cost management follows the same logic, just at a much larger scale. AWS has similar tools (Cost Explorer, Savings Plans, Reserved Instances), and Azure has Cost Management + Advisor. The concepts are universal — only the implementation details differ.

1. The GCP Billing Hierarchy

To manage costs at scale, you must first understand how they are attributed.
  • Cloud Billing Account: The top-level resource linked to a payment method.
  • Projects: All resources live in projects, and each project is linked to one billing account.
  • Labels: These are key-value pairs (e.g., team:search, env:prod) attached to resources. Labels are the single most important tool for cost allocation. They are exported into your billing data, allowing you to see exactly how much each team is spending.
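As a minimal sketch of label-based attribution, the snippet below groups cost rows by the value of a `team` label. The row shape (a dict with `cost` and `labels`) is a simplified, hypothetical stand-in for exported billing data, not the real export schema.

```python
from collections import defaultdict


def cost_by_label(rows, key):
    """Sum cost per value of the given label key.

    Rows that do not carry the label are grouped under "unlabeled" so that
    unattributed spend stays visible instead of silently disappearing.
    """
    totals = defaultdict(float)
    for row in rows:
        value = row.get("labels", {}).get(key, "unlabeled")
        totals[value] += row["cost"]
    return dict(totals)


# Hypothetical sample rows, for illustration only.
rows = [
    {"cost": 120.0, "labels": {"team": "search", "env": "prod"}},
    {"cost": 30.0, "labels": {"team": "search", "env": "dev"}},
    {"cost": 75.5, "labels": {"team": "ads", "env": "prod"}},
    {"cost": 12.0, "labels": {}},
]

print(cost_by_label(rows, "team"))
```

The "unlabeled" bucket is a design choice: its size is a quick measure of how much spend cannot yet be charged back to a team.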

2. Discount Models: CUDs and SUDs

GCP offers several ways to reduce your “list price” spend.

Sustained Use Discounts (SUDs)

SUDs are automatic. If you run an eligible Compute Engine instance for more than 25% of a month, Google automatically starts applying a discount. For a full month, the discount can reach up to 30% (the maximum varies by machine family, and some families are not eligible). This is a uniquely GCP feature: AWS and Azure do not offer automatic discounts for sustained usage. In AWS, you must explicitly purchase Savings Plans or Reserved Instances to get any discount. In GCP, you can get up to 30% off just by running an eligible VM all month, with zero commitment.

Why this matters: For organizations that are new to cloud cost optimization, GCP’s SUDs provide an immediate cost reduction without any planning or purchasing decisions. On AWS, the equivalent savings require analyzing usage patterns, choosing between Standard vs. Convertible Reserved Instances, and committing upfront capital.
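To see how a discount that starts after 25% of the month can add up to 30% over a full month, here is a rough worked example. The tier multipliers are an assumption (the incremental schedule commonly documented for N1 machine types); this chapter does not list them, so check current pricing documentation before relying on them.

```python
# Assumed incremental tiers: (upper bound of monthly usage, multiplier on base rate).
# Each quarter of the month is billed at a progressively lower rate.
TIERS = [(0.25, 1.0), (0.50, 0.8), (0.75, 0.6), (1.00, 0.4)]


def effective_discount(usage_fraction):
    """Return the overall discount for a VM that ran `usage_fraction` of the month."""
    if not 0 < usage_fraction <= 1:
        raise ValueError("usage_fraction must be in (0, 1]")
    billed = 0.0
    lower = 0.0
    for upper, multiplier in TIERS:
        if usage_fraction > lower:
            # Bill only the part of the usage that falls inside this tier.
            billed += (min(usage_fraction, upper) - lower) * multiplier
        lower = upper
    return 1 - billed / usage_fraction


for fraction in (0.25, 0.50, 0.75, 1.00):
    print(f"{fraction:.0%} of month -> {effective_discount(fraction):.0%} discount")
```

Under these assumed tiers, the full-month case bills 0.25 + 0.20 + 0.15 + 0.10 = 0.70 of the list price, which is where the 30% figure comes from.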

Committed Use Discounts (CUDs)

CUDs require a commitment (1 or 3 years) in exchange for deep discounts (up to 70%). AWS’s equivalent is Savings Plans (spend-based) and Reserved Instances (resource-based). Azure uses Reserved Virtual Machine Instances.
  • Resource-based CUDs: You commit to a specific amount of vCPU and RAM in a specific region. Best for predictable, steady-state workloads like always-on databases or baseline web servers.
  • Flexible (Spend-based) CUDs: You commit to a specific hourly spend (e.g., “$10/hour”). This applies across multiple regions and even multiple products (Compute Engine, Cloud Run, Spanner). Best for dynamic organizations that change machine types or regions frequently.
Practical Decision Framework: Start with SUDs (free, automatic). After 3 months of stable usage, analyze your baseline with the Recommender API. Purchase CUDs only for the predictable “floor” of your usage — not the peaks. A common mistake is over-committing: buying CUDs for 100% of your current usage, then needing to scale down but still paying for the commitment. Aim to commit to 60-70% of your steady-state usage and let SUDs handle the rest.
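The over-commitment risk can be made concrete with a small calculation. This is a sketch under stated assumptions: the 40% CUD discount and the unit price of 1.0 are hypothetical placeholders, and usage above the commitment is billed at plain on-demand rates (SUDs on the overflow are ignored for simplicity).

```python
def monthly_cost(usage_units, committed_units, on_demand_rate, cud_discount):
    """Cost of one month of usage.

    The commitment is paid in full whether or not it is used;
    any usage above it is billed at the on-demand rate.
    """
    committed_cost = committed_units * on_demand_rate * (1 - cud_discount)
    overflow = max(usage_units - committed_units, 0)
    return committed_cost + overflow * on_demand_rate


RATE = 1.0       # hypothetical price per unit-month
DISCOUNT = 0.40  # hypothetical CUD discount

for usage in (100, 50):  # steady state, then a scale-down
    full = monthly_cost(usage, 100, RATE, DISCOUNT)    # committed to 100% of baseline
    floor = monthly_cost(usage, 65, RATE, DISCOUNT)    # committed to 65% of baseline
    none = monthly_cost(usage, 0, RATE, DISCOUNT)      # no commitment
    print(f"usage={usage}: full={full:.0f} floor={floor:.0f} none={none:.0f}")
```

With these numbers, the full commitment is cheapest while usage stays at 100 units, but after a scale-down to 50 units it costs more than having no commitment at all, while the 65% commitment stays below the on-demand cost in both cases.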

3. Cost Optimization Strategies

The Recommender API

Google uses ML to analyze your resource usage and provides “Recommendations.”
  • Rightsizing: It might suggest moving a VM from n2-standard-4 to n2-standard-2 if the CPU usage is consistently below 10%.
  • Idle Resources: It identifies unattached Persistent Disks, idle IP addresses, and unused Load Balancers that are costing you money every hour.
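The rightsizing idea can be illustrated with a simple threshold rule. This is not how the Recommender works internally (its model is not described here); it is only a sketch of the "CPU consistently below 10%" heuristic, and `threshold` and `min_share` are hypothetical tuning parameters.

```python
def is_rightsizing_candidate(cpu_samples, threshold=0.10, min_share=0.95):
    """Return True if at least `min_share` of CPU samples fall below `threshold`.

    cpu_samples: utilization values between 0.0 and 1.0 collected over time.
    """
    if not cpu_samples:
        return False  # no data, no recommendation
    below = sum(1 for sample in cpu_samples if sample < threshold)
    return below / len(cpu_samples) >= min_share


# Hypothetical utilization samples for two VMs.
idle_vm = [0.03, 0.05, 0.04, 0.06, 0.02, 0.08, 0.05, 0.04]
busy_vm = [0.03, 0.55, 0.70, 0.06, 0.62, 0.08, 0.45, 0.04]

print(is_rightsizing_candidate(idle_vm))
print(is_rightsizing_candidate(busy_vm))
```

Requiring a share of samples rather than a low average avoids flagging a VM that is idle most of the day but saturated during a short daily peak.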

Spot VMs (formerly Preemptible)

Spot VMs offer a 60-91% discount compared to on-demand prices.
  • The Catch: Google can take them back at any time with a 30-second notice.
  • Best Use: Batch processing, CI/CD runners, and fault-tolerant GKE node pools.

4. Advanced Visibility: Billing Export to BigQuery

The standard billing console is fine for small projects, but for enterprises, you must enable the Billing Export to BigQuery.
  • Granularity: You get per-hour, per-resource cost data.
  • Custom Dashboards: Point Looker Studio at your BigQuery billing dataset to build custom dashboards for every team lead.
  • Anomaly Detection: You can write SQL queries to detect “Cost Spikes” (e.g., “Alert me if any project spends 20% more today than it did yesterday”).
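The "20% more than yesterday" rule from the last bullet can be sketched in a few lines. The input dictionaries (project name to daily cost) are a hypothetical shape; in practice they would be the result of a query against the billing export.

```python
def find_cost_spikes(today, yesterday, threshold=0.20):
    """Return {project: relative_increase} for projects whose daily spend
    rose by more than `threshold` compared with the previous day."""
    spikes = {}
    for project, cost_today in today.items():
        cost_prev = yesterday.get(project, 0.0)
        if cost_prev <= 0:
            continue  # no baseline to compare against; review new projects separately
        increase = (cost_today - cost_prev) / cost_prev
        if increase > threshold:
            spikes[project] = increase
    return spikes


# Hypothetical daily totals per project.
yesterday = {"web-prod": 100.0, "data-pipeline": 100.0}
today = {"web-prod": 130.0, "data-pipeline": 105.0, "new-sandbox": 50.0}

for project, increase in find_cost_spikes(today, yesterday).items():
    print(f"{project}: +{increase:.0%} versus yesterday")
```

A pure percentage rule is noisy for very small projects, so a real alert would likely also require a minimum absolute increase.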

5. GKE Cost Optimization

Kubernetes is a major cost driver. GKE offers specialized tools to keep it under control:
  • GKE Autopilot: You pay only for the Pods you run. Google handles the “bin-packing” (fitting as many pods onto a node as possible), eliminating the cost of idle node capacity.
  • Cost Allocation: GKE can attribute costs down to the Namespace or even the Label level within a cluster. This is essential for chargebacks in a shared cluster environment.
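As a rough illustration of namespace-level chargeback, the sketch below splits a cluster's cost across namespaces in proportion to their resource requests. This is a simplified, hypothetical model for intuition only; GKE's own cost allocation feature produces these numbers for you and its exact method is not described in this chapter.

```python
def allocate_cluster_cost(cluster_cost, requests_by_namespace):
    """Split `cluster_cost` across namespaces in proportion to requested vCPUs."""
    total = sum(requests_by_namespace.values())
    if total == 0:
        return {namespace: 0.0 for namespace in requests_by_namespace}
    return {
        namespace: cluster_cost * requested / total
        for namespace, requested in requests_by_namespace.items()
    }


# Hypothetical vCPU requests per namespace in one shared cluster.
requests = {"search": 6, "ads": 3, "batch": 1}

for namespace, cost in allocate_cluster_cost(1000.0, requests).items():
    print(f"{namespace}: ${cost:.2f}")
```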

6. Budgets and Programmatic Alerts

A “Budget” in GCP does not stop your services; it only sends alerts.
  • Thresholds: Set alerts at 50%, 90%, and 100% of your expected spend.
  • Pub/Sub Integration: You can send a budget alert to a Pub/Sub topic. This can trigger a Cloud Function that automatically shuts down non-production environments if they exceed their monthly limit.
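A minimal sketch of the decision logic inside such a function is shown below. It assumes the Pub/Sub message carries a base64-encoded JSON payload with `costAmount` and `budgetAmount` fields, which matches the budget notification format as I understand it; verify the field names against the current documentation. The action names are hypothetical, and the actual shutdown call is deliberately left out.

```python
import base64
import json


def decide_action(pubsub_message, shutdown_ratio=1.0, warn_ratio=0.5):
    """Decode a budget notification and return the action to take.

    Returns "shutdown_non_prod", "notify", or "none" (hypothetical action names).
    """
    payload = json.loads(base64.b64decode(pubsub_message["data"]).decode("utf-8"))
    cost = payload["costAmount"]
    budget = payload["budgetAmount"]
    ratio = cost / budget if budget else 0.0
    if ratio >= shutdown_ratio:
        return "shutdown_non_prod"  # placeholder: call your own shutdown routine here
    if ratio >= warn_ratio:
        return "notify"
    return "none"


def make_test_message(cost, budget):
    """Build a fake Pub/Sub message for local testing (illustration only)."""
    body = json.dumps({"budgetDisplayName": "dev-budget",
                       "costAmount": cost, "budgetAmount": budget})
    return {"data": base64.b64encode(body.encode("utf-8")).decode("ascii")}


print(decide_action(make_test_message(30.0, 100.0)))
print(decide_action(make_test_message(95.0, 100.0)))
print(decide_action(make_test_message(120.0, 100.0)))
```

Keeping the decision separate from the side effect makes the logic testable without touching any cloud resources, which matters for code that is able to shut environments down.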

6. Advanced FinOps: Egress and Orphans

6.1 Identifying “Orphaned” Resources

A common source of waste is “orphaned” resources—disks or IPs left behind after a VM is deleted.
  • BigQuery SQL: Use the billing export to find resources with cost > 0 but usage = 0 or no associated labels.
  • Automation: Use the Recommender API to identify these orphans, then have your own automation review and delete them in non-production projects.
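The filter described in the first bullet can be sketched as follows. The row fields (`resource`, `cost`, `usage_amount`, `labels`) are hypothetical simplifications of billing export data, and the output is only a list of candidates for review, not a deletion list.

```python
def find_orphan_candidates(rows):
    """Return names of resources that cost money but show no usage
    or carry no labels."""
    candidates = []
    for row in rows:
        if row["cost"] <= 0:
            continue  # free resources are not a cost problem
        if row["usage_amount"] == 0 or not row["labels"]:
            candidates.append(row["resource"])
    return candidates


# Hypothetical rows, for illustration only.
rows = [
    {"resource": "disk-old-vm", "cost": 4.0, "usage_amount": 0, "labels": {"team": "search"}},
    {"resource": "ip-unused", "cost": 7.2, "usage_amount": 720, "labels": {}},
    {"resource": "vm-web-1", "cost": 95.0, "usage_amount": 720, "labels": {"team": "web"}},
    {"resource": "free-tier-fn", "cost": 0.0, "usage_amount": 10, "labels": {}},
]

print(find_orphan_candidates(rows))
```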

6.2 Network Egress Analysis

Egress is often the most misunderstood cost. Think of it like shipping charges — storing goods in a warehouse is cheap, but every time you ship something out, you pay delivery fees. Most teams budget for compute and storage but are blindsided by egress.
Egress Type | Approximate Cost | Example
--- | --- | ---
Internet Egress | $0.08-$0.23/GB | Serving images to users worldwide
Cross-Region Egress | $0.01/GB | Replicating data from us-central1 to europe-west1
Cross-Zone Egress | $0.01/GB | GKE pods talking across zones (hidden cost)
Same-Zone | Free | VMs in the same zone communicating
Common Mistake: Running a GKE cluster across 3 zones (good for HA) without realizing that every service-to-service call that crosses a zone boundary incurs $0.01/GB. For a chatty microservices architecture processing 10TB/month of internal traffic, cross-zone egress alone can cost $100/month. Use topology-aware routing in GKE to prefer same-zone backends.
  • Tip: Use VPC Flow Logs joined with BigQuery billing to identify which specific service is driving high egress costs. AWS has similar egress pricing, but Azure offers free cross-zone traffic within a region — a significant cost advantage for zone-distributed workloads.
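The egress arithmetic is simple enough to check by hand, and a few lines of code make it easy to rerun with your own traffic volumes. The rates below are the approximate figures quoted in this section, and 10 TB is taken as 10,000 GB; real pricing varies by region and destination.

```python
def egress_cost(gigabytes, rate_per_gb):
    """Monthly egress cost for a given traffic volume and per-GB rate."""
    return gigabytes * rate_per_gb


TRAFFIC_GB = 10_000  # 10 TB/month, using 1 TB = 1,000 GB

cross_zone = egress_cost(TRAFFIC_GB, 0.01)
internet_low = egress_cost(TRAFFIC_GB, 0.08)
internet_high = egress_cost(TRAFFIC_GB, 0.23)

print(f"Cross-zone: ${cross_zone:,.0f}/month")
print(f"Internet:   ${internet_low:,.0f} to ${internet_high:,.0f}/month")
```

The same volume sent to the internet instead of across zones would cost roughly 8 to 23 times more at these rates, which is why internet egress usually deserves attention first.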

7. Interview Preparation

1. Q: What are “Committed Use Discounts” (CUDs) and how do they differ from “Sustained Use Discounts” (SUDs)? A: SUDs are automatic; you get them just by running a VM for more than 25% of a month. CUDs require a commitment (1 or 3 years) but offer much deeper discounts (up to 70%). CUDs can be Resource-based (fixed vCPU/RAM in one region) or Flexible (spend-based, applying across multiple regions and products like Cloud Run and Spanner).
2. Q: Why is “Billing Export to BigQuery” considered a mandatory FinOps practice? A: The standard Cloud Console only provides high-level views. Billing Export provides granular, per-resource, hourly cost data. By exporting to BigQuery, you can:
  • Join costs with Labels to create accurate department-level chargebacks.
  • Build custom dashboards in Looker Studio.
  • Write SQL queries to detect “Cost Spikes” or “Zombie Resources” (idle disks/IPs) programmatically.
3. Q: How does GKE Autopilot change the cost-management responsibility for an SRE? A: In GKE Standard, the SRE is responsible for Bin-packing (fitting pods into nodes to avoid idle CPU/RAM). If nodes are 20% utilized, you still pay for 100%. In Autopilot, Google handles the bin-packing. You are billed only for the Pod Requests. The SRE’s responsibility shifts from managing “Node Waste” to managing “Pod Request Right-sizing” (ensuring developers don’t request 4GB of RAM for a 512MB app).
4. Q: What is the “Recommender API” and how does it help with cost optimization? A: The Recommender uses machine learning to analyze your historical usage. It provides actionable recommendations like:
  • VM Rightsizing: Suggesting a smaller machine type if CPU is low.
  • Idle Resources: Identifying unattached Persistent Disks or unassigned Static IPs.
  • CUD Recommendations: Identifying where a commitment would save money based on steady-state usage.
5. Q: How do you implement “Budget Automation” to prevent massive cloud bills? A: You set a Budget Alert in the Billing console. Instead of just sending an email, you connect the alert to a Pub/Sub topic. When a threshold (e.g., 100%) is hit, the Pub/Sub message triggers a Cloud Function. That function can then use the GCP APIs to:
  • Disable Billing for the project (shuts down all resources).
  • Scale down GKE deployments to zero.
  • Remove external IP addresses.

Implementation: The “FinOps Master” Lab

Analyzing Costs with SQL in BigQuery

Once your billing export is enabled, you can run powerful queries like this:
-- Find the top 10 most expensive labels (teams) in the last 30 days
-- Why: This is your "chargeback" query -- it tells each team exactly what they are spending
-- Without labels, you get a single bill for the entire organization with no accountability
-- Pro-tip: Make labeling mandatory via Org Policy before costs get out of control
SELECT
  labels.value as team,
  SUM(cost) as total_cost,
  SUM(cost) / 30 as avg_daily_cost   -- Add daily average to spot trends
FROM
  `my_project.billing_dataset.gcp_billing_export_v1`,
  UNNEST(labels) as labels
WHERE
  labels.key = "team"
  AND usage_start_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
Common Mistake: Not enabling billing export until costs are already out of control. The export only starts collecting data on the day you enable it, and there is no way to backfill historical data. Enable it on day one of your GCP journey, even if you are just using the free tier. The storage cost in BigQuery is negligible (pennies per month) and the visibility it provides is invaluable.

Pro-Tip: Committed Use Discount Sharing

If you have multiple projects under one billing account, enable CUD Sharing. This allows a discount purchased in “Project A” to be applied to a matching VM running in “Project B” if Project A isn’t using its full commitment. This prevents “wasted” discounts.
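The effect of sharing can be shown with a simplified model. This is a sketch for intuition only: the project names and vCPU numbers are hypothetical, the owning project is served first, and leftover commitment is handed to other projects in dictionary order. Real attribution rules may differ.

```python
def covered_usage(commitment, usage_by_project, owner, sharing_enabled):
    """Return (covered vCPUs per project, unused commitment) in a simplified model."""
    covered = {project: 0 for project in usage_by_project}
    covered[owner] = min(commitment, usage_by_project.get(owner, 0))
    remaining = commitment - covered[owner]
    if sharing_enabled:
        for project, usage in usage_by_project.items():
            if project == owner or remaining == 0:
                continue
            # Hand leftover commitment to other projects on the billing account.
            covered[project] = min(remaining, usage)
            remaining -= covered[project]
    return covered, remaining


usage = {"project-a": 10, "project-b": 8}  # vCPUs in use

print(covered_usage(16, usage, "project-a", sharing_enabled=False))
print(covered_usage(16, usage, "project-a", sharing_enabled=True))
```

Without sharing, 6 of the 16 committed vCPUs are paid for but unused; with sharing, they offset usage in the second project instead.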