Chapter 15: Infrastructure as Code - Terraform on GCP

In the modern cloud era, clicking through the web console is considered an “anti-pattern” for production systems. Infrastructure as Code (IaC) allows you to define your entire data center in text files, which makes your environments reproducible, version-controlled, and auditable. On Google Cloud, Terraform is the de facto standard for IaC.

Think of IaC like a recipe for a meal. If you cook by randomly throwing ingredients into a pot (clicking through the console), you can never reproduce the result, and nobody else can make the same dish. A recipe (Terraform code) ensures that anyone can recreate the exact same infrastructure, every time, and you can see exactly what changed between versions by looking at the Git diff.

AWS users typically use Terraform or CloudFormation; Azure users use Terraform or Bicep. Terraform’s advantage is that the same language (HCL) and workflow apply across all three clouds, although each cloud still has its own provider and resource types.

1. Why Terraform is the GCP Standard

While Google offers Deployment Manager (native) and Config Connector (Kubernetes-based), Terraform remains the preferred choice for most engineers.
  • The Google Provider: Google maintains one of the most comprehensive Terraform providers in existence. New GCP features often have Terraform support on “Day 0.”
  • Immutable Infrastructure: Terraform encourages you to replace resources rather than patching them, which reduces “Configuration Drift.”
  • Plan and Apply: The terraform plan command acts as a safety net, showing you exactly what will happen before you make any changes.

2. Advanced State Management

The terraform.tfstate file is the source of truth for your infrastructure. In production, you must manage this with extreme care.

The GCS Backend

Always store your state in a Cloud Storage bucket with the following configuration. In AWS, the equivalent is an S3 backend with DynamoDB locking. GCP simplifies this by providing native locking in GCS without needing a separate service.
  • Versioning: Enable bucket versioning so you can recover from a corrupted state. If a terraform apply fails mid-execution and corrupts the state, you can restore the previous version from GCS versioning. Without this, you may need to manually edit the state file or import resources one by one — a painful, error-prone process.
  • Locking: GCS natively supports state locking. This prevents two developers from running terraform apply at the same time and corrupting the state.
Common Disaster Scenario: Developer A runs terraform apply on their laptop. While it is running, Developer B (who does not know A is applying) also runs terraform apply from their laptop. Without state locking, both processes read the same state, make conflicting changes, and the resulting state file is corrupted. With GCS locking, Developer B’s command immediately fails with a “state locked” error. This is not a theoretical risk — it happens regularly on teams that skip remote state setup.
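The versioning and access-control points above can be sketched as a small bootstrap configuration that creates the state bucket itself. This is a minimal sketch, not a definitive setup: the variable bootstrap_project_id, the group address, and the bucket location are assumptions made for illustration. Because this bucket must exist before any other configuration can use it as a backend, it is usually applied once from a separate bootstrap directory.

```hcl
# bootstrap/main.tf (hypothetical file name)

variable "bootstrap_project_id" {
  description = "Project that hosts the state bucket (assumed name)"
  type        = string
}

resource "google_storage_bucket" "tfstate" {
  name     = "my-company-tfstate"
  project  = var.bootstrap_project_id
  location = "US" # Assumed location; choose one that fits your compliance needs

  # State files can contain secrets, so keep the bucket private
  uniform_bucket_level_access = true
  public_access_prevention    = "enforced"

  # Keep older generations of the state object so a corrupted state can be restored
  versioning {
    enabled = true
  }
}

# Restrict who can read and write state (the group address is a placeholder)
resource "google_storage_bucket_iam_member" "tfstate_admins" {
  bucket = google_storage_bucket.tfstate.name
  role   = "roles/storage.objectAdmin"
  member = "group:platform-admins@example.com"
}
```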

Structuring Environments

  • Separate State per Environment: Never use the same state file for dev and prod. If you accidentally delete your state in dev, you don’t want it to impact prod.
  • Workspaces vs. Directories: Most GCP experts prefer separate directories (e.g., environments/prod/, environments/dev/) over Terraform Workspaces for clearer isolation and variable management.
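A directory-per-environment layout can be sketched as follows, assuming each environment directory is its own root module with its own backend prefix. The directory and bucket names are illustrative, and the prod file is shown commented out only because both files cannot live in a single configuration.

```hcl
# environments/dev/backend.tf
terraform {
  backend "gcs" {
    bucket = "my-company-tfstate"
    prefix = "terraform/state/dev" # dev state lives under its own prefix
  }
}

# environments/prod/backend.tf (a separate file in a separate directory)
# terraform {
#   backend "gcs" {
#     bucket = "my-company-tfstate"
#     prefix = "terraform/state/prod" # prod state is never touched by a dev run
#   }
# }
```

With this layout, running terraform init and terraform apply inside environments/dev/ can only read and write the dev state. Some teams go one step further and use a separate bucket per environment so that IAM can restrict prod state access independently.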

3. The Cloud Foundation Toolkit (CFT)

Instead of writing every resource from scratch, Google provides the Cloud Foundation Toolkit.
  • Best-Practice Modules: CFT is a set of open-source Terraform modules that implement Google’s best practices for VPCs, GKE clusters, Project factories, and more.
  • Opinionated Security: These modules come with “secure defaults,” like disabling public IPs for GKE nodes or enforcing encryption on buckets.

3.1 The Project Factory Module Deep Dive

The Project Factory is the most critical module in the CFT. It automates the creation of GCP projects while enforcing organization-level compliance.
What it Automates:
  • Project Creation: Handles the google_project resource.
  • Billing Linkage: Connects the project to the central billing account.
  • Service API Enablement: Enables a list of APIs (e.g., compute.googleapis.com, container.googleapis.com) automatically.
  • Shared VPC Attachment: Handles the complex handshake of attaching a project as a “Service Project” to a host VPC.
  • Default Service Account Handling: (Security Best Practice) Can delete, disable, or de-privilege the default, over-privileged compute service account, depending on the module’s default_service_account setting.
  • Group IAM Bindings: Assigns standard IAM roles to Google Workspace (formerly G Suite) groups for developers, auditors, and admins.
Example Implementation:
module "project-factory" {
  source  = "terraform-google-modules/project-factory/google"
  version = "~> 14.0"

  name              = "prod-data-platform"
  random_project_id = true
  org_id            = var.org_id
  billing_account   = var.billing_account
  folder_id         = var.folder_id

  # Shared VPC Configuration
  svpc_host_project_id = "host-project-123"
  shared_vpc_subnets   = [
    "projects/host-project-123/regions/us-central1/subnets/data-subnet"
  ]

  # API Enablement
  activate_apis = [
    "compute.googleapis.com",
    "bigquery.googleapis.com",
    "storage-api.googleapis.com"
  ]
}

4. Google Cloud Deploy: Managed CD

Once your infrastructure is provisioned, you need a way to deploy your code. Cloud Deploy is Google’s fully managed continuous delivery service for GKE, Cloud Run, and Anthos.

Key Concepts

  • Delivery Pipeline: Defines the progression of a release through different targets (e.g., dev → staging → prod).
  • Skaffold: Cloud Deploy uses Skaffold under the hood to decouple the build/deploy configuration from the pipeline definition.
  • Rollout Strategies:
    • Canary: Deploy to a small percentage of users first.
    • Blue/Green: Deploy a full new version and switch traffic instantly.
  • Approval Gates: You can require a manual “click” from a lead engineer before a release moves into the production target.
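Since this chapter manages everything with Terraform, the pipeline and its targets can be declared there as well. The following is a minimal sketch using the Google provider's google_clouddeploy_target and google_clouddeploy_delivery_pipeline resources; the project variable, cluster paths, pipeline name, and Skaffold profile names are all assumptions for illustration, and rollout strategies (canary, blue/green) are left out.

```hcl
variable "project_id" {
  description = "Project that hosts the delivery pipeline (assumed name)"
  type        = string
}

# Two targets; the cluster paths below are placeholders
resource "google_clouddeploy_target" "dev" {
  project  = var.project_id
  location = "us-central1"
  name     = "dev"

  gke {
    cluster = "projects/${var.project_id}/locations/us-central1/clusters/dev-cluster"
  }
}

resource "google_clouddeploy_target" "prod" {
  project  = var.project_id
  location = "us-central1"
  name     = "prod"

  # Approval gate: a rollout to this target waits for a manual approval
  require_approval = true

  gke {
    cluster = "projects/${var.project_id}/locations/us-central1/clusters/prod-cluster"
  }
}

# The pipeline defines the promotion order: dev first, then prod
resource "google_clouddeploy_delivery_pipeline" "app" {
  project  = var.project_id
  location = "us-central1"
  name     = "app-pipeline"

  serial_pipeline {
    stages {
      target_id = google_clouddeploy_target.dev.name
      profiles  = ["dev"] # Skaffold profile applied for this stage (assumed name)
    }
    stages {
      target_id = google_clouddeploy_target.prod.name
      profiles  = ["prod"]
    }
  }
}
```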

5. Policy as Code: Guarding the Pipeline

To prevent developers from accidentally creating insecure infrastructure (like a publicly readable Cloud Storage bucket), use Policy as Code.
  • Terraform Validator: A tool that checks your terraform plan against your organization’s security policies before it is applied.
  • Example Policy: “No VM can have an external IP address unless it has a specific tag.”

6. Config Connector: The K8s Alternative

For teams that are “all-in” on Kubernetes, Config Connector allows you to manage GCP resources using Kubernetes YAML.
  • The CRD Model: A Cloud SQL database becomes a Kubernetes object with kind: SQLInstance in your GKE cluster.
  • Reconciliation: Kubernetes’ controller loop constantly checks the state of your GCP resources and fixes any drift, just as it does for pods.

7. Advanced Terraform Patterns: Meta-Arguments and DRY Code

7.1 Lifecycle Meta-Arguments

Terraform provides meta-arguments to control how resources are handled during an apply.
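  • prevent_destroy: Essential for critical resources like production databases or DNS zones. Terraform rejects any plan that would destroy the resource (for example, terraform destroy or a change that forces replacement) for as long as the setting is present in the configuration. It does not protect you if you delete the whole resource block, because the lifecycle setting is removed along with it.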
  • ignore_changes: Useful when certain attributes are managed by other tools (e.g., GKE node pool sizes managed by the autoscaler).
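Both settings live inside a lifecycle block on the resource. The sketch below assumes a production DNS zone and an autoscaled GKE node pool; the resource names, the cluster_name variable, the domain, and the node counts are illustrative only.

```hcl
variable "cluster_name" {
  description = "Name of an existing GKE cluster (assumed to exist already)"
  type        = string
}

resource "google_dns_managed_zone" "prod" {
  name     = "prod-zone"
  dns_name = "example.com." # Placeholder domain

  lifecycle {
    # Any plan that would destroy this zone fails with an error
    prevent_destroy = true
  }
}

resource "google_container_node_pool" "primary" {
  name     = "primary-pool"
  cluster  = var.cluster_name
  location = "us-central1"

  initial_node_count = 1

  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }

  lifecycle {
    # The autoscaler resizes the pool, so do not treat size changes as drift
    ignore_changes = [initial_node_count]
  }
}
```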

7.2 Keeping Code DRY with Terragrunt

In a large-scale GCP environment, you often have dozens of nearly identical projects. Terragrunt is a thin wrapper that helps you:
  • Inherit configuration: Define your provider and backend once and inherit them across all projects.
  • Dependency management: Ensure your VPC is created before your GKE cluster.

7.3 Google Terraform Validator

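This tool validates your terraform plan against your organization’s security policies. It converts the planned resources into Cloud Asset Inventory (CAI) format and evaluates them against the constraints in your policy library, the same constraint format that Forseti’s Config Validator used. The standalone tool has since been folded into the gcloud CLI as gcloud beta terraform vet.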
  • Integration: Run it in your Cloud Build pipeline. If a developer tries to create a bucket without encryption, the build fails.

8. Interview Preparation

1. Q: Why is it critical to store the Terraform state file in a remote GCS backend? A: Storing state locally prevents collaboration and is a single point of failure. A GCS Backend provides:
  • Locking: Prevents two users from running apply simultaneously and corrupting the state.
  • Versioning: Allows you to roll back to a previous state if the current one is corrupted.
  • Security: State files often contain sensitive information (like DB passwords); GCS allows you to restrict access via IAM.
2. Q: Explain the concept of “Configuration Drift” and how Terraform handles it. A: Configuration Drift occurs when the real-world infrastructure (for example, after manual changes in the console) no longer matches the code. Terraform detects this during the plan phase: it refreshes the tfstate from the live environment, compares the result with the configuration, and proposes changes that bring the infrastructure back in line with the code. You then either apply the plan to revert the manual change or update the code to adopt it, so the infrastructure and the code stay consistent.
3. Q: What are “Terraform Modules” and why are they used in enterprise environments? A: Modules are containers for multiple resources that are used together. They enable Code Reusability and Standardization. In an enterprise, you can create a “Standard VPC Module” that includes all necessary subnets, firewalls, and logging. Developers then use this module rather than writing networking code from scratch, ensuring compliance with company security standards.
4. Q: How does “Config Connector” differ from Terraform? A:
  • Terraform: Is a standalone CLI tool that uses HCL. The configuration is declarative, but execution is on demand: nothing changes until a person or a pipeline runs terraform apply.
  • Config Connector: Is a Kubernetes-native controller. You define GCP resources as K8s YAML files. The controller is constantly “reconciling”: if a resource is deleted in the console, Config Connector recreates it on its next reconciliation pass, typically within minutes, without any manual intervention.
5. Q: What is the “Cloud Foundation Toolkit” (CFT)? A: CFT is a collection of open-source Terraform modules maintained by Google. They are built to Google’s Best Practices for security and reliability. Instead of reinventing the wheel, architects use CFT modules for complex setups like “Project Factory” (automated project creation), “Networking,” and “GKE” to ensure their environment is enterprise-hardened from day one.

Implementation: The “Platform Engineer” Lab

Setting up a Production Terraform Backend

# backend.tf
# Why GCS backend: Enables team collaboration, state locking, and disaster recovery
# The bucket MUST exist before running 'terraform init' -- create it manually or with a bootstrap script
terraform {
  backend "gcs" {
    bucket  = "my-company-tfstate"
    prefix  = "terraform/state/prod"   # Use a unique prefix per environment to isolate state files
  }
}

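# provider.tf
# Why pin the provider version: Prevents unexpected breaking changes when Google updates the provider
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 4.0"   # Assumed constraint; pick one compatible with the module versions you use
    }
  }
}

provider "google" {
  project = var.project_id
  region  = "us-central1"
}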

# vpc.tf (Using a CFT module)
# Why use a CFT module instead of raw resources: The module exposes Google's recommended network
# settings (private Google access, flow logs, secondary ranges for GKE) as simple inputs, so they
# are harder to forget than when wiring raw resources by hand
module "vpc" {
  source  = "terraform-google-modules/network/google"
  version = "~> 6.0"   # Pin to major version to get patches but avoid breaking changes

  project_id   = var.project_id
  network_name = "prod-vpc"
  routing_mode = "GLOBAL"   # GLOBAL lets Cloud Routers advertise and learn dynamic routes across all regions

  subnets = [
    {
      subnet_name           = "prod-subnet-01"
      subnet_ip             = "10.0.1.0/24"
      subnet_region         = "us-central1"
      subnet_private_access = "true"   # Reach Google APIs without external IPs
      subnet_flow_logs      = "true"   # Enable VPC flow logs for this subnet
    }
  ]
}
Pro-Tip: Always run terraform plan and save the plan to a file (terraform plan -out=plan.tfplan), then apply that file (terraform apply plan.tfplan). This ensures that the exact changes you reviewed are what gets applied. If the state has changed since the plan was created, Terraform rejects the saved plan as stale instead of applying something different. In CI/CD pipelines, this two-step approach is essential for audit trails and approval gates.

Pro-Tip: Terraform Graph

When debugging slow Terraform runs, use terraform graph | dot -Tpng > graph.png (the dot command comes from Graphviz, which must be installed separately). This visualizes the dependency graph of your resources, helping you spot long dependency chains that limit parallelism and trace the circular dependencies that Terraform reports as errors.