Google Cloud Engineering Master Course

Start Here if You’re Completely New to Cloud

0.1 What is Cloud Computing? (From Scratch)

Imagine you want to start a world-class restaurant chain.

Traditional Approach (Buy Everything):
  • Capital Expenditure (CapEx): You buy the land ($2,000,000), construct the building ($5,000,000), and buy industrial-grade ovens ($200,000) and furniture ($100,000).
  • Maintenance: You hire a full-time team to fix the roof, service the ovens, and manage the electricity.
  • Risk: If people don’t like your food, you are stuck with a $7.3 million debt and a building you can’t easily sell.
  • Scaling: If your restaurant is a hit and you need more space, you have to buy the neighboring land and start construction again (takes 12–18 months).
Cloud Approach (Rent Everything):
  • Operational Expenditure (OpEx): You rent a pre-built commercial kitchen ($10,000/month).
  • Managed Services: The landlord handles the building maintenance, utilities, and even provides a cleaning crew.
  • Risk: If the restaurant fails, you simply stop paying rent and walk away. Your loss is limited to a few months of rent.
  • Scaling: If you suddenly have 1,000 customers waiting, the landlord opens up the dining room next door immediately. You pay a bit more rent, but you never lose a customer due to lack of space.
Cloud Computing = Renting Google’s Planet-Scale Infrastructure

Instead of:
  • Buying physical servers (the “hardware”)
  • Managing massive air conditioning units, diesel generators, and physical security guards
  • Waiting 3 months for a new server to be delivered and racked
You:
  • Rent virtual resources via an API or Web Console
  • Google handles the “boring” stuff (power, cooling, hardware failure)
  • Scale from 1 server to thousands of servers in minutes (within your project’s quotas)

0.2 Key Cloud Characteristics

Before we dive into GCP, it helps to know the standard NIST cloud characteristics:
  • On‑demand self‑service – A developer can provision resources without human approval.
  • Broad network access – Access over the network (browser, CLI, APIs) from many device types.
  • Resource pooling – Physical resources are shared across many customers (multi‑tenancy).
  • Rapid elasticity – Scale out/in quickly; capacity appears unlimited from the user’s perspective.
  • Measured service – You pay for what you use (per second/minute/GB), with detailed metering.
We will see these show up repeatedly when we talk about autoscaling, managed databases, serverless, and cost management.
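"Measured service" is easiest to see with a little arithmetic. Compute Engine bills VM usage per second, with a one-minute minimum per run. The sketch below applies that rule. `HOURLY_RATE` is a hypothetical price chosen for illustration and is not a real GCP rate.

```python
# Minimal sketch of per-second billing with a one-minute minimum.
# HOURLY_RATE is a hypothetical price used only for illustration.
HOURLY_RATE = 0.10  # USD per hour (assumed, not a real GCP price)
MINIMUM_SECONDS = 60


def billed_seconds(runtime_seconds: int) -> int:
    """Seconds actually charged: the runtime, but never less than 60."""
    return max(runtime_seconds, MINIMUM_SECONDS)


def cost(runtime_seconds: int, hourly_rate: float = HOURLY_RATE) -> float:
    """Cost in USD for a single run of the given length."""
    return billed_seconds(runtime_seconds) * hourly_rate / 3600


if __name__ == "__main__":
    for runtime in (10, 90, 3600):
        print(f"{runtime:>5}s -> billed {billed_seconds(runtime):>5}s, ${cost(runtime):.4f}")
```

A 10-second job costs the same as a 60-second one. Beyond the first minute you pay only for the seconds you use.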

What is Google Cloud Platform (GCP)?

GCP is a suite of cloud computing services that runs on the same infrastructure that Google uses internally for its end-user products, such as Google Search, YouTube, Gmail, and Google Drive.

1.1 The Google Edge

What makes GCP fundamentally different from AWS and Azure is not the list of services (all three clouds offer similar categories) but the underlying infrastructure. Google has been building planet-scale distributed systems for over two decades, and that engineering DNA is baked into every GCP service.
  • Global Network: Google owns one of the largest private networks in the world. Thousands of miles of undersea fiber optic cables connect its data centers. When your data moves between GCP regions, it stays on Google’s private fiber and does not cross the public internet. This is why GCP often benchmarks at lower latency than routing over the public internet (improvements of 25-35% are commonly quoted).
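  • Planet-Scale Databases: Cloud Spanner relies on “TrueTime,” an API backed by atomic clocks and GPS receivers in Google’s data centers. TrueTime reports the current time as an interval with a bounded uncertainty. Spanner uses that bound to provide externally consistent (globally ordered) transactions across regions. Other clouds offer strongly consistent distributed databases built on different techniques, but TrueTime remains a distinctive part of Spanner’s design.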
  • Innovation: Google invented many of the technologies the world uses today, including Kubernetes (inspired by Google’s internal Borg system), MapReduce (which inspired Hadoop), and TensorFlow. When you use GKE, BigQuery, or Pub/Sub, you are using externalized versions of systems Google uses to run Search, YouTube, and Gmail.
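The TrueTime idea can be simulated in a few lines. Following the published Spanner design, a transaction picks a commit timestamp no earlier than the latest possible current time. It then performs a "commit wait": it waits until that timestamp is definitely in the past on every clock. This is a toy sketch, not Google's implementation. `EPSILON` (the uncertainty bound) is an assumed value, and the local wall clock stands in for Google's time infrastructure.

```python
import time
from dataclasses import dataclass

# Assumed uncertainty bound; the Spanner paper reports a few milliseconds.
EPSILON = 0.005  # seconds (illustrative)


@dataclass
class TTInterval:
    earliest: float
    latest: float


def tt_now(clock=time.time, epsilon: float = EPSILON) -> TTInterval:
    """Simulated TrueTime: true time lies somewhere in [earliest, latest]."""
    t = clock()
    return TTInterval(t - epsilon, t + epsilon)


def tt_after(ts: float, clock=time.time, epsilon: float = EPSILON) -> bool:
    """True only once ts has definitely passed, given the uncertainty."""
    return tt_now(clock, epsilon).earliest > ts


def commit_with_wait(clock=time.time, epsilon: float = EPSILON) -> float:
    """Choose a commit timestamp, then wait out the uncertainty (commit wait)."""
    commit_ts = tt_now(clock, epsilon).latest
    while not tt_after(commit_ts, clock, epsilon):
        time.sleep(epsilon / 10)
    return commit_ts


if __name__ == "__main__":
    start = time.time()
    ts = commit_with_wait()
    print(f"committed at {ts:.6f}, waited {time.time() - start:.4f}s")
```

Each commit waits roughly 2ε. This is why keeping ε small (atomic clocks, GPS) matters for Spanner's write latency.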

1.2 By the Numbers

  • 35+ Regions (Geographical locations)
  • 100+ Zones (isolated deployment areas within regions)
  • 187+ Network Edge Locations (Points of Presence)
  • 100% Renewable Energy Matching (Google purchases renewable energy equal to its annual electricity consumption; this is annual matching, not zero emissions)
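Zone names make the region/zone relationship concrete. A zone is named after its region plus a one-letter suffix: `us-central1-a`, for example, is a zone in the `us-central1` region. The minimal sketch below recovers the region from a zone name. This is useful when checking that a deployment really spans several zones of a region. The zone list is illustrative.

```python
from collections import Counter


def region_of(zone: str) -> str:
    """Return the region for a zone name, e.g. 'us-central1-a' -> 'us-central1'."""
    region, sep, suffix = zone.rpartition("-")
    # Zone names are '<region>-<single letter>'; reject anything else.
    if not sep or len(suffix) != 1 or not suffix.isalpha():
        raise ValueError(f"not a zone name: {zone!r}")
    return region


def zones_per_region(zones: list[str]) -> Counter:
    """Count how many of the given zones fall in each region."""
    return Counter(region_of(z) for z in zones)


if __name__ == "__main__":
    deployment = ["us-central1-a", "us-central1-b", "europe-west1-b"]
    print(zones_per_region(deployment))
```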

1.3 Shared Responsibility Model (High Level)

Even with managed infrastructure, there are clear lines between what Google secures and what you must configure:
  • Google: physical security, hardware, hypervisor, some managed service internals.
  • You: IAM, network access, application code, data classification, backups (for some services).
We will revisit this in the Security chapters, but keep it in mind from the start: cloud does not remove responsibility, it changes it.

Why Should You Learn GCP?

The Market Reality: While AWS has the largest market share, GCP has been one of the fastest-growing major cloud providers. Enterprises are moving to GCP for three main reasons:
  1. AI and Machine Learning: Google is a leader in AI research and infrastructure (the Transformer architecture, TensorFlow, JAX, and TPUs all came out of Google).
  2. Data Analytics: BigQuery is widely regarded as one of the leading cloud data warehouses.
  3. Open Source DNA: GCP embraces open standards like Kubernetes and Istio, reducing vendor lock-in.
Career & Salary (US Data):
  • Associate Cloud Engineer: $110,000 - $145,000
  • Professional Cloud Architect: $164,000 - $210,000 (Highest paying certification in IT for 3 consecutive years)
  • Cloud Security Engineer: $170,000 - $230,000
  • Machine Learning Engineer: $180,000 - $250,000+

What Makes This Course Different?

“Most GCP courses teach you how to click buttons in the Console. This course teaches you how to think like a Google Site Reliability Engineer (SRE).”

2.1 Philosophy of the Course

We don’t just show you how to create a VM. We explain:
  • How the Andromeda software-defined network routes your traffic.
  • How Colossus (Google’s distributed file system) underpins Cloud Storage’s durability.
  • How to design for 99.999% availability using Multi-Regional architectures.
  • The FinOps strategies used to save millions on egress costs.
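The "nines" in an availability target translate directly into allowed downtime, and redundancy changes the math. The sketch below assumes that replica failures are independent. Real deployments only approximate this: shared dependencies such as DNS or a bad global config push correlate failures. Treat the parallel formula as an upper bound.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year for an availability such as 0.99999."""
    return (1 - availability) * MINUTES_PER_YEAR


def parallel(availability: float, replicas: int) -> float:
    """Availability when any one of N independent replicas is enough."""
    return 1 - (1 - availability) ** replicas


def serial(*availabilities: float) -> float:
    """Availability when every component in a chain must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


if __name__ == "__main__":
    print(f"99.999% -> {downtime_minutes_per_year(0.99999):.2f} min/year")
    print(f"two 99.9% regions in parallel -> {parallel(0.999, 2):.6f}")
    print(f"99.9% app behind 99.99% LB -> {serial(0.999, 0.9999):.6f}")
```

Five nines leaves about 5.26 minutes of downtime a year. Adding independent regions raises availability, while every serial dependency lowers it.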

2.2 Real-World Example

Scenario: A major social media platform experienced a 4-hour outage because they misconfigured their Global Load Balancer.

In this course, we break down that specific incident, show you the configuration that caused it, and teach you how to use Cloud Armor and Health Checks to guard against the same failure in your own systems.

Throughout the course we will map every concept to:
  • Real Google services (e.g., GFE, Andromeda, Colossus).
  • Real operational practices (SRE, observability, incident response).
  • Real cost trade‑offs (performance vs spend).
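Health checks are central to that kind of incident. A load balancer sends traffic only to backends its health check marks healthy. GCP health checks change a backend's state only after a configurable number of consecutive successes or failures (`healthyThreshold`, `unhealthyThreshold`). The following is a minimal model of that hysteresis, not Google's implementation. This sketch sets both thresholds to 2.

```python
from dataclasses import dataclass


@dataclass
class BackendHealth:
    healthy_threshold: int = 2    # consecutive successes to become healthy
    unhealthy_threshold: int = 2  # consecutive failures to become unhealthy
    healthy: bool = True
    _streak: int = 0              # consecutive probes disagreeing with current state

    def observe(self, probe_ok: bool) -> bool:
        """Record one probe result and return the (possibly updated) state."""
        if probe_ok == self.healthy:
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.healthy_threshold if probe_ok else self.unhealthy_threshold
        if self._streak >= needed:
            self.healthy = probe_ok
            self._streak = 0
        return self.healthy


if __name__ == "__main__":
    backend = BackendHealth()
    for probe in [False, False, True, False, True, True]:
        print(probe, "->", "healthy" if backend.observe(probe) else "unhealthy")
```

A single failed probe does not drain a backend, and a single success does not restore it. Setting thresholds too low causes flapping, while setting them too high delays failover.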

The SRE Foundation: Learning the “Google Way”

Site Reliability Engineering (SRE) is what happens when you ask a software engineer to design an operations team. This course is heavily inspired by the three definitive texts published by Google:
  1. The SRE Book: How Google runs production systems.
  2. The SRE Workbook: Practical ways to implement SRE.
  3. Building Secure & Reliable Systems: The intersection of security and reliability.
Throughout this course, we refer to these as the “Google Triad.” You will learn to apply concepts like Error Budgets, Toil Reduction, and Post-Mortem culture directly to your GCP resources. We don’t just want you to build systems that work; we want you to build systems that are maintainable at scale.

Detailed Certification & Career Path Analysis

GCP certifications are highly valued because they focus on design and problem-solving rather than just rote memorization of service names. This course provides 80-90% of the technical coverage for the following paths:

1. The Generalist (Cloud Architect / Engineer)

  • Target Cert: Associate Cloud Engineer (ACE) & Professional Cloud Architect (PCA).
  • Focus: Core infrastructure, networking, and security.
  • Primary Chapters: 1, 2, 3, 4, 5, 10, 15, 17.
  • Career Goal: Lead architect for digital transformation or startup CTO.

2. The Specialist (Data & AI Engineer)

  • Target Cert: Professional Data Engineer.
  • Focus: Scalable data pipelines, BigQuery optimization, and ML lifecycle.
  • Primary Chapters: 6, 7, 8, 12.
  • Career Goal: Building the next generation of LLM-powered applications or real-time analytics engines.

3. The Modernizer (DevOps & Security Engineer)

  • Target Cert: Professional Cloud DevOps Engineer & Professional Cloud Security Engineer.
  • Focus: CI/CD, GKE hardening, IAM governance, and observability.
  • Primary Chapters: 2, 9, 10, 13, 14, 15, 16.
  • Career Goal: Securing the software supply chain and automating “Day 2” operations.

Why This Course?

SRE Principles

Learn the Site Reliability Engineering patterns born at Google to manage planet-scale systems.

Data & AI Deep Dives

Go beyond basics in BigQuery, Vertex AI, and Pub/Sub—the heart of the modern data stack.

Architecture-First

We focus on design patterns (Hub-and-Spoke, Microservices, DR) before the CLI commands.

Cost Engineering

Master the art of “FinOps”—optimizing for performance while minimizing the monthly bill.

Course Roadmap: The Journey to Mastery

This course is designed as a path, not a collection of random topics. You can treat it as a 12–16 week guided program.

3.1 High-Level Tracks

  1. GCP Foundations & The Google Network: Deep dive into Regions, Zones, Resource Hierarchy, and the physical fiber network that makes Google different.
  2. Advanced Identity (IAM) & Governance: Master Service Accounts, Workload Identity, Organization Policies, and the “Policy Troubleshooter.”
  3. VPC Networking & Security: Build global VPCs, Shared VPCs, Firewall Rules (Tags vs Service Accounts), and Cloud NAT.
  4. Global Traffic Management: Master the Global HTTP(S) Load Balancer (GFE), Cloud CDN, and Cloud Armor.
  5. Compute Deep Dive: Compute Engine (MIGs, Sole-Tenant, Shielded VMs), Cloud Run, and Cloud Functions.
  6. The Kubernetes Masterclass (GKE): From standard clusters to Autopilot, Binary Authorization, and multi-cluster ingress.
  7. Storage & Databases: GCS, Cloud SQL (HA/DR), Cloud Spanner (TrueTime), and Bigtable performance tuning.
  8. Big Data & Analytics: BigQuery (Slots, Partitioning, ML), Pub/Sub, and Dataflow pipelines.
  9. Operations & Observability: Cloud Monitoring, Logging (Log Sinks), Trace, Profiler, and Error Reporting.
  10. Infrastructure as Code (Terraform): Provisioning the entire GCP stack with Terraform, State management, and Modules.
  11. The Capstone (Planet-Scale Application): Architecting and deploying a globally distributed, secure, and auto-scaling e-commerce platform.

3.2 Suggested Weekly Plan

You can adapt this, but a typical pacing:
  • Weeks 1–2: Foundations + IAM
  • Weeks 3–4: VPC + Load Balancing/DNS
  • Weeks 5–6: Compute + GKE + Containers
  • Weeks 7–8: Storage + Databases
  • Weeks 9–10: Data Analytics (BigQuery, Dataflow, Pub/Sub)
  • Weeks 11–12: Observability + Security + FinOps
  • Weeks 13–16: Capstone project and optional advanced topics (Anthos, multi‑cloud).

Prerequisites: “Test Yourself”

You don’t need to be an expert, but you should check these basics. If you fail a “Test Yourself,” we recommend a quick 30-minute refresher on that topic.

4.1 Networking Fundamentals

  • Concept: Do you know the difference between a Private IP and a Public IP?
  • Test Yourself: Can you explain what a Subnet Mask (e.g., /24) does?
  • Refresher: Look up “CIDR Notation” and “OSI Model Layer 3 vs 4.”
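Python's standard `ipaddress` module is a quick way to check your answers. The sketch below shows what a /24 mask means in practice. It also accounts for a GCP-specific detail: Google Cloud reserves four addresses in every subnet's primary IPv4 range (the network address, the default gateway, the second-to-last address, and the broadcast address). The example CIDR is illustrative.

```python
import ipaddress

# Google Cloud reserves four addresses in each subnet's primary IPv4 range.
GCP_RESERVED_PER_SUBNET = 4


def describe_subnet(cidr: str) -> dict:
    """Summarize a subnet: mask, size, usable VM addresses, and gateway."""
    net = ipaddress.ip_network(cidr)
    return {
        "netmask": str(net.netmask),
        "total_addresses": net.num_addresses,
        "usable_in_gcp": net.num_addresses - GCP_RESERVED_PER_SUBNET,
        "gateway": str(net.network_address + 1),  # GCP uses the second address
    }


def is_private(ip: str) -> bool:
    """True for RFC 1918 and other non-public ranges."""
    return ipaddress.ip_address(ip).is_private


if __name__ == "__main__":
    print(describe_subnet("10.128.0.0/24"))
    print(is_private("10.128.0.5"), is_private("8.8.8.8"))
```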

4.2 Linux Command Line

  • Concept: Are you comfortable moving through a file system without a mouse?
  • Test Yourself: Can you write a command to find all files ending in .log and delete them?
  • Refresher: Practice cd, ls, grep, find, and chmod.

4.3 Programming Basics

  • Concept: Understanding logic (If/Else, Loops).
  • Test Yourself: Can you read a basic Python script and tell what it does?
  • Note: We use Python and Node.js for some serverless examples.
For each prerequisite, we will link to short refresher resources in the Essentials section of the docs so you can quickly fill gaps before diving into the main modules.

The Tech Stack We Will Master

| Component | Google Cloud Technology |
| --- | --- |
| Compute | Compute Engine, GKE, Cloud Run, Cloud Functions |
| Networking | VPC, Cloud Load Balancing, Cloud DNS, Cloud Interconnect |
| Storage | Cloud Storage (GCS), Filestore, Persistent Disk |
| Databases | Cloud SQL, Cloud Spanner, Bigtable, Firestore |
| Data Analytics | BigQuery, Pub/Sub, Dataflow, Looker |
| Security | IAM, Cloud Armor, IAP, Secret Manager, KMS |
| DevOps/IaC | Terraform, Cloud Build, Artifact Registry, Config Connector |
| Observability | Cloud Monitoring, Cloud Logging, Error Reporting |

Cost Management: The $300 “Safe Zone”

Google provides a $300 Free Credit for 90 days. We have designed this course to be completed entirely within that credit.

5.1 The “SRE” Way to Save Money

The biggest risk with cloud labs is not running out of credits — it is forgetting to delete resources afterward. A single GKE cluster left running can consume $150+ of your free credits in a week. We have designed every lab with explicit cleanup instructions.
  1. Budgets and Alerts:
    We will set a $10 budget alert early in the course so you see how budget alerts work. Set this up before creating any resources, not after. Note that a budget only sends notifications; on its own, it does not stop spending.
  2. Auto-Delete Scripts:
    We provide scripts and guidance to safely delete lab resources by project or label in one shot. The safest approach: create a dedicated project for each lab and delete the entire project when done.
  3. Spot VMs:
    We will use Spot VMs (the successor to Preemptible VMs) for expensive labs to save up to ~90% compared to on-demand pricing. Since lab workloads are not production-critical, occasional preemption is an acceptable trade-off.
  4. Scale to Zero:
    We prioritize services like Cloud Run and Firestore, which scale to zero and cost little or nothing when idle (beyond any stored data). When a lab calls for a database, we use Cloud SQL with the smallest tier and remind you to delete it immediately after the exercise.
You will learn to treat cost like latency or error rate: measured, monitored, and actively optimized.
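To see why cleanup matters, turn an hourly rate into a weekly run-rate. This sketch uses hypothetical numbers: three nodes at an assumed $0.30/hour each, and an assumed 70% Spot discount. Substitute real prices from the pricing pages for your machine types and region.

```python
HOURS_PER_WEEK = 24 * 7


def weekly_cost(hourly_rate: float, count: int = 1) -> float:
    """Cost of leaving `count` identical resources running for a full week."""
    return hourly_rate * count * HOURS_PER_WEEK


def spot_rate(on_demand_rate: float, discount: float) -> float:
    """Hourly rate after a Spot discount given as a fraction (0.7 = 70% off)."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return on_demand_rate * (1 - discount)


def days_until_exhausted(credit: float, hourly_rate: float, count: int = 1) -> float:
    """How many days a credit balance lasts if nothing is cleaned up."""
    return credit / (hourly_rate * count * 24)


if __name__ == "__main__":
    node_rate = 0.30  # hypothetical on-demand USD/hour per node
    print(f"forgotten 3-node cluster: ${weekly_cost(node_rate, 3):.2f}/week")
    print(f"same cluster on Spot: ${weekly_cost(spot_rate(node_rate, 0.7), 3):.2f}/week")
    print(f"$300 credit lasts {days_until_exhausted(300, node_rate, 3):.1f} days")
```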

Community & Support

  • GitHub Repo: Access every Terraform script and Dockerfile used in the course.
  • Discord: Join the #gcp-engineering channel for peer support.
  • Office Hours: Join our bi-weekly live sessions to review complex architectures.

Ready to build the future?

Click Next to start Chapter 1: GCP Fundamentals & The Global Network. We’re going to dive deep into how Google actually builds their data centers.

Interview Preparation

Answer: GCP’s differentiation is built on Google’s internal technology heritage:
  1. Networking: Google’s private backbone (including its B4 inter-datacenter WAN) keeps traffic off the public internet and is often benchmarked at lower latency (a 25-35% improvement is a commonly quoted figure). The Andromeda SDN is designed to isolate tenants’ network performance, reducing the “noisy neighbor” problem found in virtualized network appliances.
  2. Data Platform: BigQuery is a leading serverless data warehouse. It’s built on Dremel (the same engine Google uses internally) and offers true separation of compute and storage, connected by Google’s Jupiter datacenter network (1.3 Pbps of bisection bandwidth in Google’s published figures).
  3. Kubernetes Origins: Google created Kubernetes, drawing on its internal Borg system, and GKE is one of the most mature managed Kubernetes offerings. Autopilot mode is widely regarded as one of the most hands-off ways to run Kubernetes.
  4. AI/ML Leadership: Google’s Vertex AI is built on the same infrastructure as Google Search and Gmail. TensorFlow and JAX are Google products, giving GCP first-class support.
  5. Open Standards: GCP embraces open standards (Kubernetes, Istio, Envoy) reducing vendor lock-in compared to proprietary services in other clouds.
Answer: The hierarchy is: Organization → Folders → Projects → Resources.

Why it matters:
  • IAM Inheritance: Permissions flow downward. If you grant “Viewer” at the Organization level, that permission applies to every project and resource underneath. This is both powerful (centralized control) and dangerous (overprivileged access).
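  • Organization Policies: Constraints (like “disable external IPs”) set at the Org or Folder level are inherited by every folder and project beneath them. A lower level can only diverge if someone holding the Organization Policy Administrator role sets a different policy there, so project owners cannot simply switch them off. This prevents shadow IT from creating insecure resources.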
  • Billing Aggregation: Folders allow you to group projects by department or environment, enabling cost allocation and budget alerts at the appropriate level.
  • Blast Radius: Projects are trust boundaries. By default, a service account in Project A has no IAM access to resources in Project B, and the two projects’ VPC networks cannot reach each other over private IPs unless explicitly connected (VPC Peering, Shared VPC). This limits the damage from a compromised workload.
Interview Deep Dive: An ideal enterprise setup separates Prod and Non-Prod into distinct folders, uses a Shared VPC for network centralization, and aggregates audit logs into a dedicated project in a separate “Security” folder to prevent tampering by application teams.
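IAM inheritance can be modeled as a walk up the hierarchy. IAM allow policies are additive: a resource's effective policy is the union of its own bindings and those of all its ancestors. The minimal sketch below assumes a simplified in-memory tree; the node names and members are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A resource-hierarchy node (org, folder, project, or resource)."""
    name: str
    parent: Optional["Node"] = None
    # role -> set of members granted that role directly on this node
    bindings: dict = field(default_factory=dict)


def effective_roles(node: Node, member: str) -> set:
    """Union of roles granted to `member` on `node` and on all its ancestors."""
    roles = set()
    current = node
    while current is not None:
        for role, members in current.bindings.items():
            if member in members:
                roles.add(role)
        current = current.parent
    return roles


# Hypothetical hierarchy: org -> folder -> project -> bucket
org = Node("organizations/123", bindings={"roles/viewer": {"user:alice@example.com"}})
folder = Node("folders/456", parent=org)
project = Node("projects/demo-prod", parent=folder,
               bindings={"roles/storage.admin": {"user:alice@example.com"}})
bucket = Node("buckets/demo-assets", parent=project)

if __name__ == "__main__":
    print(sorted(effective_roles(bucket, "user:alice@example.com")))
    # ['roles/storage.admin', 'roles/viewer']
```

Because grants only accumulate on the way down, a role granted at the Organization cannot be removed by a project's allow policy. IAM deny policies are the separate mechanism for carving out exceptions; they are evaluated before allow policies and are not modeled here.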
Answer: Professional Cloud Architect (PCA):
  • Focus: Design and architecture. Scenario-based questions testing system design, capacity planning, and trade-offs.
  • Skills: Designing for scalability, reliability, security, and compliance. Understanding business requirements and translating them into GCP solutions.
  • Exam: Case studies where you analyze a company’s requirements and recommend architectures.
Associate Cloud Engineer (ACE):
  • Focus: Implementation and operation. Hands-on deployment, troubleshooting, and managing GCP resources.
  • Skills: Terraform, gcloud CLI, GKE operations, and observability tooling.
  • Exam: Task-based questions like “How would you debug a failing health check?” or “What gcloud command deploys this configuration?”
Career Path: Many engineers earn PCA first (it’s considered harder and more prestigious), then follow up with ACE to demonstrate hands-on skills. Both are valuable, but PCA often commands a higher salary ($164k-$210k vs. $110k-$145k).
Answer: Google invented SRE. The core SRE principles are embedded into GCP services:
  1. Error Budgets: Instead of aiming for 100% uptime (impossible and wasteful), Google defines Service Level Objectives (SLOs) like 99.95%. The remaining 0.05% is an “error budget.” If the budget isn’t exhausted, teams can deploy faster. If it’s exhausted, they must stop features and focus on reliability.
  2. Toil Automation: SREs measure “toil”—manual, repetitive work. GCP services like GKE Autopilot, Cloud Run autoscaling, and Cloud SQL automated backups are all designed to eliminate toil for customers.
  3. Observability by Default: Every GCP service integrates with Cloud Monitoring, Logging, and Trace out of the box. This reflects Google’s belief that “you can’t manage what you can’t measure.”
  4. Blameless Post-Mortems: When a GCP service fails, Google publishes detailed incident reports explaining root cause and prevention measures. This culture encourages transparency and continuous learning.
Interview Insight: Mentioning SRE principles in answers demonstrates that you understand not just what GCP offers, but why it’s designed that way—showing strategic thinking beyond tool usage.
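The arithmetic behind an error budget is short. The sketch below uses the 99.95% SLO from the answer above and an assumed 30-day window. It shows the budget both in minutes (a time-based SLO) and as a fraction of allowed failed requests (a request-based SLO). The request counts are illustrative.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed 'bad' minutes in the window for a time-based SLO."""
    return (1 - slo) * window_days * MINUTES_PER_DAY


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget left (negative means overspent)."""
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


def may_ship_features(slo: float, total_requests: int, failed_requests: int) -> bool:
    """Policy from the text: keep shipping while budget remains, else freeze."""
    return budget_remaining(slo, total_requests, failed_requests) > 0


if __name__ == "__main__":
    print(f"99.95% over 30 days: {error_budget_minutes(0.9995):.1f} minutes")
    print(f"budget left: {budget_remaining(0.9995, 1_000_000, 200):.0%}")
    print("ship features?", may_ship_features(0.9995, 1_000_000, 200))
```

A 99.95% SLO allows about 21.6 bad minutes a month. With 1,000,000 requests, 500 failures are allowed, so 200 failures leave 60% of the budget.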
Answer: Overprivileged Service Accounts.

The Mistake: Many beginners use the “Default Compute Engine Service Account” or grant the “Editor” role at the project level. This violates the Principle of Least Privilege. If a VM is compromised, the attacker inherits those broad permissions, allowing them to read secrets, delete databases, or exfiltrate data.

The Fix:
  1. Custom Service Accounts: Always create a dedicated SA for each workload.
  2. Predefined Roles: Use the most granular predefined role (e.g., roles/storage.objectViewer instead of roles/editor).
  3. Workload Identity (for GKE): Never use JSON keys. Bind Kubernetes Service Accounts to Google Service Accounts using Workload Identity, eliminating the risk of key leakage.
  4. IAM Recommender: Google’s ML-powered tool analyzes 90 days of API usage and recommends removing unused permissions. Check it weekly.
Interview Depth: Mention that you would also use VPC Service Controls for critical workloads to create a secondary defense layer, ensuring that even a compromised SA cannot exfiltrate data outside the defined perimeter.
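The first two fixes can be checked mechanically against a policy exported as JSON, for example with `gcloud projects get-iam-policy PROJECT_ID --format=json`. That output has a top-level `bindings` list of `role`/`members` pairs. The minimal sketch below flags basic roles and the default Compute Engine service account, whose email follows the `PROJECT_NUMBER-compute@developer.gserviceaccount.com` pattern. It complements IAM Recommender rather than replacing it. The sample policy is made up.

```python
BASIC_ROLES = {"roles/owner", "roles/editor", "roles/viewer"}
DEFAULT_COMPUTE_SA_SUFFIX = "-compute@developer.gserviceaccount.com"


def audit_policy(policy: dict) -> list[str]:
    """Return human-readable findings for overly broad IAM grants."""
    findings = []
    for binding in policy.get("bindings", []):
        role = binding.get("role", "")
        for member in binding.get("members", []):
            if role in BASIC_ROLES:
                findings.append(f"{member} has basic role {role}")
            if member.startswith("serviceAccount:") and member.endswith(DEFAULT_COMPUTE_SA_SUFFIX):
                findings.append(f"default Compute Engine SA in use: {member} ({role})")
    return findings


if __name__ == "__main__":
    sample = {"bindings": [
        {"role": "roles/editor",
         "members": ["serviceAccount:123456789-compute@developer.gserviceaccount.com"]},
        {"role": "roles/storage.objectViewer",
         "members": ["serviceAccount:reporting@my-project.iam.gserviceaccount.com"]},
    ]}
    for finding in audit_policy(sample):
        print(finding)
```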

Interview Deep-Dive

Strong Answer: I would not argue for a wholesale migration; that is almost never the right call. Instead, I would focus on the specific technical advantages GCP offers for the workload at hand.
  • Data and Analytics: If the project is data-intensive, BigQuery alone can justify the move. It is serverless, separates compute from storage, and the on-demand pricing model means you pay per query rather than provisioning Redshift clusters that sit idle at night. I have seen teams cut their analytics infrastructure cost by 40-60% by moving from Redshift to BigQuery, while simultaneously eliminating the operational overhead of vacuum operations and node management.
  • Kubernetes-native workloads: If the team is building microservices on Kubernetes, GKE Autopilot is, in my experience, well ahead of EKS on Fargate in operational simplicity. Google created Kubernetes, and that shows in GKE’s networking integration (VPC-native pods get real VPC IPs, with no overlay network overhead) and security defaults (Shielded Nodes and Workload Identity out of the box).
  • Global networking: GCP’s global VPC model eliminates the need for Transit Gateway configurations when deploying across regions. A single VPC spans all regions. For a globally distributed app, this alone removes a significant architectural tax.
  • Cost baseline: GCP’s Sustained Use Discounts are automatic. You get up to 30% off on eligible machine families (such as N1) just by running a VM for a full month, with zero commitment; some families, such as E2, are not eligible. On AWS, equivalent savings require purchasing Reserved Instances or Savings Plans, which means upfront analysis and financial commitment.
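The "up to 30%" figure comes from tiered billing. For N1 machine types, Google's documented sustained use model charges each successive quarter of the month at a lower fraction of the base rate: 100%, 80%, 60%, and 40%. Those tiers average to 70% of list price for a full month. The sketch below assumes those N1 tiers. Other eligible families use different tiers, so check current documentation before relying on the numbers.

```python
# Assumed N1 sustained-use tiers: (upper bound as fraction of the month,
# fraction of the base rate charged for usage inside that tier).
N1_TIERS = [(0.25, 1.0), (0.50, 0.8), (0.75, 0.6), (1.00, 0.4)]


def effective_multiplier(usage_fraction: float, tiers=N1_TIERS) -> float:
    """Average fraction of the base rate paid for running usage_fraction of a month."""
    if not 0 < usage_fraction <= 1:
        raise ValueError("usage_fraction must be in (0, 1]")
    billed = 0.0
    lower = 0.0
    for upper, multiplier in tiers:
        if usage_fraction <= lower:
            break
        billed += (min(usage_fraction, upper) - lower) * multiplier
        lower = upper
    return billed / usage_fraction


if __name__ == "__main__":
    for fraction in (0.25, 0.5, 1.0):
        discount = 1 - effective_multiplier(fraction)
        print(f"{fraction:.0%} of month -> {discount:.0%} discount")
```

A VM running half the month gets a 10% discount. A VM running the full month reaches the 30% maximum.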
The honest caveat: if the team is heavily invested in AWS-specific services (Lambda@Edge, DynamoDB Streams, Step Functions), the migration cost may outweigh the benefits. Multi-cloud is fine, but it should be driven by technical fit, not politics.

Follow-up: What is the biggest risk of adopting GCP alongside an existing AWS footprint?

The primary risk is operational complexity. Your team now needs expertise in two clouds, two IAM models, two networking paradigms, and two billing systems. The way to mitigate this is to pick a clear boundary, for example “all data workloads on GCP, all application workloads on AWS.” The worst outcome is a random scattering of services across both clouds with no coherent strategy. You also need to account for egress costs between clouds. Cross-Cloud Interconnect starts at roughly $1,700/month for 10Gbps, but if you are transferring more than 15-20TB/month between clouds, it pays for itself compared to internet egress fees.
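That break-even point is a one-line calculation: the fixed monthly fee divided by the per-GB savings. The $1,700 figure comes from the answer above. The per-GB rates in the usage below ($0.12/GB over the internet, $0.02/GB over the interconnect) are assumptions for illustration. The sketch also ignores the other cloud's own transfer charges.

```python
def break_even_gb(fixed_monthly: float, internet_rate_per_gb: float,
                  interconnect_rate_per_gb: float) -> float:
    """Monthly GB at which a fixed-fee link becomes cheaper than internet egress."""
    savings_per_gb = internet_rate_per_gb - interconnect_rate_per_gb
    if savings_per_gb <= 0:
        raise ValueError("interconnect must be cheaper per GB to ever break even")
    return fixed_monthly / savings_per_gb


if __name__ == "__main__":
    gb = break_even_gb(1700, internet_rate_per_gb=0.12, interconnect_rate_per_gb=0.02)
    print(f"break-even at about {gb:,.0f} GB/month")
```

With these assumed rates, the break-even lands around 17 TB/month, inside the 15-20TB range quoted above.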
Strong Answer: The hierarchy design starts with understanding the organizational boundaries, compliance requirements, and operational model.
  • Organization node: Bound to the company domain via Cloud Identity. This is non-negotiable for enterprise governance. It gives us centralized audit logging, organization policies, and ensures no employee can create rogue projects outside the company’s control.
  • Top-level folders by business unit: BU-Payments, BU-Marketing, BU-Platform. Each BU gets its own folder tree, enabling per-BU IAM delegation and cost attribution.
  • Sub-folders by environment: Under each BU folder: Prod, NonProd, Sandbox. The key decision here is that Prod folders get stricter Org Policies (no external IPs, mandatory CMEK encryption, restricted regions for data residency compliance).
  • Shared infrastructure folder: A top-level Shared-Services folder containing: (1) a Shared VPC host project owned by the network team, (2) a logging-aggregation project where all audit log sinks converge, (3) a security-tools project for SCC, DLP, and vulnerability scanning.
  • Organization policies at the Org level: constraints/compute.disableSerialPortAccess, constraints/iam.disableServiceAccountKeyCreation (forces Workload Identity adoption), constraints/gcp.resourceLocations restricted to approved regions for GDPR compliance.
  • Billing: One billing account linked to all projects, with team, env, and cost-center labels required on every resource (enforced through policy checks such as custom Org Policy constraints where the resource type supports them). Billing export to BigQuery enabled from day one.
The critical insight most people miss: projects are trust boundaries. A compromised service in bu-marketing-sandbox should never be able to reach bu-payments-prod. Shared VPC with per-subnet IAM bindings, combined with firewall rules, enforces this at the network level.

Follow-up: How do you prevent a developer in the Sandbox folder from accidentally spinning up $10,000 worth of GPUs overnight?

Three layers of defense. First, set an Org Policy on the Sandbox folder (for example, a custom constraint on Compute Engine instances) that restricts allowed machine types to the E2 and N2 families only; this blocks GPU instances at the API level regardless of the caller’s IAM roles. Second, create a budget alert on every Sandbox project with a $100/month threshold, connected to a Pub/Sub topic that triggers a Cloud Function to disable billing on the project when the threshold is breached. Third, use IAM to ensure Sandbox developers only have roles/compute.instanceAdmin.v1 (not roles/editor), and use IAM Conditions to further restrict instance creation to specific zones if needed. The Org Policy is the strongest control because it cannot be overridden by project-level admins.
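The decision step of that budget-triggered Cloud Function can be sketched in Python. Budget notifications arrive on Pub/Sub as base64-encoded JSON that includes `costAmount` and `budgetAmount` fields. This sketch only decides whether to act. The Cloud Billing API call that actually detaches billing is deliberately left out, and the sample payload is made up.

```python
import base64
import json


def parse_budget_notification(data_b64: str) -> dict:
    """Decode the base64 Pub/Sub message body into the budget JSON payload."""
    return json.loads(base64.b64decode(data_b64).decode("utf-8"))


def should_disable_billing(data_b64: str) -> bool:
    """Return True once the reported cost exceeds the budget amount."""
    payload = parse_budget_notification(data_b64)
    return payload["costAmount"] > payload["budgetAmount"]


if __name__ == "__main__":
    sample = {"budgetDisplayName": "sandbox-budget", "costAmount": 120.5,
              "budgetAmount": 100.0, "currencyCode": "USD"}
    message = base64.b64encode(json.dumps(sample).encode("utf-8")).decode("ascii")
    print("disable billing?", should_disable_billing(message))
```

Disabling billing stops paid services in the project, which is why this pattern belongs on sandboxes and not on production projects.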
Strong Answer: The choice depends on where the candidate is in their career and what role they are targeting.
  • Professional Cloud Architect (PCA): This is the “design and decide” certification. The exam presents case studies (a fictional company with specific business requirements, compliance constraints, and existing infrastructure), and you must recommend the right GCP architecture. It tests trade-off analysis: when to use Spanner vs Cloud SQL, when to choose GKE Standard vs Autopilot, how to design for 99.99% availability vs 99.9%. It is consistently ranked as one of the highest-paying IT certifications because it validates architectural judgment, not just tool knowledge. Best for: engineers moving toward principal/staff roles, solution architects, or CTOs.
  • Associate Cloud Engineer (ACE): This is the “build and operate” certification. It tests hands-on skills: writing gcloud commands, debugging health check failures, configuring IAM bindings, managing Terraform state. It is more tactical. Best for: mid-level engineers, SREs, and DevOps engineers who want to prove they can operate GCP infrastructure in production.
My advice for most people: start with PCA. It forces you to learn the “why” behind every service, which makes the ACE exam significantly easier afterward. If you understand why you would choose Regional PD over Zonal PD (because HA requires synchronous cross-zone replication), the gcloud command to create it is just syntax.

The common mistake: studying for PCA by memorizing service feature lists. The exam does not ask “What is Cloud Spanner?” Instead, it asks “Given these requirements for a global e-commerce platform with strong consistency needs and 99.999% availability, which database should you use and why?”

Follow-up: If someone passed PCA but keeps failing real-world system design interviews, what is the gap?

The gap is almost always production experience. PCA validates that you can choose the right services, but system design interviews test whether you can reason about failure modes, capacity planning, and operational trade-offs under pressure. The fix is building real things: deploy the Capstone project in this course, intentionally break it (kill a zone, flood it with traffic, revoke a service account), and document what happened. Interviewers care far more about “I deployed a multi-region GKE cluster, simulated a zone failure, and discovered that my PodDisruptionBudgets were misconfigured” than “I passed PCA.”
Strong Answer: The Shared Responsibility Model defines who secures what. The key insight is that the dividing line shifts depending on the service type.
  • IaaS (Compute Engine): Google secures the physical infrastructure (data centers, hardware, Titan chip, hypervisor). You are responsible for the OS, patches, application code, IAM configuration, firewall rules, encryption decisions, and data classification. If you leave port 22 open to 0.0.0.0/0 with a weak root password, that is your problem, not Google’s.
  • PaaS (App Engine, Cloud Run): Google additionally manages the OS and runtime patching. Your responsibility shrinks to application code, IAM, data, and service configuration (like ensuring you are not accidentally serving your Cloud Run service to allUsers).
  • SaaS (BigQuery, Cloud Storage): Google manages almost everything. Your responsibility is primarily IAM (who can access the data), data classification (what sensitivity level), encryption key management (Google-managed vs CMEK vs CSEK), and network controls (VPC Service Controls to prevent exfiltration).
The most dangerous misconception is that “managed” means “secure.” Cloud SQL is a managed database, but if you expose it with a public IP and no SSL requirement, Google will not stop you. The managed part means Google handles patching the PostgreSQL engine, managing backups, and performing failovers. The secure part (network configuration, IAM, encryption at rest with CMEK) is your job.

A real-world gotcha: Cloud Storage buckets. Google encrypts data at rest by default (using Google-managed keys), but if you grant allUsers the Storage Object Viewer role, your data is public. Google fulfilled its part of the model (encryption, physical security), but you failed yours (access control).

Follow-up: How does the Shared Responsibility Model change when you move from GKE Standard to GKE Autopilot?

In GKE Standard, you manage node pools, node OS patches, node security configuration, and pod scheduling. In Autopilot, Google takes over all of those responsibilities. Your responsibility shrinks to pod-level concerns: container images (vulnerability scanning via Artifact Analysis), pod security settings (Autopilot enforces a built-in security baseline that, for example, disallows privileged containers), application code, and IAM (Workload Identity bindings). Autopilot effectively shifts the dividing line closer to the “SaaS” end of the spectrum for Kubernetes, which is why it is increasingly the recommendation for teams that do not need the low-level node control that Standard provides.
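The bucket gotcha is easy to detect in an IAM policy, because public access is granted to the special principals `allUsers` and `allAuthenticatedUsers`. The minimal sketch below scans a policy dict in the JSON shape IAM returns (a `bindings` list of `role`/`members` pairs). The sample policy is made up.

```python
PUBLIC_PRINCIPALS = {"allUsers", "allAuthenticatedUsers"}


def public_bindings(policy: dict) -> list[tuple[str, str]]:
    """Return (role, member) pairs that expose the resource publicly."""
    return [
        (binding["role"], member)
        for binding in policy.get("bindings", [])
        for member in binding.get("members", [])
        if member in PUBLIC_PRINCIPALS
    ]


if __name__ == "__main__":
    bucket_policy = {"bindings": [
        {"role": "roles/storage.objectViewer", "members": ["allUsers"]},
        {"role": "roles/storage.admin", "members": ["group:platform@example.com"]},
    ]}
    for role, member in public_bindings(bucket_policy):
        print(f"PUBLIC: {member} has {role}")
```

Auditing catches the mistake after the fact. The `constraints/storage.publicAccessPrevention` organization policy is the stronger control, because it blocks such grants up front.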