Chapter 16: Beyond a Single Cloud - Hybrid and Multi-Cloud

Modern enterprise computing is rarely confined to a single cloud provider. Organizations often keep on-premises data centers for data sovereignty, or use multiple clouds to avoid vendor lock-in. Anthos is Google Cloud’s answer to this complexity: a managed platform that extends Google’s services and operational model to any environment.

The real-world context: according to industry surveys, over 80% of enterprises use two or more cloud providers. This is not always by choice; it often happens through acquisitions, team preferences, or services that are only available on one cloud. The challenge is not “which cloud is best?” but “how do we manage all of them consistently?” Anthos addresses this by providing a single management plane. AWS has a similar (but narrower) offering in EKS Anywhere, and Azure has Azure Arc. Anthos was an early entrant in multi-cloud Kubernetes management and covers one of the broadest sets of environments.

1. Anthos: The Unified Control Plane

Anthos is not a single product; it is a suite of technologies built on Kubernetes, Istio, and Knative. It allows you to manage clusters on GCP, AWS, Azure, and On-Premises (VMware or Bare Metal) from a single dashboard.

Connect Gateway

One of the most powerful features of Anthos is the Connect Gateway. Think of it like a secure remote desktop for Kubernetes — it allows you to run kubectl commands or use the GCP Console to manage clusters that are behind firewalls or in private networks, without needing a VPN or complex SSH tunnels. The Connect Agent runs inside the cluster and maintains an outbound connection to Google, so you never need to open inbound firewall ports. AWS EKS Anywhere requires setting up AWS SSM or a VPN for similar access; Azure Arc uses a similar agent-based model.
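Requests that arrive through the Connect Gateway are still subject to Kubernetes RBAC inside the cluster. The manifests below are a minimal sketch of what that authorization might look like. They assume the Connect Agent runs as the connect-agent-sa service account in the gke-connect namespace (as Google's documentation describes), and alice@example.com is a hypothetical user; check both against your own fleet before applying anything.

```yaml
# Allow the Connect Agent to impersonate one specific user (hypothetical address).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gateway-impersonate
rules:
- apiGroups: [""]
  resources: ["users"]
  resourceNames: ["alice@example.com"]
  verbs: ["impersonate"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gateway-impersonate
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: gateway-impersonate
subjects:
- kind: ServiceAccount
  name: connect-agent-sa        # assumed agent service account
  namespace: gke-connect        # assumed agent namespace
---
# Grant that user read-only access through the built-in "view" ClusterRole.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: alice-view
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
- kind: User
  name: alice@example.com
  apiGroup: rbac.authorization.k8s.io
```

Listing users explicitly in resourceNames keeps the agent from impersonating arbitrary identities, which is usually the safer default.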

Anthos Clusters (GKE Multi-Cloud)

Anthos provides a consistent GKE experience everywhere.
  • Anthos on VMware: Runs GKE clusters on your existing vSphere infrastructure.
  • Anthos on Bare Metal: Runs GKE directly on physical Linux servers, eliminating the overhead of a hypervisor—ideal for edge computing and high-performance workloads.
  • Anthos on AWS/Azure: Google manages the lifecycle of Kubernetes clusters running on EC2 or Azure VMs.

2. Anthos Service Mesh (ASM)

ASM is a managed service mesh based on Istio. It solves the “microservices mess” by providing security, observability, and traffic control without requiring code changes.
  • Managed Control Plane: Google manages the Istio control plane (istiod, which consolidates the older Pilot and Citadel components), so you only worry about the sidecar proxies.
  • mTLS by Default: ASM automatically encrypts and authenticates service-to-service communication using Mutual TLS (mTLS), so traffic captured on a compromised network cannot be read or silently altered.
  • Service Graph: A visual representation of how your services communicate, including latency, error rates, and throughput for every link.
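To make the mTLS behavior above explicit rather than relying on defaults, many Istio-based meshes add a mesh-wide PeerAuthentication policy. The following is a minimal sketch, assuming istio-system is the mesh root namespace (that is an assumption about your installation, not something ASM guarantees in every setup):

```yaml
# Mesh-wide policy: sidecars reject any plaintext (non-mTLS) traffic.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # assumed root namespace of the mesh
spec:
  mtls:
    mode: STRICT            # PERMISSIVE would accept both mTLS and plaintext
```

A common rollout approach is to start with PERMISSIVE while workloads are still being onboarded, then switch to STRICT once every client has a sidecar.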
Cost Tip: ASM is included free with GKE Enterprise (Anthos) licensing. However, the sidecar proxy (Envoy) running in every pod adds approximately 50-100MB of RAM and 0.1 vCPU per pod. For a cluster with 200 pods, that is 10-20GB of additional RAM and 20 vCPUs of overhead. Factor this into your node pool sizing — many teams are surprised when adding a service mesh increases their compute bill by 15-25%. Consider using the newer “sidecar-less” mode (ambient mesh) for lower overhead on high-density clusters.

2.1 The ASM Ingress Gateway: TLS Termination

The ASM Ingress Gateway is a standalone Envoy-based proxy that handles traffic entering the mesh. Architectural Flow:
  1. Client Request: User hits the Global Load Balancer (or a static IP).
  2. TLS Termination: The Ingress Gateway terminates the external TLS using a Kubernetes Secret (containing the cert/key).
  3. mTLS Origination: The Gateway then starts a new mTLS connection to the backend sidecar proxy inside the mesh.
Configuring TLS Termination (YAML):
# Why terminate TLS at the Gateway (not in individual services): Centralizes certificate
# management in one place. Without this, every microservice needs its own cert handling --
# a maintenance nightmare at scale. The Gateway handles external TLS, then starts mTLS
# to backend services, so traffic is encrypted end-to-end.
# AWS equivalent: API Gateway or ALB with ACM certificate + backend TLS via App Mesh.
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: prod-gateway
  namespace: istio-system        # Restrict certs to a controlled namespace
spec:
  selector:
    istio: ingressgateway        # Use Istio default gateway implementation
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE               # SIMPLE = standard TLS termination (not mTLS from clients)
      credentialName: "frontend-certs"  # K8s secret containing cert/key -- manage via cert-manager
    hosts:
    - "api.mycompany.com"        # Only accept traffic for this hostname
Why terminate at the Gateway?
  • Security: Keeps sensitive SSL keys in a restricted namespace (istio-system).
  • Offloading: Backend microservices don’t need to manage certificates; they only see mTLS.
  • Policy Enforcement: You can apply global Rate Limiting or JWT validation at the Gateway before traffic reaches any pod.
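As an illustration of the policy-enforcement point above, JWT validation at the gateway can be sketched with two Istio resources. The issuer and JWKS URLs are hypothetical placeholders, and the selector assumes the gateway pods carry the istio: ingressgateway label used in the Gateway example:

```yaml
# Step 1: tell the gateway how to validate tokens (issuer URLs are hypothetical).
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  name: gateway-jwt
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  jwtRules:
  - issuer: "https://auth.example.com"
    jwksUri: "https://auth.example.com/.well-known/jwks.json"
---
# Step 2: reject requests that carry no valid token at all.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: require-jwt
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  action: DENY
  rules:
  - from:
    - source:
        notRequestPrincipals: ["*"]
```

Both resources are needed: RequestAuthentication on its own only rejects invalid tokens, while requests with no token would still pass without the AuthorizationPolicy.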

3. Anthos Config Management (ACM)

ACM brings GitOps to the enterprise. It ensures that all your clusters (regardless of where they are) stay in sync with a single source of truth—your Git repository.
  • Config Sync: Periodically pulls manifests (YAML) from Git and applies them to the fleet.
  • Policy Controller: Built on Open Policy Agent (OPA) Gatekeeper, it allows you to enforce guardrails at admission time. For example, you can prevent any developer from creating a “LoadBalancer” service that has a public IP in a specific namespace.
  • Hierarchy Controller: Allows you to create parent-child relationships between namespaces, making it easier to manage permissions and quotas in large multi-tenant clusters.
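Inside each cluster, Config Sync is typically driven by a RootSync object that names the repository to follow. The sketch below assumes the repository URL used elsewhere in this chapter and a hypothetical Secret named git-creds holding the access token:

```yaml
apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: root-sync
  namespace: config-management-system
spec:
  sourceFormat: unstructured      # plain directory of manifests, no enforced hierarchy
  git:
    repo: https://github.com/my-org/anthos-config
    branch: main
    dir: cluster-configs          # subdirectory to sync from
    auth: token
    secretRef:
      name: git-creds             # hypothetical Secret containing the Git token
```

Pointing every cluster in the fleet at the same repo and dir is what keeps them converged on one source of truth.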

4. Connecting the Clouds: Interconnect and VPN

To make hybrid cloud work, you need a high-performance “pipe” between your data center and GCP.

Dedicated vs. Partner Interconnect

  • Dedicated Interconnect: A physical fiber connection between your router and a Google edge PoP. Provides 10 Gbps or 100 Gbps of bandwidth.
  • Partner Interconnect: You connect to a supported service provider (like Equinix or AT&T) who already has a high-speed link to Google. Ideal if you are not in a Google peering location.

Cross-Cloud Interconnect

The modern way to do multi-cloud. Google provides direct physical links to AWS and Azure edge locations. This is a game-changer for multi-cloud architectures — previously, connecting two clouds required either VPN tunnels over the internet (high latency, unpredictable) or partnering with a colocation provider like Equinix to bridge the two networks.
  • No Internet: Traffic between GCP and AWS never touches the public internet. This matters for compliance (some regulations prohibit data traversing the public internet) and for performance.
  • Performance: Low, predictable latency between clouds (single-digit milliseconds is achievable when the regions are geographically close), enabling near-real-time data synchronization.
Cost Reality Check: Cross-Cloud Interconnect is not cheap. It starts at roughly $1,700/month for a 10 Gbps connection. But if you are transferring significant data between clouds (which costs $0.08-0.12/GB over the internet), the breakeven point is approximately 15-20 TB/month of data transfer. For enterprises with active multi-cloud workloads, it pays for itself quickly.
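The breakeven estimate can be checked with a quick calculation. This is a simplified sketch using only the figures quoted above; it treats the monthly port fee as the only Interconnect cost and ignores any per-GB charges that still apply over the dedicated link, so the real breakeven will sit somewhat higher.

```latex
% Breakeven volume = fixed monthly fee / internet egress price per GB
\[
D_{\text{breakeven}} = \frac{C_{\text{month}}}{p_{\text{egress}}}
\]
% At the high and low ends of the quoted egress price range (1 TB = 1000 GB):
\[
\frac{\$1{,}700}{\$0.12/\text{GB}} \approx 14{,}167\ \text{GB} \approx 14\ \text{TB},
\qquad
\frac{\$1{,}700}{\$0.08/\text{GB}} = 21{,}250\ \text{GB} \approx 21\ \text{TB}
\]
```

Both ends land close to the 15-20 TB/month range quoted above.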

5. Advanced Anthos Ops: Service Mesh and Policy Controller

5.1 ASM Traffic Management

ASM allows you to decouple traffic routing from deployment.
  • VirtualService: Defines where the traffic goes (e.g., “Send 90% to v1 and 10% to v2”).
  • DestinationRule: Defines how the traffic is handled at the destination (e.g., “Use random load balancing” or “Eject a backend from the pool after 5 consecutive 5xx errors”).
  • Circuit Breakers: Prevents a single failing service from taking down the entire system by failing fast and allowing the service to recover.
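The three concepts above can be sketched together for a hypothetical service named checkout with two versions. The service name, version labels, and thresholds are illustrative assumptions, not recommended values:

```yaml
# Where traffic goes: 90% to v1, 10% to v2 (a canary split).
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: checkout
spec:
  hosts:
  - checkout
  http:
  - route:
    - destination:
        host: checkout
        subset: v1
      weight: 90
    - destination:
        host: checkout
        subset: v2
      weight: 10
---
# How traffic is handled on arrival: load balancing plus circuit breaking.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: checkout
spec:
  host: checkout
  trafficPolicy:
    loadBalancer:
      simple: RANDOM
    connectionPool:
      http:
        http1MaxPendingRequests: 100   # fail fast once this many requests are queued
    outlierDetection:
      consecutive5xxErrors: 5          # eject a backend after 5 straight 5xx responses
      interval: 30s                    # how often backends are evaluated
      baseEjectionTime: 60s            # minimum time an ejected backend stays out
      maxEjectionPercent: 50           # never eject more than half the pool
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
```

Shifting the canary from 10% to 50% is then a one-line weight change in the VirtualService, with no redeployment of either version.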

5.2 Policy Controller Guardrails

Policy Controller (built on OPA Gatekeeper) allows you to audit and enforce compliance across your fleet.
  • Constraints: Declarative rules (e.g., “All namespaces must have a ‘cost-center’ label”).
  • Audit Mode: See which resources are currently violating a policy without actually blocking them—ideal for onboarding legacy clusters.
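The cost-center example above, run in audit mode, might look like the constraint below. It is a sketch that assumes the K8sRequiredLabels constraint template from the Gatekeeper policy library is already installed in the cluster; the constraint name is hypothetical.

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: ns-must-have-cost-center   # hypothetical constraint name
spec:
  enforcementAction: dryrun        # audit only: report violations, do not block
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Namespace"]
  parameters:
    labels:
    - key: cost-center
```

Once the reported violations have been cleaned up, changing enforcementAction to deny (or removing the field, since deny is the default) turns the audit into a hard guardrail.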

6. Google Cloud VMware Engine (GCVE)

For organizations that want to move to the cloud without containerizing their apps, GCVE is the fastest path. Think of it as renting a fully furnished apartment instead of buying building materials and constructing from scratch — you get your familiar VMware environment, but running on Google’s infrastructure with access to GCP’s native services. AWS has VMware Cloud on AWS, and Azure has Azure VMware Solution. All three follow the same model: run VMware’s stack on the cloud provider’s bare metal.
  • Native VMware Stack: You get a full VMware environment (vSphere, vCenter, vSAN, NSX-T) running on Google’s bare-metal infrastructure.
  • Seamless Migration: Use HCX to “Live Migrate” (vMotion) VMs from your data center to GCP with zero downtime.
Cost Reality Check: GCVE is not cheap — a minimum 3-node cluster starts at approximately $15,000-20,000/month (varies by region and node type). This makes sense for organizations migrating 50+ VMware VMs where the alternative is months of re-architecture work. For smaller workloads (fewer than 10 VMs), consider Compute Engine with Migrate to Virtual Machines instead — it is significantly cheaper and does not require maintaining a full VMware stack.

7. Interview Preparation: Architectural Deep Dive

1. Q: What is “Anthos” and what problem does it solve for the enterprise? A: Anthos is a managed application platform that provides a consistent operational model across GCP, other clouds (AWS/Azure), and on-premises environments (VMware/Bare Metal). It solves “Operational Silos” by allowing teams to use the same Kubernetes-based tools (GKE), the same service mesh (Istio), and the same GitOps tooling (Anthos Config Management) regardless of where the hardware actually lives.
2. Q: Explain the “GitOps” workflow as implemented by Anthos Config Management (ACM). A: In ACM, the Git repository is the single source of truth.
  1. A developer commits a YAML manifest to Git.
  2. The Config Sync agent running in the Anthos clusters detects the change.
  3. The agent pulls and applies the manifest to the cluster.
This ensures that all clusters in the fleet stay in sync and allows for “infrastructure versioning” and easy rollbacks.
3. Q: What is the difference between Dedicated Interconnect and Partner Interconnect? A:
  • Dedicated: You have a physical fiber connection between your router and a Google Edge Point of Presence (PoP). Supports 10G/100G. Best for high bandwidth and security.
  • Partner: You connect to a service provider (like Equinix) who already has a link to Google. Better for smaller bandwidth needs (attachments start at 50 Mbps) or if your data center is not in a Google PoP city.
4. Q: Why is “Anthos Service Mesh” (ASM) critical for a microservices architecture? A: ASM (based on Istio) provides Observability, Security, and Traffic Management without code changes.
  • mTLS: Automatically encrypts all service-to-service traffic.
  • Observability: Provides a “Service Graph” and Golden Signals (Latency/Errors) out of the box.
  • Resiliency: Handles retries, circuit breakers, and canary rollouts at the infrastructure layer.
5. Q: When should an architect choose Google Cloud VMware Engine (GCVE) over containerizing into GKE? A: GCVE is the choice for a “Fast Migration” (Lift and Shift). If an organization has a complex, legacy VMware environment and needs to exit their data center quickly, GCVE allows them to vMotion VMs to GCP with zero code changes. Containerization (GKE/Anthos) is for Modernization, but it requires a significantly higher engineering effort to refactor the applications.

Implementation: The “Hybrid Architect” Lab

Setting up a GitOps Pipeline with ACM

# 1. Enable the Anthos API
# Why: Anthos APIs are not enabled by default. This single command unlocks Fleet Management,
# Config Sync, and Policy Controller features for your project.
gcloud services enable anthos.googleapis.com

# 2. Register a cluster to the Anthos Fleet
# Why --enable-workload-identity: The Connect Agent needs a secure way to authenticate
# back to GCP. Workload Identity eliminates the need for a JSON key file on the cluster.
# Common Mistake: Forgetting this flag and then having to re-register the cluster later.
gcloud container fleet memberships register my-cluster \
    --gke-cluster=us-central1-a/my-cluster \
    --enable-workload-identity

# 3. Configure Config Sync (apply-spec.yaml)
# Why Git as source of truth: Every change is auditable, reviewable, and reversible.
# If someone makes a bad config change, you can git revert and Config Sync auto-heals.
# applySpecVersion: 1
# spec:
#   configSync:
#     enabled: true
#     sourceFormat: unstructured
#     syncRepo: https://github.com/my-org/anthos-config
#     syncBranch: main
#     policyDir: "cluster-configs"
#     secretType: token

# 4. Apply the configuration
# Why: This binds the cluster to the Git repo. From this point forward, the cluster's
# desired state is whatever is in the repo. Manual kubectl changes will be reverted.
gcloud beta container fleet config-management apply \
    --membership=my-cluster \
    --config=apply-spec.yaml

Pro-Tip: Anthos Service Mesh Dashboards

Once ASM is enabled, check the Anthos Service Mesh dashboard in the Cloud Console. It provides a “Golden Signals” view (Latency, Traffic, Errors) for every single service in your mesh automatically, without you having to write a single line of monitoring code.