Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Chapter 2: Architecture Overview

The Google File System architecture is elegantly simple: a single master coordinating hundreds of chunkservers, with clients communicating directly with chunkservers for data operations. This separation of control plane (metadata on the master) from data plane (actual bytes on chunkservers) is one of GFS’s most important contributions to distributed systems design. The same pattern appears today in HDFS (NameNode vs DataNode), Kubernetes (API server vs kubelets), and virtually every modern cloud storage service. This chapter explores how these components work together to create a highly scalable distributed file system.
Chapter Goals:
  • Understand the three main components: Master, Chunkservers, and Clients
  • Learn how data flows through the system
  • Grasp the separation of control and data flow
  • Appreciate the single-master design choice

System Components

GFS consists of three main types of components:
+---------------------------------------------------------------+
|                    GFS ARCHITECTURE                           |
+---------------------------------------------------------------+
|                                                               |
|                    ┌─────────────────┐                        |
|                    │  MASTER         │                        |
|                    │  ┌───────────┐  │                        |
|                    │  │ Metadata  │  │   • Namespace          |
|                    │  │ (in-mem)  │  │   • Chunk locations    |
|                    │  └───────────┘  │   • Leases             |
|                    │  Operation Log   │   • Garbage collection |
|                    │  Checkpoints     │                        |
|                    └─────────────────┘                        |
|                       ↑    ↑    ↑                             |
|         Control flow  │    │    │   (metadata only)           |
|                       │    │    │                             |
|         ┌─────────────┘    │    └─────────────┐               |
|         │                  │                  │               |
|    ┌────────┐         ┌────────┐         ┌────────┐          |
|    │ CLIENT │         │ CLIENT │         │ CLIENT │          |
|    └────────┘         └────────┘         └────────┘          |
|         │                  │                  │               |
|         │ Data flow        │                  │               |
|         │ (direct)         │                  │               |
|         ↓                  ↓                  ↓               |
|    ┌──────────┐      ┌──────────┐      ┌──────────┐          |
|    │ CHUNK    │←────→│ CHUNK    │←────→│ CHUNK    │          |
|    │ SERVER 1 │      │ SERVER 2 │      │ SERVER 3 │          |
|    │          │      │          │      │          │          |
|    │ [Chunks] │      │ [Chunks] │      │ [Chunks] │          |
|    │ Checksums│      │ Checksums│      │ Checksums│          |
|    └──────────┘      └──────────┘      └──────────┘          |
|                                                               |
|    ... hundreds more chunkservers ...                        |
|                                                               |
+---------------------------------------------------------------+

KEY PRINCIPLE: Control and data flow are separated
              Master handles metadata, clients talk to chunkservers for data

The Master

The master is the brain of GFS — a single process that manages all metadata. At first glance, a single master looks like an obvious bottleneck and single point of failure. Skeptics in 2003 criticized this choice heavily. But the GFS team recognized a crucial insight: by keeping the master entirely out of the data path and minimizing its per-operation cost, a single machine could handle metadata for a petabyte-scale cluster. The design leverages Metadata Minimization (64 bytes per chunk, all in RAM) and Direct Data Transfer (clients talk to chunkservers for data, never routing bytes through the master) to ensure it can manage petabytes of data from thousands of clients. This “simple master, smart clients” model proved so effective that it became the default architecture for an entire generation of distributed systems.
Namespace Structure: Unlike traditional file systems with directory inodes, GFS uses a lookup table mapping full pathnames to metadata. This lookup table is stored in a prefix-compressed Radix Tree (or Hash Table) for extreme RAM efficiency.The Locking Mechanism: Since there is no directory structure, operations on /a/b/c don’t require locking the parent /a or /a/b. Instead:
  1. To create /home/user/foo, the master acquires Read Locks on /home and /home/user, and a Write Lock on /home/user/foo.
  2. This allows concurrent file creations in the same directory (e.g., /home/user/bar can be created simultaneously as it only needs a read lock on the parent).
  3. No deadlock risk: Locks are acquired in a consistent lexicographical order.
Benefit: Extreme concurrency for metadata operations compared to traditional POSIX dentry locking.
What Gets Persisted:
OPERATION LOG (Sequential on Disk)
──────────────────────────────────

Every metadata mutation is logged:

Example Log Entries:
┌────────────────────────────────────────┐
│ Timestamp: 1234567890                  │
│ Op: CREATE_FILE                        │
│ Path: /data/crawl/20031015             │
│ Owner: crawler                         │
├────────────────────────────────────────┤
│ Timestamp: 1234567891                  │
│ Op: CREATE_CHUNK                       │
│ File: /data/crawl/20031015             │
│ ChunkHandle: 0x1a2b3c4d                │
│ ChunkIndex: 0                          │
├────────────────────────────────────────┤
│ Timestamp: 1234567892                  │
│ Op: ASSIGN_CHUNK                       │
│ ChunkHandle: 0x1a2b3c4d                │
│ Servers: [cs1, cs5, cs12]              │
└────────────────────────────────────────┘

CHECKPOINTS (Compact B-tree)
────────────────────────────

Periodically, master creates checkpoint:
• Entire namespace state
• All chunk mappings
• Compact B-tree format
• New operations log after checkpoint

Recovery:
1. Load latest checkpoint
2. Replay log entries since checkpoint
3. Query chunkservers for chunk locations

Chunkservers

Chunkservers are the workhorses that store actual data:

Storage Model

How Chunks Are Stored:
  • Each chunk: 64MB max
  • Stored as Linux file on local disk
  • Named by chunk handle (globally unique)
  • File path: /gfs/chunks/<chunk_handle>
  • Checksums for each 64KB block
  • Multiple chunks per disk

Responsibilities

What Chunkservers Do:
  • Store and retrieve chunks
  • Verify data integrity (checksums)
  • Replicate chunks to other servers
  • Report to master via heartbeat
  • Delete garbage chunks
  • Handle client read/write requests

No Metadata Cache

Simple Design:
  • Chunkservers don’t cache metadata
  • Don’t track which files chunks belong to
  • Master tells them what to do
  • Simplifies consistency
  • Reduces memory requirements

Replication

Chunk Copies:
  • Each chunk replicated 3x (default)
  • Replicas on different racks
  • Replicas can serve reads
  • One primary for writes (leased)
  • Replicas forward writes in chain
Chunkserver State:
┌─────────────────────────────────────────────────────────────┐
│                    CHUNKSERVER                               │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Disk Storage:                  Memory:                      │
│  ─────────────                  ────────                     │
│                                                              │
│  /gfs/chunks/                   • Checksums (in-mem)         │
│  ├── 0x1a2b3c4d                 • Active operations          │
│  ├── 0x2b3c4d5e                 • Network buffers            │
│  ├── 0x3c4d5e6f                                              │
│  └── ...                        No persistent metadata!      │
│                                                              │
│  Each chunk file:               Heartbeat to Master:         │
│  • 64MB max                     ──────────────────           │
│  • Checksum per 64KB            • Every few seconds          │
│  • Version number               • "I have chunks X,Y,Z"      │
│                                 • Disk space available       │
│                                 • Current load               │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Clients

GFS clients are library code linked into applications:
CLIENT LIBRARY (GFS API)
────────────────────────

┌─────────────────────────────────────┐
│  Application Code                   │
│  (e.g., MapReduce, Crawler)         │
├─────────────────────────────────────┤
│  GFS Client Library                 │
│                                     │
│  • File operations API              │
│  • Metadata caching                 │
│  • Data buffering                   │
│  • Checksum validation              │
│  • Retry logic                      │
│  • Performance optimizations        │
└─────────────────────────────────────┘
         │              │
         │              │
         ↓              ↓
    To Master    To Chunkservers
   (metadata)        (data)

CLIENT OPTIMIZATIONS:
─────────────────────

1. Metadata Caching
   • Cache chunk locations
   • Reduce master traffic
   • Timeout-based invalidation

2. Chunk Location Prefetch
   • Ask for multiple chunks
   • Batch requests to master
   • Sequential access optimization

3. Data Buffering
   • Buffer writes before sending
   • Combine small appends
   • Reduce network overhead

4. Checksum Validation
   • Verify data integrity
   • Detect corruption early
   • Avoid propagating bad data

Data Flow Patterns

Understanding how data flows through GFS is crucial. The key principle is the Separation of Control Flow and Data Flow. Control flows from client to master and then to the primary, while data is pushed linearly along a chain of chunkservers to maximize network throughput. This separation is one of the most practically important ideas in the paper. By decoupling “where should I write?” (control) from “here is the actual data” (data), GFS ensures the master never becomes a bandwidth bottleneck. Modern systems like Apache Kafka use the same principle: the controller handles partition assignments and leader election, but producers and consumers transfer data directly to/from broker nodes.

The Pipeline Push Mechanism

To fully utilize each machine’s network bandwidth, data is pushed along a chain of chunkservers rather than in a star topology.
PIPELINE DATA TRANSFER (3 Replicas: A, B, C)
-------------------------------------------
Client → A → B → C

Mathematical Optimization:
Total Time ≈ (Size / Bandwidth) + (N-1) * Latency

Why this is efficient:
1. Pipelining: Server A starts forwarding data to B as soon as it receives the first packet.
2. Network Topology: GFS chooses the chain based on IP addresses (network distance) to minimize latency.
3. Full Duplex: Each server uses its full outgoing bandwidth to push data to the next peer while receiving.

Read Operation

1

Client Requests Metadata

Client                    Master
  │                         │
  │  "Where is chunk 0      │
  │   of /data/file1?"      │
  │────────────────────────>│
  │                         │
  │  Chunk locations:       │
  │  [cs1, cs5, cs12]       │
  │<────────────────────────│
  │                         │

Client caches this for future reads
2

Client Contacts Chunkserver

Client                  Chunkserver
  │                         │
  │  "Read chunk handle     │
  │   0x1a2b, offset 10MB,  │
  │   length 1MB"           │
  │────────────────────────>│
  │                         │
  │  [Data payload]         │
  │<────────────────────────│
  │                         │

Client picks closest chunkserver
Master not involved in data transfer!
3

Data Validation

Client receives data and:

1. Validates checksums
2. Returns to application
3. May cache for future reads

If checksum fails:
→ Try different replica
→ Report to master
Complete Read Flow:
READ OPERATION FLOW
───────────────────

1. Application calls: read("/data/file1", offset, length)

2. Client Library:
   ┌────────────────────────────────────────┐
   │ a) Convert offset to chunk index       │
   │    offset 100MB = chunk 1 (64MB each)  │
   │                                        │
   │ b) Check metadata cache                │
   │    Cache hit? Skip master request      │
   │                                        │
   │ c) Request from master (if miss):      │
   │    "Give me locations for chunks       │
   │     [1,2,3] of /data/file1"            │
   │    (Prefetch multiple chunks)          │
   │                                        │
   │ d) Cache response                      │
   └────────────────────────────────────────┘

3. Master Returns:
   ┌────────────────────────────────────────┐
   │ Chunk 1: [cs2, cs7, cs15]              │
   │ Chunk 2: [cs3, cs9, cs11]              │
   │ Chunk 3: [cs1, cs8, cs14]              │
   │ Version: 5                             │
   └────────────────────────────────────────┘

4. Client Picks Closest Chunkserver:
   ┌────────────────────────────────────────┐
   │ Sort by network distance               │
   │ cs15 is closest → send request there   │
   └────────────────────────────────────────┘

5. Chunkserver Serves Data:
   ┌────────────────────────────────────────┐
   │ • Read from local disk                 │
   │ • Compute & return checksums           │
   │ • Stream data to client                │
   └────────────────────────────────────────┘

6. Client Validates & Returns:
   ┌────────────────────────────────────────┐
   │ • Verify checksums                     │
   │ • Return to application                │
   │ • On error: try different replica      │
   └────────────────────────────────────────┘

Write Operation

Writes are more complex due to maintaining consistency across replicas:
WRITE OPERATION FLOW
────────────────────

Setup Phase:
───────────

Client                Master                Chunkservers
  │                     │                         │
  │ 1. "Write to        │                         │
  │    file /data/f1"   │                         │
  │────────────────────>│                         │
  │                     │                         │
  │                     │ 2. Grant lease to       │
  │                     │    one replica (primary)│
  │                     │────────────────────────>│ Primary
  │                     │                         │
  │ 3. Return:          │                         │
  │    Primary: cs2     │                         │
  │    Secondaries:     │                         │
  │    [cs5, cs9]       │                         │
  │<────────────────────│                         │


Data Push Phase (SEPARATED FROM CONTROL):
────────────────────────────────────────

  │                                                │
  │ 4. Push data to closest replica                │
  │───────────────────────────────────────────────>│ cs5
  │                                                │
  │                                                │
  │                          5. Forward along      │
  │                             replica chain      │
  │                                                │─────> cs2
  │                                                │
  │                                                │─────> cs9
  │                                                │
  │ 6. All replicas ACK data received              │
  │    (data in memory, not yet written)           │
  │<───────────────────────────────────────────────│


Control Phase:
─────────────

  │                                                │
  │ 7. Send write request to PRIMARY               │
  │    "Apply buffered data"                       │
  │───────────────────────────────────────────────>│ Primary (cs2)
  │                                                │
  │                                                │
  │                     8. Primary assigns serial  │
  │                        number, applies writes  │
  │                        in order                │
  │                                                │
  │                     9. Forward write order     │
  │                        to secondaries          │
  │                                                │─────> cs5
  │                                                │─────> cs9
  │                                                │
  │                    10. Secondaries apply       │
  │                        writes in same order    │
  │                                                │
  │                    11. Secondaries reply       │
  │                        to primary              │
  │                                                │<───── cs5
  │                                                │<───── cs9
  │                                                │
  │ 12. Primary replies to client                  │
  │     Success/Failure                            │
  │<───────────────────────────────────────────────│
  │                                                │


KEY INSIGHTS:
────────────

1. Data flow decoupled from control flow
   → Push data along network topology
   → Send control to primary

2. Pipelining
   → Replica forwards while receiving
   → Exploits full network bandwidth

3. Primary serializes operations
   → Consistent order across replicas
   → No distributed consensus needed

Record Append Operation

The record append is GFS’s “killer feature”:
ATOMIC RECORD APPEND
────────────────────

Problem:
───────
Multiple clients want to append to same file
(e.g., 1000 mappers writing to shared output)

Traditional Append:
──────────────────
1. Client: "What's current file size?"
2. Master: "100MB"
3. Client: "Write at offset 100MB"
4. RACE: Another client may write there!
5. Need distributed locking → slow

GFS Record Append:
─────────────────

Client A ──┐
Client B ──┼──> "Append this record"
Client C ──┘     (no offset specified!)


       Primary picks offset atomically


       Returns offset to client

Properties:
──────────
✓ Atomic: All-or-nothing
✓ At-least-once: Guaranteed success
✓ No distributed locks needed
✓ Concurrent from multiple clients
⚠ May have duplicates (retries)
⚠ May have padding (if crosses chunk)

Architecture Deep Dive

Chunk Size: Why 64MB?

The 64MB chunk size is a defining characteristic of GFS:
Benefits of 64MB Chunks:
  1. Reduced Metadata:
    1TB file:
    - 4KB blocks: 268 million entries
    - 64MB chunks: 16,384 entries
    → 16,000x less metadata!
    
  2. Fewer Network Hops:
    • Client requests chunk locations once
    • Works with chunk for extended period
    • Reduces master load significantly
  3. Better TCP Performance:
    • Long-lived TCP connections
    • Connection setup cost amortized
    • Better throughput utilization
  4. Data Locality:
    • MapReduce can schedule entire chunk to one worker
    • Reduces network transfer
    • Better cache utilization
Problems with Large Chunks:
  1. Internal Fragmentation:
    Problem:
    - 1KB file occupies 64MB chunk
    - Wastes 63.999MB per small file
    
    Mitigation:
    - Lazy space allocation
    - Only allocate actual bytes used
    - Linux sparse files
    
  2. Hot Spots:
    Problem:
    - Popular small file (e.g., executable)
    - All chunks on same 3 chunkservers
    - All clients hit same servers
    → Overload!
    
    Mitigation:
    - Higher replication for hot files
    - Client-side retries with backoff
    - Applications can batch/stagger access
    
  3. Startup Issues:
    • Small configuration files
    • Many clients read at startup
    • Temporary hotspot
    Solution: Batch downloads, higher replication

Metadata Design

Why Keep Metadata in RAM?
PERFORMANCE BENEFITS:
────────────────────

Operation          Disk        RAM
──────────────────────────────────
Lookup file        10ms        100ns
List directory     50ms        500ns
Create file        20ms        200ns

→ 100,000x faster!

SCALABILITY:
───────────

100 million chunks × 50 bytes = 5GB

Modern server RAM: 100s of GB
→ Metadata easily fits

SIMPLICITY:
──────────

No cache coherency issues
No disk I/O for reads
Fast global decisions

Consistency Without Locks

GFS achieves consistency without distributed locking:
CONSISTENCY MECHANISMS:
──────────────────────

1. LEASE MECHANISM
   ─────────────

   Master grants 60-second lease to one replica (primary)

   ┌─────────────────────────────────────────┐
   │ Primary has authority to order mutations│
   │                                         │
   │ Lease Lifecycle:                        │
   │ • Master grants                         │
   │ • Primary orders operations             │
   │ • Lease expires or renewed              │
   │ • Master can revoke if needed           │
   └─────────────────────────────────────────┘

   Benefit: No distributed consensus for each write!

2. VERSION NUMBERS
   ───────────────

   Each chunk has version number

   Chunk created: version = 1
   Mutation: version++

   Stale replica detection:
   • Chunkserver reports chunk version
   • Master compares to expected
   • Stale replicas garbage collected

   Benefit: Detect stale replicas reliably

3. CHECKSUMS
   ──────────

   Each 64KB block has 32-bit checksum

   On write:
   • Compute checksum
   • Store with data

   On read:
   • Recompute checksum
   • Compare with stored
   • If mismatch: corruption detected

   Benefit: Detect data corruption

4. APPLICATION-LEVEL VALIDATION
   ─────────────────────────────

   Applications add:
   • Unique IDs (deduplication)
   • Checksums (validation)
   • Sequence numbers (ordering)

   Benefit: Handle relaxed consistency

Component Interactions

Master-Chunkserver Communication

HEARTBEAT PROTOCOL:
──────────────────

Every few seconds:

Chunkserver                          Master
    │                                  │
    │  Heartbeat + Chunk Report        │
    │─────────────────────────────────>│
    │                                  │
    │  Chunks: [0x1a2b, 0x2c3d, ...]   │
    │  Disk space: 500GB free          │
    │  Load: 50% CPU                   │
    │                                  │
    │  <───────────────────────────────│
    │      Instructions                │
    │                                  │
    │  • Delete chunk 0x5e6f           │
    │  • Re-replicate 0x7f8g to cs15   │
    │  • Update version for 0x9a0b     │
    │                                  │

CHUNK LOCATION UPDATES:
──────────────────────

Master's view updated continuously:

1. Initial: Startup poll all chunkservers
2. Ongoing: Heartbeat updates
3. Additions: New chunks reported
4. Deletions: Garbage collection confirmed

No persistent state needed!
→ Chunkservers are source of truth

Client-Master Communication

CLIENT METADATA REQUESTS:
────────────────────────

Optimizations to reduce master load:

1. BATCHING
   ────────

   Client asks for multiple chunks:

   "Give me chunks [0-10] of /file"

   Instead of 11 separate requests

2. PREFETCHING
   ───────────

   Client predicts sequential access
   Asks for future chunks

   Reading chunk 5?
   → Ask for chunks [5-15]

3. CACHING
   ────────

   Client caches chunk locations
   Timeout: few minutes

   Cache hit rate: 95%+
   → Master sees 5% of reads

4. LEASE TRANSPARENCY
   ──────────────────

   Master returns:
   • Primary replica
   • Secondary replicas
   • Lease expiration time

   Client caches until expiration
   → No master contact for writes

Key Architectural Decisions

Single Master

Decision: One master, many chunkserversWhy:
  • Simplifies design dramatically
  • Strong metadata consistency
  • Global knowledge for decisions
  • Fast in-memory operations
Mitigation:
  • Minimize master involvement
  • Shadow masters for HA
  • Client caching

Large Chunks

Decision: 64MB chunk sizeWhy:
  • Reduces metadata size
  • Fewer network round trips
  • Better throughput
  • Data locality benefits
Trade-off:
  • Internal fragmentation
  • Potential hot spots
  • Not optimal for small files

Separation of Control/Data

Decision: Metadata through master, data direct to chunkserversWhy:
  • Master not bottleneck for data
  • Scales to GB/s throughput
  • Parallel data transfers
  • Flexible data flow routing
Benefit:
  • Master handles 1000s ops/sec
  • Data path limited only by network

Relaxed Consistency

Decision: Defined consistency, not strictWhy:
  • Higher performance
  • Simpler implementation
  • No distributed transactions
  • Matches application needs
Requirement:
  • Application-level handling
  • Deduplication
  • Validation

Interview Questions

Expected Answer:GFS has three main components:
  1. Master (single):
    • Stores all metadata in RAM
    • Manages namespace (files/directories)
    • Tracks chunk locations
    • Makes placement decisions
    • Coordinates system-wide operations
  2. Chunkservers (hundreds/thousands):
    • Store 64MB chunks as Linux files
    • Serve read/write requests
    • Replicate data to other chunkservers
    • Report status to master via heartbeat
  3. Clients (many):
    • Application library
    • Caches metadata
    • Communicates with master for metadata
    • Talks directly to chunkservers for data
Key principle: Control and data flow are separated.
Expected Answer:GFS separates control (metadata) and data flow for scalability:Control Flow (Client ↔ Master):
  • Client asks master for chunk locations
  • Master returns list of replicas
  • Client caches this information
  • Happens only once per chunk/file
Data Flow (Client ↔ Chunkservers):
  • Client communicates directly with chunkservers
  • No master involvement for data transfer
  • Enables massive parallel throughput
  • Master not a bottleneck
Benefits:
  • Master handles thousands of metadata ops/sec
  • Data throughput limited only by network/chunkservers
  • System scales to hundreds of clients at GB/s aggregate
Without separation, master would need to handle all data, becoming an immediate bottleneck.
Expected Answer:GFS write operation has three phases:1. Lease Grant:
  • Client asks master for chunk replicas
  • Master grants lease to one replica (primary)
  • Client caches primary and secondary locations
2. Data Push (decoupled from control):
  • Client pushes data to closest replica
  • Each replica forwards to next in chain
  • Pipelined: forward while receiving
  • All replicas ACK when data buffered in memory
  • Data not yet written to disk!
3. Control Flow:
  • Client sends write command to primary
  • Primary assigns serial number (ordering)
  • Primary applies writes to local disk in order
  • Primary forwards write order to secondaries
  • Secondaries apply in same order
  • Secondaries ACK to primary
  • Primary ACKs success/failure to client
Key Insights:
  • Data flow optimized for network topology
  • Control flow ensures consistent ordering
  • Primary serializes concurrent operations
  • No distributed consensus needed
  • If any replica fails, entire write fails (client retries)
Expected Answer:Chunk locations are NOT stored persistently in master, only in-memory. Reasons:Why Not Persist:
  1. Chunkservers are source of truth:
    • Chunkserver knows what chunks it has (on disk)
    • No risk of inconsistency between master and reality
  2. Dynamic nature:
    • Chunks added/deleted frequently
    • Chunkservers may fail
    • Disks may fail
    • Replication changes chunk locations
  3. Simpler consistency:
    • No need to keep persistent state in sync
    • No risk of master having stale information
    • Master polls chunkservers to rebuild state
How It Works:
  1. Startup: Master polls all chunkservers for chunk list
  2. Ongoing: Heartbeat messages update chunk locations
  3. Changes: Chunkservers report additions/deletions
Benefits:
  • Eliminates entire class of consistency bugs
  • Faster recovery (no persistent state to reload)
  • Simpler implementation
Trade-off:
  • Startup requires polling all chunkservers
  • Acceptable because happens rarely and completes quickly

Summary

Key Architecture Takeaways:
  1. Single Master Design: Simplicity and strong consistency for metadata
  2. Separation of Concerns: Control (master) vs. Data (chunkservers) flow
  3. Large Chunks: 64MB chunks reduce metadata and network overhead
  4. In-Memory Metadata: Fast operations, simple design
  5. Leases for Consistency: Primary replica orders operations without consensus
  6. Direct Client-Chunkserver: Data flow doesn’t involve master
  7. Relaxed Consistency: Trade-off for performance, application handles edge cases
  8. Record Append: Atomic concurrent appends without locks

Up Next

In Chapter 3: Master Operations, we’ll explore:
  • Namespace management and locking
  • Chunk creation and allocation
  • Replica placement strategies
  • Lease management in detail
  • Garbage collection mechanism
The architecture provides the foundation—now we’ll see how the master orchestrates the entire system.

Interview Deep-Dive

Strong Answer:The lease mechanism is how GFS delegates write authority without expensive distributed consensus. When a client wants to write to a chunk, the master grants a 60-second lease to one chunkserver, designating it the “primary.” The primary defines the serial order of all mutations to that chunk. All replicas apply mutations in the same order, ensuring consistency. The lease auto-expires after 60 seconds unless the primary requests extensions via heartbeats.Why leases instead of Paxos or distributed locks? Three reasons. First, simplicity: leases are a single timer-based mechanism, while Paxos requires multiple rounds of messaging for every decision. Second, performance: the lease is granted once and reused for all writes during its lifetime, amortizing the coordination cost. Third, safety against split-brain: if the master loses contact with a primary, it simply waits 60 seconds for the lease to expire before granting a new one. This guarantees that at most one primary exists for any chunk at any time, without needing the complex leader election of consensus protocols.The trade-off is a 60-second recovery window. If a primary crashes, no writes can proceed to that chunk until the lease expires. For Google batch workloads, a 60-second pause was acceptable. For a low-latency transaction processing system, it would not be.Follow-up: What happens if the master itself crashes while a lease is outstanding?The lease continues to be valid because it is time-bound, not dependent on the master being alive. When the master recovers (by loading its checkpoint and replaying the operation log), it polls all chunkservers to discover which leases are still active. Any primary whose lease has not expired continues to serve writes. Any primary whose lease has expired simply stops accepting writes, and the recovered master can grant new leases. This design is elegant because it separates the lease grant (which requires the master) from the lease enforcement (which is purely time-based and works even during master downtime).
Strong Answer:The primary risk is data loss if the master process crashes without persisting recent mutations. GFS mitigates this with an operation log that records every metadata change before acknowledging it to the client. The operation log is replicated to multiple remote machines, so even if the master machine is destroyed, metadata can be recovered from a remote copy.The second risk is memory exhaustion. At 64 bytes per chunk, the master needs about 1GB of RAM per 16 million chunks (about 1PB of data). As the cluster grows, the master eventually runs out of RAM. Google handled this by using prefix compression for the namespace (reducing per-file overhead to about 64 bytes) and by monitoring heap usage as a capacity planning metric.The third risk is recovery time. After a crash, the master must load the latest checkpoint (a compact B-tree representation of the full namespace) and replay all operation log entries since that checkpoint. If the log is long, recovery takes minutes. GFS mitigates this by checkpointing frequently (the checkpoint runs in a background thread without blocking mutations) and by keeping the checkpoint in a format that can be memory-mapped directly.The fourth risk is GC pauses. A master with tens of gigabytes of heap can experience multi-second garbage collection pauses, during which all metadata operations stall. This was a real operational concern at Google and is one reason they eventually moved to Colossus with distributed metadata.Follow-up: If you were designing this today, would you still use in-memory metadata? What alternatives exist?For clusters under 100PB, in-memory metadata is still the right call. Modern servers with 1-2TB of RAM can hold metadata for exabyte-scale clusters. The alternatives are: (1) a distributed metadata store like Bigtable or FoundationDB, which is what Colossus uses — this removes the single-machine RAM limit but adds the complexity of distributed transactions for metadata; (2) a tiered approach where hot metadata is in RAM and cold metadata is on SSD, similar to how modern databases manage buffer pools. I would start with in-memory and plan the migration to distributed metadata as a known future milestone, because the operational simplicity of a single in-memory master saves enormous engineering time in the early years.
Strong Answer:In GFS, control flow (metadata operations like file open, chunk lookup, lease grant) goes through the master, while data flow (actual reads and writes of file content) goes directly between clients and chunkservers. The master never sees a single byte of file data. This separation is critical for three reasons.First, it prevents the master from becoming a bandwidth bottleneck. If every read and write routed through the master, a single machine would need to handle the aggregate I/O bandwidth of the entire cluster — potentially terabytes per second. By keeping the master off the data path, it only needs to handle metadata RPCs, which are small (a few hundred bytes each) and relatively infrequent (clients cache chunk locations).Second, it enables linear scaling of data throughput. Adding more chunkservers adds more aggregate bandwidth because clients talk to chunkservers directly. The master does not need to scale with data throughput, only with metadata operation rate.Third, it simplifies the master implementation. The master does not need high-bandwidth network interfaces, large disk arrays, or complex buffer management. It is essentially a metadata database that fits in RAM.This pattern has become the standard architecture for distributed systems. HDFS uses the same separation (NameNode for metadata, DataNodes for data). Kafka separates its controller from broker data paths. Kubernetes separates the API server (control plane) from kubelet data operations.Follow-up: Are there cases where you would NOT separate control and data flow?Yes. For very small clusters (under 10 nodes) or systems with extremely low throughput, the separation adds unnecessary complexity. A single coordinator that handles both metadata and data is simpler to operate and debug. Also, for systems where every operation requires a metadata decision (like distributed databases with per-row access control), you may want the metadata and data paths to be collocated for lower latency. The separation is a scaling optimization, and like all optimizations, it has a complexity cost that needs to be justified by the scale.