Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Chapter 3: Master Operations

The master is the brain of GFS, orchestrating all metadata operations and coordinating the distributed system. What makes GFS’s master design remarkable is how much sophistication is packed into a single process without it becoming the bottleneck critics predicted. The namespace locking strategy alone — using fine-grained, per-path read-write locks instead of traditional directory-level locks — was a significant innovation that enabled thousands of concurrent file creations in the same directory. This chapter explores how the master manages the namespace, allocates chunks, grants leases, places replicas, and maintains system health through garbage collection. These mechanisms are directly relevant to modern system design: lease-based coordination appears in etcd and ZooKeeper, replica placement strategies inform Kafka broker assignment, and lazy garbage collection is the standard approach in systems from Go’s runtime to distributed databases.
Chapter Goals:
  • Understand namespace management with coarse-grained locking
  • Learn chunk lease mechanism for consistency
  • Explore replica placement strategies
  • Master garbage collection techniques
  • Grasp master fault tolerance mechanisms

Namespace Management

Unlike traditional file systems with per-directory data structures (where each directory is essentially an inode pointing to a list of child entries), GFS uses a flat namespace with efficient path-to-metadata lookups. This design choice eliminated the “directory inode bottleneck” that plagued concurrent access in POSIX file systems. In a traditional file system, creating a file requires locking the parent directory’s inode, which serializes all operations in the same directory. GFS’s approach of locking individual path components independently enabled massive parallelism — a property that was essential for supporting thousands of MapReduce tasks writing output simultaneously.

Namespace Structure

GFS NAMESPACE DESIGN
────────────────────

Traditional FS:           GFS:
──────────────           ─────

Directory tree:          Flat lookup table:
/                        ┌─────────────────────────────────┐
├── home                 │ Full Path → Metadata            │
│   ├── user1            ├─────────────────────────────────┤
│   └── user2            │ /home → {dir metadata}          │
└── data                 │ /home/user1 → {dir metadata}    │
    └── files            │ /data/files/log1 → {file meta}  │
                         └─────────────────────────────────┘

Each dir has            Single prefix-compressed
list of children        table (efficient)

Problem:                Benefit:
- Tree traversal        - O(1) lookup
- Lock contention       - Fine-grained locking
                        - Parallel operations


IMPLEMENTATION:
──────────────

Data Structure: Prefix-compressed lookup table

Example:
────────
/home/alice/data/file1.txt
/home/alice/data/file2.txt
/home/bob/logs/access.log

Stored as:
──────────
Common prefixes compressed:

Efficient memory usage
Fast lookups
Support for snapshots

Coarse-Grained Locking

GFS uses namespace locks to allow concurrent operations:
Read-Write Locks on Paths:
NAMESPACE LOCKING STRATEGY
─────────────────────────

Each namespace node (file/directory) has:
• One read-write lock

Lock Granularity: Per full pathname

Example Operations:
──────────────────

1. Create /home/user1/file1
   Locks acquired:
   • /home          → Read lock
   • /home/user1    → Read lock
   • /home/user1/file1 → Write lock

2. Delete /home/user2
   Locks acquired:
   • /home          → Read lock
   • /home/user2    → Write lock

3. Snapshot /home/user1 → /save/snapshot1
   Locks acquired:
   • /home          → Read lock
   • /home/user1    → Write lock (prevent changes)
   • /save          → Read lock
   • /save/snapshot1 → Write lock


KEY INSIGHT: No parent directory locks needed!
────────────────────────────────────────────

Traditional FS:
- Need parent directory lock to modify children
- Serialization bottleneck

GFS:
- Lock full path only
- Parallel operations in same directory
- Scales beautifully

Metadata Storage

The master keeps three types of metadata, all in memory:

File Namespace

Directory and File Names
  • Full pathname → metadata
  • Persistent (operation log)
  • Modification times
  • Owner/permissions
  • List of chunk handles

Chunk Metadata

Chunk Handle → Info
  • Chunk version number
  • List of replica locations
  • Primary (if leased)
  • Lease expiration
  • Partially persistent

Server State

Chunkserver Info
  • Available disk space
  • Current load (CPU, I/O)
  • Chunks stored
  • Last heartbeat time
  • Not persistent
METADATA MEMORY LAYOUT
─────────────────────

For 100 million chunks (several petabytes):

File Namespace:
──────────────
• ~10 million files
• ~100 bytes per file
• Total: ~1 GB

Chunk Metadata:
──────────────
• 100 million chunks
• ~64 bytes per chunk
  - Chunk handle (8 bytes)
  - Version number (8 bytes)
  - Replica locations (3-5 locations × 8 bytes)
  - Lease info (16 bytes)
• Total: ~6 GB

Chunkserver State:
─────────────────
• ~1000 chunkservers
• ~1 KB per server
• Total: ~1 MB

TOTAL: < 10 GB (easily fits in RAM of modern server)

Benefits:
────────
✓ Fast operations (no disk I/O)
✓ Simple consistency
✓ Global view for decisions
✓ Easy to scan entire namespace

Chunk Lease Mechanism

Leases are central to GFS’s consistency model, enabling mutations without the overhead of distributed consensus. The lease mechanism is one of GFS’s most elegant design decisions and one of the most frequently asked-about concepts in system design interviews. The core insight is this: instead of running an expensive consensus protocol (like Paxos) for every single write operation, the master delegates authority to a “primary” chunkserver for a bounded time period. During that lease window, the primary makes all serialization decisions unilaterally. This converts the cost of distributed coordination from per-operation to per-lease-grant — a dramatic reduction. Today, this pattern is ubiquitous: Kafka uses leader epochs, Spanner uses leader leases, and even distributed lock services like etcd use lease-based TTLs for the same reason.

How Leases Work

1

Master Grants Lease

LEASE GRANT PROCESS
──────────────────

Client requests write to chunk X

Master checks:
1. Does chunk X have current primary?
   NO → Select one replica as primary

2. Grant lease:
   ┌──────────────────────────────┐
   │ Chunk: X                     │
   │ Primary: chunkserver-5       │
   │ Lease expiration: T+60sec    │
   │ Version: 7                   │
   └──────────────────────────────┘

3. Send to chunkserver-5:
   "You are primary for chunk X until T+60"

4. Increment chunk version to 8

5. Return to client:
   Primary: cs-5
   Secondaries: [cs-2, cs-9]
   Version: 8
2

Primary Orders Mutations

PRIMARY ROLE
───────────

While lease held, primary has authority:

Multiple clients send writes:

Client A: Write(offset=1000, len=100)
Client B: Append(len=200)
Client C: Write(offset=5000, len=50)

Primary receives all three:
─────────────────────────

1. Assign serial numbers:
   A → Serial #1
   B → Serial #2
   C → Serial #3

2. Apply to local chunk in order:
   #1: Write at 1000
   #2: Append at end
   #3: Write at 5000

3. Send to secondaries:
   "Apply operations in order: #1, #2, #3"

4. Wait for ACKs

5. Reply to clients

Result: Consistent ordering across all replicas!
3

Lease Renewal

LEASE LIFECYCLE
──────────────

Initial lease: 60 seconds

Renewal (if writes ongoing):
────────────────────────────

Every ~10-15 seconds:

Primary → Master:
  "Heartbeat: Still active, renew lease for chunk X"

Master → Primary:
  "Lease extended to T+60"

This continues while writes happen

Expiration:
──────────

No renewal for 60 seconds?
→ Lease expires
→ Master can grant to different replica
→ Or same replica (if still alive)


REVOCATION:
──────────

Master needs to revoke lease:
(e.g., client requested chunk deletion)

1. Master sends revoke message to primary
2. Primary stops accepting new mutations
3. Completes in-flight operations
4. ACKs revocation to master
5. Master can now delete chunk

If primary unreachable?
→ Wait for 60sec expiration
→ Then proceed

Lease Benefits

WHY LEASES?
──────────

Problem Without Leases:
───────────────────────

Distributed consensus for every write:

Client → Replica 1 ┐
Client → Replica 2 ┼→ Agree on order? (Paxos/Raft)
Client → Replica 3 ┘   Slow! Complex!

For each write:
- Multiple round trips
- Consensus protocol
- Failure handling
→ 10-100ms latency per operation


With Leases:
───────────

One-time setup (lease grant):
Master → Primary: "You order operations for 60sec"

Then for each write:
Client → Primary: "Write this"
Primary → Secondaries: "Apply in this order"
→ ~1ms latency per operation

Benefits:
────────
✓ No distributed consensus per operation
✓ Primary makes serialization decision
✓ Low latency writes
✓ Simple protocol
✓ Automatic timeout (no need for perfect failure detection)


LEASE VS LOCK:
─────────────

Lock:
- Must explicitly release
- Failure? → Deadlock or complex recovery

Lease:
- Automatic timeout
- Failure? → Expires naturally
- Master can wait or revoke
- No distributed state

Snapshot Mechanics (Copy-on-Write)

Snapshots in GFS are nearly instantaneous and allow users to create a copy of a file or a directory tree without copying data. This capability was critical for production operations: teams could snapshot an entire dataset before running a risky MapReduce job, providing a cheap rollback mechanism. The technique GFS uses — Copy-on-Write (CoW) at the chunk level — is the same fundamental mechanism used by Linux’s fork() system call, by ZFS and Btrfs file system snapshots, and by container image layers in Docker. Understanding CoW in GFS gives you the mental model for all of these systems.

How it Works (The Metadata-Only Copy)

GFS uses Copy-on-Write (CoW) at the chunk level to implement snapshots efficiently.
  1. Lease Revocation: When the master receives a snapshot request, it first revokes any outstanding leases on the chunks of the files to be snapshotted. This ensures that any subsequent writes will require a new lease, giving the master a chance to intercept and create a copy.
  2. Log & Copy Metadata: The master logs the snapshot operation to disk. It then applies the operation to its in-memory state by duplicating the metadata for the file or directory. The new “snapshot” file points to the same chunk handles as the original.
  3. Reference Counting: Each chunk handle now has a reference count > 1.

The First Write After Snapshot

When a client wants to write to a chunk that has been snapshotted:
1. Client asks Master for lease: "Write to Chunk C (ref_count=2)"
2. Master sees ref_count > 1:
   a) Pick new chunk handle C'
   b) Tell chunkservers holding C: "Copy C to C' locally"
   c) Update metadata: Current file now points to C' (ref_count=1)
   d) Snapshot still points to C (ref_count=1)
3. Master grants lease for C' to client.
4. Client writes to C' normally.
Key Benefit: The initial snapshot is just a metadata operation (copying pointers). The actual data copying is deferred until a write occurs, and only for the specific chunks being modified.

Replica Placement

The master decides where to place chunk replicas, optimizing for reliability, bandwidth, and load balancing.

Placement Goals

Survive Multiple Failures:
FAILURE DOMAIN HIERARCHY
───────────────────────

Datacenter
├── Rack 1
│   ├── Switch 1A
│   │   ├── Machine 1
│   │   ├── Machine 2
│   │   └── Machine 3
│   └── Switch 1B
│       ├── Machine 4
│       └── Machine 5
└── Rack 2
    └── Switch 2A
        ├── Machine 6
        └── Machine 7

Failure Scenarios:
─────────────────

1. Disk failure: One machine
   Probability: High (daily)

2. Machine failure: One machine
   Probability: Medium (weekly)

3. Switch failure: All machines under switch
   Probability: Low (monthly)

4. Rack failure: Power/cooling/network
   Probability: Very low (yearly)


PLACEMENT STRATEGY:
──────────────────

Bad: All replicas on same rack
┌────────────────────────────┐
│ Rack 1                     │
│  Replica 1 (machine 1)     │
│  Replica 2 (machine 2)     │
│  Replica 3 (machine 3)     │
└────────────────────────────┘
→ Rack failure = Data loss!

Good: Replicas across racks
┌────────────┐  ┌────────────┐
│ Rack 1     │  │ Rack 2     │
│ Replica 1  │  │ Replica 3  │
│ Replica 2  │  │            │
└────────────┘  └────────────┘
→ Survives rack failure
→ 2 in one rack for intra-rack bandwidth
Network Topology Awareness:
NETWORK BANDWIDTH HIERARCHY
──────────────────────────

Within machine:      1000+ MB/s (disk)
Within rack:         100-1000 MB/s (1-10 Gbps)
Cross-rack:          10-100 MB/s (limited by uplink)
Cross-datacenter:    1-10 MB/s (WAN)


READ OPTIMIZATION:
─────────────────

Client wants to read chunk X
Replicas: [rack1-machine5, rack2-machine3, rack3-machine1]

Client is in rack1

Prefer rack1-machine5 (100-1000 MB/s)
Over rack2-machine3 (10-100 MB/s)

→ 10x faster read!


WRITE OPTIMIZATION:
──────────────────

Chain replicas by network distance:

Client in rack1 writes:
───────────────────────

Bad chaining:
Client → Rack3 → Rack1 → Rack2
(cross-rack × 3 = slow)

Good chaining:
Client → Rack1 → Rack2 → Rack3
(one intra-rack, then fan out)

→ Exploits full network bandwidth
→ Pipelined transfers
Distribute Storage and I/O:
LOAD BALANCING FACTORS
──────────────────────

Master tracks for each chunkserver:

1. Disk space utilization
   ┌─────────────────────────┐
   │ cs-1: 45% full          │
   │ cs-2: 89% full ← Avoid  │
   │ cs-3: 23% full ← Prefer │
   └─────────────────────────┘

2. Recent write load
   ┌─────────────────────────┐
   │ cs-1: 50 writes/min     │
   │ cs-2: 200 writes/min ←  │
   │ cs-3: 10 writes/min     │
   └─────────────────────────┘

3. Number of chunks stored
   (proxy for future load)

4. Recent chunk creations
   (newly created chunks → writes soon)


PLACEMENT ALGORITHM:
───────────────────

When creating new chunk replica:

1. Filter candidates:
   • Enough free space (> chunk size)
   • Not in same rack as existing replica
   • Below average load

2. Sort by:
   • Disk space (prefer emptier)
   • Recent chunk creations (prefer fewer)

3. Select top candidate

4. Update tracking info


REBALANCING:
───────────

Background process (periodic):

Detect imbalance:
- cs-2: 90% full
- cs-3: 20% full
- Difference > threshold

Action:
→ Move some chunks from cs-2 to cs-3
→ Gradual process (limit impact)
→ Respect rack diversity

Chunk Creation

CHUNK CREATION WORKFLOW
──────────────────────

Scenario: Client creates new file and writes first chunk

1. CLIENT CREATES FILE
   ──────────────────

   Client: "Create file /data/logs/2003-10-15"
   Master:
     a) Create namespace entry
     b) Log operation to disk
     c) Return success

2. CLIENT WRITES DATA
   ──────────────────

   Client: "Write to /data/logs/2003-10-15, offset 0, 1MB"
   Master:
     a) File has no chunks yet
     b) Need to create chunk 0

3. MASTER ALLOCATES CHUNK
   ──────────────────────

   a) Generate chunk handle (globally unique 64-bit)
      chunk_handle = 0x1a2b3c4d5e6f7g8h

   b) Select 3 chunkservers:

      Available servers: [cs-1, cs-2, ..., cs-50]

      Filter:
      • Enough disk space
      • Below average utilization
      • Rack diversity

      Selected: [cs-5, cs-12, cs-23]
      (cs-5 and cs-12 in rack-1, cs-23 in rack-2)

   c) Designate primary (lowest load)
      Primary: cs-5

   d) Grant lease to cs-5 (60 seconds)

   e) Increment version to 1

   f) Log operation:
      "CHUNK_CREATE: handle=0x1a2b, file=/data/logs/2003-10-15,
       index=0, version=1, locations=[cs-5, cs-12, cs-23]"

4. MASTER CONTACTS CHUNKSERVERS
   ────────────────────────────

   Master → cs-5, cs-12, cs-23:
     "Create chunk 0x1a2b, version 1"

   Each chunkserver:
     a) Create empty file /gfs/chunks/0x1a2b3c4d5e6f7g8h
     b) Initialize version to 1
     c) ACK to master

5. MASTER RETURNS TO CLIENT
   ────────────────────────

   Master → Client:
     "Chunk 0 metadata:
      handle: 0x1a2b3c4d5e6f7g8h
      primary: cs-5
      secondaries: [cs-12, cs-23]
      version: 1
      lease_expiration: T+60"

6. CLIENT WRITES DATA
   ──────────────────

   (Normal write flow as described in Chapter 2)

Re-replication

When replicas fall below target count (e.g., server failure), master re-replicates:
RE-REPLICATION WORKFLOW
──────────────────────

Trigger: Chunkserver cs-12 fails

1. DETECTION
   ─────────

   Master heartbeat timeout:
   • cs-12 hasn't responded in 10+ seconds
   • Mark cs-12 as dead
   • Scan all chunks on cs-12

2. IDENTIFY UNDER-REPLICATED CHUNKS
   ────────────────────────────────

   Chunks on cs-12:
   ┌─────────────────────────────────────┐
   │ Chunk 0x1a2b: [cs-5, cs-12, cs-23]  │
   │   → Now only 2 replicas (cs-5, cs-23) │
   │   → Need re-replication!            │
   │                                     │
   │ Chunk 0x2c3d: [cs-7, cs-12, cs-15]  │
   │   → Now only 2 replicas             │
   │   → Need re-replication!            │
   └─────────────────────────────────────┘

3. PRIORITIZATION
   ──────────────

   Sort chunks by priority:

   Priority factors:
   • Current replication count (1 > 2 > 3)
   • Recently accessed (hot chunks first)
   • Blocks client writes (highest priority)

   Example order:
   1. Chunk 0x9f8e: 1 replica (URGENT!)
   2. Chunk 0x7d6c: 2 replicas, blocking write
   3. Chunk 0x1a2b: 2 replicas, recently read
   4. Chunk 0x5b4a: 2 replicas, cold data

4. SELECT SOURCE AND DESTINATION
   ──────────────────────────────

   For chunk 0x1a2b:

   Existing replicas: [cs-5, cs-23]

   Select source: cs-5 (lowest load)

   Select destination:
   • Filter: Not in same rack as cs-5 or cs-23
   • Prefer: Low disk usage, low load
   • Result: cs-31 (in rack-3)

5. ISSUE RE-REPLICATION
   ────────────────────

   Master → cs-31:
     "Copy chunk 0x1a2b from cs-5, version 1"

   cs-31 → cs-5:
     "Send me chunk 0x1a2b"

   cs-5 → cs-31:
     [chunk data, 64MB]

   cs-31:
     • Writes to local disk
     • Verifies checksums
     • ACKs to master

6. UPDATE METADATA
   ───────────────

   Master updates chunk 0x1a2b:
   Locations: [cs-5, cs-23, cs-31]

   Chunk now has 3 replicas again ✓


RATE LIMITING:
─────────────

Master limits re-replication rate:
• Max concurrent re-replications: 50-100
• Avoids overwhelming network
• Background process, not urgent
• Gradually restores replication

Garbage Collection

GFS uses lazy garbage collection instead of immediate deletion.

Why Lazy Deletion?

Benefits

Advantages of Lazy GC:
  • Simple implementation
  • Batched operations
  • No complex distributed deletion
  • Can recover from accidental deletes
  • Spreads I/O load over time
  • Handles failures gracefully

Trade-offs

Considerations:
  • Storage not freed immediately
  • May need manual cleanup for urgent cases
  • Requires background process
  • Deleted files visible briefly
  • Not suitable for quota systems

Garbage Collection Process

FILE DELETION WORKFLOW
─────────────────────

1. CLIENT DELETES FILE
   ──────────────────

   Client: "Delete /data/old/file1"

   Master:
   ┌──────────────────────────────────┐
   │ a) Rename file to hidden name:  │
   │    /data/old/file1               │
   │    →                             │
   │    /.trash/file1.deleted.T12345  │
   │                                  │
   │ b) Add deletion timestamp        │
   │                                  │
   │ c) Log operation                 │
   │                                  │
   │ d) Return success                │
   └──────────────────────────────────┘

   File still exists, just renamed!

   Client can immediately:
   • Delete again (no-op)
   • Undelete (rename back)

2. GRACE PERIOD
   ───────────

   Hidden file stays for N days (configurable, e.g., 3 days)

   During this time:
   • File not visible to normal operations
   • Chunks still exist
   • Can be recovered if needed
   • Storage not freed

   After 3 days:
   → Master's background GC removes it

3. BACKGROUND SCAN (Master)
   ───────────────────────

   Periodic scan (e.g., every few hours):

   for each file in /.trash/:
       if (current_time - deletion_time) > grace_period:
           # Permanently delete
           chunk_handles = file.get_chunks()
           for chunk in chunk_handles:
               remove_from_namespace(chunk)
           delete_namespace_entry(file)

   No rush, happens eventually

4. CHUNK REMOVAL
   ─────────────

   Deleted chunks marked in master memory:
   "Chunk 0x1a2b: DELETED"

   Not immediately removed from chunkservers
   → Next step handles this

Master Fault Tolerance

The master is replicated to ensure system availability:

Replication Strategy

MASTER REPLICATION
─────────────────

Primary Master:
┌──────────────────────────────────────┐
│ • Handles all operations             │
│ • In-memory metadata                 │
│ • Operation log (on disk)            │
└──────────────────────────────────────┘

           │ Replicates to:


┌──────────────────────────────────────┐
│ Shadow Masters (multiple)            │
│ • Read-only replicas                 │
│ • Lag slightly behind primary        │
│ • Can serve reads                    │
│ • Quick promotion on failure         │
└──────────────────────────────────────┘

           │ Operation log also on:


┌──────────────────────────────────────┐
│ Remote Disks (multiple)              │
│ • Different failure domains          │
│ • Different datacenters              │
│ • Disaster recovery                  │
└──────────────────────────────────────┘


OPERATION LOG REPLICATION:
─────────────────────────

Every metadata mutation:

1. Primary logs operation
2. Replicates to shadow masters
3. Replicates to remote disks
4. Waits for ACKs from ALL
5. Then applies to memory
6. Then returns success to client

Guarantee: If client sees success, operation is durable


SHADOW MASTER OPERATION:
───────────────────────

Shadow continuously:
• Reads operation log from primary
• Replays operations locally
• Builds same in-memory state
• Lags by ~seconds

Can serve:
• Read-only operations
• Metadata queries
• File listings

Cannot serve:
• Writes (redirects to primary)
• Lease grants
• Chunk creation


FAILOVER:
────────

Primary master fails:
1. Monitoring detects failure (<10 sec)
2. External system promotes shadow
3. Shadow becomes primary
4. Updates DNS/configuration
5. Clients redirect to new primary

Downtime: ~1 minute (conservative)

Data loss: None (operation log replicated)

Fast Recovery

1

Checkpoint Loading

Master starts (after crash or planned restart)

1. Load latest checkpoint from disk
   • Compact B-tree format
   • Contains full namespace
   • Loads in 1-2 seconds
   • Typically < 1 GB

2. Result:
   • Full file namespace in memory
   • Chunk handle mappings
   • Version numbers
2

Log Replay

Replay operation log since checkpoint

1. Read log file
2. Apply each operation in order:
   • File creates
   • Chunk allocations
   • Version updates
   • Deletions

3. Typically few seconds (100K ops/sec)

4. Result: Master has current namespace state
3

Chunk Location Discovery

Master doesn't persist chunk locations
(Chunkservers are source of truth)

1. Master sends to all chunkservers:
   "What chunks do you have?"

2. Each chunkserver replies:
   "I have: [chunk1, chunk2, ..., chunkN]
    With versions: [v1, v2, ..., vN]"

3. Master builds location map in memory

4. Identifies stale replicas (old versions)

5. Marks for garbage collection

Typically completes in 10-30 seconds
4

Resume Operations

Master is now ready!

Total recovery time: < 1 minute

1. Checkpoint load: 1-2 sec
2. Log replay: 5-10 sec
3. Chunk poll: 10-30 sec
4. Ready: 30-60 sec total

Meanwhile:
• Shadow masters can serve reads
• System remains partially available
• No data loss

Interview Questions

Expected Answer:GFS uses lazy garbage collection instead of immediate deletion for several reasons:
  1. Simplicity: No complex distributed deletion protocol needed
  2. Recovery: Accidental deletions can be recovered during grace period (e.g., 3 days)
  3. Batch Operations: Deletions batched together, reducing overhead
  4. Failure Handling: If deletion message lost, chunk collected eventually anyway
  5. Spread Load: I/O spread over time, not sudden burst
  6. Piggybacking: Uses existing heartbeat mechanism
The trade-off is that storage isn’t freed immediately, but for Google’s workload this was acceptable since storage was relatively cheap and safety was more important.Process: File deletion → rename to hidden → wait grace period → background GC removes → heartbeat informs chunkservers → chunkservers delete local chunks
Expected Answer:GFS uses fine-grained read-write locks on full pathnames to enable concurrent operations:How it works:
  • Each path (file or directory) has a read-write lock
  • Operations acquire locks on full paths, not just the target
  • Example: Creating /home/user1/file requires:
    • Read lock on /home
    • Read lock on /home/user1
    • Write lock on /home/user1/file
Benefits:
  • Multiple operations in same directory can proceed in parallel
  • Example: Creating /data/file1 and /data/file2 concurrently
  • Both acquire read lock on /data (shared)
  • Each acquires write lock on different file (no conflict)
Deadlock prevention:
  • Locks acquired in lexicographic order
  • Example: Operation needs /a/x and /b/y
  • Always acquire in sorted order: /a/x then /b/y
This design enables linear scaling with concurrent operations, unlike directory-level locking which serializes all operations in the same directory.
Expected Answer:The lease mechanism provides consistency without distributed consensus:Setup:
  1. Master grants 60-second lease to one replica (primary)
  2. Only primary can order mutations during lease period
  3. Primary identity cached by clients
Write Process:
  1. Data pushed to all replicas (in memory, not applied)
  2. Client sends write request to primary
  3. Primary assigns serial number (ordering)
  4. Primary applies to local disk
  5. Primary sends order to secondaries
  6. Secondaries apply in same order
  7. All ACK to primary
  8. Primary ACKs to client
Consistency Guarantee:
  • All replicas apply mutations in same order (serialized by primary)
  • Same serial numbers → same state
  • No distributed consensus needed
  • Primary has authority during lease
Failure Handling:
  • Lease timeout (60 sec) ensures master can regain control
  • If primary fails, lease expires naturally
  • Master can grant new lease to different replica
  • No need for perfect failure detection
Why it works:
  • Single authority (primary) during lease
  • Time-bounded authority (60 sec)
  • Master retains ultimate control
  • Simple protocol, high performance
Expected Answer:Several approaches to scale beyond single master:1. Metadata Sharding (like Colossus):
  • Partition namespace by path prefix
  • Multiple master shards, each handles subset
  • Example: Master1 handles /data/*, Master2 handles /logs/*
  • Benefits: Scales metadata capacity and throughput
  • Challenges: Cross-shard operations, rebalancing
2. Hierarchical Masters:
  • Root master coordinates multiple sub-masters
  • Each sub-master handles subset of chunkservers
  • Root handles namespace, sub-masters handle chunks
  • Benefits: Scales chunk management
  • Challenges: Two-level hierarchy complexity
3. Client-Side Metadata Caching:
  • Aggressive client caching with long timeouts
  • Lease-based cache consistency
  • Master only for cache misses
  • Benefits: Reduces master load dramatically
  • Challenges: Consistency protocol more complex
4. Metadata Distribution:
  • Distribute master state using consensus (Paxos/Raft)
  • Read from any replica, write to leader
  • Benefits: High availability, read scalability
  • Challenges: Write latency, consistency overhead
Real-world Evolution:
  • Colossus (GFS successor) uses metadata sharding
  • HDFS Federation uses multiple namenodes
  • Both prove that single master can be overcome while maintaining simplicity where possible

Stale Replica Detection (Version Numbers)

In a distributed system, some replicas may miss updates (e.g., if a chunkserver crashes during a write). GFS uses Chunk Version Numbers to distinguish between up-to-date and stale replicas.

The Versioning Protocol

  1. Lease Granting: Before granting a lease, the master increments the chunk’s version number in its persistent metadata.
  2. Propagation: The master notifies the primary and all secondaries of the new version number.
  3. Persisting: Both the master and the chunkservers record the new version number on their respective persistent disks before the mutation starts.
  4. Stale Check: If a chunkserver was down during the update, it will still have the old version number.

How the Master Uses Versions

  • During Heartbeats: Chunkservers report their (handle, version) pairs. If the master sees a version V<VcurrentV < V_{current}, it knows the replica is stale.
  • Garbage Collection: Stale replicas are immediately scheduled for garbage collection.
  • Client Requests: When a client asks for chunk locations, the master never returns a stale replica, ensuring the client only sees current data.

Key Takeaways

Master Operations Summary:
  1. Namespace Locking: Fine-grained path-based locks enable parallel operations
  2. Leases for Consistency: Time-bounded primary authority avoids distributed consensus
  3. Replica Placement: Rack-aware placement balances reliability and performance
  4. Lazy Garbage Collection: Simple, safe deletion with recovery window
  5. In-Memory Metadata: Fast operations, simple consistency, small memory footprint
  6. Master Replication: Operation log replication ensures durability and fast recovery
  7. Background Processes: GC, re-replication, balancing happen asynchronously
  8. Version Numbers: Detect stale replicas reliably without complex protocols

Up Next

In Chapter 4: Chunkservers & Data Flow, we’ll explore:
  • How chunkservers store and manage chunks
  • Detailed read, write, and record append flows
  • Data integrity mechanisms with checksums
  • Replication pipeline and optimization
  • Handling chunkserver failures
The master orchestrates the system—now we’ll see how chunkservers execute the actual data operations.

Interview Deep-Dive

Strong Answer:When a file is deleted in GFS, the master does not immediately reclaim the chunks. Instead, it renames the file to a hidden name with a deletion timestamp. A background process periodically scans for hidden files older than a configurable threshold (default three days) and removes their metadata. Orphaned chunks (chunks with no file reference) are discovered during regular chunk report exchanges with chunkservers and reclaimed.Lazy deletion is better than immediate deletion for three reasons. First, safety: accidental deletions can be undone within the grace period by simply renaming the hidden file back. This is enormously valuable in production — at Google scale, operator error is a constant risk. Second, simplicity: the garbage collector runs as a single background sweep, which is far simpler than coordinating immediate deletion across three replicas on different chunkservers. If one replica is temporarily unreachable during an immediate delete, you need complex retry logic. With lazy GC, the unreachable chunkserver simply reports the orphaned chunk on its next heartbeat, and the master tells it to delete. Third, batching: the GC can coalesce many deletions into efficient batch operations, reducing the metadata operation rate on the master.The trade-off is storage reclamation latency. Deleted files continue to consume disk space for up to three days. For clusters with tight storage budgets, this delay can be problematic. GFS allows tuning the grace period down for specific namespaces.Follow-up: How does this compare to garbage collection in modern object stores like S3?S3 uses a similar lazy approach internally, though the semantics exposed to users are different. When you delete an S3 object, the metadata is removed immediately from the API perspective, but the underlying storage blocks may not be reclaimed instantly. S3 also offers versioning, which is conceptually similar to GFS hidden-file grace period — deleted objects are retained as non-current versions until a lifecycle policy removes them. The key insight that carries across both systems is that in distributed storage, lazy reclamation is almost always the right default because it decouples the fast path (metadata update) from the slow path (physical storage reclamation).
Strong Answer:In a traditional POSIX file system, creating a file requires locking the parent directory inode. If you have 1,000 MapReduce tasks all writing output files to the same directory simultaneously, they serialize on that directory lock. This creates a bottleneck that limits parallelism.GFS uses a flat namespace with per-path read-write locks. To create /data/logs/file1, the master acquires read locks on /data and /data/logs (to prevent them from being deleted), and a write lock only on /data/logs/file1 itself. Creating /data/logs/file2 simultaneously requires read locks on the same parents (which are shared with file1) and a write lock on /data/logs/file2. Since the write locks are on different paths, both operations proceed in parallel.This enables massive concurrency. In a MapReduce job with 10,000 map tasks writing output to the same directory, all 10,000 file creations can proceed concurrently because they only contend on shared read locks for the parent paths. Deadlocks are prevented by always acquiring locks in lexicographic order of the full pathname.Follow-up: Can you think of a scenario where this locking scheme could cause problems?Yes — directory-level operations like rename or snapshot. If you snapshot /data/logs, you need a write lock on /data/logs to prevent any modifications during the snapshot. This blocks all concurrent file creations in that directory. In practice, GFS handles this by making snapshots a rare, operator-initiated operation, not something that happens during normal workload processing. The design is optimized for the common case (file creation and deletion) at the expense of rare operations (snapshots, directory renames). This is a classic engineering trade-off: optimize for the 99% case and accept higher cost for the 1% case.
Strong Answer:GFS places the first replica on the same machine as the writer (or a machine with below-average disk utilization if the writer is not a chunkserver). The second replica goes on a different rack. The third replica goes on a different machine in the same rack as the second. This strategy balances three concerns.Fault tolerance: by placing replicas across at least two racks, GFS survives an entire rack failure (power outage, top-of-rack switch failure) with at least one surviving replica. This is critical because rack-level failures are correlated — a single switch failure takes out every machine on the rack.Write performance: the pipelined replication sends data from the writer to the nearest replica first, which forwards it to the next nearest. Having two replicas on the same rack means the second-to-third hop is a fast intra-rack transfer rather than a slower cross-rack transfer. This reduces the total write latency.Read performance: for read-heavy files, having replicas on two different racks means clients in either rack can read locally, distributing read load across the network.The trade-off is that with two replicas on one rack, a rack failure leaves you with only one surviving replica. At that point, re-replication becomes urgent and is prioritized above all other chunk operations. The master tracks under-replicated chunks and prioritizes them by severity (1 replica remaining is more urgent than 2 replicas remaining).Follow-up: How would you change this placement strategy for a cluster spanning multiple data centers?I would add a fourth replica in a different data center for disaster recovery. The cross-datacenter replica would be asynchronously replicated (to avoid the latency penalty of synchronous cross-WAN writes) and used only for reads in the remote datacenter or for recovery after a datacenter-level failure. This is essentially what Google did when evolving toward Colossus and what modern systems like Spanner do with their multi-region configurations. The key design question is whether the cross-datacenter replica participates in the write quorum (stronger consistency, higher latency) or is replicated asynchronously (lower latency, risk of data loss during datacenter failure).