Chapter 3: Master Operations

The master is the brain of GFS, orchestrating all metadata operations and coordinating the distributed system. What makes GFS’s master design remarkable is how much sophistication is packed into a single process without it becoming the bottleneck critics predicted. The namespace locking strategy alone — using fine-grained, per-path read-write locks instead of traditional directory-level locks — was a significant innovation that enabled thousands of concurrent file creations in the same directory. This chapter explores how the master manages the namespace, allocates chunks, grants leases, places replicas, and maintains system health through garbage collection. These mechanisms are directly relevant to modern system design: lease-based coordination appears in etcd and ZooKeeper, replica placement strategies inform Kafka broker assignment, and lazy garbage collection is the standard approach in systems from Go’s runtime to distributed databases.

Chapter Goals:

Understand namespace management with coarse-grained locking
Learn chunk lease mechanism for consistency
Explore replica placement strategies
Master garbage collection techniques
Grasp master fault tolerance mechanisms

Namespace Management

Unlike traditional file systems with per-directory data structures (where each directory is essentially an inode pointing to a list of child entries), GFS uses a flat namespace with efficient path-to-metadata lookups. This design choice eliminated the “directory inode bottleneck” that plagued concurrent access in POSIX file systems. In a traditional file system, creating a file requires locking the parent directory’s inode, which serializes all operations in the same directory. GFS’s approach of locking individual path components independently enabled massive parallelism — a property that was essential for supporting thousands of MapReduce tasks writing output simultaneously.

Namespace Structure

GFS NAMESPACE DESIGN
────────────────────

Traditional FS:           GFS:
──────────────           ─────

Directory tree:          Flat lookup table:
/                        ┌─────────────────────────────────┐
├── home                 │ Full Path → Metadata            │
│   ├── user1            ├─────────────────────────────────┤
│   └── user2            │ /home → {dir metadata}          │
└── data                 │ /home/user1 → {dir metadata}    │
    └── files            │ /data/files/log1 → {file meta}  │
                         └─────────────────────────────────┘

Each dir has            Single prefix-compressed
list of children        table (efficient)

Problem:                Benefit:
- Tree traversal        - O(1) lookup
- Lock contention       - Fine-grained locking
                        - Parallel operations


IMPLEMENTATION:
──────────────

Data Structure: Prefix-compressed lookup table

Example:
────────
/home/alice/data/file1.txt
/home/alice/data/file2.txt
/home/bob/logs/access.log

Stored as:
──────────
Common prefixes compressed:
↓
Efficient memory usage
Fast lookups
Support for snapshots

Coarse-Grained Locking

GFS uses namespace locks to allow concurrent operations:

Lock Types
Concurrent Operations
Lock Ordering
Performance Impact

Read-Write Locks on Paths:

NAMESPACE LOCKING STRATEGY
─────────────────────────

Each namespace node (file/directory) has:
• One read-write lock

Lock Granularity: Per full pathname

Example Operations:
──────────────────

1. Create /home/user1/file1
   Locks acquired:
   • /home          → Read lock
   • /home/user1    → Read lock
   • /home/user1/file1 → Write lock

2. Delete /home/user2
   Locks acquired:
   • /home          → Read lock
   • /home/user2    → Write lock

3. Snapshot /home/user1 → /save/snapshot1
   Locks acquired:
   • /home          → Read lock
   • /home/user1    → Write lock (prevent changes)
   • /save          → Read lock
   • /save/snapshot1 → Write lock


KEY INSIGHT: No parent directory locks needed!
────────────────────────────────────────────

Traditional FS:
- Need parent directory lock to modify children
- Serialization bottleneck

GFS:
- Lock full path only
- Parallel operations in same directory
- Scales beautifully

Enabling Parallelism:

CONCURRENT OPERATIONS EXAMPLE
────────────────────────────

Scenario: Same directory, different files

Thread 1: Create /data/logs/file1
Thread 2: Create /data/logs/file2
Thread 3: Read /data/logs/file3

Lock Acquisition:
────────────────

Thread 1:
  /data      → Read lock ✓
  /data/logs → Read lock ✓
  /data/logs/file1 → Write lock ✓

Thread 2:
  /data      → Read lock ✓ (shared)
  /data/logs → Read lock ✓ (shared)
  /data/logs/file2 → Write lock ✓ (different file)

Thread 3:
  /data      → Read lock ✓ (shared)
  /data/logs → Read lock ✓ (shared)
  /data/logs/file3 → Read lock ✓

ALL THREE PROCEED IN PARALLEL!

Traditional approach:
─────────────────────
Would lock /data/logs directory
→ Serializes all operations
→ Poor scalability


CONFLICT SCENARIO:
─────────────────

Thread A: Delete /home/user1
Thread B: Create /home/user1/newfile

Thread A:
  /home → Read lock ✓
  /home/user1 → Write lock ✓

Thread B:
  /home → Read lock ✓
  /home/user1 → Read lock ✗ (BLOCKED)
  (waits for Thread A)

Correct serialization maintained!

Preventing Deadlocks:

LOCK ORDERING PROTOCOL
─────────────────────

Rule: Always acquire locks in consistent order

Order: Lexicographic order by full pathname

Example:
────────

Operation: Rename /a/x to /b/y

Incorrect:
  /a/x lock (write)
  /b/y lock (write)
  → If another thread does /b/... first → DEADLOCK!

Correct:
  Sort: [/a, /a/x, /b, /b/y]
  Acquire in order:
    /a   → read
    /a/x → write
    /b   → read
    /b/y → write

Deadlock impossible!


IMPLEMENTATION:
──────────────

Algorithm:
──────────
1. Collect all paths needed for operation
2. Sort lexicographically
3. Acquire locks in sorted order
4. Perform operation
5. Release all locks

Example with snapshot:
─────────────────────

Snapshot /home/alice to /backup/alice_snap

Paths needed:
- /home         (read)
- /home/alice   (write - prevent changes)
- /backup       (read)
- /backup/alice_snap (write)

Sorted order:
1. /backup (read)
2. /backup/alice_snap (write)
3. /home (read)
4. /home/alice (write)

Acquire in this order → No deadlock possible

Scalability Benefits:

PERFORMANCE COMPARISON
─────────────────────

Metric: Operations per second on /data/logs/

Traditional (directory-level locking):
──────────────────────────────────────
1 thread:  1000 creates/sec
2 threads: 1000 creates/sec (serialized!)
4 threads: 1000 creates/sec (serialized!)
8 threads: 1000 creates/sec (serialized!)

GFS (fine-grained path locking):
────────────────────────────────
1 thread:  1000 creates/sec
2 threads: 2000 creates/sec (parallel!)
4 threads: 4000 creates/sec (parallel!)
8 threads: 8000 creates/sec (parallel!)

→ Linear scaling with parallelism


REAL-WORLD IMPACT:
─────────────────

MapReduce with 1000 mappers:
• All write to /output/intermediate/
• Each creates file with unique name
• All proceed in parallel
• No lock contention!

Without fine-grained locking:
• Serialized on directory lock
• 1000x slower
• System unusable

Metadata Storage

The master keeps three types of metadata, all in memory:

File Namespace

Directory and File Names

Full pathname → metadata
Persistent (operation log)
Modification times
Owner/permissions
List of chunk handles

Chunk Metadata

Chunk Handle → Info

Chunk version number
List of replica locations
Primary (if leased)
Lease expiration
Partially persistent

Server State

Chunkserver Info

Available disk space
Current load (CPU, I/O)
Chunks stored
Last heartbeat time
Not persistent

METADATA MEMORY LAYOUT
─────────────────────

For 100 million chunks (several petabytes):

File Namespace:
──────────────
• ~10 million files
• ~100 bytes per file
• Total: ~1 GB

Chunk Metadata:
──────────────
• 100 million chunks
• ~64 bytes per chunk
  - Chunk handle (8 bytes)
  - Version number (8 bytes)
  - Replica locations (3-5 locations × 8 bytes)
  - Lease info (16 bytes)
• Total: ~6 GB

Chunkserver State:
─────────────────
• ~1000 chunkservers
• ~1 KB per server
• Total: ~1 MB

TOTAL: < 10 GB (easily fits in RAM of modern server)

Benefits:
────────
✓ Fast operations (no disk I/O)
✓ Simple consistency
✓ Global view for decisions
✓ Easy to scan entire namespace

Chunk Lease Mechanism

Leases are central to GFS’s consistency model, enabling mutations without the overhead of distributed consensus. The lease mechanism is one of GFS’s most elegant design decisions and one of the most frequently asked-about concepts in system design interviews. The core insight is this: instead of running an expensive consensus protocol (like Paxos) for every single write operation, the master delegates authority to a “primary” chunkserver for a bounded time period. During that lease window, the primary makes all serialization decisions unilaterally. This converts the cost of distributed coordination from per-operation to per-lease-grant — a dramatic reduction. Today, this pattern is ubiquitous: Kafka uses leader epochs, Spanner uses leader leases, and even distributed lock services like etcd use lease-based TTLs for the same reason.

How Leases Work

Master Grants Lease

LEASE GRANT PROCESS
──────────────────

Client requests write to chunk X

Master checks:
1. Does chunk X have current primary?
   NO → Select one replica as primary

2. Grant lease:
   ┌──────────────────────────────┐
   │ Chunk: X                     │
   │ Primary: chunkserver-5       │
   │ Lease expiration: T+60sec    │
   │ Version: 7                   │
   └──────────────────────────────┘

3. Send to chunkserver-5:
   "You are primary for chunk X until T+60"

4. Increment chunk version to 8

5. Return to client:
   Primary: cs-5
   Secondaries: [cs-2, cs-9]
   Version: 8

Primary Orders Mutations

PRIMARY ROLE
───────────

While lease held, primary has authority:

Multiple clients send writes:

Client A: Write(offset=1000, len=100)
Client B: Append(len=200)
Client C: Write(offset=5000, len=50)

Primary receives all three:
─────────────────────────

1. Assign serial numbers:
   A → Serial #1
   B → Serial #2
   C → Serial #3

2. Apply to local chunk in order:
   #1: Write at 1000
   #2: Append at end
   #3: Write at 5000

3. Send to secondaries:
   "Apply operations in order: #1, #2, #3"

4. Wait for ACKs

5. Reply to clients

Result: Consistent ordering across all replicas!

Lease Renewal

LEASE LIFECYCLE
──────────────

Initial lease: 60 seconds

Renewal (if writes ongoing):
────────────────────────────

Every ~10-15 seconds:

Primary → Master:
  "Heartbeat: Still active, renew lease for chunk X"

Master → Primary:
  "Lease extended to T+60"

This continues while writes happen

Expiration:
──────────

No renewal for 60 seconds?
→ Lease expires
→ Master can grant to different replica
→ Or same replica (if still alive)


REVOCATION:
──────────

Master needs to revoke lease:
(e.g., client requested chunk deletion)

1. Master sends revoke message to primary
2. Primary stops accepting new mutations
3. Completes in-flight operations
4. ACKs revocation to master
5. Master can now delete chunk

If primary unreachable?
→ Wait for 60sec expiration
→ Then proceed

Lease Benefits

WHY LEASES?
──────────

Problem Without Leases:
───────────────────────

Distributed consensus for every write:

Client → Replica 1 ┐
Client → Replica 2 ┼→ Agree on order? (Paxos/Raft)
Client → Replica 3 ┘   Slow! Complex!

For each write:
- Multiple round trips
- Consensus protocol
- Failure handling
→ 10-100ms latency per operation


With Leases:
───────────

One-time setup (lease grant):
Master → Primary: "You order operations for 60sec"

Then for each write:
Client → Primary: "Write this"
Primary → Secondaries: "Apply in this order"
→ ~1ms latency per operation

Benefits:
────────
✓ No distributed consensus per operation
✓ Primary makes serialization decision
✓ Low latency writes
✓ Simple protocol
✓ Automatic timeout (no need for perfect failure detection)


LEASE VS LOCK:
─────────────

Lock:
- Must explicitly release
- Failure? → Deadlock or complex recovery

Lease:
- Automatic timeout
- Failure? → Expires naturally
- Master can wait or revoke
- No distributed state

Snapshot Mechanics (Copy-on-Write)

Snapshots in GFS are nearly instantaneous and allow users to create a copy of a file or a directory tree without copying data. This capability was critical for production operations: teams could snapshot an entire dataset before running a risky MapReduce job, providing a cheap rollback mechanism. The technique GFS uses — Copy-on-Write (CoW) at the chunk level — is the same fundamental mechanism used by Linux’s fork() system call, by ZFS and Btrfs file system snapshots, and by container image layers in Docker. Understanding CoW in GFS gives you the mental model for all of these systems.

How it Works (The Metadata-Only Copy)

GFS uses Copy-on-Write (CoW) at the chunk level to implement snapshots efficiently.

Lease Revocation: When the master receives a snapshot request, it first revokes any outstanding leases on the chunks of the files to be snapshotted. This ensures that any subsequent writes will require a new lease, giving the master a chance to intercept and create a copy.
Log & Copy Metadata: The master logs the snapshot operation to disk. It then applies the operation to its in-memory state by duplicating the metadata for the file or directory. The new “snapshot” file points to the same chunk handles as the original.
Reference Counting: Each chunk handle now has a reference count > 1.

The First Write After Snapshot

When a client wants to write to a chunk that has been snapshotted:

1. Client asks Master for lease: "Write to Chunk C (ref_count=2)"
2. Master sees ref_count > 1:
   a) Pick new chunk handle C'
   b) Tell chunkservers holding C: "Copy C to C' locally"
   c) Update metadata: Current file now points to C' (ref_count=1)
   d) Snapshot still points to C (ref_count=1)
3. Master grants lease for C' to client.
4. Client writes to C' normally.

Key Benefit: The initial snapshot is just a metadata operation (copying pointers). The actual data copying is deferred until a write occurs, and only for the specific chunks being modified.

Replica Placement

The master decides where to place chunk replicas, optimizing for reliability, bandwidth, and load balancing.

Placement Goals

Maximize Reliability

Survive Multiple Failures:

FAILURE DOMAIN HIERARCHY
───────────────────────

Datacenter
├── Rack 1
│   ├── Switch 1A
│   │   ├── Machine 1
│   │   ├── Machine 2
│   │   └── Machine 3
│   └── Switch 1B
│       ├── Machine 4
│       └── Machine 5
└── Rack 2
    └── Switch 2A
        ├── Machine 6
        └── Machine 7

Failure Scenarios:
─────────────────

1. Disk failure: One machine
   Probability: High (daily)

2. Machine failure: One machine
   Probability: Medium (weekly)

3. Switch failure: All machines under switch
   Probability: Low (monthly)

4. Rack failure: Power/cooling/network
   Probability: Very low (yearly)


PLACEMENT STRATEGY:
──────────────────

Bad: All replicas on same rack
┌────────────────────────────┐
│ Rack 1                     │
│  Replica 1 (machine 1)     │
│  Replica 2 (machine 2)     │
│  Replica 3 (machine 3)     │
└────────────────────────────┘
→ Rack failure = Data loss!

Good: Replicas across racks
┌────────────┐  ┌────────────┐
│ Rack 1     │  │ Rack 2     │
│ Replica 1  │  │ Replica 3  │
│ Replica 2  │  │            │
└────────────┘  └────────────┘
→ Survives rack failure
→ 2 in one rack for intra-rack bandwidth

Maximize Bandwidth

Network Topology Awareness:

NETWORK BANDWIDTH HIERARCHY
──────────────────────────

Within machine:      1000+ MB/s (disk)
Within rack:         100-1000 MB/s (1-10 Gbps)
Cross-rack:          10-100 MB/s (limited by uplink)
Cross-datacenter:    1-10 MB/s (WAN)


READ OPTIMIZATION:
─────────────────

Client wants to read chunk X
Replicas: [rack1-machine5, rack2-machine3, rack3-machine1]

Client is in rack1
↓
Prefer rack1-machine5 (100-1000 MB/s)
Over rack2-machine3 (10-100 MB/s)

→ 10x faster read!


WRITE OPTIMIZATION:
──────────────────

Chain replicas by network distance:

Client in rack1 writes:
───────────────────────

Bad chaining:
Client → Rack3 → Rack1 → Rack2
(cross-rack × 3 = slow)

Good chaining:
Client → Rack1 → Rack2 → Rack3
(one intra-rack, then fan out)

→ Exploits full network bandwidth
→ Pipelined transfers

Balance Load

Distribute Storage and I/O:

LOAD BALANCING FACTORS
──────────────────────

Master tracks for each chunkserver:

1. Disk space utilization
   ┌─────────────────────────┐
   │ cs-1: 45% full          │
   │ cs-2: 89% full ← Avoid  │
   │ cs-3: 23% full ← Prefer │
   └─────────────────────────┘

2. Recent write load
   ┌─────────────────────────┐
   │ cs-1: 50 writes/min     │
   │ cs-2: 200 writes/min ←  │
   │ cs-3: 10 writes/min     │
   └─────────────────────────┘

3. Number of chunks stored
   (proxy for future load)

4. Recent chunk creations
   (newly created chunks → writes soon)


PLACEMENT ALGORITHM:
───────────────────

When creating new chunk replica:

1. Filter candidates:
   • Enough free space (> chunk size)
   • Not in same rack as existing replica
   • Below average load

2. Sort by:
   • Disk space (prefer emptier)
   • Recent chunk creations (prefer fewer)

3. Select top candidate

4. Update tracking info


REBALANCING:
───────────

Background process (periodic):

Detect imbalance:
- cs-2: 90% full
- cs-3: 20% full
- Difference > threshold

Action:
→ Move some chunks from cs-2 to cs-3
→ Gradual process (limit impact)
→ Respect rack diversity

Chunk Creation

CHUNK CREATION WORKFLOW
──────────────────────

Scenario: Client creates new file and writes first chunk

1. CLIENT CREATES FILE
   ──────────────────

   Client: "Create file /data/logs/2003-10-15"
   Master:
     a) Create namespace entry
     b) Log operation to disk
     c) Return success

2. CLIENT WRITES DATA
   ──────────────────

   Client: "Write to /data/logs/2003-10-15, offset 0, 1MB"
   Master:
     a) File has no chunks yet
     b) Need to create chunk 0

3. MASTER ALLOCATES CHUNK
   ──────────────────────

   a) Generate chunk handle (globally unique 64-bit)
      chunk_handle = 0x1a2b3c4d5e6f7g8h

   b) Select 3 chunkservers:

      Available servers: [cs-1, cs-2, ..., cs-50]

      Filter:
      • Enough disk space
      • Below average utilization
      • Rack diversity

      Selected: [cs-5, cs-12, cs-23]
      (cs-5 and cs-12 in rack-1, cs-23 in rack-2)

   c) Designate primary (lowest load)
      Primary: cs-5

   d) Grant lease to cs-5 (60 seconds)

   e) Increment version to 1

   f) Log operation:
      "CHUNK_CREATE: handle=0x1a2b, file=/data/logs/2003-10-15,
       index=0, version=1, locations=[cs-5, cs-12, cs-23]"

4. MASTER CONTACTS CHUNKSERVERS
   ────────────────────────────

   Master → cs-5, cs-12, cs-23:
     "Create chunk 0x1a2b, version 1"

   Each chunkserver:
     a) Create empty file /gfs/chunks/0x1a2b3c4d5e6f7g8h
     b) Initialize version to 1
     c) ACK to master

5. MASTER RETURNS TO CLIENT
   ────────────────────────

   Master → Client:
     "Chunk 0 metadata:
      handle: 0x1a2b3c4d5e6f7g8h
      primary: cs-5
      secondaries: [cs-12, cs-23]
      version: 1
      lease_expiration: T+60"

6. CLIENT WRITES DATA
   ──────────────────

   (Normal write flow as described in Chapter 2)

Re-replication

When replicas fall below target count (e.g., server failure), master re-replicates:

RE-REPLICATION WORKFLOW
──────────────────────

Trigger: Chunkserver cs-12 fails

1. DETECTION
   ─────────

   Master heartbeat timeout:
   • cs-12 hasn't responded in 10+ seconds
   • Mark cs-12 as dead
   • Scan all chunks on cs-12

2. IDENTIFY UNDER-REPLICATED CHUNKS
   ────────────────────────────────

   Chunks on cs-12:
   ┌─────────────────────────────────────┐
   │ Chunk 0x1a2b: [cs-5, cs-12, cs-23]  │
   │   → Now only 2 replicas (cs-5, cs-23) │
   │   → Need re-replication!            │
   │                                     │
   │ Chunk 0x2c3d: [cs-7, cs-12, cs-15]  │
   │   → Now only 2 replicas             │
   │   → Need re-replication!            │
   └─────────────────────────────────────┘

3. PRIORITIZATION
   ──────────────

   Sort chunks by priority:

   Priority factors:
   • Current replication count (1 > 2 > 3)
   • Recently accessed (hot chunks first)
   • Blocks client writes (highest priority)

   Example order:
   1. Chunk 0x9f8e: 1 replica (URGENT!)
   2. Chunk 0x7d6c: 2 replicas, blocking write
   3. Chunk 0x1a2b: 2 replicas, recently read
   4. Chunk 0x5b4a: 2 replicas, cold data

4. SELECT SOURCE AND DESTINATION
   ──────────────────────────────

   For chunk 0x1a2b:

   Existing replicas: [cs-5, cs-23]

   Select source: cs-5 (lowest load)

   Select destination:
   • Filter: Not in same rack as cs-5 or cs-23
   • Prefer: Low disk usage, low load
   • Result: cs-31 (in rack-3)

5. ISSUE RE-REPLICATION
   ────────────────────

   Master → cs-31:
     "Copy chunk 0x1a2b from cs-5, version 1"

   cs-31 → cs-5:
     "Send me chunk 0x1a2b"

   cs-5 → cs-31:
     [chunk data, 64MB]

   cs-31:
     • Writes to local disk
     • Verifies checksums
     • ACKs to master

6. UPDATE METADATA
   ───────────────

   Master updates chunk 0x1a2b:
   Locations: [cs-5, cs-23, cs-31]

   Chunk now has 3 replicas again ✓


RATE LIMITING:
─────────────

Master limits re-replication rate:
• Max concurrent re-replications: 50-100
• Avoids overwhelming network
• Background process, not urgent
• Gradually restores replication

Garbage Collection

GFS uses lazy garbage collection instead of immediate deletion.

Why Lazy Deletion?

Benefits

Advantages of Lazy GC:

Simple implementation
Batched operations
No complex distributed deletion
Can recover from accidental deletes
Spreads I/O load over time
Handles failures gracefully

Trade-offs

Considerations:

Storage not freed immediately
May need manual cleanup for urgent cases
Requires background process
Deleted files visible briefly
Not suitable for quota systems

Garbage Collection Process

File Deletion
Chunk Collection
Background Process

FILE DELETION WORKFLOW
─────────────────────

1. CLIENT DELETES FILE
   ──────────────────

   Client: "Delete /data/old/file1"

   Master:
   ┌──────────────────────────────────┐
   │ a) Rename file to hidden name:  │
   │    /data/old/file1               │
   │    →                             │
   │    /.trash/file1.deleted.T12345  │
   │                                  │
   │ b) Add deletion timestamp        │
   │                                  │
   │ c) Log operation                 │
   │                                  │
   │ d) Return success                │
   └──────────────────────────────────┘

   File still exists, just renamed!

   Client can immediately:
   • Delete again (no-op)
   • Undelete (rename back)

2. GRACE PERIOD
   ───────────

   Hidden file stays for N days (configurable, e.g., 3 days)

   During this time:
   • File not visible to normal operations
   • Chunks still exist
   • Can be recovered if needed
   • Storage not freed

   After 3 days:
   → Master's background GC removes it

3. BACKGROUND SCAN (Master)
   ───────────────────────

   Periodic scan (e.g., every few hours):

   for each file in /.trash/:
       if (current_time - deletion_time) > grace_period:
           # Permanently delete
           chunk_handles = file.get_chunks()
           for chunk in chunk_handles:
               remove_from_namespace(chunk)
           delete_namespace_entry(file)

   No rush, happens eventually

4. CHUNK REMOVAL
   ─────────────

   Deleted chunks marked in master memory:
   "Chunk 0x1a2b: DELETED"

   Not immediately removed from chunkservers
   → Next step handles this

CHUNK GARBAGE COLLECTION
───────────────────────

1. HEARTBEAT EXCHANGE
   ──────────────────

   Regular heartbeat (every ~30 seconds):

   Chunkserver → Master:
     "I have chunks: [0x1a2b, 0x2c3d, 0x3e4f, ...]"

   Master → Chunkserver:
     "Delete these: [0x5e6f, 0x7g8h]"

2. MASTER IDENTIFIES GARBAGE
   ─────────────────────────

   Master compares chunkserver's list with metadata:

   Chunkserver reports: [0x1a2b, 0x2c3d, 0x9z8y]

   Master's namespace has:
   • 0x1a2b ✓ (belongs to /data/file1)
   • 0x2c3d ✓ (belongs to /data/file2)
   • 0x9z8y ✗ (NOT in namespace)

   → 0x9z8y is orphaned/garbage

   Reasons for garbage:
   • File was deleted
   • Chunk creation failed partway
   • Master crash during operation
   • Stale replica from old version

3. CHUNKSERVER DELETES
   ───────────────────

   Chunkserver receives delete list:

   for chunk in delete_list:
       os.remove(f"/gfs/chunks/{chunk}")
       log("Deleted chunk {chunk}")

   Simple, local operation
   No coordination needed

4. STALE REPLICA DETECTION
   ───────────────────────

   Version numbers detect stale replicas:

   Chunk 0x1a2b, current version: 5

   Heartbeat report:
   cs-5:  "chunk 0x1a2b, version 5" ✓
   cs-12: "chunk 0x1a2b, version 5" ✓
   cs-23: "chunk 0x1a2b, version 4" ✗ (STALE!)

   Master → cs-23: "Delete 0x1a2b (stale)"

   How version increases:
   • Master grants new lease → version++
   • Chunkserver was down during lease grant
   • Chunkserver comes back with old version
   • Master detects and deletes stale replica

GC BACKGROUND PROCESS
────────────────────

Master runs continuous background tasks:

TASK 1: Namespace Scan
──────────────────────
Frequency: Every few hours
Purpose: Remove old deleted files

pseudocode:
──────────
def namespace_gc():
    for file in namespace:
        if file.is_hidden_deleted():
            age = now() - file.deletion_time
            if age > GRACE_PERIOD:
                permanently_delete(file)

Impact:
• Low priority
• Runs when master idle
• Rate limited (100s files/sec)


TASK 2: Chunk Accounting
────────────────────────
Frequency: Ongoing (via heartbeats)
Purpose: Identify orphaned chunks

pseudocode:
──────────
def on_heartbeat(chunkserver, chunk_list):
    for chunk in chunk_list:
        if chunk not in namespace:
            send_delete_command(chunkserver, chunk)
        elif chunk.version < expected_version:
            send_delete_command(chunkserver, chunk)

Impact:
• Immediate (next heartbeat)
• No extra overhead
• Piggybacked on heartbeat


TASK 3: Orphan Detection
────────────────────────
Frequency: Daily
Purpose: Find chunks that should exist but don't

pseudocode:
──────────
def orphan_detection():
    for file in namespace:
        for chunk in file.chunks:
            replicas = get_chunk_locations(chunk)
            if len(replicas) < REPLICATION_GOAL:
                schedule_replication(chunk)

Impact:
• Ensures all data replicated
• Catches missed failures
• Safety net


RATE LIMITING:
─────────────

All GC tasks are rate-limited:
• Don't overload master
• Don't saturate network
• Don't impact foreground ops

Configuration:
• Max deletions per heartbeat: 10
• Max namespace scans per sec: 100
• Max re-replications concurrent: 50

Master Fault Tolerance

The master is replicated to ensure system availability:

Replication Strategy

MASTER REPLICATION
─────────────────

Primary Master:
┌──────────────────────────────────────┐
│ • Handles all operations             │
│ • In-memory metadata                 │
│ • Operation log (on disk)            │
└──────────────────────────────────────┘
           │
           │ Replicates to:
           │
           ↓
┌──────────────────────────────────────┐
│ Shadow Masters (multiple)            │
│ • Read-only replicas                 │
│ • Lag slightly behind primary        │
│ • Can serve reads                    │
│ • Quick promotion on failure         │
└──────────────────────────────────────┘
           │
           │ Operation log also on:
           │
           ↓
┌──────────────────────────────────────┐
│ Remote Disks (multiple)              │
│ • Different failure domains          │
│ • Different datacenters              │
│ • Disaster recovery                  │
└──────────────────────────────────────┘


OPERATION LOG REPLICATION:
─────────────────────────

Every metadata mutation:

1. Primary logs operation
2. Replicates to shadow masters
3. Replicates to remote disks
4. Waits for ACKs from ALL
5. Then applies to memory
6. Then returns success to client

Guarantee: If client sees success, operation is durable


SHADOW MASTER OPERATION:
───────────────────────

Shadow continuously:
• Reads operation log from primary
• Replays operations locally
• Builds same in-memory state
• Lags by ~seconds

Can serve:
• Read-only operations
• Metadata queries
• File listings

Cannot serve:
• Writes (redirects to primary)
• Lease grants
• Chunk creation


FAILOVER:
────────

Primary master fails:
1. Monitoring detects failure (<10 sec)
2. External system promotes shadow
3. Shadow becomes primary
4. Updates DNS/configuration
5. Clients redirect to new primary

Downtime: ~1 minute (conservative)

Data loss: None (operation log replicated)

Fast Recovery

Checkpoint Loading

Master starts (after crash or planned restart)

1. Load latest checkpoint from disk
   • Compact B-tree format
   • Contains full namespace
   • Loads in 1-2 seconds
   • Typically < 1 GB

2. Result:
   • Full file namespace in memory
   • Chunk handle mappings
   • Version numbers

Log Replay

Replay operation log since checkpoint

1. Read log file
2. Apply each operation in order:
   • File creates
   • Chunk allocations
   • Version updates
   • Deletions

3. Typically few seconds (100K ops/sec)

4. Result: Master has current namespace state

Chunk Location Discovery

Master doesn't persist chunk locations
(Chunkservers are source of truth)

1. Master sends to all chunkservers:
   "What chunks do you have?"

2. Each chunkserver replies:
   "I have: [chunk1, chunk2, ..., chunkN]
    With versions: [v1, v2, ..., vN]"

3. Master builds location map in memory

4. Identifies stale replicas (old versions)

5. Marks for garbage collection

Typically completes in 10-30 seconds

Resume Operations

Master is now ready!

Total recovery time: < 1 minute

1. Checkpoint load: 1-2 sec
2. Log replay: 5-10 sec
3. Chunk poll: 10-30 sec
4. Ready: 30-60 sec total

Meanwhile:
• Shadow masters can serve reads
• System remains partially available
• No data loss

Interview Questions

Basic: Why does GFS use lazy garbage collection?

Expected Answer:GFS uses lazy garbage collection instead of immediate deletion for several reasons:

Simplicity: No complex distributed deletion protocol needed
Recovery: Accidental deletions can be recovered during grace period (e.g., 3 days)
Batch Operations: Deletions batched together, reducing overhead
Failure Handling: If deletion message lost, chunk collected eventually anyway
Spread Load: I/O spread over time, not sudden burst
Piggybacking: Uses existing heartbeat mechanism

The trade-off is that storage isn’t freed immediately, but for Google’s workload this was acceptable since storage was relatively cheap and safety was more important.Process: File deletion → rename to hidden → wait grace period → background GC removes → heartbeat informs chunkservers → chunkservers delete local chunks

Intermediate: Explain GFS's namespace locking mechanism

Expected Answer:GFS uses fine-grained read-write locks on full pathnames to enable concurrent operations:How it works:

Each path (file or directory) has a read-write lock
Operations acquire locks on full paths, not just the target
Example: Creating /home/user1/file requires:
- Read lock on /home
- Read lock on /home/user1
- Write lock on /home/user1/file

Benefits:

Multiple operations in same directory can proceed in parallel
Example: Creating /data/file1 and /data/file2 concurrently
Both acquire read lock on /data (shared)
Each acquires write lock on different file (no conflict)

Deadlock prevention:

Locks acquired in lexicographic order
Example: Operation needs /a/x and /b/y
Always acquire in sorted order: /a/x then /b/y

This design enables linear scaling with concurrent operations, unlike directory-level locking which serializes all operations in the same directory.

Advanced: How does the chunk lease mechanism ensure consistency?

Expected Answer:The lease mechanism provides consistency without distributed consensus:Setup:

Master grants 60-second lease to one replica (primary)
Only primary can order mutations during lease period
Primary identity cached by clients

Write Process:

Data pushed to all replicas (in memory, not applied)
Client sends write request to primary
Primary assigns serial number (ordering)
Primary applies to local disk
Primary sends order to secondaries
Secondaries apply in same order
All ACK to primary
Primary ACKs to client

Consistency Guarantee:

All replicas apply mutations in same order (serialized by primary)
Same serial numbers → same state
No distributed consensus needed
Primary has authority during lease

Failure Handling:

Lease timeout (60 sec) ensures master can regain control
If primary fails, lease expires naturally
Master can grant new lease to different replica
No need for perfect failure detection

Why it works:

Single authority (primary) during lease
Time-bounded authority (60 sec)
Master retains ultimate control
Simple protocol, high performance

System Design: How would you improve GFS's single master limitation?

Expected Answer:Several approaches to scale beyond single master:1. Metadata Sharding (like Colossus):

Partition namespace by path prefix
Multiple master shards, each handles subset
Example: Master1 handles /data/*, Master2 handles /logs/*
Benefits: Scales metadata capacity and throughput
Challenges: Cross-shard operations, rebalancing

2. Hierarchical Masters:

Root master coordinates multiple sub-masters
Each sub-master handles subset of chunkservers
Root handles namespace, sub-masters handle chunks
Benefits: Scales chunk management
Challenges: Two-level hierarchy complexity

3. Client-Side Metadata Caching:

Aggressive client caching with long timeouts
Lease-based cache consistency
Master only for cache misses
Benefits: Reduces master load dramatically
Challenges: Consistency protocol more complex

4. Metadata Distribution:

Distribute master state using consensus (Paxos/Raft)
Read from any replica, write to leader
Benefits: High availability, read scalability
Challenges: Write latency, consistency overhead

Real-world Evolution:

Colossus (GFS successor) uses metadata sharding
HDFS Federation uses multiple namenodes
Both prove that single master can be overcome while maintaining simplicity where possible

Stale Replica Detection (Version Numbers)

In a distributed system, some replicas may miss updates (e.g., if a chunkserver crashes during a write). GFS uses Chunk Version Numbers to distinguish between up-to-date and stale replicas.

The Versioning Protocol

Lease Granting: Before granting a lease, the master increments the chunk’s version number in its persistent metadata.
Propagation: The master notifies the primary and all secondaries of the new version number.
Persisting: Both the master and the chunkservers record the new version number on their respective persistent disks before the mutation starts.
Stale Check: If a chunkserver was down during the update, it will still have the old version number.

How the Master Uses Versions

During Heartbeats: Chunkservers report their (handle, version) pairs. If the master sees a version $V < V_{current}$ , it knows the replica is stale.
Garbage Collection: Stale replicas are immediately scheduled for garbage collection.
Client Requests: When a client asks for chunk locations, the master never returns a stale replica, ensuring the client only sees current data.

Key Takeaways

Master Operations Summary:

Namespace Locking: Fine-grained path-based locks enable parallel operations
Leases for Consistency: Time-bounded primary authority avoids distributed consensus
Replica Placement: Rack-aware placement balances reliability and performance
Lazy Garbage Collection: Simple, safe deletion with recovery window
In-Memory Metadata: Fast operations, simple consistency, small memory footprint
Master Replication: Operation log replication ensures durability and fast recovery
Background Processes: GC, re-replication, balancing happen asynchronously
Version Numbers: Detect stale replicas reliably without complex protocols

Up Next

In Chapter 4: Chunkservers & Data Flow, we’ll explore:

How chunkservers store and manage chunks
Detailed read, write, and record append flows
Data integrity mechanisms with checksums
Replication pipeline and optimization
Handling chunkserver failures

The master orchestrates the system—now we’ll see how chunkservers execute the actual data operations.

Interview Deep-Dive

Explain GFS garbage collection. Why is lazy deletion better than immediate deletion in a distributed file system?

Strong Answer:When a file is deleted in GFS, the master does not immediately reclaim the chunks. Instead, it renames the file to a hidden name with a deletion timestamp. A background process periodically scans for hidden files older than a configurable threshold (default three days) and removes their metadata. Orphaned chunks (chunks with no file reference) are discovered during regular chunk report exchanges with chunkservers and reclaimed.Lazy deletion is better than immediate deletion for three reasons. First, safety: accidental deletions can be undone within the grace period by simply renaming the hidden file back. This is enormously valuable in production — at Google scale, operator error is a constant risk. Second, simplicity: the garbage collector runs as a single background sweep, which is far simpler than coordinating immediate deletion across three replicas on different chunkservers. If one replica is temporarily unreachable during an immediate delete, you need complex retry logic. With lazy GC, the unreachable chunkserver simply reports the orphaned chunk on its next heartbeat, and the master tells it to delete. Third, batching: the GC can coalesce many deletions into efficient batch operations, reducing the metadata operation rate on the master.The trade-off is storage reclamation latency. Deleted files continue to consume disk space for up to three days. For clusters with tight storage budgets, this delay can be problematic. GFS allows tuning the grace period down for specific namespaces.Follow-up: How does this compare to garbage collection in modern object stores like S3?S3 uses a similar lazy approach internally, though the semantics exposed to users are different. When you delete an S3 object, the metadata is removed immediately from the API perspective, but the underlying storage blocks may not be reclaimed instantly. S3 also offers versioning, which is conceptually similar to GFS hidden-file grace period — deleted objects are retained as non-current versions until a lifecycle policy removes them. The key insight that carries across both systems is that in distributed storage, lazy reclamation is almost always the right default because it decouples the fast path (metadata update) from the slow path (physical storage reclamation).

GFS uses per-path read-write locks instead of traditional directory locks. Why does this matter at scale, and what concurrency does it enable?

Strong Answer:In a traditional POSIX file system, creating a file requires locking the parent directory inode. If you have 1,000 MapReduce tasks all writing output files to the same directory simultaneously, they serialize on that directory lock. This creates a bottleneck that limits parallelism.GFS uses a flat namespace with per-path read-write locks. To create /data/logs/file1, the master acquires read locks on /data and /data/logs (to prevent them from being deleted), and a write lock only on /data/logs/file1 itself. Creating /data/logs/file2 simultaneously requires read locks on the same parents (which are shared with file1) and a write lock on /data/logs/file2. Since the write locks are on different paths, both operations proceed in parallel.This enables massive concurrency. In a MapReduce job with 10,000 map tasks writing output to the same directory, all 10,000 file creations can proceed concurrently because they only contend on shared read locks for the parent paths. Deadlocks are prevented by always acquiring locks in lexicographic order of the full pathname.Follow-up: Can you think of a scenario where this locking scheme could cause problems?Yes — directory-level operations like rename or snapshot. If you snapshot /data/logs, you need a write lock on /data/logs to prevent any modifications during the snapshot. This blocks all concurrent file creations in that directory. In practice, GFS handles this by making snapshots a rare, operator-initiated operation, not something that happens during normal workload processing. The design is optimized for the common case (file creation and deletion) at the expense of rare operations (snapshots, directory renames). This is a classic engineering trade-off: optimize for the 99% case and accept higher cost for the 1% case.

How does GFS replica placement strategy balance fault tolerance against write performance?

Strong Answer:GFS places the first replica on the same machine as the writer (or a machine with below-average disk utilization if the writer is not a chunkserver). The second replica goes on a different rack. The third replica goes on a different machine in the same rack as the second. This strategy balances three concerns.Fault tolerance: by placing replicas across at least two racks, GFS survives an entire rack failure (power outage, top-of-rack switch failure) with at least one surviving replica. This is critical because rack-level failures are correlated — a single switch failure takes out every machine on the rack.Write performance: the pipelined replication sends data from the writer to the nearest replica first, which forwards it to the next nearest. Having two replicas on the same rack means the second-to-third hop is a fast intra-rack transfer rather than a slower cross-rack transfer. This reduces the total write latency.Read performance: for read-heavy files, having replicas on two different racks means clients in either rack can read locally, distributing read load across the network.The trade-off is that with two replicas on one rack, a rack failure leaves you with only one surviving replica. At that point, re-replication becomes urgent and is prioritized above all other chunk operations. The master tracks under-replicated chunks and prioritizes them by severity (1 replica remaining is more urgent than 2 replicas remaining).Follow-up: How would you change this placement strategy for a cluster spanning multiple data centers?I would add a fourth replica in a different data center for disaster recovery. The cross-datacenter replica would be asynchronously replicated (to avoid the latency penalty of synchronous cross-WAN writes) and used only for reads in the remote datacenter or for recovery after a datacenter-level failure. This is essentially what Google did when evolving toward Colossus and what modern systems like Spanner do with their multi-region configurations. The key design question is whether the cross-datacenter replica participates in the write quorum (stronger consistency, higher latency) or is replicated asynchronously (lower latency, risk of data loss during datacenter failure).

Documentation Index

​Chapter 3: Master Operations

​Namespace Management

​Namespace Structure

​Coarse-Grained Locking

​Metadata Storage

File Namespace

Chunk Metadata

Server State

​Chunk Lease Mechanism

​How Leases Work

​Lease Benefits

​Snapshot Mechanics (Copy-on-Write)

​How it Works (The Metadata-Only Copy)

​The First Write After Snapshot

​Replica Placement

​Placement Goals

​Chunk Creation

​Re-replication

​Garbage Collection

​Why Lazy Deletion?