The master is the brain of GFS, orchestrating all metadata operations and coordinating the distributed system. What makes GFS’s master design remarkable is how much sophistication is packed into a single process without it becoming the bottleneck critics predicted. The namespace locking strategy alone — using fine-grained, per-path read-write locks instead of traditional directory-level locks — was a significant innovation that enabled thousands of concurrent file creations in the same directory. This chapter explores how the master manages the namespace, allocates chunks, grants leases, places replicas, and maintains system health through garbage collection. These mechanisms are directly relevant to modern system design: lease-based coordination appears in etcd and ZooKeeper, replica placement strategies inform Kafka broker assignment, and lazy garbage collection is the standard approach in systems from Go’s runtime to distributed databases.
Chapter Goals:
Understand namespace management with coarse-grained locking
Unlike traditional file systems with per-directory data structures (where each directory is essentially an inode pointing to a list of child entries), GFS uses a flat namespace with efficient path-to-metadata lookups. This design choice eliminated the “directory inode bottleneck” that plagued concurrent access in POSIX file systems. In a traditional file system, creating a file requires locking the parent directory’s inode, which serializes all operations in the same directory. GFS’s approach of locking individual path components independently enabled massive parallelism — a property that was essential for supporting thousands of MapReduce tasks writing output simultaneously.
LOCK ORDERING PROTOCOL─────────────────────Rule: Always acquire locks in consistent orderOrder: Lexicographic order by full pathnameExample:────────Operation: Rename /a/x to /b/yIncorrect: /a/x lock (write) /b/y lock (write) → If another thread does /b/... first → DEADLOCK!Correct: Sort: [/a, /a/x, /b, /b/y] Acquire in order: /a → read /a/x → write /b → read /b/y → writeDeadlock impossible!IMPLEMENTATION:──────────────Algorithm:──────────1. Collect all paths needed for operation2. Sort lexicographically3. Acquire locks in sorted order4. Perform operation5. Release all locksExample with snapshot:─────────────────────Snapshot /home/alice to /backup/alice_snapPaths needed:- /home (read)- /home/alice (write - prevent changes)- /backup (read)- /backup/alice_snap (write)Sorted order:1. /backup (read)2. /backup/alice_snap (write)3. /home (read)4. /home/alice (write)Acquire in this order → No deadlock possible
Scalability Benefits:
PERFORMANCE COMPARISON─────────────────────Metric: Operations per second on /data/logs/Traditional (directory-level locking):──────────────────────────────────────1 thread: 1000 creates/sec2 threads: 1000 creates/sec (serialized!)4 threads: 1000 creates/sec (serialized!)8 threads: 1000 creates/sec (serialized!)GFS (fine-grained path locking):────────────────────────────────1 thread: 1000 creates/sec2 threads: 2000 creates/sec (parallel!)4 threads: 4000 creates/sec (parallel!)8 threads: 8000 creates/sec (parallel!)→ Linear scaling with parallelismREAL-WORLD IMPACT:─────────────────MapReduce with 1000 mappers:• All write to /output/intermediate/• Each creates file with unique name• All proceed in parallel• No lock contention!Without fine-grained locking:• Serialized on directory lock• 1000x slower• System unusable
The master keeps three types of metadata, all in memory:
File Namespace
Directory and File Names
Full pathname → metadata
Persistent (operation log)
Modification times
Owner/permissions
List of chunk handles
Chunk Metadata
Chunk Handle → Info
Chunk version number
List of replica locations
Primary (if leased)
Lease expiration
Partially persistent
Server State
Chunkserver Info
Available disk space
Current load (CPU, I/O)
Chunks stored
Last heartbeat time
Not persistent
METADATA MEMORY LAYOUT─────────────────────For 100 million chunks (several petabytes):File Namespace:──────────────• ~10 million files• ~100 bytes per file• Total: ~1 GBChunk Metadata:──────────────• 100 million chunks• ~64 bytes per chunk - Chunk handle (8 bytes) - Version number (8 bytes) - Replica locations (3-5 locations × 8 bytes) - Lease info (16 bytes)• Total: ~6 GBChunkserver State:─────────────────• ~1000 chunkservers• ~1 KB per server• Total: ~1 MBTOTAL: < 10 GB (easily fits in RAM of modern server)Benefits:────────✓ Fast operations (no disk I/O)✓ Simple consistency✓ Global view for decisions✓ Easy to scan entire namespace
Leases are central to GFS’s consistency model, enabling mutations without the overhead of distributed consensus. The lease mechanism is one of GFS’s most elegant design decisions and one of the most frequently asked-about concepts in system design interviews. The core insight is this: instead of running an expensive consensus protocol (like Paxos) for every single write operation, the master delegates authority to a “primary” chunkserver for a bounded time period. During that lease window, the primary makes all serialization decisions unilaterally. This converts the cost of distributed coordination from per-operation to per-lease-grant — a dramatic reduction. Today, this pattern is ubiquitous: Kafka uses leader epochs, Spanner uses leader leases, and even distributed lock services like etcd use lease-based TTLs for the same reason.
LEASE GRANT PROCESS──────────────────Client requests write to chunk XMaster checks:1. Does chunk X have current primary? NO → Select one replica as primary2. Grant lease: ┌──────────────────────────────┐ │ Chunk: X │ │ Primary: chunkserver-5 │ │ Lease expiration: T+60sec │ │ Version: 7 │ └──────────────────────────────┘3. Send to chunkserver-5: "You are primary for chunk X until T+60"4. Increment chunk version to 85. Return to client: Primary: cs-5 Secondaries: [cs-2, cs-9] Version: 8
2
Primary Orders Mutations
PRIMARY ROLE───────────While lease held, primary has authority:Multiple clients send writes:Client A: Write(offset=1000, len=100)Client B: Append(len=200)Client C: Write(offset=5000, len=50)Primary receives all three:─────────────────────────1. Assign serial numbers: A → Serial #1 B → Serial #2 C → Serial #32. Apply to local chunk in order: #1: Write at 1000 #2: Append at end #3: Write at 50003. Send to secondaries: "Apply operations in order: #1, #2, #3"4. Wait for ACKs5. Reply to clientsResult: Consistent ordering across all replicas!
3
Lease Renewal
LEASE LIFECYCLE──────────────Initial lease: 60 secondsRenewal (if writes ongoing):────────────────────────────Every ~10-15 seconds:Primary → Master: "Heartbeat: Still active, renew lease for chunk X"Master → Primary: "Lease extended to T+60"This continues while writes happenExpiration:──────────No renewal for 60 seconds?→ Lease expires→ Master can grant to different replica→ Or same replica (if still alive)REVOCATION:──────────Master needs to revoke lease:(e.g., client requested chunk deletion)1. Master sends revoke message to primary2. Primary stops accepting new mutations3. Completes in-flight operations4. ACKs revocation to master5. Master can now delete chunkIf primary unreachable?→ Wait for 60sec expiration→ Then proceed
WHY LEASES?──────────Problem Without Leases:───────────────────────Distributed consensus for every write:Client → Replica 1 ┐Client → Replica 2 ┼→ Agree on order? (Paxos/Raft)Client → Replica 3 ┘ Slow! Complex!For each write:- Multiple round trips- Consensus protocol- Failure handling→ 10-100ms latency per operationWith Leases:───────────One-time setup (lease grant):Master → Primary: "You order operations for 60sec"Then for each write:Client → Primary: "Write this"Primary → Secondaries: "Apply in this order"→ ~1ms latency per operationBenefits:────────✓ No distributed consensus per operation✓ Primary makes serialization decision✓ Low latency writes✓ Simple protocol✓ Automatic timeout (no need for perfect failure detection)LEASE VS LOCK:─────────────Lock:- Must explicitly release- Failure? → Deadlock or complex recoveryLease:- Automatic timeout- Failure? → Expires naturally- Master can wait or revoke- No distributed state
Snapshots in GFS are nearly instantaneous and allow users to create a copy of a file or a directory tree without copying data. This capability was critical for production operations: teams could snapshot an entire dataset before running a risky MapReduce job, providing a cheap rollback mechanism. The technique GFS uses — Copy-on-Write (CoW) at the chunk level — is the same fundamental mechanism used by Linux’s fork() system call, by ZFS and Btrfs file system snapshots, and by container image layers in Docker. Understanding CoW in GFS gives you the mental model for all of these systems.
GFS uses Copy-on-Write (CoW) at the chunk level to implement snapshots efficiently.
Lease Revocation: When the master receives a snapshot request, it first revokes any outstanding leases on the chunks of the files to be snapshotted. This ensures that any subsequent writes will require a new lease, giving the master a chance to intercept and create a copy.
Log & Copy Metadata: The master logs the snapshot operation to disk. It then applies the operation to its in-memory state by duplicating the metadata for the file or directory. The new “snapshot” file points to the same chunk handles as the original.
Reference Counting: Each chunk handle now has a reference count > 1.
When a client wants to write to a chunk that has been snapshotted:
1. Client asks Master for lease: "Write to Chunk C (ref_count=2)"2. Master sees ref_count > 1: a) Pick new chunk handle C' b) Tell chunkservers holding C: "Copy C to C' locally" c) Update metadata: Current file now points to C' (ref_count=1) d) Snapshot still points to C (ref_count=1)3. Master grants lease for C' to client.4. Client writes to C' normally.
Key Benefit: The initial snapshot is just a metadata operation (copying pointers). The actual data copying is deferred until a write occurs, and only for the specific chunks being modified.
CHUNK CREATION WORKFLOW──────────────────────Scenario: Client creates new file and writes first chunk1. CLIENT CREATES FILE ────────────────── Client: "Create file /data/logs/2003-10-15" Master: a) Create namespace entry b) Log operation to disk c) Return success2. CLIENT WRITES DATA ────────────────── Client: "Write to /data/logs/2003-10-15, offset 0, 1MB" Master: a) File has no chunks yet b) Need to create chunk 03. MASTER ALLOCATES CHUNK ────────────────────── a) Generate chunk handle (globally unique 64-bit) chunk_handle = 0x1a2b3c4d5e6f7g8h b) Select 3 chunkservers: Available servers: [cs-1, cs-2, ..., cs-50] Filter: • Enough disk space • Below average utilization • Rack diversity Selected: [cs-5, cs-12, cs-23] (cs-5 and cs-12 in rack-1, cs-23 in rack-2) c) Designate primary (lowest load) Primary: cs-5 d) Grant lease to cs-5 (60 seconds) e) Increment version to 1 f) Log operation: "CHUNK_CREATE: handle=0x1a2b, file=/data/logs/2003-10-15, index=0, version=1, locations=[cs-5, cs-12, cs-23]"4. MASTER CONTACTS CHUNKSERVERS ──────────────────────────── Master → cs-5, cs-12, cs-23: "Create chunk 0x1a2b, version 1" Each chunkserver: a) Create empty file /gfs/chunks/0x1a2b3c4d5e6f7g8h b) Initialize version to 1 c) ACK to master5. MASTER RETURNS TO CLIENT ──────────────────────── Master → Client: "Chunk 0 metadata: handle: 0x1a2b3c4d5e6f7g8h primary: cs-5 secondaries: [cs-12, cs-23] version: 1 lease_expiration: T+60"6. CLIENT WRITES DATA ────────────────── (Normal write flow as described in Chapter 2)
FILE DELETION WORKFLOW─────────────────────1. CLIENT DELETES FILE ────────────────── Client: "Delete /data/old/file1" Master: ┌──────────────────────────────────┐ │ a) Rename file to hidden name: │ │ /data/old/file1 │ │ → │ │ /.trash/file1.deleted.T12345 │ │ │ │ b) Add deletion timestamp │ │ │ │ c) Log operation │ │ │ │ d) Return success │ └──────────────────────────────────┘ File still exists, just renamed! Client can immediately: • Delete again (no-op) • Undelete (rename back)2. GRACE PERIOD ─────────── Hidden file stays for N days (configurable, e.g., 3 days) During this time: • File not visible to normal operations • Chunks still exist • Can be recovered if needed • Storage not freed After 3 days: → Master's background GC removes it3. BACKGROUND SCAN (Master) ─────────────────────── Periodic scan (e.g., every few hours): for each file in /.trash/: if (current_time - deletion_time) > grace_period: # Permanently delete chunk_handles = file.get_chunks() for chunk in chunk_handles: remove_from_namespace(chunk) delete_namespace_entry(file) No rush, happens eventually4. CHUNK REMOVAL ───────────── Deleted chunks marked in master memory: "Chunk 0x1a2b: DELETED" Not immediately removed from chunkservers → Next step handles this
CHUNK GARBAGE COLLECTION───────────────────────1. HEARTBEAT EXCHANGE ────────────────── Regular heartbeat (every ~30 seconds): Chunkserver → Master: "I have chunks: [0x1a2b, 0x2c3d, 0x3e4f, ...]" Master → Chunkserver: "Delete these: [0x5e6f, 0x7g8h]"2. MASTER IDENTIFIES GARBAGE ───────────────────────── Master compares chunkserver's list with metadata: Chunkserver reports: [0x1a2b, 0x2c3d, 0x9z8y] Master's namespace has: • 0x1a2b ✓ (belongs to /data/file1) • 0x2c3d ✓ (belongs to /data/file2) • 0x9z8y ✗ (NOT in namespace) → 0x9z8y is orphaned/garbage Reasons for garbage: • File was deleted • Chunk creation failed partway • Master crash during operation • Stale replica from old version3. CHUNKSERVER DELETES ─────────────────── Chunkserver receives delete list: for chunk in delete_list: os.remove(f"/gfs/chunks/{chunk}") log("Deleted chunk {chunk}") Simple, local operation No coordination needed4. STALE REPLICA DETECTION ─────────────────────── Version numbers detect stale replicas: Chunk 0x1a2b, current version: 5 Heartbeat report: cs-5: "chunk 0x1a2b, version 5" ✓ cs-12: "chunk 0x1a2b, version 5" ✓ cs-23: "chunk 0x1a2b, version 4" ✗ (STALE!) Master → cs-23: "Delete 0x1a2b (stale)" How version increases: • Master grants new lease → version++ • Chunkserver was down during lease grant • Chunkserver comes back with old version • Master detects and deletes stale replica
GC BACKGROUND PROCESS────────────────────Master runs continuous background tasks:TASK 1: Namespace Scan──────────────────────Frequency: Every few hoursPurpose: Remove old deleted filespseudocode:──────────def namespace_gc(): for file in namespace: if file.is_hidden_deleted(): age = now() - file.deletion_time if age > GRACE_PERIOD: permanently_delete(file)Impact:• Low priority• Runs when master idle• Rate limited (100s files/sec)TASK 2: Chunk Accounting────────────────────────Frequency: Ongoing (via heartbeats)Purpose: Identify orphaned chunkspseudocode:──────────def on_heartbeat(chunkserver, chunk_list): for chunk in chunk_list: if chunk not in namespace: send_delete_command(chunkserver, chunk) elif chunk.version < expected_version: send_delete_command(chunkserver, chunk)Impact:• Immediate (next heartbeat)• No extra overhead• Piggybacked on heartbeatTASK 3: Orphan Detection────────────────────────Frequency: DailyPurpose: Find chunks that should exist but don'tpseudocode:──────────def orphan_detection(): for file in namespace: for chunk in file.chunks: replicas = get_chunk_locations(chunk) if len(replicas) < REPLICATION_GOAL: schedule_replication(chunk)Impact:• Ensures all data replicated• Catches missed failures• Safety netRATE LIMITING:─────────────All GC tasks are rate-limited:• Don't overload master• Don't saturate network• Don't impact foreground opsConfiguration:• Max deletions per heartbeat: 10• Max namespace scans per sec: 100• Max re-replications concurrent: 50
Master starts (after crash or planned restart)1. Load latest checkpoint from disk • Compact B-tree format • Contains full namespace • Loads in 1-2 seconds • Typically < 1 GB2. Result: • Full file namespace in memory • Chunk handle mappings • Version numbers
2
Log Replay
Replay operation log since checkpoint1. Read log file2. Apply each operation in order: • File creates • Chunk allocations • Version updates • Deletions3. Typically few seconds (100K ops/sec)4. Result: Master has current namespace state
3
Chunk Location Discovery
Master doesn't persist chunk locations(Chunkservers are source of truth)1. Master sends to all chunkservers: "What chunks do you have?"2. Each chunkserver replies: "I have: [chunk1, chunk2, ..., chunkN] With versions: [v1, v2, ..., vN]"3. Master builds location map in memory4. Identifies stale replicas (old versions)5. Marks for garbage collectionTypically completes in 10-30 seconds
4
Resume Operations
Master is now ready!Total recovery time: < 1 minute1. Checkpoint load: 1-2 sec2. Log replay: 5-10 sec3. Chunk poll: 10-30 sec4. Ready: 30-60 sec totalMeanwhile:• Shadow masters can serve reads• System remains partially available• No data loss
Failure Handling: If deletion message lost, chunk collected eventually anyway
Spread Load: I/O spread over time, not sudden burst
Piggybacking: Uses existing heartbeat mechanism
The trade-off is that storage isn’t freed immediately, but for Google’s workload this was acceptable since storage was relatively cheap and safety was more important.Process: File deletion → rename to hidden → wait grace period → background GC removes → heartbeat informs chunkservers → chunkservers delete local chunks
In a distributed system, some replicas may miss updates (e.g., if a chunkserver crashes during a write). GFS uses Chunk Version Numbers to distinguish between up-to-date and stale replicas.
Explain GFS garbage collection. Why is lazy deletion better than immediate deletion in a distributed file system?
Strong Answer:When a file is deleted in GFS, the master does not immediately reclaim the chunks. Instead, it renames the file to a hidden name with a deletion timestamp. A background process periodically scans for hidden files older than a configurable threshold (default three days) and removes their metadata. Orphaned chunks (chunks with no file reference) are discovered during regular chunk report exchanges with chunkservers and reclaimed.Lazy deletion is better than immediate deletion for three reasons. First, safety: accidental deletions can be undone within the grace period by simply renaming the hidden file back. This is enormously valuable in production — at Google scale, operator error is a constant risk. Second, simplicity: the garbage collector runs as a single background sweep, which is far simpler than coordinating immediate deletion across three replicas on different chunkservers. If one replica is temporarily unreachable during an immediate delete, you need complex retry logic. With lazy GC, the unreachable chunkserver simply reports the orphaned chunk on its next heartbeat, and the master tells it to delete. Third, batching: the GC can coalesce many deletions into efficient batch operations, reducing the metadata operation rate on the master.The trade-off is storage reclamation latency. Deleted files continue to consume disk space for up to three days. For clusters with tight storage budgets, this delay can be problematic. GFS allows tuning the grace period down for specific namespaces.Follow-up: How does this compare to garbage collection in modern object stores like S3?S3 uses a similar lazy approach internally, though the semantics exposed to users are different. When you delete an S3 object, the metadata is removed immediately from the API perspective, but the underlying storage blocks may not be reclaimed instantly. S3 also offers versioning, which is conceptually similar to GFS hidden-file grace period — deleted objects are retained as non-current versions until a lifecycle policy removes them. The key insight that carries across both systems is that in distributed storage, lazy reclamation is almost always the right default because it decouples the fast path (metadata update) from the slow path (physical storage reclamation).
GFS uses per-path read-write locks instead of traditional directory locks. Why does this matter at scale, and what concurrency does it enable?
Strong Answer:In a traditional POSIX file system, creating a file requires locking the parent directory inode. If you have 1,000 MapReduce tasks all writing output files to the same directory simultaneously, they serialize on that directory lock. This creates a bottleneck that limits parallelism.GFS uses a flat namespace with per-path read-write locks. To create /data/logs/file1, the master acquires read locks on /data and /data/logs (to prevent them from being deleted), and a write lock only on /data/logs/file1 itself. Creating /data/logs/file2 simultaneously requires read locks on the same parents (which are shared with file1) and a write lock on /data/logs/file2. Since the write locks are on different paths, both operations proceed in parallel.This enables massive concurrency. In a MapReduce job with 10,000 map tasks writing output to the same directory, all 10,000 file creations can proceed concurrently because they only contend on shared read locks for the parent paths. Deadlocks are prevented by always acquiring locks in lexicographic order of the full pathname.Follow-up: Can you think of a scenario where this locking scheme could cause problems?Yes — directory-level operations like rename or snapshot. If you snapshot /data/logs, you need a write lock on /data/logs to prevent any modifications during the snapshot. This blocks all concurrent file creations in that directory. In practice, GFS handles this by making snapshots a rare, operator-initiated operation, not something that happens during normal workload processing. The design is optimized for the common case (file creation and deletion) at the expense of rare operations (snapshots, directory renames). This is a classic engineering trade-off: optimize for the 99% case and accept higher cost for the 1% case.
How does GFS replica placement strategy balance fault tolerance against write performance?
Strong Answer:GFS places the first replica on the same machine as the writer (or a machine with below-average disk utilization if the writer is not a chunkserver). The second replica goes on a different rack. The third replica goes on a different machine in the same rack as the second. This strategy balances three concerns.Fault tolerance: by placing replicas across at least two racks, GFS survives an entire rack failure (power outage, top-of-rack switch failure) with at least one surviving replica. This is critical because rack-level failures are correlated — a single switch failure takes out every machine on the rack.Write performance: the pipelined replication sends data from the writer to the nearest replica first, which forwards it to the next nearest. Having two replicas on the same rack means the second-to-third hop is a fast intra-rack transfer rather than a slower cross-rack transfer. This reduces the total write latency.Read performance: for read-heavy files, having replicas on two different racks means clients in either rack can read locally, distributing read load across the network.The trade-off is that with two replicas on one rack, a rack failure leaves you with only one surviving replica. At that point, re-replication becomes urgent and is prioritized above all other chunk operations. The master tracks under-replicated chunks and prioritizes them by severity (1 replica remaining is more urgent than 2 replicas remaining).Follow-up: How would you change this placement strategy for a cluster spanning multiple data centers?I would add a fourth replica in a different data center for disaster recovery. The cross-datacenter replica would be asynchronously replicated (to avoid the latency penalty of synchronous cross-WAN writes) and used only for reads in the remote datacenter or for recovery after a datacenter-level failure. This is essentially what Google did when evolving toward Colossus and what modern systems like Spanner do with their multi-region configurations. The key design question is whether the cross-datacenter replica participates in the write quorum (stronger consistency, higher latency) or is replicated asynchronously (lower latency, risk of data loss during datacenter failure).