Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Chapter 5: Hadoop Ecosystem

The true power of Hadoop lies not just in HDFS and MapReduce, but in the rich ecosystem of tools built on top of it. This chapter explores the key components that transform Hadoop from a low-level distributed file system and processing framework into a complete big data platform. The Hadoop ecosystem did not emerge from a single grand design. It grew organically, project by project, as engineers at Facebook, Yahoo, LinkedIn, and dozens of other companies hit real production walls. Facebook engineers needed SQL analysts to query petabytes of clickstream data without learning Java — so they built Hive (around 2008). Yahoo researchers needed a scripting language for complex ETL pipelines that were too painful to express in raw MapReduce — so they built Pig (around 2006). The pattern repeated: every time a new class of problem proved too awkward or too slow with existing tools, someone built a new layer on top of HDFS and YARN. Understanding this history matters because it explains why the ecosystem looks the way it does — not a clean layered architecture designed on a whiteboard, but a living, evolving collection of tools where each one exists because a specific pain point demanded it.
Chapter Goals:
  • Understand the Hadoop ecosystem landscape
  • Master Hive for SQL on Hadoop
  • Learn Pig for data flow programming
  • Explore HBase for real-time NoSQL storage
  • Study workflow orchestration with Oozie
  • Understand data ingestion with Kafka and Flume
  • Compare ecosystem tools and when to use each

Ecosystem Overview

The Hadoop Stack

+---------------------------------------------------------------+
|                  HADOOP ECOSYSTEM STACK                       |
+---------------------------------------------------------------+
|                                                               |
|  ┌─────────────────────────────────────────────────────────┐  |
|  │              APPLICATIONS & USE CASES                   │  |
|  │  Business Intelligence, Analytics, ML, ETL, Reporting   │  |
|  └─────────────────────────────────────────────────────────┘  |
|                           ↑                                   |
|  ┌─────────────────────────────────────────────────────────┐  |
|  │              HIGH-LEVEL TOOLS                           │  |
|  ├─────────────────────────────────────────────────────────┤  |
|  │  SQL Query:  │ Scripting:  │ ML:       │ Graph:        │  |
|  │  • Hive      │ • Pig       │ • Mahout  │ • Giraph      │  |
|  │  • Impala    │ • Cascading │ • Spark   │               │  |
|  │  • Presto    │             │   MLlib   │               │  |
|  └──────────────┴─────────────┴───────────┴───────────────┘  |
|                           ↑                                   |
|  ┌─────────────────────────────────────────────────────────┐  |
|  │         DATA PROCESSING FRAMEWORKS                      │  |
|  ├─────────────────────────────────────────────────────────┤  |
|  │  Batch:      │ Stream:     │ Interactive:              │  |
|  │  • MapReduce │ • Storm     │ • Impala                  │  |
|  │  • Spark     │ • Flink     │ • Presto                  │  |
|  │  • Tez       │ • Samza     │ • Drill                   │  |
|  └──────────────┴─────────────┴───────────────────────────┘  |
|                           ↑                                   |
|  ┌─────────────────────────────────────────────────────────┐  |
|  │         RESOURCE MANAGEMENT & ORCHESTRATION             │  |
|  ├─────────────────────────────────────────────────────────┤  |
|  │  • YARN (Resource Manager)                              │  |
|  │  • Oozie (Workflow Scheduler)                           │  |
|  │  • ZooKeeper (Coordination)                             │  |
|  └─────────────────────────────────────────────────────────┘  |
|                           ↑                                   |
|  ┌─────────────────────────────────────────────────────────┐  |
|  │              STORAGE LAYER                              │  |
|  ├─────────────────────────────────────────────────────────┤  |
|  │  Distributed FS:  │ NoSQL:    │ Ingestion:            │  |
|  │  • HDFS           │ • HBase   │ • Kafka               │  |
|  │  • S3             │ • Kudu    │ • Flume               │  |
|  │                   │           │ • Sqoop               │  |
|  └───────────────────┴───────────┴───────────────────────┘  |
|                                                               |
+---------------------------------------------------------------+

GUIDING PRINCIPLE:
─────────────────
Each layer builds on lower layers.
Higher layers provide easier abstractions.
Lower layers provide more control and flexibility.

Why the Ecosystem Matters

Abstraction

Hide Complexity:
  • MapReduce requires Java programming
  • Hive provides SQL interface
  • Easier onboarding, faster development
  • Reach more users (analysts, not just engineers)

Productivity

Faster Development:
  • 100 lines of Java MapReduce → 5 lines of SQL
  • Pig reduces code by 10-20x
  • Faster iteration, fewer bugs
  • Focus on logic, not plumbing

Specialization

Right Tool for Job:
  • Hive for SQL analytics
  • HBase for real-time access
  • Kafka for stream ingestion
  • Each optimized for specific use case

Innovation

Community-Driven:
  • Open source enables experimentation
  • Best tools emerge organically
  • Spark displaced MapReduce
  • Ecosystem evolves with needs

Apache Hive: SQL on Hadoop

What is Hive?

Apache Hive was created at Facebook in 2007 by Jeff Hammerbacher’s data team and later open-sourced in 2008. The motivation was straightforward: Facebook’s data warehouse was growing by terabytes per day, and the only way to query it was to write Java MapReduce programs. Most of the analysts who needed to answer business questions — “How many users clicked this ad in the last 30 days?” — knew SQL, not Java. Hive bridged that gap by providing a SQL dialect (HiveQL) that compiled to MapReduce jobs behind the scenes. This single decision — putting a SQL layer on top of MapReduce — arguably did more to accelerate Hadoop adoption than any other ecosystem project. It transformed Hadoop from an engineering tool into a data warehouse that business analysts could use directly. Hive provides a SQL interface (HiveQL) to data stored in HDFS, translating SQL queries into MapReduce, Tez, or Spark jobs.
+---------------------------------------------------------------+
|                    APACHE HIVE ARCHITECTURE                   |
+---------------------------------------------------------------+
|                                                               |
|  ┌──────────────────────────────────────────┐                 |
|  │  CLIENT (SQL Interface)                  │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  • Hive CLI                              │                 |
|  │  • Beeline (JDBC client)                 │                 |
|  │  • JDBC/ODBC drivers                     │                 |
|  │  • BI tools (Tableau, PowerBI)           │                 |
|  └─────────────────┬────────────────────────┘                 |
|                    │                                          |
|                    │ HiveQL query                             |
|                    ↓                                          |
|  ┌──────────────────────────────────────────┐                 |
|  │  HIVE SERVER (HiveServer2)               │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  ┌────────────────────────────────────┐  │                 |
|  │  │  Parser                            │  │                 |
|  │  │  (SQL → Abstract Syntax Tree)      │  │                 |
|  │  └────────────────────────────────────┘  │                 |
|  │  ┌────────────────────────────────────┐  │                 |
|  │  │  Semantic Analyzer                 │  │                 |
|  │  │  (Validate, resolve metadata)      │  │                 |
|  │  └────────────────────────────────────┘  │                 |
|  │  ┌────────────────────────────────────┐  │                 |
|  │  │  Logical Plan Generator            │  │                 |
|  │  │  (Operator tree)                   │  │                 |
|  │  └────────────────────────────────────┘  │                 |
|  │  ┌────────────────────────────────────┐  │                 |
|  │  │  Optimizer                         │  │                 |
|  │  │  (Predicate pushdown, join reorder)│  │                 |
|  │  └────────────────────────────────────┘  │                 |
|  │  ┌────────────────────────────────────┐  │                 |
|  │  │  Physical Plan (MapReduce/Tez)     │  │                 |
|  │  │  (Execution plan)                  │  │                 |
|  │  └────────────────────────────────────┘  │                 |
|  └──────────────────────────────────────────┘                 |
|                    │                                          |
|                    │ Submit MR/Tez job                        |
|                    ↓                                          |
|  ┌──────────────────────────────────────────┐                 |
|  │  EXECUTION ENGINE                        │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  • MapReduce (slow)                      │                 |
|  │  • Tez (faster, DAG-based)               │                 |
|  │  • Spark (fastest, in-memory)            │                 |
|  └──────────────────────────────────────────┘                 |
|                    ↓                                          |
|  ┌──────────────────────────────────────────┐                 |
|  │  METASTORE                               │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  Database: MySQL, PostgreSQL, Derby      │                 |
|  │  ┌────────────────────────────────────┐  │                 |
|  │  │  Tables:                           │  │                 |
|  │  │  • Table metadata                  │  │                 |
|  │  │  • Column types                    │  │                 |
|  │  │  • Partitions                      │  │                 |
|  │  │  • Storage location (HDFS path)    │  │                 |
|  │  │  • SerDe info                      │  │                 |
|  │  └────────────────────────────────────┘  │                 |
|  └──────────────────────────────────────────┘                 |
|                    ↓                                          |
|  ┌──────────────────────────────────────────┐                 |
|  │  STORAGE (HDFS)                          │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  /user/hive/warehouse/                   │                 |
|  │    ├─ sales/                             │                 |
|  │    │  ├─ year=2023/                      │                 |
|  │    │  │  └─ month=01/                    │                 |
|  │    │  │     └─ data.parquet              │                 |
|  │    └─ users/                             │                 |
|  │       └─ data.orc                        │                 |
|  └──────────────────────────────────────────┘                 |
|                                                               |
+---------------------------------------------------------------+

Hive Fundamentals

SQL-Like Syntax
-- CREATE TABLE
CREATE TABLE employees (
  id INT,
  name STRING,
  salary DECIMAL(10,2),
  dept STRING,
  hire_date DATE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;


-- LOAD DATA
LOAD DATA INPATH '/user/data/employees.csv'
INTO TABLE employees;


-- SELECT QUERY
SELECT dept, AVG(salary) as avg_salary
FROM employees
WHERE hire_date >= '2020-01-01'
GROUP BY dept
HAVING AVG(salary) > 50000
ORDER BY avg_salary DESC;


-- JOIN
SELECT e.name, e.salary, d.dept_name
FROM employees e
JOIN departments d
ON e.dept = d.dept_id;


-- SUBQUERY
SELECT name, salary
FROM employees
WHERE salary > (
  SELECT AVG(salary) FROM employees
);
Behind the Scenes:Each query translates to MapReduce/Tez job:
SELECT dept, COUNT(*) FROM employees GROUP BY dept;

↓ Translates to:

Map Phase:
• Read employees table from HDFS
• For each row, emit (dept, 1)

Shuffle:
• Group by dept

Reduce Phase:
• For each dept, sum counts
• Write to HDFS

Result written to temp HDFS location.

Hive Metastore

Centralized Metadata Repository
METASTORE ARCHITECTURE:
──────────────────────

┌──────────────────────────────────────┐
│  Hive Clients                        │
│  • Hive CLI                          │
│  • Beeline                           │
│  • Spark SQL                         │
│  • Impala                            │
│  • Presto                            │
└───────────────┬──────────────────────┘

                │ Thrift API

┌──────────────────────────────────────┐
│  Metastore Service                   │
│  (HiveMetastore)                     │
└───────────────┬──────────────────────┘

                │ JDBC

┌──────────────────────────────────────┐
│  Metastore Database                  │
│  (MySQL, PostgreSQL, Derby)          │
├──────────────────────────────────────┤
│  Tables:                             │
│  • DBS (databases)                   │
│  • TBLS (tables)                     │
│  • COLUMNS_V2 (columns)              │
│  • PARTITIONS (partition metadata)   │
│  • SDS (storage descriptors)         │
│  • SERDES (serialization info)       │
└──────────────────────────────────────┘


STORED METADATA:
───────────────

For each table:
• Database name
• Table name
• Column names and types
• Partition keys
• Storage location (HDFS path)
• File format (Parquet, ORC, etc.)
• SerDe class
• Bucket information
• Table statistics
Example Metadata:
CREATE TABLE sales (
  id INT,
  amount DECIMAL(10,2)
)
PARTITIONED BY (year INT, month INT)
STORED AS PARQUET
LOCATION '/user/hive/warehouse/sales';
Metastore Entries:
DBS table:
┌────────────────────────────────────┐
│ DB_ID | NAME    | DESC | LOCATION  │
├───────┼─────────┼──────┼───────────┤
│ 1     | default | ...  | /user/... │
└────────────────────────────────────┘

TBLS table:
┌───────────────────────────────────────────┐
│ TBL_ID | DB_ID | TBL_NAME | TBL_TYPE   │
├────────┼───────┼──────────┼────────────┤
│ 100    | 1     | sales    | MANAGED    │
└───────────────────────────────────────────┘

COLUMNS_V2 table:
┌─────────────────────────────────────────┐
│ CD_ID | COLUMN_NAME | TYPE_NAME         │
├───────┼─────────────┼───────────────────┤
│ 200   | id          | int               │
│ 200   | amount      | decimal(10,2)     │
└─────────────────────────────────────────┘

PARTITIONS table:
┌───────────────────────────────────────┐
│ PART_ID | TBL_ID | PART_NAME         │
├─────────┼────────┼───────────────────┤
│ 300     | 100    | year=2023/month=1 │
│ 301     | 100    | year=2023/month=2 │
└───────────────────────────────────────┘

SDS (Storage Descriptor):
┌──────────────────────────────────────────┐
│ SD_ID | LOCATION                         │
├───────┼──────────────────────────────────┤
│ 400   | /user/hive/warehouse/sales/...  │
└──────────────────────────────────────────┘
Table Ownership
-- MANAGED TABLE (Hive owns data)
CREATE TABLE managed_sales (
  id INT,
  amount DECIMAL
);

-- Data stored in Hive warehouse:
-- /user/hive/warehouse/managed_sales/

DROP TABLE managed_sales;
-- Data DELETED from HDFS!


-- EXTERNAL TABLE (Hive doesn't own data)
CREATE EXTERNAL TABLE external_sales (
  id INT,
  amount DECIMAL
)
LOCATION '/data/sales';

-- Data at /data/sales/ (outside Hive warehouse)

DROP TABLE external_sales;
-- Only metadata deleted, data remains!
When to Use Each:
MANAGED TABLES:
──────────────

Use when:
• Hive is primary consumer
• Want Hive to manage lifecycle
• Data is temporary/intermediate

Benefit:
• DROP TABLE cleans everything
• Simpler management


EXTERNAL TABLES:
───────────────

Use when:
• Multiple tools access data (Hive, Spark, Impala)
• Data produced outside Hive (Kafka, Flume)
• Production data (don't want accidental deletion)

Benefit:
• Data survives table drop
• Flexibility in storage location


BEST PRACTICE:
─────────────

Production: Use EXTERNAL tables
Reason: Prevents accidental data loss
Reading Custom Formats
-- Default: LazySimpleSerDe (CSV-like)
CREATE TABLE default_format (
  id INT,
  name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';


-- JSON SerDe
CREATE TABLE json_data (
  id INT,
  name STRING,
  attributes MAP<STRING, STRING>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';

-- Can read JSON files directly!


-- RegEx SerDe (parse logs)
CREATE TABLE apache_logs (
  ip STRING,
  timestamp STRING,
  request STRING,
  status INT,
  bytes BIGINT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) [^ ]* [^ ]* \\[([^\\]]*)\\] \"([^\"]*)\" ([0-9]*) ([0-9]*)"
);


-- Parquet SerDe (built-in)
CREATE TABLE parquet_data (
  id INT,
  name STRING
)
STORED AS PARQUET;
-- Uses org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe


-- Custom SerDe (write your own!)
CREATE TABLE custom_format (
  ...
)
ROW FORMAT SERDE 'com.company.CustomSerDe'
WITH SERDEPROPERTIES (
  "field.delim" = "|",
  "escape.delim" = "\\"
);
SerDe Flow:
READ PATH:
─────────

HDFS File

InputFormat (reads bytes)

SerDe (deserialize bytes → objects)

Hive Row Objects

Query Processing


WRITE PATH:
──────────

Query Results

Hive Row Objects

SerDe (serialize objects → bytes)

OutputFormat (write bytes)

HDFS File

Deep Dive: Hive Metastore Scalability and the “Partition Explosion”

As data lakes grow to petabyte scale, the Hive Metastore often becomes the primary bottleneck in the entire stack.

1. The Partition Bottleneck

In a standard RDBMS-backed Metastore (MySQL/PostgreSQL), a query like SELECT * FROM sales WHERE year=2023 requires the Metastore to:
  1. Lookup the table sales.
  2. Scan the PARTITIONS table for all entries matching the filter.
  3. Fetch the SDS (Storage Descriptor) for each partition to find the HDFS paths.
The Math of Failure:
  • If a table has 1,000,000 partitions (common in high-cardinality data), a single ad-hoc query might force the Metastore to load 1GB of metadata into memory just to plan the scan.
  • This leads to “Metastore OOM” and serialized query planning that can take minutes before a single mapper even starts.

2. Scaling Strategies

  • Metastore Federation: Splitting metadata across multiple Metastore instances based on database name.
  • Partition Pruning at the Source: Ensuring clients use partition-bound filters to avoid full metadata scans.
  • Direct SQL: Modern Hive versions use direct SQL queries to the backend DB instead of the slower DataNucleus ORM layer to fetch partitions.

Deep Dive: Hive Query Lifecycle and Execution

To understand Hive’s performance, one must look past the SQL interface into the transformation pipeline that converts a declarative query into a distributed DAG.

1. The Query Planning Pipeline

Hive’s “Compiler” is a multi-stage engine that performs sophisticated optimizations before any data is touched.
StageActionOutput
ParserTokenizes HiveQL using Antlr.Abstract Syntax Tree (AST)
Semantic AnalyzerResolves table names, column types, and partition metadata from the Metastore.Query Block (QB) Tree
Logical Plan GenConverts QB tree into basic relational algebra operators (Filter, Join, Project).Operator Tree (Initial)
OptimizerApplies rules like Predicate Pushdown, Column Pruning, and Partition Pruning.Operator Tree (Optimized)
Physical Plan GenBreaks the operator tree into executable tasks (MapReduce, Tez, or Spark).Task Tree (DAG)

2. Tez vs. MapReduce: The DAG Revolution

While Hive 1.x relied on MapReduce, modern Hive (2.x/3.x) uses Apache Tez to eliminate the “HDFS barrier” between jobs.
  • MapReduce Barrier: Every stage in a complex query (e.g., multiple joins) must write intermediate data to HDFS, causing massive I/O overhead.
  • Tez DAG: Tez allows data to flow directly from one task to the next (e.g., Map -> Reduce -> Reduce) without intermediate HDFS writes. It uses a Directed Acyclic Graph of tasks.

3. LLAP (Live Long and Process)

Introduced in Hive 2.0, LLAP is a hybrid architecture that combines persistent query servers with standard YARN containers.
  • Persistent Daemons: Instead of starting a new JVM for every task (high latency), LLAP uses long-running daemons on worker nodes.
  • In-Memory Caching: LLAP caches columnar data (ORC/Parquet) in a smart, asynchronous cache, avoiding repetitive HDFS reads.
  • Vectorized Execution: LLAP processes data in batches of 1024 rows at a time using SIMD instructions, drastically reducing CPU cycles per row.
FeatureStandard HiveHive with LLAP
Startup LatencyHigh (Container launch)Ultra-low (Always-on daemons)
Data AccessHDFS ScanIn-Memory Cache + HDFS
ExecutionMapReduce/Tez TasksFragment-based execution
Target Use CaseLarge Batch ETLInteractive BI / Sub-second SQL

Apache Pig: Data Flow Language

What is Pig?

Apache Pig was developed at Yahoo Research around 2006, making it one of the oldest Hadoop ecosystem projects. While Hive targeted SQL-literate analysts, Pig targeted a different audience: engineers building complex ETL pipelines where the step-by-step data flow mattered more than the declarative “what.” A typical Pig user was an engineer who needed to load data from three different sources, filter each one differently, join them in a specific order, apply custom transformations, and store the result — all in a way that was readable and debuggable. Writing this as a sequence of MapReduce jobs required hundreds of lines of boilerplate Java. Pig Latin reduced it to 10-20 lines of procedural data flow code. It is worth noting that Pig has largely been superseded by Apache Spark’s DataFrame API and PySpark, which offer the same procedural data flow model with significantly better performance (in-memory execution) and a broader ecosystem (Python integration, ML libraries). New projects should default to Spark. However, Pig scripts remain in production at many organizations, and understanding the data flow paradigm it pioneered is valuable. Pig provides a high-level scripting language (Pig Latin) for data transformations, compiling to MapReduce/Tez jobs.
+---------------------------------------------------------------+
|                      APACHE PIG ARCHITECTURE                  |
+---------------------------------------------------------------+
|                                                               |
|  ┌──────────────────────────────────────────┐                 |
|  │  PIG SCRIPT (Pig Latin)                  │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  data = LOAD '/input' AS (id, name);     │                 |
|  │  filtered = FILTER data BY id > 100;     │                 |
|  │  grouped = GROUP filtered BY name;       │                 |
|  │  counts = FOREACH grouped GENERATE       │                 |
|  │             group, COUNT(filtered);      │                 |
|  │  STORE counts INTO '/output';            │                 |
|  └──────────────────┬───────────────────────┘                 |
|                     │                                         |
|                     │ Submit                                  |
|                     ↓                                         |
|  ┌──────────────────────────────────────────┐                 |
|  │  PIG COMPILER                            │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  • Parser (Pig Latin → logical plan)     │                 |
|  │  • Optimizer (merge, push filters)       │                 |
|  │  • Physical plan (MR/Tez operators)      │                 |
|  │  • Generate MapReduce jobs               │                 |
|  └──────────────────┬───────────────────────┘                 |
|                     │                                         |
|                     │ Submit MR jobs                          |
|                     ↓                                         |
|  ┌──────────────────────────────────────────┐                 |
|  │  EXECUTION ENGINE                        │                 |
|  │  (MapReduce, Tez)                        │                 |
|  └──────────────────────────────────────────┘                 |
|                                                               |
+---------------------------------------------------------------+

DESIGN PHILOSOPHY:
─────────────────

• Procedural (not declarative like SQL)
• Dataflow-oriented (transformations on datasets)
• Schema-optional (can work without strict types)
• ETL-focused (Extract, Transform, Load)

Pig Latin Basics

Fundamental Pig Commands
-- LOAD data from HDFS
users = LOAD '/data/users.csv'
        USING PigStorage(',')
        AS (id:int, name:chararray, age:int, city:chararray);


-- FILTER rows
adults = FILTER users BY age >= 18;


-- FOREACH (transform each row)
names_ages = FOREACH adults GENERATE name, age;


-- GROUP BY
by_city = GROUP users BY city;

-- Result: {group: "NYC", users: {(1, "Alice", 30, "NYC"), (2, "Bob", 25, "NYC")}}


-- JOIN
orders = LOAD '/data/orders.csv'
         AS (order_id:int, user_id:int, amount:double);

joined = JOIN users BY id, orders BY user_id;


-- ORDER BY
sorted = ORDER users BY age DESC;


-- DISTINCT
unique_cities = DISTINCT (FOREACH users GENERATE city);


-- LIMIT
top_10 = LIMIT sorted 10;


-- STORE results
STORE top_10 INTO '/output' USING PigStorage('\t');
Example: Word Count in Pig
-- Load input
lines = LOAD '/input/books.txt' AS (line:chararray);

-- Split into words
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group by word
grouped = GROUP words BY word;

-- Count
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS count;

-- Sort by count descending
sorted = ORDER counts BY count DESC;

-- Store result
STORE sorted INTO '/output/wordcount';
Equivalent in MapReduce: ~200 lines of Java!

Apache HBase: NoSQL on HDFS

What is HBase?

HBase is a distributed, column-oriented NoSQL database built on HDFS, modeled after Google’s Bigtable paper (2006). It was originally developed by Powerset (later acquired by Microsoft) around 2007 and became an Apache top-level project in 2010. The fundamental problem HBase solved was that HDFS is designed for batch processing — high throughput sequential reads and writes — but many real applications need random, low-latency access to individual records. Facebook famously used HBase to power its messaging platform (around 2010), storing hundreds of billions of messages with single-digit millisecond read latencies. The key insight was layering an LSM-Tree based storage engine on top of HDFS, combining the durability and fault tolerance of the distributed file system with the random access performance of an in-memory write buffer. In modern architecture, HBase competes with Apache Cassandra (better for multi-region deployments and tunable consistency), Google Cloud Bigtable (the managed equivalent), and Amazon DynamoDB (serverless, fully managed). HBase remains the strongest choice when you are already running an HDFS cluster and need tight integration with the Hadoop ecosystem — particularly for use cases where co-located analytics (via Hive or Spark) on the same data are important.
+---------------------------------------------------------------+
|                    APACHE HBASE ARCHITECTURE                  |
+---------------------------------------------------------------+
|                                                               |
|  ┌──────────────────────────────────────────┐                 |
|  │  CLIENT                                  │                 |
|  │  (HBase Shell, Java API, REST/Thrift)    │                 |
|  └───────────────┬──────────────────────────┘                 |
|                  │                                            |
|                  │ RPC                                        |
|                  ↓                                            |
|  ┌──────────────────────────────────────────┐                 |
|  │  HMASTER (Cluster Master)                │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  • Assign regions to RegionServers       │                 |
|  │  • Handle region splits/merges           │                 |
|  │  • Schema changes (create/delete tables) │                 |
|  │  • Load balancing                        │                 |
|  └──────────────────────────────────────────┘                 |
|                  ↑                                            |
|  ┌───────────────┴──────────────────────┐                     |
|  │         ZOOKEEPER                    │                     |
|  ├──────────────────────────────────────┤                     |
|  │  • Cluster coordination              │                     |
|  │  • Master election                   │                     |
|  │  • Region assignment tracking        │                     |
|  └──────────────────────────────────────┘                     |
|                  ↑                                            |
|  ┌───────────────┴──────────────────────────────┐             |
|  │  REGIONSERVER 1  │  REGIONSERVER 2  │  ...   │             |
|  ├──────────────────┼──────────────────┼────────┤             |
|  │  ┌────────────┐  │  ┌────────────┐  │        │             |
|  │  │ Region A   │  │  │ Region B   │  │        │             |
|  │  ├────────────┤  │  ├────────────┤  │        │             |
|  │  │ MemStore   │  │  │ MemStore   │  │        │             |
|  │  │ (in-memory)│  │  │ (in-memory)│  │        │             |
|  │  ├────────────┤  │  ├────────────┤  │        │             |
|  │  │ BlockCache │  │  │ BlockCache │  │        │             |
|  │  │ (reads)    │  │  │ (reads)    │  │        │             |
|  │  └────────────┘  │  └────────────┘  │        │             |
|  └──────────────────┴──────────────────┴────────┘             |
|                  ↓                                            |
|  ┌──────────────────────────────────────────┐                 |
|  │  HDFS (Persistent Storage)               │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  HFiles (SSTable format):                │                 |
|  │  • Immutable sorted files                │                 |
|  │  • Stored per column family              │                 |
|  │  • 3x replicated (HDFS)                  │                 |
|  └──────────────────────────────────────────┘                 |
|                                                               |
+---------------------------------------------------------------+

DATA MODEL:
──────────

Table: users
┌─────────────┬──────────────────────────────────────────┐
│ Row Key     │ Column Family: info    │ Column Family: │
│             │                        │ prefs          │
├─────────────┼────────────────────────┼────────────────┤
│ user_001    │ info:name = "Alice"    │ prefs:theme=   │
│             │ info:email= "a@x.com"  │   "dark"       │
│             │ ts=1234567890          │ ts=1234567891  │
├─────────────┼────────────────────────┼────────────────┤
│ user_002    │ info:name = "Bob"      │ prefs:lang=    │
│             │ info:email= "b@x.com"  │   "en"         │
│             │ ts=1234567892          │ ts=1234567893  │
└─────────────┴────────────────────────┴────────────────┘

KEY CONCEPTS:
────────────

• Row Key: Primary key, lexicographically sorted
• Column Family: Physical grouping of columns
• Column Qualifier: Column name within family
• Cell: (row, column family, column, timestamp) → value
• Timestamp: Version of cell (multi-versioning)
• Sparse: Rows can have different columns

HBase Operations

Create, Read, Update, Delete
// CREATE TABLE
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor tableDesc = new HTableDescriptor(TableName.valueOf("users"));
tableDesc.addFamily(new HColumnDescriptor("info"));
tableDesc.addFamily(new HColumnDescriptor("prefs"));
admin.createTable(tableDesc);


// PUT (insert or update)
HTable table = new HTable(conf, "users");
Put put = new Put(Bytes.toBytes("user_001"));
put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
              Bytes.toBytes("Alice"));
put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
              Bytes.toBytes("alice@example.com"));
table.put(put);


// GET (read single row)
Get get = new Get(Bytes.toBytes("user_001"));
Result result = table.get(get);
byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
System.out.println("Name: " + Bytes.toString(name));


// SCAN (read multiple rows)
Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("user_000"));
scan.setStopRow(Bytes.toBytes("user_100"));
ResultScanner scanner = table.getScanner(scan);
for (Result r : scanner) {
  // Process each row
}
scanner.close();


// DELETE
Delete delete = new Delete(Bytes.toBytes("user_001"));
table.delete(delete);
HBase Shell:
# Create table
create 'users', 'info', 'prefs'

# Put data
put 'users', 'user_001', 'info:name', 'Alice'
put 'users', 'user_001', 'info:email', 'alice@example.com'

# Get data
get 'users', 'user_001'

# Scan table
scan 'users', {STARTROW => 'user_000', STOPROW => 'user_100'}

# Count rows
count 'users'

# Delete row
deleteall 'users', 'user_001'

# Disable and drop table
disable 'users'
drop 'users'

Deep Dive: HBase Internals and Storage Engine

HBase is not a relational database; it is a Log-Structured Merge-Tree (LSM-Tree) based storage system. This architecture is optimized for high-write throughput and sequential disk I/O.

1. The Write Path: WAL and MemStore

Every write to HBase follows a strict “Persistence First” protocol to ensure data durability even if a RegionServer crashes.
  1. WAL (Write-Ahead Log): The write is first appended to a log on HDFS. If the server dies, this log is used to replay the data.
  2. MemStore: After the WAL is synced, the data is written to an in-memory sorted buffer called the MemStore.
  3. Acknowledgement: The client receives a success response as soon as the data is in the MemStore.

2. MemStore Flush and HFile Creation

When a MemStore reaches its threshold (e.g., 128MB), it is “flushed” to HDFS as an HFile.
  • HFile (SSTable): A sorted, immutable file. Once written, it is never changed.
  • BlockIndex: Each HFile contains an index of its data blocks for fast binary search during lookups.

3. LSM-Tree and Read Amplification

Because HFiles are immutable, a single row might have data scattered across multiple HFiles (e.g., an update in HFile 3 overriding a value in HFile 1).
  • Read Path: HBase must check the MemStore, then scan multiple HFiles, merging the results to find the latest version of a cell. This is known as Read Amplification.
  • Bloom Filters: To speed this up, HBase uses Bloom Filters to skip HFiles that definitely do not contain a specific row key.

4. Compaction: Managing File Bloat

To prevent the number of HFiles from growing indefinitely, HBase performs Compaction.
TypeActionImpact
Minor CompactionPicks a few small HFiles and merges them into one slightly larger HFile.Low I/O; cleans up some “Read Amplification”.
Major CompactionMerges ALL HFiles in a Column Family into a single file.High I/O; deletes expired cells and tombstones (deletes).

5. Data Locality and HDFS Interaction

HBase achieves “Local Reads” by scheduling RegionServers on the same nodes as their HDFS DataNodes. When a MemStore flushes, the first replica is written to the local disk. Over time, as regions move, HBase relies on the HDFS Balancer and major compactions to restore data locality.

6. The Mathematics of Region Sizing

A common failure in production HBase clusters is “Region Squatting”—having too many regions per RegionServer, which fragments the available MemStore memory. The Capacity Formula: The maximum number of regions a RegionServer can safely handle is bounded by the total MemStore heap: MaxRegions=RS_Heap×MemStore_FractionMemStore_Flush_Size×Num_Column_FamiliesMaxRegions = \frac{RS\_Heap \times MemStore\_Fraction}{MemStore\_Flush\_Size \times Num\_Column\_Families} Example:
  • Heap: 32GB
  • MemStore Fraction: 0.4 (40% of heap reserved for writes)
  • Flush Size: 128MB
  • Column Families: 2
  • Result: 32,768×0.4128×251\frac{32,768 \times 0.4}{128 \times 2} \approx 51 regions.
Consequence of Over-provisioning: If you host 500 regions on this server, each region only gets ~2.5MB of MemStore. This causes “Thundering Flushes” where the server spends all its time writing tiny HFiles, leading to massive compaction pressure and I/O saturation.

7. The Region Split Protocol (State Machine)

When a region exceeds the hbase.hregion.max.filesize (e.g., 10GB), it must split. This is a complex distributed transaction coordinated via ZooKeeper:
  1. PRE_SPLIT: RegionServer (RS) creates a split znode in ZooKeeper.
  2. OFFLINE: RS takes the parent region offline, stopping all writes.
  3. DAUGHTER_CREATION: RS creates two “Reference Files” (daughter regions) pointing to the parent’s HFiles. This is a metadata-only operation (very fast).
  4. OPEN_DAUGHTERS: RS opens the two daughter regions and begins serving requests.
  5. POST_SPLIT: RS updates the .META. table and deletes the parent region metadata.
  6. Compaction (Background): Over time, the daughter regions undergo compaction, which physically splits the parent HFiles into new, independent HFiles, eventually deleting the reference files.

Ecosystem Integration and Coordination

Beyond storage and processing, the ecosystem requires robust coordination and ingestion tools.

1. Apache ZooKeeper: The Glue

ZooKeeper is a distributed coordination service originally developed at Yahoo Research (around 2007) to solve the recurring problem of distributed coordination that every Hadoop component was implementing independently and badly. The name is intentional — it “keeps the zoo” of distributed processes in order. ZooKeeper provides a small set of primitives (ephemeral nodes, watches, sequential nodes) that can be composed to build higher-level coordination patterns like leader election, distributed locks, and barrier synchronization. Almost every major Hadoop component depends on it:
  • HBase: Master election and region server tracking.
  • HDFS: NameNode HA leader election via the ZooKeeper Failover Controller (ZKFC).
  • YARN: ResourceManager HA and state management.
  • Kafka: Broker coordination and topic metadata (though Kafka is migrating to its own Raft-based controller as of KRaft).
In modern cloud-native architectures, ZooKeeper’s role is shrinking. Kubernetes provides similar coordination via etcd. Kafka is removing its ZooKeeper dependency. But for on-premises Hadoop clusters, ZooKeeper remains the essential coordination backbone.

2. Apache Oozie: Workflow Orchestration

Oozie was developed at Yahoo to manage complex pipelines of Hadoop jobs. In production, a single “analytics run” is rarely one MapReduce job. It is a chain: ingest raw data, clean it, join with reference tables, aggregate, export to a serving database. Oozie provides the orchestration layer for these multi-step pipelines.
  • Workflow: A DAG of actions (Hive query -> Pig script -> MapReduce job).
  • Coordinator: Triggers workflows based on time or data availability (e.g., “Run every day at 2 AM” or “Run when the /sales/today folder exists”).
  • Bundle: Groups multiple coordinators that should be managed together.
Oozie’s XML-based workflow definitions are notoriously verbose and difficult to debug. In modern practice, Apache Airflow (developed at Airbnb, open-sourced in 2015) has largely replaced Oozie for new projects. Airflow offers Python-based DAG definitions, a superior web UI for monitoring, richer operator ecosystem, and cloud-native integrations. If you are starting a new data pipeline today, use Airflow (or its managed equivalents like Google Cloud Composer or Amazon MWAA). But Oozie remains deeply embedded in many existing Hadoop deployments.

3. Data Ingestion: Sqoop, Flume, and Kafka

  • Sqoop (SQL-to-Hadoop): Efficiently transfers bulk data between RDBMS (MySQL, Oracle) and HDFS/Hive using parallel MapReduce tasks. A typical use case is nightly ingestion of updated rows from a MySQL transactional database into a Hive warehouse table. Sqoop was retired as an Apache project in 2021, and modern alternatives include Apache Spark JDBC connectors, Debezium for CDC (Change Data Capture), and cloud-native services like AWS DMS.
  • Flume: A distributed service for collecting, aggregating, and moving large amounts of streaming log data into HDFS. Flume uses a source-channel-sink architecture where agents on application servers collect logs and reliably deliver them to HDFS. While still functional, Flume has been largely replaced by Apache Kafka for log aggregation due to Kafka’s superior throughput, replay capability, and ecosystem integration.
  • Apache Kafka: Originally developed at LinkedIn (around 2010), Kafka has become the de facto standard for real-time data ingestion and event streaming. Kafka Connect provides a pluggable framework for moving data between Kafka and external systems (databases, HDFS, S3, Elasticsearch) without writing custom code. In a modern Hadoop-adjacent architecture, Kafka typically sits as the real-time ingestion layer, with consumers writing to HDFS/S3 for batch processing and to streaming engines (Flink, Spark Structured Streaming) for real-time analytics.

Conclusion: The Modern Hadoop Stack

Today, the “Hadoop Ecosystem” has evolved significantly from its 2006-2014 golden era. While Hive remains the standard for SQL-on-HDFS, many processing tasks have shifted to Apache Spark due to its in-memory performance. Pig and Oozie are in maintenance mode, largely replaced by PySpark and Airflow respectively. HBase faces competition from managed services like DynamoDB and Cloud Bigtable. However, the core architectural principles of the Hadoop ecosystem — separation of storage (HDFS/S3), resource management (YARN/Kubernetes), and specialized processing engines — continue to define modern big data architecture. The modern “lakehouse” pattern (Delta Lake, Apache Iceberg, Apache Hudi) is essentially the Hadoop ecosystem pattern rebuilt on cloud object storage with ACID transactions. When you use Databricks, Snowflake, or BigQuery, you are using systems that learned their most important lessons from the Hadoop ecosystem’s successes and failures. The most important lesson from the Hadoop ecosystem is not any specific tool — it is the architectural pattern of decoupling storage, compute, and metadata. That pattern outlives every individual project in the stack.

Interview Deep-Dive

Strong Answer:The Hive Metastore stores all table and partition metadata in a backing RDBMS — typically MySQL or PostgreSQL. When a table has millions of partitions (common with high-cardinality partitioning like year/month/day/hour on a high-volume event stream), every query that touches that table must first ask the Metastore for the list of matching partitions. This becomes a problem in three ways.First, query planning latency. A query like SELECT * FROM events WHERE date = '2024-01-15' seems simple, but if the events table has 5 million partitions, the Metastore must scan its PARTITIONS table in the backing RDBMS to find matching entries. With the default DataNucleus ORM layer, this can take minutes before a single mapper even starts.Second, Metastore memory pressure. Fetching millions of partition objects into the Metastore JVM can cause OOM errors, especially when multiple concurrent queries hit the same large table.Third, lock contention. The Metastore uses database-level locks for partition creation and deletion, so concurrent writes to a heavily partitioned table serialize on the Metastore.Solutions, in order of impact: First, ensure partition pruning is always used — enforce policies that prevent full-table scans on partitioned tables. Second, enable Direct SQL mode in the Metastore (bypasses the DataNucleus ORM and issues raw SQL queries to the backing database), which can provide 10-100x speedup for partition listing. Third, consider Metastore federation — splitting metadata across multiple Metastore instances by database name. Fourth, for new architectures, migrate to Apache Iceberg or Delta Lake, which store partition metadata in manifest files rather than a centralized RDBMS, eliminating the Metastore bottleneck entirely. Iceberg’s hidden partitioning also avoids the “partition explosion” problem at the design level.Follow-up: How does Apache Iceberg solve the partition metadata problem differently than Hive?Iceberg stores table metadata in a hierarchy of metadata files (metadata.json, manifest lists, manifest files) on the storage layer itself (HDFS or S3), not in a centralized database. Each manifest file tracks a set of data files with their partition values and column-level statistics. Query planning reads only the manifest files needed for the query’s filter predicates, and this is done in parallel by the query engine, not serialized through a single Metastore process. The result is that partition listing scales with the number of matching partitions, not total partitions, and there is no centralized bottleneck. Additionally, Iceberg supports “hidden partitioning” where the partitioning scheme is defined as a transform on a column (e.g., days(timestamp)) rather than as an explicit directory structure, so users do not need to include partition columns in WHERE clauses.
Strong Answer:Both Parquet and ORC are columnar storage formats that achieve dramatic compression and query performance improvements over row-oriented formats like CSV or JSON. They share the same core insight: analytics queries typically read a few columns out of many, so storing data column-by-column lets you skip reading irrelevant columns entirely and apply column-specific compression (a column of integers compresses much better than a row of mixed types).Parquet was developed by Twitter and Cloudera (around 2013), inspired by Google’s Dremel paper. It uses a row group/column chunk structure where each row group (typically 128MB) contains column chunks, and each column chunk has pages with optional dictionary encoding, RLE (run-length encoding), and delta encoding. Parquet supports nested data natively through Dremel’s repetition and definition levels.ORC (Optimized Row Columnar) was developed at Hortonworks primarily for Hive. It uses a stripe/stream structure with built-in bloom filters, stripe-level statistics (min/max/sum/count), and file-level statistics. ORC has native support for ACID transactions in Hive 3+, which Parquet does not.Choice criteria: Use Parquet when working with a diverse tool ecosystem (Spark, Presto, Athena, BigQuery all have excellent Parquet support), when you have deeply nested data (Parquet handles nested structures more efficiently), or when building on cloud data lakes where Parquet is the de facto standard. Use ORC when you are primarily in a Hive-heavy environment, need ACID transactions in Hive, or are using the Hortonworks distribution. In benchmarks, ORC often shows slightly better compression ratios due to its more aggressive encoding, while Parquet tends to show slightly better read performance in non-Hive engines.In practice, the difference is small enough that ecosystem compatibility should drive the decision. The modern data lake world has standardized on Parquet.Follow-up: What is predicate pushdown and how do columnar formats enable it?Predicate pushdown means applying WHERE clause filters as early as possible — ideally during the file read itself, before data enters the query engine. Both Parquet and ORC store statistics (min/max values, null counts) at the row group or stripe level. When the query engine reads a file, it can check these statistics to skip entire row groups that cannot possibly contain matching rows. For example, if a stripe’s max value for the age column is 25, and the query has WHERE age > 30, the entire stripe is skipped without reading a single data byte. This turns a full-table scan into an effective index scan. Combined with partition pruning (skipping entire files based on directory structure), predicate pushdown can reduce the amount of data read by orders of magnitude.
Strong Answer:This is the classic Lambda Architecture problem, and the answer depends on your latency requirements and operational complexity budget.The core tension is that HDFS and Hive are optimized for batch analytics (high throughput, high latency), while serving lookups requires low latency (single-digit milliseconds). No single system does both well. The standard approach is to split the architecture.For the batch layer: raw events land in Kafka, and a consumer (Spark Structured Streaming or Flink) writes them to HDFS or S3 in Parquet format, partitioned by time. Hive or Spark SQL runs batch analytics on this historical data — aggregations, trend analysis, reporting. This handles the “analytics” requirement.For the serving layer: HBase sits on top of HDFS and provides random read/write access with millisecond latency. A derived dataset (precomputed aggregations, denormalized user profiles, feature vectors) is materialized into HBase by a batch job (Spark writing to HBase) or a streaming job (Flink directly writing to HBase). Application servers query HBase for low-latency lookups.For real-time updates: Kafka consumers (Flink or Spark Structured Streaming) process events as they arrive, update HBase in near-real-time (seconds of delay), and also write to the batch layer for historical completeness.The trade-offs: This architecture has operational complexity (three systems to maintain: Kafka, HDFS/Hive, HBase). Consistency between the batch and serving layers requires careful design. And HBase’s lack of secondary indexes means your lookup patterns must be known in advance and baked into the row key design.A modern alternative that simplifies this: use Apache Druid or Apache Pinot for the serving layer instead of HBase. Both are purpose-built for analytical lookups (group-by queries with filters) at low latency, and they can ingest directly from Kafka. This is the approach LinkedIn (Pinot) and Airbnb (Druid) have taken.Follow-up: What about the Kappa Architecture? When would you use it instead of Lambda?The Kappa Architecture eliminates the batch layer entirely and treats everything as a stream. All data flows through Kafka, and a single stream processing engine (Flink) handles both real-time processing and historical reprocessing (by replaying Kafka topics from the beginning). The serving layer is populated only from the stream processor. This reduces operational complexity (one processing path instead of two) but requires Kafka to retain data for long periods (expensive storage) and the stream processor to handle both real-time and backfill workloads (which have very different resource profiles). Use Kappa when your processing logic is naturally event-driven and your latency requirements are uniform across all queries. Use Lambda when you have distinct batch and real-time requirements with different SLAs.
Strong Answer:Row key design is the single most impactful performance decision in HBase because it determines data distribution across regions, write throughput, and read access patterns. HBase stores rows lexicographically sorted by row key, and regions (horizontal partitions) are defined by key ranges. A poorly designed row key causes two catastrophic problems: write hotspots (all writes go to one region) and read inefficiency (full table scans instead of targeted lookups).For an IoT sensor data use case — say, 10,000 sensors reporting temperature every second — the naive row key is timestamp_sensorId. This is terrible because all current writes have similar timestamps, so they all go to the same region (the one handling the “latest” key range). One RegionServer gets all the write load while others sit idle.My design approach: The primary access patterns for IoT data are typically (1) get the latest readings for a specific sensor, (2) get a time range of readings for a specific sensor, and (3) get readings across all sensors for a specific time window.For patterns 1 and 2, the row key should be sensorId_reverseTimestamp where reverseTimestamp is Long.MAX_VALUE - timestamp. This distributes writes across regions (different sensorIds hash to different regions) and enables efficient time-range scans for a specific sensor (scan from sensorId_reverseTimestamp_start to sensorId_reverseTimestamp_end). The reverse timestamp ensures the most recent data comes first in a scan, which matches the most common access pattern.For pattern 3 (cross-sensor queries), this row key design is poor because it requires scanning all regions. If this pattern is frequent, I would maintain a secondary index table with row key timeBucket_sensorId (where timeBucket is hourly or daily), or use a tool like Apache Phoenix that provides secondary indexing on HBase.Additional considerations: Pre-split the table into N regions based on the sensorId distribution to avoid the initial single-region bottleneck. If sensorIds are sequential integers, apply a hash prefix (e.g., MD5(sensorId).substring(0,4)_sensorId_reverseTimestamp) to ensure uniform distribution even with non-uniform sensorId assignment. Set TTL on the column family to auto-expire old data (e.g., 90 days) to prevent unbounded growth.Follow-up: How do compactions affect this workload, and how would you tune them?With high write throughput (10,000 sensors x 1 write/second = 10,000 writes/second), MemStores flush frequently, creating many small HFiles. Minor compactions merge these into larger files, and major compactions merge all files into one per column family per region. The risk is that major compactions on a high-write table cause I/O storms that degrade read latency. I would disable automatic major compactions (hbase.hregion.majorcompaction = 0) and schedule them during off-peak hours via a cron job. For minor compactions, I would tune hbase.hstore.compactionThreshold (number of files that triggers compaction) and hbase.hstore.compaction.max (max files to compact at once) to balance between read amplification and I/O overhead.