Chapter 5: Hadoop Ecosystem

The true power of Hadoop lies not just in HDFS and MapReduce, but in the rich ecosystem of tools built on top of it. This chapter explores the key components that transform Hadoop from a low-level distributed file system and processing framework into a complete big data platform. The Hadoop ecosystem did not emerge from a single grand design. It grew organically, project by project, as engineers at Facebook, Yahoo, LinkedIn, and dozens of other companies hit real production walls. Facebook engineers needed SQL analysts to query petabytes of clickstream data without learning Java — so they built Hive (around 2008). Yahoo researchers needed a scripting language for complex ETL pipelines that were too painful to express in raw MapReduce — so they built Pig (around 2006). The pattern repeated: every time a new class of problem proved too awkward or too slow with existing tools, someone built a new layer on top of HDFS and YARN. Understanding this history matters because it explains why the ecosystem looks the way it does — not a clean layered architecture designed on a whiteboard, but a living, evolving collection of tools where each one exists because a specific pain point demanded it.

Chapter Goals:

Understand the Hadoop ecosystem landscape
Master Hive for SQL on Hadoop
Learn Pig for data flow programming
Explore HBase for real-time NoSQL storage
Study workflow orchestration with Oozie
Understand data ingestion with Kafka and Flume
Compare ecosystem tools and when to use each

Ecosystem Overview

The Hadoop Stack

+---------------------------------------------------------------+
|                  HADOOP ECOSYSTEM STACK                       |
+---------------------------------------------------------------+
|                                                               |
|  ┌─────────────────────────────────────────────────────────┐  |
|  │              APPLICATIONS & USE CASES                   │  |
|  │  Business Intelligence, Analytics, ML, ETL, Reporting   │  |
|  └─────────────────────────────────────────────────────────┘  |
|                           ↑                                   |
|  ┌─────────────────────────────────────────────────────────┐  |
|  │              HIGH-LEVEL TOOLS                           │  |
|  ├─────────────────────────────────────────────────────────┤  |
|  │  SQL Query:  │ Scripting:  │ ML:       │ Graph:        │  |
|  │  • Hive      │ • Pig       │ • Mahout  │ • Giraph      │  |
|  │  • Impala    │ • Cascading │ • Spark   │               │  |
|  │  • Presto    │             │   MLlib   │               │  |
|  └──────────────┴─────────────┴───────────┴───────────────┘  |
|                           ↑                                   |
|  ┌─────────────────────────────────────────────────────────┐  |
|  │         DATA PROCESSING FRAMEWORKS                      │  |
|  ├─────────────────────────────────────────────────────────┤  |
|  │  Batch:      │ Stream:     │ Interactive:              │  |
|  │  • MapReduce │ • Storm     │ • Impala                  │  |
|  │  • Spark     │ • Flink     │ • Presto                  │  |
|  │  • Tez       │ • Samza     │ • Drill                   │  |
|  └──────────────┴─────────────┴───────────────────────────┘  |
|                           ↑                                   |
|  ┌─────────────────────────────────────────────────────────┐  |
|  │         RESOURCE MANAGEMENT & ORCHESTRATION             │  |
|  ├─────────────────────────────────────────────────────────┤  |
|  │  • YARN (Resource Manager)                              │  |
|  │  • Oozie (Workflow Scheduler)                           │  |
|  │  • ZooKeeper (Coordination)                             │  |
|  └─────────────────────────────────────────────────────────┘  |
|                           ↑                                   |
|  ┌─────────────────────────────────────────────────────────┐  |
|  │              STORAGE LAYER                              │  |
|  ├─────────────────────────────────────────────────────────┤  |
|  │  Distributed FS:  │ NoSQL:    │ Ingestion:            │  |
|  │  • HDFS           │ • HBase   │ • Kafka               │  |
|  │  • S3             │ • Kudu    │ • Flume               │  |
|  │                   │           │ • Sqoop               │  |
|  └───────────────────┴───────────┴───────────────────────┘  |
|                                                               |
+---------------------------------------------------------------+

GUIDING PRINCIPLE:
─────────────────
Each layer builds on lower layers.
Higher layers provide easier abstractions.
Lower layers provide more control and flexibility.

Why the Ecosystem Matters

Abstraction

Hide Complexity:

MapReduce requires Java programming
Hive provides SQL interface
Easier onboarding, faster development
Reach more users (analysts, not just engineers)

Productivity

Faster Development:

100 lines of Java MapReduce → 5 lines of SQL
Pig reduces code by 10-20x
Faster iteration, fewer bugs
Focus on logic, not plumbing

Specialization

Right Tool for Job:

Hive for SQL analytics
HBase for real-time access
Kafka for stream ingestion
Each optimized for specific use case

Innovation

Community-Driven:

Open source enables experimentation
Best tools emerge organically
Spark displaced MapReduce
Ecosystem evolves with needs

Apache Hive: SQL on Hadoop

What is Hive?

Apache Hive was created at Facebook in 2007 by Jeff Hammerbacher’s data team and later open-sourced in 2008. The motivation was straightforward: Facebook’s data warehouse was growing by terabytes per day, and the only way to query it was to write Java MapReduce programs. Most of the analysts who needed to answer business questions — “How many users clicked this ad in the last 30 days?” — knew SQL, not Java. Hive bridged that gap by providing a SQL dialect (HiveQL) that compiled to MapReduce jobs behind the scenes. This single decision — putting a SQL layer on top of MapReduce — arguably did more to accelerate Hadoop adoption than any other ecosystem project. It transformed Hadoop from an engineering tool into a data warehouse that business analysts could use directly. Hive provides a SQL interface (HiveQL) to data stored in HDFS, translating SQL queries into MapReduce, Tez, or Spark jobs.

+---------------------------------------------------------------+
|                    APACHE HIVE ARCHITECTURE                   |
+---------------------------------------------------------------+
|                                                               |
|  ┌──────────────────────────────────────────┐                 |
|  │  CLIENT (SQL Interface)                  │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  • Hive CLI                              │                 |
|  │  • Beeline (JDBC client)                 │                 |
|  │  • JDBC/ODBC drivers                     │                 |
|  │  • BI tools (Tableau, PowerBI)           │                 |
|  └─────────────────┬────────────────────────┘                 |
|                    │                                          |
|                    │ HiveQL query                             |
|                    ↓                                          |
|  ┌──────────────────────────────────────────┐                 |
|  │  HIVE SERVER (HiveServer2)               │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  ┌────────────────────────────────────┐  │                 |
|  │  │  Parser                            │  │                 |
|  │  │  (SQL → Abstract Syntax Tree)      │  │                 |
|  │  └────────────────────────────────────┘  │                 |
|  │  ┌────────────────────────────────────┐  │                 |
|  │  │  Semantic Analyzer                 │  │                 |
|  │  │  (Validate, resolve metadata)      │  │                 |
|  │  └────────────────────────────────────┘  │                 |
|  │  ┌────────────────────────────────────┐  │                 |
|  │  │  Logical Plan Generator            │  │                 |
|  │  │  (Operator tree)                   │  │                 |
|  │  └────────────────────────────────────┘  │                 |
|  │  ┌────────────────────────────────────┐  │                 |
|  │  │  Optimizer                         │  │                 |
|  │  │  (Predicate pushdown, join reorder)│  │                 |
|  │  └────────────────────────────────────┘  │                 |
|  │  ┌────────────────────────────────────┐  │                 |
|  │  │  Physical Plan (MapReduce/Tez)     │  │                 |
|  │  │  (Execution plan)                  │  │                 |
|  │  └────────────────────────────────────┘  │                 |
|  └──────────────────────────────────────────┘                 |
|                    │                                          |
|                    │ Submit MR/Tez job                        |
|                    ↓                                          |
|  ┌──────────────────────────────────────────┐                 |
|  │  EXECUTION ENGINE                        │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  • MapReduce (slow)                      │                 |
|  │  • Tez (faster, DAG-based)               │                 |
|  │  • Spark (fastest, in-memory)            │                 |
|  └──────────────────────────────────────────┘                 |
|                    ↓                                          |
|  ┌──────────────────────────────────────────┐                 |
|  │  METASTORE                               │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  Database: MySQL, PostgreSQL, Derby      │                 |
|  │  ┌────────────────────────────────────┐  │                 |
|  │  │  Tables:                           │  │                 |
|  │  │  • Table metadata                  │  │                 |
|  │  │  • Column types                    │  │                 |
|  │  │  • Partitions                      │  │                 |
|  │  │  • Storage location (HDFS path)    │  │                 |
|  │  │  • SerDe info                      │  │                 |
|  │  └────────────────────────────────────┘  │                 |
|  └──────────────────────────────────────────┘                 |
|                    ↓                                          |
|  ┌──────────────────────────────────────────┐                 |
|  │  STORAGE (HDFS)                          │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  /user/hive/warehouse/                   │                 |
|  │    ├─ sales/                             │                 |
|  │    │  ├─ year=2023/                      │                 |
|  │    │  │  └─ month=01/                    │                 |
|  │    │  │     └─ data.parquet              │                 |
|  │    └─ users/                             │                 |
|  │       └─ data.orc                        │                 |
|  └──────────────────────────────────────────┘                 |
|                                                               |
+---------------------------------------------------------------+

Hive Fundamentals

HiveQL Basics
Partitioning
File Formats
Optimization

SQL-Like Syntax

-- CREATE TABLE
CREATE TABLE employees (
  id INT,
  name STRING,
  salary DECIMAL(10,2),
  dept STRING,
  hire_date DATE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;


-- LOAD DATA
LOAD DATA INPATH '/user/data/employees.csv'
INTO TABLE employees;


-- SELECT QUERY
SELECT dept, AVG(salary) as avg_salary
FROM employees
WHERE hire_date >= '2020-01-01'
GROUP BY dept
HAVING AVG(salary) > 50000
ORDER BY avg_salary DESC;


-- JOIN
SELECT e.name, e.salary, d.dept_name
FROM employees e
JOIN departments d
ON e.dept = d.dept_id;


-- SUBQUERY
SELECT name, salary
FROM employees
WHERE salary > (
  SELECT AVG(salary) FROM employees
);

Behind the Scenes:Each query translates to MapReduce/Tez job:

SELECT dept, COUNT(*) FROM employees GROUP BY dept;

↓ Translates to:

Map Phase:
• Read employees table from HDFS
• For each row, emit (dept, 1)

Shuffle:
• Group by dept

Reduce Phase:
• For each dept, sum counts
• Write to HDFS

Result written to temp HDFS location.

Organizing Data for Performance

-- CREATE PARTITIONED TABLE
CREATE TABLE sales (
  transaction_id STRING,
  amount DECIMAL(10,2),
  customer_id STRING
)
PARTITIONED BY (year INT, month INT)
STORED AS PARQUET;


-- LOAD DATA INTO PARTITION
INSERT INTO TABLE sales
PARTITION (year=2023, month=1)
SELECT transaction_id, amount, customer_id
FROM raw_sales
WHERE year(transaction_date) = 2023
  AND month(transaction_date) = 1;

Directory Structure:

/user/hive/warehouse/sales/
  ├─ year=2023/
  │  ├─ month=1/
  │  │  └─ 000000_0.parquet
  │  ├─ month=2/
  │  │  └─ 000000_0.parquet
  │  └─ month=3/
  │     └─ 000000_0.parquet
  └─ year=2024/
     ├─ month=1/
     │  └─ 000000_0.parquet
     └─ month=2/
        └─ 000000_0.parquet

Query with Partition Pruning:

SELECT SUM(amount)
FROM sales
WHERE year = 2023 AND month = 1;

Performance Impact:

Without Partitioning:
• Scan entire table (all 24 months)
• Read: 10 TB
• Time: 30 minutes

With Partitioning:
• Scan only year=2023/month=1/
• Read: 400 GB
• Time: 1 minute

25x speedup!

Best Practices:

Partition by frequently filtered columns (date, region, category)
Don’t over-partition (too many small files)
Typical: 1-3 partition levels (e.g., year/month/day)
Avoid high-cardinality partitions (e.g., user_id)

Choosing Storage Format

-- TEXT (default, human-readable)
CREATE TABLE users_text (
  id INT,
  name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;


-- PARQUET (columnar, compressed)
CREATE TABLE users_parquet (
  id INT,
  name STRING,
  age INT,
  email STRING
)
STORED AS PARQUET;


-- ORC (Optimized Row Columnar)
CREATE TABLE users_orc (
  id INT,
  name STRING,
  age INT,
  email STRING
)
STORED AS ORC;


-- AVRO (schema evolution)
CREATE TABLE users_avro
STORED AS AVRO;

Format Comparison:

┌─────────┬──────────┬──────────┬─────────────┬────────┐
│ Format  │ Type     │ Compress │ Query Speed │ Write  │
├─────────┼──────────┼──────────┼─────────────┼────────┤
│ Text    │ Row      │ Poor     │ Slow        │ Fast   │
│ Sequence│ Row      │ Good     │ Medium      │ Fast   │
│ Avro    │ Row      │ Good     │ Medium      │ Medium │
│ Parquet │ Columnar │ Excellent│ Fast        │ Medium │
│ ORC     │ Columnar │ Excellent│ Fast        │ Medium │
└─────────┴──────────┴──────────┴─────────────┴────────┘


EXAMPLE: 1 Billion rows, 100 columns
─────────────────────────────────────

Text (CSV):
• Size: 1 TB
• Compression: ~700 GB (gzip)
• Query (SELECT col1, col2): Read 700 GB

Parquet:
• Size: 200 GB (built-in compression)
• Query (SELECT col1, col2): Read 4 GB (only 2 columns!)
• 175x faster query!


WHEN TO USE:
───────────

Text/CSV:
• Human-readable debugging
• One-time ingestion
• Small datasets

Parquet:
• Analytical queries (SELECT few columns)
• Long-term storage
• Production data lakes

ORC:
• Hive-specific optimizations
• ACID transactions (Hive 3+)
• Slightly better than Parquet in Hive

Avro:
• Schema evolution (add/remove columns)
• Streaming data (Kafka)
• Row-based processing

Making Queries Faster

-- 1. BUCKETING (hash partitioning)
CREATE TABLE users_bucketed (
  id INT,
  name STRING,
  email STRING
)
CLUSTERED BY (id) INTO 32 BUCKETS
STORED AS PARQUET;

-- Benefit: Efficient joins, sampling


-- 2. MAP-SIDE JOIN (for small tables)
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000; -- 25MB

SELECT /*+ MAPJOIN(small_table) */
  big_table.*, small_table.name
FROM big_table
JOIN small_table
ON big_table.id = small_table.id;

-- Small table loaded into memory on each mapper
-- No shuffle needed!


-- 3. VECTORIZATION (process 1024 rows at once)
SET hive.vectorized.execution.enabled=true;

-- Benefit: 3-5x speedup on analytical queries


-- 4. COST-BASED OPTIMIZATION (statistics)
ANALYZE TABLE employees COMPUTE STATISTICS;
ANALYZE TABLE employees COMPUTE STATISTICS FOR COLUMNS;

-- Hive uses stats to:
-- • Choose join order
-- • Decide map-side vs reduce-side join
-- • Optimize query plan


-- 5. TEZ EXECUTION ENGINE
SET hive.execution.engine=tez;  -- Default: mr

-- Tez benefits:
-- • DAG-based (vs multiple MR stages)
-- • Container reuse
-- • 2-10x faster than MapReduce


-- 6. PARTITION PRUNING
SELECT * FROM sales
WHERE year = 2023 AND month = 1;

-- Only reads year=2023/month=1/ partition


-- 7. PREDICATE PUSHDOWN
SELECT id, name FROM users
WHERE age > 30;

-- For Parquet/ORC:
-- • Filter applied during file read
-- • Skip entire row groups that don't match
-- • Don't even read non-matching data

Query Performance Tips:

SLOW QUERY:
──────────
SELECT u.name, o.amount
FROM big_orders o
JOIN small_users u
ON o.user_id = u.id
WHERE u.region = 'US';

Issues:
• Big table on left (not optimal)
• No hint for map-side join
• Filter after join (wasteful)


OPTIMIZED:
─────────
SELECT /*+ MAPJOIN(u) */ u.name, o.amount
FROM small_users u
JOIN big_orders o
ON u.id = o.user_id
WHERE u.region = 'US';

Improvements:
• Map-side join (small table in memory)
• Filter on small table first
• Small table on left side
• Result: 10-100x faster

Hive Metastore

Metastore Architecture

Centralized Metadata Repository

METASTORE ARCHITECTURE:
──────────────────────

┌──────────────────────────────────────┐
│  Hive Clients                        │
│  • Hive CLI                          │
│  • Beeline                           │
│  • Spark SQL                         │
│  • Impala                            │
│  • Presto                            │
└───────────────┬──────────────────────┘
                │
                │ Thrift API
                ↓
┌──────────────────────────────────────┐
│  Metastore Service                   │
│  (HiveMetastore)                     │
└───────────────┬──────────────────────┘
                │
                │ JDBC
                ↓
┌──────────────────────────────────────┐
│  Metastore Database                  │
│  (MySQL, PostgreSQL, Derby)          │
├──────────────────────────────────────┤
│  Tables:                             │
│  • DBS (databases)                   │
│  • TBLS (tables)                     │
│  • COLUMNS_V2 (columns)              │
│  • PARTITIONS (partition metadata)   │
│  • SDS (storage descriptors)         │
│  • SERDES (serialization info)       │
└──────────────────────────────────────┘


STORED METADATA:
───────────────

For each table:
• Database name
• Table name
• Column names and types
• Partition keys
• Storage location (HDFS path)
• File format (Parquet, ORC, etc.)
• SerDe class
• Bucket information
• Table statistics

Example Metadata:

CREATE TABLE sales (
  id INT,
  amount DECIMAL(10,2)
)
PARTITIONED BY (year INT, month INT)
STORED AS PARQUET
LOCATION '/user/hive/warehouse/sales';

Metastore Entries:

DBS table:
┌────────────────────────────────────┐
│ DB_ID | NAME    | DESC | LOCATION  │
├───────┼─────────┼──────┼───────────┤
│ 1     | default | ...  | /user/... │
└────────────────────────────────────┘

TBLS table:
┌───────────────────────────────────────────┐
│ TBL_ID | DB_ID | TBL_NAME | TBL_TYPE   │
├────────┼───────┼──────────┼────────────┤
│ 100    | 1     | sales    | MANAGED    │
└───────────────────────────────────────────┘

COLUMNS_V2 table:
┌─────────────────────────────────────────┐
│ CD_ID | COLUMN_NAME | TYPE_NAME         │
├───────┼─────────────┼───────────────────┤
│ 200   | id          | int               │
│ 200   | amount      | decimal(10,2)     │
└─────────────────────────────────────────┘

PARTITIONS table:
┌───────────────────────────────────────┐
│ PART_ID | TBL_ID | PART_NAME         │
├─────────┼────────┼───────────────────┤
│ 300     | 100    | year=2023/month=1 │
│ 301     | 100    | year=2023/month=2 │
└───────────────────────────────────────┘

SDS (Storage Descriptor):
┌──────────────────────────────────────────┐
│ SD_ID | LOCATION                         │
├───────┼──────────────────────────────────┤
│ 400   | /user/hive/warehouse/sales/...  │
└──────────────────────────────────────────┘

Managed vs External Tables

Table Ownership

-- MANAGED TABLE (Hive owns data)
CREATE TABLE managed_sales (
  id INT,
  amount DECIMAL
);

-- Data stored in Hive warehouse:
-- /user/hive/warehouse/managed_sales/

DROP TABLE managed_sales;
-- Data DELETED from HDFS!


-- EXTERNAL TABLE (Hive doesn't own data)
CREATE EXTERNAL TABLE external_sales (
  id INT,
  amount DECIMAL
)
LOCATION '/data/sales';

-- Data at /data/sales/ (outside Hive warehouse)

DROP TABLE external_sales;
-- Only metadata deleted, data remains!

When to Use Each:

MANAGED TABLES:
──────────────

Use when:
• Hive is primary consumer
• Want Hive to manage lifecycle
• Data is temporary/intermediate

Benefit:
• DROP TABLE cleans everything
• Simpler management


EXTERNAL TABLES:
───────────────

Use when:
• Multiple tools access data (Hive, Spark, Impala)
• Data produced outside Hive (Kafka, Flume)
• Production data (don't want accidental deletion)

Benefit:
• Data survives table drop
• Flexibility in storage location


BEST PRACTICE:
─────────────

Production: Use EXTERNAL tables
Reason: Prevents accidental data loss

SerDe (Serializer/Deserializer)

Reading Custom Formats

-- Default: LazySimpleSerDe (CSV-like)
CREATE TABLE default_format (
  id INT,
  name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';


-- JSON SerDe
CREATE TABLE json_data (
  id INT,
  name STRING,
  attributes MAP<STRING, STRING>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';

-- Can read JSON files directly!


-- RegEx SerDe (parse logs)
CREATE TABLE apache_logs (
  ip STRING,
  timestamp STRING,
  request STRING,
  status INT,
  bytes BIGINT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) [^ ]* [^ ]* \\[([^\\]]*)\\] \"([^\"]*)\" ([0-9]*) ([0-9]*)"
);


-- Parquet SerDe (built-in)
CREATE TABLE parquet_data (
  id INT,
  name STRING
)
STORED AS PARQUET;
-- Uses org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe


-- Custom SerDe (write your own!)
CREATE TABLE custom_format (
  ...
)
ROW FORMAT SERDE 'com.company.CustomSerDe'
WITH SERDEPROPERTIES (
  "field.delim" = "|",
  "escape.delim" = "\\"
);

SerDe Flow:

READ PATH:
─────────

HDFS File
  ↓
InputFormat (reads bytes)
  ↓
SerDe (deserialize bytes → objects)
  ↓
Hive Row Objects
  ↓
Query Processing


WRITE PATH:
──────────

Query Results
  ↓
Hive Row Objects
  ↓
SerDe (serialize objects → bytes)
  ↓
OutputFormat (write bytes)
  ↓
HDFS File

Deep Dive: Hive Metastore Scalability and the “Partition Explosion”

As data lakes grow to petabyte scale, the Hive Metastore often becomes the primary bottleneck in the entire stack.

1. The Partition Bottleneck

In a standard RDBMS-backed Metastore (MySQL/PostgreSQL), a query like SELECT * FROM sales WHERE year=2023 requires the Metastore to:

Lookup the table sales.
Scan the PARTITIONS table for all entries matching the filter.
Fetch the SDS (Storage Descriptor) for each partition to find the HDFS paths.

The Math of Failure:

If a table has 1,000,000 partitions (common in high-cardinality data), a single ad-hoc query might force the Metastore to load 1GB of metadata into memory just to plan the scan.
This leads to “Metastore OOM” and serialized query planning that can take minutes before a single mapper even starts.

2. Scaling Strategies

Metastore Federation: Splitting metadata across multiple Metastore instances based on database name.
Partition Pruning at the Source: Ensuring clients use partition-bound filters to avoid full metadata scans.
Direct SQL: Modern Hive versions use direct SQL queries to the backend DB instead of the slower DataNucleus ORM layer to fetch partitions.

Deep Dive: Hive Query Lifecycle and Execution

To understand Hive’s performance, one must look past the SQL interface into the transformation pipeline that converts a declarative query into a distributed DAG.

1. The Query Planning Pipeline

Hive’s “Compiler” is a multi-stage engine that performs sophisticated optimizations before any data is touched.

Stage	Action	Output
Parser	Tokenizes HiveQL using Antlr.	Abstract Syntax Tree (AST)
Semantic Analyzer	Resolves table names, column types, and partition metadata from the Metastore.	Query Block (QB) Tree
Logical Plan Gen	Converts QB tree into basic relational algebra operators (Filter, Join, Project).	Operator Tree (Initial)
Optimizer	Applies rules like Predicate Pushdown, Column Pruning, and Partition Pruning.	Operator Tree (Optimized)
Physical Plan Gen	Breaks the operator tree into executable tasks (MapReduce, Tez, or Spark).	Task Tree (DAG)

2. Tez vs. MapReduce: The DAG Revolution

While Hive 1.x relied on MapReduce, modern Hive (2.x/3.x) uses Apache Tez to eliminate the “HDFS barrier” between jobs.

MapReduce Barrier: Every stage in a complex query (e.g., multiple joins) must write intermediate data to HDFS, causing massive I/O overhead.
Tez DAG: Tez allows data to flow directly from one task to the next (e.g., Map -> Reduce -> Reduce) without intermediate HDFS writes. It uses a Directed Acyclic Graph of tasks.

3. LLAP (Live Long and Process)

Introduced in Hive 2.0, LLAP is a hybrid architecture that combines persistent query servers with standard YARN containers.

Persistent Daemons: Instead of starting a new JVM for every task (high latency), LLAP uses long-running daemons on worker nodes.
In-Memory Caching: LLAP caches columnar data (ORC/Parquet) in a smart, asynchronous cache, avoiding repetitive HDFS reads.
Vectorized Execution: LLAP processes data in batches of 1024 rows at a time using SIMD instructions, drastically reducing CPU cycles per row.

Feature	Standard Hive	Hive with LLAP
Startup Latency	High (Container launch)	Ultra-low (Always-on daemons)
Data Access	HDFS Scan	In-Memory Cache + HDFS
Execution	MapReduce/Tez Tasks	Fragment-based execution
Target Use Case	Large Batch ETL	Interactive BI / Sub-second SQL

Apache Pig: Data Flow Language

What is Pig?

Apache Pig was developed at Yahoo Research around 2006, making it one of the oldest Hadoop ecosystem projects. While Hive targeted SQL-literate analysts, Pig targeted a different audience: engineers building complex ETL pipelines where the step-by-step data flow mattered more than the declarative “what.” A typical Pig user was an engineer who needed to load data from three different sources, filter each one differently, join them in a specific order, apply custom transformations, and store the result — all in a way that was readable and debuggable. Writing this as a sequence of MapReduce jobs required hundreds of lines of boilerplate Java. Pig Latin reduced it to 10-20 lines of procedural data flow code. It is worth noting that Pig has largely been superseded by Apache Spark’s DataFrame API and PySpark, which offer the same procedural data flow model with significantly better performance (in-memory execution) and a broader ecosystem (Python integration, ML libraries). New projects should default to Spark. However, Pig scripts remain in production at many organizations, and understanding the data flow paradigm it pioneered is valuable. Pig provides a high-level scripting language (Pig Latin) for data transformations, compiling to MapReduce/Tez jobs.

+---------------------------------------------------------------+
|                      APACHE PIG ARCHITECTURE                  |
+---------------------------------------------------------------+
|                                                               |
|  ┌──────────────────────────────────────────┐                 |
|  │  PIG SCRIPT (Pig Latin)                  │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  data = LOAD '/input' AS (id, name);     │                 |
|  │  filtered = FILTER data BY id > 100;     │                 |
|  │  grouped = GROUP filtered BY name;       │                 |
|  │  counts = FOREACH grouped GENERATE       │                 |
|  │             group, COUNT(filtered);      │                 |
|  │  STORE counts INTO '/output';            │                 |
|  └──────────────────┬───────────────────────┘                 |
|                     │                                         |
|                     │ Submit                                  |
|                     ↓                                         |
|  ┌──────────────────────────────────────────┐                 |
|  │  PIG COMPILER                            │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  • Parser (Pig Latin → logical plan)     │                 |
|  │  • Optimizer (merge, push filters)       │                 |
|  │  • Physical plan (MR/Tez operators)      │                 |
|  │  • Generate MapReduce jobs               │                 |
|  └──────────────────┬───────────────────────┘                 |
|                     │                                         |
|                     │ Submit MR jobs                          |
|                     ↓                                         |
|  ┌──────────────────────────────────────────┐                 |
|  │  EXECUTION ENGINE                        │                 |
|  │  (MapReduce, Tez)                        │                 |
|  └──────────────────────────────────────────┘                 |
|                                                               |
+---------------------------------------------------------------+

DESIGN PHILOSOPHY:
─────────────────

• Procedural (not declarative like SQL)
• Dataflow-oriented (transformations on datasets)
• Schema-optional (can work without strict types)
• ETL-focused (Extract, Transform, Load)

Pig Latin Basics

Core Operations
Advanced Features
Pig vs Hive

Fundamental Pig Commands

-- LOAD data from HDFS
users = LOAD '/data/users.csv'
        USING PigStorage(',')
        AS (id:int, name:chararray, age:int, city:chararray);

-- FILTER rows
adults = FILTER users BY age >= 18;

-- FOREACH (transform each row)
names_ages = FOREACH adults GENERATE name, age;

-- GROUP BY
by_city = GROUP users BY city;

-- Result: {group: "NYC", users: {(1, "Alice", 30, "NYC"), (2, "Bob", 25, "NYC")}}

-- JOIN
orders = LOAD '/data/orders.csv'
         AS (order_id:int, user_id:int, amount:double);

joined = JOIN users BY id, orders BY user_id;

-- ORDER BY
sorted = ORDER users BY age DESC;

-- DISTINCT
unique_cities = DISTINCT (FOREACH users GENERATE city);

-- LIMIT
top_10 = LIMIT sorted 10;

-- STORE results
STORE top_10 INTO '/output' USING PigStorage('\t');

Example: Word Count in Pig

-- Load input
lines = LOAD '/input/books.txt' AS (line:chararray);

-- Split into words
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group by word
grouped = GROUP words BY word;

-- Count
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS count;

-- Sort by count descending
sorted = ORDER counts BY count DESC;

-- Store result
STORE sorted INTO '/output/wordcount';

Equivalent in MapReduce: ~200 lines of Java!

Power User Techniques

-- COGROUP (group multiple datasets)
users = LOAD '/users' AS (id, name);
orders = LOAD '/orders' AS (order_id, user_id, amount);

cogrouped = COGROUP users BY id, orders BY user_id;

-- Result: {group: 1, users: {(1,"Alice")}, orders: {(101,1,50), (102,1,75)}}


-- FLATTEN (unnest)
flattened = FOREACH cogrouped
            GENERATE group,
                     FLATTEN(users),
                     FLATTEN(orders);


-- UDF (User Defined Function)
REGISTER 'my_udfs.jar';

processed = FOREACH users
            GENERATE id, com.company.MyUDF(name) AS processed_name;


-- NESTED FOREACH
by_city = GROUP users BY city;

result = FOREACH by_city {
  sorted = ORDER users BY age DESC;
  top_3 = LIMIT sorted 3;
  GENERATE group AS city, top_3;
}


-- CROSS (Cartesian product)
colors = LOAD '/colors' AS (color:chararray);
sizes = LOAD '/sizes' AS (size:chararray);

combinations = CROSS colors, sizes;


-- UNION
dataset1 = LOAD '/data1' AS (id, value);
dataset2 = LOAD '/data2' AS (id, value);

combined = UNION dataset1, dataset2;


-- SAMPLE (random sampling)
sampled = SAMPLE users 0.1;  -- 10% sample


-- SPLIT (conditional split)
SPLIT users INTO
  young IF age < 30,
  old IF age >= 30;

When to Use Each

USE CASE: Calculate average order by city
─────────────────────────────────────────

HIVE (Declarative SQL):
──────────────────────

SELECT city, AVG(amount) as avg_order
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY city;


PIG (Procedural Dataflow):
─────────────────────────

users = LOAD '/users' AS (id, name, city);
orders = LOAD '/orders' AS (order_id, user_id, amount);
joined = JOIN users BY id, orders BY user_id;
grouped = GROUP joined BY city;
result = FOREACH grouped
         GENERATE group AS city, AVG(joined.amount) AS avg_order;
STORE result INTO '/output';


COMPARISON:
──────────

┌──────────────┬─────────────┬─────────────────┐
│ Aspect       │ Hive        │ Pig             │
├──────────────┼─────────────┼─────────────────┤
│ Language     │ SQL-like    │ Procedural      │
│ Users        │ Analysts    │ Engineers       │
│ Use Case     │ Analytics   │ ETL pipelines   │
│ Learning     │ Easy (SQL)  │ Moderate        │
│ Flexibility  │ Medium      │ High            │
│ Schema       │ Required    │ Optional        │
│ Optimization │ Automatic   │ Manual control  │
└──────────────┴─────────────┴─────────────────┘


CHOOSE HIVE WHEN:
────────────────

• Users know SQL
• Ad-hoc analytics
• BI tool integration
• Formal schemas
• Let optimizer decide plan


CHOOSE PIG WHEN:
───────────────

• Complex data flows
• ETL pipelines
• Iterative development
• Schema evolution
• Need fine-grained control
• Custom UDFs common


TREND:
─────

Both declining in favor of Apache Spark:
• Faster (in-memory)
• More expressive (DataFrames, SQL)
• Unified API (batch + stream)

But Hive/Pig still widely used in legacy systems.

Apache HBase: NoSQL on HDFS

What is HBase?

HBase is a distributed, column-oriented NoSQL database built on HDFS, modeled after Google’s Bigtable paper (2006). It was originally developed by Powerset (later acquired by Microsoft) around 2007 and became an Apache top-level project in 2010. The fundamental problem HBase solved was that HDFS is designed for batch processing — high throughput sequential reads and writes — but many real applications need random, low-latency access to individual records. Facebook famously used HBase to power its messaging platform (around 2010), storing hundreds of billions of messages with single-digit millisecond read latencies. The key insight was layering an LSM-Tree based storage engine on top of HDFS, combining the durability and fault tolerance of the distributed file system with the random access performance of an in-memory write buffer. In modern architecture, HBase competes with Apache Cassandra (better for multi-region deployments and tunable consistency), Google Cloud Bigtable (the managed equivalent), and Amazon DynamoDB (serverless, fully managed). HBase remains the strongest choice when you are already running an HDFS cluster and need tight integration with the Hadoop ecosystem — particularly for use cases where co-located analytics (via Hive or Spark) on the same data are important.

+---------------------------------------------------------------+
|                    APACHE HBASE ARCHITECTURE                  |
+---------------------------------------------------------------+
|                                                               |
|  ┌──────────────────────────────────────────┐                 |
|  │  CLIENT                                  │                 |
|  │  (HBase Shell, Java API, REST/Thrift)    │                 |
|  └───────────────┬──────────────────────────┘                 |
|                  │                                            |
|                  │ RPC                                        |
|                  ↓                                            |
|  ┌──────────────────────────────────────────┐                 |
|  │  HMASTER (Cluster Master)                │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  • Assign regions to RegionServers       │                 |
|  │  • Handle region splits/merges           │                 |
|  │  • Schema changes (create/delete tables) │                 |
|  │  • Load balancing                        │                 |
|  └──────────────────────────────────────────┘                 |
|                  ↑                                            |
|  ┌───────────────┴──────────────────────┐                     |
|  │         ZOOKEEPER                    │                     |
|  ├──────────────────────────────────────┤                     |
|  │  • Cluster coordination              │                     |
|  │  • Master election                   │                     |
|  │  • Region assignment tracking        │                     |
|  └──────────────────────────────────────┘                     |
|                  ↑                                            |
|  ┌───────────────┴──────────────────────────────┐             |
|  │  REGIONSERVER 1  │  REGIONSERVER 2  │  ...   │             |
|  ├──────────────────┼──────────────────┼────────┤             |
|  │  ┌────────────┐  │  ┌────────────┐  │        │             |
|  │  │ Region A   │  │  │ Region B   │  │        │             |
|  │  ├────────────┤  │  ├────────────┤  │        │             |
|  │  │ MemStore   │  │  │ MemStore   │  │        │             |
|  │  │ (in-memory)│  │  │ (in-memory)│  │        │             |
|  │  ├────────────┤  │  ├────────────┤  │        │             |
|  │  │ BlockCache │  │  │ BlockCache │  │        │             |
|  │  │ (reads)    │  │  │ (reads)    │  │        │             |
|  │  └────────────┘  │  └────────────┘  │        │             |
|  └──────────────────┴──────────────────┴────────┘             |
|                  ↓                                            |
|  ┌──────────────────────────────────────────┐                 |
|  │  HDFS (Persistent Storage)               │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  HFiles (SSTable format):                │                 |
|  │  • Immutable sorted files                │                 |
|  │  • Stored per column family              │                 |
|  │  • 3x replicated (HDFS)                  │                 |
|  └──────────────────────────────────────────┘                 |
|                                                               |
+---------------------------------------------------------------+

DATA MODEL:
──────────

Table: users
┌─────────────┬──────────────────────────────────────────┐
│ Row Key     │ Column Family: info    │ Column Family: │
│             │                        │ prefs          │
├─────────────┼────────────────────────┼────────────────┤
│ user_001    │ info:name = "Alice"    │ prefs:theme=   │
│             │ info:email= "a@x.com"  │   "dark"       │
│             │ ts=1234567890          │ ts=1234567891  │
├─────────────┼────────────────────────┼────────────────┤
│ user_002    │ info:name = "Bob"      │ prefs:lang=    │
│             │ info:email= "b@x.com"  │   "en"         │
│             │ ts=1234567892          │ ts=1234567893  │
└─────────────┴────────────────────────┴────────────────┘

KEY CONCEPTS:
────────────

• Row Key: Primary key, lexicographically sorted
• Column Family: Physical grouping of columns
• Column Qualifier: Column name within family
• Cell: (row, column family, column, timestamp) → value
• Timestamp: Version of cell (multi-versioning)
• Sparse: Rows can have different columns

HBase Operations

Basic CRUD
Row Key Design
HBase vs HDFS/RDBMS

Create, Read, Update, Delete

// CREATE TABLE
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor tableDesc = new HTableDescriptor(TableName.valueOf("users"));
tableDesc.addFamily(new HColumnDescriptor("info"));
tableDesc.addFamily(new HColumnDescriptor("prefs"));
admin.createTable(tableDesc);


// PUT (insert or update)
HTable table = new HTable(conf, "users");
Put put = new Put(Bytes.toBytes("user_001"));
put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
              Bytes.toBytes("Alice"));
put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
              Bytes.toBytes("alice@example.com"));
table.put(put);


// GET (read single row)
Get get = new Get(Bytes.toBytes("user_001"));
Result result = table.get(get);
byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
System.out.println("Name: " + Bytes.toString(name));


// SCAN (read multiple rows)
Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("user_000"));
scan.setStopRow(Bytes.toBytes("user_100"));
ResultScanner scanner = table.getScanner(scan);
for (Result r : scanner) {
  // Process each row
}
scanner.close();


// DELETE
Delete delete = new Delete(Bytes.toBytes("user_001"));
table.delete(delete);

HBase Shell:

# Create table
create 'users', 'info', 'prefs'

# Put data
put 'users', 'user_001', 'info:name', 'Alice'
put 'users', 'user_001', 'info:email', 'alice@example.com'

# Get data
get 'users', 'user_001'

# Scan table
scan 'users', {STARTROW => 'user_000', STOPROW => 'user_100'}

# Count rows
count 'users'

# Delete row
deleteall 'users', 'user_001'

# Disable and drop table
disable 'users'
drop 'users'

Critical for Performance

PROBLEM: Sequential Row Keys
───────────────────────────

Bad Design:
Row keys: timestamp_00001, timestamp_00002, ...

Issue:
• All writes go to SAME region (hotspot!)
• Region 1: [timestamp_00001 to timestamp_10000]
• Region 2: [timestamp_10001 to timestamp_20000]
• New writes always hit Region 2
• Other regions idle


SOLUTION 1: Salting (hash prefix)
─────────────────────────────────

Original: timestamp_00001
Salted:   {hash(timestamp) % 10}_timestamp_00001

Result:
• Writes distributed across 10 regions
• No hotspot!

Trade-off:
• Range scans more complex (need to query all salts)


SOLUTION 2: Reverse Timestamp
─────────────────────────────

For most recent data access:

Row key: {Long.MAX_VALUE - timestamp}_userId

Result:
• Recent data has lowest row key
• Scan from beginning gets newest first


SOLUTION 3: Composite Keys
──────────────────────────

userId_timestamp

Benefits:
• All user's data together
• Can scan user's history efficiently
• Natural distribution (different users)


BEST PRACTICES:
──────────────

1. Avoid sequential IDs
   Bad:  1, 2, 3, 4, ...
   Good: uuid(), hash(id), ...

2. Consider access patterns
   • Frequent: Recent user events
   • Row key: userId_reverseTimestamp

3. Pre-split table
   • Create regions upfront
   • Distribute load from start

4. Keep row keys short
   • Stored in every cell
   • Long keys waste space


EXAMPLE: URL Shortener
─────────────────────

Bad:  shortCode (e.g., "abc123")
Why:  All reads/writes hit same region initially

Good: {hash(shortCode) % 100}_shortCode
Why:  100-way distribution from start

When to Use HBase

HBASE vs HDFS:
─────────────

┌──────────────────┬───────────┬──────────────┐
│ Feature          │ HDFS      │ HBase        │
├──────────────────┼───────────┼──────────────┤
│ Access Pattern   │ Batch     │ Random R/W   │
│ Latency          │ Minutes   │ Milliseconds │
│ Updates          │ Append    │ Yes          │
│ Deletes          │ No        │ Yes          │
│ Point Lookups    │ Slow      │ Fast         │
│ Full Scans       │ Fast      │ Slow         │
│ Use Case         │ Analytics │ Serving      │
└──────────────────┴───────────┴──────────────┘


HBASE vs RDBMS:
──────────────

┌──────────────────┬────────────┬──────────────┐
│ Feature          │ RDBMS      │ HBase        │
├──────────────────┼────────────┼──────────────┤
│ Schema           │ Strict     │ Flexible     │
│ Transactions     │ ACID       │ Row-level    │
│ Joins            │ Yes        │ No (manual)  │
│ Secondary Index  │ Yes        │ Limited      │
│ Scale            │ Vertical   │ Horizontal   │
│ Consistency      │ Strong     │ Eventual*    │
│ Max Data         │ TBs        │ PBs          │
└──────────────────┴────────────┴──────────────┘

* HBase provides strong consistency within single row


USE HBASE WHEN:
──────────────

✓ Need random read/write at scale (billions of rows)
✓ Sparse data (rows with different columns)
✓ Time-series data (append-heavy)
✓ Key-value lookups (user profiles, session data)
✓ Write-heavy workloads
✓ Horizontal scalability required


DON'T USE HBASE WHEN:
────────────────────

✗ Need complex joins
✗ Need ACID transactions across rows
✗ Small dataset (< 1 billion rows)
✗ Primarily analytical queries (use Hive)
✗ Need secondary indexes (use Cassandra)


COMMON USE CASES:
────────────────

• User profiles (billions of users)
• Message/email storage (Facebook Messages)
• Time-series data (sensor data, logs)
• Recommendation systems (feature vectors)
• Content management (versioned documents)
• Social graphs (following relationships)

Deep Dive: HBase Internals and Storage Engine

HBase is not a relational database; it is a Log-Structured Merge-Tree (LSM-Tree) based storage system. This architecture is optimized for high-write throughput and sequential disk I/O.

1. The Write Path: WAL and MemStore

Every write to HBase follows a strict “Persistence First” protocol to ensure data durability even if a RegionServer crashes.

WAL (Write-Ahead Log): The write is first appended to a log on HDFS. If the server dies, this log is used to replay the data.
MemStore: After the WAL is synced, the data is written to an in-memory sorted buffer called the MemStore.
Acknowledgement: The client receives a success response as soon as the data is in the MemStore.

2. MemStore Flush and HFile Creation

When a MemStore reaches its threshold (e.g., 128MB), it is “flushed” to HDFS as an HFile.

HFile (SSTable): A sorted, immutable file. Once written, it is never changed.
BlockIndex: Each HFile contains an index of its data blocks for fast binary search during lookups.

3. LSM-Tree and Read Amplification

Because HFiles are immutable, a single row might have data scattered across multiple HFiles (e.g., an update in HFile 3 overriding a value in HFile 1).

Read Path: HBase must check the MemStore, then scan multiple HFiles, merging the results to find the latest version of a cell. This is known as Read Amplification.
Bloom Filters: To speed this up, HBase uses Bloom Filters to skip HFiles that definitely do not contain a specific row key.

4. Compaction: Managing File Bloat

To prevent the number of HFiles from growing indefinitely, HBase performs Compaction.

Type	Action	Impact
Minor Compaction	Picks a few small HFiles and merges them into one slightly larger HFile.	Low I/O; cleans up some “Read Amplification”.
Major Compaction	Merges ALL HFiles in a Column Family into a single file.	High I/O; deletes expired cells and tombstones (deletes).

5. Data Locality and HDFS Interaction

HBase achieves “Local Reads” by scheduling RegionServers on the same nodes as their HDFS DataNodes. When a MemStore flushes, the first replica is written to the local disk. Over time, as regions move, HBase relies on the HDFS Balancer and major compactions to restore data locality.

6. The Mathematics of Region Sizing

A common failure in production HBase clusters is “Region Squatting”—having too many regions per RegionServer, which fragments the available MemStore memory. The Capacity Formula: The maximum number of regions a RegionServer can safely handle is bounded by the total MemStore heap:

MaxRegions = \frac{RS\_Heap \times MemStore\_Fraction}{MemStore\_Flush\_Size \times Num\_Column\_Families}

Example:

Heap: 32GB
MemStore Fraction: 0.4 (40% of heap reserved for writes)
Flush Size: 128MB
Column Families: 2
Result: $\frac{32,768 \times 0.4}{128 \times 2} \approx 51$ regions.

Consequence of Over-provisioning: If you host 500 regions on this server, each region only gets ~2.5MB of MemStore. This causes “Thundering Flushes” where the server spends all its time writing tiny HFiles, leading to massive compaction pressure and I/O saturation.

7. The Region Split Protocol (State Machine)

When a region exceeds the hbase.hregion.max.filesize (e.g., 10GB), it must split. This is a complex distributed transaction coordinated via ZooKeeper:

PRE_SPLIT: RegionServer (RS) creates a split znode in ZooKeeper.
OFFLINE: RS takes the parent region offline, stopping all writes.
DAUGHTER_CREATION: RS creates two “Reference Files” (daughter regions) pointing to the parent’s HFiles. This is a metadata-only operation (very fast).
OPEN_DAUGHTERS: RS opens the two daughter regions and begins serving requests.
POST_SPLIT: RS updates the .META. table and deletes the parent region metadata.
Compaction (Background): Over time, the daughter regions undergo compaction, which physically splits the parent HFiles into new, independent HFiles, eventually deleting the reference files.

Ecosystem Integration and Coordination

Beyond storage and processing, the ecosystem requires robust coordination and ingestion tools.

1. Apache ZooKeeper: The Glue

ZooKeeper is a distributed coordination service originally developed at Yahoo Research (around 2007) to solve the recurring problem of distributed coordination that every Hadoop component was implementing independently and badly. The name is intentional — it “keeps the zoo” of distributed processes in order. ZooKeeper provides a small set of primitives (ephemeral nodes, watches, sequential nodes) that can be composed to build higher-level coordination patterns like leader election, distributed locks, and barrier synchronization. Almost every major Hadoop component depends on it:

HBase: Master election and region server tracking.
HDFS: NameNode HA leader election via the ZooKeeper Failover Controller (ZKFC).
YARN: ResourceManager HA and state management.
Kafka: Broker coordination and topic metadata (though Kafka is migrating to its own Raft-based controller as of KRaft).

In modern cloud-native architectures, ZooKeeper’s role is shrinking. Kubernetes provides similar coordination via etcd. Kafka is removing its ZooKeeper dependency. But for on-premises Hadoop clusters, ZooKeeper remains the essential coordination backbone.

2. Apache Oozie: Workflow Orchestration

Oozie was developed at Yahoo to manage complex pipelines of Hadoop jobs. In production, a single “analytics run” is rarely one MapReduce job. It is a chain: ingest raw data, clean it, join with reference tables, aggregate, export to a serving database. Oozie provides the orchestration layer for these multi-step pipelines.

Workflow: A DAG of actions (Hive query -> Pig script -> MapReduce job).
Coordinator: Triggers workflows based on time or data availability (e.g., “Run every day at 2 AM” or “Run when the /sales/today folder exists”).
Bundle: Groups multiple coordinators that should be managed together.

Oozie’s XML-based workflow definitions are notoriously verbose and difficult to debug. In modern practice, Apache Airflow (developed at Airbnb, open-sourced in 2015) has largely replaced Oozie for new projects. Airflow offers Python-based DAG definitions, a superior web UI for monitoring, richer operator ecosystem, and cloud-native integrations. If you are starting a new data pipeline today, use Airflow (or its managed equivalents like Google Cloud Composer or Amazon MWAA). But Oozie remains deeply embedded in many existing Hadoop deployments.

3. Data Ingestion: Sqoop, Flume, and Kafka

Sqoop (SQL-to-Hadoop): Efficiently transfers bulk data between RDBMS (MySQL, Oracle) and HDFS/Hive using parallel MapReduce tasks. A typical use case is nightly ingestion of updated rows from a MySQL transactional database into a Hive warehouse table. Sqoop was retired as an Apache project in 2021, and modern alternatives include Apache Spark JDBC connectors, Debezium for CDC (Change Data Capture), and cloud-native services like AWS DMS.
Flume: A distributed service for collecting, aggregating, and moving large amounts of streaming log data into HDFS. Flume uses a source-channel-sink architecture where agents on application servers collect logs and reliably deliver them to HDFS. While still functional, Flume has been largely replaced by Apache Kafka for log aggregation due to Kafka’s superior throughput, replay capability, and ecosystem integration.
Apache Kafka: Originally developed at LinkedIn (around 2010), Kafka has become the de facto standard for real-time data ingestion and event streaming. Kafka Connect provides a pluggable framework for moving data between Kafka and external systems (databases, HDFS, S3, Elasticsearch) without writing custom code. In a modern Hadoop-adjacent architecture, Kafka typically sits as the real-time ingestion layer, with consumers writing to HDFS/S3 for batch processing and to streaming engines (Flink, Spark Structured Streaming) for real-time analytics.

Conclusion: The Modern Hadoop Stack

Today, the “Hadoop Ecosystem” has evolved significantly from its 2006-2014 golden era. While Hive remains the standard for SQL-on-HDFS, many processing tasks have shifted to Apache Spark due to its in-memory performance. Pig and Oozie are in maintenance mode, largely replaced by PySpark and Airflow respectively. HBase faces competition from managed services like DynamoDB and Cloud Bigtable. However, the core architectural principles of the Hadoop ecosystem — separation of storage (HDFS/S3), resource management (YARN/Kubernetes), and specialized processing engines — continue to define modern big data architecture. The modern “lakehouse” pattern (Delta Lake, Apache Iceberg, Apache Hudi) is essentially the Hadoop ecosystem pattern rebuilt on cloud object storage with ACID transactions. When you use Databricks, Snowflake, or BigQuery, you are using systems that learned their most important lessons from the Hadoop ecosystem’s successes and failures. The most important lesson from the Hadoop ecosystem is not any specific tool — it is the architectural pattern of decoupling storage, compute, and metadata. That pattern outlives every individual project in the stack.

Interview Deep-Dive

The Hive Metastore has become a critical bottleneck in your data lake with millions of partitions. Walk me through the problem and how you would solve it.

Strong Answer:The Hive Metastore stores all table and partition metadata in a backing RDBMS — typically MySQL or PostgreSQL. When a table has millions of partitions (common with high-cardinality partitioning like year/month/day/hour on a high-volume event stream), every query that touches that table must first ask the Metastore for the list of matching partitions. This becomes a problem in three ways.First, query planning latency. A query like SELECT * FROM events WHERE date = '2024-01-15' seems simple, but if the events table has 5 million partitions, the Metastore must scan its PARTITIONS table in the backing RDBMS to find matching entries. With the default DataNucleus ORM layer, this can take minutes before a single mapper even starts.Second, Metastore memory pressure. Fetching millions of partition objects into the Metastore JVM can cause OOM errors, especially when multiple concurrent queries hit the same large table.Third, lock contention. The Metastore uses database-level locks for partition creation and deletion, so concurrent writes to a heavily partitioned table serialize on the Metastore.Solutions, in order of impact: First, ensure partition pruning is always used — enforce policies that prevent full-table scans on partitioned tables. Second, enable Direct SQL mode in the Metastore (bypasses the DataNucleus ORM and issues raw SQL queries to the backing database), which can provide 10-100x speedup for partition listing. Third, consider Metastore federation — splitting metadata across multiple Metastore instances by database name. Fourth, for new architectures, migrate to Apache Iceberg or Delta Lake, which store partition metadata in manifest files rather than a centralized RDBMS, eliminating the Metastore bottleneck entirely. Iceberg’s hidden partitioning also avoids the “partition explosion” problem at the design level.Follow-up: How does Apache Iceberg solve the partition metadata problem differently than Hive?Iceberg stores table metadata in a hierarchy of metadata files (metadata.json, manifest lists, manifest files) on the storage layer itself (HDFS or S3), not in a centralized database. Each manifest file tracks a set of data files with their partition values and column-level statistics. Query planning reads only the manifest files needed for the query’s filter predicates, and this is done in parallel by the query engine, not serialized through a single Metastore process. The result is that partition listing scales with the number of matching partitions, not total partitions, and there is no centralized bottleneck. Additionally, Iceberg supports “hidden partitioning” where the partitioning scheme is defined as a transform on a column (e.g., days(timestamp)) rather than as an explicit directory structure, so users do not need to include partition columns in WHERE clauses.

Compare the columnar file formats Parquet and ORC. When would you choose one over the other, and how do they achieve their compression ratios?

Strong Answer:Both Parquet and ORC are columnar storage formats that achieve dramatic compression and query performance improvements over row-oriented formats like CSV or JSON. They share the same core insight: analytics queries typically read a few columns out of many, so storing data column-by-column lets you skip reading irrelevant columns entirely and apply column-specific compression (a column of integers compresses much better than a row of mixed types).Parquet was developed by Twitter and Cloudera (around 2013), inspired by Google’s Dremel paper. It uses a row group/column chunk structure where each row group (typically 128MB) contains column chunks, and each column chunk has pages with optional dictionary encoding, RLE (run-length encoding), and delta encoding. Parquet supports nested data natively through Dremel’s repetition and definition levels.ORC (Optimized Row Columnar) was developed at Hortonworks primarily for Hive. It uses a stripe/stream structure with built-in bloom filters, stripe-level statistics (min/max/sum/count), and file-level statistics. ORC has native support for ACID transactions in Hive 3+, which Parquet does not.Choice criteria: Use Parquet when working with a diverse tool ecosystem (Spark, Presto, Athena, BigQuery all have excellent Parquet support), when you have deeply nested data (Parquet handles nested structures more efficiently), or when building on cloud data lakes where Parquet is the de facto standard. Use ORC when you are primarily in a Hive-heavy environment, need ACID transactions in Hive, or are using the Hortonworks distribution. In benchmarks, ORC often shows slightly better compression ratios due to its more aggressive encoding, while Parquet tends to show slightly better read performance in non-Hive engines.In practice, the difference is small enough that ecosystem compatibility should drive the decision. The modern data lake world has standardized on Parquet.Follow-up: What is predicate pushdown and how do columnar formats enable it?Predicate pushdown means applying WHERE clause filters as early as possible — ideally during the file read itself, before data enters the query engine. Both Parquet and ORC store statistics (min/max values, null counts) at the row group or stripe level. When the query engine reads a file, it can check these statistics to skip entire row groups that cannot possibly contain matching rows. For example, if a stripe’s max value for the age column is 25, and the query has WHERE age > 30, the entire stripe is skipped without reading a single data byte. This turns a full-table scan into an effective index scan. Combined with partition pruning (skipping entire files based on directory structure), predicate pushdown can reduce the amount of data read by orders of magnitude.

You are designing a real-time analytics system that needs to serve low-latency lookups on billions of rows while also supporting batch analytics. How would you architect this using the Hadoop ecosystem?

Strong Answer:This is the classic Lambda Architecture problem, and the answer depends on your latency requirements and operational complexity budget.The core tension is that HDFS and Hive are optimized for batch analytics (high throughput, high latency), while serving lookups requires low latency (single-digit milliseconds). No single system does both well. The standard approach is to split the architecture.For the batch layer: raw events land in Kafka, and a consumer (Spark Structured Streaming or Flink) writes them to HDFS or S3 in Parquet format, partitioned by time. Hive or Spark SQL runs batch analytics on this historical data — aggregations, trend analysis, reporting. This handles the “analytics” requirement.For the serving layer: HBase sits on top of HDFS and provides random read/write access with millisecond latency. A derived dataset (precomputed aggregations, denormalized user profiles, feature vectors) is materialized into HBase by a batch job (Spark writing to HBase) or a streaming job (Flink directly writing to HBase). Application servers query HBase for low-latency lookups.For real-time updates: Kafka consumers (Flink or Spark Structured Streaming) process events as they arrive, update HBase in near-real-time (seconds of delay), and also write to the batch layer for historical completeness.The trade-offs: This architecture has operational complexity (three systems to maintain: Kafka, HDFS/Hive, HBase). Consistency between the batch and serving layers requires careful design. And HBase’s lack of secondary indexes means your lookup patterns must be known in advance and baked into the row key design.A modern alternative that simplifies this: use Apache Druid or Apache Pinot for the serving layer instead of HBase. Both are purpose-built for analytical lookups (group-by queries with filters) at low latency, and they can ingest directly from Kafka. This is the approach LinkedIn (Pinot) and Airbnb (Druid) have taken.Follow-up: What about the Kappa Architecture? When would you use it instead of Lambda?The Kappa Architecture eliminates the batch layer entirely and treats everything as a stream. All data flows through Kafka, and a single stream processing engine (Flink) handles both real-time processing and historical reprocessing (by replaying Kafka topics from the beginning). The serving layer is populated only from the stream processor. This reduces operational complexity (one processing path instead of two) but requires Kafka to retain data for long periods (expensive storage) and the stream processor to handle both real-time and backfill workloads (which have very different resource profiles). Use Kappa when your processing logic is naturally event-driven and your latency requirements are uniform across all queries. Use Lambda when you have distinct batch and real-time requirements with different SLAs.

HBase row key design is critical for performance. Explain why, and walk through how you would design a row key for a time-series IoT sensor data use case.

Strong Answer:Row key design is the single most impactful performance decision in HBase because it determines data distribution across regions, write throughput, and read access patterns. HBase stores rows lexicographically sorted by row key, and regions (horizontal partitions) are defined by key ranges. A poorly designed row key causes two catastrophic problems: write hotspots (all writes go to one region) and read inefficiency (full table scans instead of targeted lookups).For an IoT sensor data use case — say, 10,000 sensors reporting temperature every second — the naive row key is timestamp_sensorId. This is terrible because all current writes have similar timestamps, so they all go to the same region (the one handling the “latest” key range). One RegionServer gets all the write load while others sit idle.My design approach: The primary access patterns for IoT data are typically (1) get the latest readings for a specific sensor, (2) get a time range of readings for a specific sensor, and (3) get readings across all sensors for a specific time window.For patterns 1 and 2, the row key should be sensorId_reverseTimestamp where reverseTimestamp is Long.MAX_VALUE - timestamp. This distributes writes across regions (different sensorIds hash to different regions) and enables efficient time-range scans for a specific sensor (scan from sensorId_reverseTimestamp_start to sensorId_reverseTimestamp_end). The reverse timestamp ensures the most recent data comes first in a scan, which matches the most common access pattern.For pattern 3 (cross-sensor queries), this row key design is poor because it requires scanning all regions. If this pattern is frequent, I would maintain a secondary index table with row key timeBucket_sensorId (where timeBucket is hourly or daily), or use a tool like Apache Phoenix that provides secondary indexing on HBase.Additional considerations: Pre-split the table into N regions based on the sensorId distribution to avoid the initial single-region bottleneck. If sensorIds are sequential integers, apply a hash prefix (e.g., MD5(sensorId).substring(0,4)_sensorId_reverseTimestamp) to ensure uniform distribution even with non-uniform sensorId assignment. Set TTL on the column family to auto-expire old data (e.g., 90 days) to prevent unbounded growth.Follow-up: How do compactions affect this workload, and how would you tune them?With high write throughput (10,000 sensors x 1 write/second = 10,000 writes/second), MemStores flush frequently, creating many small HFiles. Minor compactions merge these into larger files, and major compactions merge all files into one per column family per region. The risk is that major compactions on a high-write table cause I/O storms that degrade read latency. I would disable automatic major compactions (hbase.hregion.majorcompaction = 0) and schedule them during off-peak hours via a cron job. For minor compactions, I would tune hbase.hstore.compactionThreshold (number of files that triggers compaction) and hbase.hstore.compaction.max (max files to compact at once) to balance between read amplification and I/O overhead.

Documentation Index

​Chapter 5: Hadoop Ecosystem

​Ecosystem Overview

​The Hadoop Stack

​Why the Ecosystem Matters

Abstraction

Productivity

Specialization

Innovation

​Apache Hive: SQL on Hadoop

​What is Hive?

​Hive Fundamentals

​Hive Metastore

​Deep Dive: Hive Metastore Scalability and the “Partition Explosion”

​1. The Partition Bottleneck

​2. Scaling Strategies

​Deep Dive: Hive Query Lifecycle and Execution

​1. The Query Planning Pipeline

​2. Tez vs. MapReduce: The DAG Revolution

​3. LLAP (Live Long and Process)

​Apache Pig: Data Flow Language

​What is Pig?

​Pig Latin Basics

​Apache HBase: NoSQL on HDFS

​What is HBase?

​HBase Operations

​Deep Dive: HBase Internals and Storage Engine

​1. The Write Path: WAL and MemStore

​2. MemStore Flush and HFile Creation

​3. LSM-Tree and Read Amplification

​4. Compaction: Managing File Bloat

​5. Data Locality and HDFS Interaction

​6. The Mathematics of Region Sizing

​7. The Region Split Protocol (State Machine)

​Ecosystem Integration and Coordination

​1. Apache ZooKeeper: The Glue

​2. Apache Oozie: Workflow Orchestration

​3. Data Ingestion: Sqoop, Flume, and Kafka

​Conclusion: The Modern Hadoop Stack

​Interview Deep-Dive