Key Takeaways
1. Big Data Demands a New Approach: Hadoop's Distributed Solution
In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.
The data deluge. We live in an age of unprecedented data growth, measured in zettabytes and generated by diverse sources such as web logs, sensors, and personal devices. Traditional systems struggle to store and analyze this volume because single-machine storage capacity is limited and, more critically, disk transfer speeds have not kept pace with growth in capacity. Reading an entire modern terabyte drive takes hours, making full dataset scans impractical on a single machine.
Distributed computing is key. The solution lies not in larger, more powerful single computers, but in coordinating computation across many commodity machines. This parallel approach drastically reduces analysis time by reading from many disks simultaneously. However, distributing data and computation introduces its own challenges, such as coping with hardware failure and combining partial results from many machines.
Hadoop's core promise. Hadoop provides a reliable, scalable, and affordable platform designed specifically for storing and analyzing massive datasets on clusters of commodity hardware. It addresses the challenges of distributed systems by handling fault tolerance and coordination automatically, enabling users to run batch queries against their entire dataset in a reasonable time, unlocking previously inaccessible data for innovation and insight.
2. HDFS: Reliable, Scalable Storage for Massive Files
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
Designed for scale. HDFS (Hadoop Distributed Filesystem) is Hadoop's primary storage system, optimized for handling files hundreds of megabytes to petabytes in size with a write-once, read-many-times streaming access pattern. It runs on clusters of inexpensive, standard hardware, anticipating and tolerating frequent component failures across the cluster.
Blocks and replication. HDFS breaks files into large blocks (128 MB by default) and replicates them across multiple machines (typically three) to ensure fault tolerance and high availability. If a disk or node fails, data can be read from another replica transparently to the user. This block abstraction simplifies storage management and allows files to exceed the capacity of any single disk.
Master-worker architecture. An HDFS cluster consists of a single Namenode (master) managing the filesystem namespace and metadata, and multiple Datanodes (workers) storing the actual data blocks. Clients interact with the Namenode to find block locations and then read/write data directly to the Datanodes, distributing data traffic across the cluster and preventing the Namenode from becoming a bottleneck.
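A minimal sketch using the Java FileSystem API (the file path is hypothetical) illustrates this split between namenode metadata lookups and direct datanode reads:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);       // client talks to the Namenode for metadata

    Path file = new Path("/data/sample.txt");   // hypothetical path
    FileStatus status = fs.getFileStatus(file);

    // Ask the Namenode which Datanodes hold each block of the file.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset " + block.getOffset()
          + " hosts " + String.join(",", block.getHosts()));
    }

    // The bytes themselves are streamed directly from the Datanodes, not through the Namenode.
    IOUtils.copyBytes(fs.open(file), System.out, 4096, false);
  }
}
```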
3. YARN: Hadoop's Flexible Resource Manager for Diverse Applications
YARN is a cluster resource management system, which allows any distributed program (not just MapReduce) to run on data in a Hadoop cluster.
Evolution beyond MapReduce. YARN (Yet Another Resource Negotiator), introduced in Hadoop 2, fundamentally changed Hadoop by decoupling resource management from the MapReduce programming model. This allows other distributed computing frameworks to run on the same Hadoop cluster and share its resources and data storage (HDFS).
Core components. YARN consists of a global Resource Manager that allocates resources (CPU, memory) across the cluster and Node Managers running on each machine that launch and monitor application-specific containers. Applications request containers from the Resource Manager and run their logic within them.
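A condensed, hypothetical sketch of application submission with the YarnClient API (the application master logic itself is omitted, and the command is a placeholder) shows the request/launch flow:

```java
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration());
    yarn.start();

    // Ask the Resource Manager for a new application id.
    YarnClientApplication app = yarn.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("demo-app");  // placeholder name

    // Describe the container that will host the application master.
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList("/bin/date"));  // placeholder command
    ctx.setAMContainerSpec(amContainer);
    ctx.setResource(Resource.newInstance(512, 1));  // 512 MB, 1 vcore for the AM container

    // Node Managers on the cluster launch and monitor the resulting container.
    yarn.submitApplication(ctx);
    yarn.stop();
  }
}
```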
Enabling diverse workloads. This architecture enables a variety of processing models beyond traditional batch MapReduce, including:
- Interactive SQL (e.g., Impala, Hive on Tez)
- Iterative processing (e.g., Spark)
- Stream processing (e.g., Storm, Spark Streaming)
- Search (e.g., Solr)
Improved scalability and utilization. Compared to the older MapReduce 1 architecture, YARN offers significantly better scalability (designed for 10,000+ nodes), higher resource utilization through fine-grained allocation (containers instead of fixed map and reduce slots), and better availability through support for a highly available Resource Manager.
4. MapReduce: The Foundational Batch Processing Model
MapReduce is a programming model for data processing. The model is simple, yet not too simple to express useful programs in.
Simple programming model. MapReduce provides a powerful, parallel programming model based on two functions: map and reduce. The map function processes input key-value pairs and generates intermediate key-value pairs. The reduce function processes intermediate key-value pairs grouped by key and generates final output key-value pairs.
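The canonical illustration is word count; the minimal sketch below uses the newer org.apache.hadoop.mapreduce API:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map: (offset, line) -> (word, 1) for each word in the line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // reduce: (word, [1, 1, ...]) -> (word, total count)
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class);   // optional local aggregation before the shuffle
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```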
The shuffle. The core of MapReduce is the "shuffle," the process of sorting and transferring intermediate map outputs to the correct reduce tasks. Map tasks write output to local disk, which is then copied, sorted, and merged by reduce tasks. This process is automatically managed by the Hadoop framework.
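During the shuffle, a partitioner decides which reduce task receives each intermediate key. The sketch below mirrors the default hash-based behavior; a custom class like this (hypothetical here) would be registered with job.setPartitionerClass(...):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each intermediate key to a reduce task. Because the mapping is a pure
// function of the key, all values for one key meet at the same reducer.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```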
Fault tolerance built-in. MapReduce is designed to handle failures gracefully. If a map or reduce task fails, the framework automatically detects it and reschedules the task on a different node. Intermediate map outputs are stored on local disk, and final reduce outputs are written to HDFS for reliability.
5. Efficient Data Handling: Serialization, Compression, and File Formats
Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage.
Optimized I/O. Hadoop provides primitives for efficient data input/output, crucial for processing large volumes. This includes data integrity checks (checksums) to detect corruption and compression to reduce storage space and improve transfer speeds.
Serialization formats. Hadoop uses serialization for interprocess communication (RPC) and persistent storage. Key requirements for serialization formats include:
- Compactness (saves network bandwidth and storage)
- Speed (fast encoding/decoding)
- Extensibility (supports schema evolution)
- Interoperability (works across languages)
Writable and Avro. Hadoop's native serialization is Writable, which is compact and fast but Java-centric. Avro is a language-neutral, schema-based serialization system that offers schema evolution and is widely used across the Hadoop ecosystem. Parquet is a columnar storage format optimized for analytical queries.
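A minimal sketch of a custom Writable, with two hypothetical fields, shows the compact, order-dependent encoding:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// A compact, Java-centric record: two fields serialized in a fixed order.
public class PageView implements Writable {
  private final Text url = new Text();
  private final IntWritable hits = new IntWritable();

  public void set(String u, int h) {
    url.set(u);
    hits.set(h);
  }

  @Override
  public void write(DataOutput out) throws IOException {   // object -> byte stream
    url.write(out);
    hits.write(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException { // byte stream -> object
    url.readFields(in);
    hits.readFields(in);
  }
}
```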
6. Beyond Batch: Real-time and Interactive Processing on Hadoop
For all its strengths, MapReduce is fundamentally a batch processing system, and is not suitable for interactive analysis.
Addressing latency needs. While MapReduce excels at batch processing, its latency (minutes or more) makes it unsuitable for interactive queries or real-time applications. The Hadoop ecosystem has evolved to address these needs by building new frameworks on top of HDFS and YARN.
Real-time databases. HBase is a distributed, column-oriented database built on HDFS that provides real-time read/write random access to large datasets. It's suitable for applications requiring low-latency access to individual records.
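A minimal sketch with the HBase 1.x+ Java client, against a hypothetical "patients" table, shows the low-latency, row-keyed access pattern:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("patients"))) { // hypothetical table

      // Low-latency write of a single cell, keyed by row.
      Put put = new Put(Bytes.toBytes("row-42"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
      table.put(put);

      // Low-latency random read of the same row.
      Result result = table.get(new Get(Bytes.toBytes("row-42")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    }
  }
}
```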
Interactive SQL. Projects like Impala and Hive on Tez provide interactive SQL query capabilities on data stored in HDFS, offering much lower latency than traditional Hive on MapReduce.
Stream processing. Frameworks like Storm and Spark Streaming enable processing of data as it arrives in continuous streams, allowing for ongoing analysis and immediate action on incoming events.
7. Higher-Level Tools Simplify Development: Pig, Hive, Crunch, Spark SQL, Cascading
With Pig, the data structures are much richer, typically being multivalued and nested, and the transformations you can apply to the data are much more powerful.
Abstracting MapReduce complexity. Writing complex data processing pipelines directly in MapReduce can be challenging, often requiring chaining multiple jobs and managing intermediate data. Higher-level tools provide more intuitive programming models and handle the translation to distributed execution engines automatically.
SQL for data warehousing. Hive provides a data warehousing solution on Hadoop with a query language (HiveQL) based on SQL, making it accessible to analysts familiar with relational databases. It manages data in tables with schemas stored in a metastore.
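One common way to run HiveQL programmatically is the HiveServer2 JDBC driver (hive-jdbc on the classpath). The sketch below assumes a local HiveServer2 and a hypothetical records table of weather readings:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // HiveServer2 JDBC endpoint; host, port, database, and user are placeholders.
    String url = "jdbc:hive2://localhost:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT year, MAX(temperature) FROM records GROUP BY year")) {
      while (rs.next()) {
        System.out.println(rs.getInt(1) + "\t" + rs.getInt(2));
      }
    }
  }
}
```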
Data flow languages. Pig offers a data flow language (Pig Latin) for exploring large datasets with richer data structures (tuples, bags, maps) and powerful operators (joins, grouping). It's well-suited for iterative data exploration and scripting.
Java/Scala/Python APIs. Libraries like Crunch and Spark provide Java, Scala, and Python APIs for building data pipelines with more expressive transformations and better code composability than raw MapReduce. They offer features like type safety, embedded UDFs, and flexible execution engine support.
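As an illustration of this composable style, here is the word count from section 4 rewritten with Spark's Java API (Spark 2.x lambda signatures assumed); Crunch pipelines read similarly:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("spark-word-count");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<String> lines = sc.textFile(args[0]);   // e.g. an HDFS path

      // The whole pipeline is a chain of transformations; no explicit job wiring.
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey((a, b) -> a + b);

      counts.saveAsTextFile(args[1]);
    }
  }
}
```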
8. Integrating with External Systems: Sqoop and Flume
Apache Sqoop is an open source tool that allows users to extract data from a structured data store into Hadoop for further processing.
Data ingress and egress. Getting data into and out of Hadoop is a critical part of any data pipeline. Sqoop and Flume are key tools in the ecosystem for integrating Hadoop with external systems.
Database integration. Sqoop specializes in transferring data between Hadoop (HDFS, Hive, HBase) and structured data stores like relational databases. It can import data from tables into Hadoop and export results back, often using database-specific bulk transfer mechanisms for efficiency.
Streaming data ingestion. Flume is designed for collecting and aggregating high-volume, event-based data streams from sources like logfiles into Hadoop. It uses a flexible agent architecture with sources, channels, and sinks, supporting tiered aggregation and reliable delivery guarantees.
Diverse connectors. Both Sqoop and Flume offer pluggable connector frameworks to support integration with a wide variety of external systems and data formats beyond just databases and logfiles.
9. Coordination and Reliability: ZooKeeper and Distributed Primitives
ZooKeeper can't make partial failures go away, since they are intrinsic to distributed systems. It certainly does not hide partial failures, either.
Building distributed applications. Writing reliable distributed applications is inherently difficult due to challenges like partial network failures and coordinating state across multiple nodes. ZooKeeper provides a service specifically designed to help build these applications.
Simple, reliable service. ZooKeeper is a highly available, high-performance coordination service that exposes a simple, filesystem-like API (znodes) with operations for creating, deleting, and accessing data and metadata. It guarantees sequential consistency, atomicity, and durability.
Coordination primitives. ZooKeeper's features enable the construction of common distributed coordination patterns (see the sketch after this list):
- Ephemeral znodes: Represent service availability or group membership.
- Sequential znodes: Impose a global ordering on events.
- Watches: Provide notifications of znode changes.
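A minimal sketch using the plain ZooKeeper Java client combines an ephemeral-sequential znode with a watch for simple group membership; the connection string is a placeholder and the /group parent znode is assumed to exist:

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class GroupMember {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);
    ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();

    // An ephemeral, sequential znode disappears automatically if this process dies,
    // so the children of /group always reflect the live members, in join order.
    zk.create("/group/member-", new byte[0],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

    // A watch on the parent fires when membership changes.
    List<String> members = zk.getChildren("/group",
        event -> System.out.println("membership changed: " + event.getPath()));
    System.out.println("current members: " + members);

    Thread.sleep(Long.MAX_VALUE);  // keep the session (and the ephemeral znode) alive
  }
}
```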
Enabling complex systems. ZooKeeper is used by many Hadoop components (HBase, YARN, Hive) for coordination tasks like leader election, service discovery, and managing distributed state, providing a robust foundation for building complex, fault-tolerant systems.
10. Hadoop in Action: Real-World Case Studies
Today, more data is analyzed for personalized advertising than personalized medicine, but that will not be the case in the future.
Solving complex problems. Hadoop and its ecosystem are being applied to solve challenging, real-world problems across various industries, demonstrating the platform's versatility and power beyond typical web analytics.
Healthcare data integration. Cerner uses Hadoop, particularly Crunch and Avro, to integrate disparate healthcare data sources, normalize complex medical records, and build pipelines for analytics and patient care management. The focus is on creating a clean, semantically integrated data foundation.
Genomics data science. The AMPLab and others are using Spark, Avro, and Parquet to accelerate genomics data processing, such as aligning DNA reads and counting k-mers. Tools like ADAM and SNAP leverage these technologies for rapid, scalable analysis of massive biological datasets, with direct applications in personalized medicine and pathogen identification.
Composable solutions. Real-world applications often build complex data pipelines by composing functions and datasets using higher-level tools like Crunch or by orchestrating jobs with Oozie, demonstrating how the ecosystem facilitates building sophisticated, modular solutions.
Review Summary
Hadoop by Tom White is highly regarded as a comprehensive guide to the Hadoop ecosystem. Readers praise its thorough coverage of topics, clear explanations, and practical examples. Many found it useful for understanding Hadoop's architecture and related technologies. Some criticisms include outdated information in newer editions and overwhelming detail in certain sections. Overall, it's considered an excellent resource for those learning about or working with Hadoop, though some suggest supplementing with more current materials for specific technologies.