Big Data: Principles and Best Practices of Scalable Real-Time Data Systems

Nathan Marz, James Warren

ISBN: 9789351198062

328 pages

INR 649

Description

This book presents the Lambda Architecture, a scalable, easy-to-understand approach to big data systems that can be built and run by a small team. You'll explore the theory of big data systems and learn how to implement them in practice. In addition to discovering a general framework for processing big data, you'll learn specific technologies like Hadoop, Storm, and NoSQL databases.
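
The core idea the book builds toward is that a query over all data can be answered by merging a batch view, precomputed from the immutable master dataset, with a realtime view that covers only the data that arrived since the last batch run. The sketch below illustrates that merge in Java; the class and method names are illustrative assumptions, not APIs from the book or from Hadoop, Storm, or ElephantDB.

    // A minimal sketch (illustrative names only): answering a pageview-count
    // query by merging a precomputed batch view with an incremental realtime view.
    import java.util.HashMap;
    import java.util.Map;

    public class LambdaQuerySketch {
        // Batch view: counts recomputed periodically from the entire master dataset.
        private final Map<String, Long> batchView = new HashMap<>();
        // Realtime view: counts for data that arrived after the last batch run.
        private final Map<String, Long> realtimeView = new HashMap<>();

        void loadBatch(String url, long count)      { batchView.put(url, count); }
        void recordRealtime(String url, long count) { realtimeView.merge(url, count, Long::sum); }

        // A query is a function of all data, answered as batch view + realtime view.
        long pageviews(String url) {
            return batchView.getOrDefault(url, 0L) + realtimeView.getOrDefault(url, 0L);
        }

        public static void main(String[] args) {
            LambdaQuerySketch views = new LambdaQuerySketch();
            views.loadBatch("http://example.com", 1000L);     // from the batch/serving layers
            views.recordRealtime("http://example.com", 42L);  // from the speed layer
            System.out.println(views.pageviews("http://example.com")); // prints 1042
        }
    }

In the book's running SuperWebAnalytics.com example, this same merge pattern backs queries such as pageviews over time and unique visitors over time.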

 

Part 1 Batch layer

 

2 Data model for Big Data

    2.1 The properties of data

    2.2 The fact-based model for representing data

    2.3 Graph schemas

    2.4 A complete data model for SuperWebAnalytics.com

    2.5 Summary

3 Data model for Big Data: Illustration

    3.1 Why a serialization framework?

    3.2 Apache Thrift

    3.3 Limitations of serialization frameworks

    3.4 Summary

4 Data storage on the batch layer

    4.1 Storage requirements for the master dataset

    4.2 Choosing a storage solution for the batch layer

    4.3 How distributed filesystems work

    4.4 Storing a master dataset with a distributed filesystem

    4.5 Vertical partitioning

    4.6 Low-level nature of distributed filesystems

    4.7 Storing the SuperWebAnalytics.com master dataset on a distributed filesystem

    4.8 Summary

5 Data storage on the batch layer: Illustration

    5.1 Using the Hadoop Distributed File System

    5.2 Data storage in the batch layer with Pail

    5.3 Storing the master dataset for SuperWebAnalytics.com

    5.4 Summary

6 Batch layer

    6.1 Motivating examples

    6.2 Computing on the batch layer

    6.3 Recomputation algorithms vs. incremental algorithms

    6.4 Scalability in the batch layer

    6.5 MapReduce: a paradigm for Big Data computing

    6.6 Low-level nature of MapReduce

    6.7 Pipe diagrams: a higher-level way of thinking about batch computation

    6.8 Summary

7 Batch layer: Illustration

    7.1 An illustrative example

    7.2 Common pitfalls of data-processing tools

    7.3 An introduction to JCascalog

    7.4 Composition

    7.5 Summary

8 An example batch layer: Architecture and algorithms

    8.1 Design of the SuperWebAnalytics.com batch layer

    8.2 Workflow overview

    8.3 Ingesting new data

    8.4 URL normalization

    8.5 User-identifier normalization

    8.6 Deduplicate pageviews

    8.7 Computing batch views

    8.8 Summary

9 An example batch layer: Implementation

    9.1 Starting point

    9.2 Preparing the workflow

    9.3 Ingesting new data

    9.4 URL normalization

    9.5 User-identifier normalization

    9.6 Deduplicate pageviews

    9.7 Computing batch views

    9.8 Summary

 

Part 2 Serving layer

 

10 Serving layer

    10.1 Performance metrics for the serving layer

    10.2 The serving layer solution to the normalization/denormalization problem

    10.3 Requirements for a serving layer database

    10.4 Designing a serving layer for SuperWebAnalytics.com

    10.5 Contrasting with a fully incremental solution

    10.6 Summary

11 Serving layer: Illustration

    11.1 Basics of ElephantDB

    11.2 Building the serving layer for SuperWebAnalytics.com

    11.3 Summary

 

Part 3 Speed layer

 

12 Realtime views

    12.1 Computing realtime views

    12.2 Storing realtime views

    12.3 Challenges of incremental computation

    12.4 Asynchronous versus synchronous updates

    12.5 Expiring realtime views

    12.6 Summary

13 Realtime views: Illustration

    13.1 Cassandra’s data model

    13.2 Using Cassandra

    13.3 Summary

14 Queuing and stream processing

    14.1 Queuing

    14.2 Stream processing

    14.3 Higher-level, one-at-a-time stream processing

    14.4 SuperWebAnalytics.com speed layer

    14.5 Summary

15 Queuing and stream processing: Illustration

    15.1 Defining topologies with Apache Storm

    15.2 Apache Storm clusters and deployment

    15.3 Guaranteeing message processing

    15.4 Implementing the SuperWebAnalytics.com uniques-over-time speed layer

    15.5 Summary

16 Micro-batch stream processing

    16.1 Achieving exactly-once semantics

    16.2 Core concepts of micro-batch stream processing

    16.3 Extending pipe diagrams for micro-batch processing

    16.4 Finishing the speed layer for SuperWebAnalytics.com (Pageviews over time; Bounce-rate analysis)

    16.5 Another look at the bounce-rate-analysis example

    16.6 Summary

17 Micro-batch stream processing: Illustration

    17.1 Using Trident

    17.2 Finishing the SuperWebAnalytics.com speed layer

    17.3 Fully fault-tolerant, in-memory, micro-batch processing

    17.4 Summary

18 Lambda Architecture in depth

    18.1 Defining data systems

    18.2 Batch and serving layers

    18.3 Speed layer

    18.4 Query layer

    18.5 Summary

 
