book-big-data

Key Takeaways

Human fault-tolerance

Page 91 (bottom) You must always have recomputation versions of your algorithms. This is the only way to ensure human-fault tolerance for your system, and human-fault tolerance is a non-negotiable requirement for robust systems.

Recomputation

Page 91 (bottom) You must always have recomputation versions of your algorithms. This is the only way to ensure human-fault tolerance for your system, and human-fault tolerance is a non-negotiable requirement for robust systems.

Chapter 1

Page 6

"So when you make a mistake, you might write bad data, but at least you won't destroy good data. This is a much stronger human-fault tolerance guarantee than in a traditional system based on mutation. With traditional databases you'd be wary of using immutable data because of how fast such a dataset would grow. But because Big Data techniques can scale up to so much data, you have the ability to design systems in different ways." book-big-data#human-fault-tolerance
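
A minimal Python sketch (mine, not from the book) of the guarantee above: with an append-only master dataset of immutable records, a buggy writer can add bad records but cannot overwrite or destroy the good ones. The `PageviewRecord` type and `record_pageview` function are hypothetical.

```python
# Hypothetical sketch: an append-only "master dataset" vs. in-place mutation.
# A mistake can add bad records, but it cannot destroy the good ones.

from dataclasses import dataclass

@dataclass(frozen=True)          # records themselves are immutable
class PageviewRecord:
    user_id: str
    url: str
    timestamp: int

master_dataset = []              # append-only: we never update or delete in place

def record_pageview(user_id, url, timestamp):
    # Even a buggy caller can only *add* a bad record here;
    # previously written records are never overwritten.
    master_dataset.append(PageviewRecord(user_id, url, timestamp))

record_pageview("alice", "http://example.com", 1_700_000_000)
record_pageview("bob", "http://example.com", 1_700_000_060)
```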

Page 7

Page 7 - top: Another crucial observation is that not all bits of information are equal. Some information is derived from other pieces of information. (...) When you keep tracing back where information is derived from, you eventually end up at information that is not derived from anything. This is the rawest information you have: information you hold to be true simply because it exists. Let's call this information data. book-big-data#rawest-information-is-data

Page 8

Page 8: If you build immutability and recomputation into the code of a Big Data system, the system will be innately resilient to human error by providing a clear and simple mechanism for recovery. book-big-data#immutability-and-recomputation-mechanism-for-recovery

Page 9

Page 9: A Big Data system must provide the information necessary to debug the system when things go wrong. The key is to be able to trace, for each value in the system, exactly what caused it to have that value. book-big-data#trace-what-caused-it-to-have-value
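
A small illustrative sketch (not from the book) of one way to get that traceability: have each derived value carry the ids of the raw records it was computed from. The record shapes and the `pageviews_per_url` function are made up for illustration.

```python
# Hypothetical sketch: every derived value keeps track of which raw
# records produced it, so you can trace "what caused this value".

raw_records = [
    {"id": "r1", "url": "/home", "views": 3},
    {"id": "r2", "url": "/home", "views": 2},
    {"id": "r3", "url": "/about", "views": 5},
]

def pageviews_per_url(records):
    derived = {}
    for r in records:
        entry = derived.setdefault(r["url"], {"total": 0, "derived_from": []})
        entry["total"] += r["views"]
        entry["derived_from"].append(r["id"])   # provenance back to the raw data
    return derived

print(pageviews_per_url(raw_records))
# -> /home: total 5 derived from ['r1', 'r2']; /about: total 5 derived from ['r3']
```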

Page 12

Page 12: Because mistakes are inevitable, the database in a fully incremental architecture is guaranteed to be corrupted. book-big-data#incremental-database-will-be-corrupted

Page 13

trample over the events store

Page 17

the batch and serving layers are also human-fault tolerant because when a mistake is made you can fix your algorithm or remove the bad data and recompute the views from scratch.

Page 18

The speed layer does incremental computation instead of the recomputation done in the batch layer

Page 20

This property of the lambda architecture is called "complexity isolation", meaning that complexity is pushed into a layer whose results are only temporary

Page 20

the batch layer repeatedly overrides the speed layer, so the approximation gets corrected and your system exhibits the property of "eventual accuracy"
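
A toy Python sketch (my own, based on the usual lambda-architecture description) of how the notes from pages 17-20 fit together: the batch view is recomputed from scratch over the whole master dataset, the speed layer incrementally maintains a small realtime view, and a query merges the two; once a batch run absorbs the recent events, the realtime portion is discarded and any approximation is corrected. All names here are hypothetical.

```python
# Hypothetical sketch of the batch/speed split and query-time merge.

master_dataset = [("alice", 1), ("bob", 1), ("alice", 1)]   # events absorbed by the last batch run
recent_events = [("alice", 1)]                              # events that arrived since then

def compute_batch_view(events):
    """Recompute counts from scratch; rerunning this also fixes past mistakes."""
    view = {}
    for user, n in events:
        view[user] = view.get(user, 0) + n
    return view

def speed_layer_update(view, event):
    """Incrementally fold one new event into the realtime view."""
    user, n = event
    view[user] = view.get(user, 0) + n

def query(user, batch_view, realtime_view):
    # Merge the authoritative-but-stale batch view with the recent realtime view.
    return batch_view.get(user, 0) + realtime_view.get(user, 0)

batch_view = compute_batch_view(master_dataset)
realtime_view = {}
for event in recent_events:
    speed_layer_update(realtime_view, event)

print(query("alice", batch_view, realtime_view))   # 2 (batch) + 1 (realtime) = 3
```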

Page 23

Page 23: because your system will be able to handle much larger amounts of data, you'll be able to collect even more data and get more value from it. book-big-data#collect-more-data

Page 23

there is very little magic happening behind the scenes, compared to something like a SQL query planner. This leads to more predictable performance.

Chapter 3 - Data modeling for big data

Page 48

Many developers go down the path of writing their raw data in a schemaless format like JSON. This is appealing because of how easy it is to get started, but this approach quickly leads to problems. Whether due to bugs or misunderstandings between different developers, data corruption inevitably occurs. book-big-data#schema-format-is-good-idea

It's our experience that data corruption errors are some of the most time-consuming to debug.

When you create an enforceable schema, you get errors at the time of writing the data -- giving you full context as to how and why the data became invalid (like a stacktrace). In addition, the error prevents the program from corrupting the master dataset by writing that data. book-big-data#schemas-at-time-of-writing
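
A plain-Python stand-in (not the book's code, which uses a serialization framework with enforceable schemas) showing the point of this quote: validation happens in the write path, so invalid data raises an error with full context instead of silently landing in the master dataset. The `PAGEVIEW_SCHEMA`, `validate`, and `write_record` names are hypothetical.

```python
# Hypothetical sketch: enforce a schema when *writing* to the master dataset,
# so bad data fails loudly at write time instead of corrupting storage.

PAGEVIEW_SCHEMA = {"user_id": str, "url": str, "timestamp": int}

class SchemaViolation(Exception):
    pass

def validate(record, schema):
    for field, field_type in schema.items():
        if field not in record:
            raise SchemaViolation(f"missing field {field!r} in {record!r}")
        if not isinstance(record[field], field_type):
            raise SchemaViolation(f"field {field!r} should be {field_type.__name__}: {record!r}")

master_dataset = []

def write_record(record):
    validate(record, PAGEVIEW_SCHEMA)      # error raised here, at write time
    master_dataset.append(record)

write_record({"user_id": "alice", "url": "/home", "timestamp": 1700000000})   # ok
# write_record({"user_id": "alice", "url": "/home"})  # raises SchemaViolation
```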

Chapter 4 - Data Storage on the batch layer

Page 61

Vertical Partitioning

Page 68 - The master dataset is the source of truth in the lambda architecture book-big-data#source-of-truth

Chapter 5 - Data Storage on the batch layer - Illustration

Page 67

Scalable, fault tolerant, performant, and elegant

Chapter 6 - Batch layer

Page 84

indexes of the master dataset

precalculation rather than totally on the fly

Page 85

"The algorithm first performs semantic normalization on the name for the person, doing conversions like Bob to Robert and Bill to William" book-big-data#semantic-normalization
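
A tiny hypothetical sketch of semantic normalization as described above: a lookup table mapping nickname variants to canonical names, applied before the batch view is computed. Because the view is recomputed from scratch, growing this table retroactively improves all results (which ties into the page 85 note below).

```python
# Hypothetical nickname-to-canonical-name normalization, applied before
# computing batch views over the master dataset.

NICKNAMES = {
    "bob": "robert",
    "bobby": "robert",
    "bill": "william",
    "billy": "william",
}

def normalize_name(name: str) -> str:
    lowered = name.strip().lower()
    return NICKNAMES.get(lowered, lowered)

assert normalize_name("Bob") == "robert"
assert normalize_name("Bill") == "william"
assert normalize_name("Alice") == "alice"   # unknown names pass through unchanged
```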

Page 85: with recomputation, results improve as your algorithm improves book-big-data#recomputation-improve-result-as-algo-improves

Page 86

Batch layer creates batch views that live in the serving layer

Page 87 Batch views are intermediate data

Page 87 Serving layer indexes the batch views book-big-data#serving-layer-indexes-batch-views

Page 88 Querying an indexed batch view
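
A hypothetical sketch of the pages 86-88 notes: the batch layer emits an unindexed batch view as flat key/value pairs, the serving layer indexes it, and a query becomes a cheap lookup against that index. The `serving_index` dict stands in for a real serving-layer database.

```python
# Hypothetical sketch: unindexed batch view -> indexed by the serving layer -> fast query.

batch_view_output = [            # unindexed output of a batch job
    (("example.com/home", "2024-01-01"), 1042),
    (("example.com/home", "2024-01-02"), 987),
    (("example.com/about", "2024-01-01"), 311),
]

# "Indexing" step in the serving layer: load the view into a lookup structure.
serving_index = {key: value for key, value in batch_view_output}

def query_pageviews(url: str, day: str) -> int:
    # A query is a random read against the index, not a computation over raw data.
    return serving_index.get((url, day), 0)

print(query_pageviews("example.com/home", "2024-01-01"))   # 1042
```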

Page 89 a table of the tradeoffs between re-computation and incrementalism

Page 89 Batch views can be a lot smaller (question based on this: what if we are not doing aggregations, but something more like transformations?)

Page 91 (middle) normalization (issues with incremental vs recomputation)

Page 90 Batch views are also generated by incremental algos, not just recomputation. book-big-data#batch-views-also-generated-incrementally

Page 91 (bottom) You must always have recomputation versions of your algorithms. This is the only way to ensure human-fault tolerance for your system, and human-fault tolerance is a non-negotiable requirement for robust systems book-big-data#key-takeaway-must-always-have-recomputation-fault-tolerance

Scalability is the ability of a system to maintain performance under increasing load by adding more resources. book-big-data#scalability-definition

Load in a Big Data context is a combination of the total amount of data you have, how much new data you receive every day, how many requests you receive every second, and so forth book-big-data#load-definition

Page 93 linear scalability is key

Page 95 in map reduce, code goes to where the data is
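
A self-contained Python sketch (simulating the framework in-process, not real Hadoop code) of the MapReduce model behind this note: the small `map_fn` and `reduce_fn` functions are what gets shipped to where each data partition lives, rather than moving the data to the code.

```python
# Hypothetical word-count sketch of the MapReduce model: map locally on each
# partition, shuffle by key, then reduce per key.

from collections import defaultdict
from itertools import chain

partitions = [                    # data already lives on different nodes
    ["the quick brown fox", "the lazy dog"],
    ["the dog barks"],
]

def map_fn(line):                 # runs where the partition lives
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):      # runs per key after the shuffle
    return (word, sum(counts))

# Simulate the framework: map each partition, shuffle by key, reduce.
shuffled = defaultdict(list)
for key, value in chain.from_iterable(map_fn(line) for part in partitions for line in part):
    shuffled[key].append(value)

result = dict(reduce_fn(k, v) for k, v in shuffled.items())
print(result["the"])              # 3
```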

Page 96 Single failures (likely infra) vs. multiple failures (likely code)

Page 98 Spark vs. Map Reduce

Page 99 joins are complicated

Page 102 Pipe diagrams

Page 104. Immutable intermediate results. "As you can see, one of the keys to pipe diagrams is that fields are immutable once created. One obvious optimization that you can make is to discard fields as soon as they're no longer needed (preventing unnecessary serialization and network I/O)" book-big-data#immutable-intermediate-results

Page 107 Smart Compiler

Page 108 A "combiner aggregator" is very efficient in some cases

Page 109 (bottom): Batch - do what you can't do in real time book-big-data#batch-do-what-you-cannot-in-real-time

Page 110 - Tradeoffs between size of generated views and query time

Chapter 7 - Batch Layer: Illustration

Page 112 - Thinking of SQL as the gold standard is limiting book-big-data#sql-gold-standard-limiting

Page 112: Partial listing for brevity book-big-data#writing-partial-listing-for-brevity

Page 114: Essential complexity vs. accidental complexity

Page 114: incorporating business logic requires user-defined functions. Question Gordon poses to self: what if SQL is the problem, not the solution? book-big-data#business-logic-user-defined-functions

Page 138: The way you express computation

Page 140: Recomputation is good at adapting to change book-big-data#batch-adapting-to-change

Page 140: batch layer to support 3 types of queries

Page 140: Goal of batch is to precompute views so queries go fast.

Page 145: Enough to make uniquely identifiable

Page 145: Easier to reason about when from same master dataset (why you don't just keep appending incrementally to the master dataset while doing calculation) book-big-data#need-a-consistent-master-dataset-when-doing-computations

Page 146: Fully-distributed iterative graph algorithm for determining a canonical id for a user

Page 147: Reaching a fixed point, where the resulting output is the same as the input book-big-data#fixed-point
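
A toy sketch (not the book's implementation) of the fixed-point loop for canonicalizing user ids: propagate the smallest known id across "same user" edges until a pass produces output identical to its input. The edge list and id scheme are made up.

```python
# Hypothetical fixed-point iteration: repeat until output == input.

equiv_edges = [("u1", "u2"), ("u2", "u3"), ("u4", "u5")]   # "these ids are the same person"

def one_pass(canonical):
    updated = dict(canonical)
    for a, b in equiv_edges:
        smallest = min(updated[a], updated[b])
        updated[a] = smallest
        updated[b] = smallest
    return updated

canonical = {uid: uid for pair in equiv_edges for uid in pair}
while True:
    next_round = one_pass(canonical)
    if next_round == canonical:      # fixed point reached: output is the same as the input
        break
    canonical = next_round

print(canonical)   # {'u1': 'u1', 'u2': 'u1', 'u3': 'u1', 'u4': 'u4', 'u5': 'u4'}
```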

Page 151: "Now that the data is ready to compute the batch view" - ah, so semantic normalization is a separate step before calculating the batch views.

Chapter 10 - Serving layer

page 179 the serving layer is the last component of the batch section of the lambda architecture book-big-data#last-component-of-serving-layer

page 180 indexing strategies to minimize latency, resource usage, and variance book-big-data#indexing-strategies

How the serving layer solved the long-debated normalization versus denormalization problem book-big-data#solve-normalization-vs-denormalization-problem

page 181 if you hit a lot of machines for a query, your latency for the query will be the latency of the slowest response from those servers

page 182 colocate the pageview information in the serving layer to speed things up because you can do a scan book-big-data#colocation
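
A hypothetical illustration of the colocation point: if the serving-layer index is keyed by (url, hour bucket) and kept sorted, all entries for a URL sit next to each other, so a time-range query is one contiguous scan instead of many random reads. The in-memory `bisect` lookup stands in for an on-disk sorted index.

```python
# Hypothetical sketch: colocated, sorted (url, hour) keys make range queries one scan.

import bisect

index = sorted([
    (("example.com/home", 0), 10),
    (("example.com/home", 1), 7),
    (("example.com/home", 2), 12),
    (("example.com/about", 0), 3),
])

def range_scan(url, start_hour, end_hour):
    keys = [k for k, _ in index]
    lo = bisect.bisect_left(keys, (url, start_hour))
    hi = bisect.bisect_right(keys, (url, end_hour))
    return sum(count for _, count in index[lo:hi])   # one contiguous slice of the index

print(range_scan("example.com/home", 0, 2))   # 10 + 7 + 12 = 29
```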

page 183: "a vital advantage of the Lambda Architecture is that it allows you to tailor the serving layer for the queries it serves to optimize efficiency" book-big-data#tailor-serving-layer-for-queries

page 185 "The denormalization process increases performance, but it comes with the huge complexity of keeping the redundant data consistent" book-big-data#cost-of-denormalization

page 184: these optimizations in the serving layer can go far beyond denormalization. In addition to prejoining data, you can perform additional aggregation and transformation to further improve efficiency. book-big-data#optimizations-in-the-serving-layer

page 185: since there are no random writes in the serving layer, you can optimize for the read path and get high performance. book-big-data#can-optimize-serving-layer-for-read-path

Page 186 "The output of the batch layer is unindexed. It's the job of the serving layer to index those views and serve them with low latency." book-big-data#batch-layer-output-unindexed

page 186 In practice you may find yourself repurposing traditional databases for the serving layer. book-big-data#repurpose-traditional-databases-for-serving-layer

Realtime Views: Illustration

Page 221: column family is analogous to tables in relational databases book-big-data#cassandra-column-family

page 221: if you consider a column family a giant map, keys are the top-level entries in that map book-big-data#cassandra-keys

page 221: Cassandra uses keys to partition a column family across a cluster book-big-data#cassandra-keys-used-for-partitioning

page 221: each key points to another map of key-value pairs called columns book-big-data#cassandra-keys-point-to-columns

page 221: all columns for a key are stored together, physically, making it inexpensive to access ranges of columns book-big-data#cassandra-columns-associated-with-a-key-stored-together

page 221: columns can differ from key to key book-big-data#cassandra-columns-can-differ-from-key-to-key

page 222: columns are sorted book-big-data#cassandra-columns-are-sorted

page 222: schemas only have to be created once per column family book-big-data#cassandra-schemas-only-have-to-be-created-once-per-column-family

page 223: because the columns are ordered and stored together, slice operations over a range of columns are very efficient. book-big-data#cassandra-slice-operations-efficient-because-columns-ordered-and-stored-together

can partition keys randomly or preserve order book-big-data#cassandra-can-partition-keys-randomly-or-preserve-order
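
A small Python model (not real Cassandra client code) of the data model these notes describe: a column family as a map from row key to a sorted map of columns, where a slice over a column range for one key is cheap because that key's columns are stored together in order. The `insert` and `slice_columns` helpers are hypothetical.

```python
# Hypothetical model: column family = map of row key -> sorted map of columns.

column_family = {}   # row key -> {column name: value}

def insert(key, column, value):
    column_family.setdefault(key, {})[column] = value

def slice_columns(key, start, end):
    # In Cassandra the columns for a key are physically stored in sorted order,
    # so a slice over a column range is a cheap contiguous read; emulated here.
    columns = column_family.get(key, {})
    return {c: v for c, v in sorted(columns.items()) if start <= c <= end}

insert("alice", "2024-01-01", 3)
insert("alice", "2024-01-02", 5)
insert("alice", "2024-01-03", 2)
print(slice_columns("alice", "2024-01-01", "2024-01-02"))   # {'2024-01-01': 3, '2024-01-02': 5}
```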

Glossary

data-architecture-glossary#glossary
 

Referring Pages

intermediate-results context-propagation data-architecture-glossary new-data-architecture schema-is-very-important keep-all-the-data-process-it-later cassandra-glossary

People

person-nathan-marz