talk-runaway-complexity-in-big-data

https://www.youtube.com/watch?v=ucHjyb6jv08

At 3:20 he talks about human fault tolerance.

At 5:20 he says the worst mistakes you can make are data loss and data corruption.

At 9:15: having an immutable system means you don't need an index on your data, because you never have to find a record in order to update it.
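A minimal sketch of that idea (the names here are mine, not from the talk): in an append-only store, an "update" is just a new fact appended to the log, so no index has to be maintained to locate and overwrite old records.

```python
import time

# Append-only log of immutable facts: (timestamp, entity, field, value).
log = []

def record(entity, field, value, ts=None):
    """'Updating' is just appending a newer fact -- no lookup required."""
    log.append((ts if ts is not None else time.time(), entity, field, value))

def current_value(entity, field):
    """The latest fact wins; derived by scanning the log, not via an index."""
    facts = [(ts, v) for ts, e, f, v in log if e == entity and f == field]
    return max(facts)[1] if facts else None

record("alice", "city", "NYC", ts=1)
record("alice", "city", "SF", ts=2)    # later fact supersedes; nothing is overwritten
print(current_value("alice", "city"))  # SF
```

A real system would derive indexed views from the log for read performance, but the master data itself never needs one.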

At 12:00: the way you store, model, and query data is complected; they are fundamentally intertwined.

At 12:15 he talks about systems where those parts are disassociated and could be scaled independently.

At 17:25 he uses Apache Thrift to define his schemas.
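A hypothetical Thrift IDL fragment in the spirit of what he describes (these type and field names are illustrative, not the actual schema from the talk):

```thrift
/* A person can be identified in more than one way; a union lets the
   schema evolve without rewriting immutable historical data. */
union PersonID {
  1: string cookie;
  2: i64 user_id;
}

struct PageView {
  1: required PersonID person;
  2: required string url;
  3: required i64 timestamp_millis;
}
```

The rigid, versioned schema is what makes the immutable master dataset safe to keep forever.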

At 27:25 he talks about ElephantDB, which he wrote, and Voldemort, which LinkedIn wrote.

At 28:30: you still have the ideas of transformation and normalization; however, they are disassociated from each other.

At 32:40 he talks about the architecture as facilitating complexity isolation: most of the complexity goes into the realtime, incremental processing, and the batch layer, which is much simpler, eventually overrides it.

At 35:20 he talks about eventual accuracy. Basically, it means that the batch layer will eventually override the realtime system.

At 39:20 he talks about how data storage is separated from querying, and how that helps with normalization and transformation. You get a fully normalized schema in storage, but you also get views (the batch views) that are optimized, potentially heavily denormalized, for your queries.
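One way to picture that separation (illustrative data, not from the talk): the master dataset stays fully normalized, and a batch function joins and aggregates it into a denormalized view shaped for the query.

```python
# Fully normalized master dataset: facts reference entities by id.
people = {1: "alice", 2: "bob"}
pageviews = [(1, "home.html"), (2, "home.html"), (1, "about.html")]

def build_batch_view(people, pageviews):
    """Batch function: join + aggregate into a denormalized, query-shaped
    view (name -> view count). The normalized storage is untouched."""
    view = {}
    for person_id, url in pageviews:
        name = people[person_id]
        view[name] = view.get(name, 0) + 1
    return view

batch_view = build_batch_view(people, pageviews)
print(batch_view)  # {'alice': 2, 'bob': 1}
```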

At 39:45: it gives you flexibility in your batch views. If you realize you need a very different set of batch views, you can change your recompute algorithm or target and create totally new batch views. These batch views could even live on a completely different system.

At 40:05: your needs change over time. As long as you can run a function over all of your data to recompute batch views, you will be able to satisfy your future needs as well.

At 44:38: this is the same idea as event sourcing.
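The event-sourcing connection can be sketched the same way (assumed event names): state is never stored directly, only derived by replaying the immutable event log, so any view can always be recomputed from scratch.

```python
# Immutable event log -- the source of truth, as in event sourcing.
events = [
    {"type": "deposited", "amount": 50},
    {"type": "withdrew", "amount": 20},
    {"type": "deposited", "amount": 5},
]

def replay(events):
    """Current state is a pure function of the full event history."""
    balance = 0
    for e in events:
        balance += e["amount"] if e["type"] == "deposited" else -e["amount"]
    return balance

print(replay(events))  # 35
```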

data-architecture-glossary#glossary
 

Referring Pages

data-architecture-glossary new-data-architecture

People

person-nathan-marz