data-architecture-organized-conceptual-glossary
Organized concepts from data-architecture-glossary
Amplification
Glossary
Examples
Context Propagation
Glossary
annotation
- The metadata system allows for arbitrary annotation of data. It is used to convey information to the compiler about types, but can also be used by application developers for many purposes, annotating data sources, policy etc.
(#)
causal ordering
- see also causal chain
- This is a causal ordering. It doesn't care so much about clock time. It cares what commits I worked from when I made mine. I knew about the parent commit, I started from there, so it's causal. Whatever you were doing on your branch, I didn't know about it, it wasn't causal, so there is no "before" or "after" relationship to yours and mine.
(#)
- For situations where reordering could be problematic, CockroachDB returns a causality token, which is just the maximum timestamp encountered during a transaction. If passed from one actor to the next in a causal chain, the token serves as a minimum timestamp for successive transactions and will guarantee that each has a properly ordered commit timestamp
(#)
- p 217: Write requests on the other hand will be coordinated by a node in the key's current preference list. This restriction is due to the fact that these preferred nodes have the added responsibility of creating a new version stamp that causally subsumes the version that has been updated by the write request. Note that if Dynamo's versioning scheme is based on physical timestamps, any node can coordinate a write request.
(#)
causal chain
- see also causal ordering
- see also lineage
- For situations where reordering could be problematic, CockroachDB returns a causality token, which is just the maximum timestamp encountered during a transaction. If passed from one actor to the next in a causal chain, the token serves as a minimum timestamp for successive transactions and will guarantee that each has a properly ordered commit timestamp
(#)
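A minimal sketch of the causality-token idea in the CockroachDB quote above. The names (run_txn, the counter clock) are illustrative, not CockroachDB's actual API: each transaction's commit timestamp is forced to be at least the token it was handed, so a causal chain of actors gets properly ordered commit timestamps.

```python
# Sketch of a causality token: the max timestamp seen by one transaction is
# passed to the next actor as a minimum commit timestamp. All names are
# illustrative.
import itertools

_clock = itertools.count(1)  # stand-in for a (possibly skewed) node clock

def run_txn(causality_token=0):
    """Commit a transaction and return its commit timestamp, which the
    caller can hand to the next actor in the causal chain as a token."""
    local_ts = next(_clock)
    # "the token serves as a minimum timestamp for successive transactions"
    commit_ts = max(local_ts, causality_token + 1)
    return commit_ts

# A causal chain of three actors, each passing the token to the next:
t1 = run_txn()
t2 = run_txn(causality_token=t1)
t3 = run_txn(causality_token=t2)
assert t1 < t2 < t3  # commit timestamps respect the causal order
```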
data lineage
- Data lineage, or data tracking, is generally defined as a type of data lifecycle that includes data origins and data movement over time. It can also describe transformations applied to the data as it passes through various processes. Data lineage can help analyse how information is used and track key information that serves a particular purpose.
(#)
- Page 9: A Big Data system must provide the information necessary to debug the system when things go wrong. The key is to be able to trace, for each value in the system, exactly what caused it to have that value.
(#)
- the focus of any data vault implementation is complete traceability and auditability of all information.
(#)
lineage
- see also causal ordering
- RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition
(#)
- While some DSM systems achieve fault tolerance through checkpointing [18], Spark reconstructs lost partitions of RDDs using lineage information captured in the RDD objects
(#)
- Capturing lineage or provenance information for datasets has long been a research topic in the scientific computing and database fields, for applications such as explaining results, allowing them to be reproduced by others, and recomputing data if a bug is found in a workflow step or if a dataset is lost. We refer the reader to [7], [23] and [9] for surveys of this work
(#)
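The RDD quotes above can be illustrated with a toy sketch (not Spark's API): a derived partition stores its lineage, i.e. its parent data and the transform that produced it, so a lost partition can be recomputed rather than restored from a checkpoint.

```python
# Toy lineage-based recovery: the derived data itself is disposable because
# the recipe for recomputing it is retained. Class and method names are
# made up for illustration.
class DerivedPartition:
    def __init__(self, parent, transform):
        self.parent = parent        # lineage: where the data came from
        self.transform = transform  # lineage: how it was derived
        self.data = None            # materialized lazily

    def compute(self):
        if self.data is None:
            self.data = [self.transform(x) for x in self.parent]
        return self.data

    def lose(self):
        self.data = None  # simulate a node failure losing this partition

part = DerivedPartition(parent=[1, 2, 3], transform=lambda x: x * 10)
before = part.compute()
part.lose()             # partition lost...
after = part.compute()  # ...and rebuilt from lineage alone
assert before == after == [10, 20, 30]
```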
metadata
- The metadata system allows for arbitrary annotation of data. It is used to convey information to the compiler about types, but can also be used by application developers for many purposes, annotating data sources, policy etc.
(#)
message id
- All assets published through the Gateway are assigned a unique message ID, and this ID is provided back to the publisher as well as passed along through Kafka and to the consuming applications, allowing us to track and monitor when each individual update is processed in each system, all the way out to the end-user applications. This is useful both for tracking performance and for pinpointing problems when something goes wrong.
(#)
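A hedged sketch of the tracking pattern the message-ID quote describes: an ID assigned once at the gateway, returned to the publisher, and logged at every downstream stage. The stage names and functions here are invented for illustration, not the actual system.

```python
# End-to-end tracking with a message ID: every stage records the same ID,
# so one update can be followed all the way to the end-user application.
import uuid

audit_log = []

def publish(payload):
    msg_id = str(uuid.uuid4())             # assigned once, at the edge
    audit_log.append((msg_id, "gateway"))
    consume({"id": msg_id, "payload": payload})
    return msg_id                          # returned to the publisher too

def consume(msg):
    audit_log.append((msg["id"], "kafka-consumer"))
    render(msg)

def render(msg):
    audit_log.append((msg["id"], "end-user-app"))

mid = publish({"asset": 42})
stages = [stage for (i, stage) in audit_log if i == mid]
assert stages == ["gateway", "kafka-consumer", "end-user-app"]
```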
provenance
- At 23:30 the attributes you add to reified transactions establish provenance
(#)
- Capturing lineage or provenance information for datasets has long been a research topic in the scientific computing and database fields, for applications such as explaining results, allowing them to be reproduced by others, and recomputing data if a bug is found in a workflow step or if a dataset is lost. We refer the reader to [7], [23] and [9] for surveys of this work
(#)
reified transactions
- see also provenance
- At 23:30 the attributes you add to reified transactions establish provenance
(#)
Examples
Fault Tolerance
Glossary
acyclic data flow graphs
- These systems achieve their scalability and fault tolerance by providing a programming model where the user creates acyclic data flow graphs to pass input data through a set of operators. This allows the underlying system to manage scheduling and to react to faults without user intervention.
(#)
- All that goes to hell as soon as you back-feed outputs of a later stage into inputs of an earlier stage. Now you have one monolithic block of code where you've semi-pointlessly drawn some boxes inside of it to pretend like it's modular like the rest of the pipeline, but it's not. You can't understand it without understanding the whole thing
(#)
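The contrast in the two quotes above can be sketched in a few lines, assuming hypothetical stage names: with an acyclic flow, each operator is a pure stage and the pipeline is plain composition, so any stage can be rerun or debugged in isolation; back-feeding a later stage's output into an earlier one destroys exactly that property.

```python
# A straight-line data flow DAG of pure operators. The system, not the
# user, can drive scheduling and rerun any stage on failure.
def parse(lines):
    return [line.split(",") for line in lines]

def project(rows):
    return [row[0] for row in rows]

def count(keys):
    out = {}
    for k in keys:
        out[k] = out.get(k, 0) + 1
    return out

pipeline = [parse, project, count]  # acyclic: data only flows forward

data = ["a,1", "b,2", "a,3"]
for stage in pipeline:
    data = stage(data)
assert data == {"a": 2, "b": 1}
```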
Examples
Locking
Glossary
heavyweight lock
- The first commit 6d46f478 has changed the heavyweight locks (locks that are used for logical database objects to ensure the database ACID properties) to lightweight locks (locks to protect shared data structures) for scanning the bucket pages
(#)
lightweight lock
- The first commit 6d46f478 has changed the heavyweight locks (locks that are used for logical database objects to ensure the database ACID properties) to lightweight locks (locks to protect shared data structures) for scanning the bucket pages
(#)
lock sharding
- At 8:20 lock sharding
(#)
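A minimal sketch of lock sharding (sometimes called striped locks): instead of one global lock, hash each key onto one of N locks so that operations on unrelated keys rarely contend. The shard count and table here are illustrative.

```python
# Lock sharding: contention is limited to keys that hash to the same stripe.
import threading

N_SHARDS = 16
locks = [threading.Lock() for _ in range(N_SHARDS)]
table = {}

def lock_for(key):
    return locks[hash(key) % N_SHARDS]

def increment(key):
    with lock_for(key):  # only keys in the same stripe contend
        table[key] = table.get(key, 0) + 1

threads = [threading.Thread(target=increment, args=("k%d" % i,))
           for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert sum(table.values()) == 100
```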
wide-area locking
- p 2: To allow for concurrent updates while avoiding many of the problems inherent with wide-area locking, it uses an update model based on conflict resolution
(#)
Examples
Materialization
Glossary
materialization
- Location: 11,286 Dataflow engines perform less materialization of intermediate state and keep more in memory, which means that they need to recompute more data if a node fails. Deterministic operators reduce the amount of data that needs to be recomputed.
(#)
materialized stage
- I like that the Lambda Architecture emphasizes retaining the input data unchanged. I think the discipline of modeling data transformation as a series of materialized stages from an original input has a lot of merit. This is one of the things that makes large MapReduce workflows tractable, as it enables you to debug each stage independently.
(#)
materialized view
- see also materialization
- We saw in "Databases and Streams" that a stream of changes to a database can be used to keep derived data systems, such as caches, search indexes, and data warehouses, up to date with a source database. We can regard these examples as specific cases of maintaining materialized views
(#)
- In order to take full advantage of this setup, we need to build applications in such a way that it is easy to deploy new instances that use replay to recreate their materialized view of the log
(#)
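The replay idea in the quotes above can be sketched as follows, with a made-up changelog format: a new instance rebuilds its materialized view from scratch by replaying the log, so deploying fresh instances is cheap and all replayers converge on the same view.

```python
# Maintaining a materialized view by replaying a change log.
log = [("put", "a", 1), ("put", "b", 2), ("put", "a", 3), ("del", "b", None)]

def replay(changelog):
    view = {}
    for op, key, value in changelog:
        if op == "put":
            view[key] = value
        elif op == "del":
            view.pop(key, None)
    return view

# Two independently deployed instances converge by replaying the same log:
assert replay(log) == replay(list(log)) == {"a": 3}
```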
Examples
Nodes
Glossary
heterogeneity
- p 208: The system needs to be able to exploit heterogeneity in the infrastructure it runs on. e.g. the work distribution must be proportional to the capabilities of the individual servers. This is essential in adding new nodes with higher capacity without having to upgrade all hosts at once.
(#)
- p 210: The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.
(#)
virtual nodes
- p 210: The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.
(#)
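A sketch of capacity-weighted virtual nodes on a consistent-hash ring, in the spirit of the Dynamo quotes above (host names and vnode counts are invented): a higher-capacity host is given more vnodes and therefore receives a proportionally larger share of keys.

```python
# Consistent hashing with virtual nodes: vnode count per physical node is
# proportional to its capacity, exploiting infrastructure heterogeneity.
import hashlib
import bisect

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

capacities = {"small-host": 8, "big-host": 32}  # vnodes per physical node

ring = sorted((h("%s#%d" % (node, i)), node)
              for node, n_vnodes in capacities.items()
              for i in range(n_vnodes))
points = [p for p, _ in ring]

def owner(key):
    idx = bisect.bisect(points, h(key)) % len(ring)  # next vnode clockwise
    return ring[idx][1]

counts = {"small-host": 0, "big-host": 0}
for k in range(10000):
    counts[owner("key-%d" % k)] += 1
# The bigger host should own a clearly larger share of the keys.
assert counts["big-host"] > counts["small-host"]
```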
Examples
Ordering
Glossary
causal ordering
- see also causal chain
- This is a causal ordering. It doesn't care so much about clock time. It cares what commits I worked from when I made mine. I knew about the parent commit, I started from there, so it's causal. Whatever you were doing on your branch, I didn't know about it, it wasn't causal, so there is no "before" or "after" relationship to yours and mine.
(#)
- For situations where reordering could be problematic, CockroachDB returns a causality token, which is just the maximum timestamp encountered during a transaction. If passed from one actor to the next in a causal chain, the token serves as a minimum timestamp for successive transactions and will guarantee that each has a properly ordered commit timestamp
(#)
- p 217: Write requests on the other hand will be coordinated by a node in the key's current preference list. This restriction is due to the fact that these preferred nodes have the added responsibility of creating a new version stamp that causally subsumes the version that has been updated by the write request. Note that if Dynamo's versioning scheme is based on physical timestamps, any node can coordinate a write request.
(#)
causal chain
- see also causal ordering
- see also lineage
- For situations where reordering could be problematic, CockroachDB returns a causality token, which is just the maximum timestamp encountered during a transaction. If passed from one actor to the next in a causal chain, the token serves as a minimum timestamp for successive transactions and will guarantee that each has a properly ordered commit timestamp
(#)
lineage
- see also causal ordering
- RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition
(#)
- While some DSM systems achieve fault tolerance through checkpointing [18], Spark reconstructs lost partitions of RDDs using lineage information captured in the RDD objects
(#)
- Capturing lineage or provenance information for datasets has long been a research topic in the scientific computing and database fields, for applications such as explaining results, allowing them to be reproduced by others, and recomputing data if a bug is found in a workflow step or if a dataset is lost. We refer the reader to [7], [23] and [9] for surveys of this work
(#)
linearizability
- see also total order broadcast
- While Spanner provides linearizability, CockroachDB's external consistency guarantee is by default only serializability, though with some features that can help bridge the gap in practice.
(#)
- Location: 8,989 linearizability is a recency guarantee: a read is guaranteed to see the latest value written.
(#)
- Location 8,990 if you have total order broadcast, you can build linearizable storage on top of it
(#)
- Location: 13,963 writes that may conflict are routed to the same partition and processed sequentially
(#)
- External consistency is a stronger property than both linearizability and serializability.
(#)
- Linearizability is a recency guarantee on reads and writes of a register (an individual object). It doesn't group operations together into transactions, so it does not prevent problems such as write skew, unless you take additional measures such as materializing conflicts
(#)
ordering
- The order of events in two different partitions is then ambiguous.
(#)
- Location: 11,763 There is no ordering guarantee across different partitions.
(#)
serializability
- p 214: Although it is desirable always to have the first node among the top N to coordinate the writes thereby serializing all writes at a single location, this approach has led to uneven load distribution resulting in SLA violations
(#)
- While Spanner provides linearizability, CockroachDB's external consistency guarantee is by default only serializability, though with some features that can help bridge the gap in practice.
(#)
- Location: 12,111 Thus, any validation of a command needs to happen synchronously, before it becomes an event — for example, by using a serializable transaction that atomically validates the command and publishes the event.
(#)
- Location: 13,780 serializability and atomic commit are established approaches, but they come at a cost: they typically only work in a single datacenter (ruling out geographically distributed architectures), and they limit the scale and fault-tolerance properties you can achieve.
(#)
- Location: 13,862 (whereas an application-level check-then-insert may fail under nonserializable isolation, as discussed in "Write Skew and Phantoms").
(#)
- External consistency is a stronger property than both linearizability and serializability.
(#)
total order broadcast
- see also linearizability
- Location: 8,988 total order broadcast is asynchronous: messages are guaranteed to be delivered reliably in a fixed order, but there is no guarantee about when a message will be delivered (so one recipient may lag behind the others)
(#)
- Location 8,990 if you have total order broadcast, you can build linearizable storage on top of it
(#)
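The "linearizable storage on top of total order broadcast" claim above can be sketched with a toy single-process model (the log list stands in for the broadcast channel): all writes go through one totally ordered log, and a replica answers a read only after applying every entry delivered so far, so every reader sees the latest write in log order.

```python
# Linearizable register built on a totally ordered log (toy model).
log = []  # stand-in for a total order broadcast channel

def broadcast(entry):
    log.append(entry)  # the single global append order IS the total order

class Replica:
    def __init__(self):
        self.applied = 0
        self.value = None

    def catch_up(self):
        while self.applied < len(log):
            self.value = log[self.applied]
            self.applied += 1

    def linearizable_read(self):
        self.catch_up()  # apply everything delivered before answering
        return self.value

a, b = Replica(), Replica()
broadcast("v1")
broadcast("v2")
# Both replicas see the latest value, regardless of how far behind they were:
assert a.linearizable_read() == b.linearizable_read() == "v2"
```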
Examples
Indexes
Glossary
Examples
Partitioning
Glossary
Examples
The order of events in two different partitions is then ambiguous.
(#)
Location: 11,763 There is no ordering guarantee across different partitions.
(#)
Performance
Glossary
write throughput
- p 2: A write operation in Dynamo also requires a read to be performed for managing the vector timestamps. This can be very limiting in environments where systems need to handle a very high write throughput.
(#)
Examples
Reads
Glossary
Examples
Scalability
Glossary
acyclic data flow graphs
- These systems achieve their scalability and fault tolerance by providing a programming model where the user creates acyclic data flow graphs to pass input data through a set of operators. This allows the underlying system to manage scheduling and to react to faults without user intervention.
(#)
- All that goes to hell as soon as you back-feed outputs of a later stage into inputs of an earlier stage. Now you have one monolithic block of code where you've semi-pointlessly drawn some boxes inside of it to pretend like it's modular like the rest of the pipeline, but it's not. You can't understand it without understanding the whole thing
(#)
incremental scalability
- see also scalability
- p 208: Dynamo should be able to scale out one storage host (henceforth, referred to as "node") at a time, with minimal impact on both operators of the system and the system itself.
(#)
Examples
Synchronous vs Asynchronous
Glossary
Examples
Location: 5,302 In practice, updates to global secondary indexes (global index) are often asynchronous (that is, if you read the index shortly after a write, the change you just made may not yet be reflected in the index).
(#)
Time
Glossary
event time
- in contrast to processing time
- At 2140 the time is event time based as opposed to processing time based, meaning they take the time information from the event, not when it is processed.
(#)
processing time
- in contrast to event time
- At 2140 the time is event time based as opposed to processing time based, meaning they take the time information from the event, not when it is processed.
(#)
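The event-time/processing-time distinction above can be shown with a small sketch (timestamps and window size are invented): the same late-arriving event lands in different hourly buckets depending on which timestamp the system groups by.

```python
# Event-time vs processing-time windowing for a late-arriving event.
from collections import defaultdict

# (event_time, processing_time) in epoch seconds; the second event arrives
# about an hour late.
events = [(3600, 3605), (3650, 7300)]

def window(ts, size=3600):
    return ts // size  # hourly bucket index

by_event_time = defaultdict(int)
by_processing_time = defaultdict(int)
for et, pt in events:
    by_event_time[window(et)] += 1       # take the time from the event
    by_processing_time[window(pt)] += 1  # take the time it was processed

assert dict(by_event_time) == {1: 2}             # both events in hour 1
assert dict(by_processing_time) == {1: 1, 2: 1}  # the late one shifts
```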
time
- Location: 13,264 This raises the problems discussed in "Reasoning About Time", such as handling stragglers and handling windows that cross boundaries between batches.
(#)
time buckets
- Zero-filling Timeseries queries normally fill empty interior time buckets with zeroes. For example, if you issue a "day" granularity timeseries query for the interval 2012-01-01/2012-01-04, and no data exists for 2012-01-02, you will receive
(#)
zero filling
- Zero-filling Timeseries queries normally fill empty interior time buckets with zeroes. For example, if you issue a "day" granularity timeseries query for the interval 2012-01-01/2012-01-04, and no data exists for 2012-01-02, you will receive
(#)
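The zero-filling behavior in the quotes above can be sketched as follows (the data and function are illustrative, not Druid's API): a day-granularity query over 2012-01-01/2012-01-04 returns a zero row for the empty interior bucket 2012-01-02 instead of omitting it.

```python
# Zero-filling empty interior time buckets in a timeseries result.
from datetime import date, timedelta

data = {date(2012, 1, 1): 5, date(2012, 1, 3): 7}  # no row for Jan 2

def timeseries(start, end, values):
    out, day = [], start
    while day < end:  # interval end is exclusive
        out.append((day.isoformat(), values.get(day, 0)))  # fill with zero
        day += timedelta(days=1)
    return out

result = timeseries(date(2012, 1, 1), date(2012, 1, 4), data)
assert result == [("2012-01-01", 5), ("2012-01-02", 0), ("2012-01-03", 7)]
```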
Examples
Transactions
Glossary
distributed transactions
- Location: 13,190 In principle, derived data systems could be maintained synchronously, just like a relational database updates secondary indexes synchronously within the same transaction as writes to the table being indexed. However, asynchrony is what makes systems based on event logs robust: it allows a fault in one part of the system to be contained locally, whereas distributed transactions abort if any one participant fails, so they tend to amplify failures by spreading them to the rest of the system (see "Limitations of distributed transactions").
(#)
reified transactions
- see also provenance
- At 23:30 the attributes you add to reified transactions establish provenance
(#)
Examples
Writes
Glossary
Examples
Referring Pages
data-architecture-glossary