data-architecture-organized-conceptual-glossary
Organized concepts from data-architecture-glossary
Amplification
Glossary
Examples
Context Propagation
Glossary
annotation
- The metadata system allows for arbitrary annotation of data. It is used to convey information to the compiler about types, but can also be used by application developers for many purposes, annotating data sources, policy etc.
(#)
causal ordering
- see also causal chain
- This is a causal ordering. It doesn't care so much about clock time. It cares what commits I worked from when I made mine. I knew about the parent commit, I started from there, so it's causal. Whatever you were doing on your branch, I didn't know about it, it wasn't causal, so there is no "before" or "after" relationship to yours and mine.
(#)
- For situations where reordering could be problematic, CockroachDB returns a causality token, which is just the maximum timestamp encountered during a transaction. If passed from one actor to the next in a causal chain, the token serves as a minimum timestamp for successive transactions and will guarantee that each has a properly ordered commit timestamp
(#)
- p 217: Write requests on the other hand will be coordinated by a node in the key's current preference list. This restriction is due to the fact that these preferred nodes have the added responsibility of creating a new version stamp that causally subsumes the version that has been updated by the write request. Note that if Dynamo's versioning scheme is based on physical timestamps, any node can coordinate a write request.
(#)
causal chain
- see also causal ordering
- see also lineage
- For situations where reordering could be problematic, CockroachDB returns a causality token, which is just the maximum timestamp encountered during a transaction. If passed from one actor to the next in a causal chain, the token serves as a minimum timestamp for successive transactions and will guarantee that each has a properly ordered commit timestamp
(#)
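A minimal sketch of the causality-token idea in the CockroachDB quote above. The names (run_txn, the counter clock) are illustrative, not CockroachDB's actual API: each transaction's commit timestamp is forced to be at least the token it was handed, so a causal chain of actors gets properly ordered commit timestamps.

```python
# Sketch of a causality token: the max timestamp seen by one transaction is
# passed to the next actor as a minimum commit timestamp. All names are
# illustrative.
import itertools

_clock = itertools.count(1)  # stand-in for a (possibly skewed) node clock

def run_txn(causality_token=0):
    """Commit a transaction and return its commit timestamp, which the
    caller can hand to the next actor in the causal chain as a token."""
    local_ts = next(_clock)
    # "the token serves as a minimum timestamp for successive transactions"
    commit_ts = max(local_ts, causality_token + 1)
    return commit_ts

# A causal chain of three actors, each passing the token to the next:
t1 = run_txn()
t2 = run_txn(causality_token=t1)
t3 = run_txn(causality_token=t2)
assert t1 < t2 < t3  # commit timestamps respect the causal order
```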
data lineage
- Data lineage, or data tracking, is generally defined as a type of data lifecycle that includes data origins and data movement over time. It can also describe transformations applied to the data as it passes through various processes. Data lineage can help analyse how information is used and track key information that serves a particular purpose.
(#)
- Page 9: A Big Data system must provide the information necessary to debug the system when things go wrong. The key is to be able to trace, for each value in the system, exactly what caused it to have that value.
(#)
- the focus of any data vault implementation is complete traceability and auditability of all information.
(#)
lineage
- see also causal ordering
- RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition
(#)
- While some DSM systems achieve fault tolerance through checkpointing [18], Spark reconstructs lost partitions of RDDs using lineage information captured in the RDD objects
(#)
- Capturing lineage or provenance information for datasets has long been a research topic in the scientific computing and database fields, for applications such as explaining results, allowing them to be reproduced by others, and recomputing data if a bug is found in a workflow step or if a dataset is lost. We refer the reader to [7], [23] and [9] for surveys of this work
(#)
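The RDD quotes above can be illustrated with a toy sketch (not Spark's API): a derived partition stores its lineage, i.e. its parent data and the transform that produced it, so a lost partition can be recomputed rather than restored from a checkpoint.

```python
# Toy lineage-based recovery: the derived data itself is disposable because
# the recipe for recomputing it is retained. Class and method names are
# made up for illustration.
class DerivedPartition:
    def __init__(self, parent, transform):
        self.parent = parent        # lineage: where the data came from
        self.transform = transform  # lineage: how it was derived
        self.data = None            # materialized lazily

    def compute(self):
        if self.data is None:
            self.data = [self.transform(x) for x in self.parent]
        return self.data

    def lose(self):
        self.data = None  # simulate a node failure losing this partition

part = DerivedPartition(parent=[1, 2, 3], transform=lambda x: x * 10)
before = part.compute()
part.lose()             # partition lost...
after = part.compute()  # ...and rebuilt from lineage alone
assert before == after == [10, 20, 30]
```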
metadata
- The metadata system allows for arbitrary annotation of data. It is used to convey information to the compiler about types, but can also be used by application developers for many purposes, annotating data sources, policy etc.
(#)
message id
- All assets published through the Gateway are assigned a unique message ID, and this ID is provided back to the publisher as well as passed along through Kafka and to the consuming applications, allowing us to track and monitor when each individual update is processed in each system, all the way out to the end-user applications. This is useful both for tracking performance and for pinpointing problems when something goes wrong.
(#)
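A hedged sketch of the tracking pattern the message-ID quote describes: an ID assigned once at the gateway, returned to the publisher, and logged at every downstream stage. The stage names and functions here are invented for illustration, not the actual system.

```python
# End-to-end tracking with a message ID: every stage records the same ID,
# so one update can be followed all the way to the end-user application.
import uuid

audit_log = []

def publish(payload):
    msg_id = str(uuid.uuid4())             # assigned once, at the edge
    audit_log.append((msg_id, "gateway"))
    consume({"id": msg_id, "payload": payload})
    return msg_id                          # returned to the publisher too

def consume(msg):
    audit_log.append((msg["id"], "kafka-consumer"))
    render(msg)

def render(msg):
    audit_log.append((msg["id"], "end-user-app"))

mid = publish({"asset": 42})
stages = [stage for (i, stage) in audit_log if i == mid]
assert stages == ["gateway", "kafka-consumer", "end-user-app"]
```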
provenance
- At 23:30 the attributes you add to reified transactions establish provenance
(#)
- Capturing lineage or provenance information for datasets has long been a research topic in the scientific computing and database fields, for applications such as explaining results, allowing them to be reproduced by others, and recomputing data if a bug is found in a workflow step or if a dataset is lost. We refer the reader to [7], [23] and [9] for surveys of this work
(#)
reified transactions
- see also provenance
- At 23:30 the attributes you add to reified transactions establish provenance
(#)
Examples
Fault Tolerance
Glossary
acyclic data flow graphs
- These systems achieve their scalability and fault tolerance by providing a programming model where the user creates acyclic data flow graphs to pass input data through a set of operators. This allows the underlying system to manage scheduling and to react to faults without user intervention.
(#)
- All that goes to hell as soon as you back-feed outputs of a later stage into inputs of an earlier stage. Now you have one monolithic block of code where you've semi-pointlessly drawn some boxes inside of it to pretend like it's modular like the rest of the pipeline, but it's not. You can't understand it without understanding the whole thing
(#)
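The contrast in the two quotes above can be sketched in a few lines, assuming hypothetical stage names: with an acyclic flow, each operator is a pure stage and the pipeline is plain composition, so any stage can be rerun or debugged in isolation; back-feeding a later stage's output into an earlier one destroys exactly that property.

```python
# A straight-line data flow DAG of pure operators. The system, not the
# user, can drive scheduling and rerun any stage on failure.
def parse(lines):
    return [line.split(",") for line in lines]

def project(rows):
    return [row[0] for row in rows]

def count(keys):
    out = {}
    for k in keys:
        out[k] = out.get(k, 0) + 1
    return out

pipeline = [parse, project, count]  # acyclic: data only flows forward

data = ["a,1", "b,2", "a,3"]
for stage in pipeline:
    data = stage(data)
assert data == {"a": 2, "b": 1}
```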
Examples
Locking
Glossary
heavyweight lock
- The first commit 6d46f478 has changed the heavyweight locks (locks that are used for logical database objects to ensure the database ACID properties) to lightweight locks (locks to protect shared data structures) for scanning the bucket pages
(#)
lightweight lock
- The first commit 6d46f478 has changed the heavyweight locks (locks that are used for logical database objects to ensure the database ACID properties) to lightweight locks (locks to protect shared data structures) for scanning the bucket pages
(#)
lock sharding
- At 8:20 lock sharding
(#)
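A minimal sketch of lock sharding (sometimes called striped locks): instead of one global lock, hash each key onto one of N locks so that operations on unrelated keys rarely contend. The shard count and table here are illustrative.

```python
# Lock sharding: contention is limited to keys that hash to the same stripe.
import threading

N_SHARDS = 16
locks = [threading.Lock() for _ in range(N_SHARDS)]
table = {}

def lock_for(key):
    return locks[hash(key) % N_SHARDS]

def increment(key):
    with lock_for(key):  # only keys in the same stripe contend
        table[key] = table.get(key, 0) + 1

threads = [threading.Thread(target=increment, args=("k%d" % i,))
           for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert sum(table.values()) == 100
```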
wide-area locking
- p 2: To allow for concurrent updates while avoiding many of the problems inherent with wide-area locking, it uses an update model based on conflict resolution
(#)
Examples
Materialization
Glossary
materialization
- Location: 11,286 Dataflow engines perform less materialization of intermediate state and keep more in memory, which means that they need to recompute more data if a node fails. Deterministic operators reduce the amount of data that needs to be recomputed.
(#)
materialized stage
- I like that the Lambda Architecture emphasizes retaining the input data unchanged. I think the discipline of modeling data transformation as a series of materialized stages from an original input has a lot of merit. This is one of the things that makes large MapReduce workflows tractable, as it enables you to debug each stage independently.
(#)
materialized view
- see also materialization
- We saw in "Databases and Streams" that a stream of changes to a database can be used to keep derived data systems, such as caches, search indexes, and data warehouses, up to date with a source database. We can regard these examples as specific cases of maintaining materialized views
(#)
- In order to take full advantage of this setup, we need to build applications in such a way that it is easy to deploy new instances that use replay to recreate their materialized view of the log
(#)
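The replay idea in the quotes above can be sketched as follows, with a made-up changelog format: a new instance rebuilds its materialized view from scratch by replaying the log, so deploying fresh instances is cheap and all replayers converge on the same view.

```python
# Maintaining a materialized view by replaying a change log.
log = [("put", "a", 1), ("put", "b", 2), ("put", "a", 3), ("del", "b", None)]

def replay(changelog):
    view = {}
    for op, key, value in changelog:
        if op == "put":
            view[key] = value
        elif op == "del":
            view.pop(key, None)
    return view

# Two independently deployed instances converge by replaying the same log:
assert replay(log) == replay(list(log)) == {"a": 3}
```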
Examples
Nodes
Glossary
heterogeneity
- p 208: The system needs to be able to exploit heterogeneity in the infrastructure it runs on. e.g. the work distribution must be proportional to the capabilities of the individual servers. This is essential in adding new nodes with higher capacity without having to upgrade all hosts at once.
(#)
- p 210: The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.
(#)
virtual nodes
- p 210: The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.
(#)
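A sketch of capacity-weighted virtual nodes on a consistent-hash ring, in the spirit of the Dynamo quotes above (host names and vnode counts are invented): a higher-capacity host is given more vnodes and therefore receives a proportionally larger share of keys.

```python
# Consistent hashing with virtual nodes: vnode count per physical node is
# proportional to its capacity, exploiting infrastructure heterogeneity.
import hashlib
import bisect

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

capacities = {"small-host": 8, "big-host": 32}  # vnodes per physical node

ring = sorted((h("%s#%d" % (node, i)), node)
              for node, n_vnodes in capacities.items()
              for i in range(n_vnodes))
points = [p for p, _ in ring]

def owner(key):
    idx = bisect.bisect(points, h(key)) % len(ring)  # next vnode clockwise
    return ring[idx][1]

counts = {"small-host": 0, "big-host": 0}
for k in range(10000):
    counts[owner("key-%d" % k)] += 1
# The bigger host should own a clearly larger share of the keys.
assert counts["big-host"] > counts["small-host"]
```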
Examples
Ordering
Glossary
causal ordering
- see also causal chain
- This is a causal ordering. It doesn't care so much about clock time. It cares what commits I worked from when I made mine. I knew about the parent commit, I started from there, so it's causal. Whatever you were doing on your branch, I didn't know about it, it wasn't causal, so there is no "before" or "after" relationship to yours and mine.
(#)
- For situations where reordering could be problematic, CockroachDB returns a causality token, which is just the maximum timestamp encountered during a transaction. If passed from one actor to the next in a causal chain, the token serves as a minimum timestamp for successive transactions and will guarantee that each has a properly ordered commit timestamp
(#)
- p 217: Write requests on the other hand will be coordinated by a node in the key's current preference list. This restriction is due to the fact that these preferred nodes have the added responsibility of creating a new version stamp that causally subsumes the version that has been updated by the write request. Note that if Dynamo's versioning scheme is based on physical timestamps, any node can coordinate a write request.
(#)
causal chain
- see also causal ordering
- see also lineage
- For situations where reordering could be problematic, CockroachDB returns a causality token, which is just the maximum timestamp encountered during a transaction. If passed from one actor to the next in a causal chain, the token serves as a minimum timestamp for successive transactions and will guarantee that each has a properly ordered commit timestamp
(#)
lineage
- see also causal ordering
- RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition
(#)
- While some DSM systems achieve fault tolerance through checkpointing [18], Spark reconstructs lost partitions of RDDs using lineage information captured in the RDD objects
(#)
- Capturing lineage or provenance information for datasets has long been a research topic in the scientific computing and database fields, for applications such as explaining results, allowing them to be reproduced by others, and recomputing data if a bug is found in a workflow step or if a dataset is lost. We refer the reader to [7], [23] and [9] for surveys of this work
(#)
linearizability
- see also total order broadcast
- While Spanner provides linearizability, CockroachDB's external consistency guarantee is by default only serializability, though with some features that can help bridge the gap in practice.
(#)
- Location: 8,989 linearizability is a recency guarantee: a read is guaranteed to see the latest value written.
(#)
- Location 8,990 if you have total order broadcast, you can build linearizable storage on top of it
(#)
- Location: 13,963 writes that may conflict are routed to the same partition and processed sequentially
(#)
- External consistency is a stronger property than both linearizability and serializability.
(#)
- Linearizability is a recency guarantee on reads and writes of a register (an individual object). It doesn't group operations together into transactions, so it does not prevent problems such as write skew, unless you take additional measures such as materializing conflicts
(#)
ordering
- The order of events in two different partitions is then ambiguous.
(#)
- Location: 11,763 There is no ordering guarantee across different partitions.
(#)
serializability
- p 214: Although it is desirable always to have the first node among the top N to coordinate the writes thereby serializing all writes at a single location, this approach has led to uneven load distribution resulting in SLA violations
(#)
- While Spanner provides linearizability, CockroachDB's external consistency guarantee is by default only serializability, though with some features that can help bridge the gap in practice.
(#)
- Location: 12,111 Thus, any validation of a command needs to happen synchronously, before it becomes an event — for example, by using a serializable transaction that atomically validates the command and publishes the event.
(#)
- Location: 13,780 serializability and atomic commit are established approaches, but they come at a cost: they typically only work in a single datacenter (ruling out geographically distributed architectures), and they limit the scale and fault-tolerance properties you can achieve.
(#)
- Location: 13,862 (whereas an application-level check-then-insert may fail under nonserializable isolation, as discussed in "Write Skew and Phantoms").
(#)
- External consistency is a stronger property than both linearizability and serializability.
(#)
total order broadcast
- see also linearizability
- Location: 8,988 total order broadcast is asynchronous: messages are guaranteed to be delivered reliably in a fixed order, but there is no guarantee about when a message will be delivered (so one recipient may lag behind the others)
(#)
- Location 8,990 if you have total order broadcast, you can build linearizable storage on top of it
(#)
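The "linearizable storage on top of total order broadcast" claim above can be sketched with a toy single-process model (the log list stands in for the broadcast channel): all writes go through one totally ordered log, and a replica answers a read only after applying every entry delivered so far, so every reader sees the latest write in log order.

```python
# Linearizable register built on a totally ordered log (toy model).
log = []  # stand-in for a total order broadcast channel

def broadcast(entry):
    log.append(entry)  # the single global append order IS the total order

class Replica:
    def __init__(self):
        self.applied = 0
        self.value = None

    def catch_up(self):
        while self.applied < len(log):
            self.value = log[self.applied]
            self.applied += 1

    def linearizable_read(self):
        self.catch_up()  # apply everything delivered before answering
        return self.value

a, b = Replica(), Replica()
broadcast("v1")
broadcast("v2")
# Both replicas see the latest value, regardless of how far behind they were:
assert a.linearizable_read() == b.linearizable_read() == "v2"
```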
Examples
Indexes
Glossary
Examples
Partitioning
Glossary
Examples
The order of events in two different partitions is then ambiguous.
(#)
Location: 11,763 There is no ordering guarantee across different partitions.
(#)
Performance
Glossary
write throughput
- p 2: A write operation in Dynamo also requires a read to be performed for managing the vector timestamps. This can be very limiting in environments where systems need to handle a very high write throughput.
(#)
Examples
Reads
Glossary
Examples
Scalability
Glossary
acyclic data flow graphs
- These systems achieve their scalability and fault tolerance by providing a programming model where the user creates acyclic data flow graphs to pass input data through a set of operators. This allows the underlying system to manage scheduling and to react to faults without user intervention.
(#)
- All that goes to hell as soon as you back-feed outputs of a later stage into inputs of an earlier stage. Now you have one monolithic block of code where you've semi-pointlessly drawn some boxes inside of it to pretend like it's modular like the rest of the pipeline, but it's not. You can't understand it without understanding the whole thing
(#)
incremental scalability
- see also scalability
- p 208: Dynamo should be able to scale out one storage host (henceforth, referred to as "node") at a time, with minimal impact on both operators of the system and the system itself.
(#)
Examples
Synchronous vs Asynchronous
Glossary
Examples
Location: 5,302 In practice, updates to global secondary indexes (global index) are often asynchronous (that is, if you read the index shortly after a write, the change you just made may not yet be reflected in the index).
(#)
Time
Glossary
event time
- in contrast to processing time
- At 2140 the time is event time based as opposed to processing time based, meaning they take the time information from the event, not when it is processed.
(#)
processing time
- in contrast to event time
- At 2140 the time is event time based as opposed to processing time based, meaning they take the time information from the event, not when it is processed.
(#)
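The event-time/processing-time distinction above can be shown with a small sketch (timestamps and window size are invented): the same late-arriving event lands in different hourly buckets depending on which timestamp the system groups by.

```python
# Event-time vs processing-time windowing for a late-arriving event.
from collections import defaultdict

# (event_time, processing_time) in epoch seconds; the second event arrives
# about an hour late.
events = [(3600, 3605), (3650, 7300)]

def window(ts, size=3600):
    return ts // size  # hourly bucket index

by_event_time = defaultdict(int)
by_processing_time = defaultdict(int)
for et, pt in events:
    by_event_time[window(et)] += 1       # take the time from the event
    by_processing_time[window(pt)] += 1  # take the time it was processed

assert dict(by_event_time) == {1: 2}             # both events in hour 1
assert dict(by_processing_time) == {1: 1, 2: 1}  # the late one shifts
```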
time
- Location: 13,264 This raises the problems discussed in "Reasoning About Time", such as handling stragglers and handling windows that cross boundaries between batches.
(#)
time buckets
- Zero-filling Timeseries queries normally fill empty interior time buckets with zeroes. For example, if you issue a "day" granularity timeseries query for the interval 2012-01-01/2012-01-04, and no data exists for 2012-01-02, you will receive
(#)
zero filling
- Zero-filling Timeseries queries normally fill empty interior time buckets with zeroes. For example, if you issue a "day" granularity timeseries query for the interval 2012-01-01/2012-01-04, and no data exists for 2012-01-02, you will receive
(#)
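The zero-filling behavior in the quotes above can be sketched as follows (the data and function are illustrative, not Druid's API): a day-granularity query over 2012-01-01/2012-01-04 returns a zero row for the empty interior bucket 2012-01-02 instead of omitting it.

```python
# Zero-filling empty interior time buckets in a timeseries result.
from datetime import date, timedelta

data = {date(2012, 1, 1): 5, date(2012, 1, 3): 7}  # no row for Jan 2

def timeseries(start, end, values):
    out, day = [], start
    while day < end:  # interval end is exclusive
        out.append((day.isoformat(), values.get(day, 0)))  # fill with zero
        day += timedelta(days=1)
    return out

result = timeseries(date(2012, 1, 1), date(2012, 1, 4), data)
assert result == [("2012-01-01", 5), ("2012-01-02", 0), ("2012-01-03", 7)]
```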
Examples
Transactions
Glossary
distributed transactions
- Location: 13,190 In principle, derived data systems could be maintained synchronously, just like a relational database updates secondary indexes synchronously within the same transaction as writes to the table being indexed. However, asynchrony is what makes systems based on event logs robust: it allows a fault in one part of the system to be contained locally, whereas distributed transactions abort if any one participant fails, so they tend to amplify failures by spreading them to the rest of the system (see "Limitations of distributed transactions").
(#)
reified transactions
- see also provenance
- At 23:30 the attributes you add to reified transactions establish provenance
(#)
Examples
Writes
Glossary
Examples
Referring Pages
data-architecture-glossary