http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
paper-amazon-dynamo#object-versioning-and-conflict-resolution1 2p 205: It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use. paper-amazon-dynamo#object-versioning-and-conflict-resolution1 2
p 205: One of the lessons our organization has learned from operating Amazon's platform is that the reliability and scalability of a system is dependent on how its application state is managed.
p 205: the common pattern of using a relational database would lead to inefficiencies and limit scale and availability.
paper-amazon-dynamo#synthesis-of-well-known-techniques1p 205: synthesis of well known techniques to achieve scalability and availability paper-amazon-dynamo#synthesis-of-well-known-techniques1
p 205: consistency is facilitated by object versioning
p 206: clearly defined consistency window
paper-amazon-dynamo#scale-out-schemep 206: scale-out scheme paper-amazon-dynamo#scale-out-scheme
p 206: No operations span multiple data items
p 206: Dynamo targets applications that operate with weaker consistency (the "C" in ACID) if this results in high availability
p 206: In Amazon's platform, services have stringent latency requirements which are in general measured at the 99.9th percentile of the distribution
p 206: Its operation environment is assumed to be non-hostile and there are no security related requirements such as authentication and authorization.
p 206: which most prominently include the client's expected request rate distribution for a particular API and the expected service latency under those conditions
paper-amazon-dynamo#example-slap 207: provide a response within 300ms for 99.9% of its requests for a peak client load of 500 requests per second. paper-amazon-dynamo#example-sla
paper-amazon-dynamo#performance-sla-99-point-9p 207: A common approach in the industry for forming a performance oriented SLA is to describe it using average, median and expected variance. At Amazon we have found that these metrics are not good enough if the goal is to build a system where all customers have a good experience, rather than just the majority. For example if extensive personalization techniques are used then customers with longer histories require more processing which impacts performance at the high-end of the distribution. An SLA stated in terms of mean or median response times will not address the performance of this important customer segment. To address this issue, at Amazon, SLAs are expressed and measured at the 99.9th percentile of the distribution. The choice for 99.9% over an even higher percentile has been made based on a cost-benefit analysis which demonstrated a significant increase in cost to improve performance that much. paper-amazon-dynamo#performance-sla-99-point-9
paper-amazon-dynamo#customers-experiencep 207: In this paper there are many references to this 99.9th percentile of distributions, which reflects Amazon engineers' relentless focus on performance from the perspective of the customers' experience. paper-amazon-dynamo#customers-experience
p 207: One of the main design considerations for Dynamo is to give services control over their system properties, such as durability and consistency, and to let services make their own tradeoffs between functionality, performance and cost-effectiveness.
p 207: paper-amazon-dynamo#optimistic-replication-techniques1 2 3For systems prone to server and network failures, availability can be increased by using optimistic replication techniques, where changes are allowed to propagate to replicas in the background, and concurrent, disconnected work is tolerated paper-amazon-dynamo#optimistic-replication-techniques1 2 3
paper-amazon-dynamo#conflict-resolutionp 207: This process of conflict resolution introduces two problems: when to resolve them and who resolves them paper-amazon-dynamo#conflict-resolution
paper-amazon-dynamo#design-spacep 207: On the other hand, Dynamo targets the design space of an "always writeable" data store (i.e., a data store that is highly available for writes) paper-amazon-dynamo#design-space
paper-amazon-dynamo#conflict-resolution-complexity-to-readsp 207: This requirement forces us to push the complexity of conflict resolution to the reads in order to ensure that writes are never rejected. paper-amazon-dynamo#conflict-resolution-complexity-to-reads
paper-amazon-dynamo#conflict-resolution-by-datastore-is-limitedp 208: If conflict resolution is done by the data store, its choices are rather limited paper-amazon-dynamo#conflict-resolution-by-datastore-is-limited
paper-amazon-dynamo#key-principles1p 208: Other key principles embraced in the design are incremental scalability, symmetry, decentralization, and heterogeneity paper-amazon-dynamo#key-principles1
paper-amazon-dynamo#incremental-scalability1p 208: Dynamo should be able to scale out one storage host (henceforth, referred to as "node") at a time, with minimal impact on both operators of the system and the system itself. paper-amazon-dynamo#incremental-scalability1
paper-amazon-dynamo#symmetry1p 208: Every node in Dynamo should have the same set of responsibilities as its peers; there should be no distinguished node or nodes that take special roles or extra set of responsibilities. In our experience, symmetry simplifies the process of system provisioning and maintenance. paper-amazon-dynamo#symmetry1
paper-amazon-dynamo#decentralization1p 208: An extension of symmetry, the design should favor decentralized peer-to-peer techniques over centralized control. In the past, centralized control has resulted in outages and the goal is to avoid it as much as possible. This leads to a simpler, more scalable, and more available system. paper-amazon-dynamo#decentralization1
paper-amazon-dynamo#heterogeneity1p 208: The system needs to be able to exploit heterogeneity in the infrastructure it runs on. e.g. the work distribution must be proportional to the capabilities of the individual servers. This is essential in adding new nodes with higher capacity without having to upgrade all hosts at once. paper-amazon-dynamo#heterogeneity1
paper-amazon-dynamo#work-distribution1p 208: The system needs to be able to exploit heterogeneity in the infrastructure it runs on. e.g. the work distribution must be proportional to the capabilities of the individual servers. This is essential in adding new nodes with higher capacity without having to upgrade all hosts at once. paper-amazon-dynamo#work-distribution1
paper-amazon-dynamo#overlay-links1p 208: These were examples of unstructured P2P networks where the overlay links between peers were established arbitrarily paper-amazon-dynamo#overlay-links1
paper-amazon-dynamo#routing-mechanisms-number-of-hopsp 2: use routing mechanisms to ensure that queries can be answered within a bounded number of hops paper-amazon-dynamo#routing-mechanisms-number-of-hops
paper-amazon-dynamo#routing-overlaysp 2: built on top of these routing overlays paper-amazon-dynamo#routing-overlays
paper-amazon-dynamo#wide-area-locking-and-conflict-resolution1 2p 2: To allow for concurrent updates while avoiding many of the problems inherent with wide-area locking, it uses an update model based on conflict resolution paper-amazon-dynamo#wide-area-locking-and-conflict-resolution1 2
paper-amazon-dynamo#target-requirements1p 209: Dynamo differs from the aforementioned decentralized storage systems in terms of its target requirements. First, Dynamo is targeted mainly at applications that need an "always writeable" data store where no updates are rejected due to failures or concurrent writes paper-amazon-dynamo#target-requirements1
paper-amazon-dynamo#hierarchical-namespaces1p 209: Dynamo does not require support for hierarchical namespaces (a norm in many file systems) or complex relational schema (supported by traditional databases). paper-amazon-dynamo#hierarchical-namespaces1
p 209: Dynamo is built for latency sensitive applications that require at least 99.9% of read and write operations to be performed within a few hundred milliseconds. To meet these stringent latency requirements, it was imperative for us to avoid routing requests through multiple nodes (which is the typical design adopted by several distributed hash table systems such as Chord and Pastry).
paper-amazon-dynamo#zero-hop-distributed-hash-table1p 209: Dynamo can be characterized as a zero-hop DHT, where each node maintains enough routing information locally to route a request to the appropriate node directly. paper-amazon-dynamo#zero-hop-distributed-hash-table1
paper-amazon-dynamo#list-of-system-propertiesp 209: The architecture of a storage system that needs to operate in a production setting is complex. In addition to the actual data persistence component, the system needs to have scalable and robust solutions for load balancing, membership and failure detection, failure recovery, replica synchronization, overload handling, state transfer, concurrency and job scheduling, request marshalling, request routing, system monitoring and alarming, and configuration management. paper-amazon-dynamo#list-of-system-properties
paper-amazon-dynamo#consistent-hashing1p 209: technique: consistent hashing - advantage: allows for incremental scalability paper-amazon-dynamo#consistent-hashing1
paper-amazon-dynamo#vector-clocks1p 209: technique: vector clocks with reconciliation during reads - advantage: allows vector size to be decoupled from update rates paper-amazon-dynamo#vector-clocks1
paper-amazon-dynamo#sloppy-quorum-and-hinted-handoffp 209: technique: sloppy quorum and hinted handoff - advantage: High availability and durability even when some of the replicas are not available paper-amazon-dynamo#sloppy-quorum-and-hinted-handoff
paper-amazon-dynamo#anti-entropy-using-merkle-treesp 209: technique: anti-entropy using merkle trees - advantage: Synchronizes divergent replicas in the background paper-amazon-dynamo#anti-entropy-using-merkle-trees
paper-amazon-dynamo#gossip-based-membership-protocol-and-failure-detectionp 209: technique: gossip-based protocol (membership protocol) and failure detection paper-amazon-dynamo#gossip-based-membership-protocol-and-failure-detection
paper-amazon-dynamo#context-informationp 209: The context encodes system metadata about the object that is opaque to the caller and includes information such as the version of the object. The context information is stored along with the object so that the system can verify the validity of the context object supplied in the put request. paper-amazon-dynamo#context-information
p 209: It applies an MD5 hash on the key to generate a 128-bit identifier, which is used to determine the storage nodes that are responsible for serving the key.
paper-amazon-dynamo#principle-advantage-of-consistent-hashingp 210: The principal advantage of consistent hashing is that departure or arrival of a node only affects its immediate neighbors and other nodes remain unaffected. paper-amazon-dynamo#principle-advantage-of-consistent-hashing
paper-amazon-dynamo#consistent-hashing-challengesp 210: The basic consistent hashing algorithm presents some challenges. paper-amazon-dynamo#consistent-hashing-challenges
paper-amazon-dynamo#virtual-nodes-and-heterogeneity1 2p 210: The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure. paper-amazon-dynamo#virtual-nodes-and-heterogeneity1 2
paper-amazon-dynamo#per-instancep 210: Each data item is replicated at N hosts, where N is a parameter configured "per-instance" paper-amazon-dynamo#per-instance
paper-amazon-dynamo#preference-list1p 210: The list of nodes that is responsible for storing a particular key is called the preference list paper-amazon-dynamo#preference-list1
paper-amazon-dynamo#distinct-physical-nodes-in-preference-list1p 210: the preference list for a key is constructed by skipping positions in the ring to ensure that the list contains only distinct physical nodes paper-amazon-dynamo#distinct-physical-nodes-in-preference-list1
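The partitioning notes above (MD5-hashed keys on a ring, virtual nodes sized by node capacity, preference lists that skip positions so only distinct physical nodes appear) can be pulled together in a minimal sketch. Node names, token counts, and the sample key are illustrative, not from the paper:

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self):
        self.tokens = []        # sorted ring positions (the virtual nodes)
        self.token_owner = {}   # ring position -> physical node name

    @staticmethod
    def _hash(value: str) -> int:
        # Dynamo derives a 128-bit ring position from the MD5 of the key.
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, name: str, num_tokens: int) -> None:
        # A node claims several tokens; more capable nodes claim more,
        # which is how heterogeneity in the infrastructure is absorbed.
        for i in range(num_tokens):
            pos = self._hash(f"{name}#token{i}")
            self.token_owner[pos] = name
            bisect.insort(self.tokens, pos)

    def node_for_key(self, key: str) -> str:
        # Walk clockwise to the first token at or after the key's position.
        idx = bisect.bisect_right(self.tokens, self._hash(key)) % len(self.tokens)
        return self.token_owner[self.tokens[idx]]

    def preference_list(self, key: str, n: int) -> list:
        # Keep walking clockwise, skipping tokens whose physical node is
        # already present, so the list holds N *distinct* physical nodes.
        idx = bisect.bisect_right(self.tokens, self._hash(key))
        nodes = []
        for i in range(len(self.tokens)):
            owner = self.token_owner[self.tokens[(idx + i) % len(self.tokens)]]
            if owner not in nodes:
                nodes.append(owner)
            if len(nodes) == n:
                break
        return nodes

ring = ConsistentHashRing()
ring.add_node("A", 8)
ring.add_node("B", 16)   # assume B has twice A's capacity
ring.add_node("C", 8)
replicas = ring.preference_list("shopping-cart-123", n=3)
```

Because every node owns many scattered tokens, a joining or leaving node sheds or absorbs load from many peers at once rather than from a single neighbor.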
paper-amazon-dynamo#immutable-datap 210: In order to provide this kind of guarantee, Dynamo treats the result of each modification as a new and immutable version of the data paper-amazon-dynamo#immutable-data
paper-amazon-dynamo#syntactic-reconciliationp 210: Most of the time, new versions subsume the previous version(s), and the system itself can determine the authoritative version (syntactic reconciliation). paper-amazon-dynamo#syntactic-reconciliation
paper-amazon-dynamo#semantic-reconciliationp 210: In these cases, the system cannot reconcile the multiple versions of the same object and the client must perform the reconciliation in order to collapse multiple branches of data evolution back into one (semantic reconciliation) paper-amazon-dynamo#semantic-reconciliation
p 210: This requires us to design applications that explicitly acknowledge the possibility of multiple versions of the same data (in order to never lose any updates).
paper-amazon-dynamo#vector-clock-node-counter-pairsp 210: A vector clock is effectively a list of (node, counter) pairs paper-amazon-dynamo#vector-clock-node-counter-pairs
paper-amazon-dynamo#must-specify-which-version-it-is-updatingp 210: In Dynamo, when a client wishes to update an object, it must specify which version it is updating paper-amazon-dynamo#must-specify-which-version-it-is-updating
paper-amazon-dynamo#version-evolutionp 211: version evolution of an object over time paper-amazon-dynamo#version-evolution
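The version-evolution example (the paper's Figure 3, with writes handled at nodes Sx, Sy, Sz) can be sketched with plain dicts as the (node, counter) pairs of a vector clock; this is a minimal sketch, not Dynamo's code:

```python
def descends(a: dict, b: dict) -> bool:
    # Clock `a` causally subsumes `b` if every counter in `b` is matched
    # or exceeded in `a`; then the system can reconcile syntactically.
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def concurrent(a: dict, b: dict) -> bool:
    # Neither subsumes the other: divergent branches that the client
    # must reconcile semantically on read.
    return not descends(a, b) and not descends(b, a)

d1 = {"Sx": 1}             # first write, handled by Sx
d2 = {"Sx": 2}             # overwrite at Sx; subsumes d1
d3 = {"Sx": 2, "Sy": 1}    # written on top of d2 at Sy
d4 = {"Sx": 2, "Sz": 1}    # written on top of d2 at Sz, concurrent with d3
```

A read returning both d3 and d4 would hand the client both clocks; the client's reconciled write then carries a clock that descends from both branches.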
p 211: the writes are usually handled by one of the top N nodes in the preference list
paper-amazon-dynamo#vector-clock-truncation-schemep 211: Dynamo employs the following clock truncation scheme: Along with each (node, counter) pair, Dynamo stores a timestamp that indicates the last time the node updated the data item paper-amazon-dynamo#vector-clock-truncation-scheme
paper-amazon-dynamo#coordinatorp 211: A node handling a read or write operation is known as the coordinator. paper-amazon-dynamo#coordinator
paper-amazon-dynamo#quorum-systemsp 211: To maintain consistency among its replicas, Dynamo uses a consistency protocol similar to those used in quorum systems paper-amazon-dynamo#quorum-systems
paper-amazon-dynamo#sloppy-quorum1 2 3p 212: To remedy this it does not enforce strict quorum membership and instead it uses a sloppy quorum; all read and write operations are performed on the first N healthy nodes from the preference list, which may not always be the first N nodes encountered while walking the consistent hashing ring. paper-amazon-dynamo#sloppy-quorum1 2 3
paper-amazon-dynamo#hinted-handoff1p 212: The replica sent to D will have a hint in its metadata that suggests which node was the intended recipient of the replica (in this case A) paper-amazon-dynamo#hinted-handoff1
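The sloppy-quorum example above (A unreachable, so D stores A's replica with a hint naming A) can be sketched as follows; the helper name and data shapes are illustrative assumptions:

```python
def nodes_for_write(preference_list, healthy, n):
    # Sloppy quorum: the write lands on the first N healthy nodes.
    # Healthy intended nodes store a plain replica; nodes further down
    # the preference list stand in for failed ones, each carrying a
    # hint naming the intended recipient (hinted handoff), so the
    # replica can be handed back once that node recovers.
    intended = preference_list[:n]
    targets = [(node, None) for node in intended if node in healthy]
    failed = [node for node in intended if node not in healthy]
    substitutes = (node for node in preference_list[n:] if node in healthy)
    for missing in failed:
        sub = next(substitutes, None)
        if sub is not None:
            targets.append((sub, missing))
    return targets

# The paper's example: A is down, so D stores A's replica with hint "A".
writes = nodes_for_write(["A", "B", "C", "D"], healthy={"B", "C", "D"}, n=3)
```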
paper-amazon-dynamo#preference-list-multiple-datacenters1p 212: In essence, the preference list of a key is constructed such that the storage nodes are spread across multiple data centers paper-amazon-dynamo#preference-list-multiple-datacenters1
paper-amazon-dynamo#anti-entropy-protocolp 212: Dynamo implements an anti-entropy (replica synchronization) protocol to keep the replicas synchronized. paper-amazon-dynamo#anti-entropy-protocol
paper-amazon-dynamo#merkle-treep 212: The principal advantage of Merkle tree is that each branch of the tree can be checked independently without requiring nodes to download the entire tree or the entire data set paper-amazon-dynamo#merkle-tree
paper-amazon-dynamo#key-rangep 212: Dynamo uses Merkle trees for anti-entropy as follows: Each node maintains a separate Merkle tree for each key range (the set of keys covered by a virtual node) it hosts paper-amazon-dynamo#key-range
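A minimal sketch of the Merkle construction over one key range: equal roots mean the replicas are in sync, and on mismatch only the differing branches need to be walked. The hash choice (SHA-256) and leaf encoding are illustrative assumptions; the paper does not specify them:

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    # Leaves are hashes of individual keys' values; each parent hashes
    # its two children, so two replicas can compare roots first and
    # descend only into subtrees whose hashes differ.
    level = [_h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])   # duplicate the odd leaf out
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

in_sync = merkle_root([b"cart:1=v1", b"cart:2=v1"])
replica = merkle_root([b"cart:1=v1", b"cart:2=v1"])
diverged = merkle_root([b"cart:1=v1", b"cart:2=v2"])  # one stale key
```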
paper-amazon-dynamo#explicit-membership-changesp 212: For these reasons, it was deemed appropriate to use an explicit mechanism to initiate the addition and removal of nodes from a Dynamo ring paper-amazon-dynamo#explicit-membership-changes
paper-amazon-dynamo#membership-changes-historyp 212: The membership changes form a history because nodes can be removed and added back multiple times paper-amazon-dynamo#membership-changes-history
paper-amazon-dynamo#membership-sync-with-peerp 212: Each node contacts a peer chosen at random every second and the two nodes efficiently reconcile their persisted membership change histories. paper-amazon-dynamo#membership-sync-with-peer
paper-amazon-dynamo#token-setp 212: When a node starts for the first time, it chooses its set of tokens (virtual nodes in the consistent hash space) and maps nodes to their respective token sets paper-amazon-dynamo#token-set
paper-amazon-dynamo#logical-partition-and-seedp 213: To prevent logical partitions, some Dynamo nodes play the role of seeds. Seeds are nodes that are discovered via an external mechanism and are known to all nodes paper-amazon-dynamo#logical-partition-and-seed
paper-amazon-dynamo#bootstrappingp 213: consider a simple bootstrapping scenario where node X is added to the ring shown in Figure 2 between A and B paper-amazon-dynamo#bootstrapping
paper-amazon-dynamo#read-repairp 214: If stale versions were returned in any of the responses, the coordinator updates those nodes with the latest version. This process is called read repair because it repairs replicas that have missed a recent update at an opportunistic time and relieves the anti-entropy protocol from having to do it. paper-amazon-dynamo#read-repair
paper-amazon-dynamo#serializing-is-desirable1p 214: Although it is desirable always to have the first node among the top N to coordinate the writes thereby serializing all writes at a single location, this approach has led to uneven load distribution resulting in SLA violations paper-amazon-dynamo#serializing-is-desirable1
paper-amazon-dynamo#top-n-nodes-in-preference-list-coordinate-writesp 214: any of the top N nodes in the preference list is allowed to coordinate the writes paper-amazon-dynamo#top-n-nodes-in-preference-list-coordinate-writes
paper-amazon-dynamo#write-coordinatorp 214: the coordinator for a write is chosen to be the node that replied fastest to the previous read operation which is stored in the context information of the request paper-amazon-dynamo#write-coordinator
paper-amazon-dynamo#read-your-writes-consistency1 2p 214: This optimization enables us to pick the node that has the data that was read by the preceding read operation thereby increasing the chances of getting read-your-writes consistency paper-amazon-dynamo#read-your-writes-consistency1 2
paper-amazon-dynamo#business-logic-specific-reconciliationp 214: Business logic specific reconciliation paper-amazon-dynamo#business-logic-specific-reconciliation
paper-amazon-dynamo#last-write-wins-reconciliationp 214: Dynamo performs simple timestamp based reconciliation logic of last write wins; i.e., the object with the largest physical timestamp value is chosen as the correct version. paper-amazon-dynamo#last-write-wins-reconciliation
paper-amazon-dynamo#persistence-cachep 214: authoritative persistence cache for data stored in more heavy weight backing stores. Services that maintain product catalog and promotional items fit in this category paper-amazon-dynamo#persistence-cache
paper-amazon-dynamo#client-applications-can-tunep 214: client applications can tune the values of N, R and W to achieve their desired levels of performance, availability and durability paper-amazon-dynamo#client-applications-can-tune
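The N, R, W tuning reduces to the quorum condition R + W > N: any read set of R replicas then intersects any write set of W replicas, so a read cannot miss the latest completed write. A trivial sketch, using the (3, 2, 2) configuration the paper reports as commonly used:

```python
def quorum_overlap(n: int, r: int, w: int) -> bool:
    # With R + W > N, every R-node read set shares at least one node
    # with every W-node write set.
    return r + w > n

common = quorum_overlap(3, 2, 2)   # the paper's typical (N, R, W)
weak = quorum_overlap(3, 1, 1)     # low-latency, but reads can miss writes
```

Lowering W below the overlap threshold is exactly the durability/latency trade the surrounding notes describe.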
paper-amazon-dynamo#vulnerability-window1p 214: This also introduces a vulnerability window for durability when a write request is successfully returned to the client even though it has been persisted at only a small number of nodes. paper-amazon-dynamo#vulnerability-window1
paper-amazon-dynamo#write-bufferingp 215: write buffering smoothes out higher percentile latencies. Obviously, this scheme trades durability for performance paper-amazon-dynamo#write-buffering
paper-amazon-dynamo#consistent-hashing-ensure-uniform-load-distribution1p 215: Dynamo uses consistent hashing to partition its key space across its replicas and to ensure uniform load distribution paper-amazon-dynamo#consistent-hashing-ensure-uniform-load-distribution1
paper-amazon-dynamo#design-assumes-enough-popular-keys-skew1p 215: Dynamo's design assumes that even where there is a significant skew in the access distribution there are enough keys in the popular end of the distribution so that the load of handling popular keys can be spread across the nodes uniformly through partitioning. paper-amazon-dynamo#design-assumes-enough-popular-keys-skew1
paper-amazon-dynamo#load-imbalancep 215: To study the load imbalance and its correlation with request load, the total number of requests received by each node was measured for a period of 24 hours paper-amazon-dynamo#load-imbalance
paper-amazon-dynamo#imbalance-ratio1p 215: the fraction of nodes that are "out-of-balance" (henceforth, imbalance ratio) during this time period paper-amazon-dynamo#imbalance-ratio1
paper-amazon-dynamo#independent-schemes-for-partitioning-and-placementp 216: Ideally, it is desirable to use independent schemes for partitioning and placement paper-amazon-dynamo#independent-schemes-for-partitioning-and-placement
paper-amazon-dynamo#load-balancing-efficiencyp 217: Load balancing efficiency is defined as the ratio of average number of requests served by each node to the maximum number of requests served by the hottest node. paper-amazon-dynamo#load-balancing-efficiency
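The definition on this page translates directly to code; the request counts below are made-up inputs, not the paper's measurements:

```python
def load_balancing_efficiency(requests_per_node):
    # Ratio of the average load per node to the load on the hottest
    # node; 1.0 means requests are spread perfectly evenly.
    mean = sum(requests_per_node) / len(requests_per_node)
    return mean / max(requests_per_node)

even = load_balancing_efficiency([100, 100, 100])   # 1.0
skewed = load_balancing_efficiency([50, 150])       # ~0.67
```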
paper-amazon-dynamo#number-divergent-verstions-seen-in-production-environment1p 217: However, this section discusses a good summary metric: the number of divergent versions seen by the application in a live production environment. paper-amazon-dynamo#number-divergent-verstions-seen-in-production-environment1
paper-amazon-dynamo#busy-robotsp 217: The increase in the number of concurrent writes is usually triggered by busy robots (automated client programs) and rarely by humans paper-amazon-dynamo#busy-robots
paper-amazon-dynamo#causally-subsumes1p 217: Write requests on the other hand will be coordinated by a node in the key's current preference list. This restriction is due to the fact that these preferred nodes have the added responsibility of creating a new version stamp that causally subsumes the version that has been updated by the write request. Note that if Dynamo's versioning scheme is based on physical timestamps, any node can coordinate a write request. paper-amazon-dynamo#causally-subsumes1
paper-amazon-dynamo#hit-ratiosp 218: This is because Dynamo's storage engine caches and write buffer have good hit ratios paper-amazon-dynamo#hit-ratios
paper-amazon-dynamo#admission-control-mechanismp 218: To this end, the background tasks were integrated with an admission control mechanism paper-amazon-dynamo#admission-control-mechanism
paper-amazon-dynamo#trailing-time-window1p 218: This information is used to check whether the percentiles of latencies (or failures) in a given trailing time window are close to a desired threshold paper-amazon-dynamo#trailing-time-window1
paper-amazon-dynamo#full-membership-model1p 218: Dynamo adopts a full membership model where each node is aware of the data hosted by its peers paper-amazon-dynamo#full-membership-model1
paper-amazon-dynamo#decentralized-techniques-make-available-system1p 219: The production use of Dynamo for the past year demonstrates that decentralized techniques can be combined to provide a single highly-available system paper-amazon-dynamo#decentralized-techniques-make-available-system1
paper-amazon-dynamo#persistent-state1p 205: the way persistent state is managed in the face of these failures drives the reliability and scalability of the software systems. paper-amazon-dynamo#persistent-state1
Sacrifices consistency
Object versioning
Application-assisted conflict resolution
True or false:
"even the slightest outage has significant financial consequences and impacts customer trust."
One of the lessons our organization has learned from operating Amazon's platform is that the reliability and scalability of a system is dependent on how its application state is managed
need to be constructed in a manner that treats failure handling as the normal case without impacting availability or performance
There are many services on Amazon's platform that only need primary-key access to a data store. For many services, such as those that provide best seller lists, shopping carts, customer preferences, session management, sales rank, and product catalog, the common pattern of using a relational database would lead to inefficiencies and limit scale and availability. Dynamo provides a simple primary-key only interface to meet the requirements of these applications.
It demonstrates that an eventually-consistent storage system can be used in production with demanding applications.
clearly defined consistency window
a significant portion of Amazon's services can work with this simple query model and do not need any relational schema
measured at the 99.9th percentile of the distribution
paper-amazon-dynamo#design-priciple-symmetry1 2Every node in Dynamo should have the same set of responsibilities as its peers; there should be no distinguished node or nodes that take special roles or extra set of responsibilities. In our experience, symmetry simplifies the process of system provisioning and maintenance. paper-amazon-dynamo#design-priciple-symmetry1 2
paper-amazon-dynamo#wide-area-locking1To allow for concurrent updates while avoiding many of the problems inherent with concept-wide-area-locking, it uses an update model based on conflict resolution. paper-amazon-dynamo#wide-area-locking1
Dynamo can be characterized as a zero-hop DHT, where each node maintains enough routing information locally to route a request to the appropriate node directly.
this paper focuses on the core distributed systems techniques used in Dynamo: partitioning, replication, versioning, membership, failure handling and scaling.
The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.
To address this, the preference list for a key is constructed by skipping positions in the ring to ensure that the list contains only distinct physical nodes.
Talks about Lamport clocks: talk-how-to-have-causality-and-wall-clock-too