https://www.youtube.com/watch?v=CwWFrZGMDds
slides: https://www.slideshare.net/AmazonWebServices/dat405-amazon-aurora-deep-dive
At around 13 minutes: they're only writing logs and not data blocks, so even though they are writing six times as much, the network traffic is nine times less.
At 8:20 lock sharding talk-amazon-aurora-deep-dive#lock-sharding
At 10:10 network weather will increase latency and jitter talk-amazon-aurora-deep-dive#network-weather-latency-and-jitter
At 10:40 you do double writes because you are doing a destructive write and you can't afford write tear talk-amazon-aurora-deep-dive#write-tear
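The torn-write problem he's describing can be sketched with a toy doublewrite scheme (all names here are hypothetical illustrations, not Aurora's or InnoDB's actual code): journal the whole page before the destructive in-place write, so a write that tears mid-way can be repaired from the intact journal copy. Aurora sidesteps this entirely by never overwriting pages, only appending log records.

```python
def apply_page_write(storage, journal, page_id, new_page, crash_mid_write=False):
    """Toy doublewrite: copy the page to a journal first, then overwrite
    in place.  An in-place write is destructive, so a crash part-way
    through leaves a torn page (half old, half new)."""
    journal[page_id] = new_page  # step 1: a durable, whole copy
    if crash_mid_write:
        half = len(new_page) // 2
        # torn write: only the first half of the new page made it to disk
        storage[page_id] = new_page[:half] + storage[page_id][half:]
    else:
        storage[page_id] = new_page


def recover(storage, journal):
    """After a crash, every journaled page is rewritten from its whole
    journal copy, repairing any torn page."""
    for page_id, page in journal.items():
        storage[page_id] = page
```

A log-only design never hits this path: appending a log record either lands or it doesn't, so there is no half-overwritten page to repair.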
At 12:05 "in Aurora, we're only shipping log records" talk-amazon-aurora-deep-dive#only-shipping-log-records
At 12:10 "we boxcar them together to generate a workload that's worth shipping over" talk-amazon-aurora-deep-dive#boxcar
At 13:10 they do see write amplification because they have to do six times the log writes talk-amazon-aurora-deep-dive#six-times-log-writes-write-amplification
At 12:20 they are segmenting the database volume across many nodes and many disks on each node. Basically it is a huge storage area network.
At 12:50 there are only log writes; there are no data block writes.
At 13:38 it is a quorum-based model where they only require four of the six writes to come back.
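The 4-of-6 write model can be sketched in a few lines (a toy, assuming per-segment acks arrive in some completion order; `segment_acks` is a hypothetical stand-in for the real protocol): commit latency is set by the fourth-fastest storage segment, not the slowest.

```python
def write_with_quorum(segment_acks, quorum=4):
    """Toy 4-of-6 write quorum: consume segment acks in completion order
    and acknowledge the commit as soon as `quorum` segments have
    confirmed the log write.  Returns how many acks were consumed."""
    acks = 0
    for attempt, ok in enumerate(segment_acks, start=1):
        acks += ok
        if acks >= quorum:
            return attempt  # commit now; the two stragglers are ignored
    raise RuntimeError("write quorum not reached: write stall")
```

With all six segments healthy the commit returns after the fourth ack; up to two slow or failed segments never show up in the latency path.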
At 14:40 they basically deconstructed the standard database, took the log processing portion of it, and turned it into its own multi-tenant application.
At 15:05 he calls out the latency-sensitive portion of the processing.
At 15:51 one of the big negatives of log-structured storage systems in the past was write amplification.
At 16:25 they are leveraging the coalesce to create the data blocks.
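The coalesce step can be pictured as replaying the queued log records for a page against its last materialized version (a minimal sketch; the `(offset, bytes)` patch format is a made-up illustration, not Aurora's redo record format):

```python
def coalesce(base_page, log_records):
    """Toy coalesce: materialize a data block by applying each queued
    log record (here, an (offset, patch_bytes) pair) to the last
    materialized version of the page."""
    page = bytearray(base_page)
    for offset, patch in log_records:
        page[offset:offset + len(patch)] = patch
    return bytes(page)
```

This is why the storage tier never needs data block writes from the database: blocks are derived from the log on the storage side.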
At 16:35 there is a hot log setup, and the logs are pushed out to S3, which is when they actually consider the data to be durably persisted.
At 17:15 they are basically scrubbing the disks on the storage nodes all the time, reading the blocks and comparing them with the CRC. Maybe five minutes before, he mentioned that there is a gossip protocol for the storage nodes to agree on what should be on them.
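The scrubbing loop is essentially this (a sketch using CRC-32 as the checksum, which is an assumption; the stored-CRC layout here is invented for illustration):

```python
import zlib

def scrub(blocks):
    """Toy scrubber: each stored block carries the CRC computed at write
    time; re-read every block, recompute the CRC, and report the indexes
    that no longer match (candidates for repair from a peer copy)."""
    bad = []
    for idx, (data, stored_crc) in enumerate(blocks):
        if zlib.crc32(data) != stored_crc:
            bad.append(idx)
    return bad
```

A mismatch flags silent corruption, and because six copies exist, the bad block can be rewritten from a healthy peer.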
At 17:40 the foreground latency path.
At 18:40 trading latency for disk space is a trade we want to make.
At 19:30 you can't actually do an I/O per commit; instead they buffer up some commits and do a group commit. That means the first writer gets a latency penalty.
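Group commit amortizes the expensive durability step across a batch (a minimal sketch; the function name and batch size are illustrative, not from the talk): one `write()` and one `fsync()` cover N commits instead of N fsyncs.

```python
import os

def group_commit(log_file, pending, max_batch=8):
    """Toy group commit: drain up to max_batch queued commit records and
    make them durable with a single write + fsync.  The first committer
    in the batch waits for the rest to queue up -- that wait is the
    latency penalty the talk mentions."""
    batch = pending[:max_batch]
    log_file.write(b"".join(batch))  # one write() covers the whole batch
    log_file.flush()
    os.fsync(log_file.fileno())      # one fsync amortized over len(batch) commits
    return len(batch)
```

Throughput goes up roughly with the batch size, at the cost of the first writer's added wait.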
At 20:35 they do the networking in user space and eliminate the (I think) first-packet penalty, or something like that.
At 23:00 they have epoll on the front end and a latch-free task queue.
At 27:10 the replica uses the same underlying storage; if it is in the cache, then the cache is cleared by new records coming in. Not a great capture of what he said; recheck.
At 29:10 he talked about how they did read-ahead, but it was slower than in MySQL.
At 30:00 he talks about the benefits of a quorum system for both latency reduction and availability during node failure.
At 30:20 he draws a distinction between a durability loss and an intermittent failure.
At 31:00 he talks about scrubbing the data blocks to make sure that the six copies of the data that initially landed on the storage nodes continue to hold the same values. I believe I took a note about this earlier.
At 32:10 the volume is broken into 10 GB segments. This unit was chosen because there is a 10 Gb network or something, which means these segments can quickly be moved from one place to another. Also, I'm not sure about the networking, whether it is 10 Gb or 10 GB, but the storage segment is 10 GB.
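The 10 GB segment / 10 Gb network pairing works out to single-digit seconds per segment, which is the back-of-the-envelope reason repair and re-replication are fast (assuming decimal units and ignoring protocol overhead):

```python
# 10 GB segment over a 10 Gb/s link: bytes * 8 bits-per-byte / bits-per-second
segment_bytes = 10 * 10**9           # one 10 GB segment (decimal gigabytes)
link_bits_per_second = 10 * 10**9    # a 10 Gb/s network link
transfer_seconds = segment_bytes * 8 / link_bits_per_second
print(transfer_seconds)  # 8.0 -- a lost segment can be re-replicated in seconds
```

In practice the transfer shares the link with foreground traffic, so real repair times would be somewhat longer, but the order of magnitude holds.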
At 33:10 the loss of a quorum would cause a write stall. They worked hard to avoid write stalls even during quorum changes.
At 35:45 "that's not what you want to see because that's not what you manage to."
When you need to restart a database, you restart it, and because the cache is part of the database process it is emptied, at which point you need to send a "trickle load" into the database, because if you give it the full load it will fall over again.
At 42:35 there are SQL commands that let you simulate various failures.
At 43:50, talking about how you move an RDS instance to Aurora, he says it basically moves at "line rate speed".