talk-building-robust-and-scalable-data-pipelines-with-kafka

https://www.youtube.com/watch?v=a4w6MXKv0Cw

Overview

Notes

At 4:30 the guiding principles for their data architecture: optimize for scalability, developer productivity, and correctness

At 7:30 LinkedIn is using Kafka to handle over 1 trillion events per day.

At 9:50 schema is very, very important.

At 11:40 he talks about whether the schema lives in the message or out of the message, i.e. in-band vs. out-of-band schemas (also at 17:10)
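
The talk doesn't show a concrete message format; a rough sketch of the two options, with made-up envelope fields, might look like this:

    import json

    # In-band: every message carries its own schema, so any consumer can decode it,
    # at the cost of a much larger payload on every event.
    in_band = json.dumps({
        "schema": {
            "type": "record",
            "fields": [{"name": "call_id", "type": "string"},
                       {"name": "duration_ms", "type": "long"}],
        },
        "payload": {"call_id": "abc123", "duration_ms": 5400},
    })

    # Out-of-band: the message carries only a schema id; the schema itself lives in
    # a separate registry / metadata service (see 15:05) that consumers look up.
    out_of_band = json.dumps({
        "schema_id": 42,
        "payload": {"call_id": "abc123", "duration_ms": 5400},
    })

    print(len(in_band), len(out_of_band))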

At 11:55 you need to enforce schemas at produce time
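
The talk doesn't include code for this; a minimal sketch of the idea, with a hand-rolled schema and a hypothetical topic name, is:

    import json

    # Illustrative schema: field name -> expected Python type. A real setup would use
    # Avro/Protobuf plus a schema registry, but the enforcement point is the same:
    # reject bad records before they ever reach the topic.
    CALL_EVENT_SCHEMA = {"call_id": str, "duration_ms": int, "status": str}

    def validate(record, schema):
        for field, expected_type in schema.items():
            if field not in record:
                raise ValueError(f"missing field: {field}")
            if not isinstance(record[field], expected_type):
                raise ValueError(f"wrong type for {field}: {type(record[field]).__name__}")

    def produce(producer, topic, record):
        validate(record, CALL_EVENT_SCHEMA)   # enforcement happens at produce time
        producer.send(topic, json.dumps(record).encode("utf-8"))

    # Usage with e.g. kafka-python (topic and servers are placeholders):
    #   produce(KafkaProducer(bootstrap_servers="localhost:9092"), "call-events",
    #           {"call_id": "abc123", "duration_ms": 5400, "status": "completed"})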

At 13:30 you need to know what uniquely identifies a record

At 15:05 they have a separate service for getting the schemas and stream meta-data

At 16:50 what are the various things you can do with the streams of data?

At 17:40 Twilio FS is the data lake

At 19:20 Copycat copies all of the Kafka data to S3. This allows re-running the data from scratch (even if they made a mistake in Copydog)

At 19:45 Copydog dedups and ensures the correct ordering in the data, but otherwise leaves the data alone. Copydog's output is what is used by the other parts of the system.
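
Copydog's actual logic isn't shown in the talk; a toy version of a dedup-and-reorder pass, assuming each record carries a unique id (see 13:30) and its Kafka offset, could look like this:

    def dedup_and_order(records):
        seen = set()
        out = []
        for rec in sorted(records, key=lambda r: r["offset"]):
            if rec["id"] in seen:
                continue              # drop duplicate deliveries
            seen.add(rec["id"])
            out.append(rec)           # the payload itself passes through untouched
        return out

    batch = [
        {"id": "a", "offset": 2, "payload": "..."},
        {"id": "b", "offset": 1, "payload": "..."},
        {"id": "a", "offset": 3, "payload": "..."},   # duplicate of "a"
    ]
    print(dedup_and_order(batch))     # ordered by offset, duplicate removed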

At 20:10 S3 does not have an atomic move, so they use versioned file names and a reference to the latest version. Similarly, at 36:34: Riak and S3 do not have the needed consistency or transactional semantics.
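
One way to picture the versioned-file-plus-pointer workaround (a sketch using boto3; the bucket layout is made up, not Twilio's actual one):

    import time
    import boto3

    s3 = boto3.client("s3")

    def publish(bucket, prefix, data):
        version = str(int(time.time()))
        data_key = f"{prefix}/v{version}/data.json"
        s3.put_object(Bucket=bucket, Key=data_key, Body=data)
        # Readers resolve this small pointer object first; overwriting it is the
        # closest thing to an atomic "publish" step, since S3 has no atomic move.
        s3.put_object(Bucket=bucket, Key=f"{prefix}/LATEST", Body=data_key.encode())
        return data_key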

At 21:05 the metadata API helps.

At 21:50 there are multiple Redshift clusters, used for different purposes.

At 23:40 use Spark for real-time processing from streams
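
The talk doesn't show the Spark code (and at the time it would likely have been classic Spark Streaming); a present-day Structured Streaming sketch that reads a Kafka topic, with placeholder topic and servers, would be roughly:

    from pyspark.sql import SparkSession

    # Needs the spark-sql-kafka connector package on the Spark classpath.
    spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "call-events")
              .load())

    # Just dump the decoded values to the console; a real job would aggregate,
    # join, or write out to another store.
    query = (events.selectExpr("CAST(value AS STRING) AS event_json")
             .writeStream
             .format("console")
             .outputMode("append")
             .start())
    query.awaitTermination()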

At 25:05 use full-row checksums to see if values align
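
A sketch of what a full-row checksum comparison can look like: hash every column value in a fixed column order, then compare digests for "the same" row in two stores (the column names here are invented):

    import hashlib

    def row_checksum(row, columns):
        canonical = "\x1f".join(str(row.get(c)) for c in columns)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    cols = ["call_id", "duration_ms", "status"]
    s3_row = {"call_id": "abc123", "duration_ms": 5400, "status": "completed"}
    warehouse_row = {"call_id": "abc123", "duration_ms": 5400, "status": "completed"}
    print(row_checksum(s3_row, cols) == row_checksum(warehouse_row, cols))  # True if values align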

At 26:30 "we try out new data stores all the time"

Referring Pages

data-architecture-glossary schema-is-very-important

People

person-mike-mentes