podcast-oreilly-data-show-massively

At 8:40 he discusses out-of-order processing.

At 9:50 he talks about Dataflow and how it came out of the Flume project, whose purpose was basically to create a plan for executing pipelines.

At 12:20 they are trying to get MillWheel plus batch processing running in Dataflow.

At 13:20 he describes the Lambda architecture and explains that the streaming semantics and consistency were not great, which is why the batch layer was needed.

At 13:50 Summingbird gives a single API instead of two separate APIs for the batch and streaming parts of Lambda.
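A minimal sketch of the "write the logic once, run it both ways" idea, in plain Python. This is not Summingbird's actual API (Summingbird is a Scala library); the names `count_words`, `run_batch`, and `run_streaming` are hypothetical and only illustrate sharing one piece of logic between a batch-style and a streaming-style runner.

```python
from collections import Counter
from typing import Iterable, Iterator


def count_words(counts: Counter, line: str) -> Counter:
    """The business logic, written once: fold one input line into running counts."""
    counts.update(line.split())
    return counts


def run_batch(lines: Iterable[str]) -> Counter:
    """Batch-style execution: consume the whole input, return one final result."""
    counts = Counter()
    for line in lines:
        count_words(counts, line)
    return counts


def run_streaming(lines: Iterable[str]) -> Iterator[Counter]:
    """Streaming-style execution: emit an updated snapshot after every input line."""
    counts = Counter()
    for line in lines:
        count_words(counts, line)
        yield Counter(counts)  # copy, so earlier snapshots are not mutated later


data = ["a b a", "b c"]
batch_result = run_batch(data)
stream_snapshots = list(run_streaming(data))
```

Both runners call the same `count_words`, so the last streaming snapshot matches the batch result for the same input.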

At 14:20 he talks about moving away from batch and streaming as terms and talking more about bounded and unbounded data.

At 15 he says "streaming" is a loaded term.

At 16:20 he mentions the talk-goodbye-to-batch at Strata in London.

At 21:40 time is event-time based as opposed to processing-time based, meaning they take the time information from the event itself, not from when it is processed.
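A small sketch of the difference, assuming fixed 60-second windows and made-up records (the window size, the `window_start` helper, and the data are all hypothetical). Each record carries both the time it happened and the time it was processed; grouping by one or the other puts the late record in different windows.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical fixed window size


def window_start(timestamp: int) -> int:
    """Return the start of the fixed window that contains the timestamp."""
    return timestamp - (timestamp % WINDOW_SECONDS)


# (event_time, processing_time, value); the last record arrives late.
records = [
    (10, 12, "a"),
    (130, 131, "c"),
    (50, 135, "b"),  # happened in the first minute, processed in the third
]

by_event_time = defaultdict(list)
by_processing_time = defaultdict(list)
for event_time, processing_time, value in records:
    by_event_time[window_start(event_time)].append(value)
    by_processing_time[window_start(processing_time)].append(value)
```

Grouped by event time, "b" lands in the window starting at 0 next to "a"; grouped by processing time, it lands in the window starting at 120 next to "c".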

At 23:40 Dataflow has the concept of a watermark, which is an indication of the point before which you think the data is stable. Up to that watermark you believe you have seen all of the data you will see, and if any additional data with a pre-watermark timestamp comes in after the fact, it is late and you can decide what to do with it.
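A toy sketch of the watermark idea. It assumes a simple heuristic, "largest event time seen minus a fixed allowance", which is only an illustration and not how Dataflow computes its watermarks; `WatermarkTracker` and `allowed_skew` are hypothetical names.

```python
class WatermarkTracker:
    """Toy watermark: the largest event time seen minus a fixed out-of-order allowance."""

    def __init__(self, allowed_skew: int) -> None:
        self.allowed_skew = allowed_skew  # hypothetical tuning knob
        self.max_event_time = None

    @property
    def watermark(self):
        """Event time before which we believe all data has been seen (None before any event)."""
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.allowed_skew

    def observe(self, event_time: int) -> str:
        """Classify the event against the current watermark, then advance the watermark."""
        current = self.watermark
        status = "late" if current is not None and event_time < current else "on_time"
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return status


tracker = WatermarkTracker(allowed_skew=5)
statuses = [tracker.observe(t) for t in [100, 103, 97, 120, 110, 116]]
```

Events stamped 97 and 110 arrive after the watermark has already passed them, so they are flagged as late; what to do with them (drop, update an earlier result, route elsewhere) is left to the caller.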

At 26:05, talking about consistency, he says the goal is to make sure your ongoing calculation is correct despite machine failures and network issues.

At 26:40 he says you don't acknowledge that something has been processed until it has been durably committed.
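A minimal sketch of "commit durably, then acknowledge", assuming a local append-only file stands in for the durable store (a real system would more likely use replicated storage). The `process_and_commit` function, the `ack` callback, and the message shape are hypothetical.

```python
import json
import os
import tempfile


def process_and_commit(message: dict, log_path: str, ack) -> None:
    """Append the result to a log, force it to disk, and only then acknowledge."""
    result = {"id": message["id"], "doubled": message["value"] * 2}
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(result) + "\n")
        log.flush()             # push Python's buffer to the OS
        os.fsync(log.fileno())  # ask the OS to persist the bytes to storage
    ack(message["id"])          # reached only if the commit above did not raise


acked = []
log_path = os.path.join(tempfile.mkdtemp(), "results.log")
process_and_commit({"id": 1, "value": 21}, log_path, acked.append)
```

If the write or the `fsync` raises, `ack` is never called, so the upstream source can redeliver the message instead of assuming it was handled.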

At 28, when describing Spark, he said that it was exciting for them to review Spark and understand that it was a principled system that cared about consistency.

Note, I may have been using the wrong word above. It may have been correctness and not consistency that he was talking about.

At 36:50 he says, roughly, to solve the other problems another way and to solve this one with windowing and triggers.

Near the end he talks about how Dataflow can run on various execution engines, such as Spark.

Referenced in show

paper-the-dataflow-model

article-the-world-beyond-batch-streaming-101

article-the-world-beyond-batch-streaming-102

Referring Pages

data-architecture-glossary

People

person-tyler-akidau