blog-post-local-state-fundamental-primitive

https://www.oreilly.com/ideas/why-local-state-is-a-fundamental-primitive-in-stream-processing

Overview: you can maintain local state with fault tolerance by logging changes to a Kafka topic.
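A minimal sketch of that idea, assuming a plain Java map as the local state and a Kafka producer writing the changelog (the class name and topic are hypothetical, not Samza's API):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.HashMap;
import java.util.Map;

public class ChangelogBackedStore {
    private final Map<String, String> local = new HashMap<>(); // the local state
    private final KafkaProducer<String, String> producer;
    private final String changelogTopic;

    public ChangelogBackedStore(KafkaProducer<String, String> producer, String topic) {
        this.producer = producer;
        this.changelogTopic = topic;
    }

    public void put(String key, String value) {
        local.put(key, value);
        // Log the change so the state can be rebuilt by replaying the topic.
        producer.send(new ProducerRecord<>(changelogTopic, key, value));
    }

    public String get(String key) {
        return local.get(key); // served entirely from local memory
    }

    // On restart, replay the changelog from the beginning and call this for
    // every record to restore the state.
    public void restore(String key, String value) {
        local.put(key, value);
    }
}
```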

Joining a stream of ad impressions to a stream of ad clicks is one example: the two streams are somewhat time aligned, so you might only need to wait for a few minutes after the impression for the matching click to occur. This is called a streaming join. Another type of join is for "enrichment" of an incoming stream: perhaps you want to take an incoming stream of ad clicks and join on attributes about the user (perhaps for use in downstream aggregations).
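A sketch of the streaming-join case: impressions are buffered in local state, and each arriving click is matched against them within a window. The record shapes and the five-minute window are illustrative assumptions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ImpressionClickJoiner {
    // impressionId -> arrival time; in a real job this would be the
    // changelog-backed local store rather than a plain map, and entries
    // older than the window would be evicted.
    private final Map<String, Long> impressions = new ConcurrentHashMap<>();
    private static final long WINDOW_MS = 5 * 60 * 1000; // wait a few minutes

    public void onImpression(String impressionId, long timestampMs) {
        impressions.put(impressionId, timestampMs);
    }

    public void onClick(String impressionId, long timestampMs) {
        Long seen = impressions.remove(impressionId);
        if (seen != null && timestampMs - seen <= WINDOW_MS) {
            emitJoinedEvent(impressionId); // matched within the window
        }
    }

    private void emitJoinedEvent(String impressionId) {
        // send the joined record downstream
    }
}
```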

One possibility is to co-partition the database and the input processing, and then move the data to be directly co-located with the processing. That is what I am calling "local state," which I will describe next.
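Co-partitioning works when both the input stream and the state are partitioned by the same key, with the same function and partition count. A toy illustration; Kafka's actual default partitioner hashes the serialized key bytes with murmur2, so treat this as a simplification:

```java
public class CoPartitioning {
    // Same key, same hash, same partition count on both the input topic and
    // the changelog topic keeps a key's events and state on the same task.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // A user's click events and that user's state changes both land in
        // partition p, so the task owning p has everything it needs locally.
        int p = partitionFor("user-42", 8);
        System.out.println(p);
    }
}
```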

The log is periodically compacted by removing duplicate updates for the same key, to keep the log from growing too large.
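With Kafka this is just topic configuration: a changelog topic created with cleanup.policy=compact retains only the latest record per key. A sketch using the Kafka AdminClient; the topic name, partition count, and replication factor are assumptions:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateChangelogTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Only the newest value per key survives compaction, so the topic
            // stays roughly the size of the state, not the size of its history.
            NewTopic changelog = new NewTopic("my-store-changelog", 8, (short) 1)
                    .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(List.of(changelog)).all().get();
        }
    }
}
```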

This model, in which state changes are themselves a kind of stream or log, is pretty nice.

The stream processor acts as a kind of fancy cross-database trigger mechanism to process changes, usually to create some materialized view to be used for serving live queries.
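A sketch of that "fancy trigger" pattern, assuming the database's changes arrive on a Kafka topic and the materialized view is just an in-memory map that a serving layer queries (topic name and view shape are illustrative):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

public class ViewBuilder {
    // userId -> latest profile record; live queries read from this map.
    static final Map<String, String> view = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "view-builder");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("users-changelog"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    if (r.value() == null) view.remove(r.key()); // tombstone = delete
                    else view.put(r.key(), r.value());           // upsert latest value
                }
            }
        }
    }
}
```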

Rich access patterns

Once you have co-located the data with the processing, you can access it in any way you like, as it is local to your machine.
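For example, a range scan or aggregation over a slice of keys becomes an in-memory traversal rather than a network round-trip per key. A self-contained toy illustration using an ordered map as the local store:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class LocalRangeScan {
    public static void main(String[] args) {
        // Local state as an ordered map: a range scan is just a traversal.
        NavigableMap<String, Long> clicksByUser = new TreeMap<>();
        clicksByUser.put("user:1001", 3L);
        clicksByUser.put("user:1500", 7L);
        clicksByUser.put("user:2500", 2L);

        long total = 0;
        for (long c : clicksByUser.subMap("user:1000", "user:2000").values()) {
            total += c; // aggregate over a key range, purely locally
        }
        System.out.println(total); // 10
    }
}
```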

The fundamental idea here is that virtually any kind of data structure can be backed by a changelog and be made fault tolerant. This means the storage is pluggable — any kind of embeddable data structure can be used. Samza provides key-value storage out of the box, but you can plug in anything you like (an in-memory hash table, a bloom filter, a bitmap index, whatever).
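As an illustration of that pluggability, here is a toy Bloom filter made fault tolerant the same way: every mutation is both applied locally and logged, and replaying the log rebuilds the filter. The hashing scheme and the changelog hook are assumptions for the sketch.

```java
import java.util.BitSet;
import java.util.function.Consumer;

public class ChangelogBackedBloomFilter {
    private static final int SIZE = 1 << 20;
    private final BitSet bits = new BitSet(SIZE);
    private final Consumer<String> changelog; // e.g., wraps a Kafka producer send

    public ChangelogBackedBloomFilter(Consumer<String> changelog) {
        this.changelog = changelog;
    }

    public void add(String key) {
        applyChange(key);
        changelog.accept(key); // log the mutation so the filter can be rebuilt
    }

    // Used both for live writes and when replaying the changelog on restart.
    public void applyChange(String key) {
        bits.set(Math.floorMod(key.hashCode(), SIZE));
        bits.set(Math.floorMod(Integer.reverse(key.hashCode()) * 31, SIZE));
    }

    public boolean mightContain(String key) {
        return bits.get(Math.floorMod(key.hashCode(), SIZE))
                && bits.get(Math.floorMod(Integer.reverse(key.hashCode()) * 31, SIZE));
    }
}
```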

You will inevitably change your code to produce different, better results and want to reprocess your input to backfill this new output. Reprocessing means consuming historical input as fast as possible rather than at its live rate, and since stream processing systems like Samza let you run your processing partitioned over a cluster of machines, throughput during these catch-up periods can be very high indeed.

However, stream processing doesn't have the naturally bounded load of an online service: because of re-processing and catch-up, load can spike arbitrarily high.
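Concretely, a backfill can be as simple as pointing the job's consumer back at the start of its input and recomputing; during that catch-up the job reads as fast as the cluster allows, which is exactly where the load spike comes from. A sketch with the Kafka consumer API; topic and group names are assumptions:

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class Backfill {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "backfill-job");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("ad-clicks"));
            // Join the group and get an assignment (a real job would wait
            // until the assignment is non-empty before seeking).
            consumer.poll(Duration.ofMillis(0));
            consumer.seekToBeginning(consumer.assignment());
            // ...the normal poll loop now recomputes output from the full
            // history, reading as fast as the cluster allows...
        }
    }
}
```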

Finally, if significant logic around data access is encapsulated in a service, then replicating the data rather than accessing the service will not enforce that logic. This issue does arise, but it is somewhat less common than it would seem — and quite similar to the situation that arises with Hadoop and MapReduce processing, which virtually requires the data to be replicated to the local system for processing.

throughput-oriented processing

Referring Pages

data-architecture-glossary
book-designing-data-intensive-applications

People

person-jay-kreps