book-i-heart-logs

- The log entry number can be thought of as the "timestamp" of the entry. Describing this ordering as a notion of time seems a bit odd at first, but it has the convenient property of being decoupled from any particular physical clock. This property will turn out to be essential as we get to distributed systems. (loc. 82; see the sketch after this list)
- log shipping protocols (loc. 101)
- Database people generally differentiate between physical and logical logging. (loc. 143)
- The state machine model usually refers to an active-active model, where we keep a log of the incoming requests and each replica processes each request in log order. A slight modification of this, called the primary-backup model, is to elect one replica as the leader. (loc. 146)
- This type of event data shakes up traditional data integration approaches because it tends to be several orders of magnitude larger than transactional data. (loc. 262)
- You can think of the log as acting as a kind of messaging system with durability guarantees and strong ordering semantics. (loc. 301)
- atomic broadcast. (loc. 302)
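The first and last two highlights above fit together: an entry's offset is its logical timestamp, and every reader that consumes in offset order sees the same total order. A minimal sketch of that idea (all names here are hypothetical, not the book's code):

```python
class Log:
    """A minimal append-only log: an entry's offset doubles as its
    logical timestamp, decoupled from any physical clock."""

    def __init__(self):
        self.entries = []

    def append(self, record):
        self.entries.append(record)
        return len(self.entries) - 1  # the new entry's offset

    def read(self, offset, max_count=10):
        """Return up to max_count records starting at offset."""
        return self.entries[offset:offset + max_count]


# Two subscribers at independent offsets still see one total order.
log = Log()
for event in ["created", "renamed", "deleted"]:
    log.append(event)

assert log.read(0, 1) == ["created"]  # a slow consumer's position
assert log.read(2, 1) == ["deleted"]  # a fast consumer's position
```

Because each reader advances only its own offset, a slow consumer never blocks a fast one; that combination of durability and ordering is what lets the log act as the messaging system the loc. 301 highlight describes.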
- First, the pipelines we had built, even though a bit of a mess, were actually extremely valuable. Just the process of making data available in a new processing system (Hadoop) unlocked many possibilities. It was possible to do new computation on the data that would have been hard to do before. Many new products and analyses came from simply putting together multiple pieces of data that had previously been locked up in specialized systems. (loc. 336)
- The idea is that adding a new data system — be it a data source or a data destination — should create integration work only to connect it to a single pipeline instead of to each consumer of data. (loc. 357)
- The key problem for a data-centric organization is coupling the clean integrated data to the data warehouse. A data warehouse is a piece of batch query infrastructure that is well suited to many kinds of reporting and ad hoc analysis, particularly when the queries involve simple counting, aggregation, and filtering. But having a batch system be the only repository of clean, complete data means the data is unavailable for systems requiring a real-time feed: real-time processing, search indexing, monitoring systems, and so on. (loc. 378)
- A better approach is to have a central pipeline, the log, with a well-defined API for adding data. The responsibility of integrating with this pipeline and providing a clean, well-structured data feed lies with the producer of this data feed. (loc. 393)
- By contrast, if the organization had built out feeds of uniform, well-structured data, getting any new system full access to all data would require only a single bit of integration plumbing to attach to the pipeline. (loc. 407)
- The original log is still available, but this real-time processing produces a derived log containing augmented data. (loc. 419)
- The problem with this is the same as the problem with all batch ETL: it couples the data flow to the data warehouse's capabilities and processing schedule. (loc. 427)
- We have defined several hundred event types, (loc. 429)
- The writer controls the assignment of the messages to a particular partition, with most users choosing to partition by some kind of key (such as a user ID). (loc. 464; a partitioning sketch follows this list)
- Each partition is replicated across a configurable number of replicas, each of which has an identical copy of the partition's log. (loc. 467)
- real-time data streams. (loc. 536)
- derived feeds (loc. 554)
- Being able to quickly tap into an output stream and check its validity, compute some monitoring statistics, or even just see what the data looks like makes development much more (loc. 572; see the second sketch below)
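The partitioning highlight (loc. 464) is easy to make concrete: hash the key, take it modulo the partition count. This is a sketch with illustrative names; Kafka's default partitioner behaves along these lines but this is not its actual code:

```python
import hashlib

NUM_PARTITIONS = 8  # configurable, like a topic's partition count

def partition_for(key: str) -> int:
    """Map a key (e.g. a user ID) to a partition. All messages with
    the same key land in the same partition, so each key's messages
    keep a total order; ordering across keys is not guaranteed."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Every event for user "u-42" is appended to one partition's log.
assert partition_for("u-42") == partition_for("u-42")
```

Replication (loc. 467) is orthogonal to this: the partitioner only decides which partition's log a message is appended to, and each of those logs is then copied identically to its replicas.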

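The truncated highlight at loc. 572 is about tapping an output stream during development. A toy version of such a tap, assuming records are dicts carrying a "type" field (that field is my assumption, not something the book specifies):

```python
from collections import Counter

def tap(stream_sample):
    """Quick-and-dirty monitoring over a sample of an output stream:
    how many records, and a breakdown by event type."""
    by_type = Counter(record.get("type", "unknown") for record in stream_sample)
    return len(stream_sample), by_type

# Peek at a few records from a (hypothetical) derived feed.
sample = [{"type": "click"}, {"type": "click"}, {"type": "view"}]
count, by_type = tap(sample)
print(f"sampled {count} records: {dict(by_type)}")  # sampled 3 records: {'click': 2, 'view': 1}
```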
book-i-heart-logs#intermediate-results

- Big, complex MapReduce workflows use files to checkpoint and share their intermediate results. Big, complex SQL processing pipelines create lots and lots of intermediate or temporary tables. This just applies the pattern with an abstraction that is suitable for data in motion, namely a log. (loc. 590)
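That "log as checkpoint for data in motion" pattern can be sketched in a few lines: instead of a temp file or table, a stage appends its intermediate output to a log that downstream stages (and any ad hoc debugging consumer) read from their own offsets. All names here are illustrative:

```python
# A hypothetical two-stage pipeline. Instead of a temporary file or
# table, stage one checkpoints intermediate results to a shared log
# that stage two (or an ad hoc debugger) tails from its own offset.
intermediate_log = []  # stands in for a real partitioned log topic

def stage_one(raw_events):
    for event in raw_events:
        enriched = {**event, "normalized": event["value"].strip().lower()}
        intermediate_log.append(enriched)  # checkpoint for data in motion

def stage_two(offset=0):
    """A second processor reads the intermediate log independently."""
    return [e for e in intermediate_log[offset:] if e["normalized"]]

stage_one([{"value": "  Click "}, {"value": "VIEW"}])
print(stage_two())  # both stages share the log; neither blocks the other
```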

- The Lambda Architecture emphasizes retaining the original input data unchanged. I think this is a really important aspect. The ability to operate on a complex data flow is greatly aided by the ability to see what inputs went in and what outputs came out. (loc. 610)
- I also like that this architecture highlights the problem of reprocessing data. Reprocessing is one of the key challenges of stream processing, (loc. 612)
- abstraction in stream processing is data-flow DAGs, which are exactly the same underlying abstraction in a traditional data warehouse (such as Volcano), as well as being the fundamental abstraction in the MapReduce successor Tez. Stream processing is just a generalization of this data-flow model that exposes checkpointing of intermediate results and continual output to the end user. (loc. 643)
- The real advantage isn't about efficiency at all, but rather about allowing people to develop, test, debug, and operate their system on top of a single processing framework. (loc. 674)
- You might, for example, want to enrich an event stream (say a stream of clicks) with information about the user doing the click, in effect joining the click stream to the user account database. (loc. 681; a join sketch follows this list)
- In fact, the processors have something very like a co-partitioned table maintained along with them. (loc. 700)
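The last two highlights describe a stream-table join: the processor keeps a local table of user accounts, fed by a changelog stream, and looks each click up against it. A minimal sketch with illustrative names (co-partitioning both streams by user ID is what would let each processor hold only its own shard of this table):

```python
# Local, co-partitioned copy of the user account data, maintained by
# consuming the account changelog rather than querying a remote DB.
user_table = {}

def on_user_change(user_id, account):
    """Apply one record from the user-account changelog stream."""
    user_table[user_id] = account

def on_click(click):
    """Enrich a click event with the current state of its user."""
    account = user_table.get(click["user_id"], {})
    return {**click, "user": account}

on_user_change("u-1", {"name": "Ada", "plan": "pro"})
print(on_click({"user_id": "u-1", "page": "/pricing"}))
# {'user_id': 'u-1', 'page': '/pricing', 'user': {'name': 'Ada', 'plan': 'pro'}}
```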

Referring Pages

intermediate-results

People

person-jay-kreps