podcast-oreilly-data-show-metadata

Meta-data services podcast with Joe Hellerstein

At 1:30 data wrangling

At 2:30 data wrangling is data cleaning and transformation

At 2:50 entity resolution

At 3:10 there might be an algorithm, but how well does the person using the algorithm interact with it? It is an HCI problem

podcast-oreilly-data-show-metadata#validate-the-basic-thrust1Way of talking: "We validated the basic thrust." podcast-oreilly-data-show-metadata#validate-the-basic-thrust1

At 4:00 an interaction model where the computer predicts what the person might want is what he calls predictive transformation or predictive interaction.

At 4:40 the data wrangling at scale they did at the company called Trifacta

podcast-oreilly-data-show-metadata#dsls-make-people-productive-without-being-programmers1At 5:15 there is a strong influence from database research on creating DSLs that make people very productive without making them programmers. podcast-oreilly-data-show-metadata#dsls-make-people-productive-without-being-programmers1

At 5:45 you want to use examples in a spreadsheet, so you can look at them visually, but then you want to scale it to maybe terabytes of data in the backend.

podcast-oreilly-data-show-metadata#elegant-intellectually1At 7:15 it is elegant intellectually but I don't think it will fly. podcast-oreilly-data-show-metadata#elegant-intellectually1

At 8:50 he uses the example of thousands of transactions per second as something that did not need a custom solution, and at a point in the mid-2000s it wasn't completely clear that new systems were needed for the amount of data that was coming in to the systems.

At 10:00 we now have data rates on the order of hertz.

At 10:45 there was work at Stanford that created a SQL-like language called CQL that worked on streams and was very similar to SQL.
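As a rough illustration of what a windowed query over a stream computes, here is a minimal Python sketch. It is not the Stanford language itself; the function name `sliding_window_sum`, the window size, and the sample readings are all made up for illustration.

```python
from collections import deque
from typing import Iterable, Iterator


def sliding_window_sum(stream: Iterable[int], size: int) -> Iterator[int]:
    """Yield the sum of the most recent `size` items after each arrival."""
    window = deque(maxlen=size)  # the oldest item drops out automatically
    for item in stream:
        window.append(item)
        yield sum(window)


# Hypothetical stream of values, consumed one at a time.
readings = [3, 1, 4, 1, 5]
sums = list(sliding_window_sum(readings, size=3))
```

The point of the sketch is that the query keeps running and emits a fresh answer per arrival, instead of running once over a finished table.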

At 13:10 a good quote about SQL being a vehicle for scale.

At 16:00 people realize that it's all about data, and computation is a servant to data. At 17:30 they were trying to figure out a way of giving MapReduce a different interface, like on top of Postgres with the Greenplum project.

At 16:20 the biggest problem is the cost of coordination

At 16:30 the question is how can you get correct semantics in your data without needing to constantly check in or get locks, etc. The question was how do you do this stuff at Internet scale

At 19:30 consistency as logical monotonicity: the CALM theorem, which came from his group
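A minimal Python sketch of the intuition, under my own assumptions: a monotonic computation (here, set union, which only ever grows) gives the same answer no matter what order updates arrive in, so replicas do not need to coordinate. A non-monotonic question ("is `c` absent?") can be retracted by later input. The names `merge_all` and `c_absent_after` and the sample updates are hypothetical.

```python
from itertools import permutations


def merge_all(updates):
    """Monotonic merge: the state only grows as updates are unioned in."""
    state = set()
    for update in updates:
        state |= update
    return state


updates = [{"a"}, {"b"}, {"a", "c"}]

# Every arrival order produces the same final state.
results = {frozenset(merge_all(order)) for order in permutations(updates)}
order_insensitive = len(results) == 1


def c_absent_after(prefix):
    """Non-monotonic query: a True answer can later flip to False."""
    return "c" not in merge_all(prefix)


# Answer to the non-monotonic query after 0, 1, 2, 3 updates have arrived.
answers = [c_absent_after(updates[:n]) for n in range(len(updates) + 1)]
```

The early `True` answers in `answers` are exactly the kind of result that would need coordination ("have I seen everything yet?") before it could be trusted.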

At 20:00 people are starting to "raise the interface" of correctness and coordination and try to expose it in systems.

At 20:21 one of his students, who is now a Stanford professor, had a version of this called invariant confluence, in which you program a bunch of invariants into your code

Some of those invariant-type ideas are starting to be integrated into NoSQL stores instead of locking.
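A toy Python sketch of the idea, with everything in it assumed for illustration (a bank balance, the invariant `balance >= 0`, and a merge that simply applies both replicas' operations to the shared starting state). Deposits keep the invariant after merging, so they need no coordination; concurrent withdrawals can each look fine locally and still break the invariant once merged.

```python
def apply_ops(start, ops):
    """State of one replica after applying its own operations."""
    return start + sum(ops)


def invariant(balance):
    """The application-level rule declared in code: never go negative."""
    return balance >= 0


def merge(start, ops_a, ops_b):
    """Merged state: both replicas' operations applied to the common start."""
    return start + sum(ops_a) + sum(ops_b)


start = 100

# Deposits: valid on each replica and still valid after merging.
deposits_a, deposits_b = [20], [30]
deposits_ok_locally = (invariant(apply_ops(start, deposits_a))
                       and invariant(apply_ops(start, deposits_b)))
deposits_ok_merged = invariant(merge(start, deposits_a, deposits_b))

# Withdrawals: valid on each replica, but the merged state is 100 - 60 - 60.
withdraw_a, withdraw_b = [-60], [-60]
withdrawals_ok_locally = (invariant(apply_ops(start, withdraw_a))
                          and invariant(apply_ops(start, withdraw_b)))
withdrawals_ok_merged = invariant(merge(start, withdraw_a, withdraw_b))
```

Only the withdrawal case would need a lock or some other coordination step; the deposit case can run freely on every replica.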

At 21:50 data-driven culture.

At 22:08 when thinking about the data lake, you can decide what the meaning of the data is after the fact with schema on use, or schema on read.

At 22:50 he says that there are pieces of software that attempt to have everyone hold the same opinion about all of the data. He calls that master data management, and says it's fine for some uses.

Somewhere around 23:00 he gives the example of how the authorship in Wikipedia seemed very ambiguous 10 years ago and there was some question of whether that could be legitimate. We have come to embrace that ambiguity and understand it.

At 23:50 he says that one of the exciting things about the data-driven philosophy, big data, and agility is the ability to deal with ambiguity in a way increasingly similar to how people deal with ambiguity.

At 24:00 we are going to have multiple views of the data. Gordon's note: schema on read facilitates multiple views of the data because different people can have different opinions about the same underlying data and what it means.
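A small Python sketch of schema on read giving two views of the same raw data. The raw text, the column names, and the two "views" (`finance_view`, `audit_view`) are invented for illustration; the only point is that the interpretation is chosen at read time, not when the data is stored.

```python
import csv
import io

# Raw data is stored as-is, with no types attached (hypothetical contents).
RAW = "id,amount,when\n1,19.99,2016-03-01\n2,5.00,2016-03-02\n"


def read_with(schema, raw):
    """Apply a schema (column name -> cast function) while reading."""
    rows = csv.DictReader(io.StringIO(raw))
    return [{col: cast(row[col]) for col, cast in schema.items()}
            for row in rows]


# Two readers with different opinions about the same underlying bytes.
finance_view = {"id": int, "amount": float}
audit_view = {"id": str, "when": str}

finance = read_with(finance_view, RAW)
audit = read_with(audit_view, RAW)
```

Neither view changes `RAW`, so a third reader can bring yet another schema later.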

At 24:10 the description of the data is different from the storage of the data. It may be stored in HDFS, but the description is elsewhere. The question is where that something else is. There is a Hive metastore, which is an Apache product. However, out in the real world, what he hears is that these are perceived as vendor-specific projects, even the Apache Atlas project.

At 25:42 what does a metastore need? One thing it needs is to be a place to put your data inventory: what is my data, how is it structured, what is it named, etc.
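A minimal sketch of what one inventory record might hold, in Python. This is not the Hive metastore or any real product's data model; `DatasetEntry`, `register`, `find_by_column`, and the example path are all hypothetical names chosen to mirror the three questions above (what is it named, where is it, how is it structured).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DatasetEntry:
    name: str                 # what is it named
    location: str             # where the bytes actually live
    columns: Dict[str, str]   # how it is structured: column -> type label


catalog: Dict[str, DatasetEntry] = {}


def register(entry: DatasetEntry) -> None:
    """Add or overwrite an inventory record, keyed by dataset name."""
    catalog[entry.name] = entry


def find_by_column(column: str) -> List[str]:
    """Names of all registered datasets that have the given column."""
    return sorted(e.name for e in catalog.values() if column in e.columns)


register(DatasetEntry(
    name="clickstream_raw",
    location="hdfs:///data/raw/clicks",  # made-up path
    columns={"user_id": "string", "ts": "timestamp"},
))
register(DatasetEntry(
    name="customers",
    location="hdfs:///data/curated/customers",  # made-up path
    columns={"user_id": "string", "region": "string"},
))
```

Note that the record describes the data without containing it, which matches the point that the description lives somewhere other than the storage.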

At 26:41 using query logs for "expert sourcing"
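One way to read "expert sourcing" is: whoever queries a table most is probably the person to ask about it. A minimal Python sketch under that assumption follows; the log format (user, table pairs), the user names, and `experts_by_table` are all made up, and a real log would need parsing first.

```python
from collections import Counter, defaultdict

# Hypothetical, already-parsed query log: (user, table touched).
query_log = [
    ("alice", "orders"),
    ("bob", "orders"),
    ("alice", "orders"),
    ("carol", "customers"),
    ("alice", "customers"),
    ("carol", "customers"),
]


def experts_by_table(log):
    """Map each table to the user who queried it most often."""
    counts = defaultdict(Counter)
    for user, table in log:
        counts[table][user] += 1
    return {table: users.most_common(1)[0][0]
            for table, users in counts.items()}


experts = experts_by_table(query_log)
```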

podcast-oreilly-data-show-metadata#raw-data-to-cooked1At 27:30 when you're working with it in the raw form and you are boiling it down into something kind of cooked podcast-oreilly-data-show-metadata#raw-data-to-cooked1

podcast-oreilly-data-show-metadata#data-lineageAt 28:31 the thing you always want to know is the data lineage. Who kicked off this report? Who has access to this data? Is it frequently accessed? Which of the reports was it used in? Is it in the raw form, or in the "post-processed form"? podcast-oreilly-data-show-metadata#data-lineage
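A minimal Python sketch of answering one lineage question, "what did this report come from?", by walking a dependency graph. The dataset names and the `upstream` helper are hypothetical; a real lineage store would also track who ran what and when.

```python
# Hypothetical lineage: each dataset maps to the datasets it was derived from.
lineage = {
    "quarterly_report": ["sales_cleaned"],
    "sales_cleaned": ["sales_raw", "regions_raw"],
    "sales_raw": [],
    "regions_raw": [],
}


def upstream(dataset, graph):
    """All direct and indirect sources of `dataset`, sorted by name."""
    seen = []
    stack = list(graph.get(dataset, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.append(node)
            stack.extend(graph.get(node, []))
    return sorted(seen)


sources = upstream("quarterly_report", lineage)
# Datasets with no inputs of their own are the raw form.
raw_sources = [name for name in sources if not lineage[name]]
```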

At 29:50 does the decision-maker want to know all of the details about what went into the data transformations? He gives the example that they might not want to see all of the Python code, but they might want to see a visualization.

At 30:10 kind of like how you can have a DSL that describes transformations you'd like to make, there could be an even higher-level visual explanation of transformations already made, kind of like an IKEA set of instructions

At 31:51 he describes a shop that is collective-governance minded, where they can do Wikipedia-like data curation; he also mentions museum curation

At 35:14 that divide has never been more porous and flexible than it is now

podcast-oreilly-data-show-metadata#patterns-of-critique1At 36:00 there are patterns of critique podcast-oreilly-data-show-metadata#patterns-of-critique1

Referring Pages

data-architecture-glossary

People

person-joe-hellerstein