https://www.youtube.com/watch?v=7ooZ4S7Ay6Y
At 5:50 Spark is a scheduling, monitoring, and distributing framework.
At 7:20 Spark SQL is a module for working with structured data made of rows and columns. Hive data, queries, etc. work with it.
At 8:08 standard ODBC connections... you can run SQL at scale.
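A hedged sketch of what "SQL at scale over rows and columns" looks like in code; the file, view, and column names are made up, and this uses the current SparkSession API rather than whatever version the talk shows:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("spark-sql-sketch")
      .master("local[*]")                        // local mode just for the sketch
      .getOrCreate()

    // Structured data (rows and columns) registered as a view, then queried with SQL.
    val people = spark.read.json("people.json")  // hypothetical input file
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 21").show()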
At 8:40 Spark Streaming: in 2012 it handled 60 million records/sec on a 100-node cluster.
At 9:40 these higher-level libraries push their work down to Spark core.
At 18:20 Spark is roughly 10-100x faster than MapReduce.
At 21:00 the Spark paper.
At 21:20 the Resilient Distributed Datasets (RDD) paper.
At 21:42 more papers: one on Spark Streaming and the other on Spark SQL.
At 22:20 there is also the Learning Spark book from O'Reilly.
At 23:30 he talks about the Spark Streaming documentation.
Advanced Spark notes
At 30:00 each RDD may have 1,000 to 10,000 partitions; that's pretty normal.
At 31:40 RDDs can be created in two ways: you can parallelize a collection, or you can get the data from some external source.
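A minimal Scala sketch of the two creation paths (the HDFS path is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("rdd-creation").setMaster("local[*]"))

    // 1) Parallelize an in-memory collection (optionally choosing the number of partitions).
    val fromCollection = sc.parallelize(1 to 100000, numSlices = 8)

    // 2) Load from an external source, e.g. a text file on HDFS.
    val fromFile = sc.textFile("hdfs:///data/events.log")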
At 33:00 he talks about a base RDD.
At 34:00 he shows RDDs being transformed with a filter and then a coalesce.
At 35:32 at the end you can ship the result over to the driver JVM via an "action".
At 36:24 it basically builds a DAG (directed acyclic graph), but nothing actually happens yet. It is lazy.
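Roughly what that filter -> coalesce -> action flow looks like, as a sketch (path and filter logic are made up); nothing runs until the action at the end:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("lazy-dag").setMaster("local[*]"))

    // Transformations only add nodes to the DAG; they are lazy.
    val base     = sc.textFile("hdfs:///data/events.log")   // base RDD
    val errors   = base.filter(_.contains("ERROR"))          // transformation
    val squeezed = errors.coalesce(4)                         // shrink the number of partitions

    // Nothing has executed yet. The action triggers the whole lineage
    // and ships the result back to the driver JVM.
    val howMany = squeezed.count()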
At 40:20 he talks about not caching your base RDD, but caching one that has had a few normalization steps applied to it. This sounds a lot like a batch view.
At 40:50 there is metadata attached to the RDD indicating that you want to cache it. Caching is lazily evaluated.
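A sketch of that pattern with a toy "normalization" step; cache() only marks the RDD, and the first action materializes it (path is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("cache-normalized").setMaster("local[*]"))

    // Don't cache the raw base RDD; cache the cleaned-up one you will reuse.
    val base       = sc.textFile("hdfs:///data/events.log")
    val normalized = base.map(_.toLowerCase.trim).filter(_.nonEmpty)

    normalized.cache()   // just sets metadata on the RDD; nothing is computed yet
    normalized.count()   // first action computes the lineage and fills the cache
    normalized.count()   // later actions read the cached partitions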
No timestamp, but the workers are called executor JVMs.
At 51:20 the five methods that make up the RDD interface.
At 55:00 the preferred location for an RDD. For example, a filtered RDD would prefer to be on the same node as its parent RDD.
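Those five pieces correspond to what a custom RDD subclass overrides in Spark's RDD base class. A skeletal sketch (the partition, location, and record values are placeholders, not a real data source):

    import org.apache.spark.{Dependency, Partition, Partitioner, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    class SketchRDD(sc: SparkContext) extends RDD[String](sc, Nil) {

      // 1) How the data is split into partitions.
      override protected def getPartitions: Array[Partition] =
        Array(new Partition { override def index: Int = 0 })

      // 2) How to compute the records of one partition.
      override def compute(split: Partition, context: TaskContext): Iterator[String] =
        Iterator("example record")

      // 3) Which parent RDDs this one depends on (none here, hence Nil above).
      override protected def getDependencies: Seq[Dependency[_]] = Nil

      // 4) Optional: where each partition would prefer to run, e.g. the node holding its data.
      override protected def getPreferredLocations(split: Partition): Seq[String] = Seq("node-17")

      // 5) Optional: how keys are partitioned (only meaningful for key-value RDDs).
      override val partitioner: Option[Partitioner] = None
    }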
At 56:30 the Cassandra connector RDD.
At 57:20 he contrasts pulling data from Cassandra with pulling it from HDFS. When pulling from Cassandra you can pass specifics about what you want, and it will return just the data you need, whereas with HDFS you typically have to pull the entire file.
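A sketch of the contrast, assuming the DataStax spark-cassandra-connector is on the classpath; keyspace, table, column names, host, and the HDFS path are all made up:

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._   // DataStax spark-cassandra-connector

    val conf = new SparkConf()
      .setAppName("cassandra-vs-hdfs")
      .set("spark.cassandra.connection.host", "10.0.0.5")   // hypothetical contact point
    val sc = new SparkContext(conf)

    // Cassandra: select/where get pushed down, so only the requested columns/rows come back.
    val recent = sc.cassandraTable("ks", "events")
      .select("user_id", "event_time")
      .where("event_time > ?", "2015-01-01")

    // HDFS: the whole text file is read, then filtered on the Spark side.
    val everything = sc.textFile("hdfs:///data/events.log")
    val filtered   = everything.filter(_.contains("2015-01"))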
At 1:09 he talks about hitting Shift+Enter to run every cell in a notebook. I don't know what he means by that.
At 1:21 running in standalone mode actually runs as a cluster
At 1:22 there is static partitioning and dynamic partitioning. This is not about partitions in RDDs, but rather about resources at the cluster level.
At 1:24 Hadoop is really three projects: HDFS, YARN, and MapReduce. Spark is really just a next-generation replacement for MapReduce.
At 1:26 in Hadoop you have separate map and reduce slots. If you start up a new machine it may spend the first hour only doing maps and not using any of the reduce slots, meaning you do not use your CPUs fully.
At 1:27:40 in Hadoop, when a slot frees up it can be 15 to 20 seconds before a new task lands there.
At 1:28:25 in Spark, the latency of assigning a new task to a slot is extremely low.
At 1:29 Spark reuses the slots.
At 1:30:30 a visualization showing green tasks. He says you can also call them slots, and that Spark calls them cores, as in num cores.
A beginner mistake is to set the Spark cores to be the same as or less than the number of actual machine cores. Instead you want to oversubscribe the Spark cores by a factor of two or three.
At 1:37:27 code showing how to do a spark-submit. The number of cores specified in code will override a value you pass to spark-submit.
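The precedence he describes, sketched with spark.cores.max (the value is arbitrary): whatever is set explicitly on the SparkConf in code wins over the same setting supplied to spark-submit.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cores-precedence")
      .set("spark.cores.max", "12")   // hard-coded here, so a value passed to spark-submit
                                      // (e.g. --total-executor-cores) is ignored
    val sc = new SparkContext(conf)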
At 1:37:40 if you try to align the number of tasks exactly to the number of cores, it won't work out, because there are internal threads running too.
At 1:39:30 in standalone mode you submit to a standalone cluster; in this case there are four machines, but it is a little unclear whether they are real machines or virtual machines on a single real machine.
At 1:41 there is a configuration variable called spark.local.dir, which can be a list of local directories. One thing you can do is have multiple SSDs mounted on the machines: maybe the OS drive is normal rotational media, but the SSDs are used to persist data from an RDD that does not have enough memory available.
At 1:42:10 the map-side spill files, i.e. the intermediate shuffle data produced after the maps, get stored in those local dirs.
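A sketch of pointing that scratch space at two SSD mounts (mount points are hypothetical); note that on a real cluster this is usually set per node, e.g. via SPARK_LOCAL_DIRS, which overrides what the application sets:

    import org.apache.spark.SparkConf

    // Comma-separated list of local directories used for shuffle spill and
    // on-disk RDD persistence.
    val conf = new SparkConf()
      .setAppName("local-dirs")
      .set("spark.local.dir", "/mnt/ssd1/spark,/mnt/ssd2/spark")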
At 1:42:40 use JBOD with SSDs, not RAID.
At 1:43:10 using the startup scripts, the Spark master JVM starts and a Spark worker JVM starts on each node; each worker registers with the master.
At 1:44 the Spark master JVM and Spark worker JVMs are both pretty small. That is not where the real work is done; the real work is done in the executor JVMs.
At 1:45:40 the Spark master JVM is basically a scheduler and tells the various workers to launch executor JVMs for the application.
At 1:46:30 all the worker JVM does is create executor JVMs when it is asked to do so by the master JVM.
At 1:46:40 if an executor crashes, the worker will restart it. If a worker crashes, the master will restart it. If the driver crashes, I didn't understand what happens. To do: question.
To do: what is a driver?
At 1:47:20 you can have two partitions and have them replicated once so they are on four executors. This is for caching the RDD values.
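That replicated caching can be sketched with a "_2" storage level, which keeps two copies of each cached partition (path is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(new SparkConf().setAppName("replicated-cache").setMaster("local[*]"))

    // Two partitions, each replicated on two executors = four cached copies in total.
    val data = sc.textFile("hdfs:///data/events.log").repartition(2)
    data.persist(StorageLevel.MEMORY_ONLY_2)   // the "_2" suffix means two replicas per partition
    data.count()                               // the action materializes the cache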
At 1:48 when a task runs local to the cached data, the data goes directly from the heap to the thread.
At 1:48:30 there is a setting in one of those files (probably SPARK_WORKER_CORES in spark-env.sh) where you can set the worker cores for a specific box to be higher if it is a more powerful box. Remember that worker cores is not the number of cores on the machine and should be at least double the actual cores on the machine; it is task slots.
At 1:49:20 the worker cores value is just the number of task slots, or cores, that a worker can give out to the executor JVMs it is in charge of.
At 1:49:50 you can make your Spark masters highly available using ZooKeeper, and you can add more masters while the cluster is running.
At 1:50:10 the DataStax distribution of Spark uses Cassandra's system tables to make the master JVMs highly available, because they didn't want to complicate things by adding ZooKeeper.
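With the ZooKeeper-based HA he described just before, an application lists every master in its URL and fails over if the active one dies; a sketch with made-up host names:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("ha-masters")
      .setMaster("spark://master1:7077,master2:7077")   // active + standby masters

    val sc = new SparkContext(conf)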
At 1:51:10 he says you shouldn't give a JVM more than about 40 GB of RAM because the garbage collection overhead gets too high.
At 1:51:45 if you want to run two of the same executor on each box in standalone mode, you will need two workers, because one worker can only run one executor per application.
At 1:52 you can set that with SPARK_WORKER_INSTANCES.
At 1:52:50 SPARK_WORKER_MEMORY is how much memory the worker can give out to its underlying executor JVMs, kind of like the worker cores setting.
At 1:53:10 SPARK_DAEMON_MEMORY sets the amount of memory for the actual worker and master JVMs. It is not SPARK_WORKER_MEMORY, even though that name sounds like it should set the memory for the worker JVM itself. Surprising.
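Those are per-node settings in spark-env.sh; the closest per-application knob is spark.executor.memory, which has to fit inside what the worker advertises. A sketch (value is arbitrary):

    import org.apache.spark.SparkConf

    // Each executor JVM for this application asks for 8 GB of heap;
    // the worker can only grant it if it fits within its SPARK_WORKER_MEMORY.
    val conf = new SparkConf()
      .setAppName("executor-memory")
      .set("spark.executor.memory", "8g")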
At 1:42:10 he mentions multiple applications again. A worker can run multiple executors, as long as they are for different applications. Glossary.
At 1:56:40 the master JVM has an embedded web server that runs on some port.
At 2:04:45 the difference between standalone and YARN is who starts the executor JVMs. In standalone it is the worker JVM; in YARN it is somebody else, will find out soon.
At 2:05 a Spark application is made up of multiple jobs. A spark-shell session is an application.
At 2:05:40 a job is made up of multiple stages.
At 2:06 a stage decomposes into one or more tasks
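A sketch of that hierarchy with a word count (path is hypothetical): the application is the SparkContext, the action submits one job, the shuffle splits it into two stages, and each stage runs one task per partition.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("job-stage-task").setMaster("local[*]"))

    val pairs  = sc.textFile("hdfs:///data/events.log")   // first stage starts here
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
    val counts = pairs.reduceByKey(_ + _)                  // shuffle => stage boundary
    counts.count()                                         // action => submits one job (two stages)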