Euclidean distance
- k-means: This unsupervised clustering algorithm uses a distance metric with the goal of minimizing the Euclidean distance from the data points to a centroid, remeasuring and reassigning each data point to a centroid on each iteration.
(#)
Explainable AI
- "The disconnect between how we make decisions and how machines make them, and the fact that machines are making more and more decisions for us, has birthed a new push for transparency and a field of research called explainable A.I., or X.A.I"
(#)
F-score
- F Score: This is a weighted harmonic mean of the true positive rate (recall) and precision.
(#)
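- For reference, the standard formulation (not quoted from the source above): the F1 score is the harmonic mean of precision and recall, and the weighted F-beta variant generalizes it.

```latex
% F1: harmonic mean of precision (P) and recall (R)
F_1 = \frac{2 \, P \, R}{P + R}
\qquad
% Weighted variant; beta controls how much recall is weighted relative to precision
F_\beta = (1 + \beta^2) \cdot \frac{P \, R}{\beta^2 P + R}
```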
false positives (FP)
- confusion-matrix: We predicted yes, but they don't actually have the disease. (Also known as a "Type I error.")
(#)
false positive rate
- confusion-matrix: When it's actually no, how often does it predict yes?
(#)
- confusion-matrix: We predicted yes, but they don't actually have the disease. (Also known as a "Type I error.")
(#)
false negatives (FN)
- confusion-matrix: We predicted no, but they actually do have the disease. (Also known as a "Type II error.")
(#)
fat one-sided distribution
- The number of parents for a git commit is probably distributed according to a fat one-sided distribution (often informally called a power law distribution, but that's usually not strictly correct for reasons that aren't interesting here)
(#)
feature
- Features are independent of each other, meaning that one feature doesn't impact the value of another, and a set of labels is considered and assigned in advance.
(#)
feature engineering
- Data organization and features are the real complex pieces of most ML.
(#)
- deep learning primarily works in cases where you don't need interpretable features and you have plenty of data. This means that its biggest successes have been in image, audio, and text problems. For all other use cases, feature engineering is still a necessary step for applying machine learning
(#)
feature independence
- Features are independent of each other, meaning that one feature doesn't impact the value of another, and a set of labels is considered and assigned in advance.
(#)
feature vectors
finite set
- At 1:05 features are categorical... they are from a finite set
(#)
fitting
- you can't talk about validation without talking about fitting
(#)
frequency
- Used to find underlying structure of data based on statistical properties such as frequency
(#)
general form
- At 6:09 let's write a general form of this formula
(#)
generative adversarial network
- It set up two neural networks — one that generated the images and another that tried to determine whether those images were real or fake. These are called generative adversarial networks, or GANs. In essence, one system does its best to fool the other — and the other does its best not to be fooled.
(#)
generative model
- At 13:30 This is discriminative as opposed to generative because we don't have a distribution.
(#)
- At 5:05 the generative model would model the joint distribution, both X and Y
(#)
- At 7:45 the two models come to different decisions for a given X
(#)
- At 7:50 the generative model is more powerful than the discriminative model because it takes more stuff into consideration
(#)
- At 8:30 generative: estimating densities requires a lot of data and is statistically difficult to do
(#)
- At 9:05 a generative model can have worse performance than a discriminative model if there is not enough data. There is higher variance in the estimates with less data
(#)
gradient descent
- Many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter (e.g., through gradient descent)
(#)
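- A minimal sketch of the idea: repeatedly apply the same gradient-step update to the same dataset to optimize a single parameter of a least-squares fit. The data, learning rate, and iteration count below are made up for illustration.

```python
import numpy as np

# Toy data: y is roughly 3 * x plus noise (invented numbers for illustration)
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)

w = 0.0    # parameter to optimize
lr = 0.1   # learning rate (itself a hyperparameter)

# Repeatedly apply the same update to the same dataset:
# step w against the gradient of the mean squared error.
for _ in range(200):
    grad = np.mean(2 * (w * x - y) * x)   # d/dw of mean((w*x - y)^2)
    w -= lr * grad

print(w)   # should end up close to 3.0
```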
greedy procedure
- Uses a greedy procedure
(#)
hidden layers
- Number of hidden layers in a deep neural network
(#)
hierarchical clustering
- Example: Hierarchical clustering
(#)
hyperparameters
- Hyperparameters: can be done by setting different values and choosing which tests better or via statistical methods
(#)
- hyperparameters: Number of clusters in k-means: in our K-means example we used the elbow method.
(#)
- hyperparameters: Number of leaves in a decision tree
(#)
- another kind of parameter that cannot be directly learned from the regular training process
(#)
- example: Number of latent factors in a matrix factorization
(#)
- example: Learning rate (in many models)
(#)
- Number of hidden layers in a deep neural network
(#)
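- A minimal sketch of tuning a hyperparameter by "setting different values and choosing which tests better", here via scikit-learn's grid search over the number of neighbors in k-NN; the parameter grid and dataset are illustrative stand-ins, not from the notes above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try several values of the hyperparameter k (n_neighbors) and keep whichever
# cross-validates best; k is never learned by the regular training process.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```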
hyperplane
- "there is a hyperplane that can separate the pink from the blue points"
(#)
independent variable
- By applying OLS, we'll get an equation that takes hand size---the 'independent' variable---as an input, and gives height---the 'dependent' variable---as an output.
(#)
intercept
- Below, ordinary least squares (OLS) is done behind-the-scenes to produce the regression equation. The constants in the regression---called 'betas'---are what OLS spits out. Here, beta_1 is an intercept; it tells what height would be even for a hand size of zero. And beta_2 is the coefficient on hand size; it tells how much taller we should expect someone to be for a given increment in their hand size
(#)
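- A minimal sketch of the OLS fit described above; the hand-size and height numbers are invented purely to show the mechanics of recovering beta_1 (the intercept) and beta_2 (the coefficient on hand size).

```python
import numpy as np

# Invented data: hand size (cm) as the independent variable,
# height (cm) as the dependent variable.
hand = np.array([17.0, 18.5, 19.0, 20.5, 21.0, 22.5])
height = np.array([160.0, 166.0, 169.0, 175.0, 178.0, 184.0])

# Ordinary least squares: height is approximately beta_1 + beta_2 * hand
X = np.column_stack([np.ones_like(hand), hand])
beta, *_ = np.linalg.lstsq(X, height, rcond=None)

beta_1, beta_2 = beta   # intercept and slope
print(beta_1, beta_2)
print(beta_1 + beta_2 * 19.5)   # predicted height for a 19.5 cm hand
```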
interpretability
- When your dataset isn't that big, doing something simpler is often both more interpretable and works just as well, given the potential for overfitting
(#)
Iris flower data set
- Based on Fisher's linear discriminant model, this data set became a typical test case for many statistical classification techniques in machine learning such as support vector machines.
(#)
Joint distribution
- At 1:10 a joint distribution
(#)
Joint probability
- At 4:50 the joint probability of these rules is the product of the independent probabilities of the rules.
(#)
Kernel trick
- low dimensional feature space and map it to a higher dimensional feature space, but do so in a computationally effective way using the so-called kernel trick
(#)
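- A minimal sketch, assuming scikit-learn: a linear SVM versus an RBF-kernel SVM on the two-moons toy data, which isn't separable by a hyperplane in its original 2-D feature space. The kernel lets the model work as if in a higher dimensional feature space without ever computing that mapping explicitly.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two-moons data is not separable by a hyperplane in its original 2-D space.
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear = SVC(kernel="linear").fit(X_train, y_train)
rbf = SVC(kernel="rbf").fit(X_train, y_train)   # kernel trick: implicit high-dimensional mapping

print("linear:", linear.score(X_test, y_test))
print("rbf:   ", rbf.score(X_test, y_test))
```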
k-folds cross validation
- a type of cross validation
- K-folds cross-validation: The training data set is split into subsets – one subset serves as the test set, and the remaining subsets are used for training
(#)
- K-folds cross-validation: Averages error rate over rounds to estimate model performance.
(#)
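- A minimal sketch of k-folds cross-validation with scikit-learn; the model (logistic regression) and dataset (iris) are stand-ins chosen only to make the example runnable.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Split the training data into k = 5 folds; each fold takes a turn as the
# held-out test set while the remaining folds are used for training.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)

# The per-fold scores are averaged to estimate model performance.
print(scores)
print("mean accuracy:", scores.mean())
```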
k-means clustering
- Example: K-means clustering
(#)
- k-means: This unsupervised clustering algorithm uses a distance metric with the goal of minimizing the Euclidean distance from the data points to a centroid, remeasuring and reassigning each data point to a centroid on each iteration.
(#)
- k-means: makes no assumptions about the data meaning it uses random seeds and an iterative process that eventually converges.
(#)
- Used for relationship discovery and understanding the underlying structure of data
(#)
- Useful for unlabelled data as a first round of analysis
(#)
- Manually give a target number of clusters
(#)
- This algorithm partitions n observations into k clusters, with each observation belonging to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
(#)
- Certain data mining algorithms (including k-means clustering and k-nearest neighbors) require a user defined parameter k.
(#)
- The k-means algorithm is a clustering algorithm, and it is unsupervised: it takes a bunch of unlabeled points and tries to group them into clusters (the "k" is the number of clusters).
(#)
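- A minimal sketch of the k-means loop described above: random seeds, then repeated reassignment of each point to its nearest centroid and re-measurement of the centroids until convergence. The 2-D points are made up.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # Random seeds: start from k randomly chosen points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Euclidean distance from every point to every centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)   # reassign each point to its nearest centroid
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break                        # converged
        centroids = new_centroids
    return labels, centroids

# Made-up data: two blobs in the plane.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(pts, k=2)
print(centroids)
```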
k nearest neighbor
- The k-nearest-neighbors algorithm is a classification algorithm, and it is supervised: it takes a bunch of labeled points and uses them to learn how to label other points. To label a new point, it looks at the labeled points closest to that new point (those are its nearest neighbors), and has those neighbors vote, so whichever label the most of the neighbors have is the label for the new point (the "k" is the number of neighbors it checks).
(#)
- Certain data mining algorithms (including k-means clustering and k-nearest neighbors) require a user defined parameter k.
(#)
- At 3:40 some distance measure of this similarity between two points X, and X prime
(#)
- At 13:42 he talks about how to select a K. One technique is called cross-validation. There is also a thing called the bias variance trade-off
(#)
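- A minimal sketch of the voting procedure described above, using plain Euclidean distance on made-up labeled points; k is the number of neighbors checked.

```python
from collections import Counter
import numpy as np

def knn_predict(train_X, train_y, new_point, k=3):
    """Label a new point by majority vote among its k nearest labeled neighbors."""
    dists = np.linalg.norm(train_X - new_point, axis=1)   # distance to every labeled point
    nearest = np.argsort(dists)[:k]                       # indices of the k closest
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up labeled points: class "a" near the origin, class "b" near (5, 5).
train_X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
train_y = np.array(["a", "a", "a", "b", "b", "b"])

print(knn_predict(train_X, train_y, np.array([0.5, 0.5])))   # expected "a"
print(knn_predict(train_X, train_y, np.array([5.5, 5.5])))   # expected "b"
```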
labeled, unlabeled, labeling
- Training data is unlabeled beforehand
(#)
- Used when there is a lack of labeled training documents
(#)
latent factors
- example: Number of latent factors in a matrix factorization
(#)
learning rate
- example: Learning rate (in many models)
(#)
leave one out cross validation
- leave one out cross-validation
(#)
- leave one out: At 7:35 a hold-out set or a leave-one-out approach
(#)
- leave one out: At 13:40 you might want to do it multiple times, holding out a different group each time
(#)
linear discriminant model
- Based on Fisher's linear discriminant model, this data set became a typical test case for many statistical classification techniques in machine learning such as support vector machines.
(#)
log-log plot
- These distributions pop up everywhere in software, appearing as straight lines on log-log plots like this one.
(#)
loss function
- At 7:40 loss function is what they call it, basically measuring errors.
(#)
machine learning
- Machine learning can be summarized as learning a function (f) that maps input variables (X) to output variables (Y).
(#)
marginal density
- At 6:25 marginal density
(#)
marginal distribution
- Marginal distribution is called that because you total up the values in the row and put the total in the margin.
(#)
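- A tiny illustration with invented counts: summing each row of a two-way table gives the totals that would sit in the margin, and normalizing them gives the marginal distribution of the row variable.

```python
import numpy as np

# Invented two-way table of counts (rows = one variable, columns = the other).
table = np.array([[10,  4],
                  [ 6, 12]])

row_totals = table.sum(axis=1)        # the totals written in the margin
marginal = row_totals / table.sum()   # marginal distribution over the row variable
print(row_totals, marginal)
```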
matrix factorization
- example: Number of latent factors in a matrix factorization
(#)
maximum entropy
- Example: Maximum Entropy
(#)
min-max
- "judicious prunings of the state space can be achieved by such elementary AI techniques as alpha-beta pruning and min-max"
(#)
misclassification rate
- confusion-matrix: Overall, how often is it wrong?
(#)
multi-class problem
- You can convert any multi-class problem to a binary problem simply by grouping output classes together.
(#)
Naive Bayes Model
- Naïve Bayes Classification is an algorithm that attempts to make predictions based on previously labeled data using a probabilistic model. Features are independent of each other, meaning that one feature doesn't impact the value of another, and a set of labels is considered and assigned in advance.
(#)
- There are many types of supervised algorithms available, one of the most popular ones is the Naive Bayes model which is often a good starting point for developers since it's fairly easy to understand the underlying probabilistic model and easy to execute.
(#)
- Naive Bayes: feature detection is decided in advance
(#)
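- A minimal sketch, assuming scikit-learn's Gaussian Naive Bayes as the probabilistic model; the previously labeled data is invented.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Invented previously-labeled data: two features per example, labels assigned in advance.
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.1],
              [4.0, 5.0], [4.2, 5.3], [3.8, 4.9]])
y = np.array(["spam", "spam", "spam", "ham", "ham", "ham"])

# Each feature is treated as conditionally independent given the label.
model = GaussianNB().fit(X, y)
print(model.predict([[1.1, 2.0], [4.1, 5.1]]))
print(model.predict_proba([[1.1, 2.0]]))   # the underlying probabilistic prediction
```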
Neural Networks
- Neural networks are models inspired by how biological neural networks solve problems and can be either supervised or unsupervised. Neural networks that are supervised have a known output and are built in layers of interconnected weighted nodes, with an output layer that gives us a known output such as an image label.
(#)
nonparametric machine learning algorithms
novelty detection
- In novelty detection, you have a data set that contains only good data, and you're trying to determine whether new observations fit within the existing data set.
(#)
Null Error Rate
- Null Error Rate: This is how often you would be wrong if you always predicted the majority class. (In our example, the null error rate would be 60/165=0.36 because if you always predicted yes, you would only be wrong for the 60 "no" cases.) This can be a useful baseline metric to compare your classifier against. However, the best classifier for a particular application will sometimes have a higher error rate than the null error rate, as demonstrated by the Accuracy Paradox.
(#)
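- Several of the confusion-matrix terms in this glossary can be computed directly from the four cell counts. The cell split below is hypothetical, chosen only to be consistent with the 165-observation example quoted above (60 actual "no", so 105 actual "yes").

```python
# Hypothetical cell counts consistent with the 165-observation example above.
TP, FP = 100, 10     # predicted yes: correctly / incorrectly
TN, FN = 50, 5       # predicted no: correctly / incorrectly
total = TP + FP + TN + FN                        # 165

accuracy = (TP + TN) / total                     # overall, how often is it right?
misclassification_rate = (FP + FN) / total       # overall, how often is it wrong?
true_positive_rate = TP / (TP + FN)              # recall / sensitivity
false_positive_rate = FP / (FP + TN)             # when it's actually no, how often yes?
specificity = TN / (FP + TN)                     # when it's actually no, how often no?
precision = TP / (TP + FP)                       # when it predicts yes, how often is it correct?
prevalence = (TP + FN) / total                   # how often does yes actually occur?
null_error_rate = min(TP + FN, TN + FP) / total  # error if always predicting the majority class

print(accuracy, misclassification_rate, precision, null_error_rate)
```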
observational statistics
one-dimensional manifold
- At 17:14 one-dimensional manifold
(#)
- getting it from n-dimensional down to 2 so it is easier to visualize
(#)
online learning
- Online learning is where you label a minimal set of data, then the machine learning system identifies the top N examples it is having trouble with and you label those, then the process repeats. You end up needing to label a significantly smaller amount of data than you otherwise would
(#)
outlier detection
- In outlier detection, the data may contain outliers, which you want to identify.
(#)
overfitting
- At 5:30 overfitting is when the model is very strictly tied to the data it saw in training
(#)
- At 7:00 Cross-validation is one of the tricks to try to avoid overfitting your data
(#)
- At 0:35 an example of overfitting
(#)
- When your dataset isn't that big, doing something simpler is often both more interpretable and works just as well, given the potential for overfitting
(#)
p-hacking
- Simmons called those questionable research practices P-hacking, because researchers used them to lower a crucial measure of statistical significance known as the P-value
(#)
parametric machine learning algorithms
- Parametric: Algorithms that simplify the function to a known form are called parametric machine learning algorithms.
(#)
- Parametric: A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model
(#)
- Parametric: often also called "linear machine learning algorithms"
(#)
- An easy to understand functional form for the mapping function is a line, as is used in linear regression: Y = B0 + B1*X1 + B2*X2
(#)
parameter tuning
- can use a validation set
- Certain data mining algorithms (including k-means clustering and k-nearest neighbors) require a user defined parameter k.
(#)
- what is the "best" value of k that one should select to solve their problem?
(#)
- Validation Set: Used for parameter tuning – choose model complexity
(#)
Positive Predicted Value
- Positive Predictive Value (PPV): This is very similar to precision, except that it takes prevalence into account. In the case where the classes are perfectly balanced (meaning the prevalence is 50%), the positive predictive value (PPV) is equivalent to precision. (More details about PPV.)
(#)
precision
- confusion-matrix: When it predicts yes, how often is it correct?
(#)
- If I have a spam filter (in which the positive class is "spam" and the negative class is "not spam"), I might optimize for precision or specificity because I want to minimize false positives (cases in which non-spam is sent to the spam box).
(#)
predictive model
- Decision trees are also a predictive model. They come in two types: regression trees (which predict continuous values) and classification trees (which predict finite values), and they use a divide-and-conquer strategy that recursively separates the data to generate the tree.
(#)
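- A minimal sketch of both tree types, assuming scikit-learn; the classification data is the standard iris set and the regression data is synthetic.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: predicts one of a finite set of labels.
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(clf.predict(X[:3]))

# Regression tree: predicts a continuous value.
X_reg = np.linspace(0, 10, 100).reshape(-1, 1)
y_reg = np.sin(X_reg).ravel()
reg = DecisionTreeRegressor(max_depth=4).fit(X_reg, y_reg)
print(reg.predict([[2.5]]))
```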
prevalence
- confusion-matrix: How often does the yes condition actually occur in our sample?
(#)
probability distribution
- At 4:35 you assume that these data come from some kind of probability distribution with a density, and you want to estimate that density
(#)
probabilistic formulation
- At 8:15 it has a probabilistic formulation
(#)
probabilistic model
- Naïve Bayes Classification is an algorithm that attempts to make predictions based on previously labeled data using a probabilistic model. Features are independent of each other, meaning that one feature doesn't impact the value of another, and a set of labels is considered and assigned in advance.
(#)
pruning strategy
- pruning-strategy: At 12:40 typically what people do is to create a regression tree and then prune it, which has performance boosting characteristics.
(#)
- "judicious prunings of the state space can be achieved by such elementary AI techniques as alpha-beta pruning and min-max"
(#)
random forest
- random-forests: At 13:05 he says that rather than using a pruning strategy we are going to use a random forests approach.
(#)
- At 13:20 Bootstrap aggregation and random subspaces. I believe this is describing how to create a random forest.
(#)
random seeds
- k-means: makes no assumptions about the data meaning it uses random seeds and an iterative process that eventually converges.
(#)
real space
- At 1:10 d-dimensional real space (or k-dimensional)
(#)
regression
- At 9:30 regression is for a given X try to determine what the Y it would be.
(#)
ROC curve
- ROC Curve: This is a commonly used graph that summarizes the performance of a classifier over all possible thresholds. It is generated by plotting the True Positive Rate (y-axis) against the False Positive Rate (x-axis) as you vary the threshold for assigning observations to a given class.
(#)
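- A minimal sketch of generating the curve, assuming scikit-learn: score a held-out set with a classifier's predicted probabilities, then sweep thresholds with roc_curve to get the (false positive rate, true positive rate) pairs. The dataset and model are stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scores (predicted probabilities for the positive class), then sweep thresholds:
scores = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)   # FPR on the x-axis, TPR on the y-axis

print(list(zip(fpr, tpr))[:5])
print("AUC:", roc_auc_score(y_test, scores))
```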
sample
- At 2:05 contrasting sample to the population
(#)
semi-supervised machine learning
- At 2:30 Semi-supervised is where there are labeled and unlabeled points. You know the values for many of the points and would like to get the values for others. Think about the Netflix problem.
(#)
sensitivity
- If I have a metal detector (in which the positive class is "has metal"), I might optimize for sensitivity (also known as True Positive Rate) because I want to minimize false negatives (cases in which someone has metal and the detector doesn't detect it).
(#)
sharpshooter's fallacy
- At 13 the sharpshooter's fallacy: shoot at a bunch of things, then move the bullseye to where you shot.
(#)
slope one family of algorithms
- collaborative-filtering: When predictions are based on binary data, as opposed to ratings, the Slope One family of algorithms can be used.
(#)
specificity
- confusion-matrix: When it's actually no, how often does it predict no?
(#)
- If I have a spam filter (in which the positive class is "spam" and the negative class is "not spam"), I might optimize for precision or specificity because I want to minimize false positives (cases in which non-spam is sent to the spam box).
(#)
random subspaces
- At 13:20 Bootstrap aggregation and random subspaces. I believe this is describing how to create a random forest.
(#)
state space
- "judicious prunings of the state space can be achieved by such elementary AI techniques as alpha-beta pruning and min-max"
(#)
supervised algorithm
- The two main types of supervised machine learning are regression and classification.
(#)
- supervised: For instance a regression model is used for the prediction of continuous data such as predicting housing prices based on historical data points and trends.
(#)
- There are many types of supervised algorithms available, one of the most popular ones is the Naive Bayes model which is often a good starting point for developers since it's fairly easy to understand the underlying probabilistic model and easy to execute.
(#)
- Neural networks are models inspired by how biological neural networks solve problems and can be either supervised or unsupervised. Neural networks that are supervised have a known output and are built in layers of interconnected weighted nodes, with an output layer that gives us a known output such as an image label.
(#)
support vector machines
- Based on Fisher's linear discriminant model, this data set became a typical test case for many statistical classification techniques in machine learning such as support vector machines.
(#)
target value
- At 1:40 the Ys are the class, value, or target value.
(#)
test set
- a type of Conventional Validation
- Test Set: Assess the model after it has been trained on the training set – run a confusion matrix to find errors and compare models
(#)
training documents
- Used when there is a lack of labeled training documents
(#)
training set
true positive (TP)
- confusion-matrix: These are cases in which we predicted yes (they have the disease), and they do have the disease.
(#)
true positive rate
- confusion-matrix: When it's actually yes, how often does it predict yes?
(#)
- confusion-matrix: These are cases in which we predicted yes (they have the disease), and they do have the disease.
(#)
true negative (TN)
- confusion-matrix: We predicted no, and they don't have the disease.
(#)
two moons data
- At 3:30 a classic toy problem called two moons data. he says in this case K equals two. What is K? Also, he labels two of the points. He calls the different groupings classes.
(#)
two way table
- At 4:25 two-way tables
(#)
unbiased sample variance
- At 3:45 unbiased sample variance. Basically you want the sample variance to accurately reflect the population's variance.
(#)
underfitting
- There is actually a dual problem to overfitting, which is called underfitting. In our attempt to reduce overfitting, we might actually begin to head to the other extreme and our model can start to ignore important features of our data set
(#)
unsupervised machine learning
- Training data is unlabeled beforehand
(#)
- Used to find underlying structure of data based on statistical properties such as frequency
(#)
- Used for exploratory analysis to find unrealized patterns
(#)
- Used when there is a lack of labeled training documents
(#)
- Example: Hierarchical clustering
(#)
- Example: K-means clustering
(#)
- Example: Maximum Entropy
(#)
- Unsupervised learning algorithms segment data into groups of examples (clusters) or groups of features. The unlabeled data creates the parameter values and classification of the data. In essence, this process adds labels to the data so that it becomes supervised.
(#)
validation methods
validation set
value
- At 1:40 the Ys are the class, value, or target value.
(#)
variance
- At 2:45 the symbol for variance is sigma squared
(#)
voronoi cells
- In mathematics, a Voronoi diagram is a partitioning of a plane into regions based on distance to points in a specific subset of the plane
(#)
- This algorithm partitions n observations into k clusters, with each observation belonging to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
(#)