"From that I extracted two test sets of corrections. The first is for development, meaning I get to look at it while I'm developing the program. The second is a final test set, meaning I'm not allowed to look at it, nor change my program after evaluating on it. This practice of having two sets is good hygiene; it keeps me from fooling myself into thinking I'm doing better than I am by tuning the program to one specific set of tests." (#)
At 37:10: create a totally separate testing framework that is used only for releases, so that it can catch different bugs. He uses the antibiotics analogy. He says that programmers will (inadvertently) learn to work around the main simulation framework by figuring out which bugs it does not catch and writing code that way. An alternative way of finding those bugs is needed. (#)
Test Set: Assess the model after it has been fit on the training set – compute a confusion matrix to find errors and compare models (#)
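The confusion-matrix step could be sketched as follows, using only the standard library. The labels and the two sets of predictions (`model_a`, `model_b`) are made-up illustrations, and `confusion_matrix` and `accuracy` are hypothetical helper names, not any particular library's API.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return Counter(zip(actual, predicted))

def accuracy(matrix):
    """Fraction of test examples whose predicted label equals the actual one."""
    total = sum(matrix.values())
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / total if total else 0.0

# Hypothetical test-set labels and predictions from two models.
actual  = ["spam", "spam", "ham", "ham", "ham"]
model_a = ["spam", "ham",  "ham", "ham", "spam"]
model_b = ["spam", "spam", "ham", "ham", "spam"]

cm_a = confusion_matrix(actual, model_a)
cm_b = confusion_matrix(actual, model_b)

# Off-diagonal cells (actual != predicted) show which kinds of errors each model makes.
errors_a = {pair: n for pair, n in cm_a.items() if pair[0] != pair[1]}
errors_b = {pair: n for pair, n in cm_b.items() if pair[0] != pair[1]}
```

Comparing `errors_a` with `errors_b` shows not just which model is more accurate on the test set but which specific confusions account for the difference.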