https://news.ycombinator.com/item?id=16877395
At 5:10 they didn't start by writing a database; they started by writing a totally deterministic simulation of a database.
At 6:00 Single-threaded pseudo-concurrency
At 8:05 Flow is something between a library and a language: an extension to C++ that they wrote, providing actor-model-like concurrency that is completely single-threaded and driven by callbacks.
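A minimal sketch of single-threaded, callback-driven pseudo-concurrency (my illustration, not Flow's actual API): all "actors" are just callbacks drained by one scheduler, so their interleaving is fully deterministic.

```cpp
// Single-threaded scheduler: actors are queued callbacks, so all
// concurrency is cooperative and runs in one deterministic order.
#include <cstdio>
#include <functional>
#include <queue>

class Scheduler {
public:
    void post(std::function<void()> task) { tasks_.push(std::move(task)); }
    void run() {                      // one thread drains the queue
        while (!tasks_.empty()) {
            auto task = std::move(tasks_.front());
            tasks_.pop();
            task();                   // a task may post() more tasks
        }
    }
private:
    std::queue<std::function<void()>> tasks_;
};

int main() {
    Scheduler sched;
    // Two "actors" interleave without threads: each step re-posts the next.
    sched.post([&] { std::puts("actor A: step 1");
                     sched.post([] { std::puts("actor A: step 2"); }); });
    sched.post([&] { std::puts("actor B: step 1"); });
    sched.run();                      // deterministic order: A1, B1, A2
}
```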
At 13:30 the simulated network has points of (deterministically seeded) randomness built in, and their intensity can be increased over time to simulate network failures, etc.
At 14:15 the seed of the pseudo-random number generator becomes an input to the program to ensure determinism.
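A minimal sketch of how these last two points fit together (my illustration, not FoundationDB's code): every source of randomness in the simulated network draws from one generator, and that generator's seed is a program input, so any run can be replayed exactly.

```cpp
// The seed comes from outside (flag, env var, CI log); it is the single
// input that makes the whole simulated run reproducible.
#include <cstdio>
#include <cstdlib>
#include <random>

int main(int argc, char** argv) {
    uint64_t seed = (argc > 1) ? std::strtoull(argv[1], nullptr, 10) : 42;
    std::mt19937_64 rng(seed);        // the one RNG behind all "randomness"

    double dropProbability = 0.01;    // the dial you turn up to add chaos
    std::uniform_real_distribution<double> coin(0.0, 1.0);

    for (int packet = 0; packet < 10; ++packet) {
        if (coin(rng) < dropProbability)
            std::printf("packet %d dropped\n", packet);
        else
            std::printf("packet %d delivered\n", packet);
        dropProbability += 0.01;      // failures ramp up over time
    }
    std::printf("replay with: ./sim %llu\n", (unsigned long long)seed);
}
```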
At 15:15 in a small fraction of runs (roughly one in 100), they seed the system with exactly the same inputs to ensure they get exactly the same outputs.
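A hypothetical harness for that check, with runSimulation standing in for a full simulated run:

```cpp
// Occasionally rerun the same simulation with an identical seed and fail
// loudly if the outputs diverge, i.e. if nondeterminism has crept in.
#include <cassert>
#include <cstdint>
#include <random>

// Stand-in for a full run: returns a digest of everything the run produced.
// A real harness would hash the simulation's actual output.
uint64_t runSimulation(uint64_t seed) {
    std::mt19937_64 rng(seed);
    uint64_t digest = 0;
    for (int step = 0; step < 1000; ++step)
        digest ^= rng();              // every "event" derives from the seed
    return digest;
}

int main() {
    std::random_device entropy;
    for (int run = 0; run < 100; ++run) {
        uint64_t seed = entropy();
        uint64_t first = runSimulation(seed);
        if (run % 100 == 0)           // ~1 in 100 runs: replay and compare
            assert(runSimulation(seed) == first && "nondeterminism detected");
    }
}
```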
At 16:41 one portion of the invariant tests creates a cycle (basically a ring of records) in the database, where records point to each other in a cycle of a specific length. The test system then runs thousands of transactions per second, moving the data around. Each transaction makes changes that keep the cycle complete and the same length. At the end of the test it checks that the cycle is still a cycle, and that it's still the same length.
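My sketch of what such an invariant test could look like (the record and transaction details are assumed): a ring of N records, splice transactions that provably preserve the ring, and a final walk that checks the cycle is intact and full length.

```cpp
// Cycle invariant test: records form a ring, random transactions permute
// the ring while preserving it, and a final walk verifies one full cycle.
#include <cassert>
#include <random>
#include <vector>

int main() {
    const int N = 1000;
    std::vector<int> next(N);                          // next[i]: i points here
    for (int i = 0; i < N; ++i) next[i] = (i + 1) % N; // initial ring

    // Each "transaction" splices three nodes: it moves a's successor x out
    // to follow b, a rewiring that keeps the structure a single cycle.
    std::mt19937_64 rng(12345);                        // deterministic seed
    std::uniform_int_distribution<int> pick(0, N - 1);
    for (int txn = 0; txn < 100000; ++txn) {
        int a = pick(rng);
        int x = next[a];                               // node to relocate
        int b = next[x];
        if (b == a) continue;                          // degenerate, skip
        next[a] = b;                                   // unlink x ...
        next[x] = next[b];                             // ... and reinsert
        next[b] = x;                                   //     right after b
    }

    // Invariant check: starting anywhere, we must return to the start in
    // exactly N steps, which for a permutation means one full cycle.
    int node = 0, steps = 0;
    do { node = next[node]; ++steps; } while (node != 0 && steps <= N);
    assert(steps == N && "ring was corrupted: wrong cycle length");
}
```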
At 18:20 the swizzle test, for reasons they don't fully understand, seems to be better at finding bugs.
At 19:00 an example of a hard test: they change the configuration while also simulating network clogs and thousands of transactions per second, then check the cycle at the end.
At 20:10 they also have a test that simulates bad system-administrator actions, for example swapping the IP addresses of two machines in one step, or swapping the data files that are on two machines, that kind of thing.
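A toy version of that kind of fault (names and structure are my own): swap the identities of two simulated machines in a single step and see whether the cluster copes.

```cpp
// "Evil sysadmin" fault injection: atomically swap two simulated machines'
// addresses, as if an operator recabled them.
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

struct SimMachine {
    std::string ip;
    std::string dataDir;
};

int main() {
    std::vector<SimMachine> cluster = {
        {"10.0.0.1", "/data/node1"},
        {"10.0.0.2", "/data/node2"},
    };
    // A separate fault could swap dataDir instead, simulating swapped disks.
    std::swap(cluster[0].ip, cluster[1].ip);
    for (auto& m : cluster)
        std::printf("machine at %s serves %s\n", m.ip.c_str(), m.dataDir.c_str());
}
```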
At 22:50 they want to catch bugs before the real world does, so they need to cause and check more failures than can happen in all their real production deployments. One way they do that is to speed up time in the simulation: for example, when an exponential backoff happens, they can make many more seconds pass in the simulation than would pass in the real world, by virtually speeding up time to run as fast as the machine the simulation happens to be running on allows.
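A common way to implement this, sketched here with my own names (not their code): the simulator owns the clock and jumps it straight to the next scheduled event, so a 60-second backoff costs microseconds of wall-clock time.

```cpp
// Virtual time: "now" advances by jumping to the next event's timestamp
// instead of actually sleeping.
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct Event {
    double at;                                 // virtual seconds
    std::function<void()> fire;
    bool operator>(const Event& o) const { return at > o.at; }
};

class SimClock {
public:
    void sleepUntil(double t, std::function<void()> cb) {
        events_.push(Event{t, std::move(cb)});
    }
    void run() {
        while (!events_.empty()) {
            Event e = events_.top(); events_.pop();
            now_ = e.at;                       // time "passes" instantly
            e.fire();
        }
    }
    double now() const { return now_; }
private:
    double now_ = 0;
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> events_;
};

int main() {
    SimClock clock;
    // Exponential backoff: each retry doubles the wait, but the simulation
    // skips over the waits instead of actually sleeping.
    double delay = 1.0;
    std::function<void()> retry = [&] {
        std::printf("retry at virtual t=%.0fs\n", clock.now());
        delay *= 2;
        if (delay <= 64) clock.sleepUntil(clock.now() + delay, retry);
    };
    clock.sleepUntil(delay, retry);
    clock.run();                               // finishes in ~no real time
}
```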
At 24:20 buggify is a macro that causes code, in some small percentage of cases, to change its behavior to suss out implicit but incorrect contractual assumptions.
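A hypothetical BUGGIFY-style macro (FoundationDB's real macro surely differs): in simulation builds, a small fraction of evaluations take the "unexpected" branch, so code that silently assumes the happy path gets exercised.

```cpp
#include <cstdio>
#include <random>

#define SIMULATION 1              // pretend this is a simulation build
std::mt19937_64 g_simRng(2024);   // the simulation's single seeded RNG

#ifdef SIMULATION
// True ~1% of the time it is evaluated, only in simulation builds.
#define BUGGIFY (std::uniform_real_distribution<double>(0, 1)(g_simRng) < 0.01)
#else
#define BUGGIFY false
#endif

bool writeToDisk(const char* data) {
    (void)data;
    if (BUGGIFY) return false;    // pretend the write failed; callers that
                                  // assume writes never fail get caught here
    return true;                  // the real write would happen here
}

int main() {
    int failures = 0;
    for (int i = 0; i < 500; ++i)
        if (!writeToDisk("record")) ++failures;  // caller must handle failure
    std::printf("handled %d injected failures\n", failures);
}
```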
At 26:00 the Hurst exponent, which helps predict the likelihood of correlated events. For example, it's much more likely to rain on Wednesday if it rained on Tuesday, too. The implication for the database, borne out by real-world experience, is that things break together. If a hard drive goes bad in a rack, check all the drives in the rack, because they may have been part of a bad batch. So they simulate these types of grouped failures.
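One way to simulate grouped failures (the model and rates here are my assumptions): once any machine in a rack fails, its rack-mates fail with much higher probability than the independent baseline.

```cpp
// Correlated failure injection: a failure raises the odds that rack-mates
// fail too, mimicking bad batches and shared power/network.
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const int kRacks = 4, kPerRack = 10;
    std::mt19937_64 rng(7);
    std::uniform_real_distribution<double> coin(0, 1);

    std::vector<std::vector<bool>> dead(kRacks, std::vector<bool>(kPerRack));
    const double pIndependent = 0.02;   // baseline failure rate
    const double pCorrelated  = 0.40;   // rack-mate rate once one dies

    for (int r = 0; r < kRacks; ++r) {
        bool rackHit = false;
        for (int m = 0; m < kPerRack; ++m)
            if (coin(rng) < pIndependent) { dead[r][m] = true; rackHit = true; }
        if (rackHit)                    // failures cluster within the rack
            for (int m = 0; m < kPerRack; ++m)
                if (!dead[r][m] && coin(rng) < pCorrelated) dead[r][m] = true;
        int lost = 0;
        for (int m = 0; m < kPerRack; ++m) lost += dead[r][m];
        std::printf("rack %d: %d/%d machines down\n", r, lost, kPerRack);
    }
}
```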
At 30:00 using printf works well for debugging the distributed system because of the determinism. Since test runs always happen exactly the same way, you can put in non-spammy printf statements gated on conditions that you know will occur.
At 32:00 as a backup to their simulations, and to prove that they are simulating the right things, they have a real cluster called Sinkhole. Sinkhole has network-enabled power switches that let it turn off routers and machines. They have not yet found any bugs in their own software this way, but they have found bugs in ZooKeeper and Linux.
At 33:35 they encountered a problem in ZooKeeper, so they wrote their own Paxos implementation in Flow so it can be run in the simulation framework like the rest of their stack.
At 34:00 he wants to be able to hire a small team specifically to write bugs. If those bugs make it through the testing phase without triggering a failure, a bug is filed against the testing framework to ensure it catches them.
At 37:10 create a totally separate testing framework that is only used for releases, so that it may catch different bugs. He uses an antibiotics analogy: programmers will (inadvertently) figure out ways of working around the main simulation framework by learning which bugs it doesn't catch and writing code that way, so you need an alternative way of finding those.
We set a high bar for taking on this type of foundational work. It's only because people agree on the complexity of distributed systems, and on our inability to reason about, and write code that can recover from, their failure scenarios, that you could motivate multi-year work on a simulation framework as a prerequisite to the work itself.
Like so many other excellent systems, this one is predicated on both speed and extreme compute capability. At the end of the talk he mentions wanting even more, mainly to facilitate better man-machine symbiosis (basically allowing developers to run their subset of the overall test suite faster).
Determinism is so important that they spent years making sure they had it; its importance is hard to overstate.
Loosely related - MemSQL had an interesting in-house testing system that uses a lot of computers: talk-memsql-running-a-107-note-cluster-on-coreos