As we looked more closely at our transactional production data we noticed some small gaps. In one case, we had a handful of memberships created each day that lacked platform attribution.
Wedding Party supports four platforms: iOS, Android, desktop web, and mobile web. Each platform has a handful of membership creation funnels, and the funnels are not at parity across the platforms. There are also cross-cutting alternative membership creation mechanisms, like being automatically joined to a wedding when you register because you have pending invitations. Memberships are created in many different ways, and we didn't know how this subset of unattributed memberships was being created.
At the time, our per-user event stream analytics were not granular enough to support easy investigation into the provenance of each membership. We've since improved our analytics collection and reporting considerably. Today we'd have no problem finding the reason for missing platform attribution. However, we don't simply want to be able to figure out our production problems quickly once we know about them. We also want to learn about them quickly, and even prevent them from happening.
To that end we use two techniques.
First, we use system-wide invariant assertions, which we call "system assertions". At a regular interval we analyze recently modified production data to ensure that it continues to match our expectations. The membership platform attribution system assertion checks that every membership created in the past day has platform attribution; if any does not, it reports an assertion failure. System assertions allow us to catch issues soon after they are pushed to production. In his book "Code Complete", Steve McConnell mentions that:
"In debug mode, Microsoft Word contains code in the idle loop that checks the integrity of the document object every few seconds. This helps to detect data corruption quickly, and it makes for easier error diagnosis."
Over time our list of system assertions has grown quite large. Of course, if we really wanted to block memberships from being created without platform attribution we could put a NOT NULL constraint on the platform attribution column, but blocking membership creation because of missing attribution data seems extreme.
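A system assertion like the platform attribution check could be sketched roughly as follows. This is a minimal illustration, not our production code: the `memberships` table, its `platform` and `created_at` columns, and the function names are all assumptions made for the example, and SQLite stands in for the real database.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def unattributed_membership_ids(conn, now, window=timedelta(days=1)):
    """Return ids of memberships created within `window` of `now` that lack a platform."""
    # Timestamps are stored as ISO 8601 UTC strings, so string comparison orders them.
    cutoff = (now - window).isoformat()
    rows = conn.execute(
        "SELECT id FROM memberships "
        "WHERE created_at >= ? AND platform IS NULL ORDER BY id",
        (cutoff,),
    ).fetchall()
    return [row[0] for row in rows]


def check_platform_attribution(conn, now):
    """Run the system assertion; return a failure message, or None if it holds."""
    bad_ids = unattributed_membership_ids(conn, now)
    if bad_ids:
        return f"{len(bad_ids)} recent membership(s) lack platform attribution: {bad_ids}"
    return None


def demo():
    """Build a tiny hypothetical data set and run the assertion against it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE memberships (id INTEGER PRIMARY KEY, platform TEXT, created_at TEXT)"
    )
    now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
    rows = [
        (1, "ios", (now - timedelta(hours=1)).isoformat()),  # recent, attributed
        (2, None, (now - timedelta(hours=2)).isoformat()),   # recent, unattributed
        (3, None, (now - timedelta(days=3)).isoformat()),    # old, outside the window
    ]
    conn.executemany("INSERT INTO memberships VALUES (?, ?, ?)", rows)
    return check_platform_attribution(conn, now)
```

In practice something like `check_platform_attribution` would run on a schedule and send its failure message to whatever alerting channel the team watches. Note that the check only reports; unlike a database constraint, it never blocks a membership from being created.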
The other technique we use is to treat our entire test suite like the exercise phase of a generative test for system-level assertions. In generative tests, the exercise phase of a four-phase test is highly dynamic, but the verify phase has a specially crafted invariant assertion that should hold no matter what happened in the exercise phase. We have many API and system tests which directly or indirectly create memberships. We view the behavioral tests themselves as the exercise phase for the higher-level system assertion tests. In this example, after each behavioral test we verify that any memberships that were created have platform attribution.
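One way to wire this up is a shared test base class whose teardown runs the invariant check. The sketch below uses Python's `unittest` and an in-memory SQLite database purely for illustration; `create_membership`, the table layout, and the class names are hypothetical, and our actual suite may be structured differently.

```python
import sqlite3
import unittest


def create_membership(conn, user_name, platform=None):
    """Hypothetical application code exercised by the behavioral tests."""
    conn.execute(
        "INSERT INTO memberships (user_name, platform) VALUES (?, ?)",
        (user_name, platform),
    )


class SystemAssertionTestCase(unittest.TestCase):
    """Each test body is the exercise phase; tearDown is the shared verify phase."""

    def setUp(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE memberships "
            "(id INTEGER PRIMARY KEY, user_name TEXT, platform TEXT)"
        )

    def tearDown(self):
        # System assertion: whatever the test did, no membership may lack a platform.
        missing = self.conn.execute(
            "SELECT COUNT(*) FROM memberships WHERE platform IS NULL"
        ).fetchone()[0]
        self.conn.close()
        self.assertEqual(missing, 0, "memberships created without platform attribution")


class RegistrationTest(SystemAssertionTestCase):
    def test_join_from_ios(self):
        create_membership(self.conn, "alice", platform="ios")
        count = self.conn.execute("SELECT COUNT(*) FROM memberships").fetchone()[0]
        self.assertEqual(count, 1)


def run_buggy_example():
    """Run a test whose exercise phase omits attribution; return the TestResult."""

    class AutoJoinTest(SystemAssertionTestCase):
        def test_auto_join_from_pending_invitation(self):
            # The test makes no assertion about platform, yet tearDown still catches it.
            create_membership(self.conn, "bob")

    result = unittest.TestResult()
    unittest.defaultTestLoader.loadTestsFromTestCase(AutoJoinTest).run(result)
    return result
```

The point of the design is that individual tests do not need to know about the invariant. Any test that happens to create a membership through an unattributed path fails in teardown, which identifies the funnel responsible before the code ships.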
Instacart update
A technique that I picked up at Instacart which also works well in this situation is to report an error to Rollbar if the membership being created does not have attribution, then allow the code to continue. Rollbar reports contain enough contextual information to point you to the caller directly. This technique relies on pushing to production and then tracking the different pathological cases, but it can be very effective for finding obscure bugs.
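The report-and-continue pattern might look something like the sketch below. The `report` callable is a hypothetical stand-in for an error-reporting client such as Rollbar, not the Rollbar API itself, and the in-memory `store` list is an assumption that keeps the example self-contained.

```python
import traceback


def create_membership(store, user, platform, report):
    """Create the membership even if attribution is missing, but report the gap."""
    if platform is None:
        # Capture the most recent frames so the report points at the caller
        # that omitted the platform.
        stack = "".join(traceback.format_stack(limit=5))
        report(
            "membership created without platform attribution",
            {"user": user, "stack": stack},
        )
    # Execution continues either way: missing attribution never blocks creation.
    membership = {"user": user, "platform": platform}
    store.append(membership)
    return membership


def auto_join_from_invitation(store, user, report):
    """Hypothetical caller that forgets to pass a platform."""
    return create_membership(store, user, None, report)
```

Passing the reporter in as an argument is only a convenience for the example; a real application would more likely call its error-reporting client directly at the point of the check.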
"If a system is expected to provide some guarantee (for example, in a message queue, that the number of incoming messages equals the number of outgoing messages), it can constantly check itself while it is running and raise an alert if a discrepancy is found [12]."