Gold-master metrics testing

A story about why metrics are hard

We have a "news" page in our iPhone app. You can swipe there from the main timeline (basically from the page that shows all the photos).

One day Ajay asked, "How many people are checking out the news screen? I need to know whether we should spend more effort developing it."

We looked, and a huge number of people were going to the news page. Bingo! Let's spend a lot of effort to make it even more awesome! Thank goodness we had these metrics to point us in the right direction!

So, we were about to make big product decisions based on the data. Then, we just happened to notice that every time a user went to the timeline, it also tracked them going to the "news" page. Every... single... time.

WTF?!?!

It turns out that when we load the main timeline, all three screens (the timeline itself, the sidebar navigation, and the news page next to it) are pre-loaded so you can swipe between them quickly, and pre-loading the news screen fired its tracking call. So it looked like people were looking at the news screen a lot, but they weren't.

Recognize when you're facing a hard problem so you can give it the respect it deserves

That's one story. Something similar has happened over and over again. Our confidence in our metrics eroded and we became skeptical of every single stat in our system.

Tracking is exceptionally hard to get right because it's not user-facing in any way! Even with ostensibly tough-to-verify things like delayed bulk emails, we occasionally get feedback, like when my wife Robin said, "Hey honey, I just got a really weird email from you guys today." (Uh oh!)

A bad data point isn't like that. It hides among hundreds of millions of other data points.

It wants to trick you. Beware!

In his excellent book The Effective Engineer, Edmond Lau says, "The only thing worse than having no data is the illusion of having the right data." He adds that "it's all too common for engineering to underinvest in data integrity" and that "engineers learn quickly that writing unit tests can help ensure code correctness; in contrast, the learning curve for carefully validating data correctness tends to be much higher."

So: it's critical to get the right data, we tend not to invest heavily enough in data integrity, and we tend to underestimate the difficulty of defending against bad data.

In other words, this is a hard problem, and we engineers often don't even understand it well enough to recognize it as a hard problem.

Let's change our mindset so we no longer think it's ok to simply throw in a few extra tracking calls and feel confident that our data will be good now and into the future.

How important is it to get this stuff right?

It's a 10. Ajay says so, and he's the boss. He's also right. We need the right data.

How can we deal with this problem?!?

Run through a single acceptance (Selenium) test and log everything that gets tracked (one way to capture this is sketched after these steps).

Independently (and beforehand), write some notes about what you think should be tracked.

Look at what was actually tracked. If anything you thought should have been tracked wasn't, that's a problem! Add it to the tracking.

If things you didn't expect to be tracked were tracked (like the news screen in the story above), figure out why.

Run the acceptance test again after making the fixes, and once its tracking output looks reasonable... lock it in, so that's the only acceptable tracking output for that test. Basically, make a golden-master expectation of the metrics that single acceptance test generates.
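For the first step, "log the stuff that's tracked" doesn't need anything fancy: point the tracking layer at an in-memory recorder while the acceptance test drives the UI. Here's a minimal sketch in Ruby; the RecordingTracker class and its track signature are made-up stand-ins, not our actual analytics API:

class RecordingTracker
  attr_reader :events

  def initialize
    @events = []
  end

  # Same shape as a production tracking call (assumed), except it only appends
  # to an in-memory list instead of sending anything over the wire.
  def track(domain:, category:, action:, label: nil)
    @events << { domain: domain, category: category, action: action, label: label }
  end
end

tracker = RecordingTracker.new
# ... swap this recorder in for the real tracker, then run the Selenium steps ...
tracker.events.each { |event| puts event.inspect }

Dumping the recorded events is all the manual review in the middle steps needs; the lock-in comes next.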

But how do you "lock it in"? Well, the format has to be very easy to read so that you can look at it and quickly reason about it. Here is an example from one of our Pinterest campaigns:

target = "
  dwb | server_action | new_user_campaign_click        | pinaa1
  dwb | engagement    | screen                         | landing
  dwb | ui_action     | click                          | auth_from_create
  dwb | mixpanel      | click auth_from_create         | NULL
  wb  | server_action | new_user_campaign_registration | pinaa1
  dwb | app_action    | event_joined                   | NULL
  dwb | engagement    | screen                         | event_settings_your_info
  "

The format is arbitrary... I wrote it to be an easy-to-read subset of the actual metrics data, which has many more, less essential attributes. Yes, those other attributes need to exist, but no, they don't need to be locked in. I've seen this kind of specialized formatting referred to as testing-concept-miniformats-for-test-input-and-output.
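To actually enforce the golden master, one approach (a sketch, not our exact harness; the event fields and the sample data below are made up) is to reduce each captured event to that same pipe-separated subset, normalize away the alignment padding, and then demand exact equality with the target string:

# Reduce a raw event (which carries many more attributes) to the locked-in subset.
def to_mini_format(event)
  [event[:domain], event[:category], event[:action], event[:label] || "NULL"].join(" | ")
end

# Parse the human-readable golden master, ignoring blank lines and alignment padding.
def parse_gold_master(text)
  text.lines.map(&:strip).reject(&:empty?).map { |line| line.split("|").map(&:strip).join(" | ") }
end

gold_master = "
  dwb | engagement | screen | landing
  dwb | ui_action  | click  | auth_from_create
"

events = [
  { domain: "dwb", category: "engagement", action: "screen", label: "landing" },
  { domain: "dwb", category: "ui_action",  action: "click",  label: "auth_from_create" },
]

actual   = events.map { |event| to_mini_format(event) }
expected = parse_gold_master(gold_master)

unless actual == expected
  raise "Tracking regression!\nexpected:\n#{expected.join("\n")}\n\ngot:\n#{actual.join("\n")}"
end

Because the comparison is exact, anything extra, anything missing, and anything duplicated all fail the test and show up in the diff.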

Cool! Is it perfect?

Nope. We have so much asynchrony that things will often (completely acceptably) come in out of order. This format doesn't account for that, and I'm still thinking about the problem.
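One possible direction (just a sketch, nothing we've settled on): compare the two sides as sorted multisets instead of exact sequences, so acceptable reordering doesn't fail the test, but missing or duplicated lines still do:

# Order-insensitive comparison: counts still matter, sequence doesn't.
def same_metrics_ignoring_order?(actual_lines, expected_lines)
  actual_lines.sort == expected_lines.sort
end

puts same_metrics_ignoring_order?(
  ["dwb | engagement | screen | landing", "dwb | ui_action | click | auth_from_create"],
  ["dwb | ui_action | click | auth_from_create", "dwb | engagement | screen | landing"]
)
# => true

The trade-off is that you can no longer assert ordering in the places where ordering actually matters.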

Also, let's say that for 10% of users you pop up an alert dialog that is tracked. That's fine... we don't care one way or the other... we should be able to mark that as an optional metric... cool if it's in the metrics log, cool if it isn't.
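One possible convention for that (again a sketch; the "?" prefix is something I'm making up here, not an existing feature): mark the line as optional in the golden master and let the comparison ignore it when it's absent:

# Optional metrics: a line prefixed with "?" in the expected list may appear or
# not; every other expected line must appear. (Ordering is ignored here for brevity.)
def matches_with_optional?(actual_lines, expected_lines)
  required = expected_lines.reject { |line| line.start_with?("?") }
  optional = expected_lines.select { |line| line.start_with?("?") }
                           .map   { |line| line.delete_prefix("?").strip }

  leftover = actual_lines.dup
  required.each do |line|
    index = leftover.index(line)
    return false unless index # a required metric never showed up
    leftover.delete_at(index)
  end

  # Whatever is left over must be explained by the optional lines.
  leftover.all? { |line| optional.delete(line) }
end

So matches_with_optional?(["signup"], ["signup", "? alert_shown"]) and matches_with_optional?(["signup", "alert_shown"], ["signup", "? alert_shown"]) both pass, but an unexpected line still fails.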

What we're looking to avoid is metrics showing up that shouldn't, metrics that should be there not showing up, and metrics that should show up only once showing up repeatedly.
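When a test does fail, the error should point straight at whichever of those three cases happened. A small sketch (assuming Ruby 2.7+ for Array#tally) that tallies both sides and reports unexpected lines, missing lines, and wrong counts:

def tracking_diff(actual_lines, expected_lines)
  got_counts  = actual_lines.tally
  want_counts = expected_lines.tally
  problems    = []

  (want_counts.keys | got_counts.keys).each do |line|
    want = want_counts.fetch(line, 0)
    got  = got_counts.fetch(line, 0)
    next if want == got

    if got.zero?
      problems << "missing:    #{line}"
    elsif want.zero?
      problems << "unexpected: #{line}"
    else
      problems << "expected #{want}x, got #{got}x: #{line}"
    end
  end

  problems # an empty array means the tracking matches the golden master
end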

Conclusion

Tracking is hard; it deserves your respect. As such, it needs to be verified with the same rigor as other aspects of our software. Verifying by hand works the first time, but to avoid regressions (which you will never notice unless you re-check the tracking by hand over and over again) we should write automated tests for our tracking metrics.