Why Sampling Your Mobile Analytics Data is Bad For Growth
The costs associated with high-quality analytics services often force companies into trade-offs about which events or users to track, and sampled data is the most common fate that people resign themselves to.
Unfortunately, sampling can be very harmful, leading to a slew of problems including inaccurate test results and a loss of data integrity. To make critical product and business decisions based on your data, you need to be able to trust it.
In this paper, we’ll review the case against sampling, including:
• Acceptable standard error & confidence intervals
• The importance of sample size when running A/B tests
• Missing out on the long tail
• Needing the full set of user data
Types of Sampling in Analytics
THE PROBLEM: SAMPLING AT THE DATA COLLECTION LEVEL
Before we get into the implications of having all the data available, it’s important to understand what sampling really means in behavioral analytics. In the general sense, statistical sampling is the process of measuring a metric on a subset of users in order to estimate the metric across the entire user base. The selection of sampled users can take place either at collection time, i.e. choosing which events to capture at all, or at query time, i.e. choosing which events to analyze (while collecting all of them). The latter is a fine strategy for enabling interactive analysis of large datasets: you don’t need 100% accuracy when doing exploratory work, as long as you can choose to get the full results when it matters.
Unfortunately, the way analytics services are priced leads to the former, which means completely losing a sizable chunk of data; this is the problem we want to address.
Event Level Sampling vs. User Level Sampling
So what does sampling typically look like in our space? It’s quite straightforward: choose a subset of users and collect only events performed by those users. A noteworthy but sometimes misunderstood aspect of sampling event data is that it must be performed at the user level (choose whether or not to collect all of a user’s events) rather than at the event level (choose whether or not to collect each individual event). Sampling at the user level preserves the calculation of metrics like retention and funnel conversion, which get skewed in non-intuitive ways when you selectively drop events that a user performs. Even getting this part right, however, doesn’t guarantee that you can draw accurate conclusions from sampled data.
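As a rough illustration of the difference, the Python sketch below samples a synthetic event log both ways. Everything in it is an assumption made for the example: the two-step signup-to-purchase funnel, the 40% true conversion rate, the 50% sampling rate, and the hash-based keep_user helper, which is one common way to make a stable per-user decision and not a description of any particular analytics service.

```python
import hashlib
import random


def keep_user(user_id: str, rate: float) -> bool:
    """User level decision: hash the ID so a user is always fully in or fully out."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return bucket < rate


def funnel_conversion(events) -> float:
    """Share of users with a 'signup' event who also have a 'purchase' event."""
    signed_up = {user for user, name in events if name == "signup"}
    purchased = {user for user, name in events if name == "purchase"}
    return len(signed_up & purchased) / len(signed_up) if signed_up else 0.0


# Synthetic log (hypothetical data): every user signs up, about 40% go on to purchase.
rng = random.Random(0)
events = []
for i in range(20_000):
    uid = f"user-{i}"
    events.append((uid, "signup"))
    if rng.random() < 0.40:
        events.append((uid, "purchase"))

rate = 0.5
# User level: keep or drop all of a user's events together.
user_level = [(u, n) for u, n in events if keep_user(u, rate)]
# Event level: flip an independent coin for every single event.
event_level = [(u, n) for u, n in events if rng.random() < rate]

true_conv = funnel_conversion(events)
user_conv = funnel_conversion(user_level)
event_conv = funnel_conversion(event_level)

print(f"true conversion:        {true_conv:.3f}")
print(f"user level sampling:    {user_conv:.3f}")
print(f"event level sampling:   {event_conv:.3f}")
```

Under these assumptions, user level sampling should land close to the true conversion rate, while event level sampling should report roughly half of it, because a purchase is only counted when both the signup event and the purchase event happen to survive the coin flips.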
The Case Against Sampled Data
It’s common for power users of analytics to try to get around sampling restrictions by any means possible. Once you hit a volume threshold on Google Analytics, your query results are sampled “in order to reduce latency” and infrastructure costs. People have written an abundance of guides on how to avoid this because they’ve experienced real-world situations in which sampled query results led to inaccurate metrics and, consequently, bad decisions.
Standard Error and Confidence Intervals
Figure: event timelines from 5 different users under event level sampling and user level sampling. The green events are chosen as part of the sample, while the gray events are not collected. In event level sampling, individual events are sampled for collection. This can result in skewed data and incorrect calculation of important metrics like funnels and retention. User level sampling, in which you collect all events from a select group of users, is the correct way to sample if sampling is required.
The fundamental trade-off of sampling is, of course, the error in the estimated metric, whether it be retention or a user’s lifetime value. The most relevant aspect of the standard error is that it shrinks with the square root of the sample size.
To make this concrete, let’s say the true day 7 retention for a cohort of users is 30% and we try to estimate that with a sample. With 10,000 users, the standard error is 0.46%, which leads to a potentially acceptable 95% confidence interval about 1.8 percentage points wide. Drop this to 1,000 users, however, and the standard error grows to 1.45%, which leads to a very poor 95% confidence interval 5.68 percentage points wide. You might be thinking that both examples use an unrealistically small number of users; let’s say your app has 1,000,000 users, and sampling to 100,000 users gives a confidence interval only 0.57 percentage points wide, so why can’t you just do that?
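The figures above are consistent with the usual normal-approximation formula for the standard error of a proportion, sqrt(p * (1 - p) / n), with the 95% interval spanning 1.96 standard errors on each side. The sketch below reproduces them under that assumption; the function names are ours, chosen for the example.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n sampled users."""
    return math.sqrt(p * (1 - p) / n)


def ci_width_95(p: float, n: int) -> float:
    """Full width of the 95% confidence interval (normal approximation)."""
    return 2 * 1.96 * standard_error(p, n)


retention = 0.30  # true day 7 retention from the example in the text
for n in (1_000, 10_000, 100_000):
    se = standard_error(retention, n)
    width = ci_width_95(retention, n)
    print(f"n={n:>7,}  standard error={se:.2%}  95% CI width={width:.2%}")
```

Because the error depends on the square root of n, a tenfold cut in the sample widens the interval by a factor of about 3.2, not 10, which is why moderate sampling can look harmless at first glance.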
The Importance of Sample Size When Running A/B Tests
One scenario in which that breaks down is running A/B tests. When experimenting with a significant product or design change, teams will often want to run an A/B test in which a small percentage of users are shown the variant, and then observe how conversion and retention are affected relative to control.
If your data has already been sampled down to 100,000 users, and you show the change to 5% of them, the variant group shrinks to just 5,000 users, and your confidence in the results of the experiment decreases accordingly.
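To put a rough number on that loss of confidence, the sketch below compares the 95% margin of error on the difference between variant and control with and without collection-time sampling. It assumes the 30% baseline retention from the earlier example, a 5% variant group, and the normal approximation for two independent proportions; the helper names are hypothetical.

```python
import math


def split(total_users: int, variant_share: float = 0.05):
    """Split collected users into (variant, control) group sizes."""
    n_variant = round(total_users * variant_share)
    return n_variant, total_users - n_variant


def margin_of_error_diff(p: float, n_variant: int, n_control: int, z: float = 1.96) -> float:
    """95% margin of error on the variant-minus-control difference,
    assuming both groups share a baseline rate p."""
    se = math.sqrt(p * (1 - p) / n_variant + p * (1 - p) / n_control)
    return z * se


baseline = 0.30  # assumed baseline retention, taken from the earlier example
sampled = margin_of_error_diff(baseline, *split(100_000))   # data sampled to 100,000 users
full = margin_of_error_diff(baseline, *split(1_000_000))    # all 1,000,000 users collected

print(f"sampled to 100,000 users: +/- {sampled:.2%}")
print(f"full 1,000,000 users:     +/- {full:.2%}")
```

With these inputs the sampled data can only resolve differences of a bit more than one percentage point, while the full dataset resolves differences of well under half a point, which matters when each test is expected to move a metric by only a few points.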
Fast-moving product and growth teams often run tens of A/B tests at once that might each increase conversion by only a few percentage points (see High Tempo Testing, a methodology introduced by Sean Ellis). If you’ve sampled your user base at collection time too much, you won’t be able to draw meaningful conclusions from these tests (or worse, you’ll come to the wrong conclusion!). Fortunately, our partner Optimizely has put together some great resources to make sure you don’t make this mistake, but you’ll need to collect enough data in order to leverage them.
Missing Out on the Long Tail
You’ve probably heard about the Long Tail phenomenon, which suggests that the true demand curve for users (under certain conditions) has significant area in the “tail,” e.g. the lower 80% of “items.” Whether it’s e-commerce, entertainment, or content, there are many examples of online products and services that realize considerable value outside of the most popular items; it’s often considered one of the reasons that they beat out brick and mortar shops that can’t compete on those niche items. In order to effectively understand the long tail behavior of users, it’s necessary to have enough data. An individual item might only cater to one in a thousand users, so only by capturing all of their events can you hope to perform meaningful analysis on them. For example, it would be completely hopeless to run an A/B test and observe its impact on the long tail if you had already sampled out a significant portion of your data.
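A back-of-the-envelope calculation shows how quickly the long tail disappears. The sketch below reuses figures that appear earlier in this paper (1,000,000 users, sampling down to 100,000, a 5% variant group) together with the one-in-a-thousand item; combining them this way is our assumption for illustration.

```python
def expected_users(total_users: int, collection_rate: float,
                   item_share: float, variant_share: float) -> float:
    """Expected number of collected users who both care about a niche item
    and fall into the A/B test variant."""
    return total_users * collection_rate * item_share * variant_share


total = 1_000_000
item_share = 1 / 1000   # the item caters to one in a thousand users
variant_share = 0.05    # 5% of users see the variant

with_all_data = expected_users(total, 1.0, item_share, variant_share)
with_sampling = expected_users(total, 0.10, item_share, variant_share)

print(f"all data collected: about {with_all_data:.0f} users")
print(f"sampled to 10%:     about {with_sampling:.0f} users")
```

Collecting everything leaves about 50 variant users for that item, already a thin basis for analysis; sampling to 10% at collection time leaves about 5, which is far too few to say anything meaningful.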
The long tail in search keywords is a great example of how it can be efficient to focus your
attention on not just the most common items, but also those outside of the top ten or twenty
percent. It turns out that, while the popular keywords by definition generate the most traffic,
they’re also very competitive and thus ineffi-