Welcome everyone to the first edition of my newsletter! There are so many newsletters out there and you’ve chosen to read mine. Will it be filled with interesting insights or tedious navel-gazing? Let’s find out together.
Lately I’ve been ruminating about how much value is really created by analyzing data. Analyzing data is what I have become accustomed to doing to be valuable, and now that I’m thinking about what’s next for me, I am forced to consider whether it is something I should continue emphasizing.
I have often argued that one of the most valuable types of data at a company is from A/B tests. Here is how a typical A/B test goes: Someone has an idea for changing the product and they build a prototype of that idea. They set up a test and randomly try out this idea for some fraction of the users for a few days or weeks. Then they estimate whether their idea provided benefits according to some criteria, and decide whether the idea is good enough to keep.
We can roughly break that up into three phases:
1. Having the idea and building it.
2. Running the test and collecting data.
3. Analyzing the data and making a decision about what to do.
Phase 2 requires little effort in many cases,1 so the question I am grappling with is: what is the relative value that is generated in Phase 1 vs. Phase 3? The answer is (obviously) that it depends. At the places I have worked, there are tons of product managers and engineers constantly working in Phase 1 and creating a large queue of ideas to test. And I’ve worked on mature products where those ideas are relatively small and incremental changes. This is the perfect regime for adding value in Phase 3, where we need to choose the best ideas from this large set of options, and often the effects are subtle enough that we need to be careful in deciding whether they are improvements.
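To make "being careful" in Phase 3 a bit more concrete, here is a minimal sketch of one common way to compare two variants: a pooled two-proportion z-test. This is only an illustration, not a description of how any particular company analyzes tests. The function name and all of the user counts are hypothetical numbers I made up.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (difference in rates, z statistic, two-sided p-value) for B vs. A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis that A and B are identical.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


# Hypothetical incremental change: 10.0% vs. 10.3% conversion, 10,000 users per arm.
subtle_diff, subtle_z, subtle_p = two_proportion_z_test(1000, 10_000, 1030, 10_000)

# Hypothetical dramatic change: 10% vs. 20% conversion, same sample sizes.
big_diff, big_z, big_p = two_proportion_z_test(1000, 10_000, 2000, 10_000)

print(f"subtle change: diff={subtle_diff:.3f}, z={subtle_z:.2f}, p={subtle_p:.3f}")
print(f"large change:  diff={big_diff:.3f}, z={big_z:.2f}, p={big_p:.3g}")
```

With these made-up numbers, the small change is not distinguishable from noise at this sample size, while the large one is unmistakable. Careful analysis matters most in the first regime.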
In August 2020, Garrett van Ryzin decided to leave Lyft and I took over the research team he created called Marketplace Labs. I met with him regularly in his last few days to understand his strategy for the team. I asked him why he hadn’t spent any energy on statistical problems at all and he had a great answer: “when a project we work on succeeds, we don’t need statistics to know it.”
This sort of sent me into an existential tailspin (in a good way). I was focused on adding a small amount of value by analyzing data a little better and giving other people tools to do that. Sure, incrementally better decisions add up to a lot of value over time,2 but maybe we’re just stuck in a local optimum and getting many small changes right will never get us to where we want to go. A modification of the famous Henry Ford quote kind of works here: you can’t A/B test your way from selling horses to selling cars. And a corollary: if you’re testing a horse against a car, you definitely don’t need an A/B test.3
So here I am, someone who’s (IMO) good at analyzing data; how could I possibly help out in Phase 1? I had been building tools to say “B is better than A” and now I need to build tools that say “we’ve been doing A… maybe we should be doing B?” You probably definitely need a far more complicated model to generate solutions to problems than to adjudicate between some set of proposed solutions.
One great point I’ve seen Josh Wills make repeatedly is that data doesn’t really matter until things go wrong. The story is that when things are going well at a company, the data are pretty boring and mostly look like accounting for all the good stuff that’s happening. But when things start to go poorly, data becomes the only way we have of trying to explain why.
You can think of debugging a broken product as roughly inverting the order of the A/B test. We start by observing a negative effect (by comparing last week to this week), then we go and look for potential explanations. Andrew Gelman and Guido Imbens nicely frame this task as “causes of effects” and contrast it with “effects of causes” (what we usually study with tools like A/B testing). “Causes of effects” is far more challenging because we need to generate hypotheses about what could have caused the problem and test each one until we find a probable explanation.
The Phase 1 problem is actually even harder than that: we have to imagine a cause that could plausibly generate the effect that we want. I tend to think of this as a high-dimensional search problem, very similar to what folks doing drug discovery are trying to solve – there are so many possible chemicals that we can synthesize, which ones are likely to be good treatments? But with developing drugs, there’s a physical model that can be developed and refined over time that allows you to search more efficiently. Models can tell us what to test or provide early feedback on what’s likely to be dangerous or promising.
Are there other domains where models can generate new ideas for things to test in the real world? I don’t think this is a particularly unrealistic idea given how far generative models have come over time. We see technologies like GPT-3 and DALL-E effectively create completely novel artifacts out of thin air that meet certain requirements we have, and in ways we don’t always expect in advance. In my opinion, the reason these applications of data seem so “intelligent” is that they are not merely measurement problems: they are constructive, and they both match our intuition in some ways and violate it in others. Close enough to something we may have thought of on our own, but far enough away that it surprises us.
So now imagine we could train a model that generates new product ideas to test. Could it ever do better than your team of product managers? What data would it need? How could we teach it what we know? How would it even explain the idea to us?4
I have an information-theoretic view of why generative models have so much potential. At best, a successfully analyzed A/B test may reveal one bit of information, when we go from 50/50 on which variant is better to being sure that one of them is. In a root cause analysis, we may consider dozens of potential explanations for a problem we observe, so if we do successfully debug it we’re creating a few more bits than an A/B test. Either way, this is just an astonishingly small amount of information to create from so much effort, and it would be overshadowed greatly by the output of any reasonable generative model. It’s like the contrast between giving someone a route between two places versus telling them “hotter or colder” when they blindly guess which direction to walk in. Of course more information isn’t always valuable, but it’s a good proxy for value.
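The bit counts above can be checked with a quick back-of-the-envelope calculation. This is a rough sketch that assumes every option is equally likely before we look at the data, which is a simplification. The specific counts (24 candidate explanations, a million candidate designs) are hypothetical numbers chosen only for illustration.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def uniform(n):
    """Uniform distribution over n equally likely options."""
    return [1 / n] * n


# A/B test: we start 50/50 on which variant is better and end up certain.
ab_test_bits = entropy_bits(uniform(2))

# Root cause analysis: assume two dozen equally plausible explanations.
root_cause_bits = entropy_bits(uniform(24))

# Hypothetical generative setting: picking one design out of a million candidates.
generative_bits = entropy_bits(uniform(1_000_000))

print(f"A/B test:        {ab_test_bits:.2f} bits")
print(f"root cause:      {root_cause_bits:.2f} bits")
print(f"one in a million: {generative_bits:.2f} bits")
```

Because information grows with the logarithm of the number of options, going from 2 options to 24 only adds a few bits. Getting substantially more than that requires a much larger space of possibilities, which is the space an idea-generating model would be searching.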
I don’t want to rehash any details of the Twitter acquisition, but I do think there’s at least one optimistic interpretation of why Elon Musk thinks the business is worth far more than what he’s paying. Twitter has used all the standard A/B testing approaches to gradually improve the product over time and the net result is a stagnant number of users and revenue. Maybe a less incremental change to how Twitter is designed could cause a change in its trajectory?5
I have no doubt that Twitter has hired thousands of very smart people and they have tested many ideas, but each change has been tightly constrained by design decisions that were made over a decade ago. I remember thinking the 280-character limit was disruptive, but that seems quaintly incremental now. It does not seem far-fetched to me that a more disruptive change could create a better product for users or more revenue for advertisers. I don’t know what that change is, but maybe Elon Musk has a model in his head that has generated a few hypotheses to test? And he wouldn’t have to align hundreds of people to test something like that, like in a normal tech company. His master plan for Tesla is one of the best examples of a (successful!) non-linear path to a dramatic shift in an important and previously stagnant product category.
It seems to me that every person has a collective theory of the world they are employing when they reason about what they should be doing. That model can identify opportunities, notice problems, and generate promising things to try. It can imagine counterfactual scenarios about what will happen if we tried one of those things. But it’s informal, it’s stuck in people’s heads, and we can’t automate anything about it. It is going to become the bottleneck in value creation.
Making it about me again and wrapping up, I guess I am currently wondering what I can do to make that model better. Back in February I gave a talk about experimentation and causal inference and I included the following slide. In my proposed model, at the end of each experiment we update our theory based on the result and it helps us generate ideas for new features. But revisiting this now, experiment results alone provide a very limited amount of information to the “new feature box” and there are many other sources of information flowing into that theory circle. There’s an interesting and complex “human-in-the-loop” process waiting to be discovered here, and after spending so much of my work life in Phase 3 I find it pretty exciting.
If you’ve read this far, thank you! It’s been fun writing this and I hope to find time and motivation to do more. Please let me know if you have feedback, questions, or criticism.
Shout out to my research friends who do manual data collection. ↩
Maybe cars were a mistake, though. ↩
Of course it could equally lead to the service’s demise, umm, thanks for subscribing to my newsletter. ↩
Question mark left intentionally. ↩