Why lawyers might be better than you at statistics

Let’s cover the basics briefly so we can get to the howling-in-frustration bit. In statistics, a population is the collection of all items that you are interested in (for the purpose of making a decision rigorously).

Should you even be attempting statistics?

You can’t answer that until you’re clear on what your population is (and it’s up to you to define it). The whole reason you’d want to take a statistical approach, as opposed to a fact-based one, is that you’re dealing with uncertainty.

A statistical approach only makes sense when there’s a mismatch between the information you want and the information you have.

In other words, your available data (sample) doesn’t cover your whole population. If it did, you’d be dealing with facts, and facts are better than uncertainty. (If you’re thinking this last point is a proclamation by Captain Obvious, perhaps you haven’t had the pleasure of grading college exam papers.) Facts mean you don’t need statistical expertise: simply state them and get on with life. No finicky p-values or credible intervals required.
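If it helps to see that distinction spelled out, here is a minimal sketch in Python. Everything in it is made up for illustration: the user IDs, the ratings, and the `summarize` helper are hypothetical. The only point is that the same average is a fact when your data covers the whole population and merely an estimate when it doesn’t.

```python
def summarize(data, population_ids):
    """Average the values we have for members of the population.

    Returns ("fact", mean) when every member of the population is in the
    data, and ("estimate", mean) when some members are missing.
    """
    # Only look at members of the population; ignore anyone else in the data.
    values = [data[i] for i in population_ids if i in data]
    if not values:
        raise ValueError("no data for any member of the population")
    mean = sum(values) / len(values)
    kind = "fact" if len(values) == len(population_ids) else "estimate"
    return kind, mean


# Hypothetical population of four users and their ratings.
population_ids = {"u1", "u2", "u3", "u4"}
full_data = {"u1": 3, "u2": 5, "u3": 4, "u4": 4}   # covers everyone
partial_data = {"u1": 3, "u3": 4}                  # covers only a sample

print(summarize(full_data, population_ids))     # a fact: just state it
print(summarize(partial_data, population_ids))  # an estimate: uncertainty applies
```

Notice that you can’t even run the check without `population_ids`, which is to say, without having defined your population first.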

Cringeworthy populations

Okay, hopefully you’re convinced that the concept of population is pretty important to the whole practice of statistics.

In the Icarus-like leap from sample to population, expect a big splat if you don’t know where you’re aiming.

Now let me show you a classic way decision-makers keep getting it wrong.

Imagine that you’re a lawyer reviewing a contract for me and my friends. We’ve told you we want to give our product’s users a \$50 voucher for chocolate. When you look inside the contract to see how the people eligible for a voucher are described, it simply says “all users.” No more and no less.

Anything wrong here?

You don’t have to be a legal expert to see that there’s a big problem! We haven’t defined “all users.”

What does “all users” even mean?

If we let this contract see the light of day before we’ve really thought about what we mean by “all users”, we’ll find ourselves flat-footed as all kinds of users climb out of the woodwork demanding chocolate. What about the people who don’t sign up but use the product on a friend’s account? Do they count? What about the ones who use the product for one second and drop it… just to score some chocolate? What about the people who claim they used it on a friend’s account in the past without signing up? Do we give them chocolate too? What about the ones who claim they’ll be future users (but want the chocolate now)? We’ll be bankrupt from chocolate vouchers before we know it.

What a nightmare! Imagine if whoever approved the contract said, “Oops, I didn’t even think of that.” Unacceptable. My lawyer friends assure me that the task here is to think of everything and be sure that what you write is precisely what you mean. No loopholes. Who gets chocolate and who doesn’t should be crystal clear from the description.

To avoid messing up, rely on your inner lawyer. Or, better yet, an outer one.

I hope you can see how important it is to use detailed legal descriptions with zero room for ambiguity. Detail is just as important in statistics.

Icarus, don’t get hurt!

You opted for statistics because (1) your decision is important — otherwise you’d prefer data-mining for a faster path to inspiration — and (2) the data you have doesn’t cover all the entities you’re interested in, so you’re trying to make an Icarus-like leap from your sample to your population. If you can’t even specify where you’re leaping, expect a big splat! Any amount of vagueness makes your entire endeavor melt into nonsense. Pretty bad when we’re dealing with an important decision.

If you leave any wiggle room in the definition, you’ve set yourself up to fail.

Despite all this obviousness, I keep seeing decision-makers write nothing but “all users” when framing their decisions. That’s just plain sloppy. In a real project, the population description involves plenty of fine print. Alas, decision-makers don’t always realize that thinking deeply about this is their job.
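To make the fine print concrete, here is a rough sketch of what an unambiguous population definition might look like for the chocolate voucher example, written as a Python rule that gives a yes or no answer for every person. The specific criteria (own account, a signup cutoff date, a minimum amount of usage) and every name in the code are my assumptions for illustration. Your decision-maker’s real definition will differ; what matters is that it leaves no case undecided.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Person:
    name: str
    signup_date: Optional[date]  # None means they never created their own account
    active_seconds: int          # total usage time on their own account


# Hypothetical fine print; a real project would have more of it.
CUTOFF = date(2024, 1, 1)    # must have signed up on or before this date
MIN_ACTIVE_SECONDS = 600     # must have used the product for at least 10 minutes


def in_population(p: Person) -> bool:
    """Return True only if the person meets every written criterion."""
    if p.signup_date is None:
        return False  # borrowed a friend's account: not one of "all users"
    if p.signup_date > CUTOFF:
        return False  # future or too-recent users don't count
    return p.active_seconds >= MIN_ACTIVE_SECONDS  # one-second visitors don't count


people = [
    Person("regular", date(2023, 6, 1), 7200),
    Person("borrower", None, 5000),
    Person("one_second", date(2023, 6, 1), 1),
    Person("future", date(2025, 1, 1), 0),
]

for p in people:
    print(p.name, in_population(p))
```

Each of the troublemakers from the contract story gets a definite answer here, and that is the standard to hold a population description to before any statistics begin.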

Advice for those who work with decision-makers

If you see a vague population description, set up a picket line until the decision-maker does their homework. The project isn’t ripe for fancy calculations yet.

This goes beyond population definition. There are a lot of tasks the decision-maker has to complete before your math can be useful. Spending all weekend rigorously chasing down some half-baked question a decision-maker drops on your desk is a well-known rookie mistake, but I see so many junior data scientists falling for it repeatedly.

All the statistical effort you’re tempted to put in makes no sense until the decision-maker’s homework is done.