The Absurdities and Indignities of False Discovery Rates in Social Science

As a social scientist with a background in biology and neuroscience, it’s been bizarre to start seeing “False Discovery Rate” crop up in the papers I read and review. The first time I saw it, it jogged something in my mind: an ancient memory from a discussion of fMRI data in undergrad or perhaps my master’s. What was it doing here, in the land of humans taking surveys?

The FDR in brief

Anyone who has taken a stats class remembers the multiple comparisons problem. Roll the proverbial p-value dice enough times and they’re bound to come up significant. When a study or scientist runs a lot of hypothesis tests, significance will invariably appear, especially at $P < 0.05$.

For nearly a century this has bothered practicing scientists and statisticians alike. Early on, there were recommendations like ensuring the ANOVA is significant before looking at pairwise comparisons. Over time, more elaborate methods evolved, from Bonferroni to Dunn-Šidák and Holm-Bonferroni. Odds are good you learned and forgot these. But each of these seeks to control the Family-Wise Error Rate:

FWER: The long-run probability that one or more false claims of statistical significance are made, among a family of tests.

Controlling FWER is pretty restrictive. At $\alpha = 0.05$, there’s only a one-in-twenty chance that one or more claims are false; that’s a 19 out of 20 chance that none are. Think of the hit rate you’d need for thousands of claims to be so confident not one is a Type I error.
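To put rough numbers on that, here is a minimal sketch of how the chance of at least one false positive grows with the number of tests, and the per-test threshold Bonferroni would need to hold FWER at 0.05. It assumes independent tests that are all truly null, which is a simplification, and the function names are just for illustration.

```python
def fwer_uncorrected(m, alpha=0.05):
    """Chance of at least one false positive across m independent, truly null tests."""
    return 1 - (1 - alpha) ** m


def bonferroni_threshold(m, alpha=0.05):
    """Per-test threshold that keeps FWER at or below alpha."""
    return alpha / m


for m in (1, 10, 100, 1000):
    print(f"m={m:>4}  uncorrected FWER={fwer_uncorrected(m):.3f}  "
          f"Bonferroni per-test threshold={bonferroni_threshold(m):.5f}")
```

With ten uncorrected tests the chance of at least one false positive is already about 40%, and by a thousand tests the Bonferroni bar for any single test is $p < 0.00005$.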

This was the problem faced by genomicists and neuroscientists as their data started rolling in from sequencers and Magnetic Resonance Imaging. Try applying a Bonferroni correction to tests of whether there’s significant activity in each voxel of your increasingly high-resolution 3D imaging and you’ll just be reporting a paper full of nulls.

This isn’t simply about sneakily getting significant effects as we fret over so frequently; it’s about balancing a very real trade-off between false positives and false negatives. If there’s a gene that can identify a protein target for treating disease, the last thing you want to do is squash any chance of finding it under a Type I error correction.

Along comes the False Discovery Rate. In contrast to FWER:

FDR: Expected proportion of false claims of significance, among claims of significance made.

We’ve loosened the belt. It’s OK now if some things are false discoveries; we just want to know how many to expect.

This is exactly what you want if you’re in early-2000s genomics. Searching for genes that have some meaningful relationship to a phenotype of interest, you’re starting off in an exploratory phase. In 2003, Storey [1] pointed out that you can select your threshold for inclusion as significant after computing the adjustment and computing q-values. This will sound heretical to anyone who has lived in the social sciences over the past decade. Can you imagine selecting $\alpha$ for a significance test after computing $p$?

A given q-value is the smallest false discovery rate you’d have to tolerate in order to call that result, and every result with a smaller q, significant. In other words, you’re welcome to tolerate an expected 12% rate of false discoveries; perhaps that gives you a tractable number of genes to follow up on with more in-depth work. Maybe you only need one true positive to move to the next step.
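As a rough illustration of picking the tolerance after the fact, the sketch below computes Benjamini-Hochberg adjusted p-values, which are commonly reported as q-values, and then counts discoveries at two tolerances. The p-values are hypothetical, and this is plain BH rather than Storey's estimator [1], which additionally estimates the share of true nulls.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity as we go.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values, for illustration only.
pvals = [0.001, 0.008, 0.012, 0.020, 0.030, 0.040, 0.200, 0.500]
qvals = bh_adjust(pvals)

for tolerance in (0.05, 0.12):
    kept = sum(q <= tolerance for q in qvals)
    print(f"tolerating an FDR of {tolerance:.2f}: {kept} discoveries")
```

The adjustment is computed once; the tolerance is just a cut through the sorted q-values, which is why it can be chosen afterwards.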

There is a lot more nuance, debate, and method development in the FDR literature, and arguments over which methods are best. At the end of the day, however, it’s a useful dial for some scientists to calibrate their expected rate of false discoveries when triaging a massive number of significance tests.

Nuance Lost

Despite FWER and FDR being conceptually different—and practically quite different—they’re just right next to each other in the same function when someone goes to address multiple comparisons after a reviewer complains. Here’s the documentation.

p.adjust(p, method = p.adjust.methods, n = length(p))
p.adjust.methods
# c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY",
#   "fdr", "none")

Choose Holm? You’ve got FWER. BH? FDR. It’s not the code maintainer’s job to ensure you know what you’re doing, but this is the guidance you’d get if you looked at the documentation.

“The false discovery rate is a less stringent condition than the family-wise error rate, so these methods are more powerful than the others.”

Sounds good! More powerful. Nice. Right?

The thing is you’ve effectively gone from building your paper around a story under the hope that every claim is solid (FWER) to tolerating some proportion of claims not being solid (FDR). By analogy to building a house, it’s going from

FWER: “probably none of the wood I’m using to build my home has termites”

to

FDR: “Some of the wood has termites but tolerating that will let me build 40 house prototypes”.

Neither of these is an inherently wrong way to do science, but they’re pretty fucking different. We lose this nuance when we focus on whether it’s more or less conservative or powerful or whatnot.
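To make the difference concrete, here is a small sketch that runs Holm (FWER) and Benjamini-Hochberg (FDR) on the same set of p-values and counts what each lets through. The p-values are hypothetical and the helper names are mine; the logic follows the standard step-down and step-up definitions.

```python
def holm_rejections(pvals, alpha=0.05):
    """Holm step-down: controls the family-wise error rate."""
    m = len(pvals)
    count = 0
    for k, p in enumerate(sorted(pvals), start=1):
        if p > alpha / (m - k + 1):
            break  # stop at the first p-value that misses its threshold
        count = k
    return count


def bh_rejections(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: controls the false discovery rate."""
    m = len(pvals)
    count = 0
    for k, p in enumerate(sorted(pvals), start=1):
        if p <= k * alpha / m:
            count = k  # keep the largest rank that clears its threshold
    return count


# Hypothetical p-values, for illustration only.
pvals = [0.001, 0.008, 0.012, 0.020, 0.030, 0.040, 0.200, 0.500]
n_holm = holm_rejections(pvals)
n_bh = bh_rejections(pvals)
print(f"Holm rejects {n_holm}, BH rejects {n_bh}")
```

On these numbers Holm keeps one claim and BH keeps five. The extra four are what “more powerful” buys, and some expected fraction of them being false is the price.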

Why is FDR now in the social sciences?

Gelman, Hill, and Yajima [2] pointed out 14 years ago that FDR is a weird fit for the social sciences:

Methods that control for the FDR may make particular sense in fields like genetics where one would expect to see a number of real effects amidst a vast quantity of zero effects such as when examining the effect of a treatment on differential gene expression (Grant et al., 2005). They may be less useful in social science applications when we are less likely to be testing thousands of hypotheses at a time and when there are less likely to be effects that are truly zero (or at least the distinction between zero and not-zero may be more blurry).

A lot has changed in 14 years, and some social scientists do indeed test larger numbers of hypotheses now, for two reasons. The first is the “MTurkification” of social science [3]. It was spooling up fourteen years ago, but since then it has become extremely easy to pull data from Prolific, MTurk, and the like. With participants in randomly assigned conditions blasting through questions on Qualtrics, each question could be a hypothesis to test. Social scientists in the academy could now test dozens of hypotheses at a time.

These same scientists, however, were being sternly warned about the risks of false positives as folks fret over the replication crisis. Reviewers could now put any paper on blast for multiple comparisons, requesting some form of correction. P-values near .05 were circled, and social scientists got into heated debates over 0.005 vs. 0.05 [4]. Turning to that code above, it makes sense that folks would use the powerful multiple comparisons correction. We can see this in citation data, where Benjamini and Hochberg’s FDR correction starts taking off in the mid-2010s in the social sciences.

Citations in Social Science to BH's classic paper on FDR. Data from OpenAlex

Elsewhere, social scientists were being paid much more money to test many more hypotheses, while eating free snacks and hitting the in-house meditation studio as the experiments ran. Although the tech industry has now collectively lost its shit, back then they were busy figuring out which color button will make you buy that Instant Pot ring or forget to unsubscribe from MoviePass. With hundreds of millions, now billions, of users, some platforms could run thousands of A/B tests in parallel, automatically making minor adjustments.

So what?

Facebook’s widget-shuffling scientists have a pretty good case for using FDR in their A/B tests: they were actually testing thousands of hypotheses. Most of these are likely to have precisely zero effect on the outcome of interest. As with the neuroscientists or genomicists, FDR is a sensible correction. What of us academics, making mid-five to low-six figures with no stock options?

Are you Yoking?

I imagine one of the reasons FWER wasn’t more popular in the social sciences is that the familiar corrections can take a huge bite out of your statistical power. Bonferroni requires $p<\frac{\alpha}{m}$ where $m$ is the number of tests you’re running. If you’ve got a $d=0.28$ (publishable…) effect size you’re testing with $n=100$, you’ve got an acceptable 80% power in a one-sample z-test; good chance of catching it. With ten tests, that power drops down to something like 50%. To recover your 80% power, you’re gonna need to recruit another 70 participants, a 70% bump in your cost for that particular hypothesis. Are the other nine worth it, or are they just questions you’re throwing in because you’re fielding a survey?
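For anyone who wants to check that arithmetic, here is a minimal sketch using only the Python standard library. It assumes a two-sided one-sample z-test with known unit variance, which is the setup that reproduces the 80% figure; the function names are mine.

```python
from math import ceil
from statistics import NormalDist

Z = NormalDist()  # standard normal


def power_two_sided(d, n, alpha):
    """Power of a two-sided one-sample z-test for standardized effect size d."""
    shift = d * n ** 0.5
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(shift - z_crit) + Z.cdf(-shift - z_crit)


def n_for_power(d, alpha, power=0.80):
    """Approximate n for the target power (ignores the negligible far tail)."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    z_power = Z.inv_cdf(power)
    return ceil(((z_crit + z_power) / d) ** 2)


d, n, alpha, m = 0.28, 100, 0.05, 10
print(f"uncorrected power:          {power_two_sided(d, n, alpha):.2f}")
print(f"Bonferroni power (m=10):    {power_two_sided(d, n, alpha / m):.2f}")
print(f"n for 80% under Bonferroni: {n_for_power(d, alpha / m)}")
```

This prints roughly 0.80, 0.50, and 170, matching the numbers above.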

Ok, so let’s say they instead decide to use FDR and have heard cool things about BKY, whatever that is. We can write a little simulation in Python. We consider two extremes. In one case, the other 9 hypotheses are all sanity checks that are almost sure to pan out. In the other, they’re absurd ideas colleagues forced you to include, none of which are going to pan out.

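Simulation code (Python)
import numpy as np
from scipy.stats import norm
from statsmodels.stats.multitest import multipletests

def bky_sim(d=0.28, m=9, p_true=0.95, n=100, sims=10000):
    """Share of simulations in which the focal test (index 0) is rejected under BKY."""
    out = np.zeros(sims)
    for i in range(sims):
        # Background effect sizes: half-normal with scale 0.5 ...
        d_bg = np.abs(np.random.normal(0, 0.5, m))
        # ... each set to a true null with probability 1 - p_true
        d_bg[np.random.rand(m) >= p_true] = 0
        # z-statistics (focal test first): effect * sqrt(n) plus standard normal noise
        z = np.r_[d, d_bg] * np.sqrt(n) + np.random.normal(size=m + 1)
        # One-sided (upper-tail) p-values
        p = 1 - norm.cdf(z)
        # Two-stage Benjamini-Krieger-Yekutieli FDR correction at alpha = 0.05
        rej, *_ = multipletests(p, alpha=0.05, method="fdr_tsbky")
        out[i] = rej[0]
    return out.mean()

print(bky_sim(m=9, p_true=0.95))  # focal test alongside mostly real effects
print(bky_sim(m=9, p_true=0))     # focal test alongside nine true nulls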

Running this, we find that when they’re testing their hypothesis alongside sanity checks they reject the null for it 93% of the time! Their power has startlingly increased when correcting for multiple comparisons! Now what if they’ve included it alongside 9 absurd ideas, or things that are anticipated to be null by design? In this case, they’re only rejecting the null for their focal hypothesis 53% of the time. Their power has dropped by about as much as it would under Bonferroni.

A 40-percentage-point swing in the statistical power of the test we care about, based solely on what it happened to be yoked with in an FDR correction.

This is, of course, known and expected behavior for BKY FDR. It even makes sense for voxels of neurons. If you hit someone with a stimulus that causes a ton of brain-wide activity, it makes sense to take that into account when evaluating the plausibility that a given voxel is significantly impacted by the stimulus.

But it’s weird for us social scientists. Often we test very different hypotheses; or at least more different from one another than activity in adjacent regions of cortex. Perhaps one hypothesis is about SES and vaccine hesitancy and the other SES and LLM use.

Do you come from a big family?

It’s also worth noting that we social scientists rarely test thousands of hypotheses. Where I see FDR, it’s in the low to mid double-digits; sometimes even used for single digits. The thing is, as you increase the number of hypotheses the distribution of p-values gets smoother and spooky things occur.

Let’s consider a scenario where we’re testing our same hypothesis above ($d=0.28$, $n=100$) but we’re doing so alongside either a small ($m=9$) or huge ($m=9000$) number of other hypotheses. Here’s where things get really weird. If you have a large number of null tests thrown into the family, your power plummets to near 5%. That’s basically $\alpha$, as though you’d never run the experiment at all.

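What’s going on behind the scenes is that the procedure compares the $k$-th smallest $p$-value to roughly $k\alpha/m$, so with $m=9000$ every threshold starts out tiny. Because everything else is null, the first pass rejects next to nothing and the whole procedure stays quite skeptical. With a lot of null tests, odds are good that dozens of them beat your focal hypothesis’s $p$ by chance, and they typically don’t clear the bar either. Your focal test effectively has to survive something close to a Bonferroni threshold, and it gets penalized into null obscurity.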
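The small-family versus huge-family comparison can be sketched with a quick standard-library simulation. This is a simplified stand-in and not the code behind the numbers above: it uses plain Benjamini-Hochberg rather than two-stage BKY, two-sided p-values, and companions that are all true nulls (so their p-values are drawn as uniform). The function name and defaults are hypothetical.

```python
import random
from statistics import NormalDist

Z = NormalDist()


def focal_power_bh(m_null, d=0.28, n=100, alpha=0.05, sims=300, seed=1):
    """Share of simulations where the focal test survives BH among m_null true nulls."""
    rng = random.Random(seed)
    m = m_null + 1
    hits = 0
    for _ in range(sims):
        # Focal test: two-sided p-value from a z-statistic centered on d * sqrt(n).
        z = rng.gauss(d * n ** 0.5, 1)
        p_focal = 2 * (1 - Z.cdf(abs(z)))
        # True nulls have uniformly distributed p-values.
        pvals = sorted([p_focal] + [rng.random() for _ in range(m_null)])
        # BH step-up: find the largest rank k with p_(k) <= k * alpha / m.
        cutoff = -1.0  # stays negative when nothing is rejected
        for k in range(m, 0, -1):
            if pvals[k - 1] <= k * alpha / m:
                cutoff = pvals[k - 1]
                break
        hits += p_focal <= cutoff
    return hits / sims


small = focal_power_bh(9)
huge = focal_power_bh(9000)
print(f"power alongside 9 nulls:    {small:.2f}")
print(f"power alongside 9000 nulls: {huge:.2f}")
```

Under these assumptions the focal test should be rejected roughly half the time in the small family and only a few percent of the time in the huge one, though the exact figures depend on the seed and the number of simulations.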

“It’s always trade-offs!”

These are extreme examples, but they highlight how the same exact statistical test can have a chance of rejecting the null that ranges from 5%, the nominal $\alpha$, to nearly 100% depending entirely on what else is in the FDR family. Because we’ve defined the outcome as being a true effect, we can start to think about what that means in terms of trade-offs.

At one extreme, our 5% chance of a false positive, set by our FDR threshold of $q<.05$, is about on par with the $\approx 5\%$ chance of a false negative. We’re equally valuing getting it wrong in either direction. At the other extreme, however, we’re willing to tolerate 19 false negatives for every false positive claim we make.

Should we?

Real Life

I’m sure there are some that would happily exchange 19 false negatives for every false positive, after chucking those parameters into Ioannidis’s 2005 paper and finding out the significance-biased literature will be mostly true (and boring). As social scientists we’ve been conditioned to fear the false positive, lest we become the poster child for a failed replication on a claim we boldly made and turned into a book deal or speaking tour.

In many cases this is perhaps reasonable. A lot of disciplines in social sciences (looking at you, social psych) reward and incentivize discovery of some phenomenon that you riff on through tenure. Ideally it gets a name and that name is associated with you. You need that to be real, otherwise your whole career is teetering on $\alpha=.05$. This isn’t a dunk on social psychology; basic and exploratory science can make a pretty good argument for prioritizing positive findings over null ones. Find things, verify them, then let the applied folks use them.

It’s out in real life, away from the abstraction of a discovery, that false negatives start to throw their weight around. Perhaps you’re trying to determine which of a city’s many different social services improves grades for K-12 students when they have access to it, and you’ve got a lot of hypotheses to test. Does the team running an important, and effective, service wind up being evaluated with 5% or 95% power? Or somewhere in between? When a null shows up, how do we know if it came from the effect being studied or the effects being studied alongside it?

A worked example

One of the defining questions of the 2010s, and into the 2020s, was the degree to which social media is impacting elections; both in the US and globally. There certainly was a palpable shift in the alignment and general vibes of many nations’ leaders. An academic collaboration with Meta set out to measure this in 2020, in part through a de-activation experiment. These experiments can’t really measure platform effects on large-scale societal processes, but let us pretend for a moment.

One of the papers [5] is titled “The effects of Facebook and Instagram on the 2020 election: A deactivation experiment.” Elections are decided by voting. As such, one finding (alongside turnout, perhaps) of the very many seems to be the answer to the titular question:

The point estimate for the effect of Facebook deactivation on Trump vote is a reduction of 0.026 units (P = 0.015, Q = 0.076, 95% CI bounds = −0.046, −0.005). This effect falls just short of our preregistered significance threshold of Q < 0.05.

In other words, they seem to have found around a 2.6% shift in votes away from Trump among the folks who took time off of Facebook. Significant before FDR, non-significant afterwards. The resolution to this decade-defining question winds up spending eternity in frequentist purgatory. It’s unlikely we’ll see a collaboration like this again, and it would occur in a very different world. And the strangest damn thing is that, corrected for multiple comparisons alongside different hypotheses, the answer could have been unambiguous on either side.

Of course this is a real-life question and it matters. If privately owned companies have designed technology that alters democratic outcomes then it means they can sway or determine democracies. Two and a half percent, in the right places, moves an election in the US. Of course the true effect could be considerably larger, as this is just from taking a wee break. Imagine what a decade of time on platform could have done beforehand to candidate preference and voting behavior.

Caught between the replication crisis and the real world

None of this is to suggest we should never use FDR or FWER. Nor is it to opine on whether the Facebook election collaboration erred in doing so. Both are just statistical procedures, appropriate when they help us answer the questions we’re asking. Yet I think they get stuck in the throats of social scientists caught between the replication crisis and the real world. On one hand, false positives are to be avoided at all costs. On the other, false negatives can have very real consequences. So how do we manage?

Well, the best way out of this is to follow Gelman, Hill, and Yajima [2] and not worry about multiple comparisons. Use multi-level models, get comfy with Bayesian inference, and start describing effect sizes rather than dichotomizing claims. When you’re in the real world, you can use the posteriors from your analyses to weigh decisions under varying assumptions about the costs of getting it right and wrong.

Absent that, I think it’s just worth asking whether you care about the chance you miss a real effect. If you’re hunting around in the lab for the new nudge that will make your career, have money to burn, and worry about getting replicated, then spool up that FDR. Alternatively, if you’re truly exploring a large landscape of near-zero effects, triage with FDR, then follow up like it was intended.

Alternatively, if false negatives could have a real life impact it’s worth asking whether the reduction in false positives is worth the ambiguity about the change in false negatives. It’s one thing to be able to constrain power across plausible effect sizes, but there’s no good way to do that if your test is yoked in with a bunch of other very different hypotheses. If you find yourself in this regime, it’s worth thinking carefully about whether FDR is worth it and, if so, who goes into the family.

Coda

This is my first blog post back after a long hiatus from blogging and a few months of being sick and in the hospital. I’m feeling really jazzed to think about science, but invariably will have typos and things that are wrong somewhere in here. Let me know, and I’ll update as time allows. As for LLMs, they did a bit of spell-checking, math-checking, and code-reorganizing but I don’t find them useful for drafting prose. As I’ve written about peer review, I think the slogs we go through are often where science happens.

References

  1. Storey, J. D. (2003). The positive false discovery rate: A Bayesian interpretation and the q-value. The Annals of Statistics, 31(6), 2013–2035.
  2. Gelman, A., Hill, J., & Yajima, M. (2012). Why we (usually) don’t have to worry about multiple comparisons. Journal of Research on Educational Effectiveness, 5(2), 189–211.
  3. Anderson, C. A., Allen, J. J., Plante, C., Quigley-McBride, A., Lovett, A., & Rokkum, J. N. (2019). The MTurkification of social and personality psychology. Personality and Social Psychology Bulletin, 45(6), 842–850.
  4. Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., … & Johnson, V. E. (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6–10.
  5. Allcott, H., Gentzkow, M., Mason, W., Wilkins, A., Barberá, P., Brown, T., … & Tucker, J. A. (2024). The effects of Facebook and Instagram on the 2020 election: A deactivation experiment. Proceedings of the National Academy of Sciences, 121(21), e2321584121.



