The generalizability crisis goes pandemic.

I’ve learned way more about rigorous methodology and good research practices from social psychologists than I have from neuroscientists. And that’s coming from someone who’s spent their whole graduate career in a neurobiology department. Why? What are social psychologists doing that we are not?

Social psychologists are willing to admit when they have problems. First came the replication crisis, fueled by questionable (even nefarious) research practices. Textbook effects either disappeared or shriveled away to nothing once the field rolled out some large scale replication projects. Goodbye ego depletion. Goodbye priming. The upshot of finding out effects didn’t replicate is that the field made a big collaborative correction to these textbook effects (or lack thereof). But rewriting the textbooks was only made possible by acknowledging the issues facing the field.

Not all replications are equal. Sometimes it’s not that interesting, or that insightful, to replicate every minuscule detail of a study. Is there any point in using the same sample size and sample demographics as a previous study? Especially when the original sample was underpowered and more homogenized than the milk in your fridge. We expect certain effects to conceptually replicate. The hypotheses we test are typically not overly specific (e.g., “we expect a Cohen’s d of exactly 0.64 in a sample of white male students recruited from the business school at ‘insert generic university name’”). We test less restrictive, more conceptual hypotheses. Conceptual replications, though, are prone to arbitrary interpretations of the hypothesised effect. Both kinds are useful, and both lend more verisimilitude to a theory/hypothesis.
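To put “underpowered” in numbers, here is a back-of-envelope sample size calculation, a sketch using the standard normal-approximation formula for a two-sample comparison (the d = 0.64 is just the example value above, not a real effect):

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sample test
    detecting a standardized effect size d."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = .05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

print(n_per_group(0.64))  # 39 per group for the example d above
print(n_per_group(0.30))  # 175 per group for a smaller, more typical effect
```

The point of the sketch: halving the effect size roughly quadruples the sample you need, which is why studies powered to chase a flashy published d so often come up empty when the true effect is smaller.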

Can’t catch a break.

No one likes having salt and lemon juice rubbed into their fresh wounds. Unfortunately for social psychology, they seem to get mixed up in crisis after crisis. As the replication crisis goes cold, the generalizability crisis has just been put on the stove to heat up. Yarkoni’s Generalizability Crisis paper got a metric shit tonne of attention among psychologists on Twitter.

It also did the rounds on the big psychology-related podcasts - Very Bad Wizards and Two Psychologists Four Beers. Despite a quick read-through of the paper, I got absolutely no podcast invitations. I have to admit some of the points made in the paper went over my head. But I did take away some (though not all; this is a meaty paper) of its points:

  1. The effects we show should generalize across stimuli. For example, a Stroop effect should hold regardless of which particular colour words or fonts are used.
  2. Experimental designs use a sample of stimuli out of a population of possible stimuli. And barely anyone gave a shit about this before this paper.
  3. Just like we treat subjects/participants as a sample of a large population, we should do the same with stimuli. Meaning, we report the N of stimuli, and treat stimuli as random effects, just like we do with participants.
  4. Effect sizes shrivel and their intervals grow when test stimuli are treated as random effects.
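Points 2–4 can be illustrated with a quick simulation. This is a minimal sketch with made-up numbers: there is a true null condition effect, but each condition uses its own small sample of stimuli, and stimuli genuinely vary. A naive trial-level test that ignores the stimulus sampling produces false positives far above the nominal 5%; even the crude fix of analysing per-stimulus means (the full fix would be a crossed mixed-effects model) brings it back toward nominal:

```python
import random
from statistics import NormalDist, mean, variance

def z_test(a, b):
    """Two-sample z statistic (normal approximation)."""
    se2 = variance(a) / len(a) + variance(b) / len(b)
    return abs(mean(a) - mean(b)) / se2 ** 0.5

def false_positive_rates(n_sims=1000, n_stim=5, n_subj=20,
                         stim_sd=1.0, noise_sd=1.0, seed=1):
    """Simulate a null condition effect where each condition has its own
    n_stim stimuli, each carrying a random offset (stim_sd), shown to
    n_subj subjects. Returns (trial-level FP rate, stimulus-level FP rate)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)  # ~1.96
    naive_fp = stim_fp = 0
    for _ in range(n_sims):
        trials, stim_means = [], []
        for _cond in range(2):
            offsets = [rng.gauss(0, stim_sd) for _ in range(n_stim)]
            per_stim = [[mu + rng.gauss(0, noise_sd) for _ in range(n_subj)]
                        for mu in offsets]
            trials.append([t for stim in per_stim for t in stim])
            stim_means.append([mean(stim) for stim in per_stim])
        naive_fp += z_test(trials[0], trials[1]) > z_crit      # stimuli ignored
        stim_fp += z_test(stim_means[0], stim_means[1]) > z_crit
    return naive_fp / n_sims, stim_fp / n_sims

naive, by_stim = false_positive_rates()
print(f"trial-level (stimuli ignored): {naive:.2f}")   # wildly inflated
print(f"per-stimulus means:            {by_stim:.2f}")  # close to nominal 0.05
```

The intuition matches point 4: once stimulus variability is allowed to count against the effect, the evidence for it shrinks, because the effective sample size is the number of stimuli, not the number of trials.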

Agree or disagree with the points made in the paper, what I like is that people are willing to have a productive discussion about these kinds of issues in science.

Papers like this and the discussions that followed are important metascientific advances for the field, if for no other reason than opening people up to rethinking some of their axiomatic assumptions when approaching research. This is good for social psychology! And there is absolutely no reason these issues don’t generalize to neuroscience. We want to know which effects are actually true and which are not. Our incentives, in that regard, are completely identical, but our approaches and discussions couldn’t be more dissimilar.

Ringing in the neuropocalypse?

So far, neuroscience hasn’t plunged into crisis. Is it because we’re better scientists than psychologists? No.

Although, some might think so.

Is it because our effects are more “real” because they are “in the brain” (whatever on earth this kind of vacuous statement means)? Absolutely not.

There’s no reason to believe that questionable research practices and p-hacking are less frequent in neuroscience than in psychology. There’s also no reason to believe experimental designs are more robust and reproducible in neuroscience than in psychology. We might use cooler methods like lasers, drugs, and viruses, but we are prone to the exact same errors as other fields. We’re just as good at running underpowered studies as anyone else.

Part of the answer seems simple: You can’t find things if you don’t look.

Neuroscience, especially animal research, is expensive. No one is paying out big grants to run large scale replication projects. Whether our effects do or do not replicate is an empirical question. The incentive structures within neuroscience and psychology are very similar (i.e., positive publication bias, lack of interest in publishing replications, disinterest in publishing the negative replication of an important effect). These are exactly the same incentive structures that have fueled questionable research practices and a lot of false-positive publications. I don’t see why we would be in a different boat to social psychology.

I think it is more common than we are willing to admit (in public and in private) that highly-cited effects do not replicate. Walking around poster sessions, I’ve heard it mentioned that specific effects only replicate for certain labs. But we never officially hear about the non-replications, just background whispers at conferences. How many resources should a lab piss away chasing the mirage of replicating X lab’s effect? Especially when no one who’s tried has gotten it to work.

That being said, there are many effects and findings that replicate conceptually across labs and even across animal models. For instance, we know sex hormones are required for the full display of sexual behaviours across many animals. The conceptual replication, that removing sex hormones abolishes sexual behaviour and replacing them with exogenous analogues restores it, is well established across animals and labs. Direct replications, using the same animal model and the same or similar doses of exogenous hormones, and showing effects of the same magnitude, have also been done. Both types of replication are vital to constructing useful theories about the role of sex hormones in sexual behaviour.

Should we declare a crisis?

I think we should believe way fewer of the effects we have published are real and generalizable. The first reason is that almost all animal research has focused on male animals. When we leave females out, we restrict our sampling to one half of the animal population. Yet we have the audacity to claim that our effects are generalizable. Come back to me when everything replicates in female animals. Not every effect should replicate in females; there are often good reasons to expect sex differences, given the different behaviours and physiology of males and females. But we blind ourselves to generalizable truths by omitting females from research.

Thankfully, things are changing; #SABV (sex as a biological variable) is trending. People care. The flagship journals are publishing papers highlighting this exact issue. On the other hand, some people don’t care, and will never care. But, thankfully funding agencies are implementing change from the top down.

The next reason is the animals themselves. I am a rat researcher. There are lots of different strains of rats, each with their own little quirks and peculiarities. I only use one strain of rat in my research. Would my effects generalize from this sample of rats to the population of rats? I’ll be honest and say: I don’t know. It’s an empirical question, and I certainly hope so. But we know that certain effects don’t generalize across strains.

If you think not being able to generalize across strains is bad, wait until you hear about breeders! If you’re not familiar with rodent neuroscience, you either buy or breed your rats. Some companies make a pretty penny from breeding rats and selling them to researchers at a premium. The last year and a half of my research life has been hell. Disappointment, frustration, just hell. I used to buy my rats from a breeding company in Quebec (I won’t name names…). I study female sexual behaviour, which means I need a group of male studs that I can call upon to have sex with my female rats. The breeding company shut down the line of rats I used, so I started buying from the same company but at a different location. As it turned out, around 50% of the new male rats refused to have sex.


So, then I swapped suppliers… and guess what? I got some stud males.

I got sold a bunch of rats that would not have sex. This tells me we are breeding some very strange traits into our animals. And if that’s happening for something as rudimentary as sex, what’s happening to higher cognitive abilities? This personal experience, combined with published studies, makes me worried about the generalizability of effects within a strain across different breeders. Adding to the above points, it looks like effects can fail to replicate even in genetically similar animals depending on geographical location.

The generalizability of behavioural models

Going full circle, back to the Generalizability Crisis paper. The point about the sampling of stimuli made me think about current animal models. Yarkoni reminds us that, within a test, we are sampling a set of stimuli from a large population of possible stimuli. Expanding this to animal models, we often end up using one test to probe some cognitive, emotional, or behavioural capacity. Let’s use spatial memory as an example (sorry, spatial memory folks, nothing personal). You design an experiment that tests the effect of lesions to a specific brain area on spatial memory. How do you test spatial memory? You use the Morris water maze. Okay, well, you’ve tested a sample (n = 1) of the possible population of spatial memory tests. Would the effect generalize to other tests of spatial memory? And would the same rat behave consistently across the whole population of tests? These are empirical questions that we do not have compelling answers to.

For now, we might get more realistic estimates of effects by using multiple tests and treating the tests themselves as random effects.

I’ve seen all of the above as issues in what I do for a while now, but it was only once I viewed them through the lens of a generalizability crisis that I saw the commonality between them. Maybe it’s because of the Twitter bubble of people I choose to follow, but I haven’t seen the same open discussion of these issues among neuroscientists as I have among social psychologists. Unless we address them, we’ll continue with business as usual, hoping no one tries to replicate our effects and sees that the emperor has no clothes.
