Why the Joy of Cooking is going after a Cornell researcher

America’s most celebrated cookbook brand is calling out one of America’s most cited food scientists — the latest chapter in a bigger scandal that has been rocking social science.

The tiff erupted on Tuesday morning, when the Joy of Cooking on Twitter called Cornell’s Brian Wansink a “bad researcher,” and claimed he had unfairly implicated the cookbook brand in the obesity epidemic with a flawed scientific study in 2009.

In a fascinating Twitter thread that goes far deeper into the intricacies of scientific research than you’d probably expect from the Joy of Cooking — discussing unrepresentative sample sizes and cherry picking — the brand alleges serious misconduct by Wansink. “The rote repetition of his work needs to stop,” it said.

So what prompted this tirade? Over the past two years, a cadre of skeptical researchers and journalists, including BuzzFeed’s Stephanie Lee, have taken a close look at Wansink’s food psychology research unit, the Food and Brand Lab at Cornell, and have shown that unsavory data manipulation ran rampant there. The cookbook brand, inspired by Lee’s recent feature, says the 2009 study about them is yet another example of Wansink’s dubious scientific practices.

But this story is a lot bigger than any cookbook or any single researcher. It’s important because it helps shine a light on persistent problems in science that have existed in labs across the world, problems that science reformers are increasingly calling for action on. Here’s what you need to know.

Six of Wansink’s studies have been retracted and the findings in dozens more have been called into question

In 2009, Wansink and a co-author published a study that went viral, suggesting that the Joy of Cooking cookbook (and others like it) was contributing to America’s growing waistline. It found that recipes in more recent editions of the tome, which has sold more than 18 million copies since 1936, contain more calories and larger serving sizes than those in the Joy of Cooking’s earliest editions.

The study focused on 18 classic recipes that have appeared in the Joy of Cooking since 1936 and found that their average calorie density had increased by 35 percent per serving over the years. But in its tweetstorm, the cookbook maker accused Wansink of cherry-picking recipes, making up arbitrary portion sizes, and smearing their name under the guise of science.

Still, when the study appeared, it got a lot of media coverage and helped Wansink reinforce his larger research agenda, which focuses on how the decisions we make about what we eat and how we live are shaped by environmental cues. See his famous “bottomless bowls” study, concluding that people will mindlessly guzzle down soup as long as their bowls are automatically refilled, or the “bad popcorn” study, which demonstrated that we’ll gobble up stale and unpalatable food when it’s presented to us in huge quantities.

The critical inquiry into his work started in 2016 when Wansink published a blog post in which he inadvertently admitted to encouraging his graduate students to engage in questionable research practices. Since then, scientists have been combing through his body of work and looking for errors, inconsistencies, and general fishiness. And they’ve uncovered dozens of head-scratchers. For instance, in one study Wansink misidentified the ages of participants, calling children ages 8 to 11 toddlers.

In sum, the collective efforts have led to a whole dossier of troublesome findings in Wansink’s work.

To date, six of his papers have been retracted from journals. But this debacle has drawn a lot of attention because Wansink was highly cited and his studies were catnip for reporters (including us here at Vox). Wansink also collected government grants, helped shape the marketing practices at food companies, and worked with the White House to influence food policy in this country.

For now, Wansink is still defending his research, according to BuzzFeed, and Cornell is investigating his work.

Wansink allegedly engaged in “p-hacking on steroids”

Among the biggest problems in science that the Wansink debacle exemplifies is the publish-or-perish mentality.

To be more competitive for grants, scientists have to publish their research in respected scientific journals. For their work to be accepted by these journals, they need positive (i.e., statistically significant) results.

That puts pressure on labs like Wansink’s to do what’s known as p-hacking. The “p” stands for p-values, a measure of statistical significance. Typically, researchers hope their results yield a p-value of less than .05, the cutoff below which they can call their results significant.

P-values are a bit complicated to explain. But basically: They’re a tool to help researchers understand how rare their results would be if only chance were at work. If the results are super-rare, scientists can feel more confident their hypothesis is correct.

Here’s the thing: P-values below .05 aren’t that hard to find if you sort the data differently or perform a huge number of analyses. In flipping coins, you’d think it’s rare to get 10 heads in a row. You might start to suspect the coin is weighted to favor heads, and that the result is statistically significant.

But what if you just got 10 heads in a row by chance (it can happen) and then suddenly decided you were done flipping coins? If you kept going, you’d stop believing the coin is weighted.
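To put rough numbers on the coin example, here is a minimal sketch in Python. It computes a one-sided p-value for a fair coin by exact counting. The function name `p_value_at_least` and the follow-up scenario (90 more flips, 45 of them heads) are illustrative assumptions, not figures from any study discussed here.

```python
from math import comb

def p_value_at_least(heads: int, flips: int) -> float:
    """Chance that a fair coin shows `heads` or more heads in `flips` flips."""
    favorable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favorable / 2 ** flips

# Ten heads in ten flips: 1 chance in 1,024, well under the .05 cutoff.
print(p_value_at_least(10, 10))

# Hypothetical continuation: 90 more flips, 45 of them heads (55 of 100 overall).
print(p_value_at_least(55, 100))
```

The first number is tiny, which is why the streak looks suspicious. In the assumed continuation, the second comes out well above .05, so the case for a weighted coin evaporates once more data is in.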

Stopping an experiment when a p-value of .05 is achieved is an example of p-hacking. But there are other ways to do it — like collecting data on a large number of outcomes but only reporting the outcomes that achieve statistical significance. By running many analyses, you’re bound to find something significant just by chance alone.
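Both tactics can be illustrated with a small simulation. The sketch below is a toy model under stated assumptions: a perfectly fair coin, a one-sided exact test, a first look after 10 flips, and up to 200 flips per experiment. All names and parameter values are hypothetical choices for illustration; none of this reconstructs an analysis from Wansink's lab.

```python
import random
from math import comb

ALPHA = 0.05

def critical_heads(flips: int, alpha: float = ALPHA) -> int:
    """Smallest head count whose one-sided p-value for a fair coin is below alpha."""
    tail = 0  # number of flip sequences with at least `heads` heads
    heads = flips
    while heads >= 0:
        tail += comb(flips, heads)
        if tail / 2 ** flips >= alpha:
            break
        heads -= 1
    return heads + 1  # may exceed `flips`, meaning no outcome is significant

def false_positive_rates(experiments=2000, max_flips=200, first_look=10, seed=0):
    """Return (rate when peeking after every flip, rate when testing once at the end)."""
    rng = random.Random(seed)
    thresholds = {n: critical_heads(n) for n in range(first_look, max_flips + 1)}
    peeking_hits = fixed_hits = 0
    for _ in range(experiments):
        heads = 0
        ever_significant = False
        for n in range(1, max_flips + 1):
            heads += rng.random() < 0.5  # fair coin: there is no real effect
            if n >= first_look and heads >= thresholds[n]:
                ever_significant = True  # a p-hacker would stop and report here
        peeking_hits += ever_significant
        fixed_hits += heads >= thresholds[max_flips]
    return peeking_hits / experiments, fixed_hits / experiments

def chance_of_any_false_positive(outcomes: int, alpha: float = ALPHA) -> float:
    """Chance that at least one of `outcomes` independent null tests dips below alpha."""
    return 1 - (1 - alpha) ** outcomes

peeking, fixed = false_positive_rates()
print("significant at some point while peeking:", peeking)
print("significant at the planned end only:", fixed)
print("20 unrelated outcomes, at least one hit:", chance_of_any_false_positive(20))
```

Because the coin here is fair by construction, every "significant" result is a false alarm. Checking after every flip can only raise the false-alarm rate relative to a single test at the planned end, and measuring 20 independent outcomes gives roughly a 64 percent chance that at least one clears the cutoff by luck.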

According to BuzzFeed’s Lee, who obtained Wansink’s emails, instead of testing a hypothesis and reporting on whatever findings he came to, Wansink often encouraged his underlings to crunch data in ways that would yield more interesting or desirable results. (That could have happened with the Joy of Cooking study: It’s possible his lab focused only on the recipes that seemingly became more caloric over time.)

In effect, he was running a p-hacking operation or, as Stanford researcher Kristin Sainani told BuzzFeed, “p-hacking on steroids.”

Wansink’s sloppiness and exaggerations may be greater than ordinary. But many, many researchers have admitted to engaging in some form of p-hacking in their careers.

A 2012 survey of 2,000 psychologists found p-hacking tactics were commonplace. Fifty percent admitted to only reporting studies that panned out (ignoring data that was inconclusive). Around 20 percent admitted to stopping data collection after they got the result they were hoping for. Most of the respondents thought their actions were defensible. Many thought p-hacking was a way to find the real signal in all the noise.

But it’s not. Increasingly, even textbook studies and phenomena are coming undone as researchers retest them with more rigorous designs.

There are many people working on trying to stop p-hacking

There’s a movement of scientists who seek to rectify practices in science like the ones that Wansink is accused of. Together, they basically call for three main fixes that are gaining momentum.

  • Preregistration of study designs: This is a huge safeguard against p-hacking. Preregistration means that scientists publicly commit to an experimental design before they start collecting data. This makes it much harder to cherry-pick results.
  • Open data sharing: Increasingly, scientists are calling on their colleagues to make all the data from their experiments available for anyone to scrutinize (there are exceptions, of course, for particularly sensitive information). This ensures that shoddy research that makes it through peer review can still be double-checked.
  • Registered replication reports: Scientists are hungry to see if previously reported findings in the academic literature hold up under more intense scrutiny. There are many efforts underway to replicate (exactly, or conceptually) research findings with rigor.

There are other potential fixes too: There’s a group of scientists calling for a stricter definition of statistically significant. Others argue arbitrary cutoffs for significance are always going to be gamed. And increasingly, scientists are turning to other forms of mathematical analysis, such as Bayesian statistics, which asks a slightly different question of data. (Whereas p-values ask, “How rare are these numbers?” a Bayesian approach asks, “What’s the probability my hypothesis is the best explanation for the results we’ve found?”)
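To make the Bayesian question concrete, here is a hedged sketch that returns to the coin. It assumes just two candidate explanations: a fair coin, or a coin weighted to land heads 75 percent of the time. Both the 75 percent figure and the prior beliefs are hypothetical inputs chosen for illustration, and a real Bayesian analysis would usually consider a whole range of possible biases.

```python
def posterior_weighted(heads: int, flips: int, prior_weighted: float,
                       weighted_bias: float = 0.75) -> float:
    """Probability the coin is weighted, given the flips and a prior belief.

    Assumes only two hypotheses: fair (0.5) or weighted (`weighted_bias`).
    """
    tails = flips - heads
    like_fair = 0.5 ** flips
    like_weighted = weighted_bias ** heads * (1 - weighted_bias) ** tails
    evidence = prior_weighted * like_weighted + (1 - prior_weighted) * like_fair
    return prior_weighted * like_weighted / evidence

# Ten heads in ten flips, starting from a 50/50 belief about the coin.
print(posterior_weighted(10, 10, prior_weighted=0.5))

# Same flips, but starting out skeptical: only a 1 percent prior that it is weighted.
print(posterior_weighted(10, 10, prior_weighted=0.01))
```

Under these assumptions, the same 10 heads leave an open-minded observer about 98 percent sure the coin is weighted, and a skeptical one only about 37 percent sure. The answer depends on what you believed going in, which is the information a p-value leaves out.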

No one solution will be the panacea. And it’s important to recognize that science has to grapple with a much more fundamental problem: its culture.

In 2016, Vox sent out a survey to more than 200 scientists, asking, “If you could change one thing about how science works today, what would it be and why?” One of the clear themes in the responses: The institutions of science need to get better at rewarding failure instead of prizing publications above all else.

One young scientist told us, “I feel torn between asking questions that I know will lead to statistical significance and asking questions that matter.”

Brian Wansink faced the same dilemma. And it’s increasingly clear which path he chose.

Further reading: Vox explains research methods (and how to make them better)

  • The Inside Story of How An Ivy League Food Scientist Turned Shoddy Data Into Viral Studies
  • What does “statistical significance” really mean
  • Why psychology’s “replication crisis” will lead to better science
  • The 7 biggest problems in science, according to hundreds of scientists
  • P-hacking, explained