Studies Until Results Expected, Thinks Hypothesis Is Now Golden
My sons watch a cartoon called Daniel Tiger’s Neighborhood. In one episode, which they (and by extension I) have watched at least one hundred times, Daniel and co. sing a little song that I imagine will repeat in my head for decades. The chorus goes:
“Keep trying, you’ll get better.”
The episode and song have a really nice message. Daniel is struggling to hit a baseball, but his friends encourage him to work at it until he improves.
What does this song have to do with experimental psychology? One interpretation of the lyric could be that of a researcher refining her craft to improve the research she conducts and strengthen the quality of evidence her studies produce. I can’t help but hear it another way.
“Keep trying, you’ll get better…results.”
As in, if at first your hypothesis is not supported, dust yourself off and try again. I think many of us have done too much SURE THING hypothesis testing.
A Twist on an Excellent Cartoon
“Bullseyes” by Charlie Hankin
This cartoon elegantly captures the concept of HARKing: Hypothesizing After Results are Known. SURE THING hypothesis testing definitely isn’t HARKing. The hypothesis in question is often established well before any results, and certainly before the supporting results, are known. The researcher simply tries and tries and tries, all the while making “improvements” or “tweaks” with the best of intentions, until the target is struck.
It also isn’t really p-hacking, a practice in which we exercise myriad researcher degrees of freedom, typically within a single study, until our results reach statistical significance. I think that both p-hacking and SURE THING hypothesis testing deserve their own cartoons. I am not a cartoonist, nor do I know Charlie Hankin, so allow me to simply describe the needed cartoons. The artistically inclined reader is invited to produce these cartoons in exchange for fame and glory.
- The “p-hacking bullseyes” cartoon: Targets are drawn beforehand, but they cover approximately 67% (drawn from Simmons & Simonsohn’s simulations of how bad it can get if we really go off the p-hacking rails) of the possible-arrow-landing surface.
  - The King’s shot has landed on one of the targets, and the assistant exclaims, “Excellent shot, my lord.”
- The “SURE THING bullseyes” cartoon: This one will need multiple panels, as SURE THING hypothesis testing is more episodic than HARKing or p-hacking.
  - The target is drawn beforehand.
  - The King shoots and misses. “No worries, my lord. The arrow must be faulty. Allow me to retrieve and refine it.”
  - The King shoots again and misses again. “Ah, I know the problem. Let us quickly tighten your bowstring.”
  - The King shoots again and misses again. “Perhaps we shall try again in better lighting and wind conditions tonight.”
  - At night. The King shoots again and hits! “Excellent shot, my lord!”
If you shoot until you hit, then success is a SURE THING.
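To put rough numbers on shooting until you hit, here is a minimal sketch in Python. It rests on assumptions that are mine, not drawn from any particular study: the hypothesis is actually false, every attempt is an independent study tested at the conventional .05 threshold, and each attempt's p-value is therefore uniform between 0 and 1. The function names `chance_of_a_hit` and `simulate_sure_thing` are hypothetical, chosen only for illustration.

```python
import random

ALPHA = 0.05  # conventional significance threshold (an assumption of this sketch)


def chance_of_a_hit(attempts, alpha=ALPHA):
    """Probability that at least one of `attempts` independent studies
    reaches p < alpha when the hypothesis is actually false."""
    return 1 - (1 - alpha) ** attempts


def simulate_sure_thing(max_attempts, n_researchers=20_000, alpha=ALPHA, seed=1):
    """Fraction of simulated researchers who get a 'hit' within max_attempts.

    Assumes a true null, so each attempt's p-value is modeled as a
    uniform draw on [0, 1]. Real 'tweaked' studies are not perfectly
    independent, so treat this as an idealized illustration.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_researchers):
        # The researcher stops at the first significant result.
        if any(rng.random() < alpha for _ in range(max_attempts)):
            hits += 1
    return hits / n_researchers


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        print(k, round(chance_of_a_hit(k), 3), round(simulate_sure_thing(k), 3))
```

Under these assumptions, a single attempt hits the target 5% of the time, but a researcher willing to make 20 attempts has roughly a 64% chance of eventually hitting it, even though there is nothing there to hit.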
Of course, others have described this process in scientific experimentation. Perhaps my favorite description comes from the Planet Money podcast episode on the replication crisis. They describe flipping coins over and over until one of them hits an unlikely sequence of results. What I think hasn’t yet been discussed adequately is the fact that many of the proposals of the open science movement (pre-registration, open data, open materials) provide only a weak defense against SURE THING hypothesis testing.
An Illustrative Hypothetical Scenario
In my last post, I discussed Comprehensive Public Dissemination (CPD) of empirical research. This and the following hypothetical scenario will help outline why I think it can be so powerful.
One researcher pre-registers and runs attempt after attempt at essentially the same study, “tweaking and refining” with the best of intentions as he goes. Eventually,
p < .05.
How do we feel about this?
An Alternative Hypothetical Scenario
A different researcher has a hypothesis about a potentially cool new effect. She engages in CPD. She clearly identifies on her CPD log a series of studies intended to pilot methods and to establish the necessary conditions for the effect to occur. Once she thinks she has established solid methods, she runs a pre-registered confirmatory study and
p < .05.
How do we feel about this?