Drawn In Perspective

Things I learnt at replication club #2 - power analysis and preregistration are fiddly and not many people document how they do it

Earlier this month I built a replication of a classic psychology result: time taken to solve shape rotation puzzles scales with the angle between the shapes.

I was also curious about some more recent studies linking performance on these tasks to aphantasia, which can be measured via "vividness of visual imagery" questionnaires, so I figured I'd throw one of those on the end of my design. I didn't think this would complicate things much... in my imagination it was going to be something like:

  • Collect the data from participants
  • Put the data in Excel or pyplot
  • Plot the same graphs as the researchers who ran the original study
  • See if the lines look like theirs
  • Plot another graph where the x-axis is the result of my questionnaire
  • See if that line goes up or down

I probably could have still done this, but instead I went down a rabbit-hole of trying to understand techniques for making sure my study could actually detect the effects I cared about. I had been vaguely aware of power analysis before, and it became clear this was what I needed here. In many ways it turned out to be more elegant than I expected; in other ways the implementation turned out to be kind of... fiddly.

How will I know when I've collected enough data?

(Or: why do power analysis in the first place?)

This was a pressing question! I had built an interface for collecting participant responses, and was going around hassling friends and other bloggers to complete the task. Each new person I asked was 15 minutes of someone's time, and also a small favour I owed them (I mostly offered to read bits of people's writing in exchange).

[Screenshots: messages recruiting friends and other bloggers to try the task]

So I was pretty motivated to understand how much data would be "enough".

Power analysis helps answer this question by estimating how likely your planned study is to detect an effect of a given size, assuming that effect truly exists.

There are two main ways to calculate the power of a study:

  1. Use statistical formulas for the specific test you want to run, or
  2. Use your fast computer to simulate the entire experiment thousands of times.
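As a sketch of the formula route: for a simple two-group comparison you can approximate power with a normal approximation to the t-test and search for the smallest sample size that reaches a target. The effect size and target power below are made-up illustrative numbers, not the ones from my study.

```python
from math import sqrt
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution

def power_two_sample(n_per_group, d, alpha=0.05):
    """Approximate power of a two-sided two-sample test for standardised
    effect size d (normal approximation; exact t-based answers need
    slightly larger n)."""
    z_crit = _Z.inv_cdf(1 - alpha / 2)
    return _Z.cdf(d * sqrt(n_per_group / 2) - z_crit)

def smallest_n(d, target_power=0.8, alpha=0.05):
    """Smallest per-group n whose approximate power reaches the target."""
    n = 2
    while power_two_sample(n, d, alpha) < target_power:
        n += 1
    return n

print(smallest_n(0.5))  # ~63 per group for a "medium" effect at 80% power
```

The simulation route (option 2) gives the same kind of answer but works for designs too messy for closed-form formulas.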

I also found this tutorial published by Lisa DeBruine and Dale Barr that works through how to use Monte Carlo simulations to design experiments where you take lots of repeated measurements on both people and stimuli sampled from random populations. In my case the stimuli weren't sampled (they were fixed rotation angles), but there were still enough moving parts in the design that their advice about simulating the whole experiment applied.

Broadly their key advice was:

  • Simulate data that matches how the experiment actually works, including the noise and effect you realistically expect
  • List the hypotheses you wish to test
  • Analyse each simulated dataset using the exact statistical model you plan to use later
  • Repeat this whole process thousands of times to count cases your analysis catches and cases it misses
  • Use this process to measure bias, precision and false-positive rates as well as just power
  • Choose a sample size that results in sensible values for all these metrics
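The steps above can be sketched as a single loop. Everything numeric here is a placeholder (40 participants, four angles, a true slope of 2 ms per degree, heavy trial noise), not my real configuration:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)

def simulate_once(n_participants=40, slope=2.0, noise_sd=300.0):
    """Simulate one experiment: each participant sees four rotation angles,
    and response time rises linearly with angle plus Gaussian noise."""
    angles = np.tile([0.0, 50.0, 100.0, 150.0], n_participants)
    rt = 1000.0 + slope * angles + rng.normal(0.0, noise_sd, angles.size)
    return angles, rt

def slope_p_value(x, y):
    """OLS slope of y on x and a two-sided p-value (normal approximation)."""
    x_c = x - x.mean()
    b = (x_c @ (y - y.mean())) / (x_c @ x_c)
    resid = y - y.mean() - b * x_c
    se = np.sqrt(resid @ resid / (len(x) - 2) / (x_c @ x_c))
    z = abs(b / se)
    return b, 2 * (1 - NormalDist().cdf(z))

def estimate_power(n_sims=500, alpha=0.05):
    """Fraction of simulated experiments whose analysis detects the slope."""
    hits = 0
    for _ in range(n_sims):
        x, y = simulate_once()
        _, p = slope_p_value(x, y)
        hits += p < alpha
    return hits / n_sims

print(estimate_power())
```

Scanning `estimate_power` over a grid of sample sizes gives the power curve you pick your n from.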

Aside from tutorial-style papers like these I couldn't find much code or detail online about how power analyses are done for real studies in this field. I got a LOT of help from LLMs patiently explaining various details to me, and have not checked my work with a real statistician or research scientist, so please do not take this post as advice. Also please reach out if you spot issues in my approach.

What noise and effects do I expect to see?

Shape rotation is a fairly well-studied information-processing task in psychology. My main study is a replication of a commonly reproduced result, so there is plenty of data on what kinds of effect sizes and noise to expect. For the link to aphantasia, one of the two prior papers I found detected an effect, so I estimated that if there was an effect it would be about as big as the one observed in the paper with the positive result.

The relationship between shape rotation performance and various other factors has been studied extensively, and it looks like the biggest factors are gender and age. Some of these factors manifest as a constant offset in response time, while others manifest as a combination of constant offset and some function of rotation angle. For this study I therefore chose to model rotation time as predicted by “intercept-level” factors (VVIQ, gender, age), plus “slope-level” factors (rotation angle × (how much VVIQ affects the slope + how much gender affects the slope)), plus block-order effects and within-trial noise.

The final equation when you factor all these effects together looks like this:

RT stands for "rotation time". i is the participant number, j is the label for the pair of shapes they're being asked to rotate (with rotation angle θj).

RTᵢⱼ = γ₀₀ + γ₀₁·VVIQᵢ + γ₀₂·genderᵢ + γ₀₃·ageᵢ + u₀ᵢ + (γ₁₀ + γ₁₁·VVIQᵢ + γ₁₂·genderᵢ + u₁ᵢ)·θⱼ + γ₂₀·blockᵢⱼ + eᵢⱼ

(u₀ᵢ and u₁ᵢ are random per-participant deviations in intercept and slope; eᵢⱼ is within-trial noise.)

All of the parameters labelled by the greek letter gamma will be configuration parameters for the simulation.
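As a sketch of what the data-generating side of the simulation might look like, with random per-participant intercepts and slopes in the DeBruine-and-Barr style. All the gamma values and noise scales below are placeholders, not the ones I actually used, and I've dropped the gender, age, and block terms to keep it short:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_participants(n, angles=(0.0, 50.0, 100.0, 150.0),
                          g00=1000.0, g01=-2.0,   # intercept, VVIQ-on-intercept
                          g10=2.5, g11=-0.01,     # slope, VVIQ-on-slope
                          sd_intercept=150.0, sd_slope=0.5, sd_noise=250.0):
    """Simulate RT_ij = intercept_i + slope_i * theta_j + noise, where
    intercept_i and slope_i each combine a fixed effect of VVIQ with a
    random per-participant deviation. Parameter values are placeholders."""
    vviq = rng.uniform(16, 80, n)        # VVIQ questionnaire scores
    u0 = rng.normal(0, sd_intercept, n)  # random intercepts
    u1 = rng.normal(0, sd_slope, n)      # random slopes
    rows = []
    for i in range(n):
        for theta in angles:
            rt = (g00 + g01 * vviq[i] + u0[i]
                  + (g10 + g11 * vviq[i] + u1[i]) * theta
                  + rng.normal(0, sd_noise))
            rows.append((i, vviq[i], theta, rt))
    return np.array(rows)  # columns: participant, VVIQ, angle, RT

data = simulate_participants(30)
print(data.shape)  # (120, 4)
```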

What hypotheses do I have?

The main papers I drew on were:

The last paper finds that strategies vary greatly between participants, and one of my main predictions is that aphantasia explains some of that variation. An even more sophisticated version of this power analysis would also have modelled a variety of strategies, predicting the variance across populations and how VVIQ influences which strategies are used. I decided to leave this out, though I do collect free-text strategy data from participants for qualitative analysis.

The hypotheses are:

  • No effect (VVIQ score and block angle have no effect)
  • Basic result (VVIQ score does nothing, but response time scales with block angle)
  • Constant effect (low VVIQ score results in slower / faster responses, regardless of angle)
  • Slope offset (VVIQ score changes the rotation rate itself i.e. there is a detectable difference in strategy used)

This last effect was the one I was most interested to test for. It also turned out to be the most computationally expensive to simulate.

I also added a "sanity check" hypothesis to my simulation (a huge artificial result to debug whether I was detecting any effects at all).

Each of these hypotheses is modelled by the equation above by varying just four of the numbers: γ₀₀, γ₀₁, γ₁₀, γ₁₁ (intercept, VVIQ effect on the intercept, slope, VVIQ effect on the slope). The other parameters are constants, fixed according to reasonable assumptions based on the existing literature.
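Concretely, each hypothesis becomes one setting of those four gammas that gets fed to the simulator. The magnitudes below are illustrative placeholders, not my real values:

```python
# (g00, g01, g10, g11) = (intercept, VVIQ-on-intercept,
#                         angle slope, VVIQ-on-slope)
HYPOTHESES = {
    "no_effect":       dict(g00=1000.0, g01=0.0,   g10=0.0,  g11=0.0),
    "basic_result":    dict(g00=1000.0, g01=0.0,   g10=2.5,  g11=0.0),
    "constant_effect": dict(g00=1000.0, g01=-2.0,  g10=2.5,  g11=0.0),
    "slope_offset":    dict(g00=1000.0, g01=-2.0,  g10=2.5,  g11=-0.01),
    # huge artificial effects, purely for debugging the detection machinery
    "sanity_check":    dict(g00=1000.0, g01=-50.0, g10=25.0, g11=-0.5),
}

for name, gammas in HYPOTHESES.items():
    print(name, gammas)
```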

How will I analyse the data?

The final analysis is just a regression, though because we're also estimating the effect on the slope per participant there are some slightly fiddly details that I am still in the process of writing up here.
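A simplified two-stage version of the idea (fit a slope per participant, then regress those slopes on VVIQ) looks like the sketch below. This is not my full analysis — a proper mixed-effects model would do both stages jointly — but it captures the shape of it:

```python
import numpy as np

def per_participant_slopes(participant, angle, rt):
    """First stage: OLS slope of RT against rotation angle, per participant."""
    slopes = {}
    for pid in np.unique(participant):
        mask = participant == pid
        slopes[pid] = np.polyfit(angle[mask], rt[mask], 1)[0]
    return slopes

def vviq_slope_effect(slopes, vviq_by_pid):
    """Second stage: regress per-participant slopes on VVIQ score.
    Returns the change in rotation rate per VVIQ point."""
    pids = sorted(slopes)
    x = np.array([vviq_by_pid[p] for p in pids])
    y = np.array([slopes[p] for p in pids])
    return np.polyfit(x, y, 1)[0]
```

A nonzero second-stage coefficient is what the "slope offset" hypothesis predicts.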

Running the simulation

I made the following changes to my simulation code (the first two on the advice of an AI coding assistant):

  • Vectorising the code
  • Precompiling the functions
  • Only doing IO and visualisation at the end

Benchmarks showed a total 3x speedup from applying these changes. I suspect most of it came from moving the IO; further benchmarking could help narrow that down.

The remaining cost came from recomputing a linear regression for every participant for every block of the trial. This was necessary in order to detect whether differences in slope indicated differences in strategy.
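To illustrate the vectorisation step: when every participant sees the same angles, all the per-participant slopes can be computed in one matrix operation instead of a Python loop over regressions. A sketch, assuming an RT matrix of shape participants × angles:

```python
import numpy as np

def slopes_vectorised(angles, rt_matrix):
    """OLS slope per participant, computed for all rows at once.
    angles: shape (k,); rt_matrix: shape (n_participants, k)."""
    x_c = angles - angles.mean()
    y_c = rt_matrix - rt_matrix.mean(axis=1, keepdims=True)
    return y_c @ x_c / (x_c @ x_c)

angles = np.array([0.0, 50.0, 100.0, 150.0])
rt = np.array([[1000.0, 1100.0, 1200.0, 1300.0],   # 2 ms per degree
               [1000.0, 1150.0, 1300.0, 1450.0]])  # 3 ms per degree
print(slopes_vectorised(angles, rt))  # [2. 3.]
```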

I considered using carlo.app or some other library - and might still try rebuilding my analysis there - but I was keen to roll as much of it from scratch as I could to make sure I understood the core concepts.

Effect size woes

p-values or SESOI?

I ran my original code and noticed my power curves never exceeded 50% power. In some cases power apparently went down as sample size went up. It turned out that I had set a SESOI (smallest effect size of interest) parameter that was smaller than the expected real effect size.

In general you should not adjust your SESOI post-hoc like this after running your simulation. I also struggled to figure out what a reasonable SESOI would be. Maybe a safer approach would have been to ask another researcher to pick the SESOI blind and compare notes? I'm not sure how people handle this in practice; it seems stressful if you are working on a major paper with a deadline. In my case this is just a blog post, so I tried not to overthink it: I threw out the SESOI and reverted to using classic p-values (as the original tutorial recommends).
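The difference between the two decision rules, given a pile of simulated effect estimates and p-values, is just which condition you count. The numbers and the SESOI value here are made-up placeholders:

```python
import numpy as np

def power_p_value(estimates, p_values, alpha=0.05):
    """Classic rule: count simulations where the effect is significant."""
    return np.mean(p_values < alpha)

def power_sesoi(estimates, p_values, sesoi, alpha=0.05):
    """SESOI rule: significant AND at least as large as the smallest
    effect size of interest."""
    return np.mean((p_values < alpha) & (np.abs(estimates) >= sesoi))

est = np.array([0.1, 0.5, 0.6, 0.05])    # simulated effect estimates
pvals = np.array([0.2, 0.01, 0.03, 0.5])
print(power_p_value(est, pvals))            # 0.5
print(power_sesoi(est, pvals, sesoi=0.55))  # 0.25
```

With a SESOI close to the true effect size, a large fraction of estimates fail the second condition even when they are significant, which drags the measured power down.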

Final analysis

I am still running iterations of the simulation, but the early outcome of this analysis is that to get significant results I'd need to run the study on about 250 participants. Currently I've run it on about 12.

This is... not nearly enough.

On the bright side, now that the power analysis is designed (and so the study is in effect pre-registered), I can at least start looking at the real data, and sending results to the people who wanted them shared back. In the meantime I can think about other study designs, or ways to recruit more participants for this one.
