Things I learnt at replication club

Published on November 7, 2025

I went to a workshop this week about replicating scientific studies, organised by Slime Mold Time Mold at Inkhaven. Slime Mold Time Mold run a blog which I admire a lot, and if you enjoy this topic you will probably enjoy their blog post about replication even more. The idea was for a bunch of us to get together, pick an interesting psychology study, and then try to replicate its results.

One area of psychology I am interested in is the relationship between visual imagination and information processing. Visual imagination also often shows up as a case study in the philosophy of mind, though there the emphasis is normally on its phenomenology or its relationship to perception. One topic that bridges these fields is the question of to what extent we make use of access-conscious visual imagery when carrying out information processing tasks. For example, this paper argues that, since subjects who have aphantasia (a lack of visual imagination) perform equally well on some information processing tasks, visual imagery cannot play a role in such tasks.

The psychology studies referenced in that paper use tasks that involve remembering numbers from a grid flashed up on a screen. Some other studies, using tasks that involve rotating shapes, report the opposite effect (i.e. different performance between aphantasics and non-aphantasics, in particular longer response times for aphantasics). I would like to see whether these results replicate. However, I am just starting out running DIY psychology replications, so I am planning to start small and work my way up.

For that reason, the first paper I plan to replicate is this one, which is itself a replication of a classic and very robust result in psychology: in shape rotation tasks, response time goes up linearly as the angle between the shapes increases. In the spirit of just trying things to see what happens, I will also stick an aphantasia questionnaire on the end for fun (though see point 7 below for why this kind of thing can be tricky).

The reason I chose this paper is that the dataset of shapes they use is publicly available and the methodology is especially well described. All the same, there are some things I didn't know before starting out that I wanted to share.

Here are some of those things, in no particular order:

1. You can really just replicate many studies if you are interested in them! Slime Mold Time Mold have a list at the end of the blog post I mentioned above. It's a great way to learn, you will have a lot of fun doing it, and it will help you understand the original papers better.
2. Start small! ... like, even smaller than I did. Even the vanilla shape rotation study is surprisingly fiddly for a beginner. In hindsight I wish I had just done a questionnaire based on, for example, one of Kosslyn's text-based tasks. Even then I'd have had to find a way to get precise measurements of participants' response times, which was hard because...
3. Good-quality, easily accessible tooling for setting up psychology studies seems hard to find. I am generally quite fast at figuring out new software, but it still took me the best part of an afternoon and nearly a full day's AI token budget to get PsychoPy set up for this task. The other option I looked at was GuidedTrack, but I couldn't find anything in its manual about measuring response times for individual tasks. Even if I had figured that out, the study I was trying to replicate had other requirements, such as giving participants live feedback during practice rounds and timing them out if they didn't respond within 7500 milliseconds.
4. LLMs are not quite reliable enough yet for vibe-coding study interfaces. I still got a lot of coding help from an LLM on this task, but it was mostly help figuring out the new tooling and explaining error messages. Using an off-the-shelf framework like PsychoPy also meant it handled a bunch of fiddly things like frame-rate sync, precise timing, and data collection for me. I reckon if I had (1) a much larger token budget and (2) the experience I now have of what a good version of this study actually requires, I might have been able to get a lot further with the vibe-coding approach.
5. A good software engineer interested in supporting better psychology research could probably do a lot of good by contributing features to these frameworks. PsychoPy, for example, seems to support all sorts of fancy things like eye tracking, yet you need custom code or hacks to set up a simple "rate this statement from 1-5" multiple-choice scale.
6. Once you get started you will have loads of ideas for more studies you want to run. Did I mention that PsychoPy supports eye tracking? Wouldn't it be fun to set that up?
7. If you are worried about sample sizes, there is a cheap way to do a power analysis: list out your hypotheses for what might be happening, then run Monte Carlo simulations to estimate whether the effect would be detectable with your planned sample size. It's best to do this before running your study, in case it makes you want to change anything about the set-up. Thank you to Gwern and Adrià Garriga-Alonso for independently suggesting that I try this approach, and for both patiently explaining the idea behind it to me. Adrià has a blog post on the topic here. In my case this was only needed because I wanted to make sure I would collect enough samples for the data from my aphantasia questionnaire to be informative.
8. If you can run your study in a browser, do that. PsychoPy's browser support turned out to be kinda weak. Running studies in person significantly limits your sample size (though I guess it gives you more control over the experimental conditions).
9. If you tell participants to press "B" if two shapes are the same and "N" if they are different, they will often forget almost immediately which key was which. I fixed this by sticking coloured tape on my keyboard. When I showed this to a real scientist, they told me that this is actually best practice.

An image of my "best practice" psychophysical input device.
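The practice-round requirements described above (live feedback and a 7500 millisecond timeout), together with the "B" for same / "N" for different key mapping, boil down to a small piece of trial-scoring logic. Here is a framework-agnostic Python sketch of it. This is not PsychoPy's API. The function name and the feedback strings are hypothetical, and so is the assumption that the framework hands you the pressed key (or `None`) plus the response time in milliseconds. Actually capturing that key press with accurate timing is the hard part that a framework like PsychoPy takes care of.

```python
# Hypothetical scoring helper. The timeout and key mapping follow the text
# above; the feedback strings are placeholders.
TIMEOUT_MS = 7500
KEY_MAP = {"b": "same", "n": "different"}
FEEDBACK = {"correct": "Correct!", "incorrect": "Incorrect", "timeout": "Too slow"}


def score_trial(key, rt_ms, correct_answer, practice):
    """Classify one trial as "correct", "incorrect" or "timeout".

    key: lowercase key pressed, or None if nothing was pressed in time.
    rt_ms: time from stimulus onset to the key press, in milliseconds.
    Feedback text is only returned on practice trials; otherwise None.
    """
    if key is None or rt_ms > TIMEOUT_MS:
        outcome = "timeout"
    elif KEY_MAP.get(key) == correct_answer:
        outcome = "correct"
    else:
        outcome = "incorrect"  # includes any key outside the mapping
    return outcome, (FEEDBACK[outcome] if practice else None)


print(score_trial("b", 1830, "same", practice=True))        # ('correct', 'Correct!')
print(score_trial("b", 2410, "different", practice=False))  # ('incorrect', None)
print(score_trial(None, None, "same", practice=True))       # ('timeout', 'Too slow')
```

Keeping the scoring separate from the display code makes it easy to check the logic without running the experiment itself.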
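Here is roughly what the Monte Carlo power analysis described above can look like, as a minimal standard-library-only Python sketch. Several choices here are my own illustrative assumptions and not Adrià's or Gwern's method:

- the simulated hypothesis, where each participant's mean response time is normally distributed and aphantasic participants are slower by a fixed amount;
- a hypothetical aphantasia prevalence used to split participants into groups;
- the parameter names;
- a permutation test on the difference in group means as the analysis.

You would swap in your own hypotheses and planned analysis. Each simulated study draws participants, runs the test, and the function reports the fraction of studies in which the effect came out significant.

```python
import random


def _mean(xs):
    return sum(xs) / len(xs)


def permutation_p_value(a, b, n_perm, rng):
    """Two-sided permutation test on the difference in group means."""
    observed = abs(_mean(a) - _mean(b))
    pooled = list(a) + list(b)
    k = len(a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of who is in which group
        if abs(_mean(pooled[:k]) - _mean(pooled[k:])) >= observed:
            extreme += 1
    # The +1 counts the observed labelling itself, so p is never exactly 0.
    return (extreme + 1) / (n_perm + 1)


def simulated_power(n_participants, p_aphantasic, effect_ms, sd_ms,
                    n_sims=500, n_perm=200, alpha=0.05, seed=0):
    """Estimate how often a study of this size would detect the effect.

    Simulated hypothesis (every number is an assumption you choose): each
    participant's mean response time is Normal(baseline, sd_ms), a fraction
    p_aphantasic of participants are aphantasic, and they are slower by
    effect_ms on average.
    """
    rng = random.Random(seed)
    baseline_ms = 2000.0  # hypothetical; only the group difference matters
    detected = 0
    for _ in range(n_sims):
        aphantasic, others = [], []
        for _ in range(n_participants):
            if rng.random() < p_aphantasic:
                aphantasic.append(rng.gauss(baseline_ms + effect_ms, sd_ms))
            else:
                others.append(rng.gauss(baseline_ms, sd_ms))
        if not aphantasic or not others:
            continue  # nothing to compare, so this study counts as "not detected"
        if permutation_p_value(aphantasic, others, n_perm, rng) <= alpha:
            detected += 1
    return detected / n_sims
```

Calling it with `effect_ms=0` is a useful sanity check, since the detection rate should then land near or below `alpha`. After that you can sweep `n_participants` for your hypothesised effect, e.g. `simulated_power(100, p_aphantasic=0.05, effect_ms=300, sd_ms=500)` (all made-up values). With a small `p_aphantasic`, many simulated studies end up with only a handful of aphantasic participants, which tends to drag the power down. Pure-Python permutation tests are slow, so keep `n_sims` and `n_perm` modest while exploring.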