And it’s screwing up both functionalist theories of consciousness and mechanistic interpretability.
Hello! Today's post is a collaboration with Adrià Garriga-Alonso. It is a work in progress but we are posting it in its current state and welcome feedback while we continue to flesh it out.
Here are two ways to build a machine which adds two numbers.
Machine 1 is a calculator embedded in the wall of a 3 by 3 meter (10 by 10 foot) box.
Machine 2 is a 3 by 3 meter box with a keypad on the wall that looks just like a calculator, except…
- When you press its buttons, their values are displayed to Bob on a screen hidden inside the box.
- Bob, by the way, is a human who lives in the box.
- Bob reads values, adds them together in his head, and then enters the result on an internal keypad which displays it back on the outer screen.
- Bob is very disciplined: he has ample supplies inside the box, never makes mistakes, and will never deviate from his task, even if you try to communicate messages to him with the numbers you type.
If you don’t look inside them, both machines appear to have the same behaviour: for any given input they will give the same output. But in order to perform its function, one of these machines involves a conscious mind. This seems important! 1

Here lives Bob. Don’t worry about him, he’s fine. Though we can’t really tell: in order to perform his calculating job perfectly, he’s essentially an extreme form of what Hilary Putnam calls a super-spartan.
You can’t judge whether a machine like this is conscious based just on its inputs and outputs. This motivates finding a more sophisticated criterion for deciding whether two machines are the same than comparing their inputs and outputs.
One alternative approach is to ask: are they implementing the same algorithm to realize this behaviour?
For a machine B to implement the same algorithm as a machine A, there must be a mapping from the states of machine A to the states of machine B such that whenever machine A is in state a-x, machine B is in the corresponding state b-x, for all states a-x and b-x. The mapping must also preserve transitions: whenever machine A would move from state a-x to state a-y, machine B moves from the corresponding state b-x to b-y.
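The mapping condition above can be sketched as a small check on toy transition systems. This is a minimal sketch, in which machines are deterministic state-transition tables; the names (`implements`, `phi`, the machines) are illustrative, not from any paper.

```python
def implements(machine_a, machine_b, phi):
    """Check that phi maps A's states to B's states and commutes with the
    transitions: if A steps a_x -> a_y, then B steps phi(a_x) -> phi(a_y)."""
    for a_x, a_y in machine_a.items():
        b_x, b_y = phi[a_x], phi[a_y]
        if machine_b.get(b_x) != b_y:
            return False
    return True

# Two two-state machines that tick back and forth.
machine_a = {"a0": "a1", "a1": "a0"}
machine_b = {"b0": "b1", "b1": "b0"}

phi = {"a0": "b0", "a1": "b1"}
print(implements(machine_a, machine_b, phi))       # True

bad_phi = {"a0": "b0", "a1": "b0"}
print(implements(machine_a, machine_b, bad_phi))   # False: b0 steps to b1, not b0
```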
This seems very promising! Surely, by this definition, Machine 1 (the calculator) and Machine 2 (the box with Bob inside) must be implementing different algorithms, because there is no suitable mapping between their internal states.
It turns out that there is such a mapping. In fact, there are infinitely many such mappings. In general, the question of whether two machines are running the same algorithm, defined this way, is vacuous: if you have enough degrees of freedom in determining what counts as ‘the same state’, anything is implementing the same algorithm as anything else!
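To see how the degrees of freedom get exploited, here is a toy version of the trick: if machine B simply passes through distinct states over time (think of a rock whose microstate never repeats), a valid mapping can be read straight off the two trajectories. A sketch under those assumptions; all names are illustrative.

```python
def commutes(trans_a, trans_b, phi):
    """The same commuting condition as before, as a one-liner."""
    return all(trans_b.get(phi[x]) == phi[y] for x, y in trans_a.items())

# Machine A: a two-step counter. Machine B: any system whose state
# simply changes at every tick and never repeats.
trans_a = {"a0": "a1", "a1": "a2"}
trans_b = {"rock_t0": "rock_t1", "rock_t1": "rock_t2"}

# Read the mapping off the matching trajectories, step by step:
phi = {"a0": "rock_t0", "a1": "rock_t1", "a2": "rock_t2"}
print(commutes(trans_a, trans_b, phi))  # True, by construction
```

Because the mapping was built by pairing up timesteps, it satisfies the condition for *any* machine A whose run is this long, which is exactly the vacuity worry.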
This is a problem that goes beyond examples like these. It crops up in several places in both the philosophy of consciousness, and in mechanistic interpretability:
In mechanistic interpretability, we’re interested in reverse-engineering neural networks and finding out what algorithm they implement. Research papers in the field usually plot the correlation of some behaviour with the network’s states (observational evidence), and modify the network in targeted ways to observe how its behaviour changes (causal evidence). This kind of evidence is tedious to evaluate and not an objective measurement: what constitutes a good enough plot?
In 2022, Atticus Geiger et al. (and, in parallel, Chan et al.) proposed a formal definition of when a neural network implements an algorithm, one that can be checked numerically. Roughly: a neural network implements an algorithm when you can pick a location in the network where each of the algorithm’s variables is represented, such that if you change the state of the network at that location, the behaviour changes in the way the algorithm predicts. Doing this for a full concrete algorithm is long and difficult, so it hasn’t caught on, but researchers do widely use interchange interventions (the proposed test for changing the internals of networks).
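The shape of an interchange intervention can be sketched on a toy example. The “network” below is just a function with one hidden value standing in for an activation, so this is only a hedged illustration of the idea, not the authors’ actual procedure; all names are made up.

```python
def high_level_alg(x, y, z):
    s = x + y          # the algorithm's intermediate variable S
    return s * z

def network(x, y, z, patched_s=None):
    # Hypothesis: the hidden value `s` is where S is represented.
    s = x + y
    if patched_s is not None:
        s = patched_s  # interchange: overwrite with the source activation
    return s * z

# Source run: record the activation at the hypothesized location.
src_s = 2 + 3                                  # S on source input (2, 3, 10)

# Base run (7, 1, 4), with the source activation patched in:
patched_out = network(7, 1, 4, patched_s=src_s)

# What the algorithm predicts if we swap S the same way:
predicted = src_s * 4

print(patched_out, predicted)  # 20 20 -> the intervention agrees
```

If the patched network output matched the algorithm’s prediction across all such swaps, that would be evidence (on this definition) that the network implements the algorithm with S at that location.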
In the philosophy of consciousness, we’re interested in the question of whether human beings are conscious by virtue of the computational structure of their brains. Perhaps the physical substance which human brains are made of does not matter, and if brains were made of silicon, or, a huge complex arrangement of water pipes, they would still be conscious, so long as information was flowing around those structures in the same way that it flows around the human brain.
Early versions of these theories were especially popular in the 1960s and 1970s, but, because of the objections we’ll discuss below, many of the philosophers who defended them later changed their minds: Ned Block in his paper “Troubles with Functionalism”, and Hilary Putnam in his book “Representation and Reality”. Several other philosophers, most notably David Chalmers, continued to defend the view with variations on these accounts.
The arguments
There isn't currently a very readable exposition of these arguments. We are working on one! For now we'll link to a few places that make them:
- The appendix of this book (available in full on archive.org). By Hilary Putnam
- The discussion of the arguments in this paper by David Chalmers
- This paper by Matthias Scheutz, which contains some helpful diagrams as well.
- This mechanistic interpretability paper uses a very similar argument to argue that causal abstraction approaches are not sufficient to explain the behaviour of language models.
The upshot of all these arguments is that unless you introduce some restricted set of allowable mappings between computational states, any notion of algorithmic identity that relies on such mappings ends up being vacuous.
How can we restrict the mapping?
In philosophy of mind literature, there are two main identifiable threads of response:
- One set of responses proceeds from this observation by iterating on tighter and tighter definitions of “functional implementation”, tying it to notions like counterfactuals, causation, sub-states, and continuous physical regions.
- Another set of responses proceeds by exploring how we interpret and justify explanations to speakers in our community. This is a radically relativist approach: the extreme version of the position denies that there is any “natural” fact of the matter as to whether two functions are the same. The motivation for this approach mostly comes from the Kripke-Wittgenstein rule-following paradox.
- Kripke raises an argument which attacks any attempt to define what abstract computational function a physical system is implementing. The basis of the attack is that abstract functions are fully deterministic and defined over infinite domains, while physical mechanisms are finite and error-prone. This means that any mapping from physical system states to abstract function states relies on some interpretation of what “normal function” for the physical system should be.
- He works through a variety of solutions, dismissing them one by one, and then concludes that the mapping you perform is relative to the interpretation of some observer or community of observers.
- One way out of this problem is quite pragmatist: does the explanation eventually enable you to make better predictions?
- This approach in general is helpful for resolving mechanistic interpretability questions, but when it comes to questions about consciousness, there remains a worry that there is more to the world than what your theory predicts.
In mechanistic interpretability:
- The same mechanistic interpretability paper, which notes worries similar to the ones Searle and Putnam raised, proposes another pragmatic solution: when considering whether a neural network is implementing some algorithm, only permit implementations where the algorithm’s variables are encoded in the network as linear representations.
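One way to picture the linear-representation restriction: only count a variable as represented if a fixed linear readout of the activations recovers it. A toy sketch, in which the “activations” are just a random linear map of the inputs (so the variable really is linearly encoded); all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=(100, 2))
variable = inputs.sum(axis=1)        # the algorithm's variable: x + y

# A toy "network layer": a fixed random linear map of the inputs.
W = rng.normal(size=(2, 8))
acts = inputs @ W

# Fit a linear readout; a near-zero residual means the variable
# passes the linear-representation test at this location.
readout, *_ = np.linalg.lstsq(acts, variable, rcond=None)
recovered = acts @ readout
print(np.allclose(recovered, variable))  # True: linearly decodable
```

A variable that is only recoverable by some elaborate nonlinear decoder would fail this test, which is exactly the kind of mapping the restriction is meant to rule out.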
There are also some other interesting responses from philosophically inclined computer scientists:
- One appears in Scott Aaronson’s “Why Philosophers Should Care About Computational Complexity”.
- The response is that for a mapping to be meaningful, its computational complexity must be meaningfully lower than that of the algorithm you are trying to analyse: if computing the mapping is as hard as running the algorithm itself, the mapping is doing all the explanatory work.
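One way to see the force of this criterion: a Putnam-style mapping that simply memorizes a system’s trajectory has a description that grows with the computation it is supposed to explain, whereas an honest mapping (like reading off a calculator’s registers) stays fixed. A toy sketch; the sizes and names are illustrative.

```python
def degenerate_mapping(run_length):
    # One memorized entry per timestep of the run: "computing" this
    # mapping costs as much as the computation it purports to explain.
    return {f"a{t}": f"rock_t{t}" for t in range(run_length)}

for n in (10, 1000):
    print(n, len(degenerate_mapping(n)))  # the mapping's size tracks n
```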
This sounds like Searle's Chinese room thought experiment but it is in fact different. In the Chinese room, the man inside the room does not understand Chinese, or even that the job he is doing involves speaking Chinese. ↩