One of the delights
of quantum science is that it is filled with puzzles, paradoxes and
surprises. In this post, I want to exposit the close connection
between quantum superposition and parallel computation. At first,
these might seem to be widely unrelated ideas – after all,
superposition is a wave phenomenon (continuous), while parallel
computation describes a discrete process. As we saw in the previous
post, however, the discrete and the continuous are not so
incompatible as it might seem. Often, the distinction is a matter of
perspective.
The Perl programming
language has become a workhorse of Linux operating systems, bundled
by default into most major distributions of that operating system.
Around 2000, a Perl package named Quantum::Superpositions was added
to the Perl package repository. This package added several features
to the language, but the two most notable for our purposes are the
any() construct and the all() construct. These constructs allows a
program to treat a list as though it were a single value, in one of
two different ways. If you are interested in a more detailed
description, the documentation is quite straightforward. The irony is that these constructs have
nothing to do with simulating quantum computations – so why name
the package Quantum::Superpositions?
In order to
understand the rationale behind the name, we have to understand
exactly what is happening in a qubit that is taking advantage of the
power of quantum superposition. Physicists will often describe it in
terms of parallel universes – imagine that the quantum bit is
simultaneously existing in two, separate universes and, in one
universe, its value is 0 and in the other universe its value is 1. In
this sense, the qubit is in “two, mutually-exclusive states at
once.” These kinds of mind-bending descriptions tend to make us
feel either monumentally stupid or secretly skeptical. Like the story
of the naked Emperor, however, we dare not air our skepticism lest we
look like the only dunce in the room.
On modern computers,
operating systems use a technique called “multi-tasking” to
perform many, unrelated tasks in such a way that they all appear to be
performed in parallel. The core technique of multi-tasking is
time-slicing … if there are 100 processes running
on your system, the operating system will divide each unit of time in
such a way that each process is allowed to run for a
small slice of time before it is forced to give up control to the
next process scheduled to run. Because the CPU in a typical computer
is operating at billions of cycles per second, the user cannot
perceive any observable artifacts from the time-slicing. When more
threads are available to the operating system (i.e. when running on a
multi-core CPU), the operating system performs less time-slicing
because each task can be allowed to run for a longer slice of time on
each thread. Suppose we have a system with four threads, the load
that each thread has is one-fourth the load that a single-threaded
system would have. Thus, the system only needs to time-slice 25% as
much, allowing each process to run four times as long without
changing the total throughput of the system as perceived by the user.
Now, suppose we have a system with eight threads. All these numbers
change proportionally – each thread has 1/8th the load of a
single-threaded system and we can allow each process to run eight
times as long. As we increase the number of threads until we have
one thread per process in the system, we reach a point at which every
process can run 100% of the time because every process can monopolize
a thread on the CPU without interfering with any other running
process – all while maintaining the same multi-tasking guarantees
that the operating-system must satisfy to the end-user.
But given the
fantastic performance of time-slicing on small numbers of threads, it
is obvious that a hardware system with so many threads that every
process is running on its own dedicated thread would be wasteful.
Most processes (and the threads they are running on) would be idle
most of the time. Most of the potential computing power of a CPU with
hundreds of threads would be wasted. This brings us to the problem of
automatic parallelization.
To understand the
problem of automatic parallelization, let’s look at the behavior of
a single computer program running on the CPU (suppose it has a
complete monopoly on the CPU). How can this program keep the CPU as
busy as possible, especially if the CPU has multiple threads? One
approach is for the program to explicitly parallelize itself, cutting
up its work into blocks so that each block can be dispatched to a
separate thread on the CPU. Another approach – the approach that
has dominated from the advent of superscalar CPUs in the mid-1990s
until today – is to allow the hardware to try to “extract”
parallelism from the software, without explicit assistance from the
software. This way, the software can focus on describing what
it is trying to compute, and the hardware can focus on how to
compute it.
How does the
hardware do this? After all, the hardware has only a very low-level
“machine code” view of what the software is doing. The hardware
differentiates between two classes of instructions. The first class
of instructions are non-branching instructions. The second class of
instructions are branching instructions. Non-branching instructions
always proceed in order, one after the other, like the steps in a
cooking recipe. But branching instructions can “jump” to any
other instruction, based on logic conditions. Branching instructions
are how software implements conditional constructs like
“if-then-else”. Whenever you want some part of a program to only
be executed under certain conditions, you’re going to need a
branching instruction to implement that functionality in the machine
code.
We can use code
visualization tools to look at how machine code looks to the hardware
– this is called a control-flow graph. A CFG shows only the
branching instructions (as vertices in the graph) while all other
instructions are abstracted away as graph edges. Whenever the CPU
encounters a branch it faces a conundrum – which branch should it
follow? It can wait to find the outcome of the branch condition, but
this wastes valuable cycles that the CPU could use to get a
head-start on processing the branch path that will be taken. In order
to avoid wasting cycles, the CPU performs “branch prediction”. It
makes a guess as to which path will be taken and starts executing
instructions down that path. Branches can be nested multiple times
and the branch predictor will attempt to make a reasonable guess as
to which branch path will be taken. When the branch predictor is
wrong, the cost of misprediction is about the same as having waited
for the branch condition to be evaluated, but when the branch
predictor is right, the program is able to proceed as if there had
been no branch at all.
When a CFG becomes
sufficiently deep and intricate, the branch predictor is bound to
make mistakes and the penalties will pile up. A modern, superscalar
CPU may have as many eight available executions (EUs) units for
branch-predicted instructions to proceed through – keeping all
those EUs busy is the scheduler's job, a job that would be almost pointless without the branch predictor. But this general
problem could be scaled up to the level of hardware threads and
automatic parallelization architectures aim to do just this.
One early proposal
for automatic parallelization is called eager execution. If
you have two available CPU threads and you encounter a branch
instruction, rather than trying to predict which path the program
will take, you simply set one thread to executing one branch path and
the other thread to executing the other branch path and just discard
the results of the wrong path. If you have four threads you can
process two branches one right after the other – with eight threads
you can process three sequential branches, with sixteen threads you
can process four branches, and so on.
Obviously, eager
execution cannot scale to parallelizing entire programs which may
easily have thousands of branch instructions. A CPU with 1,000
threads utilizing EE would waste roughly 99.9% of its processing
power on branch paths that would never occur and would still only be
able to perform lookahead for a handful of branches at a time. To
handle this scaling problem, the idea of disjoint eager execution
(DEE) – and many
other similar architectures – was proposed. A CPU that implements
DEE combines branch-prediction and eager-execution to process the
control-flow graph in probability-order – the most probable paths
are scheduled across all available threads until there are no more
available threads. As the results from those paths become available,
the branch-predictor is updated, pruned paths are discarded and the
currently most-probable paths are again dispatched across the
available threads.
Conceptually, this
is an optimal utilization of a multi-threaded CPU for processing a
serial program. Such a CPU, if it were built, would be utilizing all
the available information about the program’s runtime behavior to
make informed predictions about its future branch behavior, allowing
the CPU to speculatively look ahead across dozens, maybe hundreds of
branches at a time. This would allow a CPU with N threads to provide
speedup in proportion to a non-trivial ratio of N.
The
Quantum::Superpositions package provides us an interesting way of
visualizing what is happening at the hardware level of a DEE CPU.
Let’s suppose we have written a program that searches for a
specific number from a list of 1,000 numbers. Assuming ideal memory
performance (no cost to read or write memory), our 1,000 thread CPU
could potentially locate the number we are searching for in just 1
step (each available thread gets assigned to perform one of the
search comparisons). Using the any() construct from
Quantum::Superpositions, our search would look something like this:
if(any(number_list) == search_number){print “Found it!”}
Read: “If any
number in number_list is equal to the search_number, tell the user we
found it, otherwise, do nothing.” This construct suggests that
there is another way of thinking of the search procedure. Suppose
that the any() construct actually splits the universe into separate
parallel universes until it completes. Instead of having a 1,000
threaded CPU, we would have 1,000 copies of a single-threaded CPU,
each in its own parallel universe. The overall construct would
exhibit the same net behavior, either way. We search through 1,000
numbers in 1 time-step. (For me, this was the “Aha”-moment. Your
mileage may vary.)
More sophisticated
models of branches in the CFG would assign a weight to each edge. It
might even “expand” the CFG itself by duplicating the entire
graph relative to branches that choose between independent
weight-assignments across many edges in the graph[1]. From the
point-of-view of the multi-threaded CPU, the CFG looks like a
weighted spanning tree (which would be a kind of probability-tree) at
any given time. But this would be equally true for our
single-thread-parallel-universe CPU … some parallel universes would
have more weight than others. I assert that the mathematics that
describes the behavior of superposed qubits exactly corresponds to
the weighted-parallel-threading CPU model I have described[2]. In short,
a DEE CPU using an ideal CFG model would have exactly the same net
behavior as a quantum computer processing the same CFG using a number
of qubits that is logarithmic in the number of threads of our DEE CPU. Ten qubits could
theoretically match performance of a 1,024-thread DEE CPU; 20 qubits
could theoretically match performance of a 1.05 million-thread DEE
CPU, and so on.
But hold on a moment
-- where is the quantum fabric of the universe getting the CFG model
from? There is no reason to believe that the quantum computer can
divine what we’re
trying to compute just from the quantum operators we supply for the
computation. What the mathematics of quantum computation conceal is
the assumption of ideal traversal of the CFG as guided by the
weights on its edges – these weights would correlate to the
amplitudes of each superposition state within a quantum
computation. Think of each of the exponential number of states in a
quantum computation as being a path through the spanning tree of the
CFG – if we don’t place weights on the spanning tree, this is
exactly equivalent to Eager Execution! On a computation spanning just
10 qubits, 99.9% of the computation would be wasted, and this waste
factor increases exponentially in the number of qubits.
The mathematics that
describes the bulk of quantum computation does not view the discarded
paths as waste. Rather, it is taken as a given that Nature performs
computation for free, so all we have to do is give her the problem we
want to be solved and then wait while she solves it. But the
practical challenges of protecting qubits from decoherence have
encouraged research into a new area – quantum error-correcting
codes. What I assert is this: quantum error-correcting codes are
logically equivalent to the
weights on a weighted CFG used in a multi-threaded DEE CPU.[2]
Let
us think of every particle of matter in the Universe as though it
were a tiny, single-threaded CPU, capable of performing some kind of
simple functional computation. We are surrounded by a veritable ocean
of such particles. If we could somehow program these particles to
work together at an atomic scale, we could build a plain old
classical multi-threaded CPU
that could compute immense problems through sheer scale – a
quadrillion such particles could easily fit within the confines of a
warehouse and could compute problems at scales that are hardly
imaginable for us, today. The promise of quantum
computation is that we do not need to have a particle-per-thread
because Nature is very clever and she uses her resources much more
efficiently. Instead, we can (theoretically) obtain exponential
advantage by properly preparing qubits. But
once we have prepared our qubits – or so the theory goes in some
quarters – Nature stops being so stingy and she will compute
countless parallel Universes almost all of which will be discarded at
the moment of measurement when all the possible paths
through quantum state-space are discarded and only the actual
path remains.
To
be clear, the theory of
quantum computation is correct. But
the most optimistic
expectations of quantum speedup (exponential in the number of qubits)
do not take into account the complexities of practical quantum
computation. The further we
go down the path of quantum speed-up, the greater the challenges are
going to be. Quantum speedup faces a law of diminishing returns,
nay, it faces a difficulty
curve that becomes asymptotically vertical. You can only get
particles so close to absolute zero, you can only isolate so much
environmental noise, you can only maintain the coherence of qubits
for brief periods of time. Each of these challenges increases in
difficulty at a geometric rate, in the number of qubits. Quantum
error-correction is a promising path forward but, judging as an
outside observer, I sense a certain aloofness among the quantum
physics community, as though quantum computers are just one
breakthrough away from obsoleting all prior computing technologies.
It’s not that easy. Even Nature cannot escape the laws of
information theory, and it is those laws that place fundamental
limits on what both
classical and quantum computers can do.
I
propose a new way of thinking about quantum computation. I propose
that quantum computation is a viable candidate for the
characteristica universalis (CU),
a concept that goes back to 18th-century
philosopher and inventor of the calculus, Gottfried Leibniz. The idea
of the CU is that there must be some language which is ideal for
discussing any problem of science, mathematics or physics – a
perfect language. This perfect language would enable us to think so
much more clearly that it would become the language in which we think
about everything, even fuzzy, complicated subjects like law, social
norms, economics and politics. This
ideal language would be so well-defined, Leibniz imagined, that it
could be processed by a mechanical system, in much the same way that
modern programming languages are mechanically processed by computers.
He called the device that would process the CU the calculus
ratiocinator (CR). I propose
that Nature herself – in all her quantum subtlety – is the CR
that Leibniz was seeking.
In
summary: I assert that the Simulation Hypothesis is the case and that
quantum computation is the characteristical universalis;
the CU is
processed by the universal simulator, which is more accurately
thought of as being an instantiation of Leibniz’s reasoning
machine, the calculus ratiocinator. There
are no “levels” of simulation in the vein of simulation-based
science-fiction, whereby we are trapped inside of an illusion or
construct. Rather, the Universe, considered in its entirety, is
self-simulating or
co-simulating, more
like a peer-to-peer network than like a centrally-controlled
computation.[3]
Next: Part 24c, Finis (stay tuned)
Footnotes:
[1] – Suppose when the
program starts, the user can select between one of several operating
modes… the program’s behavior might be radically different in
each of those modes. In that case, you would want to model each of
those different modes as separate copies of the entire CFG with each
having their own weight-assignments.
[2] – One caveat
is that our CPU would have to be probabilistic in order for this
logical equivalence to hold, not deterministic like the ordinary CPUs
we are familiar with
[3]
As a closing note, the astute reader familiar with quantum phenomena
will notice that I have not mentioned entangled states. It is sometimes
implied that entangled states illustrate that quantum computers can do
things that classical computers cannot -- this is partly true because
classical computers cannot exhibit specifically quantum phenomena.
Nevertheless, every computational consequence of quantum phenomena can be -- and is
-- simulated by classical computers, including entangled states. In
short, entangled states are just one among many counter-intuitive
consequences of the mathematics of quantum computers.