Thursday, June 14, 2018

Notes on a Cosmology - Part 24b, Epilogue cont'd


One of the delights of quantum science is that it is filled with puzzles, paradoxes and surprises. In this post, I want to lay out the close connection between quantum superposition and parallel computation. At first, these might seem to be entirely unrelated ideas – after all, superposition is a wave phenomenon (continuous), while parallel computation describes a discrete process. As we saw in the previous post, however, the discrete and the continuous are not as incompatible as they might seem. Often, the distinction is a matter of perspective.

The Perl programming language has become a workhorse of Linux systems, bundled by default into most major distributions. Around 2000, a Perl package named Quantum::Superpositions was added to the Perl package repository (CPAN). This package added several features to the language, but the two most notable for our purposes are the any() construct and the all() construct. These constructs allow a program to treat a list as though it were a single value, in one of two different ways. If you are interested in a more detailed description, the documentation is quite straightforward. The irony is that these constructs have nothing to do with simulating quantum computations – so why name the package Quantum::Superpositions?
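
For concreteness, here is a minimal sketch of how the two constructs read in practice (the list and the values being tested are made up purely for illustration):

use strict;
use warnings;
use Quantum::Superpositions;    # exports any() and all()

my @values = (3, 8, 15, 42);

# any(): the comparison succeeds if it holds for at least one element
print "42 is in the list\n" if any(@values) == 42;

# all(): the comparison succeeds only if it holds for every element
print "every value is below 100\n" if all(@values) < 100;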

In order to understand the rationale behind the name, we have to understand exactly what is happening in a qubit that is taking advantage of the power of quantum superposition. Physicists will often describe it in terms of parallel universes – imagine that the qubit exists simultaneously in two separate universes: in one universe, its value is 0, and in the other, its value is 1. In this sense, the qubit is in “two mutually exclusive states at once.” These kinds of mind-bending descriptions tend to make us feel either monumentally stupid or secretly skeptical. Like the story of the naked Emperor, however, we dare not air our skepticism lest we look like the only dunce in the room.

On modern computers, operating systems use a technique called “multi-tasking” to perform many unrelated tasks in such a way that they all appear to run in parallel. The core technique of multi-tasking is time-slicing … if there are 100 processes running on your system, the operating system divides each unit of time so that each process is allowed to run for a small slice of time before it is forced to give up control to the next process scheduled to run. Because the CPU in a typical computer operates at billions of cycles per second, the user cannot perceive any observable artifacts from the time-slicing.

When more hardware threads are available to the operating system (i.e. when running on a multi-core CPU), the operating system performs less time-slicing, because each task can be allowed to run for a longer slice of time on each thread. Suppose we have a system with four threads: the load on each thread is one-fourth the load a single-threaded system would carry, so the system only needs to time-slice 25% as much, allowing each process to run four times as long without changing the total throughput of the system as perceived by the user. Now suppose we have a system with eight threads. All these numbers change proportionally – each thread carries 1/8th the load of a single-threaded system, and we can allow each process to run eight times as long. As we increase the number of threads until we have one thread per process, we reach a point at which every process can run 100% of the time, because every process can monopolize a thread on the CPU without interfering with any other running process – all while maintaining the same multi-tasking guarantees that the operating system must provide to the end user.
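
A toy calculation makes this proportionality explicit (the numbers are purely illustrative):

use strict;
use warnings;

# With P runnable processes and N hardware threads, each process gets
# roughly N/P of a thread's time, so the scheduler has to time-slice
# proportionally less often as N grows.
my $processes = 100;
for my $threads (1, 4, 8, 100) {
    my $share = $threads >= $processes ? 1 : $threads / $processes;
    printf "%3d threads: each process runs %5.1f%% of the time\n",
           $threads, 100 * $share;
}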

But given the fantastic performance of time-slicing on small numbers of threads, it is obvious that a hardware system with so many threads that every process is running on its own dedicated thread would be wasteful. Most processes (and the threads they are running on) would be idle most of the time. Most of the potential computing power of a CPU with hundreds of threads would be wasted. This brings us to the problem of automatic parallelization.

To understand the problem of automatic parallelization, let’s look at the behavior of a single computer program running on the CPU (suppose it has a complete monopoly on the CPU). How can this program keep the CPU as busy as possible, especially if the CPU has multiple threads? One approach is for the program to explicitly parallelize itself, cutting up its work into blocks so that each block can be dispatched to a separate thread on the CPU. Another approach – the approach that has dominated from the advent of superscalar CPUs in the mid-1990s until today – is to allow the hardware to try to “extract” parallelism from the software, without explicit assistance from the software. This way, the software can focus on describing what it is trying to compute, and the hardware can focus on how to compute it.
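
The first approach might look something like the following sketch in Perl (assuming a perl built with ithreads support; the workload and the number of blocks are arbitrary, chosen only for illustration):

use strict;
use warnings;
use threads;    # requires a perl built with ithreads support

# Explicit parallelization: the program itself cuts its work into blocks
# and hands each block to its own thread.
my @work   = (1 .. 10_000);
my $blocks = 4;
my $size   = int(@work / $blocks);

my @workers;
for my $i (0 .. $blocks - 1) {
    my $lo = $i * $size;
    my $hi = ($i == $blocks - 1) ? $#work : $lo + $size - 1;
    my @block = @work[$lo .. $hi];
    push @workers, threads->create(sub { my $sum = 0; $sum += $_ for @_; $sum }, @block);
}

# Each thread computes a partial sum; joining the threads collects them.
my @partials = map { $_->join } @workers;
my $total = 0;
$total += $_ for @partials;
print "total = $total\n";

Everything that follows is about the second approach: letting the hardware discover the parallelism on its own.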

How does the hardware do this? After all, the hardware has only a very low-level “machine code” view of what the software is doing. The hardware distinguishes between two classes of instructions: non-branching and branching. Non-branching instructions always proceed in order, one after the other, like the steps in a cooking recipe. Branching instructions, by contrast, can “jump” to any other instruction, based on a logical condition. Branching instructions are how software implements conditional constructs like “if-then-else”. Whenever you want some part of a program to be executed only under certain conditions, you will need a branching instruction to implement that behavior in the machine code.
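
To see what this means concretely, here is an ordinary conditional followed, in the comments, by a rough pseudo-assembly rendering of how it is lowered to machine code (the pseudo-assembly is illustrative, not any particular instruction set):

use strict;
use warnings;

my $x = 5;
my $y;

# A conditional construct in the source code ...
if ($x > 0) {
    $y = 1;      # "then" path
} else {
    $y = -1;     # "else" path
}
print "y = $y\n";

# ... is implemented with a branching instruction, roughly:
#
#       compare x, 0
#       branch-if-not-greater ELSE   ; the branch: control may jump away
#       y = 1
#       jump END
#   ELSE:
#       y = -1
#   END: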

We can use code-visualization tools to see how machine code appears to the hardware – the result is called a control-flow graph (CFG). A CFG shows only the branching instructions (as vertices in the graph), while all other instructions are abstracted away into the graph's edges. Whenever the CPU encounters a branch, it faces a conundrum – which path should it follow? It can wait for the outcome of the branch condition, but this wastes valuable cycles that the CPU could have used to get a head-start on the path that will actually be taken. To avoid wasting those cycles, the CPU performs “branch prediction”: it makes a guess as to which path will be taken and starts executing instructions down that path. Branches can be nested many levels deep, and the branch predictor attempts to make a reasonable guess at each one. When the branch predictor is wrong, the cost of the misprediction is about the same as having waited for the branch condition to be evaluated; when it is right, the program proceeds as if there had been no branch at all.
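
To make “making a guess” less mysterious, here is a toy predictor in the spirit of one classic scheme – a two-bit saturating counter per branch. This is a sketch for illustration only, not a model of any real CPU's predictor:

use strict;
use warnings;

# Two-bit saturating counter per branch:
#   0,1 => predict "not taken";  2,3 => predict "taken".
my %counter;

sub predict {
    my ($branch) = @_;
    return (($counter{$branch} // 2) >= 2) ? 1 : 0;
}

sub update {
    my ($branch, $taken) = @_;
    my $c = ($counter{$branch} // 2) + ($taken ? 1 : -1);
    $c = 3 if $c > 3;
    $c = 0 if $c < 0;
    $counter{$branch} = $c;
}

# Simulate a branch that is taken 9 times out of 10 (e.g. a loop-back edge).
my ($hits, $total) = (0, 0);
for my $i (1 .. 1000) {
    my $actual = ($i % 10 == 0) ? 0 : 1;
    $hits++ if predict('loop_branch') == $actual;
    update('loop_branch', $actual);
    $total++;
}
printf "prediction accuracy: %.1f%%\n", 100 * $hits / $total;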

When a CFG becomes sufficiently deep and intricate, the branch predictor is bound to make mistakes and the penalties pile up. A modern, superscalar CPU may have as many as eight execution units (EUs) available for branch-predicted instructions to flow through – keeping all those EUs busy is the scheduler's job, a job that would be almost pointless without the branch predictor. But this general problem can be scaled up to the level of hardware threads, and automatic-parallelization architectures aim to do just that.

One early proposal for automatic parallelization is called eager execution (EE). If you have two available CPU threads and you encounter a branch instruction, then rather than trying to predict which path the program will take, you simply set one thread executing one branch path and the other thread executing the other branch path, and later discard the results of whichever path turns out to be wrong. If you have four threads you can process two branches in a row – with eight threads you can process three sequential branches, with sixteen threads you can process four branches, and so on.
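
The arithmetic behind that scaling is worth spelling out – covering every outcome of k back-to-back branches costs 2^k threads, and only one of those paths is the one the program actually takes:

use strict;
use warnings;

# Eager execution's cost in threads: 2**k threads cover k sequential
# branches, and exactly one of the 2**k paths turns out to be real.
for my $branches (1 .. 10) {
    my $threads = 2 ** $branches;
    printf "%2d sequential branches -> %5d threads, %6.2f%% of the work kept\n",
           $branches, $threads, 100 / $threads;
}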

Obviously, eager execution cannot scale to parallelizing entire programs, which may easily contain thousands of branch instructions. A CPU with 1,000 threads utilizing EE would waste roughly 99.9% of its processing power on branch paths that would never occur, and would still only be able to look ahead across a handful of branches at a time. To handle this scaling problem, the idea of disjoint eager execution (DEE) – along with many similar architectures – was proposed. A CPU that implements DEE combines branch prediction and eager execution to process the control-flow graph in probability order – the most probable paths are scheduled across all available threads until there are no more available threads. As results from those paths become available, the branch predictor is updated, pruned paths are discarded, and the currently most-probable paths are again dispatched across the available threads.
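
A toy sketch of that scheduling policy: keep a frontier of candidate paths through the CFG, score each by the product of its predicted branch probabilities, and always dispatch the most probable ones to the available threads. All numbers and details here are made up for illustration; real DEE proposals differ in many particulars:

use strict;
use warnings;

my $threads  = 4;
my @frontier = ( { path => '', prob => 1.0 } );

for my $depth (1 .. 3) {                 # walk three branch levels deep
    my @next;
    for my $cand (@frontier) {
        # Toy model: every branch is predicted 80% taken / 20% not-taken.
        push @next, { path => $cand->{path} . 'T', prob => $cand->{prob} * 0.8 };
        push @next, { path => $cand->{path} . 'N', prob => $cand->{prob} * 0.2 };
    }
    # Probability order: keep only the most likely paths for the threads...
    my $keep  = (@next < $threads) ? $#next : $threads - 1;
    @frontier = (sort { $b->{prob} <=> $a->{prob} } @next)[0 .. $keep];
    # ...the rest are pruned (pure eager execution would have run them all).
}

printf "path %-3s  probability %.3f\n", $_->{path}, $_->{prob} for @frontier;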

Conceptually, this is an optimal utilization of a multi-threaded CPU for processing a serial program. Such a CPU, if it were built, would be using all the available information about the program's runtime behavior to make informed predictions about its future branch behavior, allowing it to speculate across dozens, maybe hundreds, of branches at a time. This would allow a CPU with N threads to deliver a speedup proportional to a non-trivial fraction of N.

The Quantum::Superpositions package gives us an interesting way of visualizing what is happening at the hardware level of a DEE CPU. Let's suppose we have written a program that searches for a specific number in a list of 1,000 numbers. Assuming ideal memory performance (no cost to read or write memory), our 1,000-thread CPU could potentially locate the number we are searching for in just one step (each available thread gets assigned one of the search comparisons). Using the any() construct from Quantum::Superpositions, our search would look something like this:

use Quantum::Superpositions;

if (any(@number_list) == $search_number) {
    print "Found it!\n";
}

Read: “If any number in number_list is equal to search_number, tell the user we found it; otherwise, do nothing.” This construct suggests another way of thinking about the search procedure. Suppose that the any() construct actually splits the universe into separate parallel universes until it completes. Instead of having a 1,000-thread CPU, we would have 1,000 copies of a single-threaded CPU, each in its own parallel universe. The overall construct would exhibit the same net behavior either way: we search through 1,000 numbers in one time-step. (For me, this was the “Aha” moment. Your mileage may vary.)

More sophisticated models of branches in the CFG would assign a weight to each edge. Such a model might even “expand” the CFG itself, duplicating the entire graph for branches that select between independent weight-assignments across many of its edges[1]. From the point of view of the multi-threaded CPU, the CFG looks, at any given time, like a weighted spanning tree (a kind of probability tree). But the same would be true for our single-thread-per-parallel-universe CPU … some parallel universes would simply carry more weight than others. I assert that the mathematics describing the behavior of superposed qubits corresponds exactly to the weighted-parallel-threading CPU model I have described[2]. In short, a DEE CPU using an ideal CFG model would have exactly the same net behavior as a quantum computer processing the same CFG using a number of qubits that is logarithmic in the number of threads of our DEE CPU. Ten qubits could theoretically match the performance of a 1,024-thread DEE CPU; 20 qubits could theoretically match the performance of a roughly 1.05-million-thread DEE CPU, and so on.
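
The arithmetic of that correspondence (as I am asserting it, at least) is simply the exponential growth of the state space:

use strict;
use warnings;

# n qubits span 2**n superposition states; on the argument above, they
# stand in for 2**n DEE threads.
for my $qubits (10, 20, 30) {
    printf "%2d qubits  <->  %s threads\n", $qubits, commify(2 ** $qubits);
}

sub commify {
    my $n = reverse shift;      # classic perlfaq comma-insertion idiom
    $n =~ s/(\d{3})(?=\d)/$1,/g;
    return scalar reverse $n;
}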

But hold on a moment – where is the quantum fabric of the universe getting the CFG model from? There is no reason to believe that the quantum computer can divine what we’re trying to compute just from the quantum operators we supply for the computation. What the mathematics of quantum computation conceals is the assumption of an ideal traversal of the CFG, as guided by the weights on its edges – these weights would correlate with the amplitudes of each superposition state within a quantum computation. Think of each of the exponentially many states in a quantum computation as a path through the spanning tree of the CFG – if we don’t place weights on the spanning tree, this is exactly equivalent to eager execution! On a computation spanning just 10 qubits, 99.9% of the computation would be wasted, and the number of wasted paths grows exponentially with the number of qubits.

The mathematics that describes the bulk of quantum computation does not view the discarded paths as waste. Rather, it is taken as a given that Nature performs computation for free, so all we have to do is give her the problem we want solved and then wait while she solves it. But the practical challenges of protecting qubits from decoherence have encouraged research into a new area – quantum error-correcting codes. What I assert is this: quantum error-correcting codes are logically equivalent to the weights on a weighted CFG used in a multi-threaded DEE CPU.[2]

Let us think of every particle of matter in the Universe as though it were a tiny, single-threaded CPU, capable of performing some kind of simple functional computation. We are surrounded by a veritable ocean of such particles. If we could somehow program these particles to work together at an atomic scale, we could build a plain old classical multi-threaded CPU that could compute immense problems through sheer scale – a quadrillion such particles could easily fit within the confines of a warehouse and could compute problems at scales that are hardly imaginable for us today. The promise of quantum computation is that we do not need a particle per thread, because Nature is very clever and uses her resources much more efficiently. Instead, we can (theoretically) obtain an exponential advantage by properly preparing qubits. But once we have prepared our qubits – or so the theory goes in some quarters – Nature stops being so stingy: she will compute countless parallel Universes, almost all of which will be discarded at the moment of measurement, when all the possible paths through quantum state-space are thrown away and only the actual path remains.

To be clear, the theory of quantum computation is correct. But the most optimistic expectations of quantum speedup (exponential in the number of qubits) do not take into account the complexities of practical quantum computation. The further we go down the path of quantum speed-up, the greater the challenges become. Quantum speedup faces a law of diminishing returns – nay, a difficulty curve that becomes asymptotically vertical. You can only get particles so close to absolute zero, you can only isolate so much environmental noise, you can only maintain the coherence of qubits for brief periods of time. Each of these challenges increases in difficulty at a geometric rate in the number of qubits. Quantum error-correction is a promising path forward but, judging as an outside observer, I sense a certain aloofness in the quantum-physics community, as though quantum computers were just one breakthrough away from obsoleting all prior computing technologies. It’s not that easy. Even Nature cannot escape the laws of information theory, and it is those laws that place fundamental limits on what both classical and quantum computers can do.

I propose a new way of thinking about quantum computation. I propose that quantum computation is a viable candidate for the characteristica universalis (CU), a concept that goes back to the 17th-century philosopher and co-inventor of the calculus, Gottfried Leibniz. The idea of the CU is that there must be some language which is ideal for discussing any problem of science, mathematics or physics – a perfect language. This perfect language would enable us to think so much more clearly that it would become the language in which we think about everything, even fuzzy, complicated subjects like law, social norms, economics and politics. This ideal language would be so well-defined, Leibniz imagined, that it could be processed by a mechanical system, in much the same way that modern programming languages are mechanically processed by computers. He called the device that would process the CU the calculus ratiocinator (CR). I propose that Nature herself – in all her quantum subtlety – is the CR that Leibniz was seeking.

In summary: I assert that the Simulation Hypothesis is the case and that quantum computation is the characteristica universalis; the CU is processed by the universal simulator, which is more accurately thought of as an instantiation of Leibniz’s reasoning machine, the calculus ratiocinator. There are no “levels” of simulation in the vein of simulation-based science-fiction, whereby we are trapped inside an illusion or construct. Rather, the Universe, considered in its entirety, is self-simulating or co-simulating – more like a peer-to-peer network than a centrally-controlled computation.[3]

Next: Part 24c, Finis (stay tuned)

Footnotes:

[1] – Suppose that when the program starts, the user can select one of several operating modes… the program’s behavior might be radically different in each of those modes. In that case, you would want to model each of those modes as a separate copy of the entire CFG, each with its own weight-assignments.

[2] – One caveat is that our CPU would have to be probabilistic in order for this logical equivalence to hold, not deterministic like the ordinary CPUs we are familiar with.

[3] – As a closing note, the astute reader familiar with quantum phenomena will notice that I have not mentioned entangled states. It is sometimes implied that entangled states illustrate that quantum computers can do things that classical computers cannot – this is partly true because classical computers cannot exhibit specifically quantum phenomena. Nevertheless, every computational consequence of quantum phenomena can be – and is – simulated by classical computers, including entangled states. In short, entangled states are just one among many counter-intuitive consequences of the mathematics of quantum computers.
