Tuesday, August 1, 2017

Notes on a Cosmology - Part 9, Algorithmic Inference

In this post, we are going to start tying together some of the concepts introduced in the first eight parts of this series. Algorithmic inference - more commonly called Solomonoff's theory of induction, or Solomonoff induction, for short - is the theory of inference based on algorithmic information theory. I am using the term "algorithmic inference" only to contrast it with Bayesian inference. However, we will find that algorithmic inference and Bayesian inference form a unified theory.

Algorithmic probability

The algorithmic probability of a string x, relative to a reference computer U, is:

$$P_U(x) = 2^{-K(x)} \qquad \text{(Eq. 1)}$$

where K(x) is the prefix Kolmogorov complexity of x - the length of the shortest program p for which U(p) = x.

This is a well-behaved probability only if the programs p in the definition of K(x) are prefix-free; the Kraft inequality then guarantees that the total over all x converges. We can think of P_U(x), loosely, as the probability that x is observed at the output of U when U is fed a random (prefix-free) program p. We can also define a length-based probability measure over all x, without reference to U:

$$P(x) = 2^{-|x|} \qquad \text{(Eq. 2)}$$
Since the strings x are taken from a prefix-free set, this is just the probability of x arising as the result of flipping a fair coin. We can now define the inductive probability of a string by weighing these two probabilities against each other:

$$P_{\mathrm{ind}}(x) = \frac{P(x)}{P(x) + P_U(x)} = \frac{2^{-|x|}}{2^{-|x|} + 2^{-K(x)}} \qquad \text{(Eq. 3)}$$

P_ind(x) should be read as "the probability that x arose as the result of a random process (e.g. flipping a coin) rather than an algorithmic process (e.g. a random program fed to U)". The lower K(x), the more structured x is, and the less likely it is that x is the result of a random process rather than an algorithmic one. Taking the odds, x is about 2^(|x|-K(x)) times more likely to be the result of an algorithmic process than of a random process[1]. When P_ind(x) is close to 0.5 - that is, when K(x) is close to |x| - we say that x is random or, equivalently, that it arose as the result of a random process.
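To make the arithmetic concrete, here is a small Python sketch of Eqs. 1-3. This is an illustration only, not a real computation: K(x) is uncomputable, so the complexity value below is simply an assumed input.

```python
from fractions import Fraction

def p_coin(n_bits: int) -> Fraction:
    """Eq. 2: probability of one particular n-bit string from fair coin flips."""
    return Fraction(1, 2 ** n_bits)

def p_algorithmic(k: int) -> Fraction:
    """Eq. 1: 2^-K(x), taking an *assumed* complexity K(x) = k as input."""
    return Fraction(1, 2 ** k)

def p_ind(n_bits: int, k: int) -> Fraction:
    """Eq. 3: probability that x came from coin flips rather than from U."""
    random_w, algo_w = p_coin(n_bits), p_algorithmic(k)
    return random_w / (random_w + algo_w)

# A hypothetical 64-bit string with an assumed K(x) of 16 bits:
print(float(p_ind(64, 16)))   # ~3.6e-15: almost certainly algorithmic in origin
print(2 ** (64 - 16))         # odds of ~2.8e14 in favour of an algorithmic origin

# A "random" string, with K(x) close to |x|:
print(float(p_ind(64, 64)))   # 0.5
```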

Epicurus and William of Ockham
There are also some things for which it is not enough to state a single cause, but several, of which one, however, is the case. Just as if you were to see the lifeless corpse of a man lying far away, it would be fitting to list all the causes of death in order to make sure that the single cause of this death may be stated. For you would not be able to establish conclusively that he died by the sword or of cold or of illness or perhaps by poison, but we know that there is something of this kind that happened to him. - Epicurus
Entities should not be multiplied unnecessarily. - William of Ockham
In the first quote, Epicurus explains that when there are multiple explanations consistent with the data, we should keep them all. Ockham, for his part, holds that "entities" - or, in our case, explanations - should not be multiplied without need. In short, among competing explanations we should prefer the simplest. This is commonly called "Ockham's razor".

Ideally, our theory of inductive inference should satisfy both Epicurus and Ockham: we should keep every explanation that remains possible, however unlikely, while giving more weight to the more likely explanations than to the less likely ones. Bayesian inference aims at this goal, but without a way to put the prior probabilities on a firm footing it remains squishy and informal. We need something more rigorous.

Inductive inference

Algorithmic information theory considers the complexity of a string, and algorithmic probability assigns every string a probability. In both cases we treat the string as a single, indivisible entity. But suppose we thought of the string as a sequence and asked the following question: what is the next bit? This is much like how we think about a discrete communication channel in Shannon information theory, and it is also much like how we think about a scientific experiment.

Let the n-bit prefix of a string x be written x_n. Solomonoff's theory of induction asks the following question: given x_n, what is the next bit? Given U, the answer is fairly straightforward. Let U(p) = x_n·b denote that p halts and outputs the sequence x_n followed by b, where b is an element of {0,1}. The probability that b = 0, given x_n, is:

$$P(b = 0 \mid x_n) \;=\; \frac{\displaystyle\sum_{p \,:\, U(p) = x_n 0} 2^{-|p|}}{\displaystyle\sum_{b' \in \{0,1\}} \;\sum_{p \,:\, U(p) = x_n b'} 2^{-|p|}} \qquad \text{(Eq. 4) [see footnote 2]}$$

These sums provably converge if the programs p are prefix-free. The key to understanding the generality of this probability is to realize that the search over U(p) exhausts every kind of order or structure, because U can implement any computable function. Unlike P_U, Eq. 4 counts the weight of all programs that output the sequence, not just the shortest one; note, though, that the vast majority of the weight is contributed by short programs. For a random string, no program shorter than the string itself contributes to the result - that is precisely the definition of a random string.
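To make Eq. 4 concrete, here is a deliberately toy, resource-bounded Python sketch. Everything about it is an assumption for illustration: run_toy_machine stands in for a universal machine (it is a made-up rule, not U), and the enumeration is truncated at a small program length, so this is a cartoon of the idea rather than Solomonoff induction itself.

```python
from itertools import product
from typing import Optional

def run_toy_machine(program: str) -> Optional[str]:
    """Stand-in for U(p): maps a bit-string 'program' to a bit-string output,
    or None if it "does not halt". The rule is arbitrary and exists only so
    the example runs: programs starting with '1' output their tail twice."""
    if not program.startswith("1"):
        return None
    return program[1:] * 2

def predict_next_bit(prefix: str, max_len: int = 12) -> float:
    """Truncated version of Eq. 4: accumulate 2^-|p| over all programs up to
    max_len bits whose output extends `prefix`, split by the predicted bit."""
    weights = {"0": 0.0, "1": 0.0}
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            p = "".join(bits)
            out = run_toy_machine(p)
            if out is not None and out.startswith(prefix) and len(out) > len(prefix):
                weights[out[len(prefix)]] += 2.0 ** -len(p)
    total = weights["0"] + weights["1"]
    return weights["0"] / total if total else 0.5  # no evidence: fall back to 1/2

print(predict_next_bit("0101"))  # P(next bit = 0 | 0101) under the toy machine
```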

Bayesian inference and algorithmic inference

In Bayesian inference, we have no obvious way to choose a prior probability distribution. Solomonoff's theory of induction gives us what is termed a universal prior probability distribution. This distribution gives every computable hypothesis some non-zero probability - including hypotheses that would never occur to a human being - while giving exponentially more weight to shorter (simpler) hypotheses. It therefore satisfies both Epicurus's principle of multiple explanations and Ockham's razor.

Let's relate the framework for Bayesian hypothesis testing that we introduced in Part 8 to Solomonoff's probability measure. We want to know the probability that the evidence observed in the laboratory (E) is the result of hypothesis Hi. Let's reframe the evidence E as the output of U, and Hi as the input to U. In this setting we have many possible hypotheses, all with non-zero probability, so we have to count every H that outputs E and assign each a probability. The result:

$$P(H_i \mid E) \;=\; \frac{2^{-|H_i|}}{\displaystyle\sum_{j \,:\, U(H_j) = E} 2^{-|H_j|}} \qquad \text{(Eq. 5)}$$
This assumes, of course, that Hi itself outputs E. Note that the numerator of Eq. 5 is Eq. 2 (with x = Hi), while the denominator is the same kind of weighted program sum that appears in Eq. 4 - it is the ratio of the length-based weight of this hypothesis (Hi) to the total weight of every hypothesis (Hj) that also outputs E. See this page from LessWrong for a detailed explanation of how we translate between Bayesian theory and algorithmic theory.
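A minimal Python sketch of Eq. 5, under the same caveats as above: it assumes a finite, hand-picked candidate set of hypothesis "programs" and a hypothetical run_hypothesis interpreter, here faked with a lookup table.

```python
def posterior(hypotheses, evidence, run_hypothesis):
    """Eq. 5 over a finite candidate set: weight each hypothesis program by
    2^-|H| and keep only those whose output reproduces the evidence exactly."""
    weights = {
        h: 2.0 ** -len(h)
        for h in hypotheses
        if run_hypothesis(h) == evidence
    }
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()} if total else {}

# Hypothetical example: three candidate "programs" (as bit strings), two of
# which reproduce the observed evidence; the shorter one gets more weight.
fake_outputs = {"110": "0101", "101010": "0101", "111": "0011"}
post = posterior(fake_outputs.keys(), "0101", fake_outputs.get)
print(post)  # {'110': 0.888..., '101010': 0.111...}
```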

Note that Eq. 5 only works for exact reproduction of the evidence E. The algorithmic theory of inductive inference generalizes to approximate reproduction of the evidence, using the very tools of information theory we covered in an earlier part of this series.

Recapitulation

Let's take a moment to reflect on where we've come in this series. We started with the long-range goal of asking, and answering, the question of how things hang together - in the broadest possible sense - with all other things, given the Simulation Hypothesis. To this end, we started building a toolbox for thinking. We touched on physical reasoning, information, entropy, computation, algorithmic information, Bayesian inference and now the algorithmic theory of inductive inference. These are not just a random collection of tools; they are a powerful set of tools that connect the foundations of science and mathematics at the deepest level. We are getting closer to applying our new toolbox to causal reasoning and physical reasoning in a way that was not possible for Ernst Mach.

As I noted earlier, the theory of communication channels applies naturally to scientific experiments. The algorithmic theory of induction closes the last gap in Bayesian theory by giving us a universal prior probability distribution, and it provides an even deeper look into what science really is. Loosely speaking, science is data compression, where the data being compressed are the phenomena themselves. In short, we want to compress the phenomena to their smallest possible description and derive from this description a theory (a program, so to speak) from which we can predict future phenomena (including observations).

That "mathematics [is] ... so admirably appropriate to the objects of reality", as Einstein noted, is looking less and less coincidental. However, we must still refrain from jumping off into the Never Never Land of metaphysical speculation. In the following posts, I will introduce one of the mathematical objects that is going to act as a guiding light throughout much of the rest of this series - the halting probability.

Next: Part 10, Quantum Computation

---

1. This intuitive argument is borrowed from here.

2. This approach to describing Solomonoff's theory of induction is informal. P, here, is really a "universal semi-measure", not a probability. There are a lot of other technicalities involved and what is happening here is more subtle than the treatment we have given in this series. But we are following the toolbox approach to thinking, so the minutiae are left as an exercise.
