Saturday, July 29, 2017

Notes on a Cosmology - Part 7, Algorithmic Information

In Parts 3, 4 and 5, we introduced the concept of Shannon information. In this post, we are going to take another view of information by looking at a message from the point-of-view of a digital computer.

Consider the message "SEND MORE MONEY" - as English speakers, it is obvious to us that this message could be broken up into 3 words, or 13 letters (and 2 spaces), and so on. But, from the point of view of information theory, the message itself is just a single message, drawn from the set of possible messages which our communication system can convey. Considered by itself, this message contains no more information than a teapot or an orangutan does. That is to say, this message can only be said to contain information relative to a set (or ensemble) of possible messages. As Shannon pointed out in a portion of MTC we quoted earlier, it was once common practice for commercial telegraphers to encode certain common greetings as 4- or 5-digit numbers, allowing a longer greeting like "Merry Christmas and Happy New Year" to be conveyed with a few numbers keyed in Morse code[1]. The message "SEND MORE MONEY" should not be thought of, from the point-of-view of Shannon information theory, as a composite of shorter messages (e.g. "words" or "characters"). Rather, it should be thought of as a single element selected from a larger ensemble of possible messages.

This is in contrast to the field of algorithmic information theory[2], where we are primarily interested in looking at the information content of a single message, usually thought of as a string. I will be using the word "string" synonymously with "binary sequence", i.e. any element of {0, 1}* (see the Wikipedia article on strings for more information).

In the digital era, we are familiar with the idea of discretization - an image displayed on a computer screen is really represented as an elaborately long sequence of binary digits (bits). In a 24-bit color image, each pixel of the image (each discrete dot of color) corresponds to one 24-bit chunk of that long sequence. For example, a single 1920x1080 image at 24-bit color depth is roughly 6.2 million bytes in size, or about 49.8 million bits.
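As a quick check of that arithmetic (nothing assumed here beyond the dimensions just given):

    # Bit and byte count of a 1920x1080 frame at 24 bits per pixel.
    width, height, bits_per_pixel = 1920, 1080, 24
    total_bits = width * height * bits_per_pixel
    print(total_bits)       # 49,766,400 bits, i.e. about 49.8 million
    print(total_bits // 8)  # 6,220,800 bytes, i.e. about 6.2 million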

We know that we can use information theory to compress the image if it has statistical regularities - fortunately, almost any real image taken with a camera does. It might seem that we have somehow "measured the information content" of the original image in order to reduce it to its compressed size. But this is not true. We have, rather, measured the information content of the pixels of the image (or, more likely, 2-dimensional blocks of pixels) relative to the image itself. In other words, the image's possible pixel values are the message space, the ensemble containing all the messages we will be sending. Each pixel or pixel-block within the image is a "message" selected from that message space. We can encode the image to advantage by building a model of the message space - a description of which pixel values actually occur in the image, and how often, rather than treating all allowable pixel values as equally likely. For example, if we are encoding a nature photograph and the majority of the background is green in color, the green pixel values will be allotted the majority of the code space (and thus shorter codewords), while other colors will be assigned a smaller portion of the code space (and longer codewords). This stretching of the code space is the model. By transmitting this model along with the encoded image, we are able to achieve good compression ratios.
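As an informal illustration of what "building a model of the message space" can mean - not the particular scheme any real image codec uses - the sketch below tallies how often each pixel value occurs in a made-up, mostly-green image and computes the ideal coded size implied by that frequency model (the bound an arithmetic or Huffman coder would approach):

    # A minimal sketch of modeling the message space: count pixel-value frequencies,
    # then compute the ideal coded size under that model. The pixel data is invented
    # for the example; real codecs model 2-D blocks of pixels, not single values.
    import math
    from collections import Counter

    def ideal_coded_bits(pixels):
        """Sum of -log2 P(pixel) over the image, with empirical frequencies as the model."""
        counts = Counter(pixels)
        total = len(pixels)
        return sum(-c * math.log2(c / total) for c in counts.values())

    # A toy "nature photograph": mostly green, with some brown and sky (24-bit RGB values).
    GREEN, BROWN, SKY = 0x228B22, 0x8B4513, 0x87CEEB
    pixels = [GREEN] * 9000 + [BROWN] * 700 + [SKY] * 300

    print(24 * len(pixels), "raw bits")                      # 240,000 bits uncompressed
    print(round(ideal_coded_bits(pixels)), "modeled bits")   # roughly 5,600 bits

In practice the frequency table (or an agreed-upon approximation of it) has to travel with the encoded data - that is the "transmitting the model along with the image" step.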

Algorithmic information theory looks at any string as an information-bearing object, without respect to an ensemble of possible messages. Thus, we can ask: "How much information is in the string making up this image?" without referring to the set of all possible images.

Let us suppose we have a general-purpose computer, U. This computer accepts a program p as input, and produces some string x as output, and halts:
x = U(p)
We say that the algorithmic complexity (or Kolmogorov complexity) of a string x is the length of the shortest program that, when run on U, outputs x and halts. Leaving aside some important technical details, we can intuitively define the Kolmogorov complexity as follows[3]:
K(x) = min { |p| : U(p) = x }
... where |p| is read "length of string p" (we assume U operates on the same kinds of strings as x, without loss of generality). For example, the Kolmogorov complexity of our camera image, above, is defined as the length of the shortest computer program that, when run on U, will output the original string encoding the image.
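To make the definition concrete, here is a runnable caricature. The reference machine U below is an invented toy - a 3-bit repeat count followed by a pattern to repeat - and is nothing like a real universal computer; it is only because the machine is so limited that the brute-force search over programs is guaranteed to finish:

    # A toy illustration of K(x) = min{ |p| : U(p) = x }. Programs are bit strings:
    # the first 3 bits encode a repeat count (0-7), the rest is a pattern that the
    # machine outputs that many times. K_toy is exact only relative to this toy U.
    from itertools import product

    def U(p):
        """Toy reference machine: '<3-bit count><pattern>' -> pattern repeated count times."""
        if len(p) < 3:
            return ""
        count = int(p[:3], 2)
        return p[3:] * count

    def K_toy(x, max_len=16):
        """Length of the shortest program p (up to max_len bits) with U(p) == x, else None."""
        for length in range(1, max_len + 1):
            for bits in product("01", repeat=length):
                if U("".join(bits)) == x:
                    return length
        return None

    print(K_toy("01010101010101"))  # regular 14-bit string -> 5 (program '111' + '01')
    print(K_toy("0110100"))         # irregular 7-bit string -> 10 (count 1, then the string itself)

Even on this toy machine the pattern we care about shows up: regular strings admit programs much shorter than themselves, while irregular strings need roughly their own length plus a constant.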

The astute reader will have noticed that we have played fast and loose with U. There is no "natural" choice of U, and the internal implementation of U can have an arbitrarily large effect on K(x) - for example, we might have a special computer instruction encoded by the string consisting solely of the bit 0 that causes U to output this particular image string, in its entirety. The complexity of the image, relative to this U, would be 1 since that is the length of the input program string. There is no need to argue about such pedantic details, however, since there is an invariance theorem that shows that, for any choice of U1 and U2, there is some constant c such that KU1(x) ≤ KU2(x) + c, for all x. Thus, no more than a constant advantage in the representation of arbitrary strings can be gained by the choice of computer implementation.

Shannon information defined the information content of a message relative to the ensemble of all possible messages that can be sent by the communication system. By contrast, algorithmic information defines the information content of a string (which may or may not be a message) relative to a reference computer, U.

One of the properties of K(x) is that it is uncomputable. This means that there is no algorithm that, given an arbitrary string x, will output K(x) - no amount of computing time suffices to evaluate it in general. Here is a visualization of what K(x) might look like for short strings x:

The image depicts that K(x) globally tracks log2(x) (where x is read as a binary number). This is a result of the fact that almost all strings are random (we will define what this means) and cannot be compressed - the shortest program representing the string has roughly the same length as the string itself. In addition, K(x) falls far below log2(x) for certain x. These x are "highly compressible", meaning there are very short programs that encode them. There is a lower bound below which K never falls, but this bound grows more slowly than any unbounded computable function - for the purposes of the diagram, it is indistinguishable from the constant 0.

But what's the use of an uncomputable function? K is an immensely useful function because it allows us to give formal definitions to a variety of concepts that would otherwise be left to intuition. And while we cannot compute K(x) directly, we can build computable approximations to K(x). It turns out that these approximations are very powerful in their own right.
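One common family of approximations simply substitutes a real-world compressor for the (impossible) shortest-program search: the compressed length of x is a computable upper bound on a machine-dependent version of K(x). A minimal sketch, using zlib only because it ships with Python:

    # Approximating K(x) from above with an off-the-shelf compressor. The compressed
    # length can overestimate the shortest description, but never falls below it by
    # more than a constant (roughly, the size of a decompressor for the format).
    import os
    import zlib

    def approx_K_bits(data):
        """Upper-bound proxy for K(data), in bits, via zlib at maximum compression."""
        return 8 * len(zlib.compress(data, 9))

    structured = b"01" * 50_000        # highly regular byte string
    random_ish = os.urandom(100_000)   # incompressible in practice

    print(approx_K_bits(structured), "vs", 8 * len(structured), "raw bits")
    print(approx_K_bits(random_ish), "vs", 8 * len(random_ish), "raw bits")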

Let's use K to define randomness. The mathematical theory of random variables has struggled to define randomness. Uniform distribution does not work - 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, ... is uniform over the numbers 1-9, but it is clearly not random. Normality (equal frequency of all digits and all sequences of digits) does not work, either, because it can be proved that Champernowne's number, 0.123456789101112131415..., is normal - yet its digits follow an obviously simple rule, so it is anything but random. Of course, we all know what we mean by "random" and, even if we can't define it, we know it when we see it.
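The Champernowne case is easy to make concrete: its digit string passes frequency-based tests, yet a program of a few lines produces as much of it as you like - exactly the kind of hidden order that counting digit frequencies cannot detect:

    # Champernowne's constant 0.1234567891011121314...: normal in base 10, but its
    # digits are generated by a trivially short program, so the sequence is anything
    # but algorithmically random.
    from itertools import count, islice

    def champernowne_digits():
        """Yield the decimal digits of 0.123456789101112131415..."""
        for n in count(1):
            yield from str(n)

    print("".join(islice(champernowne_digits(), 30)))  # 123456789101112131415161718192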

The K function gives us a powerful way of defining randomness. In fact, it is so powerful that definitions of randomness that are satisfactory for other fields do not measure up to the algorithmic definition of randomness. The sequence of digits output by a cryptographically secure pseudo-random number generator (CSPRNG) is sufficiently random to secure highly valuable information against determined attack by a well-resourced adversary. But it is not random by the standard of algorithmic information theory. The bounces of a die on a table - since they can be modeled by a sufficiently fine-grained physics simulator - are not random by the standards of algorithmic information theory. The fluctuations of the quantum vacuum are generally supposed to be random, but algorithmic information theory tells us that it is impossible to prove this, since to do so is tantamount to computing K, and we already know that K is uncomputable.

We define a string x to be random if:
K(x) ≥ |x| - c
... for some small constant c that depends on U (not on x). Intuitively, we can see that this definition of randomness is sufficient because it exhausts all notions of order. A general-purpose computer, U, allows us to simulate any computable function of our choice - thus, U can take advantage of any structure within the string x, whether this structure is humanly definable or not. But K(x) is the exhaustion of all such structure - if no structure can be found within the string, then there is no structure at all within the string. And the complete absence of order and structure is precisely what we mean by randomness, at least, randomness of discrete sequences[4].
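Is this definition vacuous - are there any such strings at all? A standard counting argument (implicit in the claim that almost all strings are random, though not spelled out in this post) says yes, overwhelmingly so: there are far too few short programs to go around.

    # Counting bound: there are 2**n strings of length n, but fewer than 2**(n-c+1)
    # programs of length at most n - c, so at most a 2**(1-c) fraction of n-bit strings
    # can satisfy K(x) <= n - c, regardless of the choice of reference machine U.
    def compressible_fraction_bound(n, c):
        """Upper bound on the fraction of n-bit strings compressible by c or more bits."""
        programs = 2 ** (n - c + 1) - 1   # all strings of length 0 .. n-c
        return programs / 2 ** n

    for c in (1, 8, 20):
        print(f"c={c:2d}: at most {compressible_fraction_bound(64, c):.2e} of 64-bit strings")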

It might seem that algorithmic information and Shannon information - arising from different mathematical disciplines - have nothing to do with each other. In fact, it can be proved that the expected value of the Kolmogorov complexity of a random variable equals its Shannon entropy, up to an additive constant[5]. Intuitively, this is not mysterious, given that almost all strings are algorithmically random which, in turn, gives rise to the tracking between the global structure of K(x) and log2(x). Just as we can build relations between Shannon entropies (joint entropy, conditional entropy, mutual information, and so on), so there are formally analogous relations between Kolmogorov complexities - the joint information, conditional information, mutual information, and so on.
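That connection can even be glimpsed numerically. In the sketch below, the zlib proxy from earlier stands in for K, so the agreement is only approximate, and the biased-coin source, sample length, and compressor are all choices made for the illustration rather than anything from the cited result:

    # Empirical echo of E[K(X)] ~ H(X): for a biased coin, the average compressed size
    # per symbol (a computable stand-in for K) should land near the Shannon entropy of
    # the source. Expect rough agreement, not equality.
    import math
    import random
    import zlib

    def shannon_entropy(p):
        """Entropy in bits of a Bernoulli(p) source."""
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    def compressed_bits_per_symbol(p, n=200_000):
        bits = "".join("1" if random.random() < p else "0" for _ in range(n))
        packed = int(bits, 2).to_bytes((n + 7) // 8, "big")  # pack 8 symbols per byte
        return 8 * len(zlib.compress(packed, 9)) / n

    for p in (0.5, 0.25, 0.1):
        print(f"p={p}: H = {shannon_entropy(p):.3f}, zlib ~ {compressed_bits_per_symbol(p):.3f} bits/symbol")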

Here, we are starting to bump into a deep connection between pure abstraction - the general-purpose computer U - and physical reality, that is, physical entropy. This connection is consistently centered on a key unifying idea: uncertainty and randomness.
"Randomness does occur in pure mathematics, it occurs in number theory, it occurs in arithmetic." - Gregory Chaitin[6]
"At this point an enigma presents itself which in all ages has agitated inquiring minds. How can it be that mathematics, being after all a product of human thought which is independent of experience, is so admirably appropriate to the objects of reality?" - Albert Einstein, Geometry and Experience
Next: Part 8, Bayesian Inference

---

1. A telegraphy code-book found on Archive.org

2. Also known as Kolmogorov complexity theory, Kolmogorov-Chaitin complexity theory, and program-length complexity, among other names.

3. { x : P(x) } is called "set-builder notation" and it is read as follows: "all x such that the predicate formula P(x) is true." Note that P(x) can be any reasonable predicate. It does not necessarily have to be a function of x, although it commonly is. A predicate is just a formula which can be true or false. For example, let P(x) represent the following predicate sentence: "x is a dog-breed." P(x) is true whenever x is the label of a dog-breed -- P("chihuahua") = true, P("siamese") = false. The min{} notation means "choose the minimum element of this set." So, min{ |p| : U(p) = x } should be read "the smallest length |p| over all programs p such that U(p) = x (and halts)."

4. There are other intuitive notions of randomness but all of these other notions can be reduced to algorithmic randomness in a natural way.

5. Grunwald and Vitanyi, 2008, p. 15

6. Undecidability & Randomness in Pure Mathematics, § "The Hilbert Problems"
