Monday, July 24, 2017

Notes on a Cosmology - Part 4, The Mystery of Entropy

"In 1961, one of us asked Shannon what he had thought about when he had finally confirmed his famous measure. Shannon replied: ‘My greatest concern was what to call it. I thought of calling it ‘information’, but the word was overly used, so I decided to call it ‘uncertainty’. When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, ‘You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name. In the second place, and more importantly, no one knows what entropy really is, so in a debate you will always have the advantage.” - Myron Tribus, from a 1971 article “Energy and Information”, co-written with American physicist Edward Charles McIrvine (1933-)
The word "entropy" originated in physics, not the mathematical theory of communication. But each field of physics (theory of heat engines, thermodynamics, statistical mechanics, etc.) has tended to have its own definition of entropy. It is not obvious that these definitions are equivalent, and this has led to a great deal of complexity and borderline mysticism in the theory of entropy throughout history, as the above anecdote humorously highlights. Even today, discussion of entropy is sometimes accompanied by language of an almost mystical character. Here are some definitions from Wikipedia's page on entropy:
  • Entropy is reversible heat divided by temperature
  • Pressure, density, and temperature tend to become uniform over time because this equilibrium state has higher probability (more possible combinations of microstates) than any other. [Entropy is] a measure of how far the equalization has progressed
  • Entropy is ... the measure of uncertainty, or "mixedupness", in the phrase of Gibbs, which remains about a system after its observable macroscopic properties, such as temperature, pressure and volume, have been taken into account
To specialists in their respective fields of physics, these definitions are clear and useful. Today, it is undeniable that the various specialized definitions of entropy are all equivalent, given a sufficient degree of abstraction. Not only are all physical definitions of entropy equivalent, they are equivalent to the mathematical definition we will lay out in this post. I will connect the theoretical definition of entropy back to the physical definitions. I will argue that, by accident of history, we are able to see the significance of physical entropy more clearly from the vantage-point of theory even though the idea of entropy originated in physics.

In MTC, Shannon introduces the idea of a communication system, or "channel", terms I left undefined in the last post. I cannot improve on Shannon's description, so I will reproduce that section of the paper, here:


By a communication system we will mean a system of the type indicated schematically in Fig. 1. It consists of essentially five parts:  
  1. An information source which produces a message or sequence of messages to be communicated to the receiving terminal. The message may be of various types: (a) A sequence of letters as in a telegraph or teletype system; (b) A single function of time f(t) as in radio or telephony; (c) A function of time and other variables as in black and white television — here the message may be thought of as a function f(x, y, t) of two space coordinates and time, the light intensity at point (x, y) and time t on a pickup tube plate; (d) Two or more functions of time, say f(t), g(t), h(t) — this is the case in “three dimensional” sound transmission or if the system is intended to service several individual channels in multiplex; (e) Several functions of several variables — in color television the message consists of three functions f(x, y, t), g(x, y, t), h(x, y, t) defined in a three-dimensional continuum — we may also think of these three functions as components of a vector field defined in the region — similarly, several black and white television sources would produce “messages” consisting of a number of functions of three variables; (f) Various combinations also occur, for example in television with an associated audio channel.
  2. A transmitter which operates on the message in some way to produce a signal suitable for transmission over the channel. In telephony this operation consists merely of changing sound pressure into a proportional electrical current. In telegraphy we have an encoding operation which produces a sequence of dots, dashes and spaces on the channel corresponding to the message. In a multiplex PCM system the different speech functions must be sampled, compressed, quantized and encoded, and finally interleaved properly to construct the signal. Vocoder systems, television and frequency modulation are other examples of complex operations applied to the message to obtain the signal.  
  3. The channel is merely the medium used to transmit the signal from transmitter to receiver. It may be a pair of wires, a coaxial cable, a band of radio frequencies, a beam of light, etc.  
  4. The receiver ordinarily performs the inverse operation of that done by the transmitter, reconstructing the message from the signal.  
  5. The destination is the person (or thing) for whom the message is intended.
In the last post, we established two important ideas:
  • Information can be understood without reference to meaning
  • Information, in this sense, can be quantified
So, now that we know what a communication channel is and what information is in the technical sense of information theory, what is entropy?

Information and entropy are not the same thing in the theory of communications. Information - also called "self-information" or "surprisal" - is how the "amount of information" pertaining to a particular symbol or message is quantified. Entropy is the average or expected value of the self-information of a random variable[1] that generates symbols or messages according to some probability distribution.

In order to avoid going into a book-length treatment of information theory, I will introduce the mathematical definitions of self-information and entropy without justifying them. The goal is to present the formal relationship between them.

Suppose we have a set of n symbols (called an alphabet in computer science), Sn. From this set, we may select any one symbol and transmit it. Let pi (0 < i ≤ n) be the probability with which the i-th symbol in Sn is selected. For our purposes, we do not care whether this probability is specified in advance (a prior) or estimated from observed frequencies (a posterior). The self-information, measured in bits[2], of each symbol si in Sn is:
I(si) = -log2(pi)
Let's consider a real-world example. Let S26 be the alphabet of English letters (case-insensitive). The letter 'e' occurs in English with a frequency of roughly 12%, for some representative corpus. Its self-information is, thus, -log2(0.12) ~ 3.06 bits. The letter 'j' occurs in English with a frequency of roughly 0.153%. Its self-information is -log2(0.00153) ~ 9.35 bits. The more improbable a symbol is, the greater its self-information. This matches our intuition that we will be more surprised to see a rare symbol than a common symbol. Thus, we assign greater "surprisal" to rarer symbols.
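To make the arithmetic concrete, here is a minimal Python sketch of the self-information calculation. The two frequencies are just the rounded values quoted above, not measurements from any particular corpus:

    import math

    def self_information(p, base=2):
        """Self-information (surprisal) of an outcome with probability p, in bits by default."""
        return -math.log(p, base)

    # Approximate English letter frequencies quoted above
    print(self_information(0.12))      # 'e' -> ~3.06 bits
    print(self_information(0.00153))   # 'j' -> ~9.35 bits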

Because entropy is a mean or expected value, it is defined on a random variable whose domain is the set of symbols (or messages). Let X be a random variable that can take on the value of any s in Sn. To calculate the entropy - called H - of X:
H(X) = Σ pi • -log2(pi), for i from 1 to n
This is just the mean or expected value of I over the domain of X; it is the average self-information or surprisal of the symbols in Sn. We have defined the surprisal and the entropy over a set of "symbols" on the basis of the supposition that these symbols can be transmitted in succession to send a complete message. But the definitions given work just as well (and, in certain cases, only work) when we consider Sn to be the set of all messages. The former view of Sn is more often used in the design of real communications systems and the latter view in theory.
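To make the averaging explicit, here is a short Python sketch that computes H(X) for an arbitrary discrete distribution. The three-symbol alphabet at the end is an invented toy example, not data from any real channel:

    import math

    def entropy(probabilities, base=2):
        """Shannon entropy: the probability-weighted average of the surprisals -log(p)."""
        return sum(-p * math.log(p, base) for p in probabilities if p > 0)

    # Toy alphabet S3 with a skewed distribution
    print(entropy([0.5, 0.25, 0.25]))  # 1.5 bits per symbol
    # A uniform distribution over the same three symbols maximizes the entropy
    print(entropy([1/3, 1/3, 1/3]))    # ~1.585 bits, i.e. log2(3)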

Up to this point, I have made no reference to the communication channel, defined by Shannon in the quote above. This is not an accident - I want to introduce another view of the communication channel as an abstraction of a scientific experiment:


You will notice that the "preparer" and "observer" are denoted as two separate individuals. In real experiments, the preparer is often the same person as the observer - but not from a physical standpoint. Even if there is just one experimenter, the observer is in the future of the preparer, and the preparer is in the past of the observer. The "generator" is just any apparatus that "drives" or "generates" the conditions of the experiment (e.g. a laser). The "detector" is any apparatus that "observes" or "detects" the results of the experiment. The medium, in this case, is just the world-as-such. The ever-present fact of randomness is denoted by the intermediary box acting on the "signal". The "signal" and "received signal" components of the diagram are best understood as sitting on the fence between physical law and theory. A received signal (during an experiment) is perturbed not only by randomness, but also by the disagreement between theoretical expectation and physical fact, even if - or especially if - we are not able to discriminate whether an error is due to physical randomness or flawed theory.
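One way to make the diagram concrete is to simulate the simplest possible "medium": a binary symmetric channel, in which the noise source flips each transmitted bit independently with some fixed probability. This is only a toy Python sketch; the flip probability of 0.1 and the eight-bit signal are arbitrary illustration values, not parameters of any real experiment:

    import random

    def binary_symmetric_channel(bits, flip_probability=0.1, rng=random):
        """Pass a sequence of 0/1 bits through a channel that flips each bit independently."""
        return [bit ^ 1 if rng.random() < flip_probability else bit for bit in bits]

    signal = [1, 0, 1, 1, 0, 0, 1, 0]
    received = binary_symmetric_channel(signal)
    # The number and location of corrupted bits varies from run to run
    errors = sum(s != r for s, r in zip(signal, received))
    print(received, "-", errors, "bit(s) corrupted")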

There is already a well-developed theory of scientific experiment with its foundations in stochastic theory. The point of the above illustration is that the standard theory of scientific experiment can be understood, equivalently, from the point-of-view of information theory. In many respects, information theory gives a simpler, more unified and more direct explanation of the process of scientific experiment.

In the next post, we will cover the two major heads of information theory: (1) noiseless coding and data-compression and (2) noisy coding, which relates noise, signal power, error-detection and error-correction. We will connect the theoretical view of entropy with the physical view. That will wrap up this section on information theory.

Next: Part 5, The Mystery of Entropy cont'd

---
1. In probability theory, a "random variable" is analogous to a coin or die - it can take on one value from a set of values (discrete or continuous), which defines its domain, but there is no function or algorithm (effective method) by which to predict which value it will take - its behavior is random over the given domain.

2. Choosing a different logarithmic base gives different units: base 2 gives bits, base e gives nats, and base 10 gives hartleys.
