Thursday, July 27, 2017

Notes on a Cosmology - Part 5, The Mystery of Entropy cont'd

In MTC, Shannon divided information theory into two major heads - noiseless coding theory and noisy coding theory. Let's consider each of these.

Noiseless coding theory

From MTC:
How is an information source to be described mathematically, and how much information in bits per second is produced in a given source? The main point at issue is the effect of statistical knowledge about the source in reducing the required capacity of the channel, by the use of proper encoding of the information. In telegraphy, for example, the messages to be transmitted consist of sequences of letters. These sequences, however, are not completely random. In general, they form sentences and have the statistical structure of, say, English. The letter E occurs more frequently than Q, the sequence TH more frequently than XP, etc. The existence of this structure allows one to make a saving in time (or channel capacity) by properly encoding the message sequences into signal sequences. This is already done to a limited extent in telegraphy by using the shortest channel symbol, a dot, for the most common English letter E; while the infrequent letters, Q, X, Z are represented by longer sequences of dots and dashes. This idea is carried still further in certain commercial codes where common words and phrases are represented by four- or five-letter code groups with a considerable saving in average time. The standardized greeting and anniversary telegrams now in use extend this to the point of encoding a sentence or two into a relatively short sequence of numbers.
The basic idea of a noiseless code is to arrange the "code space" such that shorter codes correspond to more probable symbols, and longer codes to less probable symbols. In this way, the length of a typical message is reduced from what it would be if we had assigned equal-length codes to each symbol to be transmitted (or, worse, longer codes to more probable symbols and shorter codes to less probable symbols). Shannon establishes certain theoretical limits to how far this method can go towards economizing the available channel capacity. In particular, the formula we introduced in the last post describing the entropy of a given message-source determines the maximum efficiency we can attain for a given communication system.
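
To make this concrete, below is a minimal sketch (in Python) of Huffman coding, one classic way of building a noiseless code that assigns shorter codewords to more frequent symbols and longer codewords to rarer ones. The sample sentence is made up purely for illustration.

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a prefix code giving shorter codewords to more frequent symbols."""
    # Each heap entry: (total weight, tie-breaker, {symbol: codeword-so-far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)   # the two least probable subtrees...
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + cw for s, cw in c1.items()}   # ...get one bit longer
        merged.update({s: "1" + cw for s, cw in c2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

text = "in the beginning was the word and the word was with information"
freqs = Counter(text)
code = huffman_code(freqs)
for sym, cw in sorted(code.items(), key=lambda kv: len(kv[1])):
    print(repr(sym), freqs[sym], cw)   # frequent symbols get short codewords
```

Running it, the most frequent symbols in the sample (the space character chief among them) come out with the shortest codewords, while the rare letters get the longest - exactly the arrangement of the "code space" described above.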

In the computer age, one of the most common applications of noiseless coding is data compression. A ZIP archive, like other compression formats, relies on the non-uniform statistics of the files it compresses to reduce their size. The concept of compression is at the heart of noiseless coding - the goal is always to reduce a message or file from its original size to some smaller size (as measured in bits) so that the space, time or bandwidth[1] required to transmit the message or store the file is reduced[2].
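
As a rough illustration (a sketch, not a benchmark), the snippet below computes the Shannon entropy of a sample text's byte frequencies - a lower bound, in bits per byte, for any code that treats bytes independently - and compares it with what zlib, the library implementing the DEFLATE algorithm used in ZIP files, actually achieves. The sample text is arbitrary; note that zlib can do far better than the per-byte figure on repetitive input because it also exploits structure between symbols, which is exactly the kind of statistical structure Shannon describes in English text.

```python
import math
import zlib
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy H = -sum(p * log2(p)) over the byte frequencies."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

text = ("the quick brown fox jumps over the lazy dog " * 200).encode()
h = entropy_bits_per_byte(text)
compressed = zlib.compress(text, 9)

print(f"byte entropy:     {h:.2f} bits/byte")
print(f"original size:    {len(text)} bytes")
print(f"per-byte bound:   {len(text) * h / 8:.0f} bytes")
print(f"zlib output size: {len(compressed)} bytes")
```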

One point especially worth noting is that random data (for example, an encrypted message) is not susceptible to compression. The cost of transmitting random data over a communication channel is always bit for bit. This is a fundamental limit of entropy. It can be derived by setting the probabilities of n symbols to 1/n and then calculating the entropy, which comes out to log2(n) bits per symbol - the maximum of the entropy function. Data compression schemes that purport to achieve compression on random data belong in the same category of quackery as perpetual motion machines. And as we will discover in future posts, they bear more than a metaphorical similarity.
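
This is easy to check for yourself. In the sketch below (standard library only), a block of random bytes sits at the maximum entropy of log2(256) = 8 bits per byte, and zlib cannot shrink it - the output comes back slightly larger than the input because of framing overhead.

```python
import os
import zlib

random_data = os.urandom(1_000_000)        # random bytes: ~8 bits of entropy per byte
compressed = zlib.compress(random_data, 9)

print(len(random_data))                    # 1000000
print(len(compressed))                     # slightly larger than the input
```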

Noisy coding theory

In the real world, noise perturbs every signal. Noiseless codes do not work in the presence of random perturbations because they map inputs to outputs one-to-one. That is, if a random error occurs during the transmission of an encoded signal, a noiseless code will faithfully decode it into an erroneous message. In MTC, Shannon describes the basic idea of noisy channel coding:
By sending the information in a redundant form the probability of errors can be reduced. For example, by repeating the message many times and by a statistical study of the different received versions of the message the probability of errors could be made very small. One would expect, however, that to make this probability of errors approach zero, the redundancy of the encoding must increase indefinitely, and the rate of transmission therefore approach zero. This is by no means true. If it were, there would not be a very well defined capacity, but only a capacity for a given frequency of errors, or a given equivocation; the capacity going down as the error requirements are made more stringent. Actually the capacity C defined above has a very definite significance. It is possible to send information at the rate C through the channel with as small a frequency of errors or equivocation as desired by proper encoding. This statement is not true for any rate greater than C. If an attempt is made to transmit at a higher rate than C, say C + R1, then there will necessarily be an equivocation equal to or greater than the excess R1. Nature takes payment by requiring just that much uncertainty, so that we are not actually getting any more than C through correctly.
The key result is that, as long as the rate of transmission stays below the channel capacity C, it is always possible to construct a code such that the error rate can be made as small as desired.
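
The crude scheme Shannon mentions - repeat each bit and take a majority vote at the receiver - already shows the trade-off, as in the sketch below (the flip probability and repetition counts are arbitrary choices). Repetition drives the error rate down, but only by collapsing the payload rate toward zero; Shannon's theorem says that smarter codes avoid this collapse as long as the rate stays below C.

```python
import random

def transmit(bit: int, flip_prob: float) -> int:
    """A binary symmetric channel: flip the bit with probability flip_prob."""
    return bit ^ (random.random() < flip_prob)

def send_with_repetition(bit: int, n: int, flip_prob: float) -> int:
    """Send the bit n times over the noisy channel; decode by majority vote."""
    received = [transmit(bit, flip_prob) for _ in range(n)]
    return int(sum(received) > n // 2)

random.seed(0)
flip_prob, trials = 0.1, 100_000
for n in (1, 3, 5, 9):
    errors = sum(send_with_repetition(1, n, flip_prob) != 1 for _ in range(trials))
    print(f"repeat x{n}: error rate ~{errors / trials:.4f}, payload rate = 1/{n}")
```

The n = 1 row is the no-redundancy case described earlier: every channel error passes straight through as an undetected message error.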

What unites noiseless coding theory and noisy coding theory is the concept of uncertainty, of which entropy is a measure. When a receiver is awaiting a message, he has maximal uncertainty regarding what the message will be. After the first symbol has been transmitted, the receiver's uncertainty regarding the message goes down by some amount[3]. And so on for each successive symbol, until the entire message has been transmitted. Because a noiseless code does not have to take noise into account, each symbol reduces uncertainty at the receiver by a maximal amount. A noisy code, however, cannot reduce uncertainty at the receiver by the same degree per symbol, because some amount of redundancy must be built into the code in order to allow for the detection and correction of errors introduced by noise in the communication channel.

Without going into the mathematical details, there is some quantity of information that can be reliably transmitted over a noisy channel. This quantity is a function of the channel capacity, the noise on the channel, and the desired overall error rate. We can make the error rate as low as desired, at the cost of reducing the rate at which we can transmit the "payload" data. The fundamental relation between noise and the channel rate is subtractive - the more noise, the more of the available channel rate must be diverted to error correction, leaving less for the payload data. Fortunately, the theoretically achievable rates for real-world levels of noise are very good. Modern error-correction schemes achieve virtually the maximum theoretically possible rate of data transmission for a given level of noise.
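
For the simplest concrete model - a binary symmetric channel that flips each transmitted bit with probability p - the achievable payload rate works out to C = 1 - H(p) bits per channel bit, where H is the binary entropy function: the entropy of the noise is subtracted straight off the top of the raw bit rate. A small sketch:

```python
import math

def binary_entropy(p: float) -> float:
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), the entropy of a biased coin flip."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p: float) -> float:
    """Capacity of a binary symmetric channel with crossover probability p."""
    return 1.0 - binary_entropy(p)

for p in (0.0, 0.01, 0.05, 0.1, 0.2, 0.5):
    print(f"flip probability {p:.2f}: capacity {bsc_capacity(p):.3f} bits per channel bit")
```

At p = 0.5 the channel carries nothing at all; every bit of noise entropy is a bit of capacity lost.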

The magnitude of the effect of information theory on day-to-day life is difficult to exaggerate. Every modern device with any kind of electronics inside of it is affected - cell phones, computers, onboard car computers, and so on. If the principles of information theory were not applied to these systems, they would be nonviable, requiring immense power supplies, huge antennas and a dense network of nearby RF towers. It is not an exaggeration to say that information theory made the digital revolution possible.

Maxwell's demon

James Clerk Maxwell devised a thought-experiment to describe a possible condition for the violation of the second law of thermodynamics[4]:
If we conceive of a being whose faculties are so sharpened that he can follow every molecule in its course, such a being, whose attributes are as essentially finite as our own, would be able to do what is impossible to us. For we have seen that molecules in a vessel full of air at uniform temperature are moving with velocities by no means uniform, though the mean velocity of any great number of them, arbitrarily selected, is almost exactly uniform. Now let us suppose that such a vessel is divided into two portions, A and B, by a division in which there is a small hole, and that a being, who can see the individual molecules, opens and closes this hole, so as to allow only the swifter molecules to pass from A to B, and only the slower molecules to pass from B to A. He will thus, without expenditure of work, raise the temperature of B and lower that of A, in contradiction to the second law of thermodynamics.

This contradiction[5] was resolved by the work of later scientists, notably Leo Szilard and Rolf Landauer. Charles Bennett[6] gives a summary of Landauer's principle:
Rolf Landauer (1961) attempted to apply thermodynamic reasoning to digital computers. Paralleling the fruitful distinction in statistical physics between macroscopic and microscopic degrees of freedom, he noted that some of a computer’s degrees of freedom are used to encode the logical state of the computation, and these information bearing degrees of freedom (IBDF) are by design sufficiently robust that, within limits, the computer’s logical (i.e., digital) state evolves deterministically as a function of its initial value, regardless of small fluctuations or variations in the environment or in the computer’s other noninformation bearing degrees of freedom (NIBDF). While a computer as a whole (including its power supply and other parts of its environment), may be viewed as a closed system obeying reversible laws of motion (Hamiltonian or, more properly for a quantum system, unitary dynamics), Landauer noted that the logical state often evolves irreversibly, with two or more distinct logical states having a single logical successor. Therefore, because Hamiltonian/unitary dynamics conserves (finegrained) entropy, the entropy decrease of the IBDF during a logically irreversible operation must be compensated by an equal or greater entropy increase in the NIBDF and environment. This is Landauer’s principle. Typically the entropy increase takes the form of energy imported into the computer, converted to heat, and dissipated into the environment, but it need not be, since entropy can be exported in other ways, for example by randomizing configurational degrees of freedom in the environment.
From Wikipedia: "Any irreversible logical transformation of classical information produces at least kT ln 2 of heat per bit, where k is the Boltzmann constant (approximately 1.38×10⁻²³ J/K) and T is the temperature of the heat sink in kelvins." Landauer's principle resolves the paradox created by Maxwell's demon by drawing an equivalence between intangible information and physical entropy. It is crucial to keep in mind the scales involved. Even if a colossal amount of information were irreversibly transformed, it would amount to a very small amount of physical entropy (heat, more or less).
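
To get a feel for those scales, here is a back-of-the-envelope sketch (room temperature and a terabyte are arbitrary choices): erasing a full terabyte of information at the Landauer limit dissipates only a few tens of nanojoules of heat.

```python
import math

k_B = 1.380649e-23              # Boltzmann constant, J/K
T = 300.0                       # room temperature, kelvins

landauer_per_bit = k_B * T * math.log(2)    # minimum heat per irreversibly erased bit
bits_in_a_terabyte = 8 * 10**12             # 10^12 bytes

print(f"per bit:      {landauer_per_bit:.2e} J")                        # ~2.9e-21 J
print(f"per terabyte: {landauer_per_bit * bits_in_a_terabyte:.2e} J")   # ~2.3e-8 J
```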

Connecting the dots

Maxwell's demon is the bridge between noiseless coding theory and noisy coding theory. Not only does it connect these two heads of information theory; it also connects information theory to physics. Specifically, the demon highlights the relation between uncertainty in the mind - which is physically realized in the brain and body - and uncertainty or entropy in the physical environment.

Physically speaking, noiseless coding is impossible in the same way that an ideal cube or an ideal sphere is impossible. But the nature of physical noise is not, itself, a mystery. Reversible computation, ideal work[7] and error-correcting codes on a noisy channel are conceptually isomorphic, and the role of indeterminacy (noise, randomness) in these systems is well understood.

We can conceive, in our minds, of a noiseless channel or an ideal cube or an ideal sphere. The mainstream view of our capacity to imagine physically impossible idealizations is that such visualization is no different than imagining fictional worlds. That these idealizations happen to be useful is irrelevant to their physical impossibility.

Another way to view a noiseless channel is to compare it to the deduction of theorems from axioms in mathematics. Suppose I had an "uncertainty meter" - of unspecified mechanism - that can magically determine the amount of uncertainty in my mind about anything. Let's say that I use this meter to measure my uncertainty at the beginning of a mathematical deduction and call this u0. Then, I perform my deductions and derive a theorem as a result. Afterwards, I use the meter to measure my uncertainty again, and call this u1. The nature of deduction is precisely this: any uncertainty meter that functions according to what we mean by the word "uncertainty" must measure u0 = u1. A noiseless channel is an idealization that happens to correlate with the relationship between a theorem and the axioms from which the theorem is derived.

The significance of physical entropy is more clearly seen from this vantage-point. A noisy code allows us to encode messages in such a way that the error-rate is made as small as desired. In the same way, we manipulate the noisy physical world in such a way that we are able to do things that are arbitrarily close to the ideal of a noiseless channel - things like making deductions, communicating detailed messages exactly over long distances, or performing lengthy computations without errors. The many information revolutions throughout human history[8] - for example, the Gutenberg press - are not only a story of the cheaper propagation of information but also a story of the gradual de-noising of information.

In the next post(s), we will begin looking at the nature of classical computation and at another branch of information theory - known as Kolmogorov complexity theory, or algorithmic information theory - which originates in a field of mathematics called computability theory.

Next: Part 6, Computation

---

1. "Bandwidth" is a term that originates from RF communications. In RF, a group of adjacent frequencies are referred to as a "band". The width of this band - or bandwidth - is calculated by subtracting the lowest frequency from the highest. In the context of this discussion, the lowest frequency is always 0, so the bandwidth is just the highest frequency.

2. I say "reduced" and not "minimized" because the conditions for minimal codes - that is, ideal codes - are difficult to achieve in actual practice. It is a complex subject that would require a separate treatment.

3. You have experienced this phenomenon whenever you have completed someone's sentence for them, or someone else has done the same for you. The ability to guess the rest of a sentence from some prefix of that sentence is an example of information theory operating in ordinary life.

4. This is the law of thermodynamics which is famously responsible for the inevitable rundown of all physical systems - its effects are seen in everything from erosion, to combustion, to rust, to aging.

5. In case it is not obvious, the demon contradicts the second law of thermodynamics because that law dictates that an enclosed gas will tend towards equilibrium over time, all else equal. Once it has reached equilibrium, it will remain there unless work is expended to move the gas into a non-equilibrium state (for example, by applying heating or refrigeration to the gas, both of which expend physical work). Maxwell's demon achieves the same result, without work.

6. Notes on Landauer’s principle, reversible computation, and Maxwell’s Demon, Charles H. Bennett 2003

7. Ideal work is work at the Carnot limit; the Szilard engine is a conceptual bridge between the Carnot limit and the noisy channel capacity.

8. An idea due to Seth Lloyd, as far as I know
