Lossy compression is used on rich media - audio, images, video, and so on - to great effect. JPEG compression can drastically reduce the file size of an image without severely degrading its quality, and size reductions of 90% or more are attainable for applications where image quality is less important. This tradeoff between data quality and file size is only possible when the data being compressed is noise-tolerant: a compressed image of a cat may be slightly fuzzy or mis-colored, but it is still recognizable as a cat. A text file, such as a legal document or a computer program, on the other hand, cannot tolerate this kind of noise - the semantic content would almost certainly be destroyed. For these kinds of files, we use lossless compression.
It would be nice, however, if we could achieve the filesize benefits of lossy compression, without introducing noise - can we have our lossless compression cake and eat it, too? Consider the following quote:
No one rejects, dislikes or avoids
pleasure itself, because it is pleasure, but because those who do not
know how to pursue pleasure rationally encounter consequences that
are extremely painful. Nor again is there anyone who loves or pursues
or desires to obtain pain of itself, because it is pain, but because
occasionally circumstances occur in which toil and pain can procure
him some great pleasure. To take a trivial example, which of us ever
undertakes laborious physical exercise, except to obtain some
advantage from it? But who has any right to find fault with a man who
chooses to enjoy a pleasure that has no annoying consequences, or one
who avoids a pain that produces no resultant pleasure?
- Cicero, De Finibus
As
we know, the more often a particular character or word is repeated in
a text, the higher the redundancy of that text and the more
compressible the text is. Can we apply a transform to this text that
will render it more redundant, without losing our ability to recover
the text exactly?
No one rejects, dislikes ◦◦ avoids
pleasure itself, because it ◦◦ pleasure, ◦◦◦ because those
who do ◦◦◦ know how ◦◦ pursue pleasure rationally encounter
consequences that ◦◦◦ extremely painful. Nor again ◦◦ there
anyone who loves ◦◦ pursues ◦◦ desires to obtain pain of
itself, because it ◦◦ pain, ◦◦◦ because occasionally
circumstances occur ◦◦ which toil ◦◦◦ pain can procure him
some great pleasure. To take ◦ trivial example, which of us ever
undertakes laborious physical exercise, except ◦◦ obtain some
advantage from ◦◦? But who has any right ◦◦ find fault with ◦
man who chooses ◦◦ enjoy ◦ pleasure that has ◦◦ annoying
consequences, or one who avoids ◦ pain that produces ◦◦
resultant pleasure?
A
quick glance at the text should allow you to convince yourself that
an English speaker will be able to easily reconstruct the original
text with few, if any, errors. Clearly, this version of the text has
higher redundancy because we have replaced 46 separate characters –
drawn from a subset of the English alphabet – with 46 repetitions
of a single character. Thus, this version of the text lends itself to a
better compression ratio. Is there a way that we could ensure that
the person who is trying to decode this text has reconstructed the
original text? The answer is to take a hash of the original text and
give this to the person trying to decode the obscured text. In this
case, the CRC32 of the original text is 0x77ea20bb. If the person
decoding the obscured text makes a mistake, say, by choosing “yet
because” instead of “but because” for the third obscured
word, the resulting CRC32 will be 0xffa9da29. So, by adding a few
bytes to store the hash of the original text, we can strike out words
that are easy to guess from the context, and an English speaker will
be able to recover the original text and convince herself that the
reconstructed text is identical to the original text.
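As a concrete sketch of the check, here is the idea in Python using the standard zlib module. The variable contents are placeholders for the full texts, and the exact hex value depends on the precise bytes of the original, so nothing below should be read as a reference implementation:

import zlib

def crc32_hex(text):
    # CRC32 of the UTF-8 bytes, as an 8-digit hex string.
    return format(zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF, "08x")

# Stand-ins for the full quote; in practice these would be the complete texts.
original_text = "...the full, unobscured Cicero quote..."
reconstructed_text = "...R's attempt at filling in the blanks..."

# O publishes this value alongside the obscured text
# (for the quote above it would be 77ea20bb).
published_checksum = crc32_hex(original_text)

# R checks a candidate reconstruction against the published checksum.
if crc32_hex(reconstructed_text) == published_checksum:
    print("reconstruction matches the original")
else:
    print("at least one guessed word is wrong - try again")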
We
have been careful to strike out only words that are easy-to-guess for
an English speaker (from context) – but do we have to stop there?
The answer is no. Let’s say we strike out a large word that cannot necessarily be guessed from the context:
No one rejects, dislikes ◦◦ avoids
pleasure itself, because it ◦◦ pleasure, ◦◦◦ because those
who do ◦◦◦ know how ◦◦ pursue pleasure rationally encounter
◦◦◦◦◦◦◦◦◦◦◦◦ that ◦◦◦ extremely
painful. Nor again ◦◦ there anyone who loves ◦◦ pursues ◦◦
desires to obtain pain of itself, because it ◦◦ pain, ◦◦◦
because occasionally circumstances occur ◦◦ which toil ◦◦◦
pain can procure him some great pleasure. To take ◦ trivial
example, which of us ever undertakes laborious physical exercise,
except ◦◦ obtain some advantage from ◦◦? But who has any
right ◦◦ find fault with ◦ man who chooses ◦◦ enjoy ◦
pleasure that has ◦◦ annoying consequences, or one who avoids ◦
pain that produces ◦◦ resultant pleasure?
Here,
we have increased the number of struck-out characters from 46 to 58.
The word that has been obscured has length 12. There are thousands of
possible words that could be substituted here but, in all likelihood,
only one of them will satisfy the criterion that the CRC32 hash of
the resulting text block be 0x77ea20bb – the word “consequences”.
Thus, we could write a program that searches through a dictionary
until it finds a word yielding the correct hash, and returns the result.
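Such a brute-force search is only a few lines of Python. This is a sketch: the word-list location is an assumption, and a careful implementation would scan every candidate to detect accidental collisions rather than stopping at the first hit.

import zlib

TARGET_CRC = 0x77ea20bb  # checksum published with the obscured text

def crc32(text):
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF

def find_missing_word(obscured_text, candidate_words):
    # obscured_text contains a single "{}" placeholder where the
    # unknown 12-letter word was struck out.
    matches = []
    for word in candidate_words:
        if len(word) != 12:
            continue                 # the blank tells us the length
        candidate = obscured_text.format(word)
        if crc32(candidate) == TARGET_CRC:
            matches.append(word)     # collect all hits to spot collisions
    return matches

# Hypothetical usage with a system word list (path is an assumption):
# with open("/usr/share/dict/words") as f:
#     words = [w.strip() for w in f]
# print(find_missing_word(obscured, words))   # expected: ["consequences"]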
In
itself, this is not very useful. After striking out just a handful of
hard-to-guess words, the time required for a brute-force search would
become prohibitive – it grows exponentially with each word struck
out. But we have identified a method for applying a generic
“guess-and-check” algorithm to the reconstruction of an arbitrary
text from an obscured text with higher redundancy. Can we make this
algorithm more efficient?
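To see why the naive approach fails, suppose each hard-to-guess blank has on the order of 100,000 dictionary candidates; the number of combinations to hash grows as d^k with the number of blanks k:

d = 100_000            # rough size of an English word list
for k in range(1, 5):  # number of hard-to-guess words struck out
    print(k, "words obscured ->", d ** k, "combinations to hash")
# 1 word  -> 10^5  combinations (trivial)
# 4 words -> 10^20 combinations (hopeless for brute force)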
In
the first example, it would be quite easy to train a neural network
to perform a guess-and-check algorithm to emulate the guesses of an
English speaker. But if we focus on trying to recreate the mind of an
English speaker, we miss the wider application of the method for the
purposes of compression. Let us call the person who encodes
information by increasing its redundancy the obscurer (O) and
the person who tries to reconstruct the original input, with the aid
of its checksum, the revealer (R). O is able to strike out the
easy-to-guess words because she knows that the substitutions will be
easy-to-guess for R. So, as long as this property holds – that R
will easily be able to guess the substitution based on the
information given by O – we have a system that can losslessly
encode and decode information into a form with higher redundancy than
was present at the input to the system.
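One way to picture the division of labour is as a pair of objects with complementary methods. This is only a sketch; the names and the shape of the published hints are assumptions, not a specification:

from dataclasses import dataclass
from typing import List

@dataclass
class ObscuredBlock:
    text: str          # input with easy-to-guess spans struck out
    hints: List[int]   # checksums published alongside the obscured text

class Obscurer:
    """O: increases redundancy by removing spans it predicts R can recover."""
    def obscure(self, text: str) -> ObscuredBlock:
        raise NotImplementedError  # learned model goes here

class Revealer:
    """R: guess-and-check reconstruction, validated against O's hints."""
    def reveal(self, block: ObscuredBlock) -> str:
        raise NotImplementedError  # learned model goes here

# The lossless round trip is the contract the pair must satisfy:
#   Revealer().reveal(Obscurer().obscure(text)) == text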
We
can think of the obscurer and revealer as playing a game, such as a
crossword puzzle – O is trying to construct puzzles that have as
much redundancy as possible for a given degree of difficulty and R is
trying to reconstruct the original input, based on the puzzle, as
quickly as possible. We can implement O and R as a pair of
convolutional neural networks (CNNs) guided by Monte Carlo Tree Search
(MCTS) – a combination used to great effect by DeepMind in its
AlphaGo, AlphaGo Zero and AlphaZero game software. During
training, we calculate the difficulty of each encoding/obscuring
choice that O makes and we train O to prefer paths that result in
greater redundancy, while avoiding paths that result in excessively
high difficulty for R.
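In training terms, each obscuring choice by O can be scored by the redundancy it adds minus a penalty for how hard it makes R's search. The sketch below shows one possible reward shaping; the weighting and the difficulty estimate are assumptions, not part of any fixed design:

def obscuring_reward(chars_removed, revealer_cost, difficulty_budget,
                     penalty_weight=1.0):
    # chars_removed:     redundancy gained by this obscuring move
    # revealer_cost:     estimated guess-and-check work R needed
    #                    (e.g. MCTS rollouts used during self-play)
    # difficulty_budget: maximum acceptable cost for the chosen setting
    overshoot = max(0.0, revealer_cost - difficulty_budget)
    return chars_removed - penalty_weight * overshoot

# During self-play, O's policy/value network is trained (AlphaZero-style)
# toward move sequences with high cumulative reward, i.e. high redundancy
# achieved without pushing R past the difficulty budget.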
For
each compression context, we choose a training corpus and train O and
R. We can think of this as choosing which game O and R will play with
each other. An English-text game is different from a music-audio-file
game, and so on. It is common practice in modern compression
utilities to automatically switch compressors based on context. For
this purpose, we train a feed-forward network Q to function as the
meta-compressor. During compression of a file, Q switches context as
appropriate, so that O is likely to be playing the game that is best
suited to increasing the redundancy of the input file for a given
degree of difficulty, including, in the case of random data, the
pass-through game in which the source information is passed through
as-is. For each game on which O and R are trained, we can vary the
difficulty parameter to allow a user to choose between quick, mild
compression and slower, more aggressive compression.
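Structurally, Q is just a classifier that routes each block of the input to the (O, R) pair trained on the matching corpus, or to pass-through when nothing helps. A sketch, where the context labels, block granularity and the classifier's predict interface are all assumptions:

CONTEXTS = ["english_text", "audio", "passthrough"]  # one (O, R) pair per context

class MetaCompressor:
    def __init__(self, classifier, obscurers):
        self.classifier = classifier    # feed-forward network Q
        self.obscurers = obscurers      # maps context label -> trained Obscurer

    def compress_block(self, block):
        context = self.classifier.predict(block)   # Q picks the game
        if context == "passthrough":
            return context, block                  # e.g. random/incompressible data
        return context, self.obscurers[context].obscure(block)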
Note
that we do not necessarily use the CRC32 hash; it was merely chosen
as an academic example. The hash should be chosen based on
probability considerations, that is, it should be chosen such that
the probability of collision is negligibly small for the given
context. The puzzle game played between O and R uses as many hashes
as are suitable for guiding the guess-and-check puzzle. The problem
with obscuring random words is that the guess-and-check complexity
grows exponentially with each word obscured. By choosing to obscure
only words that are easy-to-guess for R and only obscuring so many
words before providing a puzzle hint (hash), O can limit the
difficulty of R’s task.
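The collision consideration can be made concrete: with an n-bit hash and k candidate reconstructions tested per hint, the expected number of false matches is roughly k / 2^n. A back-of-the-envelope estimate, assuming the hash behaves uniformly:

def expected_false_matches(candidates_tested, hash_bits):
    # Rough estimate assuming the hash output is uniform over 2**hash_bits values.
    return candidates_tested / float(2 ** hash_bits)

print(expected_false_matches(100_000, 32))   # CRC32:  ~2.3e-05
print(expected_false_matches(100_000, 64))   # 64-bit: ~5.4e-15
print(expected_false_matches(10 ** 20, 32))  # many blanks per hint: collisions certain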
We
can think of the game being played between O and R as building and
pruning a variable-order, conditional-entropy model of the input.
At each branch point in the model, the lowest-entropy (most predictable)
branches are liable to be discarded (obscured). Every so many
branches (based on the training of O and R), a puzzle hint is given
by O so that R can reconstruct (reveal) the missing branches with a
reasonable amount of difficulty.
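A minimal stand-in for such a model is a variable-order n-gram predictor: O obscures a token only when the model's conditional probability for it, given the preceding context, is above some threshold. This is a sketch only; the order and threshold are assumptions:

from collections import defaultdict, Counter

class NgramModel:
    def __init__(self, order=2):
        self.order = order
        self.counts = defaultdict(Counter)   # context tuple -> next-token counts

    def train(self, tokens):
        for i in range(self.order, len(tokens)):
            context = tuple(tokens[i - self.order:i])
            self.counts[context][tokens[i]] += 1

    def prob(self, context, token):
        c = self.counts[tuple(context)]
        total = sum(c.values())
        return c[token] / total if total else 0.0

def choose_tokens_to_obscure(tokens, model, threshold=0.5):
    # Obscure a token when the model considers it highly predictable
    # from the preceding context - the low-surprisal branches.
    to_obscure = []
    for i in range(model.order, len(tokens)):
        context = tokens[i - model.order:i]
        if model.prob(context, tokens[i]) >= threshold:
            to_obscure.append(i)
    return to_obscure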
The
resulting, more redundant data produced by O resembles the output of
lossy compression because the obscured information is simply discarded. Any
suitable lossless compression algorithm can be applied to the output
of O. If O and R are well-trained, this should result in an improved
compression ratio or, at worst, no change to the compression ratio.
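The final step is conventional: feed O's output to an off-the-shelf lossless compressor and keep whichever result is smaller. A sketch using Python's zlib; the placeholder strings stand in for real data, so no particular ratio is implied:

import zlib

original = b"...original input block..."
obscured = b"...same block after O has struck out predictable spans..."
hints = b"...checksums published by O..."

plain = zlib.compress(original, level=9)
redundant = zlib.compress(obscured + hints, level=9)

# Keep the obscured encoding only if it actually wins after adding the hints;
# otherwise fall back to compressing the original directly (pass-through).
best = min(plain, redundant, key=len)
print(len(plain), len(redundant), len(best))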