Lossy compression is used on rich media - audio, images, video, and so on - to great effect. JPEG compression can drastically reduce the file size of an image without severely degrading its quality, and size reductions of 90% or more are attainable for applications where image quality is less important. This tradeoff between data quality and file size is only possible when the data being compressed is noise-tolerant: a compressed image of a cat may be slightly fuzzy or mis-colored, but it is still recognizable as a cat. A text file, such as a legal document or a computer program, on the other hand, cannot tolerate this kind of noise - the semantic content would almost certainly be destroyed. For these kinds of files, we use lossless compression.
It would be nice, however, if we could achieve the filesize benefits of lossy compression, without introducing noise - can we have our lossless compression cake and eat it, too? Consider the following quote:
No one rejects, dislikes or avoids
pleasure itself, because it is pleasure, but because those who do not
know how to pursue pleasure rationally encounter consequences that
are extremely painful. Nor again is there anyone who loves or pursues
or desires to obtain pain of itself, because it is pain, but because
occasionally circumstances occur in which toil and pain can procure
him some great pleasure. To take a trivial example, which of us ever
undertakes laborious physical exercise, except to obtain some
advantage from it? But who has any right to find fault with a man who
chooses to enjoy a pleasure that has no annoying consequences, or one
who avoids a pain that produces no resultant pleasure?
- Cicero, De Finibus
As
we know, the more often a particular character or word is repeated in
a text, the higher the redundancy of that text and the more
compressible the text is. Can we apply a transform to this text that
will render it more redundant, without losing our ability to recover
the text exactly?
No one rejects, dislikes ◦◦ avoids
pleasure itself, because it ◦◦ pleasure, ◦◦◦ because those
who do ◦◦◦ know how ◦◦ pursue pleasure rationally encounter
consequences that ◦◦◦ extremely painful. Nor again ◦◦ there
anyone who loves ◦◦ pursues ◦◦ desires to obtain pain of
itself, because it ◦◦ pain, ◦◦◦ because occasionally
circumstances occur ◦◦ which toil ◦◦◦ pain can procure him
some great pleasure. To take ◦ trivial example, which of us ever
undertakes laborious physical exercise, except ◦◦ obtain some
advantage from ◦◦? But who has any right ◦◦ find fault with ◦
man who chooses ◦◦ enjoy ◦ pleasure that has ◦◦ annoying
consequences, or one who avoids ◦ pain that produces ◦◦
resultant pleasure?
A
quick glance at the text should allow you to convince yourself that
an English speaker will be able to easily reconstruct the original
text with few, if any, errors. Clearly, this version of the text has
higher redundancy because we have replaced 46 separate characters –
drawn from a subset of the English alphabet – with 46 repetitions
of a single character. Thus, this version of the text lends itself to a
better compression ratio. Is there a way that we could ensure that
the person who is trying to decode this text has reconstructed the
original text? The answer is to take a hash of the original text and
give this to the person trying to decode the obscured text. In this
case, the CRC32 of the original text is 0x77ea20bb. If the person
decoding the obscured text makes a mistake, say, by choosing “yet
because” instead of “but because” for the third obscured
word, the resulting CRC32 will be 0xffa9da29. So, by adding a few
bytes to store the hash of the original text, we can strike out words
that are easy to guess from the context, and an English speaker will
be able to recover the original text and convince herself that the
reconstructed text is identical to the original text.
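As a concrete sketch of the check, here is the idea in Python using the standard zlib module. The variable contents are placeholders for the full texts, and the exact hex value depends on the precise bytes of the original, so nothing below should be read as a reference implementation:

import zlib

def crc32_hex(text):
    # CRC32 of the UTF-8 bytes, as an 8-digit hex string.
    return format(zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF, "08x")

# Stand-ins for the full quote; in practice these would be the complete texts.
original_text = "...the full, unobscured Cicero quote..."
reconstructed_text = "...R's attempt at filling in the blanks..."

# O publishes this value alongside the obscured text
# (for the quote above it would be 77ea20bb).
published_checksum = crc32_hex(original_text)

# R checks a candidate reconstruction against the published checksum.
if crc32_hex(reconstructed_text) == published_checksum:
    print("reconstruction matches the original")
else:
    print("at least one guessed word is wrong - try again")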
We
have been careful to strike out only words that are easy-to-guess for
an English speaker (from context) – but do we have to stop there?
The answer is no. Let’s say we strike out a large word that cannot necessarily be guessed from the context:
No one rejects, dislikes ◦◦ avoids
pleasure itself, because it ◦◦ pleasure, ◦◦◦ because those
who do ◦◦◦ know how ◦◦ pursue pleasure rationally encounter
◦◦◦◦◦◦◦◦◦◦◦◦ that ◦◦◦ extremely
painful. Nor again ◦◦ there anyone who loves ◦◦ pursues ◦◦
desires to obtain pain of itself, because it ◦◦ pain, ◦◦◦
because occasionally circumstances occur ◦◦ which toil ◦◦◦
pain can procure him some great pleasure. To take ◦ trivial
example, which of us ever undertakes laborious physical exercise,
except ◦◦ obtain some advantage from ◦◦? But who has any
right ◦◦ find fault with ◦ man who chooses ◦◦ enjoy ◦
pleasure that has ◦◦ annoying consequences, or one who avoids ◦
pain that produces ◦◦ resultant pleasure?
Here,
we have increased the number of struck-out characters from 46 to 58.
The word that has been obscured has length 12. There are thousands of
possible words that could be substituted here but, in all likelihood,
only one of them will satisfy the criterion that the CRC32 hash of
the resulting text block be 0x77ea20bb – the word “consequences”.
Thus, we could write a program that searches through a dictionary
until it finds a word yielding the correct hash, and returns the result.
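Such a brute-force search is only a few lines of Python. This is a sketch: the word-list location is an assumption, and a careful implementation would scan every candidate to detect accidental collisions rather than stopping at the first hit.

import zlib

TARGET_CRC = 0x77ea20bb  # checksum published with the obscured text

def crc32(text):
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF

def find_missing_word(obscured_text, candidate_words):
    # obscured_text contains a single "{}" placeholder where the
    # unknown 12-letter word was struck out.
    matches = []
    for word in candidate_words:
        if len(word) != 12:
            continue                 # the blank tells us the length
        candidate = obscured_text.format(word)
        if crc32(candidate) == TARGET_CRC:
            matches.append(word)     # collect all hits to spot collisions
    return matches

# Hypothetical usage with a system word list (path is an assumption):
# with open("/usr/share/dict/words") as f:
#     words = [w.strip() for w in f]
# print(find_missing_word(obscured, words))   # expected: ["consequences"]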
In
itself, this is not very useful. After striking out just a handful of
hard-to-guess words, the time required for a brute-force search would
become prohibitive – it grows exponentially with each word struck
out. But we have identified a method for applying a generic
“guess-and-check” algorithm to the reconstruction of an arbitrary
text from an obscured text with higher redundancy. Can we make this
algorithm more efficient?
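To see why the naive approach fails, suppose each hard-to-guess blank has on the order of 100,000 dictionary candidates; the number of combinations to hash grows as d^k with the number of blanks k:

d = 100_000            # rough size of an English word list
for k in range(1, 5):  # number of hard-to-guess words struck out
    print(k, "words obscured ->", d ** k, "combinations to hash")
# 1 word  -> 10^5  combinations (trivial)
# 4 words -> 10^20 combinations (hopeless for brute force)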
In
the first example, it would be quite easy to train a neural network
to perform a guess-and-check algorithm to emulate the guesses of an
English speaker. But if we focus on trying to recreate the mind of an
English speaker, we miss the wider application of the method for the
purposes of compression. Let us call the person who encodes
information by increasing its redundancy the obscurer (O) and
the person who tries to reconstruct the original input, with the aid
of its checksum, the revealer (R). O is able to strike out the
easy-to-guess words because she knows that the substitutions will be
easy-to-guess for R. So, as long as this property holds – that R
will easily be able to guess the substitution based on the
information given by O – we have a system that can losslessly
encode and decode information into a form with higher redundancy than
was present at the input to the system.
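One way to picture the division of labour is as a pair of objects with complementary methods. This is only a sketch; the names and the shape of the published hints are assumptions, not a specification:

from dataclasses import dataclass
from typing import List

@dataclass
class ObscuredBlock:
    text: str          # input with easy-to-guess spans struck out
    hints: List[int]   # checksums published alongside the obscured text

class Obscurer:
    """O: increases redundancy by removing spans it predicts R can recover."""
    def obscure(self, text: str) -> ObscuredBlock:
        raise NotImplementedError  # learned model goes here

class Revealer:
    """R: guess-and-check reconstruction, validated against O's hints."""
    def reveal(self, block: ObscuredBlock) -> str:
        raise NotImplementedError  # learned model goes here

# The lossless round trip is the contract the pair must satisfy:
#   Revealer().reveal(Obscurer().obscure(text)) == text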
We
can think of the obscurer and revealer as playing a game, such as a
crossword puzzle – O is trying to construct puzzles that have as
much redundancy as possible for a given degree of difficulty and R is
trying to reconstruct the original input, based on the puzzle, as
quickly as possible. We can implement O and R as a pair of
convolutional neural networks (CNNs) guided by Monte Carlo Tree Search
(MCTS) – a combination used to great effect by DeepMind in its
AlphaGo, AlphaGo Zero and AlphaZero game software. During
training, we calculate the difficulty of each encoding/obscuring
choice that O makes and we train O to prefer paths that result in
greater redundancy, while avoiding paths that result in excessively
high difficulty for R.
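In training terms, each obscuring choice by O can be scored by the redundancy it adds minus a penalty for how hard it makes R's search. The sketch below shows one possible reward shaping; the weighting and the difficulty estimate are assumptions, not part of any fixed design:

def obscuring_reward(chars_removed, revealer_cost, difficulty_budget,
                     penalty_weight=1.0):
    # chars_removed:     redundancy gained by this obscuring move
    # revealer_cost:     estimated guess-and-check work R needed
    #                    (e.g. MCTS rollouts used during self-play)
    # difficulty_budget: maximum acceptable cost for the chosen setting
    overshoot = max(0.0, revealer_cost - difficulty_budget)
    return chars_removed - penalty_weight * overshoot

# During self-play, O's policy/value network is trained (AlphaZero-style)
# toward move sequences with high cumulative reward, i.e. high redundancy
# achieved without pushing R past the difficulty budget.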
For
each compression context, we choose a training corpus and train O and
R. We can think of this as choosing which game O and R will play with
each other. An English-text game is different from a music-audio-file
game, and so on. It is common practice in modern compression
utilities to automatically switch compressors based on context. For
this purpose, we train a feed-forward network Q to function as the
meta-compressor. During compression of a file, Q switches context as
appropriate, so that O is likely to be playing the game that is best
suited to increasing the redundancy of the input file for a given
degree of difficulty, including, in the case of random data, the
pass-through game in which the source information is passed through
as-is. For each game on which O and R are trained, we can vary the
difficulty parameter to allow a user to choose between quick, mild
compression and slower, more aggressive compression.
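Structurally, Q is just a classifier that routes each block of the input to the (O, R) pair trained on the matching corpus, or to pass-through when nothing helps. A sketch, where the context labels, block granularity and the classifier's predict interface are all assumptions:

CONTEXTS = ["english_text", "audio", "passthrough"]  # one (O, R) pair per context

class MetaCompressor:
    def __init__(self, classifier, obscurers):
        self.classifier = classifier    # feed-forward network Q
        self.obscurers = obscurers      # maps context label -> trained Obscurer

    def compress_block(self, block):
        context = self.classifier.predict(block)   # Q picks the game
        if context == "passthrough":
            return context, block                  # e.g. random/incompressible data
        return context, self.obscurers[context].obscure(block)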
Note
that we do not necessarily use the CRC32 hash; it was merely chosen
as an academic example. The hash should be chosen based on
probability considerations, that is, it should be chosen such that
the probability of collision is negligibly small for the given
context. The puzzle game played between O and R uses as many hashes
as are suitable for guiding the guess-and-check puzzle. The problem
with obscuring random words is that the guess-and-check complexity
grows exponentially with each word obscured. By choosing to obscure
only words that are easy-to-guess for R and only obscuring so many
words before providing a puzzle hint (hash), O can limit the
difficulty of R’s task.
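The collision consideration can be made concrete: with an n-bit hash and k candidate reconstructions tested per hint, the expected number of false matches is roughly k / 2^n. A back-of-the-envelope estimate, assuming the hash behaves uniformly:

def expected_false_matches(candidates_tested, hash_bits):
    # Rough estimate assuming the hash output is uniform over 2**hash_bits values.
    return candidates_tested / float(2 ** hash_bits)

print(expected_false_matches(100_000, 32))   # CRC32:  ~2.3e-05
print(expected_false_matches(100_000, 64))   # 64-bit: ~5.4e-15
print(expected_false_matches(10 ** 20, 32))  # many blanks per hint: collisions certain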
We
can think of the game being played between O and R as building and
pruning a variable-order, conditional-entropy model of the input.
At each branch point in the model, the lowest-entropy (most predictable)
branches are liable to be discarded (obscured). Every so many
branches (based on the training of O and R), a puzzle hint is given
by O so that R can reconstruct (reveal) the missing branches with a
reasonable amount of difficulty.
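A minimal stand-in for such a model is a variable-order n-gram predictor: O obscures a token only when the model's conditional probability for it, given the preceding context, is above some threshold. This is a sketch only; the order and threshold are assumptions:

from collections import defaultdict, Counter

class NgramModel:
    def __init__(self, order=2):
        self.order = order
        self.counts = defaultdict(Counter)   # context tuple -> next-token counts

    def train(self, tokens):
        for i in range(self.order, len(tokens)):
            context = tuple(tokens[i - self.order:i])
            self.counts[context][tokens[i]] += 1

    def prob(self, context, token):
        c = self.counts[tuple(context)]
        total = sum(c.values())
        return c[token] / total if total else 0.0

def choose_tokens_to_obscure(tokens, model, threshold=0.5):
    # Obscure a token when the model considers it highly predictable
    # from the preceding context - the low-surprisal branches.
    to_obscure = []
    for i in range(model.order, len(tokens)):
        context = tokens[i - model.order:i]
        if model.prob(context, tokens[i]) >= threshold:
            to_obscure.append(i)
    return to_obscure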
The
resulting, more redundant data produced by O resembles the output of
lossy compression because the obscured information is simply discarded. Any
suitable lossless compression algorithm can be applied to the output
of O. If O and R are well-trained, this should result in an improved
compression ratio or, at worst, no change to the compression ratio.
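The final step is conventional: feed O's output to an off-the-shelf lossless compressor and keep whichever result is smaller. A sketch using Python's zlib; the placeholder strings stand in for real data, so no particular ratio is implied:

import zlib

original = b"...original input block..."
obscured = b"...same block after O has struck out predictable spans..."
hints = b"...checksums published by O..."

plain = zlib.compress(original, level=9)
redundant = zlib.compress(obscured + hints, level=9)

# Keep the obscured encoding only if it actually wins after adding the hints;
# otherwise fall back to compressing the original directly (pass-through).
best = min(plain, redundant, key=len)
print(len(plain), len(redundant), len(best))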