Weissman Scores: Useful?
September 23rd, 2015
tech
The metric is:
   efficiency_new
-------------------
efficiency_baseline

Where efficiency is:

 compression_ratio
---------------------
log(compression_time)

The compression_time is just how long it takes to run the algorithm, while the compression_ratio is:

uncompressed_size
-----------------
 compressed_size

There are two ideas here: (a) the ratio between the efficiency of a new algorithm and a standard baseline algorithm like
gzip
is
a good way to control for how compressible a corpus is and (b)
efficiency
captures what matters about a compression
algorithm. While (a) seems plausible, (b) is way off. The problem
with this metric is that it pays much more attention to the time than
the ratio, when we usually care much more about the ratio. So let's
look again at how efficiency
is defined:
 compression_ratio
---------------------
log(compression_time)

Say we currently can manage 10% compression (ratio = 100/90 = 1.11) and it takes us 16ms. That's (using log base 2):

compression_ratio / log(compression_time) = 1.11 / log(16) = 0.2775

Now we have two proposals. One brings us from 10% compression to 55% compression (ratio = 100/45 = 2.22) while the other one drops compression time to 4ms:

compression_ratio / log(compression_time) = 2.22 / log(16) = 0.555

vs

compression_ratio / log(compression_time) = 1.11 / log(4) = 0.555

These have the same
efficiency
, but improving the space savings 5.5x (from 10% to 55%) will nearly always be better than improving speed by 4x. Take this
to the extreme: a compressor that exits immediately leaving its input
unchanged has the best possible efficiency
score, despite
being useless.
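
To make the arithmetic concrete, here's a small sketch of the metric as described above. The function names are mine, and I'm assuming the log is base 2, since that's what makes the worked example come out (log(16) = 4, log(4) = 2):

    from math import log2

    def efficiency(compression_ratio, compression_time):
        # compression_ratio / log(compression_time), assuming log base 2
        return compression_ratio / log2(compression_time)

    def weissman_score(new_ratio, new_time, baseline_ratio, baseline_time):
        # ratio of the new algorithm's efficiency to the baseline's
        return efficiency(new_ratio, new_time) / efficiency(baseline_ratio, baseline_time)

    print(efficiency(1.11, 16))  # current:      10% compression in 16ms -> 0.2775
    print(efficiency(2.22, 16))  # better ratio: 55% compression in 16ms -> 0.555
    print(efficiency(1.11, 4))   # better speed: 10% compression in  4ms -> 0.555
    print(weissman_score(2.22, 16, 1.11, 16))  # 2.0: twice the baseline's efficiency
    # The two "improvements" get identical scores, even though the one that
    # compresses better is nearly always the one you'd actually want.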
This score is also missing some other things we care a lot about. Many things are compressed once but decompressed millions of times. Consider an image on a popular website. The site owner probably compressed the image once on uploading, while every single site visitor will need to decompress the image before they can view it. So in this sort of case we're often willing to put much more work into compression if it can save a little work each time it's decompressed.
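
As a rough illustration of that asymmetry (all numbers here are made up), the total cost is dominated by decompression once the download count gets large:

    # Made-up numbers: a compressor that spends 100x longer compressing but
    # shaves a little off each decompression still wins once the file is
    # downloaded a million times.
    def total_ms(compress_ms, decompress_ms, downloads=1_000_000):
        return compress_ms + downloads * decompress_ms

    print(total_ms(compress_ms=50, decompress_ms=5.0))     # 5,000,050 ms
    print(total_ms(compress_ms=5_000, decompress_ms=4.5))  # 4,505,000 ms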
Another thing that matters a lot for practical adoption of compression
algorithms is resource efficiency, primarily memory. When your
browser downloads a web page it tells the server Accept-Encoding:
gzip
and the server will typically gzip-compress its response on
the fly. The server tags it Content-Encoding: gzip
, and the
browser automatically unzips it before using the response. Both the
browser and server are processing lots of simultaneous transfers, and
each one of those needs to handle compression, so there's a big
difference between each needing ~50K of memory with gzip and
needing up to several MB with brotli.
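
For a concrete picture of the on-the-fly case, here's a minimal sketch (my own, not anything from the post) of a server that gzips responses when the browser advertises support:

    # Minimal sketch of on-the-fly HTTP compression: if the request says
    # "Accept-Encoding: gzip", compress the body per-request and tag the
    # response "Content-Encoding: gzip" so the browser unzips it automatically.
    import gzip
    from wsgiref.simple_server import make_server

    def app(environ, start_response):
        body = b"hello world " * 1000  # stand-in for the real response
        headers = [("Content-Type", "text/plain")]
        if "gzip" in environ.get("HTTP_ACCEPT_ENCODING", ""):
            body = gzip.compress(body)
            headers.append(("Content-Encoding", "gzip"))
        headers.append(("Content-Length", str(len(body))))
        start_response("200 OK", headers)
        return [body]

    if __name__ == "__main__":
        with make_server("", 8000, app) as server:
            server.serve_forever()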
This also illustrates another important issue with compression: different use cases need different metrics. For on-the-fly compression you don't want to spend so much time compressing that it outweighs the time saved by sending fewer bytes, while for compression done once in advance even small increases in the compression ratio matter a huge amount.
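
For the on-the-fly case the break-even is just arithmetic; here's a sketch with a made-up link speed and response size, reusing the 10%-in-16ms example from above:

    # Made-up link speed and response size; the 1.11 ratio and 16ms are the
    # example from earlier in the post. Compressing on the fly only pays off
    # if the wire time it saves exceeds the time spent compressing.
    link_mbps = 10
    original_bytes = 1_000_000
    ratio, compress_ms = 1.11, 16

    def transfer_ms(n_bytes):
        return n_bytes * 8 / (link_mbps * 1000)

    saved_ms = transfer_ms(original_bytes) - transfer_ms(original_bytes / ratio)
    print(saved_ms - compress_ms)  # ~63ms net win with these numbers; on a
                                   # fast enough link the same 16ms is a net loss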
I'm definitely in favor of TV shows seeking out expert advice, but this metric should probably stay on TV.