The Upcoming Generation of Lossless Data Compression, Part 1
First, yes, there is such a thing as next-gen lossless data compression.
Here’s one for starters: cross-file content-aware dedup for images and video.
Take a JPEG image, compress it with your favorite lossless compression algorithm, say BZip2 or LZMA, and you’ll get a file that is no smaller, and often slightly larger. Of course that makes sense, as the data inside a JPEG is already compressed, which leaves a general-purpose compressor almost nothing to squeeze out.
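If you want to see this for yourself without hunting for a photo, here’s a minimal sketch in Python. It assumes random bytes are a fair stand-in for the already-compressed payload of a JPEG (real JPEGs have headers and some structure, so the numbers will differ a little), and it uses only the standard library’s bz2 and lzma modules.

```python
import bz2
import lzma
import os

# Assumption: random bytes stand in for a JPEG's already-compressed payload.
# Swap in open("photo.jpg", "rb").read() to try it on a real file.
payload = os.urandom(200_000)

bz2_size = len(bz2.compress(payload))
lzma_size = len(lzma.compress(payload))

print(f"original: {len(payload)} bytes")
print(f"bzip2:    {bz2_size} bytes")
print(f"lzma:     {lzma_size} bytes")
```

Both compressed sizes should come out a bit larger than the original, because the compressors add their own framing and find nothing to remove.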
Now take a group of, say, 100 JPEGs, and compress them into a single archive. You’ll get no real compression. Even though there are likely redundancies among those photos, JPEG isn’t designed to handle cross-file compression, and LZMA can’t handle it because the redundancies don’t show up as repeated byte patterns in the already-compressed files.
But do that with a compression algorithm from the future, and you’ll get a much smaller file. These algorithms will know how to do content-aware cross-file compression.
What would these algorithms look like?
In fact we already have algorithms that know how to de-dup among a set of pictures. They’re called video compression algos. Take those same 100 JPEGs, put them into a movie file format, run your favorite video compression algorithm over them with the right parameters, and you’ll get your smaller compressed archive. Of course this is just a hacked approach, but you get the picture!
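Here’s a toy sketch of the idea behind that trick, in Python. It is not a real video codec and it doesn’t touch actual JPEGs: the two "frames" are made-up pixel buffers (my assumption for illustration), where the second is the first with every pixel made slightly brighter. The byte patterns differ everywhere, so LZMA alone sees nothing to share, but storing the second frame as a difference from the first makes the redundancy obvious. The helper names delta_encode and delta_decode are hypothetical.

```python
import lzma
import os

def delta_encode(reference: bytes, frame: bytes) -> bytes:
    """Per-byte residual (frame - reference) mod 256; exactly invertible."""
    return bytes((f - r) % 256 for f, r in zip(frame, reference))

def delta_decode(reference: bytes, residual: bytes) -> bytes:
    """Rebuild the original frame from the reference and the residual."""
    return bytes((r + d) % 256 for r, d in zip(reference, residual))

# Assumption: toy stand-ins for decoded pixel data, not real photos.
frame_a = os.urandom(50_000)                     # a noisy "photo"
frame_b = bytes((p + 3) % 256 for p in frame_a)  # same scene, a bit brighter

# Byte-level compression of both frames: no repeated patterns to find.
plain_size = len(lzma.compress(frame_a + frame_b))

# Content-aware: keep frame_a, store frame_b as a residual against it.
residual = delta_encode(frame_a, frame_b)
delta_size = len(lzma.compress(frame_a + residual))

# Still lossless: frame_b comes back bit for bit.
restored = delta_decode(frame_a, residual)

print(f"plain: {plain_size} bytes, with delta: {delta_size} bytes")
```

The delta version should land at roughly half the size, since the second frame collapses to a run of identical residuals. Real video codecs do something far more elaborate (motion estimation, transforms), but the predict-then-store-the-difference principle is the same.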
There are also similar content-aware tricks you could do on text, audio, etc.
This technology will become increasingly important as more and more data gets centralized into the cloud.
Someone from the commercial side is already working on this. After thinking about this idea for several hours and getting really excited, I came across a company called Ocarina (chalk talk here) that is commercializing this approach.
That doesn’t mean this technology should be limited to big companies with big budgets, though. I’m hoping someone from academia or the open source community will eventually pick this up and run with it!
Part 2 to come soon! (Edit: link to part 2) In the meantime, let’s hear it if you have any further thoughts or ideas!
(There was a good discussion of this blog post on Hacker News here.)