Not sure how that would work out in practice. We somewhat rushed the experiments in any case; maybe we could let the methods train for longer. Eventually the adversary might learn any regular patterns that appear, forcing the generator to come up with something that cannot be detected easily.
(We trained everything to 1M steps. Perhaps letting it train to 2M would solve it.)
Independently of this work, we have models from previous work which are competitive with HEVC while being significantly smaller. They won't look nearly as good as what you see in the website demo, but they're still better.
I don't have any such model handy but perhaps it's 10x-20x smaller.
We don't claim that this (or even the previous work) is the way forward for images, but the results these methods produce are very compelling, so we hope to incentivize more researchers to look in this direction and help figure out how to deploy them.
We haven't specifically compared to AVIF, which as far as we know is still under development. We'd be happy to compare, but it's unlikely that we'd learn much from it. As far as we know, AVIF is less than 100% better than HEVC, whereas we're comparing against HEVC at 300% of our bitrate.
Of course, we'd be happy to add any additional images from other codecs if they're available.
I would add JPEG-XL in addition if you're looking for suggestions for other codecs to compare to. It's very competitive with AV1 and beats it, in my opinion, at higher bitrates.
Admittedly, you're not likely to learn much from this that is useful for your research, but most of the interest from people clicking on this is probably wanting to see the latest developments in image compression.
What you suggest has already been done: train a neural network on the output of BPG or JPEG and ask it to reconstruct the input with only the decompressed pixels available.
It's definitely a valid approach, but the limitation is that if the network needs texture-specific information that cannot be extracted from the decoded pixels, it can't really do much.
There were approaches where such information was also sent on the side, which yielded better results, of course.
The field is wide open and each approach has its own challenges (e.g., you may need to train one network per quantization level if you're going to do restoration).
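To make the "one network per quantization level" point concrete, here is a toy numpy sketch (ours, not the paper's method): instead of a CNN restorer, we fit a trivial per-quality-level affine correction from decoded pixels back toward the originals. The `fake_codec` stand-in and all names are our own illustration; the structural point is the same as with a real restoration network: you fit one model per quantization level, and the model can only exploit what survives decoding.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_codec(x, q):
    """Stand-in for BPG/JPEG: coarser quantization at lower quality q."""
    step = 1.0 / q
    return np.round(x / step) * step

def fit_restorer(decoded, original):
    """Least-squares gain/bias 'restoration': original ~= a * decoded + b."""
    A = np.stack([decoded, np.ones_like(decoded)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, original, rcond=None)
    return a, b

original = rng.uniform(0, 1, size=10_000)
restorers = {}
for q in (4, 8, 16):                    # one restoration model per quantization level
    decoded = fake_codec(original, q)
    restorers[q] = fit_restorer(decoded, original)

# Restoration can only use information present in the decoded pixels:
# texture detail destroyed by quantization cannot be recovered this way.
a, b = restorers[4]
decoded = fake_codec(original, 4)
mse_before = np.mean((decoded - original) ** 2)
mse_after = np.mean((a * decoded + b - original) ** 2)
```

A real system replaces `fit_restorer` with training a CNN per quality level, but the deployment headache (a bank of models, one per setting) is visible even in this toy.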
One of the things we discussed to address this is to have the ability to:
a) turn off detail hallucination completely given the same bitstream; and
b) store the median/maximum absolute error across the image.
(b) should allow the user to determine whether the image is suitable for their use-case.
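The stats in (b) are cheap to compute at encode time. A minimal sketch (function name and container layout are ours, purely hypothetical):

```python
import numpy as np

def fidelity_stats(original, reconstruction):
    """Per-image error stats that could be stored alongside the bitstream,
    so a user can judge whether hallucinated detail is acceptable."""
    err = np.abs(original.astype(np.float64) - reconstruction.astype(np.float64))
    return {
        "median_abs_err": float(np.median(err)),  # typical deviation
        "max_abs_err": float(err.max()),          # worst-case deviation
    }

orig = np.array([[10, 20], [30, 40]], dtype=np.uint8)
recon = np.array([[12, 20], [30, 35]], dtype=np.uint8)
stats = fidelity_stats(orig, recon)
```

For a medical or scientific user, a large `max_abs_err` would be the signal to fall back to the hallucination-free decode in (a).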
Is it possible to put the decoder into a feedback loop and search among multiple candidate encodings for one that minimizes the residual error? Something similar to trellis optimization in video codecs: http://akuvian.org/src/x264/trellis.txt
It would be possible - but by minimizing residual errors you end up in a similar regime as when minimizing MSE again, likely making reconstructions blurry!
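For intuition, here is a toy scalar-quantizer version of that decoder-in-the-loop search (in the spirit of trellis quantization, not the paper's method; all names are ours). Note how the selection criterion is an MSE-style residual, which is exactly the caveat above: the search pulls reconstructions back toward the blurry MSE-optimal regime.

```python
import numpy as np

def decode(codes, step):
    """Trivial stand-in decoder: dequantize."""
    return codes * step

def rd_cost(x, codes, step, lam):
    """Rate-distortion cost: MSE residual plus a crude rate proxy."""
    dist = np.mean((decode(codes, step) - x) ** 2)
    rate = np.mean(np.abs(codes))
    return dist + lam * rate

def encode_searched(x, step, lam=0.05, offsets=(-1, 0, 1)):
    """Run the decoder in the loop: perturb each quantized code and keep
    any candidate that lowers the rate-distortion cost."""
    greedy = np.round(x / step)          # nearest-neighbour quantization
    best = greedy.copy()
    for i in range(len(x)):
        for off in offsets:
            cand = best.copy()
            cand[i] = greedy[i] + off
            if rd_cost(x, cand, step, lam) < rd_cost(x, best, step, lam):
                best = cand
    return best

x = np.array([0.9, 0.1, -0.4, 2.3])
greedy = np.round(x / 0.5)
searched = encode_searched(x, 0.5)
```

With a learned decoder the inner `decode` call becomes a full network forward pass per candidate, so the search is also expensive; and unless the cost swaps MSE for a perceptual/adversarial criterion, it optimizes for exactly the artifacts the GAN loss was meant to avoid.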
Agreed! Alternatively, a semantic segmentation network could be used together with a masked MSE loss. In this paper we focused on showing the crazy potential of GANs for compression - let's see what future work brings.
On the standardization issue: the advantage of a method like the one we presented is that, as long as there exists a standard for model specification, every image can be encoded with an arbitrary computational graph linked from the container.
Imagine being able to have domain-specific models - say, a high-accuracy/precision model for medical images (super-close to lossless), and one for low-bandwidth applications where detail generation is paramount. Also imagine a program written today (assuming the standard is out) being able to decode images created with a model invented 10 years from now, doing things that were not even thought possible when the program was originally written. This should be possible because most of the low-level building blocks (like convolution and other mathematical operations) are all we need to define new models!
On noise: I'll let my coauthors find some links to noisy images to see what happens when you process those.
That was exactly the goal of the project! Basically, if the size doesn't allow for detail, we need to "hallucinate" it. This of course isn't necessary if there's enough bandwidth available for transmission, or enough storage.
On the other hand, in our paper we show that some generated detail can help even at higher bitrates.
(coauthor here) We used an adversarial loss in addition to a perceptual loss and MSE. None of these works super-well without the others.
The adversarial loss "learns" what a compressed image looks like and steers the decoder away from such outputs.
The perceptual loss (LPIPS) is not very sensitive to pure noise and allows for it, but is sensitive to texture details.
MSE tries to get the rough shape right.
We also ran a user study where people told us which images they preferred while having access to the original. Most preferred the added details even if they're not exactly right.
The idea is that any distortion loss imposes specific artifacts. For example, MSE tends to blur outputs, and CNN-feature-based losses (VGG, LPIPS) tend to produce gridding patterns or also blur. When the discriminator network sees these artifacts, they very obviously distinguish the reconstructions from the input, so the gradients from the discriminator guide the optimization away from them. Let me know if this helps!
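The three-term objective described above can be sketched like this. This is a schematic, not the paper's training code: the weights, the frozen random matrix standing in for a VGG/LPIPS feature extractor, and the generator-side adversarial term are all our own simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))   # frozen stand-in for a VGG/LPIPS feature extractor

def mse(x, y):
    """Gets the rough shape right, but blurs."""
    return np.mean((x - y) ** 2)

def perceptual(x, y):
    """Feature-space distance: sensitive to texture, tolerant of pure noise."""
    return np.mean((W @ x - W @ y) ** 2)

def adversarial(d_fake):
    """Non-saturating generator loss: push the discriminator's
    probability that the reconstruction is 'real' toward 1."""
    return -np.mean(np.log(d_fake + 1e-8))

def total_loss(x, y, d_fake, w=(1.0, 0.1, 0.01)):
    """Weighted sum of the three terms (weights are illustrative only)."""
    return w[0] * mse(x, y) + w[1] * perceptual(x, y) + w[2] * adversarial(d_fake)

x = np.zeros(64)
perfect = total_loss(x, x, d_fake=np.array([1.0]))  # ~0 for a perfect reconstruction
```

The interplay in the thread is visible in the structure: drop the adversarial term and the distortion terms' artifacts go unpunished; drop MSE/LPIPS and the adversarial term alone has no anchor to the input.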
(coauthor here) The 0.7 megapixels/sec is the speed of PNG decoding (to get the input) + encoding + decoding + PNG encoding (to get an output we can visualize in a browser).
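In other words, the figure measures the whole round trip, not just the neural codec. A hypothetical timing harness (ours, not the paper's code) makes the accounting explicit:

```python
import time
import numpy as np

def throughput_mpix_per_s(stages, image_hw, n_runs=3):
    """Megapixels/sec through the full pipeline: every stage in `stages`
    (e.g. png_decode, encode, decode, png_encode) is timed together."""
    h, w = image_hw
    t0 = time.perf_counter()
    for _ in range(n_runs):
        x = np.zeros((h, w, 3), dtype=np.uint8)
        for stage in stages:
            x = stage(x)
    elapsed = time.perf_counter() - t0
    return (n_runs * h * w / 1e6) / elapsed

identity = lambda x: x   # stand-in for each of the four pipeline stages
rate = throughput_mpix_per_s([identity] * 4, (256, 256))
```

With real stages, the PNG steps at both ends can dominate, so the neural encoder/decoder alone would be faster than the headline number suggests.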