This is interesting, but it's key to remember that audio and video codecs have three primary constraints in their design: compression efficiency, encode complexity, and decode complexity.
Simply addressing compression efficiency without considering the other constraints makes for an impractical codec.
If your encode complexity is too high, then you can't handle real-time (live) use cases, and the compute requirements may make the codec impractical for large libraries like those of social media platforms.
If your decode complexity is too high, mobile devices will suffer severe battery drain or simply won't be able to decode at all, even in hardware. Dedicated hardware may also be too complex to implement cost-effectively.
It can be worth deprioritizing encode complexity, since the majority of audio and video streaming is non-live content. YouTube, Netflix, TikTok, Instagram, and so on can all benefit even if encoding is slower than real time. Obviously there are cost-benefit considerations here, but they are doing AV1, so they are willing to accept some hit on compute costs as a trade for bandwidth costs.
A lot of that footage originates on smartphones, GoPros, drone cameras, etc., where hardware and power limitations do not allow one to run expensive encoding algorithms.
So the smartphone does a crummy H.264 encode that's bit-expensive but power-efficient, then the YouTube server saves the bits a million times over by transcoding to AV1. There's still room for at least 2 codecs on the Pareto curve.
Lyra is a real-time neural speech codec from Google - I don't know if they use it in the Pixel line for call compression, but they certainly could.
Interestingly, I had the idea of using their open-source version as a vocoder for a lightweight TTS model. It did work - as in, it produced intelligible speech - but with very rough audio quality on the validation set. No matter what I tweaked, after 1-2 epochs the validation error would always diverge from the training error, which to me suggests considerable redundancy in the compressed representations (i.e., two clips of perceptually similar audio can map to quite different compressed representations, so the TTS model has trouble learning a consistent mapping). I suspect there's still a lot more entropy to be squeezed out of it. The EnCodec authors encountered something similar, compressing their codec's output by a further 40% by simply layering a language model over the top.
These are interesting developments. We are seeing much interest in codec representations being used in generative AI applications. My specific exposure to EnCodec comes from MusicGen, and my initial impression is that some combination of the perceptual tuning of the codec itself and the chosen quantization seriously limits the perceptual audio quality of MusicGen's outputs. I have yet to use EnCodec in isolation, but its use in MusicGen does not showcase a perceptually lossless codec.
Another approach to neural-based data representation / encoding / decoding is Implicit Neural Representations (INRs). For INRs, you overfit a neural network, typically a Multilayer Perceptron (MLP)[1], to a single data point. For instance, in the case of a single image (the data point), the inputs would be the xy pixel coordinates—usually scaled from [0, n_pixels]^2 to either [0, 1]^2 or [-1, 1]^2—and the outputs would be the RGB values of each pixel. Once trained on pixels from this singular image, the INR model essentially becomes a representation of that data point. The image can then be re-rendered or reconstructed by inputting all the pixel coordinates and receiving the corresponding RGB values. This approach even allows for sampling at intermediate coordinates or a subset of coordinates for partial, progressive, or super-resolution decoding.
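To make that concrete, here is a minimal sketch of fitting an INR to one image, assuming PyTorch; the layer widths, ReLU activations, training length, and the random stand-in image are arbitrary choices for illustration (the literature typically uses SIREN-style sine activations or Fourier features for better detail):

```python
import torch
import torch.nn as nn

H, W = 64, 64
img = torch.rand(H, W, 3)  # stand-in for the single image being "memorized"

# Pixel coordinates scaled to [-1, 1]^2, one row per pixel.
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                        torch.linspace(-1, 1, W), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)   # (H*W, 2)
targets = img.reshape(-1, 3)                            # (H*W, 3)

# A small MLP mapping (x, y) -> (r, g, b).
inr = nn.Sequential(
    nn.Linear(2, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 3), nn.Sigmoid(),
)

opt = torch.optim.Adam(inr.parameters(), lr=1e-3)
for step in range(2000):  # overfit on purpose: the weights become the encoding
    opt.zero_grad()
    loss = nn.functional.mse_loss(inr(coords), targets)
    loss.backward()
    opt.step()

# "Decode" by querying coordinates; a denser grid gives super-resolution-style sampling.
recon = inr(coords).reshape(H, W, 3)
```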
This technique is generally effective as long as you can specify a reasonable coordinate system for your data, whether it's audio (n_samples, n_channels), 3D models (x, y, z) or (x, y, z, θ, φ) [2], geospatial data (lat, lon, altitude), and so on. Interestingly, it can also be applied even when the data has no well-defined coordinate system, or when the coordinate system is non-Euclidean [3].
Additionally, you can meta-learn an initial set of weights tailored to a particular distribution of data points [4]. This enables you to quickly fit an INR to a new data point from the same distribution, such as microscope images of cells, MRI scans, or scenes for self-driving cars.
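As a rough illustration of the meta-learned initialization, here is a Reptile-style sketch (not necessarily the exact procedure in [4], which uses MAML-style inner/outer loops); `sample_image()` is a hypothetical stand-in for drawing one (coords, targets) pair from the distribution you care about:

```python
import copy
import torch
import torch.nn as nn

def make_inr():
    # same architecture as the sketch above
    return nn.Sequential(nn.Linear(2, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, 3), nn.Sigmoid())

def sample_image():
    # hypothetical helper: replace with a real data point from your distribution
    coords = torch.rand(4096, 2) * 2 - 1
    targets = torch.rand(4096, 3)
    return coords, targets

meta_inr = make_inr()
meta_lr, inner_lr, inner_steps = 0.1, 1e-3, 16

for meta_step in range(1000):
    coords, targets = sample_image()
    inner = copy.deepcopy(meta_inr)              # start every fit from the shared initialization
    opt = torch.optim.Adam(inner.parameters(), lr=inner_lr)
    for _ in range(inner_steps):                 # short inner loop: adapt to this one data point
        opt.zero_grad()
        nn.functional.mse_loss(inner(coords), targets).backward()
        opt.step()
    # Reptile outer update: nudge the shared weights toward the adapted weights.
    with torch.no_grad():
        for p_meta, p_inner in zip(meta_inr.parameters(), inner.parameters()):
            p_meta += meta_lr * (p_inner - p_meta)
```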
Most pertinent to the post, INRs can be compressed and quantized to achieve comparable or even superior performance to traditional data compression techniques, at various quality levels and across different modalities [5]. It's by no means competitive with long-standing, purpose-built approaches like dedicated image codecs across the board, but it works well nonetheless. This is an area we have explored extensively in my lab, enabling end-to-end encoding, processing, and decoding of INRs in hardware for a variety of applications.
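The simplest version of "compress the INR itself" is post-training quantization of its weights; the methods in [5] go further with quantization-aware training and entropy coding, but a uniform 8-bit scheme shows the shape of the idea:

```python
import torch

def quantize_state_dict(state_dict, bits=8):
    """Per-tensor uniform quantization: store uint8 codes plus (offset, scale)."""
    qmax = 2 ** bits - 1
    packed = {}
    for name, w in state_dict.items():
        lo, hi = w.min(), w.max()
        scale = (hi - lo) / qmax if hi > lo else torch.tensor(1.0)
        codes = torch.round((w - lo) / scale).clamp(0, qmax).to(torch.uint8)
        packed[name] = (codes, lo, scale)
    return packed

def dequantize_state_dict(packed):
    return {name: codes.float() * scale + lo
            for name, (codes, lo, scale) in packed.items()}

# Usage (with `inr` from the first sketch):
#   packed = quantize_state_dict(inr.state_dict())
#   inr.load_state_dict(dequantize_state_dict(packed))
# 32-bit floats -> 8-bit codes is roughly a 4x size reduction before any entropy coding.
```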
The most interesting topic to me is the fact that you can edit the weights of the INR to edit the underlying data [6][7], or use the weights of the INR to perform tasks on the data without ever decoding it. For example, I can perform image classification on the weights of the INR model rather than on the actual images. The classification model is a hypernetwork that takes the INR parameters as inputs instead of the RGB image. These hypernetworks can also be designed in a way that preserves the invariance and equivariance properties of MLP structures, which is a very cool idea to me [8].
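As a toy illustration of computing on weights instead of pixels: flatten the INR's parameters and feed them to an ordinary classifier. A flat MLP like this ignores the permutation symmetries of weight space that the architectures in [8] are designed to respect, so treat it purely as a sketch (it reuses `inr` from the first snippet, and `n_classes` is an arbitrary example value):

```python
import torch
import torch.nn as nn

def flatten_params(model):
    # one long vector of all the INR's weights and biases
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

n_params = flatten_params(inr).numel()
n_classes = 10

weight_space_classifier = nn.Sequential(
    nn.Linear(n_params, 512), nn.ReLU(),
    nn.Linear(512, n_classes),
)

# Class scores computed directly from the INR's weights, without decoding any pixels.
logits = weight_space_classifier(flatten_params(inr))
```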