Compression as Adaptation: Implicit Visual Representation with Diffusion Foundation Models

Zongyu Guo¹*, Jiajun He²*, Zhaoyang Jia¹, Xiaoyi Zhang¹, Jiahao Li¹, Xiao Li¹, Bin Li¹, José Miguel Hernández-Lobato², Yan Lu¹

¹Microsoft Research Asia, ²University of Cambridge. *Equal contribution


Compression Performance Comparison

Interactive comparison: ground truth (left) versus the selected compressed result (right).

Beyond Compression: Editing

Editing results using LoRA-based representations. These representations can be used for controlled generation, such as image editing or merging.

Video editing results obtained by changing "dog" to "panda" in the prompt.

Method overview

Comparison between different representation methods. (a) Explicit representations encode signals into symbolic latent variables. (b) Implicit representations encode signal information implicitly in functions. (c) Adapting a generative model can itself serve as an implicit visual representation.

A detailed illustration of our method: adaptation within a pretrained diffusion foundation model.

Representation results

Reconstruction quality versus training step. (a) Common LoRA representations with different ranks for image Kodim03 from the Kodak dataset. (b) One-vector representations for image Kodim03, varying the LoRA rank and the vector size after hashing. (c) One-vector representations for video Beauty from the UVG dataset.
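To make the idea of a "one-vector" representation with hashing concrete, here is a minimal sketch. The page does not spell out the hashing scheme, so this assumes hashing-trick weight sharing: every entry of a LoRA factor reads its value from a slot of one shared vector, chosen by a fixed pseudo-random index. The names `make_hash_indices` and `expand_to_matrix` are hypothetical, introduced only for illustration.

```python
import random

def make_hash_indices(num_params, vector_size, seed=0):
    """Fixed pseudo-random map from parameter position to vector slot.

    Assumption: a seeded RNG stands in for a hash function, so encoder
    and decoder can rebuild the same map from the seed alone.
    """
    rng = random.Random(seed)
    return [rng.randrange(vector_size) for _ in range(num_params)]

def expand_to_matrix(vector, indices, rows, cols):
    """Gather entries of the shared vector into a rows x cols LoRA factor."""
    assert len(indices) == rows * cols
    flat = [vector[i] for i in indices]
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

# Toy sizes: a 4 x 2 LoRA factor (8 parameters) backed by a 3-entry vector.
vector = [0.1, -0.4, 0.7]
indices = make_hash_indices(num_params=8, vector_size=len(vector), seed=0)
lora_factor = expand_to_matrix(vector, indices, rows=4, cols=2)
```

In this sketch only `vector` would need to be stored or transmitted. The LoRA rank sets the shape of the expanded factors, while the vector size sets how many free values exist, which mirrors the two quantities varied in panel (b).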

Compression Performance

Comparison of video codecs on UVG. For DISTS, FVD, and LPIPS, lower is better. For PSNR, higher is better.

Compression inference-time scaling

We identify a key advantage of functional representations: they naturally support inference-time scaling. In our framework, we can generate multiple samples per denoising step and select the most promising one to improve compression quality.
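The selection loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_denoise_step` is a hypothetical stand-in for one stochastic step of the adapted diffusion model, and scoring candidates by squared error against the ground truth is an assumption, since the page does not state the selection criterion.

```python
import random

def toy_denoise_step(x, step, rng):
    """Hypothetical stochastic update: shrink the state and add fresh noise."""
    scale = 1.0 / (step + 1)
    return [0.5 * v + rng.gauss(0.0, scale) for v in x]

def sq_error(a, b):
    """Squared error between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def sample_with_selection(target, num_steps, num_candidates, seed=0):
    """Draw several candidates per step and keep the one closest to target."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
    for step in range(num_steps):
        candidates = [toy_denoise_step(x, step, rng)
                      for _ in range(num_candidates)]
        # Assumed criterion: smallest squared error to the ground truth.
        x = min(candidates, key=lambda c: sq_error(c, target))
    return x

target = [0.3, -1.2, 0.8, 0.0]
single = sample_with_selection(target, num_steps=1, num_candidates=1)
best_of_16 = sample_with_selection(target, num_steps=1, num_candidates=16)
```

Raising `num_candidates` spends more computation per step in exchange for a closer match, which is the sense in which quality scales with inference-time compute in this sketch.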

Composing Representations of Multiple Videos

Concatenating the LoRAs of different videos enables composing those videos.

ShakeNDry's background + Beauty's object and motion.

ShakeNDry's object + Beauty's background and motion.

Merging ShakeNDry and Beauty in the background of Beauty.

Merging ShakeNDry and Beauty in the background of ShakeNDry.
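One way to read "concatenation of LoRA" is concatenation along the rank dimension, which is an assumption here since the page does not specify the axis. Under that reading, stacking the factors of two LoRAs gives a single LoRA whose weight update is the sum of the two individual updates. The sketch below checks this identity on small integer matrices; the helper names are hypothetical.

```python
def matmul(x, y):
    """Plain matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]

def mat_add(x, y):
    """Elementwise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def concat_lora(lora_1, lora_2):
    """Concatenate two LoRAs (B, A) along the rank dimension."""
    (b1, a1), (b2, a2) = lora_1, lora_2
    b = [row1 + row2 for row1, row2 in zip(b1, b2)]  # d x (r1 + r2)
    a = a1 + a2                                      # (r1 + r2) x k
    return b, a

# Toy LoRAs for a 2 x 3 weight: rank 1 and rank 2.
lora_x = ([[1], [2]], [[1, 0, 2]])
lora_y = ([[0, 1], [1, 0]], [[1, 1, 1], [2, 0, 1]])

b_cat, a_cat = concat_lora(lora_x, lora_y)
update_cat = matmul(b_cat, a_cat)
update_sum = mat_add(matmul(*lora_x), matmul(*lora_y))
```

Here `update_cat` and `update_sum` are equal, so the concatenated LoRA applies both adaptations at once. How the composed model is prompted to pick the object from one video and the background from another is not covered by this sketch.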

More RD Curves

Comparison of video codecs on HEVC B. For DISTS, FVD, and LPIPS, lower is better. For PSNR, higher is better.

Comparison of video codecs on HEVC C. For DISTS, FVD, and LPIPS, lower is better. For PSNR, higher is better.

Comparison of video codecs on HEVC E. For DISTS, FVD, and LPIPS, lower is better. For PSNR, higher is better.

BibTeX

@inproceedings{he2026compression,
  title={Compression as Adaptation: Implicit Visual Representation with Diffusion Foundation Models},
  author={Zongyu Guo and Jiajun He and Zhaoyang Jia and Xiaoyi Zhang and Jiahao Li and Xiao Li and Bin Li and José Miguel Hernández-Lobato and Yan Lu},
  year={2026},
  booktitle={International Conference on Machine Learning},
}