ZipMap

Abstract

Feed-forward transformer models have driven rapid progress in 3D vision, but state-of-the-art methods such as VGGT and π³ have a computational cost that scales quadratically with the number of input images, making them inefficient when applied to large image collections. Sequential-reconstruction approaches reduce this cost but sacrifice reconstruction quality. We introduce ZipMap, a stateful feed-forward model that achieves linear-time, bidirectional 3D reconstruction while matching or surpassing the accuracy of quadratic-time methods. ZipMap employs test-time training layers to compress an entire image collection into a compact hidden scene state in a single forward pass, enabling reconstruction of over 700 frames in under 10 seconds on a single H100 GPU—more than 20× faster than SOTA methods such as VGGT. Moreover, we demonstrate the benefits of having a stateful representation in real-time scene state querying and its extension to sequential streaming reconstruction.
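To give a rough intuition for how a test-time training layer can compress many tokens into a fixed-size state, here is a minimal sketch of the general fast-weight idea. It is not ZipMap's architecture: the function names `ttt_update` and `query`, the squared-error objective, the single gradient step, and the learning rate are all assumptions chosen for illustration. The point is only that the state has the same size no matter how many tokens are absorbed, and that the update cost grows linearly with the token count.

```python
# Hypothetical sketch of a test-time training (fast-weight) layer.
# All names and the update rule are illustrative assumptions,
# not the method described on this page.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def ttt_update(W, keys, values, lr=1.0):
    """Take one gradient step on the mean of 0.5 * ||W k - v||^2.

    Every (key, value) pair contributes to the same update, so the
    resulting state reflects the whole collection at once. The loop
    visits each token once, so the cost is linear in len(keys).
    """
    d_out, d_in = len(W), len(W[0])
    n = len(keys)
    grad = [[0.0] * d_in for _ in range(d_out)]
    for k, v in zip(keys, values):
        pred = matvec(W, k)
        err = [p - t for p, t in zip(pred, v)]
        for i in range(d_out):
            for j in range(d_in):
                grad[i][j] += err[i] * k[j] / n
    return [[W[i][j] - lr * grad[i][j] for j in range(d_in)]
            for i in range(d_out)]


def query(W, q):
    """Read from the compressed state with a query vector."""
    return matvec(W, q)


# Toy usage: two tokens with orthonormal keys.
zeros = [[0.0, 0.0], [0.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
state = ttt_update(zeros, keys, values, lr=2.0)
print(query(state, [1.0, 0.0]))
print(query(state, [0.0, 1.0]))
```

In this toy setting, starting from a zero state with orthonormal keys and a learning rate equal to the number of tokens, querying with a key returns the value that was stored with it. Real layers use learned projections, nonlinear fast weights, and tuned step sizes, none of which are modeled here.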

Interactive examples

Pick a scene below to explore in 3D! Press Space to play/pause, click and drag to change viewpoint.

[Demo requires a browser with WebGL2 support.]

(Note: scene geometry is downsampled for faster loading. Firefox may not properly render point clouds.)


ⓘ Note that both the static and dynamic examples are obtained from linear-time, bidirectional (non-streaming) reconstructions. We show the dynamic examples in streaming mode to better visualize the per-frame predictions.

⚡ Superfast reconstruction without compromising quality

Left: ZipMap’s runtime increases linearly in the number of frames, while quadratic-time baselines (VGGT, π³) slow down rapidly as sequences grow. ZipMap can process 750 images in less than 10 seconds, while a prior SOTA method (VGGT) takes over 200 seconds.
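The widening gap between linear and quadratic scaling can be illustrated with a toy cost model. The sketch below is an idealization, not a measurement: it assumes pure linear and pure quadratic growth with no constant overhead, and it anchors both curves at 750 frames using 10 s and 200 s, which are rounded stand-ins for the "less than 10 seconds" and "over 200 seconds" figures quoted above. The constants and function names are hypothetical.

```python
# Toy scaling model; all constants are assumed anchors, not measured data.
N_REF = 750               # reference sequence length (frames)
T_LINEAR_REF = 10.0       # assumed seconds for the linear-time model at N_REF
T_QUADRATIC_REF = 200.0   # assumed seconds for the quadratic-time model at N_REF


def linear_runtime(n):
    """Idealized runtime that grows proportionally to n."""
    return T_LINEAR_REF * (n / N_REF)


def quadratic_runtime(n):
    """Idealized runtime that grows proportionally to n squared."""
    return T_QUADRATIC_REF * (n / N_REF) ** 2


def speedup(n):
    """Ratio of the two idealized runtimes; itself linear in n."""
    return quadratic_runtime(n) / linear_runtime(n)


for n in (100, 250, 500, 750):
    print(f"{n:4d} frames: linear {linear_runtime(n):6.2f} s, "
          f"quadratic {quadratic_runtime(n):7.2f} s, "
          f"speedup {speedup(n):5.2f}x")
```

Under these assumptions the speedup is itself proportional to the number of frames, which is why the advantage of a linear-time method is most visible on long sequences. Actual runtimes include fixed and per-frame costs that this model ignores.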

Right: Not only is ZipMap much faster (more than 20× at 750 frames), it also maintains strong pose accuracy, with low absolute trajectory error (ATE) across varying input sequence lengths. Its accuracy matches or surpasses that of existing quadratic-time SOTA methods (e.g., VGGT and π³).

Please refer to the main paper for complete results across multiple datasets and additional evaluations.

Acknowledgements

We would like to thank Shangzhan Zhang, Kyle Genova, Songyou Peng, and Zehao Yu for valuable discussions throughout the project. We thank Alfred Piccioni for help with setting up the training infrastructure, and Ben Poole for feedback on the manuscript. We also thank Yifan Wang and Jianyuan Wang for sharing baseline results and implementation details. Haian Jin was supported in part by a grant from the National Science Foundation (IIS-2211259) and by a Google PhD Fellowship.

BibTeX

@inproceedings{jin2026zipmap,
  title     = {{ZipMap}: Linear-Time Stateful 3D Reconstruction via Test-Time Training},
  author    = {Jin, Haian and Wu, Rundi and Zhang, Tianyuan and Gao, Ruiqi and Barron, Jonathan T. and Snavely, Noah and Holynski, Aleksander},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2026}
}

ZipMap

Linear-Time Stateful 3D Reconstruction via Test-Time Training

CVPR 2026

TL;DR: Linear-time, stateful 3D reconstruction that matches or beats O(N²) SOTA methods such as VGGT, with over 20× speedup.
