ZipMap

Linear-Time Stateful 3D Reconstruction via Test-Time Training

Haian Jin1,2     Rundi Wu1     Tianyuan Zhang3     Ruiqi Gao1     Jonathan T. Barron1     Noah Snavely1,2     Aleksander Hołýnski1
1Google DeepMind      2Cornell University      3Massachusetts Institute of Technology

CVPR 2026

TL;DR: ZipMap achieves linear-time, stateful 3D reconstruction by zipping all input tokens into a compact scene state via test-time training layers, matching or surpassing existing quadratic-time SOTA methods (e.g., VGGT and π³) with over a 20× speedup at 750 images, zipping through large image collections.

Abstract

Feed-forward transformer models have driven rapid progress in 3D vision, but state-of-the-art methods such as VGGT and π³ have a computational cost that scales quadratically with the number of input images, making them inefficient when applied to large image collections. Sequential-reconstruction approaches reduce this cost but sacrifice reconstruction quality. We introduce ZipMap, a stateful feed-forward model that achieves linear-time, bidirectional 3D reconstruction while matching or surpassing the accuracy of quadratic-time methods. ZipMap employs test-time training layers to compress an entire image collection into a compact hidden scene state in a single forward pass, enabling reconstruction of over 700 frames in under 10 seconds on a single H100 GPU—more than 20× faster than SOTA methods such as VGGT. Moreover, we demonstrate the benefits of having a stateful representation in real-time scene state querying and its extension to sequential streaming reconstruction.
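To make the linear-time claim concrete, here is a minimal, illustrative sketch of how a test-time training (TTT) layer can compress a token stream into a compact state: the "state" is a weight matrix updated by one gradient step per token on a self-supervised loss, so the cost is O(n) in the number of tokens rather than attention's O(n²). All names, the projection setup, and the update rule are assumptions for illustration, not ZipMap's actual architecture.

```python
import numpy as np

def ttt_linear_layer(tokens, lr=0.1, seed=0):
    """Illustrative TTT layer with a linear hidden state (not ZipMap's code).

    The state is a weight matrix W, updated by one gradient step per token
    on the self-supervised loss ||W k - v||^2. One pass over n tokens costs
    O(n), unlike full self-attention's O(n^2).
    """
    rng = np.random.default_rng(seed)
    d = tokens.shape[1]
    # Illustrative key/value/query projections (learned in a real model).
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)
    W = np.zeros((d, d))                 # compact "scene state"
    outputs = []
    for x in tokens:                     # single pass over all image tokens
        k, v, q = Wk @ x, Wv @ x, Wq @ x
        grad = np.outer(W @ k - v, k)    # gradient of ||W k - v||^2 / 2 w.r.t. W
        W -= lr * grad                   # test-time gradient update of the state
        outputs.append(W @ q)            # read out from the updated state
    return np.stack(outputs), W
```

Because the state W has fixed size regardless of sequence length, it can also be kept around after the pass and queried later, which is the property the paper exploits for scene-state querying and streaming.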


Interactive examples

Pick a scene below to explore in 3D! Press Space to play/pause, click and drag to change viewpoint.


[Demo requires a browser with WebGL2 support.]



⚡ Superfast reconstruction without compromising quality

Runtime comparison
Left: ZipMap’s runtime grows linearly with the number of frames, while quadratic-time baselines (VGGT, π³) slow down rapidly as sequences grow. ZipMap processes 750 images in under 10 seconds, whereas the prior SOTA method VGGT takes over 200 seconds.

Right: Despite being much faster (20×+ at 750 frames), ZipMap maintains strong pose accuracy, with low ATE across sequence lengths; its accuracy matches or surpasses the existing quadratic-time SOTA (e.g., VGGT and π³).

Please refer to the main paper for complete results across datasets and additional evaluations.



Scene State Query


Querying the Scene State. For each example (top and bottom), the left column shows the input views (a), ground-truth RGB at the query poses (b), our rendered RGB from the scene state (c), ground-truth depth (d), and predicted depth (e). The middle panels visualize the 3D point clouds reconstructed from the input images. The right panels show point clouds obtained solely by querying the scene state. The close visual match between these two point clouds indicates that the learned scene state faithfully captures the geometry and appearance of the input scene.
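A useful property of a fixed-size scene state is that it can be queried after the forward pass without modifying it. The sketch below is purely illustrative, assuming the state is a matrix W as in a linear TTT-style layer and that query tokens are derived from target camera poses; the function name and readout rule are hypothetical, not the paper's interface.

```python
import numpy as np

def query_scene_state(W, query_tokens):
    """Hypothetical read-only query of a compact scene state.

    Each query token (e.g. derived from a target camera pose) is matched
    against the frozen state matrix W. No gradient step is taken, so
    querying is cheap and does not alter the state.
    """
    return query_tokens @ W.T   # one matmul per batch of query tokens
```

In this toy setup, repeated queries against the same state always return the same result, mirroring the non-destructive, real-time querying described above.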



Acknowledgements

We would like to thank Shangzhan Zhang, Kyle Genova, Songyou Peng, and Zehao Yu for valuable discussions throughout the project. We thank Alfred Piccioni for help with setting up the training infrastructure, and Ben Poole for feedback on the manuscript. We also thank Yifan Wang and Jianyuan Wang for sharing baseline results and implementation details. Haian Jin was supported in part by a grant from the National Science Foundation (IIS-2211259) and by a Google PhD Fellowship.

BibTeX

@inproceedings{jin2026zipmap,
    title     = {{ZipMap}: Linear-Time Stateful 3D Reconstruction with Test-Time Training},
    author    = {Jin, Haian and Wu, Rundi and Zhang, Tianyuan and Gao, Ruiqi and Barron, Jonathan T. and Snavely, Noah and Holynski, Aleksander},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year      = {2026}
}