We propose the Large View Synthesis Model (LVSM), a novel transformer-based approach for scalable and generalizable novel view synthesis from sparse-view inputs. We introduce two architectures: (1) an encoder-decoder LVSM, which encodes input image tokens into a fixed number of 1D latent tokens, functioning as a fully learned scene representation, and decodes novel-view images from them; and (2) a decoder-only LVSM, which directly maps input images to novel-view outputs, completely eliminating intermediate scene representations. Both models bypass the 3D inductive biases used in previous methods—from 3D representations (e.g., NeRF, 3DGS) to network designs (e.g., epipolar projections, plane sweeps)—addressing novel view synthesis with a fully data-driven approach. While the encoder-decoder model offers faster inference due to its independent latent representation, the decoder-only LVSM achieves superior quality, scalability, and zero-shot generalization, outperforming previous state-of-the-art methods by 1.5 to 3.5 dB in PSNR. Comprehensive evaluations across multiple datasets demonstrate that both LVSM variants achieve state-of-the-art novel view synthesis quality. Notably, our models surpass all previous methods even with reduced computational resources (1–2 GPUs).
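To make the decoder-only idea concrete, the following is a minimal, hypothetical NumPy sketch, not the paper's implementation: input-view image patches (paired with per-pixel ray maps) and target-view ray tokens attend jointly in a single transformer-style layer, and the target tokens are decoded directly to RGB patches, with no intermediate 3D representation or epipolar machinery. All names (`DecoderOnlyLVSMSketch`, `patchify`, the 6-channel ray stand-in for Plücker coordinates, the single-layer depth) are illustrative assumptions.

```python
import numpy as np

def patchify(img, p):
    """(H, W, C) -> (H/p * W/p, p*p*C): flatten non-overlapping patches into tokens."""
    H, W, C = img.shape
    return img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)

def unpatchify(tokens, H, W, p):
    """Inverse of patchify: (num_tokens, p*p*C) -> (H, W, C)."""
    C = tokens.shape[1] // (p * p)
    return tokens.reshape(H // p, W // p, p, p, C).transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def softmax_attention(q, k, v):
    """Plain scaled dot-product attention over one token set."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

class DecoderOnlyLVSMSketch:
    """Toy, single-layer stand-in for the decoder-only concept (random,
    untrained weights): all tokens attend jointly, then target tokens are
    read out as RGB patches -- no NeRF/3DGS, no plane sweeps."""

    def __init__(self, patch=8, dim=64, ray_dim=6, seed=0):
        rng = np.random.default_rng(seed)
        self.p = patch
        feat = patch * patch * (3 + ray_dim)  # RGB patch + ray-map patch per token
        self.w_in = rng.normal(0.0, 0.02, (feat, dim))
        self.w_qkv = rng.normal(0.0, 0.02, (dim, 3 * dim))
        self.w_out = rng.normal(0.0, 0.02, (dim, patch * patch * 3))

    def render(self, views, view_rays, target_rays):
        # views: list of (H, W, 3); view_rays/target_rays: (H, W, ray_dim) ray maps.
        H, W, _ = views[0].shape
        inp = [np.concatenate([patchify(v, self.p), patchify(r, self.p)], axis=1)
               for v, r in zip(views, view_rays)]
        # Target tokens carry only the query rays; the image part is zeros (unknown).
        n_tgt = (H // self.p) * (W // self.p)
        tgt = np.concatenate([np.zeros((n_tgt, self.p * self.p * 3)),
                              patchify(target_rays, self.p)], axis=1)
        x = np.concatenate(inp + [tgt], axis=0) @ self.w_in
        q, k, v = np.split(x @ self.w_qkv, 3, axis=1)
        x = x + softmax_attention(q, k, v)   # one joint attention layer (residual)
        rgb = x[-n_tgt:] @ self.w_out        # decode only the target positions
        return unpatchify(rgb, H, W, self.p)
```

With two 32×32 input views and a target ray map, `DecoderOnlyLVSMSketch().render(views, view_rays, target_rays)` returns a (32, 32, 3) array. A trained model would stack many such layers and learn the projections; the point here is only the data flow: images in, novel view out, nothing 3D in between.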
We compare our method with state-of-the-art approaches on scene-level novel view synthesis from sparse-view inputs (two views). Since PixelSplat and MVSplat only support 256×256 resolution, the following results were generated by our models at the same 256×256 resolution.
The following results are at 512×512 resolution and were generated by our decoder-only model. The input images are shown below each novel view synthesis result.
We observe that our LVSM also works with a single input view in many cases, despite being trained only with multi-view inputs. This demonstrates LVSM's capability to understand the 3D world, e.g., inferring depth, rather than merely performing pixel-level view interpolation.
@misc{jin2024lvsmlargeviewsynthesis,
      title={LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias},
      author={Haian Jin and Hanwen Jiang and Hao Tan and Kai Zhang and Sai Bi and Tianyuan Zhang and Fujun Luan and Noah Snavely and Zexiang Xu},
      year={2024},
      eprint={2410.17242},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.17242},
}