LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias

1Cornell University, 2The University of Texas at Austin,
3Adobe Research, 4Massachusetts Institute of Technology

LVSM is a purely transformer-based large view synthesis model. Given sparse input views with camera poses, it achieves high-quality novel view synthesis results in a feed-forward manner with minimal 3D inductive bias.

(This page contains many videos; we suggest Chrome or Edge on Mac or PC for the best experience.)

Abstract



We propose the Large View Synthesis Model (LVSM), a novel transformer-based approach for scalable and generalizable novel view synthesis from sparse-view inputs. We introduce two architectures: (1) an encoder-decoder LVSM, which encodes input image tokens into a fixed number of 1D latent tokens, functioning as a fully learned scene representation, and decodes novel-view images from them; and (2) a decoder-only LVSM, which directly maps input images to novel-view outputs, completely eliminating intermediate scene representations. Both models bypass the 3D inductive biases used in previous methods—from 3D representations (e.g., NeRF, 3DGS) to network designs (e.g., epipolar projections, plane sweeps)—addressing novel view synthesis with a fully data-driven approach. While the encoder-decoder model offers faster inference due to its independent latent representation, the decoder-only LVSM achieves superior quality, scalability, and zero-shot generalization, outperforming previous state-of-the-art methods by 1.5 to 3.5 dB PSNR. Comprehensive evaluations across multiple datasets demonstrate that both LVSM variants achieve state-of-the-art novel view synthesis quality. Notably, our models surpass all previous methods even with reduced computational resources (1-2 GPUs).
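To make the decoder-only design concrete, the following is a minimal PyTorch sketch of an LVSM-style model: input images are patchified into tokens (RGB plus a per-pixel ray embedding encoding camera pose), target-view tokens carry only the novel view's ray embedding, and a plain transformer with full self-attention maps everything directly to output RGB patches, with no epipolar or plane-sweep structure. All names, dimensions, and layer counts here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a decoder-only LVSM-style model.
# Assumptions (not from the authors' code): 6-D per-pixel ray embeddings for
# pose conditioning, simple linear patch tokenizers, and a standard
# nn.TransformerEncoder as the backbone.
import torch
import torch.nn as nn

class DecoderOnlyLVSM(nn.Module):
    def __init__(self, patch=8, dim=64, depth=2, heads=4):
        super().__init__()
        # Each input token packs an RGB patch plus its 6-D ray embedding.
        self.tokenize = nn.Linear((3 + 6) * patch * patch, dim)
        # Target tokens carry only the novel view's ray embedding.
        self.query = nn.Linear(6 * patch * patch, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, depth)
        self.to_rgb = nn.Linear(dim, 3 * patch * patch)

    def forward(self, src_tokens, tgt_rays):
        # src_tokens: (B, N_src, (3+6)*p*p); tgt_rays: (B, N_tgt, 6*p*p)
        x = torch.cat([self.tokenize(src_tokens), self.query(tgt_rays)], dim=1)
        # Full self-attention over all tokens: no intermediate 3D scene
        # representation, no epipolar projection, no plane sweep.
        x = self.transformer(x)
        n_tgt = tgt_rays.shape[1]
        # Read predicted RGB patches off the target-token positions.
        return self.to_rgb(x[:, -n_tgt:])
```

The encoder-decoder variant would instead compress the input tokens into a fixed number of learned latent tokens before decoding, trading some quality for faster per-view inference, since the latents are computed once per scene.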

Comparison


Quantitative Comparison


Qualitative Comparison

We compare our method with state-of-the-art approaches on scene-level novel view synthesis from sparse input views (2 views). Since PixelSplat and MVSplat only support 256x256 resolution, the following results were generated with our models at the same 256x256 resolution.


Our Results


Object-Level Novel View Synthesis from Sparse View Inputs (4 views)

The following results are at 512x512 resolution and were generated by our decoder-only model. The input images are shown at the bottom of each novel view synthesis result.




Scene-Level Novel View Synthesis from Sparse View Inputs (2 views)

The following results are at 512x512 resolution and were generated by our decoder-only model. The input images are shown at the bottom of each novel view synthesis result.

Additional Results


Single-View Novel View Synthesis for Scene Data

We observe that our LVSM also works with a single input view in many cases, despite being trained only on multi-view inputs. This suggests that LVSM learns to understand the 3D world, e.g., inferring depth, rather than merely performing pixel-level view interpolation.

BibTeX

@misc{jin2024lvsmlargeviewsynthesis,
      title={LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias}, 
      author={Haian Jin and Hanwen Jiang and Hao Tan and Kai Zhang and Sai Bi and Tianyuan Zhang and Fujun Luan and Noah Snavely and Zexiang Xu},
      year={2024},
      eprint={2410.17242},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.17242}, 
}