CUT3R summary

Author
Mark Ogata
AI and Robotics Undergraduate Researcher

Problem statement

Predict online, metric-scale 3D point maps over time from video of dynamic or static scenes, or from an unordered collection of pictures. The model can also infer smooth (blurry) infills of unseen parts of the 3D world.

Prior approaches

3DGS, NeRF, and SfM start from a blank slate for each scene, so they struggle with single-frame reconstruction and with queries from unseen viewpoints.

Methods with data-driven priors such as DUSt3R reconstruct point clouds in the world coordinate frame, implicitly predicting extrinsics and intrinsics. However, DUSt3R does not natively support multi-frame prediction: its global alignment process does, but it is computationally expensive and cannot retroactively correct past pointmap predictions. Moreover, DUSt3R's global alignment costs O(N) to process a new frame, where N is the number of frames seen so far, versus O(1) per frame for CUT3R.
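The per-frame cost gap above compounds over a sequence. A toy sketch of the cumulative work (purely illustrative unit counts, not the actual implementations):

```python
def total_cost_global_alignment(num_frames: int) -> int:
    """Illustrative DUSt3R-style cost: aligning frame i against all
    i previous frames costs O(i), so the total is O(N^2)."""
    return sum(range(1, num_frames + 1))

def total_cost_recurrent(num_frames: int) -> int:
    """Illustrative CUT3R-style cost: each frame is one O(1) state
    update plus readout, so the total is O(N)."""
    return num_frames

print(total_cost_global_alignment(100))  # 5050 work units
print(total_cost_recurrent(100))         # 100 work units
```

The gap is why global alignment is hard to run truly online on long videos, while a constant-cost recurrent update is not.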

Spann3R is very similar, but it acts more as a cache of observed images and does not allow querying of unobserved viewpoints.

Method

With each new observation, CUT3R updates a persistent hidden state and reads out a 3D pointmap of the scene contents. Data-driven priors are crucial for handling degenerate camera motions and for querying unobserved viewpoints.
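The update-then-readout loop can be sketched as follows. This is a minimal stand-in, not CUT3R's architecture: `encode`, `update_state`, and `readout` are hypothetical placeholders for its image encoder, state-interaction transformer, and prediction heads.

```python
import numpy as np

def encode(image: np.ndarray) -> np.ndarray:
    # Placeholder image encoder: reduce the frame to a fixed-size vector.
    return image.reshape(-1)[:16].astype(np.float64)

def update_state(state: np.ndarray, tokens: np.ndarray) -> np.ndarray:
    # Placeholder recurrent update (the real model uses transformer-based
    # state-image interaction); here an exponential moving average.
    return 0.9 * state + 0.1 * tokens

def readout(state: np.ndarray, tokens: np.ndarray) -> dict:
    # Placeholder readout head; the real model predicts world- and
    # frame-coordinate pointmaps plus camera pose from state + tokens.
    return {"world_pointmap": state + tokens, "frame_pointmap": tokens}

state = np.zeros(16)                         # persistent hidden state
for frame in np.random.rand(5, 4, 4):        # pretend video stream
    tokens = encode(frame)
    state = update_state(state, tokens)      # O(1) work per frame
    pred = readout(state, tokens)            # pointmap for this frame
```

The key property the sketch preserves: each incoming frame does constant work against one fixed-size state, rather than re-optimizing over all past frames.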

Trained on many types of data: single images, image collections, and video, with partial or full 3D annotations.

Predicting pose, world-coordinate pointmaps, and frame-coordinate pointmaps is redundant, but supervising each output individually helps, and it also allows training on datasets with only partial annotations.
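One way such partial supervision can work is to accumulate a loss term only for the outputs a given dataset actually labels. A hedged sketch, with illustrative names that are not CUT3R's actual code:

```python
import numpy as np

def masked_loss(pred: dict, gt: dict) -> float:
    """Average MSE over whichever of the three redundant outputs
    this dataset provides ground truth for; missing labels (None)
    simply contribute nothing."""
    total, count = 0.0, 0
    for key in ("pose", "world_pointmap", "frame_pointmap"):
        if gt.get(key) is not None:          # dataset may lack this label
            total += float(np.mean((pred[key] - gt[key]) ** 2))
            count += 1
    return total / max(count, 1)

pred = {k: np.ones(3) for k in ("pose", "world_pointmap", "frame_pointmap")}
gt = {"pose": np.zeros(3),        # labeled
      "world_pointmap": None,     # this dataset has no world-frame GT
      "frame_pointmap": np.ones(3)}
print(masked_loss(pred, gt))  # averages over the two available labels -> 0.5
```

This is what lets heterogeneous datasets (pose-only, depth-only, fully annotated) contribute to the same training run.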

Results

Limitations

  • No correspondence/tracking between frames
  • High compute requirements (practically speaking, may not be truly online)

Key contributions

Other notes

The latent state could be very useful downstream, since it is updated online and stores the 3D scene information.
