Commit
Showing 1 changed file with 5 additions and 174 deletions.
@@ -1,174 +1,5 @@
<!doctype html>
<html lang="en">
<head>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-156935549-3"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());

gtag('config', 'UA-156935549-3');
</script>

<!-- Required meta tags -->
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

<!-- Bootstrap CSS -->
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">

<!-- Other -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/handlebars.js/4.4.2/handlebars.min.js"></script>

<title>Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D</title>
</head>
<body>

<div style="overflow: hidden; background-color: #6699cc;">
<div class="container">
<a href=https://www.nvidia.com/ style="float: left; color: black; text-align: center; padding: 12px 16px; text-decoration: none; font-size: 16px;"><img width="100%" src="https://nv-tlabs.github.io/3DStyleNet/assets/nvidia.svg"></a>
<a href=https://nv-tlabs.github.io/ style="float: left; color: black; text-align: center; padding: 14px 16px; text-decoration: none; font-size: 16px;"><strong>Toronto AI Lab</strong></a>
</div>
</div>

<!-- header -->
<div class='jumbotron' style="background-color:#e6e9ec">
<div class="container">

<h1 class="text-center">Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D</h1>
<p class='text-center'><a href="https://scholar.google.com/citations?user=VVIAoY0AAAAJ&hl=en" target="_blank">Jonah Philion</a>, <a href="http://www.cs.toronto.edu/~fidler/" target="_blank">Sanja Fidler</a></p>
<p class='text-center'>NVIDIA, Vector Institute, University of Toronto</p>
<p class='text-center'>ECCV 2020</p>
<p class='text-center'><img src='imgs/nusc.gif' class='img-fluid' style='height:250px; border-radius:15px; padding:5px'></p>
<!-- <iframe src="https://drive.google.com/file/d/1XwqzDYfzXhky1WNuXTKy7i-jVXgp4hc_/preview" width="640" height="480" autoplay></iframe> -->
<!-- <div class="embed-responsive embed-responsive-16by9" style='height:250px; width:444px; margin:auto;'>
<iframe class="embed-responsive-item" src="https://www.youtube.com/embed/oL5ISk6BnDE?autoplay=1" allowfullscreen></iframe>
</div> -->
</div>
</div>

<div class="container">

<p>
The goal of perception for autonomous vehicles is to extract semantic representations from multiple sensors and fuse these representations into a single “bird’s-eye-view” coordinate frame for consumption by motion planning. We propose a new end-to-end architecture that directly extracts a bird’s-eye-view representation of a scene given image data from an arbitrary number of cameras. The core idea behind our approach is to “lift” each image individually into a frustum of features for each camera, then “splat” all frustums into a rasterized bird’s-eye-view grid. By training on the entire camera rig, we provide evidence that our model is able to learn not only how to represent images but how to fuse predictions from all cameras into a single cohesive representation of the scene while being robust to calibration error. On standard bird’s-eye-view tasks such as object segmentation and map segmentation, our model outperforms all baselines and prior work. In pursuit of the goal of learning dense representations for motion planning, we show that the representations inferred by our model enable interpretable end-to-end motion planning by “shooting” template trajectories into a bird’s-eye-view cost map output by our network. We benchmark our approach against models that use oracle depth from lidar.
</p>

<hr/>

<span class="border border-white">
<h4 class="text-center">News</h4>
<ul>
<li>[October 2020] code release on <a href='https://github.com/nv-tlabs/lift-splat-shoot' target="_blank">github</a></li>
<li>[August 2020] paper released on <a href='https://arxiv.org/abs/2008.05711' target="_blank">arxiv</a></li>
</ul>
</span>

<hr/>

<span class="border border-white">
<h4 class="text-center">Paper</h4>
<div class='row'>
<div class='col'>
<a href='https://arxiv.org/abs/2008.05711' target='_blank'><img src='imgs/icon.png' class='img-fluid float-right' style='height:180px; border: solid; border-radius:30px; border-color:#000000;'></a>
</div>
<div class='col'>
<p class="card-text">Jonah Philion, Sanja Fidler</p>
<p class="card-text">Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D</p>
<p class="card-text">ECCV, 2020. (poster)</p>
<p class="card-text">[<a href="https://arxiv.org/abs/2008.05711" target="_blank">preprint</a>] [<a href="./cite.txt" target="_blank">bibtex</a>]</p>
</div>
</div>
</span>

<hr/>
<h4 class='text-center'>Main Idea</h4>
<b>Lift, Splat, Shoot</b> Our goal is to design a model that takes as input multi-view image data from any camera rig and outputs semantics in the reference frame of the camera rig as determined by the extrinsics and intrinsics of the cameras. The tasks we consider in this paper are <span style="color:#0191F7">bird's-eye-view vehicle segmentation</span>, <span style="color:#00FFFF">bird's-eye-view lane segmentation</span>, <span style="color:#FF7F4F">drivable area segmentation</span>, and <span style="color:#C71585">motion planning</span>. Our network is composed of an initial per-image CNN followed by a bird's-eye-view CNN connected by a "Lift-Splat" pooling layer (left). To "lift" images into 3D, the per-image CNN performs an attention-style operation at each pixel over a set of discrete depths (right).

<!-- <div class='row'>
<div class='col-6 text-center'>
<img src='imgs/newsplat.jpg' class='img-fluid' style="border:1px solid #000000; border-radius: 25px;">
</div>
<div class='col-6 text-center'>
<img src='imgs/lift.png' class='img-fluid' style="border:1px solid #000000; border-radius: 25px;">
</div>
</div> -->
<p class='text-center'><img src='imgs/together.png' class='img-fluid' style="border:0px solid #000000; border-radius: 15px; height: 160px;"></p>

<b>Learning Cost Maps for Planning</b> We frame end-to-end motion planning ("shooting") as classification over a set of fixed template trajectories (left). We define the logit for each template to be the sum of the values it traces through the bird's-eye-view cost map output by our model (right). We then train the model to maximize the likelihood of expert trajectories.

<p class='text-center'><img src='imgs/eq.png' class='img-fluid' style="border:0px solid #000000; border-radius: 15px; height: 200px;">
</p>

<b>Equivariance</b> To be maximally useful, models that perform inference in the bird's-eye-view frame need to generalize to any choice of bird's-eye-view coordinates. Our model is designed such that it roughly respects equivariance under translations (top left) and rotations (top right) of the camera extrinsics. Lift-Splat Pooling is also exactly permutation invariant (bottom left) and roughly invariant to image translation (bottom right).

<div class='row'>
<div class='col-6 text-center'>
<img src='imgs/sym.gif' class='img-fluid' style="border:1px solid #000000; border-radius: 25px;">
<figcaption class="figure-caption">Extrinsic Translation</figcaption>
</div>
<div class='col-6 text-center'>
<img src='imgs/rot.gif' class='img-fluid' style="border:1px solid #000000; border-radius: 25px;">
<figcaption class="figure-caption">Extrinsic Rotation</figcaption>
</div>
</div>
<br>

<div class='row'>
<div class='col-6 text-center'>
<img src='imgs/perm.gif' class='img-fluid' style="border:1px solid #000000; border-radius: 25px;">
<figcaption class="figure-caption">Image Permutation</figcaption>
</div>
<div class='col-6 text-center'>
<img src='imgs/im.gif' class='img-fluid' style="border:1px solid #000000; border-radius: 25px;">
<figcaption class="figure-caption">Image Translation</figcaption>
</div>
</div>
<br>

<b>Results</b> We outperform baselines on bird's-eye-view segmentation. We demonstrate transfer across camera rigs in two scenarios of increasing difficulty. In the first, we drop cameras at test time from the same camera rig that was used during training. In the second, we test on an entirely different camera rig (Lyft dataset) from the one used during training (nuScenes dataset).

<p class='text-center'><img src='imgs/results.png' class='img-fluid' style="border:0px solid #000000; border-radius: 15px; height: 120px;">
<figcaption class="figure-caption text-center">Bird's-Eye-View Segmentation IOU (nuScenes and Lyft)</figcaption>
</p>

<p class='text-center'><img src='imgs/nusc.gif' class='img-fluid' style='height:250px; border-radius:15px; padding:5px'>
<figcaption class="figure-caption text-center"><b>nuScenes validation set</b> Input images are shown on the left. BEV inference output by our model is shown on the right. The BEV semantics are additionally projected back onto the input images for visualization convenience.</figcaption>
</p>

<div class='row'>
<div class='col-6 text-center'>
<iframe src="https://drive.google.com/file/d/1dGU0zmsxJgFtXMkkHrD2P6DB_JjlvYnQ/preview" width="90%"></iframe>
<figcaption class="figure-caption text-center"><b>Camera Dropout</b> At test time, we remove different cameras from the camera rig. When a camera is removed, the network imputes semantics in the blind spot by using information in the remaining cameras as well as priors about object shapes and road structure.</figcaption>
</div>
<div class='col-6 text-center'>
<iframe src="https://drive.google.com/file/d/1XwqzDYfzXhky1WNuXTKy7i-jVXgp4hc_/preview" width="90%"></iframe>
<figcaption class="figure-caption text-center"><b>Train on nuScenes -> Test on Lyft</b> We evaluate a model trained on the nuScenes dataset on the Lyft dataset. Segmentations output by the model are fuzzy but still meaningful. Quantitative transfer results against baselines can be found in our <a href='https://arxiv.org/pdf/2008.05711.pdf' target='_blank'>paper</a>.</figcaption>
</div>
</div>

<hr/>
<h4 class='text-center'>ECCV 2020 1 minute video</h4>
<div class="embed-responsive embed-responsive-16by9">
<iframe class="embed-responsive-item" src="https://www.youtube.com/embed/ypQQUG4nFJY" style='display:block;' allowfullscreen></iframe>
</div>
<br>

<hr/>
<h4 class='text-center'>ECCV 2020 10 minute video</h4>
<div class="embed-responsive embed-responsive-16by9">
<iframe class="embed-responsive-item" src="https://www.youtube.com/embed/oL5ISk6BnDE" allowfullscreen></iframe>
</div>
<br>

<!-- Optional JavaScript -->
<!-- jQuery first, then Popper.js, then Bootstrap JS -->
<script src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" crossorigin="anonymous"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.7/umd/popper.min.js" integrity="sha384-UO2eT0CpHqdSJQ6hJty5KVphtPhzWj9WO1clHTMGa3JDZwrnQq4sF86dIHNDz0W1" crossorigin="anonymous"></script>
<script src="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js" integrity="sha384-JjSmVgyd0p3pXB1rRibZUAYoIIy6OrQ6VrjIEaFf/nJGzIxFDsf4x0xIM+B07jRM" crossorigin="anonymous"></script>
</body>
</html>

<!DOCTYPE html>
<meta charset="utf-8">
<title>Redirecting to https://research.nvidia.com/labs/toronto-ai/lift-splat-shoot/</title>
<meta http-equiv="refresh" content="0; URL=https://research.nvidia.com/labs/toronto-ai/lift-splat-shoot/">
<link rel="canonical" href="https://research.nvidia.com/labs/toronto-ai/lift-splat-shoot/">
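As a rough illustration of the "lift" operation the retired page describes above (an attention-style operation at each pixel over a set of discrete depths), the sketch below is an assumption of how such a step could look in PyTorch: it is not the authors' implementation, and the tensor names and shapes are invented for clarity. Each pixel's context vector is weighted by its softmax attention over D depth bins, producing a frustum of features.

import torch

def lift(depth_logits: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
    # Illustrative sketch only (assumed shapes, not the official nv-tlabs code).
    # depth_logits: (B, D, H, W) unnormalized scores over D discrete depth bins per pixel.
    # context:      (B, C, H, W) per-pixel context features from the per-image CNN.
    # returns:      (B, D, C, H, W) frustum of features.
    depth_probs = depth_logits.softmax(dim=1)           # attention over discrete depths
    # Outer product over depth and channel axes via broadcasting.
    return depth_probs.unsqueeze(2) * context.unsqueeze(1)

# Tiny usage example with made-up sizes.
B, D, C, H, W = 1, 4, 8, 16, 16
frustum = lift(torch.randn(B, D, H, W), torch.randn(B, C, H, W))
print(frustum.shape)  # torch.Size([1, 4, 8, 16, 16])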