Depth distribution and context vector of a pixel #33
The context vectors are the features. Both of them are generated in the last layer of the camera encoder (CamEncode).
@manueldiaz96 Thanks for your comment. Let me summarize what happens in CamEncode and check whether my understanding is correct. Features are extracted from each input image by taking the last two blocks of EfficientNet (reduction_5 and reduction_4); let's say they are x1 and x2. Then new_x is computed: is this a projection of the categorical depth onto the image features?
Yes, that is what the code does.
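For illustration, here is a minimal shape-level sketch of that projection in PyTorch (not the repository's exact code; the tensor names and sizes are assumptions for the example):

```python
import torch

B, D, C, H, W = 1, 41, 64, 8, 22  # batch, depth bins, context channels, feature map size (illustrative)

depth = torch.rand(B, D, H, W).softmax(dim=1)  # categorical depth distribution per pixel
context = torch.rand(B, C, H, W)               # context vector per pixel

# Outer product over the depth and channel axes:
# every context feature is weighted by the probability of each depth bin.
new_x = depth.unsqueeze(1) * context.unsqueeze(2)  # (B, C, D, H, W)
print(new_x.shape)  # torch.Size([1, 64, 41, 8, 22])
```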
You can just call it the outer product of the depth distribution and the context features.

Look at the definition of get_depth_feat in CamEncode: they split the features given by depthnet into the first D channels (the depth distribution, after a softmax) and the next C channels (the context vector).

Using the information from the depth distribution, each context vector is scaled by the probability of every depth bin, as they describe in the 4th paragraph of subsection 3.1. Also look at the definition of categorical depth distributions in the Categorical Depth Distribution Network for Monocular 3D Object Detection paper, specifically subsection 1.1.
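As a rough sketch of that split, assuming a single 1x1 convolution head producing D + C channels per pixel (the names and sizes below are illustrative, not the repository's exact code):

```python
import torch
import torch.nn as nn

D, C = 41, 64                      # depth bins and context channels (illustrative)
feats = torch.rand(1, 512, 8, 22)  # stand-in for the fused EfficientNet features

depthnet = nn.Conv2d(512, D + C, kernel_size=1)  # hypothetical head producing D + C channels
x = depthnet(feats)

depth_logits = x[:, :D]              # first D channels -> depth bins
depth = depth_logits.softmax(dim=1)  # categorical depth distribution per pixel
context = x[:, D:D + C]              # remaining C channels -> context vector
```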
Hi there, just a few more questions about the choice of D, as I found no ablation study about it: why do the depth bins start at 4.0 m and end at 45.0 m, and why use a 1.0 m bin size instead of matching delta D to the BEV grid resolution (0.5 m)?
@ZiyuXiong, although I am not the author of the paper, the following is my intuition:
I would think that the lower limit pertains to a safe area around the car (including the car itself), since the reference frame is located at the rear axle. About the upper limit, I am not sure.

I am not sure it would ease the depth projection error, or at least not by much for vehicles (given their normal dimensions). Matching delta D to the grid resolution would increase processing time (at least doubling the time spent on the projection, and even more if you increase the original image input size), whereas the gaps between the planes in the BEV (1 m = 2 pixels) could easily be filled in by the convolutions of the BEV encoder. So I would guess that it is all about what compromises you can make to get the best output at the latency you desire.
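To make the latency point concrete, here is a back-of-the-envelope count of frustum points under assumed numbers (6 cameras, a 128x352 input downsampled 16x, depth from 4 m to 45 m; not measured timings): halving delta D doubles the number of points that have to be lifted and splatted.

```python
# Rough point count for the lift step under assumed numbers (illustrative only).
num_cams = 6
feat_h, feat_w = 128 // 16, 352 // 16   # feature map size after 16x downsampling

def frustum_points(delta_d, d_min=4.0, d_max=45.0):
    num_bins = int((d_max - d_min) / delta_d)
    return num_cams * feat_h * feat_w * num_bins

print(frustum_points(1.0))   # 43296 points with 1.0 m bins
print(frustum_points(0.5))   # 86592 points with 0.5 m bins (double the work)
```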
I wanted to understand more about the lift stage.

Basically, they mention in the paper that the lift stage is where the 2D to 3D conversion happens.

As a first step in this process they 'generate representations at all possible depths for each pixel', which is what they call D, generated from 4.0 to 45.0 with a step of 1.0. Basically, the depth distribution is defined over bins from 4.0 to 45.0 at 1.0 m intervals, isn't it?

Then there is something called the context vector C; I am not sure how this gets generated for each pixel.

It would be a great help if anybody could give a little more explanation of both of these (the depth distribution and the context vector of each pixel).
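For reference, a small sketch of how such a set of depth bins can be built with torch.arange, assuming the 4.0 to 45.0 range and 1.0 m step described above:

```python
import torch

# Depth bins from 4.0 m up to (but not including) 45.0 m, in 1.0 m steps.
depth_bins = torch.arange(4.0, 45.0, 1.0)
print(depth_bins[:3], depth_bins[-1], len(depth_bins))  # tensor([4., 5., 6.]) tensor(44.) 41
```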