Questions about create_frustum and voxel_pooling #36

Open
GYGWG opened this issue Oct 26, 2022 · 2 comments

Comments

@GYGWG

GYGWG commented Oct 26, 2022

Hi, thanks for your excellent work! I am a little confused about the functions create_frustum and voxel_pooling. It would be great if you could give some further explanation.

In create_frustum, the code indicates that the output dimension is D x H x W x 3. What does this 3 represent? Is it an RGB value, or is it the coordinate position of a point in the frustum? I am also wondering whether the input to this function is the raw image or the extracted features.

For voxel_pooling, my understanding is that it sums the features of all the points in the same voxel (pillar) using the cumsum trick. The output dimension of this function is B x C x Z x X x Y, where X, Y and Z are the coordinates in the BEV grid (which are not the same as H, W and D). However, the paper says "perform sum pooling to create a CxHxW tensor", which really confused me. Why do we still want H and W here? Also, how do you get rid of Z?

@GYGWG

GYGWG commented Oct 27, 2022

I am also confused by the function get_geometry. It says "Determine the (x,y,z) locations (in the ego frame)"; however, the output dimension is still B x N x D x H/downsample x W/downsample x 3. I assume X, Y and Z are matched to H/downsample, W/downsample and D in this case? Again, what does this 3 stand for?

@manueldiaz96

In create_frustum, the code indicates that the output dimension is D x H x W x 3. What does this 3 represent? Is it an RGB value, or is it the coordinate position of a point in the frustum?

It is the tuple (x, y, z) that gives the 3D coordinates of the point. You can see this because all the Z values for any given D are the same, which follows from LSS trying to learn where the objects are using depth planes.

I am also wondering whether the input to this function is raw image or extracted feature?

For create_frustum, there are no inputs. What it uses is the original image shape, which is an internal variable of the model, together with the downsampling factor. From these it finds the width and height of the final feature map after the encoder backbone, which would be the extracted features.
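To make the D x H x W x 3 shape concrete, here is a minimal NumPy sketch of how such a frustum grid could be built. It is not the repository's code: the function name, argument names and example sizes are made up for illustration, and it assumes that x and y are pixel positions in the original image while the third component is the candidate depth of each depth plane.

```python
import numpy as np

def create_frustum_sketch(img_h, img_w, downsample, dbound):
    """Return a D x fH x fW x 3 grid of (x, y, depth) triples.

    Hypothetical helper: dbound is assumed to be (min_depth, max_depth, step).
    """
    # Size of the feature map after the encoder backbone.
    f_h, f_w = img_h // downsample, img_w // downsample
    depths = np.arange(*dbound, dtype=np.float32)          # D depth planes
    xs = np.linspace(0, img_w - 1, f_w, dtype=np.float32)  # pixel columns
    ys = np.linspace(0, img_h - 1, f_h, dtype=np.float32)  # pixel rows
    # Each of d, y, x has shape D x fH x fW.
    d, y, x = np.meshgrid(depths, ys, xs, indexing="ij")
    # The trailing axis of size 3 holds coordinates, not RGB values.
    return np.stack((x, y, d), axis=-1)

frustum = create_frustum_sketch(img_h=128, img_w=352, downsample=16,
                                dbound=(4.0, 45.0, 1.0))
print(frustum.shape)  # (41, 8, 22, 3)
```

Note that the function takes no image and no features, only sizes, and that every point on a given depth plane shares the same third component.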

However, the paper says "perform sum pooling to create a CxHxW tensor", which really confused me. Why do we still want H and W here?

Because they are used to find the proper xyz coordinates for each pixel in each depth plane D. They aren't used for anything else, if I am not mistaken.

Besides, I am wondering how you get rid of Z?

You get rid of Z by performing the sum pooling, which takes all the points in a voxel (a cell in the discretization of 3D space) of infinite height and adds them together. This sums all the features that fall in the same cell of the BEV, where you cannot distinguish their Z component.
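A small sketch may help show why Z disappears. This is a simplified illustration and not the repository's voxel_pooling: it uses a plain scatter-add (`np.add.at`) instead of the cumsum trick, and the function name, the `(min, max, cell_size)` bounds and the toy data are all assumptions made for the example.

```python
import numpy as np

def sum_pool_to_bev(points_xyz, feats, x_bound, y_bound):
    """Sum per-point features into a C x X x Y grid of infinitely tall pillars.

    Hypothetical helper: points_xyz is K x 3, feats is K x C, and each bound
    is (min, max, cell_size).
    """
    nx = int(round((x_bound[1] - x_bound[0]) / x_bound[2]))
    ny = int(round((y_bound[1] - y_bound[0]) / y_bound[2]))
    # Only x and y pick the cell; z is never looked at.
    ix = np.floor((points_xyz[:, 0] - x_bound[0]) / x_bound[2]).astype(int)
    iy = np.floor((points_xyz[:, 1] - y_bound[0]) / y_bound[2]).astype(int)
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    bev = np.zeros((feats.shape[1], nx, ny), dtype=feats.dtype)
    # Unbuffered scatter-add: points sharing a cell accumulate.
    np.add.at(bev, (slice(None), ix[keep], iy[keep]), feats[keep].T)
    return bev

points = np.array([[0.2, 0.3, -1.0],   # cell (0, 0), low
                   [0.4, 0.1, 2.5],    # cell (0, 0), high
                   [1.5, 0.5, 0.0],    # cell (1, 0)
                   [9.0, 9.0, 0.0]],   # outside the grid, dropped
                  dtype=np.float32)
feats = np.array([[1, 10], [2, 20], [4, 40], [8, 80]], dtype=np.float32)
bev = sum_pool_to_bev(points, feats, x_bound=(0.0, 2.0, 1.0), y_bound=(0.0, 1.0, 1.0))
print(bev.shape)     # (2, 2, 1)
print(bev[:, 0, 0])  # [ 3. 30.]
```

The first two points sit at very different heights but land in the same pillar, so their features are added together and the height information is gone.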

I assume X Y and Z are matched to H/downsample W/downsample and D in this case?

No, XYZ is just a 3D vector that is assigned to each pixel (which has coordinates DHW). This is how you associate each pixel in the features with its projection in 3D.
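As a rough illustration of that association, the sketch below maps each (x_pixel, y_pixel, depth) entry of a frustum to an (x, y, z) point in the ego frame with a standard pinhole camera model. It is a sketch under stated assumptions, not the repository's get_geometry: it handles a single camera, ignores the batch and camera dimensions and any image augmentation, and the function name, arguments and calibration values are invented for the example.

```python
import numpy as np

def frustum_to_ego(frustum, intrinsics, rotation, translation):
    """Map a D x H x W x 3 grid of (u, v, depth) to (x, y, z) in the ego frame.

    Hypothetical helper: intrinsics and rotation are 3 x 3, translation has
    length 3, and rotation/translation are assumed to go from camera to ego.
    """
    u, v, d = frustum[..., 0], frustum[..., 1], frustum[..., 2]
    # Pixel coordinates scaled by depth: (u * d, v * d, d).
    scaled = np.stack((u * d, v * d, d), axis=-1)
    # Undo the intrinsics, then rotate from the camera frame to the ego frame.
    cam_to_ego = rotation @ np.linalg.inv(intrinsics)
    # Row-vector form of cam_to_ego @ p + translation, applied to every entry.
    return scaled @ cam_to_ego.T + translation

intrinsics = np.array([[100.0, 0.0, 10.0],
                       [0.0, 100.0, 5.0],
                       [0.0, 0.0, 1.0]])
rotation = np.eye(3)
translation = np.array([1.0, 0.0, 2.0])
# Two frustum entries: D = 1, H = 1, W = 2.
frustum = np.array([[[[10.0, 5.0, 7.0], [110.0, 5.0, 2.0]]]])
geom = frustum_to_ego(frustum, intrinsics, rotation, translation)
print(geom.shape)  # (1, 1, 2, 3)
```

The leading D, H, W axes are unchanged: they still index which feature pixel and depth plane an entry belongs to, while the trailing axis of size 3 now holds that entry's position in 3D.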
