Ncams=5 #31
Comments
Maybe to add robustness to the system so that it does not depend on all cameras for each segment of the final grid? That way you can also see how much the other cameras' information might contribute to the one that is currently disabled.
@manueldiaz96, I am not sure I understand the robustness you mentioned. It looks like the number of cameras can range from 1 to many; I was thinking it should be 1 to 6, since the nuScenes dataset was generated using 6 cameras, but apparently that is not the case. I have read through the paper as well; they mention something about this, but I have not fully understood it.
Hello @VeeranjaneyuluToka. The idea here is that by using the intrinsics, extrinsics, the predicted categorical depth, and sum pooling, you can theoretically project features coming from any number of cameras (N) into the 3D space centred on the robot. The NuScenes case limits us to choosing between one and six, but the method theoretically works for any number of cameras (for example, 1000 cameras). In practice, this number of cameras will be limited by your computational capabilities (GPU memory) and how fast you want the system to run. I hope this helps.
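To make the N-camera projection a bit more concrete, here is a rough sketch (not the repository's exact code; the names and shapes are my own assumptions) of how the shared pixel/depth candidates of an arbitrary number of cameras can be mapped into the ego frame using the intrinsics and extrinsics:

```python
import torch

def lift_to_ego(frustum, rots, trans, intrins):
    """Map the shared (u, v, depth) frustum of every camera into the ego frame.

    frustum: (D, H, W, 3)  pixel coords + candidate depth, shared by all cameras
    rots:    (B, N, 3, 3)  camera-to-ego rotations
    trans:   (B, N, 3)     camera-to-ego translations
    intrins: (B, N, 3, 3)  camera intrinsics
    """
    B, N = rots.shape[:2]
    pts = frustum.view(1, 1, *frustum.shape).expand(B, N, -1, -1, -1, -1)
    # undo the perspective division: (u, v, d) -> (u*d, v*d, d)
    pts = torch.cat([pts[..., :2] * pts[..., 2:3], pts[..., 2:3]], dim=-1)
    # pixel -> camera frame -> ego frame; N is just a batch-like dimension here,
    # which is why any number of cameras works in principle
    cam = torch.einsum('bnij,bndhwj->bndhwi', torch.inverse(intrins), pts)
    ego = torch.einsum('bnij,bndhwj->bndhwi', rots, cam) + trans.view(B, N, 1, 1, 1, 3)
    return ego  # (B, N, D, H, W, 3), later voxelized and sum-pooled into the grid
```

The number of cameras only appears as a batch-like dimension, which is why the limit is memory and runtime rather than the method itself.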
@manueldiaz96, thanks for your reply, and sorry for the late comment again. I have been debugging the code and analysing it to understand more, and I still have a couple of questions. What do you mean by predicted categorical depth? If I understand correctly, the depth map/distribution is generated using a certain interval (4.0 to 45.0 with a step of 1.0). Is this just an initial random initialization that is later tuned by an architecture (depthnet in the implementation), and is that what you mean by predicted categorical depth? I believe the sum pooling is what gets computed in voxel_pooling, is it not? I have one more question: if we have the extrinsics, the intrinsics, and a 2D map (an image in our case), then a 3D map can be computed straight away, as mentioned in the below link. Is it just because the depth generation needs to be part of the learning process that we start with an initial depth distribution (4.0 to 45.0 with a step of 1.0) and then tune it, as I mentioned above? And does the projection follow the approach stated in the Stack Overflow link I posted above?
First, let me start with what I (and the authors, I think) mean by categorical depth. In our case, we do not want to predict the depth for each pixel in the image, but instead the depth for each pixel in the feature map given by the backbone (which has been downsampled). We could approach this by trying to predict the exact depth value, which implies an infinite range of values within [0m, 50m]. This can be done following MDE approaches, but it would require proper depth ground truth and a specific depth prediction head architecture.

Instead, Philion and Fidler propose to leverage something from a previous paper (check figure 2), which is to define a set of planes (or depths) from the start, so we do not have to deal with continuous depth prediction. For the case of LSS, they propose a range of [4m, 45m] from the camera, where each plane is separated by 1m. You can now see that this is very similar to a classification problem, since we have a proxy for a set of classes, namely the depth plane the pixel belongs to. This is why it is called categorical depth: we are trying to classify whether a pixel belongs to class 1, 2, 3, ..., which in this case means that the pixel has a depth of 4, 5, 6, ... meters in front of the camera. In practice it is a vector containing the probability that the pixel belongs to a specific depth, e.g. [0.002, 0.004, 0.89, ...], as in normal classification with NNs. It is because of this that pixels will only be placed at 4, 5, 6, ... meters, not 4.5, 5.78, 6.21, since these values are already defined when creating the frustum.

This is the distribution that the network needs to learn, which is called alpha in the paper. And since we are learning these alpha/depth values per image, the method is independent of the number of cameras we use to train, since we predict these alpha distributions based only on what is in the image. So, to answer your questions specifically:
I mean that they define a set of depth planes, positioned from 4m to 45m in front of the camera and separated by a distance of 1m. They are predefined when creating the frustum.
No, the depth values (or depth planes) are set up from the start. You can see this in the frustum creation code.
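As a rough illustration of what "set up from the start" means (the exact function in the repo may differ, and the feature-map size here is just an example), the depth planes can be created once and are never touched by training:

```python
import torch

def make_depth_planes(dbound=(4.0, 45.0, 1.0), fH=8, fW=22):
    ds = torch.arange(*dbound)               # fixed planes: 4, 5, ..., 44 m
    D = ds.shape[0]
    ds = ds.view(D, 1, 1).expand(D, fH, fW)  # one copy per feature-map pixel
    return ds                                # a constant, not a learnable parameter
```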
I am not sure I understand your question. The sum pooling is used in the voxel pooling step.
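Conceptually, the sum pooling just adds up every lifted feature that lands in the same BEV cell. A minimal sketch (the repository implements this more efficiently with a cumulative-sum trick, and the names here are illustrative):

```python
import torch

def sum_pool_to_bev(point_feats, cell_idx, n_cells, C):
    """point_feats: (P, C) lifted features, cell_idx: (P,) flat BEV cell index."""
    bev = torch.zeros(n_cells, C, dtype=point_feats.dtype)
    bev.index_add_(0, cell_idx, point_feats)  # features falling in the same cell are summed
    return bev
```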
No, this answer is misleading, because they even say: "Add them a z=1, so they are now 3d. Project them as follows." This means they are choosing a specific depth. The depth they choose is the one we are trying to learn, in an approximate manner, from the categorical depth distribution vector. It is not possible to project from 2D back to 3D by only knowing the intrinsics and extrinsics; if it were, then tasks such as MDE would not exist. The challenge is that since images are generated by the projection of light rays onto the camera sensor, we lose information about the 3D world when projecting it onto the image plane.
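A tiny numerical example of that ambiguity (the intrinsic matrix below is made up for illustration): with intrinsics alone, a pixel only gives you a ray, and every candidate depth plane picks a different 3D point on that ray.

```python
import torch

K = torch.tensor([[500., 0., 320.],
                  [0., 500., 240.],
                  [0.,   0.,   1.]])        # made-up intrinsics
pixel = torch.tensor([400., 300., 1.])      # homogeneous pixel coordinate
ray = torch.inverse(K) @ pixel              # direction only, depth still unknown
for d in (4.0, 10.0, 44.0):                 # three of the predefined depth planes
    print(d, (d * ray).tolist())            # three very different 3D points
```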
Yes, we are learning the depth estimation (not sure generation would apply here).
No, the depth planes are predefined and they are not changed or tuned. What we are tuning is the categorical prediction, to know which pixel belongs to which plane.
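To make that split concrete, here is a hedged sketch (the layer and channel counts are my own assumptions, not the repo's exact depthnet) of the part that is tuned: a per-pixel softmax over the fixed depth planes, used to weight the image features.

```python
import torch
import torch.nn as nn

D, C = 41, 64                                 # 41 planes for dbound (4.0, 45.0, 1.0); C is illustrative
head = nn.Conv2d(512, D + C, kernel_size=1)   # hypothetical prediction head on backbone features

def categorical_lift(backbone_feats):         # (B, 512, fH, fW)
    x = head(backbone_feats)
    alpha = x[:, :D].softmax(dim=1)            # (B, D, fH, fW): the learned categorical depth
    feats = x[:, D:]                           # (B, C, fH, fW): context features
    # outer product: each feature is spread over the fixed planes, weighted by alpha
    return alpha.unsqueeze(1) * feats.unsqueeze(2)   # (B, C, D, fH, fW)
```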
No, it does not. As I said, they 'assume' a depth value; this is specifically what we want to estimate/learn. I hope this helps! If you want to better understand how cameras and projective geometry work, I recommend this excellent series of blog posts about understanding the camera matrix; the last post has a very useful demo: Post 1, Part 2, Part 3
I want to know: why is Ncams=5 when training?
I found: