ONNX export problem #64

Open
NickHauptvogel opened this issue Aug 3, 2021 · 6 comments

Comments

@NickHauptvogel

Hi,

I am posting this issue in this repo because I used PyEDDL; however, it most likely concerns EDDL rather than the bindings.

When I export similarly trained models with good training scores to .onnx in version 0.14.0 (EDDL v0.9.2b) and in version 1.0.0 (EDDL v1.0.2a), the model from version 1.0.0 performs significantly worse afterwards. It does not degrade completely, but the test loss after export and import is about 10x higher in v1.0.0 than in v0.14.0. Keeping the application unchanged, switching back to v0.14.0, and repeating training and export resolves the issue.
I noticed that the model file in v1.0.0 is larger by a few bytes, so something must have changed in the .onnx export between the versions. I suspect a float conversion or truncation, but it could also be a changed third-party dependency of the export. Inspecting both models and their weights, I cannot see a difference at first sight.

I have sent @salvacarrion the models from both versions for comparison.

Best regards,

Nick

@jonandergomez

Dear Nick,

I can confirm that several changes to the ONNX support were made between EDDL v0.9.2b and v1.0.0.a. And clearly, this issue relates to EDDL rather than PyEDDL.

The whole EDDL team is on holiday in August. So as not to delay this matter, could you please send me ([email protected]) the models you mention and some details on the dataset, so that I can reproduce the issue you describe? If the models are too large to attach to an email, you can use our university's file exchange service: https://intercambio.upv.es/index.php?lang=en&upv=

Best regards,

Jon

@NickHauptvogel
Author

Dear Jon,

I have sent you the files via the file interchange service!

Best regards,

Nick

@jonandergomez

Thanks, I received them. I'll get back to you as soon as possible.

Regards,
Jon

@jonandergomez

Hi again,

@NickHauptvogel could you provide us with more information about your problem?

Judging from the model you provided, I guess it is a semantic segmentation problem. The outputs are 256x256x14 tensors, which I permuted to 14x256x256 because EDDL works "channel first" in these cases. However, the output values are in the range 0..255, which makes them difficult for NNs to work with. I'm running some tests with the output re-scaled to 0..1; could you please confirm that this is correct for this use case?
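
The preprocessing described above can be sketched in plain Python as follows. This is only an illustration: the function name `to_channel_first_unit_range` and the tiny 2x2x3 tensor are hypothetical, and the divisor 255 assumes the target values really span 0..255.

```python
def to_channel_first_unit_range(hwc, max_value=255.0):
    """Permute an H x W x C nested list to C x H x W and scale values to 0..1."""
    height = len(hwc)
    width = len(hwc[0])
    channels = len(hwc[0][0])
    return [
        [[hwc[y][x][c] / max_value for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]


# Tiny 2 x 2 x 3 example standing in for a 256 x 256 x 14 target (made-up values).
sample = [
    [[0, 51, 255], [102, 0, 0]],
    [[255, 255, 0], [0, 204, 51]],
]
chw = to_channel_first_unit_range(sample)
```

With NumPy arrays the same step would typically be written as `np.transpose(arr, (2, 0, 1)) / 255.0`.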

Additionally, as the last layer of the model you provided has no activation (i.e., the activation is linear), I can't guess how to evaluate the predictions. So could you please share all the information you can, so that we can reproduce the same behavior you get?

With the 10 samples you provided, I ran some tests. Starting from the v1_0_0 model, the network gives the same output right after training it for 10 epochs and after importing it from ONNX, whether I initialize the weights or start training from the ones you provided. However, I have to inform you that with the v_0_14_0 model the outputs after training and when importing from scratch differ, both with and without initializing the weights of the model.
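
A round-trip check of this kind can be sketched as below. It does not call EDDL at all: the helper names, the tolerance, and the output values are hypothetical and stand in for flattened network outputs recorded before export and after import.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two flat sequences."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same length")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def outputs_match(before, after, tol=1e-5):
    """True if every element differs by at most `tol` (tolerance is an assumption)."""
    return max_abs_diff(before, after) <= tol


# Made-up outputs: one faithful round trip, one that drifted.
before_export = [0.10, 0.20, 0.70]
after_import_ok = [0.10, 0.20, 0.70]
after_import_bad = [0.10, 0.35, 0.55]

print(outputs_match(before_export, after_import_ok))
print(outputs_match(before_export, after_import_bad))
```

Comparing outputs on fixed inputs isolates the export/import step from any differences introduced by training.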

Regards,

Jon

@NickHauptvogel
Author

Hi Jon,

The use case is pose estimation via the prediction of heatmaps, i.e. probability distributions of a key point's occurrence at each pixel location. The image is 256x256 and there are 14 key points, with one heatmap per key point. I did not normalize the data because I found that it was not necessary. As I said, the v0.14.0 model works perfectly fine with the data set, predicting key points within 1 px of the ground truth in 95.5% of the images.
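
As a rough sketch of this kind of evaluation: a key point can be read off a heatmap as the location of its maximum, and accuracy reported as the fraction of predictions within a pixel threshold. The function names and sample values are hypothetical, and using Euclidean distance for "within 1 px" is an assumption.

```python
import math


def heatmap_argmax(heatmap):
    """Return (row, col) of the largest value in a 2D nested list."""
    best = (0, 0)
    best_val = heatmap[0][0]
    for r, row in enumerate(heatmap):
        for c, v in enumerate(row):
            if v > best_val:
                best_val = v
                best = (r, c)
    return best


def fraction_within(pred_points, true_points, max_dist=1.0):
    """Share of predicted points at most `max_dist` pixels from the truth."""
    hits = sum(
        1
        for p, t in zip(pred_points, true_points)
        if math.hypot(p[0] - t[0], p[1] - t[1]) <= max_dist
    )
    return hits / len(true_points)


# Made-up 3 x 3 heatmap with its peak at row 1, column 2.
heatmap = [
    [0.0, 0.1, 0.2],
    [0.1, 0.3, 0.9],
    [0.0, 0.2, 0.4],
]
peak = heatmap_argmax(heatmap)

# Made-up predictions vs. ground truth: pixel distances 0, 1 and about 4.24.
preds = [(1, 2), (5, 5), (0, 0)]
truth = [(1, 2), (5, 6), (3, 3)]
score = fraction_within(preds, truth)
```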

However, if I only run inference (no training) on the same images with the v0.14.0 and the v1.0.0 models, the v1.0.0 model gives worse predictions (the metric is mean squared error). Is this the same for you?

The point is that both models performed equally well while training with 10,000 images. Only after exporting and importing did the v1.0.0 model get so much worse.

Best regards,

Nick

@chavicoski

Hi Nick,

I have been reviewing the problem and the models that you sent to Jon and I spotted the bug.

The difference between versions 0.14 and 1.0 is that 0.14 exports the Upsample layer as Upsample in ONNX, whereas 1.0 exports it as Resize, since the Upsample operator is deprecated in ONNX (this can explain the difference in file sizes). In these versions of EDDL, the implementations of Resize and Upsample are not completely equivalent, so some precision is lost. When exporting (Upsample EDDL -> Resize ONNX) and then importing (Resize ONNX -> Resize EDDL), the results are therefore not exactly the same (Upsample EDDL != Resize EDDL in 1.0).
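
To illustrate how two resampling layers that both "scale by 2" can still disagree, here is a simplified 1D sketch. It is not EDDL code and not necessarily the discrepancy fixed here: it assumes linear interpolation and contrasts two coordinate conventions that the ONNX Resize operator names `asymmetric` and `half_pixel`. The function name and input values are hypothetical, and edge handling is reduced to clamping.

```python
def resize_linear_1d(values, scale, mode):
    """Linearly resample a 1D list using the given coordinate convention."""
    last = len(values) - 1
    out = []
    for i in range(int(len(values) * scale)):
        if mode == "asymmetric":
            x = i / scale
        elif mode == "half_pixel":
            x = (i + 0.5) / scale - 0.5
        else:
            raise ValueError(f"unknown mode: {mode}")
        # Clamp the source coordinate to the valid range (simplified edge handling).
        x = min(max(x, 0.0), float(last))
        lo = int(x)
        hi = min(lo + 1, last)
        frac = x - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


row = [0.0, 10.0]
asym = resize_linear_1d(row, 2, "asymmetric")  # [0.0, 5.0, 10.0, 10.0]
half = resize_linear_1d(row, 2, "half_pixel")  # [0.0, 2.5, 7.5, 10.0]
```

Weights trained against one convention and then evaluated with the other see slightly shifted feature maps, which is the kind of mismatch that can raise the loss after an export/import round trip.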

The bug is fixed in the develop branch, where the Upsample and Resize layers now perform the same operation, so it will be fixed in the next release (coming soon).

Best regards,

Alvaro

3 participants