Kojak Dataset Validation #81
Comments
lxml.etree.XMLSyntaxError: Namespace prefix xsi for schemaLocation on MzIdentML is not defined, line 2, column 194. I will fix with
Traceback (most recent call last):
Can you share the mzIdentML file? C
Yeah, the spectrumID attributes for SpectrumIdentificationResult elements have values like "211026EWas03_F2.34324.34324.4". Changing the SpectrumIDFormats to MS:1001530 may make it work. There may still be an open question about whether spectra in mzML files can be referenced using just the index. My feeling is we've been into this before and perhaps they can't. There might be some debate about this.
We could try just switching the SpectrumIDFormats in lines 507902 to 507932 of the mzIdentML file.
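The switch suggested above could also be scripted rather than edited by hand. This is only a minimal sketch using the Python standard library: the helper names `local_name` and `set_spectrum_id_format` are made up for illustration, and it assumes the usual mzIdentML layout in which each SpectrumIDFormat element holds a single cvParam child. It is not the converter's own code.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix so the code works for any mzIdentML version."""
    return tag.rsplit("}", 1)[-1]


def set_spectrum_id_format(root, accession="MS:1001530",
                           name="mzML unique identifier"):
    """Point every SpectrumIDFormat cvParam at the given CV term.

    Hypothetical helper: returns the number of cvParam elements changed.
    """
    changed = 0
    for elem in root.iter():
        if local_name(elem.tag) != "SpectrumIDFormat":
            continue
        for child in elem:
            if local_name(child.tag) == "cvParam":
                child.set("accession", accession)
                child.set("name", name)
                changed += 1
    return changed
```

For a real file, `root` would come from `ET.parse(path).getroot()` and the tree would be written back afterwards. Changing the declared format does not change the spectrumID values themselves, so they still have to match IDs that exist in the mzML file.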
Re. p8 of the 1.2.0 schema, I think "NativeID" refers to IDs in proprietary file formats, e.g. things from Thermo/Waters/Bruker, so that's in part why MS:1000774 isn't applicable to mzML. I think if we go into the pyteomics library we might find it doesn't allow mzML spectra to be retrieved by index only. It's something we can look into (and potentially ask the pyteomics devs about) if it becomes an issue.
Thanks @colin-combe for your comments. I changed the SpectrumIDFormats to MS:1001530 as follows:
Still getting errors:
I think the converter is correct in saying the file referred to has no spectrum with ID "211026EWas01_E1.00061.00061.2". The file referred to was 211026EWas01_E1.mzML. To find that, you need to search for "211026EWas01_E1.00061.00061.2" in the mzId, see the associated id of the spectra data, and look that id up in the Inputs element, though the beginning of the ID they used gives a strong clue it will be that file.
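The lookup described above (spectrumID, then its spectra data id, then the matching entry under Inputs) can be sketched in a few lines. This is a rough illustration with the standard library only; `spectra_files_for` is a hypothetical helper, and it assumes the standard mzIdentML attribute names `spectrumID`, `spectraData_ref`, `id` and `location`.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def spectra_files_for(root, spectrum_id):
    """Return the SpectraData locations referenced by results with this spectrumID."""
    # Map each SpectraData id (declared under Inputs) to its file location.
    locations = {
        e.get("id"): e.get("location")
        for e in root.iter()
        if local_name(e.tag) == "SpectraData"
    }
    found = set()
    for e in root.iter():
        if (local_name(e.tag) == "SpectrumIdentificationResult"
                and e.get("spectrumID") == spectrum_id):
            ref = e.get("spectraData_ref")
            if ref in locations:
                found.add(locations[ref])
    return sorted(found)
```

An empty list means either the spectrumID does not occur in the file or its spectraData_ref does not resolve to any SpectraData element.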
@ypriverol Could you please report this issue to the Kojak dataset provider? Thanks!
You can refer them to this GH issue, then any further discussion needed can take place here.
I will try.
My apologies, our software here automatically interprets TPP/mzML spectrum nomenclature and ProteoWizard/mzML spectrum nomenclature (e.g., 211026EWas01_E1.00061.00061.2 == controllerType=0 controllerNumber=1 scan=61). I seem to have taken that for granted and I will fix the mzID files to use ProteoWizard/mzML spectrum nomenclature throughout.
New mzID files have been uploaded.
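The equivalence given in that example can be written as a small conversion. This is a sketch only, not the provider's actual code: `tpp_to_native_id` is a hypothetical name, and it assumes the TPP-style ID has the form basename.startScan.endScan.charge with the same start and end scan, and that the target is the `controllerType=0 controllerNumber=1 scan=N` form shown in the comment.

```python
def tpp_to_native_id(tpp_id):
    """Convert e.g. '211026EWas01_E1.00061.00061.2' to a scan-based nativeID string."""
    # Split from the right so dots inside the file's base name are preserved.
    base, start, end, charge = tpp_id.rsplit(".", 3)
    if start != end:
        # Assumption: an ID spanning several scans has no single-scan equivalent.
        raise ValueError(f"start and end scan differ in {tpp_id!r}")
    # int() drops the zero padding: '00061' becomes 61.
    return f"controllerType=0 controllerNumber=1 scan={int(start)}"


print(tpp_to_native_id("211026EWas01_E1.00061.00061.2"))
# controllerType=0 controllerNumber=1 scan=61
```

The charge and base name are discarded here because the nativeID only identifies the spectrum within its file; the file itself is identified through the spectra data reference.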
@colin-combe I tested the newly uploaded mzID file (which I copied to the same FTP location for you). It gives long error messages like this, which are not helpful for debugging.
Yes, that's obviously not helpful for debugging, and this is the sort of thing that needs to be improved as we move towards a more usable mzIdentML validator. Though that wasn't the full output from it, was it? (Maybe it was.) Anyway, I'll look into it and get back to you. Whatever the error is, I'll try to update the code to make the output more meaningful in the case of that error.
The same sort of output keeps getting repeated, and it will just fill up the buffer with these kinds of JSON objects. Thanks!
It could also be a bug in the converter and nothing to do with validation.
I think the file is invalid at the schema level due to duplicate ids. The same ids for SpectrumIdentificationResults and SpectrumIdentificationItems recur in the different SpectrumIdentificationLists. The scope within which these ids are meant to be unique is perhaps open to interpretation from the text in the specification document. But I've attached the start of the output from (it's also why the converter's output was meaningless, it kind of assumes the input is schema valid.)
Gotcha, sorry about that. It makes sense that each SpectrumIdentificationResult.id and SpectrumIdentificationItem.id should have unique values external to their SpectrumIdentificationLists, especially if this all goes into an SQL database that requires those tables to have a unique key based on id alone. New mzID files have been uploaded (*fixedB.mzid).
Yes, the sqlalchemy errors Suresh was seeing were caused by duplicate primary keys.
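A quick pre-check for the duplicate ids discussed above could look like the following. This is a minimal sketch with the standard library, not part of the converter or validator: `duplicate_ids` is a hypothetical helper, and it assumes ids should be unique per element type across the whole document, which is the stricter reading agreed on in this thread.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def duplicate_ids(root, names=("SpectrumIdentificationResult",
                               "SpectrumIdentificationItem")):
    """Return sorted (element name, id) pairs that occur more than once."""
    counts = Counter(
        (local_name(e.tag), e.get("id"))
        for e in root.iter()
        if local_name(e.tag) in names
    )
    return sorted(key for key, n in counts.items() if n > 1)
```

Running this before conversion gives a short list of offending ids, which is easier to read than a database error raised on the first repeated primary key.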
Good news! Dataset is parsed successfully. Thank you to everyone for working to make this happen.
Validating Crosslinking data: