Kojak Dataset Validation #81

Open
sureshhewabi opened this issue Sep 19, 2024 · 21 comments
@sureshhewabi
Collaborator

Validating Crosslinking data:

  • 211026EWas01_E1.mzML
  • 211026EWas02_F1.mzML
  • 211026EWas03_F2.mzML
  • interact-1_2.ipro.mzid
@sureshhewabi sureshhewabi self-assigned this Sep 19, 2024
@sureshhewabi
Collaborator Author

sureshhewabi commented Sep 19, 2024

lxml.etree.XMLSyntaxError: Namespace prefix xsi for schemaLocation on MzIdentML is not defined, line 2, column 194
2024-09-19 11:25:40 - __main__ - ERROR - Namespace prefix xsi for schemaLocation on MzIdentML is not defined, line 2, column 194 (interact-1_2.ipro.mzid, line 2)


I will fix this by adding xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" to the MzIdentML root element.
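
A minimal sketch of that fix, assuming the only problem is the missing declaration on the root tag. The helper name `add_xsi_namespace` and the shortened sample root element (including its `xmlns` value) are illustrative and are not taken from the real file:

```python
import re
import xml.etree.ElementTree as ET

XSI_DECL = 'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'


def add_xsi_namespace(xml_text: str) -> str:
    """Insert the xsi declaration into the MzIdentML start tag if it is absent."""
    match = re.search(r"<MzIdentML\b[^>]*>", xml_text)
    if match is None or "xmlns:xsi=" in match.group(0):
        return xml_text  # no root tag found, or already declared
    insert_at = match.start() + len("<MzIdentML")
    return xml_text[:insert_at] + " " + XSI_DECL + xml_text[insert_at:]


# Hypothetical, heavily shortened root element that reproduces the problem.
broken = (
    '<MzIdentML id="demo" version="1.2.0" '
    'xmlns="http://psidev.info/psi/pi/mzIdentML/1.2" '
    'xsi:schemaLocation="http://psidev.info/psi/pi/mzIdentML/1.2 mzIdentML1.2.0.xsd">'
    "</MzIdentML>"
)

fixed = add_xsi_namespace(broken)
ET.fromstring(fixed)  # parses once the prefix is declared
```

Patching the text rather than the parsed tree is deliberate here: a namespace-aware parser rejects the file before any tree exists.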

@sureshhewabi
Collaborator Author

sureshhewabi commented Sep 19, 2024

Traceback (most recent call last):
  File "xi-mzidentml-converter/parser/process_dataset.py", line 241, in convert_dir
    id_parser.parse()
  File "xi-mzidentml-converter/parser/MzIdParser.py", line 94, in parse
    self.main_loop()
  File "xi-mzidentml-converter/parser/MzIdParser.py", line 665, in main_loop
    spectrum = peak_list_reader[sid_result["spectrumID"]]
  File "xi-mzidentml-converter/parser/peaklistReader/PeakListWrapper.py", line 71, in __getitem__
    return self.reader[spec_id]
  File "xi-mzidentml-converter/parser/peaklistReader/PeakListWrapper.py", line 274, in __getitem__
    raise SpectrumIdFormatError(
parser.peaklistReader.PeakListWrapper.SpectrumIdFormatError: MS:1000774 not supported for mzML!

@ypriverol

@colin-combe

colin-combe commented Sep 19, 2024

Can you share the mzIdentML file?
I think MS:1001530 is the only supported SpectrumIDFormat for mzML, but it could be open to interpretation; p.8 of the 1.2.0 schema is the relevant part.
Looking at the mzIdentML file and seeing what the values of the spectrum IDs are would help.

C

@colin-combe

colin-combe commented Sep 19, 2024

yeah, the spectrumID attributes for SpectrumIdentificationResult elements have values like "211026EWas03_F2.34324.34324.4",
so that doesn't meet the requirements for MS:1000774, which needs to be of format "index=xsd:nonNegativeInteger" (p.8 of 1.2.0 schema).

Changing the SpectrumIDFormats to MS:1001530 may make it work.

There may still be an open question about whether spectra in mzML files can be referenced using just the index. My feeling is that we've looked into this before and perhaps they can't, though there might be some debate about this.
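
The format check described above can be sketched as follows. This only encodes the "index=xsd:nonNegativeInteger" pattern quoted from the schema; the function name is hypothetical and this is not the converter's own validation code:

```python
import re

# MS:1000774 requires spectrum IDs of the form "index=<non-negative integer>".
INDEX_ID = re.compile(r"index=[0-9]+")


def matches_ms1000774(spectrum_id: str) -> bool:
    """Return True if the ID has the shape that MS:1000774 requires."""
    return INDEX_ID.fullmatch(spectrum_id) is not None


print(matches_ms1000774("211026EWas03_F2.34324.34324.4"))  # False
print(matches_ms1000774("index=34323"))                    # True
```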

@colin-combe

colin-combe commented Sep 19, 2024

we could try just switching the SpectrumIDFormats in lines 507902 to 507932 of the mzIdentML file
(i.e. swapping them to MS:1001530)
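
Rather than editing those lines by hand, the swap could be scripted. The following is a sketch only: the helper names are hypothetical, elements are matched by local name so the mzIdentML namespace does not need to be spelled out, and the sample input is a cut-down, illustrative Inputs fragment:

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def switch_spectrum_id_format(root, accession="MS:1001530",
                              name="mzML unique identifier") -> int:
    """Rewrite every cvParam directly under a SpectrumIDFormat element."""
    changed = 0
    for elem in root.iter():
        if local_name(elem.tag) != "SpectrumIDFormat":
            continue
        for cv_param in elem:
            if local_name(cv_param.tag) == "cvParam":
                cv_param.set("accession", accession)
                cv_param.set("name", name)
                changed += 1
    return changed


# Illustrative fragment; the real file has three SpectraData elements.
sample = """<Inputs>
 <SpectraData id="sd_0" location="211026EWas01_E1.mzML">
  <SpectrumIDFormat>
   <cvParam accession="MS:1000774" cvRef="PSI-MS" name="multiple peak list nativeID format"/>
  </SpectrumIDFormat>
 </SpectraData>
</Inputs>"""

root = ET.fromstring(sample)
changed = switch_spectrum_id_format(root)
print(changed, ET.tostring(root, encoding="unicode"))
```

Note that this only changes the declared format; it does not make the spectrumID values themselves match IDs in the mzML files.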

@colin-combe

re. p.8 of the 1.2.0 schema, I think "NativeID" refers to IDs in proprietary file formats, e.g. things from Thermo/Waters/Bruker, so that's partly why MS:1000774 isn't applicable to mzML.

I think if we go into the pyteomics library we might find it doesn't allow mzML spectra to be retrieved by index only, it's something we can look into (and potentially ask pyteomics devs about) if it becomes an issue.

@sureshhewabi
Collaborator Author

Thanks @colin-combe for your comments. I changed the SpectrumIDFormats to MS:1001530 as follows:

<Inputs>
 <SearchDatabase id="sdb_0" location="/proteomics/dshteynb/data/ABRF/StudyPackage/ABRF_iPRG_XL_2023_DECOY.fasta" name="ABRF_iPRG_XL_2023_DECOY.fasta">
  <FileFormat>
   <cvParam accession="MS:1001348" cvRef="PSI-MS" name="FASTA format"/>
  </FileFormat>
  <DatabaseName>
   <userParam name="ABRF_iPRG_XL_2023_DECOY.fasta"/>
  </DatabaseName>
 </SearchDatabase>
 <SpectraData id="sd_0" location="/proteomics/dshteynb/data/ABRF/StudyPackage/Study_Data_Phase_1/211026EWas01_E1.mzML" name="211026EWas01_E1">
  <FileFormat>
   <cvParam accession="MS:1000584" cvRef="PSI-MS" name="mzML format"/>
  </FileFormat>
  <SpectrumIDFormat>
   <cvParam accession="MS:1001530" cvRef="PSI-MS" name="mzML unique identifier"/>
  </SpectrumIDFormat>
 </SpectraData>
 <SpectraData id="sd_1" location="/proteomics/dshteynb/data/ABRF/StudyPackage/Study_Data_Phase_1/211026EWas02_F1.mzML" name="211026EWas02_F1">
  <FileFormat>
   <cvParam accession="MS:1000584" cvRef="PSI-MS" name="mzML format"/>
  </FileFormat>
  <SpectrumIDFormat>
   <cvParam accession="MS:1001530" cvRef="PSI-MS" name="mzML unique identifier"/>
  </SpectrumIDFormat>
 </SpectraData>
 <SpectraData id="sd_2" location="/proteomics/dshteynb/data/ABRF/StudyPackage/Study_Data_Phase_1/211026EWas03_F2.mzML" name="211026EWas03_F2">
  <FileFormat>
   <cvParam accession="MS:1000584" cvRef="PSI-MS" name="mzML format"/>
  </FileFormat>
  <SpectrumIDFormat>
   <cvParam accession="MS:1001530" cvRef="PSI-MS" name="mzML unique identifier"/>
  </SpectrumIDFormat>
 </SpectraData>
</Inputs>

Still getting errors:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "xi-mzidentml-converter/parser/process_dataset.py", line 241, in convert_dir
    id_parser.parse()
  File "xi-mzidentml-converter/parser/MzIdParser.py", line 94, in parse
    self.main_loop()
  File "xi-mzidentml-converter/parser/MzIdParser.py", line 665, in main_loop
    spectrum = peak_list_reader[sid_result["spectrumID"]]
  File "xi-mzidentml-converter/parser/peaklistReader/PeakListWrapper.py", line 71, in __getitem__
    return self.reader[spec_id]
  File "xi-mzidentml-converter/parser/peaklistReader/PeakListWrapper.py", line 251, in __getitem__
    spec = self._reader.get_by_id(spec_id)
  File ".local/share/virtualenvs/xi-mzidentml-converter-MB5dCI2Q/lib/python3.10/site-packages/pyteomics/auxiliary/file_helpers.py", line 84, in wrapped
    return func(self, *args, **kwargs)
  File ".local/share/virtualenvs/xi-mzidentml-converter-MB5dCI2Q/lib/python3.10/site-packages/pyteomics/xml.py", line 1152, in get_by_id
    elem = self._find_by_id_reset(elem_id, id_key=id_key)
  File ".local/share/virtualenvs/xi-mzidentml-converter-MB5dCI2Q/lib/python3.10/site-packages/pyteomics/auxiliary/file_helpers.py", line 84, in wrapped
    return func(self, *args, **kwargs)
  File ".local/share/virtualenvs/xi-mzidentml-converter-MB5dCI2Q/lib/python3.10/site-packages/pyteomics/xml.py", line 1119, in _find_by_id_reset
    return self._find_by_id_no_reset(elem_id, id_key=id_key)
  File ".local/share/virtualenvs/xi-mzidentml-converter-MB5dCI2Q/lib/python3.10/site-packages/pyteomics/xml.py", line 660, in _find_by_id_no_reset
    raise KeyError(elem_id)
KeyError: '211026EWas01_E1.00061.00061.2'
2024-09-19 13:25:49 - __main__ - ERROR - '211026EWas01_E1.00061.00061.2'

@colin-combe

i think the converter is correct in saying the file referred to has no spectrum with ID "211026EWas01_E1.00061.00061.2"

(the file referred to was 211026EWas01_E1.mzML, to find that you need to search for "211026EWas01_E1.00061.00061.2" in the mzId and then see the associated id of the spectra data and look that up in the Inputs element, though the beginning of the ID they used gives a strong clue it will be that file.)
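
The lookup described above can be sketched in a few lines. The function names and the element ids in the sample (`sil_0`, `sir_0`) are hypothetical, and the locations are shortened; the sketch assumes each SpectrumIdentificationResult carries a `spectraData_ref` pointing at a SpectraData `id`:

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def spectra_file_for(root, spectrum_id: str):
    """Return the SpectraData location that a given spectrumID refers to."""
    locations = {
        elem.get("id"): elem.get("location")
        for elem in root.iter()
        if local_name(elem.tag) == "SpectraData"
    }
    for elem in root.iter():
        if (local_name(elem.tag) == "SpectrumIdentificationResult"
                and elem.get("spectrumID") == spectrum_id):
            return locations.get(elem.get("spectraData_ref"))
    return None  # spectrumID not present in this document


# Cut-down, illustrative document structure.
sample = """<MzIdentML>
 <DataCollection>
  <Inputs>
   <SpectraData id="sd_0" location="211026EWas01_E1.mzML"/>
   <SpectraData id="sd_1" location="211026EWas02_F1.mzML"/>
  </Inputs>
  <AnalysisData>
   <SpectrumIdentificationList id="sil_0">
    <SpectrumIdentificationResult id="sir_0" spectraData_ref="sd_0"
        spectrumID="211026EWas01_E1.00061.00061.2"/>
   </SpectrumIdentificationList>
  </AnalysisData>
 </DataCollection>
</MzIdentML>"""

root = ET.fromstring(sample)
print(spectra_file_for(root, "211026EWas01_E1.00061.00061.2"))
```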

@sureshhewabi
Collaborator Author

@ypriverol Could you please report this issue to the Kojak dataset provider? thanks!

@colin-combe

> @ypriverol Could you please report this issue to the Kojak dataset provider? thanks!

You can refer them to this GitHub issue, then any further discussion needed can take place here.

@ypriverol

I will try.

@mhoopmann

My apologies, our software here automatically interprets TPP/mzML spectrum nomenclature and ProteoWizard/mzML spectrum nomenclature (e.g., 211026EWas01_E1.00061.00061.2 == controllerType=0 controllerNumber=1 scan=61). I seem to have taken that for granted and I will fix the mzID files to use ProteoWizard/mzML spectrum nomenclature throughout.
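
For illustration, the mapping in that example can be sketched as below. This is not the Kojak/TPP implementation; the function name is hypothetical, and the fixed `controllerType=0 controllerNumber=1` prefix is an assumption taken from the single example above:

```python
def tpp_to_native_id(tpp_id: str) -> str:
    """Convert 'basename.startScan.endScan.charge' to a scan-based native ID.

    Assumes one scan per spectrum and the fixed controller values from the
    example in this thread.
    """
    # Split from the right so dots in the base name are left untouched.
    base_name, start_scan, end_scan, charge = tpp_id.rsplit(".", 3)
    if start_scan != end_scan:
        raise ValueError(f"expected a single scan, got {start_scan}-{end_scan}")
    return f"controllerType=0 controllerNumber=1 scan={int(start_scan)}"


print(tpp_to_native_id("211026EWas01_E1.00061.00061.2"))
# controllerType=0 controllerNumber=1 scan=61
```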

@mhoopmann

New mzID files have been uploaded.

@sureshhewabi
Collaborator Author

@colin-combe I tested the newly uploaded mzID file (which I copied to the same FTP location for you). It gives long error messages like this, which are not helpful for debugging:

[parameters: {'id_m0': 'sii_34501_1', 'upload_id_m0': 9, 'spectrum_id_m0': 'controllerType=0 controllerNumber=1 scan=50434', 'spectra_data_id_m0': 0, 'multiple_spectra_identification_id_m0': None, 'multiple_spectra_identification_pc_m0': None, 'pep1_id_m0': 2982, 'pep2_id_m0': None, 'charge_state_m0': 2, 'pass_threshold_m0': True, 'rank_m0': 1, 'scores_m0': '{}', 'exp_mz_m0': 622.773865, 'calc_mz_m0': None, 'sip_id_m0': 0, 'id_m1': 'sii_34502_1', 'upload_id_m1': 9, 'spectrum_id_m1': 'controllerType=0 controllerNumber=1 scan=50436', 'spectra_data_id_m1': 0, 'multiple_spectra_identification_id_m1': None, 'multiple_spectra_identification_pc_m1': None, 'pep1_id_m1': 9502, 'pep2_id_m1': None, 'charge_state_m1': 2, 'pass_threshold_m1': True, 'rank_m1': 1, 'scores_m1': '{}', 'exp_mz_m1': 750.88916, 'calc_mz_m1': None, 'sip_id_m1': 0, 'id_m2': 'sii_34503_1', 'upload_id_m2': 9, 'spectrum_id_m2': 'controllerType=0 controllerNumber=1 scan=50438', 'spectra_data_id_m2': 0, 'multiple_spectra_identification_id_m2': None, 'multiple_spectra_identification_pc_m2': None, 'pep1_id_m2': 4088, 'pep2_id_m2': None, 'charge_state_m2': 2, 'pass_threshold_m2': True, 'rank_m2': 1, 'scores_m2': '{}', 'exp_mz_m2': 614.776733, 'calc_mz_m2': None, 'sip_id_m2': 0, 'id_m3': 'sii_34504_1', 'upload_id_m3': 9, 'spectrum_id_m3': 'controllerType=0 controllerNumber=1 scan=50440', 'spectra_data_id_m3': 0, 'multiple_spectra_identification_id_m3': None ... 697400 parameters truncated ... 
'rank_m46496': 1, 'scores_m46496': '{}', 'exp_mz_m46496': 572.768249, 'calc_mz_m46496': None, 'sip_id_m46496': 2, 'id_m46497': 'sii_9083_1', 'upload_id_m46497': 9, 'spectrum_id_m46497': 'controllerType=0 controllerNumber=1 scan=13393', 'spectra_data_id_m46497': 2, 'multiple_spectra_identification_id_m46497': None, 'multiple_spectra_identification_pc_m46497': None, 'pep1_id_m46497': 4394, 'pep2_id_m46497': None, 'charge_state_m46497': 2, 'pass_threshold_m46497': True, 'rank_m46497': 1, 'scores_m46497': '{}', 'exp_mz_m46497': 613.256044, 'calc_mz_m46497': None, 'sip_id_m46497': 2, 'id_m46498': 'sii_9084_1', 'upload_id_m46498': 9, 'spectrum_id_m46498': 'controllerType=0 controllerNumber=1 scan=13394', 'spectra_data_id_m46498': 2, 'multiple_spectra_identification_id_m46498': None, 'multiple_spectra_identification_pc_m46498': None, 'pep1_id_m46498': 5082, 'pep2_id_m46498': None, 'charge_state_m46498': 2, 'pass_threshold_m46498': True, 'rank_m46498': 1, 'scores_m46498': '{}', 'exp_mz_m46498': 711.303225, 'calc_mz_m46498': None, 'sip_id_m46498': 2, 'id_m46499': 'sii_9085_1', 'upload_id_m46499': 9, 'spectrum_id_m46499': 'controllerType=0 controllerNumber=1 scan=13395', 'spectra_data_id_m46499': 2, 'multiple_spectra_identification_id_m46499': None, 'multiple_spectra_identification_pc_m46499': None, 'pep1_id_m46499': 6672, 'pep2_id_m46499': None, 'charge_state_m46499': 2, 'pass_threshold_m46499': True, 'rank_m46499': 1, 'scores_m46499': '{}', 'exp_mz_m46499': 548.814697, 'calc_mz_m46499': None, 'sip_id_m46499': 2}]
(Background on this error at: https://sqlalche.me/e/20/gkpj)

@colin-combe

Yes, that's obviously not helpful for debugging, and this is the sort of thing that needs to be improved as we move towards a more usable mzIdentML validator.

Though that wasn't the full output from it, was it? (maybe it was)

Anyway, I'll look into it and get back to you. Whatever the error is, I'll try to update the code to make the output more meaningful in the case of that error.

@sureshhewabi
Collaborator Author

The same sort of output keeps getting repeated, and it will just fill up the buffer with these kinds of JSON objects.

Thanks!

@colin-combe

could also be a bug in the converter and nothing to do with validation

@colin-combe

I think the file is invalid at the schema level due to duplicate ids. The same ids for SpectrumIdentificationResults and SpectrumIdentificationItems recur in the different SpectrumIdentificationLists.

The scope within which these ids are meant to be unique is perhaps open to interpretation from the text in the specification document. But I've attached the start of the output from xmllint --noout --schema mzIdentML1.2.0.xsd interact-1_2-fixed.mzid

outputfile.txt

(It's also why the converter's output was meaningless; it kind of assumes the input is schema valid.)
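
Before running the full schema validation, a quick pre-check for this specific problem is possible. A minimal sketch, assuming ids must be unique across the whole document; the function names and the sample ids are hypothetical, and this is not a substitute for xmllint:

```python
from collections import Counter
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def duplicate_ids(root, element_names=("SpectrumIdentificationResult",
                                       "SpectrumIdentificationItem")):
    """Return the sorted ids that occur more than once on the named elements."""
    counts = Counter(
        elem.get("id")
        for elem in root.iter()
        if local_name(elem.tag) in element_names
    )
    return sorted(id_ for id_, n in counts.items() if n > 1)


# Illustrative fragment: the same ids recur in two SpectrumIdentificationLists.
sample = """<AnalysisData>
 <SpectrumIdentificationList id="sil_0">
  <SpectrumIdentificationResult id="sir_1">
   <SpectrumIdentificationItem id="sii_1_1"/>
  </SpectrumIdentificationResult>
 </SpectrumIdentificationList>
 <SpectrumIdentificationList id="sil_1">
  <SpectrumIdentificationResult id="sir_1">
   <SpectrumIdentificationItem id="sii_1_1"/>
  </SpectrumIdentificationResult>
 </SpectrumIdentificationList>
</AnalysisData>"""

print(duplicate_ids(ET.fromstring(sample)))  # ['sii_1_1', 'sir_1']
```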

@mhoopmann

Gotcha, sorry about that. Makes sense that each SpectrumIdentificationResult.id and SpectrumIdentificationItem.id should have unique values external to their SpectrumIdentificationLists, especially if this all goes into an SQL database that requires those tables to have a unique key based on id alone.

New mzID files have been uploaded (*fixedB.mzid).

@colin-combe

colin-combe commented Sep 21, 2024

> especially if this all goes into an SQL database that requires those tables to have a unique key based on id alone.

Yes, the sqlalchemy errors Suresh was seeing were caused by duplicate primary keys.
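
The failure mode can be reproduced in miniature with the standard-library `sqlite3` module. This is only an illustration: the `spectrum_match` table and its columns are hypothetical, not the converter's real schema, and the second row's scan number is made up to show two different matches sharing one id:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table keyed on id alone.
conn.execute("CREATE TABLE spectrum_match (id TEXT PRIMARY KEY, spectrum_id TEXT)")


def insert_matches(connection, rows):
    """Bulk-insert rows; return the error text if a key constraint is violated."""
    try:
        with connection:  # commits on success, rolls back on error
            connection.executemany(
                "INSERT INTO spectrum_match (id, spectrum_id) VALUES (?, ?)", rows
            )
    except sqlite3.IntegrityError as exc:
        return str(exc)
    return None


rows = [
    ("sii_34501_1", "controllerType=0 controllerNumber=1 scan=50434"),
    ("sii_34501_1", "controllerType=0 controllerNumber=1 scan=61"),  # same id again
]
print(insert_matches(conn, rows))
```

Because the whole batch is one transaction, a single duplicate id rejects every row in it, which is consistent with one bad id producing an error that lists all of the parameters.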

@sureshhewabi
Collaborator Author

Good news! Dataset is parsed successfully. Thank you to everyone for working to make this happen.
