Kojak Dataset Validation #81

sureshhewabi · 2024-09-19T10:28:06Z

Validating Crosslinking data:

211026EWas01_E1.mzML
211026EWas02_F1.mzML
211026EWas03_F2.mzML
interact-1_2.ipro.mzid

sureshhewabi · 2024-09-19T10:28:17Z

lxml.etree.XMLSyntaxError: Namespace prefix xsi for schemaLocation on MzIdentML is not defined, line 2, column 194
2024-09-19 11:25:40 - __main__ - ERROR - Namespace prefix xsi for schemaLocation on MzIdentML is not defined, line 2, column 194 (interact-1_2.ipro.mzid, line 2)

I will fix this by adding xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" to the MzIdentML root element.
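A minimal sketch of that fix, using only the Python standard library rather than lxml. The helper name `add_xsi_declaration` and the shortened sample root tag are hypothetical; the real file's root element carries more attributes.

```python
import re
import xml.etree.ElementTree as ET

XSI_DECL = 'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'

def add_xsi_declaration(xml_text: str) -> str:
    """Insert the xsi namespace declaration into the MzIdentML root tag if absent."""
    if "xmlns:xsi=" in xml_text:
        return xml_text
    # Only the first <MzIdentML ...> start tag is touched.
    return re.sub(r"<MzIdentML\b", "<MzIdentML " + XSI_DECL, xml_text, count=1)

# Shortened, illustrative root element that uses xsi: without declaring it.
broken = (
    '<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.2" '
    'xsi:schemaLocation="http://psidev.info/psi/pi/mzIdentML/1.2 mzIdentML1.2.0.xsd"/>'
)

try:
    ET.fromstring(broken)
    parsed_before = True
except ET.ParseError:
    # The standard-library parser reports the undeclared prefix as a parse error.
    parsed_before = False

fixed = add_xsi_declaration(broken)
root = ET.fromstring(fixed)  # parses once the prefix is declared
```

For a file of this size, editing the single root tag by hand is just as reasonable; the sketch only shows why the one declaration is enough.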

sureshhewabi · 2024-09-19T10:36:31Z

Traceback (most recent call last):
  File "xi-mzidentml-converter/parser/process_dataset.py", line 241, in convert_dir
    id_parser.parse()
  File "xi-mzidentml-converter/parser/MzIdParser.py", line 94, in parse
    self.main_loop()
  File "xi-mzidentml-converter/parser/MzIdParser.py", line 665, in main_loop
    spectrum = peak_list_reader[sid_result["spectrumID"]]
  File "xi-mzidentml-converter/parser/peaklistReader/PeakListWrapper.py", line 71, in __getitem__
    return self.reader[spec_id]
  File "xi-mzidentml-converter/parser/peaklistReader/PeakListWrapper.py", line 274, in __getitem__
    raise SpectrumIdFormatError(
parser.peaklistReader.PeakListWrapper.SpectrumIdFormatError: MS:1000774 not supported for mzML!

@ypriverol

colin-combe · 2024-09-19T10:55:46Z

can you share the mzIdentML file?
I think MS:1001530 is the only supported SpectrumIdFormat for mzML, but it could be open to interpretation; p.8 of the 1.2.0 schema is the relevant part.
Looking at the mzIdentML file and seeing what the values of the spectrum IDs are would help.

C

colin-combe · 2024-09-19T11:19:38Z

yeah, the spectrumID attributes for SpectrumIdentificationResult elements have values like "211026EWas03_F2.34324.34324.4",
so that doesn't meet the requirements for MS:1000774, which needs to be of format "index=xsd:nonNegativeInteger" (p.8 of 1.2.0 schema).
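As a rough illustration of that format requirement, a check like the following separates the two kinds of ID. The function name is hypothetical, and the regular expression is only an approximation of "index=xsd:nonNegativeInteger" (it does not accept an optional leading plus sign, for example).

```python
import re

# Approximation of the MS:1000774 pattern "index=xsd:nonNegativeInteger".
INDEX_FORMAT = re.compile(r"index=[0-9]+")

def matches_ms1000774(spectrum_id: str) -> bool:
    """Return True if the whole spectrumID looks like 'index=<non-negative integer>'."""
    return INDEX_FORMAT.fullmatch(spectrum_id) is not None

print(matches_ms1000774("211026EWas03_F2.34324.34324.4"))
print(matches_ms1000774("index=34324"))
```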

Changing the SpectrumIDFormats to MS:1001530 may make it work.

There may still be an open question about whether spectra in mzML files can be referenced using just the index. My feeling is that we've been into this before and perhaps they can't. There might be some debate about this.

colin-combe · 2024-09-19T11:21:43Z

we could try just switching the SpectrumIDFormats in lines 507902 to 507932 of the mzIdentML file
(i.e. swapping them to MS:1001530)

colin-combe · 2024-09-19T11:32:55Z

re. p.8 of the 1.2.0 schema, I think "NativeID" refers to IDs in proprietary file formats, e.g. things from Thermo/Waters/Bruker, so that's in part why MS:1000774 isn't applicable to mzML.

I think if we go into the pyteomics library we might find it doesn't allow mzML spectra to be retrieved by index only. It's something we can look into (and potentially ask the pyteomics devs about) if it becomes an issue.

sureshhewabi · 2024-09-19T12:30:39Z

Thanks @colin-combe for your comments. I changed the SpectrumIDFormats to MS:1001530 as follows:

<Inputs>
 <SearchDatabase id="sdb_0" location="/proteomics/dshteynb/data/ABRF/StudyPackage/ABRF_iPRG_XL_2023_DECOY.fasta" name="ABRF_iPRG_XL_2023_DECOY.fasta">
  <FileFormat>
   <cvParam accession="MS:1001348" cvRef="PSI-MS" name="FASTA format"/>
  </FileFormat>
  <DatabaseName>
   <userParam name="ABRF_iPRG_XL_2023_DECOY.fasta"/>
  </DatabaseName>
 </SearchDatabase>
 <SpectraData id="sd_0" location="/proteomics/dshteynb/data/ABRF/StudyPackage/Study_Data_Phase_1/211026EWas01_E1.mzML" name="211026EWas01_E1">
  <FileFormat>
   <cvParam accession="MS:1000584" cvRef="PSI-MS" name="mzML format"/>
  </FileFormat>
  <SpectrumIDFormat>
   <cvParam accession="MS:1001530" cvRef="PSI-MS" name="mzML unique identifier"/>
  </SpectrumIDFormat>
 </SpectraData>
 <SpectraData id="sd_1" location="/proteomics/dshteynb/data/ABRF/StudyPackage/Study_Data_Phase_1/211026EWas02_F1.mzML" name="211026EWas02_F1">
  <FileFormat>
   <cvParam accession="MS:1000584" cvRef="PSI-MS" name="mzML format"/>
  </FileFormat>
  <SpectrumIDFormat>
   <cvParam accession="MS:1001530" cvRef="PSI-MS" name="mzML unique identifier"/>
  </SpectrumIDFormat>
 </SpectraData>
 <SpectraData id="sd_2" location="/proteomics/dshteynb/data/ABRF/StudyPackage/Study_Data_Phase_1/211026EWas03_F2.mzML" name="211026EWas03_F2">
  <FileFormat>
   <cvParam accession="MS:1000584" cvRef="PSI-MS" name="mzML format"/>
  </FileFormat>
  <SpectrumIDFormat>
   <cvParam accession="MS:1001530" cvRef="PSI-MS" name="mzML unique identifier"/>
  </SpectrumIDFormat>
 </SpectraData>
</Inputs>

Still getting errors:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "xi-mzidentml-converter/parser/process_dataset.py", line 241, in convert_dir
    id_parser.parse()
  File "xi-mzidentml-converter/parser/MzIdParser.py", line 94, in parse
    self.main_loop()
  File "xi-mzidentml-converter/parser/MzIdParser.py", line 665, in main_loop
    spectrum = peak_list_reader[sid_result["spectrumID"]]
  File "xi-mzidentml-converter/parser/peaklistReader/PeakListWrapper.py", line 71, in __getitem__
    return self.reader[spec_id]
  File "xi-mzidentml-converter/parser/peaklistReader/PeakListWrapper.py", line 251, in __getitem__
    spec = self._reader.get_by_id(spec_id)
  File ".local/share/virtualenvs/xi-mzidentml-converter-MB5dCI2Q/lib/python3.10/site-packages/pyteomics/auxiliary/file_helpers.py", line 84, in wrapped
    return func(self, *args, **kwargs)
  File ".local/share/virtualenvs/xi-mzidentml-converter-MB5dCI2Q/lib/python3.10/site-packages/pyteomics/xml.py", line 1152, in get_by_id
    elem = self._find_by_id_reset(elem_id, id_key=id_key)
  File ".local/share/virtualenvs/xi-mzidentml-converter-MB5dCI2Q/lib/python3.10/site-packages/pyteomics/auxiliary/file_helpers.py", line 84, in wrapped
    return func(self, *args, **kwargs)
  File ".local/share/virtualenvs/xi-mzidentml-converter-MB5dCI2Q/lib/python3.10/site-packages/pyteomics/xml.py", line 1119, in _find_by_id_reset
    return self._find_by_id_no_reset(elem_id, id_key=id_key)
  File ".local/share/virtualenvs/xi-mzidentml-converter-MB5dCI2Q/lib/python3.10/site-packages/pyteomics/xml.py", line 660, in _find_by_id_no_reset
    raise KeyError(elem_id)
KeyError: '211026EWas01_E1.00061.00061.2'
2024-09-19 13:25:49 - __main__ - ERROR - '211026EWas01_E1.00061.00061.2'

colin-combe · 2024-09-19T14:11:08Z

I think the converter is correct in saying the file referred to has no spectrum with ID "211026EWas01_E1.00061.00061.2".

(The file referred to was 211026EWas01_E1.mzML. To find that, you need to search for "211026EWas01_E1.00061.00061.2" in the mzId, see the associated id of the spectra data, and look that up in the Inputs element, though the beginning of the ID they used gives a strong clue it will be that file.)
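That lookup can be sketched in a few lines of Python. This is only an illustration on a tiny stand-in document: the function names and the sample XML are made up for the example, and real files nest these elements more deeply and declare a namespace.

```python
import xml.etree.ElementTree as ET

def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]

def spectra_file_for(root: ET.Element, spectrum_id: str):
    """Return the SpectraData location referenced by the result with this spectrumID."""
    # Map each SpectraData id (e.g. "sd_0") to its file location.
    locations = {
        el.get("id"): el.get("location")
        for el in root.iter()
        if local_name(el.tag) == "SpectraData"
    }
    for el in root.iter():
        if (local_name(el.tag) == "SpectrumIdentificationResult"
                and el.get("spectrumID") == spectrum_id):
            return locations.get(el.get("spectraData_ref"))
    return None

# Tiny stand-in document, not taken from the real dataset.
SAMPLE = """
<MzIdentML>
  <Inputs>
    <SpectraData id="sd_0" location="211026EWas01_E1.mzML"/>
    <SpectraData id="sd_1" location="211026EWas02_F1.mzML"/>
  </Inputs>
  <SpectrumIdentificationResult id="sir_1" spectraData_ref="sd_0"
      spectrumID="211026EWas01_E1.00061.00061.2"/>
</MzIdentML>
"""

root = ET.fromstring(SAMPLE)
print(spectra_file_for(root, "211026EWas01_E1.00061.00061.2"))
```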

sureshhewabi · 2024-09-19T15:34:51Z

@ypriverol Could you please report this issue to the Kojak dataset provider? thanks!

colin-combe · 2024-09-19T15:38:50Z

> @ypriverol Could you please report this issue to the Kojak dataset provider? thanks!

You can refer them to this GH issue; any further discussion needed can then take place here.

ypriverol · 2024-09-19T15:42:56Z

I will try.

mhoopmann · 2024-09-19T19:24:33Z

My apologies, our software here automatically interprets TPP/mzML spectrum nomenclature and ProteoWizard/mzML spectrum nomenclature (e.g., 211026EWas01_E1.00061.00061.2 == controllerType=0 controllerNumber=1 scan=61). I seem to have taken that for granted and I will fix the mzID files to use ProteoWizard/mzML spectrum nomenclature throughout.
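The equivalence in that example can be written down as a small conversion. This is a sketch under assumptions, not the Kojak or TPP implementation: it assumes the TPP-style ID has the form base.startScan.endScan.charge, that start and end scan are equal, and that controllerType=0 controllerNumber=1 applies to every spectrum, as in the example above. The function name is hypothetical.

```python
def tpp_to_native_id(tpp_id: str) -> str:
    """Convert 'base.start.end.charge' into a ProteoWizard-style scan reference."""
    # Split from the right so that dots inside the base name are preserved.
    base, start_scan, end_scan, charge = tpp_id.rsplit(".", 3)
    if start_scan != end_scan:
        raise ValueError(f"start and end scan differ in {tpp_id!r}")
    # int() drops the zero padding used in the TPP-style ID.
    # Assumption: controllerType=0 controllerNumber=1 holds for every spectrum.
    return f"controllerType=0 controllerNumber=1 scan={int(start_scan)}"

print(tpp_to_native_id("211026EWas01_E1.00061.00061.2"))
```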

mhoopmann · 2024-09-19T21:29:11Z

New mzID files have been uploaded.

sureshhewabi · 2024-09-20T10:42:44Z

@colin-combe I tested the newly uploaded mzID file (which I copied to you in the same FTP location). It gives long error messages like this, which are not helpful for debugging.

[parameters: {'id_m0': 'sii_34501_1', 'upload_id_m0': 9, 'spectrum_id_m0': 'controllerType=0 controllerNumber=1 scan=50434', 'spectra_data_id_m0': 0, 'multiple_spectra_identification_id_m0': None, 'multiple_spectra_identification_pc_m0': None, 'pep1_id_m0': 2982, 'pep2_id_m0': None, 'charge_state_m0': 2, 'pass_threshold_m0': True, 'rank_m0': 1, 'scores_m0': '{}', 'exp_mz_m0': 622.773865, 'calc_mz_m0': None, 'sip_id_m0': 0, 'id_m1': 'sii_34502_1', 'upload_id_m1': 9, 'spectrum_id_m1': 'controllerType=0 controllerNumber=1 scan=50436', 'spectra_data_id_m1': 0, 'multiple_spectra_identification_id_m1': None, 'multiple_spectra_identification_pc_m1': None, 'pep1_id_m1': 9502, 'pep2_id_m1': None, 'charge_state_m1': 2, 'pass_threshold_m1': True, 'rank_m1': 1, 'scores_m1': '{}', 'exp_mz_m1': 750.88916, 'calc_mz_m1': None, 'sip_id_m1': 0, 'id_m2': 'sii_34503_1', 'upload_id_m2': 9, 'spectrum_id_m2': 'controllerType=0 controllerNumber=1 scan=50438', 'spectra_data_id_m2': 0, 'multiple_spectra_identification_id_m2': None, 'multiple_spectra_identification_pc_m2': None, 'pep1_id_m2': 4088, 'pep2_id_m2': None, 'charge_state_m2': 2, 'pass_threshold_m2': True, 'rank_m2': 1, 'scores_m2': '{}', 'exp_mz_m2': 614.776733, 'calc_mz_m2': None, 'sip_id_m2': 0, 'id_m3': 'sii_34504_1', 'upload_id_m3': 9, 'spectrum_id_m3': 'controllerType=0 controllerNumber=1 scan=50440', 'spectra_data_id_m3': 0, 'multiple_spectra_identification_id_m3': None ... 697400 parameters truncated ... 
'rank_m46496': 1, 'scores_m46496': '{}', 'exp_mz_m46496': 572.768249, 'calc_mz_m46496': None, 'sip_id_m46496': 2, 'id_m46497': 'sii_9083_1', 'upload_id_m46497': 9, 'spectrum_id_m46497': 'controllerType=0 controllerNumber=1 scan=13393', 'spectra_data_id_m46497': 2, 'multiple_spectra_identification_id_m46497': None, 'multiple_spectra_identification_pc_m46497': None, 'pep1_id_m46497': 4394, 'pep2_id_m46497': None, 'charge_state_m46497': 2, 'pass_threshold_m46497': True, 'rank_m46497': 1, 'scores_m46497': '{}', 'exp_mz_m46497': 613.256044, 'calc_mz_m46497': None, 'sip_id_m46497': 2, 'id_m46498': 'sii_9084_1', 'upload_id_m46498': 9, 'spectrum_id_m46498': 'controllerType=0 controllerNumber=1 scan=13394', 'spectra_data_id_m46498': 2, 'multiple_spectra_identification_id_m46498': None, 'multiple_spectra_identification_pc_m46498': None, 'pep1_id_m46498': 5082, 'pep2_id_m46498': None, 'charge_state_m46498': 2, 'pass_threshold_m46498': True, 'rank_m46498': 1, 'scores_m46498': '{}', 'exp_mz_m46498': 711.303225, 'calc_mz_m46498': None, 'sip_id_m46498': 2, 'id_m46499': 'sii_9085_1', 'upload_id_m46499': 9, 'spectrum_id_m46499': 'controllerType=0 controllerNumber=1 scan=13395', 'spectra_data_id_m46499': 2, 'multiple_spectra_identification_id_m46499': None, 'multiple_spectra_identification_pc_m46499': None, 'pep1_id_m46499': 6672, 'pep2_id_m46499': None, 'charge_state_m46499': 2, 'pass_threshold_m46499': True, 'rank_m46499': 1, 'scores_m46499': '{}', 'exp_mz_m46499': 548.814697, 'calc_mz_m46499': None, 'sip_id_m46499': 2}]
(Background on this error at: https://sqlalche.me/e/20/gkpj)

colin-combe · 2024-09-20T11:00:56Z

Yes, that's obviously not helpful for debugging, and this is the sort of thing that needs to be improved as we move towards a more usable mzIdentML validator.

Though that wasn't the full output from it, was it? (maybe it was)

Anyway, I'll look into it and get back to you. Whatever the error is, I'll try to update the code to make the output more meaningful in the case of that error.

sureshhewabi · 2024-09-20T11:08:46Z

The same sort of output keeps getting repeated, and it just fills up the buffer with these kinds of JSON objects.

Thanks!

colin-combe · 2024-09-20T11:22:13Z

could also be a bug in the converter and nothing to do with validation

colin-combe · 2024-09-20T16:29:08Z

I think the file is invalid at the schema level due to duplicate ids. The same ids for SpectrumIdentificationResults and SpectrumIdentificationItems recur in the different SpectrumIdentificationLists.

The scope within which these ids are meant to be unique is perhaps open to interpretation from the text in the specification document. But I've attached the start of the output from xmllint --noout --schema mzIdentML1.2.0.xsd interact-1_2-fixed.mzid

outputfile.txt

(It's also why the converter's output was meaningless; it kind of assumes the input is schema valid.)
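A quick pre-check for this kind of duplication, as a rough sketch alongside the xmllint run: count the ids of the two element types across the whole document and report any that occur more than once. The function name and the stand-in XML are hypothetical, and this is not a replacement for schema validation.

```python
from collections import Counter
import xml.etree.ElementTree as ET

CHECKED = {"SpectrumIdentificationResult", "SpectrumIdentificationItem"}

def duplicate_ids(root: ET.Element):
    """Return sorted (element name, id) pairs that occur more than once in the document."""
    counts = Counter(
        (el.tag.rsplit("}", 1)[-1], el.get("id"))
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] in CHECKED
    )
    return sorted(key for key, n in counts.items() if n > 1)

# Stand-in document: two lists that reuse the same ids.
SAMPLE = """
<AnalysisData>
  <SpectrumIdentificationList id="sil_0">
    <SpectrumIdentificationResult id="sir_1">
      <SpectrumIdentificationItem id="sii_1_1"/>
    </SpectrumIdentificationResult>
  </SpectrumIdentificationList>
  <SpectrumIdentificationList id="sil_1">
    <SpectrumIdentificationResult id="sir_1">
      <SpectrumIdentificationItem id="sii_1_1"/>
    </SpectrumIdentificationResult>
  </SpectrumIdentificationList>
</AnalysisData>
"""

dupes = duplicate_ids(ET.fromstring(SAMPLE))
print(dupes)
```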

mhoopmann · 2024-09-20T22:30:09Z

Gotcha, sorry about that. Makes sense that each SpectrumIdentificationResult.id and SpectrumIdentificationItem.id should have unique values external to their SpectrumIdentificationLists, especially if this all goes into an SQL database that requires those tables to have a unique key based on id alone.

New mzID files have been uploaded (*fixedB.mzid).

colin-combe · 2024-09-21T06:20:08Z

> especially if this all goes into an SQL database that requires those tables to have a unique key based on id alone.

Yes, the sqlalchemy errors Suresh was seeing were caused by duplicate primary keys.

sureshhewabi · 2024-09-23T10:28:08Z

Good news! Dataset is parsed successfully. Thank you to everyone for working to make this happen.

sureshhewabi self-assigned this Sep 19, 2024

colin-combe mentioned this issue Sep 19, 2024

MzIdentML Validation Feature #78

Open

9 tasks
