Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 3659 fixed loading for MOL2 and PDB edge cases #3662

Open
wants to merge 11 commits into
base: develop
Choose a base branch
from

Conversation

zwsmith200
Copy link

@zwsmith200 zwsmith200 commented May 11, 2022

Fixes #3659

Changes made in this Pull Request:

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@pep8speaks
Copy link

pep8speaks commented May 11, 2022

Hello @zwsmith200! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2023-11-07 19:25:28 UTC

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello there first time contributor! Welcome to the MDAnalysis community! We ask that all contributors abide by our Code of Conduct and that first time contributors introduce themselves on the developer mailing list so we can get to know you. You can learn more about participating here. Please also add yourself to package/AUTHORS as part of this PR.

@codecov
Copy link

codecov bot commented May 11, 2022

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (4b8b924) 93.37% compared to head (6d965b0) 93.37%.
Report is 8 commits behind head on develop.

Additional details and impacted files
@@            Coverage Diff             @@
##           develop    #3662     +/-   ##
==========================================
  Coverage    93.37%   93.37%             
==========================================
  Files          170      184     +14     
  Lines        22295    23465   +1170     
  Branches      4075     4084      +9     
==========================================
+ Hits         20818    21911   +1093     
- Misses         962     1036     +74     
- Partials       515      518      +3     
Files Coverage Δ
package/MDAnalysis/topology/MOL2Parser.py 98.27% <100.00%> (+0.03%) ⬆️
package/MDAnalysis/topology/PDBParser.py 99.50% <100.00%> (+0.01%) ⬆️
package/MDAnalysis/topology/base.py 96.61% <100.00%> (+0.95%) ⬆️

... and 20 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@orbeckst
Copy link
Member

Welcome to contributing to MDAnalysis! Thanks for the PR.

Sorry, I don't have time to review, I just wanted to ask if this PR is going to fix #3659 ? If so, please enter the issue number in the Fixes # line at the top of your issue report so that PRs and issues get linked.

@IAlibay
Copy link
Member

IAlibay commented May 12, 2022

So I should ask here - the fact that both squash_by and change_squash exist just next to each other makes me wonder if there is a file format specific reason as to why these behaviours are different :/ Is there anything in the PR history that explains why these exist?

Nevermind, I better understand the issue now.

@zwsmith200
Copy link
Author

zwsmith200 commented May 12, 2022

The scope has expanded slightly since the initial issue. Originally I was concerned that there were different behaviours when reading PDB and MOL2, namely the MOL2 format deleted the resname with a repeated resid while PDB did not. I then went to re-use change_squash from PDB and realized this will double the number of residues if they are not contiguous (for example openbabel adds hydrogens to the end of PDBs).

In light of those findings, I wrote a squash function that is robust to both repeated resids with different attributes and non-contiguous residues. I have applied this change to the MOL2 and PDB parsers but I suspect experts in other formats may want to take a look at whether similar unit tests are needed.

@richardjgowers richardjgowers self-assigned this May 12, 2022
@richardjgowers
Copy link
Member

I need to double check this, hopefully in the next few days

Copy link
Member

@richardjgowers richardjgowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zwsmith200 thanks for putting this together. I think it's an improvement for a lot of cases, but I can foresee cases where it changes behaviour, e.g. if the resids wrapped around the max value and the resname happened to be the same by bad luck then this would be considered the same residue. I think it might be better to allow this new residue grouping strategy via a keyword argument to Universe, which is passed to the parser.

Comment on lines 149 to 161
Arguments
---------
squash_attributes - list of attribute arrays (attributes used to
identify the parent)
*other_attributes - other arrays that need to follow the sorting of ids

Returns
-------
child_parents_idx - an array of len(child) which points to the index of
parent
parent_combos - len(parent) of the unique combinations
squashed_attrs - len(parent) of the attributes used for squashing
*other_attrs - len(parent) of the other attributes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this follow the doc style a bit better? Usually it's type annotation on the same line as the variable name and description on the following line

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point, I have actually run into an example of this using mol2 generated from pdb in my work because it lost chain information. I’ll add a test for wrapped resids.

@IAlibay
Copy link
Member

IAlibay commented Aug 14, 2023

@zwsmith200 is this still something you are looking to contribute?

@IAlibay IAlibay added the close? Evaluate if issue/PR is stale and can be closed. label Aug 14, 2023
@zwsmith200
Copy link
Author

Yes, I can finish cleaning this up. I got it working for my now-completed project then forgot about the PR.

Copy link

github-actions bot commented Nov 2, 2023

Linter Bot Results:

Hi @zwsmith200! Thanks for making this PR. We linted your code and found the following:

Some issues were found with the formatting of your code.

Code Location Outcome
main package ⚠️ Possible failure
testsuite ⚠️ Possible failure

Please have a look at the darker-main-code and darker-test-code steps here for more details: https://github.com/MDAnalysis/mdanalysis/actions/runs/6789351225/job/18456338341


Please note: The black linter is purely informational, you can safely ignore these outcomes if there are no flake8 failures!

@zwsmith200
Copy link
Author

Hello @zwsmith200! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 409:80: E501 line too long (80 > 79 characters)
Line 418:80: E501 line too long (81 > 79 characters)

Line 325:80: E501 line too long (86 > 79 characters)

Line 388:80: E501 line too long (87 > 79 characters)

Comment last updated at 2023-11-06 01:06:50 UTC

I realize darker used max length of 88 and we use 79. It may be useful to add -l 79 here so that copying the command matches our requirements.

@zwsmith200
Copy link
Author

@richardjgowers @IAlibay It looks like I have addressed everything I needed to, let me know if I missed something.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
close? Evaluate if issue/PR is stale and can be closed. Component-Topology
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inconsistent resnames when loading pdb and mol2 with repeat resnums
5 participants