Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: preparation for somatic cnv #590

Open
wants to merge 28 commits into
base: main
Choose a base branch
from

Conversation

ericblanc20
Copy link
Contributor

Adds several (fairly simple & simple-minded) steps required for proper CNV calling:

  • guess_sex: simple inference of sex for autosome & sex chromosome coverage
  • germline_snvs: simple identification of well-supported germline SNPs. The variant_calling step unfortunately cannot be used for this task, as it is designed for trios.
  • somatic_variants_for_cnv: creates input for cnv tools using B-allele fractions to improve/verify CNV calls based on coverage alone. The somatic_variant_calling step cannot be used, as the somatic variants from mutect2 differ greatly when germline variants are included or not.

The current code is OK, but can certainly be improved:

  • Pipes could be extended, but it needs more clever choices in germline_snvs/__init__.py
  • The problems with model serialization & validation should be understood & fixed: at the moment, automatic alias generation in bcftools models seems to work in isolation, when only the step model is considered. But validation fails during registration of sub-steps.
  • Better design of the snappy_wrapper is probably possible. Also, the derived BcftoolsWrapper is a first attempt at streamlining UNIX-like tools (such as bcftools, bedtools, bedops, samtools, rnaqc, ...). Its design should be critically reviewed, before similar wrappers are built.
  • The treatment of ignored_chroms should also be seen as a first attempt to be critically reviewed. The code in genome_windows is exercised in the ignored_chroms wrapper, called from the germline_snvs & somatic_variants_for_cnv snakefiles.

@ericblanc20 ericblanc20 linked an issue Jan 10, 2025 that may be closed by this pull request
@ericblanc20 ericblanc20 requested a review from tedil January 10, 2025 16:51
@ericblanc20 ericblanc20 changed the title 587 preparation for somatic cnv feat: preparation for somatic cnv Jan 10, 2025
Copy link

  • Please format your Python code with ruff: make fmt
  • Please check your Python code with ruff: make check
  • Please format your Snakemake code with snakefmt: make snakefmt

You can trigger all lints locally by running make lint

@coveralls
Copy link

coveralls commented Jan 14, 2025

Coverage Status

coverage: 80.761% (-5.0%) from 85.765%
when pulling 0fa6eed on 587-preparation-for-somatic-cnv
into f569530 on main.

@tedil tedil force-pushed the 587-preparation-for-somatic-cnv branch from 5d0043e to 9660c7c Compare January 16, 2025 12:11
Copy link
Member

@tedil tedil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Phew, this is a lot! Thank you for all the work and effort that went into it:

Here are some remarks, questions & feedback, in no particular order:

  • make sure VCF_TAG_PATTERN and ANNOTATION_VCF_TAG_PATTERN cover the definitions of the VCF specification
  • try to use the model and its attributes where applicable instead of dict access (i.e. wf.config.property.… instead of config["step_config"][…])
  • use functions for getting resources and params/args, so we can more easily add dynamic resource estimation later on
  • I think "last" is an unfortunate name for an action, I'd prefer something akin to "finalize" or "gather_final_result" or …)
  • make use of dictify and listify consistently
  • should we produce a little illustration for the documentation on how the steps introduced here interact/intertwine/support one another?
  • avoid code duplication:
    • I think _collapsed_arg_value and collapse_args appear multiple times with basically the same code; in this case, we may also find a cleaner way to do it, but it's fine for now!
    • get_args is often the same 2 lines I think, is that not part of Base/AbstractStepXYZ already?
  • hardcoded extra_args seem weird to me, can they not be part of the default config instead?
  • instead of using the do_md5 argument, we could add an automatic check for this
  • TODO: check regular expressions for correctness
  • use named groups in regexes
  • for even more comments, see the respective line comments ;)

Comment on lines 217 to 224
SNP_INS_DEL = "snp-ins-del"
"""Used in merge"""
ID = "id"
"""Used in merge"""
STAR = "*"
"""Used in merge"""
STAR_STAR = "**"
"""Used in merge"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split into two (or more) enums? (And use the union of them where applicable?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added command-dependent variant type classes. I am not sure it is the best solution, but I think at least it is clear.
As the model is not used anywhere (at the moment), I think I'll keep it as it is, unless you have reservations.

snappy_pipeline/workflows/common/samplesheet.py Outdated Show resolved Hide resolved
assert not any(table.duplicated()), "Duplicated entries in sample sheets"
assert not any(table["ngs_library"].duplicated()), "Duplicated NGS libraries"

# table.set_index("ngs_library", drop=False, inplace=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I'm wondering if there is any column which is useful to set as (default) index…

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are absolutely right. I was caught by my ignorance of pandas. But actually, it makes a lot of sense to index the table by the ngs library name.


rule germline_snvs_bcftools_ignore_chroms:
input:
reference=config["static_data_config"]["reference"]["path"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could also use wf.w_config.static_data_config.reference.path instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better indeed. Done

**{
"args": {
"ignore_chroms": set(
config["step_config"]["germline_snvs"]["ignore_chroms"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, wf.w_config.step_config["germline_snvs"].ignore_chroms etc etc.

args = getattr(snakemake.params, "args", {})

cmd = r"""
bcftools iseq \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bcftools iseq \
bcftools isec \


awk \
-F '\t' \
'($5 != "11") && (length($3) == 1) && (length($3) == length($4)) {{printf "%s\t%d\t%d\n", $1, $2-1, $2}}' \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love having some documentation on what this is supposed to filter / produce etc

- bioconda
- nodefaults
dependencies:
- htslib
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should at least put the current version as a lower limit (+ an upper limit)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little docstring explaining what this wrapper does / what it is used for would be nice!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're using grep and awk etc, make sure to also add e.g. coreutils or just awk and grep as their own tools.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added coreutils dependency, and set the version to the same as snappy's. I did the same for samtools (same version as snappy).
Should we try to enforce version match for those basic tools that also are in snappy (bcftools, samtools, htslib, pysam, coreutils)? At least for simple wrappers, that don't require anything else?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Preparation for somatic CNV
3 participants