feat: preparation for somatic cnv #590
base: main
…of Rscript, for more verbose logs
…to normal sample mapping
… BAQ, changed skip on flags), protect against None model values, changed regions option name
5d0043e to 9660c7c
Phew, this is a lot! Thank you for all the work and effort that went into it.
Here are some remarks, questions & feedback, in no particular order:
- make sure `VCF_TAG_PATTERN` and `ANNOTATION_VCF_TAG_PATTERN` cover the definitions of the VCF specification
- try to use the model and its attributes where applicable instead of dict access (i.e. `wf.config.property.…` instead of `config["step_config"][…]`)
- use functions for getting resources and params/args, so we can more easily add dynamic resource estimation later on
- I think "last" is an unfortunate name for an action, I'd prefer something akin to "finalize" or "gather_final_result" or …
- make use of `dictify` and `listify` consistently
- should we produce a little illustration for the documentation on how the steps introduced here interact/intertwine/support one another?
- avoid code duplication:
  - I think `_collapsed_arg_value` and `collapse_args` appear multiple times with basically the same code; in this case, we may also find a cleaner way to do it, but it's fine for now!
  - `get_args` is often the same 2 lines I think, is that not part of Base/AbstractStepXYZ already?
- hardcoded `extra_args` seem weird to me, can they not be part of the default config instead?
- instead of using the `do_md5` argument, we could add an automatic check for this
- TODO: check regular expressions for correctness
- use named groups in regexes
- for even more comments, see the respective line comments ;)
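To illustrate the "use named groups in regexes" remark, here is a minimal sketch. The pattern below is hypothetical and is not the PR's `VCF_TAG_PATTERN`; the key syntax (a letter or underscore followed by letters, digits, underscores or dots) is my reading of the VCF specification and should be checked against it. `TAG_PATTERN` and `parse_tag` are names made up for this example.

```python
import re

# Hypothetical pattern, NOT the PR's VCF_TAG_PATTERN.
# Assumption: a tag is an optional "INFO/", "FORMAT/" or "FMT/" prefix
# followed by a key that starts with a letter or underscore.
TAG_PATTERN = re.compile(
    r"^(?:(?P<field>INFO|FORMAT|FMT)/)?(?P<tag>[A-Za-z_][0-9A-Za-z_.]*)$"
)


def parse_tag(text):
    """Return (field, tag); field is None when no prefix is given."""
    match = TAG_PATTERN.match(text)
    if match is None:
        raise ValueError(f"Not a valid tag: {text!r}")
    # Named groups document intent and survive reordering of the pattern.
    return match.group("field"), match.group("tag")
```

With named groups, callers read `match.group("tag")` rather than `match.group(2)`, so inserting another group into the pattern later does not silently shift the indices.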
snappy_pipeline/models/bcftools.py (outdated)

    SNP_INS_DEL = "snp-ins-del"
    """Used in merge"""
    ID = "id"
    """Used in merge"""
    STAR = "*"
    """Used in merge"""
    STAR_STAR = "**"
    """Used in merge"""

Split into two (or more) enums? (And use the union of them where applicable?)
I have added command-dependent variant type classes. I am not sure it is the best solution, but I think at least it is clear.
As the model is not used anywhere (at the moment), I think I'll keep it as it is, unless you have reservations.
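A minimal sketch of what splitting the enum per command and using a union could look like. The class names are hypothetical and are not the classes added in the PR. The merge values are copied from the snippet above; the second enum's values are what I believe `bcftools view --types` accepts, which is an assumption to verify against the bcftools manual.

```python
from enum import Enum
from typing import Union


class MergeVariantType(Enum):
    """Values marked "Used in merge" in the reviewed snippet."""

    SNP_INS_DEL = "snp-ins-del"
    ID = "id"
    STAR = "*"
    STAR_STAR = "**"


class ViewVariantType(Enum):
    """Hypothetical second enum; values assumed from `bcftools view --types`."""

    SNPS = "snps"
    INDELS = "indels"
    MNPS = "mnps"
    OTHER = "other"


# The union can be used wherever either command's values are acceptable.
AnyVariantType = Union[MergeVariantType, ViewVariantType]


def parse_variant_type(value: str) -> AnyVariantType:
    """Try each command-specific enum in turn."""
    for enum_cls in (MergeVariantType, ViewVariantType):
        try:
            return enum_cls(value)
        except ValueError:
            continue
    raise ValueError(f"Unknown variant type: {value!r}")
```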
    assert not any(table.duplicated()), "Duplicated entries in sample sheets"
    assert not any(table["ngs_library"].duplicated()), "Duplicated NGS libraries"

    # table.set_index("ngs_library", drop=False, inplace=True)

Now I'm wondering if there is any column which is useful to set as (default) index…
Yes, you are absolutely right. I was caught by my ignorance of pandas. But actually, it makes a lot of sense to index the table by the ngs library name.

    rule germline_snvs_bcftools_ignore_chroms:
        input:
            reference=config["static_data_config"]["reference"]["path"],

I think we could also use `wf.w_config.static_data_config.reference.path` instead.
Much better indeed. Done

    **{
        "args": {
            "ignore_chroms": set(
                config["step_config"]["germline_snvs"]["ignore_chroms"]

Similarly, `wf.w_config.step_config["germline_snvs"].ignore_chroms` etc etc.
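To make the difference between dict access and attribute access concrete, here is a small sketch. It uses plain dataclasses as a stand-in for the pipeline's configuration models; all class names and the example values are hypothetical, and only the access paths mirror the ones quoted in the comments above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Reference:
    path: str


@dataclass
class StaticDataConfig:
    reference: Reference


@dataclass
class GermlineSnvs:
    ignore_chroms: List[str] = field(default_factory=list)


@dataclass
class WorkflowConfig:
    static_data_config: StaticDataConfig
    step_config: Dict[str, GermlineSnvs]


# Raw, untyped configuration as it would be accessed via `config[...]`.
raw = {
    "static_data_config": {"reference": {"path": "/data/ref.fa"}},
    "step_config": {"germline_snvs": {"ignore_chroms": ["chrM", "*_decoy"]}},
}

# Typed view of the same data: typos in attribute names fail loudly and
# editors can offer completion, unlike string keys.
w_config = WorkflowConfig(
    static_data_config=StaticDataConfig(
        reference=Reference(**raw["static_data_config"]["reference"])
    ),
    step_config={
        name: GermlineSnvs(**values)
        for name, values in raw["step_config"].items()
    },
)

reference_path = w_config.static_data_config.reference.path
ignore_chroms = set(w_config.step_config["germline_snvs"].ignore_chroms)
```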

    args = getattr(snakemake.params, "args", {})

    cmd = r"""
    bcftools iseq \

Suggested change: replace `bcftools iseq \` with `bcftools isec \` (the bcftools subcommand is `isec`).

    awk \
        -F '\t' \
        '($5 != "11") && (length($3) == 1) && (length($3) == length($4)) {{printf "%s\t%d\t%d\n", $1, $2-1, $2}}' \

I would love to have some documentation on what this is supposed to filter / produce etc.
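As a starting point for that documentation, here is my reading of the `awk` filter, written out as a Python sketch. It assumes the input is tab-separated with columns chromosome, 1-based position, REF, ALT and a presence flag string (as in the `sites.txt` file that `bcftools isec` writes), and that the doubled braces are escapes for Python string formatting. Both are assumptions, since only the `awk` line is visible here. The function name `sites_to_bed` is made up for this example.

```python
def sites_to_bed(lines):
    """Mimic the awk filter on tab-separated site records.

    Keeps records where the flag column is not "11" and where REF and ALT
    are both a single base, and emits them as BED lines (0-based start,
    end equal to the 1-based position).
    """
    bed = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            # Deviation from awk, which treats missing fields as empty:
            # incomplete records are skipped here.
            continue
        chrom, pos, ref, alt, flags = fields[:5]
        if flags != "11" and len(ref) == 1 and len(ref) == len(alt):
            bed.append(f"{chrom}\t{int(pos) - 1}\t{int(pos)}")
    return bed
```

Under these assumptions, the output is a BED file of single-nucleotide sites that are not flagged as present in both inputs.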

      - bioconda
      - nodefaults
    dependencies:
      - htslib

We should at least put the current version as a lower limit (+ an upper limit)
A little docstring explaining what this wrapper does / what it is used for would be nice!
Since you're using grep and awk etc, make sure to also add e.g. `coreutils` or just `awk` and `grep` as their own tools.
I have added the `coreutils` dependency, and set the version to the same as snappy's. I did the same for `samtools` (same version as snappy).
Should we try to enforce version match for those basic tools that also are in snappy (`bcftools`, `samtools`, `htslib`, `pysam`, `coreutils`)? At least for simple wrappers, that don't require anything else?
Co-authored-by: Till Hartmann
…variants_for_cnv steps
…aside (because it's not quite complete)
… workflow config attributes & removed unnecessary resource allocation
Adds several (fairly simple & simple-minded) steps required for proper CNV calling:
- `guess_sex`: simple inference of sex from autosome & sex chromosome coverage
- `germline_snvs`: simple identification of well-supported germline SNPs. The `variant_calling` step unfortunately cannot be used for this task, as it is designed for trios.
- `somatic_variants_for_cnv`: creates input for CNV tools using B-allele fractions to improve/verify CNV calls based on coverage alone. The `somatic_variant_calling` step cannot be used, as the somatic variants from `mutect2` differ greatly when germline variants are included or not.

The current code is OK, but can certainly be improved:
- `germline_snvs/__init__.py` … `snappy_wrapper` is probably possible. Also, the derived `BcftoolsWrapper` is a first attempt at streamlining UNIX-like tools (such as `bcftools`, `bedtools`, `bedops`, `samtools`, `rnaqc`, ...). Its design should be critically reviewed before similar wrappers are built.
- `ignored_chroms` should also be seen as a first attempt to be critically reviewed. The code in `genome_windows` is exercised in the `ignored_chroms` wrapper, called from the `germline_snvs` & `somatic_variants_for_cnv` snakefiles.
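For readers unfamiliar with the idea behind `guess_sex`, here is a rough sketch of how sex can be inferred from coverage. This is not the PR's implementation: the function signature, the contig names and the threshold are all hypothetical, and a real step would likely use both X and Y coverage and handle ambiguous cases.

```python
from statistics import median


def guess_sex(coverage, x_name="chrX", y_name="chrY", y_ratio_threshold=0.1):
    """Guess sex from mean per-contig coverage (hypothetical heuristic).

    coverage maps contig name to mean depth. The Y coverage is normalised
    by the median autosomal coverage; a clearly non-zero ratio suggests
    that a Y chromosome is present.
    """
    autosomes = [
        depth for name, depth in coverage.items() if name not in (x_name, y_name)
    ]
    if not autosomes:
        raise ValueError("No autosomal coverage available")
    baseline = median(autosomes)
    if baseline <= 0:
        raise ValueError("Autosomal coverage must be positive")
    y_ratio = coverage.get(y_name, 0.0) / baseline
    return "male" if y_ratio >= y_ratio_threshold else "female"
```

The threshold would need tuning per assay, since capture kits differ in how well they cover the sex chromosomes.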