Some changes to finemapper.py for data parsing; if these are unnecessary, I'll submit only the bug fix for BGEN files
#188
+679
−345
Pull Request Title: Enhancements and New Features for Genetic Data Processing
Code modified only in finemapper.py
Summary of Changes
This PR introduces several updates and new features to improve the handling and processing of genetic data. Key changes include:
Support for computing LD from BGEN files and saving it as .npz, which loads faster than .bcor. Use
--geno genofile --ldstore2 $(which ldstore) --cache-dir ./ --cache-format npz
to save in npz format by default.
NPZ File Reading Capability:
--ld your_npz_prefix
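For reviewers unfamiliar with the npz cache, here is a minimal sketch of the idea: store the LD matrix together with its SNP IDs in one compressed file and read both back from the prefix. The helper names `save_ld_npz` and `load_ld_npz` and the array keys `ld` and `snps` are assumptions for illustration only; the layout actually written by finemapper.py in this PR may differ.

```python
import numpy as np


def save_ld_npz(prefix, ld, snp_ids):
    """Write an LD matrix and its SNP IDs to `<prefix>.npz` (hypothetical layout)."""
    ld = np.asarray(ld, dtype=np.float32)
    if ld.shape != (len(snp_ids), len(snp_ids)):
        raise ValueError("LD matrix must be square and match the SNP list")
    # Fixed-width unicode arrays load back without pickle support.
    np.savez_compressed(prefix + ".npz", ld=ld, snps=np.asarray(snp_ids, dtype=str))


def load_ld_npz(prefix):
    """Read back the matrix and SNP IDs stored by save_ld_npz."""
    with np.load(prefix + ".npz") as data:
        return data["ld"], data["snps"].tolist()
```

Keeping the SNP IDs next to the matrix means the rows can be matched to the sumstats without a second metadata file, at the cost of a format that is specific to this sketch.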
PGEN File Support:
--geno pgen_file_prefix
finemap_tools is used for invoking Plink2 (the version must be later than PLINK v2.00a6LM 64-bit Intel, dated 2 Mar 2024) to compute LD, using the command template:
plink2 --r2-unphased square
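As a rough illustration of how the command template could be filled in, the sketch below picks the input flag from the files found under a genotype prefix (preferring bed over pgen, as this section specifies) and builds the Plink2 argument list. The function names are hypothetical and are not taken from finemap_tools; only the `--bfile`, `--pfile`, `--r2-unphased square`, and `--out` flags are standard Plink2 options.

```python
from pathlib import Path


def detect_geno_format(prefix):
    """Return 'bed' or 'pgen' for a genotype prefix, preferring bed when both exist."""
    if Path(prefix + ".bed").exists():
        return "bed"
    if Path(prefix + ".pgen").exists():
        return "pgen"
    raise FileNotFoundError(f"no .bed or .pgen file found for prefix {prefix!r}")


def build_plink2_ld_command(prefix, out_prefix, plink2="plink2"):
    """Build the argument list for an unphased r2 square matrix (hypothetical helper)."""
    input_flag = "--bfile" if detect_geno_format(prefix) == "bed" else "--pfile"
    return [plink2, input_flag, prefix, "--r2-unphased", "square", "--out", out_prefix]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building it as a list avoids shell quoting problems with unusual file names.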
The --geno option matches files by prefix; bed files take priority over pgen to avoid conflicts when both file types are present.
Improvements in LD Matrix Handling:
Updated the sync_ld_sumstats function to exclude SNPs with NA values.
Enhancements in Summary Statistics (sumstats) Loading:
Sumstats can now be loaded with the tabix command-line tool. This approach is particularly efficient for genome-wide sumstats: it retrieves data directly by chromosome, which significantly reduces loading time.
If finemap_tools is unavailable, the original logic of reading the entire file is used.
Example tabix indexing command:
tabix -s 2 -b 3 -e 3 -c S sumstats_with_bgz_compressed.bgz
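To make the loading path concrete, here is a hedged sketch of per-chromosome retrieval with a whole-file fallback. It assumes a tab-separated, bgzip-compressed file whose header line starts with "S" (for example SNP, CHR, BP), matching the `-s 2 -b 3 -c S` options above. The function names are hypothetical, and the fallback condition used here (tabix or the `.tbi` index missing) is an assumption rather than the check performed in this PR.

```python
import csv
import gzip
import io
import os
import shutil
import subprocess


def read_header(path):
    """Column names from the first line (bgzip output is gzip-compatible)."""
    with gzip.open(path, "rt") as handle:
        return handle.readline().rstrip("\n").split("\t")


def load_sumstats_chrom(path, chrom, chrom_col=1):
    """Return rows (as dicts of strings) for one chromosome.

    chrom_col is the 0-based chromosome column (tabix `-s 2` means index 1).
    The chromosome name must match the file's naming, e.g. "1" vs "chr1".
    """
    columns = read_header(path)
    chrom = str(chrom)
    if shutil.which("tabix") and os.path.exists(path + ".tbi"):
        # Indexed path: tabix prints only the data lines of the requested region.
        result = subprocess.run(
            ["tabix", path, chrom], capture_output=True, text=True, check=True
        )
        text = result.stdout
    else:
        # Fallback: scan the whole file and keep the matching chromosome.
        with gzip.open(path, "rt") as handle:
            next(handle)  # skip the header line
            text = "".join(
                line for line in handle
                if line.strip() and line.rstrip("\n").split("\t")[chrom_col] == chrom
            )
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [dict(zip(columns, row)) for row in reader]
```

Both branches return the same structure, so the caller does not need to know which one ran.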
Integration of finemap_tools Package:
finemap_tools is used for filtering biallelic and ambiguous alleles during sumstats reading.
Code Formatting Updates:
bgen bug fix
The fix is in this commit:
03c283d2190e2f3100462bb8932ed4f7441b54aa
The commits after it contain further changes that may not be necessary.
Future Developments
Further development and updates will continue in my own repository and will not be submitted as pull requests to this project.