[Question]: How to improve number of BINS/MAGS from soil ? #171

Open
ecairns62 opened this issue Dec 4, 2024 · 1 comment

@ecairns62

Hi.

I am nearing the MAG classification step for samples extracted from soil around plant roots, but the number of MAGs seems to be only around 4 or 5. The sequencing was done on a NovaSeq, so not ideal as it is short read, but we hoped we would still get better coverage. Is there any way to adjust the pipeline to tolerate a larger error so that more MAGs get assembled? We also have replicate samples, and I saw in another issue report that the samples could maybe be merged together? I've attached the visualisation files as well to give more data. These two samples are also the same, just from different sequencing runs.

assemblyVis.pdf
binningVis.pdf
qfilterVis.pdf

ecairns62 added the question label (Further information is requested) on Dec 4, 2024
@franciscozorrilla
Owner

Hi Edward,

Based on the qfiltering plot it seems like your samples have a decent sequencing depth (over 10 Gbp/sample), although increasing this in future experiments may help improve recovery of genomes, especially for complex samples. Base quality also looks good.

Regarding your assemblies, it looks like you have around 40 Mbp per sample. Unfortunately the average contig length is quite low at about 1600 bp, although this is not uncommon for complex/soil samples. Considering that the smallest binnable contig is 1000 bp, you are likely losing a lot of sequence at the binning stage due to poor-quality assemblies. One helpful metric is the % of reads from each sample that end up in MAGs: concatenate your MAGs, then map your short reads to that concatenation. Similarly, you can map your short reads to the assembly to get an idea of the % of reads that get assembled. To get a general idea of what values to expect, have a look at supplementary figure 4 from the metaGEM paper.

[Image: sup_fig4, supplementary figure 4 from the metaGEM paper]
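As a rough sketch of the read-mapping metric described above: once the reads have been mapped to the MAG concatenation (or to the assembly) with whatever aligner you prefer, the per-reference counts reported by `samtools idxstats` can be summed to get the % of reads mapped. The function name `mapped_fraction` and the example numbers are made up for illustration, and note that idxstats counts alignment records, so treat the result as an approximation rather than an exact read count.

```python
def mapped_fraction(idxstats_text: str) -> float:
    """Return the fraction of reads that mapped, from `samtools idxstats` output.

    Each line has four tab-separated fields:
    reference name, reference length, mapped count, unmapped count.
    The final "*" line holds reads that did not map to any reference.
    """
    mapped = 0
    unmapped = 0
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        _name, _length, n_mapped, n_unmapped = line.split("\t")
        mapped += int(n_mapped)
        unmapped += int(n_unmapped)
    total = mapped + unmapped
    return mapped / total if total else 0.0


# Hypothetical idxstats output for reads mapped against concatenated MAGs.
example = "\n".join([
    "bin1_contig1\t5000\t300\t0",
    "bin1_contig2\t2500\t100\t0",
    "*\t0\t0\t600",
])
print(f"{mapped_fraction(example):.1%} of reads map to MAGs")
```

Running the same function on idxstats output from a mapping against the full assembly gives the % of reads assembled, so the two numbers can be compared side by side.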

You may want to try playing around with the assembly presets/parameters to see if you can get a higher average contig length and/or assembly size. Considering an average bacterial genome size of about 4 Mbp, a 40 Mbp assembly corresponds to roughly 10 genomes at most, and since you are also losing sequences due to small contig size, I think your 4-5 genomes is reasonable given the samples you have.
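To put numbers on this for your own assemblies, here is a minimal sketch that reads contig lengths from a FASTA file and reports how much sequence survives a minimum contig length cutoff, plus a back-of-envelope upper bound on the number of genomes. The function names are hypothetical, and the 1000 bp cutoff and 4 Mbp genome size are just the rough values mentioned in this thread, not pipeline settings read from anywhere.

```python
import gzip


def contig_lengths(fasta_path):
    """Yield the length of every sequence in a FASTA file (plain or .gz)."""
    opener = gzip.open if str(fasta_path).endswith(".gz") else open
    length = None
    with opener(fasta_path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                if length is not None:
                    yield length  # finish the previous record
                length = 0
            elif length is not None:
                length += len(line.strip())
    if length is not None:
        yield length  # last record in the file


def summarise(lengths, min_binnable=1000, genome_size=4_000_000):
    """Summarise an assembly; the defaults are assumptions taken from this thread."""
    lengths = list(lengths)
    total = sum(lengths)
    binnable = sum(n for n in lengths if n >= min_binnable)
    return {
        "contigs": len(lengths),
        "total_bp": total,
        "mean_bp": total / len(lengths) if lengths else 0.0,
        "binnable_bp": binnable,
        "binnable_fraction": binnable / total if total else 0.0,
        # Upper bound only: assumes every binnable base ends up in a complete genome.
        "max_genomes": binnable / genome_size,
    }


# Toy example with made-up contig lengths.
print(summarise([500, 800, 1500, 3000]))
# For a real assembly: summarise(contig_lengths("contigs.fasta.gz"))
```

Comparing `binnable_fraction` before and after changing assembly presets is one way to see whether a parameter change is actually helping.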

If those two samples are from the same biological material, you could indeed coassemble them to improve assembly quality. There is currently no coassembly rule in the main Snakefile, but you could have a look at this example code where I coassembled some samples, and also at the megahit repo/wiki. You just have to list all the R1 files and all the R2 files, comma-separated, in the megahit call:

rule megahitCoassembly:
    input:
        R1 = f'/scratch/zorrilla/soil/coassembly/data/{{borkSoil}}/R1',
        R2 = f'/scratch/zorrilla/soil/coassembly/data/{{borkSoil}}/R2'
    output:
        f'{config["path"]["root"]}/coassembly/coassemblies/{{borkSoil}}/contigs.fasta.gz'
    benchmark:
        f'{config["path"]["root"]}/benchmarks/coassembly.{{borkSoil}}.benchmark.txt'
    shell:
        """
        set +u;source activate {config[envs][metabagpipes]};set -u;
        cd $SCRATCHDIR
        echo -n "Copying qfiltered reads to $SCRATCHDIR ... "
        cp -r {input.R1} {input.R2} $SCRATCHDIR
        echo "done. "
        # Build comma-separated lists of the R1 and R2 file names for megahit
        R1=$(ls R1/|tr '\n' ','|sed 's/,$//g')
        R2=$(ls R2/|tr '\n' ','|sed 's/,$//g')
        # Move the reads into the working directory so the bare file names resolve
        mv R1/* .
        mv R2/* .
        echo -n "Running megahit ... "
        megahit -t {config[cores][megahit]} \
            --presets {config[params][assemblyPreset]} \
            --min-contig-len {config[params][assemblyMin]} \
            --verbose \
            -1 $R1 \
            -2 $R2 \
            -o tmp;
        echo "done. "
        echo "Renaming assembly ... "
        mv tmp/final.contigs.fa contigs.fasta
        echo "Fixing contig header names: replacing spaces with hyphens ... "
        sed -i 's/ /-/g' contigs.fasta
        echo "Zipping and moving assembly ... "
        gzip contigs.fasta
        mkdir -p $(dirname {output})
        mv contigs.fasta.gz $(dirname {output})
        echo "Done. "
        """
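If you would rather try a coassembly outside of Snakemake first, the core of the rule above is just building the two comma-separated file lists. Below is a small sketch of that step; the function name `megahit_coassembly_cmd`, the directory layout (one folder of R1 files, one of R2 files) and the default values are assumptions for illustration. It relies on the R1 and R2 file names sorting into the same sample order, since megahit pairs the two lists by position.

```python
from pathlib import Path


def megahit_coassembly_cmd(r1_dir, r2_dir, out_dir, threads=8,
                           preset="meta-sensitive", min_contig_len=1000):
    """Build a megahit argument list that coassembles all paired files.

    Assumes the R1 and R2 file names sort into the same sample order.
    """
    r1 = sorted(str(p) for p in Path(r1_dir).iterdir() if p.is_file())
    r2 = sorted(str(p) for p in Path(r2_dir).iterdir() if p.is_file())
    if not r1 or len(r1) != len(r2):
        raise ValueError("expected the same non-zero number of R1 and R2 files")
    return [
        "megahit",
        "-t", str(threads),
        "--presets", preset,
        "--min-contig-len", str(min_contig_len),
        "-1", ",".join(r1),  # all forward reads, comma-separated
        "-2", ",".join(r2),  # all reverse reads, in the same order
        "-o", str(out_dir),
    ]


# Example usage (paths are placeholders):
#   import subprocess
#   cmd = megahit_coassembly_cmd("data/R1", "data/R2", "coassembly_out")
#   subprocess.run(cmd, check=True)
```

Printing the returned list before running it is a cheap way to check that each R1 file lines up with its R2 partner.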

Finally, you may want to bin your contigs into MAGs using contig coverage across more samples, which has been shown to improve results.
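To illustrate what "coverage across more samples" means in practice: each contig gets one coverage value per sample, and binners use that profile to group contigs that rise and fall together. The sketch below builds such a table from per-sample mapped read counts using the rough approximation coverage ≈ reads × read length / contig length. The function name, the input layout and the 150 bp read length are all assumptions for illustration, not how any particular binner computes depth.

```python
def coverage_table(mapped_reads, contig_lengths, read_length=150):
    """Approximate per-sample coverage for each contig.

    mapped_reads:   {sample: {contig: number of mapped reads}}
    contig_lengths: {contig: length in bp}
    Returns (samples, {contig: [coverage in each sample, in that order]}).
    """
    samples = sorted(mapped_reads)
    table = {}
    for contig, length in contig_lengths.items():
        table[contig] = [
            mapped_reads[sample].get(contig, 0) * read_length / length
            for sample in samples
        ]
    return samples, table


# Made-up counts: reads from two samples mapped to one assembly.
samples, table = coverage_table(
    {"s1": {"c1": 100}, "s2": {"c1": 20, "c2": 10}},
    {"c1": 1500, "c2": 3000},
)
print(samples)
for contig, profile in table.items():
    print(contig, profile)
```

With only one sample each contig has a single coverage value; adding samples lengthens the profile, which gives the binner more signal to separate genomes.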

Hope this helps!
Best wishes,
Francisco
