[Question]: How to improve number of BINS/MAGS from soil ? #171

ecairns62 · 2024-12-04T10:54:57Z

Hi.

I am nearing the MAG classification step for samples extracted from soil around plant roots, but I only seem to get around 4 or 5 MAGs. The sequencing was completed on a NovaSeq, so it is short-read data and not ideal, but we had hoped to get better coverage. Is there any way to adjust the pipeline to allow a larger error so that more MAGs get assembled? We also have replicate samples, and I saw in another issue report that the samples could maybe be merged together. I've attached the visualisation files as well to give more data. These two samples are also the same, just from different sequencing runs.

assemblyVis.pdf
binningVis.pdf
qfilterVis.pdf

franciscozorrilla · 2024-12-10T09:52:57Z

Hi Edward,

Based on the qfiltering plot it seems like your samples have a decent sequencing depth (over 10 Gbp/sample), although increasing this in future experiments may help improve recovery of genomes, especially for complex samples. Base quality also looks good.

Regarding your assemblies, it looks like you have around 40 Mbp per sample. Unfortunately the average contig length is quite low at about 1600 bp, although this is not uncommon for complex/soil samples. Considering that the smallest binnable contig is 1000bp, you are likely losing a lot of sequences at the binning stage due to poor quality assemblies. One metric that may be helpful is to concatenate your MAGs, then map your short reads to the MAG concatenation to get an idea of what % of reads from the samples end up in MAGs. Similarly, you can map your short reads to the assembly to get an idea of the % of reads that end up getting assembled. To get a general idea of what values you may expect, have a look at supplementary figure 4 from the metaGEM paper.
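
As a rough illustration of the read-recruitment metric described above, here is a minimal Python sketch. It assumes you have already mapped a sample's reads yourself (for example with `bwa mem` against the concatenated MAGs, or against the assembly) and saved the text output of `samtools flagstat`. The function name `mapped_fraction` is hypothetical and not part of metaGEM.

```python
import re

# Matches flagstat rows such as "900 + 0 mapped (90.00% : N/A)".
# The first number is the QC-passed count, the second the QC-failed count.
_ROW = re.compile(r"^(\d+) \+ (\d+) (.+)$")


def mapped_fraction(flagstat_text):
    """Return the fraction of QC-passed reads that mapped to the reference.

    Prefers the 'primary' / 'primary mapped' rows printed by newer samtools
    releases and falls back to 'in total' / 'mapped' when they are absent.
    """
    counts = {}
    for line in flagstat_text.splitlines():
        match = _ROW.match(line.strip())
        if not match:
            continue
        # Drop the trailing "(...)" annotation to get a stable row label.
        label = match.group(3).split(" (")[0].strip()
        counts[label] = int(match.group(1))

    total = counts.get("primary", counts.get("in total"))
    mapped = counts.get("primary mapped", counts.get("mapped"))
    if not total or mapped is None:
        raise ValueError("could not find total/mapped rows in flagstat output")
    return mapped / total
```

Running this once on the reads-vs-assembly flagstat and once on the reads-vs-MAGs flagstat gives the two percentages to compare: a large gap between them would suggest that reads are being assembled but the resulting contigs are not ending up in bins.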

You may want to try playing around with the assembly presets/parameters to see if you can get a higher average contig length and/or assembly size. Considering an average bacterial genome size of about 4 Mbp (so a 40 Mbp assembly could hold roughly 10 genomes at the very most), and the fact that you are losing sequences due to small contig size, I think your 4-5 genomes is reasonable given the samples you have.
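
To compare assembly presets on the numbers discussed above, a small helper like the following could be used. This is only a sketch: the function names are hypothetical, and the defaults (1000 bp minimum binnable contig, 4 Mbp average genome) are simply the figures quoted in this thread.

```python
import gzip


def read_contig_lengths(fasta_path):
    """Return the length of every sequence in a FASTA file (plain or .gz)."""
    opener = gzip.open if str(fasta_path).endswith(".gz") else open
    lengths = []
    current = 0
    seen_header = False
    with opener(fasta_path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                if seen_header:
                    lengths.append(current)
                seen_header = True
                current = 0
            else:
                current += len(line.strip())
    if seen_header:
        lengths.append(current)
    return lengths


def assembly_summary(lengths, min_binnable=1000, mean_genome_bp=4_000_000):
    """Summarise an assembly: size, mean contig length, N50, binnable share."""
    total = sum(lengths)
    binnable = sum(n for n in lengths if n >= min_binnable)

    # N50: length of the contig at which the running sum of the
    # longest-first contigs reaches half of the total assembly size.
    n50 = 0
    running = 0
    for n in sorted(lengths, reverse=True):
        running += n
        if running * 2 >= total:
            n50 = n
            break

    return {
        "contigs": len(lengths),
        "total_bp": total,
        "mean_len": total / len(lengths) if lengths else 0.0,
        "n50": n50,
        "binnable_fraction": binnable / total if total else 0.0,
        # Crude ceiling on genome count: binnable bases / average genome size.
        "max_genomes": binnable / mean_genome_bp,
    }
```

For example, `assembly_summary(read_contig_lengths("contigs.fasta.gz"))` run on the assembly from each preset would show whether the mean contig length, N50, or binnable fraction actually improved. The `max_genomes` value is a ceiling, not a prediction, since it assumes every binnable base lands in a complete genome.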

If those two samples are from the same biological material, you could indeed coassemble them to improve the quality of the assembly. There is currently no coassembly rule in the main Snakefile, but you could have a look at this example code where I coassembled some samples; the megahit repo/wiki is also worth a look. You just have to list all the R1 files and all the R2 files, comma-separated, in the megahit call:

metaGEM/workflow/rules/Snakefile_experimental.smk.py

Lines 21 to 65 in 8609ad6

    rule megahitCoassembly:
        input:
            R1 = f'/scratch/zorrilla/soil/coassembly/data/{{borkSoil}}/R1',
            R2 = f'/scratch/zorrilla/soil/coassembly/data/{{borkSoil}}/R2'
        output:
            f'{config["path"]["root"]}/coassembly/coassemblies/{{borkSoil}}/contigs.fasta.gz'
        benchmark:
            f'{config["path"]["root"]}/benchmarks/coassembly.{{borkSoil}}.benchmark.txt'
        shell:
            """
            set +u;source activate {config[envs][metabagpipes]};set -u;
            cd $SCRATCHDIR

            echo -n "Copying qfiltered reads to $SCRATCHDIR ... "
            cp -r {input.R1} {input.R2} $SCRATCHDIR
            echo "done. "

            R1=$(ls R1/|tr '\n' ','|sed 's/,$//g')
            R2=$(ls R2/|tr '\n' ','|sed 's/,$//g')

            mv R1/* .
            mv R2/* .

            echo -n "Running megahit ... "
            megahit -t {config[cores][megahit]} \
                --presets {config[params][assemblyPreset]} \
                --min-contig-len {config[params][assemblyMin]} \
                --verbose \
                -1 $R1 \
                -2 $R2 \
                -o tmp;
            echo "done. "

            echo "Renaming assembly ... "
            mv tmp/final.contigs.fa contigs.fasta

            echo "Fixing contig header names: replacing spaces with hyphens ... "
            sed -i 's/ /-/g' contigs.fasta

            echo "Zipping and moving assembly ... "
            gzip contigs.fasta
            mkdir -p $(dirname {output})
            mv contigs.fasta.gz $(dirname {output})
            echo "Done. "
            """
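
Outside of Snakemake, the key step in the rule above (joining all R1 files and all R2 files into comma-separated lists) can be sketched in a few lines of Python. The function name and the file names below are hypothetical. The only megahit options used are the ones already present in the rule (`-t`, `--presets`, `--min-contig-len`, `-1`, `-2`, `-o`), and `meta-sensitive` is just one of megahit's documented presets, not a recommendation.

```python
import shlex


def megahit_coassembly_cmd(pairs, out_dir, threads=8, preset=None,
                           min_contig_len=1000):
    """Build a megahit argument list that co-assembles several libraries.

    pairs: list of (r1_path, r2_path) tuples. Passing the files as tuples
    keeps the i-th R1 file aligned with the i-th R2 file, which the
    shell version relies on `ls` ordering to achieve.
    """
    if not pairs:
        raise ValueError("need at least one (R1, R2) pair")

    r1 = ",".join(r1_path for r1_path, _ in pairs)
    r2 = ",".join(r2_path for _, r2_path in pairs)

    cmd = ["megahit", "-t", str(threads),
           "--min-contig-len", str(min_contig_len),
           "-1", r1, "-2", r2, "-o", out_dir]
    if preset:
        cmd += ["--presets", preset]
    return cmd


# Hypothetical file names for two sequencing runs of the same sample.
pairs = [("runA_R1.fastq.gz", "runA_R2.fastq.gz"),
         ("runB_R1.fastq.gz", "runB_R2.fastq.gz")]
print(shlex.join(megahit_coassembly_cmd(pairs, "coassembly_out",
                                        preset="meta-sensitive")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, or printed as above and pasted into a job script.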

Finally, you may want to bin your contigs using coverage information from more samples, which has been shown to improve binning results.
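
One common way to obtain coverage across samples is to map every sample's reads against every assembly. The sketch below only does the bookkeeping for that and is an illustration under that assumption: the function name and sample IDs are hypothetical, and it does not run any mapper.

```python
from itertools import product


def cross_mapping_jobs(samples):
    """List every (assembly, reads) combination for multi-sample coverage.

    With N samples this yields N * N mapping jobs, so each contig ends up
    with a coverage vector of length N instead of a single value.
    """
    return [{"assembly": assembly, "reads": reads}
            for assembly, reads in product(samples, repeat=2)]


jobs = cross_mapping_jobs(["soil_rep1", "soil_rep2", "soil_rep3"])
print(len(jobs), "mapping jobs")
```

Because the job count grows quadratically with the number of samples, it is worth checking compute budgets before adding many samples to the cross-mapping set.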

Hope this helps!
Best wishes,
Francisco

ecairns62 added the question Further information is requested label Dec 4, 2024

ecairns62 assigned franciscozorrilla Dec 4, 2024
