
Snakefile config

Francisco Zorrilla edited this page Mar 23, 2021 · 14 revisions

The config.yaml file contains a list of parameters that are read in by the Snakefile. Instead of editing the Snakefile every time you want to change a parameter, just create a new copy of the config.yaml file with the new values. Now that's what I call reproducibility.

The config.yaml file looks something like this:

path:
    root: /path/to/your/project/folder/on/the/cluster
    scratch: $SCRATCH_FOLDER_VARIABLE_SPECIFIC_TO_YOUR_CLUSTER
folder:
    data: dataset
    logs: logs
    assemblies: assemblies
    ...
scripts:
    kallisto2concoct: kallisto2concoct.py
    prepRoary: prepareRoaryInput.R
    binFilter: binFilter.py
    ...
cores:
    fastp: 4
    megahit: 48
    crossMap: 24
    ...
params:
    cutfasta: 10000
    assemblyPreset: meta-sensitive
    assemblyMin: 1000
    ...
envs:
    metagem: metagem
    metawrap: metawrap
    prokkaroary: prokkaroary
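
Snakemake loads the file named by its `configfile:` directive into a nested dictionary called `config`, so each value is reached by section and key. The sketch below uses a hand-written dictionary as a stand-in for the parsed file. It contains only the entries shown above (the elided `...` entries are left out), and the helper `get_param` is a hypothetical name used for illustration, not something taken from the metaGEM Snakefile.

```python
# Stand-in for the dictionary Snakemake builds from config.yaml.
# Only entries visible in the excerpt above are included.
config = {
    "path": {"root": "/path/to/your/project/folder/on/the/cluster"},
    "folder": {"data": "dataset", "logs": "logs", "assemblies": "assemblies"},
    "cores": {"fastp": 4, "megahit": 48, "crossMap": 24},
    "params": {
        "cutfasta": 10000,
        "assemblyPreset": "meta-sensitive",
        "assemblyMin": 1000,
    },
}


def get_param(cfg, section, key):
    """Hypothetical helper: return cfg[section][key] or raise a readable error."""
    try:
        return cfg[section][key]
    except KeyError:
        raise KeyError(f"config.yaml has no entry '{section}: {key}'") from None


megahit_threads = get_param(config, "cores", "megahit")
preset = get_param(config, "params", "assemblyPreset")
print(megahit_threads, preset)
```

Inside a real Snakefile the same lookups would be written directly as `config["cores"]["megahit"]` and so on. The helper only adds a clearer message when a copied config file is missing a key.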

Paths

Root

The metaGEM.sh parser automatically sets the root path to the folder you are submitting jobs from. This is where the folders that store the generated files will be created:

~/cluster_login_home/
|-project_X/
|--root/
|---logs
|---dataset
|---qfiltered
|---assemblies
...
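
As a rough illustration of how the tree above relates to the config file, the sketch below joins each `folder` entry onto the root path. It assumes the folder names are simply appended to root, which is what the layout suggests. The actual joining logic in the Snakefile may differ, and `output_dirs` is a hypothetical helper name.

```python
import os


def output_dirs(root, folders):
    """Map each folder key to its location under the root path."""
    return {key: os.path.join(root, name) for key, name in folders.items()}


# Values copied from the config.yaml excerpt; elided entries are omitted.
root = "/path/to/your/project/folder/on/the/cluster"
folders = {"data": "dataset", "logs": "logs", "assemblies": "assemblies"}

dirs = output_dirs(root, folders)
print(dirs["logs"])
```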

Scratch

The scratch path is cluster-specific, and you will likely need to consult the wiki for your institution's cluster to determine how it should be set. Generally there is some directory for high I/O jobs, usually exposed through a variable called something like $SCRATCHDIR, $TMPDIR, or $TMP. The Snakefile assumes that this variable points to a unique location for each job submission. Do not set the scratch path to a single fixed directory if you are submitting jobs in parallel: multiple jobs would then copy and read files from the same temporary directory, which can result in errors.
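
To make the uniqueness requirement concrete, here is a minimal sketch of checking a scratch setting and, where the cluster does not already hand each job its own directory, creating one per job. This is not how metaGEM handles scratch. The function names `resolve_scratch` and `job_scratch_dir` are hypothetical, and the fallback to the system temporary directory is an assumption made so the example runs anywhere.

```python
import os
import tempfile


def resolve_scratch(setting, env=None):
    """Turn a '$VAR' style setting into an existing directory, or None."""
    env = os.environ if env is None else env
    if setting.startswith("$"):
        # Accept both $VAR and ${VAR} spellings.
        path = env.get(setting.lstrip("$").strip("{}"), "")
    else:
        path = setting
    return path if os.path.isdir(path) else None


def job_scratch_dir(base):
    """Create a fresh directory under base so parallel jobs never share one."""
    return tempfile.mkdtemp(prefix="job_", dir=base)


# Fall back to the system temp directory if $TMPDIR is not defined here.
base = resolve_scratch("$TMPDIR") or tempfile.gettempdir()
job_a = job_scratch_dir(base)
job_b = job_scratch_dir(base)
print(job_a != job_b)  # each call gets its own directory

# Clean up the empty example directories.
os.rmdir(job_a)
os.rmdir(job_b)
```

If `resolve_scratch` returns `None` on your cluster, the variable name in config.yaml is probably not the one your scheduler defines, which is worth checking before submitting jobs.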