Tools to analyse and extract genomic data.
Perl and R script to get assembly/FASTA information. The R script is embedded in the Perl code and plots the histogram of contig lengths.
L10-L90, N10-N90, histogram of contig/scaffold lengths (requires an installation of R), number of contigs, assembly size, longest contig length, mean contig length, total number of N's, average number of N's per contig, IUPAC bases other than ATGC (%), GC content (%).
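
For readers unfamiliar with the Nx/Lx metrics, here is a minimal Python sketch of how they are commonly defined. It is not the Perl script's code. The function names `nx_lx` and `gc_percent` are hypothetical, and the choice to compute GC content over A/C/G/T bases only is an assumption; the script may divide by the full sequence length instead.

```python
def nx_lx(lengths, x):
    """Return (Nx, Lx) for a list of contig lengths; x is a percentage, e.g. 50.

    Nx is the length of the contig at which the running total of the
    longest-first contigs reaches x% of the assembly size; Lx is the
    number of contigs needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * x / 100
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("no contigs given")


def gc_percent(seq):
    """GC content in percent, ignoring N and other IUPAC codes (assumption)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100 * (seq.count("G") + seq.count("C")) / acgt


if __name__ == "__main__":
    contig_lengths = [80, 70, 50, 40, 30, 20]
    for x in range(10, 100, 10):
        n, l = nx_lx(contig_lengths, x)
        print(f"N{x}={n} L{x}={l}")
```

Sorting once and scanning the running total keeps the calculation linear after the sort, which is adequate even for assemblies with many contigs.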
Python script to fetch a specific chromosome/scaffold/contig from a FASTA file. Requires Python 3 or above.
FASTA file with the specified chromosome/scaffold header and sequence.
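
The lookup can be sketched in a few lines of Python. This is an illustration only, not the script itself: `fetch_record` is a hypothetical name, and matching on the first whitespace-delimited word of the header is an assumption about how the sequence name is compared.

```python
def fetch_record(lines, name):
    """Return (header, sequence) for the first record whose ID equals name.

    `lines` is any iterable of FASTA lines, such as an open file handle.
    Returns None when no record matches.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                break  # the wanted record has ended
            fields = line[1:].split()
            if fields and fields[0] == name:
                header = line
        elif header is not None:
            chunks.append(line.strip())
    if header is None:
        return None
    return header, "".join(chunks)


if __name__ == "__main__":
    demo = [">chr1 demo\n", "ATGC\n", "GGCC\n", ">chr2\n", "TTAA\n"]
    print(fetch_record(demo, "chr1"))
```

Reading line by line and stopping at the next header avoids loading the whole genome into memory when only one sequence is wanted.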
Python script to get the reverse, complement or reverse complement of sequences in a FASTA file. Masked sequences are not affected and the case is maintained. Requires Python 3 or above.
(-c : 0 for reverse; 1 for complement; 2 for reverse complement)
FASTA file with the desired operation performed on every sequence in the file.
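
The three modes can be illustrated with a short Python sketch. The name `transform` is hypothetical and the mode numbers mirror the `-c` values listed above. Leaving N and other non-ACGT characters unchanged is an assumption about what "masked sequences will not be affected" means.

```python
# Translation table covering both cases, so lower-case (soft-masked)
# bases stay lower-case. Characters not listed (N, n, IUPAC codes)
# pass through unchanged.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def transform(seq, mode):
    """Apply mode 0 (reverse), 1 (complement) or 2 (reverse complement)."""
    if mode == 0:
        return seq[::-1]
    if mode == 1:
        return seq.translate(COMPLEMENT)
    if mode == 2:
        return seq.translate(COMPLEMENT)[::-1]
    raise ValueError("mode must be 0, 1 or 2")


if __name__ == "__main__":
    for mode in (0, 1, 2):
        print(mode, transform("ATGcnN", mode))
```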
Shell script for the complete variant calling process. This script requires the paths to GATK (jar file) and PICARD (jar file), and a global installation of BWA (http://bio-bwa.sourceforge.net/). It prompts for the names of the paired-end read files (FASTQ) and the genome file (FASTA). The following steps are carried out:
- Genome indexing
- Quality control and trimming of reads
- Alignment of reads against the genome
- Sorting SAM file by coordinate and conversion to BAM
- Getting sequence depth
- Building index for BAM file
- Creating realignment targets
- Realigning indels
- Variant calling
- Extracting SNPs and INDELS
- Filtering SNPs and INDELS
- Base Quality Score Recalibration (BQSR)
- Calling final variants
- Extracting final SNPs and INDELS
- Final filtering of SNPs and INDELS
VCF files containing the SNPs and INDELS present in the sample with respect to the genome.
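
As a rough illustration of how the first few steps chain together, the Python sketch below builds the commands for genome indexing, alignment, coordinate sorting with BAM conversion, and BAM indexing. It is not the shell script: the function names, the `sample` file prefix and the exact options are assumptions, and the trimming, depth, GATK and filtering steps are left out because their settings are not given here.

```python
import subprocess


def plan_first_steps(genome, reads1, reads2, picard_jar, prefix="sample"):
    """Return a list of (argv, stdout_path) pairs; stdout_path may be None.

    All file names derived from `prefix` are hypothetical.
    """
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    return [
        # Genome indexing
        (["bwa", "index", genome], None),
        # Alignment of paired-end reads; bwa mem writes SAM to stdout
        (["bwa", "mem", genome, reads1, reads2], sam),
        # Sort by coordinate and convert SAM to BAM
        (["java", "-jar", picard_jar, "SortSam",
          f"I={sam}", f"O={bam}", "SORT_ORDER=coordinate"], None),
        # Build the BAM index
        (["java", "-jar", picard_jar, "BuildBamIndex", f"I={bam}"], None),
    ]


def run(plan):
    """Execute each step in order, stopping at the first failure."""
    for argv, stdout_path in plan:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as out:
                subprocess.run(argv, check=True, stdout=out)


if __name__ == "__main__":
    # Dry run: print the commands without executing them.
    for argv, stdout_path in plan_first_steps(
            "genome.fasta", "reads_1.fastq", "reads_2.fastq", "picard.jar"):
        suffix = f" > {stdout_path}" if stdout_path else ""
        print(" ".join(argv) + suffix)
```

Separating the plan from its execution makes it possible to inspect the commands before anything runs.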
Perl script to create a synthetic genome/chromosome using a random number generator. The program prompts for the number of bases required and the name of the output file. Only the bases A, T, G and C are written to the file, in random order.
A FASTA file consisting of a header and a randomly generated genome.
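
The same idea is sketched here in Python rather than Perl. The function names, the default header text and the 60-character line width are assumptions made for the example and are not taken from the script.

```python
import random


def random_genome(n_bases, seed=None):
    """Return a string of n_bases characters drawn uniformly from A, T, G, C."""
    rng = random.Random(seed)  # a seed makes the output reproducible
    return "".join(rng.choice("ATGC") for _ in range(n_bases))


def to_fasta(sequence, header="synthetic_genome", width=60):
    """Format a sequence as FASTA text with fixed-width sequence lines."""
    lines = [f">{header}"]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    genome = random_genome(150, seed=1)
    with open("synthetic.fasta", "w") as out:
        out.write(to_fasta(genome))
```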
Perl script that can:
- Find all possible k-mers in a given sequence
- Find all unique k-mers
- Find the shortest common superstring using the unique k-mers

The program will ask for a string and the size of k-mers to compute the above.
Unique k-mers, Shortest Common Superstring
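
The three operations can be sketched in Python as follows. The superstring step uses greedy pairwise merging, a common approximation that does not always find the shortest result; whether the Perl script uses this or an exact method is not stated here, so treat it as a swapped-in technique. All function names are hypothetical.

```python
def kmers(seq, k):
    """All k-mers of seq, in order of position (duplicates kept)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def unique_kmers(seq, k):
    """Distinct k-mers in order of first occurrence."""
    return list(dict.fromkeys(kmers(seq, k)))


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_superstring(strings):
    """Repeatedly merge the pair with the largest overlap (approximation)."""
    strings = list(dict.fromkeys(strings))
    while len(strings) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left, index of right)
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    n = overlap(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        merged = strings[i] + strings[j][n:]
        strings = [s for idx, s in enumerate(strings) if idx not in (i, j)]
        strings.append(merged)
    return strings[0] if strings else ""


if __name__ == "__main__":
    uniq = unique_kmers("ATGCATG", 3)
    print(uniq)
    print(greedy_superstring(uniq))
```

The pairwise search is quadratic in the number of k-mers per merge, so this sketch suits short input strings rather than whole genomes.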