BATH was built on top of the existing HMMER3 code base. Users who are familiar with HMMER will find that BATH uses many of the same conventions. This tutorial will focus on getting you familiar with the five BATH tools listed below and, to avoid redundancy, will link to the HMMER user guide where applicable.
There are two sections in this tutorial. The first section - Input files - will cover the tools that are used to prepare your data before you begin a search. The second section - Running bathsearch - will focus on using BATH to perform translated homology search, and on interpreting the search results. All the necessary files to complete the practices are located in the directory BATH/tutorial/. You should cd into this directory before running the practice commands. If you have not run 'make install' you will need to add the path to the BATH/src/ directory to the commands.
bathsearch - search a DNA sequence database (or genome) for instances of one or more query proteins. The query can consist of a file of pHMMs (produced using bathbuild or bathconvert - see Practices 1 and 5 below) or a file containing sequences or sequence alignments (see Practice 11).
Usage: bathsearch [options] <protein-queryfile> <DNA-targetfile>
bathbuild - build BATH formatted profile hidden Markov models (pHMMs) from input multiple sequence alignments (MSAs) or unaligned sequences and save to file
Usage: bathbuild [-options] <hmmfile_out> <msa_or_seq_file_in>
bathstat - show summary statistics for a BATH formated pHMM file
Usage: bathstat [-options] <hmmfile_in>
bathconvert - convert HMMER formated pHMM files to BATH formated pHMM files
Usage: bathconvert [-options] <hmmfile_out> <hmmfile_in>
bathfetch - copy selected pHMMs from an HMMER or BATH formatted file (converting if necessary) to a new BATH formated pHMM file
Usage: bathfetch [options] <hmmfile_in> <key> (retrieves HMM named <key>)
Usage: bathfetch [options] -f <hmmfile_in> <keyfile> (retrieves all HMMs in <keyfile>)
Usage: bathfetch [options] --index <hmmfile_in> (indexes <hmmfile>)
Before you begin using BATH, it will be helpful to become familiar with the file types it uses. Running bathsearch requires two input files; a protein query file and a DNA target file. The target file must contain one or more DNA sequences in a recognizable unaligned sequence or multiple sequence alignment (MSA) format. Accepted unaligned sequence formats include fasta, embl, and genbank. Accepted MSA formats include stockholm, a2m, afa, psiblast, clustal, and phylip.
BATH's installation includes a branch of the Easel software suite developed by the Eddy/Rivas Lab. The Easel miniapps are a set of tools designed to perform a number of operations on MSA and unaligned sequence files. To familiarize yourself with those tools see the HMMER user guide (pages 145-204).
A bathsearch query file contains the proteins you wish to search for in the target DNA. The preferred format for query files is a BATH formated pHMM file (although you may also use an MSA or unaligned sequence file - see practice 11). If you are not interested in taking advantage of BATH's frameshift-aware algorithms, you can also use a HMMER formatted pHMM file along with the '--nofs' flag. The rest of this section will focus on practices to get you acquainted with the BATH tools that are used to create and manipulate BATH formated pHMM files.
Practice 1: building a pHMM from an MSA using bathbuild
The sensitivity of BATH is powered, in large part, by the use of pHMMs. The pHMM files used by BATH are nearly identical to the ones used by HMMER, but contain additional information needed to perform accurate frameshift-aware translations and provide reliable e-values for the alignments. This additional information includes the frameshift rate and codon translation table to be used by bathsearch as well as tau and lambda values that define the curve for the pHMMs score distribution from the frameshift-aware Forward algorithm. If you would like more information on pHMM files see the HMMER user guide (page 208).
BATH formated pHMMs can be created from MSA files or unaligned sequence files using the tool bathbuild. The file MET.msa contains two stockholm formatted protein MSAs (note that stockholm is the only MSA format that allows multiple MSAs in a single file). You can build pHMMs from those MSAs and save them to the file MET.bhmm by running the following command:
% bathbuild MET.bhmm MET.msa
(note the file suffix '.bhmm' - this can help distinguish BATH formated pHMM files from HMMER formatted ones, which often have the suffix '.hmm')
The summary output that is printed to your stdout should resemble the text below (the exact CPU and elapsed time will vary):
# input file: MET.msa
# output HMM file: MET.bhmm
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# idx name nseq alen mlen codon_tbl eff_nseq re/pos description
# ------ -------------------- ----- ----- ----- --------- -------- ------ -----------
1 metC 11 483 409 1 0.60 0.591 Cystathionine beta-lyase
2 metG 24 494 458 1 0.62 0.589 Methionine--tRNA ligase
# CPU time: 4.38u 0.00s 00:00:04.38 Elapsed: 00:00:02.33
The following is a brief description of each of the above fields.
idx Number, in order of the MSA file.
name Name of the pHMM.
nseq Number of sequences in the alignment this pHMM was built from.
alen Length of alignment - number of columns in the MSA.
mlen Length of the pHMM - number of match states.
codon_tbl The NCBI codon translation table ID assumed for the target DNA
eff_nseq Effective sequence number. This is the “effective” number of independent sequences that bathbuild’s default “entropy weighting” step decided on, given the phylogenetic similarity of the nseq sequences in the input alignment. The higher the number the more diversity there is among the sequences in the MSA.
re/pos Mean positional relative entropy, in bits. This can be ignored by most users.
description Description of the protein family - may be blank.
Practice 2: building a pHMM from an MSA using bathbuild with an alternate codon translation table
One of the fields that distinguishes a BATH formatted pHMM file from an HMMER formated pHMM file is an NCBI codon translation table ID. The correct codon table depends on the origins of the target DNA you intend to search the pHMMs against. When you run bathsearch, selecting the correct codon table will produce the highest quality alignments. Ensuring that the pHMMs were built with that same codon table will produce the most accurate e-values for those alignments.
By default, bathbuild will use the standard code employed by eukaryotic nuclear DNA. To use an alternate codon translation table include the option --ct followed by a table ID from the list below:
id description
--- -----------------------------------
1 Standard
2 Vertebrate mitochondrial
3 Yeast mitochondrial
4 Mold, protozoan, coelenterate mitochondrial; Mycoplasma/Spiroplasma
5 Invertebrate mitochondrial
6 Ciliate, dasycladacean, Hexamita nuclear
9 Echinoderm and flatworm mitochondrial
10 Euplotid nuclear
11 Bacterial, archaeal; and plant plastid
12 Alternative yeast
13 Ascidian mitochondrial
14 Alternative flatworm mitochondrial
16 Chlorophycean mitochondrial
21 Trematode mitochondrial
22 Scenedesmus obliquus mitochondrial
23 Thraustochytrium mitochondrial
24 Pterobranchia mitochondrial
25 Candidate Division SR1 and Gracilibacteria
In practice 10 you will search pHMMs built from MET.msa against a target sequence from the genome of an endosymbiotic bacteria that uses codon table 4. Running the following command will build the pHMMs using the correct codon table for that target:
% bathbuild --ct 4 MET-ct4.bhmm MET.msa
The summary output that is printed to your stdout should resemble the text below (the exact CPU and elapsed time will vary):
# input file: MET.msa
# output HMM file: MET.bhmm
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# idx name nseq alen mlen codon_tbl eff_nseq re/pos description
# ------ -------------------- ----- ----- ----- --------- -------- ------ -----------
1 metC 11 483 409 1 0.60 0.591 Cystathionine beta-lyase
2 metG 24 494 458 1 0.62 0.589 Methionine--tRNA ligase
# CPU time: 4.38u 0.00s 00:00:04.38 Elapsed: 00:00:02.33
Practice 3: building a pHMM from an unaligned sequences file
If your queries are single unaligned sequences rather than MSAs you can still build HMMs using bathbuild and the flag '--unali'. The file three_seqs.fa contains three unaligned protein sequences. To build three HMMs (one for each sequence) use the following command.
% bathbuild --unali three_seqs.bhmm three_seqs.fa
The summary output that is printed to your stdout should resemble the text below (the exact CPU and elapsed time will vary):
# input file: three_seqs.fa
# output HMM file: three_seqs.bhmm
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# idx name nseq len mlen codon_tbl eff_nseq re/pos description
# ------ -------------------- ----- ----- ----- --------- -------- ------ -----------
1 AT1G01010.1 1 429 429 1 1.00 0.591
2 AT1G01020.1 1 245 245 1 1.00 0.552
3 AT1G01030.1 1 358 358 1 1.00 0.600
# CPU time: 2.62u 0.00s 00:00:02.62 Elapsed: 00:00:02.14
Practice 4: summerizing pHMM files with bathstat
Since a pHMM file may contain any number of individual models, it is useful to be able to quickly summarize the contents. The tool bathstat is designed to provide such a summary for BATH formated pHMM files. The following command will summarize the pHMM file built in practice 1:
% bathstat MET.bhmm
This command should produce the following output to stdout:
#
# idx name accession nseq eff_nseq mlen codon_tbl re/pos
# ------ -------------------- ------------ -------- -------- ------ --------- ------
1 metC - 11 0.60 409 1 0.53
2 metG - 24 0.62 458 1 0.53
The fields are mainly the same as those produced by bathbuild, and detailed in practice 1, except for the accession field which may contain an alphanumeric identifier for the protein family or be left blank if no accession is listed for the pHMM.
Practice 5: converting an HMMER formated pHMM file to BATH format using bathconvert
If you have an existing HMMER formatted pHMM file and want to use it to run bathsearch with frameshift detectionG you will first need to convert it to the BATH format using bathconvert. The file tRNA-proteins.hmm contains 12 pHMMs in HMMER3 format. The following command will create the BATH formatted file tRNA-proteins.bhmm containing the same twelve pHMMs:
% bathconvert tRNA-proteins.bhmm tRNA-proteins.hmm
Your summary output should match that shown below.
# input HMM file: tRNA-proteins.hmm
# output HMM file: tRNA-proteins.bhmm
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# idx name nseq mlen codon_tbl eff_nseq re/pos description
# ------ -------------------- ----- ----- --------- -------- ------ -----------
1 ATE_N 30 78 1 1.11 0.726 Arginine-tRNA-protein transferase, N terminus
2 GlutR_N 12 152 1 0.87 0.590 Glutamyl-tRNAGlu reductase, N-terminal domain
3 PTH2 10 116 1 0.74 0.589 Peptidyl-tRNA hydrolase PTH2
4 RtcB 30 459 1 0.83 0.590 tRNA-splicing ligase RtcB
5 TGT 15 238 1 0.80 0.589 Queuine tRNA-ribosyltransferase
6 Thg1 30 131 1 0.69 0.589 tRNAHis guanylyltransferase
7 Trm56 11 121 1 0.64 0.590 tRNA ribose 2'-O-methyltransferase, aTrm56
8 tRNA-synt_1_2 30 185 1 0.91 0.590 Leucyl-tRNA synthetase, Domain 2
9 tRNA-synt_1c_C 14 192 1 0.81 0.591 tRNA synthetases class I (E and Q), anti-codon binding domain
10 tRNA-synt_2d 19 247 1 0.73 0.592 tRNA synthetases class II core domain (F)
11 tRNA-Thr_ED 12 136 1 0.63 0.590 Archaea-specific editing domain of threonyl-tRNA synthetase
12 TruB_C 11 56 1 1.64 0.994 tRNA Pseudouridine synthase II, C terminal
Practice 6: changing the codon table of a BATH formated pHMM using bathconvert
You can also use bathconvert to change the codon table of an existing BATH pHMM file using the --ct flag. This is faster than rebuilding from the original MSA. The following command will create the file tRNA-proteins-ct11.bhmm containing the same 12 pHMMs as tRNA-proteins.bhmm but modified to use NCBI codon translation table 11:
% bathconvert --ct 11 tRNA-proteins-ct11.bhmm tRNA-proteins.bhmm
This should produce the following output:
# input HMM file: tRNA-proteins.bhmm
# output HMM file: tRNA-proteins=ct11.bhmm
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# idx name nseq mlen codon_tbl eff_nseq re/pos description
# ------ -------------------- ----- ----- --------- -------- ------ -----------
1 ATE_N 30 78 11 1.11 0.726 Arginine-tRNA-protein transferase, N terminus
2 GlutR_N 12 152 11 0.87 0.590 Glutamyl-tRNAGlu reductase, N-terminal domain
3 PTH2 10 116 11 0.74 0.589 Peptidyl-tRNA hydrolase PTH2
4 RtcB 30 459 11 0.83 0.590 tRNA-splicing ligase RtcB
5 TGT 15 238 11 0.80 0.589 Queuine tRNA-ribosyltransferase
6 Thg1 30 131 11 0.69 0.589 tRNAHis guanylyltransferase
7 Trm56 11 121 11 0.64 0.590 tRNA ribose 2'-O-methyltransferase, aTrm56
8 tRNA-synt_1_2 30 185 11 0.91 0.590 Leucyl-tRNA synthetase, Domain 2
9 tRNA-synt_1c_C 14 192 11 0.81 0.591 tRNA synthetases class I (E and Q), anti-codon binding domain
10 tRNA-synt_2d 19 247 11 0.73 0.592 tRNA synthetases class II core domain (F)
11 tRNA-Thr_ED 12 136 11 0.63 0.590 Archaea-specific editing domain of threonyl-tRNA synthetase
12 TruB_C 11 56 11 1.64 0.994 tRNA Pseudouridine synthase II, C terminal
# CPU time: 8.65u 0.00s 00:00:08.65 Elapsed: 00:00:08.67
Practice 7: indexing a pHMM file and copying a single pHMM using bathfetch
If you only need to search with a single pHMM but it is located in a file with multiple pHMMs, you can save time by copying the desired pHMM to a new file using bathfetch. If the original file contains a large number of pHMMs, you may want to create an index file to speed up the fetch process. The following command will create the index file tRNA-proteins.bhmm.ssi for the BATH pHMM file created in Practice 4.
% bathfetch --index tRNA-proteins.bhmm
The summary output should read as follows:
Working... done.
Indexed 12 HMMs (12 names and 12 accessions).
SSI index written to file tRNA-proteins.bhmm.ssi
Whether or not you choose to create an index you will need the name of the pHMM you wish to copy to use as a key. The command below will copy the pHMM PTH2 from the tRNA-proteins.bhmm. The -o flag will direct the copied pHMM to the specified output file (PTH2.hmm in this case). Otherwise, it will be printed to standard out.
% bathfetch -o PTH2.bhmm tRNA-proteins.bhmm PTH2
The summary output should simply read as:
Retrieved HMM PTH2.
Practice 8: copying and converting multiple pHMMs using bathfetch
You can also use bathfetch to copy multiple pHMMs. To do so you will need to create a key file that contains the names of all the pHMMs you wish to copy, with one name per line, and use the -f flag. If the original pHMM file is in HMMER format bathfetch will automatically convert it to BATH format. The following command will copy all 3 of the pHMMs listed in the key file tRNA-synthetases-key.txt from an HMMER formated pHMM file, convert them to BATH format, and print them to the output file tRNA-synthetases.bhmm.
% bathfetch -f -o tRNA-synthetases.bhmm tRNA-proteins.hmm tRNA-synthetases-key.txt
The summary output should simply read as:
Retrieved 3 HMMs.
As with bathconvert, you can also use the --ct flag with bathfetch to change the codon table.
This section of the tutorial will focus on the tool bathsearch. This tool allows the user to perform translated annotation of protein-coding DNA even when mutations or sequencing errors have introduced frameshifts. Each of the practices in this section will involve running bathsearch with a different set of input formats, options, and outputs.
Practice 9: running a simple bathsearch and reading the output
Bathsearch requires two inputs - the query and the target. In this practice, you will use the single pHMM in the file PTH2.bhmm as the query. For the target, you will use a single DNA sequence in the file target-PTH2.fa. The -o flag is used to direct the standard output to the file PTH2.out (otherwise it will be printed to stdout).
% bathsearch -o PTH2.out PTH2.bhmm target-PTH2.fa
We can now open the file PTH2.out and see that the output is organized into the following sections:
- File Header - lines begin with '#' and contain basic information about the search parameters
# query HMM file: PTH2.bhmm
# target sequence database: target-PTH2.fa
# frameshift probability: 0.010000
# codon translation table 1
# output directed to file: PTH2.out
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- Query Header - includes a summary of each query and a hits list sorted by E-value. For each hit, the query header lists the E-value, bit score, and bias score adjustment (for more information on bias scores see pages 60-61 of the HMMER user guide. This is followed by the name of the target sequence where the hit was located, the target sequence position for the start and end of the alignment, the number of frameshifts and stop codons in that alignment, and finally a target description (which may be blank).
Query: PTH2 [M=116]
Accession: PF01981.11
Description: Peptidyl-tRNA hydrolase PTH2
Scores for complete hits:
E-value score bias Sequence start end shifts stops Description
------- ------ ----- -------- ----- ----- ------ ----- -----------
1.3e-34 110.1 0.3 seq1 672 325 0 0
7.4e-27 90.7 0.0 seq1 2670 2343 3 1
2.1e-25 86.0 0.0 seq1 1486 1731 0 0
3.3e-11 40.2 0.0 seq1 1273 1387 1 0
-
Annotation Lines- for each hit listed in the query header, bathsearch will produce an annotation line containing useful information about the hit. After the line 'Annotation for each hit (and alignments):' these annotation lines (as well as the alignments) will appear, sorted first by target sequence and then by e-value.
As in the query header, the annotations line lists the score, bias, and E-value for each hit. It also lists three types of coordinates for the hit - the alignment start and end coordinates for both the query (hmm-from & hmm-to) and the target (ali-from & ali-to), as well as the envelope coordinates (env-from & env-to). The envelope is the region of the target that bathsearch has identified as containing the homology (the alignment coordinates are always contained within the envelope). It is the envelope coordinates that bound the target subsequence used to calculate the score, bias, and E-value. An explanation of the characters seen after the coordinates ('.','[', & ']') can be found on page 38 of the HMMER user guide. The annotation line also lists the number of frameshifts and stop codons in the alignment (shift & stops), the full length of the target sequence (sq-len), and the alignment's accuracy score (acc) which is the average expected per residue accuracy of the alignment.
Below is the annotation line for the first hit in the file PTH2.out
Annotation for each hit (and alignments):
>> seq1
score bias Evalue hmm-from hmm-to ali-from ali-to env-from env-to shifts stops sq-len acc
------ ----- --------- -------- ------- --------- --------- --------- --------- ------ ----- --------- ----
! 110.1 0.3 1.3e-34 2 116 .] 672 325 .. 675 325 .. 0 0 3000 0.92
-
Alignment - Below each annotation line bathsearch prints the alignment for that query-target hit. A typical bathsearch alignment will contain at least the following five rows (in order from top to bottom): (1) the query row, (2) the match row, (3) the translation row, (4) the target row, and (5) the posterior probability row. If the pHMM was built from an MSA containing consensus structure or reference annotations those will be visible on separate CS and RF rows above the query row. There are also three types of columns: (1) a match in which a query amino is aligned to a target codon or quasi-codon, (2) a deletion in which the query amino acid is aligned to target gap characters, or (3) an insertion in which the target codon is aligned to a query gap character.
The query row begins with the name of the query pHMM followed by the coordinates of the first amino acid on that line of the alignment and ends with the coordinates of the last amino acid on that line of the alignment. For each column, the query row shows either the query consensus letter, for matches and deletions, or a gap character ('.') for insertions.
The target row begins with the name of the target sequence followed by the coordinates of the first nucleotide on that line of the alignment and ends with the coordinates of the last nucleotide on that line of the alignment. For each column, the target row shows the target codon or pseudo-codon that has been aligned to the query. In the case of a deletion, the target line will print three gap characters ('---') in place of the codon.
The translation row shows the amino acid translations of the codons and pseudo-codons on the target row. The match row shows which columns in the alignment are positive scoring. Exact matches are shown as the matched amino acid residue (in lowercase) and positive scoring mismatches are shown as a '+'. Finally, the posterior probability (PP) row gives the expected accuracy for each position of the alignment.
Below is the first line of the first alignment in the file PTH2.out. It contains a CS row in addition to the five basic rows detailed above. All the columns in this line of the alignment are matches and, as there are no frameshifts, no quasi-codons are present.
Alignment:
score: 110.1 bits
E E E E E E E E S C C S S - H H H H H H H H H H H H H CS
PTH2 2 l k q v i v v r t d l k m g k G k l a a q v a h a a v 28
+ k v + v v r t d l m + k G k + a a q + + h a +
V K L V L V V R T D L G M T K G K I A A Q C S H A T L
seq1 672 GTG AAG CTT GTG CTG GTT GTG AGG ACA GAT CTG GGC ATG ACC AAA GGC AAA ATC GCC GCC CAG TGC TCG CAT GCA ACG CTC 592
8 9 9 * * * * * * * * * * * * * * * * * * * * * * 8 7 PP
-
Query Footer - each query's output will conclude with a footer that provides information about the hit filtering process inside bathsearch. The average user can ignore this data. For those who are interested, more information on these data can be found on page 54 of the HMMER user guide. There will also be a couple of lines listing run times and a line with just '//', indicating the end of the output for the query.
Below is the query footer from PTH2.out.
Internal pipeline statistics summary:
-------------------------------------
Query model(s): 1 (116 nodes)
Target sequence(s): 1 (6000 residues searched)
Residues passing SSV filter: 1503 (0.251); expected (0.02)
Residues passing bias filter: 1503 (0.251); expected (0.02)
Residues passing Vit filter: 1401 (0.234); expected (0.001)
Residues passing Fwd filter: 1961 (0.327); expected (1e-05)
Total number of hits: 4 (0.173)
# CPU time: 0.04u 0.01s 00:00:00.05 Elapsed: 00:00:00.05
# Mc/sec: 12.20
//
- File Footer - If bathsearch did not encounter any errors the last line of the file will read '[ok]'
Practice 10: running a bathsearch search on a target with an alternate codon translation table
As discussed in Practice 2, some DNA sequences use alternate codon translation tables and the best results are achieved by specifying the correct codon table both when building the pHMMs and when performing the search. To prevent searches with mismatched codon tables, bathsearch responds to such searches with an error message. Running the following command will attempt a mismatches search by searching the pHMMs in MET.bhmm, built with the standard codon table, against the target DNA in the file target-MET.fa while specifying the use of the alternate codon table 4.
% bathsearch --ct 4 -o MET.out MET.bhmm target-MET.fa
This will result in the following error message:
Error: Requested codon translation tabel ID 4 does not match the codon translation tabel ID of the HMM file MET.bhmm. Please either run bathsearch with option '--ct 1' or run bathconvert with option '--ct 4'.
In this case, we already have a pHMM file built with the correct codon table and can skip running bathconvert. The following command will use that pHMM file to run the same search without a codon table mismatch.
% bathsearch --ct 4 -o MET-ct4.out MET-ct4.bhmm target-MET.fa
The file MET-ct4.out should contain a single hit between each of the pHMMS in MET-ct4.bhmm and the DNA sequence in target-MET.fa.
Practice 11: running bathsearch with a sequence query
If you do not wish to build the query pHMMs ahead of time you can use a sequence file (MSA or unaligned) as the query and bathsearch will build the pHMMs on the fly. However, depending on the number and length of the proteins, building pHMMs can be time-consuming. If you choose to use a sequence query file it is recommended that you use the '--hmmout' flag to save the pHMMs for use in any subsequent searches. The following command uses the single unaligned sequence in the file gidA.fa as the query, building a pHMM for that sequence and printing it the to file gidA.bhmm. The use of the '--ct' flag will determine the codon table used both to build the pHMM and to conduct the search. The standard output is directed to the file gidA.out using the '-o' flag.
% bathsearch --ct 4 --hmmout gidA.bhmm -o gidA.out gidA.fa target-gidA.fa
The file gidA.bhmm will now contain a single pHMM built with codon table 4 and gidA.out will contain output for a single git between that pHMM and the DNA sequence in target-gidA.fa.
Practice 12: producing and interpreting tabular output from bathsearch
In addition to the standard output, bathsearch can also produce a tabular summary file with a more easily parsable list of the hits found in a search. By using the '--tblout' flag you can direct bathsearch to create this tabular output and save it to the file of your choosing. The following command will run the same search as in Practice 8, but with the addition of the '--tblout' flag directing the tabular output to the file PTH2.tbl.
% bathsearch -o PTH2.out --tblout PTH2.tbl PTH2.bhmm target-PTH2.fa
If you open the file PTH2.tbl you will see the following text (file directories and dates may vary):
# target name accession query name accession hmm len hmm from hmm to seq len ali from ali to env from env to E-value score bias shifts stops pipe description of target
#------------------- ---------- -------------------- ---------- --------- --------- --------- --------- --------- --------- --------- --------- --------- ------ ----- ------- ------ ----- ---------------------
seq1 - PTH2 PF01981.11 116 2 116 3000 672 325 675 325 1.3e-34 110.1 0.3 0 0 std -
seq1 - PTH2 PF01981.11 116 1 113 3000 2670 2343 2677 2327 7.4e-27 90.7 0.0 3 1 fs -
seq1 - PTH2 PF01981.11 116 35 116 3000 1486 1731 1426 1731 2.1e-25 86.0 0.0 0 0 fs -
seq1 - PTH2 PF01981.11 116 2 40 3000 1273 1387 1269 1491 3.3e-11 40.2 0.0 1 0 fs -
#
# Program: bathsearch
# Query file: PTH2.bhmm
# Target file: target-PTH2.fa
# Option settings: bathsearch -o PTH2.out --tblout PTH2.tbl PTH2.bhmm target-PTH2.fa
# Current dir: BATH/tutorial
# Date: Wed Mar 1 18:27:58 2023
# [ok]
A brief description of each column header (from left to right) is provided below.
target name Name of the target sequence where the hit is located.
accession Alphanumeric ID for the target sequence (if provided in the target file).
query name Name of the query pHMM.
accession Alphanumeric ID for the query (if provided in the query file).
hmm len Length (number of match states) of the query pHMM.
hmm from The start position of the alignment on the query pHMM.
hmm to The end position of the alignment on the query pHMM.
seq len The length of the target sequence (in nucleotides).
ali from The start position of the alignment on the target sequence. If the hit is located on the reverse complement strand, ali from will be greater than ali to.
ali to The end position of the alignment on the target sequence.
env from The start position of the hit envelope on the target sequence. See practice 9 for an explanation of hit envelopes.
env to The end position of the hit envelope on the target sequence.
E-value The hit e-value.
score The hit bit score.
bias The hit bias adjustment score. See pages 60-61 of the [HMMER user guide](http://eddylab.org/software/hmmer/Userguide.pdf) for an explanation of bias adjustment.
shifts The number of frameshifts reported in the alignment.
stops The number of stop codons reported in the alignment.
pipe The translation pipeline used to produce the hit, either standard 'std' or frameshift aware 'fs'. To reduce runtimes and the potential for false frameshifts bathsearch uses standard (non frameshift aware) translation for hits it determines are unlikely to contain frameshifts. Most users can ignore this column.
description of target Description of the target sequence (if provided in the target file).
Practice 13: locating frameshifts and stop codons in bathsearch alignments using '--frameline'
While both the standard and tabular outputs give the user the count of frameshifts and stop codons in an alignment, the user may also want to locate the quasi and stop codons. Quasi-codons with deletions can be identified by looking for codons with one or two '-' characters in place of a nucleotide. Quasi-codons with insertions can be identified by looking for codons with more than 4 or 5 nucleotides (the nucleotides bathsearch determines to be the insertions will be shown in lowercase). Stop codons can be identified by looking for codons with all two uppercase and one lowercase nucleotide (the nucleotide bathsearch determined to be a substitution will be in lowercase). Below are examples of quasi and stop codons taken from the alignment in gidA.out from Practice 11:
| one nucleotide deletion | two nucleotide deletions | one nucleotide insertion | two nucleotide insertions | stop codon |
| w | g | k | v | a |
| w | g | + | v | + |
| W | G | Q | V | S |
| -GA | --A | CtAA | GTtaT | TaA |
| 6 | 3 | 8 | 7 | 9 |
To make it easier to locate frameshifts and stop codons the '--frameline' flag can be used to add a row to the alignment that numbers the frame of each codon and quasi-codon. This line can be used to locate quasi-codons by looking for a change from one frame to another. Stop codons can be identified on the frameline by a '0'. Note that, for hits on the reverse complement strand of a sequence, the frames will be negative (i.e. -1, -2, & -3).
Running the follwing comand will use the file gidA.bhmm, created in Practice 11, to search a single pHMM against the DNA sequence in the file target-gidA.fa using codon table 4. The '-o' flag will direct the standard output to the file gidA-frameline.out and the '--frameline' flag will add the frameline row to the alignment.
% bathsearch --ct 4 --frameline -o gidA-frameline.out gidA.bhmm target-gidA.fa
The following is an excerpt of four lines from the alignment in gidA-frameline.out. This excerpt shows two frameshifts (one by deletion and one by insertion) as well as one stop codon (the frame is shown directly beneath each codon or quasi-codon). On the first line, the frame changes - from 2 to 1 - due to the deletion of one nucleotide. There is a stop codon on the second line, with a 0 in the frameline. On the third line, the frame changes again - from 1 to 2 - due to a single nucleotide insertion.
gidA 241 g p r y c l s i e g k t l k f g r k p q k l i m e p 266
p + c + i + e p
H P P F C N Q T - K M K H P - T I N I Y R P I I E P
seq1 27209 CAT CCA CCC TTT T-C AAC CAA ACT --- AAA ATG AAA CAT CCA --- ACC ATA AAC ATC TAC AGA CCC ATC ATA GAA CCA 27279
2 2 2 2 1 1 1 1 . 1 1 1 1 1 . 1 1 1 1 1 1 1 1 1 1 1 FRAME
7 7 7 7 1 1 1 1 . 0 0 0 0 0 . 4 4 5 6 7 8 8 9 * * * PP
gidA 267 e a t g s s s v y v n g l s t s m . p i e l q l q l 291
e + + + v + + n g s s + + q l +
E S L D T K T V H L N G T S I S T s N L V I Q L N I
seq1 27280 GAA TaA CTT GAT ACT AAA ACT GTA CAC CTA AAT GGT ACC TCT ATC TCA ACC TCC AAT CTG GTA ATC CAA CTT AAC ATA 27357
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 FRAME
* 9 9 9 * * * * * * * * 9 9 7 7 5 1 5 8 8 9 * * * * PP
gidA 292 l k f t k a f r g a k i i k a g y a i e y d c v c s 317
l k t + + + k + + i e y d c s
L K P T Q H P I D V C V V K S K H T I E Y D V T C S
seq1 27358 CTA AAA CCC ACA CtAA CAC CCA ATT GAT GTC TGT GTT GTT AAG TCC AAA CAC ACC ATA GAA TAC GAT GTA ACT TGT TCA 27436
1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 FRAME
* * * * 8 7 7 8 9 9 * * * * * * * * * * * * * * * * PP
Practice 14: locating frameshifts and stop codons with a tabular output
While the frameline makes it easier to find frameshifts and stop codons in individual alignments, some users may want to see the locations of frameshifts and stop codons across multiple alignments and in a more parseable form. For this reason, the flag '--fstblout' allows the user to create a tabular output of frameshift and stop codon locations for all hits. The following command reruns the same search as in Practice 13, but rather than using '--frameline' it uses '--fstblout' to save frameshift and stop codon locations to the file gidA.fstbl.
% bathsearch --ct 4 -o gidA-fstbl.out --fstblout gidA.fstbl gidA.bhmm target-gidA.fa
If you open the file gidA.fstbl you will see the following text (file directories and dates may vary):
# target name accession query name accession E-value ali from ali to I D S length seq start ali start
#------------------- ----------- -------------------- ---------- --------- --------- --------- ----- ------ --------- ---------
seq1 - gidA - 3e-13 26678 27759 I 2 26747 70
seq1 - gidA - 3e-13 26678 27759 D 1 26815 138
seq1 - gidA - 3e-13 26678 27759 I 1 26958 281
seq1 - gidA - 3e-13 26678 27759 D 2 26995 318
seq1 - gidA - 3e-13 26678 27759 D 1 27029 352
seq1 - gidA - 3e-13 26678 27759 D 2 27076 399
seq1 - gidA - 3e-13 26678 27759 D 2 27116 439
seq1 - gidA - 3e-13 26678 27759 D 1 27141 464
seq1 - gidA - 3e-13 26678 27759 D 1 27221 544
seq1 - gidA - 3e-13 26678 27759 I 1 27370 693
seq1 - gidA - 3e-13 26678 27759 I 1 27647 970
seq1 - gidA - 3e-13 26678 27759 I 1 27696 1019
#
# Program: bathsearch
# Query file: gidA.bhmm
# Target file: target-gidA.fa
# Option settings: /home/u3/gkrause/git/BATH/src/bathsearch -o gidA-fstbl.out --fstblout gidA.fstbl --ct 4 gidA.bhmm target-gidA.fa
# Current dir: /home/u3/gkrause/git/BATH/tutorial
# Date: Mon Mar 11 14:40:19 2024
# [ok]
A brief description of each column header (from left to right) is provided below.
target name Name of the target sequence where the hit is located.
accession Alphanumeric ID for the target sequence (if provided in the target file).
query name Name of the query pHMM.
accession Alphanumeric ID for the query (if provided in the query file).
E-value The hit e-value.
ali from The start position of the alignment on the target sequence. If the hit is located on the reverse complement strand, ali from will be greater than ali to.
ali to The end position of the alignment on the target sequence.
I D S Type of codon/quasi-codons. I - insertion. D - deletion. S - stop codon.
length Number of nucleotides inserted or deleted (0 for stop codons).
seq start Position (in the target sequence) of the start of the quasi-codon or stop codon.
ali start Position (in the alignment) of the start of the quasi-codon or stop codon. The first nucleotide in the alignment = 1.