NAME
fasta36 -
scan a protein or DNA sequence library for similar sequences
fastx36 -
compare a DNA sequence to a protein sequence database,
comparing the translated DNA sequence in forward and reverse
frames.
tfastx36 -
compare a protein sequence to a DNA sequence database,
calculating similarities with frameshifts to the forward and
reverse orientations.
fasty36 -
compare a DNA sequence to a protein sequence database,
comparing the translated DNA sequence in forward and reverse
frames.
tfasty36 -
compare a protein sequence to a DNA sequence database,
calculating similarities with frameshifts to the forward and
reverse orientations.
fasts36 -
compare unordered peptides to a protein sequence
database
fastm36 -
compare ordered peptides (or short DNA sequences) to a
protein (DNA) sequence database
tfasts36 -
compare unordered peptides to a translated DNA sequence
database
fastf36 -
compare mixed peptides to a protein sequence database
tfastf36 -
compare mixed peptides to a translated DNA sequence
database
ssearch36 -
compare a protein or DNA sequence to a sequence database
using the Smith-Waterman algorithm.
ggsearch36 -
compare a protein or DNA sequence to a sequence database
using a global alignment (Needleman-Wunsch)
glsearch36 -
compare a protein or DNA sequence to a sequence database
with alignments that are global in the query and local in
the database sequence (global-local).
lalign36 -
produce multiple non-overlapping alignments for protein and
DNA sequences using the Huang and Miller sim implementation
of the Waterman-Eggert algorithm.
prss36, prfx36 -
discontinued; all the FASTA programs will estimate
statistical significance using 500 shuffled sequence scores
if two sequences are compared.
DESCRIPTION
Release 3.6 of
the FASTA package provides a modular set of sequence
comparison programs that can run on conventional single
processor computers or in parallel on multiprocessor
computers. More than a dozen programs - fasta36,
fastx36/tfastx36, fasty36/tfasty36, fasts36/tfasts36,
fastm36, fastf36/tfastf36, ssearch36, ggsearch36, and
glsearch36 - are currently available.
All the
comparison programs share a set of basic command line
options; additional options are available for individual
comparison functions.
Threaded
versions of the FASTA programs (built by default under
Unix/Linux/MacOS X) run in parallel on modern Linux and Unix
multi-core or multi-processor computers. Accelerated
versions of the Smith-Waterman algorithm are available for
processors with Intel SSE2 or PowerPC Altivec vector
instructions, which can speed up Smith-Waterman
calculations 10- to 20-fold.
In addition to
the serial and threaded versions of the FASTA programs, MPI
parallel versions are available as fasta36_mpi,
ssearch36_mpi, fastx36_mpi, etc. The MPI parallel versions
use the same command line options as the serial and threaded
versions.
Running the FASTA programs
By default, the
FASTA programs are no longer interactive; they are run from
the command line by specifying the program, query.file, and
library.file. Program options must precede the
query.file and library.file arguments:
fasta36 -option1 -option2 -option3 query.file library.file > fasta.output
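As a minimal sketch of this ordering rule, the shell function below assembles a command line with every option placed before query.file and library.file, and prints it rather than running it. The function name fasta_cmdline is hypothetical and not part of the FASTA package; the options in the usage line (-E, -s) are described under "FASTA program options".

```shell
# Hypothetical helper (not part of the FASTA package):
# print a FASTA command line with the options placed first.
# usage: fasta_cmdline program query.file library.file [options...]
fasta_cmdline() {
    prog=$1; query=$2; library=$3
    shift 3
    if [ $# -gt 0 ]; then
        # "$*" joins the remaining arguments (the options) with spaces.
        printf '%s %s %s %s\n' "$prog" "$*" "$query" "$library"
    else
        printf '%s %s %s\n' "$prog" "$query" "$library"
    fi
}

fasta_cmdline fasta36 query.file library.file -E 1.0 -s BL62
# prints: fasta36 -E 1.0 -s BL62 query.file library.file
```

The printed line can be inspected before it is run; redirect the output of the real command to a file as shown above.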
The
"classic" interactive mode, which prompts for a
query.file and library.file, is available with the -I
option. Typing a program name without any arguments
(ssearch36) provides a short help message; program_name
-help provides a complete set of program options.
Program options
MUST precede the query.file and library.file
arguments.
FASTA program options
The default
scoring matrix and gap penalties used by each of the
programs have been selected for high sensitivity searches
with the various algorithms. The default program behavior
can be modified by providing command line options
before the query.file and library.file arguments.
Command line options can also be used in interactive
mode.
Command line
arguments come in several classes.
(1) Commands
that specify the comparison type. FASTA, FASTS, FASTM,
SSEARCH, GGSEARCH, and GLSEARCH can compare either protein
or DNA sequences, and attempt to recognize the comparison
type by looking at the residue composition. -n, -p specify DNA
(nucleotide) or protein comparison, respectively. -U
specifies RNA comparison.
(2) Commands
that limit the set of sequences compared: -1, -3, -M.
(3) Commands
that modify the scoring parameters: -f gap-open penalty, -g
gap-extend penalty, -j inter-codon frameshift, within-codon
frameshift, -s scoring-matrix, -r match/mismatch score, -x
X:X score.
(4) Commands
that modify the algorithm (mostly FASTA and [T]FASTX/Y): -c,
-w, -y, -o. The -S option can be used to ignore lower-case (low
complexity) residues during the initial score
calculation.
(5) Commands
that modify the output: -A, -b number, -C width, -d number,
-L, -m 0-11,B, -w line-width, -W context-width, -o
offset1,offset2.
(6) Commands
that affect statistical estimates: -Z, -k.
Option summary:
-1
Sort by "init1" score (obsolete)
-3
([t]fast[x,y] only) use only forward frame translations
-a
Displays the full length (including unaligned regions) of
both sequences with fasta36, ssearch36, glsearch36, and
fasts36.
-A
(fasta36 only) For DNA:DNA, force Smith-Waterman alignment
for output. Smith-Waterman is the default for FASTA protein
alignment and [t]fast[x,y], but not for DNA comparisons with
FASTA. For protein:protein, use band-alignment algorithm.
-b #
number of best scores/descriptions to show (must be <
expectation cutoff if -E is given). By default, this option
is no longer used; all scores better than the expectation
(E()) cutoff are listed. To guarantee the display of #
descriptions/scores, use -b =#; e.g. -b =100 ensures that
100 descriptions/scores will be displayed. To guarantee at
least 1 description, but possibly many more (limited by -E
e_cut), use -b >1.
-c "E-opt E-join"
threshold for gap joining (E-join) and band optimization
(E-opt) in FASTA and [T]FASTX/Y. FASTA36 now uses BLAST-like
statistical thresholds for joining and band optimization.
The default statistical thresholds for protein and
translated comparisons are E-opt=0.2, E-join=0.5; for DNA,
E-opt=0.02, E-join=0.1. The actual number of joins and
optimizations is reported after the E-join and E-opt scoring
parameters. Statistical thresholds improve search speed 2-
to 3-fold, and provide much more accurate statistical
estimates for matrices other than BLOSUM50. The "classic"
joining/optimization thresholds that were the default in
fasta35 and earlier programs are available using -c O (upper
case O), possibly followed by a value > 1.0 to set the
optcut optimization threshold.
-C #
length of name abbreviation in alignments, default = 6.
Must be less than 20.
-d #
number of best alignments to show (must be < expectation
(-E) cutoff and <= the -b description limit).
-D
turn on debugging mode. Enables checks on sequence alphabet
that cause problems with tfastx36, tfasty36 (only available
after compile time option). Also preserves temp files with
-e expand_script.sh option.
-e expand_script.sh
Run a script to expand the set
of sequences displayed/aligned based on the results of the
initial search. When the -e expand_script.sh option is used,
after the initial scan and statistics calculation, but
before the "Best scores" are shown, expand_script.sh is run
with a single argument: the name of a file that contains the
accession information (the text on the fasta description
line between > and the first space) and the E()-value for
the sequence. expand_script.sh then uses
this information to send a library of additional sequences
to stdout. These additional sequences are included in the
list of high-scoring sequences (if their scores are
significant) and aligned. The additional sequences do not
change the statistics or database size.
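The following is a minimal sketch of what such an expand script might look like; it is not the script distributed with the FASTA package. The layout of the hit file (one hit per line, accession in the first whitespace-separated field, E()-value in the second), the EXPAND_DIR variable, and the one-file-per-accession directory are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch of an expand script for -e (file layouts below are assumptions).
# $1: file written by the FASTA program; assumed to hold one hit per line,
#     with the accession in field 1 and the E()-value in field 2.
# EXPAND_DIR: hypothetical directory holding <accession>.fa files with the
#     additional sequences linked to each hit.
expand_hits() {
    hit_file=$1
    dir=${EXPAND_DIR:-./linked_seqs}
    while read -r acc evalue; do
        [ -n "$acc" ] || continue          # skip blank lines
        if [ -f "$dir/$acc.fa" ]; then
            cat "$dir/$acc.fa"             # extra library sequences go to stdout
        fi
    done < "$hit_file"
}

# Run only when a hit file name is supplied as the single argument.
if [ $# -ge 1 ]; then
    expand_hits "$1"
fi
```

Saved as an executable file, a script of this shape would be named on the command line with -e; hits without a matching file contribute nothing to the output.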
-E e_cut e_cut_r
expectation value upper limit
for score and alignment display. Defaults are 10.0 for
FASTA36 and SSEARCH36 protein searches, 5.0 for translated
DNA/protein comparisons, and 2.0 for DNA/DNA searches. FASTA
version 36 now reports additional alignments between the
query and the library sequence; the second value sets the
threshold for these subsequent alignments. If it is not
given, the threshold is e_cut/10.0. If it is given and
value > 1.0, e_cut_r = e_cut / value; for value < 1.0,
e_cut_r = value. If e_cut_r < 0, the additional alignment
option is disabled.
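The rules for the second -E value can be sketched as a small shell function. This is an illustration of the arithmetic described above, not code from the FASTA programs; the function name e_cut_r is hypothetical, and the handling of a value of exactly 1.0 (which the text does not cover) is an assumption.

```shell
# Compute the threshold for additional alignments from the -E arguments.
# usage: e_cut_r e_cut [value]
# Assumption: a value of exactly 1.0 is not covered by the description;
# this sketch treats it like values below 1.0 and uses it directly.
e_cut_r() {
    awk -v e_cut="$1" -v value="${2-}" 'BEGIN {
        if (value == "")       printf "%g\n", e_cut / 10.0   # no second value given
        else if (value < 0)    print "disabled"              # negative: no extra alignments
        else if (value > 1.0)  printf "%g\n", e_cut / value  # value acts as a divisor
        else                   printf "%g\n", value          # value is the threshold itself
    }'
}

e_cut_r 10.0        # prints 1
e_cut_r 10.0 5      # prints 2
e_cut_r 10.0 0.01   # prints 0.01
```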
-f #
penalty for opening a gap.
-F #
expectation value lower limit for score and alignment
display. -F 1e-6 prevents library sequences with E()-values
lower than 1e-6 from being displayed. This allows the user
to focus on more distant relationships.
-g #
penalty for additional residues in a gap
-h
Show short help message.
-help
Show long help message, with all options.
-H
show histogram (with fasta-36.3.4, the histogram is not
shown by default).
-i
(fasta DNA, [t]fast[x,y]) compare against only the reverse
complement of the library sequence.
-I
interactive mode; prompt for query filename, library.
-j # #
([t]fast[x,y] only) penalty for
a frameshift between two codons, ([t]fasty only) penalty for
a frameshift within a codon.
-J
(lalign36 only) show identity alignment.
-k #
specify number of shuffles for statistical parameter
estimation (default=500).
-l str
specify FASTLIBS file
-L
report long sequence description in alignments (up to
200 characters).
-m
0,1,2,3,4,5,6,8,9,10,11,B,BB,"F# out.file"
alignment display
options. -m 0, 1, 2, 3 display
different types of alignments. -m 4 provides an alignment
"map" on the query. -m 5 combines the alignment
map and a -m 0 alignment. -m 6 provides an HTML output.
-m 8 seeks to mimic BLAST -m 8
tabular output. Only query and
library sequence names, and
identity, mismatch, starts/stops, E()-values, and bit scores
are displayed. -m 8C mimics BLAST tabular format with
comment lines. -m 8 formats do not show alignments.
-m 9 does not change the
alignment output, but provides
alignment coordinate and
percent identity information with the best scores report. -m
9c adds encoded alignment information to the -m 9; -m 9C
adds encoded alignment information as a CIGAR formatted
string. To accommodate frameshifts, the CIGAR format has been
supplemented with F (forward) and R (reverse). -m 9i
provides only percent identity and alignment length
information with the best scores. With current versions of
the FASTA programs, independent -m options can be combined;
e.g. -m 1 -m 9c -m 6.
-m 11 provides lav format output
from lalign36. It does not
currently affect other
alignment algorithms. The lav2ps and lav2svg programs can be
used to convert lav format output to postscript/SVG
alignment "dot-plots".
-m B provides BLAST-like
alignments. Alignments are labeled as
"Query" and
"Sbjct", with coordinates on the same line as the
sequences, and BLAST-like symbols for matches and
mismatches. -m BB extends BLAST similarity to all the
output, providing an output that closely mimics BLAST
output.
-m "F# out.file"
allows one search to write different alignment formats to
different files. The ’F’ indicates separate file output; the
’#’ is the output format (1-6,8,9,10,11,B,BB). Multiple
compatible formats can be combined, separated by commas
(’,’).
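Because -m 8 output seeks to mimic BLAST tabular output, it can be post-processed with ordinary text tools. The sketch below keeps only hits at or below an E()-value cutoff. It assumes the 12 tab-separated columns of BLAST tabular format, with the E()-value in column 11 and the bit score in column 12, and assumes that -m 8C comment lines start with "#"; check these against real output before relying on it. The function name filter_m8 is hypothetical.

```shell
# Keep hits from -m 8 style tabular output whose E()-value is <= a cutoff.
# Assumes 12 tab-separated BLAST tabular columns (E()-value in column 11,
# bit score in column 12); lines starting with "#" are skipped.
# usage: filter_m8 max_evalue < results.m8
filter_m8() {
    awk -F '\t' -v max_e="$1" '
        /^#/ { next }
        NF >= 12 && ($11 + 0) <= (max_e + 0) {
            print $1 "\t" $2 "\t" $11 "\t" $12
        }
    '
}
```

For each retained hit the function prints the query name, library sequence name, E()-value, and bit score, for example with filter_m8 0.001 < results.m8 (results.m8 is a placeholder file name).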
-M #-#
library sequence length (residue count) cutoffs. -M
"101-200" examines only library sequences that are 101-200
residues long.
-n
force query to nucleotide sequence
-N #
break long library sequences into blocks of # residues.
Useful for bacterial genomes, which have only one sequence
entry. -N 2000 works well for bacterial genomes. (This
option was required when FASTA only provided one alignment
between the query and library sequence. It is not as useful,
now that multiple alignments are available.)
-o "#,#"
offsets query, library sequence
for numbering alignments
-O file
send output to file.
-p
force query to protein alphabet.
|
-P pssm_file
(ssearch36, ggsearch36,
glsearch36 only). Provide blastpgp checkpoint file as the
PSSM for searching. Two PSSM file formats are available,
which must be provided with the filename. ’pssm_file
0’ uses a binary format that is machine specific;
’pssm_file 1’ uses the "blastpgp -u 1 -C
pssm_file" ASN.1 binary format (preferred).
-q/-Q
quiet option; do not prompt for input (on by default)
|
-r "+n/-m"
(DNA only) values for
match/mismatch for DNA comparisons. +n is used for the
maximum positive value and -m is used for the maximum
negative value. Values between max and min are rescaled,
but residue pairs having the value -1 continue to be -1.
-R file
save all scores to statistics
file (previously -r file)
-s name
specify substitution matrix.
BLOSUM50 is used by default; PAM120, PAM250, and BLOSUM62
can be specified by setting -s P120, -s P250, or -s BL62.
Additional scoring matrices include: BLOSUM80 (BL80), and
MDM10, MDM20, MDM40 (Jones, Taylor, and Thornton, 1992
CABIOS 8:275-282; specified as -s MD10, -s MD20, -s MD40),
OPTIMA5 (-s OPT5, Kann and Goldstein, (2002) Proteins
48:367-376), and VTML160 (-s VT160, Mueller and Vingron
(2002) J. Comp. Biol. 19:8-13). Each scoring matrix has
associated default gap penalties. The BLOSUM62 scoring
matrix and -11/-1 gap penalties can be specified with -s
BP62.
Alternatively,
a BLASTP format scoring matrix file can be specified, e.g.
-s matrix.filename. DNA scoring matrices can also be
specified with the "-r" option.
With fasta36.3,
variable scoring matrices can be specified by preceding the
scoring matrix abbreviation with ’?’, e.g. -s
’?BP62’. Variable scoring matrices allow the
FASTA programs to choose an alternative scoring matrix with
higher information content (bit score/position) when short
queries are used. For example, a 90 nucleotide FASTX query
can produce only a 30 amino-acid alignment, so a scoring
matrix with 1.33 bits/position is required to produce a 40
bit score. The FASTA programs include BLOSUM50 (0.49
bits/pos) and BLOSUM62 (0.58 bits/pos) but can range to MD10
(3.44 bits/position). The variable scoring matrix option
searches down the list of scoring matrices to find one with
information content high enough to produce a 40 bit
alignment score.
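The arithmetic behind this choice can be sketched as follows. The two function names are hypothetical, and the matrix list contains only the three information-content values quoted above; the list actually searched by the FASTA programs is longer, so this illustrates the selection rule rather than reproducing the programs' behavior.

```shell
# Bits per aligned position needed to reach a target bit score.
# usage: bits_needed query_len residues_per_position [target_bits]
#   residues_per_position is 3 for a translated DNA query, 1 for protein.
bits_needed() {
    awk -v len="$1" -v per="$2" -v target="${3:-40}" \
        'BEGIN { printf "%.2f\n", target / (len / per) }'
}

# Pick the first matrix whose information content meets the requirement.
# Only the three values quoted in the text are listed here.
pick_matrix() {
    awk -v need="$1" 'BEGIN {
        n = split("BLOSUM50:0.49 BLOSUM62:0.58 MD10:3.44", m, " ")
        for (i = 1; i <= n; i++) {
            split(m[i], kv, ":")
            if (kv[2] + 0 >= need + 0) { print kv[1]; exit }
        }
        print "none"
    }'
}

bits_needed 90 3                      # prints 1.33
pick_matrix "$(bits_needed 90 3)"     # prints MD10
```

With only these three matrices listed, a 90 nucleotide FASTX query falls through BLOSUM50 and BLOSUM62 to MD10, while a 400 residue protein query (0.10 bits/position needed) stays with BLOSUM50.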
-S
treat lower case letters in the query or database as low
complexity regions that are equivalent to ’X’ during the
initial database scan, but are treated as normal residues
for the final alignment display. Statistical estimates are
based on the ’X’ed out sequence used during the initial
search. Protein databases (and query sequences) can be
generated in the appropriate format using John Wootton’s
"pseg" program, available from
ftp://ftp.ncbi.nih.gov/pub/seg/pseg. Once you have compiled
the "pseg" program, use the command:
pseg database.fasta -z 1 -q > database.lc_seg
-t # |
|
Translation table - [t]fastx36
and [t]fasty36 support the BLAST tranlation tables. See
http://www.ncbi.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c/. |
|
-T # |
|
(threaded, parallel only) number
of threads or workers to use (on Linux/MacOS/Unix, the
default is to use as many processors as are available; on
Windows systems, 2 processors are used). |
|
-U |
|
Do RNA sequence comparisons:
treat ’T’ as ’U’, allow G:U base
pairs (by scoring "G-A" and "T-C" as
score(G:G)-3). Search only one strand. |
-V "?$%*"
Allow special annotation
characters in query sequence. These characters will be
displayed in the alignments on the coordinate number
line.
-w #
line width for similarity score, sequence alignment, output.
-W #
context length (default is 1/2 of line width -w) for
alignments from programs, like fasta and ssearch, that
provide additional sequence context.
-X
extended, less used options. These include -XB, -XM4G, -Xo,
-Xx, and -Xy; see fasta_guide.pdf.
-z 1, 2, 3, 4, 5, 6
Specify the statistical
calculation. Default is -z 1 for local similarity searches,
which uses regression against the length of the library
sequence. -z -1 disables statistics. -z 0 estimates
significance without normalizing for sequence length. -z 2
provides maximum likelihood estimates for lambda and K,
censoring the 250 lowest and 250 highest scores. -z 3 uses
Altschul and Gish’s statistical estimates for specific
protein BLOSUM scoring matrices and gap penalties. -z 4,5:
an alternate regression method. -z 6 uses a composition
based maximum likelihood estimate based on the method of
Mott (1992) Bull. Math. Biol. 54:59-75.
-z 11,12,14,15,16
compute the regression against
scores of randomly shuffled copies of the library sequences.
Twice as many comparisons are performed, but accurate
estimates can be generated from databases of related
sequences. -z 11 uses the -z 1 regression strategy, etc.
-z 21, 22, 24, 25, 26
compute two E()-values. The
standard (library-based) E()-value is calculated in the
standard way (-z 1, 2, etc), but a second E2() value is
calculated by shuffling the high-scoring sequences (those
with E()-values less than the threshold). For
"average" composition proteins, these two
estimates will be similar (though the best-shuffle estimates
are always more conservative). For biased composition
proteins, the two estimates may differ by 100-fold or more.
A second -z option, e.g. -z "21 2", specifies the
estimation method for the best-shuffle E2()-values.
Best-shuffle E2()-values approximate the estimates given by
PRSS (or in a pairwise SSEARCH).
-Z db_size
Set the apparent database size
used for expectation value calculations (used for
protein/protein FASTA and SSEARCH, and for [T]FASTX/Y).
Reading sequences from STDIN
The FASTA
programs can accept a query sequence from the unix
"stdin" data stream. This makes it much easier to
use fasta36 and its relatives as part of a WWW page. To
indicate that stdin is to be used, use "@" as the
query sequence file name. "@" can also be used to
specify a subset of the query sequence to be used, e.g:
cat query.aa | fasta36 @:50-150 s
would search the
’s’ database with residues 50-150 of query.aa.
FASTA cannot automatically detect the sequence type (protein
vs DNA) when "stdin" is used and assumes protein
comparisons by default; the ’-n’ option is
required for DNA queries read from stdin.
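To check which residues a range such as @:50-150 refers to before running a search, the selection can be mimicked with a short shell function. This is only an illustration, assuming 1-based, inclusive coordinates and a single-record FASTA query; it is not how the FASTA programs read the subset, and the function name fasta_subrange is hypothetical.

```shell
# Print residues START..END (1-based, inclusive) of the first FASTA
# record read from stdin. Illustration only; not part of the FASTA package.
# usage: fasta_subrange START END < query.aa
fasta_subrange() {
    awk -v start="$1" -v end="$2" '
        /^>/ { if (seen++) exit; next }   # skip the header; stop at a second record
        { seq = seq $0 }                  # join wrapped sequence lines
        END { print substr(seq, start, end - start + 1) }
    '
}
```

For example, cat query.aa | fasta_subrange 50 150 prints the 101 residues that the range 50-150 covers when the query is at least 150 residues long.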
Environment variables:
FASTLIBS
location of library choice file
(-l FASTLIBS)
SRCH_URL1, SRCH_URL2
format strings used to define
options to re-search the database.
REF_URL
the format string used to
define the option to look up the library sequence in Entrez,
or some other database.
AUTHOR
Bill Pearson
wrp@virginia.EDU
Version: $ Id: $
Revision: $Revision: 210 $