pfscale(1)
NAME
pfscale - fit parameters of an extreme-value distribution to a profile score list
SYNOPSIS
pfscale [ -hl ] [ -L log_base ] [ -M mode_nb ] [ -N db_size ] [ -P upper_limit ] [ -Q lower_limit ] [ score_list | - ] [ profile ] [ parameters ]
DESCRIPTION
pfscale fits the two parameters of an extreme-value distribution to a sorted score distribution obtained by searching a sequence database with a profile. The file ’score_list’ is a sorted list of profile match scores generated by pfsearch. If ’-’ is specified instead of a filename, the score list is read from the standard input. The result is written to the standard output.
If the original profile is given as the second argument, the normalization function with the lowest mode number or the lowest priority number specified within the profile will be updated so as to produce -log10 per-residue E-values. If the second argument is omitted, the output consists of a header line containing the normalization parameters, followed by a modified score list that shows the score rank, the original raw score, the log-cumulative frequency and the corresponding normalized score side by side.
Note that this program implements the significance estimation procedure for profile match scores described in Hofmann & Bucher (1995). It has been used for the calculation of the normalization parameters of all profiles in the PROSITE database.
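The relation between the columns of the modified score list can be illustrated with a minimal sketch. It is not the pfscale implementation: it assumes an ordinary least-squares line through the points (raw score, -log10(rank / db_size)), it ignores the fitting range set by -P and -Q, and the function names fit_normalization and score_table are hypothetical.

```python
import math

def fit_normalization(scores, db_size):
    """Least-squares line through the points (raw score, -log10(rank / db_size)).

    `scores` must be sorted in descending order, so that the 1-based rank of a
    score counts the matches scoring at least as high.
    Returns (intercept, slope) of: normalized = intercept + slope * raw_score.
    """
    xs = list(scores)
    if len(xs) < 2:
        raise ValueError("need at least two scores")
    ys = [-math.log10(rank / db_size) for rank in range(1, len(xs) + 1)]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all scores are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def score_table(scores, db_size):
    """Rows of (rank, raw score, log10 cumulative frequency, normalized score)."""
    intercept, slope = fit_normalization(scores, db_size)
    return [
        (rank, raw, math.log10(rank / db_size), intercept + slope * raw)
        for rank, raw in enumerate(scores, start=1)
    ]
```

In this sketch the normalized score of a well-fitted match is close to the negative of its log-cumulative frequency, which is what makes it readable as a -log10 frequency estimate.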
OPTIONS
score_list
Input score list.
The file must contain a sorted list of scores. The first
field of each line is considered as being a score, all other
fields on the same line are ignored. The different fields of
each line should be delimited by whitespace. If the
filename is replaced by a ’-’,
pfscale will read the score list from
stdin.
profile
Optional profile file.
If a filename is specified, the profile will be parsed and
either the lowest priority mode or the mode number specified
with option -M will be scaled. All cut-off levels
which use the specified mode number will also be
updated.
-h
Display usage help text.
-l
Remove output line length limit. Individual lines of the output profile can exceed a length of 132 characters, removing the need to wrap them over several lines.
-L log_base
Logarithmic base of the
parameters of the estimated extreme-value distribution. The
parameters reported by pfscale are expressed as
logarithms and thus can be inserted directly into a linear
normalization function defined in a generalized profile.
Default: 10
-M mode_nb
Mode number to scale.
Defines which mode number (and implicitly which cut-off
level) of the input PROSITE profile should be scaled. This
overrides the default behaviour of scaling only the
normalization mode with the lowest priority (or lowest mode
number). All cut-off levels defined in the profile as using
this mode number (via the MODE keyword) will be
updated as well.
-N db_size
Size of the database from which
the input score list was derived. The searched database is
typically a shuffled version of a real protein or nucleotide
sequence database.
Default: 14147368 (size of SWISS-PROT release 30 and
shuffled derivatives of it).
-P upper_limit
Upper threshold of the
probability range to which the extreme-value distribution
will be fitted. For instance: if
N=10’000’000 and P=0.0001 then
profile match scores below rank 1000 in the sorted input
list (corresponding to occurrence probabilities > 0.0001)
will be ignored.
Default: 0.0001
-Q lower_limit
Lower threshold of the
probability range to which the extreme-value distribution
will be fitted. For instance: if
N=10’000’000 and Q=0.000001 then
profile match scores above rank 10 in the sorted input list
(corresponding to occurrence probabilities < 0.000001)
will be ignored.
Default: 0.000001
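As an illustration of how the score_list format and the -N, -P and -Q options interact, the sketch below reads the first field of each line and keeps only the ranks whose frequency rank / db_size lies between the two limits. It is a hedged reconstruction from the option descriptions above, not pfscale's own code. The function names read_scores and fit_window are hypothetical, and treating both limits as inclusive is an assumption.

```python
def read_scores(lines):
    """Return the first whitespace-delimited field of each non-empty line as a float."""
    scores = []
    for line in lines:
        fields = line.split()
        if fields:                      # skip blank lines
            scores.append(float(fields[0]))
    return scores

def fit_window(scores, db_size=14147368, upper_limit=0.0001, lower_limit=0.000001):
    """Keep (rank, score) pairs with lower_limit <= rank / db_size <= upper_limit.

    `scores` must be sorted in descending order; ranks are 1-based.
    The defaults mirror the documented defaults of -N, -P and -Q.
    """
    kept = []
    for rank, score in enumerate(scores, start=1):
        frequency = rank / db_size
        if lower_limit <= frequency <= upper_limit:
            kept.append((rank, score))
    return kept
```

With db_size=10000000 and the default limits, this keeps ranks 10 to 1000, matching the examples given for -P and -Q. With the default database size of 14147368 it keeps ranks 15 to 1414.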
PARAMETERS
Note:
For backwards compatibility, release 2.3 of the pftools package will parse the version 2.2 style parameters, but these are deprecated and the corresponding options (refer to the OPTIONS section) should be used instead.
L=#
Logarithmic base. Use option -L instead.
M=#
Mode number. Use option -M instead.
N=#
Database size. Use option -N instead.
P=#
Upper probability threshold. Use option -P instead.
Q=#
Lower probability threshold. Use option -Q instead.
EXAMPLES
(1)
pfsearch -fr -C 200 sh3.prf shuffle20.seq | sort -nr | pfscale -P 0.0001 -Q 0.000001 -
derives score-normalization parameters for the SH3 domain profile in file ’sh3.prf’. The file ’shuffle20.seq’ contains a window-shuffled derivative of SWISS-PROT release 30 in Pearson/Fasta format (window size 20). Note that the implicit default of N corresponds to the size of this database and thus need not be specified on the command line. The cut-off value 200 for the pfsearch(1) option -C will produce about 2000 matches, completely covering the range defined by the pfscale command line options -P and -Q. A suitable cut-off value has to be guessed in advance by computing a few optimal alignment scores for random sequences.
EXIT CODE
On successful completion of its task, pfscale will return an exit code of 0. If an error occurs, a diagnostic message will be output on standard error and the exit code will be different from 0. When conflicting options were passed to the program but the task could nevertheless be completed, warnings will be issued on standard error.
NOTES
(1)
The current version of pfscale does not yet support the xpsa(5) output format produced by pfscan(1) or pfsearch(1). The score list should therefore be generated without the pfscan(1) and pfsearch(1) option -k.
REFERENCES
Hofmann K & Bucher P. (1995). The FHA-domain: a nuclear signalling domain found in protein kinases and transcription factors. Trends Biochem. Sci. 20:347-349.
SEE ALSO
pfsearch(1), pfscan(1), xpsa(5)
AUTHOR
The
pftools package was developed by Philipp Bucher.
Any comments or suggestions should be addressed to
<pftools@sib.swiss>.