cdbfasta(1)
NAME
cdbfasta - Creates an index file for records from a multi-fasta file.
DESCRIPTION
Usage:
cdbfasta <fastafile> [-o <index_file>] [-r <record_delimiter>]
[-z <compressed_db>] [-i] [-m|-n <numkeys>|-f<LIST>|-c|-C]
[-w <stopwords_list>] [-s <stripendchars>] [{-Q|-G}]
[-v]
Creates an index file for records from a multi-fasta file. By default (without -m/-n/-c/-C option), only the first space-delimited token from the defline is used as a key.
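To make the default behaviour concrete, here is a minimal Python sketch of the idea: each record is keyed by the first space-delimited token of its defline, and the key points at the place where the record starts. This is only a toy in-memory analogue. The function names, the use of byte offsets, and the dictionary are assumptions made for illustration; the sketch does not reproduce the actual .cidx file format or cdbfasta's internals.

```python
import io


def build_offset_index(handle, delimiter=b">"):
    """Map the first space-delimited defline token to the record's byte offset.

    `handle` must be a binary file-like object. Hypothetical helper: this is
    an in-memory illustration, not the on-disk .cidx format.
    """
    index = {}
    offset = 0
    for line in handle:
        if line.startswith(delimiter):
            tokens = line[len(delimiter):].split()
            if tokens:
                # Keep the first record seen for a given key.
                index.setdefault(tokens[0].decode(), offset)
        offset += len(line)
    return index


def fetch_record(handle, index, key, delimiter=b">"):
    """Return the full text of the record stored under `key`."""
    handle.seek(index[key])
    lines = [handle.readline()]          # the defline itself
    for line in iter(handle.readline, b""):
        if line.startswith(delimiter):   # the next record begins here
            break
        lines.append(line)
    return b"".join(lines).decode()


if __name__ == "__main__":
    data = io.BytesIO(b">seq1 first record\nACGT\nAC\n>seq2 second\nGGGG\n")
    idx = build_offset_index(data)
    print(idx)
    print(fetch_record(data, idx, "seq2"), end="")
```

Only "seq1" and "seq2" become keys here; the remaining defline words are ignored, which mirrors the default described above.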
<fastafile>
    the multi-fasta file to index
-o <index_file>
    name of the index file to create; if not given, the index filename is the database name plus the suffix '.cidx'
-r <record_delimiter>
    a string of characters at the beginning of a line marking the start of a record (default: '>')
-Q
    treat input as FASTQ format, i.e. with '@' as record delimiter and with records expected to have at least 4 lines
-z <compressed_db>
    the database is compressed into the file <compressed_db> before indexing (<fastafile> can be "-" or "stdin" in order to read the input records from stdin)
-s <stripendchars>
    strip extraneous characters from *around* the space-delimited tokens, for the multi-key options below (-m, -n, -f); the default <stripendchars> set is: '",`.(){}/[]!:;~|><+-
-m
    ("multi-key" option) create hash entries pointing to the same record for all tokens found in the defline
-n <numkeys>
    same as -m, but only takes the first <numkeys> tokens from the defline; when used with the -a option (see below), only collects the first <numkeys> accessions from each defline
-f <LIST>
    index the *space*-delimited tokens (fields) in the defline given by LIST, a list of fields or field ranges (the same syntax as UNIX 'cut')
-w <stopwords_list>
    exclude from indexing all the words found in the file <stopwords_list> (for options -m, -n and -f)
-i
    do case-insensitive indexing (i.e. create additional keys for the all-lowercase forms of the tokens used for indexing from the defline)
-c
    for deflines in the format db1|accession1|db2|accession2|..., only the first db-accession pair ('db1|accession1') is taken as key
-C
    like -c, but subsequent db|accession constructs are also indexed, along with the full (default) token; additionally, all nrdb concatenated accessions found in the defline are parsed and stored (assuming 0x01 or '^|^' as separators)
-a
    accession mode: like -C, but indexes only the 'accession' part of every 'db|accession' construct found, plus the default first tokens
-A
    like -a and -C together (both accessions and 'db|accession' constructs are used as keys)
-D
    index each pipe ('|') delimited token found in the record identifier (e.g. >key1|key2|key3|...)
-d <kdelim>
    same as -D, but using the custom key delimiter <kdelim> instead of the pipe character '|'
-G
    FASTA records are treated as large genomic sequences (e.g. full chromosomes/contigs) and their formatting is checked for suitability for fast range queries (i.e. uniform line length within each record)
-v
    show program version and exit
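The key-selection options above differ only in which parts of the defline become keys. The following Python sketch shows one way to read the default, -m, -n, -c and -D/-d descriptions. All function names are hypothetical, the handling of edge cases (empty tokens, when stripping is applied) is an assumption, and the more involved -C, -a and -A modes are left out. It is an illustration of the documented behaviour, not cdbfasta's own code.

```python
def defline_tokens(defline, delimiter=">"):
    """Split a defline into space-delimited tokens, without the record delimiter."""
    text = defline[len(delimiter):] if defline.startswith(delimiter) else defline
    return text.split()


def keys_default(defline):
    """Default mode: only the first space-delimited token is a key."""
    return defline_tokens(defline)[:1]


def keys_multi(defline, numkeys=None, strip_chars=""):
    """-m style (all tokens) or -n style (only the first `numkeys` tokens).

    `strip_chars` plays the role of the -s character set. Whether stripping
    happens by default is not modelled here; it is off unless requested.
    """
    tokens = defline_tokens(defline)
    if numkeys is not None:
        tokens = tokens[:numkeys]
    if strip_chars:
        tokens = [t.strip(strip_chars) for t in tokens]
    return [t for t in tokens if t]  # assumption: empty tokens are dropped


def key_first_pair(defline):
    """-c style: the first 'db|accession' pair of the record identifier."""
    tokens = defline_tokens(defline)
    ident = tokens[0] if tokens else ""
    return "|".join(ident.split("|")[:2])


def keys_pipe_tokens(defline, kdelim="|"):
    """-D style (or -d with a custom `kdelim`): each delimited token is a key."""
    tokens = defline_tokens(defline)
    ident = tokens[0] if tokens else ""
    return [t for t in ident.split(kdelim) if t]


if __name__ == "__main__":
    d = ">gi|12345|gb|AAA01.1| (hypothetical) protein, putative"
    print(keys_default(d))
    print(keys_multi(d, numkeys=2, strip_chars="(),"))
    print(key_first_pair(d))
    print(keys_pipe_tokens(">key1|key2|key3 some description"))
```

For the made-up defline in the demo, the default mode yields the whole first token, the -c reading yields 'gi|12345', and the -D reading of '>key1|key2|key3' yields three separate keys.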