sortgrcd(1)
Postprocess of the output of spaln with -O12 option, Version 2
NAME
sortgrcd - Postprocess of the output of spaln with -O12 option, Version 2
SYNOPSIS
sortgrcd [options] xxx1.grd(.gz) [xxx2.grd(.gz) ...]
DESCRIPTION
sortgrcd recovers the output of spaln run with the -O12 option, optionally applies filtering, and rearranges the outputs of multiple Spaln runs.
OPTIONS
-C#
Minimum cover rate = % nucleotides in predicted exons / length of query (x 3 if query is protein) (0-100)
-F#
Filter level. #=0: no; #=1: mild; #=2: medium; #=3: stringent (0)
-H#
Minimum alignment score
-M#
Maximum total number of mismatches near boundaries
-N#
Maximum number of non-canonical boundaries
-O#
Output format. 0: Gff3, 4: Native, 5: Intron, 15: unique intron
-P#
Minimum overall % sequence identity (0-100)
-S[a|b|c|r]
Sort order of chromosomes/contigs. a: alphabetical, b: abundance, c: input order, r: reverse for minus strand
-U#
Maximum total number of unpaired bases in gaps
-V#
Maximum internal memory size used for core sort. Suffix k (or K) or m (or M) may be attached to specify kilo or mega bytes.
-m#
Maximum number of mismatches within 10bp from the nearest exon-intron boundary
-n#
Allow non-canonical (other than GT..AG, GC..AG, AT..AC) intron ends (0: no)
-u#
Maximum number of unpaired (gap) sites within 10bp from the nearest exon-intron boundary
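The cover rate used by -C and the canonical intron ends named under -n can be illustrated with a short Python sketch. This is not sortgrcd's own code: the function names, the argument layout, and the choice to pass the intron as a plain sequence string are assumptions made only for illustration.

```python
# Illustrative sketch only; names and data layout are hypothetical,
# not taken from sortgrcd.

# Intron end pairs the -n description treats as canonical.
CANONICAL_ENDS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def cover_rate(exon_nucleotides, query_length, query_is_protein=False):
    """Percent of the query covered by predicted exons (the -C quantity).

    The query length is multiplied by 3 for a protein query, because each
    amino acid corresponds to three nucleotides.
    """
    denominator = query_length * 3 if query_is_protein else query_length
    return 100.0 * exon_nucleotides / denominator


def is_canonical(intron_seq):
    """True if the intron begins and ends with GT..AG, GC..AG, or AT..AC."""
    seq = intron_seq.upper()
    return (seq[:2], seq[-2:]) in CANONICAL_ENDS
```

For example, 900 exon nucleotides against a 300-residue protein query give a cover rate of 100, while the same 900 nucleotides against a 1000-base cDNA query give 90.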
COMMENTS
The output format of spaln -O12 has changed since version 2: in addition to the *.grd and *.erd files, a *.qrd file is generated. This change removed the limitations on the lengths of the identifiers of both target (genomic) and query sequences. The database files that were specified by the -d option of spaln must not be changed before running sortgrcd.
By default, none of the filters listed above is applied.
When the output of Spaln is split across several files, the combined results are sorted together. Although only the *.grd(.gz) files are given as arguments, the corresponding *.erd(.gz) and *.qrd(.gz) files must be present in the same directory.
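Because the companion files are not named on the command line, it can help to check for them before running sortgrcd. The following Python sketch is one possible pre-flight check; the function name is hypothetical, and it assumes that a companion may be present either plain or gzip-compressed regardless of how the *.grd file itself is stored.

```python
from pathlib import Path


def missing_companions(grd_path):
    """Return the names of .erd/.qrd companions absent next to a .grd(.gz) file.

    Assumption (not from the sortgrcd documentation): a companion counts as
    present if either its plain or its .gz form exists in the same directory.
    """
    path = Path(grd_path)
    # Strip an optional .gz suffix, then require the .grd extension.
    name = path.name[:-3] if path.name.endswith(".gz") else path.name
    if not name.endswith(".grd"):
        raise ValueError(f"not a .grd file: {grd_path}")
    stem = name[: -len(".grd")]

    missing = []
    for ext in (".erd", ".qrd"):
        plain = path.with_name(stem + ext)
        gzipped = path.with_name(stem + ext + ".gz")
        if not (plain.exists() or gzipped.exists()):
            missing.append(plain.name)
    return missing
```

An empty list means both companions were found; otherwise the list names the files to look for.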
In the default output format, the gene structure corresponding to each transcript is delimited by a line starting with ‘@’, whereas each gene locus is delimited by a line starting with ‘!’. Two transcripts belong to the same locus if their corresponding genomic regions overlap by at least one nucleotide on the same strand.
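The locus rule above can be sketched in Python as follows. This is an illustration, not sortgrcd's implementation: the tuple layout (name, sequence id, strand, start, end) and the 1-based inclusive coordinates are assumptions, as is the choice to chain overlaps transitively, so that a transcript joins a locus when it overlaps any region already merged into it.

```python
def group_into_loci(transcripts):
    """Group transcripts into loci by same-strand genomic overlap.

    transcripts: iterable of (name, seq_id, strand, start, end) tuples with
    1-based inclusive coordinates (a hypothetical layout for this sketch).
    Returns a list of loci, each a list of transcript names.
    """
    loci = []
    current_key = None   # (seq_id, strand) of the locus being built
    current_end = None   # right-most coordinate merged into that locus

    ordered = sorted(transcripts, key=lambda t: (t[1], t[2], t[3], t[4]))
    for name, seq_id, strand, start, end in ordered:
        key = (seq_id, strand)
        # With inclusive coordinates, start <= current_end means the two
        # regions share at least one nucleotide.
        if key == current_key and start <= current_end:
            loci[-1].append(name)
            current_end = max(current_end, end)
        else:
            loci.append([name])
            current_key, current_end = key, end
    return loci
```

Sorting by sequence, strand, and start position lets a single pass find every overlap, because a transcript can only overlap the locus currently being built.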
The -O0, -O3, -O4, -O5, -O6, and -O7 options work in the same manner as those of spaln. In particular, with the -O0 option, the output follows the Gff3 gene format (http://www.sequenceontology.org/gff3.shtml), where a gene locus is defined as described above. With the -O4 (default) and -O5 options, the output follows the exon-oriented and intron-oriented spaln formats, respectively.
With the -O15 option, introns are made unique, i.e., introns inferred from different transcripts with the same 5’ and 3’ boundaries are output only once.
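The idea of reporting each intron once can be illustrated with a small Python sketch. It is not sortgrcd's code: the record layout (transcript, sequence id, strand, 5' boundary, 3' boundary) is hypothetical, and including the sequence id and strand in the identity of an intron is an assumption that goes beyond the boundary rule stated above.

```python
def unique_introns(introns):
    """Collapse introns that share boundaries, keeping first-seen order.

    introns: iterable of (transcript, seq_id, strand, five_prime, three_prime)
    tuples (a hypothetical layout for this sketch).
    Returns a list of (seq_id, strand, five_prime, three_prime) tuples.
    """
    seen = set()
    result = []
    for _transcript, seq_id, strand, five_prime, three_prime in introns:
        key = (seq_id, strand, five_prime, three_prime)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result
```

Two transcripts that both contain an intron from 501 to 800 on the plus strand of the same sequence therefore contribute a single entry.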
REFERENCES
(1) "A Space-Efficient and Accurate Method for Mapping and Aligning cDNA Sequences onto Genomic Sequence", O. Gotoh, Nucleic Acids Res., 36 (8), 2630-2638 (2008).
(2) "Direct Mapping and Alignment of Protein Sequences onto Genomic Sequence", O. Gotoh, Bioinformatics, 24 (21), 2438-2444 (2008).
AUTHOR
Osamu Gotoh <o.gotoh@aist.go.jp>