unikmer(1)

Toolkit for nucleic acid k-mer analysis


NAME

unikmer - Toolkit for nucleic acid k-mer analysis

DESCRIPTION

unikmer - Toolkit for k-mers with taxonomic information

unikmer is a toolkit for nucleic acid k-mer analysis. It provides functions including set operations on k-mers, optionally with TaxIds but without count information.

K-mers are either encoded (k<=32) or hashed (arbitrary k) into 'uint64' values, and serialized in binary files with the extension '.unik'.
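The reason for the k<=32 limit can be shown with a small sketch: at 2 bits per base, 32 bases fill exactly 64 bits. The bit assignment used here (A=0, C=1, G=2, T=3) and the function names are assumptions for illustration; this page does not state unikmer's actual encoding.

```python
# Hypothetical 2-bit codes per base; assumed for illustration only.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_kmer(kmer: str) -> int:
    """Pack a k-mer (k <= 32) into an integer that fits in 64 bits."""
    if not 0 < len(kmer) <= 32:
        raise ValueError("k must be in the range [1, 32]")
    code = 0
    for base in kmer.upper():
        # Shift the bases seen so far left and append the new base's 2 bits.
        # A KeyError is raised for any base other than A, C, G, T.
        code = (code << 2) | BASE_TO_BITS[base]
    return code


def decode_kmer(code: int, k: int) -> str:
    """Unpack an integer produced by encode_kmer into a k-mer of length k."""
    bases = []
    for _ in range(k):
        bases.append(BITS_TO_BASE[code & 3])  # lowest 2 bits hold the last base
        code >>= 2
    return "".join(reversed(bases))
```

The value k must be kept alongside the integer, because leading A bases encode as zero bits and cannot be recovered from the integer alone.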

TaxIds can be assigned when counting k-mers from genome sequences, and the LCA (Lowest Common Ancestor) is computed during set operations, including computing the union, intersection, set difference, and unique and repeated k-mers.
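A minimal sketch of how an LCA could be combined with a union of two k-mer sets follows. It assumes a child-to-parent map in which the root is its own parent (as in NCBI's nodes.dmp) and k-mer sets held as dicts from k-mer code to TaxId. These structures and function names are hypothetical; this page does not describe how unikmer implements the computation.

```python
def lineage(taxid, parent):
    """Return the path from taxid up to the root, inclusive."""
    path = [taxid]
    while parent[taxid] != taxid:  # the root is its own parent
        taxid = parent[taxid]
        path.append(taxid)
    return path


def lca(a, b, parent):
    """Lowest common ancestor of TaxIds a and b in the given taxonomy."""
    ancestors = set(lineage(a, parent))
    for node in lineage(b, parent):  # walk upward from b
        if node in ancestors:
            return node
    raise ValueError("no common ancestor")


def union_with_lca(x, y, parent):
    """Union of two k-mer sets; shared k-mers get the LCA of their TaxIds."""
    merged = dict(x)
    for kmer, taxid in y.items():
        if kmer in merged:
            merged[kmer] = lca(merged[kmer], taxid, parent)
        else:
            merged[kmer] = taxid
    return merged


# Toy taxonomy (made up): 1 is the root; 2 and 5 are its children;
# 3 and 4 are children of 2.
TOY_PARENT = {1: 1, 2: 1, 3: 2, 4: 2, 5: 1}
```

With this toy taxonomy, a k-mer found in both taxon 3 and taxon 4 is assigned TaxId 2 after the union, while a k-mer found in only one input keeps its original TaxId.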

Version: v0.19.0

Author: Wei Shen <shenwei356@gmail.com>

Documents: https://bioinf.shenwei.me/unikmer

Source code: https://github.com/shenwei356/unikmer

Dataset (optional):

Manipulating k-mers with TaxIds requires taxonomy files, e.g., from the NCBI Taxonomy database. Download ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz and extract "nodes.dmp", "names.dmp", "delnodes.dmp" and "merged.dmp" into ~/.unikmer/ or some other directory. A different directory can later be specified with the flag --data-dir or the environment variable UNIKMER_DB.

For GTDB, use 'taxonkit create-taxdump' to create NCBI-style taxonomy dump files, or download from:
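The extraction step can be sketched in Python as below. It assumes taxdump.tar.gz has already been downloaded to a local path; the function name and the constant are hypothetical, and only the four file names come from the text above.

```python
import os
import shutil
import tarfile

# The four taxonomy files named in this page.
WANTED = ("nodes.dmp", "names.dmp", "delnodes.dmp", "merged.dmp")


def extract_taxonomy_files(tarball, dest_dir):
    """Copy only the wanted .dmp files from a gzipped tarball into dest_dir.

    Returns the sorted list of file names that were extracted.
    """
    os.makedirs(dest_dir, exist_ok=True)
    extracted = []
    with tarfile.open(tarball, "r:gz") as tar:
        for member in tar:
            name = os.path.basename(member.name)
            if member.isfile() and name in WANTED:
                # Stream the member to disk rather than loading it in memory.
                with tar.extractfile(member) as src, \
                        open(os.path.join(dest_dir, name), "wb") as out:
                    shutil.copyfileobj(src, out)
                extracted.append(name)
    return sorted(extracted)


# Example call, assuming the tarball is in the current directory:
# extract_taxonomy_files("taxdump.tar.gz", os.path.expanduser("~/.unikmer"))
```

Writing each member under its base name keeps the output flat in the destination directory and avoids any paths embedded in the archive.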

https://github.com/shenwei356/gtdb-taxonomy

Note that TaxIds are represented using uint32 and stored in 4 or fewer bytes, so all TaxIds must be in the range [1, 4294967295].
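To illustrate "4 or fewer bytes", the sketch below computes the smallest number of bytes that can hold every TaxId up to a given maximum. The function name is hypothetical, and whether unikmer picks the width in exactly this way is an assumption; the page only states the range and the upper bound of 4 bytes.

```python
def taxid_byte_width(max_taxid: int) -> int:
    """Smallest number of bytes able to store any TaxId in [1, max_taxid]."""
    if not 1 <= max_taxid <= 0xFFFFFFFF:
        raise ValueError("max_taxid must be in the range [1, 4294967295]")
    # Round the bit length up to whole bytes.
    return (max_taxid.bit_length() + 7) // 8
```

For example, a maximum TaxId of 3000000 fits in 3 bytes, while the full uint32 range needs all 4.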

Usage:

unikmer [command]

Available Commands:

autocompletion  Generate shell autocompletion script (bash|zsh|fish|powershell)
common          Find k-mers shared by most of multiple binary files
concat          Concatenate multiple binary files without removing duplicates
count           Generate k-mers (sketch) from FASTA/Q sequences
decode          Decode encoded integer to k-mer text
diff            Set difference of multiple binary files
dump            Convert plain k-mer text to binary format
encode          Encode plain k-mer text to integer
filter          Filter out low-complexity k-mers (experimental)
grep            Search k-mers from binary files
head            Extract the first N k-mers
info            Information of binary files
inter           Intersection of multiple binary files
locate          Locate k-mers in genome
merge           Merge k-mers from sorted chunk files
num             Quickly inspect number of k-mers in binary files
rfilter         Filter k-mers by taxonomic rank
sample          Sample k-mers from binary files
sort            Sort k-mers in binary files to reduce file size
split           Split k-mers into sorted chunk files
tsplit          Split k-mers according to taxid
union           Union of multiple binary files
uniqs           Mapping k-mers back to genome and find unique subsequences
version         Print version information and check for update
view            Read and output binary format to plain text

Flags:

-c, --compact

write compact binary file with little loss of speed

--compression-level int

compression level (default -1)

--data-dir string

directory containing NCBI Taxonomy files, including nodes.dmp, names.dmp, merged.dmp and delnodes.dmp (default "~/.unikmer")

-h, --help

help for unikmer

-I, --ignore-taxid

ignore taxonomy information

-i, --infile-list string

file listing input files (one file per line); if given, these files are appended to the files from CLI arguments

--max-taxid uint32

for smaller TaxIds, less space can be used to store them. The default value, 1<<32-1, is enough for NCBI Taxonomy TaxIds (default 4294967295)

-C, --no-compress

do not compress binary file (not recommended)

--nocheck-file

do not check the binary file; useful when reading from process substitution or a named pipe

-j, --threads int

number of CPUs to use (default 4)

--verbose

print verbose information

Use "unikmer [command] --help" for more information about a command.