mmseqs(1)

MMseqs2 (Many against Many sequence searching): fast, parallelized protein sequence searches and clustering of huge protein sequence data

NAME

MMseqs2 - MMseqs2 (Many against Many sequence searching): fast, parallelized protein sequence searches and clustering of huge protein sequence data sets.

SYNOPSIS

mmseqs <module> args

DESCRIPTION

MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein/nucleotide sequence sets. MMseqs2 is open-source, GPL-licensed software implemented in C++ for Linux, macOS, and (as a beta version, via Cygwin) Windows. The software is designed to run on multiple cores and servers and exhibits very good scalability. MMseqs2 can run 10000 times faster than BLAST. At 100 times the speed of BLAST it achieves almost the same sensitivity. It can perform profile searches with the same sensitivity as PSI-BLAST at over 400 times its speed.

The following sections list the modules that can be used as <module>.

Easy workflows (for non-experts)

An example of running a command using the easy-* modules would be: mmseqs easy-search <queryFasta> <targetDB> <resultFile> <tmpDir>

easy-search

Search with a query fasta against target fasta (or database) and return a BLAST-compatible result in a single step

easy-linsearch

Linear time search with a query fasta against target fasta (or database) and return a BLAST-compatible result in a single step

easy-linclust

Compute clustering of a fasta/fastq database in linear time. The workflow outputs the representative sequences, a cluster tsv and a fasta-like format containing all sequences.

easy-cluster

Compute clustering of a fasta database. The workflow outputs the representative sequences, a cluster tsv and a fasta-like format containing all sequences.

easy-taxonomy

Compute taxonomy and lowest common ancestor for each sequence. The workflow outputs a taxonomic classification for sequences and a hierarchical summary report.
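
As a rough illustration of how the easy workflows above are typically invoked, the following sketch wraps two of them in shell functions. The function names, the file names in the usage note and the MMSEQS variable are placeholders made up for this example and are not part of MMseqs2. The argument order (inputs, then output, then a temporary directory) follows the usual MMseqs2 convention and should be checked against the usage message printed by each module.

```shell
# MMSEQS is a helper variable for this sketch only; set MMSEQS="echo mmseqs"
# to print the commands instead of running them.
MMSEQS=${MMSEQS:-mmseqs}

# easy_search_example QUERY_FASTA TARGET_FASTA RESULT_FILE TMP_DIR
# Searches the query FASTA against the target FASTA (or database) and
# writes a BLAST-compatible result file.
easy_search_example() {
    $MMSEQS easy-search "$1" "$2" "$3" "$4"
}

# easy_cluster_example INPUT_FASTA OUTPUT_PREFIX TMP_DIR
# Clusters the input FASTA; result files are written using OUTPUT_PREFIX.
easy_cluster_example() {
    $MMSEQS easy-cluster "$1" "$2" "$3"
}
```

For instance, easy_search_example query.fasta target.fasta result.m8 tmp would search query.fasta against target.fasta, assuming both files exist and tmp is a writable directory.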

Main tools (for non-experts)
createdb

Convert protein sequence set in a FASTA file to MMseqs sequence DB format

search

Search with query sequence or profile DB (iteratively) through target sequence DB

linsearch

Search with query sequence DB through target sequence DB

map

Fast ungapped mapping of query sequences to target sequences.

cluster

Compute clustering of a sequence DB (quadratic time)

linclust

Cluster sequences of >30% sequence identity *in linear time*

createindex

Precompute index table of sequence DB for faster searches

createlinindex

Precompute index for linsearch

enrich

Enrich a query set by searching iteratively through a profile sequence set.

rbh

Find reciprocal best hits between query and target

clusterupdate

Update clustering of old sequence DB to clustering of new sequence DB
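
The main tools above operate on MMseqs2 databases rather than on FASTA files, so a search is usually preceded by createdb. The sketch below chains these steps. The database names (queryDB, targetDB, resultDB), the function name and the MMSEQS variable are assumptions made for illustration; the createindex step is optional and mainly pays off when the same target is searched repeatedly.

```shell
# MMSEQS is a helper variable for this sketch only; set MMSEQS="echo mmseqs"
# to print the commands instead of running them.
MMSEQS=${MMSEQS:-mmseqs}

# search_example QUERY_FASTA TARGET_FASTA TMP_DIR
search_example() {
    # convert both FASTA files to sequence DBs
    $MMSEQS createdb "$1" queryDB &&
    $MMSEQS createdb "$2" targetDB &&
    # optional: precompute the index table of the target DB
    $MMSEQS createindex targetDB "$3" &&
    # search queryDB through targetDB; hits are stored in resultDB
    $MMSEQS search queryDB targetDB resultDB "$3"
}
```

The resulting resultDB is an alignment DB and still needs to be converted (for example with convertalis) before it can be read as a flat file.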

Utility tools for format conversions
createtsv

Create tab-separated flat file from prefilter DB, alignment DB, cluster DB, or taxa DB

convertalis

Convert alignment DB to BLAST-tab format or specified custom-column output format

convertprofiledb

Convert ffindex DB of HMM files to profile DB

convert2fasta

Convert sequence DB to FASTA format

result2flat

Create a FASTA-like flat file from prefilter DB, alignment DB, or cluster DB

createseqfiledb

Create DB of unaligned FASTA files (1 per cluster) from sequence DB and cluster DB
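
To make the conversion tools above more concrete, here is a minimal sketch of two common conversions: an alignment DB to a BLAST-tab file, and a cluster DB to a tab-separated file. The function names and the MMSEQS variable are hypothetical helpers for this example; the positional arguments reflect the commonly documented usage and should be verified against the installed version.

```shell
# MMSEQS is a helper variable for this sketch only; set MMSEQS="echo mmseqs"
# to print the commands instead of running them.
MMSEQS=${MMSEQS:-mmseqs}

# alignment_to_m8 QUERY_DB TARGET_DB ALIGNMENT_DB OUT_FILE
# Writes the alignments in BLAST-tab format.
alignment_to_m8() {
    $MMSEQS convertalis "$1" "$2" "$3" "$4"
}

# clusters_to_tsv SEQ_DB CLUSTER_DB OUT_TSV
# For a clustering, the same sequence DB is passed as both query and target.
clusters_to_tsv() {
    $MMSEQS createtsv "$1" "$1" "$2" "$3"
}
```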

Taxonomy tools
taxonomy

Compute taxonomy and lowest common ancestor for each sequence.

createtaxdb

Annotates a sequence database with NCBI taxonomy information

addtaxonomy

Add taxonomy information to result database.

lca

Compute the lowest common ancestor from a set of taxa.

taxonomyreport

Create Kraken-style taxonomy report.

filtertaxdb

Filter taxonomy database.
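
The taxonomy tools above are usually combined: taxonomy assigns a taxon to each query, createtsv turns the result into a flat file, and taxonomyreport summarizes it. The following sketch assumes that the target DB has already been annotated with NCBI taxonomy information (for example with createtaxdb). The names taxResult, taxResult.tsv and taxReport.txt, the function name and the MMSEQS variable are placeholders for this example.

```shell
# MMSEQS is a helper variable for this sketch only; set MMSEQS="echo mmseqs"
# to print the commands instead of running them.
MMSEQS=${MMSEQS:-mmseqs}

# taxonomy_example QUERY_DB TARGET_DB TMP_DIR
# TARGET_DB is assumed to carry taxonomy annotation already.
taxonomy_example() {
    # assign a taxon to each query sequence
    $MMSEQS taxonomy "$1" "$2" taxResult "$3" &&
    # flat tab-separated file of the assignments
    $MMSEQS createtsv "$1" taxResult taxResult.tsv &&
    # Kraken-style summary report
    $MMSEQS taxonomyreport "$2" taxResult taxReport.txt
}
```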

Multi-hit search tools
multihitdb

Create sequence database and associated metadata for multi hit searches

multihitsearch

Search with a grouped set of sequences against another grouped set

besthitperset

For each set of sequences compute the best element and update the p-value

combinepvalperset

For each set compute the combined p-value

summerizeresultsbyset

For each set compute summary statistics, such as spread-pvalue etc.

resultsbyset

For each set compute the combined p-value

mergeresultsbyset

Merge results from multiple ORFs back to their respective contig

Utility tools for clustering
mergeclusters

Merge multiple cluster DBs into single cluster DB

Core tools (for advanced users)
prefilter

Search with query sequence / profile DB through target DB (k-mer matching + ungapped alignment)

ungappedprefilter

Search with query sequence / profile DB through target DB and compute optimal ungapped alignment score

align

Compute Smith-Waterman alignments for previous results (e.g. prefilter DB, cluster DB)

alignall

Compute all-against-all Smith-Waterman alignments for a result DB (e.g. prefilter DB, cluster DB)

transitivealign

Transfers alignments by transitivity via a center star alignment

clust

Cluster sequence DB from alignment DB (e.g. created by searching DB against itself)

kmermatcher

Finds exact k-mer matches between sequences

kmersearch

Search with query sequence through target DB (k-mer matching)

kmerindexdb

Finds exact k-mer matches between sequences and stores them as index

clusthash

Cluster sequences of same length and >90% sequence identity *in linear time*
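
The core tools above are the building blocks that the higher-level workflows chain together. As a sketch of that idea, the following function clusters a sequence DB step by step: prefilter the DB against itself, align the prefiltered candidates, then cluster from the alignment DB. The intermediate names (prefDB, alnDB, cluDB), the function name and the MMSEQS variable are assumptions for this example, and the cluster workflow itself may run additional or differently parameterized steps.

```shell
# MMSEQS is a helper variable for this sketch only; set MMSEQS="echo mmseqs"
# to print the commands instead of running them.
MMSEQS=${MMSEQS:-mmseqs}

# cluster_core_example SEQ_DB
cluster_core_example() {
    # k-mer matching and ungapped alignment of the DB against itself
    $MMSEQS prefilter "$1" "$1" prefDB &&
    # Smith-Waterman alignment of the prefilter hits
    $MMSEQS align "$1" "$1" prefDB alnDB &&
    # clustering of the sequence DB from the alignment DB
    $MMSEQS clust "$1" alnDB cluDB
}
```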

Utility tools to manipulate DBs
compress

Compresses a database.

decompress

Decompresses a database.

apply

Passes each input database entry to stdin of the specified program, executes it and writes its stdout to the output database.

extractorfs

Extract open reading frames from all six frames from nucleotide sequence DB

extractframes

Extract reading frames from a nucleotide sequence DB

orftocontig

Obtain location information of extracted ORFs with respect to their contigs in alignment format

reverseseq

Reverse each sequence in a DB

touchdb

Memory map database

translatenucs

Translate nucleotide sequence DB into protein sequence DB

translateaa

Translate protein sequence DB into nucleotide sequence DB

swapresults

Reformat prefilter or alignment DB as if target DB had been searched through query DB

swapdb

Create a DB where the key is from the first column of the input result DB

mergedbs

Merge multiple DBs into a single DB, based on IDs (names) of entries

splitdb

Split an MMseqs DB into multiple DBs

splitsequence

Split sequences by length

subtractdbs

Generate a DB with entries of first DB not occurring in second DB

filterdb

Filter a DB by conditioning (regex, numerical, ...) on one of its whitespace-separated columns

createsubdb

Create a subset of a DB from a file of IDs of entries

view

Prints entries to console

rmdb

Removes the database

mvdb

Move the database

result2profile

Compute profile and consensus DB from a prefilter, alignment or cluster DB

result2pp

Merge the query profiles with target profiles according to search results and outputs an enriched profile DB

result2rbh

Filter a merged result DB to retain only reciprocal best hits

result2msa

Generate MSAs for queries by locally aligning their matched targets in prefilter/alignment/cluster DB

convertmsa

Turns an MSA file into an MSA database.

msa2profile

Turns an MSA database into an MMseqs profile database.

profile2pssm

Converts a profile database into a human readable tab-separated PSSM file.

profile2cs

Converts a profile database into a column state sequence.

result2stats

Compute statistics for each entry in a sequence, prefilter, alignment or cluster DB

proteinaln2nucl

Map protein alignment to nucleotide alignment

tsv2db

Turns a TSV file into an MMseqs database

result2repseq

Get representative sequences for a result database
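
Two of the database utilities above are sketched here for illustration: extracting and translating open reading frames from a nucleotide DB, and taking a subset of a DB from a file of entry IDs. The intermediate names (orfDB, protDB), the function names and the MMSEQS variable are placeholders invented for this example.

```shell
# MMSEQS is a helper variable for this sketch only; set MMSEQS="echo mmseqs"
# to print the commands instead of running them.
MMSEQS=${MMSEQS:-mmseqs}

# orfs_to_proteins NUCL_DB
orfs_to_proteins() {
    # extract open reading frames from all six frames
    $MMSEQS extractorfs "$1" orfDB &&
    # translate the extracted ORFs into protein sequences
    $MMSEQS translatenucs orfDB protDB
}

# subset_db ID_FILE DB OUT_DB
# ID_FILE lists the IDs of the entries to keep.
subset_db() {
    $MMSEQS createsubdb "$1" "$2" "$3"
}
```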

Special-purpose utilities
rescorediagonal

Compute sequence identity for diagonal

alignbykmer

Predict sequence identity, score, alignment start and end by kmer alignment

diffseqdbs

Find IDs of sequences kept, added and removed between two versions of sequence DB

concatdbs

Concatenate two DBs, giving new IDs to entries from second input DB

sortresult

Sort a result database in the same order as prefilter or align would.

summarizealis

Summarize alignment results into a single line showing unique coverage, coverage and average sequence identity

summarizeresult

Extract annotations from alignment DB

summarizetabs

Extract annotations from HHblits BLAST-tab-formatted results

gff2db

Turn a gff3 (generic feature format) file into a gff3 DB

masksequence

Soft-mask sequences using tantan: low-complexity regions in lower case, the rest in upper case

maskbygff

X out sequence regions in a sequence DB by features in a gff3 file

prefixid

For each entry in a DB prepend the entry ID to the entry itself

suffixid

For each entry in a DB append the entry ID to the entry itself

convertkb

Convert UniProt knowledge base files into MMseqs2 database format for the selected column types

summarizeheaders

Return a new summarized header DB from the UniProt headers of a cluster DB

extractalignedregion

Extract aligned sequence region from query

extractdomains

Extract highest scoring alignment region for each sequence from BLAST-tab file

convertca3m

Converts a cA3M database into a MMseqs2 result database.

expandaln

Expands an alignment result based on another.

countkmer

Simple k-mer counter; it prints the numeric and alphanumeric representation of each k-mer and its count