QTLtools-cis(1)

cis QTL analysis

Section 1 qtltools bookworm source

Description

QTLtools-cis

NAME

QTLtools cis - cis QTL analysis

SYNOPSIS

QTLtools cis --vcf [in.vcf|in.vcf.gz|in.bcf|in.bed.gz] --bed quantifications.bed.gz [--nominal float | --permute integer | --mapping in.txt] --out output.txt [OPTIONS]

DESCRIPTION

This mode maps cis (proximal) quantitative trait loci (QTLs) that affect the phenotype, using linear regression. The method is detailed in <https://www.nature.com/articles/ncomms15452>. We first regress out the provided covariates from the phenotype data, followed by running the linear regression between the phenotype residuals and the genotype. If --normal and --cov are provided at the same time, then the residuals after the covariate correction are rank normal transformed. It incorporates an efficient permutation scheme to control for differential multiple testing burden of each phenotype. You can run a nominal pass (--nominal threshold) listing all genotype-phenotype associations below a certain threshold, a permutation pass (--permute no_of_permutations) to empirically characterize the null distribution of associations for each phenotype separately, thus adjusting the nominal p-value of the best association for a phenotype, or a conditional analysis pass (--mapping filename) to discover multiple proximal QTLs with independent effects on a phenotype.

As multiple molecular phenotypes can belong to higher order biological entities, e.g. exons of genes, QTLtools cis allows grouping of phenotypes to maximize the discoveries in such particular cases. Specifically, QTLtools can either aggregate multiple phenotypes in a given group into a single phenotype via PCA (--grp-pca1) or by taking their mean (--grp-mean), or directly use all individual phenotypes in an extended permutation scheme that accounts for their number and correlation structure (--grp-best). In our experience, --grp-best outperforms the other options for expression QTLs (eQTLs).

The conditional analysis pass first uses permutations to derive a nominal p-value threshold per phenotype that varies and reflects the number of independent tests per cis-window. Then, it uses a forward-backward stepwise regression to learn the number of independent signals per phenotype, determine the best candidate variant per signal and assign all significant hits to the independent signal they relate to.

Since linear regressions assumes normally distributed data, we highly recommend using the --normal option to rank normal transform the phenotype quantifications in order to avoid false positive associations due to outliers.

OPTIONS

--vcf [in.vcf|in.bcf|in.vcf.gz|in.bed.gz]

Genotypes in VCF/BCF format, or another molecular phenotype in BED format. If there is a DS field in the genotype FORMAT of a variant (dosage of the genotype calculated from genotype probabilities, e.g. after imputation), then this is used as the genotype. If there is only the GT field in the genotype FORMAT then this is used and it is converted to a dosage. REQUIRED.

--bed quantifications.bed.gz

Molecular phenotype quantifications in BED format. REQUIRED.

--out output.txt

Output file. REQUIRED.

--cov covariates.txt

Covariates to correct the phenotype data with.

--normal

Rank normal transform the phenotype data so that each phenotype is normally distributed. RECOMMENDED.

--std-err

Calculate and output the standard error of the beta (slope).

--window integer

Size of the cis window flanking each phenotype’s start position. DEFAULT=1000000. RECOMMENDED=1000000.

--nominal float

Calculate the nominal p-value for the genotype-phenotype associations and print out the ones that pass the provided threshold. Give 1.0 to print everything. Mutually exclusive with --permute and --mapping.

--permute integer

Adjust the best nominal p-value for this phenotype accounting for the number of variants and the linkage disequilibrium in its cis-window. We recommend at least 1000 permutation for the final analysis, and in most cases you will see diminishing returns when going over 5000. However, if you are doing exploratory analyses like which/how many covariates to include, you can go as low as 100. Mutually exclusive with --nominal and --mapping. RECOMMENDED=1000.

--mapping thresholds_filename

The conditional analysis. First you need to run a permutation analysis, then generate a thresholds file using the runFDR_cis.R script in the script directory. Mutually exclusive with --nominal and --permute.

--grp-best

Correct for multiple phenotypes within a phenotype group. Mutually exclusive with --grp-pca1 and --grp-mean.

--grp-pca1

Run PCA on phenotypes within a phenotype group and use PC1 for association testing. Mutually exclusive with --grp-best and --grp-mean.

--grp-mean

Average phenotypes within a group and use the results for association testing. Mutually exclusive with --grp-best and --grp-pca1.

--chunk integer1 integer2

For parallelization. Divide the data into integer2 number of chunks and process chunk number integer1. Chunk 0 will print a header. Mutually exclusive with --region and --region-pair. Minimum number of chunks has to be at least the same number of chromosomes in the --bed file.

--region chr:start-end

Genomic region to be processed. E.g. chr4:12334456-16334456, or chr5 Mutually exclusive with --chunk and --region-pair.

--region-pair chr:start-end chr:start-end

Genomic region for genotypes followed by region for phenotypes to be processed. Mutually exclusive with --chunk and --region.

OUTPUT FILES

--nominal output file

Space separated output file with the following columns (certain columns are only printed based on options). We recommend including chunk 0 to print out a header in order to avoid confusion.

Image grohtml-80736-1.png

--permute output file

Space separated output file with the following columns (certain columns are only printed based on options). We recommend including chunk 0 to print out a header in order to avoid confusion.

Image grohtml-80736-2.png

--mapping output file

Space separated output file with the following columns (certain columns are only printed based on options). We recommend including chunk 0 to print out a header in order to avoid confusion.

Image grohtml-80736-3.png

NOMINAL ANALYSIS EXAMPLES

o

Nominal pass, correcting for technical covariates, rank normal transforming the phenotype, and printing out associations with a p-value <= 0.01 on chromosome 22 between 17000000 and 18000000 bp, and excluding some samples (see QTLtools (1)):

QTLtools cis --vcf genotypes.chr22.vcf.gz --bed genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --nominal 0.01 --region chr22:17000000-18000000 --normal --out nominals.txt --exclude-samples sample_names_to_exclude.txt

o

Nominal pass with parallelization correcting for technical covariates, rank normal transforming the phenotype, and printing out associations with a p-value <= 0.01. To facilitate parallelization on compute cluster, we developed an option to run the analysis into chunks of molecular phenotypes. For instance, to run analysis on chunk 12 when splitting the example data set into 20 chunks, run:

QTLtools cis --vcf genotypes.chr22.vcf.gz --bed genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --nominal 0.01 --chunk 12 20 --normal --out nominals_12_20.txt

o

If you want to submit the whole analysis with 20 jobs on a compute cluster, just run (qsub needs to be changed to the job submission system used [bsub, psub, etc...]):

for j in $(seq 0 20); do

echo "QTLtools cis --vcf genotypes.chr22.vcf.gz --bed genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --nominal 0.01 --chunk $j 20 --normal --out nominals_$j\_20.txt" | qsub

done

PERMUTATION ANALYSIS EXAMPLES

o

Permutation pass, correcting for technical covariates, rank normal transforming the phenotype, and running 1000 permutations with a specific random seed on chromosome 22 between 17000000 and 18000000 bp:

QTLtools cis --vcf genotypes.chr22.vcf.gz --bed genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --permute 1000 --region chr22:17000000-18000000 --normal --seed 1354145 --out permutation.txt

o

Permutation pass with parallelization correcting for technical covariates, rank normal transforming the phenotype, and running 5000 permutations. To facilitate parallelization on compute cluster, we developed an option to run the analysis into chunks of molecular phenotypes. For instance, to run analysis on chunk 12 when splitting the example data set into 20 chunks, run:

QTLtools cis --vcf genotypes.chr22.vcf.gz --bed genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --permute 5000 --chunk 12 20 --normal --out permutations_12_20.txt

o

If you want to submit the whole analysis with 20 jobs on a compute cluster, just run (qsub needs to be changed to the job submission system used [bsub, psub, etc...]):

for j in $(seq 0 20); do

echo "QTLtools cis --vcf genotypes.chr22.vcf.gz --bed genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --permute 5000 --chunk $j 20 --normal --out permutations_$j\_20.txt" | qsub

done

o

When your phenotypes in the BED file are grouped, you can perform a permutation pass at the phenotype group level in order to discover group-level QTLs:

QTLtools cis --vcf genotypes.chr22.vcf.gz --bed exons.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --permute 1000 --normal --grp-best --out permutation.group.txt

CONDITIONAL ANALYSIS EXAMPLE

Conditional analysis to discover independent signals.

1

First we need to run a permutation analysis (see previous section), then calculate nominal p-value threshold for each gene. Here an FDR of 5% is given as an example:

cat permutations_*.txt | gzip -c > permutations_all.txt.gz

Rscript ./script/qtltools_runFDR_cis.R permutations_all.txt.gz 0.05 permutations_all

2

Now you can proceed with the actual conditional analysis. Here splitting into 20 chunks, and when all complete concatenate the results:

for j in $(seq 0 20); do

echo "QTLtools cis --vcf genotypes.chr22.vcf.gz --bed genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --mapping permutations_all.thresholds.txt --chunk $j 20 --normal --out conditional_$j\_20.txt" | qsub

done
cat conditional_*.txt > conditional_all.txt

3

If you are interested in the most significant variants per independent signal, you can filter the results, using the backward p-value:

awk ’{ if ($21 == 1) print $0 }’ conditional_all.txt > conditional_top_variants.txt

SEE ALSO

QTLtools(1)

QTLtools website: <https://qtltools.github.io/qtltools>

BUGS

o

Versions up to and including 1.2, suffer from a bug in reading missing genotypes in VCF/BCF files. This bug affects variants with a DS field in their genotype’s FORMAT and have a missing genotype (DS field is .) in one of the samples, in which case genotypes for all the samples are set to missing, effectively removing this variant from the analyses.

Please submit bugs to <https://github.com/qtltools/qtltools>

CITATION

Delaneau, O., Ongen, H., Brown, A. et al. A complete tool set for molecular QTL discovery and analysis. Nat Commun 8, 15452 (2017). <https://doi.org/10.1038/ncomms15452>

AUTHORS

Olivier Delaneau (olivier.delaneau@gmail.com), Halit Ongen (halitongen@gmail.com)