MAGMA manual (v0.2)

The program is composed of three modules: annotation, gene analysis and gene-set analysis. Gene analysis
feeds into the gene-set analysis through the [OUT].genes.raw file. All three steps can be performed at
once, or in individual steps. When performing multiple steps in one go, intermediate files are stored
as well, so later steps can be run again without rerunning the earlier steps.
Input data is binary PLINK format. It is assumed that the data has undergone quality control. It is
strongly advised that A) ancestry checks are performed during data QC and that population outliers are
removed (or split into subsamples by population); and B) that principal components are used (computed
using eg. Eigenstrat) to correct for population stratification.
MAGMA can use both raw genotype data as well as SNP p-values as input, though in the latter case a
reference data set (eg. 1000 Genomes European panel) must also be provided. It is recommended that raw
genotype data is used if possible, as the p-value only analysis is less powerful due to the loss of
information.
Using MAGMA
Input arguments for MAGMA take the form of flags (prefixed by --) followed by the relevant values needed
for that flag (if any). Many flags accept additional optional modifiers, which are keywords specified
after the values for that flag that modify the behaviour of that flag. Some modifiers consist of only the
keyword itself (eg. --flag [VALUE] modifier), other modifiers take further parameters specified by the =
sign and a comma-separated list of parameter values (eg. --flag [VALUE] modifier = param1, param2).
Annotation
./magma --annotate --snp-loc [SNPLOC_FILE] --gene-loc [GENELOC_FILE] --out [PREFIX]
Annotates SNPs to genes based on the SNP location file (no header, three columns: SNP id, chromosome,
base-pair position) and the gene location file (no header, four columns: gene name, chromosome, start
position, end position). A PLINK .bim file, for example from the data to be analysed, can also be used as
[SNPLOC_FILE]; files ending in .bim are automatically recognised as such, and the appropriate columns
selected. This has the advantage that all SNPs in the data can be annotated; when using an external SNP
location file, only SNPs with rs-ID present in both the data and the SNP location file can be used in the
analysis.
WARNING: when using your data .bim file as SNP location file, make sure that the SNP locations in the
.bim file refer to the same human reference genome build version as the gene location file!
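To illustrate how the two location formats relate, the following minimal Python sketch converts .bim lines to the three-column SNP location layout described above. It assumes the standard six-column PLINK .bim layout (chromosome, SNP id, genetic distance, base-pair position, allele 1, allele 2). The function name bim_to_snploc is hypothetical and not part of MAGMA; MAGMA reads .bim files directly, so a conversion like this is only useful for inspecting or sharing the locations.

```python
def bim_to_snploc(bim_lines):
    """Convert PLINK .bim lines to (SNP id, chromosome, base-pair position) rows."""
    rows = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chrom, snp_id, _genetic_distance, bp = fields[:4]
        rows.append((snp_id, chrom, int(bp)))
    return rows


# Made-up example lines, for illustration only.
example_bim = [
    "1 rs123 0 10500 A G",
    "X rs456 0 2000000 C T",
]
for snp_id, chrom, bp in bim_to_snploc(example_bim):
    print(snp_id, chrom, bp, sep="\t")
```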
The --annotate flag accepts three modifiers. The chr modifier specifies a subset of chromosomes to
annotate, either a single value or a range (eg. --annotate chr=3 or --annotate chr=20-X).
The window modifier specifies a window (in kilobase) around genes to be included for that gene (default
window is 0). Can either be symmetrical (single value, eg. window=5) or separately for before and after
the gene (pair of values, eg. window=5,1.5).
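The effect of the window modifier can be sketched as follows. This is an illustration only, not MAGMA's implementation: it assumes that "before" extends the lower coordinate and "after" the upper coordinate regardless of strand, and that the boundaries are inclusive. The function names parse_window and in_gene_window are hypothetical.

```python
def parse_window(value):
    """Parse 'window=' values such as '5' or '5,1.5' into (before_kb, after_kb)."""
    parts = [float(p) for p in value.split(",")]
    if len(parts) == 1:
        return parts[0], parts[0]  # symmetrical window
    return parts[0], parts[1]


def in_gene_window(snp_bp, gene_start, gene_stop, window_kb=(0, 0)):
    """Return True if the SNP position falls in the gene plus its window."""
    before_kb, after_kb = window_kb
    lower = gene_start - before_kb * 1000  # window is given in kilobase
    upper = gene_stop + after_kb * 1000
    return lower <= snp_bp <= upper


window = parse_window("5,1.5")
print(in_gene_window(9000, 10000, 20000, window))   # within 5 kb before the gene
print(in_gene_window(21600, 10000, 20000, window))  # more than 1.5 kb after the gene
```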
The filter modifier specifies a file with no header (eg. --annotate filter=data.bim), and it causes
--annotate to retain only SNPs that are specified in the first column of this file (or the second, if a .bim
file). This can be useful for example when using a very large SNP location file, which would otherwise
produce a very large gene annotation file as well.
Gene analysis
./magma --bfile [PREFIX] --gene-annot [GENEANNOT_FILE] --out [OUT]
This runs the default (PC regression) gene analysis on the PLINK data specified by [PREFIX], using a
gene-annotation (.genes.annot) file previously produced with the --annotate function. This will also
produce an [OUT].genes.raw file for subsequent gene-set analysis, unless the --genes-only flag is added.
Gene analysis with on-the-fly annotation
./magma --annotate --bfile [PREFIX] --gene-loc [GENELOC_FILE]
./magma --annotate --bfile [PREFIX] --snp-loc [SNPLOC_FILE] --gene-loc [GENELOC_FILE]
Does SNP-to-gene annotation and immediately does gene analysis. The annotation file is also saved, and is
automatically filtered for SNPs in [PREFIX].bim. If the --snp-loc flag is not set, [PREFIX].bim is also
used as the SNP location file.
Gene analysis with covariates and/or alternate phenotype
./magma --bfile [PREFIX] --gene-annot [GENEANNOT_FILE] --pheno file=[PHENO_FILE] --out [OUT]
./magma --bfile [PREFIX] --gene-annot [GENEANNOT_FILE] --covar file=[COVAR_FILE] --out [OUT]
Performs the analysis on an alternate phenotype or with covariates (or both). Files can contain an
optional header, but the first two columns must be the family ID and individual ID, corresponding to
those in the .fam file (only individuals both in the .fam file and the phenotype/covariate files are used
in the analysis).
For --pheno, the default is to use the first variable (after the two ID columns) in the file; use the use
modifier to change this. The variable can be specified by name if a header is present (eg. --pheno
file=[PHENO_FILE] use=variableX) or by variable index (eg. use=3; this does not count the two ID columns,
so this would be the *fifth* column in the file).
For --covar, the default is to use all the variables. Use the include or exclude modifiers to use only a
subset. Variables can be specified as a comma-separated list of names (if a header is present) and/or
variable indices. This can also include ranges of variables (eg. --covar file=[COVAR_FILE] include=1-5, 7,
varX-varZ will use the first five variables, the 7th variable and the variables varX, varZ and all variables
in between). In addition, the use-sex modifier includes the sex variable in the .fam file as covariate (if
no other covariates are used, the file modifier can be omitted: --covar use-sex).
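The variable numbering used by use, include and exclude can be illustrated with a small Python sketch. It is not MAGMA's parser: the function names file_column and select_variables are hypothetical, the header names are invented, and variable names that themselves contain a hyphen are not handled.

```python
def file_column(variable_index):
    """Map a 1-based variable index to the 1-based file column (two ID columns first)."""
    return variable_index + 2


def select_variables(spec, names):
    """Resolve a list such as '1-5, 7, varX-varZ' to 1-based variable indices."""
    def position(token):
        # A token is either a 1-based index or a header name.
        return int(token) if token.isdigit() else names.index(token) + 1

    selected = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if item.isdigit() or item in names:
            first = last = position(item)
        else:
            start, _, stop = item.partition("-")  # a range of indices or names
            first, last = position(start), position(stop)
        for idx in range(first, last + 1):
            if idx not in selected:
                selected.append(idx)
    return selected


# Invented header (ID columns already removed), for illustration only.
header = ["age", "sex", "pc1", "pc2", "pc3", "pc4", "pc5", "varX", "varY", "varZ"]
print(select_variables("1-5, 7, varX-varZ", header))
print(file_column(3))
```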
Gene analysis on summary statistics
./magma --bfile [PREFIX] --gene-annot [GENEANNOT_FILE] --pval [SNPPVAL_FILE] N=[SAMPLE_SIZE] --out [BATCH_OUT]
Performs gene analysis on SNP p-values, using an appropriate reference data-set to obtain estimates of
the LD (a typical choice would be the 1000 Genomes European panel). The p-value file needs to contain a
column of SNP ids and of SNP p-values. If the file has a header the program looks for columns named SNP
and P (not case-sensitive; should work automatically with PLINK SNP analysis output files), otherwise it
uses the first (ids) and second (p-values) columns. Use the use modifier (use=[SNP_COL],[PVAL_COL]) to
change this. Use the --snp-wise flag to change the model used for the gene analysis (see below).
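The default column selection for the p-value file can be sketched in Python as follows. The manual does not state how the presence of a header is detected, so this sketch assumes a header is present exactly when columns named SNP and P are both found on the first line. The function name find_pval_columns is hypothetical, and the returned positions are 0-based.

```python
def find_pval_columns(first_line):
    """Return (snp_col, pval_col, has_header) for the first line of a p-value file."""
    fields = [f.lower() for f in first_line.split()]
    if "snp" in fields and "p" in fields:
        # Header found: use the columns named SNP and P (not case-sensitive).
        return fields.index("snp"), fields.index("p"), True
    # No recognisable header: first column holds ids, second holds p-values.
    return 0, 1, False


print(find_pval_columns("CHR SNP BP A1 TEST NMISS OR STAT P"))
print(find_pval_columns("rs123 0.0431"))
```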
Gene analysis in batch mode
./magma --bfile [PREFIX] --gene-annot [GENEANNOT_FILE] --batch [INDEX] [TOTAL] --out [BATCH_OUT]
./magma --merge [BATCH_OUT] --out [MERGE_OUT]
To facilitate parallel computation of the gene analysis, the --batch flag can be used. [TOTAL] specifies
the total number of parts to split the computation into, [INDEX] the particular part to compute the gene
analysis for. Thus, one could for example run MAGMA in 8 batches with --batch [INDEX] 8, running the
program 8 times with [INDEX] = 1, …, [INDEX] = 8. The --merge function then combines the parts back into
a single set of output files.
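As a sketch of how the batch runs could be scripted, the Python snippet below builds the command lines for all parts plus the final merge, using only the flags shown above. The function name batch_commands and the file names in the example are hypothetical; how the commands are dispatched (for example to a cluster scheduler) is left to the user.

```python
def batch_commands(prefix, annot_file, batch_out, merge_out, total):
    """Build one gene analysis command per batch, followed by the merge command."""
    commands = [
        f"./magma --bfile {prefix} --gene-annot {annot_file} "
        f"--batch {index} {total} --out {batch_out}"
        for index in range(1, total + 1)
    ]
    # The merge step must only be run after all batches have finished.
    commands.append(f"./magma --merge {batch_out} --out {merge_out}")
    return commands


for command in batch_commands("mydata", "mydata.genes.annot", "part", "merged", 8):
    print(command)
```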
Gene-set analysis
./magma --gene-results [NAME].genes.raw --set-annot [SETANNOT_FILE] --out [OUT]
Runs gene-set analysis using the .genes.raw file generated by an earlier MAGMA gene analysis and the
provided set annotation file. Sets in the set annotation file must be specified by line (whitespace-separated), with the first value on each line the name or ID of the set and the values that follow it all
gene IDs. Alternatively, --set-annot [SETANNOT_FILE] col=[GENE_COL],[SET_COL] can be used if the set
annotation file is in column-based format, where the col modifier specifies which column contains the
gene IDs and which column the gene-set names.
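The two set annotation layouts can be illustrated with a short Python sketch that reads either layout into the same structure. This is not MAGMA's reader: the function names are hypothetical, and the column numbers passed to columns_to_sets are assumed to be 1-based, as with the col modifier.

```python
def parse_row_sets(lines):
    """Row-based format: set name first, then the gene IDs of that set."""
    sets = {}
    for line in lines:
        fields = line.split()
        if fields:
            sets[fields[0]] = fields[1:]
    return sets


def columns_to_sets(lines, gene_col, set_col):
    """Column-based format: one gene/set pair per line (1-based column numbers)."""
    sets = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= max(gene_col, set_col):
            sets.setdefault(fields[set_col - 1], []).append(fields[gene_col - 1])
    return sets


# Invented set and gene IDs, for illustration only.
row_based = ["setA 101 102 103", "setB 102 204"]
column_based = ["101 setA", "102 setA", "103 setA", "102 setB", "204 setB"]
print(parse_row_sets(row_based) == columns_to_sets(column_based, 1, 2))
```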
The modifier no-size-correct can be added to turn off the correction for gene size and gene density for
the competitive gene-set analysis. This is not recommended. Note that this will also turn off the
correction for any concurrent gene property analyses.
Gene-set analysis can also be run in conjunction with a gene analysis, or a merge, eg.:
./magma --bfile [PREFIX] --gene-annot [GENEANNOT_FILE] --set-annot [SETANNOT_FILE] --out [OUT]
./magma --merge [BATCH_OUT] --set-annot [SETANNOT_FILE] --out [OUT]
Gene property analysis
./magma --gene-results [NAME].genes.raw --gene-covar [GCOV_FILE] --out [OUT]
Runs gene property analysis, the generalization of the competitive gene-set analysis to continuous gene-level variables. The gene covariate file has the same kind of structure as regular covariate files used
with --covar; columns correspond to gene properties, rows to genes. The first column should contain gene
IDs. Like --covar all variables in the file are used by default, unless a subset is specified using the
include or exclude modifier. Note that at present, mean imputation is used for genes for which values are
missing (ie. value is set to NA or gene is missing from file entirely).
The gene property analysis can be run simultaneously with the gene-set analysis, and like the gene-set
analysis has a no-size-correct modifier and can be run in conjunction with a gene analysis or a merge.
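The mean imputation of missing gene property values mentioned above can be sketched as follows. This is a minimal illustration, assuming that the mean is taken over the genes with an observed value; the function name mean_impute is hypothetical, and None stands in for an NA entry.

```python
def mean_impute(values_by_gene, all_genes):
    """Fill in the mean for genes with a missing value or absent from the file."""
    observed = [v for v in values_by_gene.values() if v is not None]
    mean = sum(observed) / len(observed)
    imputed = {}
    for gene in all_genes:
        value = values_by_gene.get(gene)  # None if NA or gene not in file
        imputed[gene] = mean if value is None else value
    return imputed


# Invented gene IDs and values: geneB is NA, geneD is absent from the file.
properties = {"geneA": 1.0, "geneB": None, "geneC": 3.0}
print(mean_impute(properties, ["geneA", "geneB", "geneC", "geneD"]))
```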
Conditional gene-set / gene property analysis
./magma --gene-results [NAME].genes.raw --gene-covar [GCOV_FILE] condition=[VARIABLES] --out [OUT]
./magma --gene-results [NAME].genes.raw --set-annot [SETANNOT_FILE] --gene-covar [GCOV_FILE] condition=[VARIABLES] --out [OUT]
./magma --gene-results [NAME].genes.raw --set-annot [SETANNOT_FILE] --gene-covar [GCOV_FILE] condition-only=[VARIABLES] --out [OUT]
Gene-set and gene property analyses can be conditioned on variables in the gene covariate file (they are
always also conditioned on gene size, gene density and the log value of each, unless the no-size-correct
modifier is added). This can be done by specifying which variables to condition on using the condition
modifier. These variables are not analysed themselves. To perform conditional gene-set analysis only, use
the condition-only modifier instead.
Additional options
SNP-wise gene analysis
./magma --bfile [PREFIX] --gene-annot [GENEANNOT_FILE] --snp-wise --out [OUT]
./magma --bfile [PREFIX] --gene-annot [GENEANNOT_FILE] --snp-wise model=[MODEL] --out [OUT]
./magma --bfile [PREFIX] --gene-annot [GENEANNOT_FILE] --snp-wise stat=[STAT] --out [OUT]
./magma --bfile [PREFIX] --gene-annot [GENEANNOT_FILE] --snp-wise model=[MODEL] stat=[STAT] --out [OUT]
The --snp-wise flag can be used to perform a SNP-wise analysis rather than using the PC regression model;
or when used in conjunction with the --pval flag, to change the settings of the SNP-wise model used. The
stat modifier can be ‘chi’ or ‘chisq’ to use chi-square SNP-statistics, or ‘z’ or ‘Z’ to use standard
normal SNP-statistics (default is chi-square). The model modifier specifies how SNP-statistics are
aggregated: it can be ‘unweighted’ for unweighted mean SNP-statistic, ‘weighted’ for weighted (based on
SNP LD matrix) mean SNP-statistic or ‘top’ for highest SNP-statistic (default is unweighted mean). For
model=top, a second value can be specified to use the mean of several of the highest SNP-statistics; this
second value specifies either an absolute number (eg. model=top,3 to use the mean of the top 3 SNP-
statistics in the gene) or a fraction (eg. model=top,0.1 to use the top 10% SNP-statistics in the gene).
Note that model=top requires permutation, and as such will take considerably longer to compute than other
analyses.
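How the second value of model=top selects SNP-statistics can be sketched in Python. This covers only the aggregation of the statistics within one gene, not the permutation step needed to obtain a p-value. How a fraction is rounded to a number of SNPs is not stated above, so rounding up (with a minimum of one SNP) is an assumption here, and the function name top_statistic is hypothetical.

```python
import math


def top_statistic(snp_stats, top=1):
    """Mean of the highest SNP-statistics: top is a count (>= 1) or a fraction (< 1)."""
    ordered = sorted(snp_stats, reverse=True)
    if 0 < top < 1:
        # Fraction of the SNPs in the gene; rounding up is an assumption.
        count = max(1, math.ceil(top * len(ordered)))
    else:
        count = min(int(top), len(ordered))
    return sum(ordered[:count]) / count


# Invented chi-square statistics for one gene.
stats = [1.0, 9.0, 4.0, 2.0]
print(top_statistic(stats))       # highest statistic only
print(top_statistic(stats, 3))    # mean of the top 3
print(top_statistic(stats, 0.5))  # mean of the top 50%
```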
Rare variant analysis
./magma --bfile [PREFIX] --gene-annot [GENEANNOT_FILE] --rare [MAF_CUTOFF] --out [OUT]
./magma --bfile [PREFIX] --gene-annot [GENEANNOT_FILE] --rare [MAF_CUTOFF] max=[MAX] --out [OUT]
The --rare flag can be used to aggregate rare variants in a gene to a compound variable (the minor allele
burden score), where rare variants are defined by the MAF_CUTOFF specified. The rare variants themselves
are removed from the gene, and are replaced by the burden score. If the max modifier is specified, no
more than MAX rare variants are aggregated into a single burden score variable. If more than MAX rare
variants are present in a gene, multiple burden score variables are created.
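The aggregation into burden scores can be illustrated with the sketch below. It rests on two assumptions that are not specified above: the burden score is taken to be the unweighted sum of minor allele counts over the rare variants, and when max is given the variants are split into consecutive groups of at most MAX. The function name burden_scores is hypothetical.

```python
def burden_scores(genotypes, max_variants=None):
    """Sum minor allele counts per individual, in groups of at most max_variants.

    genotypes: one list per rare variant, holding the minor allele count
    (0, 1 or 2) of each individual.
    """
    if not genotypes:
        return []
    size = max_variants or len(genotypes)
    scores = []
    for start in range(0, len(genotypes), size):
        group = genotypes[start:start + size]
        # zip(*group) iterates over individuals across the variants in the group.
        scores.append([sum(counts) for counts in zip(*group)])
    return scores


# Three invented rare variants genotyped in three individuals.
rare = [[0, 1, 0], [1, 0, 0], [0, 0, 2]]
print(burden_scores(rare))                  # one burden score variable
print(burden_scores(rare, max_variants=2))  # two burden score variables
```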
Fixed-effects meta-analysis
./magma --meta genes=[FILENAMES] --out [OUT]
./magma --meta sets=[FILENAMES] --out [OUT]
Merges the provided comma-separated list of either .genes.out or .sets.out files using Stouffer’s
weighted Z method, with the square root of the sample size as weights.
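For reference, Stouffer's weighted Z method combines the per-study Z-values z_i with weights w_i as Z = sum(w_i * z_i) / sqrt(sum(w_i^2)), here with w_i equal to the square root of the sample size N_i. A minimal Python sketch of that formula follows; the function name stouffer_weighted_z and the example values are hypothetical, and this is not MAGMA's own code.

```python
import math


def stouffer_weighted_z(z_values, sample_sizes):
    """Combine Z-values across studies, weighting by the square root of sample size."""
    weights = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_values))
    denominator = math.sqrt(sum(w * w for w in weights))
    return numerator / denominator


# Invented Z-values for one gene in two studies of 1000 and 4000 individuals.
print(stouffer_weighted_z([1.5, 2.5], [1000, 4000]))
```

With equal sample sizes this reduces to the unweighted form, the sum of the Z-values divided by the square root of the number of studies.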
Outlier removal in gene-set and gene property analysis
Use --set-settings truncate=[LOWER],[UPPER] or --set-settings truncate=[MAX] to truncate outlier gene Z-values during the gene-set and gene property analysis (if truncate=[MAX], LOWER and UPPER are both equal
to MAX). Bounds are specified as mean(Z) - LOWER × sd(Z) and mean(Z) + UPPER × sd(Z), and Z-values outside
those bounds are set to those bounds instead.