File Formats

Annotation Files

remeta uses many of the same files as regenie to define variant sets. Most annotation files compatible with regenie should also be compatible with remeta.

--anno-file

1:55039839:T:C PCSK9 LoF
1:55039842:G:A PCSK9 missense
.

A file defining variant annotations. Contains 3 whitespace delimited columns: variant id (in CPRA format), gene name, and variant annotation.
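As an illustration of the layout, the following is a minimal Python sketch that reads annotation lines into records. It is not part of remeta; the names Annotation and parse_anno_line are hypothetical, and the sketch assumes CPRA ids are colon-separated chromosome, position, reference allele, and alternate allele, as in the example above.

```python
from typing import NamedTuple, Optional


class Annotation(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    gene: str
    annotation: str


def parse_anno_line(line: str) -> Optional[Annotation]:
    """Parse one annotation line; return None for lines without 3 columns."""
    fields = line.split()  # split on any run of whitespace
    if len(fields) != 3:
        return None
    variant_id, gene, annotation = fields
    # Assumption: CPRA ids look like CHROM:POS:REF:ALT
    chrom, pos, ref, alt = variant_id.split(":")
    return Annotation(chrom, int(pos), ref, alt, gene, annotation)


example = [
    "1:55039839:T:C PCSK9 LoF",
    "1:55039842:G:A PCSK9 missense",
]
records = [r for r in map(parse_anno_line, example) if r is not None]
```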

--set-list

A1BG 19  58346922  19:58346922:C:A,19:58346924:G:A,...
A1CF 10  50806630  10:50806630:A:G,10:50806630:A:AT,...
.

A file defining which variants are part of a gene set. Contains 4 whitespace delimited columns: gene name, chromosome, start position, and a comma separated list of variants in the gene.

--mask-def

LoF LoF
missense missense
LoF_missense LoF,missense
.

A file specifying which annotation categories to combine into masks. Contains 2 whitespace delimited columns: mask name, and a comma separated list of variant annotations in a mask.
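To make the mapping from annotations to masks concrete, here is a small Python sketch. It only illustrates how annotation categories are combined by name; it is not remeta's implementation and ignores any allele frequency filtering. The helper names parse_mask_def and variants_in_mask are hypothetical.

```python
def parse_mask_def(lines):
    """Map each mask name to the set of annotations it combines."""
    masks = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        name, annos = fields
        masks[name] = set(annos.split(","))
    return masks


def variants_in_mask(annotations, mask_annos):
    """Return variant ids whose annotation belongs to the mask."""
    return [vid for vid, anno in annotations.items() if anno in mask_annos]


mask_def = ["LoF LoF", "missense missense", "LoF_missense LoF,missense"]
# Variant id -> annotation, as it would be read from an --anno-file
annotations = {"1:55039839:T:C": "LoF", "1:55039842:G:A": "missense"}

masks = parse_mask_def(mask_def)
members = {name: variants_in_mask(annotations, annos) for name, annos in masks.items()}
```

In this sketch the LoF_missense mask contains both variants, while the LoF and missense masks contain one variant each.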

--aaf-file

An optional file with variant alternate allele frequencies. If specified, these frequencies are used for building masks. Three formats are supported. All formats assume whitespace separated columns.

Here is an example of the five column format:

7:6187101:C:T 1.53918207864341e-05 0 7 6187101
7:6190395:C:A 2.19920388819247e-06 1 7 6190395
.

Reference LD Files

--ld-prefix

A set of three files named $PREFIX.remeta.gene.ld, $PREFIX.remeta.buffer.ld, and $PREFIX.remeta.ld.idx.gz generated by remeta compute-ref-ld. The index $PREFIX.remeta.ld.idx.gz is bgzipped and human readable. The columns are:

--gene-list

PCSK9   1   55039446    55064852
USP24   1   55066358    55215364
.

A file listing gene start and end positions. Contains 4 whitespace separated columns: gene name, chromosome, start position, end position. Note that this file must align with the --set-list file for gene-based tests. Specifically,

  1. Gene names in the --gene-list file should match gene names in the --set-list file exactly.
  2. Any variant in the --set-list must fall between the start position and end position of the LD matrix for that gene; otherwise it will be dropped from the test (unless the --keep-variants-not-in-ld-mat option is specified).
  3. The start and end position should be inclusive of any variant that could appear in a test. In particular, it is not recommended to set the start and end positions based on the variants in the set list. If the set list changes, this could result in variants being dropped from a test.
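The first two requirements can be checked before running a gene-based test. The sketch below is a hypothetical pre-flight check, not a remeta feature. It assumes the LD matrix boundaries equal the start and end positions in the --gene-list file, that both ends are inclusive, and that variant ids begin with CHROM:POS. The function names and the third PCSK9 variant are made up for illustration.

```python
def parse_gene_list(lines):
    """Map gene name -> (chromosome, start, end)."""
    genes = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        name, chrom, start, end = fields
        genes[name] = (chrom, int(start), int(end))
    return genes


def check_set_list(set_list_lines, genes):
    """Return (gene, variant, reason) tuples for entries that would not align."""
    problems = []
    for line in set_list_lines:
        fields = line.split()
        if len(fields) != 4:
            continue
        gene, _chrom, _start, variants = fields
        if gene not in genes:
            problems.append((gene, None, "gene missing from gene list"))
            continue
        g_chrom, g_start, g_end = genes[gene]
        for vid in variants.split(","):
            v_chrom, v_pos = vid.split(":")[:2]
            # Assumption: boundaries are inclusive on both ends
            if v_chrom != g_chrom or not (g_start <= int(v_pos) <= g_end):
                problems.append((gene, vid, "variant outside gene boundaries"))
    return problems


genes = parse_gene_list(["PCSK9 1 55039446 55064852"])
set_list = [
    # The last variant is invented and lies past the PCSK9 end position
    "PCSK9 1 55039839 1:55039839:T:C,1:55039842:G:A,1:55070000:G:T",
    "A1BG 19 58346922 19:58346922:C:A",
]
problems = check_set_list(set_list, genes)
```

Here the check reports the invented PCSK9 variant as out of bounds and flags A1BG as absent from the gene list.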

In the remeta repository we provide an example gene list at resources/Ensembl100.GRCh38.chr1_23.gene_list.txt.gz. This file was created by extracting gene boundaries for protein coding genes annotated in Ensembl release 100. An Ensembl GTF file was downloaded from Ensembl; start and end positions were extracted from lines with the feature gene and the biotype protein_coding.
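A comparable gene list could be derived from a GTF file along the following lines. This is a sketch under stated assumptions, not the script used to build the provided file: it assumes the standard 9-column tab-separated GTF layout with Ensembl-style gene_name and gene_biotype attributes, and the function name gene_list_from_gtf and the example record are hypothetical.

```python
import re


def gene_list_from_gtf(lines):
    """Return (gene name, chromosome, start, end) for protein coding genes."""
    rows = []
    for line in lines:
        if line.startswith("#"):
            continue  # header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue  # keep only lines with feature "gene"
        # Attributes look like: key "value"; key "value";
        attrs = dict(re.findall(r'(\w+) "([^"]*)"', fields[8]))
        if attrs.get("gene_biotype") != "protein_coding":
            continue
        # Assumption: use gene_name, falling back to gene_id when it is absent
        name = attrs.get("gene_name") or attrs.get("gene_id")
        rows.append((name, fields[0], int(fields[3]), int(fields[4])))
    return rows


# Synthetic GTF records for illustration only
gtf = [
    "#!genome-build GRCh38",
    "\t".join(["1", "example", "gene", "1000", "5000", ".", "+", ".",
               'gene_id "G0001"; gene_name "GENE1"; gene_biotype "protein_coding";']),
    "\t".join(["1", "example", "gene", "7000", "9000", ".", "-", ".",
               'gene_id "G0002"; gene_name "GENE2"; gene_biotype "lncRNA";']),
    "\t".join(["1", "example", "exon", "1000", "1200", ".", "+", ".",
               'gene_id "G0001"; gene_name "GENE1"; gene_biotype "protein_coding";']),
]
gene_rows = gene_list_from_gtf(gtf)
```

Each resulting row can be written out as one whitespace separated line in the --gene-list layout.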

--target-pfile and --buffer-pfile

A set of pgen, pvar, and psam files from plink2.

--genetic-map

A genetic map in the SHAPEIT format. Contains 3 columns: position, chromosome, and centimorgan. Note that genetic maps are available from the SHAPEIT5 repository.

Miscellaneous Files

--htp

A file in htp format, the default output format of remeta. regenie also writes this format when run with the --htp option. htp files are whitespace separated with the following columns:

--extract and --exclude

Files with variant ids (one per line) to include or exclude from meta-analysis.

--genep-def

LoF LoF
NonSyn LoF,missense,LoF_missense
.

A file defining which masks to combine with ACAT. Contains two space-separated columns: the name of the GENEP set, and a comma separated list of masks to include.
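Because a GENEP set refers to masks by name, it can be useful to confirm that the names line up with the --mask-def file. The sketch below is a hypothetical consistency check, not part of remeta. It assumes every mask listed in --genep-def should be defined in --mask-def, and parse_two_column_def is an illustrative helper.

```python
def parse_two_column_def(lines):
    """Map the first column to the comma separated items in the second."""
    out = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        out[fields[0]] = fields[1].split(",")
    return out


mask_def = parse_two_column_def([
    "LoF LoF",
    "missense missense",
    "LoF_missense LoF,missense",
])
genep_def = parse_two_column_def([
    "LoF LoF",
    "NonSyn LoF,missense,LoF_missense",
])

# GENEP set name -> masks it lists that have no definition in mask_def
undefined = {}
for set_name, mask_names in genep_def.items():
    missing = [m for m in mask_names if m not in mask_def]
    if missing:
        undefined[set_name] = missing
```

With the example files above, undefined is empty because every listed mask has a definition.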