Recommendations for UK Biobank analysis

regenie is ideally suited for large-scale analyses such as 500K UK Biobank (UKBB) data, where records are available for thousands of phenotypes.

We provide below a few guidelines on how to perform such analysis on the UKBB files that all UKBB approved researchers have access to.

Pre-processing

We will first go over important steps to consider before running regenie.

Selection of traits

regenie can perform whole genome regression on multiple traits at once, which is where higher computational gains are obtained.

As different traits can have distinct missing patterns, regenie uses an imputation scheme to handle missing data. From the real data applications we have performed so far with traits having up to ~20% (for quantitative) and ~5% (for binary) missing observations, our imputation scheme resulted in nearly identical results as from discarding missing observations when analyzing each trait separately (see the paper for details). Hence, we recommend to analyze traits in groups that have similar missingness patterns with resonably low amount of missingness (<15%).

The number of phenotypes in a group will affect the computational resources required and the table below shows typical computational requirements based on using 500,000 markers in step 1 split in blocks of 1000 and using blocks of size 200 when testing SNPs in step 2. The estimates are shown when step 1 of regenie is run in low-memory mode so that within-block predictions are temporarily stored on disk (see Documentation).

Rflow

In the following sections, we'll assume traits (let's say binary) and covariates used in the analysis have been chosen and data are in files ukb_phenotypes_BT.txt and ukb_covariates.txt, which follow the format requirement for regenie (see Documentation).

Preparing genotype file

Step 1 of a regenie run requires a single genotype file as input; we recommend using array genotypes for this step. The UKBB genotype files are split by chromosome, so we recommend using PLINK to merge the files using the following code.

NOTE: please change XXX to you own UKBB application ID number

rm -f list_beds.txt
for chr in {2..22}; do echo "ukb_cal_chr${chr}_v2.bed ukb_snp_chr${chr}_v2.bim ukbXXX_int_chr1_v2_s488373.fam" >> list_beds.txt; done

plink \
  --bed ukb_cal_chr1_v2.bed \
  --bim ukb_snp_chr1_v2.bim \
  --fam ukbXXX_int_chr1_v2_s488373.fam \
  --merge-list list_beds.txt \
  --make-bed --out ukb_cal_allChrs

Exclusion files

Quality control (QC) filters can be applied using PLINK2 to filter out samples and markers in the genotype file prior to step 1 of regenie.

Note: regenie will throw an error if a low-variance SNP is included in the step 1 run. Hence, the user should run adequate QC filtering prior to running regenie to identify and remove such SNPs.

For example, to filter out SNPs with minor allele frequency (MAF) below 1%, minor allele count (MAC) below 100, genotype missingess above 10% and Hardy-Weinberg equilibrium p-value exceeding , and samples with more than 10% missingness,

plink2 \
  --bfile ukb_cal_allChrs \
  --maf 0.01 --mac 100 --geno 0.1 --hwe 1e-15 \
  --mind 0.1 \
  --write-snplist --write-samples --no-id-header \
  --out qc_pass

Step 1

We recommend to run regenie using multi-threading (8+ threads) which will decrease the overall runtime of the program. As this step can be quite memory intensive (due to storing block predictions), we recommend to use option --lowmem, where the number of phenotypes analyzed will determine how much disk space is required (see table above).

Running step 1 of regenie (by default, all available threads are used)

./regenie \
  --step 1 \
  --bed ukb_cal_allChrs \
  --extract qc_pass.snplist \
  --keep qc_pass.id \
  --phenoFile ukb_phenotypes_BT.txt \
  --covarFile ukb_covariates.txt \
  --bt \
  --bsize 1000 \
  --lowmem \
  --lowmem-prefix tmpdir/regenie_tmp_preds \
  --out ukb_step1_BT

For P phenotypes analyzed, this will generate a set of $P$ files ending with .loco which contain the genetic predictions using a LOCO scheme that will be needed for step 2, as well as a prediction list file ukb_step1_BT_pred.list, which lists the names of these predictions files and can be used as input for step 2.

Step 2

As step 1 and 2 are completely decoupled in regenie, you could either use all the traits for testing in step 2 or select a subset of the traits to perform association testing. Furthermore, you can use the same Step 1 output to test on array, exome or imputed variants; below, we will illustrate testing on imputed variants.

Step 2 of regenie can be run in parallel across chromosomes so if you have access to multiple machines, we recommend to split the runs over chromosomes (using 8+ threads).

Running regenie tesing on a single chromosome (here chromosome 1) and using the fast Firth correction as fallback for p-values below 0.01

./regenie \
  --step 2 \
  --bgen ukb_imp_chr1_v3.bgen \
  --ref-first \
  --sample ukbXXX_imp_chr1_v3_s487395.sample \
  --phenoFile ukb_phenotypes_BT.txt \
  --covarFile ukb_covariates.txt \
  --bt \
  --firth --approx --pThresh 0.01 \
  --pred ukb_step1_BT_pred.list \
  --bsize 400 \
  --split \
  --out ukb_step2_BT_chr1

This will create separate association results files for each phenotype as ukb_step2_BT_chr1_*.regenie.

When running the SKAT/ACAT gene-based tests, we recommend to use at most 2 threads and instead parallelize the runs over partitions of the genome (e.g. groups of genes).