Frequently asked questions

General

regenie performs whole genome regression using the following model

where is a phenotype, is a genotype matrix, and . This model has close ties to a linear mixed model (LMM) based on an infinitesimal model

where with is referred to as the genetic relatedness matrix (GRM). In the LMM, the polygenic effects have been integrated out so that model only involves the GRM $K$ through a variance component in the covariance matrix of the trait.

In regenie, we directly estimate the polygenic effects parameter by using ridge regression, which corresponds to fitting a linear regression model with a L2 penalty to impose shrinkage. Hence, we bypass having to use the GRM and use the polygenic effect estimates to control for population structure when testing variants for association.


For quantitative traits, we have not obtained issues running regenie on small data sets. For binary traits, we have obtained successful runs of regenie (step 1 and 2) on data sets with as little as 300 samples. A few factors to consider:

  1. Convergence issues may occur in step 1 (all the more if a trait is highly unbalanced) see below
  2. Similarly, convergence issues may occur in step 2 when using Firth approximation see below

Note: we have found that regenie can get conservative in more extreme relatedness scenarios so we recommend not to use it for smaller cohorts with high amounts of relatedness like founder populations where exact mixed-model methods can be used

Step 1

We recommend to use blocks of size 1000 as we have observed that it leads to a reasonable number of ridge predictors at level 1 (e.g. 2,500 with 500K SNPs used and the default regenie parameters) and have noticed little change in the final predictions when varying the block size.


We recommend to use a smaller set of about 500K directly genotyped SNPs in step 1, which should be sufficient to capture genome-wide polygenic effects. Note that using too many SNPs in Step 1 (e.g. >1M) can lead to a high computational burden due to the resulting higher number of predictors in the level 1 models.


This is due to variants with very low minor allele count (MAC) being included in step 1. To avoid this, you should use a MAC filter to remove such variants in a pre-processing step before running Regenie.

For example, in PLINK2 you would use the --mac option and obtain a list of variants that pass the MAC filter (note that if you are using --keep/--remove in Regenie, you should also use it in the PLINK2 command)

plink2 \
  --bfile my_bed_file \
  --mac 100 \
  --write-snplist \
  --out snps_pass

You would then use the output file in regenie as --extract snps_pass.snplist (and this would avoid having to make a new genotype file).


This can occur when the sample size used to fit the model is small and/or if the trait is extremely unbalanced.

  1. If using K-fold CV, switch to LOOCV (option --loocv) to increase the size of the sample used to fit the model (note: LOOCV is now used by default when the sample size is below 5,000)
  2. If it is due to quasi-separation (i.e. Var(Y)=0 occurred in model fitting), either increase the sample size using LOOCV or increase the MAF threshold for variants included in step 1 analysis

Step 2

This can occur when the sample size used to fit the model is small and/or if the trait is extremely unbalanced. We have implemented the same measures as in the logistf function in R to avoid convergence issues, which include the use of a step size threshold when performing a Newton step.

  1. We first try fitting the model with a step size threshold that is more liberal (=25) as well as a maximum number of iterations of 1,000 and if convergence fails, we retry the model fit using a more stringent step size threshold (=5) and a higher threshold for the number of iterations (=5,000), which will slow down convergence.
  2. The user can also specify a maximum step size threshold using --maxstep-null (use value <5) as well as increase the maximum number of iterations using --maxiter-null (use value >5000). In that case, no retries are perfomed if convergence fails.
    • We recommend to test chromosomes separately (using --chr) as these parameters may need to be altered when fitting the null model for each chromosome