GLMM with negative binomial distribution for sequencing count data

Fits many generalised linear mixed effects models (GLMM) with negative binomial distribution for analysis of overdispersed count data with random effects. Designed for longitudinal analysis of RNA-Sequencing count data.

Usage

glmmSeq(
  modelFormula,
  countdata,
  metadata,
  id = NULL,
  dispersion = NA,
  sizeFactors = NULL,
  reduced = NULL,
  modelData = NULL,
  designMatrix = NULL,
  method = c("lme4", "glmmTMB"),
  control = NULL,
  family = nbinom2,
  cores = 1,
  removeSingles = FALSE,
  zeroCount = 0.125,
  verbose = TRUE,
  returnList = FALSE,
  progress = FALSE,
  ...
)

Arguments

modelFormula: the model formula. This must be of the form "~ ..." where the structure is assumed to be "counts ~ ...". The formula must include a random effects term. For more information on formula structure for random effects see lme4::glmer()
countdata: the sequencing count data matrix with genes in rows and samples in columns
metadata: a dataframe of sample information with variables in columns and samples in rows
id: Optional. Used to specify the column in metadata which contains the sample IDs to be used in repeated samples for random effects. If not specified, the function defaults to using the variable after the "|" in the random effects term in the formula.
dispersion: a numeric vector of gene dispersion. Not required for glmmTMB models, as these determine and fit dispersion for each gene.
sizeFactors: size factors (default = NULL). If provided the glmer offset is set to log(sizeFactors). For more information see`` lme4::glmer()
reduced: Optional reduced model formula. If this is chosen, a likelihood ratio test is used to calculate p-values instead of the default Wald type 2 Chi-squared test.
modelData: Optional dataframe. Default is generated by call to expand.grid using levels of variables in the formula. Used to calculate model predictions (estimated means & 95% CI) for plotting via modelPlot. It can therefore be used to add/remove points in modelPlot.
designMatrix: custom design matrix, used only for prediction
method: Specifies which package to use for fitting GLMM models. Either "lme4" or "glmmTMB" depending on whether to use lme4::glmer or glmmTMB::glmmTMB to fit GLMM models.
control: the glmer optimizer control. If method = "lme4" default is glmerControl(optimizer = "bobyqa")). If method = "glmmTMB" default is glmmTMBControl()
family: Only used with glmmTMB models. Default is nbinom2. See glmmTMB::nbinom2
cores: number of cores to use. Default = 1.
removeSingles: whether to remove individuals without repeated measures (default = FALSE)
zeroCount: numerical value to offset zeroes for the purpose of log (default = 0.125)
verbose: Logical whether to display messaging (default = TRUE)
returnList: Logical whether to return results as a list or glmmSeq object (default = FALSE). Useful for debugging.
progress: Logical whether to display a progress bar
...: Other parameters to pass to lme4::glmer()

Value

Returns an S4 class GlmmSeq object with results for gene-wise general linear mixed models. A list of results is returned if returnList

is TRUE which is useful for debugging. If all genes return errors from glmer, then an error message is shown and a character vector containing error messages for all genes is returned.

Details

This function is a wrapper for lme4::glmer(). By default, p-values for each model term are computed using Wald type 2 Chi-squared test as per car::Anova(). The underlying code for this has been optimised for speed. However, if a reduced model formula is specified by setting reduced, then a likelihood ratio test is performed instead using stats::anova. This will double computation time since two GLMM have to be fitted.

Parallelisation is provided using parallel::mclapply on Unix/Mac or parallel::parLapply on PC.

Setting method = "glmmTMB" enables an alternative method of fitting GLMM using the glmmTMB package. This gives access to a variety of alternative GLM family functions. Note, glmmTMB negative binomial models are substantially slower to fit than glmer models with known dispersion due to the extra time taken by glmmTMB to optimise the dispersion parameter.

The id argument is usually optional. By default the id column in the metadata is determined as the term after the bar in the random effects term of the model. Note that id is not passed to glmer or glmmTMB. It is only really used to remove singletons from the countdata matrix and metadata dataframe. The id is also stored in the output from glmmSeq and used by plotting function modelPlot(). However, due to its flexible nature, in theory glmmSeq should allow for more than one random effect term, although this has not been tested formally. In this case, it is probably prudent to specify a value for id.

Examples

data(PEAC_minimal_load)
disp <- apply(tpm, 1, function(x) {
(var(x, na.rm = TRUE)-mean(x, na.rm = TRUE))/(mean(x, na.rm = TRUE)**2)
})
MS4A1glmm <- glmmSeq(~ Timepoint * EULAR_6m + (1 | PATID),
                     countdata = tpm[1:2, ],
                     metadata = metadata,
                     dispersion = disp,
                     verbose = FALSE)
names(attributes(MS4A1glmm))
#>  [1] "info"      "formula"   "stats"     "predict"   "reduced"   "countdata"
#>  [7] "metadata"  "modelData" "optInfo"   "errors"    "vars"      "class"