# Empirical Bayes Metrics with openEBGM

## Background

In Bayesian statistics, the gamma distribution is the conjugate prior distribution for a Poisson likelihood. ‘Conjugate’ means that the posterior distribution will follow the same general form as the prior distribution. DuMouchel (1999) used a model with a Poisson($$\mu_{ij}$$) likelihood for the counts (for row i and column j of the contingency table). We are interested in the ratio $$\lambda_{ij}=\frac{\mu_{ij}}{E_{ij}}$$, where $$E_{ij}$$ are the expected counts. The $$\lambda_{ij}$$s are considered random draws from a mixture of two gamma distributions (our prior) with hyperparameter $$\theta=(\alpha_1,\beta_1,\alpha_2,\beta_2,P)$$, where $$P$$ is the prior probability that $$\lambda$$ came from the first component of the prior mixture (i.e., the mixture fraction). The prior is a single distribution that models all the cells in the table; however, there is a separate posterior distribution for each cell in the table. The posterior distribution of $$\lambda$$, given count $$N=n$$, is a mixture of two gamma distributions with parameters $$\theta=(\alpha_1+n,\beta_1+E,\alpha_2+n,\beta_2+E,Q_n)$$ (subscripts suppressed for clarity), where $$Q_n$$ is the probability that $$\lambda$$ came from the first component of the posterior, given $$N=n$$ (i.e., the mixture fraction).

The posterior distribution, in a sense, is a Bayesian representation of the relative reporting ratio, $$RR$$ (note the similarity in the equations $$RR_{ij}=\frac{N_{ij}}{E_{ij}}$$ and $$\lambda_{ij}=\frac{\mu_{ij}}{E_{ij}}$$). The Empirical Bayes (EB) metrics are taken from the posterior distribution. The Empirical Bayes Geometric Mean $$(EBGM)$$ is the antilog of the mean of the log2-transformed posterior distribution. The $$EBGM$$ is therefore a measure of central tendency of the posterior distribution. The 5% and 95% quantiles of the posterior distributions can be used to create two-sided 90% credibility intervals for $$\lambda_{ij}$$, given $$N_{ij}$$ (i.e, our “sort of” RR). Alternatively, since we are most interested in the lower bound, we could ignore the upper bound and create a one-sided 95% credibility interval.

Due to Bayesian shrinkage (please see the Background section of the Introduction to openEBGM vignette), the EB scores are much more stable than $$RR$$ for small counts.

# Calculating the EB-Scores

Once the product/event combinations have been counted and the hyperparameters have been estimated, we can calculate the EB scores:

library(openEBGM)
data(caers)  #subset of publicly available CAERS data

processed <- processRaw(caers, stratify = FALSE, zeroes = FALSE)
squashed <- squashData(processed)
squashed2 <- squashData(squashed, count = 2, bin_size = 10, keep_pts = 50)
theta_init <- data.frame(alpha1 = c(0.2, 0.1, 0.3, 0.5, 0.2),
beta1  = c(0.1, 0.1, 0.5, 0.3, 0.2),
alpha2 = c(2,   10,  6,   12,  5),
beta2  = c(4,   10,  6,   12,  5),
p      = c(1/3, 0.2, 0.5, 0.8, 0.4)
)
hyper_estimates <- autoHyper(squashed2, theta_init = theta_init)
(theta_hat <- hyper_estimates$estimates) #> alpha1 beta1 alpha2 beta2 P #> 3.25380903 0.39989105 2.02613680 1.90808515 0.06534565 ###Qn() The Qn() function calculates the mixture fractions for the posterior distributions. The values returned by Qn() correspond to the probability that $$\lambda$$ came from the first component of the posterior mixture distribution, given $$N=n$$ (recall there is a $$\lambda|N=n$$ for each cell in the table, but that each $$\lambda$$ comes from a common distribution). Thus, the output from Qn() returns a numeric vector of length equal to the total number of product-symptom combinations, which is also the number of rows in the data frame returned by processRaw(). When calculating the $$Q_n$$s, be sure to use the full data set from processRaw() – not the squashed data set or the raw data. qn <- Qn(theta_hat, N = processed$N, E = processed$E) head(qn) #>  0.2819735 0.3409649 0.3482312 0.2819735 0.2226355 0.2556668 identical(length(qn), nrow(processed)) #>  TRUE summary(qn) #> Min. 1st Qu. Median Mean 3rd Qu. Max. #> 0.0001205 0.2340247 0.3089645 0.2846018 0.3397680 0.9999997 ###ebgm() The ebgm() function calculates the Empirical Bayes Geometric Mean $$(EBGM)$$ scores. $$EBGM$$ is a measure of central tendency of the posterior distributions, $$\lambda_{ij}|N=n$$. Scores much larger than one indicate product/adverse event pairs that are reported at an unusually high rate. processed$ebgm <- ebgm(theta_hat, N = processed$N, E = processed$E, qn  = qn)
#>                      var1                  var2 N            E      RR    PRR
#> 1         1-PHENYLALANINE  HEART RATE INCREASED 1 0.0360548272   27.74  27.96
#> 2 11 UNSPECIFIED VITAMINS                ASTHMA 1 0.0038736591  258.15 279.58
#> 3 11 UNSPECIFIED VITAMINS CARDIAC FUNCTION TEST 1 0.0002979738 3356.00    Inf
#> 4 11 UNSPECIFIED VITAMINS            CHEST PAIN 1 0.0360548272   27.74  27.96
#> 5 11 UNSPECIFIED VITAMINS              DYSPNOEA 1 0.0765792610   13.06  13.11
#> 6 11 UNSPECIFIED VITAMINS      HYPERSENSITIVITY 1 0.0527413588   18.96  19.06
#>   ebgm
#> 1 2.23
#> 2 2.58
#> 3 2.63
#> 4 2.23
#> 5 1.92
#> 6 2.09

###quantBisect()

The quantBisect() function calculates quantiles of the posterior distribution using the bisection method. quantBisect() can calculate any quantile of the posterior distribution between 1 and 99%, and these quantiles can be used as limits for credibility intervals. Below, QUANT_05 is the 5th percentile; QUANT_95 is the 95th percentile. These form the lower and upper bounds of 90% credibility intervals for the Empirical Bayes (EB) scores.

processed$QUANT_05 <- quantBisect(5, theta_hat = theta_hat, N = processed$N, E = processed$E, qn = qn) processed$QUANT_95 <- quantBisect(95, theta_hat = theta_hat,
N = processed$N, E = processed$E, qn = qn)
#>                      var1                  var2 N            E      RR    PRR
#> 1         1-PHENYLALANINE  HEART RATE INCREASED 1 0.0360548272   27.74  27.96
#> 2 11 UNSPECIFIED VITAMINS                ASTHMA 1 0.0038736591  258.15 279.58
#> 3 11 UNSPECIFIED VITAMINS CARDIAC FUNCTION TEST 1 0.0002979738 3356.00    Inf
#> 4 11 UNSPECIFIED VITAMINS            CHEST PAIN 1 0.0360548272   27.74  27.96
#> 5 11 UNSPECIFIED VITAMINS              DYSPNOEA 1 0.0765792610   13.06  13.11
#> 6 11 UNSPECIFIED VITAMINS      HYPERSENSITIVITY 1 0.0527413588   18.96  19.06
#>   ebgm QUANT_05 QUANT_95
#> 1 2.23     0.49    13.85
#> 2 2.58     0.52    15.78
#> 3 2.63     0.52    16.02
#> 4 2.23     0.49    13.85
#> 5 1.92     0.47    11.77
#> 6 2.09     0.48    12.95

# Analysis of EB-Scores

The EB-scores ($$EBGM$$ and quantile scores) can be used to look for “signals” in the data. As stated in the Background section of the Introduction to openEBGM vignette, Bayesian shrinkage causes the EB-scores to be far more stable than their $$RR$$ counterparts, which allows for better separation between signal and noise. One could, for example, look at all product-symptom combinations where QUANT_05 (the lower part of the 90% two-sided credibility interval) is 2 or greater. This is often used as a conservative alternative to $$EBGM$$ since QUANT_05 scores are naturally smaller than $$EBGM$$ scores. We can say with high confidence that the “true relative reporting ratios” of product/adverse event combinations above this threshold are much greater than 1, so those combinations are truly reported more than expected. The value of 2 is arbitrarily chosen, and depends on the context. Below is an example of how one may identify product-symptom combinations that require further investigation based on the EB-scores.

suspicious <- processed[processed$QUANT_05 >= 2, ] nrow(suspicious); nrow(processed); nrow(suspicious)/nrow(processed) #>  131 #>  17189 #>  0.007621153 From above we see that less than 1% of product-symptom pairs are suspect based on the QUANT_05 score. One may look more closely at these product-symptom combinations to ascertain which products may need further investigation. Subject matter knowledge is required to determine which signals might identify a possible causal relationship. The EB-scores find statistical associations – not necessarily causal relationships. suspicious <- suspicious[order(suspicious$QUANT_05, decreasing = TRUE),
c("var1", "var2", "N", "E", "QUANT_05", "ebgm",
"QUANT_95")]
#>                                           var1                        var2  N
#> 13924                            REUMOFAN PLUS            WEIGHT INCREASED 16
#> 8187  HYDROXYCUT REGULAR RAPID RELEASE CAPLETS          EMOTIONAL DISTRESS 19
#> 13886                            REUMOFAN PLUS                    IMMOBILE  6
#> 7793              HYDROXYCUT HARDCORE CAPSULES CARDIO-RESPIRATORY DISTRESS  8
#> 8220  HYDROXYCUT REGULAR RAPID RELEASE CAPLETS                      INJURY 11
#>                E QUANT_05  ebgm QUANT_95
#> 13924 0.40643623    15.68 23.26    33.48
#> 8187  0.89690107    11.65 16.78    23.55
#> 13886 0.07866508    10.16 18.28    30.83
#> 7793  0.30482718     9.00 15.25    24.52
#> 8220  0.56317044     8.98 14.28    21.78

tabbed <- table(suspicious\$var1)
#>
#> HYDROXYCUT REGULAR RAPID RELEASE CAPLETS
#>                                       26
#>             HYDROXYCUT HARDCORE CAPSULES
#>                                       13
#>                            REUMOFAN PLUS
#>                                        8
#>                      HYDROXYCUT CAPSULES
#>                                        5
#>               HYDROXYCUT MAX LIQUID CAPS
#>                                        5
#>         HYDROXYCUT CAFFEINE FREE CAPLETS
#>                                        4

The output above suggests some products which may require further investigation.

Next, the openEBGM Objects and Class Functions vignette will demonstrate the object-oriented features of the openEBGM package.