Bayes Factors

This vignette can be referred to by citing the package:

Makowski, D., Ben-Shachar, M. S., & Lüdecke, D. (2019). bayestestR: Describing Effects and their Uncertainty, Existence and Significance within the Bayesian Framework. Journal of Open Source Software, 4(40), 1541. https://doi.org/10.21105/joss.01541

The adoption of the Bayesian framework for applied statistics, especially in the social and psychological sciences, seems to be developing in two distinct directions. One of the key topics separating the two is their stance on the Bayes factor. In short, some authors (e.g., the “Amsterdam school”, led by Wagenmakers) advocate its use and emphasize its qualities as a statistical index, while others point to its limits and prefer, instead, the precise description of posterior distributions (using CIs, ROPEs, etc.).

bayestestR does not take sides in this debate; it offers tools to support whichever analysis you wish to pursue. It does, however, strongly support the notion of an informed choice: discover the methods, try them, understand them, learn about them, and decide for yourself.

Having said that, here’s an introduction to Bayes factors :)

Bayes Factors

Bayes factors (BFs) are indices of relative evidence of one “model” over another, which are used in Bayesian inference as alternatives to classical (frequentist) hypothesis testing indices (such as (p)-values). In the Bayesian framework, a Bayes factor can also be thought of as the quantity by which some prior beliefs about the relative odds of two models are updated in light of the observed data.

According to Bayes’ theorem:

[ P(M|D) = \frac{P(D|M)P(M)}{P(D)} ]

Then by comparing two models, we get:

[ \frac{P(M_1|D)}{P(M_2|D)} = \frac{P(D|M_1)}{P(D|M_2)} \times \frac{P(M_1)}{P(M_2)} ]

where the middle term is the Bayes factor:

[ BF_{12}=\frac{P(D|M_1)}{P(D|M_2)} ]

Bayes factors are usually computed as the ratio of the marginal likelihoods of two competing hypotheses / models but, as the equation above shows, they can equivalently be obtained by dividing the posterior odds by the prior odds. Importantly, Bayes factors cover a wide range of indices and applications, and come in different flavors.
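As a toy illustration of the odds-updating view (the probabilities here are assumed, not estimated from data): if two models are equally likely a priori, and after seeing the data the posterior probability of the first model is 0.95, then the Bayes factor is the posterior odds divided by the prior odds:

# A toy illustration with assumed probabilities
prior_odds <- 0.5 / 0.5        # P(M1) / P(M2): equal prior probabilities
posterior_odds <- 0.95 / 0.05  # P(M1|D) / P(M2|D)
posterior_odds / prior_odds    # BF12 = 19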

Comparing models

Generally speaking, Bayes factors are used to compare models and answer the question:

Under which model are the observed data more probable?

In other words, which model is more likely to have produced the observed data? This is usually done by computing the marginal likelihoods of two models. In such a case, the Bayes factor is a measure of the relative evidence of one of the compared models over the other.

Bayesian models (brms and rstanarm)

Note: In order to compute Bayes factors for models, non-default arguments must be added when fitting: brms models must be fitted with save_all_pars = TRUE (as done below), and rstanarm models must be fitted with a diagnostic_file specified.

Let’s first fit 5 Bayesian regressions with brms to predict Sepal.Length:

library(brms)

m0 <- brm(Sepal.Length ~ 1, data = iris, save_all_pars = TRUE)
m1 <- brm(Sepal.Length ~ Petal.Length, data = iris, save_all_pars = TRUE)
m2 <- brm(Sepal.Length ~ Species, data = iris, save_all_pars = TRUE)
m3 <- brm(Sepal.Length ~ Species + Petal.Length, data = iris, save_all_pars = TRUE)
m4 <- brm(Sepal.Length ~ Species * Petal.Length, data = iris, save_all_pars = TRUE)

We can now compare these models with the bayesfactor_models() function, using the denominator argument to specify which model all models will be compared against (in this case, the intercept-only model):

library(bayestestR)
comparison <- bayesfactor_models(m1, m2, m3, m4, denominator = m0)
comparison
> Bayes factor analysis
> ---------------------                                      
> [1] Petal.Length              3.45e+44
> [2] Species                   5.63e+29
> [3] Species + Petal.Length    7.12e+55
> [4] Species * Petal.Length    9.15e+55
> 
> Against denominator:
>        [5] (Intercept only)   
> ---
> Bayes factor type:  marginal likelihoods (bridgesampling)

We can see that the full model is the best model, with (BF \approx 9\times 10^{55}) against the null (intercept-only) model.

We can also change the reference model to the main-effects model:

update(comparison, reference = 3)
> Bayes factor analysis
> ---------------------                                      
> [1] Petal.Length              4.84e-12
> [2] Species                   7.90e-27
> [4] Species * Petal.Length        1.28
> [5] (Intercept only)          1.40e-56
> 
> Against denominator:
>        [3] Species + Petal.Length   
> ---
> Bayes factor type:  marginal likelihoods (bridgesampling)

As we can see, though the full model is the best, there is hardly any evidence that it is preferable to the main effects model.

We can also change the reference model to the Species model:

update(comparison, reference = 2)
> Bayes factor analysis
> ---------------------                                      
> [1] Petal.Length              6.12e+14
> [3] Species + Petal.Length    1.27e+26
> [4] Species * Petal.Length    1.63e+26
> [5] (Intercept only)          1.78e-30
> 
> Against denominator:
>        [2] Species   
> ---
> Bayes factor type:  marginal likelihoods (bridgesampling)

Notice that in the Bayesian framework the compared models do not need to be nested, as demonstrated here when we compared the Petal.Length-only model to the Species-only model (something that cannot be done in the frequentist framework, where compared models must be nested in one another).
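In fact, because all of the Bayes factors above share the intercept-only model as their denominator, dividing any two of them gives the Bayes factor comparing those two models directly. A quick sketch using the values printed above:

# BFs against a common denominator are transitive: [1] / [2] directly
# compares the Petal.Length-only model to the Species-only model
bf_petal_length <- 3.45e44   # [1] Petal.Length vs. intercept-only
bf_species <- 5.63e29        # [2] Species vs. intercept-only
bf_petal_length / bf_species # ~6.1e+14, matching row [1] in the output above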

NOTE: In order to correctly and precisely estimate Bayes Factors, you always need the 4 P’s: Proper Priors (1, 2, 3) and a Plentiful Posterior (4).

The BIC approximation for Frequentist Models

It is also possible to compute Bayes factors for frequentist models. This is done by comparing BIC measures, allowing a Bayesian comparison of non-nested frequentist models (Wagenmakers 2007). Let’s try it out on some linear mixed models:

library(lme4)

m0 <- lmer(Sepal.Length ~ (1 | Species), data = iris)
m1 <- lmer(Sepal.Length ~ Petal.Length + (1 | Species), data = iris)
m2 <- lmer(Sepal.Length ~ Petal.Length + (Petal.Length | Species), data = iris)
m3 <- lmer(Sepal.Length ~ Petal.Length + Petal.Width + (Petal.Length | Species), data = iris)
m4 <- lmer(Sepal.Length ~ Petal.Length * Petal.Width + (Petal.Length | Species), data = iris)

bayesfactor_models(m1, m2, m3, m4, denominator = m0)
> Bayes factor analysis
> ---------------------                                                                     
> [1] Petal.Length + (1 | Species)                             8.24e+24
> [2] Petal.Length + (Petal.Length | Species)                  4.77e+23
> [3] Petal.Length + Petal.Width + (Petal.Length | Species)    1.52e+22
> [4] Petal.Length * Petal.Width + (Petal.Length | Species)    5.93e+20
> 
> Against denominator:
>        [5] 1 + (1 | Species)   
> ---
> Bayes factor type:  BIC approximation
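Under this approach (Wagenmakers 2007), the Bayes factor is recovered from the difference in BICs as (BF_{10} \approx \exp((BIC_0 - BIC_1)/2)). Here is a rough by-hand sketch for the first comparison; note that we refit with ML, since REML-based BICs are not comparable across models with different fixed effects, so the result may differ slightly from the table above:

# BIC approximation to the Bayes factor (Wagenmakers, 2007)
bic0 <- BIC(update(m0, REML = FALSE)) # null model, refitted with ML
bic1 <- BIC(update(m1, REML = FALSE)) # Petal.Length model, refitted with ML
exp((bic0 - bic1) / 2)                # should be close to row [1] above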

Inclusion Bayes factors via Bayesian model averaging

Inclusion Bayes factors answer the question:

Are the observed data more probable under models with a particular effect, than they are under models without that particular effect?

In other words, on average - are models with effect (X) more likely to have produced the observed data than models without effect (X)?

Let’s use the brms example from above:

bayesfactor_inclusion(comparison)
>                      Pr(prior) Pr(posterior) Inclusion BF
> Petal.Length               0.6          1.00     1.93e+26
> Species                    0.6          1.00     3.15e+11
> Species:Petal.Length       0.2          0.56         5.14
> ---
> Inclusion BFs compared among all models.

If we examine the interaction term’s inclusion Bayes factor, we can see that across all 5 models, a model with the interaction term (Species:Petal.Length) is 5 times more likely than a model without the interaction term.
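This value is nothing more than the change from prior to posterior inclusion odds. A rough check using the (rounded) probabilities printed above:

# Inclusion BF = posterior inclusion odds / prior inclusion odds
posterior_incl_odds <- 0.56 / (1 - 0.56)
prior_incl_odds <- 0.2 / (1 - 0.2)
posterior_incl_odds / prior_incl_odds # ~5.1, close to the printed 5.14
                                      # (the probabilities are rounded)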

We can also compare only matched models - i.e., a model without effect (A) will only be compared to models with effect (A), but not to models containing higher-order interactions involving (A) (see explanation for why you might want to do this here).

bayesfactor_inclusion(comparison, match_models = TRUE)
>                      Pr(prior) Pr(posterior) Inclusion BF
> Petal.Length               0.4          0.44     1.27e+26
> Species                    0.4          0.44     2.07e+11
> Species:Petal.Length       0.2          0.56         1.28
> ---
> Inclusion BFs compared among matched models only.

Comparison with JASP

bayesfactor_inclusion() is meant to provide model-averaged Bayes factors per effect, similar to JASP’s Effects option. Let’s compare the two:

Compared across all models

library(BayesFactor)
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
BF_ToothGrowth <- anovaBF(len ~ dose * supp, data = ToothGrowth)

bayesfactor_inclusion(BF_ToothGrowth)
>           Pr(prior) Pr(posterior) Inclusion BF
> supp            0.6          1.00       147.54
> dose            0.6          1.00     3.36e+14
> supp:dose       0.2          0.73        11.06
> ---
> Inclusion BFs compared among all models.

Compared across matched models

bayesfactor_inclusion(BF_ToothGrowth, match_models = TRUE)
>           Pr(prior) Pr(posterior) Inclusion BF
> supp            0.4          0.26        58.06
> dose            0.4          0.27     1.34e+14
> supp:dose       0.2          0.73         2.81
> ---
> Inclusion BFs compared among matched models only.

With Nuisance Effects

We’ll add dose to the null model in JASP, and do the same in R:

BF_ToothGrowth_against_dose <- BF_ToothGrowth[3:4]/BF_ToothGrowth[2] # OR: 
# update(bayesfactor_models(BF_ToothGrowth), subset = c(4,5), reference = 3)
BF_ToothGrowth_against_dose
> Bayes factor analysis
> --------------
> [1] supp + dose             : 58  ±1.6%
> [2] supp + dose + supp:dose : 163 ±3.5%
> 
> Against denominator:
>   len ~ dose 
> ---
> Bayes factor type: BFlinearModel, JZS
bayesfactor_inclusion(BF_ToothGrowth_against_dose)
>           Pr(prior) Pr(posterior) Inclusion BF
> dose           1.00          1.00          NaN
> supp           0.67          1.00       110.66
> supp:dose      0.33          0.73         5.53
> ---
> Inclusion BFs compared among all models.

Savage-Dickey density ratio Bayes factor

The Savage-Dickey density ratio can be used to answer the question:

Given the observed data, is the null more, or less likely?

This is done by comparing the density at the null value under the prior with the density at the null value under the posterior distribution, and is an approximation of a Bayes factor comparing the fitted model against the (point) null model:

“[…] the Bayes factor for (H_0) versus (H_1) could be obtained by analytically integrating out the model parameter (\theta). However, the Bayes factor may likewise be obtained by only considering (H_1), and dividing the height of the posterior for (\theta) by the height of the prior for (\theta), at the point of interest.” (Wagenmakers et al. 2010)

Let’s use Student’s sleep data, and try to answer the question: given the observed data, is it more or less likely that the drug (the variable group) has no effect on the number of hours of extra sleep (the variable extra)?

The boxplot suggests that the 2nd group has a higher number of hours of extra sleep. By how much? Let’s fit a simple Bayesian linear model.

library(rstanarm)
model <- stan_glm(extra ~ group, data = sleep)

We can use as.data.frame() on this model to extract the posterior distribution related to the effect of group2, and get_priors() from the insight package to see what prior distribution was used:

posterior <- as.data.frame(model)$group2

insight::get_priors(model)
>    parameter distribution location scale adjusted_scale
>  (Intercept)       normal        0  10.0             20
>       group2       normal        0   2.5              5

For the group2 parameter, the prior used was a normal distribution with mean (location) 0 and SD (adjusted scale) 5.044799. We can simulate this prior distribution as follows:

library(bayestestR)
prior <- distribution_normal(length(posterior), mean = 0, sd = 5.044799)

We can now plot both the prior and posterior distribution:

Looking at the distributions, we can see that the posterior is centred at 1.56. But this does not mean that an effect of 0 is necessarily less likely. To test that, we will use bayesfactor_savagedickey().

test_group2 <- bayesfactor_savagedickey(posterior = posterior, prior = prior)
test_group2
> # Bayes Factor (Savage-Dickey density ratio)
> 
>  Bayes Factor
>          0.88
> ---
> Evidence Against Test Value:  0

This BF indicates that the observed data are only 0.88 times as probable under the model with an effect as under the point-null model. Or, when interpreting it as a Bayes factor for (H_0) versus (H_1), we can say that the observed data are 1/0.88 = 1.14 times more probable under the null than under the model with the effect! Thus, although the center of the posterior distribution has shifted away from the null, the posterior is still quite dense around the null.
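To see where this number comes from, here is a rough by-hand version of the density ratio, estimating the posterior density at the null value from the samples (bayestestR uses a more refined density estimator, so this will only approximately reproduce the 0.88):

# Savage-Dickey: BF against the null =
#   prior density at 0 / posterior density at 0
d_post <- density(posterior)
posterior_at_0 <- approx(d_post$x, d_post$y, xout = 0)$y
dnorm(0, mean = 0, sd = 5.044799) / posterior_at_0 # roughly 0.88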

Note that interpretation guides for Bayes factors can be found here.

Directional test

We can also conduct a directional test (a “one sided” or “one tailed” test) if we have a prior hypothesis about the direction of the effect. This is done by setting an order restriction on the prior and posterior distributions (Morey and Wagenmakers 2014; Morey 2015).

test_group2_right <- bayesfactor_savagedickey(posterior = posterior, prior = prior, direction = ">")
test_group2_right
> # Bayes Factor (Savage-Dickey density ratio)
> 
>  Bayes Factor
>          1.69
> ---
> Evidence Against Test Value:  0 
> Right-Sided test

As we can see, given our a priori assumption about the direction of the effect (that the effect is positive), the presence of an effect has become 1.69 times more likely than its absence, given the observed data (or: the data are 1.69 times more probable under (H_1) than under (H_0)). This indicates that, given the observed data, the posterior mass has shifted away from the null value, giving some evidence against the null (note that a Bayes factor of 1.69 is still considered quite weak evidence).
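The simple relation shown by Morey and Wagenmakers (2014) lets us verify this by hand: the one-sided Bayes factor equals the two-sided one multiplied by the ratio of posterior to prior mass in the hypothesized direction (up to sampling error):

# one-sided BF = two-sided BF * P(effect > 0 | data) / P(effect > 0)
# (the prior is symmetric around 0, so mean(prior > 0) is ~0.5)
0.88 * mean(posterior > 0) / mean(prior > 0) # approximately 1.69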

Testing all model parameters

Alternatively, we could also pass our model directly as-is to bayesfactor_savagedickey() to simultaneously test all of the model’s parameters:

bayesfactor_savagedickey(model)
> Computation of Bayes factors: sampling priors, please wait...

> # Bayes Factor (Savage-Dickey density ratio)
> 
>    Parameter Bayes Factor
>  (Intercept)         0.07
>       group2         0.86
> ---
> Evidence Against Test Value:  0

Testing Contrasts (with emmeans)

We can also use bayesfactor_savagedickey() together with emmeans, allowing us to test Bayesian contrasts.

library(emmeans)
group_diff <- pairs(emmeans(model, ~ group))
group_diff
>  contrast estimate lower.HPD upper.HPD
>  1 - 2       -1.56      -3.3     0.258
> 
> HPD interval probability: 0.95
# pass the original model via the prior argument
bayesfactor_savagedickey(group_diff, prior = model)
> Computation of Bayes factors: sampling priors, please wait...

> # Bayes Factor (Savage-Dickey density ratio)
> 
>  Parameter Bayes Factor
>      1 - 2         0.91
> ---
> Evidence Against Test Value:  0

References

Morey, Richard D. 2015. “Multiple Comparisons with Bayesfactor, Part 2 – Order Restrictions.” https://richarddmorey.org/category/order-restrictions/.

Morey, Richard D, and Eric-Jan Wagenmakers. 2014. “Simple Relation Between Bayesian Order-Restricted and Point-Null Hypothesis Tests.” Statistics & Probability Letters 92: 121–24.

Wagenmakers, Eric-Jan. 2007. “A Practical Solution to the Pervasive Problems of p Values.” Psychonomic Bulletin & Review 14 (5): 779–804.

Wagenmakers, Eric-Jan, Tom Lodewyckx, Himanshu Kuriyal, and Raoul Grasman. 2010. “Bayesian Hypothesis Testing for Psychologists: A Tutorial on the Savage–Dickey Method.” Cognitive Psychology 60 (3): 158–89.