More ANOVA designs

Aaron R. Caldwell

2022-05-16

library(Superpower)
Superpower_options(verbose = FALSE,
                   plot = FALSE)
nsims = 250

In this vignette we will quickly demonstrate a few variations on ANOVA designs and how to implement them in Superpower. To see even more designs, please check out our book.

Three between subject conditions

Imagine we aim to design a study to test the hypothesis that giving people a pet to take care of will increase their life satisfaction. We have a control condition, a ‘cat’ pet condition, and a ‘dog’ pet condition. We can simulate a One-Way ANOVA with a specified alpha, sample size, and effect size, and see the statistical power we would have for the ANOVA and the follow-up comparisons. We expect all pets to increase life satisfaction compared to the control condition. Obviously, we also expect the people who are in the ‘dog’ pet condition to have even greater life satisfaction than people in the ‘cat’ pet condition. Based on work by Pavot and Diener (1993) we assume that we can expect responses on the life-satisfaction scale to have a mean of approximately 24 in our population, with a standard deviation of 6.4. We expect that having a pet increases life satisfaction by approximately 2.2 scale points for participants who get a cat, and by 2.6 scale points for participants who get a dog. We initially consider collecting data from 150 participants in total, with 50 participants in each condition. But before we proceed with the data collection, we examine the statistical power our design would have to detect the differences we predict.

design <- "3b"
n <- 50
mu <- c(24, 26.2, 26.6)
sd <- 6.4
label_list = list("condition" = c("control", "cat", "dog"))

design_result <- ANOVA_design(design = design,
                   n = n, 
                   mu = mu, 
                   sd = sd, 
                   label_list = label_list,
                   plot = TRUE)

ANOVA_exact2(design_result,
            emm = TRUE) # Pairwise emmeans will automatically run
            
#> Power and Effect sizes for ANOVA tests
#>             power partial_eta_squared cohen_f non_centrality
#> condition 47.5236              0.0314    0.18         4.7651
#> 
#> Power and Effect sizes for pairwise comparisons (t-tests)
#>                                   power effect_size
#> p_condition_control_condition_cat 39.83        0.34
#> p_condition_control_condition_dog 52.05        0.41
#> p_condition_cat_condition_dog      6.10        0.06
#> 
#> Power and Effect sizes for estimated marginal means
#>                            contrast  power partial_eta_squared cohen_f
#> 1 condition_cat - condition_control 39.909           0.0196190 0.14146
#> 2     condition_cat - condition_dog  6.107           0.0006611 0.02572
#> 3 condition_control - condition_dog 52.148           0.0271902 0.16718
#>   non_centrality
#> 1        2.94171
#> 2        0.09725
#> 3        4.10867
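
The Cohen's f in the ANOVA table can be cross-checked by hand from the planned means and standard deviation. The sketch below uses only base R and the definition of Cohen's f as the standard deviation of the condition means divided by the common within-condition standard deviation. The names `sigma`, `sigma_m`, and `eta2` are chosen here for illustration and are not part of Superpower.

```r
# Planned population means and common SD from the design above
mu <- c(control = 24, cat = 26.2, dog = 26.6)
sigma <- 6.4

# Cohen's f: SD of the condition means divided by the within-condition SD
sigma_m <- sqrt(mean((mu - mean(mu))^2))
f <- sigma_m / sigma
round(f, 4)

# Corresponding population eta-squared
eta2 <- f^2 / (1 + f^2)
round(eta2, 4)
```

The value of f should agree with the `cohen_f` column above up to rounding.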

The result shows that you would have quite low power with 50 participants per condition, both for the overall ANOVA (just below 50% power) and for the follow-up comparisons: approximately 40% power for the control vs cat comparison, just above 50% for the control vs dog comparison, and really low power (around 6%, just above the Type 1 error rate of 5%) for the expected difference between cats and dogs. We can easily confirm the expected power for these simple comparisons using the pwr package. For our example, Cohen’s d (the standardized mean difference) is 2.2/6.4, or d = 0.34375, for the difference between the control condition and cats; 2.6/6.4, or d = 0.40625, for the difference between the control condition and dogs; and 0.4/6.4, or d = 0.0625, for the difference between cats and dogs as pets.

library(pwr)
pwr.t.test(d = 2.2/6.4,
           n = 50,
           sig.level = 0.05,
           type="two.sample",
           alternative="two.sided")$power
#> [1] 0.3983064

pwr.t.test(d = 2.6/6.4,
           n = 50,
           sig.level = 0.05,
           type="two.sample",
           alternative="two.sided")$power
#> [1] 0.5205162

pwr.t.test(d = 0.4/6.4,
           n = 50,
           sig.level = 0.05,
           type="two.sample",
           alternative="two.sided")$power
#> [1] 0.06104044

It is clear we would need to collect more participants in each group to increase our power. We can plot a power curve across a range of sample sizes to see how much we need to increase the sample size.

plot_power(design_result, min_n = 10, max_n = 250)

We see that 134 participants per condition give us around 90% power for the ANOVA. But we should also check the power for the comparison of the control condition and the cat condition.

design <- "3b"
n <- 134
mu <- c(24, 26.2, 26.6)
sd <- 6.4
label_list = list("condition" = c("control", "cat", "dog"))

design_result <- ANOVA_design(design = design,
                   n = n, 
                   mu = mu, 
                   sd = sd, 
                   label_list = label_list,
                   plot = FALSE) #do not print the plot (same as above)

ANOVA_exact2(design_result,
             emm = TRUE)
#> Power and Effect sizes for ANOVA tests
#>             power partial_eta_squared cohen_f non_centrality
#> condition 90.1767              0.0311  0.1793        12.8242
#> 
#> Power and Effect sizes for pairwise comparisons (t-tests)
#>                                   power effect_size
#> p_condition_control_condition_cat 80.05        0.34
#> p_condition_control_condition_dog 91.20        0.41
#> p_condition_cat_condition_dog      8.03        0.06
#> 
#> Power and Effect sizes for estimated marginal means
#>                            contrast  power partial_eta_squared cohen_f
#> 1 condition_cat - condition_control 80.150           0.0194560 0.14086
#> 2     condition_cat - condition_dog  8.033           0.0006555 0.02561
#> 3 condition_control - condition_dog 91.266           0.0269660 0.16647
#>   non_centrality
#> 1         7.9170
#> 2         0.2617
#> 3        11.0576

n <- 180
design_result <- ANOVA_design(design = design,
                   n = n, 
                   mu = mu, 
                   sd = sd, 
                   label_list = label_list,
                   plot = FALSE) #do not print the plot (same as above)

ANOVA_exact2(design_result,
             emm = TRUE)
#> Power and Effect sizes for ANOVA tests
#>             power partial_eta_squared cohen_f non_centrality
#> condition 96.7794              0.0311  0.1791        17.2266
#> 
#> Power and Effect sizes for pairwise comparisons (t-tests)
#>                                   power effect_size
#> p_condition_control_condition_cat 90.19        0.34
#> p_condition_control_condition_dog 97.02        0.41
#> p_condition_cat_condition_dog      9.09        0.06
#> 
#> Power and Effect sizes for estimated marginal means
#>                            contrast power partial_eta_squared cohen_f
#> 1 condition_cat - condition_control 90.24           0.0194194 0.14073
#> 2     condition_cat - condition_dog  9.10           0.0006543 0.02559
#> 3 condition_control - condition_dog 97.04           0.0269157 0.16631
#>   non_centrality
#> 1        10.6348
#> 2         0.3516
#> 3        14.8535

n <- 5380
design_result <- ANOVA_design(design = design,
                   n = n, 
                   mu = mu, 
                   sd = sd, 
                   label_list = label_list,
                   plot = FALSE) #do not print the plot (same as above)

ANOVA_exact2(design_result,
             emm = TRUE)
#> Power and Effect sizes for ANOVA tests
#>           power partial_eta_squared cohen_f non_centrality
#> condition   100              0.0309  0.1786       514.8828
#> 
#> Power and Effect sizes for pairwise comparisons (t-tests)
#>                                   power effect_size
#> p_condition_control_condition_cat   100        0.34
#> p_condition_control_condition_dog   100        0.41
#> p_condition_cat_condition_dog        90        0.06
#> 
#> Power and Effect sizes for estimated marginal means
#>                            contrast power partial_eta_squared cohen_f
#> 1 condition_cat - condition_control   100           0.0193172 0.14035
#> 2     condition_cat - condition_dog    90           0.0006507 0.02552
#> 3 condition_control - condition_dog   100           0.0267750 0.16587
#>   non_centrality
#> 1         317.86
#> 2          10.51
#> 3         443.96

With 134 participants per condition, power for the control vs cat comparison is only 80%. If we want to aim for 90% power for follow-up tests, we need more participants. With 180 participants per condition we have good power for the follow-up comparison between control and cats. But power for the difference between cats and dogs is incredibly low. We would need 5380 participants per condition to have 90% power for that comparison. This is possible if you can easily collect data, but might make this specific question impossible to reliably answer if data collection is costly. Realizing you cannot reliably study a question is an important aspect of a-priori power analysis.
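
Instead of trying sample sizes one by one, the required n per condition for each follow-up comparison can be solved for directly. The sketch below uses base R's `power.t.test()` (not the pwr package used earlier) and assumes independent two-sided t-tests at alpha = 0.05 with 90% power. The helper `n_for_90()` is a hypothetical name introduced here.

```r
sd_pop <- 6.4  # population SD assumed throughout this example

# Solve for the per-condition sample size that gives 90% power
# for a raw mean difference `delta` (two-sided independent t-test).
n_for_90 <- function(delta) {
  power.t.test(delta = delta, sd = sd_pop, sig.level = 0.05,
               power = 0.90, type = "two.sample",
               alternative = "two.sided")$n
}

n_needed <- ceiling(c(control_cat = n_for_90(2.2),
                      control_dog = n_for_90(2.6),
                      cat_dog     = n_for_90(0.4)))
n_needed
```

The results should land close to the sample sizes explored above: roughly 180 per condition for control vs cat, and several thousand per condition for cat vs dog.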

Power in Repeated Measures ANOVA

In a repeated measures design multiple observations are collected from the same participants. In the simplest case, where there are two repeated observations, a repeated measures ANOVA equals a dependent or paired t-test. Because observations from the same participant are typically correlated, let’s explore the impact of this correlation on the power of a repeated measures ANOVA.

Two within conditions

To illustrate the effect of correlated observations, we start by simulating data for a medium effect size for a dependent (or paired, or within-subject) t-test. Let’s first look at G*Power. If we want to perform an a-priori power analysis, we are asked to fill in the effect size \(d_z\). As Cohen (1988) writes, “The Z subscript is used to emphasize the fact that our raw score unit is no longer X or Y, but Z”, where Z are the difference scores of X-Y.

Within designs can have greater power to detect differences than between designs because the values are correlated, and a within design requires fewer participants because each participant provides multiple observations. One difference between an independent t-test and a dependent t-test is that an independent t-test has 2(n-1) degrees of freedom, while a dependent t-test has (n-1) degrees of freedom. The sample size needed in a two-group within-design (\(N_W\)) relative to the sample needed in two-group between-designs (\(N_B\)), assuming normal distributions, and ignoring the difference in degrees of freedom between the two types of tests, is (from Maxwell & Delaney, 2004, p. 561, formula 45):

\(N_{W}=\frac{N_{B}(1-\rho)}{2}\)

The division by 2 in the equation is due to the fact that in a two-condition within design every participant provides two data points. The extent to which this reduces the sample size compared to a between-subject design depends on the correlation (\(\rho\)) between the two dependent variables, as indicated by the \(1-\rho\) part of the equation. If the correlation is 0, a within-subject design needs half as many participants as a between-subject design (e.g., 64 instead of 128 participants), simply because every participant provides 2 data points. The higher the correlation, the larger the relative benefit of within designs. When the correlation is negative the relative benefit shrinks, and it disappears entirely when the correlation reaches -1.
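
As a quick illustration of the formula, the sketch below tabulates the approximate within-design sample size for a between-design total of 128 participants across several correlations. The function name `n_within_two()` is introduced here for illustration. Like the formula itself, the result ignores the difference in degrees of freedom, so exact power software can return slightly larger numbers.

```r
# Maxwell & Delaney (2004) approximation for a two-condition within design
n_within_two <- function(n_between, rho) {
  n_between * (1 - rho) / 2
}

rho <- c(-1, -0.5, 0, 0.5, 0.7, 0.9)
res <- data.frame(rho = rho, n_within = n_within_two(128, rho))
res
```

With a correlation of 0 the table gives 64, and with a correlation of -1 it gives 128, so no participants are saved relative to the between design.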

Whereas in an independent t-test the two observations are uncorrelated, in a within design the observations are correlated. This has an effect on the standard deviation of the difference scores. In turn, because the standardized effect size is the mean difference divided by the standard deviation of the difference scores, the correlation has an effect on the standardized mean difference in a within design, Cohen’s \(d_z\). The relation, as Cohen (1988, formula 2.3.7) explains, is:

\(\sigma_{z}=\sigma\sqrt{2(1-\rho)}\)

Therefore, the relation between \(d_z\) and d is given by \(\sqrt{2(1-\rho)}\). As Cohen (1988) writes: “In other words, a given difference between population means for matched (dependent) samples is standardized by a value which is \(\sqrt{2(1-\rho)}\) as large as would be the case were they independent.” If we enter a correlation of 0.5 in the formula, we get \(\sqrt{2(0.5)}=1\), so when the correlation is 0.5, d = \(d_z\). When there is a strong correlation between dependent variables, for example r = 0.9, we get \(d=d_{z}\sqrt{2(1-0.9)}\), and a \(d_z\) of 1 would be a d = 0.45. Conversely, \(d_{z}=\frac{d}{\sqrt{2(1-r)}}\), so with r = 0.9, a d of 1 would be a \(d_z\) = 2.24. Some consider this increase in \(d_z\) compared to d when observations are strongly correlated an ‘inflation’ when estimating effect sizes, but since the reduction in the standard deviation of the difference scores due to the correlation makes it easier to distinguish signal from noise in a hypothesis test, it leads to a clear power benefit.
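
The two conversions can be written as small helper functions. This is a minimal sketch in base R. `d_to_dz()` and `dz_to_d()` are names introduced here and are not Superpower functions.

```r
# Convert between Cohen's d and Cohen's d_z for a given correlation r
d_to_dz <- function(d, r) d / sqrt(2 * (1 - r))
dz_to_d <- function(dz, r) dz * sqrt(2 * (1 - r))

d_to_dz(0.5, r = 0.5)          # r = 0.5 leaves the effect size unchanged
round(dz_to_d(1, r = 0.9), 2)  # d_z of 1 expressed as d when r = 0.9
round(d_to_dz(1, r = 0.9), 2)  # d of 1 expressed as d_z when r = 0.9
```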

There is no equivalent \(f_z\) for Cohen’s f for a within subject ANOVA. For two groups, we can directly compute Cohen’s f from Cohen’s d, as Cohen (1988) describes, because f = \(\frac{1}{2}d\). For a d = 0.5, f = 0.25. In G*Power we can run a 2 group within-subject power analysis for ANOVA. We plan for 80% power, and reproduce the analysis above for the dependent t-test. This works because the correlation is set to 0.5, in which case d = \(d_z\), and thus the transformation f = \(\frac{1}{2}d\) works.

If we change the correlation to 0.7 and keep all other settings the same, the repeated measures a-priori power analysis yields a sample size of 21. The higher correlation increases the power of the test.

To reproduce this analysis in G*Power with a dependent t-test we need to change \(d_z\) following the formula above, \(d_{z}=\frac{0.5}{\sqrt{2(1-0.7)}}\), which yields \(d_z\) = 0.6454972. If we enter this value in G*Power for an a-priori power analysis, we get the exact same results (as we should, since a repeated measures ANOVA with 2 groups equals a dependent t-test). This example illustrates that the correlation between dependent variables always factors into a power analysis, both for a dependent t-test, and for a repeated measures ANOVA. Because a dependent t-test uses \(d_z\) the correlation might be less visible, but given the relation between d and \(d_z\), the correlation is always taken into account and can greatly improve power for within designs compared to between designs.
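
The same a-priori analysis can be sketched without G*Power, using base R's `power.t.test()` with `type = "paired"`. With `sd = 1`, the `delta` argument is the standardized difference of the difference scores, so it plays the role of \(d_z\). The helper `n_paired()` is a hypothetical name, and small numerical differences from G*Power are possible.

```r
d <- 0.5  # Cohen's d for the two conditions

# Required number of participants for a two-sided paired t-test,
# after converting d to d_z for a given correlation r.
n_paired <- function(r, power = 0.80, alpha = 0.05) {
  dz <- d / sqrt(2 * (1 - r))
  power.t.test(delta = dz, sd = 1, sig.level = alpha, power = power,
               type = "paired", alternative = "two.sided")$n
}

n_r05 <- n_paired(0.5)
n_r07 <- n_paired(0.7)
ceiling(c(r_0.5 = n_r05, r_0.7 = n_r07))
```

Rounding up should give sample sizes in line with those discussed in this section: 34 participants at r = 0.5 and 21 at r = 0.7.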

We can perform both these power analyses using simulations as well. We set groups to 2 for the simulation, n = 34 (which should give 80.777% power, according to G*Power), a correlation among repeated measures of 0.5, and an alpha of 0.05. In this case, we simulate data with means -0.25 and 0.25, and set the sd to 1. This means we have a mean difference of 0.5, and a Cohen’s d of 0.5/1 = 0.5. In the first example, we set the correlation to 0.5, and the result should be 80.777% power, and an effect size estimate of 0.5 for the simple effect. We also calculate partial eta-squared for the ANOVA, which equals \(\frac{f^2}{f^2+1}\), or 0.05882353.

K <- 2
n <- 34
sd <- 1
r <- 0.5
alpha = 0.05
f <- 0.25
f2 <- f^2
ES <- f2/(f2+1)
ES
#> [1] 0.05882353
mu <- mu_from_ES(K = K, ES = ES)
design = paste(K,"w",sep="")
label_list <- list("speed" =c("fast", "slow"))

design_result <- ANOVA_design(design = design,
                              n = n, 
                              mu = mu, 
                              sd = sd, 
                              r = r, 
                              label_list = label_list,
                              plot = TRUE)


ANOVA_exact2(design_result,
            emm = TRUE)
#> Power and Effect sizes for ANOVA tests
#>       power partial_eta_squared cohen_f non_centrality
#> speed 80.25              0.2026  0.5041         8.3852
#> 
#> Power and Effect sizes for pairwise comparisons (t-tests)
#>                         power effect_size
#> p_speed_fast_speed_slow 80.78         0.5
#> 
#> Power and Effect sizes for estimated marginal means
#>                  contrast power partial_eta_squared cohen_f non_centrality
#> 1 speed_fast - speed_slow 80.25              0.2026  0.5041          8.385

The results of the exact simulation approach are indeed very close to 80.777%. Note that the simulation calculates Cohen’s \(d_z\) effect sizes for paired comparisons, which here, given the correlation of 0.5, is also 0.5 for a medium effect size.

We should see a larger \(d_z\) if we increase the correlation, keeping the sample size the same, following the example in G*Power above. We repeat the simulation, and the only difference is a correlation between dependent variables of 0.7. This should yield an effect size \(d_z\) = 0.6454972.

K <- 2
n <- 34
sd <- 1
r <- 0.7
alpha = 0.05
f <- 0.25
f2 <- f^2
ES <- f2/(f2+1)
ES
#> [1] 0.05882353
mu <- mu_from_ES(K = K, ES = ES)
design = paste(K,"w",sep="")
label_list <- list("speed" = c("fast", "slow"))

design_result <- ANOVA_design(design = design,
                              n = n, 
                              mu = mu, 
                              sd = sd, 
                              r = r, 
                              label_list = label_list,
                              plot = TRUE)


ANOVA_exact2(design_result,
            emm = TRUE)
#> Power and Effect sizes for ANOVA tests
#>         power partial_eta_squared cohen_f non_centrality
#> speed 95.2298              0.2975  0.6508        13.9754
#> 
#> Power and Effect sizes for pairwise comparisons (t-tests)
#>                         power effect_size
#> p_speed_fast_speed_slow 95.47        0.65
#> 
#> Power and Effect sizes for estimated marginal means
#>                  contrast power partial_eta_squared cohen_f non_centrality
#> 1 speed_fast - speed_slow 95.23              0.2975  0.6508          13.98
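
The pairwise power values in both outputs can be cross-checked analytically. For two within-subject conditions, the F-test has 1 and n - 1 degrees of freedom, and under the paired t-test framing the noncentrality parameter is \(n \cdot d_z^2\). The sketch below uses base R only. `power_within_2()` is a name introduced here, and it targets the pairwise rows of the output rather than the ANOVA rows, which report a slightly different noncentrality.

```r
# Analytic power for a two-condition within design via the noncentral F
power_within_2 <- function(n, d, r, alpha = 0.05) {
  dz <- d / sqrt(2 * (1 - r))          # standardized difference score effect
  lambda <- n * dz^2                   # noncentrality parameter
  f_crit <- qf(1 - alpha, df1 = 1, df2 = n - 1)
  pf(f_crit, df1 = 1, df2 = n - 1, ncp = lambda, lower.tail = FALSE)
}

p_r05 <- power_within_2(n = 34, d = 0.5, r = 0.5)
p_r07 <- power_within_2(n = 34, d = 0.5, r = 0.7)
round(100 * c(r_0.5 = p_r05, r_0.7 = p_r07), 2)
```

The two values should land close to the 80.78 and 95.47 reported for the pairwise comparisons.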

Power in Repeated Measures ANOVA with More than 2 Groups

We will examine a repeated measures experiment with 3 within-participant conditions, to illustrate how a repeated measures ANOVA extends a dependent t-test. In the example for a two-group within design we provided a specific formula for the sample size benefit for two groups. The sample size needed in within-designs (NW) with more than 2 conditions, relative to the sample needed in between-designs (NB), assuming normal distributions and compound symmetry, and ignoring the difference in degrees of freedom between the two types of tests, is (from Maxwell & Delaney, 2004, p. 562, formula 47):

\(N_{W}=\frac{N_{B}(1-\rho)}{a}\)

Where a is the number of within-subject levels.
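
To see how the number of within-subject levels enters the approximation, the sketch below evaluates the formula for 2, 3, and 4 levels. The between-design sample size of 120 and the correlation of 0.7 are hypothetical values chosen for illustration, and `n_within()` is a name introduced here.

```r
# Maxwell & Delaney (2004) approximation with `a` within-subject levels
n_within <- function(n_between, rho, a) {
  n_between * (1 - rho) / a
}

levels_a <- c(2, 3, 4)
n_w <- sapply(levels_a, function(a) n_within(n_between = 120, rho = 0.7, a = a))
data.frame(a = levels_a, n_within = n_w)
```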

The relation between Cohen’s f and Cohen’s d

Whereas in the case of a repeated measures ANOVA with 2 groups we could explain the principles of a power analysis by comparing our test against a t-test and Cohen’s d, this becomes more difficult when we have more than 2 groups. It is more useful to explain how to directly calculate Cohen’s f, the effect size used in power analyses for ANOVA. Cohen’s f is calculated following Cohen, 1988, formula 8.2.1 and 8.2.2:

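\(f = \frac{\sqrt{\frac{\sum(\mu-\overline{\mu})^2}{N}}}{\sigma}\)

Here N is the number of condition means, \(\overline{\mu}\) is their average, and \(\sigma\) is the common standard deviation.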

Imagine we have a within-subject experiment with 3 conditions. We ask people what their mood is when their alarm clock wakes them up, when they wake up naturally on a week day, and when they wake up naturally on a weekend day. Based on pilot data, we expect the means (on a 7 point validated mood scale) are 3.8, 4.2, and 4.3. The standard deviation is 0.9, and the correlation between the dependent measurements is 0.7. We can calculate Cohen’s f for the ANOVA, and Cohen’s \(d_z\) for the contrasts:

mu <- c(3.8, 4.2, 4.3)
sd <- 0.9
f <- sqrt(sum((mu-mean(mu))^2)/length(mu))/sd #Cohen, 1988, formula 8.2.1 and 8.2.2
f
#> [1] 0.2400274

r <- 0.7
(4.2-3.8)/0.9/sqrt(2*(1-r))
#> [1] 0.5737753
(4.3-3.8)/0.9/sqrt(2*(1-r))
#> [1] 0.7172191
(4.3-4.2)/0.9/sqrt(2*(1-r))
#> [1] 0.1434438

The relation between Cohen’s d or \(d_z\) and Cohen’s f becomes more difficult when there are multiple groups, because the relationship depends on the pattern of the means. Cohen (1988) presents calculations for three patterns: minimal variability (for example, for 5 means: -0.25, 0, 0, 0, 0.25), medium variability (for example, for 5 equally spaced means: -0.25, -0.125, 0, 0.125, 0.25), and maximum variability (for example, for 5 means: -0.25, -0.25, 0.25, 0.25, 0.25 or -0.25, -0.25, -0.25, 0.25, 0.25). For these three patterns, formulas are available that compute Cohen’s f from Cohen’s d, where d is the effect size calculated for the difference between the largest and smallest mean (if the largest mean is 0.25 and the smallest mean is -0.25, the difference is 0.25 - (-0.25) = 0.5, so d is 0.5 divided by the standard deviation of 0.9). In our example, d would be (4.3-3.8)/0.9 = 0.5555556. If we divide this value by sqrt(2*(1-r)) we have \(d_z\) = 0.5555556/0.7745967 = 0.7172191. If the means and standard deviation are known, it is best to calculate Cohen’s f directly from these values.
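
To make the dependence on the pattern of means concrete, the sketch below computes Cohen’s f directly from three example patterns of 5 means and compares the results with closed-form expressions in terms of d and the number of means k. The equally spaced pattern is taken as the intermediate case, and the closed-form expressions are written from memory of Cohen (1988, chapter 8), so treat them as assumptions that the direct calculation verifies numerically. The standard deviation of 1 and the name `cohen_f()` are chosen for illustration.

```r
# Cohen's f computed directly from a vector of means and a common SD
cohen_f <- function(mu, sd) sqrt(mean((mu - mean(mu))^2)) / sd

sigma <- 1  # illustrative SD, so d equals the raw range of the means
patterns <- list(
  minimal = c(-0.25, 0, 0, 0, 0.25),           # one mean at each end, rest in the middle
  medium  = c(-0.25, -0.125, 0, 0.125, 0.25),  # equally spaced means
  maximum = c(-0.25, -0.25, 0.25, 0.25, 0.25)  # all means at the extremes
)
f_direct <- sapply(patterns, cohen_f, sd = sigma)

k <- 5
d <- (0.25 - (-0.25)) / sigma  # largest minus smallest mean, standardized
f_formula <- c(
  minimal = d * sqrt(1 / (2 * k)),
  medium  = (d / 2) * sqrt((k + 1) / (3 * (k - 1))),
  maximum = d * sqrt(k^2 - 1) / (2 * k)  # odd k; for even k this is d / 2
)

round(rbind(f_direct, f_formula), 4)
```

The same d of 0.5 corresponds to noticeably different values of f depending on the pattern, which is why computing f from the expected means is preferable when they are available.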

We can perform power analyses for within designs using simulations. We set groups to 3 for the simulation, n = 20, and the correlation between dependent variables to 0.8. If the true effect size is f = 0.25, and the alpha level is 0.05, the power is 96.9%.

In this case, we simulate data with means -0.3061862, 0.0000000, and 0.3061862, and set the sd to 1.

K <- 3
n <- 20
sd <- 1
r <- 0.8
alpha = 0.05
f <- 0.25
f2 <- f^2
ES <- f2/(f2+1)
ES
#> [1] 0.05882353
mu <- mu_from_ES(K = K, ES = ES)
sqrt(sum((mu-mean(mu))^2)/length(mu))/sd #Cohen, 1988, formula 8.2.1 and 8.2.2
#> [1] 0.25
design = paste(K,"w",sep="")
label_list <- list("speed" = c("fast", "medium", "slow"))

design_result <- ANOVA_design(design = design,
                              n = n, 
                              mu = mu, 
                              sd = sd, 
                              r = r, 
                              label_list = label_list,
                              plot = TRUE)

power_result = ANOVA_power(design_result,
                           nsims = 250)

confint(power_result,
        param = "main_results",
        level = .98)
#>             power lower.ci upper.ci
#> anova_speed    98 95.40444 99.14277

The results of the simulation are indeed very close to 96.9% (the confidence interval from the simulation includes this value). This is in line with the power estimate from G*Power.
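
Because only 250 simulations were run, the simulated power carries Monte Carlo uncertainty. One way to quantify it in base R is a Wilson score interval for a binomial proportion, sketched below for 245 significant results out of 250 (the count implied by the printed power of 98%). This is an illustration of the general idea and not a description of how `confint()` is implemented in Superpower. As an observation from the numbers alone, a 95% level, not the `level = .98` passed above, reproduces the printed bounds.

```r
nsims <- 250
n_sig <- 245  # 98% of 250 simulations significant

# Wilson score interval (prop.test without continuity correction)
ci <- prop.test(x = n_sig, n = nsims, conf.level = 0.95,
                correct = FALSE)$conf.int
round(100 * ci, 2)
```

Running more simulations narrows this interval, at the cost of computation time.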

We can even check the calculation of Cohen’s f SPSS style in G*Power. We take the G*Power settings as illustrated above. We click the ‘Options’ button, and check the radio button next to ‘As in SPSS’. Click ok, and you will notice that the ‘Corr among rep measures’ field has disappeared. The correlation does not need to be entered separately, but is incorporated in Cohen’s f. The value of Cohen’s f, which was 0.25, has changed into 0.7024394. This is the SPSS equivalent. The value is much larger. This value, and its corresponding partial eta-squared, incorporate the correlation between observations.

2x2 ANOVA, within-between design

We can simulate a two-way ANOVA with a specific alpha, sample size and effect size, to achieve a specified statistical power. We will try to reproduce the power analysis in G*Power for an F-test, ANOVA: Repeated measures, within-between interaction.