Statistical analysis: statistical model; continuous distributions; normal; log-normal
Please note that while we will be using brms to conduct a Bayesian analysis, the models would hold for other packages and frequentist analyses.
We are interested in confirming that in Murray River Turtles (MRT, Emydura macquarii) females are substantially larger than males. To this end, historical records of 85 MRT were examines and their straight carapace length (scl) (in centimetres) and sex was noted in the dataset df_mrt.RDS
.
After we load the data we can compare the length distributions by using a combined boxplot and violin plot1. (fig-mrt-scl?) shows that the scl is larger in females.
df_mrt <- readRDS('data/df_mrt.RDS')
ggplot(df_mrt, aes(sex, scl_cm)) + geom_violin() +
geom_boxplot(fill = 'darkgrey', width = 0.5) +
geom_jitter(width = 0.1, height = 0) +
ylab('Straight carapace length (cm)') +
xlab('Sex')
Figure 1: Combined boxplot and Violin plots of the straight carapace length measurements from Murray River Turtles separated by sex.
To test if there is formal evidence of a difference we can go through the steps of the flow diagram. The data is continuous, it is not near zero and the density is (roughly) bell shaped ((fig-mrt-scl?)). Therefore, we can start with a normal (Gaussian) distribution. The model has scl as a response to the explanatory variable of the sex of the MRT.
\[\text{scl_cm} \sim \text{sex}\]
mod_mrt <- brm(scl_cm ~ sex, data = df_mrt, family = 'Gaussian', cores = 4)
Once we have run the model in brms
we need to check that the model has converged by checking the Rhat and the chains. From the model output below and (fig-mrt-scl-chains?) the model looks to have converged.
mod_mrt
Family: gaussian
Links: mu = identity; sigma = identity
Formula: scl_cm ~ sex
Data: df_mrt (Number of observations: 85)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 28.37 0.38 27.60 29.13 1.00 3538 2368
sexMale -3.46 0.56 -4.57 -2.37 1.00 4068 2611
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 2.47 0.20 2.13 2.89 1.00 3654 2664
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
plot(mod_mrt)
Figure 2: The chains from each estimated parameter in the Murray River Turtle model.
Any model that we construct involves making some assumptions. These assumptions can change depending on the model. In general, the assumptions can be assessed by looking at the residuals and the posterior predictive checks.By using a package like DHARMa
we can quickly construct residual plots like in (fig-mrt-scl-resid?). The QQ plot shows the expected values versus the observed values and if the points do not follow the line \(y = x\) then distribution assumption is violated. The other plot relates to the assumption of equal variance, which would be violated if the medians and quartiles don’t match-up.
check_brms(mod_mrt)
Figure 3: Residual plots for the Murray River Turtle model.
Another plot to consider when checking model fit is the posterior predictive check ((fig-mrt-scl-pp?)). What we want to see is that the line in black (the observed data) is roughly within the bounds of the simulated data from the model. In this case it looks fine in general.
pp_check(mod_mrt, ndraws = 100)
Figure 4: Posterior predictive plot for the Murray River Turtle model.
Now that we are happy the model has converged and the model assumptions are not violated, we can see what the analysis is telling us.
First, let’s interpret the model output. The Intercept is effectively the default option, which in this case is the scl for female MRT. The next term is labelled sexMale and is the key term, as it represents the difference in scl between the mean female and mean male scl. It seems that there is a clear difference in scl, with the estimated difference being -3.5 with lower and upper 95% credible intervals (95% CI2) of -4.6 and -2.4 respectively. To see what this looks like visually, we can plot the estimated mean values with 95% credible intervals and prediction intervals3. (fig-mrt-scl-est?) highlights the difference in mean scl, but also shows that not all large MRT will be female.
prep_mrt <- df_mrt %>% select(sex) %>% distinct()
pred_mrt <- cbind(prep_mrt, fitted(mod_mrt, newdata = prep_mrt)) %>% rename(lcl = Q2.5, ucl = Q97.5) %>%
cbind(as.data.frame(predict(mod_mrt, newdata = prep_mrt)) %>% select(lpl = Q2.5, upl = Q97.5))
ggplot(pred_mrt) +
geom_linerange(aes(sex, Estimate, ymin = lpl, ymax = upl), linewidth = 1, alpha = 0.2) +
geom_linerange(aes(sex, Estimate, ymin = lcl, ymax = ucl), linewidth = 2, alpha = 0.5) +
geom_point(aes(sex, Estimate), size = 3) +
ylab('Straight carapace length (cm)') +
xlab('Sex')
Figure 5: A plot of the estimated straight carapace length (in centimetres) for each sex from the Murray River Turtle model. The black point is the mean, the thick grey vertical line is the 95% credible interval, while the thin grey line is the 95% prediciton interval.
In the methods of your document, you need to specify what model you used; was it frequentist or Bayesian; what packages you used to create the model; and any other assumptions that you made. For Bayesian models, you should also specify the priors that were used and how convergence of the model was assessed.
Potentially you may want or need to include a more formal definition of the model. If that is required, then it would be something like:
In the results section you may want to produce (fig-mrt-scl-est?) (maybe without the prediction intervals) as well as a table with the parameter estimates and the 95% credible intervals (see (tbl-mrt-mod?)). Depending on the audience, maybe the parameter estimate go into an appendix.
tb_mod_mrt <- tidy(mod_mrt)
imp_mrt <- (tb_mod_mrt$conf.low*tb_mod_mrt$conf.high > 0) & (tb_mod_mrt$effect == 'fixed')
tb_mod_mrt %>%
select(-component, -group) %>%
mutate(across(where(is.numeric), ~ round(.x, 3)),
effect = str_to_sentence(effect),
effect = str_replace(effect, 'Ran_pars', 'Random'),
term = c('Intercept', 'Sex: Male', 'Standard deviation')) %>% flextable() %>%
set_header_labels(values = c('Effect', 'Parameter', 'Estimate', 'S.E.', 'Lower', 'Upper')) %>%
add_header_row(colwidths = c(4, 2), values = c("", "95% Credible bound")) %>%
align( j=4:6, align='right', part='body' ) %>%
flextable::bold(i = imp_mrt) %>%
ari_theme()
95% Credible bound | |||||
---|---|---|---|---|---|
effect | term | estimate | std.error | conf.low | conf.high |
Fixed | Intercept | 28.373 | 0.382 | 27.599 | 29.134 |
Fixed | Sex: Male | -3.462 | 0.556 | -4.573 | -2.373 |
Random | Standard deviation | 2.474 | 0.196 | 2.127 | 2.887 |
Wetland interventions have been carries out across many locations in Victoria. To assess the effectiveness of the interventions, a subset of 15 of these location were monitored over 6 year utilising a BACI4 experimental design to determine the effect of crash grazing on wetlands. At each location there were 5 intervention and 5 control transects, with the interventions happening between the first and second years monitoring. The maximum plant height was among several criteria of interest. In particular, does the presence of crash grazing effect the maximum plant height. The relevant data can be found in df_mh_cm_crash.RDS
.
After we load the data an initial plot of the overall density ((fig-mh-den?)) shows that the data is positively skewed and the data is close to zero.
df_mh_cm_crash <- readRDS('data/df_mh_cm_crash.RDS')
ggplot(df_mh_cm_crash, aes(max_ht_cm)) +
geom_histogram(aes(y = after_stat(density)), alpha = 0.2) +
geom_density() +
xlab('Maximum plant height (cm)') +
ylab('Density')
Figure 6: A plot of the histogram of the maximum plant height with the kernel density estimate overlayed.
Looking at the flow diagram, that means that we either need to transform the data to try and make it approximately normal or choose an alternative distribution to the normal distribution. We are going to do the latter, as the log-normal distribution may fit this data, given it is positively skewed (like the data) and can only take positive values (plant heights can’t be negative). To check the effect of crash grazing on maximum plant height, the model has maximum plant height as a response to explanatory variables year and grazing treatment interacting. Given the data was from repeated visits to the same locations and transects, it is appropriate to use a mixed model with transect nested within location.
\[\text{max_ht_cm} \sim \text{year} \times \text{grazing}\]mod_mh_crash <- brm(max_ht_cm ~ exclusion * year + (1|location/transect), family = 'lognormal',
data = df_mh_cm_crash, cores = 4)
From the model output below and the trace plots the model looks to have converged.
mod_mh_crash
Family: lognormal
Links: mu = identity; sigma = identity
Formula: max_ht_cm ~ exclusion * year + (1 | location/transect)
Data: df_mh_cm_crash (Number of observations: 900)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Multilevel Hyperparameters:
~location (Number of levels: 15)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Intercept) 0.30 0.08 0.19 0.48 1.00 924
Tail_ESS
sd(Intercept) 1463
~location:transect (Number of levels: 75)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Intercept) 0.22 0.02 0.18 0.27 1.00 1206
Tail_ESS
sd(Intercept) 2160
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept 3.58 0.09 3.40 3.76 1.01 691
exclusionYes 0.01 0.04 -0.07 0.10 1.00 2044
year 0.00 0.01 -0.01 0.02 1.00 2440
exclusionYes:year 0.05 0.01 0.03 0.08 1.00 1970
Tail_ESS
Intercept 1510
exclusionYes 2635
year 2590
exclusionYes:year 2558
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.30 0.01 0.28 0.31 1.00 4212 2924
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
plot(mod_mh_crash, bins = 30)
Figure 7: The chains from each estimated parameter in the maximum plant height model.
Figure 8: The chains from each estimated parameter in the maximum plant height model.
The QQ plot ((fig-mh-crash-resid?)) does not show any violations of the log-normal distribution assumption. The other plot, however, shows a pattern in the residuals. Therefore, we should not use this model, and an alternative model should be used.
check_brms(mod_mh_crash)
Figure 9: Residual plots for the maximum plant height model.
When the data is separated by exclusion treatment and year ((fig-mh-violin?)) the positive skew persists and maybe there is a change is maximum plant height over time and between treatments. However, the change over time does not seem to follow a linear (straight line) path.
ggplot(df_mh_cm_crash %>% mutate(grazing = ifelse(exclusion == 'No', 'Grazing continued', 'Grazing excluded')), aes(year, max_ht_cm, group = year)) + geom_violin() +
geom_boxplot(fill = 'darkgrey', width = 0.5) +
geom_jitter(width = 0.1, height = 0) +
ylab('Maximum plant height (cm)') +
xlab('Time (years)') +
facet_wrap(.~ grazing)
Figure 10: Combined boxplot and Violin plots of the maximum plant height over time and split by treatment group.
Maybe after crash grazing was implemented (after the first year) the maximum plant height was steady at a greater height than before at the intervention transects. At the control transects, there may have been some change, but it is not consistent, potentially reflecting a greater variance. To check if there is just a step change after the intervention, we can swap year in the model to just before or after year 1.
\[\text{max_ht_cm} \sim \text{after} \times \text{grazing}\]
mod_mh_crash_ba <- brm(max_ht_cm ~ after * exclusion + (1|location/transect), family = 'lognormal',
data = df_mh_cm_crash, cores = 4)
Now the QQ plot and within group deviation plots look much better ((fig-mh-crash-ba-resid?)). There is still an issue with the variance being different between years after the intervention, but for the moment this is acceptable for what we are after. What options could be explore to address that issue?
mod_mh_crash_ba
Family: lognormal
Links: mu = identity; sigma = identity
Formula: max_ht_cm ~ after * exclusion + (1 | location/transect)
Data: df_mh_cm_crash %>% mutate(after = factor(ifelse(ye (Number of observations: 900)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Multilevel Hyperparameters:
~location (Number of levels: 15)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Intercept) 0.31 0.08 0.20 0.49 1.00 783
Tail_ESS
sd(Intercept) 1100
~location:transect (Number of levels: 75)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Intercept) 0.22 0.02 0.18 0.27 1.00 1019
Tail_ESS
sd(Intercept) 1735
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat
Intercept 3.40 0.09 3.21 3.59 1.01
afterAfter 0.21 0.03 0.14 0.27 1.00
exclusionYes -0.04 0.04 -0.12 0.05 1.00
afterAfter:exclusionYes 0.28 0.05 0.19 0.38 1.00
Bulk_ESS Tail_ESS
Intercept 753 865
afterAfter 2362 2832
exclusionYes 1985 2582
afterAfter:exclusionYes 1882 2456
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.27 0.01 0.25 0.28 1.00 3573 2849
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
check_brms(mod_mh_crash_ba)
Figure 11: Residual plots for the maximum plant height model using before and after only, not year.
The posterior predictive check looks okay ((fig-mh-crash-ba?)).
pp_check(mod_mh_crash_ba, ndraws = 100)
Figure 12: Posterior predictive plot for the maximum plant height model using before and after only, not year.
To get a quick sense of the results we look at the plot of estimates for each combination (fig-mh-crash-ba-est). It seems that there was initially no difference in maximum plant height between exclusion and non-exclusion transects. After the intervention, the mean maximum plant heights increased across all transects, but at the herbivore exclusion transects, the increase in mean height was larger than at the non-exclusion size. However, if we want to be more formal about our conclusions that looking at the plot, we need to use the model output table.
prep_mh_crash_ba <- df_mh_cm_crash %>% filter(location == 'A', transect == 'A_1') %>%
select(after, exclusion) %>% distinct()
pred_mh_crash_ba <- cbind(prep_mh_crash_ba,
fitted(mod_mh_crash_ba, newdata = prep_mh_crash_ba, re_formula = NA)) %>%
rename(lcl = Q2.5, ucl = Q97.5)
ggplot(pred_mh_crash_ba, aes(exclusion, Estimate, ymin = lcl, ymax = ucl, group = after, shape = after, color = after)) +
geom_pointrange(position = position_dodge(width = 0.2)) +
ylab('Maximum plant height (cm)') +
xlab('Herbivore exclusion') +
theme(legend.title = element_blank())
Figure 13: A plot of the estimated maximum plant height (in centimetres) for each combination from the BACI model. The black point is the mean, the thick grey vertical line is the 95% credible interval, while the thin grey line is the 95% prediciton interval.
The Intercept is effectively the default option, which in this case is the maximum plant height at the non-exclusion transects in year 1 (before the intervention). The next term is labelled afterAfter and represents the change in maximum plant height at non-exclusion transects. Simialrly the third term is exclusionYes and represents the difference in maximum plant height at exclusion transects compared to non-exclusion in year 1. The final fixed effect term is the interaction term5 and it is the key term in the analysis. It represents the difference in maximum plant height at exclusion transects after the intervention not accounted for by each of those options independently.
It seems that there was initially no difference in maximum plant height between exclusion and non-exclusion transects (the model estimate is -0.04, 95% CI of -0.12 to 0.05). After the intervention at exclusion transects, the maximum plant heights increased by a factor of 1.33 (95% CI: 1.21 — 1.05) above what would expect if the interaction was not important. Note the increase is not the same as the parameter estimate from the model, as that is on the transformed (log) scale. To get back to the response scale, we need to back transform6 the estimate. Which means we are now talking about a multiplicative effect (factor) rather than additive.
If a more formal definition of the model is required, then it would be something like:
In the results section you may want to produce (fig-mh-crash-ba-est?) as well as a table with the parameter estimates and their 95% credible intervals (see (tbl-mh-crash-mod?)). Depending on the audience, maybe the parameter estimate go into an appendix.
tb_mod_mh_crash <- tidy(mod_mh_crash_ba)
imp_mh_crash <- (tb_mod_mh_crash$conf.low*tb_mod_mh_crash$conf.high > 0) & (tb_mod_mh_crash$effect == 'fixed')
tb_mod_mh_crash %>%
select(-component) %>%
mutate(across(where(is.numeric), ~ round(.x, 3)),
effect = str_to_sentence(effect),
effect = str_replace(effect, 'Ran_pars', 'Random'),
term = c('Intercept', 'After intervention', 'Exclusion used',
'After intervention and exclusion used',
rep('Standard deviation', 3))) %>% flextable() %>%
set_header_labels(values = c('Effect', 'Source', 'Parameter', 'Estimate', 'S.E.',
'Lower', 'Upper')) %>%
add_header_row(colwidths = c(5, 2), values = c("", "95% Credible bound")) %>%
align( j=5:7, align='right', part='body' ) %>%
bold(i = imp_mh_crash) %>%
ari_theme()
95% Credible bound | ||||||
---|---|---|---|---|---|---|
effect | group | term | estimate | std.error | conf.low | conf.high |
Fixed | Intercept | 3.404 | 0.094 | 3.213 | 3.591 | |
Fixed | After intervention | 0.207 | 0.033 | 0.142 | 0.272 | |
Fixed | Exclusion used | -0.035 | 0.044 | -0.122 | 0.049 | |
Fixed | After intervention and exclusion used | 0.285 | 0.048 | 0.192 | 0.378 | |
Random | location | Standard deviation | 0.310 | 0.080 | 0.197 | 0.491 |
Random | location:transect | Standard deviation | 0.222 | 0.023 | 0.180 | 0.273 |
Random | Residual | Standard deviation | 0.266 | 0.007 | 0.253 | 0.279 |
Now that you are familiar with normal and log-normal generalised linear models, it is now your turn to try some.
In groups of three or four could you please try at least one of the following small analysis tasks. As a group, you will need to follow the step to conduct your analysis, including:
If there is time and the technology permits, we may may ask some of the groups to give a quick 2 minute presentation on what they found.
ex_mrt.RDS
. 🌶ex_mc_wt.RDS
. 🌶🌶ex_pdw.RDS
. 🌶🌶🌶
Violin plots are a graph type that can be used to quickly compare continuous distributions. They are a vertical (kernel) density plot reflected horizontally. Wider sections indicate more data in that area.↩︎
Remember, that in Bayesian statistics we use credible intervals rather than confidence intervals. If the 95% CI of the parameter estimate does not include zero, then that is strong evidence that the term is having an effect on the data.↩︎
Prediction intervals are representative of what individual responses are likely to be observed, rather than what is the likely mean response in credible intervals.↩︎
Before-After-Control-Intervention↩︎
In R
the interaction between two or more variables is signified by the colon (:).↩︎
The more current way of saying this is to use the “inverse operation”.↩︎
Mean-log is used to emphasis that it is the mean of the log of the variable, not the mean of the variable itself. Similarly↩︎