Day 4: Mixture models

Statistical analysis: beta, mixture

Paul Moloney (Arthur Rylah Institute, https://www.ari.vic.gov.au/)
https://bfanson.github.io/2025_Models_workshop/
2025-09-11

Day’s objectives

Project to download

Link

Key packages

Please note that while we will be using brms to conduct a Bayesian analysis, the models would hold for other packages and frequentist analyses.

Do connectivity and/or watering regimes affect canopy health in Black Box?

Context and design

Researchers are interested in determining if the canopy health of Black Box (Eucalyptus largiflorens) is affected by the connectivity of their wetland and/or the flooding history. In this study a total of 80 known Black Box sites were sampled across 5 connectivity and flooding histories (CAFH). The average crown condition score (ACCS) for Black Box was recorded at each site. The crown condition score for a tree is an ordinal¹ score from 0 (no canopy) to 7 (91%-100% canopy cover). That means the average score at any site could range from 0 to 7, inclusive. The dataset for this study can be found in df_crown_score.RDS.

Exploring the data and developing the model

After we load the data we can compare the ACCS across CAFH categories using a combined boxplot and violin plot. Figure 1 shows that the distribution of scores is quite different between some of the groups, both in terms of central tendency and spread.

Show code
head(df_crown_score <- readRDS('data/df_crown_score.RDS'))
                          cafh accs
1 Connected frequently flooded  4.8
2 Connected frequently flooded  4.8
3 Connected frequently flooded  3.6
4 Connected frequently flooded  4.6
5 Connected frequently flooded  4.7
6 Connected frequently flooded  3.9
Show code
ggplot(df_crown_score, aes(cafh, accs)) + geom_violin() +
  geom_boxplot(fill = 'darkgrey', width = 0.1) +
  geom_jitter(width = 0.1, height = 0) +
  ylab('Average crown condition score') +
  xlab('Connectivity and flooding history') +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Figure 1: Combined boxplot and violin plots of the average crown condition score of Black Box trees at wetlands with different connectivity.

Picking the distribution

Each site is a sampling event, with a response variable ACCS (\(y_i\)) and a categorical explanatory variable CAFH (\(x_i\)). The response is continuous and bounded between two values (0 and 7), but it does not reach those bounds (Figure 1). Therefore, we could start with a (scaled) beta distribution.
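Since the beta family cannot take values of exactly 0 or 1, a quick check (a minimal sketch) that the scaled scores stay strictly inside (0, 1) is worthwhile:

# The beta family requires 0 < y < 1, so check the scaled scores avoid the bounds
range(df_crown_score$accs / 7)
any(df_crown_score$accs == 0 | df_crown_score$accs == 7)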

In words I would describe the model as "The ACCS for any site follows a scaled beta distribution, where the mean and precision are functions of the wetland's CAFH category". Here the ACCS at site \(i\) is \(y_i\), the CAFH is \(x_i\), the mean is \(\mu_i\), and the precision is \(\phi_i\). Note that we allow the precision parameter to vary between CAFH categories because, in the exploration, the spread of the different categories seemed quite different.

Observational model:

\[ \begin{align} \frac{y_i}{7} &\sim beta(\mu_i, \phi_i) \\ \end{align} \]

Process model:

The mean (\(\mu_i\)) and precision (\(\phi_i\)) are linked to the predictors via the logit and log links, respectively:

Informal style:

\[ \begin{align} logit(\mu_i) &= \text{cafh}_i \\ log(\phi_i) &= \text{cafh}_i \end{align} \]

Formal style:

\[ \begin{align} logit(\mu_i) &= \alpha + \beta x_i \\ log(\phi_i) &= \gamma + \delta x_i \end{align} \]

Run the model

Show code
mod_accs <- brm(bf(accs/7 ~ cafh, phi ~ cafh), family = 'Beta', data = df_crown_score, cores = 4)

Checking the model converged

From the model output below and the trace plots (Figures 2 and 3) the model looks to have converged.

Show code
mod_accs
 Family: beta 
  Links: mu = logit; phi = log 
Formula: accs/7 ~ cafh 
         phi ~ cafh
   Data: df_crown_score (Number of observations: 80) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                               Estimate Est.Error l-95% CI u-95% CI
Intercept                          0.20      0.09     0.01     0.38
phi_Intercept                      2.67      0.26     2.14     3.14
cafhConnectedrarelyflooded        -0.37      0.15    -0.67    -0.09
cafhConnectedunflooded            -0.54      0.16    -0.86    -0.24
cafhIsolatedrarelyflooded         -0.60      0.12    -0.83    -0.37
cafhLinearrarelyflooded           -0.62      0.11    -0.83    -0.41
phi_cafhConnectedrarelyflooded     0.44      0.46    -0.51     1.31
phi_cafhConnectedunflooded         0.24      0.45    -0.66     1.12
phi_cafhIsolatedrarelyflooded      1.96      0.58     0.73     3.00
phi_cafhLinearrarelyflooded        2.77      0.56     1.58     3.79
                               Rhat Bulk_ESS Tail_ESS
Intercept                      1.00     2028     1967
phi_Intercept                  1.00     3073     2581
cafhConnectedrarelyflooded     1.00     2764     2680
cafhConnectedunflooded         1.00     2842     2940
cafhIsolatedrarelyflooded      1.00     2288     2402
cafhLinearrarelyflooded        1.00     2098     2152
phi_cafhConnectedrarelyflooded 1.00     3264     2755
phi_cafhConnectedunflooded     1.00     2998     3001
phi_cafhIsolatedrarelyflooded  1.00     2935     2126
phi_cafhLinearrarelyflooded    1.00     3218     2321

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Show code
plot(mod_accs)

Figure 2: The chains from each estimated parameter in the average crown condition score model.


Figure 3: The chains from each estimated parameter in the average crown condition score model.
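Beyond eyeballing the summary and the trace plots above, a quick numeric check is possible; a sketch using the rhat() and neff_ratio() methods that brms provides for fitted models:

# All Rhat values should be close to 1 and effective sample size ratios should not be tiny
max(rhat(mod_accs), na.rm = TRUE)
min(neff_ratio(mod_accs), na.rm = TRUE)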

Checking the model assumptions

The QQ plot (Figure 4) does not show any violations of the beta distribution assumption. The homogeneity plot is also fine, which is good, since the model allowed the spread to differ between levels.

Show code
check_brms(mod_accs)

Figure 4: Residual plots for the average crown condition score model.

The posterior predictive check looks okay (Figure 5).

Show code
pp_check(mod_accs, ndraws = 100)

Figure 5: Posterior predictive plot for the average crown condition score model.

Interpreting the model outputs

We can interpret the model estimates roughly without getting too far into the weeds, and let the plot do the heavy lifting for us. If the 95% credible interval for a parameter excludes zero, then we can infer that that level differs from the reference level (connected frequently flooded). Here all of the levels are lower. In Figure 6 we can see that connected frequently flooded is clearly higher scoring than the others, but also that connected rarely flooded is most likely higher than isolated or linear rarely flooded.
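If you want a direct comparison between two non-reference levels (say, connected rarely flooded versus isolated rarely flooded), one option is brms::hypothesis(); a minimal sketch, using the coefficient names shown in the model summary above:

# Posterior evidence that connected rarely flooded scores higher than
# isolated rarely flooded (difference on the logit scale)
hypothesis(mod_accs, "cafhConnectedrarelyflooded - cafhIsolatedrarelyflooded > 0")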

Show code
prep_accs <- df_crown_score %>% select(cafh) %>% distinct()
pred_accs <- cbind(prep_accs, 7*fitted(mod_accs, newdata = prep_accs)) %>% rename(lcl = Q2.5, ucl = Q97.5)
ggplot() +
  geom_pointrange(data = pred_accs, aes(cafh, Estimate, ymin = lcl, ymax = ucl)) +
  geom_jitter(data = df_crown_score, aes(cafh, accs), alpha = 0.2, width = 0.1) +
  ylab('Average crown condition score') +
  xlab(NULL) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Figure 6: A plot of the estimated average crown condition score for each connectivity and flood history category. The black point is the mean, the vertical line is the 95% credible interval.

In the results section you may want to produce Figure 6 as well as a table with the parameter estimates and the 95% credible intervals (see the table below). Depending on the audience, the parameter estimates could perhaps go into an appendix.

Show code
tb_mod_accs <- cbind(data.frame(Component = c('Mean', 'Precision', rep('Mean', 4),
                                              rep('Precision', 4)),
                                Parameter = c(rep('Intercept', 2),
                                              rep(c('CAFH: Connected rarely flooded',
                                                    'CAFH: Connected unflooded',
                                                    'CAFH: Isolated rarely flooded',
                                                    'CAFH: Linear rarely flooded'), 2))),
                    summary(mod_accs)$fixed %>%
                      select(Estimate, Est.Error, Lower = `l-95% CI`, Upper = `u-95% CI`)) %>%
  as_tibble()
imp_accs <- (tb_mod_accs$Lower*tb_mod_accs$Upper > 0)
tb_mod_accs %>%
  mutate(across(where(is.numeric), ~ round(.x, 3))) %>% arrange(Component) %>% flextable() %>%
  set_header_labels(values = c('Component', 'Parameter', 'Estimate', 'S.E.', 'Lower', 'Upper')) %>%
  add_header_row(colwidths = c(4, 2), values = c("", "95% Credible bound")) %>%
  align( j=4:6, align='right', part='body' ) %>%
  bold(i = imp_accs) %>%
  ari_theme()

Conclusion

The model shows that the connectivity and flood history categories have different Black Box canopy health. In particular, connected and frequently flooded wetlands had a greater ACCS than all other wetland types (Figure 6). Additionally, there is evidence to conclude that connected rarely flooded wetlands have greater scores than isolated rarely flooded or linear rarely flooded sites.

Is thatch depth affected by grazing?

Context and design

A client is interested in determining the impact of excluding livestock from wetlands. To assess the effectiveness, a BACI study was conducted (which we have simplified here). There were 10 locations monitored before and after the intervention was completed. At each location there were 5 transects that were to continue with the current continual grazing regime, and 5 transects that would have livestock excluded. The dataset for this study can be found in df_td_cm_press.RDS.

Exploring the data and developing the model

We load the data, then plot the data to see how it is distributed (Figure 7). Hopefully you noticed that a lot of the data is in the first bin, which is at 0, then there is a dip before a mini-peak around 3 cm. If we look at the violin plots (Figure 8) you can see that there is a lot of data at 0 cm (at least 25%, given the lower quartile is 0) and then a gap until after 1 cm. It is almost like the data is split into two groups: the places where it is zero, and then the rest.

Show code
head(df_td_cm_press <- readRDS('data/df_td_cm_press.RDS'))
# A tibble: 6 × 7
  location grazing exclusion           transect  year td_cm time  
  <chr>    <chr>   <fct>               <chr>    <int> <dbl> <fct> 
1 P        Press   No exclusion        P_1          1  0    Before
2 P        Press   No exclusion        P_1          6  7.54 After 
3 P        Press   Livestock exclusion P_1          1  6.59 Before
4 P        Press   Livestock exclusion P_1          6  0    After 
5 P        Press   No exclusion        P_2          1  0    Before
6 P        Press   No exclusion        P_2          6  5.28 After 
Show code
ggplot(df_td_cm_press, aes(td_cm)) +
  geom_histogram(aes(y = after_stat(density)), alpha = 0.2) +
  geom_density() +
  xlab('Thatch depth (cm)') +
  ylab('Density')

Figure 7: Combined histogram and density plot of the thatch depth.

Show code
ggplot(df_td_cm_press, aes(time, td_cm)) + geom_violin() +
  geom_boxplot(fill = 'darkgrey', width = 0.1) +
  geom_jitter(width = 0.1, height = 0) +
  ylab('Thatch depth (cm)') +
  xlab('Intervention schedule') +
  facet_wrap(.~ exclusion)

Figure 8: Combined boxplot and violin plots of the thatch depth by time and exclusion.
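A quick tally (a sketch) confirms the impression from Figures 7 and 8 that a sizeable share of the depths are exactly zero:

# Proportion of transect visits with no measurable thatch, overall and by group
mean(df_td_cm_press$td_cm == 0)
df_td_cm_press %>% group_by(exclusion, time) %>% summarise(prop_zero = mean(td_cm == 0))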

Introducing mixture models

Game 1

Let us consider a game where we flip a coin and roll two dice. If the coin lands on heads, then you score 0, otherwise you score the sum of the two dice. What sort of score are you likely to get? Let us simulate playing the game 100 times to get us some data.

Show code
fn_game1 <- function(nsim){
  data.frame(coin = rbinom(nsim, 1, 0.5),
             dice = sample.int(6, nsim, replace = TRUE) + sample.int(6, nsim, replace = TRUE)) %>%
    mutate(score = coin * dice, Distribution = ifelse(coin == 0, 'Coin', 'Dice'))
}
trial1 <- fn_game1(100)
ggplot(trial1, aes(score, fill = Distribution)) + geom_bar() + xlab('Game score')

Figure 9: Bar chart of the result of playing Game 1 100 times.

Intuitively, we should know that half the time we should expect to score 0 and half the time we will score between 2 and 12, with 7 happening most frequently. From Figure 9 we can see this is about right. If we played the game 10,000 times, the results would be close to the theoretical distribution.
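For comparison, the theoretical distribution is easy to write down: a 0.5 chance of scoring 0 from the coin, with the remaining 0.5 spread over the two-dice sums. A minimal sketch:

# Exact probabilities for Game 1: P(0) = 0.5, otherwise 0.5 * P(two-dice sum = s)
p_dice <- sapply(2:12, function(s) sum(outer(1:6, 1:6, "+") == s) / 36)
theor1 <- data.frame(score = c(0, 2:12), prob = c(0.5, 0.5 * p_dice))
theor1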

What we have done here is created what we call a mixture distribution. It combines two (or more) separate distributions into one overall distribution. Specifically what we have here could be considered a hurdle model. The hurdle here was flipping a tail, otherwise you scored zero.

Game 2

Now let us play another game; this time we are flipping a $2 coin and 3 $1 coins. If the $2 coin lands on heads, then you score 0, otherwise your score is the number of tails you got with your $1 coins. What sort of score are you likely to get? Let us simulate playing the game 100 times to get us some data.

Show code
fn_game2 <- function(nsim){
  data.frame(coin2 = rbinom(nsim, 1, 0.5),
             coin1 = rbinom(nsim, 3, 0.5)) %>%
    mutate(score = coin2 * coin1,
           Distribution = factor(ifelse(coin2 == 0, '$2 coin', '$1 coins'),
                                 levels = c('$2 coin', '$1 coins')))
}
trial2 <- fn_game2(100)
ggplot(trial2, aes(score, fill = Distribution)) + geom_bar() + xlab('Game score')

Figure 10: Bar chart of the result of playing Game 2 100 times.

Intuitively, we should know that half the time we should expect to score 0 due to the $2 coin, but we would also expect to score a 0 from the $1 coins about 1 in 8 times (ignoring the $2 coin). From Figure 10 we can see this is about right. If we played the game 10,000 times, the results would be close to the theoretical distribution.
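The theoretical chance of a zero in Game 2 can also be checked directly; a sketch:

# P(score = 0) = P($2 coin heads) + P($2 coin tails) * P(no tails on the three $1 coins)
0.5 + 0.5 * dbinom(0, size = 3, prob = 0.5)   # 0.5625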

We again created a mixture distribution. It combines two (or more) separate distributions into one overall distribution. Specifically what we have here could be considered a zero-inflation model. We got more zero scores than you would expect if you only summed the $1 coins and ignored the $2 coin.

Hurdle models versus Zero-inflated models

Both zero-inflated and hurdle models can be ways of dealing with zeroes that were unexpected. However, they are usually used under different circumstances, which usually depends on which type of zero we have.

Roughly, if you only have structural zeroes (or don't care why there are zeroes) then you would use a hurdle mixture model; if there are structural and random zeroes, then you would use a zero-inflated mixture model.

Either way, you end up with two components to the model: one part of the model deals with the structural zeroes (the hurdle or zero-inflation) and the conditional part of the model deals with the rest of the distribution (effectively dealing with only the non-structural-zero section). The hurdle/zero-inflation part is typically a Bernoulli/binomial distribution used to estimate the proportion of the data that are structural zeroes. The conditional part of the model will depend on the rest of the data.
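As a minimal sketch of the difference, we can simulate from both mixtures using a Poisson count part: in the hurdle version every zero comes from the hurdle, while in the zero-inflated version the count part can also produce zeroes of its own.

set.seed(1)
n <- 10000; p_extra_zero <- 0.4; lambda <- 3

# Hurdle: a Bernoulli decides zero versus non-zero; non-zeroes are drawn from a
# zero-truncated Poisson, so every zero is structural
cleared <- rbinom(n, 1, 1 - p_extra_zero)
y_hurdle <- ifelse(cleared == 1, qpois(runif(n, dpois(0, lambda), 1), lambda), 0)

# Zero-inflated: extra zeroes are mixed with an ordinary Poisson, which also
# produces some random zeroes by chance
y_zinf <- rbinom(n, 1, 1 - p_extra_zero) * rpois(n, lambda)

c(hurdle = mean(y_hurdle == 0), zero_inflated = mean(y_zinf == 0))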

Picking the distribution

The response variable is thatch depth (\(\text{td}_i\)), which is a continuous measurement, and there are two categorical explanatory variables, time period (\(t_i\)) and exclusion (\(f_i\)). The data are close to zero, so maybe a log-normal distribution could be used, except for the fact that we have lots of (or any²) zeroes. A mixture model could account for those structural zeroes, in particular a hurdle model, as there can be no random zeroes with a log-normal distribution. Therefore we end up trying a hurdle log-normal model. Given the data were from repeated visits to the same locations and transects, it is appropriate to use a mixed model with transect nested within location.

Observational model:

\[ \begin{align} \text{td}_i &\sim \text{HLN}(\pi, \mu_i, \sigma) \\ \end{align} \]

where \(\pi\) is the probability of a zero (the hurdle; reported as hu by brms), \(\mu\) is the mean-log and \(\sigma\) is the sd-log of the non-zero depths.
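For reference, a sketch of the hurdle log-normal density implied above, writing \(\pi\) as the probability of a zero (this matches the brms parameterisation, where the reported hu is the zero probability):

\[ f(y_i) = \begin{cases} \pi & \text{if } y_i = 0 \\ (1 - \pi)\,\text{LogNormal}(y_i \mid \mu_i, \sigma) & \text{if } y_i > 0 \end{cases} \]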

Process model:

The mean-log (\(\mu_i\)) is linked to the predictors via the identity link, while the sd-log (\(\sigma\)) and hurdle (\(\pi\)) are constants on the identity link:

Informal style:

\[ \begin{align} \mu_i &= \text{time}_i \times \text{exclusion}_i \end{align} \]

Formal style:

\[ \begin{align} \mu_i &= \beta_0 + \beta_1 \times t_i + \beta_2 \times f_i + \beta_3 \times t_i \times f_i \end{align} \]

where \(t_i = 0\) when the time is before the intervention and 1 otherwise, and \(f_i = 1\) if the transect is an exclusion transect and 0 otherwise.

Run the model

Show code
mod_td_press <- brm(td_cm ~ time * exclusion + (1|location/transect), family = 'hurdle_lognormal',
                    data = df_td_cm_press, cores = 4, control = list(adapt_delta = 0.9))

Checking the model converged

From the model output below and the trace plots (Figures 11 and 12) the model looks to have converged.

Show code
mod_td_press
 Family: hurdle_lognormal 
  Links: mu = identity; sigma = identity; hu = identity 
Formula: td_cm ~ time * exclusion + (1 | location/transect) 
   Data: df_td_cm_press (Number of observations: 200) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Multilevel Hyperparameters:
~location (Number of levels: 10) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Intercept)     0.42      0.14     0.22     0.77 1.00     1148
              Tail_ESS
sd(Intercept)     1594

~location:transect (Number of levels: 50) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Intercept)     0.22      0.05     0.12     0.32 1.00     1033
              Tail_ESS
sd(Intercept)     1645

Regression Coefficients:
                                      Estimate Est.Error l-95% CI
Intercept                                 1.34      0.15     1.03
timeAfter                                -0.16      0.07    -0.31
exclusionLivestockexclusion              -0.06      0.08    -0.22
timeAfter:exclusionLivestockexclusion     1.26      0.12     1.03
                                      u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept                                 1.63 1.00     1204     1970
timeAfter                                -0.02 1.00     2943     2947
exclusionLivestockexclusion               0.11 1.00     2804     2836
timeAfter:exclusionLivestockexclusion     1.49 1.00     2423     2880

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.26      0.03     0.22     0.32 1.00     1788     2707
hu        0.45      0.03     0.38     0.52 1.00     6243     3297

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Show code
plot(mod_td_press)

Figure 11: The chains from each estimated parameter in the thatch depth model.


Figure 12: The chains from each estimated parameter in the thatch depth model.

Checking the model assumptions

The QQ plot (Figure 13) does not show any violations of the hurdle log-normal distribution assumption. The homogeneity plot is also fine.

Show code
check_brms(mod_td_press)

Figure 13: Residual plots for the thatch depth model.

The posterior predictive check looks okay (Figure 14).

Show code
pp_check(mod_td_press, ndraws = 100)

Figure 14: Posterior predictive plot for the thatch depth model.

Interpreting the model outputs

Important parts of this model are that the BACI interaction term provided strong evidence that the intervention is associated with thatch depth increasing by a factor of 3.5 (95% CI from 2.8 to 4.4) on the non-zero part of the distribution, and that the hurdle probability was estimated at 45% (95% CI from 38 to 52%), i.e. about 45% of measurements were zero and a little over half of the transect visits had some thatch.
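The factor of 3.5 comes from exponentiating the interaction coefficient, which is on the log scale; a minimal sketch of recovering it from the posterior draws (assuming brms's default coefficient naming, visible in the summary above):

# Multiplicative effect of the BACI interaction on non-zero thatch depth
draws_int <- as_draws_df(mod_td_press)$`b_timeAfter:exclusionLivestockexclusion`
quantile(exp(draws_int), probs = c(0.025, 0.5, 0.975))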

Figure 15 shows the unconditional (population-level) estimates of thatch depth, and the effect of the interaction term is clear.

Show code
prep_td_press <- df_td_cm_press %>% filter(location == 'P', transect == 'P_1') %>%
  select(time, exclusion) %>% distinct()
pred_td_press <- cbind(prep_td_press,
                          fitted(mod_td_press, newdata = prep_td_press, re_formula = NA)) %>%
  rename(lcl = Q2.5, ucl = Q97.5) %>% mutate(time = factor(time, labels = c('Before intervention', 'After intervention')))
ggplot(pred_td_press, aes(exclusion, Estimate, group = time, shape = time, color = time)) +
  geom_pointrange(aes(ymin = lcl, ymax = ucl), position = position_dodge(width = 0.2)) +
  ylab('Thatch depth (cm)') +
  xlab('Herbivore exclusion') +
  theme(legend.title = element_blank())

Figure 15: A plot of the estimated thatch depth for each option. The black point is the mean, the vertical line is the 95% credible interval.

In the results section you may want to produce Figure 15 as well as a table with the parameter estimates and the 95% credible intervals (see the table below). Depending on the audience, the parameter estimates could perhaps go into an appendix.

Show code
tb_mod_td_press <- rbind(cbind(data.frame(Component = 'Conditional: Fixed',
                                    Parameter = c('Intercept', 'After intervention', 'Exclusion used',
                                                  'After intervention and exclusion used')),
                         summary(mod_td_press)$fixed %>%
                           select(Estimate, Est.Error, Lower = `l-95% CI`, Upper = `u-95% CI`)),
                         cbind(data.frame(Component = c('Conditional: Random', 'Hurdle'),
                                          Parameter = c('Standard deviation: Residual', 'Probability of zero')),
                               summary(mod_td_press)$spec_pars %>%
                                 select(Estimate, Est.Error, Lower = `l-95% CI`, Upper = `u-95% CI`)),
                         cbind(data.frame(Component = 'Conditional: Random',
                                          Parameter = c('Standard deviation: Location', 'Standard deviation: Transect')),
                               rbind(summary(mod_td_press)$random$location,
                                     summary(mod_td_press)$random$`location:transect`) %>%
                                 select(Estimate, Est.Error, Lower = `l-95% CI`, Upper = `u-95% CI`))) %>% as_tibble() %>%
  mutate(Component = factor(Component, levels =c('Hurdle', 'Conditional: Fixed', 'Conditional: Random'))) %>%
  arrange(Component)
imp_td_press <- (tb_mod_td_press$Lower*tb_mod_td_press$Upper > 0) &
  (tb_mod_td_press$Component == 'Conditional: Fixed')
tb_mod_td_press %>%
  mutate(across(where(is.numeric), ~ round(.x, 3))) %>% flextable() %>%
  set_header_labels(values = c('Component', 'Parameter', 'Estimate', 'S.E.', 'Lower', 'Upper')) %>%
  add_header_row(colwidths = c(4, 2), values = c("", "95% Credible bound")) %>%
  align( j=3:6, align='right', part='body' ) %>%
  bold(i = imp_td_press) %>%
  ari_theme()

Conclusion

The model shows that the exclusion of livestock increased the thatch depth and that just under half of the transect visits had no measurable thatch.

Recruitment of seedlings in semi-arid woodlands

Context and design

Researchers are interested in determining if tree recruitment in semi-arid woodlands is affected by environmental features at the site. In this study a total of 320 sites were sampled. The number of recruits (seedlings or juveniles) as well as several environmental factors were recorded. The dataset for this study can be found in df_rec_saw.RDS.

Exploring the data and developing the model

After we load the data we can explore it. Figure 16 shows that there are quite a few zeroes and some relatively large values.

Show code
head(df_rec_saw <- readRDS('data/df_rec_saw.RDS'))
  crust recruits
1  0.42        0
2  0.14       11
3  0.00        0
4  0.00        1
5  0.59        0
6  0.79        0
Show code
ggplot(df_rec_saw, aes(recruits)) +
  geom_histogram(aes(y = after_stat(density)), alpha = 0.2) +
  geom_density() +
  xlab('Tree recruitment') +
  ylab('Density')

Figure 16: Combined histogram and density of the number of recruits.

Show code
ggplot(df_rec_saw, aes(crust, recruits)) +
  geom_jitter() +
  ylab('Tree recruitment') +
  xlab('Soil crust (%)')

Figure 17: Scatter plot of the number of recruits compared to the percentage of soil crust.

Picking the distribution

The data are counts, and from Figure 16 it would seem that there are some quite large values compared to the majority of the data. There is also a very large peak at zero, which may or may not be explainable by the model. Putting those things together, we should try an initial model of a zero-inflated negative binomial. The zero-inflation will account for any structural zeroes, while still allowing for random zeroes. The negative binomial allows for the very large values, which may be examples of "clumping".
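To make the structural versus random zeroes explicit, here is a sketch of the zero-inflated negative binomial probability mass function, with \(\pi\) the zero-inflation probability, \(\mu_i\) the mean and \(\rho\) the shape:

\[ P(y_i = k) = \begin{cases} \pi + (1 - \pi)\,\text{NB}(0 \mid \mu_i, \rho) & \text{if } k = 0 \\ (1 - \pi)\,\text{NB}(k \mid \mu_i, \rho) & \text{if } k > 0 \end{cases} \]

The \(\pi\) term supplies the structural zeroes, while \(\text{NB}(0 \mid \mu_i, \rho)\) supplies the random zeroes.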

Observational model:

\[ \begin{align} y_i &\sim \text{ZINB}(\pi, \mu_i, \rho) \\ \end{align} \]

Process model:

The mean (\(\mu_i\)) is linked to the predictors via the log link, while the shape or aggregation (\(\rho\)) and the zero-inflation (\(\pi\)) are constants and use the identity link.

Informal style:

\[ \begin{align} log(\mu_i) &= \text{crust}_i \end{align} \]

Formal style:

\[ \begin{align} log(\mu_i) &= \alpha + \beta x_i \end{align} \]

where \(x_i\) is the (standardised) proportion of soil crust at that site.
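Note that the model below uses a standardised covariate z_crust rather than crust itself. A minimal sketch of how it might have been created, assuming a centre and scale of 0.25 so that it matches the back-transformation used in the prediction code further down:

# Hypothetical standardisation of soil crust (centre and scale of 0.25 assumed,
# consistent with the (crust - 0.25)/0.25 used for predictions later)
df_rec_saw <- df_rec_saw %>% mutate(z_crust = (crust - 0.25) / 0.25)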

Run the model

Show code
mod_rec_zinb <- brm(recruits ~ z_crust, family = 'zero_inflated_negbinomial', data = df_rec_saw, cores = 4)

Checking the model converged

From the model output below and Figure 18 the model looks to have converged.

Show code
mod_rec_zinb
 Family: zero_inflated_negbinomial 
  Links: mu = log; shape = identity; zi = identity 
Formula: recruits ~ z_crust 
   Data: df_rec_saw (Number of observations: 320) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     2.37      0.17     2.05     2.70 1.00     1102     1329
z_crust      -0.26      0.10    -0.45    -0.07 1.00     2169     2052

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape     0.27      0.06     0.18     0.42 1.00      829      841
zi        0.17      0.09     0.01     0.35 1.00      767     1071

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Show code
plot(mod_rec_zinb)

Figure 18: The chains from each estimated parameter in the recruits model.

Checking the model assumptions

The QQ plot (Figure 19) does not show any violations of the zero-inflated negative binomial distribution assumption. The homogeneity plot is also fine.

Show code
check_brms(mod_rec_zinb)

Figure 19: Residual plots for the recruits model.

The posterior predictive check looks okay (Figure 20).

Show code
pp_check(mod_rec_zinb, ndraws = 100)

Figure 20: Posterior predictive plot for the recruits model.

Interpreting the model outputs

Important parts of this model are that the slope term provided strong evidence that, as soil crust increases, the recruitment rate decreases (although how important that decrease is in the overall system is less clear), and that the zero-inflation may have been justified, as an estimated 17% (95% CI from 1 to 35%) of sites have no recruitment through structural zeroes. In Figure 21 we can see that there is still a high degree of variability that the model has not accounted for.
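To put the slope on the response scale we can exponentiate it; a minimal sketch (z_crust is assumed to be the standardised crust covariate described earlier, so one unit corresponds to a 0.25 increase in the crust proportion):

# Rate ratio for a one-unit increase in z_crust; values below 1 mean fewer recruits
exp(fixef(mod_rec_zinb)["z_crust", c("Estimate", "Q2.5", "Q97.5")])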

Show code
prep_rec_zinb <- data.frame(crust = seq(0, 0.9, 0.025)) %>% mutate(z_crust = (crust - 0.25)/0.25)
pred_rec_zinb <- cbind(prep_rec_zinb,
                       fitted(mod_rec_zinb, newdata = prep_rec_zinb)) %>%
  rename(lcl = Q2.5, ucl = Q97.5)
ggplot() +
  geom_ribbon(data = pred_rec_zinb, aes(crust, ymin = lcl, ymax = ucl), alpha = 0.2) +
  geom_line(data = pred_rec_zinb, aes(crust, Estimate)) +
  geom_jitter(data=df_rec_saw, aes(crust, recruits)) +
  ylab('Tree recruitment') +
  xlab('Soil crust (%)')

Figure 21: A plot of the estimated recruits. The black line is the mean, and the shaded area is the 95% credible interval.

In the results section you may want to produce Figure 21 as well as a table with the parameter estimates and the 95% credible intervals (see the table below). Depending on the audience, the parameter estimates could perhaps go into an appendix.

Show code
tb_mod_rec_zinb <- rbind(data.frame(Source = 'Negative binomial', Parameter = c('Intercept', 'Soil crust'), summary(mod_rec_zinb)$fixed[, 1:4]),
                         data.frame(Source = c('Negative binomial', 'Zero-inflation'), Parameter = c('Aggregation', 'Zero-inflation'),
                                    summary(mod_rec_zinb)$spec_pars[, 1:4])) %>% rename(Lower = `l.95..CI`, Upper = `u.95..CI`) %>% as_tibble()
imp_mod_rec_zinb <- (tb_mod_rec_zinb$Lower*tb_mod_rec_zinb$Upper > 0)
tb_mod_rec_zinb %>%
  mutate(across(where(is.numeric), ~ round(.x, 3))) %>% flextable() %>%
  set_header_labels(values = c('Source', 'Parameter', 'Estimate', 'S.E.', 'Lower', 'Upper')) %>%
  add_header_row(colwidths = c(4, 2), values = c("", "95% Credible bound")) %>%
  align( j=3:6, align='right', part='body' ) %>%
  bold(i = imp_mod_rec_zinb) %>%
  ari_theme()

Conclusion

The model shows that the amount of soil crust may influence the number of recruits present, but there is something else driving the system (Figure 21). Additionally, there is evidence that around 17% of sites have no recruitment, over and above the zeroes expected from the negative binomial alone.

Group exercises

Just one group exercise today. The data involve point data from Volcanic Plains Grasslands. Of interest is the effect of different regimes on the proportion of the landscape with Themeda present. At 15 locations, 5 transects were monitored annually for 5 years. On each transect, 50 points were assessed for the presence of Themeda. Additionally, at the commencement of the monitoring, the soil phosphorus was determined at each location. Can you explore any pattern in the data? How would you shape the data to help with your analysis? The data set is ex_them.RDS.


  1. Ordinal data is ordered categorical data. A common example would be the Likert scales (e.g. strongly disagree, disagree, neutral, agree, strongly agree).↩︎

  2. If we only had a couple of zeroes and thousands of observations, I would be tempted to either exclude these observations or use some other approach to avoid making the model much more complex.↩︎
