Tackling discrete models: Binomial, Poisson, Negative Binomial
Adapt Day 2’s workflow to discrete distributions
highlight unique aspects of discrete data
understand your experimental design
graphically explore your data (!!!!🙏 !!!!)
modeling time
repeat 2-3 until defensible model reached
summarizing model outputs for presentation
Key points
Most of the distributions (sans LogNormal) we have talked about come from the same family, the Exponential family
\[ f(x \mid \theta) = h(x) \exp\{\eta(\theta)^\top T(x) - A(\theta)\} \]
This unifying structure underlies the development of Generalized Linear Models (GLMs), which allow the same modeling framework across distributions; you just need to specify the distribution (e.g. family=‘poisson’, family=‘gaussian’)
As part of writing a distribution in exponential-family form, each distribution has a “natural” (canonical) link function. The table below lists these natural links.
note - the “natural” link has particularly good statistical properties, but you can use other link functions
Table showing link functions
# Create the data frame
library(flextable)
ds_dist <- tibble(
Distribution = c("Gaussian", "Poisson", "Negative Binomial", "Log-normal", "Binomial", "Beta"),
Type=c('Continuous','Discrete','Discrete','Continuous','Discrete','Continuous'),
`No. parameters` = c(2, 1, 2, 2, 2, 2),
Parameters = c(
"Mean (μ)\n Standard deviation (σ)",
"Rate (λ)",
"Mean (μ) or size (r)\n Dispersion (θ) or probability (p)",
"Mean (μ)\n Standard deviation (σ)",
"Number of trials (n)\n Probability of success (p)",
"Shape parameters α (shape1)\n β (shape2)"
),
Link=c('identity/raw','log','log','log','logit','logit'),
Boundary = c("(-Inf, Inf)","[0,1,2,...,Inf)","[0,1,2,...,Inf)","(0,Inf)","[0,1,2,...,n]", "(0,1)"),
Rfunction = c("rnorm","rpois","rnbinom","rlnorm","rbinom","rbeta")
) %>% filter(Distribution!='Log-normal')
regulartable( ds_dist %>% select(Distribution, Link) ) %>% ari_theme()
Distribution | Link |
---|---|
Gaussian | identity/raw |
Poisson | log |
Negative Binomial | log |
Binomial | logit |
Beta | logit |
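In practice you rarely type the link yourself; it comes with the family you hand to the fitting function. Below is a minimal sketch of what that looks like (the toy data frame d_toy and its columns are made up purely for illustration; in brms the equivalent families would be gaussian(), poisson(), negbinomial(), bernoulli() and Beta()):
# Toy data purely to illustrate how the family argument selects distribution + link
set.seed(1)
d_toy <- data.frame( x = rnorm(100) )
d_toy$y_count  <- rpois(100, lambda = exp(0.5 + 0.3 * d_toy$x))
d_toy$y_binary <- rbinom(100, size = 1, prob = plogis(-0.2 + 0.8 * d_toy$x))
# Same modelling framework, different distribution; the natural link is the default
m_pois <- glm( y_count ~ x,  data = d_toy, family = poisson(link = 'log') )
m_bin  <- glm( y_binary ~ x, data = d_toy, family = binomial(link = 'logit') )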
Researchers were interested in improving detection of the Alpine She-oak skink during plot surveys, allowing better identification of areas with skink populations. The idea was that placing concrete roof tiles in areas with known skink populations would allow an assessment of how well skinks can be detected. The tiles provide a thermal microhabitat and do not require disturbing natural shelters during surveys. As tiles heat up from solar radiation, skinks are more likely to use them for thermoregulation.
plot_grid(
ggdraw() + draw_image('images/rooftile.PNG'),
ggdraw() + draw_image('images/skink.PNG'),
nrow=1, scale = c(0.85,0.85),
labels=c('Roof tile', 'Alpine She-oak skink')
)
# ds_skink <- readRDS('data/df_ss_tile_c.RDS') %>% as_tibble() %>%
# mutate( area = rep( c('1','2'), each=n()/2 ) )
# saveRDS(ds_skink, 'data/df_ss_tile_c_area.RDS')
ds_skink <- readRDS('data/df_ss_tile_c_area.RDS') %>% as_tibble()
n_tiles <- nrow(ds_skink)
n_area <- n_distinct(ds_skink$area)
Here, we are using a simplified version of the original project’s design (with simulated data). A total of 500 tiles were placed across 2 spatial areas. Tiles were checked over several time periods (those data are not included here). When a tile was checked, the temperature underneath it (a continuous measure) was recorded with an infrared thermometer, along with whether it was a sunny day (binary: yes/no).
ggdraw() + draw_image('images/design.PNG')
Predictors
ds_pred <- tibble( predictors = c('temperature','sunny','area'),
prediction = c('increased presence with higher temps',
'if sunny, heats up tile, increasing presence',
'areas differ in presence of skink') )
regulartable(ds_pred) %>% ari_theme()
predictors | prediction |
---|---|
temperature | increased presence with higher temps |
sunny | if sunny, heats up tile, increasing presence |
area | areas differ in presence of skink |
Let’s bring in the data
# A tibble: 6 × 5
tile_c sunny detected area tile_id
<dbl> <chr> <int> <chr> <chr>
1 41.2 Yes 0 1 tile_001
2 13.9 Yes 0 1 tile_002
3 20.5 Yes 1 1 tile_003
4 15.8 No 0 1 tile_004
5 38.1 Yes 0 1 tile_005
6 45.9 Yes 0 1 tile_006
Recommendation: I always plot a visual of the design. This example is fairly simple, but it still works as an illustration.
What am I checking for in this visual?
sample sizes in each sampling unit (here, that is just the areas)
I will check out the “sampling space” (how many sampling events in each key predictor combination) and look for missing combinations or low sample sizes
plot_grid(
group_by(ds_skink, area) %>% summarise( n_unique_tiles= n_distinct(tile_id)) %>%
ggplot( aes( x=area, y= n_unique_tiles ) ) +
geom_col( fill='blue', col='black') +
geom_text( aes(label= paste('n=',n_unique_tiles) ), vjust=2, col='white' ) +
labs( title='Total sampling events per area' ),
group_by(ds_skink, area, sunny) %>% summarise( n_unique_tiles= n_distinct(tile_id)) %>%
ggplot( aes( x=area, y= sunny, fill= n_unique_tiles ) ) +
geom_tile( col='black', show.legend = T) +
geom_text( aes(label= paste('n=',n_unique_tiles) ), vjust=3, col='white' ) +
labs(fill='No. tiles', title='No. sampling events per area by sunny'),
nrow=1
)
Predictors
ggplot(ds_skink, aes( sunny, tile_c) ) + geom_violin( fill='blue') +
labs(y='Tile Temp', title='Relationship between sunny and tile temp' )
Response variable
Look at categorical plots…
ggplot(ds_skink, aes(sunny, detected)) +
geom_violin() +
geom_jitter( width=0.1, height=0, col='grey') +
geom_point( data=group_by(ds_skink, sunny, area) %>% summarise(p=sum(detected)/n() ),
aes(y=p,fill='proportion'), pch=21, size=3) +
scale_fill_manual(NULL, values='red') +
facet_wrap(~area) + labs(title='')
What does this teach us???
Tip: If possible, I like to graph out what the main model will look like. Binary data like this are tricky because the response is either 0 or 1 along a continuous predictor (e.g. temperature); binning can help here.
# function below grabs the mid-value for a range from cut() intervals
get_midpoints <- function(intervals) {
sapply(intervals, function(interval) {
bounds <- gsub("\\[|\\]|\\(|\\)", "", interval)
nums <- as.numeric(strsplit(bounds, ",")[[1]])
mean(nums)
})
}
# plot out/explore the data
plotBinnedData <- function(d, n_groups=10){
d<- mutate( d, temp_grp = cut_interval(tile_c, n=n_groups ),
temp_mid = get_midpoints(temp_grp) ) %>%
group_by(temp_mid, sunny) %>%
summarise( p = sum(detected)/n(), success=sum(detected), n=n() )
ggplot(ds_skink, aes(tile_c, detected, col=sunny) ) +
geom_point(data=d, aes(temp_mid, p, fill=sunny, size=n), pch=21,col='black' ) +
geom_text(data=d, aes(temp_mid, p, label=n), size=2.5, hjust=-0.5, col='black' ) +
scale_size_binned( range=c(0.5,3), breaks=c(1,5,10,20,30,100) )+  # point size reflects bin sample size
geom_smooth(span=2,se=F, method='gam', method.args = list(family = "binomial") ) +
labs( x='Tile temperature', title=paste('Binning with ',n_groups,' groups') ) +
coord_cartesian( ylim= c(0,0.51) )
}
plot_grid(
plotBinnedData(ds_skink,10),
plotBinnedData(ds_skink,15)
)
Each tile is a sampling event, and each sampling event has unique attributes (e.g. temperature), so I would keep N=1. Since we have a binary response (0/1) and N=1, I am thinking Bernoulli distribution here (note: you could use a Binomial with N=1 as well, but specifying Bernoulli directly is more efficient)
Observation model:
\[ y_i \sim \text{Bernoulli}( p_i ) \]
Or could do the following: \[ y_i \sim \text{Binomial}( p_i, 1 ) \]
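A quick way to convince yourself the two are the same thing (a minimal sketch; p = 0.3 is arbitrary):
# Bernoulli(p) is just Binomial(size = 1, p): same probabilities for 0 and 1
p <- 0.3
dbinom(0:1, size = 1, prob = p)                  # 0.7 0.3
table( rbinom(1000, size = 1, prob = p) ) / 1000 # ~70% zeros, ~30% ones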
Process model:
The probability (p) is linked to the predictors via a logit link:
Informal formula style: \[ \text{logit}(p) = sunny * temperature + area \]
Technical formula style: \[ \text{logit}(p_i) = \alpha + \beta_1 * \text{sunny}_i + \beta_2 * \text{temperature}_i + \beta_3 * \text{temperature}_i*\text{sunny}_i + \beta_4 * \text{area}_i \]
where \(i\) is observation \(i\); \(\text{sunny}_i\) and \(\text{area}_i\) are 0/1 indicator variables, so \(\beta_1\), \(\beta_3\) and \(\beta_4\) only apply when the corresponding indicator equals 1
Link function
log odds = logit(p) = \(log(\frac{p}{(1-p)})\)
In words, the odds (\(\frac{p}{1-p}\)) are the probability of success relative to the probability of it not happening.
Why use logit?
The log part allows us to have additive effects used in GLMs (e.g. ~ sunny + temperature )
log odds have some nice interpretability properties
Question: is increasing a probability from 0.01 to 0.02 the same as from 0.50 to 0.51? Or from 0.98 to 0.99?
Let’s think about it in respect to mortality/survival rates
Think about the odds \(\frac{p}{(1-p)}\) again
[1] " 0.01 to 0.02 increases odds by 2.02-fold"
[1] " 0.50 to 0.51 increases odds by 1.04-fold"
[1] " 0.98 to 0.99 increases odds by 2.02-fold"
Interpreting parameter estimates from the model
For a binomial (Bernoulli) GLM, this means the parameter estimates from the model are on the logit scale, and \(e^\beta\) can be interpreted as an odds ratio.
Let’s take a simple model glm( success ~ sunny )…
\(\text{logit}(p_i) = \alpha + \beta_1 * \text{sunny}_i\)
\(\text{logit}(p_{no,i}) = \alpha + \beta_1 * 0\) when sunny=no, \(\beta_1\) disappears
\(\text{logit}(p_{yes,i}) = \alpha + \beta_1 * 1\) when sunny=yes, \(\beta_1\) is added to the intercept
\(\beta_1 = \text{logit}(p_{yes}) - \text{logit}(p_{no}) = log(\frac{\text{odds }p_{yes}}{\text{odds } p_{no}})\)
What does this all mean
When interpreting \(\beta\) on the logit…
When interpreting \(e^\beta\)…
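The fitting code is not shown on the slides; a minimal brms call consistent with the two model outputs below would look something like this (the saveRDS paths are placeholders):
library(brms)
# Bernoulli GLM with logit link: linear vs quadratic temperature effect
mod_skink  <- brm( detected ~ sunny * tile_c + area,
                   data = ds_skink, family = bernoulli() )
mod_skink2 <- brm( detected ~ sunny * poly(tile_c, 2) + area,
                   data = ds_skink, family = bernoulli() )
# saveRDS(mod_skink,  'data/mod_skink.rds')   # placeholder path
# saveRDS(mod_skink2, 'data/mod_skink2.rds')  # placeholder path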
Check the Rhats (<1.01) in model output
mod_skink
Family: bernoulli
Links: mu = logit
Formula: detected ~ sunny * tile_c + area
Data: ds_skink (Number of observations: 500)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept -1.05 0.66 -2.34 0.27 1.00 1356
sunnyYes -0.10 0.78 -1.61 1.45 1.00 1279
tile_c -0.03 0.04 -0.10 0.04 1.00 1285
area2 0.22 0.25 -0.28 0.72 1.00 2341
sunnyYes:tile_c 0.00 0.04 -0.07 0.08 1.00 1218
Tail_ESS
Intercept 1982
sunnyYes 1672
tile_c 1766
area2 1847
sunnyYes:tile_c 1727
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
mod_skink2
Family: bernoulli
Links: mu = logit
Formula: detected ~ sunny * poly(tile_c, 2) + area
Data: ds_skink (Number of observations: 500)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat
Intercept -2.04 0.64 -3.51 -0.92 1.00
sunnyYes -0.07 0.66 -1.23 1.37 1.00
polytile_c21 -16.93 22.10 -66.07 21.34 1.00
polytile_c22 -5.88 13.32 -33.94 17.77 1.00
area2 0.18 0.25 -0.30 0.67 1.00
sunnyYes:polytile_c21 6.36 22.62 -33.33 57.18 1.00
sunnyYes:polytile_c22 -11.64 14.77 -38.12 19.35 1.00
Bulk_ESS Tail_ESS
Intercept 1171 1270
sunnyYes 1143 1178
polytile_c21 1057 1187
polytile_c22 1143 1333
area2 2588 2568
sunnyYes:polytile_c21 1071 1210
sunnyYes:polytile_c22 1153 1315
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Check the trace plots
plot(mod_skink)
Let's do the DHARMa quantile approach and pp_check from Day 2
Model1
check_brms(mod_skink)
Model2
check_brms(mod_skink2)
ds_sum <- mutate( ds_skink, temp_grp = cut_interval(tile_c, n=10 ),
temp_mid = get_midpoints(temp_grp) ) %>%
group_by(temp_mid, sunny) %>%
summarise( p = sum(detected)/n(), success=sum(detected), n=n() )
ds_skink$fit <- fitted(mod_skink, newdata=mutate(ds_skink, area='1') )[,1]
ds_skink$fit_lo <- fitted(mod_skink, newdata=mutate(ds_skink, area='1') )[,3]
ds_skink$fit_hi <- fitted(mod_skink, newdata=mutate(ds_skink, area='1') )[,4]
ds_skink$fit2 <- fitted(mod_skink2, newdata=mutate(ds_skink, area='1') )[,1]
ds_skink$fit2_lo <- fitted(mod_skink2, newdata=mutate(ds_skink, area='1') )[,3]
ds_skink$fit2_hi <- fitted(mod_skink2, newdata=mutate(ds_skink, area='1') )[,4]
plot_grid(
ggplot(ds_skink, aes(tile_c, detected, col=sunny) ) +
geom_point(data=ds_sum, aes(temp_mid, p, fill=sunny, size=n), pch=21,col='black' ) +
geom_text(data=ds_sum, aes(temp_mid, p, label=n), size=2.5, hjust=-0.5, col='black' ) +
scale_size_binned( range=c(0.5,3), breaks=c(1,5,10,20,30,100) )+
geom_line( aes(y=fit ) ) +
geom_ribbon( aes(ymin=fit_lo, ymax=fit_hi, fill=sunny), alpha=0.3, col=NA ) +
labs( x='Tile temperature', title='Linear fit'),
ggplot(ds_skink, aes(tile_c, detected, col=sunny) ) +
geom_point(data=ds_sum, aes(temp_mid, p, fill=sunny, size=n), pch=21,col='black' ) +
geom_text(data=ds_sum, aes(temp_mid, p, label=n), size=2.5, hjust=-0.5, col='black' ) +
scale_size_binned( range=c(0.5,3), breaks=c(1,5,10,20,30,100) )+
geom_line( aes(y=fit2 ) ) +
geom_ribbon( aes(ymin=fit2_lo, ymax=fit2_hi, fill=sunny), alpha=0.3, col=NA ) +
labs( x='Tile temperature', title='Quadratic fit')
)
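As a cross-check on the hand-built plots above, brms’s conditional_effects() gives a quick default version of the fitted temperature-by-sunny relationship (a minimal sketch):
# Default fitted-effect plot for the temperature x sunny interaction (response scale)
ce <- conditional_effects(mod_skink2, effects = 'tile_c:sunny')
plot(ce, points = FALSE)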
mod_skink2
Family: bernoulli
Links: mu = logit
Formula: detected ~ sunny * poly(tile_c, 2) + area
Data: ds_skink (Number of observations: 500)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat
Intercept -2.04 0.64 -3.51 -0.92 1.00
sunnyYes -0.07 0.66 -1.23 1.37 1.00
polytile_c21 -16.93 22.10 -66.07 21.34 1.00
polytile_c22 -5.88 13.32 -33.94 17.77 1.00
area2 0.18 0.25 -0.30 0.67 1.00
sunnyYes:polytile_c21 6.36 22.62 -33.33 57.18 1.00
sunnyYes:polytile_c22 -11.64 14.77 -38.12 19.35 1.00
Bulk_ESS Tail_ESS
Intercept 1171 1270
sunnyYes 1143 1178
polytile_c21 1057 1187
polytile_c22 1143 1333
area2 2588 2568
sunnyYes:polytile_c21 1071 1210
sunnyYes:polytile_c22 1153 1315
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Below, we pull out the actual estimates with the hypothesis() function… we can now see that, when sunny, there is evidence for the temperature effect (including the quadratic term)
bind_rows(
cloudy_linear=hypothesis( mod_skink2, 'polytile_c21 = 0' )$hypo,
sunny_linear=hypothesis( mod_skink2, 'sunnyYes:polytile_c21 + polytile_c21 = 0' )$hypo,
cloudy_quad=hypothesis( mod_skink2, 'polytile_c22 = 0' )$hypo,
sunny_quad=hypothesis(mod_skink2,'polytile_c22+sunnyYes:polytile_c22= 0')$hypo,
.id='predictor'
)
predictor Hypothesis Estimate
1 cloudy_linear (polytile_c21) = 0 -16.934824
2 sunny_linear (sunnyYes:polytile_c21+polytile_c21) = 0 -10.576889
3 cloudy_quad (polytile_c22) = 0 -5.884924
4 sunny_quad (polytile_c22+sunnyYes:polytile_c22) = 0 -17.528032
Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 22.10314 -66.07500 21.33653 NA NA
2 4.85391 -20.67852 -1.56093 NA NA *
3 13.32197 -33.93701 17.76571 NA NA
4 6.57965 -31.26140 -5.69419 NA NA *
loo_compare( loo(mod_skink,moment_match = TRUE), loo(mod_skink2,moment_match = TRUE) )
elpd_diff se_diff
mod_skink2 0.0 0.0
mod_skink -1.3 3.2
rowname df AIC BIC
1 m1 5 435.8317 456.9047
2 m2 7 431.3958 460.8981
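The m1/m2 table above is not brms output; it looks like an AIC/BIC comparison from frequentist glm fits of the same two formulas. A minimal sketch of how such a table could be produced (m1 and m2 are assumed names):
# Frequentist versions of the two candidate models, compared by AIC/BIC
m1 <- glm( detected ~ sunny * tile_c + area,          data = ds_skink, family = binomial )
m2 <- glm( detected ~ sunny * poly(tile_c, 2) + area, data = ds_skink, family = binomial )
cbind( AIC(m1, m2), BIC = BIC(m1, m2)$BIC )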
Researchers wanted to know if golden perch catch rates increased with flow discharge, in order to assess the potential role of environmental flows in management of the species.
library(cowplot)
plot_grid(
ggdraw() + draw_image('images/fishway.PNG'),
ggdraw() + draw_image('images/goldenperch.jfif'),
nrow=1, scale = c(0.85,0.85),
labels=c('Fishway', 'Golden perch')
)
ds_gp <- readRDS('data/df_gp.RDS') %>% as_tibble()
n_month <- n_distinct(ds_gp$month)
n_fishway <- n_distinct(ds_gp$fishway)
Again, we are using a simplified design of the original project (and simulated data). In this simplified version, researchers sampled twice a month for 7 months across 10 fishways.
Predictors
ds_pred <- tibble( predictors = c('discharge','month','fishway'),
prediction = c(
'flow both triggers and facilitates upstream movements in GP',
'months may differ in movement rate -random effect',
'fishway may differ in catch rates - random effect') )
regulartable(ds_pred) %>% ari_theme()
predictors | prediction |
---|---|
discharge | flow both triggers and facilitates upstream movements in GP |
month | months may differ in movement rate -random effect |
fishway | fishway may differ in catch rates - random effect |
Let’s bring in the data
# A tibble: 6 × 8
month fishway season discharge catch month_adj month_abb
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 9 A Spring 289. 1 1 Sep
2 9 B Spring 347. 2 1 Sep
3 9 C Spring 150. 0 1 Sep
4 9 D Spring 238. 1 1 Sep
5 9 E Spring 284. 1 1 Sep
6 9 F Spring 182. 1 1 Sep
# ℹ 1 more variable: discharge_gl <dbl>
Let’s get an idea of how catch rates vary across the fishways…
plot_grid(
ggplot(ds_gp, aes(catch)) + geom_histogram(fill='grey', color='black') + labs(title='Response'),
ggplot(ds_gp, aes(fishway, catch) ) +
geom_violin() + geom_jitter(width=0.1, height=0.1) +
geom_point( data=group_by(ds_gp, fishway) %>% summarise(catch=mean(catch)),
pch=21, fill='red',col='black',size=3) + labs(title='Break up by fishway'),
nrow=1
)
I want to know about the relationship between month and discharge…
group_by(ds_gp, fishway, month_adj) %>% summarise( discharge=mean(discharge) ) %>%
ggplot(aes(month_adj, discharge, col=fishway) ) +
geom_point( ) + geom_line() +
scale_x_continuous( breaks=ds_mon$month_adj, labels=ds_mon$month_abb )
Let’s look at the predictor relationships…
ggplot( ds_gp, aes( discharge, catch ) ) + geom_point( aes(col=season) )+
geom_smooth( method='lm') +
facet_wrap(~fishway)
ggplot( ds_gp, aes( discharge, catch ) ) + geom_point( aes(col=season) )+
geom_smooth( method='lm') +
facet_wrap(~month)
Observation model:
\[ y_i \sim \text{Poisson}( \lambda_i ) \] Process model:
The process model will have the log link plus predictors
Informal formula style:
\[log(\lambda_i) = discharge + (1|month) + (1|fishway)\]
Technical formula style: \[ \begin{align} log(\lambda_i) &= \alpha + \beta_1 Discharge_i + \epsilon_{month_i} + \epsilon_{fishway_i} \\ \epsilon_{month_i} & \sim \text{N}(0, \sigma_{month}^2) \\ \epsilon_{fishway_i} & \sim \text{N}(0, \sigma_{fishway}^2) \end{align} \]
where discharge is a fixed effect and month and fishway are random effects
With rates (e.g. number of fish per hour, number of species per transect), effort may vary across surveys. For example, electrofishing time or transect length can differ between sampling events. As a result, the total counts need to be adjusted for that varying effort.
Basically, the following is happening (short equation form):
\[\lambda = \frac{n}{\text{effort}} = e^{\alpha}e^{\beta_1 Discharge}\] apply the link function (log)…
\[log(\lambda) = \log\left(\frac{n}{\text{effort}}\right) = \alpha + \beta_1 Discharge\] and since \[\log\left(\frac{n}{\text{effort}}\right) = \log(n) - \log(\text{effort})\]
rearranging moves effort to the right-hand side as an offset: \[\log(n) = \alpha + \beta_1 Discharge + \log(\text{effort})\]
R code: brm( catch ~ offset( log(effort) ) + … )
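The fitting code for the golden perch model is not shown; a minimal brms call consistent with the output below would be the following (no offset is included here; the commented lines sketch how one would be added, with effort_hours being a hypothetical column):
library(brms)
# Poisson GLMM: discharge as fixed effect, month and fishway as random intercepts
mod_gp <- brm( catch ~ discharge_gl + (1 | month) + (1 | fishway),
               data = ds_gp, family = poisson() )
# With varying effort, add an offset, e.g.:
# mod_gp_off <- brm( catch ~ discharge_gl + offset(log(effort_hours)) +
#                      (1 | month) + (1 | fishway),
#                    data = ds_gp, family = poisson() )  # effort_hours is hypothetical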
Check the Rhats (<1.01) in model output
mod_gp
Family: poisson
Links: mu = log
Formula: catch ~ discharge_gl + (1 | month) + (1 | fishway)
Data: ds_gp (Number of observations: 120)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Multilevel Hyperparameters:
~fishway (Number of levels: 10)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Intercept) 0.20 0.15 0.01 0.56 1.00 1464
Tail_ESS
sd(Intercept) 1937
~month (Number of levels: 7)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Intercept) 0.55 0.30 0.16 1.32 1.00 1150
Tail_ESS
sd(Intercept) 1714
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept -0.47 0.34 -1.16 0.19 1.00 1812
discharge_gl 0.85 0.74 -0.63 2.26 1.00 3661
Tail_ESS
Intercept 2125
discharge_gl 2689
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Check the trace plots
plot(mod_gp)
Let's do the DHARMa quantile approach and pp_check from Day 2
check_brms(mod_gp)
ds_gp$fit <- fitted(mod_gp, re_formula = NA)[,1]
ds_gp$resid <- residuals(mod_gp,type = 'pearson')[,1]
ggplot(ds_gp, aes(fit, resid) ) + geom_point() + geom_smooth()
ds_gp$fit <- fitted(mod_gp, re_formula = NA)[,1]
ds_gp$fit_lo <- fitted(mod_gp, re_formula = NA)[,3]
ds_gp$fit_hi <- fitted(mod_gp, re_formula = NA)[,4]
ggplot(ds_gp, aes(discharge) ) +
geom_point( aes( y=catch), pch=21, fill='grey', col='black' ) +
geom_line( aes(y=fit) ) +
geom_ribbon( aes(ymin=fit_lo, ymax=fit_hi), alpha=0.3, col=NA ) +
labs( x='Discharge (ML/day)', y='Catch (n/event)')
mod_gp
Family: poisson
Links: mu = log
Formula: catch ~ discharge_gl + (1 | month) + (1 | fishway)
Data: ds_gp (Number of observations: 120)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Multilevel Hyperparameters:
~fishway (Number of levels: 10)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Intercept) 0.20 0.15 0.01 0.56 1.00 1464
Tail_ESS
sd(Intercept) 1937
~month (Number of levels: 7)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Intercept) 0.55 0.30 0.16 1.32 1.00 1150
Tail_ESS
sd(Intercept) 1714
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept -0.47 0.34 -1.16 0.19 1.00 1812
discharge_gl 0.85 0.74 -0.63 2.26 1.00 3661
Tail_ESS
Intercept 2125
discharge_gl 2689
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
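Because of the log link, exponentiating the coefficients puts them on a multiplicative (rate-ratio) scale. A minimal sketch using brms’s fixef() (column names assumed to be the fixef() defaults):
# Rate ratios: multiplicative change in expected catch per 1-unit change in the predictor
round( exp( fixef(mod_gp)[ , c('Estimate', 'Q2.5', 'Q97.5') ] ), 2 )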
Researchers want to know if adding instream woody habitat (IWH) increases Murray cod abundance. IWH is likely critical habitat for Murray cod, which use it as microhabitat both for protection and for ambushing prey.
library(cowplot)
plot_grid(
ggdraw() + draw_image('images/iwh.jfif'),
ggdraw() + draw_image('images/murraycod.jpg'),
nrow=1, scale = c(0.85,0.85),
labels=c('Instream Woody Habitat (IWH)', 'Murray cod')
)
ds_mc <- readRDS('data/df_resnag.RDS') %>% as_tibble()
n_reach <- n_distinct(str_split_fixed(ds_mc$reach_id,'_',2)[,2])
n_year <- n_distinct(ds_mc$year)
Again, we are using simulated data based on an actual experiment. The researchers were commissioned to enhance riverine habitat in the Ovens River. Two methods of enhancement were undertaken, basic and IWH, each at 5 reaches. As a “control”, 5 reaches in a comparable river without any habitat improvement were chosen. Over 7 years, annual surveys were undertaken to count the Murray cod caught.
Predictors
ds_pred <- tibble( predictors = c('year','works','reach'),
prediction = c('change in catch rates over time',
'treatment of interest',
'reach effects - random effect') )
regulartable(ds_pred) %>% ari_theme()
predictors | prediction |
---|---|
year | change in catch rates over time |
works | treatment of interest |
reach | reach effects - random effect |
Let’s bring in the data
# A tibble: 6 × 4
works reach_id year catch
<fct> <chr> <int> <dbl>
1 None None_A 0 8
2 None None_A 1 0
3 None None_A 2 13
4 None None_A 3 1
5 None None_A 4 1
6 None None_A 5 8
Let's check the design
ds_sum <- distinct(ds_mc, works, reach_id,year) %>%
mutate(reach_id=str_split_fixed(reach_id,'_',2)[,2])
group_by(ds_sum, works, reach_id) %>% count() %>%
ggplot( aes(works, reach_id, fill=n)) +
geom_tile( col='black' ) +
geom_text( aes(label=n), col='white') +
labs( title='Design: number of sampling events' )
Let’s get an idea of how catch rates vary across reaches…
plot_grid(
ggplot(ds_mc, aes(catch) ) + geom_histogram(aes(fill=works), binwidth = 3, color='black') + labs(title='Response') + theme(legend.position=c(0.8,0.8)),
ggplot(ds_mc, aes(reach_id, catch, fill=works) ) +
geom_violin() + geom_jitter(width=0.1, height=0.1) + coord_flip()+ theme(legend.position='none'),
nrow=1
)
Let’s visualise the likely model…
plot_grid(
ggplot(ds_mc, aes(year, catch, col=works) ) +
geom_point() + geom_smooth(se=F, span=1) +
labs(title='Raw scale'),
ggplot(ds_mc, aes(year, catch+1, col=works) ) +
geom_point() + geom_smooth(se=F, span=1) + scale_y_log10() +
labs(title='Log10 version')
)
Observation model:
\[ y_i \sim \text{NB}( \mu_i, \theta ) \] Process model:
The process model will have the log link and the predictors
Informal formula style:
\[ log(\mu_i) = year * works + (1|reach) \]
Technical formula style: \[ \begin{align} log(\mu_i) &= \alpha + \beta_1 year_i + \beta_2 works_i + \beta_3 (year_i*works_i) + \epsilon_{reach_i} \\ \epsilon_{reach_i} & \sim \text{N}(0, \sigma_{reach}^2) \end{align} \]
where year and works are fixed effects and reach is a random effect
library(brms)
mod_mc <- brm( catch ~ year*works + (1|reach_id) , data=ds_mc, family=negbinomial() )
saveRDS(mod_mc,'data/mod_mc.rds')
Check the Rhats (<1.01) in model output
mod_mc
Family: negbinomial
Links: mu = log; shape = identity
Formula: catch ~ year * works + (1 | reach_id)
Data: ds_mc (Number of observations: 105)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Multilevel Hyperparameters:
~reach_id (Number of levels: 15)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Intercept) 0.19 0.15 0.01 0.54 1.00 1619
Tail_ESS
sd(Intercept) 1638
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept 1.36 0.36 0.68 2.09 1.00 1764
year 0.07 0.09 -0.12 0.25 1.00 1789
worksBasic 0.18 0.51 -0.83 1.19 1.00 1790
worksIWH 0.45 0.52 -0.59 1.46 1.00 1937
year:worksBasic -0.02 0.13 -0.28 0.24 1.00 1748
year:worksIWH 0.08 0.14 -0.20 0.35 1.00 1895
Tail_ESS
Intercept 2355
year 2287
worksBasic 2135
worksIWH 2640
year:worksBasic 2347
year:worksIWH 2269
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape 1.01 0.17 0.73 1.38 1.00 4203 3485
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Tip: Check the shape parameter estimate (shape = \(\theta\); small values indicate clumping/overdispersion). If \(\theta\) > 10, it is worth re-running the model as a Poisson
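Why does a large \(\theta\) behave like a Poisson? In the NB2 parameterisation (which, as I understand it, brms’s negbinomial() uses), the variance is \(\mu + \mu^2/\theta\), so the extra-Poisson term shrinks as \(\theta\) grows. A quick numerical illustration (values are arbitrary):
# Negative binomial variance relative to the Poisson (variance = mean) for a mean of 5
mu    <- 5
theta <- c(0.5, 1, 10, 100)
data.frame( theta = theta,
            nb_variance      = mu + mu^2 / theta,  # 55, 30, 7.5, 5.25
            poisson_variance = mu )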
Check the trace plots
plot(mod_mc)
Let's do the DHARMa quantile approach and pp_check from Day 2
DHARMa
check_brms(mod_mc)
PP check
Residuals vs fitted
ds_mc$fit <- fitted(mod_mc, re_formula = NA)[,1]
ds_mc$resid <- residuals(mod_mc,type = 'pearson')[,1]
plot_grid(
ggplot(ds_mc, aes(fit, resid) ) + geom_point() + geom_smooth(),
ggplot(ds_mc, aes(year, resid) ) + geom_point() + geom_smooth() +
facet_wrap(~works),
ncol=1
)
ds_mc$fit <- fitted(mod_mc, re_formula = NA)[,1]
ds_mc$fit_lo <- fitted(mod_mc, re_formula = NA)[,3]
ds_mc$fit_hi <- fitted(mod_mc, re_formula = NA)[,4]
ggplot(ds_mc, aes(year, col=works, fill=works) ) +
geom_point( aes( y=catch), pch=21, col='black' ) +
geom_line( aes(y=fit) ) +
geom_ribbon( aes(ymin=fit_lo, ymax=fit_hi), alpha=0.3, col=NA ) +
labs( x='Year', y='Catch (n/event)') +
facet_wrap(~works, nrow=1) + theme(legend.position = 'none')
(Assuming we have enough time…)
Like yesterday, get in small groups and attempt one of the following small analysis tasks. Today’s tasks will use the same examples from above.
For the Murray cod model, re-run the model using a Poisson distribution. Check the diagnostics to see issues that emerge.
For the Murray cod model, re-run the model summing all catch to each year for each works treatment. Compare the estimates from this model to the lecture’s one.