Day 3: Discrete Distribution Models

Tackling discrete models: Binomial, Poisson, Negative Binomial

Ben Fanson https://bfanson.github.io/2025_Models_workshop/ (Arthur Rylah Institute)https://www.ari.vic.gov.au/
2025-09-10

A REMINDER TO START RECORDING

Day’s objective

Project to download

Link

Key packages

Review of analysis workflow

  1. understand your experimental design

    • does the dataset match what you expect?
    • what is the response variable?
    • what are the main predictors(/processes)?
  2. graphically explore your data (!!!!🙏 !!!!)

    • experimental design
    • between predictors [inferential space]
    • between response and predictors [relationships]
  3. modeling time

    • model formulation (e.g. y ~ sex + habitat )
    • model checking (e.g. convergence, residuals, simulated data)
    • visualise fitted relationships
  4. repeat 2-3 until defensible model reached

  5. summarizing model outputs for presentation


Generalized Linear Models (GLM) framework

Key points

  1. Most of the distributions (sans LogNormal) we have talked about come from the same family, the Exponential family

    • this means they can all be rewritten as this ugly (but beautiful) formula:

      \[ f(x \mid \theta) = h(x)\, \exp\{\eta(\theta)^\top T(x) - A(\theta)\} \]

  2. This unifying reality underlies the development of Generalized Linear Models, which allow the same modelling framework throughout; you just need to specify the distribution (e.g. family=‘poisson’, family=‘gaussian’)

  3. As part of converting a distribution into exponential-family notation, there is something called the “natural” link function. The table below lists the natural link function for each distribution

note - the “natural” (also called canonical) link is the one with the best statistical properties, but you can use other link functions
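
In R you can see these natural links directly on the family objects used by glm(); brms uses the same defaults. A small sketch:

Show code
  # each family object carries its natural (canonical) link plus the
  # transform and inverse-transform functions
  binomial()$link            # "logit"
  poisson()$link             # "log"
  gaussian()$link            # "identity"
  poisson()$linkfun(4)       # log(4) = 1.386
  poisson()$linkinv(1.386)   # back to ~4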

Table showing link functions
Show code
# Create the data frame
  library(dplyr)      # %>%, filter(), select()
  library(tibble)
  library(flextable)
  ds_dist <- tibble(
    Distribution = c("Gaussian", "Poisson", "Negative Binomial", "Log-normal", "Binomial", "Beta"),
    Type=c('Continuous','Discrete','Discrete','Continuous','Discrete','Continuous'),
    `No. parameters` = c(2, 1, 2, 2, 2, 2),
    Parameters = c(
      "Mean (μ)\n Standard deviation (σ)",
      "Rate (λ)",
      "Mean (μ) or size (r)\n Dispersion (θ) or probability (p)",
      "Mean (μ)\n Standard deviation (σ)",
      "Number of trials (n)\n Probability of success (p)",
      "Shape parameters α (shape1)\n β (shape2)"
    ),
    Link=c('identity/raw','log','log','log','logit','logit'),
    Boundary = c("(-Inf, Inf)","[0,1,2,...,Inf)","[0,1,2,...,Inf)","(0,Inf)","[0,1,2,...,n]","(0,1)"),
    Rfunction = c("rnorm","rpois","rnbinom","rlnorm","rbinom","rbeta")   # rpois is the Poisson RNG
  ) %>% filter(Distribution!='Log-normal')

 regulartable( ds_dist %>% select(Distribution, Link)  ) %>%  ari_theme()

Example 1: Optimising detection rates of Alpine She-oak skink under tiles

Overview

Context

Researchers were interested in improving detection of skinks during plot surveys, allowing better identification of areas with She-oak skink populations. The idea was that placing concrete tiles in areas with known skinks would allow detection rates to be assessed: the tiles provide a thermal microhabitat, and checking them does not require disturbing natural shelters. As tiles heat up under solar radiation, skinks are more likely to use them for thermoregulation.

Show code
library(cowplot)   # plot_grid(), ggdraw(), draw_image()
plot_grid(
  ggdraw() + draw_image('images/rooftile.PNG'),
  ggdraw() + draw_image('images/skink.PNG'),
  nrow=1, scale = c(0.85,0.85),
  labels=c('Roof tile', 'Alpine She-oak skink')
)

Experimental design

Show code
 #  ds_skink <- readRDS('data/df_ss_tile_c.RDS') %>% as_tibble() %>% 
 #                mutate( area = rep( c('1','2'), each=n()/2 ) )
 # saveRDS(ds_skink, 'data/df_ss_tile_c_area.RDS')
 
  ds_skink <- readRDS('data/df_ss_tile_c_area.RDS') %>% as_tibble()
  n_tiles <- nrow(ds_skink)
  n_area <- n_distinct(ds_skink$area)

Here, we are using a simplified design of the original project (with simulated data). A total of 500 tiles were placed across 2 spatial areas. Tiles were checked over several time periods (that data is not included here). When a tile was checked, the temperature underneath it (continuous measure) was recorded with an infrared thermometer, along with whether it was a sunny day (binary: yes/no).

Show code
  ggdraw() + draw_image('images/design.PNG')

Predictors

Show code
  ds_pred <- tibble( predictors = c('temperature','sunny','area'),
                     prediction = c('increased presence with higher temps',
                         'if sunny, heats up tile, increasing presence',
                         'areas differ in presence of skink') )
  regulartable(ds_pred) %>% ari_theme()

Explore the data

Let’s bring in the data

Show code
# grab the data from the data folder
  ds_skink <- readRDS('data/df_ss_tile_c_area.RDS') %>% as_tibble() %>% 
                mutate( tile_id = paste0('tile_',sprintf('%03d',1:n() ) ) )  # unique IDs can be useful (more for repeated designs)
  head(ds_skink)
# A tibble: 6 × 5
  tile_c sunny detected area  tile_id 
   <dbl> <chr>    <int> <chr> <chr>   
1   41.2 Yes          0 1     tile_001
2   13.9 Yes          0 1     tile_002
3   20.5 Yes          1 1     tile_003
4   15.8 No           0 1     tile_004
5   38.1 Yes          0 1     tile_005
6   45.9 Yes          0 1     tile_006

Design

Recommendation: I always plot a visual of the design. This example is a bit simple, but it still works to show the idea.

What am I checking for in this visual?

Show code
plot_grid(
  group_by(ds_skink, area) %>% summarise( n_unique_tiles= n_distinct(tile_id)) %>% 
    ggplot(  aes( x=area, y= n_unique_tiles  ) ) + 
      geom_col( fill='blue', col='black') +
      geom_text( aes(label= paste('n=',n_unique_tiles) ), vjust=2, col='white'  ) +
        labs( title='Total sampling events per area' ),
  group_by(ds_skink, area, sunny) %>% summarise( n_unique_tiles= n_distinct(tile_id)) %>% 
    ggplot(  aes( x=area, y= sunny, fill= n_unique_tiles  ) ) + 
      geom_tile( col='black', show.legend = T) +
      geom_text( aes(label= paste('n=',n_unique_tiles) ), vjust=3, col='white'  ) +
        labs(fill='No. tiles', title='No. sampling events per area by sunny'),
  nrow=1
)

Predictors/Response

Predictors

Show code
  ggplot(ds_skink, aes( sunny, tile_c) ) + geom_violin( fill='blue') +
    labs(y='Tile Temp', title='Relationship between sunny and tile temp' )

Response variable

Look at categorical plots…

Show code
 ggplot(ds_skink, aes(sunny, detected)) + 
        geom_violin() +
        geom_jitter( width=0.1, height=0, col='grey') + 
        geom_point( data=group_by(ds_skink, sunny, area) %>% summarise(p=sum(detected)/n() ),
                    aes(y=p,fill='proportion'), pch=21, size=3)  +
        scale_fill_manual(NULL, values='red') +
        facet_wrap(~area) + labs(title='')

What does this teach us???

  1. jittering of points is not really useful here
  2. violin plot still useful [but misleading…]
  3. overlaying raw proportion dots useful

Tip: If possible, I like to graph out what the main model will look like. Now, binary data like this is tricky as it is either 0 or 1 along a continuous predictor (e.g. temperature). Binning can help here.

Show code
# function below grabs the mid-value of each interval returned by cut()-type functions
  get_midpoints <- function(intervals) {
    sapply(intervals, function(interval) {
      bounds <- gsub("\\[|\\]|\\(|\\)", "", interval)
      nums <- as.numeric(strsplit(bounds, ",")[[1]])
      mean(nums)
    })
  }

# plot out/explore the data
  plotBinnedData <- function(d, n_groups=10){
    d<- mutate( d, temp_grp = cut_interval(tile_c, n=n_groups ), 
                   temp_mid = get_midpoints(temp_grp) ) %>% 
              group_by(temp_mid, sunny) %>% 
              summarise( p = sum(detected)/n(), success=sum(detected), n=n() )    
    ggplot(ds_skink, aes(tile_c, detected, col=sunny) ) + 
      geom_point(data=d, aes(temp_mid, p, fill=sunny, size=n), pch=21,col='black' ) +
      geom_text(data=d, aes(temp_mid, p, label=n), size=2.5, hjust=-0.5, col='black' ) +
        scale_size_binned( range=c(0.5,3), breaks=c(1,5,10,20,30,100) )+   # one size scale only
    geom_smooth(span=2,se=F, method='gam', method.args = list(family = "binomial") ) +
        labs( x='Tile temperature', title=paste('Binning with ',n_groups,' groups') ) +
        coord_cartesian( ylim= c(0,0.51) )
  }
  plot_grid(
    plotBinnedData(ds_skink,10), 
    plotBinnedData(ds_skink,15) 
  )  

Picking the distribution

Each tile is a sampling event, and each sampling event has unique attributes (e.g. temperature). So I would keep N=1. Since we have a binary response (0/1) and N=1, I am thinking Bernoulli distribution here (note: you could use the Binomial as well, since Bernoulli is just Binomial with N=1, but specifying Bernoulli is more efficient)

Observation model:

\[ y_i \sim \text{Bernoulli}( p_i ) \]

Or, equivalently: \[ y_i \sim \text{Binomial}( 1, p_i ) \]
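
A quick base-R check that the two are the same thing (Bernoulli is just Binomial with a single trial):

Show code
  # P(y = 1) under Binomial(n = 1, p) is p; P(y = 0) is 1 - p
  p <- 0.3
  dbinom(1, size = 1, prob = p)   # 0.3
  dbinom(0, size = 1, prob = p)   # 0.7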

Process model:

The probability (p) is linked to the predictors via a logit link:

Informal formula style: \[ \text{logit}(p) = sunny * temperature + area \]

Technical formula style: \[ \text{logit}(p_i) = \alpha + \beta_1 * \text{sunny}_i + \beta_2 * \text{temperature}_i + \beta_3 * \text{temperature}_i*\text{sunny}_i + \beta_4 * \text{area}_i \]

where i indexes observations; \(\text{sunny}_i\) and \(\text{area}_i\) are 0/1 indicator variables
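
A quick way to see the indicator coding R builds from this formula is to inspect the design matrix (a small sketch, using the ds_skink data loaded above):

Show code
  # first rows of the design matrix: sunnyYes and area2 are 0/1 indicators,
  # sunnyYes:tile_c is the interaction column
  head( model.matrix( ~ sunny * tile_c + area, data = ds_skink ), 3 )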

Link function

log odds = logit(p) = \(log(\frac{p}{(1-p)})\)

In words, the odds (\(\frac{p}{(1-p)}\)) are the probability of success relative to the probability of it not happening.

Why use logit?

  1. p is bounded on [0, 1], but the linear process formula of a GLM requires an unbounded scale (-Inf, Inf)
Show code
  ds <- data.frame( p = c(0.000001, 0.001,seq( 0.01, 0.99, by=0.01),0.999,0.9999999 ) ) %>% 
          mutate(logit_p = log(p/(1-p) ) ) 
  ggplot(ds, aes(p, logit_p)) + geom_line() + labs(title='P vs logit(p)' )

  2. The log part allows us to have additive effects, as used in GLMs (e.g. ~ sunny + temperature ) - see the sketch after this list

  3. log odds have some nice interpretability properties
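
For point 2, base R’s qlogis()/plogis() make it easy to check that adding on the logit scale is the same as multiplying on the odds scale:

Show code
  # add 1 on the logit scale ...
  p_new <- plogis( qlogis(0.2) + 1 )
  # ... which is the same as multiplying the odds by exp(1)
  odds_new <- (0.2 / 0.8) * exp(1)
  c( p_new, odds_new / (1 + odds_new) )   # identical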

Question: is increasing probability from 0.01 to 0.02 same as 0.50 to 0.51? 0.98 to 0.99?

Let’s think about it with respect to mortality/survival rates

Think about the odds \(\frac{p}{(1-p)}\) again

Show code
  logit <- function(p) log(p/(1-p))

  paste0(' 0.01 to 0.02 increases odds by ', round( exp( logit(0.02) - logit(0.01) ),2 ),'-fold' )
[1] " 0.01 to 0.02 increases odds by 2.02-fold"
Show code
  paste0(' 0.50 to 0.51 increases odds by ', round( exp( logit(0.51) - logit(0.50) ),2 ),'-fold' )
[1] " 0.50 to 0.51 increases odds by 1.04-fold"
Show code
  paste0(' 0.98 to 0.99 increases odds by ', round( exp( logit(0.99) - logit(0.98) ),2 ),'-fold' )
[1] " 0.98 to 0.99 increases odds by 2.02-fold"

Interpreting parameters estimates from model

For a binomial GLM, this means that the parameter estimates from the model will be on the logit scale.

Let’s take a simple model glm( detected ~ sunny )…

\(\text{logit}(p_i) = \alpha + \beta_1 * \text{sunny}_i\)

\(\text{logit}(p_{no,i}) = \alpha + \beta_1 * 0\) when sunny=no, \(\beta_1\) disappears

\(\text{logit}(p_{yes,i}) = \alpha + \beta_1 * 1\) when sunny=yes, \(\beta_1\) remains

\(\beta_1 = \text{logit}(p_{yes}) - \text{logit}(p_{no}) = log(\frac{\text{odds }p_{yes}}{\text{odds } p_{no}})\)
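
A quick numeric check of this identity on the skink data (using a frequentist glm() just for speed; assumes ds_skink is loaded as above):

Show code
  # beta_1 from the model should equal the log odds ratio computed by hand
  m <- glm( detected ~ sunny, data = ds_skink, family = binomial )
  p <- tapply( ds_skink$detected, ds_skink$sunny, mean )   # p_no, p_yes
  odds <- p / (1 - p)
  c( beta1  = unname(coef(m)['sunnyYes']),
     log_OR = unname(log( odds['Yes'] / odds['No'] )) )    # should match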

What does this all mean

When interpreting \(\beta\) on the logit scale: a one-unit change in the predictor shifts the log odds by \(\beta\) (an additive change)

When interpreting \(e^\beta\): a one-unit change in the predictor multiplies the odds by \(e^\beta\) (an odds ratio)
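
Once the brms model is fitted (next section), exponentiating the fixed effects turns the logit-scale estimates and credible intervals into odds ratios. A minimal sketch:

Show code
  # odds ratios with 95% credible intervals (mod_skink is fitted below)
  exp( fixef(mod_skink)[ , c('Estimate','Q2.5','Q97.5') ] )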

Run the Model

brms model

Show code
  library(brms)
  mod_skink <- brm( detected ~ sunny * tile_c + area, 
                    data = ds_skink, family = bernoulli() )
  mod_skink2 <- brm( detected ~ sunny * poly(tile_c,2) + area, 
                    data = ds_skink, family = bernoulli() )

 saveRDS(mod_skink, 'data/mod_skink.rds')
 saveRDS(mod_skink2, 'data/mod_skink2.rds')

check convergence

Check the Rhats (<1.01) in model output
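
Besides eyeballing the summaries below, you can pull the Rhats out programmatically; rhat() ships with brms. A small convenience sketch:

Show code
  # largest split-Rhat across all parameters; flag anything above 1.01
  r <- brms::rhat(mod_skink)
  max(r)
  r[ r > 1.01 ]   # empty if converged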

Show code
  mod_skink
 Family: bernoulli 
  Links: mu = logit 
Formula: detected ~ sunny * tile_c + area 
   Data: ds_skink (Number of observations: 500) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept          -1.05      0.66    -2.34     0.27 1.00     1356
sunnyYes           -0.10      0.78    -1.61     1.45 1.00     1279
tile_c             -0.03      0.04    -0.10     0.04 1.00     1285
area2               0.22      0.25    -0.28     0.72 1.00     2341
sunnyYes:tile_c     0.00      0.04    -0.07     0.08 1.00     1218
                Tail_ESS
Intercept           1982
sunnyYes            1672
tile_c              1766
area2               1847
sunnyYes:tile_c     1727

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Show code
  mod_skink2
 Family: bernoulli 
  Links: mu = logit 
Formula: detected ~ sunny * poly(tile_c, 2) + area 
   Data: ds_skink (Number of observations: 500) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                      Estimate Est.Error l-95% CI u-95% CI Rhat
Intercept                -2.04      0.64    -3.51    -0.92 1.00
sunnyYes                 -0.07      0.66    -1.23     1.37 1.00
polytile_c21            -16.93     22.10   -66.07    21.34 1.00
polytile_c22             -5.88     13.32   -33.94    17.77 1.00
area2                     0.18      0.25    -0.30     0.67 1.00
sunnyYes:polytile_c21     6.36     22.62   -33.33    57.18 1.00
sunnyYes:polytile_c22   -11.64     14.77   -38.12    19.35 1.00
                      Bulk_ESS Tail_ESS
Intercept                 1171     1270
sunnyYes                  1143     1178
polytile_c21              1057     1187
polytile_c22              1143     1333
area2                     2588     2568
sunnyYes:polytile_c21     1071     1210
sunnyYes:polytile_c22     1153     1315

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Check the trace plots

Show code
  plot(mod_skink)

check model assumptions

Let’s use the DHARMa quantile approach and pp_check from Day 2

Model1

Show code
  check_brms(mod_skink)

Show code
  plot_grid(pp_check(mod_skink, ndraws=100) + labs(title='Default'), 
            pp_check(mod_skink, ndraws=100, type = "bars") + labs(title='Discrete') )

Model2

Show code
  check_brms(mod_skink2)

Show code
  plot_grid(pp_check(mod_skink2, ndraws=100) + labs(title='Default'), 
            pp_check(mod_skink2, type = "bars", ndraws=100) + labs(title='Discrete') )

Graph main output

Show code
  ds_sum <- mutate( ds_skink, temp_grp = cut_interval(tile_c, n=10 ), 
                   temp_mid = get_midpoints(temp_grp) ) %>% 
              group_by(temp_mid, sunny) %>% 
              summarise( p = sum(detected)/n(), success=sum(detected), n=n() )    

  # compute fitted values once per model, then pull out the summary columns
  fit1 <- fitted(mod_skink,  newdata=mutate(ds_skink, area='1') )
  fit2 <- fitted(mod_skink2, newdata=mutate(ds_skink, area='1') )
  ds_skink$fit     <- fit1[,1]   # Estimate
  ds_skink$fit_lo  <- fit1[,3]   # Q2.5
  ds_skink$fit_hi  <- fit1[,4]   # Q97.5
  ds_skink$fit2    <- fit2[,1]
  ds_skink$fit2_lo <- fit2[,3]
  ds_skink$fit2_hi <- fit2[,4]
plot_grid(
  ggplot(ds_skink, aes(tile_c, detected, col=sunny) ) + 
    geom_point(data=ds_sum, aes(temp_mid, p, fill=sunny, size=n), pch=21,col='black' ) +
    geom_text(data=ds_sum, aes(temp_mid, p, label=n), size=2.5, hjust=-0.5, col='black' ) +
      scale_size_binned( range=c(0.5,3), breaks=c(1,5,10,20,30,100) )+
    geom_line( aes(y=fit ) ) + 
    geom_ribbon( aes(ymin=fit_lo, ymax=fit_hi, fill=sunny), alpha=0.3, col=NA ) +
      labs( x='Tile temperature', title='Linear fit'),
  ggplot(ds_skink, aes(tile_c, detected, col=sunny) ) + 
    geom_point(data=ds_sum, aes(temp_mid, p, fill=sunny, size=n), pch=21,col='black' ) +
    geom_text(data=ds_sum, aes(temp_mid, p, label=n), size=2.5, hjust=-0.5, col='black' ) +
      scale_size_binned( range=c(0.5,3), breaks=c(1,5,10,20,30,100) )+
    geom_line( aes(y=fit2 ) ) + 
    geom_ribbon( aes(ymin=fit2_lo, ymax=fit2_hi, fill=sunny), alpha=0.3, col=NA ) +
          labs( x='Tile temperature', title='Quadratic fit')
)

Which model to pick

  1. Check coefficients for strong evidence
Show code
  mod_skink2
 Family: bernoulli 
  Links: mu = logit 
Formula: detected ~ sunny * poly(tile_c, 2) + area 
   Data: ds_skink (Number of observations: 500) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                      Estimate Est.Error l-95% CI u-95% CI Rhat
Intercept                -2.04      0.64    -3.51    -0.92 1.00
sunnyYes                 -0.07      0.66    -1.23     1.37 1.00
polytile_c21            -16.93     22.10   -66.07    21.34 1.00
polytile_c22             -5.88     13.32   -33.94    17.77 1.00
area2                     0.18      0.25    -0.30     0.67 1.00
sunnyYes:polytile_c21     6.36     22.62   -33.33    57.18 1.00
sunnyYes:polytile_c22   -11.64     14.77   -38.12    19.35 1.00
                      Bulk_ESS Tail_ESS
Intercept                 1171     1270
sunnyYes                  1143     1178
polytile_c21              1057     1187
polytile_c22              1143     1333
area2                     2588     2568
sunnyYes:polytile_c21     1071     1210
sunnyYes:polytile_c22     1153     1315

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Below, the hypothesis() function gets the actual estimates for the linear and quadratic terms under each condition… we can now see that there is evidence for the linear and quadratic temperature terms when sunny

Show code
bind_rows(
  cloudy_linear=hypothesis( mod_skink2, 'polytile_c21 = 0' )$hypo,
  sunny_linear=hypothesis( mod_skink2, 'sunnyYes:polytile_c21 + polytile_c21  = 0' )$hypo,
  cloudy_quad=hypothesis( mod_skink2, 'polytile_c22 = 0' )$hypo,
  sunny_quad=hypothesis(mod_skink2,'polytile_c22+sunnyYes:polytile_c22= 0')$hypo,
  .id='predictor'
)  
      predictor                               Hypothesis   Estimate
1 cloudy_linear                       (polytile_c21) = 0 -16.934824
2  sunny_linear (sunnyYes:polytile_c21+polytile_c21) = 0 -10.576889
3   cloudy_quad                       (polytile_c22) = 0  -5.884924
4    sunny_quad (polytile_c22+sunnyYes:polytile_c22) = 0 -17.528032
  Est.Error  CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1  22.10314 -66.07500 21.33653         NA        NA     
2   4.85391 -20.67852 -1.56093         NA        NA    *
3  13.32197 -33.93701 17.76571         NA        NA     
4   6.57965 -31.26140 -5.69419         NA        NA    *
  2. Compare Bayesian models using LOO [not convincingly better]
Show code
  loo_compare( loo(mod_skink,moment_match = TRUE), loo(mod_skink2,moment_match = TRUE) )
           elpd_diff se_diff
mod_skink2  0.0       0.0   
mod_skink  -1.3       3.2   
  3. Compare to model selection using a frequentist approach [how robust are the findings?]
Show code
  m1 <- glm( formula(mod_skink ), data=ds_skink, family=binomial )  
  m2 <- glm( formula(mod_skink2), data=ds_skink, family=binomial )  

  left_join( rownames_to_column(AIC( m1, m2)), BIC( m1, m2) )
  rowname df      AIC      BIC
1      m1  5 435.8317 456.9047
2      m2  7 431.3958 460.8981

Example 2: Assessing golden perch catch rates over time at fishways in relation to discharge

Overview

Context

Researchers wanted to know if golden perch catch rates increased with flow discharge, in order to assess the potential role of environmental flows in management of the species.

Show code
library(cowplot)
plot_grid(
  ggdraw() + draw_image('images/fishway.PNG'),
  ggdraw() + draw_image('images/goldenperch.jfif'),
  nrow=1, scale = c(0.85,0.85),
  labels=c('Fishway', 'Golden perch')
)

Experimental design

Show code
  ds_gp <- readRDS('data/df_gp.RDS') %>% as_tibble() 
  n_month   <- n_distinct(ds_gp$month)
  n_fishway <- n_distinct(ds_gp$fishway)

Again, we are using a simplified design of the original project (and simulated data). In this simplified version, researchers sampled twice a month for 7 months across 10 fishways.

Predictors

Show code
  ds_pred <- tibble( predictors = c('discharge','month','fishway'),
                     prediction = c(
                         'flow both triggers and facilitates upstream movements in GP',
                         'months may differ in movement rate - random effect',
                         'fishway may differ in catch rates - random effect') )
  regulartable(ds_pred) %>% ari_theme()

Explore the data

Let’s bring in the data

Show code
  ds_gp <- readRDS('data/df_gp.RDS') %>% as_tibble() %>% 
            mutate( month_adj = ifelse(month>8, month-8, month+4 ),
                    month_abb=month.abb[month],
                    discharge_gl = discharge/1000)
  head(ds_gp)
# A tibble: 6 × 8
  month fishway season discharge catch month_adj month_abb
  <dbl> <chr>   <chr>      <dbl> <int>     <dbl> <chr>    
1     9 A       Spring      289.     1         1 Sep      
2     9 B       Spring      347.     2         1 Sep      
3     9 C       Spring      150.     0         1 Sep      
4     9 D       Spring      238.     1         1 Sep      
5     9 E       Spring      284.     1         1 Sep      
6     9 F       Spring      182.     1         1 Sep      
# ℹ 1 more variable: discharge_gl <dbl>

Design

Show code
  ds_mon <- distinct(ds_gp, month_adj, month_abb) %>% arrange(month_adj)
  group_by(ds_gp, month_adj, fishway) %>% count() %>% 
    ggplot( aes(month_adj, fishway, fill=n)) + 
      geom_tile( ) + 
      geom_text( aes(label=n), col='white') +
      scale_x_continuous( breaks=ds_mon$month_adj, labels=ds_mon$month_abb )

Predictors/Response

Let’s get an idea of how catch rates vary across the fishways…

Show code
plot_grid(
  ggplot(ds_gp, aes(catch)) + geom_histogram(fill='grey', color='black') + labs(title='Response'),
  ggplot(ds_gp, aes(fishway, catch) ) + 
    geom_violin() + geom_jitter(width=0.1, height=0.1) +
    geom_point( data=group_by(ds_gp, fishway) %>% summarise(catch=mean(catch)),
                pch=21, fill='red',col='black',size=3) + labs(title='Break up by fishway'),
  nrow=1
)

I want to know about the month-discharge relationship…

Show code
  group_by(ds_gp, fishway, month_adj) %>% summarise( discharge=mean(discharge) )  %>% 
  ggplot(aes(month_adj, discharge, col=fishway) ) + 
    geom_point( ) + geom_line() +
    scale_x_continuous( breaks=ds_mon$month_adj, labels=ds_mon$month_abb )

Let’s look at predictor relationships….

Show code
  ggplot( ds_gp, aes( discharge, catch ) ) + geom_point( aes(col=season) )+   
    geom_smooth( method='lm') +
    facet_wrap(~fishway)

Show code
  ggplot( ds_gp, aes( discharge, catch ) ) + geom_point( aes(col=season) )+   
    geom_smooth( method='lm') +
    facet_wrap(~month)

Picking the distribution

Observation model:

\[ y_i \sim \text{Poisson}( \lambda_i ) \]

Process model:

The process model will have the log link plus predictors

Informal formula style:

\[log(\lambda_i) = discharge + (1|month) + (1|fishway)\]

Technical formula style: \[ \begin{align} log(\lambda_i) &= \alpha + \beta_1 Discharge + \epsilon_{month_i} + \epsilon_{fishway_i} \\ \epsilon_{month} & \sim \text{N}(0, \sigma_{month}^2) \\ \epsilon_{fishway} & \sim \text{N}(0, \sigma_{fishway}^2) \end{align} \]

where discharge is a fixed effect and month and fishway are random effects

Effort and Offsets

With rates (e.g. number fish/hour, number species/transect), the effort may vary across the survey. For example, electrofishing time can vary or the length of transects. As a result, the total counts need to be adjusted for those varying efforts.

Basically, the following is happening (short equation form). The rate \(\lambda\) is the count per unit effort, \(\lambda = \frac{n}{\text{effort}}\), and on the natural scale the model is multiplicative:

\[\lambda = e^{\alpha}*e^{\beta_1 Discharge}\]

Apply the link function (log):

\[\log(\lambda) = \log\left(\frac{n}{\text{effort}}\right) = \alpha + \beta_1 Discharge\]

\[\log\left(\frac{n}{\text{effort}}\right) = \log(n) - \log(\text{effort})\]

Substituting and rearranging for the expected count gives:

\[\log(n) = \alpha + \beta_1 Discharge + \log(\text{effort})\]

so \(\log(\text{effort})\) joins the linear predictor with its coefficient fixed at 1: that is the offset.

R code: brm( catch ~ offset( log(effort) ) + … )
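
A minimal sketch of an offset in brms, assuming a hypothetical effort column (e.g. effort_hr, hours per sampling event; the workshop dataset does not include one, so the column name is illustrative):

Show code
  # catch modelled as a rate per hour: log(effort_hr) enters the linear
  # predictor with its coefficient fixed at 1, so no extra parameter is estimated
  mod_rate <- brm( catch ~ discharge_gl + offset( log(effort_hr) ) +
                     (1|month) + (1|fishway),
                   data = ds_gp, family = poisson )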

Run the Model

brms model

Show code
  mod_gp <- brm( catch ~ discharge_gl + (1|month) + (1|fishway), data=ds_gp, 
                 family=poisson, control=list(adapt_delta=0.95)  )
     saveRDS(mod_gp,'data/mod_gp.rds')

check convergence

Check the Rhats (<1.01) in model output

Show code
  mod_gp
 Family: poisson 
  Links: mu = log 
Formula: catch ~ discharge_gl + (1 | month) + (1 | fishway) 
   Data: ds_gp (Number of observations: 120) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Multilevel Hyperparameters:
~fishway (Number of levels: 10) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Intercept)     0.20      0.15     0.01     0.56 1.00     1464
              Tail_ESS
sd(Intercept)     1937

~month (Number of levels: 7) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Intercept)     0.55      0.30     0.16     1.32 1.00     1150
              Tail_ESS
sd(Intercept)     1714

Regression Coefficients:
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept       -0.47      0.34    -1.16     0.19 1.00     1812
discharge_gl     0.85      0.74    -0.63     2.26 1.00     3661
             Tail_ESS
Intercept        2125
discharge_gl     2689

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Check the trace plots

Show code
  plot(mod_gp)

check model assumptions

Let’s use the DHARMa quantile approach and pp_check from Day 2

Show code
  check_brms(mod_gp)

Show code
   plot_grid(pp_check(mod_gp, ndraws=100) + labs(title='Default'), 
            pp_check(mod_gp, type = "bars", ndraws=100) + labs(title='Discrete') )

Show code
  ds_gp$fit <- fitted(mod_gp, re_formula = NA)[,1]
  ds_gp$resid <- residuals(mod_gp,type = 'pearson')[,1]
  ggplot(ds_gp, aes(fit, resid) ) + geom_point() + geom_smooth()

Graph main output

Show code
  fit_gp <- fitted(mod_gp, re_formula = NA)   # population-level fit (random effects excluded)
  ds_gp$fit    <- fit_gp[,1]
  ds_gp$fit_lo <- fit_gp[,3]
  ds_gp$fit_hi <- fit_gp[,4]
  ggplot(ds_gp, aes(discharge) ) + 
    geom_point( aes( y=catch), pch=21, fill='grey', col='black' ) +
    geom_line( aes(y=fit) ) +
    geom_ribbon( aes(ymin=fit_lo, ymax=fit_hi), alpha=0.3, col=NA ) +
      labs( x='Discharge (ML/day)', y='Catch (n/event)')

Show code
mod_gp
 Family: poisson 
  Links: mu = log 
Formula: catch ~ discharge_gl + (1 | month) + (1 | fishway) 
   Data: ds_gp (Number of observations: 120) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Multilevel Hyperparameters:
~fishway (Number of levels: 10) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Intercept)     0.20      0.15     0.01     0.56 1.00     1464
              Tail_ESS
sd(Intercept)     1937

~month (Number of levels: 7) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Intercept)     0.55      0.30     0.16     1.32 1.00     1150
              Tail_ESS
sd(Intercept)     1714

Regression Coefficients:
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept       -0.47      0.34    -1.16     0.19 1.00     1812
discharge_gl     0.85      0.74    -0.63     2.26 1.00     3661
             Tail_ESS
Intercept        2125
discharge_gl     2689

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Example 3: Murray Cod response to resnagging

Overview

Context

Researchers want to know if adding instream woody habitat (IWH) increases Murray cod abundance. IWH is likely critical habitat for Murray cod, which use it as microhabitat both for protection and for ambushing prey.

Show code
library(cowplot)
plot_grid(
  ggdraw() + draw_image('images/iwh.jfif'),
  ggdraw() + draw_image('images/murraycod.jpg'),
  nrow=1, scale = c(0.85,0.85),
  labels=c('Instream Woody Habitat (IWH)', 'Murray cod')
)

Experimental design

Show code
  ds_mc <- readRDS('data/df_resnag.RDS') %>% as_tibble() 
  n_reach <- n_distinct(str_split_fixed(ds_mc$reach_id,'_',2)[,2])
  n_year <- n_distinct(ds_mc$year)

Again, we are using simulated data based on an actual experiment. The researchers were commissioned to enhance riverine habitat in the Ovens River. Two methods of enhancement were undertaken, basic and IWH, each at 5 reaches. As a “control”, 5 reaches on a comparable river without any riverine improvement were chosen. Over 7 years, annual surveys were undertaken to count the Murray cod caught.

Predictors

Show code
  ds_pred <- tibble( predictors = c('year','works','reach'),
                     prediction = c('change in catch rates over time',
                                    'treatment of interest',
                                    'reach effects - random effect') )
  regulartable(ds_pred) %>% ari_theme()

Explore the data

Let’s bring in the data

Show code
  ds_mc <- readRDS('data/df_resnag.RDS') %>% as_tibble() 
  head(ds_mc)
# A tibble: 6 × 4
  works reach_id  year catch
  <fct> <chr>    <int> <dbl>
1 None  None_A       0     8
2 None  None_A       1     0
3 None  None_A       2    13
4 None  None_A       3     1
5 None  None_A       4     1
6 None  None_A       5     8

Design

Let’s check the design

Show code
  ds_sum <- distinct(ds_mc, works, reach_id,year) %>%  
              mutate(reach_id=str_split_fixed(reach_id,'_',2)[,2])
  group_by(ds_sum, works, reach_id) %>% count() %>% 
    ggplot( aes(works, reach_id, fill=n)) + 
      geom_tile( col='black' ) + 
      geom_text( aes(label=n), col='white') +
          labs( title='Design: number of sampling events' )

Predictors/Response

Let’s get an idea of how catch rates vary across reaches…

Show code
plot_grid(
  ggplot(ds_mc, aes(catch) ) + geom_histogram(aes(fill=works), binwidth = 3, color='black') + labs(title='Response') + theme(legend.position=c(0.8,0.8)),
  ggplot(ds_mc, aes(reach_id, catch, fill=works) ) + 
    geom_violin() + geom_jitter(width=0.1, height=0.1) + coord_flip()+ theme(legend.position='none'),
  nrow=1
)

Let’s visualise the likely model…

Show code
plot_grid(
  ggplot(ds_mc, aes(year, catch, col=works) ) + 
    geom_point() + geom_smooth(se=F, span=1) +
    labs(title='Raw scale'),
  ggplot(ds_mc, aes(year, catch+1, col=works) ) + 
    geom_point() + geom_smooth(se=F, span=1) + scale_y_log10() +
    labs(title='Log10 version')
)

Picking the distribution

Observation model:

\[ y_i \sim \text{NB}( \mu_i, \theta ) \]

Process model:

The process model will have the log link and the predictors

Informal formula style:

\[ log(\mu_i) = year * works + (1|reach) \]

Technical formula style: \[ \begin{align} log(\mu_i) &= \alpha + \beta_1 year + \beta_2 works + \beta_3 (year*works) + \epsilon_{reach_i} \\ \epsilon_{reach} & \sim \text{N}(0, \sigma_{reach}^2) \end{align} \]

where year and works are fixed effects and reach is a random effect

Run the Model

brms model

Show code
  library(brms)
  mod_mc <- brm( catch ~ year*works + (1|reach_id) , data=ds_mc, family=negbinomial()  )
     saveRDS(mod_mc,'data/mod_mc.rds')

check convergence

Check the Rhats (<1.01) in model output

Show code
  mod_mc
 Family: negbinomial 
  Links: mu = log; shape = identity 
Formula: catch ~ year * works + (1 | reach_id) 
   Data: ds_mc (Number of observations: 105) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Multilevel Hyperparameters:
~reach_id (Number of levels: 15) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Intercept)     0.19      0.15     0.01     0.54 1.00     1619
              Tail_ESS
sd(Intercept)     1638

Regression Coefficients:
                Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept           1.36      0.36     0.68     2.09 1.00     1764
year                0.07      0.09    -0.12     0.25 1.00     1789
worksBasic          0.18      0.51    -0.83     1.19 1.00     1790
worksIWH            0.45      0.52    -0.59     1.46 1.00     1937
year:worksBasic    -0.02      0.13    -0.28     0.24 1.00     1748
year:worksIWH       0.08      0.14    -0.20     0.35 1.00     1895
                Tail_ESS
Intercept           2355
year                2287
worksBasic          2135
worksIWH            2640
year:worksBasic     2347
year:worksIWH       2269

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape     1.01      0.17     0.73     1.38 1.00     4203     3485

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Tip: Check the shape parameter estimate (shape = \(\theta\); smaller \(\theta\) means more “clumped”/overdispersed counts). If \(\theta\)>10, the negative binomial is already close to a Poisson, so it is worth re-running the model as a Poisson
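
The tip rests on the NB mean-variance relationship, Var(y) = \(\mu + \mu^2/\theta\): as \(\theta\) grows, the extra term vanishes and the NB collapses to a Poisson (Var = \(\mu\)). A quick simulation sketch (the \(\mu\) and \(\theta\) values are illustrative):

Show code
  # compare simulated variances against the theoretical values
  mu <- 4; theta <- 1
  x_nb   <- rnbinom(1e5, mu = mu, size = theta)
  x_pois <- rpois(1e5, lambda = mu)
  c( nb_var = var(x_nb), nb_theory = mu + mu^2/theta,   # ~20
     pois_var = var(x_pois) )                           # ~4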

Check the trace plots

Show code
  plot(mod_mc)

check model assumptions

Let’s use the DHARMa quantile approach and pp_check from Day 2

DHARMa

Show code
  check_brms(mod_mc)

PP check

Show code
  pp_check(mod_mc, ndraws=100) + labs(title='Default')

Residuals vs fitted

Show code
  ds_mc$fit <- fitted(mod_mc, re_formula = NA)[,1]
  ds_mc$resid <- residuals(mod_mc,type = 'pearson')[,1]
  plot_grid(
    ggplot(ds_mc, aes(fit, resid) ) + geom_point() + geom_smooth(),
    ggplot(ds_mc, aes(year, resid) ) + geom_point() + geom_smooth() +
        facet_wrap(~works),
    ncol=1
  )

Graph main output

Show code
  fit_mc <- fitted(mod_mc, re_formula = NA)   # population-level fit (random effects excluded)
  ds_mc$fit    <- fit_mc[,1]
  ds_mc$fit_lo <- fit_mc[,3]
  ds_mc$fit_hi <- fit_mc[,4]
  ggplot(ds_mc, aes(year, col=works, fill=works) ) + 
    geom_point( aes( y=catch), pch=21,  col='black' ) +
    geom_line( aes(y=fit) ) +
    geom_ribbon( aes(ymin=fit_lo, ymax=fit_hi), alpha=0.3, col=NA ) +
      labs( x='Year', y='Catch (n/event)') + 
      facet_wrap(~works, nrow=1) + theme(legend.position = 'none')


Group Exercises

(Assuming we have enough time…)

Like yesterday, get in small groups and attempt one of the following small analysis tasks. Today’s tasks will use the same examples from above.

  1. Using the skink dataset, re-run the model using the Binomial distribution. This will involve aggregating the data into successes and trials (in brms: successes | trials(n) ~ …).

  2. For the Murray cod model, re-run the model using a Poisson distribution. Check the diagnostics to see the issues that emerge.

  3. For the Murray cod model, re-run the model summing all catch to each year for each works treatment. Compare the estimates from this model to the lecture’s one.