Day 1: What statistical distribution to pick?

Gentle introduction to statistical distributions: what are they and why are they important

Ben Fanson https://bfanson.github.io/2024DADAworkshop/ (Arthur Rylah Institute)https://www.ari.vic.gov.au/
2025-09-08

Collecting data for today

Note - if joining online, feel free to grab a fair coin and flip along, as well as enter your height

As part of today's presentation, we will use data we collect from simple measurement experiments. We have set up two stations to collect the data, which will be entered into an online form.

Station 1: Flipping of coins

There are two coins at the table, and you will flip each coin following the steps below (also listed on the online form). One coin has a washer attached (“Weighted”) and the other is a regular (“Fair”) coin.

Steps:

  1. Select one coin
  2. Flip that coin, recording the number of heads you get after 1, 5, 10, and 25 flips.
  3. Enter data into form
  4. Repeat the same steps for the other coin

Station 2: Measuring “Eel” Wayne and your height

You will measure the height of the cutout of Wayne holding the eel

Steps:

  1. Measure the height of Wayne from the floor to top of head (top of hair)
  2. Enter data into online form
  3. Enter your height on the form [rough estimate fine]

Day’s objectives

A REMINDER TO START RECORDING

Key packages used today


Housekeeping

Course overview on website


Introduction to statistical distributions

Example: Measuring fish

Scenario: Measuring the length of fish at a fishway

Break up by species…

There are lots of them…

Well over 100 named distributions

What creates distributions in our data?

  1. measurement error
    • limits of measurement tools (accuracy, user-error)
    • miscounting number of birds in large aggregations
    • data entry/workflow errors
  2. result of a stochastic process [e.g. probabilistic events like coin toss, survival rates]

Coin toss experiment

let’s take a look at the data…

  # histogram of the number of heads after 25 flips, by coin type
  ggplot(ds_coin %>% filter(n_total==25), aes(n_success) ) + geom_histogram(col='white') +
    facet_wrap(~trt) + labs(x='No. of successes')

Animal spacing

Let’s walk through the classical ecology course on animal spacing…

from https://bio.libretexts.org/Bookshelves/Introductory_and_General_Biology

Scenario: Imagine we set up a sampling transect of 100 grid cells that we search for the presence of birds. There are 200 birds that distribute themselves among the 100 grid cells, differing in degree of spacing from uniform to random to modest clumping.

# simulate classical spacing of animals
  n_grids <- 100
  n_birds <- 200

  set.seed(12)
  p <- rbeta(n_grids, 1, 6) # grid preferences used for the clumped pattern
  ds_all <- tibble( grid_id=1:n_grids, 
                    `1) equal`   = rep( round(n_birds/n_grids), n_grids ),
                    `2) random`  = as.vector( rmultinom(1, n_birds, prob=rep(1/n_grids, n_grids)) ),
                    `3) clumped` = as.vector( rmultinom(1, n_birds, prob=p/sum(p)) ) ) %>% 
            pivot_longer( cols=-grid_id, names_to='type', values_to='n' )

#--- visualise the results ---# 
 ds_stat <- group_by(ds_all, type) %>% summarise( mean=round( mean(n) ), sd= round(sd(n),1) ) 

 ggplot(ds_all, aes(n) ) + geom_histogram(binwidth = 1,fill='grey', col='black') + facet_wrap(~type) +
      geom_text(data=ds_stat, aes( x=3, y=Inf, label=paste('Mean=',mean,'(SD=',sd,')') ), 
                vjust=2, hjust=0 ) +
        labs( x='No. birds per grid cell', y='Count' ) 

Example from the height data

Let’s look at height data…

  # centre heights within each source (trt) so the spreads can be compared
  ggplot(ds_ht %>% group_by(trt) %>% mutate(height=height-mean(height)), 
         aes(trt, height) ) + geom_violin() + labs(x='Source')

Why care about it?

# height data
#  d <- dslabs::heights  # backup data if cannot download from form
  d <- ds_ht %>% filter(trt=='measurement')
# running mean and standard error as the sample size grows from 3 to 30
  d <- map_df( 3:30, ~ data.frame( mean_ht = mean(d$height[1:.x]), 
                                   sample_size = .x, 
                                   se = sd(d$height[1:.x])/sqrt(.x) ) )
  ggplot(d, aes(sample_size, mean_ht, ymin=mean_ht-se, ymax=mean_ht+se ) ) + geom_pointrange()

Random variation in height is of interest in quantitative genetics [e.g. hints about stabilising selection, gene-environment interactions]

e.g. the “male vulnerability hypothesis”, proposed to explain the differing variation between human sexes

Types of selection

from https://numustafa.medium.com/

Take a look at the normal distribution in the Shiny App

Key properties of statistical probability distributions

Type: continuous vs discrete

A continuous distribution in statistics refers to a probability distribution in which the random variable can take any value within a given range, often an interval on the real number line.

A discrete distribution is a type of probability distribution that describes the likelihood of outcomes for a discrete random variable — one that can take on countable values, often integers or categories.
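A minimal R sketch of the distinction using simulated values (the numbers here are arbitrary):

  # continuous: values can fall anywhere within a range
  rnorm(5, mean = 170, sd = 8)    # e.g. heights in cm

  # discrete: values are counts, so only whole numbers
  rpois(5, lambda = 3)            # e.g. number of birds per grid cell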

Location: mean and median

Mean: \(\mu = \frac{1}{n} \sum_{i=1}^{n} x_i\) [average value]

Median: the 50th percentile value (middle value)… for {1,2,3,4,5}, median=3

Bonus: what is the mode?
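A minimal R sketch of these location measures; base R has no built-in mode function, so a small helper is written here purely for illustration:

  x <- c(1, 2, 2, 3, 4, 5)

  mean(x)     # average value
  median(x)   # middle (50th percentile) value

  # mode = most frequent value (illustrative helper, not a base R function)
  get_mode <- function(v) as.numeric( names( which.max( table(v) ) ) )
  get_mode(x)   # 2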

Spread/dispersion: variance/standard deviation

SD = \(\sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}\)

In essence, it is the average deviation from the mean, with larger deviations penalised more heavily (squared cost), so outliers carry much more weight.
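As a check, the formula can be computed by hand and compared with R's built-in sd() (arbitrary numbers):

  x <- c(2, 4, 4, 4, 5, 5, 7, 9)

  # by hand: squared deviations from the mean, averaged using n - 1, then square-rooted
  sqrt( sum( (x - mean(x))^2 ) / (length(x) - 1) )

  # built-in
  sd(x)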

Support and Shape

Support is the range of possible values… e.g. the Beta distribution can only range from 0 to 1

Shape is exactly that, the shape of the curve. Types may be symmetric, skewed, unimodal/bimodal

Tools for visualising distributions using R

Histograms

# histograms: same simulated data, varying the binwidth and number of bins
  ds <- data.frame( x=rnorm(100, mean=10, sd=2 )  )
  cowplot::plot_grid(
    ggplot( ds, aes(x) ) + geom_histogram( binwidth=1, fill='blue',col='black'),
    ggplot( ds, aes(x) ) + geom_histogram( binwidth=3, fill='blue',col='grey'),
    ggplot( ds, aes(x) ) + geom_histogram( bins=10, fill='blue',col='black'),
    ggplot( ds, aes(x) ) + geom_histogram( bins=20, fill='blue',col='black'),
    nrow=2
  )

Density

  ds <- data.frame( x=rnorm(50, mean=10, sd=2 ), x2= rnorm(50, mean=5, sd=2 )  )
  cowplot::plot_grid(
    ggplot( ds, aes(x) ) + geom_density( fill='blue', col='black') + labs(title='Simple'), 
    ggplot( ds, aes(x) ) + geom_density( fill='blue', col='black',  bw=0.3)+ labs(title='Change bandwidth'), 
    ggplot( ds ) + 
        geom_density( aes(x), alpha=0.4, fill='blue', col='black') + 
        geom_density( aes(x=x2), alpha=0.4, fill='red', col='black') + labs(title='Overlapping'),
    nrow=1
  )

Violin plots

  ds <- data.frame( 'Mean=10' = rnorm(50, mean=10, sd=2 ), 
                    'Mean=5'  = rnorm(50, mean=5, sd=2 ),
                    check.names = FALSE ) %>% 
          mutate( id=1:n() ) %>% 
          pivot_longer( cols=-id, names_to='type', values_to='value' )
  cowplot::plot_grid(
    ggplot( ds, aes(type, value, fill=type) ) + geom_violin( show.legend = F) + 
      labs(title='Simple'), 
    ggplot( ds, aes(type, value, fill=type) ) + geom_violin( bw=0.4,show.legend = F ) + 
      labs(title='change bandwidth'),    
    nrow=1
  )

QQ plots: to assess fit to a specific distribution

Useful for checking whether your data match a specific distribution

Plots the sample values against the theoretical quantiles. If the points fall on a straight(ish) line, there is support for that distribution. You can add a reference line of the theoretical quantiles.

geom_qq() - defaults to the normal distribution [see the code below for a non-normal example]
geom_qq_line() - plots the reference line expected if the data actually follow that distribution

# make some data
set.seed(10)
  ds <- data.frame( normal=rnorm(100, mean=10, sd=2 ), 
                    gamma = rgamma(50, shape=2, scale=1.5 )  )  # note: the 50 gamma draws are recycled to 100 rows

# get an estimate of parameters for gamma
  (gamma_parms <- MASS::fitdistr(ds$gamma, 'gamma') )
     shape         rate   
  2.08253136   0.60207926 
 (0.27420263) (0.08958277)
# plot out qqplot
  cowplot::plot_grid(
    ggplot(ds, aes(sample=normal)) + geom_qq() + geom_qq_line(col='blue') + 
        labs(title='Normal data vs. normal reference'),
    ggplot(ds, aes(sample=gamma)) + geom_qq() + geom_qq_line(col='blue') + 
        labs(title='Gamma data vs normal reference'),
    ggplot(ds, aes(sample=gamma)) + 
            geom_qq(distribution = stats::qgamma,
                    dparams=list(shape=2.08, rate=0.60) ) + 
            geom_qq_line(col='blue',
                         distribution =stats::qgamma, 
                         dparams=list(shape=2.08, rate=0.60) ) + 
        labs(title='Gamma data vs gamma reference'),
    nrow=2
  )


Continuous distributions

Note - we will talk more about these on Day 2

Normal (aka Gaussian)

Rationale behind Normal

Key Parameters for location and variance

Location: Mean \(\mu\)
Variance: \(\sigma^2\)

\[ y \sim \text{Normal}(\mu, \sigma) \] Wikipedia Normal
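A quick numeric sketch (with arbitrary parameter values) showing that the two parameters are simply the mean and SD of the draws:

  set.seed(1)
  y <- rnorm(1e4, mean = 10, sd = 2)   # y ~ Normal(mu = 10, sigma = 2)

  mean(y)   # ~10, estimate of mu
  sd(y)     # ~2,  estimate of sigma

  # roughly 95% of values fall within mu +/- 1.96*sigma
  mean( abs(y - 10) < 1.96 * 2 )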


Summary Table

Why so important in statistics?

Log-normal

Rationale behind log normal

Key parameters for location and spread/variance

Location: \(\mu\) [on the log scale]
Variance: \(\sigma^2\) [on the log scale]

\[ y \sim \text{LogNormal}(\mu, \sigma) \] Wikipedia LogNormal


Summary Table

log(Y) vs Lognormal

You probably have learned about log-transforming Y to do the following:

  1. stabilize variance
  2. normalise the data
  3. linearise relationships

When working with log(y), all predictions are in logs.

    meanlog       sdlog   
  1.03800665   0.99490672 
 (0.03146171) (0.02224679)
      mean          sd    
  1.03800665   0.99490672 
 (0.03146171) (0.02224679)
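The two blocks of output above are consistent with fitting the same data two ways: a lognormal fitted to y and a normal fitted to log(y) give identical estimates. A minimal sketch of that comparison, using simulated lognormal data (the original data/code is not shown here):

  set.seed(1)
  y <- rlnorm(1000, meanlog = 1, sdlog = 1)   # simulated lognormal data

  # fit a lognormal to y ...
  MASS::fitdistr(y, 'lognormal')

  # ... and a normal to log(y): the estimates (and SEs) match
  MASS::fitdistr(log(y), 'normal')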

Beta distribution

Rationale behind beta distribution

Key parameters for Beta and location/variance

The main parameters for the Beta are the shape parameters…

shape1 (\(\alpha\)): think of it as the number of successes
shape2 (\(\beta\)): think of it as the number of failures

These shape parameters can be combined to define the location and variance

Location: \(\mu = \frac{\alpha}{\alpha + \beta}\)
Variance: \(\sigma^2 = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}\)

Note - range of values (0,1)

\[ y \sim \text{Beta}(\alpha, \beta) \]
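A quick numeric check of those formulas (arbitrary shape values):

  set.seed(1)
  a <- 2; b <- 5                      # arbitrary shape1 (alpha) and shape2 (beta)
  y <- rbeta(1e5, shape1 = a, shape2 = b)

  mean(y);  a / (a + b)                           # simulated vs theoretical mean
  var(y);   a * b / ((a + b)^2 * (a + b + 1))     # simulated vs theoretical variance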


Summary Table

Discrete distributions

Note - we will talk more about these on Day 3

Binomial(/Bernoulli)

Rationale behind binomial

Note: the Bernoulli distribution is a special case of the Binomial in which there is only one trial (N = 1)

Key parameters for binomial

p - probability of the event
N - number of trials (e.g. number of animals tested)

Location: \(\mu = N*p\)
Variance: \(\sigma^2 = N * p * (1 - p)\)

Note - for a given N, the variance is largest when p is near 0.5

\[ y \sim \text{Binomial}(p, N) \]
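A quick numeric check of those moments (arbitrary p and N):

  set.seed(1)
  N <- 25; p <- 0.5                       # arbitrary number of trials and probability
  y <- rbinom(1e5, size = N, prob = p)    # number of successes out of N, repeated many times

  mean(y);  N * p               # simulated vs theoretical mean
  var(y);   N * p * (1 - p)     # simulated vs theoretical variance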


Summary Table

Poisson

Rationale behind Poisson

Key parameters for Poisson

\(\lambda\) - rate at which an event happens

Location: \(\mu=\lambda\)
Variance: \(\sigma^2 = \lambda\)

Note - the variance increases with the rate (the mean and variance are equal)

\[ y \sim \text{Poisson}(\lambda) \]
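A quick numeric check that the mean and variance are both \(\lambda\) (arbitrary rate):

  set.seed(1)
  lambda <- 4                        # arbitrary rate
  y <- rpois(1e5, lambda = lambda)

  mean(y)   # ~4
  var(y)    # ~4: for a Poisson, the variance equals the mean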


Summary Table

Negative Binomial

Rationale behind Negative Binomial

Key parameters for negative binomial

Location: \(\mu\)
Variance/dispersion: size (\(\theta\)) [decreasing size = higher overdispersion], giving \(\sigma^2 = \mu + \mu^2/\theta\)

Note - as \(\theta\) becomes large, the Negative Binomial approaches the Poisson (no overdispersion)

\[ y \sim \text{NegativeBinomial}(\mu, \theta) \]
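A quick numeric check of the mean-variance relationship \(\sigma^2 = \mu + \mu^2/\theta\) (arbitrary values):

  set.seed(1)
  mu <- 4; theta <- 2                          # arbitrary mean and size (dispersion)
  y <- rnbinom(1e5, mu = mu, size = theta)

  mean(y);  mu                     # simulated vs theoretical mean
  var(y);   mu + mu^2 / theta      # simulated vs theoretical variance (overdispersed)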


Mixture distributions

You can combine distributions

Note - we will talk more about these on Day 4

Zero-inflated Poisson

Rationale behind the zero-inflated Poisson

Scenario: Surveying wetland sites to count the number of frogs from rare species

    n_wetland <- 100
    n_grid <- 5
    p <- 0.8       # 80% of wetlands are empty (structural zeros)
    lambda <- 2    # mean of 2 frogs per grid cell where frogs are present
    set.seed(123)
    is_zero <- rbinom(n_wetland, 1, p)
    y <- map( 1:n_wetland, ~ if(is_zero[.x] == 1){ rep(0, n_grid) } else { rpois(n_grid, lambda) } ) %>% 
           flatten() %>% unlist() 
    
    ggplot( tibble(value = y), aes(x = value)) +
      geom_histogram(binwidth = 1, fill = "#2c7fb8", color = "white") +
      labs(title = "Histogram of Simulated ZIP Data",
           x = "No. of frogs", y = "Frequency")

Cause of the excess zeros

  1. Structural zeros: Sites where frogs are truly absent (e.g., no water, wrong vegetation).
  2. Sampling zeros: Sites where frogs are present but not detected during the survey.

Key parameters for the zero-inflated Poisson

p - probability of a structural zero from the binomial component (e.g. frogs absent from the wetland)
\(\lambda\) - Poisson rate (e.g. if frogs are present, the density of frogs)

Location: \(\mu = (1-p)*\lambda\)
Variance: \(\sigma^2 = (1 - p)*\lambda*(1+p*\lambda)\)

\[ y \sim \text{ZIP}(p, \lambda) \]
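A quick numeric check of those moments, simulating one count per draw rather than per grid cell (same p and \(\lambda\) as the wetland example):

  set.seed(123)
  n <- 1e5; p <- 0.8; lambda <- 2

  # structural zero with probability p, otherwise a Poisson count
  y <- ifelse( rbinom(n, 1, p) == 1, 0, rpois(n, lambda) )

  mean(y);  (1 - p) * lambda                          # simulated vs theoretical mean
  var(y);   (1 - p) * lambda * (1 + p * lambda)       # simulated vs theoretical variance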


Observation vs Process framework

Let’s build up an example…

# grab the data
  library(dslabs)  # package with the dataset
  data(heights)
  ds_hts <- heights
  head(ds_hts)
     sex height
1   Male     75
2   Male     70
3   Male     68
4   Male     74
5   Male     61
6 Female     65
# visualise the data  
  cowplot::plot_grid(
    ggplot(ds_hts, aes(sex, height) ) + geom_violin(),
    ggplot(ds_hts, aes( height, fill=sex) ) + geom_density( alpha=0.2 ),
    nrow=1
  )
# get key parameters
  ds_stat <- group_by(ds_hts, sex) %>% 
    summarise( ht_mean = mean(height), sd_ht = sd(height)) %>% 
    ungroup() %>% 
    mutate(diff_mean = diff(ht_mean))

Equation

\[ height_{male} \sim \text{Normal}(\mu_{male}, \sigma_{male}) \] \[ height_{female} \sim \text{Normal}(\mu_{female}, \sigma_{female}) \] Let’s assume males and females have the same variation: \(\sigma_{female}\) = \(\sigma_{male} = \sigma\)

\[ height_{male} \sim \text{Normal}(\mu_{male}, \sigma) \] \[ height_{female} \sim \text{Normal}(\mu_{female}, \sigma) \]

Okay, let’s now simplify more

\[ y_i \sim \text{Normal}(\mu_i, \sigma) \] where i = male or female and \(y_i = height_i\)

\[ \mu_i = sex \]

Observation component (distribution)

\[ y_i \sim \text{Normal}(\mu_i, \sigma) \]

Process component

\[ \mu_i = sex \]

More accurately… \[ \mu_i = \text{Intercept} + \text{Male}*X \]

the R formula

  glm( height ~ sex, data=ds_hts, family=gaussian )  # equivalent to lm(height ~ sex)

Example with the coins

Observation component (distribution)

\[ y_i \sim \text{Binomial}(p_i, N) \]

where i = fair or weighted and \(y_i = n_{successes}\)

Process component

\[ p_i = \text{coin\_type} \]

More accurately… \[ p_i = Intercept + Weighted * X \]

the R formula

  ds_coin <- rename(ds_coin, coin_type=trt) %>% 
              mutate(n_failures = n_total - n_success)
  # a binomial GLM takes a two-column matrix of successes and failures on the left-hand side
  glm( cbind(n_success, n_failures) ~ coin_type, data=ds_coin, family=binomial )

Bonus question: where do random effects (which are themselves a statistical distribution) come into the observation-process model formulation?


What about non-parametric approaches? Where do they fit in?

1. Limited Modeling Flexibility

- Nonparametric tests (e.g., Mann–Whitney U, Kruskal–Wallis) are typically designed for simple comparisons.
- They do not easily accommodate multiple predictors, interactions, or random effects, which are common in ecological models.

2. Reduced Statistical Power

- Nonparametric tests often have lower power than parametric tests when parametric assumptions are reasonably met.
- This means a higher chance of Type II errors (failing to detect a true effect).

# set up data: control vs treatment with a modest true difference in means (10 vs 11, SD = 3)
  set.seed(20)
  n=40
  makeData<-function(){
   data.frame( trt=rep( c('ctrl','trt'), each=n ), 
                    resp=c(rnorm(n,10,3), rnorm(n,11,3)) )
  }
  ds <-makeData()
  ggplot( ds, aes(resp, fill=trt) ) + geom_density(alpha=0.4)
  wilcox.test( resp~trt, data=ds) 

    Wilcoxon rank sum exact test

data:  resp by trt
W = 570, p-value = 0.02666
alternative hypothesis: true location shift is not equal to 0
  summary(lm(resp~trt, data=ds) )

Call:
lm(formula = resp ~ trt, data = ds)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.1297 -2.0481 -0.0577  2.3324  5.8958 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.4606     0.4667  20.271   <2e-16 ***
trttrt        1.5999     0.6600   2.424   0.0177 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.952 on 78 degrees of freedom
Multiple R-squared:  0.07006,   Adjusted R-squared:  0.05813 
F-statistic: 5.876 on 1 and 78 DF,  p-value: 0.01766
# run tests as nonparametric and parametric and look at power
  runTest <- function(d){ 
    bind_rows(
      broom::tidy(wilcox.test( resp~trt, data=d)) %>% select(method, p.value),
      broom::tidy(lm(resp~trt, data=d) ) %>% 
        filter(term=='trttrt') %>%             # keep the treatment effect only (drop the intercept)
        select(p.value) %>% mutate(method='parametric')
    )
    }
  ds_sim <- map_dfr( 1:100,~makeData() %>% mutate(rep=.x) ) %>%
            group_by(rep) %>% nest() %>% mutate( data= map(data, runTest) ) %>% 
            unnest('data') %>% mutate(sig= p.value<0.05) 
  group_by(ds_sim, method) %>% summarise( power=sum(sig)/n() ) %>% 
    ggplot(aes(method, power) ) + geom_col(fill='blue',col='black') +
      scale_y_continuous( labels=scales::percent, limits=c(0,1) )

3. Difficulty Handling Complex Designs

- Repeated measures, nested data, or hierarchical structures are hard to model with traditional nonparametric tests.
- Mixed models or generalized linear models (GLMs/GLMMs) are better suited for such designs.

4. Interpretation Challenges

- Nonparametric tests typically assess ranks rather than actual values.
- This makes it harder to interpret effect sizes or make predictions based on model coefficients.

5. Limited Diagnostic Tools

- Parametric models offer rich diagnostics (e.g., residual plots, AIC/BIC, goodness-of-fit).
- Nonparametric tests lack these tools, making it harder to assess model adequacy.

6. Assumption of Independence

- Many nonparametric tests assume independent observations.
- Violations (e.g., spatial or temporal autocorrelation) can lead to misleading results.

7. No Direct Extensions to Multivariate Data

- Nonparametric tests are not easily extended to multivariate or multilevel data structures.
- Techniques like PERMANOVA exist but have their own limitations and assumptions.