Day 1: What statistical distribution to pick?

Gentle introduction to statistical distributions: what are they and why are they important

Ben Fanson https://bfanson.github.io/2024DADAworkshop/ (Arthur Rylah Institute)https://www.ari.vic.gov.au/
2025-09-08

Collecting data for today

Note - if joining online, feel free to grab a fair coin and flip along, as well as enter your height

As part of today's presentation, we will use data we collect from simple measurement experiments. We have set up two stations to collect the data, which will be entered into an online form.

Station 1: Flipping of coins

There are two coins at the table, and you will flip each coin following the steps below (also listed on the online form). One coin has a washer attached (“Weighted”) and the other is a regular (“Fair”) coin.

Steps:

  1. Select one coin
  2. Flip that coin, recording the number of heads you get after 1, 5, 10, and 25 flips.
  3. Enter data into form
  4. Repeat the same steps for the other coin

Station 2: Measuring “Eel” Wayne and your height

You will measure the height of the cutout of Wayne holding the eel

Steps:

  1. Measure the height of Wayne from the floor to top of head (top of hair)
  2. Enter data into online form
  3. Enter your height on the form [rough estimate fine]

Day’s objectives

A REMINDER TO START RECORDING

Key packages used today


Housekeeping

Course overview on website


Introduction to statistical distributions

Example: Measuring fish

Scenario: Measuring the length of fish at a fishway

Break up by species…

There are lots of them…

Well over 100 named distributions

What creates distributions in our data?

  1. measurement error
    • limits of measurement tools (accuracy, user-error)
    • miscounting number of birds in large aggregations
    • data entry/workflow errors
  2. result of a stochastic process [e.g. probabilistic events like coin toss, survival rates]

Coin toss experiment

let’s take a look at the data…

  # histogram of the number of heads after 25 flips, by coin type
  ggplot(ds_coin %>% filter(n_total==25), aes(n_success) ) + geom_histogram(col='white') +
    facet_wrap(~trt) + labs(x='No. of successes')

Animal spacing

Let’s walk through the classical ecology course on animal spacing…

from https://bio.libretexts.org/Bookshelves/Introductory_and_General_Biology

Scenario: Imagine we set up a sampling transect of 100 grid cells that we search for the presence of birds. There are 200 birds that distribute themselves among the 100 grid cells, differing in degree of spacing from uniform to random to modest clumping.

# simulate classical spacing of animals
  n_grids <- 100
  n_birds <- 200

  set.seed(12)
  p <- rbeta(n_grids, 1, 6) # grid preferences used for the clumped pattern
  ds_all <- tibble( grid_id=1:n_grids, 
                    `1) equal`   = rep( round(n_birds/n_grids), n_grids ),
                    `2) random`  = as.vector( rmultinom(1, n_birds, prob=rep(1/n_grids, n_grids)) ),
                    `3) clumped` = as.vector( rmultinom(1, n_birds, prob=p/sum(p)) ) ) %>% 
            pivot_longer( cols=-grid_id, names_to='type', values_to='n' )

#--- visualise the results ---# 
 ds_stat <- group_by(ds_all, type) %>% summarise( mean=round( mean(n) ), sd= round(sd(n),1) ) 

 ggplot(ds_all, aes(n) ) + geom_histogram(binwidth = 1,fill='grey', col='black') + facet_wrap(~type) +
      geom_text(data=ds_stat, aes( x=3, y=Inf, label=paste('Mean=',mean,'(SD=',sd,')') ), 
                vjust=2, hjust=0 ) +
        labs( x='No. birds per grid cell', y='Count' ) 

Example from the height data

Let’s look at height data…

  # centre heights within each source (trt) so the spreads can be compared
  ggplot(ds_ht %>% group_by(trt) %>% mutate(height=height-mean(height)), 
         aes(trt, height) ) + geom_violin() + labs(x='Source')

Why care about it?

# height data
#  d <- dslabs::heights  # backup data if cannot download from form
  d <- ds_ht %>% filter(trt=='measurement')
# running mean and standard error as the sample size grows from 3 to 30
  d <- map_df( 3:30, ~ data.frame( mean_ht = mean(d$height[1:.x]), 
                                   sample_size = .x, 
                                   se = sd(d$height[1:.x])/sqrt(.x) ) )
  ggplot(d, aes(sample_size, mean_ht, ymin=mean_ht-se, ymax=mean_ht+se ) ) + geom_pointrange()

Random variation in height is of interest in quantitative genetics [e.g. hints about stabilising selection, gene-environment interactions]

e.g. the “male vulnerability hypothesis”, proposed to explain the differing variation between human sexes

Types of selection

from https://numustafa.medium.com/

Take a look at the normal distribution in the Shiny App

Key properties of statistical probability distributions

Type: continuous vs discrete

A continuous distribution in statistics refers to a probability distribution in which the random variable can take any value within a given range, often an interval on the real number line.

A discrete distribution is a type of probability distribution that describes the likelihood of outcomes for a discrete random variable — one that can take on countable values, often integers or categories.
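A minimal R sketch of the distinction using simulated values (the numbers here are arbitrary):

  # continuous: values can fall anywhere within a range
  rnorm(5, mean = 170, sd = 8)    # e.g. heights in cm

  # discrete: values are counts, so only whole numbers
  rpois(5, lambda = 3)            # e.g. number of birds per grid cell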

Location: mean and median

Mean: \(\mu = \frac{1}{n} \sum_{i=1}^{n} x_i\) [average value]

Median: the 50th percentile value (middle value)… for {1,2,3,4,5}, median=3

Bonus: what is the mode?
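A minimal R sketch of these location measures; base R has no built-in mode function, so a small helper is written here purely for illustration:

  x <- c(1, 2, 2, 3, 4, 5)

  mean(x)     # average value
  median(x)   # middle (50th percentile) value

  # mode = most frequent value (illustrative helper, not a base R function)
  get_mode <- function(v) as.numeric( names( which.max( table(v) ) ) )
  get_mode(x)   # 2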

Spread/dispersion: variance/standard deviation

SD = \(\sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}\)

In essence, it is the average deviation from the mean, with larger deviations penalised more heavily (squared cost), so outliers carry much more weight.
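As a check, the formula can be computed by hand and compared with R's built-in sd() (arbitrary numbers):

  x <- c(2, 4, 4, 4, 5, 5, 7, 9)

  # by hand: squared deviations from the mean, averaged using n - 1, then square-rooted
  sqrt( sum( (x - mean(x))^2 ) / (length(x) - 1) )

  # built-in
  sd(x)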

Support and Shape

Support is the range of possible values… e.g. the Beta distribution can only range from 0 to 1

Shape is exactly that, the shape of the curve. Types may be symmetric, skewed, unimodal/bimodal

Tools for visualising distributions using R

Histograms

# histograms: same simulated data, varying the binwidth and number of bins
  ds <- data.frame( x=rnorm(100, mean=10, sd=2 )  )
  cowplot::plot_grid(
    ggplot( ds, aes(x) ) + geom_histogram( binwidth=1, fill='blue',col='black'),
    ggplot( ds, aes(x) ) + geom_histogram( binwidth=3, fill='blue',col='grey'),
    ggplot( ds, aes(x) ) + geom_histogram( bins=10, fill='blue',col='black'),
    ggplot( ds, aes(x) ) + geom_histogram( bins=20, fill='blue',col='black'),
    nrow=2
  )

Density

  ds <- data.frame( x=rnorm(50, mean=10, sd=2 ), x2= rnorm(50, mean=5, sd=2 )  )
  cowplot::plot_grid(
    ggplot( ds, aes(x) ) + geom_density( fill='blue', col='black') + labs(title='Simple'), 
    ggplot( ds, aes(x) ) + geom_density( fill='blue', col='black',  bw=0.3)+ labs(title='Change bandwidth'), 
    ggplot( ds ) + 
        geom_density( aes(x), alpha=0.4, fill='blue', col='black') + 
        geom_density( aes(x=x2), alpha=0.4, fill='red', col='black') + labs(title='Overlapping'),
    nrow=1
  )

Violin plots

  ds <- data.frame( 'Mean=10' = rnorm(50, mean=10, sd=2 ), 
                    'Mean=5'  = rnorm(50, mean=5, sd=2 ),
                    check.names = FALSE ) %>% 
          mutate( id=1:n() ) %>% 
          pivot_longer( cols=-id, names_to='type', values_to='value' )
  cowplot::plot_grid(
    ggplot( ds, aes(type, value, fill=type) ) + geom_violin( show.legend = F) + 
      labs(title='Simple'), 
    ggplot( ds, aes(type, value, fill=type) ) + geom_violin( bw=0.4,show.legend = F ) + 
      labs(title='change bandwidth'),    
    nrow=1
  )

QQ plots: to assess fit to a specific distribution

Useful for checking whether your data match a specific distribution

Plots the sample values against the theoretical quantiles. If the points fall on a straight(ish) line, there is support for that distribution. You can add a reference line of the theoretical quantiles.

geom_qq() - defaults to the normal distribution [see the code below for a non-normal example]
geom_qq_line() - plots the reference line expected if the data actually follow that distribution

# make some data
set.seed(10)
  ds <- data.frame( normal=rnorm(100, mean=10, sd=2 ), 
                    gamma = rgamma(50, shape=2, scale=1.5 )  )  # note: the 50 gamma draws are recycled to 100 rows

# get an estimate of parameters for gamma
  (gamma_parms <- MASS::fitdistr(ds$gamma, 'gamma') )
     shape         rate   
  2.08253136   0.60207926 
 (0.27420263) (0.08958277)
# plot out qqplot
  cowplot::plot_grid(
    ggplot(ds, aes(sample=normal)) + geom_qq() + geom_qq_line(col='blue') + 
        labs(title='Normal data vs. normal reference'),
    ggplot(ds, aes(sample=gamma)) + geom_qq() + geom_qq_line(col='blue') + 
        labs(title='Gamma data vs normal reference'),
    ggplot(ds, aes(sample=gamma)) + 
            geom_qq(distribution = stats::qgamma,
                    dparams=list(shape=2.08, rate=0.60) ) + 
            geom_qq_line(col='blue',
                         distribution =stats::qgamma, 
                         dparams=list(shape=2.08, rate=0.60) ) + 
        labs(title='Gamma data vs gamma reference'),
    nrow=2
  )


Continuous distributions

Note - we will talk more about these on Day 2

Normal (aka Gaussian)

Rationale behind Normal

Key Parameters for location and variance

Location: Mean \(\mu\)
Variance: \(\sigma^2\)

\[ y \sim \text{Normal}(\mu, \sigma) \] Wikipedia Normal
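A quick numeric sketch (with arbitrary parameter values) showing that the two parameters are simply the mean and SD of the draws:

  set.seed(1)
  y <- rnorm(1e4, mean = 10, sd = 2)   # y ~ Normal(mu = 10, sigma = 2)

  mean(y)   # ~10, estimate of mu
  sd(y)     # ~2,  estimate of sigma

  # roughly 95% of values fall within mu +/- 1.96*sigma
  mean( abs(y - 10) < 1.96 * 2 )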


Summary Table

Why so important in statistics?

Log-normal

Rationale behind log normal

Key parameters for location and spread/variance

Location: \(\mu\) [on the log scale]
Variance: \(\sigma^2\) [on the log scale]

\[ y \sim \text{LogNormal}(\mu, \sigma) \] Wikipedia LogNormal


Summary Table

log(Y) vs Lognormal

You probably have learned about log-transforming Y to do the following:

  1. stabilize variance
  2. normalise the data
  3. linearise relationships

When working with log(y), all predictions are in logs.

    meanlog       sdlog   
  1.03800665   0.99490672 
 (0.03146171) (0.02224679)
      mean          sd    
  1.03800665   0.99490672 
 (0.03146171) (0.02224679)
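The two blocks of output above are consistent with fitting the same data two ways: a lognormal fitted to y and a normal fitted to log(y) give identical estimates. A minimal sketch of that comparison, using simulated lognormal data (the original data/code is not shown here):

  set.seed(1)
  y <- rlnorm(1000, meanlog = 1, sdlog = 1)   # simulated lognormal data

  # fit a lognormal to y ...
  MASS::fitdistr(y, 'lognormal')

  # ... and a normal to log(y): the estimates (and SEs) match
  MASS::fitdistr(log(y), 'normal')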

Beta distribution

Rationale behind beta distribution

Key parameters for Beta and location/variance

The main parameters for the Beta are the shape parameters…

shape1 (\(\alpha\)): think of it as the number of successes
shape2 (\(\beta\)): think of it as the number of failures

These shape parameters can be combined to define the location and variance

Location: \(\mu = \frac{\alpha}{\alpha + \beta}\)
Variance: \(\sigma^2 = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}\)

Note - range of values (0,1)

\[ y \sim \text{Beta}(\alpha, \beta) \]
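A quick numeric check of those formulas (arbitrary shape values):

  set.seed(1)
  a <- 2; b <- 5                      # arbitrary shape1 (alpha) and shape2 (beta)
  y <- rbeta(1e5, shape1 = a, shape2 = b)

  mean(y);  a / (a + b)                           # simulated vs theoretical mean
  var(y);   a * b / ((a + b)^2 * (a + b + 1))     # simulated vs theoretical variance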


Summary Table

Discrete distributions

Note - we will talk more about these on Day 3

Binomial(/Bernoulli)

Rationale behind binomial

Note: the Bernoulli distribution is a special case of the Binomial in which there is only one trial (N = 1)

Key parameters for binomial

p - probability of the event
N - number of trials (e.g. number of animals tested)

Location: \(\mu = N*p\)
Variance: \(\sigma^2 = N * p * (1 - p)\)

Note - for a given N, the variance is largest when p is near 0.5

\[ y \sim \text{Binomial}(p, N) \]
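A quick numeric check of those moments (arbitrary p and N):

  set.seed(1)
  N <- 25; p <- 0.5                       # arbitrary number of trials and probability
  y <- rbinom(1e5, size = N, prob = p)    # number of successes out of N, repeated many times

  mean(y);  N * p               # simulated vs theoretical mean
  var(y);   N * p * (1 - p)     # simulated vs theoretical variance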


Summary Table

Poisson

Rationale behind Poisson

Key parameters for Poisson

\(\lambda\) - rate at which an event happens

Location: \(\mu=\lambda\)
Variance: \(\sigma^2 = \lambda\)

Note - the variance increases with the rate (the mean and variance are equal)

\[ y \sim \text{Poisson}(\lambda) \]
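A quick numeric check that the mean and variance are both \(\lambda\) (arbitrary rate):

  set.seed(1)
  lambda <- 4                        # arbitrary rate
  y <- rpois(1e5, lambda = lambda)

  mean(y)   # ~4
  var(y)    # ~4: for a Poisson, the variance equals the mean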


Summary Table

Negative Binomial

Rationale behind Negative Binomial

Key parameters for negative binomial

Location: \(\mu\)
Variance/dispersion: size (\(\theta\)) [decreasing size = higher overdispersion], giving \(\sigma^2 = \mu + \mu^2/\theta\)

Note - as \(\theta\) becomes large, the Negative Binomial approaches the Poisson (no overdispersion)

\[ y \sim \text{NegativeBinomial}(\mu, \theta) \]
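A quick numeric check of the mean-variance relationship \(\sigma^2 = \mu + \mu^2/\theta\) (arbitrary values):

  set.seed(1)
  mu <- 4; theta <- 2                          # arbitrary mean and size (dispersion)
  y <- rnbinom(1e5, mu = mu, size = theta)

  mean(y);  mu                     # simulated vs theoretical mean
  var(y);   mu + mu^2 / theta      # simulated vs theoretical variance (overdispersed)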


Mixture distributions

You can combine distributions

Note - we will talk more about these on Day 4

Zero-inflated Poisson

Rationale behind the zero-inflated Poisson

Scenario: Surveying wetland sites to count the number of frogs from rare species

    n_wetland <- 100
    n_grid <- 5
    p <- 0.8       # 80% of wetlands are empty (structural zeros)
    lambda <- 2    # mean of 2 frogs per grid cell where frogs are present
    set.seed(123)
    is_zero <- rbinom(n_wetland, 1, p)
    y <- map( 1:n_wetland, ~ if(is_zero[.x] == 1){ rep(0, n_grid) } else { rpois(n_grid, lambda) } ) %>% 
           flatten() %>% unlist() 
    
    ggplot( tibble(value = y), aes(x = value)) +
      geom_histogram(binwidth = 1, fill = "#2c7fb8", color = "white") +
      labs(title = "Histogram of Simulated ZIP Data",
           x = "No. of frogs", y = "Frequency")

Cause of the excess zeros

  1. Structural zeros: Sites where frogs are truly absent (e.g., no water, wrong vegetation).
  2. Sampling zeros: Sites where frogs are present but not detected during the survey.

Key parameters for the zero-inflated Poisson

p - probability of a structural zero from the binomial component (e.g. frogs absent from the wetland)
\(\lambda\) - Poisson rate (e.g. if frogs are present, the density of frogs)

Location: \(\mu = (1-p)*\lambda\)
Variance: \(\sigma^2 = (1 - p)*\lambda*(1+p*\lambda)\)

\[ y \sim \text{ZIP}(p, \lambda) \]
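A quick numeric check of those moments, simulating one count per draw rather than per grid cell (same p and \(\lambda\) as the wetland example):

  set.seed(123)
  n <- 1e5; p <- 0.8; lambda <- 2

  # structural zero with probability p, otherwise a Poisson count
  y <- ifelse( rbinom(n, 1, p) == 1, 0, rpois(n, lambda) )

  mean(y);  (1 - p) * lambda                          # simulated vs theoretical mean
  var(y);   (1 - p) * lambda * (1 + p * lambda)       # simulated vs theoretical variance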


Observation vs Process framework

Let’s build up an example…

# grab the data
  library(dslabs)  # package with the dataset
  data(heights)
  ds_hts <- heights
  head(ds_hts)
     sex height
1   Male     75
2   Male     70
3   Male     68
4   Male     74
5   Male     61
6 Female     65
# visualise the data  
  cowplot::plot_grid(
    ggplot(ds_hts, aes(sex, height) ) + geom_violin(),
    ggplot(ds_hts, aes( height, fill=sex) ) + geom_density( alpha=0.2 ),
    nrow=1
  )
# get key parameters
  ds_stat <- group_by(ds_hts, sex) %>% 
    summarise( ht_mean = mean(height), sd_ht = sd(height)) %>% 
    ungroup() %>% 
    mutate(diff_mean = diff(ht_mean))

Equation

\[ height_{male} \sim \text{Normal}(\mu_{male}, \sigma_{male}) \] \[ height_{female} \sim \text{Normal}(\mu_{female}, \sigma_{female}) \] Let’s assume males and females have the same variation: \(\sigma_{female}\) = \(\sigma_{male} = \sigma\)

\[ height_{male} \sim \text{Normal}(\mu_{male}, \sigma) \] \[ height_{female} \sim \text{Normal}(\mu_{female}, \sigma) \]

Okay, let’s now simplify more

\[ y_i \sim \text{Normal}(\mu_i, \sigma) \] where i = male or female and \(y_i = height_i\)

\[ \mu_i = sex \]

Observation component (distribution)

\[ y_i \sim \text{Normal}(\mu_i, \sigma) \]

Process component

\[ \mu_i = sex \]

More accurately… \[ \mu_i = \text{Intercept} + \text{Male}*X \]

the R formula

  glm( height ~ sex, data=ds_hts, family=gaussian )  # equivalent to lm(height ~ sex)

Example with the coins

Observation component (distribution)

\[ y_i \sim \text{Binomial}(p_i, N) \]

where i = fair or weighted and \(y_i = n_{successes}\)

Process component

\[ p_i = \text{coin\_type} \]

More accurately… \[ p_i = Intercept + Weighted * X \]

the R formula

  ds_coin <- rename(ds_coin, coin_type=trt) %>% 
              mutate(n_failures = n_total - n_success)
  # a binomial GLM takes a two-column matrix of successes and failures on the left-hand side
  glm( cbind(n_success, n_failures) ~ coin_type, data=ds_coin, family=binomial )

Bonus question: where do random effects (which are themselves a statistical distribution) come into the observation-process model formulation?


What about non-parametric approaches? Where do they fit in?

1. Limited Modeling Flexibility

- Nonparametric tests (e.g., Mann–Whitney U, Kruskal–Wallis) are typically designed for simple comparisons.
- They do not easily accommodate multiple predictors, interactions, or random effects, which are common in ecological models.

2. Reduced Statistical Power

- Nonparametric tests often have lower power than parametric tests when parametric assumptions are reasonably met.
- This means a higher chance of Type II errors (failing to detect a true effect).

# set up data: control vs treatment with a modest true difference in means (10 vs 11, SD = 3)
  set.seed(20)
  n=40
  makeData<-function(){
   data.frame( trt=rep( c('ctrl','trt'), each=n ), 
                    resp=c(rnorm(n,10,3), rnorm(n,11,3)) )
  }
  ds <-makeData()
  ggplot( ds, aes(resp, fill=trt) ) + geom_density(alpha=0.4)
  wilcox.test( resp~trt, data=ds) 

    Wilcoxon rank sum exact test

data:  resp by trt
W = 570, p-value = 0.02666
alternative hypothesis: true location shift is not equal to 0
  summary(lm(resp~trt, data=ds) )

Call:
lm(formula = resp ~ trt, data = ds)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.1297 -2.0481 -0.0577  2.3324  5.8958 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.4606     0.4667  20.271   <2e-16 ***
trttrt        1.5999     0.6600   2.424   0.0177 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.952 on 78 degrees of freedom
Multiple R-squared:  0.07006,   Adjusted R-squared:  0.05813 
F-statistic: 5.876 on 1 and 78 DF,  p-value: 0.01766
# run tests as nonparametric and parametric and look at power
  runTest <- function(d){ 
    bind_rows(
      broom::tidy(wilcox.test( resp~trt, data=d)) %>% select(method, p.value),
      broom::tidy(lm(resp~trt, data=d) ) %>% 
        filter(term=='trttrt') %>%             # keep the treatment effect only (drop the intercept)
        select(p.value) %>% mutate(method='parametric')
    )
    }
  ds_sim <- map_dfr( 1:100,~makeData() %>% mutate(rep=.x) ) %>%
            group_by(rep) %>% nest() %>% mutate( data= map(data, runTest) ) %>% 
            unnest('data') %>% mutate(sig= p.value<0.05) 
  group_by(ds_sim, method) %>% summarise( power=sum(sig)/n() ) %>% 
    ggplot(aes(method, power) ) + geom_col(fill='blue',col='black') +
      scale_y_continuous( labels=scales::percent, limits=c(0,1) )

3. Difficulty Handling Complex Designs

- Repeated measures, nested data, or hierarchical structures are hard to model with traditional nonparametric tests.
- Mixed models or generalized linear models (GLMs/GLMMs) are better suited for such designs.

4. Interpretation Challenges

- Nonparametric tests typically assess ranks rather than actual values.
- This makes it harder to interpret effect sizes or make predictions based on model coefficients.

5. Limited Diagnostic Tools

- Parametric models offer rich diagnostics (e.g., residual plots, AIC/BIC, goodness-of-fit).
- Nonparametric tests lack these tools, making it harder to assess model adequacy.

6. Assumption of Independence

- Many nonparametric tests assume independent observations.
- Violations (e.g., spatial or temporal autocorrelation) can lead to misleading results.

7. No Direct Extensions to Multivariate Data

- Nonparametric tests are not easily extended to multivariate or multilevel data structures.
- Techniques like PERMANOVA exist but have their own limitations and assumptions.