Gentle introduction to statistical distributions: what are they and why are they important
Note - if attending online, feel free to grab a fair coin of your own, as well as enter your height
As part of today’s presentation, we will use data we collect from simple measurement experiments. We have set up two stations to collect the data, which will be entered into an online form.
There are two coins at the table, and you will flip each coin following the steps below (also listed on the online form). One coin has a washer attached (“Weighted”) and the other is a regular (“Fair”) coin.
Steps:
You will measure the height of the cutout of Wayne holding the eel
Steps:
Scenario: Measuring the length of fish at a fishway
Break up by species…
There are well over 100 named distributions
Coin toss experiment
Let’s take a look at the data…
# histogram of the number of successes per coin type
ggplot(ds_coin %>% filter(n_total == 25), aes(n_success)) +
  geom_histogram(col = 'white') +
  facet_wrap(~trt) +
  labs(x = 'No. of successes')
Animal spacing
Let’s walk through the classic ecology-course treatment of animal spacing…
from https://bio.libretexts.org/Bookshelves/Introductory_and_General_Biology
Scenario: Imagine we set up a sampling transect of 100 grid cells that we search for the presence of birds. There are 200 birds that distribute themselves among the 100 cells, differing in their degree of spacing from uniform to random to modestly clumped.
# simulate classical spacing of animals
n_grids <- 100
n_birds <- 200
set.seed(12)
p <- rbeta(n_grids, 1, 6)   # grid preferences for the clumped scenario
ds_all <- tibble(grid_id = 1:n_grids,
                 `1) equal`   = rep(round(n_birds / n_grids), n_grids),
                 `2) random`  = as.vector(rmultinom(1, n_birds, prob = rep(1 / n_grids, n_grids))),
                 `3) clumped` = as.vector(rmultinom(1, n_birds, prob = p / sum(p)))) %>%
  pivot_longer(cols = -grid_id, names_to = 'type', values_to = 'n')
#--- visualise the results ---#
ds_stat <- group_by(ds_all, type) %>% summarise( mean=round( mean(n) ), sd= round(sd(n),1) )
ggplot(ds_all, aes(n) ) + geom_histogram(binwidth = 1,fill='grey', col='black') + facet_wrap(~type) +
geom_text(data=ds_stat, aes( x=3, y=Inf, label=paste('Mean=',mean,'(SD=',sd,')') ),
vjust=2, hjust=0 ) +
labs( x='No. birds per grid cell', y='Count' )
Example from height data
Let’s look at height data…
# height data
# d <- dslabs::heights   # backup data if we cannot download from the form
d <- ds_ht %>% filter(trt == 'measurement')
# running mean and standard error as the sample size grows from 3 to 30
d_run <- map_df(3:30, ~ data.frame(mean_ht = mean(d$height[1:.x]),
                                   sample_size = .x,
                                   se = sd(d$height[1:.x]) / sqrt(.x)))
ggplot(d_run, aes(sample_size, mean_ht, ymin = mean_ht - se, ymax = mean_ht + se)) +
  geom_pointrange()
Random variation in height is of interest in quantitative genetics [e.g. hints about stabilising selection, gene-environment interactions]
e.g. “male vulnerability hypothesis” to try to explain differing variation between human sexes
Types of selection
from https://numustafa.medium.com/
Take a look at the normal distribution in the Shiny app
A continuous distribution in statistics refers to a probability distribution in which the random variable can take any value within a given range, often an interval on the real number line.
A discrete distribution is a type of probability distribution that describes the likelihood of outcomes for a discrete random variable — one that can take on countable values, often integers or categories.
Mean: \(\mu = \frac{1}{n} \sum_{i=1}^{n} x_i\) [average value]
Median: the 50th percentile value (middle value)… for {1,2,3,4,5}, median=3
Bonus: what is the mode?
SD = \(\sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}\)
In essence, it is the average difference from the mean, penalised by the magnitude of the difference (a squared cost), so outliers carry much more weight.
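A small numerical sketch (not from the original slides) of how a single outlier inflates the SD far more than the mean:
# illustrative: one outlier moves the mean modestly but inflates the SD dramatically
x     <- c(10, 11, 9, 10, 12)
x_out <- c(x, 30)
c(mean = mean(x), mean_outlier = mean(x_out))   # 10.4 vs ~13.7
c(sd = sd(x), sd_outlier = sd(x_out))           # ~1.1 vs ~8.1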
Support is the range of possible values… e.g. the Beta distribution can only range from 0 to 1
Shape is exactly that, the shape of the curve: it may be symmetric or skewed, unimodal or bimodal
# histograms: the same data plotted with different binwidths and bin counts
ds <- data.frame( x=rnorm(100, mean=10, sd=2 ) )
cowplot::plot_grid(
ggplot( ds, aes(x) ) + geom_histogram( binwidth=1, fill='blue',col='black'),
ggplot( ds, aes(x) ) + geom_histogram( binwidth=3, fill='blue',col='grey'),
ggplot( ds, aes(x) ) + geom_histogram( bins=10, fill='blue',col='black'),
ggplot( ds, aes(x) ) + geom_histogram( bins=20, fill='blue',col='black'),
nrow=2
)
# density plots: default bandwidth, narrower bandwidth, and overlapping groups
ds <- data.frame(x = rnorm(50, mean = 10, sd = 2), x2 = rnorm(50, mean = 5, sd = 2))
cowplot::plot_grid(
ggplot( ds, aes(x) ) + geom_density( fill='blue', col='black') + labs(title='Simple'),
ggplot( ds, aes(x) ) + geom_density( fill='blue', col='black', bw=0.3)+ labs(title='Change bandwidth'),
ggplot( ds ) +
geom_density( aes(x), alpha=0.4, fill='blue', col='black') +
geom_density( aes(x=x2), alpha=0.4, fill='red', col='black') + labs(title='Overlapping'),
nrow=1
)
# violin plots of two groups (use tibble with backticks: data.frame() would mangle a name like 'Mean=10')
ds <- tibble(`Mean=10` = rnorm(50, mean = 10, sd = 2),
             `Mean=5`  = rnorm(50, mean = 5,  sd = 2)) %>%
  mutate(id = 1:n()) %>%
  pivot_longer(cols = -id, names_to = 'type', values_to = 'value')
cowplot::plot_grid(
ggplot( ds, aes(type, value, fill=type) ) + geom_violin( show.legend = F) +
labs(title='Simple'),
ggplot( ds, aes(type, value, fill=type) ) + geom_violin( bw=0.4,show.legend = F ) +
labs(title='change bandwidth'),
nrow=1
)
QQ plots are useful for checking whether your data match a specific distribution.
A QQ plot plots the sample values against the theoretical quantiles of a candidate distribution. If the points fall on a straight(ish) line, that supports the candidate distribution. You can add a reference line of the theoretical quantiles.
geom_qq() - defaults to the normal distribution [see the code below for a non-normal example]
geom_qq_line() - plots the reference line expected if the data truly follow that distribution
# make some data
set.seed(10)
ds <- data.frame(normal = rnorm(100, mean = 10, sd = 2),
                 gamma  = rgamma(50, shape = 2, scale = 1.5))   # note: the 50 gamma draws are recycled to length 100 by data.frame
# get an estimate of parameters for gamma
(gamma_parms <- MASS::fitdistr(ds$gamma, 'gamma') )
shape rate
2.08253136 0.60207926
(0.27420263) (0.08958277)
# plot out qqplot
cowplot::plot_grid(
ggplot(ds, aes(sample=normal)) + geom_qq() + geom_qq_line(col='blue') +
labs(title='Normal data vs. normal reference'),
ggplot(ds, aes(sample=gamma)) + geom_qq() + geom_qq_line(col='blue') +
labs(title='Gamma data vs normal reference'),
ggplot(ds, aes(sample=gamma)) +
geom_qq(distribution = stats::qgamma,
dparams=list(shape=2.08, rate=0.60) ) +
geom_qq_line(col='blue',
distribution =stats::qgamma,
dparams=list(shape=2.08, rate=0.60) ) +
labs(title='Gamma data vs gamma reference'),
nrow=2
)
Note - we will talk more about this on Day 2
Location: Mean \(\mu\) Variance: \(\sigma^2\)
\[ y \sim \text{Normal}(\mu, \sigma) \] Wikipedia Normal
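A quick sanity check (illustrative, not from the original slides) that random draws recover the location and scale parameters:
# draw from Normal(mu = 10, sigma = 2) and recover the parameters
set.seed(1)
y <- rnorm(1e4, mean = 10, sd = 2)
c(mean = mean(y), sd = sd(y))   # approximately 10 and 2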
Distribution | Type | No. parameters | Parameters | Boundary | R function |
---|---|---|---|---|---|
Gaussian | Continuous | 2 | Mean (μ), SD (σ) | (-Inf, Inf) | rnorm |
Location: \(\mu\) (on the log scale) Variance: \(\sigma^2\) (on the log scale)
\[ y \sim \text{LogNormal}(\mu, \sigma) \] Wikipedia LogNormal
Distribution | Type | No. parameters | Parameters | Boundary | R function |
---|---|---|---|---|---|
Gaussian | Continuous | 2 | Mean (μ), SD (σ) | (-Inf, Inf) | rnorm |
Log-normal | Continuous | 2 | Mean (μ), SD (σ) on the log scale | (0, Inf) | rlnorm |
You have probably learned about log-transforming Y to do the following: 1) stabilise variance, 2) normalise the data, 3) linearise relationships.
When working with log(y), all predictions are in logs.
Fitting a log-normal to y gives:
meanlog sdlog
1.03800665 0.99490672
(0.03146171) (0.02224679)
Fitting a normal to log(y) gives identical estimates:
mean sd
1.03800665 0.99490672
(0.03146171) (0.02224679)
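A minimal sketch of that equivalence (the original y is not shown, so this uses simulated data):
# illustrative: a log-normal fit of y and a normal fit of log(y) give the same estimates
set.seed(2)
y <- rlnorm(1000, meanlog = 1, sdlog = 1)
MASS::fitdistr(y, 'lognormal')     # reports meanlog, sdlog
MASS::fitdistr(log(y), 'normal')   # identical values, reported as mean, sd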
The Beta distribution can be derived several ways, all fairly mathematical (e.g. as a conjugate prior, or from order statistics: the kth smallest of n Uniform(0,1) samples).
The Beta distribution is mainly a phenomenological/convenience distribution in ecology.
The main parameters for the Beta are its shape parameters… shape1 (α): think of it as the number of successes; shape2 (β): think of it as the number of failures.
These shape parameters can be combined to define the location and variance:
Location: \(\mu = \frac{\alpha}{\alpha + \beta}\) Variance: \(\sigma^2 = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}\)
Note - range of values (0,1)
\[ y \sim \text{Beta}(\alpha, \beta) \]
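A small sketch (not from the original slides) showing how the successes/failures intuition moves the Beta around; it assumes the tidyverse is loaded, as elsewhere in this document:
# illustrative: Beta densities for a few shape1 (successes) / shape2 (failures) pairs
x <- seq(0.001, 0.999, length.out = 200)
ds_beta <- bind_rows(data.frame(x = x, density = dbeta(x, 1, 1), parms = 'shape1=1, shape2=1'),
                     data.frame(x = x, density = dbeta(x, 2, 5), parms = 'shape1=2, shape2=5'),
                     data.frame(x = x, density = dbeta(x, 5, 2), parms = 'shape1=5, shape2=2'))
ggplot(ds_beta, aes(x, density, col = parms)) + geom_line()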
Distribution | Type | No. parameters | Parameters | Boundary | R function |
---|---|---|---|---|---|
Gaussian | Continuous | 2 | Mean (μ), SD (σ) | (-Inf, Inf) | rnorm |
Log-normal | Continuous | 2 | Mean (μ), SD (σ) on the log scale | (0, Inf) | rlnorm |
Beta | Continuous | 2 | Shape α (shape1), β (shape2) | (0, 1) | rbeta |
Note - we will talk more about this on Day 3
Note: the Bernoulli distribution is a special case of the Binomial in which only one trial occurs
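As a quick illustration (not from the original), rbinom() with size = 1 draws Bernoulli outcomes:
# Bernoulli via the Binomial: a single trial per draw
set.seed(1)
rbinom(10, size = 1, prob = 0.5)   # ten 0/1 outcomes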
p - probability of an event; N - number of trials (e.g. number of animals tested)
Location: \(\mu = N*p\) Variance: \(\sigma^2 = N * p * (1 - p)\)
Note - for a fixed N, the variance is largest when p is near 0.5
\[ y \sim \text{Binomial}(p, N) \]
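A quick simulation check (illustrative) of the mean and variance formulas:
# simulate Binomial(N = 25, p = 0.5) and compare to N*p and N*p*(1-p)
set.seed(3)
y <- rbinom(1e4, size = 25, prob = 0.5)
c(mean = mean(y), var = var(y))   # approximately 12.5 and 6.25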
Distribution | Type | No. parameters | Parameters | Boundary | R function |
---|---|---|---|---|---|
Gaussian | Continuous | 2 | Mean (μ), SD (σ) | (-Inf, Inf) | rnorm |
Log-normal | Continuous | 2 | Mean (μ), SD (σ) on the log scale | (0, Inf) | rlnorm |
Beta | Continuous | 2 | Shape α (shape1), β (shape2) | (0, 1) | rbeta |
Binomial | Discrete | 2 | No. trials (N), probability (p) | [0, 1, 2, ..., N] | rbinom |
\(\lambda\) - the rate at which an event happens
Location: \(\mu=\lambda\) Variance: \(\sigma^2 = \lambda\)
Note - the variance increases with the rate: the mean and variance are both \(\lambda\)
\[ y \sim \text{Poisson}(\lambda) \]
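A quick simulation check (illustrative) that the mean and variance both equal \(\lambda\):
# simulate Poisson(lambda = 4): mean and variance should match
set.seed(4)
y <- rpois(1e4, lambda = 4)
c(mean = mean(y), var = var(y))   # both approximately 4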
Distribution | Type | No. parameters | Parameters | Boundary | R function |
---|---|---|---|---|---|
Gaussian | Continuous | 2 | Mean (μ), SD (σ) | (-Inf, Inf) | rnorm |
Log-normal | Continuous | 2 | Mean (μ), SD (σ) on the log scale | (0, Inf) | rlnorm |
Beta | Continuous | 2 | Shape α (shape1), β (shape2) | (0, 1) | rbeta |
Binomial | Discrete | 2 | No. trials (N), probability (p) | [0, 1, 2, ..., N] | rbinom |
Poisson | Discrete | 1 | Rate (λ) | [0, 1, 2, ..., Inf) | rpois |
Location: \(\mu\) Variance/dispersion: size (\(\theta\)) [ decreasing size = higher overdispersion ]
Note - as \(\theta\) increases, the negative binomial approaches the Poisson; as \(\theta\) decreases, the variance (\(\mu + \mu^2/\theta\)) grows beyond the mean
\[ y \sim \text{NegativeBinomial}(\mu, \theta) \]
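A quick simulation check (illustrative) of how shrinking size (\(\theta\)) inflates the variance, using the mu/size parameterisation of rnbinom():
# variance = mu + mu^2/theta, so smaller size means more overdispersion
set.seed(5)
c(var_size10 = var(rnbinom(1e4, mu = 4, size = 10)),   # ~ 4 + 16/10 = 5.6
  var_size1  = var(rnbinom(1e4, mu = 4, size = 1)))    # ~ 4 + 16 = 20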
You can combine distributions
Note - we will talk more about this on Day 4
Scenario: Surveying wetland sites to count the number of frogs from rare species
n_wetland <- 100
n_grid <- 5
p <- 0.8       # 80% of wetlands empty
lambda <- 2    # 2 frogs per grid when occupied
set.seed(123)
is_zero <- rbinom(n_wetland, 1, p)   # which wetlands are structurally empty
# grids in empty wetlands get 0 counts; otherwise counts are Poisson(lambda)
y <- map(1:n_wetland, ~ if (is_zero[.x] == 1) rep(0, n_grid) else rpois(n_grid, lambda)) %>% unlist()
ggplot( tibble(value = y), aes(x = y)) +
geom_histogram(binwidth = 1, fill = "#2c7fb8", color = "white") +
labs(title = "Histogram of Simulated ZIP Data",
x = "No. of frogs", y = "Frequency")
What causes the excess zeros?
p - probability of the zero (binomial) component (e.g. frogs absent from the wetland) \(\lambda\) - rate (e.g. if frogs are present, the density of frogs)
Location: \(\mu = (1-p)*\lambda\) Variance: \(\sigma^2 = (1 - p)*\lambda*(1+p*\lambda)\)
\[ y \sim \text{ZIP}(p, \lambda) \]
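We can check these formulas against the simulation above (illustrative; reuses the y, p, and lambda objects already defined):
# compare simulated mean/variance to the ZIP formulas
c(mean_sim = mean(y), mean_theory = (1 - p) * lambda)                   # (1-0.8)*2 = 0.4
c(var_sim = var(y), var_theory = (1 - p) * lambda * (1 + p * lambda))   # 0.4*2.6 = 1.04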
# grab the data
library(dslabs) # package with the dataset
data(heights)
ds_hts <- heights
head(ds_hts)
sex height
1 Male 75
2 Male 70
3 Male 68
4 Male 74
5 Male 61
6 Female 65
# visualise the data
cowplot::plot_grid(
ggplot(ds_hts, aes(sex, height) ) + geom_violin(),
ggplot(ds_hts, aes( height, fill=sex) ) + geom_density( alpha=0.2 ),
nrow=1
)
\[ height_{male} \sim \text{Normal}(\mu_{male}, \sigma_{male}) \] \[ height_{female} \sim \text{Normal}(\mu_{female}, \sigma_{female}) \] Let’s assume males and females have the same variation: \(\sigma_{female}\) = \(\sigma_{male} = \sigma\)
\[ height_{male} \sim \text{Normal}(\mu_{male}, \sigma) \] \[ height_{female} \sim \text{Normal}(\mu_{female}, \sigma) \]
Okay, let’s now simplify more
\[ y_i \sim \text{Normal}(\mu_i, \sigma) \] where i = male or female and \(y_i = height_i\)
\[ \mu_i = sex \]
More accurately… \[ \mu_i = Intercept + Male*X \] where X is an indicator (1 = male, 0 = female)
glm( height ~ sex, data=ds_hts, family=gaussian )
\[ y_i \sim \text{Binomial}(p_i, N) \]
where i = fair or weighted and \(y_i = n_{successes}\)
\[ p_i = \text{coin_type} \]
More accurately… \[ p_i = Intercept + Weighted*X \] where X is an indicator (1 = weighted, 0 = fair)
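A hedged sketch of the corresponding GLM call for the coin data, assuming ds_coin has the n_success, n_total, and trt columns used earlier (note that glm's binomial family works on the logit scale, so coefficients are log-odds rather than raw probabilities):
# binomial GLM for the coin-toss experiment
glm(cbind(n_success, n_total - n_success) ~ trt,
    data = ds_coin %>% filter(n_total == 25),
    family = binomial)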
Bonus question: where do random effects (which are themselves a statistical distribution) come into the observation-process model formulation?
1. Limited Modeling Flexibility
- Nonparametric tests (e.g., Mann–Whitney U, Kruskal–Wallis) are typically designed for simple comparisons.
- They do not easily accommodate multiple predictors, interactions, or random effects, which are common in ecological models.
2. Reduced Statistical Power
- Nonparametric tests often have lower power than parametric tests when parametric assumptions are reasonably met.
- This means a higher chance of Type II errors (failing to detect a true effect).
# set up data: control vs treatment with a true difference of 1 (SD = 3)
set.seed(20)
n <- 40
makeData <- function(){
  data.frame(trt = rep(c('ctrl', 'trt'), each = n),
             resp = c(rnorm(n, 10, 3), rnorm(n, 11, 3)))
}
ds <- makeData()
ggplot( ds, aes(resp, fill=trt) ) + geom_density(alpha=0.4)
wilcox.test( resp~trt, data=ds)
Wilcoxon rank sum exact test
data: resp by trt
W = 570, p-value = 0.02666
alternative hypothesis: true location shift is not equal to 0
Call:
lm(formula = resp ~ trt, data = ds)
Residuals:
Min 1Q Median 3Q Max
-8.1297 -2.0481 -0.0577 2.3324 5.8958
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.4606 0.4667 20.271 <2e-16 ***
trttrt 1.5999 0.6600 2.424 0.0177 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.952 on 78 degrees of freedom
Multiple R-squared: 0.07006, Adjusted R-squared: 0.05813
F-statistic: 5.876 on 1 and 78 DF, p-value: 0.01766
# run tests as nonparametric and parametric and compare power
runTest <- function(d){
  bind_rows(
    broom::tidy(wilcox.test(resp ~ trt, data = d)) %>% select(method, p.value),
    broom::tidy(lm(resp ~ trt, data = d)) %>%
      filter(term == 'trttrt') %>%   # keep the treatment effect, not the intercept
      select(p.value) %>% mutate(method = 'parametric')
  )
}
ds_sim <- map_dfr(1:100, ~ makeData() %>% mutate(rep = .x)) %>%
  group_by(rep) %>% nest() %>% mutate(data = map(data, runTest)) %>%
  unnest('data') %>% mutate(sig = p.value < 0.05)
group_by(ds_sim, method) %>% summarise(power = sum(sig) / n()) %>%
  ggplot(aes(method, power)) + geom_col(fill = 'blue', col = 'black') +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1))
3. Difficulty Handling Complex Designs
- Repeated measures, nested data, or hierarchical structures are hard to model with traditional nonparametric tests.
- Mixed models or generalized linear models (GLMs/GLMMs) are better suited for such designs.
4. Interpretation Challenges
- Nonparametric tests typically assess ranks rather than actual values.
- This makes it harder to interpret effect sizes or make predictions based on model coefficients.
5. Limited Diagnostic Tools
- Parametric models offer rich diagnostics (e.g., residual plots, AIC/BIC, goodness-of-fit).
- Nonparametric tests lack these tools, making it harder to assess model adequacy.
6. Assumption of Independence
- Many nonparametric tests assume independent observations.
- Violations (e.g., spatial or temporal autocorrelation) can lead to misleading results.
7. No Direct Extensions to Multivariate Data
- Nonparametric tests are not easily extended to multivariate or multilevel data structures.
- Techniques like PERMANOVA exist but have their own limitations and assumptions.