library(ggplot2) # plotting (qplot)
library(popbio)  # damping.ratio
library(gtools)  # rdirichlet
When we have a set of Bernoulli trials with a yes/no outcome, we typically estimate the probability of success with the standard frequentist estimate:
\[p(success) = n_s/N\]
where n_s is the number of successes and N is the total number of trials. With that estimate in hand, we can simulate a binomial distribution to generate the distribution of successes for a given number of trials.
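For example, suppose we observed 1 success in 6 trials; here is a minimal sketch (with these made-up numbers) of simulating the number of successes in repeated sets of 6 trials:
pHat <- 1/6 # hypothetical frequentist estimate: 1 success in 6 trials
binomSim <- rbinom(n=10000,size=6,prob=pHat)
qplot(x=binomSim,color=I("black"),fill=I("goldenrod"))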
But a more useful analysis is based on the beta distribution, which describes the distribution of plausible underlying p values given an observed set of successes and failures. For example, using the beta distribution with 1 success and 5 failures, we have:
Success <- 1
Failure <- 5
# the beta needs shape1 and shape2, which represent the number of successes + 1 and the number of failures + 1
betaSim1 <- rbeta(n=10000,shape1=Success+1,shape2=Failure+1)
qplot(x=betaSim1,color=I("black"),fill=I("goldenrod"),xlim=c(0,1))
It seems intuitive that if we have a larger sample size of trials, our estimate of the probability will be more precise:
Success <- 10
Failure <- 50
# the beta needs shape1 and shape2, which represent the number of successes + 1 and the number of failures + 1
betaSim1 <- rbeta(n=10000,shape1=Success+1,shape2=Failure+1)
qplot(x=betaSim1,color=I("black"),fill=I("goldenrod"),xlim=c(0,1))
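One way to see the gain in precision is to compare 95% quantile intervals from the two sample sizes; a quick sketch (betaSmall and betaLarge are just temporary names for the two simulations above):
betaSmall <- rbeta(n=10000,shape1=1+1,shape2=5+1)
betaLarge <- rbeta(n=10000,shape1=10+1,shape2=50+1)
quantile(betaSmall,probs=c(0.025,0.975)) # wide interval with only 6 trials
quantile(betaLarge,probs=c(0.025,0.975)) # much narrower interval with 60 trials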
The Dirichlet distribution behaves the same way, but it applies to a multinomial process, in which there are multiple possible outcomes rather than just the binary response we are used to with the binomial distribution.
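For a single set of counts across several categories, one call to rdirichlet (from the gtools package) gives a plausible vector of outcome probabilities; here is a small sketch with made-up counts:
counts <- c(3,1,0,6) # hypothetical counts for 4 possible outcomes
p <- rdirichlet(n=1,alpha=counts+1) # shape parameters = counts + 1
print(p)
sum(p) # each sampled probability vector sums to 1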
Here is a simple function that uses the Dirichlet sampler to generate a matrix of transition probabilities:
##################################################
# function: dirichletSampler
# Creates transition matrix with sampling uncertainty
# input: matrix of raw transition counts
# output: matrix of transition probabilities sampled from a Dirichlet for each column
# note: add 1 to each matrix element for proper interpretation
# of rdirichlet parameters as param = number of successes + 1
#-------------------------------------------------
dirichletSampler <- function(m=matrix(rpois(n=16,lambda=5),nrow=4)) {
  z <- apply(m+1,2,rdirichlet,n=1)
  return(list(data=m,trans=z))
}
#-------------------------------------------------
dirichletSampler() # sample output of a transition matrix
## $data
## [,1] [,2] [,3] [,4]
## [1,] 3 6 5 8
## [2,] 1 6 8 7
## [3,] 3 5 8 11
## [4,] 7 4 2 6
##
## $trans
## [,1] [,2] [,3] [,4]
## [1,] 0.13594658 0.2057730 0.22572025 0.2547917
## [2,] 0.04126061 0.2966261 0.38808660 0.2381026
## [3,] 0.31132812 0.2725394 0.31491451 0.3787989
## [4,] 0.51146468 0.2250614 0.07127863 0.1283068
So, the input is a matrix of counts (data), and the output is a matrix of transition probabilities (trans) for a single sampling iteration.
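Because each column is drawn independently from its own Dirichlet, the columns of trans should each sum to 1; a quick sanity check:
colSums(dirichletSampler()$trans) # each column should sum to 1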
Here is an example from our work on ant occurrences in nest boxes at Harvard and Duke Forest. The data consist of counts: for each nest box, we record the identity of the species occupying it in one month (time t) and in the next month (time t + 1). These counts can be organized into a square transition matrix:
stages <- c("Empty",paste("Species",LETTERS[1:3]))
# use the Duke Forest heated chamber counts
A <- matrix(c(245,14,8,0,
              18,17,0,2,
              8,1,0,5,
              1,0,0,1),nrow=length(stages),byrow=TRUE,dimnames=list(stages,stages))
print(A)
## Empty Species A Species B Species C
## Empty 245 14 8 0
## Species A 18 17 0 2
## Species B 8 1 0 5
## Species C 1 0 0 1
For this matrix of raw counts, we divide by column totals for a simple frequentist estimate of transition probabilities:
scaledA <- apply(A,2, function(x) x/sum(x))
print(scaledA)
## Empty Species A Species B Species C
## Empty 0.900735294 0.43750 1 0.000
## Species A 0.066176471 0.53125 0 0.250
## Species B 0.029411765 0.03125 0 0.625
## Species C 0.003676471 0.00000 0 0.125
Finally, using the popbio package, we calculate the damping ratio from this matrix, which is a measure of neighborhood stability (the rate of return to equilibrium following a small perturbation):
damping.ratio(scaledA)
## [1] 2.156474
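Under the hood, the damping ratio is the dominant eigenvalue divided by the modulus of the subdominant eigenvalue; here is a minimal sketch of that calculation for intuition (damping.ratio in popbio takes care of the details):
ev <- sort(Mod(eigen(scaledA)$values),decreasing=TRUE) # eigenvalue moduli, largest first
ev[1]/ev[2] # ratio of dominant to subdominant eigenvalue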
But given the sampling uncertainty, we could just as easily have obtained this set of transition probabilities:
sampleMatrix <- dirichletSampler(A)
print(sampleMatrix)
## $data
## Empty Species A Species B Species C
## Empty 245 14 8 0
## Species A 18 17 0 2
## Species B 8 1 0 5
## Species C 1 0 0 1
##
## $trans
## Empty Species A Species B Species C
## [1,] 0.868142536 0.38619878 0.72444070 0.05095593
## [2,] 0.106141396 0.47122398 0.01905835 0.21365796
## [3,] 0.023188726 0.09449398 0.12549170 0.34855249
## [4,] 0.002527343 0.04808327 0.13100925 0.38683362
Note in particular how different the sample estimates are for the Species B column compared to the frequentist estimate: that column is based on only 8 observed transitions, so its probabilities are the least constrained by the data.
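To see how much that column can vary, we can draw repeated samples and look at the spread of just the Species B to Empty transition (a quick sketch; that probability sits in row 1, column 3 of trans):
bToEmpty <- replicate(1000,dirichletSampler(A)$trans[1,3]) # Species B -> Empty
quantile(bToEmpty,probs=c(0.025,0.975))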
And the damping ratio would be:
damping.ratio(sampleMatrix$trans)
## [1] 1.981846
Now, let’s repeat this process to generate a distribution of damping ratios based on samples from the Dirichlet:
simDat <- replicate(1000,damping.ratio(dirichletSampler(A)$trans))
qplot(x=simDat,col=I("black"),fill=I("goldenrod")) +
  geom_vline(xintercept=damping.ratio(scaledA),color="red",size=1.3) +
  geom_vline(xintercept=quantile(simDat,probs=c(0.025,0.975)),linetype="dashed")
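We can also summarize the simulated distribution directly, for example with the median and a 95% quantile interval of the damping ratio:
quantile(simDat,probs=c(0.025,0.5,0.975)) # median and 95% interval of damping ratios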
If you ever have count data that represent a multinomial sampling process, the Dirichlet distribution is an effective way to generate variation in probabilities that reflects the uncertainty in your samples. The smaller your number of counts, the greater your uncertainty in the frequentist estimate of the probability.