To build intuition for the theory behind frequentist confidence intervals, and for how violated model assumptions can bias our results, we run a set of simulations below that estimate a single known parameter, referred to in the code as trueValue.
library(dplyr)
library(ggplot2)
trueValue <- 3 # true underlying value
sigma_ <- 1.5 # noise added to observation process
m <- 50 # observations per trial
n <- 100 # number of trials
set.seed(123) # make this analysis repeatable
Assumptions Met
In our first simulation every model assumption is met. Each observation is centered on the true value, with normally distributed noise; the samples are independent, identically distributed, and representative of the population we are drawing from. Below we run 100 trials of 50 observations each: we simulate data around the true value, estimate that value with a linear model, and construct a confidence interval for each trial. As the plot shows, the intervals cover the true value about 95% of the time, exactly as the frequentist definition requires: “If confidence intervals are constructed using a given confidence level in an infinite number of independent experiments, the proportion of those intervals that contain the true value of the parameter will match the confidence level”.
# simulate 100 sets of observations and estimate underlying value
modelEstimates <- lapply(1:n, function(x)
  glm(rnorm(m, trueValue, sigma_) ~ 1))
# grab the estimates and the confidence intervals
DF <- bind_rows(lapply(modelEstimates, function(x){
  coefficients(x) %>%
    as.data.frame %>%
    cbind(t(as.data.frame(confint(x))))})) %>%
  mutate(model=1:n) %>%
  mutate(Cover=`2.5 %` <= trueValue & trueValue <= `97.5 %`)
names(DF)[1] <- "Estimate"
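Before plotting we can tally the empirical coverage directly; with these settings the share of intervals containing trueValue should land near .95.
mean(DF$Cover) # proportion of the 100 intervals covering the true value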
# plot the results
DF %>%
  ggplot(aes(x=model, y=Estimate, ymin=`2.5 %`, ymax=`97.5 %`, color=Cover)) +
  geom_point() +
  geom_errorbar() +
  theme_classic() +
  coord_flip() +
  geom_hline(yintercept=trueValue) +
  xlab("") +
  labs(title="Proper Assumptions")
To show that this example is not a fluke, we can repeat the entire simulation 100 more times and, for each repetition, record the percentage of intervals that cover the true population value. The distribution plotted below is clearly centered around .95.
# repeat the full simulation n times, recording the coverage rate each time
monteCarlo1 <- sapply(1:n, function(i){
  mods_ <- lapply(1:n, function(x) glm(rnorm(m, trueValue, sigma_) ~ 1))
  int_ <- sapply(mods_, function(x) confint(x))
  mean((int_[1,] < trueValue) & (int_[2,] > trueValue))
})
qplot(monteCarlo1, geom="density") +
  theme_classic() +
  labs(x="Percent Coverage", y="Density", title="Proper Assumptions")
Skewed Noise
One way our model can be biased is when the error term violates the normality assumption. Below we show a noise distribution that is zero-centered, so it does not shift our estimate of the population mean, but is strongly right-skewed.
qplot(rgamma(1000, .5, 5) - .5/5, geom="density") +
  labs(x="Error term", y="Density", title="Skewed Noise") +
  theme_classic()
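A quick numeric sketch of those two properties: on a large simulated draw the mean sits near zero while the median falls below it, the hallmark of right skew. The object noise here is just a throwaway illustration.
noise <- rgamma(1e5, .5, 5) - .5/5
mean(noise)   # approximately 0, so the population mean is untouched
median(noise) # negative, since most of the mass lies left of the mean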
When we plug this noise into the simulation, the confidence intervals no longer behave as advertised: over a set of 100 trials they cover the true value less often than the nominal 95%.
# same simulation, but with zero-centered gamma noise instead of normal noise
modelEstimates2 <- lapply(1:n, function(x)
  glm(trueValue + rgamma(m, .5, 5) - .5/5 ~ 1))
DF2 <- bind_rows(lapply(modelEstimates2, function(x){
  coefficients(x) %>%
    as.data.frame %>%
    cbind(t(as.data.frame(confint(x))))})) %>%
  mutate(model=1:n) %>%
  mutate(Cover=`2.5 %` <= trueValue & trueValue <= `97.5 %`)
names(DF2)[1] <- "Estimate"
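Tallying the empirical coverage again, the share of intervals containing trueValue should now tend to fall below .95:
mean(DF2$Cover) # proportion of intervals covering the true value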
DF2 %>%
  ggplot(aes(x=model, y=Estimate, ymin=`2.5 %`, ymax=`97.5 %`, color=Cover)) +
  geom_point() +
  geom_errorbar() +
  theme_classic() +
  coord_flip() +
  geom_hline(yintercept=trueValue) +
  xlab("") +
  labs(title="Skewed Noise")
Again, to show this is not a fluke, we can repeat the process 100 more times; the resulting distribution of coverage percentages sits below the one we obtained when all of the model's assumptions were met.
# repeat the skewed-noise simulation n times, recording the coverage rate
monteCarlo2 <- sapply(1:n, function(i){
  mods_ <- lapply(1:n, function(x)
    glm(trueValue + rgamma(m, .5, 5) - .5/5 ~ 1))
  int_ <- sapply(mods_, function(x) confint(x))
  mean((int_[1,] < trueValue) & (int_[2,] > trueValue))
})
qplot(monteCarlo2, geom="density") +
  theme_classic() +
  labs(x="Percent Coverage", y="Density", title="Skewed Noise")
Biased Sample
Perhaps the most devastating results come from a biased sample of the population. This is a common concern when the population you are trying to measure contains a hard-to-reach subpopulation that escapes measurement and differs from the rest of the population on the quantity of interest. Below is an example of what happens to the estimation process when we miss the bottom ten percent of the distribution in each of our trials.
# simulate trials in which the bottom 10% of each sample goes unobserved
modelEstimates3 <- lapply(1:n, function(x){
  sims <- rnorm(m, trueValue, sigma_)
  sims <- sims[sims > quantile(sims, .10)] # drop the bottom decile
  glm(sims ~ 1)
})
DF3 <- bind_rows(lapply(modelEstimates3, function(x){
  coefficients(x) %>%
    as.data.frame %>%
    cbind(t(as.data.frame(confint(x))))})) %>%
  mutate(model=1:n) %>%
  mutate(Cover=`2.5 %` <= trueValue & trueValue <= `97.5 %`)
names(DF3)[1] <- "Estimate"
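Because the unobserved tail sits below the mean, the estimates themselves are pulled upward; a quick check on DF3 as built above:
mean(DF3$Estimate) # drifts above the true value of 3
mean(DF3$Cover)    # far fewer intervals cover the true value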
DF3 %>%
  ggplot(aes(x=model, y=Estimate, ymin=`2.5 %`, ymax=`97.5 %`, color=Cover)) +
  geom_point() +
  geom_errorbar() +
  theme_classic() +
  coord_flip() +
  geom_hline(yintercept=trueValue) +
  xlab("") +
  labs(title="Biased Sample")
Missing a specific subpopulation has a large effect on the bias of our estimates. As the simulations below show, coverage of the true value falls consistently, and far, below the nominal 95%.
monteCarlo3 <- sapply(1:n, function(i){
  # draw a fresh truncated sample for every model fit
  mods_ <- lapply(1:n, function(x){
    sims <- rnorm(m, trueValue, sigma_)
    sims <- sims[sims > quantile(sims, .10)]
    glm(sims ~ 1)
  })
  int_ <- sapply(mods_, function(x) confint(x))
  mean((int_[1,] < trueValue) & (int_[2,] > trueValue))
})
qplot(monteCarlo3, geom="density") +
  theme_classic() +
  labs(x="Percent Coverage", y="Density", title="Biased Sample")