The Bootstrap

A description of the Bootstrap, and an example from “All of Statistics”

Suppose we have a (small) dataset, and a particular statistic \(T\) (i.e. a quantity computable from the data) that we are interested in. The bias-variance decomposition tells us that if our statistic \(T\) is an unbiased estimator of some quantity \(\theta\), then the mean squared error \(\mathbb{E}[(\theta-T)^2]\) is completely determined by the variance of our statistic. So understanding the variance is crucial to understanding how well our statistic performs.
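Concretely, this follows from the decomposition \[ \mathbb{E}[(\theta-T)^2] = (\theta-\mathbb{E}[T])^2 + \mathbb{V}(T), \] where the first term is the squared bias: for an unbiased estimator it vanishes, leaving only the variance.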

For simple enough statistics, we can determine the variance directly – but for even slightly more complicated cases, the underlying mathematics quickly gets too difficult to handle gracefully.

This is where the Bootstrap enters the stage. The Bootstrap is a way to estimate the distribution of the statistic \(T\) from the dataset itself, and then use that distribution to learn more about \(T\) – for instance to estimate its variance, or to form confidence intervals for \(T\) on a specific dataset.

Basically, the Bootstrap approximates the true distribution of the data by the empirical distribution given by the data: in the empirical distribution, each observed data point has an equal probability \(1/N\) of appearing. Sampling from this empirical distribution amounts to sampling (with replacement!) from the data itself.
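In R, drawing one such bootstrap sample is a one-liner. A minimal sketch, where the vector x is a hypothetical stand-in for the data:

```r
x <- c(5, 2, 8, 3, 7)                # a hypothetical tiny dataset
x.star <- sample(x, replace = TRUE)  # one bootstrap sample: N draws, with replacement
length(x.star) == length(x)          # same size as the original data: TRUE
all(x.star %in% x)                   # every resampled value comes from x: TRUE
```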

Example

For a concrete example, Wasserman (in All of Statistics, Example 8.6) picks up a dataset used by Bradley Efron to illustrate the Bootstrap:


library(tidyverse)  # tibble, dplyr, ggplot2
lsat.gpa = tibble(
  lsat = c(
    576, 635, 558, 578, 666, 580, 555, 661,
    651, 605, 653, 575, 545, 572, 594
  ),
  gpa  = c(
    3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00, 3.43,
    3.36, 3.13, 3.12, 2.74, 2.76, 2.88, 3.96
  )
)
ggplot(lsat.gpa, aes(lsat, gpa)) +
  geom_point()

The task is to estimate the correlation between LSAT scores and GPA from this set of 15 observations. Whereas Wasserman looked at the sample correlation, we can go one step further – let’s find distributions for a linear model predicting GPA from LSAT. Our model is \[ GPA \sim \beta_1 LSAT + \beta_0 \] and is easily fitted using lm(gpa~lsat, lsat.gpa). Fitting it gives the model parameters:


lsat.gpa.lm = lm(gpa~lsat, lsat.gpa)
lsat.gpa.lm$coefficients %>% kable()
                      x
(Intercept)   0.5998613
lsat          0.0042672

And we can plot the resulting fitted model with the data.


ggplot(lsat.gpa, aes(lsat, gpa)) +
  geom_point() +
  geom_abline(slope = lsat.gpa.lm$coefficients["lsat"],
              intercept = lsat.gpa.lm$coefficients["(Intercept)"])

But these are only point estimates for the intercept and slope of the model. We do not yet know how stable these values are – how large their respective variances come out to be. This is where the Bootstrap steps in to help.

The Bootstrap Method

The Bootstrap proceeds as follows:

  1. Pick bootstrap size \(B\).
  2. \(B\) times do:
    • Sample \(N\) values \(X^*_b\) with replacement from your data \(X\)
    • Calculate \(T^*_b = T(X^*_b)\)
  3. Collect all the \(T^*_b\): their distribution approximates the sampling distribution of \(T\).
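Before applying this to the regression model, here is the recipe in miniature for a simpler statistic; the vector x and the choice of the median as statistic are hypothetical stand-ins:

```r
set.seed(42)                           # for reproducibility
x <- rexp(25)                          # a hypothetical small dataset
B <- 1000                              # 1. pick the bootstrap size
T.star <- sapply(1:B, function(b) {    # 2. B times:
  x.star <- sample(x, replace = TRUE)  #    sample N values with replacement...
  median(x.star)                       #    ...and compute the statistic on them
})
var(T.star)                            # 3. the T* approximate the sampling distribution
```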

So let’s get ourselves bootstrap estimates of the slope and intercept.


B = 1000
lsat.gpa.lm.boot = sapply(1:B, function(b) {
  # resample the 15 observations with replacement
  lsat.gpa.star = lsat.gpa %>% sample_frac(replace = TRUE)
  # refit the linear model on the bootstrap sample
  lm(gpa~lsat, lsat.gpa.star)$coefficients
}) %>% t %>% as_tibble %>% rename(intercept = `(Intercept)`)

We can see the distribution of slope/intercept pairs in a heatmap as follows:


ggplot(lsat.gpa.lm.boot,aes(intercept, lsat)) +
  geom_hex()

And we can see the effect on the fitted lines by drawing one regression line per bootstrap replicate:


ggplot(lsat.gpa, aes(lsat, gpa)) +
  geom_abline(data=lsat.gpa.lm.boot,
              aes(intercept=intercept, slope=lsat),
              alpha=0.025) +
  geom_point()

The distributions of the slope and intercept from the Bootstrap sample are:


ggplot(lsat.gpa.lm.boot %>% pivot_longer(everything()),
       aes(value)) +
  geom_density() +
  facet_grid(rows=vars(name), scales="free")

This plot wasn’t entirely helpful. We get a better view of the sampling distribution for the slope by plotting it in isolation:


ggplot(lsat.gpa.lm.boot, aes(lsat)) +
  geom_density()

Confidence intervals

Wasserman suggests three approaches for confidence intervals:

  1. Use a normal approximation. If the Bootstrap sample data follows approximately a normal distribution, then we can use \(T(X) \pm z_{\alpha/2}\widehat{se}\) as a confidence interval, where \(\widehat{se}\) is the standard deviation of the Bootstrap sample.
  2. Use pivotal intervals. Write \(H\) for the CDF of the quantity \(T-\theta\). Then a confidence interval for \(\theta\) is given by \((T-H^{-1}(1-\alpha/2), T-H^{-1}(\alpha/2))\). Now, since \(\theta\) is unknown, so is \(H\) – but we can approximate \(H\) using our Bootstrap! Take \(\hat{H}(r) = \frac{1}{B}\#\{b : T^*_b-T(X) \leq r\}\): for a fixed \(r\), count the number of times that \(T^*_b-T(X) \leq r\) and divide by \(B\). Taking quantiles of the \(r^*_b = T^*_b-T(X)\) then yields the confidence interval \((T(X)-r^*_{(1-\alpha/2)}, T(X)-r^*_{(\alpha/2)})\), where \(r^*_{(q)}\) denotes the \(q\)th quantile of the \(r^*\). This reduces to the confidence interval \((2T(X) - T^*_{(1-\alpha/2)}, 2T(X)-T^*_{(\alpha/2)})\).
  3. Just take quantiles directly from the \(T^*\) quantities to form a bootstrap percentile interval.

So to compute the 95% two-sided confidence intervals, we could use the following R-code:


boot.ci.normal = c(
  lo=lsat.gpa.lm$coefficients["(Intercept)"]+
    qnorm(0.025)*sd(lsat.gpa.lm.boot$intercept), 
  hi=lsat.gpa.lm$coefficients["(Intercept)"]+
    qnorm(0.975)*sd(lsat.gpa.lm.boot$intercept),
  lo=lsat.gpa.lm$coefficients["lsat"]+
    qnorm(0.025)*sd(lsat.gpa.lm.boot$lsat),
  hi=lsat.gpa.lm$coefficients["lsat"]+
    qnorm(0.975)*sd(lsat.gpa.lm.boot$lsat)
)

boot.ci.pivotal = c(
  lo=2*lsat.gpa.lm$coefficients["(Intercept)"]-
    quantile(lsat.gpa.lm.boot$intercept, 0.975), 
  hi=2*lsat.gpa.lm$coefficients["(Intercept)"]-
    quantile(lsat.gpa.lm.boot$intercept, 0.025), 
  lo=2*lsat.gpa.lm$coefficients["lsat"]-
    quantile(lsat.gpa.lm.boot$lsat,0.975),
  hi=2*lsat.gpa.lm$coefficients["lsat"]-
    quantile(lsat.gpa.lm.boot$lsat,0.025)
)

boot.ci.percentage = c(
  `lo.(Intercept)`=quantile(lsat.gpa.lm.boot$intercept, 0.025),
  `hi.(Intercept)`=quantile(lsat.gpa.lm.boot$intercept, 0.975),
  lo.lsat=quantile(lsat.gpa.lm.boot$lsat, 0.025),
  hi.lsat=quantile(lsat.gpa.lm.boot$lsat, 0.975)
)

rbind(
  boot.ci.normal, boot.ci.pivotal, boot.ci.percentage
) %>% kable()
                    lo.(Intercept)  hi.(Intercept)    lo.lsat    hi.lsat
boot.ci.normal          -0.9270581        2.126781  0.0017824  0.0067520
boot.ci.pivotal         -0.9414914        1.865015  0.0021046  0.0066943
boot.ci.percentage      -0.6652925        2.141214  0.0018401  0.0064299

Now, the boot.ci.normal confidence intervals here assume that the intercept and slope are both approximately normally distributed. The usual way to check this is through QQ-plots:


ggplot(lsat.gpa.lm.boot, aes(sample=intercept)) +
  geom_qq() + geom_qq_line() + ggtitle("Intercept QQ")


ggplot(lsat.gpa.lm.boot, aes(sample=lsat)) +
  geom_qq() + geom_qq_line() + ggtitle("Slope QQ")

In R

“But wait!”, I hear you say, “Surely I don’t have to code this myself every time?!”

Indeed, as with so many other statistical methods, there is of course functionality in R to do the bootstrap. The package boot was developed to accompany Davison and Hinkley’s book Bootstrap Methods and Their Application, and contains functions for running the bootstrap on whatever data and statistic you may have.

At the core of boot sits the function boot, which takes as mandatory arguments the data, the statistic to compute (given as a function), and R, the number of bootstrap replicates.

To figure out how to write all of these pieces, let’s read a quote from the boot documentation:

statistic must take at least two arguments. The first argument passed will always be the original data. The second will be a vector of indices, frequencies or weights which define the bootstrap sample. Further, if predictions are required, then a third argument is required which would be a vector of the random indices used to generate the bootstrap predictions.

In other words, we need to write a function statistic(data, indices) that computes the relevant statistic based on the data contained in the rows data[indices, ].

In code, this would look something like this:


library(boot)
lsat.gpa.boot = boot(lsat.gpa,
                     function(data, ind) {
                       lm(gpa~lsat, data = data[ind, ])$coefficients
                     },
                     R = B)

Now, lsat.gpa.boot contains a Bootstrap object that we can interact with in various ways. For starters, lsat.gpa.boot$t contains the actual Bootstrap replicates – the statistics values computed in each round. There are specialized functions to plot and to print bootstrap objects:


print(lsat.gpa.boot)

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = lsat.gpa, statistic = function(data, ind) {
    lm(gpa ~ lsat, data = data[ind, ])$coefficients
}, R = B)


Bootstrap Statistics :
       original        bias    std. error
t1* 0.599861296 -1.756139e-02 0.824389371
t2* 0.004267224  3.676261e-05 0.001334381

plot(lsat.gpa.boot)

Helpfully, the plotting function gives us a QQ-plot to help determine whether the normal approximation confidence intervals are reasonable to use in this case.

Possibly even more interesting than these is the boot.ci function: it computes a number of different confidence interval types: normal, basic, studentized, percentile, and BCa (bias-corrected and accelerated).

Earlier in this post, we calculated the normal approximation, the pivotal, and the Bootstrap percentile intervals. Matching Wasserman’s terminology to that of the boot package: the percentile intervals correspond directly, and the pivotal interval corresponds to what boot calls the basic interval – both come out as \((2T(X) - T^*_{(1-\alpha/2)}, 2T(X)-T^*_{(\alpha/2)})\).
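We can check the correspondence between the pivotal formula and boot’s basic interval numerically. A self-contained sketch on synthetic data, using the sample mean as a stand-in statistic (this assumes boot stores each computed interval as a row of the form conf, lo-rank, hi-rank, lo, hi):

```r
library(boot)

set.seed(17)
x <- rnorm(40)                                      # synthetic stand-in data
bt <- boot(x, function(d, i) mean(d[i]), R = 2000)  # bootstrap the sample mean
ci <- boot.ci(bt, type = c("basic", "perc"))

# The basic endpoints should equal 2*T(X) minus the reversed percentile endpoints.
manual.basic <- 2 * bt$t0 - rev(ci$percent[4:5])
all.equal(as.numeric(ci$basic[4:5]), as.numeric(manual.basic))  # expect TRUE
```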


boot.ci(lsat.gpa.boot, index=1) # the intercept

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = lsat.gpa.boot, index = 1)

Intervals : 
Level      Normal              Basic         
95%   (-0.9984,  2.2332 )   (-1.3092,  1.9679 )  

Level     Percentile            BCa          
95%   (-0.7682,  2.5089 )   (-0.3918,  3.1645 )  
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable

boot.ci(lsat.gpa.boot, index=2) # the slope

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = lsat.gpa.boot, index = 2)

Intervals : 
Level      Normal              Basic         
95%   ( 0.0016,  0.0068 )   ( 0.0021,  0.0073 )  

Level     Percentile            BCa          
95%   ( 0.0013,  0.0064 )   ( 0.0003,  0.0058 )  
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/michiexile/rbind-io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".