The Bootstrap

A description of the Bootstrap, and an example from “All of Statistics”

Suppose we have a (small) dataset, and a particular statistic \(T\) (i.e. a quantity computable from the data) that we are interested in. The bias-variance decomposition tells us that if our statistic \(T\) is an unbiased estimator of some quantity \(\theta\), then the mean squared error \(\mathbb{E}[(\theta-T)^2]\) is completely determined by the variance of our statistic. So understanding the variance is crucial to understanding how well our statistic performs.
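Concretely, this follows from the decomposition \[ \mathbb{E}[(\theta-T)^2] = (\theta-\mathbb{E}[T])^2 + \mathbb{V}(T), \] where the first term is the squared bias: for an unbiased estimator it vanishes, leaving only the variance.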

For simple enough statistics, we can determine the variance directly – but for even slightly more complicated cases, the underlying mathematics quickly gets too difficult to handle gracefully.

This is where the Bootstrap enters the stage. The Bootstrap is a way to estimate the distribution of the statistic \(T\) from the dataset itself, and then use that distribution to learn more about \(T\) – for instance to estimate its variance, or to form confidence intervals for \(T\) on a specific dataset.

Basically, the Bootstrap approximates the true distribution of the data by the empirical distribution given by the data: in the empirical distribution, each observed data point has an equal probability \(1/N\) of appearing. Sampling from this empirical distribution amounts to sampling (with replacement!) from the data itself.
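In R, drawing one such bootstrap sample is a one-liner. A minimal sketch, where the vector x is a hypothetical stand-in for the data:

```r
x <- c(5, 2, 8, 3, 7)                # a hypothetical tiny dataset
x.star <- sample(x, replace = TRUE)  # one bootstrap sample: N draws, with replacement
length(x.star) == length(x)          # same size as the original data: TRUE
all(x.star %in% x)                   # every resampled value comes from x: TRUE
```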

Example

For a concrete example, Wasserman (in All of Statistics, Example 8.6) picks up a dataset used by Bradley Efron to illustrate the Bootstrap:


library(tidyverse)  # tibble, dplyr, ggplot2
lsat.gpa = tibble(
  lsat = c(
    576, 635, 558, 578, 666, 580, 555, 661,
    651, 605, 653, 575, 545, 572, 594
  ),
  gpa  = c(
    3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00, 3.43,
    3.36, 3.13, 3.12, 2.74, 2.76, 2.88, 3.96
  )
)
ggplot(lsat.gpa, aes(lsat, gpa)) +
  geom_point()

The task is to estimate the correlation between LSAT scores and GPA from this set of 15 observations. Whereas Wasserman looked at the sample correlation, we can go one step further – let’s find distributions for a linear model predicting GPA from LSAT. Our model is \[ GPA \sim \beta_1 LSAT + \beta_0 \] and is easily fitted using lm(gpa~lsat, lsat.gpa). Fitting it gives the model parameters:


lsat.gpa.lm = lm(gpa~lsat, lsat.gpa)
lsat.gpa.lm$coefficients %>% kable()
                      x
(Intercept)   0.5998613
lsat          0.0042672

And we can plot the resulting fitted model with the data.


ggplot(lsat.gpa, aes(lsat, gpa)) +
  geom_point() +
  geom_abline(slope = lsat.gpa.lm$coefficients["lsat"],
              intercept = lsat.gpa.lm$coefficients["(Intercept)"])

But these are only point estimates for the intercept and slope of the model. We do not yet know how stable these values are – how large their respective variances come out to be. This is where the Bootstrap steps in to help.

The Bootstrap Method

The Bootstrap proceeds as follows:

  1. Pick bootstrap size \(B\).
  2. \(B\) times do:
    • Sample \(N\) values \(X^*_b\) with replacement from your data \(X\)
    • Calculate \(T^*_b = T(X^*_b)\)
  3. Collect all the \(T^*_b\): their distribution approximates the sampling distribution of \(T\).
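Before applying this to the regression model, here is the recipe in miniature for a simpler statistic; the vector x and the choice of the median as statistic are hypothetical stand-ins:

```r
set.seed(42)                           # for reproducibility
x <- rexp(25)                          # a hypothetical small dataset
B <- 1000                              # 1. pick the bootstrap size
T.star <- sapply(1:B, function(b) {    # 2. B times:
  x.star <- sample(x, replace = TRUE)  #    sample N values with replacement...
  median(x.star)                       #    ...and compute the statistic on them
})
var(T.star)                            # 3. the T* approximate the sampling distribution
```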

So let’s get ourselves bootstrap estimates of the slope and intercept.


B = 1000
lsat.gpa.lm.boot = sapply(1:B, function(b) {
  # resample the 15 observations with replacement
  lsat.gpa.star = lsat.gpa %>% sample_frac(replace = TRUE)
  # refit the linear model on the bootstrap sample
  lm(gpa~lsat, lsat.gpa.star)$coefficients
}) %>% t %>% as_tibble %>% rename(intercept = `(Intercept)`)

We can see the distribution of slope/intercept pairs in a heatmap as follows:


ggplot(lsat.gpa.lm.boot,aes(intercept, lsat)) +
  geom_hex()

And we can see the effect on the fitted lines by drawing one regression line per bootstrap replicate:


ggplot(lsat.gpa, aes(lsat, gpa)) +
  geom_abline(data=lsat.gpa.lm.boot,
              aes(intercept=intercept, slope=lsat),
              alpha=0.025) +
  geom_point()

The distributions of the slope and intercept from the Bootstrap sample are:


ggplot(lsat.gpa.lm.boot %>% pivot_longer(everything()),
       aes(value)) +
  geom_density() +
  facet_grid(rows=vars(name), scales="free")

This plot wasn’t entirely helpful. We get a better view of the sampling distribution for the slope by plotting it in isolation:


ggplot(lsat.gpa.lm.boot, aes(lsat)) +
  geom_density()

Confidence intervals

Wasserman suggests three approaches for confidence intervals:

  1. Use a normal approximation. If the Bootstrap sample data follows approximately a normal distribution, then we can use \(T(X) \pm z_{\alpha/2}\widehat{se}\) as a confidence interval, where \(\widehat{se}\) is the standard deviation of the Bootstrap sample.
  2. Use pivotal intervals. Write \(H\) for the CDF of the quantity \(T-\theta\). Then a confidence interval for \(\theta\) is given by \((T-H^{-1}(1-\alpha/2), T-H^{-1}(\alpha/2))\). Now, since \(\theta\) is unknown, so is \(H\) – but we can approximate \(H\) using our Bootstrap! Take \(\hat{H}(r) = \frac{1}{B}\#\{b : T^*_b-T(X) \leq r\}\): for a fixed \(r\), count the number of times that \(T^*_b-T(X) \leq r\) and divide by \(B\). Taking quantiles of the \(r^*_b = T^*_b-T(X)\) then yields the confidence interval \((T(X)-r^*_{(1-\alpha/2)}, T(X)-r^*_{(\alpha/2)})\), where \(r^*_{(q)}\) denotes the \(q\)th quantile of the \(r^*\). This reduces to the confidence interval \((2T(X) - T^*_{(1-\alpha/2)}, 2T(X)-T^*_{(\alpha/2)})\).
  3. Just take quantiles directly from the \(T^*\) quantities to form a bootstrap percentile interval.

So to compute the 95% two-sided confidence intervals, we could use the following R-code:


boot.ci.normal = c(
  lo=lsat.gpa.lm$coefficients["(Intercept)"]+
    qnorm(0.025)*sd(lsat.gpa.lm.boot$intercept), 
  hi=lsat.gpa.lm$coefficients["(Intercept)"]+
    qnorm(0.975)*sd(lsat.gpa.lm.boot$intercept),
  lo=lsat.gpa.lm$coefficients["lsat"]+
    qnorm(0.025)*sd(lsat.gpa.lm.boot$lsat),
  hi=lsat.gpa.lm$coefficients["lsat"]+
    qnorm(0.975)*sd(lsat.gpa.lm.boot$lsat)
)

boot.ci.pivotal = c(
  lo=2*lsat.gpa.lm$coefficients["(Intercept)"]-
    quantile(lsat.gpa.lm.boot$intercept, 0.975), 
  hi=2*lsat.gpa.lm$coefficients["(Intercept)"]-
    quantile(lsat.gpa.lm.boot$intercept, 0.025), 
  lo=2*lsat.gpa.lm$coefficients["lsat"]-
    quantile(lsat.gpa.lm.boot$lsat,0.975),
  hi=2*lsat.gpa.lm$coefficients["lsat"]-
    quantile(lsat.gpa.lm.boot$lsat,0.025)
)

boot.ci.percentage = c(
  `lo.(Intercept)`=quantile(lsat.gpa.lm.boot$intercept, 0.025),
  `hi.(Intercept)`=quantile(lsat.gpa.lm.boot$intercept, 0.975),
  lo.lsat=quantile(lsat.gpa.lm.boot$lsat, 0.025),
  hi.lsat=quantile(lsat.gpa.lm.boot$lsat, 0.975)
)

rbind(
  boot.ci.normal, boot.ci.pivotal, boot.ci.percentage
) %>% kable()
                    lo.(Intercept)  hi.(Intercept)    lo.lsat    hi.lsat
boot.ci.normal          -0.9270581        2.126781  0.0017824  0.0067520
boot.ci.pivotal         -0.9414914        1.865015  0.0021046  0.0066943
boot.ci.percentage      -0.6652925        2.141214  0.0018401  0.0064299

Now, the boot.ci.normal confidence intervals here assume that the intercept and slope are both approximately normally distributed. The usual way to check this is through QQ-plots:


ggplot(lsat.gpa.lm.boot, aes(sample=intercept)) +
  geom_qq() + geom_qq_line() + ggtitle("Intercept QQ")


ggplot(lsat.gpa.lm.boot, aes(sample=lsat)) +
  geom_qq() + geom_qq_line() + ggtitle("Slope QQ")

In R

“But wait!”, I hear you say, “Surely I don’t have to code this myself every time?!”

Indeed, as with so many other statistical methods, there is of course functionality in R to do the bootstrap. The package boot was developed to accompany Davison and Hinkley’s book Bootstrap Methods and Their Application, and contains functions for running the bootstrap on whatever data and statistic you may have.

At the core of boot sits the function boot, which takes as mandatory arguments the data, the statistic to compute (given as a function), and R, the number of bootstrap replicates.

To figure out how to write all of these pieces, let’s read a quote from the boot documentation:

statistic must take at least two arguments. The first argument passed will always be the original data. The second will be a vector of indices, frequencies or weights which define the bootstrap sample. Further, if predictions are required, then a third argument is required which would be a vector of the random indices used to generate the bootstrap predictions.

In other words, we need to write a function statistic(data, indices) that computes the relevant statistic based on the data contained in the rows data[indices, ].

In code, this would look something like this:


library(boot)
lsat.gpa.boot = boot(lsat.gpa,
                     function(data, ind) {
                       lm(gpa~lsat, data = data[ind, ])$coefficients
                     },
                     R = B)

Now, lsat.gpa.boot contains a Bootstrap object that we can interact with in various ways. For starters, lsat.gpa.boot$t contains the actual Bootstrap replicates – the statistics values computed in each round. There are specialized functions to plot and to print bootstrap objects:


print(lsat.gpa.boot)

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = lsat.gpa, statistic = function(data, ind) {
    lm(gpa ~ lsat, data = data[ind, ])$coefficients
}, R = B)


Bootstrap Statistics :
       original        bias    std. error
t1* 0.599861296 -1.756139e-02 0.824389371
t2* 0.004267224  3.676261e-05 0.001334381

plot(lsat.gpa.boot)

Helpfully, the plotting function gives us a QQ-plot to help determine whether the normal approximation confidence intervals are reasonable to use in this case.

Possibly even more interesting than these is the boot.ci function: it computes a number of different confidence interval types: normal, basic, studentized, percentile, and BCa (bias-corrected and accelerated).

Earlier in this post, we calculated the normal approximation, the pivotal, and the Bootstrap percentile intervals. Matching Wasserman’s terminology to that of the boot package: the percentile intervals correspond directly, and the pivotal interval corresponds to what boot calls the basic interval – both come out as \((2T(X) - T^*_{(1-\alpha/2)}, 2T(X)-T^*_{(\alpha/2)})\).
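We can check the correspondence between the pivotal formula and boot’s basic interval numerically. A self-contained sketch on synthetic data, using the sample mean as a stand-in statistic (this assumes boot stores each computed interval as a row of the form conf, lo-rank, hi-rank, lo, hi):

```r
library(boot)

set.seed(17)
x <- rnorm(40)                                      # synthetic stand-in data
bt <- boot(x, function(d, i) mean(d[i]), R = 2000)  # bootstrap the sample mean
ci <- boot.ci(bt, type = c("basic", "perc"))

# The basic endpoints should equal 2*T(X) minus the reversed percentile endpoints.
manual.basic <- 2 * bt$t0 - rev(ci$percent[4:5])
all.equal(as.numeric(ci$basic[4:5]), as.numeric(manual.basic))  # expect TRUE
```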


boot.ci(lsat.gpa.boot, index=1) # the intercept

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = lsat.gpa.boot, index = 1)

Intervals : 
Level      Normal              Basic         
95%   (-0.9984,  2.2332 )   (-1.3092,  1.9679 )  

Level     Percentile            BCa          
95%   (-0.7682,  2.5089 )   (-0.3918,  3.1645 )  
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable

boot.ci(lsat.gpa.boot, index=2) # the slope

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = lsat.gpa.boot, index = 2)

Intervals : 
Level      Normal              Basic         
95%   ( 0.0016,  0.0068 )   ( 0.0021,  0.0073 )  

Level     Percentile            BCa          
95%   ( 0.0013,  0.0064 )   ( 0.0003,  0.0058 )  
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/michiexile/rbind-io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".