Reproducibility ensures re-running the same analysis yields identical
results. Because a random process is involved in generating the
bootstrap weights with fwb::fwb(), steps must be taken to
ensure reproducibility is possible.
There are a few arguments to fwb() that are relevant for
reproducibility. These are statistic, simple,
and cl.
statisticis the function that is applied to each bootstrap dataset and returns the quantities of interest to be estimated. It can either have a random component or not; each case requires special attention. It is always safer to avoid having a random component instatistic. Most common regression functions to do not involve a random component, but some advanced models, like machine learning models, may. Do not ever include a call toset.seed()or supply a seed tostatistic. If you are using parallelization forfwb(), do not use parallelization withinstatistic.simplecontrols whether the bootstrap weights are generated all at once (simple = FALSE) or generated separately within each bootstrap iteration (simple = TRUE). Whensimple = FALSE, the weights are generated before any parallelization takes place orstatisticis called, which makes ensuring reproducibility more straightforward. Whensimple = TRUE, the weights are generated before the call tostatisticin each bootstrap iteration, which can make it a bit more challenging to ensure reproducibility when using parallelization and adds even more challenges whenstatisticalso has a random component.clcontrols whether and how parallelization takes place. It is passed directly topbapply::pblapply(), which calls eitherparallel::mclapply(),parallel::parLapply(), orfuture.apply::future_lapply(), depending on how it is specified. The usual arguments include an integer referring to the number of cores, which only works on Mac and triggersparallel::mclapply(); aclusterobject (usually the result of a call toparallel::makeCluster()or related functions), which triggersparallel::parLapply(); or"future", which uses afuturebackend (usually initialized usingfuture::plan()). Each of these involves different requirements for ensuring reproducibility.
This guide will proceed for combinations of these scenarios.
Case 1: No parallelization (cl = NULL)
When no parallelization is used (i.e., cl is
unspecified, NULL, or 1), all you need to do
is call set.seed() before fwb() to ensure
reproducibility. It doesn’t matter what simple or
statistic do. This is probably the most common case. Just
run the following to ensure reproducibility, replacing {N}
with your favorite integer.
Case 2: simple = FALSE, non-random
statistic
If simple = FALSE and statistic does not
have a random component, see Case 1, regardless of
whether or how parallelization is used. In this case, no random process
occurs within each cluster, so no special steps need to be taken beyond
setting a seed. Note that simple = TRUE by default unless
wtype = "multinom", so this must be set manually. See below
for a code example:
Case 3: cl is an integer
When cl is an integer and the criteria for Case 2 are not met (i.e., simple = TRUE
or statistic has a random component), one additional step
is required for ensuring reproducibility. Again, all you need to do is
use set.seed(), but you must call it with
kind = "L'Ecuyer-CMRG", which is the only method
appropriate for use across multiple clusters. See below for a code
example:
Case 4: cl is "future"
When using a future backend and the criteria for Case 2 are not met, you can use the same solution as
for Case 3. fwb() performs an additional step to make sure
the seed is correctly sent to
future.apply::future_lapply(). (Internally, this works by
setting future.seed = TRUE, which you should not do
yourself.) See below for a code example:
Case 5: cl is a cluster object
When cl is a cluster object (i.e., the
output of a call to parallel::makeCluster(),
parallel::makePSOCKcluster(),
parallel::makeForkCluster() or similar functions in
parallelly), an additional step must be taken to ensure
reproducibility. Unfortunately, you can’t use set.seed();
you have to use parallel::clusterSetRNGStream(), to which
you supply the cluster object and your desired seed. See
below for a code example:
library(parallel)
cl <- makeCluster(3)
clusterSetRNGStream(cl, {N})
f.out <- fwb(., cl = cl)Computing BCa confidence intervals
Although the main purpose of considering reproducibility is to ensure
that multiple runs of the same code produce identical results, there is
another situation in which it can be important to be able to reproduce
the weights, and that is when computing bias-corrected accelerated (BCa)
confidence intervals using fwb.ci(., type = "bca") or
summary(., ci.type = "bca"). BCa confidence intervals have
the best statistical properties among the available bootstrap confidence
intervals, but they require computing the influence each unit has on the
bootstrap estimates, which requires re-generating the weights as they
were generated by fwb().
There are some cases where you don’t have to do any special work to ensure BCa intervals are correctly computed. These include:
-
simple = FALSE, regardless of parallelization or randomness instatistic -
simple = TRUE, there is no randomness instatistic, and no parallelization is used -
simple = TRUE, there is no randomness instatistic, andclis an integer or"future"
In these cases, fwb() saves the state of the random seed
that was used to originally generate the weights, and
fwb.ci() recalls that seed to re-generate the weights and
then computes the required statistics for the BCa interval without
requiring any extra involvement by the user. Remember,
simple = TRUE by default unless
wtype = "multinom".
Otherwise, when the following condition is met, an additional step is required:
-
simple = TRUE, there is no randomness instatistic, andclis aclusterobject
In this case, you need to call
parallel::clusterSetRNGStream(cl, {N}) with the same seed
as as was used prior to fwb() immediately before calling
fwb.ci() or summary().
When simple = TRUE and there is any randomness in
statistic, it is not possible to re-generate the weights
that were used in the bootstrap, so BCa confidence intervals cannot be
computed. fwb.ci() (and summary() and
confint(), which call fwb.ci()) automatically
checks for this case and throws an error if BCa confidence intervals are
requested when these conditions are met.
verbose
It may be surprising to know that the verbose argument
can affect reproducibility. This is because of how pbapply, the
package used to generate progress bars when verbose = TRUE,
handles parallelization. To ensure reproducibility when parallelization
is used, it is advisable to set verbose = FALSE and omit a
progress bar. As of fwb version 0.6.0, this is the default.
This also improves performance. Setting verbose = TRUE
affects reproducibility in Cases 3, 4, and 5.