bbw R package update
Latest CRAN release of version 0.3.0 of bbw
We just released an updated version (0.3.0
) of the {bbw}
package on
CRAN. This is {bbw}
’s third CRAN release since its maiden acceptance to CRAN almost 7 years ago (17 January 2018). Key updates include the streamlining of the resampling algorithm and the addition of the option to perform bootstrap resampling in parallel for a faster and more efficient process. The package is now also able to perform stratified bootstrap resampling out-of-the-box compared to previous version where users had to write additional code to setup stratification. Finally, the package now includes a convenience function for performing weighted post-stratification estimation. Before, users had to create additional script to perform this analysis.
To give you an idea of the new features, we compare the original bootBW()
function to the new/alternative boot_bw()
set of functions. We use the datasets for a rapid assessment method (RAM) survey on mother and child health and nutrition in three regions of Somalia included in the {bbw}
package for this demonstration.
The
indicatorsHH
dataset is a survey dataset collected from a RAM survey in Bakool, Bay, and Middle Shabelle regions of Somalia. The
villageData
contains the list of villages/clusters that were sampled in the survey that collected the indicatorsHH
dataset.
Original bootstrapping workflow
Bootstrap resampling with bootBW()
The bootBW()
function is the original bootstrap resampling function of the package. It can be used as follows:
boot_df <- bootBW(
x = indicatorsHH, w = villageData, statistic = bootClassic,
params = c("anc1", "anc2")
)
This call to bootBW()
takes in the survey dataset indicatorsHH
as its first argument (x
). This dataset is expected to have a variable labelled as psu
which identifies the primary sampling unit from which data was collected during the survey and then additional variables for the indicators to be estimated. The second argument (w
) is for the dataset of the list of primary sampling units that were sampled in the survey to collect the survey data specified in x
. This dataset, which in this case is villageData
, should have at least a variable labelled psu
which identified the primary sampling unit that matches the same variable in the survey dataset and a variable labelled pop
for the population size of the primary sampling unit. The statistic
argument specified the type of statistic to apply to the bootstrap replicates. There are two of these functions available from the {bbw}
package - bootClassic()
and the bootPROBIT()
. For this example, the bootClassic()
function is used to get the mean value of the bootstrap replicates. This is generally useful for binomial type of indicators and for continuous variables of which to get the mean of. The params
argument takes in values of the indicator names in x
to be estimated. In this example, two indicator names for antenatal care are specified. Finally, the argument for replicates
specify the number of replicate bootstraps to be performed. The default of 400 replicates is used here. This results in the following (showing first 10 rows):
head(boot_df, 10)
#> anc1 anc2
#> 1 0.1864175 0.01874714
#> 2 0.2290978 0.02035985
#> 3 0.2343529 0.02641509
#> 4 0.2548555 0.03084955
#> 5 0.2698864 0.02662863
#> 6 0.2151356 0.01819052
#> 7 0.1937834 0.02677702
#> 8 0.2148349 0.01678766
#> 9 0.2593480 0.02891566
#> 10 0.2151414 0.01922153
The result is a data.frame()
of bootstrap replicates with number of rows equal to the number or replicates and number of columns equal to the number of params
specified. Hence, boot_df
has 400 rows and 2 columns.
Bootstrap estimation
Using boot_df
containing bootstrap replicates of the indicators anc1
and anc2
, estimating each indicator with a 95% confidence interval using the percentile bootstrap method. This can be simply done using the quantile()
function from the stats
package as follows:
est_df <- lapply(
X = boot_df,
FUN = quantile,
probs = c(0.5, 0.025, 0.975)
) |>
do.call(rbind, args = _)
The quantile()
function is used to get the 50th percentile (for the estimate) and the 2.5th and the 97.5th percentile of the bootstrap replicates to get the lower confidence limit and the upper confidence limits (respectively) of the indicator estimate. This gives the following results:
est_df
#> 50% 2.5% 97.5%
#> anc1 0.2316597 0.17709920 0.28849265
#> anc2 0.0218962 0.01347537 0.03484835
Stratified bootstrap resampling
Note that the indicatorsHH
dataset has geographical stratification. Specifically, the survey from which this data was collected was designed to be representative of three regions in Somalia with the regions identified through the region
variable in indicatorsHH
. Because of this the more appropriate bootstrap resampling approach would be to resample within each region. To do this using the original bootBW()
function would require restructuring the survey dataset by region and then passing the region-stratified datasets individually to the bootBW()
function. This may look something like this:
## Split indicators by region ----
indicators_by_region <- split(indicatorsHH, f = indicatorsHH$region)
## Split psus by region ----
psus_by_region <- split(villageData, f = villageData$region)
## Bootstrap
boot_df <- Map(
f = bootBW,
x = indicators_by_region,
w = psus_by_region,
statistic = rep(list(get("bootClassic")), length(indicators_by_region)),
params = rep(list(c("anc1", "anc2")), length(indicators_by_region))
)
The bootBW()
function only accepts single data.frame
inputs for x
and w
arguments. Hence, to resample data from within region, the datasets will have to be split into separate data.frame
inputs per region and then bootBW()
applied to each separately. In the example above, this is done by concatenating each of the inputs to bootBW()
into a list and then using the Map()
function is sent to bootBW()
sequentially. This produces a list of the data.frame
bootstrap resample for each region (shown below):
class(boot_df)
#> [1] "list"
head(boot_df$Bay, 10)
#> anc1 anc2
#> 1 0.4043419 0.013568521
#> 2 0.3907104 0.020491803
#> 3 0.3224044 0.023224044
#> 4 0.2645862 0.016282225
#> 5 0.2708618 0.008207934
#> 6 0.3297151 0.024423338
#> 7 0.3627717 0.004076087
#> 8 0.3662551 0.016460905
#> 9 0.3410641 0.016371078
#> 10 0.2277628 0.014824798
head(boot_df$Bakool, 10)
#> anc1 anc2
#> 1 0.2916667 0.17415730
#> 2 0.2928177 0.09497207
#> 3 0.3260274 0.14804469
#> 4 0.2747253 0.11864407
#> 5 0.2900552 0.11797753
#> 6 0.1823204 0.05849582
#> 7 0.4065934 0.16343490
#> 8 0.2727273 0.11731844
#> 9 0.2821918 0.06944444
#> 10 0.2939560 0.09749304
head(boot_df$`Middle Shabelle`, 10)
#> anc1 anc2
#> 1 0.1723447 0.011055276
#> 2 0.2550607 0.018367347
#> 3 0.1330724 0.010816126
#> 4 0.2830189 0.024551464
#> 5 0.1921569 0.014792899
#> 6 0.2217782 0.010989011
#> 7 0.2117647 0.007881773
#> 8 0.2165156 0.019172553
#> 9 0.2195122 0.015625000
#> 10 0.2274549 0.015075377
To estimate the per region results from this bootstrap resampling, the following can be implemented:
est_df <- lapply(
X = boot_df,
FUN = function(x) lapply(
x, FUN = quantile, probs = c(0.5, 0.025, 0.975)
) |>
do.call(rbind, args = _)
)
est_df <- data.frame(
region = names(est_df),
indicators = lapply(est_df, FUN = row.names) |> unlist(),
do.call(rbind, args = est_df)
)
row.names(est_df) <- NULL
which results in the following output:
est_df
#> region indicators X50. X2.5. X97.5.
#> 1 Bakool anc1 0.30261405 0.188862799 0.41167127
#> 2 Bay anc2 0.11251780 0.050903865 0.19504237
#> 3 Middle Shabelle anc1 0.32391543 0.217411669 0.43781930
#> 4 Bakool anc2 0.01893172 0.002766156 0.03663750
#> 5 Bay anc1 0.20220114 0.134820317 0.27676772
#> 6 Middle Shabelle anc2 0.01724140 0.007237952 0.03006251
Alternative blocked weighted bootstrap function set
From this demonstration, the bootBW()
function proves to be straightforward to implement and can be easily incorporated into a user’s workflow based on their dataset and their analytic needs. However, as shown above, this flexibility requires a lot more extra coding from the user to get from resampling to indicator estimates.
Starting from v0.3.0
, an alternative set of functions is available to perform blocked weighted bootstrap resampling that facilitates all the steps from resampling to estimation. Below is an example of how to use this alternative set of functions for the same tasks shown above.
This set of functions attempts to make the blocked weighted bootstrap algorithm more efficient through vectorisation and use of parallelisation techniques. The function syntax has been kept consistent with bootBW()
for ease of transition.
Bootstrap resampling with boot_bw()
The boot_bw()
function is the alternative bootstrap resampling function of the package. It can be used as follows:
boot_df <- boot_bw(
x = indicatorsHH, w = villageData, statistic = bootClassic,
params = c("anc1", "anc2")
)
This call to boot_bw()
takes in the survey dataset indicatorsHH
as its first argument (x
). This dataset is expected to have a variable labelled as psu
which identifies the primary sampling unit from which data was collected during the survey and then additional variables for the indicators to be estimated. The second argument (w
) is for the dataset of the list of primary sampling units that were sampled in the survey to collect the survey data specified in x
. This dataset, which in this case is villageData
, should have at least a variable labelled psu
which identified the primary sampling unit that matches the same variable in the survey dataset and a variable labelled pop
for the population size of the primary sampling unit. The statistic
argument specified the type of statistic to apply to the bootstrap replicates. There are two of these functions available from the {bbw}
package - bootClassic()
and the bootPROBIT()
. For this example, the bootClassic()
function is used to get the mean value of the bootstrap replicates. This is generally useful for binomial type of indicators and for continuous variables of which to get the mean of. The params
argument takes in values of the indicator names in x
to be estimated. In this example, two indicator names for antenatal care are specified. Finally, the argument for replicates
specify the number of replicate bootstraps to be performed. The default of 400 replicates is used here. As can be noted, the boot_bw()
takes on the same type of arguments as bootBW()
and the syntax is exactly the same. Hence, using this alternative function will be familiar to those who have had experience using the original function.
However, the output of the boot_bw()
function is structured differently from the bootBW()
function. The boot_bw()
function produces and object of class boot_bw
.
class(boot_df)
#> [1] "boot_bw"
The object boot_bw
is a list with 4 named components: params
for the values specified for the params
argument, replicates
for the number of bootstrap replicates performed, strata
for the values specified for stratification, and boot_data
which is the bootstrap results.
names(boot_df)
#> [1] "params" "replicates" "strata" "boot_data"
The boot_data
component of the boot_bw
object corresponds to the output of the bootBW()
function.
Other than the difference in the structure of the output, this alternative function also has three additional arguments for the new features it provides.
-
strata
- the variable name inx
that provides information on the stratification in the survey data. This is by default set toNULL
signifying no stratification. This argument allows the user to perform stratified bootstrap resampling conveniently through theboot_bw()
function. -
parallel
- whether or not to use parallel computation for the bootstrap resampling. This is by default set to FALSE in which case bootstrap resampling is done sequentially as is with thebootBW()
function. If set to TRUE, the function sets up parallel computing and utilises the machines available cores (seecores
argument below). -
cores
- the number of cores to use for parallel computation. This is only evaluated ifparallel = TRUE
. By default, this is set to 1 less the total available number of cores of the current machine.
To use these new features and functionality, the call to boot_bw()
would look something like this:
boot_df <- boot_bw(
x = indicatorsHH, w = villageData, statistic = bootClassic,
params = c("anc1", "anc2"), strata = "region", parallel = TRUE
)
This produces a boot_bw
class list
object with the same components as above. The only different is that the boot_data
component is a list
(instead of a data.frame
) with each component being the data.frame
bootstrap resampling output for each of the strata in the dataset.
class(boot_df)
#> [1] "boot_bw"
class(boot_df$boot_data)
#> [1] "list"
names(boot_df$boot_data)
#> [1] "Bakool" "Bay" "Middle Shabelle"
Bootstrap estimation
The boot_bw_estimate()
function can then be applied to the output of the boot_bw()
function to get the indicator estimates with 95% confidence interval.
boot_bw_estimate(boot_df)
#> region indicator est lcl ucl
#> 1 Bakool anc1 0.43888889 0.38881944 0.48888889
#> 2 Bakool anc2 0.38055556 0.32497749 0.43062500
#> 3 Bay anc1 0.71619066 0.63887512 0.77849135
#> 4 Bay anc2 0.00254615 0.00000000 0.01294677
#> 5 Middle Shabelle anc1 0.20757542 0.14514451 0.28293531
#> 6 Middle Shabelle anc2 0.05065259 0.03133757 0.07453108
#> se
#> 1 0.027718319
#> 2 0.027983726
#> 3 0.036466569
#> 4 0.003743969
#> 5 0.036375151
#> 6 0.011463590
These two functions can be piped to each other for a single workflow from bootstrap resampling to estimation.
boot_bw(
x = indicatorsHH, w = villageData, statistic = bootClassic,
params = c("anc1", "anc2"), strata = "region", parallel = TRUE
) |>
boot_bw_estimate()
#> region indicator est lcl ucl
#> 1 Bakool anc1 0.438888889 0.3805556 0.49444444
#> 2 Bakool anc2 0.376731302 0.3138889 0.43888889
#> 3 Bay anc1 0.719130072 0.6487833 0.78255787
#> 4 Bay anc2 0.002534854 0.0000000 0.01262706
#> 5 Middle Shabelle anc1 0.203423968 0.1428536 0.27819673
#> 6 Middle Shabelle anc2 0.051256281 0.0339071 0.07622767
#> se
#> 1 0.030425611
#> 2 0.030033802
#> 3 0.034273078
#> 4 0.003372086
#> 5 0.033966913
#> 6 0.010573679
More efficient bootstrap resampling
A key feature of the most recent {bbw}
update is its new function set that uses parallelisation for bootstrap resampling. This vignette explores the bootstrap resampling efficiencies gained with parallelisation.
Applying the original and the alternative function/set to the Somalia survey dataset available from this package, bootstrap resampling is applied using the same parameters and the time the operation it takes to run is measured and compared.
Bootstrap resampling without parallelisation
In this comparison, the original and alternative function/set both implement sequential bootstrap resampling with number of parameters set at varying values.
Using one parameter and 400 replicates
Original vs Alternative bootstrap resampling function/set | |||
Sequential resampling with 1 parameter and 400 replicates | |||
User | System | Elapsed | |
---|---|---|---|
Original - 400 replicates - 1 parameter | 31.505 | 0.0320000000000036 | 31.4860000000008 |
Alternative - 400 replicates - 1 parameter | 25.956 | 0 | 25.8919999999998 |
Performing bootstrap resampling sequentially, the original function took 31.486 seconds to run while the alternative function set took 25.892 seconds to run. There was very little difference between the original and the alternative function/set.
Using varying number of parameters and 400 replicates
Original vs Alternative bootstrap resampling function/set | ||||||
Sequential resampling with increasing number of parameters and 400 replicates | ||||||
No. of parameters | User - Original | System - Original | Elapsed - Original | User - Alternative | System - Alternative | Elapsed - Alternative |
---|---|---|---|---|---|---|
1 | 31.505 | 0.032 | 31.486 | 25.956 | 0 | 25.892 |
2 | 32.442 | 0.000 | 32.363 | 25.747 | 0 | 25.690 |
4 | 32.485 | 0.000 | 32.404 | 25.302 | 0 | 25.244 |
8 | 31.727 | 0.000 | 31.652 | 25.660 | 0 | 25.604 |
There are marginal gains with the alternative function set when the number of parameters more than 1 but the gains do not increase with the increase in the number of parameters.
Bootstrap resampling with parallelisation
In this comparison, the alternative function/set implements parallel bootstrap resampling with number of parameters set at varying values and number of parallel cores set at varying values and then compared to performance of the original function as above.
Original vs Alternative bootstrap resampling function/set | |||||
Parallel resampling with increasing number of parameters and increasing number of cores | |||||
No. of parameters | Original | Alternative - sequential | Alternative - 2 cores | Alternative - 4 cores | Alternative - 8 cores |
---|---|---|---|---|---|
1 | 31.486 | 25.892 | 16.904 | 10.392 | 7.361 |
2 | 32.363 | 25.690 | 16.754 | 10.634 | 7.398 |
4 | 32.404 | 25.244 | 17.017 | 10.545 | 7.585 |
8 | 31.652 | 25.604 | 17.568 | 10.240 | 7.624 |
This updated version of the {bbw}
package maintains support for the original bootstrap resampling function bootBW()
so as to support existing workflows that use it. We would recommend, however, that the new boot_bw()
set of functions be used for new uses and for new workflows.