bbw R package update

Latest CRAN release of version 0.3.0 of bbw

Photo by Derek Lynn on Unsplash

We just released an updated version (0.3.0) of the {bbw} package on CRAN. This is {bbw}’s third CRAN release since its maiden acceptance to CRAN almost 7 years ago (17 January 2018). Key updates include the streamlining of the resampling algorithm and the addition of the option to perform bootstrap resampling in parallel for a faster and more efficient process. The package is now also able to perform stratified bootstrap resampling out-of-the-box compared to previous version where users had to write additional code to setup stratification. Finally, the package now includes a convenience function for performing weighted post-stratification estimation. Before, users had to create additional script to perform this analysis.

To give you an idea of the new features, we compare the original bootBW() function to the new/alternative boot_bw() set of functions. We use the datasets for a rapid assessment method (RAM) survey on mother and child health and nutrition in three regions of Somalia included in the {bbw} package for this demonstration.

The indicatorsHH dataset is a survey dataset collected from a RAM survey in Bakool, Bay, and Middle Shabelle regions of Somalia. The villageData contains the list of villages/clusters that were sampled in the survey that collected the indicatorsHH dataset.

Original bootstrapping workflow

Bootstrap resampling with bootBW()

The bootBW() function is the original bootstrap resampling function of the package. It can be used as follows:

boot_df <- bootBW(
  x = indicatorsHH, w = villageData, statistic = bootClassic,
  params = c("anc1", "anc2")
)

This call to bootBW() takes in the survey dataset indicatorsHH as its first argument (x). This dataset is expected to have a variable labelled as psu which identifies the primary sampling unit from which data was collected during the survey and then additional variables for the indicators to be estimated. The second argument (w) is for the dataset of the list of primary sampling units that were sampled in the survey to collect the survey data specified in x. This dataset, which in this case is villageData, should have at least a variable labelled psu which identified the primary sampling unit that matches the same variable in the survey dataset and a variable labelled pop for the population size of the primary sampling unit. The statistic argument specified the type of statistic to apply to the bootstrap replicates. There are two of these functions available from the {bbw} package - bootClassic() and the bootPROBIT(). For this example, the bootClassic() function is used to get the mean value of the bootstrap replicates. This is generally useful for binomial type of indicators and for continuous variables of which to get the mean of. The params argument takes in values of the indicator names in x to be estimated. In this example, two indicator names for antenatal care are specified. Finally, the argument for replicates specify the number of replicate bootstraps to be performed. The default of 400 replicates is used here. This results in the following (showing first 10 rows):

head(boot_df, 10)
#>         anc1       anc2
#> 1  0.1864175 0.01874714
#> 2  0.2290978 0.02035985
#> 3  0.2343529 0.02641509
#> 4  0.2548555 0.03084955
#> 5  0.2698864 0.02662863
#> 6  0.2151356 0.01819052
#> 7  0.1937834 0.02677702
#> 8  0.2148349 0.01678766
#> 9  0.2593480 0.02891566
#> 10 0.2151414 0.01922153

The result is a data.frame() of bootstrap replicates with number of rows equal to the number or replicates and number of columns equal to the number of params specified. Hence, boot_df has 400 rows and 2 columns.

Bootstrap estimation

Using boot_df containing bootstrap replicates of the indicators anc1 and anc2, estimating each indicator with a 95% confidence interval using the percentile bootstrap method. This can be simply done using the quantile() function from the stats package as follows:

est_df <- lapply(
  X = boot_df,
  FUN = quantile,
  probs = c(0.5, 0.025, 0.975)
) |>
  do.call(rbind, args = _)

The quantile() function is used to get the 50th percentile (for the estimate) and the 2.5th and the 97.5th percentile of the bootstrap replicates to get the lower confidence limit and the upper confidence limits (respectively) of the indicator estimate. This gives the following results:

est_df
#>            50%       2.5%      97.5%
#> anc1 0.2316597 0.17709920 0.28849265
#> anc2 0.0218962 0.01347537 0.03484835

Stratified bootstrap resampling

Note that the indicatorsHH dataset has geographical stratification. Specifically, the survey from which this data was collected was designed to be representative of three regions in Somalia with the regions identified through the region variable in indicatorsHH. Because of this the more appropriate bootstrap resampling approach would be to resample within each region. To do this using the original bootBW() function would require restructuring the survey dataset by region and then passing the region-stratified datasets individually to the bootBW() function. This may look something like this:

## Split indicators by region ----
indicators_by_region <- split(indicatorsHH, f = indicatorsHH$region)

## Split psus by region ----
psus_by_region <- split(villageData, f = villageData$region)

## Bootstrap
boot_df <- Map(
  f = bootBW, 
  x = indicators_by_region, 
  w = psus_by_region, 
  statistic = rep(list(get("bootClassic")), length(indicators_by_region)), 
  params = rep(list(c("anc1", "anc2")), length(indicators_by_region))
)

The bootBW() function only accepts single data.frame inputs for x and w arguments. Hence, to resample data from within region, the datasets will have to be split into separate data.frame inputs per region and then bootBW() applied to each separately. In the example above, this is done by concatenating each of the inputs to bootBW() into a list and then using the Map() function is sent to bootBW() sequentially. This produces a list of the data.frame bootstrap resample for each region (shown below):

class(boot_df)
#> [1] "list"

head(boot_df$Bay, 10)
#>         anc1        anc2
#> 1  0.4043419 0.013568521
#> 2  0.3907104 0.020491803
#> 3  0.3224044 0.023224044
#> 4  0.2645862 0.016282225
#> 5  0.2708618 0.008207934
#> 6  0.3297151 0.024423338
#> 7  0.3627717 0.004076087
#> 8  0.3662551 0.016460905
#> 9  0.3410641 0.016371078
#> 10 0.2277628 0.014824798

head(boot_df$Bakool, 10)
#>         anc1       anc2
#> 1  0.2916667 0.17415730
#> 2  0.2928177 0.09497207
#> 3  0.3260274 0.14804469
#> 4  0.2747253 0.11864407
#> 5  0.2900552 0.11797753
#> 6  0.1823204 0.05849582
#> 7  0.4065934 0.16343490
#> 8  0.2727273 0.11731844
#> 9  0.2821918 0.06944444
#> 10 0.2939560 0.09749304

head(boot_df$`Middle Shabelle`, 10)
#>         anc1        anc2
#> 1  0.1723447 0.011055276
#> 2  0.2550607 0.018367347
#> 3  0.1330724 0.010816126
#> 4  0.2830189 0.024551464
#> 5  0.1921569 0.014792899
#> 6  0.2217782 0.010989011
#> 7  0.2117647 0.007881773
#> 8  0.2165156 0.019172553
#> 9  0.2195122 0.015625000
#> 10 0.2274549 0.015075377

To estimate the per region results from this bootstrap resampling, the following can be implemented:

est_df <- lapply(
  X = boot_df, 
  FUN = function(x) lapply(
    x, FUN = quantile, probs = c(0.5, 0.025, 0.975)
  ) |> 
    do.call(rbind, args = _)
)

est_df <- data.frame(
  region = names(est_df),
  indicators = lapply(est_df, FUN = row.names) |> unlist(),
  do.call(rbind, args = est_df)
)

row.names(est_df) <- NULL

which results in the following output:

est_df
#>            region indicators       X50.       X2.5.     X97.5.
#> 1          Bakool       anc1 0.30261405 0.188862799 0.41167127
#> 2             Bay       anc2 0.11251780 0.050903865 0.19504237
#> 3 Middle Shabelle       anc1 0.32391543 0.217411669 0.43781930
#> 4          Bakool       anc2 0.01893172 0.002766156 0.03663750
#> 5             Bay       anc1 0.20220114 0.134820317 0.27676772
#> 6 Middle Shabelle       anc2 0.01724140 0.007237952 0.03006251

Alternative blocked weighted bootstrap function set

From this demonstration, the bootBW() function proves to be straightforward to implement and can be easily incorporated into a user’s workflow based on their dataset and their analytic needs. However, as shown above, this flexibility requires a lot more extra coding from the user to get from resampling to indicator estimates.

Starting from v0.3.0, an alternative set of functions is available to perform blocked weighted bootstrap resampling that facilitates all the steps from resampling to estimation. Below is an example of how to use this alternative set of functions for the same tasks shown above.

This set of functions attempts to make the blocked weighted bootstrap algorithm more efficient through vectorisation and use of parallelisation techniques. The function syntax has been kept consistent with bootBW() for ease of transition.

Bootstrap resampling with boot_bw()

The boot_bw() function is the alternative bootstrap resampling function of the package. It can be used as follows:

boot_df <- boot_bw(
  x = indicatorsHH, w = villageData, statistic = bootClassic,
  params = c("anc1", "anc2")
)

This call to boot_bw() takes in the survey dataset indicatorsHH as its first argument (x). This dataset is expected to have a variable labelled as psu which identifies the primary sampling unit from which data was collected during the survey and then additional variables for the indicators to be estimated. The second argument (w) is for the dataset of the list of primary sampling units that were sampled in the survey to collect the survey data specified in x. This dataset, which in this case is villageData, should have at least a variable labelled psu which identified the primary sampling unit that matches the same variable in the survey dataset and a variable labelled pop for the population size of the primary sampling unit. The statistic argument specified the type of statistic to apply to the bootstrap replicates. There are two of these functions available from the {bbw} package - bootClassic() and the bootPROBIT(). For this example, the bootClassic() function is used to get the mean value of the bootstrap replicates. This is generally useful for binomial type of indicators and for continuous variables of which to get the mean of. The params argument takes in values of the indicator names in x to be estimated. In this example, two indicator names for antenatal care are specified. Finally, the argument for replicates specify the number of replicate bootstraps to be performed. The default of 400 replicates is used here. As can be noted, the boot_bw() takes on the same type of arguments as bootBW() and the syntax is exactly the same. Hence, using this alternative function will be familiar to those who have had experience using the original function.

However, the output of the boot_bw() function is structured differently from the bootBW() function. The boot_bw() function produces and object of class boot_bw.

class(boot_df)
#> [1] "boot_bw"

The object boot_bw is a list with 4 named components: params for the values specified for the params argument, replicates for the number of bootstrap replicates performed, strata for the values specified for stratification, and boot_data which is the bootstrap results.

names(boot_df)
#> [1] "params"     "replicates" "strata"     "boot_data"

The boot_data component of the boot_bw object corresponds to the output of the bootBW() function.

Other than the difference in the structure of the output, this alternative function also has three additional arguments for the new features it provides.

  • strata - the variable name in x that provides information on the stratification in the survey data. This is by default set to NULL signifying no stratification. This argument allows the user to perform stratified bootstrap resampling conveniently through the boot_bw() function.

  • parallel - whether or not to use parallel computation for the bootstrap resampling. This is by default set to FALSE in which case bootstrap resampling is done sequentially as is with the bootBW() function. If set to TRUE, the function sets up parallel computing and utilises the machines available cores (see cores argument below).

  • cores - the number of cores to use for parallel computation. This is only evaluated if parallel = TRUE. By default, this is set to 1 less the total available number of cores of the current machine.

To use these new features and functionality, the call to boot_bw() would look something like this:

boot_df <- boot_bw(
  x = indicatorsHH, w = villageData, statistic = bootClassic,
  params = c("anc1", "anc2"), strata = "region", parallel = TRUE
)

This produces a boot_bw class list object with the same components as above. The only different is that the boot_data component is a list (instead of a data.frame) with each component being the data.frame bootstrap resampling output for each of the strata in the dataset.

class(boot_df)
#> [1] "boot_bw"

class(boot_df$boot_data)
#> [1] "list"

names(boot_df$boot_data)
#> [1] "Bakool"          "Bay"             "Middle Shabelle"

Bootstrap estimation

The boot_bw_estimate() function can then be applied to the output of the boot_bw() function to get the indicator estimates with 95% confidence interval.

boot_bw_estimate(boot_df)
#>            region indicator        est        lcl        ucl
#> 1          Bakool      anc1 0.43888889 0.38881944 0.48888889
#> 2          Bakool      anc2 0.38055556 0.32497749 0.43062500
#> 3             Bay      anc1 0.71619066 0.63887512 0.77849135
#> 4             Bay      anc2 0.00254615 0.00000000 0.01294677
#> 5 Middle Shabelle      anc1 0.20757542 0.14514451 0.28293531
#> 6 Middle Shabelle      anc2 0.05065259 0.03133757 0.07453108
#>            se
#> 1 0.027718319
#> 2 0.027983726
#> 3 0.036466569
#> 4 0.003743969
#> 5 0.036375151
#> 6 0.011463590

These two functions can be piped to each other for a single workflow from bootstrap resampling to estimation.

boot_bw(
  x = indicatorsHH, w = villageData, statistic = bootClassic,
  params = c("anc1", "anc2"), strata = "region", parallel = TRUE
) |>
  boot_bw_estimate()
#>            region indicator         est       lcl        ucl
#> 1          Bakool      anc1 0.438888889 0.3805556 0.49444444
#> 2          Bakool      anc2 0.376731302 0.3138889 0.43888889
#> 3             Bay      anc1 0.719130072 0.6487833 0.78255787
#> 4             Bay      anc2 0.002534854 0.0000000 0.01262706
#> 5 Middle Shabelle      anc1 0.203423968 0.1428536 0.27819673
#> 6 Middle Shabelle      anc2 0.051256281 0.0339071 0.07622767
#>            se
#> 1 0.030425611
#> 2 0.030033802
#> 3 0.034273078
#> 4 0.003372086
#> 5 0.033966913
#> 6 0.010573679

More efficient bootstrap resampling

A key feature of the most recent {bbw} update is its new function set that uses parallelisation for bootstrap resampling. This vignette explores the bootstrap resampling efficiencies gained with parallelisation.

Applying the original and the alternative function/set to the Somalia survey dataset available from this package, bootstrap resampling is applied using the same parameters and the time the operation it takes to run is measured and compared.

Bootstrap resampling without parallelisation

In this comparison, the original and alternative function/set both implement sequential bootstrap resampling with number of parameters set at varying values.

Using one parameter and 400 replicates

Original vs Alternative bootstrap resampling function/set
Sequential resampling with 1 parameter and 400 replicates
User System Elapsed
Original - 400 replicates - 1 parameter 31.505 0.0320000000000036 31.4860000000008
Alternative - 400 replicates - 1 parameter 25.956 0 25.8919999999998

Performing bootstrap resampling sequentially, the original function took 31.486 seconds to run while the alternative function set took 25.892 seconds to run. There was very little difference between the original and the alternative function/set.

Using varying number of parameters and 400 replicates

Original vs Alternative bootstrap resampling function/set
Sequential resampling with increasing number of parameters and 400 replicates
No. of parameters User - Original System - Original Elapsed - Original User - Alternative System - Alternative Elapsed - Alternative
1 31.505 0.032 31.486 25.956 0 25.892
2 32.442 0.000 32.363 25.747 0 25.690
4 32.485 0.000 32.404 25.302 0 25.244
8 31.727 0.000 31.652 25.660 0 25.604

There are marginal gains with the alternative function set when the number of parameters more than 1 but the gains do not increase with the increase in the number of parameters.

Bootstrap resampling with parallelisation

In this comparison, the alternative function/set implements parallel bootstrap resampling with number of parameters set at varying values and number of parallel cores set at varying values and then compared to performance of the original function as above.

Original vs Alternative bootstrap resampling function/set
Parallel resampling with increasing number of parameters and increasing number of cores
No. of parameters Original Alternative - sequential Alternative - 2 cores Alternative - 4 cores Alternative - 8 cores
1 31.486 25.892 16.904 10.392 7.361
2 32.363 25.690 16.754 10.634 7.398
4 32.404 25.244 17.017 10.545 7.585
8 31.652 25.604 17.568 10.240 7.624

This updated version of the {bbw} package maintains support for the original bootstrap resampling function bootBW() so as to support existing workflows that use it. We would recommend, however, that the new boot_bw() set of functions be used for new uses and for new workflows.



Ernest Guevarra
Ernest Guevarra
Founding Member

I am a public health specialist with a particular interest in health and nutrition metrics and analytics, and in spatial epidemiology. I develop fit-for-purpose R packages as part of my work with data.

Related