Turbocharge your R code with Rcpp

So, you’ve written code in R that contains somewhat complicated loops, and the execution time is not as fast as you hoped. You turn to the profvis package in RStudio (or Rprof) to profile the program, hoping to find the places in your code causing the bottleneck. The profiler flags a few areas for you to focus on and make more efficient, but unfortunately, no matter how many ‘loops’ you jump through, you can’t seem to reduce the execution time.

Next, you spend at least a couple of frustrating hours trying to figure out how to vectorize (think: higher-level programming to improve efficiency) the loops creating the bottleneck, to no avail. And it’s okay to admit it, we’ve all been there.

STOP!!!! The solution may be to rewrite some of your key functions in C++.

In this blog post I will share a simple but real-life example of how I used Rcpp to convert code I wrote to measure model risk, making my boss happy and saving a LOT of time in the process. In this case, running the original code written in native R (with only 100K iterations) took longer than the entire process of rewriting the code and running ten times the number of iterations using Rcpp…sold yet?

Pure R vs. Rcpp

Here's the secret: Rcpp allows R users to take advantage of the speed of C++, and C++ is unparalleled in terms of speed. Aruoba and Fernández-Villaverde found that R code runs anywhere from 200 to 400 times slower than pure C++. This suggests that leveraging C++ can provide real advantages when coupled with the functionality of the native R environment.

To review R basics: it is a language developed mainly for statisticians and economists. Unfortunately, the same features that make R so attractive to statisticians (ease of use, predefined functions and packages, etc.) are precisely what lead to sluggish execution. The problem is exacerbated by R's notorious memory management. Unlike C++, where the size of an array must be declared up front, an R vector can grow dynamically; each time it outgrows its current allocation, R copies the entire object into a new, larger block of free memory. It is easy to see how this not only slows computation but also how simply executing code can exhaust the available memory.
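To see the cost of this copying for yourself, compare growing a vector element by element against preallocating it. A minimal sketch (exact timings will vary by machine):

n <- 1e5
system.time({ x <- numeric(0); for (i in 1:n) x[i] <- i })  # grows, and copies, repeatedly
system.time({ y <- numeric(n); for (i in 1:n) y[i] <- i })  # allocated once up front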

Rcpp, the super-charging extension written by Dirk Eddelbuettel and Romain François, provides a simple interface for connecting C++ to R. Many R programmers have either never programmed with Rcpp or simply aren't aware of the performance gains it can deliver. While R gives programmers vectorization and a whole host of apply functions to make code more efficient and faster to run, some parts of your code do not lend themselves to such operations (e.g., loops that can't easily be vectorized). Vectorized functions in R are faster than interpreted R code precisely because they are implemented in lower-level languages (C, C++). So, having the ability to incorporate that lower-level functionality directly into your own code is a huge advantage.
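For the same reason, a vectorized call is dramatically faster than an interpreted loop computing the identical result. A quick illustration (timings will vary by machine):

x <- runif(1e6)

# interpreted loop: every iteration passes through the R interpreter
loopSum <- function(v) { s <- 0; for (i in seq_along(v)) s <- s + v[i]; s }
system.time(loopSum(x))

# vectorized: sum() dispatches straight to compiled C code
system.time(sum(x))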

There are free resources available online to get you up and running with Rcpp.

The purpose of this article is to demonstrate the virtues of Rcpp and the potential performance improvements from converting a key function from native R to Rcpp. To this end, a full implementation of a model risk estimation exercise will not be performed here. For a quick review of model risk and parameter uncertainty, check out the blog post “Quantify Model Risk through Parameter Uncertainty“. In this article, we will short-circuit the process and limit our scope to quantifying the uncertainty in estimating the probability of default (PD) from historical default data. Note that this approach extends naturally to measuring other sources of parameter uncertainty: variability in the historical recovery data used to estimate loss rates (i.e., LGDs), correlation analyses, etc.

A Non-Parametric Bootstrap Process to Measure Parameter Uncertainty

A brute force method for measuring parameter estimation error is a non-parametric bootstrap. Anyone who has attempted even a simple bootstrap procedure using serial computations in R knows how frustrating it can be: as the number of iterations grows, the memory issues for which R is infamous become evident. There are stable R packages built for bootstrapping, such as the well-optimized boot package, whose parallel argument lets users improve run times. However, this may cause other issues if you like having a certain amount of control over your code, or if you don't want too many packages in your program adding overhead to the processing time. Although Rcpp can be a bit gnarly to debug, it can serve as a simple alternative, solving the speed issue without introducing major overhead concerns.
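For reference, a minimal sketch of the boot-package approach, here bootstrapping the mean of a default-rate series like the AAA sample defined below (note that parallel = "multicore" is unavailable on Windows):

library(boot)

meanPD <- function(d, idx) mean(d[idx])   # statistic: mean of the resampled defaults

b <- boot(AAA, statistic = meanPD, R = 10000,
          parallel = "multicore", ncpus = 4)
b$t0                  # point estimate from the original sample
quantile(b$t, 0.5)    # median of the bootstrap distribution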

Our example looks at the estimation of a measure of default, and how the PD parameter’s uncertainty can be captured. Assessing uncertainty, or the risk inherent in estimating an unknown parameter, is a fundamental building block in the model development process.

Given a time series of default data for a group of homogeneous firms, the average probability of default can be estimated in a straightforward fashion. The steps to approximate both the true PD (assumed to be a fixed but unknown quantity) and the associated estimation error, using a non-parametric bootstrapping approach, are first implemented below in native R. The algorithm is then optimized using RCPP, and the resulting performance gains are displayed graphically for comparison.

Bootstrap steps:

  1. Use historical default rates from the empirical data set of size n to fit a serially correlated process. A typical strategy for time series data utilizes generalized ARMA (Auto-regressive / Moving average) approaches. For simplicity, we specify a lag-2 auto-regressive model for the structure of the error terms, which is denoted as AR(2).
  2. Randomly draw default rates from two consecutive time periods from the observed data.
  3. Using these two randomly selected starting points, the remaining series of default rates is propagated by:
    • First, applying the estimated regression coefficients generated by the lag-2 auto-regressive process to the two beginning observations,
    • Then, combining with a randomly selected error term (with replacement) from the residuals of the regression model to generate the third “observation”, and
    • Finally, continuing to generate successive “observations”  until the pseudo-data series is of the same size as the original.
  4. Average the generated default rates to calculate the mean PD for that iteration.
  5. Repeat steps 2-4 100,000 times to estimate the distribution of average PD rates for the empirical data (the AR(2) model in step 1 is fit only once per portfolio). For illustrative purposes, we limit the number of iterations to 100,000, but this choice is a function of the developer’s preferences.

Native R Implementation of PD Bootstrapping Code

To implement the above algorithm in the native R environment, the code begins with logic to clear the workspace and set the seed for random number generation.

rm(list=ls())          #clear workspace
set.seed(1)            #set random number seed

The next steps randomly generate a data set to use for our example. For illustration, the sample size is set at 40 observations for each of six homogeneous credit portfolios, indexed by decreasing creditworthiness: AAA, AA, A, BBB, BB, B.

Note also that each portfolio’s set of values represents a random sample of draws from a Uniform distribution, where the draws simulate default rates constructed to be monotonically increasing by credit category, AAA to B:

## Create sample random data set to use in code
sampNum <- 40          #sample number

AAA    <- runif(sampNum, min = 0.00235, max = 0.0053)
AA     <- runif(sampNum, min = 0.00902, max = 0.0242)
A      <- runif(sampNum, min = 0.03046, max = 0.0660)
BBB    <- runif(sampNum, min = 0.05843, max = 0.1168)
BB     <- runif(sampNum, min = 0.10427, max = 0.2055)
B      <- runif(sampNum, min = 0.13828, max = 0.2600)

dat    <- data.frame(AAA, AA, A, BBB, BB, B)

Next, a function is defined to mechanically lag the time series for estimating the auto-regressive structure.

lagpad <- function(x, k) {
  if (!is.vector(x))
    stop('x must be a vector')
  if (!is.numeric(x))
    stop('x must be numeric')
  if (!is.numeric(k))
    stop('k must be numeric')
  if (1 != length(k))
    stop('k must be a single number')
  c(rep(NA, k), x)[1:length(x)]   # shift x right by k positions, padding with NA
}

The bootstrap function (called PDSensitivity) accepts the historical PD series, the regression residuals and coefficients, and the number of simulations, and returns the simulated estimates of average PD:

PDSensitivity <- function(list_dat1, LMResid, coef1, coef2, coef3, simNum){

 n <- length(list_dat1)

 # initialize result and working vectors
 avg_PD <- numeric(simNum)
 modeledTimeSeries <- numeric(n)

 for (s in 1:simNum){

  # randomly select two consecutive starting points
  # (id1 is drawn from 1:(n-1) so that id2 = id1 + 1 stays in range)
  id1 <- sample(1:(n - 1), 1)
  id2 <- id1 + 1

  for (l in 1:n){
     if (l == 1){
         modeledTimeSeries[l] <- list_dat1[id1]
     } else if (l == 2){
         modeledTimeSeries[l] <- list_dat1[id2]
     } else {
        ar1 <- coef2 * modeledTimeSeries[l - 1]  # apply the lag-1 coefficient
        ar2 <- coef3 * modeledTimeSeries[l - 2]  # apply the lag-2 coefficient
        getResid <- LMResid[sample(1:n, 1, replace = TRUE)]  # resample a residual
        modeledTimeSeries[l] <- ar1 + ar2 + getResid + coef1 # add the intercept
     }
  }
  avg_PD[s] <- mean(modeledTimeSeries)
 }
 return(avg_PD)
}

This section of the code initializes a few key variables and calls the PDSensitivity function defined above, which performs the bootstrapping. Depending on your machine, the simulation may run for about ten minutes at 100K iterations.

## Initialize storage for results
avgPD_ALL <- list()

## get dataset column names
gNames <- names(dat)

system.time( # use system.time to measure code execution time
for (i in 1:length(gNames)){
 simNum <- 100000

 ## Create the dataset for the lag-2 auto-regressive model:
 ## x is the series, y is its lag-1 copy, z is its lag-2 copy
 dat1 <- dat[gNames[i]]
 dat1 <- as.data.frame(rep(dat1, 3))
 dat1 <- data.frame(x = as.numeric(unlist(dat1[1])),
                    y = lagpad(as.numeric(unlist(dat1[2])), 1),
                    z = lagpad(as.numeric(unlist(dat1[3])), 2))
 dat1 <- tail(dat1, -2)  # remove the first 2 records (NA lags)

 ## Run the model and get residuals
 LMfit          <- lm(x~y+z, data=dat1)
 LMResid        <- as.numeric(LMfit$residuals)
 LMfittedValues <- fitted(LMfit)

 ## plot the original series and the fitted values
 plot(as.numeric(unlist(dat1[1])), xlab="Qtrs", ylab = "Probability of Default Original", type="l", col="blue")
 points(as.numeric(unlist(LMfittedValues)), type="l", pch=22, lty=2, col="red")

 coef1 <- as.numeric(LMfit$coef[1])  # intercept
 coef2 <- as.numeric(LMfit$coef[2])  # lag-1 coefficient
 coef3 <- as.numeric(LMfit$coef[3])  # lag-2 coefficient

 #use the ids to get the PDs from the PD time series
 list_dat1 <- as.numeric(unlist(dat1[1]))

 ## Run bootstrap procedure
 avgPD <- PDSensitivity(list_dat1,LMResid, coef1, coef2, coef3, simNum)

 ## store the 50th percentile of the average PDs
 avgPD_ALL[i] <- quantile(unlist(avgPD), c(.50),na.rm=TRUE)
 print(quantile(unlist(avgPD), c(.50),na.rm=TRUE))

 ## clean up for re-run
 rm(avgPD)
 }
)


Conversion of the Key Function to Rcpp

Now we examine the same problem in Rcpp. The first step is to create a stand-alone C++ file, which should end with the extension .cpp (you can use any plain-text editor, such as Notepad on Windows, to create it).

Include the Rcpp header file. This pulls in the core Rcpp functions and types you will use; more advanced operations require that you explicitly include the appropriate additional headers (here, <cstdlib> for the C random number functions).

#include <Rcpp.h>
#include <cstdlib>   // for rand() and srand()
using namespace Rcpp;
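As an aside, if you want to experiment before creating a separate file, Rcpp's cppFunction() compiles a C++ function supplied as an R string. A small sketch to verify your toolchain is working:

library(Rcpp)
cppFunction('
  double sumC(NumericVector x) {
    double total = 0;
    for (int i = 0; i < x.size(); i++) total += x[i];  // note the zero-based indexing
    return total;
  }
')
sumC(c(1, 2, 3))   # returns 6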

The Rcpp syntax is quite similar to that of the native R implementation, with a few key differences. The Rcpp implementation:

  1. requires the explicit declaration of variable and array types.
  2. uses zero-based indexing, so the first element of a vector is element 0 (Rcpp vectors can be indexed with either "()" or "[]").
  3. uses the C-style 'for'-loop syntax rather than R's for (i in ...).
  4. draws random indices with the C library's rand() function (e.g., rand() % numRows) instead of the native R function sample(); note that rand() is seeded via srand() and is unaffected by R's set.seed().

The Rcpp implementation follows:


//[[Rcpp::export]]
NumericVector PDSensitivity(NumericVector list_dat1, NumericVector LMResid, int numRows, double coef1, double coef2, double coef3, int simNum){

NumericVector avg_PD(simNum);
NumericVector modeledTimeSeries(numRows); // fixed-size working array

double ar1, ar2, getResid, sumPD;
int id1, id2, getRand;

srand(42); // seed the C RNG once, outside the loop

for (int s = 0; s < simNum; s++){

  // randomly select two consecutive starting points (0-based indexing,
  // so id1 is drawn from 0..numRows-2 and id2 = id1 + 1 stays in range)
  id1 = rand() % (numRows - 1);
  id2 = id1 + 1;

  for (int l = 0; l < numRows; l++){
     if (l == 0){
        modeledTimeSeries[l] = list_dat1[id1];
     } else if (l == 1){
        modeledTimeSeries[l] = list_dat1[id2];
     } else {
       ar1 = coef2 * modeledTimeSeries[l-1]; // apply the lag-1 coefficient
       ar2 = coef3 * modeledTimeSeries[l-2]; // apply the lag-2 coefficient
       getRand = rand() % numRows;           // resample a residual index
       getResid = LMResid[getRand];
       modeledTimeSeries[l] = ar1 + ar2 + getResid + coef1;
     }
  }

  // average the generated default rates for this iteration
  sumPD = 0.0;
  for (int k = 0; k < numRows; k++){
     sumPD += modeledTimeSeries[k];
  }
  avg_PD[s] = sumPD / numRows;
}
return avg_PD;
}

To run the Rcpp implementation, modify the R script to load Rcpp and add the sourceCpp() call:

rm(list=ls())   # clear workspace
set.seed(1)     # set random number seed
library(Rcpp)   # load Rcpp so sourceCpp() is available
sourceCpp("~/pdsensitivity.cpp")

The sourceCpp() function compiles, links, and loads the C++ code in one step. The function can then be called within R like any other function.
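The driver loop from the native R version stays the same except for the call itself, since the compiled function also expects the series length. A sketch matching the signature above:

numRows <- length(list_dat1)
avgPD   <- PDSensitivity(list_dat1, LMResid, numRows, coef1, coef2, coef3, simNum)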

I experimented with a series of iteration counts and recorded the execution times to demonstrate the efficiency gained by rewriting the bootstrap function in Rcpp. The results are shown in the table and graph below, and the improvements are remarkable: execution was between 198 and 429 times faster with Rcpp than with the function written in native R.
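To reproduce the comparison, one simple harness is to time both versions over increasing iteration counts with system.time(). Here PDSensitivity_R is an assumed rename of the native R function, so the two implementations can coexist in one session:

for (simNum in c(1e3, 1e4, 1e5)) {
  # PDSensitivity_R: hypothetical rename of the native R bootstrap function
  tR   <- system.time(PDSensitivity_R(list_dat1, LMResid, coef1, coef2, coef3, simNum))["elapsed"]
  tCpp <- system.time(PDSensitivity(list_dat1, LMResid, length(list_dat1),
                                    coef1, coef2, coef3, simNum))["elapsed"]
  cat(simNum, "iterations:", tR, "s (R) vs", tCpp, "s (Rcpp), ~", round(tR / tCpp), "x faster\n")
}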

Think about it: your frustratingly sluggish code super-charged, turning what used to take days to run into mere minutes of execution time. Just imagine what you could do with Rcpp…

[Table and chart: execution times and speedups, native R vs. Rcpp]
