# R code vectorization best practices
## By row operations
The key to vectorizing R code is to reduce or eliminate "by row operations" and repeated method dispatching of R functions.
That means that when approaching a problem that at first glance requires "by row operations", such as calculating the means of each row, one needs to ask:
- What are the classes of the data sets I'm dealing with?
- Is there existing compiled code that can achieve this without the need for repetitive evaluation of R functions?
- If not, can I do these operations by column instead of by row?
- Finally, is it worth spending a lot of time developing complicated vectorized code instead of just running a simple `apply` loop? In other words, is the data big/sophisticated enough that R can't handle it efficiently using a simple loop?
Putting aside the issues of memory pre-allocation and growing objects in loops, in this example we will focus on how to possibly avoid `apply` loops, method dispatching, and re-evaluating R functions within loops.
A standard/easy way of calculating the mean by row would be:

```r
apply(mtcars, 1, mean)
```

```
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant Duster 360
29.90727 29.98136 23.59818 38.73955 53.66455 35.04909 59.72000
Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE Merc 450SL Merc 450SLC
24.63455 27.23364 31.86000 31.78727 46.43091 46.50000 46.35000
Cadillac Fleetwood Lincoln Continental Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla Toyota Corona
66.23273 66.05855 65.97227 19.44091 17.74227 18.81409 24.88864
Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
47.24091 46.00773 58.75273 57.37955 18.92864 24.77909 24.88027
Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
60.97182 34.50818 63.15545 26.26273
```
But can we do better? Let's see what happened here:
- First, we converted a `data.frame` to a `matrix`. (Note that this happens within the `apply` function.) This is both inefficient and dangerous: a `matrix` can't hold several column types at a time, so such a conversion will probably lead to loss of information and sometimes to misleading results (compare `apply(iris, 2, class)` with `str(iris)` or with `sapply(iris, class)`; see the sketch after this list).
- Second, we performed an operation repetitively, once for each row, meaning we had to evaluate some R function `nrow(mtcars)` times. In this specific case, `mean` is not a computationally expensive function, so R could likely handle it easily even for a big data set, but what would happen if we needed to calculate the standard deviation by row (which involves an expensive square root operation)? Which brings us to the next point:
- We evaluated the R function many times, but maybe there already is a compiled version of this operation?
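The conversion issue from the first point is easy to demonstrate with `iris`, which mixes numeric columns with a factor; a minimal sketch:

```r
apply(iris, 2, class)   # every entry is "character": apply() coerced iris
                        # to a character matrix first (Species is a factor)
sapply(iris, class)     # "numeric" for the four measurements, "factor" for Species
```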
Indeed, for row means there is such a compiled function; we could simply do:

```r
rowMeans(mtcars)
```

```
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant Duster 360
29.90727 29.98136 23.59818 38.73955 53.66455 35.04909 59.72000
Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE Merc 450SL Merc 450SLC
24.63455 27.23364 31.86000 31.78727 46.43091 46.50000 46.35000
Cadillac Fleetwood Lincoln Continental Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla Toyota Corona
66.23273 66.05855 65.97227 19.44091 17.74227 18.81409 24.88864
Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
47.24091 46.00773 58.75273 57.37955 18.92864 24.77909 24.88027
Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
60.97182 34.50818 63.15545 26.26273
```
This involves no by-row operations and therefore no repetitive evaluation of R functions. However, we still converted a `data.frame` to a `matrix`. Though `rowMeans` has an error-handling mechanism and won't run on a data set that it can't handle, that check still has an efficiency cost.
```r
rowMeans(iris)
```

```
Error in rowMeans(iris) : 'x' must be numeric
```
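If we only want the means of the numeric columns, we can subset them first; a minimal sketch:

```r
num_cols <- sapply(iris, is.numeric)   # TRUE for the four measurement columns
rowMeans(iris[num_cols])               # row means over the numeric columns only
```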
But still, can we do better? Instead of a matrix conversion with error handling, we could try a different method that lets us operate on `mtcars` as a list of column vectors (because a `data.frame` is essentially a `list`, and a `list` is a `vector`):
```r
Reduce(`+`, mtcars)/ncol(mtcars)
```

```
[1] 29.90727 29.98136 23.59818 38.73955 53.66455 35.04909 59.72000 24.63455 27.23364 31.86000 31.78727 46.43091 46.50000 46.35000 66.23273 66.05855
[17] 65.97227 19.44091 17.74227 18.81409 24.88864 47.24091 46.00773 58.75273 57.37955 18.92864 24.77909 24.88027 60.97182 34.50818 63.15545 26.26273
```
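To see why this is fast, consider what `Reduce` is doing: it folds `+` over the list of columns, performing one fast vectorized addition per column (10 additions here) instead of `nrow(mtcars)` evaluations of `mean`. A sketch of the unrolled computation, which should print `TRUE`:

```r
identical(Reduce(`+`, mtcars),
          mtcars[[1]] + mtcars[[2]] + mtcars[[3]] + mtcars[[4]] +
          mtcars[[5]] + mtcars[[6]] + mtcars[[7]] + mtcars[[8]] +
          mtcars[[9]] + mtcars[[10]] + mtcars[[11]])
```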
Now, for a possible speed gain, we lost column names and error handling (including `NA` handling).
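The lost `NA` handling, for instance, is easy to demonstrate (`mtcars_na` is a hypothetical copy of `mtcars` with one missing value):

```r
mtcars_na <- mtcars
mtcars_na[1, 1] <- NA                          # introduce a single missing value
rowMeans(mtcars_na, na.rm = TRUE)[1]           # still defined: the NA is skipped
(Reduce(`+`, mtcars_na)/ncol(mtcars_na))[1]    # NA: there is no na.rm equivalent
```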
Another example would be calculating the mean by group. Using base R, we could try:

```r
aggregate(. ~ cyl, mtcars, mean)
```

```
cyl mpg disp hp drat wt qsec vs am gear carb
1 4 26.66364 105.1364 82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909 1.545455
2 6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143 3.428571
3 8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714 3.500000
```
Still, we are basically evaluating an R function in a loop, but the loop is now hidden in an internal C function (it matters little whether it is a C or an R loop).
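We can make that hidden repetition visible by counting evaluations; a rough sketch (`counting_mean` is a hypothetical wrapper, and on a stock R session the count should come out to one call per group per column, i.e. 3 * 10 = 30):

```r
calls <- 0L
counting_mean <- function(x) {
  calls <<- calls + 1L   # record each evaluation of the summary function
  mean(x)
}
invisible(aggregate(. ~ cyl, mtcars, counting_mean))
calls
```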
Could we avoid it? Well, there is a compiled function in R called `rowsum`, hence we could do:

```r
rowsum(mtcars[-2], mtcars$cyl)/table(mtcars$cyl)
```

```
mpg disp hp drat wt qsec vs am gear carb
4 26.66364 105.1364 82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909 1.545455
6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143 3.428571
8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714 3.500000
```
Though, again, we had to convert to a `matrix` first.
At this point we may question whether our current data structure is the most appropriate one. Is a `data.frame` really the best practice? Or should one just switch to a `matrix` data structure in order to gain efficiency?
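If the data truly is all-numeric, as `mtcars` is, paying the conversion cost once up front may be the simplest answer; a minimal sketch:

```r
mtcars_mat <- data.matrix(mtcars)   # one-time conversion; safe here since all columns are numeric
rowMeans(mtcars_mat)                # operates on the matrix directly, no per-call conversion
```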
By-row operations get more and more expensive (even on matrices) as we start to evaluate expensive functions each time. Let us consider a by-row variance calculation as an example. Say we have a matrix `m`:
```r
set.seed(100)
m <- matrix(sample(1e2), 10)
m
```

```
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 8 33 39 86 71 100 81 68 89 84
[2,] 12 16 57 80 32 82 69 11 41 92
[3,] 62 91 53 13 42 31 60 70 98 79
[4,] 66 94 29 67 45 59 20 96 64 1
[5,] 36 63 76 6 10 48 85 75 99 2
[6,] 18 4 27 19 44 56 37 95 26 40
[7,] 3 24 21 25 52 51 83 28 49 17
[8,] 46 5 22 43 47 74 35 97 77 65
[9,] 55 54 78 34 50 90 30 61 14 58
[10,] 88 73 38 15 9 72 7 93 23 87
```
One could simply do:

```r
apply(m, 1, var)
```

```
[1] 871.6556 957.5111 699.2111 941.4333 1237.3333 641.8222 539.7889 759.4333 500.4889 1255.6111
```
On the other hand, one could also completely vectorize this operation by following the formula for sample variance, `Var(x) = sum((x - mean(x))^2) / (n - 1)`, which maps directly onto `rowSums` and `rowMeans`:
```r
RowVar <- function(x) {
  rowSums((x - rowMeans(x))^2)/(dim(x)[2] - 1)
}
RowVar(m)
```

```
[1] 871.6556 957.5111 699.2111 941.4333 1237.3333 641.8222 539.7889 759.4333 500.4889 1255.6111
```
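To get a feel for the difference on data where it matters, here is a rough timing sketch (the matrix size is arbitrary and the numbers will vary by machine):

```r
big <- matrix(rnorm(1e6), nrow = 1e4)   # 10,000 rows x 100 columns
system.time(apply(big, 1, var))         # evaluates var() 10,000 times
system.time(RowVar(big))                # a few vectorized passes over the matrix
```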