Run-length encoding
Identifying and grouping by runs in base R
Section titled “Identifying and grouping by runs in base R”One might want to group their data by the runs of a variable and perform some sort of analysis. Consider the following simple dataset:
(dat <- data.frame(x = c(1, 1, 2, 2, 2, 1), y = 1:6))# x y# 1 1 1# 2 1 2# 3 2 3# 4 2 4# 5 2 5# 6 1 6The variable x has three runs: a run of length 2 with value 1, a run of length 3 with value 2, and a run of length 1 with value 1. We might want to compute the mean value of variable y in each of the runs of variable x (these mean values are 1.5, 4, and 6).
In base R, we would first compute the run-length encoding of the x variable using rle:
(r <- rle(dat$x))# Run Length Encoding# lengths: int [1:3] 2 3 1# values : num [1:3] 1 2 1The next step is to compute the run number of each row of our dataset. We know that the total number of runs is length(r$lengths), and the length of each run is r$lengths, so we can compute the run number of each of our runs with rep:
(run.id <- rep(seq_along(r$lengths), r$lengths))# [1] 1 1 2 2 2 3Now we can use tapply to compute the mean y value for each run by grouping on the run id:
data.frame(x=r$values, meanY=tapply(dat$y, run.id, mean))# x meanY# 1 1 1.5# 2 2 4.0# 3 1 6.0Run-length Encoding with rle
Section titled “Run-length Encoding with rle”Run-length encoding captures the lengths of runs of consecutive elements in a vector. Consider an example vector:
dat <- c(1, 2, 2, 2, 3, 1, 4, 4, 1, 1)The rle function extracts each run and its length:
r <- rle(dat)r# Run Length Encoding# lengths: int [1:6] 1 3 1 1 2 2# values : num [1:6] 1 2 3 1 4 1The values for each run are captured in r$values:
r$values# [1] 1 2 3 1 4 1This captures that we first saw a run of 1’s, then a run of 2’s, then a run of 3’s, then a run of 1’s, and so on.
The lengths of each run are captured in r$lengths:
r$lengths# [1] 1 3 1 1 2 2We see that the initial run of 1’s was of length 1, the run of 2’s that followed was of length 3, and so on.
Run-length encoding to compress and decompress vectors
Section titled “Run-length encoding to compress and decompress vectors”Long vectors with long runs of the same value can be significantly compressed by storing them in their run-length encoding (the value of each run and the number of times that value is repeated). As an example, consider a vector of length 10 million with a huge number of 1’s and only a small number of 0’s:
set.seed(144)dat <- sample(rep(0:1, c(1, 1e5)), 1e7, replace=TRUE)table(dat)# 0 1# 103 9999897Storing 10 million entries will require significant space, but we can instead create a data frame with the run-length encoding of this vector:
rle.df <- with(rle(dat), data.frame(values, lengths))dim(rle.df)# [1] 207 2head(rle.df)# values lengths# 1 1 52818# 2 0 1# 3 1 219329# 4 0 1# 5 1 318306# 6 0 1From the run-length encoding, we see that the first 52,818 values in the vector are 1’s, followed by a single 0, followed by 219,329 consecutive 1’s, followed by a 0, and so on. The run-length encoding only has 207 entries, requiring us to store only 414 values instead of 10 million values. As rle.df is a data frame, it can be stored using standard functions like write.csv.
Decompressing a vector in run-length encoding can be accomplished in two ways. The first method is to simply call rep, passing the values element of the run-length encoding as the first argument and the lengths element of the run-length encoding as the second argument:
decompressed <- rep(rle.df$values, rle.df$lengths)We can confirm that our decompressed data is identical to our original data:
identical(decompressed, dat)# [1] TRUEThe second method is to use R’s built-in inverse.rle function on the rle object, for instance:
rle.obj <- rle(dat) # create a rle object hereclass(rle.obj)# [1] "rle"
dat.inv <- inverse.rle(rle.obj) # apply the inverse.rle on the rle objectWe can confirm again that this produces exactly the original dat:
identical(dat.inv, dat)# [1] TRUEIdentifying and grouping by runs in data.table
Section titled “Identifying and grouping by runs in data.table”The data.table package provides a convenient way to group by runs in data. Consider the following example data:
library(data.table)(DT <- data.table(x = c(1, 1, 2, 2, 2, 1), y = 1:6))# x y# 1: 1 1# 2: 1 2# 3: 2 3# 4: 2 4# 5: 2 5# 6: 1 6The variable x has three runs: a run of length 2 with value 1, a run of length 3 with value 2, and a run of length 1 with value 1. We might want to compute the mean value of variable y in each of the runs of variable x (these mean values are 1.5, 4, and 6).
The data.table rleid function provides an id indicating the run id of each element of a vector:
rleid(DT$x)# [1] 1 1 2 2 2 3One can then easily group on this run ID and summarize the y data:
DT[,mean(y),by=.(x, rleid(x))]# x rleid V1# 1: 1 1 1.5# 2: 2 2 4.0# 3: 1 3 6.0Remarks
Section titled “Remarks”A run is a consecutive sequence of repeated values or observations. For repeated values, R’s “run-length encoding” concisely describes a vector in terms of its runs. Consider:
dat <- c(1, 2, 2, 2, 3, 1, 4, 4, 1, 1)We have a length-one run of 1s; then a length-three run of 2s; then a length-one run of 3s; and so on. R’s run-length encoding captures all the lengths and values of a vector’s runs.
Extensions
Section titled “Extensions”A run can also refer to consecutive observations in a tabular data. While R doesn’t have a natural way of encoding these, they can be handled with rleid from the data.table package (currently a dead-end link).