# # Run-length encoding

## # Identifying and grouping by runs in base R

One might want to group their data by the runs of a variable and perform some sort of analysis. Consider the following simple dataset:

``````(dat <- data.frame(x = c(1, 1, 2, 2, 2, 1), y = 1:6))
#   x y
# 1 1 1
# 2 1 2
# 3 2 3
# 4 2 4
# 5 2 5
# 6 1 6

``````

The variable `x` has three runs: a run of length 2 with value 1, a run of length 3 with value 2, and a run of length 1 with value 1. We might want to compute the mean value of variable `y` in each of the runs of variable `x` (these mean values are 1.5, 4, and 6).

In base R, we would first compute the run-length encoding of the `x` variable using `rle`:

``````(r <- rle(dat\$x))
# Run Length Encoding
#   lengths: int [1:3] 2 3 1
#   values : num [1:3] 1 2 1

``````

The next step is to compute the run number of each row of our dataset. We know that the total number of runs is `length(r\$lengths)`, and the length of each run is `r\$lengths`, so we can compute the run number of each of our runs with `rep`:

``````(run.id <- rep(seq_along(r\$lengths), r\$lengths))
#  1 1 2 2 2 3

``````

Now we can use `tapply` to compute the mean `y` value for each run by grouping on the run id:

``````data.frame(x=r\$values, meanY=tapply(dat\$y, run.id, mean))
#   x meanY
# 1 1   1.5
# 2 2   4.0
# 3 1   6.0

``````

## # Run-length Encoding with `rle`

Run-length encoding captures the lengths of runs of consecutive elements in a vector. Consider an example vector:

``````dat <- c(1, 2, 2, 2, 3, 1, 4, 4, 1, 1)

``````

The `rle` function extracts each run and its length:

``````r <- rle(dat)
r
# Run Length Encoding
#   lengths: int [1:6] 1 3 1 1 2 2
#   values : num [1:6] 1 2 3 1 4 1

``````

The values for each run are captured in `r\$values`:

``````r\$values
#  1 2 3 1 4 1

``````

This captures that we first saw a run of 1's, then a run of 2's, then a run of 3's, then a run of 1's, and so on.

The lengths of each run are captured in `r\$lengths`:

``````r\$lengths
#  1 3 1 1 2 2

``````

We see that the initial run of 1's was of length 1, the run of 2's that followed was of length 3, and so on.

## # Run-length encoding to compress and decompress vectors

Long vectors with long runs of the same value can be significantly compressed by storing them in their run-length encoding (the value of each run and the number of times that value is repeated). As an example, consider a vector of length 10 million with a huge number of 1's and only a small number of 0's:

``````set.seed(144)
dat <- sample(rep(0:1, c(1, 1e5)), 1e7, replace=TRUE)
table(dat)
#       0       1
#     103 9999897

``````

Storing 10 million entries will require significant space, but we can instead create a data frame with the run-length encoding of this vector:

``````rle.df <- with(rle(dat), data.frame(values, lengths))
dim(rle.df)
#  207   2
#   values lengths
# 1      1   52818
# 2      0       1
# 3      1  219329
# 4      0       1
# 5      1  318306
# 6      0       1

``````

From the run-length encoding, we see that the first 52,818 values in the vector are 1's, followed by a single 0, followed by 219,329 consecutive 1's, followed by a 0, and so on. The run-length encoding only has 207 entries, requiring us to store only 414 values instead of 10 million values. As `rle.df` is a data frame, it can be stored using standard functions like `write.csv`.

Decompressing a vector in run-length encoding can be accomplished in two ways. The first method is to simply call `rep`, passing the `values` element of the run-length encoding as the first argument and the `lengths` element of the run-length encoding as the second argument:

``````decompressed <- rep(rle.df\$values, rle.df\$lengths)

``````

We can confirm that our decompressed data is identical to our original data:

``````identical(decompressed, dat)
#  TRUE

``````

The second method is to use R's built-in `inverse.rle` function on the `rle` object, for instance:

``````rle.obj <- rle(dat)                            # create a rle object here
class(rle.obj)
#  "rle"

dat.inv <- inverse.rle(rle.obj)               # apply the inverse.rle on the rle object

``````

We can confirm again that this produces exactly the original `dat`:

``````identical(dat.inv, dat)
#  TRUE

``````

## # Identifying and grouping by runs in data.table

The data.table package provides a convenient way to group by runs in data. Consider the following example data:

``````library(data.table)
(DT <- data.table(x = c(1, 1, 2, 2, 2, 1), y = 1:6))
#    x y
# 1: 1 1
# 2: 1 2
# 3: 2 3
# 4: 2 4
# 5: 2 5
# 6: 1 6

``````

The variable `x` has three runs: a run of length 2 with value 1, a run of length 3 with value 2, and a run of length 1 with value 1. We might want to compute the mean value of variable `y` in each of the runs of variable x (these mean values are 1.5, 4, and 6).

The data.table `rleid` function provides an id indicating the run id of each element of a vector:

``````rleid(DT\$x)
#  1 1 2 2 2 3

``````

One can then easily group on this run ID and summarize the `y` data:

``````DT[,mean(y),by=.(x, rleid(x))]
#    x rleid  V1
# 1: 1     1 1.5
# 2: 2     2 4.0
# 3: 1     3 6.0

``````

#### # Remarks

A run is a consecutive sequence of repeated values or observations. For repeated values, R's "run-length encoding" concisely describes a vector in terms of its runs. Consider:

``````dat <- c(1, 2, 2, 2, 3, 1, 4, 4, 1, 1)

``````

We have a length-one run of 1s; then a length-three run of 2s; then a length-one run of 3s; and so on. R's run-length encoding captures all the lengths and values of a vector's runs.

### # Extensions

A run can also refer to consecutive observations in a tabular data. While R doesn't have a natural way of encoding these, they can be handled with `rleid` from the data.table package (currently a dead-end link).