# Run-length encoding

## Identifying and grouping by runs in base R

One might want to group their data by the runs of a variable and perform some sort of analysis. Consider the following simple dataset:

```
(dat <- data.frame(x = c(1, 1, 2, 2, 2, 1), y = 1:6))
#   x y
# 1 1 1
# 2 1 2
# 3 2 3
# 4 2 4
# 5 2 5
# 6 1 6
```

The variable `x` has three runs: a run of length 2 with value 1, a run of length 3 with value 2, and a run of length 1 with value 1. We might want to compute the mean value of variable `y` in each of the runs of variable `x` (these mean values are 1.5, 4, and 6).

In base R, we would first compute the run-length encoding of the `x` variable using `rle`:

```
(r <- rle(dat$x))
# Run Length Encoding
# lengths: int [1:3] 2 3 1
# values : num [1:3] 1 2 1
```

The next step is to compute the run number of each row of our dataset. We know that the total number of runs is `length(r$lengths)`, and the length of each run is `r$lengths`, so we can compute the run number of each row by using `rep` to repeat each run's index as many times as that run is long:

```
(run.id <- rep(seq_along(r$lengths), r$lengths))
# [1] 1 1 2 2 2 3
```
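
As a cross-check, the same run ids can be derived without `rle` by flagging each position where the value changes and taking a cumulative sum. This is only a sketch: it assumes a numeric vector with no `NA` values, and the names `starts` and `run.id2` are made up for illustration.

```r
x <- c(1, 1, 2, 2, 2, 1)

# TRUE at the first element and wherever the value differs from the previous one
starts <- c(TRUE, diff(x) != 0)

# Each TRUE opens a new run, so the running count of TRUEs is the run id
run.id2 <- cumsum(starts)

# Compare with the rle-based construction
r <- rle(x)
identical(run.id2, rep(seq_along(r$lengths), r$lengths))
```

The `rle` route is the more general of the two, since `diff` only works on numeric input.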

Now we can use `tapply` to compute the mean `y` value for each run by grouping on the run id:

```
data.frame(x=r$values, meanY=tapply(dat$y, run.id, mean))
#   x meanY
# 1 1   1.5
# 2 2   4.0
# 3 1   6.0
```
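
The steps above can be wrapped into a small helper. The sketch below is one possible arrangement, not the only one: `run_id` is a hypothetical function name (it is not part of base R), and `aggregate` is used in place of `tapply` simply because it returns a data frame.

```r
# Hypothetical helper: run id for each element of a vector
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), r$lengths)
}

dat <- data.frame(x = c(1, 1, 2, 2, 2, 1), y = 1:6)
dat$run <- run_id(dat$x)

# Mean of y within each run, returned as a data frame with one row per run
res <- aggregate(y ~ run, data = dat, FUN = mean)

# Attach the value of x that each run carries
res$x <- rle(dat$x)$values
res
```

Keeping the run id as a column of the data frame makes it available to any other grouping tool later on.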

## Run-length encoding with `rle`

Run-length encoding captures the lengths of runs of consecutive elements in a vector. Consider an example vector:

```
dat <- c(1, 2, 2, 2, 3, 1, 4, 4, 1, 1)
```

The `rle` function extracts each run and its length:

```
r <- rle(dat)
r
# Run Length Encoding
# lengths: int [1:6] 1 3 1 1 2 2
# values : num [1:6] 1 2 3 1 4 1
```

The values for each run are captured in `r$values`:

```
r$values
# [1] 1 2 3 1 4 1
```

This captures that we first saw a run of 1's, then a run of 2's, then a run of 3's, then a run of 1's, and so on.

The lengths of each run are captured in `r$lengths`:

```
r$lengths
# [1] 1 3 1 1 2 2
```

We see that the initial run of 1's was of length 1, the run of 2's that followed was of length 3, and so on.
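
Because the encoding keeps lengths and values in parallel, questions about runs reduce to ordinary vector operations. A brief sketch using the same vector (the variable name `longest` is an arbitrary choice):

```r
dat <- c(1, 2, 2, 2, 3, 1, 4, 4, 1, 1)
r <- rle(dat)

# Index of the longest run (the first one, if there are ties)
longest <- which.max(r$lengths)
r$values[longest]   # value carried by the longest run
r$lengths[longest]  # its length

# The run lengths always add up to the length of the original vector
sum(r$lengths) == length(dat)

# Number of separate runs holding each value
table(r$values)
```

Here the longest run is the run of three 2's, and the value 1 appears in three separate runs.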

## Run-length encoding to compress and decompress vectors

Long vectors with long runs of the same value can be significantly compressed by storing them in their run-length encoding (the value of each run and the number of times that value is repeated). As an example, consider a vector of length 10 million with a huge number of 1's and only a small number of 0's:

```
set.seed(144)
dat <- sample(rep(0:1, c(1, 1e5)), 1e7, replace=TRUE)
table(dat)
#       0       1
#     103 9999897
```

Storing all 10 million entries requires significant space (roughly 40 MB as an integer vector), but we can instead create a data frame with the run-length encoding of this vector:

```
rle.df <- with(rle(dat), data.frame(values, lengths))
dim(rle.df)
# [1] 207 2
head(rle.df)
#   values lengths
# 1      1   52818
# 2      0       1
# 3      1  219329
# 4      0       1
# 5      1  318306
# 6      0       1
```

From the run-length encoding, we see that the first 52,818 values in the vector are 1's, followed by a single 0, followed by 219,329 consecutive 1's, followed by a 0, and so on. The run-length encoding only has 207 entries, requiring us to store only 414 values instead of 10 million values. As `rle.df` is a data frame, it can be stored using standard functions like `write.csv`.
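
To get a rough sense of the saving, `object.size` can compare the in-memory footprint of a vector with that of its encoding. The sketch below uses a smaller, deterministic vector rather than the sampled one above; exact byte counts depend on the R version and platform, so none are shown.

```r
# One million integers: a long run of 1s, a single 0, then more 1s
x <- rep(c(1L, 0L, 1L), c(500000, 1, 499999))
enc <- with(rle(x), data.frame(values, lengths))

object.size(x)    # size of the full vector
object.size(enc)  # size of the three-row encoding

# TRUE when the encoding takes less memory than the vector
as.numeric(object.size(enc)) < as.numeric(object.size(x))
```

The saving depends entirely on how long the runs are: a vector whose values change at nearly every position would gain nothing from this encoding.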

Decompressing a vector in run-length encoding can be accomplished in two ways. The first method is to simply call `rep`, passing the `values` element of the run-length encoding as the first argument and the `lengths` element of the run-length encoding as the second argument:

```
decompressed <- rep(rle.df$values, rle.df$lengths)
```

We can confirm that our decompressed data is identical to our original data:

```
identical(decompressed, dat)
# [1] TRUE
```

The second method is to use R's built-in `inverse.rle` function on the `rle` object, for instance:

```
rle.obj <- rle(dat) # create a rle object here
class(rle.obj)
# [1] "rle"
dat.inv <- inverse.rle(rle.obj) # apply the inverse.rle on the rle object
```

We can confirm again that this produces exactly the original `dat`:

```
identical(dat.inv, dat)
# [1] TRUE
```
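
Since the encoding is meant to be stored, it is worth checking that it survives a trip through a file. The following is a minimal sketch, assuming a temporary CSV file and a small integer vector; `path`, `enc`, `back`, and `restored` are illustrative names.

```r
x <- rep(c(1L, 0L, 1L, 0L, 1L), c(5000, 1, 3000, 2, 1997))
enc <- with(rle(x), data.frame(values, lengths))

# Write the encoding to disk without row names, then read it back
path <- tempfile(fileext = ".csv")
write.csv(enc, path, row.names = FALSE)
back <- read.csv(path)

# Rebuild the full vector from the stored runs
restored <- rep(back$values, back$lengths)
all.equal(restored, x)

unlink(path)  # remove the temporary file
```

For non-integer values a text format may round the stored numbers; `saveRDS` and `readRDS` are an alternative that preserves the R object exactly.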

## Identifying and grouping by runs in data.table

The data.table package provides a convenient way to group by runs in data. Consider the following example data:

```
library(data.table)
(DT <- data.table(x = c(1, 1, 2, 2, 2, 1), y = 1:6))
#    x y
# 1: 1 1
# 2: 1 2
# 3: 2 3
# 4: 2 4
# 5: 2 5
# 6: 1 6
```

The variable `x` has three runs: a run of length 2 with value 1, a run of length 3 with value 2, and a run of length 1 with value 1. We might want to compute the mean value of variable `y` in each of the runs of variable `x` (these mean values are 1.5, 4, and 6).

The data.table `rleid` function provides an id indicating the run id of each element of a vector:

```
rleid(DT$x)
# [1] 1 1 2 2 2 3
```

One can then easily group on this run ID and summarize the `y` data:

```
DT[,mean(y),by=.(x, rleid(x))]
#    x rleid  V1
# 1: 1     1 1.5
# 2: 2     2 4.0
# 3: 1     3 6.0
```
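
The default column names `rleid` and `V1` can be replaced by naming the expressions in `by` and `j`. A small sketch, assuming the data.table package is installed; the names `run`, `meanY`, and `n` are arbitrary:

```r
library(data.table)
DT <- data.table(x = c(1, 1, 2, 2, 2, 1), y = 1:6)

# Name the grouping expression and the summaries; .N is the row count per group
res <- DT[, .(meanY = mean(y), n = .N), by = .(x, run = rleid(x))]
res
```

Grouping on both `x` and the run id keeps the run's value in the result; grouping on the run id alone would produce the same groups but drop the `x` column.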

#### Remarks

A run is a consecutive sequence of repeated values or observations. For repeated values, R's "run-length encoding" concisely describes a vector in terms of its runs. Consider:

```
dat <- c(1, 2, 2, 2, 3, 1, 4, 4, 1, 1)
```

We have a length-one run of 1s; then a length-three run of 2s; then a length-one run of 3s; and so on. R's run-length encoding captures all the lengths and values of a vector's runs.

### Extensions

A run can also refer to consecutive observations in tabular data. While R doesn't have a natural way of encoding these, they can be handled with `rleid` from the data.table package.
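
For tabular data, a run can be taken to mean consecutive rows that agree on several columns at once. The sketch below hand-rolls this in base R by comparing each row with the one before it; the data frame and its column names `a` and `b` are invented for illustration, and it assumes there are no `NA` values. With data.table loaded, `rleid(df$a, df$b)` is expected to give the same ids.

```r
df <- data.frame(a = c(1, 1, 1, 2, 2), b = c("u", "u", "v", "v", "v"))
n <- nrow(df)

# TRUE where a row differs from the previous row in at least one column
changed <- df$a[-1] != df$a[-n] | df$b[-1] != df$b[-n]

# The first row always starts a run; each later change opens a new one
df$run <- cumsum(c(TRUE, changed))
df
```

Rows 1 and 2 form the first run, row 3 starts a second run because `b` changes, and rows 4 and 5 form a third because `a` changes.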