Aggregating data frames
Aggregation is one of the most common uses for R. There are several ways to do so in R, which we will illustrate here.
Aggregating with base R
Section titled “Aggregating with base R”For this, we will use the function aggregate, which can be used as follows:
aggregate(formula,function,data)The following code shows various ways of using the aggregate function.
CODE:
df = data.frame(group=c("Group 1","Group 1","Group 2","Group 2","Group 2"), subgroup = c("A","A","A","A","B"),value = c(2,2.5,1,2,1.5))
# sum, grouping by one columnaggregate(value~group, FUN=sum, data=df)
# mean, grouping by one columnaggregate(value~group, FUN=mean, data=df)
# sum, grouping by multiple columnsaggregate(value~group+subgroup,FUN=sum,data=df)
# custom function, grouping by one column# in this example we want the sum of all values larger than 2 per group.aggregate(value~group, FUN=function(x) sum(x[x>2]), data=df)OUTPUT:
> df = data.frame(group=c("Group 1","Group 1","Group 2","Group 2","Group 2"), subgroup = c("A","A","A","A","B"),value = c(2,2.5,1,2,1.5))> print(df) group subgroup value1 Group 1 A 2.02 Group 1 A 2.53 Group 2 A 1.04 Group 2 A 2.05 Group 2 B 1.5>> # sum, grouping by one column> aggregate(value~group, FUN=sum, data=df) group value1 Group 1 4.52 Group 2 4.5>> # mean, grouping by one column> aggregate(value~group, FUN=mean, data=df) group value1 Group 1 2.252 Group 2 1.50>> # sum, grouping by multiple columns> aggregate(value~group+subgroup,FUN=sum,data=df) group subgroup value1 Group 1 A 4.52 Group 2 A 3.03 Group 2 B 1.5>> # custom function, grouping by one column> # in this example we want the sum of all values larger than 2 per group.> aggregate(value~group, FUN=function(x) sum(x[x>2]), data=df) group value1 Group 1 2.52 Group 2 0.0Aggregating with dplyr
Section titled “Aggregating with dplyr”Aggregating with dplyr is easy! You can use the group_by() and the summarize() functions for this. Some examples are given below.
CODE:
# Aggregating with dplyrlibrary(dplyr)
df = data.frame(group=c("Group 1","Group 1","Group 2","Group 2","Group 2"), subgroup = c("A","A","A","A","B"),value = c(2,2.5,1,2,1.5))print(df)
# sum, grouping by one columndf %>% group_by(group) %>% summarize(value = sum(value)) %>% as.data.frame()
# mean, grouping by one columndf %>% group_by(group) %>% summarize(value = mean(value)) %>% as.data.frame()
# sum, grouping by multiple columnsdf %>% group_by(group,subgroup) %>% summarize(value = sum(value)) %>% as.data.frame()
# custom function, grouping by one column# in this example we want the sum of all values larger than 2 per group.df %>% group_by(group) %>% summarize(value = sum(value[value>2])) %>% as.data.frame()OUTPUT:
> library(dplyr)>> df = data.frame(group=c("Group 1","Group 1","Group 2","Group 2","Group 2"), subgroup = c("A","A","A","A","B"),value = c(2,2.5,1,2,1.5))> print(df) group subgroup value1 Group 1 A 2.02 Group 1 A 2.53 Group 2 A 1.04 Group 2 A 2.05 Group 2 B 1.5>> # sum, grouping by one column> df %>% group_by(group) %>% summarize(value = sum(value)) %>% as.data.frame() group value1 Group 1 4.52 Group 2 4.5>> # mean, grouping by one column> df %>% group_by(group) %>% summarize(value = mean(value)) %>% as.data.frame() group value1 Group 1 2.252 Group 2 1.50>> # sum, grouping by multiple columns> df %>% group_by(group,subgroup) %>% summarize(value = sum(value)) %>% as.data.frame() group subgroup value1 Group 1 A 4.52 Group 2 A 3.03 Group 2 B 1.5>> # custom function, grouping by one column> # in this example we want the sum of all values larger than 2 per group.> df %>% group_by(group) %>% summarize(value = sum(value[value>2])) %>% as.data.frame() group value1 Group 1 2.52 Group 2 0.0Aggregating with data.table
Section titled “Aggregating with data.table”Grouping with the data.table package is done using the syntax dt[i, j, by]
Which can be read out loud as: “Take dt, subset rows using i, then calculate j, grouped by by.” Within the dt statement, multiple calculations or groups should be put in a list. Since an alias for list() is .(), both can be used interchangeably. In the examples below we use .().
CODE:
# Aggregating with data.tablelibrary(data.table)
dt = data.table(group=c("Group 1","Group 1","Group 2","Group 2","Group 2"), subgroup = c("A","A","A","A","B"),value = c(2,2.5,1,2,1.5))print(dt)
# sum, grouping by one columndt[,.(value=sum(value)),group]
# mean, grouping by one columndt[,.(value=mean(value)),group]
# sum, grouping by multiple columnsdt[,.(value=sum(value)),.(group,subgroup)]
# custom function, grouping by one column# in this example we want the sum of all values larger than 2 per group.dt[,.(value=sum(value[value>2])),group]OUTPUT:
> # Aggregating with data.table> library(data.table)>> dt = data.table(group=c("Group 1","Group 1","Group 2","Group 2","Group 2"), subgroup = c("A","A","A","A","B"),value = c(2,2.5,1,2,1.5))> print(dt) group subgroup value1: Group 1 A 2.02: Group 1 A 2.53: Group 2 A 1.04: Group 2 A 2.05: Group 2 B 1.5>> # sum, grouping by one column> dt[,.(value=sum(value)),group] group value1: Group 1 4.52: Group 2 4.5>> # mean, grouping by one column> dt[,.(value=mean(value)),group] group value1: Group 1 2.252: Group 2 1.50>> # sum, grouping by multiple columns> dt[,.(value=sum(value)),.(group,subgroup)] group subgroup value1: Group 1 A 4.52: Group 2 A 3.03: Group 2 B 1.5>> # custom function, grouping by one column> # in this example we want the sum of all values larger than 2 per group.> dt[,.(value=sum(value[value>2])),group] group value1: Group 1 2.52: Group 2 0.0