# dplyr
# dplyr's single table verbs
dplyr (opens new window) introduces a grammar of data manipulation in R
. It provides a consistent interface to work with data no matter where it is stored: data.frame (opens new window), data.table (opens new window), or a database
. The key pieces of dplyr
are written using Rcpp (opens new window), which makes it very fast for working with in-memory data.
dplyr
's philosophy is to have small functions that do one thing well. The five simple functions (filter
, arrange
, select
, mutate
, and summarise
) can be used to reveal new ways to describe data. When combined with group_by
, these functions can be used to calculate group wise summary statistics.
# Syntax commonalities
All these functions have a similar syntax:
- The first argument to all these functions is always a data frame
- Columns can be referred directly using bare variable names (i.e., without using
$
) - These functions do not modify the original data itself, i.e., they don't have side effects. Hence, the results should always be saved to an object.
We will use the built-in mtcars (opens new window) dataset to explore dplyr
's single table verbs. Before converting the type of mtcars
to tbl_df
(since it makes printing cleaner), we add the rownames
of the dataset as a column using rownames_to_column
function from the tibble (opens new window) package.
library(dplyr) # This documentation was written using version 0.5.0
mtcars_tbl <- as_data_frame(tibble::rownames_to_column(mtcars, "cars"))
# examine the structure of data
head(mtcars_tbl)
# A tibble: 6 x 12
# cars mpg cyl disp hp drat wt qsec vs am gear carb
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#5 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# filter
filter
helps subset rows that match certain criteria. The first argument is the name of the data.frame
and the second (and subsequent) arguments are the criteria that filter the data (these criteria should evaluate to either TRUE
or FALSE
)
Subset all cars that have 4 cylinders - cyl
:
filter(mtcars_tbl, cyl == 4)
# A tibble: 11 x 12
# cars mpg cyl disp hp drat wt qsec vs am gear carb
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#2 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#3 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#4 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#5 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# ... with 6 more rows
We can pass multiple criteria separated by a comma. To subset the cars which have either 4 or 6 cylinders - cyl
and have 5 gears - gear
:
filter(mtcars_tbl, cyl == 4 | cyl == 6, gear == 5)
# A tibble: 3 x 12
# cars mpg cyl disp hp drat wt qsec vs am gear carb
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
#2 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
#3 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
filter
selects rows based on criteria, to select rows by position, use slice
.
slice
takes only 2 arguments: the first one is a data.frame
and the second is integer row values.
To select rows 6 through 9:
slice(mtcars_tbl, 6:9)
# A tibble: 4 x 12
# cars mpg cyl disp hp drat wt qsec vs am gear carb
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Valiant 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1
#2 Duster 360 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
#3 Merc 240D 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
#4 Merc 230 22.8 4 140.8 95 3.92 3.15 22.90 1 0 4 2
Or:
slice(mtcars_tbl, -c(1:5, 10:n()))
This results in the same output as slice(mtcars_tbl, 6:9)
n()
represents the number of observations in the current group
# arrange
arrange
is used to sort the data by a specified variable(s). Just like the previous verb (and all other functions in dplyr
), the first argument is a data.frame
, and consequent arguments are used to sort the data. If more than one variable is passed, the data is first sorted by the first variable, and then by the second variable, and so on..
To order the data by horsepower - hp
arrange(mtcars_tbl, hp)
# A tibble: 32 x 12
# cars mpg cyl disp hp drat wt qsec vs am gear carb
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#2 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#3 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#4 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#5 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#6 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# ... with 26 more rows
To arrange
the data by miles per gallon - mpg
in descending order, followed by number of cylinders - cyl
:
arrange(mtcars_tbl, desc(mpg), cyl)
# A tibble: 32 x 12
# cars mpg cyl disp hp drat wt qsec vs am gear carb
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#2 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#3 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#4 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#5 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#6 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# ... with 26 more rows
# select
select
is used to select only a subset of variables. To select only mpg
, disp
, wt
, qsec
, and vs
from mtcars_tbl
:
select(mtcars_tbl, mpg, disp, wt, qsec, vs)
# A tibble: 32 x 5
# mpg disp wt qsec vs
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 21.0 160.0 2.620 16.46 0
#2 21.0 160.0 2.875 17.02 0
#3 22.8 108.0 2.320 18.61 1
#4 21.4 258.0 3.215 19.44 1
#5 18.7 360.0 3.440 17.02 0
#6 18.1 225.0 3.460 20.22 1
# ... with 26 more rows
:
notation can be used to select consecutive columns. To select columns from cars
through disp
and vs
through carb
:
select(mtcars_tbl, cars:disp, vs:carb)
# A tibble: 32 x 8
# cars mpg cyl disp vs am gear carb
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Mazda RX4 21.0 6 160.0 0 1 4 4
#2 Mazda RX4 Wag 21.0 6 160.0 0 1 4 4
#3 Datsun 710 22.8 4 108.0 1 1 4 1
#4 Hornet 4 Drive 21.4 6 258.0 1 0 3 1
#5 Hornet Sportabout 18.7 8 360.0 0 0 3 2
#6 Valiant 18.1 6 225.0 1 0 3 1
# ... with 26 more rows
or select(mtcars_tbl, -(hp:qsec))
For datasets that contain several columns, it can be tedious to select several columns by name. To make life easier, there are a number of helper functions (such as starts_with()
, ends_with()
, contains()
, matches()
, num_range()
, one_of()
, and everything()
) that can be used in select
. To learn more about how to use them, see ?select_helpers
and ?select
.
Note: While referring to columns directly in select()
, we use bare column names, but quotes should be used while referring to columns in helper functions.
To rename columns while selecting:
select(mtcars_tbl, cylinders = cyl, displacement = disp)
# A tibble: 32 x 2
# cylinders displacement
# <dbl> <dbl>
#1 6 160.0
#2 6 160.0
#3 4 108.0
#4 6 258.0
#5 8 360.0
#6 6 225.0
# ... with 26 more rows
As expected, this drops all other variables.
To rename columns without dropping other variables, use rename
:
rename(mtcars_tbl, cylinders = cyl, displacement = disp)
# A tibble: 32 x 12
# cars mpg cylinders displacement hp drat wt qsec vs
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0
#2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0
#3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1
#4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1
#5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0
#6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1
# ... with 26 more rows, and 3 more variables: am <dbl>, gear <dbl>, carb <dbl>
# mutate
mutate
can be used to add new columns to the data. Like all other functions in dplyr
, mutate doesn't add the newly created columns to the original data. Columns are added at the end of the data.frame
.
mutate(mtcars_tbl, weight_ton = wt/2, weight_pounds = weight_ton * 2000)
# A tibble: 32 x 14
# cars mpg cyl disp hp drat wt qsec vs am gear carb weight_ton weight_pounds
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1.3100 2620
#2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 1.4375 2875
#3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 1.1600 2320
#4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 1.6075 3215
#5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 1.7200 3440
#6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 1.7300 3460
# ... with 26 more rows
Note the use of weight_ton
while creating weight_pounds
. Unlike base R
, mutate
allows us to refer to columns that we just created to be used for a subsequent operation.
To retain only the newly created columns, use transmute
instead of mutate
:
transmute(mtcars_tbl, weight_ton = wt/2, weight_pounds = weight_ton * 2000)
# A tibble: 32 x 2
# weight_ton weight_pounds
# <dbl> <dbl>
#1 1.3100 2620
#2 1.4375 2875
#3 1.1600 2320
#4 1.6075 3215
#5 1.7200 3440
#6 1.7300 3460
# ... with 26 more rows
# summarise
summarise
calculates summary statistics of variables by collapsing multiple values to a single value. It can calculate multiple statistics and we can name these summary columns in the same statement.
To calculate the mean and standard deviation of mpg
and disp
of all cars in the dataset:
summarise(mtcars_tbl, mean_mpg = mean(mpg), sd_mpg = sd(mpg),
mean_disp = mean(disp), sd_disp = sd(disp))
# A tibble: 1 x 4
# mean_mpg sd_mpg mean_disp sd_disp
# <dbl> <dbl> <dbl> <dbl>
#1 20.09062 6.026948 230.7219 123.9387
# group_by
group_by
can be used to perform group wise operations on data. When the verbs defined above are applied on this grouped data, they are automatically applied to each group separately.
To find mean
and sd
of mpg
by cyl
:
by_cyl <- group_by(mtcars_tbl, cyl)
summarise(by_cyl, mean_mpg = mean(mpg), sd_mpg = sd(mpg))
# A tibble: 3 x 3
# cyl mean_mpg sd_mpg
# <dbl> <dbl> <dbl>
#1 4 26.66364 4.509828
#2 6 19.74286 1.453567
#3 8 15.10000 2.560048
# Putting it all togther
We select columns from cars
through hp
and gear
, order the rows by cyl
and from highest to lowest mpg
, group the data by gear
, and finally subset only those cars have mpg
> 20 and hp
> 75
selected <- select(mtcars_tbl, cars:hp, gear)
ordered <- arrange(selected, cyl, desc(mpg))
by_cyl <- group_by(ordered, gear)
filter(by_cyl, mpg > 20, hp > 75)
Source: local data frame [9 x 6]
Groups: gear [3]
# cars mpg cyl disp hp gear
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Lotus Europa 30.4 4 95.1 113 5
#2 Porsche 914-2 26.0 4 120.3 91 5
#3 Datsun 710 22.8 4 108.0 93 4
#4 Merc 230 22.8 4 140.8 95 4
#5 Toyota Corona 21.5 4 120.1 97 3
# ... with 4 more rows
Maybe we are not interested the intermediate results, we can achieve the same result as above by wrapping the function calls:
filter(
group_by(
arrange(
select(
mtcars_tbl, cars:hp
), cyl, desc(mpg)
), cyl
),mpg > 20, hp > 75
)
This can be a little difficult to read. So, dplyr
operations can be chained using the pipe (opens new window) %>%
operator. The above code transalates to:
mtcars_tbl %>%
select(cars:hp) %>%
arrange(cyl, desc(mpg)) %>%
group_by(cyl) %>%
filter(mpg > 20, hp > 75)
# summarise multiple columns
dplyr
provides summarise_all()
to apply functions to all (non-grouping) columns.
To find the number of distinct values for each column:
mtcars_tbl %>%
summarise_all(n_distinct)
# A tibble: 1 x 12
# cars mpg cyl disp hp drat wt qsec vs am gear carb
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#1 32 25 3 27 22 22 29 30 2 2 3 6
To find the number of distinct values for each column by cyl
:
mtcars_tbl %>%
group_by(cyl) %>%
summarise_all(n_distinct)
# A tibble: 3 x 12
# cyl cars mpg disp hp drat wt qsec vs am gear carb
# <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#1 4 11 9 11 10 10 11 11 2 2 3 2
#2 6 7 6 5 4 5 6 7 2 2 3 3
#3 8 14 12 11 9 11 13 14 1 2 2 4
Note that we just had to add the group_by
statement and the rest of the code is the same. The output now consists of three rows - one for each unique value of cyl
.
To summarise
specific multiple columns, use summarise_at
mtcars_tbl %>%
group_by(cyl) %>%
summarise_at(c("mpg", "disp", "hp"), mean)
# A tibble: 3 x 4
# cyl mpg disp hp
# <dbl> <dbl> <dbl> <dbl>
#1 4 26.66364 105.1364 82.63636
#2 6 19.74286 183.3143 122.28571
#3 8 15.10000 353.1000 209.21429
helper
functions (?select_helpers
) can be used in place of column names to select specific columns
To apply multiple functions, either pass the function names as a character vector:
mtcars_tbl %>%
group_by(cyl) %>%
summarise_at(c("mpg", "disp", "hp"),
c("mean", "sd"))
or wrap them inside funs
:
mtcars_tbl %>%
group_by(cyl) %>%
summarise_at(c("mpg", "disp", "hp"),
funs(mean, sd))
# A tibble: 3 x 7
# cyl mpg_mean disp_mean hp_mean mpg_sd disp_sd hp_sd
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 26.66364 105.1364 82.63636 4.509828 26.87159 20.93453
#2 6 19.74286 183.3143 122.28571 1.453567 41.56246 24.26049
#3 8 15.10000 353.1000 209.21429 2.560048 67.77132 50.97689
Column names are now be appended with function names to keep them distinct. In order to change this, pass the name to be appended with the function:
mtcars_tbl %>%
group_by(cyl) %>%
summarise_at(c("mpg", "disp", "hp"),
c(Mean = "mean", SD = "sd"))
mtcars_tbl %>%
group_by(cyl) %>%
summarise_at(c("mpg", "disp", "hp"),
funs(Mean = mean, SD = sd))
# A tibble: 3 x 7
# cyl mpg_Mean disp_Mean hp_Mean mpg_SD disp_SD hp_SD
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 26.66364 105.1364 82.63636 4.509828 26.87159 20.93453
#2 6 19.74286 183.3143 122.28571 1.453567 41.56246 24.26049
#3 8 15.10000 353.1000 209.21429 2.560048 67.77132 50.97689
To select columns conditionally, use summarise_if
:
Take the mean
of all columns that are numeric
grouped by cyl
:
mtcars_tbl %>%
group_by(cyl) %>%
summarise_if(is.numeric, mean)
# A tibble: 3 x 11
# cyl mpg disp hp drat wt qsec
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 26.66364 105.1364 82.63636 4.070909 2.285727 19.13727
#2 6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714
#3 8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214
# ... with 4 more variables: vs <dbl>, am <dbl>, gear <dbl>,
# carb <dbl>
However, some variables are discrete, and mean
of these variables doesn't make sense.
To take the mean
of only continuous variables by cyl
:
mtcars_tbl %>%
group_by(cyl) %>%
summarise_if(function(x) is.numeric(x) & n_distinct(x) > 6, mean)
# A tibble: 3 x 7
# cyl mpg disp hp drat wt qsec
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 26.66364 105.1364 82.63636 4.070909 2.285727 19.13727
#2 6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714
#3 8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214
# Aggregating with %>% (pipe) operator
The pipe (%>%) operator (opens new window) could be used in combination with dplyr
functions. In this example we use the mtcars
dataset (see help("mtcars")
for more information) to show how to sumarize a data frame, and to add variables to the data with the result of the application of a function.
library(dplyr)
library(magrittr)
df <- mtcars
df$cars <- rownames(df) #just add the cars names to the df
df <- df[,c(ncol(df),1:(ncol(df)-1))] # and place the names in the first column
1. Sumarize the data
To compute statistics we use summarize
and the appropriate functions. In this case n()
is used for counting the number of cases.
df %>%
summarize(count=n(),mean_mpg = mean(mpg, na.rm = TRUE),
min_weight = min(wt),max_weight = max(wt))
# count mean_mpg min_weight max_weight
#1 32 20.09062 1.513 5.424
2. Compute statistics by group
It is possible to compute the statistics by groups of the data. In this case by Number of cylinders and Number of forward gears
df %>%
group_by(cyl, gear) %>%
summarize(count=n(),mean_mpg = mean(mpg, na.rm = TRUE),
min_weight = min(wt),max_weight = max(wt))
# Source: local data frame [8 x 6]
# Groups: cyl [?]
#
# cyl gear count mean_mpg min_weight max_weight
# <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#1 4 3 1 21.500 2.465 2.465
#2 4 4 8 26.925 1.615 3.190
#3 4 5 2 28.200 1.513 2.140
#4 6 3 2 19.750 3.215 3.460
#5 6 4 4 19.750 2.620 3.440
#6 6 5 1 19.700 2.770 2.770
#7 8 3 12 15.050 3.435 5.424
#8 8 5 2 15.400 3.170 3.570
# Subset Observation (Rows)
# dplyr::filter()
- Select a subset of rows in a data frame that meet a logical criteria:
dplyr::filter(iris,Sepal.Length>7)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 7.1 3.0 5.9 2.1 virginica
# 2 7.6 3.0 6.6 2.1 virginica
# 3 7.3 2.9 6.3 1.8 virginica
# 4 7.2 3.6 6.1 2.5 virginica
# 5 7.7 3.8 6.7 2.2 virginica
# 6 7.7 2.6 6.9 2.3 virginica
# 7 7.7 2.8 6.7 2.0 virginica
# 8 7.2 3.2 6.0 1.8 virginica
# 9 7.2 3.0 5.8 1.6 virginica
# 10 7.4 2.8 6.1 1.9 virginica
# 11 7.9 3.8 6.4 2.0 virginica
# 12 7.7 3.0 6.1 2.3 virginica
# dplyr::distinct()
- Remove duplicate rows:
distinct(iris, Sepal.Length, .keep_all = TRUE)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
# 7 4.4 2.9 1.4 0.2 setosa
# 8 4.8 3.4 1.6 0.2 setosa
# 9 4.3 3.0 1.1 0.1 setosa
# 10 5.8 4.0 1.2 0.2 setosa
# 11 5.7 4.4 1.5 0.4 setosa
# 12 5.2 3.5 1.5 0.2 setosa
# 13 5.5 4.2 1.4 0.2 setosa
# 14 4.5 2.3 1.3 0.3 setosa
# 15 5.3 3.7 1.5 0.2 setosa
# 16 7.0 3.2 4.7 1.4 versicolor
# 17 6.4 3.2 4.5 1.5 versicolor
# 18 6.9 3.1 4.9 1.5 versicolor
# 19 6.5 2.8 4.6 1.5 versicolor
# 20 6.3 3.3 4.7 1.6 versicolor
# 21 6.6 2.9 4.6 1.3 versicolor
# 22 5.9 3.0 4.2 1.5 versicolor
# 23 6.0 2.2 4.0 1.0 versicolor
# 24 6.1 2.9 4.7 1.4 versicolor
# 25 5.6 2.9 3.6 1.3 versicolor
# 26 6.7 3.1 4.4 1.4 versicolor
# 27 6.2 2.2 4.5 1.5 versicolor
# 28 6.8 2.8 4.8 1.4 versicolor
# 29 7.1 3.0 5.9 2.1 virginica
# 30 7.6 3.0 6.6 2.1 virginica
# 31 7.3 2.9 6.3 1.8 virginica
# 32 7.2 3.6 6.1 2.5 virginica
# 33 7.7 3.8 6.7 2.2 virginica
# 34 7.4 2.8 6.1 1.9 virginica
# 35 7.9 3.8 6.4 2.0 virginica
# Examples of NSE and string variables in dpylr
dplyr
uses Non-Standard Evaluation(NSE), which is why we normally can use the variable names without quotes. However, sometimes during the data pipeline, we need to get our variable names from other sources such as a Shiny selection box. In case of functions like select
, we can just use select_
to use a string variable to select
variable1 <- "Sepal.Length"
variable2 <- "Sepal.Width"
iris %>%
select_(variable1, variable2) %>%
head(n=5)
# Sepal.Length Sepal.Width
# 1 5.1 3.5
# 2 4.9 3.0
# 3 4.7 3.2
# 4 4.6 3.1
# 5 5.0 3.6
But if we want to use other features such as summarize or filter we need to use interp
function from lazyeval
package
variable1 <- "Sepal.Length"
variable2 <- "Sepal.Width"
variable3 <- "Species"
iris %>%
select_(variable1, variable2, variable3) %>%
group_by_(variable3) %>%
summarize_(mean1 = lazyeval::interp(~mean(var), var = as.name(variable1)), mean2 = lazyeval::interp(~mean(var), var = as.name(variable2)))
# Species mean1 mean2
# <fctr> <dbl> <dbl>
# 1 setosa 5.006 3.428
# 2 versicolor 5.936 2.770
# 3 virginica 6.588 2.974
# Remarks
dplyr is an iteration of plyr that provides a flexible "verb" based functions to manipulate data in R. The latest version of dplyr can be downloaded from CRAN using
install.package("dplyr")
The key object in dplyr is a tbl, a representation of a tabular data structure. Currently dplyr (version 0.5.0) supports:
- data frames
- data tables
- SQLite
- PostgreSQL/Redshift
- MySQL/MariaDB
- Bigquery
- MonetDB
- data cubes with arrays (partial implementation)