Reshaping data between long and wide forms
In R, tabular data is stored in data frames. This topic covers the various ways of transforming a single table.
Reshaping data
Section titled “Reshaping data”Often data comes in tables. Generally one can divide this tabular data in wide and long formats. In a wide format, each variable has its own column.
|Person|Height [cm]|Age [yr] |---|---|---|--- |Alison|178|20 |Bob|174|45 |Carl|182|31
However, sometimes it is more convenient to have a long format, in which all variables are in one column and the values are in a second column.
|Person|Variable|Value |---|---|---|--- |Alison|Height [cm]|178 |Bob|Height [cm]|174 |Carl|Height [cm]|182 |Alison|Age [yr]|20 |Bob|Age [yr]|45 |Carl|Age [yr]|31
Base R, as well as third party packages can be used to simplify this process. For each of the options, the mtcars dataset will be used. By default, this dataset is in a long format. In order for the packages to work, we will insert the row names as the first column.
mtcars # shows the datasetdata <- data.frame(observation=row.names(mtcars),mtcars)Base R
Section titled “Base R”There are two functions in base R that can be used to convert between wide and long format: stack() and unstack().
long <- stack(data)long # this shows the long formatwide <- unstack(long)wide # this shows the wide formatHowever, these functions can become very complex for more advanced use cases. Luckily, there are other options using third party packages.
The tidyr package
Section titled “The tidyr package”This package uses gather() to convert from wide to long and spread() to convert from long to wide.
library(tidyr)long <- gather(data, variable, value, 2:12) # where variable is the name of the# variable column, value indicates the name of the value column and 2:12 refers to# the columns to be converted.long # shows the long resultwide <- spread(long,variable,value)wide # shows the wide result (~data)The data.table package
Section titled “The data.table package”The data.table package extends the reshape2 functions and uses the function melt() to go from wide to long and dcast() to go from long to wide.
library(data.table)long <- melt(data,'observation',2:12,'variable', 'value')long # shows the long resultwide <- dcast(long, observation ~ variable)wide # shows the wide result (~data)The reshape function
Section titled “The reshape function”The most flexible base R function for reshaping data is reshape. See ?reshape for its syntax.
# create unbalanced longitudinal (panel) data setset.seed(1234)df <- data.frame(identifier=rep(1:5, each=3), location=rep(c("up", "down", "left", "up", "center"), each=3), period=rep(1:3, 5), counts=sample(35, 15, replace=TRUE), values=runif(15, 5, 10))[-c(4,8,11),]df
identifier location period counts values1 1 up 1 4 9.1864782 1 up 2 22 6.4311163 1 up 3 22 6.3341045 2 down 2 31 6.1611306 2 down 3 23 6.5830627 3 left 1 1 6.5134679 3 left 3 24 5.19998010 4 up 1 18 6.09399812 4 up 3 20 7.62848813 5 center 1 10 9.57329114 5 center 2 33 9.15672515 5 center 3 11 5.228851Note that the data.frame is unbalanced, that is, unit 2 is missing an observation in the first period, while units 3 and 4 are missing observations in the second period. Also, note that there are two variables that vary over the periods: counts and values, and two that do not vary: identifier and location.
Long to Wide
Section titled “Long to Wide”To reshape the data.frame to wide format,
# reshape wide on time variabledf.wide <- reshape(df, idvar="identifier", timevar="period", v.names=c("values", "counts"), direction="wide")df.wide identifier location values.1 counts.1 values.2 counts.2 values.3 counts.31 1 up 9.186478 4 6.431116 22 6.334104 225 2 down NA NA 6.161130 31 6.583062 237 3 left 6.513467 1 NA NA 5.199980 2410 4 up 6.093998 18 NA NA 7.628488 2013 5 center 9.573291 10 9.156725 33 5.228851 11Notice that the missing time periods are filled in with NAs.
In reshaping wide, the “v.names” argument specifies the columns that vary over time. If the location variable is not necessary, it can be dropped prior to reshaping with the “drop” argument. In dropping the only non-varying / non-id column from the data.frame, the v.names argument becomes unnecessary.
reshape(df, idvar="identifier", timevar="period", direction="wide", drop="location")Wide to Long
Section titled “Wide to Long”To reshape long with the current df.wide, a minimal syntax is
reshape(df.wide, direction="long")However, this is typically trickier:
# remove "." separator in df.wide names for counts and valuesnames(df.wide)[grep("\\.", names(df.wide))] <- gsub("\\.", "", names(df.wide)[grep("\\.", names(df.wide))])Now the simple syntax will produce an error about undefined columns.
With column names that are more difficult for the reshape function to automatically parse, it is sometimes necessary to add the “varying” argument which tells reshape to group particular variables in wide format for the transformation into long format. This argument takes a list of vectors of variable names or indices.
reshape(df.wide, idvar="identifier", varying=list(c(3,5,7), c(4,6,8)), direction="long")In reshaping long, the “v.names” argument can be provided to rename the resulting varying variables.
Sometimes the specification of “varying” can be avoided by use of the “sep” argument which tells reshape what part of the variable name specifies the value argument and which specifies the time argument.
Remarks
Section titled “Remarks”Helpful packages
Section titled “Helpful packages”- Reshaping, stacking and splitting with data.table
- Reshape using tidyr
- splitstackshape