How to reshape a dataframe from wide to long or long to wide format

using tidyr - pivot_longer() pivot_wider()

Reshaping a dataframe / table from long to wide format or wide to long format is one of the daily tasks a Data Analyst / Data Scientist would be doing. The long format is similar to the tidy format that the tidyverse advocates. Even while, it’s been a very common task - the tidyr package’s solution of using spread() and gather() almost never was intuitive enough to be used in the code without SOing or Referring the documentation. While I used to think, It’s just me who didn’t spread and gather, It seemed to be mass problem when the package developer Hadley Wickham realized this to completely rewrite those functions into two new functions pivot_wider() and pivot_longer() (which is also equivalent to melt() and cast() - or simply unpivot and pivot)

update your tidyr

If you are an R-user, It’s highly likely that you’ve got tidyr either as a standalone package or part of the tidyverse packages. Please update the package the get the latest tidyr to get these functions for reshaping.

install.packages("tidyr") 

or

devtools::install_github("tidyverse/tidyr")

and load the library.

library(tidyr)

packageVersion("tidyr")
## [1] '0.8.99.9000'

Data

Let’s take this problem from a very popular Stack Overflow Question.

#data
df <- read.table(textConnection(
"Code Country        1950    1951    1952    1953    1954
AFG  Afghanistan    20249  19000  22532  5000  24555
ALB  Albania        8097   8986   3000  11123  12246"), header=TRUE)

df
##   Code     Country X1950 X1951 X1952 X1953 X1954
## 1  AFG Afghanistan 20249 19000 22532  5000 24555
## 2  ALB     Albania  8097  8986  3000 11123 12246

This wide format is good for some use-cases but for most of what we do with Data Analysis, it’d require the data to be in the long format.

Why Reshaping - The Problem

Given this data, Let’s try to make a Time-series Line Chart using ggplot2. But the format in which the data is currently shaped (wide) can’t help us in building the line chart because for a line chart using geom_line() we need the data in the long format - where the x-axis and y-axis are columns (ideally with x being a Time variable and y being a continuous variable) and also that we’ve got the country which could help us comparing both the countries.

Let’s do that!

library(tidyverse)

df_pivoted <- pivot_longer(df, cols = -c("Code","Country"), names_to = "year")

df_pivoted
## # A tibble: 10 x 4
##    Code  Country     year  value
##    <fct> <fct>       <chr> <int>
##  1 AFG   Afghanistan X1950 20249
##  2 AFG   Afghanistan X1951 19000
##  3 AFG   Afghanistan X1952 22532
##  4 AFG   Afghanistan X1953  5000
##  5 AFG   Afghanistan X1954 24555
##  6 ALB   Albania     X1950  8097
##  7 ALB   Albania     X1951  8986
##  8 ALB   Albania     X1952  3000
##  9 ALB   Albania     X1953 11123
## 10 ALB   Albania     X1954 12246

I’m personally glad that I don’t have to check the confusing syntax of gather() instead the intuitve name of pivot_longer() (which is literally what we’ve to do) flows through my fingers.

As you can see above, the pivot_longer() function as any tidyverse function supports %>% which also means the first argument is the dataframe itself, then the list of cols we would like to pivot_longer() and the next essential argument names_to - new column name under whcih these are rolled up. This gives us the reshaped data which we could use for plotting. If you came to this article, just to see reshaping, you’re good by this place. But if you want to connect better with the reason why we wanted to reshape in the first place, Let’s go do the line graph!

Line Graph with reshaped data

Now that we know how to reshape the data (from wide to long format), we can use ggplot2’s geom_line() to plot the (time-series) lines.

df %>% 
  pivot_longer(-c("Code","Country"),"year") %>% 
  mutate(year = as.integer(str_replace(year, "X",""))) %>% 
  ggplot() + geom_line(aes(year, value, group = Country, color = Country)) +
  labs(title =  "Line Graph after Reshaping",
       subtitle = "I love pivot_longer()",
       caption = "I was always confused with gather and spread") +
  theme_minimal()

In the above code, I’ve repeated the reshaping again - just to imply a point that this is why %>%s are cool because we can build intuitive ETL (kind-of) pipeline and visualization together step-by-step.

The End

Well, We finally ended up with a beautiful (opinionated) line graph to compare two countries’ (made-up values) which came from a wide format data originally which we reshaped to long format data.

 
comments powered by Disqus