Convert data from wide format to long format with multiple measure columns [duplicate] - r

I want to do this but the exact opposite. So say my dataset looks like this:
ID X_1990 X_2000 X_2010 Y_1990 Y_2000 Y_2010
A 1 4 7 10 13 16
B 2 5 8 11 14 17
C 3 6 9 12 15 18
but with many more measure variables (e.g. also Z_1990, etc.). How can I reshape it so that the year becomes its own variable while each measure keeps its own column, like this:
ID Year X Y
A 1990 1 10
A 2000 4 13
A 2010 7 16
B 1990 2 11
B 2000 5 14
B 2010 8 17
C 1990 3 12
C 2000 6 15
C 2010 9 18

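You can use pivot_longer with the names_sep argument. The special '.value' entry in names_to tells pivot_longer to use the part of each column name before the separator (X, Y) as the name of an output column, while the part after it goes into Year.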
tidyr::pivot_longer(df, cols = -ID, names_to = c('.value', 'Year'), names_sep = '_')
# ID Year X Y
# <chr> <chr> <int> <int>
#1 A 1990 1 10
#2 A 2000 4 13
#3 A 2010 7 16
#4 B 1990 2 11
#5 B 2000 5 14
#6 B 2010 8 17
#7 C 1990 3 12
#8 C 2000 6 15
#9 C 2010 9 18
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(ID = c("A", "B", "C"), X_1990 = 1:3, X_2000 = 4:6,
X_2010 = 7:9, Y_1990 = 10:12, Y_2000 = 13:15, Y_2010 = 16:18),
row.names = c(NA, -3L), class = "data.frame")
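For comparison, the same reshaping can be sketched in base R with reshape(). This is an alternative technique, not the method used in the answer above; it assumes every measure column follows the name_year pattern with a single underscore, and the object name long is only illustrative.

```r
# Same example data as above, rebuilt with data.frame()
df <- data.frame(ID = c("A", "B", "C"),
                 X_1990 = 1:3, X_2000 = 4:6, X_2010 = 7:9,
                 Y_1990 = 10:12, Y_2000 = 13:15, Y_2010 = 16:18)

# reshape() splits each varying column name at "_":
# the first part becomes the measure name, the second the time value
long <- reshape(df, direction = "long",
                varying = names(df)[-1], sep = "_",
                idvar = "ID", timevar = "Year")

# reshape() stacks the rows year by year; sort by ID to mirror the layout asked for
long <- long[order(long$ID, long$Year), ]
rownames(long) <- NULL
long
```

Unlike the pivot_longer call, reshape() converts the year part to a number when it can, so Year is numeric here rather than character.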

Related

Using pivot_longer to pivot pairs of columns [duplicate]

I am trying to find a simple way to pivot_longer a dataframe that has multiple columns containing different data for each case. Using multiple names in names_to doesn't seem to solve the problem.
Here is a worked example:
#create the dataframe:
library('dplyr')
set.seed(11)
x <- data.frame(case = c(1:10),
X1990 = runif(10, 0, 1),
flag.1990 = rep(c('a','b'), 5),
X2000 = runif(10, 0, 1),
flag.2000 = rep(c('c', 'd'), 5))
> x
case X1990 flag.1990 X2000 flag.2000
1 1 0.2772497942 a 0.1751129 c
2 2 0.0005183129 b 0.4407503 d
3 3 0.5106083730 a 0.9071830 c
4 4 0.0140479084 b 0.8510419 d
5 5 0.0646897766 a 0.7339875 c
6 6 0.9548492255 b 0.5736857 d
7 7 0.0864958912 a 0.4817655 c
8 8 0.2899750092 b 0.3306110 d
9 9 0.8806991728 a 0.1576602 c
10 10 0.1232162013 b 0.4801341 d
Obviously I cannot just pivot_longer using cols = -case, as that would combine the year and flag data. If I try using a character vector in names_to (from here: https://dcl-wrangle.stanford.edu/pivot-advanced.html, section 6.1.3):
x %>%
setNames(c('case','value.1990', 'flag.1990', 'value.2000', 'flag.2000')) %>%
pivot_longer(cols = -case,
names_to = c('value', 'flag'),
names_sep = '.',
values_to = 'value')
Things don't work, because the flag data isn't in the variable name.
The only way I can think to solve this is to break the dataframe into two data frames, pivot them and then join them. For example:
#create temporary data frame for year data, then pivot
temp1 <- x %>%
select(1,2, 4) %>% #select year data
pivot_longer(cols = c(X1990, X2000), #pivot longer on year data
names_to = 'year',
values_to = 'value') %>%
mutate(year = gsub('X', '', year)) #remove 'X' so that I can use this to join
#create temporary data frame for flag data, then pivot
temp2 <- x %>%
select(1, 3, 5) %>% #select flag variables
pivot_longer(cols = c(flag.1990, flag.2000), #pivot longer on flag data
names_to = 'flag.year',
values_to = 'flag') %>%
mutate(year = gsub('flag.', '', flag.year, fixed = TRUE)) %>% #extract the year so that I can join on it
select(-flag.year) #drop flag.year as it's no longer useful information
final <- full_join(temp1, temp2, by = c('case', 'year')) #full join the two datasets to get the final data
> final
# A tibble: 20 x 4
case flag year value
<int> <chr> <chr> <dbl>
1 1 a 1990 0.277
2 1 c 2000 0.175
3 2 b 1990 0.000518
4 2 d 2000 0.441
5 3 a 1990 0.511
6 3 c 2000 0.907
7 4 b 1990 0.0140
8 4 d 2000 0.851
9 5 a 1990 0.0647
10 5 c 2000 0.734
11 6 b 1990 0.955
12 6 d 2000 0.574
13 7 a 1990 0.0865
14 7 c 2000 0.482
15 8 b 1990 0.290
16 8 d 2000 0.331
17 9 a 1990 0.881
18 9 c 2000 0.158
19 10 b 1990 0.123
20 10 d 2000 0.480
I assume there is a quicker way to do this. Am I just misreading the documentation on using multiple names in names_to? Any ideas?
In this case one has to use names_to combined with names_pattern:
library(dplyr)
library(tidyr)
> head(x,3)
case X1990 flag.1990 X2000 flag.2000
1 1 0.2772497942 a 0.1751129 c
2 2 0.0005183129 b 0.4407503 d
3 3 0.5106083730 a 0.9071830 c
> x %>%
pivot_longer(cols = -case,
names_to = c(".value", "year"),
names_pattern = "([^\\.]*)\\.*(\\d{4})")
# A tibble: 20 x 4
case year X flag
<int> <chr> <dbl> <chr>
1 1 1990 0.277 a
2 1 2000 0.175 c
3 2 1990 0.000518 b
4 2 2000 0.441 d
5 3 1990 0.511 a
6 3 2000 0.907 c
7 4 1990 0.0140 b
8 4 2000 0.851 d
9 5 1990 0.0647 a
10 5 2000 0.734 c
11 6 1990 0.955 b
12 6 2000 0.574 d
13 7 1990 0.0865 a
14 7 2000 0.482 c
15 8 1990 0.290 b
16 8 2000 0.331 d
17 9 1990 0.881 a
18 9 2000 0.158 c
19 10 1990 0.123 b
20 10 2000 0.480 d
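To see why this names_pattern works, it may help to apply the same regular expression to the column names by hand. The sketch below uses base R's regexec() with perl = TRUE as a stand-in for the regex engine tidyr uses internally, which is an assumption; the variable names are illustrative only.

```r
# Column names being pivoted in the example above
nms <- c("X1990", "flag.1990", "X2000", "flag.2000")

# Group 1: everything that is not a dot; then optional dots; group 2: four digits
pattern <- "([^\\.]*)\\.*(\\d{4})"

parts <- regmatches(nms, regexec(pattern, nms, perl = TRUE))

# Element 1 of each match is the full match, elements 2 and 3 are the two groups
value_part <- sapply(parts, `[`, 2)  # mapped to ".value": becomes output column names
year_part  <- sapply(parts, `[`, 3)  # mapped to "year"

data.frame(column = nms, .value = value_part, year = year_part)
```

The first group yields X or flag and the second yields the year, which is why the result has one X column and one flag column per case and year.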

Transpose only certain columns - data formatting [duplicate]

I'm trying to reshape some data and currently have the following:
df <- data.frame(year = c(2010,2011,2010,2011),A = c(10,11,10,11),B = c(11,12,11,12))
year A B
1 2010 10 11
2 2011 11 12
3 2010 10 11
4 2011 11 12
I want it to look like this, but do not know how to do it. Can anyone help me?
company year Variable
1 A 2010 10
2 A 2011 11
3 B 2010 11
4 B 2011 12
We can use pivot_longer
library(tidyr)
pivot_longer(df, cols = -year,
names_to = 'company', values_to = 'Variable')
-output
# A tibble: 8 × 3
year company Variable
<dbl> <chr> <dbl>
1 2010 A 10
2 2010 B 11
3 2011 A 11
4 2011 B 12
5 2010 A 10
6 2010 B 11
7 2011 A 11
8 2011 B 12
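The output above has eight rows because the input repeats each year twice, whereas the question shows four. A possible follow-up, assuming the repeated input rows are unintended duplicates, is to drop them with distinct() and then reorder; the object name long is illustrative.

```r
library(dplyr)
library(tidyr)

df <- data.frame(year = c(2010, 2011, 2010, 2011),
                 A = c(10, 11, 10, 11),
                 B = c(11, 12, 11, 12))

long <- df %>%
  distinct() %>%                          # drop the repeated input rows
  pivot_longer(cols = -year,
               names_to = "company", values_to = "Variable") %>%
  select(company, year, Variable) %>%     # column order used in the question
  arrange(company, year)                  # A rows first, then B

long
```

If the repeated rows are meaningful (for example, separate observations), distinct() should be left out.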

How to convert dataframe from wide to long and separate different values of one column into new columns? [duplicate]

I have this dataframe (with dummy values)
Country lifeExp_1952 lifeExp_1962 pop_1952 pop_1962 gdp_1952 gdp_1962
A 1 4 7 10 13 16
B 2 5 8 11 14 17
C 3 6 9 12 15 18
I would like to convert it to a long format, but have separate columns for lifeExp, pop and gdp such that it looks like this:
Country Year lifeExp pop gdp
A 1952 1 7 13
A 1962 4 10 16
B 1952 2 8 14
B 1962 5 11 17
C 1952 3 9 15
C 1962 6 12 18
So far, I have been able to extract the year with lifeExp, pop and gdp in the same column using reshape2, but I have no idea how to give them their own column.
We can use pivot_longer
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = -Country, names_to = c(".value", 'Year'), names_sep = "_")
-output
# A tibble: 6 x 5
# Country Year lifeExp pop gdp
# <chr> <chr> <int> <int> <int>
#1 A 1952 1 7 13
#2 A 1962 4 10 16
#3 B 1952 2 8 14
#4 B 1962 5 11 17
#5 C 1952 3 9 15
#6 C 1962 6 12 18
data
df1 <- structure(list(Country = c("A", "B", "C"), lifeExp_1952 = 1:3,
lifeExp_1962 = 4:6, pop_1952 = 7:9, pop_1962 = 10:12, gdp_1952 = 13:15,
gdp_1962 = 16:18), class = "data.frame", row.names = c(NA,
-3L))
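Note that Year comes out as a character column in the output above. If a numeric year is wanted, one option is the names_transform argument of pivot_longer; this sketch assumes tidyr 1.1.0 or later, and the object name long is illustrative.

```r
library(tidyr)

# Same example data as above, rebuilt with data.frame()
df1 <- data.frame(Country = c("A", "B", "C"),
                  lifeExp_1952 = 1:3, lifeExp_1962 = 4:6,
                  pop_1952 = 7:9, pop_1962 = 10:12,
                  gdp_1952 = 13:15, gdp_1962 = 16:18)

long <- pivot_longer(df1, cols = -Country,
                     names_to = c(".value", "Year"),
                     names_sep = "_",
                     # convert the Year piece of each name to integer
                     names_transform = list(Year = as.integer))

str(long)
```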

Assign unique ID based on two columns [duplicate]

I have a dataframe (df) that looks like this:
School Student Year
A 10 1999
A 10 2000
A 20 1999
A 20 2000
A 20 2001
B 10 1999
B 10 2000
And I would like to create a person ID column so that df looks like this:
ID School Student Year
1 A 10 1999
1 A 10 2000
2 A 20 1999
2 A 20 2000
2 A 20 2001
3 B 10 1999
3 B 10 2000
In other words, the ID variable indicates which person it is in the dataset, accounting for both Student number and School membership (here we have 3 students total).
I did df$ID <- df$Student and then tried to add 1 to the value whenever the combination c("School", "Student") was new. It isn't working. Help appreciated.
We can do this in base R without doing any group by operation
df$ID <- cumsum(!duplicated(df[1:2]))
df
# School Student Year ID
#1 A 10 1999 1
#2 A 10 2000 1
#3 A 20 1999 2
#4 A 20 2000 2
#5 A 20 2001 2
#6 B 10 1999 3
#7 B 10 2000 3
NOTE: Assuming that 'School' and 'Student' are ordered
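When the rows are not ordered, the cumsum approach can give rows of the same student different IDs. A base R sketch that does not depend on row order is to build a key from both columns and look up its position among the unique keys. The example data and the "_" separator are illustrative assumptions; the separator must not occur in the values themselves.

```r
# Deliberately unordered example: the same School/Student pairs recur later
df <- data.frame(School  = c("A", "B", "A", "A", "B"),
                 Student = c(10, 10, 20, 10, 10),
                 Year    = c(1999, 1999, 1999, 2000, 2000))

# One key per School/Student combination
key <- paste(df$School, df$Student, sep = "_")

# ID is the position of each key among the unique keys, in order of first appearance
df$ID <- match(key, unique(key))

# For comparison: the cumsum version on the same unordered data
naive <- cumsum(!duplicated(df[1:2]))

df
```

Here rows 4 and 5 repeat the pairs from rows 1 and 2, so match() assigns them IDs 1 and 2, while the cumsum version assigns both of them 3.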
Or using tidyverse
library(dplyr)
df %>%
mutate(ID = group_indices_(df, .dots=c("School", "Student")))
# School Student Year ID
#1 A 10 1999 1
#2 A 10 2000 1
#3 A 20 1999 2
#4 A 20 2000 2
#5 A 20 2001 2
#6 B 10 1999 3
#7 B 10 2000 3
As @radek mentioned, recent versions (dplyr 0.8.0) report that group_indices_ is deprecated; use group_indices instead:
df %>%
mutate(ID = group_indices(., School, Student))
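In dplyr 1.0.0 and later, passing grouping columns directly to group_indices() is, as far as I know, itself deprecated in favour of cur_group_id() inside a grouped mutate. A minimal sketch, assuming dplyr 1.0.0 or later and using the example data from the question (df_id is an illustrative name):

```r
library(dplyr)

df <- data.frame(School  = c("A", "A", "A", "A", "A", "B", "B"),
                 Student = c(10, 10, 20, 20, 20, 10, 10),
                 Year    = c(1999, 2000, 1999, 2000, 2001, 1999, 2000))

df_id <- df %>%
  group_by(School, Student) %>%
  mutate(ID = cur_group_id()) %>%   # one integer per School/Student group
  ungroup()

df_id
```

The IDs follow the sorted order of the groups, so this also works when the rows are not ordered.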
Group by School and Student, then assign group id to ID variable.
library('data.table')
df[, ID := .GRP, by = .(School, Student)]
# School Student Year ID
# 1: A 10 1999 1
# 2: A 10 2000 1
# 3: A 20 1999 2
# 4: A 20 2000 2
# 5: A 20 2001 2
# 6: B 10 1999 3
# 7: B 10 2000 3
Data:
df <- fread('School Student Year
A 10 1999
A 10 2000
A 20 1999
A 20 2000
A 20 2001
B 10 1999
B 10 2000')
