How to convert dataframe from wide to long and separate different values of one column into new columns? [duplicate] - r

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 2 years ago.
I have this dataframe (with dummy values)
Country lifeExp_1952 lifeExp_1962 pop_1952 pop_1962 gdp_1952 gdp_1962
A 1 4 7 10 13 16
B 2 5 8 11 14 17
C 3 6 9 12 15 18
I would like to convert it to a long format, but have separate columns for lifeExp, pop and gdp such that it looks like this:
Country Year lifeExp pop gdp
A 1952 1 7 13
A 1962 4 10 16
B 1952 2 8 14
B 1962 5 11 17
C 1952 3 9 15
C 1962 6 12 18
So far, I have been able to extract the year with lifeExp, pop and gdp in the same column using reshape2, but I have no idea how to give them their own column.

We can use pivot_longer
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = -Country, names_to = c(".value", 'Year'), names_sep = "_")
-output
# A tibble: 6 x 5
# Country Year lifeExp pop gdp
# <chr> <chr> <int> <int> <int>
#1 A 1952 1 7 13
#2 A 1962 4 10 16
#3 B 1952 2 8 14
#4 B 1962 5 11 17
#5 C 1952 3 9 15
#6 C 1962 6 12 18
data
df1 <- structure(list(Country = c("A", "B", "C"), lifeExp_1952 = 1:3,
lifeExp_1962 = 4:6, pop_1952 = 7:9, pop_1962 = 10:12, gdp_1952 = 13:15,
gdp_1962 = 16:18), class = "data.frame", row.names = c(NA,
-3L))
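Note that Year comes back as a character column in the output above, because it is taken from the column names. If an integer Year is preferred, one option (a sketch, assuming a tidyr version that supports the names_transform argument, i.e. 1.1.0 or later) is to convert it during the pivot:

```r
library(tidyr)

# Same dummy data as in the answer above
df1 <- data.frame(Country = c("A", "B", "C"),
                  lifeExp_1952 = 1:3, lifeExp_1962 = 4:6,
                  pop_1952 = 7:9, pop_1962 = 10:12,
                  gdp_1952 = 13:15, gdp_1962 = 16:18)

# ".value" keeps the part before "_" as the output column name;
# the part after "_" goes into Year and is converted to integer
long <- pivot_longer(df1,
                     cols = -Country,
                     names_to = c(".value", "Year"),
                     names_sep = "_",
                     names_transform = list(Year = as.integer))
long
```

The object name long is only used for illustration; the rest mirrors the call in the answer.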


Using pivot_longer to pivot pairs of columns [duplicate]

I am trying to find a simple way to pivot_longer a dataframe that has multiple columns containing different data for each case. Using multiple names in names_to doesn't seem to solve the problem.
Here is a worked example:
#create the dataframe:
library('dplyr')
set.seed(11)
x <- data.frame(case = c(1:10),
X1990 = runif(10, 0, 1),
flag.1990 = rep(c('a','b'), 5),
X2000 = runif(10, 0, 1),
flag.2000 = rep(c('c', 'd'), 5))
> x
case X1990 flag.1990 X2000 flag.2000
1 1 0.2772497942 a 0.1751129 c
2 2 0.0005183129 b 0.4407503 d
3 3 0.5106083730 a 0.9071830 c
4 4 0.0140479084 b 0.8510419 d
5 5 0.0646897766 a 0.7339875 c
6 6 0.9548492255 b 0.5736857 d
7 7 0.0864958912 a 0.4817655 c
8 8 0.2899750092 b 0.3306110 d
9 9 0.8806991728 a 0.1576602 c
10 10 0.1232162013 b 0.4801341 d
Obviously I cannot just pivot_longer using cols = -case, as that will combine the year and flag data. If I try using a character vector in names_to (from section 6.1.3 of https://dcl-wrangle.stanford.edu/pivot-advanced.html):
x %>%
setNames(c('case','value.1990', 'flag.1990', 'value.2000', 'flag.2000')) %>%
pivot_longer(cols = -case,
names_to = c('value', 'flag'),
names_sep = '.',
values_to = 'value')
Things don't work, because the flag data isn't in the variable name.
The only way I can think to solve this is to break the dataframe into two data frames, pivot them and then join them. For example:
#create temporary data frame for year data, then pivot
temp1 <- x %>%
select(1,2, 4) %>% #select year data
pivot_longer(cols = c(X1990, X2000), #pivot longer on year data
names_to = 'year',
values_to = 'value') %>%
mutate(year = gsub('X', '', year)) #remove 'X' so that I can use this to join
#create temporary data frame for flag data, then pivot
temp2 <- x %>%
select(1, 3, 5) %>% #select flag variables
pivot_longer(cols = c(flag.1990, flag.2000), #pivot longer on flag data
names_to = 'flag.year',
values_to = 'flag') %>%
mutate(year = gsub('flag.', '', flag.year)) %>% #get year data so that I can join on this
select(-flag.year) #drop flag.year as its no longer useful information
final <- full_join(temp1, temp2, by = c('case', 'year')) #full join the two datasets to get the final data
> final
# A tibble: 20 x 4
case flag year value
<int> <chr> <chr> <dbl>
1 1 a 1990 0.277
2 1 c 2000 0.175
3 2 b 1990 0.000518
4 2 d 2000 0.441
5 3 a 1990 0.511
6 3 c 2000 0.907
7 4 b 1990 0.0140
8 4 d 2000 0.851
9 5 a 1990 0.0647
10 5 c 2000 0.734
11 6 b 1990 0.955
12 6 d 2000 0.574
13 7 a 1990 0.0865
14 7 c 2000 0.482
15 8 b 1990 0.290
16 8 d 2000 0.331
17 9 a 1990 0.881
18 9 c 2000 0.158
19 10 b 1990 0.123
20 10 d 2000 0.480
I assume there is a quicker way to do this. Am I just misreading the documentation on using multiple names in names_to? Any ideas?
In this case one has to use names_to combined with names_pattern:
library(dplyr)
library(tidyr)
> head(x,3)
case X1990 flag.1990 X2000 flag.2000
1 1 0.2772497942 a 0.1751129 c
2 2 0.0005183129 b 0.4407503 d
3 3 0.5106083730 a 0.9071830 c
> x %>%
pivot_longer(cols = -case,
names_to = c(".value", "year"),
names_pattern = "([^\\.]*)\\.*(\\d{4})")
# A tibble: 20 x 4
case year X flag
<int> <chr> <dbl> <chr>
1 1 1990 0.277 a
2 1 2000 0.175 c
3 2 1990 0.000518 b
4 2 2000 0.441 d
5 3 1990 0.511 a
6 3 2000 0.907 c
7 4 1990 0.0140 b
8 4 2000 0.851 d
9 5 1990 0.0647 a
10 5 2000 0.734 c
11 6 1990 0.955 b
12 6 2000 0.574 d
13 7 1990 0.0865 a
14 7 2000 0.482 c
15 8 1990 0.290 b
16 8 2000 0.331 d
17 9 1990 0.881 a
18 9 2000 0.158 c
19 10 1990 0.123 b
20 10 2000 0.480 d
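An alternative sketch, not part of the original answer: names_sep is interpreted as a regular expression, so an unescaped '.' matches every character rather than a literal dot. If the value columns are first renamed so that every column follows a name.year pattern, an escaped separator together with ".value" should give the same shape. The small data frame below is a made-up stand-in for x, and the column name value is an arbitrary choice.

```r
library(tidyr)

# Small stand-in for the data frame x from the question (values are made up)
x <- data.frame(case = 1:3,
                X1990 = c(0.1, 0.2, 0.3),
                flag.1990 = c("a", "b", "a"),
                X2000 = c(0.4, 0.5, 0.6),
                flag.2000 = c("c", "d", "c"))

# Rename X1990 / X2000 to value.1990 / value.2000 so all names share one pattern
names(x) <- sub("^X", "value.", names(x))

# "\\." is a literal dot; ".value" turns the prefix into the output column name
long <- pivot_longer(x,
                     cols = -case,
                     names_to = c(".value", "year"),
                     names_sep = "\\.")
long
```

This gives columns case, year, value and flag, with one row per case and year.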

Transpose only certain columns - data formating [duplicate]

This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 1 year ago.
I'm trying to reshape some data. This is what I have now:
df <- data.frame(year = c(2010,2011,2010,2011),A = c(10,11,10,11),B = c(11,12,11,12))
year A B
1 2010 10 11
2 2011 11 12
3 2010 10 11
4 2011 11 12
I want it to look like this, but do not know how to do it. Can anyone help me?
company year Variable
1 A 2010 10
2 A 2011 11
3 B 2010 11
4 B 2011 12
We can use pivot_longer
library(tidyr)
pivot_longer(df, cols = -year,
names_to = 'company', values_to = 'Variable')
-output
# A tibble: 8 × 3
year company Variable
<dbl> <chr> <dbl>
1 2010 A 10
2 2010 B 11
3 2011 A 11
4 2011 B 12
5 2010 A 10
6 2010 B 11
7 2011 A 11
8 2011 B 12
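The output above has 8 rows because the input contains each year twice, while the desired result in the question has 4. A possible way to get the 4-row version (a sketch, assuming the repeated rows are true duplicates that can be dropped) is to call distinct() before pivoting and then reorder:

```r
library(dplyr)
library(tidyr)

df <- data.frame(year = c(2010, 2011, 2010, 2011),
                 A = c(10, 11, 10, 11),
                 B = c(11, 12, 11, 12))

long <- df %>%
  distinct() %>%                      # drop the repeated (year, A, B) rows
  pivot_longer(cols = -year,
               names_to = "company",
               values_to = "Variable") %>%
  select(company, year, Variable) %>% # column order from the question
  arrange(company, year)              # sort by company, then year
long
```

If the repeated rows are meaningful rather than accidental, distinct() should be left out.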

Convert data from wide format to long format with multiple measure columns [duplicate]

This question already has answers here:
wide to long multiple measures each time
(5 answers)
Closed 1 year ago.
I want to do this but the exact opposite. So say my dataset looks like this:
ID X_1990 X_2000 X_2010 Y_1990 Y_2000 Y_2010
A 1 4 7 10 13 16
B 2 5 8 11 14 17
C 3 6 9 12 15 18
but with a lot more measure variables (i.e. also Z_1990, etc.). How can I get it so that the year becomes a variable and it will keep the different measures, like this:
ID Year X Y
A 1990 1 10
A 2000 4 13
A 2010 7 16
B 1990 2 11
B 2000 5 14
B 2010 8 17
C 1990 3 12
C 2000 6 15
C 2010 9 18
You may use pivot_longer with the names_sep argument.
tidyr::pivot_longer(df, cols = -ID, names_to = c('.value', 'Year'), names_sep = '_')
# ID Year X Y
# <chr> <chr> <int> <int>
#1 A 1990 1 10
#2 A 2000 4 13
#3 A 2010 7 16
#4 B 1990 2 11
#5 B 2000 5 14
#6 B 2010 8 17
#7 C 1990 3 12
#8 C 2000 6 15
#9 C 2010 9 18
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(ID = c("A", "B", "C"), X_1990 = 1:3, X_2000 = 4:6,
X_2010 = 7:9, Y_1990 = 10:12, Y_2000 = 13:15, Y_2010 = 16:18),
row.names = c(NA, -3L), class = "data.frame")
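Since the question describes this as the opposite of a long-to-wide reshape, a quick way to check the result is to pivot it back with pivot_wider and compare the column names with the original. This is only a sanity-check sketch; the object names long and wide are illustrative.

```r
library(tidyr)

df <- data.frame(ID = c("A", "B", "C"),
                 X_1990 = 1:3, X_2000 = 4:6, X_2010 = 7:9,
                 Y_1990 = 10:12, Y_2000 = 13:15, Y_2010 = 16:18)

# Wide to long, as in the answer
long <- pivot_longer(df, cols = -ID,
                     names_to = c(".value", "Year"), names_sep = "_")

# Long back to wide: names are rebuilt as <measure>_<Year>
wide <- pivot_wider(long, names_from = Year, values_from = c(X, Y))

identical(names(wide), names(df))
```

The same call should extend to further measures such as Z_1990, as long as every column name follows the measure_year pattern.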

Match and Remove Rows Based on Condition R [duplicate]

This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Closed 2 years ago.
I've got an interesting one for you all.
I'm looking to first: Look through the ID column and identify duplicate values. Once those are identified, the code should go through the income of the duplicated values and keep the row with the larger income.
So if there are three ID values of 2, it will look for the one with the highest income and keep that row.
ID Income
1 98765
2 3456
2 67
2 5498
5 23
6 98
7 5645
7 67871
9 983754
10 982
10 2374
10 875
10 4744
11 6853
I know it's as easy as subsetting based on a condition, but I don't know how to remove rows based on whether the income in one row is greater than in another (only when the IDs match).
I was thinking of using an ifelse statement to create a new column to identify duplicates (through subsetting or not) then use the new column's values to ifelse again to identify the larger income. From there I can just subset based on the new columns I have created.
Is there a faster, more efficient way of doing this?
The outcome should look like this.
ID Income
1 98765
2 5498
5 23
6 98
7 67871
9 983754
10 4744
11 6853
Thank you
We can slice the rows by checking the highest value in 'Income' grouped by 'ID'
library(dplyr)
df1 %>%
group_by(ID) %>%
slice(which.max(Income))
Or using data.table
library(data.table)
setDT(df1)[, .SD[which.max(Income)], by = ID]
Or with base R
df1[with(df1, ave(Income, ID, FUN = max) == Income),]
# ID Income
#1 1 98765
#4 2 5498
#5 5 23
#6 6 98
#8 7 67871
#9 9 983754
#13 10 4744
#14 11 6853
data
df1 <- structure(list(ID = c(1L, 2L, 2L, 2L, 5L, 6L, 7L, 7L, 9L, 10L,
10L, 10L, 10L, 11L), Income = c(98765L, 3456L, 67L, 5498L, 23L,
98L, 5645L, 67871L, 983754L, 982L, 2374L, 875L, 4744L, 6853L)),
class = "data.frame", row.names = c(NA,
-14L))
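One difference between these approaches shows up when the maximum is tied within an ID: which.max returns only the first maximal row, while the ave comparison keeps every row equal to the maximum. The sketch below uses a small made-up data frame with a tie added for ID 7 (not the data from the question), and also shows slice_max with with_ties = FALSE, assuming dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical data: ID 7 has two rows sharing the maximum Income
df1 <- data.frame(ID = c(1L, 2L, 2L, 2L, 7L, 7L),
                  Income = c(98765L, 3456L, 67L, 5498L, 67871L, 67871L))

# Exactly one row per ID, even when the maximum is tied
top <- df1 %>%
  group_by(ID) %>%
  slice_max(Income, n = 1, with_ties = FALSE) %>%
  ungroup()

# The base R comparison keeps all rows equal to the group maximum
ties <- df1[with(df1, ave(Income, ID, FUN = max) == Income), ]

nrow(top)
nrow(ties)
```

Which behaviour is wanted depends on whether tied rows should count as duplicates to remove.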
Using order with duplicated (base R): sort by ID and descending Income, then keep the first row for each ID.
df <- df[order(df$ID, -df$Income), ]
df[!duplicated(df$ID), ]
ID Income
1 1 98765
4 2 5498
5 5 23
6 6 98
8 7 67871
9 9 983754
13 10 4744
14 11 6853
Here is another dplyr method. We can arrange the column and then slice the data frame for the first row.
library(dplyr)
df2 <- df %>%
arrange(ID, desc(Income)) %>%
group_by(ID) %>%
slice(1) %>%
ungroup()
df2
# # A tibble: 8 x 2
# ID Income
# <int> <int>
# 1 1 98765
# 2 2 5498
# 3 5 23
# 4 6 98
# 5 7 67871
# 6 9 983754
# 7 10 4744
# 8 11 6853
DATA
df <- read.table(text = "ID Income
1 98765
2 3456
2 67
2 5498
5 23
6 98
7 5645
7 67871
9 983754
10 982
10 2374
10 875
10 4744
11 6853",
header = TRUE)
group_by and summarise from dplyr would work too. Note that this keeps only ID and the maximum Income; any other columns are dropped.
df1 %>%
group_by(ID) %>%
summarise(Income=max(Income))
ID Income
<int> <dbl>
1 1 98765.
2 2 5498.
3 5 23.
4 6 98.
5 7 67871.
6 9 983754.
7 10 4744.
8 11 6853.
Using sqldf: Group by ID and select the corresponding max Income
library(sqldf)
sqldf("select ID,max(Income) from df group by ID")
Output:
ID max(Income)
1 1 98765
2 2 5498
3 5 23
4 6 98
5 7 67871
6 9 983754
7 10 4744
8 11 6853
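For comparison, the same group-by-and-take-the-maximum step can be sketched in base R with aggregate, which needs no extra packages. Like the SQL query, it returns only ID and the maximum Income; the object name res is illustrative.

```r
# Data from the question
df <- data.frame(ID = c(1L, 2L, 2L, 2L, 5L, 6L, 7L, 7L, 9L, 10L,
                        10L, 10L, 10L, 11L),
                 Income = c(98765L, 3456L, 67L, 5498L, 23L, 98L, 5645L,
                            67871L, 983754L, 982L, 2374L, 875L, 4744L, 6853L))

# One row per ID holding the maximum Income for that ID
res <- aggregate(Income ~ ID, data = df, FUN = max)
res
```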
