Here is dataframe 1
card value cat1 cat2 cat3
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 10 1 2 3
2 A 20 4 5 6
3 B 30 7 8 9
4 A 40 10 11 12
Here is dataframe 2 with the same number of rows and columns
card value cat1 cat2 cat3
<chr> <dbl> <dbl> <dbl> <dbl>
1 C 11 13 14 15
2 C 19 16 17 18
3 A 35 19 20 21
4 B 45 22 23 24
I want to create a new dataframe built row by row from these two: for each row position, the new dataframe takes the entire row from whichever dataframe has the higher number in the "value" column.
Thus the desired solution is:
card value cat1 cat2 cat3
<chr> <dbl> <dbl> <dbl> <dbl>
1 C 11 13 14 15
2 A 20 4 5 6
3 A 35 19 20 21
4 B 45 22 23 24
These are demo dataframes; the actual dataframes are on the order of 200,000 rows. What is the best way to do this? Note: it would also be good to have a column in the new dataframe indicating which dataframe each row came from: df_1 or df_2.
Dataframes
df_1 <- structure(list(card = c("A", "A", "B", "A"), value = c(10, 20,
30, 40), cat1 = c(1, 4, 7, 10), cat2 = c(2, 5, 8, 11), cat3 = c(3,
6, 9, 12)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-4L))
df_2 <- structure(list(card = c("C", "C", "A", "B"), value = c(11, 19,
35, 45), cat1 = c(13, 16, 19, 22), cat2 = c(14, 17, 20, 23),
cat3 = c(15, 18, 21, 24)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -4L))
You can avoid doing this as a grouped operation by stacking the two datasets and offsetting the row indexes. E.g.:
sel <- max.col(cbind(df_1$value, df_2$value))
rbind(df_1, df_2)[seq_along(sel) + c(0,nrow(df_1))[sel],]
## A tibble: 4 x 5
# card value cat1 cat2 cat3
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 C 11 13 14 15
#2 A 20 4 5 6
#3 A 35 19 20 21
#4 B 45 22 23 24
sel already records which data frame each row came from, so it can be added as a source column:
cbind(rbind(df_1, df_2)[seq_along(sel) + c(0,nrow(df_1))[sel],], src=sel)
# card value cat1 cat2 cat3 src
#1 C 11 13 14 15 2
#2 A 20 4 5 6 1
#3 A 35 19 20 21 2
#4 B 45 22 23 24 2
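If you would rather see the data-frame names than 1/2 in the source column, sel can index a character vector; a small follow-up sketch (res is just a placeholder name):
res <- rbind(df_1, df_2)[seq_along(sel) + c(0, nrow(df_1))[sel], ]
res$src <- c("df_1", "df_2")[sel]   # map 1 -> "df_1", 2 -> "df_2"
res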
base solutions
ifelse
do.call(rbind,
        ifelse(df_1$value >= df_2$value,
               split(df_1, 1:nrow(df_1)),
               split(df_2, 1:nrow(df_2)))
)
lapply
do.call(rbind, lapply(1:nrow(df_1), \(x) {
  if (df_1$value[x] >= df_2$value[x]) df_1[x, ] else df_2[x, ]
}))
# card value cat1 cat2 cat3
# 1 C 11 13 14 15
# 2 A 20 4 5 6
# 3 A 35 19 20 21
# 4 B 45 22 23 24
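Since the real data has around 200,000 rows, a fully vectorized variant that avoids building one small data frame per row may also be worth considering. A minimal base-R sketch, assuming the two data frames are row-aligned as in the question (ties go to df_1):
keep_1 <- df_1$value >= df_2$value         # TRUE where df_1 has the larger (or tied) value
out <- df_2                                # start from df_2 ...
out[keep_1, ] <- df_1[keep_1, ]            # ... and overwrite the rows where df_1 wins
out$src <- ifelse(keep_1, "df_1", "df_2")  # record the source data frame
out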
Related
Here is some fictional data:
library(tibble)
df <- tibble(fruit = rep(c("apple", "pear", "orange"), each = 3),
size = rep(c("big", "medium", "small"), times = 3),
# summer stock
shopA_summer_wk1 = abs(round(rnorm(9, 10, 5), 0)),
shopA_summer_wk2 = abs(round(rnorm(9, 10, 5), 0)),
shopB_summer_wk1 = abs(round(rnorm(9, 10, 5), 0)),
shopB_summer_wk2 = abs(round(rnorm(9, 10, 5), 0)),
shopC_summer_wk1 = abs(round(rnorm(9, 10, 5), 0)),
shopC_summer_wk2 = abs(round(rnorm(9, 10, 5), 0)),
# winter stock
shopA_winter_wk1 = abs(round(rnorm(9, 8, 4), 0)),
shopA_winter_wk2 = abs(round(rnorm(9, 8, 4), 0)),
shopA_winter_wk3 = abs(round(rnorm(9, 8, 4), 0)),
shopB_winter_wk1 = abs(round(rnorm(9, 8, 4), 0)),
shopB_winter_wk2 = abs(round(rnorm(9, 8, 4), 0)),
shopB_winter_wk3 = abs(round(rnorm(9, 8, 4), 0)),
shopC_winter_wk1 = abs(round(rnorm(9, 8, 4), 0)),
shopC_winter_wk2 = abs(round(rnorm(9, 8, 4), 0)),
shopC_winter_wk3 = abs(round(rnorm(9, 8, 4), 0)))
Some data is collected for 3 shops (A, B, C) across 2 weeks in the summer and 3 weeks in the winter. The data collected is the number of fruits (apple, pear, orange) per size (big, medium, small) the shop had in stock on that particular week.
Here are the first 6 rows of the dataset:
# fruit size shopA_summer_wk1 shopA_summer_wk2 shopB_summer_wk1 shopB_summer_wk2 shopC_summer_wk1 shopC_summer_wk2 shopA_winter_wk1 shopA_winter_wk2 shopA_winter_wk3
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 apple big 9 12 12 16 15 5 14 4 0
# 2 apple medium 21 16 16 1 12 11 8 8 9
# 3 apple small 10 6 18 18 22 12 4 2 0
# 4 pear big 13 7 4 12 13 6 10 6 2
# 5 pear medium 13 12 8 0 8 5 11 7 3
# 6 pear small 16 18 4 3 13 8 7 5 0
I would like to use the pivot_longer() function in R to restructure this dataset. Given that there are quite a few grouping categories, I'm having difficulty writing the code for this.
I would like to restructure it into a long format, with the shop, season, and week information moved out of the column names and into their own columns.
I would greatly appreciate any input :)
Using the names_pattern argument, we can do:
pivot_longer(df, c(-fruit, -size), names_pattern = '(^.*)_wk(.*$)',
names_to = c('Shop_season', 'week'))
#> # A tibble: 135 x 5
#> fruit size Shop_season week value
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 apple big shopA_summer 1 11
#> 2 apple big shopA_summer 2 8
#> 3 apple big shopB_summer 1 4
#> 4 apple big shopB_summer 2 24
#> 5 apple big shopC_summer 1 9
#> 6 apple big shopC_summer 2 10
#> 7 apple big shopA_winter 1 9
#> 8 apple big shopA_winter 2 12
#> 9 apple big shopA_winter 3 5
#> 10 apple big shopB_winter 1 5
#> # ... with 125 more rows
You might also want to separate shop and season, since these are really two different variables:
pivot_longer(df, c(-fruit, -size), names_pattern = '(^.*)_wk(.*$)',
names_to = c('Shop_season', 'week')) %>%
separate(Shop_season, into = c('shop', 'season'))
#> # A tibble: 135 x 6
#> fruit size shop season week value
#> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 apple big shopA summer 1 11
#> 2 apple big shopA summer 2 8
#> 3 apple big shopB summer 1 4
#> 4 apple big shopB summer 2 24
#> 5 apple big shopC summer 1 9
#> 6 apple big shopC summer 2 10
#> 7 apple big shopA winter 1 9
#> 8 apple big shopA winter 2 12
#> 9 apple big shopA winter 3 5
#> 10 apple big shopB winter 1 5
#> # ... with 125 more rows
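The same result can also be reached in a single call by capturing three groups in names_pattern; a sketch, assuming every measurement column follows the shopX_season_wkN naming structure:
pivot_longer(df, c(-fruit, -size), names_pattern = '(.*)_(.*)_wk(.*)',
             names_to = c('shop', 'season', 'week'))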
If the data is stored in dt, then:
pivot_longer(
data = dt,
cols = -c(fruit:size),
names_to = c("shop_season", "week"),
names_pattern = "(.*)_(.*)"
)
Output:
# A tibble: 135 x 5
fruit size shop_season week value
<chr> <chr> <chr> <chr> <dbl>
1 apple big shopA_summer wk1 13
2 apple big shopA_summer wk2 12
3 apple big shopB_summer wk1 9
4 apple big shopB_summer wk2 9
5 apple big shopC_summer wk1 7
6 apple big shopC_summer wk2 17
7 apple big shopA_winter wk1 10
8 apple big shopA_winter wk2 17
9 apple big shopA_winter wk3 12
10 apple big shopB_winter wk1 8
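If you then want separate shop and season columns and a numeric week from this output, a possible follow-up (a sketch; assumes dplyr and tidyr are loaded):
pivot_longer(
  data = dt,
  cols = -c(fruit:size),
  names_to = c("shop_season", "week"),
  names_pattern = "(.*)_(.*)"
) %>%
  separate(shop_season, into = c("shop", "season"), sep = "_") %>%
  mutate(week = as.integer(sub("wk", "", week, fixed = TRUE)))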
I have a dataframe that looks like this:
ID x.2019 x.2020
1 10 20
2 20 30
3 30 40
4 40 50
5 50 60
and I would like to reformat it to look like this:
ID time x
1 2019 10
1 2020 20
2 2019 20
2 2020 30
3 2019 30
3 2020 40
4 2019 40
4 2020 50
5 2019 50
5 2020 60
Any idea how to achieve this?
This is a fairly common task, and you can probably find it covered in other answers. That said, you can achieve what you want with data.table as follows:
library(data.table)
df = data.table( ID = 1:5,
x.2019 = seq(10, 50, by = 10),
x.2020 = seq(20, 60, by = 10)
)
# change column names conveniently
setnames(df, c("x.2019", "x.2020"), c("2019", "2020"))
# transform the dataset from wide to long format
out = melt(df, id.vars = "ID", variable.name = "time", value.name = "x", variable.factor = FALSE)
# cast time to integer
out[ , time := as.integer(time)]
# reorder by ID
setorder(out, ID)
out
#> ID time x
#> 1: 1 2019 10
#> 2: 1 2020 20
#> 3: 2 2019 20
#> 4: 2 2020 30
#> 5: 3 2019 30
#> 6: 3 2020 40
#> 7: 4 2019 40
#> 8: 4 2020 50
#> 9: 5 2019 50
#> 10: 5 2020 60
Created on 2022-01-20 by the reprex package (v2.0.1)
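A more compact variant (a sketch, assuming df still has its original x.2019/x.2020 column names) skips the renaming step and strips the prefix after melting:
out = melt(df, id.vars = "ID", variable.name = "time", value.name = "x",
           variable.factor = FALSE)
out[, time := as.integer(sub("x.", "", time, fixed = TRUE))]  # "x.2019" -> 2019
setorder(out, ID)
out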
You can use pivot_longer:
library(dplyr)
library(tidyr)
df = data.frame(ID=1:5,
x.2019=c(10, 20, 30, 40, 50),
x.2020=c(20, 30, 40, 50, 60))
df %>%
pivot_longer(cols = c(2, 3), names_to = 'time', values_to = 'x') %>%
mutate(time = as.integer(stringr::str_replace(time, 'x.', '')))
Result:
# A tibble: 10 x 3
ID time x
<int> <int> <dbl>
1 1 2019 10
2 1 2020 20
3 2 2019 20
4 2 2020 30
5 3 2019 30
6 3 2020 40
7 4 2019 40
8 4 2020 50
9 5 2019 50
10 5 2020 60
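An equivalent one-step sketch: names_prefix drops the "x." prefix and names_transform parses the year during the pivot (both arguments have been available since tidyr 1.1):
df %>%
  pivot_longer(cols = -ID, names_to = 'time', values_to = 'x',
               names_prefix = 'x\\.',
               names_transform = list(time = as.integer))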
This question already has answers here: wide to long multiple measures each time (5 answers). Closed 1 year ago.
I want to do this but the exact opposite. So say my dataset looks like this:
ID X_1990 X_2000 X_2010 Y_1990 Y_2000 Y_2010
A  1      4      7      10     13     16
B  2      5      8      11     14     17
C  3      6      9      12     15     18
but with many more measure variables (e.g. also Z_1990, etc.). How can I make the year its own variable while keeping the different measures as columns, like this:
ID Year X Y
A  1990 1 10
A  2000 4 13
A  2010 7 16
B  1990 2 11
B  2000 5 14
B  2010 8 17
C  1990 3 12
C  2000 6 15
C  2010 9 18
You may use pivot_longer with the names_sep argument and the special '.value' sentinel in names_to.
tidyr::pivot_longer(df, cols = -ID, names_to = c('.value', 'Year'), names_sep = '_')
# ID Year X Y
# <chr> <chr> <int> <int>
#1 A 1990 1 10
#2 A 2000 4 13
#3 A 2010 7 16
#4 B 1990 2 11
#5 B 2000 5 14
#6 B 2010 8 17
#7 C 1990 3 12
#8 C 2000 6 15
#9 C 2010 9 18
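Optionally, Year can be parsed as an integer in the same call via names_transform (a sketch; tidyr >= 1.1):
tidyr::pivot_longer(df, cols = -ID, names_to = c('.value', 'Year'),
                    names_sep = '_', names_transform = list(Year = as.integer))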
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(ID = c("A", "B", "C"), X_1990 = 1:3, X_2000 = 4:6,
X_2010 = 7:9, Y_1990 = 10:12, Y_2000 = 13:15, Y_2010 = 16:18),
row.names = c(NA, -3L), class = "data.frame")
Let's say I have a dataframe with 3 ID columns and one column of interest. Each row represents one observation. Some ID have multiple observations, i.e., multiple rows.
df <- data.frame(id1 = c( 1, 2, 3, 4, 4),
id2 = c( 11, 12, 13, 14, 14),
id3 = c(111, 112, 113, 114, 114),
variable_of_interest = c(13, 24, 35, 31, 12))
id1 id2 id3 variable_of_interest
1 1 11 111 13
2 2 12 112 24
3 3 13 113 35
4 4 14 114 31
5 4 14 114 12
My goal is to restructure it in order to have one row per ID, keeping the 3 ID columns and naming the new columns "variable_of_interest1" and "variable_of_interest2":
id1 id2 id3 variable_of_interest1 variable_of_interest2
1 1 11 111 13 NA
2 2 12 112 24 NA
3 3 13 113 35 NA
4 4 14 114 31 12
The solution probably needs reshape2 and the dcast function, but so far I have not been able to work it out.
We can create a sequence grouped by the 'id' columns and then reshape to wide with pivot_wider:
library(dplyr)
library(stringr)
library(tidyr)
library(data.table)
df %>%
mutate(ind = str_c('variable_of_interest', rowid(id1, id2, id3))) %>%
pivot_wider(names_from = ind, values_from = variable_of_interest)
-output
# A tibble: 4 x 5
# id1 id2 id3 variable_of_interest1 variable_of_interest2
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 11 111 13 NA
#2 2 12 112 24 NA
#3 3 13 113 35 NA
#4 4 14 114 31 12
Or another option is data.table
library(data.table)
dcast(setDT(df), id1 + id2 + id3 ~
paste0('variable_of_interest', rowid(id1, id2, id3)),
value.var = 'variable_of_interest')
-output
# id1 id2 id3 variable_of_interest1 variable_of_interest2
#1: 1 11 111 13 NA
#2: 2 12 112 24 NA
#3: 3 13 113 35 NA
#4: 4 14 114 31 12
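For context, rowid() is what numbers the repeated rows within each ID combination, which is where the 1/2 suffix comes from; a quick illustration with the question's data:
data.table::rowid(df$id1, df$id2, df$id3)
#> [1] 1 1 1 1 2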
I have a dataframe with missing values.
df1 <- data.frame(ID = c(1, 2, 3, 4, 5, 6), value1 = c(23, 14, NA, 45, NA, NA),
value2 = c(25, 15, NA, 34, NA, NA), value3 = c(33, 29, NA, 29, NA, NA))
ID value1 value2 value3
1 23 25 33
2 14 15 29
3 NA NA NA
4 45 34 29
5 NA NA NA
6 NA NA NA
And a dataframe with id relations.
df2 <- data.frame(ID1 = c(1, 2, 4), ID2 = c(3, 5, 6))
ID1 ID2
1 3
2 5
4 6
I want to replace the missing values, with the values of the related ID.
So the dataframe will look like this.
ID value1 value2 value3
1 23 25 33
2 14 15 29
3 23 25 33
4 45 34 29
5 14 15 29
6 45 34 29
Any help would be appreciated.
You will need a for loop like this:
for (i in seq_along(df2[, "ID2"])) {
  df1[df2[i, "ID2"], c("value1", "value2", "value3")] <-
    df1[df2[i, "ID1"], c("value1", "value2", "value3")]
}
You can use a for loop, as @FannieY already suggested. In addition, I test with is.na to avoid overwriting existing values.
for (i in seq_len(nrow(df2))) {
  idx <- is.na(df1[df2[i, 2], -1])                    # which cells in the ID2 row are missing
  df1[df2[i, 2], -1][idx] <- df1[df2[i, 1], -1][idx]  # fill them from the related ID1 row
}
df1
# ID value1 value2 value3
#1 1 23 25 33
#2 2 14 15 29
#3 3 23 25 33
#4 4 45 34 29
#5 5 14 15 29
#6 6 45 34 29
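A possible loop-free alternative (a sketch, assuming dplyr >= 1.0 is available): build a small lookup of donor values keyed by ID2 and let rows_patch() fill only the NA cells.
library(dplyr)
donor <- df2 %>%
  left_join(df1, by = c("ID1" = "ID")) %>%  # pull the donor row's values for each ID1
  select(ID = ID2, value1, value2, value3)  # re-key them by the ID that needs filling
rows_patch(df1, donor, by = "ID")           # overwrites only the NA cells in df1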