Specific Join of two Dataframes - r

I have two data frames: df1 and df2:
> df1
ID Gender age cd evnt scr test_dt
1 C0004 MALE 22 1 1 82 7/3/2014
2 C0004 MALE 22 1 2 76 7/3/2014
3 C0005 MALE 22 1 3 1514 7/3/2014
4 C0005 MALE 23 2 1 81 11/3/2014
5 C0006 MALE 23 2 2 75 11/3/2014
6 C0006 MALE 23 2 3 878 11/3/2014
and,
> df2
ID hgt wt phys_dt
1 C0004 70 147 6/29/2015
2 C0004 70 157 6/27/2016
3 C0005 67 175 6/27/2016
4 C0005 65 171 7/2/2014
5 C0006 69 160 6/29/2015
6 C0006 64 143 7/2/2014
I want to join df1 and df2 in a way that yields the following data frame, call it df3:
> df3
ID Gender age cd evnt scr hgt wt
1 C0004 MALE 22 1 1 82 70 147
2 C0004 MALE 22 1 2 76 70 157
3 C0005 MALE 22 1 3 1514 67 175
4 C0005 MALE 23 2 1 81 65 171
5 C0006 MALE 23 2 2 75 69 160
6 C0006 MALE 23 2 3 878 64 143
I'm trying to add df2$hgt and df2$wt to the proper ID row. The tricky part is that I want to join hgt and wt to the ID row whose dates (df1$test_dt and df2$phys_dt) most closely align. I was thinking I could first sort the two data frames by ID, then by their respective dates, and then try to join. I'm not quite sure how to approach this. Thanks.

If you want to merge just by matching df1$ID and df2$ID, the following should do it:
df3 <- left_join(df1, df2, by = c("ID" = "ID"))
If the date should be matched as well as the ID, you could try:
df3 <- left_join(df1, df2, by = c("ID" = "ID", "test_dt" = "phys_dt"))
left_join() comes from the dplyr package (library(dplyr)).
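Note that an exact join on the date columns only works when test_dt and phys_dt coincide. For the "closest date" part of the question, here is a minimal base-R sketch of a nearest-date match, using a trimmed-down, hypothetical version of the example data and assuming the date columns have been parsed with as.Date() first:

```r
df1 <- data.frame(ID = c("C0004", "C0005"),
                  scr = c(82, 81),
                  test_dt = as.Date(c("2014-07-03", "2014-11-03")))
df2 <- data.frame(ID = c("C0004", "C0004", "C0005"),
                  hgt = c(70, 70, 65),
                  wt = c(147, 157, 171),
                  phys_dt = as.Date(c("2015-06-29", "2016-06-27", "2014-07-02")))

# For each row of df1, find the df2 row with the same ID whose phys_dt
# is closest in time to test_dt
idx <- sapply(seq_len(nrow(df1)), function(i) {
  cand <- which(df2$ID == df1$ID[i])
  cand[which.min(abs(as.numeric(df2$phys_dt[cand] - df1$test_dt[i])))]
})
df3 <- cbind(df1, df2[idx, c("hgt", "wt")])
```

data.table users can get the same effect more efficiently with a rolling join (roll = "nearest").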

Join data frame into one in r

I have 4 data frames that all look like this:
Product 2018  Number  Minimum  Maximum
1             56      1        5
2             42      12       16
3             6523    23       56
4             123     23       102
5             56      23       64
6             245623  56       87
7             546     25       540
8             54566   253      560

Product 2019  Number  Minimum  Maximum
1             56      32       53
2             642     423      620
3             56423   432      560
4             3       431      802
5             2       2        6
6             4523    43       68
7             555     23       54
8             55646   3        6

Product 2020  Number  Minimum  Maximum
1             23      2        5
2             342     4        16
3             223     3        5
4             13      4        12
5             2       4        7
6             223     7        8
7             5       34       50
8             46      3        6

Product 2021  Number  Minimum  Maximum
1             234     3        5
2             3242    4        16
3             2423    43       56
4             123     43       102
5             24      4        6
6             2423    4        18
7             565     234      540
8             5646    23       56
I want to join all the tables so I get a table that looks like this:
Products  Number 2021  Min-Max 2021  Number 2020  Min-Max 2020  Number 2019  Min-Max 2019  Number 2018  Min-Max 2018
1         234          3 to 5        23           2 to 5        ...          ...           ...          ...
2         3242         4 to 16       342          4 to 16       ...          ...           ...          ...
3         2423         43 to 56      223          3 to 5        ...          ...           ...          ...
4         123          43 to 102     13           4 to 12       ...          ...           ...          ...
5         24           4 to 6        2            4 to 7        ...          ...           ...          ...
6         2423         4 to 18       223          7 to 8        ...          ...           ...          ...
7         565          234 to 540    5            34 to 50      ...          ...           ...          ...
8         5646         23 to 56      46           3 to 6        ...          ...           ...          ...
The products are the same for all years, so I would like a data frame that contains the number for each year as a column and combines the minimum and maximum columns into one.
Any help is welcome!
How about something like this. You are trying to join several data frames by a single column, which is relatively straightforward with full_join. The difficulty is that you also need to extract information from the column names and combine several columns at the same time. I would map out everything you want to do and then reduce the list of data frames at the end. Here is an example with two data frames, but you could add as many as you want to the list at the beginning.
library(tidyverse)

# test data
set.seed(23)
df1 <- tibble("Product 2018" = seq(1:8),
              Number = sample(1:100, 8),
              Minimum = sample(1:100, 8),
              Maximum = map_dbl(Minimum, ~ sample(.x:1000, 1)))
set.seed(46)
df2 <- tibble("Product 2019" = seq(1:8),
              Number = sample(1:100, 8),
              Minimum = sample(1:100, 8),
              Maximum = map_dbl(Minimum, ~ sample(.x:1000, 1)))

list(df1, df2) |>
  map(\(x){
    year <- str_extract(colnames(x)[1], "\\d+?$")
    mutate(x, !!quo_name(paste0("Min-Max ", year)) := paste(Minimum, "to", Maximum)) |>
      rename(!!quo_name(paste0("Number ", year)) := Number) |>
      rename_with(~ gsub("\\s\\d+?$", "", .), 1) |>
      select(-c(Minimum, Maximum))
  }) |>
  reduce(full_join, by = "Product")
#> # A tibble: 8 x 5
#> Product `Number 2018` `Min-Max 2018` `Number 2019` `Min-Max 2019`
#> <int> <int> <chr> <int> <chr>
#> 1 1 29 21 to 481 50 93 to 416
#> 2 2 28 17 to 314 78 7 to 313
#> 3 3 72 40 to 787 1 91 to 205
#> 4 4 43 36 to 557 47 55 to 542
#> 5 5 45 70 to 926 52 76 to 830
#> 6 6 34 96 to 645 70 20 to 922
#> 7 7 48 31 to 197 84 6 to 716
#> 8 8 17 86 to 951 99 75 to 768
This is a similar answer, but it uses bind_rows to combine the data frames and then pivot_wider to end in a wide format.
The first step strips the year from the "Product XXXX" column name, as this carries the year information for that data frame. If that column is renamed to Product, the frames are easily combined (with a separate column containing the Year). If this renaming can be done earlier in the data collection or processing pipeline, all the better.
library(tidyverse)

list(df1, df2, df3, df4) %>%
  map(~ .x %>%
        mutate(Year = gsub("Product", "", names(.x)[1])) %>%
        rename(Product = !!names(.[1]))) %>%
  bind_rows() %>%
  mutate(Min_Max = paste(Minimum, Maximum, sep = " to ")) %>%
  pivot_wider(id_cols = Product, names_from = Year,
              values_from = c(Number, Min_Max), names_vary = "slowest")
Output
Product Number_2018 Min_Max_2018 Number_2019 Min_Max_2019 Number_2020 Min_Max_2020 Number_2021 Min_Max_2021
<int> <int> <chr> <int> <chr> <int> <chr> <int> <chr>
1 1 56 1 to 5 56 32 to 53 23 2 to 5 234 3 to 5
2 2 42 12 to 16 642 423 to 620 342 4 to 16 3242 4 to 16
3 3 6523 23 to 56 56423 432 to 560 223 3 to 5 2423 43 to 56
4 4 123 23 to 102 3 431 to 802 13 4 to 12 123 43 to 102
5 5 56 23 to 64 2 2 to 6 2 4 to 7 24 4 to 6
6 6 245623 56 to 87 4523 43 to 68 223 7 to 8 2423 4 to 18
7 7 546 25 to 540 555 23 to 54 5 34 to 50 565 234 to 540
8 8 54566 253 to 560 55646 3 to 6 46 3 to 6 5646 23 to 56

How to calculate percentage for each column of a dataframe based on the total given in another column in R?

I want to calculate a percentage for each column of a data frame, adding a new column after each original column. How can I achieve this in R?
The percentage is calculated based on another column.
Here is an example dataset. The percentages are calculated from the total column for col2, col3, and col4:
Var total col2 col3 col4
A 217 77 62 78
D 112 14 47 51
B 91 15 39 37
R 89 77 7 5
V 80 8 53 19
The output should look like
Var total col2 col2_percent col3 col3_percent col4 col4_percent
A 217 77 35.48% 62 28.57% 78 35.94%
D 112 14 12.50% 47 41.96% 51 45.54%
B 91 15 16.48% 39 42.86% 37 40.66%
R 89 77 86.52% 7 7.87% 5 5.62%
V 80 8 10.00% 53 66.25% 19 23.75%
You can use across:
library(dplyr)
df %>%
  mutate(across(-c(Var, total), ~ sprintf('%.2f%%', .x / total * 100),
                .names = "{col}_percent")) %>%
  relocate(Var, total, sort(colnames(.)))
Var total col2 col2_percent col3 col3_percent col4 col4_percent
1 A 217 77 35.48% 62 28.57% 78 35.94%
2 D 112 14 12.50% 47 41.96% 51 45.54%
3 B 91 15 16.48% 39 42.86% 37 40.66%
4 R 89 77 86.52% 7 7.87% 5 5.62%
5 V 80 8 10.00% 53 66.25% 19 23.75%
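The same idea can also be sketched in base R without dplyr, looping over the value columns of a data frame rebuilt from the example (the column order will differ unless you reorder afterwards):

```r
df <- data.frame(Var = c("A", "D", "B", "R", "V"),
                 total = c(217, 112, 91, 89, 80),
                 col2 = c(77, 14, 15, 77, 8),
                 col3 = c(62, 47, 39, 7, 53),
                 col4 = c(78, 51, 37, 5, 19))

# Append a "<col>_percent" column for every value column
for (col in c("col2", "col3", "col4")) {
  df[[paste0(col, "_percent")]] <- sprintf("%.2f%%", df[[col]] / df$total * 100)
}
```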

How can I transform multiple repeated measures from wide to long format?

I have a data set that looks like this:
id <- c(1:3)
gender <- factor(c("male","female","female"))
age <- c(51,69,44)
cortisol_1 <- c(23,32,54)
cortisol_2 <- c(34,52,49)
cortisol_3 <- c(34,65,12)
blood_1 <- c(12,64,54)
blood_2 <- c(52,32,75)
blood_3 <- c(12,12,75)
temp_1 <- c(38.5,38.7,37.9)
temp_3 <- c(36.5,36.4,37.1)
df <- data.frame(id,gender,age,cortisol_1,cortisol_2,cortisol_3,blood_1,blood_2,blood_3,temp_1,temp_3)
df
id gender age cortisol_1 cortisol_2 cortisol_3 blood_1 blood_2 blood_3 temp_1 temp_3
1 1 male 51 23 34 34 12 52 12 38.5 36.5
2 2 female 69 32 52 65 64 32 12 38.7 36.4
3 3 female 44 54 49 12 54 75 75 37.9 37.1
So I have cortisol level and blood pressure, which were measured annually at three time points. However, body temperature was only assessed at baseline and at wave 3.
How can I change the data structure from wide to long? I would hope that the data looks like this:
id gender wave cortisol blood temp
1 1 male 1 23 12 38.5
2 1 male 2 34 52 NA
3 1 male 3 34 12 36.5
4 2 female 1 32 64 38.7
5 2 female 2 52 32 NA
6 2 female 3 65 12 36.4
7 3 female 1 54 54 37.9
8 3 female 2 49 75 NA
9 3 female 3 12 75 37.1
Best
Jascha
We can use pivot_longer
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = -c(id, gender, age),
               names_to = c('.value', 'grp'),
               names_sep = "_") %>%
  select(-grp)
-output
# A tibble: 9 x 6
# id gender age cortisol blood temp
# <int> <fct> <dbl> <dbl> <dbl> <dbl>
#1 1 male 51 23 12 38.5
#2 1 male 51 34 52 NA
#3 1 male 51 34 12 36.5
#4 2 female 69 32 64 38.7
#5 2 female 69 52 32 NA
#6 2 female 69 65 12 36.4
#7 3 female 44 54 54 37.9
#8 3 female 44 49 75 NA
#9 3 female 44 12 75 37.1
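If you want to keep the measurement index as the wave column shown in the desired output, rather than dropping grp, you can name it directly in names_to; a small variation on the answer above, rebuilt to be self-contained:

```r
library(tidyr)

# Rebuild the example data from the question
df <- data.frame(id = 1:3,
                 gender = factor(c("male", "female", "female")),
                 age = c(51, 69, 44),
                 cortisol_1 = c(23, 32, 54), cortisol_2 = c(34, 52, 49),
                 cortisol_3 = c(34, 65, 12),
                 blood_1 = c(12, 64, 54), blood_2 = c(52, 32, 75),
                 blood_3 = c(12, 12, 75),
                 temp_1 = c(38.5, 38.7, 37.9), temp_3 = c(36.5, 36.4, 37.1))

# '.value' splits each name into a measure column; the suffix becomes wave
long <- pivot_longer(df, cols = -c(id, gender, age),
                     names_to = c(".value", "wave"),
                     names_sep = "_",
                     names_transform = list(wave = as.integer))
```

Missing combinations such as temp_2 come out as NA automatically.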

find max column value in r conditional on another column

I have a data frame of baseball player information:
playerID nameFirst nameLast bats throws yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO IBB
81955 rolliji01 Jimmy Rollins B R 2007 1 PHI NL 162 716 139 212 38 20 30 94 41 6 49 85 5
103358 wilsowi02 Willie Wilson B R 1980 1 KCA AL 161 705 133 230 28 15 3 49 79 10 28 81 3
93082 suzukic01 Ichiro Suzuki L R 2004 1 SEA AL 161 704 101 262 24 5 8 60 36 11 49 63 19
83973 samueju01 Juan Samuel R R 1984 1 PHI NL 160 701 105 191 36 19 15 69 72 15 28 168 2
15201 cashda01 Dave Cash R R 1975 1 PHI NL 162 699 111 213 40 3 4 57 13 6 56 34 5
75531 pierrju01 Juan Pierre L L 2006 1 CHN NL 162 699 87 204 32 13 3 40 58 20 32 38 0
HBP SH SF GIDP average
81955 7 0 6 11 0.2960894
103358 6 5 1 4 0.3262411
93082 4 2 3 6 0.3721591
83973 7 0 1 6 0.2724679
15201 4 0 7 8 0.3047210
75531 8 10 1 6 0.2918455
I want to return the maximum value of the batting average ('average') column where the at-bats ('AB') are greater than 100. There are also NaN values in the average column.
If you want to return the entire row for which the two conditions are TRUE, you can do something like this.
library(tidyverse)
data <- tibble(
  AB = sample(seq(50, 150, 10), 10),
  avg = c(runif(9), NaN)
)

data %>%
  filter(AB >= 100) %>%
  filter(avg == max(avg, na.rm = TRUE))
The first filter keeps only the rows where AB is greater than or equal to 100, and the second filter selects the entire row containing the maximum. If you only want the maximum value itself, you can do something like this:
data %>%
  filter(AB >= 100) %>%
  summarise(max = max(avg, na.rm = TRUE))
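If all you need is the single maximum value, a base-R subset is also enough; a sketch with a small made-up data frame mimicking the AB and average columns from the question:

```r
df <- data.frame(AB = c(716, 705, 90, 701),
                 average = c(0.296, NaN, 0.500, 0.272))

# Maximum batting average among players with more than 100 at-bats;
# na.rm = TRUE drops both NA and NaN entries
best <- max(df$average[df$AB > 100], na.rm = TRUE)
```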

How can I select the 10 largest values from three different columns and save them in a new data frame in R?

Var1 <- 90:115
Var2 <- 1:26
Var3 <- 52:27
data <- data.frame(Var1, Var2, Var3)
I want to select the 10 largest values from each column and save them in a new data frame. I know that in my example the new data frame will contain 20 rows, but I don't understand the correct workflow.
That's what I'm expecting:
Var1 Var2 Var3
90 1 52
91 2 51
92 3 50
93 4 49
94 5 48
95 6 47
96 7 46
97 8 45
98 9 44
99 10 43
106 17 36
107 18 35
108 19 34
109 20 33
110 21 32
111 22 31
112 23 30
113 24 29
114 25 28
115 26 27
I can solve my problem for three columns with this approach:
df <- subset(data, Var1 >= 106 | Var2 >= 17 | Var3 >= 43)
but if I have to do that for 50+ columns, it's not really the best solution.
This can be done by looping over the columns with lapply, sorting them, and taking the first 10 values with head:
data.frame(lapply(data, function(x) head(sort(x, decreasing = TRUE), 10)))
If we need the first 10 rows, just use
head(data, 10)
Update
Based on the OP's edit
data[sort(Reduce(union, lapply(data, function(x)
  order(x, decreasing = TRUE)[1:10]))), ]
I think this is what you want:
data[sort(unique(c(sapply(data, order, decreasing = TRUE)[1:10, ]))), ]
Basically, index the top 10 elements from each column, merge them, remove duplicates, reorder, and extract the corresponding rows from the original data.
A direct answer to your question:
nv1 <- sort(Var1, decreasing = TRUE)[1:10]
nv2 <- sort(Var2, decreasing = TRUE)[1:10]
nv3 <- sort(Var3, decreasing = TRUE)[1:10]
nd <- data.frame(nv1, nv2, nv3)
But why would you want to do such a thing? You're breaking the pairing of the rows: in the original data, Var3 is decreasing while the other columns are increasing. Perhaps you want a list rather than a data frame?
This might help:
thresh <- sapply(data, sort, decreasing = TRUE)[10, ]
data[!!rowSums(sapply(1:ncol(data), function(x) data[, x] >= thresh[x])), ]
First, a vector thresh is defined, which contains the tenth-largest value of each column. Then we loop over the columns to check whether any of the values is larger than or equal to the corresponding threshold. The !! (a double negation, equivalent to as.logical()) combined with rowSums selects those rows where at least one value is at or above the threshold. In your example this yields the output:
# Var1 Var2 Var3
#1 90 1 52
#2 91 2 51
#3 92 3 50
#4 93 4 49
#5 94 5 48
#6 95 6 47
#7 96 7 46
#8 97 8 45
#9 98 9 44
#10 99 10 43
#17 106 17 36
#18 107 18 35
#19 108 19 34
#20 109 20 33
#21 110 21 32
#22 111 22 31
#23 112 23 30
#24 113 24 29
#25 114 25 28
#26 115 26 27
Which is equal to the output that you obtain with the command you posted:
#> identical(data[!!rowSums(sapply(1:ncol(data),function(x) data[,x]>=thresh[x])),], subset(data, Var1 >=106 | Var2 >=17 | Var3 >=43))
[1] TRUE
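With newer dplyr you can also express the row-selection version (keep a row if any column is among its own top 10) without computing thresholds by hand, using if_any(); a sketch on the example data:

```r
library(dplyr)

data <- data.frame(Var1 = 90:115, Var2 = 1:26, Var3 = 52:27)

# Keep rows where at least one column is at or above that column's
# 10th-largest value
top10 <- data %>%
  filter(if_any(everything(), ~ .x >= sort(.x, decreasing = TRUE)[10]))
```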
