Writing a function to help merge two different datasets - r

I am attempting to merge two different datasets: nflfastrpbp and routes.merging.
While both datasets have identifying factors for each game:
nflfastrpbp = game_id, old_game_id
routes.merging = GameID
... these identifiers do not match across the two.
Here is a look at the nflfastrpbp data:
# A tibble: 48,514 x 8
game_id old_game_id week home_team away_team game_date pass_defense_1_player_id pass_defense_1_player_name
<chr> <chr> <int> <chr> <chr> <chr> <chr> <chr>
1 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
2 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
3 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
4 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
5 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
6 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
7 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
8 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
9 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
10 2020_01_ARI_SF 2020091311 1 SF ARI 2020-09-13 NA NA
And here is a look at the routes.merging data:
# A tibble: 80,676 x 6
EventID GameID Season Week OffensiveTeam DefensiveTeam
<int> <int> <int> <int> <chr> <chr>
1 15 2793 2020 1 Texans Chiefs
2 15 2793 2020 1 Texans Chiefs
3 15 2793 2020 1 Texans Chiefs
4 15 2793 2020 1 Texans Chiefs
5 15 2793 2020 1 Texans Chiefs
6 25 2793 2020 1 Texans Chiefs
7 25 2793 2020 1 Texans Chiefs
8 25 2793 2020 1 Texans Chiefs
9 25 2793 2020 1 Texans Chiefs
10 45 2793 2020 1 Chiefs Texans
# ... with 80,666 more rows
What I am trying to do: get the game_id from the nflfastrpbp data onto the routes.merging data, matched up with the correct games, so that I can merge the two together (specifically, to pull the pass_defense player information from nflfastrpbp into routes.merging).
I've been trying to write a function but cannot figure it out.
If it helps, here is a reprex for each dataset (I have included the 2020_01_ARI_SF game from both to help with matching).
nflfastrpbp reprex:
structure(list(game_id = c("2020_01_ARI_SF", "2020_01_ARI_SF",
"2020_01_ARI_SF", "2020_01_ARI_SF", "2020_01_ARI_SF"), old_game_id = c("2020091311",
"2020091311", "2020091311", "2020091311", "2020091311"), week = c(1L,
1L, 1L, 1L, 1L), home_team = c("SF", "SF", "SF", "SF", "SF"),
away_team = c("ARI", "ARI", "ARI", "ARI", "ARI"), game_date = c("2020-09-13",
"2020-09-13", "2020-09-13", "2020-09-13", "2020-09-13"),
pass_defense_1_player_id = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_), pass_defense_1_player_name = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))
routes.merging reprex:
structure(list(EventID = c(30L, 30L, 30L, 30L, 45L), GameID = c(2805L,
2805L, 2805L, 2805L, 2805L), Season = c(2020L, 2020L, 2020L,
2020L, 2020L), Week = c(1L, 1L, 1L, 1L, 1L), OffensiveTeam = c("49ers",
"49ers", "49ers", "49ers", "Cardinals"), DefensiveTeam = c("Cardinals",
"Cardinals", "Cardinals", "Cardinals", "49ers")), row.names = c(NA,
-5L), groups = structure(list(EventID = c(30L, 45L), GameID = c(2805L,
2805L), .rows = structure(list(1:4, 5L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = 1:2, class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
I hope all that made sense.
EDIT - Expected Outcome: The expected outcome is the routes.merging DF with a new column, id.for.merging, that is the game_id from the nflfastrpbp DF ... again, matched up correctly by game.
EventID GameID Season Week OffensiveTeam DefensiveTeam id.for.merging
<int> <int> <int> <int> <chr> <chr> <chr>
1 15 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
2 15 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
3 15 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
4 15 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
5 15 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
6 25 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
7 25 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
8 25 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
9 25 2793 2020 1 Texans Chiefs 2020_01_HOU_KC
10 45 2793 2020 1 Chiefs Texans 2020_01_HOU_KC
EDIT #2: The game_id from nflfastrpbp and the GameID from routes.merging DO NOT match. That is why I am here for help. As seen in the Expected Outcome above, I need the game_id from nflfastrpbp on the routes.merging data so that I can merge all the data from nflfastrpbp onto the routes.merging DF.
I started to write a function that used paste and got as far as 2020_0 but couldn't figure out how to grab the week (which is 01 in the above example ... but will go all the way to 17 with the full data) and then the away_team followed by the home_team ... so 2020_01_HOU_KC
EDIT #3: There is no column to match on.
I am trying to CREATE that column by recreating the game_id column from nflfastrpbp within the routes.merging DF so that I can merge the two together on that newly created column.
So, I started to write this function:
testing <- function(x) {
  add.column.nflfastrpbp.to.routes.merged <- paste("2020_0")
}
routes.merging$id.for.merging <- testing()
And, in the id.for.merging column in routes.merging you can see it is working:
EventID GameID Season Week OffensiveTeam DefensiveTeam id.for.merging
<int> <int> <int> <int> <chr> <chr> <chr>
1 15 2793 2020 1 Texans Chiefs 2020_0
2 15 2793 2020 1 Texans Chiefs 2020_0
3 15 2793 2020 1 Texans Chiefs 2020_0
4 15 2793 2020 1 Texans Chiefs 2020_0
5 15 2793 2020 1 Texans Chiefs 2020_0
6 25 2793 2020 1 Texans Chiefs 2020_0
7 25 2793 2020 1 Texans Chiefs 2020_0
8 25 2793 2020 1 Texans Chiefs 2020_0
9 25 2793 2020 1 Texans Chiefs 2020_0
10 45 2793 2020 1 Chiefs Texans 2020_0
# ... with 80,666 more rows
What I cannot figure out is how to finish writing that function to take all the information and correctly match the game_id from nflfastrpbp for all the unique games.
So, taking:
testing <- function(x) {
  add.column.nflfastrpbp.to.routes.merged <- paste("2020_0")
}
... and finishing it so that it outputs:
2020_01_ARI_SF
or
2020_07_GB_HOU
into the newly created id.for.merging column.
To be clear:
2020 = year (not included in the data)
01 & 07 = week (included)
GB_HOU = away_team, home_team

You don't need to write a separate function to create a new column. But if you do want to, you can do this:
testing <- function(df) {
  library(dplyr)
  with(df, # assumes `df` is a data structure like `routes.merging`
       paste0(
         "2020_",
         sprintf("%02d", Week),
         "_",
         case_when( # away_team == "team_name" ~ "city"
           OffensiveTeam == "Texans" ~ "HOU",
           OffensiveTeam == "Chiefs" ~ "KC",
           OffensiveTeam == "Cardinals" ~ "ARI",
           OffensiveTeam == "49ers" ~ "SF",
           # etc.
           OffensiveTeam == "Packers" ~ "GB"
         ),
         "_",
         case_when( # home_team == "team_name" ~ "city"
           DefensiveTeam == "Texans" ~ "HOU",
           DefensiveTeam == "Chiefs" ~ "KC",
           DefensiveTeam == "Cardinals" ~ "ARI",
           DefensiveTeam == "49ers" ~ "SF",
           # etc.
           DefensiveTeam == "Packers" ~ "GB"
         )
       )
  )
}
routes.merging$id.for.merging <- testing(routes.merging)
Otherwise, you can add the column directly like this:
library(dplyr)
routes.merging <- mutate(routes.merging,
  id.for.merging = paste0(
    "2020_",
    sprintf("%02d", Week),
    "_",
    case_when( # away_team == "team_name" ~ "city"
      OffensiveTeam == "Texans" ~ "HOU",
      OffensiveTeam == "Chiefs" ~ "KC",
      OffensiveTeam == "Cardinals" ~ "ARI",
      OffensiveTeam == "49ers" ~ "SF",
      # etc.
      OffensiveTeam == "Packers" ~ "GB"
    ),
    "_",
    case_when( # home_team == "team_name" ~ "city"
      DefensiveTeam == "Texans" ~ "HOU",
      DefensiveTeam == "Chiefs" ~ "KC",
      DefensiveTeam == "Cardinals" ~ "ARI",
      DefensiveTeam == "49ers" ~ "SF",
      # etc.
      DefensiveTeam == "Packers" ~ "GB"
    )
  )
)
sprintf("%02d", Week) makes any single digit (e.g., "1") into double digits (e.g., "01"), and double digits stay double digits.
case_when() is a function in dplyr R package. The function allows you to vectorize multiple ifelse() statements. You will need to add more lines in case_when() for a complete list of the NFL teams, of course.
The output from using your reprex data structure looks like this:
# A tibble: 5 x 7
# Groups: EventID, GameID [2]
EventID GameID Season Week OffensiveTeam DefensiveTeam id.for.merging
<int> <int> <int> <int> <chr> <chr> <chr>
1 30 2805 2020 1 49ers Cardinals 2020_01_SF_ARI
2 30 2805 2020 1 49ers Cardinals 2020_01_SF_ARI
3 30 2805 2020 1 49ers Cardinals 2020_01_SF_ARI
4 30 2805 2020 1 49ers Cardinals 2020_01_SF_ARI
5 45 2805 2020 1 Cardinals 49ers 2020_01_ARI_SF
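As an aside: if typing out all 32 teams inside case_when() gets unwieldy, a named lookup vector does the same mapping with less repetition. This is only a sketch under the assumption that you fill in the remaining teams yourself; it also uses the Season column already in routes.merging rather than a hard-coded "2020":
library(dplyr)
# Hypothetical nickname -> abbreviation table; extend to all 32 teams.
team_abbr <- c(Texans = "HOU", Chiefs = "KC", Cardinals = "ARI",
               `49ers` = "SF", Packers = "GB")
routes.merging <- mutate(routes.merging,
  id.for.merging = paste0(Season, "_",
                          sprintf("%02d", Week), "_",
                          unname(team_abbr[OffensiveTeam]), "_",
                          unname(team_abbr[DefensiveTeam])))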
Finally, merging:
merged_data <- full_join(routes.merging, nflfastrpbp, by = c("id.for.merging" = "game_id"))
Run ?dplyr::join or ?merge to learn more about some other merge functions and options.
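If you only want the play-by-play columns attached to the rows that already exist in routes.merging (rather than also keeping unmatched nflfastrpbp rows), a left join may be the safer default; a quick sketch:
merged_data <- left_join(routes.merging, nflfastrpbp,
                         by = c("id.for.merging" = "game_id"))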

Related

Aggregate dataframe by condition in R

I have the following DataFrame in R:
Y ... Price Year Quantity Country
010190 ... 4781 2021 4 Germany
010190 ... 367 2021 3 Germany
010190 ... 4781 2021 6 France
010190 ... 250 2021 3 France
020190 ... 690 2021 NA USA
020190 ... 10 2021 6 USA
...... ... .... .. ...
217834 ... 56 2021 3 USA
217834 ... 567 2021 9 USA
As you can see, the numbers in the Y column start with 01..., 02..., 21.... I want to aggregate such rows from 6 digits down to 2 digits, grouping by the categorical columns (e.g. Country and Year) and summing numerical columns like Quantity and Price. I also want to take rows with NAs into account during the calculation. So, in the end I want this kind of output:
Y Price Year Quantity Country
01 5148 2021 7 Germany
01 5031 2021 9 France
02 700 2021 6 USA
.. .... ... .... ...
21 623 2021 12 USA
You can use group_by() and summarize() from dplyr:
library(dplyr)
df %>%
  mutate(Y = sprintf(as.numeric(factor(Y, unique(Y))), fmt = '%02d')) %>%
  group_by(Y, Year, Country) %>%
  summarize(across(where(is.numeric), sum))
#> # A tibble: 4 x 5
#> # Groups: Y, Year [3]
#> Y Year Country Price Quantity
#> <chr> <int> <chr> <int> <int>
#> 1 01 2021 France 5031 9
#> 2 01 2021 Germany 5148 7
#> 3 02 2021 USA 700 NA
#> 4 03 2021 USA 623 12
Update, per request (keeping NAs out of the sums):
library(dplyr)
df %>%
  mutate(Y = substr(Y, 1, 2)) %>%
  group_by(Y, Year, Country) %>%
  summarise(across(c(Price, Quantity), ~ sum(., na.rm = TRUE)))
We could use substr() to get the first two characters of Y, then group_by() and summarise() with sum():
library(dplyr)
df %>%
  mutate(Y = substr(Y, 1, 2)) %>%
  group_by(Y, Year, Country) %>%
  summarise(Price = sum(Price, na.rm = TRUE),
            Quantity = sum(Quantity, na.rm = TRUE))
Y Year Country Price Quantity
<chr> <dbl> <chr> <dbl> <dbl>
1 01 2021 France 5031 9
2 01 2021 Germany 5148 7
3 02 2021 USA 700 6
4 21 2021 USA 623 12
Using aggregate and the substring of Y.
aggregate(cbind(Quantity, Price) ~ Y + Year + Country,
          transform(dat, Y = substr(Y, 1, 2)), sum)
# Y Year Country Quantity Price
# 1 10 2021 France 9 5031
# 2 10 2021 Germany 7 5148
# 3 20 2021 USA 7 700
# 4 21 2021 USA 12 623
Data:
dat <- structure(list(Y = c(10190L, 10190L, 10190L, 10190L, 20190L,
20190L, 217834L, 217834L), foo = c("...", "...", "...", "...",
"...", "...", "...", "..."), Price = c(4781L, 367L, 4781L, 250L,
690L, 10L, 56L, 567L), Year = c(2021L, 2021L, 2021L, 2021L, 2021L,
2021L, 2021L, 2021L), model = c(NA, NA, NA, NA, NA, NA, "Tesla",
"Tesla"), Quantity = c(4L, 3L, 6L, 3L, 1L, 6L, 3L, 9L), Country = c("Germany",
"Germany", "France", "France", "USA", "USA", "USA", "USA")), class = "data.frame", row.names = c(NA,
-8L))
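One caveat with the formula interface: aggregate() silently drops rows containing NA before summing. If you instead want those rows kept (as in the dplyr answers above), a sketch using na.action = na.pass together with an na.rm-aware function should work:
aggregate(cbind(Quantity, Price) ~ Y + Year + Country,
          transform(dat, Y = substr(Y, 1, 2)),
          FUN = function(x) sum(x, na.rm = TRUE),
          na.action = na.pass)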

Summarise across each column by grouping their names

I want to calculate the weighted variance using the weights provided in the dataset, while grouping by country and capital city; however, the function returns NAs:
library(Hmisc) # for the 'wtd.var' function
weather_winter.std <- weather_winter %>%
  group_by(country, capital_city) %>%
  summarise(across(starts_with("winter"), wtd.var))
The provided output from the console (when in long format):
# A tibble: 35 x 3
# Groups: country [35]
country capital_city winter
<chr> <chr> <dbl>
1 ALBANIA Tirane NA
2 AUSTRIA Vienna NA
3 BELGIUM Brussels NA
4 BULGARIA Sofia NA
5 CROATIA Zagreb NA
6 CYPRUS Nicosia NA
7 CZECHIA Prague NA
8 DENMARK Copenhagen NA
9 ESTONIA Tallinn NA
10 FINLAND Helsinki NA
# … with 25 more rows
This is the code that I used to get the data from a wide format into a long format:
weather_winter <- weather_winter %>% pivot_longer(-c(31:33))
weather_winter$name <- NULL
names(weather_winter)[4] <- "winter"
Some example data:
structure(list(`dec-wet_2011` = c(12.6199998855591, 12.6099996566772,
14.75, 11.6899995803833, 18.2899990081787), `dec-wet_2012` = c(13.6300001144409,
14.2199993133545, 14.2299995422363, 16.1000003814697, 18.0299987792969
), `dec-wet_2013` = c(4.67999982833862, 5.17000007629395, 4.86999988555908,
7.56999969482422, 5.96000003814697), `dec-wet_2014` = c(14.2999992370605,
14.4799995422363, 13.9799995422363, 15.1499996185303, 16.1599998474121
), `dec-wet_2015` = c(0.429999977350235, 0.329999983310699, 1.92999994754791,
3.30999994277954, 7.42999982833862), `dec-wet_2016` = c(1.75,
1.29999995231628, 3.25999999046326, 6.60999965667725, 8.67999935150146
), `dec-wet_2017` = c(13.3400001525879, 13.3499994277954, 15.960000038147,
10.6599998474121, 14.4699993133545), `dec-wet_2018` = c(12.210000038147,
12.4399995803833, 11.1799993515015, 10.75, 18.6299991607666),
`dec-wet_2019` = c(12.7199993133545, 13.3800001144409, 13.9899997711182,
10.5299997329712, 12.3099994659424), `dec-wet_2020` = c(15.539999961853,
16.5200004577637, 11.1799993515015, 14.7299995422363, 13.5499992370605
), `jan-wet_2011` = c(8.01999950408936, 7.83999967575073,
10.2199993133545, 13.8899993896484, 14.5299997329712), `jan-wet_2012` = c(11.5999994277954,
11.1300001144409, 12.5500001907349, 10.1700000762939, 22.6199989318848
), `jan-wet_2013` = c(17.5, 17.4099998474121, 15.5599994659424,
13.3199996948242, 20.9099998474121), `jan-wet_2014` = c(12.5099992752075,
12.2299995422363, 15.210000038147, 9.73999977111816, 9.63000011444092
), `jan-wet_2015` = c(17.6900005340576, 16.9799995422363,
11.75, 9.9399995803833, 19), `jan-wet_2016` = c(15.6099996566772,
15.5, 14.5099992752075, 10.3899993896484, 18.4499988555908
), `jan-wet_2017` = c(9.17000007629395, 9.61999988555908,
9.30999946594238, 15.8499994277954, 11.210000038147), `jan-wet_2018` = c(8.55999946594238,
9.10999965667725, 13.2599992752075, 9.85999965667725, 15.8899993896484
), `jan-wet_2019` = c(17.0699996948242, 16.8699989318848,
14.5699996948242, 19.0100002288818, 19.4699993133545), `jan-wet_2020` = c(6.75999975204468,
6.25999975204468, 6.00999975204468, 5.35999965667725, 8.15999984741211
), `feb-wet_2011` = c(9.1899995803833, 8.63999938964844,
6.21999979019165, 9.82999992370605, 4.67999982833862), `feb-wet_2012` = c(12.2699995040894,
11.6899995803833, 8.27999973297119, 14.9399995803833, 13.0499992370605
), `feb-wet_2013` = c(15.3599996566772, 15.9099998474121,
17.0599994659424, 13.3599996566772, 16.75), `feb-wet_2014` = c(10.1999998092651,
11.1399993896484, 13.8599996566772, 10.7399997711182, 7.35999965667725
), `feb-wet_2015` = c(11.9200000762939, 12.2699995040894,
8.01000022888184, 14.5299997329712, 5.71999979019165), `feb-wet_2016` = c(14.6999998092651,
14.7799997329712, 16.7899990081787, 4.90000009536743, 19.3500003814697
), `feb-wet_2017` = c(8.98999977111816, 9.17999935150146,
11.7699995040894, 6.3899998664856, 13.9899997711182), `feb-wet_2018` = c(16.75,
16.8599987030029, 12.0599994659424, 16.1900005340576, 8.51000022888184
), `feb-wet_2019` = c(7.58999967575073, 7.26999998092651,
8.21000003814697, 7.57999992370605, 8.81999969482422), `feb-wet_2020` = c(10.6399993896484,
10.4399995803833, 13.4399995803833, 8.53999996185303, 19.939998626709
), country = c("SERBIA", "SERBIA", "SLOVENIA", "GREECE",
"CZECHIA"), capital_city = c("Belgrade", "Belgrade", "Ljubljana",
"Athens", "Prague"), weight = c(20.25, 19.75, 14.25, 23.75,
14.25)), row.names = c(76L, 75L, 83L, 16L, 5L), class = "data.frame")
Your code seems to produce the right answer now that there's more data:
# Groups: country [4]
country capital_city winter
<chr> <chr> <dbl>
1 CZECHIA Prague 27.2
2 GREECE Athens 14.6
3 SERBIA Belgrade 19.1
4 SLOVENIA Ljubljana 16.3
Is this what you were looking for?
I took the liberty of streamlining your code:
weather_winter <- weather_winter %>%
  pivot_longer(-c(31:33), values_to = "winter") %>%
  select(-name)
weather_winter.std <- weather_winter %>%
  group_by(country, capital_city) %>%
  summarise(winter = wtd.var(winter))
With only one "winter" column, there's no need for the across().
Finally, you are not using the weights. If these are needed, then change the last line to:
summarise(winter = wtd.var(winter, weights = weight))
To give:
# A tibble: 4 x 3
# Groups: country [4]
country capital_city winter
<chr> <chr> <dbl>
1 CZECHIA Prague 26.3
2 GREECE Athens 14.2
3 SERBIA Belgrade 18.8
4 SLOVENIA Ljubljana 15.8
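If you ever want to sanity-check wtd.var(), its default "unbiased" estimate can (if I recall its normalisation correctly) be reproduced by hand; a minimal sketch:
# Weighted variance: sum(w * (x - m)^2) / (sum(w) - 1),
# where m is the weighted mean
wvar <- function(x, w) {
  m <- sum(w * x) / sum(w)
  sum(w * (x - m)^2) / (sum(w) - 1)
}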

Creating a new column using scores from past years (which is in the same dataframe)

I'm sorry if this question has already been answered, but I don't really know how to phrase my question.
I have a data frame structured in this way:
country   year   score
France    2020   10
France    2019   9
Germany   2020   15
Germany   2019   14
I would like to have a new column called previous_year_score that looks up, for each country, the "score" for "year - 1". In this case France 2020 would have a previous_year_score of 9, while France 2019 would have an NA.
You can use match() for this. I imagine there are plenty of other solutions too.
Data:
df <- structure(list(country = c("France", "France", "Germany", "Germany"
), year = c(2020L, 2019L, 2020L, 2019L), score = c(10L, 9L, 15L,
14L), prev_score = c(9L, NA, 14L, NA)), row.names = c(NA, -4L
), class = "data.frame")
Solution:
i <- match(paste(df[[1]], df[[2]] - 1), paste(df[[1]], df[[2]]))
df$prev_score <- df[i, 3]
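To see why this works, compare the two key vectors being matched: for each row, match() searches for that row's "country year-1" key among the actual "country year" keys and returns the index of the previous year's row (or NA when there is none):
paste(df[[1]], df[[2]] - 1)
# [1] "France 2019"  "France 2018"  "Germany 2019" "Germany 2018"
match(paste(df[[1]], df[[2]] - 1), paste(df[[1]], df[[2]]))
# [1]  2 NA  4 NA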
You can use the following solution:
library(dplyr)
df %>%
  group_by(country) %>%
  arrange(year) %>%
  mutate(prev_val = ifelse(year - lag(year) == 1, lag(score), NA))
# A tibble: 4 x 4
# Groups: country [2]
country year score prev_val
<chr> <int> <int> <int>
1 France 2019 9 NA
2 Germany 2019 14 NA
3 France 2020 10 9
4 Germany 2020 15 14
Using case_when
library(dplyr)
df1 %>%
  arrange(country, year) %>%
  group_by(country) %>%
  mutate(prev_val = case_when(year - lag(year) == 1 ~ lag(score)))
# A tibble: 4 x 4
# Groups: country [2]
country year score prev_val
<chr> <int> <int> <int>
1 France 2019 9 NA
2 France 2020 10 9
3 Germany 2019 14 NA
4 Germany 2020 15 14
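For completeness, the same lag logic translates to data.table; a sketch (my translation, not from the answers above):
library(data.table)
setDT(df)
setorder(df, country, year)
# shift() lags within each country; fifelse() keeps the lagged score only
# when the previous row is exactly one year earlier
df[, prev_score := fifelse(year - shift(year) == 1L, shift(score), NA_integer_),
   by = country]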

Merge two dataframes: specifically merge a selection of columns based on two conditions?

I have two datasets on the same 2 patients. With the second dataset I want to add new information to the first, but I can't seem to get the code right.
My first (incomplete) dataset has a patient ID, measurement time (either T0 or FU1), year of birth, date of the CT scan, and two outcomes (legs_mass and total_mass):
library(tidyverse)
library(dplyr)
library(magrittr)
library(lubridate)
df1 <- structure(list(ID = c(115, 115, 370, 370), time = structure(c(1L,
6L, 1L, 6L), .Label = c("T0", "T1M0", "T1M6", "T1M12", "T2M0",
"FU1"), class = "factor"), year_of_birth = c(1970, 1970, 1961,
1961), date_ct = structure(c(16651, 17842, 16651, 18535), class = "Date"),
legs_mass = c(9.1, NA, NA, NA), total_mass = c(14.5, NA,
NA, NA)), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"))
# Which gives the following dataframe
df1
# A tibble: 4 x 6
ID time year_of_birth date_ct legs_mass total_mass
<dbl> <fct> <dbl> <date> <dbl> <dbl>
1 115 T0 1970 2015-08-04 9.1 14.5
2 115 FU1 1970 2018-11-07 NA NA
3 370 T0 1961 2015-08-04 NA NA
4 370 FU1 1961 2020-09-30 NA NA
The second dataset adds to the legs_mass and total_mass columns:
df2 <- structure(list(ID = c(115, 370), date_ct = structure(c(17842,
18535), class = "Date"), ctscan_label = c("PXE115_CT_20181107_xxxxx-3.tif",
"PXE370_CT_20200930_xxxxx-403.tif"), legs_mass = c(956.1, 21.3
), total_mass = c(1015.9, 21.3)), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"))
# Which gives the following dataframe:
df2
# A tibble: 2 x 5
ID date_ct ctscan_label legs_mass total_mass
<dbl> <date> <chr> <dbl> <dbl>
1 115 2018-11-07 PXE115_CT_20181107_xxxxx-3.tif 956. 1016.
2 370 2020-09-30 PXE370_CT_20200930_xxxxx-403.tif 21.3 21.3
What I am trying to do is:
Add the legs_mass and total_mass column values from df2 to df1, based on ID number and date_ct.
Add the new columns of df2 (the one that is not in df1; ctscan_label) to df1, also based on the date of the ct and patient ID.
So that the final dataset df3 looks as follows:
df3 <- structure(list(ID = c(115, 115, 370, 370), time = structure(c(1L,
6L, 1L, 6L), .Label = c("T0", "T1M0", "T1M6", "T1M12", "T2M0",
"FU1"), class = "factor"), year_of_birth = c(1970, 1970, 1961,
1961), date_ct = structure(c(16651, 17842, 16651, 18535), class = "Date"),
legs_mass = c(9.1, 956.1, NA, 21.3), total_mass = c(14.5,
1015.9, NA, 21.3)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
# Corresponding to the following tibble:
# A tibble: 4 x 6
ID time year_of_birth date_ct legs_mass total_mass
<dbl> <fct> <dbl> <date> <dbl> <dbl>
1 115 T0 1970 2015-08-04 9.1 14.5
2 115 FU1 1970 2018-11-07 956. 1016.
3 370 T0 1961 2015-08-04 NA NA
4 370 FU1 1961 2020-09-30 21.3 21.3
I have tried the merge function and rbind from base R, and bind_rows from dplyr, but can't seem to get it right.
Any help?
You can join the two datasets and use coalesce to keep the first non-NA value from the two datasets.
library(dplyr)
left_join(df1, df2, by = c("ID", "date_ct")) %>%
mutate(leg_mass = coalesce(legs_mass.x , legs_mass.y),
total_mass = coalesce(total_mass.x, total_mass.y)) %>%
select(-matches('\\.x|\\.y'), -ctscan_label)
# ID time year_of_birth date_ct legs_mass total_mass
# <dbl> <fct> <dbl> <date> <dbl> <dbl>
#1 115 T0 1970 2015-08-04 9.1 14.5
#2 115 FU1 1970 2018-11-07 956. 1016.
#3 370 T0 1961 2015-08-04 NA NA
#4 370 FU1 1961 2020-09-30 21.3 21.3
We can use data.table methods
library(data.table)
setDT(df1)[setDT(df2), c("legs_mass", "total_mass") :=
             .(fcoalesce(legs_mass, i.legs_mass),
               fcoalesce(total_mass, i.total_mass)),
           on = .(ID, date_ct)]
Output:
df1
ID time year_of_birth date_ct legs_mass total_mass
1: 115 T0 1970 2015-08-04 9.1 14.5
2: 115 FU1 1970 2018-11-07 956.1 1015.9
3: 370 T0 1961 2015-08-04 NA NA
4: 370 FU1 1961 2020-09-30 21.3 21.3
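Since the goal here is precisely to fill in NA cells of df1, dplyr (1.0 and later) also offers rows_patch(), which overwrites only the missing values, matched by key. A sketch (ctscan_label is dropped first, because rows_patch() only accepts columns that already exist in x):
library(dplyr)
df3 <- rows_patch(df1,
                  select(df2, ID, date_ct, legs_mass, total_mass),
                  by = c("ID", "date_ct"))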

How to calculate a day difference like a SQL window function in R

Input: (the input table is provided as a dput under "Source code" below)
Aim:
Create a new column named 'dayDifference' with the following rule: for each 'item-city' pair, calculate the day difference for that pair.
Desired output:
Rows 1 and 2 [pair Pizza-Berlin] correspond to 2 because there are 2 days between 2 Feb and 4 Feb
Row 3 [pair Pizza-Hamburg] corresponds to 0 because there is no day difference
Rows 4 and 5 [pair Pasta-Hamburg] correspond to 10 because there are 10 days from the 10th to the 20th
Row 6 [pair Pasta-Berlin] corresponds to 0 because there is no day difference
Info: Of course a pair can have more than 2 rows. For instance, the pair 'Pizza-Berlin' could have 100 rows; if so, always take the max(date) and subtract the min(date) for that pair.
Constraint: it needs to be done in R (i.e. no outside connection to a database).
Source code:
df <- structure(list(id = c(4848L, 4887L, 4899L, 4811L, 4834L, 4892L
), item = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Pasta",
"Pizza"), class = "factor"), city = structure(c(1L, 1L, 2L, 2L,
2L, 1L), .Label = c("Berlin", "Hamburg"), class = "factor"),
date = structure(c(17199, 17201, -643892, 17449, 17459, 17515
), class = "Date")), .Names = c("id", "item", "city", "date"
), row.names = c(NA, -6L), class = "data.frame")
I would do it using data.table:
library(data.table)
setDT(df)
df[, min_date := min(date), by = c("item", "city")]
df[, max_date := max(date), by = c("item", "city")]
df[, dayDifference := difftime(max_date, min_date, units = "days")]
df[, c("min_date", "max_date") := NULL]
It'll give you desired output:
id item city date dayDifference
1: 4848 Pizza Berlin 2017-02-02 2 days
2: 4887 Pizza Berlin 2017-02-04 2 days
3: 4899 Pizza Hamburg 0207-02-01 0 days
4: 4811 Pasta Hamburg 2017-10-10 10 days
5: 4834 Pasta Hamburg 2017-10-20 10 days
6: 4892 Pasta Berlin 2017-12-15 0 days
You can also use df[, dayDifference := max_date - min_date] instead of df[, dayDifference := difftime(max_date, min_date, units = "days")].
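The same thing can also be done in a single step, computing the per-group range on the fly; a sketch:
library(data.table)
# one grouped update instead of three separate steps
setDT(df)[, dayDifference := difftime(max(date), min(date), units = "days"),
          by = .(item, city)]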
Reduce is an awesome function
library(dplyr)
df %>%
  group_by(item, city) %>%
  mutate(dayDifference = abs(Reduce(`-`, as.numeric(range(date)))))
# A tibble: 6 x 5
# Groups: item, city [4]
id item city date dayDifference
<int> <fctr> <fctr> <date> <dbl>
1 4848 Pizza Berlin 2017-02-02 2
2 4887 Pizza Berlin 2017-02-04 2
3 4899 Pizza Hamburg 0207-02-01 0
4 4811 Pasta Hamburg 2017-10-10 10
5 4834 Pasta Hamburg 2017-10-20 10
6 4892 Pasta Berlin 2017-12-15 0
Not pretty, but...
i <- unique(lapply(1:nrow(df), function(x)
  which(paste(df[, 2], df[, 3]) %in% paste(df[x, 2], df[x, 3]))))
for (j in 1:length(i))
  df[i[[j]], "days"] <- abs(difftime(df[i[[j]], ][1, "date"],
                                     df[i[[j]], ][2, "date"]))
> df
id item city date days
1 4848 Pizza Berlin 2017-02-02 2
2 4887 Pizza Berlin 2017-02-04 2
3 4899 Pizza Hamburg 0207-02-01 NA
4 4811 Pasta Hamburg 2017-10-10 10
5 4834 Pasta Hamburg 2017-10-20 10
6 4892 Pasta Berlin 2017-12-15 NA
