Merge two data frames by date and week using R

I have two data frames. One has a year_month_week column and the other has a date column. I made two simplified data frames for demonstration purposes.
df1<-data.frame(id=c(1,1,1,2,2,2,2),
year_month_week=c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
points=c(65,58,47,21,25,27,43))
df2<-data.frame(id=c(1,1,1,2,2,2),
date=c(20220503,20220506,20220512,20220401,20220408,20220409),
temperature=c(36.1,36.3,36.6,34.3,34.9,35.3))
For df1, 2022051 means the 1st week of May 2022. Likewise, 2022052 means the 2nd week of May 2022. For df2, 20220503 means May 3rd, 2022.
What I want to do now is merge df1 and df2 with respect to year_month_week. In this case, 20220503 and 20220506 both fall in the 1st week of May 2022. If more than one date falls in the same year_month_week, I will just keep the first of them. So my expected output is as follows:
df<-data.frame(id=c(1,1,2,2),
year_month_week=c(2022051,2022052,2022041,2022042),
points=c(65,58,21,25),
temperature=c(36.1,36.6,34.3,34.9))

One way of doing it is to extract the last two digits of the date column in df2 (the day of the month), divide them by 7, then round up. This gives the week number within the month (this part is in the mutate call).
Then group_by the year_month_week column, keep only the earliest record per year_month_week, and join with df1.
library(tidyverse)  # attaches dplyr and stringr
df <- df2 %>%
  # build yyyymm + week of month, where week = ceiling(day of month / 7)
  mutate(year_month_week =
           as.integer(
             paste0(str_extract(date, ".*(?=\\d\\d$)"),
                    ceiling(as.integer(str_extract(date, "\\d\\d$")) / 7))
           )) %>%
  group_by(year_month_week) %>%
  slice_min(date) %>%  # keep the earliest date within each week
  left_join(df1,
            by = c("year_month_week", "id")) %>%
  select(-date)
df
# A tibble: 4 × 4
# Groups: year_month_week [4]
id temperature year_month_week points
<dbl> <dbl> <dbl> <dbl>
1 2 34.3 2022041 21
2 2 34.9 2022042 25
3 1 36.1 2022051 65
4 1 36.6 2022052 58
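The week-of-month arithmetic can also be done without any string handling. Below is a minimal base R sketch, assuming the dates are stored as yyyymmdd numbers as in df2. The helper name to_year_month_week is hypothetical and not part of the answer above.

```r
# Hypothetical helper: converts a yyyymmdd number into the
# yyyymm + week-of-month code used in df1.
to_year_month_week <- function(date) {
  year_month <- date %/% 100           # drop the two day digits, e.g. 202205
  day <- date %% 100                   # day of month, e.g. 3
  year_month * 10 + ceiling(day / 7)   # append the week of month, e.g. 2022051
}

to_year_month_week(c(20220503, 20220506, 20220512,
                     20220401, 20220408, 20220409))
# returns 2022051 2022051 2022052 2022041 2022042 2022042
```

Note that under this rule days 29 to 31 end up in a 5th week.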

Related

Using dplyr - how can I create a new category for one column when another column has duplicates?

I have a dataframe of coordinates for different studies that have been conducted. The studies are either experiment or observation however at some locations both experiment AND observation occur. For these sites, I would like to create a new study category called both. How can I do this using dplyr?
Example Data
df1 <- data.frame(matrix(ncol = 4, nrow = 6))
colnames(df1)[1:4] <- c("value", "study", "lat","long")
df1$value <- c(1,1,2,3,4,4)
df1$study <- rep(c('experiment','observation'),3)
df1$lat <- c(37.541290,37.541290,38.936604,29.9511,51.509865,51.509865)
df1$long <- c(-77.434769,-77.434769,-119.986649,-90.0715,-0.118092,-0.118092)
df1
value study lat long
1 1 experiment 37.54129 -77.434769
2 1 observation 37.54129 -77.434769
3 2 experiment 38.93660 -119.986649
4 3 observation 29.95110 -90.071500
5 4 experiment 51.50986 -0.118092
6 4 observation 51.50986 -0.118092
Note that the value above is duplicated when study has experiment AND observation.
The ideal output would look like this
value study lat long
1 1 both 37.54129 -77.434769
2 2 experiment 38.93660 -119.986649
3 3 observation 29.95110 -90.071500
4 4 both 51.50986 -0.118092
We can set study to 'both' for those 'value' groups where both experiment and observation are present, and then keep the distinct rows:
library(dplyr)
df1 %>%
  group_by(value) %>%
  mutate(study = if (all(c("experiment", "observation") %in% study))
    "both" else study) %>%
  ungroup() %>%
  distinct()
Output:
# A tibble: 4 × 4
value study lat long
<dbl> <chr> <dbl> <dbl>
1 1 both 37.5 -77.4
2 2 experiment 38.9 -120.
3 3 observation 30.0 -90.1
4 4 both 51.5 -0.118
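For comparison, the same relabel-then-deduplicate idea can be sketched in base R with ave() and unique(). This is only an alternative illustration and it assumes the study column is a character vector (hence stringsAsFactors = FALSE); res is a name introduced here.

```r
# Example data from the question
df1 <- data.frame(
  value = c(1, 1, 2, 3, 4, 4),
  study = rep(c("experiment", "observation"), 3),
  lat = c(37.541290, 37.541290, 38.936604, 29.9511, 51.509865, 51.509865),
  long = c(-77.434769, -77.434769, -119.986649, -90.0715, -0.118092, -0.118092),
  stringsAsFactors = FALSE
)

# Relabel study within each value group when both study types are present
res <- df1
res$study <- ave(df1$study, df1$value, FUN = function(s) {
  if (all(c("experiment", "observation") %in% s)) rep("both", length(s)) else s
})

# Rows of a "both" group are now identical, so unique() keeps one of each
res <- unique(res)
res
```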

Identifying values from one database to use in another database

I am working on a project in which I need to work with 2 databases, identify values from one database to use in another.
I have a dataframe 1,
df1<-data.frame("ID"=c(1,2,3),"Condition A"=c("B","B","A"),"Condition B"=c("1","1","2"),"Year"=c(2002,1988,1995))
and a dataframe 2,
df2 <- data.frame("Condition A"=c("A","A","B","B"),"Condiction B"=c("1","2","1","2"),"<1990"=c(20,30,50,80),"1990-2000"=c(100,90,80,30),">2000"=c(300,200,800,400))
I would like to add a new column to df1 called "Value", in which, for each ID (from df1), collects the values from column 3,4 or 5 from df2 (depending on the year), and following conditions A and B available in both databases. The end result would be something like this:
df1<-data.frame("ID"=c(1,2,3),"Condition A"=c("B","B","A"),"Condition B"=c("1","1","2"),"Year"=c(2002,1988,1995),"Value"=c(800,50,90))
thanks!
I think we can simply left_join, then mutate with case_when, then drop the undesired columns with select:
library(dplyr)
left_join(df1, df2, by=c("Condition.A", "Condition.B"))%>%
mutate(Value=case_when(Year<1990 ~ X.1990,
Year<2000 ~ X1990.2000,
Year>=2000 ~ X.2000))%>%
select(-starts_with("X"))
ID Condition.A Condition.B Year Value
1 1 B 1 2002 800
2 2 B 1 1988 50
3 3 A 2 1995 90
EDIT: I edited your code, removing the "Condiction" typo
You could use
library(dplyr)
library(tidyr)
df2 %>%
  rename(Condition.B = Condiction.B) %>%
  pivot_longer(matches("\\d{4}")) %>%
  right_join(df1, by = c("Condition.A", "Condition.B")) %>%
  filter(name == case_when(
    Year < 1990 ~ "X.1990",
    Year > 2000 ~ "X.2000",
    TRUE ~ "X1990.2000")) %>%
  select(ID, Condition.A, Condition.B, Year, Value = value) %>%
  arrange(ID)
This returns
# A tibble: 3 x 5
ID Condition.A Condition.B Year Value
<dbl> <chr> <chr> <dbl> <dbl>
1 1 B 1 2002 800
2 2 B 1 1988 50
3 3 A 2 1995 90
First we rename the misspelled column Condiction.B of df2 and bring the data into a "long format" based on the "<1990", "1990-2000" and ">2000" columns. Note that data.frame() does not keep these names as written: because they are not syntactically valid, they are automatically converted to X.1990, X1990.2000 and X.2000.
Next we use a right join with df1 on the two Condition columns.
Finally we filter just the matching years based on a hard coded case_when function and do some clean up (selecting and arranging).
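The lookup itself can also be sketched in base R without reshaping: cut() assigns each Year to a period column and a two-column index matrix picks one cell per ID. This is only an illustration under a few assumptions: df2 is rebuilt here with the converted column names and the corrected Condition.B, the year 2000 is counted in the 1990-2000 period, and period, row_in_df2 and value_matrix are names introduced for this sketch.

```r
df1 <- data.frame(ID = c(1, 2, 3),
                  Condition.A = c("B", "B", "A"),
                  Condition.B = c("1", "1", "2"),
                  Year = c(2002, 1988, 1995))
df2 <- data.frame(Condition.A = c("A", "A", "B", "B"),
                  Condition.B = c("1", "2", "1", "2"),
                  X.1990 = c(20, 30, 50, 80),
                  X1990.2000 = c(100, 90, 80, 30),
                  X.2000 = c(300, 200, 800, 400))

# Map each Year to the name of the period column it belongs to
# (intervals are right-closed: <= 1989, 1990 to 2000, > 2000)
period <- as.character(cut(df1$Year,
                           breaks = c(-Inf, 1989, 2000, Inf),
                           labels = c("X.1990", "X1990.2000", "X.2000")))

# Find the row of df2 that matches both conditions for each ID
row_in_df2 <- match(paste(df1$Condition.A, df1$Condition.B),
                    paste(df2$Condition.A, df2$Condition.B))

# Pick one cell per ID from the numeric part of df2
value_matrix <- as.matrix(df2[, c("X.1990", "X1990.2000", "X.2000")])
df1$Value <- value_matrix[cbind(row_in_df2, match(period, colnames(value_matrix)))]
df1$Value
# returns 800 50 90
```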
We could do it this way:
1. Condiction must be a typo, so I changed it to Condition.
2. In df1, create a helper column that assigns each year to its group, named after the matching column in df2.
3. Bring df2 into long format.
4. Finally, apply left_join with by = c("Condition.A", "Condition.B", "helper" = "name").
library(dplyr)
library(tidyr)
df1 <- df1 %>%
mutate(helper = case_when(Year >=1990 & Year <=2000 ~"X1990.2000",
Year <1990 ~ "X.1990",
Year >2000 ~ "X.2000"))
df2 <- df2 %>%
pivot_longer(
cols=starts_with("X")
)
df3 <- left_join(df1, df2, by=c("Condition.A", "Condition.B", "helper"="name")) %>%
select(-helper)
ID Condition.A Condition.B Year value
1 1 B 1 2002 800
2 2 B 1 1988 50
3 3 A 2 1995 90

Number of days spent in each STATE in r

I'm trying to calculate the number of days that a patient spent during a given state in R.
The image of an example data is included below. I only have columns 1 to 3 and I want to get the answer in column 5. I am thinking if I am able to create a date column in column 4 which is the first recorded date for each state, then I can subtract that from column 2 and get the days I am looking for.
I tried a group_by(MRN, STATE) but the problem is, it groups the second set of 1's as part of the first set of 1's, so does the 2's which is not what I want.
Use mdy_hm to convert OBS_DTM to POSIXct, then group_by MRN and the rleid of STATE so that the first set of 1's is handled separately from the second set. Use difftime to calculate the difference, in days, between each OBS_DTM and the minimum OBS_DTM in its group.
If your data frame is called data:
library(dplyr)
data %>%
mutate(OBS_DTM = lubridate::mdy_hm(OBS_DTM)) %>%
group_by(MRN, grp = data.table::rleid(STATE)) %>%
mutate(Answer = as.numeric(difftime(OBS_DTM, min(OBS_DTM),units = 'days'))) %>%
ungroup %>%
select(-grp) -> result
result
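To see what the run-length grouping does, here is a small base R sketch that uses rle() in place of data.table::rleid(). The data frame obs is hypothetical (the question's data was only posted as an image), and the sketch assumes the rows are already sorted by MRN and time.

```r
# Hypothetical data: one patient whose STATE goes 1, 1, 2, 1, 1
obs <- data.frame(
  MRN = 1,
  OBS_DTM = as.POSIXct(c("2020-07-27 08:00", "2020-07-28 08:00",
                         "2020-07-30 08:00", "2020-08-01 08:00",
                         "2020-08-02 20:00"), tz = "UTC"),
  STATE = c(1, 1, 2, 1, 1)
)

# Number each consecutive run of the same MRN/STATE pair, so that the
# second run of 1's gets its own id instead of joining the first run
runs <- rle(paste(obs$MRN, obs$STATE))
obs$grp <- rep(seq_along(runs$lengths), runs$lengths)

# Days elapsed since the first observation of each run
run_start <- ave(as.numeric(obs$OBS_DTM), obs$grp, FUN = min)
obs$Answer <- (as.numeric(obs$OBS_DTM) - run_start) / 86400
obs$Answer
# returns 0.0 1.0 0.0 0.0 1.5
```

A plain group_by(MRN, STATE) would instead put rows 1, 2, 4 and 5 into one group, which is the behaviour the question wants to avoid.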
You could try the following:
library(dplyr)
df %>%
group_by(ID, State) %>%
mutate(priorObsDTM = lag(OBS_DTM)) %>%
filter(!is.na(priorObsDTM)) %>%
ungroup() %>%
mutate(Answer = as.numeric(OBS_DTM - priorObsDTM, units = 'days'))
The dataframe I used for this example:
df <- data.frame(
ID = 1,
OBS_DTM = as.POSIXlt(
c('2020-07-27 8:44', '2020-7-27 8:56', '2020-8-8 20:12',
'2020-8-14 10:13', '2020-8-15 13:32')
),
State = c(1, 1, 2, 2, 2),
stringsAsFactors = FALSE
)
df
# A tibble: 3 x 5
# ID OBS_DTM State priorObsDTM Answer
# <dbl> <dttm> <dbl> <dttm> <dbl>
# 1 1 2020-07-27 08:56:00 1 2020-07-27 08:44:00 0.00833
# 2 1 2020-08-14 10:13:00 2 2020-08-08 20:12:00 5.58
# 3 1 2020-08-15 13:32:00 2 2020-08-14 10:13:00 1.14

Calculate the mean of values that fall between 2 dates

I have 2 dataframes. One is a list of occasional events. It has a date column and a column of values.
df1 = data.frame(date = c(as.Date('2020-01-01'), as.Date('2020-02-02'), as.Date('2020-03-01')),
value = c(1,5,9))
I have another data frame that is a daily record. It too has a date column and a column of values.
set.seed(1)
df2 = data.frame(date = seq.Date(from = as.Date('2020-01-01'), to = as.Date('2020-04-01'), by = 1),
value = rnorm(92))
I want to create a new column in df1 that is the mean of df2$value from the current row date to the subsequent date value (non inclusive of the second value, so in this example, the first new value would be the mean of values from df2 of row 1 through row 32, where row 33 is the row that matches df1$date[2]). The resultant data frame would look like the following:
date value value_new
1 2020-01-01 1 0.1165512
2 2020-02-02 5 0.0974052
3 2020-03-01 9 0.1241778
But I have no idea how to specify that. Also I would prefer the last value to be the mean of whatever data is beyond the last value of df1$date, but I would also accept an NA.
We can join df2 with df1, fill the NA values with the previous values, and take the mean of the value_new column within each group.
library(dplyr)
df2 %>%
rename(value_new = value) %>%
left_join(df1, by = 'date') %>%
tidyr::fill(value) %>%
group_by(value) %>%
summarise(date = first(date),
value_new = mean(value_new))
# A tibble: 3 x 3
# value date value_new
# <dbl> <date> <dbl>
#1 1 2020-01-01 0.117
#2 5 2020-02-02 0.0974
#3 9 2020-03-01 0.124
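The same grouping can be sketched in base R with findInterval(), which returns, for each daily date, the index of the last event date that is not after it. This assumes df1$date is sorted and that df2 has no dates before the first event (otherwise an extra group 0 would appear); interval is a name introduced for this sketch.

```r
df1 <- data.frame(date = as.Date(c("2020-01-01", "2020-02-02", "2020-03-01")),
                  value = c(1, 5, 9))
set.seed(1)
df2 <- data.frame(date = seq.Date(from = as.Date("2020-01-01"),
                                  to = as.Date("2020-04-01"), by = 1),
                  value = rnorm(92))

# Index of the event interval each daily record falls into (1, 2 or 3 here);
# every interval runs from one event date up to, but not including, the next
interval <- findInterval(as.numeric(df2$date), as.numeric(df1$date))

# Mean of the daily values within each interval, in event order
df1$value_new <- as.vector(tapply(df2$value, interval, mean))
df1
```

The last interval has no upper bound, so the final value is the mean of everything from the last event date onwards.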

Calculate largest value for multiple overlapping events in a specific range

I have multiple large data frames that capture events that last a certain amount of time. This example gives a simplified version of my data set
Data frame 1:
ID Days Date Value
1 10 80 30
1 10 85 30
2 20 75 20
2 10 80 20
3 5 90 30
Data frame 2:
ID Days Date Value
1 20 0 30
1 10 3 20
2 20 5 30
3 20 1 10
3 10 10 10
The same ID is used for the same person in all datasets
Days specifies the length of the event (if Days has the value 10 then the event lasts 10 days)
Date specifies the day on which the event starts. In this case, Date can be any number between 0 and 90 or 91 (the data represent days in a quarter)
Value is an attribute that is repeated for the number of Days specified. For example, for the first row in df1, the value 30 is repeated 10 times starting from day 80 (30 is repeated for 10 days)
What I am interested in is finding, for each ID in each data frame, the highest value on any single day. Keep in mind that multiple events can overlap, and their values then have to be summed.
The final data frame should look like this:
ID HighestValuedf1 HighestValuedf2
1 60 80
2 40 30
3 30 20
For example, for ID 1 three events overlapped and resulted in the highest value of 80 in data frame 2. There was no overlap between the events of df1 and df2 for ID 3, only an overlap within df2.
I would prefer a solution that avoids merging all data frames into one data frame because of the size of my files.
EDIT
I rearranged my data so that all events that overlap are in one data frame. I only need the highest overlap value for every data frame.
Code to reproduce the data frames:
ID = c(1,1,2,2,3)
Date = c(80,85,75,80,90)
Days = c(10,10,20,10,5)
Value = c(30,30,20,20,30)
df1 = data.frame(ID,Days, Date,Value)
ID = c(1,1,2,3,3)
Date = c(1,3,5,1,10)
Days = c(20,10,20,20,10 )
Value =c(30,20,30,10,10)
df2 = data.frame(ID,Days, Date,Value)
ID= c(1,2,3)
HighestValuedf1 = c(60,40,30)
HighestValuedf2 = c(80,30,20)
df3 = data.frame(ID, HighestValuedf1, HighestValuedf2)
I am interpreting highest value per day to mean highest value on a single day throughout the time period. This is probably not the most efficient solution, since I expect something can be done with map or apply functions, but I didn't see how on a first look. Using df1 and df2 as defined above:
EDIT: I modified the code after understanding that df1 and df2 are supposed to represent sequential quarters. I think the easiest way to do this is simply to stack the data frames so that anything that overlaps is caught automatically (i.e. day 1 of df2 is day 91 overall). You will probably need to either adjust this code manually because quarters differ in length, or preferably convert days of the quarter into actual dates of the year with a date format (df1 day 1 becomes January 1st 2017, for example). The code below rearranges the data to achieve this and then produces the desired results for each quarter by filtering on days 1:90 and 91:180, as shown.
ID = c(1,1,2,2,3)
Date = c(80,85,75,80,90)
Days = c(10,10,20,10,5)
Value = c(30,30,20,20,30)
df1 = data.frame(ID,Days, Date,Value)
ID = c(1,1,2,3,3)
Date = c(1,3,5,1,10)
Days = c(20,10,20,20,10 )
Value =c(30,20,30,10,10)
df2 = data.frame(ID,Days, Date,Value)
library(tidyverse)
#> -- Attaching packages --------------------------------------------------------------------- tidyverse 1.2.1 --
#> v ggplot2 2.2.1.9000 v purrr 0.2.4
#> v tibble 1.4.2 v dplyr 0.7.4
#> v tidyr 0.7.2 v stringr 1.2.0
#> v readr 1.1.1 v forcats 0.2.0
#> -- Conflicts ------------------------------------------------------------------------ tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
df2 <- df2 %>%
mutate(Date = Date + 90)
# Make a dataframe with complete set of day-ID combinations
df_completed <- df1 %>%
mutate(day = factor(Date, levels = 1:180)) %>% # set to total day length
complete(ID, day) %>%
mutate(daysum = 0) %>%
select(ID, day, daysum)
# Function to apply to each data frame containing events
# Should take each event and add value to the appropriate days
sum_df_daily <- function(df_complete, df){
for (i in 1:nrow(df)){
event_days <- seq(df[i, "Date"], df[i, "Date"] + df[i, "Days"] - 1)
df_complete <- df_complete %>%
mutate(
to_add = case_when(
ID == df[i, "ID"] & day %in% event_days ~ df[i, "Value"],
!(ID == df[i, "ID"] & day %in% event_days) ~ 0
),
daysum = daysum + to_add
)
}
return(df_complete)
}
df_filled <- df_completed %>%
sum_df_daily(df1) %>%
sum_df_daily(df2) %>%
mutate(
quarter = case_when(
day %in% 1:90 ~ "q1",
day %in% 91:180 ~ "q2"
)
)
df_filled %>%
group_by(quarter, ID) %>%
summarise(maxsum = max(daysum))
#> # A tibble: 6 x 3
#> # Groups: quarter [?]
#> quarter ID maxsum
#> <chr> <dbl> <dbl>
#> 1 q1 1.00 60.0
#> 2 q1 2.00 40.0
#> 3 q1 3.00 30.0
#> 4 q2 1.00 80.0
#> 5 q2 2.00 30.0
#> 6 q2 3.00 40.0
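If each data frame already contains all the events that can overlap (as described in the question's EDIT), the loop can be avoided by expanding every event into one entry per day and aggregating. The following is a rough base R sketch under that assumption. It treats each data frame on its own and does not carry events over from one quarter into the next, and highest_daily_value is a hypothetical helper name.

```r
ID <- c(1, 1, 2, 2, 3)
Date <- c(80, 85, 75, 80, 90)
Days <- c(10, 10, 20, 10, 5)
Value <- c(30, 30, 20, 20, 30)
df1 <- data.frame(ID, Days, Date, Value)
ID <- c(1, 1, 2, 3, 3)
Date <- c(1, 3, 5, 1, 10)
Days <- c(20, 10, 20, 20, 10)
Value <- c(30, 20, 30, 10, 10)
df2 <- data.frame(ID, Days, Date, Value)

highest_daily_value <- function(df) {
  # Expand every event into one entry per day it covers
  ids <- rep(df$ID, df$Days)
  days <- rep(df$Date, df$Days) + sequence(df$Days) - 1
  values <- rep(df$Value, df$Days)
  # Sum overlapping events per ID and day, then keep the largest sum per ID
  daily <- aggregate(values, by = list(ID = ids, day = days), FUN = sum)
  out <- aggregate(x ~ ID, data = daily, FUN = max)
  names(out)[2] <- "HighestValue"
  out
}

highest_daily_value(df1)$HighestValue  # returns 60 40 30
highest_daily_value(df2)$HighestValue  # returns 50 30 20
```

For df2 taken alone, ID 1 peaks at 50 rather than 80, because the df1 event that runs past day 90 is not carried over into the second quarter here.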
