Tidyverse merging two datasets on most recent dates - r

In R, I have two data sets with dates that I am attempting to merge. The first is the environmental conditions, which have start_dates and stop_dates. The interval lengths are irregular, ranging from a day to a year. The second data set is events that each have a given date. I would like to merge them so that I know the environmental conditions that existed during each event.
In the example below, the merged result should be the Event_data with a new column showing the weather at each date.
require(tidyverse)
( Envir_data = data.frame(envir_start_date=as.Date(c("2017-05-31","2018-01-17", "2018-02-03"), format="%Y-%m-%d"),
envir_end_date=as.Date(c("2018-01-17", "2018-01-20", "2018-04-17"), format="%Y-%m-%d"),
weather = c("clear","storming","windy")) )
( Event_data = data.frame(event_date=as.Date(c("2017-06-03","2017-10-18", "2018-01-19"), format="%Y-%m-%d"),
cars_sold=c(2,3,7)) )

SQL lets you do a between join that gets exactly the result you are looking for.
library(sqldf)
join <- sqldf(
"SELECT L.Event_date, L.cars_sold, R.weather
FROM Event_data as L
LEFT JOIN Envir_data as R
ON L.event_date BETWEEN R.envir_start_date AND R.envir_end_date"
)

We use seq.Date to generate a sequence of dates for each interval in Envir_data. It is important to use rowwise() so that each sequence is built from a single row's start and end dates. This operation results in a list column. We then unnest that list column to have one row per date. Finally, we join the result to Event_data.
Envir_data_2 <- Envir_data %>%
rowwise() %>%
mutate(event_date = list(seq.Date(envir_start_date, envir_end_date,
by = "day"))) %>%
unnest(event_date) %>%
select(event_date, weather)
Event_data %>%
inner_join(Envir_data_2, by = "event_date")
# event_date cars_sold weather
# 1 2017-06-03 2 clear
# 2 2017-10-18 3 clear
# 3 2018-01-19 7 storming
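Expanding every interval into one row per day can get large when intervals span years. As an alternative sketch, and assuming dplyr 1.1.0 or later is installed (the version that introduced join_by()), the same "between" condition can be written directly in a dplyr join. The data frames are rebuilt here so the block runs on its own.

```r
library(dplyr)

Envir_data <- data.frame(
  envir_start_date = as.Date(c("2017-05-31", "2018-01-17", "2018-02-03")),
  envir_end_date   = as.Date(c("2018-01-17", "2018-01-20", "2018-04-17")),
  weather          = c("clear", "storming", "windy")
)
Event_data <- data.frame(
  event_date = as.Date(c("2017-06-03", "2017-10-18", "2018-01-19")),
  cars_sold  = c(2, 3, 7)
)

# Non-equi join: keep every event and attach the interval that contains its date
merged <- Event_data %>%
  left_join(
    Envir_data,
    by = join_by(between(event_date, envir_start_date, envir_end_date))
  ) %>%
  select(event_date, cars_sold, weather)

merged
```

Because this is a left join, an event that falls in no interval is kept with NA weather, and an event on a date shared by two intervals (such as 2018-01-17 here) would appear twice.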

Related

Filter data in R based on condition?

I want to filter the dataframe below, to where only certain rows are kept.
total.Date = date of event
total.start = start time of event
total.TotalTime = duration of event (minutes)
total.ISSUE_DATE = date of item ordered
total.ISSUE_TIME = time of item ordered
In this specific subsetted dataset, I believe all rows will be excluded. However when I perform this on the entire dataset, some rows are expected to remain.
First, paste together the surgery and order dates and times to form proper datetimes, then convert the integer minutes into a "period" object, in lubridate terminology.
Then it's straightforward to filter: keep the rows where the order time is greater than the start time minus 30 minutes AND less than the start time plus the length of the surgery.
library(dplyr)
library(lubridate)
your_df %>%
mutate(
surgery_start = mdy_hms(paste(total.Date, total.PTIN)),
order_time = mdy_hms(paste(total.ISSUE_DATE, total.ISSUE_TIME)),
surgery_duration = minutes(total.TotalORTime)
) %>%
filter(
order_time > surgery_start - minutes(30),
order_time < surgery_start + surgery_duration
)
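Since the original data frame is not shown, here is a small sketch on made-up rows to check the logic. The column names follow the answer above; the values, and the month/day/year format with seconds that mdy_hms() expects, are assumptions. The sketch stores the surgery end as a datetime rather than keeping a period column, which is a minor variation on the answer.

```r
library(dplyr)
library(lubridate)

# Hypothetical rows: values are invented for illustration only
your_df <- data.frame(
  total.Date        = c("01/15/2020", "01/15/2020", "01/16/2020"),
  total.PTIN        = c("08:00:00", "08:00:00", "13:00:00"),
  total.TotalORTime = c(90, 90, 60),
  total.ISSUE_DATE  = c("01/15/2020", "01/15/2020", "01/16/2020"),
  total.ISSUE_TIME  = c("07:45:00", "10:00:00", "13:30:00")
)

kept <- your_df %>%
  mutate(
    surgery_start = mdy_hms(paste(total.Date, total.PTIN)),
    order_time    = mdy_hms(paste(total.ISSUE_DATE, total.ISSUE_TIME)),
    surgery_end   = surgery_start + minutes(total.TotalORTime)
  ) %>%
  filter(
    order_time > surgery_start - minutes(30),  # no earlier than 30 minutes before the start
    order_time < surgery_end                   # and before the surgery ends
  )

kept
```

With these rows, the second order (10:00, after the 09:30 end of the surgery) is dropped and the other two are kept.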

How can I get a conditional statement to select the most recent timestamp?

I have a data frame consisting of ~1,000,000 rows and am classifying some data.
Where there are two or more dates present against a record, I want to use the first date in a new field called Day1 and the second date in a field called Day2.
I achieve this thus:
df %>%
group_by(pii, cn) %>%
summarise(Day1 = min(TestDate, na.rm = TRUE), # Selects the first available date
Day2 = sort(TestDate, na.last = TRUE)[2]) # Selects the second available date
However, I have come across a problem affecting around 1.6% of the records (~14,000) where there are only two dates listed, which are identical.
In this case, I want to be able to look at the time listed against each date (recorded in df$time) to determine which came first, still with the intention of taking the first (earlier) date as Day1 and the second as Day2.
How can I incorporate this into my current structure?
For the sake of an illustrative example (albeit non-functioning), I am thinking that it could be something like this:
if_else(sort(TestDate,na.last = TRUE)[2] == Day1, [CHECK TIMES HERE], sort(TestDate, na.last = TRUE)[2])
As such, I would hope for something like this as an output:
id Day1 D1_Time Day2 D2_Time
1 2021-01-02 NA 2021-01-04 NA
2 2021-01-01 04.45 2021-01-01 04.48
3 2021-01-03 NA 2021-01-08 NA
In this output example, the record with id value 2 has two identical dates listed, so the df$time field was consulted to determine which came first.
I think this would solve your problem (though it might not answer your specific question). I'd do something like this:
library(dplyr)
library(tidyr)
df %>%
group_by(pii, cn) %>%
arrange(TestDate, time, .by_group = TRUE) %>% # order within each group by date, then time
mutate(rownum = row_number()) %>% # number the rows in that order
filter(rownum <= 2) %>% # keep the first 2 rows per group (slice_head(n = 2) could replace these two steps, but rownum is needed below)
ungroup() %>%
pivot_wider(names_from = rownum, # spread to one row per group, matching your desired output
values_from = c(TestDate, time)) %>%
select(pii, cn, TestDate_1, time_1, TestDate_2, time_2) # reorder the columns to match your sample
Or, if you don't like that approach, could you combine each date/time pair into a single datetime field, and use your original logic?
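A minimal sketch of that second idea, on made-up data. It assumes the time column is text in the "HH.MM" form shown in the sample output and that a missing time can be treated as midnight; both are assumptions, as is the helper column name test_dt.

```r
library(dplyr)

# Hypothetical data: two records, the second with identical dates but different times
df <- data.frame(
  pii      = c(1, 1, 2, 2),
  cn       = c("a", "a", "b", "b"),
  TestDate = as.Date(c("2021-01-04", "2021-01-02", "2021-01-01", "2021-01-01")),
  time     = c(NA, NA, "04.48", "04.45")
)

result <- df %>%
  mutate(
    time_filled = coalesce(time, "00.00"),  # assumption: missing time means midnight
    test_dt = as.POSIXct(paste(TestDate, time_filled),
                         format = "%Y-%m-%d %H.%M", tz = "UTC")
  ) %>%
  group_by(pii, cn) %>%
  summarise(
    Day1 = min(test_dt, na.rm = TRUE),        # earliest datetime
    Day2 = sort(test_dt, na.last = TRUE)[2],  # second earliest datetime
    .groups = "drop"
  )

result
```

Day1 and Day2 are datetimes here, so the date and time parts can be split back out with as.Date() and format() if separate columns are needed.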

Create a column in one dataframe based on another column in another dataframe in R

I am fairly new to R and dplyr and I am stuck on this issue:
I have two tables:
(1) Repairs done on cars
(2) Amount owed on each car over time
What I would like to do is create three extra columns on the repair table that give me:
(1) the amount owed on the car when the repair was done,
(2) the amount owed 3 months down the road, and
(3) the last payment record on file.
In the case where the repair date does not match any payment record, I need to use the closest amount owed on record.
So something like:
Any ideas how I can do that?
Here are the data frames:
Repairs done on cars:
df_repair <- data.frame(unique_id = c("A1","A2","A3","A4","A5","A6","A7","A8"),
car_number = c(1,1,1,2,2,2,3,3),
repair_done = c("Front Fender","Front Lights","Rear Lights","Front Fender","Rear Fender","Rear Lights","Front Lights","Front Fender"),
YearMonth = c("2014-03","2016-03","2016-07","2015-05","2015-08","2016-01","2018-01","2018-05"))
df_owed <- data.frame(car_number = c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3),
YearMonth = c("2014-02","2014-05","2014-06","2014-08","2015-06","2015-12","2016-03","2016-04","2016-05","2016-06","2016-07","2016-08","2015-05","2015-08","2015-12","2016-03","2018-01","2018-02","2018-03","2018-04","2018-05","2018-09"),
amount_owed = c(20000,18000,17500,16000,10000,7000,6000,5500,5000,4500,4000,3000,10000,8000,6000,0,50000,40000,35000,30000,25000,15000))
Using zoo for year-months, and tidyverse, you could try the following. Using left_join, add all the df_owed data to your df_repair data, by car_number. You can convert your year-month columns to yearmon objects with zoo. Then, sort your rows by the year-month column from df_owed.
For each unique_id (using group_by) you can create your three columns of interest. The first uses the latest amount_owed where the owed date is on or before the repair date. The second (3 months) uses the first amount_owed value where the owed date is at least 3 months (3/12 of a year) after the repair date. Finally, the most recent takes just the last value of amount_owed.
Using the example data, the results differ a bit, possibly due to the data frames not matching the images in the post.
library(tidyverse)
library(zoo)
df_repair %>%
left_join(df_owed, by = "car_number") %>%
mutate(across(c(YearMonth.x, YearMonth.y), as.yearmon)) %>%
arrange(YearMonth.y) %>%
group_by(unique_id, car_number) %>%
summarise(
owed_repair_done = last(amount_owed[YearMonth.y <= YearMonth.x]),
owed_3_months = first(amount_owed[YearMonth.y >= YearMonth.x + 3/12]),
owed_most_recent = last(amount_owed)
)
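The `+ 3/12` in the code works because zoo stores a yearmon as a year plus a fraction of a year, so one month is 1/12. A short sketch of that arithmetic, using dates taken from the example data (the variable names are only for illustration):

```r
library(zoo)

repair_month <- as.yearmon("2016-03")

# Adding 3/12 of a year moves the value forward by three months
three_months_later <- repair_month + 3/12
format(three_months_later, "%Y-%m")

# yearmon values compare chronologically, which is what the filters in summarise() rely on
as.yearmon("2016-07") >= three_months_later
as.yearmon("2016-04") >= three_months_later
```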

R generate one random date per month between defined interval

I'd like to generate a list of random dates between a defined interval using R such that there is only one date for each month present in the interval.
I've tried using a variation of the code from another solution, but I can't seem to limit it to one date per month. I get multiple dates for a given month.
Here's my attempt
df = data.frame(Date=c(sample(seq(as.Date('2020/01/01'), as.Date('2020/09/01'), by="day"), 9)))
But I seem to get more than one date for a given month. Any inputs would be highly appreciated.
First, I create a table containing all the possible dates to sample from. In a column of this table I store the month number of each date, using the month() function from the lubridate package.
library(lubridate)
dates <- data.frame(
days = seq(as.Date('2020/01/01'), as.Date('2020/09/01'), by="day")
)
dates$month <- month(dates$days)
Then, the idea is to loop over the months with the lapply() function. In each iteration, I select only the dates of that month from the dates table, and I pass these dates to the sample() function.
results <- lapply(1:9, function(x){
sample_dates <- dates$days[dates$month == x]
return(sample(sample_dates, size = 1))
})
df <- data.frame(
dates = as.Date(unlist(results), origin = "1970-01-01")
)
This results in:
dates
1 2020-01-19
2 2020-02-06
3 2020-03-26
4 2020-04-13
5 2020-05-16
6 2020-06-29
7 2020-07-06
8 2020-08-21
9 2020-09-01
In other words, the idea of this approach is to give sample() only the dates of one specific month in each iteration, so it picks exactly one date for that month.
How about this:
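First, you create a function that returns a random day from month 'month'.
Then you lapply it over all the months you need, 1 to 9.
x <- function(month){
first_day <- as.Date(paste0('2020/', month, '/01'))
# last day of the month: first day of the next month minus one day (valid for months 1 to 11)
last_day <- as.Date(paste0('2020/', month + 1, '/01')) - 1
sample(seq(first_day, last_day, by = "day"), 1)
}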
df <- data.frame(
dates = as.Date(unlist(lapply(1:9,x)), origin = "1970-01-01")
)
If you also want the results in random order (not January, February, March...) you only need to add a sample:
df <- data.frame(
dates = as.Date(unlist(sample(lapply(1:9,x))), origin = "1970-01-01")
)
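For comparison, a sketch of the same idea without an explicit loop, assuming dplyr 1.0.0 or later for slice_sample(): group every candidate date by its year and month, then draw one row per group. The column name year_month is made up for the example, and set.seed() is there only to make the sketch reproducible.

```r
library(dplyr)

set.seed(42)  # only so that repeated runs draw the same dates

all_days <- data.frame(
  days = seq(as.Date("2020-01-01"), as.Date("2020-09-01"), by = "day")
)

one_per_month <- all_days %>%
  mutate(year_month = format(days, "%Y-%m")) %>%  # grouping key, e.g. "2020-03"
  group_by(year_month) %>%
  slice_sample(n = 1) %>%                         # one random date per month
  ungroup()

one_per_month
```

Grouping on year and month together, rather than on the month number alone, keeps the result correct if the interval spans more than one year. Because the interval ends on 2020-09-01, that is the only date that can be drawn for September.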

Can I match a character string containing m-d with a date vector in R?

All, I've seen that date conversion questions get downvoted a lot, but I couldn't find any information online or in the help files...
I have a df with a date formatted as ymd_hm() and then some data in other columns. Then I have another df with 366 rows, one for each day, and a column containing some values relevant for that day (some climatological stuff that is essentially the same every year, so the year doesn't matter). The dfs might look something like this:
df1 <- tibble(Date=seq(ymd_hm('2010-05-01 00:00'),ymd_hm('2010-05-03 00:00'), by = 'hour'), Data=c(1:length(Date)))
df2 <- tibble(MonthDay=c("04-30", "05-01", "05-02","05-03","05-04"), OtherData=c(20,30,40,50, 60))
Now, is it possible to do some lookup sort of thing and match Date and MonthDay and then write whatever OtherData is into df1? I'm struggling since I can't convert MonthDay to a date.
So, all the 2010-05-01 dates should have 30 next to them, all 2010-05-02 dates should have 40 in the next column, and so on and so forth...
Thanks y'all!
We extract the 'MonthDay' from Date with format, then use that as the common joining column in left_join.
library(dplyr)
df1 %>%
mutate(MonthDay = format(Date, "%m-%d")) %>%
left_join(df2, by = "MonthDay") %>%
select(-MonthDay)
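To check this end to end, here is the same pipeline run on the two example tibbles from the question (with seq_along(Date) in place of c(1:length(Date)), which gives the same values). The final count() is only there to show how many hourly rows received each value.

```r
library(dplyr)
library(lubridate)
library(tibble)

df1 <- tibble(
  Date = seq(ymd_hm("2010-05-01 00:00"), ymd_hm("2010-05-03 00:00"), by = "hour"),
  Data = seq_along(Date)
)
df2 <- tibble(
  MonthDay  = c("04-30", "05-01", "05-02", "05-03", "05-04"),
  OtherData = c(20, 30, 40, 50, 60)
)

joined <- df1 %>%
  mutate(MonthDay = format(Date, "%m-%d")) %>%  # "05-01", "05-02", ... as text
  left_join(df2, by = "MonthDay") %>%
  select(-MonthDay)

# Number of hourly rows per OtherData value
count(joined, OtherData)
```

Every timestamp on 2010-05-01 gets 30, every timestamp on 2010-05-02 gets 40, and the single 2010-05-03 00:00 row gets 50, as the question asks.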
