Create a column in one dataframe based on another column in another dataframe in R

I am fairly new to R and dplyr and I am stuck on this issue:
I have two tables:
(1) Repairs done on cars
(2) Amount owed on each car over time
What I would like to do is create three extra columns on the repair table that give me:
(1) the amount owed on the car when the repair was done,
(2) the amount owed 3 months down the road, and
(3) the last payment record on file.
In the case where the repair date does not match any payment record, I need to use the closest amount owed on record.
So something like:
Any ideas how I can do that?
Here are the data frames:
Repairs done on cars:
df_repair <- data.frame(
  unique_id = c("A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8"),
  car_number = c(1, 1, 1, 2, 2, 2, 3, 3),
  repair_done = c("Front Fender", "Front Lights", "Rear Lights", "Front Fender",
                  "Rear Fender", "Rear Lights", "Front Lights", "Front Fender"),
  YearMonth = c("2014-03", "2016-03", "2016-07", "2015-05",
                "2015-08", "2016-01", "2018-01", "2018-05")
)
Amount owed on each car over time:
df_owed <- data.frame(
  car_number = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3),
  YearMonth = c("2014-02", "2014-05", "2014-06", "2014-08", "2015-06", "2015-12",
                "2016-03", "2016-04", "2016-05", "2016-06", "2016-07", "2016-08",
                "2015-05", "2015-08", "2015-12", "2016-03",
                "2018-01", "2018-02", "2018-03", "2018-04", "2018-05", "2018-09"),
  amount_owed = c(20000, 18000, 17500, 16000, 10000, 7000,
                  6000, 5500, 5000, 4500, 4000, 3000,
                  10000, 8000, 6000, 0,
                  50000, 40000, 35000, 30000, 25000, 15000)
)

Using zoo for year-months together with the tidyverse, you could try the following. Use left_join to add all the df_owed data to your df_repair data by car_number. Convert both year-month columns to yearmon objects with zoo, then sort the rows by the year-month column from df_owed.
For each unique_id (using group_by) you can create your three columns of interest. The first uses the latest amount_owed whose owed date is on or before the repair date. The second (3 months) uses the first amount_owed whose owed date is at least 3 months (3/12 of a year) after the repair date. Finally, the most recent simply takes the last value of amount_owed.
Using the example data, the results differ a bit from those in the post, possibly because the data frames do not match the images there.
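library(tidyverse)
library(zoo)
df_repair %>%
  # Attach every owed record for the same car to each repair
  left_join(df_owed, by = "car_number") %>%
  # YearMonth.x is the repair month, YearMonth.y the owed-record month
  mutate_at(c("YearMonth.x", "YearMonth.y"), as.yearmon) %>%
  arrange(YearMonth.y) %>%
  group_by(unique_id, car_number) %>%
  summarise(
    # Latest record on or before the repair month (NA if there is none)
    owed_repair_done = last(amount_owed[YearMonth.y <= YearMonth.x]),
    # First record at least 3 months after the repair month (NA if there is none)
    owed_3_months = first(amount_owed[YearMonth.y >= YearMonth.x + 3/12]),
    # Last record on file for the car
    owed_most_recent = last(amount_owed)
  )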
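The question asks for the closest amount on record when the repair month has no matching payment, while the summarise step above takes the latest record on or before the repair. If the nearest record in either direction is wanted instead, a base R sketch might look like the following. The helpers month_index and closest_owed are hypothetical names introduced here for illustration (they are not part of dplyr or zoo), and the three owed records are the first three rows of df_owed.

```r
# Convert "YYYY-MM" strings to a month count so differences are in whole months.
month_index <- function(ym) {
  parts <- strsplit(ym, "-", fixed = TRUE)
  vapply(parts, function(p) as.integer(p[1]) * 12L + as.integer(p[2]), integer(1))
}

# Hypothetical helper: amount owed on the record closest in time to `target`.
closest_owed <- function(target, owed_months, owed_amounts) {
  gap <- abs(month_index(owed_months) - month_index(target))
  owed_amounts[which.min(gap)]  # on a tie, the record that comes first in the input wins
}

owed_months  <- c("2014-02", "2014-05", "2014-06")
owed_amounts <- c(20000, 18000, 17500)

# Repair A1 was done in 2014-03; the nearest record is 2014-02.
closest_owed("2014-03", owed_months, owed_amounts)
```

Whether "closest" should mean nearest in either direction or nearest earlier record is a choice the question leaves open, so treat this as one possible reading.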

Related

How to count the number of days that pass between two dates in a dataset column in R

I am working with a dataset. It contains 33 unique Ids, each repeated for every day on which that user provided data from their Fitbit within a 30-day period. I am trying to count the number of days on which each user logged data, using the ActivityDay column and grouping by Id, so that I can see how many of the 30 days each user used their Fitbit.
The activity date column was originally POSIXct and I converted it to Date. How can I count the dates as a number of days, grouped by each individual Id?
I tried using count inside dplyr::summarise to get each Id with its number of days while grouping by Id, but that failed.
I also thought of using case_when, but I didn't think it would work: it wouldn't count all the way up to the end dates I specify, so anything between the two dates would just get the outputs I specified. I also tried count_date_between(min(user_device_activity), max(user_device_activity), by 'day'), but R said the function doesn't exist, and when I tried installing it, I was told it doesn't exist within R.
library(dplyr)
user_device_activity %>%
  distinct(Id, ActivityDate) %>% # in case duplicates possible in data
  count(Id, month = lubridate::floor_date(ActivityDate, "month"))
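The pipeline above returns one count per Id and calendar month. If a single total per Id is what is needed, a minimal sketch using n_distinct could look like this. The sample data is made up for illustration; only the column names Id and ActivityDate are taken from the answer.

```r
library(dplyr)

# Made-up sample data; Id 1 has a duplicated date on purpose
user_device_activity <- data.frame(
  Id = c(1, 1, 1, 2, 2),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13", "2016-04-13",
                           "2016-04-12", "2016-05-01"))
)

# One row per Id with the number of distinct dates that have data
days_per_id <- user_device_activity %>%
  group_by(Id) %>%
  summarise(days_used = n_distinct(ActivityDate))

days_per_id
```

Because n_distinct ignores repeated dates, the distinct() step is not needed in this variant.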

Filter data in R based on condition?

I want to filter the dataframe below, to where only certain rows are kept.
total.Date = date of event
total.start = start time of event
total.TotalTime = duration of event (minutes)
total.ISSUE_DATE = date of item ordered
total.ISSUE_TIME = time of item ordered
In this specific subsetted dataset, I believe all rows will be excluded. However when I perform this on the entire dataset, some rows are expected to remain.
First, paste together the surgery and order dates and times to form proper datetimes, then convert the integer minutes into a "period" object, in lubridate terminology.
Then it's straightforward to filter: keep rows where the order time is later than the start time minus 30 minutes AND earlier than the start time plus the length of the surgery.
library(dplyr)
library(lubridate)
# total.PTIN (start time) and total.TotalORTime (duration in minutes) presumably
# correspond to total.start and total.TotalTime in the question
your_df %>%
  mutate(
    surgery_start = mdy_hms(paste(total.Date, total.PTIN)),
    order_time = mdy_hms(paste(total.ISSUE_DATE, total.ISSUE_TIME)),
    surgery_duration = minutes(total.TotalORTime)
  ) %>%
  filter(
    order_time > surgery_start - minutes(30),
    order_time < surgery_start + surgery_duration
  )
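To see the filter in action, here is a small worked sketch on made-up rows. The values and the "mm/dd/yyyy" plus "hh:mm:ss" text formats are assumptions chosen to suit mdy_hms; the real dataset may store dates and times differently.

```r
library(dplyr)
library(lubridate)

# Made-up rows: the first and third orders fall inside the window, the second is too early
your_df <- data.frame(
  total.Date        = c("01/15/2020", "01/15/2020", "01/16/2020"),
  total.PTIN        = c("08:00:00", "08:00:00", "13:00:00"),
  total.TotalORTime = c(60, 60, 90),
  total.ISSUE_DATE  = c("01/15/2020", "01/15/2020", "01/16/2020"),
  total.ISSUE_TIME  = c("07:45:00", "06:00:00", "14:29:00")
)

kept <- your_df %>%
  mutate(
    surgery_start = mdy_hms(paste(total.Date, total.PTIN)),
    order_time = mdy_hms(paste(total.ISSUE_DATE, total.ISSUE_TIME)),
    surgery_duration = minutes(total.TotalORTime)
  ) %>%
  filter(
    order_time > surgery_start - minutes(30),
    order_time < surgery_start + surgery_duration
  )

nrow(kept)
```

Both comparisons are strict, so an order placed exactly 30 minutes before the start or exactly at the end of the surgery is dropped; switch to >= and <= if those boundary cases should be kept.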

How can I show Q1 to quarter without year on r

I am studying R, and an exercise requires me to create a Quarter column whose values look like Q4.
I used the zoo and lubridate libraries, but the result I got was a year plus Q1.
I have a column InvoiceDate that holds a complete datetime.
What I need:
The original column has a date-time like this: 2010-12-01 08:26:00. The column name is InvoiceDate. I need to take this column and create other columns with specific values, like quarter, year, month, etc.
What I achieved:
How do I achieve my goal?
You can use the inbuilt functions to extract the information that you need.
library(dplyr)
df %>%
  mutate(quarter = quarters(InvoiceDate),
         Month = format(InvoiceDate, '%b'),
         weekday = weekdays(InvoiceDate))
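As a quick check in base R, using the example timestamp from the question (the UTC time zone is an assumption made here so the result does not depend on the local setting):

```r
InvoiceDate <- as.POSIXct("2010-12-01 08:26:00", tz = "UTC")

quarter   <- quarters(InvoiceDate)       # "Q4", with no year attached
year      <- format(InvoiceDate, "%Y")   # "2010"
month_num <- format(InvoiceDate, "%m")   # "12"

c(quarter, year, month_num)
```

format with "%Y" covers the year column the question also asks for. Note that format returns character strings, so wrap the result in as.integer if a numeric year or month is needed.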

Difficulty in generating time series in R for my data set

I am trying to generate a time series for my dataset in R but am having difficulty doing so. My dataset has two columns: one for the date and the other for the price of a material. There are many dates that have no price and hence are not in the dataset. The data covers roughly a year. I am having difficulty setting the frequency and start time for the time series. Is there a way to set the start from the dataset and have the time series automatically incorporate the missing data points?
The following is for a dataframe df with two columns called "date" and "price", where date is of class Date.
It creates rows for the missing dates and fills the missing price for each of those dates with the preceding price. You can replace the fill step if you want to fill with other specified values.
library(tidyverse)
df <- df %>%
  complete(date = seq.Date(min(date), max(date), by = "day")) %>%
  fill(price)
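A small worked sketch of the same two steps, on made-up prices with a gap of two days:

```r
library(dplyr)
library(tidyr)

# Made-up data: 2020-01-03 and 2020-01-04 are missing
df <- data.frame(
  date  = as.Date(c("2020-01-01", "2020-01-02", "2020-01-05")),
  price = c(10, 11, 14)
)

filled <- df %>%
  # Add one row for every calendar day between the first and last date
  complete(date = seq.Date(min(date), max(date), by = "day")) %>%
  # Carry the last known price forward into the new rows
  fill(price)

filled$price
```

The two inserted days take the price of 2020-01-02, so the series has one value per day and its start date is simply min(df$date).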

Tidyverse merging two datasets on most recent dates

In R, I have two data sets with dates that I am attempting to merge. The first holds environmental conditions that have start dates and end dates. The interval lengths are irregular, ranging from a day to a year. The second data set holds events that each have a given date. I would like to merge them so that I know the environmental conditions that existed during each event.
In the example below, the merged result should be Event_data with a new column showing the weather on each date.
require(tidyverse)
( Envir_data = data.frame(
    envir_start_date = as.Date(c("2017-05-31", "2018-01-17", "2018-02-03"), format = "%Y-%m-%d"),
    envir_end_date = as.Date(c("2018-01-17", "2018-01-20", "2018-04-17"), format = "%Y-%m-%d"),
    weather = c("clear", "storming", "windy")) )
( Event_data = data.frame(
    event_date = as.Date(c("2017-06-03", "2017-10-18", "2018-01-19"), format = "%Y-%m-%d"),
    cars_sold = c(2, 3, 7)) )
SQL lets you do a between join that gets exactly the result you are looking for.
library(sqldf)
join <- sqldf(
  "SELECT L.event_date, L.cars_sold, R.weather
   FROM Event_data as L
   LEFT JOIN Envir_data as R
   ON L.event_date BETWEEN R.envir_start_date AND R.envir_end_date"
)
We use seq.Date to generate a sequence of dates for each interval in Envir_data. It is important to use rowwise so that each row gets its own sequence, built from that row's start and end dates. This operation results in a list column. We then unnest that list column to have one row per date. Finally we join to the Event_data.
Envir_data_2 <- Envir_data %>%
  rowwise() %>%
  mutate(event_date = list(seq.Date(envir_start_date, envir_end_date,
                                    by = "day"))) %>%
  unnest(event_date) %>%
  select(event_date, weather)
Event_data %>%
  inner_join(Envir_data_2, by = "event_date")
#   event_date cars_sold  weather
# 1 2017-06-03         2    clear
# 2 2017-10-18         3    clear
# 3 2018-01-19         7 storming
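Another option, assuming dplyr 1.1.0 or later is installed, is to express the interval condition directly in the join with join_by and its between helper, without expanding every interval into daily rows. This is a sketch of an alternative, not the method used in either answer above; the data frames are repeated so the block runs on its own.

```r
library(dplyr)  # join_by() requires dplyr >= 1.1.0

Envir_data <- data.frame(
  envir_start_date = as.Date(c("2017-05-31", "2018-01-17", "2018-02-03")),
  envir_end_date   = as.Date(c("2018-01-17", "2018-01-20", "2018-04-17")),
  weather          = c("clear", "storming", "windy")
)
Event_data <- data.frame(
  event_date = as.Date(c("2017-06-03", "2017-10-18", "2018-01-19")),
  cars_sold  = c(2, 3, 7)
)

# Keep every event and attach the interval whose (inclusive) bounds contain its date
merged <- Event_data %>%
  left_join(
    Envir_data,
    by = join_by(between(event_date, envir_start_date, envir_end_date))
  ) %>%
  select(event_date, cars_sold, weather)

merged
```

Because the bounds are inclusive and the example intervals share the boundary date 2018-01-17, an event on exactly that date would match two intervals and appear twice in the result.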
