Merging two data sets in R (based on ID and date range) - r

I have two large data sets.
The first data set includes ID, Starting time and Ending time.
The second data set includes ID, Starting time and Ending time.
I want to merge these two data sets on ID and starting time, where a date in the first data set may match any date in the second within 5 days either side. For example, 23/4/2012 in the first data set can merge with any starting date from 18/4/2012 to 28/4/2012.
Input data:
# Dates are quoted so that R stores them as strings instead of evaluating 24/5/1980 as a division
x<-c(1,2,3,4,5,6,6,7,7,8,8,9,10)
StartTime<-c("24/5/1980","2/6/1932","24/6/1945","25/9/1954","12/11/1970","14/3/1984","15/5/1999","20/5/1990","25/9/1981","28/2/1980","29/1/1984","24/4/1987","30/6/1988")
Endtime<-c("24/6/1980","2/8/1932","24/9/1945","25/10/1954","14/11/1970","14/12/1984","15/10/1999","26/5/1990","29/9/1981","28/3/1980","29/1/1984","24/6/1987","30/7/1988")
df1<-data.frame(x,StartTime,Endtime)
x<-c(1,1,1,2,2,3,3,4,5,5,6,6,7)
StartTime<-c("29/5/1980","20/5/1980","23/5/1945","5/6/1932","7/6/1932","27/6/1945","20/6/1945","20/5/1990","25/9/1981","28/2/1980","29/3/1984","24/5/1987","30/7/1988")
Endtime2<-c("24/6/1980","2/8/1990","24/9/1945","25/10/1954","14/11/1970","14/12/1984","15/10/1999","26/5/1990","29/9/1981","28/3/1980","29/1/1984","24/6/1987","30/7/1988")
df2<-data.frame(x,StartTime,Endtime2)

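Convert your date strings to Dates first, for example with as.Date() (or as.POSIXct(), see https://stat.ethz.ch/R-manual/R-devel/library/base/html/as.POSIXlt.html). sqldf passes Date columns to SQLite as day numbers, so subtracting them gives a difference in days. The columns in your data are named x and StartTime, so the join uses those:
library(sqldf)
df1$StartTime <- as.Date(df1$StartTime, format = "%d/%m/%Y")
df2$StartTime <- as.Date(df2$StartTime, format = "%d/%m/%Y")
df3 <- sqldf("SELECT df1.*, df2.* FROM df1 INNER JOIN df2 ON df1.x = df2.x AND df1.StartTime - df2.StartTime BETWEEN -5 AND 5")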
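For readers who would rather avoid SQL, the same match within 5 days can be sketched in base R by merging on the ID and then filtering on the date difference. This is only an illustration under assumptions: the toy data frames a and b and the suffixes .1 and .2 are made up for the example, and the intermediate merge holds every same-ID pair, so it may grow too large for very big inputs.

```r
# Toy data in the same shape as df1/df2 (values are illustrative only)
a <- data.frame(x = c(1, 2, 3),
                StartTime = as.Date(c("24/5/1980", "2/6/1932", "24/6/1945"),
                                    format = "%d/%m/%Y"))
b <- data.frame(x = c(1, 1, 2, 3),
                StartTime = as.Date(c("29/5/1980", "20/5/1980", "7/6/1932", "20/7/1945"),
                                    format = "%d/%m/%Y"))

# Step 1: pair every row of a with every row of b that has the same ID
pairs <- merge(a, b, by = "x", suffixes = c(".1", ".2"))

# Step 2: keep only pairs whose start dates differ by at most 5 days
day_gap <- as.numeric(pairs$StartTime.1 - pairs$StartTime.2)
matched <- pairs[abs(day_gap) <= 5, ]
```

Subtracting two Date vectors gives a difference in days, which is why a plain abs() comparison is enough here.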

Related

Tidyverse merging two datasets on most recent dates

In R, I have two data sets with dates that I am attempting to merge. The first contains environmental conditions with start and stop dates (envir_start_date, envir_end_date). The interval lengths are irregular, ranging from a day to a year. The second data set contains events that each have a given date. I would like to merge them so that I know the environmental conditions that existed during each event.
In the example below, the merged result should be the Event_data with a new column showing the weather at each date.
require(tidyverse)
( Envir_data = data.frame(envir_start_date=as.Date(c("2017-05-31","2018-01-17", "2018-02-03"), format="%Y-%m-%d"),
envir_end_date=as.Date(c("2018-01-17", "2018-01-20", "2018-04-17"), format="%Y-%m-%d"),
weather = c("clear","storming","windy")) )
( Event_data = data.frame(event_date=as.Date(c("2017-06-03","2017-10-18", "2018-01-19"), format="%Y-%m-%d"),
cars_sold=c(2,3,7)) )
SQL lets you do a between join that gets exactly the result you are looking for.
library(sqldf)
join <- sqldf(
"SELECT L.Event_date, L.cars_sold, R.weather
FROM Event_data as L
LEFT JOIN Envir_data as R
ON L.event_date BETWEEN R.envir_start_date AND R.envir_end_date"
)
We use seq.Date to generate a sequence of dates for each interval in Envir_data. The rowwise() call matters: it makes mutate build one sequence per row rather than passing whole columns to seq.Date. This operation results in a list column. We then unnest that list column to have one row per date. Finally we join to the Event_data.
Envir_data_2 <- Envir_data %>%
rowwise() %>%
mutate(event_date = list(seq.Date(envir_start_date, envir_end_date,
by = "day"))) %>%
unnest(event_date) %>%
select(event_date, weather)
Event_data %>%
inner_join(Envir_data_2)
# event_date cars_sold weather
# 1 2017-06-03 2 clear
# 2 2017-10-18 3 clear
# 3 2018-01-19 7 storming

How do I delete rows in a data frame based on the value (date) in one of the columns?

I have a data frame that consists of daily data. It has 500,000+ rows and 18 columns. The 2nd column contains the date.
For example, it goes from 7/1/2017 to the current date, chronologically.
I pull the data every Monday and input it into R, but I only want data up until the most recent Friday.
I've set a variable equal to the most recent Friday's date (in the exact date format of the data):
library(lubridate)
LastFriday <- gsub("X", "", gsub("X0", "", format(
Sys.Date() - wday(Sys.Date()+1), "X%m/X%d/%Y")))
which returns 9/15/2017
How do I delete all the rows in the data frame after the last row that contains last Friday's date?
The following should work, though I have not tested it. It parses both the date column and LastFriday as Dates and keeps the rows on or before that Friday:
keep_index <- as.Date(df[,2], format = "%m/%d/%Y") <= as.Date(LastFriday, format = "%m/%d/%Y")
mydf <- df[keep_index, ]
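As a rough sketch of the whole step without lubridate, the cutoff can be computed on Date objects directly and compared against the parsed date column. The helper name last_friday and the toy data frame are assumptions made for this illustration. Like the expression in the question, the helper returns the Friday strictly before the given date.

```r
# Hypothetical helper: the most recent Friday strictly before date d
last_friday <- function(d) {
  wd <- as.POSIXlt(d)$wday        # 0 = Sunday, ..., 5 = Friday, 6 = Saturday
  d - ((wd - 6) %% 7 + 1)         # step back 1 to 7 days to land on a Friday
}

# Toy data; the date strings use the same m/d/Y layout as in the question
toy <- data.frame(id = 1:4,
                  date = c("9/14/2017", "9/15/2017", "9/16/2017", "9/18/2017"),
                  stringsAsFactors = FALSE)

cutoff <- last_friday(as.Date("2017-09-18"))          # 2017-09-18 is a Monday
toy_dates <- as.Date(toy$date, format = "%m/%d/%Y")   # parse once, compare as Dates
trimmed <- toy[toy_dates <= cutoff, ]
```

Comparing Date objects avoids the leading-zero formatting problem altogether, since no string comparison is involved.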

Combine different rows

Consider a dataframe of the form
id start end
2009.36220 65693384 2010-03-20 2010-07-04
2010.36221 65693592 2010-01-01 2010-12-31
2010.36222 65698250 2010-01-01 2010-12-31
2010.36223 65704349 2010-01-01 2010-12-31
where I have around 20k observations per year for 15 years.
I need to combine the rows by the following rule:
if for the same id, there exists a record that ends at the last day of the year
and a record that starts at the first day of the following year
then
- create a new row with the start value of the earlier row and the end value of the later row
- and delete the two original rows
Given that the same id can appear several times (since I have more than 2 years), I will then just run the script several times to combine ids that have, for example, 4 rows in consecutive years that satisfy the condition.
The Question
I'd know how to program this in an iterative manner, where I would go over every single row and check whether there's a row somewhere in the whole data frame whose start date next year follows on from the end date this year. But that's extremely slow and unsatisfying from an aesthetic point of view. I'm a complete beginner with R, so I have no clue where to even look to do such a thing more efficiently. I'm open to any suggestion.
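Warning: growing a data frame with rbind() inside a loop is slow and generally bad practice, but this is the easiest solution I could think of. Let df be your data. Note that it collapses all rows of an id into one, whether or not the years are consecutive.
df$start = as.POSIXct(df$start)
df$end = as.POSIXct(df$end)
df2 = data.frame()
for (i in unique(df$id)){
# all rows belonging to the current id
s = subset(df, id==i)
# one combined row: earliest start, latest end
df2 = rbind(df2, data.frame(id=i, start=min(s$start), end=max(s$end)))
}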
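A loop-free sketch that respects the adjacency rule from the question might look like the following. It rests on assumptions: the toy data frame recs is invented, and the rule is slightly generalised to "a record starts the day after the previous record of the same id ends", which covers the 31 December / 1 January case. Chains of several consecutive years are merged in one pass, so no repeated runs are needed.

```r
# Toy data: id 1 has two records that touch at the year boundary, id 2 has a gap
recs <- data.frame(
  id    = c(1, 1, 2, 2),
  start = as.Date(c("2010-03-20", "2011-01-01", "2010-01-01", "2011-06-01")),
  end   = as.Date(c("2010-12-31", "2011-12-31", "2010-12-31", "2011-12-31"))
)

# Sort so that records of the same id sit next to each other in time order
recs <- recs[order(recs$id, recs$start), ]
n <- nrow(recs)

# A row continues the previous one when the id matches and it starts
# exactly one day after the previous row ends
continues <- c(FALSE,
               recs$id[-1] == recs$id[-n] & recs$start[-1] == recs$end[-n] + 1)

# Every run of continuing rows shares one run number
run <- cumsum(!continues)

# First row of each run supplies the start, last row supplies the end
first <- !duplicated(run)
last  <- !duplicated(run, fromLast = TRUE)
combined <- data.frame(id = recs$id[first],
                       start = recs$start[first],
                       end = recs$end[last])
```

Because the work is done with vectorised comparisons and cumsum(), the cost is dominated by the initial sort rather than by a row-by-row search.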

How can I subtract 2 dataframes of different length by searching for the closest timestamp in R?

I am using R to extract data from a process historian using SQL. I have two dataframes, one of net weights (NetWt) with timestamps (100 rows) and another of weight setpoints (SetPt) with timestamps (6 rows). The setpoint is changed infrequently but a new bag weight is recorded every 30 seconds. I need to subtract the two such that I get a resultant dataframe of NetWt - SetPt for each timestamp in NetWt. In my last dataset the most recent SetPt timestamp is earlier than the first NetWt timestamp. I need a function that will go through each row in NetWt, take the timestamp, search for the closest timestamp before that time in the SetPt dataframe, return the most recent SetPt and output the difference (NetWt-SetPt).
I have researched merge, rbind, cbind, and I can't find a function to search backwards for the most recent SetPt value and merge that with the NetWt so that I can subtract them to plot the difference with time. Can anyone please help?
Data:
SetPtLines <- "Value,DateTime
51.35,2014-02-10 08:10:49
53.30,2014-02-10 07:52:37
53.10,2014-02-10 07:52:19
51.70,2014-02-10 07:50:26
51.35,2014-02-09 19:25:21
51.40,2014-02-09 19:13:11
51.50,2014-02-09 18:24:53
51.45,2014-02-09 16:10:38
51.40,2014-02-09 15:54:42"
SetPt <- read.csv(text=SetPtLines, header=TRUE)
NetWtLines <- "DateTime,Value
2014-02-11 12:51:50,50.90735
2014-02-11 12:52:24,50.22308
2014-02-11 12:52:55,50.88604
2014-02-11 12:53:27,50.69514
2014-02-11 12:53:58,51.38968
2014-02-11 12:54:29,50.96672"
NetWt <- read.csv(text=NetWtLines, header=TRUE)
There are 100 rows in NetWt.
data.table has a roll argument which would probably be very helpful here
library(data.table)
NetWt <- as.data.table(NetWt)
SetPt <- as.data.table(SetPt)
## Only needed if dates are strings:
## Ensure that your DateTime columns are actually times and not strings
NetWt[, DateTime := as.POSIXct(DateTime)]
SetPt[, DateTime := as.POSIXct(DateTime)]
## Next, set keys to the dates
setkey(NetWt, DateTime)
setkey(SetPt, DateTime)
## Join the two; roll = TRUE carries the most recent earlier SetPt forward
## to each NetWt timestamp (roll = "nearest" could pick a later setpoint)
## In j, Value is the SetPt value and i.Value is the NetWt value
NetWt[, Diff := SetPt[NetWt, i.Value - Value, roll = TRUE]]
Here's a solution using xts. Note that your example would be more helpful if SetPt and NetWt included some overlapping observations.
library(xts)
# convert your data to xts
xSetPt <- xts(SetPt$Value, as.POSIXct(SetPt$DateTime))
xNetWt <- xts(NetWt$Value, as.POSIXct(NetWt$DateTime))
# merge them
xm <- merge(xNetWt, xSetPt)
# fill all missing values in the SetPt column with their prior value
xm$xSetPt <- na.locf(xm$xSetPt)
# plot the difference
plot(na.omit(xm$xNetWt - xm$xSetPt))
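If neither package is available, the "most recent setpoint before each weight" lookup can also be sketched in base R with findInterval(). The small data frames set_pt and net_wt below are invented for the illustration and deliberately overlap in time. Weights recorded before the first setpoint get NA.

```r
# Toy data (values are illustrative only)
set_pt <- data.frame(
  DateTime = as.POSIXct(c("2014-02-10 08:10:49", "2014-02-10 07:50:26"), tz = "UTC"),
  Value    = c(51.35, 51.70)
)
net_wt <- data.frame(
  DateTime = as.POSIXct(c("2014-02-10 08:00:00", "2014-02-10 08:30:00",
                          "2014-02-11 12:51:50"), tz = "UTC"),
  Value    = c(51.90, 51.00, 50.90)
)

# findInterval() needs the lookup times sorted in increasing order
set_pt <- set_pt[order(set_pt$DateTime), ]

# For each weight: index of the last setpoint at or before its timestamp
idx <- findInterval(as.numeric(net_wt$DateTime), as.numeric(set_pt$DateTime))
idx[idx == 0] <- NA   # 0 means no setpoint has been recorded yet

net_wt$SetPt <- set_pt$Value[idx]
net_wt$Diff  <- net_wt$Value - net_wt$SetPt
```

The resulting Diff column can be plotted against net_wt$DateTime in the same way as the xts result.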

R Querying Dates in one data frame with dates from a separate data frame

After searching for a few hours, it seems I can't find a solution to the following problem. I've got 2 data frames: one contains a column of observation dates, the other contains a start date and an end date:
for example:
head(x)
station temp obsdate
311820 65.0 1973-01-01
311821 62.0 1973-01-02
etc...
head(seasonDates)
season startDate endDate
A 1973-11-01 1974-06-30
B 1974-11-01 1975-06-30
C 1975-11-01 1976-06-30
etc...
I'd like to assign the 'season' from the 'seasonDates' data frame to the 'x' data frame if the observation date 'obsdate' is within the range of dates indicated by 'startDate' to 'endDate'. Any help is greatly appreciated.
Assuming the three date columns are of class "Date" :
library(sqldf)
sqldf("select * from x left join seasonDates on
(obsdate between startDate and endDate)")
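A base R sketch of the same range lookup, in case sqldf is not an option, builds a logical matrix with outer() and picks the matching season per observation. The toy data frames obs and seasons are assumptions modelled on the question. The matrix has one cell per observation and season pair, so this suits a modest number of seasons.

```r
# Toy data modelled on the question (values are illustrative only)
obs <- data.frame(station = c(311820, 311821, 311822),
                  obsdate = as.Date(c("1973-01-01", "1973-12-15", "1975-06-30")))
seasons <- data.frame(season    = c("A", "B"),
                      startDate = as.Date(c("1973-11-01", "1974-11-01")),
                      endDate   = as.Date(c("1974-06-30", "1975-06-30")),
                      stringsAsFactors = FALSE)

# inside[i, j] is TRUE when observation i falls within season j (inclusive bounds)
d <- as.numeric(obs$obsdate)
inside <- outer(d, as.numeric(seasons$startDate), ">=") &
          outer(d, as.numeric(seasons$endDate), "<=")

# Index of the first matching season per observation, NA when none matches
hit <- apply(inside, 1, function(r) which(r)[1])
obs$season <- seasons$season[hit]
```

Observations outside every season keep NA, which mirrors what the left join returns for unmatched rows.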
