I have two data frames of different lengths:
NROW(data) = 20000
NROW(database) = 8000
Both data frames contain date-time values in the format YYYY-MM-DD HH:MM:SS, and the timestamps do not match exactly between the two.
What I want is to merge them by the nearest date-time, keeping only the records that exist in database.
I tried the approach posted in another Stack Exchange question,
[R – How to join two data frames by nearest time-date?][1],
which is based on the data.table library. I tried the following, without success:
require("data.table")
database <- data.table(database)
data <- data.table(data)
setkey(data, "timekey")
setkey(database, "timekeyd")
database <- data[database, roll = "nearest"]
But the merge was almost completely wrong. You can see how the merge was performed in the following table, which shows only the two key columns (timekey and timekeyd):
  timekey             timekeyd
1 2017-11-01 00:00:00 2017-10-31 21:00:00
2 2017-11-01 00:00:00 2017-10-31 22:10:00
3 2017-11-02 19:00:00 2017-11-02 21:00:00
4 2017-11-02 19:00:00 2017-11-02 21:00:00
5 2017-11-03 20:08:00 2017-11-03 22:10:00
6 2017-11-04 19:00:00 2017-11-04 21:00:00
7 2017-11-04 19:00:00 2017-11-04 21:00:00
8 2017-11-05 19:00:00 2017-11-05 21:10:00
9 2017-11-07 18:00:00 2017-11-07 20:00:00
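For reference, `X[Y, roll = "nearest"]` does keep the rows of `Y`, so the direction of the join above matches the goal. A common cause of a wrong "nearest" result is that the key columns are not actually POSIXct (e.g. they are character, or in different time zones), so the roll is not computed on time distance. A minimal sketch of a correct nearest join, assuming the column names `timekey`/`timekeyd` from the question:

```r
library(data.table)

# Toy tables; the key columns must be POSIXct (same time zone) for
# roll = "nearest" to measure actual time distance
data <- data.table(timekey = as.POSIXct(c("2017-11-01 00:00:00",
                                          "2017-11-02 21:00:00")),
                   x = c("a", "b"))
database <- data.table(timekeyd = as.POSIXct(c("2017-11-01 00:10:00",
                                               "2017-11-02 19:00:00")),
                       y = c("p", "q"))

# Join each row of `database` to the nearest row of `data`;
# the result has one row per row of `database`
result <- data[database, roll = "nearest", on = .(timekey = timekeyd)]
```

The `on =` form avoids `setkey` entirely, which also makes it harder to accidentally roll on the wrong column.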
I have a data table like below:
library(data.table)
DT1 <- data.table(
  id = c(1,2,3,4,3,2),
  in_time = c("2017-11-01 08:37:35","2017-11-01 09:07:44","2017-11-01 09:46:16","2017-11-01 10:32:29","2017-11-01 10:59:25","2017-11-01 13:24:12"),
  out_time = c("2017-11-01 08:45:35","2017-11-01 09:15:30","2017-11-01 10:11:16","2017-11-01 10:37:05","2017-11-01 11:45:25","2017-11-01 14:10:09")
)
It records the time each person enters and exits the store.
Now I want to count the people in the store every 5 minutes (standard 5-minute boundaries, i.e. minute 0, 5, 10, 15, ..., 60). If there is no one, I need a 0 value.
So I tried with
library(lubridate)
DT1[,time:=ymd_hms(in_time)]
DT1[,time:=ceiling_date(time,"5mins")]
DT1[,.N,by=list(time)]
which only gives how many people entered at each time, but I am now stuck on how to take the out_time into account. For example, id 1 entered at 2017-11-01 08:37:35 and left at 2017-11-01 08:45:35, so he will be in the shop for the 5-minute interval from 2017-11-01 08:40:00 to 2017-11-01 08:45:00, and not in 2017-11-01 08:50:00, and so on.
An id can repeat multiple times, e.g. one person dropping by the store multiple times a day.
Any help is appreciated.
Here is an option using data.table::foverlaps:
# generate intervals of 5 mins (fmt is defined in the "data" section below)
times <- seq(as.POSIXct("2017-11-01 00:00:00", format=fmt),
             as.POSIXct("2017-11-02 00:00:00", format=fmt),
             by="5 min")
DT2 <- data.table(in_time=times[-length(times)], out_time=times[-1L], key=c("in_time","out_time"))
# set keys before foverlaps
setkey(DT1, in_time, out_time)
# find overlaps and count distinct ids in each 5-min interval.
# !is.na(id) truncates the output for checking; remove it in actual code
foverlaps(DT2, DT1)[!is.na(id), uniqueN(id), .(i.in_time, i.out_time)]
And if id is unique in each time interval, the last line of code can instead be foverlaps(DT2, DT1)[, sum(!is.na(id)), .(i.in_time, i.out_time)].
First 8 rows of output:
i.in_time i.out_time V1
1: 2017-11-01 08:35:00 2017-11-01 08:40:00 1
2: 2017-11-01 08:40:00 2017-11-01 08:45:00 1
3: 2017-11-01 08:45:00 2017-11-01 08:50:00 1
4: 2017-11-01 09:05:00 2017-11-01 09:10:00 1
5: 2017-11-01 09:10:00 2017-11-01 09:15:00 1
6: 2017-11-01 09:15:00 2017-11-01 09:20:00 1
7: 2017-11-01 09:45:00 2017-11-01 09:50:00 1
8: 2017-11-01 09:50:00 2017-11-01 09:55:00 1
data:
library(data.table)
DT1 <- data.table(
id=c(1,2,3,4,3,2),
in_time=c("2017-11-01 08:37:35","2017-11-01 09:07:44","2017-11-01 09:46:16","2017-11-01 10:32:29","2017-11-01 10:59:25","2017-11-01 13:24:12"),
out_time=c("2017-11-01 08:45:35","2017-11-01 09:15:30","2017-11-01 10:11:16","2017-11-01 10:37:05","2017-11-01 11:45:25","2017-11-01 14:10:09")
)
cols <- c("in_time", "out_time")
fmt <- "%Y-%m-%d %T"
DT1[, (cols) := lapply(.SD, as.POSIXct, format=fmt), .SDcols=cols]
I have two data frames with different numbers of rows and columns; each has a date interval. df has an additional column which indicates some kind of attribute. My goal is to extract information from df (the attributes) into df2 under certain conditions. The procedure should be the following:
For each date interval of df2, check whether any interval in df overlaps with it. If yes, create a column in df2 that lists the attributes of the overlapping intervals of df. Multiple attributes can be matched to a single interval of df2.
I created the following example of my data:
library(lubridate)
date1 <- as.Date(c('2017-11-1','2017-11-1','2017-11-4'))
date2 <- as.Date(c('2017-11-5','2017-11-3','2017-11-5'))
df <- data.frame(matrix(NA,nrow=3, ncol = 4))
names(df) <- c("Begin_A", "End_A", "Interval", "Attribute")
df$Begin_A <- date1
df$End_A <- date2
df$Interval <- df$Begin_A %--% df$End_A
df$Attribute <- as.character(c("Attr1","Attr2","Attr3"))
### Second df:
date1 <- as.Date(c('2017-11-2','2017-11-5','2017-11-7','2017-11-1'))
date2 <- as.Date(c('2017-11-3','2017-11-6','2017-11-8','2017-11-1'))
df2 <- data.frame(matrix(NA,nrow=4, ncol = 3))
names(df2) <- c("Begin_A", "End_A", "Interval")
df2$Begin_A <- date1
df2$End_A <- date2
df2$Interval <- df2$Begin_A %--% df2$End_A
This results in these data frames:
df:
Begin_A End_A Interval Attribute
2017-11-01 2017-11-05 2017-11-01 UTC--2017-11-05 UTC Attr1
2017-11-01 2017-11-03 2017-11-01 UTC--2017-11-03 UTC Attr2
2017-11-04 2017-11-05 2017-11-04 UTC--2017-11-05 UTC Attr3
df2:
Begin_A End_A Interval
2017-11-02 2017-11-03 2017-11-02 UTC--2017-11-03 UTC
2017-11-05 2017-11-06 2017-11-05 UTC--2017-11-06 UTC
2017-11-07 2017-11-08 2017-11-07 UTC--2017-11-08 UTC
2017-11-01 2017-11-01 2017-11-01 UTC--2017-11-01 UTC
My desired data frames look like this:
Begin_A End_A Interval Matched_Attr
2017-11-02 2017-11-03 2017-11-02 UTC--2017-11-03 UTC Attr1;Attr2
2017-11-05 2017-11-06 2017-11-05 UTC--2017-11-06 UTC Attr1;Attr3
2017-11-07 2017-11-08 2017-11-07 UTC--2017-11-08 UTC NA
2017-11-01 2017-11-01 2017-11-01 UTC--2017-11-01 UTC Attr1;Attr2
I already looked into the int_overlaps() function but could not make the part where it scans through all intervals of another column work.
Is there a solution that makes use of the tidyverse environment?
Using the tidyverse's lubridate package and its function int_overlaps(), you can write a simple for loop over the individual values of df2$Interval as follows:
df2$Matched_Attr <- NA
for(i in 1:nrow(df2)){
  df2$Matched_Attr[i] <- paste(df$Attribute[int_overlaps(df2$Interval[i], df$Interval)], collapse=", ")
}
giving the following outcome:
# Begin_A End_A Interval Matched_Attr
#1 2017-11-02 2017-11-03 2017-11-02 UTC--2017-11-03 UTC Attr1, Attr2
#2 2017-11-05 2017-11-06 2017-11-05 UTC--2017-11-06 UTC Attr1, Attr3
#3 2017-11-07 2017-11-08 2017-11-07 UTC--2017-11-08 UTC
#4 2017-11-01 2017-11-01 2017-11-01 UTC--2017-11-01 UTC Attr1, Attr2
I left the NA strategy open, but the additional line df2$Matched_Attr[df2$Matched_Attr==""] <- NA would return the exact desired outcome.
In response to your comment (only perform the above action when the condition df$ID[i] == df2$ID[i] is met), the implementation follows:
library(lubridate)
#df
df <- data.frame(Attribute=c("Attr1","Attr2","Attr3"),
ID = c(3,2,1),
Begin_A=as.Date(c('2017-11-1','2017-11-1','2017-11-4')),
End_A=as.Date(c('2017-11-5','2017-11-3','2017-11-5')))
df$Interval <- df$Begin_A %--% df$End_A
### Second df:
df2 <- data.frame(ID=c(3,4,5),
Begin_A=as.Date(c('2017-11-2','2017-11-5','2017-11-7')),
End_A=as.Date(c('2017-11-3','2017-11-6','2017-11-8')))
df2$Interval <- df2$Begin_A %--% df2$End_A
df2$Matched_Attr <- NA
for(i in 1:nrow(df2)){
  if(df2$ID[i]==df$ID[i]){
    df2$Matched_Attr[i] <- paste(df$Attribute[int_overlaps(df2$Interval[i], df$Interval)], collapse=", ")
  }
}
print(df2)
# ID Begin_A End_A Interval Matched_Attr
#1 3 2017-11-02 2017-11-03 2017-11-02 UTC--2017-11-03 UTC Attr1, Attr2
#2 4 2017-11-05 2017-11-06 2017-11-05 UTC--2017-11-06 UTC <NA>
#3 5 2017-11-07 2017-11-08 2017-11-07 UTC--2017-11-08 UTC <NA>
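As a side note, the first loop above can also be written without explicit index bookkeeping, e.g. with sapply. A self-contained sketch using the same toy data as the question (same df/df2 setup, no ID condition):

```r
library(lubridate)

# Toy data from the question: attributes matched by interval overlap
df <- data.frame(Begin_A = as.Date(c("2017-11-1", "2017-11-1", "2017-11-4")),
                 End_A   = as.Date(c("2017-11-5", "2017-11-3", "2017-11-5")),
                 Attribute = c("Attr1", "Attr2", "Attr3"))
df$Interval <- df$Begin_A %--% df$End_A

df2 <- data.frame(Begin_A = as.Date(c("2017-11-2", "2017-11-5", "2017-11-7", "2017-11-1")),
                  End_A   = as.Date(c("2017-11-3", "2017-11-6", "2017-11-8", "2017-11-1")))
df2$Interval <- df2$Begin_A %--% df2$End_A

# One sapply call instead of the explicit for loop; int_overlaps() is
# vectorized over df$Interval, so each row of df2 is compared to all of df
df2$Matched_Attr <- sapply(seq_len(nrow(df2)), function(i)
  paste(df$Attribute[int_overlaps(df2$Interval[i], df$Interval)], collapse = ", "))
df2$Matched_Attr[df2$Matched_Attr == ""] <- NA
```

The result matches the desired output in the question, including the NA for the non-overlapping 2017-11-07 to 2017-11-08 row.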
I am trying to dynamically read dates in R from csv or xlsx files. The challenge is that the dates could be in any combination of %d for day, %m, %b, or %B for month, and %y or %Y for year, in any order of day, month, and year.
Are there any ready-made functions I can use, or is the solution to read the characters from a series of dates and then determine which format it could be?
Any pointers highly appreciated.
The function findAndTransformDates from dataPreparation will automatically find the format in each column and transform it.
NB: it only works if all rows of a column share the same format.
For example:
require(dataPreparation)
data("messy_adult")
head(messy_adult[, .(date1, date2, date3, date4)])
date1 date2 date3 date4
1: 2017-10-07 NA 19-Jan-2017 21-January-2017
2: 2017-31-12 1513465200 06-Jun-2017 08-June-2017
3: 2017-12-10 1511305200 03-Jul-2017 05-July-2017
4: 2017-06-09 1485126000 19-Jul-2017 21-July-2017
5: 2017-02-03 1498345200 16-May-2017 18-May-2017
6: 2017-04-10 1503183600 02-Apr-2017 04-April-2017
messy_adult <- findAndTransformDates(messy_adult)
head(messy_adult[, .(date1, date2, date3, date4)])
date1 date2 date3 date4
1: 2017-07-10 <NA> 2017-01-19 2017-01-21
2: 2017-12-31 2017-12-17 00:00:00 2017-06-06 2017-06-08
3: 2017-10-12 2017-11-22 00:00:00 2017-07-03 2017-07-05
4: 2017-09-06 2017-01-23 00:00:00 2017-07-19 2017-07-21
5: 2017-03-02 2017-06-25 01:00:00 2017-05-16 2017-05-18
6: 2017-10-04 2017-08-20 01:00:00 2017-04-02 2017-04-04
Hope it helps.
Disclaimer: I'm the author of this package.
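If you prefer not to add a package dependency for this, lubridate's parse_date_time() can also guess among a set of candidate orders, even within a single mixed vector. A sketch (the orders vector is an assumption and must cover the formats you expect to encounter):

```r
library(lubridate)

# Mixed day-month-year and year-month-day strings in one vector;
# parse_date_time() tries each order and picks the one that parses
x <- c("21-January-2017", "2017-06-08", "19-Jul-2017")
res <- parse_date_time(x, orders = c("dmy", "ymd"))
```

Unlike findAndTransformDates, this works row by row, so it can handle a column whose rows do not all share one format, at the cost of ambiguity when several orders match the same string.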
I was working with a time series dataset having hourly data. The data contained a few missing values so I tried to create a dataframe (time_seq) with the correct time value and do a merge with the original data so the missing values become 'NA'.
> data
date value
7980 2015-03-30 20:00:00 78389
7981 2015-03-30 21:00:00 72622
7982 2015-03-30 22:00:00 65240
7983 2015-03-30 23:00:00 47795
7984 2015-03-31 08:00:00 37455
7985 2015-03-31 09:00:00 70695
7986 2015-03-31 10:00:00 68444
# converting the date in the data to POSIXct format.
> data$date <- format.POSIXct(data$date,'%Y-%m-%d %H:%M:%S')
# creating a dataframe with the correct sequence of dates.
> time_seq <- seq(from = as.POSIXct("2014-05-01 00:00:00"),
to = as.POSIXct("2015-04-30 23:00:00"), by = "hour")
> df <- data.frame(date=time_seq)
> df
date
8013 2015-03-30 20:00:00
8014 2015-03-30 21:00:00
8015 2015-03-30 22:00:00
8016 2015-03-30 23:00:00
8017 2015-03-31 00:00:00
8018 2015-03-31 01:00:00
8019 2015-03-31 02:00:00
8020 2015-03-31 03:00:00
8021 2015-03-31 04:00:00
8022 2015-03-31 05:00:00
8023 2015-03-31 06:00:00
8024 2015-03-31 07:00:00
# merging with the original data
> a <- merge(data,df, x.by = data$date, y.by = df$date ,all=TRUE)
> a
date value
4005 2014-07-23 07:00:00 37003
4006 2014-07-23 07:30:00 NA
4007 2014-07-23 08:00:00 37216
4008 2014-07-23 08:30:00 NA
The values I get after merging are incorrect and contain half-hourly values. What would be the correct approach for solving this?
Why is the merge result in 30-minute intervals when both my data frames are hourly?
PS: I looked into this question: Fastest way for filling-in missing dates for data.table and followed the steps, but it didn't help.
You can use the padr package to solve this problem.
library(padr)
library(dplyr) #for the pipe operator
data %>%
  pad() %>%          # insert the missing hourly timestamps (value becomes NA)
  fill_by_value()    # replace those NAs, by default with 0
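To make the behavior concrete, here is a self-contained sketch with a few rows shaped like the question's data (column names `date`/`value` are assumed to mirror it):

```r
library(padr)

# Toy hourly series with a gap between 21:00 and 08:00
data <- data.frame(
  date  = as.POSIXct(c("2015-03-30 20:00:00", "2015-03-30 21:00:00",
                       "2015-03-31 08:00:00")),
  value = c(78389, 72622, 37455)
)

# pad() detects the hourly interval and inserts the 10 missing rows,
# with NA in `value`; fill_by_value() then replaces those NAs with 0
padded <- pad(data)
filled <- fill_by_value(padded, value)
```

If you only want the missing hours marked as NA (as the question asked), stop after pad() and skip fill_by_value().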
I am trying to subset an xts object of OHLC hourly data with a vector.
If I create the vector myself with the following command
lookup = c("2012-01-12", "2012-01-31", "2012-03-05", "2012-03-19")
testdfx[lookup]
I get the correct data displayed, which shows all the hours matching the dates in the vector (00:00 to 23:00).
> head(testdfx[lookup])
open high low close
2012-01-12 00:00:00 1.27081 1.27217 1.27063 1.27211
2012-01-12 01:00:00 1.27212 1.27216 1.27089 1.27119
2012-01-12 02:00:00 1.27118 1.27166 1.27017 1.27133
2012-01-12 03:00:00 1.27134 1.27272 1.27133 1.27261
2012-01-12 04:00:00 1.27260 1.27262 1.27141 1.27183
2012-01-12 05:00:00 1.27183 1.27230 1.27145 1.27165
> tail(testdfx[lookup])
open high low close
2012-03-19 18:00:00 1.32451 1.32554 1.32386 1.32414
2012-03-19 19:00:00 1.32417 1.32465 1.32331 1.32372
2012-03-19 20:00:00 1.32373 1.32415 1.32340 1.32372
2012-03-19 21:00:00 1.32373 1.32461 1.32366 1.32376
2012-03-19 22:00:00 1.32377 1.32424 1.32359 1.32366
2012-03-19 23:00:00 1.32364 1.32406 1.32333 1.32336
However, when I extract dates from an object and create a vector to use for subsetting, I only get the hours 00:00-19:00 displayed in my subset.
> head(testdfx[dates])
open high low close
2007-01-05 00:00:00 1.3092 1.3093 1.3085 1.3088
2007-01-05 01:00:00 1.3087 1.3092 1.3075 1.3078
2007-01-05 02:00:00 1.3079 1.3091 1.3078 1.3084
2007-01-05 03:00:00 1.3083 1.3084 1.3073 1.3074
2007-01-05 04:00:00 1.3073 1.3080 1.3061 1.3071
2007-01-05 05:00:00 1.3070 1.3072 1.3064 1.3069
> tail(euro[nfp.releases])
open high low close
2014-01-10 14:00:00 1.35892 1.36625 1.35728 1.36366
2014-01-10 15:00:00 1.36365 1.36784 1.36241 1.36743
2014-01-10 16:00:00 1.36742 1.36866 1.36693 1.36719
2014-01-10 17:00:00 1.36720 1.36752 1.36579 1.36617
2014-01-10 18:00:00 1.36617 1.36663 1.36559 1.36624
2014-01-10 19:00:00 1.36630 1.36717 1.36585 1.36702
I have compared both objects containing the required dates and they appear to be the same.
> class(lookup)
[1] "character"
> class(nfp.releases)
[1] "character"
> str(lookup)
chr [1:4] "2012-01-12" "2012-01-31" "2012-03-05" "2012-03-19"
> str(nfp.releases)
chr [1:86] "2014-02-07" "2014-01-10" "2013-12-06" "2013-11-08" ..
I am new to R but have tried everything over the past 3 days to get this to work. If I can't do it this way, I will end up having to create the variable by hand, but as it has 86 dates this may take some time.
Thanks in advance.
I cannot reproduce your problem:
library(xts)
lookup <- c("2012-01-12", "2012-01-31", "2012-03-05", "2012-03-19")
time_index <- seq(from = as.POSIXct("2012-01-01 07:00"),
                  to = as.POSIXct("2012-05-17 18:00"), by = "hour")
set.seed(1)
value <- matrix(rnorm(n = 4*length(time_index)),length(time_index),4)
testdfx <- xts(value, order.by = time_index)
testdfx[lookup[1]]
testdfx["2012-01-12"]
Thanks for the response guys. I actually thought I had deleted this thread, but obviously not.
The problem in the case above was to be found about three feet from the computer. When looking through the data I was only interested in Fridays, which is also when the FX market closes down for the weekend.
Sorry to have wasted your time, and admin please remove.