I have two data frames of different lengths:
NROW(data) = 20000
NROW(database) = 8000
Both data frames contain date-time values in the format YYYY-MM-DD HH:MM:SS, and the timestamps do not match exactly between the two.
What I want is to merge them by the nearest date-time, keeping only the records that exist in database.
I tried the approach posted in another Stack Exchange question,
[R – How to join two data frames by nearest time-date?][1],
which is based on the data.table library. I tried the following, without success:
require("data.table")
database <- data.table(database)
data <- data.table(data)
setkey(data, "timekey")
setkey(database, "timekeyd")
database <- data[database, roll = "nearest"]
But the merge was almost completely wrong. You can see how the merge was performed in the following table, which shows only the two key columns (timekey and timekeyd):
  timekey             timekeyd
1 2017-11-01 00:00:00 2017-10-31 21:00:00
2 2017-11-01 00:00:00 2017-10-31 22:10:00
3 2017-11-02 19:00:00 2017-11-02 21:00:00
4 2017-11-02 19:00:00 2017-11-02 21:00:00
5 2017-11-03 20:08:00 2017-11-03 22:10:00
6 2017-11-04 19:00:00 2017-11-04 21:00:00
7 2017-11-04 19:00:00 2017-11-04 21:00:00
8 2017-11-05 19:00:00 2017-11-05 21:10:00
9 2017-11-07 18:00:00 2017-11-07 20:00:00
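For reference, `X[Y, roll = "nearest"]` does keep the rows of `Y`, so the direction of the join above matches the goal. A common cause of a wrong "nearest" result is that the key columns are not actually POSIXct (e.g. they are character, or in different time zones), so the roll is not computed on time distance. A minimal sketch of a correct nearest join, assuming the column names `timekey`/`timekeyd` from the question:

```r
library(data.table)

# Toy tables; the key columns must be POSIXct (same time zone) for
# roll = "nearest" to measure actual time distance
data <- data.table(timekey = as.POSIXct(c("2017-11-01 00:00:00",
                                          "2017-11-02 21:00:00")),
                   x = c("a", "b"))
database <- data.table(timekeyd = as.POSIXct(c("2017-11-01 00:10:00",
                                               "2017-11-02 19:00:00")),
                       y = c("p", "q"))

# Join each row of `database` to the nearest row of `data`;
# the result has one row per row of `database`
result <- data[database, roll = "nearest", on = .(timekey = timekeyd)]
```

The `on =` form avoids `setkey` entirely, which also makes it harder to accidentally roll on the wrong column.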
I have a data table like below:
library(data.table)
DT1 <- data.table(
  id = c(1,2,3,4,3,2),
  in_time = c("2017-11-01 08:37:35","2017-11-01 09:07:44","2017-11-01 09:46:16","2017-11-01 10:32:29","2017-11-01 10:59:25","2017-11-01 13:24:12"),
  out_time = c("2017-11-01 08:45:35","2017-11-01 09:15:30","2017-11-01 10:11:16","2017-11-01 10:37:05","2017-11-01 11:45:25","2017-11-01 14:10:09")
)
It records the time each person enters and exits the store.
Now I want to count the people in the store every 5 minutes (standard 5-minute boundaries, i.e. minute 0, 5, 10, 15, ..., 60). If there is no one, I need a 0 value.
So I tried with
library(lubridate)
DT1[,time:=ymd_hms(in_time)]
DT1[,time:=ceiling_date(time,"5mins")]
DT1[,.N,by=list(time)]
which only gives how many people entered at each time, but I am now stuck on how to take the out_time into account. For example, id 1 entered at 2017-11-01 08:37:35 and left at 2017-11-01 08:45:35, so he will be in the shop for the 5-minute interval from 2017-11-01 08:40:00 to 2017-11-01 08:45:00, and not in 2017-11-01 08:50:00, and so on.
An id can repeat multiple times, e.g. one person dropping by the store multiple times a day.
Any help is appreciated.
Here is an option using data.table::foverlaps:
# generate intervals of 5 mins (fmt is defined in the "data" section below)
times <- seq(as.POSIXct("2017-11-01 00:00:00", format=fmt),
             as.POSIXct("2017-11-02 00:00:00", format=fmt),
             by="5 min")
DT2 <- data.table(in_time=times[-length(times)], out_time=times[-1L], key=c("in_time","out_time"))
# set keys before foverlaps
setkey(DT1, in_time, out_time)
# find overlaps and count distinct ids in each 5-min interval.
# !is.na(id) truncates the output for checking; remove it in actual code
foverlaps(DT2, DT1)[!is.na(id), uniqueN(id), .(i.in_time, i.out_time)]
And if id is unique in each time interval, the last line of code can instead be foverlaps(DT2, DT1)[, sum(!is.na(id)), .(i.in_time, i.out_time)].
First 8 rows of output:
i.in_time i.out_time V1
1: 2017-11-01 08:35:00 2017-11-01 08:40:00 1
2: 2017-11-01 08:40:00 2017-11-01 08:45:00 1
3: 2017-11-01 08:45:00 2017-11-01 08:50:00 1
4: 2017-11-01 09:05:00 2017-11-01 09:10:00 1
5: 2017-11-01 09:10:00 2017-11-01 09:15:00 1
6: 2017-11-01 09:15:00 2017-11-01 09:20:00 1
7: 2017-11-01 09:45:00 2017-11-01 09:50:00 1
8: 2017-11-01 09:50:00 2017-11-01 09:55:00 1
data:
library(data.table)
DT1 <- data.table(
id=c(1,2,3,4,3,2),
in_time=c("2017-11-01 08:37:35","2017-11-01 09:07:44","2017-11-01 09:46:16","2017-11-01 10:32:29","2017-11-01 10:59:25","2017-11-01 13:24:12"),
out_time=c("2017-11-01 08:45:35","2017-11-01 09:15:30","2017-11-01 10:11:16","2017-11-01 10:37:05","2017-11-01 11:45:25","2017-11-01 14:10:09")
)
cols <- c("in_time", "out_time")
fmt <- "%Y-%m-%d %T"
DT1[, (cols) := lapply(.SD, as.POSIXct, format=fmt), .SDcols=cols]
I have two data frames with different numbers of rows and columns; each has a date interval. df has an additional column which indicates some kind of attribute. My goal is to extract information from df (the attributes) into df2 under certain conditions. The procedure should be the following:
For each date interval of df2, check whether any interval in df overlaps with it. If yes, create a column in df2 that lists the attributes of the overlapping intervals of df. Multiple attributes can be matched to a single interval of df2.
I created the following example of my data:
library(lubridate)
date1 <- as.Date(c('2017-11-1','2017-11-1','2017-11-4'))
date2 <- as.Date(c('2017-11-5','2017-11-3','2017-11-5'))
df <- data.frame(matrix(NA,nrow=3, ncol = 4))
names(df) <- c("Begin_A", "End_A", "Interval", "Attribute")
df$Begin_A <- date1
df$End_A <- date2
df$Interval <- df$Begin_A %--% df$End_A
df$Attribute <- as.character(c("Attr1","Attr2","Attr3"))
### Second df:
date1 <- as.Date(c('2017-11-2','2017-11-5','2017-11-7','2017-11-1'))
date2 <- as.Date(c('2017-11-3','2017-11-6','2017-11-8','2017-11-1'))
df2 <- data.frame(matrix(NA,nrow=4, ncol = 3))
names(df2) <- c("Begin_A", "End_A", "Interval")
df2$Begin_A <- date1
df2$End_A <- date2
df2$Interval <- df2$Begin_A %--% df2$End_A
This results in these data frames:
df:
Begin_A End_A Interval Attribute
2017-11-01 2017-11-05 2017-11-01 UTC--2017-11-05 UTC Attr1
2017-11-01 2017-11-03 2017-11-01 UTC--2017-11-03 UTC Attr2
2017-11-04 2017-11-05 2017-11-04 UTC--2017-11-05 UTC Attr3
df2:
Begin_A End_A Interval
2017-11-02 2017-11-03 2017-11-02 UTC--2017-11-03 UTC
2017-11-05 2017-11-06 2017-11-05 UTC--2017-11-06 UTC
2017-11-07 2017-11-08 2017-11-07 UTC--2017-11-08 UTC
2017-11-01 2017-11-01 2017-11-01 UTC--2017-11-01 UTC
My desired data frames look like this:
Begin_A End_A Interval Matched_Attr
2017-11-02 2017-11-03 2017-11-02 UTC--2017-11-03 UTC Attr1;Attr2
2017-11-05 2017-11-06 2017-11-05 UTC--2017-11-06 UTC Attr1;Attr3
2017-11-07 2017-11-08 2017-11-07 UTC--2017-11-08 UTC NA
2017-11-01 2017-11-01 2017-11-01 UTC--2017-11-01 UTC Attr1;Attr2
I already looked into the int_overlaps() function but could not make the part where it scans through all intervals of another column work.
Is there a solution that makes use of the tidyverse environment?
Using the tidyverse's lubridate package and its function int_overlaps(), you can write a simple for loop over the individual values of df2$Interval as follows:
df2$Matched_Attr <- NA
for(i in 1:nrow(df2)){
  df2$Matched_Attr[i] <- paste(df$Attribute[int_overlaps(df2$Interval[i], df$Interval)], collapse=", ")
}
giving the following outcome:
# Begin_A End_A Interval Matched_Attr
#1 2017-11-02 2017-11-03 2017-11-02 UTC--2017-11-03 UTC Attr1, Attr2
#2 2017-11-05 2017-11-06 2017-11-05 UTC--2017-11-06 UTC Attr1, Attr3
#3 2017-11-07 2017-11-08 2017-11-07 UTC--2017-11-08 UTC
#4 2017-11-01 2017-11-01 2017-11-01 UTC--2017-11-01 UTC Attr1, Attr2
I left the NA strategy open, but the additional line df2$Matched_Attr[df2$Matched_Attr==""] <- NA would return the exact desired outcome.
In response to your comment (only perform the above action when the condition df$ID[i] == df2$ID[i] is met), the implementation follows:
library(lubridate)
#df
df <- data.frame(Attribute=c("Attr1","Attr2","Attr3"),
ID = c(3,2,1),
Begin_A=as.Date(c('2017-11-1','2017-11-1','2017-11-4')),
End_A=as.Date(c('2017-11-5','2017-11-3','2017-11-5')))
df$Interval <- df$Begin_A %--% df$End_A
### Second df:
df2 <- data.frame(ID=c(3,4,5),
Begin_A=as.Date(c('2017-11-2','2017-11-5','2017-11-7')),
End_A=as.Date(c('2017-11-3','2017-11-6','2017-11-8')))
df2$Interval <- df2$Begin_A %--% df2$End_A
df2$Matched_Attr <- NA
for(i in 1:nrow(df2)){
  if(df2$ID[i]==df$ID[i]){
    df2$Matched_Attr[i] <- paste(df$Attribute[int_overlaps(df2$Interval[i], df$Interval)], collapse=", ")
  }
}
print(df2)
# ID Begin_A End_A Interval Matched_Attr
#1 3 2017-11-02 2017-11-03 2017-11-02 UTC--2017-11-03 UTC Attr1, Attr2
#2 4 2017-11-05 2017-11-06 2017-11-05 UTC--2017-11-06 UTC <NA>
#3 5 2017-11-07 2017-11-08 2017-11-07 UTC--2017-11-08 UTC <NA>
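As a side note, the first loop above can also be written without explicit index bookkeeping, e.g. with sapply. A self-contained sketch using the same toy data as the question (same df/df2 setup, no ID condition):

```r
library(lubridate)

# Toy data from the question: attributes matched by interval overlap
df <- data.frame(Begin_A = as.Date(c("2017-11-1", "2017-11-1", "2017-11-4")),
                 End_A   = as.Date(c("2017-11-5", "2017-11-3", "2017-11-5")),
                 Attribute = c("Attr1", "Attr2", "Attr3"))
df$Interval <- df$Begin_A %--% df$End_A

df2 <- data.frame(Begin_A = as.Date(c("2017-11-2", "2017-11-5", "2017-11-7", "2017-11-1")),
                  End_A   = as.Date(c("2017-11-3", "2017-11-6", "2017-11-8", "2017-11-1")))
df2$Interval <- df2$Begin_A %--% df2$End_A

# One sapply call instead of the explicit for loop; int_overlaps() is
# vectorized over df$Interval, so each row of df2 is compared to all of df
df2$Matched_Attr <- sapply(seq_len(nrow(df2)), function(i)
  paste(df$Attribute[int_overlaps(df2$Interval[i], df$Interval)], collapse = ", "))
df2$Matched_Attr[df2$Matched_Attr == ""] <- NA
```

The result matches the desired output in the question, including the NA for the non-overlapping 2017-11-07 to 2017-11-08 row.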
I am trying to dynamically read dates in R from csv or xlsx files. The challenge is that the dates could be in any combination of %d for day, %m, %b, or %B for month, and %y or %Y for year, in any order of day, month, and year.
Are there any ready-made functions I can use, or is the solution to read the characters from a series of dates and then determine which format it could be?
Any pointers highly appreciated.
The function findAndTransformDates from dataPreparation will automatically find the format in each column and transform it.
NB: it only works if all rows of a column share the same format.
For example:
require(dataPreparation)
data("messy_adult")
head(messy_adult[, .(date1, date2, date3, date4)])
date1 date2 date3 date4
1: 2017-10-07 NA 19-Jan-2017 21-January-2017
2: 2017-31-12 1513465200 06-Jun-2017 08-June-2017
3: 2017-12-10 1511305200 03-Jul-2017 05-July-2017
4: 2017-06-09 1485126000 19-Jul-2017 21-July-2017
5: 2017-02-03 1498345200 16-May-2017 18-May-2017
6: 2017-04-10 1503183600 02-Apr-2017 04-April-2017
messy_adult <- findAndTransformDates(messy_adult)
head(messy_adult[, .(date1, date2, date3, date4)])
date1 date2 date3 date4
1: 2017-07-10 <NA> 2017-01-19 2017-01-21
2: 2017-12-31 2017-12-17 00:00:00 2017-06-06 2017-06-08
3: 2017-10-12 2017-11-22 00:00:00 2017-07-03 2017-07-05
4: 2017-09-06 2017-01-23 00:00:00 2017-07-19 2017-07-21
5: 2017-03-02 2017-06-25 01:00:00 2017-05-16 2017-05-18
6: 2017-10-04 2017-08-20 01:00:00 2017-04-02 2017-04-04
Hope it helps.
Disclaimer: I'm the author of this package.
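If you prefer not to add a package dependency for this, lubridate's parse_date_time() can also guess among a set of candidate orders, even within a single mixed vector. A sketch (the orders vector is an assumption and must cover the formats you expect to encounter):

```r
library(lubridate)

# Mixed day-month-year and year-month-day strings in one vector;
# parse_date_time() tries each order and picks the one that parses
x <- c("21-January-2017", "2017-06-08", "19-Jul-2017")
res <- parse_date_time(x, orders = c("dmy", "ymd"))
```

Unlike findAndTransformDates, this works row by row, so it can handle a column whose rows do not all share one format, at the cost of ambiguity when several orders match the same string.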
I was working with a time series dataset having hourly data. The data contained a few missing values so I tried to create a dataframe (time_seq) with the correct time value and do a merge with the original data so the missing values become 'NA'.
> data
date value
7980 2015-03-30 20:00:00 78389
7981 2015-03-30 21:00:00 72622
7982 2015-03-30 22:00:00 65240
7983 2015-03-30 23:00:00 47795
7984 2015-03-31 08:00:00 37455
7985 2015-03-31 09:00:00 70695
7986 2015-03-31 10:00:00 68444
# converting the date in the data to POSIXct format.
> data$date <- format.POSIXct(data$date,'%Y-%m-%d %H:%M:%S')
# creating a dataframe with the correct sequence of dates.
> time_seq <- seq(from = as.POSIXct("2014-05-01 00:00:00"),
to = as.POSIXct("2015-04-30 23:00:00"), by = "hour")
> df <- data.frame(date=time_seq)
> df
date
8013 2015-03-30 20:00:00
8014 2015-03-30 21:00:00
8015 2015-03-30 22:00:00
8016 2015-03-30 23:00:00
8017 2015-03-31 00:00:00
8018 2015-03-31 01:00:00
8019 2015-03-31 02:00:00
8020 2015-03-31 03:00:00
8021 2015-03-31 04:00:00
8022 2015-03-31 05:00:00
8023 2015-03-31 06:00:00
8024 2015-03-31 07:00:00
# merging with the original data
> a <- merge(data,df, x.by = data$date, y.by = df$date ,all=TRUE)
> a
date value
4005 2014-07-23 07:00:00 37003
4006 2014-07-23 07:30:00 NA
4007 2014-07-23 08:00:00 37216
4008 2014-07-23 08:30:00 NA
The values I get after merging are incorrect and contain half-hourly values. What would be the correct approach for solving this?
Why is the merge result in 30-minute intervals when both my data frames are hourly?
PS: I looked into this question: Fastest way for filling-in missing dates for data.table and followed the steps, but it didn't help.
You can use the padr package to solve this problem.
library(padr)
library(dplyr) #for the pipe operator
data %>%
  pad() %>%          # insert the missing hourly timestamps (value becomes NA)
  fill_by_value()    # replace those NAs, by default with 0
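To make the behavior concrete, here is a self-contained sketch with a few rows shaped like the question's data (column names `date`/`value` are assumed to mirror it):

```r
library(padr)

# Toy hourly series with a gap between 21:00 and 08:00
data <- data.frame(
  date  = as.POSIXct(c("2015-03-30 20:00:00", "2015-03-30 21:00:00",
                       "2015-03-31 08:00:00")),
  value = c(78389, 72622, 37455)
)

# pad() detects the hourly interval and inserts the 10 missing rows,
# with NA in `value`; fill_by_value() then replaces those NAs with 0
padded <- pad(data)
filled <- fill_by_value(padded, value)
```

If you only want the missing hours marked as NA (as the question asked), stop after pad() and skip fill_by_value().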
I am trying to subset an xts object of OHLC hourly data with a vector.
If I create the vector myself with the following command
lookup = c("2012-01-12", "2012-01-31", "2012-03-05", "2012-03-19")
testdfx[lookup]
I get the correct data displayed, which shows all the hours matching the dates in the vector (00:00 to 23:00).
> head(testdfx[lookup])
open high low close
2012-01-12 00:00:00 1.27081 1.27217 1.27063 1.27211
2012-01-12 01:00:00 1.27212 1.27216 1.27089 1.27119
2012-01-12 02:00:00 1.27118 1.27166 1.27017 1.27133
2012-01-12 03:00:00 1.27134 1.27272 1.27133 1.27261
2012-01-12 04:00:00 1.27260 1.27262 1.27141 1.27183
2012-01-12 05:00:00 1.27183 1.27230 1.27145 1.27165
> tail(testdfx[lookup])
open high low close
2012-03-19 18:00:00 1.32451 1.32554 1.32386 1.32414
2012-03-19 19:00:00 1.32417 1.32465 1.32331 1.32372
2012-03-19 20:00:00 1.32373 1.32415 1.32340 1.32372
2012-03-19 21:00:00 1.32373 1.32461 1.32366 1.32376
2012-03-19 22:00:00 1.32377 1.32424 1.32359 1.32366
2012-03-19 23:00:00 1.32364 1.32406 1.32333 1.32336
However, when I extract dates from an object and create a vector to use for subsetting, I only get the hours 00:00-19:00 displayed in my subset.
> head(testdfx[dates])
open high low close
2007-01-05 00:00:00 1.3092 1.3093 1.3085 1.3088
2007-01-05 01:00:00 1.3087 1.3092 1.3075 1.3078
2007-01-05 02:00:00 1.3079 1.3091 1.3078 1.3084
2007-01-05 03:00:00 1.3083 1.3084 1.3073 1.3074
2007-01-05 04:00:00 1.3073 1.3080 1.3061 1.3071
2007-01-05 05:00:00 1.3070 1.3072 1.3064 1.3069
> tail(euro[nfp.releases])
open high low close
2014-01-10 14:00:00 1.35892 1.36625 1.35728 1.36366
2014-01-10 15:00:00 1.36365 1.36784 1.36241 1.36743
2014-01-10 16:00:00 1.36742 1.36866 1.36693 1.36719
2014-01-10 17:00:00 1.36720 1.36752 1.36579 1.36617
2014-01-10 18:00:00 1.36617 1.36663 1.36559 1.36624
2014-01-10 19:00:00 1.36630 1.36717 1.36585 1.36702
I have compared both objects containing the required dates and they appear to be the same.
> class(lookup)
[1] "character"
> class(nfp.releases)
[1] "character"
> str(lookup)
chr [1:4] "2012-01-12" "2012-01-31" "2012-03-05" "2012-03-19"
> str(nfp.releases)
chr [1:86] "2014-02-07" "2014-01-10" "2013-12-06" "2013-11-08" ..
I am new to R but have tried everything over the past 3 days to get this to work. If I can't do it this way, I will end up having to create the variable by hand, but as it has 86 dates this may take some time.
Thanks in advance.
I cannot reproduce your problem:
library(xts)
lookup <- c("2012-01-12", "2012-01-31", "2012-03-05", "2012-03-19")
time_index <- seq(from = as.POSIXct("2012-01-01 07:00"),
                  to = as.POSIXct("2012-05-17 18:00"), by = "hour")
set.seed(1)
value <- matrix(rnorm(n = 4*length(time_index)),length(time_index),4)
testdfx <- xts(value, order.by = time_index)
testdfx[lookup[1]]
testdfx["2012-01-12"]
Thanks for the response guys. I actually thought I had deleted this thread, but obviously not.
The problem in the case above was to be found about three feet from the computer. When looking through the data I was only interested in Fridays, which is also when the FX market closes down for the weekend.
Sorry to have wasted your time, and admin please remove.