I have two data frames with different numbers of rows and columns; each of them contains a date interval. df has an additional column which indicates some kind of attribute. My goal is to extract information from df (with the attributes) into df2 under certain conditions. The procedure should be the following:
For each date interval of df2, check if there is any interval in df which overlaps with the interval of df2. If yes, create a column in df2 which indicates the attributes matching with the overlapping interval of df. There can be multiple attributes that are matched to a specific interval of df2.
I created the following example of my data:
library(lubridate)
date1 <- as.Date(c('2017-11-1','2017-11-1','2017-11-4'))
date2 <- as.Date(c('2017-11-5','2017-11-3','2017-11-5'))
df <- data.frame(matrix(NA, nrow = 3, ncol = 4))
names(df) <- c("Begin_A", "End_A", "Interval", "Attribute")
df$Begin_A <- date1
df$End_A <- date2
df$Interval <- df$Begin_A %--% df$End_A
df$Attribute <- as.character(c("Attr1","Attr2","Attr3"))
### Second df:
date1 <- as.Date(c('2017-11-2','2017-11-5','2017-11-7','2017-11-1'))
date2 <- as.Date(c('2017-11-3','2017-11-6','2017-11-8','2017-11-1'))
df2 <- data.frame(matrix(NA, nrow = 4, ncol = 3))
names(df2) <- c("Begin_A", "End_A", "Interval")
df2$Begin_A <- date1
df2$End_A <- date2
df2$Interval <- df2$Begin_A %--% df2$End_A
This results in these data frames:
df:
Begin_A End_A Interval Attribute
2017-11-01 2017-11-05 2017-11-01 UTC--2017-11-05 UTC Attr1
2017-11-01 2017-11-03 2017-11-01 UTC--2017-11-03 UTC Attr2
2017-11-04 2017-11-05 2017-11-04 UTC--2017-11-05 UTC Attr3
df2:
Begin_A End_A Interval
2017-11-02 2017-11-03 2017-11-02 UTC--2017-11-03 UTC
2017-11-05 2017-11-06 2017-11-05 UTC--2017-11-06 UTC
2017-11-07 2017-11-08 2017-11-07 UTC--2017-11-08 UTC
2017-11-01 2017-11-01 2017-11-01 UTC--2017-11-01 UTC
My desired data frame looks like this:
Begin_A End_A Interval Matched_Attr
2017-11-02 2017-11-03 2017-11-02 UTC--2017-11-03 UTC Attr1;Attr2
2017-11-05 2017-11-06 2017-11-05 UTC--2017-11-06 UTC Attr1;Attr3
2017-11-07 2017-11-08 2017-11-07 UTC--2017-11-08 UTC NA
2017-11-01 2017-11-01 2017-11-01 UTC--2017-11-01 UTC Attr1;Attr2
I already looked into the int_overlaps() function but could not make the "scanning through all intervals of another column" part work.
Is there any solution that makes use of the tidyverse?
Using the tidyverse's lubridate package and its function int_overlaps(), you can create a simple for loop to go through the individual values of df2$Interval as follows:
df2$Matched_Attr <- NA
for(i in 1:nrow(df2)){
  df2$Matched_Attr[i] <- paste(df$Attribute[int_overlaps(df2$Interval[i], df$Interval)], collapse=", ")
}
giving the following outcome:
# Begin_A End_A Interval Matched_Attr
#1 2017-11-02 2017-11-03 2017-11-02 UTC--2017-11-03 UTC Attr1, Attr2
#2 2017-11-05 2017-11-06 2017-11-05 UTC--2017-11-06 UTC Attr1, Attr3
#3 2017-11-07 2017-11-08 2017-11-07 UTC--2017-11-08 UTC
#4 2017-11-01 2017-11-01 2017-11-01 UTC--2017-11-01 UTC Attr1, Attr2
I left the NA strategy open, but the additional line df2$Matched_Attr[df2$Matched_Attr==""] <- NA would return the exact desired outcome.
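Since you asked about the tidyverse: here is a minimal sketch of the same logic with dplyr and purrr (an assumption on my part that a tidyverse-flavoured rewrite of the loop is what you are after; it uses df and df2 exactly as built above, with the NA handling folded in):
library(dplyr)
library(purrr)
library(lubridate)

df2 <- df2 %>%
  mutate(Matched_Attr = map_chr(seq_len(n()), function(i) {
    # attributes of all df intervals overlapping df2's i-th interval
    hits <- df$Attribute[int_overlaps(Interval[i], df$Interval)]
    if (length(hits) == 0) NA_character_ else paste(hits, collapse = ", ")
  }))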
In response to your comment (only perform the above action when a df$ID[i]==df2$ID[i] condition is met), the implementation follows:
library(lubridate)
#df
df <- data.frame(Attribute = c("Attr1","Attr2","Attr3"),
                 ID = c(3,2,1),
                 Begin_A = as.Date(c('2017-11-1','2017-11-1','2017-11-4')),
                 End_A = as.Date(c('2017-11-5','2017-11-3','2017-11-5')))
df$Interval <- df$Begin_A %--% df$End_A
### Second df:
df2 <- data.frame(ID = c(3,4,5),
                  Begin_A = as.Date(c('2017-11-2','2017-11-5','2017-11-7')),
                  End_A = as.Date(c('2017-11-3','2017-11-6','2017-11-8')))
df2$Interval <- df2$Begin_A %--% df2$End_A
df2$Matched_Attr <- NA
for(i in 1:nrow(df2)){
  if(df2$ID[i]==df$ID[i]){
    df2$Matched_Attr[i] <- paste(df$Attribute[int_overlaps(df2$Interval[i], df$Interval)], collapse=", ")
  }
}
print(df2)
# ID Begin_A End_A Interval Matched_Attr
#1 3 2017-11-02 2017-11-03 2017-11-02 UTC--2017-11-03 UTC Attr1, Attr2
#2 4 2017-11-05 2017-11-06 2017-11-05 UTC--2017-11-06 UTC <NA>
#3 5 2017-11-07 2017-11-08 2017-11-07 UTC--2017-11-08 UTC <NA>
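Note that df2$ID[i]==df$ID[i] compares the two frames row by row, so it assumes df has at least as many rows as df2. If the intent were instead to match against any row of df sharing the same ID, a sketch of that variant (same data as above):
for(i in 1:nrow(df2)){
  # restrict the overlap test to the df rows that share df2's ID
  hit <- int_overlaps(df2$Interval[i], df$Interval) & df$ID == df2$ID[i]
  if(any(hit)) df2$Matched_Attr[i] <- paste(df$Attribute[hit], collapse=", ")
}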
Related
I have a data table like the one below:
library(data.table)
DT1<-data.table(
id=c(1,2,3,4,3,2),
in_time=c("2017-11-01 08:37:35","2017-11-01 09:07:44","2017-11-01 09:46:16","2017-11-01 10:32:29","2017-11-01 10:59:25","2017-11-01 13:24:12"),
out_time=c("2017-11-01 08:45:35","2017-11-01 09:15:30","2017-11-01 10:11:16","2017-11-01 10:37:05","2017-11-01 11:45:25","2017-11-01 14:10:09")
)
It contains information about the time each person enters and exits the store.
Now I want to count the people in the store every 5 minutes (on standard 5-minute boundaries: minute 0, 5, 10, 15, ..., 60). If no one is in the store, I need a 0 value.
So I tried:
library(lubridate)
DT1[,time:=ymd_hms(in_time)]
DT1[,time:=ceiling_date(time,"5mins")]
DT1[,.N,by=list(time)]
which only gives how many people entered at each time, but I am now stuck on how to take the out_time into account. For example, id 1 entered at 2017-11-01 08:37:35 and left at 2017-11-01 08:45:35, so he will be in the shop for the 5-minute interval from 2017-11-01 08:40:00 to 2017-11-01 08:45:00 and not in 2017-11-01 08:50:00, and so on.
An id can repeat multiple times, e.g. one person dropping by the store multiple times a day.
Any help is appreciated.
Here is an option using data.table::foverlaps:
#generate intervals of 5mins (fmt is defined in the data section below)
times <- seq(as.POSIXct("2017-11-01 00:00:00", format=fmt),
             as.POSIXct("2017-11-02 00:00:00", format=fmt),
             by="5 min")
DT2 <- data.table(in_time=times[-length(times)], out_time=times[-1L], key=c("in_time","out_time"))
#set keys before foverlaps
setkey(DT1, in_time, out_time)
#find overlaps and count distinct ids in each 5min interval
#!is.na(id) is only for truncating the output for checking; remove it in actual code
foverlaps(DT2, DT1)[!is.na(id), uniqueN(id), .(i.in_time, i.out_time)]
And if id is unique in each time interval, the last line of code can instead be foverlaps(DT2, DT1)[, sum(!is.na(id)), .(i.in_time, i.out_time)].
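If you also need the empty 5-minute windows reported with a 0 count (as the question requires), note that uniqueN(id) would count the NA row of a non-matching window as one distinct id; one way around that while keeping the distinct count is:
#keep all windows; empty ones count distinct ids over an empty vector, i.e. 0
foverlaps(DT2, DT1)[, .(N = uniqueN(id[!is.na(id)])), .(i.in_time, i.out_time)]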
First 8 rows of output:
i.in_time i.out_time V1
1: 2017-11-01 08:35:00 2017-11-01 08:40:00 1
2: 2017-11-01 08:40:00 2017-11-01 08:45:00 1
3: 2017-11-01 08:45:00 2017-11-01 08:50:00 1
4: 2017-11-01 09:05:00 2017-11-01 09:10:00 1
5: 2017-11-01 09:10:00 2017-11-01 09:15:00 1
6: 2017-11-01 09:15:00 2017-11-01 09:20:00 1
7: 2017-11-01 09:45:00 2017-11-01 09:50:00 1
8: 2017-11-01 09:50:00 2017-11-01 09:55:00 1
data:
library(data.table)
DT1 <- data.table(
id=c(1,2,3,4,3,2),
in_time=c("2017-11-01 08:37:35","2017-11-01 09:07:44","2017-11-01 09:46:16","2017-11-01 10:32:29","2017-11-01 10:59:25","2017-11-01 13:24:12"),
out_time=c("2017-11-01 08:45:35","2017-11-01 09:15:30","2017-11-01 10:11:16","2017-11-01 10:37:05","2017-11-01 11:45:25","2017-11-01 14:10:09")
)
cols <- c("in_time", "out_time")
fmt <- "%Y-%m-%d %T"
DT1[, (cols) := lapply(.SD, as.POSIXct, format=fmt), .SDcols=cols]
I have two data frames of different lengths: NROW(data) = 20000 and NROW(database) = 8000.
Both data frames have date-time values in the format YYYY-MM-DD HH:MM:SS, and the timestamps are not the same in each data frame.
What I want is to merge them by the nearest date-time and keep only the records that exist in database.
I tried the approach based on the data.table library posted in another Stack Exchange post, [R – How to join two data frames by nearest time-date?][1], but without success:
require("data.table")
database <- data.table(database)
data <- data.table(data)
setkey( data, "timekey")
setkey( database, "timekeyd")
database <- data[ database, roll = "nearest"]
But the merge was almost completely wrong. You can see how the merge was performed in the following table, which shows only the two keys (timekey and timekeyd):
1 2017-11-01 00:00:00 2017-10-31 21:00:00
2 2017-11-01 00:00:00 2017-10-31 22:10:00
3 2017-11-02 19:00:00 2017-11-02 21:00:00
4 2017-11-02 19:00:00 2017-11-02 21:00:00
5 2017-11-03 20:08:00 2017-11-03 22:10:00
6 2017-11-04 19:00:00 2017-11-04 21:00:00
7 2017-11-04 19:00:00 2017-11-04 21:00:00
8 2017-11-05 19:00:00 2017-11-05 21:10:00
9 2017-11-07 18:00:00 2017-11-07 20:00:00
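For reference, a minimal self-contained sketch of the intended rolling join (hypothetical column names and values based on the question). One thing worth ruling out when every match is off by a couple of hours, as in the table above, is the two key columns having been parsed in different time zones (or left as character), so both keys are parsed here explicitly as POSIXct in UTC:
library(data.table)

data <- data.table(timekey = as.POSIXct(c("2017-11-01 00:00:00",
                                          "2017-11-02 19:00:00"), tz = "UTC"),
                   x = 1:2)
database <- data.table(timekeyd = as.POSIXct(c("2017-10-31 21:00:00",
                                               "2017-11-02 21:00:00"), tz = "UTC"),
                       y = 3:4)
setkey(data, timekey)
setkey(database, timekeyd)
#for each row of database, take the data row whose timekey is nearest
database <- data[database, roll = "nearest"]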
I have a table that looks like this:
user_id timestamp
aa 2018-01-01 12:01 UTC
ab 2018-01-01 05:01 UTC
bb 2018-06-01 09:01 UTC
bc 2018-03-03 23:01 UTC
cc 2018-01-02 11:01 UTC
I have another table that has every week in 2018.
week_id week_start week_end
1 2018-01-01 2018-01-07
2 2018-01-08 2018-01-15
3 2018-01-16 2018-01-23
4 2018-01-23 2018-01-30
... ... ...
Assume the week_start is a Monday and week_end is a Sunday.
I'd like to do two things. I'd first like to join the week_id to the first table and then I'd like to assign a day to each of the timestamps. My output would look like this:
user_id timestamp week_id day_of_week
aa 2018-01-01 12:01 UTC 1 Monday
ab 2018-01-02 05:01 UTC 1 Tuesday
bb 2018-01-13 09:01 UTC 2 Friday
bc 2018-01-28 23:01 UTC 4 Friday
cc 2018-01-06 11:01 UTC 1 Saturday
In Excel I could easily do this with a vlookup. My main interest is to learn how to join tables in cases like this. For that reason, I won't accept answers that use the weekday function.
Here are both of the tables in a more accessible format.
user_id <- c("aa", "ab", "bb", "bc", "cc")
timestamp <- c("2018-01-01 12:01", "2018-01-01 05:01", "2018-06-01 09:01", "2018-03-03 23:01", "2018-01-02 11:01")
week_id <- seq(1,52)
week_start <- seq(as.Date("2018-01-01"), as.Date("2018-12-31"), 7)
week_end <- week_start + 6
week_start <- week_start[1:52]
week_end <- week_end[1:52]
table1 <- data.frame(user_id, timestamp)
table2 <- data.frame(week_id, week_start, week_end)
Using SQL, one can join two tables on a range like this. This seems the most elegant solution, since it expresses our intent directly, but we also provide some alternatives further below.
library(sqldf)
DF1$date <- as.Date(DF1$timestamp)
sqldf("select *
from DF1 a
left join DF2 b on date between week_start and week_end")
giving:
user_id timestamp date week_id week_start week_end
1 aa 2018-01-01 12:01:00 2018-01-01 1 2018-01-01 2018-01-07
2 ab 2018-01-01 05:01:00 2018-01-01 1 2018-01-01 2018-01-07
3 bb 2018-06-01 09:01:00 2018-06-01 NA <NA> <NA>
4 bc 2018-03-03 23:01:00 2018-03-04 NA <NA> <NA>
5 cc 2018-01-02 11:01:00 2018-01-02 1 2018-01-01 2018-01-07
dplyr
In a comment the poster asked whether it could be done in dplyr. It can't be done directly, since dplyr does not support complex joins, but a workaround is to do a full cross join of the two data frames, which gives rise to an nrow(DF1) * nrow(DF2) intermediate result, and then filter it down. dplyr does not directly support cross joins either, but we can simulate one by doing a full join on an identical dummy constant column that is appended to both data frames. Since we actually need a right join here to add back the unmatched rows, we do a final right_join with the original DF1 data frame. Obviously this is entirely impractical for sufficiently large inputs, but for the small input here it works. If it were known that there is a match in DF2 for every row in DF1, the right_join at the end could be omitted.
DF1 %>%
mutate(date = as.Date(timestamp), dummy = 1) %>%
full_join(DF2 %>% mutate(dummy = 1)) %>%
filter(date >= week_start & date <= week_end) %>%
select(-dummy) %>%
right_join(DF1)
R Base
findix finds the index of the row in DF2 corresponding to a date d. We then sapply it over the dates corresponding to the rows of DF1 and cbind DF1 with the corresponding DF2 rows.
findix <- function(d) c(which(d >= DF2$week_start & d <= DF2$week_end), NA)[1]
cbind(DF1, DF2[sapply(as.Date(DF1$timestamp), findix), ])
Note
The input data in reproducible form used is:
Lines1 <- "user_id timestamp
aa 2018-01-01 12:01 UTC
ab 2018-01-01 05:01 UTC
bb 2018-06-01 09:01 UTC
bc 2018-03-03 23:01 UTC
cc 2018-01-02 11:01 UTC"
# replace only the first run of spaces on each line, so the timestamp keeps its internal spaces
DF1 <- read.csv(text = sub(" +", ",", readLines(textConnection(Lines1))), strip.white = TRUE)
DF1$timestamp <- as.POSIXct(DF1$timestamp)
Lines2 <- "week_id week_start week_end
1 2018-01-01 2018-01-07
2 2018-01-08 2018-01-15
3 2018-01-16 2018-01-23
4 2018-01-23 2018-01-30"
DF2 <- read.table(text = Lines2, header = TRUE)
DF2$week_start <- as.Date(DF2$week_start)
DF2$week_end <- as.Date(DF2$week_end)
This is a case for the fuzzyjoin package. With the match_fun argument we can specify a condition for each pair of join columns: in this case table1$date >= table2$week_start and table1$date <= table2$week_end.
library(fuzzyjoin)
library(lubridate)
library(dplyr)

table1$date <- as.Date(table1$timestamp)
fuzzy_left_join(table1, table2,
                by = c("date" = "week_start", "date" = "week_end"),
                match_fun = list(`>=`, `<=`)) %>%
  mutate(day_of_week = wday(date, label = TRUE)) %>%
  select(user_id, timestamp, week_id, day_of_week)
user_id timestamp week_id day_of_week
1 aa 2018-01-01 12:01 1 Mo
2 ab 2018-01-01 05:01 1 Mo
3 bb 2018-06-01 09:01 22 Fr
4 bc 2018-03-03 23:01 9 Sa
5 cc 2018-01-02 11:01 1 Di
I'm also a smartass because I didn't use the weekday function but wday from the lubridate package.
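(The day labels above, Mo/Di/Fr/Sa, are German because wday() follows the system locale; in current lubridate versions wday() also takes a locale argument if you want English labels regardless of system settings:)
wday(as.Date("2018-01-01"), label = TRUE, locale = "C")  # on most systems gives "Mon" rather than "Mo"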
I have a data frame that contains two POSIXct columns. How can I go about calculating the number of weekdays between these two columns?
df <- data.frame(StartDate=as.POSIXct(c("2017-05-17 12:53:00","2017-08-31 21:16:00","2017-08-25 13:54:00","2017-09-06 15:47:00","2017-10-15 05:11:00"), format = "%Y-%m-%d %H:%M:%S"),
EndDate=as.POSIXct(c("2017-06-09 11:57:00","2017-11-29 16:51:00","2017-09-06 15:13:00","2018-01-03 16:22:00","2017-11-17 11:51:00"), format = "%Y-%m-%d %H:%M:%S"))
Using dplyr:
df %>%
dplyr::rowwise() %>%
dplyr::mutate(wdays = sum(!weekdays(seq(StartDate, EndDate, by="day")) %in% c("Saturday", "Sunday")))
Source: local data frame [5 x 3]
Groups: <by row>
# A tibble: 5 x 3
StartDate EndDate wdays
<dttm> <dttm> <int>
1 2017-05-17 12:53:00 2017-06-09 11:57:00 17
2 2017-08-31 21:16:00 2017-11-29 16:51:00 64
3 2017-08-25 13:54:00 2017-09-06 15:13:00 9
4 2017-09-06 15:47:00 2018-01-03 16:22:00 86
5 2017-10-15 05:11:00 2017-11-17 11:51:00 25
This makes use of the fact that dates can easily be sequenced, and that because TRUE is equal to one, we can just sum up all of the non-weekend days.
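For instance, for the first row the computation reduces to:
days <- seq(as.POSIXct("2017-05-17 12:53:00"), as.POSIXct("2017-06-09 11:57:00"), by = "day")
sum(!weekdays(days) %in% c("Saturday", "Sunday"))  # 17, matching the first row above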
Try the bizdays package:
library(bizdays) # Load the package
## Make a calendar that excludes Saturdays and Sundays
create.calendar("Workdays",weekdays = c("saturday", "sunday"))
## Calculate difference in days using the new Workdays calendar
df$bizdays <- bizdays(df$StartDate,df$EndDate,"Workdays")
df$bizdays
[1] 17 63 8 85 24
That returned 17, 63, 8, 85, and 24 business days between the start and end dates you provided. This looks right when I checked the 8 business days between 8/25/2017 and 9/6/2017.
Consider this
library(lubridate)

time <- seq(ymd_hms("2014-02-24 23:00:00"), ymd_hms("2014-06-25 08:32:00"), by="hour")
group <- rep(LETTERS[1:20], each = length(time))
value <- sample(-10^3:10^3,length(time), replace=TRUE)
df2 <- data.frame(time,group,value)
str(df2)
> head(df2)
time group value
1 2014-02-24 23:00:00 A 246
2 2014-02-25 00:00:00 A -261
3 2014-02-25 01:00:00 A 628
4 2014-02-25 02:00:00 A 429
5 2014-02-25 03:00:00 A -49
6 2014-02-25 04:00:00 A -749
I would like to create a variable that contains, for each group, the rolling mean of value over the last 5 days (not including the current observation), only considering observations that fall at the exact same hour as the current observation.
In other words:
At time 2014-02-24 23:00:00, df2['rolling_mean_same_hour'] contains the mean of the values of value observed at 23:00:00 during the last 5 days in the data (not including 2014-02-24 of course).
I would like to do that in either dplyr or data.table. I confess I have no idea how to do that.
Any ideas?
Many thanks!
You can calculate the rollmean() with your data grouped by the group variable and the hour of the time variable. Normally rollmean() will include the current observation, but you can use the shift() function to exclude it:
library(data.table); library(zoo)
setDT(df2)
df2[, .(rolling_mean_same_hour = shift(
          rollmean(value, 5, na.pad = TRUE, align = 'right'),
          n = 1,
          type = 'lag'),
        time), .(hour(time), group)]
# hour group rolling_mean_same_hour time
# 1: 23 A NA 2014-02-24 23:00:00
# 2: 23 A NA 2014-02-25 23:00:00
# 3: 23 A NA 2014-02-26 23:00:00
# 4: 23 A NA 2014-02-27 23:00:00
# 5: 23 A NA 2014-02-28 23:00:00
# ---
#57796: 22 T -267.0 2014-06-20 22:00:00
#57797: 22 T -389.6 2014-06-21 22:00:00
#57798: 22 T -311.6 2014-06-22 22:00:00
#57799: 22 T -260.0 2014-06-23 22:00:00
#57800: 22 T -26.8 2014-06-24 22:00:00
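Since the question allowed either package, here is a rough dplyr equivalent (same assumption as above: "last 5 days" means the previous 5 observations at that hour within the group, i.e. one observation per hour per day):
library(dplyr)
library(lubridate)
library(zoo)

df2 %>%
  group_by(group, hour = hour(time)) %>%
  arrange(time, .by_group = TRUE) %>%
  mutate(rolling_mean_same_hour =
           # lag() shifts the right-aligned 5-observation mean so the
           # current observation is excluded
           lag(rollmean(value, 5, fill = NA, align = "right"))) %>%
  ungroup()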