difftime in R is not calculating correctly

I am trying to find the difference between two dates in hours, and for time differences that span more than one day I am getting wildly incorrect numbers.
Here is an example of the data:
Observation Status DateTime
1 Active 2016-11-04 22:32:49
2 Inactive 2016-11-05 08:30:56
I am running this command:
getDiff <- function(x) {
  difftime(shift(x, fill = NA, type = "lead"), x, units = "hours")
}
diff_result <- dataframe[, time.diff := ifelse(Status == "Active",
                                               getDiff(DateTime), NA)]
And I get the following output:
Observation Status DateTime Time.diff
1 Active 2016-11-04 22:32:49 8757.884
2 Inactive 2016-11-05 08:30:56
This command works for all other differences that do not happen on separate days.
The correct answer should be around 10 hours, not over 8000.
Also,
> class(DataFrame$DateTime)
[1] "POSIXct" "POSIXt"
Thank you in advance!

It seems the OP has not converted the DateTime column correctly: 8757 hours is roughly one year, so the DateTime values were most likely parsed with the wrong format.
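As a quick sanity check on the magnitude (simple arithmetic, not from the original post):
8757.884 / 24
# [1] 364.9118   # roughly a year's worth of hours, pointing to a parsing problem rather than a difftime bug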
The result looks fine using OP's data at my end.
library(data.table)
getDiff <- function(x) {
  difftime(shift(x, fill = NA, type = "lead"), x, units = "hours")
}
setDT(df)
diff_result <- df[, time.diff := ifelse(Status == "Active",
                                        getDiff(DateTime), NA)]
diff_result
# Observation Status DateTime time.diff
# 1: 1 Active 2016-11-04 22:32:49 9.968611
# 2: 2 Inactive 2016-11-05 08:30:56 NA
#
Data:
df <- read.table(text =
"Observation Status DateTime
1 Active '2016-11-04 22:32:49'
2 Inactive '2016-11-05 08:30:56'",
header = TRUE, stringsAsFactors = FALSE)
df$DateTime = as.POSIXct(df$DateTime, format = "%Y-%m-%d %H:%M:%S")

Related

Categorizing data using date variable in R

I am having trouble using the date variable in my dataset to create 6-month time period categories. I want to create these categories for the period between 2017-1-1 and 2020-6-30, so the categories would run 2017-1-1 to 2017-6-30, 2017-7-1 to 2017-12-31, and so on until 2020-6-30.
I have tried the following two pieces of code to create the date categories, but I get a similar error from both:
#CODE1
#checking for date class
str(myData) # date is currently of class factor
#convert to date class
date_class <- as.Date(myData$date, format = "%m/%d/%Y")
myData$date_class <- as.Date(myData$date, format = "%m/%d/%Y")
myData
#creating timeperiod category 1
date_cat <- NA
myData$date_cat[which(myData$date_class >= "2017-1-1" & myData$date_class < "2017-7-1")] <- 1
#CODE2
#converting to date format
myData$date <- strptime(myData$date,format="%m/%d/%Y")
myData$date <- as.POSIXct(myData$date)
myData
#creating timeperiod category 1
date_cat <- NA
myData$date_cat[which(myData$date >= "2017-1-1" & myData$date < "2017-7-1")] <- 1
For both the codes I am getting a similar error
Error in `$<-.data.frame`(`*tmp*`, date_cat, value = numeric(0)) :
  replacement has 0 rows, data has 1123
Please help me with understanding where I am going wrong.
Thanks,
Priya
Here's a function (to.interval) that returns a time interval index {0, 1, 2, 3, ...}, given the event date, the index (anchor) date, and the interval width in days. It would probably be a good idea to include error checking in the function so that, for example, an event date prior to the anchor date returns NA (a sketch of that check follows the output below).
df <- data.frame(event.date=as.Date(c("2017-01-01", "2017-08-01", "2018-04-30")))
to.interval <- function(anchor.date, future.date, interval.days){
  round(as.integer(future.date - anchor.date) / interval.days, 0)
}
df$interval <- to.interval(as.Date('2017-01-01'), df$event.date, 180)
df
Output
event.date interval
1 2017-01-01 0
2 2017-08-01 1
3 2018-04-30 3
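Following the error-checking suggestion above, a minimal sketch (not part of the original answer) that returns NA when the event date falls before the anchor date:
to.interval <- function(anchor.date, future.date, interval.days){
  out <- round(as.integer(future.date - anchor.date) / interval.days, 0)
  out[future.date < anchor.date] <- NA   # guard against events before the anchor
  out
}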

Compare date intervals within the same data frame

I have searched around and found similar questions, but I can't make them work for my data.
I have a data frame with start and end dates, as well as several other factors. Ideally, the start date of a row should be later than the end date of any previous row, but the data has duplicated starts or ends, and sometimes the date intervals overlap.
I tried to make a reproducible example:
df = data.frame(start=c("2018/04/15 9:00:00","2018/04/15 9:00:00","2018/04/16 10:20:00","2018/04/16 15:30:00",
"2018/04/17 12:40:00","2018/04/17 18:50:00"),
end=c("2018/04/16 8:00:00","2018/04/16 7:10:00","2018/04/17 18:20:00","2018/04/16 16:30:00",
"2018/04/17 16:40:00","2018/04/17 19:50:00"),
value=c(10,15,11,13,14,12))
I was able to remove the duplicated start or end dates, but I can't remove the overlapping intervals. I want to create a loop that "cleans" any interval contained within a larger interval, so the result looks like this:
result = df[c(1,3,6),]
I thought I could make a loop that would "clean" both duplicates and overlapping intervals, but I can't make it work.
Any suggestions?
The data.table package is suited to this kind of problem, using the overlapping-join function foverlaps (inspired by the findOverlaps function from the Bioconductor package IRanges) and then an anti-join (data.table syntax: B[!A, on=...]) to remove the inner intervals.
library(data.table)
cols <- c("start", "end")
setDT(df)
df[, (cols) := lapply(.SD, function(x) as.POSIXct(x, format="%Y/%m/%d %H:%M:%S")), .SDcols=cols]
setkeyv(df, cols)
anti <- foverlaps(df, df, type="within")[start!=i.start | end!=i.end | value!=i.value]
df[!anti, on=.(start=i.start, end=i.end, value=i.value)]
# start end value
# 1: 2018-04-15 09:00:00 2018-04-16 08:00:00 10
# 2: 2018-04-16 10:20:00 2018-04-17 18:20:00 11
# 3: 2018-04-17 18:50:00 2018-04-17 19:50:00 12
An alternative approach is to use %within% from the lubridate package:
library(lubridate)
# transform characters to dates
start_time <- as_datetime(df[ , "start"], tz = "UTC")
end_time <- as_datetime(df[ , "end"], tz = "UTC")
# construct intervals
start_end_intrvls <- interval(start_time, end_time)
# find indices of the non-within intervals
not_within <- !(sapply(FUN = function(i) any(start_end_intrvls[i] %within% start_end_intrvls[-i]),
X = seq(along.with = df[ , "start"])))
df[not_within, ]
# start end value
# 1 2018/04/15 9:00:00 2018/04/16 8:00:00 10
# 3 2018/04/16 10:20:00 2018/04/17 18:20:00 11
# 6 2018/04/17 18:50:00 2018/04/17 19:50:00 12
Update
The as_datetime() function causes an error when applied to a tibble:
as_datetime(tibble("2018/04/15 9:00:00"), tz = "UTC")
Error in as.POSIXct.default(x) :
do not know how to convert 'x' to class “POSIXct”
The solution above can be modified to work around this by substituting as.POSIXlt() for as_datetime():
df_tibble <- tibble(start=c("2018/04/15 9:00:00","2018/04/15 9:00:00","2018/04/16 10:20:00",
"2018/04/16 15:30:00", "2018/04/17 12:40:00","2018/04/17 18:50:00"),
end=c("2018/04/16 8:00:00","2018/04/16 7:10:00","2018/04/17 18:20:00","2018/04/16 16:30:00",
"2018/04/17 16:40:00","2018/04/17 19:50:00"), value=c(10,15,11,13,14,12))
start_time_lst <- lapply(FUN = function(i) as.POSIXlt(as.character(df_tibble[i , "start"]),
tz = "UTC"),
X = seq(along.with = unlist(df_tibble[ , "start"])))
end_time_lst <- lapply(FUN = function(i) as.POSIXlt(as.character(df_tibble[ i, "end"]),
tz = "UTC"),
X = seq(along.with = unlist(df_tibble[ , "end"])))
start_end_intrvls <- lapply(function(i) interval(start_time_lst[[i]] , end_time_lst[[i]]),
X = seq(along.with = unlist(df_tibble[ , "start"])))
not_within <- sapply(function(i) !(any(unlist(Map(`%within%`,
start_end_intrvls[[i]], start_end_intrvls[-i])))),
X = seq(along.with = unlist(df_tibble[ , "start"])))
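Subsetting with the resulting logical vector then drops the nested intervals, as in the data.frame case above (assuming the objects built in the previous block):
df_tibble[not_within, ]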

Count number of occurrences in date range in R

I have a dataframe with a number of accounts, their status and the start and endtime for that status. I would like to report on the number of accounts in each of these statuses over a date range. The data looks like the df below, with the resulting report. (Actual data contains more state values. N/A values are shown with a dummy date far in the future.)
df <- data.frame(account = c(1,1,2,3),
state = c("Open","Closed","Open","Open"),
startdate = c("2016-01-01","2016-04-04","2016-03-02","2016-08-01"),
enddate = c("2016-04-04","2999-01-01","2016-05-02","2016-08-05")
)
report <- data.frame(date = seq(from = as.Date("2016-04-01"),by="1 day", length.out = 6),
number.open = c(2,2,2,1,1,1)
)
I have looked at options involving rowwise() and mutate from dplyr and foverlaps from data.table, but haven't been able to code it up so it works.
(See Checking if Date is Between two Dates in R)
We can use sapply to do this for us:
report$NumberOpen <-
sapply(report$date, function(x)
sum(as.Date(df1$startdate) < as.Date(x) &
as.Date(df1$enddate) > as.Date(x) &
df1$state == 'Open'))
# report
# date NumberOpen
# 1 2016-04-01 2
# 2 2016-04-02 2
# 3 2016-04-03 2
# 4 2016-04-04 1
# 5 2016-04-05 1
# 6 2016-04-06 1
data
df1 <- data.frame(account = c(1,1,2,3),
state = c("Open","Closed","Open","Open"),
startdate = c("2016-01-01","2016-04-04","2016-03-02","2016-08-01"),
enddate = c("2016-04-04","2999-01-01","2016-05-02","2016-08-05")
)
report <- data.frame(date = seq(from = as.Date("2016-04-01"),by="1 day", length.out = 6)
)
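Since the question mentions foverlaps, here is a hedged sketch of that route (an alternative to the sapply answer above, not part of it). Note two differences: foverlaps treats interval endpoints as inclusive, so a report date equal to a start or end date is counted (unlike the strict < and > used above), and nomatch = 0L drops report dates with zero open accounts from the result.
library(data.table)
acc <- as.data.table(df1)[state == "Open"]
acc[, `:=`(startdate = as.Date(startdate), enddate = as.Date(enddate))]
rpt <- data.table(date = report$date, date2 = report$date)   # zero-width intervals, one per report date
setkey(acc, startdate, enddate)
ov <- foverlaps(rpt, acc, by.x = c("date", "date2"), type = "within", nomatch = 0L)
ov[, .(NumberOpen = .N), by = date]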

Count rows based on multiple consecutive time flows

I have a large file of time-series data, which looks as follows. The dataset covers years, in increments of 15 minutes. A small subset looks like:
uniqueid time
a 2014-04-30 23:30:00
a 2014-04-30 23:45:00
a 2014-05-01 00:00:00
a 2014-05-01 00:15:00
a 2014-05-12 13:45:00
a 2014-05-12 14:00:00
b 2014-05-12 13:45:00
b 2014-05-12 14:00:00
b 2014-05-12 14:30:00
To reproduce above:
time<-c("2014-04-30 23:30:00","2014-04-30 23:45:00","2014-05-01 00:00:00","2014-05-01 00:15:00",
"2014-05-12 13:45:00","2014-05-12 14:00:00","2014-05-12 13:45:00","2014-05-12 14:00:00",
"2014-05-12 14:30:00")
uniqueid<-c("a","a","a","a","a","a","b","b","b")
mydf<-data.frame(uniqueid,time)
My goal is to count the number of rows per unique id per consecutive time flow. A consecutive timespan is when a unique id is stamped every 15 minutes in a row (such as id a, which is stamped from 30.04.14 23:30 until 01.05.14 00:15, hence 4 rows). When this flow of 15-minute increments is disrupted (after 01.05.14 00:15 there is no stamp at 00:30, so the flow is broken), the next timestamp should count as the start of a new consecutive flow, and the number of rows is again counted until that flow is disrupted. Time is POSIX.
As you can see in above example; a consecutive timeflow may cover different days, different months, or different years. I have many unique ids (and as said, a very large file), so I'm looking for a way that my computer can handle (loops probably wouldn't work).
I am looking for output something like:
uniqueid flow number_rows
a 1 4
a 2 2
b 3 2
b 4 1
I have looked into some time packages (such as lubridate), but given my limited R knowledge, I don't even know where to begin.
I hope all is clear - if not, I'd be happy to try to clarify it further. Thank you very much in advance!
Another way to do this with data.table, also using a time difference, is to make use of data.table's internal values for the group number (.GRP) and the number of rows in each group (.N):
library(data.table)
res<-setDT(mydf)[, list(number_rows=.N,flow=.GRP),
by=.(uniqueid,cumsum(as.numeric(difftime(time,shift(time,1L,type="lag",fill=0))) - 15))][,cumsum:=NULL]
print(res)
uniqueid number_rows flow
1: a 4 1
2: a 2 2
3: b 2 3
4: b 1 4
Also, since the sample data you posted didn't align with the subset you showed, I have included the data I used below:
Data
time<-as.POSIXct(c("2014-04-30 23:30:00","2014-04-30 23:45:00","2014-05-01 00:00:00","2014-05-01 00:15:00",
"2014-05-12 13:45:00","2014-05-12 14:00:00","2014-05-12 13:45:00","2014-05-12 14:00:00",
"2014-05-12 14:30:00"))
uniqueid<-c("a","a","a","a","a","a","b","b","b")
mydf<-data.frame(uniqueid,time)
You can group by uniqueid and the cumulative sum of a flag for rows where the time difference from the previous row is not 15 minutes; that cumulative sum gives the flow id, and a count of rows per group then gives what you need.
The logic: whenever the time difference within a uniqueid is not 15 minutes, a new flow should start, so we label that row TRUE; combining those flags with cumsum turns each run of consecutive rows into its own flow id:
library(dplyr)
mydf$time <- as.POSIXct(mydf$time, format = "%Y-%m-%d %H:%M:%S")
# convert the time column to POSIXct class so that we can apply the diff function correctly
mydf %>% group_by(uniqueid, flow = 1 + cumsum(c(F, diff(time) != 15))) %>%
summarize(num_rows = n())
# Source: local data frame [4 x 3]
# Groups: uniqueid [?]
#
# uniqueid flow num_rows
# <fctr> <dbl> <int>
# 1 a 1 4
# 2 a 2 2
# 3 b 3 2
# 4 b 4 1
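To see what the flag-and-cumsum step does, here it is spelled out for id "a" alone (the long gap between 2014-05-01 00:15 and 2014-05-12 13:45 is 16650 minutes):
gaps <- c(15, 15, 15, 16650, 15)     # minutes between consecutive stamps of id "a"
1 + cumsum(c(FALSE, gaps != 15))     # flow id for each of the 6 rows
# [1] 1 1 1 1 2 2                    # two flows, of 4 and 2 rows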
Base R is pretty fast. Using crude benchmarking, I found it finished in half the time of DT, and I got tired of waiting for dplyr.
# estimated size of data, years x days x hours x 15mins x uniqueids
5*365*24*4*1000 # = approx 180M
# make data with posixct and characters of 180M rows, mydf is approx 2.5GB in memory
time<-rep(as.POSIXct(c("2014-04-30 23:30:00","2014-04-30 23:45:00","2014-05-01 00:00:00","2014-05-01 00:15:00",
"2014-05-12 13:45:00","2014-05-12 14:00:00","2014-05-12 13:45:00","2014-05-12 14:00:00",
"2014-05-12 14:30:00")),times = 20000000)
uniqueid<-rep(as.character(c("a","a","a","a","a","a","b","b","b")),times = 20000000)
mydf<-data.frame(uniqueid,time = time)
rm(time,uniqueid);gc()
Base R:
# assumes that uniqueid's are in groups and in order, and there won't be a followed by b that have the 15 minute "flow"
starttime <- Sys.time()
# find failed flows
mydf$diff <- c(0,diff(mydf$time))
mydf$flowstop <- mydf$diff != 15
# give each flow an id
mydf$flowid <- cumsum(mydf$flowstop)
# clean up vars
mydf$time <- mydf$diff <- mydf$flowstop <- NULL
# find flow length
mydfrle <- rle(mydf$flowid)
# get uniqueid/flowid pairs (unique() is too slow)
mydf <- mydf[!duplicated(mydf$flowid), ]
# append rle and remove separate var
mydf$number_rows <- mydfrle$lengths
rm(mydfrle)
print(Sys.time()-starttime)
# Time difference of 30.39437 secs
data.table:
library(data.table)
starttime <- Sys.time()
res<-setDT(mydf)[, list(number_rows=.N,flow=.GRP),
by=.(uniqueid,cumsum(as.numeric(difftime(time,shift(time,1L,type="lag",fill=0))) - 15))][,cumsum:=NULL]
print(Sys.time()-starttime)
# Time difference of 57.08156 secs
dplyr:
library(dplyr)
# convert the time column to POSIXct class so that we can apply the diff function correctly
starttime <- Sys.time()
mydf %>% group_by(uniqueid, flow = 1 + cumsum(c(F, diff(time) != 15))) %>%
summarize(num_rows = n())
print(Sys.time()-starttime)
# too long, did not finish after a few minutes
I think the assumption that uniqueids and times are already in order is a big one, and the other solutions might be able to take advantage of it better. order() is easy enough to run first.
I'm not sure about the impact on memory, or how data sets that aren't so simple would behave. It should be easy enough to break the data into chunks and process them if memory is an issue. It certainly takes more code in base R.
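For reference, the ordering step assumed above is a one-liner on the example data:
mydf <- mydf[order(mydf$uniqueid, mydf$time), ]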
Having both the "id" and "time" columns ordered, we can build the grouping directly by creating a logical vector that flags the rows where either "id" changes or the gap in "time" is more than 15 minutes.
With:
id = as.character(mydf$uniqueid)
tm = mydf$time
find where "id":
id_gr = c(TRUE, id[-1] != id[-length(id)])
and "time":
tm_gr = c(0, difftime(tm[-1], tm[-length(tm)], unit = "mins")) > 15
change and combine them in:
gr = id_gr | tm_gr
which shows wherever either "id" changed or "time" > 15.
And to get the result:
tab = tabulate(cumsum(gr)) ## basically, the only operation per group -- 'n by group'
data.frame(id = id[gr], flow = seq_along(tab), n = tab)
# id flow n
#1 a 1 4
#2 a 2 2
#3 b 3 2
#4 b 4 1
On a larger scale:
set.seed(1821); nid = 1e4
dat = replicate(nid, as.POSIXct("2016-07-07 12:00:00 EEST") +
cumsum(sample(c(1, 5, 10, 15, 20, 30, 45, 60, 90, 120, 150, 200, 250, 300), sample(5e2:1e3, 1), TRUE)*60),
simplify = FALSE)
names(dat) = make.unique(rep_len(letters, nid))
dat = data.frame(id = rep(names(dat), lengths(dat)), time = do.call(c, dat))
system.time({
id = as.character(dat$id); tm = dat$time
id_gr = c(TRUE, id[-1] != id[-length(id)])
tm_gr = c(0, difftime(tm[-1], tm[-length(tm)], unit = "mins")) > 15
gr = id_gr | tm_gr
tab = tabulate(cumsum(gr))
ans1 = data.frame(id = id[gr], flow = seq_along(tab), n = tab)
})
# user system elapsed
# 1.44 0.19 1.66
For comparison, here is MikeyMike's answer:
library(data.table)
dat2 = copy(dat)
system.time({
ans2 = setDT(dat2)[, list(flow = .GRP, n = .N),
by = .(id, cumsum(as.numeric(difftime(time,
shift(time, 1L, type = "lag", fill = 0),
unit = "mins")) > 15))][, cumsum := NULL]
})
# user system elapsed
# 3.95 0.22 4.26
identical(as.data.table(ans1), ans2)
#[1] TRUE

Take a daily rolling mean of a seven day window for 30 minute sampled data

I would like to take the mean over a 7-day rolling window, in 1-day increments, of data collected at 30-minute intervals.
I have tried using data.table with a conditional by statement, with no success. Any guidance would be greatly appreciated.
# packages
library(data.table)
library(lubridate)
# Set set.seed to have reproducible sampling
set.seed(42)
# Create some Data
start = ymd_hms("2014-01-01 00:00:00")
end = ymd_hms("2014-12-31 23:59:59")
# Create data with 30 minute intervals.
dat <- data.table(timestamp = seq(start, end, by = "30 min"),
sample1 = sample(1:20, 17520, replace = TRUE))
# Create date variable for merging datasets.
dat[, date := as.Date(timestamp)]
# Create data for 7 day window moving window with one day increments.
dat2 <- data.table(start = seq(start, end, by = "1 day"),
end = seq(start + days(7), end + days(7), by = "1 day"))
# Create date variable for merging datasets.
dat2[, date := as.Date(start)]
# merge datasets.
dat <- merge(dat, dat2, by="date")
# Tried
dat[, .(sample.mean = mean(sample1)), by = .(timestamp >= start & timestamp < end)]
# timestamp sample.mean
# 1: TRUE 10.46638
dat[, .(sample.mean = mean(sample1)), by = .(timestamp %in% c(start:end))]
# timestamp sample.mean
# 1: TRUE 10.40059
# 2: FALSE 10.46767
# Warning messages:
# 1: In start:end :
# numerical expression has 17520 elements: only the first used
# 2: In start:end :
# numerical expression has 17520 elements: only the first used
dat[, .(sample.mean = mean(sample1)), by = .(timestamp %between% c(start, end))]
# timestamp sample.mean
# 1: TRUE 19.00000
# 2: FALSE 10.46589
I'm not 100% sure I understand your exact parameters, but here's the basic approach:
setkey(dat, date)
#pull the 7 previous days
dat[ , dat[.(seq(.BY$date - 7L,
.BY$date, by = "day")),
#nomatch = 0L will exclude any requested dates outside the interval
mean(sample1), nomatch = 0L], by = date]
# date V1
# 1: 2014-01-01 12.31250
# 2: 2014-01-02 10.94792
# 3: 2014-01-03 11.27083
# 4: 2014-01-04 11.10417
# 5: 2014-01-05 10.79167
# ---
# 361: 2014-12-27 10.50260
# 362: 2014-12-28 10.52344
# 363: 2014-12-29 10.05990
# 364: 2014-12-30 10.03906
# 365: 2014-12-31 10.38542
Some possible tinkers:
Change 7L to whatever window you'd like; use positive if you want forward-looking averages
If you want to go by timestamp, you'll have to adjust the 7L to match whatever units (seconds/minutes/hours/etc)
The extreme points of the interval are not technically correct since the window is shorter than requested; exclude nomatch and these points will return as NA
Use .(var = mean(sample1)) to name the output column var.
Here's one approach:
library(zoo)
daymeans = dat[, mean(sample1), by=date][, rmean := rollmean(V1, 7, fill=NA)]
dat[daymeans, rmean := i.rmean, on="date"]
This assumes that your data is already sorted by date; if not, use keyby=date instead of by=date. If you don't want to juggle intermediate objects, there is a one-liner:
# Michael Chirico's suggestion from the comments
dat[dat[, mean(sample1), by=date][, rollmean(V1, 7, fill=NA)], rmean := i.V1, on = "date"]
You may need to tweak the arguments to rollmean to fit your particular definition of the window. @eddi suggested that runmean from the caTools library is typically faster than zoo's rollmean, so it is probably also worth a look.
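For instance, if the window is meant to be the 7 days ending at each date (a trailing window) rather than a centered one, one hedged tweak is to pass align = "right" to rollmean; a sketch reusing the daymeans step from above:
library(zoo)
daymeans = dat[, mean(sample1), by=date][, rmean := rollmean(V1, 7, fill=NA, align="right")]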
Crude benchmark with the OP's example data:
dat2 = copy(dat)
# Michael's answer
system.time({
setkey(dat, date)
dat[ , dat[.(seq(.BY$date - 7L,
.BY$date, by = "day")),
mean(sample1), nomatch = 0L], by = date]
})
user system elapsed
0.33 0.00 0.35
# this answer
system.time({
daymeans = dat2[, mean(sample1), by=date][, rmean := rollmean(V1, 7, fill=NA)]
dat2[daymeans, rmean := i.rmean, on="date"]
})
user system elapsed
0 0 0
Why it's faster: here we compute 365 means of 48 numbers and then one rolling mean over 365 values, which is less computationally costly than doing 365 joins to collect 48*7 numbers each and then taking the mean of each of those.
