I'm currently struggling with R and calculating the time difference in days.
I have a data.frame with around 60,000 rows. In this data frame there are two columns called "start" and "end". Both columns contain UNIX timestamps WITH milliseconds, as you can see from the last three digits.
Start <- c("1470581434000", "1470784954000", "1470811368000", "1470764345000")
End <- c("1470560601000", "1470581549000", "1470785452000", "1470764722000")
d <- data.frame(Start, End)
My desired output is an extra column called timediff that gives the time difference in days.
I tried it with timediff and strptime which I found here. But nothing worked out.
Maybe one of you worked with calculation of time differences in the past.
Thanks a lot
There is a very small and fast solution:
Start_POSIX <- as.POSIXct(as.numeric(Start)/1000, origin="1970-01-01")
End_POSIX <- as.POSIXct(as.numeric(End)/1000, origin="1970-01-01")
difftime(Start_POSIX, End_POSIX)
Time differences in mins
[1] 347.216667 3390.083333 431.933333 -6.283333
or if you want another unit:
difftime(Start_POSIX, End_POSIX, units = "secs")
Time differences in secs
[1] 20833 203405 25916 -377
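Since the question asks for days in a new column, the same conversion can be sketched directly on the data frame. This is only a minimal sketch: the helper name ms_to_posixct is made up for illustration, and tz = "UTC" is an assumption (the time zone does not change the differences).

```r
Start <- c("1470581434000", "1470784954000", "1470811368000", "1470764345000")
End   <- c("1470560601000", "1470581549000", "1470785452000", "1470764722000")
d <- data.frame(Start, End, stringsAsFactors = FALSE)

# Hypothetical helper: millisecond UNIX timestamps (as strings) -> POSIXct
ms_to_posixct <- function(ms) {
  as.POSIXct(as.numeric(ms) / 1000, origin = "1970-01-01", tz = "UTC")
}

# Start minus End, expressed in days, stored as a plain numeric column
d$timediff <- as.numeric(
  difftime(ms_to_posixct(d$Start), ms_to_posixct(d$End), units = "days")
)
```

Keeping the column numeric (instead of a difftime object) makes it easier to sort or summarise later; drop the as.numeric() call if you prefer to keep the units attached.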
You have a few steps you'll need to take:
# 1. Separate the milliseconds.
# To do this, insert a period in front of the last three digits
Start <-
  sub(pattern = "(\\d{3}$)",  # match the three digits at the end of the string
      replacement = ".\\1",   # replace with a . followed by the matched digits
      x = Start)
# 2. Convert to numeric
Start <- as.numeric(Start)
# 3. Convert to POSIXct
Start <- as.POSIXct(Start,
                    origin = "1970-01-01")
For convenience, it would be good to put these all into a function:
# Bundle all three steps into one function
unixtime_to_posixct <- function(x)
{
  x <- sub(pattern = "(\\d{3}$)",
           replacement = ".\\1",
           x = x)
  x <- as.numeric(x)
  as.POSIXct(x,
             origin = "1970-01-01")
}
And with that, you can get your differences in days:
#* Put it all together.
library(dplyr)
library(magrittr)
Start <- c("1470581434000", "1470784954000", "1470811368000", "1470764345000")
End <- c("1470560601000", "1470581549000", "1470785452000", "1470764722000")
d <- data.frame(Start,
                End,
                stringsAsFactors = FALSE)
lapply(
  X = d,
  FUN = unixtime_to_posixct
) %>%
  as.data.frame() %>%
  mutate(diff = difftime(Start, End, units = "days"))
Related
I've got the following time frame:
A <- c('2016-01-01', '2019-01-05')
B <- c('2017-05-05','2019-06-05')
X_Period <- interval("2015-01-01", "2019-12-31")
Y_Periods <- interval(A, B)
I'd like to find the non overlapping periods between X_Period and Y_Periods so that the result would be:
[1]'2015-01-01'--'2015-12-31'
[2]'2017-05-06'--'2019-01-04'
[3]'2019-06-06'--'2019-12-31'
I'm trying to use setdiff but it does not work
setdiff(X_Period, Y_Periods)
Here is an option:
library(lubridate)
seq_X <- as.Date(seq(int_start(X_Period), int_end(X_Period), by = "1 day"))
seq_Y <- as.Date(do.call("c", sapply(Y_Periods, function(x)
seq(int_start(x), int_end(x), by = "1 day"))))
unique_dates_X <- seq_X[!seq_X %in% seq_Y]
lst <- aggregate(
  unique_dates_X,
  by = list(cumsum(c(0, diff(unique_dates_X) != 1))),
  FUN = function(x) c(min(x), max(x)),
  simplify = FALSE)$x
lapply(lst, function(x) interval(x[1], x[2]))
#[[1]]
#[1] 2015-01-01 UTC--2015-12-31 UTC
#
#[[2]]
#[1] 2017-05-06 UTC--2019-01-04 UTC
#
#[[3]]
#[1] 2019-06-06 UTC--2019-12-31 UTC
The strategy is to convert the intervals to by-day sequences (one for X_Period and one for Y_Periods); then we find all days that are only part of X_Period (and not part of Y_Periods). We then aggregate to determine the first and last date in every sub-sequence of consecutive dates. The resulting lst is a list with those start/end dates. To convert to intervals, we simply loop through the list and turn each start/end pair into an interval.
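The grouping step is the least obvious part, so here is a small base R sketch of how cumsum() over "gap" flags labels runs of consecutive dates. The dates below are made up for illustration, and the first/last lookup via duplicated() is a swapped-in shortcut that assumes the dates are sorted.

```r
# Made-up example: two runs of consecutive days with a gap in between
dates <- as.Date(c("2015-01-01", "2015-01-02", "2015-01-03",
                   "2015-01-10", "2015-01-11"))

# A new run starts wherever the step to the previous date is not exactly 1 day;
# the cumulative sum of those flags gives every run its own id
run_id <- cumsum(c(0, diff(dates) != 1))

# First and last date of each run (relies on `dates` being sorted)
run_start <- dates[!duplicated(run_id)]
run_end   <- dates[!duplicated(run_id, fromLast = TRUE)]

data.frame(run = unique(run_id), start = run_start, end = run_end)
```

Each row of the resulting data frame corresponds to one non-overlapping period, which could then be passed to interval() as in the answer.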
Dear Stackoverflow Community,
I have a Dataset with Datetimes [posixct '%d.%m.%Y %H:%M'] and Sensor measurements in [A] and [V].
The Datetime is one column and the different sensors are the other columns, with one column for each sensor.
I'd like to calculate a correction value with values within the column of each sensor.
The correction value should be written into a new column hourly.
Therefore I'd like to calculate the correction as follows:
correction = |x - (0.5 * (y+z))|
x= value of sensor 1, if Minute =='00'
y= value of sensor 1, if Minute =='03'
z= value of sensor 1, if Minute =='06'
What I'd like to have is a function that calculates the formula above for every hour, but only if a value exists for all three minutes ('00', '03' and '06') in that hour, and writes the correction value into a new column (Data$correction).
I hope I could explain what I'd like to do.
I tried several loops and the apply and mapply functions, but there was always a problem with the date format or the function.
This is what seems to be the best approach to me; it doesn't work right now, but I hope there is a way to make it work.
I also think that writing out vectors and merging them back with melt or merge might not be the best way, but right now I'm just struggling and don't know how to solve the problem.
I really hope you can help me. Thanks so much.
Test_sub <- read.table(file = 'Test_sub.csv',
                       header = TRUE, sep = ';', dec = '.', stringsAsFactors = FALSE)
sensor1_V_0 <- Test_sub[format(Test_sub$Datehour, format = '%M') == '00', ]
sensor1_V_3 <- Test_sub[format(Test_sub$Datehour, format = '%M') == '03', ]
sensor1_V_6 <- Test_sub[format(Test_sub$Datehour, format = '%M') == '06', ]
test_sub2 <- mapply(function(x, y, z) x - (0.5 * (y + z)), sensor1_V_0$sensor1_V, sensor1_V_3$sensor1_V, sensor1_V_6$sensor1_V)
Let's start by creating some fake data:
dill<-data.frame(time=seq(as.POSIXct("2019-01-01 11:30"), as.POSIXct("2019-01-01 13:20"), by=180),val=runif(37,0,100))
Now we can do this:
library(tidyverse)
library(lubridate)
dill <- dill %>%
  # group by the hour; note this assumes there's only one day in the data, so adjust the grouping if there's more than one day
  group_by(hour(time)) %>%
  # remove any hours in the data that don't have minutes 0, 3 and 6
  filter(any(minute(time) == 3) & any(minute(time) == 6) & any(minute(time) == 0)) %>%
  # calculate the correction
  mutate(correction = abs(val[minute(time) == 0] - 0.5 * (val[minute(time) == 3] + val[minute(time) == 6])))
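If the data covers more than one day, one way to adjust the grouping is to key each row by day and hour together. The following is a base R sketch under that assumption; the data is made up, and the names dat, hour_id and hourly_correction are hypothetical, not taken from the question.

```r
# Made-up multi-day data: one reading every 3 minutes, random values
set.seed(1)
dat <- data.frame(
  time = seq(as.POSIXct("2019-01-01 11:30", tz = "UTC"),
             as.POSIXct("2019-01-02 13:20", tz = "UTC"), by = 180)
)
dat$val <- runif(nrow(dat), 0, 100)

# Key rows by day AND hour so the same hour on different days stays separate
dat$hour_id <- format(dat$time, "%Y-%m-%d %H")
dat$minute  <- format(dat$time, "%M")

# |x - 0.5 * (y + z)| for one hour; NA unless minutes 00, 03 and 06
# each occur exactly once in that hour
hourly_correction <- function(g) {
  x <- g$val[g$minute == "00"]
  y <- g$val[g$minute == "03"]
  z <- g$val[g$minute == "06"]
  if (length(x) == 1 && length(y) == 1 && length(z) == 1) {
    abs(x - 0.5 * (y + z))
  } else {
    NA_real_
  }
}

# One correction per hour, then copied back onto every row of that hour
corr <- sapply(split(dat, dat$hour_id), hourly_correction)
dat$correction <- unname(corr[dat$hour_id])
```

Unlike the filter() step above, this keeps incomplete hours in the data and marks them with NA, which may or may not be what you want.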
An example of the data would be:
y <- seq(from= 0.1, to= 0.5, by= 0.1)
min_time <- as.POSIXct('2018-09-25 09:00:00')
max_time <- as.POSIXct('2018-09-26 17:45:00')
SEQ <- data.frame(Datehour = seq.POSIXt(min_time, max_time, by = 60 * 3))
str(SEQ)
# keep only the readings taken at minutes 00, 03, 06, 15, 30 and 45
keep <- format(SEQ$Datehour, format = '%M') %in% c('00', '03', '06', '15', '30', '45')
data <- data.frame(Datehour = SEQ$Datehour[keep], y = 0.1, z = 0.3)
Other questions have centered around having a start and end date. See the following for examples:
Given start date and end date, reshape/expand data for each day between (each day on a row)
Expand rows by date range using start and end date
My question is different in that I only have one date column, and I would like to convert the unequal date ranges to daily counts. This specific example deals with the number of workers on a job site at one time. Different crews of people come on different dates.
A brief data frame is provided as follows:
dd <- data.frame(date=as.Date(c("1999-03-22","1999-03-29","1999-04-08")),work=c(43,95,92),cumwork=c(43,138,230))
I would like the data to look like this:
dw <- data.frame(date=c(seq(as.Date("1999-03-22"),as.Date("1999-04-10"),by= "day")),
work=c(rep(43,7),rep(95,10),rep(92,3)),
cumwork=c(rep(43,7),rep(138,10),rep(230,3)))
I have been stuck on this for some time. Any help would be appreciated!
UPDATE (7/5/2017): As pointed out by @Scarabee, the dates in the dataframe 'dd' should be in date format. I have updated the code to reflect this.
A possible way:
First, create the sequence of dates you're interested in as a one-column dataframe:
v <- data.frame(date = seq(min(dd$date), as.Date("1999-04-10"), by="day"))
Next, join with your original dataframe and fill the missing values, for instance using dplyr and zoo:
library(dplyr)
library(zoo)
v %>%
left_join(dd, by = "date") %>%
na.locf
NB: I suppose that your dataframe dd actually contains dates (and not factors).
dd <- data.frame(date=as.Date(c("1999-03-22","1999-03-29","1999-04-08")),work=c(43,95,92),cumwork=c(43,138,230))
A solution similar, with base R (and zoo package):
dd$date <- as.Date(as.character(dd$date))
my.seq <- data.frame(date=seq.Date(from=range(dd$date)[1], to=range(dd$date)[2], by="day"))
output <- merge(my.seq, dd, all.x=TRUE)
output <- zoo::na.locf(output)
You first have to transform your date into a Date format. Then separately create a vector of complete dates and merge it with the original data. Finally, run a "last observation carried forward" algorithm.
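For comparison, the "last observation carried forward" step can also be sketched without zoo by looking up, for each day, the most recent row of dd. This is a swapped-in technique (findInterval), not what the answers above use; the end date 1999-04-10 is taken from the desired output in the question, and it assumes dd$date is sorted.

```r
dd <- data.frame(date = as.Date(c("1999-03-22", "1999-03-29", "1999-04-08")),
                 work = c(43, 95, 92),
                 cumwork = c(43, 138, 230))

# Every day from the first observation to the (assumed) last day of interest
all_days <- seq(min(dd$date), as.Date("1999-04-10"), by = "day")

# For each day, the index of the latest row in dd whose date is <= that day
idx <- findInterval(as.numeric(all_days), as.numeric(dd$date))

dw <- data.frame(date = all_days,
                 work = dd$work[idx],
                 cumwork = dd$cumwork[idx])
```

Because no join is involved, this stays fast on large inputs, but it only works when every requested day is on or after the first date in dd.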
Here is a really fast pure base R solution:
ExpandDates <- function(df, lastColRepeat) {
    myDiff <- diff(df$date)
    dfOut <- data.frame(df$date[1] + 0:(sum(myDiff) + lastColRepeat - 1L),
                        stringsAsFactors=FALSE)
    myDiff <- c(myDiff, lastColRepeat)
    for (i in 2:3) {dfOut[,i] <- rep(df[ ,i], times = myDiff)}
    names(dfOut) <- names(df)
    dfOut
}
The last argument determines the number of times the last value should be repeated. As it stands, there is nothing in the original data.frame that would give this value. I'm also assuming that the "date" field is actually a date, as pointed out by @Scarabee.
Here is some test data:
set.seed(123)
workVec <- sample(5000, 3000)
testDF <- data.frame(date = as.Date(sort(sample(12000, 3000)),
origin = "1970-01-01"), work = workVec,
cumwork = cumsum(workVec))
DplyrTest <- function(dd) { ## from @Scarabee
    v <- data.frame(date = seq(min(dd$date), max(dd$date), by="day"))
    v %>%
        left_join(dd, by = "date") %>%
        na.locf
}
a <- ExpandDates(testDF, 1)
b <- DplyrTest(testDF)
Test for equality:
identical(a$cumwork, as.integer(b$cumwork))
[1] TRUE
identical(a$work, as.integer(b$work))
[1] TRUE
identical(a$date, as.Date(b$date))
[1] TRUE
Benchmarks:
library(microbenchmark)
microbenchmark(DplyrTest(testDF), ExpandDates(testDF,1))
Unit: milliseconds
                   expr       min        lq      mean    median        uq       max neval cld
      DplyrTest(testDF) 80.909303 84.337006 91.315057 86.320883 88.818739 173.69395   100   b
 ExpandDates(testDF, 1)  1.122384  1.208184  2.521693  1.355564  1.486317  72.23444   100  a
I'm currently struggling with a beginner's issue regarding the calculation of a time difference between two events.
I want to take a column consisting of date and time (both values in one column) into consideration and calculate a time difference with the value of the previous/next row with the same ID (A or B in this example).
ID = c("A", "A", "B", "B")
time = c("08.09.2014 10:34","12.09.2014 09:33","13.08.2014 15:52","11.09.2014 02:30")
d = data.frame(ID,time)
My desired output is in the format Hours:Minutes
time difference = c("94:59","94:59","682:38","682:38")
The format Days:Hours:Minutes or anything similar would also work, as long as it could be conveniently implemented. I am flexible regarding the format of the output, the above is just an idea that crossed my mind.
For each single ID, I always have two rows (in the example 2xA and 2xB). I don't have a convincing idea of how to avoid repeating the difference.
I've tried some examples before, which I found on stackoverflow. Most of them used POSIXt and strptime. However, I didn't manage to apply those ideas to my data set.
Here's my attempt using dplyr
library(dplyr)
d %>%
mutate(time = as.POSIXct(time, format = "%d.%m.%Y %H:%M")) %>%
group_by(ID) %>%
mutate(diff = paste0(gsub("[.].*", "", diff(time)*24), ":",
round(as.numeric(gsub(".*[.]", ".", diff(time)*24))*60)))
# Source: local data frame [4 x 3]
# Groups: ID
#
# ID time diff
# 1 A 2014-09-08 10:34:00 94:59
# 2 A 2014-09-12 09:33:00 94:59
# 3 B 2014-08-13 15:52:00 682:38
# 4 B 2014-09-11 02:30:00 682:38
A very (to me) hack-ish base solution:
ID <- c("A", "A", "B", "B")
time <- c("08.09.2014 10:34", "12.09.2014 09:33", "13.08.2014 15:52","11.09.2014 02:30")
d <- data.frame(ID, time)
d$time <- as.POSIXct(d$time, format="%d.%m.%Y %H:%M")
unlist(unname(lapply(split(d, d$ID), function(d) {
sapply(abs(diff(c(d$time[2], d$time))), function(x) {
sprintf("%s:%s", round(((x*24)%/%1)), round(((x*24)%%1 *60)))
})
})))
## [1] "94:59" "94:59" "682:38" "682:38"
I have to believe this function exists somewhere already, tho.
Similar to the attempts of David and hrmbrmstr, I found that this solution using difftime works.
I use a rowShift script I found on Stack Overflow:
rowShift <- function(x, shiftLen = 1L) {
r <- (1L + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r])
}
d$time.c <- as.POSIXct(d$time, format = "%d.%m.%Y %H:%M")
d$time.prev <- rowShift(d$time.c,-1)
d$diff <- difftime(d$time.c,d$time.prev, units="hours")
Successive rows of d$diff alternate between positive and negative values. I remove all rows with negative values, which leaves the difference between the first and the last time for every ID.
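To get from a numeric difference to the Hours:Minutes format the question asks for, a base R sketch along these lines might help. It assumes each ID's difference is the span between its earliest and latest time, and tz = "UTC" is an assumption to keep daylight-saving shifts out of the arithmetic.

```r
ID   <- c("A", "A", "B", "B")
time <- c("08.09.2014 10:34", "12.09.2014 09:33",
          "13.08.2014 15:52", "11.09.2014 02:30")
d <- data.frame(ID, time, stringsAsFactors = FALSE)
d$time <- as.POSIXct(d$time, format = "%d.%m.%Y %H:%M", tz = "UTC")

# Span between the earliest and latest time per ID, in whole minutes,
# repeated on every row of that ID
mins <- round(ave(as.numeric(d$time), d$ID,
                  FUN = function(s) (max(s) - min(s)) / 60))

# Format as total hours and zero-padded minutes
d$diff <- sprintf("%d:%02d", as.integer(mins %/% 60), as.integer(mins %% 60))
```

Rounding the total minutes before splitting them avoids ever producing a minutes part of 60.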
Let's say I have a data frame consisting of 3 columns with dates:
index <- c("31.10.2012", "16.06.2012")
begin <- c("22.10.2012", "29.05.2012")
end <- c("24.10.2012", "17.06.2012")
index.new <- as.Date(index, format = "%d.%m.%Y")
begin.new <- as.Date(begin, format = "%d.%m.%Y")
end.new <- as.Date(end, format = "%d.%m.%Y")
data.frame(index.new, begin.new, end.new)
My problem: I want to select (subset) the rows, where the interval of begin and end-date is within 4 days before the index-day. This is obviously only in row no 2.
Can you help me out here?
The way you express the problem is messy: in the first case dates.new[1] > dates.new[2], and in the second case dates.new[3] < dates.new[4]. Putting the endpoints in order:
interval1 = c(dates.new[2], dates.new[1])
interval2 = c(dates.new[3], dates.new[4])
If you want to check whether interval2 CONTAINS interval1:
all.equal(findInterval(interval1, interval2), c(1, 1))
Please let me know if this works and if it is what you want.
library("timeDate")
index <- c("31.10.2012", "16.06.2012")
begin <- c("22.10.2012", "29.05.2012")
end <- c("24.10.2012", "17.06.2012")
index.new <- as.Date(index, format = "%d.%m.%Y")
begin.new <- as.Date(begin, format = "%d.%m.%Y")
end.new <- as.Date(end, format = "%d.%m.%Y")
data <- data.frame(index.new, begin.new, end.new)
apply(data, 1, function(x){paste(x[1]) %in% paste(timeSequence(x[2], x[3], by = "day"))})
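The same containment check can be sketched in base R with vectorised Date comparisons, without building a day-by-day sequence. This assumes, like the apply() call above, that a row qualifies when the index date lies between begin and end (inclusive); the data frame name d is hypothetical.

```r
index <- c("31.10.2012", "16.06.2012")
begin <- c("22.10.2012", "29.05.2012")
end   <- c("24.10.2012", "17.06.2012")
d <- data.frame(index.new = as.Date(index, format = "%d.%m.%Y"),
                begin.new = as.Date(begin, format = "%d.%m.%Y"),
                end.new   = as.Date(end,   format = "%d.%m.%Y"))

# TRUE where the index date falls inside [begin, end]
inside <- d$index.new >= d$begin.new & d$index.new <= d$end.new

# Keep only the matching rows
d[inside, ]
```

If the rule should instead involve a fixed window before the index date (for example 4 days), the comparison could be adapted by subtracting the window length from d$index.new.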