Converting an interval to duration per hour per weekday in R using data.table - r

I have the following problem:
Suppose we have:
Idx  ID  StartTime            EndTime
1:   1   2014-01-01 02:20:00  2014-01-01 03:42:00
2:   1   2014-01-01 14:51:00  2014-01-01 16:44:00
Note: Idx is not part of the data; I'm only adding it to the table view.
Here the person with ID=1 uses the computer from 2:20 to 3:42 (2014-01-01 is a Wednesday). I would like to convert this interval into a set of variables, one per weekday and hour, each holding the duration in minutes spent in that period.
Idx  ID  Monday-0:00  Monday-1:00  ...  Wednesday-2:00  Wednesday-3:00
1:   1                                  40              42
For the second row we would have
Idx  ID  Monday-0:00  Monday-1:00  ...  Wednesday-14:00  Wednesday-15:00  Wednesday-16:00
2:   1                                  9                60               44
The difficulty is that an interval can span multiple hours, as the second row shows.
I would like to do this per row, and I am wondering whether it is possible with data.table without too much computational effort.
PS: An interval can also cross midnight into the next day.

library(data.table)
library(lubridate)
# produce sample data; as.POSIXct is used so that adding a number adds seconds
# (recent lubridate versions return a Date from ymd() unless tz is given, and Date arithmetic is in days)
origin <- as.POSIXct('1970-01-01', tz='UTC')
DT <- data.table(idx=1:100, ID=rep(1:20,5), StartTime=as.POSIXct('2014-01-01', tz='UTC')+runif(100,60*60,60*60*365))
# one random duration per row (runif(1, ...) would give every row the same duration)
DT[, EndTime := StartTime+runif(.N,60,60*60*8)]
# make fake start and end dates with same day of week and time but all within a single calendar week
DT[, fakestart := as.numeric(difftime(StartTime,origin,units="days"))%%7*60*60*24+origin]
DT[, fakeend := as.numeric(difftime(EndTime,origin,units="days"))%%7*60*60*24+origin]
setkey(DT,fakestart,fakeend)
# check that weekdays line up
nrow(DT[weekdays(EndTime)==weekdays(fakeend)])
nrow(DT[weekdays(StartTime)==weekdays(fakestart)])
# both are 100 so we're good.
# check that fakeend > fakestart
DT[fakeend<fakestart]
# uh-oh some ends are earlier than starts, let's add 7 days to those ends
DT[fakeend<fakestart, fakeend := fakeend+days(7)]
# make data.table with all possible labels (one row per hour)
DTin <- data.table(start=seq(from=origin, to=DT[,floor_date(max(fakeend),"hour")], by="hour"))
DTin[, end := start+hours(1)]
DTin[, label := paste0(format(start,format="%A-%H:00"),' ',format(end,format="%A-%H:00"))]
# set key and use the foverlaps feature of data.table, which merges by interval
setkey(DT,fakestart,fakeend)
setkey(DTin,start,end)
DTout <- foverlaps(DT,DTin,type="any")
# compute duration in minutes within each hourly interval
DTout[, dur := 60-pmax(0,as.numeric(difftime(fakestart,start,units="mins")))-pmax(0,as.numeric(difftime(end,fakeend,units="mins")))]
# cast all the rows up to columns for final result (dcast.data.table is the older name for data.table's dcast method)
castout <- dcast(DTout, idx+ID~label, value.var="dur", fill=0)
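
As a sanity check on the per-hour arithmetic, the two example rows from the question can be split by hand. This is only a minimal base R sketch, not part of the answer above: `hourly_minutes` is a hypothetical helper, and it assumes a time zone whose offset is a whole number of hours (UTC here) so that flooring the epoch seconds lands on an hour boundary.

```r
# Hypothetical helper: minutes of [start, end] falling into each clock hour.
hourly_minutes <- function(start, end) {
  # floor the start to the hour (assumes a whole-hour time zone offset)
  first_bin <- start - as.numeric(start) %% 3600
  bins <- seq(first_bin, end, by = "hour")
  # overlap of [start, end] with each one-hour bin, in seconds
  overlap_secs <- pmin(as.numeric(end), as.numeric(bins) + 3600) -
    pmax(as.numeric(start), as.numeric(bins))
  out <- data.frame(label = format(bins, "%A-%H:00"),
                    minutes = overlap_secs / 60)
  out[out$minutes > 0, ]
}

row1 <- hourly_minutes(as.POSIXct("2014-01-01 02:20:00", tz = "UTC"),
                       as.POSIXct("2014-01-01 03:42:00", tz = "UTC"))
row2 <- hourly_minutes(as.POSIXct("2014-01-01 14:51:00", tz = "UTC"),
                       as.POSIXct("2014-01-01 16:44:00", tz = "UTC"))
row1$minutes  # 40 42
row2$minutes  # 9 60 44
```

The weekday part of each label comes from `format(..., "%A")` and therefore depends on the locale. For many rows the foverlaps approach should scale much better than calling a helper like this once per row.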

Related

Turn date column into days from beginning integer Rstudio

Hi everyone. I am currently plotting time series graphs in RStudio. I have created a nice time series graph, but I would like the x axis to show an integer counting days from the start date of the graph rather than the date.
For example, instead of seeing 01/01/2021 I want to see day 100, meaning it is the 100th day of recording data.
Do I need to create another column converting all the dates into numerical values and then plot that?
If so, how do I do this? At the moment all I have is a Date column and the column of values I am plotting.
Thanks
Assuming you want 01/01/2021 as the first day, you could use it as a reference, calculate the number of days passed since the first day of recording, and plot that. This gives you an integer counting from the starting date.
I'm not sure what your data frame looks like, so hopefully this helps.
Using lubridate
library(lubridate)
df
Date
1 01/01/2021
2 02/01/2021
3 03/01/2021
4 04/01/2021
df$days <- yday(dmy(df$Date)) -1
Output:
Date days
1 01/01/2021 0
2 02/01/2021 1
3 03/01/2021 2
4 04/01/2021 3
Which is indeed a numeric
str(df$days)
num [1:4] 0 1 2 3
This is a simulation of dates:
date.simulation = as.Date(1:100, "2001-01-01")
factor(date.simulation-min(date.simulation))
You just subtract the minimum date from each date. The result is wrapped in factor() for plotting purposes.
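
Note that yday() counts from 1 January of each year, so it only matches "days since the start of recording" when the data begin on 1 January and stay within one year. A minimal base R sketch of the subtraction approach, using made-up dates that cross a year boundary and assuming the dates are stored as day/month/year strings as in the question:

```r
# Illustrative dates (not from the question) that cross a year boundary
date_strings <- c("30/12/2020", "31/12/2020", "01/01/2021")
dates <- as.Date(date_strings, format = "%d/%m/%Y")

# 1-based day number relative to the first recorded date
day_number <- as.numeric(dates - min(dates)) + 1
day_number  # 1 2 3
```

Dropping the `+ 1` gives a 0-based count like the yday() answer. Keeping `day_number` numeric rather than a factor lets the plotting function treat the axis as continuous.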

Plot data over time in R

I'm working with a dataframe including the columns 'timestamp' and 'amount'. The data can be produced like this
sample_size <- 40
start_date = as.POSIXct("2020-01-01 00:00")
end_date = as.POSIXct("2020-01-03 00:00")
timestamps <- as.POSIXct(sample(seq(start_date, end_date, by=60), sample_size))
amount <- rpois(sample_size, 5)
df <- data.frame(timestamps=timestamps, amount=amount)
Now I'd like to plot the sum of the amount entries per timeframe (for example every hour, 30 min, or 20 min). The final plot would look like a histogram of the timestamps, but instead of counting how many timestamps fall into each timeframe it should show the total amount that falls into it.
How can I approach this? I could create an extra vector with the amount of each timeframe, but I don't know how to proceed.
I'd also like to add a feature to reduce by time of day, so that just one day is plotted (notice the range between start_date and end_date is two days) and each timeframe (let's say every hour) shows the amount of data located in that hour on any date. In this case the data
2020-01-01 13:03:00 5
2020-01-02 13:21:00 10
2020-01-02 13:38:00 1
2020-01-01 13:14:00 3
would produce a bar of height sum(5, 10, 1, 3) = 19 in the timeframe 13:00-14:00. How can I implement the plotting so that I can easily switch between these two modes (plot all days, or plot just one day and reduce)?
EDIT: Following the advice of @Gregor Thomas I added a grouping column like this:
df$time_group <- lubridate::floor_date(df$timestamps, unit="20 minutes")
Now I'm wondering how to ignore the dates and thus reduce by 20-minute frame (independent of date).
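
One way to ignore the date is to group on minutes since midnight instead of on the full timestamp. The following is a minimal base R sketch under that assumption; `reduce_by_time_of_day` is a hypothetical helper, and the four rows are the example values from the question.

```r
# Hypothetical helper: total amount per time-of-day slot, ignoring the date.
reduce_by_time_of_day <- function(timestamps, amount, frame_mins = 60L) {
  lt <- as.POSIXlt(timestamps)
  mins <- lt$hour * 60L + lt$min             # minutes since midnight
  slot <- (mins %/% frame_mins) * frame_mins # start of the slot, in minutes
  totals <- tapply(amount, slot, sum)
  slot_mins <- as.integer(names(totals))
  data.frame(slot_start = sprintf("%02d:%02d", slot_mins %/% 60L, slot_mins %% 60L),
             amount = as.vector(totals))
}

ts <- as.POSIXct(c("2020-01-01 13:03:00", "2020-01-02 13:21:00",
                   "2020-01-02 13:38:00", "2020-01-01 13:14:00"), tz = "UTC")
amt <- c(5, 10, 1, 3)

hourly <- reduce_by_time_of_day(ts, amt, 60L)  # one slot, 13:00, amount 19
twenty <- reduce_by_time_of_day(ts, amt, 20L)  # 13:00 -> 8, 13:20 -> 11
```

Either result can be drawn with, for example, `barplot(hourly$amount, names.arg = hourly$slot_start)`. For the non-reduced mode, the same sum can be taken over the `time_group` column from the edit above instead of over `slot`.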

In R, finding the start and end dates for each interval after using diff()

I am using diff() to find the difference in variables down a column. However, I would also like to display the dates the difference is found over.
For example:
Dates <- c("2017-06-07","2017-06-10","2017-06-15","2017-07-07","2017-07-12","2017-07-18")
Variable<-c(5,6,7,8,9,3)
dd<-diff(Dates)
dv<-diff(Variable)
I'd like to find a way to add columns for the start and end date for each interval, so "06-07" as the start and "06-10" as the end date for the difference between the first 2 variables. Any ideas?
The OP has requested to add columns for the start and end date for each interval.
This can be accomplished by using the head() and tail() functions:
# data provided by OP
Dates <- c("2017-06-07","2017-06-10","2017-06-15","2017-07-07","2017-07-12","2017-07-18")
Variable<-c(5,6,7,8,9,3)
start <- head(Dates, -1) # take all Dates except the last one
end <- tail(Dates, -1L) # take all Dates except the first one
dd <- diff(as.Date(Dates)) # coercion to class Date required for date arithmetic
dv <- diff(Variable)
# create data.frame of intervals
intervals <- data.frame(start, end, dd, dv)
intervals
       start        end      dd dv
1 2017-06-07 2017-06-10  3 days  1
2 2017-06-10 2017-06-15  5 days  1
3 2017-06-15 2017-07-07 22 days  1
4 2017-07-07 2017-07-12  5 days  1
5 2017-07-12 2017-07-18  6 days -6
Note that intervals has 5 rows while the vector of breakpoints Dates it was constructed from has a length of 6.
Are you after the difference in dates?
diff(as.Date(Dates, format="%Y-%m-%d"))

how to iterate based on a condition, and assign aggregated value to a row in new dataframe in R

I have a large dataset of stock prices with 203615 rows and 2 columns (price and Timestamp), in the format below:
price(USD) | Timestamp
3.5 | 2014-01-01 20:00:00
2 | 2014-01-01 20:15:00
5 | 2014-01-01 20:15:00
----
4 | 2014-01-31 23:00:00
5 | 2014-01-31 23:00:00
4.5 | 2014-01-31 23:00:00
203615 2.3 | 2014-01-31 23:00:00
The timestamps range from "2014-01-01 20:00:00" to "2014-01-31 23:00:00" in 15-minute intervals (rounded to 15 min). There are several transactions on the same timestamp.
I have to group rows into windows of one day by timestamp, calculate the min, max and mean of the price and the number of rows within each window, and assign them to a row in a new dataframe on every iteration, from the starting date ("2014-01-02 20:00:00") until the end timestamp ("2014-01-31 23:00:00").
Note: the iteration has to be done for every 15 min.
I have tried a while loop. Please help me with this and suggest any packages I could use.
This is my own code, which I used to create a window of time (the prior 24 hours) to iterate over and compute min and max values for a project I am working on.
inter is the start of the interval used inside the loop
raw is the data frame name
i is the row of raw from which the datetime column value is taken
I started my intervals at the 97th row (i in 97:nrow(raw)) because the stamps were taken at 15-minute intervals and I wanted a 24-hour backward window, so I needed to leave 96 intervals to pull from. I could not reach back into a time I had no data for, so I started far enough into my data to leave room for those intervals.
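for (i in 97:nrow(raw)){
  # start of the 24-hour backward window for row i
  inter <- raw$datetime[i] - as.difftime(24, units='hours')
  # temp was not defined in the original snippet; assumed here to be the rows of raw inside the window
  temp <- raw[raw$datetime >= inter & raw$datetime <= raw$datetime[i], ]
  raw$deltaAirTemp_24[i] <- max(temp$Air.Temperature) - min(temp$Air.Temperature)
}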
The key is getting the column into a real date-time format. Run str() on the field with the dates; if it comes back as anything but Factor, use:
as.POSIXct(yourdate$field, format = "%Y-%m-%d %H:%M:%S")
If str() reports a Factor, wrap the field in as.character() first, as in as.POSIXct(as.character(yourdate$field), format = "%Y-%m-%d %H:%M:%S"), so that the factor's level numbers are not converted instead of the dates.
Get them into a consistent date format, then construct something like the loop above to extract the periods you need. difftime is in the base package and works well; you can use positive and negative intervals with it. I hope this helps!
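
Applied to the stock-price question, the same trailing-window idea can be sketched in base R as follows. `window_stats` and its column names are hypothetical, the tiny data set is made up, and the window is assumed to be half-open, (end - 24 h, end], since the question does not say which endpoints are included.

```r
# Hypothetical helper: price statistics over a trailing window,
# evaluated every step_secs seconds.
window_stats <- function(df, window_secs = 24 * 3600, step_secs = 15 * 60) {
  ends <- seq(min(df$Timestamp) + window_secs, max(df$Timestamp), by = step_secs)
  rows <- lapply(seq_along(ends), function(k) {
    t_end <- ends[k]
    # assumption: window start excluded, window end included
    p <- df$price[df$Timestamp > t_end - window_secs & df$Timestamp <= t_end]
    data.frame(window_end = t_end,
               n = length(p),
               price_min = if (length(p)) min(p) else NA_real_,
               price_max = if (length(p)) max(p) else NA_real_,
               price_mean = if (length(p)) mean(p) else NA_real_)
  })
  do.call(rbind, rows)
}

# Made-up data: four trades, the last one exactly 24 hours after the first
base_time <- as.POSIXct("2014-01-01 20:00:00", tz = "UTC")
prices <- data.frame(price = c(3.5, 2, 5, 4),
                     Timestamp = base_time + c(0, 900, 900, 86400))
res <- window_stats(prices)
```

Here the single window ends at 2014-01-02 20:00:00 and covers the three trades after the first one. With 203615 rows a loop like this may be slow; a non-equi join or rolling join in data.table is a common alternative, not shown here.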

Using dplyr::mutate between two dataframes to create column based on date range

Right now I have two dataframes. One contains over 11 million rows of a start date, end date, and other variables. The second dataframe contains daily values for heating degree days (basically a temperature measure).
set.seed(1)
library(lubridate)
date.range <- ymd(paste(2008,3,1:31,sep="-"))
daily <- data.frame(date=date.range,value=runif(31,min=0,max=45))
intervals <- data.frame(start=daily$date[1:5],end=daily$date[c(6,9,15,24,31)])
In reality my daily dataframe has every day for 9 years and my intervals dataframe has entries that span over arbitrary dates in this time period. What I wanted to do was to add a column to my intervals dataframe called nhdd that summed over the values in daily corresponding to that time interval (end exclusive).
For example, in this case the first entry of this new column would be
sum(daily$value[1:5])
and the second would be
sum(daily$value[2:8]) and so on.
I tried using the following code
intervals <- mutate(intervals,nhdd=sum(filter(daily,date>=start&date<end)$value))
This is not working and I think it might have something to do with not referencing the columns correctly but I'm not sure where to go.
I'd really like to use dplyr to solve this and not a loop because 11 million rows will take long enough using dplyr. I tried using more of lubridate but dplyr doesn't seem to support the Period class.
Edit: I'm actually using dates from as.Date now instead of lubridate, but the basic question of how to refer to a different dataframe from within mutate still stands.
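# note: between() is inclusive at both ends, and subtracting .Machine$double.eps from a
# Date leaves it unchanged at this magnitude, so the end date is included in the sums below
eps <- .Machine$double.eps
library(dplyr)
intervals %>%
rowwise() %>%
mutate(nhdd = sum(daily$value[between(daily$date, start, end - eps )]))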
# start end nhdd
#1 2008-03-01 2008-03-06 144.8444
#2 2008-03-02 2008-03-09 233.4530
#3 2008-03-03 2008-03-15 319.5452
#4 2008-03-04 2008-03-24 531.7620
#5 2008-03-05 2008-03-31 614.2481
If you find the dplyr solution a bit slow (mainly due to rowwise), you might want to use data.table for pure speed
library(data.table)
setkey(setDT(intervals), start, end)
setDT(daily)[, date1 := date]
foverlaps(daily, by.x = c("date", "date1"), intervals)[, sum(value), by=c("start", "end")]
# start end V1
#1: 2008-03-01 2008-03-06 144.8444
#2: 2008-03-02 2008-03-09 233.4530
#3: 2008-03-03 2008-03-15 319.5452
#4: 2008-03-04 2008-03-24 531.7620
#5: 2008-03-05 2008-03-31 614.2481
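
The question asks for an end-exclusive sum, whereas between() and foverlaps(type = "any") both include the end date. A minimal base R sketch with explicit comparisons, using small made-up data (values 1 to 10 rather than runif) so that the result is easy to verify by hand:

```r
# Made-up daily values: 1, 2, ..., 10 for 2008-03-01 .. 2008-03-10
daily <- data.frame(date = as.Date("2008-03-01") + 0:9,
                    value = as.numeric(1:10))
intervals <- data.frame(start = as.Date(c("2008-03-01", "2008-03-02")),
                        end   = as.Date(c("2008-03-06", "2008-03-09")))

# Sum daily$value over [start, end): start included, end excluded
intervals$nhdd <- vapply(seq_len(nrow(intervals)), function(i) {
  in_range <- daily$date >= intervals$start[i] & daily$date < intervals$end[i]
  sum(daily$value[in_range])
}, numeric(1))

intervals$nhdd  # 15 35
```

The same condition, `daily$date >= start & daily$date < end`, could replace the between() call inside the rowwise mutate above. This loop is still one pass over `daily` per interval, so for 11 million rows a join-based approach is likely to be considerably faster.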
