Speed improvement for sapply along a POSIX sequence - r

I am iterating along a POSIX sequence to identify the number of concurrent events at a given time with exactly the method described in this question and the corresponding answer:
How to count the number of concurrent users using time interval data?
My problem is that my tinterval sequence in minutes covers a year, which means it has 523,025 entries. In addition, I am also thinking about a resolution in seconds, which would make things even worse.
Is there anything I can do to improve this code (e.g. is the order of the date intervals in the input data (tdata) of any relevance?), or do I have to accept this performance if I want a solution in R?

You could try using data.table's new foverlaps() function. With the data from the other question:
library(data.table)
setDT(tdata)
setkey(tdata, start, end)
minutes <- data.table(start = seq(trunc(min(tdata[["start"]]), "mins"),
                                  round(max(tdata[["end"]]), "mins"), by = "min"))
minutes[, end := start+59]
setkey(minutes, start, end)
DT <- foverlaps(tdata, minutes, type="any")
counts <- DT[, .N, by=start]
plot(N~start, data=counts, type="s")
I haven't timed this on huge data; try it yourself.
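For reference, here is one quick way (my addition, not from the original answer) to time the overlap join and the per-minute count on your own tdata:
# rough timing of the two steps on your own data
system.time({
  DT <- foverlaps(tdata, minutes, type = "any")
  counts <- DT[, .N, by = start]
})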

Here is another approach that should be faster than processing a list. It relies on data.table joins and lubridate for binning times to the closest minute. It also assumes that there were 0 users before you started recording them, but this can be fixed by adding a constant to concurrent at the end:
library(data.table)
library(lubridate)

td <- data.table(start = floor_date(tdata$start, "minute"),
                 end   = ceiling_date(tdata$end, "minute"))

# create vector of all minutes from start to end
# about 530K for a whole year
time.grid <- seq(from = min(td$start), to = max(td$end), by = "min")
users <- data.table(time = time.grid, key = "time")

# match users on starting time and
# sum matches by start time to count multiple logins in the same minute
setkey(td, start)
users <- td[users,
            list(started = !is.na(end)),
            nomatch = NA,
            allow.cartesian = TRUE][, list(started = sum(started)),
                                    by = start]

# match users on ending time, essentially the same procedure
setkey(td, end)
users <- td[users,
            list(started, ended = !is.na(start)),
            nomatch = NA,
            allow.cartesian = TRUE][, list(started = sum(started),
                                           ended = sum(ended)),
                                    by = end]

# fix timestamp column name
setnames(users, "end", "time")

# here you can exclude all entries where both counts are zero
# for a sparse representation
users <- users[started > 0 | ended > 0]

# last step, take difference of cumulative sums to get concurrent users
users[, concurrent := cumsum(started) - cumsum(ended)]
The two complex-looking joins could each be split into two steps (first the join, then the summary by minute), but I recall reading that doing it in one chained expression is more efficient. If not, splitting them would make the operations more legible.
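To illustrate, here is a minimal sketch (my addition, untested on large data) of the first join split into those two steps, assuming td and users are defined as above:
# step 1: join users' start times onto the minute grid
setkey(td, start)
joined <- td[users, nomatch = NA, allow.cartesian = TRUE]
# step 2: count the logins per minute
users <- joined[, list(started = sum(!is.na(end))), by = start]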

R is an interpreted language, which means that every time you ask it to execute a command, it has to interpret your code first and then execute it. For loops this means that in each iteration it essentially has to "re-interpret" your code, which is, of course, very slow.
There are three common ways that I am aware of which help solve this:
1. R is vector-oriented, so loops are most likely not a good way to use it. So, if possible, you should try to rethink your logic here and vectorize the approach.
2. Use a just-in-time (byte-code) compiler (see the sketch after this list).
3. (What I ended up doing) Use Rcpp to translate your loop-heavy code into C/C++ (also sketched below). This can easily give you a speed boost of a thousand times.
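As a rough illustration of options 2 and 3 (my addition, not from the original answer): count_slow below is a made-up stand-in for the per-minute counting loop, not the exact code from the linked question, and the timestamps are assumed to be passed as plain numerics.
library(compiler)
library(Rcpp)

# a toy loop-based counter: how many intervals [start, end] cover time t
count_slow <- function(starts, ends, t) {
  n <- 0L
  for (i in seq_along(starts)) {
    if (starts[i] <= t && ends[i] >= t) n <- n + 1L
  }
  n
}

# option 2: byte-compile the function (often a modest speed-up)
count_cmp <- cmpfun(count_slow)

# option 3: rewrite the loop in C++ via Rcpp (usually the biggest win);
# pass POSIXct times through as.numeric() before calling
cppFunction('
int count_fast(NumericVector starts, NumericVector ends, double t) {
  int n = 0;
  for (int i = 0; i < starts.size(); ++i) {
    if (starts[i] <= t && ends[i] >= t) ++n;
  }
  return n;
}')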

Related

assign phase of the day to each date - but fast

I want to compare an element within a list to intervals within a data frame and assign the respective interval to that element.
In my case I want to get the phase of the day (i.e. morning, day, evening, night) for a measurement. I found the R package 'suncalc', which creates the intervals for such phases, and I also have a solution to assign these phases of the day. BUT this is very slow and I wonder how to do it faster.
# make a list of different days and times
times <- seq.POSIXt(from = Sys.time(),
                    to = Sys.time() + 2*24*60*60, length.out = 50)

# load the suncalc package
library(suncalc)

# a function to get a phase for one point in time
get.dayphase <- function(x){
  phases <- getSunlightTimes(date = as.Date(x, tz = Sys.timezone()),
                             lat = 52.52, lon = 13.40,
                             tz = Sys.timezone())
  if (x < phases$nightEnd) return("night_morning")
  if (x >= phases$nightEnd & x < phases$goldenHourEnd) return("morning")
  if (x >= phases$goldenHourEnd & x < phases$goldenHour) return("day")
  if (x >= phases$goldenHour & x < phases$night) return("evening")
  if (x >= phases$night) return("night_evening")
}

# use sapply to get a phase for each point in time of the list
df <- data.frame(time = times, dayphase = sapply(times, get.dayphase))
the desired but slow result:
head(df)
time dayphase
1 2019-09-05 16:12:08 day
2 2019-09-05 17:10:55 day
3 2019-09-05 18:09:41 day
4 2019-09-05 19:08:28 evening
5 2019-09-05 20:07:14 evening
6 2019-09-05 21:06:01 evening
Basically, this is what I want, but it is too slow when I run it on a lot of points in time. getSunlightTimes() can also take a vector of dates and returns a data table, but I have no idea how to handle that to get the desired result.
Thanks for your help
What is slowing your process down is most likely the sapply function, which is basically a hidden for loop.
To improve performance you need to vectorize your code. getSunlightTimes() can take a vector of dates. Also, instead of using a series of if statements, the case_when() function from the dplyr package simplifies the code and should reduce the number of logical comparisons.
library(dplyr)
library(suncalc)

times <- seq.POSIXt(from = Sys.time(),
                    to = Sys.time() + 2*24*60*60, length.out = 50)

# get the phases for all of the times in one vectorized call
phases <- getSunlightTimes(as.Date(times),
                           lat = 52.52, lon = 13.40,
                           tz = Sys.timezone(),
                           keep = c("nightEnd", "goldenHourEnd", "goldenHour", "night"))

dayphase <- case_when(
  times < phases$nightEnd      ~ "night_morning",
  times < phases$goldenHourEnd ~ "morning",
  times < phases$goldenHour    ~ "day",
  times < phases$night         ~ "evening",
  TRUE                         ~ "night_evening"
)
This should provide a significant improvement. Additional performance gains are possible if you have a large number of times on each day. In that case, calculate the phases data frame once per day and then use the result as a lookup table for the individual times.
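A rough sketch of that per-day lookup idea (my addition, untested): compute the phases once per unique date, then join them back onto the individual timestamps.
library(dplyr)
library(suncalc)

# one row of sun phases per unique calendar day
day_phases <- getSunlightTimes(
  date = unique(as.Date(times, tz = Sys.timezone())),
  lat = 52.52, lon = 13.40, tz = Sys.timezone(),
  keep = c("nightEnd", "goldenHourEnd", "goldenHour", "night")
)

# join the per-day phases back onto every timestamp and classify
df <- data.frame(time = times, date = as.Date(times, tz = Sys.timezone())) %>%
  left_join(day_phases, by = "date") %>%
  mutate(dayphase = case_when(
    time < nightEnd      ~ "night_morning",
    time < goldenHourEnd ~ "morning",
    time < goldenHour    ~ "day",
    time < night         ~ "evening",
    TRUE                 ~ "night_evening"
  ))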

T-tests in R - unable to run together

I have an airline dataset from stat computing which I am trying to analyse.
There are variables DepTime and ArrDelay (departure time and arrival delay). I am trying to analyse how arrival delay varies with certain chunks of departure time. My objective is to find which time chunks a person should avoid when booking tickets in order to avoid arrival delays.
My understanding: if a one-tailed t-test between arrival delays for departure time >1800 and arrival delays for departure time >1900 shows high significance, it means that one should avoid flights between 1800 and 1900 (please correct me if I am wrong). I want to run such tests for all departure hours.
Totally new to programming and data science. Any help would be much appreciated.
The data looks like this; the highlighted columns are the ones I am analysing.
Sharing an image of the data is not the same as providing the data for us to work with...
That said I went and grabbed one year of data and worked this up.
flights <- read.csv("~/Downloads/1995.csv", header = TRUE)
flights <- flights[, c("DepTime", "ArrDelay")]

# bin departure times (hhmm) into hour-wide chunks, e.g. 1845 -> 1800
flights$Dep <- round(flights$DepTime - 30, digits = -2)
head(flights, n = 25)

# This tests each hour of departures against the entire day.
# Alternative is set to "less" because we want to know if a given hour
# has less delay than the day as a whole.
pVsDay <- tapply(flights$ArrDelay, flights$Dep,
                 function(x) t.test(x, flights$ArrDelay, alternative = "less"))

# This tests each hour of departures against every other hour of the day.
# Alternative is set to "less" because we want to know if a given hour
# has less delay than the other hours.
pAllvsAll <- tapply(flights$ArrDelay, flights$Dep,
                    function(x) tapply(flights$ArrDelay, flights$Dep, function(z)
                      t.test(x, z, alternative = "less")))
I'll let you figure out multiple hypothesis testing and the like.
(Screenshot of the All vs All output omitted.)
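As a starting point, here is a minimal sketch (my addition, not part of the original answer) for pulling the p-values out of pVsDay and adjusting them for multiple comparisons:
# extract each hour's p-value and apply a Benjamini-Hochberg correction
pvals <- sapply(pVsDay, function(tt) tt$p.value)
data.frame(hour = names(pvals),
           p_raw = pvals,
           p_adj = p.adjust(pvals, method = "BH"))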

Cutting a time elapsed variable into manageable things

flight_time
11:42:00
19:37:06
18:11:17
I am having trouble working with the flight_time variable in the dataset. I can't seem to figure out how to get R to treat this value as numeric.
Apologies if this has been asked before.
EDIT:
Okay well given the stuff posted below I've realised there's a few things I didn't know/check before.
First of all this is a factor variable. I read through the lubridate package documentation, and since I want to perform arithmetic operations (if this is the right terminology) I believe the duration function is the correct one.
However, looking at the examples, I am not entirely sure what the syntax is for applying this to a whole column in a large(ish) data frame. Since I have 4.5k observations, I'm not sure exactly how to apply this. I don't need an excessive amount of granularity; even hours and minutes are fine.
So I'm thinking I would want my code to look like:
conversion from factor variable to character string > conversion from character string to duration/as.numeric.
Try this code:
# dummy data with factors
df <- data.frame(flight_time = c("11:42:00", "19:37:06", "18:11:17"))

# add Seconds column
df$Seconds <- sapply(as.character(df$flight_time), function(i)
  sum(as.numeric(unlist(strsplit(i, ":"))) * c(60^2, 60, 1)))

# result
df
#   flight_time Seconds
# 1    11:42:00   42120
# 2    19:37:06   70626
# 3    18:11:17   65477
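Since the question mentions lubridate, the same conversion can also be sketched (my addition, untested) with hms() and period_to_seconds():
library(lubridate)
# hms() parses "HH:MM:SS" strings; period_to_seconds() turns the result into numeric seconds
df$Seconds <- period_to_seconds(hms(as.character(df$flight_time)))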

Timeline Event Concentration

Given a series of events, is there an algorithm for determining if a certain number of events occur in a certain period of time? For example, given list of user logins, are there any thirty day periods that contain more than 10 logins?
I can come up with a few brute-force ways to do this; I'm just wondering if there is an algorithm or a name for this kind of problem that I haven't turned up with the usual Google searching.
In general it is called binning. It is basically aggregating one variable (e.g. events) over an index (e.g. time) using count as a summary function.
Since you didn't provide data I'll just show a simple example:
# Start with a data frame of dates and number of events
data <- data.frame(date = paste('2013', rep(1:12, each = 20), rep(1:20, times = 12), sep = '-'),
                   logins = rpois(12*20, 5))

# Make sure to store dates as class Date, it can be useful for other purposes
data$date <- as.Date(data$date)

# Now bin it. This is just a dirty trick; exactly how you do it depends on what you want.
# Let's just sum the number of events for each month
data$month <- sub('-', '', substr(data$date, 6, 7))
aggregate(logins ~ month, data = data, sum, na.rm = TRUE)
Is that what you wanted?
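For the literal thirty-day question in the post, a sliding-window variant of the same counting idea might look like this sketch (my addition, assuming the data frame above with one row per date):
# for each date, count logins falling in the 30 days starting on that date
data <- data[order(data$date), ]
window_counts <- sapply(data$date, function(d)
  sum(data$logins[data$date >= d & data$date < d + 30]))
# are there any 30-day periods with more than 10 logins?
any(window_counts > 10)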

Obtaining or subsetting the first 5 minutes of each day of data from an xts

I would like to subset out the first 5 minutes of time series data for each day from minutely data. However, the first 5 minutes do not occur at the same time each day, so using something like xtsobj["T09:00/T09:05"] would not work, since the beginning of the first 5 minutes changes, i.e. sometimes it starts at 9:20am or some other random time in the morning instead of 9am.
So far, I have been able to subset out the first minute for each day using a function like:
k <- diff(index(xtsobj))> 10000
xtsobj[c(1, which(k)+1)]
i.e. finding gaps in the data that are larger than 10000 seconds. But going from that to finding the first 5 minutes of each day is proving more difficult, as the data is not always evenly spaced: between the first minute and the 5th minute there could be anywhere from 2 to 5 rows, and thus using something like:
xtsobj[c(1, which(k)+6)]
and then binding the results together
is not always accurate. I was hoping that a function like 'first' could be used, but wasn't sure how to do this for multiple days, perhaps this might be the optimal solution. Is there a better way of obtaining this information?
Many thanks to the Stack Overflow community in advance.
split(xtsobj, "days") will create a list with an xts object for each day.
Then you can apply head to each day:
lapply(split(xtsobj, "days"), head, 5)
or more generally
lapply(split(xtsobj, "days"), function(x) {
  x[1:5, ]
})
Finally, you can rbind the days back together if you want.
do.call(rbind, lapply(split(xtsobj, "days"), function(x) x[1:5, ]))
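If you literally want the first 5 minutes of data rather than the first 5 rows, xts::first() also accepts a time span; a sketch of that variant (my addition, untested):
# first 5 minutes of observations (however many rows that is) for each day
do.call(rbind, lapply(split(xtsobj, "days"), xts::first, "5 mins"))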
What about using the package lubridate: first find the starting point of each day, which according to you changes sort of randomly, and then use the function minutes().
So it would be something like:
five_minutes_after <- starting_point_each_day + minutes(5)
Then you can use the usual xts subsetting, doing something like:
five_min_period <- paste(starting_point_each_day, five_minutes_after, sep = '/')
xtsobj[five_min_period]
Edit:
@Joshua
I think this works, look at this example:
library(xts)
library(lubridate)

x <- xts(cumsum(rnorm(20, 0, 0.1)), Sys.time() - seq(60, 1200, 60))
starting_point_each_day <- index(x[1])
five_minutes_after <- index(x[1]) + minutes(5)
five_min_period <- paste(starting_point_each_day, five_minutes_after, sep = '/')
x[five_min_period]
In my previous example I made a mistake: I put five_min_period between quotes.
Was that what you were pointing out, Joshua? Also, maybe the starting point is not necessary; just:
until5min <- paste('/', five_minutes_after, sep = "")
x[until5min]
