I want to compare an element within a list to intervals within a data frame and assign the respective interval to that element.
In my case I want to get the phase of the day (i.e. morning, day, evening, night) for each measurement. I found the R package 'suncalc', which creates the intervals for such phases, and I also have a solution that assigns these phases of the day. BUT it is very slow, and I wonder how to do it faster.
# make a list of different days and times
times <- seq.POSIXt(from = Sys.time(),
                    to = Sys.time() + 2*24*60*60, length.out = 50)
# load the suncalc package
library(suncalc)
# a function to get the phase for one point in time
get.dayphase <- function(x) {
  phases <- getSunlightTimes(date = as.Date(x, tz = Sys.timezone()),
                             lat = 52.52, lon = 13.40,
                             tz = Sys.timezone())
  if (x < phases$nightEnd) return("night_morning")
  if (x >= phases$nightEnd & x < phases$goldenHourEnd) return("morning")
  if (x >= phases$goldenHourEnd & x < phases$goldenHour) return("day")
  if (x >= phases$goldenHour & x < phases$night) return("evening")
  if (x >= phases$night) return("night_evening")
}
# use sapply to get a phase for each point in time of the list
df <- data.frame(time = times, dayphase = sapply(times, get.dayphase))
The desired (but slow) result:
head(df)
time dayphase
1 2019-09-05 16:12:08 day
2 2019-09-05 17:10:55 day
3 2019-09-05 18:09:41 day
4 2019-09-05 19:08:28 evening
5 2019-09-05 20:07:14 evening
6 2019-09-05 21:06:01 evening
Basically, this is what I want. But it is too slow when I run it on a lot of points in time. getSunlightTimes() can also take a list of dates and returns a data table, but I have no idea how to handle this to get the desired result.
Thanks for your help
What is slowing your process down is most likely the sapply() call, which is essentially a hidden for loop.
To improve performance you need to vectorize your code. getSunlightTimes() can take a vector of dates. Also, instead of a series of if statements, the case_when() function from the dplyr package simplifies the code and should reduce the number of logical operations.
library(dplyr)
library(suncalc)
times <- seq.POSIXt(from = Sys.time(),
                    to = Sys.time() + 2*24*60*60, length.out = 50)
# get the phases for all of the times in a single vectorized call
phases <- getSunlightTimes(as.Date(times, tz = Sys.timezone()),
                           lat = 52.52, lon = 13.40,
                           tz = Sys.timezone(),
                           keep = c("nightEnd", "goldenHourEnd", "goldenHour", "night"))
dayphase <- case_when(
  times < phases$nightEnd ~ "night_morning",
  times < phases$goldenHourEnd ~ "morning",
  times < phases$goldenHour ~ "day",
  times < phases$night ~ "evening",
  TRUE ~ "night_evening"
)
This should provide a significant improvement. Additional performance improvements are possible if you have a large number of times on each day. If that is the case, calculate the phases data frame once per day and then use it as a lookup table for the individual times, as sketched below.
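A minimal sketch of that lookup-table idea, reusing the objects from above (the column names are those returned by getSunlightTimes()):
# compute the phase boundaries once per calendar day ...
dates <- as.Date(times, tz = Sys.timezone())
lookup <- getSunlightTimes(date = unique(dates),
                           lat = 52.52, lon = 13.40,
                           tz = Sys.timezone(),
                           keep = c("nightEnd", "goldenHourEnd", "goldenHour", "night"))
# ... then map every measurement to its day's row and classify as before
idx <- match(dates, lookup$date)
dayphase <- case_when(
  times < lookup$nightEnd[idx] ~ "night_morning",
  times < lookup$goldenHourEnd[idx] ~ "morning",
  times < lookup$goldenHour[idx] ~ "day",
  times < lookup$night[idx] ~ "evening",
  TRUE ~ "night_evening"
)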
I want to forecast the number of customers entering a shop during service hours. I have hourly data for
Monday to Friday
8:00 to 18:00
Thus, I assume my time series is in fact regular, but atypical in a sense, since I have 10 hours a day and 5 days a week.
I am able to do modeling with this regular 24/7 time series by setting non-service hours to zero, but I find this inefficient and also incorrect, because the times are not missing. Rather, they do not exist.
Using the old ts-framework I was able to explicitly specify
myTS <- ts(x, frequency = 10)
However, within the new tsibble/fable framework this is not possible. It detects hourly data and expects 24 hours per day, not 10. Every subsequent function reminds me of implicit gaps in time. Manually overriding the interval attribute works:
> attr(ts, "interval") <- new_interval(hour = 10)
> has_gaps(ts)
# A tibble: 1 x 1
.gaps
<lgl>
1 FALSE
But it has no effect on modeling:
model(ts,
snaive = SNAIVE(customers ~ lag("week")))
I still get the same error message:
1 error encountered for snaive [1] .data contains implicit gaps in
time. You should check your data and convert implicit gaps into
explicit missing values using tsibble::fill_gaps() if required.
Any help would be appreciated.
This question actually corresponds to this gh issue. As far as I know, there is no R package that lets users construct a custom schedule, for example to specify particular days and intra-day periods. A couple of packages provide some specific calendars (like business dates), but none of them gives a way to set up intra-day periods. tsibble will gain a calendar argument for custom calendars to respect such structural missings once a package for them is made available, but there is currently no support for this.
As you stated, it is hourly data, so the data interval should be 1 hour, not 10 hours. The frequency in ts() is the seasonal period (10 hours per day) used for modelling, not the data interval.
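Until such a calendar argument exists, the only route I know of is the one the error message itself suggests: make the gaps explicit. A minimal sketch (assuming your tsibble ts has a customers column; filling with 0 reproduces the 24/7 workaround you already described):
library(tsibble)
library(fable)
# turn the implicit gaps into explicit rows; customers = 0 fills the
# non-service hours instead of leaving NAs
ts_filled <- fill_gaps(ts, customers = 0)
model(ts_filled, snaive = SNAIVE(customers ~ lag("week")))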
I have an airline dataset from stat computing which I am trying to analyse.
There are variables DepTime and ArrDelay (Departure Time and Arrival Delay). I am trying to analyse how Arrival Delay varies with certain chunks of departure time. My objective is to find which time chunks a person should avoid when booking their tickets in order to avoid arrival delays.
My understanding: if a one-tailed t-test between arrival delays for dep time >1800 and arrival delays for dep time >1900 shows high significance, it means that one should avoid flights between 1800 and 1900. (Please correct me if I am wrong.) I want to run such tests for all departure hours.
Totally new to programming and Data Science. Any help would be much appreciated.
The data looks like this; the highlighted columns are the ones I am analysing.
Sharing an image of the data is not the same as providing the data for us to work with...
That said, I went and grabbed one year of data and worked this up.
flights <- read.csv("~/Downloads/1995.csv", header = TRUE)
flights <- flights[, c("DepTime", "ArrDelay")]
# bin the HHMM departure times into hour-of-day buckets (e.g. 1859 -> 1800)
flights$Dep <- round(flights$DepTime - 30, digits = -2)
head(flights, n = 25)
# This tests each hour of departures against the entire day.
# Alternative is set to "less" because we want to know if a given hour
# has less delay than the day as a whole.
pVsDay <- tapply(flights$ArrDelay, flights$Dep,
                 function(x) t.test(x, flights$ArrDelay, alternative = "less"))
# This tests each hour of departures against every other hour of the day.
# Alternative is set to "less" because we want to know if a given hour
# has less delay than the other hours.
pAllvsAll <- tapply(flights$ArrDelay, flights$Dep,
                    function(x) tapply(flights$ArrDelay, flights$Dep, function(z)
                      t.test(x, z, alternative = "less")))
I'll let you figure out multiple hypothesis testing and the like.
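For example, the p-values can be pulled out of those results like this (a sketch; p.adjust() is one simple way to start on the multiple-testing problem):
# p-value of each departure hour vs. the whole day
pvals <- sapply(pVsDay, function(res) res$p.value)
# adjust for testing many hours at once
p.adjust(pvals, method = "BH")
# the all-vs-all object is a list of lists, so go one level deeper
pmat <- sapply(pAllvsAll, function(hr) sapply(hr, function(res) res$p.value))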
I am iterating along a POSIX sequence to identify the number of concurrent events at a given time with exactly the method described in this question and the corresponding answer:
How to count the number of concurrent users using time interval data?
My problem is that my tinterval sequence in minutes covers a year, which means it has 523,025 entries. In addition, I am also thinking about a resolution in seconds, which would make things even worse.
Is there anything I can do to improve this code (e.g. is the order of the date intervals in the input data (tdata) of relevance?), or do I have to accept this performance if I want a solution in R?
You could try using data.table's new foverlaps() function. With the data from the other question:
library(data.table)
setDT(tdata)
setkey(tdata, start, end)
minutes <- data.table(start = seq(trunc(min(tdata[["start"]]), "mins"),
                                  round(max(tdata[["end"]]), "mins"), by = "min"))
minutes[, end := start + 59]
setkey(minutes, start, end)
DT <- foverlaps(tdata, minutes, type="any")
counts <- DT[, .N, by=start]
plot(N~start, data=counts, type="s")
I haven't timed this on huge data; try it yourself.
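A quick way to check on your own data, in case it helps:
# time the overlap join plus the per-minute count
system.time({
  DT <- foverlaps(tdata, minutes, type = "any")
  counts <- DT[, .N, by = start]
})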
Here is another approach that should be faster than processing a list. It relies on data.table joins and on lubridate for binning times to the closest minute. It also assumes that there were 0 users before you started recording them, but this can be fixed by adding a constant number to concurrent at the end:
library(data.table)
library(lubridate)
td <- data.table(start = floor_date(tdata$start, "minute"),
                 end = ceiling_date(tdata$end, "minute"))
# create a vector of all minutes from start to end
# (about 530K for a whole year)
time.grid <- seq(from = min(td$start), to = max(td$end), by = "min")
users <- data.table(time = time.grid, key = "time")
# match users on starting time and
# sum matches by start time to count multiple logins in the same minute
setkey(td, start)
users <- td[users,
            list(started = !is.na(end)),
            nomatch = NA,
            allow.cartesian = TRUE][, list(started = sum(started)),
                                    by = start]
# match users on ending time, essentially the same procedure
setkey(td, end)
users <- td[users,
            list(started, ended = !is.na(start)),
            nomatch = NA,
            allow.cartesian = TRUE][, list(started = sum(started),
                                           ended = sum(ended)),
                                    by = end]
# fix the timestamp column name
setnames(users, "end", "time")
# here you can exclude all entries where both counts are zero
# for a sparse representation
users <- users[started > 0 | ended > 0]
# last step: take the difference of cumulative sums to get concurrent users
users[, concurrent := cumsum(started) - cumsum(ended)]
The two complex-looking joins could each be split into two steps (first the join, then the summary by minute), but I recall reading that chaining them this way is more efficient. If not, splitting them would make the operations more legible.
R is an interpreted language, which means that every time you ask it to execute a command, it has to interpret your code first and then execute it. For loops, this means the body is re-interpreted on every iteration, which is, of course, very slow.
There are three common ways I am aware of to deal with this:
1. R is vector-oriented, so explicit loops are most likely not a good way to use it. If possible, you should try to rethink your logic and vectorize the approach (see the sketch after this list).
2. Use a just-in-time compiler (e.g. R's built-in compiler package).
3. (What I came to do in the end) Use Rcpp to translate your loop-heavy code into C/C++. This can easily give you a speed boost of a factor of a thousand.
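For the concurrent-users problem specifically, the vectorized idea from point 1 could look roughly like this (a sketch, assuming tdata has POSIXct start and end columns as in the linked question):
# treat every start as +1 and every end as -1, order the events by time,
# and take a cumulative sum: 'concurrent' is the number of active users
# at each change point, with no per-minute loop at all
events <- data.frame(time = c(tdata$start, tdata$end),
                     change = rep(c(1L, -1L), each = nrow(tdata)))
events <- events[order(events$time), ]
events$concurrent <- cumsum(events$change)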
I've seen a lot of solutions for working with groups of times or dates, like aggregate to sum daily observations into weekly observations, or solutions to compute a moving average, but I haven't found a way to do what I want, which is to pluck relative dates out of data keyed by an additional variable.
I have daily sales data for a bunch of stores. So that is a data.frame with columns
store_id date sales
It's nearly complete, but there are some missing data points, and those missing data points are having a strong effect on our models (I suspect). So I used expand.grid to make sure we have a row for every store and every date, but at this point the sales data for those missing data points are NAs. I've found solutions like
dframe[is.na(dframe)] <- 0
or
dframe$sales[is.na(dframe$sales)] <- mean(dframe$sales, na.rm = TRUE)
but I'm not happy with the RHS of either of those. I want to replace missing sales data with our best estimate, and the best estimate of sales for a given store on a given date is the average of the sales 7 days prior and 7 days later. E.g. for Sunday the 8th, the average of Sunday the 1st and Sunday the 15th, because sales is significantly dependent on day of the week.
So I guess I can use
dframe$sales[is.na(dframe$sales)] <- my_func(dframe)
where my_func(dframe) replaces every store's missing sales data with the average of that store's sales 7 days prior and 7 days later (ignoring for the first go-round the situation where one of those data points is also missing), but I have no idea how to write my_func in an efficient way.
How do I match up the store_id and the dates 7 days prior and future without using a terribly inefficient for loop? Preferably using only base R packages.
Something like:
with(
  dframe,
  ave(sales, store_id, FUN = function(x) {
    # within each store, replace NAs with the mean of the values
    # 7 rows before and 7 rows after
    naw <- which(is.na(x))
    x[naw] <- rowMeans(cbind(x[naw + 7], x[naw - 7]))
    x
  })
)
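To write the imputed values back into the data frame, the result of that expression can be assigned directly (this assumes the rows are ordered by date within each store_id, so that an offset of 7 rows really is 7 days):
# same expression as above, assigned back into the sales column
dframe$sales <- with(dframe, ave(sales, store_id, FUN = function(x) {
  naw <- which(is.na(x))
  x[naw] <- rowMeans(cbind(x[naw + 7], x[naw - 7]))
  x
}))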
I would like to subset out the first 5 minutes of time series data for each day from minutely data. However, the first 5 minutes do not occur at the same time each day, so something like xtsobj["T09:00/T09:05"] would not work, since the beginning of the first 5 minutes changes; sometimes it starts at 9:20am or some other random time in the morning instead of 9am.
So far, I have been able to subset out the first minute for each day using a function like:
k <- diff(index(xtsobj))> 10000
xtsobj[c(1, which(k)+1)]
i.e. finding gaps in the data that are larger than 10000 seconds. But going from that to finding the first 5 minutes of each day is proving more difficult, as the data is not always evenly spaced out. I.e. between the first minute and the 5th minute there could be anywhere from 2 rows to 5 rows, and thus using something like:
xtsobj[c(1, which(k)+6)]
and then binding the results together
is not always accurate. I was hoping that a function like 'first' could be used, but I wasn't sure how to do this for multiple days; perhaps that might be the optimal solution. Is there a better way of obtaining this information?
Many thanks to the stackoverflow community in advance.
split(xtsobj, "days") will create a list with an xts object for each day.
Then you can apply head to each day
lapply(split(xtsobj, "days"), head, 5)
or more generally
lapply(split(xtsobj, "days"), function(x) {
  x[1:5, ]
})
Finally, you can rbind the days back together if you want.
do.call(rbind, lapply(split(xtsobj, "days"), function(x) x[1:5, ]))
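Since you mention first: xts's first() also accepts a period string, so taking the first 5 minutes of data (rather than the first 5 rows) per day should, if I remember the interface correctly, look something like:
# first 5 minutes of observations within each day
do.call(rbind, lapply(split(xtsobj, "days"), first, "5 mins"))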
What about using the package lubridate: first find the starting point of each day (which, as you say, changes somewhat randomly), and then use the function minutes().
So it would be something like:
five_minutes_after = starting_point_each_day + minutes(5)
Then you can use the usual xts subsetting, doing something like:
five_min_period = paste(starting_point_each_day, five_minutes_after, sep = '/')
xtsobj[five_min_period]
Edit:
@Joshua
I think this works, look at this example:
library(lubridate)
library(xts)
x <- xts(cumsum(rnorm(20, 0, 0.1)), Sys.time() - seq(60, 1200, 60))
starting_point_each_day = index(x[1])
five_minutes_after = index(x[1]) + minutes(5)
five_min_period = paste(starting_point_each_day, five_minutes_after, sep = '/')
x[five_min_period]
In my previous example I made a mistake, I put the five_min_period between quotes.
Was that what you were pointing out, Joshua? Also, maybe the starting point is not necessary; just:
until5min=paste('/',five_minutes_after,sep="")
x[until5min]