Cutting an elapsed-time variable into manageable pieces - r

flight_time
11:42:00
19:37:06
18:11:17
I am having trouble working with the time played variable in the dataset. I can't seem to figure out how to get R to treat this value as a numeric.
Apologies if this has been asked before.
EDIT:
Okay, well, given the answers posted below I've realised there are a few things I didn't know or check before.
First of all, this is a factor variable. I read through the lubridate package documentation, and since I want to perform arithmetic operations (if that is the right terminology) I believe the duration function is the correct one.
However, looking at the examples, I am not entirely sure what the syntax is for applying this to a whole column in a large(ish) data frame. Since I have 4.5k observations, I'm not sure exactly how to apply this. I don't need an excessive amount of granularity; even hours and minutes are fine.
So I'm thinking I would want my code to look like:
conversion from factor variable to character string > conversion from character string to duration/as.numeric.
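For what it's worth, a minimal sketch of that pipeline with lubridate (assuming a flight_time column like the sample above; hms() parses the "HH:MM:SS" strings and period_to_seconds() gives a numeric value) might look like:
library(lubridate)
#hypothetical data frame; flight_time stored as a factor, as in the question
df <- data.frame(flight_time = factor(c("11:42:00", "19:37:06", "18:11:17")))
#factor -> character -> period -> numeric seconds
df$seconds <- period_to_seconds(hms(as.character(df$flight_time)))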

Try this code:
#dummy data; stringsAsFactors = TRUE so flight_time really is a factor, as in the question
df <- data.frame(flight_time = c("11:42:00", "19:37:06", "18:11:17"),
                 stringsAsFactors = TRUE)
#add a Seconds column: split "HH:MM:SS" and weight the parts by 3600, 60, 1
df$Seconds <-
  sapply(as.character(df$flight_time), function(i)
    sum(as.numeric(unlist(strsplit(i, ":"))) * c(60^2, 60, 1)))
#result
df
# flight_time Seconds
# 1 11:42:00 42120
# 2 19:37:06 70626
# 3 18:11:17 65477

Related

Calculating Time Differences using xts in R

I was wondering if there was a way to calculate time differences using the xts package without having to convert time values etc. if possible. I have an xts object with a time format given as 2010-02-15 13:35:59.123 (where the .123 is the milliseconds).
Now, I would like to find the number of milliseconds until the end of the day (i.e. 17:00:00). The problem, however, is that I basically have to do a few conversions of the data before I can do this (such as using as.POSIXct), and this becomes more complicated since I have to do it for several different days and possibly different times. For this reason, I would prefer not to convert the "end of day" time and instead leave it as 17:00:00, so that finding the number of milliseconds between the present time and the end-of-day time is just a fairly simple operation such as 17:00:00.000 - 13:35:59.123 = ...
Is there a simple way to do this with minimal conversions? I'm certain xts has a function which I don't know of but I couldn't find anything in the documentation :/
EDIT: I forgot to mention, I tried the more 'straightforward' route of computing the time differences by first using as.POSIXct(16:00:00, format = "%H:%M:%S"), but this gives an error, and I'm honestly not sure why...
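Side note on that EDIT: the error is a syntax error, because the time has to be passed as a quoted string. Quoted, it parses (though, without a date part, R silently fills in today's date):
as.POSIXct("16:00:00", format = "%H:%M:%S")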
You should be able to do this using a combination of ave(), .indexDate(), and a custom function. You didn't provide a reproducible example, so here's one using the daily data that comes with xts.
library(xts)
data(sample_matrix)
x <- as.xts(sample_matrix)
secsRemaining <- function(x) { end(x) - index(x) }
tdiff <- ave(x[,1], as.yearmon(index(x)), FUN = secsRemaining)
tdiff[86:92,]
# Open
# 2007-03-28 259200
# 2007-03-29 172800
# 2007-03-30 86400
# 2007-03-31 0
# 2007-04-01 2505600
# 2007-04-02 2419200
# 2007-04-03 2332800
In your case, the call would use .indexDate(x) instead of as.yearmon(index(x)).
tdiff <- ave(x[,1], .indexDate(x), FUN = secsRemaining)
Also note that this call to ave() only works on a 1-column xts object; it seems like a bug that it doesn't work on multi-column objects. Also note that you have to use FUN = with ave(), since the FUN argument comes after ... in the argument list.
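If the goal really is the number of milliseconds until a fixed 17:00:00 cut-off, a rough sketch (not part of the answer above; it assumes x has a POSIXct index with sub-second precision) could be:
options(digits.secs = 3)   # print milliseconds
# 17:00:00 on each observation's own day, in the index time zone
eod <- as.POSIXct(paste(format(index(x), "%Y-%m-%d"), "17:00:00"), tz = tzone(x))
ms_to_close <- as.numeric(difftime(eod, index(x), units = "secs")) * 1000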

Speed improvement for sapply along a POSIX sequence

I am iterating along a POSIX sequence to identify the number of concurrent events at a given time with exactly the method described in this question and the corresponding answer:
How to count the number of concurrent users using time interval data?
My problem is that my tinterval sequence in minutes covers a year, which means it has 523,025 entries. In addition, I am also thinking about a resolution in seconds, which would make things even worse.
Is there anything I can do to improve this code (e.g. is the order of the date intervals from the input data (tdata) of relevance?) or do I have to accept the performance if I like to have a solution in R?
You could try using data.table's new foverlaps() function. With the data from the other question:
library(data.table)
setDT(tdata)
setkey(tdata, start, end)
minutes <- data.table(start = seq(trunc(min(tdata[["start"]]), "mins"),
                                  round(max(tdata[["end"]]), "mins"), by = "min"))
minutes[, end := start+59]
setkey(minutes, start, end)
DT <- foverlaps(tdata, minutes, type="any")
counts <- DT[, .N, by=start]
plot(N~start, data=counts, type="s")
I haven't timed this for huge data; try it yourself.
Here is another approach that should be faster than processing a list. It relies on data.table joins and on lubridate for binning times to the nearest minute. It also assumes that there were 0 users before you started recording them, but this can be fixed by adding a constant to concurrent at the end:
library(data.table)
library(lubridate)
td <- data.table(start = floor_date(tdata$start, "minute"),
                 end = ceiling_date(tdata$end, "minute"))
# create vector of all minutes from start to end
# (about 530K for a whole year)
time.grid <- seq(from = min(td$start), to = max(td$end), by = "min")
users <- data.table(time = time.grid, key = "time")
# match users on starting time and
# sum matches by start time to count multiple logins in the same minute
setkey(td, start)
users <- td[users,
            list(started = !is.na(end)),
            nomatch = NA,
            allow.cartesian = TRUE][, list(started = sum(started)),
                                    by = start]
# match users on ending time, essentially the same procedure
setkey(td, end)
users <- td[users,
            list(started, ended = !is.na(start)),
            nomatch = NA,
            allow.cartesian = TRUE][, list(started = sum(started),
                                           ended = sum(ended)),
                                    by = end]
# fix timestamp column name
setnames(users, "end", "time")
# here you can exclude all entries where both counts are zero
# for a sparse representation
users <- users[started > 0 | ended > 0]
# last step: take the difference of cumulative sums to get concurrent users
users[, concurrent := cumsum(started) - cumsum(ended)]
The two complex-looking joins could each be split in two (first the join, then the summary by minute), but I recall reading that this way is more efficient. If not, splitting them would make the operations more legible.
R is an interpreted language, which means that every time you ask it to execute a command, it has to interpret your code first and then execute it. For loops this means that every iteration of the for loop has to be re-interpreted, which is, of course, very slow.
There are three common ways that I am aware of which help to solve this.
1. R is vector-oriented, so loops are most likely not a good way to use it. If possible, you should try to rethink your logic here, vectorizing the approach.
2. Use a just-in-time/byte-code compiler (the compiler package).
3. (What I ended up doing.) Use Rcpp to translate your loop-heavy code into C/C++. This can easily give you a speed boost of a thousand times. A sketch of options 2 and 3 follows below.
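To make options 2 and 3 concrete, here is a toy sketch (slow_sum, fast_sum and cpp_sum are made-up names, not the code from the question):
library(compiler)
library(Rcpp)
# naive R loop
slow_sum <- function(v) {
  s <- 0
  for (x in v) s <- s + x
  s
}
# option 2: explicitly byte-compile the R function
# (recent R versions JIT-compile functions automatically)
fast_sum <- cmpfun(slow_sum)
# option 3: move the loop to C++ via Rcpp
cppFunction("
double cpp_sum(NumericVector v) {
  double s = 0;
  for (int i = 0; i < v.size(); ++i) s += v[i];
  return s;
}
")
v <- runif(1e6)
c(slow_sum(v), fast_sum(v), cpp_sum(v))   # same result, increasing speed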

Getting Date to Add Correctly

I have a 3000 x 1000 matrix time series database going back 14 years that is updated every three months. I am forecasting out 9 months using this data, which brings it to roughly a 3200 x 1100 matrix (mind you, these are rough numbers).
During the forecasting process I need the Year and Month variables to be calculated appropriately. I am trying to automate the process so I don't have to mess with the code any more; I can just run the code every three months and upload the projections into our database.
Below is the code I am using right now. As I said above, I do not want to have to look at the data or the code, just run it every three months. Right now everything else is working as planned, but I still have to ensure the dates are appropriately annotated. The foo variables have been changed for privacy purposes due to the nature of their names.
projection <- rbind(projection, data.frame(foo=forbar, bar=barfoo,
                                           Year=2012, Month=1:9,
                                           Foo=as.vector(fc$mean)))
I'm not sure exactly where the year/months are coming from, but if you want to refer to the current date for those numbers, here is an option (using the wonderful package, lubridate):
library(lubridate)
today = Sys.Date()
projection <- rbind(projection, data.frame(foo=foobar, bar=barfoo,
                                           year = year(today),
                                           month = sapply(1:9, function(x) month(today + months(x))),
                                           Foo = as.vector(fc$mean)))
I hope this is what you're looking for.
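One caveat worth flagging: with a fixed year = year(today), projections that cross a year boundary get the wrong Year. A hedged variant that derives both fields from one shifted date sequence (using lubridate's %m+%, which avoids NAs on month-end dates) could be:
library(lubridate)
proj_dates <- Sys.Date() %m+% months(1:9)
data.frame(Year = year(proj_dates), Month = month(proj_dates))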

Obtaining or subsetting the first 5 minutes of each day of data from an xts

I would like to subset out the first 5 minutes of each day's data from minutely data. However, the first 5 minutes do not occur at the same time each day, so using something like xtsobj["T09:00/T09:05"] would not work, since the beginning of the first 5 minutes changes; sometimes it starts at 9:20am or some other random time in the morning instead of 9am.
So far, I have been able to subset out the first minute for each day using a function like:
k <- diff(index(xtsobj))> 10000
xtsobj[c(1, which(k)+1)]
i.e. finding gaps in the data that are larger than 10000 seconds. Going from that to finding the first 5 minutes of each day is proving more difficult, as the data is not always evenly spaced: between the first minute and the fifth minute there could be anywhere from 2 to 5 rows, and thus using something like:
xtsobj[c(1, which(k)+6)]
and then binding the results together
is not always accurate. I was hoping that a function like first() could be used, but wasn't sure how to do this for multiple days; perhaps that might be the optimal solution. Is there a better way of obtaining this information?
Many thanks to the Stack Overflow community in advance.
split(xtsobj, "days") will create a list with an xts object for each day.
Then you can apply head to each day:
lapply(split(xtsobj, "days"), head, 5)
or more generally
lapply(split(xtsobj, "days"), function(x) {
x[1:5, ]
})
Finally, you can rbind the days back together if you want.
do.call(rbind, lapply(split(xtsobj, "days"), function(x) x[1:5, ]))
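Since the question mentions first(), another option worth trying (a suggestion, not part of the answer above) is to pass a period string to xts::first(), so that each day's slice covers the first 5 minutes of clock time rather than the first 5 rows:
# first() accepts a period specification such as "5 mins"
do.call(rbind, lapply(split(xtsobj, "days"), first, "5 mins"))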
What about using the package lubridate: first find the starting point of each day (which, as you say, changes somewhat randomly), and then use the function minutes().
So it would be something like:
five_minutes_after = starting_point_each_day + minutes(5)
Then you can use the usual subset of xts doing something like:
five_min_period = paste(starting_point_each_day, five_minutes_after, sep='/')
xtsobj[five_min_period]
Edit:
@Joshua
I think this works; look at this example:
library(lubridate)
x <- xts(cumsum(rnorm(20, 0, 0.1)), Sys.time() - seq(60,1200,60))
starting_point_each_day= index(x[1])
five_minutes_after = index(x[1]) + minutes(5)
five_min_period = paste(starting_point_each_day,five_minutes_after,sep='/')
x[five_min_period]
In my previous example I made a mistake: I put five_min_period between quotes.
Was that what you were pointing out, Joshua? Also, maybe the starting point is not necessary; just:
until5min=paste('/',five_minutes_after,sep="")
x[until5min]

Data aggregation loop in R

I am facing a problem concerning aggregating my data to daily data.
I have a data frame from which NAs have been removed (a link to a picture of the data is given below). Data has been collected 3 times a day, but sometimes, due to NAs, there are just 1 or 2 entries per day; some days the data is missing completely.
I am now interested in calculating the daily mean of "dist": this means summing up the "dist" values of one day and dividing by the number of entries for that day (3 if no data is missing that day). I would like to do this via a loop.
How can I do this with a loop? The problem is that sometimes I have 3 entries per day and sometimes just 2 or even 1. I would like to tell R that, for every day, it should sum up "dist" and divide it by the number of entries available for that day.
I just have no idea how to formulate a for loop for this purpose. I would really appreciate any advice on this problem. Thanks for your efforts and kind regards,
Jan
Data frame: http://www.pic-upload.de/view-11435581/Data_loop.jpg.html
Edit: I used aggregate and tapply as suggested; however, the mean value of the data was not really calculated:
Group.1 x
1 2006-10-06 12:00:00 636.5395
2 2006-10-06 20:00:00 859.0109
3 2006-10-07 04:00:00 301.8548
4 2006-10-07 12:00:00 649.3357
5 2006-10-07 20:00:00 944.8272
6 2006-10-08 04:00:00 136.7393
7 2006-10-08 12:00:00 360.9560
8 2006-10-08 20:00:00 NaN
The code used was:
dates<-Dis_sub$date
distance<-Dis_sub$dist
aggregate(distance,list(dates),mean,na.rm=TRUE)
tapply(distance,dates,mean,na.rm=TRUE)
Don't use a loop. Use R. Some example data:
dates <- rep(seq(as.Date("2001-01-05"),
                 as.Date("2001-01-20"),
                 by = "day"),
             each = 3)
values <- rep(1:16,each=3)
values[c(4,5,6,10,14,15,30)] <- NA
and any of :
aggregate(values,list(dates),mean,na.rm=TRUE)
tapply(values,dates,mean,na.rm=TRUE)
gives you what you want. See also ?aggregate and ?tapply.
If you want a data frame back, you can look at the plyr package:
Data <- data.frame(dates, values)
require(plyr)
ddply(Data, "dates", summarise, values = mean(values, na.rm = TRUE))
Keep in mind that ddply does not fully support the Date format (yet).
Look at the data.table package, especially if your data is huge. Here is some code that calculates the mean of dist by day:
library(data.table)
dt <- data.table(Data)
dt[, list(avg_dist = mean(dist, na.rm = TRUE)), by = "date"]
It looks like your main problem is that your date field has times attached. The first thing you need to do is create a column that has just the date using something like
Dis_sub$date_only <- as.Date(Dis_sub$date)
Then using Joris Meys' solution (which is the right way to do it) should work.
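For example, a minimal sketch assuming the column names from the question:
Dis_sub$date_only <- as.Date(Dis_sub$date)
# the formula method drops rows with NA dist by default
aggregate(dist ~ date_only, data = Dis_sub, FUN = mean)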
However, if for some reason you really want to use a loop, you could try something like:
newFrame <- data.frame()
for (d in unique(Dis_sub$date)) {
  meanDist <- mean(Dis_sub$dist[Dis_sub$date == d], na.rm = TRUE)
  newFrame <- rbind(newFrame, c(d, meanDist))
}
But keep in mind that this will be slow and memory-inefficient.
