Consider data collected at 5-minute intervals, with a numeric variable a and a discrete variable acc that indicates whether an incident happened (0 for no incident, 1 for an incident):
a<-c(1:(288*4))
t<-seq(as.POSIXct("2016-01-01 00:05:00"), as.POSIXct("2016-01-05 00:00:00"), by = '5 min')
acc<-rep(0,288*4)
df<-data.frame(t,a,acc)
Now I have another data set containing the times (accurate to 1 second) at which incidents happened during the collection period:
T<-sample(seq(as.POSIXct("2016-01-01 00:05:00"), as.POSIXct("2016-01-05 00:00:00"), by = '1 sec'),size = 5)
I want to set acc to 1 for the two nearest prior observations for each time in T. For example, if an incident happened at 2016-01-02 07:13:23, the observations with t equal to 2016-01-02 07:05:00 and 2016-01-02 07:10:00 should be marked as 1.
How can I do this?
ind <- findInterval(T, df$t)
df$acc[c(ind, ind + 1)] <- 1
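To see which rows findInterval() points at, here is a minimal sketch using the example incident from the question. findInterval() returns the index of the last observation at or before each incident, so the two lines above mark that observation and the one after it (07:10:00 and 07:15:00 for the example). If the question's wording is taken literally (the two prior observations), the indices would be ind - 1 and ind instead. The names t_obs, incident and prior_two are made up for illustration, and the time zone is fixed to UTC only to keep the output reproducible.

```r
# Observation times every 5 minutes, as in the question (UTC assumed here).
t_obs <- seq(as.POSIXct("2016-01-01 00:05:00", tz = "UTC"),
             as.POSIXct("2016-01-05 00:00:00", tz = "UTC"),
             by = "5 min")
incident <- as.POSIXct("2016-01-02 07:13:23", tz = "UTC")

# Index of the last observation at or before the incident.
ind <- findInterval(as.numeric(incident), as.numeric(t_obs))

# Mark the two prior observations; drop index 0 if the incident
# falls at the very first observation.
acc <- rep(0, length(t_obs))
prior_two <- c(ind - 1, ind)
prior_two <- prior_two[prior_two >= 1]
acc[prior_two] <- 1

marked <- format(t_obs[acc == 1], "%Y-%m-%d %H:%M:%S")
print(marked)
```

Which pair of rows is the right one depends on whether an observation stamped 07:10:00 describes the interval before or after that time, so it is worth checking against how the data was recorded.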
One way could be:
library(lubridate)
df$acc=apply(sapply(T,function(x) x %within% interval((df$t - minutes(4)-seconds(59)),(df$t + minutes(4)+seconds(59)))),1,sum)
lubridate makes date-time manipulation easy: minutes(x) and seconds(x) create period objects of x minutes or x seconds, which can be added to or subtracted from date-times.
interval() creates, for each observation, the time interval df$t ± 4 min 59 s.
sapply() checks, for each time in T, which of those intervals contain it.
apply() collapses the result of sapply() (which has one column per element of T) by summing across each row. Because the rows are summed, acc holds a count and can exceed 1 when several incidents fall near the same observation.
If T contains a value exactly equal to one in df$t, such as 2016-01-04 12:05:00 CET, only that one observation is set to 1.
I am attempting to write a function that counts the number of holidays a person worked in my organization between their start date and term date in the year 2017. My organization recognized 6 holidays that year:
New Years Day- 2017-01-02
Memorial Day- 2017-05-29
Independence Day - 2017-07-04
Labor Day - 2017-09-04
Thanksgiving Day- 2017-11-23
Christmas day - 2017-12-25
I used lubridate and dplyr to combine my year, month and day columns into complete dates like so:
dates <- data %>% mutate("Term Date" = make_date(month = `Term Month`,
                                                 day = `Term Day`,
                                                 year = `Term Year`),
                         "Start Date" = make_date(month = `Start Month`,
                                                  day = `Start Day`,
                                                  year = `Start Year`))
I then went on to attempt to write my function.
holidays <- function(x){
z<- 0
if( ymd("2017-01-01") %within% interval(dates$`Start Date`, dates$`Term Date`)){
z <- z + 1
}
print(z)
}
This was only my first step. My goal was to first make my function work for New Year's Day and then build in the other holidays step by step using if statements. I was unable to get the apply function to work correctly and am unsure whether my function even works. I attempted to apply the function like so:
apply(dates,2,holidays)
But I got an error.
Does anyone have any advice?
Putting the holidays in a vector:
holidays <- as.Date(c('2017-01-02', '2017-05-29', '2017-07-04', '2017-09-04', '2017-11-23', '2017-12-25'))
Converting each holiday to its day of year, to make the comparison independent of the year ("%j" stands for day of year):
holidays <- format(as.Date(holidays), "%j")
Generating some random data to test (1000 uniformly distributed work entries in 2017, 5 employees):
d <- data.frame(
'date' = as.Date(as.integer(runif(1000, 17167, 17531)), origin = '1970-01-01'),
'emp' = sample(LETTERS[1:5], 1000, replace = TRUE)
)
Keeping only the entries that fall on a holiday:
h <- d[format(d$date, "%j") %in% holidays, ]
Counting number of holidays worked per employee using aggregate():
aggregate(h$date, list(h$emp), length)
# Group.1 x
#1 A 3
#2 B 4
#3 C 2
#4 D 5
#5 E 1
NB: this works for 2017 but won't carry over to leap years, because the day-of-year number of every date after February 29 shifts by one (one workaround that doesn't involve altering the code too much is to change the year in the holiday vector manually).
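For the original question (holidays falling between each person's start and term date), a minimal sketch that avoids apply() is to compare the holiday vector with each row's dates directly. The emp data frame and its start and term columns are made up for illustration; they stand in for the Start Date and Term Date columns built with make_date() in the question.

```r
# The six holidays listed in the question.
holidays_2017 <- as.Date(c("2017-01-02", "2017-05-29", "2017-07-04",
                           "2017-09-04", "2017-11-23", "2017-12-25"))

# Hypothetical employees with start and term dates.
emp <- data.frame(
  start = as.Date(c("2016-06-01", "2017-06-01", "2017-01-03")),
  term  = as.Date(c("2017-12-31", "2017-09-30", "2017-05-28"))
)

# For each row, count the holidays on or between start and term.
emp$holidays_worked <- vapply(
  seq_len(nrow(emp)),
  function(i) sum(holidays_2017 >= emp$start[i] & holidays_2017 <= emp$term[i]),
  integer(1)
)
print(emp)
```

This counts holidays that fall within the employment period; whether the person actually worked on those days would need attendance data like the d data frame above.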
I have two dataframes: one contains a year's worth of hourly temperatures and the other contains flight information. Below is an extract from the temperature dataframe:
Time <- c("2000-01-01 00:53:00","2000-01-01 06:53:00","2000-01-01 10:53:00")
Time <- as.POSIXct(Time)
Temp <- c(20,30,10)
Temperature <- data.frame(Time,Temp)
Temperature
Time Temp
1 2000-01-01 00:53:00 20
2 2000-01-01 06:53:00 30
3 2000-01-01 10:53:00 10
Below is an extract from the flight information dataframe:
DepartureTime <- c("2000-01-01 03:01:00","2000-01-01 10:00:00","2000-01-01 14:00:00")
DepartureTime <- as.POSIXct(DepartureTime)
FlightInformation <- data.frame(DepartureTime)
FlightInformation
DepartureTime
1 2000-01-01 03:01:00
2 2000-01-01 10:00:00
3 2000-01-01 14:00:00
My goal is to take each row of FlightInformation$DepartureTime and find the closest time in the whole column Temperature$Time. I then want to add the corresponding temperature to the FlightInformation dataframe. The desired output should look like this:
FlightInformation
DepartureTime Temp
1 2000-01-01 03:01:00 20
2 2000-01-01 10:00:00 10
3 2000-01-01 14:00:00 10
My attempts so far have come up with this:
i <- 1
j <- 1
while(i <= nrow(Temperature)){
while(j <= nrow(FlightInformation)){
if(Temperature$Time[i] == FlightInformation$Time[j]){
FlightInformation$Temp[j] == Temperature$Temp[i]
}
j <- j + 1
}
i <- i + 1
}
This involves first rounding all times to the nearest hour. This method is not as accurate as I would like and seems very inefficient. Is there an easy way to find the nearest POSIXct time to give my desired output?
Some assumptions:
you have temperature data before and after all flight information; otherwise you'll see NA
temperature data is continuous-enough, meaning with the interpolation this presents, you don't grab something from 3 months prior (not useful)
temperature data is ordered (easy enough to fix if not)
We'll use cut, which finds the interval in which each value falls within a series of breaks:
(ind <- cut(FlightInformation$DepartureTime, Temperature$Time, labels = FALSE))
# [1] 1 2 NA
These indicate the rows within Temperature from which we should retrieve the $Temp. Unfortunately, cut always picks the earlier break and does not account for a value being closer to the next one, so we can compensate for that:
(ind <- ind + (abs(Temperature$Time[ind] - FlightInformation$DepartureTime) >
abs(Temperature$Time[1+ind] - FlightInformation$DepartureTime)))
# [1] 1 3 NA
Okay, now that NA: that indicates that the latest $DepartureTime is outside of the known times. This indicates a violation of my first assumption above, but it can be fixed. I use a magic-constant of "6 hours" here to determine that the data is close enough to be able to use it; there are certainly many other heuristics which will be less-wrong. For those, we can just assume the latest temperature:
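# difftime picks its units automatically, so force seconds before comparing with 60*60*6
(is_recoverable <- is.na(ind) & abs(difftime(FlightInformation$DepartureTime, max(Temperature$Time), units = "secs")) < 60*60*6)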
# [1] FALSE FALSE TRUE
ind[is_recoverable] <- nrow(Temperature)
ind
# [1] 1 3 3
The results:
FlightInformation$Temp <- Temperature$Temp[ ind ]
FlightInformation
# DepartureTime Temp
# 1 2000-01-01 03:01:00 20
# 2 2000-01-01 10:00:00 10
# 3 2000-01-01 14:00:00 10
Though definitely quicker than double while loops, it will be a problem if you have large gaps in your temperature data. That is, if you have a 3-year gap in your data, the most-recent temperature will be used, which might be 2.99 years ago. For a double-check, use this:
FlightInformation$TempTime <- Temperature$Time[ ind ]
FlightInformation$TimeDelta <- with(FlightInformation, abs(TempTime - DepartureTime))
FlightInformation
# DepartureTime Temp TempTime TimeDelta
# 1 2000-01-01 03:01:00 20 2000-01-01 00:53:00 128 mins
# 2 2000-01-01 10:00:00 10 2000-01-01 10:53:00 53 mins
# 3 2000-01-01 14:00:00 10 2000-01-01 10:53:00 187 mins
You can use different units for the time delta and check for problems with:
units(FlightInformation$TimeDelta) <- "secs"
which(FlightInformation$TimeDelta > 60*60*6)
# integer(0)
(where integer(0) says you have none that are outside of my magic window of 6 hours.)
Here's a way! Time is easiest to work with for this if you convert it to a numeric value. Then you can compare the numeric values to find the closest times before/after your reference time (FlightInformation$time_num in the below example). Once you have the closest time before and after your reference value, figure out which is really the closest to your reference. Use that time value to look up (index) the correct temperature value and add it to your data frame.
#convert time to numeric (seconds since origin of time)
Temperature$time_num <- as.numeric(Temperature$Time)
FlightInformation$time_num <- as.numeric(FlightInformation$DepartureTime)
#make sure time data is in correct order so that indexes for time are in correct order
Temperature <- Temperature[with(Temperature, order(time_num)), ] #sort data
for (i in 1:nrow(FlightInformation)) #for each row of data in flight...
{
#find the time in Temp that is closest + prior to Flight time
#create a logical vector saying which Temperature$time_num are <= to FlightInformation$time_num.
#pull the max row index from the logical vector where value == TRUE (this is the closest time for Temp that is prior to Flight Time)
#use that row index to look up the Temperature$time_num value that is closest + prior to Flight time
#will return NA/warning message if no time in Temp is before time in Flight
temptime_prior <- Temperature[max(which(Temperature$time_num <= FlightInformation$time_num[i])), "time_num"]
#find the time in Temp that is closest + after to Flight time
#will return NA/warning message if no time in Temp is after time in Flight
temptime_after <- Temperature[min(which(Temperature$time_num > FlightInformation$time_num[i])), "time_num"]
#compare times before and after to see which is closest to flight time. If no before/after time was found (e.g., NA was returned), always use the other time value
temptime_closest <- ifelse(is.na(temptime_prior), temptime_after,
ifelse(is.na(temptime_after), temptime_prior,
ifelse((FlightInformation$time_num[i] - temptime_prior) <= (temptime_after - FlightInformation$time_num[i]),
temptime_prior, temptime_after)))
#look up the right temp by finding the row index of right Temp$time_num value and add it to Flight info
FlightInformation$Temp[i] <- Temperature[which(Temperature$time_num == temptime_closest), "Temp"]
}
#get rid of numeric time column, you don't need it anymore
FlightInformation <- FlightInformation[,!(names(FlightInformation) %in% c("time_num"))]
Output
DepartureTime Temp
1 2000-01-01 03:01:00 20
2 2000-01-01 10:00:00 10
3 2000-01-01 14:00:00 10
If you have subsets of data in each data frame you need to match up to (e.g., match df1$group1 time values only to df2$group1 time values), you can use survival::neardate. It's a nice function for this that does basically what the above code does, but has some additional parameters if you need them.
Hope this helps! The code is a lot shorter without all the comments =)
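For reference, the same nearest-time lookup can be sketched without a loop using findInterval(). This is only a sketch under the assumption that the temperature times are sorted in ascending order; nearest_index is a hypothetical helper name, and the time zone is fixed to UTC only to keep the example reproducible.

```r
Temperature <- data.frame(
  Time = as.POSIXct(c("2000-01-01 00:53:00", "2000-01-01 06:53:00",
                      "2000-01-01 10:53:00"), tz = "UTC"),
  Temp = c(20, 30, 10)
)
FlightInformation <- data.frame(
  DepartureTime = as.POSIXct(c("2000-01-01 03:01:00", "2000-01-01 10:00:00",
                               "2000-01-01 14:00:00"), tz = "UTC")
)

# Hypothetical helper: for each x, the index of the closest value in ref.
# Assumes ref is sorted ascending; ties go to the earlier value.
nearest_index <- function(x, ref) {
  x <- as.numeric(x)
  ref <- as.numeric(ref)
  lo <- findInterval(x, ref)             # last ref <= x (0 if none)
  lo_c <- pmax(lo, 1L)                   # clamp to the first row
  hi_c <- pmin(lo + 1L, length(ref))     # clamp to the last row
  ifelse(abs(x - ref[lo_c]) <= abs(ref[hi_c] - x), lo_c, hi_c)
}

idx <- nearest_index(FlightInformation$DepartureTime, Temperature$Time)
FlightInformation$Temp <- Temperature$Temp[idx]
print(FlightInformation)
```

Flights before the first or after the last temperature reading simply get the first or last reading, so the gap check from the earlier answer (comparing the matched time with the departure time) is still worth doing.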
I have a data frame that looks like this:
X id mat.1 mat.2 mat.3 times
1 1 1 Anne 1495206060 18.5639404 2017-05-19 11:01:00
2 2 1 Anne 1495209660 9.0160321 2017-05-19 12:01:00
3 3 1 Anne 1495211460 37.6559161 2017-05-19 12:31:00
4 4 1 Anne 1495213260 31.1218856 2017-05-19 13:01:00
....
164 164 1 Anne 1497825060 4.8098351 2017-06-18 18:31:00
165 165 1 Anne 1497826860 15.0678781 2017-06-18 19:01:00
166 166 1 Anne 1497828660 4.7636241 2017-06-18 19:31:00
What I would like is to subset the data set by time interval (all data between 11 AM and 4 PM) if there are data points for each hour at least (11 AM, 12, 1, 2, 3, 4 PM) within each day. I want to ultimately sum the values from mat.3 per time interval (11 AM to 4 PM) per day.
I tried:
sub.1 <- subset(t,format(times,'%H')>='11' & format(times,'%H')<='16')
but this returns all the data from any of the times between 11 AM and 4 PM, but often I would only have data for e.g. 12 and 1 PM for a given day.
I only want the subset from days where I have data for each hour from 11 AM to 4 PM. Any ideas what I can try?
A complement to @Henry Navarro's answer, solving an additional problem mentioned in the question.
If I understand correctly, another concern of the question is to find the dates for which there is at least one data point for each hour of the given interval within the day. A possible way, following the style of @Henry Navarro's solution, is as follows:
library(lubridate)
your_data$hour_only <- as.numeric(format(your_data$times, format = "%H"))
your_data$days <- ymd(format(your_data$times, "%Y-%m-%d"))
your_data_by_days_list <- split(x = your_data, f = your_data$days)
# the interval is narrowed for demonstration purposes
hours_intervals <- 11:13
# take the days from the names of the split list so they line up with the flags
all_hours_flags <- data.frame(days = as.Date(names(your_data_by_days_list)),
all_hours_present = sapply(your_data_by_days_list,
function(Z) all(hours_intervals %in% Z$hour_only)), row.names = NULL)
your_data <- merge(your_data, all_hours_flags, by = "days")
There is now a column "all_hours_present" indicating whether the data for the corresponding day contains at least one value for each hour in the given hours_intervals. You can use this column to subset your data:
subset(your_data, all_hours_present)
Try creating a new variable in your data frame that holds only the time of day.
your_data$hour<-format(your_data$times, format="%H:%M:%S")
Then use this new variable as follows:
# auxiliary variable flagging rows inside your time interval
your_data$aux_var<-ifelse(your_data$hour >= "11:00:00" & your_data$hour <= "16:00:00", 1, 0)
The next step is to filter your data where aux_var == 1:
your_data[which(your_data$aux_var ==1),]
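Putting the two answers together, a minimal base-R sketch of the full task (keep only days that have data for every hour of the window, then sum mat.3 per day) could look like this. The toy data frame and the names wanted_hours, in_window and daily_sum are made up for illustration, and the window is narrowed to 11:00 to 13:59 to keep the example short.

```r
# Toy data: May 19 has readings in hours 11, 12 and 13; May 20 lacks hour 12.
toy <- data.frame(
  times = as.POSIXct(c("2017-05-19 11:01:00", "2017-05-19 12:01:00",
                       "2017-05-19 13:31:00", "2017-05-20 11:15:00",
                       "2017-05-20 13:05:00"), tz = "UTC"),
  mat.3 = c(1, 2, 3, 10, 20)
)
toy$day  <- as.Date(toy$times)
toy$hour <- as.integer(format(toy$times, "%H"))

wanted_hours <- 11:13

# Rows inside the window.
in_window <- toy[toy$hour %in% wanted_hours, ]

# For each day, check that every wanted hour occurs at least once.
complete <- tapply(in_window$hour, in_window$day,
                   function(h) all(wanted_hours %in% h))
complete_days <- names(complete)[complete]

# Keep only complete days and sum mat.3 per day.
kept <- in_window[as.character(in_window$day) %in% complete_days, ]
daily_sum <- aggregate(mat.3 ~ day, data = kept, FUN = sum)
print(daily_sum)
```

For the window in the question, wanted_hours would be 11:16 (or 11:15 if readings during the 4 PM hour should not count).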
I have a large (150,000x7) dataframe that I intend to use for back-testing and real-time analysis of a financial market. The data represents the condition of an investment vehicle at 5 minute intervals (although holes do exist). It looks like this (but much longer):
pTime Time Price M1 M2 M3 M4
1 1212108300 20:45:00 1.5518 12.21849 -0.37125 4.50549 -31.00559
2 1212108900 20:55:00 1.5516 11.75350 -0.81792 -1.53846 -32.12291
3 1212109200 21:00:00 1.5512 10.75070 -1.47438 -8.24176 -34.35754
4 1212109500 21:05:00 1.5514 10.23529 -1.06044 -8.46154 -33.24022
5 1212109800 21:10:00 1.5514 9.74790 -1.02759 -10.21978 -33.24022
6 1212110100 21:15:00 1.5513 9.31092 -1.17076 -11.97802 -33.79888
7 1212110400 21:20:00 1.5512 8.84034 -1.28428 -13.62637 -34.35754
8 1212110700 21:25:00 1.5509 8.07843 -1.63715 -18.24176 -36.03352
9 1212111000 21:30:00 1.5509 7.39496 -1.49198 -20.65934 -36.03352
10 1212111300 21:35:00 1.5512 7.65266 -1.03717 -18.57143 -34.35754
The data is pre-loaded into R, but during my back-test I need to subset it by two criteria:
The first criterion is a sliding window to avoid peeking into the future. The window must be such that each new 5-minute interval in the back-test shifts the whole window into the future by 5 minutes. This part I can do like this:
require(zoo)
zooser <- zoo(x=tser$Close, order.by=as.POSIXct(tser$pTime, origin="1970-01-01"))
window(zooser, start=A, end=B)
The second criterion is another sliding window, but one that slides through the time of day and contains only those entries that are within N minutes of the input time on any given day.
Example: If the window's size is 2 hours, and the input time is 12:00PM then the window must contain all rows with Time between 10:00AM and 2:00PM
This is the part that I am having trouble figuring out.
Edit: My data has holes in it, two consecutive rows could be MORE than 5 minutes apart. The data looks like this (very zoomed in)
As the window moves through these gaps the number of points inside the windows should vary.
The following is my MySQL code that does what I want to do in R (same table structure):
SET @qTime = Time(FROM_UNIXTIME(SAMP_endTime));
SET @inc = -1;
INSERT INTO MetIndListBuys (pTime,ArrayPos,M1,M2,M3,M4)
SELECT pTime,@inc:=@inc+1,M1,M2,M3,M4
FROM mergebuys USE INDEX (`y`) WHERE pTime BETWEEN SAMP_startTime AND SAMP_endTime
AND TIME_TO_SEC(TIMEDIFF(Time,@qTime))/3600 BETWEEN 0-HourSpan AND HourSpan
;
Say that you have your target time t0 on the same scale as pTime: seconds since the epoch. Then pTime - t0 = (difference in whole days between the two) + (difference in remaining seconds). Taking (pTime - t0) %% (number of seconds per day) leaves the difference in seconds in clock arithmetic (wrapped around if the difference is negative). Note the parentheses: in R, %% binds more tightly than binary minus. This suggests the following function:
SecondsPerDay <- 24 * 60 * 60
within <- function(d, t0Sec, wMin) {
diff <- (d$pTime - t0Sec) %% SecondsPerDay
wSec <- 60 * wMin
return(d[diff < wSec | diff > (SecondsPerDay - wSec), ])
}
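As a quick check of the clock-arithmetic idea, here is a minimal usage sketch on the first few pTime values from the question. The function is the same as above but renamed to within_time_of_day (a made-up name) so that it does not mask base R's within(); t0 is an assumed target time three days after the third row, at the same time of day.

```r
seconds_per_day <- 24 * 60 * 60

# Same logic as the function above, under a name that does not mask base::within.
within_time_of_day <- function(d, t0_sec, w_min) {
  diff <- (d$pTime - t0_sec) %% seconds_per_day  # wrapped time-of-day difference
  w_sec <- 60 * w_min
  d[diff < w_sec | diff > (seconds_per_day - w_sec), ]
}

# First five rows of the question's data (pTime and Price only).
DF <- data.frame(
  pTime = c(1212108300, 1212108900, 1212109200, 1212109500, 1212109800),
  Price = c(1.5518, 1.5516, 1.5512, 1.5514, 1.5514)
)

# Target: same time of day as the third row, three days later.
t0 <- 1212109200 + 3 * seconds_per_day

res <- within_time_of_day(DF, t0, 10)  # rows strictly within 10 minutes
print(res)
```

Because both comparisons are strict, rows exactly w_min minutes away from the target are excluded; use <= and >= if the window edges should be included.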
1) If DF is the data frame shown in the question, then create a zoo object from it as you have done and split it into days, giving zs. Then lapply your function f to each successive set of w points in each component (i.e. in each day). For example, if you want to apply your function to 2 hours of data at a time and your data is regularly spaced 5-minute data, then w = 24 (since there are 24 five-minute periods in two hours). In such a case f would be passed 24 rows of data as a matrix each time it's called. Also, align has been set to "right" below, but it can alternatively be set to align="center", and the condition giving ix can be changed to be double-sided, etc. For more on rollapply see: ?rollapply
library(zoo)
z <- zoo(DF[-2], as.POSIXct(DF[,1], origin = "1970-01-01"))
w <- 3 # replace this with 24 to handle two hours at a time with five min data
f <- function(x) {
tt <- x[, 1]
ix <- tt[w] - tt <= w * 5 * 60 # RHS converts w to seconds
x <- x[ix, -1]
sum(x) # replace sum with your function
}
out <- rollapply(z, w, f, by.column = FALSE, align = "right")
Using the data frame in the question we get this:
> out
$`2008-05-30`
2008-05-30 02:00:00 2008-05-30 02:05:00 2008-05-30 02:10:00 2008-05-30 02:15:00
-66.04703 -83.92148 -95.93558 -100.24924
2008-05-30 02:20:00 2008-05-30 02:25:00 2008-05-30 02:30:00 2008-05-30 02:35:00
-108.15038 -121.24519 -134.39873 -140.28436
2) This could alternately be done as the following where w and f are as above:
n <- nrow(DF)
m <- as.matrix(DF[-2])
sapply(w:n, function(i) { m <- m[seq(length = w, to = i), ]; f(m) })
Replace the sapply with lapply if needed. Also, this may seem shorter than the first solution, but it's not much different once you add the code to define f and w (which appear in the first but not the second).
If there are no holes during the day and only holes between days then these solutions could be simplified.
Let me begin by saying this question pertains to R (the statistical programming language), but I'm open to straightforward suggestions for other environments.
The goal is to merge outcomes from dataframe (df) A to sub-elements in df B. This is a one to many relationship but, here's the twist, once the records are matched by keys they also have to match over a specific frame of time given by a start time and duration.
For example, a few records in df A:
OBS ID StartTime Duration Outcome
1 01 10:12:06 00:00:10 Normal
2 02 10:12:30 00:00:30 Weird
3 01 10:15:12 00:01:15 Normal
4 02 10:45:00 00:00:02 Normal
And from df B:
OBS ID Time
1 01 10:12:10
2 01 10:12:17
3 02 10:12:45
4 01 10:13:00
The desired outcome from the merge would be:
OBS ID Time Outcome
1 01 10:12:10 Normal
3 02 10:12:45 Weird
Desired result: dataframe B with outcomes merged in from A. Notice observations 2 and 4 were dropped because although they matched IDs on records in A they did not fall within any of the time intervals given.
Question
Is it possible to perform this sort of operation in R and how would you get started? If not, can you suggest an alternative tool?
Set up data
First set up the input data frames. We create two versions of the data frames: A and B just use character columns for the times and At and Bt use the chron package "times" class for the times (which has the advantage over "character" class that one can add and subtract them):
LinesA <- "OBS ID StartTime Duration Outcome
1 01 10:12:06 00:00:10 Normal
2 02 10:12:30 00:00:30 Weird
3 01 10:15:12 00:01:15 Normal
4 02 10:45:00 00:00:02 Normal"
LinesB <- "OBS ID Time
1 01 10:12:10
2 01 10:12:17
3 02 10:12:45
4 01 10:13:00"
A <- At <- read.table(textConnection(LinesA), header = TRUE,
colClasses = c("numeric", rep("character", 4)))
B <- Bt <- read.table(textConnection(LinesB), header = TRUE,
colClasses = c("numeric", rep("character", 2)))
# in At and Bt convert times columns to "times" class
library(chron)
At$StartTime <- times(At$StartTime)
At$Duration <- times(At$Duration)
Bt$Time <- times(Bt$Time)
sqldf with times class
Now we can perform the calculation using the sqldf package. We use method="raw" (which does not assign classes to the output), so we must assign the "times" class to the output "Time" column ourselves:
library(sqldf)
out <- sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
where Time between StartTime and StartTime + Duration",
method = "raw")
out$Time <- times(as.numeric(out$Time))
The result is:
> out
OBS ID Time Outcome
1 1 01 10:12:10 Normal
2 3 02 10:12:45 Weird
With the development version of sqldf this can be done without using method="raw" and the "Time" column will automatically be set to "times" class by the sqldf class assignment heuristic:
library(sqldf)
source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R") # grab devel ver
sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
where Time between StartTime and StartTime + Duration")
sqldf with character class
It's actually possible to avoid the "times" class by performing all time calculations in sqlite on character strings, employing sqlite's strftime function. The SQL statement is unfortunately a bit more involved:
sqldf("select B.OBS, ID, Time, Outcome from A join B using(ID)
where strftime('%s', Time) - strftime('%s', StartTime)
between 0 and strftime('%s', Duration) - strftime('%s', '00:00:00')")
Here is an example:
# first, merge by ID
z <- merge(A[, -1], B, by = "ID")
# convert string to POSIX time
z <- transform(z,
s_t = as.numeric(strptime(as.character(z$StartTime), "%H:%M:%S")),
dur = as.numeric(strptime(as.character(z$Duration), "%H:%M:%S")) -
as.numeric(strptime("00:00:00", "%H:%M:%S")),
tim = as.numeric(strptime(as.character(z$Time), "%H:%M:%S")))
# subset by time range
subset(z, s_t < tim & tim < s_t + dur)
The output:
ID StartTime Duration Outcome OBS Time s_t dur tim
1 1 10:12:06 00:00:10 Normal 1 10:12:10 1321665126 10 1321665130
2 1 10:12:06 00:00:10 Normal 2 10:12:15 1321665126 10 1321665135
7 2 10:12:30 00:00:30 Weird 3 10:12:45 1321665150 30 1321665165
OBS #2 looks to be in the range. Does that make sense?
Merge the two data.frames together with merge(). Then subset() the resulting data.frame with the condition time >= startTime & time <= startTime + Duration or whatever rules make sense to you.
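A minimal base-R sketch of that merge-then-subset idea, using the data from the question, could look like this. The helper to_secs is made up for illustration: it converts "HH:MM:SS" strings to seconds so that start time plus duration can be compared with the observation time. The interval is treated as inclusive at both ends, which is an assumption.

```r
A <- data.frame(
  ID        = c("01", "02", "01", "02"),
  StartTime = c("10:12:06", "10:12:30", "10:15:12", "10:45:00"),
  Duration  = c("00:00:10", "00:00:30", "00:01:15", "00:00:02"),
  Outcome   = c("Normal", "Weird", "Normal", "Normal"),
  stringsAsFactors = FALSE
)
B <- data.frame(
  OBS  = 1:4,
  ID   = c("01", "01", "02", "01"),
  Time = c("10:12:10", "10:12:17", "10:12:45", "10:13:00"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: "HH:MM:SS" -> seconds since midnight.
to_secs <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

# One-to-many merge on ID, then keep rows whose Time falls in the interval.
m <- merge(B, A, by = "ID")
start <- to_secs(m$StartTime)
end   <- start + to_secs(m$Duration)
tm    <- to_secs(m$Time)

res <- m[tm >= start & tm <= end, c("OBS", "ID", "Time", "Outcome")]
res <- res[order(res$OBS), ]
print(res)
```

With the data as shown in the question, this keeps observations 1 and 3, matching the desired output.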