Let me begin by saying this question pertains to R (the statistical programming language), but I'm open to straightforward suggestions for other environments.
The goal is to merge outcomes from data frame (df) A onto sub-elements in df B. This is a one-to-many relationship but, here's the twist: once the records are matched by key, they also have to fall within a specific time window given by a start time and a duration.
For example, a few records in df A:
OBS ID StartTime Duration Outcome
1 01 10:12:06 00:00:10 Normal
2 02 10:12:30 00:00:30 Weird
3 01 10:15:12 00:01:15 Normal
4 02 10:45:00 00:00:02 Normal
And from df B:
OBS ID Time
1 01 10:12:10
2 01 10:12:17
3 02 10:12:45
4 01 10:13:00
The desired outcome from the merge would be:
OBS ID Time Outcome
1 01 10:12:10 Normal
3 02 10:12:45 Weird
Desired result: data frame B with outcomes merged in from A. Notice that observations 2 and 4 were dropped because, although their IDs matched records in A, they did not fall within any of the time intervals given.
Question
Is it possible to perform this sort of operation in R and how would you get started? If not, can you suggest an alternative tool?
Set up data
First set up the input data frames. We create two versions of the data frames: A and B just use character columns for the times and At and Bt use the chron package "times" class for the times (which has the advantage over "character" class that one can add and subtract them):
LinesA <- "OBS ID StartTime Duration Outcome
1 01 10:12:06 00:00:10 Normal
2 02 10:12:30 00:00:30 Weird
3 01 10:15:12 00:01:15 Normal
4 02 10:45:00 00:00:02 Normal"
LinesB <- "OBS ID Time
1 01 10:12:10
2 01 10:12:17
3 02 10:12:45
4 01 10:13:00"
A <- At <- read.table(textConnection(LinesA), header = TRUE,
colClasses = c("numeric", rep("character", 4)))
B <- Bt <- read.table(textConnection(LinesB), header = TRUE,
colClasses = c("numeric", rep("character", 2)))
# in At and Bt convert times columns to "times" class
library(chron)
At$StartTime <- times(At$StartTime)
At$Duration <- times(At$Duration)
Bt$Time <- times(Bt$Time)
sqldf with times class
Now we can perform the calculation using the sqldf package. We use method="raw" (which does not assign classes to the output), so we must assign the "times" class to the output "Time" column ourselves:
library(sqldf)
out <- sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
where Time between StartTime and StartTime + Duration",
method = "raw")
out$Time <- times(as.numeric(out$Time))
The result is:
> out
OBS ID Time Outcome
1 1 01 10:12:10 Normal
2 3 02 10:12:45 Weird
With the development version of sqldf this can be done without using method="raw" and the "Time" column will automatically be set to "times" class by the sqldf class assignment heuristic:
library(sqldf)
source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R") # grab devel ver
sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
where Time between StartTime and StartTime + Duration")
sqldf with character class
It's actually possible to avoid the "times" class altogether by performing all time calculations in SQLite directly on the character strings, using SQLite's strftime function. The SQL statement is unfortunately a bit more involved:
sqldf("select B.OBS, ID, Time, Outcome from A join B using(ID)
where strftime('%s', Time) - strftime('%s', StartTime)
between 0 and strftime('%s', Duration) - strftime('%s', '00:00:00')")
EDIT:
A series of edits which fixed grammar, added additional approaches and fixed/improved the read.table statements.
EDIT:
Simplified/improved final sqldf statement.
Here is an example:
# first, merge by ID
z <- merge(A[, -1], B, by = "ID")
# convert string to POSIX time
z <- transform(z,
s_t = as.numeric(strptime(as.character(z$StartTime), "%H:%M:%S")),
dur = as.numeric(strptime(as.character(z$Duration), "%H:%M:%S")) -
as.numeric(strptime("00:00:00", "%H:%M:%S")),
tim = as.numeric(strptime(as.character(z$Time), "%H:%M:%S")))
# subset by time range
subset(z, s_t < tim & tim < s_t + dur)
The output:
ID StartTime Duration Outcome OBS Time s_t dur tim
1 1 10:12:06 00:00:10 Normal 1 10:12:10 1321665126 10 1321665130
2 1 10:12:06 00:00:10 Normal 2 10:12:15 1321665126 10 1321665135
7 2 10:12:30 00:00:30 Weird 3 10:12:45 1321665150 30 1321665165
OBS #2 looks to be in the range. Does that make sense?
Merge the two data.frames together with merge(). Then subset() the resulting data.frame with the condition Time >= StartTime & Time <= StartTime + Duration (after converting the columns to a time class or to numbers), or whatever rules make sense to you.
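A minimal base-R sketch of that merge-then-subset idea, using the sample data from the question. The helper `to_secs` is a hypothetical name introduced here, and treating both ends of the window as inclusive is an assumption:

```r
# Sample data from the question, kept as character columns
A <- data.frame(OBS = 1:4, ID = c("01", "02", "01", "02"),
                StartTime = c("10:12:06", "10:12:30", "10:15:12", "10:45:00"),
                Duration = c("00:00:10", "00:00:30", "00:01:15", "00:00:02"),
                Outcome = c("Normal", "Weird", "Normal", "Normal"),
                stringsAsFactors = FALSE)
B <- data.frame(OBS = 1:4, ID = c("01", "01", "02", "01"),
                Time = c("10:12:10", "10:12:17", "10:12:45", "10:13:00"),
                stringsAsFactors = FALSE)

# Hypothetical helper: "HH:MM:SS" strings -> seconds since midnight
to_secs <- function(x) {
  sapply(strsplit(x, ":", fixed = TRUE),
         function(p) sum(as.numeric(p) * c(3600, 60, 1)))
}

# One-to-many merge on ID (drop A's OBS so that B's OBS survives)
m <- merge(B, A[, -1], by = "ID")

# Keep rows whose Time lies in [StartTime, StartTime + Duration]
s <- to_secs(m$StartTime)
keep <- to_secs(m$Time) >= s & to_secs(m$Time) <= s + to_secs(m$Duration)
res <- m[keep, c("OBS", "ID", "Time", "Outcome")]
res <- res[order(res$OBS), ]
res
```

With the data as posted in the question this should keep observations 1 and 3 of B.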
Related
I have two data frames: one contains a year's worth of hourly temperatures and the other contains flight information. Below is an extract from the temperature data frame:
Time <- c("2000-01-01 00:53:00","2000-01-01 06:53:00","2000-01-01 10:53:00")
Time <- as.POSIXct(Time)
Temp <- c(20,30,10)
Temperature <- data.frame(Time,Temp)
Temperature
Time Temp
1 2000-01-01 00:53:00 20
2 2000-01-01 06:53:00 30
3 2000-01-01 10:53:00 10
Below is an extract from the flight information data frame:
DepartureTime <- c("2000-01-01 03:01:00","2000-01-01 10:00:00","2000-01-01 14:00:00")
DepartureTime <- as.POSIXct(DepartureTime)
FlightInformation <- data.frame(DepartureTime)
FlightInformation
DepartureTime
1 2000-01-01 03:01:00
2 2000-01-01 10:00:00
3 2000-01-01 14:00:00
My goal is to take each row of FlightInformation$DepartureTime and find the closest time in the whole column Temperature$Time. I then want to add the corresponding temperature to the FlightInformation dataframe. The desired output should look like this:
FlightInformation
DepartureTime Temp
1 2000-01-01 03:01:00 20
2 2000-01-01 10:00:00 10
3 2000-01-01 14:00:00 10
My attempts so far have come up with this:
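FlightInformation$Temp <- NA  # create the column before filling it
i <- 1
while(i <= nrow(Temperature)){
  j <- 1  # restart the inner loop for every temperature row
  while(j <= nrow(FlightInformation)){
    if(Temperature$Time[i] == FlightInformation$DepartureTime[j]){
      FlightInformation$Temp[j] <- Temperature$Temp[i]  # assign with <-, not ==
    }
    j <- j + 1
  }
  i <- i + 1
}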
This involves first rounding all times to the nearest hour. This method is not as accurate as I would like it to be and seems VERY inefficient! Is there an easy way to find the nearest POSIXct time to give my desired output?
Some assumptions:
- you have temperature data before and after all flight information; otherwise you'll see NA
- temperature data is continuous enough, meaning that with the interpolation this presents, you don't grab something from 3 months prior (not useful)
- temperature data is ordered (easy enough to fix if not)
We'll use cut, which finds the interval in which each value falls within a series of breaks:
(ind <- cut(FlightInformation$DepartureTime, Temperature$Time, labels = FALSE))
# [1] 1 2 NA
These indicate the rows of Temperature from which we should retrieve $Temp. Unfortunately, cut always returns the interval that starts at the earlier break, even when the departure is closer to the next break, so we compensate for that:
(ind <- ind + (abs(Temperature$Time[ind] - FlightInformation$DepartureTime) >
abs(Temperature$Time[1+ind] - FlightInformation$DepartureTime)))
# [1] 1 3 NA
Okay, now that NA: that indicates that the latest $DepartureTime is outside of the known times. This indicates a violation of my first assumption above, but it can be fixed. I use a magic-constant of "6 hours" here to determine that the data is close enough to be able to use it; there are certainly many other heuristics which will be less-wrong. For those, we can just assume the latest temperature:
(is_recoverable <- is.na(ind) & abs(as.numeric(difftime(FlightInformation$DepartureTime, max(Temperature$Time), units = "secs"))) < 60*60*6)
# [1] FALSE FALSE TRUE
ind[is_recoverable] <- nrow(Temperature)
ind
# [1] 1 3 3
Then the results:
FlightInformation$Temp <- Temperature$Temp[ ind ]
FlightInformation
# DepartureTime Temp
# 1 2000-01-01 03:01:00 20
# 2 2000-01-01 10:00:00 10
# 3 2000-01-01 14:00:00 10
Though definitely quicker than double while loops, it will be a problem if you have large gaps in your temperature data. That is, if you have a 3-year gap in your data, the most-recent temperature will be used, which might be 2.99 years ago. For a double-check, use this:
FlightInformation$TempTime <- Temperature$Time[ ind ]
FlightInformation$TimeDelta <- with(FlightInformation, abs(TempTime - DepartureTime))
FlightInformation
# DepartureTime Temp TempTime TimeDelta
# 1 2000-01-01 03:01:00 20 2000-01-01 00:53:00 128 mins
# 2 2000-01-01 10:00:00 10 2000-01-01 10:53:00 53 mins
# 3 2000-01-01 14:00:00 10 2000-01-01 10:53:00 187 mins
You can use different units for the time delta and check for problems with:
units(FlightInformation$TimeDelta) <- "secs"
which(FlightInformation$TimeDelta > 60*60*6)
# integer(0)
(where integer(0) says you have none that are outside of my magic window of 6 hours.)
Here's a way! Time is easiest to work with for this if you convert it to a numeric value. Then you can compare the numeric values to find the closest times before/after your reference time (FlightInformation$time_num in the below example). Once you have the closest time before and after your reference value, figure out which is really the closest to your reference. Use that time value to look up (index) the correct temperature value and add it to your data frame.
#convert time to numeric (seconds since origin of time)
Temperature$time_num <- as.numeric(Temperature$Time)
FlightInformation$time_num <- as.numeric(FlightInformation$DepartureTime)
#make sure time data is in correct order so that indexes for time are in correct order
Temperature <- Temperature[with(Temperature, order(time_num)), ] #sort data
for (i in 1:nrow(FlightInformation)) #for each row of data in flight...
{
#find the time in Temp that is closest + prior to Flight time
#create a logical vector saying which Temperature$time_num are <= to FlightInformation$time_num.
#pull the max row index from the logical vector where value == TRUE (this is the closest time for Temp that is prior to Flight Time)
#use that row index to look up the Temperature$time_num value that is closest + prior to Flight time
#will return NA/warning message if no time in Temp is before time in Flight
temptime_prior <- Temperature[max(which(Temperature$time_num <= FlightInformation$time_num[i])), "time_num"]
#find the time in Temp that is closest + after to Flight time
#will return NA/warning message if no time in Temp is after time in Flight
temptime_after <- Temperature[min(which(Temperature$time_num > FlightInformation$time_num[i])), "time_num"]
#compare times before and after to see which is closest to flight time. If no before/after time was found (e.g., NA was returned), always use the other time value
temptime_closest <- ifelse(is.na(temptime_prior), temptime_after,
ifelse(is.na(temptime_after), temptime_prior,
ifelse((FlightInformation$time_num[i] - temptime_prior) <= (temptime_after - FlightInformation$time_num[i]),
temptime_prior, temptime_after)))
#look up the right temp by finding the row index of right Temp$time_num value and add it to Flight info
FlightInformation$Temp[i] <- Temperature[which(Temperature$time_num == temptime_closest), "Temp"]
}
#get rid of numeric time column, you don't need it anymore
FlightInformation <- FlightInformation[,!(names(FlightInformation) %in% c("time_num"))]
Output
DepartureTime Temp
1 2000-01-01 03:01:00 20
2 2000-01-01 10:00:00 10
3 2000-01-01 14:00:00 10
If you have subsets of data in each data frame you need to match up to (e.g., match df1$group1 time values only to df2$group1 time values), you can use survival::neardate. It's a nice function for this that does basically what the above code does, but has some additional parameters if you need them.
Hope this helps! The code's a lot shorter without all the comments =)
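For comparison, a more compact sketch of the same nearest-time lookup, assuming both tables are small enough that comparing every departure against every temperature reading is acceptable. The `tz = "UTC"` setting and the names `temp_secs`, `dep_secs` and `nearest` are assumptions made for this illustration:

```r
Temperature <- data.frame(
  Time = as.POSIXct(c("2000-01-01 00:53:00", "2000-01-01 06:53:00",
                      "2000-01-01 10:53:00"), tz = "UTC"),
  Temp = c(20, 30, 10))
FlightInformation <- data.frame(
  DepartureTime = as.POSIXct(c("2000-01-01 03:01:00", "2000-01-01 10:00:00",
                               "2000-01-01 14:00:00"), tz = "UTC"))

# Work in seconds so the differences are plain numbers
temp_secs <- as.numeric(Temperature$Time)
dep_secs <- as.numeric(FlightInformation$DepartureTime)

# For each departure, the row of Temperature with the smallest absolute gap
nearest <- sapply(dep_secs, function(x) which.min(abs(temp_secs - x)))
FlightInformation$Temp <- Temperature$Temp[nearest]
FlightInformation
```

Because which.min returns the first minimum, an exact tie between the reading before and the reading after resolves to the earlier one, which matches the <= comparison used in the loop above.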
I have a data frame that looks like this:
X id mat.1 mat.2 mat.3 times
1 1 1 Anne 1495206060 18.5639404 2017-05-19 11:01:00
2 2 1 Anne 1495209660 9.0160321 2017-05-19 12:01:00
3 3 1 Anne 1495211460 37.6559161 2017-05-19 12:31:00
4 4 1 Anne 1495213260 31.1218856 2017-05-19 13:01:00
....
164 164 1 Anne 1497825060 4.8098351 2017-06-18 18:31:00
165 165 1 Anne 1497826860 15.0678781 2017-06-18 19:01:00
166 166 1 Anne 1497828660 4.7636241 2017-06-18 19:31:00
What I would like is to subset the data set by time interval (all data between 11 AM and 4 PM) if there are data points for each hour at least (11 AM, 12, 1, 2, 3, 4 PM) within each day. I want to ultimately sum the values from mat.3 per time interval (11 AM to 4 PM) per day.
I tried:
sub.1 <- subset(t,format(times,'%H')>='11' & format(times,'%H')<='16')
but this returns all the data from any of the times between 11 AM and 4 PM, but often I would only have data for e.g. 12 and 1 PM for a given day.
I only want the subset from days where I have data for each hour from 11 AM to 4 PM. Any ideas what I can try?
A complement to @Henry Navarro's answer that addresses an additional problem mentioned in the question.
If I understand correctly, the other concern in the question is to find the dates that have at least one data point for each hour of the given interval within the day. A possible way, following the style of @Henry Navarro's solution, is as follows:
library(lubridate)
your_data$hour_only <- as.numeric(format(your_data$times, format = "%H"))
your_data$days <- ymd(format(your_data$times, "%Y-%m-%d"))
your_data_by_days_list <- split(x = your_data, f = your_data$days)
# the interval is narrowed for demonstration purposes
hours_intervals <- 11:13
all_hours_flags <- data.frame(days = unique(your_data$days),
all_hours_present = sapply(function(Z) (sum(unique(Z$hour_only) %in% hours_intervals) >=
length(hours_intervals)), X = your_data_by_days_list), row.names = NULL)
your_data <- merge(your_data, all_hours_flags, by = "days")
There is now the column "all_hours_present" indicating that the data for a corresponding day contains at least one value for each hour in the given hours_intervals. And you may use this column to subset your data
subset(your_data, all_hours_present)
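The question also asks for the sum of mat.3 per day over the window. A base-R sketch of that last step, on a tiny made-up data set (the values, the narrowed 11:00 to 13:00 window and the names `hours_wanted`, `in_window`, `complete_days` and `daily_sum` are all assumptions for illustration):

```r
# Made-up example: the first day covers hours 11, 12 and 13; the second misses 12
your_data <- data.frame(
  times = as.POSIXct(c("2017-05-19 11:01:00", "2017-05-19 12:01:00",
                       "2017-05-19 13:01:00", "2017-05-20 11:01:00",
                       "2017-05-20 13:01:00"), tz = "UTC"),
  mat.3 = c(1, 2, 3, 10, 20))

hours_wanted <- 11:13
your_data$hour <- as.integer(format(your_data$times, "%H"))
your_data$day <- format(your_data$times, "%Y-%m-%d")

# Rows inside the window
in_window <- your_data[your_data$hour %in% hours_wanted, ]

# Days on which every wanted hour occurs at least once
complete <- tapply(in_window$hour, in_window$day,
                   function(h) all(hours_wanted %in% h))
complete_days <- names(complete)[as.vector(complete)]

# Sum mat.3 per complete day
kept <- in_window[in_window$day %in% complete_days, ]
daily_sum <- aggregate(mat.3 ~ day, data = kept, FUN = sum)
daily_sum
```

Only 2017-05-19 survives in this toy data, with a sum of 6.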
Try to create a new variable in your data frame with only the hour.
your_data$hour<-format(your_data$times, format="%H:%M:%S")
Then, using this new variable, do the following:
# auxiliary variable flagging rows inside your time interval
your_data$aux_var<-ifelse(your_data$hour >= "11:00:00" & your_data$hour <= "16:00:00", 1, 0)
The next step is to filter your data to the rows where aux_var == 1:
your_data[which(your_data$aux_var ==1),]
Consider data collected at 5-minute intervals, with a numeric variable a and a discrete variable acc that records whether an incident happened (0 for no incident, 1 for an incident):
a<-c(1:(288*4))
t<-seq(as.POSIXct("2016-01-01 00:05:00"), as.POSIXct("2016-01-05 00:00:00"), by = '5 min')
acc<-rep(0,288*4)
df<-data.frame(t,a,acc)
Now I have another data set which holds the times (accurate to 1 second) at which the incidents happened during the collection period:
T<-sample(seq(as.POSIXct("2016-01-01 00:05:00"), as.POSIXct("2016-01-05 00:00:00"), by = '1 sec'),size = 5)
I want to set acc to 1 for the two nearest prior observations of each time in T. For example, if an incident happened at 2016-01-02 07:13:23, the observations with t of 2016-01-02 07:05:00 and 2016-01-02 07:10:00 should be marked as 1.
How could I manage to do this?
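ind <- findInterval(T, df$t)  # index of the last observation at or before each incident
df$acc[c(ind - 1, ind)] <- 1  # mark that observation and the one before it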
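As a check against the example in the question, here is a small sketch with a single fixed incident time instead of a random sample. It assumes that "the two nearest prior observations" means the observation at or before the incident plus the one before that; `incident` and `marked` are names made up for this illustration:

```r
t <- seq(as.POSIXct("2016-01-01 00:05:00", tz = "UTC"),
         as.POSIXct("2016-01-05 00:00:00", tz = "UTC"), by = "5 min")
df <- data.frame(t = t, a = seq_along(t), acc = 0)

incident <- as.POSIXct("2016-01-02 07:13:23", tz = "UTC")

# Index of the last observation whose time is <= the incident time
ind <- findInterval(as.numeric(incident), as.numeric(df$t))

# That observation and the one before it; drop indices that fall off the start
marked <- c(ind - 1, ind)
marked <- marked[marked >= 1]
df$acc[marked] <- 1

format(df$t[df$acc == 1], "%Y-%m-%d %H:%M:%S", tz = "UTC")
```

The two marked rows should be 07:05:00 and 07:10:00 on 2016-01-02, as in the question's example.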
One way could be:
library(lubridate)
df$acc=apply(sapply(T,function(x) x %within% interval((df$t - minutes(4)-seconds(59)),(df$t + minutes(4)+seconds(59)))),1,sum)
lubridate makes date manipulation easy: minutes(x) and seconds(x) create period objects of x minutes or x seconds, which can be added to or subtracted from a date-time.
interval() is used to create a time interval confined by the time in df$t ± 4min59s.
sapply() is used to check if any of the time in T is within the interval.
apply() is used to collapse the results of sapply() (it outputs 1 column for each element in T)
If T contains a value that is exactly equal to one in df$t such as 2016-01-04 12:05:00 CET this will only put 1 for this one.
I have a dataframe of time stamps which specify a categorical status. The status is valid until the next time stamp, at which time the category might change.
I'd like to be able to determine percentage of time spent in each category over regular time periods, like monthly, quarterly, or annually.
This seems like a common enough problem, but I've been unable to find an elegant solution or library to solve it.
For example, with the following sample dataframe:
date status
2016-02-20 09:11:00 a
2016-03-06 02:38:00 c
2016-03-10 15:20:00 b
2016-03-10 21:20:00 a
2016-03-11 11:51:00 b
2016-03-12 01:19:00 c
2016-03-22 14:39:00 c
2016-03-23 11:37:00 b
2016-03-25 17:38:00 c
2016-03-26 01:24:00 c
2016-03-26 12:40:00 a
2016-04-12 10:28:00 c
... I might want to report weekly from 3/1-3/7, 3/8-3/14, 3/15-3/21, the percent time in each week of 'a', 'b', and 'c' status.
I started brute force coding a solution to this (it's ugly...), when I decided maybe I should ask here whether there's a more elegant way to do it.
======== Edited to add an inelegant brute-force solution below ========
library(plyr)  # provides ddply
time_analysis <- function(df, starttime, endtime) {
# - assumes sorted by date
startindex <- sum(df$date <= starttime) # find the index of the entry which contains the start time
endindex <- sum(df$date <= endtime) + 1 # find the index of the entry which contains the end time
if ( (startindex == 0) || (endindex > nrow(df) ) ) {
print("Date outside of available data")
return(NULL)
}
df2 <- df[ startindex:endindex, ] # subset the dataframe to include the range, but still need to trim ends
df2$date[1] <- starttime # trim to the start time
df2$date[nrow(df2)] <- endtime # trim back the end time
df2$status[nrow(df2)] <- df2$status[nrow(df2)-1] # status hasn't changed yet, so still the previous status
duration <- diff(df2$date) # vector of the time within each segment, 1 fewer elements than the dataframe
units(duration) <- 'days'
duration <- as.numeric(duration) # need to convert to numeric, or else can't divide by total duration
df2 <- df2[ -nrow(df2), ] # remove the last row, to make length same as the duration vector
df2$duration <- duration # add the duration column
total <- sum(df2$duration) # to allow calculations within the ddply
return(ddply(df2[, c('status','duration')], 'status', function(x) { # calculate by each status category
return( c(
date = starttime,
totaldays = round(sum(x$duration), 2),
fraction = round(sum(x$duration) / total, 3)) )
} ))
}
Below is a sample use that splits the reporting into roughly 2-week chunks. I hate the manual date coding and using a loop in R, but am too inexperienced to know a better way.
times <- c("2016-03-01","2016-03-15","2016-04-01","2016-04-15","2016-05-01","2016-05-15")
result <- data.frame()
for (i in 1:(length(times) - 1)) {
result <- rbind( result, time_analysis(d, times[i], times[i+1]) )
}
print(result, row.names = FALSE)
Yielding (other than some errors for dates out of range):
status date totaldays fraction
a 2016-03-01 5.71 0.409
b 2016-03-01 0.81 0.058
c 2016-03-01 7.43 0.532
a 2016-03-15 5.47 0.322
b 2016-03-15 2.25 0.132
c 2016-03-15 9.28 0.546
=====
And after posting, found a much nicer way to generate the times:
times <- as.character( seq( as.Date("2016-03-01"), as.Date("2016-05-15"), by = '2 weeks' ) )
Here's an approach that combines the cut.POSIXt() S3 method with a nested data.table aggregation.
## define data
library(data.table);
dt <- data.table(date=as.POSIXct(c('2016-02-20 09:11:00','2016-03-06 02:38:00','2016-03-10 15:20:00','2016-03-10 21:20:00','2016-03-11 11:51:00','2016-03-12 01:19:00','2016-03-22 14:39:00','2016-03-23 11:37:00','2016-03-25 17:38:00','2016-03-26 01:24:00','2016-03-26 12:40:00','2016-04-12 10:28:00')),status=c('a','c','b','a','b','c','c','b','c','c','a','c'));
## solution
dt[,{ n1 <- .N; .SD[,.(pct=.N/n1*100),.(status)]; },.(month=cut(date,'month'))];
## month status pct
## 1: 2016-02-01 a 100
## 2: 2016-03-01 c 50
## 3: 2016-03-01 b 30
## 4: 2016-03-01 a 20
## 5: 2016-04-01 c 100
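Note that the aggregation above counts rows per status. If the share of elapsed time is wanted instead, a base-R sketch could look like the following. It assumes the timestamps are sorted, that each status holds until the next timestamp, and that the last status holds until the end of the reporting window; `status_share` is a hypothetical helper and the three-row data set is made up:

```r
# Hypothetical helper: fraction of the window [from, to] spent in each status
status_share <- function(dates, status, from, to) {
  d <- as.numeric(dates)
  seg_start <- pmax(d, as.numeric(from))            # clip segment starts to the window
  seg_end <- pmin(c(d[-1], Inf), as.numeric(to))    # each status ends at the next stamp
  held <- pmax(seg_end - seg_start, 0)              # seconds inside the window
  tapply(held, status, sum) / sum(held)
}

d <- data.frame(
  date = as.POSIXct(c("2016-03-01 00:00:00", "2016-03-02 00:00:00",
                      "2016-03-04 00:00:00"), tz = "UTC"),
  status = c("a", "b", "a"), stringsAsFactors = FALSE)

share <- status_share(d$date, d$status,
                      as.POSIXct("2016-03-01", tz = "UTC"),
                      as.POSIXct("2016-03-05", tz = "UTC"))
share
```

In this toy example 'a' and 'b' each cover two of the four days. Any part of the window before the first timestamp is not attributed to a status and is left out of the denominator.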
I have two data frames. One contains time periods identified by unique character IDs, and the other contains events with their own set of unique IDs.
Period DF (code):
periodID <- c("P_UID_00", "P_UID_01", "P_UID_02", "P_UID_03")
periodStart <- as.POSIXct(c("2016/02/10 19:00", "2016/02/11 19:00",
"2016/02/12 19:00", "2016/02/13 19:00"))
periodEnd <- as.POSIXct(c("2016/02/10 21:00", "2016/02/11 21:00",
"2016/02/12 21:00", "2016/02/13 21:00"))
periodDF <- data.frame(periodID, periodStart, periodEnd)
Period DF:
periodID periodStart periodEnd
1 P_UID_00 2016-02-10 19:00:00 2016-02-10 21:00:00
2 P_UID_01 2016-02-11 19:00:00 2016-02-11 21:00:00
3 P_UID_02 2016-02-12 19:00:00 2016-02-12 21:00:00
4 P_UID_03 2016-02-13 19:00:00 2016-02-13 21:00:00
Event DF (code):
eventID <- c("E_UID_00", "E_UID_01", "E_UDI_02", "E_UID_03")
eventTime <- as.POSIXct(c("2016/02/09 19:55:01", "2016/02/11 19:12:01",
"2016/02/11 20:22:01", "2016/02/15 19:00:01"))
eventDF <- data.frame(eventID, eventTime)
Event DF:
eventID eventTime
1 E_UID_00 2016-02-09 19:55:01
2 E_UID_01 2016-02-11 19:12:01
3 E_UDI_02 2016-02-11 20:22:01
4 E_UID_03 2016-02-15 19:00:01
I want to map the event times in the second DF to the time periods in the first DF in order to match the ID of each event to the ID of its period. Essentially, the result table I want to see should look like:
eventID periodID
1 E_UID_00 NA
2 NA P_UID_00
3 E_UID_01 P_UID_01
4 E_UDI_02 P_UID_01
5 NA P_UID_02
6 NA P_UID_03
7 E_UID_03 NA
I suppose this can be achieved by using lubridate to transform the start and end columns in the first DF to intervals and then using some form of apply with an instant %within% interval combination, but I am not really familiar with lubridate and did not manage to produce working code.
Additional considerations:
- periods are completely arbitrary and can last from seconds to years
- periods never overlap, so this is not an issue
- more than one event could be associated with a time period
- it is possible for DFs to contain unassociatable events and time periods
- the solution must not include loops
- does not have to be solved with lubridate, in fact a solution with the base R will be even more welcome.
I actually managed to come up with code that produces exactly what I wanted using lubridate. So if anyone knows how to do this in base R, or simply a better way than the one suggested below, sharing it will be greatly appreciated!
First off, the start and end times in the period DF should be converted to lubridate intervals:
library(lubridate)
intervalsP <- as.interval(periodStart, periodEnd)
Step 2: A function should be created for checking whether an instant is located within a list of intervals. The only reason I have created a separate function is to be able to use it with lapply:
PeriodAssign <- function(x, y){
# x - instants
# y - intervals
variable1 <- mapply(`%within%`, x, y)
if (length(y[variable1]) != 0) {
as.character(y[variable1])
} else {
NA
}
}
NOTE: I had to coerce the intervals to character because otherwise the apply function coerced them to their length in seconds, which is not useful for matching purposes: all four intervals in this example have the same length.
Step 3: The function can then be used on the event DF, and both DFs can then be merged to produce the DF I was looking for:
eventDF$intervals <- lapply(eventTime, PeriodAssign, intervalsP)
periodDF$intervals <- as.character(intervalsP)
mergedDF <- merge(periodDF, eventDF, by = "intervals")
presentableDF <- mergedDF[, c(2, 5)]
# adding in the unmatched Periods and Events
tDF1 <- data.frame(periodDF[!(periodDF$periodID %in% presentableDF$periodID), 1], NA)
colnames(tDF1) <- c("periodID", "eventID")
presentableDF <- rbind(presentableDF, tDF1)
tDF2 <- data.frame(NA, eventDF[!(eventDF$eventID %in% presentableDF$eventID), 1])
colnames(tDF2) <- c("periodID", "eventID")
presentableDF <- rbind(presentableDF, tDF2)
presentableDF <- presentableDF[order(presentableDF[,1]),]
The eventual DF looks like:
> presentableDF
periodID eventID
3 P_UID_00 <NA>
1 P_UID_01 E_UID_01
2 P_UID_01 E_UDI_02
4 P_UID_02 <NA>
5 P_UID_03 <NA>
6 <NA> E_UID_00
7 <NA> E_UID_03