Combine timedelta and date column, group by time interval

I need to combine two separate columns to one datetime column.
The pandas dataframe looks as follows:
calendarid time_delta_actualdeparture actualtriptime
20140101 0 days 06:35:49.000020000 27.11666667
20140101 0 days 06:51:37.000020000 24.83333333
20140101 0 days 07:11:40.000020000 28.1
20140101 0 days 07:31:40.000020000 23.03333333
20140101 0 days 07:53:34.999980000 23.3
20140101 0 days 08:14:13.000020000 51.81666667
I would like to convert it to look like this:
calendarid actualtriptime
2014-01-01 6:30:00 mean of trip times in time interval
2014-01-01 7:00:00 mean of trip times in time interval
2014-01-01 7:30:00 mean of trip times in time interval
2014-01-01 8:00:00 mean of trip times in time interval
2014-01-01 8:30:00 mean of trip times in time interval
Essentially I would like to combine the two columns into one and then group into 30-minute time intervals, taking the mean of the actual trip time in each interval. I've unsuccessfully tried many techniques, but I am still learning Python/pandas. Can anyone help me with this?

Convert your 'calendarid' column to a datetime and add the delta to get the starting times.
In [5]: df['calendarid'] = pd.to_datetime(df['calendarid'], format='%Y%m%d')
In [7]: df['calendarid'] = df['calendarid'] + df['time_delta_actualdeparture']
In [8]: df
Out[8]:
calendarid time_delta_actualdeparture actualtriptime
0 2014-01-01 06:35:49.000020 06:35:49.000020 27.116667
1 2014-01-01 06:51:37.000020 06:51:37.000020 24.833333
2 2014-01-01 07:11:40.000020 07:11:40.000020 28.100000
3 2014-01-01 07:31:40.000020 07:31:40.000020 23.033333
4 2014-01-01 07:53:34.999980 07:53:34.999980 23.300000
5 2014-01-01 08:14:13.000020 08:14:13.000020 51.816667
Then you can set your date column as the index and resample at a 30-minute frequency to get the mean over each interval. (Recent pandas releases removed resample()'s how= argument; call the aggregation as a method instead.)
In [19]: df.set_index('calendarid').resample('30Min', label='right').mean()
Out[19]:
actualtriptime
calendarid
2014-01-01 07:00:00 25.975000
2014-01-01 07:30:00 28.100000
2014-01-01 08:00:00 23.166667
2014-01-01 08:30:00 51.816667

Unexpected dplyr::right_join() behavior for expanding POSIXct time series

I have a data frame containing some daily data timestamped at midnight on each day and some hourly data timestamped at the beginning of each hour throughout the day. I want to expand the data so it's all hourly, and I'd like to do so within a tidyverse "pipe chain".
My thought was to create a data frame containing the full hourly time series and then dplyr::right_join() my data against this time series. I thought this would populate the proper values where there was a match for the daily data (at midnight) and populate NA wherever there was no match (any hour except midnight). This seems to work only when the time series in my data is daily only, rather than a mix of daily and hourly values, which was unexpected. Why does the right join not expand the daily time series when it coexists in a data frame along with another hourly time series?
I've generated a minimal example below. My representative data set that I want to expand is named allData and contains a mix of daily and hourly datasets from two different time series variables, Daily TS and Hourly TS.
library(dplyr)   # for the %>% pipe

dailyData <- data.frame(
  DateTime = seq.POSIXt(lubridate::ymd_hms('2019-01-01', truncated = 3),
                        lubridate::ymd_hms('2019-01-07', truncated = 3),
                        by = 'day'),
  Name = 'Daily TS'
)
allHours <- data.frame(
  DateTime = seq.POSIXt(lubridate::ymd_hms('2019-01-01', truncated = 3),
                        lubridate::ymd_hms('2019-01-07 23:00:00'),
                        by = 'hour')
)
hourlyData <- allHours %>%
  dplyr::mutate(Name = 'Hourly TS')
allData <- rbind(dailyData, hourlyData)
This gives
head( allData, n=15 )
DateTime Name
1 2019-01-01 00:00:00 Daily TS
2 2019-01-02 00:00:00 Daily TS
3 2019-01-03 00:00:00 Daily TS
4 2019-01-04 00:00:00 Daily TS
5 2019-01-05 00:00:00 Daily TS
6 2019-01-06 00:00:00 Daily TS
7 2019-01-07 00:00:00 Daily TS
8 2019-01-01 00:00:00 Hourly TS
9 2019-01-01 01:00:00 Hourly TS
10 2019-01-01 02:00:00 Hourly TS
11 2019-01-01 03:00:00 Hourly TS
12 2019-01-01 04:00:00 Hourly TS
13 2019-01-01 05:00:00 Hourly TS
14 2019-01-01 06:00:00 Hourly TS
15 2019-01-01 07:00:00 Hourly TS
Now, I thought that dplyr::right_join() of the full hourly sequence of POSIXct values against allData$DateTime would have expanded the daily time series, leaving NA values for any hours not explicitly present in the data. I could then use tidyr::fill() to fill these in over the day. However, the following code does not behave this way:
expanded_BAD <- allData %>%
  dplyr::right_join(allHours, by = 'DateTime') %>%
  tidyr::fill(dplyr::everything(), .direction = 'down') %>%
  dplyr::arrange(Name, DateTime)
expanded_BAD shows that the daily data hasn't been expanded by the right_join(). That is, the hours in allHours missing from allData were not retained in the result, which I thought was the whole purpose of using a right join. Here's the head of the result:
head(expanded_BAD, n=15)
DateTime Name
1 2019-01-01 00:00:00 Daily TS
2 2019-01-02 00:00:00 Daily TS
3 2019-01-03 00:00:00 Daily TS
4 2019-01-04 00:00:00 Daily TS
5 2019-01-05 00:00:00 Daily TS
6 2019-01-06 00:00:00 Daily TS
7 2019-01-07 00:00:00 Daily TS
8 2019-01-01 00:00:00 Hourly TS
9 2019-01-01 01:00:00 Hourly TS
10 2019-01-01 02:00:00 Hourly TS
11 2019-01-01 03:00:00 Hourly TS
12 2019-01-01 04:00:00 Hourly TS
13 2019-01-01 05:00:00 Hourly TS
14 2019-01-01 06:00:00 Hourly TS
15 2019-01-01 07:00:00 Hourly TS
Interestingly, if we perform the exact same right join on only the daily data, we get the desired result:
dailyData_expanded_GOOD <- dailyData %>%
  dplyr::right_join(allHours, by = 'DateTime') %>%
  tidyr::fill(dplyr::everything(), .direction = 'down')
Here's the head:
head(dailyData_expanded_GOOD, n=15)
DateTime Name
1 2019-01-01 00:00:00 Daily TS
2 2019-01-01 01:00:00 Daily TS
3 2019-01-01 02:00:00 Daily TS
4 2019-01-01 03:00:00 Daily TS
5 2019-01-01 04:00:00 Daily TS
6 2019-01-01 05:00:00 Daily TS
7 2019-01-01 06:00:00 Daily TS
8 2019-01-01 07:00:00 Daily TS
9 2019-01-01 08:00:00 Daily TS
10 2019-01-01 09:00:00 Daily TS
11 2019-01-01 10:00:00 Daily TS
12 2019-01-01 11:00:00 Daily TS
13 2019-01-01 12:00:00 Daily TS
14 2019-01-01 13:00:00 Daily TS
15 2019-01-01 14:00:00 Daily TS
Why does the right join do different things on the full data compared to only the daily data?
I think the problem is that you are trying to bind the dataframes together too soon. I believe this gives you what you want:
result <- bind_rows(dailyData_expanded_GOOD, hourlyData)
head(result)
#> DateTime Name
#> 1 2019-01-01 00:00:00 Daily TS
#> 2 2019-01-01 01:00:00 Daily TS
#> 3 2019-01-01 02:00:00 Daily TS
#> 4 2019-01-01 03:00:00 Daily TS
#> 5 2019-01-01 04:00:00 Daily TS
#> 6 2019-01-01 05:00:00 Daily TS
The reason right_join() doesn't work is that allHours matches perfectly the rows in allData for the hourly time series. From ?right_join:
return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
You're hoping that rows in x with no match in y will have NA values, but the rows in y do match rows in x already. There are actually multiple matches, one for the daily and one for the hourly, but right_join() just returns both without expanding the daily time series rows.
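A quick way to see this with the objects built above (the filter isolates midnight on the first day, where both series match; the expected two rows are shown as comments):
allData %>%
  dplyr::right_join(allHours, by = 'DateTime') %>%
  dplyr::filter(DateTime == lubridate::ymd_hms('2019-01-01 00:00:00'))
#>              DateTime      Name
#> 1 2019-01-01 00:00:00  Daily TS
#> 2 2019-01-01 00:00:00 Hourly TS
Both matched rows come back, but no NA row is created, so there is nothing left for fill() to do.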
This is different from the situation in this question, where the datetimes to be expanded do not occur in the left hand data frame. Then the strategy of merging would expand your result as expected.
So that explains why a bare right_join() doesn't work, but it doesn't solve the problem, because you would have to split up the data manually, and that would get old fast if there are varying numbers of time series. There are a couple of solutions in the comments, and one additional one that I will add below.
tidyr::expand()
expandedData <- allData %>%
  tidyr::expand(DateTime, Name) %>%
  dplyr::arrange(Name, DateTime)
This works, but only with both time series present. If there is only dailyData, then the result is not expanded.
The kitchen sink
expandedData1 <- allData %>%
  dplyr::right_join(allHours, by = 'DateTime') %>%
  tidyr::fill(dplyr::everything()) %>%
  tidyr::expand(DateTime, Name) %>%
  dplyr::arrange(Name, DateTime)
As pointed out in the comments, this works for all cases: both types, only daily data, or only hourly data. This solution and the next generate warnings unless you use stringsAsFactors = FALSE in the data.frame() calls above.
The only issue with this solution is that fill() and right_join() are only there to deal with edge cases. I don't know if that is a real problem or not.
"Split" in the pipe
The simple solution splits the dataset, and this can be done inside the pipe in a couple of ways.
expandedData2 <- allData %>%
  tidyr::nest(-Name) %>%   # tidyr >= 1.0 prefers nest(data = -Name)
  mutate(data = purrr::map(data, ~ right_join(., allHours, by = 'DateTime'))) %>%
  tidyr::unnest()          # and unnest(data) in tidyr >= 1.0
The other way would be to use base::split() and then purrr::map_dfr(), for example:
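Here is a sketch of that split-based variant, using the same allData and allHours built above (expandedData3 is just an illustrative name; the right join leaves Name NA for the newly added hours, so it is refilled from each piece):
library(dplyr)
library(purrr)

expandedData3 <- allData %>%
  split(.$Name) %>%                 # one data frame per time series
  map_dfr(~ .x %>%
            right_join(allHours, by = 'DateTime') %>%
            mutate(Name = .x$Name[1]))   # refill Name for the new rows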
Created on 2019-03-24 by the reprex package (v0.2.0).

Aggregate by month, day of week, hour and min in R

I have a folder with several files, where the name of each file is the respective user ID. Each file looks something like this:
Time Sms
1 2012-01-01 00:00:00 10
2 2012-01-01 00:30:00 11
3 2012-01-01 01:00:00 13
4 2012-01-01 01:30:00 10
How can I aggregate by month, day of week, hour and minute? Something like this:
Month DayofWeek hour min SMS
1 Mon 0 0 14 <-mean
1 Mon 0 30 12
1 Mon 1 0 17
1 Mon 1 30 21
.............................
12 Sunday 23 30 12
I had a similar issue aggregating hourly data into daily data. This is the code that worked for me.
fun <- function(s, i, j) { sum(s[i:(i + j - 1)]) }
radday <- sapply(X = seq(1, 24 * nb_of_days, 24), FUN = fun, s = your_time_series, j = 24)
This sums the data across a period of length j; in my case, summing over 24 hours, j was 24. By changing the j value you can adjust it for your other periods (hour, day, week, month), assuming the period is constant.
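To make that concrete, here is a toy check of the same code with three days of hourly ones, so each daily sum should come out as 24:
your_time_series <- rep(1, 24 * 3)   # 3 days of hourly measurements, all equal to 1
nb_of_days <- 3
fun <- function(s, i, j) { sum(s[i:(i + j - 1)]) }
sapply(X = seq(1, 24 * nb_of_days, 24), FUN = fun, s = your_time_series, j = 24)
#> [1] 24 24 24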
Thanks for the help. I solved my problem by applying this code (month(), hour() and minute() come from the lubridate package):
df <- aggregate(Sms ~ month(Time) + weekdays(Time) + hour(Time) + minute(Time), df, FUN = 'mean')
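For reference, a self-contained sketch of that solution with made-up numbers in the question's shape (Time must be POSIXct; the column renaming at the end matches the desired output):
library(lubridate)

df <- data.frame(
  Time = seq(as.POSIXct("2012-01-01 00:00:00"), by = "30 min", length.out = 4),
  Sms  = c(10, 11, 13, 10)
)

agg <- aggregate(Sms ~ month(Time) + weekdays(Time) + hour(Time) + minute(Time),
                 data = df, FUN = mean)
names(agg) <- c("Month", "DayOfWeek", "Hour", "Min", "SMS")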

Time to failure variable based off start and end timestamps in R

I have two data sets. Data set 1 contains time stamps of 15 minute intervals starting at 2009-08-18 18:15:00 and ending 2012-11-09 22:30:00 with measurements taken at those times. Data set 2 has start and end time stamps for faults occurring in a factory. There are 6 faults and these faults' start and end times are also 15 min intervals, yet can last longer than 1 interval. They also all fall somewhere between 2009-08-18 18:15:00 and 2012-11-09 22:30:00 as well. I am trying to create a time to failure variable for the faults, where -i would indicate the next fault is i intervals (which are 15 mins) away and i would indicate the fault started i intervals ago. For example,
DataSet1
Timestamp Sensor 1
2009-09-04 10:00:00 30
2009-09-04 10:30:00 40
2009-09-04 10:45:00 33
2009-09-04 11:00:00 23
2009-09-04 11:15:00 24
2009-09-04 11:30:00 42
DataSet 2
Start Time End Time Fault Type
09/04/09 10:45 9/4/2009 11:15 1
09/04/09 21:45 9/4/2009 22:00 1
09/04/09 23:00 9/4/2009 23:15 1
09/05/09 10:45 9/5/2009 11:15 1
09/05/09 21:30 9/5/2009 23:15 1
09/08/09 10:45 9/8/2009 12:30 1
So what I want to end up with is the following time to failure variable (TTF1) and then repeat the process for faults 2-6
Timestamp Sensor 1 TTF1
2009-09-04 10:00:00 30 -3
2009-09-04 10:30:00 40 -1
2009-09-04 10:45:00 33 0
2009-09-04 11:00:00 23 1
2009-09-04 11:15:00 24 2
2009-09-04 11:30:00 42 -41
I know I can use the sqldf function to separate out each fault type, but I have no clue where to begin to even create counting the time to fault variable. I'm very stuck, any help would be greatly appreciated!
You can use the difftime() function from base R to get the time difference between two timestamps; for example:
(z <- Sys.time() - 3600)   # a time one hour ago
Sys.time() - z             # just over 3600 seconds
as.difftime(c("0:3:20", "11:23:15"))
as.difftime(c("3:20", "23:15", "2:"), format = "%H:%M")   # 3rd gives NA
(z <- as.difftime(c(0, 30, 60), units = "mins"))
as.numeric(z, units = "secs")
as.numeric(z, units = "hours")
format(z)
I would recommend setting units = "mins". You can convert the result to character, strip out any non-numeric text with gsub(), then convert back with as.numeric(). Finally, just divide by 15 to get the 15-minute units you want; you can use floor() to round the result if needed.
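Applied to the question's first few rows, a minimal sketch of those steps might look like this (note that as.numeric() accepts a difftime directly, so the character/gsub round trip can be skipped):
# first timestamps of data set 1 and the first fault start of data set 2
timestamps  <- as.POSIXct(c("2009-09-04 10:00:00", "2009-09-04 10:30:00",
                            "2009-09-04 10:45:00", "2009-09-04 11:00:00"))
fault_start <- as.POSIXct("2009-09-04 10:45:00")

d   <- difftime(timestamps, fault_start, units = "mins")
ttf <- floor(as.numeric(d) / 15)   # -3 -1 0 1, matching the TTF1 column above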

Given start and end times, create hourly labels to indicate whether an hour is in the duration or not

I have start and end times of some commercial event for a couple of locations. The event may or may not take place on each day, and event durations do not overlap. For example, run this:
inputdata = data.frame(
  location = c('x', 'x', 'y', 'z', 'z'),
  start = c(as.POSIXct("2010/1/1 8:28:00"), as.POSIXct("2010/1/2 7:20:00"),
            as.POSIXct("2010/1/1 10:22:00"),
            as.POSIXct("2010/1/5 13:28:00"), as.POSIXct("2010/1/7 15:39:00")),
  end = c(as.POSIXct("2010/1/1 13:25:00"), as.POSIXct("2010/1/2 10:09:00"),
          as.POSIXct("2010/1/1 15:24:00"),
          as.POSIXct("2010/1/6 00:28:00"), as.POSIXct("2010/1/7 19:34:00"))
)
The input data looks like:
location start end
1 x 2010-01-01 08:28:00 2010-01-01 13:25:00
2 x 2010-01-02 07:20:00 2010-01-02 10:09:00
3 y 2010-01-01 10:22:00 2010-01-01 15:24:00
4 z 2010-01-05 13:28:00 2010-01-06 00:28:00
5 z 2010-01-07 15:39:00 2010-01-07 19:34:00
I want to construct an hourly dataset with three columns: 1. location, 2. hour, and 3. indicator, where each row is a pair of a location and a sharp hour (for instance, as.POSIXct("2010/1/1 13:00:00")), and indicator is a dummy that equals 1 if this hour falls between some event's start and end times for that location.
For instance, let's say the output hourly data are for 2010-01-01 to 2010-01-07. Run this:
output = data.frame(
  location = rep(c('x', 'y', 'z'),
                 each = length(seq(as.POSIXct("2010/1/1"), as.POSIXct("2010/1/7 23:00:00"), "hours"))),
  hour = rep(seq(as.POSIXct("2010/1/1"), as.POSIXct("2010/1/7 23:00:00"), "hours"), 3),
  indicator = rep(0, 3 * length(seq(as.POSIXct("2010/1/1"), as.POSIXct("2010/1/7 23:00:00"), "hours")))
)
So we get the first six rows look like this:
location hour indicator
1 x 2010-01-01 00:00:00 0
2 x 2010-01-01 01:00:00 0
3 x 2010-01-01 02:00:00 0
4 x 2010-01-01 03:00:00 0
5 x 2010-01-01 04:00:00 0
6 x 2010-01-01 05:00:00 0
Now, we need to change the value of indicator to 1 if the hour in the same row has an event in effect for the location in the same row.
For instance, location x has an event between 8:28 and 13:25 on 2010/1/1, so the rows for 7:00 through 14:00 should look like this:
location hour indicator
8 x 2010-01-01 07:00:00 0
9 x 2010-01-01 08:00:00 1
10 x 2010-01-01 09:00:00 1
11 x 2010-01-01 10:00:00 1
12 x 2010-01-01 11:00:00 1
13 x 2010-01-01 12:00:00 1
14 x 2010-01-01 13:00:00 1
15 x 2010-01-01 14:00:00 0
I could exhaustively search every pair of location and hour and update the value of indicator if the hour is between the start and end of some event at that location, but I doubt this is the best way.
Alternatively, I am thinking that I can first convert the input data to hourly data, where an hour appears only if it falls between the start and end of some event. In other words, the converted data should look like:
location hour indicator
1 x 2010-01-01 08:00:00 1
2 x 2010-01-01 09:00:00 1
3 x 2010-01-01 10:00:00 1
4 x 2010-01-01 11:00:00 1
5 x 2010-01-01 12:00:00 1
6 x 2010-01-01 13:00:00 1
7 x 2010-01-02 07:00:00 1
8 x 2010-01-02 08:00:00 1
9 x 2010-01-02 09:00:00 1
10 x 2010-01-02 10:00:00 1
11 y 2010-01-01 10:00:00 1
12 y 2010-01-01 11:00:00 1
and then go from there to get the correct indicators for each hour for each location. However, I don't know how to convert the start/end hours to hourly observations.
This is all I have for this problem so far; I do not have a solution and would like to ask for help.
Also, all I want is that output with three columns. When contributing, please do not be constrained by my thoughts, which may not be efficient.
It is worth mentioning that the actual problem covers 5 years and 30 locations, so the algorithm needs to be efficient.
Here is a way to do this with a cross join.
library(dplyr)

# cross join: pair every location with every hour in the window
# (data_frame() is superseded; tibble() is the modern spelling)
hours =
  data_frame(hour = seq(as.POSIXct("2010/1/1"),
                        as.POSIXct("2010/1/7 23:00:00"),
                        "hours")) %>%
  merge(inputdata %>% select(location) %>% distinct)

hours %>%
  left_join(inputdata, by = 'location') %>%
  filter(start <= hour & hour <= end) %>%
  right_join(hours, by = c('location', 'hour')) %>%
  mutate(indicator = +!is.na(start))
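As a quick spot check, assign the pipe above to a variable (result here is an assumed name) and filter down to the first day for location x:
result %>%
  filter(location == 'x',
         hour >= as.POSIXct("2010/1/1"),
         hour < as.POSIXct("2010/1/2")) %>%
  arrange(hour)
Note that this marks an hour with 1 only when the sharp hour itself falls inside an event, so 08:00 comes out 0 here even though the question's example marks it 1 (the event starts at 08:28). If the overlap behaviour is wanted, a condition like start < hour + 3600 & end > hour in the filter() reproduces the question's rows.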

Averaging a continuous measurement of meteorological parameters on R

I am quite new to R, and I am trying to find a way to average continuous data over specific periods of time.
My data is a month-long recording of several parameters with 1 s time steps.
The table, read in via read.csv, has a date column and a time column followed by several other columns with values.
TimeStamp UTC Pitch Roll Heave(m)
05-02-13 6:45 0 0 0
05-02-13 6:46 0.75 -0.34 0.01
05-02-13 6:47 0.81 -0.32 0
05-02-13 6:48 0.79 -0.37 0
05-02-13 6:49 0.73 -0.08 -0.02
So I want to average the data over specific intervals, 20 minutes for example, such that the average for hour 7:00 takes all the points from 6:41 to 7:00, and so on for the entire dataset.
The result will look like this:
TimeStamp
05-02-13 19:00 462
05-02-13 19:20 332
05-02-13 19:40 15
05-02-13 20:00 10
05-02-13 20:20 42
Here is a reproducible dataset similar to your own.
meteorological <- data.frame(
  TimeStamp = rep.int("05-02-13", 1440),
  UTC = paste(
    rep(formatC(0:23, width = 2, flag = "0"), each = 60),
    rep(formatC(0:59, width = 2, flag = "0"), times = 24),
    sep = ":"
  ),
  Pitch = runif(1440),
  Roll = rnorm(1440),
  Heave = rnorm(1440)
)
The first thing that you need to do is to combine the first two columns to create a single (POSIXct) date-time column.
library(lubridate)
meteorological$DateTime <- with(
meteorological,
dmy_hm(paste(TimeStamp, UTC))
)
Then set up a sequence of break points for your different time groupings.
breaks <- seq(ymd_hms("2013-02-05 00:00:00"), ymd_hms("2013-02-06 00:00:00"), "20 mins")
(In current lubridate, ymd() returns a Date, which seq() cannot step by "20 mins"; ymd_hms() gives POSIXct break points that are also comparable with the DateTime column.)
Finally, you can calculate the summary statistics for each group. There are many ways to do this. ddply from the plyr package is a good choice.
library(plyr)
ddply(
  meteorological,
  .(cut(DateTime, breaks)),
  summarise,
  MeanPitch = mean(Pitch),
  MeanRoll = mean(Roll),
  MeanHeave = mean(Heave)
)
Please see if something simple like this works for you:
myseq <- data.frame(time = seq(ISOdate(2014, 1, 1, 12, 0, 0), ISOdate(2014, 1, 1, 13, 0, 0), "5 min"))
myseq$cltime <- cut(myseq$time, "20 min", labels = FALSE)
> myseq
time cltime
1 2014-01-01 12:00:00 1
2 2014-01-01 12:05:00 1
3 2014-01-01 12:10:00 1
4 2014-01-01 12:15:00 1
5 2014-01-01 12:20:00 2
6 2014-01-01 12:25:00 2
7 2014-01-01 12:30:00 2
8 2014-01-01 12:35:00 2
9 2014-01-01 12:40:00 3
10 2014-01-01 12:45:00 3
11 2014-01-01 12:50:00 3
12 2014-01-01 12:55:00 3
13 2014-01-01 13:00:00 4
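From there, the group labels can feed an ordinary aggregate() call; for instance, with a made-up value column (hypothetical, your data would supply real measurements):
myseq$value <- rnorm(nrow(myseq))                      # hypothetical measurement
aggregate(value ~ cltime, data = myseq, FUN = mean)    # one mean per 20-min group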
