Pandas increment time series by one minute - datetime

My data is here.
I want to add a minute to values in STA_STD that contain "23:59:00", to get a regular 5-minute time series. Adding one minute should also roll the date over to 00:00 hours of the next day.
My code is here
dat = pd.read_csv("temp.csv")
if (dat['STA_STD'].str.contains("23:59:00")):
    dat['STA_STD_NEW'] = pd.to_datetime(dat[dat['STA_STD'].str.contains("23:59:00")]['STA_STD']) + datetime.timedelta(minutes=1)
else:
    dat['STA_STD_NEW'] = dat['STA_STD']
And this gives me the error below:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Pandas documentation here talks about the same error.
What is the procedure to iterate through all rows and increment the value by one minute if it contains "23:59:00"?
Please advise.

Two things:
- You can't use if/else like that to evaluate multiple values at the same time (you would need to iterate over the values and do the if/else for each value separately). Using boolean indexing is much better in this case.
- str.contains does not work with datetimes, but you can e.g. check whether the time part of the datetime values equals datetime.time(23, 59).
A small example:
In [1]: import pandas as pd, datetime
In [2]: dat = pd.DataFrame({'STA_STD': pd.date_range('2012-01-01 23:50', periods=10, freq='1min')})
In [3]: dat['STA_STD_NEW'] = dat['STA_STD']
In [4]: dat.loc[dat['STA_STD'].dt.time == datetime.time(23,59), 'STA_STD_NEW'] += datetime.timedelta(minutes=1)
In [5]: dat
Out[5]:
STA_STD STA_STD_NEW
0 2012-01-01 23:50:00 2012-01-01 23:50:00
1 2012-01-01 23:51:00 2012-01-01 23:51:00
2 2012-01-01 23:52:00 2012-01-01 23:52:00
3 2012-01-01 23:53:00 2012-01-01 23:53:00
4 2012-01-01 23:54:00 2012-01-01 23:54:00
5 2012-01-01 23:55:00 2012-01-01 23:55:00
6 2012-01-01 23:56:00 2012-01-01 23:56:00
7 2012-01-01 23:57:00 2012-01-01 23:57:00
8 2012-01-01 23:58:00 2012-01-01 23:58:00
9 2012-01-01 23:59:00 2012-01-02 00:00:00 <-- increment of 1 minute
Using the dt.time approach requires pandas >= 0.15.

Related

Using subset on dates giving shifted dates from the desired time frame

I have a data frame (called homeAnew) whose head is as follows:
date total
1 2014-01-01 00:00:00 0.756
2 2014-01-01 01:00:00 0.717
3 2014-01-01 02:00:00 0.643
4 2014-01-01 03:00:00 0.598
5 2014-01-01 04:00:00 0.604
6 2014-01-01 05:00:00 0.638
I wanted to extract explicit dates and I originally used:
Hourly <- subset(homeAnew,date >= "2014-04-10 00:00:00" & date <= "2015-04-10 00:00:00")
However the result was a dataframe that started at 2014-04-09 12:00:00 and ended 2015-04-09 12:00:00. Basically it was shifted back 12 hours from where I wanted it.
I was able to use
Date1<-as.Date("2014-04-10 00:00:00")
Date2<-as.Date("2015-04-10 00:00:00")
Hourly<-homeAnew[homeAnew$date>=Date1 & homeAnew$date<=Date2,]
To get what I was after, but I was wondering if someone could explain why subset behaves like that?
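A likely explanation (an assumption, since the question doesn't show the session timezone): when a POSIXct column is compared with a character string, R converts the string with as.POSIXct using the session's timezone, which can differ from the timezone the data were stored in. A 12-hour offset between the two produces exactly the 12-hour shift observed. A minimal sketch:
# data stored in UTC; a record 11 hours before the intended cutoff
x <- as.POSIXct("2014-04-09 13:00:00", tz = "UTC")
# force a session timezone 12 hours ahead of UTC, for illustration only
Sys.setenv(TZ = "Etc/GMT-12")
# the string is parsed in the session timezone (= 2014-04-09 12:00:00 UTC),
# so x passes the filter even though it is before 2014-04-10 in UTC
x >= "2014-04-10 00:00:00"
# [1] TRUE
The as.Date workaround avoids this because a Date is converted to POSIXct at midnight UTC rather than in the session timezone.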

How do I prevent {data.table} foverlaps from feeding NAs into its any(...) call when executing on large datatables?

First of all, a similar problem:
Foverlaps error: Error in if (any(x[[xintervals[2L]]] - x[[xintervals[1L]]] < 0L)) stop
The story
I'm trying to count how many times fluor emissions (measured every minute) overlap with a given event. An emission is said to overlap with a given event when the emission time falls between 10 minutes before and 30 minutes after the time of the event. In total we consider three events: AC, CO and MT.
The data
Edit 1:
Here are two example datasets that allow the execution of the code below.
The code runs just fine for these sets. Once I have data that generates the error I'll make a second edit. Note that event.GN in the example dataset below is a data.table instead of a list.
library(data.table)
library(lubridate)

emissions.GN <- data.table(date.time = seq(ymd_hms("2016-01-01 00:00:00"), by = "min", length.out = 1000000))
event.GN <- data.table(dat = seq(ymd_hms("2016-01-01 00:00:00"), by = "15 mins", length.out = 26383))
Edit 2:
I created a csv file containing the data event.GN that generates the error. The file has 26383 rows of one variable dat but only about 14000 are necessary to generate the error.
Edit 3:
Up until the dat "2017-03-26 00:25:20" the function works fine. Right after adding the next record, with dat "2017-03-26 01:33:46", the error occurs. I noticed that there is more than 60 minutes between those two points. This means that between those two event times one or several emission records won't have corresponding events. This in turn generates NAs that somehow get caught up in the any() call of the foverlaps function. Am I looking in the right direction?
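(As an aside, the mechanism is easy to reproduce in isolation; a minimal illustration of an NA reaching any() inside an if:
x <- c(5L, NA, 3L)
if (any(x < 0L)) stop("All entries should be >= 0")
# Error in if (any(x < 0L)) stop("All entries should be >= 0") :
#   missing value where TRUE/FALSE needed
any() returns NA when its input contains NA but no TRUE, and if() rejects NA.)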
The fluor emissions are stored in a large datatable (~1 million rows) called emissions.GN. Note that only the date.time (POSIXct) variable is relevant to my problem.
example of emissions.GN:
date.time fluor hall period dt
1: 2016-01-01 00:17:04 0.3044254 GN [2016-01-01,2016-02-21] -16.07373
2: 2016-01-01 00:17:04 0.4368381 GN [2016-01-01,2016-02-21] -16.07373
3: 2016-01-01 00:18:04 0.5655382 GN [2016-01-01,2016-02-21] -16.07395
4: 2016-01-01 00:19:04 0.6542259 GN [2016-01-01,2016-02-21] -16.07417
5: 2016-01-01 00:21:04 0.6579384 GN [2016-01-01,2016-02-21] -16.07462
The data of the three events is stored in three smaller datatables (~20 thousand records) contained in a list called events.GN. Note that only the dat (POSIXct) variable is relevant to my problem.
example of AC events (CO and MT are analogous):
events.GN[["AC"]]
dat hall numevt txtevt
1: 2016-01-01 00:04:54 GN 321 PHASE 1 CHANGEMENT D'ANODE (Position anode #1I)
2: 2016-01-01 00:09:21 GN 321 PHASE 1 CHANGEMENT D'ANODE (Position anode #1I)
3: 2016-01-01 00:38:53 GN 321 PHASE 1 CHANGEMENT D'ANODE (Position anode #1I)
4: 2016-01-01 02:30:33 GN 321 PHASE 1 CHANGEMENT D'ANODE (Position anode #1I)
5: 2016-01-01 02:34:11 GN 321 PHASE 1 CHANGEMENT D'ANODE (Position anode #1I)
The function
I have written a function that applies foverlaps on a given (large) x datatable and a given (small) y datatable. The function returns a datatable with two columns. The first column yid contains the indices of emissions.GN observations that overlap at least once with an event. The second column N contains the overlap count (i.e. the number of times an overlap occurs for that particular index). The indices of emissions that have zero overlaps are omitted from the result.
# A function to compute the number of times an emission record falls
# between the defined starting point and end point of an event.
# Requires data.table, lubridate (minutes) and magrittr (%>%).
find_index_and_count <- function(hall, event, lower.margin = 10, upper.margin = 30){
  # Define start and stop variables of the large emission dataset hall
  # to be equal, i.e. each record is a single time point, not an interval.
  hall$start <- hall$date.time
  hall$stop  <- hall$date.time
  # Define the start and stop variables of the small event datatables
  # using the defined margins of 10 and 30 minutes respectively.
  event$start <- event$dat - minutes(lower.margin)
  event$stop  <- event$dat + minutes(upper.margin)
  # Set the key of both datasets to be start and stop.
  setkey(hall, start, stop)
  setkey(event, start, stop)
  # Return the index of each emission record that falls N times within an
  # event time interval. The call to na.omit removes NAs introduced by
  # x records that don't fall within any y interval.
  foverlaps(event, hall, nomatch = NA, which = TRUE)[, .N, by = yid] %>% na.omit
}
The function executes successfully for the events AC and CO
The function gives the desired result as described above when called on the events AC and CO:
find_index_and_count(emissions.GN,events.GN[["AC"]])
yid N
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 2
---
find_index_and_count(emissions.GN,events.GN[["CO"]])
yid N
1: 3 1
2: 4 1
3: 5 1
4: 6 1
5: 7 1
---
The function returns an error when called on the MT event
The following function call results in the error below:
find_index_and_count(emissions.GN,events.GN[["MT"]])
Error in if (any(x[[xintervals[2L]]] - x[[xintervals[1L]]] < 0L)) stop("All entries in column ", :
  missing value where TRUE/FALSE needed
5. foverlaps(event, hall, nomatch = NA, which = TRUE)
4. eval(lhs, parent, parent)
3. eval(lhs, parent, parent)
2. foverlaps(event, hall, nomatch = NA, which = TRUE)[, .N, by = yid] %>% na.omit
1. find_index_and_count(emissions.GN, events.GN[["MT"]])
I assume the function returns an NA whenever a record in x (emissions.GN) has no overlap with any of the events in y (events.GN[["AC"]] etc.).
I don't understand why the function fails on the event MT when it works just fine for AC and CO. The data have exactly the same structure, with only the values and the number of records differing slightly.
What I have tried so far
Firstly, In the similar problem linked above, someone pointed out the following idea:
This often indicates an NA value being fed to the any function, so it returns NA and that's not a legal logical value. – Carl Witthoft May 7 '15 at 13:50
Hence, I modified the call to foverlaps to return 0 instead of NA whenever no overlap between x and y is found, like this:
foverlaps(event,hall,nomatch = 0, which = TRUE)[, .N, by=yid] %>% na.omit
This did not change anything (the function works for AC and CO but fails for MT).
Secondly, I made absolutely sure that none of my datatables contained NA's.
More information
If required I can provide the SQL code that generates the emissions.GN data and all the events.GN data. Note that because all the events.GN data has the same origin, there should be no differences (other than the values) between the data of the events AC, CO and MT.
If anything else is required, please do feel free to ask !
I'm trying to count how many times fluor emissions (measured every 1 minute) overlap with a given event. An emission is said to overlap with a given event when the emission time is 10 minutes before or 30 minutes after the time of the event.
Just addressing this objective (since I don't know foverlaps well)...
event.GN[, n :=
  # for each event row (.EACHI), count emissions whose date.time falls
  # in the window from 10 minutes before to 30 minutes after the event
  emissions.GN[.SD[, .(d_dn = dat - 10*60, d_up = dat + 30*60)],
               on = .(date.time >= d_dn, date.time <= d_up),
               .N,
               by = .EACHI]$N
]
dat n
1: 2016-01-01 00:00:00 31
2: 2016-01-01 00:15:00 41
3: 2016-01-01 00:30:00 41
4: 2016-01-01 00:45:00 41
5: 2016-01-01 01:00:00 41
---
26379: 2016-10-01 18:30:00 41
26380: 2016-10-01 18:45:00 41
26381: 2016-10-01 19:00:00 41
26382: 2016-10-01 19:15:00 41
26383: 2016-10-01 19:30:00 41
To check/verify one of these counts...
> # dat from 99th event...
> my_d <- event.GN[99, {print(.SD); dat}]
dat n
1: 2016-01-02 00:30:00 41
>
> # subsetting to overlapping emissions
> emissions.GN[date.time %between% (my_d + c(-10*60, 30*60))]
date.time
1: 2016-01-02 00:20:00
2: 2016-01-02 00:21:00
3: 2016-01-02 00:22:00
4: 2016-01-02 00:23:00
5: 2016-01-02 00:24:00
6: 2016-01-02 00:25:00
7: 2016-01-02 00:26:00
8: 2016-01-02 00:27:00
9: 2016-01-02 00:28:00
10: 2016-01-02 00:29:00
11: 2016-01-02 00:30:00
12: 2016-01-02 00:31:00
13: 2016-01-02 00:32:00
14: 2016-01-02 00:33:00
15: 2016-01-02 00:34:00
16: 2016-01-02 00:35:00
17: 2016-01-02 00:36:00
18: 2016-01-02 00:37:00
19: 2016-01-02 00:38:00
20: 2016-01-02 00:39:00
21: 2016-01-02 00:40:00
22: 2016-01-02 00:41:00
23: 2016-01-02 00:42:00
24: 2016-01-02 00:43:00
25: 2016-01-02 00:44:00
26: 2016-01-02 00:45:00
27: 2016-01-02 00:46:00
28: 2016-01-02 00:47:00
29: 2016-01-02 00:48:00
30: 2016-01-02 00:49:00
31: 2016-01-02 00:50:00
32: 2016-01-02 00:51:00
33: 2016-01-02 00:52:00
34: 2016-01-02 00:53:00
35: 2016-01-02 00:54:00
36: 2016-01-02 00:55:00
37: 2016-01-02 00:56:00
38: 2016-01-02 00:57:00
39: 2016-01-02 00:58:00
40: 2016-01-02 00:59:00
41: 2016-01-02 01:00:00
date.time

Create a time interval of 15 minutes from minutely data in R?

I have some data which is formatted in the following way:
time count
00:00 17
00:01 62
00:02 41
So I have from 00:00 to 23:59hours and with a counter per minute. I'd like to group the data in intervals of 15 minutes such that:
time count
00:00-00:15 148
00:16-00:30 284
I have tried to do it manually but this is exhausting, so I am sure there has to be a function or something to do it easily, but I haven't yet figured out how.
I'd really appreciate some help!!
Thank you very much!
For data that's in POSIXct format, you can use the cut function to create 15-minute groupings, and then aggregate by those groups. The code below shows how to do this in base R and with the dplyr and data.table packages.
First, create some fake data:
set.seed(4984)
dat = data.frame(time = seq(as.POSIXct("2016-05-01"), as.POSIXct("2016-05-01") + 60*99, by = 60),
                 count = sample(1:50, 100, replace = TRUE))
Base R
cut the data into 15 minute groups:
dat$by15 = cut(dat$time, breaks="15 min")
time count by15
1 2016-05-01 00:00:00 22 2016-05-01 00:00:00
2 2016-05-01 00:01:00 11 2016-05-01 00:00:00
3 2016-05-01 00:02:00 31 2016-05-01 00:00:00
...
98 2016-05-01 01:37:00 20 2016-05-01 01:30:00
99 2016-05-01 01:38:00 29 2016-05-01 01:30:00
100 2016-05-01 01:39:00 37 2016-05-01 01:30:00
Now aggregate by the new grouping column, using sum as the aggregation function:
dat.summary = aggregate(count ~ by15, FUN=sum, data=dat)
by15 count
1 2016-05-01 00:00:00 312
2 2016-05-01 00:15:00 395
3 2016-05-01 00:30:00 341
4 2016-05-01 00:45:00 318
5 2016-05-01 01:00:00 349
6 2016-05-01 01:15:00 397
7 2016-05-01 01:30:00 341
dplyr
library(dplyr)
dat.summary = dat %>% group_by(by15 = cut(time, "15 min")) %>%
  summarise(count = sum(count))
data.table
library(data.table)
dat.summary = setDT(dat)[ , list(count=sum(count)), by=cut(time, "15 min")]
UPDATE: To answer the comment, for this case the end point of each grouping interval is as.POSIXct(as.character(dat$by15)) + 60*15 - 1. In other words, the endpoint of the grouping interval is 15 minutes minus one second from the start of the interval. We add 60*15 - 1 because POSIXct is denominated in seconds. The as.POSIXct(as.character(...)) is because cut returns a factor and this just converts it back to date-time so that we can do math on it.
If you want the end point to be the nearest minute before the next interval (instead of the nearest second), you could use as.POSIXct(as.character(dat$by15)) + 60*14.
If you don't know the break interval, for example, because you chose the number of breaks and let R pick the interval, you could find the number of seconds to add by doing max(unique(diff(as.POSIXct(as.character(dat$by15))))) - 1.
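For instance, to attach explicit interval start and end columns to the summary (a small sketch of the arithmetic just described, using the dat.summary built with the base R code above):
dat.summary$int_start <- as.POSIXct(as.character(dat.summary$by15))
dat.summary$int_end   <- dat.summary$int_start + 60*15 - 1   # last second of each interval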
The cut approach is handy but slow with large data frames. The following approach is approximately 1,000x faster than the cut approach (tested with 400k records).
# Function: Truncate (floor) POSIXct to time interval (specified in seconds)
# Author: Stephen McDaniel # PowerTrip Analytics
# Date : 2017MAY
# Copyright: (C) 2017 by Freakalytics, LLC
# License: MIT
floor_datetime <- function(date_var, floor_seconds = 60,
                           origin = "1970-01-01") { # defaults to minute rounding
  if(!is(date_var, "POSIXct")) stop("Please pass in a POSIXct variable")
  if(is.na(date_var)) return(as.POSIXct(NA)) else {
    return(as.POSIXct(floor(as.numeric(date_var) /
      (floor_seconds)) * (floor_seconds), origin = origin))
  }
}
Sample output:
test <- data.frame(good = as.POSIXct(Sys.time()),
                   bad1 = as.Date(Sys.time()),
                   bad2 = as.POSIXct(NA))
test$good_15 <- floor_datetime(test$good, 15 * 60)
test$bad1_15 <- floor_datetime(test$bad1, 15 * 60)
Error in floor_datetime(test$bad, 15 * 60) :
Please pass in a POSIXct variable
test$bad2_15 <- floor_datetime(test$bad2, 15 * 60)
test
good bad1 bad2 good_15 bad2_15
1 2017-05-06 13:55:34.48 2017-05-06 <NA> 2007-05-06 13:45:00 <NA>
You can do it in one line by using the trs function from the foqat package, like this:
df_15mins=trs(df, "15 mins")
Below is a repeatable example:
library(foqat)
head(aqi[,c(1,2)])
# Time NO
#1 2017-05-01 01:00:00 0.0376578
#2 2017-05-01 01:01:00 0.0341483
#3 2017-05-01 01:02:00 0.0310285
#4 2017-05-01 01:03:00 0.0357016
#5 2017-05-01 01:04:00 0.0337507
#6 2017-05-01 01:05:00 0.0238120
# mean over each 15-minute interval
aqi_15mins=trs(aqi[,c(1,2)], "15 mins")
head(aqi_15mins)
# Time NO
#1 2017-05-01 01:00:00 0.02736549
#2 2017-05-01 01:15:00 0.03244958
#3 2017-05-01 01:30:00 0.03743626
#4 2017-05-01 01:45:00 0.02769419
#5 2017-05-01 02:00:00 0.02901817
#6 2017-05-01 02:15:00 0.03439455
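If you'd rather not add a dependency, the same 15-minute means can be computed with the cut/aggregate approach from earlier in this thread (a sketch reusing this example's columns, so it assumes foqat's aqi data is loaded and Time is POSIXct):
aqi2 <- aqi[, c(1, 2)]
aqi2$by15 <- cut(aqi2$Time, breaks = "15 mins")
aggregate(NO ~ by15, FUN = mean, data = aqi2)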

Create a dataframe with columns of given Date and Time

I would like to create a dataframe with its first column as Date and second column as Time. The condition is that the time should increase in 30-minute intervals, with the date advancing accordingly. Later I will add other columns manually.
> df
Date Time
2012-01-01 00:00:00
2012-01-01 00:30:00
2012-01-01 01:00:00
2012-01-01 01:30:00
.......... ........
.......... ........
and so on...
EDIT
Can be done in another way as well.
A single column can be created with the given date and time and then separated later using tidyr or any other package.
> df
DateTime
2012-01-01 00:00:00
2012-01-01 00:30:00
2012-01-01 01:00:00
2012-01-01 01:30:00
..........
..........
and so on...
Any help will be appreciated. Thank you in advance.
You can generate a sequence using seq, specifying the start and end dates and the time interval:
df <- data.frame(DateTime = seq(as.POSIXct("2012-01-01"),
                                as.POSIXct("2012-02-01"),
                                by = (30*60)))
head(df)
DateTime
1 2012-01-01 00:00:00
2 2012-01-01 00:30:00
3 2012-01-01 01:00:00
4 2012-01-01 01:30:00
5 2012-01-01 02:00:00
6 2012-01-01 02:30:00
And to get them in two separate columns we can use ?strftime
date_seq <- seq(as.POSIXct("2012-01-01"),
                as.POSIXct("2012-02-01"),
                by = (30*60))

df <- data.frame(Date = strftime(date_seq, format = "%Y-%m-%d"),
                 Time = strftime(date_seq, format = "%H:%M:%S"))
Date Time
1 2012-01-01 00:00:00
2 2012-01-01 00:30:00
3 2012-01-01 01:00:00
4 2012-01-01 01:30:00
5 2012-01-01 02:00:00
6 2012-01-01 02:30:00
Update
You can include the time part of the POSIXct datetime too. This will give you finer control over your upper & lower bounds:
date_seq <- seq(as.POSIXct("2012-01-01 00:00:00"),
                as.POSIXct("2012-02-02 23:30:00"),
                by = (30*60))

df <- data.frame(Date = strftime(date_seq, format = "%Y-%m-%d"),
                 Time = strftime(date_seq, format = "%H:%M:%S"))
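And for the EDIT above (building a single DateTime column and splitting it afterwards), a minimal sketch with tidyr::separate, assuming the one-column frame built earlier:
library(tidyr)

df <- data.frame(DateTime = seq(as.POSIXct("2012-01-01"),
                                as.POSIXct("2012-02-01"),
                                by = (30*60)))
# separate() splits character data, so format the POSIXct column first
df$DateTime <- format(df$DateTime, "%Y-%m-%d %H:%M:%S")
df_split <- separate(df, DateTime, into = c("Date", "Time"), sep = " ")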

Alter values in one data frame based on comparison values in another in R

I am trying to subtract one hour from date/times within a POSIXct column that are earlier than or equal to a time stated in a different comparison dataframe for that particular ID.
For example:
#create sample data
# create sample data
Time <- as.POSIXct(c("2015-10-02 08:00:00", "2015-11-02 11:00:00",
                     "2015-10-11 10:00:00", "2015-11-11 09:00:00",
                     "2015-10-24 08:00:00", "2015-10-27 08:00:00"),
                   format = "%Y-%m-%d %H:%M:%S")
ID <- c(01, 01, 02, 02, 03, 03)
data <- data.frame(Time, ID)
Which produces this:
Time ID
1 2015-10-02 08:00:00 1
2 2015-11-02 11:00:00 1
3 2015-10-11 10:00:00 2
4 2015-11-11 09:00:00 2
5 2015-10-24 08:00:00 3
6 2015-10-27 08:00:00 3
I then have another dataframe with a key date and time for each ID to compare against. The Time in data should be compared against Comparison in ComparisonData for the particular ID it is associated with. If the Time value in data is earlier than or equal to the comparison value, one hour should be subtracted from the value in data:
# create sample comparison data
Comparison <- as.POSIXct(c("2015-10-29 08:00:00", "2015-11-02 08:00:00",
                           "2015-10-26 08:30:00"),
                         format = "%Y-%m-%d %H:%M:%S")
ID <- c(01, 02, 03)
ComparisonData <- data.frame(Comparison, ID)
This should look like this:
Comparison ID
1 2015-10-29 08:00:00 1
2 2015-11-02 08:00:00 2
3 2015-10-26 08:30:00 3
In summary, the code should check all times of a certain ID to see if any are earlier than or equal to the value specified in ComparisonData and if they are, subtract one hour. This should give this data frame as an output:
Time ID
1 2015-10-02 07:00:00 1
2 2015-11-02 11:00:00 1
3 2015-10-11 09:00:00 2
4 2015-11-11 09:00:00 2
5 2015-10-24 07:00:00 3
6 2015-10-27 08:00:00 3
I have looked at similar solutions such as this one, but I cannot understand how to also check the times against the comparison value for that particular ID.
I think ddply seems quite promising, but I'm not sure how to use it for this particular problem.
Here's a quick and efficient solution using data.table. First we join the two data sets by ID, and then modify the Times which are earlier than or equal to Comparison:
library(data.table) # v1.9.6+
setDT(data)[ComparisonData, end := i.Comparison, on = "ID"]
data[Time <= end, Time := Time - 3600L][, end := NULL]
data
# Time ID
# 1: 2015-10-02 07:00:00 1
# 2: 2015-11-02 11:00:00 1
# 3: 2015-10-11 09:00:00 2
# 4: 2015-11-11 09:00:00 2
# 5: 2015-10-24 07:00:00 3
# 6: 2015-10-27 08:00:00 3
Alternatively, we could do this in one step while joining, using ifelse (not sure how efficient this is, though):
setDT(data)[ComparisonData,
            Time := ifelse(Time <= i.Comparison,
                           Time - 3600L, Time),
            on = "ID"]
data
# Time ID
# 1: 2015-10-02 07:00:00 1
# 2: 2015-11-02 11:00:00 1
# 3: 2015-10-11 09:00:00 2
# 4: 2015-11-11 09:00:00 2
# 5: 2015-10-24 07:00:00 3
# 6: 2015-10-27 08:00:00 3
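A side note, not from the original answer: base ifelse() drops attributes, so with some data.table versions the Time column may lose its POSIXct class in the one-step version. Assuming data.table >= 1.12.4 is available, fifelse() preserves the class:
setDT(data)[ComparisonData,
            # fifelse() keeps Time as POSIXct, unlike base ifelse()
            Time := fifelse(Time <= i.Comparison, Time - 3600L, Time),
            on = "ID"]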
I am sure there is going to be a better solution than this, however, I think this works.
for(i in 1:nrow(data)) {
  # look up the comparison time for this row's ID (column 1 of ComparisonData)
  if(data$Time[i] <= ComparisonData[data$ID[i], 1]){
    data$Time[i] <- data$Time[i] - 3600
  }
}
# Time ID
#1 2015-10-02 07:00:00 1
#2 2015-11-02 11:00:00 1
#3 2015-10-11 09:00:00 2
#4 2015-11-11 09:00:00 2
#5 2015-10-24 07:00:00 3
#6 2015-10-27 08:00:00 3
This is going to iterate through every row in data.
ComparisonData[data$ID[i], 1] gets the time column in ComparisonData for the corresponding ID (note this indexes by row number, which only works here because the IDs happen to match the row order). If the Time in data is earlier than or equal to this comparison time, it is reduced by 1 hour.
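For a vectorised variant that matches on ID explicitly instead of relying on row numbers equalling IDs, a small sketch:
# match each row's ID to the corresponding row of ComparisonData
idx <- match(data$ID, ComparisonData$ID)
shift <- data$Time <= ComparisonData$Comparison[idx]
data$Time[shift] <- data$Time[shift] - 3600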
