Algorithm to determine intervals in my data - r

I have data consisting of production months and relative frequencies: the x-axis shows the production month and the y-axis the relative frequency. The series shows increases and decreases over time, and my goal is to split these climbs and descents into intervals. There are several procedures for detecting such change points. I have looked into them and implemented a "hill climbing" algorithm. I do get intervals, but they are not great, so I would like to extend the algorithm to get better ones. I also tried packages such as strucchange (e.g. its breakpoints() function), but these always give me errors. Since I'm neither a computer scientist nor a mathematician, any advice would be great!
My code for hill climbing:
hillclimbing1 <- function(month, amount)
{
  # Place an interval boundary wherever the series jumps by more than
  # 30% relative to the last accepted value.
  res <- c()
  val <- amount[1]
  j <- 1
  for (i in seq_along(month))
  {
    if (abs(amount[i] - val) > abs(val * 0.3))
    {
      val <- amount[i]
      res[j] <- i - 0.5  # boundary between observation i-1 and i
      j <- j + 1
    }
  }
  return(res)
}
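For reference, the same 30%-jump rule can be sketched in plain Python (an illustrative translation, not part of the original code; note that Python indices are 0-based, so the boundary values come out one less than in R):

```python
def hill_climbing(amounts, threshold=0.3):
    """Return candidate interval boundaries: a boundary is placed between
    elements i-1 and i whenever the value jumps by more than `threshold`
    (here 30%) relative to the last accepted value."""
    boundaries = []
    val = amounts[0]
    for i, amount in enumerate(amounts):
        if abs(amount - val) > abs(val * threshold):
            val = amount                 # accept the new level
            boundaries.append(i - 0.5)   # boundary between i-1 and i (0-based)
    return boundaries

print(hill_climbing([0.0, 1.0, 1.1, 3.0]))  # → [0.5, 2.5]
```

Note one quirk this shares with the R version: when the last accepted value is 0, the threshold is 0 as well, so any nonzero change triggers a boundary.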
My dataframe looks like this:
month amount
2012-07-01 0.0000000
2012-08-01 1.1111111
2012-09-01 0.2985075
2012-10-01 0.5141388
2012-11-01 0.0000000
2012-12-01 0.0000000
2013-01-01 0.6849315
2013-02-01 1.9762846
2013-03-01 1.1799410
2013-04-01 0.2881844
2013-05-01 0.2617801
2013-06-01 1.2285012
2013-07-01 1.2285012
2013-08-01 1.3539652
2013-09-01 1.6694491
2013-10-01 2.4000000
2013-11-01 2.5065963
2013-12-01 2.4869110
2014-01-01 2.0497804
2014-02-01 1.4044944
2014-03-01 3.9443155
2014-04-01 2.9748284
2014-05-01 3.0623020
2014-06-01 2.2044088
2014-07-01 2.9686175
2014-08-01 3.1304348
2014-09-01 3.9028621
2014-10-01 2.3942538
2014-11-01 2.9021559
2014-12-01 4.6280992
2015-01-01 3.8616251
2015-02-01 3.0252101
2015-03-01 3.7565740
2015-04-01 4.0977714
EDIT:
After using the min/max method I get the following plot:
Is there any way to get rid of intervals no. 2 and 3?


How do I prevent {data.table} foverlaps from feeding NA's into its any(...) call when executing on large data.tables?

First of all, a similar problem:
Foverlaps error: Error in if (any(x[[xintervals[2L]]] - x[[xintervals[1L]]] < 0L)) stop
The story
I'm trying to count how many times fluor emissions (measured every 1 minute) overlap with a given event. An emission is said to overlap with a given event when the emission time falls between 10 minutes before and 30 minutes after the time of the event. In total we consider three events: AC, CO and MT.
The data
Edit 1:
Here are two example datasets that allow the code below to run. The code runs fine for these sets; once I have data that generates the error I'll make a second edit. Note that event.GN in the example dataset below is a data.table instead of a list.
library(data.table)
library(lubridate)
library(magrittr)  # for %>%
emissions.GN <- data.table(date.time = seq(ymd_hms("2016-01-01 00:00:00"), by = "min", length.out = 1000000))
event.GN <- data.table(dat = seq(ymd_hms("2016-01-01 00:00:00"), by = "15 mins", length.out = 26383))
Edit 2:
I created a csv file containing the data event.GN that generates the error. The file has 26383 rows of one variable dat but only about 14000 are necessary to generate the error.
Edit 3:
Up until dat "2017-03-26 00:25:20" the function works fine. Right after adding the next record, with dat "2017-03-26 01:33:46", the error occurs. I noticed that there are more than 60 minutes between those points. This means that between those two event times, one or several emission records won't have corresponding events. This in turn generates NA's that somehow get caught up in the any() call of the foverlaps function. Am I looking in the right direction?
The fluor emissions are stored in a large datatable (~1 million rows) called emissions.GN. Note that only the date.time (POSIXct) variable is relevant to my problem.
example of emissions.GN:
date.time fluor hall period dt
1: 2016-01-01 00:17:04 0.3044254 GN [2016-01-01,2016-02-21] -16.07373
2: 2016-01-01 00:17:04 0.4368381 GN [2016-01-01,2016-02-21] -16.07373
3: 2016-01-01 00:18:04 0.5655382 GN [2016-01-01,2016-02-21] -16.07395
4: 2016-01-01 00:19:04 0.6542259 GN [2016-01-01,2016-02-21] -16.07417
5: 2016-01-01 00:21:04 0.6579384 GN [2016-01-01,2016-02-21] -16.07462
The data of the three events is stored in three smaller datatables (~20 thousand records) contained in a list called events.GN. Note that only the dat (POSIXct) variable is relevant to my problem.
example of AC events (CO and MT are analogous):
events.GN[["AC"]]
dat hall numevt txtevt
1: 2016-01-01 00:04:54 GN 321 PHASE 1 CHANGEMENT D'ANODE (Position anode #1I)
2: 2016-01-01 00:09:21 GN 321 PHASE 1 CHANGEMENT D'ANODE (Position anode #1I)
3: 2016-01-01 00:38:53 GN 321 PHASE 1 CHANGEMENT D'ANODE (Position anode #1I)
4: 2016-01-01 02:30:33 GN 321 PHASE 1 CHANGEMENT D'ANODE (Position anode #1I)
5: 2016-01-01 02:34:11 GN 321 PHASE 1 CHANGEMENT D'ANODE (Position anode #1I)
The function
I have written a function that applies foverlaps on a given (large) x datatable and a given (small) y datatable. The function returns a datatable with two columns. The first column yid contains the indices of emissions.GN observations that overlap at least once with an event. The second column N contains the overlap count (i.e. the number of times an overlap occurs for that particular index). The index of emissions that have zero overlaps are omitted from the result.
# A function to count how many times an emission record falls between the
# defined start and end points of an event.
find_index_and_count <- function(hall, event, lower.margin = 10, upper.margin = 30){
  # Make each emission record of the large dataset hall a zero-width interval,
  # i.e. each record is a single time point, not an interval.
  hall$start <- hall$date.time
  hall$stop <- hall$date.time
  # Expand each event of the small event data.table into an interval using the
  # defined margins of 10 and 30 minutes respectively.
  event$start <- event$dat - minutes(lower.margin)
  event$stop <- event$dat + minutes(upper.margin)
  # Key both data.tables on start and stop.
  setkey(hall, start, stop)
  setkey(event, start, stop)
  # Return the index of each emission record that falls N times within an event
  # time interval. The call to na.omit is necessary to remove NA's introduced by
  # x records that don't fall within any y interval.
  foverlaps(event, hall, nomatch = NA, which = TRUE)[, .N, by = yid] %>% na.omit
}
The function executes successfully for the events AC and CO
The function gives the desired result as described above when called on the events AC and CO:
find_index_and_count(emissions.GN,events.GN[["AC"]])
yid N
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 2
---
find_index_and_count(emissions.GN,events.GN[["CO"]])
yid N
1: 3 1
2: 4 1
3: 5 1
4: 6 1
5: 7 1
---
The function returns an error when called on the MT event
The following function call results in the error below:
find_index_and_count(emissions.GN,events.GN[["MT"]])
Error in if (any(x[[xintervals[2L]]] - x[[xintervals[1L]]] < 0L)) stop("All entries in column ", : missing value where TRUE/FALSE needed
5.foverlaps(event, hall, nomatch = NA, which = TRUE)
4.eval(lhs, parent, parent)
3.eval(lhs, parent, parent)
2.foverlaps(event, hall, nomatch = NA, which = TRUE)[, .N, by = yid] %>% na.omit
1.find_index_and_count(emissions.GN, events.GN[["MT"]])
I assume the function returns an NA whenever a record in x (emissions.GN) has no overlap with any of the events in y (events.GN[["AC"]] etc.).
I don't understand why the function fails on the event MT when it works just fine for AC and CO. The data are exactly the same except for the values and a slightly different number of records.
What I have tried so far
First, in the similar problem linked above, someone pointed out the following idea:
This often indicates an NA value being fed to the any function, so it returns NA and that's not a legal logical value. – Carl Witthoft May 7 '15 at 13:50
Hence, I modified the call to foverlaps to return 0 instead of NA whenever no overlap between x and y is found, like this:
foverlaps(event,hall,nomatch = 0, which = TRUE)[, .N, by=yid] %>% na.omit
This did not change anything (the function works for AC and CO but fails for MT).
Secondly, I made absolutely sure that none of my datatables contained NA's.
More information
If required I can provide the SQL code that generates the emissions.GN data and all the events.GN data. Note that because all the events.GN data has the same origin, there should be no differences (other than the values) between the data of the events AC, CO and MT.
If anything else is required, please do feel free to ask !
I'm trying to count how many times fluor emissions (measured every 1 minute) overlap with a given event. An emission is said to overlap with a given event when the emission time is 10 minutes before or 30 minutes after the time of the event.
Just addressing this objective (since I don't know foverlaps well.)...
event.GN[, n :=
  emissions.GN[.SD[, .(d_dn = dat - 10*60, d_up = dat + 30*60)],
               on = .(date.time >= d_dn, date.time <= d_up),
               .N, by = .EACHI]$N
]
dat n
1: 2016-01-01 00:00:00 31
2: 2016-01-01 00:15:00 41
3: 2016-01-01 00:30:00 41
4: 2016-01-01 00:45:00 41
5: 2016-01-01 01:00:00 41
---
26379: 2016-10-01 18:30:00 41
26380: 2016-10-01 18:45:00 41
26381: 2016-10-01 19:00:00 41
26382: 2016-10-01 19:15:00 41
26383: 2016-10-01 19:30:00 41
To check/verify one of these counts...
> # dat from 99th event...
> my_d <- event.GN[99, {print(.SD); dat}]
dat n
1: 2016-01-02 00:30:00 41
>
> # subsetting to overlapping emissions
> emissions.GN[date.time %between% (my_d + c(-10*60, 30*60))]
date.time
1: 2016-01-02 00:20:00
2: 2016-01-02 00:21:00
3: 2016-01-02 00:22:00
4: 2016-01-02 00:23:00
5: 2016-01-02 00:24:00
6: 2016-01-02 00:25:00
7: 2016-01-02 00:26:00
8: 2016-01-02 00:27:00
9: 2016-01-02 00:28:00
10: 2016-01-02 00:29:00
11: 2016-01-02 00:30:00
12: 2016-01-02 00:31:00
13: 2016-01-02 00:32:00
14: 2016-01-02 00:33:00
15: 2016-01-02 00:34:00
16: 2016-01-02 00:35:00
17: 2016-01-02 00:36:00
18: 2016-01-02 00:37:00
19: 2016-01-02 00:38:00
20: 2016-01-02 00:39:00
21: 2016-01-02 00:40:00
22: 2016-01-02 00:41:00
23: 2016-01-02 00:42:00
24: 2016-01-02 00:43:00
25: 2016-01-02 00:44:00
26: 2016-01-02 00:45:00
27: 2016-01-02 00:46:00
28: 2016-01-02 00:47:00
29: 2016-01-02 00:48:00
30: 2016-01-02 00:49:00
31: 2016-01-02 00:50:00
32: 2016-01-02 00:51:00
33: 2016-01-02 00:52:00
34: 2016-01-02 00:53:00
35: 2016-01-02 00:54:00
36: 2016-01-02 00:55:00
37: 2016-01-02 00:56:00
38: 2016-01-02 00:57:00
39: 2016-01-02 00:58:00
40: 2016-01-02 00:59:00
41: 2016-01-02 01:00:00
date.time
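For comparison, this window count (emissions within [event − 10 min, event + 30 min]) can also be sketched without data.table at all, using a binary search over the sorted emission times. This is an illustrative plain-Python sketch, not part of the original answer:

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta

def count_overlaps(emission_times, event_times,
                   lower=timedelta(minutes=10), upper=timedelta(minutes=30)):
    """For each event, count emissions falling in [event - lower, event + upper].
    emission_times must be sorted ascending."""
    return [bisect_right(emission_times, t + upper)
            - bisect_left(emission_times, t - lower)
            for t in event_times]

start = datetime(2016, 1, 1)
emissions = [start + timedelta(minutes=m) for m in range(120)]  # one per minute
events = [start + timedelta(minutes=15 * k) for k in range(3)]  # every 15 min
print(count_overlaps(emissions, events))  # → [31, 41, 41]
```

With one emission per minute and events every 15 minutes, the first event sees 31 emissions (its window starts before the data) and later ones see 41, matching the counts in the output above.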

Calculating 2 hourly average of data

I have flow data for a year. I want to get the 2 hourly averages of the data and make a timeseries that records the average flow for the two hours along with the timestamp.
The data look like this:
2005-01-01 00:00:00 18
2005-01-01 00:15:00 18
2005-01-01 00:30:00 18
2005-01-01 00:45:00 18
2005-01-01 01:00:00 18
2005-01-01 01:15:00 18
2005-01-01 01:30:00 18
2005-01-01 01:45:00 19
So at the end I would like something that looks like:
2005-01-01 00:00:00 18.125
This is what I'm doing right now:
for (i in seq(1, length(streamflow), 8)){
  streamflow2hr[i] <- mean(streamflow[i:(i + 7)])  # i:(i+7); ':' binds tighter than '+', so i:i+7 is a single value
}
valid2hr <- complete.cases(streamflow2hr)
validIndex <- which(valid2hr,arr.ind = TRUE)
streamflow2hrvalid <- streamflow2hr[validIndex]
streamflow2hrvalidTime <- streamflowDateTime[validIndex]
data2hr <- data.frame(streamflow2hrvalidTime,streamflow2hrvalid)
names(data2hr) <- c("DateTime","Flow")
But since I'm using relative positions it isn't consistent with the 2 hourly timestamp!
You can adjust this code for your needs:
# Generate a sample dataset
set.seed(1)
z <- as.POSIXct("2015-01-31 13:00:00") + 900*0:23
d <- data.frame(t=z,v=sample(length(z)))
d$cut <- cut(d$t,breaks="2 hours")
aggregate(v~cut,d,mean)
# cut v
#1 2015-01-31 13:00:00 12.875
#2 2015-01-31 15:00:00 12.125
#3 2015-01-31 17:00:00 12.500
This solution doesn't rely on 15-minute intervals between timestamps. Instead, it divides the time range into 2-hour intervals and uses them to calculate per-interval means.
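The same bucketing idea can be sketched in plain Python (illustrative only; note one difference in assumptions: cut(..., breaks = "2 hours") anchors buckets at the first timestamp, while this sketch anchors them at even hours from midnight):

```python
from collections import defaultdict
from datetime import datetime, timedelta

def two_hour_means(times, values):
    """Average `values` over 2-hour buckets (anchored at even hours from
    midnight), returning a dict keyed by bucket start time."""
    buckets = defaultdict(list)
    for t, v in zip(times, values):
        b = t.replace(minute=0, second=0, microsecond=0)
        b = b.replace(hour=b.hour - b.hour % 2)  # snap down to an even hour
        buckets[b].append(v)
    return {b: sum(vs) / len(vs) for b, vs in sorted(buckets.items())}

# Reproducing the asker's example: eight 15-minute flow readings
t0 = datetime(2005, 1, 1)
times = [t0 + timedelta(minutes=15 * i) for i in range(8)]
flows = [18, 18, 18, 18, 18, 18, 18, 19]
print(two_hour_means(times, flows))  # → {datetime(2005, 1, 1, 0, 0): 18.125}
```

Like the cut() solution, this does not assume regular 15-minute spacing: any timestamp simply lands in its 2-hour bucket.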

Pandas increment time series by one minute

My data is here.
I want to add a minute to values in STA_STD, to get a 5-minute regular time series, if the value in that column contains "23:59:00". Adding one minute should also change the date to the next day, 00:00 hours.
My code is here
dat = pd.read_csv("temp.csv")
if dat['STA_STD'].str.contains("23:59:00"):
    dat['STA_STD_NEW'] = pd.to_datetime(
        dat[dat['STA_STD'].str.contains("23:59:00")]['STA_STD']
    ) + datetime.timedelta(minutes=1)
else:
    dat['STA_STD_NEW'] = dat['STA_STD']
And this gives me below error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Pandas documentation here talks about the same error.
What is the procedure to iterate through all rows and increment value by one minute, if value contains "23:59:00"?
Please advise.
Two things:
You can't use if/else like that to evaluate multiple values at the same time (you would need to iterate over the values and apply the if/else to each one separately). But using boolean indexing is much better in this case.
str.contains does not work with datetimes. But you can e.g. check whether the time part of the datetime values equals datetime.time(23, 59).
A small example:
In [2]: dat = pd.DataFrame({'STA_STD':pd.date_range('2012-01-01 23:50', periods=10, freq='1min')})
In [3]: dat['STA_STD_NEW'] = dat['STA_STD']
In [4]: dat.loc[dat['STA_STD'].dt.time == datetime.time(23,59), 'STA_STD_NEW'] += datetime.timedelta(minutes=1)
In [5]: dat
Out[5]:
STA_STD STA_STD_NEW
0 2012-01-01 23:50:00 2012-01-01 23:50:00
1 2012-01-01 23:51:00 2012-01-01 23:51:00
2 2012-01-01 23:52:00 2012-01-01 23:52:00
3 2012-01-01 23:53:00 2012-01-01 23:53:00
4 2012-01-01 23:54:00 2012-01-01 23:54:00
5 2012-01-01 23:55:00 2012-01-01 23:55:00
6 2012-01-01 23:56:00 2012-01-01 23:56:00
7 2012-01-01 23:57:00 2012-01-01 23:57:00
8 2012-01-01 23:58:00 2012-01-01 23:58:00
9 2012-01-01 23:59:00 2012-01-02 00:00:00 <-- increment of 1 minute
Using the dt.time approach needs pandas >= 0.15
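The underlying increment rule itself is easy to state in plain Python, independent of pandas (illustrative sketch, with a hypothetical helper name):

```python
from datetime import datetime, time, timedelta

def bump_2359(ts):
    """If the timestamp's time-of-day is 23:59, add one minute;
    the addition rolls the date over to the next day's 00:00."""
    if ts.time() == time(23, 59):
        return ts + timedelta(minutes=1)
    return ts

print(bump_2359(datetime(2012, 1, 1, 23, 59)))  # rolls to 2012-01-02 00:00
print(bump_2359(datetime(2012, 1, 1, 23, 58)))  # unchanged
```

The vectorized pandas expression above does exactly this for all matching rows at once, which is why no explicit loop or if/else is needed.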

Produce weekly average plots from large dataset in R

I am quite new to R and have been struggling with converting my data; I could use some much-needed help.
I have a dataframe of approximately 70,000 × 2. The data cover a whole year (52 weeks / 365 days). A portion of it looks like this:
Create.Date.Time Ticket.ID
1 2013-06-01 12:59:00 INCIDENT684790
2 2013-06-02 07:56:00 SERVICE684793
3 2013-06-02 09:39:00 SERVICE684794
4 2013-06-02 14:14:00 SERVICE684796
5 2013-06-02 17:20:00 SERVICE684797
6 2013-06-03 07:20:00 SERVICE684799
7 2013-06-03 08:02:00 SERVICE684839
8 2013-06-03 08:04:00 SERVICE684841
9 2013-06-03 08:04:00 SERVICE684842
10 2013-06-03 08:08:00 SERVICE684843
I am trying to get the number of tickets in every hour of the week (that is, hours 1 to 168) for each week. Hour 1 starts on Monday at 00:00, and hour 168 is Sunday 23:00-23:59. This repeats for each week. I want to use the Create.Date.Time data to calculate the hour of the week each ticket falls in, e.g. for:
2013-06-01 12:59:00 INCIDENT684790 - hour 133,
2013-06-03 08:08:00 SERVICE684843 - hour 9
I am then going to do averages for each hour and plot those. I am completely at a loss as to where to start. Could someone please point me to the right direction?
Before addressing the plotting aspect of your question, is this the format of data you are trying to get? This uses the lubridate package, which you might have to install (install.packages("lubridate", dependencies = TRUE)).
library(lubridate)
##
Events <- paste(
  sample(c("INCIDENT","SERVICE"), 20000, replace = TRUE),
  sample(600000:900000, 20000)
)
t0 <- as.POSIXct(
  "2013-01-01 00:00:00",
  format = "%Y-%m-%d %H:%M:%S",
  tz = "America/New_York")
Dates <- sort(t0 + sample(0:(3600*24*365 - 1), 20000))
Weeks <- week(Dates)
wDay <- wday(Dates, label = TRUE)
Hour <- hour(Dates)
##
hourShift <- function(time, wday){
  # Day-of-week offset in hours (Monday = 0 hours, Sunday = 24*6 hours).
  hShift <- sapply(wday, function(X){
    if (X == "Mon") {
      0
    } else if (X == "Tues") {
      24*1
    } else if (X == "Wed") {
      24*2
    } else if (X == "Thurs") {
      24*3
    } else if (X == "Fri") {
      24*4
    } else if (X == "Sat") {
      24*5
    } else {
      24*6
    }
  })
  ##
  tOut <- hour(time) + hShift + 1
  return(tOut)
}
##
weekHour <- hourShift(time = Dates, wday = wDay)
##
Data <- data.frame(
  Event = Events,
  Timestamp = Dates,
  Week = Weeks,
  wDay = wDay,
  dayHour = Hour,
  weekHour = weekHour,
  stringsAsFactors = FALSE)
##
This gives you:
> head(Data)
Event Timestamp Week wDay dayHour weekHour
1 SERVICE 783405 2013-01-01 00:13:55 1 Tues 0 25
2 INCIDENT 860015 2013-01-01 01:06:41 1 Tues 1 26
3 INCIDENT 808309 2013-01-01 01:10:05 1 Tues 1 26
4 INCIDENT 835509 2013-01-01 01:21:44 1 Tues 1 26
5 SERVICE 769239 2013-01-01 02:04:59 1 Tues 2 27
6 SERVICE 762269 2013-01-01 02:07:41 1 Tues 2 27
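The hour-of-week arithmetic itself is compact. As an illustrative cross-check in plain Python (where datetime.weekday() already gives Monday = 0, so no day-name lookup table is needed):

```python
from datetime import datetime

def hour_of_week(ts):
    """1-based hour of the week: Monday 00:00-00:59 is hour 1,
    Sunday 23:00-23:59 is hour 168 (weekday(): Monday == 0)."""
    return ts.weekday() * 24 + ts.hour + 1

print(hour_of_week(datetime(2013, 6, 1, 12, 59)))  # Saturday afternoon → 133
print(hour_of_week(datetime(2013, 6, 3, 8, 8)))    # Monday morning → 9
```

These two values match the asker's worked examples (hour 133 and hour 9), which suggests the shift-plus-one convention above is the intended one.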

Calculating elapsed time for different interview dates in R

So my data looks like this
dat<-data.frame(
subjid=c("a","a","a","b","b","c","c","d","e"),
type=c("baseline","first","second","baseline","first","baseline","first","baseline","baseline"),
date=c("2013-02-07","2013-02-27","2013-04-30","2013-03-03","2013-05-23","2013-01-02","2013-07-23","2013-03-29","2013-06-03"))
i.e.:
subjid type date
1 a baseline 2013-02-07
2 a first 2013-02-27
3 a second 2013-04-30
4 b baseline 2013-03-03
5 b first 2013-05-23
6 c baseline 2013-01-02
7 c first 2013-07-23
8 d baseline 2013-03-29
9 e baseline 2013-06-03
and I'm trying to make a variable "elapsedtime" that denotes the time elapsed from the baseline date to first and second round interview dates (so that elapsedtime=0 for baselines). Note that it varies individually whether they have taken further interviews.
I tried to reshape the data so that I could subtract the dates, but my brain isn't really functioning today. Or is there another way?
Please help and thank you.
Screaming out for ave:
I'll throw an NA value in there just for good measure:
dat <- data.frame(
  subjid=c("a","a","a","b","b","c","c","d","e"),
  type=c("baseline","first","second","baseline","first","baseline","first","baseline","baseline"),
  date=c("2013-02-07",NA,"2013-04-30","2013-03-03","2013-05-23","2013-01-02","2013-07-23","2013-03-29","2013-06-03"))
And you should probably sort the data to be on the safe side:
dat$type <- ordered(dat$type,levels=c("baseline","first","second","third") )
dat <- dat[order(dat$subjid,dat$type),]
Turn your date into a proper Date object:
dat$date <- as.Date(dat$date)
Then calculate the differences:
dat$elapsed <- ave(as.numeric(dat$date),dat$subjid,FUN=function(x) x-x[1] )
# subjid type date elapsed
#1 a baseline 2013-02-07 0
#2 a first <NA> NA
#3 a second 2013-04-30 82
#4 b baseline 2013-03-03 0
#5 b first 2013-05-23 81
#6 c baseline 2013-01-02 0
#7 c first 2013-07-23 202
#8 d baseline 2013-03-29 0
#9 e baseline 2013-06-03 0
This makes no assumptions that baseline is the always at position 1:
dat$date <- as.Date(dat$date)
dat$elapsed <- unlist(by(dat, dat$subjid, FUN=function(x) {
  as.numeric(x$date - x[x$type=="baseline",]$date)
}))
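The per-subject difference can also be sketched in plain Python (illustrative only; like the ave() answer, it assumes records are sorted with the baseline first in each subject's group):

```python
from datetime import date
from itertools import groupby

def elapsed_days(records):
    """records: list of (subjid, type, date) tuples, sorted by subject with the
    baseline visit first in each group. Returns days since that subject's
    baseline; None propagates for missing dates."""
    out = []
    for _, grp in groupby(records, key=lambda r: r[0]):
        grp = list(grp)
        base = grp[0][2]  # baseline date for this subject
        for _, _, d in grp:
            out.append(None if d is None else (d - base).days)
    return out

recs = [("a", "baseline", date(2013, 2, 7)),
        ("a", "first", None),
        ("a", "second", date(2013, 4, 30)),
        ("b", "baseline", date(2013, 3, 3)),
        ("b", "first", date(2013, 5, 23))]
print(elapsed_days(recs))  # → [0, None, 82, 0, 81]
```

The 82 and 81 agree with the elapsed values in the output table above.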
