Quantifying logged non-activity in R -- overlapping logged events

Update: This seems to be well described in SQL forums -- how to account for the gaps between time ranges (many of which overlap). So I may have to turn to SQL to solve this quickly, but I'm surprised it cannot be done in R. It would appear that the object returned by interval gets almost all the way there, but outside of a slow loop it seems difficult to apply in a vector-wide analysis. Please do let me know if you have any ideas; here's a description of the problem and its solution in SQL:
https://www.simple-talk.com/sql/t-sql-programming/calculating-gaps-between-overlapping-time-intervals-in-sql/
What I'd like to do is come up with a list of non-activity periods from a log, and then filter it down to show only gaps of at least a minimum length.
date-time in      date-time out
1/17/2012 0:15    1/17/2012 0:31
1/20/2012 0:21    1/20/2012 0:22
1/15/2013 1:08    1/15/2013 1:10
1/15/2013 1:08    1/15/2013 1:10
1/15/2013 7:39    1/15/2013 7:41
1/15/2013 7:39    1/15/2013 7:41
1/16/2013 1:11    1/16/2013 1:15
1/16/2013 1:11    1/16/2013 1:15
I was going to just lag the end times into the start row and compute the difference, but then it became clear there were overlapping activities. I also tried "price is right" type matching to get the closest end time... except, of course, if things are going on simultaneously, this doesn't guarantee there's no activity from a still-unfinished simultaneous task.
I currently have date-time in and date-time out columns. I am hoping there is a better idea than taking the many millions of entries and using seq.POSIXt to write out every individual minute that has activity; even that doesn't seem very workable. But it would seem there should be some easy way to identify gaps of a minimum size, whether 5 minutes or 30. Any suggestions?

Assuming that 1/17/2012 00:15 is the first value in your data set, I would convert your data into two columns, each containing the number of minutes since that timestamp,
i.e. using the first three rows of your data as an example:
 start | end
-------+-------
     0 |     16
  4326 |   4327
524213 | 524215
   ... |    ...
Subtracting the first column from the second tells you the minutes during which activity occurred; you can then simply invert this to get your non-activity.
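For what it's worth, here is a minimal vectorized sketch of an alternative that avoids enumerating every minute (my own suggestion, not part of the answer above; it assumes a data frame called events with POSIXct columns start and end, names of my choosing): sort by start time, track the running maximum of the end times, and report a gap wherever the next start lies beyond everything seen so far.
find_gaps <- function(events, min_gap = 5) {
  events <- events[order(events$start), ]
  # running maximum of the end times seen so far, in seconds since the epoch
  roll_end <- cummax(as.numeric(events$end))
  # a gap opens wherever the next start lies beyond every earlier end
  gap_from <- roll_end[-nrow(events)]
  gap_to   <- as.numeric(events$start[-1])
  gap_min  <- (gap_to - gap_from) / 60          # gap length in minutes
  keep <- gap_min >= min_gap
  data.frame(
    from    = as.POSIXct(gap_from[keep], origin = "1970-01-01"),  # local time zone
    to      = events$start[-1][keep],
    minutes = gap_min[keep]
  )
}
find_gaps(events, min_gap = 30)   # all non-activity gaps of at least 30 minutes
Because the running maximum accounts for still-open overlapping events, this handles the simultaneous-task case the question worries about.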

Related

Sqlite remove duplicates within specific time range

I know there are many questions about removing duplicates in SQL. However, my case is slightly more complicated.
These are data with a Barcode that repeats over a month, so entries with the same Barcode are expected. However, it turns out that, possibly due to a machine bug, the same data are recorded 2 to 3 times within a 4-5 minute timeframe. It does not happen for every entry, but it happens rather frequently.
Allow me to demonstrate with a sample table which contains the same Barcode "A00000":
Barcode No Date A B C D
A00000 1499456 10/10/2019 3:28 607 94 1743 72D
A00000 1803564 10/20/2019 22:09 589 75 1677 14D
A00000 1803666 10/20/2019 22:13 589 75 1677 14D
A00000 1803751 10/20/2019 22:17 589 75 1677 14D
A00000 2084561 10/30/2019 12:22 583 86 1677 14D
A00000 2383742 11/9/2019 23:18 594 81 1650 07D
As you can see, the entries on 10/20 contain identical data; they are duplicates that should be removed so that only one of them remains (any of them is fine, and the exact time is not the main concern). The "No" column is a purely arbitrary number which can be safely disregarded. The other entries should remain as they are.
I know this should be done using "GROUP BY", but I am struggling with how to write the conditions. I have also tried INNER JOINing the table with itself and then removing the selected results:
T2.A = T2.B AND
T2.[Date] > T1.[Date] AND
strftime('%s',T2.[Date]) - strftime('%s',T1.[Date]) < 600
The results still seem a bit off, as some of the entries are selected twice and some are not selected at all. I am still not used to the SQL style of thinking. Any help is appreciated.
The format of the Date column complicates things a bit, but otherwise the solution is basically to use GROUP BY in the normal way. In the following, I've assumed the name of the table is test:
WITH sane AS (
  SELECT *,
         -- the portion of Date before the space, i.e. the calendar day
         substr(Date, 1, instr(Date, ' ') - 1) AS day
  FROM test
)
SELECT Barcode, max(No), Date, A, B, C, D
FROM sane
GROUP BY Barcode, day;
The use of max() is perhaps unnecessary, but it makes the choice of the surviving row deterministic, which might be helpful.

Building a model in R based on historic data

I'm working with daily data frames for one month; the main variables for each data frame are like this:
Date_Heure Fonction Presence
2015-09-02 08:01:28 Acce 1
2015-09-02 08:15:56 check-out 0
2015-09-02 08:16:23 Alarme 0
The idea is to learn, over 15 days, the habits of the owner in his home: the rate of his presence in each time slot, and when he activates the home alarm.
After building this history, we want to predict, for the next day (the 16th day), when he will activate his alarm, based on the information we calculated.
So basically the history should be transformed into a model, but I cannot figure out how to do this.
What I have in hand are my inputs (I suppose): the percentage of presence in the two half-hours before and after activating the alarm; and my output, normally, should be the time at which the alarm is activated. So what I have is like this:
Presence 1st Time slot Presence 2nd Time slot Date_Heure
0.87 0 2015-09-02 08:16:23
0.91 0 2015-09-03 08:19:02
0.85 0 2015-09-04 08:18:11
I have the mean of the activation hour and of the percentage of presence in the two time slots.
Every new day will be added to the history (so the history grows by one day every day, and the parameters will of course change: the mean, the max and the min of my indicators); it's like we are doing "statistical learning".
So if you have any ideas or any clue to help me get started, it would be helpful, because what I found when I searched is very vague to me, and I just need the right key to get going.
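One simple place to start (a hedged sketch of my own, not an answer from this thread; it assumes a data frame named habits with columns Presence1, Presence2 and a POSIXct Date_Heure as in the table above, the column names being my guesses) is to treat the activation hour as the response of a linear model, refit it on the growing history each day, and predict the next day:
# turn the activation timestamp into a numeric hour of day
habits$alarm_hour <- as.numeric(format(habits$Date_Heure, "%H")) +
                     as.numeric(format(habits$Date_Heure, "%M")) / 60
# learn from the first 15 days of history
fit <- lm(alarm_hour ~ Presence1 + Presence2, data = habits[1:15, ])
# predict the activation hour for day 16 from that day's presence rates
predict(fit, newdata = data.frame(Presence1 = 0.90, Presence2 = 0))
As each new day arrives, you would append it to habits and refit, which gives the expanding-window behaviour described above.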

Efficient chunking timeseries in R

I am working with a number of large time-series files of currency-pair pricing data in R. Files tend to be 100-300 MB in size, and I will generally be working with 3 files at a time. I am looking for a (much) more efficient way of working with the TIME column of these data.
My data begins like this:
PAIR TIME BID ASK
1 USD/JPY 2012-01-02 00:00:00.307 77.023 77.055
2 USD/JPY 2012-01-02 00:00:00.493 77.030 77.049
3 USD/JPY 2012-01-02 00:00:05.003 77.030 77.050
4 USD/JPY 2012-01-02 00:00:05.005 77.023 77.056
5 USD/JPY 2012-01-02 00:00:05.006 77.024 77.056
6 USD/JPY 2012-01-02 00:00:06.008 77.023 77.056
... ... ... ...
R has no problem understanding the TIME column. For instance,
USDJPY$TIME[2] - USDJPY$TIME[1]
Gives output
Time difference of 0.1860001 secs
Data are already organized into files by month. Unfortunately a month is still much too large a chunk. I would like to break the pricing data down by 'trading week'.
Forex trading occurs in continuous multi-day stretches, usually from Monday to Friday. Some trading holidays will suspend trading, and there will not be data on these days. The nature of trading scheduling is such that, if
USDJPY$TIME[t+1] - USDJPY$TIME[t]
... is greater than 12 hours, time t is the last time index for that week in USDJPY.
I have not found an acceptable way to break the data into trading weeks, by indices, or otherwise. All my attempts end up hanging. The USDJPY file contains ~1,900,000 rows.
One approach I tried:
for(i in 1:(length(USDJPY$TIME) - 1)){
  USDJPY.diff <- c(USDJPY.diff, USDJPY$TIME[i+1] - USDJPY$TIME[i])
}
takes far too long (I quit before it could finish)
I would think data.table should speed things up here quite a bit:
library(data.table)  # 1.9.5+
setDT(data)
# difference from the previous tick, explicitly in hours
data[, DIFF := as.numeric(difftime(TIME, shift(TIME, n = 1, type = "lag"), units = "hours"))]
Week number calculation (increment whenever the difference is greater than 12 hours; the is.na() guard stops the NA in the first row from propagating through the cumulative sum):
data[, Week.num := cumsum(!is.na(DIFF) & DIFF > 12)]
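From there, if you want one object per trading week, something like the following should work (a sketch; the by argument of split.data.table needs a reasonably recent data.table):
weekly <- split(data, by = "Week.num")   # list of data.tables, one per trading week
length(weekly)                           # number of trading weeks found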

Mismatching drawdown calculations

I would like to ask you to clarify the following question, which is extremely important to me, since a major part of my master's thesis relies on getting the calculations in the following example right.
I have a list of financial time series, which look like this (AUDUSD example):
Open High Low Last
1992-05-18 0.7571 0.7600 0.7565 0.7598
1992-05-19 0.7594 0.7595 0.7570 0.7573
1992-05-20 0.7569 0.7570 0.7548 0.7562
1992-05-21 0.7558 0.7590 0.7540 0.7570
1992-05-22 0.7574 0.7585 0.7555 0.7576
1992-05-25 0.7575 0.7598 0.7568 0.7582
From these data I calculate log returns for the column Last, obtaining something like this:
Last
1992-05-19 -0.0032957646
1992-05-20 -0.0014535847
1992-05-21 0.0010573620
1992-05-22 0.0007922884
Now I want to calculate the drawdowns in the time series presented above, which I achieve by using (from the package PerformanceAnalytics)
ddStats <- drawdownsStats(timeSeries(AUDUSDLgRetLast[,1], rownames(AUDUSDLgRetLast)))
which results in the following output (here are just the first 5 lines, but it returns every single drawdown, including one-day-long ones):
From Trough To Depth Length ToTrough Recovery
1 1996-12-03 2001-04-02 2007-07-13 -0.4298531511 2766 1127 1639
2 2008-07-16 2008-10-27 2011-04-08 -0.4003839141 713 74 639
3 2011-07-28 2014-01-24 2014-05-13 -0.2254426369 730 652 NA
4 1992-06-09 1993-10-04 1994-12-06 -0.1609854215 650 344 306
5 2007-07-26 2007-08-16 2007-09-28 -0.1037999707 47 16 31
Now, the problem is the following: the depth of the worst drawdown (according to the output above) is -0.4298, whereas if I do the calculation "by hand" I obtain
(AUDUSD[as.character(ddStats[1,1]),4]-AUDUSD[as.character(ddStats[1,2]),4])/(AUDUSD[as.character(ddStats[1,1]),4])
[1] 0.399373
To make things clearer, these are the two lines from the AUDUSD data frame for the From and Trough dates:
AUDUSD[as.character(ddStats[1,1]),]
Open High Low Last
1996-12-03 0.8161 0.8167 0.7845 0.7975
AUDUSD[as.character(ddStats[1,2]),]
Open High Low Last
2001-04-02 0.4858 0.4887 0.4773 0.479
Also, the other drawdown depths do not agree with the calculations "by hand". What am I missing? How come these two numbers, which should be the same, differ by such a substantial amount?
I have tried replicating the drawdown via:
cumsum(rets) - cummax(cumsum(rets))
where rets is the vector of your log returns.
For some reason, when I calculate drawdowns of, say, less than 20%, I get the same results as table.Drawdowns() and drawdownsStats(), but when there is a large drawdown, say over 35%, the maximum drawdown starts to diverge between the calculations. More specifically, table.Drawdowns() and drawdownsStats() are overstated (at least from what I noticed). I do not know why this is so, but perhaps it helps to attach a confidence interval to large drawdowns (those over 35%) using the standard error of the drawdown. I would use 0.4298531511/sqrt(1127), i.e. the maximum drawdown divided by the square root of the length to the trough. This yields +/- 0.01280437, i.e. a drawdown of between 0.4169956 and 0.4426044, and the lower end of 0.4169956 is much closer to your "by hand" calculation of 0.399373. Hope it helps.
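One more thing worth checking side by side (my own sketch, not part of the answer above; rets is the log-return vector as before): a drawdown computed on cumulative log returns is expressed in log units, while the "by hand" figure is a simple percentage drop in price, and the two are related through exp():
dd_log <- cumsum(rets) - cummax(cumsum(rets))   # drawdown path in log-return units
dd_pct <- exp(dd_log) - 1                       # the same path as simple percentage drops
min(dd_log)                                     # worst drawdown, log units
min(dd_pct)                                     # worst drawdown, percentage terms
Comparing min(dd_pct) with the hand-calculated price drop at least tells you whether the two figures are being quoted in the same units.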

Ensuring temporal data density in R

ISSUE ---------
I have thousands of time-series files (.csv) that contain intermittent data spanning between 20 and 50 years (see df). Each file contains the date_time and a metric (temperature). The data are hourly, and where no measurement exists there is an NA.
>df
date_time temp
01/05/1943 11:00 5.2
01/05/1943 12:00 5.2
01/05/1943 13:00 5.8
01/05/1943 14:00 NA
01/05/1943 15:00 NA
01/05/1943 16:00 5.8
01/05/1943 17:00 5.8
01/05/1943 18:00 6.3
I need to check these files to see if they have sufficient data density, i.e. that the ratio of NAs to data values is not too high. To do this I have 3 criteria that must be checked for each file:
Ensure that no more than 10% of the hours in a day are NA's
Ensure that no more than 10% of the days in a month are NA's
Ensure that there are 3 continuous years of data with valid days and months.
Each criterion must be fulfilled sequentially and if the file does not meet the requirements then I must create a data frame (or any list) of the files that do not meet the criteria.
QUESTION--------
I wanted to ask the community how to go about this. I have considered the value of nested if loops, along with using sqldf, plyr, aggregate or even dplyr. But I do not know the simplest way to achieve this. Any example code or suggestions would be very much appreciated.
I think this will work for you. These functions check, for every hour, the number of NAs in the following day, month, or 3-year period. They are not tested, because I didn't make up data to test them, but they should spit out the number of NAs in the respective time period. So for checkdays, if it returns a value greater than 2.4, then according to your 10% rule you'd have a problem; for months you're hoping for values less than 72, and for 3-year periods less than 2628. Again, please check these functions. By the way, the functions assume your NA data is in column 2. Cheers.
checkdays <- function(data){
  countNA = NULL
  for(i in 1:(length(data[,2]) - 23)){
    nadata = data[i:(i+23), 2]
    countNA[i] = length(nadata[is.na(nadata)])
  }
  return(countNA)
}
checkmonth <- function(data){
  countNA = NULL
  for(i in 1:(length(data[,2]) - 719)){
    nadata = data[i:(i+719), 2]
    countNA[i] = length(nadata[is.na(nadata)])
  }
  return(countNA)
}
check3years <- function(data){
  countNA = NULL
  for(i in 1:(length(data[,2]) - 26279)){
    nadata = data[i:(i+26279), 2]
    countNA[i] = length(nadata[is.na(nadata)])
  }
  return(countNA)
}
So I ended up testing these. They work for me. Here are system times for a dataset a year long. So I don't think you'll have problems.
> system.time(checkdays(RM_W1))
user system elapsed
0.38 0.00 0.37
> system.time(checkmonth(RM_W1))
user system elapsed
0.62 0.00 0.62
Optimization:
I took the time to run these functions with the data you posted above, and it wasn't good. For loops are dangerous because they work well for small data sets but slow down dramatically as datasets get larger if they're not constructed properly (here, mostly because growing countNA inside the loop forces repeated copying). I cannot report system times for the functions above with your data (they never finished), but I waited about 30 minutes. After reading this awesome post, Speed up the loop operation in R, I rewrote the functions to be much faster. By minimising the amount of work that happens in the loop and pre-allocating memory you can really speed things up. You need to call the function like checkdays(df[,2]), but it's faster this way.
checkdays <- function(data){
  countNA = numeric(length(data) - 23)
  for(i in 1:(length(data) - 23)){
    nadata = data[i:(i+23)]
    countNA[i] = length(nadata[is.na(nadata)])
  }
  return(countNA)
}
> system.time(checkdays(df[,2]))
user system elapsed
4.41 0.00 4.41
I believe this should be sufficient for your needs. As regards leap years, you should be able to modify the optimized function as I mentioned in the comments; just make sure you pass the leap-year data as a second dataset rather than as a second column.
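If the optimized loop is still too slow for the longest files, a rolling NA count can also be computed with no explicit loop at all (a sketch of my own, not part of the answer above; x is the temperature column and width is the window length in hours):
rolling_na <- function(x, width) {
  cs <- cumsum(c(0, is.na(x)))                       # cumulative count of NAs
  cs[(width + 1):length(cs)] - cs[1:(length(cs) - width)]
}
rolling_na(df$temp, 24)      # NAs in every 24-hour window, same result as checkdays()
rolling_na(df$temp, 26280)   # NAs in every 3-year (non-leap) window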
