Building a model in R based on historic data

I'm working with daily data frames for one month. The main variables of each data frame look like this:
Date_Heure Fonction Presence
2015-09-02 08:01:28 Acce 1
2015-09-02 08:15:56 check-out 0
2015-09-02 08:16:23 Alarme 0
The idea is to learn, over 15 days, the habits of the owner in his home: his rate of presence in each time slot, and when he activates the home alarm.
After building this history, we want to predict, for the next day (the 16th day), when he will activate his alarm based on the information we calculated.
So basically the history should be transformed into a MODEL, but I cannot figure out how to do this.
What I have in hand are my inputs (I suppose): the percentage of presence in the two half-hours before and after activating the alarm; and my output should be the time at which the alarm is activated. So what I have looks like this:
Presence 1st Time slot Presence 2nd Time slot Date_Heure
0.87 0 2015-09-02 08:16:23
0.91 0 2015-09-03 08:19:02
0.85 0 2015-09-04 08:18:11
I have the mean of the activation hour and of the percentage of presence in the two time slots.
Every new day is added to the history (to the model), so the history grows by one day each day and the parameters change accordingly (the mean, max and min of my indicators); it is like doing "statistical learning".
So if you have any ideas or clues to help me start, it would be very helpful, because what I found when I searched was very vague, and I just need the right key to work from.
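One hedged starting point, sketched below with invented column names (presence_slot1, presence_slot2, alarm_minutes): keep the per-day summary rows in a growing data frame and fit a simple regression of activation time on the presence rates, refitting as each new day is appended. With only a handful of days, a simple summary (mean activation time) may be all the data can support; the regression is just one way to make the "historic" into a model.

```r
# Growing "historic" data frame: one row per day, using the three days above.
historic <- data.frame(
  presence_slot1 = c(0.87, 0.91, 0.85),          # % presence in half-hour before
  presence_slot2 = c(0.00, 0.00, 0.00),          # % presence in half-hour after
  alarm_minutes  = c(8 * 60 + 16, 8 * 60 + 19, 8 * 60 + 18)  # time as minutes since midnight
)

# A minimal "model": linear regression of activation time on the first presence rate
# (the second slot is constant in this toy data, so it carries no information here)
fit <- lm(alarm_minutes ~ presence_slot1, data = historic)

# Predict tomorrow's activation time from tomorrow's expected presence rate
new_day <- data.frame(presence_slot1 = 0.88)
pred <- as.numeric(predict(fit, newdata = new_day))

# Each evening, append the observed day and refit, so the model's parameters
# (mean, min, max, coefficients) update daily
historic <- rbind(historic,
                  data.frame(presence_slot1 = 0.88, presence_slot2 = 0,
                             alarm_minutes = round(pred)))
```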

Related

Anomaly Detection - Correlated Variables

I am working on an 'anomaly' detection assignment in R. My dataset has around 30,000 records, of which around 200 are anomalous. It has around 30 columns, all quantitative. Some of the variables are highly correlated (~0.9). By anomaly I mean that some records have unusual (high/low) values for some column(s), while in others correlated variables do not behave as expected. The example below should give some idea.
Suppose vehicle speed and heart rate are highly positively correlated. Usually vehicle speed varies between 40 and 60, while heart rate varies between 55 and 70.
time_s steering vehicle.speed running.distance heart_rate
0 -0.011734953 40 0.251867414 58
0.01 -0.011734953 50 0.251936555 61
0.02 -0.011734953 60 0.252005577 62
0.03 -0.011734953 60 0.252074778 90
0.04 -0.011734953 40 0.252074778 65
Here we have two types of anomalies. The 4th record has an exceptionally high value for heart_rate, while the 5th record seems okay if we look at the columns individually. But since heart_rate increases with speed, we expected a lower heart rate for the 5th record, whereas we have a higher value.
I could identify the column-level anomalies using box plots etc., but find it hard to identify the second type. Somewhere I read about PCA-based anomaly detection, but I couldn't find its implementation in R.
Will you please help me with PCA-based anomaly detection in R for this scenario? My Google search mainly turned up time-series-related material, which is not what I am looking for.
Note: there is a similar implementation in Microsoft Azure Machine Learning, 'PCA Based Anomaly Detection for Credit Risk', which does the job, but I want to know the logic behind it and replicate it in R.
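The logic behind PCA-based detectors of this kind is, roughly, reconstruction error: project the data onto the leading principal components, reconstruct, and flag rows that the low-rank reconstruction fits badly; those are exactly the rows where correlated variables do not behave as expected. A hedged sketch on invented two-column data (the real dataset's 30 columns would slot in the same way):

```r
set.seed(1)
# toy data standing in for the real 30-column dataset
n <- 1000
speed      <- runif(n, 40, 60)
heart_rate <- 30 + 0.6 * speed + rnorm(n, sd = 2)  # correlated with speed
X <- scale(cbind(speed, heart_rate))               # PCA needs centred/scaled data

pca <- prcomp(X, center = FALSE, scale. = FALSE)
k   <- 1                                           # keep the dominant component
Xhat <- pca$x[, 1:k, drop = FALSE] %*% t(pca$rotation[, 1:k, drop = FALSE])

# anomaly score = squared reconstruction error per row; a point with speed 40
# but heart rate 65 scores high even though each column alone looks normal
score <- rowSums((X - Xhat)^2)

# flag, say, the top 1% as anomalies (the threshold is a tuning choice)
threshold <- quantile(score, 0.99)
anomalies <- which(score > threshold)
```

The choice of k (how many components to keep) and of the threshold are the two knobs; with 30 correlated columns, k would be chosen from the scree plot or the cumulative explained variance.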

Quantifying Logged Non-activity in R-- Overlapping logged events

Update: this seems to be very well described in SQL forums: how to account for the gaps between time ranges, many of which overlap. So I may have to turn to SQL to quickly solve this problem, but I'm surprised it cannot be done in R. The interval object seems to get almost all the way there, but outside of a slow loop it appears difficult to apply in a vectorised way. Please do let me know if you have any ideas; here is a description of the problem and its solution in SQL:
https://www.simple-talk.com/sql/t-sql-programming/calculating-gaps-between-overlapping-time-intervals-in-sql/
What I'd like to do is come up with a list of non-activity periods from a log, and then filter it down to show only periods above a minimum length.
1/17/2012 0:15 1/17/2012 0:31
1/20/2012 0:21 1/20/2012 0:22
1/15/2013 1:08 1/15/2013 1:10
1/15/2013 1:08 1/15/2013 1:10
1/15/2013 7:39 1/15/2013 7:41
1/15/2013 7:39 1/15/2013 7:41
1/16/2013 1:11 1/16/2013 1:15
1/16/2013 1:11 1/16/2013 1:15
I was going to just lag the end times into the start rows and compute the differences, but then it became clear there were overlapping activities. I also tried "price is right"-style matching to find the closest end time; except, of course, if things are going on simultaneously, this doesn't guarantee there is no activity from a still-unfinished simultaneous task.
I currently have date-time-in and date-time-out columns. I am hoping there is a better idea than taking the many millions of entries and using seq.POSIXt to write out every individual minute that has activity; even that doesn't seem very workable. It would seem there should be some easy way to identify gaps of a minimum size, whether 5 minutes or 30. Any suggestions?
Assuming that 1/17/2012 00:15 is the first value in your data set, I would convert your data into two columns, each containing the number of minutes since that time stamp,
i.e. using the first 3 rows of your data as an example:
_______|_______
0 | 16
4326 | 4327
524213 | 524215
... | ...
Subtracting these two columns from each other will tell you the minutes where activity occurred; you can then simply invert this to get your non-activity.
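The interval-merging step itself can be sketched in base R without a per-row loop: sort by start time, merge overlapping intervals using a running maximum of end times, and read the gaps off the merged blocks. The function and parameter names below (merge_gaps, min_gap_mins) are invented for illustration:

```r
merge_gaps <- function(start, end, min_gap_mins = 5) {
  o <- order(start)
  start <- start[o]; end <- end[o]
  # an interval starts a new block when it begins after every earlier end time
  run_end   <- cummax(as.numeric(end))
  new_block <- c(TRUE, as.numeric(start[-1]) > run_end[-length(run_end)])
  block     <- cumsum(new_block)
  merged_start <- tapply(as.numeric(start), block, min)
  merged_end   <- tapply(as.numeric(end),   block, max)
  # gaps run from the end of one merged block to the start of the next
  gap_start <- merged_end[-length(merged_end)]
  gap_end   <- merged_start[-1]
  gap_mins  <- (gap_end - gap_start) / 60
  data.frame(gap_start = as.POSIXct(gap_start, origin = "1970-01-01", tz = "UTC"),
             gap_end   = as.POSIXct(gap_end,   origin = "1970-01-01", tz = "UTC"),
             minutes   = gap_mins)[gap_mins >= min_gap_mins, ]
}

# first two log rows from the example above
log_in  <- as.POSIXct(c("2012-01-17 00:15", "2012-01-20 00:21"), tz = "UTC")
log_out <- as.POSIXct(c("2012-01-17 00:31", "2012-01-20 00:22"), tz = "UTC")
gaps <- merge_gaps(log_in, log_out)
```

Sorting plus cummax does the overlap resolution in one vectorised pass, so this should scale to millions of rows far better than a minute-by-minute expansion.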

Mismatching drawdown calculations

I would like to ask you to clarify the following question, which is of extreme importance to me, since a major part of my master's thesis relies on properly implementing the calculations in the following example.
I have a list of financial time series, which look like this (AUDUSD example):
Open High Low Last
1992-05-18 0.7571 0.7600 0.7565 0.7598
1992-05-19 0.7594 0.7595 0.7570 0.7573
1992-05-20 0.7569 0.7570 0.7548 0.7562
1992-05-21 0.7558 0.7590 0.7540 0.7570
1992-05-22 0.7574 0.7585 0.7555 0.7576
1992-05-25 0.7575 0.7598 0.7568 0.7582
From this data I calculate log returns for the column Last to obtain something like this:
Last
1992-05-19 -0.0032957646
1992-05-20 -0.0014535847
1992-05-21 0.0010573620
1992-05-22 0.0007922884
Now I want to calculate the drawdowns in the above presented time series, which I achieve by using (from package PerformanceAnalytics)
ddStats <- drawdownsStats(timeSeries(AUDUSDLgRetLast[,1], rownames(AUDUSDLgRetLast)))
which results in the following output (here are just the first 5 lines, but it returns every single drawdown, including one-day-long ones):
From Trough To Depth Length ToTrough Recovery
1 1996-12-03 2001-04-02 2007-07-13 -0.4298531511 2766 1127 1639
2 2008-07-16 2008-10-27 2011-04-08 -0.4003839141 713 74 639
3 2011-07-28 2014-01-24 2014-05-13 -0.2254426369 730 652 NA
4 1992-06-09 1993-10-04 1994-12-06 -0.1609854215 650 344 306
5 2007-07-26 2007-08-16 2007-09-28 -0.1037999707 47 16 31
Now, the problem is the following: the depth of the worst drawdown (according to the output above) is -0.4298, whereas if I do the following calculation "by hand" I obtain
(AUDUSD[as.character(ddStats[1,1]),4]-AUDUSD[as.character(ddStats[1,2]),4])/(AUDUSD[as.character(ddStats[1,1]),4])
[1] 0.399373
To make things clearer, these are the two lines from the AUDUSD data frame for the From and Trough dates:
AUDUSD[as.character(ddStats[1,1]),]
Open High Low Last
1996-12-03 0.8161 0.8167 0.7845 0.7975
AUDUSD[as.character(ddStats[1,2]),]
Open High Low Last
2001-04-02 0.4858 0.4887 0.4773 0.479
Also, the other drawdown depths do not agree with the "by hand" calculations. What am I missing? How come these two numbers, which should be the same, differ by a substantial amount?
I have tried replicating the drawdown via:
cumsum(rets) - cummax(cumsum(rets))
where rets is the vector of your log returns.
For some reason, when I calculate drawdowns of, say, less than 20%, I get the same results as table.Drawdowns() and drawdownsStats(), but for large drawdowns, say over 35%, the max drawdown figures begin to diverge between calculations. More specifically, table.Drawdowns() and drawdownsStats() are overstated (at least from what I noticed). I do not know why this is so, but perhaps it might help to use a confidence interval for large drawdowns (those over 35%) based on the standard error of the drawdown. I would use 0.4298531511/sqrt(1127), i.e. the max drawdown divided by sqrt(days to trough). This yields a +/- of 0.01280437, i.e. a drawdown between roughly 0.4170 and 0.4427; the lower bound of 0.4170 is much closer to your "by hand" calculation of 0.399373. Hope it helps.
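One possible source of the mismatch, offered as a hypothesis only: functions that compound returns as prod(1 + r) expect simple returns, so feeding them log returns produces a number that agrees with neither the log-space drawdown nor the price-ratio drawdown. The three candidate quantities for the worst drawdown above can be compared directly:

```r
# Last prices at the From (1996-12-03) and Trough (2001-04-02) dates
p0 <- 0.7975
p1 <- 0.4790

dd_price <- p1 / p0 - 1    # "by hand" simple drawdown: about -0.3994
dd_log   <- log(p1 / p0)   # drawdown in log-return space: about -0.5098

# If rets holds daily log returns, then the cumsum/cummax replication above
# measures dd_log, while compounding prod(1 + rets) - 1 over the same span
# gives yet another figure, lying between dd_price and dd_log.
```

Checking which of these reproduces the reported -0.4299, and feeding the drawdown functions simple returns instead of log returns, may be worth trying before resorting to a confidence interval.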

Ensuring temporal data density in R

ISSUE ---------
I have thousands of time-series files (.csv) containing intermittent data spanning between 20 and 50 years (see df). Each file contains the date_time and a metric (temperature). The data is hourly, and where no measurement exists there is an NA.
>df
date_time temp
01/05/1943 11:00 5.2
01/05/1943 12:00 5.2
01/05/1943 13:00 5.8
01/05/1943 14:00 NA
01/05/1943 15:00 NA
01/05/1943 16:00 5.8
01/05/1943 17:00 5.8
01/05/1943 18:00 6.3
I need to check these files to see if they have sufficient data density. I.e. that the ratio of NA's to data values is not too high. To do this I have 3 criteria that must be checked for each file:
Ensure that no more than 10% of the hours in a day are NA's
Ensure that no more than 10% of the days in a month are NA's
Ensure that there are 3 continuous years of data with valid days and months.
Each criterion must be fulfilled sequentially and if the file does not meet the requirements then I must create a data frame (or any list) of the files that do not meet the criteria.
QUESTION--------
I wanted to ask the community how to go about this. I have considered nested loops, along with sqldf, plyr, aggregate or even dplyr, but I do not know the simplest way to achieve this. Any example code or suggestions would be very much appreciated.
I think this will work for you. These functions check every hour for NA's in the following day, month or 3-year period, and spit out the number of NA's in the respective time window. So for checkdays, if it returns a value greater than 2.4 then, by your 10% rule, you'd have a problem; for months the threshold is 72, and for 3-year periods you're hoping for values less than 2628. Not tested, so please check these functions. By the way, the functions assume your NA data is in column 2. Cheers.
checkdays <- function(data) {
  countNA <- NULL
  for (i in 1:(length(data[, 2]) - 23)) {
    nadata <- data[i:(i + 23), 2]
    countNA[i] <- length(nadata[is.na(nadata)])
  }
  return(countNA)
}

checkmonth <- function(data) {
  countNA <- NULL
  for (i in 1:(length(data[, 2]) - 719)) {
    nadata <- data[i:(i + 719), 2]
    countNA[i] <- length(nadata[is.na(nadata)])
  }
  return(countNA)
}

check3years <- function(data) {
  countNA <- NULL
  for (i in 1:(length(data[, 2]) - 26279)) {
    nadata <- data[i:(i + 26279), 2]
    countNA[i] <- length(nadata[is.na(nadata)])
  }
  return(countNA)
}
So I ended up testing these. They work for me. Here are system times for a dataset a year long. So I don't think you'll have problems.
> system.time(checkdays(RM_W1))
user system elapsed
0.38 0.00 0.37
> system.time(checkmonth(RM_W1))
user system elapsed
0.62 0.00 0.62
Optimization:
I took the time to run these functions with the data you posted above, and it wasn't good. For loops work well for small data sets but can slow down dramatically as data sets grow if they are not constructed properly. I cannot report system times for the functions above with your data (they never finished), but I waited about 30 minutes. After reading the excellent post Speed up the loop operation in R, I rewrote the functions to be much faster: by minimising the amount of work done inside the loop and pre-allocating memory, you can really speed things up. You need to call the function like checkdays(df[,2]), but it's faster this way.
checkdays <- function(data) {
  countNA <- numeric(length(data) - 23)
  for (i in 1:(length(data) - 23)) {
    nadata <- data[i:(i + 23)]
    countNA[i] <- length(nadata[is.na(nadata)])
  }
  return(countNA)
}
> system.time(checkdays(df[,2]))
user system elapsed
4.41 0.00 4.41
I believe this should be sufficient for your needs. Regarding leap years, you should be able to modify the optimized function as I mentioned in the comments; just make sure you specify a leap-year dataset as a second dataset rather than a second column.
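If the per-window NA counts are all that is needed, the loop can be avoided entirely: a rolling NA count over a w-hour window is a difference of prefix sums, which runs in linear time even on 50 years of hourly data. A sketch (rolling_na is an invented name), shown here with a 4-hour window on the sample temperatures above:

```r
rolling_na <- function(x, w) {
  cs <- cumsum(c(0, is.na(x)))                 # prefix sums of the NA indicator
  cs[(w + 1):length(cs)] - cs[1:(length(cs) - w)]  # NA count in each w-length window
}

x <- c(5.2, 5.2, 5.8, NA, NA, 5.8, 5.8, 6.3)
rolling_na(x, 4)   # NA count in every 4-hour window
```

The same function covers the daily, monthly and 3-year checks by passing w = 24, 720 or 26280, and the 10% thresholds apply unchanged.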

How to enter censored data into R's survival model?

I'm attempting to model customer lifetimes on subscriptions. As the data is censored I'll be using R's survival package to create a survival curve.
The original subscriptions dataset looks like this..
id start_date end_date
1 2013-06-01 2013-08-25
2 2013-06-01 NA
3 2013-08-01 2013-09-12
Which I manipulate to look like this..
id tenure_in_months status(1=cancelled, 0=active)
1 2 1
2 ? 0
3 1 1
..in order to feed the survival model:
obj <- with(subscriptions, Surv(time=tenure_in_months, event=status, type="right"))
fit <- survfit(obj~1, data=subscriptions)
plot(fit)
What should I put in the tenure_in_months variable for the censored cases, i.e. the cases where the subscription is still active today: should it be the tenure up until today, or should it be NA?
First, I shall say I disagree with the previous answer. For a subscription still active today, the value should be neither the tenure up until today nor NA. What do we know exactly about those subscriptions? We know they have survived up until today; although we don't know their exact lifetimes, we do know they are longer than the tenure observed so far.
This is a situation known as right-censor in survival analysis. See: http://en.wikipedia.org/wiki/Censoring_%28statistics%29
So your data would need to translate from
id start_date end_date
1 2013-06-01 2013-08-25
2 2013-06-01 NA
3 2013-08-01 2013-09-12
to:
id t1 t2 status(3=interval_censored)
1 2 2 3
2 3 NA 3
3 1 1 3
Then you will need to change your R surv object, from:
Surv(time=tenure_in_months, event=status, type="right")
to:
Surv(t1, t2, event=status, type="interval2")
See http://stat.ethz.ch/R-manual/R-devel/library/survival/html/Surv.html for more syntax details. A very good summary of computational details can be found: http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_lifereg_sect018.htm
Interval censored data can be represented in two ways. For the first use type = interval and the codes shown above. In that usage the value of the time2 argument is ignored unless event=3. The second approach is to think of each observation as a time interval with (-infinity, t) for left censored, (t, infinity) for right censored, (t,t) for exact and (t1, t2) for an interval. This is the approach used for type = interval2, with NA taking the place of infinity. It has proven to be the more useful.
If a missing end date means that the subscription is still active, then you need to take the time until the current date as the censoring time.
NA won't work in the survival object; I think those cases would be omitted. That is not what you want, because these cases contain important information about the survival.
SQL code to get the time till event (use in the SELECT part of the query):
DATEDIFF(M, start_date, ISNULL(end_date, GETDATE())) AS tenure_in_months
BTW:
I would use the difference in days for my analysis; it does not make sense to round the time off to months.
You need to know the date the data was collected. The tenure_in_months for id 2 should then be that date minus 2013-06-01.
Otherwise I believe your encoding of the data is correct: the status of 0 for id 2 indicates it is right-censored (meaning we have a lower bound on its lifetime, but not an upper bound).
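Putting the right-censored encoding together, a sketch assuming the data were extracted on a known cutoff date (2013-10-01 is an invented value standing in for "the date the data was collected"):

```r
library(survival)

subs <- data.frame(
  id         = 1:3,
  start_date = as.Date(c("2013-06-01", "2013-06-01", "2013-08-01")),
  end_date   = as.Date(c("2013-08-25", NA, "2013-09-12"))
)

cutoff <- as.Date("2013-10-01")                    # date the data was pulled
subs$status <- ifelse(is.na(subs$end_date), 1, 2) - 1  # 0 = still active (censored), 1 = cancelled
subs$status <- ifelse(is.na(subs$end_date), 0, 1)
# censored rows get tenure up to the cutoff; cancelled rows up to end_date
subs$tenure_days <- as.numeric(
  ifelse(is.na(subs$end_date), cutoff, subs$end_date)
) - as.numeric(subs$start_date)

fit <- survfit(Surv(tenure_days, status) ~ 1, data = subs)
# plot(fit) draws the Kaplan-Meier curve, correctly using the censored row
```

Days rather than months are used here, in line with the remark above that rounding to months discards information.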