Predicting WHEN an event is going to occur - r

I am very new to machine learning and R, so my question might be unclear or missing information. I have tried to explain as much as possible. Please correct me if I have used the wrong terminology or phrasing. Any help will be greatly appreciated.
Context - I am trying to build a model to predict "when" an event is going to happen.
I have a dataset with the structure described below. This is not the actual data; it is dummy data created to explain the scenario. The actual data cannot be shared due to confidentiality.
About the data -
A customer buys a subscription under which he is allowed to use $x worth of the service provided.
A customer can have multiple subscriptions. Subscriptions can overlap in time or follow one another.
Each subscription has a usage limit of $x.
Each subscription has a start date and an end date.
A subscription is no longer used after its end date.
Each customer has his own behaviour/pattern of using the service. This is described by other derived variables such as monthly utilization, average monthly utilization, etc.
Customer can use the service above $x. This is indicated by column
"ExceedanceMonth" in the table above. Value of 1 says customer
went above $x in the first month of the subscription, value 5 says
customer went above $x in 5th month of the subscription. Value of
NULL indicates that the limit $x is not reached yet. This could be
either because
subscription ended and customer didn't overuse
or
subscription is yet to end and customer might overuse in future
The 2nd scenario after or condition described above is what I want to
predict. Among the subscriptions which are yet to end and customer
hasn't overused, WHEN will the limit be reached. i.e. predict the
ExceedanceMonth column in the above table.
Before reaching this model, I built a classification model using a decision tree which predicts whether a customer is going to cross the limit amount or not, i.e. predicts whether LimitReached = 1 or 0 in the next 2 months. I am not sure whether I should train the model discussed here (predicting time to event) on all the data and test/use it on the customers/subscriptions with LimitReached = 1, or train it only on the customers/subscriptions which have LimitReached = 1.
I have researched survival models. I understand that a survival model like Cox can be used to estimate the hazard function and understand how each variable affects the time to event. I tried to use the predict function with coxph, but I did not understand whether any of the values passed to the "type" parameter can be used to predict the actual time, i.e. I did not understand how I can predict the actual value for "WHEN" the limit will be crossed.
Maybe a survival model isn't the right approach for this scenario, so please advise me on what could be the best way to approach this problem.
#define the survival object
recsurv <- Surv(time = df$ExceedanceMonth, event = df$LimitReached)
#time-based split, only for testing the code
train <- subset(df, df$SubStartDate >= "20150301" & df$SubEndDate <= "20180401")
test <- subset(df, df$SubStartDate > "20180401")
#fit the Cox model on the training subset (bare column names in the formula, so that data = train is actually used)
fit <- coxph(Surv(ExceedanceMonth, LimitReached) ~ SubDurationInMonths + `#subs` + LimitAmount + Monthlyutitlization + AvgMonthlyUtilization, data = train, model = TRUE)
predicted <- predict(fit, newdata = test)
head(predicted)
1 2 3 4 5 6
0.75347328 0.23516619 -0.05535162 -0.03759123 -0.65658488 -0.54233043
Thank you in advance!

Survival models are fine for what you're trying to do. (I'm assuming you've estimated the model correctly from this point on.)
The key is understanding what comes out of the model. For a Cox model, the default quantity returned by predict() is the linear predictor (b1x1 + b2x2 + ...; the Cox model doesn't estimate an intercept b0). That alone won't tell you anything about when.
Specifying type="expected" for predict() will give you when via the expected duration--how long, on average, until the customer reaches his/her data limit, with the follow-up time (how long you watch the customer) set equal to the customer's actual duration (retrieved from the coxph model object).
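A minimal sketch of that call, continuing from the question's fit and test objects (one assumption: test still contains the ExceedanceMonth and LimitReached columns, which type = "expected" needs in order to evaluate the follow-up time):
# expected number of events for each subscription, given its observed follow-up time
exp_events <- predict(fit, newdata = test, type = "expected")
head(exp_events)
# per-subscription survival curves are another way to read off a time estimate
sf <- survfit(fit, newdata = test)
summary(sf)$table[, "median"]   # median predicted month of exceedance, where it is reached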
The coxed package will also give you expected durations, calculated using a different method, without the need to worry about follow-up time. It's also a little more forgiving when it comes to inputting a newdata argument, particularly if you have a specific covariate profile in mind. See the package vignette here.
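A hedged sketch of the coxed route, with the interface as I recall it from the package vignette (the method and output element names are assumptions and may differ across versions):
library(coxed)
ed <- coxed(fit, newdata = test, method = "npsf")  # non-parametric step-function method
head(ed$exp.dur)                                   # expected duration (in months) per subscription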
See also this thread for more on predict.coxph().

Related

How can Keras predict sequences of sales (individually) of 11106 distinct customers, each a series of varying length (anywhere from 1 to 15 periods)

I am approaching a problem that Keras must offer an excellent solution for, but I am having trouble developing an approach (because I am such a neophyte concerning anything in deep learning). I have sales data. It contains 11106 distinct customers, each with its own time series of purchases of varying length (anywhere from 1 to 15 periods).
I want to develop a single model to predict each customer's purchase amount for the next period. I like the idea of an LSTM, but clearly, I cannot make one for each customer; even if I tried, there would not be enough data for an LSTM in any case---the longest individual time series only has 15 periods.
I have used types of Markov chains, clustering, and regression in the past to model this kind of data. I am asking the question here, though, about what type of model in Keras is suited to this type of prediction. A complication is that all customers can be clustered by their overall patterns. Some belong together based on similarity; others do not; e.g., some customers spend with patterns like $100-$100-$100, others like $100-$100-$1000-$10000, and so on.
Can anyone point me to a type of sequential model supported by Keras that might handle this well? Thank you.
I am trying to achieve this in R. I haven't been able to build a model that gives me more than about 0.3 accuracy.
I don't think the main difficulty comes from which model to use so much as from how to frame the problem.
As you mention, WHO is spending the money seems as relevant as their past transactions for knowing how much they will likely spend.
But you cannot train 10k+ models, one for each customer, either.
Instead I would suggest clustering your customer base and fitting one model per cluster, using all the time series of the customers in that cluster combined to train the same model (a sketch follows below).
This would allow each model to learn the spending pattern of that particular group.
For that you can use an LSTM or another RNN-type model.
Hi, here's my suggestion; I will edit it later to provide more information.
Since it's a sequence problem, you should use RNN-based models: LSTMs, GRUs.
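A minimal sketch of the per-cluster idea using the keras package in R, under assumptions: a hypothetical list seqs holds one numeric purchase vector (1 to 15 periods) per customer in a given cluster, next_amount holds the amount to predict for the following period, and 0 is not a real purchase amount so it can be used as padding.
library(keras)
max_len <- 15
x <- t(sapply(seqs, function(s) c(rep(0, max_len - length(s)), s)))  # left-pad shorter series with 0
x <- array(x, dim = c(length(seqs), max_len, 1))                     # samples x timesteps x features
y <- next_amount
model <- keras_model_sequential() %>%
  layer_masking(mask_value = 0, input_shape = c(max_len, 1)) %>%  # let the LSTM skip the padding
  layer_lstm(units = 32) %>%
  layer_dense(units = 1)                                          # single regression output
model %>% compile(loss = "mse", optimizer = "adam")
model %>% fit(x, y, epochs = 20, batch_size = 64, validation_split = 0.2)
The same code would be run once per customer cluster, with each cluster's sequences pooled together.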

Event duration prediction in R? need tips for methods and packages

I have a general question of methodology.
I have to create a model to predict when, i.e. after how many days of therapy, a patient reaches a certain value. I have measurements of this value from frequent laboratory checks and also information on some influencing variables. But now I'm at a loss about how best to proceed, so that in the end there is something that can be used to predict for new patients when this threshold is reached, or, as a binary variable, when the value is no longer detectable.
The more I read about the topic, the more unsure I am about the optimal method.

forecast time to event survival analysis

I'm currently trying to model the time to event, where there are three different events that might happen. It's for telecommunication data, and I want to predict the expected lifetime of unlocked customers, i.e. customers whose contract period has ended and who can now resign on a monthly basis. They become unlocked customers as soon as their 1- or 2-year contract ends, and as time goes by they can churn, retain (buy a new contract) or stay an unlocked customer (so I will need a competing risks model).
My point of interest is the time until one of these events happens. I was thinking of using a Cox regression model to find the influence of covariates on the survival probability, but since the baseline hazard is not specified in a Cox model, it will be hard to predict the time to event (right?). I was thinking a parametric survival model might work better then, but I can't really make up my mind from what I have found on the internet so far.
Now is my question, is survival analysis the right method to predict time to event? Does anyone maybe have experience with predicting time to event?
You can assume a parametric model for the baseline by using e.g. survival::survreg; this way you avoid the unspecified-baseline problem. Alternatively, you can estimate the non-parametric baseline in-sample with a Cox model. See the type = "expected" argument in ?predict.coxph.
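A hedged sketch of both routes with the survival package, assuming a data frame d with columns time (months observed), status (1 = event occurred) and illustrative covariates tenure and plan; all of these names, and new_d, are placeholders, and competing risks are not handled here (that would need e.g. cause-specific models).
library(survival)
# Parametric route: Weibull AFT model, so predicted times are available directly
fit_weib <- survreg(Surv(time, status) ~ tenure + plan, data = d, dist = "weibull")
pred_median <- predict(fit_weib, newdata = new_d, type = "quantile", p = 0.5)  # predicted median time to event
# Cox route: non-parametric baseline estimated in-sample;
# type = "expected" gives the expected number of events over each row's follow-up
fit_cox <- coxph(Surv(time, status) ~ tenure + plan, data = d)
pred_exp <- predict(fit_cox, newdata = d, type = "expected")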

Simulating returns from ARMA(1,1) - MCsGARCH(1,1) model

How can I find expected intraday return of ARMA(1,1) - MCsGARCH(1,1) Model in R?
The sample code of the model is available at http://www.unstarched.net/2013/03/20/high-frequency-garch-the-multiplicative-component-garch-mcsgarch-model/
I think you are mixing something up here. There is no "expected intraday return"; for the ARMA(1,1) - MCsGARCH(1,1) there is only an estimate of the volatility of the following period/day (sigma, as you've already noticed in the comments).
I assume you are referring to the last plot on the website you provided; that would mean you want to know the VaR (Value-at-Risk) that is calculated from the volatility produced by the estimation procedure.
If you look at the code that was used to provide the plot:
D = as.POSIXct(rownames(roll@forecast$VaR))
VaRplot(0.01, actual = xts(roll@forecast$VaR[, 3], D), VaR = xts(roll@forecast$VaR[, 1], D))
You can see that the VaR (and the returns) were taken from the object roll. After you've run the simulation (without changing any variable names from the example), you could store them in variables for later use like this:
my_VaR = roll@forecast$VaR[, 1]
my_act = roll@forecast$VaR[, 3]
Here [, 1] selects the first column of the VaR element. If you check str(roll), you can see pretty much at the end that:
Element 1: stands for the alpha(1%) VaR
Element 2: stands for the alpha(5%) VaR and
Element 3: stands for the realized return.
To address what you said in your comment:
Have a look at the variable df (generated from as.data.frame(roll)); that may include what you are looking for.
"I want to compare the expected return and the actual return."
This seems to drift more in the direction of Cross Validated, but I'll try to give a brief outline.
GARCH models are primarily used for volatility forecasting and to learn about the volatility dynamics of a time series (and/or the correlation dynamics in multivariate models). Now, since the variance is a second moment, i.e. based on squared deviations, it is always positive. But are returns always positive? Of course they are not. This means the volatility forecast gives us an idea of the magnitude of next period's return, but at that point we don't know whether it will be a positive or a negative return. That's where the Value-at-Risk (VaR) comes into play.
Take e.g. a portfolio manager who owns one asset. With a GARCH model he can predict the volatility of the next period (let's say he uses a daily return series, then that would be tomorrow). Traders watch the risk of their portfolio much more closely than the potential upside. So with the volatility forecast he can make a good guess about the risk of his asset losing value tomorrow. A 95%-VaR of, let's say, 1,000 EUR means that with 95% probability tomorrow's loss will not exceed 1,000 EUR. A higher confidence level covers more of the tail, so a 99%-VaR will be higher, e.g. 1,500 EUR.
To wrap this up: there is no "expected" return; there is only a volatility forecast for tomorrow that gives an indication (never certainty) of how tomorrow's return could turn out. With the VaR this can be used for risk management. This is what is being done in the last part of the article you provided.
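As a minimal sketch of such a comparison (reusing the roll object and the same VaR columns as above), you can put the 1% VaR forecasts next to the realized returns and count exceedances:
VaR_1pct <- roll@forecast$VaR[, 1]        # 1% VaR forecasts
realized <- roll@forecast$VaR[, 3]        # realized returns for the same periods
exceedances <- sum(realized < VaR_1pct)   # periods where the loss was worse than the 1% VaR
exceedances / length(realized)            # empirical exceedance rate, compare with 0.01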
"What is the difference between ugarchsim and the roll function?"
You could check the documentation of the rugarch package; every function and its properties are explained in more detail there. At a quick glance I would say ugarchsim is used if you want to fit a model to a complete time series. The last standard deviation is then the forecast for the next period. The documentation for ugarchroll says:
ugarchroll-methods {rugarch}: Univariate GARCH Rolling Density Forecast and Backtesting
Description: Method for creating rolling density forecasts from ARMA-GARCH models with option for refitting every n periods with parallel functionality. It is used for forecasting as well as for backtesting.
This is if you want to test how your model would have performed in the past. It takes, e.g., only the first 300 data points provided and gives the forecast for data point 301. Then the VaR (95% or 99%) is compared with the realized return of data point 301. Then the model is refitted, giving a forecast for data point 302, and so on.
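As a short, hedged illustration of that backtest (the report() method for ugarchroll objects runs the standard VaR exceedance tests; the argument names are as I recall them from the rugarch documentation):
# Kupiec/Christoffersen-style VaR backtest on the rolling forecasts
report(roll, type = "VaR", VaR.alpha = 0.01, conf.level = 0.95)
# type = "fpm" gives forecast performance measures instead
report(roll, type = "fpm")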
Edit: added answers to the questions from the comments.

Cluster Analysis using R for large data sample

I am just starting out with segmenting, using R, a customer database I have for an e-commerce retail business. I am seeking some guidance on the best approach for this exercise.
I have searched the topics already posted here and tried the suggestions myself, like dist() and hclust(). However, I keep running into one issue or another and am not able to overcome them, since I am new to R.
Here is the brief description of my problem.
I have approximately 480K records of customers who have bought so far. The data contains following columns:
email id
gender
city
total transactions so far
average basket value
average basket size (no. of items purchased during one transaction)
average discount claimed per transaction
No of days since the user first purchased
Average duration between two purchases
No of days since last transaction
The business goal of this exercise is to identify the most profitable segments and encourage repeat purchases in those segments using campaigns. Can I please get some guidance as to how to do this successfully without running into problems like the size of the sample or the data type of columns?
Read this to learn how to subset data frames. When you try to define d, it looks like you're providing way too much data, which might be fixed by subsetting your table first. If not, you might want to take a random sample of your data instead of all of it. Suppose you know that columns 4 through 10 of your data frame called cust_data contain numerical data, then you might try this:
cust_data2 <- cust_data[, 4:10]
d <- dist(cust_data2)
For large values, you may want to log transform them--just experiment and see what makes sense. I really am not sure about this, and that's just a suggestion. Maybe choosing a more appropriate clustering or distance metric would be better.
Finally, when you run hclust, you need to pass in the d matrix, and not the original data set.
h <- hclust(d, method = "average")
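A hedged sketch combining the subsetting and sampling ideas above (the column positions, the sample size, and the choice of k are assumptions; dist() on all 480K rows would need on the order of a terabyte of memory):
set.seed(1)
num_cols <- 4:10                                         # assumed positions of the numeric columns
samp     <- cust_data[sample(nrow(cust_data), 10000), num_cols]
samp_sc  <- scale(samp)                                  # bring the variables onto comparable scales
d <- dist(samp_sc)
h <- hclust(d, method = "average")
clusters <- cutree(h, k = 5)                             # choose k by inspecting the dendrogram
table(clusters)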
Sadly your data does not contain any attributes that indicate what types of items/transactions did NOT result in a sale.
I am not sure if clustering is the way to go here.
Here are some ideas:
First split your data into a training set (say 70%) and a test set.
Set up a simple linear regression model with, say, "average basket value" as the response variable and all other variables as independent variables.
fit <- lm(averagebasketvalue ~ ., data = custdata)
Run the model on the training set, determine significant attributes (those with at least one star in the summary(fit) output), then focus on those variables.
Check how the model performs on the test set by calculating R-squared and the sum of squared errors (SSE) on the test set. You can use the predict() function; the calls will look like:
fitpred <- predict(fit, newdata = testset)
SSE <- sum((testset$averagebasketvalue - fitpred)^2)
SST <- sum((testset$averagebasketvalue - mean(testset$averagebasketvalue))^2)
1 - SSE / SST   # out-of-sample R-squared
Maybe "city" contains too many unique values to be meaningful. Try to generalize them by introducing a new attribute CityClass (e.g. BigCity-MediumCity-SmallCity ... or whatever classification scheme is useful for your cities). You might also condition the model on "gender". Drop "email id".
This can go on for a while... play with the model to try to get better R-squared and SSEs.
I think a tree-based model (rpart) might also work well here; a minimal sketch follows below.
Then you might change to cluster analysis at a later time.
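A hedged sketch of that tree-based alternative (the response column name follows the regression example above, and dropping column 1 as the email id is an assumption about its position):
library(rpart)
tree_fit <- rpart(averagebasketvalue ~ ., data = custdata[, -1])  # drop the email id column
printcp(tree_fit)                            # complexity table, useful for pruning
plot(tree_fit); text(tree_fit, use.n = TRUE) # quick look at the splits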
