I'm currently trying to model time to event, where there are three different events that might happen. It's for telecommunication data, and I want to predict the expected lifetime of unlocked customers, i.e. customers whose contract period has ended and who can now re-sign on a monthly basis. Customers become unlocked as soon as their 1- or 2-year contract ends, and as time goes by they can churn, retain (buy a new contract), or stay an unlocked customer (so I will need a competing risks model).
My point of interest is the time until one of these events happens. I was thinking of using a Cox regression model to find the influence of covariates on the survival probability, but since the baseline hazard is left unspecified in a Cox model, it will be hard to predict the time to event (right?). I was thinking a parametric survival model might work better then, but I can't really make up my mind from what I have found online so far.
So my question is: is survival analysis the right method for predicting time to event? Does anyone have experience with predicting time to event?
You can assume a parametric model for the baseline by using e.g. survival::survreg; that way you avoid the problem of the unspecified baseline. Alternatively, you can estimate the non-parametric baseline in-sample with a Cox model; see the type = "expected" argument in ?predict.coxph.
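For concreteness, a minimal sketch of both options (a hypothetical data frame dat with columns time, status, x1, x2):

library(survival)

# parametric (AFT) model: the baseline is fully specified by the chosen distribution
aft <- survreg(Surv(time, status) ~ x1 + x2, data = dat, dist = "weibull")
predict(aft, newdata = dat, type = "quantile", p = 0.5)  # predicted median survival time

# Cox model: baseline left unspecified, but the in-sample cumulative hazard can be used
cox <- coxph(Surv(time, status) ~ x1 + x2, data = dat)
predict(cox, type = "expected")  # expected number of events over each subject's follow-up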
My job is to make sure that an online retailer achieves a certain service level (in-stock rate) for their products, while avoiding aging and excess stock. I have a robust cost and leadtime simulation model. One of the inputs into that model is a vector of prediction intervals for cumulative demand over the next leadtime weeks.
I've been reading about quantile regression, conformal models, gradient boosting, and quantile random forests... frankly, all of these are far above my head, and they seem focused on multivariate regression of non-time-series data. I know that I can't just regress against time, so I'm not even sure how to set up a complex regression method correctly. Moreover, since I'm forecasting many thousands of items every week, the parameter setting and tuning needs to be completely automated.
To date, I've been using a handful of traditional forecast methods (TSB [variation of Croston], ETS, ARIMA, etc) including hybrids, using R packages like hybridForecast. My prediction intervals are almost universally much narrower than our actual results (e.g. in a sample of 500 relatively steady-selling items, 20% were below my ARIMA 1% prediction interval, and 12% were above the 99% prediction interval).
I switched to using simulation + bootstrapping the residuals to build my intervals, but the results are directionally the same as above.
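For illustration, a sketch of the simulation-plus-residual-bootstrap approach to intervals for cumulative demand over the leadtime (hypothetical series y holding one item's weekly demand, 6-week leadtime, using the forecast package):

library(forecast)

fit <- ets(y)        # y = one item's weekly demand (hypothetical)
L <- 6               # leadtime in weeks
n_paths <- 5000

# simulate future sample paths by bootstrapping residuals, then take
# quantiles of the cumulative demand over the leadtime
cum_demand <- replicate(
  n_paths,
  sum(simulate(fit, nsim = L, future = TRUE, bootstrap = TRUE))
)
quantile(cum_demand, probs = c(0.01, 0.10, 0.50, 0.90, 0.99))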
I'm looking for the simplest way to arrive at a univariate time series model with more accurate prediction intervals for cumulative demand over leadtime weeks, particularly at the upper/lower 10% and beyond. All my current models are trained on MSE, so one step is probably to use something more like pinball loss scored against cumulative demand (rather than the per-period error). Unfortunately, I'm totally unfamiliar with how to write a custom loss function for the legacy forecasting libraries (much less the new, sexy ones above).
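For illustration, the pinball (quantile) loss itself is simple to write by hand for scoring; a sketch with hypothetical vectors:

# pinball loss for a quantile forecast q at level tau against realised value y
pinball_loss <- function(y, q, tau) {
  ifelse(y >= q, tau * (y - q), (1 - tau) * (q - y))
}

# e.g. score 90th-percentile forecasts of cumulative demand across items:
# mean(pinball_loss(actual_cum_demand, forecast_q90, tau = 0.90))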
I'd deeply appreciate any advice!
A side note: we already have an AWS setup that can compute each item from an R job in parallel, so computing time is not a major factor.
Can you calculate the time at a given risk level from proportional hazards models in R?
Say I've got my model built for the whole population. The risk of developing an event in one year is 5%. After stratification, some patients have a higher risk, some lower.
How do I get a value, per patient, that will tell me the time at which they have a 5% risk of developing an event?
Apologies for not showing any code examples; I'm wondering if this is even possible. If it isn't, could you suggest other models?
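One way this could work with the survival package is to ask for the 5% quantile of each patient's predicted survival curve, i.e. the time by which that patient's event risk reaches 5%. A minimal sketch with hypothetical variable names:

library(survival)

# hypothetical fit and per-patient covariate data
fit <- coxph(Surv(time, status) ~ age + sex, data = dat)
sf  <- survfit(fit, newdata = patients)   # predicted survival curve per patient

# the p-th quantile of a survival curve is the time at which S(t) = 1 - p,
# so probs = 0.05 gives the time at which each patient's risk reaches 5%
quantile(sf, probs = 0.05)$quantile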
I am very new to machine learning and R, so my question might seem unclear or might need more information. I have tried to explain as much as possible. Please correct me if I have used the wrong terminology or phrasing. Any help on this will be greatly appreciated.
Context - I am trying to build a model to predict "when" an event is going to happen.
I have a dataset which has the structure below. This is not the actual data; it is dummy data created to explain the scenario. The actual data cannot be shared due to confidentiality.
About the data:
- A customer buys a subscription under which he is allowed to use $x worth of the service provided.
- A customer can have multiple subscriptions. Subscriptions can overlap in time or be serialized in time.
- Each subscription has a limit on usage, which is $x.
- Each subscription has a start date and an end date.
- A subscription will no longer be used after its end date.
- Each customer has his own behavior/pattern in which he uses the service. This is described by other derived variables: monthly utilization, average monthly utilization, etc.
- A customer can use the service above $x. This is indicated by the column "ExceedanceMonth" in the table above. A value of 1 says the customer went above $x in the first month of the subscription; a value of 5 says the customer went above $x in the 5th month of the subscription. A value of NULL indicates that the limit $x has not been reached yet. This could be either because
  - the subscription ended and the customer didn't overuse, or
  - the subscription is yet to end and the customer might overuse in the future.
The 2nd scenario (after the "or" above) is what I want to predict: among the subscriptions which are yet to end and where the customer hasn't overused, WHEN will the limit be reached, i.e. predict the ExceedanceMonth column in the table above.
Before reaching this model, I built a classification model using a decision tree which predicts whether a customer is going to cross the limit amount, i.e. whether LimitReached = 1 or 0 in the next 2 months. I am not sure whether I should train the model discussed here (predicting time to event) on all the data and then test/use it on customers/subscriptions with LimitReached = 1, or train it only on the customers/subscriptions that have LimitReached = 1.
I have researched survival models. I understand that a survival model like Cox regression can be used to estimate the hazard function and to understand how each variable affects the time to event. I tried to use the predict function with coxph, but I did not understand whether any of the values passed to the "type" parameter can be used to predict the actual time, i.e. I did not understand how I can predict the actual value for WHEN the limit will be crossed.
Maybe a survival model isn't the right approach for this scenario. So please advise me on the best way to approach this problem.
library(survival)
# define the survival object: time = month of exceedance, event = limit reached
recsurv <- Surv(time = df$ExceedanceMonth, event = df$LimitReached)
# train/test split (only for testing the code)
train <- subset(df, SubStartDate >= "20150301" & SubEndDate <= "20180401")
test  <- subset(df, SubStartDate > "20180401")
# use bare column names in the formula so that data = train is actually used
# (df$... in the formula would silently ignore the training subset)
fit <- coxph(Surv(ExceedanceMonth, LimitReached) ~ SubDurationInMonths + `#subs` +
               LimitAmount + Monthlyutitlization + AvgMonthlyUtilization,
             data = train, model = TRUE)
# default predict() returns the linear predictor (risk score), not a time
predicted <- predict(fit, newdata = test)
head(predicted)
1 2 3 4 5 6
0.75347328 0.23516619 -0.05535162 -0.03759123 -0.65658488 -0.54233043
Thank you in advance!
Survival models are fine for what you're trying to do. (I'm assuming you've estimated the model correctly from this point on.)
The key is understanding what comes out of the model. For a Cox model, the default quantity returned by predict() is the linear combination (b1x1 + b2x2 + ...; a Cox model does not estimate an intercept b0). That alone won't tell you anything about "when".
Specifying type = "expected" for predict() will give you "when" via the expected duration: how long, on average, until the customer reaches his/her data limit, with the follow-up time (how long you watch the customer) set equal to the customer's actual duration (retrieved from the coxph model object).
The coxed package will also give you expected durations, calculated using a different method, without the need to worry about follow-up time. It's also a little more forgiving when it comes to inputting a newdata argument, particularly if you have a specific covariate profile in mind. See the package vignette here.
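A sketch of both, assuming the fit and test objects from the question (the exact coxed arguments are described in its vignette):

library(survival)
library(coxed)

# expected number of events over each subject's observed follow-up time
# (newdata must contain the time/status columns for this to be computed)
predict(fit, newdata = test, type = "expected")

# expected durations via coxed, non-parametric step-function method
ed <- coxed(fit, newdata = test, method = "npsf")
head(ed$exp.dur)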
See also this thread for more on predict.coxph().
I am trying to investigate the relationship between some Google Trends Data and Stock Prices.
I performed the ADF test and the KPSS test to make sure that both time series are integrated of the same order (I(1)).
However, after I took first differences, the ACF plots were completely insignificant (apart from the value of 1 at lag 0, of course), which told me that the differenced series behave like white noise.
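For reference, these steps might look like this in R (hypothetical series names, using the tseries package):

library(tseries)   # adf.test(), kpss.test()

# stationarity checks on the level series
adf.test(stocks.ts);  kpss.test(stocks.ts)
adf.test(trends.ts);  kpss.test(trends.ts)

# first differences, then re-inspect the autocorrelation
d_stocks <- diff(stocks.ts)
d_trends <- diff(trends.ts)
acf(d_stocks)
acf(d_trends)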
Nevertheless I tried to estimate a VAR model which you can see attached.
As you can see, only one constant is significant. I have already read that, because Stocks.ts.l1 is not significant in the equation for GoogleTrends and GoogleTrends.ts.l1 is not significant in the equation for Stocks, there is no dynamic between the two time series and each can be modelled independently of the other with an AR(p) model.
I checked the residuals of the model. They fulfil the assumptions (normality of the residuals is not entirely given but acceptable, there is homoscedasticity, the model is stable, and there is no autocorrelation).
But what does it mean when no coefficient is significant, as in the Stocks.ts equation? Is the model simply inappropriate for the data because the data don't follow an AR process? Or is the model just so bad that a constant would describe the data better than the model? Or a combination of the two? Any suggestions on how I could proceed with my analysis?
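For reference, the VAR estimation and the formal cross-equation (Granger-causality) checks might look like this with the vars package (continuing the hypothetical objects above):

library(vars)

# VAR on the differenced series, lag order chosen by AIC
y <- cbind(Stocks = d_stocks, GoogleTrends = d_trends)
var_fit <- VAR(y, lag.max = 8, ic = "AIC", type = "const")
summary(var_fit)

# Granger-causality tests between the two series
causality(var_fit, cause = "GoogleTrends")
causality(var_fit, cause = "Stocks")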
Thanks in advance
I have a general question of methodology.
I have to create a model to predict when, i.e. after how many days of therapy, a patient reaches a certain value. I have data on this value from tight laboratory controls, and also information on some influencing variables. But now I'm at a loss and I don't know how best to approach it, so that in the end there is something that can be used to predict, for new patients, when this threshold will be reached, or as a binary variable whether the value is no longer detectable.
The more I read about the topic, the more unsure I am about the optimal method.