Can you calculate the time at a set risk level from proportional hazards models in R?
Say I've got my model built for the whole population. The risk of developing an event in one year is 5%. After stratification, some patients have a higher risk, some lower.
How do I get a value, per patient, that will tell me the time at which they have a 5% risk of developing an event?
Apologies for not showing any code from my own analysis; I'm mainly wondering if this is even possible. If it isn't, could you suggest other models?
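To make the question concrete, here is a rough sketch of the sort of thing I'm after, using the survival package's built-in lung data as a stand-in (the model and the 5% threshold are placeholders, not my actual analysis):

    # Rough sketch: invert per-patient survival curves from a Cox fit to find
    # the time at which each patient's predicted event risk first reaches 5%.
    library(survival)

    fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

    # Predicted survival curve for every patient in the data
    sf <- survfit(fit, newdata = lung)

    # First time at which survival drops to or below 0.95 (i.e. 5% event risk)
    time_at_5pct <- apply(sf$surv, 2, function(s) {
      idx <- which(s <= 0.95)[1]
      if (is.na(idx)) NA else sf$time[idx]
    })
    head(time_at_5pct)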
Related
My job is to make sure that an online retailer achieves a certain service level (in stock rate) for their products, while avoiding aging and excess stock. I have a robust cost and leadtime simulation model. One of the inputs into that model is a vector of prediction intervals for cumulative demand over the next leadtime weeks.
I've been reading about quantile regression, conformal prediction, gradient boosting, and quantile random forests... frankly all of these are far above my head, and they seem focused on multivariate regression of non-time-series data. I know that I can't just regress against time, so I'm not even sure how to set up a complex regression method correctly. Moreover, since I'm forecasting many thousands of items every week, the parameter setting and tuning needs to be completely automated.
To date, I've been using a handful of traditional forecast methods (TSB [a variation of Croston], ETS, ARIMA, etc.), including hybrids, using R packages like forecastHybrid. My prediction intervals are almost universally much too narrow relative to actual results (e.g. in a sample of 500 relatively steady-selling items, 20% fell below my ARIMA 1% prediction interval, and 12% were above the 99% prediction interval).
I switched to using simulation + bootstrapping the residuals to build my intervals, but the results are directionally the same as above.
I'm looking for the simplest way to arrive at a univariate time series model with more accurate prediction intervals for cumulative demand over leadtime weeks, particularly at the upper/lower 10% and beyond. All my current models are trained on MSE, so one step is probably to use something more like pinball loss scored against cumulative demand (rather than the per-period error). Unfortunately I'm totally unfamiliar with how to write a custom loss function for the legacy forecasting libraries (much less the new sexy ones above).
I'd deeply appreciate any advice!
A side note: we already have an AWS setup that can compute each item from an R job in parallel, so computing time is not a major factor.
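For reference, here is a bare-bones version of the simulation + residual-bootstrap approach on made-up weekly data (the series, lead time, and number of paths are placeholders; my real pipeline runs per item on AWS):

    # Rough sketch: simulate future sample paths from an ETS fit with the
    # forecast package, then take empirical quantiles of *cumulative* demand
    # over the lead time instead of per-period intervals.
    library(forecast)

    demand   <- ts(rpois(104, lambda = 20), frequency = 52)  # fake weekly history
    leadtime <- 6                                            # lead time in weeks

    fit <- ets(demand)

    # Many future paths, bootstrapping the residuals
    paths <- replicate(2000, simulate(fit, nsim = leadtime, future = TRUE, bootstrap = TRUE))

    # Cumulative demand over the lead time for each simulated path
    cum_demand <- colSums(paths)

    # Empirical prediction intervals for lead-time demand
    quantile(cum_demand, probs = c(0.01, 0.10, 0.50, 0.90, 0.99))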
I want to calculate the 95% coverage rate of a simple OLS estimator.
The (for me) tricky addition is that the independent variable has 91 values that I have to test against each other in order to see which value leads to the best estimate.
For each value of the independent variable I want to draw 1000 samples.
I tried reading up on the theory and searching on platforms such as Stack Overflow, but I didn't manage to find an appropriate answer.
My biggest question is how to calculate a coverage rate for a 95% confidence interval.
I would deeply appreciate it if you could provide me with some possibilities or insights.
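For concreteness, here is a minimal sketch of the kind of simulation I have in mind, with a made-up data-generating process; I'm not sure this is the right way to compute coverage, which is exactly my question:

    # Sketch: simulate data with a known slope, fit OLS, and record how often
    # the 95% confidence interval contains the true value.
    set.seed(1)

    true_beta <- 2
    n_sims    <- 1000
    covered   <- logical(n_sims)

    for (i in seq_len(n_sims)) {
      x  <- rnorm(50)
      y  <- 1 + true_beta * x + rnorm(50)
      ci <- confint(lm(y ~ x), "x", level = 0.95)
      covered[i] <- ci[1] <= true_beta && true_beta <= ci[2]
    }

    mean(covered)  # empirical coverage rate; should be close to 0.95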
I'm working on a churn model; the main idea is to predict whether a customer will churn within 30 days.
I've been struggling with my dataset: I have 100k rows and my target variable is imbalanced, 95% no churn and 5% churn.
I'm trying GLM and RF. If I train both models on the raw data, I don't get any churn predictions, so that doesn't work for me. I tried balancing by taking all the churners and the same number of non-churners (50% churn, 50% no churn), training on that and then testing on my full data, and I get a lot of churn predictions for customers who don't actually churn. I have tried oversampling, undersampling, ROSE, and SMOTE, and nothing seems to work for me.
At best, both models catch a maximum of 20% of my churners, so my gain and lift are not that good. I think I've tried everything, but I can't get beyond that 20%.
I have customer behavior variables, personal information and more. I also did an exploratory analysis, calculating the churn percentage by age, sex, and behavior, and I saw that every group has the same churn percentage, so I'm thinking that maybe I lack variables that separate the groups better (this last idea is just a personal hunch).
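For concreteness, here is a minimal sketch (fake data, placeholder variable names) of an alternative I'm wondering about: a weighted GLM on the raw, imbalanced data, ranking customers by predicted probability instead of using a 0.5 cutoff. Would something like this be more sensible than resampling?

    # Sketch: weight churners instead of resampling, then flag the top of the
    # predicted-probability ranking rather than thresholding at 0.5.
    set.seed(1)
    df <- data.frame(
      churn  = rbinom(1000, 1, 0.05),
      usage  = rnorm(1000),
      tenure = rpois(1000, 24)
    )

    # Up-weight churners by the (rounded) class ratio
    w   <- ifelse(df$churn == 1, round(sum(df$churn == 0) / sum(df$churn == 1)), 1)
    fit <- glm(churn ~ usage + tenure, data = df, family = binomial, weights = w)

    p <- predict(fit, type = "response")

    # Flag, say, the top 5% highest-risk customers
    threshold <- quantile(p, 0.95)
    table(predicted = p >= threshold, actual = df$churn)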
Thank you everyone, greetings.
Dataset Description: I use a dataset with neuropsychological (np) tests from several subjects. Every subject has more than one test in his/her follow-up, i.e. one test per year. I study the cognitive decline in these subjects. The information that I have is: Individual number (identity number), Education (years), Gender (M/F as factor), Age (years), Time from Baseline (= years after the first np test).
AIM: My aim is to measure the rate of change in their np tests, i.e. the cognitive decline per year for each of them. To do that I use Linear Mixed Effects Models (LMEM), taking the above parameters into account, and I compute the slope for each subject.
Question: When I run the possible models (combining different parameters each time), I also check them for singularity, and the result in almost all cases is TRUE. So my models are singular! If I wanted to use these models for prediction this would not be good, as it means the model overfits the data. But since I just want to find the slope for each individual, I think this is not a problem; or, even better, I think it is an advantage, as in that case singularity offers a more precise calculation of the subjects' slopes. Do you think this reasoning is correct?
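To show what I'm doing, here is a stripped-down version on simulated data (not my np dataset; variable names are placeholders):

    # Sketch: random intercept and slope for time per subject, then extract the
    # per-subject slopes with coef().
    library(lme4)

    set.seed(1)
    d <- data.frame(
      id   = rep(1:30, each = 5),
      time = rep(0:4, times = 30)
    )
    d$score <- 28 - 0.5 * d$time + rnorm(30)[d$id] + rnorm(nrow(d), 0, 0.8)

    m <- lmer(score ~ time + (1 + time | id), data = d)

    isSingular(m)     # the singularity check; in my real models this is almost always TRUE
    head(coef(m)$id)  # per-subject intercepts and slopes (fixed + random effects)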
I'm currently trying to model the time to event, where there are three different events that might happen. It's for telecommunication data, and I want to predict the expected lifetime of unlocked customers, i.e. customers whose contract period has ended and who are now on a monthly basis. They become unlocked customers as soon as their 1- or 2-year contract ends, and as time goes by they can churn, retain (buy a new contract) or stay an unlocked customer (so I will need a competing risks model).
What I'm interested in is the time until one of these events happens. I was thinking of using a Cox regression model to find the influence of covariates on the survival probability, but since the baseline hazard is left unspecified in a Cox model, it will be hard to predict the time to event (right?). I was thinking a parametric survival model might work better then, but I can't really make up my mind from what I've found on the internet so far.
So my question is: is survival analysis the right method to predict time to event? Does anyone have experience with predicting time to event?
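To make the setup concrete, here is a sketch on simulated data (not my actual telecom data) of the multi-state structure I mean, using the survival package:

    # Sketch: a status factor with censoring as the first level gives a
    # multi-state (competing risks) survival object; survfit then returns the
    # Aalen-Johansen cumulative incidence of churn vs. retain.
    library(survival)

    set.seed(1)
    n  <- 500
    tt <- rexp(n, 0.05)                          # months until something happens
    ev <- sample(c("censored", "churn", "retain"), n,
                 replace = TRUE, prob = c(0.3, 0.4, 0.3))
    d  <- data.frame(
      time   = pmin(tt, 24),                     # administrative cutoff at 24 months
      status = factor(ifelse(tt > 24, "censored", ev),
                      levels = c("censored", "churn", "retain"))
    )

    fit <- survfit(Surv(time, status) ~ 1, data = d)
    summary(fit, times = c(6, 12, 18))           # cumulative incidence per event type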
You can assume a parametric model for the baseline hazard by using e.g. survival::survreg; this way you avoid the problem of the unspecified baseline. Alternatively, you can estimate the non-parametric baseline in-sample with a Cox model; see the type = "expected" argument in ?predict.coxph.
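A quick sketch of both options, using the built-in lung data as a stand-in for the telecom data in the question:

    # Parametric (Weibull) model: predicted median time to event per subject
    library(survival)
    pfit <- survreg(Surv(time, status) ~ age + sex, data = lung, dist = "weibull")
    pred_median <- predict(pfit, type = "quantile", p = 0.5)

    # Cox model: expected number of events per subject, using the in-sample
    # non-parametric baseline
    cfit <- coxph(Surv(time, status) ~ age + sex, data = lung)
    expected <- predict(cfit, type = "expected")

    head(cbind(pred_median, expected))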