Event duration prediction in R? Need tips for methods and packages

I have a general methodology question.
I need to build a model to predict when, i.e. after how many days of therapy, a patient reaches a certain laboratory value. I have measurements of this value from tight laboratory controls, plus information on several influencing variables. But now I'm at a loss as to how best to proceed, so that in the end I have something that can predict, for a new patient, when this threshold is reached, or, as a binary variable, when the value is no longer detectable.
The more I read about the topic, the more unsure I am about the optimal method.

Related

How can Keras predict sequences of sales (individually) of 11106 distinct customers, each a series of varying length (anywhere from 1 to 15 periods)

I am approaching a problem that Keras must offer an excellent solution for, but I am having trouble developing an approach (because I am such a neophyte at anything related to deep learning). I have sales data containing 11106 distinct customers, each with a time series of purchases of varying length (anywhere from 1 to 15 periods).
I want to develop a single model to predict each customer's purchase amount for the next period. I like the idea of an LSTM, but clearly I cannot make one for each customer; even if I tried, there would not be enough data for an LSTM in any case, since the longest individual time series only has 15 periods.
I have used types of Markov chains, clustering, and regression in the past to model this kind of data. What I am asking here, though, is what type of model in Keras is suited to this type of prediction. A complication is that customers can be clustered by their overall patterns: some belong together based on similarity, others do not; e.g., some customers spend in patterns like $100-$100-$100, others like $100-$100-$1000-$10000, and so on.
Can anyone point me to a type of sequential model supported by Keras that might handle this well? Thank you.
I am trying to achieve this in R. I haven't been able to build a model that gives me more than about 0.3 accuracy.
I don't think the main difficulty comes from which model to use so much as from how to frame the problem.
As you mention, WHO is spending the money seems as relevant as their past transactions for knowing how much they will likely spend.
But you cannot train 10k+ models either, one per customer.
Instead I would suggest clustering your customer base and fitting one model per cluster, using all the time series of the customers in that cluster combined to train the same model.
This would allow each model to learn the spending pattern of that particular group; a sketch of the clustering step is shown below.
For the per-cluster model you can then use an LSTM or another RNN.
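For illustration, a minimal sketch of the clustering step in R, assuming a long data frame purchases with columns customer_id and amount (all names here are hypothetical):
library(dplyr)
# summarise each customer's series into a few spending features
feats <- purchases %>%
  group_by(customer_id) %>%
  summarise(mean_amt = mean(amount),
            sd_amt   = coalesce(sd(amount), 0),  # sd() is NA for 1-period series
            n_per    = n())
# cluster the customer base on the scaled features
set.seed(1)
km <- kmeans(scale(feats[, c("mean_amt", "sd_amt", "n_per")]), centers = 5)
feats$cluster <- km$cluster
# then pool the series within each cluster and train one model per cluster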
Here's my suggestion, and I will edit it later to provide you with more information.
Since it's a sequence problem, you should use RNN-based models: LSTMs or GRUs. A minimal sketch follows.
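For instance, with the keras package in R (a sketch only, not the asker's setup; x_train and y_train are hypothetical arrays of series zero-padded to the maximum length of 15 periods):
library(keras)
# one padded series per customer: shape (n_customers, 15 timesteps, 1 feature)
model <- keras_model_sequential() %>%
  layer_masking(mask_value = 0, input_shape = c(15, 1)) %>%  # skip the padding
  layer_lstm(units = 32) %>%
  layer_dense(units = 1)  # next-period purchase amount
model %>% compile(optimizer = "adam", loss = "mse")
# model %>% fit(x_train, y_train, epochs = 20, batch_size = 64)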

R caret training - but each sample has three separate measurements and I want to use majority vote to predict

I have a very specific dataset with 50 people. Each person has a response (sex) and ~2000 measurements of some biological quantities.
We have three independent replicates from each person, so 3 rows per person.
I can easily use caret and groupKFold() to keep each person entirely in either the training or the test set - that works fine.
Then I simply predict each replicate separately (so 3 predictions per person).
I want to combine these three predictions into one prediction per person, using a majority vote and/or some other scheme.
I.e., for each person I take the 3 predictions and predict the response to be the one with the most votes. That's pretty easy to do for the final model, but it should also be used in the tuning step (i.e. in the cross-validation that picks parameter values).
I think I can do that via the summaryFunction=... argument when calling caret::trainControl(), but I would simply like to ask:
Is there a simpler way of doing this?
I have googled around but keep failing to find people with similar problems. I really hope someone can point me in the right direction.
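For reference, a minimal sketch of such a summary function, assuming a vector person_id with one entry per row of the training data, and relying on the rowIndex column that caret passes to custom summary functions:
library(caret)
majorityVoteSummary <- function(data, lev = NULL, model = NULL) {
  # data$rowIndex maps resampled predictions back to the original rows
  ids   <- person_id[data$rowIndex]
  voted <- tapply(as.character(data$pred), ids,
                  function(p) names(which.max(table(p))))  # majority vote
  truth <- tapply(as.character(data$obs), ids, function(o) o[1])
  c(PersonAccuracy = mean(voted == truth))
}
ctrl <- trainControl(index = groupKFold(person_id, k = 5),
                     summaryFunction = majorityVoteSummary)
# then: train(..., trControl = ctrl, metric = "PersonAccuracy")
Passing metric = "PersonAccuracy" to train() makes the tuning step select parameters by the voted, per-person accuracy rather than the per-replicate accuracy.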

Predicting WHEN an event is going to occur

I am very new to machine learning and R, so my question might seem unclear or might need more information. I have tried to explain as much as possible. Please correct me if I have used the wrong terminology or phrasing. Any help with this will be greatly appreciated.
Context - I am trying to build a model to predict "when" an event is going to happen.
I have a dataset with the structure described below. This is not the actual data; it is dummy data created to explain the scenario. The actual data cannot be shared due to confidentiality.
About the data -
A customer buys a subscription under which he is allowed to use $x worth of the service provided.
A customer can have multiple subscriptions. Subscriptions can overlap in time or be serialized in time.
Each subscription has a limit on usage, which is $x.
Each subscription has a start date and an end date.
A subscription is no longer used after its end date.
Each customer has his own behavior/pattern in using the service. This is described by other derived variables: monthly utilization, average monthly utilization, etc.
A customer can use the service above $x. This is indicated by the column "ExceedanceMonth" in the table above. A value of 1 says the customer went above $x in the first month of the subscription; a value of 5 says the customer went above $x in the 5th month of the subscription. A value of NULL indicates that the limit $x has not been reached yet. This could be either because
the subscription ended and the customer didn't overuse,
or
the subscription is yet to end and the customer might overuse in the future.
The 2nd scenario, after the "or" above, is what I want to predict: among the subscriptions which are yet to end and where the customer hasn't overused, WHEN will the limit be reached? I.e., predict the ExceedanceMonth column in the above table.
Before reaching this model, I built a classification model using a decision tree which predicts whether a customer is going to cross the limit amount or not, i.e. it predicts whether LimitReached = 1 or 0 in the next 2 months. I am not sure whether I should train the model discussed here (predicting time to event) on all the data and then test/use it on the customers/subscriptions with LimitReached = 1, or train it only on the customers/subscriptions that will have LimitReached = 1.
I have researched survival models. I understand that a survival model like Cox can be used to estimate the hazard function and to understand how each variable affects the time to event. I tried to use the predict function with Cox, but I did not understand whether any of the values passed to the "type" parameter can be used to predict the actual time, i.e. I did not understand how I can predict the actual value of "WHEN" the limit will be crossed.
Maybe a survival model isn't the right approach for this scenario. So, please advise me on what could be the best way to approach this problem.
library(survival)
#define survival object
recsurv <- Surv(time = df$ExceedanceMonth, event = df$LimitReached)
#only for testing the code
train <- subset(df, SubStartDate >= "20150301" & SubEndDate <= "20180401")
test <- subset(df, SubStartDate > "20180401") #only for testing the code
#covariates are referenced without the df$ prefix so they come from data = train
fit <- coxph(Surv(ExceedanceMonth, LimitReached) ~ SubDurationInMonths + `#subs` +
               LimitAmount + Monthlyutitlization + AvgMonthlyUtilization,
             data = train, model = TRUE)
predicted <- predict(fit, newdata = test) #default type = "lp", the linear predictor
head(predicted)
1 2 3 4 5 6
0.75347328 0.23516619 -0.05535162 -0.03759123 -0.65658488 -0.54233043
Thank you in advance!
Survival models are fine for what you're trying to do. (I'm assuming you've estimated the model correctly from this point on.)
The key is understanding what comes out of the model. For a Cox model, the default quantity from predict() is the linear predictor (b0 + b1x1 + b2x2 + ..., though the Cox does not estimate a b0). That alone won't tell you anything about when.
Specifying type="expected" in predict() will get you to "when" via the expected duration: how long, on average, until the customer reaches his/her limit, with the follow-up time (how long you watch the customer) set equal to the customer's actual duration (retrieved from the coxph model object).
The coxed package will also give you expected durations, calculated by a different method, without the need to worry about follow-up time. It is also a little more forgiving about the newdata argument, particularly if you have a specific covariate profile in mind. See the package vignette here.
See also this thread for more on predict.coxph().
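For example (a sketch reusing fit and test from the question; the coxed usage follows that package's vignette):
library(survival)
# expected number of events over each subject's follow-up time;
# newdata must contain the response columns for type = "expected"
exp_events <- predict(fit, newdata = test, type = "expected")
library(coxed)
# expected durations via the non-parametric step-function method
ed <- coxed(fit, newdata = test, method = "npsf")
head(ed$exp.dur)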

forecast time to event survival analysis

I'm currently trying to model time to event, where there are three different events that might happen. It's telecommunication data, and I want to predict the expected lifetime of unlocked customers, i.e. customers whose contract period has ended and who can now re-sign on a monthly basis. Customers are unlocked as soon as their 1- or 2-year contract ends, and as time goes by they can churn, retain (buy a new contract) or stay an unlocked customer (so I will need a competing-risks model).
My point of interest is the time until one of these events happens. I was thinking of using a Cox regression model to find the influence of covariates on the survival probability, but since the baseline hazard is not estimated in a Cox model, it will be hard to predict the time to event (right?). I was thinking a parametric survival model might work better then, but I can't really make up my mind from what I have found on the internet so far.
So my question is: is survival analysis the right method to predict time to event? Does anyone have experience with predicting time to event?
You can assume a parametric model for the baseline by using e.g. survival::survreg; this way you avoid the unspecified baseline. Alternatively, you can estimate the non-parametric baseline in-sample with a Cox model; see the type = "expected" argument in ?predict.coxph. A minimal sketch of the parametric route follows.
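For instance, assuming a data frame d with columns time and status and two illustrative covariates (all names here are hypothetical):
library(survival)
fit_w <- survreg(Surv(time, status) ~ age + contract_type, data = d,
                 dist = "weibull")
# predicted median time to event for new customers
predict(fit_w, newdata = newd, type = "quantile", p = 0.5)
Note that this models a single event type; for the competing-risks part you would fit cause-specific models, one per event type, or use a dedicated competing-risks package.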

Predicting future emissions from fitted HMM model

I've fitted an HMM to my data using the hmm.discnp package in R as follows:
library(hmm.discnp)
# lis is a list of observed discrete emission sequences
zs <- hmm(y = lis, K = 5)  # fit a hidden Markov model with K = 5 states
Now I want to predict the next k observations (emissions) from this model, but I am only able to get the most probable state sequence for the observations I already have, via the Viterbi algorithm.
I have t emissions already, i.e. (y(1),...,y(t)).
I want the most probable future k emissions from the fitted HMM object, i.e. (y(t+1),...,y(t+k)).
Is there a function to calculate this? If not, how do I calculate it manually?
Generating emissions from an HMM is pretty straightforward to do manually. I'm not really familiar with R, but I'll explain the steps to generate data as you ask.
The first thing to keep in mind is that, by its Markovian nature, the HMM has no memory: at any time, only the current state is known, and what happened before is "forgotten". This means that the generation of the sample at time t+1 only depends on the sample at time t.
If you have a sequence, the first thing you can do is find the most probable state sequence (with the Viterbi algorithm), as you did. Now you know the state that generated the last observation you have (the one you denote y(t)).
From this state, you know the probabilities of transitioning to every other state of the model, thanks to the transition matrix. This is a probability mass function (pmf), and you can draw a state number from this pmf (not by hand! R has a built-in function, sample(), to draw from a pmf). The state number you draw is the state your system is in at time t+1.
With this information, you can now draw a sample observation from the emission distribution assigned to this new state (likewise, if it were a Gaussian distribution you would use a Gaussian random generator; here the emissions are discrete, so you draw from the state's emission pmf).
From this state at time t+1, you can apply the same procedure to reach a state at time t+2, and so on.
Keep in mind that if you run this full procedure several times (to generate data samples from time t+1 to t+k), you will end up with different results each time. This is due to the probabilistic nature of the model. I am not sure what you mean by the most probable future emissions, and I am not sure whether there are routines for that. You can compute the likelihood of the full sequence you obtain at the end (from 1 to t+k); it will in general be greater than the likelihood of the sequence up to t, since the last part was generated from the model itself and thus fits "perfectly" in some regard.
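A rough R sketch of this procedure, assuming lis is a list of sequences and that the fitted object zs exposes the transition matrix as zs$tpm and the emission probabilities as zs$Rho, with rows indexing emission values and columns indexing states (check ?hmm; the layout of Rho has varied across hmm.discnp versions):
set.seed(42)
states <- viterbi(lis, zs)      # most probable state sequence per series
s <- tail(states[[1]], 1)       # state behind the last observation y(t)
k <- 5
sim <- character(k)
for (i in seq_len(k)) {
  s <- sample(ncol(zs$tpm), 1, prob = zs$tpm[s, ])           # state at t+i
  sim[i] <- sample(rownames(zs$Rho), 1, prob = zs$Rho[, s])  # emission at t+i
}
sim  # one simulated future path y(t+1), ..., y(t+k)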
