Survival analysis for various time points - R

I am running survival analysis on a very large data set and attempting to examine the impact of a particular variable on survival at various time points (30 days, 90 days, 180 days, 365 days).
I want to run a univariate Cox regression, and I am not sure how to do this properly. The data set contains a variable, "Time", which contains the number of days that each patient has been present in the data set.
At first, I simply subset the main data set at the various time points (i.e., subset by Time <= 30, etc.) and then ran a simple Cox regression on each data frame: coxph(Surv(time, event) ~ x). This was obviously foolish, because each subset only included information leading up to that interval. I have no idea how to attack this problem otherwise, and I was unable to find a good answer through my searches.
Any suggestions would be greatly appreciated. Thank you!
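For reference, a common fix (not from the original post; the data frame df and the columns Time, event, and x are assumptions) is to censor at each horizon rather than subset: anyone still at risk beyond day h is kept but administratively censored at h, so all the follow-up information is used. A minimal sketch:

library(survival)

horizons <- c(30, 90, 180, 365)

fits <- lapply(horizons, function(h) {
  d <- df
  d$event_h <- ifelse(d$Time <= h, d$event, 0)  # events after day h count as censored
  d$time_h  <- pmin(d$Time, h)                  # follow-up truncated at day h
  coxph(Surv(time_h, event_h) ~ x, data = d)
})
names(fits) <- paste0("day_", horizons)

Each fit then estimates the hazard ratio for x using only information up to the corresponding horizon, without discarding patients who survive past it.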

Related

Data organization for running joint latent class mixed models in R

I'm attempting to run a joint latent class mixed model in R (the "lcmm" package) to identify different patterns (latent trajectories) of a patient score reported over roughly 5-6 time points per patient. I'm using a JLCM because it should allow joint modeling of repeated measurements and time-to-event outcomes. More specifically, there's a survival component in the model that I'd like to include for the event of death during the study.
I'm assuming that the data should be in long format, where each row/observation includes a time point (days from baseline), the patient score, and the patient ID. But how should the survival/event data be organized? I assume the event data should be at the person level: the number of days from baseline at which the patient died (indicator = 1), or the day of the patient's last submitted score if the study ended before death (indicator = 0).
Here is the R documentation for the jointlcmm function:
https://www.rdocumentation.org/packages/lcmm/versions/1.8.1.1/topics/Jointlcmm
In general, I'm most interested in how "time" is coded in the survival function, Surv(Time, Indicator), given that my data set is in long format and each row corresponds to a patient score.
There are some papers that provide great resources on the reasons to use this type of model, their R syntax, etc., but I can't find an example data set to refer to. Any help would be greatly appreciated!
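For what it's worth, here is a minimal sketch of the layout Jointlcmm expects, with hypothetical column names (id, day, score, Time, Indicator): the data frame is in long format, one row per patient visit, and the person-level survival columns Time and Indicator are simply repeated on every row of a given patient.

library(lcmm)
library(survival)  # for Surv()

# Toy long-format data (a real fit needs many more patients):
df <- data.frame(
  id        = c(1, 1, 1, 2, 2),
  day       = c(0, 30, 60, 0, 30),    # days from baseline for each score
  score     = c(10, 12, 15, 9, 8),
  Time      = c(90, 90, 90, 45, 45),  # person-level days to death/censoring, repeated
  Indicator = c(1, 1, 1, 0, 0)        # 1 = died, 0 = censored, repeated
)

m <- Jointlcmm(fixed = score ~ day,
               random = ~ day,
               subject = "id",
               survival = Surv(Time, Indicator) ~ 1,
               hazard = "Weibull",
               ng = 1,
               data = df)

So Surv(Time, Indicator) is read from the same long data frame as the repeated measurements; it just has to be constant within each patient.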

How to correctly compute the target variable for stock price prediction 1 day, 14 days, and 30 days ahead?

I am trying to predict stock price movement using different machine learning algorithms with various technical indicators as features. I intend to predict whether the stock price will go up or down 1 day ahead, 14 days ahead, and 30 days ahead.
I am a little bit confused about how to compute the target variables to make the predictions correctly.
So far I have computed daily returns for each firm and constructed a class variable to predict 1-day ahead.
library(dplyr)
data <- data %>% group_by(company) %>% mutate(ret = `CLOSING PRICE` / lag(`CLOSING PRICE`) - 1)
data$class <- ifelse(data$ret >= 0, "Up", "Down")
The problem now is that I am not sure how to properly make predictions 14 and 30 days ahead.
The accuracy of all the models (SVM, RF, and DT) is very similar, around 82-85%, for 1-day ahead predictions. Is this something to be concerned about or is it logical that the accuracy is very similar for all the models?
You need to decide what it is you want to predict at these time points and make the appropriate calculations, as you've already done for the 1-day interval. Some options: you could do something similar to your 1-day target and calculate whether the closing price at day 14 or day 30 is above or below the closing price on day 0, then predict a binary response of "up" or "down". Or you could calculate the actual difference in price between those days and use that as your response; this would be a regression problem rather than a binary classification one. However you decide to calculate your response, compute the same metric in your training data and use that to train your models.
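As a concrete illustration, here is a minimal sketch of the binary version using the poster's dplyr pipeline (column names as in the question; rows assumed to be one per trading day, so 14 rows ahead means 14 trading days, not calendar days):

library(dplyr)

data <- data %>%
  group_by(company) %>%
  mutate(ret14   = lead(`CLOSING PRICE`, 14) / `CLOSING PRICE` - 1,
         ret30   = lead(`CLOSING PRICE`, 30) / `CLOSING PRICE` - 1,
         class14 = ifelse(ret14 >= 0, "Up", "Down"),
         class30 = ifelse(ret30 >= 0, "Up", "Down")) %>%
  ungroup()

The last 14 (or 30) rows of each company get NA targets, since their future price is unknown, and should be dropped before training.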
It's not unusual for different models to offer similar accuracy, especially if you've taken time to tune them all before testing. Do make sure you test against some unseen data, as some models are more prone to over-fitting than others.

Can one improve coefficients in recursion using some machine learning algorithm?

I've got this prediction problem for daily data across several years. My data has both yearly and weekly seasonality.
I tried using the following recurrence (which I just came up with, from nowhere if you like):
x_n = (1/4) * [ x_{n-728} + x_{n-364} + x_{n-7} + (1/6) * (x_{n-1} + x_{n-2} + x_{n-3} + x_{n-4} + x_{n-5} + x_{n-6}) ]
Basically, I am taking into consideration some of the previous days in the week before the day I am trying to predict and also the corresponding day a year and two years earlier. I am doing an average over them.
My question is: can one improve the prediction by replacing the coefficients 1/4, 1/6, etc. with coefficients that would make the mean squared residual smaller?
Personally, I see your problem as a regression problem.
If you have enough data, I would treat it as a time-series prediction task.
You said that the data has yearly and weekly seasonality. To cope with that, you can build two models, one with a weekly window and one dealing with the yearly pattern, and then combine them somehow (a linear combination, or even another model).
However, if you don't have enough data for that, you could try passing the lagged values x_i above as features to a regression model such as linear regression, an SVM, or a feed-forward neural network; in theory it will find coefficients that produce a small enough loss (error).
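As a minimal sketch of that suggestion (assuming a numeric vector x of daily values covering at least roughly three years, so the two-year lag exists), ordinary least squares will find the lag weights that minimize the mean squared residual:

lags <- c(1:6, 7, 364, 728)   # the lags used in the recurrence above
n    <- length(x)
idx  <- (max(lags) + 1):n     # rows where all lags are available

# Design matrix: one column per lag, X[t, j] = x[t - lags[j]]
X <- sapply(lags, function(L) x[idx - L])
colnames(X) <- paste0("lag", lags)

fit <- lm(y ~ ., data = data.frame(y = x[idx], X))
coef(fit)   # learned weights replacing the fixed 1/4, 1/6, ...

The fixed-coefficient recurrence is one particular point in this model's parameter space, so the least-squares fit can only match or improve it in-sample; out-of-sample performance should still be checked on held-out data.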

Forecasting panel data and time series

I have a panel data set of, let's say, 1000 observations, so i = 1, 2, ..., 1000. The data set runs on a daily basis for a month, so t = 1, 2, ..., 31.
I want to estimate an individual-specific model in R:
y_{i,10} = α_i + β_i * y_{i,9} + γ_i * y_{i,8} + ... + δ_i * y_{i,1} + ε_{i,10}
and then produce density forecasts for the next 21 days, that is, density forecasts for y_{i,11}, y_{i,12}, etc.
My questions are:
Can I do this with the plm package? I know how to estimate the model with plm, but I do not know how to produce the forecasts.
Would it be easier (and correct) to treat each individual as a separate time series, fit an ARIMA(9,0,0) to each one, and then get the density forecasts? If so, how can I get the density forecasts?
In (2), how can I include individual-specific effects that are constant over time?
Thanks a lot
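For option (2), here is a minimal sketch (assuming panel is a list of numeric vectors, one 31-day series per individual): forecast::Arima with include.mean = TRUE gives each series its own constant, which plays the role of the time-invariant individual effect, and forecast() returns Gaussian predictive densities via the point forecasts and their standard errors.

library(forecast)

fcasts <- lapply(panel, function(y) {
  fit <- Arima(y, order = c(9, 0, 0), include.mean = TRUE)  # own intercept per individual
  forecast(fit, h = 21, level = 95)                         # forecasts for days 11..31+
})

# Under Gaussian errors, the step-h density forecast is N(mean_h, se_h^2);
# the standard error can be recovered from the interval width:
f1 <- fcasts[[1]]
se <- (f1$upper[, 1] - f1$mean) / qnorm(0.975)

Bear in mind that an AR(9) plus intercept is heavily parameterized for only 31 observations per series, so the individual fits may be unstable.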

Auto-ARIMA function in R giving odd results

I have a day-level dataset covering 3 years.
I ran auto.arima() on it in R for simple time-series forecasting, and it gave me a (2,1,2) model.
When I used this model to predict the variable for the next year, the plot became constant after a few days, which can't be correct.
Since I have daily data for 3 years, with a frequency of 364 days, is ARIMA incapable of handling daily data with large frequencies?
Any help will be appreciated
This sounds like you are trying to forecast too far into the future.
The forecast for tomorrow is going to be accurate, but the forecasts for the day after that and beyond are not influenced much by the past data, and they will therefore settle around some constant when you try to forecast too far into the future. "Too far into the future" can mean as little as two or more time points ahead.
Let's say you have data up until time point T+100, which you used to estimate your ARIMA(2,1,2) model. You can "forecast" the value for time T+1 by pretending you only have data until point T and using your ARIMA(2,1,2) model to forecast T+1. Then move ahead by one period, pretend you only have data until time T+1, and "forecast" T+2. This way you can assess the forecasting accuracy of your ARIMA(2,1,2) model, for example by calculating the mean squared error (MSE) of these one-step "forecasts".
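A minimal sketch of that rolling scheme (assuming a numeric or ts series y): the model is estimated once on the full sample, then re-applied to each truncated sample via Arima's model argument, which reuses the coefficients without re-estimating them.

library(forecast)

fit0    <- Arima(y, order = c(2, 1, 2))   # estimated on all the data, as above
n       <- length(y)
origins <- (n - 100):(n - 1)              # e.g. the last 100 one-step forecasts

errs <- sapply(origins, function(t) {
  refit <- Arima(y[1:t], model = fit0)    # same coefficients, data up to t only
  forecast(refit, h = 1)$mean[1] - y[t + 1]
})

mse <- mean(errs^2)

A large MSE here, or forecasts that drift toward a constant even one step ahead, would suggest the (2,1,2) specification is missing structure such as the weekly or yearly seasonality.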
