survfit.coxph ; Predicting Survival using newdata and ID option - r

I am attempting to use surfit.coxph to predict an estimate of the survival function using the newdata and Id option. I am aware of the limitations of this; the baseline hazard is defined as the average of all covariates and what constitutes a typical patient, but please can we put this aside for one moment;
I am fitting the model;
Model.Cox <- coxph(Surv(Start,Stop, censor) ~ baseline,data = data)
I then try to use;
summary(survfit(Model.Cox, newdata = data,id = Id ))
to predict new data. However, both and
summary(survfit(Model.Cox, newdata = data,id = Id ))$time
summary(survfit(Model.Cox, newdata = data,id = Id ))$surv
give different times than the original data? I would expect predictions for the times in the original dataset, is there a time when this would not be the case?

If times is missing (the default) and censored=FALSE (also its default) then you get only predictions at event times. If your expectation is for predictions only for a limited number of individuals, but at all the times in the original dataset ,then you need to provide a vector of times to the times parameter.
allT <- data$Stop
summfitID <- summary(survfit(Model.Cox, newdata = data,id = Id ), times=allT)
summfitID$time
summfitID$surv
Looking at the code I wondered if the same effect could be had just by setting censored-TRUE in the summary.survfit arguments.

Related

Smote function in R

Anyone knows how to set up the perc.over and perc.under in my case? I tried a couple of combination, but it did not give me good result. I want my target variable to be split into almost 50/50.
I have 266776 for my training set, and the current ratio of my target variable in this dataset is 88/12.
Here is my code.
smoted_data <- SMOTE(Response ~ ., data= train,perc.over = 100)

How to predict future values of time series using h2o.predict

I am going through the book "Hands-on Time series analysis with R" and I am stuck at the example using machine learning h2o package. I don't get how to use h2o.predict function. In the example it requires newdata argument, which is test data in this case. But how do you predict future values of time series if you in fact don't know these values ?
If I just ignore newdata argument I get : predictions with a missing newdata argument is not implemented yet.
library(h2o)
h2o.init(max_mem_size = "16G")
train_h <- as.h2o(train_df)
test_h <- as.h2o(test_df)
forecast_h <- as.h2o(forecast_df)
x <- c("month", "lag12", "trend", "trend_sqr")
y <- "y"
rf_md <- h2o.randomForest(training_frame = train_h,
nfolds = 5,
x = x,
y = y,
ntrees = 500,
stopping_rounds = 10,
stopping_metric = "RMSE",
score_each_iteration = TRUE,
stopping_tolerance = 0.0001,
seed = 1234)
h2o.varimp_plot(rf_md)
rf_md#model$model_summary
library(plotly)
tree_score <- rf_md#model$scoring_history$training_rmse
plot_ly(x = seq_along(tree_score), y = tree_score,
type = "scatter", mode = "line") %>%
layout(title = "Random Forest Model - Trained Score History",
yaxis = list(title = "RMSE"),
xaxis = list(title = "Num. of Trees"))
test_h$pred_rf <- h2o.predict(rf_md, test_h)
test_1 <- as.data.frame(test_h)
mape_rf <- mean(abs(test_1$y - test_1$pred_rf) / test_1$y)
mape_rf
H2O-3 does not support traditional time series algorithms (e.g., ARIMA). Instead, the recommendation is to treat the time series use case as a supervised learning problem and perform time-series specific pre-processing.
For example, if your goal was to predict the sales for a store tomorrow, you could treat this as a regression problem where your target would be the Sales. If you try to train a supervised learning model on the raw data, however, chances are your performance would be pretty poor. So the trick is to add historical attributes like lags as a pre-processing step.
If we trained a model on an unaltered dataset, the Mean Absolute Error is around 35%.
If we start adding historical features like the sales from the previous day for that store, we can reduce the Mean Absolute Error to about 15%.
While H2O-3 does not support lagging, you can leverage Sparkling Water to perform this pre-processing. You can use Spark to generate the lags per group and then use H2O-3 to train the regression model. Here is an example of this process: https://github.com/h2oai/h2o-tutorials/tree/master/best-practices/forecasting
The training data, train_df has to have all the columns listed in both x (c("month", "lag12", "trend", "trend_sqr")) and y ("y"), whereas the data you give to h2o.predict() just has to have the columns in x; the y-column is what will be returned as the prediction.
As you have features (in x) that are things like lag, trend, etc. the fact that it is a time series does not matter. (But you do have to be very careful when preparing those features to make sure you do not use any information in them that was not known at that point in time - but I would imagine the book has already been emphasizing that.)
Normally with a time series, for a given row in the training data, your x data is the data known at time t, and the value in the y column is the value of interest at time t+1. So when doing a prediction, you give the x values as the values at the moment, and the prediction returned is what will happen next.

Using the caret::train package for calculating prediction error (MdAE) of glmms with beta-binomial errors

The question is more or less as the title indicates. I would like to use the caret::train function with beta-binomial models made with glmmTMB package (although I am not opposed to other functions capable of fitting beta-binomial models) to calculate median absolute error (MdAE) estimates through jack-knife (leave-one-out) cross-validation. The glmmTMBControl function is already capable of estimating the optimal dispersion parameter but I was hoping to retain this information somehow as well... or having caret do the calculation possibly?
The dataset I am working with looks like this:
df <- data.frame(Effect = rep(seq(from = 0.05, to = 1, by = 0.05), each = 5), Time = rep(seq(1:20), each = 5))
Ideally I would be able to pass the glmmTMB function to trainControl like so:
BB.glmm1 <- train(Time ~ Effect,
data = df, method = "glmmTMB",
method = "", metric = "MAD")
The output would be as per the examples contained in train, although possibly with estimates for the dispersion parameter.
Although I am in no way opposed to work arounds - Thank you in advance!
I am unsure how to perform the required operation with caret without creating a custom method but I trust it is fairly easy to implement it with a for (lapply) loop.
In the example I will use the sleepstudy data set since your example data throws a bunch of warnings.
library(glmmTMB)
to perform LOOCV - for every row, create a model without that row and predict on that row:
data(sleepstudy,package="lme4")
LOOCV <- lapply(1:nrow(sleepstudy), function(x){
m1 <- glmmTMB(Reaction ~ Days + (Days|Subject),
data = sleepstudy[-x,])
return(predict(m1, sleepstudy[x,], type = "response"))
})
get the median of the residuals (I think this is MdAE? if not post a comment on how its calculated):
median(abs(unlist(LOOCV) - sleepstudy$Reaction))

Excluding ID field when fitting model in R

I have a simple Random Forest model I have created and tested in R. For now I have excluded an internal company ID from my training/testing data frames. Is there a way in R that I could include this column in my data and have the training/execution of my model ignore the field?
I obviously would not want the model to try and incorporate it as a variable, but upon an export of the data with a column added of the predicted outcome, I would need that internal id to tie back in other customer data so I know what customers have been categorized as
I am just using the out of the box random forest function from the randomForest library
#divide data into training and test sets
set.seed(3)
id<-sample(2,nrow(Churn_Model_Data_v2),prob=c(0.7,0.3),replace = TRUE)
churn_train<-Churn_Model_Data_v2[id==1,]
churn_test<-Churn_Model_Data_v2[id==2,]
#changes Churn data 1/2 to a factor for model
Churn_Model_Data_v2$`Churn` <- as.factor(Churn_Model_Data_v2$`Churn`)
churn_train$`Churn` <- as.factor(churn_train$`Churn`)
#churn_test$`Churn` <- as.factor(churn_test$`Churn`)
bestmtry <- tuneRF(churn_train,churn_train$`Churn`, stepFactor = 1.2,
improve =0.01, trace=T, plot=T )
#creates model based on training data, views model
churn_forest <- randomForest(`Churn`~. , data= churn_train )
churn_forest
#shows us what variables are most important
importance(churn_forest)
varImpPlot(churn_forest)
#predicts churn diagnosis on test data
predict_churn <- predict(churn_forest, newdata = churn_test, type="class")
predict_churn
A simple example of excluding a particular column or set of columns is as follows
library(MASS)
temp<-petrol
randomForest(No ~ .,data = temp[, !(colnames(temp) %in% c("SG"))]) # One Way
randomForest(No ~ .-SG,data = temp) #Another way with similar result
This method of exclusion is commonly valid across other fuctions/alogorithms in R too.

svm.predict cannot predict new dataset

the data set is rather simple,
I was trying to see what the prediction of the result would be if i consider result1 as a numeric variable.
so I subsetted the dataset into two part, one for training and the other one for testing randomly, run the svm code, and predict the testing set
wh.train = wh[1:1106,]
wh.test = wh[1107:1120,]
model1 = svm(result1 ~., wh.train)
pred1 = predict(model1, wh.test)
then result come out to me is really weird,
named numeric(0)

Resources