I'm trying to predict future returns using the caret package.
I know how to validate my model through time-series cross-validation,
but I don't know how to get the latest prediction value.
As you can see in this picture,
the last values are always held out as the "horizon".
I want to use these values as training data and get the latest prediction, even though I can't validate it anymore.
Should I use the predict function, or are there other good ways?
Here is my code for building the model and the time-series validation.
timecontrol <- trainControl(method = 'timeslice', initialWindow = window_length,
                            horizon = 4, selectionFunction = "best",
                            returnResamp = 'final', fixedWindow = TRUE,
                            savePredictions = 'final')
cur_val_m <- train(test_sample[, -1], test_sample[, 1], method = "knn",
                   trControl = timecontrol,
                   tuneGrid = knnGrid)  # knnGrid should be a data frame of candidate k values, e.g. expand.grid(k = 1:10), not a character string
You need to include some of your code or data. But in general, if we need to predict one step ahead, we can use this:
prediction <- predict(model, yourdata[nrow(yourdata), , drop = FALSE])  # the most recent row of predictors; indexing nrow(yourdata) + 1 would return a row of NAs
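Applied to the objects in the question (cur_val_m and test_sample from the code above), a sketch under the assumption that your predictors are already lagged, so that the newest row corresponds to the period you want to forecast:
latest_x <- test_sample[nrow(test_sample), -1, drop = FALSE]  # most recent row of predictors
latest_pred <- predict(cur_val_m, newdata = latest_x)
Because train refits the winning model on the full data set after resampling, no separate refit is needed before calling predict.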
The website on which I am trying to run the code uses an old version of R and does not accept ranger as a library, so I have to use the caret package. I am trying to process about 800,000 rows in my train data frame, and here is the code I use:
control <- trainControl(method = 'repeatedcv',
                        number = 3,
                        repeats = 1,
                        search = 'grid')
tunegrid <- expand.grid(.mtry = c(sqrt(ncol(train_1))))
fit <- train(value ~ .,
             data = train_1,
             method = 'rf',
             ntree = 73,
             tuneGrid = tunegrid,
             trControl = control)
Looking at previous posts, I tried to tune my control parameters. Is there any way I can make the model run faster? Am I able to specify a particular setting so that it just fits a model with the parameters I set, rather than trying multiple options?
This is my ranger code, which I optimized and which currently gives an accurate model:
fit <- ranger(value ~ .,
              data = train_1,
              num.trees = 73,
              max.depth = 35,
              mtry = 7,
              importance = 'impurity',
              splitrule = "extratrees")
Thank you so much for your time
When you specify method='rf', caret is using the randomForest package to build the model. If you don't want to do all the cross-validation that caret is useful for, just build your model using the randomForest package directly. e.g.
library(randomForest)
fit <- randomForest(value ~ ., data=train_1)
You can specify values for ntree, mtry etc.
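For example, a minimal sketch reusing the values tuned in the ranger call from the question (ntree = 73, mtry = 7); note that randomForest has no direct counterpart to max.depth or splitrule = "extratrees":
library(randomForest)
fit <- randomForest(value ~ ., data = train_1,
                    ntree = 73,         # same number of trees as the ranger model
                    mtry = 7,           # same number of candidate variables per split
                    importance = TRUE)  # also compute permutation importance (impurity importance is reported by default)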
Note that the randomForest package is slow (or just won't work) for large datasets. If ranger is unavailable, have you tried the Rborist package?
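As an aside (this is not part of the original answer), if you prefer to stay inside caret, you can also skip resampling entirely by combining trainControl(method = 'none') with a single-row tuneGrid, so that only the one parameter combination you specify is fit; a sketch reusing the names from the question:
control_none <- trainControl(method = 'none')   # no resampling, no tuning
tunegrid_fixed <- expand.grid(.mtry = 7)        # the single mtry value to use
fit <- train(value ~ ., data = train_1, method = 'rf',
             ntree = 73, tuneGrid = tunegrid_fixed, trControl = control_none)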
I would like to create a Random Forest model with caret. Since there are missing values in the training set, I was looking for possible solutions and came across the option "na.roughfix" from the package "randomForest". If the randomForest library is loaded, this option can be used as the argument for the parameter "na.action" within caret's train function. Inside the train function I use 5-fold CV and tune for the best ROC value. I do this to ensure comparability with other models. The method I've chosen for the Random Forest is "ranger".
But now something strange happens: when I trigger the train function, the calculation starts, but then, for example, the following error message appears:
model fit failed for Fold5: mtry= 7, splitrule=gini, min.node.size= 5 Error : Missing data in columns: ...
The "..." stands for the columns in which the missing values occur. Moreover, this error message always occurs, no matter for which fold or value for mtry.
I am well aware that there are missing values in these columns ... that's why I use na.roughfix. I also remove the NZVs, but that doesn't help either.
I would be very happy about an explanation or even a solution!
Best regards
Edit: I've now seen that, if I want to choose the "na.action" argument in the train function, it does not appear automatically, which it usually does. It seems that it's somehow lost ... maybe this is the reason why caret does not use na.roughfix ...
Edit 2: I guess that this is one part of the problem. train always behaves differently, depending on the previous arguments. In my train function I use a recipe from the recipes package to remove the NZVs. As soon as I remove the recipe, the na.action argument becomes available again. However, now the preProcess argument has vanished, meaning I cannot remove the NZVs anymore. This is really a mess :-/ Is there a possibility to apply the na.action AND the preProcess argument at the same time, or any other solution to my missing-values/NZV problem?
Edit 3: As requested by the user missuse, I try to provide a code example. Unfortunately I cannot provide you with data, since mine is relatively sensitive - thank you for your understanding.
At first, I create a "blueprint" which I hand over to the train function. Here, I remove the Near Zero Variance Variables.
blueprint <- recipe(target ~ ., data = train_data) %>%
step_nzv(all_predictors())
In the next step, I define the trainControl
train_control <- trainControl(method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary,
verboseIter = TRUE)
and a grid:
hyper_grid <- expand.grid(mtry=c(1:(ncol(train_data)-1)),
splitrule = c("gini", "extratrees"),
min.node.size = c(1, 3, 5, 7, 10))
Finally, I put it all together into the train function:
tuned_rf <- train(
blueprint,
data = train_data,
method = "ranger",
metric = "ROC",
trControl = train_control,
tuneGrid = hyper_grid,
na.action = na.roughfix
)
Here, the argument na.action doesn't get suggested by R, meaning that it is not available. This throws the error message from the opening question. However, if I remove the blueprint and write the model like this:
tuned_rf <- train(
target ~ .,
data = train_data,
method = "ranger",
metric = "ROC",
trControl = train_control,
tuneGrid = hyper_grid,
na.action = na.roughfix
)
na.action is available and na.roughfix can be used. However, now the preprocessing is missing. If I want to add the argument "preProcess =" to remove the NZVs, R does not suggest it, meaning that it is not available anymore. Therefore, I would have to replace the formula and the data with the predictors X and the response variable Y. Now preProcess is available again ... but na.action has vanished, so I cannot use na.roughfix.
tuned_rf <- train(
X,
Y,
method = "ranger",
metric = "ROC",
trControl = train_control,
tuneGrid = hyper_grid,
preProcess = "nzv"
)
Of course I could identify the NZVs first and remove them manually - but if I want to apply further steps, the whole process gets complicated.
I hope, my problem is now more understandable ...
From the help (?randomForest::na.roughfix), na.roughfix just performs median/mode imputation, so when using a recipe you can replace it with step_impute_median and step_impute_mode.
Your blueprint would look like:
library(recipes)
blueprint <- recipe(target ~ ., data = train_data) %>%
step_nzv(all_predictors()) %>%
step_impute_median(all_numeric()) %>%
step_impute_mode(all_nominal())
Perhaps also try
blueprint <- recipe(target ~ ., data = train_data) %>%
step_impute_median(all_numeric()) %>%
step_impute_mode(all_nominal()) %>%
step_nzv(all_predictors())
Depending on how step_nzv handles missing values.
I would also check performance with other imputing functions like
step_impute_bag
step_impute_knn
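For instance, a minimal sketch of the same blueprint with knn imputation instead of median/mode (neighbors = 5 is only an illustrative value):
blueprint_knn <- recipe(target ~ ., data = train_data) %>%
  step_nzv(all_predictors()) %>%
  step_impute_knn(all_predictors(), neighbors = 5)   # imputes numeric and nominal predictors from their 5 nearest neighbours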
I am new to R and trying to generate predictions of a "yes"/"no" variable using an ensemble model. To do so, I am using caret to generate predictions with a random forest (ranger), a LASSO (glmnet) and a gradient boosted regression tree (xgbLinear) model. My dataset contains around 600k observations and 500 variables (a mix of continuous and binary variables, weighing 322 MB), of which I am using 30% to train the models.
Here is the code I am using for this:
train_control_final <- trainControl(method = "none", savePredictions = TRUE,
                                    allowParallel = TRUE, classProbs = TRUE,
                                    summaryFunction = twoClassSummary)
rff_final <- train(training_y ~ ., data = training_final, method = "ranger",
                   tuneGrid = rfgrid_final, num.trees = 250, metric = "ROC",
                   sample.fraction = 0.1, replace = TRUE,
                   trControl = train_control_final, maximize = FALSE, na.action = na.omit)
rboost_final <- train(training_y ~ ., data = training_final, method = "xgbLinear",
                      tuneGrid = boostgrid_final, metric = "ROC", subsample = 0.1,
                      trControl = train_control_final, maximize = FALSE, na.action = na.omit)
rlasso_final <- train(training_y ~ ., data = training_final, method = "glmnet",
                      tuneGrid = lassogrid_final, metric = "ROC",
                      trControl = train_control_final, maximize = FALSE, na.action = na.omit)
In a first step, I use a different 10% of the sample to tune the parameters for each model (using 3-fold CV), with the results stored in rf/lasso/boostgrid_final (for this, I use the same code structure).
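For reference, a hypothetical sketch of that tuning step (tuning_sample is an assumed name for the 10% tuning subsample; only the ranger model is shown): the best parameter combination is stored as a single-row data frame and reused as the tuneGrid in the method = "none" fits above.
train_control_tune <- trainControl(method = "cv", number = 3, classProbs = TRUE,
                                   summaryFunction = twoClassSummary)
rff_tune <- train(training_y ~ ., data = tuning_sample, method = "ranger",
                  tuneLength = 5, metric = "ROC", trControl = train_control_tune)
rfgrid_final <- rff_tune$bestTune   # single-row data frame of the winning mtry/splitrule/min.node.size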
The problem is that every time I run this code to generate the predictions, R crashes because of memory issues. The code works without problems when tuning the parameters and when including fewer variables (around 60). So my question is: what can I try to make this work with the full dataset? Or is there an alternative way of accomplishing what I want (generating predictions using these three different algorithms) without running into memory issues?
Thanks a lot in advance!
I am using the caret package's train function to fit a model and then the predict function to predict values on an unknown data set (on which I then get feedback, so I know the quality of my predictions). I'm having problems, and I'm convinced it has to do with preprocessing the unknown data.
Briefly and simply, this is what I'm doing:
Pre-Process Training Data:
preproc = preProcess(train_num,method = c("center", "scale"))
train_standardized <- predict(preproc, train_num)
Train the Model:
gbmGrid <- expand.grid(interaction.depth = c(1, 5, 9),
n.trees = c(100,500),
shrinkage = 0.1,
n.minobsinnode = 20)
train.boost = train(x=train_standardized[,-length(train_standardized)],
y=train_standardized$response,
method = "gbm",
metric = "ROC",
maximize = FALSE,
tuneGrid= gbmGrid,
trControl = trainControl(method="cv",
number=5,
classProbs = TRUE,
verboseIter = TRUE,
summaryFunction=twoClassSummary,
savePredictions = TRUE))
Prepare unknown data for predictions:
...
unknown_standardized <- predict(preproc, unknown_num)
...
Make the actual prediction on the unknown data:
preds <- predict(train.boost,newdata=unknown_standardized,type="prob")
Note that the "preproc" object is the same one resulting from analysis of the training set; it was used to produce the centered/scaled training data on which the model was trained.
When I get my evaluation back on the unknown data, it is substantially worse than what was estimated from the training set (ROC via cross-validation on the training data is about .83; ROC on the unknown data, as reported by the evaluating party, is about .70).
Do I have the process right? What am I doing wrong?
Thanks in advance.
In one sense, you are not doing anything wrong at all.
A predictor is likely to do better on the training sample because it has used that data to build the model.
The whole point of evaluating on held-out data is to see how well the model generalizes. It is likely to "overfit" the training data to a greater or lesser extent and to do somewhat worse on new data.
At least once you have your score against new data, you know the true accuracy of the model. If that accuracy is sufficient for your purposes, then the model will be usable and (because you have done the training/test evaluation) robust to new data.
Now, it is possible that the model could be better if it was trained on a wider variety of data. So to increase real accuracy, it might be worth using cross-validation to train it on multiple slices of the data - k fold cross-validation. Caret has a nice facility for that. http://machinelearningmastery.com/how-to-estimate-model-accuracy-in-r-using-the-caret-package/
I am using the caret package to fine-tune the random forest mtry parameter. In the package, the tuneLength parameter can be used to automate the search for the best mtry. But the problem is that tuneLength only works when I set at least 2 folds in cross-validation; it does not work when I do not want cross-validation.
ctrl <- trainControl(method = "cv", classProbs = TRUE, summaryFunction = twoClassSummary, number = 2)
set.seed(2)
trained <- train(Y ~ ., data = mydata, method = "rf", ntree = 500,
                 tuneLength = 10,  # caret's argument is tuneLength (case-sensitive)
                 metric = "ROC", trControl = ctrl, importance = TRUE)
Also, does anyone know the default setting of tuneLength? I mean, which values of mtry it would start with.
I think you don't understand what parameter tuning means. You want to select the combination of parameters that improves some quality measure. The thing is that this quality measure can't be computed on the training set itself, because that would lead to overfitting. Cross-validation gives you precisely an unbiased estimate of your quality measure.
But the problem is that tuneLength only works when I set at least 2 folds in cross-validation; it does not work when I do not want cross-validation.
I'm not sure what "does not work" means. If you are not resampling, there are not many ways for determining mtry. You could use method = "OOB" in trainControl and use the internal random forest estimate and set tuneLength the same way you did before (see these two pages for more details).
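For example, a minimal sketch of that suggestion, reusing the objects from the question (with method = "OOB" the random forest's out-of-bag error is used to pick mtry, so the ROC/twoClassSummary settings no longer apply and the default Accuracy metric is used):
ctrl_oob <- trainControl(method = "OOB")
set.seed(2)
trained_oob <- train(Y ~ ., data = mydata, method = "rf", ntree = 500,
                     tuneLength = 10, trControl = ctrl_oob, importance = TRUE)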
Again, I'm not sure if this answers your question.
Max