GBM cross validation - r

I'm trying to use R's gbm regression model.
I want to compute the coefficient of determination (R squared) between the cross-validation predicted response values and the true response values. However, the cv.fitted values of the gbm.object only provide predicted response values for the 1-train.fraction portion of the data. So, to get what I want, I need to find out which observations the cv.fitted values correspond to.
Any idea how to get that information?

You can use the predict function to easily get at model predictions, if I'm understanding your question correctly.
dat <- data.frame(y = runif(1000), x=rnorm(1000))
gbmMod <- gbm::gbm(y~x, data=dat, n.trees=5000, cv.folds=0)
summary(lm(predict(gbmMod, n.trees=5000) ~ dat$y))$adj.r.squared
But shouldn't we hold data to the side and assess model accuracy on test data? This would correspond to the following, where I partition the data into a training set (70%) and testing set (30%):
inds <- sample(1:nrow(dat), 0.7*nrow(dat))
train <- dat[inds, ]
test <- dat[-inds, ]
gbmMod2 <- gbm::gbm(y~x, data=train, n.trees=5000)
preds <- predict(gbmMod2, newdata = test, n.trees=5000)
summary(lm(preds ~ test[,1]))$adj.r.squared
It's also worth noting that the number of trees in the gbm can be tuned using the gbm.perf function and the cv.folds argument to the gbm function. This helps avoid overfitting.
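If the goal is an R squared on cross-validated predictions with a known mapping back to the rows, one option is to run the folds by hand. The following is only a minimal sketch under stated assumptions: lm() stands in for gbm so that it runs with base R alone, the data are simulated, and the names fold_id, cv_pred and r_squared are made up for illustration.

```r
# Manual k-fold cross-validation that records which row each
# held-out prediction belongs to. lm() is a stand-in for gbm here
# (an assumption, chosen so the sketch needs only base R).
set.seed(42)
dat <- data.frame(x = rnorm(200))
dat$y <- 2 * dat$x + rnorm(200)

k <- 5
# assign every row to exactly one fold
fold_id <- sample(rep(seq_len(k), length.out = nrow(dat)))
cv_pred <- rep(NA_real_, nrow(dat))

for (f in seq_len(k)) {
  held_out <- fold_id == f
  fit <- lm(y ~ x, data = dat[!held_out, ])
  # store predictions at the positions of the held-out rows
  cv_pred[held_out] <- predict(fit, newdata = dat[held_out, ])
}

# R squared as 1 - SS_res / SS_tot
r_squared <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}
cv_r2 <- r_squared(dat$y, cv_pred)
```

Because cv_pred is filled by row position, cv_pred[i] is always the out-of-fold prediction for dat[i, ], so no extra bookkeeping is needed to line predictions up with the true responses.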

Related

Random Forest model yields incorrect predictions despite having accuracy of over 99 percent

For a ML course, I am supposed to build a model based on the training set to predict the variable "classe" on a validation set. I removed all unnecessary variables in the training set, used cross validation to prevent over-fitting, and made sure the validation set matched the training set in terms of which columns are removed. When I predict classe in the validation set, it yields all classe A, and I know this is incorrect.
I included the entire script below.
Where did I go wrong?
library(caret)
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", "train.csv")
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", "test.csv")
train <- read.csv("./train.csv")
val <- read.csv("./test.csv")
#getting rid of columns with NAs
nas <- sapply(train, function(x) sum(is.na(x)))
train <- train[, nas<1900]
#removing near zero variance columns
remove <- nearZeroVar(train)
train <- train[, -remove]
#create partition in our training set
set.seed(8675309)
inTrain <- createDataPartition(train$classe, p = .7, list = FALSE)
training <- train[inTrain,]
testing <- train[-inTrain,]
model <- train(classe ~ ., method = "rf", data = training)
confusionMatrix(predict(model, testing), testing$classe)
#make sure validation set has same features as training set
trainforvalid <- subset(training, select = -classe)
val <- val[, colnames(trainforvalid)]
predict(model, val)
#the above step yields all predictions as classe A
This might be happening because the data are unbalanced. If there are far more data points for class A than for class B, the model may simply learn to always predict class A.
In that case, try a metric better suited to unbalanced data, such as the F1 score.
I also recommend using techniques like oversampling or undersampling to avoid the unbalanced data issue.
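To check for imbalance and look at a per-class F1 score, something along these lines could be used. It is a sketch in base R with made-up labels (truth and pred are hypothetical vectors, not the course data); with the real data you would pass testing$classe and the predictions instead.

```r
# Hypothetical true and predicted labels, for illustration only
lv <- c("A", "B", "C")
truth <- factor(c("A", "A", "A", "A", "A", "A", "B", "B", "C", "C"), levels = lv)
pred  <- factor(c("A", "A", "A", "A", "A", "A", "A", "B", "A", "C"), levels = lv)

# class counts show how unbalanced the labels are
class_counts <- table(truth)

# confusion matrix: rows are predictions, columns are true classes
cm <- table(pred = pred, truth = truth)
tp <- diag(cm)

precision <- tp / rowSums(cm)  # correct predictions / all predictions of that class
recall    <- tp / colSums(cm)  # correct predictions / all true members of that class
f1 <- 2 * precision * recall / (precision + recall)

accuracy <- sum(tp) / sum(cm)
```

In this toy example the overall accuracy is 0.8, while the F1 scores for the minority classes B and C are noticeably lower than for A, which is the kind of gap that accuracy alone hides.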

Estimate prediction accuracy of cox ph

I would like to develop a Cox proportional hazards model with R, use it to make predictions and evaluate the accuracy of the model. For the evaluation I would like to use the Brier score.
# import various packages, needed at some point of the script
library("survival")
library("survminer")
library("prodlim")
library("randomForestSRC")
library("pec")
library("rpart")
library("mlr")
library("Hmisc")
library("ipred")
# load lung cancer data
data("lung")
head(lung)
# recode status variable
lung$status <- lung$status-1
# Delete rows with missing values
lung <- na.omit(lung)
# split data into training and testing
## 80% of the sample size
smp_size <- floor(0.8 * nrow(lung))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(lung)), size = smp_size)
# training and testing data
train.lung <- lung[train_ind, ]
test.lung <- lung[-train_ind, ]
# time and failure event
s <- Surv(train.lung$time, train.lung$status)
# create model
cox.ph2 <- coxph(s~age+meal.cal+wt.loss, data=train.lung)
# predict
pred <- predict(cox.ph2, newdata = train.lung)
# evaluate
sbrier(s, pred)
As the outcome of the prediction I would expect a time (as in "when does this individual experience failure?"). Instead I get values like this:
[1] 0.017576359 -0.135928959 -0.347553969 0.112509137 -0.229301199 -0.131861582 0.044589175 0.002634008
[9] 0.345966978 0.209488560 0.002418358
What does that mean?
Furthermore sbrier does not work. Apparently it can not work with the prediction pred (no surprise there)
How do I solve this? How do I make a prediction with cox.ph2? How can I evaluate the model afterwards?
The predict() function won't return a time value. Its type argument takes one of "lp", "risk", "expected", "terms" or "survival"; the default is "lp", the linear predictor, which is what your output shows.
If you want to get the hazard ratios:
predict(cox.ph2, newdata = test.lung, type = "risk")
Note that you want to predict the values on the test set, not the training set.
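As a rough sketch of how the prediction types relate, assuming only the survival package and its bundled lung data: the object names lung2, fit and surv_365, and the one-year time point, are chosen here for illustration and are not part of the question.

```r
library(survival)

# keep only the variables used in the model, then drop incomplete rows
lung2 <- na.omit(lung[, c("time", "status", "age", "meal.cal", "wt.loss")])
lung2$status <- lung2$status - 1  # recode 1/2 to 0/1

set.seed(123)
train_ind <- sample(seq_len(nrow(lung2)), size = floor(0.8 * nrow(lung2)))
train.lung <- lung2[train_ind, ]
test.lung  <- lung2[-train_ind, ]

fit <- coxph(Surv(time, status) ~ age + meal.cal + wt.loss, data = train.lung)

# linear predictor (the default) and risk score on the test set
lp   <- predict(fit, newdata = test.lung, type = "lp")
risk <- predict(fit, newdata = test.lung, type = "risk")

# predicted survival curves for the test rows, read off at 365 days
sf <- survfit(fit, newdata = test.lung)
surv_365 <- as.vector(summary(sf, times = 365)$surv)
```

Here risk is exp(lp), so neither is a time. The survival curves from survfit() give, for each test row, the estimated probability of surviving past a chosen time, which is usually the closest thing to a "when" that a Cox model provides.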
I have read that you can use AFT models in your case:
https://stats.stackexchange.com/questions/79362/how-to-get-predictions-in-terms-of-survival-time-from-a-cox-ph-model
You can also read this post:
Calculate the Survival prediction using Cox Proportional Hazard model in R
Hope this helps.

Get row number for prediction with caret

I use caret a lot for my machine learning tasks in R and I like it a lot.
But I face the following problem:
I train a model in caret, say a linear regression with lm()
When I want to score new data, I do: predict(model, new_data)
When new_data contains missing values in my predictors, predict returns no prediction for those rows, instead of, say, NA.
Is it possible to either:
return a prediction for all rows in new_data with a prediction of NA when it is not possible or
return predictions + the row number of the dataframe the prediction corresponds to?
E.g. like the mlr-package does with an id-column that shows which row the prediction corresponds to:
Here is the link to the mlr-predict page with more details:
mlr-package: predict with row-id
Any help greatly appreciated!
You can identify the cases with missing values prior to running caret::train() by creating a new column with the row names in your data set, since these default to the row numbers in the data frame.
Using the Sonar data set from the mlbench package as an illustration:
library(mlbench)
data(Sonar)
library(caret)
set.seed(95014)
# add row numbers
Sonar$rowId <- rownames(Sonar)
# create training & testing data sets
inTraining <- createDataPartition(Sonar$Class, p = .75, list=FALSE)
training <- Sonar[inTraining,]
testing <- Sonar[-inTraining,]
# set column 60 to NA for some values in test data
testing[48:51,60] <- NA
testing[!complete.cases(testing),"rowId"]
...and the output:
> testing[!complete.cases(testing),"rowId"]
[1] "193" "194" "200" "206"
You can then run predict() on the rows in the test data set that have complete cases. Again using the Sonar dataset with a random forest model and 3 fold cross validation to expedite processing:
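fitControl <- trainControl(method = "cv", number = 3)
# x and y were undefined in the original call; use the formula interface on the
# training set and drop the rowId helper column so it is not used as a predictor
fit <- train(Class ~ ., data = training[, names(training) != "rowId"], method = "rf", trControl = fitControl)
predicted <- predict(fit, testing[complete.cases(testing), ])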
Another way to handle this situation is to use an imputation strategy to eliminate the missing values for the independent variables in your model. My article on Github, Strategies for Handling Missing Values links to a number of research papers on this topic.
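To get one prediction per row of new_data, with NA where a prediction is not possible, the complete-case predictions can be written back into a vector of the full length. This is a minimal sketch with base R and simulated data; lm() stands in for a caret model, and train_df, new_data and pred_all are hypothetical names.

```r
# Simulated training data and a simple model (stand-in for a caret fit)
set.seed(1)
train_df <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
train_df$y <- 1 + 2 * train_df$x1 - train_df$x2 + rnorm(50, sd = 0.1)
model <- lm(y ~ x1 + x2, data = train_df)

# New data with missing predictor values in rows 2 and 3
new_data <- data.frame(x1 = c(0.5, NA, -1, 2), x2 = c(1, 0, NA, 0.3))

# Predict only for complete rows, then place results by row position
ok <- complete.cases(new_data[, c("x1", "x2")])
pred_all <- rep(NA_real_, nrow(new_data))
pred_all[ok] <- predict(model, newdata = new_data[ok, ])

# One row per input row, with the row number alongside the prediction
result <- data.frame(row = seq_len(nrow(new_data)), prediction = pred_all)
```

The same indexing pattern should carry over to a model fitted with caret::train(), since it only relies on predict() returning one value per complete row.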

Different no of tuples for the prediction model and test set data in SVM

I have a dataset with two columns as shown below, where Column 1, timestamp, is a particular point in time and Column.10 gives the total power usage at that point in time. There are 81502 instances in total.
I'm doing support vector regression on this data in R using the e1071 package to predict the future usage of power. The code is given below. I first divided the dataset into training and test data. Then I fitted a model to the training data using the svm function and predicted the power usage for the test set.
library(e1071)
attach(data.csv)
index <- 1:nrow(data.csv)
testindex <- sample(index,trunc(length(index)/3))
testset <- na.omit(data.csv[testindex, ])
trainingset <- na.omit(data.csv[-testindex, ])
model <- svm(Column.10 ~ timestamp, data=trainingset)
prediction <- predict(model, testset[,-2])
tab <- table(pred = prediction, true = testset[,2])
However, when I try to make a confusion matrix from the prediction, I'm getting the error:
Error in table(pred = prediction, true = testset[, 2]) : all arguments must have the same length
So I tried to find the length of the two arguments and found that
the length(prediction) to be 81502
and the length(testset[,2]) to be 27167
Since I had done the prediction only for the testset, I don't know how predictions were made for 81502 values. Why does the number of values differ between the prediction and the testset? How is the power value for the entire dataset getting predicted even though I passed only the testset?
Change
prediction <- predict(model, testset[,-2])
to
prediction <- predict(model, testset)
However, you should not use table when doing regression; use the MSE instead.
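For reference, the MSE is just the mean of the squared differences between observed and predicted values. A small sketch with made-up numbers follows (actual and predicted are hypothetical vectors; with the question's data they would be testset[, 2] and prediction).

```r
# Hypothetical observed and predicted values, for illustration only
actual    <- c(3.1, 2.8, 4.0, 5.2)
predicted <- c(3.0, 3.0, 3.5, 5.0)

# the two vectors must line up one-to-one
stopifnot(length(actual) == length(predicted))

mse  <- mean((actual - predicted)^2)  # mean squared error
rmse <- sqrt(mse)                     # same units as the response
```

The length check up front catches the mismatch described in the question before any metric is computed.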

how to get predictions using gbm in r

fit <- gbm(Crop_Damage ~ Estimated_Insects_Count+Crop_Type+ Soil_Type
+Pesticide_Use_Category+Number_Doses_Week+Number_Weeks_Used
+Number_Weeks_Quit+Season,
data = mydata, distribution="multinomial")
gbmpred <- predict(fit,mydata,n.trees = fit$n.trees)
I tried the above code, but it gives me probabilities. I want to get class predictions.
By default, the predict function returns values on the scale of f(x), so in the multinomial (classification) case you get one value per class (a probability when type = "response" is used) rather than a class label. All you have to do is translate them into responses by taking, for each row, the class with the highest value.
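Taking the highest-scoring class per row can be sketched as below. The matrix probs is a hypothetical stand-in for the predict output, and the class labels "0", "1" and "2" are assumed for illustration. If predict returns a three-dimensional array (rows, classes, number of n.trees values), the last dimension would need to be dropped first, for example with gbmpred[, , 1]; treat that shape as an assumption to verify on your own object.

```r
# Hypothetical per-class probabilities: one row per observation,
# one named column per class
probs <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.3, 0.6,
                  0.2, 0.5, 0.3),
                nrow = 3, byrow = TRUE,
                dimnames = list(NULL, c("0", "1", "2")))

# index of the largest value in each row, mapped back to the class label
pred_class <- colnames(probs)[apply(probs, 1, which.max)]
```

The result is a character vector with one label per row, which can be converted with factor() if it needs to be compared against the original response.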
