I am trying to run different regression models on the prostate cancer data from the lasso2 package. For the lasso, I have seen two different methods to calculate the mean squared error, but they give me quite different results, so I would like to know whether I am doing something wrong or whether it just means that one method is better than the other.
# Needs the following R packages.
library(lasso2)
library(glmnet)
# Gets the prostate cancer dataset
data(Prostate)
# Defines the Mean Square Error function
mse = function(x,y) { mean((x-y)^2)}
# 75% of the sample size.
smp_size = floor(0.75 * nrow(Prostate))
# Sets the seed to make the partition reproducible.
set.seed(907)
train_ind = sample(seq_len(nrow(Prostate)), size = smp_size)
# Training set
train = Prostate[train_ind, ]
# Test set
test = Prostate[-train_ind, ]
# Creates matrices for independent and dependent variables.
xtrain = model.matrix(lpsa~. -1, data = train)
ytrain = train$lpsa
xtest = model.matrix(lpsa~. -1, data = test)
ytest = test$lpsa
# Fitting a linear model by Lasso regression on the "train" data set
pr.lasso = cv.glmnet(xtrain,ytrain,type.measure='mse',alpha=1)
lambda.lasso = pr.lasso$lambda.min
# Getting predictions on the "test" data set and calculating the mean square error
lasso.pred = predict(pr.lasso, s = lambda.lasso, newx = xtest)
# Calculating MSE via the mse function defined above
mse.1 = mse(lasso.pred,ytest)
cat("MSE (method 1): ", mse.1, "\n")
# Calculating MSE via the cvm attribute inside the pr.lasso object
mse.2 = pr.lasso$cvm[pr.lasso$lambda == lambda.lasso]
cat("MSE (method 2): ", mse.2, "\n")
So these are the outputs I got for both MSE:
MSE (method 1): 0.4609978
MSE (method 2): 0.5654089
And they're quite different. Does anyone know why ?
Thanks a lot in advance for your help!
Samuel
As pointed out by @alistaire, in the first case you are using the test data to compute the MSE, while in the second case the MSE from the cross-validation (training) folds is reported, so it's not an apples-to-apples comparison.
We can do something like the following to make an apples-to-apples comparison (by keeping the fitted values on the training folds), and as we can see, mse.1 and mse.2 are exactly equal when computed on the same training folds (although the value is a little different from yours; on my desktop I have R version 3.1.2, x86_64-w64-mingw32, Windows 10):
# Needs the following R packages.
library(lasso2)
library(glmnet)
# Gets the prostate cancer dataset
data(Prostate)
# Defines the Mean Square Error function
mse = function(x,y) { mean((x-y)^2)}
# 75% of the sample size.
smp_size = floor(0.75 * nrow(Prostate))
# Sets the seed to make the partition reproducible.
set.seed(907)
train_ind = sample(seq_len(nrow(Prostate)), size = smp_size)
# Training set
train = Prostate[train_ind, ]
# Test set
test = Prostate[-train_ind, ]
# Creates matrices for independent and dependent variables.
xtrain = model.matrix(lpsa~. -1, data = train)
ytrain = train$lpsa
xtest = model.matrix(lpsa~. -1, data = test)
ytest = test$lpsa
# Fitting a linear model by Lasso regression on the "train" data set
# keep the fitted values on the training folds
pr.lasso = cv.glmnet(xtrain,ytrain,type.measure='mse', keep=TRUE, alpha=1)
lambda.lasso = pr.lasso$lambda.min
lambda.id <- which(pr.lasso$lambda == pr.lasso$lambda.min)
# get the predicted values on the training folds with lambda.min (not from test data)
mse.1 = mse(pr.lasso$fit.preval[, lambda.id], ytrain)
cat("MSE (method 1): ", mse.1, "\n")
MSE (method 1): 0.6044496
# Calculating MSE via the cvm attribute inside the pr.lasso object
mse.2 = pr.lasso$cvm[pr.lasso$lambda == lambda.lasso]
cat("MSE (method 2): ", mse.2, "\n")
MSE (method 2): 0.6044496
mse.1 == mse.2
[1] TRUE
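As an extra sanity check, here is a minimal sketch (relying on the foldid vector that cv.glmnet returns when keep = TRUE) that rebuilds cvm at lambda.min from the per-fold errors, weighting each fold by its size, which is the default behaviour with unit observation weights:
# prevalidated squared errors at lambda.min
sq.err <- (ytrain - pr.lasso$fit.preval[, lambda.id])^2
# per-fold mean squared error
fold.mse <- tapply(sq.err, pr.lasso$foldid, mean)
# fold-size-weighted average should match pr.lasso$cvm[lambda.id] (mse.2 above)
weighted.mean(fold.mse, as.numeric(table(pr.lasso$foldid)))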
Here is a simple reproducible example of my dilemma.
My goal is very simple: train a random forest model using 'ranger' and then run the training data through the trained model to double-check it against the model's stored predicted values. The two sets of predicted values do not match. In fact, running the training data through the trained model reproduces the actual training target values exactly.
Others have posted related questions to the forum, though I have yet to find a definitive answer.
This makes no sense to me: feeding a trained model the data it was trained on should give consistent results.
library(tidyverse)
library(ranger)
train <- tibble(target   = ifelse(runif(1000) > 0.5, 1, 0),
                feature1 = runif(1000),
                feature2 = runif(1000))
# Train the model
rf <- ranger(
  target ~ .,
  data = train,
  classification = TRUE,
  num.trees = 500,
  seed = 123
)
# Obtain the in-sample model predictions
in_sample_predictions_2 <- rf$predictions
# Alternatively, run the train data through the trained model to ensure the results are the same as provided by 'in_sample_predictions_2'
in_sample_predictions_1 <- predict( rf, data = train )$predictions
# Check for equivalency fails, Sum of Squared differences should equal 0
sum( (in_sample_predictions_1 - in_sample_predictions_2 )^2 )
# [1] 506
# Yet the number of mismatching predictions exactly equals the OOB error rate: 50.6% = 506/1000
# It turns out that the predictions obtained by running the train data through the trained model are identical to the actual train 'target' values.
sum( (train$target - in_sample_predictions_1)^2 )
# [1] 0
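For what it's worth, ranger's documentation describes the predictions element of a fitted ranger object as out-of-bag predictions (each tree predicts only the rows it did not train on), while predict() on the training data uses the whole forest. A minimal sketch reusing the objects above makes the two error rates explicit:
# rf$predictions are out-of-bag predictions; predict(rf, data = train) uses every tree
mean(in_sample_predictions_2 != train$target)  # OOB error rate (~0.506 here)
mean(in_sample_predictions_1 != train$target)  # in-sample error of the full forest (0 here)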
I have a dataset that contains information about patients. It includes several variables and their clinical status (0 if they are healthy, 1 if they are sick).
I have tried to implement an SVM model to predict patient status based on these variables.
library(e1071)
library(ROCR)   # for prediction() and performance()
Index <- order(Ytrain, decreasing = FALSE)
SVMfit_Var <- svm(Xtrain[Index, ], Ytrain[Index],
                  type = "C-classification", gamma = 0.005,
                  probability = TRUE, cost = 0.001, epsilon = 0.1)
preds1 <- predict(SVMfit_Var, Xtest, probability = TRUE)
preds1 <- attr(preds1, "probabilities")[, 1]
samples <- !is.na(Ytest)
pred <- prediction(preds1[samples], Ytest[samples])
AUC <- performance(pred, "auc")@y.values[[1]]
prediction <- predict(SVMfit_Var, Xtest)
xtab <- table(Ytest, prediction)
To test the performance of the model, I have calculated the ROC AUC, and with the validation set I obtain an AUC = 0.997.
But when I look at the predictions, all of the patients have been classified as healthy.
AUC = 0.997
> xtab
prediction
Ytest 0 1
0 72 0
1 52 0
Can anyone help me with this problem?
Did you look at the probabilities versus the fitted values? You can read about how probability works with SVM here.
If you want to look at the performance, you can use the DescTools library with its Conf function, or the caret library with its confusionMatrix function. (They provide the same output.)
library(DescTools)
library(caret)
# for the training performance with DescTools
Conf(table(SVMfit_Var$fitted, Ytrain[Index]))
# svm.model$fitted, y-values for training
# training performance with caret
confusionMatrix(SVMfit_Var$fitted, as.factor(Ytrain[Index]))
# svm.model$fitted, y-values
# if y.values aren't factors, use as.factor()
# for testing performance with DescTools
# with `table()` in your question, you must flip the order:
# predicted first, then actual values
Conf(table(prediction, Ytest))
# and for caret
confusionMatrix(prediction, as.factor(Ytest))
Your question isn't reproducible, so I went through this with the iris data. There, too, the predicted probability was the same for every observation; I included it so you can see the behaviour with another data set.
library(e1071)
library(ROCR)
library(caret)
library(dplyr)   # for %>% and filter()
data("iris")
# make it binary
df1 <- iris %>% filter(Species != "setosa") %>% droplevels()
# check the subset
summary(df1)
set.seed(395) # keep the sample repeatable
tr <- sample(1:nrow(df1), size = 70,  # 70%
             replace = F)
# create the model
svm.fit <- svm(df1[tr, -5], df1[tr, ]$Species,
               type = "C-classification",
               gamma = .005, probability = T,
               cost = .001, epsilon = .1)
# look at probabilities
pb.fit <- predict(svm.fit, df1[-tr, -5], probability = T)
# this shows EVERY row has the same outcome probability distro
pb.fit <- attr(pb.fit, "probabilities")[,1]
# look at performance
performance(prediction(pb.fit, df1[-tr, ]$Species), "auc")@y.values[[1]]
# [1] 0.03555556 -- that's abysmal!!
# test the model
p.fit = predict(svm.fit, df1[-tr, -5])
confusionMatrix(p.fit, df1[-tr, ]$Species)
# 93% accuracy with NIR at 50%... the AUC score was not useful
# check the trained model performance
confusionMatrix(svm.fit$fitted, df1[tr, ]$Species)
# 87%, with NIR at 50%... that's really good
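One extra check worth doing on your own data (a minimal sketch reusing the iris objects above; the choice of "virginica" as the positive class is just for illustration): e1071 names the columns of the probabilities attribute after the class labels, and the column order follows the model's internal class ordering rather than necessarily the factor levels, so grabbing column 1 blindly can feed the AUC calculation the probability of the wrong class and make a decent classifier look abysmal (or vice versa).
# the probability matrix has one named column per class; check the names
pb.mat <- attr(predict(svm.fit, df1[-tr, -5], probability = TRUE), "probabilities")
colnames(pb.mat)
# use the column of the class you treat as "positive" when computing AUC
pb.pos <- pb.mat[, "virginica"]
performance(prediction(pb.pos, df1[-tr, ]$Species == "virginica"), "auc")@y.values[[1]]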
I am trying to write my own gradient boosting algorithm. I understand there are existing packages like gbm and xgboost, but I wanted to understand how the algorithm works by writing my own.
I am using the iris data set, and my outcome is Sepal.Length (continuous). My loss function is mean(1/2*(y-yhat)^2) (basically the mean squared error with a factor of 1/2 in front), so the corresponding negative gradient, which the base learners are fitted to, is just the residual y - yhat. I'm initializing the predictions at 0.
library(rpart)
data(iris)
#Define gradient
grad.fun <- function(y, yhat) {return(y - yhat)}
mod <- list()
grad_boost <- function(data, learning.rate, M, grad.fun) {
# Initialize fit to be 0
fit <- rep(0, nrow(data))
grad <- grad.fun(y = data$Sepal.Length, yhat = fit)
# Initialize model
mod[[1]] <- fit
# Loop over a total of M iterations
for(i in 1:M){
# Fit base learner (tree) to the gradient
tmp <- data$Sepal.Length
data$Sepal.Length <- grad
base_learner <- rpart(Sepal.Length ~ ., data = data, control = rpart.control(maxdepth = 2))
data$Sepal.Length <- tmp
# Fitted values by fitting current model
fit <- fit + learning.rate * as.vector(predict(base_learner, newdata = data))
# Update gradient
grad <- grad.fun(y = data$Sepal.Length, yhat = fit)
# Store current model (index is i + 1 because index 1 contains the initialized estimates)
mod[[i + 1]] <- base_learner
}
return(mod)
}
With this, I split up the iris data set into a training and testing data set and fit my model to it.
train.dat <- iris[1:100, ]
test.dat <- iris[101:150, ]
learning.rate <- 0.001
M = 1000
my.model <- grad_boost(data = train.dat, learning.rate = learning.rate, M = M, grad.fun = grad.fun)
Now I calculate the predicted values from my.model. For my.model, the fitted values are 0 (the vector of initial estimates) + learning.rate * (predictions from tree 1) + learning.rate * (predictions from tree 2) + ... + learning.rate * (predictions from tree M).
yhats.mymod <- apply(sapply(2:length(my.model), function(x) learning.rate * predict(my.model[[x]], newdata = test.dat)), 1, sum)
# Calculate RMSE
> sqrt(mean((test.dat$Sepal.Length - yhats.mymod)^2))
[1] 2.612972
I have a few questions
Does my gradient boosting algorithm look right?
Did I calculate the predicted values yhats.mymod correctly?
Yes, this looks correct. At each step you are fitting to the pseudo-residuals, which are computed as the negative derivative of the loss with respect to the current fit. You have correctly derived this at the start of your question, and even bothered to get the factor of 2 right.
This also looks correct. You are aggregating across the models, weighted by the learning rate, just as you did during training.
But to address something that was not asked, I noticed that your training setup has a few quirks.
The iris dataset is split equally between three species (setosa, versicolor, virginica), and the rows are grouped by species. Your training data contains all of the setosa and versicolor examples, while the test set contains only virginica. There is no overlap in species, so the model is forced to extrapolate far outside what it saw in training. It is preferable to shuffle or stratify the split so the training and test sets cover the same species; a sketch of one way to do this follows.
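A minimal sketch of a shuffled, species-stratified split (the 2/3 per species used here is an arbitrary illustrative choice):
set.seed(1)  # arbitrary seed, just to make the split reproducible
# sample 2/3 of the rows within each species so all three appear in both sets
train.idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                           function(idx) sample(idx, floor(2/3 * length(idx)))))
train.dat <- iris[train.idx, ]
test.dat  <- iris[-train.idx, ]
table(train.dat$Species)  # 33 of each species
table(test.dat$Species)   # 17 of each species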
The combination of learning rate and model count looks too low to me. The residual shrinks like (1 - lr)^n, so with lr = 1e-3 and n = 1000 you can only capture 1 - (1 - 0.001)^1000 ≈ 63.2% of the signal. That is, even if every tree predicted every sample perfectly, you would still be estimating only about 63.2% of the correct value. Initializing the fit with the average of the training outcome, instead of 0s, would help, since then the shrinkage pulls predictions towards the mean rather than towards 0.
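And a minimal sketch of that initialization change (your code stores the initial fit vector in mod[[1]]; here the scalar baseline is stored instead, purely so it can be added back when predicting on new data):
# inside grad_boost(): start from the training mean instead of 0
base.value <- mean(data$Sepal.Length)
fit <- rep(base.value, nrow(data))
grad <- grad.fun(y = data$Sepal.Length, yhat = fit)
mod[[1]] <- base.value  # keep the baseline so it can be added back at prediction time

# at prediction time: baseline + learning.rate * sum of the tree predictions
yhats.mymod <- my.model[[1]] +
  rowSums(sapply(2:length(my.model),
                 function(x) learning.rate * predict(my.model[[x]], newdata = test.dat)))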
Under certain circumstances, there are differences in predictions from e1071 package svm models depending on the setting of the probability input argument. This code example:
rm(list = ls())
library(e1071)
data(iris)
## Training and testing subsets
set.seed(73) # For reproducibility
ri = sample(seq(1, nrow(iris)), round(nrow(iris)*0.8))
train = iris[ri, ]
test = iris[-ri,]
## Models and predictions with probability setting F or T
set.seed(42) # Just to exclude that randomness in algorithm itself is the cause
m1 <- svm(Species ~ ., data = train, probability = F)
pred1 = predict(m1, newdata = test, probability = F)
set.seed(42) # Just to exclude that randomness in algorithm itself is the cause
m2 <- svm(Species ~ ., data = train, probability = T)
pred2 = predict(m2, newdata = test, probability = T)
## Accuracy
acc1 = sum(test$Species == pred1)/nrow(iris)
acc2 = sum(test$Species == pred2)/nrow(iris)
will give
acc1 = 0.18666...
acc2 = 0.19333...
My conclusion is that svm() performs calculations differently based on the setting of the probability parameter.
Is that correct?
If so, why and how does it differ?
I haven't seen anything about this in the docs for the package or function.
The reason I bother with this is that I have found the performance of the classification to be not only different, but consistently slightly worse when probability = T in a project where I do classification based on ~800 observations of ~250 gene abundances (bioinformatics stuff). The code from that project contains data cleaning and uses cross-validation, making it a bit bulky to include here, so you'll have to take my word for it.
Any ideas folks?
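I can't tell you why from here, but a minimal diagnostic sketch like the following (reusing m1, m2, pred1 and pred2 from the question) would at least show whether the fitted models themselves differ or only the predicted labels do:
# do the two fitted models actually differ, or only the predictions?
all.equal(m1$coefs, m2$coefs)  # support-vector coefficients
all.equal(m1$rho,   m2$rho)    # intercepts
all.equal(m1$SV,    m2$SV)     # support vectors

# which test rows receive different labels?
table(pred1, pred2)
which(pred1 != pred2)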
For the purpose of doing Twitter sentiment analysis I am using the SVM function from e1071 package.
I used the RTextTools package to create a document-term matrix, which I split into a training and a test set. I have "train", which is a data frame of training observations excluding the predicted variable, and "sentitrain", which is a vector of sentiment values corresponding to the training set. The same holds for the test set ("test" and "sentitest").
I then used three different ways of fitting an SVM model.
First, I created a container:
library(RTextTools)   # create_container(), train_model(), classify_model()
library(e1071)        # svm()
library(caret)        # confusionMatrix()
trainmat = as.matrix(train)
# create container object
traincontainer = create_container(trainmat,
                                  sentitrain,
                                  trainSize = 1:nrow(trainmat),
                                  virgin = FALSE)
# create test matrix
testmat = as.matrix(test)
testcontainer = create_container(testmat, labels = rep(0, nrow(test)),
                                 testSize = 1:nrow(test), virgin = FALSE)
model <- train_model(traincontainer, "SVM", kernel = "radial", cost = 400)
results = classify_model(testcontainer, model)
preds = results[,1]
confusionMatrix(table(preds, sentitest))
This gave me approximately 76% classification accuracy.
In the second method I simply took the column names of my training matrix and then created a formula:
n = names(train)
# exclude the predicted variable
n = setdiff(n, c("sentiment"))
predictors = paste(n, collapse = " + ")
# create formula
f = as.formula(paste("sentiment ~ ", predictors))
model = svm(f, data = train, cost = 400, kernel = "radial")
preds = predict(model, test)
confusionMatrix(table(preds, sentitest))
This gave me around 69% accuracy.
Thirdly I just passed the data frames and the vectors of predicted values directly to the function call:
model = svm(train, sentitrain, data = train, cost = 400, kernel = "radial")
preds = predict(model, test)
confusionMatrix(table(preds, sentitest))
This resulted in an astonishing 87% accuracy.
I ran each model several times with cross-validation to make sure these differences were not due to randomness.
As I understand it, I always used the same function from the same package (RTextTools calls the svm function from e1071 internally) with the same parameters on the same data. The only difference is how I passed those parameters. How can the results be so different?
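One thing worth checking first (a minimal diagnostic sketch; model3 and preds3 are hypothetical names, and sentitrain, train, test and sentitest are the objects from the question): when svm() is given x and y directly, the class of y decides what gets fitted, so a numeric sentiment vector silently produces eps-regression instead of C-classification, which would not be comparable with the other two approaches.
# a factor response gives C-classification; a numeric response gives eps-regression
class(sentitrain)
print(model)  # printing an svm object shows the SVM-Type that was actually fitted

# hypothetical re-fit forcing classification, for an apples-to-apples comparison
model3 <- svm(train, as.factor(sentitrain), type = "C-classification",
              cost = 400, kernel = "radial")
preds3 <- predict(model3, test)
confusionMatrix(table(preds3, sentitest))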