Error in h(simpleError(msg, call)) in Ridge/Lasso Regression - r

I am trying to run ridge/lasso regression with the glmnet and onehot packages and am getting an error.
library(glmnet)
library(onehot)
set.seed(123)
Sample <- HouseData[1:1460, ]
smp_size <- floor(0.5 * nrow(Sample))
train_ind <- sample(seq_len(nrow(Sample)), size = smp_size)
train <- Sample[train_ind, ]
test <- Sample[-train_ind, ]
############Ridge & Lasso Regressions ################
# Define the response for the training + test set
y_train <- train$SalePrice
y_test <- test$SalePrice
# Define the x training and test
x_train <- train[,!names(train)=="SalePrice"]
x_test <- test[,!names(train)=="SalePrice"]
str(y_train)
## encoding information for training set
x_train_encoded_data_info <- onehot(x_train,stringsAsFactors = TRUE, max_levels = 50)
x_train_matrix <- (predict(x_train_encoded_data_info,x_train))
x_train_matrix <- as.matrix(x_train_matrix)
# create encoding information for x test
x_test_encoded_data_info <- onehot(x_test,stringsAsFactors = TRUE, max_levels = 50)
x_test_matrix <- (predict(x_test_encoded_data_info,x_test))
str(x_train_matrix)
###Calculate best lambda
cv.out <- cv.glmnet(x_train_matrix, y_train,
alpha = 0, nlambda = 100,
lambda.min.ratio = 0.0001)
best.lambda <- cv.out$lambda.min
best.lambda
model <- glmnet(x_train_matrix, y_train, alpha = 0, lambda = best.lambda)
results_ridge <- predict(model,newx=x_test_matrix)
I know my data is clean and my matrices are the same size, but I keep getting this error when I try to run my prediction.
Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'as.matrix': Cholmod error 'X and/or Y have wrong dimensions' at file ../MatrixOps/cholmod_sdmult.c, line 90
My professor has also told me to one-hot encode before I split my data, but that makes no sense to me.

It's hard to debug that specific error because it's not entirely clear where the onehot function in your code is coming from; it doesn't exist in base R or the glmnet package.
That said, I would recommend using the old built-in standby function model.matrix (or its sparse cousin, sparse.model.matrix, if you have larger datasets) for creating the x argument to glmnet. model.matrix will automatically one-hot encode factor or categorical variables for you. It requires a model formula as input, which you can create from your dataset as shown below.
# create the model formula
y_variable <- "SalePrice"
model_formula <- as.formula(paste(y_variable, "~",
paste(names(train)[names(train) != y_variable], collapse = "+")))
# test & train matrices
x_train_matrix <- model.matrix(model_formula, data = train)[, -1]
x_test_matrix <- model.matrix(model_formula, data = test)[, -1]
###Calculate best lambda
cv.out <- cv.glmnet(x_train_matrix, y_train,
alpha = 0, nlambda = 100,
lambda.min.ratio = 0.0001)
A second, newer option is the glmnet function makeX(), which builds the matrices from your train/test data frames. The training matrix (x_matrices$x) can be passed to cv.glmnet as the x argument, as below; the matching test matrix is returned as x_matrices$xtest.
## option 2: use glmnet built in function to create x matrices
x_matrices <- glmnet::makeX(train = train[, !names(train) == "SalePrice"],
test = test[, !names(test) == "SalePrice"])
###Calculate best lambda
cv.out <- cv.glmnet(x_matrices$x, y_train,
alpha = 0, nlambda = 100,
lambda.min.ratio = 0.0001)
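A likely reason for the original dimension error is that the training and test sets were encoded separately, so a factor level present in only one of them produces a column the other lacks. This is also why encoding before splitting can make sense. The sketch below uses base R only and a made-up toy data frame (the column names Area and Zone are hypothetical stand-ins for HouseData) to show the mismatch and one way around it.

```r
# Toy stand-in for HouseData; Area and Zone are hypothetical columns
toy <- data.frame(
  SalePrice = c(200, 210, 190, 205, 220, 195, 215, 225),
  Area      = c(80, 95, 70, 85, 110, 75, 100, 120),
  Zone      = c("A", "B", "C", "A", "B", "A", "B", "A"),
  stringsAsFactors = FALSE
)
train <- toy[1:4, ]   # Zone contains A, B and C
test  <- toy[5:8, ]   # Zone contains only A and B

# Encoding each subset on its own: levels are inferred per subset
x_train_sep <- model.matrix(SalePrice ~ ., data = train)[, -1]
x_test_sep  <- model.matrix(SalePrice ~ ., data = test)[, -1]
colnames(x_train_sep)  # includes ZoneC
colnames(x_test_sep)   # no ZoneC, so the column counts differ

# Encoding the full data once, then splitting the rows
x_all   <- model.matrix(SalePrice ~ ., data = toy)[, -1]
x_train <- x_all[1:4, , drop = FALSE]
x_test  <- x_all[5:8, , drop = FALSE]
identical(colnames(x_train), colnames(x_test))
```

Encoding first only fixes the column layout; any statistics that are learned from the data (scaling, imputation) should still come from the training rows alone.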

Related

Use of PCA results as input to XGboost model throwing an error: Feature names stored in `object` and `newdata` are different

I use PCA on my divided train dataset and project the test dataset to the results after removing irrelevant columns.
data <- read.csv('bottom10.csv')
set.seed(1)
inTrain <- createDataPartition(data$cuisine, p = .8)[[1]]
dataTrain <- data[,-1][inTrain,][,-1]
dataTest <- data[,-1][-inTrain,][,-1]
cuisine.pca <- prcomp(dataTrain[,-1])
Then I extract the first 500 components and project the test dataset.
traincom <- cuisine.pca$x[,1:500]
testcom <- scale(dataTest[,-1], cuisine.pca$center) %*% cuisine.pca$rotation
Then I convert the labels to integers, and combine the components and labels into xgb.DMatrix form.
label_train <- as.integer(dataTrain$cuisine) - 1
label_test <- as.integer(dataTest$cuisine) - 1
xgb_train <- xgb.DMatrix(data = traincom, label = label_train)
xgb_test <- xgb.DMatrix(data = testcom, label = label_test)
Then I build the xgboost model as
xgb.fit <- xgboost(cuisine~., data = xgb_train, nrounds = 40, num_class = 10, early_stopping_rounds = 5)
And after I run this, there is a warning but the training can still run.
xgboost: label will be ignored
I can predict the train dataset using the model but when I try to predict test dataset there will be an error.
xgb_pred <- predict(xgb.fit, newdata = xgb_train)
sum(label_train == xgb_pred)/length(label_train)
xgb_pred <- predict(xgb.fit, newdata = xgb_test, rescale = T)
Error in predict.xgb.Booster(xgb.fit, newdata = xgb_test, rescale = T) :
Feature names stored in `object` and `newdata` are different!
Please let me know what I am doing wrong.
Regards
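One thing visible in the code above is that traincom keeps only the first 500 components, while testcom is multiplied by the full rotation matrix and so keeps every component. That alone would give the two matrices different feature sets. A minimal base R sketch on simulated data (the matrix X and the choice k = 3 are made up for illustration) that projects the test rows onto the same k components:

```r
set.seed(1)
# Simulated numeric features; names f1..f6 are hypothetical
X <- matrix(rnorm(100 * 6), nrow = 100, ncol = 6,
            dimnames = list(NULL, paste0("f", 1:6)))
X_train <- X[1:80, ]
X_test  <- X[81:100, ]

pca <- prcomp(X_train)   # fitted on the training rows only
k <- 3                   # number of components to keep

train_pc <- pca$x[, 1:k]
# predict() applies the training centre and rotation to new rows
test_pc <- predict(pca, newdata = X_test)[, 1:k]

# Equivalent manual projection, restricted to the same k columns
test_pc_manual <- scale(X_test, center = pca$center, scale = FALSE) %*%
  pca$rotation[, 1:k]

identical(colnames(train_pc), colnames(test_pc))
```

With both matrices restricted to the same columns, their feature names (PC1 to PCk) match.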

How can I calculate F1-measure and ROC in multiclass classification problem in R?

I have this code for a multiclass classification problem:
data$Class = as.factor(data$Class)
levels(data$Class) <- make.names(levels(factor(data$Class)))
trainIndex <- createDataPartition(data$Class, p = 0.6, list = FALSE, times=1)
trainingSet <- data[ trainIndex,]
testingSet <- data[-trainIndex,]
train_x <- trainingSet[, -ncol(trainingSet)]
train_y <- trainingSet$Class
testing_x <- testingSet[, -ncol(testingSet)]
testing_y <- testingSet$Class
oneRM <- OneR(trainingSet, verbose = TRUE)
oneRM
summary(oneRM)
plot(oneRM)
oneRM_pred <- predict(oneRM, testing_x)
oneRM_pred
eval_model(oneRM_pred, testing_y)
AUC_oneRM_pred <- auc(roc(oneRM_pred,testing_y))
cat ("AUC=", oneRM_pred)
# Recall-Precision curve
oneRM_prediction <- prediction(oneRM_pred, testing_y)
RP.perf <- performance(oneRM_prediction, "tpr", "fpr")
plot (RP.perf)
plot(roc(oneRM_pred,testing_y))
But the code stops working at this line:
oneRM_prediction <- prediction(oneRM_pred, testing_y)
I get this error:
Error in prediction(oneRM_pred, testing_y) : Format of predictions is
invalid.
In addition, I don't know how to get the F1-measure easily.
Finally, does it make sense to calculate AUC in a multiclass classification problem?
Let's start from F1.
Assuming that you are using the iris dataset, first, we need to load everything, train the model and perform the predictions as you did.
library(datasets)
library(caret)
library(OneR)
library(pROC)
trainIndex <- createDataPartition(iris$Species, p = 0.6, list = FALSE, times=1)
trainingSet <- iris[ trainIndex,]
testingSet <- iris[-trainIndex,]
train_x <- trainingSet[, -ncol(trainingSet)]
train_y <- trainingSet$Species
testing_x <- testingSet[, -ncol(testingSet)]
testing_y <- testingSet$Species
oneRM <- OneR(trainingSet, verbose = TRUE)
oneRM_pred <- predict(oneRM, testing_x)
Then, you should calculate the precision, recall, and F1 for each class.
cm <- as.matrix(confusionMatrix(oneRM_pred, testing_y))
n = sum(cm) # number of instances
nc = nrow(cm) # number of classes
rowsums = apply(cm, 1, sum) # number of predictions per class (caret puts predictions in rows)
colsums = apply(cm, 2, sum) # number of actual instances per class (reference in columns)
diag = diag(cm) # number of correctly classified instances per class
precision = diag / rowsums
recall = diag / colsums
f1 = 2 * precision * recall / (precision + recall)
print(" ************ Confusion Matrix ************")
print(cm)
print(" ************ Diag ************")
print(diag)
print(" ************ Precision/Recall/F1 ************")
print(data.frame(precision, recall, f1))
After that, you are able to find the macro F1.
macroPrecision = mean(precision)
macroRecall = mean(recall)
macroF1 = mean(f1)
print(" ************ Macro Precision/Recall/F1 ************")
print(data.frame(macroPrecision, macroRecall, macroF1))
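The orientation of the confusion matrix decides which margin belongs to precision and which to recall. As a small worked check in base R, on made-up labels rather than the OneR output, the table below has predictions in rows and true classes in columns, so precision divides by row sums and recall by column sums.

```r
classes <- c("a", "b", "c")
# Made-up predictions and true labels, for illustration only
pred   <- factor(c("a", "a", "a", "a", "b", "b", "c", "c"), levels = classes)
actual <- factor(c("a", "a", "b", "c", "b", "b", "c", "b"), levels = classes)

cm <- table(Predicted = pred, Actual = actual)

correct   <- diag(cm)
precision <- correct / rowSums(cm)  # of everything predicted as class k, how much was right
recall    <- correct / colSums(cm)  # of everything truly in class k, how much was found
f1        <- 2 * precision * recall / (precision + recall)

macro_f1 <- mean(f1)
print(data.frame(precision, recall, f1))
print(macro_f1)
```

If a class is never predicted, its precision is 0/0 and the resulting NaN propagates into the macro average, so that case needs separate handling.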
To find the ROC (more precisely, the AUC), it is best to use the pROC library.
print(" ************ AUC ************")
roc.multi <- multiclass.roc(testing_y, as.numeric(oneRM_pred))
print(auc(roc.multi))
Hope that it helps you.
Find details on this link for F1 and this for AUC.
If I use levels(oneRM_pred) <- levels(testing_y) in this way:
...
oneRM <- OneR(trainingSet, verbose = TRUE)
oneRM
summary(oneRM)
plot(oneRM)
oneRM_pred <- predict(oneRM, testing_x)
levels(oneRM_pred) <- levels(testing_y)
...
The accuracy is much lower than before, so I am not sure whether enforcing the same levels this way is a good solution.
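A possible explanation for the accuracy drop, assuming the predicted factor is missing at least one of the true levels: `levels(x) <- value` renames the existing levels by position instead of matching them by name, so some predictions silently change class. The base R sketch below uses made-up labels to show the effect, and an alternative that rebuilds the factor from its character values.

```r
truth_levels <- c("setosa", "versicolor", "virginica")

# Hypothetical predictions in which "versicolor" never occurs
pred <- factor(c("setosa", "virginica", "virginica"))
levels(pred)  # only two levels: "setosa" "virginica"

# Assigning levels renames by position: "virginica" becomes "versicolor"
relabeled <- pred
levels(relabeled) <- truth_levels
as.character(relabeled)

# Rebuilding the factor keeps every label and only adds the missing level
aligned <- factor(as.character(pred), levels = truth_levels)
as.character(aligned)
```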

How to calculate the cross-validated R2 on a LASSO regression?

I am using this code to fit a model using LASSO regression.
library(glmnet)
IV1 <- data.frame(IV1 = rnorm(100))
IV2 <- data.frame(IV2 = rnorm(100))
IV3 <- data.frame(IV3 = rnorm(100))
IV4 <- data.frame(IV4 = rnorm(100))
IV5 <- data.frame(IV5 = rnorm(100))
DV <- data.frame(DV = rnorm(100))
data<-data.frame(IV1,IV2,IV3,IV4,IV5,DV)
x <-model.matrix(DV~.-IV5 , data)[,-1]
y <- data$DV
AB<-glmnet(x=x, y=y, alpha=1)
plot(AB,xvar="lambda")
lambdas = NULL
for (i in 1:100)
{
fit <- cv.glmnet(x,y)
errors = data.frame(fit$lambda,fit$cvm)
lambdas <- rbind(lambdas,errors)
}
lambdas <- aggregate(lambdas[, 2], list(lambdas$fit.lambda), mean)
bestindex = which(lambdas[2]==min(lambdas[2]))
bestlambda = lambdas[bestindex,1]
fit <- glmnet(x,y,lambda=bestlambda)
I would like to calculate some sort of R2 using the training data. I assume that one way to do this is using the cross-validation that I performed in choosing lambda. Based off of this post it seems like this can be done using
r2<-max(1-fit$cvm/var(y))
However, when I run this, I get this error:
Warning message:
In max(1 - fit$cvm/var(y)) :
no non-missing arguments to max; returning -Inf
Can anyone point me in the right direction? Is this the best way to compute R2 based off of the training data?
The glmnet function does not return cvm as part of its fit object, so fit$cvm is NULL and max() has nothing to work with:
?glmnet
What you want to do is use cv.glmnet
?cv.glmnet
The following works (note you must specify more than 1 lambda or let it figure it out)
fit <- cv.glmnet(x,y,lambda=lambdas[,1])
r2<-max(1-fit$cvm/var(y))
I'm not sure I understand what you are trying to do. Maybe do this?
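lambdas <- NULL
r2 <- numeric(100)
for (i in 1:100)
{
fit <- cv.glmnet(x,y)
errors = data.frame(fit$lambda,fit$cvm)
lambdas <- rbind(lambdas,errors)
r2[i]<-max(1-fit$cvm/var(y)) # best R2 over all lambdas in run i
}
lambdas <- aggregate(lambdas[, 2], list(lambdas$fit.lambda), mean)
bestindex = which(lambdas[2]==min(lambdas[2]))
bestlambda = lambdas[bestindex,1]
# R2 at the best lambda, using the CV error averaged over the 100 runs
1 - lambdas[bestindex, 2] / var(y)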
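For a single cross-validation run, the same idea can be written more directly: take the cross-validated mean squared error at lambda.min and compare it with the variance of y. This is a minimal sketch on simulated data with a real signal (the question's data were pure noise, where this quantity sits near or below zero); it is an approximate R2, since var(y) uses the n - 1 denominator.

```r
library(glmnet)

set.seed(42)
# Simulated predictors and response; the coefficients are made up
n <- 100
x <- matrix(rnorm(n * 4), nrow = n,
            dimnames = list(NULL, paste0("IV", 1:4)))
y <- 2 * x[, 1] - x[, 2] + rnorm(n)

cv_fit <- cv.glmnet(x, y, alpha = 1)

# Position of lambda.min in the lambda sequence used by cv.glmnet
idx <- which(cv_fit$lambda == cv_fit$lambda.min)

# Cross-validated R2: 1 - (CV mean squared error) / (variance of y)
r2_cv <- 1 - cv_fit$cvm[idx] / var(y)
r2_cv
```

Because the error comes from held-out folds, this value can be negative when the model predicts worse than the mean of y.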

predict() R function caret package errors: "newdata" rows different, "type" not accepted

I am running a logistic regression analysis using the caret package.
The data is input as an 18x6 matrix.
Everything is fine so far except the predict() function.
R is telling me the type parameter is supposed to be raw or prob but raw just spits out an exact copy of the last column (the values of the binomial variable). prob gives me the following error:
"Error in dimnames(out)[[2]] <- modelFit$obsLevels :
length of 'dimnames' [2] not equal to array extent
In addition: Warning message:
'newdata' had 7 rows but variables found have 18 rows"
install.packages("pbkrtest")
install.packages("caret")
install.packages('e1071', dependencies=TRUE)
#install.packages('caret', dependencies = TRUE)
require(caret)
library(caret)
A=matrix(
c(
64830,18213,4677,24761,9845,17504,22137,12531,5842,28827,51840,4079,1000,2069,969,9173,11646,946,66161,18852,5581,27219,10159,17527,23402,11409,8115,31425,55993,0,0,1890,1430,7873,12779,627,68426,18274,5513,25687,10971,14104,19604,13438,6011,30055,57242,0,0,2190,1509,8434,10492,755,69716,18366,5735,26556,11733,16605,20644,15516,5750,31116,64330,0,0,1850,1679,9233,12000,500,73128,18906,5759,28555,11951,19810,22086,17425,6152,28469,72020,0,0,1400,1750,8599,12000,500,1,1,1,0,1,0,0,0,0,1,0,1,1,1,1,1,1,1
),
nrow = 18,
ncol = 6,
byrow = FALSE) #"bycol" does NOT exist
################### data set as vectors
a<-c(64830,18213,4677,24761,9845,17504,22137,12531,5842,28827,51840,4079,1000,2069,969,9173,11646,946)
b<-c(66161,18852,5581,27219,10159,17527,23402,11409,8115,31425,55993,0,0,1890,1430,7873,12779,627)
c<-c(68426,18274,5513,25687,10971,14104,19604,13438,6011,30055,57242,0,0,2190,1509,8434,10492,755)
d<-c(69716,18366,5735,26556,11733,16605,20644,15516,5750,31116,64330,0,0,1850,1679,9233,12000,500)
e<-c(73128,18906,5759,28555,11951,19810,22086,17425,6152,28469,72020,0,0,1400,1750,8599,12000,500)
f<-c(1,1,1,0,1,0,0,0,0,1,0,1,1,1,1,1,1,1)
######################
n<-nrow(A);
K<-ncol(A)-1;
Train <- createDataPartition(f, p=0.6, list=FALSE) #60% of data set is used as training.
training <- A[ Train, ]
testing <- A[ -Train, ]
nrow(training)
#this is the logistic formula:
#estimates from logistic regression characterize the relationship between the predictor and response variable on a log-odds scale
mod_fit <- train(f ~ a + b + c + d +e, data=training, method="glm", family="binomial")
mod_fit
#this is the exponential function to calculate the odds ratios for each predictor:
exp(coef(mod_fit$finalModel))
predict(mod_fit, newdata=training)
predict(mod_fit, newdata=testing, type="prob")
I'm not sure I fully understand, but A is already a matrix of (a, b, c, d, e, f), so you don't need to create two sets of objects.
install.packages("pbkrtest")
install.packages("caret")
install.packages('e1071', dependencies=TRUE)
#install.packages('caret', dependencies = TRUE)
require(caret)
library(caret)
A=matrix(
c(
64830,18213,4677,24761,9845,17504,22137,12531,5842,28827,51840,4079,1000,2069,969,9173,11646,946,66161,18852,5581,27219,10159,17527,23402,11409,8115,31425,55993,0,0,1890,1430,7873,12779,627,68426,18274,5513,25687,10971,14104,19604,13438,6011,30055,57242,0,0,2190,1509,8434,10492,755,69716,18366,5735,26556,11733,16605,20644,15516,5750,31116,64330,0,0,1850,1679,9233,12000,500,73128,18906,5759,28555,11951,19810,22086,17425,6152,28469,72020,0,0,1400,1750,8599,12000,500,1,1,1,0,1,0,0,0,0,1,0,1,1,1,1,1,1,1
),
nrow = 18,
ncol = 6,
byrow = FALSE) #"bycol" does NOT exist
A <- data.frame(A)
colnames(A) <- c('a','b','c','d','e','f')
A$f <- as.factor(A$f)
Train <- createDataPartition(A$f, p=0.6, list=FALSE) #60% of data set is used as training.
training <- A[ Train, ]
testing <- A[ -Train, ]
nrow(training)
And to predict a variable you must enter the explanatory variables and not the variable to predict
mod_fit <- train(f ~ a + b + c + d +e, data=training, method="glm", family="binomial")
mod_fit
#this is the exponential function to calculate the odds ratios for each predictor:
exp(coef(mod_fit$finalModel))
predict(mod_fit, newdata=training[,-which(colnames(training)=="f")])
predict(mod_fit, newdata=testing[,-which(colnames(testing)=="f")])
Short answer: you should not include the explained variable, f, in the data you pass to predict. So you should do:
predict(mod_fit, newdata=training[, -ncol(training)])
predict(mod_fit, newdata=testing[, -ncol(testing)])
The warning 'newdata' had 11 rows but variables found have 18 rows appears because the regression was run on the whole data set (18 observations), while predict was given only part of it (either 11 or 7 rows).
EDIT: To simplify the data creation and glm processes we can do:
library(caret)
A <- data.frame(a = c(64830,18213,4677,24761,9845,17504,22137,12531,5842,28827,51840,4079,1000,2069,969,9173,11646,946),
b = c(66161,18852,5581,27219,10159,17527,23402,11409,8115,31425,55993,0,0,1890,1430,7873,12779,627),
c = c(68426,18274,5513,25687,10971,14104,19604,13438,6011,30055,57242,0,0,2190,1509,8434,10492,755),
d = c(69716,18366,5735,26556,11733,16605,20644,15516,5750,31116,64330,0,0,1850,1679,9233,12000,500),
e = c(73128,18906,5759,28555,11951,19810,22086,17425,6152,28469,72020,0,0,1400,1750,8599,12000,500),
f = c(1,1,1,0,1,0,0,0,0,1,0,1,1,1,1,1,1,1))
A$f <- as.factor(A$f) # a factor outcome makes caret treat this as classification
Train <- createDataPartition(A$f, p=0.6, list=FALSE) #60% of data set is used as training.
training <- A[ Train, ]
testing <- A[ -Train, ]
mod_fit <- train(f ~ a + b + c + d + e, data=training, method="glm", family="binomial")
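The row-count warning can be reproduced without caret. The sketch below swaps in base R's glm for caret's train and uses simulated data, so it illustrates the mechanism rather than the original model: when newdata has no column named like the predictor, R falls back to the vector of that name in the calling environment, which has all 18 values.

```r
set.seed(1)
# Simulated predictor and binary outcome, 18 rows as in the question
a <- rnorm(18)
f <- rbinom(18, 1, 0.5)
dat <- data.frame(a = a, f = f)

train_idx <- 1:11
training  <- dat[train_idx, ]
testing   <- dat[-train_idx, ]

fit <- glm(f ~ a, data = training, family = binomial)

# newdata has a column named "a": one probability per test row
p_good <- predict(fit, newdata = testing["a"], type = "response")
length(p_good)  # 7

# newdata lacks a column named "a": the global vector `a` is used instead,
# which triggers the "'newdata' had 7 rows but variables found have 18 rows" warning
p_bad <- suppressWarnings(
  predict(fit, newdata = data.frame(V1 = testing$a), type = "response")
)
length(p_bad)  # 18
```

Keeping the data in a data frame with named columns, as in the answers above, avoids the fallback.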
I am trying to run a logistic regression model. I wrote this code:
install.packages('caret')
library(caret)
setwd('C:\\Users\\BAHOZ\\Documents\\')
D<-read.csv(file = "D.csv",header = T)
D<-read.csv(file = 'DataSet.csv',header=T)
names(D)
set.seed(111134)
Train<-createDataPartition(D$X, p=0.7,list = FALSE)
training<-D[Train,]
length(training$age)
testing<-D[-Train,]
length(testing$age)
mod_fit<-train(X~age + gender + total.Bilirubin + direct.Bilirubin + total.proteins + albumin + A.G.ratio+SGPT + SGOT + Alkphos,data=training,method="glm", family="binomial")
summary(mod_fit)
exp(coef(mod_fit$finalModel))
And I received this output for the last command:
(Intercept) age gender total.Bilirubin direct.Bilirubin total.proteins albumin A.G.ratio
0.01475027 1.01596886 1.03857883 1.00022899 1.78188072 1.00065332 1.01380334 1.00115742
SGPT SGOT Alkphos
3.93498241 0.05616662 38.29760014
By running this command I could predict my data,
predict(mod_fit , newdata=testing)
But if I set type="prob" or type="raw"
predict(mod_fit , newdata=testing, type = "prob")
it fails with this error:
Error in dimnames(out) <- *vtmp* :
length of 'dimnames' [2] not equal to array extent

predict with kernlab package error Error in .local(object, ...) : test vector does not match model R

I'm testing the kernlab package on a regression problem. It seems to be a common issue to get 'Error in .local(object, ...) : test vector does not match model !' when passing the ksvm object to the predict function. However, I have only found answers for classification problems or custom kernels, which are not applicable to my problem (I'm using a built-in kernel for regression). I'm running out of ideas here; my sample code is:
data <- matrix(rnorm(200*10),200,10)
tr <- data[1:150,]
ts <- data[151:200,]
mod <- ksvm(x = tr[,-1],
y = tr[,1],
kernel = "rbfdot", type = 'nu-svr',
kpar = "automatic", C = 60, cross = 3)
pred <- predict(mod,
ts
)
You forgot to remove the y variable from the test set, so it fails because the number of predictors doesn't match. This will work:
predict(mod,ts[,-1])
You can use pred <- predict(mod, ts) if ts is a data frame and the model is fit with a formula.
It would be:
data <- setNames(data.frame(matrix(rnorm(200*10),200,10)),
c("Y",paste("X", 1:9, sep = "")))
tr <- data[1:150,]
ts <- data[151:200,]
mod <- ksvm(as.formula("Y ~ ."), data = tr,
kernel = "rbfdot", type = 'nu-svr',
kpar = "automatic", C = 60, cross = 3)
pred <- predict(mod, ts)
