Format of newx in Lasso regression gives error in R

I am trying to implement lasso linear regression. I can train my model, but when I try to make predictions on unknown data I get the following error:
Error in cbind2(1, newx) %*% nbeta :
invalid class 'NA' to dup_mMatrix_as_dgeMatrix
Summary of my data is:
I want to predict the unknown percent_gc. I initially train my model using data for which percent_gc is known
set.seed(1)
###training data
data.all <- tibble(description = c('Xylanimonas cellulosilytica XIL07, DSM 15894','Teredinibacter turnerae T7901',
'Desulfotignum phosphitoxidans FiPS-3, DSM 13687','Brucella melitensis bv. 1 16M'),
phylum = c('Actinobacteria','Proteobacteria','Proteobacteria','Bacteroidetes'),
genus = c('Acaryochloris','Acetohalobium','Acidimicrobium','Acidithiobacillus'),
Latitude = c('63.93','69.372','3.493.11','44.393.704'),
Longitude = c('-22.1','88.235','134.082.527','-0.130781'),
genome_size = c(8361599,2469596,2158157,3207552),
percent_gc = c(34,24,55,44),
percent_psuedo = c(0.0032987747,0.0291222313,0.0353728489,0.0590663703),
percent_signalpeptide = c(0.02987198,0.040607055,0.048757170,0.061606859))
###data for prediction
data.prediction <- tibble(description = c('Liberibacter crescens BT-1','Saprospira grandis Lewin',
'Sinorhizobium meliloti AK83','Bifidobacterium asteroides ATCC 25910'),
phylum = c('Actinobacteria','Proteobacteria','Proteobacteria','Bacteroidetes'),
genus = c('Acaryochloris','Acetohalobium','Acidimicrobium','Acidithiobacillus'),
Latitude = c('39.53','69.372','5.493.12','44.393.704'),
Longitude = c('20.1','-88.235','134.082.527','-0.130781'),
genome_size = c(474832,2469837,2158157,3207552),
percent_gc = c(NA,NA,NA,NA),
percent_psuedo = c(0.0074639239,0.0291222313,0.0353728489,0.0590663703),
percent_signalpeptide = c(0.02987198,0.040607055,0.048757170,0.061606859))
x=model.matrix(percent_gc~.,data.all)
y=data.all$percent_gc
cv.out <- cv.glmnet (x, y, alpha = 1,family = "gaussian")
best.lambda= cv.out$lambda.min
fit <- glmnet(x,y,alpha=1)
I then want to make predictions for the rows where percent_gc is not known.
newX = matrix(data = data.prediction %>% select(-percent_gc))
data.prediction$percent_gc <-
predict(object = fit ,type="response", s=best.lambda, newx=newX)
And this generates the error I mentioned above.
I don't understand which format newX should be in to get rid of this error. Insights would be appreciated.

I could not really figure out how to construct an appropriate matrix, but the glmnetUtils package provides functionality to fit a formula directly on a data frame and predict from it. With this I got it to predict values:
library(glmnetUtils)
fit <- glmnet(percent_gc~.,data.all,alpha=1)
cv.out <- cv.glmnet (percent_gc~.,data.all, alpha = 1,family = "gaussian")
best.lambda= cv.out$lambda.min
predict(object = fit,data.prediction,s=best.lambda)
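If you prefer to stay with plain glmnet, here is a minimal sketch of one way to build a numeric newx whose columns match the training matrix. It assumes the cause of the error is that matrix() applied to a tibble does not produce a numeric matrix, and that model.matrix() drops rows whose response is NA. The toy data frames and their column names (y, size, phylum) are hypothetical stand-ins, not the data from the question.

```r
# Toy stand-ins for the training and prediction data (hypothetical columns)
train <- data.frame(y = c(34, 24, 55, 44),
                    size = c(8.4, 2.5, 2.2, 3.2),
                    phylum = c("A", "P", "P", "B"))
new <- data.frame(y = NA_real_,
                  size = c(0.5, 2.5),
                  phylum = c("P", "B"))

# Rows with an NA response would be dropped by model.matrix(),
# so fill in a placeholder value; it is never used as a predictor.
new$y <- 0

# Build the design matrix on the stacked data so that factor levels,
# and therefore dummy columns, are identical for both parts.
both <- rbind(train, new)
x_all <- model.matrix(y ~ ., both)[, -1]   # drop the intercept column

x_train <- x_all[seq_len(nrow(train)), , drop = FALSE]
x_new <- x_all[-seq_len(nrow(train)), , drop = FALSE]
```

x_train would then go to glmnet()/cv.glmnet() and x_new to predict(..., newx = x_new). Note that a factor level seen only in the prediction rows still has no fitted coefficient, so this only aligns the columns.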

Related

R: Caret package: Brier Score

I want to perform a logistic regression with the train() function from the caret package. My model looks something like that:
model <- train(Y ~.,
data = train_data,
family = "binomial",
method = "glmnet")
With the resulting model, I want to make predictions:
pred <- predict(model, newdata = test_data, s = "lambda.min", type = "prob")
Now I want to evaluate how good the model predictions are compared with the actual test data. I know how to obtain the ROC curve and the AUC for this. However, I am also interested in obtaining the Brier score. The formula for the Brier score is almost identical to the MSE.
The problem I am facing is that the type argument in predict only allows "prob" (or "class", which I am not interested in), which gives the probability of a prediction being a ONE (e.g. 0.64) and the complementary probability of it being a ZERO (e.g. 0.36). For the Brier score, however, I need one probability estimate per prediction that carries the information of both (e.g. a value above 0.5 would indicate a 1 and a value below 0.5 would indicate a 0).
I have not found any solution for obtaining the Brier score in the caret package. I am aware that with cv.glmnet from the glmnet package the predict function accepts type = "response", which would solve my problem. However, for personal preference I would like to stay with the caret package.
Thanks for the help!
If we go by the wiki definition of the Brier score:
The most common formulation of the Brier score is
$$BS = \frac{1}{N} \sum_{t=1}^{N} (f_t - o_t)^2$$
where f_t is the probability that was forecast, o_t the actual outcome of the event at instance t (0 or 1) and N is the number of forecasting instances.
In R, if your label is a factor, then the logistic regression will always predict with respect to the 2nd level, meaning you just calculate the probability and 0/1 with respect to that. For example:
library(caret)
idx = sample(nrow(iris),100)
data = iris
data$Species = factor(ifelse(data$Species=="versicolor","v","o"))
levels(data$Species)
[1] "o" "v"
In this case, o is 0 and v is 1.
train_data = data[idx,]
test_data = data[-idx,]
model <- train(Species ~.,data = train_data,family = "binomial",method = "glmnet")
pred <- predict(model, newdata = test_data, type = "prob")
So we can see the probability of the class:
head(pred)
          o          v
1 0.8367885 0.16321154
2 0.7970508 0.20294924
3 0.6383656 0.36163437
4 0.9510763 0.04892370
5 0.9370721 0.06292789
To calculate the score:
f_t = pred[,2]
o_t = as.numeric(test_data$Species)-1
mean((f_t - o_t)^2)
[1] 0.32
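The same calculation can be wrapped in a small helper so that it does not depend on the order of the factor levels. This is only a sketch in base R; brier_score and its argument names are made up for illustration and are not part of caret.

```r
# Brier score for a binary outcome.
# prob_positive: predicted probability of the positive class
# observed:      observed labels (factor or character)
# positive:      label that counts as outcome 1
brier_score <- function(prob_positive, observed, positive) {
  o_t <- as.numeric(observed == positive)   # 1 for the positive class, else 0
  mean((prob_positive - o_t)^2)
}

# Small hand-made example
obs <- factor(c("v", "o", "v", "o"), levels = c("o", "v"))
p_v <- c(0.9, 0.2, 0.4, 0.1)
bs <- brier_score(p_v, obs, positive = "v")
bs
```

With the caret output above, the call would be brier_score(pred[, "v"], test_data$Species, "v").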
I use the Brier score to tune my models in caret for binary classification. I ensure that the "positive" class is the second class, which is the default when you label your response "0:1". Then I created this master summary function, based on caret's own suite of summary functions, to return all the metrics I want to see:
BigSummary <- function (data, lev = NULL, model = NULL) {
  pr_auc <- try(MLmetrics::PRAUC(data[, lev[2]],
                                 ifelse(data$obs == lev[2], 1, 0)),
                silent = TRUE)
  brscore <- try(mean((data[, lev[2]] - ifelse(data$obs == lev[2], 1, 0)) ^ 2),
                 silent = TRUE)
  rocObject <- try(pROC::roc(ifelse(data$obs == lev[2], 1, 0), data[, lev[2]],
                             direction = "<", quiet = TRUE), silent = TRUE)
  if (inherits(pr_auc, "try-error")) pr_auc <- NA
  if (inherits(brscore, "try-error")) brscore <- NA
  rocAUC <- if (inherits(rocObject, "try-error")) {
    NA
  } else {
    rocObject$auc
  }
  tmp <- unlist(e1071::classAgreement(table(data$obs,
                                            data$pred)))[c("diag", "kappa")]
  out <- c(Acc = tmp[[1]],
           Kappa = tmp[[2]],
           AUCROC = rocAUC,
           AUCPR = pr_auc,
           Brier = brscore,
           Precision = caret:::precision.default(data = data$pred,
                                                 reference = data$obs,
                                                 relevant = lev[2]),
           Recall = caret:::recall.default(data = data$pred,
                                           reference = data$obs,
                                           relevant = lev[2]),
           F = caret:::F_meas.default(data = data$pred, reference = data$obs,
                                      relevant = lev[2]))
  out
}
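Now I can simply pass summaryFunction = BigSummary in trainControl (together with classProbs = TRUE, so that the class-probability columns are available to the function) and then metric = "Brier", maximize = FALSE in the train call.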

Use of PCA results as input to XGboost model throwing an error: Feature names stored in `object` and `newdata` are different

I use PCA on my divided train dataset and project the test dataset to the results after removing irrelevant columns.
data <- read.csv('bottom10.csv')
set.seed(1)
inTrain <- createDataPartition(data$cuisine, p = .8)[[1]]
dataTrain <- data[,-1][inTrain,][,-1]
dataTest <- data[,-1][-inTrain,][,-1]
cuisine.pca <- prcomp(dataTrain[,-1])
Then I extract the first 500 components and project the test dataset.
traincom <- cuisine.pca$x[,1:500]
testcom <- scale(dataTest[,-1], cuisine.pca$center) %*% cuisine.pca$rotation
Then I convert the labels to integers and combine the components and labels into xgb.DMatrix form.
label_train <- as.integer(dataTrain$cuisine) - 1
label_test <- as.integer(dataTest$cuisine) - 1
xgb_train <- xgb.DMatrix(data = traincom, label = label_train)
xgb_test <- xgb.DMatrix(data = testcom, label = label_test)
Then I build the xgboost model as
xgb.fit <- xgboost(cuisine~., data = xgb_train, nrounds = 40, num_class = 10, early_stopping_rounds = 5)
And after I run this, there is a warning but the training can still run.
xgboost: label will be ignored
I can predict on the train dataset using the model, but when I try to predict on the test dataset I get an error.
xgb_pred <- predict(xgb.fit, newdata = xgb_train)
sum(label_train == xgb_pred)/length(label_train)
xgb_pred <- predict(xgb.fit, newdata = xgb_test, rescale = T)
Error in predict.xgb.Booster(xgb.fit, newdata = xgb_test, rescale = T) :
Feature names stored in `object` and `newdata` are different!
Please let me know what I am doing wrong.
Regards
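One thing that stands out in the code above: traincom keeps only the first 500 components, while testcom is multiplied by the full rotation matrix and so keeps every component. A likely cause of the error is therefore that the two matrices do not have the same columns. Below is a minimal sketch, on simulated data with made-up names and sizes, of projecting a test set onto the same k components with base R's prcomp; it is not a tested fix for the cuisine data.

```r
set.seed(1)

# Simulated stand-in for the numeric feature columns (hypothetical sizes)
X <- matrix(rnorm(100 * 6), 100, 6,
            dimnames = list(NULL, paste0("f", 1:6)))
tr <- X[1:80, ]
te <- X[81:100, ]

pca <- prcomp(tr)
k <- 3   # number of components to keep (500 in the question)

# Training scores: first k components
traincom <- pca$x[, 1:k]

# Test scores: predict() centers with the training means and rotates;
# keep the SAME k columns as for the training data
testcom <- predict(pca, newdata = te)[, 1:k]

# Equivalent manual projection, restricted to the first k loadings
manual <- scale(te, center = pca$center, scale = FALSE) %*% pca$rotation[, 1:k]

identical(colnames(traincom), colnames(testcom))
```

Both matrices now carry the column names PC1 to PCk, which is what the feature-name check in predict compares.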

plotting interaction effects for LASSO models in R

I fitted a lasso logistic model with interaction terms. Then I wanted to visualize those interactions using an interaction plot.
I tried to find an R function that will plot interactions for glmnet models and I couldn't find any.
Is there any R package that will plot interactions for LASSO?
Since I couldn't find any, I tried to do it manually by plotting the predicted values. But I am getting some errors.
My code is as follows:
require(ISLR)
require(glmnet)
y <- Smarket$Direction
x <- model.matrix(Direction ~ Lag1 + Lag4* Volume, Smarket)[, -1]
lasso.mod <- cv.glmnet(x, y, alpha=1,family="binomial",nfolds = 5, type.measure="class",
lambda = seq(0.001,0.1,by = 0.001))
lasso.mod$lambda.min
pred = expand.grid(Lag1 = median(Smarket$Lag1),
Lag4 = c(-0.64,0.0385,0.596750),
Volume = seq(min(Smarket$Volume), max(Smarket$Volume), length=100))
lasso.mod1 <- glmnet(x, y, alpha=1,family="binomial",
lambda = lasso.mod$lambda.min)
pred$Direction = predict(lasso.mod1, newx=pred,
type="response", s= lasso.mod$lambda.min)
I am getting this error:
Error in cbind2(1, newx) %*% nbeta :
not-yet-implemented method for <data.frame> %*% <dgCMatrix>
Can anyone suggest anything to fix this issue?
Thank you
The documentation for predict.glmnet says newx must be a matrix, and you need to supply the interaction column yourself.
library(dplyr)
pred = expand.grid(Lag1 = median(Smarket$Lag1),
Lag4 = c(-0.64,0.0385,0.596750),
Volume = seq(min(Smarket$Volume), max(Smarket$Volume), length=100)) %>%
mutate(`Lag4:Volume` = Lag4 * Volume) # preparing interaction values
pred$Direction = predict(lasso.mod1, newx = as.matrix(pred), # convert to matrix
type = "link", s= lasso.mod$lambda.min)
[EDITED]
Oh, I overlooked a more general, better way.
pred = expand.grid(Lag1 = median(Smarket$Lag1),
Lag4 = c(-0.64,0.0385,0.596750),
Volume = seq(min(Smarket$Volume), max(Smarket$Volume), length=100))
pred$Direction = predict(lasso.mod1,
newx = model.matrix( ~ Lag1 + Lag4* Volume, pred)[, -1],
type="response", s= lasso.mod$lambda.min)
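To see why the model.matrix approach works, here is a small base R sketch showing that the same formula expands a prediction grid into exactly the columns the model was fitted on, including the interaction. The grid values are arbitrary and only for illustration.

```r
# Arbitrary small grid with the same variable names as the model formula
grid <- expand.grid(Lag1 = 0,
                    Lag4 = c(-0.5, 0, 0.5),
                    Volume = c(1, 2))

# Same right-hand side as in the fit; drop the intercept column
newx <- model.matrix(~ Lag1 + Lag4 * Volume, grid)[, -1]

colnames(newx)   # main effects first, then the Lag4:Volume product
is.matrix(newx)  # a numeric matrix, which is what newx expects
```

Because the interaction column is generated from the formula, it stays correct if the formula changes, which the manual mutate() version does not.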

Obtaining an AUC value from the KNN function

I am using the class package in order to use the KNN algorithm. I am also using the ROCR package to calculate the AUC value.
knn_one<-knn(train, test, train$Digit, k=1)
To calculate the AUC value for another method, e.g. classification trees, I used this series of commands:
treeTrain_Pred<-predict(Tree_Train, test , type = "prob")[,2]
Pred<-prediction(treeTrain_Pred, test$Digit)
Perf<-performance(Pred, "auc")
Perf@y.values[[1]]
However, when I try
knn_one = predict(knn_one, test, type="prob")[,2]
I get the following error:
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "factor"
How can I fix this and obtain an AUC value for my KNN function?
There is no predict method for knn models. Instead, you train and receive predictions in a single call. Example on the Sonar data:
library(mlbench)
data(Sonar)
Create the data partition:
set.seed(1)
tr_ind <- sample(1:nrow(Sonar), 150)
train <- Sonar[tr_ind,]
test <- Sonar[-tr_ind,]
mod <- class::knn(cl = train$Class,
test = test[,1:60],
train = train[,1:60],
k = 5,
prob = TRUE)
The proportion of votes for the winning class of each prediction is now stored in:
attributes(mod)$prob
library(pROC)
roc(test$Class, attributes(mod)$prob)
#output
Call:
roc.default(response = test$Class, predictor = attributes(mod)$prob)
Data: attributes(mod)$prob in 30 controls (test$Class M) < 28 cases (test$Class R).
Area under the curve: 0.4667
plot(roc(test$Class, attributes(mod)$prob),
print.thres = T,
print.auc=T)
Let's try with k = 4:
mod <- class::knn(cl = train$Class,
test = test[,1:60],
train = train[,1:60],
k = 4,
prob = TRUE)
plot(roc(test$Class, attributes(mod)$prob),
print.thres = T,
print.auc = T,
print.auc.y = 0.2)
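One caveat: the prob attribute returned by class::knn is the vote share of whichever class won, not the probability of one fixed class, so feeding it directly to roc() can give a misleading AUC. A minimal sketch of converting it to the probability of a chosen positive class for the two-class case follows; the helper name winning_to_positive is made up, and the example uses a hand-built factor instead of a real knn fit.

```r
# Convert "vote share of the winning class" into "probability of `positive`".
# Only valid for two-class problems.
winning_to_positive <- function(pred, positive) {
  p_win <- attr(pred, "prob")
  ifelse(pred == positive, p_win, 1 - p_win)
}

# Hand-built stand-in for the factor returned by class::knn(..., prob = TRUE)
pred <- factor(c("M", "R", "R"), levels = c("M", "R"))
attr(pred, "prob") <- c(0.8, 0.6, 1.0)

p_R <- winning_to_positive(pred, "R")
p_R
```

With the Sonar example, roc(test$Class, winning_to_positive(mod, "R")) would then score the probability of class R for every test row.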

predict with kernlab package error Error in .local(object, ...) : test vector does not match model R

I'm testing the kernlab package on a regression problem. It seems to be a common issue to get 'Error in .local(object, ...) : test vector does not match model !' when passing the ksvm object to the predict function. However, I have only found answers for classification problems or custom kernels, which are not applicable to my problem (I'm using a built-in kernel for regression). I'm running out of ideas here; my sample code is:
data <- matrix(rnorm(200*10),200,10)
tr <- data[1:150,]
ts <- data[151:200,]
mod <- ksvm(x = tr[,-1],
y = tr[,1],
kernel = "rbfdot", type = 'nu-svr',
kpar = "automatic", C = 60, cross = 3)
pred <- predict(mod,
ts
)
You forgot to remove the y variable from the test set, so it fails because the number of predictors doesn't match. This will work:
predict(mod,ts[,-1])
You can use pred <- predict(mod, ts) if ts is a data frame and the model was fitted with the formula interface.
It would be:
data <- setNames(data.frame(matrix(rnorm(200*10),200,10)),
c("Y",paste("X", 1:9, sep = "")))
tr <- data[1:150,]
ts <- data[151:200,]
mod <- ksvm(as.formula("Y ~ ."), data = tr,
kernel = "rbfdot", type = 'nu-svr',
kpar = "automatic", C = 60, cross = 3)
pred <- predict(mod, ts)
