Fastshap summary plot - Error: can't combine <double> and <factor<919a3>> - r

I'm trying to get a summary plot using fastshap explain function as in the code below.
p_function_G<- function(object, newdata)
caret::predict.train(object,
newdata =
newdata,
type = "prob")[,"AntiSocial"] # select G class
# Calculate the Shapley values
#
# boostFit: is a caret model using catboost algorithm
# trainset: is the dataset used for bulding the caret model.
# The dataset contains 4 categories W,G,R,GM
# corresponding to 4 diferent animal behaviors
library(caret)
shap_values_G <- fastshap::explain(xgb_fit,
X = game_train,
pred_wrapper =
p_function_G,
nsim = 50,
newdata= game_train[which(game_test=="AntiSocial"),])
)
However I'm getting error
Error in 'stop_vctrs()':
can't combine latitude and gender <factor<919a3>>
What's the way out?

I see that you are adapting code from Julia Silge's Predict ratings for board games Tutorial. The original code used SHAPforxgboost for generating SHAP values, but you're using the fastshap package.
Because Shapley explanations are only recently starting to gain traction, there aren't very many standard data formats. fastshap does not like tidyverse tibbles, it only takes matrices or matrix-likes.
The error occurs because, by default, fastshap attempts to convert the tibble to a matrix. But this fails, because matrices can only have one type (f.x. either double or factor, not both).
I also ran into a similar issue and found that you can solve this by passing the X parameter as a data.frame. I don't have access to your full code but you could you try replacing the shap_values_G code-block as so:
shap_values_G <- fastshap::explain(xgb_fit,
X = game_train,
pred_wrapper =
p_function_G,
nsim = 50,
newdata= as.data.frame(game_train[which(game_test=="AntiSocial"),]))
)
Wrap newdata with as.data.frame. This converts the tibble to a dataframe and so shouldn't upset fastshap.

Related

How can I include both my categorical and numeric predictors in my elastic net model? r

As a note beforehand, I think I should mention that I am working with highly sensitive medical data that is protected by HIPAA. I cannot share real data with dput- it would be illegal to do so. That is why I made a fake dataset and explained my processes to help reproduce the error.
I have been trying to estimate an elastic net model in r using glmnet. However, I keep getting an error. I am not sure what is causing it. The error happens when I go to train the data. It sounds like it has something to do with the data type and matrix.
I have provided a sample dataset. Then I set the outcomes and certain predictors to be factors. After setting certain variables to be factors, I label them. Next, I create an object with the column names of the predictors I want to use. That object is pred.names.min. Then I partition the data into the training and test data frames. 65% in the training, 35% in the test. With the train control function, I specify a few things I want to have happen with the model- random paraments for lambda and alpha, as well as the leave one out method. I also specify that it is a classification model (categorical outcome). In the last step, I specify the training model. I write my code to tell it to use all of the predictor variables in the pred.names.min object for the trainingset data frame.
library(dplyr)
library(tidyverse)
library(glmnet),0,1,0
library(caret)
#creating sample dataset
df<-data.frame("BMIfactor"=c(1,2,3,2,3,1,2,1,3,2,1,3,1,1,3,2,3,2,1,2,1,3),
"age"=c(0,4,8,1,2,7,4,9,9,2,2,1,8,6,1,2,9,2,2,9,2,1),
"L_TartaricacidArea"=c(0,1,1,0,1,1,1,0,0,1,0,1,1,0,1,0,0,1,1,0,1,1),
"Hydroxymethyl_5_furancarboxylicacidArea_2"=
c(1,1,0,1,0,0,1,0,1,1,0,1,1,0,1,1,0,1,0,1,0,1),
"Anhydro_1.5_D_glucitolArea"=
c(8,5,8,6,2,9,2,8,9,4,2,0,4,8,1,2,7,4,9,9,2,2),
"LevoglucosanArea"=
c(6,2,9,2,8,6,1,8,2,1,2,8,5,8,6,2,9,2,8,9,4,2),
"HexadecanolArea_1"=
c(4,9,2,1,2,9,2,1,6,1,2,6,2,9,2,8,6,1,8,2,1,2),
"EthanolamineArea"=
c(6,4,9,2,1,2,4,6,1,8,2,4,9,2,1,2,9,2,1,6,1,2),
"OxoglutaricacidArea_2"=
c(4,7,8,2,5,2,7,6,9,2,4,6,4,9,2,1,2,4,6,1,8,2),
"AminopentanedioicacidArea_3"=
c(2,5,5,5,2,9,7,5,9,4,4,4,7,8,2,5,2,7,6,9,2,4),
"XylitolArea"=
c(6,8,3,5,1,9,9,6,6,3,7,2,5,5,5,2,9,7,5,9,4,4),
"DL_XyloseArea"=
c(6,9,5,7,2,7,0,1,6,6,3,6,8,3,5,1,9,9,6,6,3,7),
"ErythritolArea"=
c(6,7,4,7,9,2,5,5,8,9,1,6,9,5,7,2,7,0,1,6,6,3),
"hpresponse1"=
c(1,0,1,1,0,1,1,0,0,1,0,0,1,0,1,1,1,0,1,0,0,1),
"hpresponse2"=
c(1,0,1,0,0,1,1,1,0,1,0,1,0,1,1,0,1,0,1,0,0,1))
#setting variables as factors
df$hpresponse1<-as.factor(df$hpresponse1)
df$hpresponse2<-as.factor(df$hpresponse2)
df$BMIfactor<-as.factor(df$BMIfactor)
df$L_TartaricacidArea<- as.factor(df$L_TartaricacidArea)
df$Hydroxymethyl_5_furancarboxylicacidArea_2<-
as.factor(df$Hydroxymethyl_5_furancarboxylicacidArea_2)
#labeling factor levels
df$hpresponse1 <- factor(df$hpresponse1, labels = c("group1.2", "group3.4"))
df$hpresponse2 <- factor(df$hpresponse2, labels = c("group1.2.3", "group4"))
df$L_TartaricacidArea <- factor(df$L_TartaricacidArea, labels =c ("No",
"Yes"))
df$Hydroxymethyl_5_furancarboxylicacidArea_2 <-
factor(df$Hydroxymethyl_5_furancarboxylicacidArea_2, labels =c ("No",
"Yes"))
df$BMIfactor <- factor(df$BMIfactor, labels = c("<40", ">=40and<50",
">=50"))
#creating list of predictor names
pred.start.min <- which(colnames(df) == "BMIfactor"); pred.start.min
pred.stop.min <- which(colnames(df) == "ErythritolArea"); pred.stop.min
pred.names.min <- colnames(df)[pred.start.min:pred.stop.min]
#partition data into training and test (65%/35%)
set.seed(2)
n=floor(nrow(df)*0.65)
train_ind=sample(seq_len(nrow(df)), size = n)
trainingset=df[train_ind,]
testingset=df[-train_ind,]
#specifying that I want to use the leave one out cross-
#validation method and
use "random" as search for elasticnet
tcontrol <- trainControl(method = "LOOCV",
search="random",
classProbs = TRUE)
#training model
elastic_model1 <- train(as.matrix(trainingset[,
pred.names.min]),
trainingset$hpresponse1,
data = trainingset,
method = "glmnet",
trControl = tcontrol)
After I run the last chunk of code, I end up with this error:
Error in { :
task 1 failed - "error in evaluating the argument 'x' in selecting a
method for function 'as.matrix': object of invalid type "character" in
'matrix_as_dense()'"
In addition: There were 50 or more warnings (use warnings() to see the first
50)
I tried removing the "as.matrix" arguemtent:
elastic_model1 <- train((trainingset[, pred.names.min]),
trainingset$hpresponse1,
data = trainingset,
method = "glmnet",
trControl = tcontrol)
It still produces a similar error.
Error in { :
task 1 failed - "error in evaluating the argument 'x' in selecting a method
for function 'as.matrix': object of invalid type "character" in
'matrix_as_dense()'"
In addition: There were 50 or more warnings (use warnings() to see the first
50)
When I tried to make none of the predictors factors (but keep outcome as factor), this is the error I get:
Error: At least one of the class levels is not a valid R variable name; This
will cause errors when class probabilities are generated because the
variables names will be converted to X0, X1 . Please use factor levels that
can be used as valid R variable names (see ?make.names for help).
How can I fix this? How can I use my predictors (both the numeric and categorical ones) without producing an error?
glmnet does not handle factors well. The recommendation currently is to dummy code and re-code to numeric where possible:
Using LASSO in R with categorical variables

How to input matrix data into brms formula?

I am trying to input matrix data into the brm() function to run a signal regression. brm is from the brms package, which provides an interface to fit Bayesian models using Stan. Signal regression is when you model one covariate using another within the bigger model, and you use the by parameter like this: model <- brm(response ~ s(matrix1, by = matrix2) + ..., data = Data). The problem is, I cannot input my matrices using the 'data' parameter because it only allows one data.frame object to be inputted.
Here are my code and the errors I obtained from trying to get around that constraint...
First off, my reproducible code leading up to the model-building:
library(brms)
#100 rows, 4 columns. Each cell contains a number between 1 and 10
Data <- data.frame(runif(100,1,10),runif(100,1,10),runif(100,1,10),runif(100,1,10))
#Assign names to the columns
names(Data) <- c("d0_10","d0_100","d0_1000","d0_10000")
Data$Density <- as.matrix(Data)%*%c(-1,10,5,1)
#the coefficients we are modelling
d <- c(-1,10,5,1)
#Made a matrix with 4 columns with values 10, 100, 1000, 10000 which are evaluation points. Rows are repeats of the same column numbers
Bins <- 10^matrix(rep(1:4,times = dim(Data)[1]),ncol = 4,byrow =T)
Bins
As mentioned above, since 'data' only allows one data.frame object to be inputted, I've tried other ways of inputting my matrix data. These methods include:
1) making the matrix within the brm() function using as.matrix()
signalregression.brms <- brm(Density ~ s(Bins,by=as.matrix(Data[,c(c("d0_10","d0_100","d0_1000","d0_10000"))])),data = Data)
#Error in is(sexpr, "try-error") :
argument "sexpr" is missing, with no default
2) making the matrix outside the formula, storing it in a variable, then calling that variable inside the brm() function
Donuts <- as.matrix(Data[,c(c("d0_10","d0_100","d0_1000","d0_10000"))])
signalregression.brms <- brm(Density ~ s(Bins,by=Donuts),data = Data)
#Error: The following variables can neither be found in 'data' nor in 'data2':
'Bins', 'Donuts'
3) inputting a list containing the matrix using the 'data2' parameter
signalregression.brms <- brm(Density ~ s(Bins,by=donuts),data = Data,data2=list(Bins = 10^matrix(rep(1:4,times = dim(Data)[1]),ncol = 4,byrow =T),donuts=as.matrix(Data[,c(c("d0_10","d0_100","d0_1000","d0_10000"))])))
#Error in names(dat) <- object$term :
'names' attribute [1] must be the same length as the vector [0]
None of the above worked; each had their own errors and it was difficult troubleshooting them because I couldn't find answers or examples online that were of a similar nature in the context of brms.
I was able to use the above techniques just fine for gam(), in the mgcv package - you don't have to define a data.frame using 'data', you can call on variables defined outside of the gam() formula, and you can make matrices inside the gam() function itself. See below:
library(mgcv)
signalregression2 <- gam(Data$Density ~ s(Bins,by = as.matrix(Data[,c("d0_10","d0_100","d0_1000","d0_10000")]),k=3))
#Works!
It seems like brms is less flexible... :(
My question: does anyone have any suggestions on how to make my brm() function run?
Thank you very much!
My understanding of signal regression is limited enough that I'm not convinced this is correct, but I think it's at least a step in the right direction. The problem seems to be that brm() expects everything in its formula to be a column in data. So we can get the model to compile by ensuring all the things we want are present in data:
library(tidyverse)
signalregression.brms = brm(Density ~
s(cbind(d0_10_bin, d0_100_bin, d0_1000_bin, d0_10000_bin),
by = cbind(d0_10, d0_100, d0_1000, d0_10000),
k = 3),
data = Data %>%
mutate(d0_10_bin = 10,
d0_100_bin = 100,
d0_1000_bin = 1000,
d0_10000_bin = 10000))
Writing out each column by hand is a little annoying; I'm sure there are more general solutions.
For reference, here are my installed package versions:
map_chr(unname(unlist(pacman::p_depends(brms)[c("Depends", "Imports")])), ~ paste(., ": ", pacman::p_version(.), sep = ""))
[1] "Rcpp: 1.0.6" "methods: 4.0.3" "rstan: 2.21.2" "ggplot2: 3.3.3"
[5] "loo: 2.4.1" "Matrix: 1.2.18" "mgcv: 1.8.33" "rstantools: 2.1.1"
[9] "bayesplot: 1.8.0" "shinystan: 2.5.0" "projpred: 2.0.2" "bridgesampling: 1.1.2"
[13] "glue: 1.4.2" "future: 1.21.0" "matrixStats: 0.58.0" "nleqslv: 3.3.2"
[17] "nlme: 3.1.149" "coda: 0.19.4" "abind: 1.4.5" "stats: 4.0.3"
[21] "utils: 4.0.3" "parallel: 4.0.3" "grDevices: 4.0.3" "backports: 1.2.1"

Issue with h2o Package in R using subsetted dataframes leading to near perfect prediction accuracy

I have been stumped on this problem for a very long time and cannot figure it out. I believe the issue stems from subsets of data.frame objects retaining information of the parent but I also feel it's causing issues when training h2o.deeplearning models on what I think is just my training set (though this may not be true). See below for sample code. I included comments to clarify what I'm doing but it's fairly short code:
dataset = read.csv("dataset.csv")[,-1] # Read dataset in but omit the first column (it's just an index from the original data)
y = dataset[,1] # Create response
X = dataset[,-1] # Create regressors
X = model.matrix(y~.,data=dataset) # Automatically create dummy variables
y=as.factor(y) # Ensure y has factor data type
dataset = data.frame(y,X) # Create final data.frame dataset
train = sample(length(y),length(y)/1.66) # Create training indices -- A boolean
test = (-train) # Create testing indices
h2o.init(nthreads=2) # Initiate h2o
# BELOW: Create h2o.deeplearning model with subset of dataset.
mlModel = h2o.deeplearning(y='y',training_frame=as.h2o(dataset[train,,drop=TRUE]),activation="Rectifier",
hidden=c(6,6),epochs=10,train_samples_per_iteration = -2)
predictions = h2o.predict(mlModel,newdata=as.h2o(dataset[test,-1])) # Predict using mlModel
predictions = as.data.frame(predictions) # Convert predictions to dataframe object. as.vector() caused issues for me
predictions = predictions[,1] # Extract predictions
mean(predictions!=y[test])
The problem is that if I evaluate this against my test subset I get almost 0% error:
[1] 0.0007531255
Has anyone encountered this issue? Have an idea of how to alleviate this problem?
It will be more efficient to use the H2O functions to load the data and split it.
data = h2o.importFile("dataset.csv")
y = 2 #Response is 2nd column, first is an index
x = 3:(ncol(data)) #Learn from all the other columns
data[,y] = as.factor(data[,y])
parts = h2o.splitFrame(data, 0.8) #Split 80/20
train = parts[[1]]
test = parts[[2]]
# BELOW: Create h2o.deeplearning model with subset of dataset.
mlModel = h2o.deeplearning(x=x, y=y, training_frame=train,activation="Rectifier",
hidden=c(6,6),epochs=10,train_samples_per_iteration = -2)
h2o.performance(mlModel, test)
It is hard to say what the problem with your original code is, without seeing the contents of dataset.csv and being able to try it. My guess is that train and test are not being split, and it is actually being trained on the test data.

Kaggle Digit Recognizer Using SVM (e1071): Error in predict.svm(ret, xhold, decision.values = TRUE) : Model is empty

I am trying to solve the digit Recognizer competition in Kaggle and I run in to this error.
I loaded the training data and adjusted the values of it by dividing it with the maximum pixel value which is 255. After that, I am trying to build my model.
Here Goes my code,
Given_Training_data <- get(load("Given_Training_data.RData"))
Given_Testing_data <- get(load("Given_Testing_data.RData"))
Maximum_Pixel_value = max(Given_Training_data)
Tot_Col_Train_data = ncol(Given_Training_data)
training_data_adjusted <- Given_Training_data[, 2:ncol(Given_Training_data)]/Maximum_Pixel_value
testing_data_adjusted <- Given_Testing_data[, 2:ncol(Given_Testing_data)]/Maximum_Pixel_value
label_training_data <- Given_Training_data$label
final_training_data <- cbind(label_training_data, training_data_adjusted)
smp_size <- floor(0.75 * nrow(final_training_data))
set.seed(100)
training_ind <- sample(seq_len(nrow(final_training_data)), size = smp_size)
training_data1 <- final_training_data[training_ind, ]
train_no_label1 <- as.data.frame(training_data1[,-1])
train_label1 <-as.data.frame(training_data1[,1])
svm_model1 <- svm(train_label1,train_no_label1) #This line is throwing an error
Error : Error in predict.svm(ret, xhold, decision.values = TRUE) : Model is empty!
Please Kindly share your thoughts. I am not looking for an answer but rather some idea that guides me in the right direction as I am in a learning phase.
Thanks.
Update to the question :
trainlabel1 <- train_label1[sapply(train_label1, function(x) !is.factor(x) | length(unique(x))>1 )]
trainnolabel1 <- train_no_label1[sapply(train_no_label1, function(x) !is.factor(x) | length(unique(x))>1 )]
svm_model2 <- svm(trainlabel1,trainnolabel1,scale = F)
It didn't help either.
Read the manual (https://cran.r-project.org/web/packages/e1071/e1071.pdf):
svm(x, y = NULL, scale = TRUE, type = NULL, ...)
...
Arguments:
...
x a data matrix, a vector, or a sparse matrix (object of class
Matrix provided by the Matrix package, or of class matrix.csr
provided by the SparseM package,
or of class simple_triplet_matrix provided by the slam package).
y a response vector with one label for each row/component of x.
Can be either a factor (for classification tasks) or a numeric vector
(for regression).
Therefore, the mains problems are that your call to svm is switching the data matrix and the response vector, and that you are passing the response vector as integer, resulting in a regression model. Furthermore, you are also passing the response vector as a single-column data-frame, which is not exactly how you are supposed to do it. Hence, if you change the call to:
svm_model1 <- svm(train_no_label1, as.factor(train_label1[, 1]))
it will work as expected. Note that training will take some minutes to run.
You may also want to remove features that are constant (where the values in the respective column of the training data matrix are all identical) in the training data, since these will not influence the classification.
I don't think you need to scale it manually since svm itself will do it unlike most neural network package.
You can also use the formula version of svm instead of the matrix and vectors which is
svm(result~.,data = your_training_set)
in your case, I guess you want to make sure the result to be used as factor,because you want a label like 1,2,3 not 1.5467 which is a regression
I can debug it if you can share the data:Given_Training_data.RData

Predict function from Caret package give an Error

I am doing just a regular logistic regression using the caret package in R. I have a binomial response variable coded 1 or 0 that is called a SALES_FLAG and 140 numeric response variables that I used dummyVars function in R to transform to dummy variables.
data <- dummyVars(~., data = data_2, fullRank=TRUE,sep="_",levelsOnly = FALSE )
dummies<-(predict(data, data_2))
model_data<- as.data.frame(dummies)
This gives me a data frame to work with. All of the variables are numeric. Next I split into training and testing:
trainIndex <- createDataPartition(model_data$SALE_FLAG, p = .80,list = FALSE)
train <- model_data[ trainIndex,]
test <- model_data[-trainIndex,]
Time to train my model using the train function:
model <- train(SALE_FLAG~. data=train,method = "glm")
Everything runs nice and I get a model. But when I run the predict function it does not give me what I need:
predict(model, newdata =test,type="prob")
and I get an ERROR:
Error in dimnames(out)[[2]] <- modelFit$obsLevels :
length of 'dimnames' [2] not equal to array extent
On the other hand when I replace "prob" with "raw" for type inside of the predict function I get prediction but I need probabilities so I can code them into binary variable given my threshold.
Not sure why this happens. I did the same thing without using the caret package and it worked how it should:
model2 <- glm(SALE_FLAG ~ ., family = binomial(logit), data = train)
predict(model2, newdata =test, type="response")
I spend some time looking at this but not sure what is going on and it seems very weird to me. I have tried many variations of the train function meaning I didn't use the formula and used X and Y. I used method = 'bayesglm' as well to check and id gave me the same error. I hope someone can help me out. I don't need to use it since the train function to get what I need but caret package is a good package with lots of tools and I would like to be able to figure this out.
Show us str(train) and str(test). I suspect the outcome variable is numeric, which makes train think that you are doing regression. That should also be apparent from printing model. Make it a factor if you want to do classification.
Max

Resources