xgboost in R providing unexpected prediction

Below is code that builds a simple xgboost model to reproduce an issue I've been seeing. Once the model is built, we predict with it and take the second row of our data. Taking the log of the ratio between the 10-tree and 9-tree predictions should isolate the contribution of the 10th tree: 0.00873184 in this case.
Now if we take the input to the tree (matrix "a", which has the value 0.1234561702 in row 2) and trace it through the plotted tree by hand, we expect a prediction of 0.0121501638. However, it looks like at the second split (<0.123456173) the model takes the "wrong" direction and ends up at the leaf with value 0.00873187464 - very close to the 0.00873184 we computed above!
Does anyone have an idea what is going on?
[Figure: plot of the 10th tree from xgb.plot.tree]
Versions:
R: 4.1.0
xgboost: 1.4.1.1
dplyr: 1.0.7
data.table: 1.14.0
library(xgboost)
library(dplyr)
library(data.table)
set.seed(2)
# 1000 feature values in a very narrow band around 0.1234561
a <- matrix(runif(1000, 0.1234561, 0.1234562),
            ncol = 1, nrow = 1000)
colnames(a) <- c("b")
d <- abs(rnorm(1000, 3 * a[, 1]))
d2 <- xgb.DMatrix(data = a, label = d)
e <- xgboost::xgboost(data = d2, nrounds = 10, method = "hist", objective = "reg:gamma")
xgb.plot.tree(e$feature_names, e, trees = 9)
x <- 2
# the log of the prediction ratio isolates the 10th tree's leaf value (log link of reg:gamma)
log((predict(e, a, ntreelimit = 10) / predict(e, a, ntreelimit = 9)))[x]
format(a[x, ], nsmall = 10)

For anyone interested, the xgboost team provided the answer here:
https://github.com/dmlc/xgboost/issues/7294
In short, xgboost converts the input data to float32 before training, whereas R uses double precision by default. Hence, 0.1234561702 should be converted to float32 before being traced through the model. Doing so gives the value 0.123456173, which then takes the correct path.
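One quick way to check this in R is to round-trip the value through 32-bit floats. This sketch assumes the float package (not used in the original post), whose fl()/dbl() functions convert to and from float32:
library(float)
x <- a[2, ]              # 0.1234561702 stored as a double
x32 <- dbl(fl(x))        # round-trip through float32
format(x32, nsmall = 10) # prints the float32-rounded value, ~0.123456173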

Related

Fastshap summary plot - Error: can't combine <double> and <factor<919a3>>

I'm trying to get a summary plot using the fastshap explain() function, as in the code below.
p_function_G <- function(object, newdata) {
  caret::predict.train(object,
                       newdata = newdata,
                       type = "prob")[, "AntiSocial"] # select G class
}
# Calculate the Shapley values
#
# boostFit: a caret model using the catboost algorithm
# trainset: the dataset used for building the caret model.
# The dataset contains 4 categories (W, G, R, GM)
# corresponding to 4 different animal behaviors
library(caret)
shap_values_G <- fastshap::explain(xgb_fit,
                                   X = game_train,
                                   pred_wrapper = p_function_G,
                                   nsim = 50,
                                   newdata = game_train[which(game_test == "AntiSocial"), ])
However, I'm getting this error:
Error in `stop_vctrs()`:
! Can't combine `latitude` <double> and `gender` <factor<919a3>>
What's the way out?
I see that you are adapting code from Julia Silge's Predict ratings for board games tutorial. The original code used SHAPforxgboost to generate the SHAP values, but you're using the fastshap package.
Because Shapley explanations have only recently started to gain traction, there aren't many standard data formats yet. fastshap does not like tidyverse tibbles; it only takes matrices or matrix-likes.
The error occurs because, by default, fastshap attempts to convert the tibble to a matrix. This fails because a matrix can only hold one type (e.g. either double or factor, not both).
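A two-line illustration of that constraint (the column names here are hypothetical, echoing the error message):
df <- data.frame(latitude = c(51.5, 48.9), gender = factor(c("m", "f")))
as.matrix(df) # both columns get coerced to a single type (character)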
I also ran into a similar issue and found that you can solve this by passing a data.frame rather than a tibble. I don't have access to your full code, but could you try replacing the shap_values_G code block as follows:
shap_values_G <- fastshap::explain(xgb_fit,
                                   X = game_train,
                                   pred_wrapper = p_function_G,
                                   nsim = 50,
                                   newdata = as.data.frame(game_train[which(game_test == "AntiSocial"), ]))
Wrap newdata with as.data.frame. This converts the tibble to a data frame and so shouldn't upset fastshap.
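If the same error then comes from the X argument, the identical conversion can be applied there as well (a sketch based on the same reasoning; game_train's exact structure isn't shown in the question):
shap_values_G <- fastshap::explain(xgb_fit,
                                   X = as.data.frame(game_train),
                                   pred_wrapper = p_function_G,
                                   nsim = 50,
                                   newdata = as.data.frame(game_train[which(game_test == "AntiSocial"), ]))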

KNN in R -- All arguments must have the same length, test.X is empty

I'm trying to perform KNN in R on a data frame, doing 3-way classification of vehicle types (car, boat, plane) using columns such as mpg and cost as features.
To start, when I run:
knn.pred = knn(train.X, test.X, train.VehicleType, k = 3)
then
knn.pred
returns
factor(0) Levels: car boat plane
And
table(knn.pred, VehicleType.All)
returns
Error in table(knn.pred, VehicleType.All) :
all arguments must have the same length
I think my problem is that I can successfully load train.X with cbind() but when I try the same for test.X it remains an empty matrix. My code looks like this:
train = (DATA$Values <= 200) # to train for all 200 entries including cars, boats and planes
train.X = cbind(DATA$mpg, DATA$cost)[train, ]
summary(train.X)
Here, summary(train.X) returns correctly, but when I try the same for test.X:
test.X = cbind(DATA$mpg, DATA$cost)[!train, ]
when I try to print test.X, it returns an empty matrix like so:
[,1] [,2]
Apologies for such a long question, and I'm probably not including all the relevant info. If anyone has any idea what's going wrong here, or why test.X isn't picking up any data, I'd appreciate it!
Without any info on your data it is hard to guess where the problem is; you should post a minimal reproducible example, or at least dput() your data or part of it. However, here I show two methods for training a knn model, using two different packages (class and caret) with the built-in mtcars dataset.
with class
library(class)
data("mtcars")
str(mtcars)
mtcars$gear <- as.factor(mtcars$gear) # 'gear' stands in for the class label
ind <- sample(1:nrow(mtcars), 20)
train.X <- mtcars[ind, ]
test.X <- mtcars[-ind, ]
train.VehicleType <- train.X[, "gear"]
VehicleType.All <- test.X[, "gear"]
# drop the factor column (10) before calling knn(): class::knn expects numeric features
knn.pred <- knn(train.X[, -10], test.X[, -10], train.VehicleType, k = 3)
table(knn.pred, VehicleType.All)
with caret
library(caret)
ind <- createDataPartition(mtcars$gear, p = 0.60, list = FALSE)
train.X <- mtcars[ind, ]
test.X <- mtcars[-ind, ]
control <- trainControl(method = "cv", number = 10)
grid <- expand.grid(k = 2:10)
knn.pred <- train(gear ~ ., data = train.X, method = "knn",
                  trControl = control, tuneGrid = grid)
pred <- predict(knn.pred, test.X[, -10])
cm <- confusionMatrix(pred, test.X$gear)
The caret package makes it straightforward to cross-validate tuning parameters during model fitting. Here trainControl requests 10-fold cross-validation (without trControl, train defaults to 25 bootstrap resamples) to find the best value of k among the values supplied in the grid object.
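To see which k was selected, the fitted train object exposes the usual caret fields:
knn.pred$bestTune # the k chosen from the grid
knn.pred$results  # accuracy and kappa for each candidate k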
From your example, it seems that your test object is empty, so the result of knn is a 0-length vector; the problem is probably in the data reading. In any case, a better way to subset your DATA is this:
# instead of
train.X = cbind(DATA$mpg, DATA$cost)[train, ]
# you should do:
train.X <- DATA[train, c("mpg", "cost")]
test.X <- DATA[!train, c("mpg", "cost")] # 'train' is logical, so negate it with ! rather than -
However, I do not understand what the variable DATA$Values is. At first I thought it was the outcome, but this line confused me a lot:
train = (DATA$Values <= 200)
You can work through these examples to catch your error on your own; if you can't, post an example that reproduces your situation.

Kaggle Digit Recognizer Using SVM (e1071): Error in predict.svm(ret, xhold, decision.values = TRUE) : Model is empty

I am trying to solve the Digit Recognizer competition on Kaggle and I ran into this error.
I loaded the training data and scaled its values by dividing by the maximum pixel value, which is 255. After that, I am trying to build my model.
Here goes my code:
Given_Training_data <- get(load("Given_Training_data.RData"))
Given_Testing_data <- get(load("Given_Testing_data.RData"))
Maximum_Pixel_value <- max(Given_Training_data)
Tot_Col_Train_data <- ncol(Given_Training_data)
training_data_adjusted <- Given_Training_data[, 2:ncol(Given_Training_data)] / Maximum_Pixel_value
testing_data_adjusted <- Given_Testing_data[, 2:ncol(Given_Testing_data)] / Maximum_Pixel_value
label_training_data <- Given_Training_data$label
final_training_data <- cbind(label_training_data, training_data_adjusted)
smp_size <- floor(0.75 * nrow(final_training_data))
set.seed(100)
training_ind <- sample(seq_len(nrow(final_training_data)), size = smp_size)
training_data1 <- final_training_data[training_ind, ]
train_no_label1 <- as.data.frame(training_data1[, -1])
train_label1 <- as.data.frame(training_data1[, 1])
svm_model1 <- svm(train_label1, train_no_label1) # this line is throwing an error
Error : Error in predict.svm(ret, xhold, decision.values = TRUE) : Model is empty!
Please kindly share your thoughts. I am not looking for a ready answer, but rather some idea that guides me in the right direction, as I am in a learning phase.
Thanks.
Update to the question :
trainlabel1 <- train_label1[sapply(train_label1, function(x) !is.factor(x) | length(unique(x)) > 1)]
trainnolabel1 <- train_no_label1[sapply(train_no_label1, function(x) !is.factor(x) | length(unique(x)) > 1)]
svm_model2 <- svm(trainlabel1, trainnolabel1, scale = F)
It didn't help either.
Read the manual (https://cran.r-project.org/web/packages/e1071/e1071.pdf):
svm(x, y = NULL, scale = TRUE, type = NULL, ...)
Arguments:
x: a data matrix, a vector, or a sparse matrix (an object of class Matrix provided by the Matrix package, of class matrix.csr provided by the SparseM package, or of class simple_triplet_matrix provided by the slam package).
y: a response vector with one label for each row/component of x. Can be either a factor (for classification tasks) or a numeric vector (for regression).
Therefore, the main problems are that your call to svm switches the data matrix and the response vector, and that you are passing the response as an integer vector, resulting in a regression model. Furthermore, you are passing the response as a single-column data frame, which is not how it is meant to be supplied. Hence, if you change the call to:
svm_model1 <- svm(train_no_label1, as.factor(train_label1[, 1]))
it will work as expected. Note that training will take some minutes to run.
You may also want to remove features that are constant (columns of the training data matrix whose values are all identical), since they cannot influence the classification.
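A short sketch of one way to drop such columns (reusing the thread's variable names, and assuming train_no_label1 is a data frame):
keep <- sapply(train_no_label1, function(col) length(unique(col)) > 1)
train_no_label1 <- train_no_label1[, keep, drop = FALSE] # keep only non-constant columns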
I don't think you need to scale the data manually, since svm does that itself, unlike most neural network packages.
You can also use the formula interface of svm instead of separate matrix and vector arguments:
svm(result ~ ., data = your_training_set)
In your case, make sure the response is a factor, because you want a class label like 1, 2, 3, not a regression output like 1.5467.
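Putting that together with the variables from the question (a sketch; the combined data frame name is made up):
training_set <- data.frame(result = as.factor(train_label1[, 1]), train_no_label1)
svm_model <- svm(result ~ ., data = training_set) # factor response => classification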
I can debug it if you can share the data: Given_Training_data.RData

No missing values are allowed -- kNN in R

I have a data set of 45212 rows and 17 columns, and I want to predict the class label in the last column using the kNN algorithm. As far as I can tell everything is OK, but I always end up with the error
"Error in knn(train = data_train, test = data_test, cl = data_train_labels, :
no missing values are allowed"
here is my code
> data_train <- data[1:25000, ]
> data_test <- data[25001:45212, ]
> data_train_labels <- data[1:25000, 17]
> data_test_labels <- data[25001:45212, 17]
> install.packages("class")
> library(class)
> data_test_pred <- knn(train = data_train, test = data_test, cl = data_train_labels, k = 10)
here is what my data set looks like:
age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
35,management,married,tertiary,no,231,yes,no,unknown,5,may,139,1,-1,0,unknown,no
28,management,single,tertiary,no,447,yes,yes,unknown,5,may,217,1,-1,0,unknown,no
42,entrepreneur,divorced,tertiary,yes,2,yes,no,unknown,5,may,380,1,-1,0,unknown,no
58,retired,married,primary,no,121,yes,no,unknown,5,may,50,1,-1,0,unknown,no
43,technician,single,secondary,no,593,yes,no,unknown,5,may,55,1,-1,0,unknown,no
41,admin.,divorced,secondary,no,270,yes,no,unknown,5,may,222,1,-1,0,unknown,no
I think that your problem is all of the factors in your data. The knn documentation says that it uses Euclidean distance, which does not make sense for factors. Here is a possible solution if you really want to use knn: you can get a distance matrix between the points using daisy in the cluster package. There are several implementations of knn in R, but I don't know of one that accepts a distance matrix. You could either write your own (not so difficult) or map the distance matrix to a Euclidean space using cmdscale, then use knn on the projected space, as sketched below.
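A minimal sketch of that pipeline, assuming the data frame is called data with the label in column 17 (on all 45212 rows the distance matrix gets very large, so you would want to try this on a subset first):
library(cluster)
library(class)
gower_d <- daisy(data[, -17], metric = "gower") # mixed-type dissimilarity, handles factors
coords <- cmdscale(gower_d, k = 5)              # embed the points in a 5-D Euclidean space
train_idx <- 1:25000
pred <- knn(train = coords[train_idx, ],
            test = coords[-train_idx, ],
            cl = data[train_idx, 17], k = 10)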
I believe that your mistake is data_train <- data[1:25000,]: you are including your header row, which you have not normalized. I was able to reproduce the same error, but when I changed it to data_train <- data[2:25000,] it ran fine.
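If the header really was read in as a data row, it may be cleaner to re-read the file with header = TRUE so the first row holds actual observations (the file name here is hypothetical):
data <- read.csv("bank_data.csv", header = TRUE, stringsAsFactors = TRUE)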

predict in caret ConfusionMatrix is removing rows

I'm fairly new to the caret library and it's causing me some problems; any help/advice would be appreciated. My situation is as follows:
I'm trying to run a generalized linear model on some data and, when I run it through confusionMatrix, I get 'the data and reference factors must have the same number of levels'. I know what this error means (I've run into it before), but I've double- and triple-checked my data manipulation and it all looks correct (I'm using the right variables in the right places), so I'm not sure why the two arguments to the confusionMatrix call disagree. I've run almost exactly the same code for a different variable and it works fine.
I went through every variable and everything was balanced until I got to the predict call inside the confusionMatrix. I discovered this by doing the following:
a <- table(testing2$hold1yes0no)
a[1]+a[2]
1543
b <- table(predict(modelFit,trainTR2))
dim(b)
[1] 1538
Those two values shouldn't disagree. Where are the missing 5 rows?
My code is below:
set.seed(2382)
inTrain2 <- createDataPartition(y=HOLD$hold1yes0no, p = 0.6, list = FALSE)
training2 <- HOLD[inTrain2,]
testing2 <- HOLD[-inTrain2,]
preProc2 <- preProcess(training2[-c(1,2,3,4,5,6,7,8,9)], method="BoxCox")
trainPC2 <- predict(preProc2, training2[-c(1,2,3,4,5,6,7,8,9)])
trainTR2 <- predict(preProc2, testing2[-c(1,2,3,4,5,6,7,8,9)])
modelFit <- train(training2$hold1yes0no ~ ., method ="glm", data = trainPC2)
confusionMatrix(testing2$hold1yes0no, predict(modelFit,trainTR2))
I'm not sure, as I don't know your data structure, but I wonder if this is due to the way you set up modelFit using the formula method. In that case, you are specifying y = training2$hold1yes0no and x = everything else. Perhaps you should try:
modelFit <- train(trainPC2, training2$hold1yes0no, method="glm")
which specifies y = training2$hold1yes0no and x = trainPC2.
