R Caret knnImpute for partially NA rows

I'm trying to run some code to preprocess my data for machine learning in Caret. One step I'm having a lot of trouble with is KNN imputation. When I run the following block of code:
library(caret)
traindf <- data.frame(matrix( rnorm(7*7,mean=0,sd=1), nrow=7, ncol=7))
testdf <- data.frame(matrix( rnorm(7*7,mean=0,sd=1), nrow=7, ncol=7))
for(i in 1:7){
traindf[i,i] <- NA #generates NA's in every row
}
impute_model <- preProcess(traindf, method = c('knnImpute')) #this line is problematic
imputed_train <- predict(impute_model, traindf)
imputed_test <- predict(impute_model, testdf)
I get an error:
Error in RANN::nn2(old[, non_missing_cols, drop = FALSE], new[, non_missing_cols, :
Cannot find more nearest neighbours than there are points
From some research, I believe this is due to the fact that the kNN imputation implementation Caret uses discards rows with any NA's. In my dataset, NA's are scattered throughout such that this would result in all rows being discarded for imputation purposes. Instead I would like to keep these partially NA rows and still use them for imputation.
I know of one package that does this: https://www.rdocumentation.org/packages/impute/versions/1.46.0/topics/impute.knn. However, it doesn't provide a predict method, so I can't easily use it to impute the test set as well, as in the example above.
Does anyone have suggestions on how I can get this partial-NA KNN imputation working with Caret?
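One possible workaround, sketched in base R rather than with caret: measure the distance between two rows only over the columns that both have observed, so partially NA rows can still act as neighbours. The helper name knn_impute_partial is hypothetical (it is not part of caret or impute), and unlike caret's knnImpute this sketch does not centre or scale the data first.

```r
# Hypothetical helper (not part of caret or impute): fill the NAs in `new`
# using the k nearest rows of `ref`, where distance is the mean squared
# difference over the columns that both rows have observed.
knn_impute_partial <- function(ref, new, k = 3) {
  ref <- as.matrix(ref)
  out <- as.matrix(new)
  for (i in seq_len(nrow(out))) {
    target <- out[i, ]
    miss <- which(is.na(target))
    if (length(miss) == 0) next
    # Distance from the target row to every reference row
    d <- apply(ref, 1, function(r) {
      shared <- !is.na(r) & !is.na(target)
      if (!any(shared)) return(Inf)
      mean((r[shared] - target[shared])^2)
    })
    for (j in miss) {
      # Donors must have column j observed and share at least one column
      ok <- which(!is.na(ref[, j]) & is.finite(d))
      donors <- ok[order(d[ok])][seq_len(min(k, length(ok)))]
      out[i, j] <- mean(ref[donors, j])
    }
  }
  as.data.frame(out)
}

set.seed(1)
traindf <- data.frame(matrix(rnorm(7 * 7), nrow = 7, ncol = 7))
testdf <- data.frame(matrix(rnorm(7 * 7), nrow = 7, ncol = 7))
for (i in 1:7) traindf[i, i] <- NA  # an NA in every training row
testdf[2, 5] <- NA                  # one NA in the test set, for illustration

imputed_train <- knn_impute_partial(traindf, traindf)
imputed_test <- knn_impute_partial(traindf, testdf)  # test rows borrow from train only
```

Passing the training data as ref in both calls plays the role that predict plays in the caret workflow: the test set is imputed from the training rows only.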

Related

Fastshap summary plot - Error: can't combine <double> and <factor<919a3>>

I'm trying to get a summary plot using fastshap explain function as in the code below.
p_function_G<- function(object, newdata)
caret::predict.train(object,
newdata =
newdata,
type = "prob")[,"AntiSocial"] # select G class
# Calculate the Shapley values
#
# boostFit: is a caret model using catboost algorithm
# trainset: is the dataset used for bulding the caret model.
# The dataset contains 4 categories W,G,R,GM
# corresponding to 4 diferent animal behaviors
library(caret)
shap_values_G <- fastshap::explain(xgb_fit,
X = game_train,
pred_wrapper =
p_function_G,
nsim = 50,
newdata= game_train[which(game_test=="AntiSocial"),])
However I'm getting error
Error in 'stop_vctrs()':
Can't combine `latitude` <double> and `gender` <factor<919a3>>.
What's the way out?
I see that you are adapting code from Julia Silge's Predict ratings for board games Tutorial. The original code used SHAPforxgboost for generating SHAP values, but you're using the fastshap package.
Because Shapley explanations have only recently started to gain traction, there aren't many standard data formats yet. fastshap does not like tidyverse tibbles; it only takes matrices or matrix-like objects.
The error occurs because, by default, fastshap attempts to convert the tibble to a matrix. This fails because a matrix can hold only one type (e.g. either double or factor, not both).
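The one-type-per-matrix restriction is easy to see in base R. The toy data frame below is an assumption that borrows the column names from the error message; it is not the asker's data.

```r
# Toy data with one numeric and one factor column (made-up values)
df <- data.frame(latitude = c(51.5, 40.7, 48.9),
                 gender = factor(c("f", "m", "f")))

m <- as.matrix(df)
typeof(m)         # every cell is coerced to a single type

# data.matrix() keeps things numeric by replacing factors with their integer codes
dm <- data.matrix(df)
typeof(dm)
```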
I ran into a similar issue and found that you can solve it by passing the data as a data.frame. I don't have access to your full code, but you could try replacing the shap_values_G code block like so:
shap_values_G <- fastshap::explain(xgb_fit,
X = game_train,
pred_wrapper =
p_function_G,
nsim = 50,
newdata= as.data.frame(game_train[which(game_test=="AntiSocial"),]))
Wrap newdata with as.data.frame. This converts the tibble to a dataframe and so shouldn't upset fastshap.

KNN in R -- All arguments must have the same length, test.X is empty

I'm trying to perform KNN in R on a dataframe, following 3-way classification for vehicle types (car, boat, plane), using columns such as mpg, cost as features.
To start, when I run:
knn.pred=knn(train.X,test.X,train.VehicleType,k=3)
then
knn.pred
returns
factor(0) Levels: car boat plane
And
table(knn.pred,VehicleType.All)
returns
Error in table(knn.pred, VehicleType.All) :
all arguments must have the same length
I think my problem is that I can successfully load train.X with cbind() but when I try the same for test.X it remains an empty matrix. My code looks like this:
train=(DATA$Values<=200) # to train for all 200 entries including cars, boats and planes
train.X = cbind(DATA$mpg,DATA$cost)[train,]
summary(train.X)
Here, summary(train.X) returns correctly, but when I try the same for test.X:
test.X = cbind(DATA$mpg,DATA$cost)[!train,]
When I try and print test.X it returns an empty matrix like so:
[,1] [,2]
Apologies for such a long question and I'm probably not including all relevant info. If anyone has any idea what's going wrong here or why my test.X isn't loading through any data I'd appreciate it!
Without any info on your data, it is hard to guess where the problem is. You should post a minimal reproducible example, or at least dput your data or part of it. In the meantime, here are two ways to train a knn model, using two different packages (class and caret) with the built-in mtcars dataset.
with class
library(class)
data("mtcars")
str(mtcars)
mtcars$gear <- as.factor(mtcars$gear)
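ind <- sample(1:nrow(mtcars),20)
# keep the label (gear) out of the feature matrices so it does not leak into the distances
features <- setdiff(names(mtcars),"gear")
train.X <- mtcars[ind,features]
test.X <- mtcars[-ind,features]
train.VehicleType <- mtcars[ind,"gear"]
VehicleType.All <- mtcars[-ind,"gear"]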
knn.pred=knn(train.X,test.X,train.VehicleType,k=3)
table(knn.pred,VehicleType.All)
with caret
library(caret)
ind <- createDataPartition(mtcars$gear,p=0.60,list=F)
train.X <- mtcars[ind,]
test.X <- mtcars[-ind,]
control <-trainControl(method = "cv",number = 10)
grid <- expand.grid(k=2:10)
knn.pred <- train(gear~.,data=train.X,method="knn",tuneGrid=grid)
pred <- predict(knn.pred,test.X[,-10])
cm <- confusionMatrix(pred,test.X$gear)
The caret package makes it straightforward to tune parameters by resampling during model fitting. By default, train uses the bootstrap with 25 resamples to choose the best value of k from the values supplied in the grid object. To use the 10-fold cross-validation defined in control instead, pass trControl = control to train.
From your example, it seems that your test object is empty so the result of knn is a 0-length vector. Probably your problem is in the data reading. However, a better way to subset your DATA can be this:
#instead of
train.X = cbind(DATA$mpg,DATA$cost)[train,]
#you should do (train is a logical vector, so negate it with ! rather than -):
train.X <- DATA[train,c("mpg","cost")]
test.X <- DATA[!train,c("mpg","cost")]
However, I do not understand what variable DATA$Values is. At first I thought it was the outcome, but this line confused me a lot:
train=(DATA$Values<=200)
You can work through these examples to find the error on your own. If you can't, post an example that reproduces your situation.
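One way the test matrix can end up empty, sketched under the assumption that every row of DATA satisfies Values <= 200 (the comment in the question suggests this, but it is a guess). The toy DATA below is made up for illustration.

```r
# Toy stand-in for DATA (made-up values); every Values entry is <= 200
DATA <- data.frame(Values = seq(10, 100, by = 10),
                   mpg = c(21, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8),
                   cost = c(110, 93, 110, 175, 105, 245, 62, 95, 123, 123))

train <- DATA$Values <= 200
table(train)                        # only TRUE: nothing is left over for testing

test.X <- DATA[!train, c("mpg", "cost")]
nrow(test.X)                        # zero rows, so knn() has nothing to predict

# A random split leaves rows on both sides
set.seed(1)
train_idx <- sample(nrow(DATA), 7)
train.X <- DATA[train_idx, c("mpg", "cost")]
test.X2 <- DATA[-train_idx, c("mpg", "cost")]
```

Checking table(train) before subsetting shows at a glance whether the mask leaves anything for the test set.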

Subscript out of bound error in predict function of randomforest

I am using random forest for prediction, and the predict(fit, test_feature) line gives me the following error. Can someone help me overcome this? I followed the same steps with another dataset and had no error, but I get one here.
Error: Error in x[, vname, drop = FALSE] : subscript out of bounds
training_index <- createDataPartition(shufflled[,487], p = 0.8, times = 1)
training_index <- unlist(training_index)
train_set <- shufflled[training_index,]
test_set <- shufflled[-training_index,]
accuracies<- c()
k=10
n= floor(nrow(train_set)/k)
for(i in 1:k){
sub1<- ((i-1)*n+1)
sub2<- (i*n)
subset<- sub1:sub2
train<- train_set[-subset, ]
test<- train_set[subset, ]
test_feature<- test[ ,-487]
True_Label<- as.factor(test[ ,487])
fit<- randomForest(x= train[ ,-487], y= as.factor(train[ ,487]))
prediction<- predict(fit, test_feature) #The error line
correctlabel<- prediction == True_Label
t<- table(prediction, True_Label)
}
I had a similar problem a few weeks ago.
To get around it, you can do this:
df$label <- factor(df$label)
Instead of as.factor, try the plain factor function. Also, try naming your label variable first.
Are there identical column names in your training and validation x?
I had the same error message and solved it by renaming my column names because my data was a matrix and their colnames were all empty, i.e. "".
Your question is not very clear, but I will try to help.
First of all check your data to see the distribution in levels of your various predictors and outcomes.
You may find that some of your predictor levels or outcome levels are very highly skewed, or some outcomes or predictor levels are very rare. I got that error when I was trying to predict a very rare outcome with a heavily tuned random forest, and so some of the predictor levels were not actually in the training data. Thus a factor level appears in the test data that the training data thinks is out of bounds.
Alternatively, check the names of your variables before calling predict() to make sure that they match.
Without your data files, it's hard to tell why your first example worked.
For example, you can try:
names(test) <- names(train)
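Before overwriting the names, it can help to see which ones actually differ. A minimal base R sketch; train_x and test_x are hypothetical stand-ins for the predictor columns passed to randomForest and predict:

```r
# Hypothetical predictor sets with one mismatched column name
train_x <- data.frame(a = 1:3, b = 4:6, c = 7:9)
test_x <- data.frame(a = 1:2, b = 3:4, d = 5:6)

missing_in_test <- setdiff(names(train_x), names(test_x))  # used in training, absent at predict time
extra_in_test <- setdiff(names(test_x), names(train_x))    # present only in the test set

missing_in_test
extra_in_test

# For a matrix, also check for missing or empty column names
m <- matrix(1:4, nrow = 2)
no_names <- is.null(colnames(m)) || any(colnames(m) == "")
no_names
```

Any name listed in missing_in_test is a column the fitted model would look up in the test data and not find.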
Add the expression
dimnames(test_feature) <- NULL
before
prediction <- predict(fit, test_feature)

No missing values are allowed: kNN in R

I have a data set of 45212 rows and 17 columns, and I want to predict the class label in the last column using the kNN algorithm. As far as I can tell everything is OK, but I always end up with this error:
"Error in knn(train = data_train, test = data_test, cl = data_train_labels, :
no missing values are allowed"
here is my code
> data_train <-data[1:25000,]
> data_test <-data[25001:45212,]
> data_train_labels <- data[1:25000, 17]
> data_test_labels <- data[25001:45212, 17]
> install.packages("class")
> library(class)
> data_test_pred <- knn(train=data_train, test=data_test, cl=data_train_labels, k=10)
here is how my data set looks like:
age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
35,management,married,tertiary,no,231,yes,no,unknown,5,may,139,1,-1,0,unknown,no
28,management,single,tertiary,no,447,yes,yes,unknown,5,may,217,1,-1,0,unknown,no
42,entrepreneur,divorced,tertiary,yes,2,yes,no,unknown,5,may,380,1,-1,0,unknown,no
58,retired,married,primary,no,121,yes,no,unknown,5,may,50,1,-1,0,unknown,no
43,technician,single,secondary,no,593,yes,no,unknown,5,may,55,1,-1,0,unknown,no
41,admin.,divorced,secondary,no,270,yes,no,unknown,5,may,222,1,-1,0,unknown,no
I think that your problem is all of the factors in your data. The knn documentation says that it uses Euclidean distance, which does not make sense for factors. Here is a possible solution if you really want to use knn. You can get a distance matrix between the points using daisy in the cluster package. There are several implementations of knn in R but I don't know of one that accepts a distance matrix. You could either write your own (not so difficult) or you could map the distance matrix to a Euclidean space using cmdscale. Then use knn on the projected space.
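A minimal sketch of the daisy plus cmdscale route described above, on a toy frame. The feature values are taken from the sample rows in the question, the y labels are made up for illustration, and Gower distance and k = 2 dimensions are assumptions rather than recommendations. Note that train and test rows are embedded together here, which is only reasonable when the test features are available up front.

```r
library(cluster)  # daisy()
library(class)    # knn()

toy <- data.frame(
  age     = c(58, 44, 33, 47, 33, 35, 28, 42),
  job     = factor(c("management", "technician", "entrepreneur", "blue-collar",
                     "unknown", "management", "management", "entrepreneur")),
  marital = factor(c("married", "single", "married", "married",
                     "single", "married", "single", "divorced")),
  balance = c(2143, 29, 2, 1506, 1, 231, 447, 2),
  y       = factor(c("no", "no", "yes", "no", "yes", "no", "yes", "no"))  # made-up labels
)

# Dissimilarities that can handle mixed numeric and factor columns
features <- toy[, c("age", "job", "marital", "balance")]
d <- daisy(features, metric = "gower")

# Project the dissimilarities into a 2-dimensional Euclidean space
coords <- cmdscale(as.matrix(d), k = 2)

# Ordinary knn on the projected coordinates
train_rows <- 1:6
pred <- knn(train = coords[train_rows, , drop = FALSE],
            test  = coords[-train_rows, , drop = FALSE],
            cl    = toy$y[train_rows], k = 3)
pred
```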
I believe that your mistake is: data_train <-data[1:25000,]
You are including your header that you have not normalized. I was able to reproduce the same error. But when I changed to data_train <-data[2:25000,] it ran fine.

Use of randomforest() for classification in R?

I originally had a data frame composed of 12 columns in N rows. The last column is my class (0 or 1). I had to convert my entire data frame to numeric with
training <- sapply(training.temp,as.numeric)
But then I thought I needed the class column to be a factor column to use the randomforest() tool as a classifier, so I did
training[,"Class"] <- factor(training[,ncol(training)])
I proceed to creating the tree with
training_rf <- randomForest(Class ~., data = trainData, importance = TRUE, do.trace = 100)
But I'm getting two errors:
1: In Ops.factor(training[, "Status"], factor(training[, ncol(training)])) :
<= this is not relevant for factors (roughly translated)
2: In randomForest.default(m, y, ...) :
The response has five or fewer unique values. Are you sure you want to do regression?
I would appreciate it if someone could point out the formatting mistake I'm making.
Thanks!
So the issue is actually quite simple. It turns out my training data was an atomic vector. So it first had to be converted as a data frame. So I needed to add the following line:
training <- as.data.frame(training)
Problem solved!
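A small sketch of why that line matters: sapply over a data frame returns a matrix, and a matrix cannot hold a factor column. The toy training.temp below is made up for illustration.

```r
# Toy stand-in for training.temp (made-up values stored as text)
training.temp <- data.frame(x1 = c("1.5", "2.5", "3.5", "4.5"),
                            Class = c("0", "1", "0", "1"),
                            stringsAsFactors = FALSE)

training <- sapply(training.temp, as.numeric)
is.matrix(training)      # sapply() returned a numeric matrix, not a data frame

# Convert back to a data frame first, then make the class column a factor
training <- as.data.frame(training)
training$Class <- factor(training$Class)
str(training)
```

With Class stored as a factor in a data frame, randomForest treats the problem as classification rather than regression.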
First, your coercion to a factor is not working because of syntax errors. Second, you should always use indexing when specifying a RF model. Here are changes in your code that should make it work.
training <- sapply(training.temp,as.numeric)
training[,"Class"] <- as.factor(training[,"Class"])
training_rf <- randomForest(x=training[,1:(ncol(training)-1)], y=training[,"Class"],
importance=TRUE, do.trace=100)
# You can also coerce to a factor directly in the model statement
training_rf <- randomForest(x=training[,1:(ncol(training)-1)], y=as.factor(training[,"Class"]),
importance=TRUE, do.trace=100)
