KNN in R: 'train and class have different lengths'? - r

Here is my code:
train_points <- read.table("kaggle_train_points.txt", sep="\t")
train_labels <- read.table("kaggle_train_labels.txt", sep="\t")
test_points <- read.table("kaggle_test_points.txt", sep="\t")
#uses package 'class'
library(class)
knn(train_points, test_points, train_labels, k = 5);
dim(train_points) is 42000 x 784
dim(train_labels) is 42000 x 1
I don't see the issue, but I'm getting the error :
Error in knn(train_points, test_points, train_labels, k = 5) :
'train' and 'class' have different lengths.
What's the problem?

Without access to the data, it's really hard to help. However, I suspect that train_labels should be a vector. So try
cl = train_labels[,1]
knn(train_points, test_points, cl, k = 5)
Also double check:
dim(train_points)
dim(test_points)
length(cl)

I had the same issue in trying to apply knn on breast cancer diagnosis from wisconsin dataset I found that the issue was linked to the fact that cl argument need to be a vector factor (my mistake was to write cl=labels , I thought this was the vector to be predicted it was in fact a data frame of one column ) so the solution was to use the following syntax : knn (train, test,cl=labels$diagnosis,k=21) diagnosis was the header of the one column data frame labels and it worked well
Hope this help !

I have recently encountered a very similar issue.
I wanted to give only a single column as a predictor. In such cases, selecting a column, you have to remember about drop argument and set it to FALSE. The knn() function accepts only matrices or data frames as train and test arguments. Not vectors.
knn(train = trainSet[, 2, drop = FALSE], test = testSet[, 2, drop = FALSE], cl = trainSet$Direction, k = 5)

Try converting the data into a dataframe using as.dataframe(). I was having the same problem & afterwards it worked fine:
train_pointsdf <- as.data.frame(train_points)
train_labelsdf <- as.data.frame(train_labels)
test_pointsdf <- as.data.frame(test_points)

Simply set drop = TRUE while you're excluding cl from dataframe, it causes to remove dimension from an array which have only one level:
cl = train_labels[,1, drop = TRUE]
knn(train_points, test_points, cl, k = 5)

I had a similar error when I was reading to a tibble (read_csv) and when I switched to read.csv the code worked.

Followed the code as given in the book but will show error due to mismatch lengths (1 is df other is vector returned). I reached here but nothing worked exactly but ideas helped that vectors were needed for comparison.
This throws error
gmodels::CrossTable(x = wbcd_test_labels, # actuals
y = wbcd_test_pred, # predicted
prop.chisq = FALSE)
The following works :
gmodels::CrossTable(x = wbcd_test_labels$diagnosis, # actuals
y = wbcd_test_pred, # predicted
prop.chisq = FALSE)
where using $ for x makes it a vector and hence matches
Additionally while running knn
Cl parameter shoud also have vector save labels in vectors else there will be length mismatch OR use labelDF$Class_label
wbcd_test_pred <- knn(train = wbcd_train,
test = wbcd_test,
cl =wbcd_train_labels$diagnosis, #note this
k = 21)
Hope this helps beginners like me.

Uninstall R Previous versions and install R version > 4.0. It will work.

Related

SMOTE target variable not found in R

Why is it when I run the smote function in R, an error appears saying that my target variable is not found? I am using the smotefamily package to run this smote function.
tv_smote <- SMOTE(tv_smote, Churn, K = 5, dup_size = 0)
Error in table(target) : object 'Churn' not found
Chunk Codes
df data structure
df 1st few rows
Generally, you should include the minimum code and data needed to reproduce the problem. This saves time and gives you more chance of getting an answer. However, try this...
tv_smote <- SMOTE(tv_smote, tv_smote$Churn, K = 5, dup_size = 0)
I can get the same error by doing this:
library(smotefamily)
df <- data.frame(x = 1:8, y = 8:1)
df_smote <- SMOTE(df, y, K = 3, dup_size = 0)
It appears to me that SMOTE doesn't know y is a column name. In the documentation, ?SMOTE, it says that the target argument is "A vector of a target class attribute corresponding to a dataset X." My interpretation of this is that you need to supply a vector, not a name. Changing it to df_smote <- SMOTE(df, df$y, K = 3, dup_size = 0) gets past that part.
I am not familiar with the SMOTE function, but from testing it would appear that SMOTE cannot accept dataframes with any factor type columns. I get Error in get.knnx(data, query, k, algorithm) : Data non-numeric when using as.factor.

Error when running PerformanceAnalytics function in R

I am getting a Error in 1:T : argument of length 0 when running the Performance Analytics package in R. am I missing a package? Below is my code with error.
#clean z, all features, alpha = .01, run below
setwd("D:/LocalData/casaler/Documents/R/RESULTS/PLOTS_PCA/CLN_01")
PGFZ_ALL <- read.csv("D:/LocalData/casaler/Documents/R/PG_DEUX_Z.csv", header=TRUE)
options(max.print = 100000) #Sets ability to view all dealer records
pgfzc_all <- PGFZ_ALL
#head(pgfzc_all,10)
library("PerformanceAnalytics")
library("RGraphics")
Loading required package: grid
pgfzc_elev <- pgfzc_all$ELEV
#head(pgfzc_elev,5)
#View(pgfzc_elev)
set.seed(123) #for replication purposes; always use same seed value
cln_elev <- clean.boudt(pgfzc_elev, alpha = 0.01) #set alpha .001 to give the most extreme outliers
Error in 1:T : argument of length 0
It's hard to answer your question without knowing what your data looks like. But I can tell you what throws that error. Looking into the source code of the clean.boudt function I find the following cause of your error:
T = dim(R)[1]
...
for (t in c(1:T)) {
d2t = as.matrix(R[t, ] - mu) %*% invSigma %*% t(as.matrix(R[t,
] - mu))
vd2t = c(vd2t, d2t)
}
...
The dim(R)[1] extracts the number of rows in the data supplied to the R argument in the function. It appears that your data has no rows, so check the data type of pgfzc_elev
The cause of the error is likely from your use of $ to subset pgfzc_all.
pgfzc_elev <- pgfzc_all$ELEV
I reckon it is of class integer, which is why dim(R)[1] does not work in the function.
Rather subset your object like this:
pgfzc_elev <- pgfzc_all[, ELEV, drop = F]
Try that and see if it works.

MXnet odd error

This is my first ANN so I imagine that there might be a lot of things done wrong here. I don't follow
I'm trying to predict species of flowers from iris data set provided in R language but I get following error:
Error in `dimnames<-.data.frame`(`*tmp*`, value = list(n)) :
invalid 'dimnames' given for data frame
My code:
require(mxnet)
train <- iris[1:130,]
test <- iris[131:150,]
train.data <- as.data.frame(train[-5])
train.label <- data.frame(model.matrix(data=train,object =~Species-1))
test.data <- as.data.frame(test[-5])
test.label <- data.frame(model.matrix(data=test,object =~Species-1))
var1 <- mx.symbol.Variable("data")
layer0 <- mx.symbol.FullyConnected(var1, num.hidden=3)
cat.out <- mx.symbol.SoftmaxOutput(layer0)
net.model <- mx.model.FeedForward.create(cat.out,
array.layout = "auto",
X=train.data,
y=train.label,
eval.data = list(data=test.data,label=test.label),
num.round = 20,
array.batch.size = 20,
learning.rate=0.1,
momentum=0.9,
eval.metric = mx.metric.accuracy)
UPDATE:
I managed to get rid of this error by specifying column to use in labels(traning.label[,1]and test.label[,1]).
However now I'm training my net to predict just one of my binary variables while I have 3 (one for each species).
I had the same problem, turned out that:
train.data should be a matrix
train.label should be a numeric vector
Check these two and hopefully it should work.
I had a similar problem but during the prediction step. It turns out that my features were in a Data Frame which was causing the issue. Once I converted the data frame into a matrix, the issue went away.
pred.values = stats::predict(model,as.matrix(features))
instead of
pred.values = stats::predict(model,features)
So, the features need to be a matrix both during training and during the process of making predictions.

R, TRUE/FALSE needed in FindCorrelation

I've created a simple correlation matrix in R and I'm trying to use caret for feature selection so I can remove the highly correlated X attributes.
Here is my code:
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = 0.90, verbose = FALSE, names = TRUE, exact = ncol(correlationMatrix) < 100)
highlyCorrelated is the name for the new object
correlationMatrix is the name of my correlation matrix
I'm getting the following error regardless of how I enter the function into R. Even if I only use one parameter I still get this error:
Error in if (x[i, j] > cutoff) { : missing value where TRUE/FALSE needed
Any thoughts?
I had the same problem and #user20650 answer was correct.
I always do the same "preprocess" to ensure finCorrelation works:
nums <- sapply(data, is.numeric)
data.numeric <- data[ , nums]
data.without_na <- na.omit(data.numeric)
cor_matrix <- cor(data.without_na)
findCorrelation(cor_matrix, 0.7)
I had the same problem. In my case the issue was, Infinite values in my data, which use='complete.obs' doesn't account for in cor().
Solved it by preprocessing the data with
data <- apply(data, 2, function(y) {y[!is.finite(y)]=NA; y})

adabag boosting function throws error when giving mfinal>10

I have a strange issue, whenever I try increasing the mfinal argument in boosting function of adabag package beyond 10 I get an error, Even with mfinal=9 I get warnings.
My train data has 7 class Dependant variable and 100 independant variables and around 22000 samples of data(Smoted one class using DMwR). My Dependant Variable is at the end of the training dataset in sequence.
library(adabag)
gc()
exp_recog_boo <- boosting(V1 ~ .,data=train_dataS,boos=TRUE,mfinal=9)
Error in 1:nrow(object$splits) : argument of length 0
In addition: Warning messages:
1: In acum + acum1 :
longer object length is not a multiple of shorter object length
Thanks in advance.
My mistake was that I didn't set the TARGET as factor before.
Try this:
train$target <- as.factor(train$target)
and check by doing:
str(train$TARGET)
This worked for me:
modelADA <- boosting(lettr ~ ., data = trainAll, boos = TRUE, mfinal = 10, control = (minsplit = 0))
Essentially I just told rpart to require a minimum split length of zero to generate tree, it eliminated the error. I haven't tested this extensively so I can't guarantee it's a valid solution (what does a tree with a zero length leaf actually mean?), but it does prevent the error from being thrown.
I think i Hit the problem.
ignore this -if you configure your control with a cp = 0, this wont happen. I think that if the first node of a tree make no improvement (or at least no better than the cp) the tree stay wiht 0 nodes so you have an empty tree and that make the algorithm fail.
EDIT: The problem is that the rpart generates trees with only one leaf(node) and the boosting metod use this sentence "k <- varImp(arboles[[m]], surrogates = FALSE, competes = FALSE)" being arboles[[m]] a tree with only one node it give you the eror.
To solve that you can modify the boosting metod:
Write: fix(boosting) and add the *'S lines.
if (boos == TRUE) {
** k <- 1
** while (k == 1){
boostrap <- sample(1:n, replace = TRUE, prob = pesos)
fit <- rpart(formula, data = data[boostrap, -1],
control = control)
** k <- length(fit$frame$var)
** }
flearn <- predict(fit, newdata = data[, -1], type = "class")
ind <- as.numeric(vardep != flearn)
err <- sum(pesos * ind)
}
this will prevent the algorith from acepting one leaf trees but you have to set the CP from the control param as 0 to avoid an endless loop..
Just ran into the same problem, and setting the complexity parameter to -1 or minimum split to 0 both work for me with rpart.control, e.g.
library(adabag)
r1 <- boosting(Y ~ ., data = data, boos = TRUE,
mfinal = 10, control = rpart.control(cp = -1))
r2 <- boosting(Y ~ ., data = data, boos = TRUE,
mfinal = 10, control = rpart.control(minsplit = 0))
I also run into this same problem recently and this example R script solves it completely!
The main idea is that you need to set the control for rpart (which adabag uses for creating trees, see rpart.control) appropriately, so that at least a split is attempted in every tree.
I'm not totally sure but it appears that your "argument of length 0" may be the result of an empty tree, which can happen since there is a default setting of a "complexity" parameter that tells the function not to attempt a split if the decrease in homogeneity/lack of fit is below certain threshold.
use str() to see the attributes of your dataframe. For me, I just convert myclass variable as factor, then everything runs.

Resources