I am trying to run a k-nearest neighbours (KNN) algorithm on a data set. My code is below:
# random number that is 90% of the total number of rows in dataset
ran <- sample(1:nrow(Knn_data), 0.9*nrow(Knn_data))
# the normalization function created
nor <- function(x) { (x-min(x))/(max(x)-min(x))}
#run normalization function on predictors
Knn_data_norm <- as.data.frame(lapply(Knn_data[,c(1,2,3,4,5,6,7)], nor))
summary(Knn_data_norm)
# extract training set
Knn_train <- Knn_data_norm[ran,]
# extract testing set
Knn_test <- Knn_data_norm[-ran,]
# extract 8th column of train dataset because it will be used as 'cl' argument in knn function
Knn_target_category <- Knn_data[ran,8]
# extract 8th column of test dataset to measure the accuracy
Knn_test_category <- Knn_data[-ran,8]
library(class)
#run knn function
pr <- knn(Knn_train, Knn_test, cl=Knn_target_category, k=3)
I keep getting the following error:
Error in knn(Knn_train, Knn_test, cl = Knn_target_category, k = 3) : 'train' and 'class' have different lengths
I am not sure how to change the code to correct this error.
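A minimal sketch of one likely fix. It assumes the mismatch comes from Knn_data being a tibble (the question does not say so), in which case Knn_data[ran, 8] stays a one-column table whose length is 1 rather than a vector with one label per training row. The Knn_data built below is a hypothetical stand-in, not the original data.

```r
library(class)

set.seed(1)
# Hypothetical stand-in for Knn_data: 7 numeric predictors, class label in column 8
Knn_data <- data.frame(matrix(rnorm(100 * 7), ncol = 7))
Knn_data$label <- factor(sample(c("a", "b"), 100, replace = TRUE))

ran <- sample(1:nrow(Knn_data), 0.9 * nrow(Knn_data))
nor <- function(x) (x - min(x)) / (max(x) - min(x))
Knn_data_norm <- as.data.frame(lapply(Knn_data[, 1:7], nor))

Knn_train <- Knn_data_norm[ran, ]
Knn_test  <- Knn_data_norm[-ran, ]

# [[8]] extracts the column as a plain vector for both data frames and tibbles
Knn_target_category <- Knn_data[[8]][ran]
Knn_test_category   <- Knn_data[[8]][-ran]

# cl must have one label per training row
stopifnot(length(Knn_target_category) == nrow(Knn_train))

pr <- knn(Knn_train, Knn_test, cl = Knn_target_category, k = 3)
table(pr, Knn_test_category)
```

If the lengths already match in your session, the cause is something else and this sketch will not help.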
Context
I have a data frame with multiple columns, and I have to run a regression between alternate pairs of columns (1 and 2, 3 and 4, and so on), predict on a test data set, and then find the residuals. I do not want to reference each column by name (like target.data$nifty), hence I am using a data matrix.
Problem
When I predict using a data matrix, I get the following error:
number of items to replace is not a multiple of replacement length
I know my training data set has 255 rows and test data set has 50 rows but it should be able to predict on test data set and give me 50 predicted values, but it is not.
b is a matrix having training data set of 255 rows and 8 columns and t is a matrix having test data set of 50 rows and 8 columns.
I tried passing the matrix directly as newdata in predict, but it gave the following error:
Error in eval(predvars, data, env) : numeric 'envir' arg not of length one
... so I converted it into a data frame. Please suggest how I can use a matrix inside predict.
Here's the code I am using:
b<-as.matrix(target.data)
t<-as.matrix(target.test)
t_pred <- matrix(,nrow = 50,ncol = 8)
t_res <- matrix(,nrow = 50,ncol = 8)
for(i in 1:6) {
if (!i %% 2) {
t_model <- lm(b[,i] ~ b[,i])
t_pred[,i] <- predict(t_model, newdata=data.frame(t[,i]), type = 'response')
t_res[,i] <- t_pred[,i] - t[,i]
}
}
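A hedged sketch of how the loop could be restructured. It assumes each even column i is the response and column i - 1 its predictor, which the question implies but does not state. The matrices b and t are random stand-ins with the stated shapes. The idea is to fit on a small data frame with fixed names (y, x), so that predict can match newdata by name and return one value per test row.

```r
set.seed(1)
# Hypothetical stand-ins for target.data (255 x 8) and target.test (50 x 8)
b <- matrix(rnorm(255 * 8), nrow = 255)
t <- matrix(rnorm(50 * 8), nrow = 50)

t_pred <- matrix(NA_real_, nrow = nrow(t), ncol = ncol(t))
t_res  <- matrix(NA_real_, nrow = nrow(t), ncol = ncol(t))

for (i in seq(2, 6, by = 2)) {
  # Assumption: column i is the response, column i - 1 the predictor
  train_df <- data.frame(y = b[, i], x = b[, i - 1])
  t_model  <- lm(y ~ x, data = train_df)

  # newdata must contain a column named exactly like the predictor in the formula
  t_pred[, i] <- predict(t_model, newdata = data.frame(x = t[, i - 1]))
  t_res[, i]  <- t_pred[, i] - t[, i]
}
```

With lm(b[,i] ~ b[,i]) the formula refers to the training matrix itself, so predict cannot find those terms in newdata and falls back to the 255 training rows, which do not fit into a 50-row column.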
I have a dataframe dfab that contains 2 columns that I used as argument to generate a series of linear models as following:
models = list()
for (i in 1:10){
models[[i]] = lm(fc_ab10 ~ (poly(nUs_ab, i)), data = dfab)
}
dfab has 32 observations and I want to predict fc_ab10 for only 1 value.
I thought of doing so:
newdf = data.frame(newdf = nUs_ab)
newdf[] = 0
newdf[1,1] = 56
prediction = predict(models[[1]], newdata = newdf)
First I tried writing newdf as a dataframe with only one position, but since there are 32 in the dataset on which the model was built, I thought I had to provide at least 32 points as well. I don't think this is necessary though.
Every time I run that piece of code I am given the following error:
Error: variable 'poly(nUs_ab, i) was fitted with type “nmatrix.1” but type “numeric” was supplied.
In addition: Warning message:
In Z/rep(sqrt(norm2[-1L]), each = length(x)) :
longer object length is not a multiple of shorter object length
I thought all I needed to use predict was an lm model and the predictor (the number 56) supplied in a data frame with a named column. Obviously, I am mistaken.
How can I fix this issue?
Thanks.
newdf should be a data.frame with a column named nUs_ab; otherwise R cannot tell which column to use when it builds the design matrix for prediction. So the following code should work:
newdf = data.frame(nUs_ab = 56)
prediction = predict(models[[1]], newdata = newdf)
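For completeness, here is a self-contained sketch of the same idea on made-up data (the dfab below is hypothetical). One further assumption worth flagging: because the models in the question were fitted with poly(nUs_ab, i), the stored formula still refers to i, and as far as I can tell i is looked up again at prediction time. The sketch therefore writes the degree into each formula as a literal, so the prediction does not depend on whatever value i holds after the loop.

```r
set.seed(1)
# Hypothetical stand-in for dfab: 32 observations, 2 columns
dfab <- data.frame(nUs_ab = seq(10, 100, length.out = 32))
dfab$fc_ab10 <- 2 * dfab$nUs_ab + 1 + rnorm(32, sd = 0.5)

# Build one model per degree, with the degree written into the formula
models <- lapply(1:10, function(d) {
  f <- as.formula(sprintf("fc_ab10 ~ poly(nUs_ab, %d)", d))
  lm(f, data = dfab)
})

# A single new observation is enough; the column name must match the predictor
newdf <- data.frame(nUs_ab = 56)
prediction <- predict(models[[1]], newdata = newdf)
prediction
```

The number of rows in newdata is independent of the 32 rows used for fitting: predict returns one value per row supplied.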
In QDA (Quadratic Discriminant Analysis), do I need to keep the training and test data exactly the same length? If not, how do you compute a confusion matrix in that case?
Here's pseudo data.
If I keep training and test data sets of different lengths, I get an error (using RStudio):
"Error in table(pred, true) : all arguments must have the same length".
Tried to remove NAs using na.omit() on both data sets as well as pred and true; and using na.action = na.exclude for qda(), but it didn't work.
After dividing the data set in exactly half; half of it as training and half as test; it worked perfectly after na.omit() on pred and true.
Following is the code used for either of approaches. In approach 2, with data split into equal halves, it worked perfectly fine.
#Approach 1: divide data age-wise
train <- vif_data$Age < 30
# there are around 400 values passing (TRUE) above condition and around 50 failing (FALSE)
train_vif <- vif_data[train,]
test_vif <- vif_data[!train,]
#taking QDA
zone_qda <- qda(train_vif$Awareness~train_vif$Zone, na.action = na.exclude)
#compare QDA against test data
zone_pred <- predict(zone_qda, test_vif)
#omitting nulls
pred <- na.omit(zone_pred$class)
true <- na.omit(test_vif$Awareness)
length(pred) # result: 399
length(true) # result: 47
#that's where it throws error: "Error in table(zone_pred$class, train_vif) : all arguments must have the same length"
zone_aware <- table(zone_pred$class, train_vif)
# OR
zone_aware <- table(pred, true)
accur <- mean(zone_pred$class==test_vif$Awareness)
###############################
#Approach 2: divide data into random halves
train <- splitSample(dataset = vif_data, div = 2, path = "./", type = "csv")
train_data <- read.csv("splitSample_s1.csv")
test_data <- read.csv("splitSample_s2.csv")
#taking QDA
zone_qda <- qda(train_vif$Awareness~train_vif$Zone, na.action = na.exclude)
#compare QDA against test data
zone_pred <- predict(zone_qda, test_vif)
#omitting nulls
pred <- na.omit(zone_pred$class)
true <- na.omit(test_vif$Awareness)
length(train_vif)
# this works fine
zone_aware <- table(zone_pred$class, train_vif)
# OR
zone_aware <- table(pred, true)
accur <- mean(zone_pred$class==test_vif$Awareness)
Want to know if there is any method by which we can have a confusion matrix with data set unequally divided into training and test data set.
Thanks!
Are you plugging in your training inputs instead of your test set input data to predict? Notice how this yields the same error message:
table(c(1,2),c(1,2,3))
If pred isn't the right length, then you're probably predicting incorrectly. At the moment, you haven't shared any of your code, so I cannot say anything more. But there is no reason that you shouldn't be able to get a confusion matrix using test data of different size than your training data.
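To illustrate that unequal sizes are fine, here is a minimal sketch using the built-in iris data as a stand-in (the vif_data columns are not available, so the variable names here are not the asker's). It passes plain column names plus a data argument to qda. The assumption being illustrated is that a formula written as train_vif$Awareness ~ train_vif$Zone pins the model to the training vectors, so predict is likely to ignore newdata and return one prediction per training row.

```r
library(MASS)

set.seed(1)
# Deliberately unequal split: 120 training rows, 30 test rows
idx   <- sample(nrow(iris), 120)
train <- iris[idx, ]
test  <- iris[-idx, ]

# Plain column names plus data = ..., so predict() can look them up in newdata
fit  <- qda(Species ~ Sepal.Length + Sepal.Width, data = train)
pred <- predict(fit, newdata = test)$class

# Both vectors now have one entry per test row
conf     <- table(predicted = pred, actual = test$Species)
accuracy <- mean(pred == test$Species)
conf
accuracy
```

The confusion matrix only ever compares predictions with the true labels of the test rows, so the size of the training set does not enter into it.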
I am trying to split my data into training and test sets using code obtained from my professor, but I am getting errors. I thought it was because of the data's format, but I went back and hard-coded it and still nothing works. The data is in matrix form right now, and I believe the code is meant to measure how accurate the logistic regression is.
A = matrix(
c(64830,18213,4677,24761,9845,17504,22137,12531,5842,28827,66161,18852,5581,27219,10159,17527,23402,11409,8115,31425,68426,18274,5513,25687,10971,14104,19604,13438,6011,30055,69716,18366,5735,26556,11733,16605,20644,15516,5750,31116,73128,18906,5759,28555,11951,19810,22086,17425,6152,28469,1,1,1,0,1,0,0,0,0,1),
nrow = 10,
ncol = 6,
byrow = FALSE)
n<-row(A);
K<-ncol(A)-1;
x<-matrix(0,n,K);
for(i in 1:K){x[,i]<-A[,i];}
#A[,i] is 10long and x[,i] is 1long.
A[,i:length(x[,i])]=x[,i]
y<-A[,K+1];
#training/test data split:
idx<-sample(1:n,floor(n/2),replace=FALSE);
xtr<-x[idx,]; ytr<-y[idx];
xts<-x[-idx,]; yts<-y[-idx];
#fit the logistic model to it
myglm<-glmnet(xtr,ytr,family = "binomial");
#Error in if (is.null(np) | (np[2] <= 1)) stop("x should be a matrix with 2 or more columns") : argument is of length zero
#apply traning data to test data
mypred<-predict(myglm,newx=xts,type="response",s=0.01);
posteriprob<-mypred[,,1];
yhat<-matrix(1,nrow(xts),1);
for(i in 1:nrow(xts))
{
yhat[i]<-which.max(posteriprob[i,]);
}
acc<-sum(yhat+2==yts)/nrow(xts);
cat("accuracy of test data:", acc, "\n");
The first for loop gives me this error:
Error in x[, i] <- A[, i]:
Number of items to replace is not a multiple of replacement length
When I run the logistic model using xtr/ytr I get error in if (is.null(np) | (np[2] <= 1)) stop("x should be a matrix with 2 or more columns"):
argument is of length zero
For the first error, it was a typo: change n<-row(A) to n<-nrow(A) and the loop should work. After that, A[,i:length(x[,i])]=x[,i] produces another error, because A is 10x6 while length(x[,i]) is 10. You probably meant to do something different here from what is currently coded.
For the second error, xtr should have at least 2 columns (size n x 2 or wider). Also, your data is not appropriate for a binomial glm: observations should be either 1 or 0.
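A short sketch of the corrected set-up, using base R only. The predictor values are a random stand-in for the numbers in the question, and the 0/1 response is the last column from the question. row(A) returns a whole matrix of row indices, whereas nrow(A) returns the single row count that matrix() and sample() need here.

```r
set.seed(1)
# Hypothetical stand-in for A: 5 numeric predictors plus the 0/1 response column
A <- cbind(matrix(rnorm(10 * 5), nrow = 10),
           c(1, 1, 1, 0, 1, 0, 0, 0, 0, 1))

n <- nrow(A)                 # a single integer, unlike row(A)
K <- ncol(A) - 1
x <- A[, 1:K, drop = FALSE]  # copies all predictor columns at once, no loop needed
y <- A[, K + 1]

# training/test data split
idx <- sample(1:n, floor(n / 2), replace = FALSE)
xtr <- x[idx, , drop = FALSE];  ytr <- y[idx]
xts <- x[-idx, , drop = FALSE]; yts <- y[-idx]

# xtr is now a matrix with K columns, as glmnet's check on x requires
dim(xtr)
```

drop = FALSE keeps x, xtr and xts as matrices even if a subset ends up with a single row or column.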