I am trying to run a kNN algorithm on a data set. My code is below:
# random sample of row indices covering 90% of the rows in the dataset
ran <- sample(1:nrow(Knn_data), 0.9*nrow(Knn_data))
# min-max normalization function
nor <- function(x) { (x-min(x))/(max(x)-min(x))}
#run normalization function on predictors
Knn_data_norm <- as.data.frame(lapply(Knn_data[,c(1,2,3,4,5,6,7)], nor))
summary(Knn_data_norm)
# extract training set
Knn_train <- Knn_data_norm[ran,]
# extract testing set
Knn_test <- Knn_data_norm[-ran,]
# extract the 8th column of the original data for the training rows; used as the 'cl' argument in knn
Knn_target_category <- Knn_data[ran,8]
# extract the 8th column for the test rows to measure the accuracy
Knn_test_category <- Knn_data[-ran,8]
library(class)
#run knn function
pr <- knn(Knn_train, Knn_test, cl=Knn_target_category, k=3)
I keep getting the following error:
Error in knn(Knn_train, Knn_test, cl = Knn_target_category, k = 3) : 'train' and 'class' have different lengths
I am not sure how to change the code to correct this error.
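For reference, the first thing worth checking (a sketch; whether Knn_data is a plain data frame or a tibble is an assumption, since the data isn't shown) is that cl really is a vector with one label per training row, because knn() compares length(cl) with nrow(train):
nrow(Knn_train)               # rows in the training set
length(Knn_target_category)   # must be the same number
# If Knn_data is a tibble, Knn_data[ran, 8] stays a one-column data frame
# instead of dropping to a vector, which makes 'cl' the wrong length;
# extracting the column as a vector avoids that:
Knn_target_category <- Knn_data[[8]][ran]
Knn_test_category   <- Knn_data[[8]][-ran]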
We created a table in R with values from the S&P 500 and added columns such as the simple 10-day moving average. We set the NA values to 0. Example:
library(TTR)  # provides SMA()
myStartDate <- '2020-01-01'
myEndDate <- Sys.Date()
Dataset$SMA10 <- SMA(Dataset[,"Close"], 10)
Dataset$SMA10 <- as.numeric(Dataset$SMA10)
Dataset$SMA10[is.na(Dataset$SMA10)] <- 0
Our goal is to create a random forest model. Therefore we split the data into a training set and a validation set:
set.seed(100)
train <- sample(nrow(Dataset), 0.5*nrow(Dataset), replace = FALSE)
TrainSet <- Dataset[train, ]
ValidSet <- Dataset[-train, ]
Now if we want to generate the model with the following code:
library(randomForest)
model1 <- randomForest(SMA10 ~ ., data = TrainSet, mtry = 5, importance = TRUE, ntree = 500)
print(model1)
we get this error message:
Error in x[, i] <- frame[[i]] : number of items to replace is not a multiple of replacement length
By looking up this error in the forum, we found that it is related to NA values. Therefore we are a little confused, because we have no NA values in our table. Can you tell us what we are doing wrong? Thank you very much in advance.
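One thing worth ruling out before blaming NA values (a hedged suggestion, since the full construction of Dataset isn't shown): randomForest expects every column it receives to be a plain numeric or factor vector, and this kind of replacement-length error can come from a column that is something else, for example a Date column or a column that is still a matrix or xts object.
# Inspect the structure of the training data; any column whose class is not
# a plain numeric/factor vector (e.g. "Date", "xts", or a matrix) is suspect.
sapply(TrainSet, class)
str(TrainSet)
sum(is.na(TrainSet))   # double-checks that no NA values remain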
I am trying to use kknn + a loop to create leave-one-out cross-validation for a model, and to compare that with train.kknn.
I have split the data into two parts: training (80% of the data) and test (20% of the data). In the training data, I leave out one point per iteration of the loop to manually create LOOCV.
I think something goes wrong in predict(knn.fit, data.test). I have tried to find out how to predict with kknn through the kknn package documentation and online, but all the examples show "summary(model)" and "table(validation...)" rather than prediction on a separate test data set. The code predict(model, dataset) works successfully with the train.kknn function, so I thought I could use similar arguments with kknn.
I am not sure if there is such a prediction function in kknn. If so, what arguments should I give?
I look forward to your suggestions. Thank you.
library(kknn)
for (i in 1:nrow(data.train)) {
  train.data <- data.train[-i, ]
  validation.data <- data.train[i, ]
  knn.fit <- kknn(as.factor(R1)~., train.data, validation.data, k = 40,
                  kernel = "rectangular", scale = TRUE)
  # train.data + validation.data is the 80% data I split.
}
pred.knn <- predict(knn.fit, data.test) # data.test is 20% data.
Here is the error message:
Error in switch(type, raw = object$fit, prob = object$prob,
  stop("invalid type for prediction")) :
  EXPR must be a length 1 vector
Actually I am trying to compare train.kknn and kknn + loop on the results of leave-one-out CV. I have a few more questions:
1) In kknn: is it possible to use another set of data as test data to see the knn.fit prediction?
2) In train.kknn: I split the data, use 80% of the whole data, and intend to use the remaining 20% for prediction. Is that a correct and common practice?
3) Or should I just use the original data (the whole data set) for train.kknn, and create a loop in kknn with data[-i,] for training and data[i,] for validation, so that they are counterparts?
I find that if I use the training data in the train.kknn function and then predict on the test data set, the best k and kernel are selected automatically and used directly to generate the predicted values for the test data set. In contrast, if I use the kknn function and build a loop over different k values, the model generates the corresponding predictions on the test data set each time the k value changes, and the best k is then selected based on the actual prediction accuracy on the test data. In short, the best k selected by train.kknn may not work best on the test data.
Thank you.
For objects returned by kknn, predict gives the predicted value or the predicted probabilities of R1 for the single row contained in validation.data:
predict(knn.fit)                  # predicted class for the single validation row
predict(knn.fit, type = "prob")   # predicted class probabilities for that row
The predict command also works on objects returned by train.kknn.
For example:
train.kknn.fit <- train.kknn(as.factor(R1)~., data.train, ks = 10,
                             kernel = "rectangular", scale = TRUE)
class(train.kknn.fit)
# [1] "train.kknn" "kknn"
pred.train.kknn <- predict(train.kknn.fit, data.test)
table(pred.train.kknn, as.factor(data.test$R1))
The train.kknn command implements a leave-one-out method very close to the loop developed by @vcai01. See the following example:
set.seed(43210)
n <- 500
data.train <- data.frame(R1=rbinom(n,1,0.5), matrix(rnorm(n*10), ncol=10))
library(kknn)
pred.kknn <- array(0, nrow(data.train))
for (i in 1:nrow(data.train)) {
  train.data <- data.train[-i, ]
  validation.data <- data.train[i, ]
  knn.fit <- kknn(as.factor(R1)~., train.data, validation.data, k = 40,
                  kernel = "rectangular", scale = TRUE)
  pred.kknn[i] <- predict(knn.fit)
}
knn.fit <- train.kknn(as.factor(R1)~., data.train, ks = 40,
                      kernel = "rectangular", scale = TRUE)
pred.train.kknn <- predict(knn.fit, data.train)
table(pred.train.kknn, pred.kknn)
#                pred.kknn
# pred.train.kknn   1   2
#               0 374  14
#               1   9 103
(pred.kknn shows the codes 1 and 2 because the factor predictions were stored in the numeric array pred.kknn, so they appear as the integer codes of the levels 0 and 1 of pred.train.kknn; the off-diagonal counts are the rows where the two approaches disagree.)
I have an SVM model using k-fold cross-validation, and I want to save the results of each fold (the cross-validation metrics and the corresponding actual and predicted values) in an array. I have tried the following code but am struggling with this error. I am not good at R; I would be thankful if anyone could solve my problem with this loop.
Error:
Error in *tmp*[[j]] : subscript out of bounds
My code is as follows:
#required Packages
library(rminer)
library("caret")
library("e1071")
#Generating random numbers
B1 <- c(runif(100))
B2 <- c(runif(100))
B3 <- c(runif(100))
AWC<-c(runif(100))#Target variable(respond)
data_scale<-data.frame(B1,B2,B3,AWC)
foldss<-createFolds(data_scale,,k=3)
#creating a list and an array for storing the results for all folds
value_svm<-list()
value_svm_all<-array()
cv_ksvm_result<-list()
cv_ksvm_total_result<-array()
#Construct the loop for all process
for(i in 1:3){
  for(j in 1:3) {
    #create test and train sets
    dat_terain<-data_scale[(-foldss[[i]]),]
    dat_test<-data_scale[foldss[[i]],]
    #Build the model
    fit_svm<-e1071::svm(AWC~.,data=dat_terain,kernel="radial")
    #predict
    AWC_pred<-predict(fit_svm, dat_test)
    print(value_svm[[j]])<-AWC_pred
    value_svm_all<-cbind(value_svm_all,value_svm[[j]])
    cv_ksvm_result[[i]]<-
      mmetric(dat_test$AWC,AWC_pred,c("MAE","RMSE","MAPE","RMSPE",
                                      "RRSE","RAE","COR","R2"))
    print(cv_ksvm_result[[i]])
    cv_ksvm_total_result<-cbind(cv_ksvm_total_result, cv_ksvm_result[[i]])
  }
}
The source of the error is this line:
print(value_svm[[j]])<-AWC_pred
You just need to replace it with:
value_svm[[j]]<-AWC_pred
But in reality there are other issues in this code.
When you set...
foldss<-createFolds(data_scale,,k=3)
... you intend to get 3 folds, but it returns only 2. The createFolds function expects a vector as its first argument so that it can work from the data frame's number of rows; when the full data frame is provided, it works from the number of columns instead.
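A quick way to see the difference (a minimal sketch, reusing the data_scale object built above):
# Passing the whole data frame: createFolds works off the columns,
# so the fold indices cannot cover the 100 rows as intended.
str(createFolds(data_scale, k = 3))
# Passing a single column (a vector of length 100) gives k folds over the rows.
str(createFolds(data_scale$AWC, k = 3))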
I've made the necessary adjustments and now the code runs correctly:
#required Packages
library(rminer)
library("caret")
library("e1071")
#Generating random numbers
B1 <- c(runif(100))
B2 <- c(runif(100))
B3 <- c(runif(100))
AWC<-c(runif(100))#Target variable(respond)
data_scale<-data.frame(B1,B2,B3,AWC)
foldss<-createFolds(data_scale$AWC, k=3)
#creating a list and an array for storing the results for all folds
value_svm<-list()
value_svm_all<-array()
cv_ksvm_result<-list()
cv_ksvm_total_result<-array()
#Construct the loop for all process
for(i in 1:3){
  #create test and train sets
  dat_terain<-data_scale[(-foldss[[i]]),]
  dat_test<-data_scale[foldss[[i]],]
  #Build the model
  fit_svm<-e1071::svm(AWC~.,data=dat_terain,kernel="radial")
  #predict
  AWC_pred<-predict(fit_svm, dat_test)
  value_svm[[i]]<-AWC_pred
  value_svm_all<-cbind(value_svm_all,value_svm[[i]])
  cv_ksvm_result[[i]]<-
    mmetric(dat_test$AWC,AWC_pred,c("MAE","RMSE","MAPE","RMSPE",
                                    "RRSE","RAE","COR","R2"))
  print(cv_ksvm_result[[i]])
  cv_ksvm_total_result<-cbind(cv_ksvm_total_result, cv_ksvm_result[[i]])
}
I am trying to apply the table function but I get this error; I think it is because the test column is a factor and the prediction is a matrix:
Error in table(rfe_nB_test_folds[, 7], rfe_nB_predict) :
all arguments must have the same length
To fix that, I need to convert the prediction result to a factor so I can use it in the table function, but then I get the error below. I think it is because of the 10-fold cross-validation, because when I try it without the 10-fold cross-validation it works:
Error in [.default(rfe_nB_predict, , 2) :
incorrect number of dimensions
My code:
set.seed(100)
rfe_nB_folds<-createFolds(BC_bind$outcome, k=10) #create folds
rfe_nB_fun <- lapply(rfe_nB_folds, function(x){
  rfe_nB_traing_folds<-BC_bind[-x,]
  rfe_nB_test_folds<-BC_bind[x,]
  #build the model
  rfe_nB_model<-naiveBayes(outcome ~ ., data = rfe_nB_traing_folds)
  #test the model
  rfe_nB_predict<-predict(rfe_nB_model,rfe_nB_test_folds[-7],type="raw")
  rfe_nB_predict<-as.factor(rfe_nB_predict)
  CR<-roc.curve(rfe_nB_test_folds[,7], rfe_nB_predict[,2])
  print(CR)
  rfe_nB_table<-table(rfe_nB_test_folds[,7],rfe_nB_predict)
  rfe_nB_confusionMatrix<-confusionMatrix(rfe_nB_table,positive = "R") #to see the matrix for each fold
  return (rfe_nB_confusionMatrix$table)
})
I used specific columns, so I saved them in BC_bind as shown below.
Top_6featurs <- wpdc[,c(33,11,10,32,29,12)] #column numbers of the top 6 features
BC_bind <- data.frame(cbind(Top_6featurs , wpdc$outcome))
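For reference, inspecting the prediction object shows where both errors come from (a sketch meant to be run inside one fold of the loop above; with type = "raw", predict.naiveBayes returns a matrix of class probabilities, and as.factor() flattens a matrix into one long factor):
rfe_nB_raw <- predict(rfe_nB_model, rfe_nB_test_folds[-7], type = "raw")
str(rfe_nB_raw)             # a numeric matrix: one row per test case, one column per class
dim(rfe_nB_raw)             # here rfe_nB_raw[, 2] still works
str(as.factor(rfe_nB_raw))  # a factor of length nrow * ncol with no dim attribute,
                            # so it no longer matches the test labels and [, 2] fails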
I have constructed a decision tree using rpart for a dataset.
I have then divided the data into two parts, a training dataset and a test dataset, and constructed a tree using the training data. I want to calculate the accuracy of the predictions made by the model that was created.
My code is shown below:
library(rpart)
#reading the data
data = read.table("source")
names(data) <- c("a", "b", "c", "d", "class")
#generating test and train data - data selected randomly with an 80/20 split
trainIndex <- sample(1:nrow(data), 0.8 * nrow(data))
train <- data[trainIndex,]
test <- data[-trainIndex,]
#tree construction based on information gain
tree = rpart(class ~ a + b + c + d, data = train, method = 'class', parms = list(split = "information"))
I now want to calculate the accuracy of the predictions generated by the model by comparing the results with the actual values from the test data; however, I am facing an error while doing so.
My code is shown below:
t_pred = predict(tree,test,type="class")
t = test['class']
accuracy = sum(t_pred == t)/length(t)
print(accuracy)
I get an error message that states:
Error in t_pred == t : comparison of these types is not implemented
In addition: Warning message:
Incompatible methods ("Ops.factor", "Ops.data.frame") for "=="
On checking the type of t_pred, I found that it is of type integer; however, the documentation
(https://stat.ethz.ch/R-manual/R-devel/library/rpart/html/predict.rpart.html)
states that the predict() method should return a vector.
I am unable to understand why the type of the variable is an integer and not a list. Where have I made the mistake and how can I fix it?
The comparison fails because test['class'] is a one-column data frame rather than a vector, so t_pred == t compares a factor with a data frame. Try calculating the confusion matrix first:
confMat <- table(test$class,t_pred)
Now you can calculate the accuracy by dividing the sum of the diagonal of the matrix - which counts the correct predictions - by the total sum of the matrix:
accuracy <- sum(diag(confMat))/sum(confMat)
My response is very similar to @mtoto's, but a bit simpler... I hope it also helps.
mean(test$class == t_pred)