I want to run a kNN classification with k = 3. I would like to predict the dependent variable “diabetes” in the validation set using the training set and then calculate the accuracy.
But I get the following error message:
Error in knn(train = TrainXNormDF, test = ValidXNormDF, cl = MLdata2[, : 'train' and 'class' have different lengths
I couldn't solve the problem with the following approach either:
for(i in ((length(MLValidY) + 1):length(TrainXNormDF)))+(MLValidY = c(MLValidY, 0))
What can I do about it? Please help.
My code is below:
install.packages("mlbench")
install.packages("gbm")
library(mlbench)
library(gbm)
data("PimaIndiansDiabetes2")
head(PimaIndiansDiabetes2)
MLdata <- as.data.frame(PimaIndiansDiabetes2)
head(MLdata)
str(MLdata)
View(MLdata)
any(is.na(MLdata))
sum(is.na(MLdata))
MLdata2 <- na.omit(MLdata)
any(is.na(MLdata2))
sum(is.na(MLdata2))
View(MLdata2)
MLIdx <- sample(1:3, size = nrow(MLdata2), prob = c(0.6, 0.2, 0.2), replace = TRUE)
MLTrain <- MLdata2[MLIdx == 1,]
MLValid <- MLdata2[MLIdx == 2,]
MLTest <- MLdata2[MLIdx == 3,]
head(MLTrain)
head(MLValid)
head(MLTest)
str(MLTrain)
str(MLValid)
str(MLTest)
MLTrainX <- MLTrain[ , -9]
MLValidX <- MLValid[ , -9]
MLTestX <- MLTest[ , -9]
MLTrainY <- as.data.frame(MLTrain[ , 9])
MLValidY <- as.data.frame(MLValid[ , 9])
MLTestY <- as.data.frame(MLTest[ , 9])
View(MLTrainX)
View(MLTrainY)
library(caret)
NormValues <- preProcess(MLTrainX, method = c("center", "scale"))
TrainXNormDF <- predict(NormValues, MLTrainX)
ValidXNormDF <- predict(NormValues, MLValidX)
TestXNormDF <- predict(NormValues, MLTestX)
head(TrainXNormDF)
head(ValidXNormDF)
head(TestXNormDF)
install.packages('FNN')
library(FNN)
library(class)
NN <- knn(train = TrainXNormDF,
          test = ValidXNormDF,
          cl = MLValidY,
          k = 3)
Thank you
Your cl variable is not the same length as your train variable. MLValidY only has 74 observations, while TrainXNormDF has 224.
cl should provide the true classification for every row in your training set.
Furthermore, cl is a data.frame instead of a vector.
Try the following:
NN <- knn(train = TrainXNormDF,
          test = ValidXNormDF,
          cl = MLTrainY$`MLTrain[, 9]`,
          k = 3)
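A slightly cleaner alternative (just a sketch, reusing the objects from the question) is to keep the labels as a factor vector from the start, which avoids the awkward backtick column name:
MLTrainY <- MLTrain[, 9]   # a plain factor vector instead of a data.frame
NN <- knn(train = TrainXNormDF,
          test = ValidXNormDF,
          cl = MLTrainY,
          k = 3)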
As @rw2 stated, the problem is the length of cl. I think you meant to use MLTrainY, not MLValidY. Note that even a single-column data frame can cause shape problems, because knn() expects cl to be a factor vector, not a data frame. You could go back to the original data to make sure you pass the right content, like so:
NN <- knn(train = TrainXNormDF,
          test = ValidXNormDF,
          cl = MLdata2[MLIdx == 1, ]$diabetes,  # shape no longer an issue
          k = 3)
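To get the validation accuracy the question asks about, you can then compare the predictions with the true validation labels (a minimal sketch, reusing the objects defined in the question):
ValidTruth <- MLdata2[MLIdx == 2, ]$diabetes   # true labels of the validation set
mean(NN == ValidTruth)                         # overall accuracy
table(Predicted = NN, Actual = ValidTruth)     # confusion matrix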
When I run the code below I get an error. How do I solve it?
The dataset is the Caravan Insurance dataset.
Caravan <- read.csv(file = "~/Desktop/Caravan.csv")
attach(Caravan)   # so that Purchase can be referenced directly below
dim(Caravan)
# [1] 5822 86
table(Purchase)
# Purchase
#   No  Yes
# 5474  348
I am trying to use a 5-fold cross-validation approach to determine the optimal K value, using the training data only. The candidate values of K are {3, 4, ..., 10}.
standardized.X = scale(Caravan[, -86])
var(standardized.X[,1])
var(standardized.X[,2])
test = 1:1000
train.X=standardized.X[-test ,]
test.X=standardized.X[test ,]
train.Y=Purchase[-test]
test.Y=Purchase[test]
#Validation set approach
set.seed(2001835)
ntrain=nrow(train.X)
retrainData <- sample(ntrain, 0.7*ntrain)
train.setX <- train.X[retrainData,]
test.setX <- train.X[-retrainData,]
train.setY <- train.Y[retrainData]
test.setY <- train.Y[-retrainData]
F <- 5
set.seed(2001835)
folds <- cut(seq(1,ntrain), breaks = F, labels = FALSE)
folds <- folds[sample(ntrain)]
folds
k = 3:10
ErrCV <- matrix(0, nrow=F, ncol= k)
for (f in 1:F) {
  retrainData <- which(folds != f)
  train.setX <- train.X[retrainData, ]
  test.setX  <- train.X[-retrainData, ]
  train.setY <- train.Y[retrainData]
  test.setY  <- train.Y[-retrainData]
  for (k in 3:10) {
    # Model fitting
    knn.pred <- knn(train = train.setX, test = test.setX, cl = train.setY, k = k)
    # Error
    ErrCV[f, k] <- mean(knn.pred != test.setY)
  }
}
CV <- apply(ErrCV, 2, mean)
plot(CV,type="l")
When I run the code, it keeps saying that a subscript is out of bounds.
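No answer was posted for this one, but a likely cause of the "subscript out of bounds" message is the ErrCV matrix: ncol = k effectively uses only the first element of k (which is 3:10 at that point), so the matrix gets 3 columns while ErrCV[f, k] later indexes columns up to 10. A minimal sketch of a fix, assuming the objects from the question, is to give the matrix one column per candidate k and index it by position:
k.list <- 3:10
ErrCV <- matrix(0, nrow = F, ncol = length(k.list))   # one column per candidate k
for (f in 1:F) {
  retrainData <- which(folds != f)
  train.setX <- train.X[retrainData, ]
  test.setX  <- train.X[-retrainData, ]
  train.setY <- train.Y[retrainData]
  test.setY  <- train.Y[-retrainData]
  for (i in seq_along(k.list)) {
    knn.pred <- class::knn(train = train.setX, test = test.setX,
                           cl = train.setY, k = k.list[i])
    ErrCV[f, i] <- mean(knn.pred != test.setY)   # column i holds the error for k = k.list[i]
  }
}
CV <- apply(ErrCV, 2, mean)
plot(k.list, CV, type = "l", xlab = "k", ylab = "CV error")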
I created a simple neural network. The issue is that whatever I do to my network, it classifies things perfectly; I always get accuracy, sensitivity, and specificity of 1.
I need to know whether I have done something wrong.
Following is my code.
#maxs <- apply(new_columns, 2, max)
#mins <- apply(new_columns, 2, min)
#scaled <- as.data.frame(scale(new_columns, center = mins, scale = maxs - mins))
#train_ <- scaled[index,]
#test_ <- scaled[-index,]
# Notice the coercion happens before doing anything else.
#.new_columns <- new_columns %>% mutate(yyes = as.factor(yyes))
set.seed(3033)
intrain <- createDataPartition(y = new_columns$yyes, p= 0.7, list = FALSE)
train_ <- new_columns[intrain,]
test_ <- new_columns[-intrain,]
n <- names(train_)
f <- as.formula(paste("yyes ~", paste(n[!n %in% "yyes"], collapse = " + ")))
nn <- neuralnet(f,data=train_,hidden=c(3,2),algorithm = "rprop+",threshold=0.1,stepmax = 1e+06)
pr.nn <- compute(nn,test_)
pr.nn_ <- pr.nn$net.result*(max(new_columns$yyes)-min(new_columns$yyes))+min(new_columns$yyes)
test.r <- (test_$yyes)*(max(new_columns$yyes)-min(new_columns$yyes))+min(new_columns$yyes)
test_pred_binary_neu <- +(pr.nn_ >= 0.5)
MSE.nn <- sum((test.r - pr.nn_)^2)/nrow(test_)
caret::confusionMatrix(factor(test_pred_binary_neu),factor(test.r))
summary(new_columns)
xtab_neu <- table(test_pred_binary_neu, test.r)
confusionMatrix(xtab_neu, positive = "1")
Kindly help me to sort this out. As noted above, my statistics (accuracy, sensitivity, specificity) all come out as 1.
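No answer was posted here either. One common cause of accuracy, sensitivity, and specificity all being exactly 1 is that the outcome (or a column derived from it) is still sitting among the predictors, so the network simply copies it. A quick leak check you could run (a sketch, not a diagnosis of this particular data set; it assumes new_columns is your full data frame and yyes the outcome):
predictors <- setdiff(names(new_columns), "yyes")
# correlation of every predictor with the outcome; values of exactly 1 or -1 suggest leakage
sapply(predictors, function(col)
  suppressWarnings(cor(as.numeric(new_columns[[col]]), as.numeric(new_columns$yyes))))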
I'm tuning parameters with custom summaryFunction in caret.
I originally thought that if I set up K-fold cross-validation and the input data has N points, performance would be measured on N/K data points.
However, this apparently isn't correct, because when I inspect data$pred with browser() (data being what is handed to the summary function), it only has 10 values.
Since the input (df) has over 500 data points, this number is far smaller than I expected.
Why does it only have 10 values? Is there any way to increase this (i.e., measure performance on more data points)?
Any kind of help is needed. Thank you.
sigma.list <- seq(1, 5, 1)
c.list <- seq(1, 10, 1)
met <- "FValue"
#define evaluation function
eval <- function(data, lev = NULL, model = NULL) {
  mat <- table(data$pred, data$obs)
  pre <- mat[1, 1] / sum(mat[1, ])  # precision
  rec <- mat[1, 1] / sum(mat[, 1])  # recall
  res <- c("Precision" = pre, "Recall" = rec, "FValue" = 2 * pre * rec / (pre + rec))
  browser()
  res
}
#define train control
tc <- trainControl(method = "cv",
                   number = 5,
                   summaryFunction = eval,
                   classProbs = TRUE)
#tune with caret
svm.tune <- train(Flag ~ .,
                  data = df,
                  method = "svmRadial",
                  tuneGrid = expand.grid(C = c.list, sigma = sigma.list),
                  trControl = tc,
                  metric = met)
After tracking this down, it appears this is normal caret behavior.
I think that caret is essentially verifying that your summaryFunction is working properly by passing fake data (of length 10) to it. The function inside caret that is doing this is evalSummaryFunction.
I'm not quite sure what I'm doing in RStudio's debugger, but this code in train.default:
testSummary <- evalSummaryFunction(y, wts = weights,
ctrl = trControl, lev = classLevels, metric = metric,
method = method)
perfNames <- names(testSummary)
calls evalSummaryFunction which looks like:
function (y, wts = NULL, perf = NULL, ctrl, lev, metric, method)
{
    n <- if (class(y)[1] == "Surv")
        nrow(y)
    else length(y)
    if (class(y)[1] != "Surv") {
        if (is.factor(y)) {
            values <- rep_len(levels(y), min(10, n))
            pred_samp <- factor(sample(values), levels = lev)
            obs_samp <- factor(sample(values), levels = lev)
        }
        else {
            pred_samp <- sample(y, min(10, n))
            obs_samp <- sample(y, min(10, n))
        }
    }
    else {
        pred_samp <- y[sample(1:n, min(10, n)), "time"]
        obs_samp <- y[sample(1:n, min(10, n)), ]
    }
    testOutput <- data.frame(pred = pred_samp, obs = obs_samp)
    if (!is.null(perf)) {
        if (is.vector(perf))
            stop("`perf` should be a data frame", call. = FALSE)
        perf <- perf[sample(1:nrow(perf), nrow(testOutput)), , drop = FALSE]
        testOutput <- cbind(testOutput, perf)
    }
    if (ctrl$classProbs) {
        for (i in seq(along = lev)) testOutput[, lev[i]] <- runif(nrow(testOutput))
        testOutput[, lev] <- t(apply(testOutput[, lev], 1, function(x) x/sum(x)))
    }
    else {
        if (metric == "ROC" & !ctrl$classProbs)
            stop("train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()")
    }
    if (!is.null(wts))
        testOutput$weights <- sample(wts, min(10, length(wts)))
    testOutput$rowIndex <- sample(1:n, size = nrow(testOutput))
    ctrl$summaryFunction(testOutput, lev, method)
}
It appears that 10 is the length of the fake data that caret passes to your summary function to evaluate it (to make sure it is working properly?).
If anyone can verify/explain better that this is what caret is actually doing, please post.
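One way to convince yourself that the real resampling results are not limited to 10 rows is to log the size of the data each time the summary function is called; during the actual 5-fold CV it should receive roughly N/5 rows per call (a small sketch, modifying the eval function from the question):
eval <- function(data, lev = NULL, model = NULL) {
  cat("summaryFunction called with", nrow(data), "rows\n")  # the 10-row call is only caret's initial sanity check
  mat <- table(data$pred, data$obs)
  pre <- mat[1, 1] / sum(mat[1, ])  # precision
  rec <- mat[1, 1] / sum(mat[, 1])  # recall
  c(Precision = pre, Recall = rec, FValue = 2 * pre * rec / (pre + rec))
}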
I have a multiclass classification problem and I'm trying to run the KNN algorithm to find the 50 nearest neighbors around each data point. I have used the FNN package in R; however, it takes a long time since my dataset has about 29 million rows. I was wondering whether there is a package in R that can run KNN in parallel. Do you have any suggestion, with an example of its usage?
You can use the following by modifying it according to KNN. If needed, I will provide you with exact code; the following code is for an SVC.
pkgs <- c('foreach', 'doParallel')
lapply(pkgs, require, character.only = T)
registerDoParallel(cores = 4)
### PREPARE FOR THE DATA ###
df1 <- read.csv(...... your dataset path........)
## do normalization if needed ##
### SPLIT DATA INTO K FOLDS ###
set.seed(2016)
df1$fold <- caret::createFolds(1:nrow(df1), k = 10, list = FALSE)
### PARAMETER LIST ###
cost <- 10^(-1:4)
gamma <- 2^(-4:-1)
parms <- expand.grid(cost = cost, gamma = gamma)
### LOOP THROUGH PARAMETER VALUES ###
result <- foreach(i = 1:nrow(parms), .combine = rbind) %do% {
  c <- parms[i, ]$cost
  g <- parms[i, ]$gamma
  ### K-FOLD VALIDATION ###
  out <- foreach(j = 1:max(df1$fold), .combine = rbind, .inorder = FALSE) %dopar% {
    deve <- df1[df1$fold != j, ]
    test <- df1[df1$fold == j, ]
    mdl <- e1071::svm(Classification-type-column ~ ., data = deve, type = "C-classification",
                      kernel = "radial", cost = c, gamma = g, probability = TRUE)
    pred <- predict(mdl, test, decision.values = TRUE, probability = TRUE)
    # replace DEFAULT below with the name of your classification column
    data.frame(y = test$DEFAULT, prob = attributes(pred)$probabilities[, 2])
  }
  ### CALCULATE SVM PERFORMANCE ###
  roc <- pROC::roc(as.factor(out$y), out$prob)
  data.frame(parms[i, ], roc = roc$auc[1])
}
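For kNN specifically, one way to adapt this pattern (a sketch only, with assumed object names train.X, train.Y, and test.X, since the exact data was not posted) is to split the query points into chunks and let each worker classify its own chunk, because the prediction for each test row is independent of the others:
library(foreach)
library(doParallel)
registerDoParallel(cores = 4)
# split the test rows into one chunk per worker
chunks <- split(seq_len(nrow(test.X)), cut(seq_len(nrow(test.X)), 4, labels = FALSE))
pred <- foreach(idx = chunks, .combine = c, .packages = "FNN") %dopar% {
  as.character(FNN::knn(train = train.X,
                        test = test.X[idx, , drop = FALSE],
                        cl = train.Y, k = 50))
}
pred <- factor(pred, levels = levels(train.Y))   # reassemble in the original order of test.X
If you only need the indices of the 50 nearest neighbors rather than a class label, FNN::get.knnx can be chunked over the query points in exactly the same way.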
I am getting the same "'train' and 'class' have different lengths" error using the following code:
install.packages("class")
library("class")
mydata <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";", header=TRUE);
index <- 1:nrow(mydata)
testindex <- sample(index, trunc(length(index)/6))
trainset <-mydata[testindex,]
testset <- mydata[-testindex,]
cl <- factor(c(rep("quality",3),rep("residual.sugar",3)))
knn(train = trainset, test = testset, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE)
Please advise, and feel free to change the way I set up 'cl'; I honestly have no idea what I'm doing with that. I want to classify 'quality' based on 'residual.sugar'.
If you need to classify quality based on residual.sugar, then quality is your cl argument. This is stated in the documentation as well:
cl: factor of true classifications of training set
So, in order to run your knn model you need to do:
library("class")
mydata <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";", header=TRUE);
index <- 1:nrow(mydata)
testindex <- sample(index, trunc(length(index)/6))
trainset <-mydata[testindex,]
testset <- mydata[-testindex,]
knn(train = trainset['residual.sugar'],    # you only need residual.sugar, you said, so just use that
    test = testset['residual.sugar'],      # again, test is the residual.sugar column
    cl = as.factor(trainset[['quality']]), # your cl argument is quality
    k = 1, l = 0, prob = FALSE, use.all = TRUE)
And do not define cl previously at all.
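To see how well the model does, you could then compare the predictions against the true quality values in the test set (a minimal sketch using the same objects as above):
pred <- knn(train = trainset['residual.sugar'],
            test = testset['residual.sugar'],
            cl = as.factor(trainset[['quality']]),
            k = 1)
mean(as.character(pred) == as.character(testset$quality))   # proportion classified correctly
table(Predicted = pred, Actual = testset$quality)           # confusion matrix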