How to implement knn based on weights - r

I would like to implement the weighted knn algorithm but I don't know how to do it. Everything and that I can use kknn, I suppose that it can also be done with knn. In the function train(caret) there is an option "weights" but I can't find the solution, any suggestion?
I use the following code in R :
library(caret)
library(corrplot)
glass <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data",
col.names=c("","RI","Na","Mg","Al","Si","K","Ca","Ba","Fe","Type"))
str(glass)
head(glass)
glass_1<- glass[,-7]
glass_2<- glass_1[,-7]
head(glass_2)
glass<- glass_2
standard.features <- scale(glass[,2:8])
data <- cbind(standard.features,glass[9])
anyNA(data)
head(data)
corrplot(cor(data))
data$Type<-factor(data$Type)
inTraining <- createDataPartition(data$Type, p = .7, list = FALSE, times =1 )
training <- data[ inTraining,]
testing <- data[-inTraining,]
prop.table(table(training$Type))
prop.table(table(testing$Type))
dim(training); dim(testing);
summary(data)
fitControl <- trainControl(## 5-fold CV
method = "cv",
number = 5,
## repeated ten times
#repeats = 5)
)
#k_value <- expand.grid(kmax = 3, distance = 2, kernel = "optimal")
k_value <- expand.grid(k = 3)
set.seed(825)
knn_Fit <- train(Type ~ ., data = training, weights = ????,
method = "knn", tuneGrid = k_value,
trControl = fitControl)
## This last option is actually one
## for gbm() that passes through
#verbose = FALSE)
knn_Fit
knn_Fit$finalModel

Related

Training, validation and testing without using caret

I'm having doubts during the hyperparameters tune step. I think I might be making some confusion.
I split my dataset into training (70%), validation (15%) and testing (15%). Below is the code used for regression with Random Forest.
1. Training
I perform the initial training with the dataset, as follows:
rf_model <- ranger(y ~.,
date = train ,
num.trees = 500,
mtry = 5,
min.node.size = 100,
importance = "impurity")
I get the R squared and the RMSE using the actual and predicted data from the training set.
pred_rf <- predict(rf_model,train)
pred_rf <- data.frame(pred = pred_rf, obs = train$y)
RMSE_rf <- RMSE(pred_rf$pred, pred_rf$obs)
R2_rf <- (color(pred_rf$pred, pred_rf$obs)) ^2
2. Parameter optimization
Using a parameter grid, the best model is chosen based on performance.
hyper_grid <- expand.grid(mtry = seq(3, 12, by = 4),
sample_size = c(0.5,1),
min.node.size = seq(20, 500, by = 100),
MSE = as.numeric(NA),
R2 = as.numeric(NA),
OOB_RMSE = as.numeric(NA)
)
And I perform the search for the best model according to the smallest OOB error, for example.
for (i in 1:nrow(hyper_grid)) {
model <- ranger(formula = y ~ .,
date = train,
num.trees = 500,
mtry = hyper_grid$mtry[i],
sample.fraction = hyper_grid$sample_size[i],
min.node.size = hyper_grid$min.node.size[i],
importance = "impurity",
replace = TRUE,
oob.error = TRUE,
verbose = TRUE
)
hyper_grid$OOB_RMSE[i] <- sqrt(model$prediction.error)
hyper_grid[i, "MSE"] <- model$prediction.error
hyper_grid[i, "R2"] <- model$r.squared
hyper_grid[i, "OOB_RMSE"] <- sqrt(model$prediction.error)
}
Choose the best performing model
x <- hyper_grid[which.min(hyper_grid$OOB_RMSE), ]
The final model:
rf_fit_model <- ranger(formula = y ~ .,
date = train,
num.trees = 100,
mtry = x$mtry,
sample.fraction = x$sample_size,
min.node.size = x$min.node.size,
oob.error = TRUE,
verbose = TRUE,
importance = "impurity"
)
Perform model prediction with validation data
rf_predict_val <- predict(rf_fit_model, validation)
rf_predict_val <- as.data.frame(rf_predict_val[1])
names(rf_predict_val) <- "pred"
rf_predict_val <- data.frame(pred = rf_predict_val, obs = validation$y)
RMSE_rf_fit <- RMSE rf_predict_val$pred, rf_predict_val$obs)
R2_rf_fit <- (cor(rf_predict_val$pred, rf_predict_val$obs)) ^ 2
Well, now I wonder if I should replicate the model evaluation with the test data.
The fact is that the validation data is being used only as a "test" and is not effectively helping to validate the model.
I've used cross validation in other methods, but I'd like to do it more manually. One of the reasons is that the CV via caret is very slow.
I'm in the right way?
Code using Caret, but very slow:
ctrl <- trainControl(method = "repeatedcv",
repeats = 10)
grid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = 1000,
shrinkage = c(0.01,0.1),
n.minobsinnode = 50)
gbmTune <- train(y ~ ., data = train,
method = "gbm",
tuneGrid = grid,
verbose = TRUE,
trControl = ctrl)

How to prevent "algorithm did not converge" errors in neuralnet / Caret / R?

I am trying to train a neural network using train function and neuralnet as my method paramater to predict times table.
I am scaling my training data set as well.
Even though I've tried different learningrates, stepmaxes, and thresholds for my neuralnet, each time I tried to train the network using train function one of the k-folds happened to fail every time saying
1: Algorithm did not converge in 1 of 1 repetition(s) within the stepmax.
2: predictions failed for Fold05.Rep1: layer1=8, layer2=0, layer3=0 Error in cbind(1, pred) %*% weights[[num_hidden_layers + 1]] :
requires numeric/complex matrix/vector arguments
I am guessing that this is because of weights being random so somehow each time I happen to get some weights that are not going to converge.
Is there anyway of preventing this? Maybe trying to re-train the particular fold which has failed using different weights?
Here is my code:
library(caret)
library(neuralnet)
# Create the dataset
tt = data.frame(multiplier = rep(1:10, times = 10), multiplicand = rep(1:10, each = 10))
tt = cbind(tt, data.frame(product = tt$multiplier * tt$multiplicand))
# Splitting
indexes = createDataPartition(tt$product,
times = 1,
p = 0.7,
list = FALSE)
tt.train = tt[indexes,]
tt.test = tt[-indexes,]
# Pre-process
preProc <- preProcess(tt, method = c('center', 'scale'))
tt.preProcessed <- predict(preProc, tt)
tt.preProcessed.train <- tt.preProcessed[indexes,]
tt.preProcessed.test <- tt.preProcessed[-indexes,]
# Train
train.control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3)
tune.grid <- expand.grid(layer1 = 8,
layer2 = 0,
layer3 = 0)
tt.cv <- train(product ~ .,
data = tt.preProcessed.train,
method = 'neuralnet',
tuneGrid = tune.grid,
trControl = train.control,
linear.output = TRUE,
algorithm = 'backprop',
learningrate = 0.01,
stepmax = 500000,
lifesign = 'minimal',
threshold = 0.01)

How does setting preProcess argument in train function in Caret work?

I am trying to predict the times table training a neural network. However, I couldn't really get how preProcess argument works in train function in Caret.
In the docs, it says:
The preProcess class can be used for many operations on predictors, including centering and scaling.
When we set preProcess like below,
tt.cv <- train(product ~ .,
data = tt.train,
method = 'neuralnet',
tuneGrid = tune.grid,
trControl = train.control,
linear.output = TRUE,
algorithm = 'backprop',
preProcess = 'range',
learningrate = 0.01)
Does it mean that the train function preprocesses (normalizes) the training data passed, in this case tt.train?
After the training is done, when we are trying to predict, do we pass normalized inputs to the predict function or are inputs normalized in the function because we set the preProcess parameter?
# Do we do
predict(tt.cv, tt.test)
# or
predict(tt.cv, tt.normalized.test)
And from the quote above, it seems that when we use preProcess, outputs are not normalized this way in training, how do we go about normalizing outputs? Or do we just normalize the training data beforehand like below and then pass it to the train function?
preProc <- preProcess(tt, method = 'range')
tt.preProcessed <- predict(preProc, tt)
tt.preProcessed.train <- tt.preProcessed[indexes,]
tt.preProcessed.test <- tt.preProcessed[-indexes,]
The whole code:
library(caret)
library(neuralnet)
# Create the dataset
tt = data.frame(multiplier = rep(1:10, times = 10), multiplicand = rep(1:10, each = 10))
tt = cbind(tt, data.frame(product = tt$multiplier * tt$multiplicand))
# Splitting
indexes = createDataPartition(tt$product,
times = 1,
p = 0.7,
list = FALSE)
tt.train = tt[indexes,]
tt.test = tt[-indexes,]
# Pre-process
preProc <- preProcess(tt, method = c('center', 'scale'))
tt.preProcessed <- predict(preProc, tt)
tt.preProcessed.train <- tt.preProcessed[indexes,]
tt.preProcessed.test <- tt.preProcessed[-indexes,]
# Train
train.control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
savePredictions = TRUE)
tune.grid <- expand.grid(layer1 = 8,
layer2 = 0,
layer3 = 0)
tt.cv <- train(product ~ .,
data = tt.train,
method = 'neuralnet',
tuneGrid = tune.grid,
trControl = train.control,
algorithm = 'backprop',
learningrate = 0.01,
stepmax = 100000,
preProcess = c('center', 'scale'),
lifesign = 'minimal',
threshold = 0.01)

how to use the PCs (resulting from PCA) on my dataset in R?

I am an R learner. I am working on 'Human Activity Recognition' dataset from internet. It has 563 variables, the last variable being the class variable 'Activity' which has to be predicted.
I am trying to use KNN algorithm here from CARET package of R.
I have created another dataset with 561 numeric variables excluding the last 2 - subject and activity.
I ran the PCA on that and decided that I will use the top 20 PCs.
pca1 <- prcomp(human2, scale = TRUE)
I saved the data of those PCs in another dataset called 'newdat'
newdat <- pca1$x[ ,1:20]
Now I am tryig to run the below code : but it gives me error because, this newdat doesn't have my class variable
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(3333)
knn_fit <- train(Activity ~., data = newdat, method = "knn",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneLength = 10)
I tried to extract the last column 'activity' from the raw data and appending it using cbind() with 'newdat' to use that on knn-fit (above) but its not getting appended.
any suggestions how to use the PCs ?
Below is the code:
human1 <- read.csv("C:/NIIT/Term 2/Prog for Analytics II/human-activity-recognition-with-smartphones (1)/train1.csv", header = TRUE)
humant <- read.csv("C:/NIIT/Term 2/Prog for Analytics II/human-activity-recognition-with-smartphones (1)/test1.csv", header = TRUE)
#taking the predictor columns
human2 <- human1[ ,1:561]
pca1 <- prcomp(human2, scale = TRUE)
newdat <- pca1$x[ ,1:15]
newdat <- cbind(newdat, Activity = as.character(human1$Activity))
pca1 <- preProcess(human1[,1:561],
method=c("BoxCox", "center",
"scale", "pca"))
PC = predict(pca1, human1[,1:561])
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(3333)
knn_fit <- train(Activity ~., data = newdat, method = "knn",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneLength = 10)
#applying knn_fit to test data
test_pred <- predict(knn_fit, newdata = testing)
test_pred
#checking the prediction
confusionMatrix(test_pred, testing$V1 )
I am running into error in the below part. I have attached with the error:
> knn_fit <- train(Activity ~., data = newdat, method = "knn",
+ trControl=trctrl,
+ preProcess = c("center", "scale"),
+ tuneLength = 10)
Error: cannot allocate vector of size 1.3 Gb
How have you tried to cbind the column, could you please show the code? I think you simply stepped into the difficulties produced by StringsAsFactors = TRUE. Does the following line solve your problem:
#...
#newdat <- pca1$x[ ,1:20]
newdat <- cbind(newdat, Activity = as.character(human2$Activity))

Caret: undefined columns selected

I have been trying to get the below code to run in caret but get the error. Can anyone tell me how to trouble shoot it.
Error in [.data.frame(data, , lvls[1]) : undefined columns selected
library(tidyverse)
library(caret)
mydf <- iris
mydf <- mydf %>%
mutate(tgt = as.factor(ifelse(Species == 'setosa','Y','N'))) %>%
select(everything(), -Species)
trainIndex <- createDataPartition(mydf$tgt, p = 0.75, times = 1, list = FALSE)
train <- mydf[trainIndex,]
test <- mydf[-trainIndex,]
fitControl <- trainControl(method = 'repeatedcv',
number = 10,
repeats = 10,
allowParallel = TRUE,
summaryFunction = twoClassSummary)
fit_log <- train(tgt~.,
data = train,
method = "glm",
trControl = fitControl,
family = "binomial")
You need to used classProbs = TRUE in your control function. The ROC curve is based on the class probabilities and the error is the summary function not finding those columns.
Use data = data.frame(xxxxx). As in the example below
fit.cart <- train(Condition~., data = data.frame(trainset), method="rpart", metric=metric, trControl=control)

Resources