k-nearest-neighbors in caret with categorical data - r

I want to predict a binary variable with some other variables, some of them are categorical. I've set up a code and everything seemed to work fine, immediately. The the predictions were quite similar in comparison to a logistic regression and a random forest. This is my code (I don't think there is something wrong with it):
knn.Fit <-
train(Y ~ .,
data = Data,
method = "knn",
trControl = trainControl(method = "repeatedcv",
repeats = 5,
number = 5),
tuneLength = 20)
Now my question is how is this done with categorical variables? For example, if I have a categorial variable with values a, b and c, does the function create three (or two?) dummy variables in the background and calculates the distance with them? And are the numeric variables standardized automatically? Otherwise these dummy variables should not fall into account if one or more numeric variables have much bigger standard deviations? I've thought I have to do quite much data preparation before running the algorithm ...
EDIT:
I've seen that I can standardize with the argument preProcess:
preProcess = c("center", "scale")
My numeric variables didn't have a big SD, indeed.

For categorical variables, k-nearest neighbors won't work. Try k-modes (https://en.wikipedia.org/wiki/K-medoids) which can be used via the klaR package. The two cannot typically be combined however.

Related

How to identify the non-zero coefficients in final caret elastic net model -

I have used caret to build a elastic net model using 10-fold cv and I want to see which coefficients are used in the final model (i.e the ones that aren't reduced to zero). I have used the following code to view the coefficients, however, this apears to create a dataframe of every permutation of coefficient values used, rather than the ones used in the final model:
tr_control = train_control(method="cv",number=10)
formula = response ~.
model1 = caret::train(formula,
data=training,
method="glmnet",
trControl=tr_control,
metric = "Accuracy",
family = "binomial")
Then to extract the coefficients from the final model and using the best lambda value, I have used the following:
data.frame(as.matrix(coef(model1$finalModel, model1$bestTune$.lambda)))
However, this just returns a dataframe of all the coefficients and I can see different instances of where the coefficients have been reduced to zero, however, I'm not sure which is the one the final model uses. Using some slightly different code, I get slightly different results, but in this instance, non of the coefficients are reduced to zero, which suggests to me that the the final model isn't reducing any coefficients to zero:
data.frame(as.matrix(coef(model1$finalModel, model1$bestTune$lambda))) #i have removed the full stop preceeding lambda
Basically, I want to know which features are in the final model to assess how the model has performed as a feature reduction process (alongside standard model evaluation metrics such as accuracy, sensitivity etc).
Since you do not provide any example data I post an example based on the iris built-in dataset, slightly modified to fit better your need (a binomial outcome).
First, modify the dataset
library(caret)
set.seed(5)#just for reproducibility
iris
irisn <- iris[iris$Species!="virginica",]
irisn$Species <- factor(irisn$Species,levels = c("versicolor","setosa"))
str(irisn)
summary(irisn)
fit the model (the caret function for setting controls parameters for train is trainControl, not train_control)
tr_control = trainControl(method="cv",number=10)
model1 <- caret::train(Species~.,
data=irisn,
method="glmnet",
trControl=tr_control,
family = "binomial")
You can extract the coefficients of the final model as you already did:
data.frame(as.matrix(coef(model1$finalModel, model1$bestTune$lambda)))
Also here the model did not reduce any coefficients to 0, but what if we add a random variable that explains nothing about the outcome?
irisn$new1 <- runif(nrow(irisn))
model2 <- caret::train(Species~.,
data=irisn,
method="glmnet",
trControl=tr_control,
family = "binomial")
var <- data.frame(as.matrix(coef(model2$finalModel, model2$bestTune$lambda)))
Here, as you can see, the coefficient of the new variable was turning to 0. You can extract the variable name retained by the model with:
rownames(var)[var$X1!=0]
Finally, the accuracy metrics from the test set can be obtained with
confusionMatrix(predict(model1,test),test$outcome)

Conduct quantile regression with several dependent variables in R

I'm interested in doing a multivariate regression in R, looking at the effects of a grouping variable (2 levels) on several dependent variables. However, due to my data being non-normal and the 2 groups not having homogenous variances, I'm looking to use a quantile regression instead. I'm using the rq function from the quantreg toolbox to do this.
My code is as follows
# Generate some fake data
DV = matrix(rnorm(40*5),ncol=5) #construct matrix for dependent variables
IV = matrix(rep(1:2,20)) #matrix for grouping factor
library(quantreg)
model.q = rq(DV~IV,
tau = 0.5)
I get the following error message when this is run:
Error in y - x %*% z$coef : non-conformable arrays
In addition: Warning message:
In rq.fit.br(x, y, tau = tau, ...) : Solution may be nonunique
I believe this is due to my having several DVs, as the model works fine when I try using a DV of one column. Is there a specific way I should be formatting my data? Or perhaps there is another function I may be able to use?
Thank you!
If you just want to run several regressions, each with the same set of independent variables, but with a different dependent variable, you could write a function and then apply it to all columns of your DV matrix and save the models in a list:
reg <- function(col_number) {
model.q <- rq(DV[, col_number] ~ IV, tau = 0.5)
}
model_list <- lapply(1:ncol(DV), reg)
However, as pointed out in the comments, it might be that you want a multivariate model accounting for the correlation of the outcome - but then I do not think the rq method would be appropriate
If you have multiple responses, what you most likely need is:
DV = matrix(rnorm(40*5),ncol=5) #construct matrix for dependent variables
IV = matrix(rep(1:2,20)) #matrix for grouping factor
library(quantreg)
rqs.fit(x=IV, y=DV, tau=0.5, tol = 0.0001)
Unfortunately, there's really not a lot of documentation about how this works.. I can update if i do find it

How can I extract coefficients from this model in caret?

I'm using the caret package with the leaps package to get the number of variables to use in a linear regression. How do I extract the model with the lowest RMSE that uses mdl$bestTune number of variables? If this can't be done are there functions in other packages you would recommend that allow for loocv of a stepwise linear regression and actually allow me to find the final model?
Below is reproducible code. From it, I can tell from mdl$bestTune that the number of variables should be 4 (even though I would have hoped for 3). It seems like I should be able to extract the variables from the third row of summary(mdl$finalModel) but I'm not sure how I would do this in a general case and not just this example.
library(caret)
set.seed(101)
x <- matrix(rnorm(36*5), nrow=36)
colnames(x) <- paste0("V", 1:5)
y <- 0.2*x[,1] + 0.3*x[,3] + 0.5*x[,4] + rnorm(36) * .0001
train.control <- trainControl(method="LOOCV")
mdl <- train(x=x, y=y, method="leapSeq", trControl = train.control, trace=FALSE)
coef(mdl$finalModel, as.double(mdl$bestTune))
mdl$bestTune
summary(mdl$finalModel)
mdl$results
Here's the context behind my question in case it's of interest. I have historical monthly returns hundreds of mutual fund. Each fund's returns will be a dependent variable that I'd like to regress against a set of returns on a handful (e.g. 5) factors. For each fund I want to run a stepwise regression. I expect only 1 to 3 of the five factors to be significant for any fund.
you can use:
coef(mdl$finalModel,unlist(mdl$bestTune))

LASSO analysis (glmnet package). Can I loop the analysis and the results extraction?

I'm using the package glmnet, I need to run several LASSO analysis for the calibration of a large number of variables (%reflectance for each wavelength throughout the spectrum) against one dependent variable. I have a couple of doubts on the procedure and on the results I wish to solve. I show my provisional code below:
First I split my data in training (70% of n) and testing sets.
smp_size <- floor(0.70 * nrow(mydata))
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
Then I separate the target trait (y) and the independent variables (x) for each set as follows:
vars.train <- train[3:2153]
vars.test <- test[3:2153]
x.train <- data.matrix(vars.train)
x.test <- data.matrix(vars.test)
y.train <- train$X1
y.test <- test$X1
Afterwords, I run a cross-validated LASSO model for the training set and extract and writte the non-zero coefficients for lambdamin. This is because one of my concerns here is to note which variables (wavebands of the reflectance spectrum) are selected by the model.
install.packages("glmnet")
library(glmnet)
cv.lasso.1 <- cv.glmnet(y=y.train, x= x.train, family="gaussian", nfolds =
5, standardize=TRUE, alpha=1)
coef(cv.lasso.1,s=cv.lasso.1$lambda.min) # Using lambda min.
(cv.lasso.1)
install.packages("broom")
library(broom)
c <- tidy(coef(cv.lasso.1, s="lambda.min"))
write.csv(c, file = "results")
Finally, I use the function “predict” and apply the object “cv.lasso1” (the model obtained previously) to the variables of the testing set (x.2) in order to get the prediction of the variable and I run the correlation between the predicted and the actual values of Y for the testing set.
predict.1.2 <- predict(cv.lasso.1, newx=x.2, type = "response", s =
"lambda.min")
cor.test(x=c(predict.1.2), y=c(y.2))
This is a simplified code and had no problem so far, the point is that I would like to make a loop (of one hundred repetitions) of the whole code and get the non-zero coefficients of the cross-validated model as well as the correlation coefficient of the predicted vs actual values (for the testing set) for each repetition. I've tried but couldn't get any clear results. Can someone give me some hint?
thanks!
In general, running repeated analyses of the same type over and over on the same data can be tricky. And in your case, may not be necessary the way in which you have outlined it.
If you are trying to find the variables most predictive, you can use PCA, Principal Component Analysis to select variables with the most variation within the a variable AND between variables, but it does not consider your outcome at all, so if you have poor model design it will pick the least correlated data in your repository but it may not be predictive. So you should be very aware of all variables in the set. This would be a way of reducing the dimensionality in your data for a linear or logistic regression of some sort.
You can read about it here
yourPCA <- prcomp(yourData,
center = TRUE,
scale. = TRUE)
Scaling and centering are essential to making these models work right, by removing the distance between your various variables setting means to 0 and standard deviations to 1. Unless you know what you are doing, I would leave those as they are. And if you have skewed or kurtotic data, you might need to address this prior to PCA. Run this ONLY on your predictors...keep your target/outcome variable out of the data set.
If you have a classification problem you are looking to resolve with much data, try an LDA, Linear Discriminant Analysis which looks to reduce variables by optimizing the variance of each predictor with respect to the OUTCOME variable...it specifically considers your outcome.
require(MASS)
yourLDA =r <- lda(formula = outcome ~ .,
data = yourdata)
You can also set the prior probabilities in LDA if you know what a global probability for each class is, or you can leave it out, and R/ lda will assign the probabilities of the actual classes from a training set. You can read about that here:
LDA from MASS package
So this gets you headed in the right direction for reducing the complexity of data via feature selection in a computationally solid method. In looking to build the most robust model via repeated model building, this is known as crossvalidation. There is a cv.glm method in boot package which can help you get this taken care of in a safe way.
You can use the following as a rough guide:
require(boot)
yourCVGLM<- cv.glmnet(y = outcomeVariable, x = allPredictorVariables, family="gaussian", K=100) .
Here K=100 specifies that you are creating 100 randomly sampled models from your current data OBSERVATIONS not variables.
So the process is two fold, reduce variables using one of the two methods above, then use cross validation to build a single model from repeated trials without cumbersome loops!
Read about cv.glm here
Try starting on page 41, but look over the whole thing. The repeated sampling you are after is called booting and it is powerful and available in many different model types.
Not as much code and you might hope for, but pointing you in a decent direction.

How to build regression models and then compare their fits with data held out from the model training-testing?

I have been building a couple different regression models using the caret package in R in order to make predictions about how fluorescent certain genetic sequences will become under certain experimental conditions.
I have followed the basic protocol of splitting my data into two sets: one "training-testing set" (80%) and one "hold-out set" (20%), the former of which would be utilized to build the models, and the latter would be used to test them in order to compare and pick the final model, based on metrics such as their R-squared and RMSE values. One such guide of the many I followed can be found here (http://www.kimberlycoffey.com/blog/2016/7/16/compare-multiple-caret-run-machine-learning-models).
However, I run into a block in that I do not know how to test and compare the different models based on how well they can predict the scores in the hold-out set. In the guide I linked to above, the author uses a ConfusionMatrix in order to calculate the specificity and accuracy for each model after building a predict.train object that applied the recently built models on the hold-out set of data (which is referred to as test in the link). However, ConfusionMatrix can only be applied to classification models, wherein the outcome (or response) is a categorical value (as far as my research has indicated. Please correct me if this is incorrect, as I have not been able to conclude without any doubt that this is the case).
I have found that the resamples method is capable of comparing multiple models against each other (source: https://www.rdocumentation.org/packages/caret/versions/6.0-77/topics/resamples), but it cannot take into account how the new models fit with the data that I excluded from the training-testing sessions.
I tried to create predict objects using the recently built models and hold-out data, then calculate Rsquared and RMSE values using caret's R2 and RMSE methods. But I'm not sure if such an approach is best possible way for comparing and picking the best model.
At this point, I should note that all the model building methods I am using are based on linear regression, since I need to be able to extract the coefficients and apply them in a separate Python script.
Another option I considered was setting a threshold in my outcome, wherein any genetic sequence that had a fluorescence value over 100 was considered useful, while sequences scoring values under 100 were not. This would allow me utilize the ConfusionMatrix. But I'm not sure how I should implement this within my R code to make these two classes in my outcome variable. I'm further concerned that this approach might make it difficult to apply my regression models to other data and make predictions.
For what it's worth, each of the predictors is either an integer or a float, and have ranges that are not normally distributed.
Here is the code I thus far been using:
library(caret)
data <- read.table("mydata.csv")
sorted_Data<- data[order(data$fluorescence, decreasing= TRUE),]
splitprob <- 0.8
traintestindex <- createDataPartition(sorted_Data$fluorescence, p=splitprob, list=F)
holdoutset <- sorted_Data[-traintestindex,]
trainingset <- sorted_Data[traintestindex,]
traindata<- trainingset[c('x1', 'x2', 'x3', 'x4', 'x5', 'fluorescence')]
cvCtrl <- trainControl(method = "repeatedcv", number= 20, repeats = 20, verboseIter = FALSE)
modelglmStepAIC <- train(fluorescence~., traindata, method = "glmStepAIC", preProc = c("center","scale"), trControl = cvCtrl)
model_rlm <- train(fluorescence~., traindata, method = "rlm", preProc = c("center","scale"), trControl = cvCtrl)
pred_glmStepAIC<- predict.lm(modelglmStepAIC$finalModel, holdoutset)
pred_rlm<- predict.lm(model_rlm$finalModel, holdoutset)
glmStepAIC_r2<- R2(pred_glmStepAIC, holdoutset$fluorescence)
glmStepAIC_rmse<- RMSE(pred_glmStepAIC, holdoutset$fluorescence)
rlm_r2<- R2(pred_rlm, holdoutset$fluorescence)
rlm_rmse<- RMSE(pred_rlm, holdoutset$fluorescence)
The out-of-sample performance measures offered by Caret are RMSE, MAE and squared correlation between fitted and observed values (called R2). See more info here https://topepo.github.io/caret/measuring-performance.html
At least in time series regression context, RMSE is the standard measure for out-of-sample performance of regression models.
I would advise against discretising continuous outcome variable, because you are essentially throwing away information by discretising.

Resources