For loop in train() (caret) to select different predictors in an lm model - R

I'm just a beginner, so I hope you can help with a problem with the train() function (from the caret package) in R.
I tried this:
models.list = as.list(vector(length = ncol(mtcars)))
for (i in 1:ncol(mtcars)) {
  models.list[[i]] <- train(x = mtcars[, i], y = mtcars[, 1], method = "lm")
}
This causes the error "Please use column names for x". Do you know how I can use the column names instead of bare observations in a for loop? My goal is to fit an lm regression with different predictor variables.
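The error appears because mtcars[, i] drops the subset down to an unnamed vector. A minimal sketch of one fix (assuming column 1, mpg, is the outcome and each remaining column should get its own single-predictor lm): subset with drop = FALSE so the column name is preserved.
library(caret)

models.list <- vector("list", ncol(mtcars) - 1)
for (i in 2:ncol(mtcars)) {
  # mtcars[, i, drop = FALSE] stays a one-column data frame with its name,
  # so train() no longer complains about missing column names
  models.list[[i - 1]] <- train(x = mtcars[, i, drop = FALSE],
                                y = mtcars[, 1],
                                method = "lm")
}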


Creating multiple GLM in a for loop, skipping models in the loop where the coefficients do not work in R

I have created a list that contains all possible combinations of my independent variables, and I am trying to create multiple GLMs with all of those combinations.
combinations = list()
models = list()
for (i in seq_along(combinations)) {
  models[[i]] <- glm(as.formula(paste("(x) ~", combinations[[i]])), family = 'Gamma', data = df)
}
I get the error message:
Error: no valid set of coefficients has been found: please supply starting values
I know this happens because, for one or a few models, the particular combination of independent variables causes the Gamma fit to fail. How can I skip the models that create those issues (or leave them blank) and keep the loop going?
As a side comment: I tried implementing a normal exponential glm, which worked, but I would really like to stay with the gamma family.
combinations = list()
models = list()
for (i in seq_along(combinations)) {
  models[[i]] <- glm(as.formula(paste("exp(x) ~", combinations[[i]])), family = 'gaussian', data = df)
}
Many thanks!
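One common way to keep the loop going (a sketch, assuming the same combinations, df, and response as above) is to wrap each fit in tryCatch(), so a failing Gamma fit is stored as NULL instead of stopping the loop:
models <- vector("list", length(combinations))
for (i in seq_along(combinations)) {
  models[[i]] <- tryCatch(
    glm(as.formula(paste("(x) ~", combinations[[i]])), family = 'Gamma', data = df),
    error = function(e) NULL  # skip (leave blank) the models that error out
  )
}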

Caret train function for multiple data frames as a function

There was a similar question to mine 6+ years ago and it has not been solved (R -- Can I apply the train function in caret to a list of data frames?).
This is why I am bringing up this topic again.
I'm writing my own functions for my big R project at the moment, and I'm wondering whether there is a way to wrap caret's model-training function train() for different data frames with different predictors.
My function should look like this:
lda_ex <- function(data, predictor){
  model <- train(predictor ~ ., data,
                 method = "lda",
                 trControl = trainControl(method = "none"),
                 preProc = c("center", "scale"))
  return(model)
}
Using it afterwards should work like this:
data_iris <- iris
predictor_iris <- "Species"
iris_res <- lda_ex(data = data_iris, predictor = predictor_iris)
Unfortunately, as far as I have tried, an R formula cannot deal with a variable as input this way.
Is there something I am missing?
Thank you in advance for helping me out!
Solving this would help me A LOT in keeping my function file clean and would definitely save work.
By writing predictor_iris <- "Species", you are basically saving a string object in predictor_iris. Thus, when you run lda_ex, I guess you run into an error concerning the formula object in train(), since you are trying to predict a string using vectors of covariates.
Indeed, I tried the following toy example:
X = rnorm(1000)
Y = runif(1000)
predictor = "Y"
lm(predictor ~ X)
which gives an error about differences in the lengths of variables.
Let me modify your function:
lda_ex <- function(data, formula){
  model <- train(formula, data,
                 method = "lda",
                 trControl = trainControl(method = "none"),
                 preProc = c("center", "scale"))
  return(model)
}
The key difference is that now we must pass in the whole formula instead of only the predictor name. That way, we avoid the string-related problem.
library(caret) # Remember to specify the packages needed to reproduce your examples!
data_iris <- iris
formula_iris = Species ~ . # Key difference!
iris_res <- lda_ex(data = data_iris, formula = formula_iris)
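If you would rather keep passing the outcome name as a string, a sketch of an alternative is to build the formula inside the function with reformulate() (lda_ex_str is a hypothetical name):
lda_ex_str <- function(data, predictor){
  # reformulate(".", response = predictor) builds e.g. Species ~ .
  model <- train(reformulate(".", response = predictor), data,
                 method = "lda",
                 trControl = trainControl(method = "none"),
                 preProc = c("center", "scale"))
  return(model)
}
iris_res <- lda_ex_str(data = iris, predictor = "Species")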

R object is not a matrix

I am new to R and trying to save my SVM model; I have read the documentation but still do not understand what is wrong.
I am getting the error "object is not a matrix", which would seem to mean that my data is not a matrix, but it is... so something is missing.
My data is defined as:
data = read.table("data.csv")
trainSet = as.data.frame(data[,1:(ncol(data)-1)])
where the last column is my label.
I am trying to define my model as:
svm.model <- svm(type ~ ., data=trainSet, type='C-classification', kernel='polynomial',scale=FALSE)
This seems like it should be correct but I am having trouble finding other examples.
Here is my code so far:
# load libraries
require(e1071)
require(pracma)
require(kernlab)
options(warn = -1)

# load dataset
SVMtimes = 1
KERNEL = "polynomial"
DEGREE = 2
data = read.table("head.csv")
results10foldAll = c()

# cross fold for training and validation datasets
for (timesRun in 1:SVMtimes) {
  cat("Running SVM = ", timesRun, " result = ")
  trainSet = as.data.frame(data[, 1:(ncol(data) - 1)])
  trainClasses = as.factor(data[, ncol(data)])
  model = svm(trainSet, trainClasses, type = "C-classification",
              kernel = KERNEL, degree = DEGREE, coef0 = 1, cost = 1,
              cachesize = 10000, cross = 10)
  accAll = model$accuracies
  cat(mean(accAll), "/", sd(accAll), "\n")
  results10foldAll = rbind(results10foldAll, c(mean(accAll), sd(accAll)))
}

# create model
svm.model <- svm(type ~ ., data = trainSet, type = 'C-classification', kernel = 'polynomial', scale = FALSE)
An example of one of my samples would be:
10.135338 7.214543 5.758917 6.361316 0.000000 18.455875 14.082668 31
Here, trainSet is a data frame, but the svm() call expects data to be a matrix (you are assigning trainSet to data). Hence, set data = as.matrix(trainSet). This should work fine.
Indeed, as pointed out by #user5196900, you need a matrix to run svm(). However, beware that a matrix object means all columns have the same data type: all numeric or all categorical/factors. If this is true for your data, as.matrix() may be fine.
In practice, more often than not, people want model.matrix() or sparse.model.matrix() (from the Matrix package), which creates dummy columns for categorical variables while keeping a single column for each numerical variable. But it is a matrix indeed.
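A minimal sketch of that approach (assuming a data frame trainSet that still contains the factor label column 'type' alongside mixed numeric/factor predictors; these names follow the question):
library(e1071)
library(Matrix)

X  <- model.matrix(type ~ . - 1, data = trainSet)          # dense, dummy-coded predictors
Xs <- sparse.model.matrix(type ~ . - 1, data = trainSet)   # sparse equivalent

svm.model <- svm(x = X, y = trainSet$type,
                 type = 'C-classification', kernel = 'polynomial', scale = FALSE)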

For loops regression in R

I'm fitting a GARCH model to the residuals of an ARIMA model, and trying to apply ARCH(p) for p from 1 to 10 to compare the fits. Here is my code. Errors are returned in the for-loop part, but I cannot figure out why. Could anyone give some tips?
So for the single value p=1 the codes are as below and it's no problem.
fitone<- garchFit(~garch(1,0),data=logprice)
coef(fitone)
summary(fitone)
And my code for the for loop goes like this:
for (n in 1:10) {
  fit[[n]] <- garchFit(~garch(n, 0), data = logprice)
  coef(fit[[n]])
  summary(fit[[n]])
}
Error in .garchArgsParser(formula = formula, data = data, trace = FALSE) :
Formula and data units do not match.
I have never written loop code before. Can someone help me with the code?
The problem is that R generally tries to evaluate all the variables in a formula in the context of the data= parameter, but your n variable isn't coming from logprice; it's coming from the global environment. You will need to create the formula dynamically. Here's one way to run all the models with lapply rather than a for loop:
library(fGarch)
# sample data
x.vec = as.vector(garchSim(garchSpec(rseed = 1985), n = 200)[, 1])
fits <- lapply(1:10, function(n) {
  garchFit(bquote(~garch(.(n), 0)), data = x.vec, trace = FALSE)
})
and then we can get the coefs with
lapply(fits, coef)
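If you prefer a for loop, a sketch of the same idea (reusing x.vec from above): pasting n into the formula string embeds the literal number, so garchFit() no longer needs to look n up in the data.
fit <- vector("list", 10)
for (n in 1:10) {
  form <- as.formula(paste0("~ garch(", n, ", 0)"))
  fit[[n]] <- garchFit(form, data = x.vec, trace = FALSE)
}
lapply(fit, coef)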

How to add specific conditions to stepAIC

I am running a regression with 37 variables, and I am using stepAIC to perform model selection. I do NOT want a predictive model; I just want to find out which variables have the best explanatory power.
My current code looks like:
fitObject <- lm(mydata)
DEP.select <- stepAIC(fitObject, direction = 'both', scope = list(lower = ~AUC), trace = F, k = log(obs))
# DEP is my dependent variable, and AUC is an independent variable I want to have in my model.
The problem is that a lot of my variables are highly correlated, and the result stepAIC gives me contains several of those highly correlated variables. Notice that since I have forced AUC into the model, multicollinearity is a problem, especially when variables highly correlated with AUC are chosen.
Is there a way to specify in the function some thresholds for correlation or p-value of the coefficients?
Any comments on other approaches that could solve my problem are also welcome.
Thank you!
Perhaps the Variance Inflation Factor (VIF) will work better for you. This article explains some of the logic: http://en.wikipedia.org/wiki/Variance_inflation_factor
Example use:
v = ezvif(df, yvar = 'columnNameOfWhichYouAreTryingToPredict')
Here is the function I wrote that combines VIF::vif with cross validation.
require(VIF)
require(cvTools)

# returns selected variables using VIF and k-fold cross validation
ezvif = function(df, yvar, folds = 5, trace = F){
  f = cvFolds(nrow(df), K = folds)
  findings = list()
  for (v in names(df)){
    if (v == yvar) next
    findings[[v]] = 0
  }
  for (i in 1:folds){
    rows = f$subsets[f$which != i]
    y = df[rows, yvar]
    xdf = df[rows, names(df) != yvar]  # remove output var
    vifResult = vif(y, xdf, trace = trace, subsize = min(200, floor(nrow(xdf))))
    for (v in names(xdf)[vifResult$select]){
      findings[[v]] = findings[[v]] + 1  # vote
    }
  }
  findings = sort(unlist(findings), decreasing = T)
  if (trace) print(findings[findings > 0])
  return(c(yvar, names(findings[findings == findings[1]])))
}
I would recommend removing the variables with high correlations. The caret and corrplot libraries can help:
library(corrplot)
library(caret)
dm = data.matrix(mydata[, names(mydata) != 'DEP'])  # without your outcome var
Visualize your correlations, clustering highly correlated variables together:
corrplot(cor(dm), order = 'hclust')
Then find the indices of variables that you could remove due to high (>0.75) correlations:
findCorrelation(cor(dm), 0.75)
Removing these variables can improve your model. After removing the variables, continue doing the stepAIC as you described in your question.
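A sketch of the full sequence (assuming MASS is loaded for stepAIC and the mydata / 'DEP' / 'AUC' names from the question):
library(MASS)

drop_idx <- findCorrelation(cor(dm), cutoff = 0.75)
keep <- !(names(mydata) %in% colnames(dm)[drop_idx])  # check that AUC is not among the dropped columns
mydata_reduced <- mydata[, keep]

fitObject <- lm(DEP ~ ., data = mydata_reduced)
DEP.select <- stepAIC(fitObject, direction = 'both',
                      scope = list(lower = ~AUC), trace = FALSE)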
To assess multicollinearity between predictors when running the dredge function (MuMIn package), include the following max.r function as the "extra" argument:
max.r <- function(x){
  corm <- cov2cor(vcov(x))
  corm <- as.matrix(corm)
  if (length(corm) == 1){
    # intercept-only model: no coefficient correlations to report
    corm <- 0
    max(abs(corm))
  } else if (length(corm) == 4){
    # intercept plus a single predictor: after dropping the intercept
    # row/column only one coefficient remains, so report 0
    cormf <- corm[2:nrow(corm), 2:ncol(corm)]
    cormf <- 0
    max(abs(cormf))
  } else {
    # drop the intercept row/column and return the largest absolute
    # correlation among the remaining coefficients
    cormf <- corm[2:nrow(corm), 2:ncol(corm)]
    diag(cormf) <- 0
    max(abs(cormf))
  }
}
Then simply run dredge, specifying the number of predictor variables and including the max.r function:
options(na.action = na.fail)
Allmodels <- dredge(Fullmodel, rank = "AIC", m.lim = c(0, 3), extra = max.r)
Allmodels[Allmodels$max.r <= 0.6, ]  ## subset models with max.r <= 0.6 (not collinear)
NCM <- get.models(Allmodels, subset = max.r <= 0.6)  ## retrieve models with max.r <= 0.6 (not collinear)
model.sel(NCM)  ## final model selection table
This works for lme4 models. For nlme models see: https://github.com/rojaff/dredge_mc
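For context, a sketch of the kind of full model dredge expects here (dat, y, x1..x3, and g are hypothetical names; this assumes an lme4 fit, as in the answer):
library(lme4)
library(MuMIn)

# fit the global model with ML so dredge can compare AICs across submodels
Fullmodel <- lmer(y ~ x1 + x2 + x3 + (1 | g), data = dat, REML = FALSE)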
