R. How to apply sapply() to random forest - r

I need to use a batch of models of RandomForest package. I decided to use a list list.of.models to store them. Now I don't know how to apply them. I append a list using
list.of.models <- append(list.of.models, randomForest(data, as.factor(label))
and then tried to use
sapply(list.of.models[length(list.of.models)], predict, data, type = "prob")
to call the last one but the problem is that randomForest returns a list of many values, not a learner.
What to do to add to list RF-model and then call it? For example lets take a source code
data(iris)
set.seed(111)
ind <- sample(2, nrow(iris), replace = TRUE, prob=c(0.8, 0.2))
iris.rf <- randomForest(Species ~ ., data=iris[ind == 1,])
iris.pred <- predict(iris.rf, iris[ind == 2,])

With append your're extending your model.list with the elements inside the RF.model, thus not recognized by predict.randomForest etc. because the outer container list and its attribute class="randomForest" are lost. Also data input is named newdata in predict.randomForest.
This should work:
set.seed(1234)
library(randomForest)
data(iris)
test = sample(150,25)
#create 3 models
RF.models = lapply(1:3,function(mtry) {
randomForest(formula=Species~.,data=iris[-test,],mtry=mtry)
})
#append extra model
RF.models[[length(RF.models)+1]] = randomForest(formula=Species~.,data=iris[-test,],mtry=4)
summary(RF.models)
#predict all models
sapply(RF.models,predict,newdata=iris[test,])
#predict one model
predict(RF.models[[length(RF.models)]],newdata=iris[test,])

Related

Error : 'data' must be a data.frame, environment, or list

#define training and testing sets
set.seed(555)
train <- df2[1:800, c("charges")]
y_test <- df2[801:nrow(df2), c("charges")]
test <- df2[801:nrow(df2), c("age","bmi","children","smoker")]
#use model to make predictions on a test set
model <- pcr(charges~age+bmi+children+smoker, data = train, scale=TRUE, validation="CV")
pcr_pred <- predict(model, test, ncomp = 4)
#calculate RMSE
sqrt(mean((pcr_pred - y_test)^2))
I dont know why i get this error... already tried number of things but still stuck here
When you executed:
train <- df2[1:800, c("charges")]
You created an R atomic character vector. The class of the result would not be a list unless you also added the drop=FALSE parameter:
train <- df2[1:800, c("charges"), drop=FALSE]
That should fix that error although the lack of any data prevents any of us from determining whether further errors might arise. Actually, I'm pretty sure you did not want that train object to be just a single column since your model obviously expected other columns. Try this instead:
set.seed(555)
train <- df2[1:800, ]
test <- df2[801:nrow(df2), ]
#use model to make predictions on a test set
model <- pcr(charges~age+bmi+children+smoker, data = train, scale=TRUE, validation="CV")
pcr_pred <- predict(model, test, ncomp = 4)
#calculate RMSE
sqrt(mean((pcr_pred - y_test)^2))

Caret train function for muliple data frames as function

there has been a similar question to mine 6 years+ ago and it hasn't been solve (R -- Can I apply the train function in caret to a list of data frames?)
This is why I am bringing up this topic again.
I'm writing my own functions for my big R project at the moment and I'm wondering if there is an opportunity to sum up the model training function train() of the pakage caret for different dataframes with different predictors.
My function should look like this:
lda_ex <- function(data, predictor){
model <- train(predictor ~., data,
method = "lda",
trControl = trainControl(method = "none"),
preProc = c("center","scale"))
return(model)
}
Using it afterwards should work like this:
data_iris <- iris
predictor_iris <- "Species"
iris_res <- lda_ex(data = data_iris, predictor = predictor_iris)
Unfortunately the R formula is not able to deal with a variable as input as far as I tried.
Is there something I am missing?
Thank you in advance for helping me out!
Solving this would help me A LOT to keep my function sheet clean and safe work for sure.
By writing predictor_iris <- "Species", you are basically saving a string object in predictor_iris. Thus, when you run lda_ex, I guess you incur in some error concerning the formula object in train(), since you are trying to predict a string using vectors of covariates.
Indeed, I tried the following toy example:
X = rnorm(1000)
Y = runif(1000)
predictor = "Y"
lm(predictor ~ X)
which gives an error about differences in the lengths of variables.
Let me modify your function:
lda_ex <- function(data, formula){
model <- train(formula, data,
method = "lda",
trControl = trainControl(method = "none"),
preProc = c("center","scale"))
return(model)
}
The key difference is that now we must pass in the whole formula, instead of the predictor only. In that way, we avoid the string-related problem.
library(caret) # Recall to specify the packages needed to reproduce your examples!
data_iris <- iris
formula_iris = Species ~ . # Key difference!
iris_res <- lda_ex(data = data_iris, formula = formula_iris)

R Caret Train Input values

I want to make a list of input values for the caret train function to ignore. So far I can do it and it works, however, it has to be done withing the train function.
Example:
LabCa_R1_Fit <- train(LabCa ~ . -EV1 -kgpm -Fe ,...)
The -EV1 -kgpm -Fe is me removing the values, however, I want it in the form of:
list <- c(-EV1, -kgpm, -Fe)
LabCa_R1_Fit <- train (LabCa ~ . list, ...)
The problem is when I put the options to delete outside the train function they area treated as variables instead of options and I get the appropriate error. How do I create a list of the options I want?
There is also an undocumented feature that will allow you do use:
mod <- train(Species ~ ., data = iris, method = "lda", preProc = list(ignore = "Sepal.Width"))
I found the solution by doing the following:
# Outside
list <- LabCa ~ . -EV13
# Inside
LabCa_R1_Fit <- train( list , ... )

R object is not a matrix

I am new to R and trying to save my svm model in R and have read the documentation but still do not understand what is wrong.
I am getting the error "object is not a matrix" which would seem to mean that my data is not a matrix, but it is... so something is missing.
My data is defined as:
data = read.table("data.csv")
trainSet = as.data.frame(data[,1:(ncol(data)-1)])
Where the last line is my label
I am trying to define my model as:
svm.model <- svm(type ~ ., data=trainSet, type='C-classification', kernel='polynomial',scale=FALSE)
This seems like it should be correct but I am having trouble finding other examples.
Here is my code so far:
# load libraries
require(e1071)
require(pracma)
require(kernlab)
options(warn=-1)
# load dataset
SVMtimes = 1
KERNEL="polynomial"
DEGREE = 2
data = read.table("head.csv")
results10foldAll=c()
# Cross Fold for training and validation datasets
for(timesRun in 1:SVMtimes) {
cat("Running SVM = ",timesRun," result = ")
trainSet = as.data.frame(data[,1:(ncol(data)-1)])
trainClasses = as.factor(data[,ncol(data)])
model = svm(trainSet, trainClasses, type="C-classification",
kernel = KERNEL, degree = DEGREE, coef0=1, cost=1,
cachesize = 10000, cross = 10)
accAll = model$accuracies
cat(mean(accAll), "/", sd(accAll),"\n")
results10foldAll = rbind(results10foldAll, c(mean(accAll),sd(accAll)))
}
# create model
svm.model <- svm(type ~ ., data = trainSet, type='C-classification', kernel='polynomial',scale=FALSE)
An example of one of my samples would be:
10.135338 7.214543 5.758917 6.361316 0.000000 18.455875 14.082668 31
Here, trainSet is a data frame but in the svm.model function it expects data to be a matrix(where you are assigning trainSet to data). Hence, set data = as.matrix(trainSet). This should work fine.
Indeed as pointed out by #user5196900 you need a matrix to run the svm(). However beware that matrix object means all columns have same datatypes, all numeric or all categorical/factors. If this is true for your data as.matrix() may be fine.
In practice more than often people want to model.matrix() or sparse.model.matrix() (from package Matrix) which gives dummy columns for categorical variables, while having single column for numerical variables. But a matrix indeed.

Setting Random seeds do not affect classification methods C5.0 and ctree

I want to compare between two different classification methods, namely ctree and C5.0 in the libraries partyand c50 respectively, the comparison is to test their sensitivity to the initial start points. The test should be carried 30 times for each time the number of wrong classified items are calculated and stored in a vector then by using t-test I hope to see if they are really different or not.
library("foreign"); # for read.arff
library("party") # for ctree
library("C50") # for C5.0
trainTestSplit <- function(data, trainPercentage){
newData <- list();
all <- nrow(data);
splitPoint <- floor(all * trainPercentage);
newData$train <- data[1:splitPoint, ];
newData$test <- data[splitPoint:all, ];
return (newData);
}
ctreeErrorCount <- function(st,ss){
set.seed(ss);
model <- ctree(Class ~ ., data=st$train);
class <- st$test$Class;
st$test$Class <- NULL;
pre = predict(model, newdata=st$test, type="response");
errors <- length(which(class != pre)); # counting number of miss classified items
return(errors);
}
C50ErrorCount <- function(st,ss){
model <- C5.0(Class ~ ., data=st$train, seed=ss);
class <- st$test$Class;
pre = predict(model, newdata=st$test, type="class");
errors <- length(which(class != pre)); # counting number of miss classified items
return(errors);
}
compare <- function(n = 30){
data <- read.arff(file.choose());
set.seed(100);
errors = list(ctree = c(), c50 = c());
seeds <- floor(abs(rnorm(n) * 10000));
for(i in 1:n){
splitData <- trainTestSplit(data, 0.66);
errors$ctree[i] <- ctreeErrorCount(splitData, seeds[i]);
errors$c50[i] <- C50ErrorCount(splitData, seeds[i]);
}
cat("\n\n");
cat("============= ctree Vs C5.0 =================\n");
cat(paste(errors$ctree, " ", errors$c50, "\n"))
tt <- t.test(errors$ctree, errors$c50);
print(tt);
}
The program shown is supposedly doing the job of comparison, but because of the number of errors does not change in the vectors then the t.test function produces an error. I used iris inside R (but changing class to Class) and Winchester breast cancer data which can be downloaded here to test it but any data can be used as long as it has Class attribute
But I get in to the problem that the result of both methods remain constant and not changes while I am changing the random seed, theoretically ,as described in their documentation,both of the functions use random seeds, ctree uses set.seed(x) while C5.0 uses an argument called seed to set seed, unfortunatly I can not find the effect.
Could you please tell me how to control initials of these functions
ctrees does only depend on a random seed in the case where you configure it to use a random selection of input variables (ie that mtry > 0 within ctree_control). See http://cran.r-project.org/web/packages/party/party.pdf (p. 11)
In regards to C5.0-trees the seed is used this way:
ctrl = C5.0Control(sample=0.5, seed=ss);
model <- C5.0(Class ~ ., data=st$train, control = ctrl);
Notice that the seed is used to select a sample of the data, not within the algoritm itself. See http://cran.r-project.org/web/packages/C50/C50.pdf (p. 5)

Resources