Error in root.matrix(crossprod(process)) : matrix is not positive semidefinite - r

I want to extend RandomForest so that each leaf contains a naive Bayes model instead of an average. As a first step, I tried to use mob() to fit a linearModel in each leaf. I got the following error:
Error in root.matrix(crossprod(process)) : matrix is not positive semidefinite
Here is my code:
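library(data.table)
library(party)
library(randomForest) # needed for the randomForest() calls below
set.seed(123)
# car.data has no header row, so header = TRUE would consume the first record
data1 <- read.csv('https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data', header = FALSE, stringsAsFactors = TRUE)
colnames(data1)<- c("BuyingPrice","Maintenance","NumDoors","NumPersons","BootSpace","Safety","Condition")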
# Split into Train and Validation sets
# Training Set : Validation Set = 70 : 30 (random)
set.seed(100)
train <- sample(nrow(data1), 0.7*nrow(data1), replace = FALSE)
TrainSet <- data1[train,]
ValidSet <- data1[-train,]
summary(TrainSet)
summary(ValidSet)
# Create a Random Forest model with default parameters
model1 <- randomForest(Condition ~ ., data = TrainSet, importance = TRUE)
model1
# Fine tuning parameters of Random Forest model
model2 <- randomForest(Condition ~ ., data = TrainSet, ntree = 500, mtry = 6, importance = TRUE)
model2
fmBH <- mob(Condition ~ BuyingPrice + Maintenance | NumDoors+ NumPersons + BootSpace + Safety ,
data = TrainSet, model = linearModel)


predict() function in R is not providing prediction in R console

My training data has 87620 rows and 5 columns. My test data has the same number of rows and columns. When I use a CART model to predict "Defaults" (the target variable) on the test data, the model works and provides predictions.
My validation data set has 6 columns and only 19561 rows, and does not contain the Defaults variable. When I predict on it and then run
View(validationsetpreds.CART3.3x)
I get the output shown in the attached picture:
[Validationsetpreds Picture]
When I run the same command on the test data set, I get the following: [Testsetpreds Picture]
set.seed(123)
loans_training$Default <- as.factor(loans_training$Default)#Make the default variable categorical
loans_test$Default <- as.factor(loans_test$Default)#Make the default variable categorical
loans_training$term <- as.factor(loans_training$term)
loans_test$term <- as.factor(loans_test$term)
#Standardize datasets
library(psych)
library(caret)
preprocess.train.z <- preProcess(loans_training[1:5], method = c("center", "scale"))
preprocess.train.z
loans_train.z <- predict(preprocess.train.z,loans_training[1:5])
describe(loans_train.z)
View(loans_train.z)
summary(loans_train.z$Default)
preprocess.test.z <- preProcess(loans_test[1:5], method = c("center", "scale"))
preprocess.test.z
loans_test.z <- predict(preprocess.test.z,loans_test[1:5])
describe(loans_test.z)
View(loans_test.z)
summary(loans_test.z$Default)
(22417 * 2.3) + 22417
#Resampling subroutine
rare.record.indices <- which(loans_train.z$Default == "1")
rare.indices.resampled <- sample(x = rare.record.indices,size = 51559, replace = TRUE)
rare.records.resampled <- loans_train.z[rare.indices.resampled,]
loans_train.3.3x <- rbind(loans_train.z, rare.records.resampled)
table(loans_train.3.3x$Default)
#Develop 3.3x CART model
TC <- trainControl(method = "CV", number = 10)
fit.CART.3.3x <- train(Default ~ ., data = loans_train.3.3x, method = "rpart", trControl = TC)
fit.CART.3.3x$resample
testsetpreds.CART3.3x <- predict(fit.CART.3.3x,loans_test.z)
table(loans_test.z$Default, testsetpreds.CART3.3x)
testsetpreds.CART3.3x
#Predictions
set.seed(123)
loans_validation$grade <- as.factor(loans_validation$grade)#Make the grade variable categorical
loans_validation$term <- as.factor(loans_validation$term)#Make the term variable categorical
loans_validation$Index <- as.factor(loans_validation$Index)#Make the Index variable categorical
#Standardize dataset
library(psych)
library(caret)
preprocess.validation.z <- preProcess(loans_validation[1:6], method = c("center", "scale"))
preprocess.validation.z
loans_validation.z <- predict(preprocess.validation.z,loans_validation[1:6])
#Predict Defaults using Cart
validationsetpreds.CART3.3x <- predict(fit.CART.3.3x,loans_validation.z)
View(validationsetpreds.CART3.3x)
Any help would be greatly appreciated :)
How would I apply this to the validation data set?
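One thing that may be worth checking, offered as a sketch rather than a diagnosis: the validation set above is centered and scaled with its own means and standard deviations, while the model was trained on data scaled with the training set's parameters. The base R example below uses made-up data (`train_df` and `valid_df` are hypothetical names, not the loan data) to show the idea of estimating the scaling on the training set and reusing it.

```r
# Hypothetical toy data standing in for the training and validation sets
set.seed(1)
train_df <- data.frame(amount = rnorm(100, mean = 50, sd = 10),
                       income = rnorm(100, mean = 1000, sd = 200))
valid_df <- data.frame(amount = rnorm(20, mean = 55, sd = 10),
                       income = rnorm(20, mean = 900, sd = 200))

# Estimate centering and scaling parameters on the training set only
train_means <- colMeans(train_df)
train_sds   <- apply(train_df, 2, sd)

# Apply the *training* parameters to both sets
train_z <- as.data.frame(scale(train_df, center = train_means, scale = train_sds))
valid_z <- as.data.frame(scale(valid_df, center = train_means, scale = train_sds))

round(colMeans(train_z), 10)  # zero by construction
colMeans(valid_z)             # generally not exactly zero
```

With caret, the analogous step would presumably be to pass the matching validation columns to `predict(preprocess.train.z, ...)` instead of fitting a separate `preProcess` object on the validation data.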

How to extract RMSE from models built using caret?

I have built a glm model using R package "caret" and I'd like to assess its performance using RMSE. I notice that the two RMSEs are different and I wonder which one is the real RMSE?
Also, how can I extract each fold (5*5=25 in total) of the training data, test data, and predicted data (based on the optimal tuned parameter) from the model?
library(caret)
data("mtcars")
set.seed(100)
mydata = mtcars[, -c(8,9)]
model_glm <- train(
hp ~ .,
data = mydata,
method = "glm",
metric = "RMSE",
preProcess = c('center', 'scale'),
trControl = trainControl(
method = "repeatedcv",
number = 5,
repeats = 5,
verboseIter = TRUE
)
)
GLM.pred = predict(model_glm, subset(mydata, select = -hp))
RMSE(pred = GLM.pred, obs = mydata$hp) # 21.89
model_glm$results$RMSE # 32.16
With the following code, I get :
sqrt(mean((mydata$hp - predict(model_glm)) ^ 2))
[1] 21.89127
This shows that "RMSE(pred = GLM.pred, obs = mydata$hp)" is the RMSE computed on the full data set the model was fitted to.
Also, you have
model_glm$resample$RMSE
[1] 28.30254 34.69966 25.55273 25.29981 40.78493 31.91056 25.05311 41.83223 26.68105 23.64629 27.98388 25.98827 45.26982 37.28214
[15] 38.13617 31.14513 23.35353 42.05274 34.04761 35.17733 28.28838 35.89639 21.42580 45.17860 29.13998
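which is the RMSE for each of the 25 resamples (5 folds x 5 repeats). Also, we have
mean(model_glm$resample$RMSE)
32.16515
So, the 32.16 is the average of the held-out RMSE over the 25 cross-validation resamples, and it estimates performance on unseen data. The 21.89 is the RMSE on the original dataset, i.e. on the same rows used for fitting, so it is optimistic.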
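The difference between the two numbers can be reproduced without caret. The following is a minimal base R sketch, assuming a plain linear model (`lm`) in place of caret's "glm" method and a single round of 5-fold cross-validation; the helper `rmse()` and the fold assignment are illustrative and will not match caret's resamples exactly.

```r
data("mtcars")
set.seed(100)
mydata <- mtcars[, -c(8, 9)]

# Illustrative helper: root mean squared error
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# RMSE on the same rows used for fitting (resubstitution)
full_fit <- lm(hp ~ ., data = mydata)
train_rmse <- rmse(mydata$hp, predict(full_fit, mydata))

# Manual 5-fold cross-validation: fit on 4 folds, score on the held-out fold
fold_id <- sample(rep(1:5, length.out = nrow(mydata)))
fold_rmse <- sapply(1:5, function(k) {
  fit <- lm(hp ~ ., data = mydata[fold_id != k, ])
  held_out <- mydata[fold_id == k, ]
  rmse(held_out$hp, predict(fit, held_out))
})
cv_rmse <- mean(fold_rmse)

train_rmse  # error on the data the model has already seen
fold_rmse   # one held-out error per fold
cv_rmse     # average held-out error
```

The held-out average is usually the larger of the two, because each fold is scored by a model that never saw it.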

train() function in r - SVM

I'm running a SVM in R with caret package. My entire df (named total, which includes train and test) are scaled numbers from 0 to 1. My Y is binary (0-1). All the variables have the class "num". Here is the code:
model_SVM <- train(
Y ~ ., training,
method = "svmPoly",
trControl = trainControl(
method = "cv", number = 10,
verboseIter = TRUE
)
)
summary(model_SVM)
model_SVM
SVMprediction <-predict(model_SVM, testing)
cmSVM <-confusionMatrix(SVMprediction,testing$Y) # ERROR
print(SVMprediction)
I got this first error on the line marked # ERROR:
> cmSVM <-confusionMatrix(SVMprediction,testing$Y)
Error: `data` and `reference` should be factors with the same levels.
It was solved by adding:
SVMprediction<-as.factor(SVMprediction)
testing$Y<-as.factor(testing$Y)
However, I now get a second error on the same line (marked # ERROR 2 in the full code below):
Error in confusionMatrix.default(SVMprediction, testing$Y) :
the data cannot have more levels than the reference
When I check the levels, SVMprediction has 361 levels and testing$Y has 2 levels. How did SVMprediction get 361 levels if Y had just two?
Thanks!
PS: The full code:
totalY <- total
total <- total%>%
select(-Y)
# Missing Values with MICE
mod_mice <- mice(data = total, m = 5,meth='cart')
total <- complete(mod_mice)
post_mv_var_top10 <- total
Y <- totalY$Y
total<-cbind(total,Y)
train_ <- total%>%
filter(is.na(Y)==FALSE)
test_ <- total%>%
filter(is.na(Y)==TRUE)
inTraining <- createDataPartition(train_$Y, p = .70, list = FALSE)
training <- train_[ inTraining,]
testing <- train_[-inTraining,]
# MODEL SVM
model_SVM <- train(
Y ~ ., training,
method = "svmPoly",
trControl = trainControl(
method = "cv", number = 10,
verboseIter = TRUE
)
)
summary(model_SVM)
SVMprediction <-predict(model_SVM, testing)
SVMprediction<-as.factor(SVMprediction)
testing$Y<-as.factor(testing$Y)
cmSVM <-confusionMatrix(SVMprediction,testing$Y) # ERROR 2
print(cmSVM)
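A plausible explanation, assuming Y was still numeric when train() was called: caret then fits a regression model, the predictions are continuous numbers, and as.factor() creates one level per distinct predicted value. The base R sketch below uses made-up vectors (`y_true` and `pred_numeric` are hypothetical stand-ins for testing$Y and SVMprediction) to show the effect and one way to get two matching levels.

```r
set.seed(42)
# Hypothetical stand-ins: a 0/1 outcome and continuous regression-style predictions
y_true <- rbinom(50, size = 1, prob = 0.5)
pred_numeric <- runif(50)

# Converting continuous predictions to a factor creates one level per distinct value
nlevels(as.factor(pred_numeric))  # many levels
nlevels(as.factor(y_true))        # at most 2

# Thresholding gives class labels; explicit levels keep both factors aligned
pred_class <- factor(ifelse(pred_numeric > 0.5, 1, 0), levels = c(0, 1))
y_factor   <- factor(y_true, levels = c(0, 1))
table(Predicted = pred_class, Actual = y_factor)
```

The cleaner route is probably to convert training$Y to a factor before calling train(), so that the model is fitted as a classifier and predict() returns class labels directly.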

How to create Random Forest from scratch in R (without the randomforest package)

This is how I currently use Random Forest with the randomForest package:
library (randomForest)
rf1 <- randomForest(CLA ~ ., dat, ntree=100, norm.votes=FALSE)
p1 <- predict(rf1, testing, type='response')
confMat_rf1 <- table(p1,testing_CLA$CLA)
accuracy_rf1 <- sum(diag(confMat_rf1))/sum(confMat_rf1)
I don't want to use the RandomForest Package at all. Given a dataset (dat) and using rpart and default values of randomforest package, how can I get the same results? For instance, for the 100 decision trees, I need to run the following:
for(i in 1:100){
cart.models[[i]]<-rpart(CLA~ ., data = random_dataset[[i]],cp=-1)
}
Here each random_dataset[[i]] would be a random sample of rows and attributes, with the default sizes used by randomForest. In addition, is rpart used by randomForest?
It is possible to simulate training a random forest by training multiple trees using rpart and bootstrap samples on the training set and the features of the training set.
The following code snippet trains 10 trees to classify the iris species and returns a list of trees with the out of bag accuracy on each tree.
library(rpart)
library(Metrics)
library(doParallel)
library(foreach)
library(ggplot2)
random_forest <- function(train_data, train_formula, method="class", feature_per=0.7, cp=0.01, min_split=20, min_bucket=round(min_split/3), max_depth=30, ntrees = 10) {
target_variable <- as.character(train_formula)[[2]]
features <- setdiff(colnames(train_data), target_variable)
n_features <- length(features)
ncores <- detectCores(logical=FALSE)
cl <- makeCluster(ncores)
registerDoParallel(cl)
rf_model <- foreach(
icount(ntrees),
.packages = c("rpart", "Metrics")
) %dopar% {
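# Sample a subset of features and a bootstrap sample of rows for this tree
bagged_features <- sample(features, floor(n_features * feature_per), replace = FALSE)
index_bag <- sample(nrow(train_data), replace=TRUE)
# Keep only the sampled features plus the target (assumes a "target ~ ." formula)
in_train_bag <- train_data[index_bag, c(bagged_features, target_variable)]
out_train_bag <- train_data[-index_bag,]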
trControl <- rpart.control(minsplit = min_split, minbucket = min_bucket, cp = cp, maxdepth = max_depth)
tree <- rpart(formula = train_formula,
data = in_train_bag,
control = trControl)
oob_pred <- predict(tree, newdata = out_train_bag, type = "class")
oob_acc <- accuracy(actual = out_train_bag[, target_variable], predicted = oob_pred)
list(tree=tree, oob_perf=oob_acc)
}
stopCluster(cl)
rf_model
}
train_formula <- as.formula("Species ~ .")
forest <- random_forest(train_data = iris, train_formula = train_formula)
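To turn a list of trees into forest-level predictions, the trees' class votes can be combined by majority vote. The sketch below is a standalone, sequential variant (no parallel backend, bootstrap over rows only) rather than a continuation of the function above; `n_trees`, `votes`, and `forest_pred` are illustrative names.

```r
library(rpart)

set.seed(1)
n_trees <- 25
# Grow each tree on a bootstrap sample of the rows
trees <- lapply(seq_len(n_trees), function(i) {
  idx <- sample(nrow(iris), replace = TRUE)
  rpart(Species ~ ., data = iris[idx, ], method = "class")
})

# Collect each tree's class prediction: one row per observation, one column per tree
votes <- sapply(trees, function(tr) {
  as.character(predict(tr, newdata = iris, type = "class"))
})

# Majority vote across trees for every observation
majority <- apply(votes, 1, function(v) names(which.max(table(v))))
forest_pred <- factor(majority, levels = levels(iris$Species))

mean(forest_pred == iris$Species)  # accuracy on the training rows (optimistic)
```

Note that the randomForest package does not call rpart; it uses its own tree-growing code and samples candidate features at every split (the mtry parameter) rather than once per tree, so results from an rpart-based imitation will be similar in spirit but not identical.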

predict() R function caret package errors: "newdata" rows different, "type" not accepted

I am running a logistic regression analysis using the caret package.
The data is input as an 18x6 matrix.
Everything works so far except the predict() function.
R tells me the type parameter must be "raw" or "prob", but "raw" just returns an exact copy of the last column (the values of the binomial variable), and "prob" gives me the following error:
"Error in dimnames(out)[[2]] <- modelFit$obsLevels :
length of 'dimnames' [2] not equal to array extent
In addition: Warning message:
'newdata' had 7 rows but variables found have 18 rows"
install.packages("pbkrtest")
install.packages("caret")
install.packages('e1071', dependencies=TRUE)
#install.packages('caret', dependencies = TRUE)
require(caret)
library(caret)
A=matrix(
c(
64830,18213,4677,24761,9845,17504,22137,12531,5842,28827,51840,4079,1000,2069,969,9173,11646,946,66161,18852,5581,27219,10159,17527,23402,11409,8115,31425,55993,0,0,1890,1430,7873,12779,627,68426,18274,5513,25687,10971,14104,19604,13438,6011,30055,57242,0,0,2190,1509,8434,10492,755,69716,18366,5735,26556,11733,16605,20644,15516,5750,31116,64330,0,0,1850,1679,9233,12000,500,73128,18906,5759,28555,11951,19810,22086,17425,6152,28469,72020,0,0,1400,1750,8599,12000,500,1,1,1,0,1,0,0,0,0,1,0,1,1,1,1,1,1,1
),
nrow = 18,
ncol = 6,
byrow = FALSE) #"bycol" does NOT exist
################### data set as vectors
a<-c(64830,18213,4677,24761,9845,17504,22137,12531,5842,28827,51840,4079,1000,2069,969,9173,11646,946)
b<-c(66161,18852,5581,27219,10159,17527,23402,11409,8115,31425,55993,0,0,1890,1430,7873,12779,627)
c<-c(68426,18274,5513,25687,10971,14104,19604,13438,6011,30055,57242,0,0,2190,1509,8434,10492,755)
d<-c(69716,18366,5735,26556,11733,16605,20644,15516,5750,31116,64330,0,0,1850,1679,9233,12000,500)
e<-c(73128,18906,5759,28555,11951,19810,22086,17425,6152,28469,72020,0,0,1400,1750,8599,12000,500)
f<-c(1,1,1,0,1,0,0,0,0,1,0,1,1,1,1,1,1,1)
######################
n<-nrow(A);
K<-ncol(A)-1;
Train <- createDataPartition(f, p=0.6, list=FALSE) #60% of data set is used as training.
training <- A[ Train, ]
testing <- A[ -Train, ]
nrow(training)
#this is the logistic formula:
#estimates from logistic regression characterize the relationship between the predictor and response variable on a log-odds scale
mod_fit <- train(f ~ a + b + c + d +e, data=training, method="glm", family="binomial")
mod_fit
#this is the exponential function to calculate the odds ratios for each predictor:
exp(coef(mod_fit$finalModel))
predict(mod_fit, newdata=training)
predict(mod_fit, newdata=testing, type="prob")
I'm not sure I fully understand, but A already contains the columns (a,b,c,d,e,f), so you don't need to create both the matrix and the separate vectors.
install.packages("pbkrtest")
install.packages("caret")
install.packages('e1071', dependencies=TRUE)
#install.packages('caret', dependencies = TRUE)
require(caret)
library(caret)
A=matrix(
c(
64830,18213,4677,24761,9845,17504,22137,12531,5842,28827,51840,4079,1000,2069,969,9173,11646,946,66161,18852,5581,27219,10159,17527,23402,11409,8115,31425,55993,0,0,1890,1430,7873,12779,627,68426,18274,5513,25687,10971,14104,19604,13438,6011,30055,57242,0,0,2190,1509,8434,10492,755,69716,18366,5735,26556,11733,16605,20644,15516,5750,31116,64330,0,0,1850,1679,9233,12000,500,73128,18906,5759,28555,11951,19810,22086,17425,6152,28469,72020,0,0,1400,1750,8599,12000,500,1,1,1,0,1,0,0,0,0,1,0,1,1,1,1,1,1,1
),
nrow = 18,
ncol = 6,
byrow = FALSE) #"bycol" does NOT exist
A <- data.frame(A)
colnames(A) <- c('a','b','c','d','e','f')
A$f <- as.factor(A$f)
Train <- createDataPartition(A$f, p=0.6, list=FALSE) #60% of data set is used as training.
training <- A[ Train, ]
testing <- A[ -Train, ]
nrow(training)
To predict, pass only the explanatory variables as newdata, not the variable being predicted:
mod_fit <- train(f ~ a + b + c + d +e, data=training, method="glm", family="binomial")
mod_fit
#this is the exponential function to calculate the odds ratios for each predictor:
exp(coef(mod_fit$finalModel))
predict(mod_fit, newdata=training[,-which(colnames(training)=="f")])
predict(mod_fit, newdata=testing[,-which(colnames(testing)=="f")])
Short answer: you should not include the explained variable, which is f, in the data passed to predict. So you should do:
predict(mod_fit, newdata=training[, -ncol(training)])
predict(mod_fit, newdata=testing[, -ncol(testing)])
The warning message 'newdata' had 11 rows but variables found have 18 rows appears because the regression was run on the whole data set (18 observations), while the prediction uses only part of it (either 11 or 7 rows).
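The mechanism behind this kind of warning can be reproduced in base R. In the sketch below (made-up vectors `x` and `y`, with `lm` standing in for the caret model), the formula refers to variables in the workspace rather than to columns of a data frame, so predict() cannot find them in newdata and falls back to the full-length workspace copies.

```r
set.seed(7)
# Variables living in the workspace, not in a data frame
x <- 1:18
y <- 2 * x + rnorm(18)
fit <- lm(y ~ x)  # no data argument: the formula refers to the workspace x and y

# newdata has 7 rows but no column named x, so predict() falls back to the 18-value x
newd <- data.frame(z = 1:7)
msgs <- character(0)
p <- withCallingHandlers(
  predict(fit, newdata = newd),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))  # record the warning text
    invokeRestart("muffleWarning")
  }
)
length(p)  # 18, not 7
msgs       # the "'newdata' had 7 rows but variables found have 18 rows" warning

# Fitting and predicting from data frames with matching column names avoids this
df <- data.frame(x = x, y = y)
fit_df <- lm(y ~ x, data = df[1:11, ])
length(predict(fit_df, newdata = df[12:18, ]))  # 7
```

In the question, a to f exist as separate 18-element vectors alongside the matrix A, which is consistent with this behaviour; keeping everything in one data frame with named columns sidesteps it.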
EDIT: To simplify the data creation and glm processes we can do:
library(caret)
A <- data.frame(a = c(64830,18213,4677,24761,9845,17504,22137,12531,5842,28827,51840,4079,1000,2069,969,9173,11646,946),
b = c(66161,18852,5581,27219,10159,17527,23402,11409,8115,31425,55993,0,0,1890,1430,7873,12779,627),
c = c(68426,18274,5513,25687,10971,14104,19604,13438,6011,30055,57242,0,0,2190,1509,8434,10492,755),
d = c(69716,18366,5735,26556,11733,16605,20644,15516,5750,31116,64330,0,0,1850,1679,9233,12000,500),
e = c(73128,18906,5759,28555,11951,19810,22086,17425,6152,28469,72020,0,0,1400,1750,8599,12000,500),
f = c(1,1,1,0,1,0,0,0,0,1,0,1,1,1,1,1,1,1))
A$f <- as.factor(A$f) # classification needs a factor outcome
Train <- createDataPartition(A$f, p=0.6, list=FALSE) #60% of data set is used as training.
training <- A[ Train, ]
testing <- A[ -Train, ]
mod_fit <- train(f ~ a + b + c + d + e, data=training, method="glm", family="binomial")
I am trying to run a logistic regression model. I wrote this code:
install.packages('caret')
library(caret)
setwd('C:\\Users\\BAHOZ\\Documents\\')
D<-read.csv(file = "D.csv",header = T)
D<-read.csv(file = 'DataSet.csv',header=T)
names(D)
set.seed(111134)
Train<-createDataPartition(D$X, p=0.7,list = FALSE)
training<-D[Train,]
length(training$age)
testing<-D[-Train,]
length(testing$age)
mod_fit<-train(X~age + gender + total.Bilirubin + direct.Bilirubin + total.proteins + albumin + A.G.ratio+SGPT + SGOT + Alkphos,data=training,method="glm", family="binomial")
summary(mod_fit)
exp(coef(mod_fit$finalModel))
And I received this output for the last command:
(Intercept) age gender total.Bilirubin direct.Bilirubin total.proteins albumin A.G.ratio
0.01475027 1.01596886 1.03857883 1.00022899 1.78188072 1.00065332 1.01380334 1.00115742
SGPT SGOT Alkphos
3.93498241 0.05616662 38.29760014
By running this command I could predict my data,
predict(mod_fit , newdata=testing)
But if I set type="prob" or type="raw"
predict(mod_fit , newdata=testing, type = "prob")
it fails with this error:
Error in dimnames(out) <- *vtmp* :
length of 'dimnames' [2] not equal to array extent
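A likely cause, assuming X is stored as a numeric 0/1 column: caret then treats the problem as regression, and type = "prob" is only meaningful for classification models, so converting X to a factor before calling train() may resolve it. Independently of caret, class probabilities from a logistic regression can be obtained in base R as sketched below; the data frame `d` and its columns are made up for illustration.

```r
set.seed(3)
# Hypothetical data: a binary outcome driven by one predictor
d <- data.frame(age = round(runif(200, 20, 80)))
d$X <- factor(rbinom(200, 1, plogis((d$age - 50) / 10)),
              levels = c(0, 1), labels = c("no", "yes"))

# Logistic regression with a factor outcome
fit <- glm(X ~ age, data = d, family = binomial)

# type = "response" returns the probability of the second factor level ("yes")
prob_yes <- predict(fit, newdata = d, type = "response")
pred_class <- factor(ifelse(prob_yes > 0.5, "yes", "no"), levels = levels(d$X))

range(prob_yes)
table(Predicted = pred_class, Actual = d$X)
```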
