Foreach Parallel RandomForest Predicting Error - r

I am trying to run randomForest in multicore mode using the foreach function. Fitting the trees seems to work, but calling predict on the resulting model gives the following error message:
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "call"
It seems that foreach returns a simple list instead of a proper randomForest model.
Here is the complete code I am trying to run:
# sample from 1 to k, nrow times (the number of observations in the data)
labeled_data <- bundesliga[bundesliga$Season<2017,]
labeled_data$id <- sample(1:k, nrow(labeled_data), replace = TRUE)
list <- 1:k
# prediction and testset data frames that we add to with each iteration over
# the folds
#Creating a progress bar to know the status of CV
progress.bar <- create_progress_bar("text")
progress.bar$init(k)
prediction <- data.frame()
testsetCopy <- data.frame()
accuracy <- list()
rf.formula <- as.formula(paste("as.factor(FTR)","~",paste("AvgAgeHome",
"AvgAge_Away",
"AvgMarketValueHome_z_score",
"AvgMarketValue_Away_z_score",
"ForeignPlayersHome",
"ForeignPlayers_Away",
"KaderHome",
"Kader_Away",
"no_won_matches_last_20_home",
"no_won_matches_last_20_away",
"no_won_matches_last_15_home",
"no_won_matches_last_15_away",
"no_won_matches_last_10_home",
"no_won_matches_last_10_away",
"no_won_matches_last_5_home",
"no_won_matches_last_5_away",
"no_won_matches_last_3_home",
"no_won_matches_last_3_away",
"no_won_matches_last_2_home",
"no_won_matches_last_2_away",
"won_last_1_matches_away",
"won_last_1_matches_home",
"OverallMarketValueHome_z_score",
"OverallMarketValue_Away_z_score",
"roll_FTHG_Home",
"roll_FTAG_Away",
"Stadium.Capacity.y",
"WDL_3_roll_matches_away",
"WDL_3_roll_matches_home",
"WDL_2_roll_matches_home",
"WDL_2_roll_matches_away",
"WDL_1_roll_matches_home",
"WDL_1_roll_matches_away",sep="+")))
for (i in 1:k){
# remove rows with id i from dataframe to create training set
# select rows with id i to create test set
trainingset <- subset(labeled_data, id %in% list[-i])
testset <- subset(labeled_data, id %in% c(i))
#run a random forest model
rf <- foreach(ntree=rep(1, 8),
.combine=combine,.packages='randomForest') %dopar% {
environment(rf.formula) <- environment()
randomForest(rf.formula,data=trainingset, ntree=ntree)
}
print(class(rf))
# remove response column 1
pred <- predict(rf, testset[,-1])
temp <- as.data.frame(pred)
match_test_pred <- cbind(as.data.frame(testset),temp)
accuracy_fold <- sum(match_test_pred$Correct)/nrow(match_test_pred)
accuracy <- rbind(accuracy,accuracy_fold)
# append this iteration's predictions to the end of the prediction data frame
prediction <- rbind(prediction, temp)
# append this iteration's test set to the test set copy data frame
# keep only the FTR column
testsetCopy <- rbind(testsetCopy, as.data.frame(testset$FTR))
print(confusionMatrix(pred,testset$FTR))
progress.bar$step()
}
Thanks in advance for your help!
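One thing worth ruling out, offered as an assumption rather than a confirmed diagnosis: the first element of a randomForest object is its call, so a result of class "call" suggests the per-worker forests were not merged by randomForest's own combine(). That can happen when another attached package defines a function with the same name. The sketch below uses the built-in iris data instead of the Bundesliga data, and %do% instead of %dopar% so that it runs without a parallel backend; the only substantive change is that the combiner is named explicitly.

```r
library(randomForest)
library(foreach)

set.seed(1)

# Grow four small forests and merge them with randomForest's own combine().
# Naming the namespace avoids picking up a same-named function from
# another attached package.
rf <- foreach(ntree = rep(25, 4), .combine = randomForest::combine) %do% {
  randomForest(Species ~ ., data = iris, ntree = ntree)
}

class(rf)                          # should include "randomForest"
pred <- predict(rf, iris[1:5, ])   # predict now dispatches on the forest
```

With a registered backend (for example via doParallel), %dopar% and .packages = 'randomForest' can be used as in the question.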

Related

Why is the error rate from bagging trees much higher than that from a single tree?

I cross-posted this question on Cross Validated, but it seemed unlikely to receive an answer there, so I am posting it here as well.
I'm running the classification method bagging trees (bootstrap aggregation) and comparing its misclassification error rate with that of a single tree. We expect bagging to do better than a single tree, i.e. the error rate from bagging should be lower than that of the single tree.
I repeat the whole procedure M = 100 times (each time splitting randomly the original data set into a training set and a test set) to obtain 100 test errors and bagging test errors (use a for loop). Then I use boxplots to compare the distributions of these two types of errors.
# Loading package and data
library(rpart)
library(boot)
library(mlbench)
data(PimaIndiansDiabetes)
# Initialization
n <- 768
ntrain <- 468
ntest <- 300
B <- 100
M <- 100
single.tree.error <- vector(length = M)
bagging.error <- vector(length = M)
# Define statistic
estim.pred <- function(a.sample, vector.of.indices)
{
current.train <- a.sample[vector.of.indices, ]
current.fitted.model <- rpart(diabetes ~ ., data = current.train, method = "class")
predict(current.fitted.model, test.set, type = "class")
}
for (j in 1:M)
{
# Split the data into test/train sets
train.idx <- sample(1:n, ntrain, replace = FALSE)
train.set <- PimaIndiansDiabetes[train.idx, ]
test.set <- PimaIndiansDiabetes[-train.idx, ]
# Train a direct tree model
fitted.tree <- rpart(diabetes ~ ., data = train.set, method = "class")
pred.test <- predict(fitted.tree, test.set, type = "class")
single.tree.error[j] <- mean(pred.test != test.set$diabetes)
# Bootstrap estimates
res.boot = boot(train.set, estim.pred, B)
pred.boot <- vector(length = ntest)
for (i in 1:ntest)
{
pred.boot[i] <- ifelse (mean(res.boot$t[, i] == "pos") >= 0.5, "pos", "neg")
}
bagging.error[j] <- mean(pred.boot != test.set$diabetes)
}
boxplot(single.tree.error, bagging.error, ylab = "Misclassification errors", names = c("single.tree", "bagging"))
The result is a boxplot of the two error distributions (image not included).
Could you please explain why the error rate for bagging trees is much higher than that of a single tree? I feel that this does not make sense. I've checked my code but could not find anything unusual.
I've received an answer at https://stats.stackexchange.com/questions/452882/why-is-the-error-rate-from-bagging-trees-much-higher-than-that-from-a-single-tre. I am posting the link here to close this question and for future visitors.
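One detail in the code above is worth checking, and the following is a sketch of it rather than a summary of the linked answer. estim.pred returns a factor, and my understanding is that boot() stores the replicate results in a plain matrix. Assigning a factor into a matrix keeps only its integer codes, so a comparison against "pos" is never true and every test case is voted "neg". The variable names below are hypothetical stand-ins and only base R is used.

```r
# Stand-in for the class predictions returned by one bootstrap replicate.
pred <- factor(c("neg", "pos", "pos"), levels = c("neg", "pos"))

# A matrix pre-filled with NA and then filled row by row (an assumption
# about how the replicate results are stored).
t_mat <- matrix(NA, nrow = 2, ncol = 3)
t_mat[1, ] <- pred
t_mat[2, ] <- pred

# The stored values are the integer codes 1 and 2, so nothing equals "pos".
share_pos_wrong <- colMeans(t_mat == "pos")

# Comparing against the integer code of the "pos" level recovers the vote.
pos_code <- which(levels(pred) == "pos")
share_pos_right <- colMeans(t_mat == pos_code)
vote <- ifelse(share_pos_right >= 0.5, "pos", "neg")
```

If this is what happens in the loop, the "bagging" error is simply the share of "pos" cases in each test set, which would explain why it sits well above the single-tree error.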

I am trying to run XGBoost in R but am facing some issues

I have a dataset of 25 variables and 248 rows.
There are 8 factor variables and the rest are integers and numbers.
I am trying to run XGBoost.
I have written the following code:
# Partition Data
set.seed(1234)
ind <- sample(2, nrow(mission), replace = T, prob = c(0.7,0.3))
train <- mission[ind == 1,]
test <- mission[ind == 2,]
# Create matrix - One-Hot Encoding for Factor variables
trainm <- sparse.model.matrix(GRL ~ .-1, data = train)
head(trainm)
train_label <- train[,"GRL"]
train_matrix <- xgb.DMatrix(data = as.matrix(trainm), label = train_label)
testm <- sparse.model.matrix(GRL~.-1, data = test)
test_label <- test[,"GRL"]
test_matrix <- xgb.DMatrix(data = as.matrix(testm),label = test_label)
The response variable here is "GRL", and I extract the test labels with test_label <- test[,"GRL"].
The above code runs, but when I try to use the labels in xgb.DMatrix, I get the following error:
Error in setinfo.xgb.DMatrix(dmat, names(p), p[[1]]) :
The length of labels must equal to the number of rows in the input data
I have partitioned the data into 70:30.
If mission is a tibble rather than a base data.frame, test[,"GRL"] returns a one-column data frame, and XGBoost needs the label to be a vector.
Just use test$GRL or test[["GRL"]] instead. You also need to do the same for the training dataset.
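To see why a one-column data frame trips the length check, here is a minimal base R sketch. The data frame is made up, and drop = FALSE is used only to mimic the tibble-style behaviour where single-column selection stays a data frame.

```r
# Hypothetical stand-in for the test set.
test_df <- data.frame(GRL = c(0, 1, 1, 0), x = c(2.5, 3.1, 4.0, 1.7))

# Stays a data frame (what a tibble does for test[, "GRL"]).
label_df <- test_df[, "GRL", drop = FALSE]

# Plain numeric vector, which is what a label argument expects.
label_vec <- test_df[["GRL"]]

length(label_df)   # number of columns, not number of rows
length(label_vec)  # number of rows
```

The length of a data frame is its column count, so a label passed this way appears to have length 1 regardless of how many rows the data has.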

Trying to create confusion matrix from cross-validated results using the best value of k in R

I have written the k-NN cross-validation method below using the iris dataset in R. How would I get the best value of k from this and create a confusion matrix based on it? Any help would be great.
library(class)
data("iris")
kfolds = 5
iris$folds = cut(seq(1,nrow(iris)),breaks=kfolds,labels=FALSE)
iris$folds
# Sets the columns to use as predictors
pred = c("Petal.Width", "Petal.Length")
accuracies = c()
ks = c(1,3,5,7,9,11,13,15)
for (k in ks) {
k.accuracies = c()
for(i in 1:kfolds) {
# Builds the training set and test set for this fold.
train.items.this.fold = iris[iris$folds != i,]
validation.items.this.fold = iris[iris$folds == i,]
# Fit knn model on this fold.
predictions = knn(train.items.this.fold[,pred],
validation.items.this.fold[,pred],
train.items.this.fold$Species, k=k)
predictions.table <- table(predictions, validation.items.this.fold$Species)
# Work out the amount of correct and incorrect predictions.
correct.list <- predictions == validation.items.this.fold$Species
nr.correct = nrow(validation.items.this.fold[correct.list,])
# Get accuracy rate of cv.
accuracy.rate = nr.correct/nrow(validation.items.this.fold)
# Adds the accuracy list.
k.accuracies <- cbind(k.accuracies, accuracy.rate)
}
# Adds the mean accuracy to the total accuracy list.
accuracies <- cbind(accuracies, mean(k.accuracies))
}
# Accuracy for each value of k: visualisation.
accuracies
Update:
predictions.table <- table(predictions == ks[which.max(accuracies)], validation.items.this.fold$Species)
Your code has some problems; this version runs:
library(class)
data("iris")
kfolds = 5
iris$folds = cut(seq(1,nrow(iris)),breaks=kfolds,labels=FALSE)
iris$folds
# Sets the columns to use as predictors
pred = c("Petal.Width", "Petal.Length")
accuracies = c()
ks = c(1,3,5,7,9,11,13,15)
k.accuracies = c()
predictions.list = list()
for (k in ks) {
k.accuracies = c()
for(i in 1:kfolds) {
# Builds the training set and test set for this fold.
train.items.this.fold = iris[iris$folds != i,]
validation.items.this.fold = iris[iris$folds == i,]
# Fit knn model on this fold.
predictions = knn(train.items.this.fold[,pred],
validation.items.this.fold[,pred],
train.items.this.fold$Species, k=k)
predictions.list[[i]] = predictions
predictions.table <- table(predictions, validation.items.this.fold$Species)
# Work out the amount of correct and incorrect predictions.
correct.list <- predictions == validation.items.this.fold$Species
nr.correct = nrow(validation.items.this.fold[correct.list,])
# Get accuracy rate of cv.
accuracy.rate = nr.correct/nrow(validation.items.this.fold)
# Adds the accuracy list.
k.accuracies <- cbind(k.accuracies, accuracy.rate)
}
# Adds the mean accuracy to the total accuracy list.
accuracies <- cbind(accuracies, mean(k.accuracies))
}
accuracies
predictions.table <- table(predictions.list[[which.max(accuracies)]], validation.items.this.fold$Species)
The call predictions.table <- table(predictions, validation.items.this.fold$Species) already produces a confusion matrix. Since you are using accuracy as the evaluation metric, the best k is the one with the highest mean accuracy. You can get it like this:
ks[which.max(accuracies)]
UPDATE
I create a list to store the predictions and then build the confusion matrix from the entry with the best accuracy.
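An alternative sketch, not the answer's exact method: keep the out-of-fold prediction for every row and every candidate k, pick the k with the highest overall accuracy, and tabulate that column against the true species. This gives a confusion matrix over all 150 rows rather than over the last fold only. The shuffled fold assignment and the names oof, folds and best_k are my own assumptions.

```r
library(class)  # provides knn()

data(iris)
set.seed(1)
kfolds <- 5
# Shuffled fold labels so each fold mixes species (an assumption; the
# original code assigns folds in row order).
folds <- sample(rep(seq_len(kfolds), length.out = nrow(iris)))
pred_cols <- c("Petal.Width", "Petal.Length")
ks <- c(1, 3, 5, 7, 9, 11, 13, 15)

# One column of out-of-fold predictions per candidate k.
oof <- matrix(NA_character_, nrow = nrow(iris), ncol = length(ks))
for (j in seq_along(ks)) {
  for (i in seq_len(kfolds)) {
    held_out <- folds == i
    fit <- knn(iris[!held_out, pred_cols], iris[held_out, pred_cols],
               iris$Species[!held_out], k = ks[j])
    oof[held_out, j] <- as.character(fit)
  }
}

# Accuracy per k over all rows, then the confusion matrix for the best k.
accuracies <- colMeans(oof == as.character(iris$Species))
best <- which.max(accuracies)
best_k <- ks[best]
confusion <- table(predicted = factor(oof[, best], levels = levels(iris$Species)),
                   actual = iris$Species)
```

Because every row is held out exactly once, each column of oof is a complete set of cross-validated predictions for one value of k.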

How to predict test data using a GAM with MRF smooth and neighborhood structure?

I am having a problem using the predict() function for a mgcv::gam (training) model on a new (testing) dataset. The problem arises due to a mrf smooth I have integrated to account for the spatial nature of my data.
I use the following call to create my GAM model
## Run GAM with MRF
m <- gam(crime ~ s(district,k=nrow(traindata),
bs ='mrf',xt=list(nb=nbtrain)), #define MRF smooth
data = traindata,
method = 'REML',
family = scat(), #fit scaled t distribution
gamma = 1.4
)
where I predict the dependent variable crime using the neighbourhood structure, passed into the model via the smooth term argument xt. The neighbourhood structure is an nb object that I created using the poly2nb() function.
Now, if I want to use predict() on a new testing dataset, I don't know how to pass the corresponding neighbourhood structure into the call. Providing just the new data
pred <- predict.gam(m,newdata=testdata)
throws the following error:
Error in predict.gam(m, newdata = testdata) :
7, 16, 20, 28, 35, 36, 37, 43 not in original fit
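The message lists district labels that appear in the new data but were not among the factor levels seen when the model was fitted. A minimal base R sketch of that check follows, using made-up district labels rather than the Columbus data; with one row per district, every held-out district is necessarily unseen by the training fit.

```r
set.seed(42)

# Hypothetical stand-in: one row per district, labelled 0 to 48.
district <- factor(0:48)
in_train <- sample(c(TRUE, FALSE), length(district),
                   replace = TRUE, prob = c(0.8, 0.2))

# Levels that survive droplevels() in each subset.
train_levels <- levels(droplevels(district[in_train]))
test_levels <- levels(droplevels(district[!in_train]))

# Districts predict() would be asked about but the fit never saw.
unseen <- setdiff(test_levels, train_levels)
```

Running the same setdiff() on the levels of traindata$district and testdata$district should reproduce the list in the error message.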
Here's a full reproduction of the error using the Columbus dataset called from within R directly:
#ERROR REPRODUCTION
## Load packages
require(mgcv)
require(spdep)
require(dplyr)
## Load Columbus Ohio crime data (see ?columbus for details and credits)
data(columb.polys) #Columbus district shapes list
columb.polys <- lapply(columb.polys,na.omit) #omit NAs (unfortunate problem with the Columbus sample data)
data(columb) #Columbus data frame
df <- data.frame(district=numeric(0),x=numeric(0),y= numeric(0)) #Create empty df to store x, y and IDs for each polygon
## Extract x and y coordinates from each polygon and assign district ID
for (i in 1:length(columb.polys)) {
district <- i-1
x <- columb.polys[[i]][,1]
y <- columb.polys[[i]][,2]
df <- rbind(df,cbind(district,x,y)) #Save in df data.frame
}
## Convert df into SpatialPolygons
sp <- df %>%
group_by(district) %>%
do(poly=select(., x, y) %>%Polygon()) %>%
rowwise() %>%
do(polys=Polygons(list(.$poly),.$district)) %>%
{SpatialPolygons(.$polys)}
## Merge SpatialPolygons with data
spdf <- SpatialPolygonsDataFrame(sp,columb)
## Split into training and test sample (80/20 ratio)
splt <- sample(1:2,size=nrow(spdf),replace=TRUE,prob=c(0.8,0.2))
train <- spdf[splt==1,]
test <- spdf[splt==2,]
## Prepare both samples and create NB objects
traindata <- train@data #Extract data from SpatialPolygonsDataFrame
testdata <- test@data
traindata <- droplevels(as(train, 'data.frame')) #Drop levels
testdata <- droplevels(as(test, 'data.frame'))
traindata$district <- as.factor(traindata$district) #Factorize
testdata$district <- as.factor(testdata$district)
nbtrain <- poly2nb(train, row.names=train$Precinct, queen=FALSE) #Create NB objects for training and test sample
nbtest <- poly2nb(test, row.names=test$Precinct, queen=FALSE)
names(nbtrain) <- attr(nbtrain, "region.id") #Set region.id
names(nbtest) <- attr(nbtest, "region.id")
## Run GAM with MRF
m <- gam(crime ~ s(district, k=nrow(traindata), bs = 'mrf',xt = list(nb = nbtrain)), # define MRF smooth
data = traindata,
method = 'REML', # fast version of REML smoothness selection; alternatively 'GCV.Cp'
family = scat(), #fit scaled t distribution
gamma = 1.4
)
## Run prediction using new testing data
pred <- predict.gam(m,newdata=testdata)
SOLUTION:
I finally found the time to update this post with the solution. Thanks to everyone for helping me out. Here is the code for implementing k-fold CV with a random training-testing split:
#Apply k-fold cross validation
mses <- data.frame() #Create empty df to store CV squared error values
scores <- data.frame() #Create empty df to store CV R2 values
set.seed(42) #Set seed for reproducibility
k <- 10 #Define number of folds
for (i in 1:k) {
# Create weighting column
data$weight <- sample(c(0,1),size=nrow(data),replace=TRUE,prob=c(0.2,0.8)) #0 Indicates testing sample, 1 training sample
#Run GAM with MRF
ctrl <- gam.control(nthreads = 6) #Set controls
m <- gam(crime ~ s(district, k=nrow(data), bs = 'mrf',xt = list(nb = nb)), #define MRF smooth
data = data,
weights = data$weight, #Use only weight==1 observations (training)
method = 'REML',
control = ctrl,
family = scat(),
gamma = 1.4
)
#Generate test dataset
testdata <- data[data$weight==0,] #Select test data by weight
#Predict test data
pred <- predict(m,newdata=testdata)
#Extract MSES
mses[i,1] <- mean((data$R_MeanDiff[data$weight==0] - pred)^2)
scores[i,1] <- summary(m)$r.sq
}
av.mse.GMRF <- mean(mses$V1)
av.r2.GMRF <- mean(scores$V1)
I have one criticism of the current solution: the full dataset was used to "train" the model, so the predictions are going to be biased because the test data was also used to train it.
This only requires a couple minor tweaks to fix:
#Apply k-fold cross validation
mses <- data.frame() #Create empty df to store CV squared error values
scores <- data.frame() #Create empty df to store CV R2 values
set.seed(42) #Set seed for reproducibility
k <- 10 #Define number of folds
#For loop for each fold
for (i in 1:k) {
# Create weighting column
data$weight <- sample(c(0,1),size=nrow(data),replace=TRUE,prob=c(0.2,0.8)) #0 Indicates testing sample, 1 training sample
#Generate training dataset
trainingdata <- data[data$weight == 1, ] #Select training data by weight
#Generate test dataset
testdata <- data[data$weight == 0, ] #Select test data by weight
#Run GAM with MRF
ctrl <- gam.control(nthreads = 6) #Set controls
m <- gam(crime ~ s(district, k=nrow(data), bs = 'mrf',xt = list(nb = nb)), #define MRF smooth
data = trainingdata,
weights = trainingdata$weight, #Weights must match the rows of trainingdata (all 1 here)
method = 'REML',
control = ctrl,
family = scat(),
gamma = 1.4
)
#Predict test data
pred <- predict(m,newdata = testdata)
#Extract MSES
mses[i,1] <- mean((data$R_MeanDiff[data$weight==0] - pred)^2)
scores[i,1] <- summary(m)$r.sq
}
#Get average scores from each k-fold test
av.mse.GMRF <- mean(mses$V1)
av.r2.GMRF <- mean(scores$V1)

Predict warning-----new data rows <> variable rows

I'm a beginner in R.
I tried to build a model using part of the samples and to predict the response using the remaining samples. But when I use predict(), I get a warning message:
'newdata' had 152 rows but variables found have 354 rows
I have searched for answers, but I still can't understand the problem. Please help.
library(MASS)
data(Boston)
n <- nrow(Boston)
n_train <- round(.70*n)
train_set <- sample(n,size=n_train,replace = FALSE)
x <- cbind(Boston$lstat,log(Boston$lstat))
y <- Boston$medv
x_train <- x[train_set,]
y_train <- y[train_set]
x_test <- x[-train_set,]
y_test <- y[-train_set]
lm_temp <- lm(y_train~x_train)
y_test_hat <- predict(lm_temp,newdata=data.frame(x_test))
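It looks like R is getting confused when you pass a matrix as the independent variables: the model is fit on a variable named x_train, but data.frame(x_test) has no column with that name, so predict() falls back to the original x_train (354 rows) instead of the new data.
You can solve the problem by running your lm on a data frame: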
library(MASS)
data(Boston)
n <- nrow(Boston)
n_train <- round(.70*n)
train_set <- sample(n,size=n_train,replace = FALSE)
data <- Boston[ , c('medv', 'lstat')]
data$loglstat <- log(data$lstat)
train <- data[train_set, ]
test <- data[-train_set,]
lm_temp <- lm(medv ~ ., data = train)
y_test_hat <- predict(lm_temp,newdata=test)
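The name mismatch can be checked directly. The sketch below swaps in the built-in mtcars data for Boston (an assumption made so that it runs in base R alone) and compares the term name stored in the matrix-based model with the column names predict() would look for in newdata.

```r
set.seed(1)
train_idx <- sample(nrow(mtcars), 22)

# Matrix predictors, as in the question: the model term is literally "x_train".
x <- cbind(mtcars$wt, log(mtcars$wt))
x_train <- x[train_idx, ]
y_train <- mtcars$mpg[train_idx]
x_test <- x[-train_idx, ]
fit_matrix <- lm(y_train ~ x_train)
model_term <- attr(terms(fit_matrix), "term.labels")

# data.frame() names the columns X1 and X2, so "x_train" is not in newdata.
newdata_names <- names(data.frame(x_test))

# Fitting on named data frame columns lets predict() match them in newdata.
dat <- data.frame(mpg = mtcars$mpg, wt = mtcars$wt, log_wt = log(mtcars$wt))
fit_df <- lm(mpg ~ wt + log_wt, data = dat[train_idx, ])
pred <- predict(fit_df, newdata = dat[-train_idx, ])
```

With the data frame version, the number of predictions matches the number of test rows and no warning is raised.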

Resources