Combining random forests built with different training sets in R - r

I am new to R (day 2) and have been tasked with building a forest of random forests. Each individual random forest will be built using a different training set and we will combine all the forests at the end to make predictions. I am implementing this in R and am having some difficulty combining two forests not built using the same set. My attempt is as follows:
d1 = read.csv("../data/rr/train/10/chunk0.csv",header=TRUE)
d2 = read.csv("../data/rr/train/10/chunk1.csv",header=TRUE)
rf1 = randomForest(A55~., data=d1, ntree=10)
rf2 = randomForest(A55~., data=d2, ntree=10)
rf = combine(rf1,rf2)
This of course produces an error:
Error in rf$votes + ifelse(is.na(rflist[[i]]$votes), 0, rflist[[i]]$votes) :
non-conformable arrays
In addition: Warning message:
In rf$oob.times + rflist[[i]]$oob.times :
longer object length is not a multiple of shorter object length
I have been browsing the web for some time looking at a clue for this but haven't had any success yet. Any help here would be most appreciated.

Ah. This is either an oversight in combine or what you're trying to do is nonsensical, depending on your point of view.
The votes matrix records the number of votes in the forest for each case in the training data for each response category. Naturally, it will have the same number of rows as the number of rows in your training data.
combine is assuming that you ran your random forests twice on the same set of data, so the dimensions of those matrices will be the same. It's doing this because it wants to provide you with some "overall" error estimates for the combined forest.
But if the two data sets are different combining the votes matrices becomes simply nonsensical. You could get combine to run by simply removing one row from your larger training data set, but the resulting votes matrix in the combined forest would be gibberish, since each row would be a combination of votes for two different training cases.
So maybe this is simply something that should be an option that can be turned off in combine. Because it should still make sense to combine the actual trees and predict on the resulting object. But some of "combined" error estimates in the output from combine will be meaningless.
Long story short, make each training data set the same size, and it will run. But if you do, I wouldn't use the resulting object for anything other than making new predictions. Anything that is combined that was summarizing the performance of the forests will be nonsense.
However, I think the intended way to use combine is to fit multiple random forests on the full data set, but with a reduced number of trees and then to combine those forests.
Edit
I went ahead and modified combine to "handle" unequal training set sizes. All that means really is that I removed a large chunk of code that was trying to stitch things together that weren't going to match up. But I kept the portion that combines the forests, so you can still use predict:
my_combine <- function (...)
{
pad0 <- function(x, len) c(x, rep(0, len - length(x)))
padm0 <- function(x, len) rbind(x, matrix(0, nrow = len -
nrow(x), ncol = ncol(x)))
rflist <- list(...)
areForest <- sapply(rflist, function(x) inherits(x, "randomForest"))
if (any(!areForest))
stop("Argument must be a list of randomForest objects")
rf <- rflist[[1]]
classRF <- rf$type == "classification"
trees <- sapply(rflist, function(x) x$ntree)
ntree <- sum(trees)
rf$ntree <- ntree
nforest <- length(rflist)
haveTest <- !any(sapply(rflist, function(x) is.null(x$test)))
vlist <- lapply(rflist, function(x) rownames(importance(x)))
numvars <- sapply(vlist, length)
if (!all(numvars[1] == numvars[-1]))
stop("Unequal number of predictor variables in the randomForest objects.")
for (i in seq_along(vlist)) {
if (!all(vlist[[i]] == vlist[[1]]))
stop("Predictor variables are different in the randomForest objects.")
}
haveForest <- sapply(rflist, function(x) !is.null(x$forest))
if (all(haveForest)) {
nrnodes <- max(sapply(rflist, function(x) x$forest$nrnodes))
rf$forest$nrnodes <- nrnodes
rf$forest$ndbigtree <- unlist(sapply(rflist, function(x) x$forest$ndbigtree))
rf$forest$nodestatus <- do.call("cbind", lapply(rflist,
function(x) padm0(x$forest$nodestatus, nrnodes)))
rf$forest$bestvar <- do.call("cbind", lapply(rflist,
function(x) padm0(x$forest$bestvar, nrnodes)))
rf$forest$xbestsplit <- do.call("cbind", lapply(rflist,
function(x) padm0(x$forest$xbestsplit, nrnodes)))
rf$forest$nodepred <- do.call("cbind", lapply(rflist,
function(x) padm0(x$forest$nodepred, nrnodes)))
tree.dim <- dim(rf$forest$treemap)
if (classRF) {
rf$forest$treemap <- array(unlist(lapply(rflist,
function(x) apply(x$forest$treemap, 2:3, pad0,
nrnodes))), c(nrnodes, 2, ntree))
}
else {
rf$forest$leftDaughter <- do.call("cbind", lapply(rflist,
function(x) padm0(x$forest$leftDaughter, nrnodes)))
rf$forest$rightDaughter <- do.call("cbind", lapply(rflist,
function(x) padm0(x$forest$rightDaughter, nrnodes)))
}
rf$forest$ntree <- ntree
if (classRF)
rf$forest$cutoff <- rflist[[1]]$forest$cutoff
}
else {
rf$forest <- NULL
}
#
#Tons of stuff removed here...
#
if (classRF) {
rf$confusion <- NULL
rf$err.rate <- NULL
if (haveTest) {
rf$test$confusion <- NULL
rf$err.rate <- NULL
}
}
else {
rf$mse <- rf$rsq <- NULL
if (haveTest)
rf$test$mse <- rf$test$rsq <- NULL
}
rf
}
And then you can test it like this:
data(iris)
d <- iris[sample(150,150),]
d1 <- d[1:70,]
d2 <- d[71:150,]
rf1 <- randomForest(Species ~ ., d1, ntree=50, norm.votes=FALSE)
rf2 <- randomForest(Species ~ ., d2, ntree=50, norm.votes=FALSE)
rf.all <- my_combine(rf1,rf2)
predict(rf.all,newdata = iris)
Obviously, this comes with absolutely no warranty! :)

Related

Expand for-loop to accommodate list in R?

I've recently been interested in trying to develop a for-loop that would be able to run multiple generalized additive models and then produce results in a table that ranks them based on AIC, p-value of each smooth in the model, deviance explained of the overall model, etc.
I found this related question in stack overflow which is basically what I want and was able to run this well for gam() instead of gamm(), however I want to expand this to include multiple independent variables in the model, not just 1.
Ideally, the models would run all possible combinations of independent variables against the dependent variable, and it would test combinations anywhere from 1 independent variable in the model, up to all of the possible covariates in "d_pred" in the model.
I have attempted to do this so far by starting out small and finding all possible combinations of 2 independent variables (df_combinations2), which results in a list of data frames. Then I adjusted the rest of the code to run the for loop such that each iteration will run a different combination of the two variables:
library(mgcv)
## Example data
set.seed(0)
dat <- gamSim(1,n=200,scale=2)
set.seed(1)
dat2 <- gamSim(1,n=200,scale=2)
names(dat2)[1:5] <- c("y1", paste0("x", 4:7))
d <- cbind(dat[, 1:5], dat2[, 1:5])
d_resp <- d[ c("y", "y1")]
d_pred <- d[, !(colnames(d) %in% c("y", "y1"))]
df_combinations2 <- lapply(1:(ncol(combn(1:ncol(d_pred), m = 2))),
function(y) d_pred[, combn(1:ncol(d_pred), m = 2)[,y]])
## create a "matrix" list of dimensions i x j
results_m2 <-lapply(1:length(df_combinations2), matrix, data= NA, nrow=ncol(d_resp), ncol=2)
## for-loop
for(k in 1:length(df_combinations2)){
for(i in 1:ncol(d_resp)){
for(j in 1:ncol(df_combinations2[[k]])){
results_m2[i, j][[1]] <- gam(d_resp[, i] ~ s(df_combinations2[[k]][,1])+s(df_combinations2[[k]][,2]))
}
}}
However, after running the for-loop I get the error "Error in all.vars1(gp$fake.formula[-2]) : can't handle [[ in formula".
Anyone know why I am getting this error/ how to fix it?
Any insight is much appreciated. Thanks!
Personally, I would create a data.table() containing all combinations of target variables and combinations of predictors and loop through all rows. See below.
library(data.table)
library(dplyr)
# Example data
set.seed(0)
dat <- gamSim(1,n=200,scale=2)
set.seed(1)
dat2 <- gamSim(1,n=200,scale=2)
names(dat2)[1:5] <- c("y1", paste0("x", 4:7))
d <- cbind(dat[, 1:5], dat2[, 1:5])
#select names of targets and predictors
targets <- c("y", "y1")
predictors <- colnames(d)[!colnames(d) %in% targets]
#create all combinations of predictors
predictor_combinations <- lapply(1:length(predictors), FUN = function(x){
#create combination
combination <- combn(predictors, m = x) |> as.data.table()
#add s() to all for gam
combination <- sapply(combination, FUN = function(y) paste0("s(", y, ")")) |> as.data.table()
#collapse
combination <- summarize_all(combination, .funs = paste0, collapse = "+")
#unlist
combination <- unlist(combination)
#remove names
names(combination) <- NULL
#return
return(combination)
})
#merge combinations of predictors as vector
predictor_combinations <- do.call(c, predictor_combinations)
#create folder to save results to
if(!dir.exists("dev")){
dir.create("dev")
}
if(!dir.exists("dev/models")){
dir.create("dev/models")
}
#create and save hypergrid (all combinations of targets and predictors combinations)
if(!file.exists("dev/hypergrid.csv")){
#create hypergrid and save to dev
hypergrid <- expand.grid(target = targets, predictors = predictor_combinations) |> as.data.table()
#add identifier
hypergrid[, model := paste0("model", 1:nrow(hypergrid))]
#save to dev
fwrite(hypergrid, file = "dev/hypergrid.csv")
} else{
#if file exists read
hypergrid <- fread("dev/hypergrid.csv")
}
#loop through hypergrid, create GAM models
#progressbar
pb <- txtProgressBar(min = 1, max = nrow(hypergrid), style = 3)
for(i in 1:nrow(hypergrid)){
#update progressbar
setTxtProgressBar(pb, i)
#select target
target <- hypergrid[i,]$target
#select predictors
predictors <- hypergrid[i,]$predictors
#create formula
gam.formula <- as.formula(paste0(target, "~", predictors))
#run gam
gam.model <- gam(gam.formula, data = d)
#save gam model do dev/model
saveRDS(gam.model, file = paste0("dev/models/", hypergrid[i,]$model, ".RDS"))
}
#example where you extract model performances
for(i in 1:nrow(hypergrid)){
#read the right model
rel.model <- readRDS(paste0("dev/models/", hypergrid[i,]$model, ".RDS"))
#extract model performance, add to hypergrid
hypergrid[i, R2 := summary(rel.model)[["r.sq"]]]
}
#arrange hypergrid on target and r2
hypergrid <- dplyr::arrange(hypergrid, hypergrid$target, desc(hypergrid$R2))
Which would give
head(hypergrid)
target predictors model R2
1: y s(x0)+s(x1)+s(x2)+s(x4)+s(x5) model319 0.6957242
2: y s(x0)+s(x1)+s(x2)+s(x3)+s(x4)+s(x5) model423 0.6953753
3: y s(x0)+s(x1)+s(x2)+s(x4)+s(x5)+s(x7) model437 0.6942054
4: y s(x0)+s(x1)+s(x2)+s(x5) model175 0.6941025
5: y s(x0)+s(x1)+s(x2)+s(x4)+s(x5)+s(x6) model435 0.6940569
6: y s(x0)+s(x1)+s(x2)+s(x3)+s(x4)+s(x5)+s(x7) model481 0.6939756
All models are saved to a folder with an identifier (for if you want to use the model or extract more information from the model).
Notably, p-hacking comes to mind using this appraoch and I would be careful by conducting your analysis like this.

mixture copula in R

I want to use mixture copula for reliability analysis, now ,with the help of a friend ,I've already finished it ‘RVMs_fitted’ 。now i want to perform the probability integral transformation (PIT),but the function of RVINEPIT can’t use,because RVINEPIT(data,RVM),this RVM not RVINEMATRIX Here is my code:
library(vineclust)
data1 <- read.csv("D:/ASTUDY/Rlanguage/Mix copula/data.csv", header = FALSE)
fit <- vcmm(data = data1, total_comp=3,is_cvine = 0)
print(fit)
summary(fit)
RVMs_fitted <- list()
RVMs_fitted[[1]] <- VineCopula::RVineMatrix(Matrix=fit$output$vine_structure[,,1],
family=fit$output$bicop_familyset[,,1],
par=fit$output$bicop_param[,,1],
par2=fit$output$bicop_param2[,,1])
RVMs_fitted[[2]] <- VineCopula::RVineMatrix(Matrix=fit$output$vine_structure[,,2],
family=fit$output$bicop_familyset[,,2],
par=fit$output$bicop_param[,,2],
par2=fit$output$bicop_param2[,,2])
RVMs_fitted[[3]] <- VineCopula::RVineMatrix(Matrix=fit$output$vine_structure[,,3],
family=fit$output$bicop_familyset[,,3],
par=fit$output$bicop_param[,,3],
par2=fit$output$bicop_param2[,,3])
RVM<-RVMs_fitted
meanx <- c(0.47,0.508,0.45,0.52,0.48)
sigmax <- c(0.318,0.322,0.296,0.29,0.279)
ux1<-pnorm(x[1],meanx[1],sigmax[1])
ux2<-pnorm(x[2],meanx[2],sigmax[2])
ux3<-pnorm(x[3],meanx[3],sigmax[3])
ux4<-pnorm(x[4],meanx[4],sigmax[4])
ux5<-pnorm(x[5],meanx[5],sigmax[5])
data <- c(ux1,ux2,ux3,ux4,ux5)
du=RVinePIT(data, RVM)
y=t(qnorm(t(du)))
Error:
In RVinePIT: RVM has to be an RVineMatrix object.
You have multiple problems here:
RVM is a list. However, you tried to fit RVinePIT to a list, while it works for one data at a time.
The same holds for the y.
I do not have your data, but try it with other data.
Here is the code (it should work):
library(vineclust)
library(VineCopula)
data1 <- read.csv("D:/ASTUDY/Rlanguage/Mix copula/data.csv", header = FALSE)
fit <- vcmm(data = data, total_comp=3,is_cvine = 0)
print(fit)
summary(fit)
RVMs_fitted <- list()
RVMs_fitted[[1]] <- RVineMatrix(Matrix=fit$output$vine_structure[,,1],
family=fit$output$bicop_familyset[,,1],
par=fit$output$bicop_param[,,1],
par2=fit$output$bicop_param2[,,1])
RVMs_fitted[[2]] <- RVineMatrix(Matrix=fit$output$vine_structure[,,2],
family=fit$output$bicop_familyset[,,2],
par=fit$output$bicop_param[,,2],
par2=fit$output$bicop_param2[,,2])
RVMs_fitted[[3]] <- RVineMatrix(Matrix=fit$output$vine_structure[,,3],
family=fit$output$bicop_familyset[,,3],
par=fit$output$bicop_param[,,3],
par2=fit$output$bicop_param2[,,3])
RVM<-RVMs_fitted
meanx <- c(0.47,0.508,0.45,0.52,0.48)
sigmax <- c(0.318,0.322,0.296,0.29,0.279)
ux1<-pnorm(x[1],meanx[1],sigmax[1])
ux2<-pnorm(x[2],meanx[2],sigmax[2])
ux3<-pnorm(x[3],meanx[3],sigmax[3])
ux4<-pnorm(x[4],meanx[4],sigmax[4])
ux5<-pnorm(x[5],meanx[5],sigmax[5])
data <- c(ux1,ux2,ux3,ux4,ux5)### This must be a matrix to work with RVinePIT
du=lapply(1:3, function(i) RVinePIT(data, RVM[[i]]))
y <-lapply(1:3, function(i) t(qnorm(t(du[[i]]))))

speed up replication of rows using model

I would like to create replicate predictions for one integer independent variable (iv1) given some model and a data frame called training. This is my current approach. I appreciate this is not self containing but hopefully it is self explanatory:
number_of_samples <- 10
results <- NULL
for (row in 1:nrow(training)) {
fake_iv1_values <- sample(1:100, number_of_samples)
case <- training[row,]
for (iv1 in fake_iv1_values) {
case$iv1 <- iv1
case$prediction <- predict(some_model, newdata = case)
results <- rbind(results, case)
}
}
Using loops is very slow. I wonder, if this could be sped up? Thanks!
Try with this.
Reproducible fake data and model:
# create fake data
n_row <- 100
n_xs <- 100
training <- data.frame(y = rnorm(n_row), iv1 = rnorm(n_row))
training[, paste0("x",1:n_xs)] <- replicate(n_xs, list(rnorm(n_row)))
# example model
some_model <- lm(y~., training)
Rewritten code:
number_of_samples <- 10
results <- NULL
# vector of several fake_iv1_values vectors
fake_iv1_values <- as.numeric(replicate(nrow(training), sample(1:100, number_of_samples)))
# replicate each row of the original dataframe
results <- training[rep(seq_len(nrow(training)), each = number_of_samples), ]
# add fake values to the replicated dataframe
results$iv1 <- fake_iv1_values
# get predictions
results$prediction <- predict(some_model, newdata = results)

The for loop is working but without any good

I am using R to do a ML project, I have prepared the dataset and split the data into 10 equal splits but the problem is I need to fit the model 10 times manually (10-fold CV). I have tried to create train and test data using a for loop but each time it runs, train is the whole dataset and test is null. Can someone help me, please?
# Preparing the data
data <- read.csv("./project.csv")
id <- seq(1:103342)
data[, 'id'] <- id
for (i in 3:8) {
data[,i] <- as.factor(data[,i])
}
# splitting the data into 10 equal data frames
f <- rep(seq(1, 10), each=round(103342/10), length.out=103342)
df <- split(data, f)
lapply(df, dim)
# running 10-fold cross-validation and computing error rate and AUC for each run.
results <- matrix(nrow=10, ncol=2, dimnames= list(c(), c('error_rate', 'auc')))
for (i in 1:10) {
train <- data[!(data$id %in% df$`i`$id),]
test <- df$`i`
print(dim(test)) # Here is my problem the print statement will print null 10 times
glm.fit <- glm(canceled ~ ., data=train, family=binomial)
glm.prob <- predict(glm.fit, newdata=test, type="response")
...
}

R Simulation Programming Efficiency

I am a relatively new R programmer and have written a script that takes some statistical results and will ultimately compare it to a vector of results in which the target variable has been randomized. The result vector contains the statistical results of n simulations. As the number of simulations increases (I would like to run 10,000 simulations at least) the run time is longer than I would like. I have tried increasing the performance in ways I know to modify the code, but would love the help of others in optimizing it. The relevant part of the code is below.
#CREATE DATA
require(plyr)
Simulations <- 10001
Variation <- c("Control", "A", "B","C")
Trials <- c(727,724,723,720)
NonResponse <- c(692,669,679,682)
Response <- c(35,55,44,38)
ConfLevel <- .95
#PERFORM INITIAL CALCS
NonResponse <- Trials-Response
Data <-data.frame(Variation, NonResponse, Response, Trials)
total <- ddply(Data,.(Variation),function(x){data.frame(value = rep(c(0,1),times = c(x$NonResponse,x$Response)))})
total <- total[sample(1:nrow(total)), ]
colnames(total) <- c("Variation","Response")
#CREATE FUNCTION TO PERFORM SIMULATIONS
targetshuffle <- function(x)
{
shuffle_target <- x[,"Response"]
shuffle_target <- data.frame(sample(shuffle_target))
revised <- cbind(x[,"Variation"], shuffle_target)
colnames(revised) <- c("Variation","Yes")
yes_variation <- data.frame(table(revised$Yes,revised$Variation))
colnames(yes_variation) <- c("Yes","Variation","Shuffled_Response")
Shuffled_Data <- subset(yes_variation, yes_variation$Yes==1)
Shuffled_Data <- Shuffled_Data[match(Variation, Shuffled_Data$Variation),]
yes_variation <- cbind(Data,Shuffled_Data)
VectorPTest_All <- yes_variation[,c("Variation","NonResponse","Response","Trials","Shuffled_Response")]
Control_Only <- yes_variation[yes_variation$Variation=="Control",]
VectorPTest_Chall <- subset(yes_variation,!(Variation=="Control"))
VectorPTest_Chall <- VectorPTest_Chall[,c("Variation","NonResponse","Response","Trials","Shuffled_Response")]
ControlResponse <- Control_Only$Response
ControlResponseRevised <- Control_Only$Shuffled_Response
ControlTotal <- Control_Only$Trials
VariationCount <- length(VectorPTest_Chall$Variation)
VP <- data.frame(c(VectorPTest_Chall,rep(ControlResponse),rep(ControlResponseRevised),rep(ControlTotal)))
names(VP) <- c("Variation","NonResponse","Response", "Trials", "ResponseShuffled", "ControlReponse",
"ControlResponseShuffled","ControlTotal")
VP1 <<- data.frame(VP[,c(5,7,4,8)])
VP2 <<- data.frame(VP[,c(3,6,4,8)])
ptest <- apply(VP1, 1, function(column) prop.test(x=c(column[1], column[2]),
n=c(column[3], column[4]), alternative="two.sided",
conf.level=ConfLevel, correct=FALSE)$p.value)
min_p_value <- min(ptest)
return(min_p_value)
}
#CALL FUNCTION
sim_result <- do.call(rbind, rlply(Simulations, targetshuffle(total)))
Offhand, one thing to look at is creating all the data frames. Each time you do that you're copying all the data in the constituent object. If the dimensions are predictable you might consider creating empty matrices at the beginning of the function and populating them as you go.

Resources