How do I replace the bootstrap step in the package randomForest in R?

First some background info, which is probably more interesting on stats.stackexchange:
In my data analysis I try to compare the performance of different machine learning methods on time series data (regression, not classification). For example, I have trained a boosting model and compare it with a random forest model (R package randomForest).
I use time series data where the explanatory variables are lagged values of other data and of the dependent variable.
For some reason the random forest severely underperforms. One of the problems I could think of is that the random forest performs a bootstrap sampling step on the training data for each tree. If it does this to time series data, the autoregressive nature of the series is completely removed.
To test this idea, I would like to replace the (bootstrap) sampling step in the randomForest() function with a so-called block-wise bootstrap. This means I cut the training set into k contiguous parts, where k << N and each part keeps its original order. If I then sample these k blocks with replacement, I still benefit from the 'randomness' in the random forest, but the time series nature is left largely intact.
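A minimal sketch of the resampling idea (hypothetical sizes; the full implementation is in the answer below):
N <- 1000; k <- 25                                # k << N
blocks <- split(1:N, cut(1:N, k))                 # k contiguous index blocks, order kept inside each block
idx <- unlist(sample(blocks, k, replace = TRUE))  # one block-wise bootstrap sample of row indices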
Now my problem is this:
To achieve this I would normally copy the existing function and edit the desired step/lines:
randomForest2 <- randomForest  # copy the function object, then edit it
But the randomForest() function seems to be a wrapper around another wrapper around deeper underlying functions. So how can I edit the actual bootstrap step in the randomForest() function and still run the rest of the function regularly?

So for me the solution wasn't editing the existing randomForest function. Instead I coded the block-wise bootstrap myself, using the split2 function given by Soren H. Welling to create the blocks. Once I had my data block-wise bootstrapped, I looked for a package (rpart) that fits just a single regression tree, and aggregated the trees' predictions myself (taking the means).
On my actual data this gives a slight but consistent improvement over the normal random forest in terms of RMSPE.
For the simulated code below the performance is a coin-toss.
Taking Soren's code as an example, it looks a bit like this:
library(randomForest)
library(doParallel) #parallel package and mclapply is better for linux
library(rpart)
#parallel backend ftw
nCPU = detectCores()
cl = makeCluster(nCPU)
registerDoParallel(cl)
#simulated time series(y) with time roll and lag=1
timepoints=1000;var=6;noise.factor=.2
#past to present orientation
y = sin((1:timepoints)*pi/30) * 1000 +
    sin((1:timepoints)*pi/40) * 1000 + 1:timepoints
y = y+rnorm(timepoints,sd=sd(y))*noise.factor
plot(y,type="l")
#convert to absolute change, with lag=1
dy = c(0,y[-1]-y[-length(y)]) # c(0,t2-t1,t3-t2,...)
dy = dy + rnorm(timepoints)*sd(dy)*noise.factor #add noise
dX = sapply(1:40,function(i){
  getTheseLags = (1:timepoints) - i
  getTheseLags[getTheseLags<1] = NA #mark lags from before the start of the series
  dy[getTheseLags]
})
dX[is.na(dX)] = -100 #quick fix for lags that exceed the start of the series
pairs(data.frame(dy,dX[,1:5]),cex=.2) #data structure
#make train- and test-set
train=1:600
dy.train = dy[ train]
dy.test = dy[-train]
dX.train = dX[ train,]
dX.test = dX[-train,]
#classic rf
rf = randomForest(dX.train,dy.train,ntree=500)
print(rf)
#like function split for a vector without mixing
split2 = function(aVector,splits=31) {
  lVector = length(aVector)
  mod = lVector %% splits
  lBlocks = rep(floor(lVector/splits),splits)
  if(mod!=0) lBlocks[1:mod] = lBlocks[1:mod] + 1
  lapply(1:splits,function(i) {
    Stop = sum(lBlocks[1:i])
    Start = Stop - lBlocks[i] + 1
    aVector[Start:Stop]
  })
}
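#e.g. split2(1:10,splits=3) returns list(1:4, 5:7, 8:10) - blocks keep the original order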
#create a list of block-wise bootstrapped samples
aBlock <- list()
numTrees <- 500
splits <- 40
for (ttt in 1:numTrees){
  aBlock[[ttt]] <- unlist(
    sample(
      split2(1:nrow(dX.train),splits=splits),
      splits,
      replace=T
    )
  )
}
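#each aBlock[[ttt]] is a vector of roughly nrow(dX.train) row indices:
#'splits' contiguous blocks drawn with replacement and concatenated,
#i.e. one block-wise bootstrap sample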
#put data into a dataframe so rpart understands it
df1 <- data.frame(dy.train, dX.train)
#fit one regression tree per block-wise bootstrap sample
#and predict the test set with each tree
rfBlocks = foreach(aBlock = aBlock,
                   .packages=("rpart")) %dopar% {
  dBlock = df1[aBlock,]
  predict( rpart( dy.train ~., data = dBlock, method ="anova" ),
           newdata=data.frame(dX.test) )
}
#predict test, make results table
#use rowMeans to aggregate the block-wise predictions
results = data.frame(predBlock = rowMeans(do.call(cbind.data.frame, rfBlocks)),
                     true = dy.test,
                     predBootstrap = predict(rf, newdata=dX.test))
plot(results[,1:2],xlab="predicted change",
     ylab="true change",
     main="black: block-wise trees, blue: classic bootstrap rf")
points(results[,3:2],col="blue")
#prediction results
print(cor(results)^2)
stopCluster(cl)#close cluster

To directly alter the sampling of randomForest(type="regression"): learn basic C programming, download the source code randomForest.4.6-10.tar.gz from CRAN (if Windows, install Rtools; if OSX, install Xcode), install and open RStudio, start a new project, choose package, unpack the ...tar.gz into the folder, look into the src folder, open regrf.c, and check out lines 151 and 163. Write a new sampling strategy, press Ctrl+Shift+B occasionally to rebuild/compile and overwrite the randomForest library, correct the stated compile errors, test occasionally whether the package still works, spend some hours figuring out the old uninformative code, perhaps change the DESCRIPTION file, NAMESPACE file and a few other references so the package changes its name to randomForestMod, rebuild, voilà.
An easier way, which does not change randomForest itself, is described below. Any trees with the same feature inputs can be patched together with the function randomForest::combine, so you can design your sampling regime in pure R code. I thought it was actually a bad idea, but for this very naive simulation it works with similar/slightly better performance! Remember not to predict the absolute target value, but instead a stationary derivative such as relative change or absolute change. If you predict the absolute value, RF will fall back to predicting that tomorrow is something pretty close to today, which is trivial, useless information.
library(randomForest)
library(doParallel) #parallel package and mclapply is better for linux
#parallel backend ftw
nCPU = detectCores()
cl = makeCluster(nCPU)
registerDoParallel(cl)
#simulated time series(y) with time roll and lag=1
timepoints=1000;var=6;noise.factor=.2
#past to present orientation
y = sin((1:timepoints)*pi/30) * 1000 +
    sin((1:timepoints)*pi/40) * 1000 + 1:timepoints
y = y+rnorm(timepoints,sd=sd(y))*noise.factor
plot(y,type="l")
#convert to absolute change, with lag=1
dy = c(0,y[-1]-y[-length(y)]) # c(0,t2-t1,t3-t2,...)
dy = dy + rnorm(timepoints)*sd(dy)*noise.factor #add noise
dX = sapply(1:40,function(i){
  getTheseLags = (1:timepoints) - i
  getTheseLags[getTheseLags<1] = NA #mark lags from before the start of the series
  dy[getTheseLags]
})
dX[is.na(dX)] = -100 #quick fix for lags that exceed the start of the series
pairs(data.frame(dy,dX[,1:5]),cex=.2) #data structure
#make train- and test-set
train=1:600
dy.train = dy[ train]
dy.test = dy[-train]
dX.train = dX[ train,]
dX.test = dX[-train,]
#classic rf
rf = randomForest(dX.train,dy.train,ntree=500)
print(rf)
#like function split for a vector without mixing
split2 = function(aVector,splits=31) {
  lVector = length(aVector)
  mod = lVector %% splits
  lBlocks = rep(floor(lVector/splits),splits)
  if(mod!=0) lBlocks[1:mod] = lBlocks[1:mod] + 1
  lapply(1:splits,function(i) {
    Stop = sum(lBlocks[1:i])
    Start = Stop - lBlocks[i] + 1
    aVector[Start:Stop]
  })
}
nBlocks=10 #combine does not support blocks of unequal size
rfBlocks = foreach(aBlock = split2(train,splits=nBlocks),
                   .combine=randomForest::combine,
                   .packages=("randomForest")) %dopar% {
  dXblock = dX.train[aBlock,] ; dyblock = dy.train[aBlock]
  randomForest(x=dXblock,y=dyblock,sampsize=length(dyblock),
               replace=T,ntree=50)
}
print(rfBlocks)
#predict test, make results table
results = data.frame(predBlock = predict(rfBlocks, newdata=dX.test),
                     true = dy.test,
                     predBootstrap = predict(rf, newdata=dX.test))
plot(results[,1:2],xlab="predicted change",
     ylab="true change",
     main="black: block-wise rf, blue: classic bootstrap rf")
points(results[,3:2],col="blue")
#prediction results
print(cor(results)^2)
stopCluster(cl)#close cluster

Related

Is it possible to adapt standard prediction interval code for dlm in R with other distribution?

Using the dlm package in R I fit a dynamic linear model to a time series data set consisting of 20 observations. I then use the dlmForecast function to predict future values (which I can validate against the genuine data for that period).
I use the following code to create a prediction interval:
ciTheory <- (outer(sapply(fut1$Q, FUN=function(x) sqrt(diag(x))), qnorm(c(0.05,0.95))) +
             as.vector(t(fut1$f)))
However, my data does not follow a normal distribution, and I wondered whether it would be possible to adapt the qnorm function for other distributions. I have tried qt, but am unable to apply qgamma. Does anyone know how you would go about sorting this?
Below is a reproduced version of my code:
library(dlm)
data <- c(20.68502, 17.28549, 12.18363, 13.53479, 15.38779, 16.14770, 20.17536, 43.39321, 42.91027, 49.41402, 59.22262, 55.42043)
mod.build <- function(par) {
  dlmModPoly(1, dV = exp(par[1]), dW = exp(par[2]))
}
# returns the most likely estimates of the relevant parameters
mle <- dlmMLE(data, rep(0,2), mod.build)
if(mle$convergence==0) print("converged") else print("did not converge")
v <- exp(mle$par[1]); w <- exp(mle$par[2]) # variances implied by the MLE fit (assumed; a1/a2/v/w were undefined in the original)
mod1 <- dlmModPoly(dV = v, dW = c(0, w))
mod1Filt <- dlmFilter(data, mod1)
fut1 <- dlmForecast(mod1Filt, n = 7)
Cheers

LSTM time series forecasting, predictions stabilize

My code is in R, using the Keras and TensorFlow libraries. I'm creating an LSTM model to forecast 100 future values. My input shape is (100,200,1).
Let's say my input data is X. I make a prediction at time step t=201 and get the column Y of predictions. Then I create a new variable Xnew = c(X[2:200],Y), where I concatenate X (except for its first column) and Y. I use this Xnew to predict the next time step.
What's happening is that, after a certain number of predicted future time steps (around 15), the predictions become constant for every time step afterwards. Does anyone know why this happens?
prdvec = function(dat, modname, numpreds, cnt, scl){
  model = load_model_hdf5(modname)   # keras model from disk
  inpt = dat
  pred = list()
  for(i in 1:numpreds){
    # centre/scale, reshape to (samples, timesteps, features), predict one step
    # (reshape_X_3d is the asker's own helper, not shown)
    pred[[i]] <- predict(model, reshape_X_3d((inpt[,1:ncol(inpt)]-cnt)/scl), batch_size = 1)
    # drop the oldest column and append the back-transformed prediction
    inpt = cbind(inpt[,2:ncol(inpt)], (pred[[i]]*scl+cnt))
    print(i)
    flush.console()
  }
  pred
}
I encountered a similar problem. When the LSTM takes as input values it created itself, the feedback loop tends to damp out and the predictions stabilize to a constant.

R bsts predictions are not consistent

Whenever I run the predict function multiple times on a bsts model using the same prediction data, I get different answers. So my question is, is there a way to return consistent answers given I keep my predictor dataset the same?
Example using the iris data set (I know it's not time series but it will illustrate my point)
iris_train <- iris[1:100,1:3]
iris_test <- iris[101:150,1:3]
ss <- AddLocalLinearTrend(list(), y = iris_train$Sepal.Length)
iris_bsts <- bsts(formula = Sepal.Length ~ ., data = iris_train,
                  state.specification = ss,
                  family = 'gaussian', seed = 1, niter = 500)
burn <- SuggestBurn(0.1, iris_bsts)
Now if I run the following line, say, 10 times, each result is different:
iris_predict <- predict(iris_bsts, newdata = iris_test, burn = burn)
iris_predict$mean
I understand that it is running MCMC simulations, but I require consistent results and have therefore tried:
Setting the seed in bsts and before predict
Setting the state space standard deviation to near 0, which just creates unstable results.
Neither seems to work. Any help would be appreciated!
I encountered the same problem. To fix it, you need to set the random seed in the embedded C code. I forked the package and made the modifications here: BSTS.
For package installation only, download bsts_0.7.1.1.tar.gz in the build folder. If you already have bsts installed, replace it with this version via:
remove.packages("bsts")
# assumes the working directory is where the file is located
install.packages("bsts_0.7.1.1.tar.gz", repos=NULL, type="source")
If you do not have bsts installed, please install it first to ensure all dependencies are there. (This may require installing Rtools, Boom, and BoomSpikeSlab individually.)
This package version only modifies the predict function from bsts, so all other code should work as is. It automatically sets the random seed to 1 each time predict is called. If you want predictions to vary, you'll need to explicitly set the seed each time predict is called.
You can make a function to specify the seed each time (set.seed was unnecessary...):
reproducible_predict <- function(S) {
  iris_bsts <- bsts(formula = Sepal.Length ~ ., data = iris_train,
                    state.specification = ss, seed = S,
                    family = 'gaussian', niter = 500)
  burn <- SuggestBurn(0.1, iris_bsts)
  iris_predict <- predict(iris_bsts, newdata = iris_test, burn = burn)
  return(iris_predict$mean)
}
reproducible_predict(1)
[1] 7.043592 6.212780 6.789205 6.563942 6.746156
reproducible_predict(1)
[1] 7.043592 6.212780 6.789205 6.563942 6.746156
reproducible_predict(200)
[1] 7.013679 6.173846 6.763944 6.567651 6.715257
reproducible_predict(200)
[1] 7.013679 6.173846 6.763944 6.567651 6.715257
I have come across the same issue. The problem comes from setting the seed within the model definition only.
To solve your problem, you have to set a seed within the predict function as well, such as:
iris_predict <- predict(iris_bsts, newdata = iris_test, burn = burn, seed=X)
Hope this helps.

R e1071 SVM leave one out cross validation function result differ from manual LOOCV

I'm using the e1071 svm function to classify my data.
I tried two different ways to do LOOCV.
The first one is like this:
svm.model <- svm(mem ~ ., data, kernel = "sigmoid", cost = 7, gamma = 0.009, cross = subSize)
svm.pred = data$mem
svm.pred[which(svm.model$accuracies==0 & svm.pred=='good')]=NA
svm.pred[which(svm.model$accuracies==0 & svm.pred=='bad')]='good'
svm.pred[is.na(svm.pred)]='bad'
conMAT <- table(pred = svm.pred, true = data$mem)
summary(svm.model)
I set cross = 'subject number' to perform LOOCV, but the classification result differs from my manual version of LOOCV, which looks like this:
CMAT <- 0; CORR <- numeric(subSize)  # initialise accumulators (missing in the original)
for (i in 1:subSize){
  data_Tst <- data[i,1:dSize]
  data_Trn <- data[-i,1:dSize]
  svm.model1 <- svm(mem ~ ., data = data_Trn, kernel = "linear", cost = 2, gamma = 0.02)
  svm.pred1 <- predict(svm.model1, data_Tst[,-dSize])
  conMAT <- table(pred = svm.pred1, true = data_Tst[,dSize])
  CMAT <- CMAT + conMAT
  CORR[i] <- sum(diag(conMAT))
}
In my opinion, LOOCV accuracy should not vary across runs of the code, because the SVM builds a model on all the data except one observation and repeats this until the end of the loop. However, with the svm function's 'cross' argument, the accuracy differs on every run of the code.
Which way is more accurate? Thanks for reading this post! :-)
You are using different hyper-parameters (cost, gamma) and different kernels (sigmoid vs. linear) in the two versions. If you want identical results, then these should be the same in each run; a sketch with matched settings follows after this list.
Also, it depends on how Leave One Out (LOO) is implemented:
Does your LOO method leave one out randomly, or as a sliding window over the dataset?
Does your LOO method leave one out from one class at a time, or from both classes at the same time?
Is the training set always the same, or are you using a randomisation procedure before splitting into a training and testing set (assuming you have a separate independent testing set)? In that case, the examples you are cross-validating would change on each run.
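A hedged sketch of a manual LOOCV that matches the cross= call above (it assumes the question's data, mem, subSize and dSize; these names come from the question, not from e1071):
CORR <- numeric(subSize)
for (i in 1:subSize) {
  fit  <- svm(mem ~ ., data = data[-i, ], kernel = "sigmoid", cost = 7, gamma = 0.009)
  pred <- predict(fit, data[i, , drop = FALSE])
  CORR[i] <- as.numeric(pred == data$mem[i])
}
mean(CORR)  # overall LOOCV accuracy; compare with the cross= run's svm.model$accuracies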

Get randomForest regression faster in R

I have to fit a regression with randomForest in R. My problem is that my data frame is huge: I have 12 variables and more than 400k entries. When I try to fit a randomForest regression (the code is written at the bottom), the system takes many hours to process the data: after 5 or 6 hours of computation, I am obliged to stop the operation without any output. Can someone suggest how I can make it faster?
Thanks
library(caret)
library(randomForest)
dataset <- read.csv("/home/anonimo/Modelli/total_merge.csv", header=TRUE)
dati <- data.frame(dataset)
attach(dati)
trainSet <- dati[2:107570,]
testSet <- dati[107570:480343,]
output.forest <- randomForest(dati$Clip_pm25 ~ dati$e_1 + dati$Clipped_so + dati$Clip_no2 + dati$t2m_1 + dati$tp_1 + dati$Clipped_nh + dati$Clipped_co + dati$Clipped_o3 + dati$ssrd_1 + dati$Clipped_no + dati$Clip_pm10 + dati$sp_1, data=trainSet, ntree=250)
I don't think parallelizing on a single PC (2-4 cores) is the answer. There is plenty of lower-hanging fruit to pick.
1) RF models increase in complexity with the number of training samples. The average tree depth would be something like log(480,000/5)/log(2) = 16.5 intermediate nodes. In the vast majority of examples 2000-10000 samples per tree are fine. If you are competing to win on Kaggle, where a small extra performance really matters as winner takes all, that is different; in practice, you probably don't need it.
2) Don't clone your data set in your R code; try to keep only one copy of your data (passing by reference is of course fine). It's not a big problem for this data set, as it is not that big (~38 MB) even for R.
3) Don't use the formula interface with the randomForest algorithm for large datasets: it makes an extra copy of the data set. But again, memory is not that much of a problem here.
4) Use a faster RF algorithm: extraTrees, ranger or Rborist are available for R. extraTrees is not exactly an RF algorithm, but pretty close.
5) Avoid categorical features with more than 10 categories. RF can handle up to 32, but becomes super slow, as every one of up to 2^32 possible splits has to be evaluated. extraTrees and Rborist handle more categories by testing only some randomly selected splits (which works fine). Another solution, as in python-sklearn, is to assign every category a unique integer and handle the feature as numeric. You can convert your categorical features with as.numeric before running randomForest to do the same trick (see the one-liner after this list).
6) For much bigger data: split the data set into random blocks and train a few (~10) trees on each, then combine the forests or save them separately (a sketch follows after the code below). This will slightly increase tree correlation. There are some nice cluster implementations for training like this, but it won't be necessary for datasets below 1-100 GB, depending on tree complexity etc.
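The one-liner for point 5, with a hypothetical column name some_factor (your data would use its own names):
X$some_factor <- as.numeric(as.factor(X$some_factor))  # integer-encode a high-cardinality factor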
#below I use solutions 1-3) and get a run time of some minutes
library(randomForest)
#simulate data
dataset <- data.frame(replicate(12,rnorm(400000)))
dataset$Clip_pm25 = dataset[,1]+dataset[,2]^2+dataset[,4]*dataset[,3]
#dati <- data.frame(dataset) #no need to keep the data set, an extra time in memory
#attach(dati) #if you attach dati you don't need to write dati$Clip_pm25, just Clip_pm25
#but avoid formula interface for randomForest for large data sets because it cost extra memory and time
#split data in X and y manually
y = dataset$Clip_pm25
X = dataset[,names(dataset) != "Clip_pm25"]
rm(dataset);gc()
object.size(X) #38Mb, no problemo
#if you were using formula interface
#output.forest <- randomForest(dati$Clip_pm25 ~ dati$e_1 + dati$Clipped_so + dati$Clip_no2 + dati$t2m_1 + dati$tp_1 + dati$Clipped_nh + dati$Clipped_co + dati$Clipped_o3 + dati$ssrd_1 + dati$Clipped_no + dati$Clip_pm10 + dati$sp_1, data=trainSet, ntree=250)
#output.forest <- randomForest(dati$Clip_pm25 ~ ., ntree=250) # use dot to indicate all variables
#start small, and scale up slowly
rf = randomForest(X,y,sampsize=1000,ntree=5) #runtime ~15 seconds
print(rf) #~67% explained var
#you probably really don't need to exceed 5000-10000 samples per tree; you could grow 2000 trees to sample most of the training set
rf = randomForest(X,y,sampsize=5000,ntree=500) # runtime ~5 minutes
print(rf) #~87% explained var
#regarding parallel:
#here you could implement some parallel looping
#.... but is it really worth it for a 2-4x speedup?
#coding parallel on a single PC is fun but rarely worth the effort
#If you work at some company or university with a decent computer cluster,
#then you can spawn the process across 20-80-200 nodes and get a ~10-60-150x speedup
#I can recommend the BatchJobs package
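For point 6, a minimal hedged sketch (reusing the simulated X and y above, with randomForest already loaded) of training small forests on random blocks and patching them together with randomForest::combine:
blockIdx <- split(sample(nrow(X)), rep(1:10, length.out = nrow(X)))  #10 random index blocks
forests <- lapply(blockIdx, function(ix) randomForest(X[ix,], y[ix], ntree = 25))
bigForest <- do.call(randomForest::combine, forests)  #one forest of 250 trees
print(bigForest)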
Since you are using caret, you could use method = "parRF", an implementation of parallel random forest. For example:
library(caret)
library(randomForest)
library(doParallel)
cores <- 3
cl <- makePSOCKcluster(cores)
registerDoParallel(cl)
dataset <- read.csv("/home/anonimo/Modelli/total_merge.csv", header=TRUE)
dati <- data.frame(dataset)
attach(dati)
trainSet <- dati[2:107570,]
testSet <- dati[107570:480343,]
# 3-fold cross-validation
my_control <- trainControl(method = "cv", number = 3)
my_forest <- train(Clip_pm25 ~ e_1 + Clipped_so + Clip_no2 + t2m_1 + tp_1 + Clipped_nh + Clipped_co + Clipped_o3 + ssrd_1 + Clipped_no + Clip_pm10 + sp_1,
                   data = trainSet,
                   method = "parRF",
                   ntree = 250,
                   trControl = my_control)
Here is a foreach implementation as well:
foreach_forest <- foreach(ntree = rep(250, cores),
                          .combine = combine,
                          .multicombine = TRUE,
                          .packages = "randomForest") %dopar%
  randomForest(Clip_pm25 ~ e_1 + Clipped_so + Clip_no2 + t2m_1 + tp_1 + Clipped_nh + Clipped_co + Clipped_o3 + ssrd_1 + Clipped_no + Clip_pm10 + sp_1,
               data = trainSet, ntree = ntree)
# don't forget to stop the cluster
stopCluster(cl)
Remember that I didn't set any seeds; you might want to consider this as well. And here is a link to a randomForest package that also runs in parallel, but I have not tested it.
The other two answers are good. Another option is to use more recent packages that are purpose-built for high-dimensional / high-volume data sets. They run their code in lower-level languages (C++ and/or Java) and in certain cases use parallelization.
I'd recommend taking a look at these three:
ranger (uses C++ compiler)
randomForestSRC (uses C++ compiler)
h2o (Java compiler - needs Java version 8 or higher)
Also, some additional reading to give you more to go on when choosing a package: https://arxiv.org/pdf/1508.04409.pdf
Page 8 shows benchmarks of ranger against randomForest with growing data size - ranger is far faster, due to linear growth in runtime rather than the non-linear growth of randomForest for rising tree/sample/split/feature sizes. A minimal usage sketch follows.
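A hedged sketch of ranger on the X/y split from the earlier answer (X and y are assumed from above; num.trees and num.threads are ranger's tree-count and parallelism arguments):
library(ranger)
rf_fast <- ranger(x = X, y = y, num.trees = 250, num.threads = 4)  # multithreaded C++ forest
print(rf_fast)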
Good Luck!
