Random Forest with caret package: Error: cannot allocate vector of size 153.1 Gb

I was trying to build a random forest model for a Kaggle dataset. I usually do my machine learning with the caret package. The dataset has 1.5 million+ rows and 46 variables with no missing values (about 150 MB in size); 40+ of the variables are categorical, and the outcome I am trying to predict is binary. After some pre-processing with dplyr, I started building the model with caret, but I got this error message when I tried to run the "train" function: "Error: cannot allocate vector of size 153.1 Gb". Here is my code:
## load packages
require(tidyr)
require(dplyr)
require(readr)
require(ggplot2)
require(ggthemes)
require(caret)
require(parallel)
require(doParallel)
## prepare for parallel processing
n_Cores <- detectCores()
n_Cluster <- makeCluster(n_Cores)
registerDoParallel(n_Cluster)
## import original datasets
people_Dt <- read_csv("people.csv",col_names = TRUE)
activity_Train <- read_csv("act_train.csv",col_names = TRUE)
### join two sets together and remove variables not to be used
first_Try <- people_Dt %>%
  left_join(activity_Train, by = "people_id") %>%
  select(-ends_with("y")) %>%
  filter(!is.na(outcome))
## try with random forest
in_Tr <- createDataPartition(first_Try$outcome,p=0.75,list=FALSE)
rf_Train <- first_Try[in_Tr,]
rf_Test <- first_Try[-in_Tr,]
## set model cross validation parameters
model_Control <- trainControl(method = "repeatedcv",repeats=2,number=2,allowParallel = TRUE)
rf_RedHat <- train(outcome~.,
data=rf_Train,
method="rf",
tuneLength=10,
importance=TRUE,
trControl=model_Control)
My computer is a fairly powerful machine with an E3 processor and 32 GB of RAM. I have two questions:
1. Where did a vector as large as 150 GB come from? Is it caused by something in my code?
2. I cannot get a machine with that much RAM. Is there any workaround so that I can move on with my model building?

the dataset has 1.5 million + rows and 46 variables with no missing values (about 150 mb in size)
To be clear, you most likely don't need all 1.5 million rows to build a model. Instead, take a smaller subset that doesn't cause the memory problems. If you are worried that a smaller sample is not representative, run some descriptive stats on the 40 predictors in both the subset and the full data and check that the distributions look the same.
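For example, a minimal sketch of drawing a stratified 10% subsample with caret (the 10% fraction is an arbitrary choice; first_Try is the joined data frame from the question):
library(caret)
## draw a stratified 10% subsample, preserving the outcome class balance
small_Idx <- createDataPartition(first_Try$outcome, p = 0.10, list = FALSE)
small_Try <- first_Try[small_Idx, ]
## sanity check: the class proportions should match the full data
prop.table(table(first_Try$outcome))
prop.table(table(small_Try$outcome))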

The problem is probably caused by caret one-hot-encoding your categorical variables. Since you have 40+ categorical variables, this blows your dataset up enormously: the formula interface of train creates a new indicator column for every level of every factor.
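You can see the expansion yourself with model.matrix, which performs the same encoding that train's formula interface uses (a toy sketch with made-up data, not your dataset):
## one ID-like factor with ~1000 levels...
df <- data.frame(y = rnorm(1e4),
                 x = factor(sprintf("id_%04d", sample(1000, 1e4, replace = TRUE))))
## ...expands to roughly 1000 columns (intercept plus one dummy per level)
dim(model.matrix(y ~ ., data = df))
With 40+ such variables, the encoded design matrix can easily require hundreds of gigabytes.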
Maybe you could try something like the h2o package, which handles categorical variables natively, so the dataset does not explode when the model is run.
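A minimal sketch of that route, assuming rf_Train from the question (parameters such as ntrees are arbitrary):
library(h2o)
h2o.init()
## factors stay factors inside h2o; no dummy-variable explosion
train_Hex <- as.h2o(rf_Train)
train_Hex$outcome <- as.factor(train_Hex$outcome)  ## classification, not regression
rf_H2o <- h2o.randomForest(y = "outcome",
                           x = setdiff(colnames(train_Hex), "outcome"),
                           training_frame = train_Hex,
                           ntrees = 100)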

Related

Variance inflation factors in R

I am trying to compute the VIFs from a regression model that has a lot of independent variables (> 100). I am using vif from the car package to do that.
I always get the error: cannot allocate vector of size 13.8 GB. I realize this is a memory issue, but my PC already has a lot of memory. So the question is: can this function be optimized in any way so that it doesn't require so much memory? I am unsure whether this is more of a stats or a computational question. As my dataset is quite large, I am also unsure how to represent this case with a MWE. Basically what is needed is a lot of independent variables (e.g. 200+) and one arbitrary dependent variable, with around 440 observations per variable. Thanks for any hints.
I just ran a simulated version of what you did, and it worked fine: it took less than a second with 250 explanatory variables, one response, and 500 observations.
For entertainment I pasted together a formula for it, but that isn't really necessary. The vif() values were computed easily.
In general, since vif_j = 1/(1 - R^2_j), where R^2_j is the R-squared from regressing the j-th explanatory variable on all the other explanatory variables, the computation should take at most the time of 250 linear regressions with 500 observations and 250 explanatory variables, which is very fast and not at all memory intensive.
You might need to post your code so we can see what went wrong.
library(car)  ## for vif()
resp <- rnorm(500)
X <- data.frame(matrix(rnorm(500 * 250), nrow = 500, ncol = 250))
colnames(X) <- col_names <- paste("x", 1:250, sep = "")
## build the formula resp ~ x1 + x2 + ... + x250
formula <- paste("resp ~", paste(col_names, collapse = " + "))
hold <- lm(formula, data = cbind(resp, X))
summary(vif(hold))
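To see why this is cheap, here is the same quantity computed by hand for the first variable, using the simulated objects above:
## vif_1 = 1/(1 - R^2_1): regress x1 on the other 249 predictors
r2_1 <- summary(lm(X$x1 ~ ., data = X[, -1]))$r.squared
1 / (1 - r2_1)  ## should match vif(hold)["x1"]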

Is there a way to handle "cannot allocate vector of size" issue without dropping data?

Unlike a previous question about this, this case is different, which is why I'm asking. I have an already cleaned dataset containing 120,000 observations of 25 variables, and I am supposed to analyze it with logistic regression and a random forest. However, I get the error "cannot allocate vector of size 98 GB", whereas my friend doesn't.
The summary says most of it. I even tried reducing the number of observations to 50,000 and the number of variables to 15 (using 5 of them in the regression), and it still failed. However, when I sent the script with the shortened dataset to a friend, she could run it. This is odd because I have a 64-bit system and 8 GB of RAM, while she has only 4 GB. So it appears that the problem lies with me.
pd_data <- read.csv2("pd_data_v2.csv")
split <- rsample::initial_split(pd_data, prop = 0.7)
train <- rsample::training(split)
test <- rsample::testing(split)
## fit on the training split, not the full data
log_model <- glm(default ~ profit_margin + EBITDA_margin + payment_reminders, data = train, family = "binomial")
log_model
The result should be a logistic model where I can see the coefficients, measure its accuracy, and make adjustments.

Speeding up the felm command in R (lfe library)

I am using felm from the lfe library and am running into serious speed issues with a large data set. By large I mean 100 million rows. My data consist of one dependent variable and five categorical variables (factors). I am running regressions with no covariates, only factors.
The felm algorithm does not converge. I also tried some of the tricks from this short article, but they did not help. My code is as follows:
library(lfe)
my_data <- read.csv("path_to//data.csv")
## convert the grouping variables to factors
for (v in c("fac1", "fac2", "fac3")) {
  my_data[[v]] <- factor(my_data[[v]])
}
and now I run the regression, without covariates (because I'm only interested in the residuals), and with interactions as follows:
est <- felm(y ~ 0|fac1:fac2+fac1:fac3, my_data)
This line takes forever and does not converge. Note that the dimensions of the factors are as follows:
fac1 has about 6000 unique values
fac2 has about 100 unique values
fac3 has about 10 unique values
(and remember there are 100 million rows). I suspect there must be something wrong with how I use the factors, because I imagine R should be able to handle such sizes (Stata's reghdfe command handles it without problems). Any suggestions are highly appreciated.
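One thing worth checking before blaming felm: the interaction fac1:fac2 alone can define up to 6000 x 100 = 600,000 fixed-effect groups. A quick diagnostic (a sketch, assuming my_data from above) counts the interaction cells that actually occur:
## number of distinct fixed-effect groups implied by each interaction
length(levels(interaction(my_data$fac1, my_data$fac2, drop = TRUE)))
length(levels(interaction(my_data$fac1, my_data$fac3, drop = TRUE)))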

RSNNS neural network prediction for raster image classification in R

I'm trying to harness the power of neural networks for image classification of big rasters using the RSNNS package in R.
As for the data preparation and training of the model, everything works perfectly fine and the accuracies look quite promising.
Subsequently, I'm trying to classify the raster values with the predict function and the trained model. Since the amount of data is quite big (a RasterStack with dimensions 10980 x 10980 x 16), I'm processing the data block by block. And here's the problem:
The prediction of the class values is extremely slow. I'm working on a quite powerful machine (Windows x64, 32 GB RAM, i7 3.4 GHz quad-core), but the process is still almost literally taking ages. I already reduced the size of my blocks, but the amount of time needed is still unacceptable. Currently I split the data into blocks of 64 rows each, which results in a total of 172 blocks. Assuming a linear processing time per block (33 minutes per block in my case!), it would take me almost 95 hours to process the whole image. Again, that cannot be right.
I've tried other neural network packages; nnet, for instance, classifies bigger blocks like these in under one minute.
So please, if you have any pointers on what I'm doing wrong, I'd greatly appreciate it.
Here's a working example similar to my code:
library(RSNNS)
#example data for training and testing
dat <- matrix(runif(702720),ncol = 16)
#example data to classify
rasval <- matrix(runif(11243520),ncol = 16)
dat <- as.data.frame(dat)
#example class labels (11 classes, 0 to 10)
classes <- floor(runif(nrow(dat), 0, 11))
dat$classes <- classes
#shuffle dataset
dat <- dat[sample(nrow(dat)),]
datValues <- dat[,1:16]
datTargets <- decodeClassLabels(dat[,17])
#split dataset
dat <- splitForTrainingAndTest(datValues, datTargets, ratio=0.15)
#normalize data
dat <- normTrainingAndTestSet(dat)
#extract normalization variables
ncolmeans <- attributes(dat$inputsTrain)$normParams$colMeans
ncolsds <- attributes(dat$inputsTrain)$normParams$colSds
#train model
model <- mlp(dat$inputsTrain, dat$targetsTrain, size=1, learnFunc="SCG", learnFuncParams=c(0, 0, 0, 0),
maxit=400, inputsTest=dat$inputsTest, targetsTest=dat$targetsTest)
#normalize raster data
rasval <- sweep(sweep(rasval,2,ncolmeans),2,ncolsds,'/')
#Predict classes ##Problem##
pred <- predict(model,rasval)
Yes, unfortunately RSNNS can be very slow when predicting. The new version 0.4-8 (not on CRAN yet, but you can get it from github) should speed things up a bit, but the general problem is that every row of data has to be passed separately into the SNNS kernel, and resolving this would mean reimplementing some things in the kernel. Not impossible, but some work to do.
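Until a faster RSNNS release lands, one workaround is the nnet package the asker already found to be fast; a minimal sketch on the same normalized data (assumes the dat and rasval objects from the question; nnet's single-hidden-layer softmax model stands in for mlp and is not an exact equivalent of the SCG-trained network):
library(nnet)
## single-hidden-layer net; targets are the 0/1 indicator matrix from decodeClassLabels
nn <- nnet(dat$inputsTrain, dat$targetsTrain, size = 1, softmax = TRUE, maxit = 400)
## predict class probabilities for the whole normalized block in one call
pred_Prob <- predict(nn, rasval)
pred_Class <- max.col(pred_Prob)  ## column index of the most probable class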

R, issue with a Hierarchical clustering after a Multiple correspondence analysis

I want to cluster a dataset (600000 observations), and for each cluster I want to get the principal components.
Each observation consists of one email address and 30 qualitative variables.
Each qualitative variable has 4 levels: 0, 1, 2 and 3.
So first thing I'm doing is to load the library FactoMineR and to load my data:
library(FactoMineR)
mydata = read.csv("/home/tom/Desktop/ACM/acm.csv")
Then I'm setting my variables as qualitative (I'm excluding the variable 'email' though):
for(n in 2:length(mydata)){mydata[[n]] <- factor(mydata[[n]])}
I'm removing the emails from my vectors:
mydata2 = mydata[2:31]
And I'm running a MCA in this new dataset:
mca.res <- MCA(mydata2)
I now want to cluster my dataset using the hcpc function:
res.hcpc <- HCPC(mca.res)
But I got the following error message:
Error: cannot allocate vector of size 1296.0 Gb
What do you think I should do? Is my dataset too large? Am I using the HCPC function correctly?
Since it uses hierarchical clustering, HCPC needs to compute the lower triangle of a 600000 x 600000 distance matrix (~ 180 billion elements). You simply don't have the RAM to store this object and even if you did, the computation would likely take hours if not days to complete.
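A quick back-of-the-envelope check (distances are stored as 8-byte doubles):
n <- 6e5
n * (n - 1) / 2               ## ~ 1.8e11 pairwise distances
n * (n - 1) / 2 * 8 / 2^30    ## ~ 1341 GiB, the same order as the error message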
There have been various discussions on Stack Overflow/Cross Validated on clustering large datasets; some with solutions in R include:
k-means clustering in R on very large, sparse matrix? (bigkmeans)
Cluster Big Data in R and Is Sampling Relevant? (clara)
If you want to use one of these alternative clustering approaches, you would apply it to mca.res$ind$coord in your example.
Another idea, suggested in response to the problem clustering very large dataset in R, is to first use k means to find a certain number of cluster centres and then use hierarchical clustering to build the tree from there. This method is actually implemented via the kk argument of HCPC.
For example, using the tea data set from FactoMineR:
library(FactoMineR)
data(tea)
## run MCA as in ?MCA
res.mca <- MCA(tea, quanti.sup = 19, quali.sup = c(20:36), graph = FALSE)
## run HCPC for all 300 individuals
hc <- HCPC(res.mca, kk = Inf, consol = FALSE)
## run HCPC from 30 k means centres
res.consol <- NULL ## bug work-around
hc2 <- HCPC(res.mca, kk = 30, consol = FALSE)
The consol argument offers the option to consolidate the clusters from the hierarchical clustering using k-means; this option is not available when kk is set to a real number, hence consol is set to FALSE here. The object res.consol is set to NULL to work around a minor bug in FactoMineR 1.27.
The following plot shows the clusters based on all 300 individuals (kk = Inf) and based on the 30 k-means centres (kk = 30), for the data plotted on the first two MCA axes:
It can be seen that the results are very similar. You should easily be able to apply this to your data with 600 or 1000 k means centres, perhaps up to 6000 with 8GB RAM. If you wanted to use a larger number, you'd probably want to code a more efficient version using bigkmeans, SpatialTools::dist1 and fastcluster::hclust.
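A rough sketch of that more efficient route (assumes the biganalytics and fastcluster packages; mca.res is the MCA result from the question and 6000 centres is an arbitrary choice):
library(biganalytics)  ## provides bigkmeans()
library(fastcluster)   ## drop-in, faster replacement for stats::hclust()
coords <- mca.res$ind$coord
## step 1: reduce 600,000 individuals to 6000 k-means centres
km <- bigkmeans(coords, centers = 6000, iter.max = 50)
## step 2: hierarchical clustering on the centres only (a 6000 x 6000 problem)
hc <- fastcluster::hclust(dist(km$centers), method = "ward.D2")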
That error message usually indicates that R does not have enough RAM at its disposal to complete the command. I guess you are running this in 32-bit R, possibly under Windows? If that is the case, killing other processes and deleting unused R variables might help: for example, you might try to delete mydata and mydata2 with
rm(mydata, mydata2)
(as well as all other unnecessary R variables) before executing the command that generates the error. However, the ultimate solution in general is to switch to 64-bit R, preferably under 64-bit Linux and with a decent amount of RAM; also see here:
R memory management / cannot allocate vector of size n Mb
R Memory Allocation "Error: cannot allocate vector of size 75.1 Mb"
http://r.789695.n4.nabble.com/Error-cannot-allocate-vector-of-size-td3629384.html
