Issue using ff with SVM function in library(e1071) - r

I am trying to use an ff object to run an SVM classification study.
I converted my data frame to an ff object using ffdf <- as.ffdf(signalDF). The dataset has 1024 columns and ~600K rows.
When I run the function, svm(Y~., data=ffdf,scale=FALSE,kernel="linear"), I receive the error:
Error: cannot allocate vector of size 15.8 Gb
Running ulimit -n:
64000
Also, running df shows plenty of disk space.
Any reason why I am receiving a memory error when using a ff object?
Any help is appreciated.
Thank you

Disk space is different from the memory available for computation. The error means you don't have enough RAM for the computation: ff keeps the data on disk, but svm() still has to pull the training data (and its kernel computations) into memory, and your dataset is large relative to your machine's RAM. If you reduce the training set size, it should run.
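For example, a minimal sketch of training on a random subsample (ffdf and the response Y are from the question; the subsample size of 50,000 is purely illustrative):
library(e1071)
library(ff)

# svm() materializes the training data in RAM, so ff's on-disk storage
# does not help here; train on a subsample that fits in memory instead.
set.seed(1)
idx     <- sample(nrow(ffdf), 50000)        # illustrative subsample size
trainDF <- as.data.frame(ffdf[idx, ])       # pull just those rows into RAM
fit     <- svm(Y ~ ., data = trainDF, scale = FALSE, kernel = "linear")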

Related

brms add_criterion how to manage large brmsfit models

I would like to get some overview of what the options are for model comparison in brms when the models are large (brmsfit objects of ~ 6 GB due to 2000000 iterations).
My immediate problem is that add_criterion() won't run after the models are finished on my laptop (16 GB memory). I got the error message "vector memory exhausted (limit reached?)", after which I increased the memory cap on R in .Renviron to 100 GB (as described here: R on MacOS Error: vector memory exhausted (limit reached?)). Total memory usage goes up to about 90 GB; I get error messages in R when I try to estimate both 'waic' and 'loo', and if I estimate just 'loo', R invariably crashes.
What are my options here and what would be the recommendations?
Use the cluster -- the local convention is to use a single node; is this advisable? (I guess not, as we have 6, 10, and 16 GB cores. Any (link to) advice on parallelising R on a cluster is welcome.)
Is it possible to have a less dense posterior in brms, i.e. sample less during estimation, as in BayesTraits? (A thinning sketch follows below.)
Can I parallelise R/RStudio on my own laptop?
...?
Many thanks for your advice!
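A hedged note on the 'sample less' option: brm() has a thin argument that keeps only every k-th draw, and cores runs chains in parallel on a single machine. A minimal sketch with a placeholder formula, placeholder data, and illustrative values:
library(brms)

# Hypothetical model; the point is the control arguments, not the formula.
fit <- brm(
  y ~ x + (1 | group),          # placeholder formula
  data   = mydata,              # placeholder data set
  chains = 4,
  cores  = 4,                   # run the chains in parallel locally
  iter   = 2000000,             # as in the question
  thin   = 100                  # keep every 100th draw, shrinking the stored posterior
)

# The much smaller brmsfit should then be easier to post-process.
fit <- add_criterion(fit, criterion = "loo")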

how do I run a function on a data set that is too large?

I'm trying to perform hierarchical clustering in R. The data set I'm using is only 1.4 MB. For some reason, when I run the dist() function, I get the following error:
Error: cannot allocate vector of size 13.3 Gb
I only have 8 GB of RAM on this PC and I haven't been able to find any approach that would let me get the distance results.
Does anyone have any suggestions on how I can get the output of a function whose result is larger than my RAM capacity?
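A hedged note: dist() allocates an n(n-1)/2-element distance matrix, so its memory use grows with the square of the number of rows rather than with the size of the input file; roughly 60,000 rows already needs about 13 GB of doubles. A common workaround is to cluster a random subsample, sketched below (object names, the subsample size, and k are illustrative):
# Cluster a subsample whose distance matrix fits in RAM.
set.seed(1)
idx    <- sample(nrow(mydata), 5000)         # illustrative subsample size
d      <- dist(mydata[idx, ], method = "euclidean")
hc     <- hclust(d, method = "ward.D2")
groups <- cutree(hc, k = 5)                  # k chosen for illustration
plot(hc)                                     # dendrogram of the subsample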

Resample or aggregate large raster limiting memory use R

I'm trying to resample a huge WorldPop population raster, but I keep crashing my Linux instance of R, which has 32 GB of memory. When I launch a 52 GB memory Google Compute instance, the code below works, but it crashes my regular 32 GB RAM computer.
Is there a way to do either a raster aggregation or resampling LIMITING MEMORY USE?
Download code for a large worldpop raster that I am having issues resampling:
### download the huge raster to recreate the scale problem
devtools::install_github("nbarsch/spaceheater")
library(spaceheater)
getWPdownload("Tanzania", "Population", "adj", 2015)  # warning: downloads a ~1 GB file
library(raster)
wpras <- raster("TANZANIA_Population_adj_2015.tif")
Two methods that work on a computer with 52 GB RAM but kill my local computer with 32 GB RAM:
# aggregate method
agras <- raster::aggregate(wpras, fact = 10, expand = TRUE)
# returns "Killed"
# resample method
reras  <- raster(nrow = ceiling(nrow(wpras)/10), ncol = ceiling(ncol(wpras)/10))
reras2 <- raster::resample(wpras, reras, method = "bilinear")
# returns Error: cannot allocate vector of size 1.3 Gb
Anyone have a solution that doesn't use all the ram? Thanks!
You probably should update the raster package. The previous short-lived release (2.7-15) had an error in the memory settings. Version 2.8-4 should have fixed that.
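If updating alone is not enough, raster can also be told to keep less in RAM, which forces block-by-block processing with temporary files on disk; writing the result straight to a file helps as well. A hedged sketch (the limits and output file name are illustrative, not tuned; see ?rasterOptions for the exact units in your raster version):
library(raster)

# Cap how much raster keeps in memory so big operations run block-by-block.
rasterOptions(maxmemory = 1e9, chunksize = 1e8)   # illustrative limits

wpras <- raster("TANZANIA_Population_adj_2015.tif")
agras <- raster::aggregate(wpras, fact = 10, expand = TRUE,
                           filename = "tanzania_pop_agg_10x.tif")  # write result to disk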

R foreach error unable to allocate size 34.6 GB when I have 140GB+ memory allocated to R

I am trying to use R's doSNOW and foreach packages to parallelize ranger random forest classification predictions on a file that has a little over 15MM records and 82 attributes.
I am using an AWS EC2 instance of size m4.10xlarge. It is a Windows OS and R is 64-bit. I have definitely allocated over 100 GB of RAM to R and I have cleared out any garbage memory. I only have a ranger random forest model and the 15MM-record data frame as objects in R, yet when I try to run my code (regardless of the splits/chunks I try), I get errors similar to the following:
error calling combine function:
<simpleError: cannot allocate vector of size 34.6 Gb>
For each record, I'm predicting the class at each tree so I can later calculate class probabilities. I can run the prediction easily on the entire file, but it's too slow and I have a lot more files to predict on. All I'm trying to do is speed up the process, but no matter what I do, I keep getting this error. I have spent hours online looking for info, but cannot find anyone else with this issue.
Here is my prediction code:
library(doSNOW)     # parallel backend for foreach (also loads snow and foreach)
library(itertools)  # provides isplitRows() for splitting the data frame

num_splits <- 7
cl <- makeCluster(num_splits, outfile = "cluster.txt")
registerDoSNOW(cl)
writeLines(c(""), "log.txt")
system.time(
  predictions <- foreach(d = isplitRows(df2, chunks = num_splits),
                         .combine = rbind, .packages = c("ranger")) %dopar% {
    sink("log.txt", append = TRUE)                  # redirect this worker's output
    predict(model1, data = d, predict.all = TRUE)$predictions
  }
)
stopCluster(cl)
Does anyone know why I'm getting this error and how to fix it? Is it possible I'm allocating too much memory to R and that's why?
Or could the way I'm combining the results be causing issues -- I'm trying to stack predictions from each chunk on top of each other so I have a row of tree level predictions for each of the 15MM records.
Thanks so much!
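A hedged note: with predict.all = TRUE each chunk returns a rows-by-trees matrix, and the rbind combine forces the master process to hold every chunk's matrix at once, which is a plausible source of the 34.6 GB allocation. One workaround under that assumption is to save each chunk's predictions to disk inside the worker and return only the file path, sketched below (file names are illustrative):
library(doSNOW)
library(itertools)   # isplitRows(); icount() comes from the iterators package it loads

cl <- makeCluster(num_splits)
registerDoSNOW(cl)

chunk_files <- foreach(d = isplitRows(df2, chunks = num_splits), i = icount(),
                       .combine = c, .packages = c("ranger")) %dopar% {
  p   <- predict(model1, data = d, predict.all = TRUE)$predictions
  out <- sprintf("tree_preds_chunk_%02d.rds", i)   # illustrative file name
  saveRDS(p, out)     # keep the large matrix on disk, not in the master's RAM
  out                 # return only the path
}

stopCluster(cl)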

biglm predict unable to allocate a vector of size xx.x MB

I have this code:
library(biglm)
library(ff)
myData <- read.csv.ffdf(file = "myFile.csv")
testData <- read.csv(file = "test.csv")
form <- dependent ~ .
model <- biglm(form, data=myData)
predictedData <- predict(model, newdata=testData)
the model is created without problems, but when I make the prediction... it runs out of memory:
unable to allocate a vector of size xx.x MB
Any hints?
Or how could I use ff to reserve memory for the predictedData variable?
I have not used the biglm package before. Based on what you said, you ran out of memory when calling predict, and your new dataset has nearly 7,000,000 rows.
To resolve the memory issue, the prediction must be done chunk-wise, for example by iteratively predicting 20,000 rows at a time. I am not sure whether predict.bigglm can do chunk-wise prediction.
Why not have a look at the mgcv package? bam can fit linear models / generalized linear models / generalized additive models, etc., for large data sets. Similar to biglm, it performs chunk-wise matrix factorization when fitting the model. But predict.bam supports chunk-wise prediction, which is really useful for your case. Furthermore, it does parallel model fitting and prediction, backed by the parallel package [use the cluster argument of bam(); see the examples under ?bam and ?predict.bam].
Just do library(mgcv), and check ?bam and ?predict.bam.
Remark
Do not use the nthreads argument for parallelism. That is not useful for parametric regression.
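A hedged sketch of that chunk-wise route with mgcv, reusing the object names from the question; the formula terms and cluster size are placeholders, and it assumes myData can be held as an ordinary data frame (bam() does not read ffdf objects directly):
library(mgcv)
library(parallel)

cl <- makeCluster(4)                       # illustrative cluster size

# bam() fits the model with chunk-wise matrix factorization; explicit terms
# are used here because the real predictors are not shown in the question.
model <- bam(dependent ~ x1 + x2, data = as.data.frame(myData), cluster = cl)

# predict.bam() processes newdata in blocks and can reuse the same cluster.
predictedData <- predict(model, newdata = testData,
                         block.size = 50000, cluster = cl)

stopCluster(cl)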
Here are the possible causes and solutions:
Cause: You're using 32-bit R
Solution: Use 64-bit R
Cause: You're just plain out of RAM
Solution: Allocate more RAM if you can (?memory.limit). If you can't, then consider using ff, working in chunks, running gc(), or, at worst, scaling up by leveraging a cloud. Chunking is often the key to success with Big Data -- try doing the projections 10% at a time, saving the results to disk after each chunk and removing the in-memory objects after use (see the sketch after this list).
Cause: There's a bug in your code leaking memory
Solution: Fix the bug -- this doesn't look like your case; however, make sure that your data is of the expected size and keep an eye on your resource monitor to make sure nothing funny is going on.
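For the chunking route (and the question's ask about using ff to hold predictedData), a hedged sketch that predicts in blocks and writes each block into a pre-allocated on-disk ff vector; the chunk size is illustrative, and model and testData are the objects from the question:
library(biglm)
library(ff)

n         <- nrow(testData)
chunkSize <- 20000                               # illustrative chunk size
predFF    <- ff(vmode = "double", length = n)    # on-disk vector for the predictions

for (start in seq(1, n, by = chunkSize)) {
  end <- min(start + chunkSize - 1, n)
  # Only one block's model matrix exists in RAM at a time.
  predFF[start:end] <- as.numeric(predict(model, newdata = testData[start:end, ]))
}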
I've tried biglm and mgcv, but memory and factor problems came quickly. I have had some success with the h2o library.
