I'm relatively new to R and am currently trying to run a SIMPROF analysis (clustsig package) on a small dataset of 1000 observations and 24 variables. After ~30 iterations I receive the following error:
Error: cannot allocate vector of size 1.3 Mb.
In addition: There were 39 warnings (use warnings() to see them)
All the additional warnings relate to R reaching a total allocation of 8183Mb
The method I'm using to run the analysis is below.
library(clustsig)
Data <- read.csv(file.choose(), header=TRUE, colClasses="numeric")
Matrix <- function(Data) vegan::vegdist(Data, method="gower")  # Gower dissimilarity via vegan
SimprofOutput <- simprof(Data, num.expected=1000, num.simulated=999,
                         method.cluster="average", method.distance=Matrix,
                         alpha=0.10, silent=FALSE, increment=100)
I'm wondering if anybody else has had trouble running the SIMPROF analysis or any ideas how to stop R running out of RAM. I'm running 64 bit Win7 Enterprise and using R 2.15.1 on a machine with 8gb RAM
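For reference, here is a quick diagnostic I can run (using the same data loaded above) to see how much memory a single Gower dissimilarity matrix needs on its own; the repeated allocations from the 999 simulated data sets seem to be what pushes R toward the 8 GB limit, rather than the input itself:
# Size of one Gower dissimilarity matrix for the 1000 x 24 data set
d <- vegan::vegdist(Data, method = "gower")
print(object.size(d), units = "Mb")   # roughly 4 Mb for 1000 observations
gc()                                  # report current memory use and free unused allocations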
I am doing logistic regression on a data set with dimensions 190000 x 53 (91 MB), a mix of categorical and numeric data, but I am facing serious memory issues: R hangs every time I run the logistic regression code.
I have tried to reduce the sample by taking only 10% of the data, but R got stuck again.
Error: cannot allocate vector of size 12.9 Gb
I have tried to increase the memory using memory.limit() and memory.size(), but it has not helped:
> memory.size()
[1] 152.98
> memory.limit()
[1] 1.759219e+13
I have 8 GB of physical RAM, but if I increase the memory limit to 16 GB using memory.limit() I get this warning:
> memory.limit(size=16000)
[1] 1.759219e+13
Warning message:
In memory.limit(size = 16000) : cannot decrease memory limit: ignored
Can anyone let me know how to solve this problem? I am using R 4.0.2 on Windows 10 (64-bit).
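A sketch of one possible workaround (untested on my data; it assumes the biglm package is installed, and mydata, response, x1, x2 are placeholders for my actual names) is to fit the model in chunks with biglm::bigglm, so the full 190000 x 53 model matrix never has to be held in memory at once:
library(biglm)
# Chunked logistic regression: bigglm() updates the fit piece by piece.
# 'mydata', 'response', 'x1', 'x2' are placeholder names.
fit <- bigglm(response ~ x1 + x2,
              data = mydata,
              family = binomial(),
              chunksize = 10000)
summary(fit)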
I am trying to fit a LightGBM multiclass model on a large dataframe:
train_data = lgb.Dataset(train_df[v1].values, label=label)
The dataframe is 631,761 rows x 1,786 columns (2.2 GB). This runs fine. However, there is one column with 10,000 unique classes, which I am currently using in the model with the help of pd.factorize. I want to expand it into indicators for each class and use those instead, as below:
train_data = lgb.Dataset(train_df[v1].values, label=label, feature_name=v1, categorical_feature=['ward_id'])
This transformation leads to a memory error. Is there an efficient way to do it without running out of memory?
Here is my configuration:
Core i7, 16 GB RAM.
I am setting up a piece of code to parallel-process some computations for N groups in my data using foreach.
I have a computation that involves a call to h2o.gbm.
In my current, sequential set-up, I use up to about 70% of my RAM.
How do I correctly set up my h2o.init() within the parallel piece of code? I am afraid that I might run out of RAM when I use multiple cores.
My Windows 10 machine has 12 cores and 128GB of RAM.
Would something like this pseudo-code work?
library(foreach)
library(doParallel)
library(data.table)  # for fwrite
library(h2o)
#setup parallel backend to use 12 processors
cl <- makeCluster(12)
registerDoParallel(cl)
#loop
df4 <- foreach(i = as.numeric(seq(1, 999)), .combine = rbind) %dopar% {
  df4 <- data.frame()
  #bunch of computations
  h2o.init(nthreads = 1, max_mem_size = "10G")
  gbm <- h2o.gbm(train_some_model)
  df4 <- data.frame(someoutput)
}
fwrite(df4, append = TRUE)
stopCluster(cl)
The way your code is currently set up won't be the best option. I understand what you are trying to do -- execute a bunch of GBMs in parallel (each on a single-core H2O cluster) so you can maximize CPU usage across the 12 cores on your machine. However, what your code will actually do is try to run all the GBMs in your foreach loop in parallel on the same single-core H2O cluster. You can only connect to one H2O cluster at a time from a single R instance, but the foreach loop will create a new R instance for each worker.
Unlike most machine learning algos in R, the H2O algos are all multi-core enabled so the training process will already be parallelized at the algorithm level, without the need for a parallel R package like foreach.
You have a few options (#1 or #3 is probably best):
Set h2o.init(nthreads = -1) at the top of your script to use all 12 of your cores. Change the foreach() loop to a regular loop and train each GBM (on a different data partition) sequentially. Although the different GBMs are trained sequentially, each single GBM will be fully parallelized across the H2O cluster. (A minimal sketch of this option follows right after this list.)
Set h2o.init(nthreads = -1) at the top of your script, but keep your foreach() loop. This should run all your GBMs at once, with each GBM parallelized across all cores. This could overwhelm the H2O cluster a bit (this is not really how H2O is meant to be used) and could be a bit slower than #1, but it's hard to say without knowing the size of your data and the number of partitions you want to train on. If you are already using 70% of your RAM for a single GBM, then this might not be the best option.
You can update your code to do the following (which most closely resembles your original script). This will preserve your foreach loop, but will create a new 1-core H2O cluster on a different port for each iteration. See the full example further below.
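A minimal sketch of what option #1 could look like (sequential training on a single multi-core H2O cluster; the three-way split of iris below is just an illustrative stand-in for your data partitions):
library(h2o)
# Option #1 sketch: one multi-core H2O cluster, GBMs trained one after another.
# Each individual GBM is parallelized across all available cores.
h2o.init(nthreads = -1, max_mem_size = "8G")
data(iris)
data <- as.h2o(iris)
partitions <- h2o.splitFrame(data, ratios = c(0.3, 0.3))  # stand-in for real partitions
models <- list()
for (i in seq_along(partitions)) {
  models[[i]] <- h2o.gbm(x = 1:4, y = "Species", training_frame = partitions[[i]])
}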
Updated R code example for option #3, which uses the iris dataset and returns the predicted class for iris as a data.frame:
library(foreach)
library(doParallel)
library(h2o)
h2o.shutdown(prompt = FALSE)
#setup parallel backend to use 12 processors
cl <- makeCluster(12)
registerDoParallel(cl)
#loop
df4 <- foreach(i = seq(20), .combine=rbind) %dopar% {
  library(h2o)
  port <- 54321 + 3*i
  print(paste0("http://localhost:", port))
  h2o.init(nthreads = 1, max_mem_size = "1G", port = port)
  df4 <- data.frame()
  data(iris)
  data <- as.h2o(iris)
  ss <- h2o.splitFrame(data)
  gbm <- h2o.gbm(x = 1:4, y = "Species", training_frame = ss[[1]])
  df4 <- as.data.frame(h2o.predict(gbm, ss[[2]]))[,1]
}
In order to judge which option is best, I would try running this on a few data partitions (maybe 10-100) to see which approach seems to scale the best. If your training data is small, it's possible that #3 will be faster than #1, but overall, I'd say #1 is probably the most scalable/stable solution.
Following Erin LeDell's answer, I just wanted to add that in many cases a decent practical solution can be something in between #1 and #3. To increase CPU utilization and still save RAM, you can run multiple H2O instances in parallel, but give each of them several cores; this loses little performance relative to running more instances with only one core each (a sketch of this setup follows the timing results below).
I ran an experiment using a relatively small 40 MB dataset (240K rows, 22 columns) on a 36-core server.
Case 1: Use all 36 cores (nthreads=36) to estimate 120 GBM models (with default hyper-parameters) sequentially.
Case 2: Use foreach to run 4 H2O instances on this machine, each using 9 cores to estimate 30 default GBM models sequentially (total = 120 estimations).
Case 3: Use foreach to run 12 H2O instances on this machine, each using 3 cores to estimate 10 default GBM models sequentially (total = 120 estimations).
Using 36 cores to estimate a single GBM model on this dataset is very inefficient. CPU utilization in Case 1 jumps around a lot, but is on average below 50%. So there is definitely something to gain by using more than one H2O instance at a time.
Runtime Case 1: 264 seconds
Runtime Case 2: 132 seconds
Runtime Case 3: 130 seconds
Given the small improvement from 4 to 12 H2O instances, I did not even run 36 H2O instances each using one core in parallel.
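For completeness, here is a minimal sketch of the Case 2 setup (4 instances x 9 cores each). The iris data, the memory size, and the logloss metric are placeholders rather than the actual benchmark code:
library(foreach)
library(doParallel)
library(h2o)
# Case 2 sketch: 4 parallel H2O instances, each using 9 cores.
cl <- makeCluster(4)
registerDoParallel(cl)
results <- foreach(i = seq(4), .combine = rbind) %dopar% {
  library(h2o)
  # Give each instance its own port so the workers don't collide.
  h2o.init(nthreads = 9, max_mem_size = "8G", port = 54321 + 3 * i)
  data(iris)                      # placeholder for the real 240K x 22 data
  train <- as.h2o(iris)
  out <- numeric(30)
  for (j in seq(30)) {            # 30 GBMs per instance -> 120 in total
    gbm <- h2o.gbm(x = 1:4, y = "Species", training_frame = train)
    out[j] <- h2o.logloss(gbm)    # placeholder metric to collect
  }
  out
}
stopCluster(cl)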
I'm trying to convert a data.frame with numeric, nominal, and NA values to a dissimilarity matrix using the daisy function from the cluster package in R. My purpose is to create a dissimilarity matrix before applying k-means clustering for customer segmentation. The data.frame has 133,153 rows and 36 columns. Here's my machine:
sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
How can I fix the daisy error?
Since the Windows computer has 3 GB of RAM, I increased the virtual memory to 100 GB, hoping that would be enough to create the matrix, but it didn't work: I still got a couple of errors about memory. I've looked into other R packages for solving the memory problem, but they don't work here. I cannot use bigmemory with the biganalytics package because it only accepts numeric matrices. The clara and ff packages also accept only numeric matrices.
CRAN's cluster package suggests the gower similarity coefficient as a distance measure before applying k-means. The gower coefficient takes numeric, nominal, and NA values.
Store1 <- read.csv("/Users/scdavis6/Documents/Work/Client1.csv", header=FALSE)
df <- as.data.frame(Store1)
save(df, file="df.Rda")
library(cluster)
daisy1 <- daisy(df, metric = "gower", type = list(ordratio = c(1:35)))
#Error in daisy(df, metric = "gower", type = list(ordratio = c(1:35))) :
#  long vectors (argument 11) are not supported in .C
**EDIT: I have RStudio linked to Amazon Web Services' (AWS) r3.8xlarge instance with 244 GB of memory and 32 vCPUs. I tried creating the daisy matrix on my computer, but did not have enough RAM.**
**EDIT 2: I used the clara function for clustering the dataset. **
#50 samples
clara2 <- clara(df, 3, metric = "euclidean", stand = FALSE, samples = 50,
rngR = FALSE, pamLike = TRUE)
If you have a lot of data, use algorithms that do not require O(n^2) memory. Swapping to disk will kill performance; it is not a sensible option.
Instead, try either to reduce your data set size, or use index acceleration to avoid the O(n^2) memory cost. (And it's not only O(n^2) memory, but also O(n^2) distance computations, which will take a long time!)
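As a rough back-of-the-envelope check for the 133,153-row data set above:
# Memory needed to hold a full pairwise dissimilarity matrix in double precision
n <- 133153
pairs <- n * (n - 1) / 2   # ~8.86e9 lower-triangle entries
pairs * 8 / 1024^3         # ~66 GB of RAM just for the distances
# That entry count also exceeds .Machine$integer.max (2^31 - 1), which appears
# to be what triggers the "long vectors ... are not supported in .C" error in daisy.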
I am running a zero-inflated regression model function on a 498,501-row dataframe on a 32 GB Linux machine with a 2.6 GHz CPU. I used the following command:
library(pscl)
zeroinfl(response ~ predictor1 + predictor2, data=dataframe, dist = "negbin", EM = TRUE)
R has now been computing for more than 48 hours with no warning or error message.
I am a bit puzzled, since a traditional lm() on the same data returns almost instantaneously. Should I suspect some problem with my command? Or is it just a very slow function?
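One sanity check (a rough sketch; the 5,000-row subsample size is an arbitrary choice) would be to time the same call on a small random subset and see how the runtime grows, since zeroinfl() with EM = TRUE alternates between fitting the count and zero components and is far more expensive than a single lm() fit:
library(pscl)
# Time the same model on a small random subsample to gauge how runtime scales.
set.seed(1)
small <- dataframe[sample(nrow(dataframe), 5000), ]
system.time(
  zeroinfl(response ~ predictor1 + predictor2, data = small,
           dist = "negbin", EM = TRUE)
)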