How to avoid the memory limit in R

I'm trying to replace values in a matrix, specifically "t"->1 and "f"->0, but I keep getting the error messages:
Error: cannot allocate vector of size 2.0 Mb
...
Reached total allocation of 16345Mb: see help(memory.size)
I'm using a Win7 computer with 16GB of memory on the 64-bit version of R in RStudio.
What I'm currently running is:
a <- matrix( dataset, nrow=nrow(dataset), ncol=ncol(dataset), byrow=TRUE)
memory.size()
a[a=="t"] <- 1
where dataset is a data frame of roughly 525,000 x 300. The memory.size() line reports less than 4 GB used, and memory.limit() is 16 GB. Why does the replacement line require so much memory to execute? Is there any way to do the replacement without hitting the memory limit (and are there any good tips on avoiding it in general), and if so, will it cost me a lot of time to run? I'm still pretty new to R, so I don't know whether it makes a difference which data class I use or how R allocates memory...

when you call this line
a[a=="t"] <- 1
R has to create a whole new logical matrix (the result of a == "t") to index into a. If a is huge, that logical matrix will also be huge.
Maybe you can try working on smaller sections of the matrix instead of trying to do it all in one shot.
for (i in 1:ncol(a)) {
  ix <- (a[, i] == "t")   # logical index for one column at a time
  a[ix, i] <- 1
}
It's not fast or elegant, but it might get around the memory problem.
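If the matrix really only contains "t" and "f", a variation on the same idea (just a sketch, under that assumption) skips the character replacement entirely: the comparison itself already yields the 0/1 coding, so you can build an integer matrix column by column, which is also far smaller than a character matrix:
b <- matrix(0L, nrow = nrow(a), ncol = ncol(a))   # integer storage instead of character
for (i in seq_len(ncol(a))) {
  b[, i] <- as.integer(a[, i] == "t")             # TRUE -> 1L, FALSE -> 0L, one column at a time
}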

Related

How to fix "Error: cannot allocate vector of size 265.6 Mb"

I have a very large list of dataframes (300 dataframes, each with 2 columns and 300~600 rows), and I want to join all of them with
final <- subset %>% reduce(full_join, by = "Frame_times")
When I try to do this, however, I get the following error:
Error: cannot allocate vector of size 265.6 Mb
I am operating on 64-bit Windows 10 with the latest installation of 64-bit R (4.0.0).
I have 8 GB of RAM, and
> memory.limit()
[1] 7974
> memory.size(max = TRUE)
[1] 7939.94
I have also tried the gc() function, but it did not help.
It appears that I have enough space and memory to run this, so why am I getting this error?
And how can I fix it?
Thank you very much!
You are running out of RAM. A first troubleshooting step might be to run this code on a smaller subset of dataframes (say, 3). Are the results (in particular, the number of rows) what you were expecting? If yes and it's really doing the right thing, then it might help to do it in batches (say, 5 batches of 100; see the sketch below). The most likely scenario is that for some reason the number of rows or columns is blowing up to a much bigger number than you're expecting.
The 266 Mb mentioned in the error is just the final straw, not the total memory you're using.
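A minimal sketch of that batching idea, assuming the list is called subset and the join key is Frame_times as in the question (whether it actually saves memory depends on how much each intermediate join grows):
library(dplyr)
library(purrr)

batches <- split(subset, ceiling(seq_along(subset) / 100))                        # ~5 batches of 100 dataframes
partials <- lapply(batches, function(b) reduce(b, full_join, by = "Frame_times")) # join within each batch
final <- reduce(partials, full_join, by = "Frame_times")                          # then join the partial results
Checking nrow() of each partial result as you go will quickly show whether the joins are exploding in size.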

How to manipulate a huge data set in R?

First of all, I'm sorry for probably duplicating this question.
But I've looked at a lot of other similar questions and haven't been able to solve my problem.
Well, I'm working with a huge data set, which contains 184,903,890 rows: an object of over 6.5 GB.
This csv file can be reached on this link: Ad Tracking Fraud Detection Challenge
I'm running it in a pc with the following specifications:
i7 - 7700K - 4.2GHz
16GB Ram
GeForce GTX 1080 Ti with 11.2GB DDR 5
But even when I try to set a column as Date, the system stops working.
Is it possible to deal with this size of data set using only R?
Code details:
training <- fread('train.csv')
Some attempts that either freeze R or return "cannot allocate vector of size ...":
training$click_time <- as.Date(training$click_time)
training$click_time <- as.POSIXct(training$click_time, 'GMT')
training <- training %>% mutate(d_month = sapply(click_time, mday))
Additional updates:
I've already used gc() to clean the memory;
I've already selected only 2 columns to a new data set;
Maybe you have reached the memory limit assigned to R. Check it with memory.limit(), and if needed you can increase the default with memory.limit(size = xxxx).
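If increasing the limit is not enough, another option (just a sketch, assuming the column names from the question) is to stay in data.table, which fread() already returns, and modify columns by reference instead of piping through mutate(), since := avoids copying the whole 6.5 GB table:
library(data.table)

training <- fread("train.csv", select = "click_time")            # read only the column(s) you need
training[, click_time := as.POSIXct(click_time, tz = "GMT")]     # update by reference, no full copy
training[, d_month := mday(as.IDate(click_time))]                # vectorised day-of-month, no sapply()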

R accumulating memory in each iteration with large input files

I am reading around 20,000 text files in a for loop for sentiment analysis. Each file is around 20-40 MB. In each iteration I extract some sentiment counts (just two numbers) from the input text and store them in a dataframe. The issue is that R's memory use keeps growing with every iteration. After 10,000 files I see around 13 GB allocated to R in my task manager. I tried gc() and rm() to delete objects after each iteration, but it still does not work. My reading is that because I am reusing the same objects in every iteration, R is not releasing the memory used in previous iterations.
for (i in 1:20000) {
  filename <- paste0("file_", i, ".txt")
  text <- readLines(filename)
  # Doing sentiment analysis based on a dictionary-based approach
  # Storing sentiment counts in a dataframe
  # Removing used objects
  rm(filename, text)
  gc()
}
You could check which objects are taking up memory that you no longer use:
print(sapply(ls(), function(x) pryr::object_size(get(x))/1024/1024))
(EDIT: just saw the comment with this almost identical advice)
This line gives you the size in megabytes of every object present in the environment (in RAM).
Alternatively, if nothing suspicious shows up, you can call gc() several times instead of once, like:
rm(filename, text)
for (i in 1:3) gc()
It is usually more effective...
If nothing works, it could mean the memory is fragmented: RAM is technically free but unusable because it sits in small gaps between data you still use.
The solution could be to run your script in chunks of files, say 1000 at a time, as sketched below.
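A minimal sketch of that chunking idea; the get_sentiment_counts() helper is hypothetical and stands in for the dictionary-based scoring from the question:
chunk_size <- 1000
n_files <- 20000
results <- vector("list", n_files / chunk_size)

for (chunk in seq_along(results)) {
  idx <- ((chunk - 1) * chunk_size + 1):(chunk * chunk_size)
  counts <- lapply(idx, function(i) {
    text <- readLines(paste0("file_", i, ".txt"))
    get_sentiment_counts(text)       # hypothetical helper returning the two sentiment counts
  })
  results[[chunk]] <- do.call(rbind, counts)
  rm(counts); gc()                   # drop the finished chunk before starting the next one
}

sentiments <- do.call(rbind, results)
Restarting R between chunks (for example, driving the chunks from a shell script) is an even more thorough way to hand memory back to the operating system.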

apcluster in R: Memory limitation

I am trying to run a clustering exercise in R. The algorithm that I used is apcluster(). The script that I used is:
s1 <- negDistMat(df, r=2, method="euclidean")
apcluster <- apcluster(s1)
My data set has around 0.1 million rows. When I ran the script, I got the following error:
Error in simpleDist(x[, sapply(x, is.numeric)], sel, method = method, :
negative length vectors are not allowed
When I searched online, I found that the "negative length vectors" error occurs when an allocation exceeds the limits of my RAM. My question is whether there is any workaround to run apcluster() on my dataset of 0.1 million rows with the available RAM, or whether I'm missing something I need to take care of when running apcluster in R.
I have a machine with 8 GB of RAM.
The standard version of affinity propagation implemented in the apcluster() method will never ever run successfully on data of that size. On the one hand, the similarity matrix (s1 in your code sample) will have 100K x 100K = 10G entries. On the other hand, computation times will be excessive. I suggest you use apclusterL() instead.
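A rough sketch of what that could look like; leveraged affinity propagation only computes similarities against a random fraction of the samples instead of the full 100K x 100K matrix. The frac and sweeps values below are illustrative guesses, so check ?apclusterL for the exact arguments and sensible settings for your data:
library(apcluster)

# Leveraged AP: compare all rows against a 1% random sample, repeated for 5 sweeps,
# instead of materialising the full similarity matrix.
apres <- apclusterL(s = negDistMat(r = 2), x = df, frac = 0.01, sweeps = 5)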

R data.table growing table size when adding columns by reference

I'm using R to deal with some data that is not huge but big enough to cause problems with the available memory.
(I'm using a 32-bit system with 3 GB RAM; there is no possibility to use another system.)
I found that the package data.table should be a good way to do memory-efficient calculations. In particular, this post dealing with joining tables without copying seems to help:
data.table join then add columns to existing data.frame without re-copy
When doing some tests I found that, even when using references, table sizes increase quite fast:
#rm(list=ls()); gc();
library(data.table);
n <- 7000000;
A <- data.table(a=1:n, z="sometext", key="a");
B <- data.table(a=1:n, b=rnorm(n, 1), key="a");
#ACopy<-A[B, .(b=i.b, c=i.b, d=i.b, e=i.b, f=i.b, g=i.b, h=i.b, j=i.b, k=i.b, l=i.b, m=i.b)];
A[B, ':='(b=i.b, c=i.b, d=i.b, e=i.b, f=i.b, g=i.b, h=i.b, j=i.b, k=i.b, l=i.b, m=i.b)]
object.size(A);
When increasing n in the above example I get a "cannot allocate vector of size ..." error. I was surprised that this error already shows up at a table size of about 600 MB. (I know that not all of the 3 GB can be used, but 1.5 GB should be feasible.) Could anyone explain why the error appears at only about 600 MB? (The workspace is clear and no other memory-hungry applications are running.)
ACopy does not use data.table's reference features, so an object limit of ~600 MB seems reasonable to me, since some copying is done there. What surprised me is a) that ACopy is smaller than A and b) that the reference solution results in such a big object (I expected it to be much smaller because of the reference). As you can see, I'm new to this and would be glad if anyone could explain.
Thanks,
Michael
