I am writing an application that has to store a lot of data in memory to improve calculation performance.
The data is a hierarchy of lists and objects whose top-level container is a QList<myObject*>. When loading data, many myObject instances are created with new and added to the list. Memory consumption grows, and when it reaches ~1.9 GB the program crashes. My computer (Vista) has 4 GB of RAM, and I have tested on other computers with less RAM (XP); it crashes at the same point. Can I really not use more than 1.9 GB of RAM?
When a smaller file is loaded and memory usage according to the Windows Task Manager is (say) 1.2 GB, I can work with the data. But if I then load another file, the growth starts from 1.2 GB even after I call delete on all the objects and clear the list. Why?
I tried switching to QVector and calling squeeze(), but the memory usage stays the same. I have read the other threads here about dynamic memory allocation in QList, but is there really no way to release the memory before I load a new file? This matters especially because of the crash at ~1.9 GB: loading three small files in a row is enough to reach it.
Thanks a lot for any suggestions.
If you have 32-bit Windows, then a user-mode process gets only 2 GB of address space by default; the rest of the 32-bit address space is reserved for the kernel. If you need more memory, consider switching to 64-bit Windows and a 64-bit build of your application.
I'm running RStudio x64 on Windows 10 with 16 GB of RAM. RStudio seems to run out of memory when allocating large vectors, in this case a 265 MB one. I've gone through multiple tests and checks to identify the problem:
Memory limit checks via memory.limit() and memory.size() (see the snippet after this list). The memory limit is ~16 GB and the objects stored in the environment total ~5.6 GB.
Garbage collection via gc(). This removes some 100s of MBs.
Upped priority of rsession.exe and rstudio.exe via Task Manager to real-time.
Ran chkdsk and RAM diagnostics on system restart. Both returned no errors.
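For reference, the memory checks in the first item look roughly like this; the last line, which lists the largest objects in the workspace, is an addition to help find what is occupying the space:

library(utils)
memory.limit()                      # maximum memory R may use (Windows only)
memory.size()                       # memory currently in use by this session
memory.size(max = TRUE)             # peak memory obtained from the OS so far
gc()                                # force garbage collection and report usage
# list the largest objects in the global environment, in bytes
head(sort(sapply(ls(), function(nm) as.numeric(object.size(get(nm, envir = globalenv())))),
          decreasing = TRUE))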
But the problem persists. It seems to me that R can access 16 GB of RAM (and shows 16 GB committed in Resource Monitor), but somehow is still unable to make a large vector. My main confusion is this: the problem only begins appearing if I run code on multiple datasets consecutively, without restarting RStudio in between. If I do restart RStudio, the problem goes away, at least for a few runs.
The error should be replicable with any large R vector allocation (see e.g. the code here). I'm guessing the fault is software, some kind of memory leak, but I'm not sure where or how, or how to automate a fix.
Any thoughts? What have I missed?
Some confidential data is stored on a server and accessible to researchers via remote access.
Researchers can log in via a remote client (Cisco, I think) and share virtual machines on the same host.
The virtual machine runs 64-bit Windows.
The system appears to be set up for Stata; I'm among the first to use the data with R. There is no RStudio installed on the client, just RGui 3.0.2.
And here's my problem: the data is saved in the Stata format (.dta), and I need to open it in R. At the moment I am doing
read.dta(fileName, convert.factors = FALSE)[fields]
Loading a smaller file (around 200 MB) takes 1-2 minutes. However, loading the main file (3-4 GB) takes longer than my patience lasted, and during that time the R GUI stops responding.
I can test my code on my own machine (OS X, RStudio) with a smaller data sample, and everything works fine. Is this
because of OS X + RStudio, or only
because of the size of the file?
A colleague is using Stata on a similar file in the same environment, and that works fine for him.
What can I do to improve the situation? Possible solutions I came up with are:
Load the data into R differently (perhaps in a way that doesn't require so much memory). I also have access to Stata, so if all else fails I could prepare the data there, for example slice it into smaller pieces and reassemble it in R (a rough sketch of this option follows the list below).
Ask them to allocate more memory to my user of the VM (if that indeed is the issue)
Ask them to provide RStudio as a backend (even if that's not faster, perhaps it's less prone to crashes)
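A rough sketch of the first option, assuming the readstata13 package can be installed on the server and can read this .dta version; the column names and the "slices" directory are purely illustrative:

library(readstata13)
# read only the needed columns (and optionally rows) to keep memory usage down
dat <- read.dta13(fileName, convert.factors = FALSE,
                  select.cols = c("id", "year", "income"))

# alternative: export slices to CSV from Stata, then reassemble with data.table
library(data.table)
dat <- rbindlist(lapply(list.files("slices", full.names = TRUE), fread))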
Certainly the size of the file is a prime factor, but the machine and its configuration might be, too. It's hard to tell without more information. You need a 64-bit operating system and a 64-bit version of R.
I don't imagine that RStudio will help or hinder the process.
If the process scales linearly, your big file will take about (120 seconds) × (4096 MB / 200 MB) ≈ 2458 seconds, or roughly 40 minutes. Is that how long you waited?
The process might not be linear.
Was the process making progress? When you checked CPU and memory, was it still running? Was it doing a lot of page swapping?
Let's say I have a 4 GB dataset on a server with 32 GB of RAM.
I can read all of that into R, make a data.table global variable, and have all of my functions use that global as a kind of in-memory database. However, when I exit R and restart, I have to read it from disk again. Even with smart disk caching strategies (save/load or R.cache) there is a delay of about 10 seconds to get the data in. Copying the data takes about 4 seconds.
Is there a good way to cache this in memory that survives the exit of an R session?
A couple of things come to mind: Rserve, redis/rredis, memcached, multicore...
Shiny Server and RStudio Server also seem to have ways of solving this problem.
But then again, it seems to me that perhaps data.table could provide this functionality since it appears to move data outside of R's memory block anyway. That would be ideal in that it wouldn't require any data copying, restructuring etc.
Update:
I ran some more detailed tests and I agree with the comment below that I probably don't have much to complain about.
But here are some numbers that others might find useful. I have a 32 GB server. I created a 4 GB data.table. According to gc(), and also looking at top, it appeared to use about 15 GB of peak memory, and that includes making one copy of the data. That's pretty good, I think.
I wrote it to disk with save(), deleted the object, and used load() to recreate it. This took 17 and 10 seconds respectively.
I did the same with the R.cache package, and that was actually slower: 23 and 14 seconds.
However, both of those reload times are quite fast. The load() method gave me a transfer rate of about 357 MB/s. By comparison, an in-memory copy took 4.6 seconds. This is a virtual server; I'm not sure what kind of storage it has or how much the read speed is influenced by the cache.
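For reference, the kind of timing run described above looks roughly like this (object and file names are illustrative, not the exact code I used):

library(data.table)
dt <- data.table(x = runif(2.5e8), y = runif(2.5e8))   # two numeric columns, roughly 4 GB
print(object.size(dt), units = "auto")

system.time(save(dt, file = "/tmp/dt.RData"))          # write to disk (~17 s in my test)
rm(dt); invisible(gc())
system.time(load("/tmp/dt.RData"))                     # reload (~10 s in my test)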
Very true: data.table hasn't got to on-disk tables yet. In the meantime, some options are:
Don't exit R. Leave it running on a server and use svSocket to evalServer() to it, as the video on the data.table homepage demonstrates (a rough sketch appears at the end of this answer). Or use one of the other similar options you mentioned.
Use a database for persistence, either SQL or any of the NoSQL databases.
If you have large delimited files, some people have recently reported that fread() appears (much) faster than load(). But experiment with compress=FALSE in save(). Also, we've just pushed fwrite to the current development version (1.9.7; use devtools::install_github("Rdatatable/data.table") to install), which has some reported write times on par with native save() (a quick comparison sketch follows this list).
Packages ff, bigmemory and sqldf, too. See the HPC Task View, the "Large memory and out-of-memory data" section.
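The comparison mentioned in the fread/fwrite item could look something like this; file paths and the object name dt are illustrative, and fwrite needs the 1.9.7 development version:

library(data.table)
# binary route: save()/load(), with compression turned off for speed
system.time(save(dt, file = "dt.RData", compress = FALSE))
system.time(load("dt.RData"))

# delimited route: fwrite()/fread()
system.time(fwrite(dt, "dt.csv"))
system.time(dt2 <- fread("dt.csv"))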
In enterprises where data.table is being used, my guess is that it is currently mostly being fed with data from some other persistent database. Those enterprises probably:
use 64-bit systems with, say, 16 GB, 64 GB or 128 GB of RAM. RAM is cheap these days. (But I realise this doesn't address persistence.)
The internals have been written with on-disk tables in mind. But don't hold your breath!
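As a rough illustration of the first option (don't exit R); the host name, port and object names here are assumptions, not taken from the video:

# on the server: an R session that never exits and holds the data.table
library(svSocket)
library(data.table)
dt <- fread("big.csv")             # the large table stays in this session's memory
startSocketServer(port = 8888)

# on a client: connect and evaluate expressions inside the server session
library(svSocket)
con <- socketConnection(host = "myserver", port = 8888)
evalServer(con, dt[, .N])          # runs in the server process; only the result comes back
close(con)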
If you really need to exit R between computation sessions for some reason, and the server is not restarted, then just make a 4 GB ramdisk and store the data there. Loading data from RAM to RAM is much faster than from any SAS or SSD drive :)
This can be set up easily on Linux by adding a line like the following to /etc/fstab:
none /data tmpfs nodev,nosuid,noatime,size=5000M,mode=1777 0 0
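With that mount in place (path /data as above), the R side is just the usual save()/load() pointed at the ramdisk; the object name dt is illustrative:

save(dt, file = "/data/dt.RData", compress = FALSE)   # write the table to the RAM-backed mount
# ...exit R, start a new session later...
load("/data/dt.RData")                                # reload from RAM rather than from disk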
Depending on what your dataset looks like, you might consider using the ff package. If you save your dataset as an ffdf, it will be stored on disk, but you can still access the data from R.
ff objects have a virtual part and a physical part. The physical part is the data on disk; the virtual part gives you information about the data.
To load the dataset in R, you only load the virtual part, which is a lot smaller, maybe only a few KB, depending on whether you have a lot of factor data. So the data is loaded into R in a matter of milliseconds instead of seconds, while you still have access to the physical data for your processing.
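A minimal sketch of that workflow, assuming the ff and ffbase packages are available; object and directory names are illustrative:

library(ff)
library(ffbase)
big_ff <- as.ffdf(big_df)                 # move an in-memory data.frame into on-disk ff files
save.ffdf(big_ff, dir = "~/ffdb")         # persist the ffdf and its metadata to a directory

# in a later session: load only the small virtual part; the data stays on disk
load.ffdf(dir = "~/ffdb")
mean(big_ff$some_column[])                # materialises just this column in RAM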
I am running some basic data manipulation on a MacBook Air (4 GB memory, 120 GB HD with 8 GB available). My input file is about 40 MB, and I don't write anything to disk until the end of the process. However, in the middle of the process, my Mac says there is no memory left to run. I checked the hard drive and found there is only about 500 MB left.
So here are my questions:
How is it possible that R filled up my disk so quickly? My understanding is that R stores everything in memory (unless I explicitly write something out to disk).
If R does write temporary files to the disk, how can I find these files and delete them?
Thanks a lot.
Update 1: the error message I got:
Force Quit Applications: Your Mac OS X startup disk has no more space available for
application memory
Update 2: I checked tempdir() and it shows "var/folders/k_xxxxxxx/T//Rtmpdp9GCo", but I can't locate this directory in Finder.
Update 3: After running unlink(tempdir(), recursive = TRUE) in R and restarting my computer, I got my disk space back. I would still like to know whether R writes to my hard drive, so I can avoid similar situations in the future.
Update 4: My main object is about 1 GB. I used Activity Monitor to track the process, and while memory usage is about 2 GB, disk activity is extremely high: data read 14 GB, data written 44 GB. I have no idea what R is writing.
R writes to a temporary per-session directory which it also cleans up at exit.
It follows convention and respects TMP and related environment variables.
What makes you think that disk space has anything to do with this? R needs all objects to be held in memory, not on disk (by default; there are add-on packages that allow a subset of operations on files stored on disk that are too big to fit into RAM).
One of the steps in the "process" caused R to request a chunk of RAM from the OS so it could continue. The OS could not comply, and so R terminated the "process" you were running with the error message you didn't give us. (Hint: it would help if you showed the actual error, not your paraphrasing of it. Some inkling of the code you were running would also help. 40 MB on disk sounds like a reasonably large file; how many rows/columns does it have? How big is the object within R, according to object.size()?)
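As a rough diagnostic along the lines asked for above (the object name my_big_object is a placeholder):

object.size(my_big_object)                                  # in-memory size of the main object
format(object.size(my_big_object), units = "Mb")

tempdir()                                                   # R's per-session temporary directory
files <- list.files(tempdir(), full.names = TRUE, recursive = TRUE)
sum(file.info(files)$size) / 1024^2                         # MB of temporary files written so far

unlink(files)                                               # remove finished temporary files
gc()                                                        # and free unused R memory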
When I build my application statically, it comes out to just over 5 MB, so it's a small, simple program. However, any system with under 3 GB of RAM can't run the program; it says there's not enough memory. There is nothing very memory-intensive in the program, and I don't allocate large amounts of memory explicitly. Any thoughts on what's causing this?
I believe that even less than 1 MB of compiled code can easily fill 10 GB of memory. Make sure that your code does not allocate memory it doesn't need.
There was a problem with the static build. I first got it to work by exporting from the Visual Studio plugin; then I rebuilt the SDK and the program again, and everything worked fine from Qt Creator.