Loading .dta data into R takes a long time - r

Some confidential data is stored on a server and accessible to researchers via remote access.
Researchers log in via a remote client (Cisco, I think) and share virtual machines on the same host.
The virtual machine runs 64-bit Windows.
The system appears to be optimized for Stata; I'm among the first to use the data with R. There is no RStudio installed on the client, just RGui 3.0.2.
Here's my problem: the data is saved in Stata format (.dta), and I need to open it in R. At the moment I am doing
read.dta(fileName, convert.factors = FALSE)[fields]
Loading a smaller file (around 200 MB) takes 1-2 minutes. However, loading the main file (3-4 GB) takes longer than my patience lasted, and during that time the R GUI stops responding.
I can test my code on my own machine (OS X, RStudio) on a smaller data sample, and that works fine. Is the difference because of OS X + RStudio, or only because of the size of the file?
A colleague uses Stata on a similar file in the same environment, and that works fine for him.
What can I do to improve the situation? Possible solutions I came up with were:
Load the data into R differently (perhaps there is a way that requires less memory). I also have access to Stata, so if all else fails I could prepare the data there, for example slice it into smaller pieces and reassemble them in R.
Ask them to allocate more memory to my user of the VM (if that indeed is the issue).
Ask them to provide RStudio as a backend (even if that's not faster, perhaps it's less prone to crashes).

Certainly the size of the file is a prime factor, but the machine and configuration might be, too. Hard to tell without more information. You need a 64 bit operating system and a 64 bit version of R.
I don't imagine that RStudio will help or hinder the process.
If the process scales linearly, your big-data case will take (120 seconds) * (4096 MB / 200 MB) = 2458 seconds, or roughly 40 minutes. Is that how long you waited?
The process might not be linear.
Was the process making progress? If you checked CPU and memory, was it still running? Was it doing a lot of page swaps?
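One way to act on the "load the data differently" idea in the question, assuming you are allowed to install packages on the remote machine: both readstata13 and haven are commonly reported to read .dta files faster than foreign::read.dta, and both can restrict the read to the columns you need. fileName and fields below are the objects from the question; treat this as a sketch, not a drop-in fix.

# Option 1: readstata13 - select.cols reads only the listed columns
library(readstata13)
dat <- read.dta13(fileName, select.cols = fields, convert.factors = FALSE)

# Option 2: haven - col_select avoids materialising unneeded columns
library(haven)
dat <- read_dta(fileName, col_select = tidyselect::all_of(fields))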

Related

Rstudio potential memory leak / background activity?

I’m having a lot of trouble working with Rstudio on a new PC. I could not find a solution searching the web.
When Rstudio is on, it constantly eats up memory until it becomes unworkable. If I work on an existing project, it takes half an hour to an hour to become impossible to work with. If I start a new project without loading any objects or packages, just writing scripts without even running them, it takes longer to reach that point, but it still does.
When I first start the program, the Task Manager already shows memory usage of 950-1000 MB (sometimes more), and as I work it climbs to 6000 MB, at which point it is impossible to work with, as every activity is delayed and 'stuck'. Just to compare, on my old PC the Task Manager shows 100-150 MB while working in the program. When I click the "Memory Usage Report" within Rstudio, the "used by session" value is very small and the "used by system" value is almost at the maximum, yet Rstudio is the only thing taking up the system memory on the PC.
Things I have tried: installing older versions of both R and Rstudio, pausing my anti-virus program, changing compatibility mode, setting zoom to 100%. It feels like Rstudio is continuously running something in the background, as the memory usage keeps growing (and quite quickly). But maybe it is something else entirely.
I am currently using the latest versions of R and Rstudio (4.1.2 and 2021.09.0-351) on a PC with an Intel i7 processor, 64-bit, 16 GB RAM, Windows 10.
What should I look for at this point?
On Windows, there are several typical memory or CPU issues with Rstudio. In this answer, I explain how the Rstudio interface itself uses memory and CPU as soon as you open a project (e.g., when Rstudio shows you some .Rmd files). The memory/CPU cost of the computation itself is not covered (i.e., performance issues when executing a line of code are not covered here).
When working on 'long' .Rmd files within Rstudio on Windows, CPU and/or memory usage sometimes gets very high and increases progressively (e.g., because of a process named 'Qtwebengineprocess'). To solve the problem caused by long Rmd files loaded within an Rstudio session, you should:
Pay attention to which Rstudio processes consume memory while it scans your code (i.e., disable or enable things in Rstudio's 'Global options' menu). For example, try disabling inline display (Tools => Global Options => R Markdown => Show equation and image previews => Never). This is what led me to realise that memory/CPU leaks are sometimes due to Rstudio itself, not the data or the code.
Set up a bookdown project in order to split your large Rmd files into several smaller Rmd files. See here.
Bonus step: check whether any loaded packages conflict, using tidyverse_conflicts() (a short sketch follows), but that is already a 'computing problem' (not covered here).
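A minimal sketch of the in-code checks mentioned above, assuming the tidyverse meta-package is installed; the inline-display setting itself is changed through the GUI and has no code equivalent.

library(tidyverse)

# How much memory the R session itself is holding. Compare this with Task Manager:
# the difference is roughly what the Rstudio interface (e.g. Qtwebengineprocess) is using.
gc()

# Bonus step from above: list functions masked by conflicting packages
tidyverse_conflicts()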

"Cannot allocate vector of size xxx mb" error, nothing seems to fix

I'm running RStudio x64 on Windows 10 with 16 GB of RAM. RStudio seems to be running out of memory for allocating large vectors, in this case a 265 MB one. I've gone through multiple tests and checks to identify the problem:
Memory limit checks via memory.limit() and memory.size(). The memory limit is ~16 GB and the size of the objects stored in the environment is ~5.6 GB.
Garbage collection via gc(). This frees a few hundred MB.
Upped the priority of rsession.exe and rstudio.exe to real-time via Task Manager.
Ran chkdsk and RAM diagnostics on system restart. Both returned no errors.
But the problem persists. It seems to me that R can access 16 GB of RAM (and shows 16 GB committed in Resource Monitor), but somehow is still unable to make a large vector. My main confusion is this: the problem only appears if I run code on multiple datasets consecutively, without restarting RStudio in between. If I do restart RStudio, the problem doesn't show up again, at least not for a few runs.
The error should be replicable with any large R vector allocation (see e.g. the code here). I'm guessing the fault is software, some kind of memory leak, but I'm not sure where or how, or how to automate a fix.
Any thoughts? What have I missed?
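A minimal sketch consolidating the checks listed in the question, plus freeing objects between consecutive runs. memory.limit() and memory.size() are Windows-only (and were removed in R 4.2), so this assumes an older R on Windows; keep_this is a placeholder object name.

# Windows-only diagnostics; values are in MB
memory.limit()            # maximum memory R is allowed to use
memory.size()             # memory currently used by this R session
memory.size(max = TRUE)   # peak memory obtained from the OS so far

# Between consecutive datasets: drop what you no longer need and run a full
# garbage collection before the next large allocation
rm(list = setdiff(ls(), "keep_this"))
gc()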

R Running out of memory over large file

I had an unusual problem yesterday when trying to read a large .csv file into memory.
The file itself is 9 GB, with a bit more than 80 million rows and 10 columns.
It loaded perfectly, taking up around 7 GB of memory, on a remote machine with 128 GB of RAM.
My problem is that I want to work on the data on a local machine that only has 32 GB of RAM.
I tried reading it with data.table::fread, but R crashes when it uses all of the machine's memory.
Is there a safer way of reading the data that won't crash R?
Is this a known issue? Could something be wrong with the machine?
Both machines are running Windows 7 Enterprise.
EDIT:
Saving and reading the data as an RDS file worked, but I still want to be able to use just one computer for the entire job.
Is there any other way to read the data directly from the CSV file?
I don't want to report a bug in data.table unless I am sure this is an issue with fread and not a local issue.
Any other ideas?
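Not an answer to whether fread has a bug, but a sketch of one way to keep the read inside 32 GB, assuming you only need a subset of the columns; "big.csv" and the column names are placeholders for your file.

library(data.table)

# Read only the columns you actually need; this alone can cut memory use sharply
dt <- fread("big.csv", select = c("id", "date", "value"))

# Estimate memory before committing to the full 80-million-row read
test <- fread("big.csv", nrows = 1e6L, select = c("id", "date", "value"))
print(object.size(test), units = "MB")   # multiply by ~80 for the full file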

What is a good way to get an in-memory cache with data.table

Let's say I have a 4GB dataset on a server with 32 GB.
I can read all of that into R, make a data.table global variable and have all of my functions use that global as a kind of in-memory database. However, when I exit R and restart, I have to read it from disk again. Even with smart disk caching strategies (save/load or R.cache) there is a delay of 10 seconds or so getting that data in. Copying the data takes about 4 seconds.
Is there a good way to cache this in memory that survives the exit of an R session?
A couple of things come to mind: Rserve, redis/rredis, Memcache, multicore...
Shiny-Server and Rstudio-Server also seem to have ways of solving this problem.
But then again, it seems to me that perhaps data.table could provide this functionality since it appears to move data outside of R's memory block anyway. That would be ideal in that it wouldn't require any data copying, restructuring etc.
Update:
I ran some more detailed tests and I agree with the comment below that I probably don't have much to complain about.
But here are some numbers that others might find useful. I have a 32GB server. I created a data.table of 4GB size. According to gc() and also looking at top, it appeared to use about 15GB peak memory and that includes making one copy of the data. That's pretty good I think.
I wrote to disk with save(), deleted the object and used load() to remake it. This took 17 seconds and 10 seconds respectively.
I did the same with the R.cache package, and this was actually slower: 23 and 14 seconds.
However, both of those reload times are quite fast. The load() method gave me a 357 MB/s transfer rate. By comparison, a copy took 4.6 seconds. This is a virtual server; I'm not sure what kind of storage it has or how much that read speed is influenced by the cache.
Very true: data.table hasn't got to on-disk tables yet. In the meantime, some options are:
Don't exit R. Leave it running on a server and use svSocket's evalServer() to talk to it, as the video on the data.table homepage demonstrates. Or use the other similar options you mentioned.
Use a database for persistence, such as SQL or any of the NoSQL databases.
If you have large delimited files, then some people have recently reported that fread() appears (much) faster than load(). But experiment with compress=FALSE. Also, we've just pushed fwrite to the most current development version (1.9.7; use devtools::install_github("Rdatatable/data.table") to install), which has some reported write times on par with native save (see the timing sketch after this list).
Packages ff, bigmemory and sqldf, too. See the HPC Task View, the "Large memory and out-of-memory data" section.
In enterprises where data.table is being used, my guess is that it is currently mostly being fed with data from some other persistent database. Those enterprises probably:
use 64-bit R with, say, 16 GB, 64 GB or 128 GB of RAM. RAM is cheap these days. (But I realise this doesn't address persistence.)
The internals have been written with on-disk tables in mind. But don't hold your breath!
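A minimal timing sketch of the save()/load() versus fwrite()/fread() comparison mentioned in the list above. The table here is a small stand-in for the real 4 GB data so the example runs quickly; in current released data.table both fwrite() and fread() are available.

library(data.table)

# Small stand-in for the real 4 GB table
dt <- data.table(id = seq_len(1e6), value = rnorm(1e6))

# Binary route discussed above: save()/load() with compression off
system.time(save(dt, file = "cache.RData", compress = FALSE))
system.time(load("cache.RData"))

# Delimited route: data.table's own fwrite()/fread()
system.time(fwrite(dt, "cache.csv"))
system.time(dt2 <- fread("cache.csv"))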
If you really need to exit R between computation sessions for some reason, and the server is not restarted, then just make a 4 GB ramdisk and store the data there. Loading data from RAM to RAM is much faster than from any SAS or SSD drive :)
This can be solved pretty easily on Linux with something like adding this line to /etc/fstab:
none /data tmpfs nodev,nosuid,noatime,size=5000M,mode=1777 0 0
Depending on what your dataset looks like, you might consider using the ff package. If you save your dataset as an ffdf, it will be stored on disk, but you can still access the data from R.
ff objects have a virtual part and a physical part. The physical part is the data on disk; the virtual part gives you information about the data.
To load such a dataset in R, you only load the virtual part, which is a lot smaller, maybe only a few KB, depending on whether you have a lot of factor data. So this loads your data into R in a matter of milliseconds instead of seconds, while you still have access to the physical data for your processing.
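A minimal sketch of that ffdf workflow, assuming the ff package is installed; the CSV path and chunk size are placeholders.

library(ff)

# Import a delimited file into an on-disk ffdf; only chunks pass through RAM
big <- read.csv.ffdf(file = "big.csv", header = TRUE, next.rows = 500000)

# ffsave() writes the physical files plus a small image holding the virtual part;
# ffload() brings the virtual part back almost instantly in a later session
ffsave(big, file = "./big_ff")
rm(big)
ffload("./big_ff")

# Use it roughly like a data.frame; rows are pulled from disk on demand
big[1:10, ]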

Where does R store temporary files

I am running some basic data manipulation on a MacBook Air (4 GB memory, 120 GB HD with 8 GB available). My input file is about 40 MB, and I don't write anything to disk until the end of the process. However, in the middle of the process, my Mac says there is no memory left to run. I checked the hard drive and found there was only about 500 MB free.
So here are my questions:
How is it possible that R filled up my disk so quickly? My understanding is that R stores everything in memory (unless I explicitly write something out to disk).
If R does write temporary files on the disk, how can I find these files to delete them?
Thanks a lot.
Update 1: error message I got:
Force Quit Applications: Your Mac OS X startup disk has no more space available for
application memory
Update 2: I checked tempdir() and it shows "var/folders/k_xxxxxxx/T//Rtmpdp9GCo", but I can't locate this directory in Finder.
Update 3: After running unlink(tempdir(), recursive = TRUE) in R and restarting my computer, I got my disk space back. I would still like to know whether R writes to my hard drive, so I can avoid similar situations in the future.
Update 4: My main object is about 1 GB. I used Activity Monitor to track the process; while memory usage was about 2 GB, disk activity was extremely high: data read 14 GB, data written 44 GB. I have no idea what R is writing.
R writes to a temporary per-session directory which it also cleans up at exit.
It follows convention and respects TMP and related environment variables.
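A short sketch for inspecting the per-session temporary directory the answer above refers to; all of this runs inside the current R session.

# Where this session's temporary directory lives
tempdir()

# What has been written there, with sizes in bytes
files <- list.files(tempdir(), full.names = TRUE, recursive = TRUE)
file.info(files)["size"]

# The environment variables R consults when choosing the location
Sys.getenv(c("TMPDIR", "TMP", "TEMP"))

# Manual cleanup; R also removes the whole directory when the session exits normally
unlink(files, recursive = TRUE)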
What makes you think that disk space has anything to do with this? R needs all objects held in memory, not off disk (by default; there are add-on packages that allow a subset of operations on on-disk stored files too big to fit into RAM).
One of the steps in the "process" is causing R to request a chunk of RAM from the OS so it can continue. The OS could not comply, and thus R terminated the "process" you were running, with the error message you failed to give us. [Hint: it would help if you showed the actual error, not your paraphrasing of it. Some inkling of the code you were running would also help. 40 MB on disk sounds like a reasonably large file; how many rows/columns, etc.? How big is the object within R; what does object.size() report?]
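A tiny sketch of the on-disk versus in-memory comparison the answer asks for; "input.csv" is a placeholder for the 40 MB file.

# Size on disk, in MB
file.size("input.csv") / 1024^2

# Size of the resulting object in RAM, often several times larger than on disk
dat <- read.csv("input.csv")
print(object.size(dat), units = "MB")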
