read.csv crashes RStudio

Help me figure out what I am doing wrong!
I have about 20 .csv files (product feeds) online. I used to be able to fetch them all. But now they crash R if I fetch more than one or two. File size is about 50K rows / 30 columns each.
I guess it's a memory issue but I've tried on a different computer with the exact same result.
Could it be some formatting in the files that makes R use too much memory? Or what else could it be?
If I read one of these, everything is fine. Two sometimes works. Three almost certainly crashes:
a <- read.csv("URL1")
b <- read.csv("URL2")
c <- read.csv("URL3")
I have tried specifying all sorts of stuff like:
d <- read.csv("URL4",skipNul=TRUE,sep=",",stringsAsFactors=FALSE,header=TRUE)
I keep getting this message:
R session aborted.
R encountered a fatal error.
The session was terminated.
We have some commercial software where I can fetch the same files without issues, so the files should be fine.
My script had also been running twice daily for several months without issues.
R version 3.6.1
Platform: x86_64-apple-darwin15.6.0 (64-bit)

I have had this issue as well but with read_csv(). I haven't figured out what the exact cause is yet, but my best guess is that trying to read a file and write that file to a variable at the same time is too much for memory or CPU to handle.
Stemming from that guess, I tried this method and it has worked perfectly for me:
library(dplyr)
a <- read.csv("URL1") %>% as_tibble()
# You can use other data types instead of tibble; this is just my example.
The whole idea is to split the reading process from the writing process by separating them using a pipe. This makes sure that one must be finished before the next can start.
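For what it's worth, here is a minimal sketch of applying that same pattern to several feeds in one go; the URLs are placeholders and the lapply() loop is my own suggestion rather than anything from the question:
library(dplyr)
# Placeholder feed URLs; substitute the real ones
urls <- c("URL1", "URL2", "URL3")
# Read each feed with the same pattern as above and collect the results in a list
feeds <- lapply(urls, function(u) read.csv(u, stringsAsFactors = FALSE) %>% as_tibble())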

Related

Issue with applying str_length to a dataframe

I created a simple R Script that is run on a monthly basis by colleagues.
This script brings in a fairly chunky RDS file that has around 2.6M observations and 521 variables.
Against this file the following two commands are run:
Latest$MFU <- substr(Latest$SUB_BUSINESS_UNIT_CODE, 1, 2)
Latest$LENGTH <- str_length(Latest$POLICYHOLDER_COMPANY_NAME_LAST_NAME)
This script has run perfectly for the last three years, but today, for some reason, it is failing for all three people tasked to run it, and it has indeed fallen over for me too.
The error message received is
Error: cannot allocate vector of size 10.0 Mb
At first I assumed that their computers were running out of memory, or that they were not using 64-bit R, or some other reason such as not having restarted their computers, etc.
It turns out, though, that they have plenty of memory available, have restarted their computers, are using 64-bit R in RStudio, and are all on different versions of RStudio/R.
I tried running the process myself; my computer has 32GB of RAM and 768GB of hard drive space free. I am getting the same error message.
So it must be a corrupt source file, I figured. I tried last month's file, which ran just fine for everyone last month, and got the same error.
Maybe just try the stringr package instead then, and move around the problem that way. Nope, no dice; exact same error message.
I have to admit I'm stumped. I have tried gc(), tried previous versions of the file, and tried cutting the file in half and running it that way; it just flat out refuses to run.
Anyone know of an alternative to stringr/base R commands to get the length of a character string as a new variable and to get a substring as a new variable?
What about rm(list=ls()) before running, and memory.limit(size = 16265*4) (or another big number)?
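Not from the original thread, but one low-copy route worth sketching: do both derivations by reference with data.table, which avoids the full-column copies that the $<- assignments above can trigger on a 2.6M-row data frame. nchar() stands in for str_length() here; check that its NA handling suits your data.
library(data.table)
setDT(Latest)  # convert the existing data.frame to a data.table in place, without copying
Latest[, MFU := substr(SUB_BUSINESS_UNIT_CODE, 1, 2)]
Latest[, LENGTH := nchar(POLICYHOLDER_COMPANY_NAME_LAST_NAME)]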

cannot allocate vector but my environment is empty

I found lots of questions here asking how to deal with "cannot allocate vector of size **" and tried the suggestions, but I still can't figure out why RStudio crashes every time.
I'm using 64-bit R on Windows 10, and my memory.limit() is 16287.
I'm working with a bunch of large data files (mass spectra) that take up 6-7GB of memory each, so I've been calling in individual files one at a time and saving each as a variable with the XCMS package, like below.
msdata <- xcmsRaw(datafile1,profstep=0.01,profmethod="bin",profparam=list(),includeMSn=FALSE,mslevel=NULL, scanrange=NULL)
I do a series of additional operations to clean up the data and make some plots using rawEIC (also in the XCMS package), which increases my memory.size() to 7738.28. Then I remove all the variables I created in my global environment using rm(list=ls()). But when I try to call in a new file, it tells me it cannot allocate a vector of size **Gb. With the empty environment, my memory.size() is 419.32, and I also checked with gc() to confirm that the used memory (on the Vcells row) is on the same order as when I first open a new R session.
I couldn't find any information on why R still thinks that something is taking up a bunch of memory space when the environment is completely empty. But if I terminate the session and reopen the program, I can import the data file - I just have to re-open the session every single time one data file processing is done, which is getting really annoying. Does anyone have suggestions on this issue?
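One workaround to sketch, not something I have tested against XCMS specifically: run each file in its own short-lived R process (for example with the callr package), so that all memory really is returned to the operating system when that process exits. The helper name process_one_file and the datafiles vector below are hypothetical.
library(callr)
process_one_file <- function(datafile) {
  library(xcms)
  msdata <- xcmsRaw(datafile, profstep = 0.01, profmethod = "bin", includeMSn = FALSE)
  # ... clean up the data and write out plots/results here ...
  invisible(NULL)
}
for (f in datafiles) {
  callr::r(process_one_file, args = list(f))  # memory is freed when the child process exits
}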

XLConnect 'envir' error

I manage a number of Excel reports, and I use R to do the preprocessing and write the output report. It's great because all I have to do is run the R function and distribute the reports, and the rest of the report writing is inactive time. The reports need to be in Excel format because it is the easiest to disseminate and the audience is large and non-technical. Once the data is pre-processed, I do this very, very simply using XLConnect:
file.copy(from = template, to = newFileName)
writeWorksheetToFile(file = newFileName,
                     data = newData,
                     sheet = "Data",
                     clearSheets = TRUE)
However, one of my reports began throwing this error when I attempted to write the new data:
Error in ls(envir = envir, all.names = private) :
invalid 'envir' argument
Furthermore, before throwing the error, the function ties up R for 15 minutes. The normal writing time is less than 10 seconds. I must confess, I don't understand what this error even means, and it did not succumb to my usual debugging methods or to any other SO solution.
I've noticed that others have referred to rJava (reinstalling this package didn't work) and to a Java cache of log files (not sure where this would be located on Mac). I'm especially confused as the report ran with no problems just one day earlier using precisely the same process, AND my other reports using the exact same process still work just fine.
I didn't update Java or R or my OS, or debug/rewrite any of the R code. So, starting from the beginning - how can I investigate this 'envir' error? What would you do if you were in my shoes? I've been working on this for a couple days and I'm stumped.
I'm happy to provide extra information if it will provide better context for more discerning programmers than myself :)
Update:
My previous answer (below) did not, in fact, fix this intermittent error (which as the OP points out is extremely difficult to unpick due to the Java dependency). Instead, I followed the advice given here and migrated from the XLConnect package to openxlsx, which sidesteps the problem entirely.
Previous answer:
I've been frustrated by precisely this error for a while, including the apparent intermittency and the tying up of R for several minutes when writing a workbook.
I just realised what the problem was: the length of the name of an Excel worksheet appears to be limited to 31 characters, and my R code was generating worksheet names in excess of this limit.
Just to be clear, I'm referring to the names of the individual tabbed sheets within an Excel workbook, not the filename of the workbook itself.
Trimming each worksheet name to no more than 31 characters fixed this error for me.
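To guard against this in code, a minimal sketch (sheetName here is a hypothetical variable holding whatever name your code generates): truncate every worksheet name before handing it to XLConnect.
# Excel limits worksheet names to 31 characters; longer names can surface
# as obscure rJava/XLConnect errors, so trim defensively before writing.
safeSheetName <- strtrim(sheetName, 31)
writeWorksheetToFile(file = newFileName,
                     data = newData,
                     sheet = safeSheetName,
                     clearSheets = TRUE)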

How to open up a matrix that's running into an error

I am running into an error on a big job in R, which I am running as an R script. I keep getting the error Error in chol.default(F.mat) :
the leading minor of order 1 is not positive definite.
I normally run my job via qsub, but that only gives me the error output and I can't poke around. I then tried running the job locally, but my 4GB MacBook was completely overwhelmed.
Now I am trying to run it in a screen session with options(error=recover). I am running into the same error as above, but I don't know how to access the data frames. I get "recover called non-interactively; frames dumped, use debugger() to view", but then I am dropped back to my bash prompt and I don't know how to open up the dumped frames.
Any ideas?
This is a bit awkward since (1) it's more or less remote debugging and (2) I don't actually ever try to debug non-interactively myself, but: it seems that
options(error=function() dump.frames(to.file=TRUE)) might be worth trying?
After your frames dump to a file (last.dump.rda in the working directory, by default), you should be able to run load("last.dump.rda"); debugger(last.dump) to get back to the debugging environment.
Two caveats:
I haven't actually tested this, just read & interpreted ?dump.frames;
I strongly recommend that you test this with short test runs, either running your original code on a small subset of your data or setting up a mini test script, something like:
options(error=function() dump.frames(to.file=TRUE))
Sys.sleep(60)
stop("testing error exit")

R console unexpectedly slow, long behind job (PDF output) is finished

When I run a large R script (it works as expected and produces a correct PDF at the end; base plotting plus beeswarm, with dev.off() as the last line of the script), I notice that the PDF is finished after ~3 seconds and can even be opened in other applications, long before the console output (merely a few integer values and the echo of ~400 lines of code) is finished (~20 seconds). There are no errors reported. In between, the echo stops and does nothing for seconds at a time.
I work with RStudio v0.97.551, R version 3.0.1, on Windows 7.
gc() and closing and restarting R did not help, and the data structures used are not big anyway (5 data frames with up to 60 observations and 64 numeric or short character variables). The available memory should be sufficient (around 4 GB throughout, according to the Task Manager), but the CPU is busy during that time.
I agree this is not reproducible for other people without the script, which is however too large to post, but maybe someone has experienced the same problem, or has an explanation or a suggestion for what to check? Thanks in advance!
EDIT:
I ran exactly the same code directly in R 3.0.1 (without RStudio), and the problem was gone, which suggests the problem is related to RStudio. I added the RStudio tag, but I am not sure whether I am now supposed to move this question somewhere else?
Recently I came across a similar problem: running from RStudio became very slow, even when executing something as simple as example('plot'). After searching around, this post pointed me to the right place, which eventually led to a workaround: resetting RStudio by renaming the RStudio-Desktop directory. The exact way to do so depends on the OS you are using, and you can find detailed instructions here. I just tried it, and it works.
