Issue with applying str_length to a dataframe - r

I created a simple R Script that is run on a monthly basis by colleagues.
This script brings in a fairly chunky RDS file that has around 2.6M observations and 521 variables.
Against this file the following two commands are run:
Latest$MFU <- substr(Latest$SUB_BUSINESS_UNIT_CODE, 1, 2)
Latest$LENGTH <- str_length(Latest$POLICYHOLDER_COMPANY_NAME_LAST_NAME)
This script has run perfectly for the last three years, but today, for some reason, it is failing for all three people tasked with running it, and it has fallen over for me too.
The error message received is
Error: cannot allocate vector of size 10.0 Mb
At first I assumed that their computers were running out of memory, or that they were not using 64-bit R, or some other cause such as not having restarted their computers, etc.
It turns out, though, that they have plenty of memory available, have restarted their computers, are using 64-bit R in RStudio, and are all on different versions of RStudio/R.
I tried running the process myself; my computer has 32 GB of RAM and 768 GB of hard drive space free. I am getting the same error message.
So it must be a corrupt source file, I figure. I tried last month's file, which ran just fine for everyone last month: same error.
Maybe just try the stringr package instead, then, and work around the problem that way. Nope, no dice, exact same error message.
I have to admit I'm stumped. I have tried gc(), tried previous versions of the file, and tried cutting the file in half and running it that way; it just flat-out refuses to run.
Anyone know of an alternative to stringr/base R commands to get the length of a character string as a new variable and to get a substring as a new variable?

What about rm(list=ls()) before running, and memory.limit(size = 16265*4) (or another big number)?
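For the original question about alternative commands, here is a hedged sketch (an assumption on my part, not a tested fix for this exact error) that uses the stringi package together with data.table's assignment by reference, so the two new columns are added without copying the full 2.6M x 521 data frame; the column names are taken from the question:
# Hedged sketch: stringi equivalents of substr()/str_length(), assigned by
# reference with data.table so the large table is not copied.
library(data.table)
library(stringi)

setDT(Latest)  # convert the data frame to a data.table in place

Latest[, MFU := stri_sub(SUB_BUSINESS_UNIT_CODE, 1, 2)]
Latest[, LENGTH := stri_length(POLICYHOLDER_COMPANY_NAME_LAST_NAME)]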

Related

read.csv crashes RStudio

Help me figure out what I am doing wrong!
I have about 20 .csv files (product feeds) online. I used to be able to fetch them all. But now they crash R if I fetch more than one or two. File size is about 50K rows / 30 columns each.
I guess it's a memory issue but I've tried on a different computer with the exact same result.
Could it be some formatting in the files that make R use too much memory? Or what can it be?
If I run one of these, everything is good. Two, sometimes. Three, and it almost certainly crashes:
a <- read.csv("URL1")
b <- read.csv("URL2")
c <- read.csv("URL3")
I have tried specifying all sorts of stuff like:
d <- read.csv("URL4",skipNul=TRUE,sep=",",stringsAsFactors=FALSE,header=TRUE)
I keep getting this message:
R session aborted.
R encountered a fatal error.
The session was terminated.
We have some commercial software where I can fetch the same files without issues, so the files should be fine.
And my script was running twice daily for several months without issues.
R version 3.6.1
Platform: x86_64-apple-darwin15.6.0 (64-bit)
I have had this issue as well, but with read_csv(). I haven't figured out the exact cause yet, but my best guess is that trying to read a file and assign it to a variable at the same time is too much for memory or the CPU to handle.
Stemming from that guess, I tried this method and it has worked perfectly for me:
library(dplyr)
a <- read.csv("URL1") %>% as_tibble()
# You can use other data types instead of a tibble; this is just my example.
The whole idea is to separate the reading step from the assignment step by piping one into the other, which makes sure that one must be finished before the next can start.
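As a usage note, the same pattern from this answer could be applied to all of the feeds in one pass; the URLs below are placeholders standing in for the real feed addresses:
library(dplyr)

# Placeholder URLs; each feed is read and piped into a tibble, one at a time.
urls <- c("URL1", "URL2", "URL3")
feeds <- lapply(urls, function(u) read.csv(u, stringsAsFactors = FALSE) %>% as_tibble())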

cannot allocate vector but my environment is empty

I found lots of questions here asking how to deal with "cannot allocate vector of size **" and tried the suggestions, but I still can't figure out why RStudio crashes every time.
I'm using 64bit R in Windows 10, and my memory.limit() is 16287.
I'm working with a bunch of large data files (mass spectra) that take up 6-7 GB of memory each, so I've been reading in individual files one at a time and saving each as a variable with the XCMS package, like below.
msdata <- xcmsRaw(datafile1, profstep = 0.01, profmethod = "bin", profparam = list(), includeMSn = FALSE, mslevel = NULL, scanrange = NULL)
I do a series of additional operations to clean up the data and make some plots using rawEIC (also in the XCMS package), which increases my memory.size() to 7738.28. Then I remove all the variables I created, which are saved in my global environment, using rm(list=ls()). But when I try to read in a new file, it tells me it cannot allocate a vector of size ** Gb. With the empty environment, my memory.size() is 419.32, and I also checked with gc() to confirm that the used memory (on the Vcells row) is on the same order as when I first open a new R session.
I couldn't find any information on why R still thinks that something is taking up a bunch of memory when the environment is completely empty. But if I terminate the session and reopen the program, I can import the data file; I just have to re-open the session every single time one data file is processed, which is getting really annoying. Does anyone have suggestions on this issue?
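One hedged workaround sketch (my suggestion, not from the thread): since re-opening the session reliably frees the memory, each file's processing could be run in a throwaway R subprocess with the callr package, so everything is released when that subprocess exits. The file names and the processing body below are placeholders; the xcmsRaw() call is copied from the question:
library(callr)

process_one <- function(datafile) {
  library(xcms)
  msdata <- xcmsRaw(datafile, profstep = 0.01, profmethod = "bin",
                    profparam = list(), includeMSn = FALSE,
                    mslevel = NULL, scanrange = NULL)
  # ... clean-up steps and rawEIC() plots go here; write results to disk ...
  invisible(NULL)  # return nothing large to the parent session
}

for (f in c("datafile1", "datafile2")) {  # placeholder file paths
  callr::r(process_one, args = list(f))   # fresh R session per file
}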

How do you change memory available for data storage settings in R using OSX command line?

I'm very new to R and I'm having trouble changing the memory-available-for-data-storage setting --max-ppsize. A line of my code threw the error Error: protect(): protection stack overflow because I'm trying to run the Rtsne package on a very large data set, and from reading other posts, that error indicates I should raise this setting to the maximum allowed value (--max-ppsize=500000) using the command line. I don't fully understand how to pass command-line options to R in the OSX Terminal. I can launch R in the Terminal, but from there I'm not exactly sure what to do. Any help would be very much appreciated.
So this ended up being a fairly simple fix with a fresh brain the next day. I didn't end up changing --max-ppsize at all. I read another post's comments section and saw that I needed to pass a matrix (I used data.matrix() to convert my data frame) to the function I was trying to run. Hope this helps someone else out.
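A minimal sketch of that fix, assuming the function in question is Rtsne::Rtsne() and that my_df stands in for the original data frame of numeric columns:
library(Rtsne)

# my_df is a hypothetical stand-in for the original data frame;
# Rtsne() expects a matrix, so convert with data.matrix() first.
mat <- data.matrix(my_df)
tsne_out <- Rtsne(mat)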

How to open up a matrix that's running into an error

I am running into an error on a big job in R, which I am running as an R script. I keep getting the error:
Error in chol.default(F.mat) : the leading minor of order 1 is not positive definite
I normally run my job via qsub, but that only gives me error output and I can't poke around. I then tried running the job locally, but my 4 GB MacBook was completely overwhelmed.
Now I am using screen and running the job there with options(error=recover). I am running into the same error as above, but I don't know how to access the data frames. I get recover called non-interactively; frames dumped, use debugger() to view, but then I am put back at my bash prompt and I don't know how to open up the dumped frames.
Any ideas?
This is a bit awkward since (1) it's more or less remote debugging and (2) I don't actually ever try to debug non-interactively myself, but: it seems that
options(error=function() dump.frames(to.file=TRUE)) might be worth trying?
After your frames dump to a file (last.dump.rda in the working directory, by default), you should be able to run load("last.dump.rda"); debugger(last.dump) to get back to the debugging environment (a short sketch of this follows the test script below).
Two caveats:
I haven't actually tested this, just read & interpreted ?dump.frames;
I strongly recommend that you test this with short test runs, either running your original code on a small subset of your data or setting up a mini test script, something like:
options(error=function() dump.frames(to.file=TRUE))
Sys.sleep(60)
stop("testing error exit")

R console unexpectedly slow, lagging long after the job (PDF output) is finished

When I run a large R script (it works as expected and basically produces a correct PDF at the end; base plotting plus beeswarm, last line of the script is dev.off()), I notice that the PDF is finished after ~3 seconds and can even be opened in other applications, long before the console output (merely a few integer values and the echo of ~400 lines of code) is finished (~20 seconds). There are no errors reported. In between, the echo stops and does nothing for seconds.
I work with RStudio v0.97.551, R version 3.0.1, on Windows 7.
Neither gc() nor closing and restarting R helped, and the data structures used are not big anyway (5 data frames with up to 60 observations and 64 numeric or short character variables). The available memory should be sufficient (around 4 GB throughout, according to Task Manager), but the CPU is busy during that time.
I realize this is not reproducible for other people without the script, which is, however, too large to post, but maybe someone has experienced the same problem, or has an explanation or a suggestion of what to check? Thanks in advance!
EDIT:
I ran exactly the same code directly in R 3.0.1 (without RStudio), and the problem was gone, suggesting the problem is related to RStudio. I added the rstudio tag, but I am not sure whether I am now supposed to move this question somewhere else.
Recently I came across a similar problem: running from RStudio becomes very slow, even when it is executing something as simple as example('plot'). After searching around, this post pointed me to the right place and eventually led to a workaround: resetting RStudio by renaming the "RStudio-Desktop Directory". The exact way to do so depends on the OS you are using, and you can find the detailed instructions here. I just tried it, and it works.
