Why can't I break an *.Rdata loading process?

It seems that R does not respond when I try to break off loading an *.Rdata file with load("*.Rdata"). What is the reason, and is there a way around it?
I tried to interrupt several loading processes with different files and sizes. The only option then seems to be to terminate R. I am working with large files whose loading time exceeds half an hour.

I think you're stuck. R doesn't make guarantees about whether low-level processes can be interrupted by the user. Low-level C code needs a call to R_CheckUserInterrupt() in order to "notice" a request from the user to break execution (see Wickham's Advanced R book). You can look at the low-level code for loading data if you like (although it may not be too helpful ...).
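To illustrate the point, here is a minimal sketch (not the author's code; it assumes Rcpp and a working compiler are installed) of how compiled code stays interruptible by checking for user interrupts inside its loop. The answer above suggests the serialization code behind load() lacks such checks, which is why pressing Esc has no effect there.

library(Rcpp)

cppFunction('
double slow_sum(int n) {
    double total = 0;
    for (int i = 0; i < n; i++) {
        // periodically honour Esc / Ctrl-C from the R session
        if (i % 1000000 == 0) Rcpp::checkUserInterrupt();
        total += i;
    }
    return total;
}
')

slow_sum(1000000000)  # long-running, but can be interrupted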
The only workaround I can think of (besides making sure that you really do want to load a particular data file) is to find ways to decompose your data into smaller chunks (and concatenate the chunks appropriately after reading them into R). If data reading is a really big bottleneck, you could look at the High-Performance Computing task view section on out-of-memory data tools ...
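As a rough sketch of the chunking workaround (the object big_df and the file names are made up), you could save the data in pieces and rebuild it after reading:

# write the data out in 10 row-wise chunks
chunks <- split(big_df, cut(seq_len(nrow(big_df)), 10, labels = FALSE))
for (i in seq_along(chunks)) saveRDS(chunks[[i]], sprintf("chunk_%02d.rds", i))

# later: read the chunks back; each readRDS call is short, so you are never
# committed to a single half-hour load
big_df <- do.call(rbind, lapply(sprintf("chunk_%02d.rds", 1:10), readRDS))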

Related

Fastest way to read a data.frame into a shiny app on load?

For a shiny app in a repository containing a single static data file, what is the optimal format for that flat file (and corresponding function to read that file) which minimises the read time for that flat file to a data.frame?
For example, suppose that when a shiny app starts it reads an .RDS file, but that takes ~30 seconds and we wish to decrease it. Is there a different way of saving the file, and a corresponding read function, that would save time?
Here's what I know already:
I have been reading some speed-comparison articles, but none seem to comprehensively benchmark all methods in the context of a shiny app (or the possible cores/threading implications). Some offer sound advice, like trying to load in less data.
I notice that languages like Julia can sometimes be faster, but I'm not sure whether reading a file using another language would help, since it would have to be converted to an object R recognises, and presumably that process would take longer than simply reading it as an R object in the first place.
I have noticed that identical files seem to be smaller when saved as .RDS compared to .csv; however, I'm not sure whether file size necessarily has an effect on read time.
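No benchmark is given in the thread, but one is easy to run on your own data. A minimal sketch, assuming a data frame df and that the data.table, fst and microbenchmark packages are installed (file names are illustrative):

library(microbenchmark)
library(data.table)
library(fst)

# write the same data frame in three candidate formats
saveRDS(df, "df.rds")
fwrite(df, "df.csv")
write_fst(df, "df.fst")

# compare read times; pick whichever wins on your data and hardware
microbenchmark(
  rds = readRDS("df.rds"),
  csv = fread("df.csv"),
  fst = read_fst("df.fst"),
  times = 5
)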

R running very slowly after loading large datasets > 8GB

I have been unable to work in R because of how slowly it operates once my datasets are loaded. These datasets total around 8GB. I am running on 8GB of RAM and have adjusted memory.limit() to exceed my RAM, but nothing seems to be working. I have also used fread from the data.table package to read these files, simply because read.table would not run.
After seeing a similar post on the forum addressing the same issue, I have attempted to run gctorture(), but to no avail.
R is running so slowly that I cannot even check the length of the list of datasets I have uploaded, cannot View or do any basic operation once these datasets are uploaded.
I have tried uploading the datasets in 'pieces', so 1/3 of the total files over 3 times, which seemed to make things run more smoothly for the importing part, but has not changed anything with regards to how slow R runs after this.
Is there any way to get around this issue? Any help would be much appreciated.
Thank you all for your time.
The problem arises because R loads the full dataset into RAM, which can bring the system to a halt when you try to View your data.
If it's a really huge dataset, first make sure the data contains only the most important columns and rows. Valid columns can be identified through the domain and world knowledge you have about the problem. You can also try to eliminate rows with missing values.
Once this is done, depending on the size of your data, you can try different approaches. One is to use packages like bigmemory and ff. bigmemory, for example, creates a pointer object through which you can read the data from disk without loading it into memory.
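A minimal sketch of the bigmemory idea, assuming a purely numeric CSV (big.matrix objects cannot mix column types); the file names are hypothetical:

library(bigmemory)

# file-backed big.matrix: the values stay on disk, only a pointer lives in RAM
x <- read.big.matrix("huge_numeric.csv", header = TRUE, type = "double",
                     backingfile = "huge.bin", descriptorfile = "huge.desc")

dim(x)        # dimensions without loading the data
x[1:5, 1:3]   # only the requested block is pulled into memory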
Another approach is parallelism (implicit or explicit). MapReduce-style tools can also be very useful for handling big datasets.
For more information on these, check out this blog post on rpubs and this old but gold post from SO.

Will putting functions in one file improve speed?

If I write all my functions into one file that I use for multiple scripts, will sourcing the file containing the functions once at the top of my script improve my speed? If I call source("fn.r"), for example, will I be able to call the functions I created, as they are already saved in the workspace? I am trying to reduce the time it takes for the script to run and improve performance. Any other tips regarding improving speed are welcome as well.
Sourcing the file loads any functions within that script. Sourcing doesn't have much impact on the speed at which those functions run, as they would be in memory regardless, but you should look at the R compiler for an easy way to get a moderate speed boost.
See this blog post about the compiler:
the performance gain for various made-up functions can range between 2x to 5x times faster running time. This is great for the small amount of work ... it requires ... Moreover, by combining C/C++ code with R code (through the {Rcpp} and {Inline} packages) you can improve your code's running time by a factor of 80 ... relative to interpreted code. But to be fair to R, the code that is used for such examples is often unrealistic code examples that is often not representative of real R work. Thus, effective speed gains can be expected to be smaller.
The easiest way to use the compiler is to place this at the beginning of your script. R will then automatically compile any function you create.
require(compiler)   # the compiler package ships with base R
enableJIT(3)        # highest JIT level: compile closures and top-level loops before use
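If you would rather not change the global JIT setting, compiler::cmpfun() byte-compiles a single function; a small sketch (the function here is made up):

library(compiler)

slow_f <- function(x) {
  s <- 0
  for (i in seq_along(x)) s <- s + x[i]^2
  s
}

fast_f <- cmpfun(slow_f)   # byte-compiled copy of the same function
fast_f(runif(1e6))

Note that recent versions of R (3.4 and later) byte-compile functions by default, so the explicit step may buy you less than the quoted figures suggest.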

R code failed with: "Error: cannot allocate buffer"

Compiling an RMarkdown script overnight failed with the message:
Error: cannot allocate buffer
Execution halted
The code chunk it died on was training a caretEnsemble list of 10 machine learning algorithms. I know it takes a fair bit of RAM and computing time, but I had previously succeeded in running the same code in the console. Why did it fail in RMarkdown? I'm fairly sure that even if it ran out of free RAM, there was enough swap.
I'm running Ubuntu with 3GB RAM and 4GB swap.
I found a blog article about memory limits in R, but it only applies to Windows: http://www.r-bloggers.com/memory-limit-management-in-r/
Any ideas on solving/avoiding this problem?
One reason why it may be backing up is that knitr and RMarkdown add a layer of computing complexity to things, and they take some memory. The console is the most streamlined implementation.
Also, caret is fat, slow and unapologetic about it. If the machine learning algorithm is complex, the data set is large and you have limited RAM, it can become problematic.
Some things you can do to reduce the burden (a combined sketch follows after this list):
If there are unused variables in the set, use a subset of the ones you want and then clear the old set from memory using rm() with your variable name for the data frame in the parentheses.
After removing variables, run a garbage collection; it reclaims the memory space that your removed variables and interim sets were taking up.
R does not always release memory promptly on its own, so if a function is not written with a garbage collect and you do not do one yourself, the refuse from past executions persists in memory and makes life hard.
To do this, just type gc() with nothing in the parentheses. Also clear out the memory with gc() between the 10 ML runs. And if you import data with XLConnect, the Java implementation is nastily inefficient... that alone could tap out your memory; gc() after using it every time.
After setting up your training, testing and validation sets, save the testing and validation files in CSV format on the hard drive, REMOVE THEM from your memory and run, you guessed it, gc(). Load them again when you need them after the first model.
Once you have decided which of the algorithms to run, try installing their original packages separately instead of running caret; require() each by name as you get to it, and clean up after each one with detach("package:packagenamehere", unload = TRUE) followed by gc().
There are two reasons for this.
One, caret is a collection of other ML algorithms, and it is inherently slower than all of them in their native environments. An example: I was running a data set through random forest in caret, and after 30 minutes I was less than 20% done. It had crashed twice already at about the one-hour mark. I loaded the original independent package and had a completed analysis in about 4 minutes.
Two, if you require, detach and garbage collect, you have less resident memory to worry about bogging you down. Otherwise you have ALL of caret's functions in memory at once... that is wasteful.
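A compact sketch of the rm()/gc()/detach() housekeeping described above; the object and package names are placeholders:

# keep only the columns you need, then drop the original object
train_small <- full_data[, c("y", "x1", "x2")]
rm(full_data)
gc()                                   # reclaim the memory straight away

# park the hold-out sets on disk until they are needed
write.csv(test_set, "test_set.csv", row.names = FALSE)
write.csv(valid_set, "valid_set.csv", row.names = FALSE)
rm(test_set, valid_set); gc()

# fit one model with its native package, then unload it before the next
library(randomForest)
fit <- randomForest(y ~ ., data = train_small)
detach("package:randomForest", unload = TRUE)
gc()

# read the hold-out data back only when you need it
test_set <- read.csv("test_set.csv")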
There are some general things you can do to make it go better that you might not initially think of but could be useful. Depending on your code they may or may not work, or work to varying degrees, but try them and see where it gets you (a short sketch follows after point III).
I. Use lexical scoping to your advantage. Run the whole script in a clean RStudio environment and make sure that all of the pieces and parts are living in your workspace, then garbage collect the remnants. Then go to knitr & RMarkdown and call pieces and parts from your existing workspace. They are available to you in Markdown under the same RStudio session, as long as nothing was created inside a loop without being saved to the global environment.
II. In Markdown, set your code chunks up so that you cache the things that would need to be calculated multiple times, so that they live somewhere ready to be called upon instead of taxing memory multiple times.
If you take a column from a data frame, do something as simple as multiplying each observation in it, and save the result back into the original frame, you can end up with as many as three copies in memory. If the file is large, that is a killer. So make a clean copy, garbage collect and cache the pure frame.
Caching intuitively seems like it would waste memory, and done wrong it will, but if you rm() the unnecessary objects from the environment and gc() regularly, you will probably benefit from tactical caching.
III. If things are still getting bogged down, you can try saving results in CSV files, sending them to the hard drive and calling them back up as needed, to move them out of memory if you do not need all of the data at one time.
I am pretty certain that you can set the program up to load and unload libraries, data and results as needed. But honestly the best thing you can do, based on my own biased experience, is move away from caret on big multi-algorithm processes.
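A hedged illustration of points II and III, kept entirely in R (the options call belongs in the setup chunk of the .Rmd; object and file names are made up):

# point II: turn caching on for expensive chunks, from the setup chunk
knitr::opts_chunk$set(cache = TRUE)

# point III: push big intermediate results to disk and re-read on demand
write.csv(model_results, "model_results.csv", row.names = FALSE)
rm(model_results); gc()

# ... later, in the chunk that actually needs them:
model_results <- read.csv("model_results.csv")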
I was getting this error when I was inadvertently running the 32-bit version of R on my 64-bit machine.
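You can check which build you are running from any R session:

R.version$arch             # e.g. "x86_64" for a 64-bit build
.Machine$sizeof.pointer    # 8 on 64-bit R, 4 on 32-bit R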

Restriction on the size of an Excel file

I need to use R to open an Excel file, which can have 1000~10000 rows and 5000~20000 columns. I would like to know whether there is any restriction on the size of this kind of Excel file in R.
Generally speaking, your limitation in using R will be how well the data set fits in memory, rather than specific limits on the size or dimension of a data set. The closer you are to filling up your available RAM (including everything else you're doing on your computer) the more likely you are to run into problems.
But keep in mind that having enough RAM to simply load the data set into memory is often a very different thing than having enough RAM to manipulate the data set, which by the very nature of R will often involve a lot of copying of objects. And this in turn leads to a whole collection of specialized R packages that allow for the manipulation of data in R with minimal (or zero) copying...
The most I can say about your specific situation, given the very limited amount of information you've provided, is that it seems likely your data will not exceed your physical RAM constraints, but it will be large enough that you will need to take some care to write smart code, as many naive approaches may end up being quite slow.
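As a rough back-of-the-envelope check (assuming every cell is numeric, i.e. 8 bytes per value, at the largest dimensions mentioned in the question):

rows <- 10000; cols <- 20000
rows * cols * 8 / 2^30     # ~1.5 GiB just to hold the values in RAM

That fits in typical RAM, but leaves less headroom once R starts copying objects during manipulation.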
I do not see any barrier to this on the R side. Looks like a fairly modestly sized dataset. It could possibly depend on "how" you do this, but you have not described any code, so that remains an unknown.
The above answers correctly discuss the memory issue. I have recently been importing some large Excel files too. I highly recommend trying out the XLConnect package to read in (and write) files.
options(java.parameters = "-Xmx1024m") # Increase the available memory for JVM to 1GB or more.
# This option should be always set before loading the XLConnect package.
library(XLConnect)
wb.read <- loadWorkbook("path.to.file")
data <- readWorksheet(wb.read, sheet = "sheet.name")
