Cannot load dataset due to 'Your notebook tried to allocate more memory than is available. It has restarted' - out-of-memory

So this is my first time doing a Kaggle competition, and I am attempting to load the dataset using pd.read_csv(), which returns the error 'Your notebook tried to allocate more memory than is available. It has restarted.'
Looking through various discussion boards and notebooks on Kaggle and Stack Overflow, I found that the suggested solutions are to adjust the dataset so that it uses less memory. However, my current problem is that I cannot even load the dataset in order to adjust it.
Additionally, do people normally run datasets on their private PC? I don't understand how I can increase my memory without upgrading my hardware, which is not really an option for me.

Related

R running very slowly after loading large datasets > 8GB

I have been unable to work in R given how slowly it operates once my datasets are loaded. These datasets total around 8GB. I am running on 8GB of RAM and have adjusted memory.limit to exceed my RAM, but nothing seems to be working. Also, I have used fread from the data.table package to read these files, simply because read.table would not run.
After seeing a similar post on the forum addressing the same issue, I have attempted to run gctorture(), but to no avail.
R is running so slowly that I cannot even check the length of the list of datasets I have loaded, nor View them or do any basic operation once they are loaded.
I have tried loading the datasets in 'pieces', a third of the files at a time, which seemed to make the importing run more smoothly but has not changed anything with regard to how slowly R runs afterwards.
Is there any way to get around this issue? Any help would be much appreciated.
Thank you all for your time.
The problem arises because R loads the full dataset into RAM, which often brings the system to a halt when you try to View your data.
If it's a really huge dataset, first make sure the data contains only the most important columns and rows. Valid columns can be identified through the domain and world knowledge you have about the problem. You can also try to eliminate rows with missing values.
Once this is done, depending on the size of your data, you can try different approaches. One is to use packages like bigmemory and ff. bigmemory, for example, creates a pointer object through which you can read the data from disk without loading it into memory.
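For instance, here is a minimal sketch with bigmemory, assuming a mostly numeric CSV and hypothetical file names:
library(bigmemory)

# read.big.matrix() creates a file-backed matrix: the data stay on disk and
# only a small pointer object lives in RAM. Note that a big.matrix holds a
# single numeric type, so this suits numeric data.
x <- read.big.matrix("big_data.csv", header = TRUE, type = "double",
                     backingfile = "big_data.bin",
                     descriptorfile = "big_data.desc")

dim(x)        # dimensions are available without loading the data
x[1:5, 1:3]   # small subsets are pulled from disk on demand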
Another approach is parallelism (implicit or explicit). MapReduce is another package that is very useful for handling big datasets.
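As a minimal sketch of explicit parallelism with the base parallel package (the per-file summary is a hypothetical workload):
library(parallel)

cl <- makeCluster(detectCores() - 1)
# Summarise each CSV file on a separate worker instead of holding them all
# in one session at once.
file_list <- list.files(pattern = "\\.csv$")
results <- parLapply(cl, file_list, function(f) {
  d <- read.csv(f)
  colMeans(d[sapply(d, is.numeric)])
})
stopCluster(cl)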
For more information on these, check out this blog post on RPubs and this old-but-gold post on SO.

What are the risks of "Reached total allocation of 31249Mb: see help(memory.size)"

A few times when modifying large objects (5GB) on a Windows machine with 30GB of RAM, I have been receiving the error
Reached total allocation of 31249Mb: see help(memory.size). However, the process seems to complete, i.e. I get a file with what look like the right values. Checking every bit of a large file for exactly the right returns, by cutting it up and comparing it to the right section, is time consuming, but when I have done it the returned objects appear consistent with my expectations.
What risks/side effects can I expect from this error? What should I be checking? Is the process automatically recovering, given that I am getting back the returns I expect, or are the errors going to be more subtle? My entire analysis is written using the tidyverse; does this mean I can rely on good error handling from Hadley et al., and is that why my process warns but also completes?
N.B. I have not included an attempt at an MWE, as every machine will have different limits on available memory, though I am happy to be shown an MWE for this kind of process if there are suggestions.
Use memory.limit(x), where x is the amount of memory, in MB, to give it.
See link for more details:
Increasing (or decreasing) the memory available to R processes
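As a minimal sketch (Windows-only; note that memory.limit() was retired in R 4.2 and later, where it no longer has any effect):
memory.limit()              # query the current limit, in MB
memory.limit(size = 64000)  # raise the limit to ~64GB (RAM plus pagefile)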

Forcing R (and RStudio) to use virtual memory on Windows

I'm working with large datasets, and quite often R produces an error telling me it can't allocate a vector of that size or that it doesn't have enough memory.
My computer has 16GB of RAM (Windows 10) and I'm working with datasets of around 4GB, but some operations need a lot of memory, for example converting a dataset from wide format to long.
In some situations I can use gc() to release some memory, but many times it's not enough.
Sometimes I can break the dataset into smaller chunks, but sometimes I need to work with the whole table at once.
I've read that Linux users don't have this problem, but what about Windows?
I've tried setting a large pagefile on an SSD (200GB), but I've found that R doesn't use it at all.
Watching the Task Manager, I can see that when memory consumption reaches 16GB, R stops working. The size of the pagefile doesn't seem to make any difference.
How can I force R to use the pagefile?
Do I need to compile it myself with some special flags?
PS: My experience is that deleting an object with rm() and later using gc() doesn't recover all the memory. As I perform operations with large datasets, my computer has less and less free memory at every step, no matter whether I use gc().
PS2: I hope not to hear trivial solutions like "you need more RAM".
PS3: I've been testing, and the problem only happens in RStudio. If I use R directly, it works well. Does anybody know how to do this in RStudio?
To get this working automatically every time you start RStudio: the R_MAX_MEM_SIZE solution is ignored, whether it is created as an environment variable or inside the .Rprofile.
Writing memory.limit(64000) is ignored too.
The proper way is to add the following line to the .Rprofile file:
invisible(utils::memory.limit(64000))
or whatever number you want.
Of course you need to have a pagefile big enough. That number includes free RAM and free pagefile space.
Using the pagefile is slower but it's going to be used only when needed.
Something strange I've found is that it only lets you increase the maximum memory to use; it doesn't allow you to decrease it.

R code failed with: "Error: cannot allocate buffer"

Compiling an RMarkdown script overnight failed with the message:
Error: cannot allocate buffer
Execution halted
The code chunk it died on was training a caretEnsemble list of 10 machine learning algorithms. I know it takes a fair bit of RAM and computing time, but I did previously succeed in running that same code in the console. Why did it fail in RMarkdown? I'm fairly sure that even if it ran out of free RAM, there was enough swap.
I'm running Ubuntu with 3GB RAM and 4GB swap.
I found a blog article about memory limits in R, but it only applies to Windows: http://www.r-bloggers.com/memory-limit-management-in-r/
Any ideas on solving/avoiding this problem?
One reason why it may be backing up is that knitr and RMarkdown add a layer of computing complexity to things, and they take some memory. The console is the most streamlined implementation.
Also, caret is fat, slow and unapologetic about it. If the machine learning algorithm is complex, the data set is large and you have limited RAM, it can become problematic.
Some things you can do to reduce the burden:
If there are unused variables in the set, use a subset of the ones you want and then clear the old set from memory using rm(), with the name of the data frame in the parentheses.
After removing variables, run garbage collection; it reclaims the memory space that your removed variables and interim sets were taking up.
R does not aggressively purge memory on its own, so if a function is not written with a garbage collect and you do not do it yourself, all your past executed refuse persists in memory, making life hard.
To do this, just type gc() with nothing in the parentheses. Also clear out the memory with gc() between the 10 ML runs. And if you import data with XLConnect, its Java implementation is nastily inefficient; that alone could tap your memory, so gc() after using it every time.
After setting up training, testing and validation sets, save the testing and validation files in CSV format on the hard drive, REMOVE THEM from memory and run, you guessed it, gc(). Load them again when you need them, after the first model.
Once you have decided which of the algorithms to run, try installing their original packages separately instead of running caret, require() each by name as you get to it, and clean up after each one with detach(package:packagenamehere) and gc() (a sketch of this whole pattern appears below).
There are two reasons for this.
One, caret is a collection of other ML algorithms, and it is inherently slower than ALL of them in their native environments. An example: I was running a data set through random forest in caret; after 30 minutes I was less than 20% done, and it had already crashed twice at about the one-hour mark. I loaded the original independent package and had a completed analysis in about 4 minutes.
Two, if you require, detach and garbage collect, you have less resident memory to worry about bogging you down. Otherwise you have ALL of caret's functions in memory at once; that is wasteful.
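Putting those housekeeping steps together, here is a minimal sketch with hypothetical object and column names (full_data, y, x1, x2):
# Keep only the columns you actually need, then drop the full data frame.
idx   <- sample(nrow(full_data), floor(0.8 * nrow(full_data)))
train <- full_data[idx,  c("y", "x1", "x2")]
test  <- full_data[-idx, c("y", "x1", "x2")]
rm(full_data, idx)
gc()

# Park the test set on disk and free its memory until it is needed.
write.csv(test, "test.csv", row.names = FALSE)
rm(test)
gc()

# Use the native package directly instead of going through caret.
library(randomForest)
fit <- randomForest(y ~ ., data = train, ntree = 100)

test  <- read.csv("test.csv")          # reload only when it is needed again
preds <- predict(fit, newdata = test)

# Detach the package and garbage collect before the next algorithm.
detach("package:randomForest", unload = TRUE)
gc()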
There are some general things you can do to make it go better that you might not initially think of but that could be useful. Depending on your code, they may or may not work, or work to varying degrees, but try them and see where it gets you.
I. Use lexical scoping to your advantage. Run the whole script in a clean RStudio environment and make sure that all of the pieces and parts are living in your workspace. Then garbage collect the remnants. Then go to knitr and RMarkdown and call the pieces and parts from your existing workspace. They are available to you in Markdown under the same RStudio shell, as long as nothing was created inside a loop without being saved to the global environment.
II. In Markdown, set your code chunks up so that you cache the things that would need to be calculated multiple times, so that they live somewhere ready to be called upon instead of taxing memory multiple times.
If you call a variable from a data frame and do something as simple as multiplying each observation in one column and saving it back into that same frame, you could end up with as many as 3 copies in memory. If the file is large, that is a killer. So make a clean copy, garbage collect and cache the pure frame.
Caching intuitively seems like it would waste memory, and done wrong it will, but if you rm() the unnecessary objects from the environment and gc() regularly, you will probably benefit from tactical caching (a minimal sketch follows below).
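A minimal sketch of that caching pattern, assuming it lives in an R Markdown setup chunk:
# Cache all chunk results between knits so expensive steps are not recomputed:
knitr::opts_chunk$set(cache = TRUE)
# Or enable caching for a single chunk via its header, e.g. {r heavy_model, cache=TRUE};
# the cleaned copy of the frame is then rebuilt only when that chunk's code changes.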
III. If things are still getting bogged down, you can try to save results in CSV files, send them to the hard drive and call them back up as needed, moving them out of memory if you do not need all of the data at one time.
I am pretty certain that you can set the program up to load and unload libraries, data and results as needed. But honestly, the best thing you can do, based on my own biased experience, is move away from caret for big multi-algorithm processes.
I was getting this error when I was inadvertently running the 32-bit version of R on my 64-bit machine.

R becomes unresponsive while running randomforest on huge data. Does this mean it is still running or it has stopped working?

My data contains 229,907 rows and 200 columns, and I am training randomForest on it. I know it will take time, but I do not know how much. While running randomForest on this data, R becomes unresponsive: "R Console (64 Bit) (Not Responding)". I just want to know what this means. Is R still working, or has it stopped working so that I should close it and start again?
It's common for RGui to be unresponsive during a long calculation. If you wait long enough, it will usually come back.
The running time won't scale linearly with your data size. With the default parameters, more data means both more observations to process and more nodes per tree. Try building some small forests with ntree=1, different values of the maxnodes parameter and different amounts of data, to get a feel for how long it should take. Have the Windows task manager or similar open at the same time so that you can monitor CPU and RAM usage.
Another thing you can try is making some small forests (small values of ntree) and then using the combine function to make a big forest.
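A minimal sketch of that experiment, with dat and y standing in for your data frame and response column:
library(randomForest)

# Time a few tiny forests to see how run time grows with tree size and data.
system.time(randomForest(y ~ ., data = dat[1:10000, ], ntree = 1, maxnodes = 16))
system.time(randomForest(y ~ ., data = dat[1:50000, ], ntree = 5, maxnodes = 64))

# Grow several small forests and merge them into one larger forest.
parts  <- lapply(1:5, function(i) randomForest(y ~ ., data = dat, ntree = 20))
big_rf <- do.call(combine, parts)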
You should check your CPU usage and memory usage. If the R process is still showing high CPU usage, R is probably still going strong.
Consider switching to 32-bit R. For some reason, it seems more stable for me, even when my system is perfectly capable of 64-bit support.
