I have installed IRkernel following the standard instructions. When I try to work on a large dataset (9M rows x 70 columns), a simple command such as print(db[1,1]) or head(db) takes forever to run in a Jupyter Notebook with IRkernel.
I can run these commands in seconds in RStudio.
Is there anything I could change in order to work on this dataset efficiently through Jupyter Notebook? What could be the problem?
Thanks a lot.
I'm not sure if this is a CV question or SO, so I apologize if it falls within the CV domain.
Problem
I know it's possible to microbenchmark specific chunks of R code, but is there a benchmarking tool for an entire Jupyter Notebook? I could just run the entire notebook and time it manually, but I'd like the kind of timing statistics and precision that the microbenchmark package provides (I'm trying to make a case for automating data analyses and visualizations).
The other dilemma (a general issue with notebooks) is that my notebook is divided into many individual cells, so benchmarking inside the Jupyter environment might be inefficient (forcing me to export all the code and then run a microbenchmark on it in, say, RStudio).
Desired Solution
An efficient way to benchmark an entire, multi-celled JupyterLab Notebook.
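One possible approach, sketched under assumptions rather than offered as an established tool: export the notebook to a plain R script (for example with jupyter nbconvert --to script) and time repeated source() calls on it. The helper name time_script and the temporary stand-in script are hypothetical; only base R's system.time is used, so no extra package is needed.

```r
# Hypothetical helper: run an R script several times and collect elapsed times.
time_script <- function(path, times = 5) {
  vapply(seq_len(times), function(i) {
    # Fresh environment per run so that runs do not share state
    env <- new.env(parent = globalenv())
    unname(system.time(source(path, local = env))["elapsed"])
  }, numeric(1))
}

# Stand-in for the exported notebook (assumption: the real file would come
# from something like `jupyter nbconvert --to script notebook.ipynb`).
script <- tempfile(fileext = ".R")
writeLines(c("x <- rnorm(1e5)", "y <- cumsum(x)", "m <- mean(y)"), script)

elapsed <- time_script(script, times = 5)
summary(elapsed)  # min / median / mean / max of the whole-script timings
```

If the microbenchmark package is installed, the same source() call could be passed to microbenchmark() instead to get its finer-grained statistics.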
I am stuck with a huge dataset that I need to import into R and then process (with randomForest). I have a CSV file of about 100K rows and 6K columns. Importing it directly takes a long time and produces many warnings about memory allocation (limit reached for 8061mb). After the warnings I do get the data frame in R, but I am not sure whether I can rely on it. Even if I use that data frame, I am fairly sure that running randomForest on it will be a huge problem. Hence, mine is a two-part question:
How to efficiently import such a large csv file without any warnings/errors?
Once it is imported into R, how should I proceed with using the randomForest function on it?
Should we use some package that improves computing efficiency? Any help is welcome, thanks.
Your memory limit in R appears to be about 8 GB; try increasing it if your machine has more memory.
If that does not work, one option is to submit the job to MapReduce or Spark from R (see http://www.r-bloggers.com/mapreduce-with-r-on-hadoop-and-amazon-emr/ and https://spark.apache.org/docs/1.5.1/sparkr.html). However, random forests are not yet supported by either option.
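For the import part of the question, here is a minimal sketch in base R, assuming the column types are known in advance. A 100K x 6K table of doubles needs roughly 4.5 GiB on its own (1e5 * 6e3 * 8 bytes), which is why an 8 GB limit is tight once temporary copies are counted. The tiny temporary file and its column names are made up for illustration.

```r
# Small temporary CSV standing in for the real 100K x 6K file (assumption).
csv <- tempfile(fileext = ".csv")
toy <- data.frame(id = 1:1000, a = rnorm(1000), b = rnorm(1000))
write.csv(toy, csv, row.names = FALSE)

# Back-of-the-envelope size of the real table if every column is a double
approx_gib <- 1e5 * 6e3 * 8 / 2^30

# Declaring column types up front avoids type guessing, and giving nrows
# lets R size its buffers once instead of growing them repeatedly.
df <- read.csv(csv,
               colClasses = c("integer", "numeric", "numeric"),
               nrows = 1000)

str(df)
print(object.size(df), units = "Kb")  # rough in-memory footprint
```

A commonly suggested alternative is fread from the data.table package, which is usually much faster on wide files; it is not used here only to keep the sketch dependency-free.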
Are there any packages that let R run faster via parallel computing? I have built a very large integer program (IP) that needs to run for a while, so I was wondering whether there is a specific package in R that could help me run it. Currently, I have a function that returns the solution of an IP, and the line that R gets stuck on (for a very, very long time) is the call to lp (....all.int = TRUE). My CPU usage is around 12.5% (8 cores) on my Windows computer, and I want it to be near 100%.
Edit: I tried using the doParallel package,
library('doParallel')
cl <- makeCluster(8)
registerDoParallel(cl)
But my CPU usage is still not at 100%. What else do I need to do? Is there a specific package that makes optimization problems run faster? Most parallel packages help with simulation, and foreach seems to work only for iterative structures and apply-style functions. I just want R to use all of my CPU.
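A note on the numbers: 12.5% on 8 cores corresponds to one core running flat out. Creating and registering a cluster does not by itself move any work onto the workers; the work has to be dispatched explicitly, for example with foreach and %dopar%, or with parLapply. The sketch below uses the base parallel package. The function solve_one is a hypothetical stand-in for an independent subproblem; a single lp() call would, as far as I know, still run inside one process.

```r
library(parallel)

# Hypothetical stand-in for one independent piece of work.
solve_one <- function(k) {
  sum(sqrt(seq_len(k * 1e4)))
}

cl <- makeCluster(2)                      # two worker processes; adjust to your machine
results <- parLapply(cl, 1:8, solve_one)  # the work is shipped to the workers here
stopCluster(cl)                           # always release the workers

unlist(results)
```

This only helps when the problem can be split into independent pieces, which is an assumption that may not hold for a single large IP.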
I am running my code on a PC, and I don't think I have a problem with RAM.
When I run this step:
dataset <- rbind(dataset_1, dataset_2,dataset_3,dataset_4,dataset_5)
I got the
Error: cannot allocate vector of size 261.0 Mb
dataset_1 through dataset_5 have around 5 million observations each.
Could anyone please advise how to solve this problem?
Thank you very much!
There are several packages available that may solve your problem under the High Performance Computing CRAN taskview. See "Large memory and out-of-memory data", the ff package, for example.
R, like MATLAB, loads all data into memory, which means you can quickly run out of RAM (especially with big datasets). The only alternative I can see is to partition your data (i.e. load only part of it), do the analysis on that part, and write the results to files before loading the next chunk.
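The chunk-by-chunk idea above can be sketched in base R as follows. This is an illustration under assumptions, not a drop-in solution: the temporary file, the column x, and the chunk size are made up, and the per-chunk "analysis" is just a running sum that stands in for whatever result you would write out.

```r
# Small temporary file standing in for a dataset too big for RAM (assumption).
path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:1050), path, row.names = FALSE)

chunk_size <- 100
con <- file(path, open = "r")
# Consume the header line once and strip the quotes write.csv adds
col_names <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])

total <- 0
n_rows <- 0
repeat {
  lines <- readLines(con, n = chunk_size)   # next chunk of raw lines
  if (length(lines) == 0) break             # end of file reached
  chunk <- read.csv(text = lines, header = FALSE, col.names = col_names)
  total <- total + sum(chunk$x)             # keep only the running summary
  n_rows <- n_rows + nrow(chunk)
}
close(con)

total / n_rows  # mean of x, computed without holding the full file in memory
```

Only one chunk is in memory at a time, so peak usage depends on chunk_size rather than on the size of the file.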
In your case you might want to use Linux tools to merge the datasets.
Say you have two files, dataset1.txt and dataset2.txt; you can merge them using shell commands such as join, cat, or awk.
More generally, using Linux shell tools for parsing big datasets is usually much faster and requires much less memory.
I don't think I quite understand how memory works in R. I've been running into problems where the same piece of code gets slower later in the week (using the same R session, sometimes even after I clear the workspace). I've tried to develop a toy problem that I think reproduces the "slowing down effect" I have been observing when working with large objects. Note that the code below is somewhat memory intensive (don't blindly run it without adjusting n and N to match what your setup can handle). It will likely take about 5-10 minutes before you start to see the slowdown pattern (possibly even longer).
N <- 4e7  # number of simulation runs
n <- 2e5  # number of simulation runs between elapsed-time measurements
meanStorer <- rep(0, N)   # pre-allocated result vector
toc <- rep(0, N / n)      # elapsed time for each block of n runs
x <- rep(0, 50)
for (i in 1:N) {
  if (i %% n == 1) { tic <- proc.time()[3] }  # start of a block
  x[] <- runif(50)
  meanStorer[i] <- mean(x)
  if (i %% n == 0) { toc[i / n] <- proc.time()[3] - tic; print(toc[i / n]) }
}
plot(toc)
meanStorer is certainly large, but it is pre-allocated, so I am not sure why the loop slows down as time goes on. If I clear my workspace and run this code again, it starts out just as slow as the last few calculations! I am using RStudio (in case that matters). Here is some of my system information:
OS: Windows 7
System Type: 64-bit
RAM: 8gb
R version: 2.15.1 ($platform yields "x86_64-pc-mingw32")
Here is a plot of toc, prior to using pre-allocation for x (i.e. using x=runif(50) in the loop)
Here is a plot of toc, after using pre-allocation for x (i.e. using x[]=runif(50) in the loop)
Is ?rm not doing what I think it's doing? What's going on under the hood when I clear the workspace?
Update: with the newest version of R (3.1.0), the problem no longer persists even when increasing N to N=3e8 (note R doesn't allow vectors too much larger than this)
It is quite unsatisfying that the fix is just updating R to the newest version, because I can't figure out why there were problems in version 2.15. It would still be nice to know what caused them, so I am going to leave this question open.
As you state in your updated question, the high-level answer is that you were using an old version of R with a bug: with the newest version of R (3.1.0), the problem no longer persists.