RapidMiner Hangs on Basic Task (Fresh Install)

I selected the first 50,000 rows of the Homesite dataset - a current Kaggle competition. (30MB out of the full 200MB in the dataset.) I imported the .csv file and connected it to a Random Forest model. I changed one default - the RF will build 50 trees rather than 10.
If I click on any other task icon after I start the process, the laptop hangs, requiring a power down - which I've never had to do before. If I don't select any other icon and just stay in RapidMiner, it hangs at 1:15 (the displayed timer).
This was my very first attempt at using RapidMiner Studio 6.5. I have the much older RapidMiner 5, which doesn't hang, but it is painfully slow compared to R, and it doesn't even have Random Forests or many of the other models found in 6.5. Version 6.5 is also supposed to be much faster, and it is supposed to be able to run R scripts.
I performed a complete checkup on my Dell laptop. Everything passes.
I ran a complete scan with Malwarebytes.
I don't have any issues with any other software.
R builds neural networks (etc) at the expected speeds without any problems.
RapidMiner Studio 6.5 (64-bit; Basic Edition)
Windows 8.0 (64-bit); Intel Core i3 @ 2 GHz; 6 GB RAM.

Related

Unusually long package installation time on RStudio Server Pro Standard on GCP?

If we run install.packages("dplyr") on a GCP 'RStudio Server Pro Standard' VM, it takes around 3 minutes to install (on an instance with 4 cores / 15 GB RAM).
This seems unusual, as installation would typically take ~20 seconds on a laptop with equivalent specs.
Why so slow, and is there a quick and easy way of speeding this up?
Notes
I use the RStudio Server Pro Standard image from the GCP Marketplace to start the instance
Keen to know if there are any 'startup scripts' or similar I can set to run after the instance starts, e.g. to install a collection of commonly used packages
@user5783745 you can also adjust the Makevars file to allow multithreaded compilation, which will help speed up compilation.
I followed this RStudio community post, and dropped MAKEFLAGS = -j4 into ~/.R/Makevars.
This basically halved the amount of time it took to install dplyr from scratch on the RStudio Server Pro Standard for GCP instance I spun up (same as yours: 4 vCPUs, 15 GB RAM).
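
For anyone reproducing the Makevars change, here is a minimal sketch in R (run once per user on the server; the -j4 flag matches the 4 vCPUs mentioned above):

# Minimal sketch: write MAKEFLAGS = -j4 into ~/.R/Makevars so that packages
# compiled from source can use four parallel make jobs, as described above.
makevars_dir  <- path.expand("~/.R")
makevars_file <- file.path(makevars_dir, "Makevars")

dir.create(makevars_dir, showWarnings = FALSE, recursive = TRUE)

# Append rather than overwrite, in case a Makevars file already exists.
cat("MAKEFLAGS = -j4\n", file = makevars_file, append = TRUE)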

Cordapp tutorial crashing in a Fedora VirtualBox Machine

I have downloaded the CorDapp example provided on the Corda website. I followed all the steps (to run it from the console) in
https://docs.corda.net/tutorial-cordapp.html
without any problem until "Running the example CorDapp". Here I get errors one way or another.
First, when running
workflows-kotlin/build/nodes/runnodes
one or more of the nodes would not start. I was using a virtual machine with 2 cores and 4 GB of RAM. Eventually, I noticed it seemed to be an issue with the RAM, so I changed the VM config to 4 CPUs and 10 GB of RAM.
Now I can run
workflows-kotlin/build/nodes/runnodes
and get all 4 nodes working, but as soon as I run the following instruction
./gradlew runPartyXServer
where X = [A, B, C] for each of the possible nodes, after 20-30 seconds at most the machine suddenly slows down and aborts.
The VM has Fedora 30, 4 cores and 10 GB of RAM. It is empty except for what I downloaded for the tutorial. I cannot believe those are not enough resources to run the tutorial. Am I wrong? Do I need more? Could it be something else?
Any help is welcome.
Solved
The issue was the resources. I jumped to 8 cores and 32 GB and it ran. I will try at some point with 16 GB. In any case, the problem, from my point of view, is that given these large hardware requirements, the tutorial should include a section describing the minimum setup needed to run it.
From the given information, I believe you ran into a memory issue.
According to our documentation, Corda has a suggested minimum requirement of 1 GB of heap and 2-3 GB of host RAM per node.
https://docs.corda.net/docs/corda-enterprise/4.4/node/sizing-and-performance.html#sizing
With four nodes, that is roughly 8-12 GB of host RAM before the webservers started by runPartyXServer are counted, so a 10 GB VM is already borderline. I would suggest either reducing the number of nodes hosted on a single machine or expanding the RAM of the VM.

Jupyter notebook on Ubuntu doesn't free memory

I have a piece of code (Python 3.7) in a Jupyter notebook cell that generates a pandas data frame (df) containing numpy arrays.
I am checking the memory consumption of the df just by looking at the system monitor app preinstalled in Ubuntu.
The problem is that if I run the cell a second time, the memory consumption doubles, even though the df is assigned to the same variable.
If I run the same cell multiple times, the system runs out of memory and the kernel dies by itself.
Using del df or gc.collect() doesn't free the memory either.
Restarting the notebook kernel is the only way to free the memory.
In practice, I would expect the memory to stay roughly the same because I am just reassigning a new df to the same variable over and over again.
Indeed, the memory accumulates only if I run the code on a Linux machine and in the notebook. If I run the same code from the terminal with python script.py, or if I run the very same notebook on macOS, the memory pressure does not change; I can run the same cell multiple times and the occupied memory stays stable (as expected).
Can you help me figure out where the problem is coming from and how to solve it?
P.S. Both Python and Jupyter are installed with Anaconda 2018.12 on Ubuntu 18.04.
I have asked the same question on the Ubuntu community since I am not sure this is strictly related to Python itself, but I got no answers so far.

Loading .dta data into R takes long time

Some confidential data is stored on a server and accessible for researchers via remote access.
Researchers can log in via some (I think Cisco) remote client and share virtual machines on the same host.
There's 64-bit Windows running on the virtual machine.
The system appears to be optimized for Stata; I'm among the first to use the data with R. There is no RStudio installed on the client, just RGui 3.0.2.
And here's my problem: the data is saved in the Stata format (.dta), and I need to open it in R. At the moment I am doing
read.dta(fileName, convert.factors = FALSE)[fields]
Loading a smaller file (around 200 MB) takes 1-2 minutes. However, loading the main file (3-4 GB) takes very long, longer than my patience lasted. During that time, the R GUI stops responding.
I can test my code on my own machine (OS X, RStudio) on a smaller data sample, and that works fine. Is this
because of OS X + RStudio, or only
because of the size of the file?
A colleague is using Stata on a similar file in their environment, and that works fine for him.
What can I do to improve the situation? Possible solutions I came up with are:
Load the data into R somehow differently (perhaps there is a way that doesn't require all this memory usage); a sketch of this option follows the list. I also have access to Stata. If all else fails, I could prepare the data in Stata, for example slice it into smaller pieces and reassemble it in R
Ask them to allocate more memory to my user of the VM (if that indeed is the issue)
Ask them to provide RStudio as a backend (even if that's not faster, perhaps it's less prone to crashes)
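
For the first option, a minimal sketch of what "loading differently" could look like, assuming a .dta reader other than foreign::read.dta - for example the haven package - can be installed on the remote R session; whether it is actually faster or lighter on memory there is an assumption to verify. fileName and fields are placeholders:

# Sketch only: read the Stata file with haven::read_dta instead of foreign::read.dta,
# assuming a haven version compatible with the server's R can be installed.
# install.packages("haven")
library(haven)

fileName <- "main_file.dta"          # placeholder path
fields   <- c("id", "income")        # placeholder column names

df <- read_dta(fileName)             # returns a data frame (tibble)
df <- df[fields]                     # keep only the needed columns, as in the original call

# Note: like the original read.dta() call, this still loads the whole file before
# subsetting; slicing the file into pieces in Stata first remains the fallback.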
Certainly the size of the file is a prime factor, but the machine and configuration might be, too. Hard to tell without more information. You need a 64-bit operating system and a 64-bit version of R.
I don't imagine that RStudio will help or hinder the process.
If the process scales linearly, it means your big data case will take (120 seconds) * (4096 MB / 200 MB) ≈ 2458 seconds, or roughly 40 minutes. Is that how long you waited?
The process might not be linear.
Was the process making progress? If you checked CPU and memory, was the process still running? Was it doing a lot of page swaps?

Linear programming (lpSolve) error using Big Data in R

I am trying to optimize my model with 30,000 variables and 1,700 constraints, but I get this error when I add some more constraints.
n <- lp("max", f.obj, f.con, f.dir, f.rhs)$solution
Error: cannot allocate vector of size 129.9 Mb
I'm working on 32-bit Windows with 2 GB RAM.
What can I do to optimize my model using a large dataset?
That's a tiny machine by modern standards, and a non-tiny problem. Short answer is that you should run on a machine with a lot more RAM. Note that the problem isn't that R can't allocate 130 MB vectors in general -- it can -- it's that it's run out of memory on your specific machine.
I'd suggest running on a 64-bit instance of R 3.0 on a machine with 16 GB of RAM, and see if that helps.
You may want to look into spinning up a machine on the cloud, and using RStudio remotely, which will be a lot cheaper than buying a new computer.
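
If more RAM is simply not an option, one technique worth mentioning (not raised in the answer above, so treat it as a suggestion to verify) is lpSolve's sparse constraint interface: lp() can take the non-zero constraint coefficients as a three-column dense.const matrix instead of a full constraint matrix, and a dense 1,700 x 30,000 matrix of doubles alone is about 1700 * 30000 * 8 bytes ≈ 400 MB. A toy illustration, not the asker's model:

# Toy sketch of lpSolve's sparse constraint form (dense.const), not the asker's model.
library(lpSolve)

f.obj <- c(3, 2)            # maximise 3*x1 + 2*x2
f.dir <- c("<=", "<=")
f.rhs <- c(4, 6)

# Dense form: the full constraint matrix, zeros included.
f.con <- matrix(c(1, 0,
                  1, 2), nrow = 2, byrow = TRUE)
lp("max", f.obj, f.con, f.dir, f.rhs)$solution        # 4 1

# Sparse form: only non-zero coefficients, as (constraint, column, value) rows
# ordered by constraint number. For a mostly-zero 1700 x 30000 matrix this is
# far smaller than the dense version.
sparse.con <- matrix(c(1, 1, 1,
                       2, 1, 1,
                       2, 2, 2), ncol = 3, byrow = TRUE)
lp("max", f.obj, const.dir = f.dir, const.rhs = f.rhs,
   dense.const = sparse.con)$solution                  # 4 1

Whether this is enough on a 2 GB 32-bit machine is uncertain, but it avoids building the full dense constraint matrix.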

Resources