Cannot allocate memory - RSelenium and EC2

I am trying to implement a Selenium test that performs automated actions on a website (looping through pages). I am using R with the RSelenium package, plus a PostgreSQL database accessed through the DBI package, all running on an AWS EC2 instance.
My problem is that a few minutes after the script is launched, my RStudio session freezes (as does my Linux session) and I see a message like "cannot allocate memory".
So this is clearly a memory issue, and by running top I could see that my Selenium Docker container was using most of the resources.
My question is: how can I reduce the amount of memory used by the Selenium test?

IMHO there is no practical way for a test to use less memory than the test actually requires. You can try to simplify the test by breaking it up into two or more smaller tests, and check for memory leaks, as suggested in another answer.
It would be much easier to use the next largest instance type with more memory and, if cost is an issue, shut the instance down when it is not in use.

Don't forget drive.close() in your code; if you don't close your driver, you will end up with a lot of Chrome instances.
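Building on both answers above, here is a minimal sketch of what closing and periodically recycling the browser can look like with RSelenium. It assumes the Selenium server runs in a Docker container exposed on port 4445; page_urls and scrape_page() are placeholders standing in for the asker's own loop.

    # Minimal sketch: close and periodically recycle the browser session.
    # Assumes the Selenium server runs in a Docker container on port 4445;
    # scrape_page() is a placeholder for your own scraping logic.
    library(RSelenium)

    scrape_in_batches <- function(page_urls, batch_size = 50) {
      results <- list()
      batches <- split(page_urls, ceiling(seq_along(page_urls) / batch_size))

      for (batch in batches) {
        remDr <- remoteDriver(remoteServerAddr = "localhost",
                              port = 4445L, browserName = "chrome")
        remDr$open()

        for (url in batch) {
          remDr$navigate(url)
          results[[url]] <- scrape_page(remDr)  # your scraping logic here
        }

        # Close the session so Chrome processes are released before the next batch
        remDr$close()
        gc()
      }
      results
    }

Whether this is enough depends on how much memory a single page load needs, but recycling the session keeps orphaned Chrome processes from accumulating inside the container.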

Related

R is using multiple threads with no job running (R v4.0 / Win 10.0.18363)

A few days ago I noticed R was using 34% of the CPU when I have no code running. I noticed it again today and I can't figure out why. If I restart R, CPU usage returns to normal, then after 20 minutes or so it ramps up again.
I have a scheduled task that downloads a small file once a week using R, and another that uses wget in Ubuntu (WSL). It might be that the constant CPU usage only happens after I download covid-related data from a GitHub repository (link below). Is there a way to see if this is hijacking resources? If it is, other people should know about it.
I don't think it's a Windows task-reporting error, since my temperatures are what I would expect for a constant 34% CPU usage (~56C).
Is this a security issue? Is there a way to see what R is doing? I'm sure there is a way to inspect this more closely, but I don't know where to begin. GlassWire hasn't reported any unusual activity.
From Win10 event viewer, I've noticed a lot of these recently but don't quite know how to read it:
The application-specific permission settings do not grant Local Activation permission for the COM Server application with CLSID {8BC3F05E-D86B-11D0-A075-00C04FB68820} and APPID {8BC3F05E-D86B-11D0-A075-00C04FB68820} to the user redacted SID (S-1-5-21-1564340199-2159526144-420669435-1001) from address LocalHost (Using LRPC) running in the application container Unavailable SID (S-1-15-2-181400768-2433568983-420332673-1010565321-2203959890-2191200666-700592917). This security permission can be modified using the Component Services administrative tool.
Edit: CPU usage seems to be positively correlated with how long R has been open.
Given the information you provided, it looks like RStudio (not R) is using a lot of resources. R and RStudio are two very different things. These types of issues are very difficult to investigate, as one needs to be able to reproduce them on another computer. One thing you could do is raise the issue with the RStudio team on GitHub.

Best way to execute code on remote machine

I am looking for the best way to execute code on a remote machine. Ideally, I am looking for a solution like CUDA, which lets you choose whether an execution runs on the GPU or the CPU, but for separate machines instead.
I have tried several ways to do this:
I connect to my machines with ssh, copy my script over, and execute it. No particular issues, but not very handy; maybe this approach could be optimised, since I open my ssh connection from the terminal or Termius.
I tried another way with mosh: same outcome, but quicker.
Currently, I am working on a remote Spyder kernel to get a direct link to the machine where the code executes.
I have also seen that nohup could be an option, but I still need to work on this solution to understand what it offers.
Everything works well, but I am looking for a more convenient solution.
Thank you in advance for your answers!
You could use sshfs together with ssh to mount the remote filesystem on your machine; it is easier than always copying the code by hand. I would also recommend using screen or something similar, so that a broken connection causes no problems.
Personally, I like to work with Visual Studio Code and the SSH FS extension for this purpose.
Another alternative is to work with X2Go. X2Go gives you access to the graphical desktop of a remote computer over a low-bandwidth (or high-bandwidth) connection.
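As a side note for the "connect over ssh, copy the script, execute it" workflow described in the question, those steps can also be scripted from R with the ssh package. This is only a rough sketch; the host, file names, and remote paths are illustrative, not taken from the question.

    # Rough sketch: drive a remote machine from R with the ssh package.
    # "user@remote-host" and all file names/paths are illustrative; the
    # remote "jobs/" directory is assumed to exist.
    library(ssh)

    session <- ssh_connect("user@remote-host")

    # Copy the script over and run it remotely, streaming its output back
    scp_upload(session, files = "analysis.R", to = "jobs/")
    ssh_exec_wait(session, command = "Rscript jobs/analysis.R")

    # Fetch any results the script produced
    scp_download(session, files = "jobs/results.rds", to = ".")

    ssh_disconnect(session)

Combined with screen or nohup on the remote side for long runs, this keeps the whole round trip inside one R session instead of a terminal.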

Improve R Script execution from NodeJS

I'm new to R, and I'm invoking an R script from a NodeJS app. When the R script is invoked, it takes a long time to produce output. I investigated and realized that the bulk of that overhead is spent loading the libraries and the model I'm using. Let me clarify that any optimization would help, taking into account that I'm running this code on a Raspberry Pi 2 B+.
My question is: Is there a way to preload all the libraries and the model on R and then trigger predictions on demand? So that I won't need to reload the libraries and the model every time I want a prediction.
No. Since you're just invoking a script, everything it loads has to be loaded every time the script is run, because nothing existed in memory before you invoked it.
One workaround I would suggest is, instead of invoking an R script each time, to have your R code running as a service and then query that service from NodeJS.
I cannot help you much with that, since my R expertise doesn't go very far and I don't know whether having an R server is even possible.
An alternative, if it is not too cumbersome, is to port your R project to Python and stand up a server of some kind (which is extremely easy to do with Python), then poke that server from NodeJS. Since you would be running a server, you can cache the libraries at server startup time and have everything in RAM for your next query.
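For what it's worth, an R server of this kind is possible with the plumber package: the libraries and the model are loaded once when the API starts, and NodeJS then sends an HTTP request per prediction. A minimal sketch, in which the file name, the model object, and the inputs x1/x2 are illustrative rather than taken from the question:

    # predict_api.R -- minimal plumber API sketch.
    # The model file and the inputs x1/x2 are illustrative placeholders.
    library(plumber)

    # Loaded once, when the API starts, not on every request
    model <- readRDS("model.rds")

    #* Return a prediction for the supplied inputs
    #* @param x1 numeric input
    #* @param x2 numeric input
    #* @post /predict
    function(x1, x2) {
      newdata <- data.frame(x1 = as.numeric(x1), x2 = as.numeric(x2))
      as.numeric(predict(model, newdata = newdata))
    }

Start it with plumber::plumb("predict_api.R")$run(port = 8000) and POST to http://localhost:8000/predict from NodeJS. On a Raspberry Pi the slow startup is still paid, but only once rather than on every prediction.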

multiple RStudio sessions following the use of the parallel package

I've recently run into an issue when using RStudio Server where multiple sessions are spawned instead of a single one. In my case (see below), five sessions are created instead of one. This happens even after trying the usual fixes: deleting ~/.rstudio, clearing .GlobalEnv, and restarting R. Note that there is no spawning issue when using the R command prompt.
My belief is that the source of this problem is a prematurely terminated mclapply. Here are the relevant docs from the parallel package (discovered after the fact):
It is strongly discouraged to use these functions in GUI or embedded environments, because it leads to several processes sharing the same GUI which will likely cause chaos (and possibly crashes). Child processes should never use on-screen graphics devices.
At least one other person has had the same error as me but there is no documented solution that I can find. As the warning has already been ignored, I would appreciate any pointers that can help me get untangled.
Edit:
I am still encountering the error but was able to catch the ephemeral script sourcing issue that I believe is causing this problem. Unfortunately, I don't know what other files are being sourced and therefore what settings need to be changed. Grrrrr.....
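For anyone hitting the same wall, one way to sidestep forked children inside the RStudio GUI altogether (an assumption on my part, not a confirmed fix for the sourcing issue above) is to replace mclapply with an explicitly created and stopped PSOCK cluster. slow_task() below is just a placeholder workload.

    # Sketch: explicit PSOCK cluster instead of fork-based mclapply, so no
    # child process shares the RStudio session. slow_task() is a placeholder.
    library(parallel)

    slow_task <- function(x) {
      Sys.sleep(1)
      x^2
    }

    run_parallel <- function(xs, workers = 4) {
      cl <- makeCluster(workers)            # separate worker processes, no forking
      on.exit(stopCluster(cl), add = TRUE)  # workers are cleaned up even on error
      parLapply(cl, xs, slow_task)
    }

    results <- run_parallel(1:8)

Workers started this way are independent R processes that stopCluster shuts down, so an interrupted run should not leave extra sessions attached to RStudio Server.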

WinDBG - Analyse dump file on local PC

I have created a memory dump of an ASP.NET process on a server using the following command: .dump /ma mydump.dmp. I am trying to identify a memory leak.
I want to look at the dump file in more detail on my local development PC. I read somewhere that it is advisable to debug on the same machine as you create the dump file. However, I have also read that some developers do analyse the dump file on their local development PC's. What is the best approach?
I notice that when I create a dump file using the command above, the W3WP process memory increases by about 1.5 times. Why is this? I suppose this should be avoided on a live server.
Analyzing on the same machine saves you from SOS loading issues later on. Unless you are familiar with WinDbg and SOS, you will otherwise find it confusing and frustrating.
If you have to use another machine for the analysis, make sure you read this blog post carefully, http://blogs.msdn.com/b/dougste/archive/2009/02/18/failed-to-load-data-access-dll-0x80004005-or-what-is-mscordacwks-dll.aspx, as it shows you how to copy the necessary files from the source machine (where the dump is captured) to the target machine (the one where you launch WinDbg).
For your second question: because you use WinDbg to attach to the process directly and the .dump command to capture the dump, the target process is unfortunately modified (not easy to explain in a few words). The recommended way is to use ADPlus.exe or Debug Diag; even ProcDump from Sysinternals is better. Those tools are designed for dump capture and have minimal impact on the target processes.
For memory leaks from unmanaged libraries, you should use the memory leak rule of Debug Diag. For managed memory leaks, you can simply capture hang dumps when memory usage is high.
I am no expert on WinDBG but I once had to analyse a dump file on my ASP.NET site to find a StackOverflowException.
While I got a dump file of my live site (I had no choice since that was what was failing), originally I tried to analyse that dump file on my local dev PC but ran into problems when trying to load the CLR data from it. The reason being that the exact version of the .NET framework differed between my dev PC and the server - both were .NET 4 but I imagine my dev PC had some cumulative updates installed that the server did not. The SOS module simply refused to load because of this discrepancy. I actually wrote a blog post about my findings.
So, to answer part of your question, it may be that you have no choice but to run WinDbg on your server; at least that way you can be sure the dump file will match your environment.
It is not necessary to debug on the actual machine unless the problem is difficult to reproduce on your development machine.
As long as you have the PDBs with private symbols and the correct version of .NET installed, the symbols should resolve and call stacks should be displayed correctly.
For memory leaks, you should enable the GFlags user-mode stack trace database and take memory dumps at two points in time, so you can compare memory usage before and after the action that provokes the leak. Remember to disable GFlags afterwards!
You could also run DebugDiag on the server, which has automated memory-pressure analysis scripts that work with .NET leaks.
