I’m writing a webserver that sometimes has to pass data through an R script.
Unfortunately startup is slow, since I have to load some libraries, which load other libraries, and so on.
Is there a way to either
load libraries, save the interpreter state to a file, and load that state quickly the next time it is invoked? Or
maintain a background R process that can be sent messages (not just low-level data streams), which are delegated to asynchronous workers (i.e. sending a new message before the previous one has been parsed shouldn’t block)?
R-Websockets is unfortunately synchronous.
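For the first option, base R can snapshot workspace objects with save()/save.image() and restore them with load(); note, however, that attached packages are not part of the saved image, so this only helps when the slow part is rebuilding data rather than calling library(). A minimal sketch:

```r
# Sketch: persist R objects to a file and restore them quickly later.
# Caveat: attached packages are NOT saved -- library() calls must be
# rerun -- so this helps only if building objects is the slow part.
state_file <- tempfile(fileext = ".RData")
x <- list(model = "expensive-to-build object")
save(x, file = state_file)   # snapshot the object
rm(x)
load(state_file)             # fast restore (e.g. in a later session)
print(x$model)
```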
Rserve and RSclient make it easy to create and use an asynchronous server.
Open two R sessions.
In the first one, type:
require(Rserve)
run.Rserve(port=6311L)
In the second one, type:
require(RSclient)
rsc = RS.connect(port=6311L)
# start with a synchronous call
RS.eval(rsc, {x <<- vector(mode="integer")}, wait=TRUE)
# continue with an asynchronous call
RS.eval(rsc, {
  cat("begin")
  for (i in 1:100000) x[[i]] <- i
  cat("end")
  TRUE
}, wait = FALSE)
# call RS.collect until the result is available
RS.collect(rsc, timeout = 1)
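If pulling in Rserve is too heavy for your use case, a similar submit-then-collect pattern is available in base R's parallel package (on Unix-alikes only, since it relies on forking); mccollect() plays roughly the role of RS.collect() here:

```r
library(parallel)  # ships with base R; forking works on Unix-alikes only

# Submit an asynchronous job, keep working, then collect the result.
job <- mcparallel({
  Sys.sleep(1)       # stand-in for a long computation
  sum(1:100)
})
# ... do other work here while the job runs ...
# mccollect(job, wait = FALSE) returns NULL while the job is running;
# with wait = TRUE it blocks until the result is ready.
res <- mccollect(job, wait = TRUE)
print(res[[1]])
```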
Context
In order to test the web capabilities of an R package I am writing, I'm attempting to serve a file locally using the httpuv package, so that I can run tests against an offline copy of the page.
Issue
However, curl doesn't seem to want to play nice with httpuv - specifically, when trying to read the hosted file using curl (for example, with curl::curl() or curl::curl_fetch_memory()), the request hangs, and eventually times out if not manually interrupted.
Minimal example
# Serve a small page
server <- httpuv::startServer("0.0.0.0", port = 9359, app = list(
  call = function(req) {
    list(
      status = 200L,
      headers = list("Content-Type" = "text/html"),
      body = "Some content..."
    )
  }
))
# Attempt to retrieve content (this hangs)
page <- curl::curl_fetch_memory(url = "http://127.0.0.1:9359")
httpuv::stopServer(server)
Current progress
Once the server has been started, running curl -v 127.0.0.1:9359 at the terminal returns content as expected. Additionally, if I open a new instance of RStudio and try to curl::curl_fetch_memory() in that new R session (while the old one is still open), it works perfectly.
Encouraged by that, I've been playing around with callr for a while, thinking maybe it's possible to launch the server in some background process, and then continue as usual. Unfortunately I haven't had any success so far with this approach.
Any insight or suggestions very much appreciated!
Isn't it a great feeling when you can come back and answer a question you asked!
From the httpuv::startServer() documentation:
startServer binds the specified port and listens for connections on a thread running in the background. This background thread handles the I/O, and when it receives an HTTP request, it will schedule a call to the user-defined R functions in app to handle the request. This scheduling is done with later(). When the R call stack is empty – in other words, when an interactive R session is sitting idle at the command prompt – R will automatically run the scheduled calls. However, if the call stack is not empty – if R is evaluating other R code – then the callbacks will not execute until either the call stack is empty, or the run_now() function is called. This function tells R to execute any callbacks that have been scheduled by later(). The service() function is essentially a wrapper for run_now().
In other words, if we want to respond to requests as soon as they are received, we have to explicitly do so using httpuv::service(). Something like the following does the trick!
s <- callr::r_session$new()
on.exit(s$close())

s$call(function() {
  httpuv::startServer("0.0.0.0", port = 9359, app = list(
    call = function(req) {
      list(
        status = 200L,
        headers = list("Content-Type" = "text/html"),
        body = "Some content..."
      )
    }
  ))
  while (TRUE) httpuv::service()
})
# Give the server a chance to start
Sys.sleep(3)
page <- curl::curl_fetch_memory(url = "http://127.0.0.1:9359")
As a result of several hours of unfruitful searches, I am posting this question.
I suppose it is a duplicate of this one:
How do you run RServe on AWS Lambda with NodeJS?
But since it seems that the author of that question did not accomplish his/her goal successfully, I am going to try again.
What I currently have:
A NodeJS server, that invokes an R script through Rserve and passes data to evaluate through node-rio.
The function responsible for that looks like this:
const R = (arg1, arg2) => {
  return new Promise((resolve, reject) => {
    const args = { arg1, arg2 };
    // send data to Rserve to evaluate
    rio.$e({
      filename: path.resolve('./r-scripts/street.R'),
      entrypoint: 'run',
      data: args,
    })
      .then((data) => {
        resolve(JSON.parse(data));
      })
      .catch((err) => {
        reject(`err: ${err}`);
      });
  });
};
And this works just fine. I am sending data over to my R instance and getting results back into my server.
What I am ultimately trying to achieve:
Every request seems to spawn its own R workspace, which has a considerable memory overhead. Thus, serving even hundreds of concurrent requests using this approach is impossible, as my AWS EC2 runs out of memory pretty quickly.
So, I am looking for a way to deploy all the memory intensive parts to AWS Lambda and thus get rid of the memory overhead.
I guess the specific question in my case is whether there is a way to package R and Rserve together with a NodeJS Lambda function, or whether this approach simply won't work with Lambda and I should look for an alternative.
Note: I cannot use anything other than R, since these are external R scripts, that I have to invoke from my server.
Thanks in advance!
I have an asp.net web page that obtains data from two web services, the main one and a fall-back one. Currently those services are called consecutively. When running on my box, the page render time is usually 6-10 s.
I decided to parallelize those calls so that the fall-back data is available sooner if the main service fails.
Result res = null;
Task<Result> task = new Task<Result>(() => service1.method1());
task.Start();
res = service2.method();
if (res == null || res.Count == 0)
    res = task.Result;
return res;
With that change the page render time grew significantly and reaches 60 s. I added some profiling code and I see that this piece of code typically runs in less than 3 s.
So I suspect that using Tasks somehow hinders the asp.net infrastructure performance.
What might cause that severe performance degradation?
Is there a better way of running those two external service calls in parallel?
Edit 1: I'm mostly interested in the result from service2. The result from service1 will be used only if service2 fails. However, I want to minimize the waiting time for the service1 result if service2 fails.
It turned out that the code executed within a Task called a lazily initialized singleton. The singleton's constructor tried to set up some caching if HttpContext.Current was available. Because HttpContext.Current was not available within a Task, the singleton did not use caching and this degraded performance.
I'm executing a batch file inside an R script. I'd like to run this and another large section of the R script twice using a foreach loop.
library(foreach)

foreach(i = 1:2, .combine = rbind) %do% {
  shell.exec("\\\\network\\path\\to\\batch\\script.ext")
  # ...rest of the R script...
}
One silly problem, though, is that this batch file generates data, and that data is connected to a SQL Server localdb inside the loop. I thought at first that the script would execute the batch file, wait for it to finish, and then move on. However (it seems obvious in hindsight), the script instead executes the batch file, tries to grab data that hasn't been created yet (because the file isn't finished running), and then executes the batch file again before the first run has finished.
I've been trying to find a way to delay the rest of the script from executing until the batch script has finished, but have not come up with anything yet. I'd appreciate any insights anyone has.
Use system2 instead of shell.exec. system2 calls are blocking, meaning the function waits until the external program has finished running. On most systems this can be used directly to run scripts. On Windows, you may have to invoke rundll32 to execute a script:
cmd <- c('rundll32.exe', 'Shell32.dll,ShellExecute', 'NULL', 'open', scriptpath)
system2(paste(shQuote(cmd), collapse = ' '))
Windows users may use shell, which defaults to wait=TRUE, causing R to wait for the command's completion. You may choose whether or not to "intern" the result directly.
On unix-like systems, use system, which also defaults to wait=TRUE.
If your batch file simply launches another process and terminates, then it may need to be modified to either wait for completion or return a suitable process or file indicator that can be monitored.
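To see the blocking behaviour directly, here is a minimal sketch using echo as a stand-in for the batch script: because system2() waits for the command by default, its output is guaranteed to exist on the very next line.

```r
# system2() blocks until the command finishes, so the captured output
# can be used immediately afterwards ('echo' stands in for the batch job).
out <- system2("echo", args = "batch job finished", stdout = TRUE)
cat(out, "\n")   # safe: the external process has already exited
```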
I would like to run OpenCPU job asynchronously and collect its results from a different session. In Rserve + RSclient I can do the following:
RS.eval(connection, expression, wait = FALSE)
# do something while the job is running
and then when I'm ready to receive results call either:
RS.collect(connection)
to try to collect results and wait until they are ready if job is still running or:
RS.collect(connection, timeout = 0)
if I want to check the job state and let it run if it is still not finished.
Is it possible with OpenCPU to receive the tmp/*/... path with the result id before the job has finished?
It seems, according to this post, that OpenCPU does not support asynchronous jobs. Every request between the browser and the OpenCPU server must stay alive in order to execute a script or function and receive a response successfully.
If you find any workaround I would be pleased to know it.
In my case, I need to run a long process (it may take a few hours) and I can't keep the client request alive until the process finishes.