R Shiny and Spark: how to free Spark resources?

Say we have a Shiny app which is deployed on a Shiny Server. We expect that the app will be used by several users via their web browsers, as usual.
The Shiny app's server.R includes some sparklyr package code which connects to a Spark cluster for classic filter, select, mutate, and arrange operations on data located on HDFS.
Is it mandatory to disconnect from Spark, that is, to include a spark_disconnect at the end of the server.R code to free resources? I think we should never disconnect and let Spark handle the load for each arriving and leaving user. Can somebody please help me confirm this?

TL;DR SparkSession and SparkContext are not lightweight resources which can be started on demand.
Putting aside all security considerations related to starting a Spark session directly from a user-facing application, maintaining a SparkSession inside server (starting the session on entry, stopping it on exit) is simply not a viable option.
The server function is executed for every new session (that is, each time a user connects), so a session started there would effectively restart a whole Spark application each time, rendering the project unusable. And this is only the tip of the iceberg. Since Spark reuses existing sessions (only one context is allowed per JVM), multiuser access could lead to random failures if the reused session has been stopped by another server call.
One possible solution is to register onSessionEnded with spark_disconnect, but I am pretty sure it will be useful only in a single user environment.
Another possible approach is to use a global connection, and wrap runApp with a function calling spark_disconnect_all on exit:
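runApp <- function() {
  # Register the cleanup first: shiny::runApp() blocks until the app stops,
  # and the handler must already be in place if it errors or is interrupted.
  on.exit(sparklyr::spark_disconnect_all(), add = TRUE)
  shiny::runApp()
}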
although in practice the resource manager should free the resources when the driver disassociates, even without stopping the session explicitly.

Related

Best practice to maintain PSQL+R Shiny connections

I've built an application that does a lot of processing of user data (they can load in their data, map the variables, run analyses, review a dashboard, and download the results/report). It's a pretty heavy application, and I'm running into an issue that I'm not sure how to best solve for.
The problem is that sometimes the session will unexpectedly disconnect from the psql database. This causes problems because just about every corner of the application depends on retrieving or sending information. Basically, the app doesn't work at all without the connection. What's even worse is that the UI doesn't really inform the user of the problem; it kind of sits there all lazy-like.
The application exists on an EC2 instance within a docker container, served through an HTTPS proxy (Caddy) to the public via a registered domain name. Each new session searches for a global pool connection, and if one does not exist, it creates one, then checks out a local connection and passes that into all the downstream modules.
I'm wondering how others have addressed this problem. Should I,
use a global pool, then check out a single connection and test for a severed connection at the start of each function? This is my current (unfinished) approach and seems not great.
search for a pool connection and checkout a connection at the start of each function, then return at the end? This would take a bit of time to implement (and test), but seems like a reasonable solution.
check for a connection every minute and if one doesn't exist, create one. I'm guessing this would need to happen in each module independently.
Any direction will be greatly appreciated.
Thanks,

Serve web request in python that spawns a new long running subprocess

I currently have a python command line application that uses python invoke package to organise, list and execute tasks. There are many task files (controlled & created by users, not me). Execution time for some task files can be more than an hour. Each task is actually a test script/program. invoke is useful in listing/executing all the tasks in a task file (we call it a testsuite) or only a bunch of them (a tasks collection) or a single task. (Having a ton of loose scripts and organising, listing & running them in the way users want would be quite a task, hence invoke).
However, invoke cannot be used as a library. It does not offer an API that can be leveraged to list and run test tasks. So I am forced to run invoke as a shell command in a subprocess from the command line program. I replace (via execl()) the current process with invoke because once control passes to invoke, there is no need to come back to the parent process. So far so good.
Now, there is a requirement that this command line program be callable from a web application. So I need to wrap this cmdline program in a restful http API. I've decided to use bottle.py to keep things simple.
I understand that the long-running testsuite (tasks) will have to be run outside the http request/response cycle. But I'm unable to finalise exactly how to go about it (I may be overthinking). Here is what I want:
Tasks are written by users. They are always synchronous, they may sleep or execute shell commands via subprocess.run().
The application is internal, so it will not be bombarded with a huge number of requests. Number of users: max. 10.
But each request (of type that runs the task) will take minutes and some cases > hour to complete. New requests during this should not block.
Calling application (running on a different host) will need to report progress of the running task to the browser UI. ('progress bar')
Ability to communicate with running task and 'cancel' it from browser UI.
With above situation, am I correct in saying ..
because a new 'process' must be spawned for a request (due to the use of subprocess and execl in the current code), it rules out using 'threads' of any type (os threads, greenlets, gevent)?
Using any async libraries (web framework, web/http server or in app code) won't be of much help, because every run request will have to be a new process anyway?
How will the process be spawned when a request comes in? Let the web/http server (gunicorn?) do it? Or does my application have to take care of forking itself?
is 'gunicorn' a good choice for this situation?
I have a feeling that users may also ask for the ability to schedule tasks/tests. I might end up using some sort of task queue. I have read 'huey' and feel that it is light & simple for my needs. (No redis/Celery). But any task queue also means a separate consumer process to administer? More moving parts to the mix.
'progress-bar' functionality means, subprocess has to keep updating its progress somewhere and calling application has to read from there. Does this necessitate 'task queue' anyway?
There is a lot of material on all of this and I have read quite some of it. But it has still left me unclear as to how exactly to go about implementing my requirements. Any direction/pointers would be appreciated. I'd also appreciate any advice on what 'not to use'.
If you need something really simple then you could write a wrapper around task spooler (linux tool to run tasks) https://vicerveza.homeunix.net/~viric/soft/ts/ (especially https://vicerveza.homeunix.net/~viric/soft/ts/article_linux_com.html for more details)
Otherwise it's probably better to switch to the uwsgi spooler, rq with redis, or celery with rabbitmq (celery with redis works only to a certain extent).
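If none of those tools is wanted yet, the spawn/poll/cancel part of the question can be sketched with the standard library alone. This is only an illustration under stated assumptions: JobRegistry and its methods start, status and cancel are hypothetical names, not part of invoke, bottle, or any tool mentioned above; each run is assumed to be an independent child process started with subprocess.Popen; and persistence across restarts and progress reporting are left out.

```python
import subprocess
import sys
import threading
import uuid


class JobRegistry:
    """Hypothetical in-memory registry of child processes, keyed by job id."""

    def __init__(self):
        self._jobs = {}
        self._cancelled = set()
        self._lock = threading.Lock()  # request handlers may run in several threads

    def start(self, argv):
        """Spawn argv as a child process and return a job id without waiting."""
        proc = subprocess.Popen(argv)  # returns as soon as the child is started
        job_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[job_id] = proc
        return job_id

    def status(self, job_id):
        """Return 'running', 'cancelled', 'finished' or 'failed' without blocking."""
        with self._lock:
            proc = self._jobs[job_id]
            cancelled = job_id in self._cancelled
        code = proc.poll()  # None while the child is still alive
        if code is None:
            return "running"
        if cancelled:
            return "cancelled"
        return "finished" if code == 0 else "failed"

    def cancel(self, job_id):
        """Ask a running child to stop; kill it if it ignores the request."""
        with self._lock:
            proc = self._jobs[job_id]
        if proc.poll() is None:
            with self._lock:
                self._cancelled.add(job_id)
            proc.terminate()
            try:
                proc.wait(timeout=5)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()
        return self.status(job_id)
```

A request handler would call something like registry.start(["invoke", ...]) with the real arguments and hand the returned id back to the caller, which then polls status or calls cancel. Popen is used instead of execl here because execl would replace the web server process itself, whereas Popen leaves the server running alongside the child.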

Do shiny apps data-scoping rules apply to ShinyProxy?

As far as I understand, ShinyProxy launches a separate container for every connected user. Is it possible to share data among user sessions by using these documented shiny scoping rules (see Objects visible across all sessions)?
My use case involves loading a big static dataset in memory that is the same for every app user, so the correct approach here is to have a single copy of the dataset in memory and share it among all user sessions (= load it before the 'server' function). Does this work with ShinyProxy as explained in the above Shiny documentation?
Thanks in advance,
Juanje.

R-Shiny two way communication

We have a separate process which provides data to our R-Shiny application. I know I can provide data to R-Shiny via a file or database and observe the data source via reactivePoll. This works fine and I understand it's sort of the recommended way.
What I don't like on this approach is:
It's hard to send to shiny various type of inputs (like data and metadata)
I miss the shiny application feedback to the data providing process. I just write a file and hope that shiny app will pick it up and process and will be successful. Data sourcing process cannot be notified about failure for example (invalid data)
I would love to have some 2-ways protocol. For example send the data through a websocket (this would have to be different websocket than the one Shiny has with the UI obviously) or raw socket and be able to send response back.
Surely I can implement some file-based API where I store files under different names observing them with shiny and shiny then writes other files back and I would observe them with my application which provided the data. But this basically sucks :)
Any tips much appreciated!
Edit: it may or may not be obvious from my saying that the Java and R applications write files for each other, but the apps are running on the same host and I can live with this limitation.

A Way to Run a Long Process From ASP.NET page

What are your most successful ways of running a long process, like 2 hours, in asp.net and returning information to the client on the progress?
I've heard creating a windows service, httphandler and remoting can be successful.
Just a suggestion...
If you have logic that you are trying to utilize already in asp.net... You could make an external app (windows service, console app, etc.) that calls a web service on your asp.net page.
For example, I had a similar problem where the code I needed was in asp.net and I needed to update about 3000 clients using this code. It started timing out, so I exposed the code through a web service. Then, instead of trying to run all 3000 clients through asp.net at once, I used a console app, run by a nightly sql server job, that called the web service once for each client. This way all the time-consuming processing was handled by the console app, which doesn't have the time-out issue, but the code we had already written in asp.net did not have to be recreated. In the end, slightly modifying the design of my existing architecture allowed me to easily get around this problem.
It really depends on the environment and constraints you have to deal with...Hope this helps.
There are two ways that I have handled this. First, you can simply run the process and let the client time out. This has two drawbacks: the UI isn't in synch and you are tying up an IIS thread for non-html purposes (I did this for a process that used to return quickly enough but that grew beyond time-out limits).
The better way to handle this is to write a "Service" application that handles the request as passed through a database table (put the details of the request there). Then you can create a window that gives the user a "window" into ongoing progress on the task (e.g. how many records have been processed or emails sent). This status window can either have a link to permit the user to refresh or you can automate the refresh using Ajax callbacks on a timer.
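The table-based handoff described above is not tied to ASP.NET, so here is a minimal sketch of it in Python, using the standard-library sqlite3 module in place of a real database server. The table name jobs, its columns, and the helper functions init, submit, work_one and progress are all hypothetical, chosen only for illustration; a real service would run work_one in its own long-lived process.

```python
import sqlite3


def init(conn):
    """Create the hypothetical request/progress table if it is missing."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY,"
        " details TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'queued',"
        " processed INTEGER NOT NULL DEFAULT 0,"
        " total INTEGER NOT NULL)"
    )
    conn.commit()


def submit(conn, details, total):
    """Web side: record the request and return at once with its id."""
    cur = conn.execute(
        "INSERT INTO jobs (details, total) VALUES (?, ?)", (details, total)
    )
    conn.commit()
    return cur.lastrowid


def work_one(conn):
    """Service side: take the oldest queued job and process it step by step."""
    row = conn.execute(
        "SELECT id, total FROM jobs WHERE status = 'queued' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None  # nothing to do
    job_id, total = row
    conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (job_id,))
    conn.commit()
    for done in range(1, total + 1):
        # The real per-record work (e.g. sending one email) would happen here.
        conn.execute("UPDATE jobs SET processed = ? WHERE id = ?", (done, job_id))
        conn.commit()  # commit each step so the status page can see it
    conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
    conn.commit()
    return job_id


def progress(conn, job_id):
    """Status page: read-only view of how far the job has got."""
    status, processed, total = conn.execute(
        "SELECT status, processed, total FROM jobs WHERE id = ?", (job_id,)
    ).fetchone()
    return {"status": status, "processed": processed, "total": total}
```

The page that the user refreshes (or that polls on a timer) only ever calls progress, so it never waits on the long-running work itself.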
This isn't directly applicable but I wrote code that will let you run processes similar to "scheduled tasks" inside of ASP.NET without needing to use windows services or any type of cron jobs.
Scheduled Tasks in ASP.NET!
I very much prefer WCF service to scheduled tasks. You might (off the top of my head) pass an addr to the WCF service as a sort of 'callback' that the service can call with progress reports as it works.
I'd shy away from scheduled tasks... too coarse-grained.

Resources