I'm interested in using a statistical programming language within a web site I'm building to do high-performance stats processing that will then be displayed on the web.
I'm wondering whether R can be embedded within a web server and threaded so that it works well with the LAMP stack, integrates smoothly with the front-end and back-end of the site, and improves the site's performance.
If R is not the right choice for such an application, then perhaps there is another tool that is?
The general rule is that the web server should do NO calculations -- whatever you do, doing them there will always end in a bad user experience. The right approach is for the server to respond to a calculation request by scheduling the job for some worker process, give the user a nice "working" status, and then push the results obtained from the worker when they are ready (most likely with AJAX polling or some more recent COMET idea).
Of course this requires some RPC protocol to R and some queuing agent -- this can be done either with background processes (easy yet slow), R HTTP servers (more difficult yet faster), or real RPC like Rserve or triggr (hard, yet fast to ultra-fast).
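To make the Rserve route a bit more concrete, here is a minimal sketch, assuming the Rserve and RSclient packages are installed and that an Rserve instance has already been started elsewhere on the default port; the queuing and status-pushing parts would live in your web stack, not in R:

    ## Minimal sketch: evaluate a long-running computation on a remote Rserve.
    ## Assumes install.packages(c("Rserve", "RSclient")) and that an Rserve
    ## instance was started separately, e.g. with Rserve::Rserve(args = "--no-save").
    library(RSclient)

    con <- RS.connect(host = "localhost", port = 6311)  # default Rserve port

    ## The web layer would enqueue this job; here we simply evaluate it remotely.
    result <- RS.eval(con, {
      x <- rnorm(1e6)
      c(mean = mean(x), sd = sd(x))
    })

    RS.close(con)
    print(result)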
You are confusing two issues.
Yes, R can be used via a web platform. In fact, the R FAQ has an entire section on this. In the fifteen-plus years that both R and 'the Web' have risen to prominence, many such frameworks have been proposed. And since R 2.13.0, R even has its own embedded web server (to drive documentation display).
Yes, R scripts can run faster via the byte-code compiler, but that does not give you orders-of-magnitude gains.
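For what it's worth, a quick sketch of what using the byte-code compiler looks like on an ordinary R function (the gain is real but usually modest):

    ## Byte-compile a plain R function with the compiler package that ships with R.
    library(compiler)

    slow_sum <- function(n) {
      s <- 0
      for (i in seq_len(n)) s <- s + i
      s
    }

    fast_sum <- cmpfun(slow_sum)  # byte-compiled version of the same function

    ## Compare the two; expect a noticeable but not dramatic improvement.
    system.time(slow_sum(1e6))
    system.time(fast_sum(1e6))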
I developed an R application and I want to deploy it.
Currently the application consists of a set of functions to be run from the command line, like an R package. In order to deploy it, I am thinking of repackaging R Portable adding the necessary libraries and my code to it. My main problem is choosing a proper GUI toolkit.
Production Environment
My app is a single-user one (i.e. a desktop application) and the target platform is Windows. It could bootstrap in R and then call the toolkit, or bootstrap, say, in Java and then call the R engine. The GUI should first and foremost feed input to the app's functions. It should also capture their graphical output.
Possible Alternatives
Here is a potential list of alternatives. I'd like to know whether they fit the production environment described above.
Java JRI is now released only as a part of rJava, but while the latter is clearly documented, I am unable to find docs and tutorials for the former.
As for Deducer, it is presented as a GUI front-end, but I found out that it is also a GUI toolkit.
Tcl/Tk bindings seem a natural choice for R and are well documented, but someone complains about the limitations of this toolkit.
RGtk2 seems interesting and there are also some tutorials around.
gWidgets is one of the rare toolkits to sport a package vignette! (A minimal sketch of this option follows the list of alternatives below.)
Although I don't need a real web application, an interesting option would be interfacing R with JavaScript/HTML. Like most of us, I am familiar with this environment, and the app could benefit from the many JS libraries.
The problem is that the beautiful Shiny Server and rApache are Linux-only, and this is probably true for Concerto too. Rserve, instead, runs on Windows and, while there is no official JS client, I found the third-party rserve-js and also a node.js client.
Rook, by the same author of rApache, should be platform agnostic (isn't it?).
R Server Pages could work, but I didn't find examples of the functions HttpDaemon and HttpRequest in the vignette or reference manual.
I ran some simple examples with gWidgetsWWW. It works, but it seems to produce canned web pages, with no possibility of modifying the HTML code.
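To make the desktop-GUI alternative mentioned above (gWidgets) more concrete, here is a minimal, hypothetical sketch, assuming the gWidgets and gWidgetstcltk packages are installed; the widget names are made up for illustration:

    ## Minimal sketch of a gWidgets desktop GUI driving an R function.
    ## Assumes install.packages(c("gWidgets", "gWidgetstcltk")); Tcl/Tk ships
    ## with the standard Windows build of R.
    library(gWidgets)
    options(guiToolkit = "tcltk")

    win  <- gwindow("My R app")
    grp  <- ggroup(container = win, horizontal = FALSE)
    n_in <- gedit("100", container = grp)            # user input
    run  <- gbutton("Run", container = grp, handler = function(h, ...) {
      n <- as.numeric(svalue(n_in))                  # read the input field
      hist(rnorm(n))                                 # graphical output of the app function
    })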
EDIT
Let me clarify my question. I am not surveying your personal preferences.
The technologies or products mentioned here tend to be very young and not widespread. It would be very unpleasant to discover, after investing months of coding, that they are not yet ready or don't fit production. So I would like to know (not your subjective tastes, but) whether they are able to work in the environment described above.
We have created a kind of web app building on rApache and Ruby on Rails, among other technologies, at rapporter.net. It turned out to act more as a framework for hosting R-based statistical applications ("Rapplications") than as our initial goal of a user-friendly online front-end to R. I'd warmly suggest checking out its features, as you might save a lot of resources by not dealing with server-side, CMS and other boring issues, and could concentrate on the statistical tool instead.
Anyway, besides promoting our stuff, let me summarize my experiences:
rApache is definitely ready for production, but please note that it only suits rather stateless algorithms (by default Apache starts a bunch of worker processes, so the same user/client would end up interacting with a different R session on each query). Rserve, for example, would be a better alternative for a stateful application.
AFAIK Shiny Server is meant to host dedicated statistical tools and applications -- just like our Rapplication service, with or without a DB backend -- with some customizable user input (see the minimal sketch after the question list below). You would need some technical skills to set it up, and providing a highly available (HA) environment might require far too many extra resources. This can be a huge advantage or disadvantage depending on your requirements and expectations.
The biggest issue in such a question should be security (like using RAppArmor or sandboxR), not just the R connector back-end, as users will interact with your servers (if hosted in the cloud). Desktop applications are a bit more developer-friendly, but supporting all major platforms can be a no-no in the era of tablets and smartphones. A cloud app can run on any device with a browser.
You should choose the optimal solution based on your requirements. There are a bunch of tools ready for production, and each has its own advantages and special use cases. Just check which related packages/applications are still under active development and support, and answer a few questions like:
Is there any need to connect to databases?
What types of user input are needed (e.g. only parameters, datasets, R commands)?
Desktop or cloud app? Are you sure? If the latter, do you want to deal with the setup, maintenance and support yourself?
Do you run any computational intensive tasks?
Do you want to implement an application that helps users with repetitive and standardized tasks, or rather provide general and extensible software?
Do you need a responsive application with interactive setup, or is it for reporting purposes?
What output formats do you need?
What other technologies are you familiar with? It's rather hard to build a Meteor-based app with a NoSQL backend if you have only worked with MySQL, PHP, Java or C# in the past.
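To illustrate the kind of customizable user input a Shiny app offers (as referenced above), here is a minimal, hypothetical single-file app; it only assumes the shiny package is installed:

    ## Minimal single-file Shiny app: one numeric input, one plot output.
    ## Run with shiny::runApp() after saving this as app.R.
    library(shiny)

    ui <- fluidPage(
      titlePanel("Histogram demo"),
      numericInput("n", "Number of random values:", value = 100, min = 10, max = 10000),
      plotOutput("hist")
    )

    server <- function(input, output) {
      output$hist <- renderPlot({
        hist(rnorm(input$n), main = paste("n =", input$n))
      })
    }

    shinyApp(ui = ui, server = server)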
I'm about to do something similar. The fastest way (in terms of both deployment time and future application performance) seems to be C# interfacing with R through R.NET. In Visual Studio you will benefit from incredible visualisation options with just a few clicks, and setting up interactive/drill-down charts is quite straightforward too. As you also mentioned, Rserve (with Java) is another valuable option.
EDIT
If the web application is not required to run on a public IP address, the R package Rook is an interesting option. Some examples with Rook: using ggplot2, using googleVis.
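For reference, a minimal sketch of what a Rook app looks like when served from a local R session (assuming the Rook package is installed; the app name is made up):

    ## Minimal Rook app running on R's built-in web server (localhost only).
    library(Rook)

    s <- Rhttpd$new()
    s$add(
      name = "hello",
      app  = function(env) {
        req <- Request$new(env)
        res <- Response$new()
        res$write("<h1>Hello from R</h1>")
        res$write(paste("Query string:", req$query_string()))
        res$finish()
      }
    )
    s$start(quiet = TRUE)
    s$browse("hello")   # opens http://127.0.0.1:<port>/custom/hello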
Disclaimer: I'm a c# ASP.NET developer learning "RoR". Sorry if this question doesn't "get" RoR, any corrections greatly appreciated!
What is multithreading
My understanding of "multithread" ability in web apps is twofold:
Every time a web/app server receives a request it can assign a thread to the new request, thus multiple requests can run concurrently.
The app runtime + language allows for multiple threads to be used WITHIN a single request (in ASP.NET via "Async" methods and keywords for example).
In this way, IIS7 + ASP.NET can do points 1 AND 2.
I'm confused about RoR
I've read these two articles and they have left me confused:
Clearing up some things about LinkedIn mobile’s move from Rails to node.js
How to deploy a multi-threaded Rails app
question one.
I think I understand that RoR doesn't lend itself very well to point number 2 above, that is, having multiple threads within the same request, have I got that right?
question two.
Just to be crystal clear: RoR app/web servers can also do point number 1 above, right (that is, multiple requests can run concurrently)? Or is that not always the case with RoR?
Question 1:
You can spawn more Ruby threads in one request if you want, although that seems to be outside the typical use case for Rails. There are uses for it for certain long-running IO or external operations.
Question 2:
The limiting factor for Ruby concurrency in general, not just with Rails, is the Global Interpreter Lock (GIL). This feature of Ruby prevents more than one Ruby thread from executing at any given time per process. The lock is released whenever non-Ruby code is executing, such as waiting for disk IO or SQL responses. You can get around this by using a different implementation of Ruby than the default, such as JRuby, but not all implementations drop the lock.
Phusion Passenger uses process based concurrency to handle a few requests concurrently, so, strictly speaking, is not "multithreaded," but is still concurrent.
This talk from Ruby MidWest 2011 has some good thoughts on getting multithreaded Ruby on Rails going.
Since this is about "from ASP.NET to RoR" there is another small but important detail to remember: In *nix environments it's common to achieve concurrency of a service application through multi-processing rather than multi-threading. This is an architecture that goes way back and is related to the relatively cheap cost of multi-processing on *nix systems using fork and Copy-on-Write. Each process serves one request at a time in a single thread and the main process controls spawning and killing worker child processes. Multiple requests are served concurrently by different child processes.
Modern service applications, for example Apache, have multi-process, multi-threaded, and even combined modes (where the service forks several processes, each running several threads).
In cases where the application was built with portability at mind (examples again: Apache, MySQL, etc) it is customary to run it in multi-process or combined mode on *nix systems, and in multi-threaded mode on Windows servers.
However, admittedly Rails is somewhat lacking on the Windows front. It's not that you can't run it on Windows, it's just that not a lot of effort went into making sure it runs well and smoothly for production use on Windows servers. It's not a common production platform among the RoR community.
As a result, even though Rails itself has been thread-safe since version 2.2, there isn't yet a good multi-threaded server for it on Windows. You get the best results by running it on *nix servers using a multi-process/single-threaded concurrency model.
Rails as a framework is thread-safe. So, the answer is yes!
The second link that you posted names many of the Rails servers that don't do well with multi-threading. He later mentions that nginx is the way to go (it is definitely the most popular one and highly recommended), but he doesn't explain what led him to that conclusion.
Ruby 1.9.3 came out recently and has some new threading goodness built in which didn't exist before.
Use of multi-threading generally depends on the use case.
Personally, I tried it about a year ago and it worked, but I haven't used it in any production code because I haven't come across a use case where multi-threading made more sense than pushing the long-running task to a background job.
I would love to explore this more. So, if you can describe what you are trying to achieve then maybe we can do a POC.
I would like to schedule and distribute on several machines - Windows or Ubuntu - (one task is only on one machine) the execution of R scripts (using RServe for instance).
I don't want to reinvent the wheel and would like to use a system that already exists to distribute these tasks in an optimal manner and ideally have a GUI to control the proper execution of the scripts.
1/ Is there a R package or a library that can be used for that?
2/ One framework that seems to be quite widely used is MapReduce with Apache Hadoop.
I have no experience with this framework. What installation/plugin/setup would you advise for my purpose?
Edit: Here are more details about my setup:
I have indeed an office full of machines (small servers or workstations) that are sometimes also used for other purpose. I want to use the computing power of all these machines and distribute my R scripts on them.
I also need a scheduler, i.e. a tool to schedule the scripts at a fixed time or regularly.
I am using both Windows and Ubuntu but a good solution on one of the system would be sufficient for now.
Finally, I don't need the server to get back the results of the scripts. The scripts do things like accessing a database, saving files, etc., but do not return anything. I would just like to get back any errors or warnings.
If what you are wanting to do is distribute jobs for parallel execution on machines you have physical access to, I HIGHLY recommend the doRedis backend for foreach. You can read the vignette PDF to get more details. The gist is as follows:
Why write a doRedis package? After all, the foreach package already has available many parallel back end packages, including doMC, doSNOW and doMPI. The doRedis package allows for dynamic pools of workers. New workers may be added at any time, even in the middle of running computations. This feature is relevant, for example, to modern cloud computing environments. Users can make an economic decision to "turn on" more computing resources at any time in order to accelerate running computations. Similarly, modern cluster resource allocation systems can dynamically schedule R workers as cluster resources become available.
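A minimal sketch of how this can look in practice, assuming a Redis server reachable from every machine and the foreach and doRedis packages installed on each of them (the queue name and Redis host are made up):

    ## On the "master" machine: register a Redis-backed foreach backend, submit work.
    library(foreach)
    library(doRedis)

    registerDoRedis(queue = "jobs", host = "redis.example.org")  # hypothetical host

    results <- foreach(i = 1:100, .combine = c) %dopar% {
      sqrt(i)            # stand-in for a real R script / computation
    }

    removeQueue("jobs")  # clean up the work queue when done

    ## On each worker machine (Windows or Ubuntu), in a separate R session:
    ## library(doRedis)
    ## startLocalWorkers(n = 4, queue = "jobs", host = "redis.example.org")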
Hadoop works best if the machines running Hadoop are dedicated to the cluster, and not borrowed. There's also considerable overhead to setting up Hadoop which can be worth the effort if you need the map/reduce algo and distributed storage provided by Hadoop.
So what, exactly is your configuration? Do you have an office full of machines you're wanting to distribute R jobs on? Do you have a dedicated cluster? Is this going to be EC2 or other "cloud" based?
The devil is in the details, so you can get better answers if the details are explicit.
If you want the workers to do jobs and have the results of the jobs reconfigured back in one master node, you'll be much better off using a dedicated R solution and not a system like TakTuk or dsh which are more general parallelization tools.
Look into TakTuk and dsh as starting points. You could perhaps roll your own mechanism with pssh or clusterssh, though these may be more effort.
I have a bunch of R scripts which I am running on a Windows machine and want to ensure that the code remains unread by those not intended to see it. On a Linux box, I could wrap the R code in a bash script #! and make an encrypted (and perhaps even a limited-life) executable shell script. What are my options to do something on similar lines under Windows?
My answer is a bit late, but I believe this is a good question. Unfortunately, I don't believe that there is a solution, or at least an easy one, at the present time.
The difficulty is common because, for most interpreted languages, including R, it is often possible to turn on logging and inspection of all commands being run. This can negate many tricks to obfuscate the code.
For those who prefer to think of code being open == good, one should know that a common reason to obfuscate the code is if one is consulting with a client that hires multiple vendors. It is not uncommon for a client to take scripts from vendor A and ask vendor B why it doesn't work with their system. (This may be done by a low-level IT flunkie, rather than someone responsible for the NDA contracts.) If A & B are competitors, A's code has just been handed to B. When scripts == serious programs, then serious code has been given away.
The ways I've seen this addressed are:
Make a call to a compiled language, and use standard protections available there.
Host the executable on a different server, and use calls to the server to execute the calculations. (In R, there are multiple server-side options.)
Use compiled (preprocessed / bytecode) code within the language.
Option 2 is actually easier and better when the code may be widely distributed, not just for IP reasons. A major advantage is that it lets you upgrade the code without having to go through the pain of a site-wide release process. If new libraries are needed, no problem - update the server.
Option 3 is done in Matlab with .p files, and can be done with py2exe for Python on Windows. In R, the new bytecode compilation may be analogous, but I am not familiar enough with it to address any differences between .Rc files in the R context and .p files in the Matlab context. For more info on the compiler, see: http://www.inside-r.org/r-doc/compiler/compile
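For what option 3 looks like with the R byte-code compiler, here is a small sketch (the file name is made up); note that this is obfuscation at best, not encryption, since the byte code can still be inspected:

    ## Byte-compile an R source file to a .Rc file and load it without shipping
    ## the original source. The compiler package ships with base R.
    library(compiler)

    ## Suppose secret.R defines the functions you want to distribute.
    cmpfile("secret.R", "secret.Rc")

    ## On the target machine, load the byte-compiled file instead of the source.
    loadcmp("secret.Rc")

    ## The functions from secret.R are now available in the workspace, but note
    ## that printing/deparsing them can still reveal much of the logic.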
Hosting computations on the server is great for working with unsophisticated users, because it is easier to iterate quickly in response to bugs or feature requests. The IP protection is simply a benefit.
This is not a specifically R-oriented strategy. (And it's a bit unclear what your constraints or goals really are anyway.) If you want a cross-platform encryption method, you should look into the open-source program TrueCrypt. It supports creating encrypted files that can be mounted as volumes on any machine that supports the volume formatting method. I have tested this across the Mac-PC divide, since the Mac can read FAT files, but have no experience with how it might work across the Linux-PC chasm.
(Their TODO list for Windows includes: "Command line options for volume creation (already implemented in Linux and Mac OS X versions)". So I don't see any clear way to use this from within R without you running the program from the OS.)
I don't think this is possible because the R interpreter has to be able to decrypt and read the code in order to execute it which means that whoever is using that interpreter will also be able to decrypt and read the code.
I am by no means an expert, so I reserve the right to be 100% wrong about that statement.
I believe the best solution is to ensure value comes from the expertise and services provided by your company and its employees, not from keeping secrets.
Failing that, you could try separating the code into a client/server model. That way the client just sends data and receives results---they never have access to the code that runs on the server.
However, the scientist in me just said "that solution sucks and I would never trust results provided under such conditions".
There is a great presentation by Dan Farino, Chief Systems Architect at MySpace.com, showcasing a web-based stack dump tool that catalogues all threads running in a given process (what they're doing, how long they've been executing, etc.)
Their techniques are also summarized on highscalability.com:
PerfCollector. Centralized collection of performance data via UDP. More reliable than Windows and allows any client to connect and see stats.
Web Based Stack Dump Tool. Can right-click on a problem server and get a stack dump of the .NET managed threads. Used to have to RDC into the system, attach a debugger, and get an answer 1/2 hour later. Slow, nonscalable, and tedious. Not just a stack dump; it gives a lot of context about what the thread is doing. Troubleshooting is easier because you can see 90 threads are blocked on a database, so the database may be down.
Web Based Heap Dump Tool. Dumps all memory allocations. Very useful for developers. Saves hours of doing it by hand.
Profiler. Traces a request from start to finish and produces a report. See URL, methods, status, everything that will help you identify a slow request. Looks at lock contentions, whether a lot of exceptions are being thrown, anything that might be interesting. Very lightweight. It's running on one box in every VIP (group of 100 servers) in production. Samples 1 thread every 10 seconds. Always tracing in the background.
The question is: what tools are required to build a web-based stack dump tool for ASP.NET? For convenience, let's assume that an *.aspx hosted in the target AppDomain, able to output all managed call stacks in that process, is sufficient.
There are a few posts that cover the use of Mdbg (debugger for managed code written entirely in C#/IL that started shipping with CLR 2 SDK) and the mdbgcore assembly typically found in C:\Program Files\Microsoft Visual Studio 8\SDK\v2.0\Bin:
http://dotnetdebug.net/2005/11/09/exceptiondbg-v01-debug-your-exceptions/
http://blogs.msdn.com/jmstall/archive/tags/MDbg/default.aspx
http://blogs.msdn.com/vijaysk/archive/2009/11/04/asp-net-debugger-extension-for-iis-7.aspx
Would a solution simply reference this assembly to produce the desired output? What impact would a "list all managed call-stacks" operation have on the running process that's servicing production traffic?
I believe the profiling API of .NET is the way to go.
Look at the SlimTune project on Google Code for a live sample, with sources, that you can study to see how to adapt and improve it for an ASP.NET scenario.
Regards
Massimo
With the profiling API of .NET you have to stop the server and it takes a lot of CPU (but it gives you full control over all called methods).
I think the most lightweight solution is to do this with MDbg. I put together a very small but useful little app called StackDump that does the following:
1) The debugger stops the application and generates a list of all CLR stacks running for the process.
2) The application is started again.
This operation is a quick operation and can (maybe) be executed on a running production server with unchanged production code.
It's just 80 lines of .NET code to manage this. I have published the source code on CodePlex.