How can I transfer data between several computers in R? [closed]

Background: I want to build a simple distributed environment in R that can handle some data-heavy jobs on Windows, for example "big" matrix multiplication. There seem to be various solutions, and I worked on them for a while, but I couldn't get any of them to do what I need.
I already tried these: Rserve & RSclient, and packages such as snow and snowfall.
I tried several approaches, but I can't find a proper way to transfer data between clients, and it could be a disaster if all the data transfer has to go through the master.
Question: Are there any functions that can send a matrix directly between any two computers in a cluster?
I have an idea that a socket connection might work, but how can I start it gracefully? Do I have to start the R scripts on the different computers manually, since there seems to be no SSH on Windows? I have to work on this because of my professor.
I also wanted to know whether this is good practice. Thanks in advance.
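A minimal sketch of the socket idea, using only base R: socketConnection plus serialize/unserialize can move a matrix directly between two R sessions without going through the master. The hostname and port below are assumptions; both machines must be able to reach each other on that port.

# Receiver: run first on one machine; it listens on a port of your choosing
con <- socketConnection(port = 5555, server = TRUE, blocking = TRUE, open = "a+b")
m   <- unserialize(con)     # blocks until the sender writes the matrix
close(con)

# Sender: run on the other machine, pointing at the receiver (hostname is hypothetical)
con <- socketConnection(host = "receiver-host", port = 5555, blocking = TRUE, open = "a+b")
serialize(matrix(rnorm(1e6), 1000, 1000), con)
close(con)

You would still need to launch the R processes on each machine yourself (or via a scheduled task), which is exactly the orchestration problem the answers below try to solve.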

One option would be to use SparkR.
You will be compelled to use Spark APIs to distribute your data, and there is a chance certain packages won't behave as expected, but it would do the job.
A Spark standalone cluster is made of a master accessible via HTTP and multiple workers. It is not the ideal solution for resource sharing, but it is lighter than a Hadoop + Spark on YARN setup.
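As a rough, hedged sketch of what that route might look like for the matrix example: spark.lapply ships each piece of work to a worker and collects the results. The master URL is a placeholder, and it assumes Spark and R are installed on every worker.

library(SparkR)
sparkR.session(master = "spark://master-host:7077", appName = "blockMultiply")

# Split one factor into row blocks and multiply each block on a worker
A      <- matrix(rnorm(4000 * 1000), nrow = 4000)
B      <- matrix(rnorm(1000 * 500),  nrow = 1000)
blocks <- split.data.frame(A, rep(1:4, each = 1000))      # four 1000-row blocks
parts  <- spark.lapply(blocks, function(blk) blk %*% B)   # B travels inside the closure
C      <- do.call(rbind, parts)                           # reassemble A %*% B

sparkR.session.stop()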
Finally, you can try Dataiku, as it provides such capabilities via notebooks, Spark integration and dataset management. The community edition is not collaborative, but they provide free licenses to schools.

Related

R machine learning model deployment as webservice [closed]

Is there any way to deploy a machine learning model written in R as a web service? I know we have Flask in Python and many more, but I didn't come across any such library for R machine learning code.
As others suggested, you can use R Shiny to build an app which you can later deploy as a web service easily. Moreover, you can use HTML code inside Shiny, so you can customise your layout to your heart's content. If you are using RStudio (which I definitely encourage if you aren't), you only need to select File > New File > Shiny Web App... Have a look at the documentation and examples here.
However, if you only want to create a compact and fast web service without having to build a layout etc., I would suggest you use the R plumber library. This is a good solution if you don't need anything too fancy, and it is also easy to implement by adding decorators to your current code.
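A minimal plumber sketch, assuming a model saved earlier as model.rds (the file name and the predictor x are made up for illustration). The #* comments are the decorators mentioned above, and plumb() turns the file into a running web service.

# plumber.R -- hypothetical scoring endpoint for a saved model
library(plumber)
model <- readRDS("model.rds")

#* Predict from a single numeric input
#* @param x a numeric predictor value
#* @get /predict
function(x) {
  predict(model, newdata = data.frame(x = as.numeric(x)))
}

# In a separate R session or script, start the API:
# plumber::plumb("plumber.R")$run(port = 8000)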
Hope this helps!

R packages for Distributed processing [closed]

I currently have an R script that does parallel processing within a loop using foreach, but it runs on a single server with 32 cores. Because of my data size, I am trying to find R packages that can distribute the computation to different Windows servers and still work with foreach for the parallelism.
I really appreciate your help!
For several releases now, R has been shipping with a base library parallel. You could do much worse than starting to read its rather excellent (and still short) pdf vignette.
In a nutshell, you can just do something like
mclapply(1:nCores, someFunction)   # pass the function itself, not the result of someFunction()
and the function someFunction() will be run in parallel over nCores. A default value of half your physical cores may be a good start.
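One caveat: mclapply relies on forking, which is not available on Windows, so there it silently falls back to running sequentially. A hedged sketch of the PSOCK alternative from the same parallel package is below; the same cluster can be registered with doParallel so your existing foreach loop keeps working.

library(parallel)
library(doParallel)   # attaches foreach as well

# PSOCK workers run as separate Rscript processes, so this works on Windows.
# Remote machine names can be given too (e.g. c("server2", "server3")), but
# launching workers there needs a remote shell such as ssh set up between hosts.
cl <- makePSOCKcluster(detectCores() - 1)

res1 <- parLapply(cl, 1:8, function(i) i^2)   # parallel-package style

registerDoParallel(cl)                        # reuse the same cluster with foreach
res2 <- foreach(i = 1:8) %dopar% i^2

stopCluster(cl)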
The Task View on High-Performance Computing has many more pointers.
SparkR is the answer. From "Announcing SparkR: R on Apache Spark":
SparkR, an R package initially developed at the AMPLab, provides an R frontend to Apache Spark and using Spark’s distributed computation engine allows us to run large scale data analysis from the R shell.
Also see SparkR (R on Spark).
To get started you need to set up a Spark cluster. This web page should help. The Spark documentation for a standalone cluster (without Mesos or YARN as the cluster manager) is here. Once you have Spark set up, see Wendy Yu's tutorial on SparkR. She also shows how to integrate H2O with Spark, which is referred to as 'Sparkling Water'.
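Once the cluster is running, a minimal first session might look like the sketch below; the master URL is a placeholder for your standalone master, and the example just pushes a built-in data frame out to the cluster and aggregates it.

library(SparkR)
sparkR.session(master = "spark://your-master:7077", appName = "firstSteps")

df <- as.DataFrame(faithful)   # distribute a local data.frame to the cluster
head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))

sparkR.session.stop()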

How to set up / configure R for a dev, stage, and production environment on one server? [closed]

A colleague needs to set up a dev, stage and production environment on our server and he is asking what this means for how to run our R codes. Frankly, I do not know. My intuitive solution would be to have three different servers, so the installed R packages do not conflict. But for now, the environments should be on the same server. I am not sure how to achieve this. How can we run several versions of a package side by side? For example, using different .libPaths to be able to host different packages side by side?
What is the proper way to set this up?
PS. I hope I expressed myself clearly enough, as I do not have any experience with this stuff.
Every GNU program allows you to prefix its installation (and more, e.g. a suffix or prefix appended to the executable name).
We use that in the 'how to build R-devel' script I posted years ago and still use, both directly and e.g. in the Dockerfile scripts for Rocker.
This generalizes easily. Use different configurations (with/without (memory) profiling, UBSAN, ...) and/or versions to your heart's content, place them next to each other in /opt/R or /usr/local/lib/R or ..., and just use them, as every R installation has its own separate file tree. One easy way to access them differently is via $PATH; another is to have front-end scripts (or shell aliases) R-prod, R-qa, R-dev, etc.
You will have to think through whether you want a common .libPaths() (for, say, common dependencies) or whether you want to reinstall all libraries for each. The latter is the default.
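If you do go the separate-library route, here is a small sketch of what a per-environment startup file (e.g. each installation's Rprofile.site) might contain. The R_ENV variable and the /opt/R/libs layout are assumptions, not a standard.

# Hypothetical per-environment library selection
env <- Sys.getenv("R_ENV", unset = "dev")      # set R_ENV=prod / stage / dev in each front-end script
lib <- file.path("/opt/R/libs", env)           # /opt/R/libs/dev, /opt/R/libs/stage, ...
dir.create(lib, recursive = TRUE, showWarnings = FALSE)
.libPaths(c(lib, .libPaths()))                 # environment-specific libraries take precedence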

R 64 bit for Windows comparison [closed]

So there is R for 64-bit Windows users now. I'd like to know if anyone has found incremental benefits in using R-64bit over the 32bit version on Windows.
I'm looking for more specific information:
What was the system specification (6 GB RAM, for example) and the largest dataset that was crunched?
Which algorithms performed faster?
Any other experiential information that motivated you to adopt the 64-bit version on Windows.
If I had a $ for all the times non-R users cribbed about R's data limitation....!
I want to showcase R at my workplace and want some testimonials to prove that with a decently powerful machine, 64bit R on windows can crunch gigabyte class datasets.
I don't have anything quantitative, but it's been well worth the upgrade. 64-bit Windows (7) is far more reliable and you can simply run more large jobs at once. The main advantage of 64-bit R is the ability to create large objects that don't hit the 32-bit limit. I have 12 GB of RAM and have worked with objects ~8 GB in size, for simple tests. Usually I wouldn't have any R process using more than 1-2 GB, but the overall performance is great.
I'd be happy to run examples if you want specifics.
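If you want something quick to demonstrate the difference yourself, a small check like the sketch below shows whether the running R is a 64-bit build and allocates an object well past the 32-bit limit. The sizes are illustrative and assume a machine with enough RAM.

.Machine$sizeof.pointer                      # 8 on a 64-bit build, 4 on 32-bit
x <- matrix(0, nrow = 20000, ncol = 20000)   # ~3.2 GB of doubles; not possible on 32-bit R
print(object.size(x), units = "auto")
rm(x); gc()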

Thoughts on moving to Maven in an enterprise environment [closed]

I'm interested in hearing from those who either A) use Maven in an enterprise environment or B) tried to use Maven in an enterprise environment.
I work for a large company that is contemplating bringing Maven into our environment. Currently we use OpenMake to build/merge and home-grown software to deploy code to 100+ servers running various platforms (e.g. WAS and JBoss). OpenMake works fine for us; however, Maven does have some ideal features, most importantly dependency management. But is it viable in a large environment? Also, what headaches have you incurred, if any, in maintaining a Maven environment?
Side note, I've read Why does Maven have such a bad rep?, What are your impressions of Maven?, and a few other posts. It's interesting seeing the split between developers.
Stack Overflow isn't meant to be for subjective questions and discussion, so to keep it brief:
is it viable in a large environment?
Yes, it's already in use in large environments, both across open source foundations and corporations.
A lot of what Maven provides through its centralisation, reuse and easy sharing is designed to facilitate collaboration between multiple teams or organisations.
Like anything, you'll get out what you put in. Spend the time implementing the infrastructure you need and designing the organisational structure and project patterns that you'll need, and it will pay dividends.
I'm sure for more details you can find plenty of examples (positive and negative) across the web, and particularly on the users@maven.apache.org list.
