I have Revolution R Enterprise and want to run two simple but computationally intensive operations on each of 121k files in a directory, outputting to new files. I was hoping to use some RevoScaleR function that chunks/parallel-processes the data similarly to lapply. So I'd have lapply(list of files, function), but using a faster RevoScaleR function that might actually finish, since I suspect plain lapply would never complete.
So is there a RevoScaleR version of lapply? Will running it from Revolution R Enterprise automatically chunk things?
I see parLapply and mclapply (http://www.inside-r.org/r-doc/parallel/clusterApply)... can I run these using cores on the same desktop? AWS servers? Do I get anything out of running these packages in Revolution if it's not a native RevoScaleR function? I guess this is really a question about what I can use as a "cluster" in this situation.
There is rxExec, which behaves like lapply in the single-core scenario, and like parLapply in the multi-core/multi-process scenario. You would use it like this:
# vector of file names to operate on
files <- list.files()

# func carries out the operations you want on a single file
func <- function(fname) {
    # ... your per-file computations here ...
}

rxSetComputeContext("localpar")
rxExec(func, fname = rxElemArg(files))
Here, func is the function that carries out the operations you want on the files; you pass it to rxExec much as you would to lapply. The rxElemArg function tells rxExec to run func on each of the different values of files. Setting the compute context to "localpar" starts up a local cluster of worker processes, so the operations will run in parallel. By default the number of workers is 4, but you can change this with rxOptions(numCoresToUse=n).
How much speedup can you expect to get? That depends on your data. If your files are small and most of the time is taken up by computation, then doing things in parallel can give you a big speedup. However, if your files are large, you may run into I/O bottlenecks, especially if all the files are on the same hard disk.
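For comparison, if you want to do the same thing with base R's parallel package on a single desktop (which covers the parLapply route mentioned in the question), the pattern would look roughly like this; process_file is a hypothetical stand-in for your two operations, so treat it as a sketch rather than tested code:
library(parallel)

# process_file() is a hypothetical placeholder: read one file, compute, write a new file
process_file <- function(fname) {
    # ... your per-file computations here ...
}

cl <- makeCluster(4)                            # 4 worker processes on the local machine
results <- parLapply(cl, files, process_file)   # files as defined above
stopCluster(cl)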
I am exploring parallel programming in R and I have a good understanding of how the foreach function works, but I don't understand the differences between parallel, doParallel, doMC, doSNOW, snow, multicore, etc.
After doing a bunch of reading, it seems that these packages work differently depending on the operating system. I see some packages use the word multicore and others use cluster (I am not sure whether those are different), but beyond that it isn't clear what advantages or disadvantages each has.
I am working on Windows, and I want to calculate standard errors using replicate weights in parallel so I don't have to calculate each replicate one at a time (if I have n cores I should be able to do n replicates at once). I was able to implement it using doSNOW, but it looks like plyr and the R community in general use doMC, so I am wondering if using doSNOW is a mistake.
Regards,
Carl
My understanding is that parallel is a conglomeration of snow and multicore, and is meant to incorporate the best parts of both.
For parallel computing on a single machine, I find parallel to have been very effective.
For parallel computing using a cluster of multiple machines, I've never succeeded in completing the cluster setup using parallel, but have succeeded using snow.
I've never used any of the do* packages, so I'm afraid I'm unable to comment.
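For what it's worth, since you are on Windows and already have working foreach/doSNOW code, the parallel-based route is simply to register a doParallel backend instead; the foreach loop itself should not need to change. A rough sketch, where replicate_weights and the loop body are assumptions standing in for your replicate-weight setup:
library(doParallel)

cl <- makeCluster(4)            # PSOCK cluster, works on Windows
registerDoParallel(cl)

# replicate_weights is assumed to be a list with one set of weights per replicate
se_parts <- foreach(w = replicate_weights, .combine = c) %dopar% {
    # compute the estimate for this set of replicate weights
}

stopCluster(cl)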
Suppose I have a VBA macro in Excel that does some calculation, and I would like to do part of this calculation in R, programmatically. Say, at some point the Excel macro has a vector and needs to find its mean using R's mean function. How can I call R from VBA, pass a vector to R, run the calculation in R, and get the result back into VBA? Thanks.
There is a plugin, RExcel, but I found it quite horrible to use (and you kind of have to pay for it).
The easiest and most general, if hacky, way to set up the interaction is the following:
1) Save your array/matrix/vector as a CSV file in a folder
2) Write your R code in a script that reads the CSV and writes the result to another CSV (see the sketch below)
3) Call the R script from VBA with the VBA Shell function (Rscript scriptName.R)
4) Import the result back into Excel/VBA.
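As a rough sketch of steps 2) and 3) (the file paths and script name here are made up), the R script might look like the following, and VBA would launch it with something like Shell("Rscript.exe C:\scripts\mean_calc.R"):
# mean_calc.R -- hypothetical script: read the vector VBA wrote out,
# compute its mean, and write the result back for VBA to import
input <- read.csv("C:/temp/input.csv", header = FALSE)
result <- mean(input[[1]])
write.csv(result, "C:/temp/output.csv", row.names = FALSE)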
This method has the advantage that you separate the computational logic (R) from the formatting (VBA).
You could also run R code directly from VBA using Rscript's -e option, but this is strongly discouraged.
Hope it helps!
BTW: the same approach works with other programs (Python/LaTeX/Matlab).
If one is building a substantial, organization-wide code base in R, is it acceptable practice to rely on the sqldf package as the default approach for data munging tasks? Or is it best practice to rely on operations with R-specific syntax where possible? By relying on sqldf, one introduces a substantial amount of a different syntax, SQL, into the R code base.
I'm asking this question with specific regard to maintainability and style. I've searched existing R style guides and did not find anything on this subject.
EDIT: To clarify the workflow I'm concerned with, consider a data munging script making ample use of sqldf as follows:
library(sqldf)

gclust_group <- sqldf("SELECT clust, SUM(trips) AS trips2
                       FROM gclust
                       GROUP BY clust")

gclust_group2 <- sqldf("SELECT g.*, h.Longitude, h.Latitude, h.withinss, s.trips2
                        FROM highestd g
                        LEFT JOIN centers h ON g.clust = h.clust
                        LEFT JOIN gclust_group s ON g.clust = s.clust")
And such a script could continue for many lines. (For those familiar with Hadoop and Pig, the style is actually similar to a Pig script.) Most of the work is done using SQL syntax, albeit with the benefit of avoiding complex subqueries.
Write functions. Functions with clear names that describe their purpose. Document them. Write tests.
Whether the functions contain sqldf parts, or use dplyr, or use bare R code, or call Rcpp is at that level irrelevant.
But if you want to try changing something from sqldf to dplyr the important thing is that you have a stable platform on which to experiment, which means well-defined functions and a good set of tests. Maybe there's a bottleneck in one function that might run 100x faster if you do it with dplyr? Great, you can profile and test the code with both.
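For example (a sketch using the column names from the question; the function names are made up), the first sqldf query could live behind a named, documented function, with a dplyr version kept as a drop-in alternative for exactly that kind of comparison:
library(sqldf)
library(dplyr)

# Total trips per cluster; df is expected to have columns clust and trips
cluster_trip_totals_sqldf <- function(df) {
    sqldf("SELECT clust, SUM(trips) AS trips2 FROM df GROUP BY clust")
}

# Same totals via dplyr; either implementation can back the rest of the pipeline
cluster_trip_totals_dplyr <- function(df) {
    df %>%
        group_by(clust) %>%
        summarise(trips2 = sum(trips))
}
Your tests can then check that both versions return the same totals on a small sample.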
You can even branch your code and have a sqldf branch and a dplyr branch in your revision control system (you are using an RCS, right?) and work in parallel until you get a winner.
It honestly doesn't matter if you are introducing other bits of syntax into your R code from a maintainability perspective if your codebase is well-documented and tested.
I worked out an estimator, and I would like to check its performance by doing simulation studies in R. I want to repeat the experiment 500 times. Unfortunately, the computation involved in the estimator is sophisticated; each replication takes about 15 minutes on my desktop. I am looking for distributed computation approaches with R. How should I start? I have googled this topic, and there are many posts about it.
I'd suggest starting with the foreach package. If you're using Mac or Linux, the following is the simplest way to do parallel computing:
# First we register a parallel backend. This works on Mac and Linux;
# Windows is more complicated: try the snow package there.
library(doMC)
registerDoMC(cores = 4)  # substitute the number of cores you want to run on

# Now we can run things in parallel using foreach
foreach(i = 1:4) %dopar% {
    # Whatever is in here runs on a separate core for each iteration
}
You should read the vignette for foreach, as it's quite different from a regular for loop (especially for nested loops), and it is also quite powerful for combining results at the end and returning them.
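For the 500-replication study in the question, the loop might look roughly like this; run_one_replication is a hypothetical wrapper around your estimator code, so this is a sketch rather than something to copy verbatim:
library(doMC)            # on Windows use doParallel or doSNOW instead
registerDoMC(cores = 4)

# run_one_replication() is a hypothetical function doing one 15-minute run
results <- foreach(i = 1:500, .combine = rbind) %dopar% {
    run_one_replication(i)
}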
The first step with any R problem as broad as this should be checking the CRAN Task Views. Oh look:
http://cran.r-project.org/web/views/HighPerformanceComputing.html
Note that Stack Overflow isn't really the place for asking broad questions that are best answered with 'read that documentation over there' or 'why don't you try using tool X?'
Are there any R-project packages that facilitate asynchronous network IO?
I'm thinking here along the lines of Ruby's EventMachine or Python's Twisted.
If there are several such packages/libraries, which is the best in terms of:
- performance
- features
First of all, R is single-threaded, so typically people try to use parallel computing approaches (see, for instance, the snow package). I don't think there's anything quite like EventMachine or Twisted.
Check out the following:
The "State of the Art in Parallel Computing with R" paper describes most of the approaches to parallel computing in R (http://www.jstatsoft.org/v31/i01/paper). There are many useful packages in the HighPerformanceComputing view: http://cran.r-project.org/web/views/HighPerformanceComputing.html.
Check out svSocket: http://cran.r-project.org/web/packages/svSocket/
You can try using NetWorkSpaces with R: http://cran.r-project.org/web/packages/nws/.
There are several examples of R servers, e.g. Rserve: http://www.rforge.net/Rserve/
The IBrokers package is one of the only ones I know of that uses asynchronous requests. Have a look at the source code for that package (you can download it from R-Forge) and the related vignette: http://cran.r-project.org/web/packages/IBrokers/vignettes/RealTime.pdf
The biocep project also includes many relevant features: http://biocep-distrib.r-forge.r-project.org/
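None of these is a full EventMachine/Twisted-style framework, but note that base R does offer non-blocking socket connections that can be polled with socketSelect, which may be enough for simple cases. A minimal sketch (the host and port are made up, and a server is assumed to be listening there):
# Open a non-blocking client connection to an assumed server
con <- socketConnection(host = "localhost", port = 6011,
                        blocking = FALSE, open = "r+")

# Poll the socket; do other work between polls instead of blocking the session
while (!socketSelect(list(con), timeout = 1)) {
    # ... other work here ...
}
lines <- readLines(con)
close(con)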