I have some experience in R and am learning Spark 1.6.1 by first exploring its R API (SparkR).
I noticed that the syntax for the R sample() command is different in Spark:
base R: sample(x, size, replace)
SparkR: sample(DataFrame, withReplacement, fraction)
base::sample(x, size, replace) still works, but is masked by the SparkR version.
Does anyone know why this is, when most commands are identical between the two?
Are there use cases where I should use one versus the other?
Has anyone found an authoritative list of differences between SparkR and base R?
Thanks!
If you have a SparkR DataFrame, you'll need to use the SparkR API for sampling. If you have an R data.frame, you'll need to use the base R sampling function. SparkR is not R, and the function calls are not identical.
The issue you are having is one of masking.
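For example, here is a minimal sketch against the Spark 1.6 API (it assumes a SparkR session with a sqlContext already initialized):

    library(SparkR)  # attaching SparkR masks several base functions, sample() among them

    # SparkR's sample() works on a Spark DataFrame and takes a fraction, not a size:
    sdf <- createDataFrame(sqlContext, faithful)
    sampled_sdf <- sample(sdf, withReplacement = FALSE, fraction = 0.1)

    # To sample an ordinary R vector, call the masked base version explicitly:
    idx <- base::sample(1:100, size = 10, replace = FALSE)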
To address the second part of the question, for the benefit of others who follow, I found that the Spark documentation does in fact list the R functions that are masked:
R Function Name Conflicts
This is a short question.
Is there a way to perform abstractive text summarization in R?
There are ways to perform extractive summarization (i.e., extracting a few relevant sentences), as shown here: Text summarization in R language
However, I was unable to find a way to do abstractive summarization "purely" in R. I understand that abstractive summarization is far more complex and typically requires model training. The two methods that might work so far are using Python through the reticulate package or using one of Google's APIs. I was wondering if anyone is aware of an R package that does this without external dependencies.
Thank you in advance.
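For reference, the reticulate route mentioned above looks roughly like this. It is a sketch, not pure R, and assumes a local Python installation with the transformers package available:

    library(reticulate)

    # Load a Python summarization pipeline via reticulate
    # (the model is downloaded on first use).
    transformers <- import("transformers")
    summarizer <- transformers$pipeline("summarization")

    text <- "Long article text goes here..."
    res <- summarizer(text)
    res[[1]]$summary_text  # the abstractive summary as a string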
I have defined some R functions in RStudio that contain complicated scripts and a lot of readline calls. I can run them successfully in RStudio. Is there any way, like macros, to transfer these user-defined functions to SAS 9.4? I am not very familiar with SAS programming, so it would be easiest to just copy the R functions into SAS and use them directly. I am trying to figure out how to do this conversion. Thank you!
You can't natively run R code in SAS, and you probably wouldn't want to. R and SAS are entirely different languages, SAS being closer to a database language while R is a matrix language. Efficient R approaches are terrible in SAS, and vice versa. (Write a simple loop in both and you'll find SAS is orders of magnitude faster; matrix algebra, on the other hand, belongs in R.)
You can call R from SAS, though. You need to be in PROC IML, SAS's matrix language (which may require a license separate from your Base SAS); once there, you use a submit / R block to pass the code to R. The RLANG system option must be set, you may need some additional configuration on your SAS box to make sure it can see your R installation, and you need R 3.0+. You also need to be running SAS 9.22 or newer.
If you don't have R available through IML, you can use the X command or CALL SYSTEM, if those are enabled and you have access to R through the command line. Alternately, you can run R by hand, separately from SAS. Either way, you would use a CSV or similar file format to transfer data back and forth.
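The R side of that file-based hand-off is short. This sketch assumes a user-defined function my_fun() and illustrative file names; on the SAS side, PROC IMPORT and PROC EXPORT would read and write the same files:

    # Read input that SAS exported, run the existing R function, write results back.
    input  <- read.csv("from_sas.csv", stringsAsFactors = FALSE)
    result <- my_fun(input)   # your existing user-defined function (hypothetical name)
    write.csv(result, "to_sas.csv", row.names = FALSE)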
Finally, I recommend seeing if there's a better approach in SAS for the same problem you solved in R. There usually is, and it's often quite fast.
If I wanted to use a standard R package like MXNet inside SparkR, is this possible? Can standard CRAN packages be used inside the Spark distributed environment without considering a local versus a Spark DataFrame? Is the strategy for working with large data sets in R and Spark to use a Spark DataFrame, whittle down the DataFrame, and then convert it to a local data.frame in order to use the standard CRAN package? Is there another strategy that I'm not aware of?
Thanks
Can standard CRAN packages be used inside the Spark distributed environment without considering a local versus a Spark DataFrame?
No, they cannot.
Is the strategy for working with large data sets in R and Spark to use a Spark DataFrame, whittle down the DataFrame, and then convert it to a local data.frame?
Sadly, most of the time this is what you do.
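The pattern looks roughly like this (the path and column names are illustrative, using the Spark 2.x SparkR API):

    library(SparkR)

    sdf <- read.df("hdfs:///data/events.parquet", source = "parquet")

    small_sdf <- filter(sdf, sdf$score > 0.9)   # whittle down on the cluster first
    local_df  <- collect(small_sdf)             # now an ordinary R data.frame

    # From here, any CRAN package works as usual, e.g.:
    fit <- lm(y ~ x, data = local_df)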
Is there another strategy that I'm not aware of?
The dapply and gapply functions in Spark 2.0 can apply arbitrary R code to partitions or groups, respectively.
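As a sketch of dapply: the user function runs once per partition and both receives and returns a plain R data.frame, so CRAN code can run inside it, provided the package is installed on every worker node.

    library(SparkR)

    sdf <- createDataFrame(mtcars)

    # The output schema must be declared up front.
    schema <- structType(structField("mpg", "double"),
                         structField("mpg_scaled", "double"))

    result <- dapply(sdf, function(pdf) {
      # pdf is an ordinary data.frame holding one partition
      data.frame(mpg = pdf$mpg, mpg_scaled = scale(pdf$mpg)[, 1])
    }, schema)

    head(collect(result))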
I am trying to find a way to use the distance weighted discrimination (DWD) method to remove biases from multiple microarray datasets.
My starting point is this. The problem is that the Matlab version runs only under Windows and needs Excel 5 format as input, where data appear to be truncated at line 65535; the Matlab error is:
Error reading record for cells starting at column 65535. Try saving as Excel 98.
The Java version runs only with caBIG support, which, if I understood correctly, has been shut down recently.
So I searched a lot and found the R/DWD package, but from the example I could not work out how to provide the two datasets to merge to the kdwd function.
Does anybody know how to use it?
Thanks
Try this; it has a DWD implementation:
http://www.bioconductor.org/packages/release/bioc/html/inSilicoMerging.html
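If I read the package vignette correctly, the call is along these lines, where eset1 and eset2 stand for your two microarray datasets as ExpressionSet objects; treat this as a sketch and check the package's current documentation:

    library(inSilicoMerging)

    # merge() takes a list of ExpressionSets and a batch-effect removal method;
    # "DWD" is one of the supported methods.
    merged <- merge(list(eset1, eset2), method = "DWD")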
Is it possible to call Stata functions from R?
Not directly, i.e. there is no package I am aware of that implements a bridge.
You can always call external programs using system(), but that is neither elegant nor efficient. That said, you could prepare data in R, write it out, call Stata, and then read the results back in; see help(system).
There's now an RStata package on CRAN that bridges R and Stata.
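A minimal sketch of that bridge follows; the Stata path and version are illustrative, so set them to match your installation:

    library(RStata)

    options("RStata.StataPath" = "/usr/local/bin/stata")  # adjust for your machine
    options("RStata.StataVersion" = 13)

    # Send an R data.frame to Stata, run commands, and get a data.frame back:
    out <- stata(c("summarize mpg",
                   "generate kpl = mpg * 0.4251"),
                 data.in = mtcars, data.out = TRUE)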
The real problem is that Stata doesn't have an interactive interpreter you can pass arguments to.
Dirk is right; you can just go ahead and write the data to a common format (if size is large and speed is an issue, fixed width is safe), but you can also just use .dta throughout the process, using read.dta in R and reading natively in Stata.
Also, in R, when you call system() you can pass a do-file or a string containing a bunch of Stata commands.
So, generally, trying to use Stata for this or that task may or may not be worth it, especially if an R equivalent is close by.
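For completeness, the file-based round trip described above can be sketched like this; the paths, file names, and Stata binary name are illustrative, and batch-mode flags differ between platforms:

    library(foreign)  # read.dta / write.dta handle Stata's .dta format

    write.dta(mtcars, "in.dta")            # hand the data to Stata
    system("stata -b do analysis.do")      # run a do-file in batch mode (Unix-style call)
    results <- read.dta("out.dta")         # read back whatever the do-file wrote out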