I need to use R packages with the spark_apply function in the sparklyr package. The R documentation is not quite clear. I tried to make spark_apply work by following this link. It worked for the first part with the following modifications.
The working part:
library(sparklyr)
spark_apply_bundle(packages = T, base_path = getwd())
bundle <- paste(getwd(), list.files()[grep("\\.tar$",list.files())][1], sep = "/")
hdfs_path <- "hdfs://<my-ip>/user/hadoop/R/packages/packages.tar"
system(paste("hdfs dfs -moveFromLocal", bundle, "hdfs://<my-ip>/user/hadoop/R/packages"))
config <- spark_config()
config$sparklyr.shell.files <- "hdfs://<my-ip>/user/hadoop/R/packages/packages.tar"
sc <- spark_connect(master = "yarn-client",
version = "2.4.0",
config = config)
mtcars_sparklyr <- copy_to(sc, mtcars)
However, when I try to use the svm function within spark_apply, it does not work with the packages argument.
result <- mtcars_sparklyr %>%
  spark_apply(
    function(d) {
      fit <- svm(d$mpg, d$wt)
      sum(fit$residuals ^ 2)
    },
    group_by = "cyl",
    packages = bundle
  )
The following, on the other hand, works; this is when I pass the svm function in via the context argument. However, I need the packages argument to work because I have several packages whose functions I want to use within spark_apply.
result <- mtcars_sparklyr %>%
  spark_apply(
    function(d) {
      fit <- svm(d$mpg, d$wt)
      sum(fit$residuals ^ 2)
    },
    group_by = "cyl",
    context = {svm <- e1071::svm}
  )
Assume I have two packages:
- package1, which contains the data sets d1, d2, d3
- package2, which should use the data sets from package1
The data sets in package1 are the ones I want to use for testing.
I can access each of these via e.g. package1::d1. But how can I load all of them in an automated way?
Something like
ds <- data(package = "package1") # you can try e.g. "carData"
ds$results[1, 3] # gives the first entry
mydataset <- load(ds$results[1, 3]) # this does not work
Others will use both packages, so it should work for other users and on different platforms (Windows, Mac).
Any ideas?
ds$results[1, 3]
# Item
# "d1"
looks promising, but
data(ds$results[1, 3])
# Warning message:
# In data(ds$results[1, 3]) : data set ‘ds$results[1, 3]’ not found
As indicated by @dcarlson, you can extract the names of all data sets in your package and pass them back to the data() function via its list argument. However, this solution only returns a promise for each data set and not the actual data set.
my_package <- "datasets"
name_of_all_datasets <- data.frame(data(package = my_package)$results)$Item
data(list = name_of_all_datasets, package = my_package)
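To go from promises to actual objects, one option is mget() on the environment data() loaded into. This is a minimal sketch, assuming each Item name matches the name of the object data() creates in the global environment (which is not always the case; some packages list items as, e.g., "d1 (dataset_file)"):

# Force the lazy-loaded promises into a named list; unmatched names become NULL.
loaded_datasets <- mget(as.character(name_of_all_datasets),
                        envir = globalenv(),
                        ifnotfound = list(NULL))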
I just realized there are two options:
1) You want to load external files from inst/extdata:
pathExtData <- system.file("extdata", package = "myPackage")
allFilenames <- list.files(pathExtData, full.names = TRUE)
# e.g. in case of Excel files
datalist <- list()
for (i in 1:length(allFilenames)) {
datalist[[i]] <- readxl::read_xlsx(path = allFilenames[i], sheet = "mySheet")
}
2) You want to load RData from a package; you can use:
ds <- data(package = "myPackage")
datalist <- list()
for (i in 1:length(ds$results[, 3])) {
eval(parse(text = paste0("datalist[[", i, "]] <- myPackage::", ds$results[i, 3])))
}
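If you would rather avoid eval(parse()), a hedged alternative is getExportedValue(), which is essentially what pkg::name does under the hood and also resolves lazy-loaded data sets. A sketch under the same assumptions as above:

ds <- data(package = "myPackage")
dataset_names <- ds$results[, "Item"]
# Build a named list of the data sets without constructing R code as strings
datalist <- setNames(
  lapply(dataset_names, function(nm) getExportedValue("myPackage", nm)),
  dataset_names
)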
I am pretty new to Spark. I have tried to look for something on the web, but I haven't found anything satisfactory.
I have always run parallel computations using mclapply, and I like its structure (i.e., the first parameter is used as the looping index, the second argument is the function to be parallelized, and the other optional parameters are passed on to that function).
Now I am trying to do roughly the same thing via Spark, i.e., I would like to distribute my computations among all the nodes of the Spark cluster. In short, this is what I have learned and how I think the code should be structured (I'm using the sparklyr package):
I create a connection to Spark using the command spark_connect;
I copy my data.frame into the Spark environment with copy_to and access it through its tibble reference;
I would like to implement a "Spark-friendly" version of mclapply, but it seems there is no similar function in the package (there is a spark.lapply function in the SparkR package, but unfortunately it is no longer on CRAN).
Below is a simple test script I have implemented that works using mclapply.
#### Standard code that works with mclapply #########
library(parallel)
dfTest = data.frame(X = rep(1, 10000), Y = rep(2, 10000))
.testFunc = function(X = 1, df, str) {
  rowSelected = df[X, ]
  y = as.numeric(rowSelected[1] + rowSelected[2])
  return(list(y = y, str = str))
}
lOutput = mclapply(X = 1 : nrow(dfTest), FUN = .testFunc, df = dfTest,
                   str = "useless string", mc.cores = 2)
######################################################
###### Similar code that should work with Spark ######
library(sparklyr)
sc = spark_connect(master = "local")
dfTest = data.frame(X = rep(1, 10000), Y = rep(2, 10000))
.testFunc = function(X = 1, df, str) {
  rowSelected = df[X, ]
  nSum = as.numeric(rowSelected[1] + rowSelected[2])
  return(list(nSum = nSum, str = str))
}
dfTest_tbl = copy_to(sc, dfTest, "test_tbl", overwrite = TRUE)
# Apply an mclapply-like function to dfTest_tbl that works with Spark
# ???
######################################################
If someone has already found a solution for this, that would be great. Other references/guides/links are also more than welcome. Thanks!
sparklyr
spark_apply is the existing function you're looking for:
spark_apply(sdf, function(data) {
...
})
Please refer to Distributed R in the sparklyr documentation for details.
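As a minimal hedged sketch against the dfTest_tbl created in the question (spark_apply hands each partition to the R function as a data frame and expects a data frame back):

library(dplyr)
result_tbl <- dfTest_tbl %>%
  spark_apply(function(df) {
    data.frame(nSum = df$X + df$Y)
  }, names = "nSum")
collect(result_tbl)   # bring the distributed result back to the driver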
SparkR
With SparkR use gapply / gapplyCollect:
gapply(df, groupingCols, function(key, data) { ... }, schema)
or the dapply / dapplyCollect UDFs:
dapply(df, function(data) { ... }, schema)
Refer to the dapply and gapply docs for details.
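A minimal hedged sketch of the dapply route, reusing dfTest from the question and assuming an active SparkR session (SparkR ships with the Spark distribution rather than CRAN):

library(SparkR)
sparkR.session()                          # assumes a local Spark is available
sdf <- createDataFrame(dfTest)
outSchema <- structType(structField("nSum", "double"))
res <- dapply(sdf, function(part) {
  data.frame(nSum = part$X + part$Y)
}, outSchema)
head(collect(res))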
Be warned that all of these solutions are inferior to native Spark code and should be avoided when high performance is required.
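For this particular toy example, the "native" route is just a dplyr verb, which sparklyr translates to Spark SQL so no R worker processes are needed (a hedged sketch reusing dfTest_tbl from the question):

library(dplyr)
dfTest_tbl %>%
  mutate(nSum = X + Y)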
sparklyr::spark_apply can now support passing external variables, such as models, via the context argument.
Here is my example of running an xgboost model with sparklyr:
bst <- xgboost::xgb.load("project/models/xgboost.model")
res3 <- spark_apply(
  x = ft_union_price %>% sdf_repartition(partitions = 1500, partition_by = "uid"),
  f = inference_fn,
  packages = FALSE,
  memory = FALSE,
  names = c("uid", "action_1", "pred"),
  context = {model <- bst}
)
I'm trying to pass a custom R function inside spark_apply but keep running into issues and can't figure out what some of the errors mean.
library(sparklyr)
sc <- spark_connect(master = "local")
perf_df <- data.frame(predicted = c(5, 7, 20),
                      actual = c(4, 6, 40))
perf_tbl <- sdf_copy_to(sc = sc,
                        x = perf_df,
                        name = "perf_table")
#custom function
ndcg <- function(predicted_rank, actual_rank) {
# x is a vector of relevance scores
DCG <- function(y) y[1] + sum(y[-1]/log(2:length(y), base = 2))
DCG(predicted_rank)/DCG(actual_rank)
}
#works in R using R data frame
ndcg(perf_df$predicted, perf_df$actual)
#does not work
perf_tbl %>%
  spark_apply(function(e) ndcg(e$predicted, e$actual),
              names = "ndcg")
OK, I'm seeing two possible problems.
(1) spark_apply prefers functions that have one parameter, a data frame.
(2) You may need to make a package, depending on how complex the function is.
Let's say you modify ndcg to receive a data frame as the parameter:
ndcg <- function(dataset) {
  predicted_rank <- dataset$predicted
  actual_rank <- dataset$actual
  # x is a vector of relevance scores
  DCG <- function(y) y[1] + sum(y[-1]/log(2:length(y), base = 2))
  DCG(predicted_rank)/DCG(actual_rank)
}
And you put that in a package called ndcg_package.
Now your code will be similar to:
spark_apply(perf_tbl, ndcg, packages = TRUE, names = "ndcg")
Doing this from memory, so there may be a few typos, but it'll get you close.
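If building a package feels like overkill for a single helper, a hedged workaround is to define the helper inside the applied function so it gets serialized along with the closure, and to return a data frame as spark_apply expects:

perf_tbl %>%
  spark_apply(function(e) {
    DCG <- function(y) y[1] + sum(y[-1] / log(2:length(y), base = 2))
    data.frame(ndcg = DCG(e$predicted) / DCG(e$actual))
  }, names = "ndcg")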
I have a function where I dynamically build multiple formulas as strings and cast them to formulas with as.formula. I then call that function in a parallel process using doSNOW and foreach, and use those formulas through dplyr::mutate_.
When I use lapply(formula_list, as.formula) I get the error could not find function *custom_function* when run in parallel, though it works fine when run locally. However, when I use lapply(formula_list, function(x) as.formula(x)), it works both in parallel and locally.
Why? What's the correct way to understand the environments here and the "right" way to code it?
I do get a warning that says: In e$fun(obj, substitute(ex), parent.frame(), e$data) : already exporting variable(s): *custom_func*
A minimal reproducible example is below.
# Packages
library(dplyr)
library(doParallel)
library(doSNOW)
library(foreach)
# A simple custom function
custom_sum <- function(x){
sum(x)
}
# Functions that create formulas and use them with standard-evaluation dplyr:
dplyr_mut_lapply_reg <- function(df){
  my_dots <- setNames(
    object = lapply(list("~custom_sum(Sepal.Length)"), as.formula),
    nm = c("Sums")
  )
  return(
    df %>%
      group_by(Species) %>%
      mutate_(.dots = my_dots)
  )
}
dplyr_mut_lapply_lambda <- function(df){
  my_dots <- setNames(
    object = lapply(list("~custom_sum(Sepal.Length)"), function(x) as.formula(x)),
    nm = c("Sums")
  )
  return(
    df %>%
      group_by(Species) %>%
      mutate_(.dots = my_dots)
  )
}
#1. CALLING BOTH LOCALLY
dplyr_mut_lapply_lambda(iris) #works
dplyr_mut_lapply_reg(iris) #works
#2. CALLING IN PARALLEL
#Faux Parallel Setup
cl <- makeCluster(1, outfile="")
registerDoSNOW(cl)
# Call Lambda Version WORKS
foreach(j = 1,
        .packages = c("dplyr", "tidyr"),
        .export = lsf.str()
) %dopar% {
  dplyr_mut_lapply_lambda(iris)
}
# Call Regular Version FAILS
foreach(j = 1,
        .packages = c("dplyr", "tidyr"),
        .export = lsf.str()
) %dopar% {
  dplyr_mut_lapply_reg(iris)
}
# Close Cluster
stopCluster(cl)
EDIT: In my original post title I wrote that I was using nse, but I really meant using standard evaluation. Whoops. I have changed this accordingly.
I don't have an exact answer to why here, but the future package (I'm the author) handles these types of "tricky" globals - they are tricky because they are not part of a package and they are nested, i.e. one global calls another global. For example, if you use:
library("doFuture")
cl <- parallel::makeCluster(1, outfile = "")
plan(cluster, workers = cl)
registerDoFuture()
that problematic "Call Regular Version FAILS" case should now work.
Now, the above uses parallel::makeCluster(), which defaults to type = "PSOCK", whereas if you load doSNOW you get snow::makeCluster(), which defaults to type = "MPI". Unfortunately, a full MPI backend is not yet implemented for the future package. Thus, if you're looking for an MPI solution, this won't help you (yet).
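Putting the pieces together, here is a hedged sketch of the full doFuture setup applied to the question's failing case. It mirrors the question's own foreach call, although doFuture identifies globals such as custom_sum automatically, so .export is not needed:

library(doFuture)                         # also attaches foreach and future
registerDoFuture()
cl <- parallel::makeCluster(1, outfile = "")
plan(cluster, workers = cl)

foreach(j = 1, .packages = c("dplyr", "tidyr")) %dopar% {
  dplyr_mut_lapply_reg(iris)              # should now find custom_sum
}

parallel::stopCluster(cl)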
How do I pass a pre-existing object, such as a DF, to my custom functions, given the setup I have (see below)?
Alternatively, do I need to set up my custom functions differently?
My functions reside in a series of *.R scripts.
I source the functions in my .Rprofile:
.env$fxShortName <- function(){
source("C:\\path\\to\\scriptFile.R")
}
Not-Quite Solutions:
1) Defining the Function Manually || It works w/ the obvious drawback that I need to manually load my functions each time.
2) Rscript + commandArgs || This works if I define the DF within the function, like this:
#foo.R
a <- data.frame(a = c(1))
b <- data.frame(b = c(1))
args <- commandArgs(trailingOnly = TRUE)
print(args)
data.frame.name <- args[1]
print(colnames(get(data.frame.name)))
Rscript creates a new R instance, though, so it doesn't see my pre-existing DF. At least, it doesn't find it out-of-the-box.
3) Function w/ substitute, match.call, etc. || I've adopted %>>% to set up auto-updated views of certain DFs, so I tried modifying the setup that works in that case. For %>>% I have this code in my .Rprofile:
.env$`%>>%` <- function(expr, x) {
x <- substitute(x)
call <- match.call()[-1]
fun <- function() {NULL}
body(fun) <- call$expr
makeActiveBinding(sym = deparse(x), fun = fun, env = parent.frame())
invisible(NULL)
}
This type of setup works with DFs from my current session. However, I prefer the structure offered by keeping my custom scripts separate from my .Rprofile.
4) get() & mget() || This seemed promising, but I don't understand it enough to definitively say whether or not it will help. And, yes, I did RTFM.
Reproducible Example:
myfx(head, preExistingDF)
Sample Code:
myfx <- function(expr, x) {
x <- substitute(x)
call <- match.call()[-1]
fun <- function() {NULL}
body(fun) <- call$expr
print(body(fun))}
Put the sample code in a script. Add the following code to your .Rprofile:
.env$mySamplefx <- function(){
source("C:\\path\\to\\myfx.R")
}
Then try it after adding the code directly to your .Rprofile.