Get number of active spark executors with sparklyr and R

When launching a Spark cluster via sparklyr, I notice that it can take between 10 and 60 seconds for all the executors to come online.
Right now I'm using Sys.sleep(60) to allow time for them to come online, but sometimes it takes longer than that and sometimes shorter. I want a programmatic way to adjust for this variance, similar to this question regarding Python. So I think I want to pass getExecutorMemoryStatus via sparklyr, but I'm not sure how to do this.
To reproduce what I'm seeing, run the following code to launch a yarn-client Spark connection and check the YARN UI. In the Event Timeline we can see at which time each executor comes online.
spark_config <- spark_config()
spark_config$spark.executor.memory <- "11G"
spark_config$`sparklyr.shell.driver-memory` <- "11G"
spark_config$spark.dynamicAllocation.enabled <- FALSE
spark_config$`spark.yarn.executor.memoryOverhead` <- "1G"
spark_config$spark.executor.instances <- 32
sc <- spark_connect(master = "yarn-client", config = spark_config)

So I think I want to pass getExecutorMemoryStatus via sparklyr, but I'm not sure how to do this.
You have to retrieve the SparkContext object:
sc <- spark_connect(...)
spark_context(sc) %>%
  ...
and then invoke the method:
... %>% invoke("getExecutorMemoryStatus")
Together:
spark_context(sc) %>%
  invoke("getExecutorMemoryStatus") %>%
  names()
should give you a list of active executors.
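Building on that, the original question (waiting until the executors are actually up instead of a fixed Sys.sleep(60)) can be handled with a small polling loop around the same call. A minimal sketch, assuming the 32-executor config above; wait_for_executors is a made-up helper, not part of sparklyr, and the +1 assumes the driver also shows up in getExecutorMemoryStatus:
library(sparklyr)

# Hypothetical helper: poll until at least `expected` entries report in,
# or stop after `timeout_sec` seconds.
wait_for_executors <- function(sc, expected, timeout_sec = 120, poll_sec = 2) {
  start <- Sys.time()
  repeat {
    n <- spark_context(sc) %>%
      invoke("getExecutorMemoryStatus") %>%
      names() %>%
      length()
    if (n >= expected) return(invisible(n))
    if (as.numeric(difftime(Sys.time(), start, units = "secs")) > timeout_sec)
      stop("Timed out waiting for executors; last count was ", n)
    Sys.sleep(poll_sec)
  }
}

# 32 executors requested, plus (assumed) one entry for the driver
wait_for_executors(sc, expected = 32 + 1)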

Related

Registering a temporary table using sparklyr in Databricks

My colleague is using pyspark in Databricks, and the usual step is to run an import using data = spark.read.format('delta').parquet('parquet_table').select('column1', 'column2') and then this caching step, which is really fast:
data.cache()
data.registerTempTable("data")
As an R user I am looking for this registerTempTable equivalent in sparklyr.
I would usually do
data = sparklyr::spark_read_parquet(sc = sc, path = "parquet_table", memory = FALSE) %>% dplyr::select(column1, column2)
If I opt for memory = TRUE or tbl_cache(sc, "data"), it keeps running and never seems to finish. The difference in time is stark: my colleague's registerTempTable takes seconds, whereas my sparklyr approach runs for an unknown amount of time. Is there a sparklyr function for R users that can do the equivalent of registerTempTable this quickly?
You can try using cache and createOrReplaceTempView
library(SparkR)
df <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", source = "csv", header="true", inferSchema = "true")
cache(df)
createOrReplaceTempView(df,"df_temp")
The above works for SparkR.
For sparklyr, your code using spark_read_parquet() with memory = FALSE is a pretty similar procedure to what is happening in SparkR.
You shouldn't need to cache the data if you just want to create a temporary table so I would just use memory = FALSE.
See this question.
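As a side note (not part of the answer above): sparklyr also has sdf_register(), which registers a Spark DataFrame or lazy tbl as a temporary view without pulling it into memory, which is arguably the closest analogue to registerTempTable. A rough sketch using the question's hypothetical path and column names:
library(sparklyr)
library(dplyr)

data <- spark_read_parquet(sc, path = "parquet_table", memory = FALSE) %>%
  select(column1, column2) %>%
  sdf_register("data")   # creates a temp view named "data"; no caching happens here

# Caching can then be requested separately if it is actually wanted;
# eager caching is what makes memory = TRUE slow up front.
# tbl_cache(sc, "data", force = FALSE)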

How to use shiny app as a target in drake

How do I pass the previous target (df) to the ui and server functions that I use in the next command, shinyApp()? My plan looks like this:
plan <- drake_plan(
  df = faithful,
  app = shinyApp(ui, server)
)
ui and server are copied from the shiny tutorial. There's only one difference: I changed faithful to df (the data from the previous target).
Now I'm getting an error:
Warning: Error in $: object of type 'closure' is not subsettable
[No stack trace available]
How to solve this? What's the best practice?
drake targets should return fixed data objects that can be stored with saveRDS() (or alternative kinds of files if you are using specialized formats). I recommend having a look at https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets. There are issues with defining a running instance of a Shiny app as a target.
As long as the app is running, make() will never finish.
It does not really make sense to save the return value of shinyApp() as a data object. That's not really what a target is for. The purpose of a target is to reproducibly cache the results of a long computation so you do not need to rerun it unless some upstream code or data change.
Instead, I think the purpose of the app target should be to deploy to a website like https://shinyapps.io. To make the app update when df changes, be sure to mention df as a symbol in a command so that drake's static code analyzer can pick it up. Also, use file_in() to declare your Shiny app scripts as dependencies so drake automatically redeploys the app when the code changes.
library(drake)
plan <- drake_plan(
  df = faithful,
  deployment = custom_deployment_function(file_in("app.R"), df)
)

custom_deployment_function <- function(file, ...) {
  rsconnect::deployApp(
    appFiles = file,
    appName = "your_name",
    forceUpdate = TRUE
  )
}
Also, be sure to check the dependency graph so you know drake will run the correct targets in the correct order.
vis_drake_graph(plan)
In your previous plan, the command for the app did not mention the symbol df, so drake did not know it needed to run one before the other.
plan <- drake_plan(
  df = faithful,
  app = shinyApp(ui, server)
)
vis_drake_graph(plan)
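One detail the answer leaves open is how app.R gets hold of df at runtime. A possible approach for local development (an assumption on my part, not something the answer prescribes) is to read the target back from drake's cache with readd(); a copy deployed to shinyapps.io would not have access to that cache, so there you would more likely write df to a file and ship it alongside the app:
# app.R, local-development sketch: read the 'df' target from drake's cache
library(drake)
library(shiny)

df <- readd(df)

ui <- fluidPage(
  sliderInput("bins", "Number of bins:", min = 1, max = 50, value = 30),
  plotOutput("distPlot")
)

server <- function(input, output) {
  output$distPlot <- renderPlot({
    x <- df$waiting
    hist(x, breaks = seq(min(x), max(x), length.out = input$bins + 1))
  })
}

shinyApp(ui, server)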

How to unpersist in Sparklyr?

I am using sparklyr for a project and have come to understand that persisting is very useful. I am using sdf_persist for this, with the following syntax (correct me if I am wrong):
data_frame <- sdf_persist(data_frame)
Now I am reaching a point where I have too many RDDs stored in memory, so I need to unpersist some. However, I cannot seem to find the function to do this in sparklyr. Note that I have tried:
dplyr::db_drop_table(sc, "data_frame")
dplyr::db_drop_table(sc, data_frame)
unpersist(data_frame)
sdf_unpersist(data_frame)
But none of those work.
Also, I am trying to avoid using tbl_cache (in which case db_drop_table seems to work), as sdf_persist offers more control over the storage level. It might be that I am missing the big picture of how to use persistence here, in which case I'll be happy to learn more.
If you don't care about granularity then the simplest solution is to invoke Catalog.clearCache:
spark_session(sc) %>% invoke("catalog") %>% invoke("clearCache")
Uncaching a specific object is much less straightforward due to sparklyr's indirection. If you check the object returned by sdf_persist you'll see that the persisted table is not exposed directly:
df <- copy_to(sc, iris, memory=FALSE, overwrite=TRUE) %>% sdf_persist()
spark_dataframe(df) %>%
  invoke("storageLevel") %>%
  invoke("equals", invoke_static(sc, "org.apache.spark.storage.StorageLevel", "NONE"))
[1] TRUE
That's because you don't get the registered table directly, but rather the result of a subquery like SELECT * FROM ....
It means you cannot simply call unpersist:
spark_dataframe(df) %>% invoke("unpersist")
as you would in one of the official APIs.
Instead, you can try to retrieve the name of the source table, for example like this:
src_name <- as.character(df$ops$x)
and then invoke Catalog.uncacheTable:
spark_session(sc) %>% invoke("catalog") %>% invoke("uncacheTable", src_name)
That is likely not the most robust solution, so please use with caution.
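If you need to do this more than once, the lookup and the Catalog call can be wrapped in a small helper. This is only a sketch: uncache_tbl is a made-up name, and it relies on the df$ops$x convention shown above, which may not hold for every sparklyr version or query shape:
# Hypothetical helper, not part of sparklyr: uncache the table backing a tbl_spark.
uncache_tbl <- function(tbl) {
  sc <- spark_connection(tbl)
  src_name <- as.character(tbl$ops$x)   # name of the underlying registered table
  spark_session(sc) %>%
    invoke("catalog") %>%
    invoke("uncacheTable", src_name)
  invisible(src_name)
}

uncache_tbl(df)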

Is there a way to transfer R objects to separate R sessions on linux?

I've got a program that repeatedly loads largish datasets that are stored in R's Rds format. Here's a silly example that has all of the salient features:
# make and save the data
big_data <- matrix(rnorm(1e4^2), 1e4)
saveRDS(big_data, file = "big_data.Rds")
# write a program that uses the data
big_data <- readRDS("big_data.Rds")
BIGGER_data <- big_data+rnorm(1)
print("hooray!")
# save this in a text file called `my_program.R`
# run this program a bunch
for (i in 1:1000) {
  system("Rscript my_program.R")
}
The bottleneck is loading the data. But what if I had a separate process somewhere that held the data in memory?
Maybe something like this:
# write a program to hold the data in memory
big_data <- readRDS("big_data.Rds")
# save this as `holder.R`, open a terminal, and run
Rscript holder.R
Now there is a process running somewhere with my data in memory. How can I get it from a different R session? (I'm assuming that this would be faster than loading it -- but is this correct?)
Maybe something like this:
# write another program:
big_data <- get_big_data_from_holder()
BIGGER_data <- big_data+1
print("yahoo!")
# save this as `my_improved_program.R`
# now do the following:
for (i in 1:1000) {
  system("Rscript my_improved_program.R")
}
So I guess my question is what would the function get_big_data_from_holder() look like? Is it possible to do this? Practical?
Backstory: I'm trying to work around what appears to be a memory leak in R's interface to keras/tensorflow, that I've described here. The workaround is to let the OS clean up all of the cruft left over from a TF session, so that I can run TF sessions one after another without my computer slowing to a crawl.
Edit: maybe I could do this with a clone() system call? Conceptually I can imagine that I'd clone the process running holder and then run all of the commands in the program that depend on the data that's loaded. But I don't know how this would be done.
You may also improve the performance of saving and loading the data by turning off compression:
saveRDS(..., compress = FALSE)
You may find my filematrix package useful for storing and quickly accessing the big matrix.
To create it, run:
big_data = matrix(rnorm(1e4^2), 1e4)
library(filematrix)
fm = fm.create.from.matrix('matrix_file', big_data)
close(fm)
To access it from another R session:
library(filematrix)
fm = fm.open('matrix_file')
show(fm[1:3,1:3])
close(fm)
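Tying this back to the loop in the question: each run of my_improved_program.R would then open the file-backed matrix instead of calling readRDS(). A rough sketch, assuming matrix_file was created with fm.create.from.matrix() as above; note that the data is still copied into each child process, the gain is mainly skipping R serialization and decompression:
# my_improved_program.R (sketch)
library(filematrix)
fm <- fm.open("matrix_file")
big_data <- as.matrix(fm)   # read the whole matrix back in; slices like fm[1:3, 1:3] also work
close(fm)
BIGGER_data <- big_data + rnorm(1)
print("yahoo!")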

spark: java.io.IOException: No space left on device [again!]

I am getting java.io.IOException: No space left on device after running a simple query in sparklyr. I am using the latest versions of both Spark (2.1.1) and sparklyr:
df_new <- spark_read_parquet(sc, "/mypath/parquet_*", name = "df_new", memory = FALSE)
myquery <- df_new %>% group_by(text) %>% summarize(mycount = n()) %>%
  arrange(desc(mycount)) %>% head(10)
# this FAILS
get_result <- collect(myquery)
I have set both
spark.local.dir <- "/mypath/"
spark.worker.dir <- "/mypath/"
using the usual
config <- spark_config()
config$`spark.executor.memory` <- "100GB"
config$`spark.executor.cores` <- "3"
config$`spark.local.dir` <- "/mypath/"
config$`spark.worker.dir` <- "mypath/"
config$`spark.cores.max`<- "2000"
config$`spark.default.parallelism`<- "4"
config$`spark.total-executor-cores`<- "80"
config$`sparklyr.shell.driver-memory` <- "100G"
config$`sparklyr.shell.executor-memory` <- "100G"
config$`spark.yarn.executor.memoryOverhead` <- "100G"
config$`sparklyr.shell.num-executors` <- "90"
config$`spark.memory.fraction` <- "0.2"
Sys.setenv(SPARK_HOME="mysparkpath")
sc <- spark_connect(master = "spark://mynode", config = config)
where /mypath has more than 5 TB of disk space (I can see these options in the Environment tab). I tried a similar command in PySpark and it failed the same way (same error).
Looking at the Stages tab in the Spark UI, I see that the error occurs when the shuffle write is about 60 GB (input is about 200 GB). This is puzzling given that I have plenty of space available. I have looked at the other SO solutions already...
The cluster job is started with magpie https://github.com/LLNL/magpie/blob/master/submission-scripts/script-sbatch-srun/magpie.sbatch-srun-spark
Every time I start a Spark job, I see a directory called spark-abcd-random_numbers in my /mypath folder, but the size of the files in there is very small (nowhere near the 60 GB shuffle write).
There are about 40 parquet files, each about 700 KB (the original CSV files were 100 GB). They contain mostly strings.
The cluster has 10 nodes, each with 120 GB RAM and 20 cores.
What is the problem here?
Thanks!!
I've had this problem multiple times before. The reason is temporary files: most servers have a very small partition for /tmp/, which is the default temporary directory for Spark.
I usually change that by setting it in the spark-submit command, like this:
$spark-submit --master local[*] --conf "spark.driver.extraJavaOptions=-Djava.io.tmpdir=/mypath/" ....
In your case, I think you can provide it in the R configuration as follows (I have not tested this, but it should work):
config$`spark.driver.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath/"
config$`spark.executor.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath/"
Note that you have to change this for both the driver and the executors, since you're using a Spark standalone master (as I can see in your question).
I hope that helps.
Change the following settings in your magpie script:
export MAGPIE_LOCAL_DIR="/tmp/${USER}/magpie"
export SPARK_LOCAL_DIR="/tmp/${USER}/spark"
so that they point under /mypath instead of /tmp.
Once you set the parameter, you can see the new value of spark.local.dir in the Spark environment UI, but it doesn't take effect right away.
I faced a similar problem. After setting this parameter, I restarted the machines, and then it started working.
Since you need to set this when the JVM is launched via spark-submit, you need to use the sparklyr java-options, e.g.
config$`sparklyr.shell.driver-java-options` <- "-Djava.io.tmpdir=/mypath"
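Put together, an untested sketch of the sparklyr-side configuration for this approach (the driver JVM needs the flag at launch via sparklyr.shell.driver-java-options, while the executor setting can go through the ordinary spark.executor.extraJavaOptions property):
config <- spark_config()
config$`sparklyr.shell.driver-java-options` <- "-Djava.io.tmpdir=/mypath"
config$`spark.executor.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath"
sc <- spark_connect(master = "spark://mynode", config = config)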
I had this very problem this week on a standalone-mode cluster. After trying different things, including some of the recommendations in this thread, it turned out that a subfolder called "work" inside the Spark home folder had grown unchecked for a while, filling up the worker's HDD.
