We are trying to get the logical plan (not to be confused with the physical plan) that Spark generates for a given query. According to the Spark docs, you should be able to retrieve it with the Scala command:
df.explain(true)
or in sparklyr with the example code:
spark_version <- "2.4.3"
sc <- spark_connect(master = "local", version = spark_version)
iris_sdf <- copy_to(sc, iris)
iris_sdf %>%
  spark_dataframe() %>%
  invoke("explain", TRUE)
This command runs, but simply returns NULL in RStudio. My guess is that sparklyr does not capture content that is printed to the console. Is there a way around this, or another way to retrieve the logical plan using sparklyr? The physical plan is easy to get with dplyr::explain([your_sdf]), but that does not return the logical plan that was used to create it.
Looks like you can get this via:
iris_sdf %>%
  spark_dataframe() %>%
  invoke("queryExecution") %>%
  invoke("toString") %>%
  cat()
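The toString() call prints the full query execution: the parsed, analyzed and optimized logical plans plus the physical plan. If you only want one of them, you can invoke the corresponding QueryExecution accessor directly; a minimal sketch, assuming the optimizedPlan accessor is available in your Spark version:
iris_sdf %>%
  spark_dataframe() %>%
  invoke("queryExecution") %>%
  invoke("optimizedPlan") %>%   # or "analyzed" / "logical"
  invoke("toString") %>%
  cat()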
I am working with R on (Azure) Databricks and wanted to enable Apache Arrow for I/O. However, with the sample code below I get a strange error that I cannot trace back.
The error is occurring on clusters using Databricks runtime ML7.0 (Spark 3.0.0) and ML7.1 (Spark 3.0.0).
library(arrow)
library(dplyr)
library(SparkR)
arrow::arrow_available()
#TRUE
# initialize Spark session using Arrow
SparkR::sparkR.session(sparkConfig = list(spark.sql.execution.arrow.sparkr.enabled = "true"))
# create Spark DataFrame
df <- mtcars
spark_df <- cache(createDataFrame(df))
# write spark_df as parquet
sink_path <- "/dbfs/FileStore/testData"
file_path <- "dbfs:/FileStore/testData/arrow_testFile"
dir.create(sink_path, recursive = TRUE, showWarnings = FALSE)
SparkR::write.parquet(spark_df, file_path, mode = "overwrite")
# read parquet file as Spark DataFrame and cache
file_path %>%
  SparkR::read.parquet() %>%
  SparkR::cache() -> sdf_new

# collect sdf_new
sdf_new %>%
  SparkR::collect() -> rdf_new
The error message I am getting is the following:
Error : 'as_tibble' is not an exported object from 'namespace:arrow'
I know that there were some changes regarding as_tibble, but it is unclear to me how I can avoid this error and get Arrow working.
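If it helps narrow things down, here is a quick sanity check; this is just a sketch, based on my assumption that the problem is a mismatch between the installed arrow package and SparkR code that still calls arrow::as_tibble (which newer arrow releases no longer export):
# which arrow version is installed, and does it still export as_tibble()?
packageVersion("arrow")
exists("as_tibble", where = asNamespace("arrow"), inherits = FALSE)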
My question boils down to: what is the sparklyr equivalent of the R str command?
I am opening a large table (from a file), call it my_table, in Spark from R using the sparklyr package.
How can I describe the table? Column names and types, a few example rows, etc.
Apologies in advance for what must be a very basic question, but I did search for it, checked RStudio's sparklyr cheat sheet, and did not find the answer.
Let's use the mtcars dataset and move it to a local Spark instance for example purposes:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
tbl_cars <- dplyr::copy_to(sc, mtcars, "mtcars")
Now you have many options; here are two of them, each slightly different. Choose based on your needs:
1. Collect the first row into R (it is then a standard R data frame) and look at str:
str(tbl_cars %>% head(1) %>% collect())
2. Invoke the schema method and look at the result:
spark_dataframe(tbl_cars) %>% invoke("schema")
This will give something like:
StructType(StructField(mpg,DoubleType,true), StructField(cyl,DoubleType,true), StructField(disp,DoubleType,true), StructField(hp,DoubleType,true), StructField(drat,DoubleType,true), StructField(wt,DoubleType,true), StructField(qsec,DoubleType,true), StructField(vs,DoubleType,true), StructField(am,DoubleType,true), StructField(gear,DoubleType,true), StructField(carb,DoubleType,true))
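If you prefer to stay with plain R objects, sparklyr also ships sdf_schema() (assuming a reasonably recent sparklyr version), which returns the column names and Spark types as a list without collecting any data:
# column names and Spark types as an R list
sdf_schema(tbl_cars)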
Say I ran the following code, forgot to assign the Spark DataFrame iris to a variable in R, and can't use .Last.value to assign it because I ran some other code right after copying the data to Spark.
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
copy_to(sc, iris)
2+2 # ran some other code so can't use .Last.value
How do I assign the Spark DataFrame "iris" to a variable in R called iris_tbl?
copy_to provides an additional name argument. By default it is set to:
deparse(substitute(df))
so in your case the name will be iris. If you want more predictable behavior you should set the name manually:
copy_to(sc, iris, name = "foo")
Then you can access it dplyr way with tbl:
dplyr::tbl(sc, "foo")
or via Spark session:
sc %>% spark_session() %>% invoke("table", "foo") %>% sdf_register()
All production-ready reader methods (copy_to shouldn't be used as anything other than a testing and development tool) require a name, so you can reference tables the same way:
spark_read_csv(sc, "bar", path)
tbl(sc, "bar")
I have the code below, which takes a dataset and applies a SQL transformation to it using a wrapper function that calls the Spark SQL API via sparklyr. I then use invoke("createOrReplaceTempView", "name") to save the table in the Spark environment as a Spark DataFrame so that I can refer to it in a future function call. Next I use dplyr's mutate to call the Hive function regexp_replace, which transforms letters to numeric (0). I then need to call a SQL function again.
However, to do so I seem to have to use the copy_to function from sparklyr. On a large dataset, copy_to causes the following error:
Error: org.apache.spark.SparkException: Job aborted due to stage
failure: Total size of serialized results of 6 tasks (1024.6 MB) is
bigger than spark.driver.maxResultSize (1024.0 MB)
Is there an alternative to copy_to that gives me a Spark DataFrame I can then query with the SQL API?
Here is my code:
sqlfunction <- function(sc, block) {
  spark_session(sc) %>% invoke("sql", block)
}

sqlfunction(sc, "SELECT * FROM test") %>%
  invoke("createOrReplaceTempView", "name4")

names <- tbl(sc, "name4") %>%
  mutate(col3 = regexp_replace(col2, "^[A-Z]", "0"))

copy_to(sc, names, overwrite = TRUE)

sqlfunction(sc, "SELECT * FROM test") %>%
  invoke("createOrReplaceTempView", "name5")

final <- tbl(sc, "name5")
It'd help if you had a reprex, but try:
# register the dplyr result as a Spark temp view instead of copy_to()
final <- names %>%
  sdf_register("name5")

# "name5" now lives in Spark, so it can be queried with the SQL API again
sqlfunction(sc, "SELECT * FROM name5")
In R I have a Spark connection and a DataFrame called ddf.
library(sparklyr)
library(tidyverse)
sc <- spark_connect(master = "foo", version = "2.0.2")
ddf <- spark_read_parquet(sc, name='test', path="hdfs://localhost:9001/foo_parquet")
Since it's not a whole lot of rows I'd like to pull this into memory to apply some machine learning magic. However, it seems that certain rows cannot be collected.
df <- ddf %>% head %>% collect # works fine
df <- ddf %>% collect # doesn't work
The second line of code throws an Error in rawToChar(raw) : embedded nul in string: error. The column/row it fails on contains some string data. Since head %>% collect works, it seems that some rows fail while others collect as expected.
How can I work around this error? Is there a way to clean up the data that causes it? What does the error actually mean?
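The error indicates that a string value sent back to R contains an embedded NUL byte (\0), which R character vectors cannot store. A debugging sketch (assuming sdf_schema() and tidyselect's all_of() are available) is to collect one column at a time to find where the offending bytes live; they could then be stripped on the Spark side, for example with regexp_replace, before collecting:
# collect one column at a time to find which column fails
for (col in names(sdf_schema(ddf))) {
  res <- try(ddf %>% select(all_of(col)) %>% collect(), silent = TRUE)
  if (inherits(res, "try-error")) message("collect() fails on column: ", col)
}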