I am new to sparklyr and haven't had any formal training, which will become obvious from this question. I'm also more on the statistician side of the spectrum, which isn't helping. I'm getting an error after subsetting a Spark DataFrame.
Consider the following example:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local[*]")
iris_tbl <- copy_to(sc, iris, name="iris", overwrite=TRUE)
#check column names
colnames(iris_tbl)
#subset so only a few variables remain
subdf <- iris_tbl %>%
  select(Sepal_Length, Species)
subdf <- spark_dataframe(subdf)
#error happens when I try this operation
spark_session(sc) %>%
  invoke("table", "subdf")
The error I'm getting is:
Error: org.apache.spark.sql.catalyst.analysis.NoSuchTableException
at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:122)
at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:122)
There are several other lines of the error.
I don't understand why I'm getting this error. "subdf" is a Spark DataFrame.
To understand why this doesn't work, you have to understand what happens when you copy_to. Internally, sparklyr registers a temporary table using the Spark metastore and treats it more or less like just another database. This is why:
spark_session(sc) %>% invoke("table", "iris")
can find the "iris" table:
<jobj[32]>
class org.apache.spark.sql.Dataset
[Sepal_Length: double, Sepal_Width: double ... 3 more fields]
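As a quick check, you can also list everything currently registered in the catalog. This is only a minimal sketch, assuming the same local connection sc and that dplyr (and, optionally, DBI) is loaded:

src_tbls(sc)          # dplyr-style listing of registered tables, e.g. "iris"
DBI::dbListTables(sc) # the same information through sparklyr's DBI interface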
subdf, on the other hand, is just a plain local object. It is not registered in the metastore, hence it cannot be accessed through the Spark catalog. To make it work you can register the Spark DataFrame:
subdf <- iris_tbl %>%
  select(Sepal_Length, Species)

spark_dataframe(subdf) %>%
  invoke("createOrReplaceTempView", "subdf")
or use copy_to if the data is small enough to be handled by the driver:
subdf <- iris_tbl %>%
  select(Sepal_Length, Species) %>%
  copy_to(sc, ., name="subdf", overwrite=TRUE)
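With the table registered either way, the original lookup from the question should now succeed:

spark_session(sc) %>%
  invoke("table", "subdf")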
If you work with Spark 1.x, createOrReplaceTempView should be replaced with registerTempTable.
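For reference, on Spark 1.x the registration step would look like this (same idea, older method name):

spark_dataframe(subdf) %>%
  invoke("registerTempTable", "subdf")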
Related
We are trying to get the logical plan (not to be confused with the physical plan) that Spark generates for a given query. According to the Spark docs here, you should be able to retrieve this using the Scala command:
df.explain(true)
or in sparklyr with the example code:
spark_version <- "2.4.3"
sc <- spark_connect(master = "local", version = spark_version)
iris_sdf <- copy_to(sc, iris)
iris_sdf %>%
  spark_dataframe() %>%
  invoke("explain", TRUE)
This command runs, but simply returns NULL in RStudio. My guess is that sparklyr does not capture content that is printed to the console. Is there a way around this, or another way to retrieve the logical plan using sparklyr? The physical plan is easy to get using dplyr::explain([your_sdf]), but it does not return the logical plan that was used to create it.
Looks like you can get this via:
iris_sdf %>%
  spark_dataframe() %>%
  invoke("queryExecution") %>%
  invoke("toString") %>%
  cat()
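If you only need a specific stage of the plan, the QueryExecution object also exposes the individual plans. This is a sketch assuming Spark's standard QueryExecution API, where the accessors are named logical, analyzed, optimizedPlan and sparkPlan:

iris_sdf %>%
  spark_dataframe() %>%
  invoke("queryExecution") %>%
  invoke("optimizedPlan") %>%  # or "logical", "analyzed", "sparkPlan"
  invoke("toString") %>%
  cat()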
My question boils down to: what is the sparklyr equivalent of R's str command?
I am opening a large table (from a file), call it my_table, in Spark from R using the sparklyr package.
How can I describe the table? Column names and types, a few example rows, etc.
Apologies in advance for what must be a very basic question, but I did search for it, checked RStudio's sparklyr cheatsheet, and did not find the answer.
Let's use the mtcars dataset and move it to a local spark instance for example purposes:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
tbl_cars <- dplyr::copy_to(sc, mtcars, "mtcars")
Now you have many options; here are two of them, each slightly different - choose based on your needs:
1. Collect the first row into R (now it is a standard R data frame) and look at str:
str(tbl_cars %>% head(1) %>% collect())
2. Invoke the schema method and look at the result:
spark_dataframe(tbl_cars) %>% invoke("schema")
This will give something like:
StructType(StructField(mpg,DoubleType,true), StructField(cyl,DoubleType,true), StructField(disp,DoubleType,true), StructField(hp,DoubleType,true), StructField(drat,DoubleType,true), StructField(wt,DoubleType,true), StructField(qsec,DoubleType,true), StructField(vs,DoubleType,true), StructField(am,DoubleType,true), StructField(gear,DoubleType,true), StructField(carb,DoubleType,true))
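A third option, not part of the original answer, is sparklyr's sdf_schema(), which returns the column names and types as a plain R list; a minimal sketch, assuming a reasonably recent sparklyr version:

sdf_schema(tbl_cars)
# returns a named list, roughly list(mpg = list(name = "mpg", type = "DoubleType"), ...)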
Say I ran the following code, forgot to assign the Spark DataFrame iris to a variable in R, and can't use .Last.value to assign it because I ran some other code right after copying the data to Spark.
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
copy_to(sc, iris)
2+2 # ran some other code so can't use .Last.value
How do I assign the Spark DataFrame "iris" to a variable in R called iris_tbl?
copy_to provides an additional name argument. By default it is set to:
deparse(substitute(df))
so in your case the name will be iris. If you want more predictable behavior, you should set the name manually:
copy_to(sc, iris, name = "foo")
Then you can access it the dplyr way with tbl:
dplyr::tbl(sc, "foo")
or via Spark session:
sc %>% spark_session() %>% invoke("table", "foo") %>% sdf_register()
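Either way, since your original copy_to(sc, iris) call used the default name "iris", the assignment you are after is simply:

iris_tbl <- dplyr::tbl(sc, "iris")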
All production-ready reader methods (copy_to shouldn't be used as anything other than a testing and development tool) require a name, so you can reference tables the same way:
spark_read_csv(sc, "bar", path)
tbl(sc, "bar")
I am using the sparklyr library.
I have a variable, wtd, which I copied to Spark:
copy_to(sc,wtd)
colnames(wtd) <- c("a","b","c","d","e","f","g")
Then I want to do a computation and store the result in Spark, not in my R environment.
When I try:
sdf_register(wtd %>% group_by(c, b) %>% filter(row_number() == 1) %>% count(d), "wtd2")
Error in UseMethod("sdf_register") :
no applicable method for 'sdf_register' applied to an object of class "c('tbl_df', 'tbl', 'data.frame')"
The command wtd2 = wtd %>% group_by(c, b) %>% filter(row_number() == 1) %>% count(d) works correctly, but that will store the result in my environment, not in Spark.
The first argument in your sequence of operations should be a "tbl_spark", not a regular data.frame. Your command,
wtd2 = wtd %>% group_by(c, b) %>% filter(row_number() == 1) %>% count(d)
works because you are not using Spark at all, just normal R data.frames.
If you want to use it with Spark, first store the tbl_spark variable that is returned when you copy your data.frame:
colnames(wtd) <- c("a","b","c","d","e","f","g")
wtd_tbl <- copy_to(sc, wtd)
Then, you can execute your data pipeline using sdf_register(wtd_tbl %>% ..., "wtd2").
If you execute the pipeline as defined, you will get an exception saying:
Error: org.apache.spark.sql.AnalysisException: Window function rownumber() requires window to be ordered
This is because in order to use row_number() in Spark, you first need to provide an ordering for the window. You can do this with arrange(). I assume that you want your rows ordered by the columns "c" and "b", so your final pipeline would be something like this:
sdf_register(wtd_tbl %>%
               dplyr::group_by(c, b) %>%
               arrange(c, b) %>%
               dplyr::filter(row_number() == 1) %>%
               dplyr::count(d),
             "wtd2")
I hope this helps.
In R I have a Spark connection and a DataFrame, ddf.
library(sparklyr)
library(tidyverse)
sc <- spark_connect(master = "foo", version = "2.0.2")
ddf <- spark_read_parquet(sc, name='test', path="hdfs://localhost:9001/foo_parquet")
Since it's not a whole lot of rows, I'd like to pull this into memory to apply some machine learning magic. However, it seems that certain rows cannot be collected.
df <- ddf %>% head %>% collect # works fine
df <- ddf %>% collect # doesn't work
The second line of code throws an Error in rawToChar(raw) : embedded nul in string: error. The column/row it fails on contains some string data. The fact that head %>% collect works indicates that some rows fail to collect while others work as expected.
How can I work around this error, and is there a way to clean up the offending data? What does the error actually mean?