Sparklyr "embedded nul in string" when collecting - r

In R I have a Spark connection and a DataFrame called ddf.
library(sparklyr)
library(tidyverse)
sc <- spark_connect(master = "foo", version = "2.0.2")
ddf <- spark_read_parquet(sc, name='test', path="hdfs://localhost:9001/foo_parquet")
Since it's not a whole lot of rows, I'd like to pull this into memory to apply some machine learning magic. However, it seems that certain rows cannot be collected.
df <- ddf %>% head %>% collect # works fine
df <- ddf %>% collect # doesn't work
The second line of code throws Error in rawToChar(raw) : embedded nul in string. The column it fails on contains string data. The fact that head %>% collect works while the full collect does not indicates that some rows fail while others work as expected.
How can I work around this error, and is there a way to clean up the offending data? What does the error actually mean?
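The error comes from R's rawToChar(): R character vectors cannot contain an embedded nul byte, so collect() fails as soon as it deserializes a string that holds one. A sketch of a workaround (the column name description is hypothetical here, and the exact escaping may need adjusting for your Spark version) is to strip the nul bytes on the Spark side before collecting; regexp_replace() is passed through to Spark SQL by dplyr:
library(dplyr)
df <- ddf %>%
mutate(description = regexp_replace(description, "\\\\x00", "")) %>% # pattern survives both the SQL string parser and the Java regex engine
collect()
If escaping proves fiddly, a blunter fallback pattern is "[^ -~]", which strips every non-printable ASCII character (and, as a side effect, any non-ASCII characters as well). Apply the same cleanup to each string column that might be affected.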

Related

Warning "ORDER BY is ignored in subqueries without LIMIT" in R sparklyr

Issue
Sparklyr script output is filled with the same warning message, repeated over and over:
"ORDER BY is ignored in subqueries without LIMIT
Do you need to move arrange() later in the pipeline or use window_order() instead?"
I have the same issue everywhere (local, databricks, jupyter, etc.)
I don't know exactly what is causing the warning but it appears generally when I use dplyr::group_by() with dplyr::arrange().
Reproducible example
library("dplyr"); library("sparklyr")
spark_version <- "3.1.3"
sc <- spark_connect(master = "local", version = spark_version)
# --- copying mtcars data into spark for the example
sdf_mtcars = sdf_copy_to(sc = sc, x = mtcars, name = "sdf_mtcars")
# --- example generating the warning message (twice)
# note : I had to use mutate() twice to make the warning appear
# the warning does not appear in this simple example if only one of the two columns cum_mpg or max_mpg is created
example_mtcars = sdf_mtcars %>%
group_by(cyl) %>%
arrange(mpg, .by_group = TRUE) %>%
mutate(cum_mpg = cumsum(mpg)) %>%
mutate(max_mpg = cummax(mpg)) %>%
ungroup() %>%
sdf_collect()
Warning messages in R :
1: ORDER BY is ignored in subqueries without LIMIT
ℹ Do you need to move arrange() later in the pipeline or use window_order() instead?
2: ORDER BY is ignored in subqueries without LIMIT
ℹ Do you need to move arrange() later in the pipeline or use window_order() instead?
In this simple example the warning appears twice, but in my code it can appear up to 50 times.
The output is not really readable.
Thanks.
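Following the hint in the warning itself, a sketch of a fix (assuming sparklyr's window_order() behaves as documented for grouped window functions) is to drop arrange() from the Spark pipeline and order the window instead:
library("dplyr"); library("sparklyr")
example_mtcars <- sdf_mtcars %>%
group_by(cyl) %>%
window_order(mpg) %>% # orders the window used by cumsum()/cummax() instead of sorting the subquery
mutate(cum_mpg = cumsum(mpg)) %>%
mutate(max_mpg = cummax(mpg)) %>%
ungroup() %>%
sdf_collect()
If the collected result also needs to be sorted, arrange() can then be applied to the regular R data frame after sdf_collect().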

Return logical plan using sparklyr

We are trying to get the logical plan (not to be confused with the physical plan) that Spark generates for a given query. According to the Spark docs here, you should be able to retrieve this using the Scala command:
df.explain(true)
or in sparklyr with the example code:
spark_version <- "2.4.3"
sc <- spark_connect(master = "local", version = spark_version)
iris_sdf <- copy_to(sc, iris)
iris_sdf %>%
spark_dataframe %>%
invoke("explain", T)
This command runs, but simply returns NULL in RStudio. My guess is that sparklyr does not retrieve content that is printed to the console. Is there a way around this, or another way to retrieve the logical plan using sparklyr? The physical plan is easy to get using dplyr::explain([your_sdf]), but it does not return the logical plan that was used to create it.
Looks like you can get this via:
iris_sdf %>%
spark_dataframe %>%
invoke("queryExecution") %>%
invoke("toString") %>%
cat()
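If you only want a single stage of the plan rather than the whole queryExecution dump, the same pattern can be pointed at the individual plan accessors that Spark's QueryExecution object exposes (analyzed, optimizedPlan, executedPlan); a sketch:
iris_sdf %>%
spark_dataframe %>%
invoke("queryExecution") %>%
invoke("optimizedPlan") %>% # or "analyzed" / "logical" for earlier stages
invoke("toString") %>%
cat()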

equivalent of "str()" (describes dataframe) for a spark table using sparklyr

My question boils down to: what is the Sparklyr equivalent of the str R command?
I am opening a large table (from a file), call it my_table, in Spark from R using the Sparklyr package.
How can I describe the table? Column names and types, a few example rows, etc.
Apologies in advance for what must be a very basic question, but I did search for it, checked RStudio's Sparklyr cheatsheet, and did not find the answer.
Let's use the mtcars dataset and move it to a local spark instance for example purposes:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
tbl_cars <- dplyr::copy_to(sc, mtcars, "mtcars")
Now you have many options; here are two of them, each slightly different - choose based on your needs:
1. Collect the first row into R (now it is a standard R data frame) and look at str:
str(tbl_cars %>% head(1) %>% collect())
2. Invoke the schema method and look at the result:
spark_dataframe(tbl_cars) %>% invoke("schema")
This will give something like:
StructType(StructField(mpg,DoubleType,true), StructField(cyl,DoubleType,true), StructField(disp,DoubleType,true), StructField(hp,DoubleType,true), StructField(drat,DoubleType,true), StructField(wt,DoubleType,true), StructField(qsec,DoubleType,true), StructField(vs,DoubleType,true), StructField(am,DoubleType,true), StructField(gear,DoubleType,true), StructField(carb,DoubleType,true))
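If you prefer a plain R structure over the JVM object, sparklyr also provides sdf_schema(), which returns the column names and types as a list; a small sketch:
sdf_schema(tbl_cars)
# str() of the result gives a compact overview of names and types
str(sdf_schema(tbl_cars))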

undefined columns when trying to use separate function

I have a code that I built to scrape player data from yahoo's fantasy football player page so I can get a list of players and the rank that yahoo gives them.
The code worked fine last year but now I am getting an error when I run the separate function:
> temp <- separate(temp,two,c('Note', 'Player','a','b','c','Opp'), sep="\n", remove=TRUE)
Error in `[.data.frame`(x, x_vars) : undefined columns selected
In addition: Warning message:
Expected 6 pieces. Missing pieces filled with `NA` in 1 rows [1].
I cannot figure out why it is giving this error; the column I am trying to separate looks correct. I have another script that uses this function to do something similar, and when I tried it there it worked fine.
The "missing pieces filled with NA" warning shouldn't be a problem; the real issue is that it won't run because of the undefined columns error.
The minimal code that I use to get to where I am is this:
library(rvest) ## For read_html
library(tidyr) ## For the separate function
#scrapes the data
url <- 'https://football.fantasysports.yahoo.com/f1/107573/players?status=A&pos=O&cut_type=9&stat1=S_S_2017&myteam=0&sort=PR&sdir=1&count=0'
web <- read_html(url)
table = html_nodes(web, 'table')
temp <- html_table(table)[[2]]
#
colnames(temp) <- c('one','two',3:26)
temp <- separate(temp,two,c('Note', 'Player','a','b','c','Opp'), sep="\n", remove=TRUE)
The data is scraped in without column names, so I quickly assign names to them, spelling out the column in question so it works with the separate function. I have tried using quotation marks around two in separate, but it gives the same error.
After removing the first row of temp, your code works.
library(dplyr)
colnames(temp) <- c('one','two',3:ncol(temp))
# Use ncol(temp) to make sure the column number is correct
temp2 <- temp %>%
filter(row_number() > 1) %>%
separate(two, c('Note', 'Player','a','b','c','Opp'), sep="\n", remove=TRUE)
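As a general debugging step for this kind of error, it can also help to confirm the column really exists under the name you assigned before calling separate(), since "undefined columns selected" means a name or index lookup failed; a quick, hypothetical check on the scraped temp:
"two" %in% colnames(temp) # should be TRUE after renaming
head(temp$two, 2) # eyeball the raw strings you are about to split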

Sparklyr "NoSuchTableException" error after subsetting data

I am new to sparklyr and haven't had any formal training - which will become obvious after this question. I'm also more on the statistician side of the spectrum, which isn't helping. I'm getting an error after subsetting a Spark DataFrame.
Consider the following example:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local[*]")
iris_tbl <- copy_to(sc, iris, name="iris", overwrite=TRUE)
#check column names
colnames(iris_tbl)
#subset so only a few variables remain
subdf <- iris_tbl %>%
select(Sepal_Length,Species)
subdf <- spark_dataframe(subdf)
#error happens when I try this operation
spark_session(sc) %>%
invoke("table", "subdf")
The error I'm getting is:
Error: org.apache.spark.sql.catalyst.analysis.NoSuchTableException
at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:122)
at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:122)
There are several other lines of the error.
I don't understand why I'm getting this error. "subdf" is a Spark DataFrame.
To understand why this doesn't work, you have to understand what happens when you copy_to. Internally, sparklyr registers a temporary table using the Spark metastore and treats it more or less like just another database table. This is why:
spark_session(sc) %>% invoke("table", "iris")
can find the "iris" table:
<jobj[32]>
class org.apache.spark.sql.Dataset
[Sepal_Length: double, Sepal_Width: double ... 3 more fields]
subdf, on the other hand, is just a plain local object. It is not registered in the metastore, hence it cannot be accessed using the Spark catalog. To make it work you can register the Spark DataFrame:
subdf <- iris_tbl %>%
select(Sepal_Length, Species)
spark_dataframe(subdf) %>%
invoke("createOrReplaceTempView", "subdf")
or use copy_to if the data is small enough to be handled by the driver:
subdf <- iris_tbl %>%
select(Sepal_Length, Species) %>%
copy_to(sc, ., name="subdf", overwrite=TRUE)
If you work with Spark 1.x, createOrReplaceTempView should be replaced with registerTempTable.
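Once the temp view is registered (or the data copied with copy_to), the catalog lookup from the question resolves as expected:
spark_session(sc) %>%
invoke("table", "subdf")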
