equivalent of "str()" (describes dataframe) for a spark table using sparklyr - r

My question boils down to: what is the sparklyr equivalent of R's str() command?
I am opening a large table (from a file), call it my_table, in Spark from R using the sparklyr package.
How can I describe the table? Column names and types, a few example rows, etc.
Apologies in advance for what must be a very basic question, but I did search for it and checked RStudio's sparklyr cheat sheet without finding the answer.

Let's use the mtcars dataset and move it to a local spark instance for example purposes:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
tbl_cars <- dplyr::copy_to(sc, mtcars, "mtcars")
Now you have many options; here are two of them, each slightly different. Choose based on your needs:
1. Collect the first row into R (now it is a standard R data frame) and look at str():
str(tbl_cars %>% head(1) %>% collect())
2. Invoke the schema method and look at the result:
spark_dataframe(tbl_cars) %>% invoke("schema")
This will give something like:
StructType(StructField(mpg,DoubleType,true), StructField(cyl,DoubleType,true), StructField(disp,DoubleType,true), StructField(hp,DoubleType,true), StructField(drat,DoubleType,true), StructField(wt,DoubleType,true), StructField(qsec,DoubleType,true), StructField(vs,DoubleType,true), StructField(am,DoubleType,true), StructField(gear,DoubleType,true), StructField(carb,DoubleType,true))
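A third option, if your sparklyr version provides it, is sdf_schema(), which returns the column names and Spark types as a plain R list (a sketch; the exact output shape may vary across versions):
# list of columns, each with a name and a Spark type
sdf_schema(tbl_cars)
# since tbl_cars is an ordinary dplyr tbl, glimpse() should also give a
# compact column-by-column preview (it pulls a few rows from Spark)
dplyr::glimpse(tbl_cars)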

Related

How to hot encode/generate dummy columns using sparklyr

I know there are a number of questions similar to this here, but 1) most of the solutions rely on deprecated functions like ml_create_dummy_variables and 2) other solutions are incomplete.
Is there a function or an approach to easily hot encode a categorical variable into multiple dummy variables in sparklyr?
This post asks for a solution in SparkR; incidentally, a sparklyr solution is given there, but it only works when the categories in a given column are unique, which renders it pointless.
This solution results in a single dummy instead of a dummy for each category (it grabs only the first category). This is also the solution I stumbled onto (based on this post), and it does not cut it:
iris_sdf <- copy_to(sc, iris, overwrite = TRUE)
iris_sdf %>%
  ft_string_indexer(input_col = "Species", output_col = "species_num") %>%
  mutate(cat_num = species_num + 1) %>%
  ft_one_hot_encoder("species_num", "species_dum") %>%
  ft_vector_assembler(c("species_dum"))
I'm looking for a solution that will take Species from the iris dataset and generate three columns, one for each category in Species (virginica, setosa, and versicolor). In plain R, the fastDummies package has what I need, but I'm left wondering how to achieve similar functionality in sparklyr.
Again, I'll note that ml_create_dummy_variables (suggested by this post) produced the following error:
Error in ml_create_dummy_variables(., "species_num", "species_dum") :
  could not find function "ml_create_dummy_variables"
Note: I'm using sparklyr_1.3.1
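One approach that should produce one dummy per category (an untested sketch: it assumes your sparklyr version's ft_one_hot_encoder() supports drop_last = FALSE and that sdf_separate_column() is available; the output column names are hypothetical) is to keep all categories at the encoding step and then split the resulting vector column:
library(sparklyr)
library(dplyr)

iris_sdf <- copy_to(sc, iris, overwrite = TRUE)
iris_sdf %>%
  ft_string_indexer(input_col = "Species", output_col = "species_idx") %>%
  # drop_last = FALSE keeps a slot for every category instead of
  # dropping the final one
  ft_one_hot_encoder("species_idx", "species_vec", drop_last = FALSE) %>%
  # split the encoded vector into one numeric column per category; which
  # index maps to which species depends on the indexer's label ordering
  sdf_separate_column("species_vec", into = paste0("species_", 0:2))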

Return logical plan using sparklyr

We are trying to get the logical plan (not to be confused with the physical plan) that Spark generates for a given query. According to the Spark docs here, you should be able to retrieve it with the Scala command:
df.explain(true)
or in sparklyr with the example code:
spark_version <- "2.4.3"
sc <- spark_connect(master = "local", version = spark_version)
iris_sdf <- copy_to(sc, iris)
iris_sdf %>%
  spark_dataframe() %>%
  invoke("explain", TRUE)
This command runs but simply returns NULL in RStudio. My guess is that sparklyr does not capture content that is printed to the JVM console. Is there a way around this, or another way to retrieve the logical plan using sparklyr? The physical plan is easy to get using dplyr::explain([your_sdf]), but that does not return the logical plan that was used to create it.
Looks like you can get this via:
iris_sdf %>%
  spark_dataframe() %>%
  invoke("queryExecution") %>%
  invoke("toString") %>%
  cat()
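This prints the whole QueryExecution summary: the parsed and analyzed logical plans, the optimized logical plan, and the physical plan. If you only need one of them, the QueryExecution object exposes them individually (a sketch based on Spark's QueryExecution API; method availability may vary by Spark version):
iris_sdf %>%
  spark_dataframe() %>%
  invoke("queryExecution") %>%
  # "analyzed" gives the analyzed logical plan, "optimizedPlan" the
  # optimized one
  invoke("optimizedPlan") %>%
  invoke("toString") %>%
  cat()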

How to load .dta (preserving labels) most comfortable in R?

I work with .dta files and try to make loading data as comfortable as possible. In my view, I need a combination of haven and readstata13.
haven looks perfect. It provides the best "sub-labels". But it does not provide a column-selector function, and I cannot use read_dta for large files (~1 GB, on 64 GB RAM, Intel Xeon E5).
Question: Is there a way to select/load a subset of data?
read.dta13 is my best workaround: it has a select.cols argument. But I then have to pull the attributes separately, save them, and merge them back (for about 10 files).
Question: How can I manually add these second labels which the haven package creates? (What are they called?)
Here is the MWE:
library(foreign)
write.dta(mtcars, "mtcars.dta")
library(haven)
mtcars <- read_dta("mtcars.dta")
library(readstata13)
mtcars2 <- read.dta13("mtcars.dta", convert.factors = FALSE, select.cols = c("mpg", "cyl", "vs"))
var.labels <- attr(mtcars2, "var.labels")
data.key.mtcars2 <- data.frame(var.name = names(mtcars2), var.labels)
haven's development version supports selecting columns with the col_select argument:
library(haven) # devtools::install_github("tidyverse/haven")
mtcars <- read_dta("mtcars.dta", col_select = c(mpg, cyl, vs))
Alternatively, the column labels in RStudio's viewer are taken from each column's "label" attribute. You can use a simple loop to assign them from the labels read by readstata13:
for (i in seq_along(mtcars2)) {
  attr(mtcars2[[i]], "label") <- var.labels[i]
}
View(mtcars2)
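For what it's worth, these per-column "label" attributes are the same ones haven sets; if the labelled package is available, its var_label() setter does the same thing in one line (an alternative sketch, assuming labelled is installed):
# assign all labels at once as a named list
labelled::var_label(mtcars2) <- setNames(as.list(var.labels), names(mtcars2))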

How to refer to a Spark DataFrame by name in sparklyr and assign it to a variable?

Say I ran the following code and forgot to assign the Spark dataframe iris to a variable in R, and I can't use .Last.value because I ran some other code right after copying the data to Spark.
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
copy_to(sc, iris)
2+2 # ran some other code so can't use .Last.value
How do I assign the Spark dataframe "iris" to a variable in R called iris_tbl?
copy_to provides an additional name argument. By default it is set to:
deparse(substitute(df))
so in your case the name will be iris. If you want more predictable behavior, you should set the name manually:
copy_to(sc, iris, name = "foo")
Then you can access it the dplyr way with tbl:
dplyr::tbl(sc, "foo")
or via Spark session:
sc %>% spark_session() %>% invoke("table", "foo") %>% sdf_register()
All production-ready reader methods (copy_to shouldn't be used as anything other than a testing and development tool) require a name, so you can reference tables the same way:
spark_read_csv(sc, "bar", path)
tbl(sc, "bar")

Sparklyr "embedded nul in string" when collecting

In R I have a Spark connection and a DataFrame called ddf.
library(sparklyr)
library(tidyverse)
sc <- spark_connect(master = "foo", version = "2.0.2")
ddf <- spark_read_parquet(sc, name='test', path="hdfs://localhost:9001/foo_parquet")
Since it's not a whole lot of rows I'd like to pull this into memory to apply some machine learning magic. However, it seems that certain rows cannot be collected.
df <- ddf %>% head %>% collect # works fine
df <- ddf %>% collect # doesn't work
The second line of code throws an Error in rawToChar(raw) : embedded nul in string error. The column/row it fails on contains some string data. Since head %>% collect works, some rows seem to fail while others work as expected.
How can I work around this error? Is there a way to clean up the offending data, and what does the error actually mean?
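One possible workaround (an untested sketch: the column name string_col is hypothetical, and it assumes the offending values contain literal nul bytes that Spark's regexp_replace can strip before the data reaches R):
library(dplyr)

ddf_clean <- ddf %>%
  # regexp_replace is passed through to Spark SQL; the pattern "\\u0000"
  # reaches Java's regex engine as \u0000, i.e. the nul character (R
  # strings cannot contain a raw nul, hence the escaped form)
  mutate(string_col = regexp_replace(string_col, "\\u0000", ""))

df <- ddf_clean %>% collect()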
