Spark ML Pipelines: managing table reading (R)

I am using Spark ML Pipelines to deploy, in a Scala production environment, operations that I have developed in sparklyr. It works pretty well, except for one part: when I read a table from Hive and then create a pipeline that applies operations to this table, the pipeline also saves the table-reading step, and with it the name of the table. However, I want the pipeline to be independent of this.
Here is a reproducible example:
Sparklyr part:
sc <- spark2_context(memory = "4G")
iris <- copy_to(sc, iris, overwrite = TRUE)
spark_write_table(iris, "base.iris")
spark_write_table(iris, "base.iris2")

df1 <- tbl(sc, "base.iris")

df2 <- df1 %>%
  mutate(foo = 5)

pipeline <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(df2) %>%
  ml_fit(df1)

ml_save(pipeline,
        paste0(save_pipeline_path, "test_pipeline_reading_from_table"),
        overwrite = TRUE)

df2 <- pipeline %>% ml_transform(df1)

dbSendQuery(sc, "drop table base.iris")
Scala part:
import org.apache.spark.ml.PipelineModel
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
val df1 = spark.sql("select * from base.iris2")
val pipeline = PipelineModel.load(pipeline_path + "/test_pipeline_reading_from_table")
val df2 = pipeline.transform(df1)
I get this error:
org.apache.spark.sql.AnalysisException: Table or view not found: `base`.`iris`; line 2 pos 5;
'Project ['Sepal_Length, 'Sepal_Width, 'Petal_Length, 'Petal_Width, 'Species, 5.0 AS foo#110]
+- 'UnresolvedRelation `base`.`iris`
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:82)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:637)
at org.apache.spark.ml.feature.SQLTransformer.transformSchema(SQLTransformer.scala:86)
at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:310)
at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:310)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
at org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:310)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:304)
... 71 elided
I can see two solutions:
Persisting the dataframe seems to be a workable option, but then I would need a way to avoid overloading memory, hence my question on unpersisting.
Passing the name of the Hive table as a parameter of the pipeline, which I am trying to solve in a separate question.
Now, all of this being said, I might be missing something as I am only a beginner...
EDIT: this is significantly different from the linked question, as it concerns the specific problem of integrating a dataframe that was just read into a pipeline, as specified in the title.
EDIT: as for my project, persisting the tables after I read them is a viable solution. I don't know if there is any better solution.
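For the persist/unpersist concern mentioned above, here is a hedged sparklyr sketch (reusing the question's base.iris2 table and pipeline objects; tbl_cache()/tbl_uncache() are the relevant sparklyr helpers):
# cache the table after reading it, release it once the pipeline has been applied
tbl_cache(sc, "base.iris2")
df1 <- tbl(sc, "base.iris2")
df2 <- pipeline %>% ml_transform(df1)
tbl_uncache(sc, "base.iris2")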

Then the pipeline would call my table "base.table", making it impossible to apply it to another table.
That's actually not true. ft_dplyr_transformer is syntactic sugar for Spark's own SQLTransformer. Internally, the dplyr expression is converted to a SQL query, and the name of the table is replaced with __THIS__ (a Spark placeholder referring to the current table).
Let's say you have a transformation like this one:
copy_to(sc, iris, overwrite = TRUE)

df <- tbl(sc, "iris") %>%
  mutate(foo = 5)

pipeline <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(df) %>%
  ml_fit(tbl(sc, "iris"))

ml_stage(pipeline, "dplyr_transformer") %>%
  spark_jobj() %>%
  invoke("getStatement")
[1] "SELECT `Sepal_Length`, `Sepal_Width`, `Petal_Length`, `Petal_Width`, `Species`, 5.0 AS `foo`\nFROM `__THIS__`"
That is, however, a rather confusing way of expressing things, and it makes more sense to use the native SQL transformer directly:
pipeline <- ml_pipeline(sc) %>%
  ft_sql_transformer("SELECT *, 5 as `foo` FROM __THIS__") %>%
  ml_fit(df)
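As a hedged check (the save path below is assumed, not taken from the original code): because the saved statement only references the __THIS__ placeholder, the reloaded pipeline can be applied to any table with the same columns.
ml_save(pipeline, "pipelines/sql_transformer_pipeline", overwrite = TRUE)
reloaded <- ml_load(sc, "pipelines/sql_transformer_pipeline")
reloaded %>% ml_transform(tbl(sc, "iris"))  # works for any table with this schema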
Edit:
The problem you experience here looks like a bug. The get_base_name function returns an unquoted table name, so the value in your case will be
> get_base_name(x$ops)
<IDENT> default.iris
and the pattern will be
> pattern
[1] "\\bdefault.iris\\b"
However, dbplyr::sql_render returns a backquoted, fully qualified name:
> dbplyr::sql_render(x)
<SQL> SELECT `Sepal_Length`, `Sepal_Width`, `Petal_Length`, `Petal_Width`, `Species`, 5.0 AS `foo`
FROM `default`.`iris`
So the pattern doesn't match the name.
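To see concretely why the substitution fails, here is a minimal base-R sketch (plain gsub, not the actual sparklyr internals) comparing the unquoted pattern with the backquoted name:
sql <- "SELECT `Sepal_Length` FROM `default`.`iris`"
pattern <- "\\bdefault.iris\\b"
gsub(pattern, "__THIS__", sql)            # no match, the SQL is left unchanged
[1] "SELECT `Sepal_Length` FROM `default`.`iris`"
gsub("`default`.`iris`", "__THIS__", sql, fixed = TRUE)   # matching the rendered form works
[1] "SELECT `Sepal_Length` FROM __THIS__"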

Related

Can dplyr functions work when connected to SQL Server?

I have a table in a SQL Server database, and I want to manipulate it with the dbplyr/dplyr R packages.
library(odbc)
library(DBI)
library(tidyverse)

con <- DBI::dbConnect(odbc::odbc(),
                      Driver   = "SQL Server",
                      Server   = "xx.xxx.xxx.xxx",
                      Database = "stock",
                      UID      = "userid",
                      PWD      = "userpassword")

startday <- 20150101

day <- tbl(con, dbplyr::in_schema("dbo", "LogDay"))
I tried this simple dplyr operation after connecting to the remote database, but it failed with error messages.
day %>%
  mutate(ovnprofit = ifelse(stockCode == lead(stockCode, 1),
                            lead(priceOpen, 1) / priceClose, NA)) %>%
  select(logDate, stockCode, ovnprofit)
How can I solve this problem?
P.S. When I apply the dplyr functions after converting 'day' to a tibble first, it works. However, I want to apply them directly, without converting to a tibble, because the conversion is too time consuming and memory intensive.
The problem is most likely with the lead function. In R a data set has an ordering, but in SQL datasets are unordered and the order needs to be specified explicitly.
Note that the SQL code in the error message contains:
LEAD("stockCode", 1.0, NULL) OVER ()
The fact that there is nothing in the brackets after OVER suggests that SQL expects something there.
Two ways you can resolve this:
By using arrange before the mutate
By specifying the order_by argument of lead
# approach 1:
day %>%
  arrange(logDate) %>%
  mutate(ovnprofit = ifelse(stockCode == lead(stockCode, 1),
                            lead(priceOpen, 1) / priceClose,
                            NA)) %>%
  select(logDate, stockCode, ovnprofit)

# approach 2:
day %>%
  mutate(ovnprofit = ifelse(stockCode == lead(stockCode, 1, order_by = logDate),
                            lead(priceOpen, 1, order_by = logDate) / priceClose,
                            NA)) %>%
  select(logDate, stockCode, ovnprofit)
However, it also looks like you only want to lead within each stockCode. This can be done with group_by. I would recommend the following:
output <- day %>%
  group_by(stockCode) %>%
  arrange(logDate) %>%
  mutate(next_priceOpen = lead(priceOpen, 1)) %>%
  mutate(ovnprofit = next_priceOpen / priceClose) %>%
  select(logDate, stockCode, ovnprofit)
If you view the generated SQL with show_query(output), you should see an OVER clause similar to the following:
LEAD(priceOpen, 1.0, NULL) OVER (PARTITION BY stockCode ORDER BY logDate)
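If you want to check the translation without touching the live SQL Server, a hedged option is dbplyr's simulated SQL Server backend, simulate_mssql(); the column names below are taken from the question:
library(dplyr)
library(dbplyr)
day_sim <- tbl_lazy(
  data.frame(logDate = integer(), stockCode = character(),
             priceOpen = numeric(), priceClose = numeric()),
  con = simulate_mssql()
)
day_sim %>%
  group_by(stockCode) %>%
  arrange(logDate) %>%
  mutate(ovnprofit = lead(priceOpen, 1) / priceClose) %>%
  show_query()   # prints the LEAD(...) OVER (PARTITION BY ... ORDER BY ...) clause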

R: dbplyr using eval()

I have a question about how to use eval(parse(text=...)) with dbplyr SQL translation.
The following code does exactly what I want with dplyr, using eval(parse(text = eval_text)):
selected_col <- c("wt", "drat")
text <- paste(selected_col, ">3")
implode <- function(..., sep='|') {
paste(..., collapse=sep)
}
eval_text <- implode(text)
mtcars %>% dplyr::filter(eval(parse(text=eval_text)))
But when I run it against the database it returns an error message. I am looking for any solution that lets me dynamically set the column names and filter with the OR operator.
db <- tbl(con, "mtcars") %>%
dplyr::filter(eval(parse(eval_text)))
db <- collect(db)
Thanks!
You have the right approach, but dbplyr tends to work better with something that can receive the !! ('bang-bang') operator. At one point dplyr had *_ versions of functions (e.g. filter_) that accepted text inputs; this is now done using NSE (non-standard evaluation).
A couple of references: shiptech and r-bloggers (sorry, I couldn't find the official dplyr reference).
For your purposes you should find that the following works:
library(rlang)
df %>% dplyr::filter(!!parse_expr(eval_text))
Full working example:
library(dplyr)
library(dbplyr)
library(rlang)
data(mtcars)
df = tbl_lazy(mtcars, con = simulate_mssql()) # simulated database connection
implode <- function(..., sep='|') { paste(..., collapse=sep) }
selected_col <- c("wt", "drat")
text <- paste(selected_col, ">3")
eval_text <- implode(text)
df %>% dplyr::filter(eval(parse(eval_text))) # returns clearly wrong SQL
df %>% dplyr::filter(!!parse_expr(eval_text)) # returns valid & correct SQL
df %>% dplyr::filter(!!!parse_exprs(text)) # passes filters as a list --> AND (instead of OR)
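If you'd rather keep the conditions as a character vector and skip the string pasting, a hedged alternative (assuming rlang plus purrr, beyond what the original answer uses) is to combine the parsed expressions into a single OR expression before splicing it in:
library(purrr)
library(rlang)
or_expr <- reduce(parse_exprs(text), ~ expr(!!.x | !!.y))
df %>% dplyr::filter(!!or_expr)   # same OR semantics as parse_expr(eval_text) above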

Saving a dataset through a function, but it is not loaded afterwards

I have created a function that reads some text data from a set of files, does some manipulation (omitted here), and then saves each modified dataframe in the resulting list as .RData. I have checked that the function does its job. However, when I load the output back into RStudio, the load command runs without errors, but no new object appears in my environment.
Any possible fixes?
f <- function(directory_input, directory_output, par1, par2){

  library(tidyverse)
  library(readxl)

  if (!dir.exists(directory_output)) {
    dir.create(directory_output)
  }

  key <- data.frame(par = as.character(paste0(0, par1, par2)))

  paths <- key %>% mutate(
    filepath_in  = file.path(directory_input,  paste0(par, '.txt')),
    filepath_out = file.path(directory_output, paste0(par, '.RData'))
  )

  filepath_in  <- paths$filepath_in
  filepath_out <- paths$filepath_out

  DF <- filepath_in %>% map(~ .x %>% read.delim2(., encoding = 'Latin-1', nrows = 1000))

  map2(DF, filepath_out, ~ .x %>% save(file = .y))
}
EDIT
After the comments, here is a bit more context:
I was instructed to write a function that will be part of a future package.
The function does not create a new dataframe in the calling environment; it only saves the data to disk. I designed it this way to make things easier for the user in case they are manipulating multiple datasets.
On the other hand, it is natural to assume that the user will later want to use these datasets in the most intuitive way, with a plain load. That is why a solution that requires assign to load the results later would not be ideal.
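One hedged debugging aid (generic R, not specific to this function): save() and load() work by object name, and load() returns the names it restored, so checking that return value shows what actually came back and under which name.
x <- data.frame(a = 1:3)
tmp <- tempfile(fileext = ".RData")
save(x, file = tmp)
rm(x)
restored <- load(tmp)   # load() invisibly returns the restored names
restored
[1] "x"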

Concat_ws() function in Sparklyr is missing

I am following a tutorial on web (Adobe) analytics, where I want to build a Markov Chain Model. (http://datafeedtoolbox.com/attribution-theory-the-two-best-models-for-algorithmic-marketing-attribution-implemented-in-apache-spark-and-r/).
In the example they are using the function concat_ws (from library(sparklyr)). But it looks like the function does not exist: after installing the package and loading the library, I receive an error saying the function does not exist...
Comment from the blog author: concat_ws is a Spark SQL function:
https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/sql/functions.html
So, you’ll have to rely on sparklyr to have that function work.
My question: are there workarounds to get access to the concat_ws() function? I tried:
Searching on GitHub (https://github.com/rstudio/sparklyr) to see whether I could find the function (or its source code), unfortunately with no result.
What is the goal of the function?
Concatenates multiple input string columns together into a single string column, using the given separator.
You can simply use paste from base R.
library(sparklyr)
library(dplyr)

config <- spark_config()
sc <- spark_connect(master = "local", config = config)

df <- as.data.frame(cbind(c("1", "2", "3"), c("a", "b", "c")))
sdf <- sdf_copy_to(sc, df, overwrite = TRUE)

sdf %>%
  mutate(concat = paste(V1, V2, sep = "-"))
You cannot find the function because it doesn't exist in the sparklyr package. concat_ws is a Spark SQL function (org.apache.spark.sql.functions.concat_ws).
sparklyr depends on a SQL translation layer - function calls are translated into SQL expressions with dbplyr:
> dbplyr::translate_sql(concat_ws("-", foo, bar))
<SQL> CONCAT_WS('-', "foo", "bar")
This means that the function can be applied only in the sparklyr context:
sc <- spark_connect(master = "local[*]")
df <- copy_to(sc, tibble(x="foo", y="bar"))
df %>% mutate(xy = concat_ws("-", x, y))
# # Source: spark<?> [?? x 3]
# x y xy
# * <chr> <chr> <chr>
# 1 foo bar foo-bar
I had a similar problem with dbplyr (BigQuery database).
Problem
I kept getting the error:
my_dbplyr_object %>%
  mutate(datetime_char = paste(date_char, time_char))
# failed x Function not found: CONCAT_WS at [1:147] [invalidQuery]
Solution
I wrote custom SQL and placed it inside sql().
Example
Once you know the SQL that will generate what you're after (in my case it was CONCAT(date_char, ' ', time_char)), then simply place it inside the sql() function, like so:
my_dbplyr_object %>%
  mutate(datetime_char = sql("CONCAT(date_char, ' ', time_char)"))

How can I append data to a PostgreSQL table with `dplyr` without `collect()`?

The table reg_data is a PostgreSQL table. It turns out to be faster to run the regressions in PostgreSQL. But, as I am running it for 100,000s of data sets, I want to do it data set by data set and append the results of each to a table.
Is there a way to append PostgreSQL data to a PostgreSQL table using native dplyr verbs? I'm not sure that there's a huge cost to bringing the data to R then sending them back to PostgreSQL (it's just 6 numbers and a couple of identifying fields), but it does seem inelegant.
library(dplyr)

pg <- src_postgres()

reg_data <- tbl(pg, "reg_data")

reg_results <-
  reg_data %>%
  summarize(r_squared = regr_r2(y, x),
            num_obs = regr_count(y, x),
            constant = regr_intercept(y, x),
            slope = regr_slope(y, x),
            mean_analyst_fog = regr_avgx(y, x),
            mean_manager_fog = regr_avgy(y, x)) %>%
  collect() %>%
  as.data.frame()

# Push to database.
dbWriteTable(pg$con, c("bgt", "within_call_data"), reg_results,
             append = TRUE, row.names = FALSE)
dplyr does not include commands to insert or update records in a database, so there is not a complete native dplyr solution for this. But you could combine dplyr with regular SQL statements to avoid bringing the data to R.
Let's start by reproducing your steps before the collect() statement
library(dplyr)

pg <- src_postgres()

reg_data <- tbl(pg, "reg_data")

reg_results <-
  reg_data %>%
  summarize(r_squared = regr_r2(y, x),
            num_obs = regr_count(y, x),
            constant = regr_intercept(y, x),
            slope = regr_slope(y, x),
            mean_analyst_fog = regr_avgx(y, x),
            mean_manager_fog = regr_avgy(y, x))
Now, you could use compute() instead of collect() to create a temporary table in the database.
temp.table.name <- paste0(sample(letters, 10, replace = TRUE), collapse = "")
reg_results <- reg_results %>% compute(name=temp.table.name)
Here temp.table.name is a random table name. Using the option name = temp.table.name in compute(), we assign this random name to the temporary table created.
Now we will use the RPostgreSQL library to create an insert query that uses the results stored in the temporary table. As the temporary table only lives in the connection created by src_postgres(), we need to reuse that connection.
library(RPostgreSQL)
copyconn <- pg$con
class(copyconn) <- "PostgreSQLConnection" # I get an error if I don't fix the class
Finally the insert query
sql <- paste0("INSERT INTO destination_table SELECT * FROM ", temp.tbl.name,";")
dbSendQuery(copyconn, sql)
So, everything is happening in the database and the data is not brought into R.
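As a hedged follow-up (not part of the original answer), the temporary table can be dropped over the same connection once the rows have been copied:
dbSendQuery(copyconn, paste0("DROP TABLE ", temp.table.name, ";"))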
EDIT
Previous versions of this post broke encapsulation by extracting the temporary table name from reg_results. This is avoided by using the name = option in compute().
Another option would be to use sql_render() to create each SQL statement, then db_save_query() to create the table from the first statement, and manual INSERT statements to append to it. To loop through the queries, the purrr commands map() and walk() are used. Ideally a command like compute() would handle this, but in lieu of that, the following is a fully reproducible example:
library(dplyr)
library(dbplyr)
library(purrr)

# Setting up a SQLite db with 3 tables
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
copy_to(con, filter(mtcars, cyl == 4), "mtcars1")
copy_to(con, filter(mtcars, cyl == 6), "mtcars2")
copy_to(con, filter(mtcars, cyl == 8), "mtcars3")

# Pre-process the SQL statements
tables <- c("mtcars1", "mtcars2", "mtcars3")
all_results <- tables %>%
  map(~{
    tbl(con, .x) %>%
      summarise(avg_mpg = mean(mpg),
                records = n()) %>%
      sql_render()
  })

# Execute the SQL statements: the first one creates the table,
# subsequent queries are inserted into the table
first_query <- TRUE
all_results %>%
  walk(~{
    if (first_query == TRUE) {
      first_query <<- FALSE
      db_save_query(con, ., "results")
    } else {
      DBI::dbExecute(con, build_sql("INSERT INTO results ", ., con = con))
    }
  })

tbl(con, "results")

DBI::dbDisconnect(con)
