I am following a tutorial on web (Adobe) analytics, where I want to build a Markov chain model (http://datafeedtoolbox.com/attribution-theory-the-two-best-models-for-algorithmic-marketing-attribution-implemented-in-apache-spark-and-r/).
In the example they are using the function:
concat_ws (from library(sparklyr)). But it looks like the function does not exist (after installing the package and loading the library, I get an error that the function does not exist).
Comment from the author of the blog: concat_ws is a Spark SQL function:
https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/sql/functions.html
So, you’ll have to rely on sparklyr to have that function work.
My question: are there workarounds to get access to the concat_ws() function? I tried:
Searching GitHub (https://github.com/rstudio/sparklyr) to see if I could find the function (or its source code), unfortunately with no result.
What is the goal of the function?
Concatenates multiple input string columns together into a single string column, using the given separator.
You can simply use paste from base R.
library(sparklyr)
library(dplyr)
config <- spark_config()
sc <- spark_connect(master = "local", config = config)
df <- as.data.frame(cbind(c("1", "2", "3"), c("a", "b", "c")))
sdf <- sdf_copy_to(sc, df, overwrite = T)
sdf %>%
  mutate(concat = paste(V1, V2, sep = "-"))
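As a quick check (a sketch; I believe sparklyr/dbplyr translates paste() with a sep into Spark's CONCAT_WS, but verify against your version), you can inspect the SQL that is generated for the remote table:
# Show the SQL sent to Spark for the mutate above; the paste() call should
# appear as a string-concatenation function in the rendered query.
sdf %>%
  mutate(concat = paste(V1, V2, sep = "-")) %>%
  dplyr::show_query()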
You cannot find the function because it doesn't exist in the sparklyr package. concat_ws is a Spark SQL function (org.apache.spark.sql.functions.concat_ws).
sparklyr depends on a SQL translation layer; function calls are translated into SQL expressions with dbplyr:
> dbplyr::translate_sql(concat_ws("-", foo, bar))
<SQL> CONCAT_WS('-', "foo", "bar")
This means that the function can be applied only in a sparklyr context:
sc <- spark_connect(master = "local[*]")
df <- copy_to(sc, tibble(x="foo", y="bar"))
df %>% mutate(xy = concat_ws("-", x, y))
# # Source: spark<?> [?? x 3]
# x y xy
# * <chr> <chr> <chr>
# 1 foo bar foo-bar
I had a similar problem with dbplyr (BigQuery database).
Problem
I kept getting the error:
my_dbplyr_object %>%
  mutate(datetime_char = paste(date_char, time_char))
# failed x Function not found: CONCAT_WS at [1:147] [invalidQuery]
Solution
I wrote custom SQL and placed it inside sql().
Example
Once you know the SQL that will generate what you're after (in my case it was CONCAT(date_char, ' ', time_char)), simply place it inside the sql() function, like so:
my_dbplyr_object %>%
  mutate(datetime_char = sql("CONCAT(date_char, ' ', time_char)"))
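To double-check what is actually sent to the database (a sketch; it assumes my_dbplyr_object is a lazy remote table), show_query() prints the rendered SQL, with the sql() fragment passed through verbatim:
my_dbplyr_object %>%
  mutate(datetime_char = sql("CONCAT(date_char, ' ', time_char)")) %>%
  show_query()   # the sql() fragment appears unchanged in the generated query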
Related
I am trying to insert values from a dataframe into a database table (Impala) using SparkR in a Databricks notebook:
require(SparkR)
test_df <- data.frame(row_no  = c(2, 3, 4, 5, 6, 7, 8),
                      row_dat = c('dat_2', 'dat_3', 'dat_4', 'dat_5', 'dat_6', 'dat_7', 'dat_8'))
test_df <- as.data.frame(test_df)
sparkR.session()
insertInto(test_df,"db_name.table_name",overwrite = false)
I get the error: "unable to find an inherited method for function ‘insertInto’ for signature ‘"data.frame", "character"’"
I have checked the connection to this table and using SparkR::collect I can return the data from it no problem. So why isn't the insert working?
Instead of as.data.frame, which returns an R data.frame, you need to use as.DataFrame, which returns a Spark DataFrame that can be used with insertInto (see the docs). Change the code to:
require(SparkR)
test_df <- data.frame(row_no  = c(2, 3, 4, 5, 6, 7, 8),
                      row_dat = c('dat_2', 'dat_3', 'dat_4', 'dat_5', 'dat_6', 'dat_7', 'dat_8'))
sparkR.session()                  # on Databricks a session already exists, so this is a no-op
test_df <- as.DataFrame(test_df)  # Spark DataFrame, not an R data.frame
insertInto(test_df, "db_name.table_name", overwrite = FALSE)
I have a question on how to use eval(parse(text = ...)) with dbplyr's SQL translation.
The following code does exactly what I want with dplyr, using eval(parse(text = eval_text)):
selected_col <- c("wt", "drat")
text <- paste(selected_col, ">3")
implode <- function(..., sep = '|') {
  paste(..., collapse = sep)
}
eval_text <- implode(text)
mtcars %>% dplyr::filter(eval(parse(text=eval_text)))
But when I run it against the database it returns an error. I am looking for any solution that allows me to dynamically set the column names and filter with the OR operator.
db <- tbl(con, "mtcars") %>%
  dplyr::filter(eval(parse(text = eval_text)))
db <- collect(db)
Thanks!
You have the right approach, but dbplyr tends to work better with something that can receive the !! ('bang-bang') operator. At one point dplyr had *_ versions of functions (e.g. filter_) that accepted text inputs; this is now done using NSE (non-standard evaluation).
A couple of references: shiptech and r-bloggers (sorry, I couldn't find the official dplyr reference).
For your purposes you should find the following works:
library(rlang)
df %>% dplyr::filter(!!parse_expr(eval_text))
Full working example:
library(dplyr)
library(dbplyr)
library(rlang)
data(mtcars)
df = tbl_lazy(mtcars, con = simulate_mssql()) # simulated database connection
implode <- function(..., sep='|') { paste(..., collapse=sep) }
selected_col <- c("wt", "drat")
text <- paste(selected_col, ">3")
eval_text <- implode(text)
df %>% dplyr::filter(eval(parse(eval_text))) # returns clearly wrong SQL
df %>% dplyr::filter(!!parse_expr(eval_text)) # returns valid & correct SQL
df %>% dplyr::filter(!!!parse_exprs(text)) # passes filters as a list --> AND (instead of OR)
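If you do want the individual filters combined with OR rather than AND, one option (a sketch using rlang and purrr; reduce() folds the parsed expressions into a single `|` expression before splicing it in):
library(purrr)
# Build one `a | b | ...` expression from the vector of filter strings
or_expr <- reduce(parse_exprs(text), ~ expr(!!.x | !!.y))
df %>% dplyr::filter(!!or_expr)   # renders SQL with OR between the conditions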
Data
I work with a large dataset (280 million rows) for which Spark and R seem to work nicely.
Problem
I've had problems with SparkR's regexp_extract function. I thought it would work analogously to stringr's str_detect, but I haven't managed to get it to work. The documentation for regexp_extract is limited. Could you please give me a hand?
Reprex
Here is a reprex where I try to identify strings that do not have a space and paste " 00:01" as a suffix.
# Load packages
library(tidyverse)
library(sparklyr)
library(SparkR)
# Create data
df <- data.frame(sampletaken = c("06/03/2013", "29/11/2005 8:30", "06/03/2013", "15/01/2007 12:25", "06/03/2013", "15/01/2007 12:25"))
# Create Spark connection
sc <- spark_connect(master = "local", spark_home = spark_home_dir())
# Transfer data to Spark memory
df <- copy_to(sc, df, "df", overwrite = TRUE)
# Modify data
df1 <- df %>%
  dplyr::mutate(sampletaken = ifelse(regexp_extract(sampletaken, " "), sampletaken, paste(sampletaken, "00:01")))
# Collect data as dataframe
df1 <- df1 %>% as.data.frame()
head(df1$sampletaken)
Error
error: org.apache.spark.sql.AnalysisException: cannot resolve '(NOT regexp_extract(df.sampletaken, ' ', 1))' due to data type mismatch: argument 1 requires boolean type, however, 'regexp_extract(df.sampletaken, ' ', 1)' is of string type.; line 1 pos 80;
Solution
# Load packages
library(tidyverse)
library(sparklyr)
library(SparkR)
# Create data
df <- data.frame(sampletaken = c("06/03/2013", "29/11/2005 8:30", "06/03/2013", "15/01/2007 12:25", "06/03/2013", "15/01/2007 12:25"))
# Create Spark connection
sc <- spark_connect(master = "local", spark_home = spark_home_dir())
# Transfer data to Spark memory
df <- copy_to(sc, df, "df", overwrite = TRUE)
# Modify data
df1 <- df %>%
  dplyr::mutate(sampletaken1 = ifelse(rlike(sampletaken, " "), sampletaken, paste(sampletaken, "00:01")))
# Collect data as dataframe
df1 <- df1 %>% as.data.frame()
head(df1$sampletaken)
Probably rlike is what you're after if you're looking for the analog of str_detect; see the SQL API docs:
str rlike regexp - Returns true if str matches regexp, or false otherwise.
SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.*'
true
On a Column (i.e., in R, rather than in Spark SQL through sql()), it would look like:
rlike(Column, 'regex.*pattern')
# i.e., in magrittr form
Column %>% rlike('regex.*pattern')
Note that like is usually more efficient if you can use it since the set of valid like patterns is much smaller.
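For comparison (a sketch; like uses SQL wildcard patterns, where % matches any sequence of characters, rather than a regex):
# Simple substring check with a LIKE pattern instead of a regular expression
Column %>% like('%substring%')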
I'm not familiar with SparkR, but it seems that the function regexp_extract returns a string (presumably the matched part of the string) instead of a boolean, as required by ifelse.
You could try comparing the returned value against the empty string.
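For example, a sketch along those lines (untested; Spark's regexp_extract returns an empty string when nothing matches, and group index 0 refers to the whole match):
# Treat an empty extraction as "no space found" and append the default time
df1 <- df %>%
  dplyr::mutate(sampletaken1 = ifelse(regexp_extract(sampletaken, " ", 0) == "",
                                      paste(sampletaken, "00:01"),
                                      sampletaken))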
I have created a function that reads some text data from certain files, does some manipulation (omitted here), and then saves each modified dataframe as .RData. I have checked that the function does its job. However, when loading the output back into RStudio, the load command runs without errors, but there is no new object in my environment.
Any possible fixes?
f <- function(directory_input, directory_output, par1, par2){
  library(tidyverse)
  library(readxl)

  if (!dir.exists(directory_output)) {
    dir.create(directory_output)
  }

  key <- data.frame(par = as.character(paste0(0, par1, par2)))

  paths <- key %>% mutate(
    filepath_in  = file.path(directory_input, paste0(par, '.txt')),
    filepath_out = file.path(directory_output, paste0(par, '.RData'))
  )

  filepath_in  <- paths$filepath_in
  filepath_out <- paths$filepath_out

  # read each input file (first 1000 rows only)
  DF <- filepath_in %>% map(~ .x %>% read.delim2(., encoding = 'Latin-1', nrows = 1000))

  # save each resulting data frame to its own .RData file
  map2(DF, filepath_out, ~ .x %>% save(file = .y))
}
EDIT
After the comments, here is a bit more context:
I was instructed to write a function that will be part of a future package.
The function does not create a new dataframe in the session; it only saves it to disk. I designed it this way to make it easier for users who are manipulating multiple datasets.
On the other hand, it is natural to assume that users will later read these datasets back in the most intuitive way, using only load. That is why a solution that requires assign to load the results would not be ideal.
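One quick diagnostic (a sketch; the file path is illustrative): load() invisibly returns the names of the objects it restores, so printing its result shows exactly what was created, including names that start with a dot, which ls() and the RStudio environment pane hide by default.
restored_names <- load("directory_output/012.RData")  # illustrative path
print(restored_names)    # names of the objects load() actually created
ls(all.names = TRUE)     # also lists hidden objects (names starting with ".")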
I am using Spark ML pipelines to deploy operations that I have developed in sparklyr to a production environment, using Scala. It is working pretty well, except for one part: it seems that when I read a table from Hive and then create a pipeline that applies operations to this table, the pipeline also saves the table-reading operation, and thereby the name of the table. However, I want the pipeline to be independent of this.
Here is a reproducible example:
Sparklyr part:
sc = spark2_context(memory = "4G")
iris <- copy_to(sc, iris, overwrite=TRUE)
spark_write_table(iris, "base.iris")
spark_write_table(iris, "base.iris2")
df1 <- tbl(sc, "base.iris")
df2 <- df1 %>%
  mutate(foo = 5)
pipeline <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(df2) %>%
  ml_fit(df1)
ml_save(pipeline,
        paste0(save_pipeline_path, "test_pipeline_reading_from_table"),
        overwrite = TRUE)
df2 <- pipeline %>% ml_transform(df1)
dbSendQuery(sc, "drop table base.iris")
Scala part:
import org.apache.spark.ml.PipelineModel
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
val df1 = spark.sql("select * from base.iris2")
val pipeline = PipelineModel.load(pipeline_path + "/test_pipeline_reading_from_table")
val df2 = pipeline.transform(df1)
I get this error:
org.apache.spark.sql.AnalysisException: Table or view not found: `base`.`iris`; line 2 pos 5;
'Project ['Sepal_Length, 'Sepal_Width, 'Petal_Length, 'Petal_Width, 'Species, 5.0 AS foo#110]
+- 'UnresolvedRelation `base`.`iris`
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:82)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:637)
at org.apache.spark.ml.feature.SQLTransformer.transformSchema(SQLTransformer.scala:86)
at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:310)
at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:310)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
at org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:310)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:304)
... 71 elided
I can see two solutions:
It seems that persisting the dataframe would be a solution, but then I would need to find a way not to overload my memory, hence my question on unpersisting.
Passing the name of the Hive table as a parameter of the pipeline, which I am trying to solve in this question.
Now, all of this being said, I might be missing something as I am only a beginner...
EDIT: this is significantly different from this question, as it concerns the specific problem of integrating a dataframe that was just read into a pipeline, as specified in the title.
EDIT: as for my project, persisting the tables after I read them is a viable solution. I don't know if there is any better solution.
Then the pipeline would call my table "base.table", making it impossible to apply it to another table.
That's actually not true. ft_dplyr_transformer is syntactic sugar for Spark's own SQLTransformer. Internally, the dplyr expression is converted to a SQL query, and the name of the table is replaced with __THIS__ (a Spark placeholder referring to the current table).
Let's say you have a transformation like this one:
copy_to(sc, iris, overwrite = TRUE)

df <- tbl(sc, "iris") %>%
  mutate(foo = 5)

pipeline <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(df) %>%
  ml_fit(tbl(sc, "iris"))

ml_stage(pipeline, "dplyr_transformer") %>% spark_jobj() %>% invoke("getStatement")
[1] "SELECT `Sepal_Length`, `Sepal_Width`, `Petal_Length`, `Petal_Width`, `Species`, 5.0 AS `foo`\nFROM `__THIS__`"
That is, however, a rather confusing way of expressing things, and it makes more sense to use the native SQL transformer directly:
pipeline <- ml_pipeline(sc) %>%
  ft_sql_transformer("SELECT *, 5 as `foo` FROM __THIS__") %>%
  ml_fit(df)
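A sketch of how this could then be used (table names taken from the question; the pipeline name is illustrative): because the statement only references __THIS__, the saved pipeline no longer depends on base.iris and can be applied to any table with the same schema, e.g. base.iris2.
ml_save(pipeline,
        paste0(save_pipeline_path, "test_pipeline_sql_transformer"),  # illustrative name
        overwrite = TRUE)
pipeline %>% ml_transform(tbl(sc, "base.iris2"))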
Edit:
The problem you experience here looks like a bug. The get_base_name function returns an unquoted table name, so the value in your case will be
> get_base_name(x$ops)
<IDENT> default.iris
and the pattern will be
> pattern
[1] "\\bdefault.iris\\b"
However, dbplyr::sql_render returns a backquoted, fully qualified name:
> dbplyr::sql_render(x)
<SQL> SELECT `Sepal_Length`, `Sepal_Width`, `Petal_Length`, `Petal_Width`, `Species`, 5.0 AS `foo`
FROM `default`.`iris`
So the pattern doesn't match the name.
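Until that is fixed, one possible workaround (a sketch, not verified against a specific sparklyr version): register the Hive table under an unqualified temporary name first, so the rendered SQL only contains a single-part name that the __THIS__ substitution can match, as in the iris example above.
# Read from Hive, then register under a simple, unqualified temp view name
df1 <- tbl(sc, "base.iris") %>% sdf_register("iris_tmp")
df2 <- df1 %>% mutate(foo = 5)

pipeline <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(df2) %>%
  ml_fit(df1)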