SparkR and dplyr: window function count() using gapply

I'm trying to implement a simple query in Spark using gapply, but I'm running into trouble.
This code works well:
library(SparkR)
library(dplyr)
df <- createDataFrame(iris)
createOrReplaceTempView(df, "iris")
display(SparkR::sql("SELECT *, COUNT(*) OVER(PARTITION BY Species) AS RowCount FROM iris"))
But I can't get the same result via gapply:
display(df %>%
  SparkR::group_by(df$Species) %>%
  gapply(function(key, x) { y <- data.frame(x, SparkR::count()) },
         "Sepal_Length double, Sepal_Width double, Petal_Length double, Petal_Width double, Species string, RowCount integer"))
This returns the error:
SparkException: R unexpectedly exited. Caused by: EOFException:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 235.0 failed 4 times, most recent failure: Lost task
0.3 in stage 235.0 (TID 374) (10.150.202.5 executor 1): org.apache.spark.SparkException: R unexpectedly exited. R worker
produced errors: Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘count’ for signature
‘"missing"’ Calls: compute ... computeFunc -> data.frame ->
-> Execution halted
Is it possible to implement the window function "count" with gapply using pipes from dplyr?

Just a small mistake: you should use base::nrow instead of SparkR::count inside gapply.
display(df %>%
  SparkR::group_by(df$Species) %>%
  gapply(function(key, x) { y <- data.frame(x, nrow(x)) },
         "Sepal_Length double, Sepal_Width double, Petal_Length double, Petal_Width double, Species string, RowCount integer"))
And this is how you could do it through the SparkR API using SparkR::windowPartitionBy; there is no need to create a UDF here:
(
  df %>%
    SparkR::select(
      c(
        SparkR::columns(df),
        SparkR::over(
          SparkR::count(SparkR::lit(1)),
          SparkR::windowPartitionBy(SparkR::column("Species"))
        ) %>% SparkR::alias("RowCount")
      )
    ) %>%
    display()
)
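Since only one column is being added, SparkR::withColumn gives a slightly shorter variant of the same window expression. This is just a sketch of an alternative, not part of the original answer:
# Build the window expression once, then attach it as a new column.
rowCountCol <- SparkR::over(
  SparkR::count(SparkR::lit(1)),
  SparkR::windowPartitionBy(SparkR::column("Species"))
)
display(SparkR::withColumn(df, "RowCount", rowCountCol))
Since iris has 50 rows per species, RowCount should be 50 on every row, which is a quick sanity check against the SQL version above.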

Related

PBI R Script Visual: Error in UseMethod("rename") : no applicable method for 'rename' applied to an object of class "function"

Super R noob here trying to get away with some copy/paste from Google. Any help to eliminate this error is appreciated.
Here is my code:
library(bupaR)
library(dplyr)

data %>%
  # rename timestamp variables appropriately
  dplyr::rename(start = Starttime,
                complete = Endtime) %>%
  # convert timestamps to date/time format
  convert_timestamps(columns = c("start", "complete"), format = ymd_hms) %>%
  activitylog(case_id = "Shipasitem",
              activity_id = "OperationDescription",
              timestamps = c("start", "complete"))
and here is the error:
Error in UseMethod("rename") :
no applicable method for 'rename' applied to an object of class "function"
Calls: %>% ... convert_timestamps -> stopifnot -> %in% ->
Execution halted
I've tried calling all sorts of different libraries with no luck.
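No accepted fix is quoted here, but one thing worth checking (a guess from the error text, not a verified solution): "no applicable method for 'rename' applied to an object of class 'function'" means the object piped into rename is a function, i.e. data is resolving to the base utils::data function rather than your table. In Power BI R script visuals the input table is normally exposed as dataset, so a hedged sketch would be:
library(bupaR)
library(dplyr)
library(lubridate)  # ymd_hms comes from lubridate

# `dataset` is the name Power BI usually gives the R visual's input table;
# if your input is bound under a different name, substitute it here.
dataset %>%
  dplyr::rename(start = Starttime,
                complete = Endtime) %>%
  convert_timestamps(columns = c("start", "complete"), format = ymd_hms) %>%
  activitylog(case_id = "Shipasitem",
              activity_id = "OperationDescription",
              timestamps = c("start", "complete"))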

Applying user defined function to normalize all columns in sparklyr using spark_apply

I have a Spark dataframe with more than 100 columns that I manipulate using sparklyr. I would like to normalize each column in the following way: (vector - mean(vector)) / sd(vector). To achieve that in R, I could use dplyr like this:
library(dplyr)

normalize <- function(vector){
  vector_norm = (vector - mean(vector)) / sd(vector)
  return(vector_norm)
}

iris %>%
  select(-Species) %>%
  mutate_all(funs(normalize(.))) %>%
  view
Unfortunately, sparklyr cannot run user-defined R functions natively. There is an approach using spark_apply that allows this (though inefficiently). My best attempt at that approach is the following:
# Connect to Spark and push iris dataset to Spark
library(sparklyr)
sc <- spark_connect(method = "databricks")
iris_sdf <- sdf_copy_to(sc, iris %>% head(4), overwrite = T)
schema <- as.list(colnames(iris))
results_sdf <- spark_apply(iris_sdf,
                           function(vector){
                             vector_norm = (vector - mean(vector)) / sd(vector)
                             return(vector_norm)
                           },
                           columns = schema)
head(results_sdf, 10)
But I got the following error:
Error : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 4 times, most recent failure: Lost task 0.3 in stage 21.0 (TID 25, 10.19.216.60, executor 0): java.lang.Exception: sparklyr worker rscript failure with status 255, check worker logs for details.
at sparklyr.Rscript.init(rscript.scala:83)
at sparklyr.WorkerApply$$anon$2.run(workerapply.scala:133)
I also tried:
iris_sdf %>%
  spark_apply(
    function(e) data.frame((e$Sepal.Length - mean(e$Sepal.Length)) / sd(e$Sepal.Length)),
    names = c("Sepal.Length")
  )
No error but the resulting output had zero rows.
I would be open to any solution in sparklyr, pyspark, or scala.
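One way to sidestep spark_apply entirely, as a sketch rather than part of the original post, assuming plain dplyr verbs are acceptable and that the copied table uses underscore column names such as Sepal_Length: sparklyr translates mean() and sd() inside mutate() into window aggregates, so the normalization runs as SQL on the cluster rather than in an R worker.
library(sparklyr)
library(dplyr)

sc <- spark_connect(method = "databricks")
iris_sdf <- sdf_copy_to(sc, iris, overwrite = TRUE)

# mean()/sd() become AVG(...) OVER () and STDDEV_SAMP(...) OVER () in the
# generated SQL; repeat (or generate programmatically) for the other columns.
iris_norm <- iris_sdf %>%
  mutate(
    Sepal_Length = (Sepal_Length - mean(Sepal_Length, na.rm = TRUE)) / sd(Sepal_Length, na.rm = TRUE),
    Sepal_Width  = (Sepal_Width  - mean(Sepal_Width,  na.rm = TRUE)) / sd(Sepal_Width,  na.rm = TRUE)
  )

head(iris_norm, 10)
You can check the generated SQL with show_query(iris_norm) before collecting anything.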

Grouped correlation between two variables with dbplyr and corrr

I am connected to Impala:
con <- DBI::dbConnect(odbc::odbc(), "impala connector", schema = "some_schema")

library(dplyr)
library(dbplyr) # I have to load both of them, otherwise tbl won't work
table <- tbl(con, 'serverTable')
I would like to use Pearson's R to track the change of a measure in time as a quick and dirty prediction model.
Locally it works quite well, but I have problems implementing it on the server.
Here's the code:
library(corrr)
table %>%
  filter(!is.na(VAR) | VAR > -10 | VAR < -32) %>%
  # VAR is the measure; values over -10 or under -32 are already out of the
  # threshold, and I want to catch the subjects before that
  mutate(num_date = as.numeric(as.POSIXct(date))) %>%
  # to convert the date string into the number of seconds since 1970
  group_by(id) %>%
  # the measure is taken daily for various subjects; I am interested in
  # isolating the subjects approaching the thresholds
  mutate(corr = corrr::correlate(VAR, num_date)) %>%
  ungroup() %>%
  # here I calculate Pearson's R; I must specify corrr:: or I get an error
  filter(abs(corr) > 0.9) %>%
  # locally I found that a value of 0.9 works well for isolating the subjects
  # whose measure is approaching the thresholds
  select(id) %>%
  collect()
If I run this though, I get the error:
Error in corrr::correlate(VAR, num_date) : object 'VAR' not found.
So I tried to substitute that line with:
mutate(corr = corrr::correlate(.$VAR, .$num_date)) %>%
and with that I get the error:
Error in stats::cor(x = x, y = y, use = use, method = method) : supply both 'x' and 'y' or a matrix-like 'x'
If I instead try to use cor from stats, cor(VAR, num_date), I get the error:
Error in new_result(connection@ptr, statement, immediate) : nanodbc/nanodbc.cpp:1412: HY000: [Cloudera][ImpalaODBC] (370) Query analysis error occurred during query execution: [HY000] : AnalysisException: some_schema.cor() unknown
It seems dbplyr can't translate cor into SQL (I can see that if I run show_query() instead of collect()).
EDIT: I solved the problem using SQL:
SELECT id, cor
FROM (
    SELECT id,
           ((tot_sum - (VAR_sum * date_sum / _count)) /
            sqrt((VAR_sq - pow(VAR_sum, 2.0) / _count) * (date_sq - pow(date_sum, 2.0) / _count))) AS cor
    FROM (
        SELECT id,
               sum(VAR) AS VAR_sum,
               sum(CAST(CAST(date AS TIMESTAMP) AS DOUBLE)) AS date_sum,
               sum(VAR * VAR) AS VAR_sq,
               sum(CAST(CAST(date AS TIMESTAMP) AS DOUBLE) * CAST(CAST(date AS TIMESTAMP) AS DOUBLE)) AS date_sq,
               sum(VAR * CAST(CAST(date AS TIMESTAMP) AS DOUBLE)) AS tot_sum,
               count(*) AS _count
        FROM (
            SELECT id, VAR, date
            FROM (
                SELECT id, VAR, date
                FROM schema
                WHERE VAR IS NOT NULL) AS a
            WHERE VAR < -10 OR VAR > -32) AS b
        GROUP BY id) AS c) AS d
WHERE ABS(cor) > 0.9 AND ABS(cor) <= 1
Thanks to this article:
https://chartio.com/learn/postgresql/correlation-coefficient-pearson/
cor is not in the list of functions that dplyr can translate - see here: https://dbplyr.tidyverse.org/articles/sql-translation.html#known-functions
You can try the following in your code:
mutate(corr = translate_sql(corr(VAR, num_date)))
This should translate directly to CORR(VAR, num_date). These translations don't work in all database types. If you can't get this working in your case, you likely have no choice but to collect your data before you try to run non-translatable functions.
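For completeness, an untested sketch: because dbplyr passes functions it cannot translate through to the database unchanged, you can also write corr() directly inside a grouped summarise, reusing the pipeline from the question, provided your database exposes a CORR(x, y) aggregate (PostgreSQL, Oracle, and Spark SQL do; check your Impala version):
# corr() is not an R function here; dbplyr sends it to the database as CORR(...)
table %>%
  filter(!is.na(VAR)) %>%
  mutate(num_date = as.numeric(as.POSIXct(date))) %>%
  group_by(id) %>%
  summarise(corr = corr(VAR, num_date)) %>%
  filter(abs(corr) > 0.9) %>%
  collect()
If the backend rejects CORR, fall back to the summarise-based formula in the next answer.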
My solution was to use dplyr's functions to replicate the correlation formula:
temp_cor <- d_price_w_db %>%   # your table from SQL, from tbl(con, "NAME OF TABLE")
  group_by(GroupA, GroupB) %>% # your groups
  # And then use summarise to create the correlation.
  # You can create as many as you like:
  summarise(cor_temp_ab = (avg(temp_a * temp_b) - (avg(temp_a) * avg(temp_b))) /
                          (sd(temp_a) * sd(temp_b)),
            .groups = "drop")
This builds the SQL query that computes your correlation coefficients; you can see it with show_query(temp_cor). Finally, just do
local_object <- temp_cor %>%
  collect()
to save the result of your query in a local object.
The formula for correlation from this post: https://www.red-gate.com/simple-talk/blogs/statistics-sql-pearsons-correlation/

sparklyr feature transformation functions result in error

I have some problems using the ft_... functions from the sparklyr R package. ft_binarizer works, but ft_normalizer and ft_min_max_scaler do not. Here is an example:
library(sparklyr)
library(dplyr)
library(nycflights13)

sc <- spark_connect(master = "local", version = "2.1.0")
x <- flights %>% select(dep_delay)
x_tbl <- sdf_copy_to(sc, x)

# works fine
ft_binarizer(x = x_tbl, input.col = "dep_delay", output.col = "delayed", threshold = 0)

# error
ft_normalizer(x = x_tbl, input.col = "dep_delay", output.col = "delayed_norm")

# error
ft_min_max_scaler(x = x_tbl, input.col = "dep_delay", output.col = "delayed_min_max")
The normalizer returns:
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most recent failure: Lost task 0.0 in stage 9.0 (TID 9, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$createTransformFunc$1: (double) => vector)"
The min_max_scaler returns:
"Error: java.lang.IllegalArgumentException: requirement failed: Column dep_delay must be of type org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7 but was actually DoubleType."
I think it is a problem with the data type, but I don't know how to solve it. Does anybody have an idea what to do?
Many thanks in advance!
ft_normalizer operates on Vector columns so you have to use ft_vector_assembler first:
ft_vector_assembler(x_tbl, input_cols = "dep_delay", output_col = "dep_delay_v") %>%
  ft_normalizer(input.col = "dep_delay_v", output.col = "delayed_v_norm")

Use of substr() on DataFrame column in SparkR

I am using SparkR and want to use the substr() command to isolate the last character of a string contained in a column. I can get substr() to work if I set the StartPosition and EndPosition to constants:
substr(sdfIris$Species, 8, 8)
But when I try to set these parameters using a value sourced from the DataFrame:
sdfIris <- createDataFrame(sqlContext, iris)
sdfIris$Len <- length(sdfIris$Species)
sdfIris$Last <- substr(sdfIris$Species, sdfIris$Len, sdfIris$Len)
Error in as.integer(start - 1) : cannot coerce type 'S4' to vector of type 'integer'
It seems that the result being returned from sdfIris$Len is perhaps a one-cell DataFrame, and the parameter needs an integer.
I have tried collect(sdfIris$Len), but:
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘collect’ for signature ‘"Column"’
This seems incongruous. substr() seems to see sdfIris$Len as a DataFrame, but collect() seems to see it as a Column.
I have already identified a work-around by using registerTempTable and using SparkSQL's substr to isolate the last character, but I was hoping to avoid the unnecessary steps of switching to SQL.
How can I use SparkR substr() on a DataFrame column with dynamic Start and Finish parameters?
It is not optimal but you can use expr:
df <- createDataFrame(
  sqlContext,
  data.frame(s = c("foo", "bar", "foobar"), from = c(1, 2, 0), to = c(2, 3, 5))
)

select(df, expr("substr(s, from, to)")) %>% head()
##   substr(s,from,to)
## 1                fo
## 2                ar
## 3             fooba
or selectExpr:
selectExpr(df, "substr(s, from, to)") %>% head()
##   substr(s,from,to)
## 1                fo
## 2                ar
## 3             fooba
as well as an equivalent SQL query.
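Applied to the original iris example, the same idea gives the last character directly, without the temporary Len column. This is a sketch relying on Spark SQL's substr and length functions, not part of the original answer:
# substr(str, pos, len) with pos = length(Species) picks out the final character
sdfIris <- createDataFrame(sqlContext, iris)
head(selectExpr(sdfIris, "Species", "substr(Species, length(Species), 1) AS Last"))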
