I have a wide Spark data frame of a few thousand columns by about a million rows, for which I would like to calculate the row totals. My solution so far is below. I used:
dplyr - sum of multiple columns using regular expressions and
https://github.com/tidyverse/rlang/issues/116
library(sparklyr)
library(DBI)
library(dplyr)
library(rlang)
sc1 <- spark_connect(master = "local")
# toy data: 10 rows x 200 columns of random integers between 1 and 20
wide_df = as.data.frame(matrix(ceiling(runif(2000, 0, 20)), 10, 200))
wide_sdf = sdf_copy_to(sc1, wide_df, overwrite = TRUE, name = "wide_sdf")
# "V1+V2+...+V200"
col_eqn = paste0(colnames(wide_df), collapse = "+")
# build up the SQL query and send to Spark with DBI
query = paste0("SELECT (",
               col_eqn,
               ") as total FROM wide_sdf")
dbGetQuery(sc1, query)
# Equivalent approach using dplyr instead
col_eqn2 = quo(!! parse_expr(col_eqn))
wide_sdf %>%
  transmute("total" := !!col_eqn2) %>%
  collect() %>%
  as.data.frame()
The problems come when the number of columns is increased. Spark SQL appears to build the sum one element at a time, i.e. (((V1 + V2) + V3) + V4)..., and the deeply nested expression leads to recursion errors.
Does anyone have an alternative, more efficient approach? Any help would be much appreciated.
You're out of luck here. One way or another you're going to hit some recursion limits (even if you go around the SQL parser, a sufficiently large sum of expressions will crash the query planner). There are some slow solutions available:
Use spark_apply (at the cost of conversion to and from R):
# each partition is converted to an R data frame, row-summed, and sent back
wide_sdf %>% spark_apply(function(df) { data.frame(total = rowSums(df)) })
Convert to long format and aggregate (at the cost of explode and shuffle):
key_expr <- "monotonically_increasing_id() AS key"
value_expr <- paste(
  "explode(array(", paste(colnames(wide_sdf), collapse = ","), ")) AS value"
)
wide_sdf %>%
  spark_dataframe() %>%
  # Add id and explode. We need a separate invoke so id is applied
  # before "lateral view"
  sparklyr::invoke("selectExpr", list(key_expr, "*")) %>%
  sparklyr::invoke("selectExpr", list("key", value_expr)) %>%
  sdf_register() %>%
  # Aggregate by id
  group_by(key) %>%
  summarize(total = sum(value)) %>%
  arrange(key)
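If your Spark version is 2.4 or later, a third option worth trying (an untested sketch, not part of the original answer) is the aggregate higher-order function: the columns are packed into an array once, so the sum is a single flat expression rather than a deeply nested chain of + operators:
# hypothetical sketch - requires Spark >= 2.4 for higher-order functions
sum_expr <- paste0(
  "aggregate(array(", paste(colnames(wide_sdf), collapse = ","),
  "), 0D, (acc, x) -> acc + x) AS total"
)
wide_sdf %>%
  spark_dataframe() %>%
  sparklyr::invoke("selectExpr", list(sum_expr)) %>%
  sdf_register()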
To get something more efficient you should consider writing a Scala extension and applying the sum directly on a Row object, without exploding:
package com.example.sparklyr.rowsum
import org.apache.spark.sql.{DataFrame, Encoders}
object RowSum {
  def apply(df: DataFrame, cols: Seq[String]) = df.map {
    row => cols.map(c => row.getAs[Double](c)).sum
  }(Encoders.scalaDouble)
}
and
invoke_static(
  sc, "com.example.sparklyr.rowsum.RowSum", "apply",
  wide_sdf %>% spark_dataframe(),
  as.list(colnames(wide_sdf))  # the cols argument required by RowSum.apply
) %>% sdf_register()
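For invoke_static to find the class, the compiled jar has to be on the session's classpath before connecting. A minimal sketch (the jar name and path are assumptions, not part of the original answer):
# hypothetical: package the Scala object above into rowsum.jar first
config <- spark_config()
config$sparklyr.jars.default <- "path/to/rowsum.jar"
sc <- spark_connect(master = "local", config = config)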
Related
I am trying to use the pipe function in dplyr and left_join to clean some metadata up. Setting up variables:
library(openxlsx)
library(tidyverse)
mdat <- read.xlsx("https://journals.plos.org/plospathogens/article/file?type=supplementary&id=info:doi/10.1371/journal.ppat.1005511.s011",
                  startRow = 3, fillMergedCells = TRUE) %>%
  mutate(sample = Accession.Number)
dge$samples$sample contains:
[1] "SRR1346026" "SRR1346027" "SRR1346028" "SRR1346029" "SRR1346030" "SRR1346031" "SRR1346032" "SRR1346033" "SRR1346034"
[10] "SRR1346035" "SRR1346036" "SRR1346037" "SRR1346038" "SRR1346039" "SRR1346040" "SRR1346041" "SRR1346042" "SRR1346043"
[19] "SRR1346044" "SRR1346045" "SRR1346046" "SRR1346047" "SRR1346049" "SRR1346048" "SRR1346050" "SRR1346051" "SRR1346052"
I am trying to pipe in dge$samples$sample, which is of character class. It needs to become a data frame with one column named sample, so I can left join mdat to it and remove all the metadata rows I don't have a sample for. If you run dim(mdat) you will find it is 35 by 15; I want to reduce it to the 19 samples I actually have data for, which are given in the dge$samples$sample list. The code below, which first tries to convert dge$samples$sample into a one-column data frame titled sample for the join, is my progress so far, but I think I am failing to understand how the pipe works.
test = data.frame(dge$samples$sample) %>%
  colnames(.) = c("sample") %>%
  left_join(
    .,
    mdat,
    by = sample,
    copy = FALSE,
    suffix = c(".x", ".y"),
    keep = FALSE,
    na_matches = c("na", "never")
  )
Why not just check whether they're in there and filter:
mdat %>% filter( sample %in% dge$samples$sample )
It's easier to understand and control than a join, and performance shouldn't be an issue.
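For illustration, a minimal sketch with made-up values (this mdat is a toy stand-in, not the real metadata):
mdat <- data.frame(sample = c("SRR1346026", "SRR1346027", "SRR1346099"),
                   source = c("a", "b", "c"))
keep <- c("SRR1346026", "SRR1346027")
mdat %>% filter(sample %in% keep) # keeps only the first two rows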
I think your code can be reduced to
library(dplyr)
test <- data.frame(sample = dge$samples$sample) %>%
  left_join(mdat, by = 'sample')
Or an inner join should work as well. Using base R:
test <- merge(data.frame(sample = dge$samples$sample), mdat, by = 'sample')
Using collapse (sbt() is the shorthand for fsubset()):
library(collapse)
sbt(mdat, sample %in% dge$samples$sample)
I'm new to R and I don't know all the basic concepts yet. The task is to produce one merged table from multiple response sets. I am trying to do this using the expss library and a loop.
This is the code in R without a loop (works fine):
#libraries
#blah, blah...
#path
df.path = "C:/dataset.sav"
#dataset load
df = read_sav(df.path)
#table
table_undropped1 = df %>%
  tab_cells(mdset(q20s1i1 %to% q20s1i8)) %>%
  tab_total_row_position("none") %>%
  tab_stat_cpct() %>%
  tab_pivot()
There are 10 multiple response sets, so I need to create 10 tables in the manner shown above; then I transpose those tables and merge them. To simplify the code (and learn something new) I decided to produce the tables using a loop. However, nothing works. I have looked for a solution, and I think the closest to correct is:
#this generates a message: '1' not found
for(i in 1:10) {
  assign(paste0("table_undropped",i),1) = df %>%
    tab_cells(mdset(assign(paste0("q20s",i,"i1"),1) %to% assign(paste0("q20s",i,"i8"),1)))
    tab_total_row_position("none") %>%
    tab_stat_cpct() %>%
    tab_pivot()
}
Still, it causes the error described in the comment above the code.
Alternatively, the equivalent SPSS macro would be (shown only to better express the problem, because I have to avoid SPSS):
define macro1 (x = !tokens (1)
/y = !tokens (1))
!do !i = !x !to !y.
mrsets
/mdgroup name = !concat($SET_,!i)
variables = !concat("q20s",!i,"i1") to !concat("q20s",!i,"i8")
value = 1.
ctables
/table !concat($SET_,!i) [colpct.responses.count pct40.0].
!doend
!enddefine.
*** MACRO CALL.
macro1 x = 1 y = 10.
In other words I am looking for a working substitute of !concat() in R.
%to% is not suited for parametric variable selection. There is a set of special functions for parametric variable selection and assignment. One of them is mdset_t:
for(i in 1:10) {
  table_name = paste0("table_undropped", i)
  ..$table_name = df %>%
    tab_cells(mdset_t("q20s{i}i{1:8}")) %>% # expressions in the curly brackets will be evaluated and substituted
    tab_total_row_position("none") %>%
    tab_stat_cpct() %>%
    tab_pivot()
}
However, it is not good practice to store all the tables as separate variables in the global environment. A better approach is to save them in a list:
all_tables = lapply(1:10, function(i)
  df %>%
    tab_cells(mdset_t("q20s{i}i{1:8}")) %>%
    tab_total_row_position("none") %>%
    tab_stat_cpct() %>%
    tab_pivot()
)
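If you need to refer to individual tables later, you can name the list elements (a small optional addition):
names(all_tables) <- paste0("table_undropped", 1:10)
all_tables[["table_undropped3"]]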
UPDATE.
Generally speaking, there is no need to merge. You can do all your work with tab_*:
my_big_table = df %>%
  tab_total_row_position("none")

for(i in 1:10) {
  my_big_table = my_big_table %>%
    tab_cells(mdset_t("q20s{i}i{1:8}")) %>% # expressions in the curly brackets will be evaluated and substituted
    tab_stat_cpct()
}

my_big_table = my_big_table %>%
  tab_pivot(stat_position = "inside_columns") # here we say that we need to combine the subtables horizontally
I'm trying to do analysis from multiple csv files, and to create a key that can be used for left_join I think I need to merge two columns. At present I'm using the tidyverse packages (incl. mutate), but I'm running into an issue because the two columns to merge have different formats: one is a double and the other is in date format. I'm using the following code
qlik2 <- qlik %>%
  separate('Admit DateTime', into = c('Admit Date', 'Admit Time'), sep = 10) %>%
  mutate(key = MRN + `Admit Date`)
and getting this error:
Error in mutate_impl(.data, dots) :
  Evaluation error: non-numeric argument to binary operator.
If there's another way around this (or if the error is actually related to something else), then I'd appreciate any thoughts on the matter. Equally, if people know of a way to left_join with multiple keys, then that would work as well.
Thanks,
Cal
Hard to say without a reproducible example. But if I understand your question, you either want a numeric key, or you are trying to concatenate strings with the plus sign +.
Numeric key
library(hablar)
qlik2 <- qlik %>%
  separate('Admit DateTime',
           into = c('Admit Date', 'Admit Time'),
           sep = 10) %>%
  convert(num(MRN, `Admit Date`)) %>%
  mutate(key = MRN + `Admit Date`)
String key
qlik2 <- qlik %>%
  separate('Admit DateTime',
           into = c('Admit Date', 'Admit Time'),
           sep = 10) %>%
  mutate(key = paste(MRN, `Admit Date`))
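As for the multiple-key left_join mentioned in the question, by accepts several columns at once, so a combined key column may not be needed at all (other_table and its columns are assumptions here):
qlik2 <- qlik %>%
  separate('Admit DateTime', into = c('Admit Date', 'Admit Time'), sep = 10) %>%
  left_join(other_table, by = c("MRN", "Admit Date"))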
I'm new to working with Spark. I would like to multiply a large number of columns of a Spark dataframe by the values in a vector. So far with mtcars I used a for loop and mutate_at as follows:
library(dplyr)
library(rlang)
library(sparklyr)
sc1 <- spark_connect(master = "local")
mtcars_sp = sdf_copy_to(sc1, mtcars, overwrite = TRUE)
mtcars_cols = colnames(mtcars_sp)
mtc_factors = 0:10 / 10
# mutate 1 col at a time
for (i in 1:length(mtcars_cols)) {
  # set the equation - use sym() to convert a string to a symbol
  mtcars_eq = quo(UQ(sym(mtcars_cols[i])) * mtc_factors[i])
  # mutate formula - LHS resolves to a string, RHS a quosure
  mtcars_sp = mtcars_sp %>%
    mutate(!!mtcars_cols[i] := !!mtcars_eq)
}
dbplyr::sql_render(mtcars_sp)
mtcars_sp
This works OK with mtcars. However, it results in nested SQL queries being sent to Spark, as sql_render shows, and it breaks down with many columns. Can dplyr instead be used to send a single SQL query in this case?
BTW, I'd rather not transpose the data as it would be too expensive. Any help would be much appreciated!
In general you can use the great answer by Artem Sokolov:
library(glue)
mtcars_sp %>%
  mutate(!!! setNames(glue("{mtcars_cols} * {mtc_factors}"), mtcars_cols) %>%
    lapply(parse_quosure))
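Note that parse_quosure has since been deprecated in rlang; a sketch of the same idea with parse_exprs (assumes a reasonably recent rlang):
library(glue)
library(rlang)
mtcars_sp %>%
  mutate(!!! setNames(
    parse_exprs(as.character(glue("{mtcars_cols} * {mtc_factors}"))),
    mtcars_cols
  ))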
However, if this is input for MLlib algorithms, then ft_vector_assembler combined with ft_elementwise_product might be a better fit:
scaled <- mtcars_sp %>%
  ft_vector_assembler(mtcars_cols, "features") %>%
  ft_elementwise_product("features", "features_scaled", mtc_factors)
The result can be separated (I wouldn't recommend that if you're going with MLlib) into individual columns with sdf_separate_column:
scaled %>%
  select(features_scaled) %>%
  sdf_separate_column("features_scaled", mtcars_cols)
When an in-database (PostgreSQL) dplyr::mutate operation calculates the difference between two timestamps, a character vector is returned, with each element of the form:
> RPostgreSQL::dbGetQuery(db$con, 'select now() - current_date;')
?column?
1 09:23:48.880493
In this case it is HH:MM:SS.ssssss. How do I get dplyr to return this vector of time differences in seconds? That is, I would like to do the same thing as here, except have it as part of a mutate statement.
Example dplyr code would be:
tbl(db$con, 'tmp_table') %>%
  mutate(time_diff = received_at - started_at) %>%
  select(id, time_diff) %>%
  collect(n = Inf)
This is by no means a satisfactory answer to me, but a roundabout way of doing it is:
tmp_table <-
  tbl(db$con, 'tmp_table') %>%
  mutate(time_diff = received_at - started_at) %>%
  select(id, time_diff) %>%
  compute() # creates a temporary table
You can then find the name of the temporary table using:
as.character(tmp_table$ops$x$x)
In my case this was [1] "rzlhbxogjx". Then, using the linked answer you could do:
RPostgreSQL::dbGetQuery(db$con,
  paste0("select id, extract(epoch from time_diff) as time_diff from ",
         as.character(tmp_table$ops$x$x), ";"))
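For completeness, a less roundabout sketch (untested): dbplyr passes sql() literals through to the database untranslated, so the epoch extraction can live inside the mutate itself:
tbl(db$con, 'tmp_table') %>%
  mutate(time_diff = sql("extract(epoch from (received_at - started_at))")) %>%
  select(id, time_diff) %>%
  collect(n = Inf)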