Here is the sample data:
sample,fit_result,Site,Dx_Bin,dx,Hx_Prev,Hx_of_Polyps,Age,Gender,Smoke,Diabetic,Hx_Fam_CRC,Height,Weight,NSAID,Diabetes_Med,stage
2003650,0,U Michigan,High Risk Normal,normal,0,1,64,m,,0,1,182,120,0,0,0
2005650,0,U Michigan,High Risk Normal,normal,0,1,61,m,0,0,0,167,78,0,0,0
2007660,26,U Michigan,High Risk Normal,normal,0,1,47,f,0,0,1,170,63,0,0,0
2009650,10,Toronto,Adenoma,adenoma,0,1,81,f,1,0,0,168,65,1,0,0
2013660,0,U Michigan,Normal,normal,0,0,44,f,0,0,0,170,72,1,0,0
2015650,0,Dana Farber,High Risk Normal,normal,0,1,51,f,1,0,0,160,67,0,0,0
2017660,7,Dana Farber,Cancer,cancer,1,1,78,m,1,1,0,172,78,0,1,3
2019651,19,U Michigan,Normal,normal,0,0,59,m,0,0,0,177,65,0,0,0
2023680,0,Dana Farber,High Risk Normal,normal,1,1,63,f,1,0,0,154,54,0,0,0
2025653,1509,U Michigan,Cancer.,cancer,1,1,67,m,1,0,0,167,58,0,0,4
2027653,0,Toronto,Normal,normal,0,0,65,f,0,0,0,167,60,0,0,0
below is the R code
library(tidyverse)
h <- 'Height'
w <- 'Weight'
data %>% select(h) %>% filter(h > 180)
I can see only height column in output but filter is not applied. I dont get any error when i run the code. similarly, below code also does not work
s <- 'Site'
data %>% select(s) %>% mutate(s = str_replace(s," ","_"))
Output:
Site s
1 U Michigan Site
2 U Michigan Site
3 U Michigan Site
4 Toronto Site
I want to replce the space in Site column but obviously its not recognizing s and creating a new column s.
I tried running below code and still face the same issue.
exp <- substitute(s <- 'Site')
r <- eval(exp,data)
data %>% select(r) %>% mutate(r = str_replace(s," ","_"))
I searched everywhere and could not find a solution, Any help would be great. Thanks in advance (i know the normal way to do it i just want to be able to pass variables to the function)
We may either convert to sym and evaluate (!!). Also, if we want to assign on the lhs of the operator, use := instead of = and evaluate with !!
library(dplyr)
library(stringr)
data %>%
select(all_of(s)) %>%
mutate(!!s := str_replace(!! rlang::sym(s)," ","_"))
Similarly for the filter
data %>%
select(all_of(h)) %>%
filter(!! rlang::sym(h) > 180)
Yet another option would be to pass the variable objects in across (for filter can also use if_any/if_all) where we can pass one or more variables to loop across the columns
data %>%
select(all_of(s)) %>%
mutate(across(all_of(s), ~ str_replace(.x, " ", "_")))
Or use .data
data %>%
select(all_of(s)) %>%
mutate(!!s := str_replace(.data[[s]]," ","_"))
I have a question on how to use eval(parse(text=...)) in dbplyr SQL translation.
The following code works exactly what I want with dplyr using eval(parse(text=eval_text))
selected_col <- c("wt", "drat")
text <- paste(selected_col, ">3")
implode <- function(..., sep='|') {
paste(..., collapse=sep)
}
eval_text <- implode(text)
mtcars %>% dplyr::filter(eval(parse(text=eval_text)))
But when I put it into the database it returns an error message. I am looking for any solution that allows me to dynamically set the column names and filter with the or operator.
db <- tbl(con, "mtcars") %>%
dplyr::filter(eval(parse(eval_text)))
db <- collect(db)
Thanks!
Right approach, but dbplyr tends to work better with something that can receive the !! operator ('bang-bang' operator). At one point dplyr had *_ versions of functions (e.g. filter_) that accepted text inputs. This is now done using NSE (non-standard evaluation).
A couple of references: shiptech and r-bloggers (sorry couldn't find the official dplyr reference).
For your purposes you should find the following works:
library(rlang)
df %>% dplyr::filter(!!parse_expr(eval_text))
Full working:
library(dplyr)
library(dbplyr)
library(rlang)
data(mtcars)
df = tbl_lazy(mtcars, con = simulate_mssql()) # simulated database connection
implode <- function(..., sep='|') { paste(..., collapse=sep) }
selected_col <- c("wt", "drat")
text <- paste(selected_col, ">3")
eval_text <- implode(text)
df %>% dplyr::filter(eval(parse(eval_text))) # returns clearly wrong SQL
df %>% dplyr::filter(!!parse_expr(eval_text)) # returns valid & correct SQL
df %>% dplyr::filter(!!!parse_exprs(text)) # passes filters as a list --> AND (instead of OR)
I'm new to R and I don't know all basic concepts yet. The task is to produce a one merged table with multiple response sets. I am trying to do this using expss library and a loop.
This is the code in R without a loop (works fine):
#libraries
#blah, blah...
#path
df.path = "C:/dataset.sav"
#dataset load
df = read_sav(df.path)
#table
table_undropped1 = df %>%
tab_cells(mdset(q20s1i1 %to% q20s1i8)) %>%
tab_total_row_position("none") %>%
tab_stat_cpct() %>%
tab_pivot()
There are 10 multiple response sets therefore I need to create 10 tables in a manner shown above. Then I transpose those tables and merge. To simplify the code (and learn something new) I decided to produce tables using a loop. However nothing works. I'd looked for a solution and I think the most close to correct one is:
#this generates a message: '1' not found
for(i in 1:10) {
assign(paste0("table_undropped",i),1) = df %>%
tab_cells(mdset(assign(paste0("q20s",i,"i1"),1) %to% assign(paste0("q20s",i,"i8"),1)))
tab_total_row_position("none") %>%
tab_stat_cpct() %>%
tab_pivot()
}
Still it causes an error described above the code.
Alternatively, an SPSS macro for that would be (published only to better express the problem because I have to avoid SPSS):
define macro1 (x = !tokens (1)
/y = !tokens (1))
!do !i = !x !to !y.
mrsets
/mdgroup name = !concat($SET_,!i)
variables = !concat("q20s",!i,"i1") to !concat("q20s",!i,"i8")
value = 1.
ctables
/table !concat($SET_,!i) [colpct.responses.count pct40.0].
!doend
!enddefine.
*** MACRO CALL.
macro1 x = 1 y = 10.
In other words I am looking for a working substitute of !concat() in R.
%to% is not suited for parametric variable selection. There is a set of special functions for parametric variable selection and assignment. One of them is mdset_t:
for(i in 1:10) {
table_name = paste0("table_undropped",i)
..$table_name = df %>%
tab_cells(mdset_t("q20s{i}i{1:8}")) %>% # expressions in the curly brackets will be evaluated and substituted
tab_total_row_position("none") %>%
tab_stat_cpct() %>%
tab_pivot()
}
However, it is not good practice to store all tables as separate variables in the global environment. Better approach is to save all tables in the list:
all_tables = lapply(1:10, function(i)
df %>%
tab_cells(mdset_t("q20s{i}i{1:8}")) %>%
tab_total_row_position("none") %>%
tab_stat_cpct() %>%
tab_pivot()
)
UPDATE.
Generally speaking, there is no need to merge. You can do all your work with tab_*:
my_big_table = df %>%
tab_total_row_position("none")
for(i in 1:10) {
my_big_table = my_big_table %>%
tab_cells(mdset_t("q20s{i}i{1:8}")) %>% # expressions in the curly brackets will be evaluated and substituted
tab_stat_cpct()
}
my_big_table = my_big_table %>%
tab_pivot(stat_position = "inside_columns") # here we say that we need combine subtables horizontally
I have four functions, clean, clean2, cleanFun, and trim. Currently I apply the functions to one column, like so.
library(tidyverse)
library(data.table)
py17$CE.Finding.Description <- clean(py17$CE.Finding.Description)
py17$CE.Finding.Description <- clean2(py17$CE.Finding.Description)
py17$CE.Finding.Description <- cleanFun(py17$CE.Finding.Description)
py17$CE.Finding.Description <- trim(py17$CE.Finding.Description)
This process does the trick but I have to copy and paste this multiple times, and I'd eventually like to expand this to multiple columns.
For now, I'd like to save time and add an apply function but I'm not sure how to create that apply function. I've tried creating this.
maxclean <- function(cleaner) {
c(clean(cleaner), clean2(cleaner), cleanFun(cleaner), trim(cleaner))
}
py17$CE.Finding.Description <- sapply(py17$CE.Finding.Description, maxclean)
After trying this I just get
Error in `$<-.data.frame`(`*tmp*`, CE.Finding.Description, value = c(NA, :
replacement has 4 rows, data has 4318
I do not get any errors doing this the long way. Where am I going wrong on this?
Your maxclean function should take the same arguments as the separate functions. In your case - a vector. And then call each function in a row. Like this:
maxclean <- function(x) {
x <- clean(x)
x <- clean2(x)
x <- cleanFun(x)
x <- trim(x)
return(x)
}
Apparently, the OP has created a cleaning pipeline where the output of one step is fed into the next step and the final result of the pipeline overwrites the original input.
The magrittr package has the freduce() function which applies one function after the other in the described way. Thus,
py17$CE.Finding.Description <- clean(py17$CE.Finding.Description)
py17$CE.Finding.Description <- clean2(py17$CE.Finding.Description)
py17$CE.Finding.Description <- cleanFun(py17$CE.Finding.Description)
py17$CE.Finding.Description <- trim(py17$CE.Finding.Description)
can be written as:
library(magrittr)
fcts <- list(clean, clean2, cleanFun, trim)
py17$CE.Finding.Description %<>% freduce(fcts)
which is a shortcut for
py17$CE.Finding.Description <- py17$CE.Finding.Description %>%
clean() %>%
clean2() %>%
cleanFun() %>%
trim()
Here, %>% is the magrittr forward-pipe operator and %<>% is the magrittr compound assignment pipe-operator which updates the left-hand side object with the resulting value.
Reproducible example
Using the mtcars dataset:
data(mtcars)
mycars <- mtcars
mycars$mpg %<>%
{. - mean(.)} %>%
abs() %>%
sqrt()
mycars
or
mycars <- mtcars
mycars$mpg %<>% freduce(list(function(.) {. - mean(.)}, abs, sqrt))
mycars
Applying on multiple columns
The OP has mentioned that he eventually like to expand this to multiple columns
This can be achieved by, e.g.,
mycars <- mtcars
fcts <- list(function(.) {. - mean(.)}, abs, sqrt)
mycars$mpg %<>% freduce(fcts)
mycars$disp %<>% freduce(fcts)
mycars
I have a wide spark data frame of a few thousand columns by about a million rows, for which I would like to calculate the row totals. My solution so far is below. I used:
dplyr - sum of multiple columns using regular expressions and
https://github.com/tidyverse/rlang/issues/116
library(sparklyr)
library(DBI)
library(dplyr)
library(rlang)
sc1 <- spark_connect(master = "local")
wide_df = as.data.frame(matrix(ceiling(runif(2000, 0, 20)), 10, 200))
wide_sdf = sdf_copy_to(sc1, wide_df, overwrite = TRUE, name = "wide_sdf")
col_eqn = paste0(colnames(wide_df), collapse = "+" )
# build up the SQL query and send to spark with DBI
query = paste0("SELECT (",
col_eqn,
") as total FROM wide_sdf")
dbGetQuery(sc1, query)
# Equivalent approach using dplyr instead
col_eqn2 = quo(!! parse_expr(col_eqn))
wide_sdf %>%
transmute("total" := !!col_eqn2) %>%
collect() %>%
as.data.frame()
The problems come when the number of columns is increased. On spark SQL it seems to be calculated one element at a time i.e. (((V1 + V1) + V3) + V4)...) This is leading to errors due to very high recursion.
Does anyone have an alternative more efficient approach? Any help would be much appreciated.
You're out of luck here. One way or another you're are going to hit some recursion limits (even if you go around SQL parser, sufficiently large sum of expressions will crash query planner). There are some slow solutions available:
Use spark_apply (at the cost of conversion to and from R):
wide_sdf %>% spark_apply(function(df) { data.frame(total = rowSums(df)) })
Convert to long format and aggregate (at the cost of explode and shuffle):
key_expr <- "monotonically_increasing_id() AS key"
value_expr <- paste(
"explode(array(", paste(colnames(wide_sdf), collapse=","), ")) AS value"
)
wide_sdf %>%
spark_dataframe() %>%
# Add id and explode. We need a separate invoke so id is applied
# before "lateral view"
sparklyr::invoke("selectExpr", list(key_expr, "*")) %>%
sparklyr::invoke("selectExpr", list("key", value_expr)) %>%
sdf_register() %>%
# Aggregate by id
group_by(key) %>%
summarize(total = sum(value)) %>%
arrange(key)
To get something more efficient you should consider writing Scala extension and applying sum directly on a Row object, without exploding:
package com.example.sparklyr.rowsum
import org.apache.spark.sql.{DataFrame, Encoders}
object RowSum {
def apply(df: DataFrame, cols: Seq[String]) = df.map {
row => cols.map(c => row.getAs[Double](c)).sum
}(Encoders.scalaDouble)
}
and
invoke_static(
sc, "com.example.sparklyr.rowsum.RowSum", "apply",
wide_sdf %>% spark_dataframe
) %>% sdf_register()