How to properly parse (?) mdsets in expss within a loop?

I'm new to R and don't know all the basic concepts yet. The task is to produce one merged table from multiple response sets. I am trying to do this using the expss library and a loop.
This is the code in R without a loop (works fine):
#libraries
#blah, blah...
#path
df.path = "C:/dataset.sav"
#dataset load
df = read_sav(df.path)
#table
table_undropped1 = df %>%
    tab_cells(mdset(q20s1i1 %to% q20s1i8)) %>%
    tab_total_row_position("none") %>%
    tab_stat_cpct() %>%
    tab_pivot()
There are 10 multiple response sets, so I need to create 10 tables in the manner shown above, then transpose those tables and merge them. To simplify the code (and learn something new) I decided to produce the tables using a loop. However, nothing works. I've looked for a solution, and I think the closest to correct is:
# this generates a message: '1' not found
for(i in 1:10) {
    assign(paste0("table_undropped", i), 1) = df %>%
        tab_cells(mdset(assign(paste0("q20s", i, "i1"), 1) %to% assign(paste0("q20s", i, "i8"), 1))) %>%
        tab_total_row_position("none") %>%
        tab_stat_cpct() %>%
        tab_pivot()
}
Still, it produces the error noted in the comment above the code.
Alternatively, an SPSS macro for this would be (included only to better express the problem, because I have to avoid SPSS):
define macro1 (x = !tokens (1)
/y = !tokens (1))
!do !i = !x !to !y.
mrsets
/mdgroup name = !concat($SET_,!i)
variables = !concat("q20s",!i,"i1") to !concat("q20s",!i,"i8")
value = 1.
ctables
/table !concat($SET_,!i) [colpct.responses.count pct40.0].
!doend
!enddefine.
*** MACRO CALL.
macro1 x = 1 y = 10.
In other words I am looking for a working substitute of !concat() in R.
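As a side note, the literal R counterpart of SPSS's !concat() is plain string concatenation with paste0() (or sprintf()); a minimal illustration, assuming i holds the set number:
i <- 3
paste0("q20s", i, "i1")    # "q20s3i1"
sprintf("q20s%di%d", i, 1) # the same result with a format string
The difficulty is that names built this way are character strings, which the plain %to% selector does not evaluate, as the answer below explains.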

%to% is not suited for parametric variable selection. There is a set of special functions for parametric variable selection and assignment. One of them is mdset_t:
for(i in 1:10) {
    table_name = paste0("table_undropped", i)
    # '..$' is expss's parametric assignment: it assigns to the variable whose name is stored in table_name
    ..$table_name = df %>%
        tab_cells(mdset_t("q20s{i}i{1:8}")) %>% # expressions in the curly brackets will be evaluated and substituted
        tab_total_row_position("none") %>%
        tab_stat_cpct() %>%
        tab_pivot()
}
However, it is not good practice to store all tables as separate variables in the global environment. A better approach is to save all the tables in a list:
all_tables = lapply(1:10, function(i)
    df %>%
        tab_cells(mdset_t("q20s{i}i{1:8}")) %>%
        tab_total_row_position("none") %>%
        tab_stat_cpct() %>%
        tab_pivot()
)
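If you still need a single combined table, the list can then be collapsed in one step. A minimal sketch, assuming expss's add_rows() (which stacks tables) gives the combination you want; transpose each table first if you need the transposed layout from the question:
merged_table = Reduce(expss::add_rows, all_tables)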
UPDATE.
Generally speaking, there is no need to merge. You can do all your work with tab_*:
my_big_table = df %>%
    tab_total_row_position("none")

for(i in 1:10) {
    my_big_table = my_big_table %>%
        tab_cells(mdset_t("q20s{i}i{1:8}")) %>% # expressions in the curly brackets will be evaluated and substituted
        tab_stat_cpct()
}

my_big_table = my_big_table %>%
    tab_pivot(stat_position = "inside_columns") # here we say that we need to combine the subtables horizontally

Related

Using a for loop in R to assign value labels

Context: I have a large dataset (CoreData) with an accompanying data file (CoreValues) that contains the codes and values for each variable within the dataset.
Problem: I want to use a loop to assign each variable within the dataset (CoreData) the correct value labels (from the CoreValues data).
What I've tried so far:
I have created a character vector that identifies which variables within my main data (CoreData) have values that need to be added:
Core_VarwithValueLabels<- unique(CoreValues$Abbreviation)
I have tried a for loop over that vector to create vectors for both the labels and levels arguments that feed into the factor() function.
for (i in Core_VarwithValueLabels){
    assign(paste0(i, 'Labels'),
           CoreValues %>%
               filter(Abbreviation == i) %>%
               select(Description) %>%
               unique() %>%
               unlist()
    )
    assign(paste0(i, 'Levels'),
           CoreValues %>%
               filter(Abbreviation == i) %>%
               select(Code) %>%
               unique() %>%
               unlist()
    )
    CoreData[i] <- factor(CoreData[i], levels = paste0(i, 'Levels'), labels = paste0(i, 'Labels'))
}
This creates the correct label and level vectors; however, they are not picked up properly by factor(), because paste0(i, 'Levels') is just a character string, so factor() treats the string itself as a level rather than looking up the vector of that name.
Question: Can you help me identify how to get my factor function to work within this loop or if there is a more appropriate method?
Sample data (shown as screenshots in the original post):
CoreValues: [example data from CoreValues]
CoreData: [example data from CoreData]
UPDATE: RESOLVED
I have now resolved this by using the get() function within my factor() call: it takes the strings I've created with paste0() and finds the vector of that name.
for (i in Core_VarwithValueLabels){
    assign(paste0(i, 'Labels'),
           CoreValues %>%
               filter(Abbreviation == i) %>%
               select(Description) %>%
               unique() %>%
               unlist()
    )
    assign(paste0(i, 'Levels'),
           CoreValues %>%
               filter(Abbreviation == i) %>%
               select(Code) %>%
               unique() %>%
               unlist()
    )
    # get() resolves each string to the vector of that name; '[[' extracts the column as a vector
    CoreData[[i]] <- factor(CoreData[[i]], levels = get(paste0(i, 'Levels')), labels = get(paste0(i, 'Labels')))
}
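For what it's worth, the assign()/get() pair can be avoided entirely by keeping one lookup table per variable in a named list. A minimal sketch, assuming CoreValues has the Abbreviation, Code and Description columns described above:
# build one lookup table per variable instead of pairs of free-floating vectors
lookups <- split(CoreValues[c("Code", "Description")], CoreValues$Abbreviation)
for (i in Core_VarwithValueLabels) {
    lk <- unique(lookups[[i]])
    CoreData[[i]] <- factor(CoreData[[i]], levels = lk$Code, labels = lk$Description)
}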

Problem with mutate keyword and functions in R

I have a problem with the use of mutate(); please check the next code block.
output1 <- mytibble %>%
    mutate(newfield = FND(mytibble$ndoc))
output1
where the FND function is a filter applied to a large file (5 GB):
FND <- function(n){
    result <- LARGETIBBLE %>% filter(LARGETIBBLE$id == n)
    return(paste(unique(result$somefield), collapse = " "))
}
I want to execute the FND function for each row of the output1 tibble, but it only executes once.
Never use $ in dplyr pipes; it is very rarely needed there. You can change your FND function to:
library(dplyr)

FND <- function(n){
    LARGETIBBLE %>%
        filter(id == n) %>%
        pull(somefield) %>%
        unique %>%
        paste(collapse = " ")
}
Now apply this function to every ndoc value in mytibble.
mytibble %>% mutate(newfield = purrr::map_chr(ndoc, FND))
You can also use sapply:
mytibble$newfield <- sapply(mytibble$ndoc, FND)
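A hedged aside (my addition, not part of the answer): since LARGETIBBLE is 5 GB, a join is usually much faster than filtering it once per row. A sketch of that alternative, assuming the same id and somefield columns:
# pre-aggregate LARGETIBBLE once, then join instead of calling FND per row
lookup <- LARGETIBBLE %>%
    distinct(id, somefield) %>%
    group_by(id) %>%
    summarise(newfield = paste(somefield, collapse = " "))
mytibble %>% left_join(lookup, by = c("ndoc" = "id"))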
FND(mytibble$ndoc) is the idiom for plain data frames. When you use functions such as mutate() on a tibble, there is no need to specify the name of the tibble, only that of the column; the %>% pipe already makes sure that only data from the tibble is used. Thus your example would be:
output1 <- mytibble %>%
    mutate(newfield = FND(ndoc))

FND <- function(n){
    result <- LARGETIBBLE %>% filter(id == n)
    return(paste(unique(result$somefield), collapse = " "))
}
That is the theory; however, I do not know whether your FND function will work. Try it, and if it does not, give a practical example with data and what you are trying to achieve.
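A caveat worth adding (my note, not part of either answer): as written, mutate() hands FND the whole ndoc column in one call, and filter(id == n) then compares id against that entire vector, which is why the function appears to run only once. Base R's Vectorize() is a quick, if not fast, fix:
FND_v <- Vectorize(FND)  # wraps FND so it is applied element-wise
output1 <- mytibble %>% mutate(newfield = FND_v(ndoc))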

Efficiently calculate row totals of a wide Spark DF

I have a wide Spark data frame of a few thousand columns by about a million rows, for which I would like to calculate the row totals. My solution so far is below. I used:
dplyr - sum of multiple columns using regular expressions and
https://github.com/tidyverse/rlang/issues/116
library(sparklyr)
library(DBI)
library(dplyr)
library(rlang)

sc1 <- spark_connect(master = "local")
wide_df = as.data.frame(matrix(ceiling(runif(2000, 0, 20)), 10, 200))
wide_sdf = sdf_copy_to(sc1, wide_df, overwrite = TRUE, name = "wide_sdf")
col_eqn = paste0(colnames(wide_df), collapse = "+")

# build up the SQL query and send to spark with DBI
query = paste0("SELECT (", col_eqn, ") as total FROM wide_sdf")
dbGetQuery(sc1, query)
# Equivalent approach using dplyr instead
col_eqn2 = quo(!!parse_expr(col_eqn))
wide_sdf %>%
    transmute("total" := !!col_eqn2) %>%
    collect() %>%
    as.data.frame()
The problems come when the number of columns is increased. In Spark SQL the total seems to be calculated one element at a time, i.e. ((((V1 + V2) + V3) + V4) ...), which leads to errors due to very deep recursion.
Does anyone have an alternative more efficient approach? Any help would be much appreciated.
You're out of luck here. One way or another you are going to hit some recursion limits (even if you go around the SQL parser, a sufficiently large sum of expressions will crash the query planner). There are some slow solutions available:
Use spark_apply (at the cost of conversion to and from R):
wide_sdf %>% spark_apply(function(df) { data.frame(total = rowSums(df)) })
Convert to long format and aggregate (at the cost of explode and shuffle):
key_expr <- "monotonically_increasing_id() AS key"
value_expr <- paste(
    "explode(array(", paste(colnames(wide_sdf), collapse = ","), ")) AS value"
)

wide_sdf %>%
    spark_dataframe() %>%
    # Add id and explode. We need a separate invoke so id is applied
    # before "lateral view"
    sparklyr::invoke("selectExpr", list(key_expr, "*")) %>%
    sparklyr::invoke("selectExpr", list("key", value_expr)) %>%
    sdf_register() %>%
    # Aggregate by id
    group_by(key) %>%
    summarize(total = sum(value)) %>%
    arrange(key)
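A middle-ground workaround (my sketch, not from the original answer, and untested at Spark scale) is to keep every expression shallow by summing the columns in chunks and then summing the chunk subtotals; this stays in dplyr, assuming all columns are numeric and that two projections are acceptable:
# sum in chunks of 100 columns so no single SQL expression gets too deep
cols <- colnames(wide_sdf)
chunks <- split(cols, ceiling(seq_along(cols) / 100))
partials <- lapply(chunks, function(cs) parse_expr(paste(cs, collapse = " + ")))
names(partials) <- paste0("part_", seq_along(partials))
wide_sdf %>%
    transmute(!!!partials) %>%
    transmute(total = !!parse_expr(paste(names(partials), collapse = " + ")))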
To get something more efficient, you should consider writing a Scala extension and applying the sum directly on a Row object, without exploding:
package com.example.sparklyr.rowsum

import org.apache.spark.sql.{DataFrame, Encoders}

object RowSum {
    def apply(df: DataFrame, cols: Seq[String]) = df.map {
        row => cols.map(c => row.getAs[Double](c)).sum
    }(Encoders.scalaDouble)
}
and
invoke_static(
    sc, "com.example.sparklyr.rowsum.RowSum", "apply",
    wide_sdf %>% spark_dataframe, colnames(wide_sdf) # pass the column names required by RowSum.apply
) %>% sdf_register()
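Note (my addition): for invoke_static() to find the class, the Scala object has to be compiled into a jar and attached to the Spark session first, for example via the sparklyr.jars.default entry of the spark_connect() config.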

Use multiple command chains with piping

EDIT: I reworked the question to make it clearer and to integrate what I found by myself.
Pipes are a great way to make code more readable when using a single command chain.
In some cases, however, I feel one is forced to be inconsistent with their philosophy, either by creating unnecessary temp variables, mixing piping and embedded parentheses, or defining custom functions.
See this SO question for example, where the OP wants to know how to convert colnames to lower case with pipes: Dplyr or Magrittr - tolower?
(I'll forget about the existence of names<- to make my point.)
There are basically three ways to do it:
Use a temp variable
temp <- df %>% names %>% tolower
df %>% setNames(temp)
Use embedded parentheses
df %>% setNames(tolower(names(.)))
Define a custom function
upcase <- function(df) {names(df) <- tolower(names(df)); df}
df %>% upcase
I think it would be more consistent to be able to do something like this:
df %T>% # create a new branch with %T>%
    {names(.) %>% tolower %as% n} %>% # parallel branch assigned to alias n, then back to the main branch with %>%
    setNames(n) # combine branches
For more complex cases this is, in my opinion, more readable than the three examples above, and I'm not polluting my workspace.
So far I've been able to come quite close; I can type:
df %T>%
    {names(.) %>% tolower %as% n} %>%
    setNames(A(n)); fp()
OR (a little tribute to old school calculators)
df %1% # puts lhs in the first memory slot (notice "%1%"; I define these up to "%9%")
    names %>%
    tolower %>%
    setNames(M(1), .); fp() # call the first stored value
(see code at bottom)
My issues are the following:
I create a new environment in my global environment, and I have to flush it manually with fp(); it's quite ugly.
I'd like to get rid of this A function, but I don't understand the environment structure of pipe chains well enough to do so.
Here's my code. It creates an environment named PipeAliasEnv for the aliases:
%as% creates an alias in an isolated environment
%to% creates a variable in the calling environment
A calls an alias
fp removes all objects from PipeAliasEnv
Below is the code I used, followed by a reproducible example solved in 4 different ways:
library(magrittr)
alias_init <- function(){
assign("PipeAliasEnv",new.env(),envir=.GlobalEnv)
assign("%as%" ,function(value,variable) {assign(as.character(substitute(variable)),value,envir=PipeAliasEnv)},envir=.GlobalEnv)
assign("%to%" ,function(value,variable) {assign(as.character(substitute(variable)),value,envir=parent.frame())},envir=.GlobalEnv)
assign("A" ,function(variable) { get(as.character(substitute(variable)), envir=PipeAliasEnv)},envir=.GlobalEnv)
assign("fp" ,function(remove_envir=FALSE){if(remove_envir) rm(PipeAliasEnv,envir=.GlobalEnv) else rm(list=ls(envir=PipeAliasEnv),envir=PipeAliasEnv)},envir=.GlobalEnv) # flush environment
# to handle `%i%` and M(i) notation, 9 should be enough :
sapply(1:9,function(i){assign(paste0("%",i,"%"),eval(parse(text=paste0('function(lhs,rhs){lhs <- eval(lhs)
rhs <- as.character(substitute(rhs))
str <- paste("lhs %>%",rhs[1],"(",paste(rhs[-1],collapse=","),")")
assign("x',i,'",lhs,envir=PipeAliasEnv)
eval(parse(text= str))}'))),envir=.GlobalEnv)})
assign("M" ,function(i) { get(paste0("x",as.character(substitute(i))), envir=PipeAliasEnv)},envir=.GlobalEnv)
}
alias_init()
# using %as%
df <- iris %T>%
    {names(.) %>% toupper %as% n} %>%
    setNames(A(n)) %T>%
    {. %>% head %>% print}(.); fp()

# still using %as%, choosing another main chain
df <- iris %as% dataset %>%
    names %>%
    toupper %>%
    setNames(A(dataset), .) %T>%
    {. %>% head %>% print}(.); fp()

# using %to% (notice no assignment on the 1st line)
iris %T>%
    {names(.) %>% toupper %as% n} %>%
    {setNames(., A(n))} %to% df %>% # no need for '%T>%' and '{}' here
    head %>% print; fp()

# or in the old school calculator fashion (probably the clearest for this precise task)
df <- iris %1%
    names %>%
    toupper %>%
    setNames(M(1), .) %T>%
    {. %>% head %>% print}(.); fp()
My question in short:
How do I get rid of A and fp?
Bonus: %to% doesn't work when inside {}; how can I solve this?

Getting the tidyr::nest() -> purrr::map() workflow to work for the special case of no grouping var

I'm trying to write a function that does a split-apply-combine for which the split variable(s) are parameters, and - importantly - a null split is acceptable. For example, running statistics either on subsets of data or on the entire dataset.
somedata = expand.grid(a = 1:3, b = 1:3)

somefun = function(df_in, grpvars = NULL){
    df_in %>% group_by_(.dots = grpvars) %>% nest() %>%
        mutate(X2.Resid = map(data, ~with(.x, chisq.test(b)$residuals))) %>%
        unnest(data, X2.Resid) %>% return()
}
somefun(somedata,"a") # This works
somefun(somedata) # This fails
The null condition fails because nest() seems to need a variable to nest by, rather than nesting the entire df into a 1x1 data.frame. I can get around this as follows:
somefun2 = function(df_in, grpvars = "Dummy"){
    df_in$Dummy = 1
    df_in %>% group_by_(.dots = grpvars) %>% nest() %>%
        mutate(X2.Resid = map(data, ~with(.x, chisq.test(b)$residuals))) %>%
        unnest(data, X2.Resid) %>%
        select(-Dummy) %>% return()
}
somefun2(somedata) # This works
However, I'm wondering if there is a more elegant way to fix this, without needing the dummy variable?
Hmm, that behavior is a little surprising to me. A fix is easy though: you just have to make sure you nest everything():
somefun3 <- function(df_in, grpvars = NULL) {
    df_in %>%
        group_by_(.dots = grpvars) %>%
        nest(everything()) %>%
        mutate(X2.Resid = map(data, ~with(.x, chisq.test(b)$residuals))) %>%
        unnest()
}
somefun3(somedata, "a")
somefun3(somedata)
Both work.
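For readers on current tidyverse versions, note that group_by_() has since been deprecated. A hedged sketch of the same idea in modern tidyr (>= 1.0): nest everything except the grouping variables, which also handles grpvars = NULL, because negating an empty selection selects every column:
somefun4 <- function(df_in, grpvars = NULL) {
    df_in %>%
        nest(data = !all_of(grpvars)) %>% # grpvars = NULL nests the whole frame into one row
        mutate(X2.Resid = map(data, ~with(.x, chisq.test(b)$residuals))) %>%
        unnest(c(data, X2.Resid))
}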
