I'm modifying nested data frames inside of foo with map2 and mutate, and I'd like to name a variable in each nested data frame according to foo$name. I'm not sure what the proper syntax for nse/tidyeval unquotation would be here.
My attempt:
library(tidyverse)
foo <- mtcars %>%
group_by(gear) %>%
nest %>%
mutate(name = c("one", "two", "three")) %>%
mutate(data = map2(data, name, ~
mutate(.x, !!(.y) := "anything")))
#> Error in quos(...): object '.y' not found
I want the name of the newly created variable inside the nested data frames to be "one", "two", and "three", respectively.
I'm basing my attempt off the normal syntax I'd use if I was doing a normal mutate on a normal df, and where name is a string:
name <- "test"
mtcars %>% mutate(!!name := "anything") # works fine
If successful, the following line should return TRUE:
foo[1,2] %>% unnest %>% names %>% .[11] == "one"
This seems to be a feature/bug (not sure, see linked GitHub issue below) of how !! works within mutate and map. The solution is to define a custom function, in which case the unquoting works as expected.
library(tidyverse)
custom_mutate <- function(df, name, string = "anything")
mutate(df, !!name := string)
foo <- mtcars %>%
group_by(gear) %>%
nest %>%
mutate(name = c("one", "two", "three")) %>%
mutate(data = map2(data, name, ~
custom_mutate(.x, .y)))
foo[1,2] %>% unnest %>% names %>% .[11] == "one"
#[1] TRUE
You find more details on GitHub under issue #541: map2() call in dplyr::mutate() error while standalone map2() call works; note that the issue has been closed in September 2018, so I am assuming this is intended behaviour.
An alternative might be to use group_split instead of nest, in which case we
avoid the unquoting issue
nms <- c("one", "two", "three")
mtcars %>%
group_split(gear) %>%
map2(nms, ~.x %>% mutate(!!.y := "anything"))
This is because of the timing of unquoting. Nesting tidy eval functions can be a bit tricky because it is the very first tidy eval function that processes the unquoting operators.
Let's rewrite this:
mutate(data = map2(data, name, ~ mutate(.x, !!.y := "anything")))
to
mutate(data = map2(data, name, function(x, y) mutate(x, !!y := "anything")))
The x and y bindings are only created when the function is called by map2(). So when the first mutate() runs, these bindings don't exist yet and you get an object not found error. With the formula it's a bit harder to see but the formula expands to a function taking .x and .y arguments so we have the same problem.
In general, it's better to avoid complex nested logic in your code because it makes it harder to read. With tidy eval that's even more complexity, so best do things in steps. As an added bonus, doing things in steps requires creating intermediate variables which, if well named, help understand what the function is doing.
Related
I'm new to R and I don't know all basic concepts yet. The task is to produce a one merged table with multiple response sets. I am trying to do this using expss library and a loop.
This is the code in R without a loop (works fine):
#libraries
#blah, blah...
#path
df.path = "C:/dataset.sav"
#dataset load
df = read_sav(df.path)
#table
table_undropped1 = df %>%
tab_cells(mdset(q20s1i1 %to% q20s1i8)) %>%
tab_total_row_position("none") %>%
tab_stat_cpct() %>%
tab_pivot()
There are 10 multiple response sets therefore I need to create 10 tables in a manner shown above. Then I transpose those tables and merge. To simplify the code (and learn something new) I decided to produce tables using a loop. However nothing works. I'd looked for a solution and I think the most close to correct one is:
#this generates a message: '1' not found
for(i in 1:10) {
assign(paste0("table_undropped",i),1) = df %>%
tab_cells(mdset(assign(paste0("q20s",i,"i1"),1) %to% assign(paste0("q20s",i,"i8"),1)))
tab_total_row_position("none") %>%
tab_stat_cpct() %>%
tab_pivot()
}
Still it causes an error described above the code.
Alternatively, an SPSS macro for that would be (published only to better express the problem because I have to avoid SPSS):
define macro1 (x = !tokens (1)
/y = !tokens (1))
!do !i = !x !to !y.
mrsets
/mdgroup name = !concat($SET_,!i)
variables = !concat("q20s",!i,"i1") to !concat("q20s",!i,"i8")
value = 1.
ctables
/table !concat($SET_,!i) [colpct.responses.count pct40.0].
!doend
!enddefine.
*** MACRO CALL.
macro1 x = 1 y = 10.
In other words I am looking for a working substitute of !concat() in R.
%to% is not suited for parametric variable selection. There is a set of special functions for parametric variable selection and assignment. One of them is mdset_t:
for(i in 1:10) {
table_name = paste0("table_undropped",i)
..$table_name = df %>%
tab_cells(mdset_t("q20s{i}i{1:8}")) %>% # expressions in the curly brackets will be evaluated and substituted
tab_total_row_position("none") %>%
tab_stat_cpct() %>%
tab_pivot()
}
However, it is not good practice to store all tables as separate variables in the global environment. Better approach is to save all tables in the list:
all_tables = lapply(1:10, function(i)
df %>%
tab_cells(mdset_t("q20s{i}i{1:8}")) %>%
tab_total_row_position("none") %>%
tab_stat_cpct() %>%
tab_pivot()
)
UPDATE.
Generally speaking, there is no need to merge. You can do all your work with tab_*:
my_big_table = df %>%
tab_total_row_position("none")
for(i in 1:10) {
my_big_table = my_big_table %>%
tab_cells(mdset_t("q20s{i}i{1:8}")) %>% # expressions in the curly brackets will be evaluated and substituted
tab_stat_cpct()
}
my_big_table = my_big_table %>%
tab_pivot(stat_position = "inside_columns") # here we say that we need combine subtables horizontally
Pretty basic but I don't think I really understand the change:
library(dplyr)
library(lubridate)
Lab_import_sql <- Lab_import %>%
select_if(~sum(!is.na(.)) > 0) %>%
mutate_if(is.factor, as.character) %>%
mutate_if(is.character, funs(ifelse(is.character(.), trimws(.),.))) %>%
mutate_at(.vars = Lab_import %>% select_if(grepl("'",.)) %>% colnames(),
.funs = gsub,
pattern = "'",
replacement = "''") %>%
mutate_if(is.character, funs(ifelse(is.character(.), paste0("'", ., "'"),.))) %>%
mutate_if(is.Date, funs(ifelse(is.Date(.), paste0("'", ., "'"),.)))
Edit:
Thanks everyone for the input, here's reproducible code and my solution:
library(dplyr)
library(lubridate)
import <- data.frame(Test_Name = "Fir'st Last",
Test_Date = "2019-01-01",
Test_Number = 10)
import_sql <-import %>%
select_if(~!all(is.na(.))) %>%
mutate_if(is.factor, as.character) %>%
mutate_if(is.character, trimws) %>%
mutate_if(is.character, list(~gsub("'", "''",.))) %>%
mutate_if(is.character, list(~paste0("'", ., "'"))) %>%
mutate_if(is.Date, list(~paste0("'", ., "'")))
As of dplyr 0.8.0, the documentation states that we should use list instead of funs, giving the example:
Before:
funs(name = f(.))
After:
list(name = ~f(.))
So here, the call funs(ifelse(is.character(.), trimws(.),.)) can become instead list(~ifelse(is.character(.), trimws(.),.)). This is using the formula notation for anonymous functions in the tidyverse, where a one-sided formula (expression beginning with ~) is interpreted as function(x), and wherever x would go in the function is represented by .. You can still use full functions inside list.
Note the difference between the .funs argument of mutate_if and the funs() function which wrapped other functions to pass to .funs; i.e. .funs = gsub still works. You only needed funs() if you needed to apply multiple functions to selected columns or to name them something by passing them as named arguments. You can do all the same things with list().
You also are duplicating work by adding ifelse inside mutate_if; that line could be simplified to mutate_if(is.character, trimws) since if the column is character already you don't need to check it again with ifelse. Since you apply only one function, no need for funs or list at all.
Pretty basic but I don't think I really understand the change:
library(dplyr)
library(lubridate)
Lab_import_sql <- Lab_import %>%
select_if(~sum(!is.na(.)) > 0) %>%
mutate_if(is.factor, as.character) %>%
mutate_if(is.character, funs(ifelse(is.character(.), trimws(.),.))) %>%
mutate_at(.vars = Lab_import %>% select_if(grepl("'",.)) %>% colnames(),
.funs = gsub,
pattern = "'",
replacement = "''") %>%
mutate_if(is.character, funs(ifelse(is.character(.), paste0("'", ., "'"),.))) %>%
mutate_if(is.Date, funs(ifelse(is.Date(.), paste0("'", ., "'"),.)))
Edit:
Thanks everyone for the input, here's reproducible code and my solution:
library(dplyr)
library(lubridate)
import <- data.frame(Test_Name = "Fir'st Last",
Test_Date = "2019-01-01",
Test_Number = 10)
import_sql <-import %>%
select_if(~!all(is.na(.))) %>%
mutate_if(is.factor, as.character) %>%
mutate_if(is.character, trimws) %>%
mutate_if(is.character, list(~gsub("'", "''",.))) %>%
mutate_if(is.character, list(~paste0("'", ., "'"))) %>%
mutate_if(is.Date, list(~paste0("'", ., "'")))
As of dplyr 0.8.0, the documentation states that we should use list instead of funs, giving the example:
Before:
funs(name = f(.))
After:
list(name = ~f(.))
So here, the call funs(ifelse(is.character(.), trimws(.),.)) can become instead list(~ifelse(is.character(.), trimws(.),.)). This is using the formula notation for anonymous functions in the tidyverse, where a one-sided formula (expression beginning with ~) is interpreted as function(x), and wherever x would go in the function is represented by .. You can still use full functions inside list.
Note the difference between the .funs argument of mutate_if and the funs() function which wrapped other functions to pass to .funs; i.e. .funs = gsub still works. You only needed funs() if you needed to apply multiple functions to selected columns or to name them something by passing them as named arguments. You can do all the same things with list().
You also are duplicating work by adding ifelse inside mutate_if; that line could be simplified to mutate_if(is.character, trimws) since if the column is character already you don't need to check it again with ifelse. Since you apply only one function, no need for funs or list at all.
EDIT: I reworked the question to make it clearer and integrate what I found by myself
Pipes are a great way to make the code more readable when using a single command chain
In some cases however, I feel one is forced to be inconsistent to its philosophy, either by creating unnecessary temp variables, mixing piping and embedded parenthesis, or defining custom functions.
See this SO question for example, where OP wants to know how to convert colnames to lower case with pipes: Dplyr or Magrittr - tolower?
I'll forget about the existence of names<- to make my point
There's basically 3 ways to do it:
Use a temp variable
temp <- df %>% names %>% tolower
df %>% setNames(temp)
Use embedded parenthesis
df %>% setNames(tolower(names(.)))
Define custom function
upcase <- function(df) {names(df) <- tolower(names(df)); df}
df %>% upcase
I think it would be more consistent to be able to do something like this:
df %T>% # create new branch with %T%>%
{names(.) %>% tolower %as% n} %>% # parallel branch assigned to alias n, then going back to main branch with %>%
setNames(n) # combine branches
For more complex cases, it is in my opinion more readable than the 3 examples above and I'm not polluting my workspace.
So far I've been able to come quite close, I can type:
df %T>%
{names(.) %>% tolower %as% n} %>%
setNames(A(n));fp()
OR (a little tribute to old school calculators)
df %1% # puts lhs in first memory slot (notice "%1%", I define these up to "%9%")
names %>%
tolower %>%
setNames(M(1),.);fp() # call the first stored value
(see code at bottom)
My issues are the following:
I create a new environment in my global environment, and I have to flush it manually with fp(), it's quite ugly
I'd like to get rid of this A function, but I don't understand well enough the environment structure of pipe chains to do so
Here's my code :
It creates an environment named PipeAliasEnv for aliases
%as% creates an alias in an isolated environment
%to% creates a variable in the calling environment
A calls an alias
fp removes all objects from PipeAliasEnv
This is the code that I used and a reproducible example solved in 4 different ways:
library(magrittr)
alias_init <- function(){
assign("PipeAliasEnv",new.env(),envir=.GlobalEnv)
assign("%as%" ,function(value,variable) {assign(as.character(substitute(variable)),value,envir=PipeAliasEnv)},envir=.GlobalEnv)
assign("%to%" ,function(value,variable) {assign(as.character(substitute(variable)),value,envir=parent.frame())},envir=.GlobalEnv)
assign("A" ,function(variable) { get(as.character(substitute(variable)), envir=PipeAliasEnv)},envir=.GlobalEnv)
assign("fp" ,function(remove_envir=FALSE){if(remove_envir) rm(PipeAliasEnv,envir=.GlobalEnv) else rm(list=ls(envir=PipeAliasEnv),envir=PipeAliasEnv)},envir=.GlobalEnv) # flush environment
# to handle `%i%` and M(i) notation, 9 should be enough :
sapply(1:9,function(i){assign(paste0("%",i,"%"),eval(parse(text=paste0('function(lhs,rhs){lhs <- eval(lhs)
rhs <- as.character(substitute(rhs))
str <- paste("lhs %>%",rhs[1],"(",paste(rhs[-1],collapse=","),")")
assign("x',i,'",lhs,envir=PipeAliasEnv)
eval(parse(text= str))}'))),envir=.GlobalEnv)})
assign("M" ,function(i) { get(paste0("x",as.character(substitute(i))), envir=PipeAliasEnv)},envir=.GlobalEnv)
}
alias_init()
# using %as%
df <- iris %T>%
{names(.) %>% toupper %as% n} %>%
setNames(A(n)) %T>%
{. %>% head %>% print}(.) ;fp()
# still using %as%, choosing another main chain
df <- iris %as% dataset %>%
names %>%
toupper %>%
setNames(A(dataset),.) %T>%
{. %>% head %>% print}(.);fp()
# using %to% (notice no assignment on 1st line)
iris %T>%
{names(.) %>% toupper %as% n} %>%
{setNames(.,A(n))} %to% df %>% # no need for '%T>%' and '{}' here
head %>% print;fp()
# or using the old school calculator fashion (probably the clearest for this precise task)
df <- iris %1%
names %>%
toupper %>%
setNames(M(1),.) %T>%
{. %>% head %>% print}(.);fp()
My question in short:
How do I get rid of A and fp ?
Bonus: %to% doesn't work when inside {}, how can I solve this ?
I'm trying to write a function that does a split-apply-combine for which the split variable(s) are parameters, and - importantly - a null split is acceptable. For example, running statistics either on subsets of data or on the entire dataset.
somedata=expand.grid(a=1:3,b=1:3)
somefun=function(df_in,grpvars=NULL){
df_in %>% group_by_(.dots=grpvars) %>% nest() %>%
mutate(X2.Resid=map(data,~with(.x,chisq.test(b)$residuals))) %>%
unnest(data,X2.Resid) %>% return()
}
somefun(somedata,"a") # This works
somefun(somedata) # This fails
The null condition fails because nest() seems to need a variable to nest by, rather than nesting the entire df into a 1x1 data.frame. I can get around this as follows:
somefun2=function(df_in,grpvars="Dummy"){
df_in$Dummy=1
df_in %>% group_by_(.dots=grpvars) %>% nest() %>%
mutate(X2.Resid=map(data,~with(.x,chisq.test(b)$residuals))) %>%
unnest(data,X2.Resid) %>%
select(-Dummy) %>% return()
}
somefun2(somedata) # This works
However, I'm wondering if there is a more elegant way to fix this, without needing the dummy variabe?
Hmm, that behavior is a little surprising to me. A fix is easy though: you just have to make sure you nest everything():
somefun3 <- function(df_in, grpvars = NULL) {
df_in %>%
group_by_(.dots = grpvars) %>%
nest(everything()) %>%
mutate(X2.Resid = map(data, ~with(.x, chisq.test(b)$residuals))) %>%
unnest()
}
somefun3(somedata, "a")
somefun3(somedata)
Both work.