I've been playing with magrittr, and I really like the resulting code. It's clean and can really save on typing.
How can I rename list elements in a magrittr pipeline?
In typical base R:
data_lists <- paste0("q",2011:2015)
data_lists <- lapply(data_lists,get)
names(data_lists) <- paste0("q",2011:2015)
In magrittr, I thought:
data_lists <-
paste0("q",2011:2015) %>%
lapply(.,get) %>%
names(.) %<>% paste0("q",2011:2015) # this is wrong
But... no dice.
magrittr provides a number of aliases for problems of this nature. Here is an example sequence using the alias set_names():
data_lists <-
paste0("q",2011:2015) %>%
lapply(.,get) %>%
set_names(paste0("q",2011:2015))
See ?extract for more aliases
Because everything in R is a function (mostly), you could also do
data_lists <-
paste0("q",2011:2015) %>%
lapply(.,get) %>%
`names<-`(paste0("q",2011:2015))
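As a self-contained sketch of the same idea, the example below creates hypothetical stand-ins for the q2011... objects in a separate environment (the environment `env` and the dummy data frames are assumptions made only for illustration) and names the resulting list with set_names(). It also shows base R's mget(), which returns an already-named list and may make the renaming step unnecessary.

```r
library(magrittr)

# Hypothetical stand-ins for the q2011..q2015 objects from the question,
# kept in their own environment so the lookup does not depend on the workspace
env <- new.env()
obj_names <- paste0("q", 2011:2013)
for (nm in obj_names) assign(nm, data.frame(year = nm), envir = env)

# Pipe version: fetch each object, then name the list with set_names()
data_lists <- obj_names %>%
  lapply(get, envir = env) %>%
  set_names(obj_names)

# Base R alternative: mget() returns a list that is already named
data_lists2 <- mget(obj_names, envir = env)

names(data_lists)
names(data_lists2)
```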
Related
I am trying to use the assertive package for run-time testing, and I would like to pass column names using the pipe.
Here's a simple example:
library(tidyverse)
library(assertive)
df <- tibble(Name = c("DONALD", "JAIME", "LINDA"))
This works but doesn't use the pipe:
assertive::assert_all_are_true(df$Name == str_to_upper(df$Name))
This uses the pipe, but doesn't work:
df %>% assertive::assert_all_are_true(Name == str_to_upper(Name))
#> Error in match.arg(severity): object 'Name' not found
How can I pipe column names to assertive?
We can use with
library(dplyr)
df %>%
with(., assertive::assert_all_are_true(Name == str_to_upper(Name)))
Or extract the column with .$
df %>%
{assertive::assert_all_are_true(.$Name == str_to_upper(.$Name))}
Or with the native pipe |>, available from R 4.1.0:
df |>
{\(x) assertive::assert_all_are_true(x$Name == str_to_upper(x$Name))}()
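Another option that may fit here is magrittr's exposition pipe %$%, which exposes the columns of the left-hand side to the expression on the right, much like with(). The sketch below uses a plain all() check instead of assertive so that it runs without that package; substituting assertive::assert_all_are_true() for all() is an untested assumption.

```r
library(magrittr)

df <- data.frame(Name = c("DONALD", "JAIME", "LINDA"))

# %$% makes the columns of df visible by name on the right-hand side
all_upper <- df %$% all(Name == toupper(Name))

# The same check written with with(); %>% passes df as the first argument
all_upper2 <- df %>% with(all(Name == toupper(Name)))

all_upper
all_upper2
```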
I have a problem using mutate(); please see the code block below.
output1 <- mytibble %>%
mutate(newfield = FND(mytibble$ndoc))
output1
Here the FND function applies a filter to a large file (5 GB):
FND <- function(n){
result <- LARGETIBBLE %>% filter(LARGETIBBLE$id == n)
return(paste(unique(result$somefield),collapse=" "))
}
I want to execute the FND function once for each row of the output1 tibble, but it is executed only once.
Avoid $ inside dplyr pipes; it is very rarely needed there. You can change your FND function to:
library(dplyr)
FND <- function(n){
LARGETIBBLE %>% filter(id == n) %>% pull(somefield) %>%
unique %>% paste(collapse = " ")
}
Now apply this function to every ndoc value in mytibble.
mytibble %>% mutate(newfield = purrr::map_chr(ndoc, FND))
You can also use sapply :
mytibble$newfield <- sapply(mytibble$ndoc, FND)
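To make the difference concrete, here is a small sketch with made-up data (the contents of LARGETIBBLE and mytibble below are assumptions, not the asker's data). It applies FND once per ndoc value with map_chr(), and also shows a possible alternative for a large table: summarise once per id and join, which avoids filtering the big table once per row.

```r
library(dplyr)
library(purrr)

# Made-up stand-ins for the asker's data
LARGETIBBLE <- tibble(id = c(1, 1, 2, 3), somefield = c("a", "b", "c", "d"))
mytibble <- tibble(ndoc = c(1, 2, 3))

FND <- function(n) {
  LARGETIBBLE %>%
    filter(id == n) %>%
    pull(somefield) %>%
    unique() %>%
    paste(collapse = " ")
}

# One call to FND per element of ndoc
out <- mytibble %>% mutate(newfield = map_chr(ndoc, FND))

# Alternative sketch: build the lookup once, then join
lookup <- LARGETIBBLE %>%
  group_by(id) %>%
  summarise(newfield = paste(unique(somefield), collapse = " "))
out2 <- mytibble %>% left_join(lookup, by = c("ndoc" = "id"))

out
out2
```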
Writing FND(mytibble$ndoc) is the base R data-frame style. When you use functions such as mutate() on a tibble, there is no need to specify the name of the tibble, only that of the column: mutate() evaluates its arguments in the context of the tibble that %>% passes in, so ndoc already refers to that column. Thus your example would be:
output1 <- mytibble %>%
mutate(newfield = FND(ndoc))
FND <- function(n){
result <- LARGETIBBLE %>% filter(id == n)
return(paste(unique(result$somefield),collapse=" "))
}
That is how it should look in principle. However, I do not know whether your FND function will work as written; try it, and if it does not, please post a small reproducible example with data and the result you are trying to achieve.
I have four functions, clean, clean2, cleanFun, and trim. Currently I apply the functions to one column, like so.
library(tidyverse)
library(data.table)
py17$CE.Finding.Description <- clean(py17$CE.Finding.Description)
py17$CE.Finding.Description <- clean2(py17$CE.Finding.Description)
py17$CE.Finding.Description <- cleanFun(py17$CE.Finding.Description)
py17$CE.Finding.Description <- trim(py17$CE.Finding.Description)
This process does the trick but I have to copy and paste this multiple times, and I'd eventually like to expand this to multiple columns.
For now, I'd like to save time and add an apply function but I'm not sure how to create that apply function. I've tried creating this.
maxclean <- function(cleaner) {
c(clean(cleaner), clean2(cleaner), cleanFun(cleaner), trim(cleaner))
}
py17$CE.Finding.Description <- sapply(py17$CE.Finding.Description, maxclean)
After trying this I just get
Error in `$<-.data.frame`(`*tmp*`, CE.Finding.Description, value = c(NA, :
replacement has 4 rows, data has 4318
I do not get any errors doing this the long way. Where am I going wrong on this?
Your maxclean function should take the same argument as the separate functions, in your case a vector, and then call each function in turn, feeding the result of one into the next. Like this:
maxclean <- function(x) {
x <- clean(x)
x <- clean2(x)
x <- cleanFun(x)
x <- trim(x)
return(x)
}
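The following sketch shows why the c() version fails and the chained version does not. The four cleaner bodies are hypothetical placeholders (the real clean, clean2, cleanFun and trim are not shown in the question); only the shape of the results matters.

```r
# Hypothetical placeholder cleaners, each taking and returning a character vector
clean    <- function(x) tolower(x)
clean2   <- function(x) gsub("[0-9]+", "", x)
cleanFun <- function(x) gsub("\\s+", " ", x)
trim     <- function(x) trimws(x)

x <- c("  Hello 123  World ", "FOO   42")

# Original attempt: c() returns four separately cleaned copies per element,
# so sapply() builds a matrix with 4 rows instead of a vector like x
bad <- sapply(x, function(cleaner) {
  c(clean(cleaner), clean2(cleaner), cleanFun(cleaner), trim(cleaner))
})
dim(bad)

# Chained version: each step works on the output of the previous one
maxclean <- function(x) {
  x <- clean(x)
  x <- clean2(x)
  x <- cleanFun(x)
  x <- trim(x)
  x
}
maxclean(x)
```

Because maxclean() is vectorised, it can be called directly on the column without sapply().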
Apparently, the OP has created a cleaning pipeline where the output of one step is fed into the next step and the final result of the pipeline overwrites the original input.
The magrittr package has the freduce() function which applies one function after the other in the described way. Thus,
py17$CE.Finding.Description <- clean(py17$CE.Finding.Description)
py17$CE.Finding.Description <- clean2(py17$CE.Finding.Description)
py17$CE.Finding.Description <- cleanFun(py17$CE.Finding.Description)
py17$CE.Finding.Description <- trim(py17$CE.Finding.Description)
can be written as:
library(magrittr)
fcts <- list(clean, clean2, cleanFun, trim)
py17$CE.Finding.Description %<>% freduce(fcts)
which is a shortcut for
py17$CE.Finding.Description <- py17$CE.Finding.Description %>%
clean() %>%
clean2() %>%
cleanFun() %>%
trim()
Here, %>% is the magrittr forward-pipe operator and %<>% is the magrittr compound assignment pipe-operator which updates the left-hand side object with the resulting value.
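As a rough illustration of what freduce() does, the sketch below compares it with an explicit pipe chain and with base R's Reduce(). The vector v and the two-function list are made up for the example.

```r
library(magrittr)

v <- c(-4, 9, -16)
fcts <- list(abs, sqrt)

# freduce() applies the functions one after the other to the value
via_freduce <- freduce(v, fcts)

# The same chain written out with the forward pipe
via_pipe <- v %>% abs() %>% sqrt()

# Base R equivalent: fold over the function list, starting from v
via_reduce <- Reduce(function(value, f) f(value), fcts, v)

via_freduce
```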
Reproducible example
Using the mtcars dataset:
data(mtcars)
mycars <- mtcars
mycars$mpg %<>%
{. - mean(.)} %>%
abs() %>%
sqrt()
mycars
or
mycars <- mtcars
mycars$mpg %<>% freduce(list(function(.) {. - mean(.)}, abs, sqrt))
mycars
Applying on multiple columns
The OP has mentioned that they would eventually like to expand this to multiple columns.
This can be achieved by, e.g.,
mycars <- mtcars
fcts <- list(function(.) {. - mean(.)}, abs, sqrt)
mycars$mpg %<>% freduce(fcts)
mycars$disp %<>% freduce(fcts)
mycars
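With more than a handful of columns, repeating one line per column gets tedious. One possible way to avoid that, sketched here under the assumption that every selected column should go through the same function list, is to lapply() over the columns (the vector cols is introduced only for this example):

```r
library(magrittr)

mycars <- mtcars
fcts <- list(function(x) x - mean(x), abs, sqrt)

# Columns to transform; every other column is left untouched
cols <- c("mpg", "disp")

# freduce(column, fcts) is called once per selected column
mycars[cols] <- lapply(mycars[cols], freduce, fcts)

head(mycars[cols])
```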
EDIT: I reworked the question to make it clearer and integrate what I found by myself
Pipes are a great way to make code more readable when using a single command chain.
In some cases, however, I feel one is forced to be inconsistent with this philosophy, either by creating unnecessary temp variables, mixing piping and nested parentheses, or defining custom functions.
See this SO question for example, where OP wants to know how to convert colnames to lower case with pipes: Dplyr or Magrittr - tolower?
I'll ignore the existence of names<- to make my point.
There are basically three ways to do it:
Use a temp variable
temp <- df %>% names %>% tolower
df %>% setNames(temp)
Use embedded parenthesis
df %>% setNames(tolower(names(.)))
Define custom function
lowcase <- function(df) {names(df) <- tolower(names(df)); df}
df %>% lowcase
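For reference, the three approaches can be checked against each other on a small data frame. In this sketch, head(iris) stands in for df and the helper name lower_names is hypothetical.

```r
library(magrittr)

df <- head(iris)

# 1. Temp variable
temp <- df %>% names %>% tolower
res_temp <- df %>% setNames(temp)

# 2. Nested call; the dot refers to df inside names()
res_nested <- df %>% setNames(tolower(names(.)))

# 3. Custom function (hypothetical helper name)
lower_names <- function(d) {
  names(d) <- tolower(names(d))
  d
}
res_fun <- df %>% lower_names

names(res_temp)
```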
I think it would be more consistent to be able to do something like this:
df %T>% # create new branch with %T>%
{names(.) %>% tolower %as% n} %>% # parallel branch assigned to alias n, then going back to main branch with %>%
setNames(n) # combine branches
For more complex cases, it is in my opinion more readable than the 3 examples above and I'm not polluting my workspace.
So far I've been able to come quite close, I can type:
df %T>%
{names(.) %>% tolower %as% n} %>%
setNames(A(n));fp()
OR (a little tribute to old school calculators)
df %1% # puts lhs in first memory slot (notice "%1%", I define these up to "%9%")
names %>%
tolower %>%
setNames(M(1),.);fp() # call the first stored value
(see code at bottom)
My issues are the following:
I create a new environment in my global environment and have to flush it manually with fp(), which is quite ugly.
I'd like to get rid of the A function, but I don't understand the environment structure of pipe chains well enough to do so.
Here's my code:
It creates an environment named PipeAliasEnv for aliases
%as% creates an alias in an isolated environment
%to% creates a variable in the calling environment
A calls an alias
fp removes all objects from PipeAliasEnv
This is the code that I used and a reproducible example solved in 4 different ways:
library(magrittr)
alias_init <- function(){
assign("PipeAliasEnv",new.env(),envir=.GlobalEnv)
assign("%as%" ,function(value,variable) {assign(as.character(substitute(variable)),value,envir=PipeAliasEnv)},envir=.GlobalEnv)
assign("%to%" ,function(value,variable) {assign(as.character(substitute(variable)),value,envir=parent.frame())},envir=.GlobalEnv)
assign("A" ,function(variable) { get(as.character(substitute(variable)), envir=PipeAliasEnv)},envir=.GlobalEnv)
assign("fp" ,function(remove_envir=FALSE){if(remove_envir) rm(PipeAliasEnv,envir=.GlobalEnv) else rm(list=ls(envir=PipeAliasEnv),envir=PipeAliasEnv)},envir=.GlobalEnv) # flush environment
# to handle `%i%` and M(i) notation, 9 should be enough :
sapply(1:9,function(i){assign(paste0("%",i,"%"),eval(parse(text=paste0('function(lhs,rhs){lhs <- eval(lhs)
rhs <- as.character(substitute(rhs))
str <- paste("lhs %>%",rhs[1],"(",paste(rhs[-1],collapse=","),")")
assign("x',i,'",lhs,envir=PipeAliasEnv)
eval(parse(text= str))}'))),envir=.GlobalEnv)})
assign("M" ,function(i) { get(paste0("x",as.character(substitute(i))), envir=PipeAliasEnv)},envir=.GlobalEnv)
}
alias_init()
# using %as%
df <- iris %T>%
{names(.) %>% toupper %as% n} %>%
setNames(A(n)) %T>%
{. %>% head %>% print}(.) ;fp()
# still using %as%, choosing another main chain
df <- iris %as% dataset %>%
names %>%
toupper %>%
setNames(A(dataset),.) %T>%
{. %>% head %>% print}(.);fp()
# using %to% (notice no assignment on 1st line)
iris %T>%
{names(.) %>% toupper %as% n} %>%
{setNames(.,A(n))} %to% df %>% # no need for '%T>%' and '{}' here
head %>% print;fp()
# or using the old school calculator fashion (probably the clearest for this precise task)
df <- iris %1%
names %>%
toupper %>%
setNames(M(1),.) %T>%
{. %>% head %>% print}(.);fp()
My question in short:
How do I get rid of A and fp?
Bonus: %to% doesn't work when inside {}, how can I solve this ?
Here's a piece of code:
library(dplyr)
data <- data.frame(a=runif(20),b=runif(20),subject=rep(1:2,10)) %>%
group_by(subject) %>%
do(distance = dist(.))
#no dplyr
intermediate <- lapply(data$distance,as.matrix)
mean.dists <- apply(simplify2array(intermediate),MARGIN = c(1,2),FUN=mean)
#dplyr
mean.dists <- lapply(data$distance,as.matrix) %>%
apply(simplify2array(.),MARGIN=c(1,2),FUN=mean)
Why does the "no dplyr" version work, and the "dplyr" version throws the error, "dim(X) must have a positive length"? They seem identical to me.
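The issue is that the pipeline is only partly written as a pipeline. You are using magrittr here, and the problem has little to do with dplyr. Because . appears only inside the nested call simplify2array(.), %>% still inserts the left-hand side as the first argument, so the call becomes apply(., simplify2array(.), MARGIN=c(1,2), FUN=mean). apply() then receives the plain list as X, which has no dim, hence the error. Piping each step separately avoids this: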
data$distance %>%
lapply(as.matrix ) %>%
simplify2array %>%
apply(MARGIN=1:2, FUN=mean)
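The first-argument rule can be seen on a tiny example. The list mats below is made up; the sketch captures the failing call with tryCatch(), then shows two working forms: braces, which stop magrittr from also inserting the left-hand side as the first argument, and the fully piped version.

```r
library(magrittr)

mats <- list(matrix(1:4, 2), matrix(5:8, 2))

# Fails: the list is still inserted as the first argument of apply()
res <- tryCatch(
  mats %>% apply(simplify2array(.), MARGIN = c(1, 2), FUN = mean),
  error = function(e) e
)
inherits(res, "error")

# Works: inside braces the dot is used only where it is written
m1 <- mats %>% { apply(simplify2array(.), MARGIN = c(1, 2), FUN = mean) }

# Works: every step is its own link in the chain
m2 <- mats %>%
  simplify2array() %>%
  apply(MARGIN = 1:2, FUN = mean)

m1
```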