I am interested in passing a string not as an argument to a function but as the entire function call. This may not be the smartest approach, but I am curious and want to understand how dplyr and R interpret strings. Perhaps I am missing something obvious, but here are my attempts:
#what i want----
library(dplyr)
mtcars %>% count()
#replicate by passing string as count---
#feed string as a function
my_string = "count()"
#attempt 1
mtcars %>% my_string
#attempt 2
mtcars %>% eval(noquote(my_string))
#neither of the attempts work
If this is not possible I understand, but it would be interesting if possible as I can see some applications for this in my mind.
EDIT
A little more to explain why I want to do this. I have worked with fst files for some time for some very large data and load data into my environment like so, often performing operations on one file at a time and in parallel which is very efficient for my purposes:
#pseudo code---
seq.Date(1,2,by = "days") %>%
pblapply(function(x){
read.fst(list.files(as.character(x)), as.data.table = TRUE) %>%
#this portion turn into a string----
group_by(foo) %>%
count()
#------------------------------
}) %>% rbindlist()
#application-------
my_string = "group_by(foo) %>%
count()"
seq.Date(1,2,by = "days") %>%
pblapply(function(x){
read.fst(list.files(as.character(x)), as.data.table = TRUE) %>% my_string
}) %>% rbindlist()
I use data.table more often, but I think dplyr might be better for this specific task. What I want is to write out the entire pipeline separately as a string and then pass it in. That would let me wrap the workflow in a package and shorten it.
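One way to get this behaviour, offered as a sketch rather than a recommendation: a string is not code until it is parsed, so `mtcars %>% my_string` fails because the pipe needs a function call on its right-hand side. Prepending magrittr's `. %>%` to the string and evaluating the parsed result gives an ordinary function that can then be piped into. The helper name `pipeline_from_string` is made up for illustration, and evaluating strings as code is only safe when you control the strings.

```r
library(dplyr)

# Hypothetical helper: turn "verb1(...) %>% verb2(...)" into a function.
# ". %>% rhs" is magrittr's way of building a function from pipeline steps.
pipeline_from_string <- function(steps) {
  eval(parse(text = paste(". %>%", steps)))
}

count_rows <- pipeline_from_string("count()")
by_cyl <- pipeline_from_string("group_by(cyl) %>%
  count()")

mtcars %>% count_rows()   # same result as mtcars %>% count()
mtcars %>% by_cyl()       # same result as mtcars %>% group_by(cyl) %>% count()
```

Inside `pblapply()` the generated function would be called the same way, e.g. `read.fst(...) %>% by_cyl()`.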
Related
I am trying to rewrite this expression to magrittr’s pipe operator:
print(mean(pull(df, height), na.rm=TRUE))
which returns 175.4 for my dataset.
I know that I have to start with the data frame and write df %>%, but I'm confused about how to turn the expression inside out. For example, should the na.rm=TRUE go inside mean(), pull() or print()?
UPDATE: I actually figured it out by trial and error...
df %>%
pull(height) %>%
mean(na.rm=TRUE) %>%
print()
returns 175.4
It would be good practice to make a reproducible example, with dummy data like this:
height <- 1:30
weight <- 1:30
df <- data.frame(height, weight)
The pipe operator works with most of the tidyverse, not just magrittr; pull() and summarise() come from dplyr. na.rm=TRUE is needed for many summary functions such as mean and sd, and for functions like min and max, whenever the data contain NA values, because by default these functions return NA if any value is missing.
df %>% pull(height) %>% mean(na.rm=T) %>% print()
Unless your data is nested you may not even need to use pull
df %>% summarise(mean = mean(height,na.rm=T))
Also, using summarise you can pipe these into another dataframe rather than just printing, and call them out of the dataframe whenever you want.
df %>% summarise(meanHt = mean(height,na.rm=T), sdHt = sd(height,na.rm=T)) -> summary
summary[1]
summary[2]
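To check that the rewrites above really are equivalent, here is a small sketch with dummy data that contains a missing value (the numbers are invented for illustration). It also shows that indexing with `[1]` gives back a one-column data frame, while `pull()` gives the bare number.

```r
library(dplyr)

df <- data.frame(height = c(170, 180, NA, 176), weight = c(65, 80, 72, 70))

nested <- mean(pull(df, height), na.rm = TRUE)
piped <- df %>% pull(height) %>% mean(na.rm = TRUE)
stats <- df %>%
  summarise(meanHt = mean(height, na.rm = TRUE), sdHt = sd(height, na.rm = TRUE))

identical(nested, piped)   # TRUE: the pipe only changes how the call is written
stats[1]                   # one-column data frame
stats %>% pull(meanHt)     # plain numeric value
mean(df$height)            # NA, because na.rm defaults to FALSE
```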
How to avoid assigning a value to a variable, then calling it separately right after to manipulate it? (Like so:)
df.3 <- filter(df.1, !(Patient_ID %in% df.2))
df.3 %>% count(MIPSGroup)
The only way I know how to do it is:
(df.3 <- filter(df.1, !(Patient_ID %in% df.2))) %>%
df.3 %>% count(MIPSGroup)
But there's gotta be a better way...
Thanks!
If you were willing to modify the original input, you could use magrittr's %<>% (compound assignment) operator:
(mtcars %<>% filter(cyl==6)) %>% count(mpg)
This modifies the value of mtcars according to the filter and prints the results of the count operation on the result.
There may be some way to use magrittr's other operators (e.g. %T>%) to get this done, but I haven't figured it out yet. I tried
((mtcars -> tmpcars) %<>% filter(cyl==6)) %>% count(mpg)
but R's parsing magic can't quite handle it.
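For what it's worth, here is one way the tee operator can be made to do this; treat it as a sketch. `%T>%` runs its right-hand side for the side effect and passes the left-hand side along unchanged, so `assign()` can store the intermediate result mid-chain. The environment `store` and the name `six` are invented for the example, and the plain `(x <- ...) %>%` form is arguably easier to read.

```r
library(dplyr)
library(magrittr)  # for %T>%

store <- new.env()

counts <- mtcars %>%
  filter(cyl == 6) %T>%
  assign("six", ., envir = store) %>%   # side effect: keep the filtered data
  count(gear)

# Parenthesised assignment does the same job without extra operators
(six <- filter(mtcars, cyl == 6)) %>% count(gear)
```

Unlike `%<>%`, neither form overwrites the original `mtcars`.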
I recently discovered the pipe operator %>%, which can make code more readable. Here is my MWE.
library(dplyr) # for the pipe operator
library(lsr) # for the cohensD function
set.seed(4) # make it reproducible
dat <- data.frame( # create data frame
subj = c(1:6),
pre = sample(1:6, replace = TRUE),
post = sample(1:6, replace = TRUE)
)
dat %>% select(pre, post) %>% sapply(., mean) # works as expected
However, I struggle using the pipe operator in this particular case
dat %>% select(pre, post) %>% cohensD(.$pre, .$post) # piping returns an error
cohensD(dat$pre, dat$post) # classical way works fine
Why is it not possible to subset columns using the placeholder . in combination with $? Is it worthwhile to write this line using the pipe operator %>%, or does it complicate the syntax? The classical way of writing this seems more concise.
This would work:
dat %>% select(pre, post) %>% {cohensD(.$pre, .$post)}
Wrapping the last call in curly braces makes the pipe treat it as an expression rather than a function call. When you pipe into an expression, the left-hand side is not inserted as a first argument, and every . is replaced as expected. I often use this trick to call a function that does not interface well with piping.
What is inside the braces happens to be a function call here, but it could be any expression involving the . placeholder.
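The difference is easy to see with a toy function that only reports how many arguments it received. This is a sketch; `count_args` and the small `dat` are invented for the illustration and stand in for `cohensD` and the real data.

```r
library(magrittr)

dat <- data.frame(pre = c(1, 4, 2), post = c(3, 5, 6))

# Hypothetical helper that just reports how many arguments it was given
count_args <- function(...) length(list(...))

without_braces <- dat %>% count_args(.$pre, .$post)     # dat is also passed first
with_braces <- dat %>% { count_args(.$pre, .$post) }    # only the two columns

without_braces  # 3
with_braces     # 2
```

So in the failing call, `cohensD` effectively received the whole data frame as its first argument, followed by the two columns.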
Since you're going from a bunch of data to one (row of) value(s), you're summarizing. In a dplyr pipeline you can use the summarize function for that; inside summarize you don't need to subset and can just refer to pre and post directly.
Like so:
dat %>% select(pre, post) %>% summarize(CD = cohensD(pre, post))
(The select statement isn't actually necessary in this case, but I left it in to show how this works in a pipeline)
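It doesn't work because the . placeholder only suppresses the automatic first argument when it is used directly as an argument. When it appears only inside a nested call (like .$pre), the piped data is still inserted as the first argument, so cohensD receives three arguments.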
If you really want to use piping, you can do it with the formula interface, but with a little reshaping before (melt is from reshape2 package):
dat %>% select(pre, post) %>% melt %>% cohensD(value~variable, .)
#### [1] 0.8115027
I've looked around StackOverflow for an answer here, but I think I may be missing a term. Here's the scenario:
I have a large data set with multiple groups that I want to report on. Let's say that this data set has answers to certain questions as columns, and I want to take specific columns and responses, group the answers, and perform counts. Essentially, I have a dplyr filter expression that would look like this:
z <- results %>% filter(AgeGroup %in% c("16-20", "21-25", "26-30")) %>%
group_by(AgeGroup) %>% summarize(ageCount=n())
Then I generate a table with the results using xtable() and dump them in my Rmarkdown document. What I'd like to do is create a function that can do this, such that I can do the following
resultPrint <- function(qualifier, groupColumn) {
return(results %>% filter(qualifier) %>%
group_by(groupColumn) %>% summarize(count=n()))
}
resultPrint("AgeGroup %in% c(\"16-20\", \"21-25\", \"26-30\")", "AgeGroup")
Or some equivalent.
Is there a way to do this in R? It would simplify a lot of code I am writing if I could. Thanks!
Thank you to r2evans! Here's my solution:
resultPrint <- function(qualifier, groupColumn) {
return(results %>% filter_(qualifier) %>%
group_by_(.dots = groupColumn) %>% summarize(count=n()))
}
filterClause = quote(AgeGroup %in% c("16-20", "21-25", "26-30"))
stuff <- resultPrint(filterClause, quote(AgeGroup))
Thank you!!
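A note for later readers: `filter_()` and `group_by_()` have since been deprecated in dplyr. Below is a sketch of how the same function might look today, assuming dplyr 1.0 or later. `{{ }}` forwards unquoted arguments; for the string interface from the question, `rlang::parse_expr()` and `all_of()` can be used instead. The function names and the toy `results` data are made up for the example.

```r
library(dplyr)

results <- data.frame(AgeGroup = c("16-20", "16-20", "21-25", "31-35"))

# Unquoted interface: arguments are forwarded with {{ }}
result_print <- function(data, qualifier, group_col) {
  data %>%
    filter({{ qualifier }}) %>%
    group_by({{ group_col }}) %>%
    summarize(count = n())
}

# String interface: the filter is parsed, the column is selected by name
result_print_str <- function(data, qualifier, group_col) {
  data %>%
    filter(!!rlang::parse_expr(qualifier)) %>%
    group_by(across(all_of(group_col))) %>%
    summarize(count = n())
}

a <- result_print(results, AgeGroup %in% c("16-20", "21-25"), AgeGroup)
b <- result_print_str(results, "AgeGroup %in% c(\"16-20\", \"21-25\")", "AgeGroup")
```

Both calls should give the same two-row table; passing the data frame as an argument instead of reading a global `results` also makes the function usable in a pipe.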
Is it possible to set all column names to upper or lower within a dplyr or magrittr chain?
In the example below I load the data and then, using a magrittr pipe, chain it through to my dplyr mutations. In the 4th line I use the tolower function, but this is for a different purpose: to create a new variable with lowercase observations.
mydata <- read.csv('myfile.csv') %>%
mutate(Year = mdy_hms(DATE),
Reference = (REFNUM),
Event = tolower(EVENT))
I'm obviously looking for something like colnames = tolower but know this doesn't work/exist.
I note the dplyr rename function but this isn't really helpful.
In magrittr the colname options are:
set_colnames instead of base R's colnames<-
set_names instead of base R's names<-
I've tried numerous permutations with these but no dice.
Obviously this is very simple in base R.
names(mydata) <- tolower(names(mydata))
However it seems incongruous with the dplyr/magrittr philosophies that you'd have to do that as a clunky one liner, before moving on to an elegant chain of dplyr/magrittr code.
With {dplyr} we can do:
mydata %>% rename_with(tolower)
or, with the older scoped variant (now superseded):
mydata %>% rename_all(tolower)
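A sketch of how this fits into a longer chain, using `rename_with()` (available from dplyr 1.0; its optional `.cols` argument restricts which columns are renamed). `iris` stands in for the CSV from the question.

```r
library(dplyr)

# Lowercase every column name, then carry on with the usual verbs
lowered <- iris %>%
  rename_with(tolower) %>%
  mutate(species = toupper(as.character(species)))

# Rename only a subset of the columns
partly <- iris %>% rename_with(toupper, .cols = starts_with("Sepal"))

names(lowered)
names(partly)
```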
iris %>% setNames(tolower(names(.))) %>% head
Or equivalently use replacement function in non-replacement form:
iris %>% `names<-`(tolower(names(.))) %>% head
iris %>% `colnames<-`(tolower(names(.))) %>% head # if you really want to use `colnames<-`
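A little background on why this works, as a base R sketch: `names(x) <- value` is syntactic sugar for calling the function `` `names<-` `` and assigning the result back to `x`. Called directly, the function returns a renamed copy, which is what makes it usable in the middle of a pipe.

```r
x <- c(A = 1, B = 2)

# Direct call: returns a renamed copy, x itself keeps its names
y <- `names<-`(x, tolower(names(x)))

names(x)  # "A" "B"
names(y)  # "a" "b"

# setNames() is a thin wrapper around the same idea
z <- setNames(x, tolower(names(x)))
identical(y, z)  # TRUE
```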
Using magrittr's "compound assignment pipe-operator" %<>% might be, if I understand your question correctly, an even more succinct option.
library("magrittr")
names(iris) %<>% tolower
?`%<>%` # for more
mtcars %>%
set_colnames(value = casefold(colnames(.), upper = FALSE)) %>%
head
casefold is available in base R and can convert in both directions: the upper flag selects either all upper case or all lower case, as needed.
Also colnames() will use only column headers for case conversion.
You could also define a function:
upcase <- function(df) {
names(df) <- toupper(names(df))
df
}
library(dplyr)
mtcars %>% upcase %>% select(MPG)