dplyr group_by by variable column name

I use the dplyr group_by function to group my data frame, and I need to group the data by a column whose name I don't know in advance; it is decided later in the code, so the name can't be hard-coded.
For example, I can't use
data %>% group_by(col_name)
I need to do something like
c <- col_name
data %>% group_by(c)
When I try this, it throws an error:
Error: unknown variable to group by : c
All the examples I find, in the group_by examples and in the R help, are for the trivial case where you can hard-code the name of the column.
Thanks.

You want to look up NSE (non-standard evaluation), as others have said in the comments. That approach requires the lazyeval package together with the group_by_ function, which lets you use standard evaluation. It looks like:
data %>% group_by_(lazyeval::interp(~var, var = as.name(c)))
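(For reference: group_by_() and the lazyeval interface have since been deprecated. A minimal sketch of the current idiom, assuming dplyr >= 1.0, where the column name is held in a string:)

```r
library(dplyr)

col_name <- "cyl"  # column name decided at run time

# Look the column up by string with the .data pronoun
by_pronoun <- mtcars %>%
  group_by(.data[[col_name]]) %>%
  summarise(mean_mpg = mean(mpg))

# Equivalent: across(all_of()) also accepts a character vector of names
by_across <- mtcars %>%
  group_by(across(all_of(col_name))) %>%
  summarise(mean_mpg = mean(mpg))
```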

Related

dplyr passing column names as a variable with is.na filter

I am aware that similar questions have been asked and I have tried multiple options but I am still having an error message.
df_construction <- function(selected_month, selected_variable){
  selected_variable_en <- rlang::enquo(selected_variable) # an attempt following the link
  # filter_criteria <- interp(!is.na(~y), .values = list(y = as.name(selected_variable))) # this doesn't work
  df1 <- airquality %>%
    dplyr::filter(Month == selected_month,
                  !is.na(selected_variable_en)) %>%
    select(Month, Day, !!selected_variable)
  return(df1)
}
df1 <- df_construction(2, "Solar.R")
df1 <- df_construction(2, "Solar.R")
My ultimate goal is to build this in Shiny and thus have inputs the user will have selected as arguments in the function.
I know that the filter and the select functions shouldn't be dealt with in the same way.
I have followed the steps according to: https://www.brodrigues.co/blog/2016-07-18-data-frame-columns-as-arguments-to-dplyr-functions/ but had no success due to the !is.na filter.
I just want to have a dataframe where the only columns are the Month column for the selected months, the Day column and whichever column from the choice Ozone, Solar.R, Wind, Temp the user has selected, without any NA.
Thank you very much for your help!!
!! is often not enough to unquote variable names; you often need it in conjunction with rlang::sym. And if you have more than one variable to unquote, you need !!! and rlang::syms.
df_construction <- function(selected_month, selected_variable){
  df1 <- airquality %>%
    dplyr::filter(Month == selected_month,
                  !is.na(!!rlang::sym(selected_variable))) %>%
    select(Month, Day, selected_variable)
  return(df1)
}
For select, you can pass column names directly. dplyr has since gained the {{ }} (embrace) operator for unquoting, but it does not work in all cases.
If you start passing variable names into functions, you may run into difficulties with dplyr. In that respect, data.table is easier to use (see a blog post I wrote on the subject).
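Putting the pieces together, a sketch of the whole function in current dplyr (the .data pronoun inside filter(), all_of() inside select(); month 5 is used in the example call because airquality only covers May through September):

```r
library(dplyr)

df_construction <- function(selected_month, selected_variable) {
  # selected_variable arrives as a string, e.g. "Solar.R"
  airquality %>%
    filter(Month == selected_month,
           !is.na(.data[[selected_variable]])) %>%
    select(Month, Day, all_of(selected_variable))
}

df1 <- df_construction(5, "Solar.R")
```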

How can I write this R expression in the pipe operator format?

I am trying to rewrite this expression using magrittr’s pipe operator:
print(mean(pull(df, height), na.rm=TRUE))
which returns 175.4 for my dataset.
I know that I have to start with the data frame, written as df %>%, but I’m confused about how to turn the expression inside out. For example, should the na.rm=TRUE go inside mean(), pull() or print()?
UPDATE: I actually figured it out by trial and error:
df %>%
  pull(height) %>%
  mean(na.rm = TRUE) %>%
  print()
returns 175.4
It would be good practice to make a reproducible example, with dummy data like this:
height <- 1:30
weight <- 1:30
df <- data.frame(height, weight)
These pipe operators work with the majority of the tidyverse (not just magrittr); what you are trying to do actually comes from dplyr. The na.rm=T is required for many summary functions like mean and sd, as well as functions that pick out specific data points like min, max, etc. These functions don't play well with NA values.
df %>% pull(height) %>% mean(na.rm=T) %>% print()
Unless your data is nested you may not even need to use pull
df %>% summarise(mean = mean(height,na.rm=T))
Also, using summarise you can pipe these into another dataframe rather than just printing, and call them out of the dataframe whenever you want.
df %>% summarise(meanHt = mean(height,na.rm=T), sdHt = sd(height,na.rm=T)) -> summary
summary[1]
summary[2]

Pass a string variable to spread function in dplyr

I am trying to write a function that passes a string variable into a dplyr pipeline, but I'm having some problems. For example:
col_spread = "speed"
In select(), I can use get(col_spread) to select the column named speed.
df %>% select(get(col_spread))
However, when I am using spread function in dplyr
df %>% spread(key = Key_col, value = get(col_spread))
Error: Invalid column specification
It doesn't work.
Is NSE the only way to go? If so, what should I do?
Thank you!
Actually get really isn't a great idea. It would be better to use the standard evaluation version:
df %>% select_(col_spread)
and then for spread it would look like
df %>% spread_("Key_col", col_spread)
note which values are quoted and which are not. spread_ expects two character values.
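Both spread() and its underscore variant have since been superseded. As a sketch with made-up toy data, tidyr's pivot_wider() takes column names as strings (or via all_of()) without any NSE tricks:

```r
library(dplyr)
library(tidyr)

col_spread <- "speed"

df <- data.frame(id      = c(1, 1, 2, 2),
                 Key_col = c("a", "b", "a", "b"),
                 speed   = c(10, 20, 30, 40))

# names_from/values_from use tidyselect, so string names work directly
wide <- df %>%
  pivot_wider(names_from = "Key_col", values_from = all_of(col_spread))
```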

Using dplyr in assignment

I have a question about dplyr.
Lets say I want to update certain values in a dataframe, can I do this?:
mtcars %>% filter(mpg>20) %>% select(hp)=1000
(The example is nonsensical where all cars with MPGs greater than 20 have HP set to 1000)
I get an error, so I am guessing the answer is no, I can't use %>% and the dplyr verbs on the left-hand side of an assignment; but the dplyr syntax is a lot cleaner than:
mtcars[mtcars$mpg>20,"hp"]=1000
Especially when you are dealing with more complex cases, so I wanted to ask if there is any way to use the dplyr syntax in this case?
edit: It looks like mutate is the verb I want, so now my question is: can I dynamically set the name of the variable in the mutate statement, like so:
for (i in c("hp","wt")) { mtcars <- mtcars %>% filter(mpg>20) %>% mutate(i=1000) }
This just creates a column named "i" with the value 1000, which isn't what I want.
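A sketch of both halves in current dplyr: updating values in place needs mutate() with if_else() rather than assignment into a pipe, and a string can be injected as the column *name* with the := operator (the loop mirrors the hp/wt example above):

```r
library(dplyr)

cars <- mtcars

# Update values in place: the dplyr equivalent of mtcars[mtcars$mpg > 20, "hp"] <- 1000
cars <- cars %>%
  mutate(hp = if_else(mpg > 20, 1000, hp))

# Dynamic column names: inject the string with !!name := value
for (col in c("hp", "wt")) {
  cars <- cars %>%
    mutate(!!col := if_else(mpg > 20, 1000, .data[[col]]))
}
```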

using variable column names in dplyr (do)

I have the following example data
d.1 = data.frame(id=c(1,1,2,3,3), date=c(2001,2002,2001,2001,2003), measure=c(1:5))
d.2 = data.frame(id=c(1,2,2,3,3), date=c(2001,2002,2003,2002,2008), measure=c(1:5))
d = merge(d.1,d.2, all=T, by="id")
d.1 and d.2 are two kinds of measurements, and I need one of each measurement per id. The measurements should be as close to each other in time as possible. I can do that with dplyr by
require(dplyr)
d = d %>%
  group_by(id) %>%
  do(.[which.min(abs(.$date.x - .$date.y)), ])
The question is how I can use dplyr when the names of the date columns are stored in variables, like name.x = "date.x" and name.y = "date.y", because I can't use
...
do(.[which.min(abs(.[, name.x] - .[, name.y])), ])
...
I tried to find another solution using eval, as.symbol and the like, but I couldn't figure one out.
d$date.x returns a vector while d[, name.x] returns a data.frame, which does not work when passed inside your function. So simply change the way you access this column to d[[name.x]] and it will work:
d %>% group_by(id) %>% do(.[which.min(abs(.[[name.x]] -.[[name.y]])),])
Since 0.4 (which was released just after this question was answered), dplyr has included standard evaluation version do_, which in theory should be easier to program with than the NSE version.
You could use it similarly:
interp <- lazyeval::interp
d %>%
  group_by(id) %>%
  do_(interp(~ .[which.min(abs(.$x - .$y)), ],
             x = as.name(name.x), y = as.name(name.y)))
I'm not sure it's any easier to read or write than the NSE version. For the other verbs, code can remain concise while also programmatically accessing names. For do_, however, one must use the dot pronoun to access column names, e.g. as discussed in this question. As a consequence, I think you always need to use interp with do_, which makes the code more verbose than the NSE version in the earlier answer.
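For completeness: do() and do_() are superseded in current dplyr. A sketch of the same closest-date selection using slice() and the .data pronoun, which handles string column names with no lazyeval at all:

```r
library(dplyr)

name.x <- "date.x"
name.y <- "date.y"

d.1 <- data.frame(id = c(1, 1, 2, 3, 3), date = c(2001, 2002, 2001, 2001, 2003), measure = 1:5)
d.2 <- data.frame(id = c(1, 2, 2, 3, 3), date = c(2001, 2002, 2003, 2002, 2008), measure = 1:5)
d <- merge(d.1, d.2, all = TRUE, by = "id")

# slice() runs per group, so which.min() picks the closest date pair for each id
closest <- d %>%
  group_by(id) %>%
  slice(which.min(abs(.data[[name.x]] - .data[[name.y]])))
```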
