dplyr 0.5: arrange() using groupings - r

I've got a lot of code written for dplyr 0.4.3 that relies on the grouped behavior of arrange(). As of the 0.5 release, arrange() no longer respects grouping.
This decision baffles me, as it makes arrange() inconsistent with the other dplyr verbs, and surely a user could just ungroup() before arrange() if ungrouped sorting is required. I would have hoped for perhaps a parameter in arrange() to retain the group_by() behavior, but alas!
I therefore have to rewrite my grouped arrange. At this point, my only option seems to be to break up the pipe at the arrange call, loop through the groups and arrange group by group, and then bind the results together again. I'm hoping there might be a more elegant solution?
Below is an MRE. I'd like to run a cumsum on wt per group_by(cyl), with each group sorted by descending mpg. Many thanks for ideas/suggestions.
library(dplyr)
mtcars %>%
group_by(cyl) %>%
arrange(desc(mpg)) %>%
mutate(WtCum = cumsum(wt))

To order within groups in dplyr 0.5, add the grouping variable before the other ordering variables within arrange.
mtcars %>%
group_by(cyl) %>%
arrange(cyl, desc(mpg))
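For readers on a newer dplyr than 0.5: later releases (0.7.0 onward, as far as I know) added a .by_group argument to arrange(), which is essentially the parameter the question wished for. A minimal sketch, assuming such a version is installed; the name `result` is only for illustration:

```r
library(dplyr)

# Sort within each cyl group, then take a running total of wt per group.
# Assumes a dplyr version in which arrange() has the .by_group argument.
result <- mtcars %>%
  group_by(cyl) %>%
  arrange(desc(mpg), .by_group = TRUE) %>%
  mutate(WtCum = cumsum(wt)) %>%
  ungroup()
```

On 0.5 itself, arrange(cyl, desc(mpg)) followed by the same mutate() should give the same running totals.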

If you want to keep around an “old arrange”, you may use this snippet:
arrange_old <- function(.data, ...) {
dplyr::arrange_(.data, .dots = c(groups(.data), lazyeval::lazy_dots(...)))
}
This will respect grouping by basically prepending the group variables to the new arrange call.
Then you can do:
mtcars %>%
group_by(cyl) %>%
arrange_old(desc(mpg))
For what it's worth, I've also found this change confusing and unintuitive, and I keep making the mistake of forgetting to explicitly specify the grouping.

Related

dplyr mutate using dynamic variable name while respecting group_by

I'm trying, as per "dplyr mutate using variable columns" and "dplyr - mutate: use dynamic variable names", to use dynamic names in mutate. What I am trying to do is to normalize column data by group, subject to a minimum standard deviation. Each column has a different minimum standard deviation.
e.g. (I omitted loops & map statements for convenience)
require(dplyr)
require(magrittr)
require(purrr)  # provides pluck()
data(iris)
iris <- tbl_df(iris)
minsd <- c('Sepal.Length' = 0.8)
varname <- 'Sepal.Length'
iris %>% group_by(Species) %>% mutate(!!varname := mean(pluck(iris,varname),na.rm=T)/max(sd(pluck(iris,varname)),minsd[varname]))
I got the dynamic assignment and variable selection to work as suggested by the referenced answers. But group_by() is not respected, which, for me at least, is the main benefit of using dplyr here.
The desired result is given by
iris %>% group_by(Species) %>% mutate(!!varname := mean(Sepal.Length,na.rm=T)/max(sd(Sepal.Length),minsd[varname]))
Is there a way around this?
I actually did not know much about pluck, so I don't know what went wrong, but I would go for this and this works:
iris %>%
group_by(Species) %>%
mutate(
!! varname :=
mean(!!as.name(varname), na.rm = T) /
max(sd(!!as.name(varname)),
minsd[varname])
)
Let me know if this isn't what you were looking for.
The other answer is obviously the best, and it also solved a similar problem that I have encountered. For example, with !!as.name(), there is no need to use group_by_() (or group_by_at()) or arrange_() (or arrange_at()).
However, another way is to replace pluck(iris,varname) in your code with .data[[varname]]. The reason pluck(iris,varname) does not work is, I suppose, that the iris inside pluck(iris,varname) is the whole, ungrouped data set. In contrast, .data refers to the tibble that mutate() is operating on, and so is grouped.
An alternative to as.name() is rlang::sym() from the rlang package.
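To make the last two suggestions concrete, here is a small sketch of both variants on the question's data. The names `by_pronoun` and `by_symbol` are made up for illustration, and I use minsd[[varname]] rather than minsd[varname] only to drop the name from the lookup:

```r
library(dplyr)

minsd   <- c("Sepal.Length" = 0.8)
varname <- "Sepal.Length"

# Variant 1: the .data pronoun, which only sees the current group's rows
by_pronoun <- iris %>%
  group_by(Species) %>%
  mutate(!!varname := mean(.data[[varname]], na.rm = TRUE) /
           max(sd(.data[[varname]]), minsd[[varname]])) %>%
  ungroup()

# Variant 2: turn the string into a symbol with rlang::sym() and unquote it
by_symbol <- iris %>%
  group_by(Species) %>%
  mutate(!!varname := mean(!!rlang::sym(varname), na.rm = TRUE) /
           max(sd(!!rlang::sym(varname)), minsd[[varname]])) %>%
  ungroup()
```

Both should give one value per species, unlike the pluck(iris, varname) version, which computes a single value from the whole data set.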

How to Create Multiple Frequency Tables with Percentages Across Factor Variables using Purrr::map

library(tidyverse)
library(ggmosaic)  # for the "happy" dataset
I feel like this should be a somewhat simple thing to achieve, but I'm having difficulty with percentages when using purrr::map together with table(). Using the "happy" dataset, I want to create a list of frequency tables for each factor variable. I would also like to have rounded percentages instead of counts, or both if possible.
I can create frequency percentages for each factor variable separately with the code below.
with(happy,round(prop.table(table(marital)),2))
However I can't seem to get the percentages to work correctly when using table() with purrr::map. The code below doesn't work...
happy%>%select_if(is.factor)%>%map(round(prop.table(table)),2)
The second method I tried was using tidyr::gather, and calculating the percentage with dplyr::mutate and then splitting the data and spreading with tidyr::spread.
TABLE<-happy%>%select_if(is.factor)%>%gather()%>%group_by(key,value)%>%summarise(count=n())%>%mutate(perc=count/sum(count))
However, since there are different factor variables, I would have to split the data by "key" before spreading using purrr::map and tidyr::spread, which came close to producing some useful output except for the repeating "key" values in the rows and the NA's.
TABLE%>%split(TABLE$key)%>%map(~spread(.x,value,perc))
So any help on how to make both of the above methods work would be greatly appreciated...
You can use an anonymous function or a formula to get your first option to work. Here's the formula option.
happy %>%
select_if(is.factor) %>%
map(~round(prop.table(table(.x)), 2))
In your second option, removing the NA values and then removing the count variable prior to spreading helps. The order in the result has changed, however.
TABLE = happy %>%
select_if(is.factor) %>%
gather() %>%
filter(!is.na(value)) %>%
group_by(key, value) %>%
summarise(count = n()) %>%
mutate(perc = round(count/sum(count), 2), count = NULL)
TABLE %>%
split(.$key) %>%
map(~spread(.x, value, perc))
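Since the question also asked for counts and percentages together, the formula approach can be extended to return a small data frame per factor. This is only a sketch: `toy` is a hypothetical stand-in for the "happy" dataset (so the example runs without ggmosaic), and `freq_tables` is an illustrative name.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data standing in for the "happy" dataset
toy <- data.frame(
  marital = factor(c("married", "single", "married", "divorced")),
  sex     = factor(c("f", "m", "f", "m")),
  age     = c(30, 41, 25, 58)
)

# One data frame per factor column, holding both counts and rounded shares
freq_tables <- toy %>%
  select_if(is.factor) %>%
  map(~ {
    counts <- table(.x)
    data.frame(
      level = names(counts),
      count = as.vector(counts),
      perc  = as.vector(round(prop.table(counts), 2))
    )
  })
```

The result is a named list (here with elements marital and sex), so freq_tables$marital gives the table for that variable.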

Take a sample without group in dplyr, R

I know how to take a random sample from each group of a data frame using sample_n or sample_frac in dplyr, which goes like this:
dataset %>%
group_by(user_id) %>%
sample_n(10)
However, I have a slightly different question. I want to take a random sample from the whole dataset. It should be as simple as this one,
sample_n(dataset,10)
But because I applied group_by to the dataset earlier, the grouping still seems to be in effect, and the second command behaves the same as the first.
How can I remove the effect of group_by and get a random sample from the whole dataset?
We can use ungroup() to remove any grouping and then apply sample_n():
dataset %>%
group_by(user_id) %>%
ungroup() %>%
sample_n(10)
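A quick way to see the difference, sketched on mtcars since `dataset` from the question is not available (the names `grouped`, `per_group` and `overall` are illustrative):

```r
library(dplyr)

set.seed(1)  # only to make the sketch reproducible

grouped <- mtcars %>% group_by(cyl)

# Still grouped: 3 rows are drawn from each of the three cyl groups
per_group <- grouped %>% sample_n(3)

# Grouping removed: 3 rows are drawn from the whole table
overall <- grouped %>% ungroup() %>% sample_n(3)
```

Comparing nrow(per_group) with nrow(overall) shows whether the grouping was still in effect when sampling.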

summarise vs. summarise_each function in dplyr package

I am trying to summarise one variable after splitting the data with group_by() from the dplyr package. The following code works fine, but I cannot substitute summarise() for summarise_each(), even though only one column needs to be calculated. I wonder why?
iris %>% group_by(Species) %>% select(one_of('Sepal.Length')) %>%
summarise_each(funs(mean(.)))
With summarise() instead, I get output like "S3:lazy".
summarize and summarize_each work quite differently. summarize is in fact simpler: just specify the expression directly.
iris %>%
group_by(Species) %>%
select(Sepal.Length) %>%
summarize(Sepal.Length = mean(Sepal.Length))
You can choose any name for the output column; it doesn't need to be the same as the input.

Using dplyr in assignment

I have a question about dplyr.
Let's say I want to update certain values in a data frame. Can I do this?
mtcars %>% filter(mpg>20) %>% select(hp)=1000
(The example is nonsensical where all cars with MPGs greater than 20 have HP set to 1000)
I get an error so I am guessing the answer is no I can't use %>% and the dplyr verbs to the left of an assignment, but the dplyr syntax is a lot cleaner than:
mtcars[mtcars$mpg>20,"hp"]=1000
Especially when you are dealing with more complex cases, so I wanted to ask if there is any way to use the dplyr syntax in this case?
edit: It looks like mutate is the verb I want, so now my question is, can I dynamically change the name of the var in the mutate statement like so:
for (i in c("hp","wt")) {mtcars<-mtcars %>% filter(mpg>20) %>% mutate(i=1000) }
This example just creates a column named "i" with value 1000, which isn't what I want.
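One possible way to get what the edit asks for, offered only as a sketch and not as an answer taken from the original thread: keep all rows, make the update conditional inside mutate(), and unquote the loop variable on the left-hand side with !! and :=. The name `updated` is made up for illustration, and the filter() step is left out on purpose because it would drop the rows with mpg <= 20.

```r
library(dplyr)

updated <- mtcars
for (i in c("hp", "wt")) {
  updated <- updated %>%
    # !!i := ... names the column after the string in i;
    # .data[[i]] reads the existing values of that same column
    mutate(!!i := ifelse(mpg > 20, 1000, .data[[i]]))
}
```

After the loop, hp and wt are 1000 wherever mpg is above 20 and unchanged elsewhere.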
