R command `group_by` - r

I don't understand exactly how this code works. I found it in a tutorial guide,
Data manipulation in R by Steph Locke.
On page 133 there is an example that I understand only partially:
library(tidyverse)
library(nycflights13)
flights %>%
group_by(month, carrier) %>%
summarise(n=n()) %>% ##sum of items;
group_by(month) %>%
mutate(prop=scales::percent(n/sum(n)), n=NULL) %>%
spread(month, prop)
flights %>%
group_by(month, carrier) %>% ## This is grouping by months and within the months by carrier;
summarise(n=n()) %>% ## It is summing the items, giving for each month and each carrier the sum of items;
At this point there is another group_by(), which looks like it is nested inside group_by(month, carrier).
Then:
mutate(prop=scales::percent(n/sum(n)), n=NULL) %>% ## Calculates the percentage of items over the total and store them in "prop"
The last line creates the matrix, putting the months in the columns and filling them with the values from prop.
I would like to better understand what exactly the second group_by(month) %>% is doing.
Thank you in advance for every reply.

The second group_by is not needed here because, by default, summarise uses .groups = "drop_last" when each group yields a single row, which drops only the last grouping variable. Therefore, after the first summarise, a single grouping column, 'month', remains. We can change the code to
flights %>%
group_by(month, carrier) %>%
summarise(n=n()) %>%
mutate(prop=scales::percent(n/sum(n)), n=NULL)
Suppose we change .groups to "drop": summarise then drops all the grouping variables, so a new group_by statement is needed. Also, a mutate after the last grouping statement does not drop the group attributes, so an ungroup at the end is useful
flights %>%
group_by(month, carrier) %>%
summarise(n=n(), .groups = "drop") %>%
group_by(month) %>%
mutate(prop=scales::percent(n/sum(n)), n=NULL) %>%
ungroup
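To see which grouping survives summarise, group_vars() can be inspected directly. The following is a minimal sketch on a small made-up table (the toy data and its values are assumptions standing in for flights), not the tutorial's own code:

```r
library(dplyr)

# Hypothetical toy data standing in for nycflights13::flights
toy <- tibble(
  month   = c(1, 1, 1, 2, 2),
  carrier = c("AA", "AA", "UA", "AA", "UA")
)

# Count rows per month/carrier; "drop_last" removes only `carrier`
counts <- toy %>%
  group_by(month, carrier) %>%
  summarise(n = n(), .groups = "drop_last")

# The result is still grouped by month
remaining <- group_vars(counts)
print(remaining)

# So sum(n) is computed within each month, without a second group_by()
props <- counts %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()
print(props)
```

Within each month the proportions add up to 1, which is the behaviour the second group_by(month) was spelling out explicitly.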

Related

I would like to select the bottom 3 lowest numbers within a group using R

I have this script
CHECK <-TOP3BYNumber %>%
arrange(Number) %>%
group_by(Number) %>%
top_n(3)
This gives me the 3 highest values within each group of the column Number, using dplyr.
Instead of the three highest values, I would like to get the three lowest values.
I tried
top_n(-3) and this does not work.
We can use slice
library(dplyr)
TOP3BYNumber %>%
arrange(desc(Number)) %>%
group_by(Number) %>%
slice(seq_len(3))
Or with row_number()
TOP3BYNumber %>%
arrange(desc(Number)) %>%
group_by(Number) %>%
slice(head(row_number(), 3))
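In dplyr 1.0.0 and later, slice_min() picks the rows with the smallest values of a column within each group. The sketch below assumes a hypothetical value column named Value, since the question does not show which column holds the numbers to rank:

```r
library(dplyr)

# Hypothetical data: `Number` is the grouping column, `Value` the ranked column
scores <- tibble(
  Number = c(1, 1, 1, 1, 2, 2, 2, 2),
  Value  = c(10, 40, 20, 30, 8, 5, 7, 6)
)

# Keep the 3 rows with the lowest Value in each group of Number
lowest3 <- scores %>%
  group_by(Number) %>%
  slice_min(order_by = Value, n = 3, with_ties = FALSE) %>%
  ungroup()

print(lowest3)
```

Setting with_ties = FALSE guarantees at most three rows per group even when values repeat.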

Better output with dplyr -- breaking functions and results

This is a long-standing question, but now I really need to solve this puzzle. I use dplyr all the time and I think it is great for summarising variables. However, I'm trying to display a pivot table with only partial success. dplyr always reports one single row with all the results, which is annoying. I have to copy-paste the results into Excel to organize everything...
I got the code here and it is almost working. This result should look like the following one, because I always report my results using this style.
Use this code to get the same results:
library(tidyverse)
set.seed(123)
ds <- data.frame(group=c("american", "canadian"),
iq=rnorm(n=50,mean=100,sd=15),
income=rnorm(n=50, mean=1500, sd=300),
math=rnorm(n=50, mean=5, sd=2))
ds %>%
group_by(group) %>%
summarise_at(vars(iq, income, math),funs(mean, sd)) %>%
t %>%
as.data.frame %>%
rownames_to_column %>%
separate(rowname, into = c("feature", "fun"), sep = "_")
To clarify, I've tried this code, but spread works with only one summary (mean or sd, etc). Some people use gather(), but it's complicated to work with group_by and gather().
Thanks for any help.
Instead of transposing (t) and changing the class types, do a gather after the summarise step to change the data to 'long' format, and then spread it back after some modifications with separate and unite
library(tidyverse)
ds %>%
group_by(group) %>%
summarise_at(vars(iq, income, math), list(mean = mean, sd = sd)) %>% ## funs() is deprecated; a named list gives the same column names
gather(key, val, iq_mean:math_sd) %>%
separate(key, into = c('key1', 'key2')) %>%
unite(group, group, key2) %>%
spread(group, val)
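With newer package versions the same reshaping can be written with across(), pivot_longer() and pivot_wider(). This is a sketch assuming dplyr 1.0.0+ and tidyr 1.0.0+; the column names feature and stat are chosen here for illustration:

```r
library(dplyr)
library(tidyr)

set.seed(123)
ds <- data.frame(group  = c("american", "canadian"),
                 iq     = rnorm(n = 50, mean = 100, sd = 15),
                 income = rnorm(n = 50, mean = 1500, sd = 300),
                 math   = rnorm(n = 50, mean = 5, sd = 2))

wide <- ds %>%
  group_by(group) %>%
  # produces iq_mean, iq_sd, income_mean, ... (one row per group)
  summarise(across(c(iq, income, math), list(mean = mean, sd = sd)),
            .groups = "drop") %>%
  # split "iq_mean" into feature = "iq" and stat = "mean"
  pivot_longer(-group, names_to = c("feature", "stat"), names_sep = "_") %>%
  # one column per group/statistic pair, e.g. american_mean
  pivot_wider(names_from = c(group, stat), values_from = value)

print(wide)
```

The result has one row per feature and one column per group/statistic combination, the same layout the gather/spread pipeline produces.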

Explain ungroup() in dplyr

Suppose I'm working with a dataset and I want to group the data (e.g. by country), compute a summary statistic (mean()), and then ungroup() the data.frame to get a dataset with the original dimensions (country-year) plus a new column that lists the mean for each country (repeated over n years). How would I do that with dplyr? The ungroup() function doesn't return a data.frame with the original dimensions:
gapminder %>%
group_by(country) %>%
summarize(mn = mean(pop)) %>%
ungroup() # returns data.frame with nrows == length(unique(gapminder$country))
ungroup() is useful if you want to do something like
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop)) %>%
ungroup()
where you want to do some sort of transformation that uses an entire group's statistics. In the above example, mn is the ratio of a population to the group's average population. When it is ungrouped, any further mutations called on it would not use the grouping for aggregate statistics.
summarize automatically reduces the dimensions, and there's no way to get that back. Perhaps you wanted to do
gapminder %>%
group_by(country) %>%
mutate(mn = mean(pop)) %>%
ungroup()
Which creates mn as the mean for each group, replicated for each row within that group.
The summarize() reduced the number of rows. If you didn't want to change the number of rows, then use mutate() rather than summarize().
Actually, ungroup() is not needed in your case.
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop))
generates the same values as the following (although the first result stays grouped by country, so later verbs would still operate per group):
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop)) %>%
ungroup()
The only difference is that the latter actually runs a bit slower.
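Whether the grouping is still attached matters for whatever comes next in the pipeline. A small sketch on made-up data (the toy table is an assumption standing in for gapminder) shows the difference:

```r
library(dplyr)

# Hypothetical country-year data standing in for gapminder
toy <- tibble(
  country = c("A", "A", "B", "B"),
  year    = c(2000, 2001, 2000, 2001),
  pop     = c(10, 30, 100, 300)
)

# mutate() keeps all rows and repeats the group mean on each row
grouped   <- toy %>% group_by(country) %>% mutate(mn = mean(pop))
ungrouped <- grouped %>% ungroup()

# Same columns and values, but only `grouped` still carries the grouping
print(is_grouped_df(grouped))
print(is_grouped_df(ungrouped))

# A later summarise() therefore behaves differently
per_country <- grouped   %>% summarise(total = sum(pop))  # one row per country
overall     <- ungrouped %>% summarise(total = sum(pop))  # a single row
print(per_country)
print(overall)
```

Ending a pipeline with ungroup() is a defensive habit: it makes sure later steps do not silently compute per-group results.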

Split a Dataset into a Nested List of Dataframes and then Spread Using Tidyr and Purrr

library(ggmosaic)
library(tidyverse)
Below is the sample code
happy2<-happy%>%
select(sex,marital,degree,health)%>%
group_by(sex,marital,degree,health)%>%
summarise(Count=n())
The following code splits the dataset into a nested list with tables of male and female (sex variable) for each category of the degree variable.
happy2 %>%
split(.$degree) %>%
lapply(function(x) split(x, x$sex))
This is where I'm now struggling. I would like to reshape, or using Tidyr, spread the "marital" variable, or perhaps this should be split again, so that each category of "marital" is a column header with each column containing the "health" variable and corresponding "Count". The redundant "sex" and "degree" columns can be dropped.
Since I'm working with a list, I've been attempting to use Tidyverse methods, for example, I've been trying to use purrr to drop variables:
happy2%>%map(~select(.x,-sex))
I'm thinking that I can also spread using purrr, but I'm having trouble making this work.
To help illustrate what I'm looking for, I attached a pic of the possible structure. I didn't include all categories and the counts are not correct since I'm only showing the structure. I suppose the "marital" category could also be a third split variable as well if that's easier? So what I'm hoping for is male and female tables for each category of degree, with marital by health and showing the corresponding count.
Help would be appreciated...
Would the following work? I changed the syntax for split by sex so that I can chain the subsequent commands together:
happy2 %>%
ungroup() %>% ## drop the grouping left by summarise() so select() can remove sex and degree
split(.$degree) %>%
lapply(function(x) x %>% split(.$sex) %>%
lapply(function(x) x %>% select(-sex, -degree) %>%
spread(health, Count)))
Edit:
This would give you a separate table for each marital status:
happy2 %>%
ungroup() %>%
split(.$degree) %>%
lapply(function(x) x %>% split(.$sex) %>%
lapply(function(x) x %>% select(-sex, -degree) %>% split(.$marital)))
And if you don't want the first column indicating marital status, the following version drops that:
happy2 %>%
ungroup() %>%
split(.$degree) %>%
lapply(function(x) x %>% split(.$sex) %>%
lapply(function(x) x %>% select(-sex, -degree) %>% split(.$marital) %>%
lapply(function(x) x %>% select(-marital))))
What about this:
# cleaned up your code a bit
# removed the select (as it does nothing)
# consistent column names (count is lower case like the rest of the variables)
# added spacing
happy2 <- happy %>%
group_by(sex, marital, degree, health) %>%
summarise(count=n())
happy2 %>%
dplyr::ungroup() %>%
split(list(.$degree, .$sex, .$marital)) %>%
lapply(. %>% select(health, count))
Or do you really want the "marital" status as a table heading for the "health" column, as in your picture?

How to pass multiple column names as input to group_by in dplyr [duplicate]

I am new to R and the dplyr package. I am trying to pass a variable to dplyr's group_by, so that the grouping can be varied.
For instance, when working with the flights dataset, I can get counts of rows by any column (or multiple columns) using the code below:
library(nycflights13)
flights %>% group_by(origin) %>% tally()
flights %>% group_by(carrier) %>% tally()
flights %>% group_by(origin,carrier) %>% tally()
But if I want to pass the names of the columns to group_by as a variable, it does not work with multiple column names.
group="carrier"
flights %>% group_by_(group) %>% tally()
group="origin"
flights %>% group_by_(group) %>% tally()
group=c("origin","carrier") #This does not work
flights %>% group_by_(group) %>% tally()
I will appreciate any help. Thanks.
You've almost got it, you just need to use the .dots argument to pass in your grouping variables.
group <- c("origin","carrier")
flights %>%
group_by_(.dots = group) %>%
tally()
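The underscore verbs such as group_by_() have since been deprecated in dplyr. In dplyr 1.0.0 and later, a character vector of column names can be passed with across(all_of(...)). A minimal sketch on made-up data (the toy table is an assumption standing in for flights):

```r
library(dplyr)

# Hypothetical toy data standing in for nycflights13::flights
toy <- tibble(
  origin  = c("JFK", "JFK", "LGA", "LGA", "LGA"),
  carrier = c("AA", "UA", "AA", "AA", "UA")
)

# Column names held in a character vector
group <- c("origin", "carrier")

# all_of() selects exactly the named columns; across() hands them to group_by()
counts <- toy %>%
  group_by(across(all_of(group))) %>%
  tally()

print(counts)
```

Changing group to a single name, such as "origin", works with the same pipeline.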
