This is a long-lasting question, but now I really to solve this puzzle. I'm using dplyr all the time and I think it is great to summarise variables. However, I'm trying to display a pivot table with partial success only. Dplyr always reports one single row with all results, what's annoying. I have to copy-paste the results to excel to organize everything...
I got the code here
and it almost working.
This result
Should be like the following one:
Because I always report my results using this style
Use this code to get the same results:
library(tidyverse)
set.seed(123)
ds <- data.frame(group=c("american", "canadian"),
iq=rnorm(n=50,mean=100,sd=15),
income=rnorm(n=50, mean=1500, sd=300),
math=rnorm(n=50, mean=5, sd=2))
ds %>%
group_by(group) %>%
summarise_at(vars(iq, income, math),funs(mean, sd)) %>%
t %>%
as.data.frame %>%
rownames_to_column %>%
separate(rowname, into = c("feature", "fun"), sep = "_")
To clarify, I've tried this code, but spread works with only one summary (mean or sd, etc). Some people use gather(), but it's complicated to work with group_by and gather().
Thanks for any help.
Instead of transposing (t) and changing the class types, after the summarise step, do a gather to change it to 'long' format and then spread it back after doing some modifications with separate and unite
library(tidyverse)
ds %>%
group_by(group) %>%
summarise_at(vars(iq, income, math),funs(mean, sd)) %>%
gather(key, val, iq_mean:math_sd) %>%
separate(key, into = c('key1', 'key2')) %>%
unite(group, group, key2) %>%
spread(group, val)
Related
I am currently trying to find a short & tidy way to unnest a nested tibble with 2 grouping variables and a tibble/df as data for each observation into a tibble having only one of the grouping variables and the respective data in a df (or tibble). I will illustrate my sample by using the starwars dataset provided by tidyverse and show the 3 solutions I came up with so far.
library(tidyverse)
#Set example data: 2 grouping variables, name & sex, and one data column with a tibble/df for each observation
tbl_1 <- starwars %>% group_by(name, sex) %>% nest() %>% ungroup()
#1st Solution: Neither short nor tidy but gets me the result I would like to have in the end
tbl_2 <- tbl_1 %>%
group_by(sex) %>%
nest() %>%
ungroup()%>%
mutate(data = map(.$data, ~.x %>% group_by(name) %>% unnest(c(data))))
#2nd Solution: A lot shorter and more neat but still not what I have in mind
tbl_2 <- tbl_1 %>%
nest(-sex) %>%
mutate(data = map(.$data, ~.x %>% unnest(cols = c(data))))
#3rd Solution: The best so far, short and readable
tbl_2 <- tbl_1 %>%
unnest(data) %>%
group_by(name) %>%
nest(-sex)
##Solution as I have it in mind / I think should be somehow possible.
tbl_2 <- tbl_1 %>% group_by(sex) %>% unnest() #This however gives one large tibble grouped by sex, not two separate tibbles in a nested tibble
Is such a solution I am looking for even possible in the first place or is the 3rd solution as close as it gets in terms of being both short, readable and tidy?
In terms of my actual workflow tbl_1 is the "work horse" of my analysis and not subject to change, I use to apply analysis or ggplot via map for figures etc., which are sometimes on the level of "names" or "sex".
I appreciate any input!
Update:
User #caldwellst has given a sufficient enough answer for me to mark this question as answered, unfortunately only as a comment. After waiting a bit, I would now accept any other answer with the same suggestion as the solution to mark this question as solved.
As #caldwellst has pointed out in a comment, the group_by is unnecessary, the provided solution is sufficiently short and tidy enough for me in that case.
tbl_1 %>% unnest(data) %>% nest(data = -sex).
I will remove my answer and accept a different one, if #caldwellst posts the comment as answer or somebody else provides a different, but equally suitable one.
If I'm working with a dataset and I want to group the data (i.e. by country), compute a summary statistic (mean()) and then ungroup() the data.frame to have a dataset with the original dimensions (country-year) and a new column that lists the mean for each country (repeated over n years), how would I do that with dplyr? The ungroup() function doesn't return a data.frame with the original dimensions:
gapminder %>%
group_by(country) %>%
summarize(mn = mean(pop)) %>%
ungroup() # returns data.frame with nrows == length(unique(gapminder$country))
ungroup() is useful if you want to do something like
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop)) %>%
ungroup()
where you want to do some sort of transformation that uses an entire group's statistics. In the above example, mn is the ratio of a population to the group's average population. When it is ungrouped, any further mutations called on it would not use the grouping for aggregate statistics.
summarize automatically reduces the dimensions, and there's no way to get that back. Perhaps you wanted to do
gapminder %>%
group_by(country) %>%
mutate(mn = mean(pop)) %>%
ungroup()
Which creates mn as the mean for each group, replicated for each row within that group.
The summarize() reduced the number of rows. If you didn't want to change the number of rows, then use mutate() rather than summarize().
actually ungroup() is not needed in your case.
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop))
generates the same results as the following:
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop)) %>%
ungroup()
The only difference is that the latter actually runs a bit slower.
Using the sample data (bottom), I want to use the code below to group and summarise the data. After this, I want to transpose, but I'm stuck on how to use tidyr to achieve this?
For context, I'm attempting to recreate an existing table that was created in Excel using knitr::kable, so the final product of my code below is expected to break tidy principles.
For example:
library(tidyverse)
Df <- Df %>% group_by(Code1, Code2, Level) %>%
summarise_all(funs(count = sum(!is.na(.))))
I can add t(.) using the pipe...
Df <- Df %>% group_by(Code1, Code2, Level) %>%
summarise_all(funs(count = sum(!is.na(.)))) %>%
t(.)
or I can add...
Df <- as.data.frame(t(Df)
Both of these options allow me to transpose, but I'm wondering if there's a tidyverse method of achieving this using tidyr's gather and spread functions? I want to have more control over the process and also want to remove the "V1","V2", etc, that appear as column names when using transpose (t).
How can I achieve this using tidyverse?
Sample Code:
Code1 <- c("H200","H350","H250","T400","T240","T600")
Code2 <- c("4A","4A","4A","2B","2B","2B")
Level <- c(1,2,3,1,2,3)
Q1 <- c(30,40,40,50,60,80)
Q2 <- c(50,30,50,40,80,30)
Q3 <- c(30,45,70,42,81,34)
Df <- data.frame(Code1, Code2, Level, Q1, Q2, Q3)
The general idiom in the tidyverse is to gather() your data to the maximal extent, forming a "long" data frame with one measurement per row. Then, spread() can revert this long data frame into whichever "wide" format that you like best. This procedure can effectively transpose the data: just gather() all the identifier columns except the row names, and then spread() the row names.
For example, here is how to effectively transpose mtcars:
require(tidyverse)
mtcars %>%
rownames_to_column %>%
gather(variable, value, -rowname) %>%
spread(rowname, value)
Your data does not have "row names" as understood in R, but Code1 effectively serves as a row name because it uniquely identifies each (original) row of your data.
Df1 <- Df %>%
group_by(Code1, Code2, Level) %>%
summarise_all(funs(count = sum(!is.na(.)))) %>%
gather(column, value, -Code1) %>%
spread(Code1, value)
UPDATE for tidyr 1.0 or higher (late 2019 onwards)
The new pivot_wider() and pivot_longer() functions are now preferred over the older (but still supported) gather() and spread(). Thus the preferred way to transpose mtcars is probably
require(tidyverse)
mtcars %>%
rownames_to_column() %>%
pivot_longer(-rowname, 'variable', 'value') %>%
pivot_wider(variable, rowname)
library(tidyr)
library(dplyr)
Df <- Df %>% group_by(Code1, Code2, Level) %>%
summarise_all(funs(count = sum(!is.na(.)))) %>%
gather(var, val, 2:ncol(Df)) %>%
spread(Code1, val)
I have a data.frame that contains client names, years, and several revenue numbers from each year.
df <- data.frame(client = rep(c("Client A","Client B", "Client C"),3),
year = rep(c(2014,2013,2012), each=3),
rev = rep(c(10,20,30),3)
)
I want to end up with a data.frame that aggregates the revenue by client and year. I then want to sort the data.frame by year then by descending revenue.
library(dplyr)
df1 <- df %>%
group_by(client, year) %>%
summarise(tot = sum(rev)) %>%
arrange(year, desc(tot))
However, when using the code above the arrange() function doesn't change the order of the grouped data.frame at all. When I run the below code and coerce to a normal data.frame it works.
library(dplyr)
df1 <- df %>%
group_by(client, year) %>%
summarise(tot = sum(rev)) %>%
data.frame() %>%
arrange(year, desc(tot))
Am I missing something or will I need to do this every time when trying to arrange a grouped_df by a grouped variable?
R Version: 3.1.1
dplyr package version: 0.3.0.2
EDIT 11/13/2017:
As noted by lucacerone, beginning with dplyr 0.5, arrange once again ignores groups when sorting. So my original code now works in the way I initially expected it would.
arrange() once again ignores grouping, reverting back to the behaviour of dplyr 0.3 and earlier. This makes arrange() inconsistent with other dplyr verbs, but I think this behaviour is generally more useful. Regardless, it’s not going to change again, as more changes will just cause more confusion.
Try switching the order of your group_by statement:
df %>%
group_by(year, client) %>%
summarise(tot = sum(rev)) %>%
arrange(year, desc(tot))
I think arrange is ordering within groups; after summarize, the last group is dropped, so this means in your first example it's arranging rows within the client group. Switching the order to group_by(year, client) seems to fix it because the client group gets dropped after summarize.
Alternatively, there is the ungroup() function
df %>%
group_by(client, year) %>%
summarise(tot = sum(rev)) %>%
ungroup() %>%
arrange(year, desc(tot))
Edit, #lucacerone: since dplyr 0.5 this does not work anymore:
Breaking changes arrange() once again ignores grouping, reverting back
to the behaviour of dplyr 0.3 and earlier. This makes arrange()
inconsistent with other dplyr verbs, but I think this behaviour is
generally more useful. Regardless, it’s not going to change again, as
more changes will just cause more confusion.
Latest versions of dplyr (at least from dplyr_0.7.4) allow to arrange within groups. You just have so set into the arrange() call .by_group = TRUE. More information is available here
In your example, try:
library(dplyr)
df %>%
group_by(client, year) %>%
summarise(tot = sum(rev)) %>%
arrange(desc(tot), .by_group = TRUE)
I can summarise a data frame with dplyr like this:
mtcars %>%
group_by(cyl) %>%
summarise(mean(mpg))
To convert the output back to class data.frame, my current approach is this:
as.data.frame(mtcars %>%
group_by(cyl) %>%
summarise(mean(mpg)))
Is there any way to get dplyr to output a class data.frame without having to use as.data.frame?
As was pointed out in the comments you might not need to convert it since it might be good enough that it inherits from data frame. If that is not good enough then this still uses as.data.frame but is slightly more elegant:
mtcars %>%
group_by(cyl) %>%
summarise(mean(mpg)) %>%
ungroup %>%
as.data.frame()
ADDED I just read in the comments that the reason you want this is to avoid the truncation of printed output. In that case just define this option, possibly in your .Rprofile file:
options(dplyr.print_max = Inf)
(Note that you can still hit the maximum defined by the "max.print" option associated with print so you would need to set that one too if it's also too low for you.)
Update: Changed %.% to %>% to reflect changes in dplyr.
In addition to what G. Grothendieck mentioned above, you can convert it into a new dataframe:
new_summary <- mtcars %>%
group_by(cyl) %>%
summarise(mean(mpg)) %>%
as.data.frame()