Does dplyr's summarize output ever have a deterministic output order?

When performing a grouped summary in dplyr, one would normally summarize all target variables in a single command:
# Method 1: summarize all target variables in one command
mtcars %>%
  group_by(am) %>%
  summarize(mpg = mean(mpg),
            disp = mean(disp))
However, one might prefer to perform the summarizations separately for greater flexibility and programmability (yes, I am aware of across(), but my impression is that its flexibility is limited). In this case, I assume that one must join the tables together at the end:
# Method 2: summarize separately and join
a <- mtcars %>%
  group_by(am) %>%
  summarize(mpg = mean(mpg))
b <- mtcars %>%
  group_by(am) %>%
  summarize(disp = mean(disp))
inner_join(a, b, by = 'am')
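If there are many such per-variable summaries, the joins can be folded programmatically; a minimal sketch using purrr::reduce (my addition, assuming every summary shares the am key):
library(purrr)
summaries <- list(a, b)  # one summary table per target variable
reduce(summaries, inner_join, by = 'am')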
The join could be avoided by just appending the summary from b to a:
a$c <- b$disp
However, this would assume that the rows of a and b are in the same order. That is certainly not assured in general, as typical SQL databases do not guarantee output order. When dplyr uses such a database as a backend, it will presumably reflect whatever arbitrary order the database returned the data in.
My question is: does vanilla dplyr (i.e. no external backend) guarantee a certain ordering of rows, such that the non-join solution can be considered safe and robust? I suspect dplyr is not interested in guaranteeing row order, but I have not been able to find a definitive statement one way or the other.
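For what it's worth, the non-join shortcut can be made robust regardless of any ordering guarantee by sorting both summaries on the grouping key first; a minimal defensive sketch (my addition):
a <- arrange(a, am)
b <- arrange(b, am)
stopifnot(identical(a$am, b$am))  # guard: keys must line up row-for-row
a$c <- b$disp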

Related

Using the R syntax sequence operator ":" within the sum command with more than 50 columns

I would like to index by column name within the sum command using the sequence operator.
library(dbplyr)
library(tidyverse)
df = data.frame(
  X = c("A", "B", "C"),
  X.1=c(1,2,3), X.2=c(1,2,3), X.3=c(1,2,3), X.4=c(1,2,3), X.5=c(1,2,3), X.6=c(1,2,3), X.7=c(1,2,3), X.8=c(1,2,3), X.9=c(1,2,3), X.10=c(1,2,3),
  X.11=c(1,2,3), X.12=c(1,2,3), X.13=c(1,2,3), X.14=c(1,2,3), X.15=c(1,2,3), X.16=c(1,2,3), X.17=c(1,2,3), X.18=c(1,2,3), X.19=c(1,2,3), X.20=c(1,2,3),
  X.21=c(1,2,3), X.22=c(1,2,3), X.23=c(1,2,3), X.24=c(1,2,3), X.25=c(1,2,3), X.26=c(1,2,3), X.27=c(1,2,3), X.28=c(1,2,3), X.29=c(1,2,3), X.30=c(1,2,3),
  X.31=c(1,2,3), X.32=c(1,2,3), X.33=c(1,2,3), X.34=c(1,2,3), X.35=c(1,2,3), X.36=c(1,2,3), X.37=c(1,2,3), X.38=c(1,2,3), X.39=c(1,2,3), X.40=c(1,2,3),
  X.41=c(1,2,3), X.42=c(1,2,3), X.43=c(1,2,3), X.44=c(1,2,3), X.45=c(1,2,3), X.46=c(1,2,3), X.47=c(1,2,3), X.48=c(1,2,3), X.49=c(1,2,3), X.50=c(1,2,3),
  X.51=c(1,2,3), X.52=c(1,2,3), X.53=c(1,2,3), X.54=c(1,2,3), X.55=c(1,2,3), X.56=c(1,2,3))
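As an aside (my addition, not part of the question): a repetitive frame like this can be built programmatically, which scales to any number of columns. A sketch under the assumption that every X.i column holds the same values 1:3:
df <- data.frame(
  X = c("A", "B", "C"),
  setNames(replicate(56, c(1, 2, 3), simplify = FALSE), paste0("X.", 1:56)))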
Is there a quicker way to do this? The following provides the correct result. However, for large datasets (larger than this one) it becomes very laborious to deal with, especially when pivot_wider is used and the columns are not created beforehand (like above).
df %>% rowwise() %>% mutate(
  Result_column = case_when(
    X == "A" ~ sum(c(X.1, X.2, X.3, X.4, X.5)),
    X == "B" ~ sum(c(X.4, X.5)),
    X == "C" ~ sum(c(X.3, X.4, X.5, X.6, X.7, X.8, X.9, X.10, X.11, X.12, X.13, X.14, X.15, X.16,
                     X.17, X.18, X.19, X.20, X.21, X.22, X.23, X.24, X.25, X.26, X.27, X.28, X.29, X.30,
                     X.31, X.32, X.33, X.34, X.35, X.36, X.37, X.38, X.39, X.40, X.41, X.42, X.43, X.44,
                     X.45, X.46, X.47, X.48, X.49, X.50, X.51, X.52, X.53, X.54, X.55, X.56)))) %>%
  dplyr::select(Result_column)
The following is how it would be written using ":" selection syntax, which is what I would like to use. However, it does not produce the correct numerical result: inside sum(c(...)), X.1:X.5 is evaluated as a numeric sequence from the value of X.1 to the value of X.5, not as a range of columns. If it worked, it would shorten the code by ~50 entries.
df %>% rowwise() %>% mutate(
  Result_column = case_when(
    X == "A" ~ sum(c(X.1:X.5)),
    X == "B" ~ sum(c(X.4:X.5)),
    X == "C" ~ sum(c(X.3:X.56)))) %>% dplyr::select(Result_column)
Below is a related question; however, it is not the same, because what is needed is not a column that starts with "X" but rather a sequence.
Using mutate rowwise over a subset of columns
EDIT:
The provided code (below) from cnbrowlie is correct.
df %>% mutate(
  Result_column = case_when(
    X == "A" ~ sum(c(X.1:X.5)),
    X == "B" ~ sum(c(X.4:X.5)),
    X == "C" ~ sum(c(X.3:X.56)))) %>% dplyr::select(Result_column)
This can be done with dplyr >= 1.0.0 using rowSums() (which computes the sum for a row across multiple columns) and across() (which superseded vars() as a method for specifying columns in a data frame, allowing the use of : to select sequences of columns):
df %>% rowwise() %>% mutate(
  Result_column = case_when(
    X == "A" ~ rowSums(across(X.1:X.5)),
    X == "B" ~ rowSums(across(X.4:X.5)),
    X == "C" ~ rowSums(across(X.3:X.56))
  )
) %>% dplyr::select(Result_column)
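As a side note (my addition, not part of the accepted answer): in dplyr >= 1.0.0 the rowwise-idiomatic spelling is sum(c_across()), which also accepts : column ranges; a minimal sketch on the same df:
df %>% rowwise() %>% mutate(
  Result_column = case_when(
    X == "A" ~ sum(c_across(X.1:X.5)),
    X == "B" ~ sum(c_across(X.4:X.5)),
    X == "C" ~ sum(c_across(X.3:X.56))
  )
) %>% dplyr::select(Result_column)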

Collapsing scattered information across multiple variables into 1 in R

I have some table data that has been scattered across around 1000 variables in a dataset. Most are split across 2 variables, and I can piece together the data using coalesce; however, this is pretty inefficient for some variables which are instead spread across more than 10. Is there a better/more efficient way?
The syntax I have written so far is:
scattered_data <- df %>%
  select(id, contains("MASS9A_E2")) %>%
  # this brings in all the variables for this one question that start with this string
  mutate(speciality = coalesce(MASS9A_E2_C4_1, MASS9A_E2_C4_2, MASS9A_E2_C4_3, MASS9A_E2_C4_4,
                               MASS9A_E2_C4_5, MASS9A_E2_C4_6, MASS9A_E2_C4_7, MASS9A_E2_C4_8,
                               MASS9A_E2_C4_9, MASS9A_E2_C5_1, MASS9A_E2_C5_2, MASS9A_E2_C5_3,
                               MASS9A_E2_C5_4, MASS9A_E2_C5_5, MASS9A_E2_C5_6, MASS9A_E2_C5_7,
                               MASS9A_E2_C5_8, MASS9A_E2_C5_9))
As I have this for 28 MASS questions, I would really love to be able to collapse these down a bit quicker.
You can use do.call() to take all columns except id as input of coalesce().
library(dplyr)
df %>%
  select(id, contains("MASS9A_E2")) %>%
  # select from the piped data (.), not the original df, so that only
  # the MASS9A_E2 columns are passed to coalesce()
  mutate(speciality = do.call(coalesce, select(., -id)))
In addition, you can call coalesce() iteratively with Reduce():
df %>%
  select(id, contains("MASS9A_E2")) %>%
  mutate(speciality = Reduce(coalesce, select(., -id)))
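Another spelling (my addition, not from the original answer) splices the selected columns directly into coalesce() as separate arguments with rlang's !!! operator, avoiding do.call():
df %>%
  select(id, contains("MASS9A_E2")) %>%
  mutate(speciality = coalesce(!!!select(., -id)))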

Why selected variables in dplyr package are not in output df in R?

I have a df with 30 columns and 2000 rows.
From the df, I selected several variables by name and calculated the mean of Value over every 3 rows, by the group and type variables.
But only 3 variables (group, type, res) appear in the output data.
How do I tell it to keep the selected variables in the output df? Is there anything wrong with this code?
output <- data %>%
  select(group, type, A, B, C, Value) %>%
  group_by(group = gl(n()/3, 3), type) %>%
  summarise(res = mean(Value))
Thanks in advance!
As others have pointed out, summarize only returns the grouping variables and the variables created inside summarize. This is by design: summarize returns a single row for each group, so there must be a single value for each variable.
The function used in summarize must return a single value (so that's covered), and using group_by with variables ensures that those variables are constant within each group. But the other variables could take several different values within a group: which one should summarize choose? Instead of making a guess, it drops those variables.
There are several options to get around this; which one is best depends on your data and what you want to do with it:
Add these variables as grouping variables. This is the preferred method, but obviously it only works if the structure of the data allows it. For example, in a hypothetical dataset, if you want to group by city but want to preserve the state variable, using group_by(city, state) will divide into groups the same way as group_by(city) since city and state are linked (for example, "Boston" will always be with "MA").
Define them in summarize and choose only the first value to be the value for that group, as in #thc's answer. Note that you will lose any other values of those variables and it's not always clear which value will be kept and which will be lost.
Use mutate instead - this will keep the original number of rows rather than collapsing to 1 per group, but will ensure that you don't lose any data.
Collapse them into a comma (or other) separated string by adding A = paste(A, collapse = ', ') to the summarize call for each variable you want to keep (note it is collapse, not sep, that joins a vector into one string). This preserves the information, at the expense of making it difficult to work with in future steps.
You can include them in summarise instead, e.g.:
output <- data %>%
  select(group, type, A, B, C, Value) %>%
  group_by(group = gl(n()/3, 3), type) %>%
  summarise(res = mean(Value), A = A[1], B = B[1], C = C[1])
I believe this is the fastest approach under dplyr if you have a very large data.frame.
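For completeness, a minimal sketch of the mutate() alternative (option 3 above), reusing the asker's code; it keeps A, B and C, and every original row:
output <- data %>%
  select(group, type, A, B, C, Value) %>%
  group_by(group = gl(n()/3, 3), type) %>%
  mutate(res = mean(Value)) %>%
  ungroup()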

Does the order of the dplyr functions used in a pipeline matter?

I noticed that the order in which the dplyr functions are used in a pipeline impacts the result. For example:
iris %>%
  group_by(Species) %>%
  mutate(Sum = sum(Sepal.Length))
produces different results than this:
iris %>%
  mutate(Sum = sum(Sepal.Length)) %>%
  group_by(Species)
Can anyone explain the reason for this, and whether there is a specific order in which the functions have to appear?
Thank you
FYI: iris is a built-in dataset in R; use data(iris) to load it. I was trying to add a new column with the sum of sepal lengths for each species.
Yes, the order matters.
The pipe is equivalent to:
iris <- group_by(iris, Species)
iris <- mutate(iris, Sum = sum(Sepal.Length))
If you change the order, you change the result. If you group by species first, you'll get the sum by species (I guess that's what you want).
However, if you group by species after the sum, the Sum column will hold the sum of Sepal.Length across all species.
Yes, the order matters, because each part of the pipe is evaluated on its own, from the first pipe-part through to the last, and the result of the previous part (or the original dataset) is piped forward into the next part. That means that if you use group_by after the mutate, as in your example, the mutate is done without grouping.
One side effect is that you can create complex and long pipes where you control the order of operations (by positioning them at the right part of the pipe) and you don't need to start a new pipe after an operation is finished.
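A quick way to see the difference (my illustration, not from the answers) is to compare the distinct Sum values the two orderings produce:
library(dplyr)
# group_by() first: one sum per species
iris %>% group_by(Species) %>% mutate(Sum = sum(Sepal.Length)) %>% distinct(Species, Sum)
# mutate() first: a single grand total repeated on every row
iris %>% mutate(Sum = sum(Sepal.Length)) %>% distinct(Sum)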

How to extract one specific group in dplyr

Given a grouped tbl, can I extract one or a few groups?
Such a function can be useful when prototyping code, e.g.:
mtcars %>%
  group_by(cyl) %>%
  select_first_n_groups(2) %>%
  do({'complicated expression'})
Surely, one can do an explicit filter before grouping, but that can be cumbersome.
With a bit of dplyr along with some nesting/unnesting (supported by the tidyr package), you can build a small helper to get the first (or any) group:
first <- function(x) x %>% nest() %>% ungroup() %>% slice(1) %>% unnest(data)
mtcars %>% group_by(cyl) %>% first()
By adjusting the slicing you could also extract the nth or any range of groups by index, but typically the first or the last is what most users want.
The name is inspired by functional APIs which all call it first (see stdlibs of i.e. kotlin, python, scala, java, spark).
Edit: Faster Version
A more scalable version (>50x faster on large datasets) that avoids nesting would be:
first_group <- function(x) x %>%
  select(group_cols()) %>%
  distinct() %>%
  ungroup() %>%
  slice(1) %>%
  { semi_join(x, .) }
Another positive side effect of this improved version is that it fails if no grouping is present in x.
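In recent dplyr (>= 1.0.0) there is an even shorter spelling (my addition, not from the answer), using cur_group_id() inside filter():
# keep the first group; use <= n for the first n groups (in group-key order)
mtcars %>% group_by(cyl) %>% filter(cur_group_id() == 1)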
Try this where groups is a vector of group numbers. Here 1:2 means the first two groups:
select_groups <- function(data, groups, ...)
data[sort(unlist(attr(data, "indices")[ groups ])) + 1, ]
mtcars %>% group_by(cyl) %>% select_groups(1:2)
The selected rows appear in the original order. If you prefer the rows to appear in the order that the groups are specified (e.g. in the above example, the rows of the first group followed by the rows of the second group), then remove the sort.
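Note that attr(data, "indices") relies on a dplyr internal that was removed in dplyr 0.8; a sketch of the same idea against the current grouped-data API uses group_rows(), which returns 1-based row indices, so the + 1 adjustment is no longer needed:
select_groups <- function(data, groups, ...)
  data[sort(unlist(group_rows(data)[groups])), ]
mtcars %>% group_by(cyl) %>% select_groups(1:2)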
