select minus operator in dplyr group_by - r

Does anyone know of a fast way to select 'all-but-one' (or 'all-but-a-few') columns when using dplyr::group_by?
Ultimately, I just want to aggregate over all distinct rows after removing a few select columns, but I don't want to have to explicitly list all the grouping columns each time (since those get added and removed somewhat frequently in my analysis).
Example:
> df <- data_frame(a = c(1,1,2,2), b = c("foo", "foo", "bar", "bar"), c = runif(4))
> df
Source: local data frame [4 x 3]
a b c
(dbl) (chr) (dbl)
1 1 foo 0.95460749
2 1 foo 0.05094088
3 2 bar 0.93032589
4 2 bar 0.40081121
Now I want to aggregate by a and b, so I can do this:
> df %>% group_by(a, b) %>% summarize(mean(c))
Source: local data frame [2 x 3]
Groups: a [?]
a b mean(c)
(dbl) (chr) (dbl)
1 1 foo 0.5027742
2 2 bar 0.6655686
Great.
But, I'd really like to be able to do something like just specify not c, similar to dplyr::select(-c):
> df %>% select(-c)
Source: local data frame [4 x 2]
a b
(dbl) (chr)
1 1 foo
2 1 foo
3 2 bar
4 2 bar
But group_by evaluates its arguments as expressions rather than as a column selection, so the equivalent doesn't work:
> df %>% group_by(-c) %>% summarize(mean(c))
Source: local data frame [4 x 2]
-c mean(c)
(dbl) (dbl)
1 -0.95460749 0.95460749
2 -0.93032589 0.93032589
3 -0.40081121 0.40081121
4 -0.05094088 0.05094088
Anyone know if I'm just missing a basic function or shortcut to help me do this quickly?
Example use case: if df suddenly gains a new column d, I'd like the downstream code to now aggregate over unique combinations of a, b, and d, without me having to explicitly add d to the group_by call.

In current versions of dplyr, the function group_by_at, together with vars, accomplishes this goal:
df %>% group_by_at(vars(-c)) %>% summarize(mean(c))
# A tibble: 2 x 3
# Groups: a [?]
a b `mean(c)`
<dbl> <chr> <dbl>
1 1 foo 0.5027742
2 2 bar 0.6655686
This appears to have been introduced in dplyr 0.7.0, in June 2017.
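In dplyr 1.0.0 and later, the scoped _at verbs are superseded; as far as I can tell the same selection can be written with across() inside group_by() (a sketch along those lines; pick() plays a similar role in dplyr 1.1+):
library(dplyr)
df %>%
  group_by(across(-c)) %>%
  summarize(mean_c = mean(c))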

Related

Using tidyr::pivot_wider when some keys have multiple values [duplicate]

I have a long data frame that I want to widen, but one key has two different values:
df <- data.frame(ColA=c("A", "B", "B", "C"), ColB=letters[23:26])
ColA ColB
1 A w
2 B x
3 B y
4 C z
I want my output to be a paste of the two values for this key together:
ColA ColB
1 A w
2 B xy
3 C z
A regular pivot_wider() will throw a warning and convert the values to lists:
df.wide <- df %>%
pivot_wider(names_from=ColA, values_from=ColB)
Warning message:
Values are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = length` to identify where the duplicates arise
* Use `values_fn = {summary_fun}` to summarise duplicates
# A tibble: 1 x 3
A B C
<list> <list> <list>
1 <chr [1]> <chr [2]> <chr [1]>
Based on the warning, it looks like pivot_wider() with a values_fn is close to what I want as an intermediate step:
# intermediate step
df.wide <- df %>%
pivot_wider(names_from=ColA, values_from=ColB, values_fn=SOMETHING)
A B C
1 w xy z
But it looks like values_fn only takes summary functions, not something that would work on character data (like paste()).
The closest I can get is:
df %>%
pivot_wider(names_from=ColA, values_from=ColB, values_fn=list) %>%
mutate(across(everything(), as.character)) %>%
pivot_longer(cols=everything(), names_to="ColA", values_to="ColB")
# A tibble: 3 x 2
ColA ColB
<chr> <chr>
1 A "w"
2 B "c(\"x\", \"y\")"
3 C "z"
plus an additional gsub()-type cleanup in a mutate(). Surely there's an easier way! Preferably within the tidyverse, but I'm also open to other packages.
Thanks
I don't think you need to pivot here, unless your real data is more complicated than the example shown.
library(dplyr)
df %>%
group_by(ColA) %>%
summarise(ColB = paste0(ColB, collapse = ""))
Result:
# A tibble: 3 × 2
ColA ColB
<chr> <chr>
1 A w
2 B xy
3 C z
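For completeness: if you really do need the wide intermediate, tidyr 1.1.0 and later accept a single arbitrary function in values_fn, so (as a sketch, assuming that version) the paste can happen inside pivot_wider() itself:
library(tidyr)
df %>%
  pivot_wider(names_from = ColA, values_from = ColB,
              values_fn = function(x) paste0(x, collapse = ""))
You would still need a pivot_longer() afterwards to get back to the long ColA/ColB shape shown above.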

How to summarize_each with mixed column class

Consider the situation, where I want to summarize_each a data.frame with mixed column type.
> (temp=data.frame(ID=c(1,1,2,2),gender=c("M","M","F","F"),val1=rnorm(4),val2=rnorm(4)))
ID gender val1 val2
1 1 M -1.7944804 0.5232313
2 1 M 0.3938437 -0.8424086
3 2 F -0.3190777 0.3220580
4 2 F 1.3667340 -0.6031376
> temp%>%group_by(ID)%>%summarize_each(funs(mean))
Source: local data frame [2 x 4]
ID gender val1 val2
(dbl) (lgl) (dbl) (dbl)
1 1 NA -0.7003184 -0.1595886
2 2 NA 0.5238282 -0.1405398
This doesn't work because mean(gender) doesn't make sense.
Question:
If all my non-numeric columns are characteristics of ID, and thus identical within each ID, can I somehow get summarize_each to return that 'unique' value?
> temp%>%group_by(ID,gender)%>%summarize_each(funs(mean))
Source: local data frame [2 x 4]
Groups: ID [?]
ID gender val1 val2
(dbl) (fctr) (dbl) (dbl)
1 1 M -0.7003184 -0.1595886
2 2 F 0.5238282 -0.1405398
is the output that I want, but I somehow feel like this is doing unnecessary nested group_by because there really is nothing to group within ID.
One option would be gather/spread from tidyr. Reshape to 'long' format with gather, group by 'ID' and 'var', take the first element of 'gender' and the mean of 'val', then spread back to 'wide' format.
library(tidyr)
library(dplyr)
gather(temp, var, val, val1:val2) %>%
group_by(ID, var) %>%
summarise(gender = first(gender), val = mean(val)) %>%
spread(var, val)
Another option is mutate_if and unique. After grouping by 'ID', we get the mean of the numeric columns with mutate_if. As the other columns (i.e. 'gender') also remain in the output, we can then use unique to keep only the distinct rows.
temp %>%
group_by(ID) %>%
mutate_if(is.numeric, mean) %>%
unique()
# ID gender val1 val2
# <int> <chr> <dbl> <dbl>
#1 1 M -0.7003184 -0.1595886
#2 2 F 0.5238281 -0.1405398
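In current dplyr (1.0 and later), summarize_each and mutate_if are superseded by across(); a sketch of the same idea, assuming that version, which avoids the nested group_by entirely:
library(dplyr)
temp %>%
  group_by(ID) %>%
  summarise(across(where(is.numeric), mean),
            gender = first(gender))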

R: aggregate by all factor levels (present and not present)

I can aggregate a data.frame trivially with dplyr with the following:
z <- data.frame(a = rnorm(20), b = rep(letters[1:4], each = 5))
library(dplyr)
z %>%
group_by(b) %>%
summarise(out = n())
Source: local data frame [4 x 2]
b out
(fctr) (int)
1 a 5
2 b 5
3 c 5
4 d 5
However, sometimes a dataset may be missing a factor level, in which case I would like the output count to be 0.
For example, let's say the typical dataset should have 5 groups.
z$b <- factor(z$b, levels = letters[1:5])
But clearly there aren't any 'e' rows in this particular dataset, though there could be in another. How can I aggregate this data so the count for missing factor levels is 0?
Desired output:
Source: local data frame [5 x 2]
b out
(fctr) (int)
1 a 5
2 b 5
3 c 5
4 d 5
5 e 0
One way to approach this is to use complete from "tidyr". You have to use mutate first to set the factor levels of column "b":
library(dplyr)
library(tidyr)
z %>%
mutate(b = factor(b, letters[1:5])) %>%
group_by(b) %>%
summarise(out = n()) %>%
complete(b, fill = list(out = 0))
# Source: local data frame [5 x 2]
#
# b out
# (fctr) (dbl)
# 1 a 5
# 2 b 5
# 3 c 5
# 4 d 5
# 5 e 0
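Since dplyr 0.8.0, group_by() also has a .drop argument that keeps empty factor levels, so once "b" carries all five levels the complete() step isn't needed (a sketch assuming that version or later):
library(dplyr)
z %>%
  mutate(b = factor(b, levels = letters[1:5])) %>%
  group_by(b, .drop = FALSE) %>%
  summarise(out = n())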
A workaround is to join with a table containing all levels:
z <- full_join(z, data.frame(b = levels(z$b)))
This will set all the missing rows for your analysis variables to NA, which in the general case would make more sense than setting them to zero. You can change them to zero if necessary with z[is.na(z)] <- 0.
You could use xtabs:
xtabs(a ~ b, z)
This sums z$a within each level of z$b rather than just counting rows per level as in your example, but a count that keeps the empty levels is easily achieved with table:
table(z$b)

group_by() into fill() not working as expected

I'm trying to do a Last Observation Carried Forward operation on some poorly formatted data using dplyr and tidyr. It isn't working as I'd expect.
library(dplyr)
library(tidyr)
df <- data.frame(id=c(1,1,2,2,3,3),
email=c('bob@email.com', NA, 'joe@email.com', NA, NA, NA))
df2 <- df %>% group_by(id) %>% fill(email)
This results in:
Source: local data frame [6 x 2]
Groups: id [3]
id email
(dbl) (fctr)
1 1 bob@email.com
2 1 bob@email.com
3 2 joe@email.com
4 2 joe@email.com
5 3 joe@email.com
6 3 joe@email.com
I expect it to be:
Source: local data frame [6 x 2]
Groups: id [3]
id email
(dbl) (fctr)
1 1 bob@email.com
2 1 bob@email.com
3 2 joe@email.com
4 2 joe@email.com
5 3 NA
6 3 NA
The reason I expect it to be the latter is because of group_by's documentation saying, "The group_by function takes an existing tbl and converts it into a grouped tbl where operations are performed "by group"." The group in this case is determined by the id variable, and the following operation is fill(email). However, it's pretty clearly NOT doing that.
And before anybody asks, it makes no difference if the fields are both character instead of numeric or factor.
UPDATE
@aosmith pointed out this open issue on Github. I'm going to say that there won't be a proper solution to this problem until that issue is resolved. Everything else would just be a workaround. So, if somebody makes a successful PR addressing that issue and posts it here, I'd be happy to mark it as the solution.
Looks like this has been fixed in the development version of tidyr. You now get the expected result per id using fill from tidyr_0.3.1.9000.
df %>% group_by(id) %>% fill(email)
Source: local data frame [6 x 2]
Groups: id [3]
id email
(dbl) (fctr)
1 1 bob@email.com
2 1 bob@email.com
3 2 joe@email.com
4 2 joe@email.com
5 3 NA
6 3 NA
Luckily you can still use zoo::na.locf for this:
df %>%
group_by(id) %>%
mutate(email = zoo::na.locf(email, na.rm = FALSE))
# Source: local data frame [6 x 2]
# Groups: id [3]
#
# id email
# (dbl) (fctr)
# 1 1 bob@email.com
# 2 1 bob@email.com
# 3 2 joe@email.com
# 4 2 joe@email.com
# 5 3 NA
# 6 3 NA
Another option is to use do from dplyr:
df3 <- df %>% group_by(id) %>% do(fill(.,email))
Two questions: does the output have to keep the duplicated rows, and do you have to use dplyr and tidyr?
Maybe this could be a solution?
(
bar <- data.frame(id=c(1,1,2,2,3,3),
email=c('bob@email.com', NA, 'joe@email.com', NA, NA, NA))
)
#> id email
#> 1 bob@email.com
#> 1 <NA>
#> 2 joe@email.com
#> 2 <NA>
#> 3 <NA>
#> 3 <NA>
(
foo <- bar[!duplicated(bar$id),]
)
#> id email
#> 1 bob@email.com
#> 2 joe@email.com
#> 3 <NA>
This is kind of ugly, but it is another option that uses dplyr and works with your sample data:
df %>%
group_by(id) %>%
mutate(email = email[ !is.na(email) ][1])
I have come across this issue quite a few times, and I worry about using
df2 <- df %>% group_by(id) %>% fill(email)
on large data sets, as I have had mixed results, so I found the following workaround. The split function, used with map_df, ensures you apply whatever you are doing to a separate data frame for each id, and map_df then binds all the individual data frames back together like magic. It has also proved handy in lots of other circumstances. It is somewhat obsolete now that this issue has been fixed, but it is still a useful alternative that avoids group_by().
library(purrr)
df %>% split(.$id) %>% map_df(function(x){ x %>% fill(email) })
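A similar split-apply-combine can also be written with dplyr's group_split() and purrr's map_dfr(), if you prefer to stay within the tidyverse grammar (a sketch, assuming dplyr >= 0.8):
library(dplyr)
library(purrr)
library(tidyr)
df %>%
  group_split(id) %>%
  map_dfr(~ fill(.x, email))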

rearrange specific rows into columns using dplyr

I am trying to rearrange rows into columns in a specific way (preferably using dplyr), but I don't really know where to start with this. I am trying to create one row for each person (Bill or Bob) and have all of that person's values on one row. So far I have:
df<-data.frame(
Participant=c("bob1","bill1","bob2","bill2"),
No_Photos=c(1,4,5,6)
)
res<-df %>% group_by(Participant) %>% dplyr::summarise(phot_mean=mean(No_Photos))
which gives me:
Participant phot_mean
(fctr) (dbl)
1 bill1 4
2 bill2 6
3 bob1 1
4 bob2 5
GOAL:
mean_NO_Photos_1 mean_No_Photos_2
bob 1 5
bill 4 6
Using tidyr and dplyr:
library(tidyr)
library(dplyr)
df %>% mutate(rep = extract_numeric(Participant),
Participant = gsub("[0-9]", "", Participant)) %>%
group_by(Participant, rep) %>%
summarise(mean = mean(No_Photos)) %>%
spread(rep, mean)
Source: local data frame [2 x 3]
Participant 1 2
(chr) (dbl) (dbl)
1 bill 4 6
2 bob 1 5
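Note that extract_numeric() has since been deprecated in tidyr; a roughly equivalent modern version (a sketch using readr::parse_number() and pivot_wider() in place of spread(), with the column prefix from the goal added via names_prefix) would be:
library(dplyr)
library(tidyr)
df %>%
  mutate(rep = readr::parse_number(as.character(Participant)),
         Participant = gsub("[0-9]", "", Participant)) %>%
  group_by(Participant, rep) %>%
  summarise(mean = mean(No_Photos), .groups = "drop") %>%
  pivot_wider(names_from = rep, values_from = mean,
              names_prefix = "mean_No_Photos_")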

Resources