dplyr not respecting group_by when applying cumsum - r

As described in numerous questions on here, I should be able to take a data.frame, group it, sort by date, and then apply cumsum, to get the cumulative sum over time per grouping.
Instead, with dplyr 0.8.0, I'm getting cumulative sums that ignore the grouping.
Example code:
data.frame(
cat = sample(c("a", "b", "c"), size = 1000, replace = T),
date = sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 1000, replace=T)
) %>%
mutate(
x = 1
) %>%
arrange(date) %>%
group_by(cat) %>%
mutate(x = cumsum(x)) %>%
tail()
Now, I'd expect the last few rows to have x equal to around 300-something, for each group.
Instead I get:
# A tibble: 6 x 3
# Groups: cat [2]
cat date x
<chr> <date> <dbl>
1 a 1999-12-31 995
2 a 1999-12-31 996
3 c 2000-01-01 997
4 a 2000-01-01 998
5 c 2000-01-01 999
6 a 2000-01-01 1000
What am I doing wrong?

I'm guessing this is a classic problem when you load plyr after dplyr, nothing to do with your version of dplyr. For example:
tmp1<- data.frame(cat = sample(c("a", "b", "c"), size = 1000, replace = T),
date = sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 1000, replace=T)) %>% mutate(x = 1)
see difference between
tmp1 %>%
arrange(date) %>%
group_by(cat) %>%
plyr::mutate(x = cumsum(x)) %>%
tail()
and
tmp1 %>%
arrange(date) %>%
group_by(cat) %>%
dplyr::mutate(x = cumsum(x)) %>%
tail()
plyr's mutate doesn't understand grouping.
You can verify if this is the problem using search()

Related

Using dplyr::summarise with dplyr::across and purrr::map to sum across columns with the same prefix

I have a data frame where I want to sum column values with the same prefix to produce a new column. My current problem is that it's not taking into account my group_by variable and returning identical values. Is part of the problem the .cols variable I'm selecting in the across function?
Sample data
library(dplyr)
library(purrr)
set.seed(10)
dat <- data.frame(id = rep(1:2, 5),
var1.pre = rnorm(10),
var1.post = rnorm(10),
var2.pre = rnorm(10),
var2.post = rnorm(10)
) %>%
mutate(index = id)
var_names = c("var1", "var2")
What I've tried
sumfunction <- map(
var_names,
~function(.){
sum(dat[glue("{.x}.pre")], dat[glue("{.x}.post")], na.rm = TRUE)
}
) %>%
setNames(var_names)
dat %>%
group_by(id) %>%
summarise(
across(
.cols = index,
.fns = sumfunction,
.names = "{.fn}"
)
) %>%
ungroup
Desired output
For this and similar problems I made the 'dplyover' package (it is not on CRAN). Here we can use dplyover::across2() to loop over two series of columns, first, all columns ending with "pre" and second all columns ending with "post". To get the names correct we can use .names = "{pre}" to get the common prefix of both series of columns.
library(dplyr)
library(dplyover) # https://timteafan.github.io/dplyover/
dat %>%
group_by(id) %>%
summarise(across2(ends_with("pre"),
ends_with("post"),
~ sum(c(.x, .y)),
.names = "{pre}"
)
)
#> # A tibble: 2 × 3
#> id var1 var2
#> <int> <dbl> <dbl>
#> 1 1 -2.32 -5.55
#> 2 2 1.11 -9.54
Created on 2022-12-14 with reprex v2.0.2
Whenever operations across multiple columns get complicated, we could pivot:
library(dplyr)
library(tidyr)
dat %>%
pivot_longer(-c(id, index),
names_to = c(".value", "name"),
names_sep = "\\.") %>%
group_by(id) %>%
summarise(var1 = sum(var1), var2=sum(var2))
id var1 var2
<int> <dbl> <dbl>
1 1 -2.32 -5.55
2 2 1.11 -9.54

How to estimate the mean of the 10% upper and lower values over multiple categories with dplyr?

Suppose you have this data.frame in R
set.seed(15)
df <- data.frame(cat = rep(c("a", "b"), each = 50),
x = c(runif(50, 0, 1), runif(50, 1, 2)))
I want to estimate the mean of the 10% upper and lower values in each category.
I can do it using base functions like this
dfa <- df[df$cat=="a",]
dfb <- df[df$cat=="b",]
mean(dfa[dfa$x >= quantile(dfa$x, 0.9),"x"])
# [1] 0.9537632
mean(dfa[dfa$x <= quantile(dfa$x, 0.1),"x"])
# [1] 0.07959845
mean(dfb[dfb$x >= quantile(dfb$x, 0.9),"x"])
# [1] 1.963775
mean(dfb[dfb$x <= quantile(dfb$x, 0.1),"x"])
# [1] 1.092218
However, I can't figure it out how to implement this using dplyr or purrr.
Thanks for the help.
We could do this in a group by approach using cut and quantile as breaks
library(dplyr)
df %>%
group_by(cat) %>%
mutate(grp = cut(x, breaks = c(-Inf, quantile(x,
probs = c(0.1, 0.9)), Inf))) %>%
group_by(grp, .add = TRUE) %>%
summarise(x = mean(x, na.rm = TRUE), .groups = 'drop_last') %>%
slice(-2)
-ouptut
# A tibble: 4 x 3
# Groups: cat [2]
cat grp x
<chr> <fct> <dbl>
1 a (-Inf,0.0813] 0.0183
2 a (0.853, Inf] 0.955
3 b (-Inf,1.21] 1.07
4 b (1.93, Inf] 1.95
Here's a way you can use cut() to help partitaion your data into groups and then take the mean
df %>%
group_by(cat) %>%
mutate(part=cut(x, c(-Inf, quantile(x, c(.1, .9)), Inf), labels=c("low","center","high"))) %>%
filter(part!="center") %>%
group_by(cat, part) %>%
summarize(mean(x))
which returns everything in a nice tibble
cat part `mean(x)`
<chr> <fct> <dbl>
1 a low 0.0796
2 a high 0.954
3 b low 1.09
4 b high 1.96
To make it a bit cleaner, you can factor out the splitting to a helper function
split_quantile <- function(x , p=c(.1, .9)) {
cut(x, c(-Inf, quantile(x, c(.1, .9)), Inf), labels=c("low","center","high"))
}
df %>%
group_by(cat) %>%
mutate(part = split_quantile(x)) %>%
filter(part != "center") %>%
group_by(cat, part) %>%
summarize(mean(x))
A variant of #MrFlick's answer - you can use cut_number and slice:
df %>%
group_by(cat) %>%
mutate(part = cut_number(x, n = 10)) %>%
group_by(cat, part) %>%
summarise(mean(x)) %>%
slice(1, n())

dplyr count unique and repeat id's by months

I have a df that looks like the following:
ID DATE
12 10-20-20
12 10-22-20
10 10-15-20
9 10-10-20
11 11-01-20
7 11-02-20
I would like to group by month and then create a column for unique id count and repeat id count like below:
MONTH Unique_Count Repeat_Count
10-1-20 2 2
11-1-20 2 0
I am able to get the date down to the first of the month and group by ID but I am not sure how to count unique instances within the months.
df %>%
mutate(month = floor_date(as.Date(DATE), "month")) %>%
group_by(ID) %>%
mutate(count = n())
Are you perhaps looking for:
df %>%
mutate(month = strftime(floor_date(as.Date(DATE, "%m-%d-%y"), "month"),
"%m-%d-%y")) %>%
group_by(month) %>%
summarize(unique_count = length(which(table(ID) == 1)),
repeat_count = sum(table(ID)[(which(table(ID) > 1))]))
#> # A tibble: 2 x 3
#> month unique_count repeat_count
#> <chr> <int> <int>
#> 1 10-01-20 2 2
#> 2 11-01-20 2 0
Here's a shot at it:
library(lubridate)
library(dplyr)
dates <- as.Date(c("2020-10-15", "2020-10-15", "2020-11-16", "2020-11-16", "2020-11-16"))
ids <- c(12, 12, 13, 13, 14)
df <- data.frame(dates, ids)
duplicates <- df %>%
group_by(dates_floored = floor_date(dates, unit = "month"), ids) %>%
mutate(duplicate_count = n()) %>%
filter(duplicate_count > 1) %>%
distinct(ids, .keep_all = TRUE)
uniques <- df %>%
group_by(dates_floored = floor_date(dates, unit = "month"), ids) %>%
mutate(unique_count = n()) %>%
filter(unique_count < 2) %>%
distinct(ids, .keep_all = TRUE)
df_cleaned <- full_join(uniques, duplicates, by = c("ids", "dates", "dates_floored")) %>%
group_by(dates_floored) %>%
summarize(count_duplicates = sum(duplicate_count, na.rm = TRUE),
count_unique = sum(unique_count, na.rm = TRUE))
df_cleaned

More efficient way to perform calculations on multiple (combined) columns by group

What is a more efficient way to perform calculations on multiple combined columns by group?
I have a dataset with Manager Effectiveness & Team Effectiveness components. How can I quickly calculate the number of 5s for each component by gender?
The desired outcome is like so:
Number of 5s for 'Manager effectiveness' = 2
Number of 5s for 'Team effectiveness' = 0
So far, I've tried the dplyr method:
Data %>%
group_by(gender) %>%
summarise(sum(c(Manager EQ, Manager IQ)) == 5)
Data %>%
group_by(gender) %>%
summarise(sum(c(Team collaboration, Team friendliness)) == 5)
Though it works, typing each column name quickly becomes tedious and error-prone as more columns are involved.
We can use summarise_at
library(dplyr)
Data %>%
group_by(gender) %>%
summarise_at(vars(starts_with('Manager')), ~ sum(. == 5))
Or if we are checking the sum of all numeric columns, use summarise_if
Data %>%
group_by(gender) %>%
summarise_if(is.numeric, ~ sum(. == 5))
Can we wrapped in a function
f1 <- function(dat, colPrefix, grp, val) {
dat %>%
group_by_at(grp) %>%
summarise_at(vars(starts_with(colPrefix)), ~ sum(. == val))
}
f1(Data, "Manager", "gender", 5)
Mostly expanding on #akrun's answer:
## made up data 100 observations
set.seed(133)
dat <- 1:5
gen <- c("M", "F")
z <- tibble(me = sample(dat, 100, TRUE),
mi = sample(dat, 100, TRUE),
tc = sample(dat, 100, TRUE),
tf = sample(dat, 100, TRUE),
gender = sample(gen, 100, TRUE))
# Grouping by gender, counting 5's, and reshaping data
z %>%
group_by(gender) %>%
summarise_at(vars(everything()), ~ sum(. == 5)) %>%
pivot_longer(me:tf) %>%
mutate(name = paste0("# 5's for ", name)) %>%
pivot_wider(gender)
Output:
# A tibble: 2 x 5
gender `# 5's for me` `# 5's for mi` `# 5's for tc` `# 5's for tf`
<chr> <int> <int> <int> <int>
1 F 6 6 8 5
2 M 10 14 20 5
This is starting to get a little hack-ey, but in response to Amanda's comment & my misunderstanding of the question:
z %>%
group_by(gender) %>%
summarise_at(vars(everything()), ~ sum(. == 5)) %>%
pivot_longer(me:tf) %>%
mutate(name = paste0("# 5's for ", name)) %>%
mutate(grp = ifelse(str_detect(name, 'm'), 'manager', 'team')) %>%
group_by(gender, grp) %>%
summarise(total_5s = sum(value))
Gives results:
# A tibble: 4 x 3
# Groups: gender [2]
gender grp total_5s
<chr> <chr> <int>
1 F manager 12
2 F team 13
3 M manager 24
4 M team 25
Unfortunately this relies heavily on making a distinction and group based on the column names of the original data.

How to use multiple arguments in mutate_all for any function?

My data is below
grp <- paste('group', sample(1:3, 100, replace = T))
x <- rnorm(100, 100)
y <- rnorm(100, 10)
df <- data.frame(grp = grp, x =x , y =y , stringsAsFactors = F)
lag_size <- c(10, 4, 9)
Now when I try to use
df %>% group_by(grp) %>% mutate_all(lag, n = lag_size) %>% arrange(grp)
it gives an error
Error in mutate_impl(.data, dots) :
Expecting a single value:
whereas this works fine
df %>% group_by(grp) %>% mutate_all(lag, n = 10) %>% arrange(grp)
If we need to do the lag based on the 'grp' i.e. to lag the corresponding 'grp' with the value specified in 'lag_size'
library(tidyverse)
res <- map2(split(df[2:3], df$grp) , lag_size, ~.x %>%
mutate_all(lag, n = .y)) %>%
bind_rows(., .id = 'grp')
We can check the lag in 'grp' by the position of the first non-NA element
res %>%
group_by(grp) %>%
summarise(n = which(!is.na(x))[1]-1)
# A tibble: 3 x 2
# grp n
# <chr> <dbl>
#1 group 1 10
#2 group 2 4
#3 group 3 9

Resources