Count unique values by group - r

DATA = data.frame("TRIMESTER" = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3),
"STUDENT" = c(1,2,3,4,5,6,7,1,2,3,5,9,10,11,3,7,10,6,12,15,17,16,21))
WANT = data.frame("TRIMESTER" = c(1,2,3),
"NEW_ENROLL" = c(7,3,5),
"TOTAL_ENROLL" = c(7,10,15))
I Have 'DATA' and want to make 'WANT' which has three columns and for every 'TRIMESTER' you count the number of NEW 'STUDENT' and then for 'TOTAL_ENROLL' you just count the total number of unique 'STUDENT' every trimester.
My attempt only counts the number for each TRIMESTER.
library(dplyr)
DATA %>%
group_by(TRIMESTER) %>%
count()

Here is a way.
suppressPackageStartupMessages(library(dplyr))
DATA <- data.frame("TRIMESTER" = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3),
"STUDENT" = c(1,2,3,4,5,6,7,1,2,3,5,9,10,11,3,7,10,6,12,15,17,16,21))
DATA %>%
mutate(NEW_ENROLL = !duplicated(STUDENT)) %>%
group_by(TRIMESTER) %>%
summarise(NEW_ENROLL = sum(NEW_ENROLL)) %>%
ungroup() %>%
mutate(TOTAL_ENROLL = cumsum(NEW_ENROLL))
#> # A tibble: 3 × 3
#> TRIMESTER NEW_ENROLL TOTAL_ENROLL
#> <dbl> <int> <int>
#> 1 1 7 7
#> 2 2 3 10
#> 3 3 5 15
Created on 2022-08-14 by the reprex package (v2.0.1)

For variety we can use Base R aggregate with transform
transform(aggregate(. ~ TRIMESTER , DATA[!duplicated(DATA$STUDENT),] , length),
TOTAL_ENROLL = cumsum(STUDENT))
Output
TRIMESTER STUDENT TOTAL_ENROLL
1 1 7 7
2 2 3 10
3 3 5 15

We replace the duplicated elements in 'STUDENT' to NA, grouped by TRIMESTER, get the sum of nonNA elements and finally do the cumulative sum (cumsum)
library(dplyr)
DATA %>%
mutate(STUDENT = replace(STUDENT, duplicated(STUDENT), NA)) %>%
group_by(TRIMESTER) %>%
summarise(NEW_ENROLL = sum(!is.na(STUDENT)), .groups= 'drop') %>%
mutate(TOTAL_ENROLL = cumsum(NEW_ENROLL))
-output
# A tibble: 3 × 3
TRIMESTER NEW_ENROLL TOTAL_ENROLL
<dbl> <int> <int>
1 1 7 7
2 2 3 10
3 3 5 15
Or with distinct
distinct(DATA, STUDENT, .keep_all = TRUE) %>%
group_by(TRIMESTER) %>%
summarise(NEW_ENROLL = n(), .groups = 'drop') %>%
mutate(TOTAL_ENROLL = cumsum(NEW_ENROLL))
# A tibble: 3 × 3
TRIMESTER NEW_ENROLL TOTAL_ENROLL
<dbl> <int> <int>
1 1 7 7
2 2 3 10
3 3 5 15

Related

How to add a row to each group and assign values

I have this tibble:
library(tibble)
library(dplyr)
df <- tibble(id = c("one", "two", "three"),
A = c(1,2,3),
B = c(4,5,6))
id A B
<chr> <dbl> <dbl>
1 one 1 4
2 two 2 5
3 three 3 6
I want to add a row to each group AND assign values to the new column BUT with a function (here the new row in each group should get A=4 B = the first group value of column B USING first(B)-> desired output:
id A B
<chr> <dbl> <dbl>
1 one 1 4
2 one 4 4
3 three 3 6
4 three 4 6
5 two 2 5
6 two 4 5
I have tried so far:
If I add a row in a ungrouped tibble with add_row -> this works perfect!
df %>%
add_row(A=4, B=4)
id A B
<chr> <dbl> <dbl>
1 one 1 4
2 two 2 5
3 three 3 6
4 NA 4 4
If I try to use add_row in a grouped tibble -> this works not:
df %>%
group_by(id) %>%
add_row(A=4, B=4)
Error: Can't add rows to grouped data frames.
Run `rlang::last_error()` to see where the error occurred.
According to this post Add row in each group using dplyr and add_row() we could use group_modify -> this works great:
df %>%
group_by(id) %>%
group_modify(~ add_row(A=4, B=4, .x))
id A B
<chr> <dbl> <dbl>
1 one 1 4
2 one 4 4
3 three 3 6
4 three 4 4
5 two 2 5
6 two 4 4
I want to assign to column B the first value of column B (or it can be any function min(B), max(B) etccc.) -> this does not work:
df %>%
group_by(id) %>%
group_modify(~ add_row(A=4, B=first(B), .x))
Error in h(simpleError(msg, call)) :
Fehler bei der Auswertung des Argumentes 'x' bei der Methodenauswahl für Funktion 'first': object 'B' not found
library(tidyverse)
df <- tibble(id = c("one", "two", "three"),
A = c(1,2,3),
B = c(4,5,6))
df %>%
group_by(id) %>%
summarise(add_row(cur_data(), A = 4, B = first(cur_data()$B)))
#> `summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
#> # A tibble: 6 × 3
#> # Groups: id [3]
#> id A B
#> <chr> <dbl> <dbl>
#> 1 one 1 4
#> 2 one 4 4
#> 3 three 3 6
#> 4 three 4 6
#> 5 two 2 5
#> 6 two 4 5
Or
df %>%
group_by(id) %>%
group_split() %>%
map_dfr(~ add_row(.,id = first(.$id), A = 4, B = first(.$B)))
#> # A tibble: 6 × 3
#> id A B
#> <chr> <dbl> <dbl>
#> 1 one 1 4
#> 2 one 4 4
#> 3 three 3 6
#> 4 three 4 6
#> 5 two 2 5
#> 6 two 4 5
Created on 2022-01-02 by the reprex package (v2.0.1)
Maybe this is an option
library(dplyr)
df %>%
group_by(id) %>%
summarise( A=c(A,4), B=c(B,first(B)) ) %>%
ungroup
`summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
# A tibble: 6 x 3
id A B
<chr> <dbl> <dbl>
1 one 1 4
2 one 4 4
3 three 3 6
4 three 4 6
5 two 2 5
6 two 4 5
According to the documentation of the function group_modify, if you use a formula, you must use ". or .x to refer to the subset of rows of .tbl for the given group;" that's why you used .x inside the add_row function. To be entirely consistent, you have to do it also within the first function.
df %>%
group_by(id) %>%
group_modify(~ add_row(A=4, B=first(.x$B), .x))
# A tibble: 6 x 3
# Groups: id [3]
id A B
<chr> <dbl> <dbl>
1 one 1 4
2 one 4 4
3 three 3 6
4 three 4 6
5 two 2 5
6 two 4 5
Using first(.$B) or first(df$B) will provide the same results.
A possible solution:
library(tidyverse)
df <- tibble(id = c("one", "two", "three"),
A = c(1,2,3),
B = c(4,5,6))
df %>%
group_by(id) %>%
slice(rep(1,2)) %>% mutate(A = if_else(row_number() > 1, first(df$B), A)) %>%
ungroup
#> # A tibble: 6 × 3
#> id A B
#> <chr> <dbl> <dbl>
#> 1 one 1 4
#> 2 one 4 4
#> 3 three 3 6
#> 4 three 4 6
#> 5 two 2 5
#> 6 two 4 5

Using `map` to find rowMeans by column names

I have a dataset with consistently named columns and I would like to take the average of the columns by their group e.g.,
library(dplyr)
library(purrr)
library(glue)
df <- tibble(`1_x_blind` = 1:3,
`1_y_blind` = 7:9,
`2_x_blind` = 4:6,
`2_y_blind` = 5:7)
df %>%
mutate(`1_overall_test` = rowMeans(select(., matches(glue("^1_.*_blind$")))))
#> # A tibble: 3 x 5
#> `1_x_blind` `1_y_blind` `2_x_blind` `2_y_blind` `1_overall_test`
#> <int> <int> <int> <int> <dbl>
#> 1 1 7 4 5 4
#> 2 2 8 5 6 5
#> 3 3 9 6 7 6
This method works fine. The next step for me would be to scale it so that I can do the entire series of columns e.g., something like
df %>%
mutate(overall_blind = map(1:2, ~rowMeans(select(., matches(glue("^{.x}_.*_blind$"))))))
#> Error: Problem with `mutate()` input `overall_blind`.
#> x no applicable method for 'select' applied to an object of class "c('integer', 'numeric')"
#> ℹ Input `overall_blind` is `map(1:2, ~rowMeans(select(., matches(glue("^{.x}_.*_blind$")))))`.
I think the problem here is that select is confusing the . operator. Is it possible to map over a series of column names in this way? Ideally I'd like the column names to follow the {.x}_overall pattern as in the example above.
Update Here's a cleaner way that doesn't require rename or bind_cols:
map_dfc(1:2,
function(x) df %>%
select(matches(glue("^{x}_.*_blind$"))) %>%
mutate("{x}_overall_blind" := rowMeans(.))
)
# A tibble: 3 x 6
`1_x_blind` `1_y_blind` `1_overall_blind` `2_x_blind` `2_y_blind` `2_overall_blind`
<int> <int> <dbl> <int> <int> <dbl>
1 1 7 4 4 5 4.5
2 2 8 5 5 6 5.5
3 3 9 6 6 7 6.5
Previous
Here's a map approach.
The challenge is mutating two new columns based on separate groups of existing columns. Easiest just to do that in its own map_dfc() and then bind that to the existing df.
df %>%
bind_cols(
map_dfc(1:2, ~rowMeans(df %>% select(matches(glue("^{.x}_.*_blind$"))))) %>%
rename_with(~paste0(str_replace(., "\\...", ""), "_overall_blind"))
)
# A tibble: 3 x 6
`1_x_blind` `1_y_blind` `2_x_blind` `2_y_blind` `1_overall_blind` `2_overall_blind`
<int> <int> <int> <int> <dbl> <dbl>
1 1 7 4 5 4 4.5
2 2 8 5 6 5 5.5
3 3 9 6 7 6 6.5
And here's a way to get your rowwise column-group averages using pivots, which avoids regex and mutate/map operations:
df %>%
mutate(row = row_number()) %>%
pivot_longer(-row) %>%
separate(name, c("grp"), sep = "_", extra = "drop") %>%
group_by(row, grp) %>%
summarise(overall_blind = mean(value)) %>%
ungroup() %>%
pivot_wider(id_cols = row, names_from = grp, values_from = overall_blind,
names_glue = "{grp}_{.value}") %>%
bind_cols(df)
# A tibble: 3 x 6
`1_overall_blind` `2_overall_blind` `1_x_blind` `1_y_blind` `2_x_blind` `2_y_blind`
<dbl> <dbl> <int> <int> <int> <int>
1 4 4.5 1 7 4 5
2 5 5.5 2 8 5 6
3 6 6.5 3 9 6 7
Here is one solution:
map_dfc(1:2, function(x) {
select(df, matches(glue("^{x}_.*_blind$"))) %>%
mutate(overall_blind = rowMeans(select(., matches(glue("^{x}_.*_blind$"))))) %>%
# General but not perfect names
# set_names(paste0(x, "_", names(.)))
# Hand-tailored names
set_names(c(names(.)[1], names(.)[2], paste0(x, "_", names(.)[3])))
})
#> # A tibble: 3 x 6
#> `1_x_blind` `1_y_blind` `1_overall_blind` `2_x_blind` `2_y_blind` `2_overall_blind`
#> <int> <int> <dbl> <int> <int> <dbl>
#> 1 1 7 4 4 5 4.5
#> 2 2 8 5 5 6 5.5
#> 3 3 9 6 6 7 6.5
I added two possibilities of naming the overall_blind column for each group, one more general but not perfect names (it duplicates the 1_ or 2_ for the data columns), and another that gives the names that you want but require knowing in advance the number of columns per group.
We can use split.default to split the data into list of datasets based on the column name pattern, then get the rowMeans and bind with the original data
library(dplyr)
library(purrr)
library(stringr)
df %>%
split.default(readr::parse_number(names(.))) %>%
map_dfc(rowMeans) %>%
set_names(str_c(names(.), "_overall_blind")) %>%
bind_cols(df, .)
# A tibble: 3 x 6
# `1_x_blind` `1_y_blind` `2_x_blind` `2_y_blind` `1_overall_blind` `2_overall_blind`
# <int> <int> <int> <int> <dbl> <dbl>
#1 1 7 4 5 4 4.5
#2 2 8 5 6 5 5.5
#3 3 9 6 7 6 6.5

How do you use dplyr::pull to convert grouped a colum into vectors?

I have a tibble, df, I would like to take the tibble and group it and then use dplyr::pull to create vectors from the grouped dataframe. I have provided a reprex below.
df is the base tibble. My desired output is reflected by df2. I just don't know how to get there programmatically. I have tried to use pull to achieve this output but pull did not seem to recognize the group_by function and instead created a vector out of the whole column. Is what I'm trying to achieve possible with dplyr or base r. Note - new_col is supposed to be a vector created from the name column.
library(tidyverse)
library(reprex)
df <- tibble(group = c(1,1,1,1,2,2,2,3,3,3,3,3),
name = c('Jim','Deb','Bill','Ann','Joe','Jon','Jane','Jake','Sam','Gus','Trixy','Don'),
type = c(1,2,3,4,3,2,1,2,3,1,4,5))
df
#> # A tibble: 12 x 3
#> group name type
#> <dbl> <chr> <dbl>
#> 1 1 Jim 1
#> 2 1 Deb 2
#> 3 1 Bill 3
#> 4 1 Ann 4
#> 5 2 Joe 3
#> 6 2 Jon 2
#> 7 2 Jane 1
#> 8 3 Jake 2
#> 9 3 Sam 3
#> 10 3 Gus 1
#> 11 3 Trixy 4
#> 12 3 Don 5
# Desired Output - New Col is a column of vectors
df2 <- tibble(group=c(1,2,3),name=c("Jim","Jane","Gus"), type=c(1,1,1), new_col = c("'Jim','Deb','Bill','Ann'","'Joe','Jon','Jane'","'Jake','Sam','Gus','Trixy','Don'"))
df2
#> # A tibble: 3 x 4
#> group name type new_col
#> <dbl> <chr> <dbl> <chr>
#> 1 1 Jim 1 'Jim','Deb','Bill','Ann'
#> 2 2 Jane 1 'Joe','Jon','Jane'
#> 3 3 Gus 1 'Jake','Sam','Gus','Trixy','Don'
Created on 2020-11-14 by the reprex package (v0.3.0)
Maybe this is what you are looking for:
library(dplyr)
df <- tibble(group = c(1,1,1,1,2,2,2,3,3,3,3,3),
name = c('Jim','Deb','Bill','Ann','Joe','Jon','Jane','Jake','Sam','Gus','Trixy','Don'),
type = c(1,2,3,4,3,2,1,2,3,1,4,5))
df %>%
group_by(group) %>%
mutate(new_col = name, name = first(name, order_by = type), type = first(type, order_by = type)) %>%
group_by(name, type, .add = TRUE) %>%
summarise(new_col = paste(new_col, collapse = ","))
#> `summarise()` regrouping output by 'group', 'name' (override with `.groups` argument)
#> # A tibble: 3 x 4
#> # Groups: group, name [3]
#> group name type new_col
#> <dbl> <chr> <dbl> <chr>
#> 1 1 Jim 1 Jim,Deb,Bill,Ann
#> 2 2 Jane 1 Joe,Jon,Jane
#> 3 3 Gus 1 Jake,Sam,Gus,Trixy,Don
EDIT If new_col should be a list of vectors then you could do `summarise(new_col = list(c(new_col)))
df %>%
group_by(group) %>%
mutate(new_col = name, name = first(name, order_by = type), type = first(type, order_by = type)) %>%
group_by(name, type, .add = TRUE) %>%
summarise(new_col = list(c(new_col)))
Another option would be to use tidyr::nest:
df %>%
group_by(group) %>%
mutate(new_col = name, name = first(name, order_by = type), type = first(type, order_by = type)) %>%
nest(new_col = new_col)

dplyr: getting grouped min and max of columns in a for loop [duplicate]

This question already has answers here:
Apply several summary functions (sum, mean, etc.) on several variables by group in one call
(7 answers)
Closed 3 years ago.
I am trying to get the grouped min and max of several columns using a for loop:
My data:
df <- data.frame(a=c(1:5, NA), b=c(6:10, NA), c=c(11:15, NA), group=c(1,1,1,2,2,2))
> df
a b c group
1 1 6 11 1
2 2 7 12 1
3 3 8 13 1
4 4 9 14 2
5 5 10 15 2
6 NA NA NA 2
My attempt:
cols <- df %>% select(a,b) %>% names()
for(i in seq_along(cols)) {
output <- df %>% dplyr::group_by(group) %>%
dplyr::summarise_(min=min(.dots=i, na.rm=T), max=max(.dots=i, na.rm=T))
print(output)
}
Desired output for column a:
group min max
<dbl> <int> <int>
1 1 1 3
2 2 4 5
Using dplyr package, you can get:
df %>%
na.omit() %>%
pivot_longer(-group) %>%
group_by(group, name) %>%
summarise(min = min(value),
max = max(value)) %>%
arrange(name, group)
# group name min max
# <dbl> <chr> <int> <int>
# 1 1 a 1 3
# 2 2 a 4 5
# 3 1 b 6 8
# 4 2 b 9 10
# 5 1 c 11 13
# 6 2 c 14 15
We can use summarise_all after grouping by 'group' and if it needs to be in a particular order, then use select to select based on the column names
library(dplyr)
library(stringr)
df %>%
group_by(group) %>%
summarise_all(list(min = ~ min(., na.rm = TRUE),
max = ~ max(., na.rm = TRUE))) %>%
select(group, order(str_remove(names(.), "_.*")))
# A tibble: 2 x 7
# group a_min a_max b_min b_max c_min c_max
# <dbl> <int> <int> <int> <int> <int> <int>
#1 1 1 3 6 8 11 13
#2 2 4 5 9 10 14 15
Without to use for loop but using dplyr and tidyr from tidyverse, you can get the min and max of each columns by 1) pivoting the dataframe in a longer format, 2) getting the min and max value per group and then 3) pivoting wider the dataframe to get the expected output:
library(tidyverse)
df %>% pivot_longer(., cols = c(a,b,c), names_to = "Names",values_to = "Value") %>%
group_by(group,Names) %>% summarise(Min = min(Value, na.rm =TRUE), Max = max(Value,na.rm = TRUE)) %>%
pivot_wider(., names_from = Names, values_from = c(Min,Max)) %>%
select(group,contains("_a"),contains("_b"),contains("_c"))
# A tibble: 2 x 7
# Groups: group [2]
group Min_a Max_a Min_b Max_b Min_c Max_c
<dbl> <int> <int> <int> <int> <int> <int>
1 1 1 3 6 8 11 13
2 2 4 5 9 10 14 15
Is it what you are looking for ?
In base R, we can use aggregate and get min and max for multiple columns by group.
aggregate(.~group, df, function(x)
c(min = min(x, na.rm = TRUE),max= max(x, na.rm = TRUE)))
# group a.min a.max b.min b.max c.min c.max
#1 1 1 3 6 8 11 13
#2 2 4 5 9 10 14 15

R dplyr group_by summarise keep last non missing

Consider the following dataset where id uniquely identifies a person, and name varies within id only to the extent of minor spelling issues. I want to aggregate to id level using dplyr:
df= data.frame(id=c(1,1,1,2,2,2),name=c('michael c.','mike', 'michael','','John',NA),var=1:6)
Using group_by(id) yields the correct computation, but I lose the name column:
df %>% group_by(id) %>% summarise(newvar=sum(var)) %>%ungroup()
A tibble: 2 x 2
id newvar
<dbl> <int>
1 1 6
2 2 15
Using group_by(id,name) yields both name and id but obviously the "wrong" sums.
I would like to keep the last non-missing observatoin of the name within each group. I basically lack a dplyr version of Statas lastnm() function:
df %>% group_by(id) %>% summarise(sum = sum(var), Name = lastnm(name))
id sum Name
1 1 6 michael
2 2 15 John
Is there a "keep last non missing"-option?
1) Use mutate like this:
df %>%
group_by(id) %>%
mutate(sum = sum(var)) %>%
ungroup
giving:
# A tibble: 6 x 4
id name var sum
<dbl> <fct> <int> <int>
1 1 michael c. 1 6
2 1 mike 2 6
3 1 michael 3 6
4 2 john 4 15
5 2 john 5 15
6 2 john 6 15
2) Another possibility is:
df %>%
group_by(id) %>%
summarize(name = name %>% unique %>% toString, sum = sum(var)) %>%
ungroup
giving:
# A tibble: 2 x 3
id name sum
<dbl> <chr> <int>
1 1 michael c., mike, michael 6
2 2 john 15
3) Another variation is to only report the first name in each group:
df %>%
group_by(id) %>%
summarize(name = first(name), sum = sum(var)) %>%
ungroup
giving:
# A tibble: 2 x 3
id name sum
<dbl> <fct> <int>
1 1 michael c. 6
2 2 john 15
I posted a feature request on dplyrs github thread, and the reponse there is actually the best answer. For sake of completion I repost it here:
df %>%
group_by(id) %>%
summarise(sum=sum(var), Name=last(name[!is.na(name)]))
#> # A tibble: 2 x 3
#> id sum Name
#> <dbl> <int> <chr>
#> 1 1 6 michael
#> 2 2 15 John

Resources