Create a group ID based on a string column in R

This is probably a dumb question, but how do I create a new group ID based on a string column in R? The values of the ID are arbitrary.
ID: the column I want to create
Name ID
A09john 1
J43mary 2
B7you 3
A09john 1
J43mary 2
B7you 3
I was hoping to use simple codes like below, but I don't know how to do it. Thank you!
df1 %>%
group_by(Name) %>%
mutate(ID = row_number(as.numeric(????)))

Here is a tidyverse approach that uses dplyr::cur_group_id() (the current group identifier):
library(tidyverse)
d <- data.frame(
Name = c("A09john", "J43mary", "B7you", "A09john", "J43mary", "B7you")
)
new_data <- d |>
dplyr::group_by(Name) |>
dplyr::mutate(ID = dplyr::cur_group_id()) |>
ungroup()
new_data
#> # A tibble: 6 x 2
#> Name ID
#> <chr> <int>
#> 1 A09john 1
#> 2 J43mary 3
#> 3 B7you 2
#> 4 A09john 1
#> 5 J43mary 3
#> 6 B7you 2
# If you want the IDs to follow the order of appearance,
# convert Name to a factor first
new_data2 <- d |>
dplyr::mutate(Name = factor(Name, levels = unique(Name))) |>
dplyr::group_by(Name) |>
mutate(ID = dplyr::cur_group_id()) |>
ungroup()
new_data2
#> # A tibble: 6 x 2
#> Name ID
#> <fct> <int>
#> 1 A09john 1
#> 2 J43mary 2
#> 3 B7you 3
#> 4 A09john 1
#> 5 J43mary 2
#> 6 B7you 3
Created on 2022-06-16 by the reprex package (v2.0.1)
row_number() is not the solution as it will compute the row number in each group.
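For comparison, the same two numberings can be sketched in base R without any grouping. This is only an illustration on the example vector from the question; `name`, `id_sorted` and `id_appearance` are names made up here.

```r
# Example names from the question
name <- c("A09john", "J43mary", "B7you", "A09john", "J43mary", "B7you")

# IDs follow the sorted order of the distinct names
# (the same numbering as the first output above)
id_sorted <- as.integer(factor(name))

# IDs follow the order in which each name first appears
# (the same numbering as the second output above)
id_appearance <- match(name, unique(name))

data.frame(Name = name, id_sorted, id_appearance)
```

Since the values of the ID are arbitrary in the question, either numbering should do.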

I would use sequence and dplyr in this way:
df1 %>%
group_by(Name) %>%
mutate(ID = sequence(n()))
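Note that sequence(n()) inside a grouped mutate() counts rows within each group, so it does not give one ID per name. A rough base R sketch of the same within-group counter, with ave() standing in for the grouped mutate() (`name` and `counter` are illustrative names):

```r
# Example names from the question
name <- c("A09john", "J43mary", "B7you", "A09john", "J43mary", "B7you")

# Row counter within each name: the first occurrence of a name gets 1,
# the second occurrence gets 2, and so on
counter <- ave(seq_along(name), name, FUN = seq_along)

# The same name ends up with different values, so this is not a group ID
data.frame(Name = name, counter)
```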

Related

Continuing a sequence into NAs using dplyr

I am trying to figure out a dplyr specific way of continuing a sequence of numbers when there are NAs in that column.
For example I have this dataframe:
library(tibble)
dat <- tribble(
~x, ~group,
1, "A",
2, "A",
NA_real_, "A",
NA_real_, "A",
1, "B",
NA_real_, "B",
3, "B"
)
dat
#> # A tibble: 7 × 2
#> x group
#> <dbl> <chr>
#> 1 1 A
#> 2 2 A
#> 3 NA A
#> 4 NA A
#> 5 1 B
#> 6 NA B
#> 7 3 B
I would like this one:
#> # A tibble: 7 × 2
#> x group
#> <dbl> <chr>
#> 1 1 A
#> 2 2 A
#> 3 3 A
#> 4 4 A
#> 5 1 B
#> 6 2 B
#> 7 3 B
When I try this I get a warning which makes me think I am probably approaching this incorrectly:
library(dplyr)
dat %>%
group_by(group) %>%
mutate(n = n()) %>%
mutate(new_seq = seq_len(n))
#> Warning in seq_len(n): first element used of 'length.out' argument
#> Warning in seq_len(n): first element used of 'length.out' argument
#> # A tibble: 7 × 4
#> # Groups: group [2]
#> x group n new_seq
#> <dbl> <chr> <int> <int>
#> 1 1 A 4 1
#> 2 2 A 4 2
#> 3 NA A 4 3
#> 4 NA A 4 4
#> 5 1 B 3 1
#> 6 NA B 3 2
#> 7 3 B 3 3

It's easier if you do it in one go. Your approach is not 'wrong'; it is just that seq_len() needs a single integer and you are giving it a vector (the column n), so seq_len() uses only the first value and warns about it.
dat %>%
group_by(group) %>%
mutate(x = seq_len(n()))
Note that row_number might be even easier here:
dat %>%
group_by(group) %>%
mutate(x = row_number())
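As a side note on the single-value-versus-vector point: base R's sequence() accepts a whole vector of lengths, whereas seq_len() expects just one. A small sketch, assuming the rows of each group are contiguous as in the example (`group`, `run_lengths` and `new_seq` are illustrative names):

```r
# Group labels from the example data
group <- c("A", "A", "A", "A", "B", "B", "B")

# Lengths of the consecutive runs of identical labels: 4 and 3
run_lengths <- rle(group)$lengths

# sequence() concatenates seq_len(4) and seq_len(3)
new_seq <- sequence(run_lengths)
new_seq
```

If the groups were interleaved, the run lengths would no longer match the group sizes, and a grouped approach like the ones above would be the safer choice.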

We could use rowid (from data.table) directly if the intention is to create a sequence and the group size is just an intermediate column:
library(data.table)
library(dplyr)
dat %>%
mutate(new_seq = rowid(group))
The issue with using a column after it is created is that it is no longer a single value, as shown in @Maël's post. If we need to do that, wrap it in first(), since seq_len() is not vectorized; here, though, the intermediate column is not needed at all.
dat %>%
group_by(group) %>%
mutate(n = n()) %>%
mutate(new_seq = seq_len(first(n)))

A base R option using ave (which works similarly to group_by in dplyr):
> transform(dat, x = ave(x, group, FUN = seq_along))
x group
1 1 A
2 2 A
3 3 A
4 4 A
5 1 B
6 2 B
7 3 B

Count unique values by group

DATA = data.frame("TRIMESTER" = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3),
"STUDENT" = c(1,2,3,4,5,6,7,1,2,3,5,9,10,11,3,7,10,6,12,15,17,16,21))
WANT = data.frame("TRIMESTER" = c(1,2,3),
"NEW_ENROLL" = c(7,3,5),
"TOTAL_ENROLL" = c(7,10,15))
I have 'DATA' and want to make 'WANT', which has three columns: for every 'TRIMESTER', 'NEW_ENROLL' counts the number of new 'STUDENT' values, and 'TOTAL_ENROLL' counts the total number of unique 'STUDENT' values up to that trimester.
My attempt only counts the number for each TRIMESTER.
library(dplyr)
DATA %>%
group_by(TRIMESTER) %>%
count()

Here is a way.
suppressPackageStartupMessages(library(dplyr))
DATA <- data.frame("TRIMESTER" = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3),
"STUDENT" = c(1,2,3,4,5,6,7,1,2,3,5,9,10,11,3,7,10,6,12,15,17,16,21))
DATA %>%
mutate(NEW_ENROLL = !duplicated(STUDENT)) %>%
group_by(TRIMESTER) %>%
summarise(NEW_ENROLL = sum(NEW_ENROLL)) %>%
ungroup() %>%
mutate(TOTAL_ENROLL = cumsum(NEW_ENROLL))
#> # A tibble: 3 × 3
#> TRIMESTER NEW_ENROLL TOTAL_ENROLL
#> <dbl> <int> <int>
#> 1 1 7 7
#> 2 2 3 10
#> 3 3 5 15
Created on 2022-08-14 by the reprex package (v2.0.1)
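The pipeline above depends on duplicated() marking a student as new only at the first row where they appear, so the rows need to be in trimester order. A base R sketch of the same idea with an explicit sort, using the data from the question (`sorted`, `is_new`, `new_enroll` and `result` are illustrative names):

```r
DATA <- data.frame(
  TRIMESTER = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3),
  STUDENT   = c(1,2,3,4,5,6,7,1,2,3,5,9,10,11,3,7,10,6,12,15,17,16,21)
)

# Make sure earlier trimesters come first, so that the first occurrence
# of a student is the trimester in which they enrolled
sorted <- DATA[order(DATA$TRIMESTER), ]

# TRUE on the first row where each student appears
is_new <- !duplicated(sorted$STUDENT)

# Number of first appearances per trimester
new_enroll <- tapply(is_new, sorted$TRIMESTER, sum)

result <- data.frame(
  TRIMESTER    = as.numeric(names(new_enroll)),
  NEW_ENROLL   = as.vector(new_enroll),
  # Running count of distinct students seen so far
  TOTAL_ENROLL = cumsum(as.vector(new_enroll))
)
result
```

TOTAL_ENROLL computed this way is a running total of distinct students, which is what the WANT table shows, rather than the number of students present in each trimester.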

For variety, we can use base R aggregate with transform:
transform(aggregate(. ~ TRIMESTER , DATA[!duplicated(DATA$STUDENT),] , length),
TOTAL_ENROLL = cumsum(STUDENT))
Output
TRIMESTER STUDENT TOTAL_ENROLL
1 1 7 7
2 2 3 10
3 3 5 15

We replace the duplicated elements in 'STUDENT' with NA, group by TRIMESTER, get the count of non-NA elements, and finally take the cumulative sum (cumsum):
library(dplyr)
DATA %>%
mutate(STUDENT = replace(STUDENT, duplicated(STUDENT), NA)) %>%
group_by(TRIMESTER) %>%
summarise(NEW_ENROLL = sum(!is.na(STUDENT)), .groups= 'drop') %>%
mutate(TOTAL_ENROLL = cumsum(NEW_ENROLL))
-output
# A tibble: 3 × 3
TRIMESTER NEW_ENROLL TOTAL_ENROLL
<dbl> <int> <int>
1 1 7 7
2 2 3 10
3 3 5 15
Or with distinct
distinct(DATA, STUDENT, .keep_all = TRUE) %>%
group_by(TRIMESTER) %>%
summarise(NEW_ENROLL = n(), .groups = 'drop') %>%
mutate(TOTAL_ENROLL = cumsum(NEW_ENROLL))
# A tibble: 3 × 3
TRIMESTER NEW_ENROLL TOTAL_ENROLL
<dbl> <int> <int>
1 1 7 7
2 2 3 10
3 3 5 15

How to add a row to each group and assign values

I have this tibble:
library(tibble)
library(dplyr)
df <- tibble(id = c("one", "two", "three"),
A = c(1,2,3),
B = c(4,5,6))
id A B
<chr> <dbl> <dbl>
1 one 1 4
2 two 2 5
3 three 3 6
I want to add a row to each group and assign values to the new row using a function. Here the new row in each group should get A = 4 and B = the first value of column B in that group, using first(B). Desired output:
id A B
<chr> <dbl> <dbl>
1 one 1 4
2 one 4 4
3 three 3 6
4 three 4 6
5 two 2 5
6 two 4 5
I have tried so far:
If I add a row to an ungrouped tibble with add_row, this works perfectly:
df %>%
add_row(A=4, B=4)
id A B
<chr> <dbl> <dbl>
1 one 1 4
2 two 2 5
3 three 3 6
4 NA 4 4
If I try to use add_row on a grouped tibble, it does not work:
df %>%
group_by(id) %>%
add_row(A=4, B=4)
Error: Can't add rows to grouped data frames.
Run `rlang::last_error()` to see where the error occurred.
According to this post Add row in each group using dplyr and add_row() we could use group_modify -> this works great:
df %>%
group_by(id) %>%
group_modify(~ add_row(A=4, B=4, .x))
id A B
<chr> <dbl> <dbl>
1 one 1 4
2 one 4 4
3 three 3 6
4 three 4 4
5 two 2 5
6 two 4 4
I want to assign to column B the first value of column B in the group (or the result of any other function, such as min(B) or max(B)), but this does not work:
df %>%
group_by(id) %>%
group_modify(~ add_row(A=4, B=first(B), .x))
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'first': object 'B' not found

library(tidyverse)
df <- tibble(id = c("one", "two", "three"),
A = c(1,2,3),
B = c(4,5,6))
df %>%
group_by(id) %>%
summarise(add_row(cur_data(), A = 4, B = first(cur_data()$B)))
#> `summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
#> # A tibble: 6 × 3
#> # Groups: id [3]
#> id A B
#> <chr> <dbl> <dbl>
#> 1 one 1 4
#> 2 one 4 4
#> 3 three 3 6
#> 4 three 4 6
#> 5 two 2 5
#> 6 two 4 5
Or
df %>%
group_by(id) %>%
group_split() %>%
map_dfr(~ add_row(.,id = first(.$id), A = 4, B = first(.$B)))
#> # A tibble: 6 × 3
#> id A B
#> <chr> <dbl> <dbl>
#> 1 one 1 4
#> 2 one 4 4
#> 3 three 3 6
#> 4 three 4 6
#> 5 two 2 5
#> 6 two 4 5
Created on 2022-01-02 by the reprex package (v2.0.1)

Maybe this is an option
library(dplyr)
df %>%
group_by(id) %>%
summarise( A=c(A,4), B=c(B,first(B)) ) %>%
ungroup
`summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
# A tibble: 6 x 3
id A B
<chr> <dbl> <dbl>
1 one 1 4
2 one 4 4
3 three 3 6
4 three 4 6
5 two 2 5
6 two 4 5

According to the documentation of the function group_modify, if you use a formula, you must use ". or .x to refer to the subset of rows of .tbl for the given group;" that's why you used .x inside the add_row function. To be entirely consistent, you have to do it also within the first function.
df %>%
group_by(id) %>%
group_modify(~ add_row(A=4, B=first(.x$B), .x))
# A tibble: 6 x 3
# Groups: id [3]
id A B
<chr> <dbl> <dbl>
1 one 1 4
2 one 4 4
3 three 3 6
4 three 4 6
5 two 2 5
6 two 4 5
Using first(.$B) gives the same result. first(df$B) does not: df$B is the whole ungrouped column, so every new row would get B = 4.
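The same per-group logic can be written out in base R, where the rows of the current group are an explicit function argument (here called `g`, an arbitrary name), much like .x in the formula. This is a sketch on a plain data frame rather than the tidyverse method itself; `add_group_row` and `out` are illustrative names.

```r
df <- data.frame(
  id = c("one", "two", "three"),
  A  = c(1, 2, 3),
  B  = c(4, 5, 6),
  stringsAsFactors = FALSE
)

# Append one row to a single group's rows: A is fixed at 4,
# B is taken from the first row of that group
add_group_row <- function(g) {
  rbind(g, data.frame(id = g$id[1], A = 4, B = g$B[1],
                      stringsAsFactors = FALSE))
}

# split() orders the pieces by the sorted id values: one, three, two
out <- do.call(rbind, lapply(split(df, df$id), add_group_row))
rownames(out) <- NULL
out
```

Because `g` only ever holds one group's rows, `g$B[1]` is the group's own first value, which is the distinction between .x$B and df$B.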

A possible solution:
library(tidyverse)
df <- tibble(id = c("one", "two", "three"),
A = c(1,2,3),
B = c(4,5,6))
df %>%
group_by(id) %>%
slice(rep(1,2)) %>% mutate(A = if_else(row_number() > 1, 4, A)) %>%
ungroup
#> # A tibble: 6 × 3
#> id A B
#> <chr> <dbl> <dbl>
#> 1 one 1 4
#> 2 one 4 4
#> 3 three 3 6
#> 4 three 4 6
#> 5 two 2 5
#> 6 two 4 5

How do you use dplyr::pull to convert a grouped column into vectors?

I have a tibble, df, I would like to take the tibble and group it and then use dplyr::pull to create vectors from the grouped dataframe. I have provided a reprex below.
df is the base tibble. My desired output is reflected by df2. I just don't know how to get there programmatically. I have tried to use pull to achieve this output, but pull did not seem to recognize the group_by function and instead created a vector out of the whole column. Is what I'm trying to achieve possible with dplyr or base R? Note: new_col is supposed to be a vector created from the name column.
library(tidyverse)
library(reprex)
df <- tibble(group = c(1,1,1,1,2,2,2,3,3,3,3,3),
name = c('Jim','Deb','Bill','Ann','Joe','Jon','Jane','Jake','Sam','Gus','Trixy','Don'),
type = c(1,2,3,4,3,2,1,2,3,1,4,5))
df
#> # A tibble: 12 x 3
#> group name type
#> <dbl> <chr> <dbl>
#> 1 1 Jim 1
#> 2 1 Deb 2
#> 3 1 Bill 3
#> 4 1 Ann 4
#> 5 2 Joe 3
#> 6 2 Jon 2
#> 7 2 Jane 1
#> 8 3 Jake 2
#> 9 3 Sam 3
#> 10 3 Gus 1
#> 11 3 Trixy 4
#> 12 3 Don 5
# Desired Output - New Col is a column of vectors
df2 <- tibble(group=c(1,2,3),name=c("Jim","Jane","Gus"), type=c(1,1,1), new_col = c("'Jim','Deb','Bill','Ann'","'Joe','Jon','Jane'","'Jake','Sam','Gus','Trixy','Don'"))
df2
#> # A tibble: 3 x 4
#> group name type new_col
#> <dbl> <chr> <dbl> <chr>
#> 1 1 Jim 1 'Jim','Deb','Bill','Ann'
#> 2 2 Jane 1 'Joe','Jon','Jane'
#> 3 3 Gus 1 'Jake','Sam','Gus','Trixy','Don'
Created on 2020-11-14 by the reprex package (v0.3.0)

Maybe this is what you are looking for:
library(dplyr)
df <- tibble(group = c(1,1,1,1,2,2,2,3,3,3,3,3),
name = c('Jim','Deb','Bill','Ann','Joe','Jon','Jane','Jake','Sam','Gus','Trixy','Don'),
type = c(1,2,3,4,3,2,1,2,3,1,4,5))
df %>%
group_by(group) %>%
mutate(new_col = name, name = first(name, order_by = type), type = first(type, order_by = type)) %>%
group_by(name, type, .add = TRUE) %>%
summarise(new_col = paste(new_col, collapse = ","))
#> `summarise()` regrouping output by 'group', 'name' (override with `.groups` argument)
#> # A tibble: 3 x 4
#> # Groups: group, name [3]
#> group name type new_col
#> <dbl> <chr> <dbl> <chr>
#> 1 1 Jim 1 Jim,Deb,Bill,Ann
#> 2 2 Jane 1 Joe,Jon,Jane
#> 3 3 Gus 1 Jake,Sam,Gus,Trixy,Don
EDIT: If new_col should be a list of vectors, then you could use `summarise(new_col = list(c(new_col)))`:
df %>%
group_by(group) %>%
mutate(new_col = name, name = first(name, order_by = type), type = first(type, order_by = type)) %>%
group_by(name, type, .add = TRUE) %>%
summarise(new_col = list(c(new_col)))
Another option would be to use tidyr::nest:
df %>%
group_by(group) %>%
mutate(new_col = name, name = first(name, order_by = type), type = first(type, order_by = type)) %>%
nest(new_col = new_col)
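On the original question of why pull() did not respect the grouping: it simply extracts the whole column as one vector. If the goal is one character vector per group, a base R sketch with split() returns a named list of vectors (`names_by_group` is an illustrative name):

```r
df <- data.frame(
  group = c(1,1,1,1,2,2,2,3,3,3,3,3),
  name  = c("Jim","Deb","Bill","Ann","Joe","Jon","Jane",
            "Jake","Sam","Gus","Trixy","Don"),
  stringsAsFactors = FALSE
)

# One character vector of names per group, in a list named by group value
names_by_group <- split(df$name, df$group)

# The vector for group 2
names_by_group[["2"]]
```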

R dplyr group_by summarise keep last non missing

Consider the following dataset where id uniquely identifies a person, and name varies within id only to the extent of minor spelling issues. I want to aggregate to id level using dplyr:
df= data.frame(id=c(1,1,1,2,2,2),name=c('michael c.','mike', 'michael','','John',NA),var=1:6)
Using group_by(id) yields the correct computation, but I lose the name column:
df %>% group_by(id) %>% summarise(newvar=sum(var)) %>%ungroup()
# A tibble: 2 x 2
id newvar
<dbl> <int>
1 1 6
2 2 15
Using group_by(id,name) yields both name and id but obviously the "wrong" sums.
I would like to keep the last non-missing observation of name within each group. I basically lack a dplyr version of Stata's lastnm() function:
df %>% group_by(id) %>% summarise(sum = sum(var), Name = lastnm(name))
id sum Name
1 1 6 michael
2 2 15 John
Is there a "keep last non missing"-option?

1) Use mutate like this:
df %>%
group_by(id) %>%
mutate(sum = sum(var)) %>%
ungroup
giving:
# A tibble: 6 x 4
id name var sum
<dbl> <fct> <int> <int>
1 1 michael c. 1 6
2 1 mike 2 6
3 1 michael 3 6
4 2 john 4 15
5 2 john 5 15
6 2 john 6 15
2) Another possibility is:
df %>%
group_by(id) %>%
summarize(name = name %>% unique %>% toString, sum = sum(var)) %>%
ungroup
giving:
# A tibble: 2 x 3
id name sum
<dbl> <chr> <int>
1 1 michael c., mike, michael 6
2 2 john 15
3) Another variation is to only report the first name in each group:
df %>%
group_by(id) %>%
summarize(name = first(name), sum = sum(var)) %>%
ungroup
giving:
# A tibble: 2 x 3
id name sum
<dbl> <fct> <int>
1 1 michael c. 6
2 2 john 15

I posted a feature request on dplyr's GitHub thread, and the response there is actually the best answer. For the sake of completeness, I repost it here:
df %>%
group_by(id) %>%
summarise(sum=sum(var), Name=last(name[!is.na(name)]))
#> # A tibble: 2 x 3
#> id sum Name
#> <dbl> <int> <chr>
#> 1 1 6 michael
#> 2 2 15 John
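The same idea can be wrapped in a small helper. The sketch below is base R only; the function name `last_nm` and its `blank_is_missing` argument are made up here (they are not part of dplyr), and treating empty strings as missing is an optional assumption that the answer above does not make.

```r
df <- data.frame(
  id   = c(1, 1, 1, 2, 2, 2),
  name = c("michael c.", "mike", "michael", "", "John", NA),
  var  = 1:6,
  stringsAsFactors = FALSE
)

# Hypothetical helper: last non-missing element of x,
# or NA of the same type if nothing is left
last_nm <- function(x, blank_is_missing = FALSE) {
  keep <- !is.na(x)
  if (blank_is_missing) keep <- keep & x != ""
  x <- x[keep]
  if (length(x) == 0) return(x[NA_integer_])
  x[length(x)]
}

# One row per id: the sum of var and the last non-missing name
result <- data.frame(
  id   = sort(unique(df$id)),
  sum  = as.vector(tapply(df$var, df$id, sum)),
  Name = unname(vapply(split(df$name, df$id), last_nm, character(1))),
  stringsAsFactors = FALSE
)
result
```

With `blank_is_missing = TRUE`, a trailing "" would be skipped as well, which may be closer to how missing strings are handled in Stata.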
