dplyr: replace grouping values with 1 through N groups [duplicate] - r

This question already has answers here:
How to create a consecutive group number
(13 answers)
Closed 1 year ago.
I have data where each row represents one observation from one person. For example:
library(dplyr)
dat <- tibble(ID = rep(sample(1111:9999, 3), each = 3),
              X = 1:9)
# A tibble: 9 x 2
ID X
<int> <int>
1 9573 1
2 9573 2
3 9573 3
4 7224 4
5 7224 5
6 7224 6
7 7917 7
8 7917 8
9 7917 9
I want to replace these IDs with a different value. It can be anything, but the easiest (and preferred) solution is just to replace them with group numbers 1:n. So the desired output would be:
# A tibble: 9 x 2
ID X
<int> <int>
1 1 1
2 1 2
3 1 3
4 2 4
5 2 5
6 2 6
7 3 7
8 3 8
9 3 9
Probably something that starts with:
dat %>%
group_by(ID) %>%
???

A fast option would be match:
library(dplyr)
dat %>%
mutate(ID = match(ID, unique(ID)))
Output:
# A tibble: 9 x 2
# ID X
# <int> <int>
#1 1 1
#2 1 2
#3 1 3
#4 2 4
#5 2 5
#6 2 6
#7 3 7
#8 3 8
#9 3 9
Or use as.integer on a factor
dat %>%
mutate(ID = as.integer(factor(ID, levels = unique(ID))))
In tidyverse, we can also use cur_group_id():
dat %>%
group_by(ID = factor(ID, levels = unique(ID))) %>%
mutate(ID = cur_group_id()) %>%
ungroup
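The match() trick works on plain vectors too, which makes the mechanics easy to see; a minimal base-R sketch of the same idea (the IDs below are made up for illustration):

```r
# Three hypothetical people, three observations each
ids <- c(9573, 9573, 9573, 7224, 7224, 7224, 7917, 7917, 7917)

# match() returns each ID's position within the vector of unique IDs,
# which is exactly a 1:n group index in order of first appearance
new_ids <- match(ids, unique(ids))
new_ids
# [1] 1 1 1 2 2 2 3 3 3
```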

Related

How to balance a dataset in `dplyr` using `sample_n` automatically to the size of the smallest class?

I have a dataset like:
df <- tibble(
id = 1:18,
class = rep(c(rep(1,3),rep(2,2),3),3),
var_a = rep(c("a","b"),9)
)
# A tibble: 18 x 3
id class var_a
<int> <dbl> <chr>
1 1 1 a
2 2 1 b
3 3 1 a
4 4 2 b
5 5 2 a
6 6 3 b
7 7 1 a
8 8 1 b
9 9 1 a
10 10 2 b
11 11 2 a
12 12 3 b
13 13 1 a
14 14 1 b
15 15 1 a
16 16 2 b
17 17 2 a
18 18 3 b
That dataset contains a number of observations in several classes. The classes are not balanced. In the sample above we can see, that only 3 observations are of class 3, while there are 6 observations of class 2 and 9 observations of class 1.
Now I want to automatically balance that dataset so that all classes are of the same size. So I want a dataset of 9 rows, 3 rows in each class. I can use the sample_n function from dplyr to do such a sampling.
I managed to do so by first calculating the smallest class size..
min_length <- as.numeric(df %>%
  group_by(class) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  summarise(min = min(n)))
..and then apply the sample_n function:
set.seed(1)
df %>% group_by(class) %>% sample_n(min_length)
# A tibble: 9 x 3
# Groups: class [3]
id class var_a
<int> <dbl> <chr>
1 15 1 a
2 7 1 a
3 13 1 a
4 4 2 b
5 5 2 a
6 17 2 a
7 18 3 b
8 6 3 b
9 12 3 b
I wondered if it's possible to do that (calculating the smallest class size and then sampling) in one go?
You can do it in one step, but it is cheating a little:
set.seed(42)
df %>%
group_by(class) %>%
sample_n(min(table(df$class))) %>%
ungroup()
# # A tibble: 9 x 3
# id class var_a
# <int> <dbl> <chr>
# 1 1 1 a
# 2 8 1 b
# 3 15 1 a
# 4 4 2 b
# 5 5 2 a
# 6 11 2 a
# 7 12 3 b
# 8 18 3 b
# 9 6 3 b
I say "cheating" because normally you would not want to reference df$ from within the pipe. However, because the property we're looking for belongs to the whole frame while table only sees one group at a time inside the pipe, we need to side-step that a little.
One could do
df %>%
mutate(mn = min(table(class))) %>%
group_by(class) %>%
sample_n(mn[1]) %>%
ungroup()
# # A tibble: 9 x 4
# id class var_a mn
# <int> <dbl> <chr> <int>
# 1 14 1 b 3
# 2 13 1 a 3
# 3 7 1 a 3
# 4 4 2 b 3
# 5 16 2 b 3
# 6 5 2 a 3
# 7 12 3 b 3
# 8 18 3 b 3
# 9 6 3 b 3
Though I don't think that is any more elegant or readable.
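For comparison, the same down-sampling can be sketched in base R with split() and lapply(); this is an illustrative sketch using the question's df, not the answer's code:

```r
set.seed(1)
df <- data.frame(id = 1:18,
                 class = rep(c(rep(1, 3), rep(2, 2), 3), 3),
                 var_a = rep(c("a", "b"), 9))

min_n <- min(table(df$class))  # size of the smallest class

# Sample min_n rows from each class and stack the pieces back together
balanced <- do.call(rbind, lapply(split(df, df$class),
                                  function(g) g[sample(nrow(g), min_n), ]))
table(balanced$class)  # every class now contributes min_n rows
```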

Add a new row for each id in dataframe for ALL variables

I want to add a new row after each id. I found a solution on a Stack Overflow page (Inserting a new row to data frame for each group id),
but there is one thing I want to change and I don't know how. I want to create the new row for all variables at once; I don't want to write out every variable as in that example. The numbers in the new row don't matter, I will change them later. If it is possible to put "base" in the new row's trt column, that would be good. I want the code to work for many ids and variables, since the data I'm working with has a lot of both. Many thanks if someone can help me with this!
The example code:
set.seed(1)
id <- rep(1:3, each = 4)
trt <- rep(c("A", "OA", "B", "OB"), 3)
pointA <- sample(1:10, 12, replace = TRUE)
pointB <- sample(1:10, 12, replace = TRUE)
pointC <- sample(1:10, 12, replace = TRUE)
df <- data.frame(id, trt, pointA, pointB, pointC)
df
id trt pointA pointB pointC
1 1 A 3 7 3
2 1 OA 4 4 4
3 1 B 6 8 1
4 1 OB 10 5 4
5 2 A 3 8 9
6 2 OA 9 10 4
7 2 B 10 4 5
8 2 OB 7 8 6
9 3 A 7 10 5
10 3 OA 1 3 2
11 3 B 3 7 9
12 3 OB 2 2 7
I want it to look like:
df <- rbind(df[1:4,], df1, df[5:8,], df2, df[9:12,],df3)
df
id trt pointA pointB pointC
1 1 A 3 7 3
2 1 OA 4 4 4
3 1 B 6 8 1
4 1 OB 10 5 4
5 1 base
51 2 A 3 8 9
6 2 OA 9 10 4
7 2 B 10 4 5
8 2 OB 7 8 6
13 2 base
9 3 A 7 10 5
10 3 OA 1 3 2
11 3 B 3 7 9
12 3 OB 2 2 7
14 3 base
I'm trying this code:
df %>%
  group_by(id) %>%
  summarise(week = "base") %>%
  mutate_all() %>%  # want to mutate all variables
  bind_rows(df, .) %>%
  arrange(id)
You could bind_rows directly, it will add NAs to all other columns by default.
library(dplyr)
df %>% group_by(id) %>% summarise(trt = 'base') %>% bind_rows(df) %>% arrange(id)
# id trt pointA pointB pointC
# <int> <chr> <int> <int> <int>
# 1 1 base NA NA NA
# 2 1 A 3 7 3
# 3 1 OA 4 4 4
# 4 1 B 6 8 1
# 5 1 OB 10 5 4
# 6 2 base NA NA NA
# 7 2 A 3 8 9
# 8 2 OA 9 10 4
# 9 2 B 10 4 5
#10 2 OB 7 8 6
#11 3 base NA NA NA
#12 3 A 7 10 5
#13 3 OA 1 3 2
#14 3 B 3 7 9
#15 3 OB 2 2 7
If you want empty strings instead of NA, we can give a range of columns in mutate_at and replace NA values with empty string.
df %>%
group_by(id) %>%
summarise(trt = 'base') %>%
bind_rows(df) %>%
mutate_at(vars(pointA:pointC), ~replace(., is.na(.) , '')) %>%
arrange(id)
library(dplyr)
library(purrr)
df %>% mutate_if(is.factor, as.character) %>%
group_split(id) %>%
map_dfr(~bind_rows(.x, data.frame(id=.x$id[1], trt="base", stringsAsFactors = FALSE)))
#Note that group_modify is Experimental
df %>% mutate_if(is.factor, as.character) %>%
group_by(id) %>%
group_modify(~bind_rows(.x, data.frame(trt="base", stringsAsFactors = FALSE)))
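A base-R version of the same idea, for readers without dplyr: build one "base" row per id, rbind it on, and restore the grouping with order() (an illustrative sketch on the question's df; unspecified columns are filled with NA, as with bind_rows):

```r
set.seed(1)
id <- rep(1:3, each = 4)
trt <- rep(c("A", "OA", "B", "OB"), 3)
pointA <- sample(1:10, 12, replace = TRUE)
pointB <- sample(1:10, 12, replace = TRUE)
pointC <- sample(1:10, 12, replace = TRUE)
df <- data.frame(id, trt, pointA, pointB, pointC)

# One extra "base" row per id; the point columns become NA
extra <- data.frame(id = unique(df$id), trt = "base",
                    pointA = NA, pointB = NA, pointC = NA)
out <- rbind(df, extra)
out <- out[order(out$id), ]  # group each new row with its id
```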

Matching of positive to negative numbers in the same group - R

I would like to do the following thing:
id calendar_week value
1 1 10
2 2 2
3 2 -2
4 2 3
5 3 10
6 3 -10
The output I want is the list of ids (or the rows) that have a positive-to-negative match within a given calendar_week. For example, I want ids 2 and 3 because -2 matches 2 in calendar week 2; I don't want id 4 because there is no -3 value in calendar week 2, and so on.
output:
id calendar_week value
2 2 2
3 2 -2
5 3 10
6 3 -10
Could also do:
library(dplyr)
df %>%
group_by(calendar_week, ab = abs(value)) %>%
filter(n() > 1) %>% ungroup() %>%
select(-ab)
Output:
# A tibble: 4 x 3
id calendar_week value
<int> <int> <int>
1 2 2 2
2 3 2 -2
3 5 3 10
4 6 3 -10
Given your additional clarifications, you could do:
df %>%
group_by(calendar_week, value) %>%
mutate(idx = row_number()) %>%
group_by(calendar_week, idx, ab = abs(value)) %>%
filter(n() > 1) %>% ungroup() %>%
select(-idx, -ab)
On a modified data frame:
id calendar_week value
1 1 1 10
2 2 2 2
3 3 2 -2
4 3 2 2
5 4 2 3
6 5 3 10
7 6 3 -10
8 7 4 10
9 8 4 10
This gives:
# A tibble: 4 x 3
id calendar_week value
<int> <int> <int>
1 2 2 2
2 3 2 -2
3 5 3 10
4 6 3 -10
Using tidyverse:
library(tidyverse)
df %>%
group_by(calendar_week) %>%
nest() %>%
mutate(values = map_chr(data, ~ str_c(.x$value, collapse = ', '))) %>%
unnest() %>%
filter(str_detect(values, as.character(-value))) %>%
select(-values)
Output:
calendar_week id value
<dbl> <int> <dbl>
1 2 2 2
2 2 3 -2
3 3 5 10
4 3 6 -10
If, as stated in the comments, only a single match is required, you could try:
library(dplyr)
df %>%
group_by(calendar_week, nvalue = abs(value)) %>%
filter(!duplicated(value)) %>%
filter(sum(value) == 0) %>%
ungroup() %>%
select(-nvalue)
id calendar_week value
<int> <int> <int>
1 2 2 2
2 3 2 -2
3 5 3 10
4 6 3 -10
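The pairing test can also be written in base R with ave(): a row is kept when its sign-flipped value occurs somewhere in the same calendar week. A sketch on the question's data (note that ave() coerces the logical result to 0/1, hence the as.logical()):

```r
df <- data.frame(id = 1:6,
                 calendar_week = c(1, 2, 2, 2, 3, 3),
                 value = c(10, 2, -2, 3, 10, -10))

# For each row: is the negated value present within the same week?
keep <- as.logical(ave(df$value, df$calendar_week,
                       FUN = function(v) (-v) %in% v))
df[keep, ]
```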

Summarise all using which on other column in dplyr [duplicate]

This question already has answers here:
Extract row corresponding to minimum value of a variable by group
(9 answers)
Closed 5 years ago.
For some reason, I could not find a solution using the summarise_all function for the following problem:
df <- data.frame(A = c(1,2,2,3,3,3,4,4), B = 1:8, C = 8:1, D = c(1,2,3,1,2,5,10,9))
desired results:
df %>%
group_by(A) %>%
summarise(B = B[which.min(D)],
C = C[which.min(D)],
D = D[which.min(D)])
# A tibble: 4 x 4
A B C D
<dbl> <int> <int> <dbl>
1 1 1 8 1
2 2 2 7 2
3 3 4 5 1
4 4 8 1 9
What I tried:
df %>%
group_by(A) %>%
summarise_all(.[which.min(D)])
In words, I want to group by a variable and find for each column the value that belongs to the minimum value of another column. I could not find a solution for this using summarise_all. I am searching for a dplyr approach.
You can just filter down to the row that has a minimum value of D for each level of A. The code below assumes there is only one minimum row in each group.
df %>%
group_by(A) %>%
arrange(D) %>%
slice(1)
A B C D
1 1 1 8 1
2 2 2 7 2
3 3 4 5 1
4 4 8 1 9
If there can be multiple rows with minimum D, then:
df <- data.frame(A = c(1,2,2,3,3,3,4,4), B = 1:8, C = 8:1, D = c(1,2,3,1,2,5,9,9))
df %>%
group_by(A) %>%
filter(D == min(D))
A B C D
1 1 1 8 1
2 2 2 7 2
3 3 4 5 1
4 4 7 2 9
5 4 8 1 9
You need filter: any time you're trying to drop some rows and keep others, that's the verb you want.
df %>% group_by(A) %>% filter(D == min(D))
#> # A tibble: 4 x 4
#> # Groups: A [4]
#> A B C D
#> <dbl> <int> <int> <dbl>
#> 1 1 1 8 1
#> 2 2 2 7 2
#> 3 3 4 5 1
#> 4 4 8 1 9
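The same "row with the minimum D per group" extraction can be done in base R via split() and which.min(), which makes the indexing explicit (a sketch on the question's df):

```r
df <- data.frame(A = c(1, 2, 2, 3, 3, 3, 4, 4),
                 B = 1:8, C = 8:1,
                 D = c(1, 2, 3, 1, 2, 5, 10, 9))

# For each level of A, keep the single row where D is smallest
res <- do.call(rbind, lapply(split(df, df$A),
                             function(g) g[which.min(g$D), ]))
res
```

In modern dplyr (>= 1.0.0) the grouped equivalent is df %>% group_by(A) %>% slice_min(D, n = 1); note that slice_min() keeps ties by default (with_ties = TRUE).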

How to group_by(everything())

I want to count unique combinations in a dataframe using dplyr
I tried the following:
require(dplyr)
set.seed(314)
dat <- data.frame(a = sample(1:3, 100, replace = T),
                  b = sample(1:2, 100, replace = T),
                  c = sample(1:2, 100, replace = T))
dat %>% group_by(a,b,c) %>% summarise(n = n())
Which results in:
a b c n
<int> <int> <int> <int>
1 1 1 1 6
2 1 1 2 8
3 1 2 1 13
4 1 2 2 8
5 2 1 1 7
6 2 1 2 12
7 2 2 1 14
8 2 2 2 10
9 3 1 1 3
10 3 1 2 4
11 3 2 1 7
12 3 2 2 8
But to make this generic (unrelated to the names of the columns) I tried:
dat %>% group_by(everything()) %>% summarise(n = n())
Which gives the error:
Error in mutate_impl(.data, dots) : `c(...)` must be a character vector
I fiddled around with different things but cannot get it to work. I know I could use names(dat), but the columns that need to be in the group_by() depend on previous steps in the dplyr chain.
There is a function called group_by_all() (and, in the same vein, group_by_at() and group_by_if()) which does exactly that.
library(dplyr)
dat %>%
group_by_all() %>%
summarise(n = n())
which gives the same result,
# A tibble: 12 x 4
# Groups: a, b [?]
a b c n
<int> <int> <int> <int>
1 1 1 1 6
2 1 1 2 8
3 1 2 1 13
4 1 2 2 8
5 2 1 1 7
6 2 1 2 12
7 2 2 1 14
8 2 2 2 10
9 3 1 1 3
10 3 1 2 4
11 3 2 1 7
12 3 2 2 8
PS
packageVersion('dplyr')
#[1] ‘0.7.2’
We can use .dots
dat %>%
group_by(.dots = names(.)) %>%
summarise(n = n())
# A tibble: 12 x 4
# Groups: a, b [?]
# a b c n
# <int> <int> <int> <int>
#1 1 1 1 6
#2 1 1 2 8
#3 1 2 1 13
#4 1 2 2 8
#5 2 1 1 7
#6 2 1 2 12
#7 2 2 1 14
#8 2 2 2 10
#9 3 1 1 3
#10 3 1 2 4
#11 3 2 1 7
#12 3 2 2 8
Another option would be to use the unquote, sym approach
dat %>%
group_by(!!! rlang::syms(names(.))) %>%
summarise(n = n())
In dplyr version 1.0.0 and later, you would now use across().
library(dplyr)
dat %>%
group_by(across(everything())) %>%
summarise(n = n())
Package version:
> packageVersion("dplyr")
[1] ‘1.0.5’
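The same combination count can be sketched without dplyr using base R's aggregate(); like the grouped summarise, it only reports combinations that actually occur (illustrative sketch on the question's dat; exact counts depend on the RNG version, so none are shown):

```r
set.seed(314)
dat <- data.frame(a = sample(1:3, 100, replace = TRUE),
                  b = sample(1:2, 100, replace = TRUE),
                  c = sample(1:2, 100, replace = TRUE))

# Sum a column of 1s within each unique (a, b, c) combination
counts <- aggregate(data.frame(n = rep(1L, nrow(dat))), by = dat, FUN = sum)
head(counts)
```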
