How to count distinct non-missing rows in R using dplyr

library(dplyr)
mydat <- data.frame(id = c(123, 111, 234, "none", 123, 384, "none"),
id2 = c(1, 1, 1, 2, 2, 3, 4))
> mydat
id id2
1 123 1
2 111 1
3 234 1
4 none 2
5 123 2
6 384 3
7 none 4
I would like to count the number of unique ids for each id2 in mydat. However, I do not want to count the id that is "none".
> mydat %>% group_by(id2) %>% summarise(count = n_distinct(id))
# A tibble: 4 × 2
id2 count
<dbl> <int>
1 1 3
2 2 2
3 3 1
4 4 1
This mistakenly counts "none" as an id. The desired output is:
# A tibble: 4 × 2
id2 count
<dbl> <int>
1 1 3
2 2 1
3 3 1
4 4 0

mydat %>% group_by(id2) %>%
summarise(
count = n_distinct(id),
wanted = n_distinct(id[id != "none"])
)
# # A tibble: 4 × 3
# id2 count wanted
# <dbl> <int> <int>
# 1 1 3 3
# 2 2 2 1
# 3 3 1 1
# 4 4 1 0
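Since the title asks about non-missing values, another way to sketch this is to turn the "none" placeholder into a real NA and let n_distinct() skip it. This assumes "none" is the only placeholder in id; na_if() and the na.rm argument are existing dplyr features, while res is just an illustrative name.

```r
library(dplyr)

mydat <- data.frame(id  = c(123, 111, 234, "none", 123, 384, "none"),
                    id2 = c(1, 1, 1, 2, 2, 3, 4))

res <- mydat %>%
  mutate(id = na_if(id, "none")) %>%   # "none" becomes NA
  group_by(id2) %>%
  summarise(count = n_distinct(id, na.rm = TRUE))  # NA is not counted

res$count
# 3 1 1 0
```

The subsetting approach above avoids modifying id; this variant may be more convenient if the placeholder should be treated as missing in later steps too.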

Related

How to fill a binary outcome to all rows within a group in dplyr?

I'm trying to use group_by() to create a new variable that assigns either a 1 or 0 based on a condition (two criteria). I'd like to assign 1 to ALL rows within the group if the condition is met ONCE, and a 0 if not at all. The code below assigns the 1 at the single line in which the condition is met. How can I adjust this to fill all rows within the grouping variables?
library(tidyverse)
set.seed(10)
dat <- data.frame(
group = rep(c("Group1", "Group2"), each = 18),
d_num = rep(c(1:6), times = 2, each = 3),
var1 = sample(1:4, 36, replace = TRUE),
var2 = sample(1:4, 36, replace = TRUE)
)
x <- dat %>%
group_by(group, d_num) %>%
mutate(var3 = ifelse(var1 == 1 | var2 == 1, 1, 0))
Wrap your condition in any():
library(tidyverse)
x <- dat %>%
group_by(group, d_num) %>%
mutate(var3 = ifelse(any(var1 == 1 | var2 == 1), 1, 0))
print(x, n = 20)
Output:
# A tibble: 36 × 5
# Groups: group, d_num [12]
group d_num var1 var2 var3
<chr> <int> <int> <int> <dbl>
1 Group1 1 3 2 1
2 Group1 1 1 1 1
3 Group1 1 2 2 1
4 Group1 2 4 1 1
5 Group1 2 4 3 1
6 Group1 2 3 2 1
7 Group1 3 4 2 1
8 Group1 3 2 2 1
9 Group1 3 3 1 1
10 Group1 4 3 4 1
11 Group1 4 3 1 1
12 Group1 4 4 2 1
13 Group1 5 3 4 0
14 Group1 5 3 3 0
15 Group1 5 2 2 0
16 Group1 6 3 3 0
17 Group1 6 2 2 0
18 Group1 6 2 4 0
19 Group2 1 4 3 1
20 Group2 1 1 3 1
# … with 16 more rows
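The key point is that any() collapses the whole group to a single TRUE/FALSE, which mutate() then recycles to every row of that group. A minimal sketch on a small hand-made data frame (toy and its values are made up for illustration, not the asker's data), using as.integer() instead of ifelse():

```r
library(dplyr)

# Hypothetical toy data: group "a" contains a 1, group "b" does not.
toy <- data.frame(
  group = c("a", "a", "b", "b"),
  var1  = c(1, 2, 3, 4),
  var2  = c(2, 2, 3, 3)
)

flagged <- toy %>%
  group_by(group) %>%
  # any() returns one logical per group; mutate() recycles it to all rows
  mutate(var3 = as.integer(any(var1 == 1 | var2 == 1))) %>%
  ungroup()

flagged$var3
# 1 1 0 0
```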

Filter groups that are not consecutively numbered

I have a dataframe with a grouping variable Sequ and a counting variable grp:
df <- data.frame(
Sequ = c(1,1,2,2,2,2,3,3,3,4,4,4,4),
grp = c(1,2,
1,3,4,5,
1,2,3,
1,2,4,5)
)
I need to keep only those sequences (Sequ groups) in which grp does not always increase in steps of 1, i.e. where at least one step is greater than 1. The following method identifies the rows where the 'break' occurs, but it does not return those sequences in their entirety:
df %>%
group_by(Sequ) %>%
filter(lead(grp) - grp > 1)
# A tibble: 2 × 2
# Groups: Sequ [2]
Sequ grp
<dbl> <dbl>
1 2 1
2 4 2
How can I get this desired output:
df
Sequ grp
1 2 1
2 2 3
3 2 4
4 2 5
5 4 1
6 4 2
7 4 4
8 4 5
You may try
library(dplyr)
df %>%
group_by(Sequ) %>%
filter(!all(abs(diff(grp)) == 1))
Sequ grp
<dbl> <dbl>
1 2 1
2 2 3
3 2 4
4 2 5
5 4 1
6 4 2
7 4 4
8 4 5

How to flag the last row of a data frame group?

Suppose we start with the below dataframe df:
ID <- c(1, 1, 1, 5, 5)
Period <- c(1,2,3,1,2)
Value <- c(10,12,11,4,6)
df <- data.frame(ID, Period, Value)
ID Period Value
1 1 1 10
2 1 2 12
3 1 3 11
4 5 1 4
5 5 2 6
Now using dplyr I add a "Calculate" column that multiplies Period and Value of each row, giving me the following:
> df %>% mutate(Calculate = Period * Value)
ID Period Value Calculate
1 1 1 10 10
2 1 2 12 24
3 1 3 11 33
4 5 1 4 4
5 5 2 6 12
I'd like to modify the above "Calculate" to give me a value of 0, when reaching the last row for a given ID, so that the data frame output looks like:
ID Period Value Calculate
1 1 1 10 10
2 1 2 12 24
3 1 3 11 0
4 5 1 4 4
5 5 2 6 0
I was going to use the lead() function to peek at the next row and see whether the ID changes, but I wasn't sure what happens when reaching the end of the data frame.
How could this be accomplished using dplyr?
You can group_by ID and replace the last row for each ID with 0.
library(dplyr)
df %>%
mutate(Calculate = Period * Value) %>%
group_by(ID) %>%
mutate(Calculate = replace(Calculate, n(), 0)) %>%
ungroup
# ID Period Value Calculate
# <dbl> <dbl> <dbl> <dbl>
#1 1 1 10 10
#2 1 2 12 24
#3 1 3 11 0
#4 5 1 4 4
#5 5 2 6 0
Yet another possibility:
library(tidyverse)
ID <- c(1, 1, 1, 5, 5)
Period <- c(1,2,3,1,2)
Value <- c(10,12,11,4,6)
df <- data.frame(ID, Period, Value)
df %>%
mutate(Calculate = Period * Value) %>%
group_by(ID) %>%
mutate(Calculate = if_else(row_number() == n(), 0, Calculate)) %>%
ungroup
#> # A tibble: 5 × 4
#> ID Period Value Calculate
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 10 10
#> 2 1 2 12 24
#> 3 1 3 11 0
#> 4 5 1 4 4
#> 5 5 2 6 0
ID <- c(1, 1, 1, 5, 5)
Period <- c(1,2,3,1,2)
Value <- c(10,12,11,4,6)
df <- data.frame(ID, Period, Value)
library(tidyverse)
df %>%
mutate(Calculate = Period * Value * duplicated(ID, fromLast = TRUE))
#> ID Period Value Calculate
#> 1 1 1 10 10
#> 2 1 2 12 24
#> 3 1 3 11 0
#> 4 5 1 4 4
#> 5 5 2 6 0
Created on 2022-01-09 by the reprex package (v2.0.1)
This should work. You can most likely use Period in place of rownum, provided Period increases within each ID.
ID <- c(1, 1, 1, 5, 5)
Period <- c(1,2,3,1,2)
Value <- c(10,12,11,4,6)
df <- data.frame(ID, Period, Value)
df = df %>% mutate(Calculate = Period * Value)
# numeric row index; rownames() returns character, so max() would compare "10" < "9" as strings
df$rownum = seq_len(nrow(df))
df = df %>%
group_by(ID) %>%
mutate(Calculate = ifelse(rownum == max(rownum), 0, Calculate)) %>%
ungroup()
# A tibble: 5 × 5
ID Period Value Calculate rownum
<dbl> <dbl> <dbl> <dbl> <int>
1 1 1 10 10 1
2 1 2 12 24 2
3 1 3 11 0 3
4 5 1 4 4 4
5 5 2 6 0 5
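On the question of what lead() does at the end of the data frame: it returns NA for the last row (its default argument is NA), so that case can be tested with is.na(). A sketch of the lead()-based idea without grouping, assuming the rows of each ID are contiguous (out and next_id are illustrative names):

```r
library(dplyr)

df <- data.frame(
  ID     = c(1, 1, 1, 5, 5),
  Period = c(1, 2, 3, 1, 2),
  Value  = c(10, 12, 11, 4, 6)
)

out <- df %>%
  mutate(
    next_id   = lead(ID),  # NA on the very last row
    # last row of an ID: either there is no next row or the next ID differs
    Calculate = if_else(is.na(next_id) | next_id != ID, 0, Period * Value)
  ) %>%
  select(-next_id)

out$Calculate
# 10 24 0 4 0
```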

bootstrap by group in tibble

Suppose I have a tibble tbl_
tbl_ <- tibble(id = c(1,1,2,2,3,3), dta = 1:6)
tbl_
# A tibble: 6 x 2
id dta
<dbl> <int>
1 1 1
2 1 2
3 2 3
4 2 4
5 3 5
6 3 6
There are 3 id groups. I want to resample entire id groups 3 times with replacement. For example the resulting tibble can be:
id dta
<dbl> <int>
1 1 1
2 1 2
3 1 1
4 1 2
5 3 5
6 3 6
but not
id dta
<dbl> <int>
1 1 1
2 1 2
3 1 1
4 2 4
5 3 5
6 3 6
or
id dta
<dbl> <int>
1 1 1
2 1 1
3 2 3
4 2 4
5 3 5
6 3 6
Here is one option with sample_n and distinct
library(tidyverse)
distinct(tbl_, id) %>%
sample_n(nrow(.), replace = TRUE) %>%
pull(id) %>%
map_df( ~ tbl_ %>%
filter(id == .x)) %>%
arrange(id)
# A tibble: 6 x 2
# id dta
# <dbl> <int>
#1 1.00 1
#2 1.00 2
#3 1.00 1
#4 1.00 2
#5 3.00 5
#6 3.00 6
Another option is to get the minimum row number for each id. Those row numbers are then sampled with replace = TRUE, and each sampled number is expanded to the rows of its group.
library(dplyr)
tbl_ %>% mutate(rn = row_number()) %>%
group_by(id) %>%
summarise(minrow = min(rn)) ->min_row
indx <- rep(sample(min_row$minrow, nrow(min_row), replace = TRUE), each = 2) +
rep(c(0,1), 3)
tbl_[indx,]
# # A tibble: 6 x 2
# id dta
# <dbl> <int>
# 1 1.00 1
# 2 1.00 2
# 3 3.00 5
# 4 3.00 6
# 5 2.00 3
# 6 2.00 4
Note: the answer above assumes that every id has exactly 2 rows. The hard-coded each = 2 and c(0, 1) need to be modified to handle more than 2 rows per id, and the 3 in rep(c(0, 1), 3) should be replaced by nrow(min_row) to handle a different number of ids.
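To avoid hard-coding the group size, one possible sketch splits the tibble by id, draws group names with replacement, and binds the drawn groups back together. This is not taken from either answer; groups, drawn, and boot are illustrative names, and the uneven group sizes are chosen only to show that whole groups stay intact.

```r
library(dplyr)

set.seed(1)  # only to make the draw repeatable

# Groups of different sizes (1, 2 and 3 rows)
tbl_ <- tibble(id = c(1, 2, 2, 3, 3, 3), dta = 1:6)

groups <- split(tbl_, tbl_$id)              # one tibble per id
drawn  <- sample(names(groups), length(groups), replace = TRUE)
boot   <- bind_rows(unname(groups[drawn]))  # each drawn group is copied whole

boot
```

Because whole groups are copied, the number of rows in boot varies between draws when group sizes differ.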

adding grouping indicator for repeating sequences

I thought this would be simple, but I failed and can't find an answer anywhere.
Example data looks like this. I have nro running from 1 to x and restarting at random points. I would like to create an ind variable that is 1 for the first run, 2 for the second, and so on.
tbl <- tibble(nro = c(rep(1:3, 1), rep(1:5, 1), rep(1:4, 1)))
End result should look like this:
tibble(nro = c(rep(1:3, 1), rep(1:5, 1), rep(1:4, 1)),
ind = c(rep(1, 3), rep(2, 5), rep(3, 4)))
# A tibble: 12 x 2
nro ind
<int> <dbl>
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2
7 4 2
8 5 2
9 1 3
10 2 3
11 3 3
12 4 3
I thought I could do something with ifelse but failed miserably.
tbl %>%
mutate(ind = ifelse(nro < lag(nro), 1 + lag(ind), 1))
I assume this needs some kind of loop.
for sequences of the same length
You could use group_by on your nro variable and then just take the row_number():
tbl %>%
group_by(nro) %>%
mutate(ind = row_number())
# A tibble: 12 x 2
# Groups: nro [4]
# nro ind
# <int> <int>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 1 2
# 6 2 2
# 7 3 2
# 8 4 2
# 9 1 3
# 10 2 3
# 11 3 3
# 12 4 3
for varying length of the sequences
inspired by docendo discimus's comment
tbl <- tibble(nro = c(rep(1:3, 1), rep(1:5, 1), rep(1:4, 1)))
tbl %>%
mutate(ind = cumsum(nro == 1))
However, this only works for sequences that begin with 1, since only the TRUE values of nro == 1 are accumulated.
So consider using this instead, which starts a new run whenever nro decreases:
tbl %>% mutate(dif = nro - lag(nro)) %>%
mutate(dif = ifelse(is.na(dif), nro, dif)) %>%
mutate(ind = cumsum(dif < 0) + 1) %>%
select(-dif)
# A tibble: 12 x 2
# nro ind
# <int> <dbl>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 1 2
# 5 2 2
# 6 3 2
# 7 4 2
# 8 5 2
# 9 1 3
# 10 2 3
# 11 3 3
# 12 4 3
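The same idea can be written without the helper column: a new run starts wherever nro drops relative to the previous row. A compact sketch, assuming every restart goes to a strictly smaller value (res is an illustrative name):

```r
library(dplyr)

tbl <- tibble(nro = c(1:3, 1:5, 1:4))

res <- tbl %>%
  # the first row always opens run 1; afterwards a negative diff marks a restart
  mutate(ind = cumsum(c(TRUE, diff(nro) < 0)))

res$ind
# 1 1 1 2 2 2 2 2 3 3 3 3
```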

Resources