I am trying to count values by group when one column of the data frame contains NAs. I have data like this:
> df <- data.frame(id = c(1, 2, 3, NA, 4, NA),
group = c("A", "A", "B", "C", "D", "E"))
> df
id group
1 1 A
2 2 A
3 3 B
4 NA C
5 4 D
6 NA E
I want groups where the first column is NA to be counted as 0, but with an approach like this
> df %>% group_by(group) %>% summarise(n = n())
# A tibble: 5 x 2
group n
* <chr> <int>
1 A 2
2 B 1
3 C 1
4 D 1
5 E 1
I get 1 in rows C and E, not the 0 that I want.
The expected result looks like this:
# A tibble: 5 x 2
group n
* <chr> <int>
1 A 2
2 B 1
3 C 0
4 D 1
5 E 0
How can I do this?
We can take the sum of a logical vector created with !is.na: TRUE counts as 1 and FALSE as 0, so the sum returns the number of non-NA elements.
library(dplyr)
df %>%
group_by(group) %>%
summarise(n = sum(!is.na(id)))
# A tibble: 5 x 2
# group n
# * <chr> <int>
#1 A 2
#2 B 1
#3 C 0
#4 D 1
#5 E 0
Or use length after subsetting:
df %>%
group_by(group) %>%
summarise(n = length(id[!is.na(id)]))
n() returns the total number of rows, including the missing values.
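To see the difference side by side, here is a minimal sketch comparing both counts on the same data (assuming dplyr is loaded and df is defined as above):
df %>%
  group_by(group) %>%
  summarise(total_rows = n(),            # counts every row, NA or not
            non_na = sum(!is.na(id)))    # counts only non-NA ids
For groups C and E, total_rows is 1 while non_na is 0.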
I have the following dataframe:
df <-read.table(header=TRUE, text="id code
1 A
1 B
1 C
2 A
2 A
2 A
3 A
3 B
3 A")
Per id, I would like to find those individuals that meet at least 2 conditions, namely:
conditionA = "A"
conditionB = "B"
conditionC = "C"
and create a new column, "index": 1 if two or more of the conditions are met and 0 otherwise:
df_output <-read.table(header=TRUE, text="id code index
1 A 1
1 B 1
1 C 1
2 A 0
2 A 0
2 A 0
3 A 1
3 B 1
3 A 1")
So far I have tried the following:
df_output = df %>%
group_by(id) %>%
mutate(index = ifelse(grepl(conditionA|conditionB|conditionC, code), 1, 0))
and as you can see I am struggling to get the threshold count into the code.
You can create a vector of conditions, then use %in% and sum to count how many of them occur in each group; wrapping code in unique() ensures repeated values (like the three "A"s for id 2) are counted only once. Use + (or ifelse) to convert the logical result into 1 and 0:
conditions = c("A", "B", "C")
df %>%
group_by(id) %>%
mutate(index = +(sum(unique(code) %in% conditions) >= 2))
id code index
1 1 A 1
2 1 B 1
3 1 C 1
4 2 A 0
5 2 A 0
6 2 A 0
7 3 A 1
8 3 B 1
9 3 A 1
You could use n_distinct(), which is a faster and more concise equivalent of length(unique(x)).
df %>%
group_by(id) %>%
mutate(index = +(n_distinct(code) >= 2)) %>%
ungroup()
# # A tibble: 9 × 3
# id code index
# <int> <chr> <int>
# 1 1 A 1
# 2 1 B 1
# 3 1 C 1
# 4 2 A 0
# 5 2 A 0
# 6 2 A 0
# 7 3 A 1
# 8 3 B 1
# 9 3 A 1
You can check the conditions using the intersect() function and test whether the resulting vector has at least the minimal length (e.g. 2).
conditions = c('A', 'B', 'C')
df_output2 =
df %>%
group_by(id) %>%
mutate(index = as.integer(length(intersect(code, conditions)) >= 2))
I have the following dataframe:
df <- data.frame(
id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
name = c("J", "Z", "X", "A", "J", "B", "R", "J", "X")
)
I would like to group_by(id), then create a counter column which increments for each subsequent instance of a given name. The desired output would look like this...
id name count
1 J 1
1 Z 1
1 X 1
2 A 1
2 J 2
2 B 1
3 R 1
3 J 3
3 X 2
I assume it would be something that starts like this...
library(tidyverse)
df %>%
group_by(id) %>%
But I'm not sure how I would implement that kind of counter...
Any help much appreciated.
Actually you have to group by "name", since you want to count each name independently of the id:
library(dplyr)
df %>%
dplyr::group_by(name) %>%
dplyr::mutate(count = dplyr::row_number()) %>%
dplyr::ungroup()
# A tibble: 9 x 3
id name count
<dbl> <chr> <int>
1 1 J 1
2 1 Z 1
3 1 X 1
4 2 A 1
5 2 J 2
6 2 B 1
7 3 R 1
8 3 J 3
9 3 X 2
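As a side note, dplyr 1.1 or later (an assumption about your installed version, not something the answer above requires) lets you write the same counter without an explicit group_by()/ungroup() pair, using the .by argument:
library(dplyr)
# per-operation grouping via `.by`; no persistent grouping is created
df %>% mutate(count = row_number(), .by = name)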
I have a data frame with a group, a condition that differs by group, and an index within each group:
df <- data.frame(group = c(rep(c("A", "B", "C"), each = 3)),
condition = rep(c(0,1,1), each = 3),
index = c(1:3,1:3,2:4))
> df
group condition index
1 A 0 1
2 A 0 2
3 A 0 3
4 B 1 1
5 B 1 2
6 B 1 3
7 C 1 2
8 C 1 3
9 C 1 4
I would like to slice the data within each group, filtering out all but the row with the lowest index. However, this filter should only be applied when the condition applies, i.e., condition == 1. My solution was to compute a ranking on the index within each group and filter on the combination of condition and rank:
df %>%
group_by(group) %>%
mutate(rank = order(index)) %>%
filter(case_when(condition == 0 ~ TRUE,
condition == 1 & rank == 1 ~ TRUE))
# A tibble: 5 x 4
# Groups: group [3]
group condition index rank
<chr> <dbl> <int> <int>
1 A 0 1 1
2 A 0 2 2
3 A 0 3 3
4 B 1 1 1
5 C 1 2 1
This left me wondering whether there is a faster solution that does not require a separate ranking variable, and potentially uses slice_min() instead.
You can use filter() to keep all cases where the condition is zero or the index equals the minimum index.
library(dplyr)
df %>%
group_by(group) %>%
filter(condition == 0 | index == min(index))
# A tibble: 5 x 3
# Groups: group [3]
group condition index
<chr> <dbl> <int>
1 A 0 1
2 A 0 2
3 A 0 3
4 B 1 1
5 C 1 2
An option with slice:
library(dplyr)
df %>%
group_by(group) %>%
slice(unique(c(which(condition == 0), which.min(index))))
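As for the slice_min() idea from the question: slice_min() alone would also collapse the condition == 0 groups to a single row, so it has to be applied conditionally. A minimal sketch using group_modify() (assuming dplyr >= 1.0, where both functions are available):
library(dplyr)
df %>%
  group_by(group) %>%
  # keep the whole group when condition == 0, otherwise only the min-index row
  group_modify(~ if (first(.x$condition) == 1) slice_min(.x, index, n = 1) else .x) %>%
  ungroup()
Whether this beats the plain filter() above on speed is untested here; the filter() version is certainly simpler.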
I was wondering if there's a more elegant way of taking a dataframe, grouping by x to count how many times each x occurs in the dataset, then finding the first occurrence (the first value of y) for every x.
test <- data.frame(x = c("a", "b", "c", "d",
"c", "b", "e", "f", "g"),
y = c(1,1,1,1,2,2,2,2,2))
x y
1 a 1
2 b 1
3 c 1
4 d 1
5 c 2
6 b 2
7 e 2
8 f 2
9 g 2
Current Output
output <- test %>%
group_by(x) %>%
summarise(count = n())
x count
<fct> <int>
1 a 1
2 b 2
3 c 2
4 d 1
5 e 1
6 f 1
7 g 1
Desired Output
x count first_seen
<fct> <int> <dbl>
1 a 1 1
2 b 2 1
3 c 2 1
4 d 1 1
5 e 1 2
6 f 1 2
7 g 1 2
I can filter the test dataframe for the first occurrences then use a left_join but was hoping there's a more elegant solution using mutate?
# filter for first occurrences of y
right <- test %>%
group_by(x) %>%
filter(y == min(y)) %>%
slice(1) %>%
ungroup()
# bind to the output dataframe
left_join(output, right, by = "x")
We can use first() after grouping by 'x' to create a new column, include that column in the group_by as well, and get the count with n():
library(dplyr)
test %>%
group_by(x) %>%
group_by(first_seen = first(y), add = TRUE) %>%
summarise(count = n())
# A tibble: 7 x 3
# Groups: x [7]
# x first_seen count
# <fct> <dbl> <int>
#1 a 1 1
#2 b 1 2
#3 c 1 2
#4 d 1 1
#5 e 2 1
#6 f 2 1
#7 g 2 1
I have a question: why not keep it simple? For example
test %>%
group_by(x) %>%
summarise(
count = n(),
first_seen = first(y)
)
#> # A tibble: 7 x 3
#> x count first_seen
#> <chr> <int> <dbl>
#> 1 a 1 1
#> 2 b 2 1
#> 3 c 2 1
#> 4 d 1 1
#> 5 e 1 2
#> 6 f 1 2
#> 7 g 1 2
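And since the question specifically asked about mutate(): a mutate()-based version is possible too, though it needs distinct() afterwards to collapse the repeated rows (an illustration, not part of the original answers):
test %>%
  group_by(x) %>%
  mutate(count = n(), first_seen = first(y)) %>%  # adds both columns to every row
  ungroup() %>%
  distinct(x, count, first_seen)                  # keep one row per x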
Given the example data, I'd like to spread a subset of the key-value pairs. In this case it is just one pair. However, there are other cases where the subset to be spread is more than one pair.
library(tidyr)
# dummy data
> df1 <- data.frame(e = c(1, 1, 1, 1),
n = c("a", "b", "c", "d") ,
s = c(1, 2, 5, 7))
> df1
e n s
1 1 a 1
2 1 b 2
3 1 c 5
4 1 d 7
Classical spread of all key-value pairs:
> df1 %>% spread(n,s)
e a b c d
1 1 1 2 5 7
Desired output, spreading only n == "c":
e c n s
1 1 5 a 1
2 1 5 b 2
3 1 5 d 7
We can do a gather after the spread:
df1 %>%
spread(n, s) %>%
gather(n, s, -c, -e)
# e c n s
#1 1 5 a 1
#2 1 5 b 2
#3 1 5 d 7
Or, instead of spread/gather, we filter out the 'c' row and then mutate to create the 'c' column, subsetting the 's' value that corresponds to 'c':
df1 %>%
filter(n != "c") %>%
mutate(c = df1$s[df1$n=="c"])
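Note that spread() and gather() have since been superseded in tidyr. On tidyr 1.0 or later (an assumption about your setup), the equivalent with pivot_wider()/pivot_longer() would be:
library(tidyr)
df1 %>%
  pivot_wider(names_from = n, values_from = s) %>%
  # keep e and c as id columns, lengthen the rest back into n/s pairs
  pivot_longer(-c(e, c), names_to = "n", values_to = "s")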