Get number of occurrences of each unique value [duplicate] - r

This is something I spent some time searching for. There were several good answers on Stack Overflow detailing how you can get the number of unique values, but I couldn't find any that showed how to count the number of occurrences for each value using dplyr.

df %>% select(val) %>% group_by(val) %>% mutate(count = n()) %>% unique()
This first selects the column of interest, groups by it, adds a column holding the size of each group, and then keeps only the unique rows, leaving each distinct value alongside its number of occurrences.
Here is a reproducible example showcasing how it works:
library(dplyr)
id <- c(1,2,3,4,5,6,7,8,9,0)
val <- c(0,1,2,3,1,1,1,0,0,2)
df <- data.frame(id=id,val=val)
df
#>    id val
#> 1   1   0
#> 2   2   1
#> 3   3   2
#> 4   4   3
#> 5   5   1
#> 6   6   1
#> 7   7   1
#> 8   8   0
#> 9   9   0
#> 10  0   2
df %>% select(val) %>% group_by(val) %>% mutate(count = n()) %>% unique()
#> # A tibble: 4 x 2
#> # Groups:   val [4]
#>     val count
#>   <dbl> <int>
#> 1     0     3
#> 2     1     4
#> 3     2     2
#> 4     3     1
Created on 2020-06-17 by the reprex package (v0.3.0)
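For comparison, here is a minimal sketch of the same tally using dplyr's count(), which groups, counts, and ungroups in one step. The name = "count" argument is only there to match the column name used above; it assumes a dplyr version that supports it (0.8 or later).

```r
library(dplyr)

# Same example data as above
df <- data.frame(id  = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 0),
                 val = c(0, 1, 2, 3, 1, 1, 1, 0, 0, 2))

# count() groups by val, tallies the rows in each group, and returns
# an ungrouped result with one row per distinct value
counts <- df %>% count(val, name = "count")
counts
```

Unlike the mutate() plus unique() pipeline, the result is not grouped, so no ungroup() is needed before further processing.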


Difference by subgroup using R

I have the following dataset:
I want to calculate the difference between values within each group, according to the subgroups. The value for subgroup 1 must always be the first term, so the results should be 10 - 0 = 10, 0 - 20 = -20, and 30 - 31 = -1. I want to do this in R.
I know that it would be something like this, but I do not know how to put the sub_group into the code:
library(tidyverse)
df %>%
group_by(group) %>%
summarise(difference= diff(value))
Edited answer after OP's comment:
The OP clarified that the data are not sorted by sub_group within every group. Therefore, I added the arrange after group_by. The OP further clarified that the value of sub_group == 1 always should be the first term of the difference.
Below I demonstrate how to achieve this in an example with 3 sub_groups within every group. The code rests on the assumption that the lowest sub_group in every group is 1, so that after sorting, first(value) is the value of sub_group 1. After computing the differences, I drop each group's first row, whose difference is always 0.
library(tidyverse)
df <- tibble(group = rep(LETTERS[1:3], each = 3),
sub_group = rep(1:3, 3),
value = c(10,0,5,0,20,15,30,31,10))
df
#> # A tibble: 9 × 3
#>   group sub_group value
#>   <chr>     <int> <dbl>
#> 1 A             1    10
#> 2 A             2     0
#> 3 A             3     5
#> 4 B             1     0
#> 5 B             2    20
#> 6 B             3    15
#> 7 C             1    30
#> 8 C             2    31
#> 9 C             3    10
df |>
group_by(group) |>
arrange(group, sub_group) |>
mutate(value = first(value) - value) |>
slice(2:n())
#> # A tibble: 6 × 3
#> # Groups:   group [3]
#>   group sub_group value
#>   <chr>     <int> <dbl>
#> 1 A             2    10
#> 2 A             3     5
#> 3 B             2   -20
#> 4 B             3   -15
#> 5 C             2    -1
#> 6 C             3    20
Created on 2022-10-18 with reprex v2.0.2
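A possible variation, sketched here under the assumption that every group contains exactly one row with sub_group == 1: instead of relying on sort order and first(), the reference value can be looked up explicitly. The `shuffled` data frame and the `difference` column are illustrative names, not part of the original question.

```r
library(dplyr)

df <- tibble(group     = rep(LETTERS[1:3], each = 3),
             sub_group = rep(1:3, 3),
             value     = c(10, 0, 5, 0, 20, 15, 30, 31, 10))

# Shuffle the rows to show that the result does not depend on row order
shuffled <- df[c(2, 1, 3, 6, 5, 4, 9, 7, 8), ]

result <- shuffled %>%
  group_by(group) %>%
  # value[sub_group == 1] is the single reference value of the group
  # (assumption: exactly one such row per group)
  mutate(difference = value[sub_group == 1] - value) %>%
  # the reference row itself always has a difference of 0, so drop it
  filter(sub_group != 1) %>%
  ungroup() %>%
  arrange(group, sub_group)
result
```

If a group had no row, or several rows, with sub_group == 1, the lookup would no longer return a single value, so it would be worth checking the data first.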
P.S. (from the original answer)
In the example data, you show the wrong difference for group C. It should read -1. I am convinced that most people here would appreciate it if you could post your example data as code, or at least as text that can be copied, instead of as a picture.

Continuing a sequence into NAs using dplyr

I am trying to figure out a dplyr-specific way of continuing a sequence of numbers when there are NAs in that column.
For example I have this dataframe:
library(tibble)
dat <- tribble(
~x, ~group,
1, "A",
2, "A",
NA_real_, "A",
NA_real_, "A",
1, "B",
NA_real_, "B",
3, "B"
)
dat
#> # A tibble: 7 × 2
#>       x group
#>   <dbl> <chr>
#> 1     1 A
#> 2     2 A
#> 3    NA A
#> 4    NA A
#> 5     1 B
#> 6    NA B
#> 7     3 B
I would like this one:
#> # A tibble: 7 × 2
#>       x group
#>   <dbl> <chr>
#> 1     1 A
#> 2     2 A
#> 3     3 A
#> 4     4 A
#> 5     1 B
#> 6     2 B
#> 7     3 B
When I try this I get a warning which makes me think I am probably approaching this incorrectly:
library(dplyr)
dat %>%
group_by(group) %>%
mutate(n = n()) %>%
mutate(new_seq = seq_len(n))
#> Warning in seq_len(n): first element used of 'length.out' argument
#> Warning in seq_len(n): first element used of 'length.out' argument
#> # A tibble: 7 × 4
#> # Groups:   group [2]
#>       x group     n new_seq
#>   <dbl> <chr> <int>   <int>
#> 1     1 A         4       1
#> 2     2 A         4       2
#> 3    NA A         4       3
#> 4    NA A         4       4
#> 5     1 B         3       1
#> 6    NA B         3       2
#> 7     3 B         3       3
It's easier if you do it in one go. Your approach is not 'wrong'; it is just that seq_len needs a single integer and you are giving it a vector (n), so seq_len falls back to using the first value and warns about it.
dat %>%
group_by(group) %>%
mutate(x = seq_len(n()))
Note that row_number might be even easier here:
dat %>%
group_by(group) %>%
mutate(x = row_number())
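Note that both versions overwrite x with the row position inside each group, which only matches the desired output if every sequence starts at 1 and increases by 1. As a rough sanity check (the names `filled`, `new_x`, and `agree` are made up for this sketch), the generated sequence can be compared against the values that were already known:

```r
library(dplyr)

dat <- tibble(x     = c(1, 2, NA, NA, 1, NA, 3),
              group = c("A", "A", "A", "A", "B", "B", "B"))

# Build the sequence in a separate column so it can be compared with x
filled <- dat %>%
  group_by(group) %>%
  mutate(new_x = row_number()) %>%
  ungroup()

# TRUE if the generated sequence agrees with every non-missing original value
known <- !is.na(filled$x)
agree <- all(filled$new_x[known] == filled$x[known])
agree
```

If `agree` came back FALSE, simply numbering the rows would not be a faithful continuation of the existing values.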
We could use rowid (from data.table) directly if the intention is to create a sequence and the group size is just an intermediate column:
library(data.table)
library(dplyr)
dat %>%
mutate(new_seq = rowid(group))
The issue with using a column after it is created is that it is no longer a single value but a vector with one element per row, as shown in @Maël's post. If we need to do that, use first(), since seq_len() expects a single length. The intermediate column is not really needed here, though.
dat %>%
group_by(group) %>%
mutate(n = n()) %>%
mutate(new_seq = seq_len(first(n)))
A base R option using ave (which works in a similar way to group_by in dplyr):
> transform(dat, x = ave(x, group, FUN = seq_along))
  x group
1 1     A
2 2     A
3 3     A
4 4     A
5 1     B
6 2     B
7 3     B

Group by a factor and then summarise a different variable [duplicate]

I have data in this format, where samples are in groups (in this example A or B), have a numerical quantity and a quality score (which is a factor).
I would like to summarise the qual_score by each group_name.
Example Data:
group_name <- rep(c("A","B"),5)
qual_score <- c(rep("POOR",4),rep("FAIR",1),rep("GOOD",5))
quantity <- 5:14
df <- data.frame(group_name, qual_score, quantity)
> df
   group_name qual_score quantity
1           A       POOR        5
2           B       POOR        6
3           A       POOR        7
4           B       POOR        8
5           A       FAIR        9
6           B       GOOD       10
7           A       GOOD       11
8           B       GOOD       12
9           A       GOOD       13
10          B       GOOD       14
Desired Output:
desired_output <- data.frame(c("2","2"),c("1","0"),c("2","3"))
colnames(desired_output) <- c("POOR", "FAIR", "GOOD")
rownames(desired_output) <- c("A", "B")
desired_output
  POOR FAIR GOOD
A    2    1    2
B    2    0    3
I can do summary() of qual_score for the entire dataframe:
> summary(df$qual_score)
FAIR GOOD POOR
   1    5    4
And can group_by() to summarise mean(quantity) according to each group:
> df %>%
+ group_by(group_name) %>%
+ summarise(mean(quantity))
# A tibble: 2 x 2
  group_name `mean(quantity)`
  <fct>                 <dbl>
1 A                         9
2 B                        10
But when I try to use group_by() with summary() I get a warning and the following output:
> df %>%
+ group_by(group_name) %>%
+ summary(qual_score)
 group_name qual_score    quantity
 A:5        FAIR:1     Min.   : 5.00
 B:5        GOOD:5     1st Qu.: 7.25
            POOR:4     Median : 9.50
                       Mean   : 9.50
                       3rd Qu.:11.75
                       Max.   :14.00
Warning messages:
1: In if (length(ll) > maxsum) { :
the condition has length > 1 and only the first element will be used
2: In if (length(ll) > maxsum) { :
the condition has length > 1 and only the first element will be used
library(dplyr)
df %>%
group_by(group_name) %>%
select(-quantity) %>%
table()
#>           qual_score
#> group_name FAIR GOOD POOR
#>          A    1    2    2
#>          B    0    3    2
If you want a solution completely in tidyverse:
library(dplyr)
library(tidyr)
df %>%
group_by(group_name, qual_score) %>%
tally() %>%
spread(qual_score, n, fill=0)
#> # A tibble: 2 x 4
#> # Groups:   group_name [2]
#>   group_name  FAIR  GOOD  POOR
#>   <fct>      <dbl> <dbl> <dbl>
#> 1 A              1     2     2
#> 2 B              0     3     2
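In more recent versions of tidyr, spread() is superseded by pivot_wider(). A minimal sketch of the same idea, assuming tidyr 1.1.0 or later (for the scalar values_fill) and using count() in place of group_by() plus tally(); the name `wide` is just for illustration:

```r
library(dplyr)
library(tidyr)

# Example data from the question
group_name <- rep(c("A", "B"), 5)
qual_score <- c(rep("POOR", 4), rep("FAIR", 1), rep("GOOD", 5))
quantity   <- 5:14
df <- data.frame(group_name, qual_score, quantity)

wide <- df %>%
  # one row per (group_name, qual_score) combination with its frequency n
  count(group_name, qual_score) %>%
  # one column per quality score; combinations that never occur get 0
  pivot_wider(names_from = qual_score, values_from = n, values_fill = 0)
wide
```

Because count() returns an ungrouped result, the wide table is ungrouped as well.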

standard eval with `dplyr::count()` [duplicate]

How can I pass a character vector to dplyr::count()?
library(magrittr)
variables <- c("cyl", "vs")
mtcars %>%
dplyr::count_(variables)
This works well, but dplyr v0.8 throws the warning:
count_() is deprecated.
Please use count() instead
The 'programming' vignette or the tidyeval book can help you
to program with count() : https://tidyeval.tidyverse.org
I'm not seeing standard evaluation examples of quoted names or of dplyr::count() in https://tidyeval.tidyverse.org/dplyr.html or other chapters of the current versions of the tidyeval book and Programming with dplyr.
My two best guesses after reading this documentation and another SO question are
mtcars %>%
dplyr::count(!!variables)
mtcars %>%
dplyr::count(!!rlang::sym(variables))
which throw these two errors:
Error: Column <chr> must be length 32 (the number of rows) or one,
not 2
Error: Only strings can be converted to symbols
To create a list of symbols from strings, you want rlang::syms (not rlang::sym). For unquoting a list or a vector, you want to use !!! (not !!). The following will work:
library(magrittr)
variables <- c("cyl", "vs")
vars_sym <- rlang::syms(variables)
vars_sym
#> [[1]]
#> cyl
#>
#> [[2]]
#> vs
mtcars %>%
dplyr::count(!!! vars_sym)
#> # A tibble: 5 x 3
#>     cyl    vs     n
#>   <dbl> <dbl> <int>
#> 1     4     0     1
#> 2     4     1    10
#> 3     6     0     3
#> 4     6     1     4
#> 5     8     0    14
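With dplyr 1.0.0 or later (an assumption about the installed version, since the question refers to v0.8), a character vector of column names can also be passed without rlang by combining across() with the tidyselect helper all_of(). A minimal sketch:

```r
library(dplyr)

variables <- c("cyl", "vs")

# all_of() selects exactly the columns named in the character vector and
# errors if one of them is missing; across() hands them to count()
counts <- mtcars %>%
  count(across(all_of(variables)))
counts
```

This should give the same five rows as the !!! version above.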
Maybe you can try
mtcars %>%
group_by(cyl, vs) %>%
tally()
This gives
# A tibble: 5 x 3
# Groups:   cyl [3]
    cyl    vs     n
  <dbl> <dbl> <int>
1     4     0     1
2     4     1    10
3     6     0     3
4     6     1     4
5     8     0    14

select only rows with duplicate id and specific value from another column in R

I have the following data with ID and value:
id <- c("1103-5","1103-5","1104-2","1104-2","1104-4","1104-4","1106-2","1106-2","1106-3","1106-3","2294-1","2294-1","2294-2","2294-2","2294-2","2294-3","2294-3","2294-3","2294-4","2294-4","2294-5","2294-5","2294-5","2300-1","2300-1","2300-2","2300-2","2300-4","2300-4","2321-1","2321-1","2321-2","2321-2","2321-3","2321-3","2321-4","2321-4","2347-1","2347-1","2347-2","2347-2")
value <- c(6,3,6,3,6,3,6,3,6,3,3,6,9,3,6,9,3,6,3,6,9,3,6,9,6,9,6,9,6,9,3,9,3,9,3,9,3,9,6,9,6)
Notice that there are multiple values for the same id. What I'd like to do is keep only the rows with value 3 or 6, and only for IDs that have both of those values. For example, ID "1103-5" has both 3 and 6, so it should be in the list, but "2347-2" should not.
I'm using R
One method I tried is the following, but it gives me everything with value 3 and 6.
d <- data.frame(id, value)
group36 <- d[d$value == 3 | d$value == 6,]
and
d %>% group_by(id) %>% filter(3 == value | 6 == value)
The output should be like this:
id value
1103-5 6
1103-5 3
1104-2 6
1104-2 3
1104-4 6
1104-4 3
1106-2 6
1106-2 3
1106-3 6
1106-3 3
2294-1 3
2294-1 6
2294-2 3
2294-2 6
2294-3 3
2294-3 6
2294-4 3
2294-4 6
2294-5 3
2294-5 6
d<-group_by(d,id)
filter(d,any(value==3),any(value==6))
This gives you all the IDs where there is both a value of 3 (somewhere) AND a value of 6 (somewhere). Mind you, your data contains some IDs with THREE values. In these cases, if both 3 and 6 are present, all rows of that ID are included in the result, including the third value.
To also drop the remaining rows whose value is neither 3 nor 6, add this condition:
filter(d,value==3 | value==6)
Putting both together keeps only the 3 and 6 rows of IDs that have both values. Rows with OTHER values are dropped, but the IDs they belong to are kept:
filter(d,any(value==3),any(value==6),value==3 | value==6)
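To make the difference between these variants concrete, here is a small sketch on made-up data (the IDs "a", "b", "c" and the names `both`, `both_rows`, and `only` are purely illustrative). The third variant, using all(), is an addition not taken from the answer above: it drops an ID entirely if it has any value other than 3 or 6.

```r
library(dplyr)

# "a" has 3 and 6; "b" has 3, 6 and 9; "c" has 6 and 9 but no 3
d <- data.frame(id    = c("a", "a", "b", "b", "b", "c", "c"),
                value = c(6, 3, 9, 3, 6, 9, 6))

# Keep every row of IDs that contain both a 3 and a 6 (the 9 of "b" stays)
both <- d %>%
  group_by(id) %>%
  filter(any(value == 3), any(value == 6)) %>%
  ungroup()

# Same IDs, but only their rows with value 3 or 6
both_rows <- d %>%
  group_by(id) %>%
  filter(any(value == 3), any(value == 6), value %in% c(3, 6)) %>%
  ungroup()

# Only IDs whose values are exclusively 3 and 6 ("b" is dropped entirely)
only <- d %>%
  group_by(id) %>%
  filter(any(value == 3), any(value == 6), all(value %in% c(3, 6))) %>%
  ungroup()
```

The desired output in the question corresponds to the second variant, since IDs such as "2294-2" appear there with only their 3 and 6 rows.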
Not sure if this is what you want. We can filter rows that equal to either 3 or 6 then convert from long to wide format and keep only columns which have both 3 and 6 values. After that, convert back to long format.
library(dplyr)
library(tidyr)
id <- c("1103-5","1103-5","1104-2","1104-2","1104-4","1104-4","1106-2","1106-2",
"1106-3","1106-3","2294-1","2294-1","2294-2","2294-2","2294-2",
"2294-3","2294-3","2294-3","2294-4","2294-4","2294-5","2294-5","2294-5",
"2300-1","2300-1","2300-2","2300-2","2300-4","2300-4","2321-1","2321-1",
"2321-2","2321-2","2321-3","2321-3","2321-4","2321-4","2347-1","2347-1","2347-2","2347-2")
value <- c(6,3,6,3,6,3,6,3,6,3,3,6,9,3,6,9,3,6,3,6,9,3,6,9,6,9,6,9,6,9,3,9,3,9,3,9,3,9,6,9,6)
d <- data.frame(id, value)
d %>%
group_by(id) %>%
filter(value %in% c(3, 6)) %>%
mutate(rows = 1:n()) %>%
spread(key = id, value) %>%
select_if(~ all(!is.na(.)))
#> # A tibble: 2 x 11
#>    rows `1103-5` `1104-2` `1104-4` `1106-2` `1106-3` `2294-1` `2294-2`
#>   <int>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#> 1     1        6        6        6        6        6        3        3
#> 2     2        3        3        3        3        3        6        6
#> # ... with 3 more variables: `2294-3` <dbl>, `2294-4` <dbl>,
#> #   `2294-5` <dbl>
d %>%
group_by(id) %>%
filter(value %in% c(3, 6)) %>%
mutate(rows = 1:n()) %>%
spread(key = id, value) %>%
select_if(~ all(!is.na(.))) %>%
select(-rows) %>%
gather(id, value)
#> # A tibble: 20 x 2
#>    id     value
#>    <chr>  <dbl>
#>  1 1103-5     6
#>  2 1103-5     3
#>  3 1104-2     6
#>  4 1104-2     3
#>  5 1104-4     6
#>  6 1104-4     3
#>  7 1106-2     6
#>  8 1106-2     3
#>  9 1106-3     6
#> 10 1106-3     3
#> 11 2294-1     3
#> 12 2294-1     6
#> 13 2294-2     3
#> 14 2294-2     6
#> 15 2294-3     3
#> 16 2294-3     6
#> 17 2294-4     3
#> 18 2294-4     6
#> 19 2294-5     3
#> 20 2294-5     6
Created on 2018-07-01 by the reprex package (v0.2.0.9000).
