I have the following data (example):
id <- c(1, 1, 2, 2, 2)
x <- c(2, 2, 3, 3, 4)
dat <- data.frame(id, x)
Now I can count the occurrences of x by group (id) and save the result in dat2:
dat2 <- dat %>% group_by(id, x) %>% dplyr::mutate(count = n())
Now count the cases per id:
dat2 <- dat2 %>% group_by(id) %>% dplyr::mutate(j = n())
This all works fine. Result:
dat2
# A tibble: 5 x 4
# Groups: id [2]
id x count j
<dbl> <dbl> <int> <int>
1 1 2 2 2
2 1 2 2 2
3 2 3 2 3
4 2 3 2 3
5 2 4 1 3
Now to my problem. I want to use paste within group_by. To be more exact, I want to use two character "placeholders", i (for id) and z (for x), to control the grouping. I don't want to use the "real" objects id and x:
i <- "id"
z <- "x"
dat2 <- dat %>% group_by(dat[[paste(i, sep = "")]], dat[[paste(z, sep = "")]]) %>% dplyr::mutate(count = n())
This first step also works, same as above. However, in the next and final step, an error occurs:
dat2 <- dat2 %>% group_by(dat[[paste(i, sep = "")]]) %>% dplyr::mutate(j = n ())
Error: Problem with `mutate()` input `..1`.
x Input `..1` can't be recycled to size 2.
i Input `..1` is `dat[[paste(i, sep = "")]]`.
i Input `..1` must be size 2 or 1, not 5.
i The error occured in group 1: dat[[paste(i, sep = "")]] = 1, dat[[paste(z, sep = "")]] = 2.
Run `rlang::last_error()` to see where the error occurred.
My question: how can I avoid this error and get the same result as before, with or without paste? Working with the paste command may look strange, but I need to work with character placeholders.
I am glad about any help!
We could use across instead of paste
library(dplyr)
dat %>%
  group_by(across(all_of(c(i, z)))) %>%
  mutate(count = n()) %>%
  group_by(across(all_of(i))) %>%
  mutate(j = n())
# A tibble: 5 x 4
# Groups: id [2]
id x count j
<dbl> <dbl> <int> <int>
1 1 2 2 2
2 1 2 2 2
3 2 3 2 3
4 2 3 2 3
5 2 4 1 3
Or instead of grouping, use add_count
dat %>%
  add_count(across(all_of(c(i, z))), name = 'count') %>%
  add_count(across(all_of(i)), name = 'j')
id x count j
1 1 2 2 2
2 1 2 2 2
3 2 3 2 3
4 2 3 2 3
5 2 4 1 3
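Another option for column names stored in character variables is the .data pronoun; a minimal sketch, assuming dplyr >= 1.0.0, that gives the same result as above:
library(dplyr)

i <- "id"
z <- "x"

dat %>%
  group_by(.data[[i]], .data[[z]]) %>%   # look the columns up by name inside the data
  mutate(count = n()) %>%
  group_by(.data[[i]]) %>%               # regroup by id only
  mutate(j = n())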
I have a dataset that looks like this
ID | Filter
 1 | Y
 1 | N
 1 | Y
 1 | Y
 2 | N
 2 | N
 2 | N
 2 | Y
 2 | Y
 3 | N
 3 | Y
 3 | Y
I would like the final result to look like this: a summary with the total count and also the count where Filter is "Y".
ID | All Count | Filter Yes
 1 |         4 |          3
 2 |         5 |          2
 3 |         3 |          2
If I do it like this I only get the full count, but I also want the Filter "Y" count as the next column:
df <- df %>%
  group_by(ID) %>%
  summarise(`All Count` = n())
df %>%
  group_by(ID) %>%
  summarise(`All Count` = n(),
            `Count Yes` = sum(Filter == "Y"))
# A tibble: 3 × 3
ID `All Count` `Count Yes`
<chr> <int> <int>
1 1 4 3
2 2 5 2
3 3 3 2
We can use
library(dplyr)
df %>%
  group_by(ID) %>%
  summarise(`All Count` = n(), `Filter Yes` = sum(Filter == 'Y', na.rm = TRUE))
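Both answers rely on the fact that sum() applied to a logical vector counts the TRUE values (TRUE is treated as 1 and FALSE as 0), which is what turns Filter == "Y" into a per-group count; a quick illustration:
# three of the four elements are "Y", so the comparison yields three TRUEs
sum(c("Y", "N", "Y", "Y") == "Y")
#> [1] 3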
I have a dataframe with a column of ids, but for some rows there are multiple ids concatenated together. I want to merge this onto another dataframe using the id, and when the ids are concatenated, the merge should handle that and reflect it by concatenating the values of the newly added columns as well.
For example I have dataframes
data <- data.frame(
  id = c(1, 4, 3, "2,3", "1,4"),
  value = c(1:5)
)
> data
id value
1 1 1
2 4 2
3 3 3
4 2,3 4
5 1,4 5
mapping <- data.frame(
  id = 1:4,
  name = c("one", "two", "three", "four")
)
> mapping
id name
1 1 one
2 2 two
3 3 three
4 4 four
I would like to end up with
id value name
1 1 1 one
2 4 2 four
3 3 3 three
4 2,3 4 two,three
5 1,4 5 one,four
I don't think there's a good way to do this other than to separate, join, and re-concatenate:
library(dplyr)
library(tidyr)
data %>%
  mutate(true_id = row_number()) %>%
  separate_rows(id, convert = TRUE) %>%
  left_join(mapping, by = "id") %>%
  group_by(true_id, value) %>%
  summarize(id = toString(id), name = toString(name), .groups = "drop")
# # A tibble: 5 × 4
# true_id value id name
# <int> <int> <chr> <chr>
# 1 1 1 1 one
# 2 2 2 4 four
# 3 3 3 3 three
# 4 4 4 2, 3 two, three
# 5 5 5 1, 4 one, four
I wasn't sure if your value column would actually be unique, so I added a true_id just in case.
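As a side note, convert = TRUE in separate_rows() is what turns the split id strings back into integers, so the join key matches the integer mapping$id; a tiny illustration:
library(tidyr)

# the single "2,3" string becomes two integer rows, ready to join on
separate_rows(data.frame(id = "2,3"), id, convert = TRUE)$id
#> [1] 2 3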
What about something like this? I can think of a few ways. One is longer but much easier to follow, and the other is short but kind of a mess.
library(tidyverse)
#long and readable
data |>
  mutate(tmp = row_number()) |>
  mutate(id = str_split(id, ",")) |>
  unnest_longer(id) |>
  left_join(mapping |>
              mutate(id = as.character(id)), by = "id") |>
  group_by(tmp) |>
  summarise(id = paste(id, collapse = ","),
            value = value[1],
            name = paste(name, collapse = ","))
#> # A tibble: 5 x 4
#> tmp id value name
#> <int> <chr> <int> <chr>
#> 1 1 1 1 one
#> 2 2 4 2 four
#> 3 3 3 3 three
#> 4 4 2,3 4 two,three
#> 5 5 1,4 5 one,four
#short and ugly
data |>
  mutate(name = map_chr(id, \(x) paste(
    mapping$name[which(as.character(mapping$id) %in% str_split(x, ",")[[1]])],
    collapse = ",")))
#> id value name
#> 1 1 1 one
#> 2 4 2 four
#> 3 3 3 three
#> 4 2,3 4 two,three
#> 5 1,4 5 one,four
grep-ing the data$id digits out of the mapping$id values (note this relies on the ids being single digits, since the pattern is a character class of the digits in each data$id):
mapply(\(x, y) toString(mapping$name[grep(sprintf('[%s]', gsub('\\D', '', x)), y)]),
data$id, list(mapping$id))
# 1 4 3 2,3 1,4
# "one" "four" "three" "two, three" "one, four"
In order not to have a space after the comma, use paste(., collapse=',') instead of toString.
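A quick comparison of the two separators:
toString(c("two", "three"))               # inserts ", " between elements
#> [1] "two, three"
paste(c("two", "three"), collapse = ",")  # no space after the comma
#> [1] "two,three"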
I have a data.frame with a group variable and an integer variable, with missing data.
df <- data.frame(group = c(1, 1, 2, 2, 3, 3), a = as.integer(c(1, 2, NA, NA, 1, NA)))
I want to compute the maximum available value of variable a within each group: in my example, I should get 2 for group 1, NA for group 2 and 1 for group 3.
df %>%
  group_by(group) %>%
  mutate(max.a = case_when(sum(!is.na(a)) == 0 ~ NA_integer_,
                           TRUE ~ max(a, na.rm = TRUE)))
The above code generates an error, seemingly because in group 2 all values of a are missing so max(a,na.rm=T) is set to -Inf, which is not an integer.
Why is this branch computed for group 2 even though the condition is false, as the following verification confirms?
df %>% group_by(group) %>% mutate(test=sum(!is.na(a))==0)
I found a workaround by converting a to double, but I still get a warning and am dissatisfied not to have found a better solution.
case_when evaluates all the RHS expressions irrespective of whether the condition is satisfied, hence the error. You may use hablar::max_, which returns NA if all the values are NA.
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(max.a = hablar::max_(a)) %>%
  ungroup
# group a max.a
# <dbl> <int> <int>
#1 1 1 2
#2 1 2 2
#3 2 NA NA
#4 2 NA NA
#5 3 1 1
#6 3 NA 1
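As a minimal sketch of the evaluation behaviour described above: the warning from max() appears even though its branch is never selected, showing that case_when() evaluates every right-hand side in full before picking values.
library(dplyr)

case_when(TRUE ~ 1, FALSE ~ max(numeric(0), na.rm = TRUE))
#> Warning: no non-missing arguments to max; returning -Inf
#> [1] 1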
Instead of making use of case_when, I would suggest using an if() statement like so:
library(dplyr)
df <- data.frame(group = c(1, 1, 2, 2, 3, 3), a = as.integer(c(1, 2, NA, NA, 1, NA)))
df %>%
  group_by(group) %>%
  mutate(max.a = if (all(is.na(a))) NA_real_ else max(a, na.rm = T))
#> # A tibble: 6 x 3
#> # Groups: group [3]
#> group a max.a
#> <dbl> <int> <dbl>
#> 1 1 1 2
#> 2 1 2 2
#> 3 2 NA NA
#> 4 2 NA NA
#> 5 3 1 1
#> 6 3 NA 1
This code gives a warning, but it works (note that the all-missing group comes out as -Inf rather than NA):
library(dplyr)
df %>%
  group_by(group) %>%
  dplyr::summarise(max.a = max(a, na.rm = TRUE))
Output:
group max.a
<dbl> <dbl>
1 1 2
2 2 -Inf
3 3 1
library(tidyverse)
df <- tibble(a = as.factor(1:20), b = c(50, 20, 13, rep(2, 10), rep(1, 7)))
How do I make dplyr look at this data frame df and collapse all these occurrences of 2 into a single summed group, and collapse all the occurrences of 1 into a single summed group? And also keep the rest of the data frame.
Turn this:
# A tibble: 20 x 2
a b
<fct> <dbl>
1 1 50
2 2 20
3 3 13
4 4 2
5 5 2
6 6 2
7 7 2
8 8 2
9 9 2
10 10 2
11 11 2
12 12 2
13 13 2
14 14 1
15 15 1
16 16 1
17 17 1
18 18 1
19 19 1
20 20 1
into this:
# A tibble: 5 x 2
a b
<fct> <dbl>
1 1 50
2 2 20
3 3 13
4 grp2 20
5 grp1 7
[Edit] - I fixed the example data. Sorry about that.
We group by a manufactured sortkey to maintain the sort order. We use the fact that b is in descending order in the input, but if that is not the case in your actual data, replace sortkey = -b with the more general sortkey = data.table::rleid(b) or the longer sortkey = cumsum(coalesce(b != lag(b), FALSE)).
We also convert b to the group names, giving a new a. It wasn't clear which groups are to be converted to grp... form. Hard-coded 1 and 2? Any group with more than one row? Groups at the end with more than one row? At any rate, it would be easy enough to change the condition in the if_else once that is clarified.
Finally perform the summation and then remove the sortkey.
df %>%
  group_by(sortkey = -b, a = paste0(if_else(b %in% 1:2, "grp", ""), b)) %>%
  summarize(b = sum(b)) %>%
  ungroup %>%
  select(-sortkey)
giving:
# A tibble: 5 x 2
a b
<chr> <int>
1 50 50
2 20 20
3 13 13
4 grp2 20
5 grp1 7
Here's a way. I have converted a from factor to character to make things easier. You can convert it back to factor if you want. Also your test data was a bit wrong.
df <- tibble(a = as.character(1:20), b = c(50, 20, 13, rep(2, 10), rep(1, 7)))
df %>%
  mutate(
    a = case_when(
      b == 1 ~ "grp1",
      b == 2 ~ "grp2",
      TRUE ~ a
    )
  ) %>%
  group_by(a) %>%
  summarise(b = sum(b))
# A tibble: 5 x 2
a b
<chr> <dbl>
1 1 50
2 2 20
3 3 13
4 grp1 7
5 grp2 20
This is an approach which gives you the desired group names and where you don't need to decide in advance how many such cases you need (e.g. it would create grp3, grp4, ... depending on the values in b).
library(dplyr)
df %>%
  mutate(
    grp = as.numeric(lag(df$b) != df$b),
    grp = cumsum(ifelse(is.na(grp), 0, grp))
  ) %>% group_by(grp) %>%
  mutate(
    a = ifelse(n() > 1, paste0("grp", b), a),
    b = sum(b)
  ) %>% ungroup() %>% distinct(a, b)
Output:
a b
<chr> <dbl>
1 1 50
2 2 20
3 3 13
4 grp2 20
5 grp1 7
Note that the code could also be condensed, but that leads to a certain lack of readability in my opinion:
df %>%
  group_by(grp = cumsum(ifelse(is.na(as.numeric(lag(df$b) != df$b)), 0, as.numeric(lag(df$b) != df$b)))) %>%
  mutate(
    a = ifelse(n() > 1, paste0("grp", b), a),
    b = sum(b)
  ) %>% ungroup() %>% distinct(a, b)
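A sketch of a newer variant, assuming dplyr >= 1.1.0: consecutive_id() replaces the manual lag()/cumsum() run-id construction, and a comes out as character here as well.
library(dplyr)

df %>%
  group_by(grp = consecutive_id(b)) %>%   # one group per run of equal b values
  summarise(a = if (n() > 1) paste0("grp", first(b)) else as.character(first(a)),
            b = sum(b)) %>%
  select(-grp)
#> # A tibble: 5 x 2
#>   a         b
#>   <chr> <dbl>
#> 1 1        50
#> 2 2        20
#> 3 3        13
#> 4 grp2     20
#> 5 grp1      7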
I have a situation where I am trying to find the number of intersections with a vector per group in another tibble.
Data example
a <- tibble(EXPERIMENT = rep(c("a", "b", "c"), each = 4),
            ECOTYPE = rep(1:12))
b <- tibble(ECOTYPE = c(1, 1, 5, 4, 8, 7, 6, 1, 4, 4, 2, 5, 6, 7, 1))
I want to find the number of intersections between ECOTYPE in b and ECOTYPE per EXPERIMENT in a.
I wonder if I can use dplyr to solve this, as the group_by function seems to fit this problem, but when I run:
a %>%
  group_by(EXPERIMENT) %>%
  summarise(INTERSECTIONS = length(intersect(b$ECOTYPE, .$ECOTYPE)))
I only get the total number of intersections between a and b.
Am I missing something?
Edit:
Sorry for not posting my desired output. I would like something like this:
# A tibble: 3 x 2
EXPERIMENT INTERSECTIONS
<chr> <dbl>
1 a 8
2 b 7
3 c 0
Depending on how you want to count, this will give the number of rows in b matching a:
b %>% mutate(b_flag = 1) %>%
  right_join(a) %>%
  group_by(EXPERIMENT) %>%
  summarize(INTERSECTIONS = sum(b_flag, na.rm = T))
# # A tibble: 3 x 2
# EXPERIMENT INTERSECTIONS
# <fctr> <dbl>
# 1 a 8
# 2 b 7
# 3 c 0
I think the only problem with your code is the unnecessary .$, but it gives the counts of distinct ecotypes in b, ignoring the fact that b has several ECOTYPE = 1 rows, for example.
a %>%
  group_by(EXPERIMENT) %>%
  summarise(INTERSECTIONS = length(intersect(b$ECOTYPE, ECOTYPE)))
# # A tibble: 3 x 2
# EXPERIMENT INTERSECTIONS
# <fctr> <int>
# 1 a 3
# 2 b 4
# 3 c 0
This is a result of how intersect works:
intersect(c(1, 2, 3), c(1, 1, 1))
# [1] 1
Join the two and count how many are left:
inner_join(a,b, by='ECOTYPE') %>% group_by(EXPERIMENT) %>% count()
# A tibble: 2 x 2
# Groups: EXPERIMENT [2]
EXPERIMENT n
<chr> <int>
1 a 8
2 b 7
Now, if you add an indicator column to b, you can start to count absences as well:
b %>% mutate(present=TRUE) %>% right_join(a, by='ECOTYPE') %>% group_by(EXPERIMENT) %>% summarise(n(), missing=sum(is.na(present)))
# A tibble: 3 x 3
EXPERIMENT `n()` missing
<chr> <int> <int>
1 a 9 1
2 b 7 0
3 c 4 4
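Counting only the matched rows in that same join recovers the INTERSECTIONS column from the desired output, including the 0 for experiment c; a small variation on the above:
b %>%
  mutate(present = TRUE) %>%
  right_join(a, by = 'ECOTYPE') %>%
  group_by(EXPERIMENT) %>%
  summarise(INTERSECTIONS = sum(!is.na(present)))
# # A tibble: 3 x 2
#   EXPERIMENT INTERSECTIONS
#   <chr>              <int>
# 1 a                      8
# 2 b                      7
# 3 c                      0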