I have a tibble which resembles the following:
data<-tibble(ref=c("ABC", "ABC", "XYZ", "XYZ", "FGH", "FGH", "FGH"),
type=c("A", "B", "A", "A", "A", "A", "B"))
ref type
1 ABC A
2 ABC B
3 XYZ A
4 XYZ A
5 FGH A
6 FGH A
7 FGH B
I need to group by ref and if--within a group--type B is present, return that row, else default to return any row (but only 1 row) of type A.
Expected output:
ref type
1 ABC B
2 XYZ A
3 FGH B
With large amounts of data, it is better to sort before grouping.
tidyverse
library(tidyverse)
df<-tibble(ref=c("ABC", "ABC", "XYZ", "XYZ", "FGH", "FGH", "FGH"),
type=c("A", "B", "A", "A", "A", "A", "B"))
distinct(df) %>%
arrange(ref, desc(type)) %>%
group_by(ref) %>%
slice_head(n = 1) %>%
ungroup()
#> # A tibble: 3 × 2
#> ref type
#> <chr> <chr>
#> 1 ABC B
#> 2 FGH B
#> 3 XYZ A
Created on 2022-04-27 by the reprex package (v2.0.1)
data.table
df<-data.frame(ref=c("ABC", "ABC", "XYZ", "XYZ", "FGH", "FGH", "FGH"),
type=c("A", "B", "A", "A", "A", "A", "B"))
library(data.table)
setDT(df)[order(ref, -type), .SD[1], by = ref]
#> ref type
#> 1: ABC B
#> 2: FGH B
#> 3: XYZ A
Created on 2022-04-27 by the reprex package (v2.0.1)
If you only have A and B, then you can arrange and simply get the first row, i.e.
library(dplyr)
data %>%
group_by(ref) %>%
filter(type %in% c('A', 'B')) %>% #If other types exist
arrange(desc(type)) %>%
slice(1L)
# A tibble: 3 x 2
# Groups: ref [3]
ref type
<chr> <chr>
1 ABC B
2 FGH B
3 XYZ A
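Note that the sort-based answers rely on "B" sorting after "A" alphabetically. As a minimal sketch, assuming the preference between types should be spelled out instead, you could rank the types with match() against a priority vector (the name priority is made up for this illustration) and keep the best-ranked row per group:

```r
library(dplyr)

data <- tibble(ref  = c("ABC", "ABC", "XYZ", "XYZ", "FGH", "FGH", "FGH"),
               type = c("A", "B", "A", "A", "A", "A", "B"))

# Hypothetical priority vector: most preferred type first
priority <- c("B", "A")

result <- data %>%
  group_by(ref) %>%
  # match() gives 1 for "B" and 2 for "A"; the smallest rank wins,
  # and with_ties = FALSE guarantees a single row per group
  slice_min(match(type, priority), n = 1, with_ties = FALSE) %>%
  ungroup()

result
```

This should extend to more than two types by adding them to priority in order of preference.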
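We can use which.max over a logical vector to extract the desired rows. It returns the position of the first TRUE, or 1 when no row in the group is of type B, so every group yields exactly one row: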
data %>%
group_by(ref) %>%
slice(which.max(type == "B")) %>%
ungroup()
which gives
# A tibble: 3 x 2
ref type
<chr> <chr>
1 ABC B
2 FGH B
3 XYZ A
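To see why which.max works here, it may help to run it on two small logical vectors in base R. This is only an illustration with made-up inputs, not part of the answer above:

```r
# which.max() treats TRUE as 1 and FALSE as 0 and returns the index
# of the first maximum

# A group that contains a "B": the index of the first "B" is returned
first_b <- which.max(c("A", "B") == "B")

# A group without any "B": all values are FALSE, so the first index
# is returned, which acts as the "any row of type A" fallback
fallback <- which.max(c("A", "A") == "B")

first_b
fallback
```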
Related
I would like to reassign records to a single group if they are duplicated. In the dataset below, I would like all 12-4 records to be assigned to group A or B, but not both. Is there a way to go about it?
library(tidyverse)
dat <- tibble(
group = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
assigned = c("12-1", "12-2", "12-3", "12-4", "12-4", "12-5", "12-6",
"12-7", "12-8")
)
# Attempts to tease out records for each group
dat %>% pivot_wider(names_from = group, values_from = assigned)
You can group by record and reassign all to the same group, chosen at random from the available groups:
dat %>%
group_by(assigned) %>%
mutate(group = nth(group, sample(n())[1])) %>%
ungroup()
#> # A tibble: 9 x 2
#> group assigned
#> <chr> <chr>
#> 1 A 12-1
#> 2 A 12-2
#> 3 A 12-3
#> 4 A 12-4
#> 5 A 12-4
#> 6 B 12-5
#> 7 B 12-6
#> 8 B 12-7
#> 9 B 12-8
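Because the answer above draws the group at random, its result can change between runs. If a reproducible result is preferred, one possible variant (a sketch, assuming it is acceptable to always take the first group seen for each record) is:

```r
library(dplyr)

dat <- tibble(
  group = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
  assigned = c("12-1", "12-2", "12-3", "12-4", "12-4", "12-5", "12-6",
               "12-7", "12-8")
)

res <- dat %>%
  group_by(assigned) %>%
  # every row of a record gets the group of its first occurrence
  mutate(group = first(group)) %>%
  ungroup()

res
```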
library(tidyverse)
dat <- tibble(
group = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
assigned = c(
"12-1", "12-2", "12-3", "12-4", "12-4", "12-5", "12-6",
"12-7", "12-8"
)
)
dat %>%
select(-group) %>%
left_join(
dat %>%
left_join(dat %>% count(group)) %>%
# reassign to the smallest group
arrange(n) %>%
select(-n) %>%
distinct(assigned, .keep_all = TRUE)
)
#> Joining, by = "group"
#> Joining, by = "assigned"
#> # A tibble: 9 × 2
#> assigned group
#> <chr> <chr>
#> 1 12-1 A
#> 2 12-2 A
#> 3 12-3 A
#> 4 12-4 A
#> 5 12-4 A
#> 6 12-5 B
#> 7 12-6 B
#> 8 12-7 B
#> 9 12-8 B
Created on 2022-04-04 by the reprex package (v2.0.0)
I have this data:
df <- data.frame(
node = c("A", "B", "A", "A", "A", "B", "A", "A", "A", "B", "B", "B", "B"),
left = c("ab", "ab", "ab", "ab", "cc", "xx", "cc", "ab", "zz", "xx", "xx", "zz", "zz")
)
I want to count grouped frequencies and proportions and slice/filter out a sequence of grouped rows. Say, given the small dataset, I want to have the rows with the two highest Freq_left values per group. How can that be done? I can only extract the rows with the maximum Freq_left values but not the desired sequence of rows:
df %>%
group_by(node, left) %>%
# summarise
summarise(
Freq_left = n(),
Prop_left = round(Freq_left/nrow(.)*100, 4)
) %>%
slice_max(Freq_left)
# A tibble: 2 × 4
# Groups: node [2]
node left Freq_left Prop_left
<chr> <chr> <int> <dbl>
1 A ab 4 30.8
2 B xx 3 23.1
Expected output:
node left Freq_left Prop_left
<chr> <chr> <int> <dbl>
A ab 4 30.8
A cc 2 15.4
B xx 3 23.1
B zz 2 15.4
You could use dplyr::slice_max (thanks to @PaulSmith for pointing out that dplyr::top_n is superseded in favor of dplyr::slice_max):
library(dplyr)
df %>%
group_by(node, left) %>%
# summarise
summarise(
Freq_left = n(),
Prop_left = round(Freq_left/nrow(.)*100, 4)
) %>%
slice_max(order_by = Prop_left, n = 2)
#> `summarise()` has grouped output by 'node'. You can override using the `.groups` argument.
#> # A tibble: 4 × 4
#> # Groups: node [2]
#> node left Freq_left Prop_left
#> <chr> <chr> <int> <dbl>
#> 1 A ab 4 30.8
#> 2 A cc 2 15.4
#> 3 B xx 3 23.1
#> 4 B zz 2 15.4
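By default slice_max keeps ties, so a group can return more than n rows when frequencies are equal. The following is a sketch of the same idea using count() for the frequencies and with_ties = FALSE to cap each group at two rows; whether dropping ties is wanted is an assumption about the use case:

```r
library(dplyr)

df <- data.frame(
  node = c("A", "B", "A", "A", "A", "B", "A", "A", "A", "B", "B", "B", "B"),
  left = c("ab", "ab", "ab", "ab", "cc", "xx", "cc", "ab", "zz", "xx", "xx", "zz", "zz")
)

top2 <- df %>%
  # one row per node/left combination with its frequency
  count(node, left, name = "Freq_left") %>%
  # proportion relative to all rows (count() returns ungrouped data)
  mutate(Prop_left = round(Freq_left / sum(Freq_left) * 100, 4)) %>%
  group_by(node) %>%
  # at most two rows per node, even if frequencies are tied
  slice_max(Freq_left, n = 2, with_ties = FALSE) %>%
  ungroup()

top2
```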
I am working with grouped data in R.
In the following example data, I would like to fill the missing values in the "sex" variable within each id, and keep them as is if there is no corresponding data (i.e. for id=6).
In the "diagnosis" variable, some ids have only one diagnosis and some have multiple diagnoses. So I would also like to summarise "diagnosis" into a new variable "wanted" that identifies which combination of diagnoses each id has.
The example data is:
d.f <- tribble (
~id, ~sex, ~diagnosis,
1, "M", "A",
1, NA, "B",
1, NA, "C",
2, NA, "A",
2, "F", NA,
2, NA, "A",
3, NA, NA,
3, "M", "A",
3, "M", "B",
4, "F", "C",
5, "F", "B",
6, NA, "A",
7, "M", NA
)
The desired data is:
wanted <- tribble (
~id, ~sex, ~diagnosis,~wanted,
1, "M", "A", "ABC group",
1, "M", "B", "ABC group",
1, "M", "C", "ABC group",
2, "F", "A", "Only A",
2, "F", NA, "Only A",
2, "F", "A", "Only A",
3, "M", NA, "AB group",
3, "M", "A", "AB group",
3, "M", "B", "AB group",
4, "F", "C", "Only C",
5, "F", "B", "Only B",
6, NA, "A", "Only A",
7, "M", NA, "Missing"
)
Fill the sex column with first(na.omit(sex)). first is just an aggregating function, which is safe to use here because it returns NA when a group has no non-missing value (as for id = 6).
The other column, wanted, can be built in two steps.
First, paste all distinct diagnoses in the group together using paste(unique(na.omit(diagnosis)), collapse = '').
Then use case_when to turn the resulting string into the label of your choice.
library(tidyverse)
d.f %>%
group_by(id) %>%
mutate(sex = first(na.omit(sex)),
wanted = { x <- paste(unique(na.omit(diagnosis)), collapse = '');
case_when(nchar(x) == 1 ~ paste0('Only ', x),
nchar(x) == 0 ~ 'Missing',
TRUE ~ paste(x, 'Group'))})
#> # A tibble: 13 x 4
#> # Groups: id [7]
#> id sex diagnosis wanted
#> <dbl> <chr> <chr> <chr>
#> 1 1 M A ABC Group
#> 2 1 M B ABC Group
#> 3 1 M C ABC Group
#> 4 2 F A Only A
#> 5 2 F <NA> Only A
#> 6 2 F A Only A
#> 7 3 M <NA> AB Group
#> 8 3 M A AB Group
#> 9 3 M B AB Group
#> 10 4 F C Only C
#> 11 5 F B Only B
#> 12 6 <NA> A Only A
#> 13 7 M <NA> Missing
library(dplyr)
library(tidyr)
library(stringr)
d.f %>%
group_by(id) %>%
drop_na(diagnosis) %>%
summarise(wanted = str_c(unique(diagnosis), collapse = "")) %>%
full_join(d.f, . , by = "id") %>%
group_by(id) %>%
fill(sex, .direction = "updown")
#> # A tibble: 13 x 4
#> # Groups: id [7]
#> id sex diagnosis wanted
#> <dbl> <chr> <chr> <chr>
#> 1 1 M A ABC
#> 2 1 M B ABC
#> 3 1 M C ABC
#> 4 2 F A A
#> 5 2 F <NA> A
#> 6 2 F A A
#> 7 3 M <NA> AB
#> 8 3 M A AB
#> 9 3 M B AB
#> 10 4 F C C
#> 11 5 F B B
#> 12 6 <NA> A A
#> 13 7 M <NA> <NA>
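The output above contains the collapsed diagnosis codes rather than the labels asked for in the question. As a rough sketch, the codes could be converted afterwards with a small helper (label_diagnoses is a made-up name for this illustration, and the "group" wording follows the desired data in the question):

```r
library(dplyr)

# Hypothetical helper: turns collapsed codes such as "ABC" into labels
label_diagnoses <- function(x) {
  case_when(
    is.na(x)      ~ "Missing",          # no diagnosis recorded at all
    nchar(x) == 1 ~ paste("Only", x),   # a single distinct diagnosis
    TRUE          ~ paste(x, "group")   # several distinct diagnoses
  )
}

labels <- label_diagnoses(c("ABC", "A", NA, "AB"))
labels
```

It could then be applied with mutate(wanted = label_diagnoses(wanted)) at the end of the pipeline.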
This can also be used:
library(dplyr)
d.f %>%
group_by(id) %>%
mutate(sex = coalesce(sex, sex[!is.na(sex)][1]),
wanted = {x <- unique(diagnosis[!is.na(diagnosis)])
if_else(length(x) > 1, paste(paste(x, collapse = ""), "Group"),
if_else(length(x) == 1, paste("Only", x[1]), "Missing")
)})
# A tibble: 13 x 4
# Groups: id [7]
id sex diagnosis wanted
<dbl> <chr> <chr> <chr>
1 1 M A ABC Group
2 1 M B ABC Group
3 1 M C ABC Group
4 2 F A Only A
5 2 F NA Only A
6 2 F A Only A
7 3 M NA AB Group
8 3 M A AB Group
9 3 M B AB Group
10 4 F C Only C
11 5 F B Only B
12 6 NA A Only A
13 7 M NA Missing
I have a data set with several columns (>20), and 2 of these columns contain the subject names. I would like to add another column containing a number that identifies the combination of the two subjects, regardless of their order.
Here is an example with only the 2 columns of names (I don't include the others for convenience):
ID1 ID2
A B
A C
A B
B C
A B
B A
C B
And here is what I would like to create:
ID1 ID2 CODE
A B 1
A C 2
A B 1
B C 3
A B 1
B A 1
C B 3
I am fairly new to R and I think it can be done with stringr, but I am not sure how.
Thanks for the help!
Simo
df$CODE <- as.integer(
factor(
apply(df[c("ID1", "ID2")], 1, function(x) paste0(sort(x), collapse = ""))
)
)
# ID1 ID2 CODE
# 1 A B 1
# 2 A C 2
# 3 A B 1
# 4 B C 3
# 5 A B 1
# 6 B A 1
# 7 C B 3
Data
df <- data.frame(
ID1 = c("A", "A", "A", "B", "A", "B", "C"),
ID2 = c("B", "C", "B", "C", "B", "A", "B")
)
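A vectorised alternative that avoids the row-wise apply is sketched below, assuming both ID columns are character vectors. pmin and pmax put each pair into a consistent order, and match against the unique keys numbers the pairs by first appearance (the factor approach above numbers them alphabetically instead; for this data both give the same codes). The "_" separator is an arbitrary choice that avoids ambiguity with longer names:

```r
df <- data.frame(
  ID1 = c("A", "A", "A", "B", "A", "B", "C"),
  ID2 = c("B", "C", "B", "C", "B", "A", "B")
)

# Order-independent key: smaller name first, larger name second
key <- paste(pmin(df$ID1, df$ID2), pmax(df$ID1, df$ID2), sep = "_")

# Number the keys in order of first appearance
df$CODE <- match(key, unique(key))

df
```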
Try this:
library(dplyr)
#Code
new <- df %>% rowwise() %>%
mutate(Var = paste0(sort(c(ID1, ID2)), collapse = '')) %>%
group_by(Var) %>%
mutate(CODE=cur_group_id()) %>%
ungroup() %>%
select(-Var)
Output:
# A tibble: 7 x 3
ID1 ID2 CODE
<chr> <chr> <int>
1 A B 1
2 A C 2
3 A B 1
4 B C 3
5 A B 1
6 B A 1
7 C B 3
Some data used:
#Data
df <- structure(list(ID1 = c("A", "A", "A", "B", "A", "B", "C"), ID2 = c("B",
"C", "B", "C", "B", "A", "B")), class = "data.frame", row.names = c(NA,
-7L))
What would be a good tidyverse approach to this type of problem? I want to filter out the duplicated rows of group that have an NA in them (keeping the row that has values for both var1 and var2) but keep the rows when there is no duplicated value in group. dat illustrates the raw example with expected_output showing what I'd hope to have.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tibble)
dat <- tibble::tribble(
~group, ~var1, ~var2,
"A", "foo", NA,
"A", "foo", "bar",
"B", "foo", NA,
"C", NA, "bar",
"C", "foo", "bar",
"D", NA, "bar",
"E", "foo", "bar",
"E", NA, "bar"
)
expected_output <- tibble::tribble(
~group, ~var1, ~var2,
"A", "foo", "bar",
"B", "foo", NA,
"C", "foo", "bar",
"D", NA, "bar",
"E", "foo", "bar"
)
expected_output
#> # A tibble: 5 x 3
#> group var1 var2
#> <chr> <chr> <chr>
#> 1 A foo bar
#> 2 B foo <NA>
#> 3 C foo bar
#> 4 D <NA> bar
#> 5 E foo bar
Any suggestions or ideas?
Solution 1 - if the duplicate rows are located in different positions for each group (e.g. first, last or somewhere in between)
dat %>%
arrange(group,var1,var2) %>%
group_by(group) %>%
slice_head() %>%
ungroup()
Output:
# A tibble: 5 x 3
group var1 var2
<chr> <chr> <chr>
1 A foo bar
2 B foo NA
3 C foo bar
4 D NA bar
5 E foo bar
Solution 2 - if the duplicate row is always the last row of that group
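You can use duplicated with fromLast = TRUE to flag every row of a group except its last one, negate that logical vector, and use it to keep only the last row per group:
dat[!duplicated(dat$group, fromLast = TRUE), ]
For groups A to D, where the complete row comes last, this gives the requested output (for group E in dat the last row is the one with the NA, so Solution 1 is the safer choice there):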
# A tibble: 4 x 3
group var1 var2
<chr> <chr> <chr>
1 A foo bar
2 B foo NA
3 C foo bar
4 D NA bar
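To check which rows this approach keeps, the group column of dat from the question can be run through duplicated on its own. This is a small base R illustration only:

```r
# group column of dat from the question
group <- c("A", "A", "B", "C", "C", "D", "E", "E")

# TRUE for the last row of each group, FALSE for earlier duplicates
keep <- !duplicated(group, fromLast = TRUE)

kept_rows <- which(keep)
kept_rows
```

Row 8 is the second E row, which in dat has a missing var1, so the approach only fits data where the complete row is always last in its group.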
One option could be:
dat %>%
group_by(group) %>%
slice_max(rowSums(!is.na(across(c(var1, var2)))), n = 1)
group var1 var2
<chr> <chr> <chr>
1 A foo bar
2 B foo <NA>
3 C foo bar
4 D <NA> bar
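A variant of the same idea is sketched below on the full dat from the question, under the assumption that exactly one row per group is wanted even when two rows are equally complete. It counts the non-missing values per row in a helper column (n_present is a made-up name) and uses with_ties = FALSE:

```r
library(dplyr)

dat <- tribble(
  ~group, ~var1, ~var2,
  "A", "foo", NA,
  "A", "foo", "bar",
  "B", "foo", NA,
  "C", NA,    "bar",
  "C", "foo", "bar",
  "D", NA,    "bar",
  "E", "foo", "bar",
  "E", NA,    "bar"
)

res <- dat %>%
  # number of non-missing values in var1 and var2 for each row
  mutate(n_present = (!is.na(var1)) + (!is.na(var2))) %>%
  group_by(group) %>%
  # keep the most complete row; with_ties = FALSE keeps only one
  slice_max(n_present, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-n_present)

res
```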