What would be a good tidyverse approach to this type of problem? I want to filter out the duplicated rows of group that have an NA in them (keeping the row that has values for both var1 and var2) but keep the rows when there is no duplicated value in group. dat illustrates the raw example with expected_output showing what I'd hope to have.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tibble)
dat <- tibble::tribble(
~group, ~var1, ~var2,
"A", "foo", NA,
"A", "foo", "bar",
"B", "foo", NA,
"C", NA, "bar",
"C", "foo", "bar",
"D", NA, "bar",
"E", "foo", "bar",
"E", NA, "bar"
)
expected_output <- tibble::tribble(
~group, ~var1, ~var2,
"A", "foo", "bar",
"B", "foo", NA,
"C", "foo", "bar",
"D", NA, "bar",
"E", "foo", "bar"
)
expected_output
#> # A tibble: 5 x 3
#> group var1 var2
#> <chr> <chr> <chr>
#> 1 A foo bar
#> 2 B foo <NA>
#> 3 C foo bar
#> 4 D <NA> bar
#> 5 E foo bar
Any suggestions or ideas?
Solution 1 - works when the duplicate rows can sit anywhere within each group (first, last or somewhere in between). Because arrange() sorts NAs last, the complete row ends up first in each group, so slice_head() keeps it:
dat %>%
arrange(group, var1, var2) %>%
group_by(group) %>%
slice_head() %>%
ungroup()
Output:
# A tibble: 5 x 3
group var1 var2
<chr> <chr> <chr>
1 A foo bar
2 B foo NA
3 C foo bar
4 D NA bar
5 E foo bar
Solution 2 - if the complete row is always the last row within its group
You can use duplicated() with fromLast = TRUE to flag the earlier occurrences of each group, negate the result, and use that to keep only the last row per group:
dat[!duplicated(dat$group, fromLast = TRUE), ]
which gives:
# A tibble: 5 x 3
  group var1  var2
  <chr> <chr> <chr>
1 A     foo   bar
2 B     foo   NA
3 C     foo   bar
4 D     NA    bar
5 E     NA    bar
Note that group E keeps its NA row because the complete row comes first there, so this approach only works when the row you want to keep is the last one of its group.
One option could be:
dat %>%
group_by(group) %>%
slice_max(rowSums(!is.na(across(c(var1, var2)))), n = 1)
group var1 var2
<chr> <chr> <chr>
1 A foo bar
2 B foo <NA>
3 C foo bar
4 D <NA> bar
5 E foo bar
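Another tidyverse option, sketched below rather than taken from an existing answer, keeps single-row groups as they are and, within duplicated groups, keeps only the complete rows (it assumes every duplicated group contains at least one row without NAs):
dat %>%
  group_by(group) %>%
  # keep a row if its group has only one row, or if the row itself has no NAs
  filter(n() == 1 | (!is.na(var1) & !is.na(var2))) %>%
  ungroup()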
What I would like to do is a bit difficult to explain, but the code would look something like this:
df_merged <- merge(df1, df2,
by.x = c("City", "District"),
by.y = c("City", "District" | "Area"),
all.x = TRUE)
Here "|", in the code above, would mean "OR".
The basic point is that I would like to merge the two frames by two columns. "City" matches for both data frames. However, I also need there to be a match based on "District".
The problem is that, due to human error while the dataset was originally made, in df2 some values for "District" were put in the "Area" column. Hence, ideally, if we have "District" being "A" in df1, then the merge occurs if "A" is found in either the "District" or "Area" column from df2.
Here is an example:
df1 <- data.frame(City = c("A", "B"), District = c("cc", "dd"))
df2 <- data.frame(City = c("A", "A", "B", "B"), Code = c("1a","2a","3a","4a"), District = c("cc", "Apple", "Pear", "Orange"), Area = c("e", "a", "dd", "f"))
df3 <- data.frame(City = c("A", "B"), District = c("cc","dd"), Code = c("1a", "3a"))
Here df3 is what I am aiming for. As you can see, in df2 some of the District values ended up in the wrong column. In my original dataset, it is difficult to clean up this error.
> df1
City District
1 A cc
2 B dd
> df2
City Code District Area
1 A 1a cc e
2 A 2a Apple a
3 B 3a Pear dd
4 B 4a Orange f
> df3
City District Code
1 A cc 1a
2 B dd 3a
Here are a couple of options. One idea: if District is NA whenever the value was actually entered in Area, you can coalesce the two columns and join on District. Alternatively, you can map out which rows of df2 match each row of df1 and then expand df1 to accommodate those matches.
library(tidyverse)
df1 <- tibble(City = c(rep("Chicago", 4), rep("Tucson", 3)),
District = c("A", "A", "A", "B", "A", "B", "C"))
df2 <- tibble(City = c("Chicago", "Chicago", "Tucson", "Tucson"),
District = c("A", "B", NA, "C"),
Area = c("10", "30", "A", "20"),
value = c(1:4))
#option 1
left_join(df1,
df2 |>
mutate(District = coalesce(District, Area)),
by = c("City", "District" ))
#> # A tibble: 7 x 4
#> City District Area value
#> <chr> <chr> <chr> <int>
#> 1 Chicago A 10 1
#> 2 Chicago A 10 1
#> 3 Chicago A 10 1
#> 4 Chicago B 30 2
#> 5 Tucson A A 3
#> 6 Tucson B <NA> NA
#> 7 Tucson C 20 4
#option 2
df1 |>
mutate(matches = map2(City, District,
~filter(df2, City == .x & (District == .y | Area == .y))|>
select(-City, -District))) |>
unnest_wider(matches)
#> # A tibble: 7 x 4
#> City District Area value
#> <chr> <chr> <chr> <int>
#> 1 Chicago A 10 1
#> 2 Chicago A 10 1
#> 3 Chicago A 10 1
#> 4 Chicago B 30 2
#> 5 Tucson A A 3
#> 6 Tucson B <NA> NA
#> 7 Tucson C 20 4
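For reference, the same map2() + filter() idea can be applied directly to the df1/df2 from the question. This is just a sketch; it assumes every row of df1 has at least one match in df2 (unmatched rows would be dropped unless you pass keep_empty = TRUE to unnest()):
library(tidyverse)
df1 <- data.frame(City = c("A", "B"), District = c("cc", "dd"))
df2 <- data.frame(City = c("A", "A", "B", "B"),
                  Code = c("1a", "2a", "3a", "4a"),
                  District = c("cc", "Apple", "Pear", "Orange"),
                  Area = c("e", "a", "dd", "f"))
df1 |>
  mutate(matches = map2(City, District,
                        # keep df2 rows with the same City whose District OR Area equals df1's District
                        ~ filter(df2, City == .x & (District == .y | Area == .y)) |>
                          select(Code))) |>
  unnest(matches)
# should return Code 1a for (A, cc) and Code 3a for (B, dd), matching df3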
I have a tibble which resembles the following:
data<-tibble(ref=c("ABC", "ABC", "XYZ", "XYZ", "FGH", "FGH", "FGH"),
type=c("A", "B", "A", "A", "A", "A", "B"))
ref type
1 ABC A
2 ABC B
3 XYZ A
4 XYZ A
5 FGH A
6 FGH A
7 FGH B
I need to group by ref and, if type B is present within a group, return that row; otherwise return any one row (but only one row) of type A.
Expected output:
ref type
1 ABC B
2 XYZ A
3 FGH B
With large amounts of data, it is better to sort before grouping.
tidyverse
library(tidyverse)
df<-tibble(ref=c("ABC", "ABC", "XYZ", "XYZ", "FGH", "FGH", "FGH"),
type=c("A", "B", "A", "A", "A", "A", "B"))
distinct(df) %>%
arrange(ref, desc(type)) %>%
group_by(ref) %>%
slice_head(n = 1) %>%
ungroup()
#> # A tibble: 3 × 2
#> ref type
#> <chr> <chr>
#> 1 ABC B
#> 2 FGH B
#> 3 XYZ A
Created on 2022-04-27 by the reprex package (v2.0.1)
data.table
df<-data.frame(ref=c("ABC", "ABC", "XYZ", "XYZ", "FGH", "FGH", "FGH"),
type=c("A", "B", "A", "A", "A", "A", "B"))
library(data.table)
setDT(df)[order(ref, -type), .SD[1], by = ref]
#> ref type
#> 1: ABC B
#> 2: FGH B
#> 3: XYZ A
Created on 2022-04-27 by the reprex package (v2.0.1)
If you only have A and B, then you can arrange and simply get the first row, i.e.
library(dplyr)
data %>%
group_by(ref) %>%
filter(type %in% c('A', 'B')) %>% #If other types exist
arrange(desc(type)) %>%
slice(1L)
# A tibble: 3 x 2
# Groups: ref [3]
ref type
<chr> <chr>
1 ABC B
2 FGH B
3 XYZ A
We can use which.max() on the logical vector type == "B" to extract the desired rows: it returns the index of the first TRUE (a type B row) if one exists, and otherwise falls back to the first row of the group (a type A row).
data %>%
group_by(ref) %>%
slice(which.max(type == "B")) %>%
ungroup()
which gives
# A tibble: 3 x 2
ref type
<chr> <chr>
1 ABC B
2 FGH B
3 XYZ A
I have a data frame with two columns: the first holds unique individual IDs, and the second shows the action each individual took, so several individuals can share the same action.
I'd like to write R code that creates new rows pairing up every two individuals who share the same action.
That is, given this example:
person <- c("a", "b", "c", "d", "e", "f")
action <- c("x", "x", "x", "y", "y", "y")
data.frame(person, action)
I'd want to create this:
person1 <- c("a", "a", "b", "d", "d", "e")
person2 <- c("b", "c", "c", "e", "f", "f")
data.frame(person1, person2)
A method using group_modify() and combn():
library(dplyr)
df %>%
group_by(action) %>%
group_modify(~ as_tibble(t(combn(pull(.x, person), 2))))
# A tibble: 6 × 3
# Groups: action [2]
action V1 V2
<chr> <chr> <chr>
1 x a b
2 x a c
3 x b c
4 y d e
5 y d f
6 y e f
How about this:
library(dplyr)
library(tidyr)
#> Warning: package 'tidyr' was built under R version 4.1.2
person<-c("a", "b", "c", "d", "e", "f")
action<-c("x", "x", "x", "y", "y", "y")
dat <- data.frame(person, action)
dat %>%
group_by(action) %>%
summarise(person = as.data.frame(t(combn(person, 2)))) %>%
unnest(person) %>%
rename(person1=V1, person2=V2)
#> `summarise()` has grouped output by 'action'. You can override using the
#> `.groups` argument.
#> # A tibble: 6 × 3
#> # Groups: action [2]
#> action person1 person2
#> <chr> <chr> <chr>
#> 1 x a b
#> 2 x a c
#> 3 x b c
#> 4 y d e
#> 5 y d f
#> 6 y e f
Created on 2022-04-21 by the reprex package (v2.0.1)
Here is a one-liner in base R.
person <- c("a", "b", "c", "d", "e", "f")
action <- c("x", "x", "x", "y", "y", "y")
df <- data.frame(person, action)
setNames(
do.call(
rbind,
lapply(split(df, df$action),
function(x) as.data.frame(t(combn(x$person, 2))))),
c("person1", "person2"))
# person1 person2
# x.1 a b
# x.2 a c
# x.3 b c
# y.1 d e
# y.2 d f
# y.3 e f
Using base R
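# self-join dat on action, drop self-pairs, then keep one row per unordered pair of persons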
subset(merge(dat, dat, by = 'action'), person.x != person.y &
duplicated(paste(pmin(person.x, person.y), pmax(person.x, person.y))))
action person.x person.y
4 x b a
7 x c a
8 x c b
13 y e d
16 y f d
17 y f e
I would like to reassign a given record to a single group when that record is duplicated. In the dataset below, I would like every 12-4 record to be assigned to group A or to group B, but not to both. Is there a way to go about it?
library(tidyverse)
dat <- tibble(
group = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
assigned = c("12-1", "12-2", "12-3", "12-4", "12-4", "12-5", "12-6",
"12-7", "12-8")
)
# Attempts to tease out records for each group
dat %>% pivot_wider(names_from = group, values_from = assigned)
You can group by record and reassign all to the same group, chosen at random from the available groups:
dat %>%
group_by(assigned) %>%
mutate(group = nth(group, sample(n())[1])) %>%
ungroup()
#> # A tibble: 9 x 2
#> group assigned
#> <chr> <chr>
#> 1 A 12-1
#> 2 A 12-2
#> 3 A 12-3
#> 4 A 12-4
#> 5 A 12-4
#> 6 B 12-5
#> 7 B 12-6
#> 8 B 12-7
#> 9 B 12-8
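If randomness isn't needed, a deterministic variant (a sketch, not part of the original answer) gives every duplicated record the group of its first occurrence:
dat %>%
  group_by(assigned) %>%
  # both 12-4 rows get group "A", the group of the first occurrence
  mutate(group = first(group)) %>%
  ungroup()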
library(tidyverse)
dat <- tibble(
group = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
assigned = c(
"12-1", "12-2", "12-3", "12-4", "12-4", "12-5", "12-6",
"12-7", "12-8"
)
)
dat %>%
select(-group) %>%
left_join(
dat %>%
left_join(dat %>% count(group)) %>%
# reassign to the smallest group
arrange(n) %>%
select(-n) %>%
distinct(assigned, .keep_all = TRUE)
)
#> Joining, by = "group"
#> Joining, by = "assigned"
#> # A tibble: 9 × 2
#> assigned group
#> <chr> <chr>
#> 1 12-1 A
#> 2 12-2 A
#> 3 12-3 A
#> 4 12-4 A
#> 5 12-4 A
#> 6 12-5 B
#> 7 12-6 B
#> 8 12-7 B
#> 9 12-8 B
Created on 2022-04-04 by the reprex package (v2.0.0)
This question already has answers here:
Getting the top values by group
I have this data:
df <- data.frame(
node = c("A", "B", "A", "A", "A", "B", "A", "A", "A", "B", "B", "B", "B"),
left = c("ab", "ab", "ab", "ab", "cc", "xx", "cc", "ab", "zz", "xx", "xx", "zz", "zz")
)
I want to count grouped frequencies and proportions and then slice/filter out a sequence of grouped rows. Say, given this small dataset, I want the rows with the two highest Freq_left values per group. How can that be done? I can only extract the rows with the maximum Freq_left value, not the desired sequence of rows:
df %>%
group_by(node, left) %>%
# summarise
summarise(
Freq_left = n(),
Prop_left = round(Freq_left/nrow(.)*100, 4)
) %>%
slice_max(Freq_left)
# A tibble: 2 × 4
# Groups: node [2]
node left Freq_left Prop_left
<chr> <chr> <int> <dbl>
1 A ab 4 30.8
2 B xx 3 23.1
Expected output:
node left Freq_left Prop_left
<chr> <chr> <int> <dbl>
A ab 4 30.8
A cc 2 15.4
B xx 3 23.1
B zz 2 15.4
You could use dplyr::top_n() or dplyr::slice_max().
Thanks to @PaulSmith for pointing out that dplyr::top_n() is superseded in favor of dplyr::slice_max():
library(dplyr)
df %>%
group_by(node, left) %>%
# summarise
summarise(
Freq_left = n(),
Prop_left = round(Freq_left/nrow(.)*100, 4)
) %>%
slice_max(order_by = Prop_left, n = 2)
#> `summarise()` has grouped output by 'node'. You can override using the `.groups` argument.
#> # A tibble: 4 × 4
#> # Groups: node [2]
#> node left Freq_left Prop_left
#> <chr> <chr> <int> <dbl>
#> 1 A ab 4 30.8
#> 2 A cc 2 15.4
#> 3 B xx 3 23.1
#> 4 B zz 2 15.4
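One caveat, not covered in the answer above: slice_max() keeps ties by default, so a group with a tie at the second-highest Prop_left can return more than two rows. Passing with_ties = FALSE (sketched below) returns at most n rows per group:
df %>%
  group_by(node, left) %>%
  summarise(
    Freq_left = n(),
    Prop_left = round(Freq_left/nrow(.)*100, 4)
  ) %>%
  # with_ties = FALSE guarantees at most n rows per group even when Prop_left values tie
  slice_max(order_by = Prop_left, n = 2, with_ties = FALSE)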