While removing rows that are duplicates in one particular column, is it possible to preferentially retain one of the duplicate rows based upon the second and third columns?
Consider the following example:
# Example dataframe.
df <- data.frame(col.1 = c(1, 1, 1, 2, 2, 2, 3, 3),
                 col.2 = c('a', 'b', 'b', 'a', 'b', 'c', 'a', 'a'),
                 col.3 = c('b', 'c', 'a', 'b', 'a', 'b', 'c', 'b'))
# Output
col.1 col.2 col.3
1 a b
1 b c
1 b a
2 a b
2 b a
2 c b
3 a c
3 a b
I would like to remove rows that are duplicates in col.1, while preferentially retaining rows that have col.2 == 'b' and col.3 == 'c'. A match in both col.2 and col.3 is preferred most; a single match in col.2 is preferred over a single match in col.3, and a match in just one column is preferred over no match at all. For duplicate rows with no matches, any one of the duplicate rows may be retained.
In the case of the example given, the resultant data frame would look like this:
# Output.
col.1 col.2 col.3
1 b c
2 b a
3 a c
Thank you!
We group by 'col.1', filter the rows where 'col.2' is 'b' or 'col.3' is 'c', then drop the rows that are duplicated in 'col.2' and 'col.3', keeping the last occurrence.
library(tidyverse)
df %>%
  group_by(col.1) %>%
  filter(col.2 == 'b' | col.3 == 'c') %>%
  ungroup %>%
  filter(!duplicated(.[-1], fromLast = TRUE))
# A tibble: 3 x 3
# col.1 col.2 col.3
# <dbl> <fct> <fct>
#1 1 b c
#2 2 b a
#3 3 a c
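Note that if a group had no row matching either preference, the filter() step above would drop that group entirely, whereas the question allows any row to be kept in that case. A sketch that encodes the full preference order by scoring each row and keeping the top-scoring row per group could look like this (the 2-point/1-point weighting is my own illustration, not part of the question):
library(dplyr)
df %>%
  group_by(col.1) %>%
  # 2 points for the col.2 match, 1 for the col.3 match, so
  # both > col.2 only > col.3 only > none.
  arrange(desc(2 * (col.2 == 'b') + (col.3 == 'c')), .by_group = TRUE) %>%
  slice(1) %>%
  ungroup()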
If you group_by col.1 and col.3 while preferentially retaining the duplicates that have col.2 == 'b', then take the output of this and group_by just col.1 while preferentially retaining the duplicates that have col.3 == 'c', you end up with the desired result. The same logic also holds if the preferred values are changed.
df %>%
  group_by(col.1, col.3) %>%
  slice(match('b', col.2, nomatch = 1)) %>%
  group_by(col.1) %>%
  slice(match('c', col.3, nomatch = 1))
# Output:
# A tibble: 3 x 3
# Groups: col.1 [3]
col.1 col.2 col.3
<dbl> <fct> <fct>
1 1 b c
2 2 b a
3 3 a c
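The nomatch = 1 fallback is what keeps a row even when the preferred value is absent from a group; a quick illustration of the two cases:
# Position of the first 'b' when it exists in the group...
match('b', c('a', 'b', 'b'), nomatch = 1)
#[1] 2
# ...and the fallback to the first row when it does not.
match('b', c('a', 'c'), nomatch = 1)
#[1] 1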
I am trying to identify which trees are different between two groups a & b across different forest types (type).
My dummy example:
dd1 <- data.frame(
  type = rep(1, 5),
  grp = c('a', 'a', 'a', 'b', 'b'),
  sp = c('oak', 'beech', 'spruce',
         'oak', 'yew')
)
dd2 <- data.frame(
  type = rep(2, 3),
  grp = c('a', 'b', 'b'),
  sp = c('oak', 'beech', 'spruce')
)
dd <- rbind(dd1, dd2)
I can find the unique species within each group (in reality two grouping variables, type & grp) with distinct:
dd %>%
group_by(type, grp) %>%
distinct(sp)
But instead, I want to know which trees in group b are different from those in group a.
Expected output:
type grp sp
<dbl> <chr> <chr>
1 1 b yew # here, only `yew` is a new one; `oak` was previously listed in group `a`
2 2 b beech # both beech and spruce are new compared to group `a`
3 2 b spruce
How can I do this? Thank you!
The condition to filter on: within each type, keep the group 'b' rows whose species do not appear among the group 'a' species.
library(dplyr)
dd %>%
  group_by(type) %>%
  filter(grp == 'b' & !sp %in% sp[grp == 'a']) %>%
  ungroup()
# # A tibble: 3 × 3
# type grp sp
# <dbl> <chr> <chr>
# 1 1 b yew
# 2 2 b beech
# 3 2 b spruce
You could try an anti_join:
library(dplyr)
library(tidyr)
dd |>
  anti_join(dd |> filter(grp == "a"), by = c("sp", "type"))
Output:
type grp sp
1 1 b yew
2 2 b beech
3 2 b spruce
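For comparison, the same anti-join idea in base R, as a rough sketch (the pasted type/sp key is just an illustration device, not part of the answers above):
# Keys of the (type, sp) pairs that occur in group 'a'.
a_keys <- with(subset(dd, grp == "a"), paste(type, sp))
# Keep the group 'b' rows whose (type, sp) pair is not among them.
subset(dd, grp == "b" & !paste(type, sp) %in% a_keys)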
I have a simple data frame such as
df <- data.frame(x = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
                 y = c('a', 'b', 'a', 'c', 'e', 'd', 'e', 'a', 'f', 'c'))
I want to group by x; then, for each x group whose first row has y == 'a', keep only the rows that have y == 'a' | y == 'c'.
So I expect the outcome to contain rows 1, 3, 4, 8, and 10.
Thank you very much.
After grouping by 'x', combine two conditions with &: 1) check whether the first value of 'y' is 'a', and 2) check whether 'y' is one of 'a', 'c'.
library(dplyr)
df %>%
  group_by(x) %>%
  filter('a' == first(y), y %in% c('a', 'c')) %>%
  ungroup
-output
# A tibble: 5 × 2
x y
<dbl> <chr>
1 1 a
2 1 a
3 1 c
4 3 a
5 3 c
If we have additional rules, create a named list where the names are the expected first values of 'y' and the elements are the vectors of values to keep; then extract the list element based on the first value of 'y' and use that vector in the logical expression with %in%.
df %>%
  group_by(x) %>%
  filter(y %in% list(a = c('a', 'c'), e = 'e')[[first(y)]]) %>%
  ungroup
-output
# A tibble: 7 × 2
x y
<dbl> <chr>
1 1 a
2 1 a
3 1 c
4 2 e
5 2 e
6 3 a
7 3 c
Here is another option, combining dplyr::filter with base ave
df %>%
  filter(y %in% c("a", "c") & ave(y == "a", x, FUN = first))
x y
1 1 a
2 1 a
3 1 c
4 3 a
5 3 c
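To see what the ave() term contributes: within each x group it recycles the answer to "is the first y equal to 'a'?" across the whole group, so the filter keeps only groups 1 and 3. For illustration:
with(df, ave(y == "a", x, FUN = dplyr::first))
# [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE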
I am working on a function that outputs a data frame which currently omits trials with missing data. However, I would like the full trial count to be added back into the file, with the other data columns left blank for these instances (reflecting the missing data).
Example Data Frames:
Df1withTrialCount <- data.frame(Participant = c('A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'),
                                Trial = c(1, 1, 2, 2, 3, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10),
                                NotRelevantVariable = c(1, 2, 3, 4, 5, 6, 4, 3, 2, 1, 1, 2, 3, 4, 5))
Df2NeedsTrialsAddedIn <- data.frame(Participant = c('A', 'A', 'A', 'A', 'A'),
                                    Trial = c(1, 3, 5, 6, 10),
                                    EyeGaze = c(.4, .2, .2, .1, .1))
So I would end up with something that has one row for each of Trials 1-10, but a blank in EyeGaze where there was no data (e.g., Trial 2 would have a blank for EyeGaze, while Trial 3 would have .2).
Any help or insights would be greatly appreciated.
Take care and thank you for your time,
Caroline
With base::merge:
merge(unique(Df1withTrialCount[, c("Participant", "Trial")]),
      Df2NeedsTrialsAddedIn,
      all.x = TRUE)
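The first argument builds the skeleton of all recorded trials, which all.x = TRUE then preserves even where Df2NeedsTrialsAddedIn has no matching EyeGaze row; for illustration:
# One row per (Participant, Trial) pair: participant A, trials 1 through 10.
unique(Df1withTrialCount[, c("Participant", "Trial")])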
We can use complete
library(tidyr)
complete(Df2NeedsTrialsAddedIn, Participant,
         Trial = seq_len(max(Df1withTrialCount$Trial)))
-output
# A tibble: 10 x 3
# Participant Trial EyeGaze
# <chr> <dbl> <dbl>
# 1 A 1 0.4
# 2 A 2 NA
# 3 A 3 0.2
# 4 A 4 NA
# 5 A 5 0.2
# 6 A 6 0.1
# 7 A 7 NA
# 8 A 8 NA
# 9 A 9 NA
#10 A 10 0.1
If we need both the min and max from the first dataset:
complete(Df2NeedsTrialsAddedIn, Participant,
         Trial = seq(min(Df1withTrialCount$Trial), max(Df1withTrialCount$Trial), by = 1))
Or left_join the EyeGaze data onto the full trial list and keep one row per Trial:
library(tidyverse)
Df1withTrialCount %>%
  left_join(Df2NeedsTrialsAddedIn, by = c('Participant', 'Trial')) %>%
  distinct(Trial, .keep_all = TRUE)
I am looking to find common cases across groups in R, based on a tidy data set.
I could split the data sets and then join them, or use Reduce, but that seems laborious and I am sure there must be a way to do this easily for tidy data, likely using dplyr and group_by().
Here is an example:
data <- data.frame(case = c('A', 'B', 'C', 'D', 'B', 'C', 'D', 'E'),
                   var = c(rep(1, 4), rep(2, 4)))
case var
1 A 1
2 B 1
3 C 1
4 D 1
5 B 2
6 C 2
7 D 2
8 E 2
What I want is the cases common to all values of var: 'B', 'C', 'D'. I am thinking this should be easy but can't find an answer.
Group by case, then grab the first row of the cases that appear for every distinct value of var.
library(dplyr)
data %>%
  group_by(case) %>%
  slice(which(n_distinct(var) == n_distinct(.$var))[1])
After grouping by 'case', filter the groups where the number of distinct elements in 'var' equals the total number of distinct elements in 'var', then ungroup and get the distinct 'case'.
library(dplyr)
data %>%
  group_by(case) %>%
  filter(n_distinct(var) == n_distinct(.$var)) %>%
  ungroup %>%
  distinct(case)
# A tibble: 3 x 1
# case
# <fct>
#1 B
#2 C
#3 D
Or using data.table
library(data.table)
setDT(data)[, .GRP[uniqueN(var) == uniqueN(data$var)], case]$case
#[1] B C D
Or using base R
with(data, names(Filter(function(x) all(unique(var) %in% x), split(var, case))))
#[1] "B" "C" "D"
Very similar to this question, I am trying to populate a new variable by finding the last non-missing value per group of an existing variable in a data frame, ideally using dplyr/zoo. I want to keep only the last value, though, and not merely overwrite missings. Consider the following minimal example:
df1 <- data.frame(ID = c(1, 1, 1, 2, 2, 2),
                  date = c(1, 2, 3, 1, 2, 3),
                  var1 = c('a', '', 'b', '', 'c', ''))
## R commands needed to produce df2:
df2 <- data.frame(ID = c(1, 1, 1, 2, 2, 2),
                  date = c(1, 2, 3, 1, 2, 3),
                  var1 = c('b', 'b', 'b', 'c', 'c', 'c'))
Using dplyr,
library(dplyr)
df1 %>%
  group_by(ID) %>%
  mutate(var1 = last(var1[var1 != '']))
which gives,
# A tibble: 6 x 3
# Groups: ID [2]
ID date var1
<dbl> <dbl> <fct>
1 1 1 b
2 1 2 b
3 1 3 b
4 2 1 c
5 2 2 c
6 2 3 c
Here is one option with base R using ave
df1$var1 <- with(df1, ave(as.character(var1), ID,
                          FUN = function(x) tail(x[nzchar(x)], 1)))
df1$var1
#[1] "b" "b" "b" "c" "c" "c"