Grouping and then verifying two conditions with character variables in R

I would like to understand how to verify two conditions in groups with R. Like if I have:
x <- data.frame("id" = c('A12', 'A12', 'A13', 'A13', 'A14', 'A14'),
"var1" = c('a', 'b', 'b', 'c', 'b', 'a'),
"var2" = c('x', 'y', 'z', 'z', 'y', 'x'),
"var3" = c('h', 'l', 'l', 'h', 'q', 'q'),
stringsAsFactors = FALSE)
For the group with the ID A12, are 'a', 'x' and 'h' present in the same row?

After grouping by 'id', wrap the condition in any() so that the whole group is flagged when at least one of its rows satisfies the condition
library(dplyr)
x %>%
group_by(id) %>%
mutate(flag = any(var1 == 'a' & var2 == 'x' & var3 == 'h'))
# A tibble: 6 x 5
# Groups: id [3]
# id var1 var2 var3 flag
# <chr> <chr> <chr> <chr> <lgl>
#1 A12 a x h TRUE
#2 A12 b y l TRUE
#3 A13 b z l FALSE
#4 A13 c z h FALSE
#5 A14 b y q FALSE
#6 A14 a x q FALSE
Or another option is to paste the columns, and then do a string match
library(stringr)
x %>%
group_by(id) %>%
mutate(flag = any(str_c(var1, var2, var3) == 'axh'))
If the goal is just a row-wise TRUE/FALSE column, remove the any() call and the group_by() step
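If one row per id is enough, the same check can be collapsed with summarise(). This is a minimal sketch that rebuilds the x data frame from the question; the name flags is only illustrative.

```r
library(dplyr)

x <- data.frame(id = c('A12', 'A12', 'A13', 'A13', 'A14', 'A14'),
                var1 = c('a', 'b', 'b', 'c', 'b', 'a'),
                var2 = c('x', 'y', 'z', 'z', 'y', 'x'),
                var3 = c('h', 'l', 'l', 'h', 'q', 'q'),
                stringsAsFactors = FALSE)

# One row per id: TRUE if any single row of the group holds all three values
flags <- x %>%
  group_by(id) %>%
  summarise(flag = any(var1 == 'a' & var2 == 'x' & var3 == 'h'))

flags
```

Only A12 has a row with 'a', 'x' and 'h' together, so it is the only id flagged TRUE.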

Related

Find novel categories between groups

I am trying to identify which trees are different between two groups a & b across different forest types (type).
My dummy example:
dd1 <- data.frame(
type = rep(1, 5),
grp = c('a', 'a', 'a', 'b', 'b'),
sp = c('oak', 'beech', 'spruce',
'oak', 'yew')
)
dd2 <- data.frame(
type = rep(2, 3),
grp = c('a', 'b', 'b'),
sp = c('oak', 'beech', 'spruce')
)
dd <- rbind(dd1, dd2)
I can find unique species by each group (in reality, two groups: type & grp) by distinct:
dd %>%
group_by(type, grp) %>%
distinct(sp)
But instead I want to know which trees in group b are different from group a?
Expected output:
type grp sp
<dbl> <chr> <chr>
1 1 b yew # here, only `yew` is a new one; `oak` was previously listed in group `a`
2 2 b beech # both beech and spruce are new compared to group `a`
3 2 b spruce
How can I do this? Thank you!
The condition to filter is
library(dplyr)
dd %>%
group_by(type) %>%
filter(grp == 'b' & !sp %in% sp[grp == 'a']) %>%
ungroup()
# # A tibble: 3 × 3
# type grp sp
# <dbl> <chr> <chr>
# 1 1 b yew
# 2 2 b beech
# 3 2 b spruce
You could try an anti_join:
library(dplyr)
library(tidyr)
dd |>
anti_join(dd |> filter(grp == "a"), by = c("sp", "type"))
Output:
type grp sp
1 1 b yew
2 2 b beech
3 2 b spruce

Using the value in one column to specify from which row to retrieve a value for a new column

I'm looking for an automated way of converting this:
dat = tribble(
~a, ~b, ~c
, 'x', 1, 'y'
, 'y', 2, NA
, 'q', 4, NA
, 'z', 3, 'q'
)
to:
tribble(
~a, ~b, ~d
, 'x', 1, 2
, 'z', 3, 4
)
So, the column c in dat encodes which row in dat to look at to grab a value for a new column d, and if c is NA, toss that row from the output. Any tips?
We can join dat with itself using c and a columns.
library(dplyr)
dat %>%
inner_join(dat %>% select(-c) %>% rename(d = 'b'),
by = c('c' = 'a'))
# A tibble: 2 x 4
# a b c d
# <chr> <dbl> <chr> <dbl>
#1 x 1 y 2
#2 z 3 q 4
In base R, we can do this with merge :
merge(dat, dat[-3], by.x = 'c', by.y = 'a')
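If the row named in 'c' always directly follows the row that refers to it, we can create 'd' as the lead of 'b', filter out the rows where 'c' is NA, and remove the 'c' column with select. Note that in dat as defined above, 'q' comes before 'z', so lead(b) returns NA for 'z'; the output shown below assumes 'q' directly follows 'z'.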
library(dplyr)
dat %>%
mutate(d = lead(b)) %>%
filter(!is.na(c)) %>%
select(-c)
# A tibble: 2 x 3
# a b d
# <chr> <dbl> <dbl>
#1 x 1 2
#2 z 3 4
Or more compactly
dat %>%
mutate(d = replace(lead(b), is.na(c), NA), c = NULL) %>%
na.omit
Or with fill
library(tidyr)
dat %>%
mutate(c1 = c) %>%
fill(c1) %>%
group_by(c1) %>%
mutate(d = lead(b)) %>%
ungroup %>%
filter(!is.na(c)) %>%
select(-c, -c1)
Or in data.table
library(data.table)
setDT(dat)[, d := shift(b, type = 'lead')][!is.na(c)][, c := NULL][]
# a b d
#1: x 1 2
#2: z 3 4
NOTE: These lead-based solutions are simple and do not require any joins, but they give the expected output in the OP's post only when each referenced row directly follows the row that refers to it
Or using match from base R
cbind(na.omit(dat), d = with(dat, b[match(c, a, nomatch = 0)]))[, -3]
# a b d
#1 x 1 2
#2 z 3 4
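The same match() lookup can also be written inside a dplyr pipeline, which does not depend on the order of the rows. This is a sketch using dat exactly as defined in the question; the name res is only illustrative.

```r
library(dplyr)
library(tibble)

dat <- tribble(
  ~a,  ~b, ~c,
  'x', 1,  'y',
  'y', 2,  NA,
  'q', 4,  NA,
  'z', 3,  'q'
)

res <- dat %>%
  # match(c, a) gives the row position whose 'a' equals this row's 'c' (NA if none)
  mutate(d = b[match(c, a)]) %>%
  # drop the rows that do not point at another row
  filter(!is.na(c)) %>%
  select(a, b, d)

res
```

Because the lookup is by value rather than by position, 'z' still picks up the b value of 'q' even though 'q' appears earlier in the table.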

Preferential removal of partial duplicates in a dataframe, dependent upon multiple columns

While removing rows that are duplicates in one particular column, is it possible to preferentially retain one of the duplicate rows based upon the second and third columns?
Consider the following example:
# Example dataframe.
df <- data.frame(col.1 = c(1, 1, 1, 2, 2, 2, 3, 3),
col.2 = c('a', 'b', 'b', 'a', 'b', 'c', 'a', 'a'),
col.3 = c('b', 'c', 'a', 'b', 'a', 'b', 'c', 'b'))
# Output
col.1 col.2 col.3
1 a b
1 b c
1 b a
2 a b
2 b a
2 c b
3 a c
3 a b
I would like to remove rows that are duplicates in col.1, while preferentially retaining rows that have col.2 == 'b', and col.3 == 'c'. A match in both col.2 and col.3 is preferred the most, while a single match in col.2 is preferred over a single match in col.3, and a match in just one column is preferred over no match at all. For duplicate rows with no matches, any one of the duplicate rows may be retained.
In the case of the example given, the resultant data frame would look like this:
# Output.
col.1 col.2 col.3
1 b c
2 b a
3 a c
Thank you!
We group by 'col.1', filter rows where 'col.2' is 'b' or 'col.3' is 'c', then filter out the duplicated rows based on the 'col.2' and 'col.3' values
library(tidyverse)
df %>%
group_by(col.1) %>%
filter(col.2 == 'b'| col.3 == 'c') %>%
ungroup %>%
filter(!duplicated(.[-1], fromLast = TRUE))
# A tibble: 3 x 3
# col.1 col.2 col.3
# <dbl> <fct> <fct>
#1 1 b c
#2 2 b a
#3 3 a c
If you group_by col.1 and col.3 while preferentially retaining the rows that have col.2 == 'b', and then group_by just col.1 while preferentially retaining the rows that have col.3 == 'c', you end up with the desired result. This logic also holds if the preferred values are changed.
df %>%
group_by(col.1, col.3) %>%
slice(match('b', col.2, nomatch = 1)) %>%
group_by(col.1) %>%
slice(match('c', col.3, nomatch = 1))
# Output:
# A tibble: 3 x 3
# Groups: col.1 [3]
col.1 col.2 col.3
<dbl> <fct> <fct>
1 1 b c
2 2 b a
3 3 a c
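Another way to express the stated preference order is to turn it into a numeric score and keep the best-scoring row per col.1. This is a sketch rather than either answer's method: the score column and the weights 2 and 1 are assumptions chosen so that a match in col.2 outranks a match in col.3, and slice_max() requires dplyr 1.0.0 or later.

```r
library(dplyr)

df <- data.frame(col.1 = c(1, 1, 1, 2, 2, 2, 3, 3),
                 col.2 = c('a', 'b', 'b', 'a', 'b', 'c', 'a', 'a'),
                 col.3 = c('b', 'c', 'a', 'b', 'a', 'b', 'c', 'b'))

result <- df %>%
  # 3 = both match, 2 = only col.2 matches, 1 = only col.3 matches, 0 = neither
  mutate(score = 2 * (col.2 == 'b') + (col.3 == 'c')) %>%
  group_by(col.1) %>%
  # keep exactly one row per col.1, the one with the highest score
  slice_max(score, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-score)

result
```

With with_ties = FALSE, groups where no row matches still keep one row, which fits the requirement that any one of the duplicates may be retained in that case.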

R group_by return number of largest unique type

Suppose I have this data set:
df <- data.frame(c('a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b', 'b'),
c('c', 'c', 'd', 'e', 'f', 'c', 'e', 'f', 'f', 'f', 'g', 'h', 'f')
) %>% setNames(c('type', 'value'))
type value
1 a c
2 a c
3 a d
4 a e
5 a f
6 a c
7 b e
8 b f
9 b f
10 b f
11 b g
12 b h
13 b f
I'd like to perform some kind of command as follows:
df %>% group_by(type) %>%
summarise_all(funs(largest_group_size))
This would ideally produce a table with the largest number of any value for a and b.
type largest_group_size
1 a 3
2 b 4
This table would have:
3 for a, because there are 3 values of c for a, and c is the largest group for a
4 for b, because there are 4 values of f for b, and f is the largest group for b
Ideally, I'd like to go a step further and calculate the percentage that the largest group is of the whole by type. So (largest_group_size / n()).
In two group_by steps:
df %>%
group_by(type, value) %>%
summarise(groups = n()) %>%
group_by(type) %>%
summarise(largest_group = max(groups),
as_percentage = largest_group / sum(groups))
This gives:
type largest_group as_percentage
<fct> <dbl> <dbl>
1 a 3 0.5
2 b 4 0.571
There is probably a more efficient way, but this is how I would do this in a hurry.
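A slightly shorter variant replaces the first group_by/summarise pair with count(), which produces the per-(type, value) tallies in a column named n. This is a sketch using the data from the question; the name res is only illustrative.

```r
library(dplyr)

df <- data.frame(
  type  = c(rep('a', 6), rep('b', 7)),
  value = c('c', 'c', 'd', 'e', 'f', 'c', 'e', 'f', 'f', 'f', 'g', 'h', 'f')
)

res <- df %>%
  count(type, value) %>%          # one row per (type, value) with its tally n
  group_by(type) %>%
  summarise(largest_group = max(n),
            as_percentage = largest_group / sum(n))

res
```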

How to overwrite some rows of a tibble with another tibble

Suppose I have data like the following:
# A tibble: 10 x 4
# Groups: a.month, a.group [10]
a.month a.group other.group amount
<date> <chr> <chr> <dbl>
1 2016-02-01 A X 15320
2 2016-05-01 A Z 50079
3 2016-06-01 A Y 60564
4 2016-08-01 A X 10540
5 2017-01-01 B X 30020
6 2017-03-01 B X 76310
7 2017-04-01 B Y 44215
8 2017-05-01 A Y 67241
9 2017-06-01 A Z 17180
10 2017-07-01 B Z 31720
And I want to produce rows for every possible combination of a.group, other.group and for every month in between (with amount being zero if not present on the data above)
I managed to produce a tibble with the default amounts through:
another.tibble <- as_tibble(expand.grid(
a.month = months.list,
a.group = unique.a.groups,
other.group = unique.o.groups,
amount = 0
));
How should I proceed to populate another.tibble with the values from the first one?
It is important to invoke expand.grid with stringsAsFactors = FALSE. Then we simply use left_join() to fill in the combinations for which we have data
library(tidyverse)
df <- tribble(
~a.month, ~a.group, ~other.group, ~amount,
'2016-02-01', 'A', 'X', 15320,
'2016-05-01', 'A', 'Z', 50079,
'2016-06-01', 'A', 'Y', 60564,
'2016-08-01', 'A', 'X', 10540,
'2017-01-01', 'B', 'X', 30020,
'2017-03-01', 'B', 'X', 76310,
'2017-04-01', 'B', 'Y', 44215,
'2017-05-01', 'A', 'Y', 67241,
'2017-06-01', 'A', 'Z', 17180,
'2017-07-01', 'B', 'Z', 31720
)
another.tibble <- as_tibble(expand.grid(
a.month = unique(df$a.month),
a.group = unique(df$a.group),
other.group = unique(df$other.group),
amount = 0, stringsAsFactors = FALSE)
)
another.tibble %>%
left_join(df, by= c("a.month" = "a.month", "a.group" = "a.group", "other.group" = "other.group")) %>%
mutate(amount.x = ifelse(is.na(amount.y), 0, amount.y)) %>%
rename(amount = amount.x) %>%
select(1:4)
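The join above only covers the months that already occur in the data. If every month in between is needed as well, tidyr's complete() can build the grid and fill the gaps in one step. This is a sketch under two assumptions: a.month is converted to Date, and the month range runs from the earliest to the latest month in the data (the question's months.list is not shown). The name filled is only illustrative.

```r
library(dplyr)
library(tidyr)
library(tibble)

df <- tribble(
  ~a.month,     ~a.group, ~other.group, ~amount,
  '2016-02-01', 'A', 'X', 15320,
  '2016-05-01', 'A', 'Z', 50079,
  '2016-06-01', 'A', 'Y', 60564,
  '2016-08-01', 'A', 'X', 10540,
  '2017-01-01', 'B', 'X', 30020,
  '2017-03-01', 'B', 'X', 76310,
  '2017-04-01', 'B', 'Y', 44215,
  '2017-05-01', 'A', 'Y', 67241,
  '2017-06-01', 'A', 'Z', 17180,
  '2017-07-01', 'B', 'Z', 31720
) %>%
  mutate(a.month = as.Date(a.month))

filled <- df %>%
  # every month in the range x every a.group x every other.group;
  # combinations missing from df get amount = 0
  complete(a.month = seq(min(a.month), max(a.month), by = "month"),
           a.group, other.group,
           fill = list(amount = 0))
```

The existing amounts are kept as they are; only the combinations that were absent from df receive the fill value.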
