Why does this dplyr group function give strange results?

When I run the below reproducible code I get the desired grouping results in the GroupRank column shown immediately beneath:
library(dplyr)
myData <-
  data.frame(
    Element = c("A","A","B","A","C","C"),
    Group = c(0,0,0,0,1,1)
  )
myDataGroups <- myData %>%
  mutate(origOrder = row_number()) %>%
  group_by(Element) %>%
  mutate(ElementCnt = row_number()) %>%
  ungroup() %>%
  mutate(Group = factor(Group, unique(Group))) %>%
  arrange(Group) %>%
  mutate(groupCt = cumsum(Group != lag(Group, 1, Group[[1]])) - 1L) %>%
  group_by(Group) %>%
  mutate(GroupRank = ElementCnt - max(0L,groupCt),
         GroupRank = if_else(as.character(Group) == "0", ElementCnt, min(GroupRank))
  ) %>%
  ungroup() %>%
  arrange(origOrder)
myDataGroups
> myDataGroups
# A tibble: 6 x 6
  Element Group origOrder ElementCnt groupCt GroupRank
  <chr>   <fct>     <int>      <int>   <int>     <int>
1 A       0             1          1      -1         1
2 A       0             2          2      -1         2
3 B       0             3          1      -1         1
4 A       0             4          3      -1         3
5 C       1             5          1       0         1
6 C       1             6          2       0         1
However, when I take the line GroupRank = if_else(as.character(Group) == "0", ElementCnt, min(GroupRank)) from the above code and simply wrap it in max(), like this: GroupRank = max(1L, if_else(as.character(Group) == "0", ElementCnt, min(GroupRank))) (I tried both 1 and 1L and got the same result), I get the strange output shown below. GroupRank shouldn't have changed from the above output:
  Element Group origOrder ElementCnt groupCt GroupRank
  <chr>   <fct>     <int>      <int>   <int>     <int>
1 A       0             1          1      -1         3
2 A       0             2          2      -1         3
3 B       0             3          1      -1         3
4 A       0             4          3      -1         3
5 C       1             5          1       0         1
6 C       1             6          2       0         1
What am I doing wrong here? Am I using max() incorrectly?

Note the difference between max() and pmax().
max(1:5, 5:1)
#> [1] 5
pmax(1:5, 5:1)
#> [1] 5 4 3 4 5
max() returns a single value (a scalar), which is why you get a constant value per group. pmax() does what you apparently expect: it returns the element-wise (parallel) maximum as a vector.
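As a minimal sketch of the difference inside a grouped mutate(), assuming a made-up toy data frame (toy, with_max and with_pmax are hypothetical names, not taken from the question):

```r
library(dplyr)

# Hypothetical toy data: two groups with a per-row count.
toy <- data.frame(
  Group      = c("0", "0", "0", "0", "1", "1"),
  ElementCnt = c(1L, 2L, 1L, 3L, 1L, 2L)
)

res <- toy %>%
  group_by(Group) %>%
  mutate(
    with_max  = max(1L, ElementCnt),   # one value per group, recycled to every row
    with_pmax = pmax(1L, ElementCnt)   # element-wise: one value per row
  ) %>%
  ungroup()

res$with_max
#> [1] 3 3 3 3 2 2
res$with_pmax
#> [1] 1 2 1 3 1 2
```

Here with_max repeats each group's largest value on every row, while with_pmax keeps one value per row, which appears to be what the question intended.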

Related

Average across rows and sum across columns if condition is met in R dataframe

I have an R dataframe that looks like this:
chr bp instances_1 instances_2 instances_sum
1 143926410 0 1 1
1 144075771 1 0 1
1 187762696 0 2 2
1 187783844 2 0 2
2 121596288 0 1 1
2 122042325 3 0 3
2 259939985 1 0 1
2 259991389 0 1 1
What I would like to do is group by 'chr', determine if two rows are within 1e7 base-pairs ('bp') from one another, and if they are, retain the average (and round the average) and sum across all other columns that met the condition. So, the final product would look like:
chr bp instances_1 instances_2 instances_sum
1 144001091 1 1 2
1 187773270 2 2 4
2 121819307 3 1 4
2 259965687 1 1 2
I tried to adapt the following code (using tidyverse), which I had used for a similar task that worked over multiple columns:
df_Pruned <- df |>
group_by(chr_snp1, chr_snp2) |>
mutate(grp = (abs(loc_snp1 - lag(loc_snp1, default = first(loc_snp1))) < 1e7) &
(abs(loc_snp2 - lag(loc_snp2, default = first(loc_snp2))) < 1e7)) |>
group_by(grp, .add=TRUE) |>
filter(pval == min(pval)) |>
ungroup()|>
select(-grp)
into this by trying to do the same over one grouping variable ('chr') and by trying to average and sum at the same time:
df_Pruned <- df |>
group_by(chr) |>
mutate(grp = (abs(bp - lag(bp, default = first(bp))) < 1e7)) |>
group_by(grp, .add=TRUE) |>
filter(bp == mean(bp) & instances_sum == sum(instances_sum)) |>
ungroup()|>
select(-grp)
But I can't get it to work. I think I'm close but could use some help.

Using cumsum with the lag condition produces your expected output:
df |>
  mutate(grp = cumsum(abs(bp - lag(bp, default = first(bp))) > 1e7)) |>
  group_by(chr, grp) |>
  summarise(bp = mean(bp),
            across(starts_with("instance"), sum),
            .groups = "drop")
# A tibble: 4 × 6
chr grp bp instances_1 instances_2 instances_sum
<int> <int> <dbl> <int> <int> <int>
1 1 0 144001090. 1 1 2
2 1 1 187773270 2 2 4
3 2 2 121819306. 3 1 4
4 2 3 259965687 1 1 2
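To see why cumsum() over the lag comparison builds the grouping key, here is a small base R sketch using the first four bp values from the question (prev, gap, grp_id and means are illustrative names, not part of the answer):

```r
# First four positions from the question (all on chr 1).
bp <- c(143926410, 144075771, 187762696, 187783844)

# Previous position for each row; the first row is compared with itself.
prev <- c(bp[1], head(bp, -1))

# TRUE wherever the jump from the previous row exceeds 1e7 base pairs.
gap <- abs(bp - prev) > 1e7

# Each TRUE starts a new run, so the running total acts as a cluster id.
grp_id <- cumsum(gap)
grp_id
#> [1] 0 0 1 1

# Mean position per cluster.
means <- tapply(bp, grp_id, mean)
```

If whole-number positions are needed, round(mean(bp)) is one option. Note that R rounds exact halves to the even digit, so 144001090.5 should become 144001090 rather than the 144001091 shown in the question.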

Create dyadic (relational) data from monadic data

I have conflict data that looks like this
conflict_ID country_code SideA
1 1 1
1 2 1
1 3 0
2 4 1
2 5 0
Now I want to make it into dyadic conflict data that looks like this (SideA=1 should be country_code_1):
conflict_ID country_code_1 country_code_2
1 1 3
1 2 3
2 4 5
Can anyone point me in the right direction?

Here's a direct approach:
df %>%
  filter(SideA == 1) %>%
  select(conflict_ID, country_code_1 = country_code) %>%
  left_join(
    df %>%
      filter(SideA == 0) %>%
      select(conflict_ID, country_code_2 = country_code),
    by = "conflict_ID"
  )
# conflict_ID country_code_1 country_code_2
# 1 1 1 3
# 2 1 2 3
# 3 2 4 5
Using this data:
df = read.table(text = 'conflict_ID country_code SideA
1 1 1
1 2 1
1 3 0
2 4 1
2 5 0 ', header = T)
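For comparison, the same split-and-join idea can be sketched in base R with merge(), assuming the data above (side_a, side_b and dyads are illustrative names):

```r
df <- data.frame(
  conflict_ID  = c(1, 1, 1, 2, 2),
  country_code = c(1, 2, 3, 4, 5),
  SideA        = c(1, 1, 0, 1, 0)
)

# Split the rows by side and keep only the key and the country code.
side_a <- df[df$SideA == 1, c("conflict_ID", "country_code")]
side_b <- df[df$SideA == 0, c("conflict_ID", "country_code")]
names(side_a)[2] <- "country_code_1"
names(side_b)[2] <- "country_code_2"

# Left join: every SideA country is paired with each opponent in the same conflict.
dyads <- merge(side_a, side_b, by = "conflict_ID", all.x = TRUE)
```

With all.x = TRUE, a SideA country whose conflict has no opposing row is kept with NA in country_code_2, mirroring left_join().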

This extends the previous issue you posted. You could produce all combinations for each conflict_ID, and filter out those combinations where country_code_2 matches country_code with SideA == 1.
library(dplyr)
library(tidyr)
# note: dplyr >= 1.1.0 warns when summarise() returns several rows per group;
# reframe() is the newer verb for that case
df %>%
  group_by(conflict_ID) %>%
  summarise(country_code = combn(country_code, 2, sort, simplify = FALSE),
            .groups = 'drop') %>%
  unnest_wider(country_code, names_sep = '_') %>%
  anti_join(filter(df, SideA == 1),
            by = c("conflict_ID", "country_code_2" = "country_code"))
# # A tibble: 3 × 3
# conflict_ID country_code_1 country_code_2
# <int> <int> <int>
# 1 1 1 3
# 2 1 2 3
# 3 2 4 5

Find 2 out of 3 conditions per ID

I have the following dataframe:
df <-read.table(header=TRUE, text="id code
1 A
1 B
1 C
2 A
2 A
2 A
3 A
3 B
3 A")
Per id, I would love to find those individuals that have at least 2 conditions, namely:
conditionA = "A"
conditionB = "B"
conditionC = "C"
and create a new column, "index", that is 1 if two or more conditions are met and 0 otherwise:
df_output <-read.table(header=TRUE, text="id code index
1 A 1
1 B 1
1 C 1
2 A 0
2 A 0
2 A 0
3 A 1
3 B 1
3 A 1")
So far I have tried the following:
df_output = df %>%
group_by(id) %>%
mutate(index = ifelse(grepl(conditionA|conditionB|conditionC, code), 1, 0))
and as you can see I am struggling to get the threshold count into the code.

You can create a vector of conditions, and then use %in% and sum to count the number of occurrences in each group. Use + (or ifelse) to convert logical into 1 and 0:
conditions = c("A", "B", "C")
df %>%
group_by(id) %>%
mutate(index = +(sum(unique(code) %in% conditions) >= 2))
id code index
1 1 A 1
2 1 B 1
3 1 C 1
4 2 A 0
5 2 A 0
6 2 A 0
7 3 A 1
8 3 B 1
9 3 A 1
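The same counting logic can be sketched in base R, assuming the question's df (rebuilt here so the block is self-contained; n_met is an illustrative name):

```r
df <- data.frame(
  id   = rep(1:3, each = 3),
  code = c("A", "B", "C", "A", "A", "A", "A", "B", "A")
)
conditions <- c("A", "B", "C")

# For each id, count how many of the conditions occur at least once.
n_met <- tapply(df$code, df$id, function(x) sum(conditions %in% x))

# Look up each row's count by id and flag ids that meet two or more conditions.
df$index <- as.integer(n_met[as.character(df$id)] >= 2)
df$index
#> [1] 1 1 1 0 0 0 1 1 1
```

Counting conditions %in% x rather than the rows themselves means repeated codes within an id are only counted once.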

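You could use n_distinct(), which is a faster and more concise equivalent of length(unique(x)). Note that it counts every distinct code in the group, so it matches the intended result only when code contains no values other than the three conditions.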
df %>%
group_by(id) %>%
mutate(index = +(n_distinct(code) >= 2)) %>%
ungroup()
# # A tibble: 9 × 3
# id code index
# <int> <chr> <int>
# 1 1 A 1
# 2 1 B 1
# 3 1 C 1
# 4 2 A 0
# 5 2 A 0
# 6 2 A 0
# 7 3 A 1
# 8 3 B 1
# 9 3 A 1

You can check the conditions with intersect() and test whether the resulting vector has at least the required length (e.g. 2).
conditions = c('A', 'B', 'C')
df_output2 =
df %>%
group_by(id) %>%
mutate(index = as.integer(length(intersect(code, conditions)) >= 2))

Multiplying column value by another value matching column name R

I have a data frame which looks like this:
Value1 = c("1","2","1","3")
Letter = c("A","B","B","A")
A = c("2","2","0","1")
B = c("1","1","1","0")
data <- data.frame(Value1,Letter,A,B)
data
Value1 Letter A B
1 1 A 2 1
2 2 B 2 1
3 1 B 0 1
4 3 A 1 0
I'm trying to add a new column which is the multiplication of column Value1, by column A or B depending on what is in the Letter column. The expected result would be:
Value1 Letter A B Results
1 1 A 2 1 2
2 2 B 2 1 2
3 1 B 0 1 1
4 3 A 1 0 3
I'm trying to use the match() function, but without success.
Thanks!

With base R:
data <- type.convert(data, as.is = TRUE)
data$Results <- ifelse(data$Letter == 'A', data$A * data$Value1, data$B * data$Value1)
Output
Value1 Letter A B Results
1 1 A 2 1 2
2 2 B 2 1 2
3 1 B 0 1 1
4 3 A 1 0 3
Another option would be to pivot to long form, do the calculation, then pivot back to wide format.
library(tidyverse)
data %>%
type.convert(as.is = TRUE) %>%
pivot_longer(c(A, B)) %>%
mutate(Results = ifelse(Letter == name, value * Value1, NA_integer_)) %>%
pivot_wider(names_from = "name", values_from = "value") %>%
group_by(Value1, Letter) %>%
summarise_all(discard, is.na)
Output
Value1 Letter Results A B
<int> <chr> <int> <int> <int>
1 1 A 2 2 1
2 1 B 1 0 1
3 2 B 2 2 1
4 3 A 3 1 0

Use case_when or ifelse
library(dplyr)
data <- data %>%
type.convert(as.is = TRUE) %>%
mutate(Results = case_when(Letter == 'A' ~ A * Value1,
TRUE ~ B * Value1))
-output
data
Value1 Letter A B Results
1 1 A 2 1 2
2 2 B 2 1 2
3 1 B 0 1 1
4 3 A 1 0 3
Or use get with rowwise
data <- data %>%
type.convert(as.is = TRUE) %>%
rowwise %>%
mutate(Result = get(Letter) * Value1) %>%
# or may also use
# mutate(Result = cur_data()[[Letter]] * Value1) %>%
ungroup
-output
data
# A tibble: 4 × 5
Value1 Letter A B Result
<int> <chr> <int> <int> <int>
1 1 A 2 1 2
2 2 B 2 1 2
3 1 B 0 1 1
4 3 A 1 0 3
In base R, we may use row/column indexing as a vectorized option
data <- type.convert(data, as.is = TRUE)
nm1 <- unique(data$Letter)
data$Results <-data[nm1][cbind(seq_len(nrow(data)),
match(data$Letter, nm1))] * data$Value1
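The row/column indexing above can be unpacked step by step. This is only a sketch, assuming the columns are already numeric (as they are after type.convert()); lookup and idx are illustrative names:

```r
data <- data.frame(
  Value1 = c(1L, 2L, 1L, 3L),
  Letter = c("A", "B", "B", "A"),
  A      = c(2L, 2L, 0L, 1L),
  B      = c(1L, 1L, 1L, 0L)
)

# Matrix of the candidate multiplier columns.
lookup <- as.matrix(data[c("A", "B")])

# Two-column index matrix: (row number, position of the column named in Letter).
idx <- cbind(seq_len(nrow(data)), match(data$Letter, colnames(lookup)))

# Indexing a matrix with a two-column matrix picks one cell per row.
data$Results <- lookup[idx] * data$Value1
data$Results
#> [1] 2 2 1 3
```

Because one cell is picked per row in a single indexing call, no rowwise() or per-row get() lookup is needed.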

R: Slicing a grouped data frame conditional on a column

I have a data frame with a group, a condition that differs by group, and an index within each group:
df <- data.frame(group = c(rep(c("A", "B", "C"), each = 3)),
condition = rep(c(0,1,1), each = 3),
index = c(1:3,1:3,2:4))
> df
group condition index
1 A 0 1
2 A 0 2
3 A 0 3
4 B 1 1
5 B 1 2
6 B 1 3
7 C 1 2
8 C 1 3
9 C 1 4
I would like to slice the data within each group, filtering out all but the row with the lowest index. However, this filter should only be applied when the condition applies, i.e., condition == 1. My solution was to compute a ranking on the index within each group and filter on the combination of condition and rank:
df %>%
group_by(group) %>%
mutate(rank = order(index)) %>%
filter(case_when(condition == 0 ~ TRUE,
condition == 1 & rank == 1 ~ TRUE))
# A tibble: 5 x 4
# Groups: group [3]
group condition index rank
<chr> <dbl> <int> <int>
1 A 0 1 1
2 A 0 2 2
3 A 0 3 3
4 B 1 1 1
5 C 1 2 1
This left me wondering whether there is a faster solution that does not require a separate ranking variable, and potentially uses slice_min() instead.

You can use filter() to keep all cases where the condition is zero or the index equals the minimum index.
library(dplyr)
df %>%
group_by(group) %>%
filter(condition == 0 | index == min(index))
# A tibble: 5 x 3
# Groups: group [3]
group condition index
<chr> <dbl> <int>
1 A 0 1
2 A 0 2
3 A 0 3
4 B 1 1
5 C 1 2

An option with slice
library(dplyr)
df %>%
group_by(group) %>%
slice(unique(c(which(condition == 0), which.min(index))))
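For reference, the same filter can be sketched in base R with ave(), assuming the question's df (is_min and kept are illustrative names):

```r
df <- data.frame(
  group     = rep(c("A", "B", "C"), each = 3),
  condition = rep(c(0, 1, 1), each = 3),
  index     = c(1:3, 1:3, 2:4)
)

# ave() returns the per-group minimum aligned with the original rows.
is_min <- df$index == ave(df$index, df$group, FUN = min)

# Keep every row where the condition is off, plus the minimum-index row otherwise.
kept <- df[df$condition == 0 | is_min, ]
rownames(kept)
#> [1] "1" "2" "3" "4" "7"
```

Like the filter() answer, this keeps all tied rows if several share the minimum index within a group, whereas which.min() in the slice() answer keeps only the first.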
