R: Slicing a grouped data frame conditional on a column

I have a data frame with a group, a condition that differs by group, and an index within each group:
df <- data.frame(group = c(rep(c("A", "B", "C"), each = 3)),
condition = rep(c(0,1,1), each = 3),
index = c(1:3,1:3,2:4))
> df
group condition index
1 A 0 1
2 A 0 2
3 A 0 3
4 B 1 1
5 B 1 2
6 B 1 3
7 C 1 2
8 C 1 3
9 C 1 4
I would like to slice the data within each group, filtering out all but the row with the lowest index. However, this filter should only be applied when the condition applies, i.e., condition == 1. My solution was to compute a ranking on the index within each group and filter on the combination of condition and rank:
df %>%
group_by(group) %>%
mutate(rank = min_rank(index)) %>%
filter(case_when(condition == 0 ~ TRUE,
condition == 1 & rank == 1 ~ TRUE))
# A tibble: 5 x 4
# Groups: group [3]
group condition index rank
<chr> <dbl> <int> <int>
1 A 0 1 1
2 A 0 2 2
3 A 0 3 3
4 B 1 1 1
5 C 1 2 1
This left me wondering whether there is a faster solution that does not require a separate ranking variable, and potentially uses slice_min() instead.

You can use filter() to keep all cases where the condition is zero or the index equals the minimum index.
library(dplyr)
df %>%
group_by(group) %>%
filter(condition == 0 | index == min(index))
# A tibble: 5 x 3
# Groups: group [3]
group condition index
<chr> <dbl> <int>
1 A 0 1
2 A 0 2
3 A 0 3
4 B 1 1
5 C 1 2
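As a dependency-free cross-check of the filter above, the same rule can be sketched in base R. Using ave() to recycle each group's minimum index onto its rows is an assumption of this sketch rather than part of the answer, and `group_min` and `result` are illustrative names.

```r
df <- data.frame(group = rep(c("A", "B", "C"), each = 3),
                 condition = rep(c(0, 1, 1), each = 3),
                 index = c(1:3, 1:3, 2:4))

# Per-group minimum of index, repeated for every row of its group
group_min <- ave(df$index, df$group, FUN = min)

# Keep all rows where the condition does not apply,
# otherwise only the rows sitting at the group minimum
result <- df[df$condition == 0 | df$index == group_min, ]
print(result)
```

Comparing against the minimum keeps every row tied for the lowest index, whereas which.min() returns only the first such row, so the two approaches can differ when a group has ties.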

An option with slice
library(dplyr)
df %>%
group_by(group) %>%
slice(unique(c(which(condition == 0), which.min(index))))
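Since the question asks about slice_min(), here is one possible sketch, assuming dplyr 1.0.0 or later (where slice_min() is available). It splits the data on the condition and recombines it, which is a different technique from the slice() call above; `untouched`, `lowest`, and `result` are illustrative names.

```r
library(dplyr)

df <- data.frame(group = rep(c("A", "B", "C"), each = 3),
                 condition = rep(c(0, 1, 1), each = 3),
                 index = c(1:3, 1:3, 2:4))

# Rows where the condition does not apply are kept untouched
untouched <- filter(df, condition == 0)

# Rows where the condition applies are reduced to the lowest index per group
lowest <- df %>%
  filter(condition == 1) %>%
  group_by(group) %>%
  slice_min(index, n = 1, with_ties = FALSE) %>%
  ungroup()

# Recombine and restore the original ordering
result <- bind_rows(untouched, lowest) %>%
  arrange(group, index)
print(result)
```

With with_ties = FALSE exactly one row per group is returned; the default, with_ties = TRUE, would keep all rows tied for the minimum.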

Related

Find 2 out of 3 conditions per ID

I have the following dataframe:
df <-read.table(header=TRUE, text="id code
1 A
1 B
1 C
2 A
2 A
2 A
3 A
3 B
3 A")
Per id, I would love to find those individuals that have at least 2 conditions, namely:
conditionA = "A"
conditionB = "B"
conditionC = "C"
and create a new column, "index", that is 1 if two or more conditions are met and 0 otherwise:
df_output <-read.table(header=TRUE, text="id code index
1 A 1
1 B 1
1 C 1
2 A 0
2 A 0
2 A 0
3 A 1
3 B 1
3 A 1")
So far I have tried the following:
df_output = df %>%
group_by(id) %>%
mutate(index = ifelse(grepl(conditionA|conditionB|conditionC, code), 1, 0))
and as you can see I am struggling to get the threshold count into the code.
You can create a vector of conditions, and then use %in% and sum to count the number of occurrences in each group. Use + (or ifelse) to convert logical into 1 and 0:
conditions = c("A", "B", "C")
df %>%
group_by(id) %>%
mutate(index = +(sum(unique(code) %in% conditions) >= 2))
id code index
1 1 A 1
2 1 B 1
3 1 C 1
4 2 A 0
5 2 A 0
6 2 A 0
7 3 A 1
8 3 B 1
9 3 A 1
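To make the two steps in this answer concrete, the sketch below counts the matched conditions per id with base R's tapply() (an assumption of this sketch; the answer itself uses group_by() and mutate()) and then shows the logical-to-integer conversion performed by the unary +. The names `n_met`, `meets_threshold`, and `index_per_id` are illustrative.

```r
df <- read.table(header = TRUE, text = "id code
1 A
1 B
1 C
2 A
2 A
2 A
3 A
3 B
3 A")
conditions <- c("A", "B", "C")

# Number of the listed conditions that occur at least once within each id
n_met <- tapply(df$code, df$id, function(x) sum(conditions %in% x))
print(n_met)

# The comparison yields a logical vector; unary + coerces TRUE/FALSE to 1/0
meets_threshold <- n_met >= 2
index_per_id <- +meets_threshold
print(index_per_id)

# Map the per-id result back onto the rows
df$index <- as.integer(index_per_id[as.character(df$id)])
print(df)
```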
You could use n_distinct(), which is a faster and more concise equivalent of length(unique(x)).
df %>%
group_by(id) %>%
mutate(index = +(n_distinct(code) >= 2)) %>%
ungroup()
# A tibble: 9 × 3
# id code index
# <int> <chr> <int>
# 1 1 A 1
# 2 1 B 1
# 3 1 C 1
# 4 2 A 0
# 5 2 A 0
# 6 2 A 0
# 7 3 A 1
# 8 3 B 1
# 9 3 A 1
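One difference between this answer and the previous one is worth illustrating: n_distinct(code) counts every distinct code, whereas unique(code) %in% conditions counts only codes that appear in the conditions vector. The two agree on the example data because every code there is a listed condition. The sketch below uses base R's length(unique()) as a stand-in for n_distinct() and a made-up group containing a code "D" that does not occur in the original data.

```r
conditions <- c("A", "B", "C")

# Hypothetical codes for a single id; "D" is not one of the conditions
codes <- c("A", "D", "A")

# Counting all distinct codes: "A" and "D" give 2, so the threshold is met
any_distinct <- +(length(unique(codes)) >= 2)

# Counting only the listed conditions: just "A" matches, so it is not met
listed_only <- +(sum(unique(codes) %in% conditions) >= 2)

print(c(any_distinct = any_distinct, listed_only = listed_only))
```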
You can use intersect() to find which of the conditions occur in each group, and then check whether the result has at least the required length (e.g., 2).
conditions = c('A', 'B', 'C')
df_output2 =
df %>%
group_by(id) %>%
mutate(index = as.integer(length(intersect(code, conditions)) >= 2))

Multiplying column value by another value matching column name R

I have a data frame which looks like this:
Value1 = c("1","2","1","3")
Letter = c("A","B","B","A")
A = c("2","2","0","1")
B = c("1","1","1","0")
data <- data.frame(Value1,Letter,A,B)
data
Value1 Letter A B
1 1 A 2 1
2 2 B 2 1
3 1 B 0 1
4 3 A 1 0
I'm trying to add a new column which is the multiplication of column Value1, by column A or B depending on what is in the Letter column. The expected result would be:
Value1 Letter A B Results
1 1 A 2 1 2
2 2 B 2 1 2
3 1 B 0 1 1
4 3 A 1 0 3
I'm trying to use the match() function, but without success.
Thanks!
With base R:
data <- type.convert(data, as.is = TRUE)
data$Results <- ifelse(data$Letter == 'A', data$A * data$Value1, data$B * data$Value1)
Output
Value1 Letter A B Results
1 1 A 2 1 2
2 2 B 2 1 2
3 1 B 0 1 1
4 3 A 1 0 3
Another option would be to pivot to long form, do the calculation, then pivot back to wide format.
library(tidyverse)
data %>%
type.convert(as.is = TRUE) %>%
pivot_longer(c(A, B)) %>%
mutate(Results = ifelse(Letter == name, value * Value1, NA_integer_)) %>%
pivot_wider(names_from = "name", values_from = "value") %>%
group_by(Value1, Letter) %>%
summarise_all(discard, is.na)
Output
Value1 Letter Results A B
<int> <chr> <int> <int> <int>
1 1 A 2 2 1
2 1 B 1 0 1
3 2 B 2 2 1
4 3 A 3 1 0
Use case_when or ifelse
library(dplyr)
data <- data %>%
type.convert(as.is = TRUE) %>%
mutate(Results = case_when(Letter == 'A' ~ A * Value1,
TRUE ~ B * Value1))
Output
data
Value1 Letter A B Results
1 1 A 2 1 2
2 2 B 2 1 2
3 1 B 0 1 1
4 3 A 1 0 3
Or use get with rowwise
data <- data %>%
type.convert(as.is = TRUE) %>%
rowwise() %>%
mutate(Results = get(Letter) * Value1) %>%
# or may also use (note: cur_data() is deprecated as of dplyr 1.1.0 in favor of pick())
# mutate(Results = cur_data()[[Letter]] * Value1) %>%
ungroup()
Output
data
# A tibble: 4 × 5
Value1 Letter A B Results
<int> <chr> <int> <int> <int>
1 1 A 2 1 2
2 2 B 2 1 2
3 1 B 0 1 1
4 3 A 1 0 3
In base R, we may use row/column indexing as a vectorized option: a two-column matrix of (row, column) positions picks one cell per row.
data <- type.convert(data, as.is = TRUE)
nm1 <- unique(data$Letter)
data$Results <-data[nm1][cbind(seq_len(nrow(data)),
match(data$Letter, nm1))] * data$Value1
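The matrix-indexing step is compact, so here is a sketch that unpacks it. It assumes the columns are already numeric (instead of calling type.convert()) and converts the lookup columns to a matrix explicitly, which the answer leaves implicit; `cols`, `m`, `idx`, and `picked` are illustrative names.

```r
data <- data.frame(Value1 = c(1, 2, 1, 3),
                   Letter = c("A", "B", "B", "A"),
                   A = c(2, 2, 0, 1),
                   B = c(1, 1, 1, 0))

cols <- c("A", "B")
m <- as.matrix(data[cols])

# One (row, column) pair per row of data: the row number and the
# position of that row's Letter among the lookup columns
idx <- cbind(seq_len(nrow(data)), match(data$Letter, cols))
print(idx)

# Indexing a matrix with a two-column matrix extracts one cell per pair
picked <- m[idx]
print(picked)

data$Results <- picked * data$Value1
print(data)
```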

How can I filter by subjects who have all levels of a factor?

I am trying to filter a data set to only include subjects who have data in all conditions (levels of a factor).
I have tried to filter by calculating the number of levels for each subject, but that does not work.
library(tidyverse)
Data <- data.frame(
Subject = factor(c(rep(1, 3),
rep(2, 3),
rep(3, 1))),
Condition = factor(c("A", "B", "C",
"A", "B", "C",
"A")),
Val = c(1, 0, 1,
0, 0, 1,
1)
)
Data %>%
semi_join(
.,
Data %>%
group_by(Subject) %>%
summarize(Num_Cond = length(levels(Condition))) %>%
filter(Num_Cond == 3),
by = "Subject"
)
This attempt yields:
Subject Condition Val
1 1 A 1
2 1 B 0
3 1 C 1
4 2 A 0
5 2 B 0
6 2 C 1
7 3 A 1
Desired output:
Subject Condition Val
1 1 A 1
2 1 B 0
3 1 C 1
4 2 A 0
5 2 B 0
6 2 C 1
I want to filter subject 3 out because they only have data for one condition.
Is there a dplyr/tidyverse approach for this problem?
We can create a condition with all and levels
library(dplyr)
Data %>%
group_by(Subject) %>%
filter(all(levels(Condition) %in% Condition))
# A tibble: 6 x 3
# Groups: Subject [2]
# Subject Condition Val
# <fct> <fct> <dbl>
#1 1 A 1
#2 1 B 0
#3 1 C 1
#4 2 A 0
#5 2 B 0
#6 2 C 1
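The reason the attempt in the question keeps subject 3 is that subsetting a factor does not drop its unused levels, so length(levels(Condition)) is 3 in every group. A small base R sketch of that behavior follows; `cond_s3`, `n_levels`, `n_observed`, and `has_all` are illustrative names.

```r
Condition <- factor(c("A", "B", "C", "A", "B", "C", "A"))
Subject <- factor(c(1, 1, 1, 2, 2, 2, 3))

# The rows belonging to subject 3: a single observation of condition "A"
cond_s3 <- Condition[Subject == "3"]

# The subset still carries all three levels
print(levels(cond_s3))
n_levels <- length(levels(cond_s3))

# Only one condition is actually observed
n_observed <- length(unique(cond_s3))

# The all()/%in% test asks whether every level occurs in the data
has_all <- all(levels(cond_s3) %in% cond_s3)

print(c(n_levels = n_levels, n_observed = n_observed))
print(has_all)
```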
Or with n_distinct and nlevels
Data %>%
group_by(Subject) %>%
filter(nlevels(Condition) == n_distinct(Condition))
# A tibble: 6 x 3
# Groups: Subject [2]
# Subject Condition Val
# <fct> <fct> <dbl>
#1 1 A 1
#2 1 B 0
#3 1 C 1
#4 2 A 0
#5 2 B 0
#6 2 C 1
Here is a solution testing whether the number of rows in each group is equal to the number of levels of Condition.
Data %>%
group_by(Subject) %>%
filter(n() == nlevels(Condition))
## A tibble: 6 x 3
## Groups: Subject [2]
# Subject Condition Val
# <fct> <fct> <dbl>
#1 1 A 1
#2 1 B 0
#3 1 C 1
#4 2 A 0
#5 2 B 0
#6 2 C 1
Edit
Following the comment by user @akrun, I tested with a data set in which every row is duplicated, and the code above does fail.
bind_rows(Data, Data) %>%
group_by(Subject) %>%
#distinct() %>%
filter(n() == nlevels(Condition))
## A tibble: 0 x 3
## Groups: Subject [0]
## ... with 3 variables: Subject <fct>, Condition <fct>, Val <dbl>
Uncommenting the distinct() line solves the problem.
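The failure comes from counting rows rather than distinct conditions. The base R sketch below contrasts the two counts on the duplicated data; tapply() and the names `doubled`, `rows_per_subject`, `distinct_per_subject`, and `keep` are assumptions of this sketch, not part of the answer.

```r
Data <- data.frame(Subject = factor(c(1, 1, 1, 2, 2, 2, 3)),
                   Condition = factor(c("A", "B", "C", "A", "B", "C", "A")))

# Every row appears twice
doubled <- rbind(Data, Data)

# Row counts no longer match the number of levels for any subject
rows_per_subject <- tapply(doubled$Condition, doubled$Subject, length)
print(rows_per_subject)

# Distinct conditions per subject are unaffected by the duplication
distinct_per_subject <- tapply(doubled$Condition, doubled$Subject,
                               function(x) length(unique(x)))
print(distinct_per_subject)

# Subjects observed in every level of Condition
keep <- names(distinct_per_subject)[distinct_per_subject == nlevels(doubled$Condition)]
print(keep)
```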
I found a relatively simple solution by sub-setting on Subject:
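# Caveat: levels(Condition)[Subject] has one element per row of the group, so
# Num_Cond is effectively the group's row count; it equals the number of
# conditions only when a subject has at most one row per condition.
Data %>%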
semi_join(
.,
Data %>%
group_by(Subject) %>%
droplevels() %>%
summarize(Num_Cond = length(levels(Condition)[Subject])) %>%
filter(Num_Cond == 3),
by = "Subject"
)
This gives the desired output:
Subject Condition Val
1 1 A 1
2 1 B 0
3 1 C 1
4 2 A 0
5 2 B 0
6 2 C 1

Flagging row that meets two conditions

For a given ID, I am trying to identify the latest observation (last wave, i.e., highest wave number) that meets a criterion (var equal to 1 or 2).
My data:
data <- data.frame(id=c(1,1,1, 2,2,2, 3,3,3), wave=c(1,2,3, 1,2,3, 1,2,3), var=c(NA,1,2, 1,2,NA, 3,1,3))
Outcome:
outcome <- data.frame(id=c(1,1,1, 2,2,2, 3,3,3), wave=c(1,2,3, 1,2,3, 1,2,3), var=c(NA,1,2, 1,2,NA, 3,1,3), flag=c(0,0,1, 0,1,0, 0,1,0))
I can't seem to figure out how to specify to only flag the latest/last row for a given id
data %>% group_by(id) %>% mutate(flag=if_else(var %in% c(1,2) & ...,1,0))
Subset 'wave' to the rows where 'var' is 1 or 2, take the max, compare it (==) with the 'wave' column, and convert the result to integer
library(dplyr)
data %>%
group_by(id) %>%
mutate(flag = as.integer(wave == max(wave[var %in% 1:2])))
# A tibble: 9 x 4
# Groups: id [3]
# id wave var flag
# <dbl> <dbl> <dbl> <int>
#1 1 1 NA 0
#2 1 2 1 0
#3 1 3 2 1
#4 2 1 1 0
#5 2 2 2 1
#6 2 3 NA 0
#7 3 1 3 0
#8 3 2 1 1
#9 3 3 3 0
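Here, we assume that the 'wave' values are unique within each 'id'. If an 'id' has no row where 'var' is 1 or 2, max() of the empty subset returns -Inf with a warning, so no row is flagged for that 'id'.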
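The same steps can be sketched in base R in a way that also avoids calling max() on an empty subset. This is an alternative formulation under the assumption that ave() is acceptable in place of group_by(); `hit` and `latest` are illustrative names.

```r
data <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                   wave = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                   var = c(NA, 1, 2, 1, 2, NA, 3, 1, 3))

# Rows meeting the criterion; %in% treats NA as "no match"
hit <- data$var %in% c(1, 2)

# Latest qualifying wave per id, or NA when an id has no qualifying row
latest <- ave(ifelse(hit, data$wave, NA_real_), data$id,
              FUN = function(w) if (all(is.na(w))) NA_real_ else max(w, na.rm = TRUE))

# Flag only qualifying rows that sit at the latest qualifying wave
data$flag <- as.integer(hit & data$wave == latest)
print(data)
```

Requiring `hit` in the final comparison also keeps a non-qualifying row from being flagged if it shares a wave number with a qualifying one.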

Filter (subset) by conditions in 2 columns in R (dplyr or otherwise)

Given a dataset such as:
set.seed(134)
df<- data.frame(ID= rep(LETTERS[1:5], each=2),
condition=rep(0:1, 5),
value=rpois(10, 3)
)
df
ID condition value
1 A 0 2
2 A 1 3
3 B 0 5
4 B 1 2
5 C 0 3
6 C 1 1
7 D 0 2
8 D 1 4
9 E 0 1
10 E 1 5
For each ID, when the value for condition==0 is less than the value for condition==1, I want to keep both observations. When the value for condition==0 is greater than condition==1, I want to keep only the row for condition==0.
The subset returned should be this:
ID condition value
1 A 0 2
2 A 1 3
3 B 0 5
5 C 0 3
7 D 0 2
8 D 1 4
9 E 0 1
10 E 1 5
Using dplyr the first step is:
df %>% group_by(ID) %>%
But not sure where to go from there.
Translating fairly literally,
library(dplyr)
set.seed(134)
df <- data.frame(ID = rep(LETTERS[1:5], each = 2),
condition = rep(0:1, 5),
value = rpois(10, 3))
df %>% group_by(ID) %>%
filter(condition == 0 |
(condition == 1 & value > value[condition == 0]))
#> # A tibble: 8 x 3
#> # Groups: ID [5]
#> ID condition value
#> <fct> <int> <int>
#> 1 A 0 2
#> 2 A 1 3
#> 3 B 0 5
#> 4 C 0 3
#> 5 D 0 2
#> 6 D 1 4
#> 7 E 0 1
#> 8 E 1 5
This depends on each group having a single observation with condition == 0, but should otherwise be fairly robust.
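To make the per-group comparison explicit, here is a base R sketch that looks up each ID's condition == 0 value as a baseline. The values are hard-coded from the table printed in the question instead of being drawn with rpois(), and, like the answer, the sketch assumes exactly one condition == 0 row per ID; `zero_rows`, `baseline`, `keep`, and `result` are illustrative names.

```r
df <- data.frame(ID = rep(LETTERS[1:5], each = 2),
                 condition = rep(0:1, 5),
                 value = c(2, 3, 5, 2, 3, 1, 2, 4, 1, 5))

# The condition == 0 row of each ID supplies the baseline value
zero_rows <- df[df$condition == 0, ]
baseline <- zero_rows$value[match(df$ID, zero_rows$ID)]

# Always keep condition == 0; keep condition == 1 only above the baseline
keep <- df$condition == 0 | df$value > baseline
result <- df[keep, ]
print(result)
```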
This may not be the easiest way, but it should work as you want.
library(dplyr)
library(reshape2)
df %>%
dcast(ID ~ condition, value.var = 'value') %>% # cast to wide format
mutate(`1` = ifelse(`1` > `0`, `1`, NA)) %>% # set `1` to NA where it does not exceed `0`
melt('ID') %>% # melt as long format
arrange(ID) %>% # sort by ID
filter(complete.cases(.)) # remove NA rows
Output:
ID variable value
1 A 0 2
2 A 1 3
3 B 0 5
4 C 0 3
5 D 0 2
6 D 1 4
7 E 0 1
8 E 1 5
You always want the value from the first row in each group. You only want the value from the second row in each group if it's larger than the first.
This works:
df %>%
group_by(ID) %>%
filter(row_number() == 1 | value > lag(value))
Edit: as @alistaire points out, this method depends on the rows being in a particular order within each group, which it might be a good idea to guarantee as follows:
df %>%
arrange(ID, condition) %>%
group_by(ID) %>%
filter(row_number() == 1 | value > lag(value))

Resources