Equality of columns using dplyr - problem with missing values - r

I am using the dplyr package in R to test equality of two columns using the code below. The results work well except for missing values, where neither TRUE nor FALSE is returned:
mutate(check = if_else(only == count, TRUE, FALSE))
Any ideas on how I can tweak this syntax?
Thanks in advance!

Is this what you are looking for?
library(dplyr)
dat <- data.frame(only = c(1, NA, 2, 3, NA),
count = c(1, NA, 3, 2, 1))
dat %>%
mutate(check = if_else(only == count | is.na(only) & is.na(count),
TRUE, FALSE, missing = FALSE))
only count check
1 1 1 TRUE
2 NA NA TRUE
3 2 3 FALSE
4 3 2 FALSE
5 NA 1 FALSE
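The reason `missing = FALSE` is needed follows from R's three-valued logic: any comparison with NA yields NA, and `NA | FALSE` stays NA. A minimal base R sketch using the same vectors as above (the names `eq`, `raw` and `check` are illustrative and not part of the answer):

```r
only  <- c(1, NA, 2, 3, NA)
count <- c(1, NA, 3, 2, 1)

# Equality is NA wherever either side is missing
eq <- only == count

# NA | TRUE is TRUE, but NA | FALSE remains NA (row 5)
raw <- eq | (is.na(only) & is.na(count))

# Map the remaining NA to FALSE, as missing = FALSE does in if_else()
check <- !is.na(raw) & raw
print(check)
```

Row 5 is the only one that needs the final step: one value is missing and the other is not, so the combined condition is NA rather than FALSE.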

# If either value is missing, the columns match only when both are missing
mutate(check = if_else(is.na(only) | is.na(count),
is.na(only) == is.na(count), only == count))

You can try case_when. In the following code, any row where col1 == col2 is not TRUE returns FALSE. That covers both col1 != col2 and rows where either column is NA.
# Example data frame
dat <- data.frame(
col1 = c(1, 2, 3, 4, 5),
col2 = c(1, 2, NA, 4, 6)
)
library(dplyr)
dat2 <- dat %>%
mutate(check = case_when(
col1 == col2 ~TRUE,
TRUE ~FALSE
))
print(dat2)
# col1 col2 check
# 1 1 1 TRUE
# 2 2 2 TRUE
# 3 3 NA FALSE
# 4 4 4 TRUE
# 5 5 6 FALSE
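The fall-through `TRUE ~ FALSE` branch is what turns the NA comparison into FALSE. A base R sketch of the same idea, assuming the same example columns; `%in% TRUE` is one common idiom for this and is not part of the answer above:

```r
col1 <- c(1, 2, 3, 4, 5)
col2 <- c(1, 2, NA, 4, 6)

# col1 == col2 is NA in row 3; `%in% TRUE` never returns NA,
# so anything that is not exactly TRUE becomes FALSE
check <- (col1 == col2) %in% TRUE
print(check)
```

As with the case_when version, two missing values would be treated as not equal here.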

Related

Identify rows with a value greater than threshold, but only direct one above per group

Suppose we have a dataset with a grouping variable, a value, and a threshold that is constant within each group. I want to flag a value that is greater than the threshold, but only the first such value per group.
test <- data.frame(
grp = c("A", "A", "A", "B", "B", "B"),
value = c(1, 3, 5, 1, 3, 5),
threshold = c(4,4,4,2,2,2)
)
want <- data.frame(
grp = c("A", "A", "A", "B", "B", "B"),
value = c(1, 3, 5, 1, 3, 5),
threshold = c(4,4,4,2,2,2),
want = c(NA, NA, "yes", NA, "yes", NA)
)
In the table above, group A has a threshold of 4, and only the value 5 is higher. In group B the threshold is 2, and both 3 and 5 are higher; however, only the row with value 3 is marked.
I was able to do this by identifying which rows had value greater than threshold, then removing the repeated value:
library(dplyr)
test %>%
group_by(grp) %>%
mutate(want = if_else(value > threshold, "yes", NA_character_)) %>%
mutate(across(want, ~replace(.x, duplicated(.x), NA)))
I was wondering if there is a direct way to do this using a single logical statement rather than a two-step method, something along the lines of:
test %>%
group_by(grp) %>%
mutate(want = if_else(???, "yes", NA_character_))
The answer doesn't have to be in R either. Just an explanation of the logical steps would suffice as well. Perhaps using a rank?
Thank you!
library(dplyr)
test %>%
group_by(grp) %>%
mutate(want = (value > threshold), want = want & !lag(cumany(want), default = FALSE)) %>%
ungroup()
# # A tibble: 6 × 4
# grp value threshold want
# <chr> <dbl> <dbl> <lgl>
# 1 A 1 4 FALSE
# 2 A 3 4 FALSE
# 3 A 5 4 TRUE
# 4 B 1 2 FALSE
# 5 B 3 2 TRUE
# 6 B 5 2 FALSE
If you really want strings, you can if_else after this.
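The same "first hit per group" logic can be sketched in base R with a running count of hits inside each group. This is an illustration only; the names `hit` and `hits_so_far` are made up, and it assumes rows are already in the order in which "first" should be judged:

```r
test <- data.frame(
  grp = c("A", "A", "A", "B", "B", "B"),
  value = c(1, 3, 5, 1, 3, 5),
  threshold = c(4, 4, 4, 2, 2, 2)
)

# TRUE for every row above its group's threshold
hit <- test$value > test$threshold

# Running count of hits within each group; the first hit has count 1
hits_so_far <- ave(as.integer(hit), test$grp, FUN = cumsum)

# Flag only the first hit, as a string to match the desired output
test$want <- ifelse(hit & hits_so_far == 1, "yes", NA_character_)
print(test)
```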
Here is a more direct way.
The essential part:
min(which((value > threshold) == TRUE)) gives the position of the first TRUE within each group.
Next we use ifelse() to compare that position with the row number and set the value accordingly:
library(dplyr)
test %>%
group_by(grp) %>%
mutate(want = ifelse(row_number()==min(which((value > threshold) == TRUE)),
"yes", NA_character_))
grp value threshold want
<chr> <dbl> <dbl> <chr>
1 A 1 4 NA
2 A 3 4 NA
3 A 5 4 yes
4 B 1 2 NA
5 B 3 2 yes
6 B 5 2 NA
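One edge case worth knowing: if a group has no value above its threshold, which() returns an empty vector and min() returns Inf with a warning. A hedged alternative is `match(TRUE, cond)`, which returns NA in that case. The helper name `first_true` below is hypothetical:

```r
# Position of the first TRUE, or NA if there is none
first_true <- function(cond) match(TRUE, cond)

first_true(c(FALSE, TRUE, TRUE))   # first hit is at position 2
first_true(c(FALSE, FALSE))        # no hit: NA, without a warning

# For comparison, the min(which()) form on an all-FALSE vector
suppressWarnings(min(which(c(FALSE, FALSE))))  # Inf
```

Inside the mutate() call, `row_number() == first_true(value > threshold)` would then be NA for every row of such a group, and ifelse() returns NA for an NA condition.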
This is a perfect chance for a data.table answer using its non-equi matching and multiple match handling capabilities:
library(data.table)
setDT(test)
test[test, on=.(grp, value>threshold), mult="first", flag := TRUE]
test
# grp value threshold flag
# <char> <num> <num> <lgcl>
#1: A 1 4 NA
#2: A 3 4 NA
#3: A 5 4 TRUE
#4: B 1 2 NA
#5: B 3 2 TRUE
#6: B 5 2 NA
This finds the "first" matching value in each group that is greater than (>) the threshold and sets (:=) its flag to TRUE.

Remove non-last rows with certain condition per group

I have the following dataframe called df (dput below):
group indicator value
1 A FALSE 2
2 A FALSE 1
3 A FALSE 2
4 A TRUE 4
5 B FALSE 5
6 B FALSE 1
7 B TRUE 3
I would like to remove the non-last rows with indicator == FALSE per group. This means that in df the rows: 1,2 and 5 should be removed because they are not the last rows with FALSE per group. Here is the desired output:
group indicator value
1 A FALSE 2
2 A TRUE 4
3 B FALSE 1
4 B TRUE 3
So I was wondering if anyone knows how to remove non-last rows with certain condition per group in R?
dput of df:
df <- structure(list(group = c("A", "A", "A", "A", "B", "B", "B"),
indicator = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE
), value = c(2, 1, 2, 4, 5, 1, 3)), class = "data.frame", row.names = c(NA,
-7L))
Filter using last(which()) to find the row number of the last FALSE row per group:
library(dplyr)
df %>%
group_by(group) %>%
filter(indicator | row_number() == last(which(!indicator))) %>%
ungroup()
# A tibble: 4 × 3
group indicator value
<chr> <lgl> <dbl>
1 A FALSE 2
2 A TRUE 4
3 B FALSE 1
4 B TRUE 3
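For comparison, a base R sketch of the same filter. It relies on `duplicated(..., fromLast = TRUE)`, which is a different technique from the one in the answer above; the variable names are illustrative:

```r
df <- data.frame(
  group = c("A", "A", "A", "A", "B", "B", "B"),
  indicator = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE),
  value = c(2, 1, 2, 4, 5, 1, 3)
)

# Scanning from the bottom, the last row of each (group, indicator)
# pair is the only one not flagged as a duplicate
is_last_of_kind <- !duplicated(df[c("group", "indicator")], fromLast = TRUE)

# Keep every TRUE row plus the last FALSE row of each group
keep <- df$indicator | is_last_of_kind
df[keep, ]
```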
You can do this with lead and check if the following indicator is TRUE.
library(tidyverse)
df <- structure(list(group = c("A", "A", "A", "A", "B", "B", "B"),
indicator = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE
), value = c(2, 1, 2, 4, 5, 1, 3)), class = "data.frame", row.names = c(NA,
-7L))
df |>
group_by(group) |>
mutate(slicer = if_else(lead(indicator) ==F, 1, 0)) |>
mutate(slicer = if_else(is.na(slicer), 0 , slicer)) |>
filter(slicer == 0) |>
select(-slicer)
#> # A tibble: 4 × 3
#> # Groups: group [2]
#> group indicator value
#> <chr> <lgl> <dbl>
#> 1 A FALSE 2
#> 2 A TRUE 4
#> 3 B FALSE 1
#> 4 B TRUE 3
Another approach:
library(dplyr)
df %>%
group_by(group) %>%
slice_max(cumsum(!indicator))
Note: While this approach covers the example shown and OP's clarification that T always comes last, it will not work in sequences such as T, F, F, T in which you'd like to keep both Ts and not just the one following F.
Output:
# A tibble: 4 x 3
# Groups: group [2]
group indicator value
<chr> <lgl> <dbl>
1 A FALSE 2
2 A TRUE 4
3 B FALSE 1
4 B TRUE 3
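To see the limitation from the note concretely, here is a base R sketch that mimics what slice_max(cumsum(!indicator)) selects within a single group, with ties kept (the default). The vector is a made-up example:

```r
indicator <- c(TRUE, FALSE, FALSE, TRUE)

# Running count of FALSE rows
running <- cumsum(!indicator)

# Rows tied for the maximum are the ones that are kept
kept <- which(running == max(running))
print(kept)

# The leading TRUE at position 1 is not among them
print(1 %in% kept)
```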
Some alternatives one could come up with:
"Dumb" solution
should_be_kept <- logical(nrow(df))
# Column names are lowercase in df: "group" and "indicator"
for(row in seq_len(nrow(df))) {
if(df[row, "indicator"]) {
should_be_kept[row] <- TRUE
} else if(row == max(which(!df[, "indicator"] & df$group == df[row, "group"]))) {
should_be_kept[row] <- TRUE
} else {
should_be_kept[row] <- FALSE
}
}
df[should_be_kept, ]
Solution using a custom function to find the last FALSE indicators from each group
rows_to_keep <- logical(nrow(df)) #We create a TRUE/FALSE vector with one entry for each row of df
rows_to_keep[df$indicator] <- TRUE #If indicator is TRUE, we mark that row as "selectable"
get_last_false_in_group <- function(df, group) {
return(max(which(df$group == group & !df$indicator))) #Gets the last row where the condition inside which() is met
}
#The following chunk does a group-by-group search of the last false indicator. There's probably some apply magic that simplifies this but I'm too dumb to come up with it.
groups <- levels(factor(df$group))
for(current_group in groups) {
rows_to_keep[get_last_false_in_group(df, current_group)] <- TRUE
}
#Now that our rows_to_keep vector is ready, we can filter the corresponding rows and get the intended result:
df[rows_to_keep,]
With the data.table package loaded, the calls to max(which(...)) can be replaced with last(which(...)).
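As for the "apply magic" mentioned in the comment, one possible base R sketch uses tapply() to find the last FALSE row of every group in a single call. The variable names are illustrative:

```r
df <- data.frame(
  group = c("A", "A", "A", "A", "B", "B", "B"),
  indicator = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE),
  value = c(2, 1, 2, 4, 5, 1, 3)
)

# Row numbers of all FALSE rows
false_rows <- which(!df$indicator)

# Largest FALSE row number within each group
last_false <- tapply(false_rows, df$group[false_rows], max)

# Start from the TRUE rows and add the last FALSE row of each group
rows_to_keep <- df$indicator
rows_to_keep[as.vector(last_false)] <- TRUE
df[rows_to_keep, ]
```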

How to convert multiple binary columns into a single character column?

I would like to convert data frame df1 into data frame df2.
id <- c(1,2,3)
outcome_1 <- c(1,0,1)
outcome_2 <- c(1,1,0)
df1 <- data.frame(id,outcome_1,outcome_2)
id <- c(1,2,3)
outcome <- c("1,2","2","1")
df2 <- data.frame(id,outcome)
The answers to the following question almost do what I want, but in my case a row can have more than one positive outcome (e.g. first row needs to be "1,2"). Also, I would like the resulting column to be a character column.
R: Converting multiple binary columns into one factor variable whose factors are binary column names
Please kindly help. Thank you.
Subset the substrings of the outcomes with their binary values coerced as.logical.
apply(df1[-1], 1, \(x) toString(substring(names(df1)[-1], 9)[as.logical(x)]))
# [1] "1, 2" "2" "1"
or
apply(df1[-1], 1, \(x) paste(substring(names(df1)[-1], 9)[as.logical(x)], collapse=','))
# [1] "1,2" "2" "1"
Using the first method:
cbind(df1[1], outcome=apply(df1[-1], 1, \(x) toString(substring(names(df1)[-1], 9)[as.logical(x)])))
# id outcome
# 1 1 1, 2
# 2 2 2
# 3 3 1
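Note that `substring(..., 9)` depends on the prefix "outcome_" being exactly eight characters long. A sketch that strips the prefix by pattern instead, assuming all indicator columns start with "outcome_" (the names `labels` and `flags` are illustrative):

```r
df1 <- data.frame(id = c(1, 2, 3), outcome_1 = c(1, 0, 1), outcome_2 = c(1, 1, 0))

# Labels are whatever follows the "outcome_" prefix
labels <- sub("^outcome_", "", names(df1)[-1])

# Logical matrix: TRUE where an outcome is present
flags <- as.matrix(df1[-1]) == 1

# Paste the labels of the TRUE columns, row by row
outcome <- unname(apply(flags, 1, function(keep) paste(labels[keep], collapse = ",")))

df2 <- data.frame(id = df1$id, outcome = outcome, stringsAsFactors = FALSE)
print(df2)
```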
If you want a nested list you may use list2DF.
l <- list2DF(c(df1[1],
outcome=list(apply(df1[-1], 1, \(x)
as.numeric(substring(names(df1)[-1], 9))[as.logical(x)]))))
l
# id outcome
# 1 1 1, 2
# 2 2 2
# 3 3 1
where
str(l)
# 'data.frame': 3 obs. of 2 variables:
# $ id : num 1 2 3
# $ outcome:List of 3
# ..$ : num 1 2
# ..$ : num 2
# ..$ : num 1
Data:
df1 <- structure(list(id = c(1, 2, 3), outcome_1 = c(1, 0, 1), outcome_2 = c(1,
1, 0)), class = "data.frame", row.names = c(NA, -3L))
Here is one more tidyverse approach:
library(dplyr)
library(tidyr)
df1 %>%
mutate(across(-id, ~case_when(. == 1 ~ cur_column()), .names = 'new_{col}'), .keep="unused") %>%
unite(outcome, starts_with('new'), na.rm = TRUE, sep = ', ') %>%
mutate(outcome = gsub('outcome_', '', outcome))
id outcome
1 1 1, 2
2 2 2
3 3 1
How many outcome_ columns are there? If just 2, this will work fine.
library(dplyr)
df1 %>%
rowwise() %>%
summarise(id = id,
outcome = paste(which(c(outcome_1,outcome_2)==1), collapse =","))
# A tibble: 3 x 2
id outcome
<dbl> <chr>
1 1 1,2
2 2 2
3 3 1
If there are more than 2, try this:
df1 %>%
rowwise() %>%
summarise(id=id,
outcome = paste(which(c_across(-id)== 1), collapse =","))
Another possible solution, based on dplyr and purrr::pmap:
library(tidyverse)
df1 %>%
transmute(id, outcome = pmap_chr(., ~ c(1*..2, 2*..3) %>% .[. != 0] %>% toString))
#> id outcome
#> 1 1 1, 2
#> 2 2 2
#> 3 3 1
Or simply:
library(tidyverse)
pmap_dfr(df1, ~ data.frame(id = ..1, outcome = c(1*..2, 2*..3) %>% .[. != 0]
%>% toString))
#> id outcome
#> 1 1 1, 2
#> 2 2 2
#> 3 3 1
outcome_col_idx <- grepl("outcome", colnames(df1))
cbind(
df1[,!outcome_col_idx, drop = FALSE],
outcome = apply(
replace(df1, df1 == 0, NA)[,outcome_col_idx],
1,
function(x){
as.factor(
toString(
gsub(
"outcome_",
"",
names(x)[complete.cases(x)]
)
)
)
}
)
)

How to simplify data checking that iterates by rows and columns (two nested loops) with if_else

Q1 <- c(1, 1, 2, 2)
Q2_1 <- c(3, 3, 3, 3)
Q2_2 <- c(3, 4, 2, 1)
data <- data.frame(cbind(Q1, Q2_1, Q2_2))
I need to check whether the values in the Q1 variable appear in the Q2 variables (in both Q2_1 and Q2_2), and I need the result in a single variable.
For now I am using two nested for loops (over rows and columns) with the if_else function from dplyr, but it's quite a lot of code and I have to do similar checks multiple times. Is there any way to simplify the code?
This is what I'm doing for now:
Q2_index <- grep("Q2_", names(data))
data$Q2_error <- 0
for(i in 1:dim(data)[1]){
for(j in 1:length(Q2_index)){
data$Q2_error[i] <- if_else(data$Q2_error != 1 & data$Q1 == data[, Q2_index[j]], 1, 0, 0)[i]
}
}
Second example:
ID <- 1:3
Q1_1 <- 1:3
Q1_2 <- c(3, NA, 1)
Q1_3 <- c(4, 2, 1)
Q2_1 <- c(5, 2, 1)
Q2_2 <- c(1, NA, NA)
Q2_3 <- c(NA, NA, NA)
data <- data.frame(ID, Q1_1, Q1_2, Q1_3, Q2_1, Q2_2, Q2_3)
Q1_index <- grep("Q1_", names(data))
Q2_index <- grep("Q2_", names(data))
data$Q1Q2error <- 0
for(i in 1:dim(data)[1]){
for(j in 1:length(Q1_index)){
data$Q1Q2error[i] <- if_else(data[, Q1_index[j]] >= 1 & data[, Q2_index[j]] != data[, Q1_index[j]] & is.na(data[, Q2_index[j]]), 0, 1, 1)[i]
}
}
The evaluated conditions vary from check to check. As a result I need a single variable that indicates whether there is an error, so I can easily match the error to ID (so either 1 and 0 or TRUE and FALSE). Please note that this is a simplified example and I have to deal with around 10-20 Q1 or Q2 variables at the same time.
Why not create a generic function for the desired operation, so you can reuse it on new data frames:
aggreg_it <- function(data){
cols_Q1 <- names(data)[grep("Q1", names(data))]
cols_Q2 <- names(data)[grep("Q2", names(data))]
mapply(function(i,j) {ifelse(length(intersect(i, j))>0,1,0)},
strsplit(apply(data[, cols_Q1], 1, paste, collapse=","),","),
strsplit(apply(data[, cols_Q2] , 1, paste ,collapse=","),","))
}
data$result <- aggreg_it(data)
# ID Q1_1 Q1_2 Q1_3 Q2_1 Q2_2 Q2_3 result
#1 1 10 3 4 5 1 NA 0
#2 2 11 NA 2 2 NA NA 1
#3 3 12 1 1 1 NA NA 1
#4 4 13 5 6 7 8 9 0
As you did not state any assumptions about NAs, note that NAs are treated as valid values in this example. Hope this helps. Similarly, you could make a function from your own code.
Data used:
ID <- 1:4
Q1_1 <- c(10,11,12,13)
Q1_2 <- c(3, NA, 1, 5)
Q1_3 <- c(4, 2, 1, 6)
Q2_1 <- c(5, 2, 1, 7)
Q2_2 <- c(1, NA, NA, 8)
Q2_3 <- c(NA, NA, NA, 9)
data <- data.frame(ID, Q1_1, Q1_2, Q1_3, Q2_1, Q2_2, Q2_3)
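If NAs should not count as matching values, a variant along the following lines could be used. This is a sketch, not the answer's function: `any_shared` and its pattern arguments are hypothetical names, and it returns TRUE/FALSE instead of 1/0:

```r
ID <- 1:4
Q1_1 <- c(10, 11, 12, 13)
Q1_2 <- c(3, NA, 1, 5)
Q1_3 <- c(4, 2, 1, 6)
Q2_1 <- c(5, 2, 1, 7)
Q2_2 <- c(1, NA, NA, 8)
Q2_3 <- c(NA, NA, NA, 9)
data <- data.frame(ID, Q1_1, Q1_2, Q1_3, Q2_1, Q2_2, Q2_3)

any_shared <- function(data, pattern_a, pattern_b) {
  a <- as.matrix(data[grep(pattern_a, names(data))])
  b <- as.matrix(data[grep(pattern_b, names(data))])
  vapply(seq_len(nrow(data)), function(i) {
    x <- a[i, ]
    y <- b[i, ]
    # Drop missing values so that NA never counts as a match
    length(intersect(x[!is.na(x)], y[!is.na(y)])) > 0
  }, logical(1))
}

data$result_no_na <- any_shared(data, "^Q1_", "^Q2_")
print(data)
```

On this particular data the result agrees with the version above, because no row matches on NA alone.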
No need for a loop, these operations are vectorized in R.
(I changed your input data a bit to show differentiated results)
Base R:
data$Q1_in_Q2 <- data$Q1 %in% data$Q2_1 | data$Q1 %in% data$Q2_2
data
#> Q1 Q2_1 Q2_2 Q1_in_Q2
#> 1 1 1 5 TRUE
#> 2 3 3 4 TRUE
#> 3 2 3 2 TRUE
#> 4 6 3 1 FALSE
With dplyr:
library(dplyr)
data <- data %>%
mutate(Q1_in_Q2_1 = Q1 %in% Q2_1,
Q1_in_Q2_2 = Q1 %in% Q2_2,
Q1_in_Q2 = Q1_in_Q2_1 | Q1_in_Q2_2) %>%
select(Q1, Q2_1, Q2_2, Q1_in_Q2_1, Q1_in_Q2_2, Q1_in_Q2)
data
#> Q1 Q2_1 Q2_2 Q1_in_Q2_1 Q1_in_Q2_2 Q1_in_Q2
#> 1 1 1 5 TRUE TRUE TRUE
#> 2 3 3 4 TRUE FALSE TRUE
#> 3 2 3 2 FALSE TRUE TRUE
#> 4 6 3 1 FALSE FALSE FALSE
Data:
Q1 <- c(1, 3, 2, 6)
Q2_1 <- c(1, 3, 3, 3)
Q2_2 <- c(5, 4, 2, 1)
data <- data.frame(cbind(Q1, Q2_1, Q2_2))
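One caveat with `%in%`: it tests whether a value occurs anywhere in the other column, not in the same row. That is why row 1 shows TRUE for Q1_in_Q2_2 even though Q2_2 is 5 in that row. If a row-by-row comparison is what is needed, as the loop in the question suggests, a sketch using `==` could look like this (the names `in_column`, `same_row` and `Q1_in_Q2_rowwise` are illustrative):

```r
Q1 <- c(1, 3, 2, 6)
Q2_1 <- c(1, 3, 3, 3)
Q2_2 <- c(5, 4, 2, 1)
data <- data.frame(Q1, Q2_1, Q2_2)

# %in% asks "does this value occur anywhere in the column?"
in_column <- data$Q1 %in% data$Q2_2

# == compares row by row
same_row <- data$Q1 == data$Q2_2

# Row-wise check across all Q2_ columns at once
q2 <- as.matrix(data[grep("^Q2_", names(data))])
data$Q1_in_Q2_rowwise <- rowSums(q2 == data$Q1, na.rm = TRUE) > 0
print(data)
```

Because the Q2_ columns are picked up by pattern, the same lines work unchanged with 10-20 columns.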

R ifelse loop on unique values always resolves FALSE

I am newish to R and having trouble with a for loop over unique values.
with the df:
id = c(1,2,2,3,3,4)
rank = c(1,2,1,3,3,4)
df = data.frame(id, rank)
I run:
df$dg <- logical(6)
for(i in unique(df$id)){
ifelse(!unique(df$rank), df$dg ==T, df$dg == F)
}
I am trying to mark the $dg variable as T providing that rank is different for each unique id and F if rank is the same within each id.
I am not getting any errors, but I am only getting F for all values of $dg even though I should be getting a mix.
I have also used the following loop with the same results:
for(i in unique(df$id)){
ifelse(length(unique(df$rank)), df$dg ==T, df$dg == F)
}
I have read other similar posts but the advice has not worked for my case.
From Comments:
I want to mark dg TRUE for all instances of an id if rank changed at all for a given id. I'm looking to say, for a given ID which has anywhere between 1-13 instances, mark dg TRUE if rank differs across instances.
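For reference, the loops in the question leave dg unchanged for three reasons: the result of ifelse() is never assigned, `df$dg == T` is a comparison rather than an assignment, and nothing inside the loop uses `i` to restrict the check to one id. A minimal corrected loop, as a sketch of the logic originally described (rank differs within an id):

```r
id <- c(1, 2, 2, 3, 3, 4)
rank <- c(1, 2, 1, 3, 3, 4)
df <- data.frame(id, rank)

df$dg <- logical(nrow(df))
for (i in unique(df$id)) {
  # Rows belonging to the current id
  rows <- df$id == i
  # TRUE if this id has more than one distinct rank; assigned with <-
  df$dg[rows] <- length(unique(df$rank[rows])) > 1
}
print(df)
```

Under this reading, ids with a single row get FALSE, which differs from the clarified requirement handled in the answer below.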
Update: How to identify groups (ids) that only have one rank?
After the clarification that OP provided, this would be a solution for this particular case:
library(dplyr)
df %>%
group_by(id) %>%
mutate(dg = ifelse( length(unique(rank))>1 | n() == 1, T, F))
For another data set (presented below) that also contains an id with both duplicated and non-duplicated ranks, this would be the output:
df2 %>%
group_by(id) %>%
mutate(dg = ifelse( length(unique(rank))>1 | n() == 1, T, F))
#:OUTPUT:
# Source: local data frame [9 x 3]
# Groups: id [5]
#
# # A tibble: 9 x 3
# id rank dg
# <dbl> <dbl> <lgl>
# 1 1 1 TRUE
# 2 2 2 TRUE
# 3 2 1 TRUE
# 4 3 3 FALSE
# 5 3 3 FALSE
# 6 4 4 TRUE
# 7 5 1 TRUE
# 8 5 1 TRUE
# 9 5 3 TRUE
Data-no-2:
df2 <- structure(list(id = c(1, 2, 2, 3, 3, 4, 5, 5, 5), rank = c(1, 2, 1, 3, 3, 4, 1, 1, 3
)), .Names = c("id", "rank"), row.names = c(NA, -9L), class = "data.frame")
How to identify duplicated rows within each group (id)?
You can use dplyr package:
library(dplyr)
df %>%
group_by(id, rank) %>%
mutate(dg = ifelse(n() > 1, F,T))
This will give you:
# Source: local data frame [6 x 3]
# Groups: id, rank [5]
#
# # A tibble: 6 x 3
# id rank dg
# <dbl> <dbl> <lgl>
# 1 1 1 TRUE
# 2 2 2 TRUE
# 3 2 1 TRUE
# 4 3 3 FALSE
# 5 3 3 FALSE
# 6 4 4 TRUE
Note: You can simply convert it back to a data.frame().
A data.table solution would be:
library(data.table)
dt <- data.table(df)
# dg is TRUE when the (id, rank) combination occurs only once
dt[, dg := .N == 1, by = .(id, rank)]
Data:
df <- structure(list(id = c(1, 2, 2, 3, 3, 4), rank = c(1, 2, 1, 3,
3, 4)), .Names = c("id", "rank"), row.names = c(NA, -6L), class = "data.frame")
# > df
# id rank
# 1 1 1
# 2 2 2
# 3 2 1
# 4 3 3
# 5 3 3
# 6 4 4
N.B. Unless you want an identifier other than TRUE/FALSE, using ifelse() is redundant and adds computational cost. #DavidArenburg
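Following the note above, the comparison itself already yields TRUE/FALSE, so ifelse() can be dropped. A base R sketch of the "rank differs within id, or id occurs once" rule using ave(); the names `n_ranks` and `n_rows` are illustrative:

```r
df <- data.frame(id = c(1, 2, 2, 3, 3, 4), rank = c(1, 2, 1, 3, 3, 4))

# Number of distinct ranks within each id, repeated for every row of that id
n_ranks <- ave(df$rank, df$id, FUN = function(r) length(unique(r)))

# Number of rows within each id
n_rows <- ave(df$rank, df$id, FUN = length)

# The logical expression is the result; no ifelse() needed
df$dg <- n_ranks > 1 | n_rows == 1
print(df)
```

The same simplification applies to the dplyr versions: `mutate(dg = n() == 1)` after `group_by(id, rank)` gives the same column as the `ifelse(n() > 1, F, T)` form.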
