Remove non-last rows with a certain condition per group in R

I have the following dataframe called df (dput below):
group indicator value
1 A FALSE 2
2 A FALSE 1
3 A FALSE 2
4 A TRUE 4
5 B FALSE 5
6 B FALSE 1
7 B TRUE 3
I would like to remove the non-last rows with indicator == FALSE per group. This means that rows 1, 2, and 5 of df should be removed, because they are not the last FALSE row in their group. Here is the desired output:
group indicator value
1 A FALSE 2
2 A TRUE 4
3 B FALSE 1
4 B TRUE 3
So I was wondering if anyone knows how to remove the non-last rows matching a certain condition per group in R?
dput of df:
df <- structure(list(group = c("A", "A", "A", "A", "B", "B", "B"),
indicator = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE
), value = c(2, 1, 2, 4, 5, 1, 3)), class = "data.frame", row.names = c(NA,
-7L))

Filter using last(which()) to find the row number of the last FALSE row per group:
library(dplyr)
df %>%
group_by(group) %>%
filter(indicator | row_number() == last(which(!indicator))) %>%
ungroup()
# A tibble: 4 × 3
group indicator value
<chr> <lgl> <dbl>
1 A FALSE 2
2 A TRUE 4
3 B FALSE 1
4 B TRUE 3
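A hedged aside: if a group contained no FALSE rows at all, which(!indicator) would be empty, last() would return NA, and the NA comparison is dropped by filter(), so only the TRUE rows would survive. A quick check on a hypothetical all-TRUE group:
tibble(group = "C", indicator = c(TRUE, TRUE), value = c(1, 2)) %>%
filter(indicator | row_number() == last(which(!indicator)))
# both TRUE rows are kept, no error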

You can do this with lead: flag each row whose following row's indicator is FALSE and drop those rows (the last row of each group gets an NA flag, which is recoded so that row is kept).
library(tidyverse)
df <- structure(list(group = c("A", "A", "A", "A", "B", "B", "B"),
indicator = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE
), value = c(2, 1, 2, 4, 5, 1, 3)), class = "data.frame", row.names = c(NA,
-7L))
df |>
group_by(group) |>
mutate(slicer = if_else(lead(indicator) == FALSE, 1, 0)) |>
mutate(slicer = if_else(is.na(slicer), 0, slicer)) |>
filter(slicer == 0) |>
select(-slicer)
#> # A tibble: 4 × 3
#> # Groups: group [2]
#> group indicator value
#> <chr> <lgl> <dbl>
#> 1 A FALSE 2
#> 2 A TRUE 4
#> 3 B FALSE 1
#> 4 B TRUE 3
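The two mutate calls can be collapsed with coalesce, which substitutes FALSE for the NA that lead() produces on each group's last row. A compact sketch of the same idea:
df |>
group_by(group) |>
filter(!coalesce(lead(indicator) == FALSE, FALSE)) |>
ungroup()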

Another approach:
library(dplyr)
df %>%
group_by(group) %>%
slice_max(cumsum(!indicator))
Note: while this approach covers the example shown and the OP's clarification that TRUE always comes last in a group, it will not work for sequences such as T, F, F, T, where you would want to keep both TRUE rows and not just the one following a FALSE. A sketch handling that case follows the output below.
Output:
# A tibble: 4 x 3
# Groups: group [2]
group indicator value
<chr> <lgl> <dbl>
1 A FALSE 2
2 A TRUE 4
3 B FALSE 1
4 B TRUE 3
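If you do need to handle sequences such as T, F, F, T, one hedged sketch (assuming the desired rule is: keep every TRUE plus the last FALSE of each run of FALSEs) keeps rows whose own indicator is TRUE or whose next row's indicator is TRUE:
library(dplyr)
df %>%
group_by(group) %>%
filter(indicator | lead(indicator, default = TRUE)) %>%
ungroup()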

Some alternatives one could come up with:
"Dumb" solution
should_be_kept <- logical(nrow(df))
for(row in 1:nrow(df)) {
if(df[row,"Indicator"]) {
should_be_kept[row] <- TRUE
} else if(row == max(which(!df[, "Indicator"] & df$Group == df[row, "Group"]))) {
should_be_kept[row] <- TRUE
} else {
should_be_kept[row] = FALSE
}
}
df[should_be_kept, ]
Solution using a custom function to find the last FALSE indicators from each group
rows_to_keep <- logical(nrow(df)) # We create a TRUE/FALSE vector with one entry for each row of df
rows_to_keep[df$indicator] <- TRUE # If indicator is TRUE, we mark that row as "selectable"
get_last_false_in_group <- function(df, group) {
return(max(which(df$group == group & !df$indicator))) # Gets the last row at which the condition inside which() is met
}
# The following chunk does a group-by-group search for the last FALSE indicator.
# There's probably some apply magic that simplifies this but I'm too dumb to come up with it.
groups <- levels(factor(df$group))
for(current_group in groups) {
rows_to_keep[get_last_false_in_group(df, current_group)] <- TRUE
}
# Now that our rows_to_keep vector is ready, we can filter the corresponding rows and get the intended result:
df[rows_to_keep, ]
With the data.table package, the calls to max(which(...)) can be replaced with calls to just the last function.
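For instance, a minimal data.table sketch of that idea, written against the dput above (an illustration, not a tested drop-in):
library(data.table)
dt <- as.data.table(df)
# Per group, keep the TRUE rows plus the last FALSE row
dt[, .SD[indicator | seq_len(.N) == last(which(!indicator))], by = group]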

Related

Identify rows with a value greater than a threshold, but only the first one above per group

Suppose we have a dataset with a grouping variable, a value, and a threshold that is unique per group. Say I want to identify values that are greater than the threshold, but flag only the first such value per group.
test <- data.frame(
grp = c("A", "A", "A", "B", "B", "B"),
value = c(1, 3, 5, 1, 3, 5),
threshold = c(4,4,4,2,2,2)
)
want <- data.frame(
grp = c("A", "A", "A", "B", "B", "B"),
value = c(1, 3, 5, 1, 3, 5),
threshold = c(4,4,4,2,2,2),
want = c(NA, NA, "yes", NA, "yes", NA)
)
In the table above, group A has a threshold of 4, and only the value 5 is higher. In group B, the threshold is 2 and both 3 and 5 are higher; however, only the row with value 3 is marked.
I was able to do this by identifying which rows had value greater than threshold, then removing the repeated marks:
library(dplyr)
test %>%
group_by(grp) %>%
mutate(want = if_else(value > threshold, "yes", NA_character_)) %>%
mutate(across(want, ~replace(.x, duplicated(.x), NA)))
I was wondering if there was a more direct way to do this using a single logical statement rather than this two-step method, something along the lines of:
test %>%
group_by(grp) %>%
mutate(want = if_else(???, "yes", NA_character_))
The answer doesn't have to be on R either. Just a logical step explanation would suffice as well. Perhaps using a rank?
Thank you!
library(dplyr)
test %>%
group_by(grp) %>%
mutate(want = (value > threshold), want = want & !lag(cumany(want))) %>%
ungroup()
# # A tibble: 6 × 4
# grp value threshold want
# <chr> <dbl> <dbl> <lgl>
# 1 A 1 4 FALSE
# 2 A 3 4 FALSE
# 3 A 5 4 TRUE
# 4 B 1 2 FALSE
# 5 B 3 2 TRUE
# 6 B 5 2 FALSE
If you really want strings, you can apply if_else after this, as sketched below.
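For example (a sketch; lag's default argument avoids the NA that a bare lag() would produce on each group's first row):
test %>%
group_by(grp) %>%
mutate(want = value > threshold,
want = want & !lag(cumany(want), default = FALSE)) %>%
mutate(want = if_else(want, "yes", NA_character_)) %>%
ungroup()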
Here is a more direct way.
The essential part:
With min(which(value > threshold)) we get the position of the first TRUE in our column.
Next we use ifelse to compare that number with the row number and set the value accordingly:
library(dplyr)
test %>%
group_by(grp) %>%
mutate(want = ifelse(row_number() == min(which(value > threshold)),
"yes", NA_character_))
grp value threshold want
<chr> <dbl> <dbl> <chr>
1 A 1 4 NA
2 A 3 4 NA
3 A 5 4 yes
4 B 1 2 NA
5 B 3 2 yes
6 B 5 2 NA
This is a perfect chance for a data.table answer using its non-equi matching and multiple match handling capabilities:
library(data.table)
setDT(test)
test[test, on=.(grp, value>threshold), mult="first", flag := TRUE]
test
# grp value threshold flag
# <char> <num> <num> <lgcl>
#1: A 1 4 NA
#2: A 3 4 NA
#3: A 5 4 TRUE
#4: B 1 2 NA
#5: B 3 2 TRUE
#6: B 5 2 NA
Find the "first" matching value in each group that is greater than > the threshold and set := it to TRUE

Equality of columns using dplyr - problem with missing values

I am using the dplyr package in R to test equality of two columns using the code below. The results work well except for missing values, for which neither TRUE nor FALSE is returned:
mutate(check = if_else(only == count, TRUE, FALSE))
Any ideas on how I can tweak this syntax?
Thanks in advance!
Is this what you are looking for?
library(dplyr)
dat <- data.frame(only = c(1, NA, 2, 3, NA),
count = c(1, NA, 3, 2, 1))
dat %>%
mutate(check = if_else(only == count | is.na(only) & is.na(count),
TRUE, FALSE, missing = FALSE))
only count check
1 1 1 TRUE
2 NA NA TRUE
3 2 3 FALSE
4 3 2 FALSE
5 NA 1 FALSE
The key line, with explicit parentheses (note the &: both values must be NA to count as equal, and missing = FALSE handles one-sided NAs):
mutate(check = if_else(only == count |
(is.na(only) & is.na(count)), TRUE, FALSE, missing = FALSE))
You can try case_when. The following code returns FALSE whenever col1 == col2 is not TRUE, which includes col1 != col2 as well as rows where either column is NA.
# Example data frame
dat <- data.frame(
col1 = c(1, 2, 3, 4, 5),
col2 = c(1, 2, NA, 4, 6)
)
library(dplyr)
dat2 <- dat %>%
mutate(check = case_when(
col1 == col2 ~ TRUE,
TRUE ~ FALSE
))
print(dat2)
# col1 col2 check
# 1 1 1 TRUE
# 2 2 2 TRUE
# 3 3 NA FALSE
# 4 4 4 TRUE
# 5 5 6 FALSE
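A base R idiom worth knowing here is %in%, which returns FALSE rather than NA for missing comparisons. A minimal sketch (note it treats two NAs as not equal, unlike the first answer):
dat <- data.frame(only = c(1, NA, 2, 3, NA),
count = c(1, NA, 3, 2, 1))
# NA == NA yields NA, but NA %in% TRUE yields FALSE
dat$check <- (dat$only == dat$count) %in% TRUE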

Find points in a dataframe where (col_1[i], col_2[i]) = (col_1[j], -col_2[j])

There might be an obvious solution to this that I have missed but here goes:
Consider the data frame below. I wish to create a column with TRUE/FALSE values, where the value is TRUE whenever the condition (col_1[i], col_2[i]) = (col_1[j], -col_2[j]) is fulfilled. Note that sum() does not work here, since there might be a third value.
To elaborate; what I have is:
col_1 <- c("x", "x", "y", "y", "y", "z", "z")
col_2 <- c(-1, 1, 3, -3, 4, 7, 3)
df <- data.frame(col_1, col_2)
What I want is:
col_1 col_2 check
1 x -1 TRUE
2 x 1 TRUE
3 y 3 TRUE
4 y -3 TRUE
5 y 4 FALSE
6 z 7 FALSE
7 z 3 FALSE
I think the answer must be something along the lines of df %>% group_by(col_1), but I can't think of the complete solution.
Here is my attempt. As you were saying, grouping the data is necessary. I defined groups with col_1 and foo, where foo contains the absolute values of col_2. If a group has more than one observation and the number of distinct values of col_2 in it equals 2, you have the pairs you are searching for.
group_by(df, col_1, foo = abs(col_2)) %>%
mutate(check = n() > 1 & n_distinct(col_2) == 2) %>%
ungroup %>%
select(-foo)
col_1 col_2 check
<fct> <dbl> <lgl>
1 x -1 TRUE
2 x 1 TRUE
3 y 3 TRUE
4 y -3 TRUE
5 y 4 FALSE
6 z 7 FALSE
7 z 3 FALSE
As Ronak previously mentioned, there may be cases like this one, where a duplicated value should not be flagged:
col_1 <- c("x", "x", "y", "y", "y", "z", "z")
col_2 <- c(1, 1, 3, -3, 4, 7, 3)
df2 <- data.frame(col_1, col_2)
col_1 col_2
1 x 1
2 x 1
3 y 3
4 y -3
5 y 4
6 z 7
7 z 3
group_by(df2, col_1, foo = abs(col_2)) %>%
mutate(check = n() > 1 & n_distinct(col_2) == 2) %>%
ungroup %>%
select(-foo)
col_1 col_2 check
<fct> <dbl> <lgl>
1 x 1 FALSE
2 x 1 FALSE
3 y 3 TRUE
4 y -3 TRUE
5 y 4 FALSE
6 z 7 FALSE
7 z 3 FALSE
You can try the following base R code, where a custom function f is defined to check the sum:
f <- function(v) {
# combn(v, 2, sum) == 0 flags the pairs of elements of v that sum to zero;
# the matching columns of combn(seq_along(v), 2) give their positions
unique(c(combn(seq_along(v), 2)[, combn(v, 2, sum) == 0]))
}
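As a quick sanity check, calling f on group y's values returns the positions of the zero-sum pair:
f(c(3, -3, 4))
# [1] 1 2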
dfout <- Reduce(rbind,
lapply(split(df, df$col_1),
function(v) {
v$col_3 <- FALSE
v$col_3[f(v$col_2)] <- TRUE
v
})
)
dfout <- dfout[order(as.numeric(rownames(dfout))), ]
such that
> dfout
col_1 col_2 col_3
1 x -1 TRUE
2 x 1 TRUE
3 y 3 TRUE
4 y -3 TRUE
5 y 4 FALSE
6 z 7 FALSE
7 z 3 FALSE
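A compact dplyr variant of the same grouping idea (a sketch: col_2 %in% -col_2 asks, per group, whether each value's negative is present, so it also handles the duplicated-value case above; a value of exactly 0 would match itself, though):
library(dplyr)
df %>%
group_by(col_1) %>%
mutate(check = col_2 %in% -col_2) %>%
ungroup()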

In R, discover rows which partially match rows in another data frame

I have the following two data frames:
> df1
# A tibble: 4 x 4
x y z w
<dbl> <dbl> <dbl> <dbl>
4 5 8 9
4 6 7 4
3 6 7 10
8 2 8 9
> df2
# A tibble: 4 x 4
x y z w
<dbl> <dbl> <dbl> <dbl>
6 2 7 9
2 6 7 10
4 5 8 12
4 5 8 3
I would like to discover which rows in df2 have a match in df1, where a match means being identical in at least n/2 columns (here, at least 2 of the 4).
So in this example, row 1 in df2 matches row 4 in df1 (on columns 2 and 4), row 2 in df2 matches row 2 in df1 on columns 2 and 3 and row 3 of df1 on columns 2, 3, and 4, and so on.
I also have to save the location of the repeating rows and the columns on which they match.
For small data sets, I could replicate both data sets and subtract them and count the zeros. However what I need is a solution which would work on very large data sets (~20K rows).
Any ideas? A dplyr solution (rather than a data.table) would be highly appreciated.
This final output might not be the ideal format, but it should at least have the information you're looking for and work with many more fields/columns.
df1 <- read.table(text =
"x y z w
4 5 8 9
4 6 7 4
3 6 7 10
8 2 8 9",
header = T)
df2 <- read.table(text =
"x y z w
6 2 7 9
2 6 7 10
4 5 8 12
4 5 8 3",
header = T)
library(dplyr)
library(tidyr)
Add a row ID number to each data frame and reshape the data from wide to long with gather. (I'm assuming each row can be treated as a unique id):
df1 <- df1 %>%
mutate(df1_id = row_number()) %>%
gather(field, value, x:w) %>%
arrange(df1_id)
df2 <- df2 %>%
mutate(df2_id = row_number()) %>%
gather(field, value, x:w) %>%
arrange(df2_id)
Join the two data frames with an inner_join on field/column and value. Then use group_by and filter to keep only df1/df2 row pairs that match on two or more field/value combinations:
df2 %>%
inner_join(df1, by = c('value', 'field')) %>%
group_by(df2_id, df1_id) %>%
filter(n()>=2) %>% # where 2 is the minimum number of matches
arrange(df2_id, df1_id, value) %>%
select(df2_id, df1_id, field, value)
# A tibble: 13 x 4
# Groups: df2_id, df1_id [5]
df2_id df1_id field value
<int> <int> <chr> <int>
1 1 4 y 2
2 1 4 w 9
3 2 2 y 6
4 2 2 z 7
5 2 3 y 6
6 2 3 z 7
7 2 3 w 10
8 3 1 x 4
9 3 1 y 5
10 3 1 z 8
11 4 1 x 4
12 4 1 y 5
13 4 1 z 8
You can see that df2 row 1 matches df1 row 4 on the fields y and w,
df2 row 2 matches df1 row 2 on fields y and z,
df2 row 2 also matches df1 row 3 on fields y, z, and w.
df2 rows 3 and 4 match df1 row 1 on x, y, and z.
arrange and select are really only necessary for easier viewing of the data.
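If a one-row-per-pair summary is easier to scan, the same pipeline can optionally end in summarise instead (a sketch; the .groups argument requires dplyr 1.0+):
df2 %>%
inner_join(df1, by = c('value', 'field')) %>%
group_by(df2_id, df1_id) %>%
filter(n() >= 2) %>%
summarise(n_matches = n(),
fields = paste(field, collapse = ", "),
.groups = "drop")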
How about this? Using dplyr and purrr, we add an id.1/id.2 field to each data frame and append .1 or .2 to the existing field names as appropriate. Then we create a list of vectors for the by parameter, iterate through it with inner_join on df1 and df2, concatenate all the join results, and select the ids from both data frames.
require(dplyr)
require(purrr)
df1 <- tibble(
x = c(4, 4, 3, 8),
y = c(5, 6, 6, 2),
z = c(8, 7, 7, 8),
w = c(9, 4, 10, 9)
)
df2 <- tibble(
x = c(6, 2, 4, 4),
y = c(2, 6, 5, 5),
z = c(7, 7, 8, 8),
w = c(9, 10, 12, 13)
)
df1 <- df1 %>%
mutate(id.1 = row_number()) %>% # 1:length(.) would count columns, not rows
rename(
x.1 = x,
y.1 = y,
z.1 = z,
w.1 = w
)
df2 <- df2 %>%
mutate(id.2 = row_number()) %>%
rename(
x.2 = x,
y.2 = y,
z.2 = z,
w.2 = w
)
inner_join_by <-
list(
c("x.1" = "x.2", "y.1" = "y.2"),
c("x.1" = "x.2", "z.1" = "z.2"),
c("x.1" = "x.2", "w.1" = "w.2"),
c("y.1" = "y.2", "z.1" = "z.2"),
c("y.1" = "y.2", "w.1" = "w.2"),
c("z.1" = "z.2", "w.1" = "w.2")
)
filtered <- inner_join_by %>%
map_df(.f = ~inner_join(x = df1, y = df2, by = .x)) %>%
select(id.1, id.2) %>%
distinct()
One option could be using apply row-wise:
apply(df1, 1, function(x)apply(df2,1,function(y)x==y))
# [,1] [,2] [,3] [,4]
# [1,] FALSE FALSE FALSE FALSE
# [2,] FALSE FALSE FALSE TRUE
# [3,] FALSE TRUE TRUE FALSE
# [4,] TRUE FALSE FALSE TRUE
# [5,] FALSE FALSE FALSE FALSE
# [6,] FALSE TRUE TRUE FALSE
# [7,] FALSE TRUE TRUE FALSE
# [8,] FALSE FALSE TRUE FALSE
# [9,] TRUE TRUE FALSE FALSE
# [10,] TRUE FALSE FALSE FALSE
# [11,] TRUE FALSE FALSE TRUE
# [12,] FALSE FALSE FALSE FALSE
# [13,] TRUE TRUE FALSE FALSE
# [14,] TRUE FALSE FALSE FALSE
# [15,] TRUE FALSE FALSE TRUE
# [16,] FALSE FALSE FALSE FALSE
What about the following solution (still involving a loop)?
Here is a function which, for a given row x, returns the indices of the rows of dat that agree with x in at least two columns (or 0 if there are none):
fct <- function(x, dat){
M1logical <- t(unlist(x) == t(dat)) # compare row x against every row of dat, element-wise
n <- which(rowSums(M1logical) > 1) # rows matching in at least two columns
if(length(n) > 0){
return(n)
}
return(0) # no match found
}
Now iterating over the rows of df2:
mylist <- rep(list(NA), nrow(df2))
for(k in 1:nrow(df2)){
mylist[[k]] <- fct(df2[k,], df1)
}
It takes my computer 23.14 seconds (measured with microbenchmark) to run this on two data frames of size 20000x4 each (roughly 45 seconds on an older device). The dummy data used:
df1 <- data.frame(x=sample(1:20,20000, replace = T), y=sample(1:20,20000, replace = T),
z=sample(1:20,20000, replace = T), w=sample(1:20,20000, replace = T),
stringsAsFactors = F)
df2 <- data.frame(x=sample(1:20,20000, replace = T), y=sample(1:20,20000, replace = T),
z=sample(1:20,20000, replace = T), w=sample(1:20,20000, replace = T),
stringsAsFactors = F)

R ifelse loop on unique values always resolves FALSE

I am newish to R and having trouble with a for loop over unique values.
With this df:
id = c(1,2,2,3,3,4)
rank = c(1,2,1,3,3,4)
df = data.frame(id, rank)
I run:
df$dg <- logical(6)
for(i in unique(df$id)){
ifelse(!unique(df$rank), df$dg ==T, df$dg == F)
}
I am trying to mark the $dg variable as TRUE provided that rank differs within each unique id, and FALSE if rank is the same within each id.
I am not getting any errors, but I am only getting F for all values of $dg even though I should be getting a mix.
I have also used the following loop with the same results:
for(i in unique(df$id)){
ifelse(length(unique(df$rank)), df$dg ==T, df$dg == F)
}
I have read other similar posts but the advice has not worked for my case.
From Comments:
I want to mark dg TRUE for all instances of an id if rank changed at all for that id. I'm looking to say: for a given id, which has anywhere between 1 and 13 instances, mark dg TRUE if rank differs across instances.
Update: How to identify groups (ids) that only have one rank?
After the clarification the OP provided, this would be a solution for this particular case:
library(dplyr)
df %>%
group_by(id) %>%
mutate(dg = ifelse(length(unique(rank)) > 1 | n() == 1, TRUE, FALSE))
For another data set (presented below), which also has an id with duplicated as well as non-duplicated ranks, this would be the output:
df2 %>%
group_by(id) %>%
mutate(dg = ifelse(length(unique(rank)) > 1 | n() == 1, TRUE, FALSE))
# Output:
# Source: local data frame [9 x 3]
# Groups: id [5]
#
# # A tibble: 9 x 3
# id rank dg
# <dbl> <dbl> <lgl>
# 1 1 1 TRUE
# 2 2 2 TRUE
# 3 2 1 TRUE
# 4 3 3 FALSE
# 5 3 3 FALSE
# 6 4 4 TRUE
# 7 5 1 TRUE
# 8 5 1 TRUE
# 9 5 3 TRUE
Data set no. 2:
df2 <- structure(list(id = c(1, 2, 2, 3, 3, 4, 5, 5, 5), rank = c(1, 2, 1, 3, 3, 4, 1, 1, 3
)), .Names = c("id", "rank"), row.names = c(NA, -9L), class = "data.frame")
How to identify duplicated rows within each group (id)?
You can use dplyr package:
library(dplyr)
df %>%
group_by(id, rank) %>%
mutate(dg = ifelse(n() > 1, FALSE, TRUE))
This will give you:
# Source: local data frame [6 x 3]
# Groups: id, rank [5]
#
# # A tibble: 6 x 3
# id rank dg
# <dbl> <dbl> <lgl>
# 1 1 1 TRUE
# 2 2 2 TRUE
# 3 2 1 TRUE
# 4 3 3 FALSE
# 5 3 3 FALSE
# 6 4 4 TRUE
Note: you can simply convert it back with as.data.frame().
A data.table solution would be:
library(data.table)
dt <- as.data.table(df)
dt[, dg := .N == 1, by = .(id, rank)]
Data:
df <- structure(list(id = c(1, 2, 2, 3, 3, 4), rank = c(1, 2, 1, 3,
3, 4)), .Names = c("id", "rank"), row.names = c(NA, -6L), class = "data.frame")
# > df
# id rank
# 1 1 1
# 2 2 2
# 3 2 1
# 4 3 3
# 5 3 3
# 6 4 4
N.B. Unless you want an identifier other than TRUE/FALSE, using ifelse() is redundant and computationally costly (per @DavidArenburg); a version without it follows.
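For instance, an ifelse()-free sketch of the same duplicated-row check:
library(dplyr)
df %>%
group_by(id, rank) %>%
mutate(dg = n() == 1) %>%
ungroup()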
