Conditionally removing duplicates - r

I have a dataset in which I need to conditionally remove duplicated rows based on values in another column.
Specifically, I need to delete any row where size = 0 only if SampleID is duplicated.
SampleID<-c("a", "a", "b", "b", "b", "c", "d", "d", "e")
size<-c(0, 1, 1, 2, 3, 0, 0, 1, 0)
data<-data.frame(SampleID, size)
I want to delete rows with:
Sample ID size
a 0
d 0
And keep:
SampleID size
a 1
b 1
b 2
b 3
c 0
d 1
e 0
Note. actual dataset it very large, so I am not looking for a way to just remove a known row by row number.

In dplyr we can do this using group_by and filter:
library(dplyr)
data %>%
group_by(SampleID) %>%
filter(!(size==0 & n() > 1)) # filter(size!=0 | n() == 1))
#> # A tibble: 7 x 2
#> # Groups: SampleID [5]
#> SampleID size
#> <fct> <dbl>
#> 1 a 1
#> 2 b 1
#> 3 b 2
#> 4 b 3
#> 5 c 0
#> 6 d 1
#> 7 e 0

Using data.table framework: Transform your set to data.table
require(data.table)
setDT(data)
Build a list of id where we can delete lines:
dropable_ids = unique(data[size != 0, SampleID])
Finaly keep lines that are not in the dropable list or with non 0 value
data = data[!(SampleID %in% dropable_ids & size == 0), ]
Please note that not( a and b ) is equivalent to a or b but data.table framework doesn't handle well or.
Hope it helps

A solution that works in base R without data.table and is easy to follow through for R starters:
#Find all duplicates
data$dup1 <- duplicated(data$SampleID)
data$dup2 <- duplicated(data$SampleID, fromLast = TRUE)
data$dup <- ifelse(data$dup1 == TRUE | data$dup2 == TRUE, 1, 0)
#Subset to relevant
data$drop <- ifelse(data$dup == 1 & data$size == 0, 1, 0)
data2 <- subset(data, drop == 0)

Related

Identify rows with a value greater than threshold, but only direct one above per group

Suppose we have a dataset with a grouping variable, a value, and a threshold that is unique per group. Say I want to identify a value that is greater than a threshold, but only one.
test <- data.frame(
grp = c("A", "A", "A", "B", "B", "B"),
value = c(1, 3, 5, 1, 3, 5),
threshold = c(4,4,4,2,2,2)
)
want <- data.frame(
grp = c("A", "A", "A", "B", "B", "B"),
value = c(1, 3, 5, 1, 3, 5),
threshold = c(4,4,4,2,2,2),
want = c(NA, NA, "yes", NA, "yes", NA)
)
In the table above, Group A has a threshold of 4, and only value of 5 is higher. But in Group B, threshold is 2, and both value of 3 and 5 is higher. However, only row with value of 3 is marked.
I was able to do this by identifying which rows had value greater than threshold, then removing the repeated value:
library(dplyr)
test %>%
group_by(grp) %>%
mutate(want = if_else(value > threshold, "yes", NA_character_)) %>%
mutate(across(want, ~replace(.x, duplicated(.x), NA)))
I was wondering if there was a direct way to do this using a single logical statement rather than doing it two-step method, something along the line of:
test %>%
group_by(grp) %>%
mutate(want = if_else(???, "yes", NA_character_))
The answer doesn't have to be on R either. Just a logical step explanation would suffice as well. Perhaps using a rank?
Thank you!
library(dplyr)
test %>%
group_by(grp) %>%
mutate(want = (value > threshold), want = want & !lag(cumany(want))) %>%
ungroup()
# # A tibble: 6 × 4
# grp value threshold want
# <chr> <dbl> <dbl> <lgl>
# 1 A 1 4 FALSE
# 2 A 3 4 FALSE
# 3 A 5 4 TRUE
# 4 B 1 2 FALSE
# 5 B 3 2 TRUE
# 6 B 5 2 FALSE
If you really want strings, you can if_else after this.
Here is more direct way:
The essential part:
With min(which((value > threshold) == TRUE) we get the first TRUE in our column,
Next we use an ifelse and check the number we get to the row number and set the conditions:
library(dplyr)
test %>%
group_by(grp) %>%
mutate(want = ifelse(row_number()==min(which((value > threshold) == TRUE)),
"yes", NA_character_))
grp value threshold want
<chr> <dbl> <dbl> <chr>
1 A 1 4 NA
2 A 3 4 NA
3 A 5 4 yes
4 B 1 2 NA
5 B 3 2 yes
6 B 5 2 NA
>
This is a perfect chance for a data.table answer using its non-equi matching and multiple match handling capabilities:
library(data.table)
setDT(test)
test[test, on=.(grp, value>threshold), mult="first", flag := TRUE]
test
# grp value threshold flag
# <char> <num> <num> <lgcl>
#1: A 1 4 NA
#2: A 3 4 NA
#3: A 5 4 TRUE
#4: B 1 2 NA
#5: B 3 2 TRUE
#6: B 5 2 NA
Find the "first" matching value in each group that is greater than > the threshold and set := it to TRUE

Filter out rows when condition is met with dplyr

How to filter out rows in one column when condition is met in another column for different groups?
For example:
library(dplyr)
df1 <-tribble(
~group, ~var1, ~var2,
"a", 0, 0,
"a", 1, 0,
"a",1, 0,
"a",0, 1,
"a", 1, 0,
"b", 1, 0,
"b", 0, 1,
"b", 1, 0,
"b", 0, 1)
I want to allow ones in var1 only after having the first 1 in var2. Therefore, in this example, I would like to get:
group var1 var2
<chr> <dbl> <dbl>
a 0 0
a 0 1
a 1 0
b 0 1
b 1 0
b 0 1
I can identify from where I want to start filtering the data, but don't know exactly how to proceed:
df1 %>%
group_by(var2,group) %>%
mutate(test = case_when(row_number() == 1 & var2 == 1 ~ "exclude_previous_rows",
T ~ "n"))
I'm sure there is a simple way to do this with dplyr, but couldn't find it so far.
We can use a cumulative sum. I think this is what you want:
df1 %>%
group_by(group) %>%
filter(cumsum(var2 == 1) > 0)
# # A tibble: 5 x 3
# # Groups: group [2]
# group var1 var2
# <chr> <dbl> <dbl>
# 1 a 0 1
# 2 a 1 0
# 3 b 0 1
# 4 b 1 0
# 5 b 0 1
This will keep all rows including and after the first 1 in var2, by group. I'm not really sure what you mean by "I want to allow ones in var1" - your code seems to ignore var1, and mine follows suit.
An option using data.table
library(data.table)
setDT(df1)[df1[, .I[cumsum(var2 == 1) > 0], group]$V1]

How to create a rank for a variable in a longitudinal dataset based on a condition?

I have a longitudinal dataset where each subject is represented more than once. One represents one admission for a patient. Each admission, regardless of the subject also has a unique "key". I need to figure out which admission is the "INDEX" admission, that is, the first admission, so that I know that which rows are the subsequent RE-admission. The variable to use is "Daystoevent"; the lowest number represents the INDEX admission. I want to create a new variable based on the condition that for each subject, the lowest number in the variable "Daystoevent" is the "index" admission and each subsequent gets a number "1" , "2" etc. I want to do this WITHOUT changing into the horizontal format.
The dataset looks like this:
Subject Daystoevent Key
A 5 rtwe
A 8 erer
B 3 tter
B 8 qgfb
A 2 sada
C 4 ccfw
D 7 mjhr
B 4 sdfw
C 1 srtg
C 2 xcvs
D 3 muyg
Would appreciate some help.
This may not be an elegant solution but will do the job:
library(dplyr)
df <- df %>%
group_by(Subject) %>%
arrange(Subject, Daystoevent) %>%
mutate(
Admission = if_else(Daystoevent == min(Daystoevent), 0, 1),
) %>%
ungroup()
for(i in 1:(nrow(df) - 1)) {
if(df$Admission[i] == 1) {
df$Admission[i + 1] <- 2
} else if(df$Admission[i + 1] != 0){
df$Admission[i + 1] <- df$Admission[i] + 1
}
}
df[df == 0] <- "index"
df
# # A tibble: 11 x 4
# Subject Daystoevent Key Admission
# <chr> <dbl> <chr> <chr>
# 1 A 2 sada index
# 2 A 5 rtwe 1
# 3 A 8 erer 2
# 4 B 3 tter index
# 5 B 4 sdfw 1
# 6 B 8 qgfb 2
# 7 C 1 srtg index
# 8 C 2 xcvs 1
# 9 C 4 ccfw 2
# 10 D 3 muyg index
# 11 D 7 mjhr 1
Data:
df <- data_frame(
Subject = c("A", "A", "B", "B", "A", "C", "D", "B", "C", "C", "D"),
Daystoevent = c(5, 8, 3, 8, 2, 4, 7, 4, 1, 2, 3),
Key = c("rtwe", "erer", "tter", "qgfb", "sada", "ccfw", "mjhr", "sdfw", "srtg", "xcvs", "muyg")
)

How to use window function in R

I have the following data frame structure :
id status
a 1
a 2
a 1
b 1
b 1
b 0
b 1
c 0
c 0
c 2
c 1
d 0
d 2
d 0
Here a,b,c are unique id's and status is a flag ranging from 0,1 and 2.
I need to select each individual id whose status has changed from 0 to 1 in any point during the whole time frame, so the expected output of this would be two id's 'b' and 'c'.
I thought of using lag to accomplish that but in that case, I wont't be able to handle id 'c', in which there is a 0 in the beginning but it reaches 1 at some stage. Any thoughts on how we can achieve this using window functions (or any other technique)
You want to find id's having a status of 1 after having had a status of 0.
Here is a dplyr solution:
library(dplyr)
# Generate data
mydf = data_frame(
id = c(rep("a", 3), rep("b", 4), rep("c", 4), rep("d", 3)),
status = c(1, 2, 1, 1, 1, 0, 1, 0, 0, 2, 1, 0, 2, 0)
)
mydf %>% group_by(id) %>%
# Keep only 0's and 1's
filter(status %in% c(0,1)) %>%
# Compute diff between two status
mutate(dif = status - lag(status, 1)) %>%
# If it is 1, it is a 0 => 1
filter(dif == 1) %>%
# Catch corresponding id's
select(id) %>%
unique
One possible way using dplyr (Edited to include id only when a 1 appears after a 0):
library(dplyr)
df %>%
group_by(id) %>%
filter(status %in% c(0, 1)) %>%
filter(status == 0 & lead(status, default = 0) == 1) %>%
select(id) %>% unique()
#> # A tibble: 2 x 1
#> # Groups: id [2]
#> id
#> <chr>
#> 1 b
#> 2 c
Data
df <- read.table(text = "id status
a 1
a 2
a 1
b 1
b 1
b 0
b 1
c 0
c 0
c 2
c 1
d 0
d 2
d 0", header = TRUE, stringsAsFactors = FALSE)
I dunno if this is the most efficient way, but: split by id, check statuses for 0, and if there is any, check for 1 behind the 0 index:
lst <- split(df$status, df$id)
f <- function(x) {
if (!any(x==0)) return(FALSE)
any(x[which.max(x==0):length(x)]==1)
}
names(lst)[(sapply(lst, f))]
# [1] "b" "c"

Fast way to find min in groups after excluding observations using R

I need to do something similar to below on a very large data set (with many groups), and read somewhere that using .SD is slow. Is there any faster way to perform the following operation?
To be more precise, I need to create a new column that contains the min value for each group after having excluded a subset of observations in that group (something similar to minif in Excel).
library(data.table)
dt <- data.table(valid = c(0,1,1,0,1),
a = c(1,1,2,3,4),
groups = c("A", "A", "A", "B", "B"))
dt[, valid_min := .SD[valid == 1, min(a, na.rm = TRUE)], by = groups]
With the output:
> test
valid a k valid_min
1: 0 1 A 1
2: 1 1 A 1
3: 1 2 A 1
4: 0 3 B 4
5: 1 4 B 4
To make it even more complicated, groups could have no valid entries or they could have multiple valid but missing entries. My current code is similar to this:
dt <- data.table(valid = c(0,1,1,0,1,0,1,1),
a = c(1,1,2,3,4,3,NA,NA),
k = c("A", "A", "A", "B", "B", "C", "D", "D"))
dt[, valid_min := .SD[valid == 1,
ifelse(all(is.na(a)), NA_real_, min(a, na.rm = TRUE))], by = k]
Output:
> dt
valid a k valid_min
1: 0 1 A 1
2: 1 1 A 1
3: 1 2 A 1
4: 0 3 B 4
5: 1 4 B 4
6: 0 3 C NA
7: 1 NA D NA
8: 1 NA D NA
There's...
dt[dt[valid == 1 & !is.na(a), min(a), by=k], on=.(k), the_min := i.V1]
This should be fast since the inner call to min is optimized for groups. (See ?GForce.)
We can do the same using dplyr
dt %>%
group_by(groups) %>%
mutate(valid_min = min(ifelse(valid == 1,
a, NA),
na.rm = TRUE))
Which gives:
valid a groups valid_min
<dbl> <dbl> <chr> <dbl>
1 0 1 A 1
2 1 1 A 1
3 1 2 A 1
4 0 3 B 4
5 1 4 B 4
Alternatively, if you are not interested in keeping the 'non-valid' rows, we can do the following:
dt %>%
filter(valid == 1) %>%
group_by(groups) %>%
mutate(valid_min = min(a))
Looks like I provided the slowest approach. Comparing each approach (using a larger, replicated data frame called df) with a microbenchmark test:
library(microbenchmark)
library(ggplot2)
mbm <- microbenchmark(
dplyr.test = suppressWarnings(df %>%
group_by(k) %>%
mutate(valid_min = min(ifelse(valid == 1,
a, NA),
na.rm = TRUE),
valid_min = ifelse(valid_min == Inf,
NA,
valid_min))),
data.table.test = df[, valid_min := .SD[valid == 1,
ifelse(all(is.na(a)), NA_real_, min(a, na.rm = TRUE))], by = k],
GForce.test = df[df[valid == 1 & !is.na(a), min(a), by=k], on=.(k), the_min := i.V1]
)
autoplot(mbm)
...well, i tried...

Resources