Subset data based on conditional statement - R

I would like to know if there is a way of combining an ifelse statement with the filter function (in the dplyr package) to subset a data frame. Consider the data
df <- data.frame(id = c(1,1,1,2,2,2,2,3,3),
                 A = c(3,6,2,5,4,3,8,9,8),
                 D1 = c(0,0,0,1,1,0,0,0,0),
                 D2 = c(1,0,0,0,0,1,1,0,1))
For each id, I want to delete the rows that come after the first row where D2 = 1 or D1 = D2 = 0. The expected output would look like
df <- data.frame(id = c(1,2,2,2,3),
                 A = c(3,5,4,3,9),
                 D1 = c(0,1,1,0,0),
                 D2 = c(1,0,0,1,0))
I have made several attempts using group_by and the filter function, but it appears conditional statements are needed and I am finding it difficult to combine these with filter. I have come across several Q&As on subsetting data (e.g. How to subset data by filtering and grouping efficiently in R), but these do not address my question. I greatly appreciate any help on this.

In dplyr, you can find the first index where the condition is met within each group and keep only the rows up to and including that index.
library(dplyr)
df %>%
  group_by(id) %>%
  filter(row_number() <= which(D1 == 0 & D2 == 0 | D2 == 1)[1])
#     id     A    D1    D2
#  <dbl> <dbl> <dbl> <dbl>
# 1    1     3     0     1
# 2    2     5     1     0
# 3    2     4     1     0
# 4    2     3     0     1
# 5    3     9     0     0
The above works assuming that at least one row in each group satisfies the condition. For the general case, where none of the rows in a group might satisfy the condition and we want to keep all of that group's rows, we can use:
df %>%
  group_by(id) %>%
  slice({
    inds <- which(D1 == 0 & D2 == 0 | D2 == 1)[1]
    if (!is.na(inds)) -((inds + 1):n()) else seq_len(n())
  })
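A quick check of the no-match case (adding a made-up extra group, id 4, whose rows never satisfy the condition) shows that all of its rows are retained:
df2 <- rbind(df, data.frame(id = c(4, 4), A = c(1, 2), D1 = c(1, 1), D2 = c(0, 0)))
df2 %>%
  group_by(id) %>%
  slice({
    inds <- which(D1 == 0 & D2 == 0 | D2 == 1)[1]
    if (!is.na(inds)) -((inds + 1):n()) else seq_len(n())
  })
# both rows of id 4 are kept, in addition to the five rows selected above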

It doesn't seem like you need to use dplyr here (unless I'm missing something). Try this:
df <- data.frame(id = c(1,1,1,2,2,2,2,3,3),
                 A = c(3,6,2,5,4,3,8,9,8),
                 D1 = c(0,0,0,1,1,0,0,0,0),
                 D2 = c(1,0,0,0,0,1,1,0,1))
# Walk the rows in order and remember which ids have already hit the stopping
# condition; rows of an id that come after its first hit are dropped.
keep <- rep(TRUE, nrow(df))
done <- c()
for (i in 1:nrow(df)) {
  if (df$id[i] %in% done) {
    keep[i] <- FALSE
  } else if (df$D2[i] == 1 | (df$D1[i] == 0 & df$D2[i] == 0)) {
    done <- c(done, df$id[i])
  }
}
df <- df[keep, ]   # rows 1, 4, 5, 6, 8 -- the expected output

Pure dplyr: keep each row only while no earlier row in its group has met the stopping condition.
df %>%
  group_by(id) %>%
  filter(!lag(cumany(D2 == 1 | (D1 == 0 & D2 == 0)), default = FALSE))
# # A tibble: 5 x 4
# # Groups:   id [3]
#      id     A    D1    D2
#   <dbl> <dbl> <dbl> <dbl>
# 1     1     3     0     1
# 2     2     5     1     0
# 3     2     4     1     0
# 4     2     3     0     1
# 5     3     9     0     0

Related

Selecting all columns that have some specific values

I have a data.frame with more than 50 columns and 10,000 rows. I want to select those columns that have 0 or 1 in them, excluding the other values in those columns.
A sample data.frame is below:
dummy_df <- data.frame(
  id = 1:4,
  gender = c(4, 1, 0, 1),
  height = seq(150, 180, by = 10),
  smoking = c(3, 0, 1, 0)
)
I want to select all those columns with 0 or 1 values and exclude the rows with other values, like the 4 in gender and the 3 in smoking, as below:
gender smoking
     1       0
     0       1
     1       0
But I have 50 columns in the actual data frame and I don't know in advance which of them contain 0 or 1.
What I'm trying is:
dummy_df %>% select_if(~ all( . %in% 0:1))
Is this useful for you?
dummy_df %>%
  select(-c(id, height)) %>%
  rowwise() %>%
  filter(any(c_across() == 0) | any(c_across() == 1))
# A tibble: 3 x 2
# Rowwise:
  gender smoking
   <dbl>   <dbl>
1      1       0
2      0       1
3      1       0
EDIT:
If you don't know in advance which cols contain 0 and/or 1, you can determine that in base R:
temp <- dummy_df[sapply(dummy_df, function(x) any(x == 0 | x == 1))]
Now you can filter for rows with 0 and/or 1:
temp %>%
  rowwise() %>%
  filter(any(c_across() == 0) | any(c_across() == 1))
I think it's more like a case of filter than select:
library(dplyr)
dummy_df %>%
  filter(if_all(c(gender, smoking), ~ .x %in% c(0, 1)))
  id gender height smoking
1  2      1    160       0
2  3      0    170       1
3  4      1    180       0
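Since the expected output in the question keeps only the gender and smoking columns, a small extension of the same idea (a sketch, hard-coding the two example columns) drops the other columns after filtering:
library(dplyr)
dummy_df %>%
  filter(if_all(c(gender, smoking), ~ .x %in% c(0, 1))) %>%
  select(gender, smoking)
# leaves the three rows with only 0/1 values, as in the expected output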

Creating Event Onset Variable

I have clinical data that records a patient at three time points with a disease outcome indicated by a binary variable. It looks something like this
patientid <- c(100,100,100,101,101,101,102,102,102)
time <- c(1,2,3,1,2,3,1,2,3)
outcome <- c(0,1,1,0,0,1,1,1,0)
Data<- data.frame(patientid=patientid,time=time,outcome=outcome)
Data
I want to create an onset variable, so for each patient it would code a 1 for the time at which the patient first got the disease, but would be a 0 for any time period before or after (even if that patient still had the disease). For the example data it should look like this:
patientid <- c(100,100,100,101,101,101,102,102,102)
time <- c(1,2,3,1,2,3,1,2,3)
outcome <- c(0,1,1,0,0,1,1,1,0)
outcome_onset <- c(0,1,0,0,0,1,1,0,0)
Data <- data.frame(patientid = patientid, time = time, outcome = outcome,
                   outcome_onset = outcome_onset)
Data
I would therefore like some code, or some help, automating the creation of the outcome_onset variable.
Here is an option with cumsum to create a logical vector after grouping by the 'patientid'
library(dplyr)
Data %>%
  group_by(patientid) %>%
  mutate(outcome_onset = +(cumsum(outcome) == 1))
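To see why this works, here is a quick trace on one patient (my illustration, not from the original answer):
outcome <- c(0, 1, 1)     # patient 100
cumsum(outcome)           # 0 1 2
+(cumsum(outcome) == 1)   # 0 1 0 -> the onset row is flagged exactly once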
Or use match and %in%:
Data %>%
  group_by(patientid) %>%
  mutate(outcome_onset = +(row_number() %in% match(1, outcome)))
We can use which.max to get the index of the first 1 in the outcome variable and make that row 1 and the rest of them 0.
library(dplyr)
Data %>%
  group_by(patientid) %>%
  mutate(outcome_onset = as.integer(row_number() %in% which.max(outcome)),
         outcome_onset = replace(outcome_onset, is.na(outcome), NA))
#   patientid  time outcome outcome_onset
#       <dbl> <dbl>   <dbl>         <int>
# 1       100     1       0             0
# 2       100     2       1             1
# 3       100     3       1             0
# 4       101     1       0             0
# 5       101     2       0             0
# 6       101     3       1             1
# 7       102     1       1             1
# 8       102     2       1             0
# 9       102     3       0             0

Filtering rows based on two conditions at the ID level

I have long data where a given subject has 4 observations. I want to include only those ids that meet the following conditions:
has at least one 3
has at least one of 1,2 OR NA
My data structure:
df <- data.frame(id = c(1,1,1,1,2,2,2,2,3,3,3,3),
                 a = c(NA,1,2,3, NA,3,2,0, NA,NA,1,1))
My unsuccessful attempt (I get an empty data frame):
df %>% dplyr::group_by(id) %>% filter(a==3 & a %in% c(1,2,NA))
An option is to group by 'id' and build a condition that returns a single TRUE/FALSE per group. Based on the OP's post, we need the value 3 and at least one of the values 1, 2, or NA in column 'a'. So 3 %in% a returns a logical vector of length 1; for the second set we wrap any() around a comparison with the multiple values (or a check of the NA elements with is.na), and merge both logical outputs with &:
library(dplyr)
df %>%
  group_by(id) %>%
  filter((3 %in% a) & any(c(1, 2) %in% a | is.na(a)))
# A tibble: 8 x 2
# Groups:   id [2]
#      id     a
#   <dbl> <dbl>
# 1     1    NA
# 2     1     1
# 3     1     2
# 4     1     3
# 5     2    NA
# 6     2     3
# 7     2     2
# 8     2     0
I have done this a bit of a long way to show how an idea could work. You can consolidate this a bit.
df %>%
  group_by(id) %>%
  mutate(has_3 = sum(a == 3, na.rm = TRUE) > 0,
         keep_me = has_3 & (sum(is.na(a)) > 0 | sum(a %in% c(1, 2)) > 0)) %>%
  filter(keep_me == TRUE) %>%
  select(id, a)
      id     a
   <dbl> <dbl>
 1     1    NA
 2     1     1
 3     1     2
 4     1     3
 5     2    NA
 6     2     3
 7     2     2
 8     2     0
As I read it, the filter should keep ids 1 and 2, so I would use a combination of all/any:
df %>%
  group_by(id) %>%
  filter(all(3 %in% a) & any(c(1, 2, NA) %in% a))

Keeping certain rows in a data frame with a condition

I have a data frame in R from which I want to remove certain rows provided they match certain conditions. How can I do it?
I have tried using dplyr and ifelse but my code does not give the right answer:
check8 <- distinct(df5, prod, .keep_all = TRUE)
This does not work; it gives back the entire data set.
Input is:
check1 <- data.frame(ID = c(1,1,2,2,2,3,4),
                     prod = c("R","T","R","T",NA,"T","R"),
                     bad = c(0,0,0,1,0,1,0))
#   ID prod bad
# 1  1    R   0
# 2  1    T   0
# 3  2    R   0
# 4  2    T   1
# 5  2 <NA>   0
# 6  3    T   1
# 7  4    R   0
Output expected:
data.frame(ID = c(1,2,3,4),
           prod = c("R","R","T","R"),
           bad = c(0,0,1,0))
#   ID prod bad
# 1  1    R   0
# 2  2    R   0
# 3  3    T   1
# 4  4    R   0
I want the output such that for IDs that have more than one row (both prod values or an NA), only the row with prod R is kept, but if an ID has only one row then that row is kept regardless of its prod.
Using dplyr we can use filter to select rows where prod == "R" or if there is only one row in the group, select that row.
library(dplyr)
check1 %>%
  group_by(ID) %>%
  filter(prod == "R" | n() == 1)
#      ID prod    bad
#   <dbl> <fct> <dbl>
# 1     1 R         0
# 2     2 R         0
# 3     3 T         1
# 4     4 R         0
Here is a solution using an anti_join:
library(dplyr)
check1 <- data.frame(ID = c(1,1,2,2,2,3,4),
                     prod = c("R","T","R","T",NA,"T","R"),
                     bad = c(0,0,0,1,0,1,0))
# First part: select all the IDs which contain 'R' as prod
p1 <- check1 %>%
  group_by(ID) %>%
  filter(prod == 'R')
# Second part: using anti_join, get all the rows from check1 that have no
# matching ID in p1
p2 <- anti_join(check1, p1, by = 'ID')
solution <- bind_rows(p1, p2) %>%
  arrange(ID)
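The same logic can also be written as a single grouped filter, which may be a useful comparison (my sketch, not part of the original answer): keep the R rows, or every row of an ID that has no R at all.
check1 %>%
  group_by(ID) %>%
  filter(prod == 'R' | !any(prod == 'R', na.rm = TRUE)) %>%
  ungroup()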

Setting column value of a subset of rows in a dataframe in R [duplicate]

This question already has answers here:
How can I rank observations in-group faster?
(4 answers)
Closed 5 years ago.
I have a dataframe df with a column called ID.
Multiple rows may have the same ID and I want to set a column value "occurrence" to indicate how many times the ID has been seen before.
for (i in unique(df$ID)) {
  rows = df[df$ID == i, ]
  for (idx in 1:nrow(rows)) {
    rows[idx, 'occurrence'] = idx
  }
}
Unfortunately, this adds the occurrence column to rows, but it does not update the original data frame. How do I get the occurrence column added to df?
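A minimal base R fix for the loop above, assigning into df directly rather than into the per-ID copy rows (a sketch, not taken from the answers below):
df$occurrence <- NA_integer_   # create the column up front
for (i in unique(df$ID)) {
  df$occurrence[df$ID == i] <- seq_len(sum(df$ID == i))
}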
Update: The row_number() function pointed out by neilfws works great. Actually, I have a follow-up question: the dataframe also has a Year column, and what I need to do is add a new column (say Prev.Year.For.This.ID) for the year of the previous occurrence of the ID, e.g. if the input is
Year <- c(1991, 1992, 1993, 1994, 1995)
ID <- c(1, 2, 1, 2, 1)
df <- data.frame(Year, ID)
I'd like the output to look like this:
  ID Year occurrence Prev.Year.For.This.Id
   1 1991          1                  <NA>
   2 1992          1                  <NA>
   1 1993          2                  1991
   2 1994          2                  1992
   1 1995          3                  1993
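For the follow-up column, one possible sketch (assuming rows are ordered by Year within each ID) combines row_number() with lag() inside the same grouped mutate:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(occurrence = row_number(),
         Prev.Year.For.This.Id = lag(Year)) %>%
  ungroup()
# occurrence counts each ID's rows in order, and Prev.Year.For.This.Id holds the
# Year of that ID's previous occurrence (NA for the first one)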
You can use dplyr to group_by ID, then row_number gives the running total of occurrences.
library(dplyr)
df1 <- data.frame(ID = c(1,2,3,1,4,5,6,2,7,8,2))
df1 %>%
  group_by(ID) %>%
  mutate(cnt = row_number()) %>%
  ungroup()
      ID   cnt
   <dbl> <int>
 1     1     1
 2     2     1
 3     3     1
 4     1     2
 5     4     1
 6     5     1
 7     6     1
 8     2     2
 9     7     1
10     8     1
11     2     3
Are you after something like the following (I made up sample data for you):
library(dplyr)
df = data.frame(ID = c(1,1,1,2,2,3))
answer = df %>%
  group_by(ID) %>%
  mutate(occurrence = cumsum(ID / ID) - 1) %>%
  as.data.frame
This will give something which looks like this:
  ID occurrence
   1          0
   1          1
   1          2
   2          0
   2          1
   3          0
The dplyr package is a great tool for grouping and summarising data. I also find the code very readable when I use the pipe %>% (though, admittedly, it does take some getting used to).
> library(data.table)
> df = data.frame(ID = c(1,1,1,2,2,3))
> df <- data.table(df)
> df[, occurrence := sequence(.N), by = c("ID")]
> df
   ID occurrence
1:  1          1
2:  1          2
3:  1          3
4:  2          1
5:  2          2
6:  3          1
