Filter Rows Between with Multiple Events per Subject - r

I have a large data set and I'm trying to filter the days following a specific event for each subject. The issue is that the "event" of interest may happen multiple times for some subjects, and for a few subjects the event doesn't happen at all (in which case they can simply be dropped from the summarized data).
Here is an example of the data and what I've tried:
library(tidyverse)
set.seed(355)
subject <- c(rep(LETTERS[1:4], each = 40), rep("E", times = 40))
event <- c(sample(0:1, size = length(subject)-40, replace = T, prob = c(0.95, 0.05)), rep(0, times = 40))
df <- data.frame(subject, event)
df %>%
filter(event == 1) %>%
count(subject, event, sort = T)
# A tibble: 4 x 3
subject event n
<fct> <dbl> <int>
1 D 1 3
2 A 1 2
3 B 1 2
4 C 1 2
So we see that subject D has had the event 3 times while subjects A, B, and C have had the event 2 times. Subject E has not had the event at all.
My next step was to create an "event" tag that identifies where each event happened and produces an NA for all other rows. I also created an event sequence, which counts along between events, because I thought it might be useful, but I didn't end up using it.
df_cleaned <- df %>%
group_by(subject, event) %>%
mutate(event_seq = seq_along(event == 1),
event_detail = ifelse(event == 1, "event", NA)) %>%
as.data.frame()
I tried two different approaches using filter() and between() to get each event and the 2 rows following each event. Both approaches throw an error because of the multiple events within a subject, and I can't figure out a good workaround.
Approach 1:
df_cleaned %>%
group_by(subject) %>%
filter(., between(row_number(),
left = which(!is.na(event_detail)),
right = which(!is.na(event_detail)) + 1))
Approach 2:
df_cleaned %>%
group_by(subject) %>%
mutate(event_group = cumsum(!is.na(event_detail))) %>%
filter(., between(row_number(), left = which(event_detail == "event"), right = which(event_detail == "event") + 2))

If you want to get the rows with 1 in event and the two rows following each of them, you can do the following. Thanks to Ananda Mahto, the author of the splitstackshape package, we can handle this type of operation with getMyRows() from the SOfun package, which returns a list. You can specify a range of rows in the function. Here I used 0:2, so I am asking R to take each row with 1 in event and the following two rows. I used bind_rows() to return a data frame, but if you need to work with a list, you do not have to do that.
devtools::install_github("mrdwab/SOfun")  # requires the devtools package
library(SOfun)
library(dplyr)
ind <- which(x = df$event == 1)
bind_rows(getMyRows(data = df, pattern = ind, range = 0:2))
subject event
1 A 1
2 A 0
3 A 0
4 A 1
5 A 0
6 A 0
7 B 1
8 B 0
9 B 0
10 B 1
11 B 0
12 B 0
13 C 1
14 C 0
15 C 0
16 C 1
17 C 0
18 C 0
19 D 1
20 D 0
21 D 0
22 D 1
23 D 0
24 D 0
25 D 1
26 D 0
27 D 0

Here is a tidyverse approach which uses cumsum() to create groups of rows after (and including) an event and which picks the top 3 rows of each group:
df %>%
group_by(subject) %>%
mutate(event_group = cumsum(event == 1L)) %>%
group_by(event_group, .add = TRUE) %>%  # dplyr >= 1.0.0; older versions use add = TRUE
filter(event_group > 0 & row_number() <= 3L)
# A tibble: 27 x 3
# Groups: subject, event_group [9]
subject event event_group
<fct> <dbl> <int>
1 A 1 1
2 A 0 1
3 A 0 1
4 A 1 2
5 A 0 2
6 A 0 2
7 B 1 1
8 B 0 1
9 B 0 1
10 B 1 2
# … with 17 more rows
To test an edge case, here is a modified data set where subject A starts with three consecutive events. I have also added row numbers rn in order to check that the correct rows are picked:
df2 <- df %>%
mutate(event = ifelse(row_number() <= 2L, 1L, event),
rn = row_number())
Now we get
df2 %>%
group_by(subject) %>%
mutate(event_group = cumsum(event == 1L)) %>%
group_by(event_group, .add = TRUE) %>%  # dplyr >= 1.0.0; older versions use add = TRUE
filter(event_group > 0 & row_number() <= 3L)
# A tibble: 29 x 4
# Groups: subject, event_group [11]
subject event rn event_group
<fct> <dbl> <int> <int>
1 A 1 1 1
2 A 1 2 2
3 A 1 3 3
4 A 0 4 3
5 A 0 5 3
6 A 1 22 4
7 A 0 23 4
8 A 0 24 4
9 B 1 59 1
10 B 0 60 1
# … with 19 more rows
which is in line with my expectations for this edge case.
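To make the grouping logic easier to follow, here is a minimal base R sketch on a made-up event vector for a single subject (the vector and the helper names pos_in_group and keep are only for illustration). It shows how cumsum() starts a new group at every event and how the first three rows of each group are then selected:

```r
# Hypothetical event vector for one subject
event <- c(0, 1, 0, 0, 0, 1, 0)

# Every event starts a new group; rows before the first event fall into group 0
event_group <- cumsum(event == 1)

# Position of each row within its group (1 = the event row itself)
pos_in_group <- ave(seq_along(event), event_group, FUN = seq_along)

# Keep the event row and the two rows after it, skipping group 0
keep <- event_group > 0 & pos_in_group <= 3
which(keep)
```

Because a new event resets the position counter, a second event that falls inside the window of the first one simply starts its own group, which is the behaviour seen in the edge case above.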

Here is a base R option which looks similar to @jazzurro's attempt. We get the row indices where event == 1, then select each of those rows plus the next two, use unique() so that overlapping indices are selected only once, and subset the original df with the result.
inds <- which(df$event == 1)
df[unique(c(sapply(inds, `+`, 0:2))), ]
# subject event
#3 A 1
#4 A 0
#5 A 0
#22 A 1
#23 A 0
#24 A 0
#59 B 1
#60 B 0
#61 B 0
#62 B 1
#63 B 0
#64 B 0
#....
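One thing to keep in mind with plain index arithmetic is that it ignores subject boundaries: an event in the last or second-to-last row of a subject would pull in rows from the next subject (or point past the end of the data). A possible guard, sketched here on a small made-up data frame (df_toy, offsets, cand, origin and ok are illustrative names, not part of the answer above), is to keep only candidate rows that exist and belong to the same subject as the event that produced them:

```r
# Hypothetical data: subject A has an event in its last row, and so does B
df_toy <- data.frame(subject = c("A", "A", "A", "B", "B"),
                     event   = c(0, 0, 1, 0, 1))

inds    <- which(df_toy$event == 1)
offsets <- 0:2

# Candidate rows (event row plus the next two) and the event row each came from
cand   <- rep(inds, each = length(offsets)) + offsets
origin <- rep(inds, each = length(offsets))

# Drop candidates beyond the last row, then those belonging to another subject
ok <- cand <= nrow(df_toy)
ok[ok] <- df_toy$subject[cand[ok]] == df_toy$subject[origin[ok]]

rows <- sort(unique(cand[ok]))
df_toy[rows, ]
```

Here only the two event rows themselves survive, because neither event has following rows within its own subject.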
Another option, using dplyr, is lag():
library(dplyr)
df %>%
group_by(subject) %>%
filter(event == 1 | lag(event) == 1 | lag(event, 2) == 1)

Related

Find 2 out of 3 conditions per ID

I have the following dataframe:
df <-read.table(header=TRUE, text="id code
1 A
1 B
1 C
2 A
2 A
2 A
3 A
3 B
3 A")
Per id, I would love to find those individuals that have at least 2 conditions, namely:
conditionA = "A"
conditionB = "B"
conditionC = "C"
and create a new column "index", which is 1 if two or more conditions are met and 0 otherwise:
df_output <-read.table(header=TRUE, text="id code index
1 A 1
1 B 1
1 C 1
2 A 0
2 A 0
2 A 0
3 A 1
3 B 1
3 A 1")
So far I have tried the following:
df_output = df %>%
group_by(id) %>%
mutate(index = ifelse(grepl(conditionA|conditionB|conditionC, code), 1, 0))
and as you can see I am struggling to get the threshold count into the code.
You can create a vector of conditions, and then use %in% and sum to count the number of occurrences in each group. Use + (or ifelse) to convert logical into 1 and 0:
conditions = c("A", "B", "C")
df %>%
group_by(id) %>%
mutate(index = +(sum(unique(code) %in% conditions) >= 2))
id code index
1 1 A 1
2 1 B 1
3 1 C 1
4 2 A 0
5 2 A 0
6 2 A 0
7 3 A 1
8 3 B 1
9 3 A 1
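For comparison, the same count can be sketched in base R. This assumes the example data from the question (rebuilt here as df_toy so the snippet runs on its own) and counts, per id, how many of the conditions occur at least once:

```r
# Example data from the question, rebuilt under a different name
df_toy <- data.frame(id   = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                     code = c("A", "B", "C", "A", "A", "A", "A", "B", "A"))
conditions <- c("A", "B", "C")

# Number of conditions that appear at least once within each id
n_met <- tapply(df_toy$code, df_toy$id, function(x) sum(conditions %in% x))

# Map the per-id count back to the rows and apply the threshold of 2
df_toy$index <- as.integer(n_met[as.character(df_toy$id)] >= 2)
df_toy
```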
You could use n_distinct(), which is a faster and more concise equivalent of length(unique(x)).
df %>%
group_by(id) %>%
mutate(index = +(n_distinct(code) >= 2)) %>%
ungroup()
# # A tibble: 9 × 3
# id code index
# <int> <chr> <int>
# 1 1 A 1
# 2 1 B 1
# 3 1 C 1
# 4 2 A 0
# 5 2 A 0
# 6 2 A 0
# 7 3 A 1
# 8 3 B 1
# 9 3 A 1
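One caveat with counting distinct codes directly: it counts every distinct value, including any that are not among the conditions. That makes no difference for the example data, but it would if other codes could occur. A small base R sketch with a made-up code "D" (not in the original data) shows the difference:

```r
conditions <- c("A", "B", "C")

# Hypothetical group containing a code that is not one of the conditions
code <- c("A", "D", "A")

# Counts all distinct codes, so "D" is included
n_all <- length(unique(code))

# Counts only distinct codes that are among the conditions
n_cond <- length(unique(code[code %in% conditions]))

c(all_codes = n_all, matching_conditions = n_cond)
```

With a threshold of 2, the first count would flag this group and the second would not.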
You can check the conditions with intersect() and test whether the resulting vector has at least the minimum length (e.g. 2).
conditions = c('A', 'B', 'C')
df_output2 =
df %>%
group_by(id) %>%
mutate(index = as.integer(length(intersect(code, conditions)) >= 2))

Efficient way to add sample information as new column to data set

I know how I can subset a data frame by sampling certain rows. However, I'm struggling with finding an easy (preferably tidyverse) way to just ADD the sampling information as a new column to my data set, i.e. I simply want to populate a new column with "1" if it is sampled and "0" if not.
I currently have this one, but it feels overly complicated. Note, in the example I want to sample 3 rows per group.
df <- data.frame(group = c(1,2,1,2,1,1,1,1,2,2,2,2,2,1,1),
var = 1:15)
library(tidyverse)
df <- df %>%
group_by(group) %>%
mutate(sampling_info = sample.int(n(), size = n(), replace = FALSE),
sampling_info = if_else(sampling_info <= 3, 1, 0))
You can try:
library(dplyr)
set.seed(123)
df %>%
arrange(group) %>%
group_by(group) %>%
mutate(sampling_info = as.integer(row_number() %in% sample(n(), size = 3))) %>%
ungroup
# group var sampling_info
# <dbl> <int> <int>
# 1 1 1 0
# 2 1 3 0
# 3 1 5 1
# 4 1 6 0
# 5 1 7 0
# 6 1 8 0
# 7 1 14 1
# 8 1 15 1
# 9 2 2 0
#10 2 4 1
#11 2 9 1
#12 2 10 0
#13 2 11 0
#14 2 12 1
#15 2 13 0
sample(n(), size = 3) will generate 3 random row numbers for each group and we assign 1 for those row numbers.
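The same idea can be sketched without dplyr, assuming every group has at least 3 rows (otherwise sample() fails). The data frame df_toy below is made up for illustration; ave() applies the function to the row positions of each group and puts the results back in the original row order:

```r
set.seed(1)

# Hypothetical data with four rows per group
df_toy <- data.frame(group = c(1, 2, 1, 2, 1, 1, 2, 2), var = 1:8)

# Within each group, flag 3 randomly chosen positions with 1 and the rest with 0
df_toy$sampling_info <- ave(
  seq_len(nrow(df_toy)), df_toy$group,
  FUN = function(i) as.integer(seq_along(i) %in% sample(length(i), size = 3))
)

# Check: exactly 3 sampled rows per group
tapply(df_toy$sampling_info, df_toy$group, sum)
```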

How to remove subsequent values after first instance only when other values are absent

I am trying to remove zeros after the first instance of a zero, when all future values are 0. Eventually I would love to do this group_by species but baby steps. Here's an example;
# Sample
library(tidyverse)
id<-c("a","b","c","d","e","f","g","h","i","j")
time<-c(1,2,3,4,5,6,7,8,9,10)
value<-c(90, 50, 40, 0, 30, 30, 0, 10, 0, 0)
df<-data.frame(id, time, value)
df
id time value
1 a 1 90
2 b 2 50
3 c 3 40
4 d 4 0
5 e 5 30
6 f 6 30
7 g 7 0
8 h 8 10
9 i 9 0
10 j 10 0
I would like to see observation id "j" and only observation id "j" removed. I am not even sure where to start. Any suggestions are much appreciated!
Here is a base R only solution. It uses rle() to get the number of trailing zeros, if any, and then subsets the data frame with head().
r <- rle(df$value == 0)
if(r$values[length(r$values)]) head(df, -(r$lengths[length(r$values)] - 1))
# id time value
#1 a 1 90
#2 b 2 50
#3 c 3 40
#4 d 4 0
#5 e 5 30
#6 f 6 30
#7 g 7 0
#8 h 8 10
#9 i 9 0
You can write a function with the code above, and maybe *apply it to groups.
trailingZeros <- function(DF, col = "value"){
r <- rle(DF[[col]] == 0)
if(r$values[length(r$values)] && r$lengths[length(r$values)] > 1)
head(DF, -(r$lengths[length(r$values)] - 1))
else
DF
}
trailingZeros(df)
Note that this also works with a larger number of trailing zeros.
id2 <- c("a","b","c","d","e","f","g","h","i","j","k")
time2 <- c(1,2,3,4,5,6,7,8,9,10,11)
value2 <- c(90, 50, 40, 0, 30, 30, 0, 10, 0, 0, 0) # One more zero at end
df2 <- data.frame(id = id2, time = time2, value = value2)
trailingZeros(df2)
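As for applying it to groups, one possible way is split() plus lapply(), followed by rbind(). This is only a sketch: the function is repeated so the snippet runs on its own, and df_grp is a made-up grouped data set.

```r
# Same logic as the function above, repeated to keep the example self-contained
trailingZeros <- function(DF, col = "value") {
  r <- rle(DF[[col]] == 0)
  n <- length(r$values)
  if (r$values[n] && r$lengths[n] > 1)
    head(DF, -(r$lengths[n] - 1))
  else
    DF
}

# Hypothetical grouped data: both groups end in a run of zeros
df_grp <- data.frame(id    = letters[1:11],
                     group = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                     value = c(90, 0, 0, 40, 0, 0, 30, 30, 0, 0, 0))

# Split by group, trim each piece, and bind the pieces back together
res <- do.call(rbind, lapply(split(df_grp, df_grp$group), trailingZeros))
rownames(res) <- NULL
res
```

Each group keeps its first trailing zero, and zeros in the middle of a group are left alone.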
Here is a solution within the tidyverse which also works with a larger number of trailing zeros:
df <- tibble(id = letters[1:11], time = 1:11,
value = c(90,50,40,0,30,30,0,10,0,0,0))
df %>%
slice(n():1) %>%
slice(c(which(cumsum(value > 0) > 0)[1] - 1, which(cumsum(value > 0) > 0))) %>%
slice(n():1)
Here is a tidyverse solution that also works with groups, shown first on the sample data (without grouping). The code could be shortened, but this version is very readable ;-)
df %>%
#arrange by id
arrange( id ) %>%
#no grouping variable in sample data.. so don't use group_by here
#group_by( group) %>%
#create dummies: position in group, last value of group, position of last non-zero in group, previous value (within group)
mutate( pos_in_group = 1:n() ) %>%
mutate( last_value = last( value ) ) %>%
mutate( pos_last_not_zero = max( which( value != 0) ) ) %>%
mutate( prev_value = lag( value ) ) %>%
#drop all rows where:
# the last value of the group == 0 AND
# the previous row (within the group) == 0 AND
# the position of the row is 'below' the last non-zero measurement (in the group)
filter( !(last_value == 0 & prev_value == 0 & pos_in_group >= pos_last_not_zero + 1 ) ) %>%
#throw away the dummies
select( -c( pos_in_group, last_value, pos_last_not_zero, prev_value ) )
# id time value
# 1 a 1 90
# 2 b 2 50
# 3 c 3 40
# 4 d 4 0
# 5 e 5 30
# 6 f 6 30
# 7 g 7 0
# 8 h 8 10
# 9 i 9 0
Example with some grouping involved
# Sample
library(tidyverse)
id<-c("a","b","c","d","e","f","g","h","i","j","k")
group<-c(1,1,1,1,1,1,2,2,2,2,2)
time<-c(1,2,3,4,5,6,7,8,9,10,11)
value = c(90,0,0,40,0,0,30,30,0,0,0)
df<-data.frame(id, group, time, value)
df
# id group time value
# 1 a 1 1 90
# 2 b 1 2 0
# 3 c 1 3 0
# 4 d 1 4 40
# 5 e 1 5 0
# 6 f 1 6 0
# 7 g 2 7 30
# 8 h 2 8 30
# 9 i 2 9 0
# 10 j 2 10 0
# 11 k 2 11 0
df %>%
#arrange by id
arrange( id ) %>%
#group
group_by( group) %>%
#create dummies: position in group, last value of group, position of last non-zero in group, previous value (within group)
mutate( pos_in_group = 1:n() ) %>%
mutate( last_value = last( value ) ) %>%
mutate( pos_last_not_zero = max( which( value != 0) ) ) %>%
mutate( prev_value = lag( value ) ) %>%
#drop all rows where:
# the last value of the group == 0 AND
# the previous row (within the group) == 0 AND
# the position of the row is 'below' the last non-zero measurement (in the group)
filter( !(last_value == 0 & prev_value == 0 & pos_in_group >= pos_last_not_zero + 1 ) ) %>%
#throw away the dummies
select( -c( pos_in_group, last_value, pos_last_not_zero, prev_value ) )
# # A tibble: 8 x 4
# # Groups: group [2]
# id group time value
# <fct> <dbl> <dbl> <dbl>
# 1 a 1 1 90
# 2 b 1 2 0
# 3 c 1 3 0
# 4 d 1 4 40
# 5 e 1 5 0
# 6 g 2 7 30
# 7 h 2 8 30
# 8 i 2 9 0

Filter (subset) by conditions in 2 columns in R (dplyr or otherwise)

Given a dataset such as:
set.seed(134)
df<- data.frame(ID= rep(LETTERS[1:5], each=2),
condition=rep(0:1, 5),
value=rpois(10, 3)
)
df
ID condition value
1 A 0 2
2 A 1 3
3 B 0 5
4 B 1 2
5 C 0 3
6 C 1 1
7 D 0 2
8 D 1 4
9 E 0 1
10 E 1 5
For each ID, when the value for condition==0 is less than the value for condition==1, I want to keep both observations. When the value for condition==0 is greater than condition==1, I want to keep only the row for condition==0.
The subset returned should be this:
ID condition value
1 A 0 2
2 A 1 3
3 B 0 5
5 C 0 3
7 D 0 2
8 D 1 4
9 E 0 1
10 E 1 5
Using dplyr the first step is:
df %>% group_by(ID) %>%
But not sure where to go from there.
Translating fairly literally,
library(dplyr)
set.seed(134)
df <- data.frame(ID = rep(LETTERS[1:5], each = 2),
condition = rep(0:1, 5),
value = rpois(10, 3))
df %>% group_by(ID) %>%
filter(condition == 0 |
(condition == 1 & value > value[condition == 0]))
#> # A tibble: 8 x 3
#> # Groups: ID [5]
#> ID condition value
#> <fct> <int> <int>
#> 1 A 0 2
#> 2 A 1 3
#> 3 B 0 5
#> 4 C 0 3
#> 5 D 0 2
#> 6 D 1 4
#> 7 E 0 1
#> 8 E 1 5
This depends on each group having a single observation with condition == 0, but should otherwise be fairly robust.
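If you want to verify that assumption before filtering, a base R sketch could look like the following. It uses the values printed in the question (typed in by hand as df_toy rather than regenerated with rpois()), checks that every ID has exactly one condition == 0 row, and then applies the same rule without dplyr:

```r
# Values as printed in the question, entered manually
df_toy <- data.frame(ID        = rep(LETTERS[1:5], each = 2),
                     condition = rep(0:1, 5),
                     value     = c(2, 3, 5, 2, 3, 1, 2, 4, 1, 5))

# Assumption check: exactly one condition == 0 row per ID
n_zero <- tapply(df_toy$condition == 0, df_toy$ID, sum)
stopifnot(all(n_zero == 1))

# Look up each row's condition == 0 value for the same ID
is_zero <- df_toy$condition == 0
base0 <- df_toy$value[is_zero][match(df_toy$ID, df_toy$ID[is_zero])]

# Keep every condition == 0 row, and condition == 1 rows that exceed it
keep <- is_zero | df_toy$value > base0
df_toy[keep, ]
```

The stopifnot() call stops with an error if an ID has no condition == 0 row or more than one, instead of silently returning a wrong subset.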
This may not be the easiest way, but it should work as you want.
library(dplyr)
library(reshape2)
df %>%
dcast(ID ~ condition, value.var = 'value') %>% # cast to wide format
mutate(`1` = ifelse(`1` > `0`, `1`, NA)) %>% # turn 0>1 values as NA
melt('ID') %>% # melt as long format
arrange(ID) %>% # sort by ID
filter(complete.cases(.)) # remove NA rows
Output:
ID variable value
1 A 0 2
2 A 1 3
3 B 0 5
4 C 0 3
5 D 0 2
6 D 1 4
7 E 0 1
8 E 1 5
You always want the value from the first row in each group. You only want the value from the second row in each group if it's larger than the first.
This works:
df %>%
group_by(ID) %>%
filter(row_number() == 1 | value > lag(value))
Edit: as @alistaire points out, this method depends on a particular row order, which it might be a good idea to guarantee as follows:
df %>%
arrange(ID, condition) %>%
group_by(ID) %>%
filter(row_number() == 1 | value > lag(value))

Extracting group dependent results from a dataframe

I have a data frame containing several groups, each with real and predicted values. I want to extract the results of tests on these values:
library(dplyr)
d = data.frame(group = rep(c("a", "b"), each = 5), real = rep(1:5, times = 2), pred = c(2,1,3,4,5,1,2,4,3,5))
group real pred
1 a 1 2
2 a 2 1
3 a 3 3
4 a 4 4
5 a 5 5
6 b 1 1
7 b 2 2
8 b 3 4
9 b 4 3
10 b 5 5
d <- d %>% group_by(group) %>% mutate( sg = ifelse(real == 1 & real == pred, 1, 0))
d <- d %>% group_by(group) %>% mutate( sp = ifelse(real <= 3 & pred <= 3, 1, 0))
d %>% distinct(sg, sp)
sg sp group
1 0 1 a
2 0 0 a
3 1 1 b
4 0 1 b
5 0 0 b
But I want something like this (only 1 result per group)
sg sp group
1 0 1 a
3 1 1 b
I am pretty sure dplyr, data.table or tidyr can do something but I cannot find how.
If it is always the first row of each group that you want to extract, you could use the do function:
d %>% do(.[1,])
Another option is to use the filter function like this:
d %>% filter(seq_along(sp) == 1)
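If you would rather not depend on the grouping still being attached to d, the first row per group can also be picked in base R with duplicated(). This is a sketch that rebuilds the example data and the two test columns so it runs on its own (d_toy and first are illustrative names):

```r
# Example data from the question, rebuilt under a different name
d_toy <- data.frame(group = rep(c("a", "b"), each = 5),
                    real  = rep(1:5, times = 2),
                    pred  = c(2, 1, 3, 4, 5, 1, 2, 4, 3, 5))

# The two row-wise tests from the question
d_toy$sg <- ifelse(d_toy$real == 1 & d_toy$real == d_toy$pred, 1, 0)
d_toy$sp <- ifelse(d_toy$real <= 3 & d_toy$pred <= 3, 1, 0)

# duplicated() is FALSE for the first occurrence of each group, so this keeps row 1 of each
first <- d_toy[!duplicated(d_toy$group), c("sg", "sp", "group")]
first
```

Like the two options above, this only gives the desired result if the first row of each group is the one you want.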
