Countdown dates in R - r

Suppose I have the following dataset:
id1 <- c(1,1,1,1,2,2,2,2,1,1,1,1)
dates <- c("a","a","a","a","b","b","b","b","c","c","c","c")
x <- c(NA,0,NA,NA,NA,NA,0,NA,NA,NA,NA,0)
df <- data.frame(id1,dates,x)
My objective is to add a new column that counts the position of each observation relative to the 0 within every combination of id1 and dates. This would yield the following outcome:
desired_result <- c(-1,0,1,2,-2,-1,0,1,-3,-2,-1,0)
Any help is appreciated.

library(dplyr)
df %>%
group_by(id1, dates) %>%
mutate(x = row_number() - which(x == 0))
id1 dates x
1 1 a -1
2 1 a 0
3 1 a 1
4 1 a 2
5 2 b -2
6 2 b -1
7 2 b 0
8 2 b 1
9 1 c -3
10 1 c -2
11 1 c -1
12 1 c 0
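With dplyr 1.1.0 or later, per-operation grouping with .by does the same (grouping by both id1 and dates, as the question asks):
df %>%
mutate(x = row_number() - which(x == 0), .by = c(id1, dates))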
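For comparison, here is a minimal base R sketch of the same idea using ave(). It is not the answerer's code: the helper name offset_from_zero is hypothetical, and the guard for groups without exactly one 0 is an added assumption about how such groups should be handled (they get NA).

```r
# Same data as in the question
id1   <- c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1)
dates <- c("a", "a", "a", "a", "b", "b", "b", "b", "c", "c", "c", "c")
x     <- c(NA, 0, NA, NA, NA, NA, 0, NA, NA, NA, NA, 0)
df <- data.frame(id1, dates, x)

# Hypothetical helper: position of each row relative to the single 0 in its group
offset_from_zero <- function(v) {
  zero_pos <- which(v == 0)  # which() ignores the NA comparisons
  # Assumption: a group without exactly one 0 gets NA instead of an error
  if (length(zero_pos) != 1L) return(rep(NA_real_, length(v)))
  seq_along(v) - zero_pos
}

# ave() applies the helper within each id1/dates combination
df$offset <- ave(df$x, df$id1, df$dates, FUN = offset_from_zero)
```

The dplyr versions above rely on which(x == 0) returning exactly one index per group; the guard here only makes that assumption explicit.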

Related

Find 2 out of 3 conditions per ID

I have the following dataframe:
df <-read.table(header=TRUE, text="id code
1 A
1 B
1 C
2 A
2 A
2 A
3 A
3 B
3 A")
Per id, I would love to find those individuals that have at least 2 conditions, namely:
conditionA = "A"
conditionB = "B"
conditionC = "C"
and create a new column "index" that is 1 if two or more conditions are met and 0 otherwise:
df_output <-read.table(header=TRUE, text="id code index
1 A 1
1 B 1
1 C 1
2 A 0
2 A 0
2 A 0
3 A 1
3 B 1
3 A 1")
So far I have tried the following:
df_output = df %>%
group_by(id) %>%
mutate(index = ifelse(grepl(conditionA|conditionB|conditionC, code), 1, 0))
and as you can see I am struggling to get the threshold count into the code.
You can create a vector of conditions, and then use %in% and sum to count the number of occurrences in each group. Use + (or ifelse) to convert logical into 1 and 0:
conditions = c("A", "B", "C")
df %>%
group_by(id) %>%
mutate(index = +(sum(unique(code) %in% conditions) >= 2))
id code index
1 1 A 1
2 1 B 1
3 1 C 1
4 2 A 0
5 2 A 0
6 2 A 0
7 3 A 1
8 3 B 1
9 3 A 1
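You could use n_distinct(), which is a faster and more concise equivalent of length(unique(x)). Note that it counts every distinct code, so it only matches the requirement when all codes in the data are among the conditions.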
df %>%
group_by(id) %>%
mutate(index = +(n_distinct(code) >= 2)) %>%
ungroup()
# # A tibble: 9 × 3
# id code index
# <int> <chr> <int>
# 1 1 A 1
# 2 1 B 1
# 3 1 C 1
# 4 2 A 0
# 5 2 A 0
# 6 2 A 0
# 7 3 A 1
# 8 3 B 1
# 9 3 A 1
You can use intersect() to find which conditions occur in each group, and then check whether the result has at least the required length (e.g. 2).
conditions = c('A', 'B', 'C')
df_output2 =
df %>%
group_by(id) %>%
mutate(index = as.integer(length(intersect(code, conditions)) >= 2))
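The counting idea behind these answers can also be sketched in base R. The data below is hypothetical: it extends the question's example with an id 4 that has two distinct codes, only one of which ("A") is a listed condition, to show why matching against conditions matters.

```r
# Hypothetical data: id 4 has codes "A" and "Z", but "Z" is not a condition
df <- data.frame(
  id   = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4),
  code = c("A", "B", "C", "A", "A", "A", "A", "B", "A", "A", "Z"),
  stringsAsFactors = FALSE
)
conditions <- c("A", "B", "C")

# Number of distinct listed conditions seen for each id (result is named by id)
n_met <- tapply(df$code, df$id, function(codes) length(intersect(codes, conditions)))

# Look up the per-id count for every row and apply the threshold
df$index <- as.integer(n_met[as.character(df$id)] >= 2)
```

Here id 4 gets index 0, whereas counting distinct codes alone would give it 1.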

Filter Rows Between with Multiple Events per Subject

I have a large data set and I'm trying to filter the days following a specific event for each subject. The issue is that the "event" of interest may happen multiple times for some subjects, and for a few subjects the event doesn't happen at all (in which case they could just be removed from the summarized data).
Here is an example of the data and what I've tried:
library(tidyverse)
set.seed(355)
subject <- c(rep(LETTERS[1:4], each = 40), rep("E", times = 40))
event <- c(sample(0:1, size = length(subject)-40, replace = T, prob = c(0.95, 0.05)), rep(0, times = 40))
df <- data.frame(subject, event)
df %>%
filter(event == 1) %>%
count(subject, event, sort = T)
# A tibble: 4 x 3
subject event n
<fct> <dbl> <int>
1 D 1 3
2 A 1 2
3 B 1 2
4 C 1 2
So we see that subject D has had the event 3 times while subjects A, B, and C have had the event 2 times. Subject E has not had the event at all.
My next step was to create an "event" tag that identifies where each event happened and produces NA for all other rows. I also created an event sequence, which counts along between events, because I thought it might be useful, but I didn't end up using it.
df_cleaned <- df %>%
group_by(subject, event) %>%
mutate(event_seq = seq_along(event == 1),
event_detail = ifelse(event == 1, "event", NA)) %>%
as.data.frame()
I tried two different approaches using a filter() and between() to get each event and the 2 rows following each event. Both of these approaches create an error because of the multiple events within subject. I can't figure out a good workaround for it.
Approach 1:
df_cleaned %>%
group_by(subject) %>%
filter(., between(row_number(),
left = which(!is.na(event_detail)),
right = which(!is.na(event_detail)) + 1))
Approach 2:
df_cleaned %>%
group_by(subject) %>%
mutate(event_group = cumsum(!is.na(event_detail))) %>%
filter(., between(row_number(), left = which(event_detail == "event"), right = which(event_detail == "event") + 2))
If you want the rows with 1 in event and the two rows following each of them, you can do the following. Ananda Mahto, the author of the splitstackshape package, provides getMyRows() in his SOfun package; it returns a list. You specify a range of rows relative to each match. Here I used 0:2, which asks R to take each row with 1 in event and the following two rows. I used bind_rows() to return a data frame, but if you need to work with a list, you can skip that step.
# install_github() comes from the devtools (or remotes) package
devtools::install_github("mrdwab/SOfun")
library(SOfun)
library(dplyr)
ind <- which(x = df$event == 1)
bind_rows(getMyRows(data = df, pattern = ind, range = 0:2))
subject event
1 A 1
2 A 0
3 A 0
4 A 1
5 A 0
6 A 0
7 B 1
8 B 0
9 B 0
10 B 1
11 B 0
12 B 0
13 C 1
14 C 0
15 C 0
16 C 1
17 C 0
18 C 0
19 D 1
20 D 0
21 D 0
22 D 1
23 D 0
24 D 0
25 D 1
26 D 0
27 D 0
Here is a tidyverse approach which uses cumsum() to create groups of rows after (and including) an event and which picks the top 3 rows of each group:
df %>%
group_by(subject) %>%
mutate(event_group = cumsum(event == 1L)) %>%
group_by(event_group, .add = TRUE) %>% # `add` was renamed to `.add` in dplyr 1.0.0
filter(event_group > 0 & row_number() <= 3L)
# A tibble: 27 x 3
# Groups: subject, event_group [9]
subject event event_group
<fct> <dbl> <int>
1 A 1 1
2 A 0 1
3 A 0 1
4 A 1 2
5 A 0 2
6 A 0 2
7 B 1 1
8 B 0 1
9 B 0 1
10 B 1 2
# … with 17 more rows
To test an edge case, here is a modified data set where subject A starts with three consecutive events. I have also added row numbers rn to check that the correct rows are picked:
df2 <- df %>%
mutate(event = ifelse(row_number() <= 2L, 1L, event),
rn = row_number())
Now we get
df2 %>%
group_by(subject) %>%
mutate(event_group = cumsum(event == 1L)) %>%
group_by(event_group, .add = TRUE) %>% # `add` was renamed to `.add` in dplyr 1.0.0
filter(event_group > 0 & row_number() <= 3L)
# A tibble: 29 x 4
# Groups: subject, event_group [11]
subject event rn event_group
<fct> <dbl> <int> <int>
1 A 1 1 1
2 A 1 2 2
3 A 1 3 3
4 A 0 4 3
5 A 0 5 3
6 A 1 22 4
7 A 0 23 4
8 A 0 24 4
9 B 1 59 1
10 B 0 60 1
# … with 19 more rows
which is in line with my expectations for this edge case.
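The cumsum() grouping used above can be sketched in base R as well. The data set here is small and hypothetical (two subjects, with back-to-back events for subject B), chosen so the result can be checked by eye.

```r
# Hypothetical data: two subjects, events marked with 1
df <- data.frame(
  subject = rep(c("A", "B"), each = 6),
  event   = c(0, 1, 0, 0, 0, 1,   1, 1, 0, 0, 0, 0)
)

# Running count of events within each subject; 0 means "before the first event"
df$event_group <- ave(as.integer(df$event == 1), df$subject, FUN = cumsum)

# Position of each row within its subject/event_group block
df$pos <- ave(seq_along(df$event), df$subject, df$event_group, FUN = seq_along)

# Keep the event row and up to two rows after it
keep <- df$event_group > 0 & df$pos <= 3
result <- df[keep, ]
```

As in the edge case above, an event that is immediately followed by another event contributes only its own row, because the next event starts a new block.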
Here is a base R option which looks similar to @jazzurro's attempt. We get the row indices where event == 1, then select the next two rows after each index, use unique() so that overlapping indices are selected only once, and subset the original df with the result.
inds <- which(df$event == 1)
df[unique(c(sapply(inds, `+`, 0:2))), ]
# subject event
#3 A 1
#4 A 0
#5 A 0
#22 A 1
#23 A 0
#24 A 0
#59 B 1
#60 B 0
#61 B 0
#62 B 1
#63 B 0
#64 B 0
#....
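One thing to watch with plain index arithmetic is that an event near the end of a subject's rows can pull in rows from the next subject, or indices past the end of the data. The following is a hedged sketch of one way to guard against that, on a small hypothetical data set; it is an addition, not part of the answer above.

```r
# Hypothetical data: subject A's event is on its last row,
# and subject B's second event is on the last row of the data
df <- data.frame(
  subject = rep(c("A", "B"), each = 4),
  event   = c(0, 0, 0, 1,   0, 1, 0, 1)
)

inds <- which(df$event == 1)                 # 4, 6, 8

# Candidate rows: each event row plus the two rows after it
cand   <- as.vector(outer(0:2, inds, `+`))   # 4 5 6 6 7 8 8 9 10
origin <- rep(inds, each = 3)                # event row each candidate came from

# Keep candidates that exist and belong to the same subject as their event
ok <- cand <= nrow(df)
ok[ok] <- df$subject[cand[ok]] == df$subject[origin[ok]]

rows <- sort(unique(cand[ok]))
result <- df[rows, ]
```

Row 5 is dropped because it belongs to subject B while its event (row 4) belongs to subject A, and rows 9 and 10 are dropped because they do not exist.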
Another dplyr option is to use lag():
library(dplyr)
df %>%
group_by(subject) %>%
filter(event == 1 | lag(event) == 1 | lag(event, 2) == 1)

Filter (subset) by conditions in 2 columns in R (dplyr or otherwise)

Given a dataset such as:
set.seed(134)
df<- data.frame(ID= rep(LETTERS[1:5], each=2),
condition=rep(0:1, 5),
value=rpois(10, 3)
)
df
ID condition value
1 A 0 2
2 A 1 3
3 B 0 5
4 B 1 2
5 C 0 3
6 C 1 1
7 D 0 2
8 D 1 4
9 E 0 1
10 E 1 5
For each ID, when the value for condition==0 is less than the value for condition==1, I want to keep both observations. When the value for condition==0 is greater than condition==1, I want to keep only the row for condition==0.
The subset returned should be this:
ID condition value
1 A 0 2
2 A 1 3
3 B 0 5
5 C 0 3
7 D 0 2
8 D 1 4
9 E 0 1
10 E 1 5
Using dplyr the first step is:
df %>% group_by(ID) %>%
But not sure where to go from there.
Translating fairly literally,
library(dplyr)
set.seed(134)
df <- data.frame(ID = rep(LETTERS[1:5], each = 2),
condition = rep(0:1, 5),
value = rpois(10, 3))
df %>% group_by(ID) %>%
filter(condition == 0 |
(condition == 1 & value > value[condition == 0]))
#> # A tibble: 8 x 3
#> # Groups: ID [5]
#> ID condition value
#> <fct> <int> <int>
#> 1 A 0 2
#> 2 A 1 3
#> 3 B 0 5
#> 4 C 0 3
#> 5 D 0 2
#> 6 D 1 4
#> 7 E 0 1
#> 8 E 1 5
This depends on each group having a single observation with condition == 0, but should otherwise be fairly robust.
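The assumption mentioned above can be checked explicitly before filtering. Here is a minimal base R sketch that does so; the values are typed in from the table in the question rather than regenerated with rpois(), and the names n_zero and baseline are illustrative only.

```r
# Data as printed in the question
df <- data.frame(
  ID        = rep(LETTERS[1:5], each = 2),
  condition = rep(0:1, 5),
  value     = c(2, 3, 5, 2, 3, 1, 2, 4, 1, 5),
  stringsAsFactors = FALSE
)

# Check the assumption: exactly one condition == 0 row per ID
n_zero <- tapply(df$condition == 0, df$ID, sum)
stopifnot(all(n_zero == 1))

# Baseline value (the condition == 0 row) for each ID, named by ID
zero_rows <- df[df$condition == 0, ]
baseline  <- setNames(zero_rows$value, zero_rows$ID)

# Keep every baseline row, and a condition == 1 row only if it exceeds its baseline
keep <- unname(df$condition == 0 | df$value > baseline[df$ID])
result <- df[keep, ]
```

If an ID had zero or several condition == 0 rows, the stopifnot() call would fail early instead of letting the comparison recycle or return an empty vector.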
This may not be the easiest way, but it should work as you want.
library(dplyr)
library(reshape2)
df %>%
dcast(ID ~ condition, value.var = 'value') %>% # cast to wide format
mutate(`1` = ifelse(`1` > `0`, `1`, NA)) %>% # set `1` to NA unless it exceeds `0`
melt('ID') %>% # melt as long format
arrange(ID) %>% # sort by ID
filter(complete.cases(.)) # remove NA rows
Output:
ID variable value
1 A 0 2
2 A 1 3
3 B 0 5
4 C 0 3
5 D 0 2
6 D 1 4
7 E 0 1
8 E 1 5
You always want the value from the first row in each group. You only want the value from the second row in each group if it's larger than the first.
This works:
df %>%
group_by(ID) %>%
filter(row_number() == 1 | value > lag(value))
Edit: as @alistaire points out, this method depends on the rows being in a particular order within each group, which it might be a good idea to guarantee as follows:
df %>%
arrange(ID, condition) %>%
group_by(ID) %>%
filter(row_number() == 1 | value > lag(value))

Extracting group dependent results from a dataframe

I have a dataframe made from different groups, and for each group real and predicted values. I want to extract values of tests on these values :
library(dplyr)
d = data.frame(group = c(rep(5,x="a"),rep(5,x="b")), real = c(rep(2, x=1:5)), pred = c(2,1,3,4,5,1,2,4,3,5))
group real pred
1 a 1 2
2 a 2 1
3 a 3 3
4 a 4 4
5 a 5 5
6 b 1 1
7 b 2 2
8 b 3 4
9 b 4 3
10 b 5 5
d <- d %>% group_by(group) %>% mutate( sg = ifelse(real == 1 & real == pred, 1, 0))
d <- d %>% group_by(group) %>% mutate( sp = ifelse(real <= 3 & pred <= 3, 1, 0))
d %>% distinct(sg, sp)
sg sp group
1 0 1 a
2 0 0 a
3 1 1 b
4 0 1 b
5 0 0 b
But I want something like this (only 1 result per group)
sg sp group
1 0 1 a
3 1 1 b
I am pretty sure dplyr, data.table or tidyr can do something but I cannot find how.
If it is always the first row of each group that you want to extract, you could use the do function (superseded since dplyr 1.0.0, where slice(1) is the usual replacement):
d %>% do(.[1,])
Another option is to use the filter function like this:
d %>% filter(seq_along(sp) == 1)
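A base R sketch of the same "first row per group" selection uses duplicated(). The data is rebuilt explicitly from the table in the question; this is an alternative for comparison, not the answerer's method.

```r
# Data as printed in the question
d <- data.frame(
  group = rep(c("a", "b"), each = 5),
  real  = rep(1:5, times = 2),
  pred  = c(2, 1, 3, 4, 5, 1, 2, 4, 3, 5),
  stringsAsFactors = FALSE
)

# Same two test columns as in the question
d$sg <- ifelse(d$real == 1 & d$real == d$pred, 1, 0)
d$sp <- ifelse(d$real <= 3 & d$pred <= 3, 1, 0)

# duplicated() is FALSE for the first occurrence of each group value
first_rows <- d[!duplicated(d$group), c("sg", "sp", "group")]
```

Like the answers above, this assumes the row of interest is the first one in each group as the data is currently ordered.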

Using 'window' functions in dplyr

I need to process rows of a data-frame in order, but need to look-back for certain rows. Here is an approximate example:
library(dplyr)
d <- tibble(trial = rep(c("A","a","b","B","x","y"),2)) # tibble() replaces the deprecated data_frame()
d <- d %>%
mutate(cond = rep('', n()), num = as.integer(rep(0,n())))
for (i in 1:nrow(d)){
if(d$trial[i] == "A"){
d$num[i] <- 0
d$cond[i] <- "A"
}
else if(d$trial[i] == "B"){
d$num[i] <- 0
d$cond[i] <- "B"
}
else{
d$num[i] <- d$num[i-1] +1
d$cond[i] <- d$cond[i-1]
}
}
The resulting data-frame looks like
> d
Source: local data frame [12 x 3]
trial cond num
1 A A 0
2 a A 1
3 b A 2
4 B B 0
5 x B 1
6 y B 2
7 A A 0
8 a A 1
9 b A 2
10 B B 0
11 x B 1
12 y B 2
What is the proper way of doing this using dplyr?
dplyr-only solution:
d %>%
group_by(i=cumsum(trial %in% c('A','B'))) %>%
mutate(cond=trial[1],num=seq(n())-1) %>%
ungroup() %>%
select(-i)
# trial cond num
# 1 A A 0
# 2 a A 1
# 3 b A 2
# 4 B B 0
# 5 x B 1
# 6 y B 2
# 7 A A 0
# 8 a A 1
# 9 b A 2
# 10 B B 0
# 11 x B 1
# 12 y B 2
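The trick in this answer is that cumsum() over the "starts a new block" flag yields a block id. A minimal base R sketch of the same steps, assuming (as in the example data) that the first row is an "A" or "B" trial:

```r
trial <- rep(c("A", "a", "b", "B", "x", "y"), 2)

# Each "A" or "B" row starts a new block, so the running sum is a block id
block <- cumsum(trial %in% c("A", "B"))    # 1 1 1 2 2 2 3 3 3 4 4 4

# cond is the first trial of each block, recycled over the block
cond <- ave(trial, block, FUN = function(v) v[1])

# num counts rows within each block, starting at 0
num <- ave(seq_along(trial), block, FUN = seq_along) - 1L

d <- data.frame(trial, cond, num, stringsAsFactors = FALSE)
```

If the data started with a row other than "A" or "B", that row would land in block 0 and take its own trial value as cond, so the assumption matters.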
Try
d %>%
mutate(cond = zoo::na.locf(ifelse(trial=="A"|trial=="B", trial, NA))) %>%
group_by(id=rep(1:length(rle(cond)$values), rle(cond)$lengths)) %>%
mutate(num = 0:(n()-1)) %>% ungroup %>%
select(-id)
Here is one way. The first step was to put A or B in cond using ifelse(). Then I used na.locf() from the zoo package to fill the NAs with the preceding A or B. I wanted to assign a temporary group ID before taking care of num, so I borrowed rleid() from the data.table package. Grouping the data by the temporary group ID (i.e., foo), I used row_number(), which is one of the window functions in dplyr. Note that I tried to remove foo with select(-foo), but the column stayed. This is because foo is still a grouping variable at that point; calling ungroup() before select(-foo) drops it.
library(zoo)
library(dplyr)
library(data.table)
d <- tibble(trial = rep(c("A","a","b","B","x","y"),2)) # tibble() replaces the deprecated data_frame()
mutate(d, cond = ifelse(trial == "A" | trial == "B", trial, NA),
cond = na.locf(cond),
foo = rleid(cond)) %>%
group_by(foo) %>%
mutate(num = row_number() - 1)
# trial cond foo num
#1 A A 1 0
#2 a A 1 1
#3 b A 1 2
#4 B B 2 0
#5 x B 2 1
#6 y B 2 2
#7 A A 3 0
#8 a A 3 1
#9 b A 3 2
#10 B B 4 0
#11 x B 4 1
#12 y B 4 2
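rleid() gives each run of identical consecutive values its own id. A base R sketch of the same idea with rle(), starting from the already filled cond column, might look like this:

```r
# cond after the na.locf() step in the answer above
cond <- c("A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B")

runs <- rle(cond)

# One id per run of identical consecutive values (what rleid() returns)
run_id <- rep(seq_along(runs$lengths), times = runs$lengths)

# Zero-based position within each run, equivalent to row_number() - 1 per run
num <- sequence(runs$lengths) - 1L
```

Because run_id and num are plain vectors here, there is no grouping to undo afterwards.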
