I need a more efficient way to add a marker showing that an observation was registered in the 3 days before a specific date. The problem is that these dates are not necessarily consecutive, i.e. some can be missing, yet I need the marker to skip over the missing days. The example below illustrates the problem and what I need more clearly:
library(tidyverse)
library(lubridate)
df <- data.frame("Date" = c(as_date(0:9)), ID = rep(paste0("ID", 1:3), each = 10))
df <- df[-c(5, 13, 24),]
date_before <- "1970-01-07"
df[, "three_days_before"] <- 0
for (i in df$ID) {
  cond <- df[, "ID"] == i & df[, "Date"] == date_before
  before_n <- (which(cond) - 3):(which(cond) - 1)
  df[before_n, "three_days_before"] <- 1
}
df
The loop gives me what I need (three rows are marked each time, regardless of whether all three preceding days are present in the data frame), yet it takes quite a long time on a larger data set. Can someone recommend a better way?
1) Apply a rolling window separately for each ID. The rolling window function checks whether any of the next 3 elements of Date equals date_before.
(Specifying a width of list(1:3) says to use offsets 1, 2 and 3, i.e. the next 3 elements ahead.) Note that there are no next 3 elements for the last value, so we use fill to fill it in. We add 0 to convert from logical to numeric. This solution involves only two lines of code and no explicit looping.
library(zoo)
roll <- function(x) rollapply(x, list(1:3), FUN = any, partial = TRUE, fill = FALSE)
transform(df, before = ave(Date == date_before, ID, FUN = roll) + 0)
giving:
Date ID before
1 1970-01-01 ID1 0
2 1970-01-02 ID1 0
3 1970-01-03 ID1 1
4 1970-01-04 ID1 1
6 1970-01-06 ID1 1
7 1970-01-07 ID1 0
8 1970-01-08 ID1 0
9 1970-01-09 ID1 0
10 1970-01-10 ID1 0
11 1970-01-01 ID2 0
12 1970-01-02 ID2 0
14 1970-01-04 ID2 1
15 1970-01-05 ID2 1
16 1970-01-06 ID2 1
17 1970-01-07 ID2 0
18 1970-01-08 ID2 0
19 1970-01-09 ID2 0
20 1970-01-10 ID2 0
21 1970-01-01 ID3 0
22 1970-01-02 ID3 0
23 1970-01-03 ID3 1
25 1970-01-05 ID3 1
26 1970-01-06 ID3 1
27 1970-01-07 ID3 0
28 1970-01-08 ID3 0
29 1970-01-09 ID3 0
30 1970-01-10 ID3 0
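For intuition, here is the list(1:3) width applied to a tiny made-up vector (a quick sketch, separate from the data above):
library(zoo)
x <- c(FALSE, FALSE, TRUE, FALSE)
rollapply(x, list(1:3), FUN = any, partial = TRUE, fill = FALSE)
# [1]  TRUE  TRUE FALSE FALSE
# each position asks: is there a TRUE among the (up to) 3 elements ahead?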
2) This could also be expressed as a pipeline where roll is from above:
library(dplyr)
library(zoo)
df %>%
  group_by(ID) %>%
  mutate(before = roll(Date == date_before)) %>%
  ungroup
Here is a tidyverse solution using difftime and cumsum:
library(tidyverse)
df %>%
  group_by(ID) %>%
  mutate(
    is_before = difftime(as_date(date_before), Date) >= 0,
    three_days_before = as.numeric((max(cumsum(is_before)) - cumsum(is_before)) %in% 1:3)
  ) %>%
  select(-is_before) %>%
  as.data.frame()
# Date ID three_days_before
#1 1970-01-01 ID1 0
#2 1970-01-02 ID1 0
#3 1970-01-03 ID1 1
#4 1970-01-04 ID1 1
#5 1970-01-06 ID1 1
#6 1970-01-07 ID1 0
#7 1970-01-08 ID1 0
#8 1970-01-09 ID1 0
#9 1970-01-10 ID1 0
#10 1970-01-01 ID2 0
#11 1970-01-02 ID2 0
#12 1970-01-04 ID2 1
#13 1970-01-05 ID2 1
#14 1970-01-06 ID2 1
#15 1970-01-07 ID2 0
#16 1970-01-08 ID2 0
#17 1970-01-09 ID2 0
#18 1970-01-10 ID2 0
#19 1970-01-01 ID3 0
#20 1970-01-02 ID3 0
#21 1970-01-03 ID3 1
#22 1970-01-05 ID3 1
#23 1970-01-06 ID3 1
#24 1970-01-07 ID3 0
#25 1970-01-08 ID3 0
#26 1970-01-09 ID3 0
#27 1970-01-10 ID3 0
Explanation: We group entries by ID; is_before flags entries at or before date_before; we then flag the three rows immediately before date_before with (max(cumsum(is_before)) - cumsum(is_before)) %in% 1:3.
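To see the counting trick in isolation, here is a small sketch with a made-up flag vector (assuming rows are sorted by Date within each ID):
is_before <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)
max(cumsum(is_before)) - cumsum(is_before)
# [1] 4 3 2 1 0 0 0
# values falling in 1:3 mark exactly the three rows before the last TRUE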
Sample data
library(lubridate)
df <- data.frame("Date" = c(as_date(0:9)), ID = rep(paste0("ID", 1:3), each = 10))
df <- df[-c(5, 13, 24),]
date_before <- "1970-01-07"
Dear geniuses of Stack
I find myself in need of help with writing rather advanced code. I have a data frame with three columns; a unique ID column (IDcol), a result column (Result), and a date for the results (Date).
There are 3 types of results, N1, N2 and N4. These can be measured over time for each ID. My data could look something like this:
IDcol Result Date
1 ID1 N1 2018-06-11
2 ID1 N2 2018-06-11
3 ID1 N4 2018-01-01
4 ID1 N1 2012-06-11
5 ID1 N2 2012-06-11
6 ID2 N1 2016-04-23
7 ID2 N2 2016-04-23
8 ID2 N4 2015-01-05
9 ID3 N1 2015-11-24
10 ID3 N2 2015-11-24
11 ID3 N4 2013-05-05
12 ID4 N1 2015-01-01
13 ID4 N2 2015-01-01
14 ID4 N1 2014-05-06
15 ID4 N2 2014-05-06
16 ID5 N1 2016-04-04
17 ID5 N2 2014-04-04
18 ID5 N4 2012-04-04
As you can see, for ID1, N1 and N2 are measured twice, on two different dates - N4 is also measured in between. The code needs to "scan" for the earliest dates where N1 and N2 are measured. The scoring system could give these occurrences a score of 2. However, if N4 occurs within the year preceding the N1 and N2 measurements, the code should "scan" for this and prioritize it with a higher score. So for the example of ID1, the desired output would look like this:
IDcol Result Date score
1 ID1 N1 2018-06-11 3
2 ID1 N2 2018-06-11 3
3 ID1 N4 2018-01-01 3
Meaning that even though earlier N1 and N2 results were measured, they were dropped because N4 was also measured within one year of the later N1 and N2 measurements. ID2 is an example where the code should keep all of ID2's rows. For ID3, N4 is measured but falls outside the one-year interval, so its row should be dropped. For ID4, the earliest measurements should be prioritized over the latest, and lastly, for ID5, there are too many discrepancies between the dates on which results are measured. Effectively, the desired output should look like this:
IDcol Result Date score
1 ID1 N1 2018-06-11 3
2 ID1 N2 2018-06-11 3
3 ID1 N4 2018-01-01 3
4 ID2 N1 2016-04-23 3
5 ID2 N2 2016-04-23 3
6 ID2 N4 2015-01-05 3
7 ID3 N1 2015-11-24 2
8 ID3 N2 2015-11-24 2
9 ID4 N1 2014-05-06 2
10 ID4 N2 2014-05-06 2
I have tried the following with dplyr:
IDcol <- c("ID1", "ID1", "ID1",
"ID1", "ID1",
"ID2", "ID2", "ID2",
"ID3", "ID3", "ID3",
"ID4", "ID4",
"ID4", "ID4",
"ID5", "ID5", "ID5")
Result <- c("N1", "N2", "N4",
"N1", "N2",
"N1", "N2", "N4",
"N1", "N2", "N4",
"N1", "N2",
"N1", "N2",
"N1", "N2", "N4")
Date <- c("2018-06-11","2018-06-11", "2018-01-01",
"2012-06-11", "2012-06-11",
"2016-04-23", "2016-04-23", "2015-01-05",
"2015-11-24", "2015-11-24", "2013-05-05",
"2015-01-01", "2015-01-01",
"2014-05-06", "2014-05-06",
"2016-04-04", "2014-04-04", "2012-04-04")
library(dplyr)
library(lubridate)
df <- data.frame(IDcol, Result, Date)
df$Date <- ymd(df$Date)
df1 <- df %>%
  group_by(IDcol, Date) %>%
  mutate(score = case_when(
    length(Result) > 3 & Date[Result == "N4"] %within%
      interval(Date[Result == "N1"] - duration(1, units = "years"), Date[Result == "N1"]) ~ 3,
    "N1" %in% Result & "N2" %in% Result ~ 2,
    TRUE ~ 0
  )) %>%
  filter(score > 0) %>%
  group_by(IDcol) %>%
  arrange(desc(score), Date, .by_group = TRUE) %>%
  filter(Date == first(Date)) %>%
  ungroup()
Any help is greatly appreciated
I believe this gives you your desired output. With problems like this I like to simplify the logic/code a bit by separating some information into a different df first and using filter.
cutoff.data <- df %>%
filter(Result != 'N4') %>%
group_by(IDcol) %>%
summarise(earliest = min(Date)) %>%
mutate(n4.cutoff = earliest - dyears(1)) %>%
select(-earliest)
df %>%
left_join(cutoff.data) %>%
group_by(IDcol) %>%
mutate(n4.row = ifelse(Result == 'N4',1,0),
n4.check = sum(n4.row)) %>%
ungroup() %>% # grouping messes up the following ifelse step
mutate(score = ifelse(n4.check == 1 & Date[Result == 'N4'] >= n4.cutoff, 3, 2)) %>%
group_by(IDcol) %>% # but then grouping is required to get lowest score per ID
mutate(score = min(score))
# A tibble: 18 x 7
# Groups: IDcol [5]
IDcol Result Date n4.cutoff n4.row n4.check score
<fct> <fct> <date> <dttm> <dbl> <dbl> <dbl>
1 ID1 N1 2018-06-11 2011-06-11 18:00:00 0 1 3
2 ID1 N2 2018-06-11 2011-06-11 18:00:00 0 1 3
3 ID1 N4 2018-01-01 2011-06-11 18:00:00 1 1 3
4 ID1 N1 2012-06-11 2011-06-11 18:00:00 0 1 3
5 ID1 N2 2012-06-11 2011-06-11 18:00:00 0 1 3
6 ID2 N1 2016-04-23 2015-04-23 18:00:00 0 1 2
7 ID2 N2 2016-04-23 2015-04-23 18:00:00 0 1 2
8 ID2 N4 2015-01-05 2015-04-23 18:00:00 1 1 2
9 ID3 N1 2015-11-24 2014-11-23 18:00:00 0 1 2
10 ID3 N2 2015-11-24 2014-11-23 18:00:00 0 1 2
11 ID3 N4 2013-05-05 2014-11-23 18:00:00 1 1 2
12 ID4 N1 2015-01-01 2013-05-05 18:00:00 0 0 2
13 ID4 N2 2015-01-01 2013-05-05 18:00:00 0 0 2
14 ID4 N1 2014-05-06 2013-05-05 18:00:00 0 0 2
15 ID4 N2 2014-05-06 2013-05-05 18:00:00 0 0 2
16 ID5 N1 2016-04-04 2013-04-03 18:00:00 0 1 2
17 ID5 N2 2014-04-04 2013-04-03 18:00:00 0 1 2
18 ID5 N4 2012-04-04 2013-04-03 18:00:00 1 1 2
I have a big data frame (dim: 12867779x5) which looks like this:
id    group  date1       date2       icf
id1   2      2020-03-17  2019-06-05
id1   3      2020-04-03  2019-05-09
id2   2      2020-04-10  2019-07-04
id2   3      2021-04-1   2020-06-01
id3   1      2020-04-13  2019-07-07
id3   2      2021-04-10  2020-06-01
id3   3      2020-04-10  2019-07-04
id3   4      2021-04-13  2020-06-01
Desired output:
id    group  date1       date2        icf
id1   3      2020-04-03  2019-05-09     0
id2   2      2020-04-10  2019-07-04    52
id2   3      2021-04-01  2020-06-01     0
id3   1      2020-04-13  2019-07-07    49
id3   2      2021-04-10  2020-06-01  -646
id3   3      2020-04-10  2019-07-04    52
id3   4      2021-04-13  2020-06-01     0
To calculate icf I need to check whether the ids in row i and row i+1 are the same. If yes, icf = date2(i+1) - date1(i); otherwise icf = 0.
I wrote the function below to calculate icf, but it's too slow. I'm looking for ways to speed it up. I was thinking about using the apply family, but I have no idea how to rewrite this icfCalculation function.
icfCalculation <- function(dataFrame) {
  nr <- nrow(dataFrame) - 1
  for (i in 1:nr) {
    # same id in the next row: icf is the next row's date2 minus this row's date1
    if (dataFrame[i, 1] == dataFrame[i + 1, 1]) {
      dataFrame[i, 5] <- dataFrame[i + 1, 4] - dataFrame[i, 3]
    } else {
      dataFrame[i, 5] <- 0
    }
  }
  return(dataFrame)
}
Thanks for putting in the expected output. This is not the same as what you have put, but it does give the same results as your function and should be significantly quicker thanks to data.table's internal optimisations:
library(data.table)
# Read in data
dat <- read.table(text = "id group date1 date2
id1 2 2020-03-17 2019-06-05
id1 3 2020-04-03 2019-05-09
id2 2 2020-04-10 2019-07-04
id2 3 2021-04-1 2020-06-01
id3 1 2020-04-13 2019-07-07
id3 2 2021-04-10 2020-06-01
id3 3 2020-04-10 2019-07-04
id3 4 2021-04-13 2020-06-01",
h = T,
colClasses = c("character", "character", "Date", "Date")
)
# Make it a data.table
setDT(dat)
dat[, icf := fifelse(
id == shift(id, type = "lead"),
as.integer(
shift(date2, type = "lead") - date1
),
0)
]
dat
# id group date1 date2 icf
# 1: id1 2 2020-03-17 2019-06-05 -313
# 2: id1 3 2020-04-03 2019-05-09 0
# 3: id2 2 2020-04-10 2019-07-04 52
# 4: id2 3 2021-04-01 2020-06-01 0
# 5: id3 1 2020-04-13 2019-07-07 49
# 6: id3 2 2021-04-10 2020-06-01 -646
# 7: id3 3 2020-04-10 2019-07-04 52
# 8: id3 4 2021-04-13 2020-06-01 NA
If you want the last NA to be 0, just add dat$icf[nrow(dat)] <- 0.
library(dplyr)
library(tidyr)
df %>%
  mutate(icf = replace_na(ifelse(id == lead(id), lead(date2) - date1, 0), 0))
Rather than use tidyr::replace_na you could also specify the default argument of lead.
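For example, something like this (a sketch; the default "" is arbitrary, any value that can never equal a real id will do):
df %>%
  mutate(icf = ifelse(id == lead(id, default = ""),   # "" never matches a real id, so the last row gets 0
                      as.numeric(lead(date2) - date1),
                      0))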
Base R
A base R approach would be something like:
df$icf <- with(df, ifelse(id == c(id[2:nrow(df)], NA), c(date2[2:nrow(df)], NA) - date1, 0))
Output
id group date1 date2 icf
1 id1 2 2020-03-17 2019-06-05 -313
2 id1 3 2020-04-03 2019-05-09 0
3 id2 2 2020-04-10 2019-07-04 52
4 id2 3 2021-04-01 2020-06-01 0
5 id3 1 2020-04-13 2019-07-07 49
6 id3 2 2021-04-10 2020-06-01 -646
7 id3 3 2020-04-10 2019-07-04 52
8 id3 4 2021-04-13 2020-06-01 0
I have two data.tables, which I want to combine based on if a date in one table is in the given time range in the other table. In dt1 I have exit dates and I want to check in dt2 which values were valid at the exit date for each ID.
dt1 <- data.table (ID = 1:10,
exit = c("31/12/2010", "01/01/2021", "30/09/2010", "31/12/2015", "30/09/2010","31/10/2018", "01/02/2016", "01/05/2015", "01/09/2013", "01/01/2016"))
dt2 <- data.table (ID = c(1,2,2,2,3,5,6,6,7,8,8,9,10),
valid_from = c("01/01/2010", "01/01/2012", "01/01/2013", "01/12/2017", "01/05/2010", "01/04/2010", "01/05/2014", "01/11/2016", "01/01/2016", "15/04/2013", "01/01/2015", "15/02/2010", "01/04/2012"),
valid_until = c("01/01/2021", "31/12/2012", "30/11/2017", "01/01/2021", "01/01/2021", "01/01/2021", "31/10/2016", "01/01/2021", "01/01/2021", "31/12/2014", "01/05/2015", "01/01/2013", "01/01/2021"),
text1 = c("a", "a", "b", "c", "b", "b", "c", "a", "a", "b", "a", "c", "a"),
text2 = c("I", "I", "II", "I", "III", "I", "II", "III", "I", "II", "II", "I", "III" ))
ID exit
1: 1 31/12/2010
2: 2 01/01/2021
3: 3 30/09/2010
4: 4 31/12/2015
5: 5 30/09/2010
6: 6 31/10/2018
7: 7 01/02/2016
8: 8 01/05/2015
9: 9 01/09/2013
10: 10 01/01/2016
ID valid_from valid_until text1 text2
1: 1 01/01/2010 01/01/2021 a I
2: 2 01/01/2012 31/12/2012 a I
3: 2 01/01/2013 30/11/2017 b II
4: 2 01/12/2017 01/01/2021 c I
5: 3 01/05/2010 01/01/2021 b III
6: 5 01/04/2010 01/01/2021 b I
7: 6 01/05/2014 31/10/2016 c II
8: 6 01/11/2016 01/01/2021 a III
9: 7 01/01/2016 01/01/2021 a I
10: 8 15/04/2013 31/12/2014 b II
11: 8 01/01/2015 01/05/2015 a II
12: 9 15/02/2010 01/01/2013 c I
13: 10 01/04/2012 01/01/2021 a III
As a result, I would like dt1 to contain the values that were valid at each exit date.
If an ID is not found in dt2 (would be the case for ID 4 in the sample data), it should return NA.
ID exit text1 text2
1: 1 31/12/2010 a I
2: 2 01/01/2021 c I
3: 3 30/09/2010 b III
4: 4 31/12/2015 <NA> <NA>
5: 5 30/09/2010 b I
6: 6 31/10/2018 a III
7: 7 01/02/2016 a I
8: 8 01/05/2015 a II
9: 9 01/09/2013 c I
10: 10 01/01/2016 a III
Could anyone help me solve this?
As the input is a data.table, consider using data.table methods, which are fast:
library(data.table)
# // convert the date columns to `Date` class
dt1[, exit := as.IDate(exit, '%d/%m/%Y')]
dt2[, c('valid_from', 'valid_until') := .(as.IDate(valid_from, '%d/%m/%Y'),
as.IDate(valid_until, '%d/%m/%Y'))]
# // do a non-equi join
dt1[dt2, c('text1', 'text2') := .(i.text1, i.text2),
on = .(ID, exit >= valid_from, exit <= valid_until)]
-output
> dt1
ID exit text1 text2
1: 1 2010-12-31 a I
2: 2 2021-01-01 c I
3: 3 2010-09-30 b III
4: 4 2015-12-31 <NA> <NA>
5: 5 2010-09-30 b I
6: 6 2018-10-31 a III
7: 7 2016-02-01 a I
8: 8 2015-05-01 a II
9: 9 2013-09-01 <NA> <NA>
10: 10 2016-01-01 a III
Here is a dplyr solution, created with the help of @akrun (see dates: Not yet implemented NAbounds=TRUE for this non-numeric and non-character type):
library(dplyr)
library(lubridate)
df1 <- left_join(dt1, dt2, by="ID") %>%
mutate(across(c(exit, valid_from, valid_until), dmy)) %>%
rowwise() %>%
mutate(match= +(dplyr::between(exit, valid_from, valid_until))) %>%
group_by(ID) %>%
filter(match==max(match) | is.na(match)) %>%
select(ID, exit, text1, text2) %>%
ungroup()
output:
ID exit text1 text2
<dbl> <date> <chr> <chr>
1 1 2010-12-31 a I
2 2 2021-01-01 c I
3 3 2010-09-30 b III
4 4 2015-12-31 NA NA
5 5 2010-09-30 b I
6 6 2018-10-31 a III
7 7 2016-02-01 a I
8 8 2015-05-01 a II
9 9 2013-09-01 c I
10 10 2016-01-01 a III
You may use fuzzyjoin after changing the dates to Date class.
library(fuzzyjoin)
library(dplyr)
dt1 %>%
mutate(exit = as.Date(exit, '%d/%m/%Y')) %>%
fuzzy_left_join(dt2 %>%
mutate(across(starts_with('valid'), as.Date, '%d/%m/%Y')),
by = c('ID', 'exit' = 'valid_from', 'exit' = 'valid_until'),
match_fun = c(`==`, `>=`, `<=`)) %>%
select(ID = ID.x, exit, text1, text2)
# ID exit text1 text2
#1 1 2010-12-31 a I
#2 2 2021-01-01 c I
#3 3 2010-09-30 b III
#4 4 2015-12-31 <NA> <NA>
#5 5 2010-09-30 b I
#6 6 2018-10-31 a III
#7 7 2016-02-01 a I
#8 8 2015-05-01 a II
#9 9 2013-09-01 <NA> <NA>
#10 10 2016-01-01 a III
I create a simple dataframe:
library(dplyr)
df <- tibble(
  UserId = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  Answer_Date = as.Date(c("2010-12-31", "2011-12-29", "2012-12-25", "2013-12-10", "2014-12-31",
                          "2010-10-31", "2011-10-28", "2013-10-31", "2015-10-31")),
  Q1 = c(3, 1, 1, 0, 1, 4, 2, 5, 4),
  Q2 = c(2, 0, 1, 2, 1, 8, 2, 6, 5)
) %>%
  group_by(UserId) %>%
  mutate(First_Date = min(Answer_Date)) %>%
  mutate(Last_Date = max(Answer_Date)) %>%
  ungroup()
which gives me
> df
# A tibble: 9 x 6
UserId Answer_Date Q1 Q2 First_Date Last_Date
<chr> <date> <dbl> <dbl> <date> <date>
1 A 2010-12-31 3 2 2010-12-31 2014-12-31
2 A 2011-12-29 1 0 2010-12-31 2014-12-31
3 A 2012-12-25 1 1 2010-12-31 2014-12-31
4 A 2013-12-10 0 2 2010-12-31 2014-12-31
5 A 2014-12-31 1 1 2010-12-31 2014-12-31
6 B 2010-10-31 4 8 2010-10-31 2015-10-31
7 B 2011-10-28 2 2 2010-10-31 2015-10-31
8 B 2013-10-31 5 6 2010-10-31 2015-10-31
9 B 2015-10-31 4 5 2010-10-31 2015-10-31
I now wish to compute the change in each subject's answers between the first and last date on which they answer the questionnaire. I start by writing
df_tmp <- df %>%
  filter(Answer_Date == First_Date) %>%
  select(c("UserId", "Q1", "Q2"))
colnames(df_tmp) <- c("UserId", paste0("First_Response_", c("Q1", "Q2")))
df <- merge(df, df_tmp, by = "UserId")
df_tmp <- df %>%
  filter(Answer_Date == Last_Date) %>%
  select(c("UserId", "Q1", "Q2"))
colnames(df_tmp) <- c("UserId", paste0("Last_Response_", c("Q1", "Q2")))
df <- merge(df, df_tmp, by = "UserId")
giving me
> df
UserId Answer_Date Q1 Q2 First_Date Last_Date First_Response_Q1 First_Response_Q2 Last_Response_Q1 Last_Response_Q2
1 A 2010-12-31 3 2 2010-12-31 2014-12-31 3 2 1 1
2 A 2011-12-29 1 0 2010-12-31 2014-12-31 3 2 1 1
3 A 2012-12-25 1 1 2010-12-31 2014-12-31 3 2 1 1
4 A 2013-12-10 0 2 2010-12-31 2014-12-31 3 2 1 1
5 A 2014-12-31 1 1 2010-12-31 2014-12-31 3 2 1 1
6 B 2010-10-31 4 8 2010-10-31 2015-10-31 4 8 4 5
7 B 2011-10-28 2 2 2010-10-31 2015-10-31 4 8 4 5
8 B 2013-10-31 5 6 2010-10-31 2015-10-31 4 8 4 5
9 B 2015-10-31 4 5 2010-10-31 2015-10-31 4 8 4 5
I now wish to create two new columns, Delta_Q1 = Last_Response_Q1 - First_Response_Q1 and Delta_Q2 = Last_Response_Q2 - First_Response_Q2, but (possibly) using mutate, paste0("First_Response_", c("Q1", "Q2")), paste0("Last_Response_", c("Q1", "Q2")) and paste0("Delta_", c("Q1", "Q2")).
What is the correct syntax for computing the differences (or, in general, some function of two variables) between pairs of columns sequentially? The reason I don't want to write the differences down manually is simple: the real data frame has lots of pairs of columns.
Many thanks in advance for your help.
Sincerely
Thomas Philips
You can create two vectors of columns and directly subtract them to create the new columns.
first_r_col <- grep('First_Response', colnames(df))
last_r_col <- grep('Last_Response', colnames(df))
df[paste0('delta', seq_along(first_r_col))] <- df[last_r_col] - df[first_r_col]
Using a dplyr select statement might be an easier way to select the columns.
library(dplyr)
df[paste0('delta', seq_along(first_r_col))] <-
  df %>% select(starts_with('Last_Response')) -
  df %>% select(starts_with('First_Response'))
Here's one approach that does not require creating the First_Date and Last_Date columns:
library(dplyr)
df %>%
  group_by(UserId) %>%
  arrange(UserId, Answer_Date) %>%
  filter(row_number() == 1 | row_number() == n()) %>%
  summarize(Delta_Q1 = diff(Q1),
            Delta_Q2 = diff(Q2))
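Run on the sample data, this should give something like:
# A tibble: 2 x 3
  UserId Delta_Q1 Delta_Q2
  <chr>     <dbl>    <dbl>
1 A            -2       -1
2 B             0       -3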
I don't think that much code is needed; below is a dplyr solution:
df %>%
  group_by(UserId) %>%
  arrange(Answer_Date) %>%
  summarize(First_Q1 = first(Q1),
            First_Q2 = first(Q2),
            Last_Q1 = last(Q1),
            Last_Q2 = last(Q2)) %>%
  mutate(Delta_Q1 = Last_Q1 - First_Q1,
         Delta_Q2 = Last_Q2 - First_Q2)
Gives the output of:
# A tibble: 2 x 7
UserId First_Q1 First_Q2 Last_Q1 Last_Q2 Delta_Q1 Delta_Q2
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 3 2 1 1 -2 -1
2 B 4 8 4 5 0 -3
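Since the real data frame has many pairs of columns, the same idea generalises with across() (a sketch, assuming dplyr >= 1.0 and that every question column starts with "Q"):
df %>%
  group_by(UserId) %>%
  arrange(UserId, Answer_Date) %>%
  summarize(across(starts_with("Q"), ~ last(.x) - first(.x),
                   .names = "Delta_{.col}"))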
With the benefit of hindsight, I missed the blindingly obvious answer, and made it harder than it should have been:
QUESTIONS <- c("Q1", "Q2")
FIRST_RESPONSE_PREFIX <- "First_"
LAST_RESPONSE_PREFIX <- "Last_"
DELTA_RESPONSE_PREFIX <- "Delta_"
first_response_cols <- paste0(FIRST_RESPONSE_PREFIX, QUESTIONS)
last_response_cols <- paste0(LAST_RESPONSE_PREFIX, QUESTIONS)
delta_response_cols <- paste0(DELTA_RESPONSE_PREFIX, QUESTIONS)
df_tmp1 <- df %>%
  filter(Answer_Date == First_Date) %>%
  select(c("UserId", QUESTIONS))
colnames(df_tmp1) <- c("UserId", first_response_cols)
df <- merge(df, df_tmp1, by = "UserId")
df_tmp2 <- df %>%
  filter(Answer_Date == Last_Date) %>%
  select(c("UserId", QUESTIONS))
colnames(df_tmp2) <- c("UserId", last_response_cols)
df <- merge(df, df_tmp2, by = "UserId")
df[delta_response_cols] <- df[last_response_cols] - df[first_response_cols]
When I run the code, I get exactly what I want:
> df
UserId Answer_Date Q1 Q2 First_Date Last_Date First_Q1 First_Q2 Last_Q1 Last_Q2 Delta_Q1 Delta_Q2
1 A 2010-12-31 3 2 2010-12-31 2014-12-31 3 2 1 1 -2 -1
2 A 2011-12-29 1 0 2010-12-31 2014-12-31 3 2 1 1 -2 -1
3 A 2012-12-25 1 1 2010-12-31 2014-12-31 3 2 1 1 -2 -1
4 A 2013-12-10 0 2 2010-12-31 2014-12-31 3 2 1 1 -2 -1
5 A 2014-12-31 1 1 2010-12-31 2014-12-31 3 2 1 1 -2 -1
6 B 2010-10-31 4 8 2010-10-31 2015-10-31 4 8 4 5 0 -3
7 B 2011-10-28 2 2 2010-10-31 2015-10-31 4 8 4 5 0 -3
8 B 2013-10-31 5 6 2010-10-31 2015-10-31 4 8 4 5 0 -3
9 B 2015-10-31 4 5 2010-10-31 2015-10-31 4 8 4 5 0 -3
That said, thanks for the help - I learned something by looking at the suggested answers.
I've seen lots of questions like this but can't figure this simple problem out. I don't want to collapse the dataset. Say I have this dataset:
library(tidyverse)
library(lubridate)
df <- data.frame(group = c("a", "a", "a", "a", "a", "b", "b", "b"),
starts = c("2011-09-18", NA, "2014-08-08", "2016-09-18", NA, "2013-08-08", "2015-08-08", NA),
ends = c(NA, "2013-03-06", "2015-08-08", NA, "2017-03-06", "2014-08-08", NA, "2016-08-08"))
df$starts <- parse_date_time(df$starts, "ymd")
df$ends <- parse_date_time(df$ends, "ymd")
df
group starts ends
1 a 2011-09-18 <NA>
2 a <NA> 2013-03-06
3 a 2014-08-08 2015-08-08
4 a 2016-09-18 <NA>
5 a <NA> 2017-03-06
6 b 2013-08-08 2014-08-08
7 b 2015-08-08 <NA>
8 b <NA> 2016-08-08
Desired output is:
group starts ends epi
1 a 2011-09-18 <NA> 1
2 a <NA> 2013-03-06 1
3 a 2014-08-08 2015-08-08 2
4 a 2016-09-18 <NA> 3
5 a <NA> 2017-03-06 3
6 b 2013-08-08 2014-08-08 1
7 b 2015-08-08 <NA> 2
8 b <NA> 2016-08-08 2
I was thinking of something like this, but it obviously doesn't account for episodes where there is no NA:
df <- df %>%
  group_by(group) %>%
  mutate(epi = cumsum(is.na(ends)))
df
I'm not sure how to combine cumsum(is.na) with a conditional if_else. Maybe I'm going about it the wrong way?
Any suggestions would be great.
A solution using dplyr, assuming your data frame is well structured in the sense that each start always has an associated end record:
df2 <- df %>%
  group_by(group) %>%
  mutate(epi = cumsum(!is.na(starts))) %>%
  ungroup()
df2
# # A tibble: 8 x 4
# group starts ends epi
# <fct> <dttm> <dttm> <int>
# 1 a 2011-09-18 00:00:00 NA 1
# 2 a NA 2013-03-06 00:00:00 1
# 3 a 2014-08-08 00:00:00 2015-08-08 00:00:00 2
# 4 a 2016-09-18 00:00:00 NA 3
# 5 a NA 2017-03-06 00:00:00 3
# 6 b 2013-08-08 00:00:00 2014-08-08 00:00:00 1
# 7 b 2015-08-08 00:00:00 NA 2
# 8 b NA 2016-08-08 00:00:00 2
An option is to get the rowSums of the NA elements in the 'starts' and 'ends' columns, then, grouped by 'group', take the rleid of that to get 'epi':
library(dplyr)
library(data.table)
df %>%
  mutate(epi = rowSums(is.na(.[c("starts", "ends")]))) %>%
  group_by(group) %>%
  mutate(epi = rleid(epi))
# A tibble: 8 x 4
# Groups: group [2]
# group starts ends epi
# <fct> <dttm> <dttm> <int>
#1 a 2011-09-18 00:00:00 NA 1
#2 a NA 2013-03-06 00:00:00 1
#3 a 2014-08-08 00:00:00 2015-08-08 00:00:00 2
#4 a 2016-09-18 00:00:00 NA 3
#5 a NA 2017-03-06 00:00:00 3
#6 b 2013-08-08 00:00:00 2014-08-08 00:00:00 1
#7 b 2015-08-08 00:00:00 NA 2
#8 b NA 2016-08-08 00:00:00 2
If there are only two columns to consider:
df %>%
  group_by(group) %>%
  mutate(epi = rleid(is.na(starts) + is.na(ends)))
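For reference, data.table's rleid() just numbers consecutive runs of equal values, e.g. (a quick sketch with made-up input):
library(data.table)
rleid(c(1, 1, 0, 0, 1, 1))
# [1] 1 1 2 2 3 3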