Remove rows after a certain date based on a condition in R

I've seen similar questions, but they filter the whole table rather than only specific rows of a data.table or data.frame:
Subset a dataframe between 2 dates
How to select some rows with specific date from a data frame in R
I have a dataset with patients who were diagnosed with OA and those who were not:
dt <- data.table(ID = seq(1,10,1), OA = c(1,0,0,1,0,0,0,1,1,0),
oa.date = as.Date(c("01/01/2006", "01/01/2001", "01/01/2001", "02/03/2005","01/01/2001","01/01/2001","01/01/2001","05/06/2010", "01/01/2011", "01/01/2001"), "%d/%m/%Y"),
stop.date = as.Date(c("01/01/2006", "31/12/2007", "31/12/2008", "02/03/2005", "31/12/2011", "31/12/2011", "31/12/2011", "05/06/2010", "01/01/2011", "31/12/2011"), "%d/%m/%Y"))
dt$oa.date[dt$OA==0] <- NA
> dt
ID OA oa.date stop.date
1: 1 1 2006-01-01 2006-01-01
2: 2 0 <NA> 2007-12-31
3: 3 0 <NA> 2008-12-31
4: 4 1 2005-03-02 2005-03-02
5: 5 0 <NA> 2011-12-31
6: 6 0 <NA> 2011-12-31
7: 7 0 <NA> 2011-12-31
8: 8 1 2010-06-05 2010-06-05
9: 9 1 2011-01-01 2011-01-01
10: 10 0 <NA> 2011-12-31
What I want to do is delete those who were diagnosed with OA (OA==1) before start:
start <- as.Date("01/01/2009", "%d/%m/%Y")
So I want my final data to be:
> dt
ID OA oa.date stop.date
1: 2 0 <NA> 2007-12-31
2: 3 0 <NA> 2008-12-31
3: 5 0 <NA> 2011-12-31
4: 6 0 <NA> 2011-12-31
5: 7 0 <NA> 2011-12-31
6: 8 1 2010-06-05 2010-06-05
7: 9 1 2011-01-01 2011-01-01
8: 10 0 <NA> 2011-12-31
My tries are:
dt[dt$OA==1] <- dt[!(oa.date < start)]
I've also tried a loop but to no effect.
Any help is much appreciated.

This should be straightforward:
> dt[!(OA & oa.date < start)]
# ID OA oa.date stop.date
#1: 2 0 <NA> 2007-12-31
#2: 3 0 <NA> 2008-12-31
#3: 5 0 <NA> 2011-12-31
#4: 6 0 <NA> 2011-12-31
#5: 7 0 <NA> 2011-12-31
#6: 8 1 2010-06-05 2010-06-05
#7: 9 1 2011-01-01 2011-01-01
#8: 10 0 <NA> 2011-12-31
The OA column is binary (1/0) which is coerced to logical (TRUE/FALSE) in the i-expression.
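To see why rows with a missing oa.date survive this filter, here is a minimal sketch on a made-up four-row table (the object names toy, kept and date_only are illustrative, not from the question). It relies on FALSE & NA evaluating to FALSE in R, and on data.table treating an NA in a logical i as FALSE.

```r
library(data.table)

# Illustrative data: two diagnosed patients, two without a diagnosis date
toy <- data.table(
  ID      = 1:4,
  OA      = c(1, 0, 1, 0),
  oa.date = as.Date(c("2006-01-01", NA, "2010-06-05", NA))
)
start <- as.Date("2009-01-01")

# OA is coerced to logical; for OA == 0 the result is FALSE even though
# oa.date < start is NA, because FALSE & NA is FALSE
to_drop <- toy$OA & toy$oa.date < start

# Negating the condition keeps every row that is not an early diagnosis
kept <- toy[!(OA & oa.date < start)]

# Filtering on the date alone loses the undiagnosed rows, since !NA is NA
# and data.table drops rows where i is NA
date_only <- toy[!(oa.date < start)]
```

Here kept retains IDs 2, 3 and 4, while date_only retains only ID 3, which is why the OA term is needed.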

You can try
dt=dt[dt$OA==0|(dt$OA==1&!(dt$oa.date < start)),]

Related

Create a summarizing variable for multiple columns in data.table r

I have the following data.table
dt <- data.table(id=c(1,2,2,2,3,3,4),
date=c("2019-09-13", "2018-12-06", "2017-12-14", "2018-02-08", "2015-12-06", "2012-12-14", "2011-02-08"),
variable_1=c("a","b",NA,NA,"b","c",NA),
variable_2=c(NA,NA,"a",NA,"a","c",NA),
variable_3=c(NA,NA,NA,"b","c","c",NA))
dt
id date variable_1 variable_2 variable_3
1: 1 2019-09-13 a <NA> <NA>
2: 2 2018-12-06 b <NA> <NA>
3: 2 2017-12-14 <NA> a <NA>
4: 2 2018-02-08 <NA> <NA> b
5: 3 2015-12-06 b a c
6: 3 2012-12-14 c c c
7: 4 2011-02-08 <NA> <NA> <NA>
I want to create a variable y that summarizes these columns: y should be 0 for every row that has at least one non-NA value among the variables, and 1 for every row where all of the variables are NA. Like this:
id date variable_1 variable_2 variable_3 y
1: 1 2019-09-13 a <NA> <NA> 0
2: 2 2018-12-06 b <NA> <NA> 0
3: 2 2017-12-14 <NA> a <NA> 0
4: 2 2018-02-08 <NA> <NA> b 0
5: 3 2015-12-06 b a c 0
6: 3 2012-12-14 c c c 0
7: 4 2011-02-08 <NA> <NA> <NA> 1
In the original data.table I have 22 variables that I am looking at among 830 total variables. So I would prefer not to look for every Variable with _1 to _22 separately.
Is there a way in data.table?
dt[, y := +(rowSums(!is.na(.SD)) == 0L), .SDcols = patterns("^variable_")]
# id date variable_1 variable_2 variable_3 y
# 1: 1 2019-09-13 a <NA> <NA> 0
# 2: 2 2018-12-06 b <NA> <NA> 0
# 3: 2 2017-12-14 <NA> a <NA> 0
# 4: 2 2018-02-08 <NA> <NA> b 0
# 5: 3 2015-12-06 b a c 0
# 6: 3 2012-12-14 c c c 0
# 7: 4 2011-02-08 <NA> <NA> <NA> 1
Walk-through:
.SDcols=patterns(...) defines the columns to be processed as .SD in the j component. This doesn't involve removing/selecting columns for the output, just the ones that will be referenced internally.
!is.na(.SD) returns a logical matrix with the same dimensions as .SD, TRUE wherever the value is not NA.
rowSums(...) returns the count of non-NAs in the row.
Counting the non-NA values in a row (rather than the NA values) means the test does not depend on how many columns are processed; this is what allows me to use == 0L.
+(...) is a shorthand trick for converting logical to integer 0/1.
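The steps of the walk-through can also be run one at a time. This is only a sketch on a made-up three-row table; the intermediate names cols, non_na and counts are illustrative and do not appear in the answer.

```r
library(data.table)

# Illustrative data: the last row is NA in every variable_* column
toy <- data.table(
  id         = 1:3,
  variable_1 = c("a", NA, NA),
  variable_2 = c(NA, "b", NA)
)

# Pick the columns by name pattern, as .SDcols = patterns(...) does
cols <- grep("^variable_", names(toy), value = TRUE)

# Logical matrix: TRUE where a value is present
non_na <- !is.na(toy[, ..cols])

# Number of non-NA values per row
counts <- rowSums(non_na)

# A row with zero non-NA values gets y = 1; unary + turns logical into integer
toy[, y := +(counts == 0L)]
```

Only the third row ends up with y equal to 1.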

aggregate per variable over which sums are calculated in R data.table

I have some time data
library(data.table); library(lubridate); set.seed(42)
dat <- rbind(data.table(time=as.POSIXct("2019-01-01 08:00:00") + round(runif(10,60,1e4)), val=runif(10),group=1)[order(time), id:=seq_len(.N)],
data.table(time=as.POSIXct("2019-02-01 18:00:00") + round(runif(10,60,1e4)), val=runif(10),group=2)[order(time), id:=seq_len(.N)])
> dat[order(group,id)]
time val group id
1: 2019-01-01 08:23:19 0.117487362 1 1
2: 2019-01-01 08:48:24 0.934672247 1 2
3: 2019-01-01 09:27:00 0.940014523 1 3
4: 2019-01-01 09:47:19 0.462292823 1 4
5: 2019-01-01 09:49:51 0.474997082 1 5
6: 2019-01-01 09:57:48 0.560332746 1 6
7: 2019-01-01 10:03:02 0.978226428 1 7
8: 2019-01-01 10:18:35 0.255428824 1 8
9: 2019-01-01 10:32:33 0.457741776 1 9
10: 2019-01-01 10:36:15 0.719112252 1 10
11: 2019-02-01 18:14:39 0.003948339 2 1
12: 2019-02-01 18:23:59 0.811055141 2 2
13: 2019-02-01 19:05:39 0.007334147 2 3
14: 2019-02-01 19:15:03 0.906601408 2 4
15: 2019-02-01 19:26:11 0.832916080 2 5
16: 2019-02-01 20:19:30 0.611778643 2 6
17: 2019-02-01 20:30:46 0.737595618 2 7
18: 2019-02-01 20:31:03 0.207658973 2 8
19: 2019-02-01 20:37:50 0.685169729 2 9
20: 2019-02-01 20:44:50 0.388108283 2 10
and I would like to calculate the sum of val during the following hour for each value of time. For example, for ID 1, this would be the sum of val for IDs 1 and 2 (because time for ID 3 is more than one hour after ID 1), for ID 2, the sum of val for IDs 2 to 4, and so forth. This yields the desired output (for group 1 only)
> res
time val id new1 new2
1: 2019-01-01 08:23:19 0.1174874 1 1.052160 1.052160
2: 2019-01-01 08:48:24 0.9346722 2 2.336979 2.336979
3: 2019-01-01 09:27:00 0.9400145 3 3.671292 3.671292
4: 2019-01-01 09:47:19 0.4622928 4 3.908132 3.908132
5: 2019-01-01 09:49:51 0.4749971 5 3.445839 NA
6: 2019-01-01 09:57:48 0.5603327 6 2.970842 NA
7: 2019-01-01 10:03:02 0.9782264 7 2.410509 NA
8: 2019-01-01 10:18:35 0.2554288 8 1.432283 NA
9: 2019-01-01 10:32:33 0.4577418 9 1.176854 NA
10: 2019-01-01 10:36:15 0.7191123 10 0.719112 NA
where two behaviors at the end are possible:
where the sequence is treated as is;
where sums are calculated only for rows that still have a later id with time at least an hour after them, and all other rows are set to NA (preferred).
I suspect that solving this requires me to subset within j but this is a problem I frequently run into and can't solve. I have not yet understood the general approach to this.
It could be a loop with join
dat1 <- dat[order(id)]
out <- rbindlist(lapply(dat1$id, function(i) {
d1 <- dat1[seq_len(.N) >= match(i, id)]
d1[d1[, .(time = time %m+% hours(1))], .(time1 = time, val, new1 = sum(val)),
on = .(time <= time), by = .EACHI][1]
}))[, time := NULL][]
setnames(out, 1, "time")
out[time < time[2] %m+% hours(1), new2 := new1]
out
# time val new1 new2
# 1: 2019-01-01 08:23:19 0.1174874 1.0521596 1.052160
# 2: 2019-01-01 08:48:24 0.9346722 2.3369796 2.336980
# 3: 2019-01-01 09:27:00 0.9400145 3.6712924 3.671292
# 4: 2019-01-01 09:47:19 0.4622928 3.9081319 3.908132
# 5: 2019-01-01 09:49:51 0.4749971 3.4458391 NA
# 6: 2019-01-01 09:57:48 0.5603327 2.9708420 NA
# 7: 2019-01-01 10:03:02 0.9782264 2.4105093 NA
# 8: 2019-01-01 10:18:35 0.2554288 1.4322829 NA
# 9: 2019-01-01 10:32:33 0.4577418 1.1768540 NA
#10: 2019-01-01 10:36:15 0.7191123 0.7191123 NA
Update
For the new data, we can split by group and apply the same method
f1 <- function(data) {
lst1 <- split(data, data[["group"]])
rbindlist(lapply(lst1, function(.dat) {
out <- rbindlist(lapply(.dat$id, function(i) {
d1 <- .dat[seq_len(.N) >= match(i, id)]
d1[d1[, .(time = time %m+% hours(1))], .(time1 = time, val, new1 = sum(val)),
on = .(time <= time), by = .EACHI][1]
}))[, time := NULL][]
setnames(out, 1, "time")
out[time[.N]-time > hours(1), new2 := new1][]
})
)}
f1(dat1)
# time val new1 new2
#1: 2019-01-01 08:23:19 0.117487362 1.0521596 1.0521596
#2: 2019-01-01 08:48:24 0.934672247 2.3369796 2.3369796
#3: 2019-01-01 09:27:00 0.940014523 3.6712924 3.6712924
#4: 2019-01-01 09:47:19 0.462292823 3.9081319 3.9081319
#5: 2019-01-01 09:49:51 0.474997082 3.4458391 NA
#6: 2019-01-01 09:57:48 0.560332746 2.9708420 NA
#7: 2019-01-01 10:03:02 0.978226428 2.4105093 NA
#8: 2019-01-01 10:18:35 0.255428824 1.4322829 NA
#9: 2019-01-01 10:32:33 0.457741776 1.1768540 NA
#10: 2019-01-01 10:36:15 0.719112252 0.7191123 NA
#11: 2019-02-01 18:14:39 0.003948339 0.8223376 0.8223376
#12: 2019-02-01 18:23:59 0.811055141 1.7249907 1.7249907
#13: 2019-02-01 19:05:39 0.007334147 1.7468516 1.7468516
#14: 2019-02-01 19:15:03 0.906601408 1.7395175 1.7395175
#15: 2019-02-01 19:26:11 0.832916080 1.4446947 NA
#16: 2019-02-01 20:19:30 0.611778643 2.6303112 NA
#17: 2019-02-01 20:30:46 0.737595618 2.0185326 NA
#18: 2019-02-01 20:31:03 0.207658973 1.2809370 NA
#19: 2019-02-01 20:37:50 0.685169729 1.0732780 NA
#20: 2019-02-01 20:44:50 0.388108283 0.3881083 NA
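As an alternative to the loop above (a different technique, not the answer's method), the forward-looking sum can be sketched with a non-equi self-join. The sketch assumes a data.table version that supports non-equi joins, assumes the window includes rows exactly one hour later, and uses made-up data; the names end and sums are illustrative. The new2 rule shown is one reading of the preferred behavior in the question: keep the sum only when a full hour of data follows the row within its group.

```r
library(data.table)

# Illustrative data: two groups with hand-picked offsets in seconds
toy <- data.table(
  group = c(1, 1, 1, 2, 2),
  time  = as.POSIXct("2019-01-01 08:00:00", tz = "UTC") + c(0, 1800, 5400, 0, 3600),
  val   = c(1, 2, 4, 8, 16)
)

# End of each row's one-hour window
toy[, end := time + 3600]

# For every row of i, sum val over rows of x in the same group whose time
# lies in [i.time, i.end]; by = .EACHI returns one result per row of i
sums <- toy[toy,
            on = .(group, time >= time, time <= end),
            .(new1 = sum(val)),
            by = .EACHI]
toy[, new1 := sums$new1]

# Keep the sum only where the group still has data a full hour later
toy[, new2 := ifelse(end <= max(time), new1, NA_real_), by = group]
```

Because the join is done once for the whole table, no per-row subsetting inside j is needed, and group is simply another join column.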

Copy a value from one person in a group to everyone in a group

I have a data set in long format (multiple observations per ID), due to omitted information on prescriptions. Each ID is part of a larger "set", and there are 50 or more sets all with one diseased ID. One person per set has the disease, and the others don't.
dt <- data.table(ID = rep(1:10, each = 4),
disease = c(rep(0, 16), rep(1, 4), rep(0, 12), rep(1,4), rep(0,4)),
dob = c(rep(as.Date("13/05/1924", "%d/%m/%Y"), 4), rep(as.Date("15/09/1936", "%d/%m/%Y"),4),
rep(as.Date("30/06/1957", "%d/%m/%Y"),4), rep(as.Date("19/02/1946", "%d/%m/%Y"),4),
rep(as.Date("26/04/1939", "%d/%m/%Y"),4), rep(as.Date("13/05/1922", "%d/%m/%Y"), 4), rep(as.Date("18/10/1945", "%d/%m/%Y"),4),
rep(as.Date("30/06/1957", "%d/%m/%Y"),4), rep(as.Date("19/02/1946", "%d/%m/%Y"),4),
rep(as.Date("26/12/1939", "%d/%m/%Y"),4)),
disease.date = c(rep(as.Date("01/01/2000", "%d/%m/%Y"), 16), rep(as.Date("19/02/2006", "%d/%m/%Y"),4),
rep(as.Date("01/01/2000", "%d/%m/%Y"), 12), rep(as.Date("13/11/2010", "%d/%m/%Y"),4),
rep(as.Date("01/01/2000", "%d/%m/%Y"), 4)),
set = c(rep(1,20), rep(2,20)))
dt <- dt[(disease==0), disease.date:=NA]
dt
ID disease dob disease.date set
1: 1 0 1924-05-13 <NA> 1
2: 1 0 1924-05-13 <NA> 1
3: 1 0 1924-05-13 <NA> 1
4: 1 0 1924-05-13 <NA> 1
5: 2 0 1936-09-15 <NA> 1
6: 2 0 1936-09-15 <NA> 1
7: 2 0 1936-09-15 <NA> 1
8: 2 0 1936-09-15 <NA> 1
9: 3 0 1957-06-30 <NA> 1
10: 3 0 1957-06-30 <NA> 1
11: 3 0 1957-06-30 <NA> 1
12: 3 0 1957-06-30 <NA> 1
13: 4 0 1946-02-19 <NA> 1
14: 4 0 1946-02-19 <NA> 1
15: 4 0 1946-02-19 <NA> 1
16: 4 0 1946-02-19 <NA> 1
17: 5 1 1939-04-26 2006-02-19 1
18: 5 1 1939-04-26 2006-02-19 1
19: 5 1 1939-04-26 2006-02-19 1
20: 5 1 1939-04-26 2006-02-19 1
21: 6 0 1922-05-13 <NA> 2
22: 6 0 1922-05-13 <NA> 2
23: 6 0 1922-05-13 <NA> 2
24: 6 0 1922-05-13 <NA> 2
25: 7 0 1945-10-18 <NA> 2
26: 7 0 1945-10-18 <NA> 2
27: 7 0 1945-10-18 <NA> 2
28: 7 0 1945-10-18 <NA> 2
29: 8 0 1957-06-30 <NA> 2
30: 8 0 1957-06-30 <NA> 2
31: 8 0 1957-06-30 <NA> 2
32: 8 0 1957-06-30 <NA> 2
33: 9 1 1946-02-19 2010-11-13 2
34: 9 1 1946-02-19 2010-11-13 2
35: 9 1 1946-02-19 2010-11-13 2
36: 9 1 1946-02-19 2010-11-13 2
37: 10 0 1939-12-26 <NA> 2
38: 10 0 1939-12-26 <NA> 2
39: 10 0 1939-12-26 <NA> 2
40: 10 0 1939-12-26 <NA> 2
I'm interested in finding the age of everyone in a set on the case's disease date.
For example, how old is everyone in set 1 on 19/02/2006 (the case's disease date), and in set 2 on 13/11/2010?
I've tried the data.table way:
cc[, age := dob - oa.cons.date, by = set]
which only worked for those with a disease.date
Any other thoughts I had involved copying the disease.date of each case to the controls in the same set, but I didn't know how to do that either.
You can copy the first non-empty disease date within each set group to the whole column disease.date:
dt[, disease.date := disease.date[!is.na(disease.date)][1], by = set]
Then calculate age:
dt[, age := disease.date - dob]
Note that the resulting time differences are in days. You may divide them by 365 (or 365.25 to allow for leap years) or treat them in any other suitable way. The lubridate package may be useful here. With its help:
dt[, age := as.period(interval(dob, disease.date), unit = "years")]
or
dt[, age := decimal_date(disease.date) - decimal_date(dob)]
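Put together, the two steps can be sketched without lubridate as follows. The table is made up (one row per person, two sets), and dividing by 365.25 is an approximation chosen here for illustration, not something the answer prescribes.

```r
library(data.table)

# Illustrative data: one case per set, controls have no disease.date
toy <- data.table(
  ID           = 1:4,
  set          = c(1, 1, 2, 2),
  dob          = as.Date(c("1950-01-01", "1960-01-01", "1970-01-01", "1980-01-01")),
  disease.date = as.Date(c(NA, "2000-01-01", "2010-01-01", NA))
)

# Copy the case's date to every member of the same set
toy[, disease.date := disease.date[!is.na(disease.date)][1], by = set]

# Age in approximate years at the case's disease date
toy[, age := as.numeric(disease.date - dob) / 365.25]
```

After the first step no disease.date is missing, so the age is defined for cases and controls alike.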
You can try this:
(dt$disease.date[20] - dt$dob)/365
This uses dt$disease.date[20] (the case's date in set 1) because the disease.date column contains NAs, so it only gives the right ages for set 1.
Since both columns are date objects, R automatically calculates the difference in two dates. The difference will be in terms of days, so dividing by 365 gives you the approximate age.

Adding row for missing value in data.table

My question is somewhat related to "Fastest way to add rows for missing values in a data.frame?", but a bit tougher, I think, and I can't figure out how to adapt that solution to my problem.
Here is what my data.table looks like :
ida idb value date
1: A 2 26600 2004-12-31
2: A 3 19600 2005-03-31
3: B 3 18200 2005-06-30
4: B 4 1230 2005-09-30
5: C 2 8700 2005-12-31
The difference is that every 'ida' has its own dates, and there is at least one row for each of those dates, but not necessarily for every 'idb'. For each of an 'ida''s dates, I want to insert every ('ida','idb') pair that is missing on that date, with 0 as the value.
Moreover, the dates are not periodic.
How would you do this?
Desired output :
ida idb value date
1: A 2 26600 2004-12-31
2: A 2 0 2005-03-31
3: A 3 19600 2005-03-31
4: A 3 0 2004-12-31
5: B 3 18200 2005-06-30
6: B 3 0 2005-09-30
7: B 4 1230 2005-09-30
8: B 4 0 2005-06-30
9: C 2 8700 2005-12-31
The order doesn't matter. Every date missing is filled with a 0 value.
You just do the same thing as in your linked question by each ida:
setkey(dt, idb, date)
dt[, .SD[CJ(unique(idb), unique(date))], by = ida][is.na(value), value := 0][]
# ida idb value date
#1: A 2 26600 2004-12-31
#2: A 2 0 2005-03-31
#3: A 3 0 2004-12-31
#4: A 3 19600 2005-03-31
#5: C 2 8700 2005-12-31
#6: B 3 18200 2005-06-30
#7: B 3 0 2005-09-30
#8: B 4 0 2005-06-30
#9: B 4 1230 2005-09-30
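The same idea can be sketched without setting a key, by building the per-ida grid first and then joining on explicit columns. This is a variant for illustration, not the answer's code; the names grid and filled are made up, and value is stored as numeric here so that assigning 0 needs no type conversion.

```r
library(data.table)

# The question's data, rebuilt by hand
toy <- data.table(
  ida   = c("A", "A", "B", "B", "C"),
  idb   = c(2, 3, 3, 4, 2),
  value = c(26600, 19600, 18200, 1230, 8700),
  date  = as.Date(c("2004-12-31", "2005-03-31", "2005-06-30",
                    "2005-09-30", "2005-12-31"))
)

# All (idb, date) combinations that occur within each ida
grid <- toy[, CJ(idb = unique(idb), date = unique(date)), by = ida]

# Join the original rows onto the grid; unmatched combinations get NA
filled <- toy[grid, on = .(ida, idb, date)]

# Replace the NAs introduced by the join with 0
filled[is.na(value), value := 0]
```

The result has nine rows, four of which are the inserted zero-value rows.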

Compare different columns in separate rows in R

I would like to check that an individual does not have any gaps in their eligibility status. I define a gap as a date_of_claim that occurs more than 30 days after the last elig_end_date. Therefore, I would like to check that each date_of_claim is no later than the elig_end_date + 30 days in the immediately preceding row. Ideally, I would like an indicator, per person, that is 0 where there is no gap and 1 where a gap occurs. Here is a sample df with the solution built in as 'gaps'.
names date_of_claim elig_end_date obs gaps
1 tom 2010-01-01 2010-07-01 1 NA
2 tom 2010-05-04 2010-07-01 1 0
3 tom 2010-06-01 2014-01-01 2 0
4 tom 2010-10-10 2014-01-01 2 0
5 mary 2010-03-01 2014-06-14 1 NA
6 mary 2010-05-01 2014-06-14 1 0
7 mary 2010-08-01 2014-06-14 1 0
8 mary 2010-11-01 2014-06-14 1 0
9 mary 2011-01-01 2014-06-14 1 0
10 john 2010-03-27 2011-03-01 1 NA
11 john 2010-07-01 2011-03-01 1 0
12 john 2010-11-01 2011-03-01 1 0
13 john 2011-02-01 2011-03-01 1 0
14 sue 2010-02-01 2010-04-30 1 NA
15 sue 2010-02-27 2010-04-30 1 0
16 sue 2010-03-13 2010-05-31 2 0
17 sue 2010-04-27 2010-06-30 3 0
18 sue 2010-04-27 2010-06-30 3 0
19 sue 2010-05-06 2010-08-31 4 0
20 sue 2010-06-08 2010-09-30 5 0
21 mike 2010-05-01 2010-07-30 1 NA
22 mike 2010-06-01 2010-07-30 1 0
23 mike 2010-11-12 2011-07-30 2 1
I found the post "How can I compare a value in a column to the previous one using R?" quite useful, but I don't think I can use a loop: my df has 4 million rows, and I have already had a lot of difficulty trying to run a loop on it.
To this end, I think the code I need is something like this:
df$gaps<-ifelse(df$date_of_claim>=df$elig_end_date+30,1,0) ## this doesn't use the preceding row
I've made a clumsy attempt using this:
df$gaps<-df$date_of_claim>=df$elig_end_date[-1,]
but I get an error saying I have an incorrect number of dimensions.
All help is greatly appreciated! Thank you.
With four million observations I would use data.table:
DF <- read.table(text="names date_of_claim elig_end_date obs gaps
1 tom 2010-01-01 2010-07-01 1 NA
2 tom 2010-05-04 2010-07-01 1 0
3 tom 2010-06-01 2014-01-01 2 0
4 tom 2010-10-10 2014-01-01 2 0
5 mary 2010-03-01 2014-06-14 1 NA
6 mary 2010-05-01 2014-06-14 1 0
7 mary 2010-08-01 2014-06-14 1 0
8 mary 2010-11-01 2014-06-14 1 0
9 mary 2011-01-01 2014-06-14 1 0
10 john 2010-03-27 2011-03-01 1 NA
11 john 2010-07-01 2011-03-01 1 0
12 john 2010-11-01 2011-03-01 1 0
13 john 2011-02-01 2011-03-01 1 0
14 sue 2010-02-01 2010-04-30 1 NA
15 sue 2010-02-27 2010-04-30 1 0
16 sue 2010-03-13 2010-05-31 2 0
17 sue 2010-04-27 2010-06-30 3 0
18 sue 2010-04-27 2010-06-30 3 0
19 sue 2010-05-06 2010-08-31 4 0
20 sue 2010-06-08 2010-09-30 5 0
21 mike 2010-05-01 2010-07-30 1 NA
22 mike 2010-06-01 2010-07-30 1 0
23 mike 2010-11-12 2011-07-30 2 1", header=TRUE)
library(data.table)
DT <- data.table(DF)
DT[, c("date_of_claim", "elig_end_date") := list(as.Date(date_of_claim), as.Date(elig_end_date))]
DT[, gaps2:= c(NA, date_of_claim[-1] > head(elig_end_date, -1)+30), by=names]
# names date_of_claim elig_end_date obs gaps gaps2
# 1: tom 2010-01-01 2010-07-01 1 NA NA
# 2: tom 2010-05-04 2010-07-01 1 0 FALSE
# 3: tom 2010-06-01 2014-01-01 2 0 FALSE
# 4: tom 2010-10-10 2014-01-01 2 0 FALSE
# 5: mary 2010-03-01 2014-06-14 1 NA NA
# 6: mary 2010-05-01 2014-06-14 1 0 FALSE
# 7: mary 2010-08-01 2014-06-14 1 0 FALSE
# 8: mary 2010-11-01 2014-06-14 1 0 FALSE
# 9: mary 2011-01-01 2014-06-14 1 0 FALSE
# 10: john 2010-03-27 2011-03-01 1 NA NA
# 11: john 2010-07-01 2011-03-01 1 0 FALSE
# 12: john 2010-11-01 2011-03-01 1 0 FALSE
# 13: john 2011-02-01 2011-03-01 1 0 FALSE
# 14: sue 2010-02-01 2010-04-30 1 NA NA
# 15: sue 2010-02-27 2010-04-30 1 0 FALSE
# 16: sue 2010-03-13 2010-05-31 2 0 FALSE
# 17: sue 2010-04-27 2010-06-30 3 0 FALSE
# 18: sue 2010-04-27 2010-06-30 3 0 FALSE
# 19: sue 2010-05-06 2010-08-31 4 0 FALSE
# 20: sue 2010-06-08 2010-09-30 5 0 FALSE
# 21: mike 2010-05-01 2010-07-30 1 NA NA
# 22: mike 2010-06-01 2010-07-30 1 0 FALSE
# 23: mike 2010-11-12 2011-07-30 2 1 TRUE
# names date_of_claim elig_end_date obs gaps gaps2
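The same comparison can be sketched with data.table's shift(), which lags a column within each group and also makes it easy to return the 0/1 indicator the question asks for instead of TRUE/FALSE. This is a variant on a made-up subset of the data, not the answer's code; it assumes a data.table version that provides shift().

```r
library(data.table)

# Illustrative subset: tom has no gap, mike's third claim is a gap
toy <- data.table(
  names         = c("tom", "tom", "mike", "mike", "mike"),
  date_of_claim = as.Date(c("2010-01-01", "2010-05-04",
                            "2010-05-01", "2010-06-01", "2010-11-12")),
  elig_end_date = as.Date(c("2010-07-01", "2010-07-01",
                            "2010-07-30", "2010-07-30", "2011-07-30"))
)

# shift() returns the previous row's elig_end_date within each person
# (NA for the first row), so the first claim of each person gets NA
toy[, gaps := as.integer(date_of_claim > shift(elig_end_date) + 30),
    by = names]
```

As in the answer, the comparison is strict, so a claim exactly 30 days after the previous end date is not flagged.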

Resources