We will start with the following data.table:
id date
1: 1 2015-12-31
2: 1 2014-12-31
3: 1 2013-12-31
4: 1 2012-12-31
5: 1 2011-12-31
6: 2 2015-12-31
7: 2 2014-12-31
8: 2 2014-01-25
9: 2 2013-01-25
10: 2 2012-01-25
library(data.table)
DT <- data.table(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
as.IDate(c("2015-12-31", "2014-12-31", "2013-12-31", "2012-12-31",
"2011-12-31", "2015-12-31", "2014-12-31", "2014-01-25",
"2013-01-25", "2012-01-25")))
setnames(DT, c("id", "date"))
For every unique id, I want to create a ranking. The most recent date for a given id should have rank 0. Going back one year from that date gives rank -1, and so on. If the month differs from the month of the rank-0 date, the ranking should stop. For example, at line 8 (id = 2), the month is not December, so the ranking stops there.
We would get the following result:
id date rank_year
1: 1 2015-12-31 0
2: 1 2014-12-31 -1
3: 1 2013-12-31 -2
4: 1 2012-12-31 -3
5: 1 2011-12-31 -4
6: 2 2015-12-31 0
7: 2 2014-12-31 -1
8: 2 2014-01-25 NA
9: 2 2013-01-25 NA
10: 2 2012-01-25 NA
I have the following code so far (provided by @Frank and @akrun):
DT <- DT[order(id, -date)]
DT <- DT[,rank_year := { z = month(date) + year(date)*12
as.integer( (z - z[1L])/12) # 12 months
}, by = id]
id date rank_year
1: 1 2015-12-31 0
2: 1 2014-12-31 -1
3: 1 2013-12-31 -2
4: 1 2012-12-31 -3
5: 1 2011-12-31 -4
6: 2 2015-12-31 0
7: 2 2014-12-31 -1
8: 2 2014-01-25 -1
9: 2 2013-01-25 -2
10: 2 2012-01-25 -3
Ok, I guess I would do it like
DT[, rank_year := replace(
year(date) - year(date)[1L],
month(date) != month(date[1L]),
NA_integer_
), by=id]
id date rank_year
1: 1 2015-12-31 0
2: 1 2014-12-31 -1
3: 1 2013-12-31 -2
4: 1 2012-12-31 -3
5: 1 2011-12-31 -4
6: 2 2015-12-31 0
7: 2 2014-12-31 -1
8: 2 2014-01-25 NA
9: 2 2013-01-25 NA
10: 2 2012-01-25 NA
See ?replace for details on how this works.
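Note that replace() as used above only blanks the rows whose month differs, so a later row that falls in the matching month again would still receive a rank. If the ranking should stop for good at the first mismatch, a cumulative flag is one option. This is a minimal sketch on made-up data (the extra 2013-12-31 row for id 2 is an assumption added to show the difference), not part of the original answer:

```r
library(data.table)

# Toy data: id 2 has a December row *after* the first non-December row
DT <- data.table(
  id   = c(1, 1, 1, 2, 2, 2, 2),
  date = as.IDate(c("2015-12-31", "2014-12-31", "2013-12-31",
                    "2015-12-31", "2014-12-31", "2014-01-25", "2013-12-31"))
)
setorder(DT, id, -date)  # most recent date first within each id

DT[, rank_year := {
  # TRUE from the first month mismatch onwards, never switches back
  stopped <- cumsum(month(date) != month(date[1L])) > 0L
  replace(year(date) - year(date[1L]), stopped, NA_integer_)
}, by = id]

DT
```

With this variant the last row of id 2 is NA even though it is in December, because the ranking has already stopped at 2014-01-25.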
One way of extending the old answer is
DT[, r := {
z = month(date) + year(date)*12
res = (z - z[1L])/12
as.integer( replace(res, res %% 1 != 0, NA) )
}, by=id]
Related
This is the example data I have:
data<- data.frame(ID=c(rep(1,4),rep(2,5),rep(3,5),rep(4,4),rep(5,3),rep(6,3),rep(7,4),rep(8,5)),
test_results=c("POS","NEG","NA","NA",
"NA","NEG","POS","NA","NA",
"NEG","NEG","NEG","POS","NA",
"NA","NA","NA","NA",
"NEG","NEG","NEG",
"POS","POS","POS",
"NEG","NEG","NEG","NA",
"POS","POS","POS","NA","POS"),
Test_date=c("2000-1-1","2002-1-2","2003-1-1","2004-1-1",
"2000-2-1","2000-10-1","2002-10-2","2002-11-1","2002-12-1",
"2000-1-1","2002-1-1","2004-1-1","2006-1-1","2008-1-1",
"2000-1-1","2001-1-1","2002-1-1","2003-1-1",
"2000-1-1","2002-1-1","2004-1-1",
"2002-1-1","2004-1-1","2006-1-1",
"2000-1-1","2002-2-1","2003-12-1","2003-12-30",
"2002-3-1","2004-5-2","2005-12-30","2005-12-31","2007-9-10"))
I want to remove the 'NA' after the 'POS' in each ID, if the time interval between the "NA" and the "POS" is less than 3 months.
This is the intended result:
data.frame(ID=c(rep(1,4),rep(2,3),rep(3,5),rep(4,4),rep(5,3),rep(6,3),rep(7,4),rep(8,4)),
test_results=c("POS","NEG","NA","NA",
"NA","NEG","POS",
"NEG","NEG","NEG","POS","NA",
"NA","NA","NA","NA",
"NEG","NEG","NEG",
"POS","POS","POS",
"NEG","NEG","NEG","NA",
"POS","POS","POS","POS"),
Test_date=c("2000-1-1","2002-1-2","2003-1-1","2004-1-1",
"2000-2-1","2000-10-1","2002-10-2",
"2000-1-1","2002-1-1","2004-1-1","2006-1-1","2008-1-1",
"2000-1-1","2001-1-1","2002-1-1","2003-1-1",
"2000-1-1","2002-1-1","2004-1-1",
"2002-1-1","2004-1-1","2006-1-1",
"2000-1-1","2002-2-1","2003-12-1","2003-12-30",
"2002-3-1","2004-5-2","2005-12-30","2007-9-10"))
I had many attempts to find a good way to achieve this but did not get the solution. Any insights will be appreciated. Thank you!
Here is a data.table option
library(data.table)
library(lubridate)
setDT(data)[
,
Test_date := ymd(Test_date)
][
,
Q := c(NA, Test_date[test_results == "POS"] %m+% months(3))[cumsum(test_results == "POS") + 1],
ID
][!replace(rep(FALSE, .N), test_results == "NA" & Test_date <= Q, TRUE)][
,
Q := NULL
][]
which gives
ID test_results Test_date
1: 1 POS 2000-01-01
2: 1 NEG 2002-01-02
3: 1 NA 2003-01-01
4: 1 NA 2004-01-01
5: 2 NA 2000-02-01
6: 2 NEG 2000-10-01
7: 2 POS 2002-10-02
8: 3 NEG 2000-01-01
9: 3 NEG 2002-01-01
10: 3 NEG 2004-01-01
11: 3 POS 2006-01-01
12: 3 NA 2008-01-01
13: 4 NA 2000-01-01
14: 4 NA 2001-01-01
15: 4 NA 2002-01-01
16: 4 NA 2003-01-01
17: 5 NEG 2000-01-01
18: 5 NEG 2002-01-01
19: 5 NEG 2004-01-01
20: 6 POS 2002-01-01
21: 6 POS 2004-01-01
22: 6 POS 2006-01-01
23: 7 NEG 2000-01-01
24: 7 NEG 2002-02-01
25: 7 NEG 2003-12-01
26: 7 NA 2003-12-30
27: 8 POS 2002-03-01
28: 8 POS 2004-05-02
29: 8 POS 2005-12-30
30: 8 POS 2007-09-10
ID test_results Test_date
A dplyr option following a similar idea
library(tidyverse)
library(lubridate)
data %>%
mutate(Test_date = ymd(Test_date)) %>%
group_by(ID) %>%
mutate(Q = c(NA, Test_date[test_results == "POS"] %m+% months(3))[cumsum(test_results == "POS") + 1]) %>%
filter(!replace(rep(FALSE, n()), test_results == "NA" & Test_date <= Q, TRUE)) %>%
select(-Q) %>%
ungroup()
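Both options rely on the same lookup: each row is paired with a cutoff date three months after the most recent "POS" at or before it. The sketch below shows that step in isolation for a single hypothetical ID (the vectors are made up for illustration). It uses as.Date(NA) rather than a bare NA so that the cutoff vector stays a Date, and it checks !is.na(Q) explicitly; both are my choices, not part of the answers above.

```r
library(lubridate)

# One hypothetical ID, already in date order
test_results <- c("NEG", "POS", "NA", "NEG", "POS", "NA")
Test_date <- ymd(c("2000-1-1", "2000-2-1", "2000-3-1",
                   "2001-1-1", "2001-6-1", "2001-12-1"))

is_pos <- test_results == "POS"

# Slot 1 is "no POS seen yet"; slot k + 1 is the cutoff for the k-th POS
cutoffs <- c(as.Date(NA), Test_date[is_pos] %m+% months(3))

# cumsum(is_pos) counts the POS rows seen so far, so it indexes the right slot
Q <- cutoffs[cumsum(is_pos) + 1]

# Drop an "NA" row only when it falls on or before its cutoff
keep <- !(test_results == "NA" & !is.na(Q) & Test_date <= Q)
keep
```

Here the third row ("NA" one month after a "POS") is dropped, while the last row ("NA" six months after a "POS") is kept.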
I have some time data
library(data.table); library(lubridate); set.seed(42)
dat <- rbind(data.table(time=as.POSIXct("2019-01-01 08:00:00") + round(runif(10,60,1e4)), val=runif(10),group=1)[order(time), id:=seq_len(.N)],
data.table(time=as.POSIXct("2019-02-01 18:00:00") + round(runif(10,60,1e4)), val=runif(10),group=2)[order(time), id:=seq_len(.N)])
> dat[order(group,id)]
time val group id
1: 2019-01-01 08:23:19 0.117487362 1 1
2: 2019-01-01 08:48:24 0.934672247 1 2
3: 2019-01-01 09:27:00 0.940014523 1 3
4: 2019-01-01 09:47:19 0.462292823 1 4
5: 2019-01-01 09:49:51 0.474997082 1 5
6: 2019-01-01 09:57:48 0.560332746 1 6
7: 2019-01-01 10:03:02 0.978226428 1 7
8: 2019-01-01 10:18:35 0.255428824 1 8
9: 2019-01-01 10:32:33 0.457741776 1 9
10: 2019-01-01 10:36:15 0.719112252 1 10
11: 2019-02-01 18:14:39 0.003948339 2 1
12: 2019-02-01 18:23:59 0.811055141 2 2
13: 2019-02-01 19:05:39 0.007334147 2 3
14: 2019-02-01 19:15:03 0.906601408 2 4
15: 2019-02-01 19:26:11 0.832916080 2 5
16: 2019-02-01 20:19:30 0.611778643 2 6
17: 2019-02-01 20:30:46 0.737595618 2 7
18: 2019-02-01 20:31:03 0.207658973 2 8
19: 2019-02-01 20:37:50 0.685169729 2 9
20: 2019-02-01 20:44:50 0.388108283 2 10
and I would like to calculate the sum of val during the following hour for each value of time. For example, for ID 1, this would be the sum of val for IDs 1 and 2 (because time for ID 3 is more than one hour after ID 1), for ID 2, the sum of val for IDs 2 to 4, and so forth. This yields the desired output (for group 1 only)
> res
time val id new1 new2
1: 2019-01-01 08:23:19 0.1174874 1 1.052160 1.052160
2: 2019-01-01 08:48:24 0.9346722 2 2.336979 2.336979
3: 2019-01-01 09:27:00 0.9400145 3 3.671292 3.671292
4: 2019-01-01 09:47:19 0.4622928 4 3.908132 3.908132
5: 2019-01-01 09:49:51 0.4749971 5 3.445839 NA
6: 2019-01-01 09:57:48 0.5603327 6 2.970842 NA
7: 2019-01-01 10:03:02 0.9782264 7 2.410509 NA
8: 2019-01-01 10:18:35 0.2554288 8 1.432283 NA
9: 2019-01-01 10:32:33 0.4577418 9 1.176854 NA
10: 2019-01-01 10:36:15 0.7191123 10 0.719112 NA
where two behaviors at the end are possible:
where the sequence is treated as is;
where sums are only calculated as long as there is still an id whose time is at least an hour later, and all other rows are set to NA (preferred).
I suspect that solving this requires me to subset within j but this is a problem I frequently run into and can't solve. I have not yet understood the general approach to this.
It could be a loop with join
dat1 <- dat[order(id)]
out <- rbindlist(lapply(dat1$id, function(i) {
d1 <- dat1[seq_len(.N) >= match(i, id)]
d1[d1[, .(time = time %m+% hours(1))], .(time1 = time, val, new1 = sum(val)),
on = .(time <= time), by = .EACHI][1]
}))[, time := NULL][]
setnames(out, 1, "time")
out[time < time[2] %m+% hours(1), new2 := new1]
out
# time val new1 new2
# 1: 2019-01-01 08:23:19 0.1174874 1.0521596 1.052160
# 2: 2019-01-01 08:48:24 0.9346722 2.3369796 2.336980
# 3: 2019-01-01 09:27:00 0.9400145 3.6712924 3.671292
# 4: 2019-01-01 09:47:19 0.4622928 3.9081319 3.908132
# 5: 2019-01-01 09:49:51 0.4749971 3.4458391 NA
# 6: 2019-01-01 09:57:48 0.5603327 2.9708420 NA
# 7: 2019-01-01 10:03:02 0.9782264 2.4105093 NA
# 8: 2019-01-01 10:18:35 0.2554288 1.4322829 NA
# 9: 2019-01-01 10:32:33 0.4577418 1.1768540 NA
#10: 2019-01-01 10:36:15 0.7191123 0.7191123 NA
Update
For the new data, we can split by group and apply the same method
f1 <- function(data) {
lst1 <- split(data, data[["group"]])
rbindlist(lapply(lst1, function(.dat) {
out <- rbindlist(lapply(.dat$id, function(i) {
d1 <- .dat[seq_len(.N) >= match(i, id)]
d1[d1[, .(time = time %m+% hours(1))], .(time1 = time, val, new1 = sum(val)),
on = .(time <= time), by = .EACHI][1]
}))[, time := NULL][]
setnames(out, 1, "time")
out[time[.N]-time > hours(1), new2 := new1][]
})
)}
f1(dat1)
# time val new1 new2
#1: 2019-01-01 08:23:19 0.117487362 1.0521596 1.0521596
#2: 2019-01-01 08:48:24 0.934672247 2.3369796 2.3369796
#3: 2019-01-01 09:27:00 0.940014523 3.6712924 3.6712924
#4: 2019-01-01 09:47:19 0.462292823 3.9081319 3.9081319
#5: 2019-01-01 09:49:51 0.474997082 3.4458391 NA
#6: 2019-01-01 09:57:48 0.560332746 2.9708420 NA
#7: 2019-01-01 10:03:02 0.978226428 2.4105093 NA
#8: 2019-01-01 10:18:35 0.255428824 1.4322829 NA
#9: 2019-01-01 10:32:33 0.457741776 1.1768540 NA
#10: 2019-01-01 10:36:15 0.719112252 0.7191123 NA
#11: 2019-02-01 18:14:39 0.003948339 0.8223376 0.8223376
#12: 2019-02-01 18:23:59 0.811055141 1.7249907 1.7249907
#13: 2019-02-01 19:05:39 0.007334147 1.7468516 1.7468516
#14: 2019-02-01 19:15:03 0.906601408 1.7395175 1.7395175
#15: 2019-02-01 19:26:11 0.832916080 1.4446947 NA
#16: 2019-02-01 20:19:30 0.611778643 2.6303112 NA
#17: 2019-02-01 20:30:46 0.737595618 2.0185326 NA
#18: 2019-02-01 20:31:03 0.207658973 1.2809370 NA
#19: 2019-02-01 20:37:50 0.685169729 1.0732780 NA
#20: 2019-02-01 20:44:50 0.388108283 0.3881083 NA
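As a possible alternative to looping over ids, the forward-looking window can also be written as a non-equi self-join, which is one general pattern for "subset within j" problems of this kind. The sketch below is a hedged illustration on small made-up data, not the answer's method; the helper column end is hypothetical, and treating a row exactly one hour later as inside the window is an assumption. It computes only new1.

```r
library(data.table)

# Toy data: two groups, times given as offsets in seconds
dat <- data.table(
  group = c(1, 1, 1, 2, 2),
  time  = as.POSIXct("2019-01-01 08:00:00", tz = "UTC") + c(0, 1800, 4000, 0, 3600),
  val   = c(1, 2, 4, 8, 16)
)

# End of each row's one-hour window (hypothetical helper column)
dat[, end := time + 3600]

# For every row of the inner `dat`, match rows of the outer `dat` in the same
# group whose time lies in [time, end], and sum their val
dat[, new1 := dat[dat,
                  on = .(group, time >= time, time <= end),
                  sum(val),
                  by = .EACHI]$V1]

dat[, end := NULL][]
```

by = .EACHI evaluates j once per row of the inner table, so the result comes back in the original row order and can be assigned directly.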
I have a data set in long format (multiple observations per ID), due to omitted information on prescriptions. Each ID is part of a larger "set", and there are 50 or more sets all with one diseased ID. One person per set has the disease, and the others don't.
dt <- data.table(ID = rep(1:10, each = 4),
disease = c(rep(0, 16), rep(1, 4), rep(0, 12), rep(1,4), rep(0,4)),
dob = c(rep(as.Date("13/05/1924", "%d/%m/%Y"), 4), rep(as.Date("15/09/1936", "%d/%m/%Y"),4),
rep(as.Date("30/06/1957", "%d/%m/%Y"),4), rep(as.Date("19/02/1946", "%d/%m/%Y"),4),
rep(as.Date("26/04/1939", "%d/%m/%Y"),4), rep(as.Date("13/05/1922", "%d/%m/%Y"), 4), rep(as.Date("18/10/1945", "%d/%m/%Y"),4),
rep(as.Date("30/06/1957", "%d/%m/%Y"),4), rep(as.Date("19/02/1946", "%d/%m/%Y"),4),
rep(as.Date("26/12/1939", "%d/%m/%Y"),4)),
disease.date = c(rep(as.Date("01/01/2000", "%d/%m/%Y"), 16), rep(as.Date("19/02/2006", "%d/%m/%Y"),4),
rep(as.Date("01/01/2000", "%d/%m/%Y"), 12), rep(as.Date("13/11/2010", "%d/%m/%Y"),4),
rep(as.Date("01/01/2000", "%d/%m/%Y"), 4)),
set = c(rep(1,20), rep(2,20)))
dt <- dt[(disease==0), disease.date:=NA]
dt
ID disease dob disease.date set
1: 1 0 1924-05-13 <NA> 1
2: 1 0 1924-05-13 <NA> 1
3: 1 0 1924-05-13 <NA> 1
4: 1 0 1924-05-13 <NA> 1
5: 2 0 1936-09-15 <NA> 1
6: 2 0 1936-09-15 <NA> 1
7: 2 0 1936-09-15 <NA> 1
8: 2 0 1936-09-15 <NA> 1
9: 3 0 1957-06-30 <NA> 1
10: 3 0 1957-06-30 <NA> 1
11: 3 0 1957-06-30 <NA> 1
12: 3 0 1957-06-30 <NA> 1
13: 4 0 1946-02-19 <NA> 1
14: 4 0 1946-02-19 <NA> 1
15: 4 0 1946-02-19 <NA> 1
16: 4 0 1946-02-19 <NA> 1
17: 5 1 1939-04-26 2006-02-19 1
18: 5 1 1939-04-26 2006-02-19 1
19: 5 1 1939-04-26 2006-02-19 1
20: 5 1 1939-04-26 2006-02-19 1
21: 6 0 1922-05-13 <NA> 2
22: 6 0 1922-05-13 <NA> 2
23: 6 0 1922-05-13 <NA> 2
24: 6 0 1922-05-13 <NA> 2
25: 7 0 1945-10-18 <NA> 2
26: 7 0 1945-10-18 <NA> 2
27: 7 0 1945-10-18 <NA> 2
28: 7 0 1945-10-18 <NA> 2
29: 8 0 1957-06-30 <NA> 2
30: 8 0 1957-06-30 <NA> 2
31: 8 0 1957-06-30 <NA> 2
32: 8 0 1957-06-30 <NA> 2
33: 9 1 1946-02-19 2010-11-13 2
34: 9 1 1946-02-19 2010-11-13 2
35: 9 1 1946-02-19 2010-11-13 2
36: 9 1 1946-02-19 2010-11-13 2
37: 10 0 1939-12-26 <NA> 2
38: 10 0 1939-12-26 <NA> 2
39: 10 0 1939-12-26 <NA> 2
40: 10 0 1939-12-26 <NA> 2
I'm interested in finding the age of everyone in that set on the date of disease for the case.
For example, how old is everyone in set 1 on 19/02/2006 (the case's disease date), and everyone in set 2 on 13/11/2010?
I've tried the data.table way:
cc[, age := dob - oa.cons.date, by = set]
which only worked for those with a disease.date
Any other thoughts I had involved copying the disease.date of each case to the controls in the same set, but I didn't know how to do that either.
You can copy the first non-empty disease date within each set group to the whole column disease.date:
dt[, disease.date := disease.date[!is.na(disease.date)][1], by = set]
Then calculate age:
dt[, age := disease.date - dob]
Notice that time difference intervals are in days. You may divide them by 365 or treat them in any other suitable way. Maybe package lubridate can be useful here. With its help:
dt[, age := as.period(interval(dob, disease.date), unit = "years")]
or
dt[, age := decimal_date(disease.date) - decimal_date(dob)]
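Put together, the two steps can be tried on a small made-up table. This is a sketch under assumptions: the column case.date is a hypothetical name used so that the original disease.date is not overwritten, and dividing by 365.25 is only an approximation of age in years.

```r
library(data.table)

# Toy data: one control and one case per set
dt <- data.table(
  ID  = c(1, 2, 3, 4),
  set = c(1, 1, 2, 2),
  dob = as.Date(c("1924-05-13", "1939-04-26", "1957-06-30", "1946-02-19")),
  disease.date = as.Date(c(NA, "2006-02-19", NA, "2010-11-13"))
)

# Copy the case's disease date to every member of the same set
dt[, case.date := disease.date[!is.na(disease.date)][1], by = set]

# Date - Date is a difference in days; convert to approximate years
dt[, age := as.numeric(case.date - dob) / 365.25]

dt
```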
You can try this:
(dt$disease.date[20] - dt$dob)/365
This takes dt$disease.date[20] (the case date for set 1) because the disease.date column contains NAs, so the result is only meaningful for rows in set 1.
Since both columns are Date objects, R computes the difference between the two dates directly. The difference is in days, so dividing by 365 gives the approximate age in years.
I have prescription record data and would like to find out how many prescriptions each person had in each year from their issue date until the end of their record. Example data (first 5 rows of each ID):
ID Issue_Date index.date other.drugs
1: 1 2000-02-08 2011-02-03 1
2: 1 2000-04-04 2011-02-03 0
3: 1 2000-05-30 2011-02-03 1
4: 1 2000-07-25 2011-02-03 1
5: 1 2000-08-22 2011-02-03 1
---
1: 2 2007-03-23 2009-04-03 1
2: 2 2007-04-04 2009-04-03 1
3: 2 2007-04-23 2009-04-03 1
4: 2 2007-04-23 2009-04-03 0
5: 2 2007-05-21 2009-04-03 1
The other.drugs column is an indicator variable that shows whether the prescription given on that date is not a prescription of interest in the study. The index.date is the date they entered the study. There are more than 1000 IDs; only 2 are shown here.
I want to find the sum of other.drugs per year for every year after their first Issue_Date. I calculated this separately for the first year using the code below:
dt <- dt[, yearend.1 := Issue_Date[1]+365, by = ID]
dt <- dt[(Issue_Date<=yearend.1), comorbid.1 := sum(other.drugs), by = ID]
dt <- dt[, comorbid.1:= comorbid.1[!is.na(comorbid.1)][1], by = ID]
# the last line copies the value to each cell the ID occupies in the data.table for that column instead of having NA's
And this gave the following result:
ID Issue_Date index.date other.drugs yearend.1 comorbid.1
1: 1 2000-02-08 2011-02-03 1 2001-02-07 8
2: 1 2000-04-04 2011-02-03 1 2001-02-07 8
3: 1 2000-05-30 2011-02-03 1 2001-02-07 8
4: 1 2000-07-25 2011-02-03 1 2001-02-07 8
5: 1 2000-08-22 2011-02-03 1 2001-02-07 8
---
1: 2 2007-03-23 2009-04-03 1 2008-03-22 30
2: 2 2007-04-04 2009-04-03 1 2008-03-22 30
3: 2 2007-04-23 2009-04-03 1 2008-03-22 30
4: 2 2007-04-23 2009-04-03 1 2008-03-22 30
5: 2 2007-05-21 2009-04-03 1 2008-03-22 30
Interpretation: ID 1 was prescribed 8 other drugs in the year after their first issue_date, and ID 2 was prescribed 30.
For years 2-10 (there is a maximum of 11 years of records) I wrote the following loop:
years <- seq(730, 3650, 365)
# number of days in 2-10 years.
years2 <- seq(2,10,1)
# numbering the years for column names
colnames <- paste0("yearend.", years2)
colnames2 <- paste0("comorbid.", years2)
# names of columns to be used
for (i in 1:length(years)) {
dt <- dt[, colnames[i] := Issue_Date[1]+years[i], by = ID]
dt <- dt[(Issue_Date>=(as.Date(colnames[i], "%d-%m-%Y")) & Issue_Date<(as.Date(colnames[i+1], "%d-%m-%Y"))),
colnames2[i] := sum(other.drugs), by = ID]
dt <- dt[, colnames2[i]:= colnames2[i][!is.na(colnames2[i])][1], by = ID]
}
However, the new columns that were actually created look like this (the comorbid columns contain their own names instead of sums):
ID Issue_Date index.date other.drugs yearend.1 comorbid.1 yearend.2 comorbid.2 yearend.3 comorbid.3
1: 1 2000-02-08 2011-02-03 1 2001-02-07 8 2002-02-07 comorbid.2 2003-02-07 comorbid.3
2: 1 2000-04-04 2011-02-03 1 2001-02-07 8 2002-02-07 comorbid.2 2003-02-07 comorbid.3
3: 1 2000-05-30 2011-02-03 1 2001-02-07 8 2002-02-07 comorbid.2 2003-02-07 comorbid.3
4: 1 2000-07-25 2011-02-03 1 2001-02-07 8 2002-02-07 comorbid.2 2003-02-07 comorbid.3
5: 1 2000-08-22 2011-02-03 1 2001-02-07 8 2002-02-07 comorbid.2 2003-02-07 comorbid.3
---
I would like to know what is going wrong with my loop. Help is much appreciated.
When a column name is stored in an R character variable, using that variable inside a data.table expression gives you the string itself, not the column. To fetch the column you can use get. Thus you could rewrite your loop like this:
for (i in 1:length(years)) {
dt <- dt[, colnames[i] := Issue_Date[1]+years[i], by = ID]
dt <- dt[(Issue_Date>=(as.Date(get(colnames[i]), "%d-%m-%Y")) & Issue_Date<(as.Date(get(colnames[i+1]), "%d-%m-%Y"))),
colnames2[i] := sum(other.drugs), by = ID]
dt <- dt[, colnames2[i]:= get(colnames2[i])[!is.na(get(colnames2[i]))][1], by = ID]
}
I couldn't actually test your code as it is, since I had 2 problems:
I didn't have enough data so that I would get anything from your temporal condition Issue_Date>...
Maybe I'm missing something, but in your loop you are trying to use colnames[i+1], i.e. yearend.X, before it is actually created (maybe you've run it several times and that's why you don't get an error?)
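One way to sidestep the colnames[i+1] problem is to compute each year's bounds from the first issue date inside the loop, so no yearend column has to exist before it is used. The following is a minimal sketch on made-up data, not a tested fix for the original records. The names n_years, end_col, sum_col and start_date are hypothetical, and the half-open window (start inclusive, end exclusive) is an assumption that differs slightly from the <= used for year 1 in the question.

```r
library(data.table)

# Toy prescription records (values invented for illustration)
dt <- data.table(
  ID = c(1, 1, 1, 1, 2, 2, 2),
  Issue_Date = as.Date(c("2000-02-08", "2000-05-30", "2001-03-01", "2002-04-01",
                         "2007-03-23", "2007-04-23", "2008-05-21")),
  other.drugs = c(1, 1, 1, 0, 1, 0, 1)
)

n_years <- 3
for (k in seq_len(n_years)) {
  end_col <- paste0("yearend.", k)
  sum_col <- paste0("comorbid.", k)

  # End of year k, counted in 365-day blocks from each ID's first issue date
  dt[, (end_col) := min(Issue_Date) + 365 * k, by = ID]

  # Sum other.drugs over prescriptions issued during year k;
  # the single value is recycled to every row of the ID
  dt[, (sum_col) := {
    start_date <- min(Issue_Date)
    in_year <- Issue_Date >= start_date + 365 * (k - 1) &
               Issue_Date <  start_date + 365 * k
    sum(other.drugs[in_year])
  }, by = ID]
}

dt
```

Wrapping the name in parentheses on the left of := tells data.table to use the value of the variable as the column name, and because the sum is computed per ID in j, no separate step is needed to fill NAs afterwards.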
I did something like this to test it, of course the values of comorbid.2 do not make sense:
dt
ID Issue_Date index.date other.drugs yearend.1 comorbid.1
1: 1 00-02-08 2011-02-03 1 01-02-07 4
2: 1 00-04-04 2011-02-03 0 01-02-07 4
3: 1 00-05-30 2011-02-03 1 01-02-07 4
4: 1 00-07-25 2011-02-03 1 01-02-07 4
5: 1 00-08-22 2011-02-03 1 01-02-07 4
6: 2 07-03-23 2009-04-03 1 08-03-22 4
7: 2 07-04-04 2009-04-03 1 08-03-22 4
8: 2 07-04-23 2009-04-03 1 08-03-22 4
9: 2 07-04-23 2009-04-03 0 08-03-22 4
10: 2 07-05-21 2009-04-03 1 08-03-22 4
i <- 1
dt <- dt[, colnames[i] := Issue_Date[1]+years[i], by = ID]
dt <- dt[Issue_Date<get(colnames[i]),
colnames2[i] := sum(other.drugs), by = ID]
dt <- dt[, colnames2[i]:= get(colnames2[i])[!is.na(get(colnames2[i]))][1], by = ID]
dt
ID Issue_Date index.date other.drugs yearend.1 comorbid.1 yearend.2 comorbid.2
1: 1 00-02-08 2011-02-03 1 01-02-07 4 02-02-07 4
2: 1 00-04-04 2011-02-03 0 01-02-07 4 02-02-07 4
3: 1 00-05-30 2011-02-03 1 01-02-07 4 02-02-07 4
4: 1 00-07-25 2011-02-03 1 01-02-07 4 02-02-07 4
5: 1 00-08-22 2011-02-03 1 01-02-07 4 02-02-07 4
6: 2 07-03-23 2009-04-03 1 08-03-22 4 09-03-22 4
7: 2 07-04-04 2009-04-03 1 08-03-22 4 09-03-22 4
8: 2 07-04-23 2009-04-03 1 08-03-22 4 09-03-22 4
9: 2 07-04-23 2009-04-03 0 08-03-22 4 09-03-22 4
10: 2 07-05-21 2009-04-03 1 08-03-22 4 09-03-22 4
Hope it helps.
There are similar questions I've seen, but none of them apply it to specific rows of a data.table or data.frame, rather they apply it to the whole matrix.
Subset a dataframe between 2 dates
How to select some rows with specific date from a data frame in R
I have a dataset with patients who were diagnosed with OA and those who were not:
dt <- data.table(ID = seq(1,10,1), OA = c(1,0,0,1,0,0,0,1,1,0),
oa.date = as.Date(c("01/01/2006", "01/01/2001", "01/01/2001", "02/03/2005","01/01/2001","01/01/2001","01/01/2001","05/06/2010", "01/01/2011", "01/01/2001"), "%d/%m/%Y"),
stop.date = as.Date(c("01/01/2006", "31/12/2007", "31/12/2008", "02/03/2005", "31/12/2011", "31/12/2011", "31/12/2011", "05/06/2010", "01/01/2011", "31/12/2011"), "%d/%m/%Y"))
dt$oa.date[dt$OA==0] <- NA
> dt
ID OA oa.date stop.date
1: 1 1 2006-01-01 2006-01-01
2: 2 0 <NA> 2007-12-31
3: 3 0 <NA> 2008-12-31
4: 4 1 2005-03-02 2005-03-02
5: 5 0 <NA> 2011-12-31
6: 6 0 <NA> 2011-12-31
7: 7 0 <NA> 2011-12-31
8: 8 1 2010-06-05 2010-06-05
9: 9 1 2011-01-01 2011-01-01
10: 10 0 <NA> 2011-12-31
What I want to do is delete those who were diagnosed with OA (OA==1) before start:
start <- as.Date("01/01/2009", "%d/%m/%Y")
So I want my final data to be:
> dt
ID OA oa.date stop.date
1: 2 0 <NA> 2007-12-31
2: 3 0 <NA> 2008-12-31
3: 5 0 <NA> 2011-12-31
4: 6 0 <NA> 2011-12-31
5: 7 0 <NA> 2011-12-31
6: 8 1 2010-06-05 2010-06-05
7: 9 1 2011-01-01 2011-01-01
8: 10 0 <NA> 2011-12-31
My tries are:
dt[dt$OA==1] <- dt[!(oa.date < start)]
I've also tried a loop but to no effect.
Any help is much appreciated.
This should be straightforward:
> dt[!(OA & oa.date < start)]
# ID OA oa.date stop.date
#1: 2 0 <NA> 2007-12-31
#2: 3 0 <NA> 2008-12-31
#3: 5 0 <NA> 2011-12-31
#4: 6 0 <NA> 2011-12-31
#5: 7 0 <NA> 2011-12-31
#6: 8 1 2010-06-05 2010-06-05
#7: 9 1 2011-01-01 2011-01-01
#8: 10 0 <NA> 2011-12-31
The OA column is binary (1/0) which is coerced to logical (TRUE/FALSE) in the i-expression.
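The filter also copes with the NA dates of undiagnosed patients, because FALSE & NA evaluates to FALSE in R. A small base-R sketch with made-up vectors (not the full table) shows the logic:

```r
# Three hypothetical patients: diagnosed early, never diagnosed, diagnosed late
OA      <- c(1, 0, 1)
oa.date <- as.Date(c("2006-01-01", NA, "2010-06-05"))
start   <- as.Date("2009-01-01")

# 1 & TRUE  -> TRUE  (diagnosed before start: drop)
# 0 & NA    -> FALSE (not diagnosed: keep, the NA date does not matter)
# 1 & FALSE -> FALSE (diagnosed on or after start: keep)
keep <- !(OA & oa.date < start)
keep
```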
You can try
dt=dt[dt$OA==0|(dt$OA==1&!(dt$oa.date < start)),]