How to compute a new variable based on the number of days since a particular type of record - r

I'm trying to create a variable that shows the number of days since a particular event occurred. This is a follow up to this previous question, using the same data.
The data looks like this (note dates are in DD/MM/YYYY format):
ID date drug score
A 28/08/2016 2 3
A 29/08/2016 1 4
A 30/08/2016 2 4
A 2/09/2016 2 4
A 3/09/2016 1 4
A 4/09/2016 2 4
B 8/08/2016 1 3
B 9/08/2016 2 4
B 10/08/2016 2 3
B 11/08/2016 1 3
C 30/11/2016 2 4
C 2/12/2016 1 5
C 3/12/2016 2 1
C 5/12/2016 1 4
C 6/12/2016 2 4
C 8/12/2016 1 2
C 9/12/2016 1 2
For 'drug': 1=drug taken, 2=no drug taken.
Each time the value of drug is 1, if that ID has a previous record that is also drug==1, then I need to generate a new value 'lagtime' that shows the number of days (not the number of rows!) since the previous time the drug was taken.
So the output I am looking for is:
ID date drug score lagtime
A 28/08/2016 2 3
A 29/08/2016 1 4
A 30/08/2016 2 4
A 2/09/2016 2 4
A 3/09/2016 1 4 5
A 4/09/2016 2 4
B 8/08/2016 1 3
B 9/08/2016 2 4
B 10/08/2016 2 3
B 11/08/2016 1 3 3
C 30/11/2016 2 4
C 2/12/2016 1 5
C 3/12/2016 2 1
C 5/12/2016 1 4 3
C 6/12/2016 2 4
C 8/12/2016 1 2 3
C 9/12/2016 1 2 1
So I need a way to generate (mutate?) this lagtime score that is calculated as the date for each drug==1 record, minus the date of the previous drug==1 record, grouped by ID.
This has me completely bamboozled.
Here's code for the example data:
data<-data.frame(ID=c("A","A","A","A","A","A","B","B","B","B","C","C","C","C","C","C","C"),
date=as.Date(c("28/08/2016","29/08/2016","30/08/2016","2/09/2016","3/09/2016","4/09/2016","8/08/2016","9/08/2016","10/08/2016","11/08/2016","30/11/2016","2/12/2016","3/12/2016","5/12/2016","6/12/2016","8/12/2016","9/12/2016"),format= "%d/%m/%Y"),
drug=c(2,1,2,2,1,2,1,2,2,1,2,1,2,1,2,1,1),
score=c(3,4,4,4,4,4,3,4,3,3,4,5,1,4,4,2,2))

We can use data.table. Convert the data.frame to a data.table with setDT(data). In i, select only the rows where drug == 1; grouped by ID, take the difference between successive dates with diff(date), prepend NA because diff() returns one fewer element than its input, convert the result to integer, and assign it by reference with := to create lagtime. Rows not selected in i (drug == 2) are left as NA.
library(data.table)
setDT(data)[drug==1, lagtime := as.integer(c(NA, diff(date))), ID]
data
# ID date drug score lagtime
# 1: A 2016-08-28 2 3 NA
# 2: A 2016-08-29 1 4 NA
# 3: A 2016-08-30 2 4 NA
# 4: A 2016-09-02 2 4 NA
# 5: A 2016-09-03 1 4 5
# 6: A 2016-09-04 2 4 NA
# 7: B 2016-08-08 1 3 NA
# 8: B 2016-08-09 2 4 NA
# 9: B 2016-08-10 2 3 NA
#10: B 2016-08-11 1 3 3
#11: C 2016-11-30 2 4 NA
#12: C 2016-12-02 1 5 NA
#13: C 2016-12-03 2 1 NA
#14: C 2016-12-05 1 4 3
#15: C 2016-12-06 2 4 NA
#16: C 2016-12-08 1 2 3
#17: C 2016-12-09 1 2 1
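For readers without data.table, the same idea can be sketched in base R: restrict to the drug == 1 rows, then take per-ID date differences with ave(). This is not part of the original answer; it uses a small subset of the example data and assumes the rows are already sorted by date within each ID.

```r
# Small subset of the example data (ISO dates used for brevity)
data <- data.frame(
  ID   = c("A", "A", "A", "B", "B"),
  date = as.Date(c("2016-08-29", "2016-08-30", "2016-09-03",
                   "2016-08-08", "2016-08-11")),
  drug = c(1, 2, 1, 1, 1)
)

taken <- data$drug == 1          # rows where the drug was taken
data$lagtime <- NA_integer_      # every row starts as NA

# Within each ID, days since the previous drug == 1 record;
# the first such record per ID has no predecessor, hence NA
data$lagtime[taken] <- as.integer(ave(
  as.numeric(data$date[taken]),
  data$ID[taken],
  FUN = function(d) c(NA, diff(d))
))

data
```

Because the differences are computed on the dates themselves, gaps are measured in days rather than rows.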

Related

Is there some way to keep variable names from .SD + .SDcols together with non .SD variable names in data.table?

Given a data.table
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2), y=c(1,3,6), a=1:9, b=9:1)
DT
x v y a b
1: b 1 1 1 9
2: b 1 3 2 8
3: b 1 6 3 7
4: a 2 1 4 6
5: a 2 3 5 5
6: a 1 6 6 4
7: c 1 1 7 3
8: c 2 3 8 2
9: c 2 6 9 1
if one does
DT[, .(a, .SD), .SDcols=x:y]
a .SD.x .SD.v .SD.y
1: 1 b 1 1
2: 2 b 1 3
3: 3 b 1 6
4: 4 a 2 1
5: 5 a 2 3
6: 6 a 1 6
7: 7 c 1 1
8: 8 c 2 3
9: 9 c 2 6
the variables from .SDcols become prefixed by .SD. On the other hand, if one tries, as in https://stackoverflow.com/a/62282856/997979,
DT[, c(.(a), .SD), .SDcols=x:y]
V1 x v y
1: 1 b 1 1
2: 2 b 1 3
3: 3 b 1 6
4: 4 a 2 1
5: 5 a 2 3
6: 6 a 1 6
7: 7 c 1 1
8: 8 c 2 3
9: 9 c 2 6
the other variable name (a) is lost. (This is why I am re-asking the question, which I initially marked as a duplicate of the one linked above.)
Is there some way to keep the names from both .SD variables and non .SD variables?
The goal is simultaneously being able to use .() to select variables without quotes and being able to select variables through .SDcols = patterns("...")
Thanks in advance!
Not really sure why, but wrapping .SD in parentheses works ;-)
DT[, .(a, (.SD)), .SDcols=x:y]
# a x v y
# 1: 1 b 1 1
# 2: 2 b 1 3
# 3: 3 b 1 6
# 4: 4 a 2 1
# 5: 5 a 2 3
# 6: 6 a 1 6
# 7: 7 c 1 1
# 8: 8 c 2 3
# 9: 9 c 2 6
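An alternative sketch that does not depend on the parentheses behaviour is to name the extra column explicitly inside a list and concatenate that list with .SD. This is not from the answer above, and it has the drawback that the column name has to be written twice (a = a).

```r
library(data.table)

DT <- data.table(x = rep(c("b", "a", "c"), each = 3),
                 v = c(1, 1, 1, 2, 2, 1, 1, 2, 2),
                 y = c(1, 3, 6),
                 a = 1:9,
                 b = 9:1)

# c() of a named list and .SD yields a list whose names are
# "a" followed by the .SDcols names, so no name is lost or prefixed
res <- DT[, c(list(a = a), .SD), .SDcols = x:y]
names(res)
```

The same pattern should also work with .SDcols = patterns("..."), since only the way .SD is combined with the other column changes.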

Subsetting panel observations

I have a data.table with firm information.
library(data.table)
DT <- fread("
iso Firm GDP year
A 1 1 1
A 2 1 1
A 3 1 1
A 4 1 1
A 5 3 2
A 6 3 2
A 7 3 2
A 8 3 2
B 9 2 1
B 10 2 1
B 11 2 1
B 12 2 1
B 13 4 1
B 14 4 1
B 15 4 1
B 16 4 1",
header = TRUE)
I want to calculate GDPgrowth (per country) from one year to the next, as (new - old) / old, and add it to the dataset. However, if I do:
DT <- DT[,GDPgrowth :=((GDP- shift(GDP))/shift(GDP)), by=iso]
the outcome is mostly zero, because shift() compares consecutive firm rows, and firms in the same country and year share the same GDP.
How can I make sure it calculates for the whole group of firms belonging to the country together?
Desired output:
library(data.table)
DT <- fread("
iso Firm GDP GDPgrowth year
A 1 1 NA 1
A 2 1 NA 1
A 3 1 NA 1
A 4 1 NA 1
A 5 3 2 2
A 6 3 2 2
A 7 3 2 2
A 8 3 2 2
B 9 2 NA 1
B 10 2 NA 1
B 11 2 NA 1
B 12 2 NA 1
B 13 4 1 1
B 14 4 1 1
B 15 4 1 1
B 16 4 1 1",
header = TRUE)
Here is one way, continuing from your current approach:
library(data.table)
DT[,GDPgrowth :=((GDP- shift(GDP))/shift(GDP)), by=iso]
DT[GDPgrowth == 0, GDPgrowth := NA]
DT[, GDPgrowth:= zoo::na.locf(GDPgrowth, na.rm = FALSE), .(iso, year)]
DT
# iso Firm GDP year GDPgrowth
# 1: A 1 1 1 NA
# 2: A 2 1 1 NA
# 3: A 3 1 1 NA
# 4: A 4 1 1 NA
# 5: A 5 3 2 2
# 6: A 6 3 2 2
# 7: A 7 3 2 2
# 8: A 8 3 2 2
# 9: B 9 2 1 NA
#10: B 10 2 1 NA
#11: B 11 2 1 NA
#12: B 12 2 1 NA
#13: B 13 4 1 1
#14: B 14 4 1 1
#15: B 15 4 1 1
#16: B 16 4 1 1
Using dplyr and tidyr::fill it can be done as
library(dplyr)
DT %>%
group_by(iso) %>%
mutate(GDPgrowth = (GDP - lag(GDP))/lag(GDP),
GDPgrowth = replace(GDPgrowth, GDPgrowth == 0, NA)) %>%
group_by(iso, year) %>%
tidyr::fill(GDPgrowth)
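Both approaches above turn zeros into NA, which would also erase a genuine growth rate of zero. A possible alternative, sketched here under the assumption that each country has exactly one GDP value per year, is to collapse the data to one row per country-year, compute growth there, and join the result back to the firm rows. The small table below is illustrative and is not the asker's data.

```r
library(data.table)

# Illustrative data: two firms per country-year
DT <- data.table(
  iso  = rep(c("A", "B"), each = 4),
  Firm = 1:8,
  GDP  = c(1, 1, 3, 3, 2, 2, 4, 4),
  year = c(1, 1, 2, 2, 1, 1, 2, 2)
)

# One row per country-year, ordered so shift() looks at the previous year
country <- unique(DT[, .(iso, year, GDP)])
setorder(country, iso, year)
country[, GDPgrowth := (GDP - shift(GDP)) / shift(GDP), by = iso]

# Update join: copy the country-level growth onto every firm row
DT[country, GDPgrowth := i.GDPgrowth, on = .(iso, year)]
DT
```

With this layout a year with unchanged GDP would keep a growth of 0 rather than becoming NA.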

how to group consecutive days based on another category in R

I would like to use the following data frame
time <- c("01/01/1951", "02/01/1951", "03/01/1951", "04/01/1951", "03/03/1953", "04/03/1953", "05/03/1953", "06/03/1953", "02/01/1951", "03/01/1951", "04/01/1951", "05/01/1951", "13/03/1953", "14/03/1953", "15/03/1953", "16/03/1953", "01/05/1951", "02/05/1951", "03/05/1951", "04/05/1951", "04/03/1953", "05/03/1953", "06/03/1953", "07/03/1953")
member <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3)
trainall <- data.frame(time, member)
trainall$time = as.Date(trainall$time,format="%d/%m/%Y")
to order it by groups of consecutive days within each member. So if the same days appear for members 1 and 2, I don't want them grouped together as consecutive!
Ultimately I want a new column identifying these groups.
This is what I tried, but it didn't work:
y = sort(trainall$time)
trainall$g = cumsum(c(1, abs(y[-length(y)] - y[-1]) > 1))
this is the outcome I want.
trainall
time member g
1 01/01/1951 1 1
2 02/01/1951 1 1
3 03/01/1951 1 1
4 04/01/1951 1 1
5 03/03/1953 1 2
6 04/03/1953 1 2
7 05/03/1953 1 2
8 06/03/1953 1 2
9 02/01/1951 2 3
10 03/01/1951 2 3
11 04/01/1951 2 3
12 05/01/1951 2 3
13 13/03/1953 2 4
14 14/03/1953 2 4
15 15/03/1953 2 4
16 16/03/1953 2 4
17 01/05/1951 3 5
18 02/05/1951 3 5
19 03/05/1951 3 5
20 04/05/1951 3 5
21 04/03/1953 3 6
22 05/03/1953 3 6
23 06/03/1953 3 6
24 07/03/1953 3 6
Here I built the column manually, but my actual data frame is much larger (16 members).
Does anyone know how to do this easily?
The use of logical values as integers 0 and 1 and your friend diff can do the trick. Something like this should do it, provided that your data is sorted by member and time.
# Your data
time <- c("01/01/1951", "02/01/1951", "03/01/1951", "04/01/1951", "03/03/1953", "04/03/1953", "05/03/1953", "06/03/1953", "02/01/1951", "03/01/1951", "04/01/1951", "05/01/1951", "13/03/1953", "14/03/1953", "15/03/1953", "16/03/1953", "01/05/1951", "02/05/1951", "03/05/1951", "04/05/1951", "04/03/1953", "05/03/1953", "06/03/1953", "07/03/1953")
member <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3)
trainall <- data.frame(time, member)
trainall$time = as.Date(trainall$time,format="%d/%m/%Y")
# Creating column g
trainall$g <- cumsum(c(1, (abs(diff(trainall$time)) + diff(trainall$member))!=1))
print(trainall)
# time member g
#1 1951-01-01 1 1
#2 1951-01-02 1 1
#3 1951-01-03 1 1
#4 1951-01-04 1 1
#5 1953-03-03 1 2
#6 1953-03-04 1 2
#7 1953-03-05 1 2
#8 1953-03-06 1 2
#9 1951-01-02 2 3
#10 1951-01-03 2 3
#11 1951-01-04 2 3
#12 1951-01-05 2 3
#13 1953-03-13 2 4
#14 1953-03-14 2 4
#15 1953-03-15 2 4
#16 1953-03-16 2 4
#17 1951-05-01 3 5
#18 1951-05-02 3 5
#19 1951-05-03 3 5
#20 1951-05-04 3 5
#21 1953-03-04 3 6
#22 1953-03-05 3 6
#23 1953-03-06 3 6
#24 1953-03-07 3 6
Edit: Added abs() around the time difference. I guess the abs() cannot strictly be omitted, as you could have a time difference of -2 days when the member changes, which could cause the sum to be 1.
Edit 2: Re. your extra comment, try
trainall$G <- sequence(table(trainall$g))
Here is one option with .GRP from data.table
library(data.table)
setDT(trainall)[, g := .GRP, .(member, grp = cumsum(c(FALSE, diff(time) != 1)))]
trainall
# time member g
# 1: 1951-01-01 1 1
# 2: 1951-01-02 1 1
# 3: 1951-01-03 1 1
# 4: 1951-01-04 1 1
# 5: 1953-03-03 1 2
# 6: 1953-03-04 1 2
# 7: 1953-03-05 1 2
# 8: 1953-03-06 1 2
# 9: 1951-01-02 2 3
#10: 1951-01-03 2 3
#11: 1951-01-04 2 3
#12: 1951-01-05 2 3
#13: 1953-03-13 2 4
#14: 1953-03-14 2 4
#15: 1953-03-15 2 4
#16: 1953-03-16 2 4
#17: 1951-05-01 3 5
#18: 1951-05-02 3 5
#19: 1951-05-03 3 5
#20: 1951-05-04 3 5
#21: 1953-03-04 3 6
#22: 1953-03-05 3 6
#23: 1953-03-06 3 6
#24: 1953-03-07 3 6
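A base R sketch of the same per-member idea, for comparison: flag the start of each run inside every member with ave(), then take a cumulative sum of the flags. It is not from either answer and assumes the data is sorted by member and time. Starting every member with its own flag also covers the case where a member begins on the same date the previous member ended, where a time difference of 0 plus a member difference of 1 would sum to 1 in the combined diff() approach.

```r
# Illustrative subset: member 1 has two runs, member 2 has one
trainall <- data.frame(
  time   = as.Date(c("1951-01-01", "1951-01-02", "1953-03-03",
                     "1951-01-02", "1951-01-03")),
  member = c(1, 1, 1, 2, 2)
)

# 1 marks the first row of a run: the first row of each member,
# or any row that is not exactly one day after the previous row
new_run <- ave(
  as.numeric(trainall$time),
  trainall$member,
  FUN = function(d) c(1, diff(d) != 1)
)

# Running count of run starts gives the group id
trainall$g <- cumsum(new_run)
trainall
```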

imputing forward / backward

I am trying to impute some longitudinal data in this way (see below). For each individual (id), if first values are NA, I would like to impute using the first observed value for that individual regardless when that occurs. Then, I would like to impute forward based on the last value observed for each individual (see imputed below).
var values might not necessarily increase monotonically. Those values might be a character vector.
I have tried several ways to do this, but still I cannot get a satisfactory solution.
Any ideas?
library(data.table)
id <- c(1,1,1,1,1,1,1,2,2,2,2)
time <- c(1,2,3,4,5,6,7,3,5,7,9)
var <- c(NA,NA,1,NA,2,3,NA,NA,2,3,NA)
imputed <- c(1,1,1,1,2,3,3,2,2,3,3)
dat <- data.table(id, time, var, imputed)
id time var imputed
1: 1 1 NA 1
2: 1 2 NA 1
3: 1 3 1 1
4: 1 4 NA 1
5: 1 5 2 2
6: 1 6 3 3
7: 1 7 NA 3
8: 2 3 NA 2
9: 2 5 2 2
10: 2 7 3 3
11: 2 9 NA 3
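library(zoo)
# Inner na.locf (na.rm = FALSE) carries the last observation forward and keeps leading NAs;
# outer na.locf (fromLast = TRUE) then back-fills those leading NAs with the first observed value.
dat[, newimp := na.locf(na.locf(var, na.rm = FALSE), fromLast = TRUE), by = id]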
dat
# id time var imputed newimp
# 1: 1 1 NA 1 1
# 2: 1 2 NA 1 1
# 3: 1 3 1 1 1
# 4: 1 4 NA 1 1
# 5: 1 5 2 2 2
# 6: 1 6 3 3 3
# 7: 1 7 NA 3 3
# 8: 2 3 NA 2 2
# 9: 2 5 2 2 2
#10: 2 7 3 3 3
#11: 2 9 NA 3 3
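If adding zoo is not an option, the same fill can be sketched in base R. The helper below, fill_both(), is a hypothetical name introduced for illustration: it finds, for each position, the index of the most recent non-NA value, and points any leading NAs at the first observed value. It indexes rather than computes, so it should work for character vectors as well as numeric ones.

```r
# Hypothetical helper: carry forward, then fill leading NAs
# with the first observed value
fill_both <- function(x) {
  # Index of the last non-NA value seen so far (0 while none seen yet)
  idx <- cummax(seq_along(x) * !is.na(x))
  # Leading NAs point at the first observed value instead
  idx[idx == 0] <- which(!is.na(x))[1]
  x[idx]
}

id  <- c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2)
var <- c(NA, NA, 1, NA, 2, 3, NA, NA, 2, 3, NA)

# Apply the helper separately within each id
newimp <- ave(var, id, FUN = fill_both)
newimp

# Character input is handled the same way
fill_both(c(NA, "a", NA, "b", NA))
```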

Repeating sets of rows according to the number of rows by column in R with data.table

In R, I am trying to do the following with a data.table:
Suppose my data looks like:
Class Person ID Index
A 1 3
A 2 3
A 5 3
B 7 2
B 12 2
C 18 1
D 25 2
D 44 2
Here, the class refers to the class a person belongs to. The Person ID variable represents a unique identifier of a person. Finally, the Index tells us how many people are in each class.
From this, I would like to create a new data table as so:
Class Person ID Index
A 1 3
A 2 3
A 5 3
A 1 3
A 2 3
A 5 3
A 1 3
A 2 3
A 5 3
B 7 2
B 12 2
B 7 2
B 12 2
C 18 1
D 25 2
D 44 2
D 25 2
D 44 2
where we repeated each set of persons by class based on the index variable. Hence, we would repeat the class A by 3 times because the index says 3.
So far, my code looks like:
setDT(data)[, list(Class = rep(Person ID, seq_len(.N)), Person ID = sequence(seq_len(.N)), by = Index]
However, I am not getting the correct result and I feel like there is a simpler way to do this. Would anyone have any ideas? Thank you!
If that particular order is important to you, then perhaps something like this should work:
setDT(data)[, list(PersonID, sequence(rep(.N, Index))), by = list(Class, Index)]
# Class Index PersonID V2
# 1: A 3 1 1
# 2: A 3 2 2
# 3: A 3 5 3
# 4: A 3 1 1
# 5: A 3 2 2
# 6: A 3 5 3
# 7: A 3 1 1
# 8: A 3 2 2
# 9: A 3 5 3
# 10: B 2 7 1
# 11: B 2 12 2
# 12: B 2 7 1
# 13: B 2 12 2
# 14: C 1 18 1
# 15: D 2 25 1
# 16: D 2 44 2
# 17: D 2 25 1
# 18: D 2 44 2
If the order is not important, perhaps:
setDT(data)[rep(1:nrow(data), Index)]
Here is a way using dplyr in case you wanted to try
library(dplyr)
data %>%
group_by(Class) %>%
do(data.frame(.[sequence(.$Index[row(.)[,1]]),]))
which gives the output
# Class Person.ID Index
#1 A 1 3
#2 A 2 3
#3 A 5 3
#4 A 1 3
#5 A 2 3
#6 A 5 3
#7 A 1 3
#8 A 2 3
#9 A 5 3
#10 B 7 2
#11 B 12 2
#12 B 7 2
#13 B 12 2
#14 C 18 1
#15 D 25 2
#16 D 44 2
#17 D 25 2
#18 D 44 2
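Another data.table sketch that keeps the requested order is to index .SD within each Class by a repeated row sequence. It is not from the answers above; it assumes the ID column is named PersonID (no space) and that Index is constant within a class, so only its first value is used.

```r
library(data.table)

data <- data.table(
  Class    = c("A", "A", "A", "B", "B", "C", "D", "D"),
  PersonID = c(1, 2, 5, 7, 12, 18, 25, 44),
  Index    = c(3, 3, 3, 2, 2, 1, 2, 2)
)

# Within each Class, repeat the whole block of rows Index[1] times:
# for class A this selects rows 1,2,3,1,2,3,1,2,3 of its subset
res <- data[, .SD[rep(seq_len(.N), times = Index[1])], by = Class]
res
```

Unlike the list(PersonID, sequence(...)) version, this keeps every original column and adds no helper column.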

Resources