A special case of dcast in R

My question seems really simple, yet I am annoyed that I cannot make it work. Let's say I have a simple data frame with one column for group and one variable x. Because my group variable contains a "control" condition, I would like to run a t.test of all my other conditions against the control.
library(data.table) # I am used to the data.table syntax, though I will happily accept a solution in any other dialect
# Generate dummy data
set.seed(1)
df <- data.table(x = rnorm(100), g = sample(LETTERS[1:3], size = 100, replace = TRUE))
setkey(df, g, x) # Order
df # Inspect data
For that purpose, I would like to dcast the control group and add it as a new column. Since all I want is to run a t-test against the whole control group, I do not mind in which order the column's values get included. However, dcast, the function I would normally use to reshape from long to wide format, doesn't seem to work here.
# dcast approach
m <- dcast(df, x ~ g) # This is just... B*#!!it
So here is an approximation of what I'm looking for:
# Kind of what I want
# Isolate control condition
Control <- df[g == "C"]
df[, C := rep(Control$x, 3)] # This warns about a "remainder"; I would prefer to pad the new column with NAs until completion
I also would not mind having all the groups A, B and C as columns.
Thanks in advance for your help

Perhaps this might be what the OP has asked for:
library(data.table)
dcast(df, rowid(g) ~ g, value.var = "x")
g A B C
1: 1 -1.804958629 -1.98935170 -2.21469989
2: 2 -1.470752384 -1.52356680 -0.74327321
3: 3 -1.276592208 -1.37705956 -0.62124058
4: 4 -1.253633400 -1.12936310 -0.61202639
5: 5 -1.224612615 -1.04413463 -0.58952095
6: 6 -0.934097632 -0.83562861 -0.47340064
7: 7 -0.709946431 -0.82046838 -0.41499456
8: 8 -0.707495157 -0.68875569 -0.39428995
9: 9 -0.626453811 -0.47815006 -0.30538839
10: 10 -0.573265414 -0.25336168 -0.13505460
11: 11 -0.568668733 -0.13517862 0.02800216
12: 12 -0.542520031 -0.11234621 0.39810588
13: 13 -0.443291873 -0.05931340 0.41794156
14: 14 -0.367221476 -0.05612874 0.55848643
15: 15 -0.304183924 -0.05380504 0.61982575
16: 16 -0.164523596 -0.01619026 0.69696338
17: 17 -0.155795507 0.07434132 0.82122120
18: 18 -0.102787727 0.15325334 0.88110773
19: 19 -0.044933609 0.34111969 0.94383621
20: 20 -0.039240003 0.36458196 1.12493092
21: 21 0.001105352 0.38767161 1.16040262
22: 22 0.074564983 0.48742905 1.17808700
23: 23 0.183643324 0.56971963 1.46555486
24: 24 0.188792300 0.59390132 1.51178117
25: 25 0.267098791 0.61072635 NA
26: 26 0.291446236 0.76317575 NA
27: 27 0.329507772 1.10002537 NA
28: 28 0.332950371 1.35867955 NA
29: 29 0.370018810 1.43302370 NA
30: 30 0.389843236 1.58683345 NA
31: 31 0.475509529 2.40161776 NA
32: 32 0.556663199 NA NA
33: 33 0.575781352 NA NA
34: 34 0.593946188 NA NA
35: 35 0.689739362 NA NA
36: 36 0.700213650 NA NA
37: 37 0.738324705 NA NA
38: 38 0.768532925 NA NA
39: 39 0.782136301 NA NA
40: 40 0.918977372 NA NA
41: 41 1.063099837 NA NA
42: 42 1.207867806 NA NA
43: 43 1.595280802 NA NA
44: 44 1.980399899 NA NA
45: 45 2.172611670 NA NA
g A B C
This works by artificially introducing an individual row count, rowid(g), for each group.
However, in line with 42-'s comment, I do not understand how this will help to solve the OP's underlying problem.
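If the underlying goal is just those t-tests, no reshape is needed at all: t.test() accepts two samples of different lengths. A minimal sketch under that assumption, with "C" as the control level:
library(data.table)
# Compare each non-control group against the control values directly;
# unequal group sizes are fine, so no NA padding is required.
ctrl <- df[g == "C", x]
df[g != "C", .(p_value = t.test(x, ctrl)$p.value), by = g]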

How to find the max of a column within a particular date range using data.table

I have a dataset with contract number, date and days-past-due (DPD) information. How do I look forward, say, 12 months from each row and identify the maximum DPD for that contract number in that window?
The data set looks like this:
Contract_number Date DPD
1: a 2014-03-01 14
2: a 2014-03-01 5
3: a 2014-10-01 6
4: a 2014-10-01 16
5: a 2015-12-01 4
6: a 2015-12-01 17
7: a 2016-09-01 16
8: a 2016-09-01 15
9: a 2016-10-01 3
10: a 2016-10-01 8
11: b 2014-05-01 18
12: b 2014-05-01 9
13: b 2014-08-01 2
14: b 2014-08-01 14
Code for generating this dataset:
library(data.table)
set.seed(123)
dummy_data <- data.table(
  Contract_number = letters[1:4],
  Date = sample(seq(as.Date('2014/01/01'), as.Date('2016/12/01'), by = "month"), 20),
  DPD = sample.int(20:50, 40, replace = TRUE)
)
dummy_data[order(Contract_number, Date)]
I have a dplyr solution to this; wondering if there is a more concise data.table way to do it?
max_dpd_data <- dummy_data %>%
  left_join(dummy_data, by = "Contract_number") %>%
  filter(Date.y > Date.x & Date.y <= (Date.x + months(12))) %>%
  group_by(Contract_number, Date.x) %>%
  summarise(Max_DPD_12_M = max(DPD.y), N_Mnths_Future = n()) %>%
  rename(Date = "Date.x")
dummy_data1 <- left_join(dummy_data, max_dpd_data, by = c("Contract_number", "Date"))
I also do not want to go the route of using expand.grid to fill in missing months and then using shift.
I figure you are looking for something like this:
> library(data.table)
> set.seed(123)
> dummy_data= data.table(Contract_number = letters[1:4],
+ Date = sample(seq(as.Date('2014/01/01'), as.Date('2016/12/01'), by="month"), 20),
+ DPD=sample.int(20:50, 40, replace = TRUE)
+ )
>
> # You can use setkey() to sort
> setkey(dummy_data, Contract_number, Date)
> dummy_data
Contract_number Date DPD
1: a 2014-05-01 16
2: a 2014-05-01 3
3: a 2014-11-01 18
4: a 2014-11-01 3
5: a 2015-05-01 14
6: a 2015-05-01 16
7: a 2016-07-01 14
8: a 2016-07-01 4
9: a 2016-09-01 6
10: a 2016-09-01 6
11: b 2014-01-01 5
12: b 2014-01-01 16
13: b 2014-02-01 15
14: b 2014-02-01 3
15: b 2015-01-01 3
16: b 2015-01-01 18
17: b 2016-04-01 14
18: b 2016-04-01 9
19: b 2016-10-01 16
20: b 2016-10-01 3
21: c 2014-03-01 1
22: c 2014-03-01 12
23: c 2014-06-01 7
24: c 2014-06-01 18
25: c 2015-02-01 13
26: c 2015-02-01 9
27: c 2015-04-01 11
28: c 2015-04-01 5
29: c 2016-01-01 20
30: c 2016-01-01 1
31: d 2014-12-01 19
32: d 2014-12-01 9
33: d 2015-07-01 10
34: d 2015-07-01 5
35: d 2015-12-01 5
36: d 2015-12-01 8
37: d 2016-02-01 12
38: d 2016-02-01 10
39: d 2016-06-01 20
40: d 2016-06-01 8
Contract_number Date DPD
>
> # Add yearmonth decimal column
> dummy_data[, ym := as.integer(format(Date, "%Y%m"))][
+ , ym := (ym %/% 100) + (ym %% 100) / 12][, ym_less_one := ym - 1][
+ , ym2 := ym]
>
> dummy_data <- dummy_data[
+ dummy_data, on = c("Contract_number", "ym>ym", "ym_less_one<=ym2"),
+ .(Date = first(i.Date), DPD = first(i.DPD), max_DPD = max(DPD)),
+ by =.EACHI][, c("ym", "ym_less_one") := NULL]
>
> print(dummy_data)
Contract_number Date DPD max_DPD
1: a 2014-05-01 16 18
2: a 2014-05-01 3 18
3: a 2014-11-01 18 16
4: a 2014-11-01 3 16
5: a 2015-05-01 14 NA
6: a 2015-05-01 16 NA
7: a 2016-07-01 14 6
8: a 2016-07-01 4 6
9: a 2016-09-01 6 NA
10: a 2016-09-01 6 NA
11: b 2014-01-01 5 18
12: b 2014-01-01 16 18
13: b 2014-02-01 15 18
14: b 2014-02-01 3 18
15: b 2015-01-01 3 NA
16: b 2015-01-01 18 NA
17: b 2016-04-01 14 16
18: b 2016-04-01 9 16
19: b 2016-10-01 16 NA
20: b 2016-10-01 3 NA
21: c 2014-03-01 1 18
22: c 2014-03-01 12 18
23: c 2014-06-01 7 13
24: c 2014-06-01 18 13
25: c 2015-02-01 13 20
26: c 2015-02-01 9 20
27: c 2015-04-01 11 20
28: c 2015-04-01 5 20
29: c 2016-01-01 20 NA
30: c 2016-01-01 1 NA
31: d 2014-12-01 19 10
32: d 2014-12-01 9 10
33: d 2015-07-01 10 20
34: d 2015-07-01 5 20
35: d 2015-12-01 5 20
36: d 2015-12-01 8 20
37: d 2016-02-01 12 20
38: d 2016-02-01 10 20
39: d 2016-06-01 20 NA
40: d 2016-06-01 8 NA
Contract_number Date DPD max_DPD
I am not sure whether or not you want the month of the observation itself included within the 12-month period. Further, there might be some issues with the >= operations due to floating point; a possible fix is to subtract some_factor * .Machine$double.eps from the ym_less_one column.
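For what it's worth, a non-equi self-join avoids the decimal year-month encoding entirely. This is only a hedged sketch of the idea, not this answer's method; it borrows lubridate's months() from the OP's dplyr attempt, and win, start and end are illustrative names:
library(data.table)
library(lubridate) # for months(), as in the OP's dplyr code
# Build one 12-month look-ahead window per row, join each window to the
# same contract's later observations, then aggregate per starting row.
win <- dummy_data[, .(Contract_number, start = Date, end = Date + months(12))]
joined <- dummy_data[win,
                     on = .(Contract_number, Date > start, Date <= end),
                     .(Contract_number, start = i.start, DPD = x.DPD)]
res <- joined[!is.na(DPD),
              .(Max_DPD_12_M = max(DPD), N_Mnths_Future = .N),
              by = .(Contract_number, Date = start)]
The result can then be merged back onto dummy_data by Contract_number and Date, as in the dplyr version.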

Fill NA in a time series only to a limited number

Is there a way to fill NAs in a zoo or xts object forward, but only up to a limited number of NAs? In other words, fill up to 3 consecutive NAs, and keep the NAs from the 4th value on until the next valid number.
Something like this.
library(zoo)
x <- zoo(1:20, Sys.Date() + 1:20)
x[c(2:4, 6:10, 13:18)] <- NA
x
2014-09-20 2014-09-21 2014-09-22 2014-09-23 2014-09-24 2014-09-25 2014-09-26
1 NA NA NA 5 NA NA
2014-09-27 2014-09-28 2014-09-29 2014-09-30 2014-10-01 2014-10-02 2014-10-03
NA NA NA 11 12 NA NA
2014-10-04 2014-10-05 2014-10-06 2014-10-07 2014-10-08 2014-10-09
NA NA NA NA 19 20
Desired output, with variable n = 3, would be something like this:
2014-09-20 2014-09-21 2014-09-22 2014-09-23 2014-09-24 2014-09-25 2014-09-26
1 1 1 1 5 5 5
2014-09-27 2014-09-28 2014-09-29 2014-09-30 2014-10-01 2014-10-02 2014-10-03
5 NA NA 11 12 12 12
2014-10-04 2014-10-05 2014-10-06 2014-10-07 2014-10-08 2014-10-09
12 NA NA NA 19 20
I have tried lots of combinations with na.locf(x, maxgap = 3) etc. without much success. I can create a loop to get the desired output; I was wondering whether there is a vectorized way of achieving this.
fillInTheBlanks <- function(v, n = 3) {
  result <- v
  counter0 <- 1
  for (i in 2:length(v)) {
    value <- v[i]
    if (is.na(value)) {
      if (counter0 > n) {
        result[i] <- v[i]
      } else {
        result[i] <- result[i - 1]
        counter0 <- counter0 + 1
      }
    } else {
      result[i] <- v[i]
      counter0 <- 1
    }
  }
  return(result)
}
Thanks
Here's another way:
# l indexes, for every position, the most recent non-NA value so far
l <- cumsum(!is.na(x))
# within each run, positions beyond n + 1 = 4 are redirected to index 0,
# which the "+ 1" shift maps to the leading NA
c(NA, x[!is.na(x)])[replace(l, ave(l, l, FUN = seq_along) > 4, 0) + 1]
# [1] 1 1 1 1 5 5 5 5 NA NA 11 12 12 12 12 NA NA NA 19 20
edit: my previous answer required that x have no duplicates. The current answer does not.
benchmarks
x <- rep(x, length.out=1e4)
plourde <- function(x) {
  l <- cumsum(!is.na(x))
  c(NA, x[!is.na(x)])[replace(l, ave(l, l, FUN = seq_along) > 4, 0) + 1]
}
agstudy <- function(x) {
  unlist(sapply(split(coredata(x), cumsum(!is.na(x))),
                function(sx) {
                  if (length(sx) > 3)
                    sx[2:4] <- rep(sx[1], 3)
                  else
                    sx <- rep(sx[1], length(sx))
                  sx
                }))
}
library(microbenchmark)
microbenchmark(plourde(x), agstudy(x))
# Unit: milliseconds
# expr min lq median uq max neval
# plourde(x) 5.30 5.591 6.409 6.774 57.13 100
# agstudy(x) 16.04 16.249 16.454 17.516 20.64 100
And another idea that, unless I've missed something, seems valid:
na_locf_until <- function(x, n = 3) {
  wnn <- which(!is.na(x))  # positions of the non-NA values
  # break points: every non-NA position, plus a cut-off n + 1 steps after it
  # whenever that cut-off falls before the next non-NA position
  inds <- sort(c(wnn, (wnn + n + 1)[which((wnn + n + 1) < c(wnn[-1], length(x)))]))
  # repeat the value at each break point up to the next one (NA at cut-offs)
  c(rep(NA, wnn[1] - 1),
    as.vector(x)[rep(inds, c(diff(inds), length(x) - inds[length(inds)] + 1))])
}
na_locf_until(x)
#[1] 1 1 1 1 5 5 5 5 NA NA 11 12 12 12 12 NA NA NA 19 20
Without using na.locf: the idea is to split your series into groups, each starting at a non-missing value, and within each group replace only the first 3 values after the non-missing one with that value. It is a loop, but since it is applied per group rather than per element, it should be faster than a simple loop over all the values.
zz <-
  unlist(sapply(split(coredata(x), cumsum(!is.na(x))),
                function(sx) {
                  if (length(sx) > 3)
                    sx[2:4] <- rep(sx[1], 3)
                  else
                    sx <- rep(sx[1], length(sx))
                  sx
                }))
## recreate the zoo object, since the algorithm above is applied to the values only
zoo(zz, index(x))
2014-10-03 2014-10-04 2014-10-05 2014-10-06 2014-10-07 2014-10-08 2014-10-09
12 12 NA NA NA 19 20
The cleanest way to implement this in data.table is probably using the join syntax:
na.omit(dt)[dt, on = .(date), roll = +3, .(date, x_filled = x, x = i.x)]
date x_filled x
1: 2019-02-14 1 1
2: 2019-02-15 1 NA
3: 2019-02-16 1 NA
4: 2019-02-17 1 NA
5: 2019-02-18 5 5
6: 2019-02-19 5 NA
7: 2019-02-20 5 NA
8: 2019-02-21 5 NA
9: 2019-02-22 NA NA
10: 2019-02-23 NA NA
11: 2019-02-24 11 11
12: 2019-02-25 12 12
13: 2019-02-26 12 NA
14: 2019-02-27 12 NA
15: 2019-02-28 12 NA
16: 2019-03-01 NA NA
17: 2019-03-02 NA NA
18: 2019-03-03 NA NA
19: 2019-03-04 19 19
20: 2019-03-05 20 20
*This solution depends on there being a date column with contiguous daily values, since roll = +3 is measured in date units.
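If the dates are not contiguous, a hedged variant of the same join uses a helper row index (rn below is a hypothetical column, not part of the original answer), so that the roll limit counts rows rather than days:
dt[, rn := .I]
# non-NA rows keep their original row numbers, so rolling on rn carries a
# value forward across at most 3 rows
na.omit(dt)[dt, on = .(rn), roll = 3, .(date = i.date, x_filled = x, x = i.x)]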
From playing around in data.table comes this hacky solution:
np1 <- 3 + 1
dt[, x_filled := x[c(rep(1, min(np1, .N)), rep(NA, max(0, .N - np1)))],
   by = cumsum(!is.na(x))]
# Or slightly simplified:
dt[, x_filled := ifelse(rowid(x) < 4, x[1], x[NA]),
   by = cumsum(!is.na(x))]
> dt
date x x_filled
1: 2019-02-14 1 1
2: 2019-02-15 NA 1
3: 2019-02-16 NA 1
4: 2019-02-17 NA 1
5: 2019-02-18 5 5
6: 2019-02-19 NA 5
7: 2019-02-20 NA 5
8: 2019-02-21 NA 5
9: 2019-02-22 NA NA
10: 2019-02-23 NA NA
11: 2019-02-24 11 11
12: 2019-02-25 12 12
13: 2019-02-26 NA 12
14: 2019-02-27 NA 12
15: 2019-02-28 NA 12
16: 2019-03-01 NA NA
17: 2019-03-02 NA NA
18: 2019-03-03 NA NA
19: 2019-03-04 19 19
20: 2019-03-05 20 20
We build on the fact that subsetting a vector with NA returns NA.
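A quick illustration of that indexing fact (a minimal sketch, values worked out by hand):
v <- c(10, 20, 30)
v[c(1, NA, NA)]  # returns 10 NA NA: NA indices yield NA elements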
Data/Packages
library(zoo)
library(data.table)
x <- zoo(1:20, Sys.Date() + 1:20)
x[c(2:4, 6:10, 13:18)] <- NA
dt <- data.table(date = index(x), x = as.integer(x))

Retain and lag function in R as SAS

I am looking for functions in R similar to the lag1, lag2 and retain functions in SAS that I can use with data.tables.
I know there are functions like embed and lag in R, but they don't return a single value or just the previous value; they return a complete set of vectors.
Is there anything in R which I can use with data.table?
More info on the SAS functions :
Retain
Lag
You have to be aware that R works very differently from the DATA step in SAS. The lag function in SAS is used in the data step, within its implicit loop structure. The same goes for the retain statement, which simply keeps a value constant while looping through the data.
R on the other hand works completely vectorized. This means that you have to rethink what you want to do, and adapt accordingly.
retain is simply unnecessary in R, as R recycles arguments by default. If you want to do this explicitly, you might look at e.g. rep() to construct a vector with constant values and a certain length.
lag is a matter of using indices, shifting the position of all values in a vector. In order to keep a vector of the same length, you need to add some NAs at one end and drop as many values from the other.
A simple example: This SAS code lags a variable x and adds a variable year that has a constant value:
data one;
  retain year 2013;
  input x ##;
  y = lag1(x);
  z = lag2(x);
  datalines;
1 2 3 4 5 6
;
In R, you could write your own lag function like this:
mylag <- function(x,k) c(rep(NA,k),head(x,-k))
This single line adds k NAs at the beginning of the vector and drops the last k values from it. The result is a lagged vector as given by lag1 etc. in SAS.
This allows something like:
nrs <- 1:6 # equivalent to datalines
one <- data.frame(
  x = nrs,
  y = mylag(nrs, 1),
  z = mylag(nrs, 2),
  year = 2013 # R automatically recycles, so no extra command needed
)
The result is:
> one
x y z year
1 1 NA NA 2013
2 2 1 NA 2013
3 3 2 1 2013
4 4 3 2 2013
5 5 4 3 2013
6 6 5 4 2013
Exactly the same would work with a data.table object. The important note here is to rethink your strategy: instead of thinking loop-wise as you do with the DATA step in SAS, you have to start thinking in terms of vectors and indices when using R.
I would say the closest equivalent to retain, lag1, and lag2 would be the Lag function in the quantmod package.
It's very easy to use with data.tables. E.g.:
library(data.table)
library(quantmod)
d <- data.table(v1=c(rep('a', 10), rep('b', 10)), v2=1:20)
setkeyv(d, 'v1')
d[,new_var := Lag(v2, 1), by='v1']
d[,new_var2 := v2-Lag(v2, 3), by='v1']
d[,new_var3 := Next(v2, 2), by='v1']
This yields the following:
print(d)
v1 v2 new_var new_var2 new_var3
1: a 1 NA NA 3
2: a 2 1 NA 4
3: a 3 2 NA 5
4: a 4 3 3 6
5: a 5 4 3 7
6: a 6 5 3 8
7: a 7 6 3 9
8: a 8 7 3 10
9: a 9 8 3 NA
10: a 10 9 3 NA
11: b 11 NA NA 13
12: b 12 11 NA 14
13: b 13 12 NA 15
14: b 14 13 3 16
15: b 15 14 3 17
16: b 16 15 3 18
17: b 17 16 3 19
18: b 18 17 3 20
19: b 19 18 3 NA
20: b 20 19 3 NA
As you can see, Lag lets you look back and Next lets you look forward. Both functions are nice because they pad the result with NAs such that it has the same length as the input.
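As an aside (not part of the original answer): newer data.table versions also ship a native shift() function that covers the same ground without quantmod. A minimal sketch:
library(data.table)
d <- data.table(v1 = c(rep('a', 10), rep('b', 10)), v2 = 1:20)
# type = "lag" looks backwards (like Lag); type = "lead" looks forwards
# (like Next); both pad with NA so the length is unchanged
d[, new_var := shift(v2, 1, type = "lag"), by = v1]
d[, new_var3 := shift(v2, 2, type = "lead"), by = v1]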
If you want to get even fancier, and higher performance, you can look into rolling joins with data.table objects. This is a little bit different than what you are asking for, but it is conceptually related, and so powerful and awesome that I have to share it.
Start with a data.table:
library(data.table)
library(quantmod)
set.seed(42)
d1 <- data.table(
  id = c(rep('a', 10), rep('b', 10)),
  time = rep(1:10, 2),
  value = runif(20))
setkeyv(d1, c('id', 'time'))
print(d1)
id time value
1: a 1 0.9148060
2: a 2 0.9370754
3: a 3 0.2861395
4: a 4 0.8304476
5: a 5 0.6417455
6: a 6 0.5190959
7: a 7 0.7365883
8: a 8 0.1346666
9: a 9 0.6569923
10: a 10 0.7050648
11: b 1 0.4577418
12: b 2 0.7191123
13: b 3 0.9346722
14: b 4 0.2554288
15: b 5 0.4622928
16: b 6 0.9400145
17: b 7 0.9782264
18: b 8 0.1174874
19: b 9 0.4749971
20: b 10 0.5603327
You have another data.table you want to join, but not all time indexes are present in the second table:
d2 <- data.table(
  id = sample(c('a', 'b'), 5, replace = TRUE),
  time = sample(1:10, 5),
  value2 = runif(5))
setkeyv(d2, c('id', 'time'))
print(d2)
id time value2
1: a 4 0.811055141
2: a 10 0.003948339
3: b 6 0.737595618
4: b 8 0.388108283
5: b 9 0.685169729
A regular merge yields lots of missing values:
d2[d1,,roll=FALSE]
id time value2 value
1: a 1 NA 0.9148060
2: a 2 NA 0.9370754
3: a 3 NA 0.2861395
4: a 4 0.811055141 0.8304476
5: a 5 NA 0.6417455
6: a 6 NA 0.5190959
7: a 7 NA 0.7365883
8: a 8 NA 0.1346666
9: a 9 NA 0.6569923
10: a 10 0.003948339 0.7050648
11: b 1 NA 0.4577418
12: b 2 NA 0.7191123
13: b 3 NA 0.9346722
14: b 4 NA 0.2554288
15: b 5 NA 0.4622928
16: b 6 0.737595618 0.9400145
17: b 7 NA 0.9782264
18: b 8 0.388108283 0.1174874
19: b 9 0.685169729 0.4749971
20: b 10 NA 0.5603327
However, data.table allows you to roll the secondary index forward, WITHIN THE PRIMARY INDEX!
d2[d1,,roll=TRUE]
id time value2 value
1: a 1 NA 0.9148060
2: a 2 NA 0.9370754
3: a 3 NA 0.2861395
4: a 4 0.811055141 0.8304476
5: a 5 0.811055141 0.6417455
6: a 6 0.811055141 0.5190959
7: a 7 0.811055141 0.7365883
8: a 8 0.811055141 0.1346666
9: a 9 0.811055141 0.6569923
10: a 10 0.003948339 0.7050648
11: b 1 NA 0.4577418
12: b 2 NA 0.7191123
13: b 3 NA 0.9346722
14: b 4 NA 0.2554288
15: b 5 NA 0.4622928
16: b 6 0.737595618 0.9400145
17: b 7 0.737595618 0.9782264
18: b 8 0.388108283 0.1174874
19: b 9 0.685169729 0.4749971
20: b 10 0.685169729 0.5603327
This is pretty damn cool: old observations are rolled forward in time until they are replaced by new ones. If you want to replace the NA values at the beginning of the series, you can do so by rolling the first observation backwards:
d2[d1,,roll=TRUE, rollends=c(TRUE, TRUE)]
id time value2 value
1: a 1 0.811055141 0.9148060
2: a 2 0.811055141 0.9370754
3: a 3 0.811055141 0.2861395
4: a 4 0.811055141 0.8304476
5: a 5 0.811055141 0.6417455
6: a 6 0.811055141 0.5190959
7: a 7 0.811055141 0.7365883
8: a 8 0.811055141 0.1346666
9: a 9 0.811055141 0.6569923
10: a 10 0.003948339 0.7050648
11: b 1 0.737595618 0.4577418
12: b 2 0.737595618 0.7191123
13: b 3 0.737595618 0.9346722
14: b 4 0.737595618 0.2554288
15: b 5 0.737595618 0.4622928
16: b 6 0.737595618 0.9400145
17: b 7 0.737595618 0.9782264
18: b 8 0.388108283 0.1174874
19: b 9 0.685169729 0.4749971
20: b 10 0.685169729 0.5603327
These rolling joins are absolutely incredible, and I've never seen them implemented in any other open source package (see ?data.table for more info). It will take a little while to turn off your "SAS brain" and turn on your "R brain", but once you get over that initial hump you'll find that the language is much more expressive.
For retain, try this:
retain <- function(x, event, outside = NA) {
  # break points: start of the vector, plus every position where the event fires
  indices <- c(1, which(event), length(x) + 1)
  # value carried over each stretch ('outside' before the first event)
  values <- c(outside, x[event])
  rep(values, diff(indices))
}
With example data: I want to retain the value downward where w == "b".
df <- data.frame(w = c("a","b","c","a","b","c"), x = 1:6, y = c(1,1,2,2,2,3), stringsAsFactors = FALSE)
df$z <- retain(df$x - df$y, df$w == "b")
df
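For reference, worked out by hand from the definitions above (x - y is 0, 1, 1, 2, 3, 3, and w == "b" at rows 2 and 5), df should print as:
  w x y  z
1 a 1 1 NA
2 b 2 1  1
3 c 3 2  1
4 a 4 2  1
5 b 5 2  3
6 c 6 3  3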
And here is the converse, obtain, which does not exist in SAS:
obtain <- function(x, event, outside = NA) {
  indices <- c(0, which(event), length(x))
  # value looked up in advance over each stretch ('outside' after the last event)
  values <- c(x[event], outside)
  rep(values, diff(indices))
}
Here's an example. I want to obtain the value in advance where w == "b":
df$z2 <- obtain(df$x - df$y, df$w == "b")
df
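Again worked out by hand: z2 becomes 1, 1, 3, 3, 3, NA (each row takes the next w == "b" value of x - y, with NA after the last event).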
Thanks to Julien for helping.
Here's an example: a cumulative sum with sqldf. (The input table is reconstructed from the output below: ids 1 to 10 with SomeNumt 11 to 20.)
library(sqldf)
w_cum <- data.frame(id = 1:10, SomeNumt = 11:20)
w_cum <- sqldf("select t1.id, t1.SomeNumt, SUM(t2.SomeNumt) as cum_sum
                from w_cum t1
                inner join w_cum t2 on t1.id >= t2.id
                group by t1.id, t1.SomeNumt
                order by t1.id")
w_cum
id SomeNumt cum_sum
1 11 11
2 12 23
3 13 36
4 14 50
5 15 65
6 16 81
7 17 98
8 18 116
9 19 135
10 20 155
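For comparison, base R does the same cumulative sum in one vectorized call:
w_cum$cum_sum <- cumsum(w_cum$SomeNumt)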

Rolling joins: roll forwards and backwards

data.table is awesome, because I can do rolling joins, and even do rolling joins within groups!
library(data.table)
set.seed(42)
metrics <- data.frame(
  ID = c(rep(1, 10), rep(2, 5), rep(3, 5)),
  Time = c(1:10, 4:8, 8:12),
  val1 = runif(20),
  val2 = runif(20),
  val3 = runif(20),
  val4 = runif(20)
)
metrics <- data.table(metrics[sample(1:nrow(metrics), 15), ], key = c('ID', 'Time'))
calendar <- data.table(expand.grid(ID = 1:3, Time = 1:12), key = c('ID', 'Time'))
metrics[calendar, roll = TRUE]
However, this isn't awesome enough for me. This data.table still has NAs:
> metrics[calendar,roll=TRUE]
ID Time val1 val2 val3 val4
1: 1 1 0.9148060 0.9040314 0.3795592 0.675607275
2: 1 2 0.9370754 0.1387102 0.4357716 0.982817198
3: 1 3 0.9370754 0.1387102 0.4357716 0.982817198
4: 1 4 0.8304476 0.9466682 0.9735399 0.566488424
5: 1 5 0.8304476 0.9466682 0.9735399 0.566488424
6: 1 6 0.5190959 0.5142118 0.9575766 0.189473935
7: 1 7 0.7365883 0.3902035 0.8877549 0.271286615
8: 1 8 0.7365883 0.3902035 0.8877549 0.271286615
9: 1 9 0.6569923 0.4469696 0.9709666 0.693204820
10: 1 10 0.7050648 0.8360043 0.6188382 0.240544740
11: 1 11 0.7050648 0.8360043 0.6188382 0.240544740
12: 1 12 0.7050648 0.8360043 0.6188382 0.240544740
13: 2 1 NA NA NA NA
14: 2 2 NA NA NA NA
15: 2 3 NA NA NA NA
16: 2 4 0.4577418 0.7375956 0.3334272 0.042988796
17: 2 5 0.7191123 0.8110551 0.3467482 0.140479094
18: 2 6 0.9346722 0.3881083 0.3984854 0.216385415
19: 2 7 0.2554288 0.6851697 0.7846928 0.479398564
20: 2 8 0.2554288 0.6851697 0.7846928 0.479398564
21: 2 9 0.2554288 0.6851697 0.7846928 0.479398564
22: 2 10 0.2554288 0.6851697 0.7846928 0.479398564
23: 2 11 0.2554288 0.6851697 0.7846928 0.479398564
24: 2 12 0.2554288 0.6851697 0.7846928 0.479398564
25: 3 1 NA NA NA NA
26: 3 2 NA NA NA NA
27: 3 3 NA NA NA NA
28: 3 4 NA NA NA NA
29: 3 5 NA NA NA NA
30: 3 6 NA NA NA NA
31: 3 7 NA NA NA NA
32: 3 8 0.9400145 0.8329161 0.7487954 0.719355838
33: 3 9 0.9400145 0.8329161 0.7487954 0.719355838
34: 3 10 0.1174874 0.2076590 0.1712643 0.375489965
35: 3 11 0.4749971 0.9066014 0.2610880 0.514407708
36: 3 12 0.5603327 0.6117786 0.5144129 0.001570554
ID Time val1 val2 val3 val4
I could fill these NAs using zoo::na.locf with fromLast = TRUE, but that's not very fun. Can anyone think of an elegant way to roll NAs backward (after rolling them forward) during the data.table join?
This is possible in data.table version 1.8.8, released March 2013:
metrics[calendar, roll=TRUE, rollends=c(TRUE, TRUE)]
From the data.table NEWS file:
In addition to TRUE/FALSE, 'roll' may now be a positive number (roll forwards/LOCF) or
negative number (roll backwards/NOCB). A finite number limits the distance a value is
rolled (limited staleness). roll=TRUE and roll=+Inf are equivalent.
'rollends' is a new parameter holding two logicals. The first observation is rolled
backwards if the first value of rollends is TRUE. The last observation is rolled forwards if the second value of rollends
is TRUE. If roll is a finite number, the same limit applies to the ends.
New value roll='nearest' joins to the nearest value (either backwards or forwards) when
the value falls in a gap, and to the end value according to 'rollends'.
'rolltolast' has been deprecated. For backwards compatibility it is converted to
{roll=TRUE;rollends=c(FALSE,FALSE)}.
As always, to download the most up-to-date version of data.table, see Installation.
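As a hedged aside illustrating the other options quoted from the NEWS file (not part of the original answer):
metrics[calendar, roll = 2]          # limited staleness: carry forward at most 2 Time units
metrics[calendar, roll = -Inf]       # roll backwards (NOCB) instead of forwards
metrics[calendar, roll = "nearest"]  # join gaps to the nearest Time within each ID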
