How to add rows and extrapolate the data by multiple variables? - r

I'm trying to add missing lines for "day" and extrapolate the data for "value". In my data each subject ("id") has 2 periods (period 1 and period 2) and values for consecutive days.
An example of my data looks like this:
df <- data.frame(
id = c(1,1,1,1, 1,1,1,1, 2,2,2,2, 2,2,2,2, 3,3,3,3, 3,3,3,3),
period = c(1,1,1,1, 2,2,2,2, 1,1,1,1, 2,2,2,2, 1,1,1,1, 2,2,2,2),
day= c(1,2,4,5, 1,3,4,5, 2,3,4,5, 1,2,3,5, 2,3,4,5, 1,2,3,4),
value =c(10,12,15,16, 11,14,15,17, 13,14,15,16, 15,16,18,20, 16,17,19,29, 14,16,18,20))
For each id and period I am missing data for days 3,2,1,4,1,5, respectively. I want to expand the data to let's say 10 days and extrapolate the data on value column (e.g. with linear regression).
My final df should be something like that:
df2 <- data.frame(
id = c(1,1,1,1,1,1,1, 1,1,1,1,1,1,1, 2,2,2,2,2,2,2, 2,2,2,2,2,2,2, 3,3,3,3,3,3,3, 3,3,3,3,3,3,3),
period = c(1,1,1,1,1,1,1, 2,2,2,2,2,2,2, 1,1,1,1,1,1,1, 2,2,2,2,2,2,2, 1,1,1,1,1,1,1, 2,2,2,2,2,2,2),
day= c(1,2,3,4,5,6,7, 1,2,3,4,5,6,7, 1,2,3,4,5,6,7, 1,2,3,4,5,6,7, 1,2,3,4,5,6,7, 1,2,3,4,5,6,7),
value =c(10,12,13,15,16,17,18, 11,12,14,15,17,18,19, 12,13,14,15,16,18,22, 15,16,18,19,20,22,23, 15,16,17,19,29,39,49, 14,16,18,20,22,24,26))
The most similar example I found doesn't extrapolate by two variables (ID and period in my case), it extrapolates only by year. I tried to adapt the code but no success :(
Another example extrapolates the data by multiple id but doesn't add rows for missing data.
I couldn't combine both codes with my limited experience in R. Any suggestions?
Thanks in advance...

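We can use complete from tidyr to add the missing days within each id and period, then fill the resulting NA values with na.interp from forecast: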
library(dplyr)
library(tidyr)
library(forecast)
df %>%
group_by(id, period) %>%
complete(day =1:7)%>%
mutate(value = as.numeric(na.interp(value)))

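@akrun's answer is good, as long as you don't mind using linear interpolation. However, if you do want to use a linear model, you could try this data.table approach, which fits a single model on day, period and id across all the data and uses it to predict the missing values: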
library(data.table)
model <- lm(value ~ day + period + id,data=df)
dt <- as.data.table(df)[,.SD[,.(day = 1:7,value = value[match(1:7,day)])],by=.(id,period)]
dt[is.na(value), value := predict(model,.SD),]
dt
id period day value
1: 1 1 1 10.00000
2: 1 1 2 12.00000
3: 1 1 3 12.86714
4: 1 1 4 15.00000
5: 1 1 5 16.00000
6: 1 1 6 18.13725
7: 1 1 7 19.89396
8: 1 2 1 11.00000
9: 1 2 2 12.15545
10: 1 2 3 14.00000
11: 1 2 4 15.00000
12: 1 2 5 17.00000
13: 1 2 6 19.18227
14: 1 2 7 20.93898
15: 2 1 1 11.90102
16: 2 1 2 13.00000
17: 2 1 3 14.00000
18: 2 1 4 15.00000
19: 2 1 5 16.00000
20: 2 1 6 20.68455
21: 2 1 7 22.44125
22: 2 2 1 15.00000
23: 2 2 2 16.00000
24: 2 2 3 18.00000
25: 2 2 4 18.21616
26: 2 2 5 20.00000
27: 2 2 6 21.72957
28: 2 2 7 23.48627
29: 3 1 1 14.44831
30: 3 1 2 16.00000
31: 3 1 3 17.00000
32: 3 1 4 19.00000
33: 3 1 5 29.00000
34: 3 1 6 23.23184
35: 3 1 7 24.98855
36: 3 2 1 14.00000
37: 3 2 2 16.00000
38: 3 2 3 18.00000
39: 3 2 4 20.00000
40: 3 2 5 22.52016
41: 3 2 6 24.27686
42: 3 2 7 26.03357
id period day value
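If a separate regression per id and period is wanted instead of one global model, the same idea can be sketched in base R. This is only an illustration under that assumption: fill_group is a hypothetical helper name, the 1:7 day range is taken from the desired output in the question, and observed values are kept while only the missing days are predicted.

```r
# Data from the question
df <- data.frame(
  id     = c(1,1,1,1, 1,1,1,1, 2,2,2,2, 2,2,2,2, 3,3,3,3, 3,3,3,3),
  period = c(1,1,1,1, 2,2,2,2, 1,1,1,1, 2,2,2,2, 1,1,1,1, 2,2,2,2),
  day    = c(1,2,4,5, 1,3,4,5, 2,3,4,5, 1,2,3,5, 2,3,4,5, 1,2,3,4),
  value  = c(10,12,15,16, 11,14,15,17, 13,14,15,16, 15,16,18,20,
             16,17,19,29, 14,16,18,20))

# Hypothetical helper: expand one (id, period) group to `days` and
# predict the missing values from a linear fit on that group only
fill_group <- function(g, days = 1:7) {
  fit <- lm(value ~ day, data = g)
  out <- data.frame(id = g$id[1], period = g$period[1], day = days)
  out$value <- g$value[match(days, g$day)]   # keep observed values
  miss <- is.na(out$value)
  out$value[miss] <- predict(fit, newdata = out[miss, , drop = FALSE])
  out
}

groups <- split(df, list(df$id, df$period), drop = TRUE)
res <- do.call(rbind, lapply(groups, fill_group))
res <- res[order(res$id, res$period, res$day), ]
```

Because each group is fitted on only four points, a single outlier (such as the value 29 for id 3, period 1) pulls the extrapolated days strongly, so the result is worth checking against the global model.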

Related

How to calculate moving average from previous rows in data.table?

I have data like this:
library(data.table)
set.seed(1)
df <- data.table(store = sample(LETTERS[1:2],size = 10,replace = T),
week = sample(1:10),
demand = round(sample(rnorm(10,mean = 20,sd=2)),2))
random_na_index <- sample(1:nrow(df),3)
df[random_na_index,demand := NA]
setorder(df,store,week)
store week demand
1: A 3 19.18
2: A 5 NA
3: A 6 NA
4: A 8 19.55
5: A 9 20.50
6: A 10 NA
7: B 1 20.75
8: B 2 17.70
9: B 4 19.40
10: B 7 17.52
I need to calculate a moving average using the 2 weeks before the current week. I couldn't do it because zoo's rolling functions and data.table's frollmean also include the current row when calculating the moving average. I also don't know how to handle NAs while applying a rolling function.
The desired output should look like this:
store week demand desired_column
1: A 3 19.18 NA
2: A 5 NA 19.180
3: A 6 NA 19.180
4: A 8 19.55 NA
5: A 9 20.50 19.550
6: A 10 NA 20.025
7: B 1 20.75 NA
8: B 2 17.70 20.750
9: B 4 19.40 19.225
10: B 7 17.52 18.550
You could shift the values before applying frollmean with na.rm = TRUE argument:
df[order(store,week),desired:=frollmean(shift(demand),n=2,na.rm=T),by=.(store)][]
store week demand desired
<char> <int> <num> <num>
1: A 3 19.18 NA
2: A 5 NA 19.180
3: A 6 NA 19.180
4: A 8 19.55 NaN
5: A 9 20.50 19.550
6: A 10 NA 20.025
7: B 1 20.75 NA
8: B 2 17.70 20.750
9: B 4 19.40 19.225
10: B 7 17.52 18.550
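To see what the shift-then-roll step does for one store, here is a base R sketch of the same logic. prev_mean is a hypothetical helper written for illustration, not part of data.table; like the answer above, it treats the previous 2 rows as the "2 weeks before", not calendar weeks.

```r
# Hypothetical helper: mean of the n values before each position,
# ignoring NAs (mirrors frollmean(shift(x), n, na.rm = TRUE))
prev_mean <- function(x, n = 2) {
  lagged <- c(NA, head(x, -1))          # drop the current row from the window
  vapply(seq_along(lagged), function(i) {
    if (i < n) return(NA_real_)         # incomplete window
    mean(lagged[(i - n + 1):i], na.rm = TRUE)
  }, numeric(1))
}

# Demand for store A, taken from the example data
demand_a <- c(19.18, NA, NA, 19.55, 20.50, NA)
prev_mean(demand_a)
```

A window that contains only NAs yields NaN rather than NA, which is why row 4 of the output above shows NaN.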

creating a unique variable based on row differences of another variable considering groups

By using the data below, I want to create a new unique customer id by considering their contact date.
Rule: After every two days, I want each customer to get a new unique customer id and preserve it on the following record if the following contact date for the same customer is within the following two days if not assign a new id to this same customer.
I couldn't go any further than calculating date differences.
The original dataset I work with is bigger; therefore, I prefer a data.table solution if possible.
library(data.table)
treshold <- 2
dt <- structure(list(customer_id = c('10','20','20','20','20','20','30','30','30','30','30','40','50','50'),
contact_date = as.Date(c("2019-01-05","2019-01-01","2019-01-01","2019-01-02",
"2019-01-08","2019-01-09","2019-02-02","2019-02-05",
"2019-02-05","2019-02-09","2019-02-12","2019-02-01",
"2019-02-01","2019-02-05")),
desired_output = c(1,2,2,2,3,3,4,5,5,6,7,8,9,10)),
class = "data.frame",
row.names = 1:14)
setDT(dt)
setorder(dt, customer_id, contact_date)
dt[, date_diff_in_days:=contact_date - shift(contact_date, type = c("lag")), by=customer_id]
dt[, date_diff_in_days:=as.numeric(date_diff_in_days)]
dt
customer_id contact_date desired_output date_diff_in_days
1: 10 2019-01-05 1 NA
2: 20 2019-01-01 2 NA
3: 20 2019-01-01 2 0
4: 20 2019-01-02 2 1
5: 20 2019-01-08 3 6
6: 20 2019-01-09 3 1
7: 30 2019-02-02 4 NA
8: 30 2019-02-05 5 3
9: 30 2019-02-05 5 0
10: 30 2019-02-09 6 4
11: 30 2019-02-12 7 3
12: 40 2019-02-01 8 NA
13: 50 2019-02-01 9 NA
14: 50 2019-02-05 10 4
Rule: After every two days, I want each customer to get a new unique customer id and preserve it on the following record if the following contact date for the same customer is within the following two days if not assign a new id to this same customer.
When creating a new ID, if you set up the by= vectors correctly to capture the rule, the auto-counter .GRP can be used:
thresh <- 2
dt[, g := .GRP, by=.(
customer_id,
cumsum(contact_date - shift(contact_date, fill=first(contact_date)) > thresh)
)]
dt[, any(g != desired_output)]
# [1] FALSE
I think the code above is correct since it works on the example, but you might want to check on your actual data (comparing against results from, eg, Gregor's approach) to be sure.
We use cumsum to increment whenever date_diff_in_days is NA or when the threshold is exceeded.
dt[, result := cumsum(is.na(date_diff_in_days) | date_diff_in_days > treshold)]
# customer_id contact_date desired_output date_diff_in_days result
# 1: 10 2019-01-05 1 NA 1
# 2: 20 2019-01-01 2 NA 2
# 3: 20 2019-01-01 2 0 2
# 4: 20 2019-01-02 2 1 2
# 5: 20 2019-01-08 3 6 3
# 6: 20 2019-01-09 3 1 3
# 7: 30 2019-02-02 4 NA 4
# 8: 30 2019-02-05 5 3 5
# 9: 30 2019-02-05 5 0 5
# 10: 30 2019-02-09 6 4 6
# 11: 30 2019-02-12 7 3 7
# 12: 40 2019-02-01 8 NA 8
# 13: 50 2019-02-01 9 NA 9
# 14: 50 2019-02-05 10 4 10
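The counter logic can also be checked without data.table. The following base R sketch assumes the rows are already sorted by customer and date, as in the example; gap and new_id are illustrative names.

```r
customer <- c('10','20','20','20','20','20','30','30','30','30','30','40','50','50')
dates <- as.Date(c("2019-01-05","2019-01-01","2019-01-01","2019-01-02",
                   "2019-01-08","2019-01-09","2019-02-02","2019-02-05",
                   "2019-02-05","2019-02-09","2019-02-12","2019-02-01",
                   "2019-02-01","2019-02-05"))
desired <- c(1,2,2,2,3,3,4,5,5,6,7,8,9,10)

# Days since the previous contact of the same customer (NA for the first one)
gap <- ave(as.numeric(dates), customer, FUN = function(d) c(NA, diff(d)))

# Start a new id at each customer's first contact or when the gap exceeds 2 days
new_id <- cumsum(is.na(gap) | gap > 2)
all(new_id == desired)
```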

Copy a value from one person in a group to everyone in a group

I have a data set in long format (multiple observations per ID), due to omitted information on prescriptions. Each ID is part of a larger "set", and there are 50 or more sets all with one diseased ID. One person per set has the disease, and the others don't.
dt <- data.table(ID = rep(1:10, each = 4),
disease = c(rep(0, 16), rep(1, 4), rep(0, 12), rep(1,4), rep(0,4)),
dob = c(rep(as.Date("13/05/1924", "%d/%m/%Y"), 4), rep(as.Date("15/09/1936", "%d/%m/%Y"),4),
rep(as.Date("30/06/1957", "%d/%m/%Y"),4), rep(as.Date("19/02/1946", "%d/%m/%Y"),4),
rep(as.Date("26/04/1939", "%d/%m/%Y"),4), rep(as.Date("13/05/1922", "%d/%m/%Y"), 4), rep(as.Date("18/10/1945", "%d/%m/%Y"),4),
rep(as.Date("30/06/1957", "%d/%m/%Y"),4), rep(as.Date("19/02/1946", "%d/%m/%Y"),4),
rep(as.Date("26/12/1939", "%d/%m/%Y"),4)),
disease.date = c(rep(as.Date("01/01/2000", "%d/%m/%Y"), 16), rep(as.Date("19/02/2006", "%d/%m/%Y"),4),
rep(as.Date("01/01/2000", "%d/%m/%Y"), 12), rep(as.Date("13/11/2010", "%d/%m/%Y"),4),
rep(as.Date("01/01/2000", "%d/%m/%Y"), 4)),
set = c(rep(1,20), rep(2,20)))
dt <- dt[(disease==0), disease.date:=NA]
dt
ID disease dob disease.date set
1: 1 0 1924-05-13 <NA> 1
2: 1 0 1924-05-13 <NA> 1
3: 1 0 1924-05-13 <NA> 1
4: 1 0 1924-05-13 <NA> 1
5: 2 0 1936-09-15 <NA> 1
6: 2 0 1936-09-15 <NA> 1
7: 2 0 1936-09-15 <NA> 1
8: 2 0 1936-09-15 <NA> 1
9: 3 0 1957-06-30 <NA> 1
10: 3 0 1957-06-30 <NA> 1
11: 3 0 1957-06-30 <NA> 1
12: 3 0 1957-06-30 <NA> 1
13: 4 0 1946-02-19 <NA> 1
14: 4 0 1946-02-19 <NA> 1
15: 4 0 1946-02-19 <NA> 1
16: 4 0 1946-02-19 <NA> 1
17: 5 1 1939-04-26 2006-02-19 1
18: 5 1 1939-04-26 2006-02-19 1
19: 5 1 1939-04-26 2006-02-19 1
20: 5 1 1939-04-26 2006-02-19 1
21: 6 0 1922-05-13 <NA> 2
22: 6 0 1922-05-13 <NA> 2
23: 6 0 1922-05-13 <NA> 2
24: 6 0 1922-05-13 <NA> 2
25: 7 0 1945-10-18 <NA> 2
26: 7 0 1945-10-18 <NA> 2
27: 7 0 1945-10-18 <NA> 2
28: 7 0 1945-10-18 <NA> 2
29: 8 0 1957-06-30 <NA> 2
30: 8 0 1957-06-30 <NA> 2
31: 8 0 1957-06-30 <NA> 2
32: 8 0 1957-06-30 <NA> 2
33: 9 1 1946-02-19 2010-11-13 2
34: 9 1 1946-02-19 2010-11-13 2
35: 9 1 1946-02-19 2010-11-13 2
36: 9 1 1946-02-19 2010-11-13 2
37: 10 0 1939-12-26 <NA> 2
38: 10 0 1939-12-26 <NA> 2
39: 10 0 1939-12-26 <NA> 2
40: 10 0 1939-12-26 <NA> 2
I'm interested in finding the age of everyone in each set on the case's disease date.
For example, how old is everyone in set 1 on 19/02/2006 (the case's disease date), and in set 2 on 13/11/2010?
I've tried the data.table way:
cc[, age := dob - oa.cons.date, by = set]
which only worked for those with a disease.date
Any other thoughts I had involved copying the disease.date of each case to the controls in the same set, but I didn't know how to do that either.
You can copy the first non-empty disease date within each set group to the whole column disease.date:
dt[, disease.date := disease.date[!is.na(disease.date)][1], by = set]
Then calculate age:
dt[, age := disease.date - dob]
Notice that time difference intervals are in days. You may divide them by 365 or treat them in any other suitable way. Maybe package lubridate can be useful here. With its help:
dt[, age := as.period(interval(dob, disease.date), unit = "years")]
or
dt[, age := decimal_date(disease.date) - decimal_date(dob)]
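For comparison, the copy-then-subtract idea can be sketched in base R on a small made-up subset of the data (five rows, one row per ID). The column name case.date and the 365.25 divisor are assumptions made for this illustration.

```r
# Small illustrative subset: one row per ID, one case per set
d <- data.frame(
  set = c(1, 1, 1, 2, 2),
  dob = as.Date(c("1924-05-13", "1936-09-15", "1939-04-26",
                  "1946-02-19", "1939-12-26")),
  disease.date = as.Date(c(NA, NA, "2006-02-19", "2010-11-13", NA)))

# Copy the first non-missing disease date within each set to every row of that set
case_num <- ave(as.numeric(d$disease.date), d$set,
                FUN = function(v) v[!is.na(v)][1])
d$case.date <- as.Date(case_num, origin = "1970-01-01")

# Approximate age in years on the case's disease date
d$age <- as.numeric(d$case.date - d$dob) / 365.25
d
```

A set with no case at all would get NA for case.date and age.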
You can try this:
(dt$disease.date[20] - dt$dob)/365
This uses dt$disease.date[20] because the disease.date column contains NAs. Row 20 belongs to the case in set 1, so the result is only meaningful for the IDs in set 1.
Since both columns are Date objects, R calculates the difference between the two dates directly. The difference is in days, so dividing by 365 gives the approximate age in years.

R - Calculate Time Elapsed Since Last Event with Multiple Event Types

I have a dataframe that contains the dates of multiple types of events.
df <- data.frame(date=as.Date(c("06/07/2000","15/09/2000","15/10/2000"
,"03/01/2001","17/03/2001","23/04/2001",
"26/05/2001","01/06/2001",
"30/06/2001","02/07/2001","15/07/2001"
,"21/12/2001"), "%d/%m/%Y"),
event_type=c(0,4,1,2,4,1,0,2,3,3,4,3))
date event_type
---------------- ----------
1 2000-07-06 0
2 2000-09-15 4
3 2000-10-15 1
4 2001-01-03 2
5 2001-03-17 4
6 2001-04-23 1
7 2001-05-26 0
8 2001-06-01 2
9 2001-06-30 3
10 2001-07-02 3
11 2001-07-15 4
12 2001-12-21 3
I am trying to calculate the days between each event type so the output looks like the below:
date event_type days_since_last_event
---------------- ---------- ---------------------
1 2000-07-06 0 NA
2 2000-09-15 4 NA
3 2000-10-15 1 NA
4 2001-01-03 2 NA
5 2001-03-17 4 183
6 2001-04-23 1 190
7 2001-05-26 0 324
8 2001-06-01 2 149
9 2001-06-30 3 NA
10 2001-07-02 3 2
11 2001-07-15 4 120
12 2001-12-21 3 172
I have benefited from the answers from these two previous posts but have not been able to address my specific problem in R; multiple event types.
Calculate elapsed time since last event
Calculate days since last event in R
Below is as far as I have gotten. I have not been able to leverage the last event index to calculate the last event date.
df <- cbind(df, as.vector(data.frame(count=ave(df$event_type==df$event_type,
df$event_type, FUN=cumsum))))
df <- rename(df, c("count" = "last_event_index"))
date event_type last_event_index
--------------- ------------- ----------------
1 2000-07-06 0 1
2 2000-09-15 4 1
3 2000-10-15 1 1
4 2001-01-03 2 1
5 2001-03-17 4 2
6 2001-04-23 1 2
7 2001-05-26 0 2
8 2001-06-01 2 2
9 2001-06-30 3 1
10 2001-07-02 3 2
11 2001-07-15 4 3
12 2001-12-21 3 3
We can use diff to get the difference between adjacent 'date' values after grouping by 'event_type'. Here I use the data.table approach: setDT(df) converts the 'data.frame' to a 'data.table', and within each 'event_type' group we take the diff of 'date', padded with a leading NA.
library(data.table)
setDT(df)[,days_since_last_event :=c(NA,diff(date)) , by = event_type]
df
# date event_type days_since_last_event
# 1: 2000-07-06 0 NA
# 2: 2000-09-15 4 NA
# 3: 2000-10-15 1 NA
# 4: 2001-01-03 2 NA
# 5: 2001-03-17 4 183
# 6: 2001-04-23 1 190
# 7: 2001-05-26 0 324
# 8: 2001-06-01 2 149
# 9: 2001-06-30 3 NA
#10: 2001-07-02 3 2
#11: 2001-07-15 4 120
#12: 2001-12-21 3 172
Or, as @Frank mentioned in the comments, we can also use shift (from data.table v1.9.5 onwards) to get the lag (by default, type='lag') of 'date' and subtract it from 'date'.
setDT(df)[, days_since_last_event := as.numeric(date-shift(date,type="lag")),
by = event_type]
The base R version of this is to use split/lapply/rbind to generate the new column.
> do.call(rbind,
lapply(
split(df, df$event_type),
function(d) {
d$dsle <- c(NA, diff(d$date)); d
}
)
)
date event_type dsle
0.1 2000-07-06 0 NA
0.7 2001-05-26 0 324
1.3 2000-10-15 1 NA
1.6 2001-04-23 1 190
2.4 2001-01-03 2 NA
2.8 2001-06-01 2 149
3.9 2001-06-30 3 NA
3.10 2001-07-02 3 2
3.12 2001-12-21 3 172
4.2 2000-09-15 4 NA
4.5 2001-03-17 4 183
4.11 2001-07-15 4 120
Note that this returns the data in a different order than provided; you can re-sort by date or save the original indices if you want to preserve that order.
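If keeping the original row order matters, one base R alternative is ave, which applies a function per group and writes the results back in place. This is a sketch using the data from the question; the dates are converted to numeric day counts before differencing.

```r
df <- data.frame(
  date = as.Date(c("06/07/2000","15/09/2000","15/10/2000","03/01/2001",
                   "17/03/2001","23/04/2001","26/05/2001","01/06/2001",
                   "30/06/2001","02/07/2001","15/07/2001","21/12/2001"),
                 "%d/%m/%Y"),
  event_type = c(0,4,1,2,4,1,0,2,3,3,4,3))

# Per event type: days since the previous event of that type, NA for the first
df$days_since_last_event <- ave(as.numeric(df$date), df$event_type,
                                FUN = function(d) c(NA, diff(d)))
df
```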
Above, @akrun has posted the data.table approach; the parallel dplyr approach is straightforward as well:
library(dplyr)
df %>% group_by(event_type) %>% mutate(days_since_last_event=date - lag(date, 1))
Source: local data frame [12 x 3]
Groups: event_type [5]
date event_type days_since_last_event
(date) (dbl) (dfft)
1 2000-07-06 0 NA days
2 2000-09-15 4 NA days
3 2000-10-15 1 NA days
4 2001-01-03 2 NA days
5 2001-03-17 4 183 days
6 2001-04-23 1 190 days
7 2001-05-26 0 324 days
8 2001-06-01 2 149 days
9 2001-06-30 3 NA days
10 2001-07-02 3 2 days
11 2001-07-15 4 120 days
12 2001-12-21 3 172 days

Retain and lag function in R as SAS

I am looking for a function in R similar to lag1, lag2 and retain functions in SAS which I can use with data.tables.
I know there are functions like embed and lag in R, but they don't return a single value or the previous value. They return a complete set of vectors.
Is there anything in R which I can use with data.table?
More info on the SAS functions:
Retain
Lag
You have to be aware that R works very differently from the data step in SAS. The lag function in SAS is used in the data step, within the implicit loop structure of that data step. The same goes for retain, which simply keeps the value constant across the iterations of that loop.
R, on the other hand, is vectorized. This means that you have to rethink what you want to do, and adapt accordingly.
retain is simply unnecessary in R, as R recycles arguments by default. If you want to do this explicitly, you might look at e.g. rep() to construct a vector with constant values and a certain length.
lag is a matter of using indices and shifting the position of all values in a vector. To keep a vector of the same length, you need to add some NAs at the start and drop the extra values at the end.
A simple example: This SAS code lags a variable x and adds a variable year that has a constant value:
data one;
retain year 2013;
input x @@;
y=lag1(x);
z=lag2(x);
datalines;
1 2 3 4 5 6
;
In R, you could write your own lag function like this:
mylag <- function(x,k) c(rep(NA,k),head(x,-k))
This single line adds k times NA at the beginning of the vector, and drops the last k values from the vector. The result is a lagged vector as given by lag1 etc. in SAS.
This allows something like:
nrs <- 1:6 # equivalent to datalines
one <- data.frame(
x = nrs,
y = mylag(nrs,1),
z = mylag(nrs,2),
year = 2013 # R automatically loops, so no extra command needed
)
The result is :
> one
x y z year
1 1 NA NA 2013
2 2 1 NA 2013
3 3 2 1 2013
4 4 3 2 2013
5 5 4 3 2013
6 6 5 4 2013
Exactly the same would work with a data.table object. The important note here is to rethink your strategy: Instead of thinking loopwise as you do with the DATA step in SAS, you have to start thinking in terms of vectors and indices when using R.
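When the lag has to restart within each group (the usual reason for combining it with data.table's by), the same one-liner can be applied per group. A base R sketch with made-up data, using ave to keep the row order:

```r
mylag <- function(x, k) c(rep(NA, k), head(x, -k))

# Made-up example: two groups of three values each
d <- data.frame(g = rep(c("a", "b"), each = 3), v = 1:6)

# Lag v by one position within each group g
d$v_lag1 <- ave(d$v, d$g, FUN = function(x) mylag(x, 1))
d
```

Note that mylag assumes k is at least 1 and no larger than the group size; outside that range the result no longer has the same length as the input.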
I would say the closest equivalent to retain, lag1, and lag2 would be the Lag function in the quantmod package.
It's very easy to use with data.tables. E.g.:
library(data.table)
library(quantmod)
d <- data.table(v1=c(rep('a', 10), rep('b', 10)), v2=1:20)
setkeyv(d, 'v1')
d[,new_var := Lag(v2, 1), by='v1']
d[,new_var2 := v2-Lag(v2, 3), by='v1']
d[,new_var3 := Next(v2, 2), by='v1']
This yields the following:
print(d)
v1 v2 new_var new_var2 new_var3
1: a 1 NA NA 3
2: a 2 1 NA 4
3: a 3 2 NA 5
4: a 4 3 3 6
5: a 5 4 3 7
6: a 6 5 3 8
7: a 7 6 3 9
8: a 8 7 3 10
9: a 9 8 3 NA
10: a 10 9 3 NA
11: b 11 NA NA 13
12: b 12 11 NA 14
13: b 13 12 NA 15
14: b 14 13 3 16
15: b 15 14 3 17
16: b 16 15 3 18
17: b 17 16 3 19
18: b 18 17 3 20
19: b 19 18 3 NA
20: b 20 19 3 NA
As you can see, Lag lets you look back and Next lets you look forward. Both functions are nice because they pad the result with NAs such that it has the same length as the input.
If you want to get even fancier, and higher-performance, you can look into rolling joins with data.table objects. This is a little bit different than what you are asking for, but is conceptually related, and so powerful and awesome I have to share.
Start with a data.table:
library(data.table)
library(quantmod)
set.seed(42)
d1 <- data.table(
id=c(rep('a', 10), rep('b', 10)),
time=rep(1:10,2),
value=runif(20))
setkeyv(d1, c('id', 'time'))
print(d1)
id time value
1: a 1 0.9148060
2: a 2 0.9370754
3: a 3 0.2861395
4: a 4 0.8304476
5: a 5 0.6417455
6: a 6 0.5190959
7: a 7 0.7365883
8: a 8 0.1346666
9: a 9 0.6569923
10: a 10 0.7050648
11: b 1 0.4577418
12: b 2 0.7191123
13: b 3 0.9346722
14: b 4 0.2554288
15: b 5 0.4622928
16: b 6 0.9400145
17: b 7 0.9782264
18: b 8 0.1174874
19: b 9 0.4749971
20: b 10 0.5603327
You have another data.table you want to join, but not all time indexes are present in the second table:
d2 <- data.table(
id=sample(c('a', 'b'), 5, replace=TRUE),
time=sample(1:10, 5),
value2=runif(5))
setkeyv(d2, c('id', 'time'))
print(d2)
id time value2
1: a 4 0.811055141
2: a 10 0.003948339
3: b 6 0.737595618
4: b 8 0.388108283
5: b 9 0.685169729
A regular merge yields lots of missing values:
d2[d1,,roll=FALSE]
id time value2 value
1: a 1 NA 0.9148060
2: a 2 NA 0.9370754
3: a 3 NA 0.2861395
4: a 4 0.811055141 0.8304476
5: a 5 NA 0.6417455
6: a 6 NA 0.5190959
7: a 7 NA 0.7365883
8: a 8 NA 0.1346666
9: a 9 NA 0.6569923
10: a 10 0.003948339 0.7050648
11: b 1 NA 0.4577418
12: b 2 NA 0.7191123
13: b 3 NA 0.9346722
14: b 4 NA 0.2554288
15: b 5 NA 0.4622928
16: b 6 0.737595618 0.9400145
17: b 7 NA 0.9782264
18: b 8 0.388108283 0.1174874
19: b 9 0.685169729 0.4749971
20: b 10 NA 0.5603327
However, data.table allows you to roll the secondary index forward, WITHIN THE PRIMARY INDEX!
d2[d1,,roll=TRUE]
id time value2 value
1: a 1 NA 0.9148060
2: a 2 NA 0.9370754
3: a 3 NA 0.2861395
4: a 4 0.811055141 0.8304476
5: a 5 0.811055141 0.6417455
6: a 6 0.811055141 0.5190959
7: a 7 0.811055141 0.7365883
8: a 8 0.811055141 0.1346666
9: a 9 0.811055141 0.6569923
10: a 10 0.003948339 0.7050648
11: b 1 NA 0.4577418
12: b 2 NA 0.7191123
13: b 3 NA 0.9346722
14: b 4 NA 0.2554288
15: b 5 NA 0.4622928
16: b 6 0.737595618 0.9400145
17: b 7 0.737595618 0.9782264
18: b 8 0.388108283 0.1174874
19: b 9 0.685169729 0.4749971
20: b 10 0.685169729 0.5603327
This is pretty damn cool: Old observations are rolled forward in time, until they are replaced by new ones. If you want to replace the NA values at the beginning of the series, you can do so by rolling the first observation backwards:
d2[d1,,roll=TRUE, rollends=c(TRUE, TRUE)]
id time value2 value
1: a 1 0.811055141 0.9148060
2: a 2 0.811055141 0.9370754
3: a 3 0.811055141 0.2861395
4: a 4 0.811055141 0.8304476
5: a 5 0.811055141 0.6417455
6: a 6 0.811055141 0.5190959
7: a 7 0.811055141 0.7365883
8: a 8 0.811055141 0.1346666
9: a 9 0.811055141 0.6569923
10: a 10 0.003948339 0.7050648
11: b 1 0.737595618 0.4577418
12: b 2 0.737595618 0.7191123
13: b 3 0.737595618 0.9346722
14: b 4 0.737595618 0.2554288
15: b 5 0.737595618 0.4622928
16: b 6 0.737595618 0.9400145
17: b 7 0.737595618 0.9782264
18: b 8 0.388108283 0.1174874
19: b 9 0.685169729 0.4749971
20: b 10 0.685169729 0.5603327
These rolling joins are absolutely incredible, and I've never seen them implemented in any other open source package (see ?data.table for more info). It will take a little while to turn off your "SAS brain" and turn on your "R brain", but once you get over that initial hump you'll find that the language is much more expressive.
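To make the roll-forward idea concrete for a single id, here is a base R sketch of "last observation carried forward" using findInterval. It only illustrates the concept for one group with rounded, made-up values; it is not how data.table implements rolling joins.

```r
times    <- 1:10            # time index of the main table
obs_time <- c(4, 10)        # times at which the second table has a value
obs_val  <- c(0.81, 0.004)  # rounded, illustrative values

# Index of the most recent observation at or before each time (0 if none yet)
idx <- findInterval(times, obs_time)

# Prepend NA so that index 0 maps to "no observation so far"
rolled <- c(NA, obs_val)[idx + 1]
rolled
```

Times 1 to 3 stay NA, times 4 to 9 carry the first observation forward, and time 10 picks up the second one, which matches the pattern shown for id a with roll=TRUE.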
For retain, try this :
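retain <- function(x, event, outside = NA)
{
  # Use length(x) rather than nrow(df) so the function does not depend on a global df
  indices <- c(1, which(event), length(x) + 1)
  values <- c(outside, x[which(event)])
  # Repeat each retained value until the next event
  rep(values, diff(indices))
}
With this data, I want to retain the value downwards from each row where w == "b":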
df <- data.frame(w = c("a","b","c","a","b","c"), x = 1:6, y = c(1,1,2,2,2,3), stringsAsFactors = FALSE)
df$z<-retain(df$x-df$y,df$w=="b")
df
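And here's the converse, obtain, which does not exist in SAS:
obtain <- function(x, event, outside = NA)
{
  # Use length(x) rather than nrow(df) so the function does not depend on a global df
  indices <- c(0, which(event), length(x))
  values <- c(x[which(event)], outside)
  # Repeat each value backwards, up to and including its event row
  rep(values, diff(indices))
}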
Here's an example. I want to obtain the value in advance where w==b
df$z2<-obtain(df$x-df$y,df$w=="b")
df
Thanks to Julien for helping.
Here's an example: a cumulative sum with sqldf:
> w_cum <-
sqldf("select t1.id, t1.SomeNumt, SUM(t2.SomeNumt) as cum_sum
from w_cum t1
inner join w_cum t2 on t1.id >= t2.id
group by t1.id, t1.SomeNumt
order by t1.id
")
id SomeNumt cum_sum
1 11 11
2 12 23
3 13 36
4 14 50
5 15 65
6 16 81
7 17 98
8 18 116
9 19 135
10 20 155
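For a plain running total like this one, base R's cumsum gives the same numbers without a self-join. A minimal sketch, assuming the table holds ids 1 to 10 with SomeNumt running from 11 to 20 as in the output above:

```r
# Rebuild the input assumed from the output above
w <- data.frame(id = 1:10, SomeNumt = 11:20)

# Running total in id order
w <- w[order(w$id), ]
w$cum_sum <- cumsum(w$SomeNumt)
w
```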
