Define a variable iteratively in data.table in R

I am trying to find a faster way to define a variable iteratively, i.e., where each row of the variable depends on the previous row. For example, suppose I have the following data.table:
library(data.table)
tmp <- data.table(type  = c("A", "A", "A", "B", "B", "B"),
                  year  = c(2011, 2012, 2013, 2011, 2012, 2013),
                  alpha = c(1, 1, 1, 2, 2, 2),
                  beta  = c(3, 3, 3, 4, 4, 4),
                  pred  = c(1, NA, NA, 2, NA, NA))
For each type (A and then B), I want to solve for pred going forward, where pred for type A for the year 2012 is:
pred_2012_A = alpha + beta * pred_2011_A
and the pred for 2013 for type A continues:
pred_2013_A = alpha + beta * pred_2012_A
I have a solution that uses a for loop to go through each type, a temporary variable to store the previous value, and the "by" argument in data.table to step through the years:
for (i in c("A", "B")) {
  tmp.val <- tmp[type == i & year == 2011]$pred  # initial value for type i
  tmp[year > 2011 & type == i, pred := {
    tmp.val <- alpha + beta * tmp.val
  }, by = year]
}
For reference, the original data.table looks like:
   type year alpha beta pred
1:    A 2011     1    3    1
2:    A 2012     1    3   NA
3:    A 2013     1    3   NA
4:    B 2011     2    4    2
5:    B 2012     2    4   NA
6:    B 2013     2    4   NA
And the updated table looks like:
   type year alpha beta pred
1:    A 2011     1    3    1
2:    A 2012     1    3    4
3:    A 2013     1    3   13
4:    B 2011     2    4    2
5:    B 2012     2    4   10
6:    B 2013     2    4   42
My question is whether there is a faster way to implement this without the for loop. Is there a single data.table statement that computes this recurrence faster than looping? My real usage has many more types and many more years to compute, so a faster implementation would be greatly appreciated.
Thank you.

You can just do the math: unrolling the recurrence pred_n = alpha + beta * pred_(n-1) gives the closed form pred_n = pred_1 * beta^(n-1) + alpha * (1 + beta + ... + beta^(n-2)), which vectorizes per group as:
tmp[, pred := pred[1]*beta^(1:.N-1) + alpha*cumsum(c(0, beta[1]^(0:(.N-2)))), by=type]
#    type year alpha beta pred
# 1:    A 2011     1    3    1
# 2:    A 2012     1    3    4
# 3:    A 2013     1    3   13
# 4:    B 2011     2    4    2
# 5:    B 2012     2    4   10
# 6:    B 2013     2    4   42
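If alpha or beta varied from year to year within a type, the closed form above would no longer apply. A minimal sketch (my own, not from the original answer) that evaluates the recurrence directly with Reduce(..., accumulate = TRUE), still without an explicit for loop over types:
tmp[, pred := Reduce(function(prev, i) alpha[i] + beta[i] * prev,
                     seq_len(.N)[-1],   # rows 2..N within each type
                     init = pred[1],    # the known starting value
                     accumulate = TRUE),
    by = type]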
Comment. In my opinion, the data structure in the OP is flawed. Alpha and beta are clearly attributes of the type, not something that varies from row to row. It should start with:
typeDT = data.table(
  type       = c("A", "B"),
  year.start = 2011L,
  year.end   = 2013L,
  a          = 1:2,
  b          = 3:4,
  pred0      = 1:2
)
#    type year.start year.end a b pred0
# 1:    A       2011     2013 1 3     1
# 2:    B       2011     2013 2 4     2
With this structure, you could expand to your data set naturally:
typeDT[, {
  year = year.start:year.end
  n    = length(year)
  p    = pred0*b^(0:(n-1)) + a*cumsum(c(0, b^(0:(n-2))))
  .(year = year, pred = p)
}, by = type]
#    type year pred
# 1:    A 2011    1
# 2:    A 2012    4
# 3:    A 2013   13
# 4:    B 2011    2
# 5:    B 2012   10
# 6:    B 2013   42

A bit hacky, but bear with me: the while loop needs only as many passes as the longest run of consecutive NAs (two, for this data).
library(dplyr)  # for lag(); stats::lag would not shift a plain vector
df <- read.table(text = "type year alpha beta pred
1: A 2011 1 3 1
2: A 2012 1 3 NA
3: A 2013 1 3 NA
4: B 2011 2 4 2
5: B 2012 2 4 NA
6: B 2013 2 4 NA", header = TRUE)
df2 <- df
while (any(is.na(df2$pred))) {
  # apply one step of the recurrence to the lagged predictions
  df2$pred <- df2$alpha + df2$beta * lag(df2$pred)
  # restore the known (non-NA) starting values from the original data
  df2$pred[which(!is.na(df$pred))] <- df$pred[which(!is.na(df$pred))]
}
The result matches the expected output:
df2
   type year alpha beta pred
1:    A 2011     1    3    1
2:    A 2012     1    3    4
3:    A 2013     1    3   13
4:    B 2011     2    4    2
5:    B 2012     2    4   10
6:    B 2013     2    4   42

Related

replace row values based on another row value in a data.table

I have a trivial question, though I am struggling to find a simple answer. I have a data table that looks something like this:
dt <- data.table(id    = rep(c("A", "B", "C"), each = 4),  # quoted ids, sized to match the print below
                 time  = rep(1:4, times = 3),
                 score = c(NA, 10, 15, 13, NA, 25, NA, NA, 18, 29, NA, 19))
dt
#     id time score
#  1:  A    1    NA
#  2:  A    2    10
#  3:  A    3    15
#  4:  A    4    13
#  5:  B    1    NA
#  6:  B    2    25
#  7:  B    3    NA
#  8:  B    4    NA
#  9:  C    1    18
# 10:  C    2    29
# 11:  C    3    NA
# 12:  C    4    19
I would like to replace the missing values of my group "B" with the values of "A".
The final dataset should look something like this:
final
#     id time score
#  1:  A    1    NA
#  2:  A    2    10
#  3:  A    3    15
#  4:  A    4    13
#  5:  B    1    NA
#  6:  B    2    25
#  7:  B    3    15
#  8:  B    4    13
#  9:  C    1    18
# 10:  C    2    29
# 11:  C    3    NA
# 12:  C    4    19
In other words, wherever B's score is NA, I would like to substitute A's score for the same time. Note that C's NA remains untouched.
I am struggling to find a clean way to do this using data.table. However, if it is simpler with other methods it would still be ok.
Thanks a lot for your help
Here is one option: get the index of the rows where 'score' is NA and 'id' is "B", then use it to replace those NAs with the corresponding 'score' values from 'A':
library(data.table)
i1 <- setDT(dt)[id == 'B', which(is.na(score))]
dt[, score := replace(score, id == 'B' & is.na(score), score[which(id == 'A')[i1]])]
Or a similar option in dplyr:
library(dplyr)
dt %>%
  mutate(score = replace(score, id == "B" & is.na(score),
                         score[which(id == "A")[i1]]))
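Another possibility, a sketch of my own (not from the original answers): an update join keyed on time, which also works when A's and B's rows are not aligned positionally:
# fill B's missing scores from A's score at the same time
dt[id == "B" & is.na(score),
   score := dt[id == "A"][.SD, on = "time", x.score]]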

merge two tables by a given rule

Consider the example where I have two data.tables: df1 is a copy of my order and SOH is my inventory. I want to merge df1$price into SOH, whereby:
for each SOH row, write the price from the most recent df1 year that is not later than arrival_year; if every df1 year is later, write NA
if the SOH item doesn't appear in df1 at all, write NA for price
supplier <- c(1,1,1,1,1,2,2)
item <- c(20,20,20,21,22,23,26)
year <- c(2000,2002,2008,2001,2007,2005,2009)
price <- c(.3,.4,.5,1.6,1.5,3.2,.25)
df1 <- data.frame(supplier, item, year, price)
#
supplier_on_hand <- c(1,1,1,1,1,1,2,2,3)
item_on_hand <- c(20,20,20,22,20,20,23,23,10)
arrival_year <- c(2000,2001,2002,2009,2007,2012,2006,2004,2009)
SOH <- data.frame(supplier_on_hand, item_on_hand, arrival_year)
The desired output is the one shown in the answers below.
One possibility is to use the rolling-join ability of the data.table package. With roll = Inf, each SOH row is matched to the most recent df1 year at or before its arrival year (NA when there is none):
library(data.table)
setDT(df1)[setDT(SOH), on = .(supplier = supplier_on_hand, item = item_on_hand, year = arrival_year), roll = Inf]
# in a bit more readable format:
setDT(SOH)
setDT(df1)
df1[SOH, on = .(supplier = supplier_on_hand, item = item_on_hand, year = arrival_year), roll = Inf]
# or with setting keys first:
setDT(SOH, key = c('supplier_on_hand','item_on_hand','arrival_year'))
setDT(df1, key = c('supplier','item','year'))
df1[SOH, roll = Inf]
which gives:
   supplier item year price
1:        1   20 2000   0.3
2:        1   20 2001   0.3
3:        1   20 2002   0.4
4:        1   20 2007   0.4
5:        1   20 2012   0.5
6:        1   22 2009   1.5
7:        2   23 2004    NA
8:        2   23 2006   3.2
9:        3   10 2009    NA
The following looks like it works for me:
cbind(SOH, price =
  apply(SOH, 1, function(x) {
    # match the supplier and item, and keep only years up to the arrival year
    temp <- df1[df1$supplier == x[1] & df1$item == x[2] & df1$year <= x[3], ]
    # order by year descending so the most recent admissible year comes first
    temp <- temp[order(temp$year, decreasing = TRUE), ]
    # set to NA if no row satisfies the rules
    if (is.na(temp[1, 'price'])) return(NA) else return(temp[1, 'price'])
  })
)
Output:
  supplier_on_hand item_on_hand arrival_year price
1                1           20         2000   0.3
2                1           20         2001   0.3
3                1           20         2002   0.4
4                1           22         2009   1.5
5                1           20         2007   0.4
6                1           20         2012   0.5
7                2           23         2006   3.2
8                2           23         2004    NA
9                3           10         2009    NA

Make R user-defined-function for data.table commands - How to refer a column properly

I have the following df1 data:
df1 <- data.frame(id    = c("A", "A", "A", "A", "B", "B", "B", "B"),
                  year  = c(2014, 2014, 2015, 2015),  # recycled to length 8
                  month = c(1, 2),                    # recycled to length 8
                  new.employee = c(4, 6, 2, 6, 23, 2, 5, 34))
  id year month new.employee
1  A 2014     1            4
2  A 2014     2            6
3  A 2015     1            2
4  A 2015     2            6
5  B 2014     1           23
6  B 2014     2            2
7  B 2015     1            5
8  B 2015     2           34
and the desired outcome, produced by the following statements:
library(data.table) # V1.9.6+
temp <- setDT(df1)[month == 2L, .(id, frank(-new.employee)), by = year]
df1[temp, new.employee.rank := i.V2, on = c("year", "id")]
df1
#    id year month new.employee new.employee.rank
# 1:  A 2014     1            4                 1
# 2:  A 2014     2            6                 1
# 3:  A 2015     1            2                 2
# 4:  A 2015     2            6                 2
# 5:  B 2014     1           23                 2
# 6:  B 2014     2            2                 2
# 7:  B 2015     1            5                 1
# 8:  B 2015     2           34                 1
Now, I want to generalize this by writing a user-defined function in which the input variable (new.employee in the example above) can vary. I tried several approaches, but they did not work.
The first try:
myRank <- function(data, var) {
  temp <- setDT(data)[month == 2L, .(id, frank(-var)), by = year]
  data[temp, new.employee.rank := i.V2, on = c("year", "id")]
  return(data)
}
myRank(df1,new.employee)
Error in is.data.frame(x) : object 'new.employee' not found
The second try:
myRank(df1,df1$new.employee)
Nothing was printed.
The third try: I changed the function a bit:
myRank <- function(data, var) {
  temp <- setDT(data)[month == 2L, .(id, rank(data$var)), by = year]
  data[temp, new.employee.rank := i.V2, on = c("year", "id")]
  return(data)
}
myRank(df1,df1$new.employee)
Warning messages:
1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
2: In [.data.table(setDT(data), month == 2L, .(id, rank(data$var)), :
Item 2 of j's result for group 1 is zero length. This will be filled with 2 NAs to match the longest column in this result. Later groups may have a similar problem but only the first is reported to save filling the warning buffer.
3: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
I looked at similar problems but my R experience is not good enough to understand those.
data.table uses non-standard evaluation by default (unless you start to mess around with with = FALSE), so you need to refer to your column by name, or alternatively use get. Another problem with your code (as mentioned in the comments) is that you are calling new.employee while it is not defined outside the scope of df1. To prevent R from evaluating it before you pass it to your data set, you can use the deparse(substitute(var)) combination, which blocks evaluation and converts var to a character string; that string can in turn be passed to get, or to the eval(as.name()) combination (the two do entirely different things, but within the data.table scope they lead to the same result).
Finally, there is the printing issue after using := within a function: := returns its result invisibly, so return(data) will not auto-print. You need to force printing, either with an additional [] at the end or by explicitly calling print.
Here's a possible solution
myRank <- function(data, var) {
  var <- deparse(substitute(var))  ## <~~~ Note this
  temp <- setDT(data)[month == 2L, .(id, frank(-get(var))), by = year]  ## <~~ Note the get
  data[temp, new.employee.rank := i.V2, on = c("year", "id")][]  ## <~~ Note the []
}
myRank(df1, new.employee)
#    id year month new.employee new.employee.rank
# 1:  A 2014     1            4                 1
# 2:  A 2014     2            6                 1
# 3:  A 2015     1            2                 2
# 4:  A 2015     2            6                 2
# 5:  B 2014     1           23                 2
# 6:  B 2014     2            2                 2
# 7:  B 2015     1            5                 1
# 8:  B 2015     2           34                 1
Or
myRank <- function(data, var) {
  var <- as.name(deparse(substitute(var)))  ## <~~~ Note the additional as.name
  temp <- setDT(data)[month == 2L, .(id, frank(-eval(var))), by = year]  ## <~ Note the eval
  data[temp, new.employee.rank := i.V2, on = c("year", "id")][]
}
myRank(df1, new.employee)
#    id year month new.employee new.employee.rank
# 1:  A 2014     1            4                 1
# 2:  A 2014     2            6                 1
# 3:  A 2015     1            2                 2
# 4:  A 2015     2            6                 2
# 5:  B 2014     1           23                 2
# 6:  B 2014     2            2                 2
# 7:  B 2015     1            5                 1
# 8:  B 2015     2           34                 1
I would guess the second option will be faster, as it avoids extracting the whole column out of data.
As a side note, you can also make the name of the new column depend on the input variable by replacing
new.employee.rank := i.V2
with something like
paste0("New.", var, ".rank") := i.V2

filling missing values time series data in R

I am trying to expand yearly values in my panel data to year-quarter values, i.e., repeat each yearly value for every quarter of that year.
For example, I am looking to get the repeated values of income for the year-quarters 2000Q1, 2000Q2, 2000Q3, 2000Q4, 2001Q1, ..., 2001Q4, so the data frame would be id, year-quarter, income.
I use a two-step approach but have run into some issues: if the yearly value is missing, then the corresponding quarterly values need to be missing (NA) too.
Case 1:
annual_data <- data.frame(
  person = c(1, 1, 1, 2, 2, 2),
  year   = c(2010, 2011, 2012, 2010, 2011, 2012),
  income = c(4, 10, 13, 1, NA, 30)
)
Case 2:
annual_data <- data.frame(
  person = c(1, 1, 1, 2, 2, 2),
  year   = c(2010, 2011, 2012, 2010, 2011, 2012),
  income = c(4, 10, 13, NA, NA, 30)
)
In the first step, I expand the data to quarterly, as mentioned in:
interpolating in R yearly time series data with quarterly values
So use a function such as:
library(dplyr)
expand <- function(x) {
  years    <- min(x$year):max(x$year)
  quarters <- 1:4
  grid <- expand.grid(quarter = quarters, year = years)
  x$quarter <- 1  # annual values are attached to Q1; the other quarters start as NA
  merged <- grid %>% left_join(x, by = c('year', 'quarter'))
  merged$person <- x$person[1]
  return(merged)
}
Then I used zoo::na.locf inside dplyr::mutate:
quarterlydata <- annual_data %>% group_by(person) %>% do(expand(.))
testdata <- quarterlydata %>% group_by(person) %>% mutate(ynew = zoo::na.locf(income))
but haven't had much luck, as it carries the previous non-missing value forward into all subsequent missing values. That is:
Case 1: it copies all values, so the income of 1 for person 2 gets carried over to both 2010 and 2011, when it should cover only 2010 and 2011 should stay NA.
For case 2, I get:
Error: incompatible size (%d), expecting %d (the group size) or 1
Any thoughts on what I am missing?
For case 1, you are missing year in your group_by. With the code you have, na.locf runs over each person's entire series, so the fill leaks from one year into the next; grouping by person and year confines the fill to within each year.
testdata <- quarterlydata %>%
  group_by(person, year) %>%
  mutate(ynew = zoo::na.locf(income, na.rm = FALSE))
With the output:
> tail(testdata, 13)
Source: local data frame [13 x 5]
Groups: person, year

   quarter year person income ynew
1        4 2012      1     NA   13
2        1 2010      2      1    1
3        2 2010      2     NA    1
4        3 2010      2     NA    1
5        4 2010      2     NA    1
6        1 2011      2     NA   NA
7        2 2011      2     NA   NA
8        3 2011      2     NA   NA
9        4 2011      2     NA   NA
10       1 2012      2     30   30
11       2 2012      2     NA   30
12       3 2012      2     NA   30
13       4 2012      2     NA   30
For case 2, as you might already infer from the code above, you must set na.rm to FALSE; otherwise na.locf drops the leading NAs it cannot fill, returning a vector shorter than the group and triggering the size error above.
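To see the length issue concretely, compare the two behaviours on a small vector:
zoo::na.locf(c(NA, NA, 30, NA))                 # returns 30 30 -- leading NAs dropped, length 2
zoo::na.locf(c(NA, NA, 30, NA), na.rm = FALSE)  # returns NA NA 30 30 -- length preserved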
So using exactly the same code for case 2 we will have the output:
> tail(testdata, 13)
Source: local data frame [13 x 5]
Groups: person, year

   quarter year person income ynew
1        4 2012      1     NA   13
2        1 2010      2     NA   NA
3        2 2010      2     NA   NA
4        3 2010      2     NA   NA
5        4 2010      2     NA   NA
6        1 2011      2     NA   NA
7        2 2011      2     NA   NA
8        3 2011      2     NA   NA
9        4 2011      2     NA   NA
10       1 2012      2     30   30
11       2 2012      2     NA   30
12       3 2012      2     NA   30
13       4 2012      2     NA   30
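As an aside, a data.table sketch of my own (not part of the original answer): the expansion itself can be done in one cross-join step, and since each year's income, NA or not, is simply repeated across its four quarters, no na.locf is needed at all:
library(data.table)
setDT(annual_data)
quarterly <- annual_data[CJ(person  = unique(person),
                            year    = unique(year),
                            quarter = 1:4),
                         on = c("person", "year")]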

R data.table with rollapply

Is there an existing idiom for computing rolling statistics using data.table grouping?
For example, given the following code:
library(zoo)  # rollapply comes from zoo
DT = data.table(x = rep(c("a", "b", "c"), each = 2), y = c(1, 3), v = 1:6)
setkey(DT, y)
stat.ror <- DT[, rollapply(v, width = 1, by = 1, mean, na.rm = TRUE), by = y]
If there isn't one yet, what would be the best way to do it?
In fact I am trying to solve this very problem right now. Here is a partial solution which will work for grouping by a single column:
Edit: got it with RcppRoll, I think:
windowed.average <- function(input.table,
                             window.width = 2,
                             id.cols   = names(input.table)[3],
                             index.col = names(input.table)[1],
                             val.col   = names(input.table)[2]) {
  require(RcppRoll)
  avg.with.group <-
    input.table[, roll_mean(get(val.col), n = window.width), by = c(id.cols)]
  avg.index <-
    input.table[, roll_mean(get(index.col), n = window.width), by = c(id.cols)]$V1
  output.table <- data.table(
    Group = avg.with.group,
    Index = avg.index)
  # rename columns to (sensibly) match inputs
  setnames(output.table, old = colnames(output.table),
           new = c(id.cols, val.col, index.col))
  return(output.table)
}
A (badly written) unit test that the above passes:
require(testthat)
require(zoo)
test.datatable <- data.table(Time    = rep(seq_len(10), times = 2),
                             Voltage = runif(20),
                             Channel = rep(seq_len(2), each = 10))
test.width <- 8
# first test: single id column
test.avgtable <- data.table(
  test.datatable[, rollapply(Voltage, width = test.width, mean, na.rm = TRUE),
                 by = c("Channel")],
  Time = test.datatable[, rollapply(Time, width = test.width, mean, na.rm = TRUE),
                        by = c("Channel")]$V1)
setnames(test.avgtable, old = names(test.avgtable),
         new = c("Channel", "Voltage", "Time"))
expect_that(test.avgtable,
            is_identical_to(windowed.average(test.datatable, test.width)))
How it looks:
> test.datatable
    Time     Voltage Channel Class
 1:    1 0.310935570       1     1
 2:    2 0.565257533       1     2
 3:    3 0.577278573       1     1
 4:    4 0.152315111       1     2
 5:    5 0.836052122       1     1
 6:    6 0.655417230       1     2
 7:    7 0.034859642       1     1
 8:    8 0.572040136       1     2
 9:    9 0.268105436       1     1
10:   10 0.126484340       1     2
11:    1 0.139711248       2     1
12:    2 0.336316520       2     2
13:    3 0.413086486       2     1
14:    4 0.304146029       2     2
15:    5 0.399344631       2     1
16:    6 0.581641210       2     2
17:    7 0.183586025       2     1
18:    8 0.009775488       2     2
19:    9 0.449576242       2     1
20:   10 0.938517952       2     2
> test.avgtable
   Channel   Voltage Time
1:       1 0.4630195  4.5
2:       1 0.4576657  5.5
3:       1 0.4028191  6.5
4:       2 0.2959510  4.5
5:       2 0.3346841  5.5
6:       2 0.4099593  6.5
I initially hadn't managed to make it work with multiple groupings, but after the RcppRoll edit it looks okay for multiple column groups too:
# second test: multiple id columns
# Depends on the first test passing to be meaningful.
test.width <- 4
test.datatable[, Class := rep(seq_len(2), times = ceiling(nrow(test.datatable)/2))]
# windowed.average(test.datatable, test.width, id.cols = c("Channel", "Class"))
test.avgtable <- rbind(windowed.average(test.datatable[Class == 1, ], test.width),
                       windowed.average(test.datatable[Class == 2, ], test.width))
# somewhat artificially attaching expected class labels
test.avgtable[, Class := rep(seq_len(2), times = nrow(test.avgtable)/4, each = 2)]
setkey(test.avgtable, Channel)
setcolorder(test.avgtable, c("Channel", "Class", "Voltage", "Time"))
expect_that(test.avgtable,
            is_equivalent_to(windowed.average(test.datatable, test.width,
                                              id.cols = c("Channel", "Class"))))
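A closing note not in the original answers: since data.table 1.12.0 there is a built-in frollmean() (with frollapply() arriving in later releases for arbitrary statistics), which makes grouped rolling means a one-liner without zoo or RcppRoll. A minimal sketch on the question's DT:
library(data.table)  # >= 1.12.0 for frollmean
DT = data.table(x = rep(c("a", "b", "c"), each = 2), y = c(1, 3), v = 1:6)
DT[, roll.mean := frollmean(v, n = 2, na.rm = TRUE), by = x]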
