merge two tables by a given rule - r

Consider the example where I have two data frames: df1 is a copy of my order and SOH is my inventory. I want to merge df1$price into SOH, whereby:
if SOH$arrival_year >= df1$year, then write the price associated with the most recent such year; if no earlier year appears, write NA
if the SOH item doesn't appear in df1, write NA in price
supplier <- c(1,1,1,1,1,2,2)
item <- c(20,20,20,21,22,23,26)
year <- c(2000,2002,2008,2001,2007,2005,2009)
price <- c(.3,.4,.5,1.6,1.5,3.2,.25)
df1 <- data.frame(supplier, item, year, price)
#
supplier_on_hand <- c(1,1,1,1,1,1,2,2,3)
item_on_hand <- c(20,20,20,22,20,20,23,23,10)
arrival_year <- c(2000,2001,2002,2009,2007,2012,2006,2004,2009)
SOH <- data.frame(supplier_on_hand, item_on_hand, arrival_year)
The following output is desired:

Another possibility is using the rolling join ability of the data.table-package:
library(data.table)
setDT(df1)[setDT(SOH), on = .(supplier = supplier_on_hand, item = item_on_hand, year = arrival_year), roll = Inf]
# in a bit more readable format:
setDT(SOH)
setDT(df1)
df1[SOH, on = .(supplier = supplier_on_hand, item = item_on_hand, year = arrival_year), roll = Inf]
# or with setting keys first:
setDT(SOH, key = c('supplier_on_hand','item_on_hand','arrival_year'))
setDT(df1, key = c('supplier','item','year'))
df1[SOH, roll = Inf]
which gives:
supplier item year price
1: 1 20 2000 0.3
2: 1 20 2001 0.3
3: 1 20 2002 0.4
4: 1 20 2007 0.4
5: 1 20 2012 0.5
6: 1 22 2009 1.5
7: 2 23 2004 NA
8: 2 23 2006 3.2
9: 3 10 2009 NA
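To see what roll = Inf does in isolation, here is a minimal sketch on made-up data (the tables prices and stock are illustrative, not from the question). Each stock year picks up the most recent price at or before it, and a year earlier than every price gets NA:

```r
library(data.table)

# Illustrative data: one item with prices recorded in three years
prices <- data.table(item = 20, year = c(2000, 2002, 2008), price = c(.3, .4, .5))
# Years to look up: before the first price, between two prices, after the last
stock  <- data.table(item = 20, year = c(1999, 2001, 2012))

# roll = Inf carries the last observed price forward without a limit
res <- prices[stock, on = .(item, year), roll = Inf]
res$price
```

The result has one row per row of stock, which is why the joins above are written as df1[SOH, ...] rather than the other way around.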

The following looks like it works for me:
cbind(SOH, price =
apply(SOH, 1, function(x) {
#setting the item and year constraints
temp <- df1[df1$item == x[2] & df1$year <= x[3], ]
#order per year descending as per rules
temp <- temp[order(temp$year, decreasing = TRUE), ]
#set to NA if item or year does not conform to the rules
if (is.na(temp[1, 'price'])) return(NA) else return(temp[1, 'price'])
})
)
Output:
supplier_on_hand item_on_hand arrival_year price
1 1 20 2000 0.3
2 1 20 2001 0.3
3 1 20 2002 0.4
4 1 22 2009 1.5
5 1 20 2007 0.4
6 1 20 2012 0.5
7 2 23 2006 3.2
8 2 23 2004 NA
9 3 10 2009 NA
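Note that the apply() version filters on item and year only. If the supplier should match as well (an assumption, although the rolling join above does join on it), a base R sketch could look like the following; lookup_price is a hypothetical helper name:

```r
df1 <- data.frame(
  supplier = c(1, 1, 1, 1, 1, 2, 2),
  item     = c(20, 20, 20, 21, 22, 23, 26),
  year     = c(2000, 2002, 2008, 2001, 2007, 2005, 2009),
  price    = c(.3, .4, .5, 1.6, 1.5, 3.2, .25)
)
SOH <- data.frame(
  supplier_on_hand = c(1, 1, 1, 1, 1, 1, 2, 2, 3),
  item_on_hand     = c(20, 20, 20, 22, 20, 20, 23, 23, 10),
  arrival_year     = c(2000, 2001, 2002, 2009, 2007, 2012, 2006, 2004, 2009)
)

# Hypothetical helper: price from the latest df1 year at or before y
# for the given supplier and item, or NA if there is no such row
lookup_price <- function(s, i, y) {
  cand <- df1[df1$supplier == s & df1$item == i & df1$year <= y, ]
  if (nrow(cand) == 0) return(NA_real_)
  cand$price[which.max(cand$year)]
}

SOH$price <- mapply(lookup_price,
                    SOH$supplier_on_hand, SOH$item_on_hand, SOH$arrival_year)
```

On this data it yields the same prices as the output above, because no item is shared between suppliers.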

Related

Define variable iteratively in data table in r

I am trying to find a faster solution to defining a variable iteratively, i.e., the next row of the variable depends on the previous row. For example, suppose I have the following data.table:
tmp <- data.table(type = c("A", "A", "A", "B", "B", "B"),
year = c(2011, 2012, 2013, 2011, 2012, 2013),
alpha = c(1,1,1,2,2,2),
beta = c(3,3,3,4,4,4),
pred = c(1,NA,NA,2,NA, NA))
For each type (A and then B), I want to solve for pred going forward, where pred for type A for the year 2012 is:
pred_2012_A = alpha + beta * pred_2011_A
and the pred for 2013 for type A continues:
pred_2013_A = alpha + beta * pred_2012_A
I have a solution using a for loop to go through type and create a variable to store the previous value and use the "by" command in data table to loop through the year as such:
for(i in c("A", "B")){
tmp.val <- tmp[type == i & year == 2011]$pred # initial value for type i
tmp[year > 2011 & type == i, pred := {
tmp.val <- alpha + beta * tmp.val
}, by = year]
}
The original data table looks like:
type year alpha beta pred
1: A 2011 1 3 1
2: A 2012 1 3 NA
3: A 2013 1 3 NA
4: B 2011 2 4 2
5: B 2012 2 4 NA
6: B 2013 2 4 NA
And the updated table looks like:
type year alpha beta pred
1: A 2011 1 3 1
2: A 2012 1 3 4
3: A 2013 1 3 13
4: B 2011 2 4 2
5: B 2012 2 4 10
6: B 2013 2 4 42
My question here is if there is a faster way to implement this without the for loop. Is there a way to implement this routine in one data table statement that is faster than using the for loop? My real usage has many more types and many more years to compute, so a faster implementation would be greatly appreciated.
Thank you.
You can just do the math:
tmp[, pred := pred[1]*beta^(1:.N-1) + alpha*cumsum(c(0, beta[1]^(0:(.N-2)))), by=type]
# type year alpha beta pred
# 1: A 2011 1 3 1
# 2: A 2012 1 3 4
# 3: A 2013 1 3 13
# 4: B 2011 2 4 2
# 5: B 2012 2 4 10
# 6: B 2013 2 4 42
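The one-liner relies on the recurrence pred[k] = alpha + beta * pred[k-1] having the solution pred0 * beta^k + alpha * (1 + beta + ... + beta^(k-1)). As a sanity check, the sketch below compares a step-by-step iteration with the geometric-series closed form; recur and closed are hypothetical helper names, and closed assumes beta != 1:

```r
# Step-by-step iteration of pred[k] = alpha + beta * pred[k-1]
recur <- function(p0, alpha, beta, n) {
  Reduce(function(prev, k) alpha + beta * prev,
         x = seq_len(n - 1), init = p0, accumulate = TRUE)
}

# Closed form via the geometric series sum; assumes beta != 1
closed <- function(p0, alpha, beta, n) {
  k <- 0:(n - 1)
  p0 * beta^k + alpha * (beta^k - 1) / (beta - 1)
}

recur(1, 1, 3, 3)   # type A
closed(2, 2, 4, 3)  # type B
```

Both functions return the pred column for one type, so either could be used per group.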
Comment. In my opinion, the data structure in the OP is flawed. Alpha and beta are clearly attributes of the type, not something that is varying from row to row. It should start with:
typeDT = data.table(
type=c("A","B"),
year.start = 2011L,
year.end=2013,
a = 1:2,
b = 3:4,
pred0 = 1:2
)
# type year.start year.end a b pred0
# 1: A 2011 2013 1 3 1
# 2: B 2011 2013 2 4 2
With this structure, you could expand to your data set naturally:
typeDT[, {
year = year.start:year.end
n = length(year)
p = pred0*b^(0:(n-1)) + a*cumsum(c(0, b^(0:(n-2))))
.(year = year, pred = p)
}, by=type]
# type year pred
# 1: A 2011 1
# 2: A 2012 4
# 3: A 2013 13
# 4: B 2011 2
# 5: B 2012 10
# 6: B 2013 42
A bit hacky but bear with me, it only takes two iterations.
df <- read.table(text = "type year alpha beta pred
1: A 2011 1 3 1
2: A 2012 1 3 NA
3: A 2013 1 3 NA
4: B 2011 2 4 2
5: B 2012 2 4 NA
6: B 2013 2 4 NA", header = T)
df2 <- df
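while(any(is.na(df2$pred))){
# dplyr::lag shifts the vector down by one row; stats::lag would not
df2$pred <- df2$alpha + df2$beta*dplyr::lag(df2$pred)
# restore the known starting values, which the shift overwrote
df2$pred[which(!is.na(df$pred))] <- df$pred[which(!is.na(df$pred))]
}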
The solution is correct
df2
type year alpha beta pred
1: A 2011 1 3 1
2: A 2012 1 3 4
3: A 2013 1 3 13
4: B 2011 2 4 2
5: B 2012 2 4 10
6: B 2013 2 4 42

Split and Diff function in R

I have a data frame called data. I am splitting the data using split function by an attribute called KEY.
data <- split(data, data$KEY);
After splitting the dataframe by KEY, what we get is data for individual firms. dataframe data had the data for all the firms in the universe. After the split, each individual split has two columns, year and sales. For each split, I have to calculate incremental sales corresponding to each year. For instance, if we have data 2002 - 10, 2003 - 12, 2004 - 15, 2005 - 20. What I am interested in getting would be 2003-2, 2004 -3, 2005 - 5, for each split.
I have written a function, called mod_sale, to perform the job mentioned:
mod_sale <- function(data) {
data <- data[with(data, order(year)),];  # assign the sorted result back
sale_data <- diff(data$SALE);
data <- data[-1,];
data$SALE <- sale_data;
return(data)
}
Currently, I am using for loop:
mod_data <- NULL
for(key in names(data)){
a <- try(mod_sale(data[[key]]))
if(inherits(a, "try-error")) next;
mod_data <- rbind(mod_data,a)};
I think there is some way I can use sapply (and maybe plyr too). Can someone help me with improving this R code? Not sure how sapply code would go.
sapply(data, mod_sale)
Any help would be appreciated. Thanks.
Edit:
Here is a data example:
a <- data.frame();
key <- c(1,1,1,1,2,2,2,2,2,3,3,3);
sales <- c(12,12,15,8,3,6,3,9,9,12,3,7);
year <- c(2002,2003,2004,2005,2001,2002,2003,2004,2005,2003,2004,2005);
ovar <- runif(12,5.0,7.5);
a <- data.frame(key,sales,year,ovar)
In the resultant data.frame, I am expecting incremental sales rather than real sales. Obviously, we will lose 3 data points for 3 key; one for each starting year, as we are taking difference. So there will be three less rows in the resultant data.frame, which would have columns key,diff(sales),year, and ovar.
This is what I would have done:
a$diffsales <- ave( a$sales, a$key, FUN=function(x) c(NA, diff(x) ) )
a
key sales year ovar diffsales
1 1 12 2002 6.845177 NA
2 1 12 2003 6.328153 0
3 1 15 2004 6.872669 3
4 1 8 2005 6.098920 -7
5 2 3 2001 7.154824 NA
6 2 6 2002 6.110810 3
7 2 3 2003 5.906624 -3
8 2 9 2004 5.214369 6
9 2 9 2005 5.818218 0
10 3 12 2003 5.354354 NA
11 3 3 2004 6.728992 -9
12 3 7 2005 7.412213 4
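The question asks for the first year of each key to be dropped rather than kept as NA. A minimal sketch of that last step follows, assuming the rows should be ordered by key and year first; the random ovar column is left out so the result is reproducible:

```r
a <- data.frame(
  key   = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3),
  sales = c(12, 12, 15, 8, 3, 6, 3, 9, 9, 12, 3, 7),
  year  = c(2002, 2003, 2004, 2005, 2001, 2002, 2003, 2004, 2005, 2003, 2004, 2005)
)

# Make sure each key's rows are in year order before differencing
a <- a[order(a$key, a$year), ]

# Per-key difference, padded with NA for each key's first year
a$diffsales <- ave(a$sales, a$key, FUN = function(x) c(NA, diff(x)))

# Drop the padded rows: one row fewer per key
res <- a[!is.na(a$diffsales), ]
```

This leaves nine rows, three fewer than the input, as the question expects.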
I appreciate the attempt to display what you'd tried. Thank you.
In the future, try to provide a small example, like this:
df <- data.frame(year = 2001:2010,
sale = sample(20,10))
df <- rbind(df,df,df)
df$key <- rep(letters[1:3],each = 10)
That makes it much clearer what your data look like, and it makes it very easy for people trying to answer. The easier you make it for us, the faster+better answers you'll get.
I'd recommend sorting before splitting:
#Sort first (already sorted, but you get the idea)
df <- df[order(df$key,df$year),]
df_split <- split(df,df$key)
You don't actually want to use sapply. (Try it and see.) You just want lapply:
out <- lapply(df_split,function(x) {x$sale_diff <- c(NA,diff(x$sale)); x[-1,]})
You'd put it all together again using:
do.call(rbind,out)
You're right, plyr or data.table could also do this. I'll leave those examples to others.
Using data.table:
library(data.table)
dt = data.table(a)
dt[, sale_diff := c(NA, diff(sales)), by = key]
dt
# key sales year ovar sale_diff
# 1: 1 12 2002 7.416857 NA
# 2: 1 12 2003 5.625818 0
# 3: 1 15 2004 5.018934 3
# 4: 1 8 2005 6.671986 -7
# 5: 2 3 2001 6.242739 NA
# 6: 2 6 2002 6.297763 3
# 7: 2 3 2003 6.482124 -3
# 8: 2 9 2004 6.724256 6
# 9: 2 9 2005 5.071265 0
#10: 3 12 2003 6.136681 NA
#11: 3 3 2004 6.974392 -9
#12: 3 7 2005 6.517553 4

Reshape with factors

I am trying to reshape a data frame that contains a factor and a numeric variable with the melt and cast procedure. The following data shows my problem:
library(reshape)
df <- as.data.frame(cbind(c(1,1,2,2,3,3),c(2000,2001,2001,2002,2000,2001),c(2,1,4,3,1,5)))
names(df) <- c("Id","Year","Var")
df$Fac <- interaction(c(1,1,1,0,0,0),c(0,0,0,1,1,1),drop=TRUE)
MData <- melt.data.frame(df,id=c("Year","Id"))
RSData <- cast(MData, Id ~ Year | ...)
The operation works, but the missing observations in RSData are not NAs as they should be, but rather strings (<NA> and not NA):
$Var
Id 2000 2001 2002
1 1 2 1 <NA>
2 2 <NA> 4 3
3 3 1 5 <NA>
$Fac
Id 2000 2001 2002
1 1 1.0 1.0 <NA>
2 2 <NA> 1.0 0.1
3 3 0.1 0.1 <NA>
If I, however, disregard the factor the NAs are normal NAs:
df <- as.data.frame(cbind(c(1,1,2,2,3,3),c(2000,2001,2001,2002,2000,2001),c(2,1,4,3,1,5)))
names(df) <- c("Id","Year","Var")
MData <- melt.data.frame(df,id=c("Year","Id"))
RSData <- cast(MData, Id ~ Year | ...)
The output becomes:
$Var
Id 2000 2001 2002
1 1 2 1 NA
2 2 NA 4 3
3 3 1 5 NA
The string NAs give me problems when I try to use my recast data. How do I get the correct NAs when I have a factor and numeric variables in the data frame I want to melt and recast?
Thanks,
M
I am confident that I have found the answer to my own question by reading the comments and the documentation over and over. Basically, the problem is that when using the melt.data.frame() method all the variable values are put in one column, and since I have both strings and numeric values, the numeric values are implicitly converted to strings.
The only way around this I see is to reshape the numeric variables and the factors separately:
MDataNum = melt.data.frame(df[c("Id","Year","Var")],id=c("Year","Id"))
RSDataNum <- cast(MDataNum, Id ~ Year | ...)
MDataFac = melt.data.frame(df[c("Id","Year","Fac")],id=c("Year","Id"))
RSDataFac <- cast(MDataFac, Id ~ Year | ...)
The result becomes:
> RSDataNum
$Var
Id 2000 2001 2002
1 1 2 1 NA
2 2 NA 4 3
3 3 1 5 NA
> RSDataFac
$Fac
Id 2000 2001 2002
1 1 1.0 1.0 <NA>
2 2 <NA> 1.0 0.1
3 3 0.1 0.1 <NA>
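A possible alternative, not part of the answer above: base R's reshape() with direction = "wide" never stacks the values into a single column, so the numeric variable should stay numeric with real NAs. This is a sketch of a swapped-in technique, and it produces one wide data frame with columns such as Var.2000 and Fac.2000 rather than a list per variable:

```r
df <- data.frame(
  Id   = c(1, 1, 2, 2, 3, 3),
  Year = c(2000, 2001, 2001, 2002, 2000, 2001),
  Var  = c(2, 1, 4, 3, 1, 5)
)
df$Fac <- interaction(c(1, 1, 1, 0, 0, 0), c(0, 0, 0, 1, 1, 1), drop = TRUE)

# One row per Id, one column per variable and year (Var.2000, Fac.2000, ...)
wide <- reshape(df, idvar = "Id", timevar = "Year", direction = "wide")

# Var columns remain numeric, so missing cells are ordinary NA
wide$Var.2000
```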

R ifelse condition with hourly data: frequency of continuously NA

With the help of sebastian-c, I figured out my problem with daily data. Please see: R ifelse condition: frequency of continuously NA
And now I have a data set with hourly data:
set.seed(1234)
day <- c(rep(1:2, each=24))
hr <- c(rep(0:23, 2))
v <- c(rep(NA, 48))
A <- data.frame(cbind(day, hr, v))
A$v <- sample(c(NA, rnorm(100)), nrow(A), prob=c(0.5, rep(0.5/100, 100)), replace=TRUE)
What I need to do is: if there are 4 or more (>= 4) consecutive missing day-hours (7AM-7PM) or 3 or more (>= 3) consecutive missing night-hours (7PM-7AM), delete the entire day from the data frame; otherwise just run linear interpolation. Thus, the second day should be entirely deleted from the data frame since there are 4 consecutive NAs during daytime (7AM-10AM). The result should preferably remain a data frame. Please help, thank you!
If I modify the NA_run function from the question you linked to take a variable named v instead of value and return the boolean rather than the data.frame:
NA_run <- function(x, maxlen){
runs <- rle(is.na(x$v))
any(runs$lengths[runs$values] >= maxlen)
}
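For readers unfamiliar with rle(), here is a small sketch of what NA_run computes, on a made-up vector v. rle() collapses the logical vector into runs, and indexing the lengths by the values keeps only the runs of NA:

```r
# Illustrative vector: one run of three NAs and one single NA
v <- c(1, NA, NA, NA, 2, NA, 3)

runs <- rle(is.na(v))
runs$lengths  # 1 3 1 1 1
runs$values   # FALSE TRUE FALSE TRUE FALSE

# Lengths of the NA runs only
na_runs <- runs$lengths[runs$values]

any(na_runs >= 3)  # TRUE: there is a run of at least 3
any(na_runs >= 4)  # FALSE: no run of at least 4
```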
I can then write a wrapper function to call it twice for daytime and nighttime:
dropfun <- function(x) {
dt <- x$hr > 7 & x$hr < 19
daytime <- NA_run(x[dt,], 4)
nighttime <- NA_run(x[!dt,], 3)
any(daytime, nighttime)
}
Used with ddply from the plyr package, this gives me a data.frame flagging the days to drop:
> ddply(A, .(day), dropfun)
day V1
1 1 TRUE
2 2 FALSE
>
We can alter the dropfun to return the dataframe instead though:
dropfun <- function(x) {
dt <- x$hr > 7 & x$hr < 19
daytime <- NA_run(x[dt,], 4)
nighttime <- NA_run(x[!dt,], 3)
if(any(daytime, nighttime)) NULL else x
}
> ddply(A, .(day), dropfun)
day hr v
1 2 0 NA
2 2 1 NA
3 2 2 2.54899107
4 2 3 NA
5 2 4 -0.03476039
6 2 5 NA
7 2 6 0.65658846
8 2 7 0.95949406
9 2 8 NA
10 2 9 1.08444118
11 2 10 0.95949406
12 2 11 NA
13 2 12 -1.80603126
14 2 13 NA
15 2 14 NA
16 2 15 0.97291675
17 2 16 NA
18 2 17 NA
19 2 18 NA
20 2 19 -0.29429386
21 2 20 0.87820363
22 2 21 NA
23 2 22 0.56305582
24 2 23 -0.11028549
>
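The question also asks for linear interpolation on the days that are kept, which the code above does not cover. One possible sketch uses base R's approx(); the vectors hr and v are made-up illustration data, and rule = 2 (fill leading and trailing gaps with the nearest observed value) is an assumption about how the ends should be handled:

```r
# Illustrative hours and values with gaps
hr <- 0:5
v  <- c(NA, 1, NA, 3, NA, NA)

# Interpolate linearly over the observed points, evaluated at every hour
ok <- !is.na(v)
filled <- approx(x = hr[ok], y = v[ok], xout = hr, rule = 2)$y
filled
```

With the default rule = 1, the leading and trailing gaps would stay NA instead.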

Is there already a function to subtract different variables in subsequent quarters?

I have an unbalanced quarterly panel data set with missing values. I want to subtract variable A2 from A1 in subsequent quarters. Note that I do not want differences of A2 itself, but to subtract DIFFERENT variables from each other. Differences should be calculated separately for every uid. Also, quarters that span a change of year, like Q4 1999 and Q1 2000, are meant to be subsequent.
I am really not sure whether I should concatenate my time index here, since packages like zoo only take one index. But that's not the problem here. Here is some example data:
x <- structure(list(uid = c(1, 1, 1, 2, 2, 3, 3, 3), tndx = c(1999.4,
2000.1, 2000.2, 1999.4, 2000.1, 2000.1, 2000.2, 2000.3), A1 = c(2,
2, 2, 10, 11, 1, 1, 1), A2 = c(3, 3, 3, 14, 14, 2, 100, 2)), .Names = c("uid",
"tndx", "A1", "A2"), row.names = c(NA, -8L), class = "data.frame")
# which results in
uid tndx A1 A2
1 1 1999.4 2 3
2 1 2000.1 2 3
3 1 2000.2 2 3
4 2 1999.4 10 14
5 2 2000.1 11 14
6 3 2000.1 1 2
7 3 2000.2 1 100
8 3 2000.3 1 2
If you prefer a separated index, use this example:
# Thx Andrie!
x2 <- data.frame(x, colsplit(x$tndx, "\\.", names=c("year", "qtr")))
Is there a good way to solve this with reshape2, plyr or even base or would you rather write a custom function?
Note, it is also possible that some uid occurs only once. Obviously you can't calculate a lagged difference then. Still I need to check for that and create an NA then.
We split it on the uid using by and within the function that operates on each set of rows for a single uid, we create a zoo object, z, using yearqtr class for the index. Then we merge the time series with an empty series having all the desired quarters including any missing intermediate quarters giving zm and perform the computation giving zz. Finally we convert to the data.frame form on the way out:
library(zoo)
to.yearqtr <- function(x) as.yearqtr(trunc(x) + (10*(x-trunc(x))-1)/4)
DF <- do.call("rbind", by(x, x$uid, function(x) {
# columns of x are: uid tndx A1 A2
z <- zoo(x[c("A1", "A2")], to.yearqtr(x$tndx))
zm <- merge(z, zoo(, seq(start(z), end(z), 1/4)))
zz <- with(zm, cbind(zm, `A1 - A2 lag` = A1 - lag(A2, -1)))
if (ncol(zz) <= ncol(z)) zz$`A1 - A2 lag` <- NA # append NA if col not added
data.frame(uid = x[1, 1], tndx = time(zz), coredata(zz), check.names = FALSE)
}))
which gives this:
> DF
uid tndx A1 A2 result A1 - A2 lagged
1.1 1 1999 Q4 2 3 NA NA
1.2 1 2000 Q1 2 2 NA -1
1.3 1 2000 Q2 2 3 NA 0
2.1 2 1999 Q4 2 4 NA NA
2.2 2 2000 Q1 NA NA NA NA
2.3 2 2000 Q2 NA NA NA NA
2.4 2 2000 Q3 NA NA NA NA
2.5 2 2000 Q4 NA NA NA NA
2.6 2 2001 Q1 3 4 NA NA
3.1 3 2000 Q1 1 2 NA NA
3.2 3 2000 Q2 1 NA NA -1
3.3 3 2000 Q3 1 2 NA NA
EDIT: Completely re-did the solution based on further discussion. Note that this not only adds an extra column but it also converts the index to "yearqtr" class and adds the extra missing rows.
EDIT: Some minor simplifications in the by function.
I wasn't entirely clear what you wanted because you didn't include a "right answer". If you want to subtract one lagged variable from another unlagged variable, you can do that with indexing that is offset. (You do need to pad the result if you want it to get put back into the dataframe.)
x$A1lagA2 <- ave(x[, c("A1", "A2")], x$uid, FUN=function(z) {
with(z, c(NA, A1[2:NROW(z)] -A2[1:(NROW(z)-1)]) ) } )[[1]]
x
uid tndx A1 A2 A1lagA2
1 1 1999.4 2 3 NA
2 1 2000.1 2 3 -1
3 1 2000.2 2 3 -1
4 2 1999.4 10 14 NA
5 2 2000.1 11 14 -3
6 3 2000.1 1 2 NA
7 3 2000.2 1 100 -1
8 3 2000.3 1 2 -99
You do get annoying duplicate extra columns with ave() when its argument is multicolumn, but I just took the first one.
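The offset indexing pairs adjacent rows whether or not their quarters are consecutive. If a missing quarter should produce NA instead (an assumption about the intended result), one base R sketch converts tndx to a running quarter count and looks up the previous quarter explicitly; q_index and lag_diff are hypothetical helper names:

```r
x <- data.frame(
  uid  = c(1, 1, 1, 2, 2, 3, 3, 3),
  tndx = c(1999.4, 2000.1, 2000.2, 1999.4, 2000.1, 2000.1, 2000.2, 2000.3),
  A1   = c(2, 2, 2, 10, 11, 1, 1, 1),
  A2   = c(3, 3, 3, 14, 14, 2, 100, 2)
)

# Hypothetical helper: 1999.4 -> 1999 * 4 + 3, so Q4 1999 and Q1 2000 differ by 1
q_index <- function(tndx) {
  yr <- floor(tndx)
  q  <- round((tndx - yr) * 10)
  yr * 4 + (q - 1)
}

# Hypothetical helper: A1 minus A2 of the previous quarter, NA if it is absent
lag_diff <- function(d) {
  qi   <- q_index(d$tndx)
  prev <- match(qi - 1, qi)
  d$A1 - d$A2[prev]
}

# Apply per uid and put the results back in the original row order
x$A1lagA2 <- unsplit(lapply(split(x, x$uid), lag_diff), x$uid)
```

On this example there are no gaps within a uid, so the column matches the ave() result above. A uid that occurs only once gets NA, since it has no previous quarter.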

Resources