Taking the content in this thread a bit further: I've gone as far as I can, but I've finally hit a wall. I'm looking to use plyr to create ARIMA models with exogenous regressors at scale. Here is a high-level overview of the process I've been using (code with example data follows):
1) I have a dataframe with businesses, regions, revenue and orders, all by date
2) For each combination of business + region, I want to create a forecast for revenue based on previous values of revenue + previous values of orders.
3) I want to use an ARIMA model (using auto.arima() ) to figure out optimal orders for both revenue and orders, then apply that information to a forecast function
4) The problem I run into seems to boil down to not being able to pass multiple lists to a plyr function to operate on, which most likely in turn boils down to my not fully understanding how llply works (so hopefully this is an easy task)
Here's some sample data I'm working off:
library(plyr)
library(xts)
library(forecast)
data <- data.frame(
biz = sample(c("telco","shipping","tech"), 100, replace = TRUE),
region = sample(c("mideast","americas","asia"), 100, replace = TRUE),
date = rep(seq(as.Date("2010-02-01"), length=10, by = "1 day"),10),
revenue = sample(1:100),
orders = sample(1:100)
)
Edit: First, reorganize data through ddply to get rid of duplicate entries:
dataframe <- ddply(data, c("biz","region","date"), function(df) {
c(revenue = sum(df[,4]),
orders = sum(df[,5]))
})
Step 1: Create a list that contains the time series info for each combination of business + region:
list1 <- dlply(dataframe, .(biz,region), identity)
Step 2: Turn that list into an XTS object so we can use it for time-series analysis:
xtsobject <- llply(list1, function(list) {
xts(x=list[,c("revenue","orders")], order.by=list[,"date"])
})
Here's where I run into trouble. I want to make a list of orders from the auto.arima() function to pass into a forecast.Arima() function. This would be straightforward if I were just doing one variable with no exogenous regressors:
arimamodel1 <- llply(xtsobject, function(list) {
fity <- auto.arima(list$revenue)
})
And then I would apply that list to the forecast.Arima() function:
forecast1 <- llply(arimamodel1, function(model) {
forecast.Arima(model, h=2)
})
That comes out fine. I've tried changing the argument to include some room for the extra regressors, but I'm not sure the forecasts are actually pulling in the x values:
arimamodel2 <- llply(xtsobject, function(list) {
fity <- auto.arima(list$revenue, xreg=list$orders)
fitx <- auto.arima(list$orders)
})
and the forecasts:
forecast2 <- llply(arimamodel2, function(model) {
forecast.Arima(model, h=2)
})
... But it seems like in the forecast function, I should be doing something to account for the x regressor model in the way I normally use forecast.Arima() with multiple regressors; something like:
forecast.Arima(model,h=2, xreg=forecast(model,h=2)$mean)
But this doesn't work. Does anybody have any insight into how to use PLYR to make forecasts based on auto.arima() for multiple regressors?
I'm pretty sure I figured this out, in case anybody stumbles onto this question. It's just a matter of writing a function that does all of these steps, then passing that function to lapply or llply (the data in the question won't work for auto.arima because of the way it was created, but it works on the actual data I'm using):
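arimafunc <- function(list) {
  # model revenue with orders as an exogenous regressor
  fity <- auto.arima(list$revenue, xreg=list$orders)
  # separate model for orders, used to forecast the regressor itself
  fitx <- auto.arima(list$orders)
  # forecast() dispatches to the Arima method; newer forecast releases no longer export forecast.Arima()
  fc <- forecast(fity, h=2, xreg=forecast(fitx, h=2)$mean)
  return(fc)
}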
then pass through the list apply:
forecasts <- lapply(xtsobject,FUN=arimafunc)
I'm sure there's a way to do this using built-in functionality of something like llply or from one of the base commands, mapply, but this works for now...
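The same two-step idea (forecast the regressor first, then feed those values in as the future regressor) can be sketched in base R. This is only an illustration under assumptions: stats::arima() with a fixed AR(1) order stands in for auto.arima(), the two groups are simulated, and the names make_group and forecast_group are made up for the example.

```r
# Sketch only: simulated groups, fixed AR(1) order instead of auto.arima().
set.seed(1)
make_group <- function(n = 60) {
  orders  <- as.numeric(arima.sim(list(ar = 0.5), n = n)) + 50
  revenue <- 2 * orders + as.numeric(arima.sim(list(ar = 0.3), n = n))
  data.frame(revenue = revenue, orders = orders)
}
groups <- list(telco.asia = make_group(), tech.americas = make_group())

forecast_group <- function(df, h = 2) {
  # model for the regressor on its own
  fit_x <- arima(df$orders, order = c(1, 0, 0))
  # model for revenue with orders as an exogenous regressor
  fit_y <- arima(df$revenue, order = c(1, 0, 0), xreg = df$orders)
  # forecast the regressor, then pass it as newxreg to the revenue model
  future_x <- predict(fit_x, n.ahead = h)$pred
  as.numeric(predict(fit_y, n.ahead = h, newxreg = future_x)$pred)
}

# one call per group; the result is a named list of h-step forecasts
forecasts <- lapply(groups, forecast_group)
```

Because everything a group needs lives in one list element, a single-argument function passed to lapply (or llply) is enough; there is no need to pass several lists in parallel.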
I'm trying to combine the results of two function calls and return them in a single column.
My code
myfun <-function(x){
fit <-Arima(x, order = c(1,1,1), seasonal = list(order = c(0,1,0), period = 52),include.mean=TRUE,
include.constant = FALSE, method = 'CSS')
fit_a <- forecast(fit$fitted)
fit_a <- data.frame(fit_a$fitted)
colnames(fit_a)[1] <- "load"
fit_a$load <- as.data.frame(fit_a$load)
fit_b <- data.frame(forecast(fit,h=400))
fit_b <- data.frame(fit_b$Point.Forecast)
colnames(fit_b)[1] <- "load"
fit_b$load <- as.data.frame(fit_b$load)
return(rbind(fit_a,fit_b))
}
I can get each set of values individually with return(fit_a) or return(fit_b), but rbind() fails because the two pieces are separate time series.
I tried c(fit_a,fit_b), which shows two different ts objects (confirming that both outputs exist and only the rbind() step is failing).
Can someone show me how to extract both the fitted and the forecasted values in the same function?
Thanks in advance!
One way to return more than one object is to create a list of the objects, then return that. In your case you could use this at the end of your function:
fit <- list(fit_a, fit_b)
return(fit)
Then you can access the elements using fit[[1]] or fit[[2]].
You also have the option of naming the elements so you can access them using the $, like so:
fit <- list(fit_a = fit_a, fit_b = fit_b)
return(fit)
Then you can use fit$fit_a and fit$fit_b
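As a rough illustration of the named-list approach, here is a base R sketch. It makes several assumptions: the series is simulated, stats::arima() with a simple (1,1,0) order replaces the seasonal model from the question, and fit_and_forecast is a hypothetical name. The last line shows one possible way to get a single column, by dropping the time-series attributes before concatenating.

```r
# Sketch only: simulated series and a simple ARIMA(1,1,0), not the asker's model.
set.seed(42)
x <- arima.sim(list(order = c(1, 1, 0), ar = 0.6), n = 100)

fit_and_forecast <- function(x, h = 12) {
  fit <- arima(x, order = c(1, 1, 0))
  # return both pieces as named list elements
  list(fitted   = x - residuals(fit),            # in-sample fitted values
       forecast = predict(fit, n.ahead = h)$pred) # out-of-sample forecasts
}

res <- fit_and_forecast(x)

# as.numeric() strips the ts attributes, so the two pieces can share one column
combined <- data.frame(load = c(as.numeric(res$fitted), as.numeric(res$forecast)))
```

The list keeps the two time series intact (res$fitted and res$forecast), while combined trades the time index for a plain stacked column.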
I'd like to do something like the following: (myData is a data table)
#create some data
myData = data.table(invisible.covariate=rnorm(50),
visible.covariate=rnorm(50),
category=factor(sample(1:3,50, replace=TRUE)),
treatment=sample(0:1,50, replace=TRUE))
myData[,outcome:=invisible.covariate+visible.covariate+treatment*as.integer(category)]
myData[,invisible.covariate:=NULL]
#process it
myData[treatment == 0,untreated.outcome:=outcome]
myData[treatment == 1,treated.outcome:=outcome]
myPredictors = matrix(0,ncol(myData),ncol(myData))
myPredictors[5,] = c(1,1,0,0,0,0)
myPredictors[6,] = c(1,1,0,0,0,0)
myImp = mice(myData,predictorMatrix=myPredictors)
fit1 = with(myImp, lm(treated.outcome ~ category)) #this works fine
for_each_imputed_dataset(myImp, #THIS IS NOT A REAL FUNCTION but I hope you get the idea
function(imputed_data_table) {
imputed_data_table[,treatment.effect:=treated.outcome-untreated.outcome]
})
fit2 = with(myImp, lm(treatment.effect ~ category))
#I want fit2 to be an object similar to fit1
...
I would like to add a calculated value to each imputed data set, then do statistics using that calculated value. Obviously the structure above is probably not how you'd do it. I'd be happy with any solution, whether it involves preparing the data table somehow before the mice, a step before the "fit =" as sketched above, or some complex function inside the "with" call.
The complete() function will generate the "complete" imputed data set for each of the requested iterations. But note that mice expects to work with data.frames, so it returns data.frames and not data.tables. (Of course you can convert if you like). But here is one way to fit all those models
imp = mice(myData,predictorMatrix=myPredictors)
fits<-lapply(seq.int(imp$m), function(i) {
lm(I(treated.outcome-untreated.outcome)~category, complete(imp, i))
})
fits
The results will be in a list and you can extract particular lm objects via fits[[1]], fits[[2]], etc
As part of my data analysis, I am using linear regression analysis to check whether I can predict tomorrow's value using today's data.
My data are about 100 time series of company returns. Here is my code so far:
returns <- read.zoo("returns.csv", header=TRUE, sep=",", format="%d-%m-%y")
returns_lag <- lag(returns)
lm_univariate <- lm(returns_lag$companyA ~ returns$companyA)
This works without problems. Now I wish to run a linear regression for each of the 100 companies. Since setting up each linear regression model manually would take too much time, I would like to use some kind of loop (or apply function) to shorten the process.
My approach:
test <- lapply(returns_lag ~ returns, lm)
But this leads to the error "unexpected symbol in "test2" " since the tilde is not being recognized there.
So, basically I want to run a linear regression for every company separately.
The only question that looks similar to what I wanted is Linear regression of time series over multiple columns , however there the data seems to be stored in a matrix and the code example is quite messy compared to what I was looking for.
Formulas are great when you know the exact name of the variables you want to include in the regression. When you are looping over values, they aren't so great. Here's an example that uses indexing to extract the columns of interest for each iteration
#sample data
x.Date <- as.Date("2003-02-01") + c(1, 3, 7, 9, 14) - 1
returns <- zoo(cbind(companya=rnorm(10), companyb=rnorm(10)), x.Date)
returns_lag <- lag(returns)
#loop over columns/companies
xx<-lapply(setNames(1:ncol(returns),names(returns)), function(i) {
today <-returns_lag[,i]
yesterday <-head(returns[,i], -1)
lm(today~yesterday)
})
xx
This will return the results for each column as a list.
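For comparison, the same column-by-column idea can be sketched without zoo, using a plain data frame and explicit shifting. The returns here are simulated and the column names are placeholders; with real data the rows would need to be sorted by date first.

```r
# Sketch only: simulated returns in a plain data.frame, one column per company.
set.seed(1)
returns <- data.frame(companyA = rnorm(50), companyB = rnorm(50))
n <- nrow(returns)

# lapply over a data.frame iterates over its columns and keeps their names
fits <- lapply(returns, function(r) {
  d <- data.frame(tomorrow = r[-1],   # observations 2..n
                  today    = r[-n])   # observations 1..(n-1)
  lm(tomorrow ~ today, data = d)
})

# one row of coefficients per company
coefs <- t(sapply(fits, coef))
```

Building a small data frame inside the function keeps the formula fixed (tomorrow ~ today), so no formula has to be constructed from column names.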
Using the dyn package (which loads zoo) we can do this:
library(dyn)
z <- zoo(EuStockMarkets) # test data
lapply(as.list(z), function(z) dyn$lm(z ~ lag(z, -1)))
I'm having some difficulties figuring out how to approach this problem. I have a data frame that I am splitting into distinct sites (link5). Once split, I basically want to run a linear regression model on the subsets. Here is the code I'm working with, but it's definitely not correct. Also, it would be great if I could output the model results to a new data frame such that each site would have one row with the model parameter estimates; that is just a wish and not a necessity right now. Thank you for any help!
les_events <- split(les, les$link5)
result <- lapply(les_events) {
lm1 <-lm(cpe~K,data=les_events)
coef <- coef(lm1)
q.hat <- -coef(lm1)[2]
les_events$N0.hat <- coef(lm1[1]/q.hat)
}
You have a number of issues.
You haven't passed a function (the FUN argument) to lapply.
Your closure (the bit inside {}) is almost, but not quite, the body you want for your function.
Something like the following will return the coefficients from your models:
result <- lapply(les_events, function(DD){
lm1 <-lm(cpe~K,data=DD)
coef <- coef(lm1)
data.frame(as.list(coef))
})
This will return a list of data.frames containing columns for each coefficient.
lapply(les_events, lm, formula = 'cpe~K')
will return a list of linear model objects, which may be more useful.
For more general split / apply / combine approaches, use plyr or data.table.
data.table
library(data.table)
DT <- data.table(les)
# index DT (the data.table), not les, and return the coefficients as columns
result <- DT[, {lm1 <- lm(cpe ~ K, data = .SD)
as.list(coef(lm1))}, by = link5]
plyr
library(plyr)
result <- ddply(les, .(link5), function(DD){
lm1 <-lm(cpe~K,data=DD)
coef <- coef(lm1)
data.frame(as.list(coef))
})
# or to return a list of linear model objects
dlply(les, .(link5), function(DD){ lm(cpe ~K, data =DD)})
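The one-row-per-site table the question wishes for can also be sketched in base R with split, lapply and do.call(rbind, ...). The data below are simulated, and N0.hat is computed as intercept / q.hat with q.hat = -slope, which is my reading of what the question's code intended rather than something the source confirms.

```r
# Sketch only: simulated data with three sites; column names follow the question.
set.seed(2)
les <- data.frame(link5 = rep(c("s1", "s2", "s3"), each = 10),
                  K     = rep(1:10, 3))
les$cpe <- 5 - 0.3 * les$K + rnorm(30, sd = 0.2)

per_site <- lapply(split(les, les$link5), function(DD) {
  cf    <- coef(lm(cpe ~ K, data = DD))
  q.hat <- -cf[[2]]                        # assumed meaning: negative slope
  data.frame(link5     = DD$link5[1],
             intercept = cf[[1]],
             slope     = cf[[2]],
             N0.hat    = cf[[1]] / q.hat)  # assumed meaning: intercept / q.hat
})

# stack the one-row data frames into a single table
result <- do.call(rbind, per_site)
```

Each element of per_site is a one-row data frame, so the final rbind yields exactly one row per site.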
I am trying to get a rolling prediction of a dynamic timeseries in R (and then work out squared errors of the forecast). I based a lot of this code on this StackOverflow question, but I am very new to R so I am struggling quite a bit. Any help would be much appreciated.
require(zoo)
require(dynlm)
set.seed(12345)
#create variables
x<-rnorm(mean=3,sd=2,100)
y<-rep(NA,100)
y[1]<-x[1]
for(i in 2:100) y[i]=1+x[i-1]+0.5*y[i-1]+rnorm(1,0,0.5)
int<-1:100
dummydata<-data.frame(int=int,x=x,y=y)
zoodata<-as.zoo(dummydata)
prediction<-function(series)
{
mod<-dynlm(formula = y ~ L(y) + L(x), data = series) #get model
nextOb<-nrow(series)+1
#make forecast
predicted<-coef(mod)[1]+coef(mod)[2]*zoodata$y[nextOb-1]+coef(mod)[3]*zoodata$x[nextOb-1]
#strip timeseries information
attributes(predicted)<-NULL
return(predicted)
}
rolling<-rollapply(zoodata,width=40,FUN=prediction,by.column=FALSE)
This returns:
20 21 ..... 80
10.18676 10.18676 10.18676
Which has two problems I was not expecting:
Runs from 20->80, not 40->100 as I would expect (as the width is 40)
The forecasts it gives out are constant: 10.18676
What am I doing wrong? And is there an easier way to do the prediction than to write it all out? Thanks!
The main problem with your function is the data argument to dynlm. If you look at ?dynlm you will see that the data argument must be a data.frame or a zoo object. Unfortunately, as I just learned, rollapply passes each window to your function as a plain array rather than a zoo object. This means that dynlm, after noting that your data argument was not of the right form, searched for x and y in your global environment, where of course they were defined at the top of your code, so every window was fitted to the same full series. The solution is to convert series into a zoo object. There were a couple of other issues with your code, so I am posting a corrected version here:
prediction<-function(series) {
mod <- dynlm(formula = y ~ L(y) + L(x), data = as.zoo(series)) # get model
# nextOb <- nrow(series)+1 # This will always be 21. I think you mean:
nextOb <- max(series[,'int'])+1 # To get the first row that follows the window
if (nextOb<=nrow(zoodata)) { # You won't predict the last one
# make forecast
# predicted<-coef(mod)[1]+coef(mod)[2]*zoodata$y[nextOb-1]+coef(mod)[3]*zoodata$x[nextOb-1]
# That would work, but there is a very nice function called predict
predicted=predict(mod,newdata=data.frame(x=zoodata[nextOb,'x'],y=zoodata[nextOb,'y']))
# I'm not sure why you used nextOb-1
attributes(predicted)<-NULL
# I added the square error as well as the prediction.
c(predicted=predicted,square.res=(predicted-zoodata[nextOb,'y'])^2)
}
}
rollapply(zoodata,width=20,FUN=prediction,by.column=FALSE,align='right')
Your second question, about the numbering of your results, is controlled by the align argument of rollapply: "left" would give you 1..60, "center" (the default) would give you 20..80, and "right" gets you 40..100.
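If it helps to see the window bookkeeping spelled out, here is a base R sketch of a rolling one-step-ahead forecast with its squared error. It swaps dynlm and rollapply for lm and an explicit loop over window end points, with the lags built by hand, so it is an illustration of the idea and not a drop-in replacement for the code above.

```r
# Sketch only: same simulated process as the question, lags built by hand.
set.seed(12345)
x <- rnorm(100, mean = 3, sd = 2)
y <- numeric(100)
y[1] <- x[1]
for (i in 2:100) y[i] <- 1 + x[i - 1] + 0.5 * y[i - 1] + rnorm(1, 0, 0.5)

# row j holds y at time j + 1 together with the lagged values from time j
d <- data.frame(y = y[-1], y_lag = y[-100], x_lag = x[-100])

width <- 40
ends  <- width:(nrow(d) - 1)   # last row of each window; the next row is forecast

out <- t(sapply(ends, function(e) {
  win  <- d[(e - width + 1):e, ]                 # the 40 most recent rows
  mod  <- lm(y ~ y_lag + x_lag, data = win)      # refit on this window only
  pred <- unname(predict(mod, newdata = d[e + 1, ]))
  c(time = e + 2, predicted = pred, square.res = (pred - d$y[e + 1])^2)
}))
```

Here the time column labels each forecast with the period being predicted, which is the role the align argument plays in rollapply.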