Forecasting multivariate data with auto.arima in R

I am trying to forecast sales from weekly data. The data consist of 104 weeks with these variables: week number, sales, average price per unit, holiday (whether that week contains a holiday or not) and promotion (whether any promotion is running). The last 6 observations of the data set look like this:
Week   Sales    Avg.price.unit   Holiday   Promotion
101     8,970               50         0           1
102    17,000               50         1           1
103    23,000               80         1           0
104    28,000              180         1           0
105                        176         1           0
106                         75         0           1
Now I want to forecast the 105th and 106th weeks. So I created a univariate time series x using the ts function and then ran auto.arima:
x <- ts(sales$Sales, frequency=7)
fit <- auto.arima(x, xreg=external, test=c("kpss","adf","pp"),
                  seasonal.test=c("ocsb","ch"), allowdrift=TRUE)
fit
ARIMA(1,1,1)

Coefficients:
          ar1      ma1  Avg.price.unit   Holiday  Promotion
      -0.1497  -0.9180          0.0363  -10.4181    -4.8971
s.e.   0.1012   0.0338          0.0646    5.1999     5.5148

sigma^2 estimated as 479.3:  log likelihood = -465.09
AIC=942.17   AICc=943.05   BIC=957.98
Now, to forecast the values for the last 2 weeks (105th and 106th), I supply the future values of the regressors:
forecast(fit, xreg=ext)
where ext contains the future values of the regressors for the last 2 weeks.
The output comes as:
         Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
15.85714       44.13430 16.07853 72.19008 1.226693 87.04191
16.00000       45.50166 17.38155 73.62177 2.495667 88.50765
The output looks incorrect, since the forecasted sales values are far too low: the sales values in the training data are generally in the thousands.
If anyone can tell me why the output is incorrect/unexpected, that would be great.
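One thing worth checking, given the comma-formatted Sales values in the listing above, is whether Sales was actually imported as a number, and whether the regressors are passed as numeric matrices. A rough sketch of that setup (column names taken from the table above; the external/ext names simply mirror the question):

# Comma-formatted values such as "8,970" read from a file arrive as character/factor;
# as.numeric() on a factor returns the small integer level codes, so strip the commas first
sales$Sales <- as.numeric(gsub(",", "", as.character(sales$Sales)))

library(forecast)
x <- ts(sales$Sales[1:104], frequency = 7)

# xreg should be a numeric matrix, with the same columns (in the same order) for fitting and forecasting
external <- as.matrix(sales[1:104,   c("Avg.price.unit", "Holiday", "Promotion")])
ext      <- as.matrix(sales[105:106, c("Avg.price.unit", "Holiday", "Promotion")])

fit <- auto.arima(x, xreg = external)
forecast(fit, xreg = ext)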

If you knew a priori that certain weeks of the year or certain events in the year were possibly important, you could form a Transfer Function that could be useful. You might have to include some ARIMA structure to deal with short-term autoregressive structure AND/OR some Pulse/Level Shift/Local Trend components to deal with unspecified deterministic series (omitted variables). If you would like to post all of your data I would be glad to demonstrate that for you, thus providing ground-zero help. Alternatively you can email it to me at dave#autobox.com and I will analyze it and post the data and the results to the list. Other commentators on this question might also want to do the same for comparative analytics.

Where are the 51 weekly dummies in your model? Without them you have no way to capture seasonality.
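For reference, a rough sketch of one way to add week-of-year seasonality through xreg with the forecast package, assuming the 104 weeks are treated as two cycles of an annual (frequency 52) series; seasonaldummy() generates exactly these dummies:

library(forecast)
x <- ts(sales$Sales[1:104], frequency = 52)

season_hist <- seasonaldummy(x)          # 51 week-of-year dummy columns for the fitting period
season_fut  <- seasonaldummy(x, h = 2)   # the matching dummies for weeks 105 and 106

fit <- auto.arima(x, xreg = cbind(external, season_hist))
forecast(fit, xreg = cbind(ext, season_fut))

With only 104 observations, 51 dummies use up a lot of degrees of freedom; fourier(x, K = k) and fourier(x, K = k, h = 2) give a more parsimonious set of seasonal regressors.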

Predicting survival within time (cumulative hazard) [duplicate]

This question was closed as a duplicate of: Extract survival probabilities in Survfit by groups.
By using R, how can one develop an index score for predicting patient overall survival (OS)?
I have a shortlist of 4 candidate predictors that were shown to be associated with OS. They resulted from multivariable Cox regression (run with coxph()). The predictors are protein levels, so they are all continuous variables.
The data table looks something like this (showing only n=10 here):
days Status Prot1 Prot13 Prot7 Prot21
Subj_1 115.69 0 2.284498 6.319168 6.070115 8.457412
Subj_2 72.30 1 2.473034 6.066573 6.140178 8.225987
Subj_3 1.08 1 2.662481 6.212845 6.971018 8.128949
Subj_4 69.63 1 2.761391 5.902610 6.433883 7.876319
Subj_5 78.41 1 3.038122 6.355257 6.852981 7.500973
Subj_6 42.90 1 2.058549 6.020681 7.231307 8.164025
Subj_7 31.00 1 2.305096 5.415107 8.126941 8.566320
Subj_8 51.12 1 2.931978 5.574601 7.503275 7.529957
Subj_9 11.01 1 2.218814 6.270222 6.710297 8.193895
Subj_10 27.68 1 2.821947 6.132379 6.911071 8.428218
The question is: how can I create a formula capable of classifying these patients into 2 groups: one where the estimated survival over a 1-year period is <60%, and another containing those with estimated survival >60% over the same period?
Would there be any function() in R that deals with that?
Thanks a lot in advance.
I think you should post this question at https://stats.stackexchange.com, since it is a matter of statistics. Anyway, you could start with a binomial regression, but there are many other models you could try. How many subjects do you have?
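That said, a minimal sketch of one way to answer the original question with the survival package, assuming the data frame is called dat with the columns shown above and that follow-up in the full data extends to at least a year: fit the Cox model, extract each patient's predicted survival probability at 365 days from survfit(), and split at 0.60.

library(survival)

fit <- coxph(Surv(days, Status) ~ Prot1 + Prot13 + Prot7 + Prot21, data = dat)

# One predicted survival curve per patient, evaluated at 365 days
sf   <- survfit(fit, newdata = dat)
p365 <- as.vector(summary(sf, times = 365, extend = TRUE)$surv)

dat$group <- ifelse(p365 < 0.60, "predicted 1-year survival < 60%",
                                 "predicted 1-year survival >= 60%")
table(dat$group)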

Regression - out-of-sample forecasting

I am trying to figure out how to deal with my forecasting problem, and I am not sure whether my understanding in this field is right, so it would be really nice if someone could help me. First of all, my goal is to forecast a time series with regression. Instead of using an ARIMA model or other heuristic models, I want to focus on machine learning techniques such as random forest regression, k-nearest-neighbour regression, etc. Here is an overview of the dataset:
Timestamp UsageCPU UsageMemory Indicator Delay
2014-01-03 21:50:00 3123 1231 1 123
2014-01-03 22:00:00 5123 2355 1 322
2014-01-03 22:10:00 3121 1233 2 321
2014-01-03 22:20:00 2111 1234 2 211
2014-01-03 22:30:00 1000 2222 2 0
2014-01-03 22:40:00 4754 1599 1 0
The timestamp increases in steps of 10 minutes, and I want to predict UsageCPU (the dependent variable) from UsageMemory, Indicator, etc. (the independent variables). At this point I will explain my general understanding of the prediction part. For the prediction it is necessary to separate the dataset into training, validation and test sets. My dataset, which contains 2 whole weeks, is split into 60% training, 20% validation and 20% test. This means the training set contains the first 8 days, and the validation and test sets each contain 3 days. After that I can train a model in SparkR (the settings are not important):
model <- spark.randomForest(train, UsageCPU ~ UsageMemory + Indicator + Delay,
                            type = "regression", maxDepth = 30, maxBins = 50,
                            numTrees = 50, impurity = "variance",
                            featureSubsetStrategy = "all")
After this I can validate the results with the validation set and compute the RMSE to see the accuracy of the model and which points have to be tuned in the model-building part. Once that is finished, I can predict on the test dataset:
predictions <- predict(model, test)
So the prediction works fine, but this is only an in-sample forecast and cannot be used to predict, for example, the next day. In my understanding, in-sample prediction can only be used to predict data within the dataset, not future values that may happen tomorrow. I really want to predict, for example, the next day, or just the next 10 minutes / 1 hour, which is only possible with out-of-sample forecasting. I also tried something like the following (rolling regression) on the values predicted by the random forest, but in my case the rolling regression is only used for evaluating the performance of different regressors with respect to different parameter combinations, so in my understanding this is not out-of-sample forecasting:
t <- bind(prediction, RollingRegression3 = rollApply(prediction, fun=function(x) mean(UsageCPU), window=6, align='right'))
So in my understanding I need something (maybe lagged values?) before the model-building process starts. I have also read a lot of different papers and books, but there is no clear explanation of how to do it or what the key points are; they only write something like t+1 or t+n, and right now I do not even know how to do that. It would be really nice if someone could help me, because I have been trying to figure this out for three months now. Thank you.
Let’s see if I get your problem right. I suppose that, given a time window, e.g. 144 last observations (one day) of UsageCPU, UsageMemory, Indicator and Delay, you want to forecast the ‘n’ next observations of UsageCPU. One way you could do such a thing, using random forests, is assigning one model for each next observation you want to forecast. So, if you want to forecast the 10 next UsageCPU observations, you should train 10 random forest models.
Using the example I began with, you could split the data you have in chunks of 154 observations. In each, you will use the first 144 observations to forecast the last 10 values of UsageCPU. There are lots of ways in which you could use feature engineering to extract information from these first 144 observations to train your model with, e.g. mean for each variable, last observation of each variable, global mean for each variable. So, for each chunk you will get a vector containing a bunch of predictors and 10 target values.
Bind the vectors you got for each chunk and you’ll have a matrix where the first columns are the predictors and the last 10 columns are the targets. Train each random forest on the predictor columns and one of the target columns. Now you can apply the models to the features you extract from any data chunk containing the 144 observations. The model trained for target column 1 will ‘forecast’ one observation ahead, the model trained for target column 2 will ‘forecast’ two observations ahead, the model trained for target column 3 will ‘forecast’ three observations ahead...
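A rough sketch of that chunking idea in plain R, using the randomForest package rather than SparkR and a deliberately tiny, purely illustrative feature set (the data frame df, ordered by Timestamp, and the variable names come from the question):

library(randomForest)

window  <- 144   # observations used as predictors (one day at 10-minute steps)
horizon <- 10    # number of steps ahead to forecast
chunk   <- window + horizon

# Features summarising the first 144 observations of a chunk (extend as needed)
make_features <- function(hist) {
  data.frame(mean_cpu = mean(hist$UsageCPU),   last_cpu = tail(hist$UsageCPU, 1),
             mean_mem = mean(hist$UsageMemory), last_mem = tail(hist$UsageMemory, 1))
}

# One training row per chunk: features plus the 10 future UsageCPU values as targets
make_row <- function(d) {
  cbind(make_features(d[1:window, ]),
        t(setNames(d$UsageCPU[(window + 1):chunk], paste0("target", 1:horizon))))
}

starts <- seq(1, nrow(df) - chunk + 1, by = chunk)
train  <- do.call(rbind, lapply(starts, function(s) make_row(df[s:(s + chunk - 1), ])))

# One model per forecast horizon: model k predicts k steps ahead
feat_names <- c("mean_cpu", "last_cpu", "mean_mem", "last_mem")
models <- lapply(1:horizon, function(k)
  randomForest(reformulate(feat_names, paste0("target", k)), data = train))

# True out-of-sample forecast: features come only from the most recent 144 observations
forecasts <- sapply(models, predict, newdata = make_features(tail(df, window)))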

Making zero-inflated or hurdle model with R

I need to make a model that could estimate the probability that a registered user will buy some plan or no plan (i.e., will just use the free plan or won't do anything) and, if they do buy, after how much time.
I have data with around 13,000 rows; around 12,000 of them are free users (never paid, so a value of 0) and the other 1,000 paid after some time (from 1 to 690 days). I also have some count and categorical data: country, number of the user's clients, how many times they used a plan, and the plan itself (premium, free, premium plus).
The mean of the time to payment (zeros included) is around 6.37 and the variance is 1801.17; without the zeros they are 100 and 19012, which suggests to me that I should use a negative binomial model.
But I'm not sure which model fits best; I'm thinking about a zero-inflated negative binomial or hurdle model.
Here is a histogram of diff.time, with and without the zero values:
I tried these models with the pscl package:
summary(m1 <- zeroinfl(diff.time3 ~ factor(Registration.country) + factor(Plan) +
                         Campaigns.sent + Number.of.subscribers |
                         factor(Registration.country) + factor(Plan) +
                         Campaigns.sent + Number.of.subscribers,
                       data = df, link = "logit", dist = "negbin"))
or the same with hurdle()
but they gave me errors. With zeroinfl():
Error in quantile.default(x$residuals): missing values and NaN's not allowed if 'na.rm' is FALSE
In addition: Warning message: glm.fit: algorithm did not converge
and with hurdle():
Error in solve.default(as.matrix(fit_count$hessian)) : Lapack routine dgesv: system is exactly singular: U[3,3] = 0
I have never tried these models before so I'm not sure how to fix these errors or if I chose the right models.
Unfortunately, I have no opportunity to share part of my data, but I'll try to describe it:
1st column, "Plan": most of the data are "Free" (around 12,000); the rest are "Earning more", "Premium" or "Premium trial", where "Free" and "Premium trial" are not paid.
2nd column, "Plan used": around 8,000 rows are 0, 1,000 are 1, 3,000 are between 1 and 10, and another 1,000 are between 10 and 510.
3rd column, "Clients", describes how many clients a user has: around 2,000 have 0, 4,000 have 1-10, 3,000 have 10-200, 2,000 have 200-1,000, and 2,000 have 1,000-340,000.
4th column, "Registration country": 36 different countries; over half of the data is United States, the others have from 5 to a few hundred rows each.
5th column is diff.time, which should be my dependent variable; as I said before, most of the data are 0 (12,000) and the others vary from 1 day to 690 days.
If your actual data is similarly structured to the data you posted then you will have problems estimating a model like the one you specified. Let's first have a look at the data you posted on the Google drive:
load("duom.Rdata")
table(a$diff.time3 > 0)
## FALSE TRUE
## 950 50
Thus there is some variation in the response but not a lot: only 5% non-zeros, i.e. 50 non-zero observations overall. From this information alone it might seem more reasonable to estimate a bias-reduced binary model (brglm) for the hurdle part (zero vs. non-zero).
For the zero-truncated count part you can possibly fit a model but you need to be careful which effects you want to include because there are only 50 degrees of freedom. You can estimate the zero-truncated part of the hurdle model using the zerotrunc function in package countreg, available from R-Forge.
Also you should clean up your factors. By re-applying the factor function within the formula, levels with zero occurrences are excluded. But there are also levels with only one occurrence for which you will not get meaningful results.
table(factor(a$Plan))
## Earning much more Free Mailing Premium
## 1 950 1 24
## Premium trial
## 24
table(factor(a$Registration.country))
## australia Australia Austria Bangladesh Belgium brasil Brasil
## 1 567 7 5 56 1 53
## Bulgaria Canada
## 10 300
Also, you need to clean up the country levels with all lower-case letters.
After that I would start out by building a binary GLM for zero vs. non-zero, and based on those results continue with the zero-truncated count part.
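A rough sketch of that two-stage approach on the posted data, assuming the cleaned data frame a from above; brglm() comes from the brglm package and zerotrunc() from countreg on R-Forge, and which regressors to keep is left open given only 50 non-zero observations:

library(brglm)      # bias-reduced binary GLM
library(countreg)   # zerotrunc(); installed from R-Forge

# Clean up the factors first: merge case variants and drop empty levels
a$Registration.country <- droplevels(factor(tolower(as.character(a$Registration.country))))
a$Plan                 <- droplevels(factor(a$Plan))
a$paid                 <- as.numeric(a$diff.time3 > 0)

# Hurdle part: bias-reduced logistic regression for zero vs. non-zero
# (factors with near-empty levels are left out here; merge their levels before adding them)
m_zero <- brglm(paid ~ Campaigns.sent + Number.of.subscribers,
                family = binomial("logit"), data = a)

# Count part: zero-truncated negative binomial, fitted on the 50 payers only
m_count <- zerotrunc(diff.time3 ~ Campaigns.sent + Number.of.subscribers,
                     dist = "negbin", data = subset(a, diff.time3 > 0))

summary(m_zero)
summary(m_count)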

R time at risk for each group

I have been preparing survival analysis and cox regression in R. However, my line manager is a Stata user and wants the output displayed in a similar way as Stata would display it, e.g.
# Stata code
. strate
. stsum, by (GROUP)
stsum will output a time at risk for each group and an incidence rate, and I can't figure out how to achieve this with R.
The data look roughly like this (I can't get to it as it's in a secure environment):
PERS GROUP INJURY FOLLOWUP
111 1 0 2190
222 2 1 45
333 1 1 560
444 2 0 1200
So far I have been using fairly bog standard code:
library(survival)
library(coin)
# survival analysis
table(data$INJURY, data$GROUP)
survdiff(Surv(FOLLOWUP, INJURY)~GROUP, data=data)
surv_test(Surv(FOLLOWUP, INJURY)~factor(GROUP), data=data)
surv.all <- survfit(Surv(FOLLOWUP, INJURY)~GROUP, data=data)
print(surv.all, print.rmean=TRUE)
# cox regression
cox.all <- coxph(Surv(FOLLOWUP, INJURY)~GROUP, data=data)
summary(cox.all)
At the moment we have 4 lines of data and no clear description (at least to a non-user of Stata) of the desired output:
dat <- read.table(text="PERS GROUP INJURY FOLLOWUP
111 1 0 2190
222 2 1 45
333 1 1 560
444 2 0 1200",header=TRUE)
I do not know if there are functions in either the coin or the survival packages that deliver a crude event rate for such data. It is trivial to deliver crude event rates (using 'crude' in the technical sense with no disparagement intended) with ordinary R functions:
by(dat, dat$GROUP, function(d) sum(d$INJURY)/sum(d$FOLLOWUP) )
#----------------
dat$GROUP: 1
[1] 0.0003636364
------------------------------------------------------
dat$GROUP: 2
[1] 0.0008032129
The corresponding function for time at risk (or both printed to the console) would be a very simple modification. It's possible that the 'Epi' or 'epiR' package, or one of the other packages devoted to teaching basic epidemiology, has ready-made functions for this. The 'survival' and 'coin' authors may not have seen a need to write up and document such a simple function.
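For example, a small base-R helper in that spirit, applied to the dat example above, returning person-time, events and the crude rate per group much like stsum does:

stsum_r <- function(d, group, time, event) {
  do.call(rbind, by(d, d[[group]], function(g)
    data.frame(group        = g[[group]][1],
               time_at_risk = sum(g[[time]]),
               events       = sum(g[[event]]),
               rate         = sum(g[[event]]) / sum(g[[time]]))))
}
stsum_r(dat, "GROUP", "FOLLOWUP", "INJURY")
#   group time_at_risk events         rate
# 1     1         2750      1 0.0003636364
# 2     2         1245      1 0.0008032129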
When I needed to aggregate the ratios of actual to expected events within strata of factor covariates, I had to construct a function that properly created the stratified tables of events (to support confidence estimates) and sums of "expecteds" (calculated on the basis of age, gender and duration of observation), and then divided to get the A/E ratios. I assemble them into a list object and round the ratios to 2 decimal places. When I got it finished, I found these most useful as a sanity check against the results I was getting with the 'survival' and 'rms' regression methods I was using. They also help explain results to a non-statistical audience that is more familiar with tabular methods than with regression. I now have it as part of my startup .profile.

Lme error: "Error in reStruct"

Four beehives were equipped with sensors that collected temperature, humidity, pressure and decibel levels inside the hive. These are the response variables.
The treatment was wifi exposure: the experimental groups were exposed to wifi from day 1 to day 20, then again from day 35 to day 45, and data were collected until day 54. Number of hives = 4; number of data points collected by the sensors in each hive = ~1 million.
I am having difficulties running mixed effects models.
There is a data frame of all the hives' response variables.
names(Hives)
[1] "time" "dht22_t" "dht11_t" "dht22_h"
[5] "dht11_h" "db" "pa" "treatment_hive"
[9] "wifi"
time is in "%Y-%m-%d %H:%M:%S", dht11/22_t/h are either temperature and humidity data. "wifi" is a dichotomous variable (1=on 0=off) that corresponds to the time of exposure, and treatment hive is another dichotomous variable for the hives exposed to wifi (1=exposure, 0=control).
Here is the error i am getting.
attach(Hives)
model2 = lme(pa_t~wifi*treatment_hive, random=time, na.action=na.omit, method="REML",)
Error in reStruct(random, REML = REML, data = NULL) :
Object must be a list or a formula
Here is a sample of the data:
time dht22_t dht11_t dht22_h dht11_h db pa treatment_hive wifi
1 01/09/2014 15:19 NA NA NA NA 51.75467 NA 0 1
2 01/09/2014 15:19 30.8 31 59.8 44 55.27682 100672 0 1
3 01/09/2014 15:19 30.8 31 60.3 44 54.81995 100675 0 1
4 01/09/2014 15:19 30.8 31 60.9 44 54.14134 100671 0 1
5 01/09/2014 15:19 30.8 31 61.1 44 53.88574 100672 0 1
6 01/09/2014 15:19 30.8 31 61.2 44 53.68800 100680 0 1
R version 2.15.1 (2012-06-22)
Platform: i486-pc-linux-gnu (32-bit)
attached packages:
[1] ggplot2_0.8.9 proto_0.3-9.2 reshape_0.8.4 plyr_1.7.1 nlme_3.1-104
[6] lme4_0.999999-0 Matrix_1.0-6 lattice_0.20-6
There are a variety of issues here, some relevant to programming (StackOverflow) but probably the statistical issues (suitable for CrossValidated or r-sig-mixed-models#r-project.org) are more important.
tl;dr If you just want to avoid the error I think you need random=~1|hive (whatever your hive-indicator variable is) to fit a model where baseline response (intercept) varies across hives, but I'd encourage you to read on ...
- Can we have a (small!) reproducible example?
- Don't use attach(Hives); use data=Hives in your lme() call (not necessarily the problem, but [much] better practice).
- With only 4 hives it is a bit questionable whether a random-effect specification across hives will work (although with a million observations you might get away with it).
- The random effect must be based on a categorical (factor) grouping variable; in your case I think "hive" is the grouping variable, but I can't tell from your question which variable identifies hives.
- You should almost certainly have a model that accounts for trends in time and for variation in time trends across hives, i.e. a random-slopes model, which would be expressed as formula=...~...+time, random=~time|hive (where the ... represents the bits of your existing model); see the sketch after this list.
- You'll have to convert time to something sensible to use it in your model (see ?strptime or the lubridate package); something like seconds/minutes/hours from the starting time is probably most sensible. (What is your time resolution? Do you have multiple sensors per hive, in which case you should consider fitting a random effect of sensor as well?)
- With millions of data points your model fit is likely to be very slow; you might want to consider the lme4 package.
- With millions of data points everything is going to be statistically significant, and very sensitive to aspects of the model that don't appear in the data, such as (1) nonlinear trends in time (e.g. consider fitting additive models of the time trends with mgcv::gamm or the gamm4 package); (2) temporal autocorrelation (consider adding a correlation parameter in your lme model).
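A minimal sketch of what that random-slopes specification could look like, assuming a (hypothetical) hive column identifying the four hives and using hours since the start of the experiment as the time scale (adjust the format string to however time is actually stored):

library(nlme)

# Convert time to a numeric scale: hours since the first observation
Hives$time  <- as.POSIXct(Hives$time, format = "%d/%m/%Y %H:%M")   # format as in the sample shown
Hives$hours <- as.numeric(difftime(Hives$time, min(Hives$time, na.rm = TRUE), units = "hours"))
Hives$hive  <- factor(Hives$hive)   # hypothetical column identifying which of the 4 hives each row belongs to

model2 <- lme(pa ~ wifi * treatment_hive + hours,
              random = ~ hours | hive,
              data = Hives, na.action = na.omit, method = "REML")
summary(model2)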
