I built an ARIMA model with regressors in SAS and in R, but the two models' results are totally different, and I cannot figure out why the two packages give different outputs.
The following is the SAS code:
proc ARIMA data=TSDATA;
   identify var=LOG_Sale
            crosscorr=(Log_Var1 Log_Var2 Log_Var3)
            nlag=12 ALPHA=0.05 WHITENOISE=IGNOREMISS SCAN;
run;
   estimate q=(4)(10)
            input=(Log_Var1 Log_Var2 Log_Var3)
            method=ml plot;
run;
The following is the R code:
finalmodel <- arima(LOG_Sale,
                    order = c(0, 0, 4),
                    seasonal = list(order = c(0, 0, 1), period = 10),
                    include.mean = TRUE,
                    xreg = xinput,
                    fixed = c(0,0,0,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA),
                    method = "ML")
summary(finalmodel)
As you can see, the model includes MA(4)(10) terms and 3 regressors; I defined a matrix xinput to hold the three regressors (Log_Var1, Log_Var2, Log_Var3).
The coefficients in the two outputs (SAS and R) are totally different, and I don't know why. Please help me out if you can point out what's wrong in the R code: I think the SAS code is quite typical and should be right, but I am new to R and I guess the R code may be wrong.
Thanks.
The data is typical weekly time series data:
Date Log_Var1 Log_Var2 Log_Var3
3-Jan-11 13.47487027 8.65886635 9.096499556
10-Jan-11 14.1688108 9.182043773 9.096499556
17-Jan-11 14.3192497 9.175024027 9.096499556
24-Jan-11 14.54051181 9.1902397 9.096499556
31-Jan-11 14.33370089 9.1902397 9.096499556
7-Feb-11 13.76581591 9.431962767 9.326321786
14-Feb-11 14.09526221 9.29844282 9.326321786
21-Feb-11 14.61994905 9.29844282 9.326321786
28-Feb-11 14.94652204 8.700680735 9.326321786
7-Mar-11 14.71066636 9.026056892 9.348993004
As you can see from the SAS code, the model is ARIMA(0,0,4)(0,0,10) with three input series. This is straightforward in SAS, but I have read many R materials and cannot find any useful documents or examples showing how to build an ARIMA model with specific high orders of p, q or P, Q (subset ARIMA) together with external regressors.
The R code you see here actually runs, and the output looks alright, but the coefficients differ from the SAS output, so I guess the ARIMA algorithms in R could be different from SAS, even though both use the ML method.
So the point is whether the R code is correct given that the model is ARIMA(0,0,4)(0,0,10). Note that q=(4)(10) means only lags 4 and 10, not 1 through 4 and not 1 through 10; these are subset orders only.
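For reference, here is a minimal sketch of how a subset specification like q=(4)(10) can be written with stats::arima, assuming the LOG_Sale series and the 3-column matrix xinput from the question: the lower-order MA coefficients are pinned to zero through fixed, and transform.pars must be FALSE for fixed to be honored.

# Sketch only; assumes LOG_Sale and a 3-column xinput as above.
# Coefficient order in `fixed`: ma1..ma4, sma1, intercept, then the 3 xreg betas.
fit <- arima(LOG_Sale,
             order          = c(0, 0, 4),
             seasonal       = list(order = c(0, 0, 1), period = 10),
             xreg           = xinput,
             include.mean   = TRUE,
             fixed          = c(0, 0, 0, NA, NA, NA, NA, NA, NA),
             transform.pars = FALSE,  # required when fixing coefficients
             method         = "ML")

Note also that SAS writes the MA polynomial as (1 - theta_1 B - ...) while R's arima() uses (1 + theta_1 B + ...), so the MA coefficients from the two packages are expected to differ in sign even when the fits otherwise agree.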
Thanks.
I am using the forecast package in R and created an MA(1) model using the Arima function. I plotted the time series itself (the $x element of ma_model), the fitted model (the $fitted element) and the residuals (the $residuals element). Strangely, the time series looks identical to the fitted model although the residuals are not zero. Here is the code that I used:
library(forecast)
ma_model<-Arima(ts(generationData$Price[1:200]), order=c(0,1,0))
plot(ma_model$fitted, main = "Fitted")
plot(ma_model$x, main = "X")
plot(ma_model$residuals, main = "Residuals")
Here is the result (plots not shown).
Basically, the fitted model shouldn't be identical to the real time series, especially when the residuals are non-zero. Can anyone explain this to me? I'd appreciate every comment.
Update: I tried order=c(0,0,20), so I have an MA(20) or AR(20) model (I am not sure which parameter stands for MA and which for AR). Now the fitted curve and the original time series look quite similar (but not exactly equal). Is this possible and usual? I'd appreciate every further comment.
Any comments on this issue?
I am not sure about your output, but from the code it seems that you only differenced the series in the model; you did not fit an MA term.
I think it should be order=c(0,0,1) instead of order=c(0,1,0) for building the MA model.
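A minimal sketch of the corrected call, reusing the objects from the question (generationData$Price is assumed to exist as in the original post):

library(forecast)
# order = c(p, d, q): p = AR order, d = differencing, q = MA order,
# so an MA(1) model is order = c(0, 0, 1)
ma_model <- Arima(ts(generationData$Price[1:200]), order = c(0, 0, 1))
plot(ma_model$fitted, main = "Fitted MA(1)")
plot(ma_model$residuals, main = "Residuals")

With order=c(0,1,0) the model is a random walk, whose one-step-ahead fitted value is simply the previous observation; that is why the fitted series looked like a copy of the data.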
I have some code in Stata that I'm trying to redo in R. I'm working on a delayed-entry survival model and I want to limit the follow-up to 5 years. In Stata this is very easy and can be done as follows, for example:
stset end, fail(failure) id(ID) origin(start) enter(entry) exit(time 5)
stcox var1
However, I'm having trouble recreating this in R. I've made a toy example limiting follow-up to 1000 days - here is the setup:
library(survival); library(foreign); library(rstpm2)
data(brcancer)
brcancer$start <- 0
# Make delayed entry time
brcancer$entry <- brcancer$rectime / 2
# Write to dta file for Stata
write.dta(brcancer, "brcancer.dta")
OK, so now we've set up an identical dataset for use in both R and Stata. Here is the Stata code and model result:
use "brcancer.dta", clear
stset rectime, fail(censrec) origin(start) enter(entry) exit(time 1000)
stcox hormon
And here is the R code and results:
# Limit follow-up to 1000 days
brcancer$limit <- ifelse(brcancer$rectime <1000, brcancer$rectime, 1000)
# Cox model
mod1 <- coxph(Surv(time=entry, time2= limit, event = censrec) ~ hormon, data=brcancer, ties = "breslow")
summary(mod1)
As you can see, the R estimates and the Stata estimates differ slightly, and I cannot figure out why. Have I set up the R model incorrectly to match Stata, or is there another reason the results differ?
Since the methods match on an available dataset after recoding the deaths that occur after the termination date, I'm posting the relevant sections of my comment as an answer.
I also think that you should have changed any of the deaths at times greater than 1000 to be considered censored. (Notice that the number of events is quite different in the two sets of results.)
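A minimal sketch of that recoding, built on the setup code from the question (the names event1000, brc and mod2 are mine; dropping subjects whose delayed entry falls at or after the cutoff mirrors what Stata's enter()/exit(time 1000) combination does):

library(survival); library(rstpm2)
data(brcancer)
brcancer$entry <- brcancer$rectime / 2
# Truncate follow-up at 1000 days ...
brcancer$limit <- pmin(brcancer$rectime, 1000)
# ... and recode deaths occurring after day 1000 as censored
brcancer$event1000 <- ifelse(brcancer$rectime > 1000, 0, brcancer$censrec)
# Keep only subjects who enter before the cutoff
brc <- subset(brcancer, entry < 1000)
mod2 <- coxph(Surv(time = entry, time2 = limit, event = event1000) ~ hormon,
              data = brc, ties = "breslow")
summary(mod2)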
I would appreciate some input on this a lot!
I have data for 5 time series (an example of one step in the series is in the plot below), where each step in the series is a vertical profile of species sightings in the ocean; the profiles were taken 6 h apart. Within each profile, the observations are spaced vertically by 0.1 m (and the 5 profiles by the 6 h in time).
What I want to do is calculate the multivariate cross-correlation between all the series in order to find out at which lag the profiles are most correlated, and whether this is stable over time.
Profile example (figure not shown):
I find the R documentation on this not so great, so what I have done so far is use the ccm function from the MTS package to create cross-correlation matrices. However, interpreting the figures is rather difficult given the sparse documentation. I would appreciate some help with that a lot.
Data example:
http://pastebin.com/embed_iframe.php?i=8gdAeGP4
Save it in a file called cross_correlation_stack.csv, or change the path as you wish.
library(dplyr)
library(MTS)
library(data.table)
d1 <- file.path('cross_correlation_stack.csv')
d2 <- read.csv(d1)
# Using package MTS: cross-correlation matrices
mod1 <- ccm(d2, lag = 1000, level = TRUE)
# Using base R
acf(d2, lag.max = 1000)
# Multivariate Ljung-Box (Q) plot, also from the MTS package
mq(d2, lag = 1000)
The ccm and mq commands produce cross-correlation and significance plots, and in parallel the acf command produces a matrix of autocorrelation plots (figures not shown here).
My question now is whether somebody can tell me if I am going in the right direction, or whether there are better-suited packages and commands.
Since the default figures don't get any titles etc., what am I looking at, specifically in the ccm figures?
The acf command was proposed somewhere, but can I use it here? Its documentation says it calculates the autocovariance or autocorrelation, and I assume this is not what I want. But then again, it's the only command that seems to work multivariately. I am confused.
The plot with the significance values shows that the p-values increase after a lag of 150 (15 meters). How would you interpret that for my data: with species sightings at 0.1 m intervals, many lags up to 100-150 are significant? Would that mean something like peaks in sightings being stable over the 5 time steps on a scale of 150 lags, i.e. 15 meters?
Either way, it would be nice if somebody who has worked with this before could explain what I am looking at! Any input is highly appreciated!
You can use the base R function ccf(), which will estimate the cross-correlation function between any two variables x and y. However, it only works on vectors, so you'll have to loop over the columns of the data frame (d2 above). Something like:
cc <- vector("list", choose(ncol(d2), 2))
par(mfrow = c(ceiling(choose(ncol(d2), 2) / 2), 2))
cnt <- 1
for (i in 1:(ncol(d2) - 1)) {
  for (j in (i + 1):ncol(d2)) {
    cc[[cnt]] <- ccf(d2[, i], d2[, j],
                     main = paste0("Cross-correlation of ", colnames(d2)[i],
                                   " with ", colnames(d2)[j]))
    cnt <- cnt + 1
  }
}
This will plot each of the estimated CCFs and store the estimates in the list cc. It is important to remember that the lag-k value returned by ccf(x,y) is an estimate of the correlation between x[t+k] and y[t].
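A tiny self-contained check of that convention on synthetic data (not your data): if y is x delayed by 3 steps, the peak appears at lag -3, because cor(x[t+k], y[t]) is largest when x[t+k] lines up with x[t-3].

set.seed(1)
x <- rnorm(100)
y <- c(rep(0, 3), head(x, 97))  # y[t] = x[t-3]: y lags x by 3 steps
ccf(x, y, lag.max = 10)         # the spike shows up at lag k = -3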
All of that said, however, the CCF is only well-behaved for data that are more or less normally distributed, but your data are clearly overdispersed with all of those zeroes. Therefore, lacking some adequate transformation, you should really look into other measures of "association", such as the mutual information as estimated from entropy. I suggest checking out the R packages entropy and infotheo.
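For instance, a minimal sketch with infotheo, assuming the d2 data frame loaded above (infotheo estimates entropies on discretized data, so the columns are binned first):

library(infotheo)
# Bin two profile columns (equal-frequency binning by default) ...
x_d <- discretize(d2[, 1])
y_d <- discretize(d2[, 2])
# ... then estimate their mutual information (in nats)
mutinformation(x_d, y_d)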
I am trying to run a repeated-measures ANOVA in R and compared it to the SPSS output, and the results differ a lot! Maybe I made a mistake somewhere, but I cannot figure it out.
So, some sample data:
id is the subject. Every subject gives one rating for each of three items (res_1, res_2 and res_3). I want to test the overall effect of item.
id <- c(1, 2, 3, 4, 5, 6)
res_1 <- c(1, 1, 1, 2, 2, 1)
res_2 <- c(4, 5, 2, 4, 4, 3)
res_3 <- c(4, 5, 6, 3, 6, 6)
## wide format for SPSS
table <- as.data.frame(cbind(id, res_1, res_2, res_3))
## reshape to long format
library(reshape2)
table <- melt(table, id.vars = "id")
colnames(table) <- c("id", "item", "rating")
aov.out <- aov(rating ~ item + Error(id/item), data = table)
summary(aov.out)
And here is my SPSS code (from the wide-format data):
GLM res_1 res_2 res_3
/WSFACTOR=factor1 3 Polynomial
/METHOD=SSTYPE(3)
/PRINT=DESCRIPTIVE
/CRITERIA=ALPHA(.05)
/WSDESIGN=factor1.
The results I get are:
R: p-value 0.0526 (Error: Within)
SPSS: p-value 0.003 (tests of within-subjects effects)
Does anyone have a suggestion that may explain the difference?
If I do a non-parametric Friedman test, I get the same results in SPSS and R.
Actually, when looking at my data, summary(aov.out) matches SPSS's "tests of within-subjects contrasts" (but I learned to look at the tests of within-subjects effects).
Thanks!
There's a lot of stuff out there; I am a bit surprised that googling 'spss versus R anova' did not bring you to links explaining the difference in sums of squares between SPSS (Type III) and R (Type I), as well as the difference in how contrasts are handled.
These are the top two results that I found:
http://myowelt.blogspot.ca/2008/05/obtaining-same-anova-results-in-r-as-in.html and
https://stats.stackexchange.com/questions/40958/testing-anova-hypothesis-with-contrasts-in-r-and-spss
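As a further, hedged sketch (one common recipe, not guaranteed to be the whole story for your data): SPSS's repeated-measures GLM can be reproduced in R with a multivariate linear model plus car::Anova, which supports Type III sums of squares. This reuses res_1, res_2 and res_3 from the question. Note, too, that in the aov() call above id is numeric; Error(id/item) expects a factor (table$id <- factor(table$id)), which by itself changes the error strata.

library(car)
# Wide-format response matrix, the way SPSS's GLM sees the data
Y <- cbind(res_1, res_2, res_3)
idata <- data.frame(item = factor(c("res_1", "res_2", "res_3")))
fit <- lm(Y ~ 1)  # intercept-only multivariate linear model
rm.aov <- Anova(fit, idata = idata, idesign = ~ item, type = "III")
# Univariate within-subjects F tests with sphericity corrections, as in SPSS
summary(rm.aov, multivariate = FALSE)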
Q1:
I have been trying to get the AUC value for a classification problem and have been trying to use the e1071 and ROCR packages in R for this. ROCR has a nice example dataset, ROCR.simple, which contains prediction values and label values.
library(ROCR)
data(ROCR.simple)
pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)
auc <- performance(pred, "auc")
This gives the AUC value, no problem.
My problem is: how do I get the kind of data given by ROCR.simple$predictions in the above example?
I run my analysis like this:
library(e1071)
data(iris)
y <- iris$Species
x <- iris[, 1:2]
model <- svm(x, y)
pred <- predict(model, x)
Up to here I'm OK.
Then how do I get the kind of predictions that ROCR.simple$predictions gives?
Q2:
There is a nice example involving ROCR.xval, a problem with 10 cross-validation runs.
They run:
pred <- prediction(ROCR.xval$predictions, ROCR.xval$labels)
auc <- performance(pred, "auc")
This gives results for all 10 cross-validation runs.
My problem is:
How do I use
model <- svm(x, y, cross = 10)  # where x and y are as given in Q1
and get all 10 sets of predictions and labels into a list like ROCR.xval?
Q1. You could use
pred <- prediction(as.numeric(pred), as.numeric(iris$Species))
auc <- performance(pred, "auc")
BUT: the number of classes is not equal to 2, and ROCR currently supports only the evaluation of binary classification tasks (according to the error I got).
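A hedged workaround (my own, not from the question): restrict iris to two classes and ask svm() for class probabilities, so that ROCR gets continuous scores for a binary task.

library(e1071); library(ROCR)
# Keep two classes only, dropping the unused factor level
iris2 <- droplevels(subset(iris, Species != "virginica"))
model2 <- svm(iris2[, 1:2], iris2$Species, probability = TRUE)
pr <- predict(model2, iris2[, 1:2], probability = TRUE)
# Use the probability of one class as the continuous score
scores <- attr(pr, "probabilities")[, "versicolor"]
pred2 <- prediction(scores, iris2$Species == "versicolor")
performance(pred2, "auc")@y.values[[1]]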
Q2. I don't think the second one can be done the way you want. I can only think of performing the cross-validation manually, i.e.:
Get the resampling indices (with resample.indices from the package peperr):
cv.ind <- resample.indices(nrow(iris), sample.n = 10, method = c("cv"))
x <- lapply(cv.ind$sample.index, function(i) iris[i, 1:2])
y <- lapply(cv.ind$sample.index, function(i) iris[i, 5])
Then generate models and predictions for each CV sample:
model1 <- svm(x[[1]], y[[1]])
pred1 <- predict(model1, x[[1]])
etc.
Then you could manually construct a list like ROCR.xval, e.g. as sketched below.
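A hedged sketch of that construction as a loop. It reuses the two-class iris2 setup from the Q1 sketch above (ROCR needs binary labels), and all object names here are mine:

library(peperr); library(e1071); library(ROCR)
cv.ind2 <- resample.indices(nrow(iris2), sample.n = 10, method = c("cv"))
xs <- lapply(cv.ind2$sample.index, function(i) iris2[i, 1:2])
ys <- lapply(cv.ind2$sample.index, function(i) iris2[i, "Species"])
preds <- labs <- vector("list", 10)
for (k in 1:10) {
  m <- svm(xs[[k]], ys[[k]], probability = TRUE)
  p <- predict(m, xs[[k]], probability = TRUE)
  preds[[k]] <- attr(p, "probabilities")[, "versicolor"]  # continuous scores
  labs[[k]]  <- ys[[k]] == "versicolor"                   # binary labels
}
# Same shape as ROCR.xval: one predictions vector and one labels vector per run
xval_like <- list(predictions = preds, labels = labs)
performance(prediction(xval_like$predictions, xval_like$labels), "auc")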