auto.arima() seemingly selects different models given same data - r

I was trying something like the auto.arima example in https://otexts.com/fpp2/lagged-predictors.html and noticed I get different results depending on whether I specify (all) rows of data explicitly or not. MWE:
library(forecast); library(fpp2)
nrow(insurance)
auto.arima(insurance[,1], xreg=insurance[,2], stationary=TRUE)
auto.arima(insurance[1:40,1], xreg=insurance[1:40,2], stationary=TRUE)
The nrow(insurance) shows there are 40 rows, so I'd think insurance[,1] would be the same as insurance[1:40,1], and similarly for the second column. Yet, the first way results in a "Regression with ARIMA(3,0,0) errors" whereas the second way results in a "Regression with ARIMA(1,0,2) errors."
Why do these seemingly equivalent calls result in different selected models?

Note that insurance[,1] has labels and insurance[1:40,1] does not. If you pass as.numeric(insurance[,1]), you also get "ARIMA(1,0,2)", so I suspect the difference comes down to whether the first argument has labels. Also note that it doesn't matter whether you use xreg=insurance[,2] or xreg=insurance[1:40,2]; both work.

Corey nudged me in the right direction: insurance[,1] is a "time series" whereas insurance[1:40,1] is numeric. That is, is.ts(insurance[,1]) is TRUE but is.ts(insurance[1:40,1]) is FALSE. The forecast package has a subset function that preserves the time series structure, so is.ts(subset(insurance[,1],start=1,end=40)) is TRUE and
auto.arima(subset(insurance[,1],start=1,end=40),
xreg=subset(insurance[,2],start=1,end=40), stationary=TRUE)
gives the same output as the first version in my question (with insurance[,1] and insurance[,2]).
I think that explains "why" at least superficially, although I don't understand
1) why the time series structure changes the result here (since there doesn't seem to be any seasonality in the selected models?), and
2) why in the linked example Hyndman uses insurance[4:40,1] instead of his own subset() function from his forecast package?
I'll wait to see if somebody wants to answer those "deeper" questions, otherwise I'll probably accept this answer.
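The difference between the two subsetting styles can be reproduced in base R without the forecast package. This is a minimal sketch, assuming a made-up monthly series x in place of insurance[,1] (the values, start date and frequency are assumptions for illustration). It only shows that [ drops the time series attributes while window() keeps them; it does not show how auto.arima uses them.

```r
# Hypothetical monthly series standing in for insurance[,1]
set.seed(1)
x <- ts(rnorm(40), start = c(2002, 1), frequency = 12)

# Indexing with [ returns a plain numeric vector: the ts attributes are gone
plain <- x[1:40]
is.ts(x)       # TRUE
is.ts(plain)   # FALSE

# The values themselves are unchanged
same_values <- identical(as.numeric(x), plain)

# window() from base R (like forecast::subset) keeps the ts structure
w <- window(x, start = start(x), end = end(x))
is.ts(w)
frequency(w)
```

So the two calls in the question pass the same numbers but different kinds of objects.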

Related

How to combine two actions into one object

I recently just started with R a few weeks ago at the Uni. We were given a problem which we had to solve. However in this problem, I find that there are two answers that fit the question:
Verify that you created lo_heval correctly (incl. missing values). Store your verification in the object proof2.
I think this is correct:
proof2 <- soep[1:100, c("heval", "lo_heval")]
But I think that this answer is also correct:
proof2 <- table(soep$heval, soep$lo_heval, useNA = "always")
Instead of having to decide for one answer, how do I combine them both into the object? I tried to use &, but I get an error. I may be using it wrong.
Prof. if you're seeing this, please don't fail me. I just can't decide between them.
Thanks in advance!
R lists can hold any arbitrary objects in them, so you could use
proof2 <- list(
soep[1:100, c("heval", "lo_heval")],
table(soep$heval, soep$lo_heval, useNA = "always")
)
However, to my mind 100 rows of two columns isn't proof: it's an exercise to look through those rows and verify that things are right. (And what about the rows past 100? It's a decent spot check, but if there are more rows in the data it is strong evidence rather than proof.) The table approach, on the other hand, seems succinct and effective.
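As a small sketch of the list approach, here it is on a made-up stand-in for soep. The data values, the rule used to build lo_heval, and the element names rows and counts are all assumptions for illustration, not part of the assignment.

```r
# Hypothetical stand-in for the soep data; the lo_heval rule is invented
soep <- data.frame(heval = c(1, 2, NA, 4, 5))
soep$lo_heval <- soep$heval <= 2

# Naming the list elements makes each piece easy to pull out later
proof2 <- list(
  rows   = head(soep[, c("heval", "lo_heval")], 100),
  counts = table(soep$heval, soep$lo_heval, useNA = "always")
)

proof2$rows     # the spot check
proof2$counts   # the cross-tabulation, including the NA row and column
```

Each element is retrieved with proof2$rows or proof2[["counts"]], so nothing is lost by storing both in one object.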

What are the rules for ppp objects? Is selecting two variables possible for an sapply function?

I am working with code that describes a Poisson cluster process in spatstat, going through it one line at a time to understand it. The beginning is easy.
library(spatstat)
lambda<-100
win<-owin(c(0,1),c(0,1))
n.seeds<-lambda*win$xrange[2]*win$yrange[2]
Once the window is defined I then generate my points using a random generation function
x=runif(min=win$xrange[1],max=win$xrange[2],n=pmax(1,n.seeds))
y=runif(min=win$yrange[1],max=win$yrange[2],n=pmax(1,n.seeds))
This can be plotted straight away I know using the ppp function
seeds<-ppp(x=x,
y=y,
window=win)
plot(seeds)
The next line adds marks to the ppp object. They apparently describe the angle of rotation of the points. I don't understand how this works right now, but that is okay; I will figure it out later.
marks<-data.frame(angles=runif(n=pmax(1,n.seeds),min=0,max=2*pi))
seeds1<-ppp(x=x,
y=y,
window=win,
marks=marks)
The first problem I encounter is that an object called pops, describing the populations of the window, is added to the ppp object. I understand how the values are derived: they are drawn from a Poisson distribution with the given mean, which can be any value, and the number of draws equals the number of points in the window.
seeds2<-ppp(x=x,
y=y,
window=win,
marks=marks,
pops=rpois(lambda=5,n=pmax(1,n.seeds)))
My first question is, how is it possible to add a variable that has no classification in the ppp object? I checked the ppp documentation and there is no mention of pops.
The second question I have is about using double variables, the next line requires an sapply function to define dimensions.
dim1<-pmax(1,sapply(seeds1$marks$pops, FUN=function(x)rpois(n=1,sqrt(x))))
I have never seen the $ operator used twice, and seeds2$marks$pop returns the error "$ operator is invalid for atomic vectors". Could you explain what is going on here?
Many thanks.
That's several questions - please ask one question at a time.
From your post it is not clear whether you are trying to understand someone else's code, or developing code yourself. This makes a difference to the answer.
Just to clarify, this code does not come from inside the spatstat package; it is someone's code using the spatstat package to generate data. There is code in the spatstat package to generate simulated realisations of a Poisson cluster process (which is I think what you want to do), and you could look at the spatstat code for rPoissonCluster to see how it can be done correctly and efficiently.
The code you have shown here has numerous errors. But I will start by answering the two questions in your title.
The rules for creating ppp objects are set out in the help file for ppp. The help says that if the argument window is given, then unmatched arguments ... are ignored. This means that in the line seeds2<-ppp(x=x,y=y,window=win,marks=marks,pops=rpois(lambda=5,n=pmax(1,n.seeds)))
the argument pops will be ignored.
The idiom sapply(seeds1$marks$pops, FUN=f) is perfectly valid syntax in R. If the object seeds1 is a structure or list which has a component named marks, which in turn is a structure or list which has a component named pops, then the idiom seeds1$marks$pops would extract it. This has nothing particularly to do with sapply.
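To illustrate the chained $ idiom on its own, here is a base R sketch that uses a plain list instead of a ppp object. The objects obj and bad and their values are made up for illustration.

```r
# A list whose component `marks` is a data frame with a column `pops`
obj <- list(marks = data.frame(angles = c(0.1, 0.2, 0.3),
                               pops   = c(4L, 6L, 5L)))

# First $ extracts the data frame, second $ extracts its column
obj$marks$pops

# The result is an ordinary vector, so sapply() can iterate over it
dim1 <- pmax(1, sapply(obj$marks$pops, function(p) rpois(n = 1, sqrt(p))))

# If `marks` is an atomic vector instead, the second $ fails
bad <- list(marks = c(0.1, 0.2, 0.3))
res <- tryCatch(bad$marks$pops, error = function(e) conditionMessage(e))
res   # the "invalid for atomic vectors" error message
```

The error in the question therefore suggests that the marks component being indexed was a plain vector rather than a data frame with a pops column.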
Now turning to errors in the code,
The line n.seeds<-lambda*win$xrange[2]*win$yrange[2] is presumably meant to calculate the expected number of cluster parents (cluster seeds) in the window. This would only work if the window is a rectangle with bottom left corner at the origin (0,0). It would be safer to write n.seeds <- lambda * area(win).
However, the variable n.seeds is used later as if it were the number of cluster parents (cluster seeds). The author has forgotten that the number of seeds is random with a Poisson distribution. So, the more correct calculation would be n.seeds <- rpois(1, lambda * area(win))
However this is still not correct because cluster parents (seed points) outside the window can also generate offspring points inside the window. So, seed points must actually be generated in a larger window obtained by expanding win. The appropriate command used inside spatstat to generate the cluster parents is bigwin <- grow.rectangle(Frame(win), cluster_diameter) ; Parents <- rpoispp(lambda, bigwin)
The author apparently wants to assign two mark values to each parent point: a random angle and a random number pops. The correct way to do this is to make the marks a data frame with two columns, for example marks(seeds1) <- data.frame(angles=runif(n.seeds, max=2*pi), pops=rpois(n.seeds, 5))
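The corrections above can be sketched in base R, without spatstat, for the special case of a rectangular window. This is only an illustration under stated assumptions: the value of cluster_diameter is invented, the window is the unit square from the question, and the rectangle is grown by cluster_diameter on every side. In real code the spatstat functions named above are the better choice.

```r
set.seed(1)
lambda <- 100
xrange <- c(0, 1)
yrange <- c(0, 1)
cluster_diameter <- 0.1   # hypothetical value, not from the original code

# Expand the rectangle so parents just outside the window are included
big_x <- xrange + c(-1, 1) * cluster_diameter
big_y <- yrange + c(-1, 1) * cluster_diameter
big_area <- diff(big_x) * diff(big_y)

# The number of parents is itself Poisson, not a fixed count
n.seeds <- rpois(1, lambda * big_area)

# Uniform parent locations, with two mark columns per parent
parents <- data.frame(
  x      = runif(n.seeds, big_x[1], big_x[2]),
  y      = runif(n.seeds, big_y[1], big_y[2]),
  angles = runif(n.seeds, min = 0, max = 2 * pi),
  pops   = rpois(n.seeds, lambda = 5)
)
head(parents)
```

Keeping angles and pops as columns of one data frame is what makes an expression like marks$pops valid later on.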

Correlations and what brackets indicate

I have this code, from Julian Faraway's linear models book:
round(cor(seatpos[,-9]),2)
I am unsure what [,-9],2 is doing - could someone please assist?
When you are learning new stuff nested functions can be difficult. This same computation could be accomplished in steps, which might be easier for you to see what KeonV and MrFlick are suggesting.
Here is an alternative way of doing the same thing with the same functions, broken into separate steps with a short explanation of each.
sub_seatpos<- seatpos[,-9]
This says: take all rows and all columns EXCEPT column number nine, and save the result into sub_seatpos. (This subsetting was done in the original code but not saved into a new variable; saving it just makes each step easier to see.) It corresponds to the innermost part, seatpos[,-9], of the original expression:
round(cor(seatpos[,-9]),2)
cor_seatpos <- cor(sub_seatpos)
This computes the correlations for sub_seatpos and saves them into a variable named cor_seatpos. It corresponds to the cor( ... ) call in the original expression:
round( cor( seatpos[,-9] ),2)
The final step just rounds the correlations to 2 decimal places. As a separate line of code it looks like this:
round(cor_seatpos, 2)
It corresponds to the outermost round( ... ,2) call in the original expression:
round( cor(seatpos[,-9]),2)
What makes this confusing is that all of the functions are nested. As you become more proficient, this becomes less of a difficulty to read. But it can be confusing with new functions.
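The same three steps can be run end to end on a dataset that ships with R. This sketch uses mtcars in place of seatpos (an assumption made only so the example runs without the faraway package) and checks that the stepwise and nested versions agree.

```r
# Step 1: all rows, every column except the ninth
sub_mtcars <- mtcars[, -9]

# Step 2: correlation matrix of the remaining columns
cor_mtcars <- cor(sub_mtcars)

# Step 3: round to 2 decimal places
stepwise <- round(cor_mtcars, 2)

# The nested one-liner does exactly the same work
nested <- round(cor(mtcars[, -9]), 2)
identical(stepwise, nested)
```

Reading a nested call from the inside out gives the same order as the three steps.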

Prediction with cpdist using "probabilities" as evidence

I have a very quick question with an easy reproducible example that is related to my work on prediction with bnlearn
library(bnlearn)
Learning.set4=cbind(c("Yes","Yes","Yes","No","No","No"),c(9,10,8,3,2,1))
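Learning.set4=as.data.frame(Learning.set4, stringsAsFactors=TRUE) #Cause must be a factor for bn.fit; R >= 4.0 keeps strings as character by default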
Learning.set4[,c(2)]=as.numeric(as.character(Learning.set4[,c(2)]))
colnames(Learning.set4)=c("Cause","Cons")
b.network=empty.graph(colnames(Learning.set4))
struct.mat=matrix(0,2,2)
colnames(struct.mat)=colnames(Learning.set4)
rownames(struct.mat)=colnames(struct.mat)
struct.mat[1,2]=1
bnlearn::amat(b.network)=struct.mat
haha=bn.fit(b.network,Learning.set4)
#Some predictions with "lw" method
#Here is the approach I know with a SET particular modality.
#(So it's happening with certainty, here for example I know Cause is "Yes")
classic_prediction=cpdist(haha,nodes="Cons",evidence=list("Cause"="Yes"),method="lw")
print(mean(classic_prediction[,c(1)]))
#What if I wanted to predict the value of Cons, when Cause has a 60% chance of being Yes and 40% of being no?
#I decided to do this, according the help
#I could also make a function that generates "Yes" or "No" with proper probabilities.
prediction_idea=cpdist(haha,nodes="Cons",evidence=list("Cause"=c("Yes","Yes","Yes","No","No")),method="lw")
print(mean(prediction_idea[,c(1)]))
Here is what the help says:
"In the case of a discrete or ordinal node, two or more values can also be provided. In that case, the value for that node will be sampled with uniform probability from the set of specified values"
When I predict the value of a variable using categorical variables, I for now just used a certain modality of said variable as in the first prediction in the example. (Having the evidence set at "Yes" gets Cons to take a high value)
But if I wanted to predict Cons without knowing the exact modality of the variable Cause with certainty, could I use what I did in the second prediction (Just knowing the probabilities) ?
Is this an elegant way, or are there better implemented ones I don't know of?
I got in touch with the creator of the package, and I will paste his answer related to the question here:
The call to cpdist() is wrong:
prediction_idea=cpdist(haha,nodes="Cons",evidence=list("Cause"=c("Yes","Yes","Yes","No","No")),method="lw")
print(mean(prediction_idea[,c(1)]))
A query with the 40%-60% soft evidence requires you to place these new probabilities in the network first
haha$Cause = c(0.40, 0.60)
and then run the query without an evidence argument. (Because you do not have any hard evidence, really, just a different probability distribution for Cause.)
I will post the code that lets me do what I wanted off of the fitted network from the example.
change=haha$Cause$prob
change[1]=0.4
change[2]=0.6
haha$Cause=change
new_prediction=cpdist(haha,nodes="Cons",evidence=TRUE,method="lw")
print(mean(new_prediction[,c(1)]))
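As a sanity check on the number to expect, the mean of Cons under the 40%/60% distribution can be worked out by hand and by a plain simulation in base R. This sketch does not use bnlearn or likelihood weighting. It assumes Cons is Gaussian within each level of Cause, with the per-level mean and standard deviation taken from the six rows of the example.

```r
set.seed(42)
learn <- data.frame(
  Cause = factor(c("Yes", "Yes", "Yes", "No", "No", "No")),
  Cons  = c(9, 10, 8, 3, 2, 1)
)

# Per-level mean and sd of Cons (named vectors: No, Yes)
means <- sapply(split(learn$Cons, learn$Cause), mean)
sds   <- sapply(split(learn$Cons, learn$Cause), sd)

# New distribution for Cause: 40% No, 60% Yes
probs <- c(No = 0.4, Yes = 0.6)

# Expected value of Cons: 0.4 * 2 + 0.6 * 9 = 6.2
expected <- sum(probs * means[names(probs)])

# Forward simulation: draw Cause first, then Cons given Cause
n <- 1e5
cause <- sample(names(probs), n, replace = TRUE, prob = probs)
cons  <- rnorm(n, mean = means[cause], sd = sds[cause])
simulated <- mean(cons)

c(expected = expected, simulated = simulated)
```

The mean printed by the cpdist call above should land close to this expected value, up to sampling noise.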

Taylor diagram from existing Correlation and Standard Dev values

Is it possible to create a Taylor diagram from already calculated correlation and standard deviation values?
I am doing model evaluation, and I already have the correlation and standard deviation values. I understand that the plotrix package can create the diagram when given the observed and modeled values. However, for the type of work that I am doing, it is easier to start from the already calculated correlation and standard deviation values.
Is there any way I can do this in R?
There's no reason it shouldn't be possible, but the authors didn't seem to allow for that when they wrote the function. The function is a bit long and complex, but the part that does the calculation is at the top. It is possible to swap out that code and replace it to allow for the passing of summary statistics. Keep in mind that what I'm about to do is a hack, and I've only tested it with version 3.5-5 of plotrix. Other versions may not work.
Here we will create a new function taylor.diagram2 that takes all the code from taylor.diagram but adds an extra if statement to check for a list of summarized data as the first argument:
taylor.diagram2<-taylor.diagram
bl<-as.list(body(taylor.diagram))
cond<-list(
as.name("if"),
quote(is.list(ref) & missing(model)), #condition
quote({R<-ref$R; sd.r<-ref$sd.r; sd.f<-ref$sd.f}), #if true
as.call(c(as.symbol("{"), bl[3:8]))) #else
bl<-c(bl[1:2], as.call(cond), bl[9:length(bl)]) #splice in new code
body(taylor.diagram2)<-as.call(bl) #update function
Now we can test the function. First, we'll do things the standard way
#test data
aref<-rnorm(30,sd=2)
amodel1<-aref+rnorm(30)/2
#standard behavior function
taylor.diagram2(aref,amodel1, main="Standard Behavior")
#summarized data
xx<-list(
R=cor(aref, amodel1, use = "pairwise"),
sd.r=sd(aref),
sd.f=sd(amodel1)
)
#modified behavior
taylor.diagram2(xx, main="Modified Behavior")
So the new taylor.diagram2 function can do both. If you pass it two vectors, it will do the standard behavior. If you pass it a list with the names R, sd.r, and sd.f, then it will do the same plot but with the values you passed in. Also, the model parameter must be empty for the modified version to work. That means if you want to set any additional parameter, you must use named parameters rather than positional arguments.
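The body-splicing trick is easier to follow on a tiny function. The sketch below applies the same idea to a made-up function f (everything here is hypothetical and unrelated to plotrix): the first statement of the body is wrapped in an if/else so that a list of precomputed values can be passed instead.

```r
# Toy function: the first statement computes s, the second uses it
f <- function(x, y) {
  s <- x + y
  s * 2
}

# body(f) is a call to `{`; as a list: `{`, then one entry per statement
bl <- as.list(body(f))

# Build: if (is.list(x) && missing(y)) { s <- x$s } else s <- x + y
cond <- as.call(list(
  as.name("if"),
  quote(is.list(x) && missing(y)),  # condition
  quote({ s <- x$s }),              # if true: take the precomputed value
  bl[[2]]                           # else: the original computation
))

# Splice the new statement in place of the old one
f2 <- f
body(f2) <- as.call(c(bl[1], list(cond), bl[3]))

f2(1, 2)           # standard behavior, same as f(1, 2)
f2(list(s = 10))   # modified behavior, uses the supplied s
```

As with taylor.diagram2, the second argument has to be left out for the list branch to be taken.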
