How to do the RESET test on an AR model? - r

> library("lmtest")
> a = arima.sim(list(ar = c(.05, -.05)), 1000)
> b = arima(a, order = c(2, 0, 0))
> resettest(b)
**Error in terms.default(formula) : no terms component nor attribute**
Question 1. What I am doing is shown above. What should I do about that?
(I have tried to put in type, data and power parameter at resettest(), result is the same.)
Question 2.If I want to do the same thing on the model below
๐‘Ÿ๐‘ก=0.5+0.5๐‘Ÿ(๐‘กโˆ’1)โˆ’0.5๐‘Ÿ(๐‘กโˆ’2)+0.1๐‘Ÿ(๐‘กโˆ’1)^2+๐œ–_๐‘ก
which is a ar(2) model plus 0.1๐‘Ÿ_(๐‘กโˆ’1)^2, how to fit this nonlinear model (by using R, thank you!)?
should have earn more reputation... can't post pic below 10 :(

The issue is that the first argument of resettest is
formula - a symbolic description for the model to be tested (or a fitted "lm" object).
So, passing an Arima object is not going to work. Instead we may manually define the lagged variables and provide an lm object or just the formula:
la1 <- Hmisc::Lag(a, 1)
la2 <- Hmisc::Lag(a, 2)
resettest(a ~ la1 + la2)
#
# RESET test
#
# data: a ~ la1 + la2
# RESET = 0.10343, df1 = 2, df2 = 993, p-value = 0.9018
Now your second model is nonlinear in variables but linear in parameters, so the same estimation methods still apply. (I'm assuming that the true DGP remains the same and you just want to test a new specification.) In particular,
resettest(a ~ la1 + la2 + I(la2^2))
#
# RESET test
#
# data: a ~ la1 + la2 + I(la2^2)
# RESET = 0.089211, df1 = 2, df2 = 992, p-value = 0.9147

Related

R tree doesn't use all variables(why?)

Hi I'm working on a decision tree.
tree1=tree(League.binary~TME.factor+APM.factor+Wmd.factor,starcraft)
The tree shows a partitioning based solely on the APM.factor and the leaves aren't pure. here's a screenshot:
I tried creating a tree with a subset with 300 of the 3395 observations and it used more than one variable. What went wrong in the first case? Did it not need the extra two variables so it used only one?
Try playing with the tree.control() parameters, for example setting minsize=1 so that you end up with a single observation in each leaf (overfit), e.g:
model = tree(y ~ X1 + X2, data = data, control = tree.control(nobs=n, minsize = 2, mindev=0))
Also, try the same thing with the rpart package, see what results you get, which is the "new" version of tree. You can also plot the importance of the variables. Here a syntax example:
install.packages("rpart")
install.packages("rpart.plot")
library(rpart)
library(rpart.plot)
## fit tree
### alt1: class
model = rpart(y ~ X1 + X2, data=data, method = "class")
### alt2: reg
model = rpart(y ~ X1 + X2, data=data, control = rpart.control(maxdepth = 30, minsplit = 1, minbucket = 1, cp=0))
## show model
print(model)
rpart.plot(model, cex=0.5)
## importance
model$variable.importance
Note that since trees do binary splits, it is possible that a single variable explains most/all of the SSR (for regression). Try plotting the response for each regressor, see if there's any significant relation to anything but the variable you're getting.
In case you want to run the examples above, here a data simulation (put it at beginning of code):
n = 12000
X1 = runif(n, -100, 100)
X2 = runif(n, -100, 100)
## 1. SQUARE DATA
# y = ifelse( (X1< -50) | (X1>50) | (X2< -50) | (X2>50), 1, 0)
## 2. CIRCLE DATA
y = ifelse(sqrt(X1^2+X2^2)<=50, 0, 1)
## 3. LINEAR BOUNDARY DATA
# y = ifelse(X2<=-X1, 0, 1)
# Create
color = ifelse(y==0,"red","green")
data = data.frame(y,X1,X2,color)
# Plot
data$color = data$color %>% as.character()
plot(data$X2 ~ data$X1, col = data$color, type='p', pch=15)

Implementing multinomial-Poisson transformation with multilevel models

I know variations of this question have been asked before but I haven't yet seen an answer on how to implement the multinomial Poisson transformation with multilevel models.
I decided to make a fake dataset and follow the method outlined here, also consulting the notes the poster mentions as well as the Baker paper on MP transformation.
In order to check if I'm doing the coding correctly, I decided to create a binary outcome variable as a first step; because glmer can handle binary response variables, this will let me check I'm correctly recasting the logit regression as multiple Poissons.
The context of this problem is running multilevel regressions with survey data where the outcome variable is response to a question and the possible predictors are demographic variables. As I mentioned above, I wanted to see if I could properly code the binary outcome variable as a Poisson regression before moving on to multi-level outcome variables.
library(dplyr)
library(lme4)
key <- expand.grid(sex = c('Male', 'Female'),
age = c('18-34', '35-64', '45-64'))
set.seed(256)
probs <- runif(nrow(key))
# Make a fake dataset with 1000 responses
n <- 1000
df <- data.frame(sex = sample(c('Male', 'Female'), n, replace = TRUE),
age = sample(c('18-34', '35-64', '45-64'), n, replace = TRUE),
obs = seq_len(n), stringsAsFactors = FALSE)
age <- model.matrix(~ age, data = df)[, -1]
sex <- model.matrix(~ sex, data = df)[, -1]
beta_age <- matrix(c(0, 1), nrow = 2, ncol = 1)
beta_sex <- matrix(1, nrow = 1, ncol = 1)
# Create class probabilities as a function of age and sex
probs <- plogis(
-0.5 +
age %*% beta_age +
sex %*% beta_sex +
rnorm(n)
)
id <- ifelse(probs > 0.5, 1, 0)
df$y1 <- id
df$y2 <- 1 - df$y1
# First run the regular hierarchical logit, just with a varying intercept for age
glm_out <- glmer(y1 ~ (1|age), family = 'binomial', data = df)
summary(glm_out)
#Next, two Poisson regressions
glm_1 <- glmer(y1 ~ (1|obs) + (1|age), data = df, family = 'poisson')
glm_2 <- glmer(y2 ~ (1|obs) + (1|age), data = df, family = 'poisson')
coef(glm_1)$age - coef(glm_2)$age
coef(glm_out)$age
The outputs for the last two lines are:
> coef(glm_1)$age - coef(glm_2)$age
(Intercept)
18-34 0.14718933
35-64 0.03718271
45-64 1.67755129
> coef(glm_out)$age
(Intercept)
18-34 0.13517758
35-64 0.02190587
45-64 1.70852847
These estimates seem close but they are not exactly the same. I'm thinking I've specified an equation wrong with the intercept.

Coef not defined because of similarity and pvalues

I have a CRM data set used for an experiment, where the dummy W corresponds to the treatment/control group (see code below). When I tested for the independence of W from the other features, I realized two things:
When using model.matrix, some coefficients (1 in this dummy dataset) were not defined because of similarity. This did not happen when feeding the DT straight to lm()
The model obtained in both cases produces different results i.e., the p-values of the individual features changes
I (think that I) understand the concept of multi-collinearity but in this particular case I don't quite understand a) why it comes up b) why it has a different impact on model.matrix and lm
What am I missing?
Thanks a lot!
set.seed(1)
n = 302
DT = data.table(
zipcode = factor(sample(seq(1,52), n, replace=TRUE)),
gender = factor(sample(c("M","F"), n, replace=TRUE)),
age = sample(seq(1,95), n, replace=TRUE),
days_since_last_purchase = sample(seq(1,259), n, replace=TRUE),
W = sample(c(0,1), n, replace=TRUE)
)
summary(DT)
m = model.matrix(W ~ . +0, DT)
f1 = lm(DT$W ~ m)
f2= lm(W~ ., DT)
p_value_ratio <- function(lm)
{
summary_randomization = summary(lm)
p_values_randomization = summary_randomization$coefficients[, 4]
L = length(p_values_randomization)
return(sum(p_values_randomization <= 0.05)/(L-1))
}
all.equal(p_value_ratio(f1), p_value_ratio(f2))
alias(f1)
alias(f2)
Your problem is the + 0 in model.matrix. The second fit includes the intercept in the model matrix. If you exclude it, less factor levels (which are normally represented by the intercept) get excluded:
colnames(model.matrix(W ~ ., DT))
#excludes zipcode1 and genderf since these define the intercept
colnames(model.matrix(W ~ . + 0, DT))
#excludes only genderf
Note that f1 includes an intercept, which is added by lm (I believe by an internal call to model.matrix, but haven't checked):
m = model.matrix(W ~ . + 0, DT);
f1 = lm(DT$W ~ m );
model.matrix(f1)
You might want this:
m = model.matrix(W ~ ., DT);
f1 = lm(DT$W ~ m[,-1]);
(Usually you construct the model matrix only manually if you want to use lm.fit directly.)
f2= lm(W~ ., DT);
all.equal(unname(coef(f1)), unname(coef(f2)))
#[1] TRUE
In the end, this boils down to your understanding of treatment contrasts. Usually, you shouldn't exclude the intercept from the model matrix.

retain several best models during model dredging in R

Is there a way to retain the best models, for example, within two Alkaike Information Criterion (AIC) units of the best fitting model, during a model dredging approach in R? I am using the glmulti package, which returns the AIC of the best models, but does not allow visualizing the models associated with those values.
Thanks in advance.
Here is my example (data here):
results <- read.csv("gameresults.csv")
require(glmulti)
M <- glmulti(result~speed*svl*tailsize*strategy,
data=results, name = "glmulti.analysis",
intercept = TRUE, marginality = FALSE,
level = 2, minsize = 0, maxsize = -1, minK = 0, maxK = -1,
fitfunction = Multinom, method = "h", crit = "aic",
confsetsize = 100,includeobjects=TRUE)
summary(M)
The function glmulti::glmulti returns a S4 class object that can be accessed like a list. All of your models, not just the best, could be accessed. Since I don't have your functions and some other optional inputs, I performed a simplified version of your model just as a demonstration:
results <- read.csv("gameresults.csv")
library(glmulti)
M <- glmulti(result~speed*svl*strategy, data=results, crit = "aic", plotty = TRUE)
Here are a list of all models, accessed by the # operator:
M#formulas
# [[1]]
# result ~ 1 + speed + svl:speed + strategy:speed
# <environment: 0x11a616750>
#
# [[2]]
# result ~ 1 + speed + svl + svl:speed + strategy:speed
# <environment: 0x11a616750>
#
# [[3]]
# result ~ 1 + strategy + speed + svl:speed + strategy:speed
# <environment: 0x11a616750>
#
## **I omitted the remaining 36-3=33 models**
You can plot them individually based on the formula, using the base graphic or any packages that support use of model formulas. For example, I randomly selected one from the list:
plot(result ~ 1 + speed + svl, data=results)
## Hit <Return> to see next plot:
## Hit <Return> to see next plot:

R: cant get a lme{nlme} to fit when using self-constructed interaction variables

I'm trying to get a lme with self constructed interaction variables to fit. I need those for post-hoc analysis.
library(nlme)
# construct fake dataset
obsr <- 100
dist <- rep(rnorm(36), times=obsr)
meth <- dist+rnorm(length(dist), mean=0, sd=0.5); rm(dist)
meth <- meth/dist(range(meth)); meth <- meth-min(meth)
main <- data.frame(meth = meth,
cpgl = as.factor(rep(1:36, times=obsr)),
pbid = as.factor(rep(1:obsr, each=36)),
agem = rep(rnorm(obsr, mean=30, sd=10), each=36),
trma = as.factor(rep(sample(c(TRUE, FALSE), size=obsr, replace=TRUE), each=36)),
depr = as.factor(rep(sample(c(TRUE, FALSE), size=obsr, replace=TRUE), each=36)))
# check if all factor combinations are present
# TRUE for my real dataset; Naturally TRUE for the fake dataset
with(main, all(table(depr, trma, cpgl) >= 1))
# construct interaction variables
main$depr_trma <- interaction(main$depr, main$trma, sep=":", drop=TRUE)
main$depr_cpgl <- interaction(main$depr, main$cpgl, sep=":", drop=TRUE)
main$trma_cpgl <- interaction(main$trma, main$cpgl, sep=":", drop=TRUE)
main$depr_trma_cpgl <- interaction(main$depr, main$trma, main$cpgl, sep=":", drop=TRUE)
# model WITHOUT preconstructed interaction variables
form1 <- list(fixd = meth ~ agem + depr + trma + depr*trma + cpgl +
depr*cpgl +trma*cpgl + depr*trma*cpgl,
rndm = ~ 1 | pbid,
corr = ~ cpgl | pbid)
modl1 <- nlme::lme(fixed=form1[["fixd"]],
random=form1[["rndm"]],
correlation=corCompSymm(form=form1[["corr"]]),
data=main)
# model WITH preconstructed interaction variables
form2 <- list(fixd = meth ~ agem + depr + trma + depr_trma + cpgl +
depr_cpgl + trma_cpgl + depr_trma_cpgl,
rndm = ~ 1 | pbid,
corr = ~ cpgl | pbid)
modl2 <- nlme::lme(fixed=form2[["fixd"]],
random=form2[["rndm"]],
correlation=corCompSymm(form=form2[["corr"]]),
data=main)
The first model fits without any problems whereas the second model gives me following error:
Error in MEEM(object, conLin, control$niterEM) :
Singularity in backsolve at level 0, block 1
Nothing i found out about this error so far helped me to solve the problem. However the solution is probably pretty easy.
Can someone help me? Thanks in advance!
EDIT 1:
When i run:
modl3 <- lm(form1[["fixd"]], data=main)
modl4 <- lm(form2[["fixd"]], data=main)
The summaries reveal that modl4 (with the self constructed interaction variables) in contrast to modl3 shows many more predictors. All those that are in 4 but not in 3 show NA as coefficients. The problem therefore definitely lies within the way i create the interaction variables...
EDIT 2:
In the meantime I created the interaction variables "by hand" (mainly paste() and grepl()) - It seems to work now. However I would still be interested in how i could have realized it by using the interaction() function.
I should have only constructed the largest of the interaction variables (combining all 3 simple variables).
If i do so the model gets fit. The likelihoods then are very close to each other and the number of coefficients matches exactly.

Resources