Simulate an AR(1) in R as follows:
# True parameters
b0 <- 1 # intercept
b1 <- 0.9 # coefficient
trueMean <- b0 / (1-b1) # equals to 10
set.seed(8236)
capT <- 1000
eps <- rnorm(capT)
y <- rep(NA,capT)
y[1] <- b0 + b1*trueMean + eps[1] # Initialize the series
for(t in 2:capT) y[t] = b0 + b1*y[t-1] + eps[t]
reg1 <- ar(y)
reg2 <- arima(y, order=c(1,0,0))
reg3 <- lm( y[2:capT] ~y[1:(capT-1)] )
Both reg1 and reg3 estimates are close to the true values. However, reg2 which uses the arima function estimates an intercept close to the true Mean of 10. Any clue as to why this is happening?
Got the answer on this page http://www.stat.pitt.edu/stoffer/tsa2/Rissues.htm
It seems arima() reports the mean but calls it intercept!
Related
I am trying to extract the confidence intervals for my panel logit regression. I am using the following code:
model <- bife(dependent_variable ~ x1 + x2 | area, data = df, model = 'logit')
confint(model)
Running confint gives me NA values for all the coefficients and their confidence intervals.
Is this because of the 'bife' object? The model itself runs fine.
It's the bife:::vcov.bife method which doesn't produce dimnames. Until the author fixes this, we could help ourselves by writing a confint.bife method, that assigns coefficient names to the vcov.
confint.bife <- function (object, parm, level=0.95, ...) {
cf <- coef(object)
pnames <- names(cf)
if (missing(parm)) parm <- pnames
else if (is.numeric(parm)) parm <- pnames[parm]
a <- (1 - level)/2
a <- c(a, 1 - a)
pct <- stats:::format.perc(a, 3)
fac <- qnorm(a)
ci <- array(NA, dim=c(length(parm), 2L),
dimnames=list(parm, pct))
vc <- `dimnames<-`(vcov(object), list(pnames, pnames))
ses <- sqrt(diag(vc))[parm]
ci[] <- cf[parm] + ses %o% fac
ci
}
library('bife')
mod <- bife(LFP ~ I(AGE^2) + log(INCH) + KID1 + KID2 + KID3 +
factor(TIME) | ID, psid)
confint(mod)
# 2.5 % 97.5 %
# I(AGE^2) -0.003787755 -0.001185755
# log(INCH) -0.606681358 -0.236717893
# KID1 -1.393748723 -1.008131941
# KID2 -0.830532213 -0.485097762
# KID3 -0.248997085 0.012550225
# factor(TIME)2 -0.244728227 0.303869081
# factor(TIME)3 -0.190434814 0.438179674
# factor(TIME)4 0.117647679 0.870167422
# factor(TIME)5 0.635239557 1.547524672
# factor(TIME)6 0.613792831 1.689971248
# factor(TIME)7 0.639896725 1.876532219
# factor(TIME)8 0.585828050 2.017753781
# factor(TIME)9 0.753717289 2.381327746
In glm() it is possible to model bernoulli [0,1] outcomes with a logistic regression using the following sort of syntax.
glm(bin ~ x, df, family = "binomial")
However you can also perform aggregated binomial regression, where each observation represents a count of target events from a certain fixed number of bernoulli trials. For example see the following data:
set.seed(1)
n <- 50
cov <- 10
x <- c(rep(0,n/2), rep(1, n/2))
p <- 0.4 + 0.2*x
y <- rbinom(n, cov, p)
With these sort of data you use slightly different syntax in glm()
mod <- glm(cbind(y, cov-y) ~ x, family="binomial")
mod
# output
# Call: glm(formula = cbind(y, cov - y) ~ x, family = "binomial")
#
# Coefficients:
# (Intercept) x
# -0.3064 0.6786
#
# Degrees of Freedom: 49 Total (i.e. Null); 48 Residual
# Null Deviance: 53.72
# Residual Deviance: 39.54 AIC: 178
I was wondering is it possible to model this type of aggregated binomial data in the glmnet package? If so, what is the syntax?
Yes you can do it as the following
set.seed(1)
n <- 50
cov <- 10
x <- c(rep(0,n/2), rep(1, n/2))
x = cbind(x, xx = c(rep(0.5,20), rep(0.7, 20), rep(1,10)))
p <- 0.4 + 0.2*x
y <- rbinom(n, cov, p)
I added another covariate here called xx as glmnet accepts minimum of two covariates
In glm as you have it in your post
mod <- glm(cbind(y, cov-y) ~ x, family="binomial")
mod
# output
# Call: glm(formula = cbind(y, cov - y) ~ x, family = "binomial")
# Coefficients:
# (Intercept) xx xxx
# 0.04366 0.86126 -0.64862
# Degrees of Freedom: 49 Total (i.e. Null); 47 Residual
# Null Deviance: 53.72
# Residual Deviance: 38.82 AIC: 179.3
In glmnet, without regularization (lambda=0) to reproduce similar results as in glm
library(glmnet)
fit = glmnet(x, cbind(cov-y,y), family="binomial", lambda=0)
coef(fit)
# output
# 3 x 1 sparse Matrix of class "dgCMatrix"
# s0
# (Intercept) 0.04352689
# x 0.86111234
# xx -0.64831806
I am trying to calculate manually the r-squared given by lm() in R
Considering:
fit <- lm(obs_values ~ preds_values, df)
with sd(df$obs_values) == sd(df$preds_values) and mean(df$obs_values) == mean(df$preds_values)
To do so I can extract the residuals by doing
res_a = residuals(fit) and then inject them in the formula as :
y = sum( (df$obs_values - mean(df$obs_values))^2 )
r-squared = 1 - sum(res_a^2)/y
Here I get the expected r-squared
Now, I would like to get the residual manually.
It should be as trivial as :
res_b = df$obs_values - df$predss_values, but for some reason, res_b is different than res_a...
You can't just do y - x in a regression y ~ x to get residuals. Where have regression coefficients gone?
fit <- lm(y ~ x)
b <- coef(fit)
resi <- y - (b[1] + b[2] * x)
You have many options:
## Residuals manually
# option 1
beta_hat <- coef(fit)
obs_values_hat <- beta_hat["(Intercept)"] + beta_hat["preds_values"] * preds_values
u_hat <- obs_values - obs_values_hat # residuals
# option 2
obs_values_hat <- fitted(fit)
u_hat <- obs_values - obs_values_hat # residuals
# (option 3 - not manually) or just u_hat <- resid(fit)
## R-squared manually
# option 1
var(obs_values_hat) / var(obs_values)
# option 2
1 - var(u_hat) / var(obs_values)
# option 3
cor(obs_values, obs_values_hat)^2
Given below is the code for analysis of a resolvable alpha design (alpha lattice design) using the R package asreml.
# load the data
library(agridat)
data(john.alpha)
dat <- john.alpha
# load asreml
library(asreml)
# model1 - random `gen`
#----------------------
# fitting the model
model1 <- asreml(yield ~ 1 + rep, data=dat, random=~ gen + rep:block)
# variance due to `gen`
sg2 <- summary(model1 )$varcomp[1,'component']
# mean variance of a difference of two BLUPs
vblup <- predict(model1 , classify="gen")$avsed ^ 2
# model2 - fixed `gen`
#----------------------
model2 <- asreml(yield ~ 1 + gen + rep, data=dat, random = ~ rep:block)
# mean variance of a difference of two adjusted treatment means (BLUE)
vblue <- predict(model2 , classify="gen")$avsed ^ 2
# H^2 = .803
sg2 / (sg2 + vblue/2)
# H^2c = .809
1-(vblup / 2 / sg2)
I am trying to replicate the above using the R package lme4.
# model1 - random `gen`
#----------------------
# fitting the model
model1 <- lmer(yield ~ 1 + (1|gen) + rep + (1|rep:block), dat)
# variance due to `gen`
varcomp <- VarCorr(model1)
varcomp <- data.frame(print(varcomp, comp = "Variance"))
sg2 <- varcomp[varcomp$grp == "gen",]$vcov
# model2 - fixed `gen`
#----------------------
model2 <- lmer(yield ~ 1 + gen + rep + (1|rep:block), dat)
How to compute the vblup and vblue (mean variance of difference) in lme4 equivalent to predict()$avsed ^ 2 of asreml ?
I'm not that familiar with this variance partitioning stuff, but I'll take a shot.
library(lme4)
model1 <- lmer(yield ~ 1 + rep + (1|gen) + (1|rep:block), john.alpha)
model2 <- update(model1, . ~ . + gen - (1|gen))
## variance due to `gen`
sg2 <- c(VarCorr(model1)[["gen"]]) ## 0.142902
Get conditional variances of BLUPs:
rr1 <- ranef(model1,condVar=TRUE)
vv1 <- attr(rr$gen,"postVar")
str(vv1)
## num [1, 1, 1:24] 0.0289 0.0289 0.0289 0.0289 0.0289 ...
This is a 1x1x24 array (effectively just a vector of variances; we could collapse using c() if we needed to). They're not all the same, but they're pretty close ... I don't know whether they should all be identical (and this is a roundoff issue)
(uv <- unique(vv1))
## [1] 0.02887451 0.02885887 0.02885887
The relative variation is approximately 5.4e-4 ...
If these were all the same then the mean variance of a difference of any two would be just twice the variance (Var(x-y) = Var(x)+Var(y); by construction the BLUPs are all independent). I'm going to go ahead and use this.
vblup <- 2*mean(vv1)
For the model with gen fitted as a fixed effect, let's extract the variances of the parameters relating to genotypes (which are differences in the expected value from the first level):
vv2 <- diag(vcov(model2))[-(1:3)]
summary(vv2)
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06631 0.06678 0.07189 0.07013 0.07246 0.07286
I'm going to take the means of these values (not double the values, since these are already the variances of differences)
vblue <- mean(vv2)
sg2/(sg2+vblue/2) ## 0.8029779
1-(vblup/2/sg2) ## 0.7979965
The H^2 estimate looks right on, but the H^2c estimate is a little different (0.797 vs. 0.809, a 1.5% relative difference); I don't know if that is big enough to be of concern or not.
I'm trying to implement a Bayesian ANCOVA that takes account of heteroscedasticity in R using JAGS. However, despite going through several tutorials of Bayesian simple regression and ANOVA, I can't understand how to prepare the file for JAGS. Here is my code so far:
y1 = rexp(57, rate=0.8) # dependent variable
x1 = hist(rbeta(57, 6, 2)) # continuous factor
x2 = rep(c(1, 2), 57/2) # categorical factor
groups = 2
n = 57
# list of variables
lddados <- list(g=groups, n=length(x), y=y, x1=x1, x2=x2)
sink('reglin.txt') # nome do arquivo aqui
cat('
# model
{
for(i in 1:n){
mu[i] = a0 + a[i]
y[i] = a0 + x1*a[ x2[i] ] + ε[i]
}
priors
y ~ dgamma(0.001,0.01)
for(i in 1:n){
inter[i] ~ dgamma(0.001,0.001)
coef[i] ~ dnorm(0.0,1.0E-
likelihood
got stuck...
}
}#------fim do modelo
')
sink()
Im currently trying out ANCOVA using rjags myself...
From my understanding, I would test this (untested);
require(rjags)
require(coda)
model_string <- "
model {
for ( i in 1:n ){
mu[i] <- a0 + a[x2[i]] + a3 * x1[i] # linear predictor
y[i] ~ dnorm(mu[i], prec) # y is norm. dist.
}
# priors
a0 ~ dnorm(0, 1.0E-6) # intercept
a[1] ~ dnorm(0, 1.0E-6) # effect of x1 at x2 level 1
a[2] ~ dnorm(0, 1.0E-6) # effect of x1 at x2 level 2
a3 ~ dnorm(0, 1.0E-6) # regression coefficient for x1 (covariate)
prec ~ dgamma(0.001, 0.001) # precision (inverse of variance)
}
"
# initial values for the mcmc
inits_list <- list(a=0, b=c(0,0), prec=100)
# model, initial values and data in right format
jags_model <- jags.model(textConnection(model_string), data=data, inits=inits_list, n.adapt = 500, n.chains = 3, quiet = T)
# burn-in
update(jags_model, 10000)
# run the mcmc chains using the coda package
mcmc_samples <- coda.samples(jags_model, c("mu", "a", "a1", "a2", "prec"), n.iter = 100000)
Tell me if it works...
Recommended books; McCarthy M. Bayesian Methods for Ecology and Kruschke JK. Doing Bayesian Data Analysis