I am trying to model a "what if" situation in a binary logit model. I am estimating the probability of passing a test given the level of difficulty of the test (1 = easiest, 5 = toughest), with gender as a control. (The data is here.) Students are administered a test which is generally tough ("HIGH" in the data). From this we can estimate the impact of test difficulty on the likelihood of passing:
model = glm(PASS ~ as.factor(SEX) + as.factor(HIGH), family=binomial(link="logit"), data=df)
summary(model)
We can also get the predicted probabilities of passing with:
predict.high = predict(model, type="response")
The question is, what if the "LOW" test were given instead? To get the new probabilities, we can do:
newdata = rename.vars(subset(df, select=c(-HIGH)), 'LOW','HIGH')  # rename.vars() is from the gdata package
predict.low = predict(model, newdata=newdata, type="response")
But how do I know how many additional students would have passed in this case? Is there an obvious switch in glm() I am not seeing?
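A point estimate seems straightforward, since summing the predicted probabilities gives the expected number of passes under each scenario:
expected.high = sum(predict.high)   # expected number passing the HIGH test
expected.low  = sum(predict.low)    # expected number passing the LOW test
expected.low - expected.high        # expected additional passes under the LOW test
But I would also like some measure of the uncertainty around that difference.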
I have not yet dug out my code for prediction based on Gelman and Hill (2006), who, I seem to recall, used simulation. I still intend to do that. One aspect of your question that is new to my limited experience is that I am accustomed to predicting for a single observation (in this case a single student taking a single test). You, however, seem to want to predict a difference between two sets of predictions. In other words, you want to predict how many more students would pass if given a set of 5 easy exams rather than a set of 5 hard exams.
I am not sure whether Gelman and Hill (2006) covered that. You also seem to want to do this with a frequentist approach.
I am thinking that if you can predict for a single observation, so that you have a confidence interval for each observation, then perhaps you can estimate a weighted average probability of passing within each group and subtract the two weighted averages. The delta method could be used to estimate a confidence interval on the weighted averages and on their difference.
Covariance among predicted observations might have to be assumed to be 0 to implement that approach.
If assuming a covariance of 0 is not satisfactory then perhaps a Bayesian approach would be better. Again, I am only familiar with predicting for a single observation. With a Bayesian approach I have predicted a single observation by including the independent variables, but not the dependent variable, for the observation to be predicted. I suppose you could predict for every observation in the same Bayesian run (predict each student in HIGH and in LOW). The weighted averages of passing tests for each group and the difference in weighted averages are derived parameters and I suspect could be included directly in the code for the Bayesian logistic regression. Then you would have your point estimate and estimate of variance for probability of passing each group of tests and for the difference in probability of passing each group of tests. If you want the difference in the number of students passing each group of tests, perhaps that could be included in the Bayesian code as a derived parameter also.
I realize this answer, so far, has been more conversational than might be desired. I am simply mapping out strategies to attempt without having had the time yet to try implementing those strategies. Providing all of the R and WinBUGS code to implement both proposed strategies might take me a few days. (WinBUGS or OpenBUGS can be called from within R.) I will append the code to this answer as I go along. If anyone deems my proposed strategies, and/or forthcoming code, incorrect I hope they will feel free to point out my errors and offer corrections.
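In the meantime, here is a rough sketch of the kind of simulation I recall from Gelman and Hill (2006), using sim() from the arm package. It reuses the model and newdata objects defined in the question, and it assumes the LOW test takes the same set of levels as HIGH; treat it as a starting point rather than a finished solution:
library(arm)
n.sims <- 1000
post <- sim(model, n.sims)                     # draws from the approximate sampling distribution of the coefficients
beta <- post@coef                              # n.sims x k matrix of coefficient draws
X.high <- model.matrix(model)                  # design matrix under the HIGH test
X.low <- model.matrix(delete.response(terms(model)), data = newdata)  # design matrix under the LOW test
p.high <- plogis(X.high %*% t(beta))           # n x n.sims matrix of passing probabilities
p.low <- plogis(X.low %*% t(beta))
diff.pass <- colSums(p.low) - colSums(p.high)  # simulated differences in expected number passing
quantile(diff.pass, c(0.025, 0.5, 0.975))      # interval for the additional passes under LOW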
EDIT
Below is code that generates fake data and analyzes that data using both a frequentist and a Bayesian approach. I have not yet added the code to implement the above ideas for prediction; I will try to add the Bayesian prediction code in the next 1-2 days. I only used three tests instead of five. The way the code is written below, you can change the number of students, n, to any positive number that is evenly divisible by 6.
# Bayesian_logistic_regression_June2012.r
# June 24, 2012
library(R2WinBUGS)
library(arm)
library(BRugs)
set.seed(3234)
# create fake data for n students and three tests
n <- 1200
# create factors for n/6 students in each of 6 categories
gender <- c(rep(0, (n/2)), rep(1, (n/2)))
test2 <- c(rep(0, (n/6)), rep(1, (n/6)), rep(0, (n/6)),
           rep(0, (n/6)), rep(1, (n/6)), rep(0, (n/6)))
test3 <- c(rep(0, (n/6)), rep(0, (n/6)), rep(1, (n/6)),
           rep(0, (n/6)), rep(0, (n/6)), rep(1, (n/6)))
# assign slopes to factors
B0 <- 0.4
Bgender <- -0.2
Btest2 <- 0.6
Btest3 <- 1.2
# estimate probability of passing test
p.pass <- ( exp(B0 + Bgender * gender +
                Btest2 * test2 +
                Btest3 * test3) /
            (1 + exp(B0 + Bgender * gender +
                     Btest2 * test2 +
                     Btest3 * test3)))
# identify which students passed their test, 0 = fail, 1 = pass
passed <- rep(0, n)
r.passed <- runif(n,0,1)
passed[r.passed <= p.pass] = 1
# use frequentist approach in R to estimate probability
# of passing test
m.freq <- glm(passed ~ as.factor(gender) +
                       as.factor(test2) +
                       as.factor(test3),
              family = binomial)
summary(m.freq)
# predict(m.freq, type = "response")
# use OpenBUGS to analyze same data set
# Define model
sink("Bayesian.logistic.regression.txt")
cat("
model {
  # Priors
  alpha ~ dnorm(0,0.01)
  bgender ~ dnorm(0,0.01)
  btest2 ~ dnorm(0,0.01)
  btest3 ~ dnorm(0,0.01)
  # Likelihood
  for (i in 1:n) {
    passed[i] ~ dbin(p[i], 1)
    logit(p[i]) <- (alpha + bgender * gender[i] +
                    btest2 * test2[i] +
                    btest3 * test3[i])
  }
  # Derived parameters
  p.g.t1 <- exp(alpha) / (1 + exp(alpha))
  p.b.t1 <- exp(alpha + bgender) / (1 + exp(alpha + bgender))
  p.g.t2 <- ( exp(alpha + btest2) /
              (1 + exp(alpha + btest2)))
  p.b.t2 <- ( exp(alpha + bgender + btest2) /
              (1 + exp(alpha + bgender + btest2)))
  p.g.t3 <- ( exp(alpha + btest3) /
              (1 + exp(alpha + btest3)))
  p.b.t3 <- ( exp(alpha + bgender + btest3) /
              (1 + exp(alpha + bgender + btest3)))
}
", fill = TRUE)
sink()
my.data <- list(passed = passed,
gender = gender,
test2 = test2,
test3 = test3,
n = length(passed))
# Inits function
inits <- function(){ list(alpha = rlnorm(1),
bgender = rlnorm(1),
btest2 = rlnorm(1),
btest3 = rlnorm(1)) }
# Parameters to estimate
params <- c("alpha", "bgender", "btest2", "btest3",
"p.g.t1", "p.b.t1", "p.g.t2", "p.b.t2",
"p.g.t3", "p.b.t3")
# MCMC settings
nc <- 3
ni <- 2000
nb <- 500
nt <- 2
# Start Gibbs sampling
out <- bugs(data = my.data, inits = inits,
            parameters.to.save = params,
            "c:/users/Mark W Miller/documents/Bayesian.logistic.regression.txt",
            program = 'OpenBUGS',
            n.thin = nt, n.chains = nc,
            n.burnin = nb, n.iter = ni, debug = TRUE)
print(out, dig = 5)
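One quick check worth running on the bugs() output before reading too much into the derived parameters (out$summary is the summary matrix returned by bugs(); the 1.1 cutoff is just a common rule of thumb):
# flag any parameters whose Rhat suggests the chains have not converged
rownames(out$summary)[out$summary[, "Rhat"] > 1.1]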
Before I attempt to implement the weighted-average approach to prediction I wanted to convince myself that it might work. So I ginned up the following code, which seems to suggest it may:
# specify number of girls taking each test and
# number of boys taking each test
g.t1 <- rep(0,400)
b.t1 <- rep(0,120)
g.t2 <- rep(0,1200)
b.t2 <- rep(0,50)
g.t3 <- rep(0,1000)
b.t3 <- rep(0,2000)
# specify probability of individuals in each of the
# 6 groups passing their test
p.g1.t1 <- 0.40
p.b1.t1 <- 0.30
p.g1.t2 <- 0.60
p.b1.t2 <- 0.50
p.g1.t3 <- 0.80
p.b1.t3 <- 0.70
# identify which individuals in each group passed their test
g.t1[1:(p.g1.t1 * length(g.t1))] = 1
sum(g.t1)
b.t1[1:(p.b1.t1 * length(b.t1))] = 1
sum(b.t1)
g.t2[1:(p.g1.t2 * length(g.t2))] = 1
sum(g.t2)
b.t2[1:(p.b1.t2 * length(b.t2))] = 1
sum(b.t2)
g.t3[1:(p.g1.t3 * length(g.t3))] = 1
sum(g.t3)
b.t3[1:(p.b1.t3 * length(b.t3))] = 1
sum(b.t3)
# determine the weighted average probability of passing
# on test day for all individuals as a class
wt.ave.p <- ((p.g1.t1 * length(g.t1) + p.b1.t1 * length(b.t1) +
              p.g1.t2 * length(g.t2) + p.b1.t2 * length(b.t2) +
              p.g1.t3 * length(g.t3) + p.b1.t3 * length(b.t3) ) /
             (length(g.t1) + length(b.t1) + length(g.t2) +
              length(b.t2) + length(g.t3) + length(b.t3)))
wt.ave.p
# determine the expected number of individuals passing
# their test in the class as a whole
exp.num.pass <- wt.ave.p * (length(g.t1) + length(b.t1) +
                            length(g.t2) + length(b.t2) +
                            length(g.t3) + length(b.t3))
exp.num.pass
# determine the number of individuals passing
num.passing <- (sum(g.t1) + sum(b.t1) +
                sum(g.t2) + sum(b.t2) +
                sum(g.t3) + sum(b.t3) )
num.passing
num.passing
# the expected number of students passing, exp.num.pass,
# should equal the observed number of students passing,
# num.passing regardless of the number of students in each
# group and regardless of the probability of passing a
# given test, within rounding error
identical(round(exp.num.pass), round(num.passing))
Hopefully in the next couple of days I can try adding the prediction code to the above Bayesian code.
EDIT - June 27, 2012
I have not forgotten about this. Rather, I have encountered several problems:
With logistic regression it is possible to predict: a) the probability, p, that students in a given group pass a test, and b) the outcome for each individual student taking a test (0 or 1), with all of the 0's and 1's then averaged. I am not sure which of these to use. The point estimate and SD of the predicted p are identical to the estimated p for the known test outcomes. The point estimate of the average of the predicted 0's and 1's is a little different, and the SD of the averaged 0's and 1's is much larger. I believe I want b), the average of the predicted 0's and 1's, but I am checking various websites and books to be sure. Collett (1991) has a worked example that does not employ computer code, but that example includes a half-dozen variables (including two interactions), and I am having a little trouble getting my Bayesian estimates to match her frequentist estimates.
With lots of derived parameters the program is taking a long time to run.
OpenBUGS has apparently been crashing frequently, even without the prediction code. I am not sure whether that is because of something I am doing wrong, because of changes in recent versions of R or of R packages, because I am trying to run the code with 64-bit R, or something else.
I will try to post the prediction code soon, but all of the above issues have slowed me down.
You can use the predicted probabilities to simulate individual pass/fail outcomes by drawing a uniform random cutoff for each student:
cutoff <- runif(length(predicted_probabilities))
A student is then counted as passing whenever their predicted probability exceeds their cutoff; this is a stochastic (Bernoulli) simulation of outcomes rather than a deterministic rule.
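Spelled out, and assuming predicted_probabilities holds the fitted values from predict(..., type = "response"):
simulated.pass <- as.numeric(predicted_probabilities > cutoff)   # 1 = simulated pass, 0 = simulated fail
sum(simulated.pass)                                               # simulated number of students passing
Repeating the draw many times gives a distribution for the number of students passing.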
I'm currently working on constructing a zero-inflated negative binomial model in JAGS to model yearly change in abundance using count data, and I'm a bit lost on how best to specify the model. I've included an example of the base model I'm using below. The main issue I'm struggling with is that in the model output I'm getting poor convergence (high Rhat values, low n.eff values) and the 95% credible intervals are huge. I realize that without seeing/running the actual data there's probably not much anyone can help with, but I thought I'd at least try and see if there are any obvious errors in the way I have the basic model specified. I also tried fitting a variety of other model types (regular negative binomial, Poisson, and zero-inflated Poisson), but decided to go with the ZINB since it had the lowest DIC of all the models and also makes the most intuitive sense to me given my data structure.
library(R2jags)
# Create example dataframe
years <- c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2)
sites <- c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3)
months <- c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3)
# Count data
day1 <- floor(runif(18,0,7))
day2 <- floor(runif(18,0,7))
day3 <- floor(runif(18,0,7))
day4 <- floor(runif(18,0,7))
day5 <- floor(runif(18,0,7))
df <- as.data.frame(cbind(years, sites, months, day1, day2, day3, day4, day5))
# Put count data into array
y <- array(NA,dim=c(2,3,3,5))
for(m in 1:2){
  for(k in 1:3){
    sel.rows <- df$years == m & df$months == k
    y[m,k,,] <- as.matrix(df)[sel.rows, 4:8]
  }
}
# JAGS model
sink("model1.txt")
cat("
model {
  # PRIORS
  for(m in 1:2){
    r[m] ~ dunif(0,50)
  }
  t.int ~ dlogis(0,1)
  b.int ~ dlogis(0,1)
  p.det ~ dunif(0,1)
  # LIKELIHOOD
  # ECOLOGICAL SUBMODEL FOR TRUE ABUNDANCE
  for (m in 1:2) {
    zero[m] ~ dbern(pi[m])
    pi[m] <- ilogit(mu.binary[m])
    mu.binary[m] <- t.int
    for (k in 1:3) {
      for (i in 1:3) {
        N[m,k,i] ~ dnegbin(p[m,k,i], r[m])
        p[m,k,i] <- r[m] / (r[m] + (1 - zero[m]) * lambda.count[m,k,i]) - 1e-10 * zero[m]
        lambda.count[m,k,i] <- exp(mu.count[m,k,i])
        log(mu.count[m,k,i]) <- b.int
        # OBSERVATIONAL SUBMODEL FOR DETECTION
        for (j in 1:5) {
          y[m,k,i,j] ~ dbin(p.det, N[m,k,i])
        }#j
      }#i
    }#k
  }#m
}#END", fill=TRUE)
sink()
win.data <- list(y = y)
Nst <- apply(y,c(1,2,3),max)+1
inits <- function()list(N = Nst)
params <- c("N")
nc <- 3
nt <- 1
ni <- 50000
nb <- 5000
out <- jags(win.data, inits, params, "model1.txt",
n.chains = nc, n.thin = nt, n.iter = ni, n.burnin = nb,
working.directory = getwd())
print(out)
Tried fitting a ZINB model in JAGS using the code specified above but am having issues with model convergence.
The way that I have tended to specify zero-inflated models is to model the data as being Poisson distributed, with a mean that is either zero (if that individual is part of the zero-inflated group) or distributed according to a gamma distribution otherwise. Something like:
Obs[i] ~ dpois(lambda[i] * is_zero[i])
is_zero[i] ~ dbern(zero_prob)
lambda[i] ~ dgamma(k, k/mean)
Something similar to this was first used in this paper: https://www.researchgate.net/publication/5231190_The_distribution_of_the_pathogenic_nematode_Nematodirus_battus_in_lambs_is_zero-inflated
These models usually converge OK, although the performance is not as good as for simpler models of course. You also need to make sure to supply initial values for is_zero so that the model starts with all individuals with positive counts in the appropriate group.
In your case, you have multiple time points, so you need to decide if the zero-inflation is fixed over time points (i.e. an individual cannot switch into or out of the zero-inflated group over time), or if each observation is completely independent with respect to zero-inflation status. You also need to decide if you want covariates of year/month/site affecting the mean count (i.e. the gamma part) or the probability of a positive count (i.e. the zero-inflation part). For the former, you need to index mean (in my formulation) by i and then use a GLM-like formula (probably with a log link) to relate this to the appropriate covariates. For the latter, you need to index zero_prob by i and then use a GLM-like formula (probably with a logit link) to relate this to the appropriate covariates. It is also possible to do both, but if you try to use the same covariates in both parts then you can expect convergence problems! A rough sketch of the first option is given below.
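To make that concrete, here is a rough sketch of how a covariate might enter the mean-count part under the formulation above (month[i], the coefficient names, and the priors here are placeholders, not anything from your model):
for (i in 1:N) {
  Obs[i] ~ dpois(lambda[i] * is_zero[i])
  is_zero[i] ~ dbern(zero_prob)          # constant zero-inflation probability in this sketch
  lambda[i] ~ dgamma(k, k / mu[i])       # mu plays the role of 'mean' above
  log(mu[i]) <- b0 + b.month * month[i]  # GLM-like formula for the mean count
}
zero_prob ~ dunif(0, 1)
b0 ~ dnorm(0, 0.001)
b.month ~ dnorm(0, 0.001)
k ~ dgamma(0.001, 0.001)
The zero-inflation part would be handled analogously by indexing zero_prob by i and putting a logit-link formula on it.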
It would arguably be better to replace the separate Poisson-Gamma distributions with a single Negative Binomial distribution using the 'ecology parameterisation' with mean and k. This is not currently implemented in JAGS, but I will add it for the next update.
You can get the data from here: http://www.statsci.org/data/oz/snails.txt
My data come from a 4*3*3*2 completely randomized design experiment. I want to model the probability of survival in terms of the stimulus variables.
I tried ANOVA, but I'm not sure whether it's right or not.
Because I want to model the "probability", should I use a logistic model?
(I also tried a logistic model, but the data give the number of deaths out of 20 rather than individual 0/1 outcomes. Even though the response is not 0 and 1, can I still use logistic regression?)
I want to put the "probability" in as the Y variable.
So I used the logit as the response, but it's not working: the program says that y is Inf.
How can I use the logit as the Y variable in aov?
# attempt 1: glm on the raw Deaths counts
glm_a <- glm(Deaths ~ Exposure + Rel.Hum + Temp + Species, data = data,
             family = binomial)
# attempt 2: hand-computed logit of the death proportion as the response in lm()
prob <- data$Deaths / 20
logitt <- log(prob / (1 - prob))   # becomes -Inf or Inf whenever prob is 0 or 1
logmodel <- lm(logitt ~ data$Species + data$Exposure + data$Rel.Hum + data$Temp)
summary(logmodel)
# attempt 3: recoding Species and fitting a full-interaction glm
A <- factor(data$Species, levels = c("A", "B"), labels = c(-1, 1))
glm_a <- glm(Y ~ data$Species * data$Exposure * data$Rel.Hum * data$Temp,
             data = data, family = binomial)
summary(glm_a)
help("glm") should direct you to help("family"), which reveals the following
For the binomial and quasibinomial families the response can be specified in one of three ways:
As a factor: ‘success’ is interpreted as the factor not having the first level (and hence usually of having the second level).
As a numerical vector with values between 0 and 1, interpreted as the proportion of successful cases (with the total number of cases given by the weights).
As a two-column integer matrix: the first column gives the number of successes and the second the number of failures.
So for the question "How can I make logistic model with this data?", we can go with route #3 quite easily:
data <- read.table("http://www.statsci.org/data/oz/snails.txt", header = TRUE)
glm_a <- glm(cbind(Deaths, N - Deaths) ~ Species * Exposure * Rel.Hum * Temp,
data = data, family = binomial)
summary(glm_a)
# [output omitted]
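Route #2 (proportions of successes with the number of trials as weights) should give an equivalent fit, for what it's worth:
glm_b <- glm(Deaths / N ~ Species * Exposure * Rel.Hum * Temp,
             data = data, weights = N, family = binomial)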
As for the question "I tried ANOVA, but I'm not sure whether it's right or not. Because I want to model the 'probability', should I use a logistic model?", that is better asked on Cross Validated.
For my graduate research I'm using the CPLM package (specifically the cpglmm function) to account for zero-inflated data (Tweedie compound Poisson distribution) in a data set looking at the effects of logging on breeding bird densities. This isn't a widely used package like lme4, nlme, etc. Therefore, the model validation methods that can be used on these more commonly used packages cannot be used on cpglmm.
I'm currently at the stage of describing the fit of my models and am trying to calculate R-squared values, both marginal and conditional. Unfortunately I cannot use the r2glmm package or MuMIn to calculate R-squared values because they do not support cpglmm. Therefore, I've had to calculate those values manually, following an example found here (Appendix 6, cpglmm parasite models, p. 33). Here's the script from that example:
# Fit null model without fixed effects (but including all random effects)
parmodCPr <- cpglmm(Parasite ~ 1 + (1 | Population) + (1 | Container), data = DataAll)
# Fit alternative model including fixed and all random effects
parmodCPf <- cpglmm(Parasite ~ Sex + Treatment + Habitat + (1 | Population) +
(1 | Container), data = DataAll)
# Calculation of the variance in fitted values
VarF <- var(as.vector(model.matrix(parmodCPf) %*% fixef(parmodCPf)))
# getting the observation-level variance Null model
phiN <- parmodCPr@phi # the dispersion parameter
pN <- parmodCPr@p # the index parameter
mu <- exp(fixef(parmodCPr) + 0.5 * (VarCorr(parmodCPr)$Population[1] + VarCorr(parmodCPr)$Container[1]))
VarOdN <- phiN * mu^(pN - 2) # the delta method
# Full model
phiF <- parmodCPf@phi # the dispersion parameter
pF <- parmodCPf@p # the index parameter
VarOdF <- phiF * mu^(pF - 2) # the delta method
# R2[GLMM(m)] - marginal R2[GLMM]; using the delta method observation-level variance
R2glmmM <- VarF/(VarF + sum(as.numeric(VarCorr(parmodCPf))) + VarOdF)
# R2[GLMM(c)] - conditional R2[GLMM] for full model
R2glmmC <- (VarF + sum(as.numeric(VarCorr(parmodCPf))))/(VarF + sum(as.numeric(VarCorr(parmodCPf))) +
VarOdF)
What I would like to do is write an R function based on this code that outputs both the marginal and conditional R-squared values (R2glmmM and R2glmmC), taking my models as input. I'd greatly appreciate any help with this problem. Hopefully I have supplied enough information.
Thanks.
Believe I figured it out. Here's an example I wrote up:
R2glmm <- function(model){
  # NOTE: NULLmodel (the intercept-only cpglmm fit with the same random effects)
  # must already exist in the workspace; 'model' is the alternate (full) model
  # Calculation of the variance in fitted values
  VarALT <- var(as.vector(model.matrix(model) %*% fixef(model)))
  # getting the observation-level variance of the null model
  phiNULL <- NULLmodel$phi # the dispersion parameter
  pNULL <- NULLmodel$p # the index parameter
  mu <- exp(fixef(NULLmodel) + 0.5 * (VarCorr(NULLmodel)$YEAR[1]))
  VarOdNULL <- phiNULL * mu^(pNULL - 2) # the delta method
  # Alternate model
  phiALT <- model$phi # the dispersion parameter
  pALT <- model$p # the index parameter
  VarOdALT <- phiALT * mu^(pALT - 2) # the delta method
  # R2[GLMM(m)] - marginal R2[GLMM]; using the delta method observation-level variance
  R2glmmM <- VarALT / (VarALT + sum(as.numeric(VarCorr(model))) + VarOdALT)
  # R2[GLMM(c)] - conditional R2[GLMM] for the full model
  R2glmmC <- (VarALT + sum(as.numeric(VarCorr(model)))) / (VarALT + sum(as.numeric(VarCorr(model))) + VarOdALT)
  return(c(R2glmmM, R2glmmC))
}
Variables containing ALT refer to the alternate model. "model" represents any cpglmm model you need to run through the function.
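As a usage sketch (the object names here are hypothetical; NULLmodel must be the intercept-only cpglmm fit with the same random effects):
# hypothetical example: 'birds' is your data frame, YEAR your random effect
NULLmodel <- cpglmm(Density ~ 1 + (1 | YEAR), data = birds)
fullmodel <- cpglmm(Density ~ Treatment + (1 | YEAR), data = birds)
R2glmm(fullmodel)   # returns c(marginal R2, conditional R2)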
Hope this helps someone out. Been working on this problem and other related ones for ages now.
I am trying to get a perceptron algorithm for classification working but I think something is missing. This is the decision boundary achieved with logistic regression:
The red dots got into college, after performing better on tests 1 and 2.
This is the data, and this is the code for the logistic regression in R:
dat = read.csv("perceptron.txt", header=F)
colnames(dat) = c("test1","test2","y")
plot(test2 ~ test1, col = as.factor(y), pch = 20, data=dat)
fit = glm(y ~ test1 + test2, family = "binomial", data = dat)
coefs = coef(fit)
(x = c(min(dat[,1])-2, max(dat[,1])+2))
(y = c((-1/coefs[3]) * (coefs[2] * x + coefs[1])))
lines(x, y)
The code for the "manual" implementation of the perceptron is as follows:
# DATA PRE-PROCESSING:
dat = read.csv("perceptron.txt", header=F)
dat[,1:2] = apply(dat[,1:2], MARGIN = 2, FUN = function(x) scale(x)) # scaling the data
data = data.frame(rep(1,nrow(dat)), dat) # introducing the "bias" column
colnames(data) = c("bias","test1","test2","y")
data$y[data$y==0] = -1 # Turning 0/1 dependent variable into -1/1.
data = as.matrix(data) # Turning data.frame into matrix to avoid mmult problems.
# PERCEPTRON:
set.seed(62416)
no.iter = 1000 # Number of loops
theta = rnorm(ncol(data) - 1) # Starting a random vector of coefficients.
theta = theta/sqrt(sum(theta^2)) # Normalizing the vector.
h = theta %*% t(data[,1:3]) # Performing the first f(theta^T X)
for (i in 1:no.iter){                                    # We will recalculate 1,000 times
  for (j in 1:nrow(data)){                               # Each time we go through each example.
    if(h[j] * data[j, 4] < 0){                           # If the hypothesis disagrees with the sign of y,
      theta = theta + (sign(data[j,4]) * data[j, 1:3])   # We + or - the example from theta.
    }
    else
      theta = theta                                      # Else we let it be.
  }
  h = theta %*% t(data[,1:3])                            # Calculating h() after iteration.
}
theta # Final coefficients
mean(sign(h) == data[,4]) # Accuracy
With this, I get the following coefficients:
bias test1 test2
9.131054 19.095881 20.736352
and an accuracy of 88%, consistent with that calculated with the glm() logistic regression function: mean(sign(predict(fit))==data[,4]) gives 89%. Logically, there is no way to classify all of the points linearly, as is obvious from the plot above. In fact, iterating only 10 times and plotting the accuracy shows that ~90% is reached after just 1 iteration:
Since this is in line with the training classification performance of the logistic regression, the code is probably not conceptually wrong.
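For what it's worth, the two fits can also be compared visually by overlaying the perceptron's boundary on a plot of the scaled data (a sketch reusing theta and the data matrix from above):
plot(data[, "test1"], data[, "test2"], col = as.factor(data[, "y"]), pch = 20)
xs <- range(data[, "test1"])
lines(xs, -(theta[1] + theta[2] * xs) / theta[3])   # perceptron decision boundary: theta'x = 0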
QUESTION: Is it OK to get coefficients so different from those of the logistic regression:
(Intercept) test1 test2
1.718449 4.012903 3.743903
This is really more of a CrossValidated question than a StackOverflow question, but I'll go ahead and answer.
Yes, it's normal and expected to get very different coefficients because you can't directly compare the magnitude of the coefficients between these 2 techniques.
With the logit (logistic) model you're using a binomial likelihood and a logit link, i.e. a sigmoid mapping from the linear predictor to a probability. The coefficients are only meaningful in that context. You've also got an intercept term in the logit model.
None of this is true for the perceptron model. The interpretation of the coefficients is thus totally different.
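One way to see this concretely: the perceptron's decision rule only uses the sign of theta'x, so rescaling theta by any positive constant leaves every prediction, and hence the accuracy, unchanged (a sketch reusing theta and data from your code):
h1 <- sign(theta %*% t(data[, 1:3]))         # predictions from the fitted theta
h2 <- sign((5 * theta) %*% t(data[, 1:3]))   # predictions from an arbitrarily rescaled theta
all(h1 == h2)                                # TRUE: coefficient magnitude alone carries no information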
Now, that's not saying anything about which model is better. There aren't comparable performance metrics in your question that would allow us to determine that. To determine that you should do cross-validation or at least use a holdout sample.
I am trying to build a dynamic regression model and so far I did it with the dynlm package. Basically the model looks like this
y_t = a*x1_t + b*x2_t + ... + c*y_(t-1).
y_t is to be predicted; x1_t and x2_t are given, and so is y_(t-1).
Building the model with the dynlm package worked fine, but when it came to predicting y_t I got confused...
I found this, which seems to be a very similar problem, but it did not help me to handle my own problem.
Here is the problem I am facing (basically, what predict() does seems to be weird; see the comments!):
library(dynlm)
# Create Data
set.seed(1)
y <- arima.sim(model = list(ar = c(.9)), n = 11) #Create AR(1) dependant variable
A <- rnorm(11) #Create independent variables
B <- rnorm(11)
y <- y + .5 * A + .2 * B #Add relationship to independent variables
data = cbind(y, A, B)
# subset used for the fitting of the model
reg <- data[1:10, ]
# Fit dynamic linear model
model <- dynlm(y ~ A + B + L(y, k = 1), data = reg) # dynlm
model
# Time series regression with "zooreg" data:
# Start = 2, End = 11
#
# Call:
# dynlm(formula = y ~ A + B + L(y, k = 1), data = reg)
# Coefficients:
# (Intercept) A B L(y, k = 1)
# 0.8930 -0.2175 0.2892 0.5176
# subset the last two rows:
# the last row (r11), for which y_t shall be predicted and whose A and B values are inputs for the prediction,
# and the second-to-last row (r10), so that y_(t-1) can be supplied to the model as well
pred <- as.data.frame(data[10:11, ])
# prediction using predict()
predict(model, newdata = pred)
# 1 2
# 1.833134 1.483809
# manual calculation of prediction of y in r11 (how I thought it should be...), taking y_(t-1) as input
predicted_value <- model$coefficients[1] + model$coefficients[2] * pred[2, 2] + model$coefficients[3] * pred[2, 3] + model$coefficients[4] * pred[1, 1]
predicted_value
# (Intercept)
# 1.743334
# and this reproduces the value returned by predict() above: it plugs y_t itself into the model
# (i.e. the value that should be predicted) instead of y_(t-1)
predicted_value <- model$coefficients[1] + model$coefficients[2] * pred[2, 2] + model$coefficients[3] * pred[2, 3] + model$coefficients[4] * pred[2, 1]
predicted_value
# (Intercept)
# 1.483809
Of course I could just write my own prediction function, but the problem is that my real model will have many more variables (which can even vary, as I use the step function to optimize the model according to AIC), and that is why I want to use the predict() function.
Any ideas, how to solve this?
Unfortunately, the dynlm package does not provide a predict() method. At the moment the package completely separates the data pre-processing (which knows about functions like d(), L(), trend(), season(), etc.) from the model fitting (which itself is not aware of those functions). A predict() method has been on my wishlist, but so far I have not got round to writing one because the flexibility of the interface allows so many models that it is not quite straightforward what to do. In the meantime, I should probably add a method that throws a warning before the lm method is found by inheritance.
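Until then, one workaround is to pre-compute the lagged response yourself and fit the same regression with lm(), so that predict() treats the lag as an ordinary regressor (a sketch reusing the simulated y, A and B from the question; dat2 and model2 are new names):
yv <- as.numeric(y)                                     # drop the ts attributes for a plain data frame
dat2 <- data.frame(y = yv, A = A, B = B, ylag = c(NA, head(yv, -1)))
model2 <- lm(y ~ A + B + ylag, data = dat2[1:10, ])     # row 1 is dropped due to the NA lag, so this matches the dynlm fit
predict(model2, newdata = data.frame(A = A[11], B = B[11], ylag = yv[10]))
# should match the manual calculation above (about 1.7433)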