R: Dynamic linear regression with dynlm package, how to predict()?

I am trying to build a dynamic regression model, and so far I have done it with the dynlm package. Basically the model looks like this:
y_t = a*x1_t + b*x2_t + ... + c*y_(t-1).
y_t is to be predicted; x1_t, x2_t and y_(t-1) are given.
Building the model with the dynlm package worked fine, but when it came to predicting y_t I got confused...
I found this, which seems to be a very similar problem, but it did not help me to handle my own problem.
Here is the problem I am facing (basically, what predict() does seems weird; see the comments):
library(dynlm)
# Create Data
set.seed(1)
y <- arima.sim(model = list(ar = c(.9)), n = 11) #Create AR(1) dependent variable
A <- rnorm(11) #Create independent variables
B <- rnorm(11)
y <- y + .5 * A + .2 * B #Add relationship to independent variables
data = cbind(y, A, B)
# subset used for the fitting of the model
reg <- data[1:10, ]
# Fit dynamic linear model
model <- dynlm(y ~ A + B + L(y, k = 1), data = reg) # dynlm
model
# Time series regression with "zooreg" data:
# Start = 2, End = 11
#
# Call:
# dynlm(formula = y ~ A + B + L(y, k = 1), data = reg)
# Coefficients:
# (Intercept) A B L(y, k = 1)
# 0.8930 -0.2175 0.2892 0.5176
# subset last two rows.
# the last row (r11) for which y_t shall be predicted, where from the same time A and B are input for the prediction
# and the second last row (r10), so y_(t-1) can be input for the model as well
pred <- as.data.frame(data[10:11, ])
# prediction using predict()
predict(model, newdata = pred)
# 1 2
# 1.833134 1.483809
# manual calculation of prediction of y in r11 (how I thought it should be...), taking y_(t-1) as input
predicted_value <- model$coefficients[1] + model$coefficients[2] * pred[2, 2] + model$coefficients[3] * pred[2, 3] + model$coefficients[4] * pred[1, 1]
predicted_value
# (Intercept)
# 1.743334
# and then what gives the value from predict() above taking y_t into the model (which is the value that should be predicted and not y_(t-1))
predicted_value <- model$coefficients[1] + model$coefficients[2] * pred[2, 2] + model$coefficients[3] * pred[2, 3] + model$coefficients[4] * pred[2, 1]
predicted_value
# (Intercept)
# 1.483809
Of course I could just use my own prediction function, but the problem is that my real model will have way more variables (which can even vary, as I use the step function to optimize the model according to AIC), and that is why I want to use the predict() function.
Any ideas, how to solve this?

Unfortunately, the dynlm package does not provide a predict() method. At the moment the package completely separates the data pre-processing (which knows about functions like d(), L(), trend(), season() etc.) from the model fitting (which itself is not aware of these functions). A predict() method has been on my wishlist, but so far I have not gotten around to writing one, because the flexibility of the interface allows so many models where it is not straightforward what to do. The values in the question come from the lm method found by inheritance, which does not shift the series, so the current y ends up where y_(t-1) was intended. In the meantime, I should probably add a method that throws a warning before the lm method is found by inheritance.
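Until such a method exists, one workaround is to build the lagged column by hand and fit with plain lm(), so that predict() sees exactly the regressors it needs. The following is a minimal sketch in base R, assuming the simulated data from the question. The column name y_lag1 is made up for illustration, and the coefficients are not guaranteed to match the dynlm output shown in the question.

```r
# Simulated data as in the question (base R only, dynlm is not needed here)
set.seed(1)
y <- as.numeric(arima.sim(model = list(ar = 0.9), n = 11))
A <- rnorm(11)
B <- rnorm(11)
y <- y + 0.5 * A + 0.2 * B
d <- data.frame(y = y, A = A, B = B)

# Hypothetical helper column: y_lag1[t] holds y[t - 1]
d$y_lag1 <- c(NA, head(d$y, -1))

# Fit on rows 1..10; lm() drops row 1 because its lag is NA
fit <- lm(y ~ A + B + y_lag1, data = d[1:10, ])

# One-step-ahead prediction for row 11, which uses y[10] as the lag
pred <- predict(fit, newdata = d[11, ])

# The same value computed by hand from the coefficients
manual <- sum(coef(fit) * c(1, d$A[11], d$B[11], d$y[10]))
```

For forecasts more than one step ahead, each predicted value would have to be written back into y_lag1 before predicting the next row.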

Related

R: one regression model for 2 different data sets to prepare for waldtest

I have two different data sets. Each of them represents one of my two portfolios.
y(p) is the dependent variable and x1(p), x2(p), x3(p), x4(p) are the independent variables.
(p) indicates a portfolio-specific value. Column 1 of each variable represents portfolio 1 and column 2 represents portfolio 2.
The regression equation is:
y(p) = α(p) + β1(p)*x1(p) + β2(p)*x2(p) + β3(p)*x3(p) + β4(p)*x4(p)
What I did so far is to fit a separate regression model for each portfolio in R:
lm1 <- lm(y[,1]~x1[,1]+x2[,1]+x3[,1]+x4[,1])
lm2 <- lm(y[,2]~x1[,2]+x2[,2]+x3[,2]+x4[,2])
My objective is to compare the intercepts of the two regression models. As part of this comparison I need to test the joint significance of these intercepts. As far as I can tell, a Wald test should be appropriate.
If I use the waldtest function from the lmtest package, it does not work,
obviously because the response variable is not the same for both models.
library(lmtest)
waldtest(lm1,lm2)
In waldtest.default(object, ..., test = match.arg(test)) :
models with response "y[, 2]" removed because response differs from model 1
All workarounds I tried so far did not work either, e.g. R: Waldtest: "Error in solve.default(vc[ovar, ovar]) : 'a' is 0-diml"
My guess is that the regression needs to be done in a different way to fix the problems regarding the waldtest.
So that leads to my question:
Is there a possibility to do the regression in one model, which still generates portfolio-specific intercepts and coefficients? (I assume, that this would fix the problems with the waldtest-function.)
Any advice or suggestion will be appreciated.
The following data can be used for a reproducible example:
y=matrix(rnorm(10),ncol=2)
x1=matrix(rnorm(10),ncol=2)
x2=matrix(rnorm(10),ncol=2)
x3=matrix(rnorm(10),ncol=2)
x4=matrix(rnorm(10),ncol=2)
lm1 <- lm(y[,1]~x1[,1]+x2[,1]+x3[,1]+x4[,1])
lm2 <- lm(y[,2]~x1[,2]+x2[,2]+x3[,2]+x4[,2])
library(lmtest)
waldtest(lm1,lm2)
Best regards,
Simon
Here are three ways to test intercepts equality. The second one is an implementation of the accepted answer to this question, while the other two are implementations of the second answer to the aforementioned question under different assumptions.
Let
n <- 5
y <- matrix(rnorm(10), ncol = 2)
x <- matrix(rnorm(10), ncol = 2)
First, we may indeed perform the test with only a single model. For that purpose we create a new vector Y that concatenates y[, 1] and y[, 2]. As for the independent variables, we create a block-diagonal matrix with the regressors of one model at the upper-left block and those for the other model at the lower-right block. Lastly, I create a group factor indicating the hidden model. Hence,
library(Matrix)
Y <- c(y)
X <- as.matrix(bdiag(x[, 1], x[, 2]))
G <- factor(rep(0:1, each = n))
Now the unrestricted model is
m1 <- lm(Y ~ G + X - 1)
while the restricted one is
m2 <- lm(Y ~ X)
Testing for intercepts equality gives
library(lmtest)
waldtest(m1, m2)
# Wald test
#
# Model 1: Y ~ G + X - 1
# Model 2: Y ~ X
# Res.Df Df F Pr(>F)
# 1 6
# 2 7 -1 0.5473 0.4873
so that, as expected, we cannot reject their equality. A problem with this solution, however, is that it is like estimating the two models separately but assuming that the errors have the same variance in both. Also, we don't allow for cross-correlation between the errors.
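The stacked regression can also be set up without the Matrix package. Below is a small sketch in base R, assuming one regressor per portfolio as above. X1 and X2 are hypothetical names for the two zero-padded blocks, the seed is arbitrary, and anova() on the nested lm fits is used here in place of waldtest(); with a single restriction and the default covariance it reports the same kind of F test.

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
n <- 5
y <- matrix(rnorm(2 * n), ncol = 2)
x <- matrix(rnorm(2 * n), ncol = 2)

# Stack both portfolios into one response vector
Y <- c(y)
G <- factor(rep(0:1, each = n))   # which portfolio a row belongs to

# Zero-padded regressor blocks (a block-diagonal design written out by hand)
X1 <- c(x[, 1], rep(0, n))
X2 <- c(rep(0, n), x[, 2])

m_unres <- lm(Y ~ G + X1 + X2 - 1)  # separate intercepts G0 and G1
m_res   <- lm(Y ~ X1 + X2)          # one common intercept

# F test of the single restriction "both intercepts are equal"
tab <- anova(m_res, m_unres)

# The unrestricted stacked fit reproduces the two separate regressions
sep1 <- lm(y[, 1] ~ x[, 1])
sep2 <- lm(y[, 2] ~ x[, 2])
```

The point estimates agree with the separate fits; only the standard errors differ, because the stacked model pools the residual variance.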
Second, we can relax the assumption of identical error variances by estimating two separate models and employing a Z-test as follows.
M1 <- lm(y[, 1] ~ x[, 1])
M2 <- lm(y[, 2] ~ x[, 2])
# difference of the intercepts divided by sqrt(SE1^2 + SE2^2)
Z <- unname((coef(M1)[1] - coef(M2)[1]) / sqrt(coef(summary(M1))[1, 2]^2 + coef(summary(M2))[1, 2]^2))
2 * pnorm(-abs(Z))
# [1] 0.5425736
leading to the same conclusion.
Lastly, we can employ SUR (seemingly unrelated regressions), which allows for model-dependent error variances as well as contemporaneous cross-correlation between the errors (this may not be necessary in your case; it depends on what kind of data you are using). For that we can use the systemfit package as follows:
library(systemfit)
eq1 <- y[, 1] ~ x[, 1]
eq2 <- y[, 2] ~ x[, 2]
m <- systemfit(list(eq1, eq2), method = "SUR")
In this case we also are able to perform the Wald test:
R <- matrix(c(1, 0, -1, 0), nrow = 1) # Restriction matrix
linearHypothesis(m, R, test = "Chisq")
# Linear hypothesis test (Chi^2 statistic of a Wald test)
#
# Hypothesis:
# eq1_(Intercept) - eq2_(Intercept) = 0
#
# Model 1: restricted model
# Model 2: m
#
# Res.Df Df Chisq Pr(>Chisq)
# 1 7
# 2 6 1 0.3037 0.5816

Manual Perceptron example in R - are the results acceptable?

I am trying to get a perceptron algorithm for classification working but I think something is missing. This is the decision boundary achieved with logistic regression:
The red dots got into college, after performing better on tests 1 and 2.
This is the data, and this is the code for the logistic regression in R:
dat = read.csv("perceptron.txt", header=F)
colnames(dat) = c("test1","test2","y")
plot(test2 ~ test1, col = as.factor(y), pch = 20, data=dat)
fit = glm(y ~ test1 + test2, family = "binomial", data = dat)
coefs = coef(fit)
(x = c(min(dat[,1])-2, max(dat[,1])+2))
(y = c((-1/coefs[3]) * (coefs[2] * x + coefs[1])))
lines(x, y)
The code for the "manual" implementation of the perceptron is as follows:
# DATA PRE-PROCESSING:
dat = read.csv("perceptron.txt", header=F)
dat[,1:2] = apply(dat[,1:2], MARGIN = 2, FUN = function(x) scale(x)) # scaling the data
data = data.frame(rep(1,nrow(dat)), dat) # introducing the "bias" column
colnames(data) = c("bias","test1","test2","y")
data$y[data$y==0] = -1 # Turning 0/1 dependent variable into -1/1.
data = as.matrix(data) # Turning data.frame into matrix to avoid mmult problems.
# PERCEPTRON:
set.seed(62416)
no.iter = 1000 # Number of loops
theta = rnorm(ncol(data) - 1) # Starting a random vector of coefficients.
theta = theta/sqrt(sum(theta^2)) # Normalizing the vector.
h = theta %*% t(data[,1:3]) # Performing the first f(theta^T X)
for (i in 1:no.iter){ # We will recalculate 1,000 times
for (j in 1:nrow(data)){ # Each time we go through each example.
if(h[j] * data[j, 4] < 0){ # If the hypothesis disagrees with the sign of y,
theta = theta + (sign(data[j,4]) * data[j, 1:3]) # We + or - the example from theta.
}
else
theta = theta # Else we let it be.
}
h = theta %*% t(data[,1:3]) # Calculating h() after iteration.
}
theta # Final coefficients
mean(sign(h) == data[,4]) # Accuracy
With this, I get the following coefficients:
bias test1 test2
9.131054 19.095881 20.736352
and an accuracy of 88%, consistent with the 89% from the glm() logistic regression, computed as mean(sign(predict(fit))==data[,4]). Logically, there is no way of linearly classifying all of the points, as is obvious from the plot above. In fact, iterating only 10 times and plotting the accuracy, ~90% is reached after just 1 iteration:
Being in line with the training classification performance of logistic regression, it is likely that the code is not conceptually wrong.
QUESTIONS: Is it OK to get coefficients so different from the logistic regression:
(Intercept) test1 test2
1.718449 4.012903 3.743903
This is really more of a CrossValidated question than a StackOverflow question, but I'll go ahead and answer.
Yes, it's normal and expected to get very different coefficients because you can't directly compare the magnitude of the coefficients between these 2 techniques.
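With the logit (logistic) model you're fitting a binomial likelihood with a logit link, so each coefficient is a change in log-odds and is only meaningful in that context. You've also got an intercept term in the logit (the perceptron's bias weight plays the analogous role).
The perceptron has no such probabilistic interpretation: only the sign of theta^T x matters, so multiplying theta by any positive constant gives the same classifier. The interpretation of the coefficients is thus totally different.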
Now, that's not saying anything about which model is better. There aren't comparable performance metrics in your question that would allow us to determine that. To determine that you should do cross-validation or at least use a holdout sample.
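One way to see why the magnitudes are not comparable is to look at directions instead. The sketch below is in base R and uses the two coefficient vectors quoted in the question, under the assumption that both refer to the same (scaled) predictors. The helper normalize() and the random points in X are made up for illustration.

```r
# Coefficients reported in the question (bias/intercept, test1, test2)
theta_perceptron <- c(9.131054, 19.095881, 20.736352)
beta_logit       <- c(1.718449, 4.012903, 3.743903)

# Hypothetical helper: rescale a vector to unit length
normalize <- function(v) v / sqrt(sum(v^2))

# Cosine of the angle between the two coefficient vectors;
# a value near 1 means nearly the same decision boundary
cosine <- sum(normalize(theta_perceptron) * normalize(beta_logit))

# Rescaling theta never changes the predicted class (made-up points)
set.seed(1)
X <- cbind(1, matrix(rnorm(40), ncol = 2))   # bias column plus two features
same_labels <- all(sign(X %*% theta_perceptron) ==
                   sign(X %*% (10 * theta_perceptron)))
```

If the cosine is close to 1, the two models separate the plane along almost the same line even though the raw coefficients differ by a large factor.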

How to put a complicated equation into a R formula?

We have the diameter of trees as the predictor and tree height as the dependent variable. A number of different equations exist for this kind of data and we try to model some of them and compare the results.
However, we can't figure out how to correctly put one equation into the corresponding R formula format.
The trees data set in R can be used as an example.
data(trees)
df <- trees
df$h <- df$Height * 0.3048 #transform to metric system
df$dbh <- (trees$Girth * 0.3048) / pi #transform tree girth to diameter
First, the example of an equation that seems to work well:
form1 <- h ~ I(dbh ^ -1) + I( dbh ^ 2)
m1 <- lm(form1, data = df)
m1
Call:
lm(formula = form1, data = df)
Coefficients:
(Intercept) I(dbh^-1) I(dbh^2)
27.1147 -5.0553 0.1124
Coefficients a, b and c are estimated, which is what we are interested in.
Now the problematic equation:
Trying to fit it like this:
form2 <- h ~ I(dbh ^ 2) / dbh + I(dbh ^ 2) + 1.3
gives an error:
m1 <- lm(form2, data = df)
Error in terms.formula(formula, data = data)
invalid model formula in ExtractVars
I guess this is because / is interpreted as a nested model and not an arithmetic operator?
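How the formula parser reads / can be checked without fitting anything. The sketch below uses terms() from base R on placeholder variable names (h, a and b are only symbols here, no data is needed). Note that the error message itself appears to come from the bare constant 1.3 rather than from the slash, since a formula accepts only 0 or 1 as numeric terms; that part is an assumption about the parser and is not exercised in the code.

```r
# In a formula, a / b is shorthand for "a plus b nested within a"
slash_terms <- attr(terms(h ~ a / b), "term.labels")

# Wrapping the expression in I() makes / ordinary division again,
# so the whole quotient becomes a single regressor
protected_terms <- attr(terms(h ~ I(a / b)), "term.labels")

print(slash_terms)       # the expanded terms
print(protected_terms)   # one term for the whole expression
```

This is why I() gives one coefficient for the whole expression: inside I() nothing is left for lm to split into separate terms.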
This doesn't give an error:
form2 <- h ~ I(I(dbh ^ 2) / dbh + I(dbh ^ 2) + 1.3)
m1 <- lm(form2, data = df)
But the result is not the one we want:
m1
Call:
lm(formula = form2, data = df)
Coefficients:
(Intercept) I(I(dbh^2)/dbh + I(dbh^2) + 1.3)
19.3883 0.8727
Only one coefficient is given for the whole term within the outer I(), which seems logical.
How can we fit the second equation to our data?
Assuming you are using nls the R formula can use an ordinary R function, H(a, b, c, D), so the formula can be just h ~ H(a, b, c, dbh) and this works:
# use lm to get starting values
lm1 <- lm(1/(h - 1.3) ~ I(1/dbh) + I(1/dbh^2), df)
start <- rev(setNames(coef(lm1), c("c", "b", "a")))
# run nls
H <- function(a, b, c, D) 1.3 + D^2 / (a + b * D + c * D^2)
nls1 <- nls(h ~ H(a, b, c, dbh), df, start = start)
nls1 # display result
Graphing the output:
plot(h ~ dbh, df)
lines(fitted(nls1) ~ dbh, df)
You've got a couple problems. (1) You're missing parentheses for the denominator of form2 (and R has no way to know that you want to add a constant a in the denominator, or where to put any of the parameters, really), and much more problematic: (2) your 2nd model isn't linear, so lm won't work.
Fixing (1) is easy:
form2 <- h ~ 1.3 + I(dbh^2) / (a + b * dbh + c * I(dbh^2))
Fixing (2), though there are many ways to estimate parameters for a nonlinear model, the nls (nonlinear least squares) is a good place to start:
m2 <- nls(form2, data = df, start = list(a = 1, b = 1, c = 1))
You need to provide starting guesses for the parameters in nls. I just picked 1's, but you should use better guesses that ballpark what the parameters might be.
edit: fixed, no longer incorrectly using offset ...
An answer that complements @shujaa's:
You can transform your problem from
H = 1.3 + D^2/(a+b*D+c*D^2)
to
1/(H-1.3) = a/D^2+b/D+c
This would normally mess up the assumptions of the model (i.e., if H were normally distributed with constant variance, then 1/(H-1.3) wouldn't be). However, let's try it anyway:
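For reference, the algebra behind that transformation, assuming H is not equal to 1.3 and D is not zero, can be spelled out as:

```latex
\begin{aligned}
H &= 1.3 + \frac{D^2}{a + bD + cD^2} \\
H - 1.3 &= \frac{D^2}{a + bD + cD^2} \\
\frac{1}{H - 1.3} &= \frac{a + bD + cD^2}{D^2}
                   = \frac{a}{D^2} + \frac{b}{D} + c
\end{aligned}
```

So in a linear fit of 1/(H-1.3) on 1/D and 1/D^2, the intercept estimates c, the coefficient on 1/D estimates b, and the coefficient on 1/D^2 estimates a.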
data(trees)
df <- transform(trees,
h=Height * 0.3048, #transform to metric system
dbh=Girth * 0.3048 / pi #transform tree girth to diameter
)
lm(1/(h-1.3) ~ poly(I(1/dbh),2,raw=TRUE),data=df)
## Coefficients:
## (Intercept) poly(I(1/dbh), 2, raw = TRUE)1
## 0.043502 -0.006136
## poly(I(1/dbh), 2, raw = TRUE)2
## 0.010792
These results would normally be good enough to get good starting values for the nls fit. However, you can do better than that via glm, which uses a link function to allow for some forms of non-linearity. Specifically,
(fit2 <- glm(h-1.3 ~ poly(I(1/dbh),2,raw=TRUE),
family=gaussian(link="inverse"),data=df))
## Coefficients:
## (Intercept) poly(I(1/dbh), 2, raw = TRUE)1
## 0.041795 -0.002119
## poly(I(1/dbh), 2, raw = TRUE)2
## 0.008175
##
## Degrees of Freedom: 30 Total (i.e. Null); 28 Residual
## Null Deviance: 113.2
## Residual Deviance: 80.05 AIC: 125.4
##
You can see that the results are approximately the same as the linear fit, but not quite.
pframe <- data.frame(dbh=seq(0.8,2,length=51))
We use predict, but need to correct the prediction to account for the fact that we subtracted a constant from the LHS:
pframe$h <- predict(fit2,newdata=pframe,type="response")+1.3
p2 <- predict(fit2,newdata=pframe,se.fit=TRUE) ## predict on link scale
pframe$h_lwr <- with(p2,1/(fit+1.96*se.fit))+1.3
pframe$h_upr <- with(p2,1/(fit-1.96*se.fit))+1.3
png("dbh_tmp1.png",height=4,width=6,units="in",res=150)
par(las=1,bty="l")
plot(h~dbh,data=df)
with(pframe,lines(dbh,h,col=2))
with(pframe,polygon(c(dbh,rev(dbh)),c(h_lwr,rev(h_upr)),
border=NA,col=adjustcolor("black",alpha=0.3)))
dev.off()
Because we have used the constant on the LHS, this doesn't fit perfectly into ggplot's geom_smooth framework. (It almost, but doesn't quite, fit into the framework of using an offset: we could only use an offset if our formula were 1/H - 1.3 = a/D^2 + ..., i.e. if the constant adjustment were on the link (inverse) scale rather than the original scale.) Instead, the fitted line and ribbon are drawn from pframe:
library("ggplot2")
ggplot(df,aes(dbh,h))+geom_point()+theme_bw()+
geom_line(data=pframe,colour="red")+
geom_ribbon(data=pframe,colour=NA,alpha=0.3,
aes(ymin=h_lwr,ymax=h_upr))
ggsave("dbh_tmp2.png",height=4,width=6)

Adding lagged variables to an lm model?

I'm using lm on a time series, which works quite well actually, and it's super super fast.
Let's say my model is:
> formula <- y ~ x
I train this on a training set:
> train <- data.frame( x = seq(1,3), y = c(2,1,4) )
> model <- lm( formula, train )
... and I can make predictions for new data:
> test <- data.frame( x = seq(4,6) )
> test$y <- predict( model, newdata = test )
> test
x y
1 4 4.333333
2 5 5.333333
3 6 6.333333
This works super nicely, and it's really speedy.
I want to add lagged variables to the model. Now, I could do this by augmenting my original training set:
> train$y_1 <- c(0, train$y[1:(nrow(train) - 1)])
> train
x y y_1
1 1 2 0
2 2 1 2
3 3 4 1
update the formula:
formula <- y ~ x * y_1
... and training will work just fine:
> model <- lm( formula, train )
> # no errors here
However, the problem is that there is no way of using 'predict', because there is no way of populating y_1 in a test set in a batch manner.
Now, for lots of other regression things, there are very convenient ways to express them in the formula, such as poly(x,2) and so on, and these work directly using the unmodified training and test data.
So, I'm wondering if there is some way of expressing lagged variables in the formula, so that predict can be used? Ideally:
formula <- y ~ x * lag(y,-1)
model <- lm( formula, train )
test$y <- predict( model, newdata = test )
... without having to augment (not sure if that's the right word) the training and test datasets, and just being able to use predict directly?
Have a look at e.g. the dynlm package which gives you lag operators. More generally the Task Views on Econometrics and Time Series will have lots more for you to look at.
Here is the beginning of its examples -- a one and twelve month lag:
R> data("UKDriverDeaths", package = "datasets")
R> uk <- log10(UKDriverDeaths)
R> dfm <- dynlm(uk ~ L(uk, 1) + L(uk, 12))
R> dfm
Time series regression with "ts" data:
Start = 1970(1), End = 1984(12)
Call:
dynlm(formula = uk ~ L(uk, 1) + L(uk, 12))
Coefficients:
(Intercept) L(uk, 1) L(uk, 12)
0.183 0.431 0.511
R>
Following Dirk's suggestion on dynlm, I couldn't quite figure out how to predict, but searching for that led me to dyn package via https://stats.stackexchange.com/questions/6758/1-step-ahead-predictions-with-dynlm-r-package
Then after several hours of experimentation I came up with the following function to handle the prediction. There were quite a few gotchas on the way, e.g. you can't seem to rbind time series, the result of predict is offset by start, and a whole bunch of things like that. So I feel this answer adds significantly compared to just naming a package, though I have upvoted Dirk's answer.
So, a solution that works is:
use the dyn package
use the following method for prediction
predictDyn method:
# pass in training data, test data,
# it will step through one by one
# need to give dependent var name, so that it can make this into a timeseries
predictDyn <- function( model, train, test, dependentvarname ) {
Ntrain <- nrow(train)
Ntest <- nrow(test)
# can't rbind ts's apparently, so convert to numeric first
train[,dependentvarname] <- as.numeric(train[,dependentvarname])
test[,dependentvarname] <- as.numeric(test[,dependentvarname])
testtraindata <- rbind( train, test )
testtraindata[,dependentvarname] <- ts( as.numeric( testtraindata[,dependentvarname] ) )
for( i in 1:Ntest ) {
result <- predict(model,newdata=testtraindata,subset=1:(Ntrain+i-1))
testtraindata[Ntrain+i,dependentvarname] <- result[Ntrain + i + 1 - start(result)][1]
}
return( testtraindata[(Ntrain+1):(Ntrain + Ntest),] )
}
Example usage:
library("dyn")
# size of training and test data
N <- 6
predictN <- 10
# create training data, which we can get exact fit on, so we can check the results easily
traindata <- c(1,2)
for( i in 3:N ) { traindata[i] <- 0.5 + 1.3 * traindata[i-2] + 1.7 * traindata[i-1] }
train <- data.frame( y = ts( traindata ), foo = 1)
# create testing data, bunch of NAs
test <- data.frame( y = ts( rep(NA,predictN) ), foo = 1)
# fit a model
model <- dyn$lm( y ~ lag(y,-1) + lag(y,-2), train )
# look at the model, it's a perfect fit. Nice!
print(model)
test <- predictDyn( model, train, test, "y" )
print(test)
# nice plot
plot(test$y, type='l')
Output:
> model
Call:
lm(formula = dyn(y ~ lag(y, -1) + lag(y, -2)), data = train)
Coefficients:
(Intercept) lag(y, -1) lag(y, -2)
0.5 1.7 1.3
> test
y foo
7 143.2054 1
8 325.6810 1
9 740.3247 1
10 1682.4373 1
11 3823.0656 1
12 8686.8801 1
13 19738.1816 1
14 44848.3528 1
15 101902.3358 1
16 231537.3296 1
Edit: hmmm, this is super slow though. Even if I limit the data in the subset to a constant few rows of the dataset, it takes about 24 milliseconds per prediction, or, for my task, 0.024*7*24*8*20*10/60/60 = 1.792 hours :-O
Try the arima() function. The AR order is the auto-regressive part, which means lagged y. The xreg argument allows you to add other X variables. You can get predictions with predict() (the predict.Arima method), passing the future regressors as newxreg.
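A minimal sketch of that suggestion, using only the stats package that ships with R. The data are simulated here purely for illustration (an AR(1) series plus one regressor x); the sample sizes, the seed and the coefficient 2 are arbitrary assumptions.

```r
set.seed(123)
n <- 120
x <- rnorm(n)                                          # exogenous regressor
y <- as.numeric(arima.sim(list(ar = 0.6), n = n)) + 2 * x

# Fit AR(1) errors with x as external regressor on the first 100 points
fit <- arima(y[1:100], order = c(1, 0, 0),
             xreg = cbind(x = x[1:100]))

# Forecast the remaining 20 points; future x values must be supplied
fc <- predict(fit, n.ahead = 20,
              newxreg = cbind(x = x[101:120]))

fc$pred   # point forecasts
fc$se     # their standard errors
```

Note that this models the lag through the error process rather than as a lagged-y regressor, so the coefficients are not directly interchangeable with those of an lm fit on y_1.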
Here's a thought:
Why don't you create a new data frame? Fill a data frame with the regressors you need. You could have columns like L1, L2, ..., Lp for all lags of any variable you want and, then, you get to use your functions exactly like you would for a cross-section type of regression.
Because you transform the data once instead of operating on it every time you call the fitting and prediction functions, this will be considerably faster. I know that Eviews and Stata provide lagging operators, and there is some convenience to them. But 'lm' is also inefficient if you do not need everything it computes. If you have a few hundred thousand iterations to perform and you just need the forecast, or the forecast and the value of information criteria like BIC or AIC, you can beat 'lm' in speed by skipping the computations you will not use: just write an OLS estimator in a function and you're good to go.
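The idea can be sketched in base R as follows, reusing the exact recurrence from the predictDyn example above (y_t = 0.5 + 1.7 y_(t-1) + 1.3 y_(t-2)) so the result can be compared with that output. The function name ols and the column names L1 and L2 are made up for illustration.

```r
# Training series from the earlier example
y <- c(1, 2)
for (i in 3:6) y[i] <- 0.5 + 1.3 * y[i - 2] + 1.7 * y[i - 1]

# Build the lag columns once: each row is (y_t, y_{t-1}, y_{t-2})
lagged <- embed(y, 3)
colnames(lagged) <- c("y", "L1", "L2")

# Hypothetical bare-bones OLS estimator: least squares via QR, nothing else
ols <- function(X, y) qr.solve(X, y)

X    <- cbind(1, lagged[, c("L1", "L2")])
beta <- ols(X, lagged[, "y"])       # intercept, coefficient on L1, on L2

# Iterate the fitted recurrence to forecast h steps ahead
h    <- 10
path <- y
for (k in seq_len(h)) {
  m <- length(path)
  path[m + 1] <- beta[1] + beta[2] * path[m] + beta[3] * path[m - 1]
}
forecast <- tail(path, h)
```

Each forecast step is a handful of multiplications, which is where the speed-up over repeated predict() calls would come from.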

R: Predict (0,1) in logistic regression in glm()

I am trying to model a "what if" situation in a binary logit model. I am estimating the probability of passing a test, given the level of difficulty of the test (1=easiest, 5=toughest), with gender as control. (The data is here). Students are administered a test which is generally tough ("HIGH" in the data). From this we can estimate the impact of test-difficulty on the likelihood of passing:
model = glm(PASS ~ as.factor(SEX) + as.factor(HIGH), family=binomial(link="logit"), data=df)
summary(model)
We can also get the predicted probabilities of passing with:
predict.high = predict(model, type="response")
The question is, what if the "LOW" test were given instead? To get the new probabilities, we can do:
newdata = rename.vars(subset(df, select=c(-HIGH)), 'LOW','HIGH')
predict.low = predict(model, newdata=newdata, type="response")
But how do I know how many additional students would have passed in this case? Is there an obvious switch in glm() I am not seeing?
I have not yet tried to dig out my code for prediction that I wrote based on Gelman and Hill (2006) who, I seem to recall used simulation. I still intend to do that. One aspect of your question that seemed unique in my limited experience was that I was accustomed to predicting for a single observation (in this case a single student taking a single test). You, however, seem to want to predict a difference between two sets of predictions. In other words, you want to predict how many more students will pass if given a set of 5 easy exams rather than a set of 5 hard exams.
I am not sure whether Gelman and Hill (2006) covered that. You also seem to want to do this with a frequentist approach.
I am thinking that if you can predict for a single observation, so that you have a confidence interval for each observation, then perhaps you can estimate a weighted average probability of passing within each group and subtract the two weighted averages. The delta method could be used to estimate a confidence interval on the weighted averages and on their difference.
Covariance among predicted observations might have to be assumed to be 0 to implement that approach.
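The point-estimate part of this idea can be sketched in base R: the expected number of passes is the sum of the predicted probabilities, so the expected number of additional passes is the difference of two such sums. The data below are simulated and every number in the simulation is an assumption. Difficulty is treated as numeric here (unlike the as.factor() coding in the question) so that the model can be evaluated at LOW levels it was not fitted on. No confidence interval is computed.

```r
set.seed(2012)
n  <- 500
df <- data.frame(
  SEX  = rbinom(n, 1, 0.5),
  HIGH = sample(3:5, n, replace = TRUE)   # difficulty of the test actually given
)
df$LOW  <- df$HIGH - 2                    # hypothetical easier test
df$PASS <- rbinom(n, 1, plogis(3 - 0.3 * df$SEX - 0.8 * df$HIGH))

model <- glm(PASS ~ SEX + HIGH, family = binomial(link = "logit"), data = df)

# Predicted pass probabilities under the test actually given
p_high <- predict(model, type = "response")

# "What if": same students, but HIGH replaced by the LOW difficulty
what_if <- transform(df, HIGH = LOW)
p_low   <- predict(model, newdata = what_if, type = "response")

expected_high <- sum(p_high)              # expected passes, hard tests
expected_low  <- sum(p_low)               # expected passes, easy tests
extra_passes  <- expected_low - expected_high
```

For a logistic regression with an intercept, the sum of fitted probabilities equals the observed number of passes, so expected_high simply reproduces sum(df$PASS).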
If assuming a covariance of 0 is not satisfactory then perhaps a Bayesian approach would be better. Again, I am only familiar with predicting for a single observation. With a Bayesian approach I have predicted a single observation by including the independent variables, but not the dependent variable, for the observation to be predicted. I suppose you could predict for every observation in the same Bayesian run (predict each student in HIGH and in LOW). The weighted averages of passing tests for each group and the difference in weighted averages are derived parameters and I suspect could be included directly in the code for the Bayesian logistic regression. Then you would have your point estimate and estimate of variance for probability of passing each group of tests and for the difference in probability of passing each group of tests. If you want the difference in the number of students passing each group of tests, perhaps that could be included in the Bayesian code as a derived parameter also.
I realize this answer, so far, has been more conversational than might be desired. I am simply mapping out strategies to attempt without having had the time yet to try implementing those strategies. Providing all of the R and WinBUGS code to implement both proposed strategies might take me a few days. (WinBUGS or OpenBUGS can be called from within R.) I will append the code to this answer as I go along. If anyone deems my proposed strategies, and/or forthcoming code, incorrect I hope they will feel free to point out my errors and offer corrections.
EDIT
Below is code that generates fake data and analyzes that data using a frequentist and a Bayesian approach. I have not yet added the code to implement the above ideas for prediction. I will try to add the Bayesian prediction code in the next 1-2 days. I only used three tests instead of five. The way the code is written below, you can change the number of students, n, to any positive multiple of 6.
# Bayesian_logistic_regression_June2012.r
# June 24, 2012
library(R2WinBUGS)
library(arm)
library(BRugs)
set.seed(3234)
# create fake data for n students and three tests
n <- 1200
# create factors for n/6 students in each of 6 categories
gender <- c(rep(0, (n/2)), rep(1, (n/2)))
test2 <- c(rep(0, (n/6)), rep(1, (n/6)), rep(0, (n/6)),
rep(0, (n/6)), rep(1, (n/6)), rep(0, (n/6)))
test3 <- c(rep(0, (n/6)), rep(0, (n/6)), rep(1, (n/6)),
rep(0, (n/6)), rep(0, (n/6)), rep(1, (n/6)))
# assign slopes to factors
B0 <- 0.4
Bgender <- -0.2
Btest2 <- 0.6
Btest3 <- 1.2
# estimate probability of passing test
p.pass <- ( exp(B0 + Bgender * gender +
Btest2 * test2 +
Btest3 * test3) /
(1 + exp(B0 + Bgender * gender +
Btest2 * test2 +
Btest3 * test3)))
# identify which students passed their test, 0 = fail, 1 = pass
passed <- rep(0, n)
r.passed <- runif(n,0,1)
passed[r.passed <= p.pass] = 1
# use frequentist approach in R to estimate probability
# of passing test
m.freq <- glm(passed ~ as.factor(gender) +
as.factor(test2) +
as.factor(test3) ,
family = binomial)
summary(m.freq)
# predict(m.freq, type = "response")
# use OpenBUGS to analyze same data set
# Define model
sink("Bayesian.logistic.regression.txt")
cat("
model {
# Priors
alpha ~ dnorm(0,0.01)
bgender ~ dnorm(0,0.01)
btest2 ~ dnorm(0,0.01)
btest3 ~ dnorm(0,0.01)
# Likelihood
for (i in 1:n) {
passed[i] ~ dbin(p[i], 1)
logit(p[i]) <- (alpha + bgender * gender[i] +
btest2 * test2[i] +
btest3 * test3[i])
}
# Derived parameters
p.g.t1 <- exp(alpha) / (1 + exp(alpha))
p.b.t1 <- exp(alpha + bgender) / (1 + exp(alpha + bgender))
p.g.t2 <- ( exp(alpha + btest2) /
(1 + exp(alpha + btest2)))
p.b.t2 <- ( exp(alpha + bgender + btest2) /
(1 + exp(alpha + bgender + btest2)))
p.g.t3 <- ( exp(alpha + btest3) /
(1 + exp(alpha + btest3)))
p.b.t3 <- ( exp(alpha + bgender + btest3) /
(1 + exp(alpha + bgender + btest3)))
}
", fill = TRUE)
sink()
my.data <- list(passed = passed,
gender = gender,
test2 = test2,
test3 = test3,
n = length(passed))
# Inits function
inits <- function(){ list(alpha = rlnorm(1),
bgender = rlnorm(1),
btest2 = rlnorm(1),
btest3 = rlnorm(1)) }
# Parameters to estimate
params <- c("alpha", "bgender", "btest2", "btest3",
"p.g.t1", "p.b.t1", "p.g.t2", "p.b.t2",
"p.g.t3", "p.b.t3")
# MCMC settings
nc <- 3
ni <- 2000
nb <- 500
nt <- 2
# Start Gibbs sampling
out <- bugs(data = my.data, inits = inits,
parameters.to.save = params,
"c:/users/Mark W Miller/documents/Bayesian.logistic.regression.txt",
program = 'OpenBUGS',
n.thin = nt, n.chains = nc,
n.burnin = nb, n.iter = ni, debug = TRUE)
print(out, dig = 5)
Before I attempt to implement the weighted-average approach to prediction I wanted to convince myself that it might work. So I ginned up the following code, which seems to suggest it may:
# specify number of girls taking each test and
# number of boys taking each test
g.t1 <- rep(0,400)
b.t1 <- rep(0,120)
g.t2 <- rep(0,1200)
b.t2 <- rep(0,50)
g.t3 <- rep(0,1000)
b.t3 <- rep(0,2000)
# specify probability of individuals in each of the
# 6 groups passing their test
p.g1.t1 <- 0.40
p.b1.t1 <- 0.30
p.g1.t2 <- 0.60
p.b1.t2 <- 0.50
p.g1.t3 <- 0.80
p.b1.t3 <- 0.70
# identify which individuals in each group passed their test
g.t1[1:(p.g1.t1 * length(g.t1))] = 1
sum(g.t1)
b.t1[1:(p.b1.t1 * length(b.t1))] = 1
sum(b.t1)
g.t2[1:(p.g1.t2 * length(g.t2))] = 1
sum(g.t2)
b.t2[1:(p.b1.t2 * length(b.t2))] = 1
sum(b.t2)
g.t3[1:(p.g1.t3 * length(g.t3))] = 1
sum(g.t3)
b.t3[1:(p.b1.t3 * length(b.t3))] = 1
sum(b.t3)
# determine the weighted average probability of passing
# on test day for all individuals as a class
wt.ave.p <- ((p.g1.t1 * length(g.t1) + p.b1.t1 * length(b.t1) +
p.g1.t2 * length(g.t2) + p.b1.t2 * length(b.t2) +
p.g1.t3 * length(g.t3) + p.b1.t3 * length(b.t3) ) /
(length(g.t1) + length(b.t1) + length(g.t2) +
length(b.t2) + length(g.t3) + length(b.t3)))
wt.ave.p
# determine the expected number of individuals passing
# their test in the class as a whole
exp.num.pass <- wt.ave.p * (length(g.t1) + length(b.t1) +
length(g.t2) + length(b.t2) +
length(g.t3) + length(b.t3))
exp.num.pass
# determine the number of individuals passing
num.passing <- (sum(g.t1) + sum(b.t1) +
sum(g.t2) + sum(b.t2) +
sum(g.t3) + sum(b.t3) )
num.passing
# the expected number of students passing, exp.num.pass,
# should equal the observed number of students passing,
# num.passing regardless of the number of students in each
# group and regardless of the probability of passing a
# given test, within rounding error
identical(round(exp.num.pass), round(num.passing))
Hopefully in the next couple of days I can try adding the prediction code to the above Bayesian code.
EDIT - June 27, 2012
I have not forgotten about this. Rather, I have encountered several problems:
With logistic regression it is possible to predict: a) the probability, p, that students in a given group pass a test and b) the outcome of a given student taking a test (0 or 1). All of the 0's and 1's are then averaged. I am not sure which of these to use. The point estimate and SD of the predicted p is identical to the estimated p for known test outcomes. The point estimate of the average of the predicted 0's and 1's is a little different and the SD of the averaged 0's and 1's is much larger. I believe I want b, the average of the predicted 0's and 1's. However, I am attempting to examine various websites and books to be sure. Collett (1991) has a worked example that does not employ computer code, but that worked example includes a half-dozen variables including 2 interactions and I am having a little trouble getting my Bayesian estimates to match her frequentist estimates.
With lots of derived parameters the program is taking a long time to run.
Apparently OpenBUGS has been crashing frequently, I believe, even without prediction code. I am not sure whether that is because of something I am doing wrong or because of changes in the recent versions of R or changes in recent versions of R packages or maybe because I am trying to run the code with a 64-bit R or something else.
I will try to post the prediction code soon, but all of the above issues have slowed me down.
You can easily use this approach to find a cut off:
cutoff <- runif(length(predicted_probabilities))
This is a randomized decision rule (the cutoffs are random draws), similar in spirit to the accept/reject step of Metropolis-Hastings.
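As a small illustration in base R, comparing each predicted probability with its own uniform draw turns the probabilities into simulated 0/1 outcomes. The four probabilities below are made up, and the draws are random, so results change between runs unless a seed is set.

```r
set.seed(2012)
predicted_probabilities <- c(0.10, 0.40, 0.80, 0.95)   # made-up values

# One simulated outcome per student: pass if the draw falls below p
cutoff         <- runif(length(predicted_probabilities))
simulated_pass <- as.integer(cutoff <= predicted_probabilities)

# Repeating the simulation shows the average number of passes
# approaching the sum of the probabilities
n_sims <- 5000
passes_per_sim <- replicate(
  n_sims,
  sum(runif(length(predicted_probabilities)) <= predicted_probabilities)
)
mean_passes     <- mean(passes_per_sim)
expected_passes <- sum(predicted_probabilities)
```

The spread of passes_per_sim gives a rough picture of how much the number of passing students varies around that expectation.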
