I am trying to create a Bayesian latent class model in R and OpenBUGS for two populations using two diagnostic tests. I have written the equations for most parameters, but I also have inconclusive results, so I want to expand the model to include the probability of an inconclusive result on test 1 for an infected and for an uninfected individual, and likewise for test 2. I have been trying to write the equations for these probabilities (p1[1] to p1[4] and p2[1] to p2[4] in the model below), but I don't think I am quite there yet, as the model won't run. Could anyone help me with these equations? The error I get every time I run it at the moment is `invalid integer value for x1[1]`.
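For reference, here is a quick check of the data actually handed to the model (my own diagnostic, not part of the original script). Two things worth verifying are that the count vector has the same length as the probability vector it is paired with, and that it holds integer counts: a 3x3 cross-classification of the two tests has 9 cells rather than the 8 declared as x1[1:8].
x1 <- as.numeric(matrix(c(3,3,1,0,1,0,0,6,5), byrow = TRUE, ncol = 3))
length(x1)        # 9 cells, but the model declares x1[1:8] and p1[1:8]
storage.mode(x1)  # "double"; dmulti() expects integer counts
x1 <- as.integer(x1)  # fixes the type; the 8-vs-9 category mismatch still needs a changed model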
#Bayesian LCA model creation
library(R2OpenBUGS)
setwd("mywd")
model3=paste0("model3{
#multinomial model for the data
x1[1:8] ~ dmulti(p1[1:8], n1)
x2[1:8] ~ dmulti(p2[1:8], n2)
#Observed prevalence
#Pop 1 with 2 tests and unknown prevalence
p1[1] <- prev1*Se1*Se2*(1-IncT1Inf)+(1-prev1)*(1-Sp1)*(1-Sp2)*IncT1NonInf
p1[2] <- prev1*Se1*(1-Se2)*(1-IncT1NonInf)+(1-prev1)*(1-Sp1)*Sp2*IncT1Inf
p1[3] <- prev1*Se1*Se2*(1-IncT2Inf)+(1-prev1)*(1-Sp1)*(1-Sp2)*IncT2NonInf
p1[4] <- prev1*(1-Se1)*Se2*(1-IncT2NonInf)+(1-prev1)*Sp1*(1-Sp2)*IncT2Inf
p1[5] <- prev1*Se1*Se2+(1-prev1)*(1-Sp1)*(1-Sp2)
p1[6] <- prev1*Se1*(1-Se2)+(1-prev1)*(1-Sp1)*Sp2
p1[7] <- prev1*(1-Se1)*Se2+(1-prev1)*Sp1*(1-Sp2)
p1[8] <- prev1*(1-Se1)*(1-Se2)+(1-prev1)*Sp1*Sp2
#Pop 2 with 2 tests (same tests) and unknown prevalence
p2[1] <- prev2*Se1*Se2*(1-IncT1Inf)+(1-prev2)*(1-Sp1)*(1-Sp2)*IncT1NonInf
p2[2] <- prev2*Se1*(1-Se2)*(1-IncT1NonInf)+(1-prev2)*(1-Sp1)*Sp2*IncT1Inf
p2[3] <- prev2*Se1*Se2*(1-IncT2Inf)+(1-prev2)*(1-Sp1)*(1-Sp2)*IncT2NonInf
p2[4] <- prev2*(1-Se1)*Se2*(1-IncT2NonInf)+(1-prev2)*Sp1*(1-Sp2)*IncT2Inf
p2[5] <- prev2*Se1*Se2+(1-prev2)*(1-Sp1)*(1-Sp2)
p2[6] <- prev2*Se1*(1-Se2)+(1-prev2)*(1-Sp1)*Sp2
p2[7] <- prev2*(1-Se1)*Se2+(1-prev2)*Sp1*(1-Sp2)
p2[8] <- prev2*(1-Se1)*(1-Se2)+(1-prev2)*Sp1*Sp2
#Priors taken using median values from total lit search and 95th percentile provided in lit
prev1 ~ dbeta(8.89,60.26)
prev2 ~ dbeta(4.35,76.63)
Se1 ~ dbeta(14.59,0.86)
Sp1 ~ dbeta(14.95,0.86)
Se2 ~ dbeta(78.55,15.96)
Sp2 ~ dbeta(1.71,0.38)
IncT1Inf ~ dbeta(1,1)
IncT1NonInf ~ dbeta(1,1)
IncT2Inf ~ dbeta(1,1)
IncT2NonInf ~ dbeta(1,1)
}")
#write to temporary text file
write.table(model3, file="model3.txt", quote=FALSE, sep="", row.names=FALSE, col.names=FALSE)
#Data
#Sanctuary 1
n1=19
x1<-matrix(c(3,3,1,0,1,0,0,6,5),byrow=T,ncol=3,dimnames=list(c("TST+", "TST-",
"TSTInc"),c("QFT+", "QFT-", "QFTInc")))
as.numeric(x1)
x1 <- as.numeric(x1)
#Sanctuary 2
n2=12
x2<-matrix(c(0,0,0,4,8,0,0,0,0),byrow=T,ncol=3,dimnames=list(c("TST+", "TST-",
"TSTInc"),c("QFT+", "QFT-", "QFTInc")))
as.numeric(x2)
x2 <- as.numeric(x2)
#set data inputs to BUGS
dat <- list(x1=x1,n1=sum(x1),x2=x2,n2=sum(x2))
dat
#Set parameters desired to monitor
paras <- c("Se1","Sp1","Se2","Sp2","prev1","prev2","IncT1Inf","IncT1NonInf", "IncT2Inf",
"IncT2NonInf")
#Initialising values for 3 chains done using median and mean values from initial prior calculation
inits <- list(
  list(Se1=0.964, Sp1=0.965, Se2=0.834, Sp2=0.969, prev1=0.125, prev2=0.050,
       IncT1Inf=0.100, IncT1NonInf=0.250, IncT2Inf=0.150, IncT2NonInf=0.250),
  list(Se1=0.965, Sp1=0.965, Se2=0.834, Sp2=0.969, prev1=0.125, prev2=0.05,
       IncT1Inf=0.200, IncT1NonInf=0.150, IncT2Inf=0.050, IncT2NonInf=0.200),
  list(Se1=0.960, Sp1=0.965, Se2=0.858, Sp2=0.937,
       IncT1Inf=0.100, IncT1NonInf=0.250, IncT2Inf=0.150, IncT2NonInf=0.250)
)
#run model in R2OpenBUGS
niterations=12000
#Running with initial 12000 iterations, burning first 1000 and thinning every 10
bug.out <- bugs(dat, inits, paras, model.file="model3.txt", n.iter=niterations, n.burnin=1000,
n.thin=10, n.chains=3, saveExec=F, restart=F, debug=T, DIC=T, digits=6, codaPkg=F,
working.directory="mywd", clearWD=F, useWINE=F, WINE=NULL,
newWINE=F, WINEPATH=NULL, bugs.seed=1, summary.only=FALSE, over.relax = F)
I would like to derive individual growth rates from our growth model directly, similar to this OP and this OP.
I am working with a dataset that contains the age and weight (wt) measurements for ~2000 individuals in a population. Each individual is represented by a unique id number.
A sample of the data can be found here. Here is what the data looks like:
id age wt
1615 6 15
3468 32 61
1615 27 50
1615 60 145
6071 109 209
6071 125 207
10645 56 170
10645 118 200
I have developed a non-linear growth curve to model growth for this dataset (at the population level). It looks like this:
wt~ A*atan(k*age - t0) + m
which predicts weight (wt) for a given age and has modifiable parameters A, k, t0, and m. I have fit this model to the dataset at the population level using an nlme fit in which I specified individual id as a random effect and used pdDiag to specify the parameters as uncorrelated. (Note: the random effect would need to be dropped when looking at the individual level.)
The code for this looks like:
nlme.k = nlme(wt~ A*atan(k*age - t0) + m,
data = df,
fixed = A+k+t0+m~1,
random = list(id = pdDiag(A+t0+k+m~1)), #cannot include when looking at the individual level
start = c(A = 99.31,k = 0.02667, t0 = 1.249, m = 103.8), #these values are what we are using at the population level # might need to be changed for individual models
na.action = na.omit,
control = nlmeControl(maxIter = 200, pnlsMaxIter = 10, msMaxIter = 100))
I have our population level growth model (nlme.k), but I would like to use it to derive/extract individual values for each growth constant.
How can I extract individual growth constants for each id using my population-level growth model (nlme.k)? Note that the solution doesn't need to use nlme; that is just the model I used at the population level.
Any suggestions would be appreciated!
I think this is not possible due to the nature of how random effects are designed. According to this post, the effect size (your growth constant) is estimated using partial pooling, which involves using data points from other groups. Thus you cannot estimate the effect size of each group (your individual id) on its own.
Strictly speaking (see here), random effects are not really part of the model at all, but rather part of the error.
However, you can estimate the R2 for all groups together. If you want estimates at the individual level (e.g. parameter estimates for id 1), then just run the same model on the data points of that particular individual only. This gives you n models with n parameter sets for n individuals.
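A minimal sketch of that per-individual approach (my illustration, assuming a data frame df with columns id, age, and wt, and reusing the population-level starting values from the question):
# Fit the arctangent growth curve separately to each individual's data.
# Non-converging individuals get NA coefficients instead of stopping the loop.
fit_one <- function(d) {
  tryCatch(
    coef(nls(wt ~ A * atan(k * age - t0) + m, data = d,
             start = c(A = 99.31, k = 0.02667, t0 = 1.249, m = 103.8))),
    error = function(e) c(A = NA, k = NA, t0 = NA, m = NA)
  )
}
ind_coefs <- t(sapply(split(df, df$id), fit_one))
head(ind_coefs)  # one row of (A, k, t0, m) estimates per individual id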
We ended up using a few loops to do this.
Note that our answer builds off a model posted in this OP if anyone wants the background script. We will also link to the published script when it is posted.
For now, this should give a general idea of how we did it.
#Individual fits dataframe generation
yid_list <- unique(young_inds$squirrel_id)
indf_prs <- list('df', 'squirrel_id', 'A_value', 'k_value', 'mx_value', 'my_value', 'max_grate', 'hit_asymptote', 'age_asymptote', 'ind_asymptote', 'ind_mass_asy', 'converge') #List of parameters
ind_fits <- data.frame(matrix(ncol = length(indf_prs), nrow = length(yid_list))) #Blank dataframe for all individual fits
colnames(ind_fits) <- indf_prs
#Calculates individual fits for all individuals and appends into ind_fits
for (i in 1:length(yid_list)) {
yind_df <-young_inds%>%filter(squirrel_id %in% yid_list[i]) #Extracts a dataframe for each squirrel
ind_fits[i , 'squirrel_id'] <- as.numeric(yid_list[i]) #Appends squirrel i's id into individual fits dataframe
sex_lab <- unique(yind_df$sex) #Identifies and extracts squirrel "i"s sex
mast_lab <- unique(yind_df$b_mast) #Identifies and extracts squirrel "i"s mast value
Hi_dp <- max(yind_df$wt) #Extracts the largest mass for each squirrel
ind_long <- unique(yind_df$longevity) #Extracts the individual death date
#Sets corresponding values for squirrel "i"
if (mast_lab==0 && sex_lab=="F") { #Female no mast
ind_fits[i , 'df'] <- "fnm" #Squirrel dataframe (appends into ind_fits dataframe)
df_asm <- af_asm #average asymptote value corresponding to sex
df_B_guess <- guess_df[1, "B_value"] #Initial guesses for nls fits corresponding to sex and mast
df_k_guess <- guess_df[1, "k_value"]
df_mx_guess <- guess_df[1, "mx_value"]
df_my_guess <- guess_df[1, "my_value"]
ind_asyr <- indf_asy #growth rate at individual asymptote
} else if (mast_lab==0 && sex_lab=="M") { #Male no mast
ind_fits[i , 'df'] <- "mnm"
df_asm <- am_asm
df_B_guess <- guess_df[2, "B_value"]
df_k_guess <- guess_df[2, "k_value"]
df_mx_guess <- guess_df[2, "mx_value"]
df_my_guess <- guess_df[2, "my_value"]
ind_asyr <- indm_asy
} else if (mast_lab==1 && sex_lab=="F") { #Female mast
ind_fits[i , 'df'] <- "fma"
df_asm <- af_asm
df_B_guess <- guess_df[3, "B_value"]
df_k_guess <- guess_df[3, "k_value"]
df_mx_guess <- guess_df[3, "mx_value"]
df_my_guess <- guess_df[3, "my_value"]
ind_asyr <- indm_asy
} else if (mast_lab==1 && sex_lab=="M") { #Males mast
ind_fits[i , 'df'] <- "mma"
df_asm <- am_asm
df_B_guess <- guess_df[4, "B_value"]
df_k_guess <- guess_df[4, "k_value"]
df_mx_guess <- guess_df[4, "mx_value"]
df_my_guess <- guess_df[4, "my_value"]
ind_asyr <- indf_asy
} else { #If sex or mast is not identified or identified improperly in the data
print("NA")
} #End of if else loop
#Arctangent
#Fits nls model to the created dataframe
nls.floop <- tryCatch({data.frame(tidy(nls(wt~ B*atan(k*(age - mx)) + my, #tryCatch lets nls have alternate results instead of "code stopping" errors
data=yind_df,
start = list(B = df_B_guess, k = df_k_guess, mx = df_mx_guess, my = df_my_guess),
control= list(maxiter = 200000, minFactor = 1/100000000))))
},
error = function(e){
nls.floop <- data.frame(c(0,0), c(0,0)) #Specifies nls.floop as a dummy dataframe if no convergence
},
warning = function(w) {
nls.floop <- data.frame(tidy(nls.floop)) #Fit is the same if warning is displayed
}) #End of nls.floop
#Creates a dummy numerical index from nls.floop for if else loop below
numeric_floop <- as.numeric(nls.floop[1, 2])
#print(numeric_floop) #Taking a look at the values. If numeric_floop...
# == 0, the function did not converge on iteration "i"
# != 0, the function did converge on iteration "i" and the code will run through the calculations
if (numeric_floop != 0) {
results_DF <- nls.floop
ind_fits[i , 'converge'] <- 1 #converge = 1 for converging fit
#Extracting, calculating, and appending values into dataframe
B_value <- as.numeric(results_DF[1, "estimate"]) #B value
k_value <- as.numeric(results_DF[2, "estimate"]) #k value
mx_value <- as.numeric(results_DF[3, "estimate"]) #mx value
my_value <- as.numeric(results_DF[4, "estimate"]) #my value
A_value <- ((B_value*pi)/2)+ my_value #A value calculation
ind_fits[i , 'A_value'] <- A_value
ind_fits[i , 'k_value'] <- k_value
ind_fits[i , 'mx_value'] <- mx_value
ind_fits[i , 'my_value'] <- my_value #appends my_value into df
ind_fits[i , 'max_grate'] <- adr(mx_value, B_value, k_value, mx_value, my_value) #Calculates max growth rate
}
} #End of individual fits loop
Which gives this output:
> head(ind_fits%>%select(df, squirrel_id, A_value, k_value, mx_value, my_value))
df squirrel_id A_value k_value mx_value my_value
1 mnm 332 257.2572 0.05209824 52.26842 126.13183
2 mnm 1252 261.0728 0.02810033 42.37454 103.02102
3 mnm 3466 260.4936 0.03946594 62.27705 131.56665
4 fnm 855 437.9569 0.01347379 86.18629 158.27641
5 fnm 2409 228.7047 0.04919819 63.99252 123.63404
6 fnm 1417 196.0578 0.05035963 57.67139 99.65781
Note that you need to create a blank dataframe first before running the loops.
Good morning. I need the community's help to understand some problems that occurred while writing this model.
I aim to model cause-of-death proportions using as predictors "log_GDP" (gross domestic product on a log scale) and "log_h" (hospital beds per 1,000 people on a log scale):
y: 3 columns that are observed proportions of deaths over the years.
x1: "log_GDP" (Gross domestic product in log scale)
x2: "log_h" (hospital beds per 1,000 people in log scale)
As you can see from the estimation results in the last plot, I get a high noise level. When I worked with just one covariate, i.e. log_GDP, I obtained smooth results.
Here are the simulated data, followed by the model specification:
library(reshape2)
library(tidyverse)
library(ggplot2)
library(runjags)
CIRC <- c(0.3685287, 0.3675516, 0.3567829, 0.3517274, 0.3448940, 0.3391031, 0.3320184, 0.3268640,
0.3227445, 0.3156360, 0.3138515,0.3084506, 0.3053657, 0.3061224, 0.3051044)
NEOP <- c(0.3602199, 0.3567355, 0.3599409, 0.3591258, 0.3544591, 0.3566269, 0.3510974, 0.3536156,
0.3532980, 0.3460948, 0.3476183, 0.3475634, 0.3426035, 0.3352433, 0.3266048)
OTHER <-c(0.2712514, 0.2757129, 0.2832762, 0.2891468, 0.3006468, 0.3042701, 0.3168842, 0.3195204,
0.3239575, 0.3382691, 0.3385302, 0.3439860, 0.3520308, 0.3586342, 0.3682908)
log_h <- c(1.280934, 1.249902, 1.244155, 1.220830, 1.202972, 1.181727, 1.163151, 1.156881, 1.144223,
1.141033, 1.124930, 1.115142, 1.088562, 1.075002, 1.061257)
log_GDP <- c(29.89597, 29.95853, 29.99016, 30.02312, 30.06973, 30.13358, 30.19878, 30.25675, 30.30184,
30.31974, 30.30164, 30.33854, 30.37460, 30.41585, 30.45150)
D <- data.frame(CIRC=CIRC, NEOP=NEOP, OTHER=OTHER,
log_h=log_h, log_GDP=log_GDP)
cause.y <- as.matrix((data.frame(D[,1],D[,2],D[,3])))
cause.y <- cause.y/rowSums(cause.y)
mat.x<- D$log_GDP
mat.x2 <- D$log_h
n <- 15
JAGS model
dirlichet.model = "
model {
#setup priors for each species
for(j in 1:N.spp){
m0[j] ~ dnorm(0, 1.0E-3) #intercept prior
m1[j] ~ dnorm(0, 1.0E-3) # mat.x prior
m2[j] ~ dnorm(0, 1.0E-3)
}
#implement dirlichet
for(i in 1:N){
y[i,1:N.spp] ~ ddirch(a0[i,1:N.spp])
for(j in 1:N.spp){
log(a0[i,j]) <- m0[j] + m1[j] * mat.x[i]+ m2[j] * mat.x2[i] # m0 = intercept; m1= coeff log_GDP; m2= coeff log_h
}
}} #close model loop.
"
jags.data <- list(y = cause.y,mat.x= mat.x,mat.x2= mat.x2, N = nrow(cause.y), N.spp = ncol(cause.y))
jags.out <- run.jags(dirlichet.model,
data=jags.data,
adapt = 5000,
burnin = 5000,
sample = 10000,
n.chains=3,
monitor=c('m0','m1','m2'))
out <- summary(jags.out)
head(out)
Gather the coefficients and estimate the proportions
coeff <- out[c(1,2,3,4,5,6,7,8,9),4]
coef1 <- out[c(1,4,7),4] #coeff (interc and slope) caus 1
coef2 <- out[c(2,5,8),4] #coeff (interc and slope) caus 2
coef3 <- out[c(3,6,9),4] #coeff (interc and slope) caus 3
pred <- as.matrix(cbind(exp(coef1[1]+coef1[2]*mat.x+coef1[3]*mat.x2),
exp(coef2[1]+coef2[2]*mat.x+coef2[3]*mat.x2),
exp(coef3[1]+coef3[2]*mat.x+coef3[3]*mat.x2)))
pred <- pred / rowSums(pred)
Data frame of predicted and observed values
Obs <- data.frame(Circ=cause.y[,1],
Neop=cause.y[,2],
Other=cause.y[,3],
log_GDP=mat.x,
log_h=mat.x2)
Obs$model <- "Obs"
Pred <- data.frame(Circ=pred[,1],
Neop=pred[,2],
Other=pred[,3],
log_GDP=mat.x,
log_h=mat.x2)
Pred$model <- "Pred"
tot60<-as.data.frame(rbind(Obs,Pred))
tot <- melt(tot60,id=c("log_GDP","log_h","model"))
tot$variable <- as.factor(tot$variable)
Plot
tot %>%filter(model=="Obs") %>% ggplot(aes(log_GDP,value))+geom_point()+
geom_line(data = tot %>%
filter(model=="Pred"))+facet_wrap(.~variable,scales = "free")
The reason for the non-smoothness is that you are calculating Pr(y=m|X) = f(x1, x2), that is, the predicted probability is a function of both x1 and x2. You are then plotting Pr(y=m|X) against a single x variable, the log of GDP, and that result will almost certainly not be smooth. The log_GDP and log_h variables are highly negatively correlated, which is why the result is not much more variable than it is.
In my run of the model, the average coefficient for log_GDP is actually positive for NEOP and Other, suggesting that the result you see in the plot is quite misleading. If you were to plot these in two dimensions, you would see that the result is again smooth.
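A quick way to verify that claim with the simulated series from the question (my addition, not the answerer's code):
cor(log_GDP, log_h)  # strongly negative for the series given above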
mx1 <- seq(min(mat.x), max(mat.x), length=25)
mx2 <- seq(min(mat.x2), max(mat.x2), length=25)
eg <- expand.grid(mx1 = mx1, mx2 = mx2)
pred <- as.matrix(cbind(exp(coef1[1]+coef1[2]*eg$mx1 + coef1[3]*eg$mx2),
exp(coef2[1]+coef2[2]*eg$mx1 + coef2[3]*eg$mx2),
exp(coef3[1]+coef3[2]*eg$mx1 + coef3[3]*eg$mx2)))
pred <- pred / rowSums(pred)
Pred <- data.frame(Circ=pred[,1],
Neop=pred[,2],
Other=pred[,3],
log_GDP=eg$mx1,
log_h=eg$mx2)
lattice::wireframe(Neop ~ log_GDP + log_h, data=Pred, drape=TRUE)
A couple of other things to watch out for.
Usually in hierarchical Bayesian models, the parameters of your coefficients would themselves be distributions with hyperparameters. This enables shrinkage of the coefficients toward the global mean, which is a hallmark of hierarchical models.
Not sure if this is what your data really look like or not, but the correlation between the two independent variables is going to make it difficult for the model to converge. You could try using a multivariate normal distribution for the coefficients - that might help.
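As a minimal sketch of the first point (my own illustration, not a tested model), the per-cause coefficients could be given common hyperpriors so they shrink toward a global mean; the same jags.data and run.jags() call from the question should work unchanged:
dirlichet.model.h = "
model {
  # hyperpriors shared across the three causes
  mu.m0 ~ dnorm(0, 1.0E-3)
  tau.m0 ~ dgamma(0.1, 0.1)
  mu.m1 ~ dnorm(0, 1.0E-3)
  tau.m1 ~ dgamma(0.1, 0.1)
  mu.m2 ~ dnorm(0, 1.0E-3)
  tau.m2 ~ dgamma(0.1, 0.1)
  for(j in 1:N.spp){
    m0[j] ~ dnorm(mu.m0, tau.m0)  # intercept
    m1[j] ~ dnorm(mu.m1, tau.m1)  # log_GDP coefficient
    m2[j] ~ dnorm(mu.m2, tau.m2)  # log_h coefficient
  }
  for(i in 1:N){
    y[i,1:N.spp] ~ ddirch(a0[i,1:N.spp])
    for(j in 1:N.spp){
      log(a0[i,j]) <- m0[j] + m1[j]*mat.x[i] + m2[j]*mat.x2[i]
    }
  }
}"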
I am trying to run a Monte Carlo simulation of a difference in differences estimator, but I am running into an error. Here is the code I am running:
# Set the random seed
set.seed(1234567)
library(MonteCarlo)
#Set up problem, doing this before calling the function
# set sample size
n<- 400
# set true parameters: betas and sd of u
b0 <- 1 # intercept for control data (b0 in diffndiff)
b1 <- 1 # shift on both control and treated after treatment (b1 in
#diffndiff)
b2 <- 2 # difference between intercept on control vs. treated (b2-this is
#the level difference pre-treatment to compare to coef on treat)
b3 <- 3 # shift after treatment that is only for treated group (b3-this is
#the coefficient of interest in diffndiff)
b4 <- 0 # parallel time trend (not measured in diffndiff) biases b0,b1 but
#not b3 that we care about
b5 <- 0 # allows for treated group trend to shift after treatment (0 if
#parallel trends holds)
su <- 4 # std. dev for errors
dnd <- function(n,b0,b1,b2,b3,b4,b5,su){
#initialize a time vector (set observations equal to n)
timelength = 10
t <- c(1:timelength)
num_obs_per_period = n/timelength #allows for multiple observations in one
#time period (can simulate multiple states within one group or something)
t0 <- c(1:timelength)
for (p in 1:(num_obs_per_period-1)){
t <- c(t,t0)
}
T<- 5 #set treatment period
g <- t >T
post <- as.numeric(g)
# assign equal amounts of observations to each state to start with (would
#like to allow selection into treatment at some point)
treat <- vector()
for (m in 1:(round(n/2))){
treat <- c(treat,0)
}
for (m in 1:(round(n/2))){
treat <- c(treat,1)
}
u <- rnorm(n,0,su) #This assumes the mean error is zero
#create my y vector now from the data
y<- b0 + b1*post + b2*treat + b3*treat*post + b4*t + b5*(t-T)*treat*post +u
interaction <- treat*post
#run regression
olsres <- lm(y ~ post + treat + interaction)
olsres$coefficients
# assign the coefficients
bhat0<- olsres$coefficients[1]
bhat1 <- olsres$coefficients[2]
bhat2<- olsres$coefficients[3]
bhat3<- olsres$coefficients[4]
bhat3_stderr <- coef(summary(olsres))["interaction", "Std. Error"] #std. error of the interaction (DiD) coefficient
#Here I will use bhat3 to conduct a t-test and determine if this was a pass
#or a fail
tval <- (bhat3-b3)/ bhat3_stderr
#decision at 5% confidence I believe (False indicates the t-stat was less
#than 1.96, and we fail to reject the null)
decision <- abs(tval) > 1.96
decision <- unname(decision)
return(list(decision))
}
#Define a parameter grid to simulate over
from <- -5
to <- 5
increment <- .25
gridparts<- c(from , to , increment)
b5_grid <- seq(from = gridparts[1], to = gridparts[2], by = gridparts[3])
parameter <- list("n" = n, "b0" = b0 , "b1" = b1 ,"b2" = b2 ,"b3" = b3 ,"b4"
=
b4 ,"b5" = b5_grid ,"su" = su)
#Now simulate this multiple times in a monte carlo setting
results <- MonteCarlo(func = dnd ,nrep = 100, param_list = parameter)
And the error that comes up is:
Error in results[[i]] <- array(NA, dim = c(dim_vec, nrep)) :
  attempt to select less than one element in integerOneIndex
This leads me to believe that somewhere something is attempting to access the "0th" element of a vector, which doesn't exist in R as far as I understand. However, I don't believe the offending part is in my own code but rather somewhere internal to the package, and I can't make sense of the package code that runs when I call it.
I am also open to hearing about other methods that will essentially replace simulate() from Stata.
The function passed to MonteCarlo() must return a list with named components. Changing line 76 (the return statement of dnd()) to
return(list("decision" = decision))
should work
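A tiny self-contained illustration of that named-list requirement (my own toy example, not the OP's simulation; as far as I recall, MakeTable() is the package's tabulation helper):
library(MonteCarlo)
set.seed(123)
# a one-sample t-test rejection indicator, returned as a *named* list element
toy <- function(n, mu) {
  x <- rnorm(n, mean = mu)
  reject <- abs(mean(x) / (sd(x) / sqrt(n))) > 1.96
  return(list("decision" = as.numeric(reject)))
}
res <- MonteCarlo(func = toy, nrep = 100,
                  param_list = list("n" = c(50, 100), "mu" = c(0, 0.5)))
MakeTable(output = res, rows = "n", cols = "mu", digits = 2)  # rejection rates over the grid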
This began as a question, but after reading the reference provided in the answer below, as well as the source code, the solution became clear. In case anyone else finds themselves in this position:
The orcutt package (version 2.2) uses a special procedure to calculate the Durbin-Watson (DW) statistic for its Cochrane-Orcutt (CO) models. The MWE below uses the package's own example to show that the reported DW statistic is not based on the residuals of the CO estimation.
library(orcutt)
data(icecream, package="orcutt")
dw_calc = function(x){sum((x[2:length(x)] - x[1:(length(x)-1)]) ^ 2) / sum(x ^ 2)}
lm = lm(cons ~ price + income + temp, data=icecream)
e = lm$residuals
dw_calc(e)
# 1.02117 <- Durbin-Watson Statistic
coch = cochrane.orcutt(lm)
e = coch$residuals
dw_calc(e)
# 1.006431 <- Durbin-Watson Statistic
coch
# Durbin-Watson statistic
# (original): 1.02117 , p-value: 3.024e-04
# (transformed): 1.54884 , p-value: 5.061e-02
The orcutt package reports 1.54884 but the actual DW is 1.006431 for the new residuals. The reported value, 1.54884, comes from the last round of the convergence procedure (see Hildreth-Lu). See below for a thorough explanation:
lm = lm(cons ~ price + income + temp, data=icecream)
reg = lm(cons ~ price + income + temp, data=icecream)
convergence = 8
X <- model.matrix(reg)
Y <- model.response(model.frame(reg))
n<-length(Y)
e<-reg$residuals
e2<-e[-1]
e3<-e[-n]
regP<-lm(e2~e3-1)
rho<-summary(regP)$coeff[1]
rho2<-c(rho)
XB<-X[-1,]-rho*X[-n,]
YB<-Y[-1]-rho*Y[-n]
regCO<-lm(YB~XB-1)
ypCO<-regCO$coeff[1]+as.matrix(X[,-1])%*%regCO$coeff[-1]
e1<-ypCO-Y
e2<-e1[-1]
e3<-e1[-n]
regP<-lm(e2~e3-1)
rho<-summary(regP)$coeff[1]
rho2[2]<-rho
i<-2
while (round(rho2[i-1],convergence)!=round(rho2[i],convergence)){
XB<-X[-1,]-rho*X[-n,]
YB<-Y[-1]-rho*Y[-n]
regCO<-lm(YB~XB-1)
ypCO<-regCO$coeff[1]+as.matrix(X[,-1])%*%regCO$coeff[-1]
e1<-ypCO-Y
e2<-e1[-1]
e3<-e1[-n]
regP<-lm(e2~e3-1)
rho<-summary(regP)$coeff[1]
i<-i+1
rho2[i]<-rho
}
regCO$number.interaction<-i-1
regCO$rho <- rho2[i-1]
regCO$DW <- c(lmtest::dwtest(reg)$statistic, lmtest::dwtest(reg)$p.value,
lmtest::dwtest(regCO)$statistic, lmtest::dwtest(regCO)$p.value)
dw_calc(regCO$residuals)
regF<-lm(YB ~ 1)
tF <- anova(regCO,regF)
regCO$Fs <- c(tF$F[2],tF$`Pr(>F)`[2])
# fitted.value
regCO$fitted.values <- model.matrix(reg) %*% (as.matrix(regCO$coeff))
# coeff
names(regCO$coefficients) <- colnames(X)
# st.err
regCO$std.error <- summary(regCO)$coeff[,2]
# t value
regCO$t.value <- summary(regCO)$coeff[,3]
# p value
regCO$p.value <- summary(regCO)$coeff[,4]
class(regCO) <- "orcutt"
# formula
regCO$call <- reg$call
# F statistics and p value
df1 <- dim(model.frame(reg))[2] - 1
df2 <- length(regCO$residuals) - df1 - 1
RSS <- sum((regCO$residuals)^2)
TSS <- sum((regCO$model[1] - mean(regCO$model[,1]))^2)
regCO$rse <- sqrt(RSS/df2)
regCO$r.squared <- 1 - (RSS/TSS)
regCO$adj.r.squared <- 1 - ((RSS/df2)/(TSS/(df1 + df2)))
regCO$gdl <- c(df1, df2)
#
regCO$rank <- df1
regCO$df.residual <- df2
regCO$assign <- regCO$assign[-(df1+1)]
regCO$residuals <- Y - regCO$fitted.values
regCO
I don't know this area well, but it seems much more likely that there are multiple definitions/estimation methods for this statistic. Going back to the book cited in ?orcutt (Verbeek M. (2004), A Guide to Modern Econometrics, John Wiley & Sons Ltd, ISBN 978-88-08-17054-5) and searching on Google Books, the value shown as computed by orcutt in your example above agrees with the value given in the book. Earlier in the book (p. 108) it says
In [the Cochrane-Orcutt procedure], $\rho$ and $\beta$ are recursively estimated until convergence, i.e. having estimated $\beta$ by EGLS (by $\beta^*$), the residuals are recomputed and $\rho$ is estimated again using the residuals from the EGLS step. With this new estimate of $\rho$, EGLS is applied again and one obtains a new estimate of $\beta$ ...
In other words, it seems as though the estimate of $\rho$ that you give above corresponds only to the first step of the Cochrane-Orcutt procedure.
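For reference, the quasi-differencing step applied at each pass of that procedure (and of the while loop in the code above) is
$$y_t^* = y_t - \hat\rho\, y_{t-1}, \qquad x_t^* = x_t - \hat\rho\, x_{t-1},$$
after which $\beta$ is re-estimated by OLS of $y^*$ on $x^*$ and $\hat\rho$ is re-estimated from the updated residuals, repeating until $\hat\rho$ stabilises.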
I'm trying to efficiently implement a block bootstrap technique to get the distribution of regression coefficients. The main outline is as follows.
I have a panel data set, and say firm and year are the indices. For each iteration of the bootstrap, I wish to sample n subjects with replacement. From this sample, I need to construct a new data frame that is an rbind() stack of all the observations for each sampled subject, run the regression, and pull out the coefficients. Repeat for a bunch of iterations, say 100.
Each firm can potentially be selected multiple times, so I need to include its data multiple times in each iteration's data set.
Using a loop and subset approach, like below, seems computationally burdensome.
Note that my real data frame, n, and the number of iterations are much larger than in the example below.
My initial thought is to break the existing data frame into a list by subject using the split() command. From there, use
sample(unique(df1$subject),n,replace=TRUE)
to get the new list, then perhaps implement quickdf from the plyr package to construct a new data frame.
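A rough sketch of that idea (my illustration; a data frame df1 with a subject column and a placeholder formula y ~ x are assumed, and do.call(rbind, ...) stands in for plyr::quickdf):
by_subject <- split(df1, df1$subject)                       # one data frame per subject
drawn <- sample(unique(df1$subject), n, replace = TRUE)     # resample n subjects with replacement
newdata <- do.call(rbind, by_subject[as.character(drawn)])  # repeats are stacked as many times as drawn
coef(lm(y ~ x, data = newdata))                             # placeholder regression for one replicate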
Example slow code:
require(plm)
data("Grunfeld", package="plm")
firms = unique(Grunfeld$firm)
n = 10
iterations = 100
mybootresults=list()
for(j in 1:iterations){
v = sample(length(firms),n,replace=TRUE)
newdata = NULL
for(i in 1:n){
newdata = rbind(newdata,subset(Grunfeld, firm == v[i]))
}
reg1 = lm(value ~ inv + capital, data = newdata)
mybootresults[[j]] = coefficients(reg1)
}
mybootresults = as.data.frame(t(matrix(unlist(mybootresults),ncol=iterations)))
names(mybootresults) = names(reg1$coefficients)
mybootresults
(Intercept) inv capital
1 373.8591 6.981309 -0.9801547
2 370.6743 6.633642 -1.4526338
3 528.8436 6.960226 -1.1597901
4 331.6979 6.239426 -1.0349230
5 507.7339 8.924227 -2.8661479
...
...
How about something like this:
library(boot)  # provides boot()
myfit <- function(x, i) {
mydata <- do.call("rbind", lapply(i, function(n) subset(Grunfeld, firm==x[n])))
coefficients(lm(value ~ inv + capital, data = mydata))
}
firms <- unique(Grunfeld$firm)
b0 <- boot(firms, myfit, 999)
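A possible way (my addition, not part of the accepted answer) to inspect the resulting bootstrap distribution:
head(b0$t)                             # one row of (Intercept, inv, capital) per replicate
apply(b0$t, 2, sd)                     # bootstrap standard errors
boot.ci(b0, type = "perc", index = 2)  # percentile CI for the inv coefficient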
You can also use the tsboot function in the boot package with a fixed block resampling scheme.
require(plm)
require(boot)
data(Grunfeld)
### each firm is of length 20
table(Grunfeld$firm)
## 1 2 3 4 5 6 7 8 9 10
## 20 20 20 20 20 20 20 20 20 20
blockboot <- function(data)
{
coefficients(lm(value ~ inv + capital, data = data))
}
### fixed length (every 20 obs, so for each different firm) block bootstrap
set.seed(321)
boot.1 <- tsboot(Grunfeld, blockboot, R = 99, l = 20, sim = "fixed")
boot.1
## Bootstrap Statistics :
## original bias std. error
## t1* 410.81557 -25.785972 174.3766
## t2* 5.75981 0.451810 2.0261
## t3* -0.61527 0.065322 0.6330
dim(boot.1$t)
## [1] 99 3
head(boot.1$t)
## [,1] [,2] [,3]
## [1,] 522.11 7.2342 -1.453204
## [2,] 626.88 4.6283 0.031324
## [3,] 479.74 3.2531 0.637298
## [4,] 557.79 4.5284 0.161462
## [5,] 568.72 5.4613 -0.875126
## [6,] 379.04 7.0707 -1.092860
Here is a method that should typically be faster than the accepted answer, returns the same results and does not rely on additional packages (except boot). The key here is to use which and integer indexing to construct each data.frame replicate rather than split/subset and do.call/rbind.
# get function for boot
myIndex <- function(x, i) {
# select the observations to subset. Likely repeated observations
blockObs <- unlist(lapply(i, function(n) which(x[n] == Grunfeld$firm)))
# run regression for given replicate, return estimated coefficients
coefficients(lm(value~ inv + capital, data=Grunfeld[blockObs,]))
}
now, bootstrap
# get result
library(boot)
set.seed(1234)
b1 <- boot(firms, myIndex, 200)
Run the accepted answer
set.seed(1234)
b0 <- boot(firms, myfit, 200)
Let's eyeball a comparison
using indexing
b1
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = firms, statistic = myIndex, R = 200)
Bootstrap Statistics :
original bias std. error
t1* 410.8155650 -6.64885086 197.3147581
t2* 5.7598070 0.37922066 2.4966872
t3* -0.6152727 -0.04468225 0.8351341
Original version
b0
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = firms, statistic = myfit, R = 200)
Bootstrap Statistics :
original bias std. error
t1* 410.8155650 -6.64885086 197.3147581
t2* 5.7598070 0.37922066 2.4966872
t3* -0.6152727 -0.04468225 0.8351341
These look pretty close. Now, a bit more checking
identical(b0$t, b1$t)
[1] TRUE
and
identical(summary(b0), summary(b1))
[1] TRUE
Finally, we'll do a quick benchmark
library(microbenchmark)
microbenchmark(index={b1 <- boot(firms, myIndex, 200)},
rbind={b0 <- boot(firms, myfit, 200)})
On my computer, this returns
Unit: milliseconds
expr min lq mean median uq max neval
index 292.5770 296.3426 303.5444 298.4836 301.1119 395.1866 100
rbind 712.1616 720.0428 729.6644 724.0777 731.0697 833.5759 100
So, direct indexing is more than 2 times faster at every level of the distribution.
note on missing fixed effects
As with most of the answers, the issue of missing "fixed effects" may emerge. Commonly, fixed effects are used as controls and the researcher is interested in one or a couple of variables that will be included with every selected observation. In this common case, there is little or no harm in restricting the returned result of the myIndex or myfit function to only the variables of interest.
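For example (my addition, not part of the original answer), the replicate function could return only the slopes of interest:
myIndexSlim <- function(x, i) {
  blockObs <- unlist(lapply(i, function(n) which(x[n] == Grunfeld$firm)))
  # keep only the coefficients the researcher cares about
  coefficients(lm(value ~ inv + capital, data = Grunfeld[blockObs, ]))[c("inv", "capital")]
}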
The solution needs to be modified to manage fixed effects.
library(boot) # for boot
library(plm) # for Grunfeld
library(dplyr) # for left_join
## Get the Grunfeld firm data (10 firms, each for 20 years, 1935-1954)
data("Grunfeld", package="plm")
## Create dataframe with unique firm identifier (one line per firm)
firms <- data.frame(firm=unique(Grunfeld$firm),junk=1)
## for boot(), X is the firms dataframe; i index the sampled firms
myfit <- function(X, i) {
## join the sampled firms to their firm-year data
mydata <- left_join(X[i,], Grunfeld, by="firm")
## Distinguish between multiple resamples of the same firm
## Otherwise they have the same id in the fixed effects regression
## And trouble ensues
mydata <- mutate(group_by(mydata,firm,year),
firm_uniq4boot = paste(firm,"+",row_number())
)
## Run regression with and without firm fixed effects
c(coefficients(lm(value ~ inv + capital, data = mydata)),
coefficients(lm(value ~ inv + capital + factor(firm_uniq4boot), data = mydata)))
}
set.seed(1)
system.time(b <- boot(firms, myfit, 1000))
summary(b)
summary(lm(value ~ inv + capital, data=Grunfeld))
summary(lm(value ~ inv + capital + factor(firm), data=Grunfeld))
I found a method using dplyr::left_join that is a bit more concise, only takes about 60% as long, and gives the same results as in the answer by Sean. Here's a complete self-contained example.
library(boot) # for boot
library(plm) # for Grunfeld
library(dplyr) # for left_join
# First get the data
data("Grunfeld", package="plm")
firms <- unique(Grunfeld$firm)
myfit1 <- function(x, i) {
# x is the vector of firms
# i are the indexes into x
mydata <- do.call("rbind", lapply(i, function(n) subset(Grunfeld, firm==x[n])))
coefficients(lm(value ~ inv + capital, data = mydata))
}
myfit2 <- function(x, i) {
# x is the vector of firms
# i are the indexes into x
mydata <- left_join(data.frame(firm=x[i]), Grunfeld, by="firm")
coefficients(lm(value ~ inv + capital, data = mydata))
}
# rbind method
set.seed(1)
system.time(b1 <- boot(firms, myfit1, 5000))
## user system elapsed
## 13.51 0.01 13.62
# left_join method
set.seed(1)
system.time(b2 <- boot(firms, myfit2, 5000))
## user system elapsed
## 8.16 0.02 8.26
b1
## original bias std. error
## t1* 410.8155650 9.2896499 198.6877889
## t2* 5.7598070 0.5748503 2.5725441
## t3* -0.6152727 -0.1200954 0.7829191
b2
## original bias std. error
## t1* 410.8155650 9.2896499 198.6877889
## t2* 5.7598070 0.5748503 2.5725441
## t3* -0.6152727 -0.1200954 0.7829191