Implementing a neural network from scratch in R - r

I'm working on an assignment for my Machine Learning course, and as part of it I'm trying to implement a neural network. Since it's for school, I have to implement the algorithm manually, and not use any of the neuralnet packages available.
I've been using the material in "Learning from Data" along with the Caltech lectures that follow it on YouTube.
I've put together the algorithm in R to the best of my ability, but something is going wrong along the way. I haven't been able to implement a stopping rule based on the change in the cost function, so for now I've just fixed the number of iterations as a constant.
Edit
Hey guys. Thanks for the response. I can see I'm missing a lot of needed information. Sorry about that, don't really know what I was thinking.
The data I'm using is simply "toy data" generated from the sinc function sinc(x)=sin(x)/x.
The problem I'm having specifically is that the estimates that I get at the end of the algorithm are completely off from the real values, and they are significantly different every time I run the algorithm. It seems like I've put the algorithm together the way the book states, but I can't see where the problem is.
Edit 2
Added the data to the code so it can be run without doing anything extra. I also separated the individual parts of the function. As I mentioned in a comment, I was able to numerically verify the partial derivatives, so I think that part is OK. The problem I have is when I need to update the weights in order to train the network.
That part isn't in the code below, but my understanding is that to update the weights you simply take the old weight and subtract that weight's partial derivative scaled by the learning rate (wNew = wOld - eta*djdwOld)? A minimal sketch of that update loop follows the gradient check below.
theta <- function(a){
  # Non-linearity used throughout: a / (1 + |a|), a "softsign"-style
  # squashing function.
  a / (1 + abs(a))
}
theta.prime <- function(a){
  # Derivative of theta with respect to a
  1 / (1 + abs(a))^2
}
x <- c( 5.949110, -1.036600, 3.256780, 7.824520, -3.606010, 3.115640, -7.786960,
-7.598090, 2.083880, 3.983000, 8.060120, 7.879760, -2.456670,
-2.152720, 3.471950, 3.567960, -4.232630, 6.831610, -9.486860, 8.692330,
-1.551860, 0.917305, 4.669480, -7.760430, 2.835410)
y <- c(-0.10804400, 0.78264000, -0.05313330, 0.13484700, -0.05522470, -0.05758530,
0.19566100, 0.13846000, 0.43534100, -0.16861400, 0.10625000,
0.08427310, 0.27012900, 0.44004800, -0.00880575, -0.10711400, -0.18671100,
0.01158470, 0.02767190, 0.06319830, 0.61802000, 0.87124300,
-0.25668100, 0.06160800, 0.10575700)
inputlayer <- 1
outputlayer <- 1
hiddenlayer <- 2
w1 <- t(matrix(rnorm(hiddenlayer,0,.01),hiddenlayer,inputlayer))
w2 <- matrix(rnorm(hiddenlayer,0,.01),hiddenlayer,outputlayer)
### Forward propagation (as a function) ###
forward <- function(x, w1, w2, theta){
  s2 <- x %*% w1
  a2 <- apply(s2, c(1,2), theta)
  s3 <- a2 %*% w2
  yhat <- apply(s3, c(1,2), theta)
  return(yhat)
}
### Forward propagation manually ###
s2 <- x%*%w1
a2 <- apply(s2,c(1,2),theta)
s3 <- a2%*%w2
yhat <- apply(s3,c(1,2),theta)
### Error function ###
#yhat <- forward(x,w1,w2,theta)
E <- sum((y-yhat)^2)/(length(x))
### Backward Propagation ###
delta3 <- (-2*(y-yhat)) * apply(s3,c(1,2),theta.prime)
djdw2 <- t(a2) %*% delta3
delta2 <- delta3 %*% t(w2) * apply(s2,c(1,2),theta.prime)
djdw1 <- t(x)%*%delta2
### Numerically estimated gradients ###
e <- 1e-8
numgrad1 <- matrix(0,1,2)
eps <- matrix(0,1,2)
w1e <- matrix(0,1,2)
for(j in 1:2) {
  eps[1,j] <- e
  w1e <- w1 + eps
  loss2 <- sum((y - forward(x, w1e, w2, theta))^2)
  w1e <- w1
  loss1 <- sum((y - forward(x, w1e, w2, theta))^2)
  numgrad1[1,j] <- (loss2 - loss1) / e
  eps[1,j] <- 0
}
numgrad2 <- matrix(0,2,1)
eps <- matrix(0,2,1)
w2e <- matrix(0,2,1)
for(j in 1:2) {
  eps[j,1] <- e
  w2e <- w2 + eps
  loss2 <- sum((y - forward(x, w1, w2e, theta))^2)
  w2e <- w2
  loss1 <- sum((y - forward(x, w1, w2e, theta))^2)
  numgrad2[j,1] <- (loss2 - loss1) / e
  eps[j,1] <- 0
}
# Comparison of our gradients from backpropagation
# and numerical estimation.
c(djdw1,djdw2)
c(numgrad1,numgrad2)
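For reference, here is a minimal sketch of the weight-update (training) loop I have in mind, reusing the forward/backward code above; the learning rate eta and the number of iterations are arbitrary placeholder values:
eta <- 0.01      # learning rate (arbitrary placeholder)
niter <- 10000   # fixed number of iterations, as mentioned above
for(it in 1:niter){
  # forward pass
  s2 <- x %*% w1
  a2 <- apply(s2, c(1,2), theta)
  s3 <- a2 %*% w2
  yhat <- apply(s3, c(1,2), theta)
  # backward pass (same formulas as above)
  delta3 <- (-2*(y - yhat)) * apply(s3, c(1,2), theta.prime)
  djdw2 <- t(a2) %*% delta3
  delta2 <- delta3 %*% t(w2) * apply(s2, c(1,2), theta.prime)
  djdw1 <- t(x) %*% delta2
  # gradient-descent update: wNew = wOld - eta * dJ/dwOld
  w1 <- w1 - eta * djdw1
  w2 <- w2 - eta * djdw2
  # E <- sum((y - yhat)^2) / length(x)   # cost, if you want to track it
}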

Related

R nleqslv difficulties - solving for pH in an acid-base buffer

Goal
Build a theoretical titration curve for the phosphoric acid buffer (1M).
I provide a fully reproducible and self-contained example (of my failures ^.^).
Model equations
Acid-base equilibrium equations for phosphoric acid are:
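Written out (as encoded in the balance() function below), the equilibrium constants and balances are:
$$K_{a1} = \frac{[\mathrm{H^+}][\mathrm{H_2PO_4^-}]}{[\mathrm{H_3PO_4}]},\quad K_{a2} = \frac{[\mathrm{H^+}][\mathrm{HPO_4^{2-}}]}{[\mathrm{H_2PO_4^-}]},\quad K_{a3} = \frac{[\mathrm{H^+}][\mathrm{PO_4^{3-}}]}{[\mathrm{HPO_4^{2-}}]},\quad K_w = [\mathrm{H^+}][\mathrm{OH^-}]$$
Mass balance: $[\mathrm{H_3PO_4}] + [\mathrm{H_2PO_4^-}] + [\mathrm{HPO_4^{2-}}] + [\mathrm{PO_4^{3-}}] = P_{ca}$
Charge balance: $[\mathrm{H^+}] + [\mathrm{Na^+}] = [\mathrm{OH^-}] + [\mathrm{H_2PO_4^-}] + 2[\mathrm{HPO_4^{2-}}] + 3[\mathrm{PO_4^{3-}}]$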
Model implementation
Ka.1 <- 7.1 * 10^-3
Ka.2 <- 6.3 * 10^-8
Ka.3 <- 4.5 * 10^-13
Kw <- 10^-14
balance <- function(vars, Na_ca, P_ca, convert.fun = function(x) x){
  # Apply positive-only constraint
  vars <- convert.fun(vars)
  H   <- vars[1]
  H3A <- vars[2]
  H2A <- vars[3]
  HA  <- vars[4]
  A   <- vars[5]
  Na  <- convert.fun(Na_ca)
  eq.system <- c(H3A + H2A + HA + A - P_ca,         # mass balance
                 H + Na - Kw/H - H2A - 2*HA - 3*A,  # charge balance
                 H * H2A / Ka.1 - H3A,              # first dissociation
                 H * HA  / Ka.2 - H2A,              # second dissociation
                 H * A   / Ka.3 - HA)               # third dissociation
  return(eq.system)
}
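As a quick sanity check, the function can be evaluated directly at a trial point (a minimal sketch, using the same starting guess that is defined further down); the residuals only need to be near zero at a true root:
ci.start <- c(H=10^-1, H3A=0.9, H2A=0.1, HA=0.1, A=0.1)
balance(ci.start, Na_ca = 0, P_ca = 1)  # residuals at the initial guess; large values just mean it is not yet a root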
Notice that convert.fun is there to try different ways of forcing positive values on concentrations.
The return value is a vector of the model's equations, equated to zero (is this right?).
Iteration
I wished to solve the system for all possible Na+ concentrations, up to 3 equivalence "volumes".
I set initial conditions that worked for the lowest one: [Na]=0.
Then solved it with nleqslv and used the result to "seed" the next iteration.
And it seemed to work nicely:
But on close inspection, the issues become obvious.
Before that, some code!
Setup initial conditions and results matrix:
P_ca <- 1
ci.start <- c(H=10^-1, H3A=0.9, H2A=0.1, HA=0.1, A=0.1)
Na.seq <- seq(from=0,to=3*P_ca,by=P_ca/1000)
varnames <- c("Na", "H", "H3A", "H2A", "HA", "A")
result.m <- matrix(ncol = length(varnames), nrow = length(Na.seq))
colnames(result.m) <- varnames
result.m[,1] <- Na.seq
Iteration:
convert.fun <- function(x) abs(x)
for(i in 1:length(Na.seq)){
  Na_ca <- result.m[i,1]
  if(i == 1){                 # If this is the first iteration,
    ci <- ci.start            # use the "start" values as initial conditions.
  } else {                    # Otherwise,
    ci <- result.m[i-1, 2:6]  # use the previous solution as the seed.
  }
  result <- nleqslv::nleqslv(x = ci,
                             fn = balance,
                             Na = Na_ca, P = P_ca,
                             convert.fun = convert.fun,
                             method = "Newton", # method="Broyden",
                             global = "dbldog",
                             control = list(allowSingular = TRUE,
                                            maxit = 1000))
  result$x <- convert.fun(result$x)
  result.m[i,2:6] <- result$x
  stopifnot(all(result$x >= 0))
} # END LOOP
result.df <- as.data.frame(result.m)
Notice that convert.fun is now abs(x) (is this ok?).
The problem
The problem with the last plot is that the right part of it is flattened out.
The problem is even more obvious in the following plot:
The red curve is supposed to end up at the top, and the purple one at the bottom. This seems to start happening at Na~2, but after a few more iterations, the result flattens out (and becomes exactly constant).
Possible hints for the savvy
The problem is a bit worse using method="Broyden" instead of "Newton".
nleqslv's return message changes from "Function criterion near zero" to "x-values within tolerance 'xtol'".
I also tried adding a Jacobian. That didn't change the result, but at the problematic point I get something like this:
Chkjac possible error in jacobian[2,1] = 2.7598836276240e+06
Estimated[2,1] = 1.1104869955110e+04
I am now really out of ideas! And would really appreciate some help or guidance.
You should always test the termination code of nleqslv to determine whether a solution has been found, and display the termination code and/or the message nleqslv returns. You will see that in some cases no better point was found; any such result is invalid and useless.
You are using so many values for Na.seq that it is impossible to see the wood for the trees.
I would suggest starting with a very limited set of values for Na.seq.
Something like
Na.seq <- seq(from=0,to=3*P_ca,by=P_ca/10)
and also this to include the termination code in the result
varnames <- c("Na", "H", "H3A", "H2A", "HA", "A", "termcd")
result.m <- matrix(ncol = length(varnames), nrow = length(Na.seq))
And then change the iteration loop to this
for(i in 1:length(Na.seq)){
  Na_ca <- result.m[i,1]
  if(i == 1){                 # If this is the first iteration,
    ci <- ci.start            # use the "start" values as initial conditions.
  } else {                    # Otherwise,
    ci <- result.m[i-1, 2:6]  # use the previous solution as the seed.
  }
  iter.trace <- 1
  cat("Iteration ", i, "\n\n")
  result <- nleqslv::nleqslv(x = ci,
                             fn = balance,
                             Na = Na_ca, P = P_ca,
                             convert.fun = convert.fun,
                             method = "Newton", # method="Broyden",
                             global = "dbldog",
                             control = list(allowSingular = TRUE,
                                            maxit = 1000, trace = iter.trace))
  cat("\n\n ", result$message, "\n\n")
  result$x <- convert.fun(result$x)
  result.m[i,2:6] <- result$x
  result.m[i,7] <- result$termcd
  stopifnot(all(result$x >= 0))
} # END LOOP
and start analysing the output to find out what the problem is and where.
Addendum
I am reasonably sure that the difficulties with solving are (partly) caused by numerical difficulties. With the above modifications I changed the values for Ka.1, Ka.2, Ka.3, and Kw to
Ka.1 <- 7.1 * 10^-1
Ka.2 <- 6.3 * 10^-3
Ka.3 <- 4.5 * 10^-3
Kw <- 10^-3
and then there are no problems in finding a solution (all termination codes are 1). I suspect that the very small values for the K... constants are the cause of the problem. Check the system for possible errors or try to change the measurement units of the variables.
Solution details
Find details and full code at this repo.
The numerical method worked, and the analytical answer provided at chemistry stackexchange happily coincides :)
Sadly it does not match experimental data from Julia Martín et al (DOI 10.20431/2349-0403.0409002). Perhaps I'll post a question about it on chemistry stackexchange.
My thanks to everyone who helped out <3
Lastly, important plots from the numerical simulation:

Equivalent of Stata command `simulate` in R for Monte Carlo Simulation

I am searching for an equivalent function in R of the extremely convenient Stata command simulate. The command basically allows you to declare a program (reg_simulation in the example below) and then invoke such a program from simulate and store desired outputs.
Below is a Stata illustration of the usage of the simulate program, together with my attempt to replicate it using R.
Finally, my main question is: is this how R users would run a Monte Carlo simulation, or am I missing something in terms of structure or speed bottlenecks? Thanks a lot in advance.
Stata example
Defining reg_simulation program.
clear all
*Define "reg_simulation" to be used later on by "simulate" command
program reg_simulation, rclass
*Declaring Stata version
version 13
*Dropping all variables in memory
drop _all
*Set sample size (n=100)
set obs 100
*Simulate model
gen x1 = rnormal()
gen x2 = rnormal()
gen y = 1 + 0.5 * x1 + 1.5 *x2 + rnormal()
*Estimate OLS
reg y x1 x2
*Store coefficients
matrix B = e(b)
return matrix betas = B
end
Calling reg_simulation from simulate command:
*Set seed
set seed 1234
*Run the actual simulation 10 times using "reg_simulation"
simulate , reps(10) nodots: reg_simulation
Obtained result (stored data in memory)
_b_x1 _b_x2 _b_cons
.4470155 1.50748 1.043514
.4235979 1.60144 1.048863
.5006762 1.362679 .8828927
.5319981 1.494726 1.103693
.4926634 1.476443 .8611253
.5920001 1.557737 .8391003
.5893909 1.384571 1.312495
.4721891 1.37305 1.017576
.7109139 1.47294 1.055216
.4197589 1.442816 .9404677
R replication of the Stata program above.
Using R I have managed to get the following (not an R expert, though). However, the part that worries me most is the for-loop structure that loops over the number of repetitions nreps.
Defining reg_simulation function.
#Defining a function
reg_simulation <- function(obs = 1000){
  data <- data.frame(
    #Generate data
    x1 <- rnorm(obs, 0, 1),
    x2 <- rnorm(obs, 0, 1),
    y  <- 1 + 0.5*x1 + 1.5*x2 + rnorm(obs, 0, 1))
  #Estimate OLS
  ols <- lm(y ~ x1 + x2, data = data)
  return(ols$coefficients)
}
Calling reg_simulation 10 times using a for-loop structure:
#Generate list to store results from simulation
results_list <- list()
# N repetitions
nreps <- 10
for (i in 1:nreps) {
  #Set seed internally (to get different values in each run)
  set.seed(i)
  #Save results into list
  results_list[i] <- list(reg_simulation(obs = 1000))
}
#unlist results
df_results <- data.frame(t(sapply(results_list,
                                  function(x) x[1:max(lengths(results_list))])))
Obtained result: df_results.
#final results
df_results
# X.Intercept. x1 x2
# 1 1.0162384 0.5490488 1.522017
# 2 1.0663263 0.4989537 1.496758
# 3 0.9862365 0.5144083 1.462388
# 4 1.0137042 0.4767466 1.551139
# 5 0.9996164 0.5020535 1.489724
# 6 1.0351182 0.4372447 1.444495
# 7 0.9975050 0.4809259 1.525741
# 8 1.0286192 0.5253288 1.491966
# 9 1.0107962 0.4659812 1.505793
# 10 0.9765663 0.5317318 1.501162
You're on the right track. Couple of hints/corrections:
1. Don't use <- inside data.frame()
In R, we construct data frames using = for internal column assignment, i.e. data.frame(x = 1:10, y = 11:20) rather than data.frame(x <- 1:10, y <- 11:20).
(There's more to be said about <- vs =, but I don't want to distract from your main question.)
In your case, you don't actually even need to create a data frame since x1, x2 and y will all be recognized as "global" variables within the scope of the function. I'll post some code at the end of my answer demonstrating this.
2. When growing a list via a for loop in R, always pre-allocate the list first
Always try to pre-allocate the list's length (and type) if you are going to fill it in a (long) for loop. Reason: that way, R knows how much memory to allocate to your object efficiently. In the case where you are only doing 10 reps, that would mean starting with something like:
results_list <- vector("list", 10)
3. Consider using lapply instead of for
for loops have a bit of a bad rep in R. (Somewhat unfairly, but that's a story for another day.) An alternative that many R users would consider is the functional programming approach offered by lapply. I'll hold off on showing you the code for a second, but it will look very similar to a for loop. Just to note quickly, following on from point 2: one immediate benefit is that you don't need to pre-allocate the list with lapply.
4. Run large loops in parallel
A Monte Carlo simulation is an ideal candidate for running everything in parallel, since each iteration is supposed to be independent of the others. An easy way to go parallel in R is via the future.apply package.
Putting everything together, here's how I'd probably do your simulation. Note that this might be more "advanced" than you possibly need, but since I'm here...
library(data.table) ## optional, but what I'll use to coerce the list into a DT
library(future.apply) ## for parallel stuff
plan(multisession) ## use all available cores
obs <- 1e3
# Defining a function
reg_simulation <- function(...){
  x1 <- rnorm(obs, 0, 1)
  x2 <- rnorm(obs, 0, 1)
  y  <- 1 + 0.5*x1 + 1.5*x2 + rnorm(obs, 0, 1)
  #Estimate OLS
  ols <- lm(y ~ x1 + x2)
  # return(ols$coefficients)
  return(as.data.frame(t(ols$coefficients)))
}
# N repetitions
nreps <- 10
## Serial version
# results <- lapply(1:nreps, reg_simulation)
## Parallel version
results <- future_lapply(1:nreps, reg_simulation, future.seed = 1234L)
## Unlist / convert into a data.table
results <- rbindlist(results)
So, following up on the comments, you want to vary your independent variables (x) and also the error term and simulate the coefficients, but you also want to catch errors if any occur. The following would do the trick:
set.seed(42)
#Defining a function
reg_simulation <- function(obs = 1000){
  data <- data.frame(
    #Generate data
    x1 <- rnorm(obs, 0, 1),
    x2 <- rnorm(obs, 0, 1),
    y  <- 1 + 0.5*x1 + 1.5*x2 + rnorm(obs, 0, 1))
  #Estimate OLS
  tryCatch(
    {
      ols <- lm(y ~ x1 + x2, data = data)
      return(ols$coefficients)
    },
    error = function(e){
      return(c('(Intercept)' = NA, 'x1' = NA, 'x2' = NA))
    }
  )
}
output <- t(data.frame(replicate(10, reg_simulation())))
output
   (Intercept)        x1       x2
X1   0.9961328 0.4782010 1.481712
X2   1.0234698 0.4801982 1.556393
X3   1.0336289 0.5239380 1.435468
X4   0.9796523 0.5095907 1.493548
...
Here, tryCatch (see also failwith) catches the error and returns NA as the default value.
Note that you only need to set the seed once, because the RNG state advances deterministically with every call to the random number generator.

Maximum likelihood estimation of a multivariate normal distribution of arbitrary dimension in R - THE ULTIMATE GUIDE?

Searching Stack Overflow for similar questions, I notice that this has been asked several times but hasn't really been properly answered. Perhaps with help from other users this post can become a helpful guide to programming a numerical estimate of the parameters of a multivariate normal distribution.
I know, I know! The closed form solutions are available and trivial to implement. In my case I am interested in modifying the likelihood function for a specific purpose and I don't expect an exact analytic solution so this is a test case to check the procedure.
So here is my attempt. Please comment. Especially if I am missing opportunities for optimization. Note, I'm not a statistician so I'd appreciate any pointers.
ll_multN <- function(theta, X) {
  # theta = c(mu, diag(Sigma), Sigma[upper.tri(Sigma)])
  # X is an n x k dataset
  # Log-likelihood:
  #   L = -(nk/2)*log(2*pi) - (n/2)*log(det(Sigma))
  #       - (1/2)*sum_i( t(X_i - mu) %*% solve(Sigma) %*% (X_i - mu) )
  # The summation over i is performed with an apply call.
  n <- nrow(X)
  k <- ncol(X)
  # def mu
  mu.vec <- theta[1:k]
  # def Sigma
  Sigma.diag <- theta[(k+1):(2*k)]
  Sigma.offd <- theta[(2*k+1):length(theta)]
  Sigma <- matrix(NA, k, k)
  Sigma[upper.tri(Sigma)] <- Sigma.offd
  Sigma <- t(Sigma)
  Sigma[upper.tri(Sigma)] <- Sigma.offd
  diag(Sigma) <- Sigma.diag
  # compute summation
  sum_i <- sum(apply(X, 1, function(x) (matrix(x,1,k) - mu.vec) %*% solve(Sigma) %*% t(matrix(x,1,k) - mu.vec)))
  # compute log likelihood
  logl <- -.5*n*k*log(2*pi) - .5*n*log(det(Sigma))
  logl <- logl - .5*sum_i
  return(-logl)
}
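In equation form, the log-likelihood that the function evaluates (and whose negative it returns for minimization) is
$$\ell(\mu,\Sigma) = -\frac{nk}{2}\log(2\pi) - \frac{n}{2}\log\lvert\Sigma\rvert - \frac{1}{2}\sum_{i=1}^{n}(x_i-\mu)^{\top}\Sigma^{-1}(x_i-\mu)$$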
The simulated dataset is generated using the rmvnorm() function from the "mvtnorm" package. The random positive definite covariance matrix is generated with the helper function Posdef() (taken from here: https://stat.ethz.ch/pipermail/r-help/2008-February/153708).
library(mvtnorm)
Posdef <- function(n, ev = runif(n, 0, 5)) {
  # generates a random positive definite covariance matrix
  Z <- matrix(ncol = n, rnorm(n^2))
  decomp <- qr(Z)
  Q <- qr.Q(decomp)
  R <- qr.R(decomp)
  d <- diag(R)
  ph <- d / abs(d)
  O <- Q %*% diag(ph)
  Z <- t(O) %*% diag(ev) %*% O
  return(Z)
}
set.seed(2)
n <- 1000 # number of data points
k <- 3 # number of variables
mu.tru <- sample(0:3, k, replace=T) # random mean vector
Sigma.tru <- Posdef(k) # random covariance matrix
eigen(Sigma.tru)$val # check positive def (all lambda > 0)
# Generate simulated dataset
X <- rmvnorm(n, mean=mu.tru, sigma=Sigma.tru)
# initial parameter values
pars.init <- c(mu=rep(0,k), sig_ii=rep(1,k), sig_ij=rep(0, k*(k-1)/2))
# limits for optimization algorithm
eps <- .Machine$double.eps # a small value for bounding the parameter space to avoid things such as log(0)
lower.bound <- c(rep(-Inf, k), # bound on mu
                 rep(eps, k),  # bound on sigma_ii
                 rep(-Inf, k)) # bound on sigma_ij, i =/= j (k*(k-1)/2 happens to equal k when k = 3)
upper.bound <- c(rep(Inf, k),  # bound on mu
                 rep(100, k),  # bound on sigma_ii
                 rep(100, k))  # bound on sigma_ij, i =/= j
system.time(
  o <- optim(pars.init,
             ll_multN, X = X, method = "L-BFGS-B",
             lower = lower.bound,
             upper = upper.bound)
)
plot(x = c(mu.tru, diag(Sigma.tru), Sigma.tru[upper.tri(Sigma.tru)]),
     y = o$par,
     xlab = "Parameter",
     ylab = "Estimate",
     pch = 20)
abline(c(0,1), col = "red", lty = 2)
This currently runs on my laptop in
user system elapsed
47.852 24.014 24.611
and gives this graphical output:
Estimated mean and variance
In particular any advice on limit setting or algorithm choice would be much appreciated.
Thanks

R: Index out of bound using gmm package

I am new to Stack Overflow and this post is probably very basic. I get an unexpected "index out of bounds" error using the "gmm" package. More specifically, I am using the gel function of that package, and I need to supply the argument "g", which is a function that returns a matrix. The function that I pass to the "g" argument works perfectly on its own, but not as an argument of the gel function. I am aware there are very closely related questions:
https://stackoverflow.com/search?q=index+out+of+bounds+r
However none of these helped me fix the issue I am faced with.
I attach a reproducible example.
Thanks in advance.
rm(list=ls())
install.packages("gmm")
library(mvtnorm)
library(gmm)
#set.seed(1)
########################################
#functions declaration and construction#
########################################
moment.function <- function(data, alpha) {
  instrus.index <- length(alpha) + 1
  data <- as.matrix(data)
  nbr.instrus <- ncol(data) - instrus.index
  data1 <- data[, 1] - data[, (2:instrus.index)] %*% alpha
  data1 <- matrix(rep(data1, nbr.instrus), nrow(total.data), nbr.instrus)
  g.fun <- data[, -(1:instrus.index)] * data1
  #g.fun <- t(data[,-(1:instrus.index)])%*%(data[,1]-data[,(2:instrus.index)]%*%alpha)
  return(g.fun)
}
##################
#DGP construction#
##################
#set params
n <- 70
beta1 <- 1
beta2 <- 1
beta.first.stage <- 0.1
rho <- 0.1
cov.exo.instrus <- 0.3
sigma2.epsilon <- 0.1
sigma2.V <- 0.1
sigma2.simus <-0.01
Sigma <- rbind(c(1, cov.exo.instrus, cov.exo.instrus),
               c(cov.exo.instrus, 1, cov.exo.instrus),
               c(cov.exo.instrus, cov.exo.instrus, 1))
#generate obs according to DGP
#instruments and exogenous covariates
X <- rmvnorm(n, rep(0,3), Sigma)
#two disturbance terms
epsilon<-rnorm(n,0,sigma2.epsilon)
V <- rnorm(n,0,sigma2.V)
#endogenous regressor
Y2 <- beta.first.stage*(X[,2]+X[,3])+V
#outcome variable with structural error term
#h()=()^2
Y1 <- beta1*X[,1]+beta2*(Y2^2+sigma2.V-V^2-2*beta.first.stage*(X[,2]+X[,3])*V)+epsilon
#matrices for the finite-dimensional case
second.stage.vars <- cbind(Y1,X[,1],Y2^2)
total.data <- cbind(second.stage.vars,X)
###################################
#simulations in the finite-dimensional case
#with gel there is a problem
gel(moment.function, total.data, c(1.5, 1.5))
#moment.function alone has no problem
moment.function(total.data,c(1.5,1.5))
The gel function (like gmm) expects the parameter and data arguments in the other order, i.e. your moment function should be
moment.function <- function(alpha, data) {
  ## function body
}
With that change your example works for me.
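For completeness, here is a sketch of what the reordered version could look like applied to the example above (moment.function2 is just a placeholder name; the body is unchanged apart from the argument order and using nrow(data) instead of the global total.data):
moment.function2 <- function(alpha, data) {
  instrus.index <- length(alpha) + 1
  data <- as.matrix(data)
  nbr.instrus <- ncol(data) - instrus.index
  data1 <- data[, 1] - data[, 2:instrus.index] %*% alpha
  data1 <- matrix(rep(data1, nbr.instrus), nrow(data), nbr.instrus)
  data[, -(1:instrus.index)] * data1
}
gel(moment.function2, total.data, c(1.5, 1.5))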

First Difference Bootstrap from Negative Binomial

Novice here. I am fitting a negative binomial model on count data, where Y is the count of events, D is the treatment, and X enters as a logarithmic offset:
library(MASS)  # for glm.nb
out <- glm.nb(y ~ d + offset(log(x)), data = d1)
I would like to bootstrap the confidence intervals of the first difference between D=1 and D=0. I've gotten this far, but not sure if it is the correct approach:
holder <- matrix(NA, 1200, 1)
out <- glm.nb(y ~ d + offset(log(x)), data = d1)
for (i in 1:1200){
  q <- sample(1:nrow(d1), 1)
  d2 <- d1[q, ]
  d1_1 <- d1_2 <- d2
  d1_1$d <- 1
  d1_2$d <- 0
  d1pred <- predict(out, d1_1, type = "response")
  d2pred <- predict(out, d1_2, type = "response")
  holder[i,1] <- (d1pred[1] - d2pred[1])
}
mean(holder)
Is this the correct way to bootstrap the first difference?
Generally, your approach is OK, but you can do it in a more R-ish way. Firstly, if you are serious about bootstrapping you can employ the boot library and benefit from more compact code, no loops, and many other advanced options.
In your case it can look like:
## Data generation
N <- 100
set.seed(1)
d1 <- data.frame(y = rbinom(N, N, 0.5),
                 d = rbinom(N, 1, 0.5),
                 x = rnorm(N, 10, 3))
## Model
library(MASS)  # for glm.nb
library(boot)  # for boot()
out <- glm.nb(y ~ d + offset(log(x)), data = d1)
## Statistic function (what we are bootstrapping)
## Returns difference between D=1 and D=0
diff <- function(x, i, model){   # note: this masks base::diff() in the session
  v1 <- v2 <- x[i, ]
  v1$d <- 1
  v2$d <- 0
  predict(model, v1, type = "response") - predict(model, v2, type = "response")
}
## Bootstrapping itself
b <- boot(d1, diff, R=5e3, model=out)
mean(b$t)
Now b$t holds bootstrapped values. See names(b) and/or ?boot for extra information.
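If you want an actual confidence interval rather than just the bootstrap mean, one option (a sketch, assuming you summarize the statistic to a single number such as the average first difference; diff.mean is a hypothetical name) is boot.ci:
## Statistic summarized to a single number: average first difference
diff.mean <- function(x, i, model){
  v1 <- v2 <- x[i, ]
  v1$d <- 1
  v2$d <- 0
  mean(predict(model, v1, type = "response") - predict(model, v2, type = "response"))
}
b2 <- boot(d1, diff.mean, R = 5e3, model = out)
boot.ci(b2, type = "perc")   # percentile bootstrap confidence interval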
Bootstrapping is a time-consuming operation, and one of the obvious advantages of the boot library is its support for parallel operation. It's as easy as:
b <- boot(d1, diff, R=5e3, model=out, parallel="multicore", ncpus=2)
If you are on Windows use parallel="snow" instead.
