Novice here. I am fitting a negative binomial model to count data, where Y is the count of events, D is the treatment, and log(X) enters as an offset:
out <- glm.nb(y ~ d + offset(log(x)),data=d1)
I would like to bootstrap the confidence intervals of the first difference between D=1 and D=0. I've gotten this far, but I'm not sure it is the correct approach:
holder <- matrix(NA,1200,1)
out <- glm.nb(y ~ d + offset(log(x)),data=d1)
for (i in 1:1200){
q <- sample(1:nrow(d1), 1)
d2 <- d1[q,]
d1_1 <- d1_2 <- d2
d1_1$d <- 1
d1_2$d <- 0
d1pred <- predict(out,d1_1,type="response")
d2pred <- predict(out,d1_2,type="response")
holder[i,1] <- (d1pred[1] - d2pred[1])
}
mean(holder)
Is this the correct way to bootstrap the first difference?
Generally, your approach is OK, but you can do it in a more R-ish way. If you are serious about bootstrapping, you can use the boot package and benefit from more compact code, no explicit loops, and many other advanced options.
In your case it can look like:
## Packages: glm.nb() comes from MASS, boot() from boot
library(MASS)
library(boot)
## Data generation
N <- 100
set.seed(1)
d1 <- data.frame(y=rbinom(N, N, 0.5),
d=rbinom(N, 1, 0.5),
x=rnorm(N, 10, 3))
## Model
out <- glm.nb(y ~ d + offset(log(x)), data=d1)
## Statistic function (what we are bootstrapping)
## Returns difference between D=1 and D=0
diff <- function(x,i,model){
v1 <- v2 <- x[i,]
v1$d <- 1
v2$d <- 0
predict(model,v1,type="response") - predict(model,v2,type="response")
}
## Bootstrapping itself
b <- boot(d1, diff, R=5e3, model=out)
mean(b$t)
Now b$t holds the bootstrapped values. See names(b) and/or ?boot for extra information.
Bootstrapping is a time-consuming operation, and one obvious advantage of the boot package is its support for parallel execution. It's as easy as:
b <- boot(d1, diff, R=5e3, model=out, parallel="multicore", ncpus=2)
If you are on Windows, use parallel="snow" instead.
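Since the question asks for a confidence interval, here is a minimal base-R sketch of a percentile interval. It is not the code above: as an assumption, it swaps in a Poisson glm() for glm.nb() so that it runs without extra packages, it simulates its own data, and it refits the model on every resample before averaging the predicted difference. The helper name first_diff is made up for this illustration.

```r
set.seed(1)
n <- 200
d1 <- data.frame(d = rbinom(n, 1, 0.5), x = runif(n, 5, 15))
# Simulated counts (assumption): the rate is proportional to the exposure x
d1$y <- rpois(n, lambda = d1$x * exp(0.2 + 0.5 * d1$d))

# Statistic: refit on the resampled rows, then average the predicted
# difference between d = 1 and d = 0 over those rows.
first_diff <- function(data) {
  fit <- glm(y ~ d + offset(log(x)), family = poisson, data = data)
  p1 <- predict(fit, transform(data, d = 1), type = "response")
  p0 <- predict(fit, transform(data, d = 0), type = "response")
  mean(p1 - p0)
}

# Resample rows with replacement and recompute the statistic each time
boot_reps <- replicate(500, first_diff(d1[sample(nrow(d1), replace = TRUE), ]))

# Percentile 95% interval from the bootstrap distribution
ci <- quantile(boot_reps, c(0.025, 0.975))
ci
```

With the boot package, the analogous interval would normally come from boot.ci() applied to a statistic that returns a single number.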
So, just a touch of backstory. I've been learning biostatistics in the past 4-5 months in university, 6 months of biomathematics before that. I only started deep diving into programming around 5 days ago.
I've been trying to redo t.test() with my own function.
test2 = function(t,u){
T = (mean(t) - u) / ( sd(t) / sqrt(length(t)))
t1=round(T, digits=5)
df=length(t)
cat(paste('t - value =', t1,
'\n','df =', df-1,
'\n','Alternative hipotézis: a minta átlag nem egyenlő a hipotetikus átlaggal'))
}
I searched for the formula for the p-value and found one, but when I used it, my value differed from the one reported by t.test().
The t-value and the df do match t.test().
I highly appreciate any help, thank you.
P.s: Don't worry about the last line, it's in Hungarian.
The p-value can be derived from the cumulative distribution function of the t distribution, pt(). Using this, and switching to the more common notation of a sample x and a hypothesized population mean mu, we can use something like:
test2 <- function(x, u){
  t_stat <- (mean(x) - u) / (sd(x) / sqrt(length(x)))
  df <- length(x) - 1
  ## two-sided p-value; -abs() keeps it valid when the statistic is negative
  p <- 2 * pt(-abs(t_stat), df=df)
  cat('t-value =', t_stat, ', df =', df, ', p =', p, '\n')
}
set.seed(123) # remove this for other random values
## random sample
x <- rnorm(10, mean=5.5)
## population mean
mu <- 5
## own function
test2(x, mu)
## one sample t-test from R
t.test(x, mu=mu)
Our own test2 gives:
t-value = 1.905175 , df = 9, p = 0.08914715
and R's t.test gives:
One Sample t-test
data: x
t = 1.9052, df = 9, p-value = 0.08915
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
4.892330 6.256922
sample estimates:
mean of x
5.574626
The definitive source for what R is doing is its source code. If you look at stats:::t.test.default (type it into the console without trailing parentheses and press Enter), you'll see that a one-sample test like the one you're attempting runs the following:
nx <- length(x)
mx <- mean(x)
vx <- var(x)
df <- nx - 1
stderr <- sqrt(vx/nx)
tstat <- (mx - mu)/stderr
if (alternative == "less") {
pval <- pt(tstat, df)
}
else if (alternative == "greater") {
pval <- pt(tstat, df, lower.tail = FALSE)
}
else {
pval <- 2 * pt(-abs(tstat), df)
}
These are the relevant pieces (there's a lot more code in there, too).
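As a quick check of the excerpt above, the three branches can be wrapped in a small function and compared against t.test(). This is only a sketch; the function name one_sample_p is made up for the illustration.

```r
# One-sample t-test p-value, mirroring the three branches shown above
one_sample_p <- function(x, mu, alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  df <- length(x) - 1
  tstat <- (mean(x) - mu) / sqrt(var(x) / length(x))
  switch(alternative,
         less      = pt(tstat, df),
         greater   = pt(tstat, df, lower.tail = FALSE),
         two.sided = 2 * pt(-abs(tstat), df))
}

set.seed(123)
x <- rnorm(10, mean = 5.5)

# Compare against t.test() for every alternative
for (alt in c("two.sided", "less", "greater")) {
  print(all.equal(one_sample_p(x, 5, alt),
                  t.test(x, mu = 5, alternative = alt)$p.value))
}
```

Using 2 * pt(-abs(tstat), df) rather than 2 * (1 - pt(tstat, df)) matters when the statistic is negative, where the latter would return a value above 1.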
I am searching for an equivalent function in R of the extremely convenient Stata command simulate. The command basically allows you to declare a program (reg_simulation in the example below) and then invoke such a program from simulate and store desired outputs.
Below is a Stata illustration of the usage of the simulate program, together with my attempt to replicate it using R.
Finally, my main question is: is this how R users would run a Monte Carlo simulation, or am I missing something in terms of structure or speed bottlenecks? Thank you in advance.
Stata example
Defining reg_simulation program.
clear all
*Define "reg_simulation" to be used later on by "simulate" command
program reg_simulation, rclass
*Declaring Stata version
version 13
*Dropping all variables in memory
drop _all
*Set sample size (n=100)
set obs 100
*Simulate model
gen x1 = rnormal()
gen x2 = rnormal()
gen y = 1 + 0.5 * x1 + 1.5 *x2 + rnormal()
*Estimate OLS
reg y x1 x2
*Store coefficients
matrix B = e(b)
return matrix betas = B
end
Calling reg_simulation from simulate command:
*Set seed
set seed 1234
*Run the actual simulation 10 times using "reg_simulation"
simulate , reps(10) nodots: reg_simulation
Obtained result (data stored in memory)
_b_x1 _b_x2 _b_cons
.4470155 1.50748 1.043514
.4235979 1.60144 1.048863
.5006762 1.362679 .8828927
.5319981 1.494726 1.103693
.4926634 1.476443 .8611253
.5920001 1.557737 .8391003
.5893909 1.384571 1.312495
.4721891 1.37305 1.017576
.7109139 1.47294 1.055216
.4197589 1.442816 .9404677
R replication of the Stata program above.
Using R, I have managed to get the following (I'm not an R expert, though). The part that worries me most is the for loop that iterates over the number of repetitions, nreps.
Defining reg_simulation function.
#Defining a function
reg_simulation<- function(obs = 1000){
data <- data.frame(
#Generate data
x1 <-rnorm(obs, 0 , 1) ,
x2 <-rnorm(obs, 0 , 1) ,
y <- 1 + 0.5* x1 + 1.5 * x2 + rnorm(obs, 0 , 1) )
#Estimate OLS
ols <- lm(y ~ x1 + x2, data=data)
return(ols$coefficients)
}
Calling reg_simulation 10 times using a for-loop structure:
#Generate list to store results from simulation
results_list <- list()
# N repetitions
nreps <- 10
for (i in 1:nreps) {
#Set seed internally (to get different values in each run)
set.seed(i)
#Save results into list
results_list[i] <- list(reg_simulation(obs=1000))
}
#unlist results
df_results<- data.frame(t(sapply(results_list,
function(x) x[1:max(lengths(results_list))])))
Obtained result: df_results.
#final results
df_results
# X.Intercept. x1 x2
# 1 1.0162384 0.5490488 1.522017
# 2 1.0663263 0.4989537 1.496758
# 3 0.9862365 0.5144083 1.462388
# 4 1.0137042 0.4767466 1.551139
# 5 0.9996164 0.5020535 1.489724
# 6 1.0351182 0.4372447 1.444495
# 7 0.9975050 0.4809259 1.525741
# 8 1.0286192 0.5253288 1.491966
# 9 1.0107962 0.4659812 1.505793
# 10 0.9765663 0.5317318 1.501162
You're on the right track. A couple of hints/corrections:
1. Don't use <- inside data.frame()
In R, we construct data frames using = for internal column assignment, i.e. data.frame(x = 1:10, y = 11:20) rather than data.frame(x <- 1:10, y <- 11:20).
(There's more to be said about <- vs =, but I don't want to distract from your main question.)
In your case, you don't even need to create a data frame: x1, x2 and y are local variables inside the function, and when no data argument is supplied, lm() looks up the formula's variables in the environment where the formula was created. I'll post some code at the end of my answer demonstrating this.
2. When filling a list via a for loop in R, pre-allocate the list first
Always try to pre-allocate the list length and type if you are going to fill it in a (long) for loop. That way, R can allocate the memory once instead of repeatedly copying and extending the object as it grows. With only 10 reps, that would mean starting with something like:
results_list <- vector("list", 10)
3. Consider using lapply instead of for
for loops have a bit of a bad rep in R. (Somewhat unfairly, but that's a story for another day.) An alternative that many R users would consider is the functional programming approach offered by lapply. I'll hold off on showing you the code for a second, but it will look very similar to a for loop. Just to note quickly, following on from point 2, one immediate benefit is that you don't need to pre-allocate the list with lapply.
4. Run large loops in parallel
A Monte Carlo simulation is an ideal candidate for running everything in parallel, since each iteration is supposed to be independent of the others. An easy way to go parallel in R is via the future.apply package.
Putting everything together, here's how I'd probably do your simulation. Note that this might be more "advanced" than you possibly need, but since I'm here...
library(data.table) ## optional, but what I'll use to coerce the list into a DT
library(future.apply) ## for parallel stuff
plan(multisession) ## use all available cores
obs <- 1e3
# Defining a function
reg_simulation <- function(...){
x1 <- rnorm(obs, 0 , 1)
x2 <- rnorm(obs, 0 , 1)
y <- 1 + 0.5* x1 + 1.5 * x2 + rnorm(obs, 0 , 1)
#Estimate OLS
ols <- lm(y ~ x1 + x2)
# return(ols$coefficients)
return(as.data.frame(t(ols$coefficients)))
}
# N repetitions
nreps <- 10
## Serial version
# results <- lapply(1:nreps, reg_simulation)
## Parallel version
results <- future_lapply(1:nreps, reg_simulation, future.seed = 1234L)
## Unlist / convert into a data.table
results <- rbindlist(results)
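For comparison, hints 2 and 3 can also be sketched in base R only, without data.table or future.apply. This is a serial illustration under the same simulated model; the sample size of 100 and the object names loop_results and apply_results are choices made for this sketch.

```r
reg_simulation <- function(obs = 100) {
  x1 <- rnorm(obs)
  x2 <- rnorm(obs)
  y  <- 1 + 0.5 * x1 + 1.5 * x2 + rnorm(obs)
  coef(lm(y ~ x1 + x2))   # named vector: (Intercept), x1, x2
}

nreps <- 10

## Hint 2: pre-allocated list filled by a for loop
set.seed(1234)
results_list <- vector("list", nreps)
for (i in seq_len(nreps)) {
  results_list[[i]] <- reg_simulation()
}
loop_results <- do.call(rbind, results_list)

## Hint 3: the same simulation via vapply, no pre-allocation needed
set.seed(1234)
apply_results <- t(vapply(seq_len(nreps), function(i) reg_simulation(), numeric(3)))

## Same seed, same sequence of draws, so both give the same coefficients
all.equal(loop_results, apply_results, check.attributes = FALSE)
```

Both versions return a 10 x 3 matrix of coefficients; which one to prefer is mostly a matter of taste at this size.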
So, following up on the comments, you want to vary your independent variables (x) and also the error term and simulate the coefficients, but you also want to catch errors if any occur. The following would do the trick:
set.seed(42)
#Defining a function
reg_simulation<- function(obs = 1000){
data <- data.frame(
#Generate data (use = rather than <- so the columns are actually named x1 and x2)
x1 = rnorm(obs, 0 , 1) ,
x2 = rnorm(obs, 0 , 1) )
data$y <- 1 + 0.5* data$x1 + 1.5 * data$x2 + rnorm(obs, 0 , 1)
#Estimate OLS
tryCatch(
{
ols <- lm(y ~ x1 + x2, data=data)
return(ols$coefficients)
},
error = function(e){
return(c('(Intercept)'=NA, 'x1'=NA, 'x2'=NA))
}
)
}
output <- t(data.frame(replicate(10, reg_simulation())))
output
(Intercept) x1 x2
X1 0.9961328 0.4782010 1.481712
X2 1.0234698 0.4801982 1.556393
X3 1.0336289 0.5239380 1.435468
X4 0.9796523 0.5095907 1.493548
...
Here, tryCatch (see also plyr::failwith) catches the error and returns NA for each coefficient as the default value.
Note that you only need to set the seed once, because the generator's state advances deterministically with every call to the random number generator.
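To see the error branch actually fire, here is a small sketch that feeds lm() a data set it cannot fit. The helper name safe_coefs and the all-NA response used to provoke the error are assumptions made for this illustration.

```r
# Return the OLS coefficients, or NAs with the same names if lm() fails
safe_coefs <- function(data) {
  tryCatch(
    coef(lm(y ~ x1 + x2, data = data)),
    error = function(e) c("(Intercept)" = NA_real_, x1 = NA_real_, x2 = NA_real_)
  )
}

set.seed(42)
good <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
good$y <- 1 + 0.5 * good$x1 + 1.5 * good$x2 + rnorm(50)

# An all-NA response leaves lm() with no usable rows, which raises an error
bad <- transform(good, y = NA_real_)

safe_coefs(good)  # three estimated coefficients
safe_coefs(bad)   # three NAs instead of an aborted simulation
```

Because both branches return a vector with the same names, the results of many replications can still be bound into one matrix.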
This began as a question, but after reading the reference provided in the answer below, as well as the source code, the solution became clear. In case anyone else finds themselves in this position:
The orcutt package (version 2.2) uses a special procedure to calculate the DW statistic for its CO models. The MWE below uses the package's own example to show that the reported Durbin-Watson statistic is not based on the residuals returned by the CO estimation.
library(orcutt)
data(icecream, package="orcutt")
dw_calc = function(x){sum((x[2:length(x)] - x[1:(length(x)-1)]) ^ 2) / sum(x ^ 2)}
lm = lm(cons ~ price + income + temp, data=icecream)
e = lm$residuals
dw_calc(e)
# 1.02117 <- Durbin-Watson Statistic
coch = cochrane.orcutt(lm)
e = coch$residuals
dw_calc(e)
# 1.006431 <- Durbin-Watson Statistic
coch
# Durbin-Watson statistic
# (original): 1.02117 , p-value: 3.024e-04
# (transformed): 1.54884 , p-value: 5.061e-02
The orcutt package reports 1.54884, but the DW statistic computed from the returned residuals is 1.006431. The reported value, 1.54884, comes from the last round of the convergence procedure (see Hildreth-Lu). The body of the package's function, run step by step, shows where it comes from:
lm = lm(cons ~ price + income + temp, data=icecream)
reg = lm(cons ~ price + income + temp, data=icecream)
convergence = 8
X <- model.matrix(reg)
Y <- model.response(model.frame(reg))
n<-length(Y)
e<-reg$residuals
e2<-e[-1]
e3<-e[-n]
regP<-lm(e2~e3-1)
rho<-summary(regP)$coeff[1]
rho2<-c(rho)
XB<-X[-1,]-rho*X[-n,]
YB<-Y[-1]-rho*Y[-n]
regCO<-lm(YB~XB-1)
ypCO<-regCO$coeff[1]+as.matrix(X[,-1])%*%regCO$coeff[-1]
e1<-ypCO-Y
e2<-e1[-1]
e3<-e1[-n]
regP<-lm(e2~e3-1)
rho<-summary(regP)$coeff[1]
rho2[2]<-rho
i<-2
while (round(rho2[i-1],convergence)!=round(rho2[i],convergence)){
XB<-X[-1,]-rho*X[-n,]
YB<-Y[-1]-rho*Y[-n]
regCO<-lm(YB~XB-1)
ypCO<-regCO$coeff[1]+as.matrix(X[,-1])%*%regCO$coeff[-1]
e1<-ypCO-Y
e2<-e1[-1]
e3<-e1[-n]
regP<-lm(e2~e3-1)
rho<-summary(regP)$coeff[1]
i<-i+1
rho2[i]<-rho
}
regCO$number.interaction<-i-1
regCO$rho <- rho2[i-1]
regCO$DW <- c(lmtest::dwtest(reg)$statistic, lmtest::dwtest(reg)$p.value,
lmtest::dwtest(regCO)$statistic, lmtest::dwtest(regCO)$p.value)
dw_calc(regCO$residuals)
regF<-lm(YB ~ 1)
tF <- anova(regCO,regF)
regCO$Fs <- c(tF$F[2],tF$`Pr(>F)`[2])
# fitted.value
regCO$fitted.values <- model.matrix(reg) %*% (as.matrix(regCO$coeff))
# coeff
names(regCO$coefficients) <- colnames(X)
# st.err
regCO$std.error <- summary(regCO)$coeff[,2]
# t value
regCO$t.value <- summary(regCO)$coeff[,3]
# p value
regCO$p.value <- summary(regCO)$coeff[,4]
class(regCO) <- "orcutt"
# formula
regCO$call <- reg$call
# F statistics and p value
df1 <- dim(model.frame(reg))[2] - 1
df2 <- length(regCO$residuals) - df1 - 1
RSS <- sum((regCO$residuals)^2)
TSS <- sum((regCO$model[1] - mean(regCO$model[,1]))^2)
regCO$rse <- sqrt(RSS/df2)
regCO$r.squared <- 1 - (RSS/TSS)
regCO$adj.r.squared <- 1 - ((RSS/df2)/(TSS/(df1 + df2)))
regCO$gdl <- c(df1, df2)
#
regCO$rank <- df1
regCO$df.residual <- df2
regCO$assign <- regCO$assign[-(df1+1)]
regCO$residuals <- Y - regCO$fitted.values
regCO
I don't know this area well, but it seems much more likely that there are multiple definitions/estimation methods for this statistic. Going back to the book cited in ?orcutt (Verbeek M. (2004) A guide to modern econometrics, John Wiley & Sons Ltd, ISBN:978-88-08-17054-5), and searching on Google books gives
The value shown as computed by orcutt in your example above agrees with the value given in the book. Earlier in the book (p. 108) it says
In [the Cochrane-Orcutt procedure], $\rho$ and $\beta$ are recursively estimated until convergence, i.e. having estimated $\beta$ by EGLS (by $\beta^*$), the residuals are recomputed and $\rho$ is estimated again using the residuals from the EGLS step. With this new estimate of $\rho$, EGLS is applied again and one obtains a new estimate of $\beta$ ...
In other words, it seems as though the estimate of $\rho$ that you give above corresponds only to the first step of the Cochrane-Orcutt procedure.
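One way to see how two different DW values can come out of the same fit is to compute the statistic on both sets of residuals. The base-R sketch below makes several assumptions: it uses simulated AR(1) errors rather than the icecream data, and it performs a single quasi-differencing step rather than iterating to convergence. It is not the orcutt package's implementation.

```r
dw_calc <- function(e) sum(diff(e)^2) / sum(e^2)

set.seed(1)
n <- 200
x <- rnorm(n)
u <- as.numeric(arima.sim(list(ar = 0.6), n = n))  # AR(1) errors, rho = 0.6
y <- 1 + 2 * x + u

fit <- lm(y ~ x)
e   <- unname(resid(fit))

# Estimate rho by regressing e_t on e_{t-1} without an intercept
rho <- coef(lm(e[-1] ~ e[-n] - 1))[[1]]

# One quasi-differencing step: regress y_t - rho*y_{t-1} on X_t - rho*X_{t-1}
X  <- model.matrix(fit)
XB <- X[-1, ] - rho * X[-n, ]
YB <- y[-1] - rho * y[-n]
fit_t <- lm(YB ~ XB - 1)

dw_original      <- dw_calc(e)                                # OLS residuals
dw_transformed   <- dw_calc(resid(fit_t))                     # residuals of the transformed regression
dw_untransformed <- dw_calc(as.numeric(y - X %*% coef(fit_t)))  # new coefficients, original scale

c(original = dw_original, transformed = dw_transformed,
  untransformed = dw_untransformed)
```

The transformed-regression residuals should give a statistic much closer to 2, while the residuals computed on the original scale stay autocorrelated, which matches the pattern of the two numbers in the question.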
I'm working on an assignment for my Machine Learning course, and as part of it I'm trying to implement a neural network. Since it's for school, I have to implement the algorithm manually, and not use any of the neuralnet packages available.
I've been using the material in "Learning from Data" along with the Caltech lectures that accompany it on YouTube.
I've put together the algorithm in R to the best of my ability, but there's something going wrong along the way. I haven't been able to implement the difference in the cost function as a measure for when the last iteration should be, so for now I've just fixed the number of iterations as a constant.
** Edit **
Hey guys. Thanks for the response. I can see I'm missing a lot of needed information. Sorry about that, don't really know what I was thinking.
The data I'm using is simply "toy data" generated from the sinc function sinc(x)=sin(x)/x.
The problem I'm having specifically is that the estimates that I get at the end of the algorithm are completely off from the real values, and they are significantly different every time I run the algorithm. It seems like I've put the algorithm together the way the book states, but I can't see where the problem is.
Edit 2
Added the data to the code so it can be run without doing anything extra. I also separated the individual parts of the function. As I mentioned in a comment, I was able to numerically verify the partial derivatives, so I think that part is OK. The problem I have is when I need to update the weights in order to train the network.
It's not in this part of the code, but I thought that to update the weights, you simply take the old weight and subtract its partial derivative scaled by the learning rate? (wNew = wOld - eta*djdwOld)
theta <- function(a){
a / (1+abs(a)) # Softsign, a sigmoid-shaped function, serves as our
# non-linearity.
}
theta.prime <- function(a){
1 / (1+abs(a))^2
}
x <- c( 5.949110, -1.036600, 3.256780, 7.824520, -3.606010, 3.115640, -7.786960,
-7.598090, 2.083880, 3.983000, 8.060120, 7.879760, -2.456670,
-2.152720, 3.471950, 3.567960, -4.232630, 6.831610, -9.486860, 8.692330,
-1.551860, 0.917305, 4.669480, -7.760430, 2.835410)
y <- c(-0.10804400, 0.78264000, -0.05313330, 0.13484700, -0.05522470, -0.05758530,
0.19566100, 0.13846000, 0.43534100, -0.16861400, 0.10625000,
0.08427310, 0.27012900, 0.44004800, -0.00880575, -0.10711400, -0.18671100,
0.01158470, 0.02767190, 0.06319830, 0.61802000, 0.87124300,
-0.25668100, 0.06160800, 0.10575700)
inputlayer <- 1
outputlayer <- 1
hiddenlayer <- 2
w1 <- t(matrix(rnorm(hiddenlayer,0,.01),hiddenlayer,inputlayer))
w2 <- matrix(rnorm(hiddenlayer,0,.01),hiddenlayer,outputlayer)
### Forwardprop ###
forward <- function(x,w1,w2,theta){
s2 <- x%*%w1
a2 <- apply(s2,c(1,2),theta)
s3 <- a2%*%w2
yhat <- apply(s3,c(1,2),theta)
return(yhat)
}
### Forward propagation manually ###
s2 <- x%*%w1
a2 <- apply(s2,c(1,2),theta)
s3 <- a2%*%w2
yhat <- apply(s3,c(1,2),theta)
### Error function ###
#yhat <- forward(x,w1,w2,theta)
E <- sum((y-yhat)^2)/(length(x))
### Backward Propagation ###
delta3 <- (-2*(y-yhat)) * apply(s3,c(1,2),theta.prime)
djdw2 <- t(a2) %*% delta3
delta2 <- delta3 %*% t(w2) * apply(s2,c(1,2),theta.prime)
djdw1 <- t(x)%*%delta2
### Numerically estimated gradients ###
e <- 1e-8
numgrad1 <- matrix(0,1,2)
eps <- matrix(0,1,2)
w1e <- matrix(0,1,2)
for(j in 1:2) {
eps[1,j] <- e
w1e <- w1 + eps
loss2 <- sum((y-forward(x,w1e,w2,theta))^2)
w1e <- w1
loss1 <- sum((y-forward(x,w1e,w2,theta))^2)
numgrad1[1,j] <- (loss2 - loss1)/(e)
eps[1,j] <- 0
}
numgrad2 <- matrix(0,2,1)
eps <- matrix(0,2,1)
w2e <- matrix(0,2,1)
for(j in 1:2) {
eps[j,1] <- e
w2e <- w2 + eps
loss2 <- sum((y-forward(x,w1,w2e,theta))^2)
w2e <- w2
loss1 <- sum((y-forward(x,w1,w2e,theta))^2)
numgrad2[j,1] <- (loss2 - loss1)/(e)
eps[j,1] <- 0
}
# Comparison of our gradients from backpropagation
# and numerical estimation.
c(djdw1,djdw2)
c(numgrad1,numgrad2)
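The update rule mentioned above (wNew = wOld - eta*djdwOld) can be sketched as a plain gradient-descent loop around the same forward and backward passes. Everything specific here is an assumption for illustration: the toy data are re-simulated from sinc, the loss is the mean squared error, the initial weights use a larger standard deviation (0.5), and eta and the iteration count are arbitrary. The helper name loss_and_grads is made up.

```r
theta       <- function(a) a / (1 + abs(a))      # softsign activation
theta_prime <- function(a) 1 / (1 + abs(a))^2    # its derivative

# Toy data (assumption): sinc(x) sampled at 25 random points
set.seed(1)
x <- matrix(runif(25, -10, 10), ncol = 1)
y <- sin(x) / x

# Forward pass, mean squared error, and backpropagated gradients
loss_and_grads <- function(w1, w2) {
  s2 <- x %*% w1;  a2   <- theta(s2)
  s3 <- a2 %*% w2; yhat <- theta(s3)
  delta3 <- -2 * (y - yhat) * theta_prime(s3) / nrow(x)
  delta2 <- (delta3 %*% t(w2)) * theta_prime(s2)
  list(loss  = mean((y - yhat)^2),
       djdw1 = t(x)  %*% delta2,    # same shape as w1 (1 x 2)
       djdw2 = t(a2) %*% delta3)    # same shape as w2 (2 x 1)
}

w1 <- matrix(rnorm(2, 0, 0.5), 1, 2)
w2 <- matrix(rnorm(2, 0, 0.5), 2, 1)
eta <- 0.001

initial_loss <- loss_and_grads(w1, w2)$loss
for (iter in 1:5000) {
  g  <- loss_and_grads(w1, w2)
  w1 <- w1 - eta * g$djdw1   # wNew = wOld - eta * dJ/dw
  w2 <- w2 - eta * g$djdw2
}
final_loss <- loss_and_grads(w1, w2)$loss
c(initial = initial_loss, final = final_loss)
```

Note that this network has no bias terms, so its output is an odd function of x while sinc is even; a poor fit is therefore to be expected regardless of how the weights are updated.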
I have solved what I want to get out of my code, but I'm looking for a cleaner way to get this result. Are there any built-in functions that I don't know about?
We have 2 correlated variables and a lot of binary factors (around 200),
here illustrated with just f1 and f2:
x <- rnorm(100)
y <- rnorm(100)
f1 <- rbinom(100, 1, 0.5)
f2 <- rbinom(100, 1, 0.5)
# which gives the possible groups:
group <- rep(NA, 100)
group[which(f1 & f2)] <- "A"
group[which(!f1 & f2)] <- "B"
group[which(f1 & !f2)] <- "C"
group[which(!f1 & !f2)] <- "D"
df <- data.frame(group,y,x,f1,f2)
We run a model selection, adding and removing terms and interactions, and end up
with a model. Here, say both f1 and f2 and their interactions with x
came out as predictors:
m <- glm(y ~ x * f1 + x * f2)
Then my aim is to produce a simple linear model for each group, i.e.:
y = a * x + b
# The possible groups:
groups <- data.frame(groups = c("A", "B", "C", "D"), f1=c(1,0,1,0), f2=c(1,1,0,0))
interactions <- grep(":", attr(m$terms, "term.labels"))
factors <- attr(m$terms, "term.labels")[-c(1,interactions)]
interaction.terms <- substring(attr(m$terms, "term.labels")[interactions], 3)
functions <- data.frame(groups$groups, intercept=NA, slope=NA)
for(i in seq(along=groups$groups)) {
intercept <- coef(m)["(Intercept)"] + sum(groups[i, factors]*coef(m)[factors])
slope <- coef(m)["x"] + sum(groups[i, interaction.terms]*coef(m)[paste("x:", interaction.terms, sep="")])
functions[i, "intercept"] <- intercept
functions[i, "slope"] <- slope
}
Which gives an output like this:
> functions
groups.groups intercept slope
1 A -0.10932806 -0.07468630
2 B -0.37755949 -0.17769345
3 C 0.23635139 0.18406047
4 D -0.03188004 0.08105332
The output is correct and is what I want, so that is fine. I just think this method is quite a complicated mess, and I can't seem to find a cleaner way of getting these functions out.
I would probably recommend using predict() for this. The intercept is just the predicted value at x=0, and the slope is the difference between the predicted values at x=1 and x=0. So you can do
int <- predict(m, cbind(groups,x=0))
t1 <- predict(m, cbind(groups,x=1))
data.frame(group=groups$groups, int=int, slope=t1-int)
You didn't set a seed for your example, so your exact results aren't reproducible, but if you run set.seed(15) before generating the sample, you should get
group int slope
1 A -0.08372785 -0.16037708
2 B -0.03904330 0.14322623
3 C 0.16455660 -0.02951151
4 D 0.20924114 0.27409179
with both methods.
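A small self-contained check that the predict() route and the coefficient arithmetic agree is sketched below. It re-simulates data in the same way as the question; the variable names int_manual and slope_manual are made up for the illustration.

```r
set.seed(15)
x  <- rnorm(100)
y  <- rnorm(100)
f1 <- rbinom(100, 1, 0.5)
f2 <- rbinom(100, 1, 0.5)
m  <- glm(y ~ x * f1 + x * f2)

groups <- data.frame(groups = c("A", "B", "C", "D"),
                     f1 = c(1, 0, 1, 0), f2 = c(1, 1, 0, 0))

# predict() route: intercept = prediction at x = 0, slope = change from x = 0 to x = 1
int   <- predict(m, cbind(groups, x = 0))
slope <- predict(m, cbind(groups, x = 1)) - int

# Manual route: add up the relevant coefficients for each group
b <- coef(m)
int_manual   <- b[["(Intercept)"]] + groups$f1 * b[["f1"]] + groups$f2 * b[["f2"]]
slope_manual <- b[["x"]] + groups$f1 * b[["x:f1"]] + groups$f2 * b[["x:f2"]]

all.equal(unname(int), int_manual)
all.equal(unname(slope), slope_manual)
```

The predict() version has the advantage that it keeps working unchanged when terms are added to or removed from the model, as long as the groups data frame lists every factor the model uses.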