Bootstrapping regression coefficients from random subsets of data - r

I’m attempting to perform a regression calibration on two variables using the york() function in the IsoplotR package. I would like to estimate a confidence interval for the bootstrapped slope coefficient from this model; however, instead of using the typical bootstrap method below, I’d like to perform each iteration on only 75% of the data (randomly selected). So far, using the following sample data, I managed to bootstrap the slope coefficient returned by york():
library(boot)
library(IsoplotR)

X <- c(9.105, 8.987, 8.974, 8.994, 8.996, 8.966, 9.035, 9.215, 9.239,
       9.307, 9.227, 9.17, 9.102)
Y <- c(28.1, 28.9, 29.6, 29.5, 29.0, 28.8, 28.5, 27.3, 27.1, 26.5,
       27.0, 27.5, 28.4)
n <- length(X)
sX <- X * 0.02
sY <- Y * 0.05
rXY <- rep(0.8, n)
dat <- cbind(X, sX, Y, sY, rXY)
fit <- york(dat)

boot.test <- function(data, indices){
  sample <- data[indices, ]
  mod <- york(sample)
  return(mod$b)
}
result <- boot(data = dat, statistic = boot.test, R = 1000)
boot.ci(result, type = 'bca')
...but I'm not really sure where to go from here. Any help to move me in the right direction would be greatly appreciated. I’m new to R, so I apologize if the question is ambiguous. Thanks.

Based on the package documentation, you should be able to use the ran.gen argument, with sim = "parametric", to sample using a custom function; here, the sample is a random subset containing a given percentage of the observations. Two details from help("boot") matter: under sim = "parametric" the statistic is called on the generated data alone (no indices argument), and the second argument of ran.gen receives whatever is passed as mle, which is how the subsampling percentage gets through. Something like the following should accomplish what you want:
# Under sim = "parametric" the statistic receives only the generated data
boot.test.para <- function(data){
  mod <- york(data)
  return(mod$b)
}
result <- boot(
  data = dat,
  statistic = boot.test.para,
  R = 1000,
  sim = "parametric",
  ran.gen = function(data, percent){
    # 'percent' is filled in by boot's mle argument below
    n <- nrow(data)
    indic <- runif(n)
    data[rank(indic, ties.method = "random") <= round(n * percent, 0), ]
  },
  mle = 0.75)
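One caveat: BCa intervals need empirical influence values, which cannot be computed from a parametric simulation, so boot.ci(result, type = "bca") will fail on this result; a percentile interval is the simple substitute:

boot.ci(result, type = "perc")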

Related

How to conduct parametric bootstrapping in R?

I am working with the orings data set in the faraway package in R. I have written the following grouped binomial model:
library(faraway)  # provides the orings data

orings_model <- glm(cbind(damage, 6 - damage) ~ temp, family = binomial, data = orings)
summary(orings_model)
I then constructed the Chi-Square test statistic and calculated the p-value:
pchisq(orings_model$null.deviance, orings_model$df.null, lower = FALSE)
First, I would like to generate data under the null distribution for this test statistic using rbinom with the average proportion of damaged o-rings (i.e., the variable "damage"), and then recompute the above test statistic with this new data. I am not sure how to do this.
Second, I want to repeat the process above 1000 times, saving the test statistic each time. I am also not sure how to do this. My inclination is to use a for loop, but I am not sure how to set it up. Any help would be really appreciated!
It is not completely clear what you're looking to do here, but we can at least show some quick principles of how we can achieve this, and then hopefully you can get to your goal.
1) Simulating the null model
It is not entirely clear that you actually want to simulate the null model here; it seems more like you're interested in simulating from the fitted model. Note that the null model is the model with form cbind(damage, 6 - damage) ~ 1, and the null deviance and df come from that model. Either way, we can simulate data from a fitted model using the simulate function in base R:
sims <- simulate(orings_model, 1000)
If you want to go the manual way, estimate the mean vector of your model and use it for the probabilities in your call to rbinom:
nsim <- 1000 * nrow(orings)
probs <- predict(orings_model, type = 'response')
sims_man <- matrix(rbinom(nsim, 6, probs),
                   ncol = 1000)
# Check the simulations against the fitted probabilities
# (counts are out of 6 trials, so divide by 6):
# rowMeans(sims_man) / 6 - probs   # should be close to zero
In the first version we get a data frame with 1000 columns, each holding an n x 2 matrix (damage vs. no damage). In the latter we simulate only the damage counts.
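To peek at the two layouts (the orings data has 23 rows, so n = 23 here):

head(sims[[1]])  # a 23 x 2 matrix: cbind(damage, 6 - damage)
dim(sims_man)    # 23 x 1000: one column of damage counts per simulated data set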
2) Perform the bootstrapping
You could do this manually with the data above.
# Data from simulate
statfun <- function(x){
  data <- orings_model$data
  data$damage <- if(length(dim(x)) > 1)
    x[, 1]
  else
    x
  newmod <- update(orings_model, data = data)
  pchisq(newmod$null.deviance, newmod$df.null, lower = FALSE)
}
sapply(sims, statfun)

# Data from the manual method
apply(sims_man, 2, statfun)
Alternatively, one could take a bit of time with the boot function, which provides a standardized way to perform the bootstrap:
library(boot)

# See help("boot")
ran_gen <- function(data, mle){
  data$damage <- simulate(orings_model)[[1]][, 1]
  data
}
boot_metric <- function(data, w){
  model <- glm(cbind(damage = damage, not_damage = 6 - damage) ~ temp,
               family = binomial, data = data)
  pchisq(model$null.deviance,
         model$df.null,
         lower = FALSE)
}
boots <- boot(orings, boot_metric,
              R = 1000,
              sim = 'parametric',
              ran.gen = ran_gen,
              mle = pchisq(orings_model$null.deviance,
                           orings_model$df.null,
                           lower = FALSE))
At this point we have the simulated statistics in boots$t and the observed statistic in boots$t0, so a simple Monte Carlo p-value can be estimated as sum(boots$t > boots$t0) / boots$R (R being the number of replicates).
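For a quick look at where the observed statistic falls within the simulated distribution:

p_boot <- sum(boots$t > boots$t0) / boots$R
hist(boots$t, breaks = 30, col = "gray")
abline(v = boots$t0, lwd = 2)
p_boot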

Fitting a confidence interval to dlmForecast in R

I've fit a Dynamic Linear Model to some data using dlmFilter in R (from the dlm package). From said filter I have predicted 7 steps ahead using the dlmForecast function. The predicted outcome is very good, but I would like to add a 95% confidence interval and, after a lot of testing, have struggled to do so.
I've mocked up some similar code, below:
library(dlm)

data <- c(20.68502, 17.28549, 12.18363, 13.53479, 15.38779, 16.14770,
          20.17536, 43.39321, 42.91027, 49.41402, 59.22262, 55.42043)
mod.build <- function(par) {
  dlmModPoly(1, dV = exp(par[1]), dW = exp(par[2]))
}
# Maximum likelihood estimates of the (log) variance parameters
mle <- dlmMLE(data, rep(0, 2), mod.build)
if (mle$convergence == 0) print("converged") else print("did not converge")

mod1 <- mod.build(mle$par)
mod1Filt <- dlmFilter(data, mod1)
fut1 <- dlmForecast(mod1Filt, nAhead = 7)
The forecast outcome appears to be very good (although the model over-fits the data to some extent, given the small number of observations). However, I would like to add a 95% confidence interval and have struggled to figure out how to do so.
Any advice would be appreciated.
Cheers
The forecast variances in fut1$Q provide what you need; the code below adds z times the forecast standard errors to the forecasts, giving the upper limits of the 95% interval:

hwidth <- outer(sapply(fut1$Q, function(x) sqrt(diag(x))),
                qnorm(0.025, lower.tail = FALSE)) +
  as.vector(t(fut1$f))
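A fuller sketch of both limits, assuming a univariate series (so each element of fut1$Q is a 1 x 1 variance matrix):

se <- sqrt(unlist(fut1$Q))             # forecast standard errors
z <- qnorm(0.025, lower.tail = FALSE)  # about 1.96 for a 95% interval
lower <- as.vector(fut1$f) - z * se
upper <- as.vector(fut1$f) + z * se
cbind(forecast = as.vector(fut1$f), lower, upper)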

R: Differences in MLE estimation of t distribution

I am currently working on a group project estimating VaR and ES for a series of returns. One of the tasks is to estimate the degrees of freedom of a t distribution.
I am using the following approach:
make_loglik <- function(x){
  Vectorize(function(nu) sum(dt(x, df = nu, log = TRUE)))
}

t_nu_mle <- function(x) {
  loglik <- make_loglik(x)
  optimize(loglik, interval = c(0.01, 2000), maximum = TRUE)$maximum
}
Which yields for example:
set.seed(123)
t <- rt(20000,df=3)
t_nu_mle(t)
[1] 2.986434
However, on the empirical data given in the assignment, I get a different value for the df when using, for example, the MASS package:
library(MASS)
fitdistr(t, "t")
Additionally, my other group members obtain different results when using the actual data set.
Is my function above flawed? What could be the issue here?
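One thing to check: fitdistr(x, "t") estimates a location m and a scale s along with df, while t_nu_mle above fixes m = 0 and s = 1, so the two routines maximize different likelihoods. A like-for-like comparison (a sketch; the starting value df = 5 is an arbitrary choice) holds m and s fixed in fitdistr:

library(MASS)
# m and s are fixed so that only df is estimated, matching t_nu_mle
fitdistr(t, "t", start = list(df = 5), m = 0, s = 1)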

How to use previous observations to forecast the next period using for loops in R?

I have generated 1000 observations from x_t = γ_1 x_{t-1} + γ_2 x_{t-2} + ε_t, an AR(2) process.
What I would like to do is to use the first 900 observations to estimate the model, and use the remaining 100 observations to predict one-step ahead.
This is what I have done so far:
data2 <- arima.sim(n = 1000, list(ar = c(0.5, -0.7)))  # simulate 1000 observations from an AR(2)
arima(data2, order = c(2, 0, 0), method = "ML")        # ML estimates of the model parameters
fit2 <- arima(data2[1:900], c(2, 0, 0), method = "ML") # first 900 observations used to estimate the model
predict(fit2, n.ahead = 100)
The problem with this code is that it uses n.ahead = 100, whereas I would like n.ahead = 1 and 100 one-step predictions in total.
I think I need a for loop for this, but since I am a very new R user I haven't been able to figure out how to set one up to make the predictions. Can anyone help me with this?
If I've understood you correctly, you want one-step predictions on the test set. This should do what you want without loops:
library(forecast)
data2 <- arima.sim(n=1000, list(ar=c(0.5, -0.7)))
fit2 <- Arima(data2[1:900], c(2,0,0), method="ML")
fit2a <- Arima(data2[901:1000], model=fit2)
fc <- fitted(fit2a)
The Arima command allows a model to be applied to a new data set without the parameters being re-estimated. Then fitted gives one-step in-sample forecasts.
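As a quick check, the accuracy function in forecast can score these one-step forecasts against the held-out observations:

accuracy(fc, data2[901:1000])  # test-set RMSE, MAE, etc.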
If you want multi-step forecasts on the test data, you will need to use a loop. Here is an example for two-step ahead forecasts:
fcloop <- numeric(100)
h <- 2
for(i in 1:100)
{
  fit2a <- Arima(data2[1:(899 + i)], model = fit2)
  fcloop[i] <- forecast(fit2a, h = h)$mean[h]
}
If you set h <- 1 above you will get almost the same results as using fitted in the previous block of code. The first two values will be different because the approach using fitted does not take account of the data at the end of the training set, while the approach using the loop uses the end of the training set when making the forecasts.
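To see that concretely, a small sketch re-running the loop with h set to 1 and comparing against the fitted() approach:

h <- 1
fcloop1 <- numeric(100)
for(i in 1:100)
{
  fit2a <- Arima(data2[1:(899 + i)], model = fit2)
  fcloop1[i] <- forecast(fit2a, h = 1)$mean[1]
}
# Differences vanish after the first couple of points, where fitted()
# cannot see the end of the training set:
head(fcloop1 - fc)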

Using anova() on gamma distributions gives seemingly random p-values

I am trying to determine whether there is a significant difference between two gamma distributions. One distribution has (shape, scale) = (shapeRef, scaleRef) while the other has (shape, scale) = (shapeTarget, scaleTarget). I try to do an analysis of variance with the following code:
library(mgcv)  # for gam()

n <- 10000
x <- rgamma(n, shape = shapeRef, scale = scaleRef)
y <- rgamma(n, shape = shapeTarget, scale = scaleTarget)
glmm1 <- gam(y ~ x, family = Gamma(link = log))
anova(glmm1)
The resulting p-values keep changing and can be anywhere from <0.1 to >0.9.
Am I going about this the wrong way?
Edit: I now use the following code instead:
f <- gl(2, n)
x <- rgamma(n, shape = shapeRef, scale = scaleRef)
y <- rgamma(n, shape = shapeTarget, scale = scaleTarget)
xy <- c(x, y)
anova(glm(xy ~ f, family = Gamma(link = log)), test = "F")
But, every time I run it I get a different p-value.
You will indeed get a different p-value every time you run this, if you pick different realizations every time. Just like your data values are random variables, which you'd expect to vary each time you ran an experiment, so is the p-value. If the null hypothesis is true (which was the case in your initial attempts), then the p-values will be uniformly distributed between 0 and 1.
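A quick sketch of that point, drawing both groups from the same gamma distribution so that the null really holds:

set.seed(42)
p_null <- replicate(500, {
  f <- gl(2, 100)
  xy <- rgamma(200, shape = 2, scale = 1)  # identical parameters in both groups
  anova(glm(xy ~ f, family = Gamma(link = log)), test = "F")["f", "Pr(>F)"]
})
hist(p_null, breaks = 20, col = "gray")  # roughly flat over [0, 1]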
Function to generate simulated data:
simfun <- function(n = 100, shapeRef = 2, shapeTarget = 2,
                   scaleRef = 1, scaleTarget = 2) {
  f <- gl(2, n)
  x <- rgamma(n, shape = shapeRef, scale = scaleRef)
  y <- rgamma(n, shape = shapeTarget, scale = scaleTarget)
  xy <- c(x, y)
  data.frame(xy, f)
}
Function to run anova() and extract the p-value:
sumfun <- function(d) {
  aa <- anova(glm(xy ~ f, family = Gamma(link = log), data = d), test = "F")
  aa["f", "Pr(>F)"]
}
Try it out, 500 times:
set.seed(101)
r <- replicate(500,sumfun(simfun()))
The p-values are always very small (the difference in scale parameters is easily distinguishable), but they do vary:
par(las=1,bty="l") ## cosmetic
hist(log10(r),col="gray",breaks=50)
