Simulating thousands of regressions and obtaining p-values - r

I'm looking to do some basic simulation in R to examine the nature of p-values. My goal is to see whether large sample sizes trend towards small p-values. My thought is to generate random vectors of 1,000,000 data points, regress them on each other, and then plot the distribution of p-values and look for skew.
This is what I was thinking so far:
x1 = runif(1000000, 0, 1000)
x2 = runif(1000000, 0, 1000)
model1 = lm(x2~x1)
Using code taken from another thread:
lmp <- function (modelobject) {
if (class(modelobject) != "lm") stop("Not an object of class 'lm' ")
f <- summary(modelobject)$fstatistic
p <- pf(f[1],f[2],f[3],lower.tail=F)
attributes(p) <- NULL
return(p)
}
lmp(model1)
0.3874139
Any suggestions for how I might do this for 1000 models or even more? Thanks!

see ?replicate ... but the p-value you're calculating assumes gaussian errors not uniform ones (not that this will matter much at n=10^6)
Specifically, something like this:
nrep <- 1000
ndat <- 1000000
results <- replicate(nrep, {
x1=runif(ndat, 0, 1000);
x2=runif(ndat, 0, 1000);
model1=lm(x1 ~ x2);
lmp(model1)
})
should work, but
it will take a long time to run.
I'd suggest making nrep and ndat smaller to try it out.

Related

Exponential distribution in R

I want to simulate some data from an exp(1) distribution but they have to be > 0.5 .so i used a while loop ,but it does not seem to work as i would like to .Thanks in advance for your responses !
x1<-c()
w<-rexp(1)
while (length(x1) < 100) {
if (w > 0.5) {
x1<- w }
else {
w<-rexp(1)
}
}
1) The code in the question has these problems:
we need a new random variable on each iteration but it only generates new random variables if the if condition is FALSE
x1 is repeatedly overwritten rather than extended
although while could be used repeat seems better since having the test at the end is a better fit than the test at the beginning
We can fix this up like this:
x1 <- c()
repeat {
w <- rexp(1)
if (w > 0.5) {
x1 <- c(x1, w)
if (length(x1) == 100) break
}
}
1a) A variation would be the following. Note that an if whose condition is FALSE evaluates to NULL if there is no else leg so if the condition is FALSE on the line marked ## then nothing is concatenated to x1.
x1 <- c()
repeat {
w <- rexp(1)
x1 <- c(x1, if (w > 0.5) w) ##
if (length(x1) == 100) break
}
2) Alternately, this generates 200 exponential random variables keeping only those greater than 0.5. If fewer than 100 are generated then repeat. At the end it takes the first 100 from the last batch generated. We have chosen 200 to be sufficiently large that on most runs only one iteration of the loop will be needed.
repeat {
r <- rexp(200)
r <- r[r > 0.5]
if (length(r) >= 100) break
}
r <- head(r, 100)
Alternative (2) is actually faster than (1) or (1a) because it is more highly vectorized. This is despite it throwing away more exponential random variables than the other solutions.
I would advise against a while (or any other accept/reject) loop; instead use the methods from truncdist:
# Sample 1000 observations from a truncated exponential
library(truncdist);
x <- rtrunc(1000, spec = "exp", a = 0.5);
# Plot
library(ggplot2);
ggplot(data.frame(x = x), aes(x)) + geom_histogram(bins = 50) + xlim(0, 10);
It's also fairly straightforward to implement a sampler using inverse transform sampling to draw samples from a truncated exponential distribution that avoids rejecting samples in a loop. This will be a more efficient method than any accept/reject-based sampling method, and works particularly well in your case, since there exists a closed form of the truncated exponential cdf.
See for example this post for more details.

Random data generation leading to good prediction on random labels

I've been playing around with implementing CV in R but encountered a weird problem with the returned value among folds in LOOCV.
First I'll randomly generate data as well as labels, then I'll fit a randomForest on what should be just noise. From the returned loop I get not only a good AUC but a significant p-value from a t-test. I don't understand how this could be theoretically happening so I was curious if the ways I attempted to generate data/labels was best?
Here is a code snippet that shows my issue.
library(randomForest)
library(pROC)
n=30
p=900
set.seed(3)
XX=matrix(rnorm(n*p, 0, 1) , nrow=n)
YY=as.factor(sample(c('P', 'C'), n, replace=T))
resp = vector()
for(i in 1:n){
fit = randomForest(XX[-i,], YY[-i])
pred = predict(fit, XX[i,], type = "prob")[2]
resp[i] <- pred
}
t.test(resp~YY)$p.value
roc(YY, resp)$auc
I tried multiple ways of generating data all of which result in the same thing
XX=matrix(runif(n*p), nrow=n)
XX=matrix(rnorm(n*p, 0, 1) , nrow=n)
and
random_data=matrix(0, n, p)
for(i in 1:n){
random_data[i,]=jitter(runif(p), factor = 1, amount = 10)
}
XX=as.matrix(random_data)
Since the randomForest is finding relevant predictors in this scenario that leads me to believe that data may not be truly random. Is there a better possible way I could generate data, or generate the random labels? is it possible that this is an issue with R?
This is a partial answer: I modified your roc function call to make sure the distribution of AUC values are between 0 and 1. Then I ran it 20 times. Mean AUC and p-value are 0.73 and 0.12, respectively. Improved but still better than random...
library(ROCR)
library(randomForest)
library(pROC)
n=30
p=900
pvs=vector()
aucs=vector()
for (j in seq(20)){
XX=matrix(rnorm(n*p, 0, 1) , nrow=n)
YY=as.factor(sample(c('C', 'P'), n, replace=T))
resp = vector()
for(i in 1:n){
fit = randomForest(XX[-i,], YY[-i])
pred = predict(fit, XX[i,], type = "prob")[2]
resp[i] <- pred
}
pvs[j]=t.test(resp~YY)$p.value
aucs[j]=roc(YY, resp, direction='>')$auc
}

confidence interval around predicted value from complex inverse function

I'm trying to get a 95% confidence interval around some predicted values, but am not capable of achieving this.
Basically, I estimated a growth curve like this:
set.seed(123)
dat=data.frame(size=rnorm(50,10,3),age=rnorm(50,5,2))
S <- function(t,ts,C,K) ((C*K)/(2*pi))*sin(2*pi*(t-ts))
sommers <- function(t,Linf,K,t0,ts,C)
Linf*(1-exp(-K*(t-t0)-S(t,ts,C,K)+S(t0,ts,C,K)))
model <- nls(size~sommers(age,Linf,K,t0,ts,C),data=dat,
start=list(Linf=10,K=4.7,t0=2.2,C=0.9,ts=0.1))
I have independent size measurements, for which I would like to predict the age. Therefore, the inverse of the function, which is not very straightforward, I calculated like this:
model.out=coef(model)
S.out <- function(t)
((model.out[[4]]*model.out[[2]])/(2*pi))*sin(2*pi*(t-model.out[[5]]))
sommers.out <- function(t)
model.out[[1]]*(1-exp(-model.out[[2]]*(t-model.out[[3]])-S.out(t)+S.out(model.out[[3]])))
inverse = function (f, lower = -100, upper = 100) {
function (y) uniroot((function (x) f(x) - y), lower = lower, upper = upper)[1]
}
sommers.inverse = inverse(sommers.out, 0, 25)
x= sommers.inverse(10) #this works with my complete dataset, but not with this fake one
Although this works fine, I need to know the confidence interval (95%) around this estimate (x). For linear models there is for example "predict(... confidence=)". I could also bootstrap the function somehow to get the quantiles associated with the parameters (didn't find how), to then use the extremes of those to calculate the maximum and minimum values predictable. But that doesn't really look like the good way of doing this....
Any help would be greatly appreciated.
EDIT after answer:
So this worked (explained in the book of Ben Bolker, see answer):
vmat = mvrnorm(1000, mu = coef(mfit), Sigma = vcov(mfit))
dist = numeric(1000)
for (i in 1:1000) {dist[i] = sommers_inverse(9.938,vmat[i,])}
quantile(dist, c(0.025, 0.975))
On the rather bad fake data I gave, this works of course rather horrible. But on the real data (which I have a problem recreating), this is ok!
Unless I'm mistaken, you're going to have to use either regular (parametric) bootstrapping or a method called either "population predictive intervals" (e.g., see section 5 of chapter 7 of Bolker 2008), which assumes that the sampling distributions of your parameters are multivariate Normal. However, I think you may have bigger problems, unless I've somehow messed up your model in adapting it ...
Generate data (note that random data may actually bad for testing your model - see below ...)
set.seed(123)
dat <- data.frame(size=rnorm(50,10,3),age=rnorm(50,5,2))
S <- function(t,ts,C,K) ((C*K)/(2*pi))*sin(2*pi*(t-ts))
sommers <- function(t,Linf,K,t0,ts,C)
Linf*(1-exp(-K*(t-t0)-S(t,ts,C,K)+S(t0,ts,C,K)))
Plot the data and the initial curve estimate:
plot(size~age,data=dat,ylim=c(0,16))
agevec <- seq(0,10,length=1001)
lines(agevec,sommers(agevec,Linf=10,K=4.7,t0=2.2,ts=0.1,C=0.9))
I had trouble with nls so I used minpack.lm::nls.lm, which is slightly more robust. (There are other options here, e.g. calculating the derivatives and providing the gradient function, or using AD Model Builder or Template Model Builder, or using the nls2 package.)
For nls.lm we need a function that returns the residuals:
sommers_fn <- function(par,dat) {
with(c(as.list(par),dat),size-sommers(age,Linf,K,t0,ts,C))
}
library(minpack.lm)
mfit <- nls.lm(fn=sommers_fn,
par=list(Linf=10,K=4.7,t0=2.2,C=0.9,ts=0.1),
dat=dat)
coef(mfit)
## Linf K t0 C ts
## 10.6540185 0.3466328 2.1675244 136.7164179 0.3627371
Here's our problem:
plot(size~age,data=dat,ylim=c(0,16))
lines(agevec,sommers(agevec,Linf=10,K=4.7,t0=2.2,ts=0.1,C=0.9))
with(as.list(coef(mfit)), {
lines(agevec,sommers(agevec,Linf,K,t0,ts,C),col=2)
abline(v=t0,lty=2)
abline(h=c(0,Linf),lty=2)
})
With this kind of fit, the results of the inverse function are going to be extremely unstable, as the inverse function is many-to-one, with the number of inverse values depending sensitively on the parameter values ...
sommers_pred <- function(x,pars) {
with(as.list(pars),sommers(x,Linf,K,t0,ts,C))
}
sommers_pred(6,coef(mfit)) ## s(6)=9.93
sommers_inverse <- function (y, pars, lower = -100, upper = 100) {
uniroot(function(x) sommers_pred(x,pars) -y, c(lower, upper))$root
}
sommers_inverse(9.938, coef(mfit)) ## 0.28
If I pick my interval very carefully I can get back the correct answer ...
sommers_inverse(9.938, coef(mfit), 5.5, 6.2)
Maybe your model will be better behaved with more realistic data. I hope so ...

Monte Carlo simulation in R

I am trying to simulate data (Y) from an AR(1) model with rho=0.7. Then I will use this data to run a regression of Y on an intercept ( by so doing the parameter estimate becomes the mean of Y), then test the null hypothesis of the coefficient being less than or equal to zero ( alternative is greater than 0) using robust standard errors.
I want to run a Monte Carlo simulation of this hypothesis using 2000 replications for different lag values. the purpose is to show the finite sample performance of the Newey West estimator as the lag changes. so this is how I began
A<-array(0, dim=c(2000,1))
for(i in 1:2000){
y_new<-arima.sim(model=list(ar=0.7), n=50, mean=0,sd=1)
reg<-lm(y_new~1)
ad<-coeftest(reg, alternative="greater", vcov=NeweyWest(reg, lag=1, prewhite=FALSE))
A[i]<-ad[,3]
}
My question: is the code above the right way of doing this kind of simulation? And if it is, how can I get a code to repeat this process for different lag values in the HAC test. I want to run the test each time increasing the lag by 1, thus I will be doing this 50 times for lags 1,2,3,4......,50, each time storing the 2000 simulated test statistics in a vector with different names. calculate rejection probabilities for the test statistic (sig. level =0,05, using the critical value of 1.645) for each case and plot them(rejection probabilities) against the various lag values.
Please help
Because you didn't mention the possible purpose of the simulation, it is hard to tell whether it is the right way.
You save a lot of time by computing 50 test statistics for each simulated sample, instead of repeating the simulation 2000 times for each lag (that is, the number of simulation is 2000*50).
Much better format of doing simulation is
library(AER)
library(dplyr)
lags <- 1:50
nreps <- 2000
sim <- function (){
ynew <- arima.sim(model = list(ar=0.7), n=50, mean=0, sd=1)
reg <- lm(ynew ~ 1 )
s <- rep(NA, 50)
for(i in lags){
ad <- coeftest(reg, alternative="greater", vcov=NeweyWest(reg, lag = i, prewhite=FALSE))
s[i] <- ad[ ,4]
}
s
}
Following code stores simulation results in a data.frame
result <- lapply(1:nreps, function(i)data.frame(simulation = i, lag = lags, pvalues = sim())) %>%
rbind_all
From your vague description, I extrapolate what you want looks something like
library(ggplot2)
result %>%
group_by(lag) %>%
summarize(rejectfreq = mean(pvalues > 0.05)) %>%
ggplot(., aes(lag, rejectfreq)) + geom_line()+
coord_cartesian(ylim = c(0,1)) +
scale_y_continuous(breaks=seq(0, 1, by=0.1))
Although the figure was created using only 100 simulations, it is evident that the choice of the lags in Newey-West wouldn't matter much when the disturbance terms are i.i.d.

Using anova() on gamma distributions gives seemingly random p-values

I am trying to determine whether there is a significant difference between two Gamm distributions. One distribution has (shape, scale)=(shapeRef,scaleRef) while the other has (shape, scale)=(shapeTarget,scaleTarget). I try to do analysis of variance with the following code
n=10000
x=rgamma(n, shape=shapeRef, scale=scaleRef)
y=rgamma(n, shape=shapeTarget, scale=scaleTarget)
glmm1 <- gam(y~x,family=Gamma(link=log))
anova(glmm1)
The resulting p values keep changing and can be anywhere from <0.1 to >0.9.
Am I going about this the wrong way?
Edit: I use the following code instead
f <- gl(2, n)
x=rgamma(n, shape=shapeRef, scale=scaleRef)
y=rgamma(n, shape=shapeTarget, scale=scaleTarget)
xy <- c(x, y)
anova(glm(xy ~ f, family = Gamma(link = log)),test="F")
But, every time I run it I get a different p-value.
You will indeed get a different p-value every time you run this, if you pick different realizations every time. Just like your data values are random variables, which you'd expect to vary each time you ran an experiment, so is the p-value. If the null hypothesis is true (which was the case in your initial attempts), then the p-values will be uniformly distributed between 0 and 1.
Function to generate simulated data:
simfun <- function(n=100,shapeRef=2,shapeTarget=2,
scaleRef=1,scaleTarget=2) {
f <- gl(2, n)
x=rgamma(n, shape=shapeRef, scale=scaleRef)
y=rgamma(n, shape=shapeTarget, scale=scaleTarget)
xy <- c(x, y)
data.frame(xy,f)
}
Function to run anova() and extract the p-value:
sumfun <- function(d) {
aa <- anova(glm(xy ~ f, family = Gamma(link = log),data=d),test="F")
aa["f","Pr(>F)"]
}
Try it out, 500 times:
set.seed(101)
r <- replicate(500,sumfun(simfun()))
The p-values are always very small (the difference in scale parameters is easily distinguishable), but they do vary:
par(las=1,bty="l") ## cosmetic
hist(log10(r),col="gray",breaks=50)

Resources