I am trying to run a simulation in R. In the matrix named par.est1 I am saving, in the 5th and 6th columns, the standard errors of the coefficients b1 and b2, but those turn out to be exactly the same across all 1000 repetitions. Does anyone know why this is happening? Maybe it has something to do with the way I created the correlated variables?
This is the code:
set.seed(185736)
reps <- 1000 #repetitions
par.est1 <- matrix(NA, nrow=reps, ncol=6)
b1 <- 4
b2 <- 5.8
n <- 26
r <- 0.1
#Create correlated variables
library(MASS)
data <- mvrnorm(n=n, mu=c(0, 0), Sigma=matrix(c(1, r, r, 1), nrow=2), empirical=TRUE)
V1 = data[, 1] # standard normal (mu=0, sd=1)
V2 = data[, 2]
cor(V1, V2)
for(i in 1:reps){
  Y <- V1*b1 + V2*b2 + rnorm(n, 0, 1) # the true DGP, with N(0,1) error
  model1 <- lm(Y ~ V1 + V2)
  vcv1 <- vcov(model1)
  par.est1[i, 1] <- model1$coef[1]
  par.est1[i, 2] <- model1$coef[2]
  par.est1[i, 3] <- model1$coef[3]
  par.est1[i, 4] <- sqrt(diag(vcv1)[1]) #SE
  par.est1[i, 5] <- sqrt(diag(vcv1)[2])
  par.est1[i, 6] <- sqrt(diag(vcv1)[3])
}
Thank you.
Thanks user2554330.
Is there any way I can make correlated variables with different means and variances?
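For what it's worth, here is a minimal sketch of the same mvrnorm call with the covariance matrix built from chosen standard deviations and a correlation (the means and SDs below are made-up example values):
library(MASS)
mu1 <- 2;  s1 <- 1.5  # example mean and SD of the first variable
mu2 <- -1; s2 <- 0.5  # example mean and SD of the second variable
r <- 0.1              # desired correlation
Sigma <- matrix(c(s1^2, r*s1*s2, r*s1*s2, s2^2), nrow=2)
data <- mvrnorm(n=26, mu=c(mu1, mu2), Sigma=Sigma, empirical=TRUE)
V1 <- data[, 1]
V2 <- data[, 2]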
I have a data frame of values and want to add a column based on an inequality condition that involves matrix multiplication.
The data frame is built like this:
# Set possible values for variables
b0 <- b1 <- b2 <- seq(0, 2, by=0.1)
# Create data frame of all the different combos of these variables
df <- setNames(expand.grid(b0, b1, b2), c("b0", "b1", "b2"))
There are a lot of precursor objects I have to define before adding this column:
##### Set n
n = 100
#### Generate (x1i, x2i)
# Install and load the 'MASS' package
#install.packages("MASS")
library("MASS")
# Input univariate parameters
rho <- 0.5
mu1 <- 0; s1 <- 1
mu2 <- 0; s2 <- 1
# Generate parameters for bivariate normal distribution
mu <- c(mu1, mu2)
sigma <- matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2), nrow=2, ncol=2)
# Generate draws from bivariate normal distribution
bvn <- mvrnorm(n, mu=mu, Sigma=sigma ) # from MASS package
x1 <- bvn[, 1]
x2 <- bvn[, 2]
##### Generate error
error <- rnorm(n)
##### Generate dependent variable
y <- 0.5 + x1 + x2 + error
##### Create the model
lm <- lm(y ~ x1 + x2)
# Setup parameters
n <- 100
K <- 3
c <- qf(.95, K, n - K)
# Define necessary objects
sigma_hat_sq <- 1
b0_hat <- summary(lm)$coefficients[1, 1]
b1_hat <- summary(lm)$coefficients[2, 1]
b2_hat <- summary(lm)$coefficients[3, 1]
x <- cbind(1, x1, x2)
I am trying to add this conditional column like this:
# Add a column to the data frame that says whether the condition holds
df <- transform(df, ueq = (
(1/(K*sigma_hat_sq))*
t(matrix(c(b0_hat-b0, b1_hat-b1, b2_hat-b2)))%*%
t(x)%*%x%*%
matrix(c(b0_hat-b0, b1_hat-b1, b2_hat-b2))
<= c
))
...but doing so generates the error message
Error in t(matrix(c(b0_hat - b0, b1_hat - b1, b2_hat - b2))) %*% t(x) :
non-conformable arguments
Mathematically, the condition is (1/(K*sigmahat^2)) * (Bhat - B)' X'X (Bhat - B) <= c, where, for each triple (b0, b1, b2), (Bhat - B) is the 3x1 vector with elements (b0_hat - b0, b1_hat - b1, b2_hat - b2). I'm just not sure how to write this condition in R.
Any help would be greatly appreciated!
In order to work with only one row of df at a time (and get a separate answer for each 1 x 3 matrix), you need a loop.
A simple way to do this in R is mapply.
df <- transform(df, ueq = mapply(\(b0, b1, b2)
(1/(K*sigma_hat_sq)) *
t(c(b0_hat-b0, b1_hat-b1, b2_hat-b2)) %*%
t(x) %*% x %*%
c(b0_hat-b0, b1_hat-b1, b2_hat-b2)
<= c,
b0 = b0, b1 = b1, b2 = b2
))
This leads to 91 TRUE rows.
sum(df$ueq)
[1] 91
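If the grid of (b0, b1, b2) combinations becomes large, a fully vectorized alternative is possible; the following is only a sketch (the column name ueq2 is made up) that computes the same quadratic form for every row at once:
# D holds one (beta_hat - beta) vector per row of df; XtX is the 3x3 cross-product matrix
D <- cbind(b0_hat - df$b0, b1_hat - df$b1, b2_hat - df$b2)
XtX <- t(x) %*% x
# rowSums((D %*% XtX) * D) gives the quadratic form D[i, ] %*% XtX %*% D[i, ] for each row i
df$ueq2 <- (1/(K*sigma_hat_sq)) * rowSums((D %*% XtX) * D) <= c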
I'm trying to learn how to make a loop in R
I have this:
sigma2 <- 0.4
a0 <- -0.1260805
b <- 0.1260805
tt <- 1:50
z <- rnorm(50, 0, sigma2)
y <- rep(1, 50)
for(i in 1:50){
  y[i] <- exp(a0 + b*tt[i])*exp(z[i])
}
y
and I want to repeat the code above 1000 times, since I want to test a hypothesis at the 0.05 level.
I tried this, and it seems to be wrong:
aa <- rep(1, 1000)
for(i in 1:1000){
y[i]=exp(a0 + b1*tt[i])*exp(z[i])
}
Thanks for help!
I think this is what you want (or at least closer):
# re-write original code with vectorization:
n <- 50
sigma2 <- 0.4
a0 <- -0.1260805
b <- 0.1260805
tt <- 1:n
z <- rnorm(n, 0, sigma2)
y <- exp(a0 + b*tt)*exp(z)
# do it 20 times
result <- replicate(20, exp(a0 + b*tt)*exp(rnorm(n, 0, sigma2)))
result is a 50x20 matrix - one column per repetition.
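If you then need a per-repetition summary (for instance, as input to a test at the 0.05 level), you can work across the columns of result; a small sketch:
# each column of 'result' is one simulated series of length n
col_means <- apply(result, 2, mean)  # mean of the simulated y in each repetition
head(col_means)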
I'm trying to speed up a script that otherwise takes days to handle larger data sets. So, is there a way to completely vectorize the following script:
# k-fold cross validation
df <- trees # a data frame 'trees' from R.
df <- df[sample(nrow(df)), ] # randomly shuffles the data.
k <- 10 # Number of folds. Note k=nrow(df) in the leave-one-out cross validation.
folds <- cut(seq(from=1, to=nrow(df)), breaks=k, labels=FALSE) # creates unique numbers for k equally size folds.
df$ID <- folds # adds fold IDs.
df[paste("pred", 1:10, sep="")] <- NA # adds multiple columns "pred1"..."pred10" to speed up the following loop.
library(mgcv)
for(i in 1:k) {
# looping for different models:
m1 <- gam(Volume ~ s(Height), data=df, subset=(ID != i))
m2 <- gam(Volume ~ s(Girth), data=df, subset=(ID != i))
m3 <- gam(Volume ~ s(Girth) + s(Height), data=df, subset=(ID != i))
# looping for predictions:
df[df$ID==i, "pred1"] <- predict(m1, df[df$ID==i, ], type="response")
df[df$ID==i, "pred2"] <- predict(m2, df[df$ID==i, ], type="response")
df[df$ID==i, "pred3"] <- predict(m3, df[df$ID==i, ], type="response")
}
# calculating residuals:
df$res1 <- with(df, Volume - pred1)
df$res2 <- with(df, Volume - pred2)
df$res3 <- with(df, Volume - pred3)
Model <- paste("m", 1:10, sep="") # creates a vector of model names.
# creating a vector of mean-square errors (MSE):
MSE <- with(df, c(
sum(res1^2) / nrow(df),
sum(res2^2) / nrow(df),
sum(res3^2) / nrow(df)
))
# coefficients of determination, R2 = 1 - SSE/SST, for each model:
R2 <- with(df, c(
  1 - sum(res1^2) / sum((Volume - mean(Volume))^2),
  1 - sum(res2^2) / sum((Volume - mean(Volume))^2),
  1 - sum(res3^2) / sum((Volume - mean(Volume))^2)
))
model.mse <- data.frame(Model, MSE, R2) # creates a data frame of model names, mean-square errors and coefficients of determination.
model.mse <- model.mse[order(model.mse$MSE), ] # rearranges the previous data frame in order of increasing mean-square errors.
I'd appreciate any help. This code takes several days when run on 30,000 different GAM models with 3 predictors. Thanks.
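Not a full answer, but since each fold is independent, one common speed-up is to fit the folds in parallel. A rough sketch with the base parallel package, shown only for the third model (on Windows you would need makeCluster/parLapply instead of mclapply):
library(parallel)
library(mgcv)
# fit the training model for fold i and return the held-out predictions, one fold per worker
fold_preds <- mclapply(1:k, function(i) {
  train <- df[df$ID != i, ]
  test  <- df[df$ID == i, ]
  m3 <- gam(Volume ~ s(Girth) + s(Height), data = train)
  data.frame(row = rownames(test), pred3 = predict(m3, test, type = "response"))
}, mc.cores = 2)
preds3 <- do.call(rbind, fold_preds)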
When plotting runjags output, how does one plot a single specific variable when many other variables have similar names? Providing a quoted variable name with the vars argument doesn't seem to do it (it still returns all partial matches).
Here is a simple reproducible example.
N <- 200
nobs <- 3
psi <- 0.35
p <- 0.45
z <- rbinom(n=N, size=1,prob=psi)
y <- rbinom(n=N, size=nobs,prob=p*z)
sink("model.txt")
cat("
model {
for (i in 1:N){
z[i] ~ dbern(psi)
pz[i] <- z[i]*p
y[i] ~ dbin(pz[i],nobs)
} #i
psi ~ dunif(0,1)
p ~ dunif(0,1)
}
",fill = TRUE)
sink()
m <-list(y=y,N=N,nobs=nobs)
inits <- function(){list(psi=runif(1),p=runif(1),z=as.numeric(y>0))}
parameters <- c("p","psi")
ni <- 1000
nt <- 1
nb <- 200
nc <- 3
ad <- 100
library(runjags)
out <- run.jags(model="model.txt",monitor=parameters,data=m,n.chains=nc,inits=inits,burnin=nb,
sample=ni,adapt=ad,thin=nt,modules=c("glm","dic"),method="parallel")
windows(9,4)
plot(out,plot.type=c("trace","histogram"),vars="p",layout=c(1,2),new.window=FALSE)
It should be possible to double quote variables to get an exact match, but this seems to be broken. It should also be possible to specify a logical vector to vars but this seems to be broken for the plot method ... how embarrassing. The following does work though:
# Generate a logical vector to use with matching variable names:
variables <- extract(out, 'stochastic')
variables['psi'] <- FALSE
# Add summary statistics only for the specified variables and pre-draw plots:
out2 <- add.summary(out, vars=variables, plots=TRUE)
plot(out2, plot.type=c("trace","histogram"))
I will fix the other issues for the next release.
Matt
I am just getting into writing MLE commands in R that work and look similar to native R functions. In this attempt I am trying to do a simple MLE with
y=b0 + x*b1 + u
and
u~N(0,sd=s0 + z*s1)
However, I am having difficulty coding even such a simple command, although I have written a similar command in Stata in a handful of lines.
Here is the code I have written so far in R.
normalreg <- function (beta, sigma=NULL, data, beta0=NULL, sigma0=NULL,
con1 = T, con2 = T) {
# If a formula for sigma is not specified
# assume it is the same as the formula for the beta.
if (is.null(sigma)) sigma=beta
# Grab the call expression
mf <- match.call(expand.dots = FALSE)
# Find the position of each argument
m <- match(c("beta", "sigma", "data", "subset", "weights", "na.action",
"offset"), names(mf), 0L)
# Adjust names of mf
mf <- mf[c(1L, m)]
# Since I have two formulas I will call them both formula
names(mf)[2:3] <- "formula"
# Drop unused levels
mf$drop.unused.levels <- TRUE
# Divide mf into data1 and data2
data1 <- data2 <- mf
data1 <- mf[-3]
data2 <- mf[-2]
# Name the first elements model.frame which will be
data1[[1L]] <- data2[[1L]] <- as.name("model.frame")
data1 <- as.matrix(eval(data1, parent.frame()))
data2 <- as.matrix(eval(data2, parent.frame()))
y <- data1[,1]
data1 <- data1[,-1]
if (con1) data1 <- cbind(data1,1)
data2 <- unlist(data2[,-1])
if (con2) data2 <- cbind(data2,1)
data1 <- as.matrix(data1) # Ensure our data is read as matrix
data2 <- as.matrix(data2) # Ensure our data is read as matrix
if (!is.null(beta0)) if (length(beta0)!=ncol(data1))
  stop("Length of beta0 must equal the number of independent variables in the first equation")
if (!is.null(sigma0)) if (length(sigma0)!=ncol(data2))
  stop("Length of sigma0 must equal the number of independent variables in the second equation")
# Set initial parameter estimates
if (is.null(beta0)) beta0 <- rep(1, ncol(data1))
if (is.null(sigma0)) sigma0 <- rep(1, ncol(data2))
# Define the maximization function
normMLE <- function(est=c(beta0,sigma0), data1=data1, data2=data2, y=y) {
data1est <- as.matrix(est[1:ncol(data1)], nrow=ncol(data1))
data2est <- as.matrix(est[(ncol(data1)+1):(ncol(data1)+ncol(data2))],
nrow=ncol(data1))
ps <- dnorm(y - data1%*%data1est,
            sd = data2%*%data2est)
# Estimate a vector of log likelihoods based on coefficient estimates
llk <- log(ps)
-sum(llk)
}
results <- optim(c(beta0,sigma0), normMLE, hessian=T,
data1=data1, data2=data2, y=y)
results
}
x <-rnorm(10000)
z<-x^2
y <-x*2 + rnorm(10000, sd=2+z*2) + 10
normalreg(y~x, y~z)
At this point the biggest issue is finding an optimization routine that does not fail when some of the values return NA because the standard deviation goes negative. Any suggestions? Sorry for the huge amount of code.
Francis
I include a check to see whether any of the standard deviations are less than or equal to 0 and return a log-likelihood of log(0), i.e. -Inf, if that is the case. That seems to work for me. You can figure out the details of wrapping it into your function.
#y=b0 + x*b1 + u
#u~N(0,sd=s0 + z*s1)
ll <- function(par, x, z, y){
  b0 <- par[1]
  b1 <- par[2]
  s0 <- par[3]
  s1 <- par[4]
  sds <- s0 + z*s1
  if(any(sds <= 0)){
    return(log(0))
  }
  preds <- b0 + x*b1
  sum(dnorm(y, preds, sds, log = TRUE))
}
n <- 100
b0 <- 10
b1 <- 2
s0 <- 2
s1 <- 2
x <- rnorm(n)
z <- x^2
y <- b0 + b1*x + rnorm(n, sd = s0 + s1*z)
optim(c(1,1,1,1), ll, x=x, z=z,y=y, control = list(fnscale = -1))
With that said, it probably wouldn't be a bad idea to parameterize the standard deviation in such a way that it is impossible for it to go negative...
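For example, a sketch of that idea using a log-scale parameterization (note this changes the model so that the SD is exp(s0 + z*s1) rather than s0 + z*s1):
ll_pos <- function(par, x, z, y){
  b0 <- par[1]; b1 <- par[2]
  s0 <- par[3]; s1 <- par[4]
  sds <- exp(s0 + z*s1)  # positive by construction, so no check is needed
  preds <- b0 + x*b1
  sum(dnorm(y, preds, sds, log = TRUE))
}
optim(c(1, 1, 0, 0), ll_pos, x = x, z = z, y = y, control = list(fnscale = -1))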