How does rjags process and update the input data in coda.samples()?

I am working on a problem that requires me to use the Metropolis-Hastings algorithm to generate values for 3 unknowns (beta, lambda and phi) for a group of users, and then use that output as the input to a Gibbs sampler to solve a regression equation and further estimate the true values of beta, lambda and phi. I am using the rjags library in R, which works fine, but I don't clearly understand how the library calculates the values under the hood.
The following is what the output of Metropolis-Hastings looks like. Say we did 3 iterations of MH and had 3 users, so each user has 3 values of beta, lambda and phi:
user number  beta  lambda  phi
1            0.1   0.2     0.5
1            0.2   0.2     0.3
1            0.1   0.4     0.3
2            0.8   0.4     0.6
2            0.7   0.5     0.7
2            0.8   0.4     0.7
3            1.8   2.4     1.7
3            3.6   4.8     6.1
3            3.6   8.2     3.2
I also have certain covariates for the 3 users:
user number  x1  x2
1            24  36
2            12  15
3            18  31
The above two tables go as input to rjags. The rjags code is as follows:
library(data.table)
# metropolis is the output of MH (here a data.table); covariates holds x1/x2 per user
beta <- metropolis[, beta]
lambda <- metropolis[, lambda]
phi <- metropolis[, phi]
user_id <- metropolis[, `user number`]
x1 <- covariates[, x1]
x2 <- covariates[, x2]
# number of rows in the MH output (iterations x users)
N <- dim(metropolis)[1]
# number of unique users
P <- uniqueN(user_id)
data.jags <- list(beta = beta, lambda = lambda, phi = phi,
N = N, P = P,
user_id = user_id, x1 = x1, x2 = x2)
inits.jags <- list(".RNG.name" = "base::Wichmann-Hill",
".RNG.seed" = 111)
modelString <-
"model{
for (i in 1:N){
beta[i] ~ dnorm(theta_beta[user_id[i]], tau_beta)
lambda[i] ~ dnorm(theta_lambda[user_id[i]], tau_lambda)
phi[i] ~ dnorm(theta_phi[user_id[i]], tau_phi)
}
for (j in 1:P){
theta_beta[j] <- d_0 + d_1 * x1[j] + d_2 * x2[j] + error_beta[j]
theta_lambda[j] <- g_0 + g_1 * x1[j] + g_2 * x2[j] + error_lambda[j]
theta_phi[j] <- e_0 + e_1 * x1[j] + e_2 * x2[j] + error_phi[j]
real_beta[j] <- exp(theta_beta[j])
real_lambda[j] <- exp(theta_lambda[j])/(1 + exp(theta_lambda[j]))
real_phi[j] <- exp(theta_phi[j])/(1 + exp(theta_phi[j]))
error_beta[j] ~ dnorm(0,0.1)
error_lambda[j] ~ dnorm(0,0.1)
error_phi[j] ~ dnorm(0,0.1)
}
#priors
d_0 ~ dnorm(0,0.01)
d_1 ~ dnorm(0,0.01)
d_2 ~ dnorm(0,0.01)
g_0 ~ dnorm(0,0.01)
g_1 ~ dnorm(0,0.01)
g_2 ~ dnorm(0,0.01)
e_0 ~ dnorm(0,0.01)
e_1 ~ dnorm(0,0.01)
e_2 ~ dnorm(0,0.01)
sigma_beta ~ dunif(0,100)
sigma_lambda ~ dunif(0,100)
sigma_phi ~ dunif(0,100)
tau_beta <- 1/pow(sigma_beta,2)
tau_phi <- 1/pow(sigma_phi,2)
tau_lambda <- 1/pow(sigma_lambda,2)
}"
Now, the results from the above code look fine, but I don't understand how exactly the output of Metropolis-Hastings (beta, lambda, phi) is being used here. In the first for loop of the modelString, the values of beta, lambda and phi are drawn from a normal distribution, so where do the values of beta, lambda and phi that were given as inputs in data.jags come into the picture?
Any kind of explanation/insight would be much appreciated. Thanks!
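As a minimal, self-contained illustration of the mechanism being asked about (a sketch, not part of the original post): any stochastic node whose values are supplied in the data list is treated by JAGS as observed, so a line like beta[i] ~ dnorm(...) acts as a likelihood contribution for its hyperparameters rather than drawing fresh values of beta[i].
library(rjags)
# y is supplied as data, so "y[i] ~ dnorm(mu, 1)" is a likelihood term
# and only mu (the unobserved node) is sampled.
ms <- "model{
  for (i in 1:N){ y[i] ~ dnorm(mu, 1) }
  mu ~ dnorm(0, 0.001)
}"
y_obs <- c(0.1, 0.2, 0.1)   # e.g. the MH draws of beta for user 1
jm <- jags.model(textConnection(ms), data = list(y = y_obs, N = length(y_obs)))
post <- coda.samples(jm, variable.names = "mu", n.iter = 1000)
summary(post)   # posterior of mu is centred near mean(y_obs)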

Related

How to format beta0 and beta1 with confidence intervals in R?

I need to calculate β̂_0 and β̂_1 for a simple linear regression y_i = β_0 + β_1 x_i with 87% confidence intervals for β_0 and β_1, and have to display my results with three significant digits in the following format:
Est L U
beta0 1.13 0.889 1.37
beta1 3.57 1.950 5.19
What code should I use to get it in this format?
I have done the following, but cannot figure out how to show Intercept and x as beta0 and beta1 with their Estimate and Lower CI and Upper CI:
> M <- lm(y ~ x) # fit linear model
> signif(coef(M), digits = 2) # MLE's of beta
(Intercept) x
-5.40 0.13
>
> signif(confint(M, level = 0.87), digits = 3)
6.5 % 93.5 %
(Intercept) -5.710 -5.160
x 0.127 0.136
I'm doing this in RStudio
EDIT:
I've used data.frame to get it like this:
> # data.frame for MLE's of beta with 87% confidence interval for beta0 and beta1
> Est <- c(-5.40, 0.13)
> L <- c(-5.710, 0.127)
> U <- c(-5.160, 0.136)
> df <- data.frame(Est, L, U)
> row.names(df) <- c('beta0', 'beta1')
> df
        Est      L      U
beta0 -5.40 -5.710 -5.160
beta1  0.13  0.127  0.136
But is there a better way of getting it in this form using the built-in R functions lm, coef, confint?
Rename the rownames and colnames of the output of confint.
n <- 50
df <- data.frame(x = rnorm(n), y = rnorm(n))
fit <- lm(y ~ x, data = df)
ci <- confint(fit, level = .87)
colnames(ci) <- c("L", "U")
rownames(ci) <- c("beta0", "beta1")
ci
L U
beta0 -0.4962463 0.002210674
beta1 -0.3157171 0.152844873
If you need the estimates as well, convert the ci matrix to a data frame and add fit$coefficients as an additional column:
ci_df <- data.frame(ci)
est <- fit$coefficients
ci_df$Est <- est
ci_df
L U Est
beta0 -0.4962463 0.002210674 -0.24701781
beta1 -0.3157171 0.152844873 -0.08143609
And if you need to round, just do round(ci_df, 3).
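If the goal is exactly the Est / L / U layout with three significant digits from the question, here is one possible variant (a sketch reusing fit from above; the column order and the use of signif() are taken from the question, not from the answer):
tab <- data.frame(Est = coef(fit), confint(fit, level = 0.87))
colnames(tab) <- c("Est", "L", "U")
rownames(tab) <- c("beta0", "beta1")
signif(tab, 3)   # three significant digits, as requested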

How to parametrize piecewise regression coefficient to represent the slope for the following interval (instead of the change in the slope)

Consider the following dataset
Quantity <- c(25,39,45,57,70,85,89,100,110,124,137,150,177)
Sales <- c(1000,1250,2600,3000,3500,4500,5000,4700,4405,4000,3730,3400,3300)
df <- data.frame(Quantity,Sales)
df
Plotting the data, the relationship is clearly non-linear, but suggests a likely breakpoint around Quantity = 89 (I skip the plot here). Therefore, I built a joint piecewise linear model as follows
df$Xbar <- ifelse(df$Quantity>89,1,0)
df$diff <- df$Quantity - 89
reg <- lm(Sales ~ Quantity + I(Xbar * (Quantity - 89)), data = df)
summary(reg)
or simply
df$X <- df$diff*df$Xbar
reg <- lm(Sales ~ Quantity + X, data = df)
summary(reg)
However, according to this parametrization, the coefficient of X represents the change in the slope from the preceding interval.
How can I parametrize the relevant coefficient to rather represent the slope for the second interval?
I did some research but I was unable to find the desired specification, apart from some automation in Stata (see the 'marginal' option here https://www.stata.com/manuals13/rmkspline.pdf).
Any help is much appreciated. Thank you!
Acknowledgement:
the workable example is retrieved from
https://towardsdatascience.com/unraveling-spline-regression-in-r-937626bc3d96
The key here is to use a logical variable is.right which is TRUE for the points to the right of 89 and FALSE otherwise.
From the output shown below, 60.88 is the slope to the left of 89 and -19.97 is the slope to the right. The lines intersect at Quantity = 89, Sales = 4817.30.
is.right <- df$Quantity > 89
fm <- lm(Sales ~ diff : is.right, df)
fm
## Call:
## lm(formula = Sales ~ diff:is.right, data = df)
##
## Coefficients:
## (Intercept) diff:is.rightFALSE diff:is.rightTRUE
## 4817.30 60.88 -19.97
Alternatives
Alternately if you want to use Xbar from the question do it this way. It gives the same coefficients as fm.
fm2 <- lm(Sales ~ diff : factor(Xbar), df)
or
fm3 <- lm(Sales ~ I(Xbar * diff) + I((1 - Xbar) * diff), df)
Double check with nls
We can double check these using nls with the following formulation which makes use of the fact that if we extend both lines the one to use at any Quantity is the lower of the two.
st <- list(a = 0, b1 = 1, b2 = -1)
fm4 <- nls(Sales ~ a + pmin(b1 * (Quantity - 89), b2 * (Quantity - 89)), start = st)
fm4
## Nonlinear regression model
## model: Sales ~ a + pmin(b1 * (Quantity - 89), b2 * (Quantity - 89))
## data: parent.frame()
## a b1 b2
## 4817.30 60.88 -19.97
## residual sum-of-squares: 713120
##
## Number of iterations to convergence: 1
## Achieved convergence tolerance: 2.285e-09
This would also work:
fm5 <- nls(Sales ~ a + ifelse(Quantity > 89, b2, b1) * diff, df, start = st)
Plot
Here is a plot:
plot(Sales ~ Quantity, df)
lines(fitted(fm) ~ Quantity, df)
Model matrix
And here is the model matrix for the linear regression:
> model.matrix(fm)
(Intercept) diff:is.rightFALSE diff:is.rightTRUE
1 1 -64 0
2 1 -50 0
3 1 -44 0
4 1 -32 0
5 1 -19 0
6 1 -4 0
7 1 0 0
8 1 0 11
9 1 0 21
10 1 0 35
11 1 0 48
12 1 0 61
13 1 0 88
If you know the breakpoints, then you almost have the model, it should be:
fit <- lm(Sales ~ Quantity + Xbar + Quantity:Xbar, data = df)   # equivalently Sales ~ Quantity * Xbar
If you don't introduce the new intercept term (Xbar), the second segment is forced to share the intercept already in the model, which will not fit. We can plot it:
plot(df$Quantity,df$Sales)
newdata = data.frame(Quantity=seq(40,200,by=5))
newdata$Xbar= ifelse(newdata$Quantity>89,1,0)
lines(newdata$Quantity,predict(fit,newdata))
The coefficients are:
summary(fit)
Call:
lm(formula = Sales ~ Quantity * Xbar, data = df)
Residuals:
Min 1Q Median 3Q Max
-527.9 -132.2 -15.1 148.1 464.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -545.435 327.977 -1.663 0.131
Quantity 59.572 5.746 10.367 2.65e-06 ***
Xbar 7227.288 585.933 12.335 6.09e-07 ***
Quantity:Xbar -80.133 6.856 -11.688 9.64e-07 ***
And the slope of the 2nd interval is 59.572 + (-80.133) = -20.561.
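To pull that second-interval slope out of the interaction fit directly rather than adding by hand, one option (a small sketch based on the fit above) is:
sum(coef(fit)[c("Quantity", "Quantity:Xbar")])
# 59.572 + (-80.133) = -20.561, the slope implied for Quantity > 89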

Discrepancy between y values calculated using predict() or using explicit fitting equation

If I calculate the y value for a specific x value using the predict() function, I obtain a value different from the one I calculate using the explicit fitting equation.
I fitted the data below using nls(MyEquation) and obtained the m1, m2,... parameters.
Then, I want to reverse calculate the y value for a specific x value using both the predict(m) function or the explicit equation I used for fitting (putting in the desired x value).
I obtain different y values for the same x value. Which one is the correct one?
> df
pH activity
1 3.0 0.88
2 4.0 1.90
3 5.0 19.30
4 6.0 70.32
5 7.0 100.40
6 7.5 100.00
7 8.0 79.80
8 9.0 7.75
9 10.0 1.21
x <- df$pH
y <- df$activity
m<-nls(y~(m1*(10^(-x))+m2*10^(-m3))/(10^(-m3)+10^(-x)) - (m5*(10^(-x))+1*10^(-i))/(10^(-i)+10^(-x)), start = list(m1=1,m2=100,m3=7,m5=1))
> m
Nonlinear regression model
model: y ~ (m1 * (10^(-x)) + m2 * 10^(-m3))/(10^(-m3) + 10^(-x)) - (m5 * (10^(-x)) + 1 * 10^(-i))/(10^(-i) + 10^(-x))
data: parent.frame()
m1 m2 m3 m5
-176.032 13.042 6.282 -180.704
residual sum-of-squares: 1522
Number of iterations to convergence: 14
Achieved convergence tolerance: 5.805e-06
list2env(as.list(coef(m)), .GlobalEnv)
#calculate y based on fitting parameters
# choose the 7th x value (i.e. x[7]) that corresponds to pH = 8
# (using predict)
> x_pH8 <- x[7]
> predict(m)[7]
[1] 52.14299
# (using the explicit fitting equation with the fitted parameters)
> x1 <- x_pH8
> (m1*(10^(-x1))+m2*10^(-m3))/(10^(-m3)+10^(-x1)) - (m5*(10^(-x1))+1*10^(-8.3))/(10^(-8.3)+10^(-x1))
[1] 129.5284
As you can see:
predict(m)[7] gives y = 52.14299 (for x = 8)
while
(m1*(10^(-x1))+m2*10^(-m3))/(10^(-m3)+10^(-x1)) - (m5*(10^(-x1))+1*10^(-8.3))/(10^(-8.3)+10^(-x1)) gives y = 129.5284 (for x = 8)
The value of i you use in the manual calculation is probably not the same as the one you use in the model fitting. I don't get any discrepancy:
x <- df$pH
y <- df$activity
i <- 8.3
m <- nls(y~(m1*(10^(-x))+m2*10^(-m3))/(10^(-m3)+10^(-x)) - (m5*(10^(-x))+1*10^(-i))/(10^(-i)+10^(-x)), start = list(m1=1,m2=100,m3=7,m5=1))
x <- 8
with(as.list(coef(m)),
(m1*(10^(-x))+m2*10^(-m3))/(10^(-m3)+10^(-x)) - (m5*(10^(-x))+1*10^(-i))/(10^(-i)+10^(-x)))
# [1] 75.46504
predict(m)[7]
# [1] 75.46504
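A small side note (a sketch using the model fitted above): instead of indexing the fitted values with predict(m)[7], predict() for nls objects also accepts newdata, so the fit can be evaluated at an arbitrary pH directly:
predict(m, newdata = data.frame(x = 8))
# same value as predict(m)[7], since pH = 8 is the 7th observation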

Panel regression with ar1 model

Excuse my naiveté. I'm not sure what this type of model is called -- perhaps panel regression.
Imagine I have the following data:
n <- 100
x1 <- rnorm(n)
y1 <- x1 * 0.5 + rnorm(n)/2
x2 <- rnorm(n)
y2 <- x2 * 0.5 + rnorm(n)/2
x3 <- rnorm(n)
y3 <- x3 * 0.25 + rnorm(n)/2
x4 <- rnorm(n)
y4 <- x4 * 0 + rnorm(n)/2
x5 <- rnorm(n)
y5 <- x5 * -0.25 + rnorm(n)/2
x6 <- rnorm(n)
y6 <- x6 * -0.5 + rnorm(n) + rnorm(n)/2
x7 <- rnorm(n)
y7 <- x7 * -0.75 + rnorm(n)/2
foo <- data.frame(s=rep(1:100,times=7),
y=c(y1,y2,y3,y4,y5,y6,y7),
x=c(x1,x2,x3,x4,x5,x6,x7),
i=rep(1:7,each=n))
Where y and x are individual AR1 time series measured over 100 seconds (I use 's' instead of 't' for the time variable) divided equally into groups (i). I wish to model these as:
y_t = b_0 + b_1 y_{t-1} + b_2 x_t + e_t
but while taking the group (i) into account:
y_{it} = b_0 + b_1 y_{i,t-1} + b_2 x_{it} + e_{it}
I wish to know if b_2 (the coef on x) is a good predictor of y and how that coef varies with group. I also want to know the R2 and RMSE by group and to predict y_i given x_i and i. The grouping variable can be discrete or continuous.
I gather that this type of problem is called panel regression but it is not a term that is familiar to me. Is using plm in R a good approach to investigate this problem?
Based on the comment below, I guess this is a simple start:
require(dplyr)
require(broom)
fitted_models <- foo %>% group_by(i) %>% do(model = lm(y ~ x, data = .))
fitted_models %>% tidy(model)
fitted_models %>% glance(model)
Since you don't include fixed or random effects in the model, we are dealing with pooled OLS (POLS), which can be estimated using lm or plm.
Let's construct example data of 100 groups and 100 observations for each:
df <- data.frame(x = rnorm(100 * 100), y = rnorm(100 * 100),
group = factor(rep(1:100, each = 100)))
df$ly <- unlist(tapply(df$y, df$group, function(x) c(NA, head(x, -1))))
head(df, 2)
# x y group ly
# 1 1.7893855 1.2694873 1 NA
# 2 0.8671304 -0.9538848 1 1.2694873
Then
m1 <- lm(y ~ ly + x:group, data = df)
is a model with a common autoregressive coefficient and a group-dependent effect of x:
head(coef(m1)[-1:-2], 5)
# x:group1 x:group2 x:group3 x:group4 x:group5
# -0.02057244 0.06779381 0.04628942 -0.11384630 0.06377069
This allows you to plot them, etc. I suppose one thing that you will want to do is to test whether those coefficients are equal. That can be done as follows:
m2 <- lm(y ~ ly + x, data = df)
library(lmtest)
lrtest(m1, m2)
# Likelihood ratio test
#
# Model 1: y ~ ly + x:group
# Model 2: y ~ ly + x
# #Df LogLik Df Chisq Pr(>Chisq)
# 1 103 -14093
# 2 4 -14148 -99 110.48 0.2024
Hence, we cannot reject that the effects of x are the same, as expected.
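As a possible follow-up on the R-squared / RMSE-by-group part of the question (a sketch reusing the dplyr/broom approach the question already started; it assumes the foo data frame with its group column i):
library(dplyr)
library(broom)
foo %>%
  group_by(i) %>%
  do(glance(lm(y ~ x, data = .)))   # r.squared, sigma (residual SE, close to RMSE), etc. per group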

Arrange monte carlo p-value into a matrix for different sample size and variance estimators

The following code works out quite well (based on my previous question). But I have to change the variance estimator (ols, hc0, hc1, hc2, hc3) every time before I run the code. I would like to solve this problem with a loop.
Briefly, the code creates 1000 regression models for each sample size (n = 25, 50, 100, 250, 500, 1000). Each of the 1000 models is then estimated by OLS. After that, I calculate t-statistics based on the estimated beta values of x3 from the 1000 samples. The null hypothesis reads H0: beta3 = beta3h0, i.e. the estimated coefficient of x3 equals the 'real' value, which I defined as 1. In the last step, I check how often the null hypothesis is rejected (significance level = 0.05). My final goal is code that spits out the percentage rejection rate of the null hypothesis for each sample size and variance estimator. Thus, the result should be a matrix, whereas right now I get a vector. I would be pleased if anyone could help me with that. Here is my code:
library(car)
sample_size = c("n=25"=25, "n=50"=50, "n=100"=100, "n=250"=250, "n=500"=500, "n=1000"=1000)
B <- 1000
beta0 <- 1
beta1 <- 1
beta2 <- 1
beta3 <- 1
alpha <- 0.05
simulation <- function(n, beta3h0){
t.test.values <- rep(NA, B)
#simulation of size
for(rep in 1:B){
#data generation
d1 <- runif(n, 0, 1)
d2 <- rnorm(n, 0, 1)
d3 <- rchisq(n, 1, ncp=0)
x1 <- (1 + d1)
x2 <- (3*d1 + 0.6*d2)
x3 <- (2*d1 + 0.6*d3)
# homoskedastic error term: exi <- rchisq(n, 4, ncp = 0)
exi <- sqrt(x3 + 1.6)*rchisq(n, 4, ncp = 0)
y <- beta0 + beta1*x1 + beta2*x2 + beta3*x3 + exi
mydata <- data.frame(y, x1, x2, x3)
#ols estimation
lmobj <- lm(y ~ x1 + x2 + x3, mydata)
#extraction
betaestim <- coef(lmobj)[4]
betavar <- vcov(lmobj)[4,4]
#robust variance estimators: hc0, hc1, hc2, hc3
betavar0 <- hccm(lmobj, type="hc0")[4,4]
betavar1 <- hccm(lmobj, type="hc1")[4,4]
betavar2 <- hccm(lmobj, type="hc2")[4,4]
betavar3 <- hccm(lmobj, type="hc3")[4,4]
#t statistic
t.test.values[rep] <- (betaestim - beta3h0)/sqrt(betavar)
}
mean(abs(t.test.values) > qt(p=c(1-alpha/2), df=n-4))
}
sapply(sample_size, simulation, beta3h0 = 1)
You don't need a double nested loop. Just make sure you get a matrix inside your loop. Update your current simulation with the following:
## set up a matrix
## replacing `t.test.values <- rep(NA, B)`
t.test.values <- matrix(nrow = 5, ncol = B) ## 5 estimators
## update / fill a column
## replacing `t.test.values[rep] <- (betaestim - beta3h0)/sqrt(betavar)`
t.test.values[, rep] <- abs(betaestim - beta3h0) / sqrt(c(betavar, betavar0, betavar1, betavar2, betavar3))
## row means
## replacing `mean(abs(t.test.values) > qt(p=c(1-alpha/2), df=n-4))`
rowMeans(t.test.values > qt(1-alpha/2, n-4))
Now, simulation would return a vector of length 5: for each sample size, the Monte Carlo rejection rate is returned for all 5 variance estimators. Then, when you call sapply, you get a matrix result:
sapply(sample_size, simulation, beta3h0 = 1)
# n=25 n=50 n=100 n=250 n=500 n=1000
#[1,] 0.132 0.237 0.382 0.696 0.917 0.996
#[2,] 0.198 0.241 0.315 0.574 0.873 0.994
#[3,] 0.157 0.220 0.299 0.569 0.871 0.994
#[4,] 0.119 0.173 0.248 0.545 0.859 0.994
#[5,] 0.065 0.122 0.197 0.510 0.848 0.993
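To make the three replacements concrete, a full version of simulation() might look like this (a sketch: it still relies on B, beta0 to beta3, alpha, sample_size and library(car) from the question; the estimator row names are an addition for readability):
simulation <- function(n, beta3h0){
  # one row per variance estimator, one column per replication
  t.test.values <- matrix(nrow = 5, ncol = B,
                          dimnames = list(c("ols", "hc0", "hc1", "hc2", "hc3"), NULL))
  for(rep in 1:B){
    # data generation
    d1 <- runif(n, 0, 1); d2 <- rnorm(n, 0, 1); d3 <- rchisq(n, 1, ncp = 0)
    x1 <- 1 + d1
    x2 <- 3*d1 + 0.6*d2
    x3 <- 2*d1 + 0.6*d3
    exi <- sqrt(x3 + 1.6) * rchisq(n, 4, ncp = 0)
    y <- beta0 + beta1*x1 + beta2*x2 + beta3*x3 + exi
    lmobj <- lm(y ~ x1 + x2 + x3, data.frame(y, x1, x2, x3))
    # beta3 estimate and its variance under each estimator
    betaestim <- coef(lmobj)[4]
    vars <- c(vcov(lmobj)[4, 4],
              hccm(lmobj, type = "hc0")[4, 4], hccm(lmobj, type = "hc1")[4, 4],
              hccm(lmobj, type = "hc2")[4, 4], hccm(lmobj, type = "hc3")[4, 4])
    t.test.values[, rep] <- abs(betaestim - beta3h0) / sqrt(vars)
  }
  rowMeans(t.test.values > qt(1 - alpha/2, df = n - 4))
}
sapply(sample_size, simulation, beta3h0 = 1)   # 5 x 6 matrix of rejection rates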
