I'm trying to fit a number of linear models as shown below. It is important that all interaction terms are sorted lexicographically. Note that the second model is missing the main effect for x.
x = rnorm(100)
y = rnorm(100)
z = x + y + rnorm(100)
m1 = glm(z ~ x + y + x:y)
m2 = glm(z ~ y + x:y)
The models don't label the interaction terms as expected:
m1:
     Estimate Std. Error t value Pr(>|t|)
x:y   -0.1565     0.1151  -1.360   0.1770
m2:
     Estimate Std. Error t value Pr(>|t|)
y:x   -0.2776     0.1416  -1.961   0.0528 .
I understand that there may be a way to use the interaction() function with the lex.order argument but I can't figure out how or, indeed, whether this is the best way to go. Advice?
I am using the following model in R:
glmer(y ~ x + z + (1|id), weights = specification, family = binomial, data = data)
where:
y ~ binomial(y, specification)
Logit(y) = intercept + a*x + b*z
a and b are the coefficients for the x and z variables, and a itself depends on another variable I:
a = a0 + a1*I
Because one of the variables (here x) depends on another variable (here I), I need a hierarchical model to capture the heterogeneity in the mean effect of x.
I would appreciate it if anyone could help me with this problem.
Sorry if the question does not look professional! This is one of my first experiences.
I'm not perfectly sure I understand the question, but: if Logit(y) = intercept + a*x + b*z and a = a0 + a1*I, then it would seem that
Logit(y) = intercept + (a0+a1*I)*x + b*z
This looks like a straightforward interaction model:
glmer(y ~ 1 + x + x:I + z + (1|id), ...)
To make it more explicit, this could also be written as
glmer(y ~ 1 + x + I(x*I) + z + (1|id), ...)
(although the use of I as a predictor variable and in the I() function is a little bit confusing at first glance ...)
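For what it's worth, here is a minimal simulated sketch of that interaction model (all names and parameter values below are made up for illustration, and it assumes the lme4 package):
library(lme4)
set.seed(1)
n  <- 500
id <- factor(rep(1:50, each = 10))            # 50 groups of 10 observations
x  <- rnorm(n); I_var <- rnorm(n); z <- rnorm(n)
# true model: Logit(y) = -0.5 + (0.8 + 0.4*I)*x + 0.3*z + random intercept per id
eta <- -0.5 + (0.8 + 0.4 * I_var) * x + 0.3 * z + rnorm(50)[id]
d   <- data.frame(y = rbinom(n, 1, plogis(eta)), x = x, I = I_var, z = z, id = id)
m   <- glmer(y ~ 1 + x + x:I + z + (1 | id), family = binomial, data = d)
fixef(m)  # the coefficient on x estimates a0; the coefficient on x:I estimates a1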
I have the following fitted model w/o restriction:
reg <- lm(y ~ indi_x + x + inter)
where indi_x = indicator variable for x > 14 and inter = interaction variable for indi_x and x.
I want to impose the restriction that indi_x + (inter * 14) = 0 so that the two segments join at x = 14. I've been using the I() function within lm but am not getting the output I want.
Thanks!
If I understand correctly, you have two slopes that are joined at x = 14, and you want to infer the individual slopes (and possibly the common intercept?)
This would do it:
reg <- lm(y ~ 1 + x + I((x - 14) * (x > 14)))
Note that the coefficient on I((x - 14) * (x > 14)) is the change in slope at x = 14 (i.e. slope_2 - slope_1), so the absolute slope of the second segment is the coefficient on x plus this change. Because that term is zero at x = 14, the two segments are forced to meet there, which is exactly your restriction indi_x + (inter * 14) = 0.
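A quick simulated check (a sketch; the intercept and the two slopes below are made up):
set.seed(1)
x <- runif(200, 0, 30)
y <- 2 + 0.5 * x + 1.5 * (x - 14) * (x > 14) + rnorm(200, sd = 0.5)
reg <- lm(y ~ 1 + x + I((x - 14) * (x > 14)))
coef(reg)
# coefficient on x:         ~0.5 (slope of the first segment)
# coefficient on the I():   ~1.5 (change in slope, so the second segment's slope is ~2.0)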
How can I simulate data so that the coefficients recovered by lm are determined to be particular pre-determined values and have normally distributed residuals? For example, could I generate data so that lm(y ~ 1 + x) will yield (Intercept) = 1.500 and x = 4.000? I would like the solution to be versatile enough to work for multiple regression with continuous x (e.g., lm(y ~ 1 + x1 + x2)) but there are bonus points if it works for interactions as well (lm(y ~ 1 + x1 + x2 + x1*x2)). Also, it should work for small N (e.g., N < 200).
I know how to simulate random data which is generated by these parameters (see e.g. here), but that randomness carries over to variation in the estimated coefficients, e.g., Intercept = 1.488 and x = 4.067.
Related: It is possible to generate data that yields pre-determined correlation coefficients (see here and here). So I'm asking if this can be done for multiple regression?
One approach is to use perfectly symmetrical noise. The noise cancels itself out, so the estimated parameters are exactly the input parameters, yet the residuals appear normally distributed.
x <- 1:100
y <- cbind(1,x) %*% c(1.5, 4)
eps <- rnorm(100)
x <- c(x, x)
y <- c(y + eps, y - eps)
fit <- lm(y ~ x)
# (Intercept) x
# 1.5 4.0
plot(fit)
Residuals are normally distributed...
... but exhibit an unnaturally perfect symmetry!
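One quick way to see that symmetry directly (a sketch using the fit above): the residuals come in exact +/- pairs, so the sorted residuals are their own mirror image.
r <- sort(unname(residuals(fit)))
all.equal(r, -rev(r))  # TRUE, up to floating-point error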
EDIT by OP: I wrote up general-purpose code exploiting the symmetrical-residuals trick. It scales well to more complex models. This example also shows that it works for categorical predictors and interaction effects.
library(dplyr)
# Data and residuals
df = tibble(
# Predictors
x1 = 1:100, # Continuous
x2 = rep(c(0, 1), each=50), # Dummy-coded categorical
# Generate y from model, including interaction term
y_model = 1.5 + 4 * x1 - 2.1 * x2 + 8.76543 * x1 * x2,
noise = rnorm(100) # Residuals
)
# Do the symmetrical-residuals trick
# This is copy-and-paste ready, no matter model complexity.
df = bind_rows(
df %>% mutate(y = y_model + noise),
df %>% mutate(y = y_model - noise) # Mirrored
)
# Check that it works
fit <- lm(y ~ x1 + x2 + x1*x2, df)
coef(fit)
# (Intercept) x1 x2 x1:x2
# 1.50000 4.00000 -2.10000 8.76543
You could do rejection sampling:
set.seed(42)
tol <- 1e-8
x <- 1:100
continue <- TRUE
while(continue) {
y <- cbind(1,x) %*% c(1.5, 4) + rnorm(length(x))
if (sum((coef(lm(y ~ x)) - c(1.5, 4))^2) < tol) continue <- FALSE
}
coef(lm(y ~ x))
#(Intercept) x
# 1.500013 4.000023
Obviously, this is a brute-force approach and the smaller the tolerance and the more complex the model, the longer this will take. A more efficient approach should be possible by providing residuals as input and then employing some matrix algebra to calculate y values. But that's more of a maths question ...
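For what it's worth, here is one way that matrix-algebra idea could look (a sketch, not part of the original answer): project the noise onto the orthogonal complement of the design matrix, so the OLS estimates equal the chosen coefficients exactly while the residuals still look like ordinary normal OLS residuals.
set.seed(42)
x <- 1:100
X <- cbind(1, x)                   # design matrix
beta <- c(1.5, 4)
eps <- rnorm(nrow(X))
eps <- residuals(lm(eps ~ X - 1))  # make the noise orthogonal to the columns of X
y <- drop(X %*% beta) + eps
coef(lm(y ~ x))                    # exactly 1.5 and 4, up to floating-point error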
I want to estimate a regression with two exogenous variables, two endogenous variables, and a pair of fixed effects. Each endogenous variable has its own instrument.
Y = b0 + b1*X1 + b2*X2 + b3*Q + b4*W + C1*factor(id) + C2*factor(firm)
W = d0 + d1*X3
Q = e0 + e1*X4
Here is the part where I generate data for Y, X, Q, and W:
require(lfe)
oldopts <- options(lfe.threads=1)
x <- rnorm(1000)
x2 <- rnorm(length(x))
id <- factor(sample(20,length(x),replace=TRUE))
firm <- factor(sample(13,length(x),replace=TRUE))
id.eff <- rnorm(nlevels(id))
firm.eff <- rnorm(nlevels(firm))
u <- rnorm(length(x))
y <- x + 0.5*x2 + id.eff[id] + firm.eff[firm] + u
x3 <- rnorm(length(x))
x4 <- 5*rnorm(length(x))^2
Q <- 0.3*x3 - 0.3*rnorm(length(x),sd=0.3) - 0.7*id.eff[id]
W <- 0.3*log(x4)- 2*x + 0.1*x2 - 0.2*y+ rnorm(length(x),sd=0.6)
y <- y + Q + W
I can estimate the coefficients using the old lfe syntax
reg <- felm(y~x+x2+G(id)+G(firm),iv=list(Q~x3,W~x4))
But the package strongly discourages the use of the old syntax, and I do not know how to specify different first-stage equations in the new syntax.
If I try the line below, both x3 and x4 will be used in both the Q and W first-stage equations.
reg_new <- felm(y ~ x + x2 | id+firm | (Q|W ~x3 + x4))
I'm sorry for the late answer. As the author of the lfe package, I am not aware of any theory for using different sets of instruments for different endogenous variables; it should not have been allowed in the old syntax either. If one of the instruments is uncorrelated with one of the endogenous variables, its coefficient in the first stage will simply be estimated as (close to) zero. The theory for IV estimation by means of two-stage regression simply uses some matrix identities to split the IV estimation into two stages of ordinary regression, for convenience and reduction to well-known methods. As far as I'm aware, there is no IV theory with separate sets of instruments for the individual endogenous variables.
See e.g. wikipedia's entry on this:
https://en.wikipedia.org/wiki/Instrumental_variable#Estimation
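In practice that means you can simply list all instruments for all endogenous variables in the new multi-part syntax, e.g. (a sketch using the simulated data above; if I recall the lfe documentation correctly, the first-stage fits are returned in the stage1 component of the fitted object):
reg_new <- felm(y ~ x + x2 | id + firm | (Q | W ~ x3 + x4))
summary(reg_new)
# An instrument that is irrelevant for one endogenous variable simply gets a
# near-zero coefficient in that variable's first stage:
coef(reg_new$stage1)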
I was reading the documentation on R Formula, and trying to figure out how to work with depmix (from the depmixS4 package).
Now, in the documentation of depmixS4, the sample formulas tend to be something like y ~ 1.
For a simple case like y ~ x, it is defining a relationship between input x and output y, so I get that it is similar to y = a * x + b, where a is the slope and b is the intercept.
If we go back to y ~ 1, the formula is throwing me off. Is it equivalent to y = 1 (a horizontal line at y = 1)?
To add a bit of context: if you look at the depmixS4 documentation, there is the example below
depmix(list(rt~1,corr~1),data=speed,nstates=2,family=list(gaussian(),multinomial()))
In general, formulas that end with ~ 1 are confusing to me. Can anyone explain what ~ 1 or y ~ 1 means?
Many of the operators used in model formulae in R (asterisk, plus, caret) have a model-specific meaning, and this is one of them: the 'one' symbol indicates an intercept.
In other words, it is the value the dependent variable is expected to have when the independent variables are zero or have no influence. (To use the more common arithmetic meaning of operators inside a formula, you wrap the terms in I().) Intercepts are usually assumed, so the symbol is most often seen when explicitly specifying a model without an intercept.
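A small illustration of that I() point (a sketch with simulated data and made-up names):
set.seed(1)
d <- data.frame(x = rnorm(20), z = rnorm(20))
d$y <- 1 + 2 * d$x + 3 * d$z + rnorm(20)
coef(lm(y ~ x + z, data = d))     # '+' is formula notation: two separate terms
coef(lm(y ~ I(x + z), data = d))  # I() restores arithmetic: one term equal to x + z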
Here are two ways of specifying the same model for a linear regression model of y on x. The first has an implicit intercept term, and the second an explicit one:
y ~ x
y ~ 1 + x
Here are ways to give a linear regression of y on x through the origin (that is, without an intercept term):
y ~ 0 + x
y ~ -1 + x
y ~ x - 1
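As a quick check of these equivalences (a sketch using the city data from the boot package, the same data as in the example below):
library(boot)
data(city)
all.equal(coef(lm(x ~ u, data = city)), coef(lm(x ~ 1 + u, data = city)))      # TRUE
all.equal(coef(lm(x ~ 0 + u, data = city)), coef(lm(x ~ u - 1, data = city)))  # TRUE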
In the specific case you mention (y ~ 1), y is being predicted by no other variable, so the natural prediction is the mean of y, as Paul Hiemstra stated:
> data(city)
> r <- lm(x~1, data=city)
> r
Call:
lm(formula = x ~ 1, data = city)
Coefficients:
(Intercept)
97.3
> mean(city$x)
[1] 97.3
And removing the intercept with a -1 leaves you with nothing:
> r <- lm(x ~ -1, data=city)
> r
Call:
lm(formula = x ~ -1, data = city)
No coefficients
formula() is a function for extracting a formula from existing objects, and its help file isn't the best place to read about specifying model formulae in R. I suggest you look at this explanation or Chapter 11 of An Introduction to R.
If your model were of the form y ~ x1 + x2, this (roughly speaking) represents:
y = β0 + β1(x1) + β2(x2)
Which is of course the same as
y = β0(1) + β1(x1) + β2(x2)
There is an implicit +1 in the above formula. So really, the formula above is y ~ 1 + x1 + x2
We could have a very simple formula in which y does not depend on any other variable. This is the formula you are asking about:
y ~ 1, which roughly equates to
y = β0(1) = β0
As @Paul points out, when you fit this simple model, you get β0 = mean(y).
Here is an example
# Let's make a small sample data frame
dat <- data.frame(y= (-2):3, x=3:8)
# Create the linear model as above
simpleModel <- lm(y ~ 1, data=dat)
## COMPARE THE COEFFICIENTS OF THE MODEL TO THE MEAN(y)
simpleModel$coef
# (Intercept)
# 0.5
mean(dat$y)
# [1] 0.5
In general, such a formula describes the relation between dependent and independent variables in the form of a linear model. The left-hand side contains the dependent variable(s), the right-hand side the independent variables. The independent variables are used to calculate the trend component of the linear model, and the residuals are then assumed to follow some distribution. When the independent part is just one (~ 1), the trend component is a single value, e.g. the mean of the data, i.e. the linear model only has an intercept.