Excuse my naiveté. I'm not sure what this type of model is called -- perhaps panel regression.
Imagine I have the following data:
n <- 100
x1 <- rnorm(n)
y1 <- x1 * 0.5 + rnorm(n)/2
x2 <- rnorm(n)
y2 <- x2 * 0.5 + rnorm(n)/2
x3 <- rnorm(n)
y3 <- x3 * 0.25 + rnorm(n)/2
x4 <- rnorm(n)
y4 <- x4 * 0 + rnorm(n)/2
x5 <- rnorm(n)
y5 <- x5 * -0.25 + rnorm(n)/2
x6 <- rnorm(n)
y6 <- x6 * -0.5 + rnorm(n) + rnorm(n)/2
x7 <- rnorm(n)
y7 <- x7 * -0.75 + rnorm(n)/2
foo <- data.frame(s=rep(1:100,times=7),
y=c(y1,y2,y3,y4,y5,y6,y7),
x=c(x1,x2,x3,x4,x5,x6,x7),
i=rep(1:7,each=n))
Where y and x are individual AR1 time series measured over 100 seconds (I use 's' instead of 't' for the time variable) divided equally into groups (i). I wish to model these as:
y_t= b_0 + b_1(y_{t-1}) + b_2(x_{t}) + e_t
but while taking the group (i) into account:
y_{it} = b_0 + b_1(y_{i,t-1}) + b_2(x_{it}) + e_{it}
I wish to know if b_2 (the coef on x) is a good predictor of y and how that coef varies with group. I also want to know the R2 and RMSE by group and to predict y_i given x_i and i. The grouping variable can be discrete or continuous.
I gather that this type of problem is called panel regression but it is not a term that is familiar to me. Is using plm in R a good approach to investigate this problem?
Based on the comment below, I guess this is a simple start:
require(dplyr)
require(broom)
fitted_models <- foo %>% group_by(i) %>% do(model = lm(y ~ x, data = .)) # the grouping column in foo is named i
fitted_models %>% tidy(model)
fitted_models %>% glance(model)
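The per-group slope, R2 and RMSE asked for above can also be sketched in base R, without dplyr or broom. This is only a minimal illustration under assumptions: the data frame toy, its three groups and their slopes are made up for the example, and the lagged y term is left out, as in the simple start above.

```r
set.seed(1)
n <- 100
slopes <- c(0.5, 0, -0.5)  # assumed true slopes for three illustrative groups
toy <- data.frame(
  i = rep(seq_along(slopes), each = n),
  x = rnorm(n * length(slopes))
)
toy$y <- slopes[toy$i] * toy$x + rnorm(nrow(toy)) / 2

# one lm per group
fits <- lapply(split(toy, toy$i), function(d) lm(y ~ x, data = d))

# collect slope, R-squared and RMSE for each group in one data frame
by_group <- do.call(rbind, lapply(names(fits), function(g) {
  m <- fits[[g]]
  data.frame(
    i     = g,
    slope = unname(coef(m)["x"]),
    r2    = summary(m)$r.squared,
    rmse  = sqrt(mean(residuals(m)^2))
  )
}))
by_group
```

Each row of by_group describes one group, so the slope column shows how the coefficient on x varies with i.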
Since you don't include fixed or random effects in the model, we are dealing with the pooled OLS (POLS) which can be estimated using lm or plm.
Let's construct example data of 100 groups and 100 observations for each:
df <- data.frame(x = rnorm(100 * 100), y = rnorm(100 * 100),
group = factor(rep(1:100, each = 100)))
df$ly <- unlist(tapply(df$y, df$group, function(x) c(NA, head(x, -1))))
head(df, 2)
# x y group ly
# 1 1.7893855 1.2694873 1 NA
# 2 0.8671304 -0.9538848 1 1.2694873
Then
m1 <- lm(y ~ ly + x:group, data = df)
is a model with a common autoregressive coefficient and a group-dependent effect of x:
head(coef(m1)[-1:-2], 5)
# x:group1 x:group2 x:group3 x:group4 x:group5
# -0.02057244 0.06779381 0.04628942 -0.11384630 0.06377069
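Since the question also asks about predicting y given x and the group, here is a hedged sketch of how that might look with this kind of model. The data are simulated with assumed group slopes (beta), and the values in newdata (ly = 0, x = 1) are arbitrary choices for illustration.

```r
set.seed(7)
G <- 4; n_t <- 60
beta <- c(0.5, 0.25, 0, -0.5)  # assumed group-specific effects of x
df <- data.frame(x = rnorm(G * n_t), group = factor(rep(1:G, each = n_t)))
df$y <- beta[as.integer(df$group)] * df$x + rnorm(nrow(df)) / 2
# lag y within each group; the first observation of every group is NA
df$ly <- unlist(tapply(df$y, df$group, function(v) c(NA, head(v, -1))))

m1 <- lm(y ~ ly + x:group, data = df)

# group-specific slopes on x
slopes <- coef(m1)[grep("^x:group", names(coef(m1)))]
slopes

# predict y for groups 1 and 4 at the same x and lagged y
newdata <- data.frame(ly = 0, x = 1,
                      group = factor(c(1, 4), levels = levels(df$group)))
pred <- predict(m1, newdata = newdata)
pred
```

The group column in newdata has to be a factor with the same levels as in the fitted data, otherwise predict cannot match the interaction terms.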
This allows you to plot them, etc. I suppose one thing that you will want to do is to test whether those coefficients are equal. That can be done as follows:
m2 <- lm(y ~ ly + x, data = df)
library(lmtest)
lrtest(m1, m2)
# Likelihood ratio test
#
# Model 1: y ~ ly + x:group
# Model 2: y ~ ly + x
# #Df LogLik Df Chisq Pr(>Chisq)
# 1 103 -14093
# 2 4 -14148 -99 110.48 0.2024
Hence, we cannot reject the null hypothesis that the effect of x is the same in every group, as expected, since x and y were simulated independently.
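If lmtest is not available, a similar comparison of the two nested models can be sketched with base R's anova(), which reports an F test rather than a likelihood ratio test. The smaller example data (5 groups of 50) are an assumption to keep the sketch quick to run.

```r
set.seed(42)
G <- 5; n_t <- 50
df <- data.frame(x = rnorm(G * n_t), y = rnorm(G * n_t),
                 group = factor(rep(1:G, each = n_t)))
df$ly <- unlist(tapply(df$y, df$group, function(v) c(NA, head(v, -1))))

m_common <- lm(y ~ ly + x, data = df)        # one effect of x for all groups
m_group  <- lm(y ~ ly + x:group, data = df)  # one effect of x per group

# F test of the restriction that all group-specific effects are equal
tab <- anova(m_common, m_group)
tab
```

The test has G - 1 numerator degrees of freedom, one for each restriction imposed by the common-slope model.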
Is it ok to run a "plm" fixed effect model and add a factor dummy variable in R as below?
The three factors "Time", "Firm" and "Country" are all separate indices which I want to fix all together.
Instead of making two indices in total by combining "Firm" and "Country", I find the below specification works much better for my case.
Is this an acceptable format?
plm(y ~ lag(x1, 1) + x2 + x3 + x4 + x5 + factor(Country), data=DATA,
index=c("Firm","Time"), model="within")
It is okay to add additional factors. We can prove this by calculating an LSDV model. As a preliminary note, you will of course need robust standard errors, usually clustered at the highest aggregate level, i.e. country in this case.
Note: R >= 4.1 is used in the following.
LSDV
fit1 <-
lm(y ~ d + x1 + x2 + x3 + x4 + factor(id) + factor(time) + factor(country),
dat)
lmtest::coeftest(
fit1, vcov.=sandwich::vcovCL(fit1, cluster=dat$country, type='HC0')) |>
{\(.) .[!grepl('\\(|factor', rownames(.)), ]}()
# Estimate Std. Error t value Pr(>|t|)
# d 10.1398727 0.3181993 31.8664223 4.518874e-191
# x1 1.1217514 1.6509390 0.6794627 4.968995e-01
# x2 3.4913273 2.7782157 1.2566797 2.089718e-01
# x3 0.6257981 3.3162148 0.1887085 8.503346e-01
# x4 0.1942742 0.8998307 0.2159008 8.290804e-01
After adding factor(country), the estimates we get with plm::plm are identical to those from LSDV:
plm::plm
fit2 <- plm::plm(y ~ d + x1 + x2 + x3 + x4 + factor(country),
index=c('id', 'time'), model='within', effect='twoways', dat)
summary(fit2, vcov=plm::vcovHC(fit2, cluster='group', type='HC1'))$coe
# Estimate Std. Error t-value Pr(>|t|)
# d 10.1398727 0.3232850 31.3651179 5.836597e-186
# x1 1.1217514 1.9440165 0.5770277 5.639660e-01
# x2 3.4913273 3.2646905 1.0694206 2.849701e-01
# x3 0.6257981 3.1189939 0.2006410 8.409935e-01
# x4 0.1942742 0.9250759 0.2100089 8.336756e-01
However, cluster='group' will refer to "id" and not to "country", so the standard errors are wrong. It seems that clustering by the additional factor with plm is currently not possible, at least I am not aware of anything.
Alternatively, you may use lfe::felm, which lets you cluster by country while keeping the much shorter computing times relative to LSDV:
lfe::felm
summary(lfe::felm(y ~ d + x1 + x2 + x3 + x4 | id + time + country | 0 | country,
dat))$coe
# Estimate Cluster s.e. t value Pr(>|t|)
# d 10.1398727 0.3184067 31.8456637 1.826374e-33
# x1 1.1217514 1.6520151 0.6790201 5.004554e-01
# x2 3.4913273 2.7800267 1.2558611 2.153737e-01
# x3 0.6257981 3.3183765 0.1885856 8.512296e-01
# x4 0.1942742 0.9004173 0.2157602 8.301083e-01
For comparison, here is what Stata computes; the standard errors closely resemble those of LSDV and lfe::felm:
Stata
. reghdfe y d x1 x2 x3 x4, absorb (country time id) vce(cluster country)
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
d | 10.13987 .3185313 31.83 0.000 9.49907 10.78068
x1 | 1.121751 1.652662 0.68 0.501 -2.202975 4.446478
x2 | 3.491327 2.781115 1.26 0.216 -2.103554 9.086209
x3 | .6257981 3.319675 0.19 0.851 -6.052528 7.304124
x4 | .1942742 .9007698 0.22 0.830 -1.617841 2.006389
_cons | 14.26801 23.65769 0.60 0.549 -33.32511 61.86114
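The point that the within estimator and LSDV give the same coefficients can be checked in base R in a simpler setting. This is a minimal sketch under assumptions: a one-way (individual) fixed effect only, with made-up data, and demeaning done by hand with ave() instead of plm.

```r
set.seed(1)
id <- rep(1:10, each = 5)                                # 10 individuals, 5 periods each
x  <- rnorm(50)
y  <- 2 * x + rep(rnorm(10), each = 5) + rnorm(50)       # individual effect + noise

# LSDV: one dummy per individual
lsdv <- lm(y ~ x + factor(id))

# within transformation: subtract individual means, then regress without intercept
y_dm <- y - ave(y, id)
x_dm <- x - ave(x, id)
within_fit <- lm(y_dm ~ 0 + x_dm)

b_lsdv   <- unname(coef(lsdv)["x"])
b_within <- unname(coef(within_fit)["x_dm"])
all.equal(b_lsdv, b_within)
```

Only the point estimates coincide here; the standard errors reported by the hand-demeaned lm use the wrong degrees of freedom, which is one reason to rely on a dedicated package.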
Simulated Panel Data:
n1 <- 20; t1 <- 4; n2 <- 48
dat <- expand.grid(id=1:n1, time=1:t1, country=1:n2)
set.seed(42)
dat <- within(dat, {
id <- as.vector(apply(matrix(1:(n1*n2), n1), 2, rep, t1))
d <- runif(nrow(dat), 70, 80)
x1 <- sample(0:1, nrow(dat), replace=TRUE)
x2 <- runif(nrow(dat))
x3 <- runif(nrow(dat))
x4 <- rnorm(nrow(dat))
y <-
10*d + ## treatment effect
as.vector(replicate(n2, rep(runif(n1, 2, 5), t1))) + ## id FE
rep(runif(n1, 10, 12), each=t1) + ## time FE
rep(runif(n2, 10, 12), each=n1*t1) + ## country FE
- .7*x1 + 1.3*x2 + 2.4*x3 +
.5 * x4 + rnorm(nrow(dat), 0, 50)
})
readstata13::save.dta13(dat, 'panel.dta') ## for Stata
Consider a two variable (Y1, Y2) problem, with each variable defined as follows:
Y1 = 1 + Z1, and Y1 is fully observed
Y2 = 5 + 2*(Z1) + Z2, and Y2 is missing if 2*(Y1 − 1) + Z3 < 0
Z1, Z2, and Z3 follow independent standard normal distributions.
How would we go about simulating a (complete) dataset of size 500 on (Y1, Y2)? This is what I wrote below:
n <- 500
y <- rnorm(n)
How would we simulate the corresponding observed dataset (by imposing missingness on Y2)? I'm not sure where to go with this question.
n <- 500
z1 <- rnorm(n)
z2 <- rnorm(n)
z3 <- rnorm(n)
y1 <- 1 + z1
y2 <- 5 + 2*z1 + z2
Display the marginal distribution of Y2 for the complete (as originally simulated) and observed (after imposing missingness) data.
Another way to display the distributions, in addition to the great explanation by @jay.sf, is to build the missing-data mechanism into a new variable and compare y2 with y2_missing:
library(ggplot2)
library(dplyr)
library(tidyr)
set.seed(123)
#Data
n <- 500
#Random vars
z1 <- rnorm(n)
z2 <- rnorm(n)
z3 <- rnorm(n)
#Design Y1 and Y2
y1 <- 1+z1
y2 = 5 + 2*(z1) + z2
#For missing
y2_missing <- y2
#Set missing
index <- which(((2*(y1-1))+z3)<0)
y2_missing[index]<-NA
#Complete dataset
df <- data.frame(y1,y2,y2_missing)
#Plot distributions
df %>% select(-y1) %>%
pivot_longer(everything()) %>%
ggplot(aes(x=value,fill=name))+
geom_density(alpha=0.5)+
ggtitle('Distribution for y2 and y2_missing')+
labs(fill='Variable')+
theme_bw()
Output: [density plot comparing y2 and y2_missing; image not shown]
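A numeric summary can complement the density plot. The sketch below repeats the simulation from the question in base R and compares the mean of Y2 before and after imposing missingness; the name y2_obs is introduced here only for illustration.

```r
set.seed(123)
n  <- 500
z1 <- rnorm(n); z2 <- rnorm(n); z3 <- rnorm(n)
y1 <- 1 + z1
y2 <- 5 + 2 * z1 + z2

# Y2 is missing whenever 2*(Y1 - 1) + Z3 < 0
y2_obs <- ifelse(2 * (y1 - 1) + z3 < 0, NA, y2)

# share of missing values and the two means
mean(is.na(y2_obs))
c(complete = mean(y2), observed = mean(y2_obs, na.rm = TRUE))
```

Because missingness depends on Y1, which is positively correlated with Y2, the observed values of Y2 tend to be larger than the complete ones, so the observed mean is shifted upward.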
You probably want to include an error term in your data simulation, so another mean-zero vector should be added to the equation, again using rnorm(n).
seed <- sample(1:1e3, 1)
set.seed(635) ## for sake of reproducibility
n <- 500
z1 <- rnorm(n)
z2 <- rnorm(n)
To create the missing values, you may sample a given share of the vector's positions and set them to NA.
y2 <- 5 + 2*z1 + z2 + rnorm(n) ## add error term independent of the `z`s
pct.mis <- .1 ## percentage missings
y2[sample(length(y2), length(y2)*pct.mis)] <- NA
## check 1: resulting missings
prop.table(table(is.na(y2)))
# FALSE TRUE
# 0.9 0.1
summary(y2)
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# -2.627 3.372 5.123 4.995 6.643 13.653 50
## check 2: rounded coefficients resemble equation
fit <- lm(y2 ~ z1 + z2)
round(fit$coe)
# (Intercept) z1 z2
# 5 2 1
## check 3: number of fitted values equals number of non-missing obs.
length(fit$fitted.values) / length(y2)
# [1] 0.9
Consider the two data.frames below. In each case I want to extract the intercept, and slopes for the three variables from the associated models.
set.seed(911)
df1 <- data.frame(y=rnorm(10) + 1:10, x=1:10, x2=rnorm(10), x3 = rnorm(10))
model1 <- lm(y ~ x + x2 + x3, data = df1)
summary(model1)
summary(model1)$coefficients[1]
summary(model1)$coefficients[2]
summary(model1)$coefficients[3]
summary(model1)$coefficients[4]
set.seed(911)
df2 <- data.frame(y=rnorm(10) + 1:10, x=1:10, x2=1, x3 = rnorm(10))
model2 <- lm(y ~ x + x2 + x3, data = df2)
summary(model2)
summary(model2)$coefficients[1]
summary(model2)$coefficients[2]
summary(model2)$coefficients[3]
summary(model2)$coefficients[4]
However, in the second example there is no variation in x2 and so the coefficient estimate is NA. Importantly, summary(model2) prints the NA but summary(model2)$coefficients[3] does not return the NA but skips and moves to the next parameter.
But instead I would want:
0.9309032
0.8736204
NA
0.5494
If I do not know in advance which coefficients will be NA, i.e. it could be x1, x2, or x2 & x3, or even something like x1 & x2 & x3, how can I return the result I want?
Grab them directly from the model. No need for using summary():
> model2$coefficients
(Intercept) x x2 x3
0.9309032 0.8736204 NA 0.5493671
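If other columns of the summary table are needed as well (standard errors, for example), they can be aligned with the full coefficient vector by name. This is a sketch of one way to do it; the helper objects est, tab and se are names chosen for this example.

```r
set.seed(911)
df2 <- data.frame(y = rnorm(10) + 1:10, x = 1:10, x2 = 1, x3 = rnorm(10))
model2 <- lm(y ~ x + x2 + x3, data = df2)

est <- coef(model2)           # full-length vector, NA kept for x2
tab <- coef(summary(model2))  # matrix without a row for the aliased x2

# look up each coefficient name in the summary table; unmatched names give NA
rows <- match(names(est), rownames(tab))
se   <- setNames(tab[rows, "Std. Error"], names(est))

cbind(estimate = est, std_error = se)
```

Indexing by position, as in summary(model2)$coefficients[3], is fragile because the table shrinks whenever a coefficient is aliased; matching by name avoids that.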
Is it possible to replace coefficients in an lm object?
I thought the following would work
# sample data
set.seed(2157010)
x1 <- 1998:2011
x2 <- x1 + rnorm(length(x1))
y <- 3*x1 + rnorm(length(x1))
fit <- lm( y ~ x1 + x2)
# view original coefficients
coef(fit)
# replace coefficients with new values
fit$coef(fit$coef[2:3]) <- c(5, 1)
# view new coefficients
coef(fit)
Any assistance would be greatly appreciated
Your code is not reproducible, as there are a few errors in it. Here's a corrected version, which also shows your mistake:
set.seed(2157010) #forgot set.
x1 <- 1998:2011
x2 <- x1 + rnorm(length(x1))
y <- 3*x2 + rnorm(length(x1)) #you had x, not x1 or x2
fit <- lm( y ~ x1 + x2)
# view original coefficients
coef(fit)
(Intercept) x1 x2
260.55645444 -0.04276353 2.91272272
# replace coefficients with new values, use whole name which is coefficients:
fit$coefficients[2:3] <- c(5, 1)
# view new coefficients
coef(fit)
(Intercept) x1 x2
260.5565 5.0000 1.0000
So the problem was that you were using fit$coef, although the component in the lm output is really named coefficients. The abbreviated version works for getting the values but not for setting them: the assignment creates a new component named coef, while the coef function keeps extracting the values of fit$coefficients.
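The difference between reading and writing with an abbreviated name can be seen in a small sketch. The data frame d and the replacement values 99 and 5 are arbitrary choices for illustration.

```r
set.seed(1)
d <- data.frame(x = 1:10)
d$y <- 2 * d$x + rnorm(10)
fit <- lm(y ~ x, data = d)

# reading: $ partially matches "coef" to the component "coefficients"
identical(fit$coef, fit$coefficients)

# writing: the abbreviated name is NOT matched, a new component "coef" is added
fit$coef[2] <- 99
"coef" %in% names(fit)
slope_after_abbrev <- coef(fit)[["x"]]  # still the original estimate

# writing with the full name changes what coef() returns
fit$coefficients[2] <- 5
slope_after_full <- coef(fit)[["x"]]
c(slope_after_abbrev, slope_after_full)
```

Note that only the stored coefficients change; residuals, fitted values and standard errors in the object still belong to the original fit.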
I'm looking for suggestions on how to deal with NA's in linear regressions when all occurrences of an independent/explanatory variable are NA (i.e. x3 below).
I know the obvious solution would be to exclude the independent/explanatory variable in question from the model, but I am looping through multiple regions and would prefer not to have a different functional form for each region.
Below is some sample data:
set.seed(23409)
n <- 100
time <- seq(1,n, 1)
x1 <- cumsum(runif(n))
y <- .8*x1 + rnorm(n, mean=0, sd=2)
x2 <- seq(1,n, 1)
x3 <- rep(NA, n)
df <- data.frame(y=y, time=time, x1=x1, x2=x2, x3=x3)
# Quick plot of data
library(ggplot2)
library(reshape2)
df.melt <-melt(df, id=c("time"))
p <- ggplot(df.melt, aes(x=time, y=value)) +
geom_line() + facet_grid(variable ~ .)
p
I have read the documentation for lm and tried various na.action settings without success:
lm(y~x1+x2+x3, data=df, singular.ok=TRUE)
lm(y~x1+x2+x3, data=df, na.action=na.omit)
lm(y~x1+x2+x3, data=df, na.action=na.exclude)
lm(y~x1+x2+x3, data=df, singular.ok=TRUE, na.exclude=na.omit)
lm(y~x1+x2+x3, data=df, singular.ok=TRUE, na.exclude=na.exclude)
Is there a way to get lm to run without error and simply return a coefficient for the explanatory variable in question that reflects its lack of explanatory power (i.e. either zero or NA)?
Here's one idea:
set.seed(23409)
n <- 100
time <- seq(1,n, 1)
x1 <- cumsum(runif(n))
y <- .8*x1 + rnorm(n, mean=0, sd=2)
x2 <- seq(1,n, 1)
x3 <- rep(NA, n)
df <- data.frame(y=y, time=time, x1=x1, x2=x2, x3=x3)
replaceNA<-function(x){
if(all(is.na(x))){
rep(0,length(x))
} else x
}
lm(y~x1+x2+x3, data= data.frame(lapply(df,replaceNA)))
Call:
lm(formula = y ~ x1 + x2 + x3, data = data.frame(lapply(df, replaceNA)))
Coefficients:
(Intercept) x1 x2 x3
0.05467 1.01133 -0.10613 NA
lm(y~x1+x2, data=df)
Call:
lm(formula = y ~ x1 + x2, data = df)
Coefficients:
(Intercept) x1 x2
0.05467 1.01133 -0.10613
So you replace each variable that contains only NAs with a variable that contains only 0s. You get the coefficient value NA, but all the relevant parts of the model fit are the same (except the QR decomposition, but if information about that is needed, it can easily be modified). Note that the component summary(fit)$alias (see ?alias) might be useful.
This seems to relate to your other question: Replace lm coefficients in [r]
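To find out afterwards which terms were dropped in this way, the aliased component of the model summary can be inspected. A minimal sketch, assuming a toy data frame d in which x3 stands in for an all-NA column already replaced by zeros:

```r
set.seed(1)
d <- data.frame(y = rnorm(20), x1 = rnorm(20), x3 = 0)  # x3: all-NA column replaced by 0
fit <- lm(y ~ x1 + x3, data = d)

# named logical vector: TRUE for coefficients that could not be estimated
aliased <- summary(fit)$aliased
aliased

# names of the dropped terms
dropped <- names(which(aliased))
dropped
```

When looping over regions, storing dropped alongside each fit makes it easy to see where a variable carried no information.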
You won't be able to include a column with all NA values. It does strange things to model.matrix
x1 <- 1:5
x2 <- rep(NA,5)
model.matrix(~x1+x2)
(Intercept) x1 x2TRUE
attr(,"assign")
[1] 0 1 2
attr(,"contrasts")
attr(,"contrasts")$x2
[1] "contr.treatment"
So your alternative is to programmatically create the model formula based on the data.
Something like...
make_formula <- function(variables, data, response = 'y'){
if(missing(data)){stop('data not specified')}
using <- Filter(variables,f= function(i) !all(is.na(data[[i]])))
deparse(reformulate(using, response))
}
variables <- c('x1','x2','x3')
make_formula(variables, data =df)
[1] "y ~ x1 + x2"
I've used deparse to return a character string so that there are no environment issues from creating the formula within the function. lm will happily take a character string that is a valid formula.
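For the loop over regions mentioned in the question, a sketch of how this helper might be used follows. Several things are assumptions here: the list regions and its contents are made up, make_formula is a slight variant of the function above (it collapses the deparse output, which can span several strings for long formulas), and as.formula is called explicitly rather than passing the string to lm.

```r
# variant of make_formula: keeps only variables that are not entirely NA
make_formula <- function(variables, data, response = "y") {
  if (missing(data)) stop("data not specified")
  using <- Filter(function(i) !all(is.na(data[[i]])), variables)
  paste(deparse(reformulate(using, response)), collapse = " ")
}

set.seed(23409)
n <- 50
regions <- list(  # hypothetical regional data sets
  north = data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n)),
  south = data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n), x3 = NA)
)
variables <- c("x1", "x2", "x3")

# fit each region with whatever variables it actually has
fits <- lapply(regions, function(d) {
  lm(as.formula(make_formula(variables, d)), data = d)
})

# one row per region, NA where a variable was unavailable
coefs <- t(sapply(fits, function(m) coef(m)[c("(Intercept)", variables)]))
colnames(coefs) <- c("(Intercept)", variables)
coefs
```

The resulting matrix has the same columns for every region, so downstream code does not need to know which variables were missing where.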