R: Using a variable with less observations in a regression (plm) - r

I have been trying to deal with this for a while now with no luck. Essentially, what I am doing is a two-stage least squares on some panel data. To do this I am using the plm package. What I want to do is
Do a 2SLS
Get the residuals from the 2SLS in 1.
Use these residuals as an instrument in a different 2SLS
The issue I have is that in the first 2SLS the number of observations used is less than the total observations in the dataset, so my residuals vector is short and I get the following error
Error in model.frame.default(terms(formula, lhs = lhs, rhs = rhs, data = data, :
variable lengths differ (found for 'ivreg.2.a$residuals')
Here is the code I am trying to run for reference, let me know if you need any more details. I really just need my residual vector to be the same length as the data used in the first 2SLS. For reference my data has 1713 observations, however, only 1550 get used in the regression and as a result my residuals vector is length 1550. My code for the two 2SLS regressions is below.
ivreg.2.a = plm(formula = diff(loda) ~ factor(year)+diff(lgdp) | index_g_l + diff(lcru_l) + diff(lcru_l_sq) + factor(year), index = c("country", "year"), model = "within", data = panel[complete.cases(panel[, c(1,2,3,4,5,7)]),])
ivreg.2.a = plm(formula = diff(lgdp) ~ factor(year)+index_g_l + diff(lcru_l) + diff(lcru_l_sq) + diff(loda)| index_g_l + diff(lcru_l) + diff(lcru_l_sq) + factor(year) + ivreg.2.a$residuals, index = c("country", "year"), model = "within", data = panel[complete.cases(panel[, c(1,2,3,4,5,7)]),])
Let me know if you need anything else.

I assume the 163 observations are dropped because they have NA in one of the relevant variables. Most *lm functions in R have a na.action argument, which can be used to pad the residuals to correct length. E.g., when missing observation 3,
residuals(lm(formula, data, na.action=na.omit)) # 1 2 4
residuals(lm(formula, data, na.action=na.exclude)) # 1 2 NA 4
Documentation of plm, however, says that this argument is "currently not fully supported", so it would be simpler if you just filter those 1550 rows to a new dataframe first, and do all subsequent work on that.
BTW, if plm behaves like lm, you shouldn't need to specify complete.cases for it to work, as it should just skip anything with NAs.

Related

Panel regression in R with variable coefficients

We have tried to do two (similar) panel regressions in R.
1) One with time and individual fixed effects (the usual intercept dummies) using plm(). However, we are only interested mostly interested in a "slope coefficient" or beta for each individual and not one for all of the individuals:
Regression 1
Where alpha_i is the individual fixed effect, gamme_t is the time fixed effect. The sum of X is the variable X and three lags:
Sum X variable
We have already included the lagged X variables as new columns in our dataset so in our specification in the code we simply treat them as four different variables:
This is an attempt at using plm() and include our own dummy variables for each individual Beta
plm(income ~ (factor(firmid)-1)*(expense_rate + lag1 + lag2 + lag3), data = data1,
effect = c("time"), model = c("within"), index = c("name", "date"))
The lag1, lag,2, lag2 are the lagged variables of expense rate.
Data1 is in the form of a data frame.
“(factor(firmid)-1)” is an attempt at introducing dummies to get Betas for each individual instead of one for all individuals.
2) The second (and simpler) regression is:
Regression 2
This is an example of our attempt at using pvcm
pvcm1 <- pvcm(income ~ expense_rate + lag1 + lag2 + lag3, data = data1,
effect = "individual", model = "within")
Our question is what specific code and or packages/functions which would be suitable for these regressions. We have tried pvcm to no avail, running into errors such as:
“Error in table(index[1], index[2], useNA = "ifany") :
attempt to make a table with >= 2^31 elements”
and
“Error: cannot allocate vector of size 599.7 Gb”
. Furthermore, pvcm() does not seem to be able to cope with both individua and time fixed effects as in 1).

Stepwise regression in r with mixed models: numbers of rows changing [duplicate]

I want to run a stepwise regression in R to choose the best fit model, my code is attached here:
full.modelfixed <- glm(died_ed ~ age_1 + gender + race + insurance + injury + ais + blunt_pen +
comorbid + iss +min_dist + pop_dens_new + age_mdn + male_pct +
pop_wht_pct + pop_blk_pct + unemp_pct + pov_100x_npct +
urban_pct, data = trauma, family = binomial (link = 'logit'), na.action = na.exclude)
reduced.modelfixed <- stepAIC(full.modelfixed, direction = "backward")
There is a error message said
Error in stepAIC(full.modelfixed, direction = "backward") :
number of rows in use has changed: remove missing values?
Almost every variable in the data has some missing values, so I cannot delete all missing values (data = na.omit(data))
Any idea on how to fix this?
Thanks!!
This should probably be in a stats forum (stats.stackexchange) but briefly there are a number of considerations.
The main one is that when comparing two models they need to be fitted on the same dataset (i.e you need to be able to nest the models within each other).
For examples
glm1 <- glm(Dependent~indep1+indep2+indep3, family = binomial, data = data)
glm2 <- glm(Dependent~indep2+indep2, family = binomial, data = data)
Now imagine that we are missing values of indep3 but not indep1 or indep2.
When we run glm1 we are running it on a smaller dataset - the dataset for which we have the dependent variable and all three independent ones (i.e we exclude any rows where indep3 values are missing).
When we run glm2 the rows missing a value for indep3 are included because those rows do contain dependent, indep1 and indep2 which are the models in the variable.
We can no longer directly compare models as they are fitted on different datasets.
I think broadly you can either
1) Limit to data which is complete
2) If appropriate consider multiple imputation
Hope that helps.
You can use the MICE package to do imputation, then working with the dataset will not give you errors

How to deal with the error variable lengths differ

I have a panel data covering several individuals over a five year period and I want to do a 2SLS estimation using the plm package. My Instrumental Variables are the first lag of the endogenous variables, S and S_active. When I run the second stage IV regression I have the following errors:
variable lengths differ (found for 'S_hat')
Please, can someone help me out on how to resolve this error?
Apparently, by making use of the first lag of the S and S_active variables in the first stage regression, I will lose observations for one year for each panel. As such I understand the error to mean that the fitted values of my endogenous variables will have a shorter length compared to the original data. And that's why my second stage IV regressions throws up that error.
I have googled how to deal with this error and came across quite similar questions here. But none has particularly addressed my specific situation.
I tried another suggested solution (ie. to add the fitted values to the original data) but got another error:
arguments imply differing number of rows: 9196, 7192
The following is how my code looks like:
library(bootstrap)
library(AER)
library(systemfit)
library(sandwich)
library(lmtest)
library(boot)
library(laeken)
library(smoothmest)
library(glm2)
library(tidyverse)
library(foreign)
library(plm)
data_09=read.csv("panel2009.csv")
attach(data_09)
table(is.na(data_09))
FALSE
1480556
#First Stage. Run plm, regressing S on exogenous idependent variables including the IV, lag(S)
firstpan=plm(S~ act + plm::lag(S) + S_active + C_tot, data=data_09, na.action = na.exclude, index = c("individual","year"), method = "within", effect = "twoways")
summary((firstpan))
#First Stage. Run plm, regressing S_active on exogenous idependent variables including the IV, lag(S_active)
secondpan=plm(S_active ~act + act*plm::lag(towater) + S + C_tot, data=data_09, na.action = na.exclude, index = c("individual","year"), method = "within", effect = "twoways")
summary((secondpan))
#Collect fitted values and add them to data
S_hat=fitted(firstpan)
S_active_hat=fitted(secondpan)
#Run the standard 2SLS using S_hat=fitted and S_active_hat as instruments
iv=ivreg(Y ~ act+ S + S_active + C_tot|act + S_hat +S_active_hat + C_tot,data=data_09, na.action=na.exclude)
summary(iv)
Error in model.frame.default(terms(formula, lhs = lhs, rhs = rhs, data = data, :
variable lengths differ (found for 'S_hat')
The first stage regressions run succesfully, but the second-stage IV regessions throws up the error: Error in model.frame.default(terms(formula, lhs = lhs, rhs = rhs, data = data, :
variable lengths differ (found for 'S_hat')

Force inclusion of observations with missing data in lmer

I want to fit a linear mixed-effects model using lme4::lmer without discarding observations with missing data. That is, I want lmer to go ahead and maximize the likelihood using all the data.
Am I correct in thinking that using na.pass produces this behavior? This unanswered question is making me wonder if this might be wrong.
lmer(like most model functions) can't deal with missing data. To illustrate that:
data(Orthodont,package="nlme")
Orthodont$nsex <- as.numeric(Orthodont$Sex=="Male")
Orthodont$nsexage <- with(Orthodont, nsex*age)
Orthodont[1, 2] <- NA
lmer(distance ~ age + (age|Subject) + (0+nsex|Subject) +
(0 + nsexage|Subject), data=Orthodont, na.action = na.pass)
#Error in lme4::lFormula(formula = distance ~ age + (age | Subject) + (0 + :
# NA in Z (random-effects model matrix): please use "na.action='na.omit'" or "na.action='na.exclude'"
If you don't want to discard observations with missing data, your only option is imputation. Check out packages like mice or Amelia.

Why doesn't predict like the dimensions of my newdata?

I want to perform a multiple regression in R and make predictions based on the trained model. Below is an example code I am using:
price = c(10,18,18,11,17)
predictors = cbind(c(5,6,3,4,5),c(2,1,8,5,6))
predict(lm(price ~ predictors), data.frame(predictors=matrix(c(3,5),nrow=1)))
So, based on the 2-variate regression model trained by 5 samples, I want to make a prediction for the test data point where the first variate is 3 and second variate is 5. But I get a warning from above code saying that 'newdata' had 1 rows but variable(s) found have 5 rows. How can I correct above code? Below code works fine where I give the variables separately to the model formula. But since I will have hundreds of variates, I have to give them in a matrix since it would be unfeasible to append hundreds of columns using + sign.
price = c(10,18,18,11,17)
predictor1 = c(5,6,3,4,5)
predictor2 = c(2,1,8,5,6)
predict(lm(price ~ predictor1 + predictor2), data.frame(predictor1=3,predictor2=5))
Thanks in advance!
The easiest way to get past the issue of matching up variable names from a matrix of covariates to newdata data.frame column names is to put your input data into a data.frame as well. Try this
price = c(10,18,18,11,17)
predictors = cbind(c(5,6,3,4,5),c(2,1,8,5,6))
indata<-data.frame(price,predictors=predictors)
predict(lm(price ~ ., indata), data.frame(predictors=matrix(c(3,5),nrow=1)))
Here we combine price and predictors into a data.frame such that it will be named the same say as the newdata data.frame. We use the . in the formula to mean "all other columns" so we don't have to specify them explicitly.
Need to build the model first, then predict from it:
mod1 <- lm(price ~ predictor1 + predictor2)
predict( mod1 , data.frame(predictor1=3,predictor2=5))

Resources