How to deal with the error variable lengths differ - r

I have panel data covering several individuals over a five-year period and I want to do a 2SLS estimation using the plm package. My instrumental variables are the first lags of the endogenous variables, S and S_active. When I run the second-stage IV regression I get the following error:
variable lengths differ (found for 'S_hat')
Please, can someone help me out on how to resolve this error?
Apparently, by using the first lag of the S and S_active variables in the first-stage regression, I lose one year of observations for each panel. As such, I understand the error to mean that the fitted values of my endogenous variables are shorter than the original data, and that is why my second-stage IV regression throws that error.
I have googled how to deal with this error and came across quite similar questions here, but none addresses my specific situation.
I tried another suggested solution (i.e. adding the fitted values to the original data) but got another error:
arguments imply differing number of rows: 9196, 7192
This is what my code looks like:
library(bootstrap)
library(AER)
library(systemfit)
library(sandwich)
library(lmtest)
library(boot)
library(laeken)
library(smoothmest)
library(glm2)
library(tidyverse)
library(foreign)
library(plm)
data_09=read.csv("panel2009.csv")
attach(data_09)
table(is.na(data_09))
FALSE
1480556
#First Stage. Run plm, regressing S on exogenous independent variables including the IV, lag(S)
firstpan=plm(S~ act + plm::lag(S) + S_active + C_tot, data=data_09, na.action = na.exclude, index = c("individual","year"), method = "within", effect = "twoways")
summary((firstpan))
#First Stage. Run plm, regressing S_active on exogenous independent variables including the IV, lag(S_active)
secondpan=plm(S_active ~act + act*plm::lag(towater) + S + C_tot, data=data_09, na.action = na.exclude, index = c("individual","year"), method = "within", effect = "twoways")
summary((secondpan))
#Collect fitted values and add them to data
S_hat=fitted(firstpan)
S_active_hat=fitted(secondpan)
#Run the standard 2SLS using S_hat=fitted and S_active_hat as instruments
iv=ivreg(Y ~ act+ S + S_active + C_tot|act + S_hat +S_active_hat + C_tot,data=data_09, na.action=na.exclude)
summary(iv)
Error in model.frame.default(terms(formula, lhs = lhs, rhs = rhs, data = data, :
variable lengths differ (found for 'S_hat')
The first-stage regressions run successfully, but the second-stage IV regression throws the error:
Error in model.frame.default(terms(formula, lhs = lhs, rhs = rhs, data = data, :
variable lengths differ (found for 'S_hat')
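For illustration, the length mismatch described above can be reproduced on a small scale. This is only a sketch under stated assumptions: it uses base lm() on made-up data rather than plm, and the names toy, x_lag, x_hat and first_stage are hypothetical. The first-stage fit drops the rows where the lag is NA, so the fitted vector is shorter than the data; the row names it carries can be used to put the values back into the right rows. plm objects may not behave identically.

```r
# Toy panel: 3 individuals x 4 years (made-up data, not the poster's).
set.seed(1)
toy <- data.frame(
  individual = rep(1:3, each = 4),
  year       = rep(2009:2012, times = 3),
  x          = rnorm(12),
  z          = rnorm(12)
)

# First lag of x within each individual; the first year of each panel is NA.
toy$x_lag <- ave(toy$x, toy$individual,
                 FUN = function(v) c(NA, head(v, -1)))

# With R's default na.action (na.omit) the NA rows are dropped from the fit.
first_stage <- lm(x ~ x_lag + z, data = toy)
length(fitted(first_stage))  # shorter than nrow(toy)

# Align by row name: rows that were dropped from the fit stay NA.
toy$x_hat <- NA_real_
toy[names(fitted(first_stage)), "x_hat"] <- fitted(first_stage)
```

Once x_hat lives in the same data frame as the other variables, a second-stage call that takes data = toy sees columns of equal length.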

Related

Marginal Effect from svyglm object with a subsample in R

I need to compute marginal effects out of a Generalized Linear Model (family=Poisson) estimated via the svyglm function from the R package survey for a subsample.
First, I declared the survey design with:
myDesisgn = svydesign(id=data$id, strata=data$strata, weights=data$sw, data=data)
Second, I estimated my model as:
fit = svyglm(y~ x1 +x2, design=myDesisgn, data=data, subset= x3 == 1, family= poisson(link = "log"))
Finally, when I want to get the Average Marginal Effect for, let's say, x1 I run:
summary(margins(fit, variables = "x1", design=myDesisgn))
... but I get the following error message:
"Error in h(simpleError(msg, call)) :
error in evaluating the argument 'object' in selecting a method for function 'summary': 'x' and 'w' must have the same length"
Running the following does not work either:
summary(margins(fit, variables = "x1", design=myDesisgn, subset=x3==1))
Solution:
summary(margins(fit, variables = "x1", design=myDesisgn[myDesisgn$variables$x3 == 1]))
Subsetting complex surveys leads to problems in the error estimation. When interested in a parameter for a specific subsample, one should use the desired subsample to estimate the parameter of interest and the full sample for the estimation of its error.
For example, svyglm(y~x, data=data, subset = z == 1) does exactly this (beta_hat estimated using observations for which z=1 and se(beta_hat) using the full sample).
Subsetting a svy design is possible, and it keeps the original design information about the number of clusters and strata. The code shown above is the "manual" way of doing so. Alternatively, one can rely directly on the subset.survey.design {survey} function:
myDesign_subset <- subset(myDesisgn, data$x3 == 1)
The two methods are equivalent and produce correct z-stats.

Error in brglm model with Backward elimination with Interaction: error in do.call("glm.control", control) : second argument must be a list

After fitting a model with glm I got this as a result:
Warning message:
glm.fit: Adjusted probabilities with numerical value 0 or 1.
After some research on Google, I tried with the brglm package. When I try to apply backward elimination on the model, I get the following error:
Error in do.call("glm.control", control) : second argument must be a list.
I searched on Google but I didn't find anything.
Here is my code with brglm:
library(mlbench)
#require(Amelia)
library(caTools)
library(mlr)
library(ciTools)
library(brglm)
data("BreastCancer")
data_bc <- BreastCancer
data_bc
head(data_bc)
dim(data_bc)
#Delete id column
data_bc<- data_bc[,-1]
data_bc
dim(data_bc)
str(data_bc)
# convert all factors columns to be numeric except class.
for(i in 1:9){
data_bc[,i]<- as.numeric(as.character(data_bc[,i]))
}
str(data_bc)
#convert class: benign and malignant to binary 0 and 1:
data_bc$Class<-ifelse(data_bc$Class=="malignant",1,0)
# now convert class to factor
data_bc$Class<- factor(data_bc$Class, levels = c(0,1))
str(data_bc)
model <- brglm(formula = Class~.^2, data = data_bc, family = "binomial",
na.action = na.exclude )
summary(model)
#Backward Elimination:
final <- step(model, direction = "backward")
You can work around this by using the brglm2 package, which supersedes the brglm package anyway:
model <- glm(formula = Class~.^2, data = na.omit(data_bc), family = "binomial",
na.action = na.fail, method="brglmFit" )
final <- step(model, direction = "backward")
length(coef(model)) ## 46
length(coef(final)) ## 42
setdiff(names(coef(model)), names(coef(final)))
## [1] "Cl.thickness:Epith.c.size" "Cell.size:Marg.adhesion"
## [3] "Cell.shape:Bl.cromatin" "Bl.cromatin:Mitoses"
Some general concerns about your approach:
stepwise reduction is one of the worst forms of model reduction (cf. lasso, ridge, elasticnet ...)
in the presence of missing data, model comparison (e.g. by AIC) is questionable, as different models will be fitted to different subsets of the data. Given that you are only going to lose a small fraction of your data by using na.omit() (compare nrow(data_bc) with sum(complete.cases(data_bc))), I would strongly recommend dropping observations with NA values from the data set before starting
it's also not clear to me that comparing penalized models via AIC is statistically appropriate (see here)
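The recommendation to drop incomplete rows before model selection can be sketched as follows. This is a minimal illustration, not the answer's own code: it assumes the built-in airquality data and an ordinary lm() as stand-ins for data_bc and the bias-reduced binomial model, and the names aq, full and final are hypothetical.

```r
# airquality ships with R and has missing values in Ozone and Solar.R.
n_all      <- nrow(airquality)
n_complete <- sum(complete.cases(airquality))
c(all = n_all, complete = n_complete)  # how many rows would be lost

# Drop incomplete rows once, so every candidate model sees the same data.
aq <- na.omit(airquality)

# na.fail makes the fit stop with an error if any NA slipped through.
full  <- lm(Ozone ~ Solar.R + Wind + Temp + Month, data = aq,
            na.action = na.fail)
final <- step(full, direction = "backward", trace = 0)
```

Because both full and final are fitted to aq, their AIC values refer to the same set of observations.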

R code: Error in model.matrix.default(mt, mf, contrasts) : Variable 1 has no levels

I am trying to build a logistic regression model with diagnosis as the response (a two-level factor: B, M).
I am getting an Error on building a logistic regression model:
Error in model.matrix.default(mt, mf, contrasts) :
variable 1 has no levels
I am not able to figure out how to solve this issue.
R Code:
Cancer <- read.csv("Breast_Cancer.csv")
## Logistic Regression Model
lm.fit <- glm(diagnosis~.-id-X, data = Cancer, family = binomial)
summary(lm.fit)
Dataset Reference: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
Your problem is similar to the one reported here on the randomForest classifier.
Apparently glm checks through the variables in your data and throws an error because X contains only NA values.
You can fix that error in one of two ways:
either drop X completely from your dataset by setting Cancer$X <- NULL before handing it to glm, and leave X out of your formula (glm(diagnosis~.-id, data = Cancer, family = binomial));
or add na.action = na.pass to the glm call (which essentially tells it to ignore the NA warning) while still excluding X in the formula itself (glm(diagnosis~.-id-X, data = Cancer, family = binomial, na.action = na.pass))
Note, however, that you still have to provide the diagnosis variable in a form glm can digest: a numeric vector with values 0 and 1, a logical, or a factor.
"For binomial and quasibinomial families the response can also be specified as a factor (when the first level denotes failure and all others success)" - from the glm-doc
Just define Cancer$diagnosis <- as.factor(Cancer$diagnosis).
On my end, this still leaves some warnings, but I think those are coming from the data or your feature selection. It clears the blocking errors :)
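The first fix can be sketched on a small stand-in data set. The data frame cancer below is made up for illustration (it is not the Kaggle file, and the column radius is hypothetical); it only mimics the situation of an all-NA column X next to a two-level diagnosis.

```r
# Made-up stand-in for the CSV: an id, a B/M diagnosis, one predictor,
# and a column X that contains only NA.
set.seed(42)
cancer <- data.frame(
  id        = 1:40,
  diagnosis = rep(c("B", "M"), times = 20),
  radius    = rnorm(40, mean = rep(c(12, 14), times = 20), sd = 2),
  X         = NA
)

# Drop every column that is entirely NA (here: X).
all_na <- vapply(cancer, function(col) all(is.na(col)), logical(1))
cancer <- cancer[, !all_na]

# Make the response a factor; the first level ("B") is treated as failure.
cancer$diagnosis <- factor(cancer$diagnosis, levels = c("B", "M"))

fit <- glm(diagnosis ~ . - id, data = cancer, family = binomial)
```

Checking for all-NA columns programmatically avoids having to know in advance which column the CSV export added.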

R: Using a variable with less observations in a regression (plm)

I have been trying to deal with this for a while now with no luck. Essentially, what I am doing is a two-stage least squares on some panel data. To do this I am using the plm package. What I want to do is
1. Do a 2SLS
2. Get the residuals from the 2SLS in 1.
3. Use these residuals as an instrument in a different 2SLS
The issue I have is that in the first 2SLS the number of observations used is less than the total observations in the dataset, so my residuals vector is short and I get the following error
Error in model.frame.default(terms(formula, lhs = lhs, rhs = rhs, data = data, :
variable lengths differ (found for 'ivreg.2.a$residuals')
Here is the code I am trying to run for reference, let me know if you need any more details. I really just need my residual vector to be the same length as the data used in the first 2SLS. For reference my data has 1713 observations, however, only 1550 get used in the regression and as a result my residuals vector is length 1550. My code for the two 2SLS regressions is below.
ivreg.2.a = plm(formula = diff(loda) ~ factor(year)+diff(lgdp) | index_g_l + diff(lcru_l) + diff(lcru_l_sq) + factor(year), index = c("country", "year"), model = "within", data = panel[complete.cases(panel[, c(1,2,3,4,5,7)]),])
ivreg.2.a = plm(formula = diff(lgdp) ~ factor(year)+index_g_l + diff(lcru_l) + diff(lcru_l_sq) + diff(loda)| index_g_l + diff(lcru_l) + diff(lcru_l_sq) + factor(year) + ivreg.2.a$residuals, index = c("country", "year"), model = "within", data = panel[complete.cases(panel[, c(1,2,3,4,5,7)]),])
Let me know if you need anything else.
I assume the 163 observations are dropped because they have NA in one of the relevant variables. Most *lm functions in R have a na.action argument, which can be used to pad the residuals to correct length. E.g., when missing observation 3,
residuals(lm(formula, data, na.action=na.omit)) # 1 2 4
residuals(lm(formula, data, na.action=na.exclude)) # 1 2 NA 4
Documentation of plm, however, says that this argument is "currently not fully supported", so it would be simpler if you just filter those 1550 rows to a new dataframe first, and do all subsequent work on that.
BTW, if plm behaves like lm, you shouldn't need to specify complete.cases for it to work, as it should just skip anything with NAs.
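Both suggestions can be tried out with plain lm(). This is a sketch on made-up data (the data frame d and the other names are hypothetical), and since the plm documentation says na.action is not fully supported there, only the second variant (filtering first) is assumed to carry over.

```r
# Made-up data with one missing value in row 3.
d <- data.frame(y = c(1.0, 2.1, 2.9, 4.2, 5.1),
                x = c(1, 2, NA, 4, 5))

res_omit    <- residuals(lm(y ~ x, data = d, na.action = na.omit))
res_exclude <- residuals(lm(y ~ x, data = d, na.action = na.exclude))

length(res_omit)     # 4: row 3 is dropped
length(res_exclude)  # 5: row 3 is kept as NA

# Alternative: filter to the complete rows once and reuse that data frame
# for every later model, so all vectors share the same length.
d_complete <- d[complete.cases(d[, c("y", "x")]), ]
d_complete$res <- residuals(lm(y ~ x, data = d_complete))
```

With the residuals stored as a column of d_complete, a later regression can refer to res by name instead of to a separate, shorter vector.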

Prediction and Marginal Effects failure using mlogit() in R for a Nested Logit Model with updated data frame

I have run a Nested Logit model in R using the mlogit() package. I am now trying to measure marginal effects/elasticities and continue to run into an error. Here I have recreated the error by modifying the vignette by the package author:
data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
m <- mlogit(mode ~ price | income | catch, data = Fish,
nests=list(water=c("boat","charter"),
land=c("beach","pier")))
# compute a data.frame containing the mean value of the covariates in the sample
z <- with(Fish, data.frame(price = tapply(price, index(m)$alt, mean),
catch = tapply(catch, index(m)$alt, mean),
income = mean(income)))
# compute the marginal effects (the second one is an elasticity)
effects(m, covariate = "income", data = z)
I get the following error:
Error in `colnames<-`(`*tmp*`, value = c("beach", "boat", "charter", "pier" :
attempt to set 'colnames' on an object with less than two dimensions
In addition: Warning message:
In cbind(Gb, Gl) :
number of rows of result is not a multiple of vector length (arg 2)
This works fine when I do not have a nested model (like a regular Multinomial Logit), and that has been covered in some previous stackoverflow questions, but something weird is happening specifically with the step of re-predicting on a changed data frame (in this case the means frame z).
I'll note that the solution here: marginal effects of mlogit in R did not help me.
