I need to compute marginal effects from a generalized linear model (family = poisson) estimated on a subsample via the svyglm function from the R package survey.
First, I declared the survey design with:
myDesign <- svydesign(id = ~id, strata = ~strata, weights = ~sw, data = data)
Second, I estimated my model as:
fit <- svyglm(y ~ x1 + x2, design = myDesign, subset = x3 == 1, family = poisson(link = "log"))
Finally, when I want to get the Average Marginal Effect for, let's say, x1 I run:
summary(margins(fit, variables = "x1", design = myDesign))
... but I get the following error message:
"Error in h(simpleError(msg, call)) :
error in evaluating the argument 'object' in selecting a method for function 'summary': 'x' and 'w' must have the same length"
Running the following does not work either:
summary(margins(fit, variables = "x1", design = myDesign, subset = x3 == 1))
Solution:
summary(margins(fit, variables = "x1", design = myDesign[myDesign$variables$x3 == 1, ]))
Subsetting complex surveys naively leads to problems in the error estimation. When interested in a parameter for a specific subsample, one should use the desired subsample to estimate the parameter of interest and the full sample for the estimation of its standard error.
For example, svyglm(y ~ x, design = myDesign, subset = z == 1) does exactly this (beta_hat is estimated using only the observations for which z = 1, while se(beta_hat) uses the full sample).
Subsetting a survey design is possible, and it keeps the original design information about the number of clusters and strata. The code shown above is the "manual" way of doing so. Alternatively, one can rely directly on the subset.survey.design function from the survey package:
myDesign_subset <- subset(myDesign, x3 == 1)
The two methods are equivalent and produce correct z-stats.
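For illustration, here is a minimal sketch (reusing the variable names from the code above, so treat it only as a sketch) that checks the two subsetting routes agree:
sub1 <- myDesign[myDesign$variables$x3 == 1, ]
sub2 <- subset(myDesign, x3 == 1)
fit1 <- svyglm(y ~ x1 + x2, design = sub1, family = poisson(link = "log"))
fit2 <- svyglm(y ~ x1 + x2, design = sub2, family = poisson(link = "log"))
all.equal(coef(summary(fit1)), coef(summary(fit2))) # should be TRUE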
After fitting a model with glm I got the following warning:
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
After some research on Google, I tried the brglm package. When I try to apply backward elimination to the model, I get the following error:
Error in do.call("glm.control", control) : second argument must be a list.
I searched on Google but I didn't find anything.
Here is my code with brglm:
library(mlbench)
#require(Amelia)
library(caTools)
library(mlr)
library(ciTools)
library(brglm)
data("BreastCancer")
data_bc <- BreastCancer
data_bc
head(data_bc)
dim(data_bc)
#Delete id column
data_bc<- data_bc[,-1]
data_bc
dim(data_bc)
str(data_bc)
# convert all factor columns to numeric, except Class
for (i in 1:9) {
  data_bc[, i] <- as.numeric(as.character(data_bc[, i]))
}
str(data_bc)
# convert Class from benign/malignant to binary 0/1:
data_bc$Class <- ifelse(data_bc$Class == "malignant", 1, 0)
# now convert Class back to a factor
data_bc$Class <- factor(data_bc$Class, levels = c(0, 1))
str(data_bc)
model <- brglm(formula = Class~.^2, data = data_bc, family = "binomial",
na.action = na.exclude )
summary(model)
#Backward Elimination:
final <- step(model, direction = "backward")
You can work around this by using the brglm2 package, which supersedes the brglm package anyway:
library(brglm2)
model <- glm(formula = Class ~ .^2, data = na.omit(data_bc), family = "binomial",
             na.action = na.fail, method = "brglmFit")
final <- step(model, direction = "backward")
length(coef(model)) ## 46
length(coef(final)) ## 42
setdiff(names(coef(model)), names(coef(final)))
## [1] "Cl.thickness:Epith.c.size" "Cell.size:Marg.adhesion"
## [3] "Cell.shape:Bl.cromatin" "Bl.cromatin:Mitoses"
Some general concerns about your approach:
stepwise reduction is one of the worst forms of model reduction (cf. lasso, ridge, elastic net, ...); see the sketch after this list
in the presence of missing data, model comparison (e.g. by AIC) is questionable, as different models will be fitted to different subsets of the data. Given that you are only going to lose a small fraction of your data by using na.omit() (compare nrow(data_bc) with sum(complete.cases(data_bc))), I would strongly recommend dropping observations with NA values from the data set before starting
it's also not clear to me that comparing penalized models via AIC is statistically appropriate (see here)
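To illustrate the lasso alternative from the first point, here is a hedged sketch using the glmnet package (not part of the original question; the glmnet/cv.glmnet setup is my assumption about one reasonable approach):
library(glmnet)
bc <- na.omit(data_bc)
X <- model.matrix(Class ~ .^2, data = bc)[, -1] # main effects plus two-way interactions, intercept column dropped
cvfit <- cv.glmnet(X, bc$Class, family = "binomial", alpha = 1) # alpha = 1 gives the lasso penalty
coef(cvfit, s = "lambda.1se") # sparse coefficients: many interactions are shrunk exactly to zero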
I have panel data covering several individuals over a five-year period and I want to do a 2SLS estimation using the plm package. My instrumental variables are the first lags of the endogenous variables, S and S_active. When I run the second-stage IV regression I get the following error:
variable lengths differ (found for 'S_hat')
Please, can someone help me out on how to resolve this error?
Apparently, by making use of the first lags of the S and S_active variables in the first-stage regressions, I lose one year of observations for each panel. As such, I understand the error to mean that the fitted values of my endogenous variables have a shorter length than the original data, and that is why my second-stage IV regression throws up that error.
I have googled how to deal with this error and came across quite similar questions here, but none particularly addressed my specific situation.
I tried another suggested solution (i.e. to add the fitted values to the original data) but got another error:
arguments imply differing number of rows: 9196, 7192
This is how my code looks:
library(bootstrap)
library(AER)
library(systemfit)
library(sandwich)
library(lmtest)
library(boot)
library(laeken)
library(smoothmest)
library(glm2)
library(tidyverse)
library(foreign)
library(plm)
data_09=read.csv("panel2009.csv")
attach(data_09)
table(is.na(data_09))
## FALSE
## 1480556
#First stage: regress S on the exogenous independent variables, including the IV plm::lag(S)
firstpan <- plm(S ~ act + plm::lag(S) + S_active + C_tot, data = data_09, na.action = na.exclude, index = c("individual", "year"), model = "within", effect = "twoways")
summary(firstpan)
#First stage: regress S_active on the exogenous independent variables, including the IV lag(S_active)
secondpan <- plm(S_active ~ act + act*plm::lag(towater) + S + C_tot, data = data_09, na.action = na.exclude, index = c("individual", "year"), model = "within", effect = "twoways")
summary(secondpan)
#Collect fitted values and add them to data
S_hat=fitted(firstpan)
S_active_hat=fitted(secondpan)
#Run the standard 2SLS using S_hat and S_active_hat as instruments
iv=ivreg(Y ~ act+ S + S_active + C_tot|act + S_hat +S_active_hat + C_tot,data=data_09, na.action=na.exclude)
summary(iv)
Error in model.frame.default(terms(formula, lhs = lhs, rhs = rhs, data = data, :
variable lengths differ (found for 'S_hat')
The first-stage regressions run successfully, but the second-stage IV regression throws up the error above.
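For what it is worth, a minimal sketch of one way to sidestep the length mismatch entirely: plm supports instrumental variables through a two-part formula, so the lagged instruments never leave the panel structure. Variable names are taken from the question, and whether the twoways specification is supported for this exact setup is an assumption worth checking:
iv_within <- plm(Y ~ act + S + S_active + C_tot | act + plm::lag(S) + plm::lag(S_active) + C_tot,
                 data = data_09, index = c("individual", "year"),
                 model = "within", effect = "twoways") # instruments go after the |
summary(iv_within)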
I'm using the MuMIn package in R to get an averaged model (http://www.inside-r.org/packages/cran/MuMIn/docs/model.avg) and predict from that. The package also includes a predict method specifically for an object returned by model.avg (http://www.inside-r.org/node/123636). I've tried using the examples listed; code as follows:
# Example from Burnham and Anderson (2002), page 100:
fm1 <- lm(y ~ X1 + X2 + X3 + X4, data = Cement)
ms1 <- dredge(fm1)
# obtain model average for AIC delta <2
avgm <- model.avg(ms1, subset=delta<2)
# predict from the averaged model
averaged.full <- predict(avgm, full = TRUE)
But I keep getting
Error in predict.averaging(avgm, full = TRUE): can predict only from 'averaging' object containing model list
which I don't understand, because I did follow the examples and used an object returned by model.avg. Am I missing something?
When you create an "averaging" object directly from a "model.selection" object, it does not contain the component models, which are required for predict to work. You can use model.avg(..., fit = TRUE), which will fit the component models again.
To avoid fitting the models twice, you can instead first create a list of all the models with lapply(dredge(..., evaluate = FALSE), eval) and afterwards use model.avg(..., subset = ...) on that list.
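Applied to the Cement example above, a minimal sketch of the first suggestion (note that dredge itself requires na.action = na.fail on the global model):
library(MuMIn)
fm1 <- lm(y ~ X1 + X2 + X3 + X4, data = Cement, na.action = na.fail)
ms1 <- dredge(fm1)
avgm <- model.avg(ms1, subset = delta < 2, fit = TRUE) # refit component models so predict() can use them
averaged.full <- predict(avgm, full = TRUE)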
I would like to fit a binomial GLM on a certain dataset. Using glm(..., family = binomial) everything works fine; however, I would like to do it with the caret train() function. Unfortunately, I get an unexpected error which I cannot get rid of.
library("marginalmodelplots")
library("caret")
MissUSA <- MissAmerica08[,c(2,4,6,7,8,10)]
formula <- cbind(Top10, 9-Top10)~.
glmfit <- glm(formula=formula, data=MissUSA, family=binomial())
trainfit <- train(form = formula, data = MissUSA, trControl = trainControl(method = "none"), method = "glm", family = binomial())
The error I get is:
"Error : nrow(x) == length(y) is not TRUE"
caret doesn't support grouped data for a binomial outcome. You can expand the data into a binary (Bernoulli) factor variable instead. If you do that, you also no longer need family = binomial() in the call to train.
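A hedged sketch of that expansion, assuming (as the cbind(Top10, 9 - Top10) formula above implies) that each row represents 9 trials with Top10 successes:
n_trials <- 9
idx <- rep(seq_len(nrow(MissUSA)), each = n_trials) # one row per individual trial
long <- MissUSA[idx, setdiff(names(MissUSA), "Top10"), drop = FALSE]
long$Outcome <- factor(unlist(Map(function(s) c(rep("yes", s), rep("no", n_trials - s)),
                                  MissUSA$Top10)), levels = c("no", "yes"))
trainfit <- train(Outcome ~ ., data = long, trControl = trainControl(method = "none"), method = "glm")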
Max
I have a dataset that contains some missing values (on the independent variables). I'm fitting a glm model:
f.model <- glm(data = data, formula = y ~ x1 + x2, family = binomial, na.action = na.omit)
After this model I want the 'null' model, so I used update:
n.model <- update(f.model, . ~ 1)
This seems to work, but the numbers of observations in the two models differ (f.model: n = 234; n.model: n = 235). So when I try to estimate a likelihood ratio I get an error: Number of observation not equal!!.
Q: How to update the model so that it accounts for the missing values?
Although it is a bit strange that na.action = na.omit did not solve the NA problem, I decided to filter the data myself.
library(epicalc) # for lrtest
vars <- c("y", "x1", "x2") # variables in the model
n.data <- na.omit(data[, vars]) # keep only complete cases on the model variables
f.model <- glm(data = n.data, formula = y ~ x1 + x2, family = binomial)
n.model <- update(f.model, . ~ 1)
LR <- lrtest(n.model, f.model)
If someone has a better solution, or an explanation of why na.action in combination with update results in unequal numbers of observations, your answer or solution is more than welcome!
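As an additional hedged sketch (my suggestion, not from the original post): refit the null model on exactly the rows the full model used, by recycling the fitted model's own model frame:
n.model <- update(f.model, . ~ 1, data = model.frame(f.model)) # model.frame() holds only the complete cases actually used
Since both models then see identical observations, the likelihood-ratio test no longer complains about unequal counts.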