I'm trying to run a robust F-Test using a PLM object in R. I'm running the following code:
library('wooldridge')
data(wagepan,package='wooldridge')
library('plm')
pdata<-pdata.frame(wagepan,index=c("nr","year"))
fdreg<-plm(lwage~educ*(d81+d82+d83+d84+d85+d86+d87)+union,data=pdata,model="fd")
summary(fdreg)
library('car')
H0<-matchCoefs(fdreg,"educ")
linearHypothesis(fdreg,H0)
linearHypothesis(fdreg,H0,vcov=vcovHC(fdreg,"HC1"))
I'm getting the following error message:
'arg' should be one of “arellano”, “white1”, “white2”
Can anyone help me with this?
The error comes from the second linearHypothesis call.
"HC1" is a value for the argument type, but the way you pass it, it is matched positionally to the argument method (see ?plm::vcovHC for the order of the arguments).
I suggest you change your line to specify the argument by name, e.g.:
linearHypothesis(fdreg, H0, vcov. = vcovHC(fdreg, type = "HC1"))
Linear hypothesis test
Hypothesis:
educ:d81 = 0
educ:d82 = 0
educ:d83 = 0
educ:d84 = 0
educ:d85 = 0
educ:d86 = 0
educ:d87 = 0
Model 1: restricted model
Model 2: lwage ~ educ * (d81 + d82 + d83 + d84 + d85 + d86 + d87) + union
Note: Coefficient covariance matrix supplied.
Res.Df Df Chisq Pr(>Chisq)
1 3807
2 3800 7 8.0019 0.3324
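The positional-matching problem can be reproduced without plm. The sketch below uses a toy function, toy_vcov, which is a hypothetical stand-in and not plm's implementation; it only mimics an argument order of (x, method, type) and validates each argument with match.arg().

```r
# Toy stand-in (hypothetical): mimics an argument order of (x, method, type).
toy_vcov <- function(x,
                     method = c("arellano", "white1", "white2"),
                     type = c("HC0", "HC1", "HC2", "HC3")) {
  method <- match.arg(method)  # validates against the allowed choices
  type <- match.arg(type)
  paste(method, type)
}

# Positional call: "HC1" lands in `method`, so match.arg() rejects it.
positional <- tryCatch(toy_vcov(NULL, "HC1"),
                       error = function(e) conditionMessage(e))

# Named call: "HC1" goes to `type`, and `method` keeps its default.
named <- toy_vcov(NULL, type = "HC1")

print(positional)
print(named)
```

Naming the argument is the safer habit whenever a function has several character arguments in a row.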
I'm trying to write Stan code for a multilevel logistic regression. The model I tried is a random-intercept logistic model with two predictors. The first level is the child level and the second level is the mom level. When I compared the summary of the model I wrote with the one generated by the function stan_glmer(), the results for the fixed intercept did not match. First, the data I used are as follows:
library(rstanarm)
library(rstan)
data(guImmun, package = "mlmRev")
summary(guImmun)
require(dplyr)
guImmun <- guImmun %>%
mutate(immun = ifelse(immun == "N",0,1))
Second, the stan code was written as below:
data {
int N; // number of obs
int M; // number of groups
int K; // number of predictors
int y[N]; // outcome
row_vector[K] x[N]; // predictors
int g[N]; // map obs to groups (kids to women)
}
parameters {
real alpha;
real a[M];
vector[K] beta;
real<lower=0,upper=10> sigma;
}
model {
alpha ~ normal(0,1);
a ~ normal(0,sigma);
beta ~ normal(0,1);
for(n in 1:N) {
y[n] ~ bernoulli(inv_logit( alpha + a[g[n]] + x[n]*beta));
}
}
Fitting data to the model:
guI_data <- list(g=as.integer(guImmun$mom),
y=guImmun$immun,
x=data.frame(guImmun$kid2p, guImmun$mom25p),
N=nrow(guImmun),
K=2,
M=nlevels(guImmun$mom))
ranIntFit <- stan(file = "first_model.stan", data = guI_data,
iter = 500, chains = 1)
summary(ranIntFit, pars = c("alpha", "beta", "a[1]", "a[2]", "a[3]", "sigma"),
probs = c(0.025, 0.975),
digits = 2)
I got the following result:
[image: results of the hand-written model]
However, if I use stan_glmer() function, the result would be presented as follows.
M1_stanglmer <- stan_glmer(immun ~ kid2p + mom25p + (1 | mom),
family = binomial("logit"),
data = guImmun,
iter = 500,
chains = 1,
seed = 349)
print(M1_stanglmer, digits = 2)
But the results do not match, especially the result of fixed intercept.
[image: results generated by the stan_glmer() function]
Could anyone help me figure out what's wrong with my code? Thanks!
So, I wouldn't expect an exact equivalence between your model in Stan and the version implemented in stan_glmer, but for models that sample well it's reasonable to expect the estimates to be similar.
However, in your case, there is yet another issue impacting your estimates:
The covariates you are using in the guI_data$x object have values in {1,2}, where the typical implementation would use values in {0,1} to represent a binary covariate. This is what is done in stan_glmer.
This coding is apparent if you use glimpse to inspect the data structure:
> library(tidyverse)
> glimpse(guI_data)
List of 6
$ g: int [1:2159] 1 2 3 4 5 5 6 7 7 8 ...
$ y: num [1:2159] 1 0 0 0 0 1 1 1 1 1 ...
$ x:'data.frame': 2159 obs. of 2 variables:
..$ guImmun.kid2p : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 1 2 2 ...
..$ guImmun.mom25p: Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 2 2 2 ...
$ N: int 2159
$ K: num 2
$ M: int 1595
This has the biggest impact on your intercept parameter, since the intercept represents the expected linear predictor when all covariates are 0. That value will often change when covariates are transformed or added.
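The {1,2} versus {0,1} coding can be seen on a tiny factor. This is a minimal sketch with a made-up three-element factor f (not the guImmun data), showing that as.integer() and data.matrix() both return a factor's internal codes, while a comparison against "Y" yields a 0/1 indicator.

```r
# A small factor with the same levels as kid2p / mom25p (toy data).
f <- factor(c("Y", "N", "Y"), levels = c("N", "Y"))

# Internal codes: 1 for "N", 2 for "Y".
codes_default <- as.integer(f)

# data.matrix() replaces factors by the same internal codes.
codes_matrix <- unname(data.matrix(data.frame(f))[, 1])

# An explicit 0/1 indicator for the level "Y".
indicator <- as.integer(f == "Y")

print(codes_default)
print(codes_matrix)
print(indicator)
```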
Actually, I would expect that the estimated coefficients from your fit and the stan_glmer model are in fact similar, once you take this transformation into consideration.
For example, consider:
define: x_m = x + 1
Your model (m): yhat_m = alpha_m + x_m1*beta_m1 + x_m2*beta_m2
Stan_glmer: yhat = alpha + x_1*beta_1 + x_2*beta_2
and substitute:
yhat_m = alpha_m + (x_1 + 1)*beta_m1 + (x_2 + 1)*beta_m2
yhat_m = alpha_m + x_1*beta_m1 + beta_m1 + x_2*beta_m2 + beta_m2
yhat_m = alpha_m + beta_m1 + beta_m2 + x_1*beta_m1 + x_2*beta_m2
If we assume that yhat_m ~= yhat, beta_m1 ~= beta_1, and beta_m2 ~= beta_2... Then
alpha = alpha_m + beta_m1 + beta_m2
So, I would expect the stan_glmer alpha (-1.7) to be close to the hand-coded Stan alpha + both betas (-3.2 + 1.7 - 0.1).
Which indeed it is (-1.6).
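The same relationship can be checked numerically with an ordinary glm, where the two codings are exact reparameterizations of each other. This is a sketch on simulated data (the sample size, seed and true coefficients are arbitrary assumptions), not a refit of the guImmun model.

```r
set.seed(1)
n <- 500
x1 <- rbinom(n, 1, 0.5)   # binary covariates coded 0/1
x2 <- rbinom(n, 1, 0.4)
y <- rbinom(n, 1, plogis(-1.5 + 1.5 * x1 - 0.1 * x2))

# Fit with 0/1 coding and with the shifted 1/2 coding (x_m = x + 1).
fit01 <- glm(y ~ x1 + x2, family = binomial)
fit12 <- glm(y ~ I(x1 + 1) + I(x2 + 1), family = binomial)

b01 <- unname(coef(fit01))   # alpha, beta_1, beta_2
b12 <- unname(coef(fit12))   # alpha_m, beta_m1, beta_m2

# Slopes agree; the intercepts differ by the sum of the slopes.
slopes_match <- isTRUE(all.equal(b01[2:3], b12[2:3], tolerance = 1e-4))
intercept_match <- isTRUE(all.equal(b01[1], sum(b12), tolerance = 1e-4))

print(c(slopes_match = slopes_match, intercept_match = intercept_match))
```

In the Bayesian fits the match is only approximate, because the priors are placed on differently defined intercepts.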
If you furthermore update your Stan data to code these covariates as {0,1} instead of {1,2}:
guI_data2 <- list(g=as.integer(guImmun$mom),
y=guImmun$immun,
x=data.frame(guImmun$kid2p == "Y", guImmun$mom25p == "Y"),
N=nrow(guImmun),
K=2,
M=nlevels(guImmun$mom))
ranIntFit2 <- stan(file = "first_model.stan", data = guI_data2,
iter = 500, chains = 1)
And look at the output:
> summary(ranIntFit2, pars = c('alpha', 'beta'))
$summary
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
alpha -1.5110714 0.022982199 0.1903571 -1.8974997 -1.6318370 -1.5038593 -1.3861628334 -1.1729671 68.60488 1.0505237
beta[1] 1.5224756 0.025017739 0.1737332 1.2260666 1.4058789 1.5118314 1.6492158203 1.8673450 48.22471 1.0592955
beta[2] -0.1206084 0.009410305 0.1640406 -0.4267987 -0.2368855 -0.1267984 -0.0003187197 0.1894375 303.87510 0.9964177
You can confirm for yourself that you are in the right ballpark.
After this, the differences between your model and the stan_glmer will come down to the priors, the parameterization of hierarchical parameters, the sampling quality, etc.
Aside: there are a number of ways that categorical covariates can be coded into a model.matrix, each for a specific interpretation of the effect parameters. The models are usually equivalent, meaning that one can convert from one parameterization to another using linear transformations of effects as done above.
I'm doing a bit of exploratory data analysis using HMDA data from the AER package; however, the variables that I used to fit the model seem to contain some observations that perfectly determine the outcomes, an issue known as "separation." So I tried to remedy this using the solution recommended by this thread, yet when I tried to execute the first set of source code from glm.fit(), R returned an error message:
Error in family$family : object of type 'closure' is not subsettable
so I could not proceed any further to remove those fully determined observations from my data with this code. I am wondering if anyone could help me fix this?
My current code is provided below for your reference.
# load the AER package and HMDA data
library(AER)
data(HMDA)
# fit a 2-degree polynomial probit model
probit.fit <- glm(deny ~ poly(hirat, 2), family = binomial, data = HMDA)
# using the revised source code from that stackexchange thread to find the observations that triggered a warning message
library(tidyverse)
library(dplyr)
library(broom)
eps <- 10 * .Machine$double.eps
if (family$family == "binomial") {
if (any(mu > 1 - eps) || any(mu < eps))
warning("glm.fit: fitted probabilities numerically 0 or 1 occurred",
call. = FALSE)
}
# this returns the following error message
# Error in family$family : object of type 'closure' is not subsettable
probit.resids <- augment(probit.fit) %>%
mutate(p = 1 / (1 + exp(-.fitted)),
warning = p > 1-eps)
arrange(probit.resids, desc(.fitted)) %>%
select(2:5, p, warning) %>%
slice(1:10)
HMDA.nwarning <- filter(HMDA, !probit.resids$warning)
# using HMDA.nwarning should solve the problem...
probit.fit <- glm(deny ~ poly(hirat, 2), family = binomial, data = HMDA.nwarning)
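This chunk of code is copied from the body of glm.fit(), where family is a local variable holding a family object. At the top level, family is the function stats::family, so family$family fails with the 'closure' error: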
if (family$family == "binomial") {
if (any(mu > 1 - eps) || any(mu < eps))
warning("glm.fit: fitted probabilities numerically 0 or 1 occurred",
call. = FALSE)
}
When you run glm with family = binomial, the function binomial() is called to build that family object. If you look at the source of glm (just type glm):
if (is.character(family))
family <- get(family, mode = "function", envir = parent.frame())
if (is.function(family))
family <- family()
if (is.null(family$family)) {
print(family)
stop("'family' not recognized")
}
glm.fit then checks the $family element of that object during the fit, and if any fitted probability is within eps of 0 or 1, it raises that warning.
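The difference between the two meanings of family can be reproduced in a fresh session with base R only. A minimal sketch:

```r
# At the top level, `family` is the generic function stats::family.
is_fun <- is.function(family)

# Subsetting a function with $ fails; capture the message instead of stopping.
msg <- tryCatch(family$family, error = function(e) conditionMessage(e))

# Calling binomial() returns a family object, which can be subset.
fam <- binomial()
fam_name <- fam$family

print(is_fun)
print(msg)
print(fam_name)
```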
You don't need to run that part, but you do need to set eps <- 10 * .Machine$double.eps. Also, if you want a probit, you need to specify link="probit" in binomial; otherwise the default is logit. So let's run the code below:
library(AER)
library(tidyverse)
library(dplyr)
library(broom)
data(HMDA)
probit.fit <- glm(deny ~ poly(hirat, 2), family = binomial(link="probit"), data = HMDA)
eps <- 10 * .Machine$double.eps
probit.resids <- augment(probit.fit) %>%
mutate(p = 1 / (1 + exp(-.fitted)),
warning = p > 1-eps)
The column warning indicates whether an observation would raise the warning. In this dataset, there's one:
table(probit.resids$warning)
FALSE TRUE
2379 1
We can use the next step to filter it out:
HMDA.nwarning <- filter(HMDA, !probit.resids$warning)
dim(HMDA.nwarning)
[1] 2379 14
And rerun the regression:
probit.fit <- glm(deny ~ poly(hirat, 2), family = binomial(link="probit"), data = HMDA.nwarning)
coefficients(probit.fit)
(Intercept) poly(hirat, 2)1 poly(hirat, 2)2
-1.191292 8.708494 6.884404
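One caveat on the p column computed above: 1 / (1 + exp(-.fitted)) is the inverse of the logit link. Assuming .fitted is on the link scale, as that code treats it, the matching inverse for a probit model would be pnorm(). The sketch below only compares the two inverse links on a few made-up linear-predictor values; it does not refit the HMDA model, so it says nothing about how the counts above would change.

```r
eta <- c(-2, 0, 1.5)               # illustrative linear-predictor values

probit <- binomial(link = "probit")
p_probit <- probit$linkinv(eta)    # inverse probit link
p_logit <- 1 / (1 + exp(-eta))     # inverse logit link, as in the code above

# The family object's inverse link agrees with pnorm() / plogis().
probit_is_pnorm <- isTRUE(all.equal(p_probit, pnorm(eta)))
logit_is_plogis <- isTRUE(all.equal(p_logit, plogis(eta)))

print(rbind(p_probit, p_logit))
```

Using the family's own linkinv function keeps the check consistent with whichever link the model was fitted with.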
I have a training dataset
Out Revolver Ratio Num ...
0 1 0.766127 0.802982 0 ...
1 0 0.957151 0.121876 1
2 0 0.658180 0.085113 0
3 0 0.233810 0.036050 3
4 1 0.907239 0.024926 5
The outcome variable Out is binary and only takes on the values 0 or 1. Num is not a factor.
I then attempted to run nnet using caret. I want to eventually try nnGrid but I just want to make sure this works first:
nnTrControl=trainControl(method = "cv", classProbs = TRUE, summaryFunction = twoClassSummary,
number = 2,verboseIter = TRUE, returnData = FALSE, returnResamp = "all")
#nnGrid = expand.grid(.size=c(1,4,7),.decay=c(0,0.001,0.1))
Outf<-factor(training$Out)
model <- train(Outf~ Revolver+Ratio+Num, data=training, method='nnet',
trControl = nnTrControl, metric="logLoss")#, tuneGrid=nnGrid)
I get the error
Error in train.default(x, y, weights = w, ...) :
At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities are generated because the variables names will be converted to X0, X1 . Please use factor levels that can be used as valid R variable names (see ?make.names for help).
However, I've used caret and gotten this error before, which I resolved by using make.names. So when I try the below instead:
yCat<-make.names(training$Out, unique=FALSE, allow_=TRUE)
mnn <- model.matrix( ~Revolver + Ratio + Num, data = training)
model <- train(y=yCat, x=mnn, method='nnet',
trControl = nnTrControl, metric="logLoss")#, tuneGrid=nnGrid)
I then get the message
The metric "logLoss" was not in the result set. ROC will be used instead.
But I don't understand why it's not evaluating according to logLoss.
If I then use this to predict on a test set
probs<-predict(model, newdata=testSet, type="prob")
I get
Error in eval(expr, envir, enclos) : object '(Intercept)' not found
How do I fix this?
What is the difference between type="class" and type="response" in the predict function?
For instance between:
predict(modelName, newdata=testData, type = "class")
and
predict(modelName, newdata=testData, type = "response")
type = "response" gives you the numerical result, while type = "class" gives you the label assigned to that value.
With the response you can choose your own threshold. For instance,
glm.fit = glm(Direction~., data=data, family = binomial, subset = train)
glm.probs = predict(glm.fit, test, type = "response")
In glm.probs we have some numerical values between 0 and 1. Now we can determine the threshold value, let's say 0.6. Direction has two possible outcomes, up or down.
glm.pred = rep("Down",length(glm.probs))
glm.pred[glm.probs>.6] = "Up"
type = "response" is used in glm models and type = "class" is used in rpart models (CART).
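The thresholding step can be tried on a built-in dataset. This is a sketch using mtcars (am is 0 for automatic and 1 for manual) as an assumed stand-in for the Direction data above, with the same 0.6 cutoff.

```r
# Logistic regression of transmission type on weight.
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# type = "response" returns fitted probabilities between 0 and 1.
probs <- predict(fit, newdata = mtcars, type = "response")

# Turn probabilities into labels with a chosen threshold.
pred <- rep("automatic", length(probs))
pred[probs > 0.6] <- "manual"

print(table(predicted = pred, actual = mtcars$am))
```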
See:
predict.glm
predict.rpart
see ?predict.lm:
predict.lm produces a vector of predictions or a matrix of predictions and bounds with column names fit, lwr, and upr if interval is set. For type = "terms" this is a matrix with a column per term and may have an attribute "constant".
> d <- data.frame(x1=1:10,x2=rep(1:5,each=2),y=1:10+rnorm(10)+rep(1:5,each=2))
> l <- lm(y~x1+x2,d)
> predict(l)
1 2 3 4 5 6 7 8 9 10
2.254772 3.811761 4.959634 6.516623 7.664497 9.221486 10.369359 11.926348 13.074222 14.631211
> predict(l,type="terms")
x1 x2
1 -7.0064511 0.8182315
2 -5.4494620 0.8182315
3 -3.8924728 0.4091157
4 -2.3354837 0.4091157
5 -0.7784946 0.0000000
6 0.7784946 0.0000000
7 2.3354837 -0.4091157
8 3.8924728 -0.4091157
9 5.4494620 -0.8182315
10 7.0064511 -0.8182315
attr(,"constant")
[1] 8.442991
i.e. predict(l) is the row sums of predict(l,type="terms") + the constant
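That identity can be verified directly. A small sketch on simulated data (the seed is arbitrary, so the numbers will differ from the output above):

```r
set.seed(42)
d <- data.frame(x1 = 1:10, x2 = rep(1:5, each = 2))
d$y <- d$x1 + d$x2 + rnorm(10)

l <- lm(y ~ x1 + x2, data = d)

# Per-term contributions, centred, plus the "constant" attribute.
terms_mat <- predict(l, type = "terms")
rebuilt <- rowSums(terms_mat) + attr(terms_mat, "constant")

# TRUE if the row sums plus the constant reproduce predict(l).
same <- isTRUE(all.equal(unname(predict(l)), unname(rebuilt)))
print(same)
```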
I want to fit a mixed model using the nlme package in R that is equivalent to the following SAS code:
proc mixed data = one;
class var1 var2 year loc rep;
model yld = var1 * var2;
random loc year(loc) rep*year(loc);
EDITS: Explanation of what is experiment about
The same combinations of var1 and var2 were tested in replicates (rep; replicates are numbered 1:3). The replicate (rep) is considered random. This set of experiments is repeated over locations (loc) and years (year). Although replicates are numbered 1:3 within each location and year for convenience, because they do not have any names, replicate 1 within one location and year is not correlated with replicate 1 within another location and year.
I tried the following code:
require(nlme)
fm1 <- lme(yld ~ var1*var2, data = one, random = loc + year / loc + rep * year / loc)
Is my code correct?
EDITS: data and model based on suggestions
you can download the example data file from the following link:
https://sites.google.com/site/johndatastuff/mydata1.csv
data$var1 <- as.factor(data$var1)
data$var2 <- as.factor(data$var2)
data$year <- as.factor(data$year)
data$loc <- as.factor(data$loc)
data$rep <- as.factor(data$rep)
following suggestions from the comments below:
fm1 <- lme(yld ~ var1*var2, data = data, random = ~ loc + year / loc + rep * year / loc)
Error in getGroups.data.frame(dataMix, groups) :
Invalid formula for groups
EXPECTED BASED ON SAS OUTPUT
Type 3 tests of fixed effects
var1*var2 14 238 F value 16.12 Pr >F = < 0.0001
Covariance parameters:
loc = 0, year(loc) = 922161, year*rep(loc) = 2077492, residual = 1109238
I tried the following model, but I am still getting errors:
Edits: Just for information I tried the following model
require(lme4)
fm1 <- lmer(yld ~ var1*var2 + (1|loc) + (1|year / loc) + (1|rep : (year / loc)),
data = data)
Error in rep:`:` : NA/NaN argument
In addition: Warning message:
In rep:`:` : numerical expression has 270 elements: only the first used
Thanks for the more detailed information. I stored the data in d to avoid confusion with the data function and the data argument; the commands work either way, but avoiding the name data is generally considered good practice.
Note that the interaction is hard to fit because of the lack of balance between var1 and var2; for reference, here are the crosstabs:
> xtabs(~var1 + var2, data=d)
var2
var1 1 2 3 4 5
1 18 18 18 18 18
2 0 18 18 18 18
3 0 0 18 18 18
4 0 0 0 18 18
5 0 0 0 0 18
Normally to just fit the interaction (and no main effects) you'd use : instead of *, but here it works best to make a single factor, like this:
d$var12 <- factor(paste(d$var1, d$var2, sep=""))
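An alternative way to build the combined factor is base R's interaction(), which puts a separator between the labels; with sep="" in paste(), labels such as "1" with "12" and "11" with "2" would both become "112". A minimal sketch on a made-up toy data frame (not the posted data):

```r
# Toy data: three observed (var1, var2) combinations.
toy <- data.frame(var1 = factor(c(1, 1, 2)),
                  var2 = factor(c(1, 2, 2)))

# One factor level per observed combination; drop = TRUE removes the
# combinations that never occur (here var1 = 2 with var2 = 1).
toy$var12 <- interaction(toy$var1, toy$var2, drop = TRUE)

print(levels(toy$var12))
```

With single-digit labels like the ones in this dataset, the paste() version above gives an equivalent grouping.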
Then with nlme, try
fm1 <- lme(yld ~ var12, random = ~ 1 | loc/year/rep, data = d)
anova(fm1)
and with lme4, try
fm1 <- lmer(yld ~ var12 + (1 | loc/year/rep), data=d)
anova(fm1)
Also note that because nlme and lme4 overlap in their function names, it is simplest to load only one of them at a time in your R session; to switch, close R and restart. (Other ways exist, but that's the simplest to explain.)