Fama-MacBeth standard errors in R

Does anyone know of a package that runs Fama-MacBeth regressions in R and calculates the standard errors? I am aware of the sandwich package and its ability to estimate Newey-West standard errors, as well as its functions for clustering. However, I have not seen anything for Fama-MacBeth.

The plm package can estimate Fama-MacBeth regressions and their standard errors.
require(foreign)
require(plm)
require(lmtest)
test <- read.dta("http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.dta")
fpmg <- pmg(y ~ x, test, index = c("year", "firmid")) ## Fama-MacBeth
coeftest(fpmg)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.031278 0.023356 1.3392 0.1806
x 1.035586 0.033342 31.0599 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note, however, that this method works only if your data can be coerced to a pdata.frame. (It will fail if you have "duplicate couples (time-id)".)
For further details see:
Fama-MacBeth and Cluster-Robust (by Firm and Time) Standard Errors in R
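If your panel cannot be coerced to a pdata.frame, the two-pass procedure is short enough to code by hand. Below is a minimal sketch (using the same test data as above; the helper names are my own): run one cross-sectional regression per period, then treat the time series of coefficients as a sample, so the Fama-MacBeth standard error is the standard deviation of the yearly estimates divided by the square root of the number of periods.
## One cross-sectional OLS per year; betas is a (2 x T) matrix of coefficients
betas <- sapply(split(test, test$year), function(d) coef(lm(y ~ x, data = d)))
fm_coef <- rowMeans(betas)                          # Fama-MacBeth point estimates
fm_se <- apply(betas, 1, sd) / sqrt(ncol(betas))    # SE of the mean across years
fm_t <- fm_coef / fm_se
cbind(Estimate = fm_coef, StdError = fm_se, t = fm_t)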

Related

Poisson model for non-integer data

I have a GLM with (quasi)poisson family.
My dataset has 3 variables:
rate_data
rate_benchmark
X
So fitting the model:
model <- glm(formula = rate_data ~ offset(log(rate_benchmark)) + X - 1, family = poisson, data = data) # or family = quasipoisson
model_null <- glm(formula = rate_data ~ offset(log(rate_benchmark)) - 1, family = poisson, data = data) # or family = quasipoisson
When using family = poisson it gives me warnings about non-integer values, which it doesn't for quasipoisson. However, when testing whether my beta is zero with anova(model_null, model, test = "LRT"), the two families give completely different deviances (and hence different p-values).
Which model am I supposed to use? My first thought was quasipoisson, but the absence of warnings does not necessarily mean it is correct.
The Poisson and quasi-Poisson models differ in the mean-variance relationship they assume for each observation. The Poisson assumes the variance equals the mean ($\sigma^2 = \mu$); the quasi-Poisson assumes $\sigma^2 = \theta\mu$, which reduces to the Poisson when $\theta = 1$. Consequently, the deviances and p-values will, as you have observed, differ between the two models.
You can in fact run a Poisson regression on non-integer data, at least in R; you'll still get the "right" coefficient estimates etc. The warnings are just that, warnings; they don't represent an algorithm failure. Here's an example:
z <- 1 + 2*runif(100)
x <- rgamma(100,2,sqrt(z + z*z/2))
summary(glm(x~z, family=poisson))
... blah blah blah ...
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.56390 0.17214 3.276 0.00105 **
z 0.43119 0.08368 5.153 2.57e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
... blah blah blah ...
There were 50 or more warnings (use warnings() to see the first 50)
Now we'll compare to a pure "quasi" model with the same link function and relationship between mean and variance; the "quasi" model makes no assumptions about integer values for the target variable:
summary(glm(x~z, family=quasi(link="log", variance="mu")))
... stuff ...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5639 0.2621 2.151 0.03392 *
z 0.4312 0.1274 3.384 0.00103 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasi family taken to be 2.318851)
Note that the parameter estimates are exactly the same, but the standard errors are different; this is due to the different variance calculations, as reflected by the different dispersion parameters. In fact, the quasi-model standard errors are just the Poisson ones scaled by the square root of the estimated dispersion: 0.17214 × √2.3189 ≈ 0.2621 and 0.08368 × √2.3189 ≈ 0.1274.
Now for the quasi-Poisson model, which will, again, give us the same parameter estimates as the Poisson model, but with different standard errors:
summary(glm(x~z, family=quasipoisson))
... more stuff ...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5639 0.2621 2.151 0.03392 *
z 0.4312 0.1274 3.384 0.00103 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasipoisson family taken to be 2.31885)
Since the mean-variance relationship and link functions are the same as in the "quasi" model, the model results are the same also.
The Poisson distribution deals with counts: the actual number of objects you counted in a defined volume, or the actual number of events you counted in a defined period of time.
If you have normalized those counts to a rate, the result is no longer Poisson distributed.
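If you have the raw integer counts, the standard way to model a rate while keeping a genuinely Poisson response is to keep the counts as the response and put the log of the exposure in as an offset, as the original question already does. A minimal simulated sketch (names and data invented purely for illustration):
set.seed(1)
## Simulate integer counts whose mean is exposure * rate
exposure <- runif(200, 1, 10)
x <- rnorm(200)
counts <- rpois(200, lambda = exposure * exp(0.3 + 0.5 * x))
## log(exposure) as an offset: coefficients act on the rate, response stays integer
fit <- glm(counts ~ x + offset(log(exposure)), family = poisson)
summary(fit)  # no non-integer warnings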

What is ga() in the gamlss package doing?

I have been looking into the gamlss package for fitting semiparametric models and came across something strange in the ga() function. Even though the model is specified with a gamma distribution and fitted using REML, the summary of the fitted smoother reports a Gaussian distribution fitted using GCV.
Example:
library(mgcv)
library(gamlss)
library(gamlss.add)
data(rent)
ga3 <- gam(R~s(Fl)+s(A), method="REML", data=rent, family=Gamma(log))
gn3 <- gamlss(R~ga(~s(Fl)+s(A), method="REML"), data=rent, family=GA)
Model summary for the GAM:
summary(ga3)
Family: Gamma
Link function: log
Formula:
R ~ s(Fl) + s(A)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.667996 0.008646 771.2 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Fl) 1.263 1.482 442.53 <2e-16 ***
s(A) 4.051 4.814 36.34 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.302 Deviance explained = 28.8%
-REML = 13979 Scale est. = 0.1472 n = 1969
Model summary for the GAMLSS:
summary(getSmo(gn3))
Family: gaussian
Link function: identity
Formula:
Y.var ~ s(Fl) + s(A)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.306e-13 8.646e-03 0 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Fl) 1.269 1.492 440.14 <2e-16 ***
s(A) 3.747 4.469 38.83 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.294 Deviance explained = 29.6%
GCV = 0.97441 Scale est. = 0.97144 n = 1969
Question:
Why does the model output report the wrong distribution and fitting method? Or is there something I am missing, and this is actually correct?
When you use the ga() function, gamlss calls gam() from mgcv in the background without specifying a family, so the smoother object itself is fitted as if the response were normal. That is why the fitted smoother shows family: gaussian and link function: identity, and why the scale estimate returned when using ga() relates to the normal distribution.
Yes: when using the ga() function, each gamlss iteration calls gam() from mgcv in the background, but it passes the correct local working variable and local weights for a gamma distribution at each iteration. The overall gamlss fit is therefore still a gamma model; only the inner smoother is reported on the Gaussian working scale.
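A quick sanity check (a sketch, assuming ga3 and gn3 are fitted as above) is to compare the two models' fitted values on the response scale: if the gamma working weights are being used inside gamlss, the two fits should agree closely even though the inner smoother prints as Gaussian.
## Fitted means from mgcv's gam (Gamma, log link) vs gamlss with ga():
## both are on the response scale, so points should hug the 45-degree line
plot(fitted(ga3), fitted(gn3),
     xlab = "gam (Gamma, REML)", ylab = "gamlss with ga()")
abline(0, 1, col = "red")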

R: The estimate parameter is different between GLM model and optim() package

I want to estimate parameters with optim() in R and compare my results with those from a GLM. The code is:
d <- read.delim("http://dnett.github.io/S510/Disease.txt")
d$disease=factor(d$disease)
d$ses=factor(d$ses)
d$sector=factor(d$sector)
str(d)
oreduced <- glm(disease~age+sector, family=binomial(link=logit), data=d)
summary(oreduced)
y<-as.numeric(as.character(d$disease))
x1<-as.numeric(as.character(d$age))
x2<-as.numeric(as.character(d$sector))
## Negative log-likelihood for a binomial (logistic) model
nlldbin <- function(param){
eta <- param[1] + param[2]*x1 + param[3]*x2      # linear predictor
p <- 1/(1 + exp(-eta))                           # inverse-logit
-sum(y*log(p) + (1-y)*log(1-p), na.rm = TRUE)    # negative log-likelihood
}
MLE_estimates <- optim(c(Intercept = 0.1, age = 0.1, sector2 = 0.1), nlldbin, hessian = TRUE)
MLE_estimates
The result with GLM model
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.15966 0.34388 -6.280 3.38e-10 ***
age 0.02681 0.00865 3.100 0.001936 **
sector2 1.18169 0.33696 3.507 0.000453 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
And with optim()
$par
Intercept age sector2
-3.34005918 0.02680405 1.18101449
Can someone please tell me why it's different and how to fix this? Thank you.
You've given R two different problems. In your GLM, sector is a factor, so R codes it internally as a 0/1 dummy variable (sector2 = 1 for the second sector). In your MLE approach you converted it with as.numeric(as.character(...)), which produces the values 1 and 2 instead. Shifting a regressor by one unit shifts the intercept by that regressor's coefficient, which is exactly what you see: −3.3401 + 1.1810 ≈ −2.1590, matching the GLM intercept up to optimization error, while the slope estimates agree.
The "fix" is to give R only one problem to solve. For example, if you instead fit glm(y ~ x1 + x2, family = binomial(link = logit)), which uses no factor variables, you get pretty much the same parameter estimates from the MLE as from the fitted model; a sketch of both directions follows below.

How do I treat female and male as binary variables when I am working with binomial data?

I am a complete beginner in R/R Studio, coding and statistics in general.
In R, I am running a GLM where my Y variable is a no/yes (0/1) category and my X variable is a Sex category (female/male).
So I have run the following script:
hello <- read.csv(file.choose())
hello$sexbin <- ifelse(hello$Sex == 'm',0,ifelse(hello$Sex == 'f',1,NA))
modifhello <- subset(hello,hello$Combi_cag_long>=36)
model1 <- glm(modifhello$VAB~modifhello$Sex, family=binomial(link=logit),
na.action=na.exclude, data=modifhello)
summary.lm(model1)
However, in my output, R seems to have split male/female into two separate variables, suggesting that it is not treating Sex as a proper binary variable:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.689 1.009 -3.656 0.000258 ***
modifhello$Sexf 2.506 1.010 2.482 0.013084 *
modifhello$Sexm 2.922 1.010 2.894 0.003820 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
What do I need to add to my script to correct this?
FOUND THE SOLUTION
You simply need to use modifhello$VAB ~ modifhello$sexbin, not modifhello$VAB ~ modifhello$Sex (the old column); see the cleaned-up script below.
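For reference, here is a cleaned-up version of the whole script (a sketch; file and column names as in the question). Using the data argument consistently avoids the $-prefixed formula, and summary() is the right accessor for a glm object, not summary.lm():
hello <- read.csv(file.choose())
## Recode Sex ("m"/"f") into a 0/1 dummy; anything else becomes NA
hello$sexbin <- ifelse(hello$Sex == 'm', 0, ifelse(hello$Sex == 'f', 1, NA))
modifhello <- subset(hello, Combi_cag_long >= 36)
## Refer to columns via data=, and use summary(), not summary.lm()
model1 <- glm(VAB ~ sexbin, family = binomial(link = logit),
              na.action = na.exclude, data = modifhello)
summary(model1)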

Getting p-value via nls in r

I'm a complete novice in R and I'm trying to do a non-linear least squares fit to some data. The following (SC4 and t are my data columns) seems to work:
fit = nls(SC4 ~ fin+(inc-fin)*exp(-t/T), data=asc4, start=c(fin=0.75,inc=0.55,T=150.0))
The "summary(fit)" command produces an output that doesn't include a p-value, which is ultimately, aside from the fitted parameters, what I'm trying to get. The parameters I get look sensible.
Formula: SC4 ~ fin + (inc - fin) * exp(-t/T)
Parameters:
Estimate Std. Error t value Pr(>|t|)
fin 0.73703 0.02065 35.683 <2e-16 ***
inc 0.55671 0.02206 25.236 <2e-16 ***
T 51.48446 21.25343 2.422 0.0224 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.04988 on 27 degrees of freedom
Number of iterations to convergence: 8
Achieved convergence tolerance: 4.114e-06
So is there any way to get a p-value? I'd be happy to use a command other than nls if that will do the job. In fact, I'd be happy to use gnuplot, for example, if there's some way to get a p-value from it (gnuplot is what I normally use for graphics).
PS I'm looking for a p-value for the overall fit, rather than for individual coefficients.
The way to do this in R is to fit a second, reduced model with fewer parameters (the null model for your comparison) and then call anova(reduced_model, full_model). The resulting F statistic tests the null hypothesis that the parameters you removed are all zero; it will be close to one when you cannot reject that null. For a standard linear regression, summary() reports this overall F test automatically.
So, for example, this is how you would do it for a standard linear regression:
> x = rnorm(100)
> y=rnorm(100)
> reg = lm(y~x)
> summary(reg)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.3869 -0.6815 -0.1137 0.7431 2.5939
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.002802 0.105554 -0.027 0.9789
x -0.182983 0.104260 -1.755 0.0824 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.056 on 98 degrees of freedom
Multiple R-squared: 0.03047, Adjusted R-squared: 0.02058
F-statistic: 3.08 on 1 and 98 DF, p-value: 0.08237
If you then use anova() you get the same F statistic:
> anova(reg)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x 1 3.432 3.4318 3.0802 0.08237 .
Residuals 98 109.186 1.1141
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
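For the nls fit itself, the same idea applies: fit a reduced model (here a constant mean, i.e. no decay, which is nested in the full model via fin = inc) and compare the two with anova(), which performs an F test for nested nls fits. A sketch, assuming the asc4 data from the question:
## Reduced model: constant mean (no exponential decay)
fit_null <- nls(SC4 ~ c0, data = asc4, start = c(c0 = 0.6))
## Full model from the question
fit <- nls(SC4 ~ fin + (inc - fin) * exp(-t/T), data = asc4,
           start = c(fin = 0.75, inc = 0.55, T = 150.0))
## F test: do the extra parameters significantly improve the fit?
anova(fit_null, fit)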
