R: The parameter estimates are different between a GLM model and optim()

I want to find parameter estimates with the optim() function in R, and I am comparing my result with a GLM model in R. The code is:
d <- read.delim("http://dnett.github.io/S510/Disease.txt")
d$disease=factor(d$disease)
d$ses=factor(d$ses)
d$sector=factor(d$sector)
str(d)
oreduced <- glm(disease~age+sector, family=binomial(link=logit), data=d)
summary(oreduced)
y<-as.numeric(as.character(d$disease))
x1<-as.numeric(as.character(d$age))
x2<-as.numeric(as.character(d$sector))
nlldbin=function(param){
  eta<-param[1]+param[2]*x1+param[3]*x2
  p<-1/(1+exp(-eta))
  -sum(y*log(p)+(1-y)*log(1-p),na.rm=TRUE)
}
MLE_estimates<-optim(c(Intercept=0.1,age=0.1,sector2=0.1),nlldbin,hessian=TRUE)
MLE_estimates
The result with the GLM model:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.15966 0.34388 -6.280 3.38e-10 ***
age 0.02681 0.00865 3.100 0.001936 **
sector2 1.18169 0.33696 3.507 0.000453 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
And with optim():
$par
Intercept age sector2
-3.34005918 0.02680405 1.18101449
Can someone please tell me why it's different and how to fix it? Thank you.

You've given R two different problems. In your GLM, the predictors in the formula are factor variables, so R dummy-codes them: the factor sector becomes a 0/1 indicator called sector2. In your MLE approach you converted the variables to numeric, so x2 takes the raw values 1 and 2 instead. The slope estimates still agree, but the intercept from optim() is shifted by exactly the sector coefficient (-3.340 + 1.181 ≈ -2.160, the glm() intercept).
The "fix" is to give R only one problem to solve. For example, if you instead fit glm(y~x1+x2, family=binomial(link=logit)), which uses no factor variables, you get essentially the same parameter estimates from the MLE as from the fitted model.

Related

R code to test the difference between coefficients of regressors from one panel regression

I am trying to compare two regression coefficients from the same panel regression estimated over two different time periods, in order to test whether the difference is statistically significant. Running my panel regression first with observations over 2007-2009, I get an estimate of the coefficient I am interested in, which I want to compare with the estimate of the same coefficient obtained from the same panel model over 2010-2017.
Based on R code to test the difference between coefficients of regressors from one regression, I tried to compute a likelihood ratio test. The linked discussion uses a simple linear equation. If I use the same commands in R as described in that answer, I get results based on a chi-squared distribution, and I don't understand whether and how I can interpret them.
In R, I did the following:
linearHypothesis(reg.pannel.recession.fe, "Exp_Fri=0.311576")
where reg.pannel.recession.fe is the panel regression over the period 2007-2009, Exp_Fri is the coefficient of this regression I want to compare, 0.311576 is the estimated coefficient over the period 2010-2017.
I get the following results using linearHypothesis():
How can I interpret that? Should I use another function, since these are plm objects?
Thank you very much for your help.
You get an F test in that example because, as stated in the vignette:
The method for "lm" objects calls the default method, but it changes the default test to "F" [...]
You can also set the test to F, but basically linearHypothesis works whenever the standard error of the coefficient can be estimated from the variance-covariance matrix, as also stated in the vignette:
The default method will work with any model object for which the coefficient vector can be retrieved by ‘coef’ and the coefficient-covariance matrix by ‘vcov’ (otherwise the argument ‘vcov.’ has to be set explicitly)
So using an example from the package:
library(plm)
data(Grunfeld)
wi <- plm(inv ~ value + capital,
          data = Grunfeld, model = "within", effect = "twoways")
linearHypothesis(wi,"capital=0.3",test="F")
Linear hypothesis test
Hypothesis:
capital = 0.3
Model 1: restricted model
Model 2: inv ~ value + capital
Res.Df Df F Pr(>F)
1 170
2 169 1 6.4986 0.01169 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
linearHypothesis(wi,"capital=0.3")
Linear hypothesis test
Hypothesis:
capital = 0.3
Model 1: restricted model
Model 2: inv ~ value + capital
Res.Df Df Chisq Pr(>Chisq)
1 170
2 169 1 6.4986 0.0108 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
And you can also compute the t test by hand:
tested_value = 0.3
BETA = coefficients(wi)["capital"]
SE = coefficients(summary(wi))["capital", 2]
tstat = (BETA - tested_value)/SE
pvalue = as.numeric(2*pt(-abs(tstat), wi$df.residual))  # two-sided p-value
pvalue
[1] 0.01168515
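As a quick sanity check (my addition), note that with a single restriction the F statistic reported by linearHypothesis() is just the squared t statistic, so the two approaches agree:
tstat^2   # should match the F value (6.4986) printed by linearHypothesis() above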

One-way ANOVA for stratified samples in R

I have a stratified sample with three groups ("a", "b", "c") that were drawn from a larger population N. All groups have 30 observations, but their proportions in N are unequal, hence their sampling weights differ.
I use the survey package in R to calculate summary statistics and linear regression models, and would like to know how to calculate a one-way ANOVA that corrects for the survey design (if necessary).
My assumption, and please correct me if I'm wrong, is that the standard error for the variance should normally be higher for a group with a smaller weight, hence a simple ANOVA that does not account for the survey design should not be reliable.
Here is an example. Any help would be appreciated.
## One-way ANOVA tests in R for surveys with stratified sampling design
library("survey")
# create test data
test.df<-data.frame(
  id=1:90,
  variable=c(rnorm(n = 30,mean=150,sd=10),
             rnorm(n = 30,mean=150,sd=10),
             rnorm(n = 30,mean=140,sd=10)),
  groups=c(rep("a",30),
           rep("b",30),
           rep("c",30)),
  weights=c(rep(1,30),    # undersampled
            rep(1,30),
            rep(100,30))) # oversampled
# correct for survey design
test.df.survey<-svydesign(id=~id,
                          strata=~groups,
                          weights=~weights,
                          data=test.df)
## descriptive statistics
# boxplot
svyboxplot(~variable~groups,test.df.survey)
# means
svyby(~variable,~groups,test.df.survey,svymean)
# variances
svyby(~variable,~groups,test.df.survey,svyvar)
### ANOVA ###
## One-way ANOVA without correcting for survey design
summary(aov(formula = variable~groups,data = test.df))
Hmm, this is an interesting question; as far as I know, it is difficult to account for weights in a one-way ANOVA, so I will show you the way I would solve this problem.
I'm going to use a two-way ANOVA and then some post hoc tests.
First of all, let's build a linear model based on your data and check how it looks.
library(car)
library(agricolae)
model.lm = lm(variable ~ groups * weights, data = test.df)
shapiro.test(resid(model.lm))
Shapiro-Wilk normality test
data: resid(model.lm)
W = 0.98238, p-value = 0.263
leveneTest(variable ~ groups * factor(weights), data = test.df)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 2.6422 0.07692 .
87
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The distribution of the residuals is close to normal. Levene's test is borderline (p ≈ 0.077), so the homogeneity of variance that a parametric test such as ANOVA assumes is questionable. However, let's perform the test anyway.
Several plots to check how well our data fit this test:
hist(resid(model.lm))
plot(model.lm)
Here is the interpretation of the plots; they don't look bad, actually.
Let's run two-way anova:
anova(model.lm)
Analysis of Variance Table
Response: variable
Df Sum Sq Mean Sq F value Pr(>F)
groups 2 2267.8 1133.88 9.9566 0.0001277 ***
Residuals 87 9907.8 113.88
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As you can see, the results are very close to yours. Now a post hoc test:
(result.hsd = HSD.test(model.lm, list('groups', 'weights')))
$statistics
MSerror Df Mean CV MSD
113.8831 87 147.8164 7.2195 6.570186
$parameters
test name.t ntr StudentizedRange alpha
Tukey groups:weights 3 3.372163 0.05
$means
variable std r Min Max Q25 Q50 Q75
a:1 150.8601 11.571185 30 113.3240 173.0429 145.2710 151.9689 157.8051
b:1 151.8486 8.330029 30 137.1907 176.9833 147.8404 150.3161 154.7321
c:100 140.7404 11.762979 30 118.0823 163.9753 131.6112 141.1810 147.8231
$comparison
NULL
$groups
variable groups
b:1 151.8486 a
a:1 150.8601 a
c:100 140.7404 b
attr(,"class")
[1] "group"
And maybe a different way:
aov_cont<- aov(test.df$variable ~ test.df$groups * test.df$weights)
summary(aov_cont)
Df Sum Sq Mean Sq F value Pr(>F)
test.df$groups 2 2268 1133.9 9.957 0.000128 ***
Residuals 87 9908 113.9
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(TukeyHSD(aov_cont))
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = test.df$variable ~ test.df$groups * test.df$weights)
$`test.df$groups`
diff lwr upr p adj
b-a 0.9884608 -5.581725 7.558647 0.9315792
c-a -10.1197048 -16.689891 -3.549519 0.0011934
c-b -11.1081657 -17.678352 -4.537980 0.0003461
Summarizing, the results are very close to yours. Personally, I would run a two-way ANOVA with the (*) symbol, or with (+) when you are sure that your variables are independent (an additive model).
Group c, with the bigger weight, differs substantially from groups a and b.
According to the main statistician of our institute, there is no easy implementation of this kind of analysis in any common modeling environment. The reason is that ANOVA and ANCOVA are linear models that were not developed further after the emergence of general linear models (later generalized linear models, GLMs) in the 1970s.
A normal linear regression model yields practically the same results as an ANOVA, but is much more flexible regarding variable choice. Since weighting methods exist for GLMs (see the survey package in R), there is no real need to develop methods to weight for a stratified sampling design in ANOVA... simply use a GLM instead.
summary(svyglm(variable~groups,test.df.survey))
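If you want an ANOVA-style overall test of the groups term rather than a coefficient table, one option (my addition, not part of the original answer) is regTermTest() from the survey package, which performs a design-based Wald or working likelihood-ratio test of a whole model term:
# design-based analogue of the one-way ANOVA F test:
# fit the survey-weighted GLM, then test the whole groups term
fit.svy <- svyglm(variable ~ groups, design = test.df.survey)
regTermTest(fit.svy, ~groups)                  # Wald test
regTermTest(fit.svy, ~groups, method = "LRT")  # working likelihood-ratio test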

How to plot a variable to find out whether there is negative linear trend

My data consist of new product introductions. After finding the parameters of the Bass model, I would like to know whether there is a negative linear trend over time. The parameter m stands for the number of ultimate adopters.
As can be seen from the regression output below, there seems to be a negative trend. But how can I plot this nicely?
My dataset consists of product-level data. The variable Date indicates the year the new product was launched (2009:2015), and m is the Bass-model parameter estimate (continuous).
LM <- lm(m ~ Date, data = TotalBassModel1)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   371.51      29.10  12.766  < 2e-16 ***
Date2010      -18.74      27.87  -0.672  0.50132
Date2011      -71.41      27.30  -2.616  0.00893 **
Date2012      -71.03      26.17  -2.714  0.00669 **
Date2013     -137.07      25.32  -5.414 6.62e-08 ***
Date2014     -170.25      25.15  -6.770 1.53e-11 ***
Date2015     -223.50      35.63  -6.273 4.03e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
First, I think you can treat your date as a numeric variable by removing the "Date" characters before the year and converting it with as.numeric().
That will then give you a proper fit with a single coefficient relating m to Date.
You can then plot your regression by using the predict() function, or by using ggplot functions: http://t-redactyl.io/blog/2016/05/creating-plots-in-r-using-ggplot2-part-11-linear-regression-plots.html
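A minimal sketch (my addition, assuming your data frame is TotalBassModel1 as in the question and that the year can be extracted from the Date column by dropping non-digit characters):
library(ggplot2)
# turn the year factor into a numeric predictor
TotalBassModel1$Year <- as.numeric(gsub("[^0-9]", "", TotalBassModel1$Date))
# single-slope fit: the sign of the Year coefficient gives the linear trend
LM_num <- lm(m ~ Year, data = TotalBassModel1)
summary(LM_num)
# scatter plot of m against year with the fitted regression line
ggplot(TotalBassModel1, aes(x = Year, y = m)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)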
boxplot(m ~ Date, data = TotalBassModel1)
should give you a five-number summary for each of the 7 years in your dataset. If the data are symmetric, the mean approximately equals the median, and this is a nice way to robustly visualize trends in datasets with ordered categorical predictors, for either large or small sample sizes.

How to fit data in the form of vectors to Gumbel copula in R?

I am relatively new to R. For my final exam at university I decided to write a dissertation on copulas, so I have picked up some of the theory behind copulas: what they are and how they work. I also program in R, but I have never used copulas in R, so I am a novice there.
My question is:
how can I fit raw data (in the form of an R vector) to a Gumbel copula?
Here is the format of my data:
x <- c(x1,x2,x3,...,xn)
y <- c(y1,y2,y3,...,yn)
Right now I'm using the copula package. I know how to generate random Gumbel copula samples given theta, as below:
#independence case
r_matrix <- t(rgumbel(2000,theta=1))
plot(r_matrix[1,], r_matrix[2,], col="blue", main="Gumbel, independence case")
# positive dependence
r_matrix <- t(rgumbel(2000,theta=3))
plot(r_matrix[1,], r_matrix[2,], col="blue", main="Gumbel, Positive dependence")
However, I do not know how to estimate theta. I know what Spearman's rho and Kendall's tau are, if that helps.
Could you please help me to fit x and y to the Gumbel copula? Should I use another package? Thank you
I hope this helps
library(copula)
gumbel.cop= gumbelCopula(2, dim = 7)
set.seed(117)
u1 = rCopula(500, gumbel.cop)
fit.ml = fitCopula(gumbel.cop, u1, method = "ml")
The output for the above is
fitCopula() estimation based on 'maximum likelihood'
and a sample of size 500.
Estimate Std. Error z value Pr(>|z|)
param 2.01132 0.02902 69.31 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The maximized loglikelihood is 1643
Optimization converged
Number of loglikelihood evaluations:
function gradient
20 3
You can also try "mpl", "itau" or "irho" instead of "ml" for the estimator. Just so you know: "ml" is maximum likelihood, "mpl" is maximum pseudo-likelihood, "itau" is inversion of Kendall's tau, and "irho" is inversion of Spearman's rho. See the R copula package documentation for more details.

Fama MacBeth standard errors in R

Does anyone know if there is a package that would run Fama-MacBeth regressions in R and calculate the standard errors? I am aware of the sandwich package and its ability to estimate Newey-West standard errors, as well as providing functions for clustering. However, I have not seen anything with respect to Fama-MacBeth.
The plm package can estimate Fama-MacBeth regressions and SEs.
require(foreign)
require(plm)
require(lmtest)
test <- read.dta("http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.dta")
fpmg <- pmg(y~x, test, index=c("year","firmid")) ## Fama-MacBeth
coeftest(fpmg)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.031278 0.023356 1.3392 0.1806
x 1.035586 0.033342 31.0599 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
However note that this method works only if your data can be coerced to a pdata.frame. (It will fail if you have "duplicate couples (time-id)".)
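A quick way to check for that problem before calling pmg() (my addition, using the same test data as above):
# each (firmid, year) pair must occur only once for the pdata.frame coercion to work
any(duplicated(test[, c("firmid", "year")]))   # should be FALSE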
For further details see:
Fama-MacBeth and Cluster-Robust (by Firm and Time) Standard Errors in R
