Test whether coefficients in a quantile regression model differ from each other significantly - R

I have a quantile regression model, where I am interested in estimating effects for the .25, .5, and .75 quantiles. The coefficients in my model differ from each other in a way that is in line with the substantive theory underlying my model.
The next step is to test whether the coefficient of a particular explanatory variable for one quantile differs significantly from the estimated coefficient for another quantile. How do I test that?
Further, I also want to test whether the coefficient for that variable at a given quantile differs significantly from the estimate in an OLS model. How do I do that?
I am interested in any answer, although I would prefer an answer that involves R.
Here's some test code: (NOTE: this is not my actual model or data, but is an easy example as the data is available in the R installation)
data(airquality)
library(quantreg)
summary(rq(Ozone ~ Solar.R + Wind + Temp, tau = c(.25, .5, .75), data = airquality, method = "br"), se = "nid")
tau: [1] 0.25
Coefficients:
            Value      Std. Error  t value   Pr(>|t|)
(Intercept) -69.92874  12.18362    -5.73957  0.00000
Solar.R       0.06220   0.00917     6.77995  0.00000
Wind         -2.63528   0.59364    -4.43918  0.00002
Temp          1.43521   0.14363     9.99260  0.00000
Call: rq(formula = Ozone ~ Solar.R + Wind + Temp, tau = c(0.25, 0.5,
0.75), data = airquality, method = "br")
tau: [1] 0.5
Coefficients:
            Value      Std. Error  t value   Pr(>|t|)
(Intercept) -75.60305  23.27658    -3.24803  0.00155
Solar.R       0.03354   0.02301     1.45806  0.14775
Wind         -3.08913   0.68670    -4.49853  0.00002
Temp          1.78244   0.26067     6.83793  0.00000
Call: rq(formula = Ozone ~ Solar.R + Wind + Temp, tau = c(0.25, 0.5,
0.75), data = airquality, method = "br")
tau: [1] 0.75
Coefficients:
            Value      Std. Error  t value   Pr(>|t|)
(Intercept) -91.56585  41.86552    -2.18714  0.03091
Solar.R       0.03945   0.04217     0.93556  0.35161
Wind         -2.95452   1.17821    -2.50764  0.01366
Temp          2.11604   0.45693     4.63103  0.00001
and the OLS model:
summary(lm(Ozone ~ Solar.R + Wind + Temp, data = airquality))
Coefficients:
            Estimate   Std. Error  t value  Pr(>|t|)
(Intercept) -64.34208  23.05472    -2.791   0.00623 **
Solar.R       0.05982   0.02319     2.580   0.01124 *
Wind         -3.33359   0.65441    -5.094   1.52e-06 ***
Temp          1.65209   0.25353     6.516   2.42e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 21.18 on 107 degrees of freedom
(42 observations deleted due to missingness)
Multiple R-squared: 0.6059, Adjusted R-squared: 0.5948
F-statistic: 54.83 on 3 and 107 DF, p-value: < 2.2e-16
(don't worry about the actual model estimated above, this is just for illustrative purposes)
How do I now test whether, for example, the coefficient for Temp differs statistically significantly (at some given level) between quantiles .25 and .75, and whether the coefficient for Temp at the .25 quantile differs significantly from the OLS coefficient for Temp?
Answers are welcome in R, or ones that focus on the statistical approach.

For quantile regression, I generally prefer visual inspection.
data(airquality)
library(quantreg)
q <- rq(Ozone ~ Solar.R + Wind + Temp, tau = 1:9/10, data = airquality)
plot(summary(q, se = "nid"), level = 0.95)
The dotted red lines are the 95% confidence interval for linear regression and the shaded grey area is the 95% confidence interval for each of the quantreg estimates.
In the resulting plot, we can see that the quantreg estimates lie within the bounds of the linear regression estimates, which suggests that there may not be a statistically significant difference.
You can further test whether the differences in the coefficients are statistically significant using anova. See ?anova.rq for more details.
q50 <- rq(Ozone ~ Solar.R + Wind + Temp, tau = 0.5, data = airquality)
q90 <- rq(Ozone ~ Solar.R + Wind + Temp, tau = 0.9, data = airquality)
anova(q50, q90)
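To target a single coefficient, anova.rq also accepts joint = FALSE, which reports a separate Wald test for each slope; a sketch for the .25 vs .75 comparison (the Temp row of the output is the test of interest):
q25 <- rq(Ozone ~ Solar.R + Wind + Temp, tau = 0.25, data = airquality)
q75 <- rq(Ozone ~ Solar.R + Wind + Temp, tau = 0.75, data = airquality)
# joint = FALSE gives coefficient-by-coefficient tests of equality across taus
anova(q25, q75, joint = FALSE)
For the comparison against OLS there is no ready-made test in quantreg, but a paired bootstrap over observations is one defensible approach; a minimal sketch:
# bootstrap the difference between the tau = .25 coefficient and the OLS
# coefficient for Temp; a 95% interval excluding 0 suggests a real difference
set.seed(1)
diffs <- replicate(1000, {
  d <- airquality[sample(nrow(airquality), replace = TRUE), ]
  coef(rq(Ozone ~ Solar.R + Wind + Temp, tau = 0.25, data = d))["Temp"] -
    coef(lm(Ozone ~ Solar.R + Wind + Temp, data = d))["Temp"]
})
quantile(diffs, c(0.025, 0.975))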

Related

How to determine degrees of freedom for t stat with quantile regression and bootstrapped standard errors in R

I am using R to conduct a quantile regression with bootstrapped standard errors to test if one variable is higher than a second variable at the 5th, 50th, and 95th percentiles of the distributions. The output does not include degrees of freedom for the t stat. How can I calculate this?
summary(rq(data$var1 ~ data$var2, tau = .05), se = "boot")
summary(rq(data$var1 ~ data$var2, tau = .5), se = "boot")
Assuming you are using the quantreg library: if you call rq() by itself, you get the degrees of freedom.
It looks like you're fairly new to SO; welcome to the community! If you want great answers quickly, it's best to make your question reproducible. This includes the libraries you used and sample data, like the output from dput(head(dataObject)). Check it out: making R reproducible questions.
Capturing the degrees of freedom, in this case, should be relatively easy.
In truth, the total degrees of freedom is just the number of observations. The residual degrees of freedom are the number of observations minus the number of parameters estimated from the formula, and it is against these residual degrees of freedom that each coefficient's t-statistic is evaluated.
If you call the regression directly (instead of nested in the summary function), it gives you information about the degrees of freedom, as well. That being said, if you don't run the model independently, it is more difficult to test the assumptions that the data must meet for the analysis. Lastly, in this form, you can't test the model for overfitting, either.
library(quantreg)
data(mtcars)
(fit <- rq(mpg ~ wt, data = mtcars, tau = .05))
# Call:
# rq(formula = mpg ~ wt, tau = 0.05, data = mtcars)
#
# Coefficients:
# (Intercept) wt
# 37.561538 -6.515837
#
# Degrees of freedom: 32 total; 30 residual
(fit2 <- rq(mpg ~ wt, data = mtcars, tau = .5))
# Call:
# rq(formula = mpg ~ wt, tau = 0.5, data = mtcars)
#
# Coefficients:
# (Intercept) wt
# 34.232237 -4.539474
#
# Degrees of freedom: 32 total; 30 residual
summary(fit, se = "boot")
#
# Call: rq(formula = mpg ~ wt, tau = 0.05, data = mtcars)
#
# tau: [1] 0.05
#
# Coefficients:
#              Value     Std. Error  t value   Pr(>|t|)
# (Intercept)  37.56154  5.30762      7.07690  0.00000
# wt           -6.51584  1.58456     -4.11208  0.00028
summary(fit2, se = "boot")
#
# Call: rq(formula = mpg ~ wt, tau = 0.5, data = mtcars)
#
# tau: [1] 0.5
#
# Coefficients:
#              Value     Std. Error  t value   Pr(>|t|)
# (Intercept)  34.23224  3.20718     10.67362  0.00000
# wt           -4.53947  1.04645     -4.33798  0.00015
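If you want the residual degrees of freedom programmatically, one sketch (using the fit object from above):
# residual df = number of observations used minus number of estimated coefficients
length(fit$residuals) - length(coef(fit))
# [1] 30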
I would like to point out that se = "boot" doesn't appear to be doing anything here. Additionally, you can run both tau settings in the same model; the quantreg package has several tools for comparing the models when they are run together, as sketched below.
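For example, a single call can fit both quantiles (a sketch):
# one call fits both quantiles; summary() then reports each tau in turn,
# and plot(summary(...)) would compare the coefficients across quantiles
fits <- rq(mpg ~ wt, data = mtcars, tau = c(.05, .5))
summary(fits, se = "boot")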

How does the Predict function handle continuous values with a 0 in R for a Poisson Log Link Model?

I am using a Poisson GLM on some dummy data to predict ClaimCounts based on two variables, frequency and Judicial Orientation.
Dummy Data Frame:
data5 <- data.frame(
  Year = c("2006","2006","2006","2007","2007","2007","2008","2009","2010","2010","2009","2009"),
  JudicialOrientation = c("Defense","Plaintiff","Plaintiff","Neutral","Defense","Plaintiff","Defense","Plaintiff","Neutral","Neutral","Plaintiff","Defense"),
  Frequency = c(0.0,0.06,.07,.04,.03,.02,0,.1,.09,.08,.11,0),
  ClaimCount = c(0,5,10,3,4,0,7,8,15,16,17,12),
  Loss = c(100000,100,2500,100000,25000,0,7500,5200,900,100,0,50),
  Exposure = c(10,20,30,1,2,4,3,2,1,54,12,13)
)
Model GLM:
ClaimModel <- glm(ClaimCount ~ JudicialOrientation + Frequency,
                  family = poisson(link = "log"), offset = log(Exposure),
                  data = data5, na.action = na.pass)
Call:
glm(formula = ClaimCount ~ JudicialOrientation + Frequency, family = poisson(link = "log"),
data = data5, na.action = na.pass, offset = log(Exposure))
Deviance Residuals:
     Min       1Q   Median       3Q      Max
-3.7555  -0.7277  -0.1196   2.6895   7.4768
Coefficients:
                              Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)                    -0.3493    0.2125     -1.644  0.1
JudicialOrientationNeutral     -3.3343    0.5664     -5.887  3.94e-09 ***
JudicialOrientationPlaintiff   -3.4512    0.6337     -5.446  5.15e-08 ***
Frequency                      39.8765    6.7255      5.929  3.04e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 149.72 on 11 degrees of freedom
Residual deviance: 111.59 on 8 degrees of freedom
AIC: 159.43
Number of Fisher Scoring iterations: 6
I am also using log(Exposure) as an offset.
I then want to use this GLM to predict claim counts for the same observations:
data5$ExpClaimCount <- predict(ClaimModel, newdata=data5, type="response")
If I understand correctly, the Poisson GLM equation should then be:
ClaimCount = exp(-.3493 + -3.3343*JudicialOrientationNeutral +
-3.4512*JudicialOrientationPlaintiff + 39.8765*Frequency + log(Exposure))
However, I tried this manually (in Excel, =EXP(-0.3493+0+0+LOG(10)) for observation 1, for example) for some of the observations but did not get the correct answer.
Is my understanding of the GLM equation incorrect?
You are right with the assumption about how predict() for a Poisson GLM works. This can be verified in R:
co <- coef(ClaimModel)
p1 <- with(data5,
           exp(log(Exposure) +                              # offset
               co[1] +                                      # intercept
               ifelse(as.numeric(JudicialOrientation) > 1,  # factor term
                      co[as.numeric(JudicialOrientation)], 0) +
               Frequency * co[4]))                          # linear term
all.equal(p1, predict(ClaimModel, type = "response"), check.names = FALSE)
[1] TRUE
As indicated in the comments, you probably get the wrong results in Excel because of the different base of the logarithm (Excel's LOG() uses base 10, while R's log() uses the natural base e).
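A quick way to see the mismatch (Excel's LOG() defaults to base 10, while its natural-log function is LN()):
log(10)    # natural log, as used in the model offset: 2.302585
log10(10)  # what Excel's LOG(10) returns: 1
# so the Excel formula should use LN(Exposure), e.g. =EXP(-0.3493 + LN(10))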

R - Regression analysis for a logarithmic fit

I am performing regression analysis to find the best-fitting model for the diamonds dataset in ggplot2. I use price (the response variable) vs carat and fit linear, quadratic, and cubic regressions. None of these lines fits well. I found that the logarithmic fit from Excel has the best-fitting line, but I couldn't figure out how to code the logarithmic fit in R. Can anyone help?
# Comparing price vs carat
model <- lm(price ~ carat, data = diamonds)
# Model 2 adds a quadratic term
model2 <- lm(price ~ carat + I(carat^2), data = diamonds)
# Model 3 adds a cubic term
model3 <- lm(price ~ carat + I(carat^2) + I(carat^3), data = diamonds)
How can I code the logarithmic fit in R to get the same result as Excel?
y = 0.4299ln(x) - 2.5495
R² = 0.8468
Thanks!
The result you report from Excel, y = 0.4299ln(x) - 2.5495, does not contain any polynomial terms. What are you trying to do? price is very skewed, and as with, say, income, it is common practice to take its log. This also reproduces the R² you are referring to, but with very different coefficients for the intercept and the carat parameter.
m1 <- lm(log(price) ~ carat, data = diamonds)
summary(m1)
Call:
lm(formula = log(price) ~ carat, data = diamonds)
Residuals:
    Min       1Q   Median       3Q      Max
-6.2844  -0.2449   0.0335   0.2578   1.5642
Coefficients:
            Estimate  Std. Error  t value  Pr(>|t|)
(Intercept) 6.215021  0.003348    1856     <2e-16 ***
carat       1.969757  0.003608     546     <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3972 on 53938 degrees of freedom
Multiple R-squared: 0.8468, Adjusted R-squared: 0.8468
F-statistic: 2.981e+05 on 1 and 53938 DF, p-value: < 2.2e-16
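If instead you want the same functional form as Excel's logarithmic trendline, y = a*ln(x) + b, regress the untransformed price on log(carat). A sketch (the coefficients will be on the raw price scale, so they will not match the scaled values in your Excel output):
library(ggplot2)  # provides the diamonds data
m2 <- lm(price ~ log(carat), data = diamonds)
summary(m2)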

Regression in R with categorical variables

I'm trying to understand regression in R by solving an exercise that has a dataset of 100 random male/female observations like this:
sex     sbp  bmi
male    130  40.0
female  126  29.0
female  115  25.0
male    120  33.0
female  128  34.0
...
I want to get a numerical summary (0), plot the relation between sbp and bmi (1), and estimate the beta1, beta2, and sigma parameters along with R^2 (2). Then I want to check the goodness of fit of the model (3) and get the confidence intervals (4).
I think that sex is a categorical variable, so here is my code:
as.numeric(framingham$sex) - 1
apply(framingham, 2, class)
#0
framingham$sex <- factor(framingham$sex)
levels(framingham$sex) <- c("female", "male")
resultadoNumerico <- compareGroups(~., data = framingham)
resumenNumerico <- createTable(resultadoNumerico)
resumenNumerico
# 1
framinghamMatrix <- data.matrix(framingham)
pairs(framinghamMatrix)
cor(framinghamMatrix)
#2
regre <- lm(sbp ~ bmi+sex, data = framingham)
regreSum <- summary(regre)
regreSum
# Sigma
regreSum$sigma
# Betas
regreSum$coefficients
#3
plot(framingham$bmi, framingham$sbp, xlab = "BMI", ylab = "SBP")
abline(regre)
But I think that I'm not doing things right... Could you help me? Thanks in advance.
To check the relation between variables, try the pairs.panels plot from the psych library. It gives the distributions, scatter plots, and correlation coefficients.
library(psych)
pairs.panels(framingham)
The sex variable here is categorical, so convert it into a factor and then provide it as input to your linear regression model. By alphabetical order, the first level in the factor becomes your reference level, so in the summary of the model you only see the levels other than the reference level (in this case, female is the base/reference level).
framingham$sex<-as.factor(framingham$sex)
Now create your linear model.
model <- lm(sbp ~ bmi+sex, data = framingham)
model
summary(model)
The summary gives the coefficients, intercept, standard error (95% confidence), t-value, p-value (which indicates the significance of the variables), multiple R-squared (goodness of fit), adjusted R-squared (goodness of fit adjusted for model complexity), etc.
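If you prefer a different reference level, relevel() changes it before fitting; a minimal sketch, assuming the levels "female" and "male" as above:
# make "male" the reference level instead of the alphabetical default
framingham$sex <- relevel(framingham$sex, ref = "male")
model <- lm(sbp ~ bmi + sex, data = framingham)
summary(model)  # now reports a sexfemale contrast against the male baseline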
I've used sex - 1 for the categorical variable:
regre <- lm(sbp ~ bmi + sex - 1, data = framingham)
regreSum <- summary(regre)
regreSum
And now I obtain
Call:
lm(formula = sbp ~ bmi + sex - 1, data = framingham)
Residuals:
    Min       1Q   Median       3Q      Max
-28.684  -13.025   -1.314    8.711   73.476
Coefficients:
          Estimate  Std. Error  t value  Pr(>|t|)
bmi        1.9338    0.3965     4.877    4.21e-06 ***
sexhombre 79.0624   11.0716     7.141    1.71e-10 ***
sexmujer  82.1020   10.5184     7.806    6.93e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 18.48 on 97 degrees of freedom
Multiple R-squared: 0.9813, Adjusted R-squared: 0.9808
F-statistic: 1700 on 3 and 97 DF, p-value: < 2.2e-16
Am I going in the right direction now?
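For what it's worth, the sex - 1 fit is the same model in a different parametrization: the two sex coefficients are the per-group intercepts, and their difference equals the sex dummy coefficient from the default fit. (The R-squared is not comparable, though: without an intercept, R computes it around zero rather than around the mean, which is why it jumps to 0.98.) A quick check, with coefficient positions depending on your factor levels:
m_default   <- lm(sbp ~ bmi + sex, data = framingham)
m_cellmeans <- lm(sbp ~ bmi + sex - 1, data = framingham)
# the difference of the two group intercepts equals the dummy coefficient
coef(m_cellmeans)[3] - coef(m_cellmeans)[2]
coef(m_default)[3]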

Syntax for stepwise logistic regression in R

I am trying to conduct a stepwise logistic regression in R with a dichotomous DV. I have researched the step() function, which uses AIC to select a model and essentially requires a null and a full model. Here's the syntax I've been trying (I have a lot of IVs, but the N is 100,000+):
Full = glm(WouldRecommend_favorability ~ i1 + i2 + i3 + i4 + i5 + i6.....i83 + b14 +
Shift_recoded, data = ee2015, family = "binomial")
Nothing = glm(WouldRecommend_favorability ~ 1, data = ee2015, family = "binomial")
Full_Nothing_Step = step(Nothing, scope = Full,Nothing, scale = 0, direction = c('both'),
trace = 1, keep = NULL, steps = 1000, k = 2)
One thing I am not sure about is the order in which "Nothing" and "Full" should be entered in the step formula. Whichever way I try, when I print a summary of "Full_Nothing_Step", it only gives me a summary of either "Nothing" or "Full":
Call:
glm(formula = WouldRecommend_favorability ~ 1, family = "binomial",
data = ee2015)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.8263   0.1929   0.1929   0.1929   0.1929
Coefficients:
            Estimate  Std. Error  z value  Pr(>|z|)
(Intercept) 3.97538   0.01978     201      <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 25950 on 141265 degrees of freedom
Residual deviance: 25950 on 141265 degrees of freedom
AIC: 25952
Number of Fisher Scoring iterations: 6
I am pretty familiar with logistic regression in general but am new to R.
As the documentation states, you can enter scope as a single formula or as a list with both upper and lower bounds to search over.
In the example below, my initial model is lm1; I then run the stepwise procedure in both directions. The upper bound of the search is a model with all pairwise interaction terms, while the lower bound is the model with all main-effect terms. You can easily adapt this to a glm model and add any additional arguments you desire, as sketched after the example.
Be sure to read through the help page though.
lm1 <- lm(Fertility ~ ., data = swiss)
slm1 <- step(lm1,
             scope = list(upper = as.formula(Fertility ~ .^2),
                          lower = as.formula(Fertility ~ .)),
             direction = "both")
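Applied to your logistic regression, the same pattern might look like this (a sketch, assuming Full and Nothing are defined as in your question):
Full_Nothing_Step <- step(Nothing,
                          scope = list(lower = formula(Nothing),
                                       upper = formula(Full)),
                          direction = "both", trace = 1, steps = 1000, k = 2)
summary(Full_Nothing_Step)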
