I am trying to find the marginal effects of my probit regression (if anyone knows how to do it with a logit regression, I can use that instead). My dependent variable (my Y) records one of 4 possible actions, ordered by the aggressiveness of the move (Action1: most aggressive response, Action4: least aggressive response). My independent variables are 4 continuous variables that describe the state of the system. The goal of the regression is to see how a change in the state of the system affects the choice of reaction.
I have looked at several packages (mlogit, erer, VGAM, etc.) but none of them seems to have a marginal effect function that simply gives you the marginal effect of each independent variable.
I would like to get something similar to what you can get for a binomial logit/probit regression using a marginal effect function such as maBina. For example, if I were to run a simple logit/probit regression using glm I would get:
mylogit <- glm(admit ~ gre + gpa + rank, family = binomial(link = "logit"), x=TRUE, data = mydata)
> summary(mylogit)
glm(formula = admit ~ gre + gpa + rank, family = binomial(link = "logit"),
data = mydata, x = TRUE)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6268 -0.8662 -0.6388 1.1490 2.0790
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.989979 1.139951 -3.500 0.000465 ***
gre 0.002264 0.001094 2.070 0.038465 *
gpa 0.804038 0.331819 2.423 0.015388 *
rank2 -0.675443 0.316490 -2.134 0.032829 *
rank3 -1.340204 0.345306 -3.881 0.000104 ***
rank4 -1.551464 0.417832 -3.713 0.000205 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
but since this is a logit regression, the coefficients don't tell me the marginal effect of, say, GPA on the probability of getting admitted into college. To get that marginal effect (that is, to answer the question "how does an increase in the value of GPA affect my likelihood of being accepted into college?") I need to run a separate command, such as maBina, and I get:
>maBina(mylogit, x.mean = FALSE, rev.dum = TRUE, digits = 3)
Call: glm(formula = admit ~ gre + gpa + rank, family = binomial(link = "logit"),
data = mydata, x = TRUE)
(Intercept) gre gpa rank2 rank3 rank4
-3.989979 0.002264 0.804038 -0.675443 -1.340204 -1.551464
Degrees of Freedom: 399 Total (i.e. Null); 394 Residual
Null Deviance: 500
Residual Deviance: 458.5 AIC: 470.5
effect error t.value p.value
(Intercept) **-0.776** 0.233 -3.337 0.001
gre **0.000** 0.000 1.931 0.054
gpa **0.156** 0.069 2.263 0.024
rank2 **-0.136** 0.061 -2.221 0.027
rank3 **-0.261** 0.072 -3.614 0.000
rank4 **-0.251** 0.049 -5.106 0.000
where "effect" (the 2nd column from the left in the latest table, in bold) is what I'm looking for.

Generally one uses summary.glm and pulls the coefficients table from that object if all you want is the table of coefficients and standard errors, which it appears is the case here:
summary(glmfit)$coefficients # or
coef( summary(glmfit))
On the other hand, if what you want are predictions for proportions or probabilities, then predict.glm (called with type = "response") can deliver predicted responses on the measured scale rather than on the transformed scale where the regression coefficients were estimated.
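As a minimal sketch of the predict.glm route, here is a logit fit on the built-in mtcars data. The data set, the model formula, and the names fit, eta, p and new_cars are stand-ins chosen for illustration, not the poster's model.

```r
# Illustrative logit model on a built-in data set (an assumption, not the asker's data)
fit <- glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)

# Linear predictor (log-odds scale, where the coefficients live)
eta <- predict(fit, type = "link")

# Predicted probabilities (the measured scale)
p <- predict(fit, type = "response")

# Predicted probabilities at chosen values of the predictor
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5))
p_new <- predict(fit, newdata = new_cars, type = "response")
p_new
```

The response-scale predictions are simply the inverse logit of the link-scale ones, so plogis(eta) and p agree.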
There is also an effects package that provides graphical displays and allows specification of selected contrasts.
install.packages("effects", dependencies=TRUE)
It would clarify your expectations if you presented a simple example and said what values you mean to be "effects".
So after clarification I now wonder if you want a programmatic method for extracting a particular value. If so then it is as simple as:
> ea$out['gpa', 'effect']
[1] 0.534 # where ea is the object created in ?maBina example
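If the goal is to see where a number like "effect" comes from, one hedged sketch is to compute an average marginal effect for a binary logit by hand: average the logistic density over the observations and multiply by each slope. The mtcars model below is an illustrative stand-in, and this is not claimed to reproduce the internals of maBina.

```r
# Illustrative binary logit (assumed example, not the admissions data)
fit <- glm(am ~ wt + hp, family = binomial(link = "logit"), data = mtcars)

# Linear predictor for every observation
eta <- predict(fit, type = "link")

# Average marginal effect of each continuous predictor:
# mean of the logistic density at eta, times the coefficient
ame <- mean(dlogis(eta)) * coef(fit)[-1]
ame

# Numerical check for wt: nudge wt slightly and average the change in probability
h  <- 1e-6
p0 <- predict(fit, type = "response")
p1 <- predict(fit, newdata = transform(mtcars, wt = wt + h), type = "response")
fd_wt <- mean((p1 - p0) / h)
```

The finite-difference value fd_wt should agree closely with ame["wt"], which is a quick sanity check on the formula.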


How to calculate the partial R squared for a linear model with factor interaction in R

I have a linear model where my response Y is say the percentage (proportion) of fat in milk. I have two explanatory variables one (x1) is a continuous variable, the other (z) is a three level factor.
I now do the regression in R as:
contrasts(z) <- "contr.sum"
model<-lm(logit(Y) ~ log(x1)*z)
The model summary gives me the R2 of this model. However, I want to find out the importance of x1 in my model.
I can look at the p-value if the slope is statistically different from 0, but this does not tell me if x1 is actually a good predictor.
Is there a way to get the partial R2 for this model and the overall effect of x1? As this model includes an interaction I am not sure how to calculate this and if there is one unique solution or if I get a partial R2 for the main effect of x1 and a partial R2 for main effect of x1 plus its interaction.
Or would it be better to avoid partial R2 and explain the magnitude of the slope of the main effect and interaction. But given my logit transformation I am not sure if this has any practical meaning for say how log(x1) changes the log odds ratio of % fat in milk.
I tried to fit the model without the interaction and without the factor to get a usual R2, but this would not be my preferred solution; I would like to get the partial R2 when specifying the full model.
Update: As requested in a comment, here the output from the summary(model). As written above z is sum contrast coded.
lm(formula = y ~ log(x1) * z, data = mydata)
Min 1Q Median 3Q Max
-1.21240 -0.09487 0.03282 0.13588 0.85941
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.330678 0.034043 -68.462 < 2e-16 ***
log(x1) -0.012948 0.005744 -2.254 0.02454 *
z1 0.140710 0.048096 2.926 0.00357 **
z2 -0.348526 0.055156 -6.319 5.17e-10 ***
log(x1):z1 0.017051 0.008095 2.106 0.03558 *
log(x1):z2 -0.028201 0.009563 -2.949 0.00331 **
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2288 on 594 degrees of freedom
Multiple R-squared: 0.1388, Adjusted R-squared: 0.1315
F-statistic: 19.15 on 5 and 594 DF, p-value: < 2.2e-16
Update: As requested in a comment, here the output from
aov(formula = model)
log(x1) z log(x1):z Residuals
Sum of Squares 0.725230 3.831223 0.456677 31.105088
Deg. of Freedom 1 2 2 594
Residual standard error: 0.228835
Estimated effects may be unbalanced.
As written above, z is sum contrast coded.
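One possible way to define a partial R2 for x1 as a whole (main effect plus interaction) is to compare the full model with a reduced model that drops every term involving x1. The sketch below uses simulated data; the variable names mirror the question, but the values are invented and the definition used is only one of several in circulation.

```r
# Simulated stand-in data (assumption: not the milk-fat data from the question)
set.seed(1)
n  <- 600
x1 <- runif(n, 1, 100)
z  <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
contrasts(z) <- "contr.sum"
y  <- -2.3 - 0.01 * log(x1) + 0.2 * (z == "a") + rnorm(n, sd = 0.23)

full    <- lm(y ~ log(x1) * z)   # main effects plus interaction
reduced <- lm(y ~ z)             # every term involving x1 removed

rss_full    <- sum(resid(full)^2)
rss_reduced <- sum(resid(reduced)^2)

# Share of the variation left unexplained by z that x1 (with its interaction) explains
partial_r2 <- (rss_reduced - rss_full) / rss_reduced
partial_r2

# F-test for the same comparison
anova(reduced, full)
```

Replacing the reduced model with y ~ log(x1) + z would instead give a partial R2 for the interaction alone, which illustrates that the answer depends on which terms are attributed to x1.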

How to extract the actual values of parameters and their standard error instead of the marginal effect estimates in lmer from package lme4 in R?

I have this fake dataset that describes the effect of air temperature on the growth of two plant species (a and b).
data1 <- read.csv(text = "
The experiment was conducted over two years and in a block design (blocks nested within years). The goal is to report how much growth is affected per unit of change in temperature. There is also a need to provide a measure of uncertainty (standard error) for this estimate. The same needs to be done for the growth recorded at zero degrees of temperature.
test.model.1 <- lmer(growth ~
                       specie +
                       temperature +
                       specie*temperature +
                       (1|year) +
                       (1|year:block),
                     data = data1,
                     control = lmerControl(check.nobs.vs.nlev = "ignore",
                                           check.nobs.vs.rankZ = "ignore",
                                           check.nobs.vs.nRE = "ignore"))
The summary gives me this output for the fixed effects:
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: growth ~ specie + temperature + specie * temperature + (1 | year) +
(1 | year:block)
Data: data1
Control: lmerControl(check.nobs.vs.nlev = "ignore", check.nobs.vs.rankZ = "ignore",
check.nobs.vs.nRE = "ignore")
REML criterion at convergence: 331.3
Scaled residuals:
Min 1Q Median 3Q Max
-2.6408 -0.7637 0.1516 0.5248 2.4809
Random effects:
Groups Name Variance Std.Dev.
year:block (Intercept) 6.231 2.496
year (Intercept) 0.000 0.000
Residual 74.117 8.609
Number of obs: 48, groups: year:block, 4; year, 2
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 2.699 3.356 26.256 0.804 0.428
specieb 4.433 4.406 41.000 1.006 0.320
temperature 8.624 1.029 41.000 8.381 2.0e-10 ***
specieb:temperature 7.088 1.455 41.000 4.871 1.7e-05 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) specib tmprtr
specieb -0.656
temperature -0.767 0.584
spcb:tmprtr 0.542 -0.826 -0.707
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')
From this I can get the growth at 0 degrees of temperature for specie "a" (2.699) and for specie "b" (2.699 + 4.433 = 7.132). Also, the rate of change in growth per unit change in temperature is 8.624 for specie "a" and 8.624 + 7.088 = 15.712 for specie "b". The problem I have is that the standard error reported in summary() is for the marginal estimate, not for the actual value of the parameter. For instance, the standard error for 4.433 (specieb) is 4.406, but that is not the standard error for the actual growth at 0 degrees for specie "b", which is 7.132. What I am looking for is the standard error of, let's say, 7.132. Also, it would be nice to have all the calculations I did by hand performed automatically.
I tried emmeans() from the emmeans package (the successor to lsmeans), but I didn't succeed.
emmeans(test.model.1, growth ~ specie*temperature)
Error in contrast.emmGrid(object = new("emmGrid", = list(call = lmer(formula = growth ~ :
Contrast function 'growth.emmc' not found
I think your main problem is that you don't need the response variable on the left side of the formula you give to emmeans (the package assumes that you're going to use the same response variable as in the original model!) The left-hand side of the formula is reserved for specifying contrasts, e.g. pairwise ~ ... - see help("contrast-methods", package = "emmeans").
I think you might be looking for:
emmeans(test.model.1, ~specie, at = list(temperature=0))
NOTE: Results may be misleading due to involvement in interactions
specie emmean SE df lower.CL upper.CL
a 2.70 3.36 11.3 -4.665 10.1
b 7.13 3.36 11.3 -0.232 14.5
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
If you don't specify the value of temperature, then emmeans uses (I think) the overall average temperature.
For slopes, you want emtrends:
emtrends(test.model.1, ~specie, var = "temperature")
specie temperature.trend SE df lower.CL upper.CL
a 8.62 1.03 41 6.55 10.7
b 15.71 1.03 41 13.63 17.8
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
I highly recommend the extensive and clearly written vignettes for the emmeans package. Since emmeans has so many capabilities it may take a little while to find the answers to your precise questions, but the effort will be repaid in the long term.
As a small picky point, I would say that what summary() gives you are the "actual" parameters that R uses internally, and what emmeans() gives you are the marginal means (as the package name suggests: estimated marginal means).
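For intuition about where those standard errors come from, the hand calculation in the question (adding coefficients) can be paired with the coefficient covariance matrix: the variance of a sum of coefficients is c' V c for a contrast vector c. The sketch below uses a plain lm on simulated data so that it runs with base R only; the data and the names d, fit and L are invented. The same arithmetic should carry over to an lmer fit using fixef() and as.matrix(vcov()), though that is an assumption not run here.

```r
# Simulated stand-in data (not the plant data from the question)
set.seed(42)
d <- data.frame(
  specie      = rep(c("a", "b"), each = 24),
  temperature = rep(seq(0, 5, length.out = 24), times = 2)
)
d$growth <- 3 + 4 * (d$specie == "b") +
  (8 + 7 * (d$specie == "b")) * d$temperature + rnorm(48, sd = 8)

fit <- lm(growth ~ specie * temperature, data = d)
b <- coef(fit)   # (Intercept), specieb, temperature, specieb:temperature
V <- vcov(fit)   # covariance matrix of the coefficient estimates

# Each row picks out the coefficients that are added together
L <- rbind(
  intercept_b = c(1, 1, 0, 0),   # growth at temperature 0 for specie b
  slope_b     = c(0, 0, 1, 1)    # growth per unit temperature for specie b
)

est <- drop(L %*% b)
se  <- sqrt(diag(L %*% V %*% t(L)))
cbind(estimate = est, std_error = se)
```

The covariance term matters here: the standard error of the sum is not the standard error of either coefficient alone, which is why reading it straight off summary() does not work.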

How to fit a known linear equation to my data in R?

I used a linear model to obtain the best fit to my data, lm() function.
From literature I know that the optimal fit would be a linear regression with the slope = 1 and the intercept = 0. I would like to see how good this equation (y=x) fits my data? How do I proceed in order to find an R^2 as well as a p-value?
This is my data
(y = modelled, x = measured)
measured<-c(67.39369,28.73695,60.18499,49.32405,166.39318,222.29022,271.83573,241.72247, 368.46304,220.27018,169.92343,56.49579,38.18381,49.33753,130.91752,161.63536,294.14740,363.91029,358.32905,239.84112,129.65078,32.76462,30.13952,52.83656,67.35427,132.23034,366.87857,247.40125,273.19316,278.27902,123.24256,45.98363,83.50199,240.99459,266.95707,308.69814,228.34256,220.51319,83.97942,58.32171,57.93815,94.64370,264.78007,274.25863,245.72940,155.41777,77.45236,70.44223,104.22838,294.01645,312.42321,122.80831,41.65770,242.22661,300.07147,291.59902,230.54478,89.42498,55.81760,55.60525,111.64263,305.76432,264.27192,233.28214,192.75603,75.60803,63.75376)
modelled<-c(42.58318,71.64667,111.08853,67.06974,156.47303,240.41188,238.25893,196.42247,404.28974,138.73164,116.73998,55.21672,82.71556,64.27752,145.84891,133.67465,295.01014,335.25432,253.01847,166.69241,68.84971,26.03600,45.04720,75.56405,109.55975,202.57084,288.52887,140.58476,152.20510,153.99427,75.70720,92.56287,144.93923,335.90871,NA,264.25732,141.93407,122.80440,83.23812,42.18676,107.97732,123.96824,270.52620,388.93979,308.35117,100.79047,127.70644,91.23133,162.53323,NA ,276.46554,100.79440,81.10756,272.17680,387.28700,208.29715,152.91548,62.54459,31.98732,74.26625,115.50051,324.91248,210.14204,168.29598,157.30373,45.76027,76.07370)
Now I would like to see how good the equation y=x fits the data presented above (R^2 and p-value)?
I am very grateful if somebody can help me with this (basic) problem, as I found no answers to my question on stackoverflow?
Best regards Cyril
Let's be clear what you are asking here. You have an existing model, which is "the modelled values are the expected value of the measured values", or in other words, measured = modelled + e, where e are the normally distributed residuals.
You say that the "optimal fit" should be a straight line with intercept 0 and slope 1, which is another way of saying the same thing.
The thing is, this "optimal fit" is not the optimal fit for your actual data, as we can easily see by doing:
summary(lm(measured ~ modelled))
#> Call:
#> lm(formula = measured ~ modelled)
#> Residuals:
#> Min 1Q Median 3Q Max
#> -103.328 -39.130 -4.881 40.428 114.829
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 23.09461 13.11026 1.762 0.083 .
#> modelled 0.91143 0.07052 12.924 <2e-16 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> Residual standard error: 55.13 on 63 degrees of freedom
#> Multiple R-squared: 0.7261, Adjusted R-squared: 0.7218
#> F-statistic: 167 on 1 and 63 DF, p-value: < 2.2e-16
This shows us the line that would produce the optimal fit to your data in terms of reducing the sum of the squared residuals.
But I guess what you are asking is "How well do my data fit the model measured = modelled + e ?"
Trying to coerce lm into giving you a fixed intercept and slope probably isn't the best way to answer this question. Remember, the p value for the slope only tells you whether the actual slope is significantly different from 0. The above model already confirms that. If you want to know the r-squared of measured = modelled + e, you just need to know the proportion of the variance of measured that is explained by modelled. In other words:
1 - var(measured - modelled, na.rm = TRUE) / var(measured, na.rm = TRUE)
#> [1] 0.7192672
This is pretty close to the r squared from the lm call.
I think you have sufficient evidence to say that your data is consistent with the model measured = modelled, in that the slope in the lm model includes the value 1 within its 95% confidence interval, and the intercept contains the value 0 within its 95% confidence interval.
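The confidence-interval check described here can be sketched as follows, together with a joint F-test of intercept = 0 and slope = 1, which compares the residual sum of squares under y = x with that of the fitted line. The data are simulated stand-ins, not the values from the question, and the joint test is an addition for illustration rather than something proposed in the thread.

```r
# Simulated stand-in data (assumption: not the data from the question)
set.seed(10)
modelled <- runif(60, 20, 400)
measured <- modelled + rnorm(60, sd = 50)

fit <- lm(measured ~ modelled)

# Does the intercept interval cover 0 and the slope interval cover 1?
confint(fit)

# Joint F-test of H0: intercept = 0 and slope = 1
n    <- length(measured)
rss0 <- sum((measured - modelled)^2)   # residuals under y = x (no free parameters)
rss1 <- sum(resid(fit)^2)              # residuals of the fitted line
f    <- ((rss0 - rss1) / 2) / (rss1 / (n - 2))
p    <- pf(f, df1 = 2, df2 = n - 2, lower.tail = FALSE)
c(F = f, p_value = p)
```

A large p-value here means the data give no evidence against y = x; it does not prove that y = x is the true relationship.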
As mentioned in the comments, you can use the lm() function, but this actually estimates the slope and intercept for you, whereas what you want is something different.
If slope = 1 and intercept = 0, you essentially already have a fit, and your modelled values are the predicted values. You need the R squared from this fit. R squared is defined as:
R^2 = 1 - RSS / TSS
where RSS is the residual sum of squares and TSS is the total sum of squares.
We can only work with observations that are complete (non NA). So we calculate each of them:
nonNA = !is.na(modelled) & !is.na(measured)
# residuals from your prediction
RSS = sum((modelled[nonNA] - measured[nonNA])^2, na.rm = TRUE)
# total sum of squares of the data around its mean
TSS = sum((measured[nonNA] - mean(measured[nonNA]))^2, na.rm = TRUE)
1 - RSS/TSS
[1] 0.7116585
If measured and modelled are supposed to represent the actual and fitted values of an undisclosed model, as discussed in the comments below another answer, and fm is the lm object for that undisclosed model, then
summary(fm)
will show the R^2 and p value of that model.
The R squared value can actually be calculated using only measured and modelled, but the formula differs depending on whether the undisclosed model has an intercept. The signs are that there is no intercept: if there were one, sum(modelled - measured, na.rm = TRUE) should be 0, but in fact it is far from it.
In any case R^2 and the p value are shown in the output of the summary(fm) where fm is the undisclosed linear model so there is no point in restricting the discussion to measured and modelled if you have the lm object of the undisclosed model.
For example, if the undisclosed model is the following then using the builtin CO2 data frame:
fm <- lm(uptake ~ Type + conc, CO2)
we have this output, where the last two lines show R squared and the p value.
lm(formula = uptake ~ Type + conc, data = CO2)
Min 1Q Median 3Q Max
-18.2145 -4.2549 0.5479 5.3048 12.9968
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.830052 1.579918 16.349 < 2e-16 ***
TypeMississippi -12.659524 1.544261 -8.198 3.06e-12 ***
conc 0.017731 0.002625 6.755 2.00e-09 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.077 on 81 degrees of freedom
Multiple R-squared: 0.5821, Adjusted R-squared: 0.5718
F-statistic: 56.42 on 2 and 81 DF, p-value: 4.498e-16

Different coefficient in LRM vs GLM output

Let me first note that I haven't been able to reproduce this error on anything outside of my data set. However, here is the general idea. I have a data frame and I'm trying to build a simple logistic regression to understand the marginal effect of Amount on IsWon. Both models perform poorly, it's one predictor after all, but they produce two different coefficients
First is the glm output:
> summary(mod4)
glm(formula = as.factor(IsWon) ~ Amount, family = "binomial",
data = final_data_obj_samp)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2578 -1.2361 1.0993 1.1066 3.7307
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.18708622416 0.03142171761 5.9540 0.000000002616 ***
Amount -0.00000315465 0.00000035466 -8.8947 < 0.00000000000000022 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6928.69 on 4999 degrees of freedom
Residual deviance: 6790.87 on 4998 degrees of freedom
AIC: 6794.87
Number of Fisher Scoring iterations: 6
Notice that negative coefficient for Amount.
And now the lrm function from rms
Logistic Regression Model
lrm(formula = as.factor(IsWon) ~ Amount, data = final_data_obj_samp,
x = TRUE, y = TRUE)
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 5000 LR chi2 137.82 R2 0.036 C 0.633
0 2441 d.f. 1 g 0.300 Dxy 0.266
1 2559 Pr(> chi2) <0.0001 gr 1.350 gamma 0.288
max |deriv| 0.0007 gp 0.054 tau-a 0.133
Brier 0.242
Coef S.E. Wald Z Pr(>|Z|)
Intercept 0.1871 0.0314 5.95 <0.0001
Amount 0.0000 0.0000 -8.89 <0.0001
Both models do a poor job, but one estimates a positive coefficient and the other a negative coefficient. Sure, the values are negligible, but can someone help me understand this.
For what it's worth, here's what the plot of the lrm object looks like.
> plot(Predict(mod2, fun=plogis))
The plot shows the predicted probabilities of winning have a very negative relationship with Amount.
It seems like lrm is estimating the coefficient to the nearest ±0.0000 value. Since the coefficient value is well below that, it is simply rounding it to 0.0000. Hence it seems positive but may in fact not be.
You should not rely on the printed result from summary to check coefficients. The summary table is produced by a print method, so it is always subject to rounding. Have you tried mod4$coef (get the coefficients of the glm model mod4) and mod2$coef (get the coefficients of the lrm model mod2)? It is a good idea to read the "Value" section of ?glm and ?lrm.
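A small simulated sketch of the rounding issue, using glm only so that it runs with base R: a coefficient of the order of 1e-6 prints as zero at four decimals, while the stored value keeps its magnitude. The data and the names amount, is_won, fit and fit_scaled are invented for illustration, and rescaling the predictor is a suggestion added here, not something from the thread.

```r
# Simulated stand-in data (assumption: not the asker's data set)
set.seed(123)
n      <- 5000
amount <- rexp(n, rate = 1 / 50000)
is_won <- rbinom(n, 1, plogis(0.19 - 3e-6 * amount))

fit <- glm(is_won ~ amount, family = binomial)

b <- coef(fit)["amount"]   # full-precision stored coefficient
sprintf("%.4f", b)         # what a 4-decimal table shows
format(b, digits = 3)      # scientific notation keeps the magnitude visible

# Rescaling the predictor makes the printed coefficient readable
fit_scaled <- glm(is_won ~ I(amount / 1e5), family = binomial)
coef(fit_scaled)
```

The rescaled slope is the original slope multiplied by 1e5, so the two fits describe the same model.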

Coding difference between linear and cubic model

I have two variables ENERGY and TEMP
I have created two other variables, temp2 and temp3
> temp2 <- data$temp^2
> temp3 <- data$temp^3
>data=cbind(data, energy, temp,temp2,temp3)
Now to create a cubic model would it look just like a linear model?
Ok so I did what you suggested and this is the output:
> ?poly
> model<- lm( energy ~ poly(temp, 3) , data=data )
> summary(model)
lm(formula = energy ~ poly(temp, 3), data = data)
Min 1Q Median 3Q Max
-19.159 -11.257 -2.377 9.784 26.841
Estimate Std. Error t value Pr(>|t|)
(Intercept) 95.50 3.21 29.752 < 2e-16 ***
poly(temp, 3)1 207.90 15.72 13.221 2.41e-11 ***
poly(temp, 3)2 -50.07 15.72 -3.184 0.00466 **
poly(temp, 3)3 81.59 15.72 5.188 4.47e-05 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 15.73 on 20 degrees of freedom
Multiple R-squared: 0.9137, Adjusted R-squared: 0.9008
F-statistic: 70.62 on 3 and 20 DF, p-value: 8.105e-11
I would assume that I would test for the goodness of fit test the same way and look at the Pr(>|t|). This would lead me to believe that all of the variables are significant.
would I be able to use this fitted regression model to predict the average energy consumption for an average difference in temperature?
Instead of coding up dummy variable you should consider using the poly function:
?poly # Polynomial contrasts
model<- lm( energy ~ poly(temp, 3) , data=data )
If you want to use the same columns as you would have gotten with the dummies approach (which is not good for statistical inference purposes), you can use the 'raw' parameter:
model.r<- lm( energy ~ poly(temp, 3, raw=TRUE) , data=data )
Predictions will be the same, but the standard errors will not. This should give you the same estimates as would be returned by @RomanLuštrik's suggestion. The terms will not be orthogonal, so their correlations will necessarily be high and you will be unable to make correct inferences about independent effects.
Added question: "would I be able to use this fitted regression model to predict the average energy consumption for an average difference in temperature?"
No. You would need to specify two particular temperatures, and then predict could give you a difference, but that difference will vary depending on what the reference point is, even if the magnitude of the temperature difference is the same. That is a consequence of using a non-linear term. Maybe you should describe your goals and use a forum that is more geared to methods questions. SO is for coding when you know what you want to do. A statistics-focused forum may be more appropriate once you have formulated your question with more clarity.
There are two ways to do polynomial regression with lm:
lm( y ~ x + I(x^2) + I(x^3) )
lm( y ~ poly(x, 3, raw=TRUE) )
(That's cubic. I'm sure you can generalise to quartic, quintic, etc.)
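To make the comparison concrete, here is a sketch on simulated data showing that the orthogonal poly() fit, the raw poly() fit and the I() formulation give the same fitted values, and that the predicted change for a fixed temperature difference depends on the starting point. The data and object names are invented for illustration.

```r
# Simulated stand-in data (assumption: not the energy data from the question)
set.seed(1)
temp   <- seq(-10, 30, length.out = 24)
energy <- 90 + 2 * temp - 0.1 * temp^2 + 0.005 * temp^3 + rnorm(24, sd = 5)
d <- data.frame(temp, energy)

m_orth <- lm(energy ~ poly(temp, 3), data = d)               # orthogonal polynomials
m_raw  <- lm(energy ~ poly(temp, 3, raw = TRUE), data = d)   # raw powers
m_I    <- lm(energy ~ temp + I(temp^2) + I(temp^3), data = d)

# Same fitted curve in all three parameterisations
all.equal(fitted(m_orth), fitted(m_raw))
all.equal(unname(coef(m_raw)), unname(coef(m_I)))

# Predicted energy at evenly spaced temperatures
new_temps <- data.frame(temp = c(0, 10, 20, 30))
pred <- predict(m_orth, newdata = new_temps)

# Each step is 10 degrees, yet the predicted change differs from step to step
diff(pred)
```

Only the coefficient estimates and their standard errors differ between the orthogonal and raw versions; predictions do not.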
