Linear model in R doesn't fit properly

I know that the title doesn't specify exactly what I mean so let me explain it here.
I'm working on a dataset that consists of the yield of wheat given a certain wheat type (A, B, C, D). Now my issue when fitting a linear model is that when I fit:
lm1 = lm(yield ~ type), R omits the first wheat type (A), marks it as a global intercept, and then estimates the influence of all other types on the yield.
I know that I can fit a linear model like such:
lm2 = lm(yield ~ 0 + type), which will give me estimates of the influence of each type on the yield. However, what I really want to see is a sort of combination of the two.
Is there an option to fit a linear model in R such that
lm3 = lm(yield ~ GlobalIntercept + type), where GlobalIntercept would represent the general intercept of my linear model, and then I could see the influence of each type of wheat on that general intercept? So, kind of like the first model, though this time we'd estimate the influence of all types of wheat (A, B, C, D) on the general yield.

Questions to SO should include minimal reproducible example data -- see the instructions at the top of the r tag page. Since the question did not include this, we will provide it this time by using the built-in InsectSprays data set that comes with R.
Here are a few approaches:
1) lm/contr.sum/dummy.coef Try using contr.sum sum-to-zero contrasts for the spray factor and look at the dummy coefficients. That will expand the coefficients to include all 6 levels of the spray factor in this example:
fm <- lm(count ~ spray, InsectSprays, contrasts = list(spray = contr.sum))
dummy.coef(fm)
## Full coefficients are
##
## (Intercept): 9.5
## spray: A B C D E F
## 5.000000 5.833333 -7.416667 -4.583333 -6.000000 7.166667
sum(dummy.coef(fm)$spray) # check that coefs sum to zero
## [1] 0
2) tapply If each level has the same number of rows in the data set, as is the case with InsectSprays where each level has 12 rows, then we can take the mean for each level and subtract the intercept (which is the overall mean). This does not work if the data set is unbalanced, i.e. if the different levels have different numbers of rows. Note how the calculations below give the same result as (1).
mean(InsectSprays$count) # intercept
## [1] 9.5
with(InsectSprays, tapply(count, spray, mean) - mean(count))
## A B C D E F
## 5.000000 5.833333 -7.416667 -4.583333 -6.000000 7.166667
3) aov/model.tables We can also use aov with model.tables like this:
fm2 <- aov(count ~ spray, InsectSprays)
model.tables(fm2)
## Tables of effects
##
## spray
## spray
## A B C D E F
## 5.000 5.833 -7.417 -4.583 -6.000 7.167
model.tables(fm2, type = "means")
## Tables of means
## Grand mean
##
## 9.5
##
## spray
## spray
## A B C D E F
## 14.500 15.333 2.083 4.917 3.500 16.667
4) emmeans We can use lm followed by emmeans like this:
library(emmeans)
fm <- lm(count ~ spray, InsectSprays)
emmeans(fm, "spray")
## spray emmean SE df lower.CL upper.CL
## A 14.50 1.13 66 12.240 16.76
## B 15.33 1.13 66 13.073 17.59
## C 2.08 1.13 66 -0.177 4.34
## D 4.92 1.13 66 2.656 7.18
## E 3.50 1.13 66 1.240 5.76
## F 16.67 1.13 66 14.406 18.93
##
## Confidence level used: 0.95
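As a quick sanity check, since the design is balanced, centering these estimated marginal means should recover the sum-to-zero effects from (1):
em <- as.data.frame(emmeans(fm, "spray"))
em$emmean - mean(em$emmean) # subtract the grand mean
## [1]  5.000000  5.833333 -7.416667 -4.583333 -6.000000  7.166667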

From the information provided, I infer that you are modeling yield as a linear function of type, which has four categories. Your expectation is to have an intercept apart from the coefficients of each of the types. This doesn't make sense.
You are predicting the yield from a nominal variable. For a regression intercept to be interpretable on its own, the predictor needs an origin, i.e. a meaningful zero value, and by definition a nominal variable has no origin. With a continuous predictor, the intercept is the value of the dependent variable y when the predictor is zero; in your case "the category of type is zero" is practically impossible. That is why your model takes one of the categories as a reference category and calculates the intercept for it. The change in y when the category differs from the reference category is given by the coefficients.
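To see this concretely (using the built-in InsectSprays data as a stand-in for the wheat data), the default intercept is just the mean of the reference level:
fm1 <- lm(count ~ spray, InsectSprays)
coef(fm1)[1] # intercept: 14.5, the mean of reference level A
mean(InsectSprays$count[InsectSprays$spray == "A"]) # also 14.5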

Related

Getting "+" sign in the results of MuMIn::dredge

I am trying to run MuMIn::dredge on linear mixed-effects models (lme4::lmer) with categorical/continuous variables; the code is as follows:
# Selection of variables of interest
sig<-c("Age", "Sex", "BMI", "(1|HID)", "h_age", "h", "h_g", "smk_hs")
# Model formula
formula<-paste0("log10_PBA_N", "~", paste0(c(sig), collapse="+"))
# Global model
model<-lmer(formula, data=data)
# Dredging
DRG<-dredge(global.model=model)
The code runs fine (I guess), but in the results, I have this:
Global model call: lmer(formula = formula, data = data)
---
Model selection table
(Int) Age BMI h h_age h_g Sex smk_hs df logLik AICc delta weight
2 -0.2363 -0.01421 4 -332.476 673.0 0.00 0.847
66 -0.2461 -0.01420 + 5 -333.689 677.5 4.47 0.090
34 -0.2406 -0.01417 + 5 -334.508 679.2 6.11 0.040
4 -0.3348 -0.01598 0.007096 5 -335.935 682.0 8.96 0.010
18 -0.1553 -0.01421 + 7 -334.310 682.9 9.84 0.006
98 -0.2493 -0.01416 + + 6 -335.723 683.6 10.60 0.004
68 -0.3463 -0.01599 0.007206 + 6 -337.140 686.5 13.43 0.001
Can someone please explain to me, what does the "+" sign mean in the results?
I recently had the exact same question and was struggling to find an answer. However, based on a response to a similar question asked on RStudio Community, I think the answer is simply that a '+' sign means that a given categorical variable is included in that particular model; since a factor's effect isn't a single number, dredge prints '+' instead of a coefficient.
So, looking at your table, the first model includes the intercept and Age only; the second includes the intercept, Age, and the smk_hs categorical variable; the third the intercept, Age, and Sex; etc.
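If you want to reproduce this behaviour on a self-contained example (mtcars here, since your data isn't available), something like the following should show the same pattern:
library(MuMIn)
dat <- mtcars
dat$cyl <- factor(dat$cyl) # one categorical and one continuous predictor
fit <- lm(mpg ~ cyl + wt, data = dat, na.action = na.fail) # dredge requires na.fail
dredge(fit)
# models that include the factor cyl show "+" in the cyl column,
# while the continuous wt shows a numeric coefficient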

Show family class in TukeyHSD

I am used to conducting Tukey post-hoc tests in minitab. When I do, I usually get family groupings of the dependent/predictor variables.
In R, using TukeyHSD() the family grouping is not displayed (or calculated?). It only displays the relationship between each of the dependent/predictor variables. Is it possible to display the family groupings like in minitab?
Using the diamonds data set:
av <- aov(price ~ cut, data = diamonds)
tk <- TukeyHSD(av, ordered = T, which = "cut")
plot(tk)
Output:
Fit: aov(formula = price ~ cut, data = diamonds)
$cut
diff lwr upr p adj
Good-Ideal 471.32248 300.28228 642.3627 0.0000000
Very Good-Ideal 524.21792 401.33117 647.1047 0.0000000
Fair-Ideal 901.21579 621.86019 1180.5714 0.0000000
Premium-Ideal 1126.71573 1008.80880 1244.6227 0.0000000
Very Good-Good 52.89544 -130.15186 235.9427 0.9341158
Fair-Good 429.89331 119.33783 740.4488 0.0014980
Premium-Good 655.39325 475.65120 835.1353 0.0000000
Fair-Very Good 376.99787 90.13360 663.8622 0.0031094
Premium-Very Good 602.49781 467.76249 737.2331 0.0000000
Premium-Fair 225.49994 -59.26664 510.2665 0.1950425
Here is a step-by-step example of how to reproduce minitab's table for the ggplot2::diamonds dataset. I've included as much detail/explanation as possible.
Please note that as far as I can tell, results shown in minitab's table are not dependent/related to results from Tukey's post-hoc test; they are based on results from the analysis of variance. Tukey's honest significant difference (HSD) test is a post-hoc test that establishes which comparisons (of all the possible pairwise comparisons of group means) are (honestly) statistically significant, given the ANOVA results.
In order to reproduce minitab's "mean-grouping" summary table (see the first table of "Interpret the results: Step 3" of the minitab Express Support), I recommend (re-)running a linear model to extract means and confidence intervals. Note that this is exactly how aov fits the analysis-of-variance model for each group.
Fit a linear model
We specify a 0 intercept in the formula to get absolute estimates for every group (rather than estimates of the changes relative to a reference level).
fit <- lm(price ~ 0 + cut, data = diamonds)
coef <- summary(fit)$coef;
coef;
# Estimate Std. Error t value Pr(>|t|)
#cutFair 4358.758 98.78795 44.12236 0
#cutGood 3928.864 56.59175 69.42468 0
#cutVery Good 3981.760 36.06181 110.41487 0
#cutPremium 4584.258 33.75352 135.81570 0
#cutIdeal 3457.542 27.00121 128.05137 0
Determine family groupings
In order to obtain something similar to minitab's "family groupings", we adopt the following approach:
Calculate confidence intervals for all parameters
Perform a hierarchical clustering analysis on the confidence interval data for all parameters
Cut the resulting tree at a height corresponding to the standard deviation of the CIs. This will give us a grouping of parameter estimates based on their confidence intervals. This is a somewhat empirical approach but justifiable, as the tree measures pairwise distances between the confidence intervals, and the standard deviation can be interpreted as a Euclidean distance.
We start by calculating the confidence interval and cluster the resulting distance matrix using hierarchical clustering using complete linkage.
CI <- confint(fit);
hc <- hclust(dist(CI));
We inspect the cluster dendrogram
plot(hc);
We now cut the tree at a height corresponding to the standard deviation of all CIs across all parameter estimates to get the "family groupings"
grps <- cutree(hc, h = sd(CI))
Summarise results
Finally, we collate all quantities and store results in a table similar to minitab's "mean-grouping" table.
library(tidyverse)
bind_cols(
cut = rownames(coef),
N = as.integer(table(fit$model$cut)),
Mean = coef[, 1],
Groupings = grps) %>%
as.data.frame()
# cut N Mean Groupings
#1 cutFair 1610 4358.758 1
#2 cutGood 4906 3928.864 2
#3 cutVery Good 12082 3981.760 2
#4 cutPremium 13791 4584.258 1
#5 cutIdeal 21551 3457.542 3
Note the near-perfect agreement of our results with those from the minitab "mean-grouping" table: cut = Ideal is by itself in group 3 (group C in minitab's table), while Fair+Premium share group 1 (minitab: group A), and Good+Very Good share group 2 (minitab: group B).
See the cld function in the multcomp package, as explained here (copy-pasted below).
Example data set:
> data(ToothGrowth)
> ToothGrowth$treat <- with(ToothGrowth, interaction(supp,dose))
> str(ToothGrowth)
'data.frame': 60 obs. of 4 variables:
$ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
$ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
$ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
$ treat: Factor w/ 6 levels "OJ.0.5","VC.0.5",..: 2 2 2 2 2 2 2 2 2 2 ...
Model fit:
> fit <- lm(len ~ treat, data=ToothGrowth)
All pairwise comparisons with Tukey test:
> apctt <- multcomp::glht(fit, linfct = multcomp::mcp(treat = "Tukey"))
Letter-based representation of all-pairwise comparisons (algorithm from Piepho 2004):
> lbrapc <- multcomp::cld(apctt)
> lbrapc
OJ.0.5 VC.0.5 OJ.1 VC.1 OJ.2 VC.2
"b" "a" "c" "b" "c" "c"

Plot Kaplan-Meier for Cox regression

I have a Cox proportional hazards model set up using the following code in R that predicts mortality. Covariates A, B and C are added simply to avoid confounding (i.e. age, sex, race) but we are really interested in the predictor X. X is a continuous variable.
cox.model <- coxph(Surv(time, dead) ~ A + B + C + X, data = df)
Now, I'm having trouble plotting a Kaplan-Meier curve for this. I've been searching for how to create this figure but haven't had much luck. I'm not sure if plotting a Kaplan-Meier curve for a Cox model is even possible. Does the Kaplan-Meier adjust for my covariates, or does it not need them?
What I did try is below, but I've been told this isn't right.
plot(survfit(cox.model), xlab = 'Time (years)', ylab = 'Survival Probabilities')
I also tried to plot a figure that shows cumulative hazard of mortality. I don't know if I'm doing it right since I've tried it a few different ways and get different results. Ideally, I would like to plot two lines, one that shows the risk of mortality for the 75th percentile of X and one that shows the 25th percentile of X. How can I do this?
I could list everything else I've tried, but I don't want to confuse anyone!
Many thanks.
Here is an example taken from this paper.
url <- "http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt"
Rossi <- read.table(url, header=TRUE)
Rossi[1:5, 1:10]
# week arrest fin age race wexp mar paro prio educ
# 1 20 1 no 27 black no not married yes 3 3
# 2 17 1 no 18 black no not married yes 8 4
# 3 25 1 no 19 other yes not married yes 13 3
# 4 52 0 yes 23 black yes married yes 1 5
# 5 52 0 no 19 other yes not married yes 3 3
mod.allison <- coxph(Surv(week, arrest) ~
fin + age + race + wexp + mar + paro + prio,
data=Rossi)
mod.allison
# Call:
# coxph(formula = Surv(week, arrest) ~ fin + age + race + wexp +
# mar + paro + prio, data = Rossi)
#
#
# coef exp(coef) se(coef) z p
# finyes -0.3794 0.684 0.1914 -1.983 0.0470
# age -0.0574 0.944 0.0220 -2.611 0.0090
# raceother -0.3139 0.731 0.3080 -1.019 0.3100
# wexpyes -0.1498 0.861 0.2122 -0.706 0.4800
# marnot married 0.4337 1.543 0.3819 1.136 0.2600
# paroyes -0.0849 0.919 0.1958 -0.434 0.6600
# prio 0.0915 1.096 0.0286 3.194 0.0014
#
# Likelihood ratio test=33.3 on 7 df, p=2.36e-05 n= 432, number of events= 114
Note that the model uses fin, age, race, wexp, mar, paro, prio to predict arrest. As mentioned in this document the survfit() function uses the Kaplan-Meier estimate for the survival rate.
plot(survfit(mod.allison), ylim=c(0.7, 1), xlab="Weeks",
ylab="Proportion Not Rearrested")
We get a plot (with a 95% confidence interval) for the survival rate. For the cumulative hazard rate you can do
# plot(survfit(mod.allison)$cumhaz)
but this doesn't give confidence intervals. However, no worries! We know that H(t) = -ln(S(t)) and we have confidence intervals for S(t). All we need to do is
sfit <- survfit(mod.allison)
cumhaz.upper <- -log(sfit$upper)
cumhaz.lower <- -log(sfit$lower)
cumhaz <- sfit$cumhaz # same as -log(sfit$surv)
Then just plot these
plot(cumhaz, xlab="weeks ahead", ylab="cumulative hazard",
ylim=c(min(cumhaz.lower), max(cumhaz.upper)))
lines(cumhaz.lower)
lines(cumhaz.upper)
You'll want to use survfit(..., conf.int=0.50) to get bands for 75% and 25% instead of 97.5% and 2.5%.
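For example, along the same lines as above:
sfit50 <- survfit(mod.allison, conf.int = 0.50)
cumhaz.q25 <- -log(sfit50$upper) # lower band: 25th percentile
cumhaz.q75 <- -log(sfit50$lower) # upper band: 75th percentile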
The request for estimated survival curves at the 25th and 75th percentiles of X first requires determining those percentiles and specifying values for all the other covariates in a dataframe to be used as the newdata argument to survfit:
We can use the data suggested by the other respondent from Fox's website, although on my machine it required building a url object:
url <- url("http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt")
Rossi <- read.table(url, header=TRUE)
It's probably not the best example for this question, but it does have a numeric variable for which we can calculate the quartiles:
> summary(Rossi$prio)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 1.000 2.000 2.984 4.000 18.000
So this would be the model fit and survfit calls:
mod.allison <- coxph(Surv(week, arrest) ~
fin + age + race + prio ,
data=Rossi)
prio.fit <- survfit(mod.allison,
newdata= data.frame(fin="yes", age=30, race="black", prio=c(1,4) ))
plot(prio.fit, col=c("red","blue"))
Setting the confounders to fixed values and plotting the predicted survival probabilities at multiple points in time for given values of X (as #IRTFM suggested in his answer) results in a conditional effect estimate. That is not what a standard Kaplan-Meier estimator is used for, and I don't think it is what the original poster wanted. Usually we are interested in average causal effects. In other words: what would the survival probability be if X had been set to some specific value x in the entire sample?
We can obtain this probability using the Cox model that was fit plus g-computation. In g-computation, we set the value of X to x in the entire sample and then use the Cox model to predict the survival probability at t for each individual, using their observed covariate values in the process. Then we simply take the average of those predictions to obtain our final estimate. By repeating this process for a range of points in time and a range of possible values for X, we obtain a three-dimensional survival surface, which we can then visualize using color scales.
This can be done using the contsurvplot R-package I developed, as discussed in this previous answer: Converting survival analysis by a continuous variable to categorical or in the documentation of the package. More information about this strategy in general can be found in the preprint version of my article on this topic: https://arxiv.org/pdf/2208.04644.pdf
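For reference, a bare-bones sketch of the g-computation step itself, using the cox.model and df objects from the question (the helper function is mine):
library(survival)
x.vals <- quantile(df$X, c(0.25, 0.75)) # 25th and 75th percentiles of X
times <- seq(0, max(df$time), length.out = 50)
g.surv <- function(x) {
  d <- df
  d$X <- x # set X to x for the entire sample
  sf <- survfit(cox.model, newdata = d) # one predicted curve per individual
  rowMeans(summary(sf, times = times)$surv) # average over individuals
}
plot(times, g.surv(x.vals[2]), type = "l", ylim = c(0, 1),
     xlab = "Time (years)", ylab = "Survival Probability")
lines(times, g.surv(x.vals[1]), lty = 2)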

Dummy variables for Logistic regression in R

I am running a logistic regression on three factors that are all binary.
My data
table1<-expand.grid(Crime=factor(c("Shoplifting","Other Theft Acts")),Gender=factor(c("Men","Women")),
Priorconv=factor(c("N","P")))
table1<-data.frame(table1,Yes=c(24,52,48,22,17,60,15,4),No=c(1,9,3,2,6,34,6,3))
and the model
fit4<-glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table1,family=binomial)
summary(fit4)
R seems to take 1 for prior conviction P and 1 for crime shoplifting. As a result the interaction effect is only 1 if both of the above are 1. I would now like to try different combinations for the interaction term, for example I would like to see what it would be if prior conviction is P and crime is not shoplifting.
Is there a way to make R take different cases for the 1s and the 0s? It would facilitate my analysis greatly.
Thank you.
You're already getting all four combinations of the two categorical variables in your regression. You can see this as follows:
Here's the output of your regression:
Call:
glm(formula = cbind(Yes, No) ~ Priorconv + Crime + Priorconv:Crime,
family = binomial, data = table1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.9062 0.3231 5.899 3.66e-09 ***
PriorconvP -1.3582 0.3835 -3.542 0.000398 ***
CrimeShoplifting 0.9842 0.6069 1.622 0.104863
PriorconvP:CrimeShoplifting -0.5513 0.7249 -0.761 0.446942
So, for Priorconv, the reference category (the one with dummy value = 0) is N. And for Crime the reference category is Other. So here's how to interpret the regression results for each of the four possibilities (where log(p/(1-p)) is the log of the odds of a Yes result):
1. PriorConv = N and Crime = Other. This is just the case where both dummies are
zero, so your regression is just the intercept:
log(p/(1-p)) = 1.90
2. PriorConv = P and Crime = Other. So the Priorconv dummy equals 1 and the
Crime dummy is still zero:
log(p/(1-p)) = 1.90 - 1.36
3. PriorConv = N and Crime = Shoplifting. So the Priorconv dummy is 0 and the
Crime dummy is now 1:
log(p/(1-p)) = 1.90 + 0.98
4. PriorConv = P and Crime = Shoplifting. Now both dummies are 1:
log(p/(1-p)) = 1.90 - 1.36 + 0.98 - 0.55
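You can check these against predicted probabilities by inverting the logit; e.g. for case 4, with the rounded coefficients above:
plogis(1.9062 - 1.3582 + 0.9842 - 0.5513) # ~0.727, matching predict() in the update below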
You can reorder the factor values of the two predictor variables, but that will just change which combinations of variables fall into each of the four cases above.
Update: Regarding the issue of regression coefficients relative to ordering of the factors. Changing the reference level will change the coefficients, because the coefficients will represent contrasts between different combinations of categories, but it won't change the predicted probabilities of a Yes or No outcome. (Regression modeling wouldn't be all that credible if you could change the predictions just by changing the reference category.) Note, for example, that the predicted probabilities are the same even if we switch the reference category for Priorconv:
m1 = glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table1,family=binomial)
predict(m1, type="response")
1 2 3 4 5 6 7 8
0.9473684 0.8705882 0.9473684 0.8705882 0.7272727 0.6336634 0.7272727 0.6336634
table2 = table1
table2$Priorconv = relevel(table2$Priorconv, ref = "P")
m2 = glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table2,family=binomial)
predict(m2, type="response")
1 2 3 4 5 6 7 8
0.9473684 0.8705882 0.9473684 0.8705882 0.7272727 0.6336634 0.7272727 0.6336634
I agree with the interpretation provided by #eipi10. You can also use relevel to change the reference level before fitting the model:
levels(table1$Priorconv)
## [1] "N" "P"
table1$Priorconv <- relevel(table1$Priorconv, ref = "P")
levels(table1$Priorconv)
## [1] "P" "N"
m <- glm(cbind(Yes, No) ~ Priorconv*Crime, data = table1, family = binomial)
summary(m)
Note that I changed the formula argument of glm() to include Priorconv*Crime which is more compact.
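The two formulas expand to the same terms, which you can confirm directly:
attr(terms(~ Priorconv * Crime), "term.labels")
## [1] "Priorconv"       "Crime"           "Priorconv:Crime"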

Interpretation of ordered and non-ordered factors, vs. numerical predictors in model summary

I have fitted a model where:
Y ~ A + A^2 + B + mixed.effect(C)
Y is continuous
A is continuous
B actually refers to a DAY and currently looks like this:
Levels: 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8 < 9 < 11 < 12
I can easily change the data type, but I'm not sure whether it is more appropriate to treat B as numeric, a factor, or as an ordered factor. AND when treated as numeric or ordered factor, I'm not quite sure how to interpret the output.
When treated as an ordered factor, summary(my.model) outputs something like this:
Linear mixed model fit by REML ['lmerMod']
Formula: Y ~ A + I(A^2) + B + (1 | mixed.effect.C)
Fixed effects:
Estimate Std. Error t value
(Intercept) 19.04821 0.40926 46.54
A -151.01643 7.19035 -21.00
I(A^2) 457.19856 31.77830 14.39
B.L -3.00811 0.29688 -10.13
B.Q -0.12105 0.24561 -0.49
B.C 0.35457 0.24650 1.44
B^4 0.09743 0.24111 0.40
B^5 -0.08119 0.22810 -0.36
B^6 0.19640 0.22377 0.88
B^7 0.02043 0.21016 0.10
B^8 -0.48931 0.20232 -2.42
B^9 -0.43027 0.17798 -2.42
B^10 -0.13234 0.15379 -0.86
What are L, Q, and C? I need to know the effect of each additional day (B) on the response (Y). How do I get this information from the output?
When I treat B as.numeric, I get something like this as output:
Fixed effects:
Estimate Std. Error t value
(Intercept) 20.79679 0.39906 52.11
A -152.29941 7.17939 -21.21
I(A^2) 461.89157 31.79899 14.53
B -0.27321 0.02391 -11.42
To get the effect of each additional day (B) on the response (Y), am I supposed to multiply the coefficient of B times B (the day number)? Not sure what to do with this output...
This is not really a mixed-model specific question, but rather a general question about model parameterization in R.
Let's try a simple example.
set.seed(101)
d <- data.frame(x=sample(1:4,size=30,replace=TRUE))
d$y <- rnorm(30,1+2*d$x,sd=0.01)
x as numeric
This just does a linear regression: the x parameter denotes the change in y per unit of change in x; the intercept specifies the expected value of y at x=0.
coef(lm(y~x,d))
## (Intercept) x
## 0.9973078 2.0001922
x as (unordered/regular) factor
coef(lm(y~factor(x),d))
## (Intercept) factor(x)2 factor(x)3 factor(x)4
## 3.001627 1.991260 3.995619 5.999098
The intercept specifies the expected value of y in the baseline level of the factor (x=1); the other parameters specify the difference between the expected value of y when x takes on other values.
x as ordered factor
coef(lm(y~ordered(x),d))
## (Intercept) ordered(x).L ordered(x).Q ordered(x).C
## 5.998121421 4.472505514 0.006109021 -0.003125958
Now the intercept specifies the value of y at the mean factor level (halfway between 2 and 3); the L (linear) parameter gives a measure of the linear trend (not quite sure I can explain the particular value ...), and Q and C specify quadratic and cubic terms (which are close to zero in this case because the pattern is linear); if there were more levels, the higher-order contrasts would be labelled ^4, ^5, ...
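If you want to see exactly what L, Q, and C are, you can inspect the orthogonal polynomial contrast matrix that R uses for an ordered factor:
contr.poly(4)
##              .L   .Q         .C
## [1,] -0.6708204  0.5 -0.2236068
## [2,] -0.2236068 -0.5  0.6708204
## [3,]  0.2236068 -0.5 -0.6708204
## [4,]  0.6708204  0.5  0.2236068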
successive-differences contrasts
coef(lm(y~factor(x),d,contrasts=list(`factor(x)`=MASS::contr.sdif)))
## (Intercept) factor(x)2-1 factor(x)3-2 factor(x)4-3
## 5.998121 1.991260 2.004359 2.003478
This contrast specifies the parameters as the differences between successive levels, which are all a constant value of (approximately) 2.
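Again, the contrast matrix itself shows the coding (each column compares a level to the previous one):
MASS::contr.sdif(4)
##     2-1  3-2   4-3
## 1 -0.75 -0.5 -0.25
## 2  0.25 -0.5 -0.25
## 3  0.25  0.5 -0.25
## 4  0.25  0.5  0.75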
