I am running a logistic regression on three factors that are all binary.
My data
table1<-expand.grid(Crime=factor(c("Shoplifting","Other Theft Acts")),Gender=factor(c("Men","Women")),
Priorconv=factor(c("N","P")))
table1<-data.frame(table1,Yes=c(24,52,48,22,17,60,15,4),No=c(1,9,3,2,6,34,6,3))
and the model
fit4<-glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table1,family=binomial)
summary(fit4)
R seems to code prior conviction P as 1 and crime Shoplifting as 1. As a result, the interaction term equals 1 only when both of these are 1. I would now like to try different combinations for the interaction term; for example, I would like to see what it would be if prior conviction is P and crime is not shoplifting.
Is there a way to make R take different cases for the 1s and the 0s? It would facilitate my analysis greatly.
Thank you.
You're already getting all four combinations of the two categorical variables in your regression. You can see this as follows:
Here's the output of your regression:
Call:
glm(formula = cbind(Yes, No) ~ Priorconv + Crime + Priorconv:Crime,
family = binomial, data = table1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.9062 0.3231 5.899 3.66e-09 ***
PriorconvP -1.3582 0.3835 -3.542 0.000398 ***
CrimeShoplifting 0.9842 0.6069 1.622 0.104863
PriorconvP:CrimeShoplifting -0.5513 0.7249 -0.761 0.446942
So, for Priorconv, the reference category (the one with dummy value = 0) is N. And for Crime the reference category is Other Theft Acts (abbreviated to Other below). So here's how to interpret the regression results for each of the four possibilities (where log(p/(1-p)) is the log of the odds of a Yes result):
1. PriorConv = N and Crime = Other. This is just the case where both dummies are
zero, so your regression is just the intercept:
log(p/(1-p)) = 1.90
2. PriorConv = P and Crime = Other. So the Priorconv dummy equals 1 and the
Crime dummy is still zero:
log(p/(1-p)) = 1.90 - 1.36
3. PriorConv = N and Crime = Shoplifting. So the Priorconv dummy is 0 and the
Crime dummy is now 1:
log(p/(1-p)) = 1.90 + 0.98
4. PriorConv = P and Crime = Shoplifting. Now both dummies are 1:
log(p/(1-p)) = 1.90 - 1.36 + 0.98 - 0.55
You can reorder the factor values of the two predictor variables, but that will just change which combinations of variables fall into each of the four cases above.
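As a rough check of the arithmetic above, the four log-odds can be assembled from the fitted coefficients and converted to probabilities with plogis(). This is only a sketch: the names b, logit and prob are introduced here for illustration and do not appear in the original post.

```r
# Rebuild the data and the model from the question
table1 <- expand.grid(Crime = factor(c("Shoplifting", "Other Theft Acts")),
                      Gender = factor(c("Men", "Women")),
                      Priorconv = factor(c("N", "P")))
table1 <- data.frame(table1,
                     Yes = c(24, 52, 48, 22, 17, 60, 15, 4),
                     No  = c(1, 9, 3, 2, 6, 34, 6, 3))
fit4 <- glm(cbind(Yes, No) ~ Priorconv + Crime + Priorconv:Crime,
            data = table1, family = binomial)

# Coefficient order: intercept, PriorconvP, CrimeShoplifting, interaction
b <- unname(coef(fit4))

# Log-odds for each combination of the two dummies
logit <- c(N_Other       = b[1],
           P_Other       = b[1] + b[2],
           N_Shoplifting = b[1] + b[3],
           P_Shoplifting = b[1] + b[2] + b[3] + b[4])

# Inverse logit turns log-odds into probabilities of a Yes result
prob <- plogis(logit)
round(cbind(logit, prob), 4)
```

Because the model contains both main effects and their interaction, these four probabilities should coincide with the observed Yes proportions in each Priorconv by Crime cell (pooled over Gender).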
Update: Regarding the issue of regression coefficients relative to ordering of the factors. Changing the reference level will change the coefficients, because the coefficients will represent contrasts between different combinations of categories, but it won't change the predicted probabilities of a Yes or No outcome. (Regression modeling wouldn't be all that credible if you could change the predictions just by changing the reference category.) Note, for example, that the predicted probabilities are the same even if we switch the reference category for Priorconv:
m1 = glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table1,family=binomial)
predict(m1, type="response")
1 2 3 4 5 6 7 8
0.9473684 0.8705882 0.9473684 0.8705882 0.7272727 0.6336634 0.7272727 0.6336634
table2 = table1
table2$Priorconv = relevel(table2$Priorconv, ref = "P")
m2 = glm(cbind(Yes,No)~Priorconv+Crime+Priorconv:Crime,data=table2,family=binomial)
predict(m2, type="response")
1 2 3 4 5 6 7 8
0.9473684 0.8705882 0.9473684 0.8705882 0.7272727 0.6336634 0.7272727 0.6336634
I agree with the interpretation provided by @eipi10. You can also use relevel to change the reference level before fitting the model:
levels(table1$Priorconv)
## [1] "N" "P"
table1$Priorconv <- relevel(table1$Priorconv, ref = "P")
levels(table1$Priorconv)
## [1] "P" "N"
m <- glm(cbind(Yes, No) ~ Priorconv*Crime, data = table1, family = binomial)
summary(m)
Note that I changed the formula argument of glm() to include Priorconv*Crime which is more compact.
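For the specific case raised in the question (an interaction dummy that equals 1 when prior conviction is P and the crime is not shoplifting), one possible approach is to relevel Crime rather than Priorconv. The sketch below rebuilds table1 from scratch so that Priorconv keeps N as its reference; table3, m1 and m3 are names made up for this illustration.

```r
# Rebuild the data from the question (Priorconv reference level is N)
table1 <- expand.grid(Crime = factor(c("Shoplifting", "Other Theft Acts")),
                      Gender = factor(c("Men", "Women")),
                      Priorconv = factor(c("N", "P")))
table1 <- data.frame(table1,
                     Yes = c(24, 52, 48, 22, 17, 60, 15, 4),
                     No  = c(1, 9, 3, 2, 6, 34, 6, 3))
m1 <- glm(cbind(Yes, No) ~ Priorconv * Crime, data = table1, family = binomial)

# Make Shoplifting the reference level, so the Crime dummy is 1 for Other Theft Acts
table3 <- table1
table3$Crime <- relevel(table3$Crime, ref = "Shoplifting")
m3 <- glm(cbind(Yes, No) ~ Priorconv * Crime, data = table3, family = binomial)

coef(m1)  # interaction is PriorconvP:CrimeShoplifting
coef(m3)  # interaction is now PriorconvP:CrimeOther Theft Acts

# The fitted probabilities do not depend on the choice of reference level
all.equal(fitted(m1), fitted(m3))
```

With two binary factors, releveling one of them should only flip the sign of the interaction coefficient; the main-effect coefficients change because they now describe contrasts against a different baseline cell.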
I am running a model with the lavaan R package that predicts a continuous outcome from a continuous variable and two categorical variables. One of them is a dichotomous variable (let's call it A; 0 = no, 1 = yes) and the other is a three-level categorical variable (let's call it B; 0 = low, medium, 3 = high). Below is an example of the data:
outcome gender age continuous A B
1 1.333333 2 23.22404 1.333333 1 0
2 1.500000 2 23.18033 1.833333 1 1
3 1.500000 2 22.37978 2.166667 1 NA
4 2.250000 1 18.74044 1.916667 1 0
5 1.250000 1 22.37978 1.916667 1 1
6 1.500000 2 20.16940 1.500000 1 NA
In addition to a continuous, a dichotomous, and a three-level categorical variable, my model also includes some control variables:
model.1a <- 'outcome ~ gender + age + continuous + A + B
A ~~ continuous
A ~~ B
continuous ~~ B'
fit.1a <- sem(model=model.1a, data=dat)
summary(fit.1a, fit.measures=TRUE, standardized=TRUE, ci=TRUE, rsquare=T)
In a second step, I also want to include an interaction between variable A and B. For this, I first centered these two variables and then included the interaction in the model:
model.1b <- 'outcome ~ gender + age + continuous + A_centr + B_centr + interaction
A_centr ~~ continuous
A_centr ~~ B_centr
continuous ~~ B_centr
interaction ~~ 0*gender + 0*age
gender ~~ age'
fit.1b <- sem(model=model.1b, data=dat)
summary(fit.1b, fit.measures=TRUE, standardized=TRUE, ci=TRUE, rsquare=T)
However, when I run this model, I get the following error:
Error in lav_samplestats_icov(COV = cov[[g]], ridge = ridge, x.idx = x.idx[[g]], :
lavaan ERROR: sample covariance matrix is not positive-definite
From what I can tell, this is the case because the interaction between the two categorical variables is very similar to the original variables, but I am unsure how to solve this. Does anyone have a suggestion for solving the issue?
For your information, I have already tried using the non-centered version for one or both of the categorical variables for creating the interaction term and in the regression model.
I know that the title doesn't specify exactly what I mean so let me explain it here.
I am working on a dataset that consists of the yield of wheat for a given wheat type (A, B, C, D). My issue when fitting a linear model is the following. When I fit
lm1 = lm(yield ~ type)
R takes the first wheat type (A) as the baseline, absorbs it into the intercept, and then estimates the influence of the other types on the yield relative to it.
I know that I can fit a linear model like this:
lm2 = lm(yield ~ 0 + type)
which gives me an estimate of the influence of each type on the yield. However, what I really want to see is a sort of combination of the two.
Is there an option to fit a linear model in R such that
lm3 = lm(yield ~ GlobalIntercept + type)
where GlobalIntercept would represent the general intercept of my linear model, and I could then see the influence of each type of wheat on that general intercept? So it would be like the first model, except that this time we'd estimate the influence of all types of wheat (A, B, C, D) on the general yield.
Questions to SO should include minimal reproducible example data -- see instructions at the top of the r tag page. Since the question did not include this we will provide it this time by using the built-in InsectSprays data set that comes with R.
Here are a few approaches:
1) lm/contr.sum/dummy.coef Try using contr.sum sum-to-zero contrasts for the spray factor and look at the dummy coefficients. That will expand the coefficients to include all 6 levels of the spray factor in this example:
fm <- lm(count ~ spray, InsectSprays, contrasts = list(spray = contr.sum))
dummy.coef(fm)
## Full coefficients are
##
## (Intercept): 9.5
## spray: A B C D E F
## 5.000000 5.833333 -7.416667 -4.583333 -6.000000 7.166667
sum(dummy.coef(fm)$spray) # check that coefs sum to zero
## [1] 0
2) tapply If each level has the same number of rows in the data set such as is the case with InsectSprays where each level has 12 rows then we can take the mean for each level and then subtract the Intercept (which is the overall mean). This does not work if the data set is unbalanced, i.e. if the different levels have different numbers of rows. Note how the calculations below give the same result as (1).
mean(InsectSprays$count) # intercept
## [1] 9.5
with(InsectSprays, tapply(count, spray, mean) - mean(count))
## A B C D E F
## 5.000000 5.833333 -7.416667 -4.583333 -6.000000 7.166667
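To illustrate the caveat about unbalanced data, here is a small sketch that drops the first six rows of InsectSprays (all spray A) and compares the quantities again. The names unbal, fm_u and group_means are introduced only for this example. Under sum-to-zero contrasts the intercept should track the unweighted mean of the group means, which is no longer the overall mean of all rows.

```r
# Make InsectSprays unbalanced by dropping the first six rows (all spray A)
unbal <- InsectSprays[-(1:6), ]

fm_u <- lm(count ~ spray, data = unbal, contrasts = list(spray = "contr.sum"))
group_means <- with(unbal, tapply(count, spray, mean))

coef(fm_u)[["(Intercept)"]]  # intercept under sum-to-zero contrasts
mean(group_means)            # unweighted mean of the six group means
mean(unbal$count)            # overall mean of all rows, which now differs

# Effects relative to the sum-to-zero intercept, computed two ways
dummy.coef(fm_u)$spray
group_means - mean(group_means)
```

So in the unbalanced case, subtracting mean(count) as in approach (2) would not reproduce the contr.sum effects, whereas subtracting the mean of the group means would.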
3) aov/model.tables We can also use aov with model.tables like this:
fm2 <- aov(count ~ spray, InsectSprays)
model.tables(fm2)
## Tables of effects
##
## spray
## spray
## A B C D E F
## 5.000 5.833 -7.417 -4.583 -6.000 7.167
model.tables(fm2, type = "means")
## Tables of means
## Grand mean
##
## 9.5
##
## spray
## spray
## A B C D E F
## 14.500 15.333 2.083 4.917 3.500 16.667
4) emmeans We can use lm followed by emmeans like this:
library(emmeans)
fm <- lm(count ~ spray, InsectSprays)
emmeans(fm, "spray")
## spray emmean SE df lower.CL upper.CL
## A 14.50 1.13 66 12.240 16.76
## B 15.33 1.13 66 13.073 17.59
## C 2.08 1.13 66 -0.177 4.34
## D 4.92 1.13 66 2.656 7.18
## E 3.50 1.13 66 1.240 5.76
## F 16.67 1.13 66 14.406 18.93
##
## Confidence level used: 0.95
From the information you provided, I infer that you are modeling the yield as a linear function of type, which has four categories. You expect to get an intercept in addition to a coefficient for each of the types. This doesn't make sense.
You are predicting the yield from a nominal variable. A regression intercept requires a predictor with an origin, that is, a zero value for the predictor, and a nominal variable has no origin. In other words, with a continuous predictor the intercept is the value of the dependent variable y when the predictor equals zero; in your case that would mean the category of type is zero, which is practically impossible. That is why your model takes one of the categories as a reference category and uses its mean as the intercept. The coefficients then give the change in y when the category differs from the reference category.
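A quick way to see this reference-category behaviour is to compare the coefficients of a default lm fit with the raw group means. The sketch below uses the built-in InsectSprays data as a stand-in for the wheat data, which was not posted; fm and group_means are names chosen for this example.

```r
# Default treatment contrasts: the first level (spray A) is the reference
fm <- lm(count ~ spray, data = InsectSprays)
group_means <- with(InsectSprays, tapply(count, spray, mean))

coef(fm)[["(Intercept)"]]             # intercept of the fitted model
group_means[["A"]]                    # mean of the reference level, spray A

coef(fm)[-1]                          # remaining coefficients
group_means[-1] - group_means[["A"]]  # differences of each level from spray A
```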
I am used to conducting Tukey post-hoc tests in minitab. When I do, I usually get family grouping of the dependent/predictor variables.
In R, using TukeyHSD() the family grouping is not displayed (or calculated?). It only displays the relationship between each of the dependent/predictor variables. Is it possible to display the family groupings like in minitab?
Using the diamonds data set:
av <- aov(price ~ cut, data = diamonds)
tk <- TukeyHSD(av, ordered = T, which = "cut")
plot(tk)
Output:
Fit: aov(formula = price ~ cut, data = diamonds)
$cut
diff lwr upr p adj
Good-Ideal 471.32248 300.28228 642.3627 0.0000000
Very Good-Ideal 524.21792 401.33117 647.1047 0.0000000
Fair-Ideal 901.21579 621.86019 1180.5714 0.0000000
Premium-Ideal 1126.71573 1008.80880 1244.6227 0.0000000
Very Good-Good 52.89544 -130.15186 235.9427 0.9341158
Fair-Good 429.89331 119.33783 740.4488 0.0014980
Premium-Good 655.39325 475.65120 835.1353 0.0000000
Fair-Very Good 376.99787 90.13360 663.8622 0.0031094
Premium-Very Good 602.49781 467.76249 737.2331 0.0000000
Premium-Fair 225.49994 -59.26664 510.2665 0.1950425
Picture added to help clarify my response to Maruits's comment:
Here is a step-by-step example on how to reproduce minitab's table for the ggplot2::diamonds dataset. I've included details/explanation as much as possible.
Please note that as far as I can tell, results shown in minitab's table are not dependent/related to results from Tukey's post-hoc test; they are based on results from the analysis of variance. Tukey's honest significant difference (HSD) test is a post-hoc test that establishes which comparisons (of all the possible pairwise comparisons of group means) are (honestly) statistically significant, given the ANOVA results.
In order to reproduce minitab's "mean-grouping" summary table (see the first table of "Interpret the results: Step 3" of the minitab Express Support), I recommend (re-)running a linear model to extract means and confidence intervals. Note that this is exactly how aov fits the analysis of variance model for each group.
Fit a linear model
We remove the intercept (the 0 + term in the formula) to get absolute estimates for every group, rather than estimates of the changes relative to a reference group.
fit <- lm(price ~ 0 + cut, data = diamonds)
coef <- summary(fit)$coef;
coef;
# Estimate Std. Error t value Pr(>|t|)
#cutFair 4358.758 98.78795 44.12236 0
#cutGood 3928.864 56.59175 69.42468 0
#cutVery Good 3981.760 36.06181 110.41487 0
#cutPremium 4584.258 33.75352 135.81570 0
#cutIdeal 3457.542 27.00121 128.05137 0
Determine family groupings
In order to obtain something similar to minitab's "family groupings", we adopt the following approach:
1. Calculate confidence intervals for all parameters.
2. Perform a hierarchical clustering analysis on the confidence interval data for all parameters.
3. Cut the resulting tree at a height corresponding to the standard deviation of the CIs. This will give us a grouping of parameter estimates based on their confidence intervals. This is a somewhat empirical approach, but it is justifiable because the tree measures pairwise distances between the confidence intervals, and the standard deviation can be interpreted as a Euclidean distance.
We start by calculating the confidence interval and cluster the resulting distance matrix using hierarchical clustering using complete linkage.
CI <- confint(fit);
hc <- hclust(dist(CI));
We inspect the cluster dendrogram
plot(hc);
We now cut the tree at a height corresponding to the standard deviation of all CIs across all parameter estimates to get the "family groupings"
grps <- cutree(hc, h = sd(CI))
Summarise results
Finally, we collate all quantities and store results in a table similar to minitab's "mean-grouping" table.
library(tidyverse)
bind_cols(
cut = rownames(coef),
N = as.integer(table(fit$model$cut)),
Mean = coef[, 1],
Groupings = grps) %>%
as.data.frame()
# cut N Mean Groupings
#1 cutFair 1610 4358.758 1
#2 cutGood 4906 3928.864 2
#3 cutVery Good 12082 3981.760 2
#4 cutPremium 13791 4584.258 1
#5 cutIdeal 21551 3457.542 3
Note the near-perfect agreement of our results with those from the minitab "mean-grouping" table: cut = Ideal is by itself in group 3 (group C in minitab's table), while Fair+Premium share group 1 (minitab: group A ), and Good+Very Good share group 2 (minitab: group B).
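As an informal cross-check against the Tukey output posted in the question, one can list the pairs whose adjusted p-value is at or above 0.05, since those are the pairs the test does not separate. The sketch below copies the p adj column by hand into a vector called p_adj (a name introduced here); it does not refit anything.

```r
# Adjusted p-values copied from the TukeyHSD output in the question
p_adj <- c("Good-Ideal"        = 0.0000000,
           "Very Good-Ideal"   = 0.0000000,
           "Fair-Ideal"        = 0.0000000,
           "Premium-Ideal"     = 0.0000000,
           "Very Good-Good"    = 0.9341158,
           "Fair-Good"         = 0.0014980,
           "Premium-Good"      = 0.0000000,
           "Fair-Very Good"    = 0.0031094,
           "Premium-Very Good" = 0.0000000,
           "Premium-Fair"      = 0.1950425)

# Pairs that are not significantly different at the 5% level
not_different <- names(p_adj)[p_adj >= 0.05]
not_different
```

The two pairs returned, Very Good-Good and Premium-Fair, are the same pairs that share a group in the table above, while Ideal differs from every other cut.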
See the cld function in the multcomp package, as explained here (copy-pasted below).
Example data set:
> data(ToothGrowth)
> ToothGrowth$treat <- with(ToothGrowth, interaction(supp,dose))
> str(ToothGrowth)
'data.frame': 60 obs. of 3 variables:
$ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
$ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
$ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
$ treat: Factor w/ 6 levels "OJ.0.5","VC.0.5",..: 2 2 2 2 2 2 2 2 2 2 ...
Model fit:
> fit <- lm(len ~ treat, data=ToothGrowth)
All pairwise comparisons with Tukey test:
> apctt <- multcomp::glht(fit, linfct = multcomp::mcp(treat = "Tukey"))
Letter-based representation of all-pairwise comparisons (algorithm from Piepho 2004):
> lbrapc <- multcomp::cld(apctt)
> lbrapc
OJ.0.5 VC.0.5 OJ.1 VC.1 OJ.2 VC.2
"b" "a" "c" "b" "c" "c"
I'm doing logistic regression on Boston data with a column high.medv (yes/no) which indicates if the median house pricing given by column medv is either more than 25 or not.
Below is my code for logistic regression.
high.medv <- ifelse(Boston$medv>25, "Y", "N") # Applying the desired condition to medv and storing the results into a new variable called "high.medv"
ourBoston <- data.frame (Boston, high.medv)
ourBoston$high.medv <- as.factor(ourBoston$high.medv)
attach(Boston)
# 70% of data <- Train
train2<- subset(ourBoston,sample==TRUE)
# 30% will be Test
test2<- subset(ourBoston, sample==FALSE)
glm.fit <- glm (high.medv ~ lstat,data = train2, family = binomial)
summary(glm.fit)
The output is as follows:
Deviance Residuals:
[1] 0
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -22.57 48196.14 0 1
lstat NA NA NA NA
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 0.0000e+00 on 0 degrees of freedom
Residual deviance: 3.1675e-10 on 0 degrees of freedom
AIC: 2
Number of Fisher Scoring iterations: 21
I am also required to use the misclassification rate as the measure of error for two cases:
1. using lstat as the predictor, and
2. using all predictors except high.medv and medv.
But I am stuck at the regression itself.
With every classification algorithm, the art lies in choosing the threshold at which you decide whether the result is positive or negative.
When you predict outcomes in the test data set, you estimate the probability of the response variable being 1. Therefore, you need to say where you are going to cut: the threshold above which the prediction becomes 1 and below which it becomes 0.
A high threshold is more conservative about labeling a case as positive, which makes it less likely to produce false positives and more likely to produce false negatives. The opposite happens for low thresholds.
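The trade-off can be seen by counting false positives and false negatives at a few thresholds. The following is a minimal sketch on simulated data, using in-sample fitted probabilities for brevity; the names p_hat, thresholds and counts, and the particular threshold values, are assumptions made for this illustration.

```r
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1 + 2 * x))            # simulated binary outcome
p_hat <- fitted(glm(y ~ x, family = binomial))  # fitted probabilities of y = 1

thresholds <- c(0.3, 0.5, 0.7, 0.9)
counts <- t(sapply(thresholds, function(th) {
  pred <- as.integer(p_hat > th)                # predict 1 above the threshold
  c(threshold = th,
    false_pos = sum(pred == 1 & y == 0),
    false_neg = sum(pred == 0 & y == 1))
}))
counts
```

Raising the threshold can only shrink the set of cases predicted as 1, so the false-positive count cannot go up and the false-negative count cannot go down as you move down the rows.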
The usual procedure is to plot the rates that interest you, e.g., true positives and false positives against each other, and then choose the trade-off that suits you best.
set.seed(666)
# simulation of logistic data
x1 = rnorm(1000) # some continuous variables
z = 1 + 2*x1 # linear combination with a bias
pr = 1/(1 + exp(-z)) # pass through an inv-logit function
y = rbinom(1000, 1, pr)
df = data.frame(y = y, x1 = x1)
df$train = 0
df$train[sample(1:(2*nrow(df)/3))] = 1
df$new_y = NA
# modelling the response variable
mod = glm(y ~ x1, data = df[df$train == 1,], family = "binomial")
df$new_y[df$train == 0] = predict(mod, newdata = df[df$train == 0,], type = 'response') # predicted probabilities
dat = df[df$train==0,] # test data
To use the misclassification error to evaluate your model, you first need to choose a threshold. For that, you can use the roc function from the pROC package, which calculates the rates and provides the corresponding thresholds:
library(pROC)
rates = roc(dat$y, dat$new_y)
plot(rates) # visualize the trade-off
rates$specificities # ratio of true negatives to all negatives, one value per threshold
rates$thresholds # the corresponding thresholds
dat$jj = as.numeric(dat$new_y>0.7) # using 0.7 as a threshold to indicate that we predict y = 1
table(dat$y, dat$jj) # actual (rows) against predicted (columns) at the 0.7 threshold
0 1
0 86 20
1 64 164
The accuracy of your model can be computed as the ratio of the number of observations you got right against the size of your sample.
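For the table shown above, that calculation might look like the following sketch. The counts are copied by hand from the table() output, and cm, accuracy and misclassification are names introduced here; the misclassification rate is simply one minus the accuracy.

```r
# Confusion matrix copied from the table() output (rows = actual, columns = predicted)
cm <- matrix(c(86, 64, 20, 164), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy <- sum(diag(cm)) / sum(cm)  # correct predictions over all test cases
misclassification <- 1 - accuracy    # share of test cases predicted wrongly
c(accuracy = accuracy, misclassification = misclassification)
```

With a table object from your own model, sum(diag(tab)) / sum(tab) should give the same quantity, provided both classes appear among the predictions so that the table is square.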
I was trying out linear regression with R using categorical attributes and observed that I don't get a coefficient value for each of the different factor levels I have.
Please see my code below. I have 5 factor levels for states, but see only 4 coefficient values.
> states = c("WA","TE","GE","LA","SF")
> population = c(0.5,0.2,0.6,0.7,0.9)
> df = data.frame(states,population)
> df
states population
1 WA 0.5
2 TE 0.2
3 GE 0.6
4 LA 0.7
5 SF 0.9
> states=NULL
> population=NULL
> lm(formula=population~states,data=df)
Call:
lm(formula = population ~ states, data = df)
Coefficients:
(Intercept) statesLA statesSF statesTE statesWA
0.6 0.1 0.3 -0.4 -0.1
I also tried with a larger data set by doing the following, but still see the same behavior
for(i in 1:10)
{
df = rbind(df,df)
}
EDIT: Thanks to responses from eipi10, MrFlick and economy. I now understand that one of the levels is being used as the reference level. But when I get new test data whose state value is "GE", how do I substitute it into the equation y=m1x1+m2x2+...+c ?
I also tried flattening out the data such that each of these factor levels gets its own separate column, but again I get NA as the coefficient for one of the columns. If I have new test data whose state is 'WA', how can I get the 'population value'? What do I substitute as its coefficient?
> df1
population GE MI TE WA
1 1 0 0 0 1
2 2 1 0 0 0
3 2 0 0 1 0
4 1 0 1 0 0
lm(formula = population ~ (GE+MI+TE+WA),data=df1)
Call:
lm(formula = population ~ (GE + MI + TE + WA), data = df1)
Coefficients:
(Intercept) GE MI TE WA
1 1 0 1 NA
GE, the first level alphabetically, is used as the reference level and absorbed into the intercept term. As eipi10 stated, you can interpret the coefficients for the other levels in states with GE as the baseline (statesLA = 0.1 meaning LA is, on average, 0.1 higher than GE).
EDIT:
To respond to your updated question:
If you include all of the levels in a linear regression, you're going to have a situation called perfect collinearity, which is responsible for the strange results you're seeing when you force each category into its own variable. I won't get into the full explanation of that (just find a wiki), but know that linear regression can't estimate a separate coefficient for every level while also estimating an intercept, because the dummy columns add up to the intercept column. If you want to see all of the levels in a regression, you can perform a regression without an intercept term, as suggested in the comments, but again, this is ill-advised unless you have a specific reason to.
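The collinearity can be checked directly on the flattened df1 from the question. In this sketch the four rows are stacked twice (df2, a name made up here) so that there are more rows than columns; fit_int and fit_noint are likewise illustrative names.

```r
# df1 from the question, stacked twice so there are more rows than columns
df1 <- data.frame(population = c(1, 2, 2, 1),
                  GE = c(0, 1, 0, 0),
                  MI = c(0, 0, 0, 1),
                  TE = c(0, 0, 1, 0),
                  WA = c(1, 0, 0, 0))
df2 <- rbind(df1, df1)

# Design matrix with an intercept column plus all four dummies
X <- model.matrix(population ~ GE + MI + TE + WA, data = df2)
ncol(X)     # 5 columns
qr(X)$rank  # rank 4, because GE + MI + TE + WA equals the intercept column

fit_int   <- lm(population ~ GE + MI + TE + WA, data = df2)      # with intercept
fit_noint <- lm(population ~ 0 + GE + MI + TE + WA, data = df2)  # without intercept

coef(fit_int)    # WA comes back as NA: it is aliased with the other columns
coef(fit_noint)  # one estimate per state, equal to that state's mean
```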
As for the interpretation of GE in your y=mx+c equation, you can calculate the expected y by knowing that the levels of the other states are binary (zero or one), and if the state is GE, they will all be zero.
e.g.
y = x1b1 + x2b2 + x3b3 + c
y = b1(0) + b2(0) + b3(0) + c
y = c
If you don't have any other variables, like in your first example, the effect of GE will be equal to the intercept term (0.6).
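In practice you rarely need to substitute the dummies by hand, because predict() does it for you. Here is a small sketch using the five-state data from the question; fit, new_states and pred are names chosen for this example, and states is stored as a factor explicitly so that the new data can reuse the same levels.

```r
# Data from the question, with states stored as a factor
df <- data.frame(states = factor(c("WA", "TE", "GE", "LA", "SF")),
                 population = c(0.5, 0.2, 0.6, 0.7, 0.9))
fit <- lm(population ~ states, data = df)

# New observations must use the same factor levels as the training data
new_states <- data.frame(states = factor(c("GE", "WA"), levels = levels(df$states)))
pred <- predict(fit, newdata = new_states)
pred  # GE uses the intercept alone; WA uses intercept + statesWA
```

For GE all dummies are zero, so the prediction is the intercept (0.6); for WA the statesWA dummy is one, so the prediction is 0.6 - 0.1 = 0.5.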