Categorical variable in logistic regression in R

How do I include a categorical variable in a binary logistic regression in R? I want to test the influence of professional field (student, worker, teacher, self-employed) on the probability of a purchase of a product.
In my example y is a binary variable (1 for buying a product, 0 for not buying), and:
- x1: gender (0 = male, 1 = female)
- x2: age (between 20 and 80)
- x3: the categorical variable, professional field (1 = student, 2 = worker, 3 = teacher, 4 = self-employed)
set.seed(123)
y <- round(runif(100, 0, 1))
x1 <- round(runif(100, 0, 1))
x2 <- round(runif(100, 20, 80))
x3 <- round(runif(100, 1, 4))
test <- glm(y ~ x1 + x2 + x3, family = binomial(link = "logit"))
summary(test)
If I include x3 (the professional fields) in my regression above as it is, I get the wrong estimates/interpretation for x3.
What do I have to do to get the right estimates for the categorical variable (x3)?
Thanks a lot.

I suggest you set x3 as a factor variable; there is no need to create dummy variables yourself:
set.seed(123)
y <- round(runif(100,0,1))
x1 <- round(runif(100,0,1))
x2 <- round(runif(100,20,80))
x3 <- factor(round(runif(100,1,4)),labels=c("student", "worker", "teacher", "self-employed"))
test <- glm(y ~ x1 + x2 + x3, family = binomial(link = "logit"))
summary(test)
This is the output of your model:
Call:
glm(formula = y ~ x1 + x2 + x3, family = binomial(link = "logit"))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4665 -1.1054 -0.9639 1.1979 1.4044
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.464751 0.806463 0.576 0.564
x1 0.298692 0.413875 0.722 0.470
x2 -0.002454 0.011875 -0.207 0.836
x3worker -0.807325 0.626663 -1.288 0.198
x3teacher -0.567798 0.615866 -0.922 0.357
x3self-employed -0.715193 0.756699 -0.945 0.345
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.47 on 99 degrees of freedom
Residual deviance: 135.98 on 94 degrees of freedom
AIC: 147.98
Number of Fisher Scoring iterations: 4
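Note that student (the first factor level) is the reference category, so each x3 coefficient is a log-odds contrast against students. If you want a different baseline, or odds ratios instead of log-odds, here is a minimal sketch continuing the example above:
x3 <- relevel(x3, ref = "worker")  # make "worker" the reference level
test2 <- glm(y ~ x1 + x2 + x3, family = binomial(link = "logit"))
exp(coef(test2))     # odds ratios
exp(confint(test2))  # profile-likelihood confidence intervals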
In any case, I suggest you study this post on R-bloggers:
https://www.r-bloggers.com/logistic-regression-and-categorical-covariates/

Related

Interaction terms for an incomplete design in R

I'm trying to fit a GLM to some data, and I feel like there should be an interaction term between two of the explanatory variables (one categorical and one discrete), but all the non-zero instances of the discrete variable occur in the "1" state of the categorical variable (which is partly why I feel there should be an interaction).
When I put the interaction in the glm (var1*var2), the summary just shows NA for the interaction term (var1:var2).
Any help would be appreciated!
Thank you!
Edit: here is a mock example of my issue
a <- data.frame(y    = c(0, 1, 2, 3),
                var1 = c(0, 1, 1, 1),
                var2 = c(0, 0, 1, 2))
a.glm <- glm(y ~ var1 * var2, family = poisson, data = a)
summary(a.glm)
This comes up in the console:
Call:
glm(formula = y ~ var1 * var2, family = poisson, data = a)
Deviance Residuals:
1 2 3 4
-0.00002 -0.08284 0.12401 -0.04870
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -22.303 42247.166 0.00 1.00
var1 22.384 42247.166 0.00 1.00
var2 0.522 0.534 0.98 0.33
var1:var2 NA NA NA NA
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 4.498681 on 3 degrees of freedom
Residual deviance: 0.024614 on 1 degrees of freedom
AIC: 13.63
Number of Fisher Scoring iterations: 20
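The "(1 not defined because of singularities)" line is the key: in this mock data var1 * var2 is identical to var2, so the interaction column is perfectly collinear with an existing column and R drops it. A quick check, using the data frame above:
with(a, all(var1 * var2 == var2))               # TRUE: the interaction duplicates var2
qr(model.matrix(~ var1 * var2, data = a))$rank  # rank 3 for 4 columns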

Visualising crossed random effect for lme

I am new to mixed models and have some problems. I've got a model:
lmer(F2 ~ (phoneme|individual) + (1|word) + age + frequency + (1|zduration), data = nurse_female)
Linear mixed model fit by REML ['lmerMod']
Formula:
F2 ~ (phoneme | individual) + (1 | word) + age + frequency +
(1 | zduration)
Data: nurse_female
REML criterion at convergence: 654.4
Scaled residuals:
Min 1Q Median 3Q Max
-2.09203 -0.20332 0.03263 0.25273 1.37056
Random effects:
Groups Name Variance Std.Dev. Corr
zduration (Intercept) 0.27779 0.5271
word (Intercept) 0.04488 0.2118
individual (Intercept) 0.34181 0.5846
phonemeIr 0.54227 0.7364 -0.82
phonemeVr 1.52090 1.2332 -0.93 0.91
Residual 0.06326 0.2515
Number of obs: 334, groups:
zduration, 280; word, 116; individual, 23
Fixed effects:
Estimate Std. Error t value
(Intercept) 1.79167 0.32138 5.575
age -0.01596 0.00508 -3.142
frequencylow -0.37587 0.18560 -2.025
frequencymid -1.18901 0.27738 -4.286
frequencyvery high -0.68365 0.26564 -2.574
Correlation of Fixed Effects:
(Intr) age frqncyl frqncym
age -0.811
frequencylw -0.531 -0.013
frequencymd -0.333 -0.006 0.589
frqncyvryhg -0.356 0.000 0.627 0.389
The model predicts the normalised formant values of vowels such as in NURSE for female speakers. Without getting too much into it, there are roughly 3 possible variants, which I coded under phoneme as <Er, Ir, Vr>. individual identifies the speaker. I managed to plot the F2 variance of each speaker using the random effects.
But how do I plot the model predictions for the F2 values for each speaker, with phoneme on the x-axis (i.e. 3 marks for <Er, Ir, Vr>) and F2 on the y-axis?
I tried a few ways but none of them worked.
Thanks in advance. If you need further information/data, just say so.
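One possible approach (a sketch, assuming the model above is stored in m and that nurse_female contains the phoneme and individual columns used in the formula): attach the fitted values, which include the random effects, and summarise them per speaker and phoneme.
library(lme4)
library(ggplot2)
m <- lmer(F2 ~ (phoneme | individual) + (1 | word) + age + frequency +
            (1 | zduration), data = nurse_female, REML = FALSE)
nurse_female$pred <- fitted(m)                # predictions incl. random effects
ggplot(nurse_female, aes(x = phoneme, y = pred)) +
  stat_summary(fun = mean, geom = "point") +  # mean predicted F2 per phoneme
  facet_wrap(~ individual)                    # one panel per speaker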

What does NA in odds ratio mean?

I am currently working on landing-page testing, with both the independent and dependent variables as logical variables. I want to check which of these variables, if true, is a major factor for a conversion.
So basically we are testing multiple variations of a single element. For example, we have three different images; if image 1 is true for a row, the other two image variables are false.
I used logistic regression to conduct this test. When I looked at the odds-ratio output, I ended up with a lot of NAs. I am not sure how to interpret them or how to rectify them.
Below is the sample dataset. The actual data has 18,000+ rows.
classifier1 <- glm(formula = Target ~ .,
                   family = binomial,
                   data = Dataset)
This is the output.
Does this mean I need more data? Is there some other way to conduct multivariate landing page testing?
It looks like two or more of your variables (columns) are perfectly correlated. Try removing some of the columns.
You can see this with a toy data frame filled with random content:
n <- 20
y <- matrix(sample(c(TRUE, FALSE), 5 * n, replace = TRUE), ncol = 5)
colnames(y) <- letters[1:5]
z <- as.data.frame(y)
z$target <- rep(0:1, 2 * n)[1:nrow(z)]
m <- glm(target ~ ., data = z, family = binomial)
summary(m)
In this summary you can see that everything is OK:
Call:
glm(formula = target ~ ., family = binomial, data = z)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.89808 -0.48166 -0.00004 0.64134 1.89222
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -22.3679 4700.1462 -0.005 0.9962
aTRUE 3.2286 1.6601 1.945 0.0518 .
bTRUE 20.2584 4700.1459 0.004 0.9966
cTRUE 0.7928 1.3743 0.577 0.5640
dTRUE 17.0438 4700.1460 0.004 0.9971
eTRUE 2.9238 1.6658 1.755 0.0792 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 27.726 on 19 degrees of freedom
Residual deviance: 14.867 on 14 degrees of freedom
AIC: 26.867
Number of Fisher Scoring iterations: 18
But if you make two columns perfectly correlated, as below, and then fit the generalized linear model again:
z$a <- z$b
m <- glm(target ~ ., data = z, family = binomial)
summary(m)
you can observe NAs in the output:
Call:
glm(formula = target ~ ., family = binomial, data = z)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.66621 -1.01173 0.00001 1.06907 1.39309
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -18.8718 3243.8340 -0.006 0.995
aTRUE 18.7777 3243.8339 0.006 0.995
bTRUE NA NA NA NA
cTRUE 0.3544 1.0775 0.329 0.742
dTRUE 17.1826 3243.8340 0.005 0.996
eTRUE 1.1952 1.2788 0.935 0.350
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 27.726 on 19 degrees of freedom
Residual deviance: 19.996 on 15 degrees of freedom
AIC: 29.996
Number of Fisher Scoring iterations: 17
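Your landing-page data most likely hits the same problem: if image 1, image 2 and image 3 are mutually exclusive and exhaustive logicals, any one of them is determined by the other two, so one coefficient cannot be estimated. One possible fix, sketched with hypothetical column names image1, image2, image3, is to collapse them into a single factor and let glm() build a non-redundant dummy coding:
# image1..image3 are hypothetical names for the mutually exclusive columns
Dataset$image <- factor(ifelse(Dataset$image1, "image1",
                        ifelse(Dataset$image2, "image2", "image3")))
classifier2 <- glm(Target ~ image, family = binomial, data = Dataset)
exp(coef(classifier2))  # odds ratios relative to the image1 reference level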

R duplicating predictor variables with glm and categorical variables

I am building a glm in R with categorical predictors and a binary response. My data is like this (but much bigger and with multiple predictors):
y <- c(1, 1, 1, 0, 0)  # response
x <- c(0, 0, 0, 1, 2)  # predictor
Since this data is categorical (but it is represented by numbers), I did this:
y <- as.factor(y)
x <- as.factor(x)
And then I built my model:
g1 <- glm(y ~ x, family = binomial(link = "logit"))
But the details of the model are the following:
g1
Call: glm(formula = y ~ x, family = binomial(link = "logit"))
Coefficients:
(Intercept) x1 x2
24.57 -49.13 -49.13
Degrees of Freedom: 4 Total (i.e. Null); 2 Residual
Null Deviance: 6.73
Residual Deviance: 2.143e-10 AIC: 6
And the summary is:
summary(g1)
Call:
glm(formula = y ~ x, family = binomial(link = "logit"))
Deviance Residuals:
1 2 3 4 5
6.547e-06 6.547e-06 6.547e-06 -6.547e-06 -6.547e-06
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 24.57 75639.11 0 1
x1 -49.13 151278.15 0 1
x2 -49.13 151278.15 0 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6.7301e+00 on 4 degrees of freedom
Residual deviance: 2.1434e-10 on 2 degrees of freedom
AIC: 6
Number of Fisher Scoring iterations: 23
What I don't understand is why R has turned the single predictor x into x1 and x2. What do x1 and x2 mean?
I also need to explicitly write down the model with the estimates, something of the form y ~ B0 + B1*x, so I am stuck now because x has been split in two and there are no original variables called x1 and x2...
Thanks for your help!
This happens because you have made x a factor. This factor has three levels (0, 1 and 2). When you put a categorical variable in a regression model, one way of coding it is to use a reference category; here R has chosen to make the 0 level the reference. The coefficients of x1 and x2 are then the differences between level 0 and level 1, and between level 0 and level 2, respectively. Written out, the fitted model is logit(p) = B0 + B1*1[x = 1] + B2*1[x = 2], where 1[.] is a dummy (indicator) variable, so x1 and x2 are simply R's names for those two dummies.
This is pretty standard in regression, so you shouldn't find it too surprising. Perhaps you were just confused about how R named the coefficients.
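You can inspect the dummy coding R generated (continuing the example above):
model.matrix(~ x)
# columns: (Intercept), x1, x2
# x1 is 1 in rows where x == "1", x2 is 1 in rows where x == "2";
# rows with x == "0" (the reference level) have both dummies equal to 0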

predict() in lmer regression, but I need only 2 categories

I am attempting to estimate a multilevel model. My code is:
fullModel2 <- lmer(pharmexp_2001 ~ gdp_1000_gm + health_exp_per_cap_1000_gm +
                     life_exp + labour_cost_1000_gm + (year_gm | lowerID),
                   data = adat, REML = FALSE)
which results in the following model:
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: pharmexp_2001 ~ gdp_1000_gm + health_exp_per_cap_1000_gm + life_exp +
labour_cost_1000_gm + (year_gm | lowerID)
Data: adat
AIC BIC logLik deviance df.resid
1830.2 1859.9 -906.1 1812.2 191
Scaled residuals:
Min 1Q Median 3Q Max
-2.5360 -0.6853 -0.0842 0.4923 4.0051
Random effects:
Groups Name Variance Std.Dev. Corr
lowerID (Intercept) 134.6851 11.6054
year_gm 0.4214 0.6492 -1.00
Residual 487.5324 22.0801
Number of obs: 200, groups: lowerID, 2
Fixed effects:
Estimate Std. Error t value
(Intercept) -563.7924 75.4125 -7.476
gdp_1000_gm -0.9050 0.2051 -4.413
health_exp_per_cap_1000_gm 37.5394 6.3943 5.871
life_exp 8.8571 0.9498 9.326
labour_cost_1000_gm -1.3573 0.4684 -2.898
Correlation of Fixed Effects:
(Intr) g_1000 h____1 lif_xp
gdp_1000_gm -0.068
hl____1000_ 0.374 -0.254
life_exp -0.996 0.072 -0.393
lbr_c_1000_ -0.133 -0.139 -0.802 0.142
I know it is a problem that the random-effects correlation is -1, but I have a bigger problem: I have to plot my results, and I need only 2 lines, one for lowerID = 0 and one for lowerID = 1. So I want to plot pharmexp_2001 on the y-axis against year on the x-axis, with only 2 lines (by lowerID). I know that I have to use predict.merMod, but how can I plot only these two lines? Currently my plot has 21 lines (because I analyse pharmaceutical expenditure in 21 countries).
Welcome to the site, @Eszter Takács!
You only need to specify the two IDs in newdata. Here is an example based on the sleepstudy data in R. I assume you want to plot the predicted values on the y-axis. Just replace the code with your data and variables and you will obtain the predicted values for lowerID == 0 and lowerID == 1. Then you can use your code to plot the two lines for the two IDs.
> (fm1 <- lmer(Reaction ~ Days + (Days|Subject), sleepstudy, REML=F))
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: Reaction ~ Days + (Days | Subject)
Data: sleepstudy
AIC BIC logLik deviance
1763.9393 1783.0971 -875.9697 1751.9393
Random effects:
Groups Name Std.Dev. Corr
Subject (Intercept) 23.781
Days 5.717 0.08
Residual 25.592
Number of obs: 180, groups: Subject, 18
Fixed Effects:
(Intercept) Days
251.41 10.47
> newdata = sleepstudy[sleepstudy$Subject==308 | sleepstudy$Subject==333,]
> str(p <- predict(fm1,newdata)) # new data, all RE
Named num [1:20] 254 274 293 313 332 ...
- attr(*, "names")= chr [1:20] "1" "2" "3" "4" ...
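From there, one way to draw the two lines (a sketch, still using the sleepstudy example; substitute your own variables and lowerID values):
library(ggplot2)
newdata$pred <- p  # attach the predictions computed above
ggplot(newdata, aes(x = Days, y = pred, colour = Subject)) +
  geom_line()      # one line per ID kept in newdata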
