I wish to estimate a regression model where the dependent variable is a dummy (coded 0/1) and I have five or six ordinal independent variables (that I will dummy out), plus a bunch of other stuff. Can anyone recommend a package that will do the dummying-out with a minimum of fuss or otherwise handle the ordinal RHS variables? thanks
You can do it all with the built-in glm function, plus appropriate use of factor around the variables in your formula that should be made into dummy variables.
Example:
R> y <- rbinom(100, 1, .5)
R> x1 <- sample(1:5, 100, replace = TRUE)
R> x2 <- sample(1:5, 100, replace = TRUE)
R> m1 <- glm(y ~ factor(x1) + factor(x2), family = binomial(link = "probit"))
R> m1
Call: glm(formula = y ~ factor(x1) + factor(x2), family = binomial(link = "probit"))
Coefficients:
(Intercept) factor(x1)2 factor(x1)3 factor(x1)4 factor(x1)5 factor(x2)2
0.335 -0.729 -0.670 -0.639 -0.740 0.327
factor(x2)3 factor(x2)4 factor(x2)5
-0.106 0.624 0.483
Degrees of Freedom: 99 Total (i.e. Null); 91 Residual
Null Deviance: 138
Residual Deviance: 129 AIC: 147
You may also want to take a look at the dummies package.
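To see exactly which dummy columns factor() produces, you can inspect the design matrix with base R's model.matrix. This is only a minimal sketch: x1 here is a deterministic stand-in for the sampled variable above, and f is a name introduced for illustration.

```r
# Stand-in for an ordinal predictor coded 1..5 (illustrative data, not from the question)
x1 <- rep(1:5, times = 20)
f  <- factor(x1, levels = 1:5)

# model.matrix() builds the same design matrix glm() uses internally
X <- model.matrix(~ f)

colnames(X)   # intercept plus one indicator column per non-reference level
head(X)       # level 1 is the reference, so it has no column of its own
```

Each column f2 to f5 is 1 when the observation is at that level and 0 otherwise; level 1 is absorbed into the intercept.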
I'm trying to fit a GLM on some data and I feel like there should be an interaction term between two of the explanatory variables (one categorical and one discrete) but all the non-zero instances of the discrete variable occur on the "1" state of the categorical variable (partly why I feel like there should be an interaction).
When I put the interaction in the glm (var1*var2), it just shows N/A for the interaction term (var1:var2) in the summary ANOVA.
Any help would be appreciated!
Thank you!
Edit: here is a mock example of my issue
a <- data.frame(y = c(0,1,2,3),
var1 = c(0,1,1,1),
var2 = c(0,0,1,2))
a.glm <- glm(y ~ var1*var2, family=poisson, data = a)
summary(a.glm)
This comes up in the console:
Call:
glm(formula = y ~ var1 * var2, family = poisson, data = a)
Deviance Residuals:
1 2 3 4
-0.00002 -0.08284 0.12401 -0.04870
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -22.303 42247.166 0.00 1.00
var1 22.384 42247.166 0.00 1.00
var2 0.522 0.534 0.98 0.33
var1:var2 NA NA NA NA
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 4.498681 on 3 degrees of freedom
Residual deviance: 0.024614 on 1 degrees of freedom
AIC: 13.63
Number of Fisher Scoring iterations: 20
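One likely reading of the NA, sketched on the mock data: because var2 is non-zero only where var1 is 1, the product var1*var2 is identical to var2, so the interaction column adds no new information and glm drops it. The check below uses only base R; X is a name introduced here for illustration.

```r
a <- data.frame(y = c(0, 1, 2, 3),
                var1 = c(0, 1, 1, 1),
                var2 = c(0, 0, 1, 2))

# Design matrix for the formula used in the question
X <- model.matrix(~ var1 * var2, data = a)

# The interaction column duplicates var2 exactly
all(X[, "var1:var2"] == X[, "var2"])

# Four columns, but only three are linearly independent
ncol(X)
qr(X)$rank
```

With data of this shape the interaction cannot be estimated separately from var2; fitting y ~ var1 + var2 gives the same fitted values without the NA row.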
I am running a logistic regression and I am noticing that each unique character string in my vector is receiving its own parameter. Is R optimizing the prediction of the outcome variable based on each collection of unique values within the vector?
sorry. A little new to stack overflow.
library(stats)
df = as.data.frame( matrix(c("a","a","b","c","c","b","a","a","b","b","c",1,0,0,0,1,0,1,1,0,1,0,1,0,100,10,8,3,5,6,13,10,4,"SF","CHI","NY","NY","SF","SF","CHI","CHI","SF","CHI","NY"), ncol = 4))
colnames(df) = c("letter","number1","number2","city")
df$letter = as.factor(df$letter)
df$city = as.factor(df$city)
df$number1 = as.numeric(df$number1)
df$number2 = as.numeric(df$number2)
glm(number1 ~ .,data=df)
#Call: glm(formula = number1 ~ ., data = df)
#Coefficients:
# (Intercept) letterb letterc number2 cityNY citySF
#1.57191 -0.25227 -0.01424 0.04593 -0.69269 -0.20634
#Degrees of Freedom: 10 Total (i.e. Null); 5 Residual
#Null Deviance: 2.727
#Residual Deviance: 1.35 AIC: 22.14
How is the logit treating city?
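As a hedged illustration of what happens to city: a factor is expanded into indicator (dummy) columns using treatment contrasts, with the first level in alphabetical order as the baseline. The sketch below rebuilds only the city column from the example. Note also that glm without a family argument fits a Gaussian model; a logistic fit needs family = binomial.

```r
city <- factor(c("SF", "CHI", "NY", "NY", "SF", "SF",
                 "CHI", "CHI", "SF", "CHI", "NY"))

levels(city)      # alphabetical order; the first level is the baseline
contrasts(city)   # one 0/1 column per non-baseline level
```

So cityNY and citySF in the output are each level's difference from CHI, which has no coefficient of its own.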
Given sample data of proportions of successes plus sample sizes and independent variable(s), I am attempting logistic regression in R.
The following code does what I want and seems to give sensible results, but it does not look like a sensible approach; in effect it doubles the size of the data set:
datf <- data.frame(prop = c(0.125, 0, 0.667, 1, 0.9),
cases = c(8, 1, 3, 3, 10),
x = c(11, 12, 15, 16, 18))
datf2 <- rbind(datf,datf)
datf2$success <- rep(c(1, 0), each=nrow(datf))
datf2$cases <- round(datf2$cases*ifelse(datf2$success,datf2$prop,1-datf2$prop))
fit2 <- glm(success ~ x, weights=cases, data=datf2, family="binomial")
datf$proppredicted <- 1 / (1 + exp(-predict(fit2, datf)))
plot(datf$x, datf$proppredicted, type="l", col="red", ylim=c(0,1))
points(datf$x, datf$prop, cex=sqrt(datf$cases))
producing a chart (not reproduced here) which looks reasonably sensible.
But I am not happy about the use of datf2 as a way of separating the successes and failures by duplicating the data. Is something like this necessary?
As a lesser question, is there a cleaner way of calculating the predicted proportions?
No need to construct artificial data like that; glm can fit your model from the dataset as given.
> glm(prop ~ x, family=binomial, data=datf, weights=cases)
Call: glm(formula = prop ~ x, family = binomial, data = datf, weights = cases)
Coefficients:
(Intercept) x
-9.3533 0.6714
Degrees of Freedom: 4 Total (i.e. Null); 3 Residual
Null Deviance: 17.3
Residual Deviance: 2.043 AIC: 11.43
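You will get a warning about "non-integer #successes" because prop * cases is not exactly a whole number in every row (0.667 * 3 = 2.001). The estimates are essentially unaffected, as a comparison with the model on your constructed dataset shows: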
> fit2
Call: glm(formula = success ~ x, family = "binomial", data = datf2,
weights = cases)
Coefficients:
(Intercept) x
-9.3532 0.6713
Degrees of Freedom: 7 Total (i.e. Null); 6 Residual
Null Deviance: 33.65
Residual Deviance: 18.39 AIC: 22.39
The regression coefficients (and therefore predicted values) are basically equal. However your residual deviance and AIC are suspect because you've created artificial data points.
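On the lesser question, and as an alternative that avoids the warning: a sketch using the two-column cbind(successes, failures) response that glm accepts for binomial models, plus predict(..., type = "response") to get fitted proportions directly. The column names successes and failures and the object fit3 are introduced here for illustration, and the counts are recovered by rounding, which assumes the proportions were computed from whole counts.

```r
datf <- data.frame(prop = c(0.125, 0, 0.667, 1, 0.9),
                   cases = c(8, 1, 3, 3, 10),
                   x = c(11, 12, 15, 16, 18))

# Recover integer counts from the rounded proportions
datf$successes <- round(datf$prop * datf$cases)
datf$failures  <- datf$cases - datf$successes

# Two-column response: no weights argument and no duplicated rows
fit3 <- glm(cbind(successes, failures) ~ x, family = binomial, data = datf)

# type = "response" applies the inverse logit for you
datf$proppredicted <- predict(fit3, newdata = datf, type = "response")
datf
```

This replaces the manual 1 / (1 + exp(-predict(...))) step with the built-in inverse link.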
How do I include a categorical variable in a binary logistic regression in R? I want to test the influence of professional field (student, worker, teacher, self-employed) on the probability of purchasing a product.
In my example, y is a binary variable (1 for buying the product, 0 for not buying).
- x1: gender (0 male, 1 female)
- x2: age (between 20 and 80)
- x3: the categorical variable (1=student, 2=worker, 3=teacher, 4=self-employed)
set.seed(123)
y<-round(runif(100,0,1))
x1<-round(runif(100,0,1))
x2<-round(runif(100,20,80))
x3<-round(runif(100,1,4))
test<-glm(y~x1+x2+x3, family=binomial(link="logit"))
summary(test)
If I include x3 (the professional field) in the regression as above, I get the wrong estimates/interpretation for x3.
What do I have to do to get the right estimates for the categorical variable (x3)?
Thanks a lot
I suggest making x3 a factor; there is no need to create dummies by hand:
set.seed(123)
y <- round(runif(100,0,1))
x1 <- round(runif(100,0,1))
x2 <- round(runif(100,20,80))
x3 <- factor(round(runif(100,1,4)),labels=c("student", "worker", "teacher", "self-employed"))
test <- glm(y~x1+x2+x3, family=binomial(link="logit"))
summary(test)
Here is the summary of the model:
Call:
glm(formula = y ~ x1 + x2 + x3, family = binomial(link = "logit"))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4665 -1.1054 -0.9639 1.1979 1.4044
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.464751 0.806463 0.576 0.564
x1 0.298692 0.413875 0.722 0.470
x2 -0.002454 0.011875 -0.207 0.836
x3worker -0.807325 0.626663 -1.288 0.198
x3teacher -0.567798 0.615866 -0.922 0.357
x3self-employed -0.715193 0.756699 -0.945 0.345
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.47 on 99 degrees of freedom
Residual deviance: 135.98 on 94 degrees of freedom
AIC: 147.98
Number of Fisher Scoring iterations: 4
In any case, I suggest reading this post on R-bloggers:
https://www.r-bloggers.com/logistic-regression-and-categorical-covariates/
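If "student" is not the comparison you care about, the reference category can be changed with base R's relevel. The sketch below uses simulated stand-in data (not the exact draws above), and x3w and fit are names introduced for illustration.

```r
set.seed(123)
y  <- rbinom(100, 1, 0.5)
x3 <- factor(sample(1:4, 100, replace = TRUE), levels = 1:4,
             labels = c("student", "worker", "teacher", "self-employed"))

# Make "worker" the baseline instead of the first level
x3w <- relevel(x3, ref = "worker")

fit <- glm(y ~ x3w, family = binomial(link = "logit"))

names(coef(fit))   # coefficients are now differences from "worker"
exp(coef(fit))     # exponentiated, i.e. odds ratios relative to "worker"
```

Each x3w coefficient is then the log-odds difference between that field and workers.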
I am building a glm in R with categorical predictors and a binary response. My data is like this (but much bigger and with multiple predictors):
y <- c(1,1,1,0,0) #response
x <- c(0,0,0,1,2) #predictor
Since this data is categorical (but it is represented by numbers), I did this:
y <- as.factor(y)
x <- as.factor(x)
And then I built my model:
g1 <- glm(y~x, family=binomial(link="logit"))
But the details of the model are the following:
g1
Call: glm(formula = y ~ x, family = binomial(link = "logit"))
Coefficients:
(Intercept) x1 x2
24.57 -49.13 -49.13
Degrees of Freedom: 4 Total (i.e. Null); 2 Residual
Null Deviance: 6.73
Residual Deviance: 2.143e-10 AIC: 6
And the summary is:
summary(g1)
Call:
glm(formula = y ~ x, family = binomial(link = "logit"))
Deviance Residuals:
1 2 3 4 5
6.547e-06 6.547e-06 6.547e-06 -6.547e-06 -6.547e-06
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 24.57 75639.11 0 1
x1 -49.13 151278.15 0 1
x2 -49.13 151278.15 0 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6.7301e+00 on 4 degrees of freedom
Residual deviance: 2.1434e-10 on 2 degrees of freedom
AIC: 6
Number of Fisher Scoring iterations: 23
What I don't understand is why R has duplicated the x predictor in x1 and x2? What do x1 and x2 mean?
I also need to explicitly write down the model with the estimates, something of the form: y ~ B0 + B1*x so I am stuck now because x has been divided in two and there are no initial variables called x1 and x2...
Thanks for your help!
This happens because you have made x a factor. This factor has three levels (0, 1 and 2). When you put a categorical variable in a regression model, one way of coding it is to use a reference category. In this case R has chosen to make the 0 level the reference category. The coefficients x1 and x2 are then the differences in log-odds between level 1 and level 0, and between level 2 and level 0, respectively.
This is pretty standard in regression so you shouldn't find it too surprising. Perhaps you were just confused about how R named the coefficients.
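To write the model down explicitly, the form is logit(P(y = 1)) = B0 + B1*[x is 1] + B2*[x is 2], where each bracket is a 0/1 indicator. The sketch below checks this by hand on a small hypothetical data set (not the one in the question, where y is 1 exactly when x is 0, which is why the estimates and standard errors there are so large).

```r
# Hypothetical data: three observations per level, no perfect separation
y <- c(1, 1, 0,  1, 0, 0,  1, 0, 0)
x <- factor(c(0, 0, 0,  1, 1, 1,  2, 2, 2))

g <- glm(y ~ x, family = binomial(link = "logit"))
b <- coef(g)   # named "(Intercept)", "x1", "x2"

# Predicted probability for each level, built from the written-out model
p0 <- plogis(b["(Intercept)"])             # x == 0 (reference level)
p1 <- plogis(b["(Intercept)"] + b["x1"])   # x == 1
p2 <- plogis(b["(Intercept)"] + b["x2"])   # x == 2
c(p0, p1, p2)
```

The three probabilities match the observed proportion of y = 1 within each level, which is what a model with a single factor predictor reproduces.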