I was running a regression using categorical variables and came across this question. Here, the user wanted to add a column for each dummy. This left me quite confused because I thought that having long data, with a single factor column (via as.factor()) holding all the dummies, was equivalent to having explicit dummy variables.
Could someone explain the difference between the following two linear regression models?
Linear Model 1, where Month is a factor:
dt_long
Sales Period Month
1: 0.4898943 1 M1
2: 0.3097716 1 M1
3: 1.0574771 1 M1
4: 0.5121627 1 M1
5: 0.6650744 1 M1
---
8108: 0.5175480 24 M12
8109: 1.2867316 24 M12
8110: 0.6283875 24 M12
8111: 0.6287151 24 M12
8112: 0.4347708 24 M12
M1 <- lm(data = dt_long,
         formula = Sales ~ Period + factor(Month))
Linear Model 2 where each month is an indicator variable:
dt_wide
Sales Period M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12
1: 0.4898943 1 1 0 0 0 0 0 0 0 0 0 0 0
2: 0.3097716 1 1 0 0 0 0 0 0 0 0 0 0 0
3: 1.0574771 1 1 0 0 0 0 0 0 0 0 0 0 0
4: 0.5121627 1 1 0 0 0 0 0 0 0 0 0 0 0
5: 0.6650744 1 1 0 0 0 0 0 0 0 0 0 0 0
---
8108: 0.5175480 24 0 0 0 0 0 0 0 0 0 0 0 1
8109: 1.2867316 24 0 0 0 0 0 0 0 0 0 0 0 1
8110: 0.6283875 24 0 0 0 0 0 0 0 0 0 0 0 1
8111: 0.6287151 24 0 0 0 0 0 0 0 0 0 0 0 1
8112: 0.4347708 24 0 0 0 0 0 0 0 0 0 0 0 1
M2 <- lm(data = dt_wide,
         formula = Sales ~ Period + M1 + M2 + M3 + ... + M11 + M12)
Judging by this previously asked question, both models seem exactly the same. However, after running both models, I noticed that model M1 returns 11 dummy coefficients (because month M1 is used as the reference level), while model M2 returns 12.
Is one model better than the other? Is M1 more efficient? Can I set the reference level in M1 so that the two models are exactly equivalent?
Defining a model as in M1 is just a shortcut for including dummy variables: if you wanted to compute those regression coefficients by hand, clearly they'd have to be numeric.
Now, something you perhaps didn't notice about M2 is that one of the dummies should have an NA coefficient. That is because you manually included all of them and kept the intercept, which creates perfect collinearity. Dropping one of the dummies, or adding -1 to eliminate the constant term, fixes it.
Some examples. Let
y <- rnorm(100)
x0 <- rep(1:0, each = 50)
x1 <- rep(0:1, each = 50)
x <- factor(x1)
In this way x0 and x1 are a decomposition of x. Then
## Too much
lm(y ~ x0 + x1)
# Call:
# lm(formula = y ~ x0 + x1)
# Coefficients:
# (Intercept) x0 x1
# -0.15044 0.07561 NA
## One way to fix it
lm(y ~ x0 + x1 - 1)
# Call:
# lm(formula = y ~ x0 + x1 - 1)
# Coefficients:
# x0 x1
# -0.07483 -0.15044
## Another one
lm(y ~ x1)
# Call:
# lm(formula = y ~ x1)
# Coefficients:
# (Intercept) x1
# -0.07483 -0.07561
## The same results
lm(y ~ x)
# Call:
# lm(formula = y ~ x)
# Coefficients:
# (Intercept) x1
# -0.07483 -0.07561
Ultimately all the models contain the same amount of information, but under perfect multicollinearity we face an identification issue.
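As for the last part of the question: yes, you can choose which level is dropped so both parameterizations use the same reference, e.g. with relevel() (a sketch using the data above):
M1b <- lm(Sales ~ Period + relevel(factor(Month), ref = "M12"), data = dt_long)
# now M12, not M1, is the reference month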
Improper dummy coding.
When you change a categorical variable into dummy variables, you will have one fewer dummy variable than you had categories. That’s because the last category is already indicated by having a 0 on all other dummy variables. Including the last category just adds redundant information, resulting in multicollinearity. So always check your dummy coding if it seems you’ve got a multicollinearity problem.
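For instance, a minimal sketch of the default (proper) coding in R:
f <- factor(c("a", "b", "c"))
model.matrix(~ f)  # 3 levels -> intercept plus 2 dummies; level "a" is the reference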
#Here is my code:
#Note: library() loads one package per call
library(MASS)    #MASS: Support Functions and Datasets for Venables and Ripley's MASS
library(caret)   #caret: Classification and Regression Training
library(stepPlr) #stepPlr: L2 penalized logistic regression with a stepwise variable selection
library(janitor) #janitor: Simple Tools for Examining and Cleaning Dirty Data
#Howells is the main dataframe; we will subset it.
HNORSE <- Howells[which(Howells$Pop == 'NORSE'), ]
#Remove all-NA columns using the janitor package
HNORSE <- remove_empty_cols(HNORSE)
#Assigning 0's and 1's to females and males resp.
HNORSE$PopSex[HNORSE$PopSex=="NORSEF"] <- '0'
HNORSE$PopSex[HNORSE$PopSex=="NORSEM"] <- '1'
HNORSE$PopSex <- as.numeric(HNORSE$PopSex)
HNORSE$PopSex
#Resultant column looks like this
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [41] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [81] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
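As an aside, the recoding above can be collapsed into one line (a sketch, assuming PopSex holds exactly the two labels shown):
HNORSE$PopSex <- as.numeric(HNORSE$PopSex == "NORSEM") # males 1, females 0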
I want to use step.plr from the stepPlr package (loaded above):
a <- step.plr(HNORSE[, c(6:76)], HNORSE$PopSex,
              lambda = 1e-4, cp = "bic",
              max.terms = 1, trace = TRUE, type = "forward")
#Where HNORSE[,c(6:76)] --> features
#HNORSE$PopSex ---> Binary response
#lambda ----> Default value
#max.terms ---> I tried values greater than 1 for max.terms, but then R goes
#into an infinite loop of 'Convergence Error', so I am using max.terms = 1
Then I ran the summary command on "a":
summary(a)
Call: plr(x = ix0, y = y, weights = weights, offset.subset = offset.subset,
offset.coefficients = offset.coefficients, lambda = lambda,
cp = cp)
Coefficients:
           Estimate Std.Error z value Pr(>|z|)
Intercept -71.93470   13.3521  -5.388        0
ZYB         0.55594    0.1033   5.382        0
Null deviance: 152.49 on 109 degrees of freedom
Residual deviance: 57.29 on 108 degrees of freedom
Score: deviance + 4.7 * df = 66.69
I used step.plr, so I should then use predict.stepplr, right, and not predict.plr?
By this logic I wish to use predict.stepplr. The example from its documentation goes like this:
n <- 100
p <- 5
x0 <- matrix(sample(seq(3),n*p,replace=TRUE),nrow=n)
x0 <- cbind(rnorm(n),x0)
y <- sample(c(0,1),n,replace=TRUE)
level <- vector("list",length=6)
for (i in 2:6) level[[i]] <- seq(3)
fit <- step.plr(x0,y,level=level)
x1 <- matrix(sample(seq(3),n*p,replace=TRUE),nrow=n)
x1 <- cbind(rnorm(n),x1)
pred1 <- predict(fit,x0,x1,type="link")
pred2 <- predict(fit,x0,x1,type="response")
pred3 <- predict(fit,x0,x1,type="class")
object: stepplr object
x:      matrix of features used for fitting object.
        If newx is provided, x must be provided as well.
newx:   matrix of features at which the predictions are made.
        If newx=NULL, predictions for the training data are returned.
type:   If type=link, the linear predictors are returned;
        if type=response, the probability estimates are returned; and
        if type=class, the class labels are returned. Default is type=link.
...:    other options for prediction.
So, first of all, I did not do any sampling like in the example shown here. I want to predict HNORSE$PopSex, which is a binary variable. My feature data set, which does not include the binary response column, is HNORSE[,c(6:76)]. What should I pass as the x and newx arguments to predict.stepplr()? If my setup is wrong, how do I correctly call predict.stepplr? I then want to compute the overall accuracy and plot its density, i.e. plot(density(overall_accuracy)).
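Based on the documented arguments above, a minimal sketch of such a call might be (untested; it assumes the fitted object a from step.plr and predicts back on the training features):
trainX <- as.matrix(HNORSE[, 6:76])
pred <- predict(a, x = trainX, newx = trainX, type = "response")
pred_class <- as.numeric(pred > 0.5)
overall_accuracy <- mean(pred_class == HNORSE$PopSex)
# density() needs a vector of values (e.g. accuracies over many resamples),
# not a single number, before plot(density(overall_accuracy)) is meaningful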
I would like to run a logistic regression with a specific group (range of values) of a categorical variable. I did the following steps:
1. I cut the variable to groups:
cut_Var3 <- cut(dat$Var3,breaks=c(0,3,6,9))
table(cut_Var3) gave me this output (cut_Var3 was turned into a factor):
# (0,3] (3,6] (6,9]
# 5 4 4
I wanted to do a logistic regression with another variable, but separately for the (3,6] level only, so that I could run the regression on the 4 observations of the second group.
2. I tried to write this line of code (and also other variations):
ff <- glm( TargetVar ~ relevel(cut_Var3,3:6), data = dat)
but with no luck.
What should I do in order to run it properly?
Attached is an example data set:
dat <- read.table(text = " TargetVar Var1 Var2 Var3
0 0 0 7
0 0 1 1
0 1 0 3
0 1 1 7
1 0 0 5
1 0 1 1
1 1 0 0
1 1 1 6
0 0 0 8
0 0 1 5
1 1 1 4
0 0 1 2
1 0 0 9
1 1 1 2 ", header = TRUE)
For relevel you need to specify the level label exactly as it appears in the factor:
glm( TargetVar ~ relevel(cut_Var3,"(3,6]"), data = dat)
Call: glm(formula = TargetVar ~ relevel(cut_Var3, "(3,6]"), data = dat)
Coefficients:
(Intercept) relevel(cut_Var3, "(3,6]")(0,3]
0.75 -0.35
relevel(cut_Var3, "(3,6]")(6,9]
-0.50
Degrees of Freedom: 12 Total (i.e. Null); 10 Residual
(1 observation deleted due to missingness)
Null Deviance: 3.231
Residual Deviance: 2.7 AIC: 24.46
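Note that glm() defaults to family = gaussian, so the fit above is really a linear model; for an actual logistic regression add the binomial family (the coefficients will then differ from the output shown):
glm(TargetVar ~ relevel(cut_Var3, "(3,6]"), data = dat, family = binomial)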
I would like to take the dependent variable of a logistic regression (in my data set it's dat$admit) and regress it on each of the other available variables in turn, one simple regression per independent variable.
The outcome I want is a list of each regression's summary. Using the data set submitted below, there should be 3 regressions.
Here is a sample data set (where admit is the logistic regression dependent variable) :
dat <- read.table(text = "
female apcalc admit num
0 0 0 7
0 0 1 1
0 1 0 3
0 1 1 7
1 0 0 5
1 0 1 1
1 1 0 0
1 1 1 6",
header = TRUE)
I got an example for simple linear regression, but when I tried to change the function from lm to glm I got "list()" as a result.
Here is the original code - for the iris dataset where "Sepal.Length" is the dependent variable :
sapply(names(iris)[-1],
       function(x) lm.fit(cbind(1, iris[, x]), iris[, "Sepal.Length"])$coef)
How can I create the right function for a logistic regression?
This is perhaps a little too condensed, but it does the job. Of course, the sample data set is too small to get any sensible answers ...
t(sapply(setdiff(names(dat), "admit"),
         function(x) coef(glm(reformulate(x, response = "admit"),
                              data = dat, family = binomial))))
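If you want the full summary of each fit (the list the question asks for) rather than just the coefficient table, a small variation (sketch):
vars <- setdiff(names(dat), "admit")
fits <- lapply(vars, function(x) glm(reformulate(x, response = "admit"),
                                     data = dat, family = binomial))
names(fits) <- vars
lapply(fits, summary)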
I have a data.frame consisting of numeric and factor variables as seen below.
testFrame <- data.frame(First = sample(1:10, 20, replace = TRUE),
                        Second = sample(1:20, 20, replace = TRUE),
                        Third = sample(1:10, 20, replace = TRUE),
                        Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
                        Fifth = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4))
I want to build out a matrix that assigns dummy variables to the factor and leaves the numeric variables alone.
model.matrix(~ First + Second + Third + Fourth + Fifth, data=testFrame)
As expected when running lm this leaves out one level of each factor as the reference level. However, I want to build out a matrix with a dummy/indicator variable for every level of all the factors. I am building this matrix for glmnet so I am not worried about multicollinearity.
Is there a way to have model.matrix create the dummy for every level of the factor?
(Trying to redeem myself...) In response to Jared's comment on @fabians' answer about automating it, note that all you need to supply is a named list of contrast matrices. contrasts() takes a vector/factor and produces the contrasts matrix from it. We can therefore use lapply() to run contrasts() on each factor in our data set, e.g. for the testFrame example provided:
> lapply(testFrame[,4:5], contrasts, contrasts = FALSE)
$Fourth
Alice Bob Charlie David
Alice 1 0 0 0
Bob 0 1 0 0
Charlie 0 0 1 0
David 0 0 0 1
$Fifth
Edward Frank Georgia Hank Isaac
Edward 1 0 0 0 0
Frank 0 1 0 0 0
Georgia 0 0 1 0 0
Hank 0 0 0 1 0
Isaac 0 0 0 0 1
Which slots nicely into @fabians' answer:
model.matrix(~ ., data=testFrame,
contrasts.arg = lapply(testFrame[,4:5], contrasts, contrasts=FALSE))
You need to reset the contrasts for the factor variables:
model.matrix(~ Fourth + Fifth, data=testFrame,
contrasts.arg=list(Fourth=contrasts(testFrame$Fourth, contrasts=F),
Fifth=contrasts(testFrame$Fifth, contrasts=F)))
or, with a little less typing and without the proper names:
model.matrix(~ Fourth + Fifth, data=testFrame,
contrasts.arg=list(Fourth=diag(nlevels(testFrame$Fourth)),
Fifth=diag(nlevels(testFrame$Fifth))))
caret implemented a nice function dummyVars to achieve this with 2 lines:
library(caret)
dmy <- dummyVars(" ~ .", data = testFrame)
testFrame2 <- data.frame(predict(dmy, newdata = testFrame))
Checking the final columns:
colnames(testFrame2)
"First" "Second" "Third" "Fourth.Alice" "Fourth.Bob" "Fourth.Charlie" "Fourth.David" "Fifth.Edward" "Fifth.Frank" "Fifth.Georgia" "Fifth.Hank" "Fifth.Isaac"
The nicest part here is that you get the original data frame back, plus the dummy variables, with the original factor columns used for the transformation removed.
More info: http://amunategui.github.io/dummyVar-Walkthrough/
dummyVars from caret could also be used. http://caret.r-forge.r-project.org/preprocess.html
OK, just reading the above and putting it all together. Suppose you want the matrix, e.g. 'X.factors', that multiplies by your coefficient vector to give your linear predictor. There are still a couple of extra steps:
X.factors <-
  model.matrix(~ ., data = X, contrasts.arg =
                 lapply(data.frame(X[, sapply(data.frame(X), is.factor)]),
                        contrasts, contrasts = FALSE))
(Note that you need to turn the X[, ...] subset back into a data frame in case you have only one factor column.)
Then say you get something like this:
attr(X.factors,"assign")
[1] 0 1 **2** 2 **3** 3 3 **4** 4 4 5 6 7 8 9 10 #emphasis added
We want to get rid of the **-marked reference levels of each factor:
att <- attr(X.factors, "assign")
factor.columns <- unique(att[duplicated(att)])
unwanted.columns <- match(factor.columns, att)
X.factors <- X.factors[, -unwanted.columns]
X.factors <- data.matrix(X.factors)
A tidyverse answer:
library(dplyr)
library(tidyr)
result <- testFrame %>%
  mutate(one = 1) %>% spread(Fourth, one, fill = 0, sep = "") %>%
  mutate(one = 1) %>% spread(Fifth, one, fill = 0, sep = "")
yields the desired result (the same as @Gavin Simpson's answer):
> head(result, 6)
First Second Third FourthAlice FourthBob FourthCharlie FourthDavid FifthEdward FifthFrank FifthGeorgia FifthHank FifthIsaac
1 1 5 4 0 0 1 0 0 1 0 0 0
2 1 14 10 0 0 0 1 0 0 1 0 0
3 2 2 9 0 1 0 0 1 0 0 0 0
4 2 5 4 0 0 0 1 0 1 0 0 0
5 2 13 5 0 0 1 0 1 0 0 0 0
6 2 15 7 1 0 0 0 1 0 0 0 0
Using the R package 'CatEncoders'
library(CatEncoders)
testFrame <- data.frame(First=sample(1:10, 20, replace=T),
Second=sample(1:20, 20, replace=T), Third=sample(1:10, 20, replace=T),
Fourth=rep(c("Alice","Bob","Charlie","David"), 5),
Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"),4))
fit <- OneHotEncoder.fit(testFrame)
z <- transform(fit,testFrame,sparse=TRUE) # give the sparse output
z <- transform(fit,testFrame,sparse=FALSE) # give the dense output
I am currently learning the lasso model with glmnet::cv.glmnet(), model.matrix(), and Matrix::sparse.model.matrix() (for a high-dimensional matrix, using model.matrix will kill our time, as suggested by the author of glmnet).
Just sharing a tidy way of coding that gets the same answer as @fabians' and @Gavin's answers. Meanwhile, @asdf123 introduced another package, CatEncoders, as well.
> require('useful')
> # always use all levels
> build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = FALSE)
>
> # just use all levels for Fourth
> build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = c(Fourth = FALSE, Fifth = TRUE))
Source: R for Everyone: Advanced Analytics and Graphics (page 273)
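For completeness, the sparse route mentioned above works with the same contrasts.arg trick (a sketch; sparse.model.matrix accepts the same arguments as model.matrix):
library(Matrix)
X <- sparse.model.matrix(~ ., data = testFrame,
                         contrasts.arg = lapply(testFrame[, 4:5], contrasts, contrasts = FALSE))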
I wrote a package called ModelMatrixModel to improve on the functionality of model.matrix(). By default, the ModelMatrixModel() function in the package returns a class containing a sparse matrix with all levels of the dummy variables, which is suitable as input to cv.glmnet() in the glmnet package. Importantly, the returned
class also stores the transformation parameters, such as the factor level information, which can then be applied to new data. The function can handle most items in an R formula, like poly() and interactions. It also offers several other options, such as handling invalid factor levels and scaling the output.
#devtools::install_github("xinyongtian/R_ModelMatrixModel")
library(ModelMatrixModel)
testFrame <- data.frame(First=sample(1:10, 20, replace=T),
Second=sample(1:20, 20, replace=T), Third=sample(1:10, 20, replace=T),
Fourth=rep(c("Alice","Bob","Charlie","David"), 5))
newdata=data.frame(First=sample(1:10, 2, replace=T),
Second=sample(1:20, 2, replace=T), Third=sample(1:10, 2, replace=T),
Fourth=c("Bob","Charlie"))
mm=ModelMatrixModel(~First+Second+Fourth, data = testFrame)
class(mm)
## [1] "ModelMatrixModel"
class(mm$x) #default output is sparse matrix
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
data.frame(as.matrix(head(mm$x,2)))
## First Second FourthAlice FourthBob FourthCharlie FourthDavid
## 1 7 17 1 0 0 0
## 2 9 7 0 1 0 0
#apply the same transformation to the new data; note that the dummy variables for 'Fourth' include levels that do not appear in the new data
mm_new=predict(mm,newdata)
data.frame(as.matrix(head(mm_new$x,2)))
## First Second FourthAlice FourthBob FourthCharlie FourthDavid
## 1 6 3 0 1 0 0
## 2 2 12 0 0 1 0
You can use tidyverse to achieve this without specifying each column manually.
The trick is to make a "long" dataframe.
Then, munge a few things, and spread it back to wide to create the indicators/dummy variables.
Code:
library(tidyverse)
## add index variable for pivoting
testFrame$id <- 1:nrow(testFrame)
testFrame %>%
## pivot to "long" format
gather(feature, value, -id) %>%
## add indicator value
mutate(indicator=1) %>%
## create feature name that unites a feature and its value
unite(feature, value, col="feature_value", sep="_") %>%
## convert to wide format, filling missing values with zero
spread(feature_value, indicator, fill=0)
The output:
id Fifth_Edward Fifth_Frank Fifth_Georgia Fifth_Hank Fifth_Isaac First_2 First_3 First_4 ...
1 1 1 0 0 0 0 0 0 0
2 2 0 1 0 0 0 0 0 0
3 3 0 0 1 0 0 0 0 0
4 4 0 0 0 1 0 0 0 0
5 5 0 0 0 0 1 0 0 0
6 6 1 0 0 0 0 0 0 0
7 7 0 1 0 0 0 0 1 0
8 8 0 0 1 0 0 1 0 0
9 9 0 0 0 1 0 0 0 0
10 10 0 0 0 0 1 0 0 0
11 11 1 0 0 0 0 0 0 0
12 12 0 1 0 0 0 0 0 0
...
model.matrix(~ First + Second + Third + Fourth + Fifth - 1, data=testFrame)
or
model.matrix(~ First + Second + Third + Fourth + Fifth + 0, data=testFrame)
should be the most straightforward. One caveat: when the formula contains several factors, removing the intercept only gives a full set of indicator columns for the first factor; the later factors still drop their reference level.
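A quick way to see that caveat (a sketch):
colnames(model.matrix(~ Fourth + Fifth - 1, data = testFrame))
# Fourth appears with all 4 levels, while Fifth keeps only 4 of its 5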