Suppose we have a toy data frame:
x <- data.frame(x1 = gl(3, 2, labels = letters[1:3]),
x2 = gl(3, 2, labels = LETTERS[1:3]))
I would like to construct a model matrix
# x1b x1c x2B x2C
# 1 0 0 0 0
# 2 0 0 0 0
# 3 1 0 1 0
# 4 1 0 1 0
# 5 0 1 0 1
# 6 0 1 0 1
by:
model.matrix(~ x1 + x2 - 1, data = x,
contrasts.arg = list(x1 = contr.treatment(letters[1:3]),
x2 = contr.treatment(LETTERS[1:3])))
but actually I get:
# x1a x1b x1c x2B x2C
# 1 1 0 0 0 0
# 2 1 0 0 0 0
# 3 0 1 0 1 0
# 4 0 1 0 1 0
# 5 0 0 1 0 1
# 6 0 0 1 0 1
# attr(,"assign")
# [1] 1 1 1 2 2
# attr(,"contrasts")
# attr(,"contrasts")$x1
# b c
# a 0 0
# b 1 0
# c 0 1
# attr(,"contrasts")$x2
# B C
# A 0 0
# B 1 0
# C 0 1
I am somewhat confused here:
I have passed in explicit contrast matrices to drop the first level of each factor;
I have asked for the intercept to be dropped.
Then why am I getting a model matrix with 5 columns? How can I get the model matrix I want?
Whenever we lose control of something at R level, there must be some default, unchangeable behaviour at C level. The C code for model.matrix.default() can be found in the R source package at:
R-<release_number>/src/library/stats/src/model.c
We can find the explanation here:
/* If there is no intercept we look through the factor pattern */
/* matrix and adjust the code for the first factor found so that */
/* it will be coded by dummy variables rather than contrasts. */
Let's make a small test on this, with a data frame
x <- data.frame(x1 = gl(2, 2, labels = letters[1:2]), x2 = sin(1:4))
if we only have x2 on the RHS, we can drop intercept successfully:
model.matrix(~ x2 - 1, data = x)
# x2
#1 0.8414710
#2 0.9092974
#3 0.1411200
#4 -0.7568025
if we have only x1 on the RHS, contrast is not applied:
model.matrix(~ x1 - 1, data = x)
# x1a x1b
#1 1 0
#2 1 0
#3 0 1
#4 0 1
when we have both x1 and x2, contrast is not applied:
model.matrix(~ x1 + x2 - 1, data = x)
# x1a x1b x2
#1 1 0 0.8414710
#2 1 0 0.9092974
#3 0 1 0.1411200
#4 0 1 -0.7568025
This implies that while there is difference between:
lm(y ~ x2, data = x)
lm(y ~ x2 - 1, data = x)
there is no difference in the fitted model (only in how the coefficients are parametrized) between
lm(y ~ x1, data = x)
lm(y ~ x1 - 1, data = x)
or
lm(y ~ x1 + x2, data = x)
lm(y ~ x1 + x2 - 1, data = x)
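A minimal sketch of the comparison above. The response y does not appear in the original test data frame, so it is invented here purely for illustration. The check is that the fitted values agree even though the coefficient names differ.

```r
# Hypothetical data: y is made up for illustration, it is not in the original post
set.seed(42)
d <- data.frame(x1 = gl(2, 2, labels = letters[1:2]), x2 = sin(1:4))
d$y <- rnorm(4)

fit_with    <- lm(y ~ x1, data = d)      # intercept plus one treatment contrast
fit_without <- lm(y ~ x1 - 1, data = d)  # one dummy column per level, no intercept

# Both design matrices span the same column space, so the fits coincide
same_fit <- isTRUE(all.equal(fitted(fit_with), fitted(fit_without)))

names(coef(fit_with))     # "(Intercept)" "x1b"
names(coef(fit_without))  # "x1a" "x1b"
```

Only the meaning of the coefficients changes: with the intercept, x1b is the difference from level a, whereas without it each coefficient is a level mean.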
The reason for such behaviour is not to ensure numerical stability, but to ensure the sensibility of estimation / prediction. If we really drop the intercept while applying contrast to x1, we end up with a model matrix:
# x1b
#1 0
#2 0
#3 1
#4 1
The effect is that we constrain estimation for level a to 0.
In this post: How can I force dropping intercept or equivalent in this linear model?, we have a dataset:
# Y X1 X2
#1 1.8376852 TRUE TRUE
#2 -2.1173739 TRUE FALSE
#3 1.3054450 FALSE TRUE
#4 -0.3476706 TRUE FALSE
#5 1.3219099 FALSE TRUE
#6 0.6781750 FALSE TRUE
The combination (X1 = FALSE, X2 = FALSE) does not occur in this dataset. But in a broad sense, model.matrix() has to do something safe and sensible. It would be biased to assume that because a combination of two factor levels is absent from the training dataset, it never needs to be predicted. If we really drop the intercept while applying contrasts, the prediction for such a combination is constrained to 0. However, the OP of that post deliberately wants such non-standard behaviour (for some reason), in which case a possible workaround was given in my answer there.
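If the 4-column matrix from the question is really what is wanted, one commonly used way to get it (a sketch, not necessarily the workaround referred to above) is to build the matrix with the intercept and then drop the intercept column by hand. Keep in mind the caveat discussed above: fitting with this matrix forces the prediction for the reference levels to 0.

```r
x <- data.frame(x1 = gl(3, 2, labels = letters[1:3]),
                x2 = gl(3, 2, labels = LETTERS[1:3]))

# Keep the intercept so that every factor is coded by treatment contrasts ...
mm_full <- model.matrix(~ x1 + x2, data = x)

# ... then remove the intercept column manually
mm <- mm_full[, colnames(mm_full) != "(Intercept)", drop = FALSE]
colnames(mm)
```

Subsetting drops the "assign" and "contrasts" attributes, so the result is a plain numeric matrix.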
Related
There is something I do not understand in model.matrix. When I enter a single binary variable without an intercept it returns two levels.
> temp.data <- data.frame('x' = sample(c('A', 'B'), 1000, replace = TRUE))
> temp.data.table <- model.matrix( ~ 0 + x, data = temp.data)
> head(temp.data.table)
xA xB
1 1 0
2 0 1
3 0 1
4 0 1
5 1 0
6 0 1
However, when I enter a second binary variable, it creates only 3 columns. Why is that? What makes the behavior of the function suddenly different, and how can I avoid it?
> temp.data <- data.frame('x' = sample(c('A', 'B'), 1000, replace = TRUE),
+ 'y' = sample(c('J', 'D'), 1000, replace = TRUE))
> temp.data.table <- model.matrix( ~ 0 + x + y, data = temp.data)
> head(temp.data.table)
xA xB yJ
1 0 1 0
2 0 1 1
3 0 1 1
4 0 1 0
5 1 0 1
6 0 1 0
You need to work with factors and set the contrasts to FALSE. Try this:
n <- 10
temp.data <- data.frame('x'=sample(c('A', 'B'), n, replace=TRUE),
'y'=factor(sample(c('J', 'D'), n, replace=TRUE)))
model.matrix( ~ 0 + x + y, data=temp.data,
  contrasts.arg=list(y=contrasts(temp.data$y, contrasts=FALSE)))
# xA xB yD yJ
# 1 0 1 1 0
# 2 1 0 0 1
# 3 0 1 1 0
# 4 1 0 0 1
# 5 0 1 0 1
# 6 1 0 1 0
# 7 1 0 1 0
# 8 0 1 1 0
# 9 0 1 0 1
# 10 0 1 1 0
# attr(,"assign")
# [1] 1 1 2 2
# attr(,"contrasts")
# attr(,"contrasts")$x
# [1] "contr.treatment"
#
# attr(,"contrasts")$y
# D J
# D 1 0
# J 0 1
To understand why this happens, try:
contrasts(temp.data$y)
# J
# D 0
# J 1
contrasts(temp.data$y, contrasts=FALSE)
# D J
# D 1 0
# J 0 1
For your x variable this happens automatically because 0 + removes the intercept. (Strictly speaking, x should also be coded as a factor.)
The reason is that in linear regression the levels of a factor variable are usually compared to a reference level (which you can change using relevel). In your model matrix, 0 + removes the intercept, so the first factor gets a full set of dummy columns, but the following factors are still coded with contrasts (try model.matrix( ~ 0 + y + x, data=temp.data), where you get only one x column but two y columns). This is determined by the default contrasts setting, which uses treatment contrasts.
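A small sketch of the order effect described above, using a tiny made-up data frame (not the asker's sampled data) so that the result is deterministic. Whichever factor comes first in the formula gets the full dummy coding; the other one loses its reference level.

```r
# Toy data, invented for illustration
d <- data.frame(x = factor(c("A", "B", "B", "A")),
                y = factor(c("J", "D", "J", "D")))

cols_xy <- colnames(model.matrix(~ 0 + x + y, data = d))  # x first
cols_yx <- colnames(model.matrix(~ 0 + y + x, data = d))  # y first

cols_xy  # "xA" "xB" "yJ"
cols_yx  # "yD" "yJ" "xB"
```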
You may want to read a relevant post of Rose Maier (2015) explaining this in great detail:
Contrasts in R
You need to reset the contrasts of the factor variables. See this post.
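# contrasts() only works on factors; since R 4.0 strings are no longer converted by default
temp.data <- data.frame('x' = sample(c('A', 'B'), 1000, replace = TRUE),
                        'y' = sample(c('J', 'D'), 1000, replace = TRUE),
                        stringsAsFactors = TRUE)
dat <- model.matrix(~ -1 + ., data = temp.data,
                    contrasts.arg = lapply(temp.data[, 1:2], contrasts, contrasts = FALSE))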
head(dat)
xA xB yD yJ
1 0 1 0 1
2 1 0 0 1
3 1 0 0 1
4 1 0 0 1
5 0 1 1 0
6 0 1 0 1
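One thing to be aware of before using such a matrix in a regression: with full dummy coding for both factors the columns are linearly dependent (xA + xB equals yD + yJ in every row), so the matrix is rank deficient. The sketch below checks this on a small made-up data frame; the data are an assumption for illustration only.

```r
# Toy data, invented for illustration
d <- data.frame(x = factor(c("A", "B", "B", "A", "A")),
                y = factor(c("J", "D", "J", "D", "J")))

full <- model.matrix(~ 0 + x + y, data = d,
                     contrasts.arg = lapply(d, contrasts, contrasts = FALSE))

n_cols <- ncol(full)       # 4 columns: xA xB yD yJ
rank   <- qr(full)$rank    # but only 3 of them are linearly independent
```

With lm() this shows up as an NA coefficient for one of the columns; for penalized methods that can handle redundant columns it may be exactly what is wanted.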
Here is the data set for reproducibility:
a=c(90.41,37.37,18.98)
b=c(103.39,39.44,51.68)
c=c(83.51,36.41,47.46)
d=c(94.60,38.57,50.22)
e=c(95.04,38.81,50.49)
xx=t(data.frame(a,b,c,d,e))
df=data.frame(xx)
And here is the if/else function I am trying to run on the data frame:
classify=function(df){
if (df>=110) {
class="5"}
else if (df<110 & df>=103){
class="4"}
else if (df<103 & df>=95){
class="3"}
else if (df<95 & df>=76){
class="2"}
else if (df<76){
class="1"}
else {class="none"}
}
However, what I want the if/else function to produce is a new data frame that looks like this:
df
X1 X2 X3
a 2 1 1
b 4 1 1
c 2 1 1
d 2 1 1
e 3 1 1
I am unsure as to how to do this so any help would be super appreciated. I anticipate something is wrong in the if/else function itself but I am quite inexperienced and I don't know how to detect errors in the script that easily. Thank you!
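# right = FALSE gives left-closed intervals, matching the >= conditions in the question
sapply(df, function(x) as.numeric(as.character(cut(x, c(-Inf,76,95,103,110,Inf), labels = 1:5, right = FALSE))))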
X1 X2 X3
[1,] 2 1 1
[2,] 4 1 1
[3,] 2 1 1
[4,] 2 1 1
[5,] 3 1 1
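Use cut to set the intervals (its 2nd argument) and the labels (its 3rd argument). Note that cut uses right-closed intervals by default; pass right = FALSE if a value equal to a break point should fall into the higher class, as in your >= conditions. cut returns a factor, so convert back to numeric if you prefer that. Since you want to run the function over the full data frame, use sapply or lapply.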
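The interval direction only matters for values that sit exactly on a break point, which the example data happen not to contain. A short sketch with hand-picked boundary values (chosen for illustration) shows the difference; labels = FALSE makes cut return integer codes directly.

```r
breaks <- c(-Inf, 76, 95, 103, 110, Inf)
x <- c(75.9, 76, 95, 103, 110)   # boundary values picked for illustration

right_closed <- cut(x, breaks, labels = FALSE)                 # default: (a, b]
left_closed  <- cut(x, breaks, labels = FALSE, right = FALSE)  # [a, b), matches x >= a

right_closed  # 1 1 2 3 4
left_closed   # 1 2 3 4 5
```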
You can do this with findInterval. All you have to do is to pass it a non-decreasing vector of break points.
classify <- function(DF, breaks = c(-Inf, 76, 95, 103, 110, Inf)){
f <- function(x, breaks) findInterval(x, breaks)
DF[] <- lapply(DF, f, breaks)
DF
}
classify(df)
# X1 X2 X3
#a 2 1 1
#b 4 1 1
#c 2 1 1
#d 2 1 1
#e 3 1 1
Quite a similar approach to your example, using case_when from dplyr:
library(dplyr)
classify <- function(x){
case_when(
x >= 110 ~ "5",
x >= 103 & x < 110 ~ "4",
x >= 95 & x < 103 ~ "3",
x >= 76 & x < 95 ~ "2",
x < 76 ~ "1",
TRUE ~ "none"
)
}
a = c(90.41, 37.37, 18.98)
b = c(103.39, 39.44, 51.68)
c = c(83.51, 36.41, 47.46)
d = c(94.60, 38.57, 50.22)
e = c(95.04, 38.81, 50.49)
df <- data.frame(matrix(c(a, b, c, d, e), ncol = 3, byrow = TRUE))
mutate_all(df, classify)
# X1 X2 X3
#1 2 1 1
#2 4 1 1
#3 2 1 1
#4 2 1 1
#5 3 1 1
If the data contain special values (Inf, NaN, NA), for example:
df
# X1 X2 X3
#1 -Inf 37.37 18.98
#2 103.39 NaN 51.68
#3 83.51 36.41 47.46
#4 94.60 Inf 50.22
#5 95.04 38.81 NA
The results look like this:
mutate_all(df, classify)
# X1 X2 X3
#1 1 1 1
#2 4 none 1
#3 2 1 1
#4 2 5 1
#5 3 1 none
I would like to fit a separate model for each level of a factor and then, on the fly, use the fitted model matching each level to predict new data at that level. My prediction step fails with this logic. Can someone guide me, considering the case below?
Aa <- data.frame(amount=c(1,2,1,2,1,1,2,2,1,1,1,2,2,2,1), cat1=sample(letters[21:24], 15,rep=TRUE),cat2=sample(letters[11:18], 5,rep=TRUE),
card=c("a","b","c","a","c","b","a","c","b","a","b","c","a","c","a"), delay=sample(c(1,1,0,0,0),5,rep=TRUE))
ModelFit<-sapply(as.character(unique(Aa[["card"]])), function(x)glm(delay~amount+cat1+cat2, family = "binomial", data = subset(Aa, card==x)), simplify = FALSE, USE.NAMES = TRUE)
Bb<-Aa[-(which(names(Aa) %in% "delay"))]
sapply(unique(Aa[["card"]]), function(x,y) predict(seq_along(x=ModelFit), newdata=DataOPEN[DataOPEN$SubsidiaryName],type="response"))
I have made this into a loop for simplicity. The prediction throws a warning, but seems to work. Your DataOPEN dataset was not provided, so I just calculated the prediction using the original Aa (new column pred). A final rounded version of the prediction is shown in column pred.round.
Aa <- data.frame(amount=c(1,2,1,2,1,1,2,2,1,1,1,2,2,2,1), cat1=sample(letters[21:24], 15,rep=TRUE),cat2=sample(letters[11:18], 5,rep=TRUE),
card=c("a","b","c","a","c","b","a","c","b","a","b","c","a","c","a"), delay=sample(c(1,1,0,0,0),5,rep=TRUE))
ModelFit <- sapply(as.character(unique(Aa[["card"]])), function(x)glm(delay~amount+cat1+cat2, family = "binomial", data = subset(Aa, card==x)), simplify = FALSE, USE.NAMES = TRUE)
Aa$pred <- NaN # create a new variable for prediction
for(i in unique(as.character(Aa$card))){ # works whether card is a factor or a character column
newdat <- subset(Aa, subset=card==i)
newdat$pred <- predict(ModelFit[[i]], newdata=newdat,type="response")
Aa$pred[match(rownames(newdat), rownames(Aa))] <- newdat$pred
}
Aa$pred.round <- round(Aa$pred) # a rounded prediction
Aa
The output:
> Aa
amount cat1 cat2 card delay pred pred.round
1 1 u p a 0 1.170226e-09 0
2 2 x o b 1 1.000000e+00 1
3 1 x o c 0 2.143345e-11 0
4 2 w m a 0 1.170226e-09 0
5 1 v n c 0 2.143345e-11 0
6 1 x p b 0 5.826215e-11 0
7 2 u o a 1 5.000000e-01 0
8 2 x o c 0 2.143345e-11 0
9 1 w m b 0 5.826215e-11 0
10 1 w n a 0 1.170226e-09 0
11 1 w p b 0 5.826215e-11 0
12 2 w o c 1 1.000000e+00 1
13 2 u o a 0 5.000000e-01 0
14 2 u m c 0 2.143345e-11 0
15 1 w n a 0 1.170226e-09 0
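The same idea can be sketched more compactly with split() and a named list of models. Everything below is an assumption made for illustration: the data are small and deterministic rather than the sampled Aa, the model uses only amount as a predictor, and new_data is a hypothetical stand-in for the DataOPEN data set that was not provided.

```r
# Hypothetical training data: 3 cards, 10 rows each, non-separable outcome
train <- data.frame(card   = rep(c("a", "b", "c"), each = 10),
                    amount = rep(1:10, times = 3),
                    delay  = rep(c(0, 0, 1, 0, 0, 1, 1, 0, 1, 1), times = 3))

# One logistic regression per card; the list is named by card level
fits <- lapply(split(train, train$card), function(d)
  glm(delay ~ amount, family = binomial, data = d))

# Hypothetical new data to score
new_data <- data.frame(card = c("a", "c", "b", "a"), amount = c(2, 9, 5, 7))
new_data$pred <- NA_real_

# Look up the model by name and predict only the matching rows
for (k in intersect(names(fits), unique(new_data$card))) {
  rows <- new_data$card == k
  new_data$pred[rows] <- predict(fits[[k]], newdata = new_data[rows, ],
                                 type = "response")
}
new_data
```

Rows whose card has no fitted model keep pred = NA instead of raising an error.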
I'm trying to generate a model.matrix that puts dummy variables for a categorical variable if it exists in either of a pair of factors. Here is an example:
group1 <- factor(c("A","A","A","A","B",
"B","B","C","C","D"),
levels=c("A","B","C","D","E"))
group2 <- factor(c("B","C","D","E","C",
"D","E","D","E","E"),
levels=levels(group1))
set.seed(8)
val <- rnorm(10,1,.25)
control1 <- rnorm(10,2,.5)
df <- data.frame(group1,
group2,
val,
control1)
This results in 10 rows for the (5*(5-1)/2) pairs of (A,B,C,D,E):
df
group1 group2 val control1
1 A B 0.9788535 1.620103
2 A C 1.2101000 2.146025
3 A D 0.8841293 2.210699
4 A E 0.8622912 1.352755
5 B C 1.1840101 2.034643
6 B D 0.9730296 1.593481
7 B E 0.9574277 2.755427
8 C D 0.7279171 1.864196
9 C E 0.2472371 2.779127
10 D E 0.8517064 1.881325
I want to control for a fixed effect in a linear model when a particular level is in either group1 or group2. I can construct a model matrix for this:
tmp1 <- model.matrix(~ 0+group1,df)
tmp2 <- model.matrix(~ 0+group2,df)
tmp3 <- (tmp1|tmp2)*1
tmp3
group1A group1B group1C group1D group1E
1 1 1 0 0 0
2 1 0 1 0 0
3 1 0 0 1 0
4 1 0 0 0 1
5 0 1 1 0 0
6 0 1 0 1 0
7 0 1 0 0 1
8 0 0 1 1 0
9 0 0 1 0 1
10 0 0 0 1 1
A few questions:
Doing it this way does not leave me a lot of options in terms of other covariates. How can I construct such a dummy variable as represented by the model matrix tmp3 and then use it in a call to lm with other covariates such as control1?
The idea is that there is a fixed effect on whether an individual (A,B,C,D,E) is in either group1 or group2. This seems like a reasonable assumption, but I haven't found any references. Am I missing something obvious or does this have a common name in statistics?
Thanks for any help.
I am not sure whether model.matrix provides any options for this, but at least in your example you can reconstruct the matrix you are after without too much effort.
model_mat <- data.frame(tmp3[,-1], val = df$val, control1 = df$control1)
lm(val ~ ., data = model_mat)
You need to remove one of the dummies. I have removed A, but you can of course pick any of the others as the reference category.
Here's a solution using akrun's idea:
group1 <- factor(c("A","A","A","A","B",
"B","B","C","C","D"),
levels=c("A","B","C","D","E"))
group2 <- factor(c("B","C","D","E","C",
"D","E","D","E","E"),
levels=levels(group1))
set.seed(8)
val <- rnorm(10,1,.25)
control1 <- rnorm(10,2,.5)
df <- data.frame(group1,
group2,
val,
control1)
tmpval <- as.data.frame(Reduce('|',lapply(df[1:2], function(group) model.matrix(~0+group)))*1)
indf <- cbind(df,tmpval)
mod1 <- lm(val ~ 0+groupA+groupB+groupC+groupD+groupE,
indf)
summary(mod1)
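To combine the "either group" dummies with other covariates such as control1, one option (a sketch, under the assumption that dropping A as the reference level is acceptable) is to store the dummy matrix as a single matrix column of the data frame and use it as one term in the formula. The column name either is introduced here for illustration.

```r
group1 <- factor(c("A","A","A","A","B","B","B","C","C","D"),
                 levels = c("A","B","C","D","E"))
group2 <- factor(c("B","C","D","E","C","D","E","D","E","E"),
                 levels = levels(group1))
set.seed(8)
df <- data.frame(group1, group2,
                 val = rnorm(10, 1, .25), control1 = rnorm(10, 2, .5))

# 1 if the level appears in group1 or in group2 of that row
either <- (model.matrix(~ 0 + group1, df) | model.matrix(~ 0 + group2, df)) * 1
colnames(either) <- levels(df$group1)

# Matrix column; A is dropped so it acts as the reference level
df$either <- either[, -1]

fit <- lm(val ~ either + control1, data = df)
names(coef(fit))
```

lm() expands the matrix term into one coefficient per column, so further covariates can be added to the formula in the usual way.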
So, my data set consists of 15 variables, one of them (sex) has only 2 levels. I want to use it as a dummy variable, but the levels are 1 and 2. How do I do this? I want to have levels 0 and 1, but I don't know how to manage this in R!
With most of R's modelling tools with a formula interface you don't need to create dummy variables, the underlying code that handles and interprets the formula will do this for you. If you want a dummy variable for some other reason then there are several options. The easiest (IMHO) is to use model.matrix():
set.seed(1)
dat <- data.frame(sex = sample(c("male","female"), 10, replace = TRUE))
model.matrix( ~ sex - 1, data = dat)
which gives:
> dummy <- model.matrix( ~ sex - 1, data = dat)
> dummy
sexfemale sexmale
1 0 1
2 0 1
3 1 0
4 1 0
5 0 1
6 1 0
7 1 0
8 1 0
9 1 0
10 0 1
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$sex
[1] "contr.treatment"
> dummy[,1]
1 2 3 4 5 6 7 8 9 10
0 0 1 1 0 1 1 1 1 0
You can use either column of dummy as a numeric dummy variable; choose whichever column you want to be the 1-based level. dummy[,1] chooses 1 as representing the female class and dummy[,2] the male class.
Cast this as a factor if you want it to be interpreted as a categorical object:
> factor(dummy[, 1])
1 2 3 4 5 6 7 8 9 10
0 0 1 1 0 1 1 1 1 0
Levels: 0 1
But that is defeating the object of factor; what is 0 again?
Try this
set.seed(001) # generating some data
sex <- factor(sample(1:2, 10, replace=TRUE)) # this is what you have
[1] 1 1 2 2 1 2 2 2 2 1
Levels: 1 2
sex<-factor(ifelse(as.numeric(sex)==2, 1,0)) # this is what you want
sex
[1] 0 0 1 1 0 1 1 1 1 0
Levels: 0 1
If you want labels to be 0 = Male and 1 = Female, then...
sex<-factor(ifelse(as.numeric(sex)==2, 1,0), labels=c('M', 'F'))
sex # this is what you want
[1] M M F F M F F F F M
Levels: M F
Actually you don't need to create a dummy variable in order to estimate a model using lm, let's see this example:
set.seed(001) # Generating some data
N <- 100
x <- rnorm(N, 50, 20)
y <- 20 + 3.5*x + rnorm(N)
sex <- factor(sample(1:2, N, replace=TRUE))
# Estimating the linear model
lm(y ~ x + sex) # using the first category as the baseline (this means sex==1)
Call:
lm(formula = y ~ x + sex)
Coefficients:
(Intercept) x sex2
19.97815 3.49994 -0.02719
# renaming the categories and labelling them
sex<-factor(ifelse(as.numeric(sex)==2, 1,0), labels=c('M', 'F'))
lm(y ~ x + sex) # the same results, baseline is 'Male'
Call:
lm(formula = y ~ x + sex)
Coefficients:
(Intercept) x sexF
19.97815 3.49994 -0.02719
As you can see R deals with the dummies pretty well, you just pass them into the formula as factor variable and R will do the rest for you.
By the way, there's no need to change the categories from c(1,2) into c(0,1); the results will be the same, as you can see in the example above.
As suggested by many above, turn it into factor.
If you really want to dummy code the gender variable, consider this
set.seed(100)
gender = rbinom(100,1,0.5)+1
gender_dummy = gender-1
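If the variable is already stored as a factor with levels 1 and 2, subtracting 1 directly does not work, and as.numeric() returns the internal level codes rather than the labels. A small sketch with made-up values shows two equivalent ways to get a 0/1 dummy in that case.

```r
sex <- factor(c(1, 2, 2, 1, 2))   # toy values, levels "1" and "2"

# Compare against the level label: TRUE/FALSE becomes 1/0
dummy_a <- as.integer(sex == "2")

# Or convert the labels (not the internal codes) back to numbers first
dummy_b <- as.integer(as.character(sex)) - 1L

dummy_a  # 0 1 1 0 1
```

Here the internal codes happen to equal the labels, but going through as.character() stays correct when the labels are, say, 0 and 1 or 5 and 9.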