How do I make a dummy variable in R?

My data set consists of 15 variables; one of them (sex) has only 2 levels. I want to use it as a dummy variable, but the levels are 1 and 2. How do I do this? I want to have levels 0 and 1, but I don't know how to manage this in R!

With most of R's modelling tools that have a formula interface, you don't need to create dummy variables; the underlying code that handles and interprets the formula will do this for you. If you want a dummy variable for some other reason, there are several options. The easiest (IMHO) is to use model.matrix():
set.seed(1)
dat <- data.frame(sex = sample(c("male","female"), 10, replace = TRUE))
model.matrix( ~ sex - 1, data = dat)
which gives:
> dummy <- model.matrix( ~ sex - 1, data = dat)
> dummy
sexfemale sexmale
1 0 1
2 0 1
3 1 0
4 1 0
5 0 1
6 1 0
7 1 0
8 1 0
9 1 0
10 0 1
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$sex
[1] "contr.treatment"
> dummy[,1]
1 2 3 4 5 6 7 8 9 10
0 0 1 1 0 1 1 1 1 0
You can use either column of dummy as a numeric dummy variable; choose the column whose level you want coded as 1. dummy[,1] codes the female class as 1 and dummy[,2] codes the male class as 1.
Cast this as a factor if you want it to be interpreted as a categorical object:
> factor(dummy[, 1])
1 2 3 4 5 6 7 8 9 10
0 0 1 1 0 1 1 1 1 0
Levels: 0 1
But that defeats the purpose of a factor; what does 0 stand for again?
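To illustrate the point that the formula machinery builds the dummy for you, here is a minimal sketch with made-up data (the values in dat are an assumption, not the asker's data). With the intercept left in the formula, model.matrix() produces a single 0/1 column for a two-level factor coded 1 and 2:

```r
# Hypothetical data: sex stored as a factor with levels "1" and "2"
dat <- data.frame(sex = factor(c(1, 2, 2, 1, 2)))

# With an intercept, the first level ("1") becomes the baseline and
# the second level gets a single 0/1 indicator column named "sex2"
mm <- model.matrix(~ sex, data = dat)
colnames(mm)   # "(Intercept)" "sex2"
mm[, "sex2"]   # 0 1 1 0 1
```

This is the same coding that lm() or glm() would apply internally when sex is passed as a factor.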

Try this:
set.seed(001) # generating some data
sex <- factor(sample(1:2, 10, replace=TRUE)) # this is what you have
[1] 1 1 2 2 1 2 2 2 2 1
Levels: 1 2
sex<-factor(ifelse(as.numeric(sex)==2, 1,0)) # this is what you want
sex
[1] 0 0 1 1 0 1 1 1 1 0
Levels: 0 1
If you want labels to be 0 = Male and 1 = Female, then...
sex<-factor(ifelse(as.numeric(sex)==2, 1,0), labels=c('M', 'F'))
sex # this is what you want
[1] M M F F M F F F F M
Levels: M F
Actually, you don't need to create a dummy variable in order to estimate a model using lm. Consider this example:
set.seed(001) # Generating some data
N <- 100
x <- rnorm(N, 50, 20)
y <- 20 + 3.5*x + rnorm(N)
sex <- factor(sample(1:2, N, replace=TRUE))
# Estimating the linear model
lm(y ~ x + sex) # using the first category as the baseline (this means sex==1)
Call:
lm(formula = y ~ x + sex)
Coefficients:
(Intercept) x sex2
19.97815 3.49994 -0.02719
# renaming the categories and labelling them
sex<-factor(ifelse(as.numeric(sex)==2, 1,0), labels=c('M', 'F'))
lm(y ~ x + sex) # the same results, baseline is 'Male'
Call:
lm(formula = y ~ x + sex)
Coefficients:
(Intercept) x sexF
19.97815 3.49994 -0.02719
As you can see, R deals with the dummies pretty well: you just pass them into the formula as a factor variable and R will do the rest for you.
By the way, there's no need to change the categories from c(1,2) into c(0,1); the results will be the same, as you can see in the example above.
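If a different baseline is wanted, one option is relevel() from the stats package, which moves a chosen level to the front so that it becomes the reference category. A small sketch with made-up values (not the data from the example above):

```r
# Hypothetical factor: codes 1/2 labelled M/F, so "M" is the baseline
sex <- factor(c(1, 2, 2, 1), levels = 1:2, labels = c("M", "F"))
levels(sex)      # "M" "F"

# Make "F" the reference level instead
sex_f <- relevel(sex, ref = "F")
levels(sex_f)    # "F" "M"
```

A model fitted with sex_f would then report a coefficient for M relative to F; the fitted values should not change, only the parameterization.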

As suggested by many above, turn it into a factor.
If you really want to dummy code the gender variable, consider this
set.seed(100)
gender = rbinom(100,1,0.5)+1
gender_dummy = gender-1

Related

How to include "count" frequency of variable combination in logistic regression?

I am trying to build a logistic regression on predicting cancer (1) or no cancer (0) using various categorical variables in the dataset.
In the raw data set, there is a "count" column that indicates the frequency of each combination.
The dataset is large, so in order to reduce the number of rows in the dataset, they added a "count" column to indicate these variables combinations happened xxx times.
How do I incorporate this count column in the logistic regression?
my_model <- glm(cancer ~ age_grp + density + race + bmi, bcancer)
Dataset from BCSC: https://www.bcsc-research.org/data/rfdataset/dataset
You seem to have data like this.
head(dat)
# cancer age_grp race bmi count
# 1 0 1 1 18 561
# 2 1 1 1 18 997
# 3 0 2 1 18 321
# 4 1 2 1 18 153
# 5 0 3 1 18 74
# 6 1 3 1 18 228
You could fit a weighted regression with count as weights.
summary(glm(cancer ~ age_grp + race + bmi, family=binomial(), dat,
weights=count))$coef
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 0.364588477 0.041604639 8.763169 1.898369e-18
# age_grp 0.009726589 0.002182186 4.457269 8.301035e-06
# race 0.020779774 0.005636968 3.686339 2.275036e-04
# bmi -0.021827620 0.001754685 -12.439623 1.592543e-35
You could also try to "unpack" the data,
dat_unpack <- do.call(rbind.data.frame,
apply(dat, 1, \(x)
t(replicate(x['count'], x[setdiff(names(x), 'count')]))))
head(dat_unpack)
# cancer age_grp race bmi
# 1 0 1 1 18
# 2 0 1 1 18
# 3 0 1 1 18
# 4 0 1 1 18
# 5 0 1 1 18
# 6 0 1 1 18
but it's wasted effort since, apart from the usual rounding errors, the results are identical.
summary(glm(cancer ~ age_grp + race + bmi, family=binomial(), dat_unpack))$coef
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 0.364588477 0.041604640 8.763169 1.898374e-18
# age_grp 0.009726589 0.002182186 4.457268 8.301070e-06
# race 0.020779774 0.005636970 3.686338 2.275043e-04
# bmi -0.021827620 0.001754685 -12.439621 1.592570e-35
Data
set.seed(42)
dat <- expand.grid(cancer=0:1, age_grp=1:7, race=1:3, bmi=18:26)
dat$count <- sample(1e3, nrow(dat), replace=TRUE)
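As a side note, the unpacking step can also be sketched with plain row indexing instead of apply() and replicate(). The example below uses a smaller made-up data set (fewer predictors and smaller counts than above, purely for illustration) and compares the weighted fit with the fit on the expanded rows:

```r
set.seed(42)
# Hypothetical small aggregated data set with a frequency column
dat <- expand.grid(cancer = 0:1, age_grp = 1:3, bmi = 18:20)
dat$count <- sample(20, nrow(dat), replace = TRUE)

# Repeat each row index `count` times, then drop the count column
dat_long <- dat[rep(seq_len(nrow(dat)), dat$count), names(dat) != "count"]

# Weighted fit on aggregated rows vs. unweighted fit on expanded rows
fit_w <- glm(cancer ~ age_grp + bmi, family = binomial(),
             data = dat, weights = count)
fit_l <- glm(cancer ~ age_grp + bmi, family = binomial(),
             data = dat_long)

nrow(dat_long) == sum(dat$count)   # TRUE: one row per original observation
cbind(weighted = coef(fit_w), expanded = coef(fit_l))
```

The two coefficient columns should agree up to the convergence tolerance of the fitting algorithm, which is why the weighted version is usually the more economical choice.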

Creating a combination of dummy variables into a single variable in a logistic regression model in R

I need to create possible combinations of 3 dummy variables into one categorical variable in a logistic regression using R.
I made the combination manually just like the following:
new_variable_code  variable_1  variable_2  variable_3
1                  0           0           0
2                  0           1           0
3                  0           1           1
4                  1           0           0
5                  1           1           0
6                  1           1           1
I excluded the other two options (0 0 1) and (1 0 1) because I do not need them; they are not represented in the data.
I then used new_variable_code as a factor in the logistic regression along with other predictors.
My question is: Is there any automated way to create the same new_variable_code? Or is there another econometric technique to encode the 3 dummy variables into 1 categorical variable inside a logistic regression model?
My objective: To understand which variable combination has the highest odds ratio on the outcome variable (along with other predictors explained in the same model).
Thank you
You could use pmap_dbl in the following way to recode your dummy variables to a 1-6 scale:
library(tidyverse)
# Reproducing your data
df1 <- tibble(
variable_1 = c(0,0,0,1,1,1),
variable_2 = c(0,1,1,0,1,1),
variable_3 = c(0,0,1,0,0,1)
)
factorlevels <- c("000","010","011","100","110","111")
df1 <- df1 %>%
mutate(
new_variable_code = pmap_dbl(list(variable_1, variable_2, variable_3),
~ which(paste0(..1, ..2, ..3) == factorlevels))
)
Output:
# A tibble: 6 x 4
variable_1 variable_2 variable_3 new_variable_code
<dbl> <dbl> <dbl> <dbl>
1 0 0 0 1
2 0 1 0 2
3 0 1 1 3
4 1 0 0 4
5 1 1 0 5
6 1 1 1 6
I would just create a variable with paste using sep="." and make it a factor:
newvar <- factor( paste(variable_1, variable_2, variable_3, sep="."))
I don't think it would be a good idea to then turn it into a sequential value. A factor is already stored as integer codes with levels attached, since that's how factors get created.
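Another base R route, offered here as a sketch rather than as what either answer proposes, is interaction(), which builds one factor from the combinations of several variables. With drop = TRUE only the combinations that occur in the data are kept, and lex.order = TRUE orders the levels with the first variable varying slowest. The data frame below reproduces the six combinations from the question:

```r
# The six observed combinations from the question
df1 <- data.frame(
  variable_1 = c(0, 0, 0, 1, 1, 1),
  variable_2 = c(0, 1, 1, 0, 1, 1),
  variable_3 = c(0, 0, 1, 0, 0, 1)
)

# One factor for the joint combination; unused combinations are dropped
df1$combo <- interaction(df1$variable_1, df1$variable_2, df1$variable_3,
                         drop = TRUE, lex.order = TRUE)

nlevels(df1$combo)   # 6
levels(df1$combo)
```

The resulting factor can go straight into a glm() formula; as noted above, converting it to sequential numbers is not needed for modelling.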

model.matrix(): why do I lose control of contrast in this case

Suppose we have a toy data frame:
x <- data.frame(x1 = gl(3, 2, labels = letters[1:3]),
x2 = gl(3, 2, labels = LETTERS[1:3]))
I would like to construct a model matrix
# x1b x1c x2B x2C
# 1 0 0 0 0
# 2 0 0 0 0
# 3 1 0 1 0
# 4 1 0 1 0
# 5 0 1 0 1
# 6 0 1 0 1
by:
model.matrix(~ x1 + x2 - 1, data = x,
contrasts.arg = list(x1 = contr.treatment(letters[1:3]),
x2 = contr.treatment(LETTERS[1:3])))
but actually I get:
# x1a x1b x1c x2B x2C
# 1 1 0 0 0 0
# 2 1 0 0 0 0
# 3 0 1 0 1 0
# 4 0 1 0 1 0
# 5 0 0 1 0 1
# 6 0 0 1 0 1
# attr(,"assign")
# [1] 1 1 1 2 2
# attr(,"contrasts")
# attr(,"contrasts")$x1
# b c
# a 0 0
# b 1 0
# c 0 1
# attr(,"contrasts")$x2
# B C
# A 0 0
# B 1 0
# C 0 1
I am sort of confused here:
I have passed in explicit contrast matrix to drop first factor levels;
I have asked for dropping intercept.
Then why am I getting a model matrix with 5 columns? How can I get the model matrix I want?
Whenever we lose control of something at R level, there must be some default, unchangeable behaviour at C level. C code for model.matrix.default() can be found in R source package at:
R-<release_number>/src/library/stats/src/model.c
We can find the explanation here:
/* If there is no intercept we look through the factor pattern */
/* matrix and adjust the code for the first factor found so that */
/* it will be coded by dummy variables rather than contrasts. */
Let's make a small test on this, with a data frame
x <- data.frame(x1 = gl(2, 2, labels = letters[1:2]), x2 = sin(1:4))
if we only have x2 on the RHS, we can drop intercept successfully:
model.matrix(~ x2 - 1, data = x)
# x2
#1 0.8414710
#2 0.9092974
#3 0.1411200
#4 -0.7568025
if we have only x1 on the RHS, contrast is not applied:
model.matrix(~ x1 - 1, data = x)
# x1a x1b
#1 1 0
#2 1 0
#3 0 1
#4 0 1
when we have both x1 and x2, contrast is not applied:
model.matrix(~ x1 + x2 - 1, data = x)
# x1a x1b x2
#1 1 0 0.8414710
#2 1 0 0.9092974
#3 0 1 0.1411200
#4 0 1 -0.7568025
This implies that while there is difference between:
lm(y ~ x2, data = x)
lm(y ~ x2 - 1, data = x)
there is no difference in the fitted values (only the parameterization changes) between
lm(y ~ x1, data = x)
lm(y ~ x1 - 1, data = x)
or
lm(y ~ x1 + x2, data = x)
lm(y ~ x1 + x2 - 1, data = x)
The reason for such behaviour is not to ensure numerical stability, but to ensure the sensibility of estimation / prediction. If we really drop the intercept while applying contrast to x1, we end up with a model matrix:
# x1b
#1 0
#2 0
#3 1
#4 1
The effect is that we constrain estimation for level a to 0.
In this post: How can I force dropping intercept or equivalent in this linear model?, we have a dataset:
# Y X1 X2
#1 1.8376852 TRUE TRUE
#2 -2.1173739 TRUE FALSE
#3 1.3054450 FALSE TRUE
#4 -0.3476706 TRUE FALSE
#5 1.3219099 FALSE TRUE
#6 0.6781750 FALSE TRUE
The combination (X1 = FALSE, X2 = FALSE) never occurs in this dataset. But in a broad sense, model.matrix() has to do something safe and sensible. It would be biased to assume that a combination of factor levels absent from the training dataset never needs to be predicted. If we really dropped the intercept while applying contrasts, the prediction for such a combination would be constrained to 0. However, the OP of that post deliberately wants such non-standard behaviour (for some reason), in which case a possible workaround was given in my answer there.
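For the matrix asked for in the question, one possible workaround (a sketch, not necessarily the one from the linked answer) is to keep the intercept in the formula so that contrasts are applied to every factor, and then drop the intercept column afterwards:

```r
x <- data.frame(x1 = gl(3, 2, labels = letters[1:3]),
                x2 = gl(3, 2, labels = LETTERS[1:3]))

# Keep the intercept so treatment contrasts apply to both factors
mm <- model.matrix(~ x1 + x2, data = x)

# Remove the intercept column by name after the fact
mm <- mm[, colnames(mm) != "(Intercept)", drop = FALSE]
colnames(mm)   # "x1b" "x1c" "x2B" "x2C"
```

Note that subsetting the matrix this way discards the "assign" and "contrasts" attributes, so the result is a plain numeric matrix.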

Fixed effect on overlap of factors

I'm trying to generate a model.matrix that puts dummy variables for a categorical variable if it exists in either of a pair of factors. Here is an example:
group1 <- factor(c("A","A","A","A","B",
"B","B","C","C","D"),
levels=c("A","B","C","D","E"))
group2 <- factor(c("B","C","D","E","C",
"D","E","D","E","E"),
levels=levels(group1))
set.seed(8)
val <- rnorm(10,1,.25)
control1 <- rnorm(10,2,.5)
df <- data.frame(group1,
group2,
val,
control1)
This results in 10 rows for the (5*(5-1)/2) pairs of (A,B,C,D,E):
df
group1 group2 val control1
1 A B 0.9788535 1.620103
2 A C 1.2101000 2.146025
3 A D 0.8841293 2.210699
4 A E 0.8622912 1.352755
5 B C 1.1840101 2.034643
6 B D 0.9730296 1.593481
7 B E 0.9574277 2.755427
8 C D 0.7279171 1.864196
9 C E 0.2472371 2.779127
10 D E 0.8517064 1.881325
I want to control for a fixed effect in a linear model when a particular level is in either group1 or group2. I can construct a model matrix for this:
tmp1 <- model.matrix(~ 0+group1,df)
tmp2 <- model.matrix(~ 0+group2,df)
tmp3 <- (tmp1|tmp2)*1
tmp3
group1A group1B group1C group1D group1E
1 1 1 0 0 0
2 1 0 1 0 0
3 1 0 0 1 0
4 1 0 0 0 1
5 0 1 1 0 0
6 0 1 0 1 0
7 0 1 0 0 1
8 0 0 1 1 0
9 0 0 1 0 1
10 0 0 0 1 1
A few questions:
Doing it this way does not leave me a lot of options in terms of other covariates. How can I construct such a dummy variable as represented by the model matrix tmp3 and then use it in a call to lm with other covariates such as control1?
The idea is that there is a fixed effect on whether an individual (A,B,C,D,E) is in either group1 or group2. This seems like a reasonable assumption, but I haven't found any references. Am I missing something obvious or does this have a common name in statistics?
Thanks for any help.
I am not sure whether model.matrix provides any options for this, but at least in your example you can reconstruct the matrix you are after without too much effort.
model_mat <- data.frame(tmp3[,-1], val = df$val, control1 = df$control1)
lm(val ~ ., data = model_mat)
You need to remove one of the dummies, I have removed A but you can of course pick any of the others as reference category.
Here's a solution using akrun's idea:
group1 <- factor(c("A","A","A","A","B",
"B","B","C","C","D"),
levels=c("A","B","C","D","E"))
group2 <- factor(c("B","C","D","E","C",
"D","E","D","E","E"),
levels=levels(group1))
set.seed(8)
val <- rnorm(10,1,.25)
control1 <- rnorm(10,2,.5)
df <- data.frame(group1,
group2,
val,
control1)
tmpval <- as.data.frame(Reduce('|',lapply(df[1:2], function(group) model.matrix(~0+group)))*1)
indf <- cbind(df,tmpval)
mod1 <- lm(val ~ 0+groupA+groupB+groupC+groupD+groupE,
indf)
summary(mod1)
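The "level appears in either group" indicator can also be sketched without model.matrix(), by looping over the levels directly. The three-row data below is made up to keep the output easy to check by hand:

```r
# Hypothetical pairs: (A,B), (A,C), (B,C)
group1 <- factor(c("A", "A", "B"), levels = c("A", "B", "C"))
group2 <- factor(c("B", "C", "C"), levels = levels(group1))

# One column per level: 1 if that level is in group1 OR group2 for the row
ind <- sapply(levels(group1),
              function(l) as.integer(group1 == l | group2 == l))
ind
#      A B C
# [1,] 1 1 0
# [2,] 1 0 1
# [3,] 0 1 1
```

Because the columns are named after the levels, the matrix can be bound to the data with cbind() and the columns used in an lm() formula alongside other covariates.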

R How to transform Prediction as N Column Vector

I am trying to transform each of my predictions into an N-column vector, i.e.,
say my prediction set is a factor with 3 levels; I would like to write each prediction as a vector of length 3.
My Current Output is
Id Prediction
1 Prediction 1
2 prediction 2
3 prediction 3
and what I am trying to achieve
Id Prediction1 Prediction2 Prediction3
1 0 0 1
2 1 0 0
What is a simpler way of achieving this in R?
It looks like you want to perform so-called "one hot encoding" of your Prediction factor variable by introducing dummy variables. One way to do so is using the caret package.
Suppose you have a data frame like this:
> df <- data.frame(Id = c(1, 2, 3, 4), Prediction = c("Prediction 3", "Prediction 1", "Prediction 2", "Prediction 3"))
> df
Id Prediction
1 1 Prediction 3
2 2 Prediction 1
3 3 Prediction 2
4 4 Prediction 3
First make sure you have the caret package installed and loaded.
> install.packages('caret')
> library(caret)
You can then use caret's dummyVars() function to create dummy variables.
> dummies <- dummyVars( ~ Prediction, data = df, levelsOnly = TRUE)
The first argument to dummyVars(), a formula, tells it to generate dummy variables for the Prediction factor in the data frame df. (levelsOnly = TRUE strips the variable name from the column names, leaving just the level, which looks nicer in this case.)
The dummy variables can then be passed to the predict() function to generate a matrix with the one hot encoded factors.
> encoded <- predict(dummies, df)
> encoded
Prediction 1 Prediction 2 Prediction 3
1 0 0 1
2 1 0 0
3 0 1 0
4 0 0 1
You can then, for example, create a new data frame with the encoded variables instead of the original factor variable:
> data.frame(Id = df$Id, encoded)
Id Prediction.1 Prediction.2 Prediction.3
1 1 0 0 1
2 2 1 0 0
3 3 0 1 0
4 4 0 0 1
This technique generalises easily to a mixture of numerical and categorical variables. Here's a more general example:
> df <- data.frame(Id = c(1,2,3,4), Var1 = c(3.4, 2.1, 6.0, 4.7), Var2 = c("B", "A", "B", "A"), Var3 = c("Rainy", "Sunny", "Sunny", "Cloudy"))
> dummies <- dummyVars(Id ~ ., data = df)
> encoded <- predict(dummies, df)
> encoded
Var1 Var2.A Var2.B Var3.Cloudy Var3.Rainy Var3.Sunny
1 3.4 0 1 0 1 0
2 2.1 1 0 0 0 1
3 6.0 0 1 0 0 1
4 4.7 1 0 1 0 0
All numerical variables remain unchanged, whereas all categorical variables get encoded. A typical situation where this is useful is to prepare data for a machine learning algorithm that only accepts numerical variables, not categorical variables.
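If installing caret is not an option, a similar one hot encoding can be sketched in base R with model.matrix(), dropping the intercept so that every level gets its own column. Renaming the columns to the bare level names is a cosmetic choice made here, not something model.matrix() does by default:

```r
df <- data.frame(Id = 1:4,
                 Prediction = factor(c("Prediction 3", "Prediction 1",
                                       "Prediction 2", "Prediction 3")))

# No intercept: one indicator column per level of Prediction
encoded <- model.matrix(~ Prediction - 1, data = df)

# Default names are the variable name pasted to the level; keep the level only
colnames(encoded) <- levels(df$Prediction)
encoded[1, ]   # row 1 is "Prediction 3": 0 0 1
```

Each row then contains exactly one 1, in the column of the predicted level.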
You can use something like:
as.numeric(data[1,][2:4])
Where '1' is the row number that you are converting to a vector.
Taking WhiteViking's example as a starting point, using the table function seems to work.
> df <- data.frame(Id = c(1, 2, 3, 4), Prediction = c("Prediction 3", "Prediction 1", "Prediction 2", "Prediction 3"))
> df
Id Prediction
1 1 Prediction 3
2 2 Prediction 1
3 3 Prediction 2
4 4 Prediction 3
> table(df$Id, df$Prediction)
Prediction 1 Prediction 2 Prediction 3
1 0 0 1
2 1 0 0
3 0 1 0
4 0 0 1
I would use the reshape function
