Adding columns to dataframe using for loop R [duplicate] - r

I have had trouble generating the following dummy-variables in R:
I'm analyzing yearly time series data (time period 1948-2009). I have two questions:
How do I generate a dummy variable for observation #10, i.e. for year 1957 (value = 1 at 1957 and zero otherwise)?
How do I generate a dummy variable which is zero before 1957 and takes the value 1 from 1957 and onwards to 2009?

Another option that can work better if you have many variables is factor and model.matrix.
year.f = factor(year)
dummies = model.matrix(~year.f)
This will include an intercept column (all ones) and one column for each of the years in your data set except one, which will be the "default" or intercept value.
You can change how the "default" is chosen by messing with contrasts.arg in model.matrix.
Also, if you want to omit the intercept, you can either drop the first column (which leaves the K-1 dummy columns) or add +0 to the end of the formula (which gives one column for every year).
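A minimal sketch of the three variants described above, using a made-up `year` vector (the data and the object names `with_intercept`, `no_intercept` and `alt_base` are assumptions for illustration only). It uses base R's `contr.treatment` to pick a different reference level through `contrasts.arg`:

```r
# Illustrative data, not from the question
year <- c(1956, 1957, 1957, 1958)
year.f <- factor(year)

# Default: intercept column plus K-1 dummies (first level is the reference)
with_intercept <- model.matrix(~ year.f)

# "+ 0" removes the intercept, giving one dummy column per level
no_intercept <- model.matrix(~ year.f + 0)

# Choose the second level (1957) as the reference via contrasts.arg
alt_base <- model.matrix(
  ~ year.f,
  contrasts.arg = list(year.f = contr.treatment(levels(year.f), base = 2))
)

colnames(with_intercept)
colnames(no_intercept)
colnames(alt_base)
```

With `+ 0` every row has exactly one 1, whereas in the default encoding the reference year is represented by zeros in all dummy columns.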
Hope this is useful.

The simplest way to produce these dummy variables is something like the following:
> print(year)
[1] 1956 1957 1957 1958 1958 1959
> dummy <- as.numeric(year == 1957)
> print(dummy)
[1] 0 1 1 0 0 0
> dummy2 <- as.numeric(year >= 1957)
> print(dummy2)
[1] 0 1 1 1 1 1
More generally, you can use ifelse to choose between two values depending on a condition. So if instead of a 0-1 dummy variable, for some reason you wanted to use, say, 4 and 7, you could use ifelse(year == 1957, 4, 7).
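Applied to the series from the question, the same comparisons can be sketched as follows. The vector `year` and the column names `is_1957` and `from_1957` are illustrative assumptions, not taken from the asker's data:

```r
# Assumed: one observation per year, 1948 through 2009
year <- 1948:2009

dummies <- data.frame(
  year      = year,
  is_1957   = as.integer(year == 1957),  # 1 only in 1957
  from_1957 = as.integer(year >= 1957)   # 0 before 1957, 1 from 1957 onwards
)

which(dummies$is_1957 == 1)  # position of the 1957 observation
sum(dummies$from_1957)       # number of years from 1957 to 2009
```

Under that assumption, 1957 is indeed observation #10, which matches the wording of the question.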

Using dummies::dummy():
library(dummies)
# example data
df1 <- data.frame(id = 1:4, year = 1991:1994)
df1 <- cbind(df1, dummy(df1$year, sep = "_"))
df1
# id year df1_1991 df1_1992 df1_1993 df1_1994
# 1 1 1991 1 0 0 0
# 2 2 1992 0 1 0 0
# 3 3 1993 0 0 1 0
# 4 4 1994 0 0 0 1

Package mlr includes createDummyFeatures for this purpose:
library(mlr)
df <- data.frame(var = sample(c("A", "B", "C"), 10, replace = TRUE))
df
# var
# 1 B
# 2 A
# 3 C
# 4 B
# 5 C
# 6 A
# 7 C
# 8 A
# 9 B
# 10 C
createDummyFeatures(df, cols = "var")
# var.A var.B var.C
# 1 0 1 0
# 2 1 0 0
# 3 0 0 1
# 4 0 1 0
# 5 0 0 1
# 6 1 0 0
# 7 0 0 1
# 8 1 0 0
# 9 0 1 0
# 10 0 0 1
createDummyFeatures drops the original variable.
https://www.rdocumentation.org/packages/mlr/versions/2.9/topics/createDummyFeatures

The other answers here offer direct routes to accomplish this task—one that many models (e.g. lm) will do for you internally anyway. Nonetheless, here are ways to make dummy variables with Max Kuhn's popular caret and recipes packages. While somewhat more verbose, they both scale easily to more complicated situations, and fit neatly into their respective frameworks.
caret::dummyVars
With caret, the relevant function is dummyVars, which has a predict method to apply it on a data frame:
df <- data.frame(letter = rep(c('a', 'b', 'c'), each = 2),
y = 1:6)
library(caret)
dummy <- dummyVars(~ ., data = df, fullRank = TRUE)
dummy
#> Dummy Variable Object
#>
#> Formula: ~.
#> 2 variables, 1 factors
#> Variables and levels will be separated by '.'
#> A full rank encoding is used
predict(dummy, df)
#> letter.b letter.c y
#> 1 0 0 1
#> 2 0 0 2
#> 3 1 0 3
#> 4 1 0 4
#> 5 0 1 5
#> 6 0 1 6
recipes::step_dummy
With recipes, the relevant function is step_dummy:
library(recipes)
dummy_recipe <- recipe(y ~ letter, df) %>%
step_dummy(letter)
dummy_recipe
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 1
#>
#> Steps:
#>
#> Dummy variables from letter
Depending on context, extract the data with prep and either bake or juice:
# Prep and bake on new data...
dummy_recipe %>%
prep() %>%
bake(df)
#> # A tibble: 6 x 3
#> y letter_b letter_c
#> <int> <dbl> <dbl>
#> 1 1 0 0
#> 2 2 0 0
#> 3 3 1 0
#> 4 4 1 0
#> 5 5 0 1
#> 6 6 0 1
# ...or use `retain = TRUE` and `juice` to extract training data
dummy_recipe %>%
prep(retain = TRUE) %>%
juice()
#> # A tibble: 6 x 3
#> y letter_b letter_c
#> <int> <dbl> <dbl>
#> 1 1 0 0
#> 2 2 0 0
#> 3 3 1 0
#> 4 4 1 0
#> 5 5 0 1
#> 6 6 0 1

For the use case presented in the question, you can also just multiply the logical condition by 1 (or, perhaps even better, by 1L):
# example data
df1 <- data.frame(yr = 1951:1960)
# create the dummies
df1$is.1957 <- 1L * (df1$yr == 1957)
df1$after.1957 <- 1L * (df1$yr >= 1957)
which gives:
> df1
yr is.1957 after.1957
1 1951 0 0
2 1952 0 0
3 1953 0 0
4 1954 0 0
5 1955 0 0
6 1956 0 0
7 1957 1 1
8 1958 0 1
9 1959 0 1
10 1960 0 1
For the use cases presented in, for example, the answers of @zx8754 and @Sotos, there are some other options that haven't been covered yet.
1) Make your own make_dummies-function
# example data
df2 <- data.frame(id = 1:5, year = c(1991:1994,1992))
# create a function
make_dummies <- function(v, prefix = '') {
  s <- sort(unique(v))
  d <- outer(v, s, function(v, s) 1L * (v == s))
  colnames(d) <- paste0(prefix, s)
  d
}
# bind the dummies to the original dataframe
cbind(df2, make_dummies(df2$year, prefix = 'y'))
which gives:
id year y1991 y1992 y1993 y1994
1 1 1991 1 0 0 0
2 2 1992 0 1 0 0
3 3 1993 0 0 1 0
4 4 1994 0 0 0 1
5 5 1992 0 1 0 0
2) use the dcast-function from either data.table or reshape2
library(reshape2)  # or library(data.table), after converting df2 with as.data.table()
dcast(df2, id + year ~ year, fun.aggregate = length)
which gives:
id year 1991 1992 1993 1994
1 1 1991 1 0 0 0
2 2 1992 0 1 0 0
3 3 1993 0 0 1 0
4 4 1994 0 0 0 1
5 5 1992 0 1 0 0
However, this will not work when there are duplicate values in the column for which the dummies have to be created. In that case, a specific aggregation function is needed for dcast, and the result of dcast needs to be merged back into the original:
# example data
df3 <- data.frame(var = c("B", "C", "A", "B", "C"))
# aggregation function to get dummy values
f <- function(x) as.integer(length(x) > 0)
# reshape to wide with the custom aggregation function and merge back to the original
merge(df3, dcast(df3, var ~ var, fun.aggregate = f), by = 'var', all.x = TRUE)
which gives (note that the result is ordered according to the by column):
var A B C
1 A 1 0 0
2 B 0 1 0
3 B 0 1 0
4 C 0 0 1
5 C 0 0 1
3) use the spread-function from tidyr (with mutate from dplyr)
library(dplyr)
library(tidyr)
df2 %>%
mutate(v = 1, yr = year) %>%
spread(yr, v, fill = 0)
which gives:
id year 1991 1992 1993 1994
1 1 1991 1 0 0 0
2 2 1992 0 1 0 0
3 3 1993 0 0 1 0
4 4 1994 0 0 0 1
5 5 1992 0 1 0 0

What I normally do to work with this kind of dummy variables is:
(1) how do I generate a dummy variable for observation #10, i.e. for year 1957 (value = 1 at 1957 and zero otherwise)
data$factor_year_1 <- factor(with(data, ifelse(year == 1957, 1, 0)))
(2) how do I generate a dummy-variable which is zero before 1957 and takes the value 1 from 1957 and onwards to 2009?
data$factor_year_2 <- factor(with(data, ifelse(year < 1957, 0, 1)))
Then, I can introduce this factor as a dummy variable in my models. For example, to see whether there is a long-term trend in a variable y:
summary(lm(y ~ t, data = data))
Hope this helps!

If you want to get K dummy variables, instead of K-1, try:
dummies = table(1:length(year),as.factor(year))
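As a rough sketch of what this produces, here is the same idea on a small made-up `year` vector (the data and the name `dummy_df` are assumptions for illustration). It uses `seq_along(year)` in place of `1:length(year)`, which is equivalent for a non-empty vector, and converts the result to a data frame with `as.data.frame.matrix`:

```r
# Illustrative data, not from the question
year <- c(1956, 1957, 1957, 1958)

# Cross-tabulate row index against year: one row per observation,
# one column per distinct year (K columns, not K-1)
dummies <- table(seq_along(year), as.factor(year))

# A table is awkward to bind to other data; convert it to a data frame
dummy_df <- as.data.frame.matrix(dummies)
dummy_df
```

Each row contains a single 1, in the column of that observation's year.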

I read this on the kaggle forum:
#Generate example dataframe with character column
example <- as.data.frame(c("A", "A", "B", "F", "C", "G", "C", "D", "E", "F"))
names(example) <- "strcol"
#For every unique value in the string column, create a new 1/0 column
#This is what Factors do "under-the-hood" automatically when passed to function requiring numeric data
for (level in unique(example$strcol)) {
  example[paste("dummy", level, sep = "_")] <- ifelse(example$strcol == level, 1, 0)
}
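The same columns can also be built without an explicit loop. The following is only a sketch of one alternative, using `sapply` over the distinct values; the names `levels_seen`, `dummy_cols` and `result` are made up for this example:

```r
# Same example data as in the loop version
example <- data.frame(strcol = c("A", "A", "B", "F", "C", "G", "C", "D", "E", "F"))

# One 0/1 vector per distinct value, collected into a matrix
levels_seen <- unique(example$strcol)
dummy_cols <- sapply(levels_seen, function(level) as.integer(example$strcol == level))
colnames(dummy_cols) <- paste("dummy", levels_seen, sep = "_")

# Bind the dummy columns to the original data frame
result <- cbind(example, dummy_cols)
result
```

Whether this is preferable to the loop is mostly a matter of taste; the loop adds columns one at a time, while this version builds them all first and binds them once.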

The ifelse function is best for simple logic like this.
> x <- seq(1950, 1960, 1)
> ifelse(x == 1957, 1, 0)
[1] 0 0 0 0 0 0 0 1 0 0 0
> ifelse(x <= 1957, 1, 0)
[1] 1 1 1 1 1 1 1 1 0 0 0
Also, if you want it to return character data then you can do so.
> x <- seq(1950, 1960, 1)
> ifelse(x == 1957, "foo", "bar")
[1] "bar" "bar" "bar" "bar" "bar" "bar" "bar" "foo" "bar" "bar" "bar"
> ifelse(x <= 1957, "foo", "bar")
[1] "foo" "foo" "foo" "foo" "foo" "foo" "foo" "foo" "bar" "bar" "bar"
Categorical variables with nesting...
> x <- seq(1950, 1960, 1)
> ifelse(x == 1957, "foo", ifelse(x == 1958, "bar", "baz"))
[1] "baz" "baz" "baz" "baz" "baz" "baz" "baz" "foo" "bar" "baz" "baz"
This is the most straightforward option.

Another way is to use mtabulate from qdapTools package, i.e.
df <- data.frame(var = sample(c("A", "B", "C"), 5, replace = TRUE))
df
#  var
#1 C
#2 A
#3 C
#4 B
#5 B
library(qdapTools)
mtabulate(df$var)
which gives,
A B C
1 0 0 1
2 1 0 0
3 0 0 1
4 0 1 0
5 0 1 0

This one-liner in base R
model.matrix( ~ iris$Species - 1)
gives
iris$Speciessetosa iris$Speciesversicolor iris$Speciesvirginica
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
5 1 0 0
6 1 0 0
7 1 0 0
8 1 0 0
9 1 0 0
10 1 0 0
11 1 0 0
12 1 0 0
13 1 0 0
14 1 0 0
15 1 0 0
16 1 0 0
17 1 0 0
18 1 0 0
19 1 0 0
20 1 0 0
21 1 0 0
22 1 0 0
23 1 0 0
24 1 0 0
25 1 0 0
26 1 0 0
27 1 0 0
28 1 0 0
29 1 0 0
30 1 0 0
31 1 0 0
32 1 0 0
33 1 0 0
34 1 0 0
35 1 0 0
36 1 0 0
37 1 0 0
38 1 0 0
39 1 0 0
40 1 0 0
41 1 0 0
42 1 0 0
43 1 0 0
44 1 0 0
45 1 0 0
46 1 0 0
47 1 0 0
48 1 0 0
49 1 0 0
50 1 0 0
51 0 1 0
52 0 1 0
53 0 1 0
54 0 1 0
55 0 1 0
56 0 1 0
57 0 1 0
58 0 1 0
59 0 1 0
60 0 1 0
61 0 1 0
62 0 1 0
63 0 1 0
64 0 1 0
65 0 1 0
66 0 1 0
67 0 1 0
68 0 1 0
69 0 1 0
70 0 1 0
71 0 1 0
72 0 1 0
73 0 1 0
74 0 1 0
75 0 1 0
76 0 1 0
77 0 1 0
78 0 1 0
79 0 1 0
80 0 1 0
81 0 1 0
82 0 1 0
83 0 1 0
84 0 1 0
85 0 1 0
86 0 1 0
87 0 1 0
88 0 1 0
89 0 1 0
90 0 1 0
91 0 1 0
92 0 1 0
93 0 1 0
94 0 1 0
95 0 1 0
96 0 1 0
97 0 1 0
98 0 1 0
99 0 1 0
100 0 1 0
101 0 0 1
102 0 0 1
103 0 0 1
104 0 0 1
105 0 0 1
106 0 0 1
107 0 0 1
108 0 0 1
109 0 0 1
110 0 0 1
111 0 0 1
112 0 0 1
113 0 0 1
114 0 0 1
115 0 0 1
116 0 0 1
117 0 0 1
118 0 0 1
119 0 0 1
120 0 0 1
121 0 0 1
122 0 0 1
123 0 0 1
124 0 0 1
125 0 0 1
126 0 0 1
127 0 0 1
128 0 0 1
129 0 0 1
130 0 0 1
131 0 0 1
132 0 0 1
133 0 0 1
134 0 0 1
135 0 0 1
136 0 0 1
137 0 0 1
138 0 0 1
139 0 0 1
140 0 0 1
141 0 0 1
142 0 0 1
143 0 0 1
144 0 0 1
145 0 0 1
146 0 0 1
147 0 0 1
148 0 0 1
149 0 0 1
150 0 0 1

Convert your data to a data.table, then assign by reference and filter rows:
library(data.table)
dt <- as.data.table(your.dataframe.or.whatever)
dt[, is.1957 := 0]
dt[year == 1957, is.1957 := 1]
Proof-of-concept toy example:
library(data.table)
dt <- as.data.table(cbind(c(1, 1, 1), c(2, 2, 3)))
dt[, is.3 := 0]
dt[V2 == 3, is.3 := 1]

I use a function like this (for data.table):
# For a data.table object and a factor variable var.name, this function creates dummy variables named "var.name: (level1)"
factorToDummy <- function(dtable, var.name){
  stopifnot(is.data.table(dtable))
  stopifnot(var.name %in% names(dtable))
  stopifnot(is.factor(dtable[, get(var.name)]))
  dtable[, paste0(var.name,": ",levels(get(var.name)))] -> new.names
  dtable[, (new.names) := transpose(lapply(get(var.name), FUN = function(x){x == levels(get(var.name))})) ]
  cat(paste("\nAdded dummy variables: ", paste0(new.names, collapse = ", ")))
}
Usage:
data <- data.table(data)
data[, x:= droplevels(x)]
factorToDummy(data, "x")

We can also use cSplit_e from splitstackshape. Using @zx8754's data:
df1 <- data.frame(id = 1:4, year = 1991:1994)
splitstackshape::cSplit_e(df1, "year", fill = 0)
# id year year_1 year_2 year_3 year_4
#1 1 1991 1 0 0 0
#2 2 1992 0 1 0 0
#3 3 1993 0 0 1 0
#4 4 1994 0 0 0 1
To make it work for data other than numeric we need to specify type as "character" explicitly
df1 <- data.frame(id = 1:4, let = LETTERS[1:4])
splitstackshape::cSplit_e(df1, "let", fill = 0, type = "character")
# id let let_A let_B let_C let_D
#1 1 A 1 0 0 0
#2 2 B 0 1 0 0
#3 3 C 0 0 1 0
#4 4 D 0 0 0 1

Hi, I wrote this general function to generate a dummy variable, which essentially replicates the replace function in Stata.
Suppose x is the data frame and I want a dummy variable called a that takes the value 1 when x$b takes the value c:
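introducedummy <- function(x, a, b, c) {
  g <- c(a, b, c)
  n <- nrow(x)
  newcol <- g[1]        # name of the new dummy column
  p <- colnames(x)
  p2 <- c(p, newcol)    # original column names plus the new one
  new1 <- numeric(n)
  state <- x[, g[2]]    # column to test
  interest <- g[3]      # value that should map to 1
  for (i in 1:n) {
    if (state[i] == interest) {
      new1[i] = 1
    } else {
      new1[i] = 0
    }
  }
  x$added <- new1
  colnames(x) <- p2
  x
}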
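Because comparisons in R are vectorised, the same result can be had without the loop. This is only a sketch of an alternative, not the function above; the name `add_dummy`, its argument names and the toy data are all made up for the example:

```r
# Hypothetical helper: add a 0/1 column called `new_name` that is 1
# wherever df[[column]] equals `value`, and 0 elsewhere
add_dummy <- function(df, new_name, column, value) {
  df[[new_name]] <- as.integer(df[[column]] == value)
  df
}

# Toy data for illustration
x <- data.frame(state = c("NY", "CA", "NY"), stringsAsFactors = FALSE)
x <- add_dummy(x, "is_ny", "state", "NY")
x
```

Note that, like the loop version, this yields NA rather than 0 or 1 wherever the tested column itself is NA.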

Another way to do it is to use ifelse:
ifelse(year < 1965 , 1, 0)




I have had trouble generating the following dummy-variables in R:
I'm analyzing yearly time series data (time period 1948-2009). I have two questions:
How do I generate a dummy variable for observation #10, i.e. for year 1957 (value = 1 at 1957 and zero otherwise)?
How do I generate a dummy variable which is zero before 1957 and takes the value 1 from 1957 and onwards to 2009?
Another option that can work better if you have many variables is factor and model.matrix.
year.f = factor(year)
dummies = model.matrix(~year.f)
This will include an intercept column (all ones) and one column for each of the years in your data set except one, which will be the "default" or intercept value.
You can change how the "default" is chosen by messing with contrasts.arg in model.matrix.
Also, if you want to omit the intercept, you can just drop the first column or add +0 to the end of the formula.
Hope this is useful.
The simplest way to produce these dummy variables is something like the following:
> print(year)
[1] 1956 1957 1957 1958 1958 1959
> dummy <- as.numeric(year == 1957)
> print(dummy)
[1] 0 1 1 0 0 0
> dummy2 <- as.numeric(year >= 1957)
> print(dummy2)
[1] 0 1 1 1 1 1
More generally, you can use ifelse to choose between two values depending on a condition. So if instead of a 0-1 dummy variable, for some reason you wanted to use, say, 4 and 7, you could use ifelse(year == 1957, 4, 7).
Using dummies::dummy():
library(dummies)
# example data
df1 <- data.frame(id = 1:4, year = 1991:1994)
df1 <- cbind(df1, dummy(df1$year, sep = "_"))
df1
# id year df1_1991 df1_1992 df1_1993 df1_1994
# 1 1 1991 1 0 0 0
# 2 2 1992 0 1 0 0
# 3 3 1993 0 0 1 0
# 4 4 1994 0 0 0 1
Package mlr includes createDummyFeatures for this purpose:
library(mlr)
df <- data.frame(var = sample(c("A", "B", "C"), 10, replace = TRUE))
df
# var
# 1 B
# 2 A
# 3 C
# 4 B
# 5 C
# 6 A
# 7 C
# 8 A
# 9 B
# 10 C
createDummyFeatures(df, cols = "var")
# var.A var.B var.C
# 1 0 1 0
# 2 1 0 0
# 3 0 0 1
# 4 0 1 0
# 5 0 0 1
# 6 1 0 0
# 7 0 0 1
# 8 1 0 0
# 9 0 1 0
# 10 0 0 1
createDummyFeatures drops original variable.
https://www.rdocumentation.org/packages/mlr/versions/2.9/topics/createDummyFeatures
.....
The other answers here offer direct routes to accomplish this task—one that many models (e.g. lm) will do for you internally anyway. Nonetheless, here are ways to make dummy variables with Max Kuhn's popular caret and recipes packages. While somewhat more verbose, they both scale easily to more complicated situations, and fit neatly into their respective frameworks.
caret::dummyVars
With caret, the relevant function is dummyVars, which has a predict method to apply it on a data frame:
df <- data.frame(letter = rep(c('a', 'b', 'c'), each = 2),
y = 1:6)
library(caret)
dummy <- dummyVars(~ ., data = df, fullRank = TRUE)
dummy
#> Dummy Variable Object
#>
#> Formula: ~.
#> 2 variables, 1 factors
#> Variables and levels will be separated by '.'
#> A full rank encoding is used
predict(dummy, df)
#> letter.b letter.c y
#> 1 0 0 1
#> 2 0 0 2
#> 3 1 0 3
#> 4 1 0 4
#> 5 0 1 5
#> 6 0 1 6
recipes::step_dummy
With recipes, the relevant function is step_dummy:
library(recipes)
dummy_recipe <- recipe(y ~ letter, df) %>%
step_dummy(letter)
dummy_recipe
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 1
#>
#> Steps:
#>
#> Dummy variables from letter
Depending on context, extract the data with prep and either bake or juice:
# Prep and bake on new data...
dummy_recipe %>%
prep() %>%
bake(df)
#> # A tibble: 6 x 3
#> y letter_b letter_c
#> <int> <dbl> <dbl>
#> 1 1 0 0
#> 2 2 0 0
#> 3 3 1 0
#> 4 4 1 0
#> 5 5 0 1
#> 6 6 0 1
# ...or use `retain = TRUE` and `juice` to extract training data
dummy_recipe %>%
prep(retain = TRUE) %>%
juice()
#> # A tibble: 6 x 3
#> y letter_b letter_c
#> <int> <dbl> <dbl>
#> 1 1 0 0
#> 2 2 0 0
#> 3 3 1 0
#> 4 4 1 0
#> 5 5 0 1
#> 6 6 0 1
For the usecase as presented in the question, you can also just multiply the logical condition with 1 (or maybe even better, with 1L):
# example data
df1 <- data.frame(yr = 1951:1960)
# create the dummies
df1$is.1957 <- 1L * (df1$yr == 1957)
df1$after.1957 <- 1L * (df1$yr >= 1957)
which gives:
> df1
yr is.1957 after.1957
1 1951 0 0
2 1952 0 0
3 1953 0 0
4 1954 0 0
5 1955 0 0
6 1956 0 0
7 1957 1 1
8 1958 0 1
9 1959 0 1
10 1960 0 1
For the usecases as presented in for example the answers of #zx8754 and #Sotos, there are still some other options which haven't been covered yet imo.
1) Make your own make_dummies-function
# example data
df2 <- data.frame(id = 1:5, year = c(1991:1994,1992))
# create a function
make_dummies <- function(v, prefix = '') {
s <- sort(unique(v))
d <- outer(v, s, function(v, s) 1L * (v == s))
colnames(d) <- paste0(prefix, s)
d
}
# bind the dummies to the original dataframe
cbind(df2, make_dummies(df2$year, prefix = 'y'))
which gives:
id year y1991 y1992 y1993 y1994
1 1 1991 1 0 0 0
2 2 1992 0 1 0 0
3 3 1993 0 0 1 0
4 4 1994 0 0 0 1
5 5 1992 0 1 0 0
2) use the dcast-function from either data.table or reshape2
dcast(df2, id + year ~ year, fun.aggregate = length)
which gives:
id year 1991 1992 1993 1994
1 1 1991 1 0 0 0
2 2 1992 0 1 0 0
3 3 1993 0 0 1 0
4 4 1994 0 0 0 1
5 5 1992 0 1 0 0
However, this will not work when there are duplicate values in the column for which the dummies have to be created. In the case a specific aggregation function is needed for dcast and the result of of dcast need to be merged back to the original:
# example data
df3 <- data.frame(var = c("B", "C", "A", "B", "C"))
# aggregation function to get dummy values
f <- function(x) as.integer(length(x) > 0)
# reshape to wide with the cumstom aggregation function and merge back to the original
merge(df3, dcast(df3, var ~ var, fun.aggregate = f), by = 'var', all.x = TRUE)
which gives (note that the result is ordered according to the by column):
var A B C
1 A 1 0 0
2 B 0 1 0
3 B 0 1 0
4 C 0 0 1
5 C 0 0 1
3) use the spread-function from tidyr (with mutate from dplyr)
library(dplyr)
library(tidyr)
df2 %>%
mutate(v = 1, yr = year) %>%
spread(yr, v, fill = 0)
which gives:
id year 1991 1992 1993 1994
1 1 1991 1 0 0 0
2 2 1992 0 1 0 0
3 3 1993 0 0 1 0
4 4 1994 0 0 0 1
5 5 1992 0 1 0 0
What I normally do to work with this kind of dummy variables is:
(1) how do I generate a dummy variable for observation #10, i.e. for year 1957 (value = 1 at 1957 and zero otherwise)
data$factor_year_1 <- factor ( with ( data, ifelse ( ( year == 1957 ), 1 , 0 ) ) )
(2) how do I generate a dummy-variable which is zero before 1957 and takes the value 1 from 1957 and onwards to 2009?
data$factor_year_2 <- factor ( with ( data, ifelse ( ( year < 1957 ), 0 , 1 ) ) )
Then, I can introduce this factor as a dummy variable in my models. For example, to see whether there is a long-term trend in a varible y :
summary ( lm ( y ~ t, data = data ) )
Hope this helps!
If you want to get K dummy variables, instead of K-1, try:
dummies = table(1:length(year),as.factor(year))
Best,
I read this on the kaggle forum:
#Generate example dataframe with character column
example <- as.data.frame(c("A", "A", "B", "F", "C", "G", "C", "D", "E", "F"))
names(example) <- "strcol"
#For every unique value in the string column, create a new 1/0 column
#This is what Factors do "under-the-hood" automatically when passed to function requiring numeric data
for(level in unique(example$strcol)){
example[paste("dummy", level, sep = "_")] <- ifelse(example$strcol == level, 1, 0)
}
The ifelse function is best for simple logic like this.
> x <- seq(1950, 1960, 1)
> ifelse(x == 1957, 1, 0)
[1] 0 0 0 0 0 0 0 1 0 0 0
> ifelse(x <= 1957, 1, 0)
[1] 1 1 1 1 1 1 1 1 0 0 0
Also, if you want it to return character data then you can do so.
> x <- seq(1950, 1960, 1)
> ifelse(x == 1957, "foo", "bar")
[1] "bar" "bar" "bar" "bar" "bar" "bar" "bar" "foo" "bar" "bar" "bar"
> ifelse(x <= 1957, "foo", "bar")
[1] "foo" "foo" "foo" "foo" "foo" "foo" "foo" "foo" "bar" "bar" "bar"
Categorical variables with nesting...
> x <- seq(1950, 1960, 1)
> ifelse(x == 1957, "foo", ifelse(x == 1958, "bar","baz"))
[1] "baz" "baz" "baz" "baz" "baz" "baz" "baz" "foo" "bar" "baz" "baz"
This is the most straightforward option.
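Nested ifelse calls get hard to read once there are more than two or three categories. One possible alternative, sketched here with a named lookup vector (the name lookup and the default label handling are assumptions made for this illustration), gives the same result as the nested call above:

```r
x <- seq(1950, 1960, 1)

# Named lookup vector: names are the years, values are the labels
lookup <- c("1957" = "foo", "1958" = "bar")

out <- unname(lookup[as.character(x)])  # NA where a year has no entry
out[is.na(out)] <- "baz"                # default label for all other years
out
```

Adding a category then only means adding one entry to lookup.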
Another way is to use mtabulate from qdapTools package, i.e.
df <- data.frame(var = sample(c("A", "B", "C"), 5, replace = TRUE))
df
# var
#1 C
#2 A
#3 C
#4 B
#5 B
library(qdapTools)
mtabulate(df$var)
which gives,
A B C
1 0 0 1
2 1 0 0
3 0 0 1
4 0 1 0
5 0 1 0
This one-liner in base R
model.matrix( ~ iris$Species - 1)
gives
iris$Speciessetosa iris$Speciesversicolor iris$Speciesvirginica
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
5 1 0 0
6 1 0 0
7 1 0 0
8 1 0 0
9 1 0 0
10 1 0 0
11 1 0 0
12 1 0 0
13 1 0 0
14 1 0 0
15 1 0 0
16 1 0 0
17 1 0 0
18 1 0 0
19 1 0 0
20 1 0 0
21 1 0 0
22 1 0 0
23 1 0 0
24 1 0 0
25 1 0 0
26 1 0 0
27 1 0 0
28 1 0 0
29 1 0 0
30 1 0 0
31 1 0 0
32 1 0 0
33 1 0 0
34 1 0 0
35 1 0 0
36 1 0 0
37 1 0 0
38 1 0 0
39 1 0 0
40 1 0 0
41 1 0 0
42 1 0 0
43 1 0 0
44 1 0 0
45 1 0 0
46 1 0 0
47 1 0 0
48 1 0 0
49 1 0 0
50 1 0 0
51 0 1 0
52 0 1 0
53 0 1 0
54 0 1 0
55 0 1 0
56 0 1 0
57 0 1 0
58 0 1 0
59 0 1 0
60 0 1 0
61 0 1 0
62 0 1 0
63 0 1 0
64 0 1 0
65 0 1 0
66 0 1 0
67 0 1 0
68 0 1 0
69 0 1 0
70 0 1 0
71 0 1 0
72 0 1 0
73 0 1 0
74 0 1 0
75 0 1 0
76 0 1 0
77 0 1 0
78 0 1 0
79 0 1 0
80 0 1 0
81 0 1 0
82 0 1 0
83 0 1 0
84 0 1 0
85 0 1 0
86 0 1 0
87 0 1 0
88 0 1 0
89 0 1 0
90 0 1 0
91 0 1 0
92 0 1 0
93 0 1 0
94 0 1 0
95 0 1 0
96 0 1 0
97 0 1 0
98 0 1 0
99 0 1 0
100 0 1 0
101 0 0 1
102 0 0 1
103 0 0 1
104 0 0 1
105 0 0 1
106 0 0 1
107 0 0 1
108 0 0 1
109 0 0 1
110 0 0 1
111 0 0 1
112 0 0 1
113 0 0 1
114 0 0 1
115 0 0 1
116 0 0 1
117 0 0 1
118 0 0 1
119 0 0 1
120 0 0 1
121 0 0 1
122 0 0 1
123 0 0 1
124 0 0 1
125 0 0 1
126 0 0 1
127 0 0 1
128 0 0 1
129 0 0 1
130 0 0 1
131 0 0 1
132 0 0 1
133 0 0 1
134 0 0 1
135 0 0 1
136 0 0 1
137 0 0 1
138 0 0 1
139 0 0 1
140 0 0 1
141 0 0 1
142 0 0 1
143 0 0 1
144 0 0 1
145 0 0 1
146 0 0 1
147 0 0 1
148 0 0 1
149 0 0 1
150 0 0 1
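A slightly tidier variant, offered as a sketch: passing the data frame through the data argument keeps iris$ out of the column names, and head avoids printing all 150 rows. The object name dummies is chosen here for illustration.

```r
# The data argument lets the formula refer to the column by name
dummies <- model.matrix(~ Species - 1, data = iris)

head(dummies, 3)    # first rows only
colSums(dummies)    # number of rows per species
```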
Convert your data to a data.table, then set the column by reference and use row filtering:
library(data.table)
dt <- as.data.table(your.dataframe.or.whatever)
dt[, is.1957 := 0]
dt[year == 1957, is.1957 := 1]
Proof-of-concept toy example:
library(data.table)
dt <- as.data.table(cbind(c(1, 1, 1), c(2, 2, 3)))
dt[, is.3 := 0]
dt[V2 == 3, is.3 := 1]
I use such a function (for data.table):
# For a data.table object and a factor variable var.name, this function creates dummy variables named "var.name: (level1)"
factorToDummy <- function(dtable, var.name){
stopifnot(is.data.table(dtable))
stopifnot(var.name %in% names(dtable))
stopifnot(is.factor(dtable[, get(var.name)]))
dtable[, paste0(var.name,": ",levels(get(var.name)))] -> new.names
dtable[, (new.names) := transpose(lapply(get(var.name), FUN = function(x){x == levels(get(var.name))})) ]
cat(paste("\nAdded dummy variables: ", paste0(new.names, collapse = ", ")))
}
Usage:
data <- data.table(data)
data[, x:= droplevels(x)]
factorToDummy(data, "x")
We can also use cSplit_e from splitstackshape. Using @zx8754's data:
df1 <- data.frame(id = 1:4, year = 1991:1994)
splitstackshape::cSplit_e(df1, "year", fill = 0)
# id year year_1 year_2 year_3 year_4
#1 1 1991 1 0 0 0
#2 2 1992 0 1 0 0
#3 3 1993 0 0 1 0
#4 4 1994 0 0 0 1
To make it work for non-numeric data, we need to specify type as "character" explicitly:
df1 <- data.frame(id = 1:4, let = LETTERS[1:4])
splitstackshape::cSplit_e(df1, "let", fill = 0, type = "character")
# id let let_A let_B let_C let_D
#1 1 A 1 0 0 0
#2 2 B 0 1 0 0
#3 3 C 0 0 1 0
#4 4 D 0 0 0 1
Hi, I wrote this general function to generate a dummy variable; it essentially replicates the replace function in Stata.
Suppose x is the data frame and I want a dummy variable called a that takes the value 1 when x$b takes the value c:
introducedummy<-function(x,a,b,c){
g<-c(a,b,c)
n<-nrow(x)
newcol<-g[1]
p<-colnames(x)
p2<-c(p,newcol)
new1<-numeric(n)
state<-x[,g[2]]
interest<-g[3]
for(i in 1:n){
if(state[i]==interest){
new1[i]=1
}
else{
new1[i]=0
}
}
x$added<-new1
colnames(x)<-p2
x
}
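The row-by-row loop can likely be replaced by a single vectorised comparison. The function below is a hypothetical variant, not the author's code; its name and argument names (introduce_dummy, new_name, column, value) are made up for this sketch.

```r
# Hypothetical vectorised variant; names are illustrative only
introduce_dummy <- function(x, new_name, column, value) {
  # TRUE/FALSE comparison converted to 1/0 for every row at once
  x[[new_name]] <- as.integer(x[[column]] == value)
  x
}

df <- data.frame(year = 1955:1959)
df <- introduce_dummy(df, "is_1957", "year", 1957)
df
```

One behavioural difference to be aware of: rows where the column is NA end up as NA here, whereas the if statement in the loop version stops with an error on a missing value.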
Another way to do it is to use:
ifelse(year < 1965 , 1, 0)

Generate a dummy-variable

I have had trouble generating the following dummy-variables in R:
I'm analyzing yearly time series data (time period 1948-2009). I have two questions:
How do I generate a dummy variable for observation #10, i.e. for year 1957 (value = 1 at 1957 and zero otherwise)?
How do I generate a dummy variable which is zero before 1957 and takes the value 1 from 1957 and onwards to 2009?
Another option that can work better if you have many variables is factor and model.matrix.
year.f = factor(year)
dummies = model.matrix(~year.f)
This will include an intercept column (all ones) and one column for each of the years in your data set except one, which will be the "default" or intercept value.
You can change how the "default" is chosen by adjusting the contrasts.arg argument of model.matrix.
Also, if you want to omit the intercept, you can just drop the first column or add +0 to the end of the formula.
Hope this is useful.
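The three options described in this answer can be sketched as follows. The short year vector and the object names m1, m2 and m3 are assumptions for illustration; contr.treatment with base = 2 is just one way to fill in contrasts.arg.

```r
year <- c(1956, 1957, 1957, 1958)   # example values
year.f <- factor(year)

# Default: intercept plus K - 1 columns; the first level (1956) is the baseline
m1 <- model.matrix(~ year.f)

# "+ 0" drops the intercept and gives one column per level (K columns)
m2 <- model.matrix(~ year.f + 0)

# contrasts.arg: treatment contrasts with the second level (1957) as baseline
m3 <- model.matrix(
  ~ year.f,
  contrasts.arg = list(year.f = contr.treatment(levels(year.f), base = 2))
)
```

Note that the two ways of omitting the intercept are not equivalent: dropping the first column of m1 leaves K - 1 dummy columns, while the + 0 form in m2 yields all K.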
The simplest way to produce these dummy variables is something like the following:
> print(year)
[1] 1956 1957 1957 1958 1958 1959
> dummy <- as.numeric(year == 1957)
> print(dummy)
[1] 0 1 1 0 0 0
> dummy2 <- as.numeric(year >= 1957)
> print(dummy2)
[1] 0 1 1 1 1 1
More generally, you can use ifelse to choose between two values depending on a condition. So if instead of a 0-1 dummy variable, for some reason you wanted to use, say, 4 and 7, you could use ifelse(year == 1957, 4, 7).
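Applied to the period from the question, the same two comparisons look like this. This is a sketch that assumes one observation per year with no gaps; the variable names dummy_1957 and dummy_after are chosen here.

```r
year <- 1948:2009                          # the period given in the question

dummy_1957  <- as.numeric(year == 1957)    # 1 only in 1957
dummy_after <- as.numeric(year >= 1957)    # 1 from 1957 onwards

which(dummy_1957 == 1)   # position of the 1957 observation
sum(dummy_after)         # number of years from 1957 to 2009
```

Under that assumption, 1957 is observation #10, which matches the wording of the question.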
Using dummies::dummy():
library(dummies)
# example data
df1 <- data.frame(id = 1:4, year = 1991:1994)
df1 <- cbind(df1, dummy(df1$year, sep = "_"))
df1
# id year df1_1991 df1_1992 df1_1993 df1_1994
# 1 1 1991 1 0 0 0
# 2 2 1992 0 1 0 0
# 3 3 1993 0 0 1 0
# 4 4 1994 0 0 0 1
Package mlr includes createDummyFeatures for this purpose:
library(mlr)
df <- data.frame(var = sample(c("A", "B", "C"), 10, replace = TRUE))
df
# var
# 1 B
# 2 A
# 3 C
# 4 B
# 5 C
# 6 A
# 7 C
# 8 A
# 9 B
# 10 C
createDummyFeatures(df, cols = "var")
# var.A var.B var.C
# 1 0 1 0
# 2 1 0 0
# 3 0 0 1
# 4 0 1 0
# 5 0 0 1
# 6 1 0 0
# 7 0 0 1
# 8 1 0 0
# 9 0 1 0
# 10 0 0 1
createDummyFeatures drops original variable.
https://www.rdocumentation.org/packages/mlr/versions/2.9/topics/createDummyFeatures
The other answers here offer direct routes to accomplish this task—one that many models (e.g. lm) will do for you internally anyway. Nonetheless, here are ways to make dummy variables with Max Kuhn's popular caret and recipes packages. While somewhat more verbose, they both scale easily to more complicated situations, and fit neatly into their respective frameworks.
caret::dummyVars
With caret, the relevant function is dummyVars, which has a predict method to apply it on a data frame:
df <- data.frame(letter = rep(c('a', 'b', 'c'), each = 2),
y = 1:6)
library(caret)
dummy <- dummyVars(~ ., data = df, fullRank = TRUE)
dummy
#> Dummy Variable Object
#>
#> Formula: ~.
#> 2 variables, 1 factors
#> Variables and levels will be separated by '.'
#> A full rank encoding is used
predict(dummy, df)
#> letter.b letter.c y
#> 1 0 0 1
#> 2 0 0 2
#> 3 1 0 3
#> 4 1 0 4
#> 5 0 1 5
#> 6 0 1 6
recipes::step_dummy
With recipes, the relevant function is step_dummy:
library(recipes)
dummy_recipe <- recipe(y ~ letter, df) %>%
step_dummy(letter)
dummy_recipe
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 1
#>
#> Steps:
#>
#> Dummy variables from letter
Depending on context, extract the data with prep and either bake or juice:
# Prep and bake on new data...
dummy_recipe %>%
prep() %>%
bake(df)
#> # A tibble: 6 x 3
#> y letter_b letter_c
#> <int> <dbl> <dbl>
#> 1 1 0 0
#> 2 2 0 0
#> 3 3 1 0
#> 4 4 1 0
#> 5 5 0 1
#> 6 6 0 1
# ...or use `retain = TRUE` and `juice` to extract training data
dummy_recipe %>%
prep(retain = TRUE) %>%
juice()
#> # A tibble: 6 x 3
#> y letter_b letter_c
#> <int> <dbl> <dbl>
#> 1 1 0 0
#> 2 2 0 0
#> 3 3 1 0
#> 4 4 1 0
#> 5 5 0 1
#> 6 6 0 1
For the use case presented in the question, you can also just multiply the logical condition by 1 (or, perhaps even better, by 1L, which keeps the result an integer):
# example data
df1 <- data.frame(yr = 1951:1960)
# create the dummies
df1$is.1957 <- 1L * (df1$yr == 1957)
df1$after.1957 <- 1L * (df1$yr >= 1957)
which gives:
> df1
yr is.1957 after.1957
1 1951 0 0
2 1952 0 0
3 1953 0 0
4 1954 0 0
5 1955 0 0
6 1956 0 0
7 1957 1 1
8 1958 0 1
9 1959 0 1
10 1960 0 1
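As a minimal sketch of the same idea applied to the full range in the question (assuming the data are simply the years 1948 to 2009, one row per year; the vector names here are illustrative and not from the original post):

```r
# Assumed data: one observation per year, as described in the question
year <- 1948:2009

# 1 only in 1957, 0 otherwise
is_1957 <- as.integer(year == 1957)

# 0 before 1957, 1 from 1957 onwards
from_1957 <- as.integer(year >= 1957)

# Position of the single 1 in is_1957; 1957 is observation #10 of this series
which(is_1957 == 1L)
```

as.integer() on a logical vector gives the same 0/1 result as multiplying by 1L.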
For the use cases presented in, for example, the answers of @zx8754 and @Sotos, there are some other options that haven't been covered yet:
1) Make your own make_dummies-function
# example data
df2 <- data.frame(id = 1:5, year = c(1991:1994,1992))
# create a function
make_dummies <- function(v, prefix = '') {
s <- sort(unique(v))
d <- outer(v, s, function(v, s) 1L * (v == s))
colnames(d) <- paste0(prefix, s)
d
}
# bind the dummies to the original dataframe
cbind(df2, make_dummies(df2$year, prefix = 'y'))
which gives:
id year y1991 y1992 y1993 y1994
1 1 1991 1 0 0 0
2 2 1992 0 1 0 0
3 3 1993 0 0 1 0
4 4 1994 0 0 0 1
5 5 1992 0 1 0 0
2) use the dcast-function from either data.table or reshape2
dcast(df2, id + year ~ year, fun.aggregate = length)
which gives:
id year 1991 1992 1993 1994
1 1 1991 1 0 0 0
2 2 1992 0 1 0 0
3 3 1993 0 0 1 0
4 4 1994 0 0 0 1
5 5 1992 0 1 0 0
However, this will not work when there are duplicate values in the column for which the dummies have to be created. In that case, a specific aggregation function is needed for dcast, and the result of dcast needs to be merged back to the original:
# example data
df3 <- data.frame(var = c("B", "C", "A", "B", "C"))
# aggregation function to get dummy values
f <- function(x) as.integer(length(x) > 0)
# reshape to wide with the custom aggregation function and merge back to the original
merge(df3, dcast(df3, var ~ var, fun.aggregate = f), by = 'var', all.x = TRUE)
which gives (note that the result is ordered according to the by column):
var A B C
1 A 1 0 0
2 B 0 1 0
3 B 0 1 0
4 C 0 0 1
5 C 0 0 1
3) use the spread-function from tidyr (with mutate from dplyr)
library(dplyr)
library(tidyr)
df2 %>%
mutate(v = 1, yr = year) %>%
spread(yr, v, fill = 0)
which gives:
id year 1991 1992 1993 1994
1 1 1991 1 0 0 0
2 2 1992 0 1 0 0
3 3 1993 0 0 1 0
4 4 1994 0 0 0 1
5 5 1992 0 1 0 0
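In recent tidyr versions spread is superseded by pivot_wider. A sketch of the same reshape, assuming tidyr 1.1.0 or later (where values_fill accepts a single value) and the df2 example data from above:

```r
library(dplyr)
library(tidyr)

# Same example data as df2 above
df2 <- data.frame(id = 1:5, year = c(1991:1994, 1992))

wide <- df2 %>%
  mutate(v = 1, yr = year) %>%          # helper columns: value to spread and a copy of year
  pivot_wider(names_from = yr,          # one new column per distinct year
              values_from = v,
              values_fill = 0)          # fill missing combinations with 0
wide
```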
What I normally do to work with this kind of dummy variables is:
(1) How do I generate a dummy variable for observation #10, i.e. for year 1957 (value = 1 at 1957 and zero otherwise)?
data$factor_year_1 <- factor(with(data, ifelse(year == 1957, 1, 0)))
(2) How do I generate a dummy variable which is zero before 1957 and takes the value 1 from 1957 onwards to 2009?
data$factor_year_2 <- factor(with(data, ifelse(year < 1957, 0, 1)))
Then I can include this factor as a dummy variable in my models. For example, to see whether there is a long-term trend in a variable y:
summary(lm(y ~ t, data = data))
Hope this helps!
If you want to get K dummy variables, instead of K-1, try:
dummies = table(1:length(year),as.factor(year))
Best,
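A small sketch of how the table approach might be used in practice, assuming a short illustrative year vector (the data and the "y" prefix are made up for the example). The table is converted to a data frame so it can be bound to other columns:

```r
# Illustrative data with a repeated year
year <- c(1991, 1992, 1992, 1994)

# One row per observation, one column per distinct year (K columns, not K-1)
tab <- table(seq_along(year), factor(year))

# Convert the table to an ordinary data frame of 0/1 columns
dummies <- as.data.frame.matrix(tab)
names(dummies) <- paste0("y", names(dummies))
dummies
```

seq_along(year) is used instead of 1:length(year) so that an empty vector does not produce the sequence 1, 0.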
I read this on the kaggle forum:
#Generate example dataframe with character column
example <- as.data.frame(c("A", "A", "B", "F", "C", "G", "C", "D", "E", "F"))
names(example) <- "strcol"
#For every unique value in the string column, create a new 1/0 column
#This is what Factors do "under-the-hood" automatically when passed to function requiring numeric data
for(level in unique(example$strcol)){
example[paste("dummy", level, sep = "_")] <- ifelse(example$strcol == level, 1, 0)
}
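The loop above can also be written without explicit iteration. This is only a sketch of one alternative, using a shortened version of the example data; the names levels_seen, dummies and result are illustrative:

```r
# Shortened example data; stringsAsFactors = FALSE keeps strcol as character
example <- data.frame(strcol = c("A", "A", "B", "F", "C"),
                      stringsAsFactors = FALSE)

levels_seen <- unique(example$strcol)

# One 0/1 column per distinct value; sapply returns a matrix here
dummies <- sapply(levels_seen, function(level) as.integer(example$strcol == level))
colnames(dummies) <- paste("dummy", levels_seen, sep = "_")

# Bind the dummy columns to the original data frame
result <- cbind(example, dummies)
result
```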
The ifelse function is best for simple logic like this.
> x <- seq(1950, 1960, 1)
> ifelse(x == 1957, 1, 0)
[1] 0 0 0 0 0 0 0 1 0 0 0
> ifelse(x <= 1957, 1, 0)
[1] 1 1 1 1 1 1 1 1 0 0 0
Also, if you want it to return character data then you can do so.
> x <- seq(1950, 1960, 1)
> ifelse(x == 1957, "foo", "bar")
[1] "bar" "bar" "bar" "bar" "bar" "bar" "bar" "foo" "bar" "bar" "bar"
> ifelse(x <= 1957, "foo", "bar")
[1] "foo" "foo" "foo" "foo" "foo" "foo" "foo" "foo" "bar" "bar" "bar"
Categorical variables with nesting...
> x <- seq(1950, 1960, 1)
> ifelse(x == 1957, "foo", ifelse(x == 1958, "bar", "baz"))
[1] "baz" "baz" "baz" "baz" "baz" "baz" "baz" "foo" "bar" "baz" "baz"
This is the most straightforward option.
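When the nesting gets deeper than two or three levels, a named lookup vector may be easier to read. This is just a sketch of that alternative (lookup and category are illustrative names), producing the same categories as the nested ifelse:

```r
x <- seq(1950, 1960, 1)

# Map specific years to labels; names must be character
lookup <- c("1957" = "foo", "1958" = "bar")

# Years not in the lookup come back as NA...
category <- unname(lookup[as.character(x)])

# ...and are then given the default label
category[is.na(category)] <- "baz"
category
```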
Another way is to use mtabulate from the qdapTools package:
df <- data.frame(var = sample(c("A", "B", "C"), 5, replace = TRUE))
df
# var
#1 C
#2 A
#3 C
#4 B
#5 B
library(qdapTools)
mtabulate(df$var)
which gives,
A B C
1 0 0 1
2 1 0 0
3 0 0 1
4 0 1 0
5 0 1 0
This one liner in base R
model.matrix( ~ iris$Species - 1)
gives
iris$Speciessetosa iris$Speciesversicolor iris$Speciesvirginica
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
5 1 0 0
6 1 0 0
7 1 0 0
8 1 0 0
9 1 0 0
10 1 0 0
11 1 0 0
12 1 0 0
13 1 0 0
14 1 0 0
15 1 0 0
16 1 0 0
17 1 0 0
18 1 0 0
19 1 0 0
20 1 0 0
21 1 0 0
22 1 0 0
23 1 0 0
24 1 0 0
25 1 0 0
26 1 0 0
27 1 0 0
28 1 0 0
29 1 0 0
30 1 0 0
31 1 0 0
32 1 0 0
33 1 0 0
34 1 0 0
35 1 0 0
36 1 0 0
37 1 0 0
38 1 0 0
39 1 0 0
40 1 0 0
41 1 0 0
42 1 0 0
43 1 0 0
44 1 0 0
45 1 0 0
46 1 0 0
47 1 0 0
48 1 0 0
49 1 0 0
50 1 0 0
51 0 1 0
52 0 1 0
53 0 1 0
54 0 1 0
55 0 1 0
56 0 1 0
57 0 1 0
58 0 1 0
59 0 1 0
60 0 1 0
61 0 1 0
62 0 1 0
63 0 1 0
64 0 1 0
65 0 1 0
66 0 1 0
67 0 1 0
68 0 1 0
69 0 1 0
70 0 1 0
71 0 1 0
72 0 1 0
73 0 1 0
74 0 1 0
75 0 1 0
76 0 1 0
77 0 1 0
78 0 1 0
79 0 1 0
80 0 1 0
81 0 1 0
82 0 1 0
83 0 1 0
84 0 1 0
85 0 1 0
86 0 1 0
87 0 1 0
88 0 1 0
89 0 1 0
90 0 1 0
91 0 1 0
92 0 1 0
93 0 1 0
94 0 1 0
95 0 1 0
96 0 1 0
97 0 1 0
98 0 1 0
99 0 1 0
100 0 1 0
101 0 0 1
102 0 0 1
103 0 0 1
104 0 0 1
105 0 0 1
106 0 0 1
107 0 0 1
108 0 0 1
109 0 0 1
110 0 0 1
111 0 0 1
112 0 0 1
113 0 0 1
114 0 0 1
115 0 0 1
116 0 0 1
117 0 0 1
118 0 0 1
119 0 0 1
120 0 0 1
121 0 0 1
122 0 0 1
123 0 0 1
124 0 0 1
125 0 0 1
126 0 0 1
127 0 0 1
128 0 0 1
129 0 0 1
130 0 0 1
131 0 0 1
132 0 0 1
133 0 0 1
134 0 0 1
135 0 0 1
136 0 0 1
137 0 0 1
138 0 0 1
139 0 0 1
140 0 0 1
141 0 0 1
142 0 0 1
143 0 0 1
144 0 0 1
145 0 0 1
146 0 0 1
147 0 0 1
148 0 0 1
149 0 0 1
150 0 0 1
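If the iris$ prefix in the column names is unwanted, passing the data frame through the data argument keeps the names shorter. A minimal sketch, also showing the default K-1 encoding with an intercept for comparison (m_full and m_ref are illustrative names):

```r
# K dummy columns, no intercept; names come from the variable, not iris$...
m_full <- model.matrix(~ Species - 1, data = iris)
colnames(m_full)

# Default treatment contrasts: intercept plus K-1 dummies,
# with the first level (setosa) as the reference
m_ref <- model.matrix(~ Species, data = iris)
colnames(m_ref)

# Only the first rows, to avoid printing all 150
head(m_full, 3)
```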
Convert your data to a data.table, then assign by reference with := and use row filtering:
library(data.table)
dt <- as.data.table(your.dataframe.or.whatever)
dt[, is.1957 := 0]
dt[year == 1957, is.1957 := 1]
Proof-of-concept toy example:
library(data.table)
dt <- as.data.table(cbind(c(1, 1, 1), c(2, 2, 3)))
dt[, is.3 := 0]
dt[V2 == 3, is.3 := 1]
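The two assignments can probably be collapsed into one by converting the logical condition directly. A sketch on the same toy data (built with data.table() here so the column names V1 and V2 are explicit):

```r
library(data.table)

# Same toy data as above, with explicit column names
dt <- data.table(V1 = c(1, 1, 1), V2 = c(2, 2, 3))

# Create the dummy in a single step, by reference
dt[, is.3 := as.integer(V2 == 3)]
dt
```

The result is an integer column rather than a double, which is usually fine for a dummy variable.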
I use such a function (for data.table):
# For a data.table object and a factor variable var.name, this function creates dummy variables named "var.name: (level1)"
factorToDummy <- function(dtable, var.name){
stopifnot(is.data.table(dtable))
stopifnot(var.name %in% names(dtable))
stopifnot(is.factor(dtable[, get(var.name)]))
dtable[, paste0(var.name,": ",levels(get(var.name)))] -> new.names
dtable[, (new.names) := transpose(lapply(get(var.name), FUN = function(x){x == levels(get(var.name))})) ]
cat(paste("\nAdded dummy variables: ", paste0(new.names, collapse = ", ")))
}
Usage:
data <- data.table(data)
data[, x:= droplevels(x)]
factorToDummy(data, "x")
We can also use cSplit_e from splitstackshape. Using @zx8754's data:
df1 <- data.frame(id = 1:4, year = 1991:1994)
splitstackshape::cSplit_e(df1, "year", fill = 0)
# id year year_1 year_2 year_3 year_4
#1 1 1991 1 0 0 0
#2 2 1992 0 1 0 0
#3 3 1993 0 0 1 0
#4 4 1994 0 0 0 1
To make it work for data other than numeric we need to specify type as "character" explicitly
df1 <- data.frame(id = 1:4, let = LETTERS[1:4])
splitstackshape::cSplit_e(df1, "let", fill = 0, type = "character")
# id let let_A let_B let_C let_D
#1 1 A 1 0 0 0
#2 2 B 0 1 0 0
#3 3 C 0 0 1 0
#4 4 D 0 0 0 1
Hi, I wrote this general function to generate a dummy variable; it essentially replicates the replace function in Stata.
Here x is the data frame, a is the name of the new dummy variable, and the dummy takes the value 1 when x$b takes the value c:
introducedummy<-function(x,a,b,c){
g<-c(a,b,c)
n<-nrow(x)
newcol<-g[1]
p<-colnames(x)
p2<-c(p,newcol)
new1<-numeric(n)
state<-x[,g[2]]
interest<-g[3]
for(i in 1:n){
if(state[i]==interest){
new1[i]=1
}
else{
new1[i]=0
}
}
x$added<-new1
colnames(x)<-p2
x
}
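The same result can be obtained without the loop by comparing the whole column at once. This is a sketch of a vectorized equivalent, not the author's function; introduce_dummy and the example data are hypothetical:

```r
# x: data frame, a: name of the new column (string),
# b: name of the existing column (string), c: value that should map to 1
introduce_dummy <- function(x, a, b, c) {
  x[[a]] <- as.numeric(x[[b]] == c)  # vectorized comparison, TRUE -> 1, FALSE -> 0
  x
}

# Hypothetical example data
df <- data.frame(state = c("NY", "CA", "NY"), stringsAsFactors = FALSE)
out <- introduce_dummy(df, "is_ny", "state", "NY")
out
```

One difference to be aware of: with missing values in the column, the loop version stops with an error at the if statement, while this version returns NA for those rows.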
Another way to do it is to use ifelse:
ifelse(year < 1965 , 1, 0)

Resources