mixed types of variables in R

mixed types of variables in R - r

I am a new user of R. I have this kind of data type. How can I separate the different types of variables (eg.binary, or counts; or others are continuous) from them?
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
0.17 0 0 12 22 2 1 1 240.215 65.049 1.478 114
0.15 1 0 13 22 2 1 1 247.133 66.315 1.474 120
0.16 0 0 12 22 2 0 1 233.329 58.163 1.353 110
0.07 0 0 12 20 2 0 1 219.660 56.162 1.370 114
0.11 0 0 12 26 2 0 2 289.294 70.844 1.389 134
Thanks in advance!

You can use the function typeof to determine the storage mode of an object.
An example data frame:
dat <- data.frame(a = 1:2,
b = c(0.5, -1.3),
c = c("a", "b"),
d = c(TRUE, FALSE), stringsAsFactors = FALSE)
With lapply you can apply the function to all columns:
lapply(dat, typeof)
The result:
$a
[1] "integer"
$b
[1] "double"
$c
[1] "character"
$d
[1] "logical"
If you want to select, for example, all character columns, you can use:
dat[sapply(dat, typeof) == "character"] # possibility 1
dat[sapply(dat, is.character)] # possibility 2
# both commands will return the same result
c
1 a
2 b
PS: You should also have a look at the functions modeand storage.mode.

In addition to typeof, str and summary are other possibilities. These can also be applied directly to the data frame, ie no lapply or looping required.

Related

Combining columns with a grouping in R

I need to categorize information from column names and restructure a dataset. Here is how my sample dataset looks like:
df <- data.frame(id = c(111,112,113),
Demo_1_Color_Naming = c("Text1","Text1","Text1"),
Demo_1.Errors =c(0,1,2),
Item_1_Color_Naming = c("Text1","Text1","Text1"),
Item_1.Time_in_Seconds =c(10,NA, 44),
Item_1.Errors = c(0,1,NA),
Demo_2_Shape_Naming = c("Text1","Text1","Text1"),
Demo_2.Errors =c(4,1,5),
Item_2_Shape_Naming = c("Text1","Text1","Text1"),
Item_2.Time_in_Seconds =c(55,35, 22),
Item_2.Errors = c(5,2,NA))
> df
id Demo_1_Color_Naming Demo_1.Errors Item_1_Color_Naming Item_1.Time_in_Seconds Item_1.Errors Demo_2_Shape_Naming Demo_2.Errors
1 111 Text1 0 Text1 10 0 Text1 4
2 112 Text1 1 Text1 NA 1 Text1 1
3 113 Text1 2 Text1 44 NA Text1 5
Item_2_Shape_Naming Item_2.Time_in_Seconds Item_2.Errors
1 Text1 55 5
2 Text1 35 2
3 Text1 22 NA
The columns are grouped by the numbers 1,2,3,... Each number represesents a grouping name. For example number 1 in this dataset represents Color grouping where number 2 represents Shape grouping. I would like to keep Time_in_seconds info and Errors info. Then I need to sum both time and errors.
Additionally, this dataset is only limited to two grouping. The bigger dataset has more than 2 grouping. I need to handle this for a multi group/column.
How can I achieve this below:
> df1
id ColorTime ShapeTime ColorError ShapeError TotalTime TotalError
1 111 10 55 0 5 65 5
2 112 NA 35 1 2 35 3
3 113 44 22 NA NA 66 NA

We may do
cbind(df['id'], do.call(cbind, lapply(setNames(c("Time_in_Seconds",
"Item.*Errors"), c("Time_in_Seconds", "Errors")), \(x) {
tmp <- df[grep(x, names(df), value = TRUE)]
out <- setNames(as.data.frame(sapply(split.default(tmp,
gsub("\\D+", "", names(tmp))), rowSums, na.rm = TRUE)), c("Color", "Shape"))
transform(out, Total = rowSums(out))
})))
-output
id Time_in_Seconds.Color Time_in_Seconds.Shape Time_in_Seconds.Total Errors.Color Errors.Shape Errors.Total
1 111 10 55 65 0 5 5
2 112 0 35 35 1 2 3
3 113 44 22 66 0 0 0
If we need the NAs
cbind(df['id'], do.call(cbind, lapply(setNames(c("Time_in_Seconds",
"Item.*Errors"), c("Time_in_Seconds", "Errors")), \(x) {
tmp <- df[grep(x, names(df), value = TRUE)]
out <- setNames(as.data.frame(sapply(split.default(tmp,
gsub("\\D+", "", names(tmp))), \(u) Reduce(`+`, u))), c("Color", "Shape")); transform(out, Total = rowSums(out, na.rm = TRUE)) })))

How to multiply odd rows of your own matrix to get a vector without using loops

I need to write a function that calculates two vectors according to a given matrix BBB1: A - equal to the element product of odd rows of the matrix BBB1 and B – equal to the product of even rows. At the same time, you need to do this without using loops. I understand how to pull out even/odd rows, but I don't understand how to multiply the nth number of rows without a loop.
func <- function(BBB1) {
A <- 1
B <- 1
for (i in (1:dim(BBB1)[1])) {
if (i %% 2 == 0) B <- B*BBB1[i, ]
else A <- A*BBB1[i, ]
}
m <- list(A, B)
return(m)
}
So I wrote down the function, but only through a loop.
Example:
"V1" "V2" "V3" "V4" "V5"
"1" 1 5 9 13 17
"2" 2 6 10 14 18
"3" 3 7 11 15 19
"4" 4 8 12 16 20
Desired result:
Vector AA= 3 35 99 195 323
Vector BB= 8 48 120 224 360

It may be easier to subset with logical index recycled and then use prod in base R
apply(BBB1[c(TRUE, FALSE),], 2, prod)
V1 V2 V3 V4 V5
3 35 99 195 323
apply(BBB1[c(FALSE, TRUE),], 2, prod)
V1 V2 V3 V4 V5
8 48 120 224 360
Or use colProds from matrixStats
library(matrixStats)
> colProds(as.matrix(BBB1[c(TRUE, FALSE),]))
[1] 3 35 99 195 323
> colProds(as.matrix(BBB1[c(FALSE, TRUE),]))
[1] 8 48 120 224 360
data
BBB1 <- structure(list(V1 = 1:4, V2 = 5:8, V3 = 9:12, V4 = 13:16,
V5 = 17:20), class = "data.frame",
row.names = c("1",
"2", "3", "4"))

Dummy/Binary Category Variable creation in Data Frame [duplicate]

I have an R data frame containing a factor that I want to "expand" so that for each factor level, there is an associated column in a new data frame, which contains a 1/0 indicator. E.g., suppose I have:
df.original <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))
I want:
df.desired <- data.frame(foo = c(1,1,0,0), bar=c(0,0,1,1), ham=c(1,2,3,4))
Because for certain analyses for which you need to have a completely numeric data frame (e.g., principal component analysis), I thought this feature might be built in. Writing a function to do this shouldn't be too hard, but I can foresee some challenges relating to column names and if something exists already, I'd rather use that.

Use the model.matrix function:
model.matrix( ~ Species - 1, data=iris )

If your data frame is only made of factors (or you are working on a subset of variables which are all factors), you can also use the acm.disjonctif function from the ade4 package :
R> library(ade4)
R> df <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c("red","blue","green","red"))
R> acm.disjonctif(df)
eggs.bar eggs.foo ham.blue ham.green ham.red
1 0 1 0 0 1
2 0 1 1 0 0
3 1 0 0 1 0
4 1 0 0 0 1
Not exactly the case you are describing, but it can be useful too...

A quick way using the reshape2 package:
require(reshape2)
> dcast(df.original, ham ~ eggs, length)
Using ham as value column: use value_var to override.
ham bar foo
1 1 0 1
2 2 0 1
3 3 1 0
4 4 1 0
Note that this produces precisely the column names you want.

probably dummy variable is similar to what you want.
Then, model.matrix is useful:
> with(df.original, data.frame(model.matrix(~eggs+0), ham))
eggsbar eggsfoo ham
1 0 1 1
2 0 1 2
3 1 0 3
4 1 0 4

A late entry class.ind from the nnet package
library(nnet)
with(df.original, data.frame(class.ind(eggs), ham))
bar foo ham
1 0 1 1
2 0 1 2
3 1 0 3
4 1 0 4

Just came across this old thread and thought I'd add a function that utilizes ade4 to take a dataframe consisting of factors and/or numeric data and returns a dataframe with factors as dummy codes.
dummy <- function(df) {
NUM <- function(dataframe)dataframe[,sapply(dataframe,is.numeric)]
FAC <- function(dataframe)dataframe[,sapply(dataframe,is.factor)]
require(ade4)
if (is.null(ncol(NUM(df)))) {
DF <- data.frame(NUM(df), acm.disjonctif(FAC(df)))
names(DF)[1] <- colnames(df)[which(sapply(df, is.numeric))]
} else {
DF <- data.frame(NUM(df), acm.disjonctif(FAC(df)))
}
return(DF)
}
Let's try it.
df <-data.frame(eggs = c("foo", "foo", "bar", "bar"),
ham = c("red","blue","green","red"), x=rnorm(4))
dummy(df)
df2 <-data.frame(eggs = c("foo", "foo", "bar", "bar"),
ham = c("red","blue","green","red"))
dummy(df2)

Here is a more clear way to do it. I use model.matrix to create the dummy boolean variables and then merge it back into the original dataframe.
df.original <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))
df.original
# eggs ham
# 1 foo 1
# 2 foo 2
# 3 bar 3
# 4 bar 4
# Create the dummy boolean variables using the model.matrix() function.
> mm <- model.matrix(~eggs-1, df.original)
> mm
# eggsbar eggsfoo
# 1 0 1
# 2 0 1
# 3 1 0
# 4 1 0
# attr(,"assign")
# [1] 1 1
# attr(,"contrasts")
# attr(,"contrasts")$eggs
# [1] "contr.treatment"
# Remove the "eggs" prefix from the column names as the OP desired.
colnames(mm) <- gsub("eggs","",colnames(mm))
mm
# bar foo
# 1 0 1
# 2 0 1
# 3 1 0
# 4 1 0
# attr(,"assign")
# [1] 1 1
# attr(,"contrasts")
# attr(,"contrasts")$eggs
# [1] "contr.treatment"
# Combine the matrix back with the original dataframe.
result <- cbind(df.original, mm)
result
# eggs ham bar foo
# 1 foo 1 0 1
# 2 foo 2 0 1
# 3 bar 3 1 0
# 4 bar 4 1 0
# At this point, you can select out the columns that you want.

I needed a function to 'explode' factors that is a bit more flexible, and made one based on the acm.disjonctif function from the ade4 package.
This allows you to choose the exploded values, which are 0 and 1 in acm.disjonctif. It only explodes factors that have 'few' levels. Numeric columns are preserved.
# Function to explode factors that are considered to be categorical,
# i.e., they do not have too many levels.
# - data: The data.frame in which categorical variables will be exploded.
# - values: The exploded values for the value being unequal and equal to a level.
# - max_factor_level_fraction: Maximum number of levels as a fraction of column length. Set to 1 to explode all factors.
# Inspired by the acm.disjonctif function in the ade4 package.
explode_factors <- function(data, values = c(-0.8, 0.8), max_factor_level_fraction = 0.2) {
exploders <- colnames(data)[sapply(data, function(col){
is.factor(col) && nlevels(col) <= max_factor_level_fraction * length(col)
})]
if (length(exploders) > 0) {
exploded <- lapply(exploders, function(exp){
col <- data[, exp]
n <- length(col)
dummies <- matrix(values[1], n, length(levels(col)))
dummies[(1:n) + n * (unclass(col) - 1)] <- values[2]
colnames(dummies) <- paste(exp, levels(col), sep = '_')
dummies
})
# Only keep numeric data.
data <- data[sapply(data, is.numeric)]
# Add exploded values.
data <- cbind(data, exploded)
}
return(data)
}

(The question is 10yo, but for the sake of completeness...)
The function i() from the fixest package does exactly that.
Beyond creating a design matrix from a factor-like variable, you can also very easily do two extra things on the fly:
binning values (with the argument 'bin'),
excluding some factor values (with the argument ref).
And since it is made for this task, if your variable happens to be numeric you don't need to wrap it with factor(x_num) (as opposed to the model.matrix solution).
Here's an example:
library(fixest)
data(airquality)
table(airquality$Month)
#> 5 6 7 8 9
#> 31 30 31 31 30
head(i(airquality$Month))
#> 5 6 7 8 9
#> [1,] 1 0 0 0 0
#> [2,] 1 0 0 0 0
#> [3,] 1 0 0 0 0
#> [4,] 1 0 0 0 0
#> [5,] 1 0 0 0 0
#> [6,] 1 0 0 0 0
#
# Binning (check out the help, there are many many ways to bin)
#
colSums(i(airquality$Month, bin = 5:6)))
#> 5 7 8 9
#> 61 31 31 30
#
# References
#
head(i(airquality$Month, ref = c(6, 9)), 3)
#> 5 7 8
#> [1,] 1 0 0
#> [2,] 1 0 0
#> [3,] 1 0 0
And here's a little wrapper expanding all non-numeric variables (by default):
library(fixest)
# data: data.frame
# var: vector of variable names // if missing, all non numeric variables
# no argument checking
expand_factor = function(data, var){
if(missing(var)){
var = names(data)[!sapply(data, is.numeric)]
if(length(var) == 0) return(data)
}
data_list = unclass(data)
new = lapply(var, \(x) i(data_list[[x]]))
data_list[names(data_list) %in% var] = new
do.call("cbind", data_list)
}
my_data = data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))
expand_factor(my_data)
#> bar foo ham
#> [1,] 0 1 1
#> [2,] 0 1 2
#> [3,] 1 0 3
#> [4,] 1 0 4
Finally, for those wondering, the timing is equivalent to the model.matrix solution.
library(microbenchmark)
my_data = data.frame(x = as.factor(sample(100, 1e6, TRUE)))
microbenchmark(mm = model.matrix(~x, my_data),
i = i(my_data$x), times = 5)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> mm 155.1904 156.7751 209.2629 182.4964 197.9084 353.9443 5
#> i 154.1697 154.7893 159.5202 155.4166 163.9706 169.2550 5

In sapply == over eggs could be used to generate dummy vectors:
x <- with(df.original, data.frame(+sapply(unique(eggs), `==`, eggs), ham))
x
# foo bar ham
#1 1 0 1
#2 1 0 2
#3 0 1 3
#4 0 1 4
all.equal(x, df.desired)
#[1] TRUE
A maybe faster variant - Result best used as list or data.frame:
. <- unique(df.original$eggs)
with(df.original,
data.frame(+do.call(cbind, lapply(setNames(., .), `==`, eggs)), ham))
Indexing in a matrix - Result best used as matrix:
. <- unique(df.original$eggs)
i <- match(df.original$eggs, .)
nc <- length(.)
nr <- length(i)
cbind(matrix(`[<-`(integer(nc * nr), 1:nr + nr * (i - 1), 1), nr, nc,
dimnames=list(NULL, .)), df.original["ham"])
Using outer - Result best used as matrix:
. <- unique(df.original$eggs)
cbind(+outer(df.original$eggs, setNames(., .), `==`), df.original["ham"])
Using rep - Result best used as matrix:
. <- unique(df.original$eggs)
n <- nrow(df.original)
cbind(+matrix(df.original$eggs == rep(., each=n), n, dimnames=list(NULL, .)),
df.original["ham"])

How to create dummy variables for all the categorical variables in R? [duplicate]

I have an R data frame containing a factor that I want to "expand" so that for each factor level, there is an associated column in a new data frame, which contains a 1/0 indicator. E.g., suppose I have:
df.original <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))
I want:
df.desired <- data.frame(foo = c(1,1,0,0), bar=c(0,0,1,1), ham=c(1,2,3,4))
Because for certain analyses for which you need to have a completely numeric data frame (e.g., principal component analysis), I thought this feature might be built in. Writing a function to do this shouldn't be too hard, but I can foresee some challenges relating to column names and if something exists already, I'd rather use that.

Use the model.matrix function:
model.matrix( ~ Species - 1, data=iris )

If your data frame is only made of factors (or you are working on a subset of variables which are all factors), you can also use the acm.disjonctif function from the ade4 package :
R> library(ade4)
R> df <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c("red","blue","green","red"))
R> acm.disjonctif(df)
eggs.bar eggs.foo ham.blue ham.green ham.red
1 0 1 0 0 1
2 0 1 1 0 0
3 1 0 0 1 0
4 1 0 0 0 1
Not exactly the case you are describing, but it can be useful too...

A quick way using the reshape2 package:
require(reshape2)
> dcast(df.original, ham ~ eggs, length)
Using ham as value column: use value_var to override.
ham bar foo
1 1 0 1
2 2 0 1
3 3 1 0
4 4 1 0
Note that this produces precisely the column names you want.

probably dummy variable is similar to what you want.
Then, model.matrix is useful:
> with(df.original, data.frame(model.matrix(~eggs+0), ham))
eggsbar eggsfoo ham
1 0 1 1
2 0 1 2
3 1 0 3
4 1 0 4

A late entry class.ind from the nnet package
library(nnet)
with(df.original, data.frame(class.ind(eggs), ham))
bar foo ham
1 0 1 1
2 0 1 2
3 1 0 3
4 1 0 4

Just came across this old thread and thought I'd add a function that utilizes ade4 to take a dataframe consisting of factors and/or numeric data and returns a dataframe with factors as dummy codes.
dummy <- function(df) {
NUM <- function(dataframe)dataframe[,sapply(dataframe,is.numeric)]
FAC <- function(dataframe)dataframe[,sapply(dataframe,is.factor)]
require(ade4)
if (is.null(ncol(NUM(df)))) {
DF <- data.frame(NUM(df), acm.disjonctif(FAC(df)))
names(DF)[1] <- colnames(df)[which(sapply(df, is.numeric))]
} else {
DF <- data.frame(NUM(df), acm.disjonctif(FAC(df)))
}
return(DF)
}
Let's try it.
df <-data.frame(eggs = c("foo", "foo", "bar", "bar"),
ham = c("red","blue","green","red"), x=rnorm(4))
dummy(df)
df2 <-data.frame(eggs = c("foo", "foo", "bar", "bar"),
ham = c("red","blue","green","red"))
dummy(df2)

Here is a more clear way to do it. I use model.matrix to create the dummy boolean variables and then merge it back into the original dataframe.
df.original <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))
df.original
# eggs ham
# 1 foo 1
# 2 foo 2
# 3 bar 3
# 4 bar 4
# Create the dummy boolean variables using the model.matrix() function.
> mm <- model.matrix(~eggs-1, df.original)
> mm
# eggsbar eggsfoo
# 1 0 1
# 2 0 1
# 3 1 0
# 4 1 0
# attr(,"assign")
# [1] 1 1
# attr(,"contrasts")
# attr(,"contrasts")$eggs
# [1] "contr.treatment"
# Remove the "eggs" prefix from the column names as the OP desired.
colnames(mm) <- gsub("eggs","",colnames(mm))
mm
# bar foo
# 1 0 1
# 2 0 1
# 3 1 0
# 4 1 0
# attr(,"assign")
# [1] 1 1
# attr(,"contrasts")
# attr(,"contrasts")$eggs
# [1] "contr.treatment"
# Combine the matrix back with the original dataframe.
result <- cbind(df.original, mm)
result
# eggs ham bar foo
# 1 foo 1 0 1
# 2 foo 2 0 1
# 3 bar 3 1 0
# 4 bar 4 1 0
# At this point, you can select out the columns that you want.

I needed a function to 'explode' factors that is a bit more flexible, and made one based on the acm.disjonctif function from the ade4 package.
This allows you to choose the exploded values, which are 0 and 1 in acm.disjonctif. It only explodes factors that have 'few' levels. Numeric columns are preserved.
# Function to explode factors that are considered to be categorical,
# i.e., they do not have too many levels.
# - data: The data.frame in which categorical variables will be exploded.
# - values: The exploded values for the value being unequal and equal to a level.
# - max_factor_level_fraction: Maximum number of levels as a fraction of column length. Set to 1 to explode all factors.
# Inspired by the acm.disjonctif function in the ade4 package.
explode_factors <- function(data, values = c(-0.8, 0.8), max_factor_level_fraction = 0.2) {
exploders <- colnames(data)[sapply(data, function(col){
is.factor(col) && nlevels(col) <= max_factor_level_fraction * length(col)
})]
if (length(exploders) > 0) {
exploded <- lapply(exploders, function(exp){
col <- data[, exp]
n <- length(col)
dummies <- matrix(values[1], n, length(levels(col)))
dummies[(1:n) + n * (unclass(col) - 1)] <- values[2]
colnames(dummies) <- paste(exp, levels(col), sep = '_')
dummies
})
# Only keep numeric data.
data <- data[sapply(data, is.numeric)]
# Add exploded values.
data <- cbind(data, exploded)
}
return(data)
}

(The question is 10yo, but for the sake of completeness...)
The function i() from the fixest package does exactly that.
Beyond creating a design matrix from a factor-like variable, you can also very easily do two extra things on the fly:
binning values (with the argument 'bin'),
excluding some factor values (with the argument ref).
And since it is made for this task, if your variable happens to be numeric you don't need to wrap it with factor(x_num) (as opposed to the model.matrix solution).
Here's an example:
library(fixest)
data(airquality)
table(airquality$Month)
#> 5 6 7 8 9
#> 31 30 31 31 30
head(i(airquality$Month))
#> 5 6 7 8 9
#> [1,] 1 0 0 0 0
#> [2,] 1 0 0 0 0
#> [3,] 1 0 0 0 0
#> [4,] 1 0 0 0 0
#> [5,] 1 0 0 0 0
#> [6,] 1 0 0 0 0
#
# Binning (check out the help, there are many many ways to bin)
#
colSums(i(airquality$Month, bin = 5:6)))
#> 5 7 8 9
#> 61 31 31 30
#
# References
#
head(i(airquality$Month, ref = c(6, 9)), 3)
#> 5 7 8
#> [1,] 1 0 0
#> [2,] 1 0 0
#> [3,] 1 0 0
And here's a little wrapper expanding all non-numeric variables (by default):
library(fixest)
# data: data.frame
# var: vector of variable names // if missing, all non numeric variables
# no argument checking
expand_factor = function(data, var){
if(missing(var)){
var = names(data)[!sapply(data, is.numeric)]
if(length(var) == 0) return(data)
}
data_list = unclass(data)
new = lapply(var, \(x) i(data_list[[x]]))
data_list[names(data_list) %in% var] = new
do.call("cbind", data_list)
}
my_data = data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))
expand_factor(my_data)
#> bar foo ham
#> [1,] 0 1 1
#> [2,] 0 1 2
#> [3,] 1 0 3
#> [4,] 1 0 4
Finally, for those wondering, the timing is equivalent to the model.matrix solution.
library(microbenchmark)
my_data = data.frame(x = as.factor(sample(100, 1e6, TRUE)))
microbenchmark(mm = model.matrix(~x, my_data),
i = i(my_data$x), times = 5)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> mm 155.1904 156.7751 209.2629 182.4964 197.9084 353.9443 5
#> i 154.1697 154.7893 159.5202 155.4166 163.9706 169.2550 5

In sapply == over eggs could be used to generate dummy vectors:
x <- with(df.original, data.frame(+sapply(unique(eggs), `==`, eggs), ham))
x
# foo bar ham
#1 1 0 1
#2 1 0 2
#3 0 1 3
#4 0 1 4
all.equal(x, df.desired)
#[1] TRUE
A maybe faster variant - Result best used as list or data.frame:
. <- unique(df.original$eggs)
with(df.original,
data.frame(+do.call(cbind, lapply(setNames(., .), `==`, eggs)), ham))
Indexing in a matrix - Result best used as matrix:
. <- unique(df.original$eggs)
i <- match(df.original$eggs, .)
nc <- length(.)
nr <- length(i)
cbind(matrix(`[<-`(integer(nc * nr), 1:nr + nr * (i - 1), 1), nr, nc,
dimnames=list(NULL, .)), df.original["ham"])
Using outer - Result best used as matrix:
. <- unique(df.original$eggs)
cbind(+outer(df.original$eggs, setNames(., .), `==`), df.original["ham"])
Using rep - Result best used as matrix:
. <- unique(df.original$eggs)
n <- nrow(df.original)
cbind(+matrix(df.original$eggs == rep(., each=n), n, dimnames=list(NULL, .)),
df.original["ham"])

Automatically expanding an R factor into a collection of 1/0 indicator variables for every factor level

I have an R data frame containing a factor that I want to "expand" so that for each factor level, there is an associated column in a new data frame, which contains a 1/0 indicator. E.g., suppose I have:
df.original <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))
I want:
df.desired <- data.frame(foo = c(1,1,0,0), bar=c(0,0,1,1), ham=c(1,2,3,4))
Because for certain analyses for which you need to have a completely numeric data frame (e.g., principal component analysis), I thought this feature might be built in. Writing a function to do this shouldn't be too hard, but I can foresee some challenges relating to column names and if something exists already, I'd rather use that.

Use the model.matrix function:
model.matrix( ~ Species - 1, data=iris )

If your data frame is only made of factors (or you are working on a subset of variables which are all factors), you can also use the acm.disjonctif function from the ade4 package :
R> library(ade4)
R> df <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c("red","blue","green","red"))
R> acm.disjonctif(df)
eggs.bar eggs.foo ham.blue ham.green ham.red
1 0 1 0 0 1
2 0 1 1 0 0
3 1 0 0 1 0
4 1 0 0 0 1
Not exactly the case you are describing, but it can be useful too...

A quick way using the reshape2 package:
require(reshape2)
> dcast(df.original, ham ~ eggs, length)
Using ham as value column: use value_var to override.
ham bar foo
1 1 0 1
2 2 0 1
3 3 1 0
4 4 1 0
Note that this produces precisely the column names you want.

probably dummy variable is similar to what you want.
Then, model.matrix is useful:
> with(df.original, data.frame(model.matrix(~eggs+0), ham))
eggsbar eggsfoo ham
1 0 1 1
2 0 1 2
3 1 0 3
4 1 0 4

A late entry class.ind from the nnet package
library(nnet)
with(df.original, data.frame(class.ind(eggs), ham))
bar foo ham
1 0 1 1
2 0 1 2
3 1 0 3
4 1 0 4

Just came across this old thread and thought I'd add a function that utilizes ade4 to take a dataframe consisting of factors and/or numeric data and returns a dataframe with factors as dummy codes.
dummy <- function(df) {
NUM <- function(dataframe)dataframe[,sapply(dataframe,is.numeric)]
FAC <- function(dataframe)dataframe[,sapply(dataframe,is.factor)]
require(ade4)
if (is.null(ncol(NUM(df)))) {
DF <- data.frame(NUM(df), acm.disjonctif(FAC(df)))
names(DF)[1] <- colnames(df)[which(sapply(df, is.numeric))]
} else {
DF <- data.frame(NUM(df), acm.disjonctif(FAC(df)))
}
return(DF)
}
Let's try it.
df <-data.frame(eggs = c("foo", "foo", "bar", "bar"),
ham = c("red","blue","green","red"), x=rnorm(4))
dummy(df)
df2 <-data.frame(eggs = c("foo", "foo", "bar", "bar"),
ham = c("red","blue","green","red"))
dummy(df2)

Here is a more clear way to do it. I use model.matrix to create the dummy boolean variables and then merge it back into the original dataframe.
df.original <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))
df.original
# eggs ham
# 1 foo 1
# 2 foo 2
# 3 bar 3
# 4 bar 4
# Create the dummy boolean variables using the model.matrix() function.
> mm <- model.matrix(~eggs-1, df.original)
> mm
# eggsbar eggsfoo
# 1 0 1
# 2 0 1
# 3 1 0
# 4 1 0
# attr(,"assign")
# [1] 1 1
# attr(,"contrasts")
# attr(,"contrasts")$eggs
# [1] "contr.treatment"
# Remove the "eggs" prefix from the column names as the OP desired.
colnames(mm) <- gsub("eggs","",colnames(mm))
mm
# bar foo
# 1 0 1
# 2 0 1
# 3 1 0
# 4 1 0
# attr(,"assign")
# [1] 1 1
# attr(,"contrasts")
# attr(,"contrasts")$eggs
# [1] "contr.treatment"
# Combine the matrix back with the original dataframe.
result <- cbind(df.original, mm)
result
# eggs ham bar foo
# 1 foo 1 0 1
# 2 foo 2 0 1
# 3 bar 3 1 0
# 4 bar 4 1 0
# At this point, you can select out the columns that you want.

I needed a function to 'explode' factors that is a bit more flexible, and made one based on the acm.disjonctif function from the ade4 package.
This allows you to choose the exploded values, which are 0 and 1 in acm.disjonctif. It only explodes factors that have 'few' levels. Numeric columns are preserved.
# Function to explode factors that are considered to be categorical,
# i.e., they do not have too many levels.
# - data: The data.frame in which categorical variables will be exploded.
# - values: The exploded values for the value being unequal and equal to a level.
# - max_factor_level_fraction: Maximum number of levels as a fraction of column length. Set to 1 to explode all factors.
# Inspired by the acm.disjonctif function in the ade4 package.
explode_factors <- function(data, values = c(-0.8, 0.8), max_factor_level_fraction = 0.2) {
exploders <- colnames(data)[sapply(data, function(col){
is.factor(col) && nlevels(col) <= max_factor_level_fraction * length(col)
})]
if (length(exploders) > 0) {
exploded <- lapply(exploders, function(exp){
col <- data[, exp]
n <- length(col)
dummies <- matrix(values[1], n, length(levels(col)))
dummies[(1:n) + n * (unclass(col) - 1)] <- values[2]
colnames(dummies) <- paste(exp, levels(col), sep = '_')
dummies
})
# Only keep numeric data.
data <- data[sapply(data, is.numeric)]
# Add exploded values.
data <- cbind(data, exploded)
}
return(data)
}

(The question is 10yo, but for the sake of completeness...)
The function i() from the fixest package does exactly that.
Beyond creating a design matrix from a factor-like variable, you can also very easily do two extra things on the fly:
binning values (with the argument 'bin'),
excluding some factor values (with the argument ref).
And since it is made for this task, if your variable happens to be numeric you don't need to wrap it with factor(x_num) (as opposed to the model.matrix solution).
Here's an example:
library(fixest)
data(airquality)
table(airquality$Month)
#> 5 6 7 8 9
#> 31 30 31 31 30
head(i(airquality$Month))
#> 5 6 7 8 9
#> [1,] 1 0 0 0 0
#> [2,] 1 0 0 0 0
#> [3,] 1 0 0 0 0
#> [4,] 1 0 0 0 0
#> [5,] 1 0 0 0 0
#> [6,] 1 0 0 0 0
#
# Binning (check out the help, there are many many ways to bin)
#
colSums(i(airquality$Month, bin = 5:6)))
#> 5 7 8 9
#> 61 31 31 30
#
# References
#
head(i(airquality$Month, ref = c(6, 9)), 3)
#> 5 7 8
#> [1,] 1 0 0
#> [2,] 1 0 0
#> [3,] 1 0 0
And here's a little wrapper expanding all non-numeric variables (by default):
library(fixest)
# data: data.frame
# var: vector of variable names // if missing, all non numeric variables
# no argument checking
expand_factor = function(data, var){
if(missing(var)){
var = names(data)[!sapply(data, is.numeric)]
if(length(var) == 0) return(data)
}
data_list = unclass(data)
new = lapply(var, \(x) i(data_list[[x]]))
data_list[names(data_list) %in% var] = new
do.call("cbind", data_list)
}
my_data = data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))
expand_factor(my_data)
#> bar foo ham
#> [1,] 0 1 1
#> [2,] 0 1 2
#> [3,] 1 0 3
#> [4,] 1 0 4
Finally, for those wondering, the timing is equivalent to the model.matrix solution.
library(microbenchmark)
my_data = data.frame(x = as.factor(sample(100, 1e6, TRUE)))
microbenchmark(mm = model.matrix(~x, my_data),
i = i(my_data$x), times = 5)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> mm 155.1904 156.7751 209.2629 182.4964 197.9084 353.9443 5
#> i 154.1697 154.7893 159.5202 155.4166 163.9706 169.2550 5

In sapply == over eggs could be used to generate dummy vectors:
x <- with(df.original, data.frame(+sapply(unique(eggs), `==`, eggs), ham))
x
# foo bar ham
#1 1 0 1
#2 1 0 2
#3 0 1 3
#4 0 1 4
all.equal(x, df.desired)
#[1] TRUE
A maybe faster variant - Result best used as list or data.frame:
. <- unique(df.original$eggs)
with(df.original,
data.frame(+do.call(cbind, lapply(setNames(., .), `==`, eggs)), ham))
Indexing in a matrix - Result best used as matrix:
. <- unique(df.original$eggs)
i <- match(df.original$eggs, .)
nc <- length(.)
nr <- length(i)
cbind(matrix(`[<-`(integer(nc * nr), 1:nr + nr * (i - 1), 1), nr, nc,
dimnames=list(NULL, .)), df.original["ham"])
Using outer - Result best used as matrix:
. <- unique(df.original$eggs)
cbind(+outer(df.original$eggs, setNames(., .), `==`), df.original["ham"])
Using rep - Result best used as matrix:
. <- unique(df.original$eggs)
n <- nrow(df.original)
cbind(+matrix(df.original$eggs == rep(., each=n), n, dimnames=list(NULL, .)),
df.original["ham"])

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

mixed types of variables in R - r

In addition to typeof, str and summary are other possibilities. These can also be applied directly to the data frame, ie no lapply or looping required.

Related

Combining columns with a grouping in R

How to multiply odd rows of your own matrix to get a vector without using loops

Dummy/Binary Category Variable creation in Data Frame [duplicate]

How to create dummy variables for all the categorical variables in R? [duplicate]

Automatically expanding an R factor into a collection of 1/0 indicator variables for every factor level

Categories

Resources