Generate a dummy matrix from multiple factor columns - r

I already searched the web and found no answer. I have a big data.frame that contains multiple columns. Each column is a factor variable.
I want to transform the data.frame so that each possible factor level becomes its own variable, containing a "1" if that level occurs in any of the factor columns of the row and "0" otherwise.
Here is an example of what I mean.
labels <- c("1", "2", "3", "4", "5", "6", "7")
#create data frame (note, not all factor levels have to be in the columns,
#NA values are possible)
input <- data.frame(ID = c(1, 2, 3),
Cat1 = factor(c( 4, 1, 1), levels = labels),
Cat2 = factor(c(2, NA, 4), levels = labels),
Cat3 = factor(c(7, NA, NA), levels = labels))
#the seven factor levels now are the variables of the data.frame
desired_output <- data.frame(ID = c(1, 2, 3),
Dummy1 = c(0, 1, 1),
Dummy2 = c(1, 0, 0),
Dummy3 = c(0, 0, 0),
Dummy4 = c(1, 0, 1),
Dummy5 = c(0, 0, 0),
Dummy6 = c(0, 0, 0),
Dummy7 = c(1, 0, 0))
input
ID Cat1 Cat2 Cat3
1 4 2 7
2 1 <NA> <NA>
3 1 4 <NA>
desired_output
ID Dummy1 Dummy2 Dummy3 Dummy4 Dummy5 Dummy6 Dummy7
1 0 1 0 1 0 0 1
2 1 0 0 0 0 0 0
3 1 0 0 1 0 0 0
My actual data.frame has over 3000 rows and factors with more than 100 levels.
I hope you can help me convert the input to the desired output.
Greetings
sush

A couple of methods that riff off Gregor's and Aaron's answers.
From Aaron's: factorsAsStrings=FALSE keeps the value column as a factor, hence all labels are kept when using dcast.
library(reshape2)
dcast(melt(input, id="ID", factorsAsStrings=FALSE), ID ~ value, drop=FALSE)
ID 1 2 3 4 5 6 7 NA
1 1 0 1 0 1 0 0 1 0
2 2 1 0 0 0 0 0 0 2
3 3 1 0 0 1 0 0 0 1
Then you just need to remove the last column.
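The dcast call above is essentially counting how often each level occurs per ID. As a rough sketch of the same idea without reshape2 (this swaps in base R's table(), it is not Aaron's code), the columns can be stacked with unlist() and tabulated. The names cat_cols, values, ids, tab, dummies and result are made up for illustration, and it assumes every non-ID column is a category column sharing the same levels.

```r
labels <- c("1", "2", "3", "4", "5", "6", "7")
input <- data.frame(ID = c(1, 2, 3),
                    Cat1 = factor(c(4, 1, 1), levels = labels),
                    Cat2 = factor(c(2, NA, 4), levels = labels),
                    Cat3 = factor(c(7, NA, NA), levels = labels))

# Assumption: every column except ID is a category column
cat_cols <- setdiff(names(input), "ID")

# Stack all category columns into one factor; unlist() keeps the shared levels
values <- unlist(input[cat_cols], use.names = FALSE)
ids    <- rep(input$ID, times = length(cat_cols))

# table() drops NA by default, so there is no NA column to remove afterwards
tab <- table(ID = ids, value = values)

# Cap the counts at 1 so a level that occurs in two columns still gives a 0/1 flag
dummies <- matrix(as.integer(tab > 0), nrow = nrow(tab),
                  dimnames = list(NULL, paste0("Dummy", colnames(tab))))

# Rows of the table are sorted by ID, so take the IDs from the table itself
result <- data.frame(ID = as.numeric(rownames(tab)), dummies)
result
```

Note that the rows come back sorted by ID rather than in the original row order, and that an ID whose categories are all NA would still need at least one row in ids to show up (it does here, because ids is built from every row).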
From Gregor's
na.replace <- function(x) replace(x, is.na(x), 0)
options(na.action='na.pass') # this keeps the NA's which are then converted to zero
Reduce("+", lapply(input[-1], function(x) na.replace(model.matrix(~ 0 + x))))
x1 x2 x3 x4 x5 x6 x7
1 0 1 0 1 0 0 1
2 1 0 0 0 0 0 0
3 1 0 0 1 0 0 0
Then you just need to cbind the ID column
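For completeness, here is a sketch of that last step put together with the code above. The only changes are that the na.action option is restored afterwards (so the rest of the session is not affected) and that the columns are renamed to match the desired output; old_opts, dummies and result are names chosen here for illustration.

```r
labels <- c("1", "2", "3", "4", "5", "6", "7")
input <- data.frame(ID = c(1, 2, 3),
                    Cat1 = factor(c(4, 1, 1), levels = labels),
                    Cat2 = factor(c(2, NA, 4), levels = labels),
                    Cat3 = factor(c(7, NA, NA), levels = labels))

na.replace <- function(x) replace(x, is.na(x), 0)

# Keep NA rows in the model matrices, remembering the previous setting
old_opts <- options(na.action = "na.pass")
dummies <- Reduce("+", lapply(input[-1], function(x) na.replace(model.matrix(~ 0 + x))))
options(old_opts)  # restore the previous na.action

# Rename x1..x7 to Dummy1..Dummy7 and attach the ID column
colnames(dummies) <- paste0("Dummy", labels)
result <- cbind(ID = input$ID, dummies)
result
```

The result is a numeric matrix; wrap it in as.data.frame() if a data.frame is needed.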

One way to do this is with matrix indexing. Your data specifies which cells of the output matrix should be 1 (the rest should be 0), so we make a matrix of zeros and then fill in the 1s based on your data. To do that, the data needs to be in a two-column matrix, with the first column giving the row of the output (the ID) and the second column giving the column of the output.
Put input data in long format, remove missings, convert values to integers matching the labels, then make a matrix as needed.
in2 <- reshape2::melt(input, id.vars="ID")
in2 <- subset(in2, !is.na(value))
in2$value <- match(in2$value, labels)
in2$variable <- NULL
in2 <- as.matrix(in2)
Then make the new output matrix with all zeros, and fill in the ones using that matrix.
out <- matrix(0, nrow=nrow(input), ncol=length(labels))
colnames(out) <- labels
rownames(out) <- input$ID
out[in2] <- 1
out
## 1 2 3 4 5 6 7
## 1 0 1 0 1 0 0 1
## 2 1 0 0 0 0 0 0
## 3 1 0 0 1 0 0 0
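One thing to watch: the index matrix above uses the ID values themselves as row numbers, which works here because the IDs are 1, 2, 3. A possible variant that uses row positions instead, and needs no reshape2, is sketched below. The IDs 101, 205, 309 are hypothetical (changed from the question only to show that the IDs no longer have to be 1..n), and level_idx and present are names made up for this sketch.

```r
labels <- c("1", "2", "3", "4", "5", "6", "7")
# Same categories as in the question, but with hypothetical non-sequential IDs
input <- data.frame(ID = c(101, 205, 309),
                    Cat1 = factor(c(4, 1, 1), levels = labels),
                    Cat2 = factor(c(2, NA, 4), levels = labels),
                    Cat3 = factor(c(7, NA, NA), levels = labels))

# Start with all zeros, one row per input row and one column per level
out <- matrix(0, nrow = nrow(input), ncol = length(labels),
              dimnames = list(as.character(input$ID), labels))

for (col in setdiff(names(input), "ID")) {
  level_idx <- match(as.character(input[[col]]), labels)  # output column, NA if missing
  present   <- which(!is.na(level_idx))                   # row positions that have a value
  out[cbind(present, level_idx[present])] <- 1            # two-column (row, col) indexing
}
out
```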

Here's a way using model.matrix. We convert the missing values to 0s, and specify 0 as the reference level for the factor contrasts. Then we just add the individual model matrices together and stick on the IDs:
new_lab = as.character(0:7)
for (i in 2:4) {
temp = as.character(input[[i]])
temp[is.na(temp)] = "0"
input[[i]] = factor(temp, levels = new_lab)
}
mm =
model.matrix(~ Cat1, data = input) +
model.matrix(~ Cat2, data = input) +
model.matrix(~ Cat3, data = input)
mm[, 1] = input$ID
colnames(mm) = c("ID", paste0("Dummy", 1:(ncol(mm) - 1)))
mm
# ID Dummy1 Dummy2 Dummy3 Dummy4 Dummy5 Dummy6 Dummy7
# 1 1 0 1 0 1 0 0 1
# 2 2 1 0 0 0 0 0 0
# 3 3 1 0 0 1 0 0 0
# attr(,"assign")
# [1] 0 1 1 1 1 1 1 1
# attr(,"contrasts")
# attr(,"contrasts")$Cat1
# [1] "contr.treatment"
You can leave the result as a model matrix, change it back to a data frame, or whatever else.

This should work on your data frame. I converted the values to numeric before running the ifelse statement. Hope it works:
# Make dummy df
Cat1 = factor(c( 4, 1, 1))
Cat2 = factor(c(2, NA, 4))
Cat3 = factor(c(7, NA, NA))
df <- data.frame(Cat1,Cat2,Cat3)
# Load magrittr for the %<>% assignment pipe
library(magrittr)
# Specify columns
cols <- seq_along(df)
# Convert values to numeric
df[, cols] %<>% lapply(function(x) as.numeric(as.character(x)))
# Perform ifelse: if the value is NA set 0, otherwise 1
df[, cols] %<>% lapply(function(x) ifelse(is.na(x), 0, 1))
Based on input:
Cat1 Cat2 Cat3
1 4 2 7
2 1 <NA> <NA>
3 1 4 <NA>
Output looks like this (one 0/1 flag per input column, marking whether the value is non-missing):
Cat1 Cat2 Cat3
1 1 1 1
2 1 0 0
3 1 1 0

Related

Iterating over columns to create flagging variables

I've got a dataset that has a lot of numerical columns (in the example below these columns are x, y, z). I want to create individual flagging variables for each of those columns (x_YN, y_YN, z_YN) such that, if the numerical column is > 0, the flagging variable is = 1 and otherwise it's = 0. What might be the most efficient way to tackle this?
Thanks for the help!
x <- c(3, 7, 0, 10)
y <- c(5, 2, 20, 0)
z <- c(0, 0, 4, 12)
df <- data.frame(x,y,z)
We may build a logical matrix and coerce it to integer with the unary +:
df[paste0(names(df), "_YN")] <- +(df > 0)
Output:
> df
x y z x_YN y_YN z_YN
1 3 5 0 1 1 0
2 7 2 0 1 1 0
3 0 20 4 0 1 1
4 10 0 12 1 0 1
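In case the one-liner looks cryptic, here is the same thing split into steps, as a small sketch: df > 0 on a data.frame returns a logical matrix, and the unary + turns TRUE/FALSE into 1/0. The intermediate names is_pos and flags are only for illustration.

```r
df <- data.frame(x = c(3, 7, 0, 10),
                 y = c(5, 2, 20, 0),
                 z = c(0, 0, 4, 12))

is_pos <- df > 0   # logical matrix with columns x, y, z
flags  <- +is_pos  # unary plus coerces TRUE/FALSE to 1/0 (integer matrix)

# Each matrix column becomes one new data.frame column
df[paste0(names(df), "_YN")] <- flags
df
```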
The dplyr alternative:
library(dplyr)
df %>%
mutate(across(everything(), ~ +(.x > 0), .names = "{col}_YN"))
output
x y z x_YN y_YN z_YN
1 3 5 0 1 1 0
2 7 2 0 1 1 0
3 0 20 4 0 1 1
4 10 0 12 1 0 1

All possible combinations (sequential)

I am wondering what an efficient approach to the following question would be:
Suppose I have three characters in group 1 and two characters in group 2:
group_1 = c("X", "Y", "Z")
group_2 = c("A", "B")
Clearly, the "all" possible combinations for group_1 and group_2 are given by:
group_1_combs = data.frame(X = c(0,1,0,0,1,1,0,1),
Y = c(0,0,1,0,1,0,1,1),
Z = c(0,0,0,1,0,1,1,1))
group_2_combs = data.frame(A = c(0,1,0,1),
B = c(0,0,1,1))
My question is the following:
(1) How do I go from group_1 to group_1_combs efficiently (given that the character vector might be large).
(2) How do I do an "all possible" combinations of each row of group_1_combs and group_2_combs? Specifically, I want a "final" data.frame where each row of group_1_combs is "permuted" with every row of group_2_combs. This means that the final data.frame would have 8 x 4 rows (since there are 8 rows in group_1_combs and 4 rows in group_2_combs) and 5 columns (X,Y,Z,A,B).
Thanks!
You want expand.grid and merge:
Question 1:
group_1_combs <- expand.grid(setNames(rep(list(c(0, 1)), length(group_1)), group_1))
group_2_combs <- expand.grid(setNames(rep(list(c(0, 1)), length(group_2)), group_2))
Question 2:
> merge(group_1_combs, group_2_combs)
X Y Z A B
1 0 0 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 1 1 0 0 0
5 0 0 1 0 0
6 1 0 1 0 0
7 0 1 1 0 0
...
Or you can go directly to the merged data.frame:
group_12 <- c(group_1, group_2)
expand.grid(setNames(rep(list(c(0, 1)), length(group_12)), group_12))
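A quick way to convince yourself that both routes give the same rows is sketched below. merge() on two data.frames with no column names in common falls back to a cross join, so you get 8 x 4 = 32 rows. binary_grid and row_key are hypothetical helpers written for this check, not existing functions. Keep in mind that n names give 2^n rows, so this grows quickly for long character vectors.

```r
group_1 <- c("X", "Y", "Z")
group_2 <- c("A", "B")

# Hypothetical helper: all 0/1 combinations for a vector of names
binary_grid <- function(vars) {
  expand.grid(setNames(rep(list(c(0, 1)), length(vars)), vars))
}

group_1_combs <- binary_grid(group_1)  # 2^3 = 8 rows
group_2_combs <- binary_grid(group_2)  # 2^2 = 4 rows

# No shared column names, so merge() pairs every row with every row
merged <- merge(group_1_combs, group_2_combs)
direct <- binary_grid(c(group_1, group_2))

# Hypothetical helper: collapse each row into a string such as "01001"
row_key <- function(d) do.call(paste0, as.list(d))

dim(merged)
setequal(row_key(merged), row_key(direct))
```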

dummy variable columns based on strings from other columns [duplicate]

This question already has answers here:
Dummy variables from a string variable
(7 answers)
Closed 3 years ago.
I have a database with patient ID numbers and the treatment they received. I would like to have a dummy column for every different INDIVIDUAL treatment (i.e., did the patient receive treatment A, B, C, D).
This is way simplified because I have over 20 treatments and thousands of patients, and I can't figure out a simple way to do so.
example <- data.frame(id_number = c(0, 1, 2, 3, 4),
treatment = c("A", "A+B+C+D", "C+B", "B+A", "C"))
I would like to have something like this:
desired_result <- data.frame(id_number = c(0, 1, 2, 3, 4),
treatment = c("A", "A+B+C+D", "C+B", "B+A","C"),
A=c(1,1,0,1,0),
B=c(0,1,1,1,0),
C=c(0,1,1,0,1),
D=c(0,1,0,0,0))
A base version:
example["A"] <- as.numeric(grepl("A", example[,"treatment"]))
example["B"] <- as.numeric(grepl("B", example[,"treatment"]))
example["C"] <- as.numeric(grepl("C", example[,"treatment"]))
example["D"] <- as.numeric(grepl("D", example[,"treatment"]))
example
id_number treatment A B C D
1 0 A 1 0 0 0
2 1 A+B+C+D 1 1 1 1
3 2 C+B 0 1 1 0
4 3 B+A 1 1 0 0
5 4 C 0 0 1 0
The grepl function tests the presence of each pattern in each row, and as.numeric changes the logical TRUE/FALSE to 1/0
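With 20+ treatments, writing one line per treatment gets tedious, and grepl matches substrings, so a treatment called "A" would also match inside a longer name like "AB". One possible base R sketch that avoids both is to split the strings first and test for exact membership. The names parts, treatments, dummies and result are made up here, and it assumes "+" is the only separator.

```r
example <- data.frame(id_number = c(0, 1, 2, 3, 4),
                      treatment = c("A", "A+B+C+D", "C+B", "B+A", "C"),
                      stringsAsFactors = FALSE)

# Split each treatment string on a literal "+"
parts <- strsplit(example$treatment, "+", fixed = TRUE)

# All distinct individual treatments, discovered from the data
treatments <- sort(unique(unlist(parts)))

# One 0/1 column per treatment: 1 if the treatment is among the patient's parts
dummies <- sapply(treatments, function(trt) {
  as.numeric(vapply(parts, function(p) trt %in% p, logical(1)))
})

result <- cbind(example, dummies)
result
```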
One tidyverse possibility could be:
library(dplyr)
library(tidyr)
example %>%
  mutate(treatment2 = strsplit(treatment, "+", fixed = TRUE)) %>%
  unnest(treatment2) %>%
  spread(treatment2, treatment2) %>%
  mutate_at(vars(-id_number, -treatment), ~ (!is.na(.)) * 1)
id_number treatment A B C D
1 0 A 1 0 0 0
2 1 A+B+C+D 1 1 1 1
3 2 C+B 0 1 1 0
4 3 B+A 1 1 0 0
5 4 C 0 0 1 0
Or:
example %>%
  mutate(treatment2 = strsplit(treatment, "+", fixed = TRUE)) %>%
  unnest(treatment2) %>%
  mutate(val = 1) %>%
  spread(treatment2, val, fill = 0)

changing the values in many r variables

I want to do the equivalent of find and replace 1=0;2=0;3=0;4=1;5=2;6=3 for many different variables in my data set.
Things I've tried:
making 1=0;2=0;3=0;4=1;5=2;6=3 into a function and using sapply. I changed the ; to , and changed the = to <- and none of these combinations was recognized as a function. I also tried creating a function with that definition and passing it to sapply, and it didn't work.
I tried using recode and it did not work:
wdata[ ,cols2] = recode(wdata[ ,cols2], 1=0;2=0;3=0;4=1;5=2;6=3)
Assuming you are working with a data.frame or matrix you can use direct indexing:
# Sample data
set.seed(2017)
df <- as.data.frame(matrix(sample(1:6, 20, replace = TRUE), ncol = 4))
df
#V1 V2 V3 V4
#1 6 5 5 3
#2 4 1 1 3
#3 3 3 1 5
#4 2 3 3 6
#5 5 2 3 5
df[df == 1 | df == 2 | df == 3] <- 0
df[df == 4] <- 1
df[df == 5] <- 2
df[df == 6] <- 3
df
# V1 V2 V3 V4
#1 3 2 2 0
#2 1 0 0 0
#3 0 0 0 2
#4 0 0 0 3
#5 2 0 0 2
Note that the order of the substitutions matters, because a later substitution also hits values produced by an earlier one. For example, df[df == 4] <- 1; df[df == 1] <- 0 will give a different output from df[df == 1] <- 0; df[df == 4] <- 1.
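To make the ordering issue concrete, here is a small sketch on a plain vector containing each value once. The names good and bad are only for illustration.

```r
# Safe order: the targets 0, 1, 2, 3 are never hit by a later substitution
good <- 1:6
good[good %in% 1:3] <- 0
good[good == 4] <- 1
good[good == 5] <- 2
good[good == 6] <- 3
good  # 0 0 0 1 2 3

# Unsafe order: 4 is first turned into 1, and that 1 is then turned into 0
bad <- 1:6
bad[bad == 4] <- 1
bad[bad %in% 1:3] <- 0
bad[bad == 5] <- 2
bad[bad == 6] <- 3
bad   # 0 0 0 0 2 3
```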
Alternative solution using recode from dplyr with sapply or mutate_all:
set.seed(2017)
df <- as.data.frame(matrix(sample(1:6, 20, replace = TRUE), ncol = 4))
df
library(dplyr)
f = function(x) recode(x, `1`=0, `2`=0, `3`=0, `4`=1, `5`=2, `6`=3)
sapply(df, f)
# V1 V2 V3 V4
# [1,] 3 2 2 0
# [2,] 1 0 0 0
# [3,] 0 0 0 2
# [4,] 0 0 0 3
# [5,] 2 0 0 2
df %>% mutate_all(f)
# V1 V2 V3 V4
# 1 3 2 2 0
# 2 1 0 0 0
# 3 0 0 0 2
# 4 0 0 0 3
# 5 2 0 0 2
A looping alternative with lapply and match is as follows:
dat[] <- lapply(dat, function(x) c(0, 0, 0, 1, 2, 3)[match(x, 1:6)])
This uses the vector c(0, 0, 0, 1, 2, 3) as a lookup table, with match selecting the indices. Using the data.frame created by Maurits Evers (named dat here), we get
dat
V1 V2 V3 V4
1 3 2 2 0
2 1 0 0 0
3 0 0 0 2
4 0 0 0 3
5 2 0 0 2
To do this for a subset of the columns, just select them on each side, like
dat[, cols2] <-
lapply(dat[, cols2], function(x) c(0, 0, 0, 1, 2, 3)[match(x, 1:6)])
or
dat[cols2] <- lapply(dat[cols2], function(x) c(0, 0, 0, 1, 2, 3)[match(x, 1:6)])
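A small sketch of how the lookup behaves on a single vector may help, including what happens to values that are not in 1:6. Unlike the step-by-step replacement, all values are translated at once, so the order problem cannot occur. The names lookup, x, idx, recoded and kept are chosen here for illustration.

```r
lookup <- c(0, 0, 0, 1, 2, 3)  # position i holds the new value for old value i
x <- c(6, 4, 1, 7, NA)         # 7 and NA have no entry in the lookup table

idx <- match(x, 1:6)           # positions in the lookup table, NA if not found
recoded <- lookup[idx]
recoded  # 3 1 0 NA NA

# Possible variation: keep values that are not in the table unchanged
kept <- ifelse(is.na(idx), x, lookup[idx])
kept     # 3 1 0 7 NA
```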

ordering columns in dataframe based on incomplete vector

I have a vector based on col names which looks like
x <- c("C", "A", "T")
My dataframe, with row names and column names defined, looks like this:
names A B C D T
Dan 1 0 1 0 1
Joe 0 1 0 1 0
I want to order the dataframe so that the columns in the vector appear first, followed by the columns not in the vector:
names C A T B D
Dan 1 1 1 0 0
Joe 0 0 0 1 1
Thanks
The following will rearrange your data to set the columns specified in the vector x at the beginning, and the remaining columns in their original order afterwards.
x <- c("C", "A", "T")
mydata <- mydata[, c(x, setdiff(names(mydata), x))]
If the names column should stay at the first position and is not specified within x, use (thanks @StevenBeaupré for pointing it out and providing the code):
mydata <- mydata[, c(names(mydata)[1], x, setdiff(names(mydata)[-1], x))]
Small data example:
mydata <- data.frame(names = c("Dan", "Joe"), A = c(1, 0), B = c(0,1),
C = c(1, 0), D = c(0,1), T = c(1, 0))
> mydata
names A B C D T
1 Dan 1 0 1 0 1
2 Joe 0 1 0 1 0
mydata <- mydata[, c(names(mydata)[1], x, setdiff(names(mydata)[-1], x))]
> mydata
names C A T B D
1 Dan 1 1 1 0 0
2 Joe 0 0 0 1 1
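If the vector might contain names that are not columns of the data (the title mentions an incomplete vector), indexing with it directly fails with an "undefined columns selected" error. One possible guard, as a sketch, is to keep only the names that exist, using intersect. The entry "Q" is a hypothetical missing column added for illustration, and front and rest are made-up names.

```r
mydata <- data.frame(names = c("Dan", "Joe"), A = c(1, 0), B = c(0, 1),
                     C = c(1, 0), D = c(0, 1), T = c(1, 0))

x <- c("C", "A", "T", "Q")  # "Q" is hypothetical and not a column of mydata

# Keep only requested columns that exist, in the order given by x
front <- intersect(x, names(mydata))
# Everything else (except the names column) keeps its original order
rest <- setdiff(names(mydata), c("names", front))

mydata <- mydata[, c("names", front, rest)]
mydata
```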
