Encode string column as several dummy columns [duplicate] - r

I'd like to take data of the form
names label
1 A/B V
2 A W
3 A/C/D X
4 B/C Y
5 B/D Z
and encode the 'names' column into several columns containing a dummy variable which shows whether a particular name is included, i.e.
A B C D label
1 1 1 0 0 V
2 1 0 0 0 W
3 1 0 1 1 X
4 0 1 1 0 Y
5 0 1 0 1 Z
It feels like there should be an R function which takes care of this easily, but I have not been able to find one. Thanks for any pointers!

An option would be to split the string column by / and use mtabulate
library(qdapTools)
cbind(mtabulate(strsplit(df1$names, "/")), df1['label'])
# A B C D label
#1 1 1 0 0 V
#2 1 0 0 0 W
#3 1 0 1 1 X
#4 0 1 1 0 Y
#5 0 1 0 1 Z
Or in base R
table(stack(setNames(strsplit(df1$names, "/"), df1$label))[2:1])
NO packages used
data
df1 <- structure(list(names = c("A/B", "A", "A/C/D", "B/C", "B/D"),
label = c("V", "W", "X", "Y", "Z")), class = "data.frame",
row.names = c("1", "2", "3", "4", "5"))
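If installing qdapTools is not an option and a plain data frame (rather than a table object) is wanted, the same split-and-tabulate idea can be sketched in base R. This is only an illustrative sketch: the helper names parts, all_names, dummies and result are made up here and are not part of the answer above.

```r
# Same input as in the answer above
df1 <- data.frame(names = c("A/B", "A", "A/C/D", "B/C", "B/D"),
                  label = c("V", "W", "X", "Y", "Z"),
                  stringsAsFactors = FALSE)

# Split each string on the literal "/" separator
parts <- strsplit(df1$names, "/", fixed = TRUE)

# Every distinct name becomes one dummy column
all_names <- sort(unique(unlist(parts)))

# One row per input row: 1 if the name is present, 0 otherwise
dummies <- do.call(rbind, lapply(parts, function(p) as.integer(all_names %in% p)))
colnames(dummies) <- all_names

result <- cbind(as.data.frame(dummies), label = df1$label)
result
```

Building the matrix with rbind keeps one row per input row even when only a single distinct name exists.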

Use separate_rows to put it in long form and then table will produce the output. Transpose to get it in the orientation shown in the question.
library(dplyr)
library(tidyr)
DF %>%
separate_rows(names) %>%
table %>%
t
giving:
names
label A B C D
V 1 1 0 0
W 1 0 0 0
X 1 0 1 1
Y 0 1 1 0
Z 0 1 0 1
Note
The input in reproducible form:
Lines <- "names label
1 A/B V
2 A W
3 A/C/D X
4 B/C Y
5 B/D Z"
DF <- read.table(text = Lines, as.is = TRUE)

Related

All possible combinations (sequential)

I am wondering what an efficient approach to the following question would be:
Suppose I have three characters in group 1 and two characters in group 2:
group_1 = c("X", "Y", "Z")
group_2 = c("A", "B")
Clearly, the "all" possible combinations for group_1 and group_2 are given by:
group_1_combs = data.frame(X = c(0,1,0,0,1,1,0,1),
Y = c(0,0,1,0,1,0,1,1),
Z = c(0,0,0,1,0,1,1,1))
group_2_combs = data.frame(A = c(0,1,0,1),
B = c(0,0,1,1))
My question is the following:
(1) How do I go from group_1 to group_1_combs efficiently (given that the character vector might be large).
(2) How do I do an "all possible" combinations of each row of group_1_combs and group_2_combs? Specifically, I want a "final" data.frame where each row of group_1_combs is "permuted" with every row of group_2_combs. This means that the final data.frame would have 8 x 4 rows (since there are 8 rows in group_1_combs and 4 rows in group_2_combs) and 5 columns (X,Y,Z,A,B).
Thanks!
You want expand.grid and merge:
Question 1:
group_1_combs <- expand.grid(setNames(rep(list(c(0, 1)), length(group_1)), group_1))
group_2_combs <- expand.grid(setNames(rep(list(c(0, 1)), length(group_2)), group_2))
Question 2:
> merge(group_1_combs, group_2_combs)
X Y Z A B
1 0 0 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 1 1 0 0 0
5 0 0 1 0 0
6 1 0 1 0 0
7 0 1 1 0 0
...
Or you can go directly to the merged data.frame:
group_12 <- c(group_1, group_2)
expand.grid(setNames(rep(list(c(0, 1)), length(group_12)), group_12))
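As a quick sanity check, the two routes (merging the two grids versus building one grid over all labels) can be compared directly. The sketch below assumes a small helper, binary_grid, which is a made-up name for illustration; by = NULL is passed to merge to make the Cartesian product explicit.

```r
group_1 <- c("X", "Y", "Z")
group_2 <- c("A", "B")

# Hypothetical helper: all 0/1 combinations for a vector of labels
binary_grid <- function(labels) {
  expand.grid(setNames(rep(list(c(0, 1)), length(labels)), labels))
}

# Route 1: cross join of the two grids (no common columns)
crossed <- merge(binary_grid(group_1), binary_grid(group_2), by = NULL)

# Route 2: one grid over all labels at once
direct <- binary_grid(c(group_1, group_2))

dim(crossed)  # 8 x 4 = 32 rows, 5 columns
same <- all(as.matrix(crossed) == as.matrix(direct))
same
```

Keep in mind that the number of rows doubles with every added label (2^n rows for n labels), so for a large character vector the full grid may not fit in memory regardless of which route is used.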

How to create a new 0/1 variable in R for each distinct value in a column?

My variable is as follows
variable
D
D
B
C
B
D
C
C
D
I want to turn the column above into the table below:
variable B C D
D 0 0 1
D 0 0 1
B 1 0 0
C 0 1 0
B 1 0 0
D 0 0 1
C 0 1 0
C 0 1 0
D 0 0 1
But I don't want code like the one below, because the variable column has too many distinct values:
data = data %>% mutate(B=ifelse(variable=="B", 1,0),
C=ifelse(variable=="C", 1,0),
D=ifelse(variable=="D", 1,0))
Here is a base R approach. We can first find all unique variable values from the data frame. Then, sapply over that vector and generate a new column for each value. Finally, we can cbind the resulting matrix of 0/1 columns to the original data frame.
cols <- sort(unique(df$variable))
df2 <- sapply(cols, function(x) ifelse(df$variable == x, 1, 0))
df <- cbind(df, df2)
df
variable B C D
1 D 0 0 1
2 D 0 0 1
3 B 1 0 0
4 C 0 1 0
5 B 1 0 0
6 D 0 0 1
7 C 0 1 0
8 C 0 1 0
9 D 0 0 1
Data:
df <- data.frame(variable=c("D", "D", "B", "C", "B",
"D", "C", "C", "D"),
stringsAsFactors=FALSE)
Try reshaping: duplicate the original variable so that one copy supplies the new column names, add a constant value column and a row id, then pivot to wide format to obtain the expected output:
library(dplyr)
library(tidyr)
#Code
new <- df %>% mutate(Var=variable,Val=1,id=row_number()) %>%
pivot_wider(names_from = Var,values_from=Val,values_fill = 0) %>%
select(-id)
Output:
# A tibble: 9 x 4
variable D B C
<chr> <dbl> <dbl> <dbl>
1 D 1 0 0
2 D 1 0 0
3 B 0 1 0
4 C 0 0 1
5 B 0 1 0
6 D 1 0 0
7 C 0 0 1
8 C 0 0 1
9 D 1 0 0
Some data used:
#Data
df <- structure(list(variable = c("D", "D", "B", "C", "B", "D", "C",
"C", "D")), class = "data.frame", row.names = c(NA, -9L))
1) model.matrix
model.matrix will generate column names like variableB so the last line removes the variable part to ensure that the column names are exactly the same as in the question. Omit the last line if it is not important that the column names be exactly as shown there.
dat2 <- cbind(dat, model.matrix(~ variable - 1, dat))
names(dat2) <- sub("variable(.)", "\\1", names(dat2))
giving:
> dat2
variable B C D
1 D 0 0 1
2 D 0 0 1
3 B 1 0 0
4 C 0 1 0
5 B 1 0 0
6 D 0 0 1
7 C 0 1 0
8 C 0 1 0
9 D 0 0 1
2) outer
This can also be done using outer as shown. Each component of variable is compared to each level. We name the levels so that outer uses them as column names. The output is the same.
levs <- sort(unique(dat$variable))
names(levs) <- levs
cbind(dat, +outer(dat$variable, levs, `==`))
Note
The input in reproducible form:
Lines <- "
variable
D
D
B
C
B
D
C
C
D"
dat <- read.table(text = Lines, header = TRUE)
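To see that the model.matrix and outer approaches really produce the same indicator columns, they can be compared side by side. This is a sketch only: the object names mm and ou are made up here, and the data frame is built directly rather than through read.table.

```r
dat <- data.frame(variable = c("D", "D", "B", "C", "B", "D", "C", "C", "D"),
                  stringsAsFactors = FALSE)

# 1) model.matrix: one column per level, without an intercept
mm <- model.matrix(~ variable - 1, dat)
colnames(mm) <- sub("^variable", "", colnames(mm))  # strip the "variable" prefix

# 2) outer: compare every observation with every level
levs <- sort(unique(dat$variable))
ou <- +outer(dat$variable, setNames(levs, levs), `==`)  # unary + turns TRUE/FALSE into 1/0

colnames(mm)
colnames(ou)
all(mm == ou)
```

Each row contains exactly one 1, because every observation matches exactly one level.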

Convert All Variables in a Dataframe to Numbers

Is there a fast way to convert all variables in a column to numbers, regardless of variable type? ie. if a column only had values "Yes" and "No", they would be converted to 0 and 1; columns with 3 values of "a", "b" and "c" would be converted to 0, 1, 2, etc.
The current df that I am using has the 9th column as "Yes/No".
EDIT:
Using Moody_Mudskipper's suggestion, I have tried:
RawData1 <- as.matrix(as.numeric(factor(RawData[[9]], levels = c("Yes","No"))) - 1)
dput(head(df,10))
structure(c("function (x, df1, df2, ncp, log = FALSE) ", "{",
" if (missing(ncp)) ", " .Call(C_df, x, df1, df2, log)",
" else .Call(C_dnf, x, df1, df2, ncp, log)", "}"), .Dim = c(6L,
1L), .Dimnames = list(c("1", "2", "3", "4", "5", "6"), ""), class =
"noquote")
Moody's answer (+1) explains that you need to convert to factors, then to numeric
You can use mutate_all to change the class of all columns in your data frame
library(dplyr)
df %>%
mutate_all(funs(as.numeric(as.factor(.))))
You can use factors for this:
df <- data.frame(yn = sample(c("yes","no"),10,T),
abc = sample(c("a","b","c"),10,T),
stringsAsFactors = F
)
df$yn2 <- as.numeric(factor(df$yn,levels = c("yes","no"))) - 1
df$abc2 <- as.numeric(factor(df$abc,levels = c("a","b","c"))) - 1
# yn abc yn2 abc2
# 1 no b 1 1
# 2 yes b 0 1
# 3 no b 1 1
# 4 yes a 0 0
# 5 yes c 0 2
# 6 yes c 0 2
# 7 yes c 0 2
# 8 yes a 0 0
# 9 no c 1 2
# 10 yes b 0 1
Another Base R solution to convert all columns:
# Added a numeric column to @Moody_Mudskipper's data example
set.seed(1)
df <- data.frame(yn = sample(c("yes","no"),10,T),
abc = sample(c("a","b","c"),10,T),
num = 1:10,
stringsAsFactors = F
)
df = data.frame(lapply(df, function(x) as.numeric(as.factor(x))))
One issue with this though is that it gives:
yn abc num
1 2 1 1
2 2 1 2
3 1 3 3
4 1 2 4
5 2 3 5
6 1 2 6
7 1 3 7
8 1 3 8
9 1 2 9
10 2 3 10
which is not what OP wants, as he wanted factor/character variables to be converted to 0,1,2,3,... One can try to do this:
df = data.frame(lapply(df, function(x) as.numeric(as.factor(x))-1))
but then all numeric columns would incorrectly have 1 subtracted from them as well. Using mutate_all (as in @CPak's answer) has the same issue. What you can do instead is use mutate_if to convert only the columns that are factors or characters:
library(dplyr)
df %>%
mutate_if(function(x) is.factor(x) | is.character(x), funs(as.numeric(as.factor(.))-1))
# or this...
df %>%
mutate_if(function(x) !is.numeric(x), funs(as.numeric(as.factor(.))-1))
Now, columns are correctly converted:
yn abc num
1 1 0 1
2 1 0 2
3 0 2 3
4 0 1 4
5 1 2 5
6 0 1 6
7 0 2 7
8 0 2 8
9 0 1 9
10 1 2 10
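The same "only recode non-numeric columns" idea can also be sketched without dplyr. The helper to_code and the small example data frame below are made up for illustration. Note that factor() orders levels alphabetically by default, so "no" becomes 0 and "yes" becomes 1; to get the mapping asked for in the question (Yes = 0, No = 1), the levels have to be given explicitly.

```r
df <- data.frame(yn  = c("yes", "no", "no", "yes"),
                 abc = c("a", "c", "b", "a"),
                 num = c(10, 20, 30, 40),
                 stringsAsFactors = FALSE)

# Hypothetical helper: leave numeric columns alone, recode everything else to 0, 1, 2, ...
to_code <- function(x) {
  if (is.numeric(x)) return(x)
  as.integer(factor(x)) - 1L
}

coded <- df
coded[] <- lapply(df, to_code)  # the [] keeps the data.frame structure
coded

# Explicit level order when a specific mapping is required
yn_explicit <- as.integer(factor(df$yn, levels = c("yes", "no"))) - 1L
yn_explicit
```

For dplyr users: funs() has been deprecated since dplyr 0.8.0, and across() together with where() (dplyr 1.0.0 and later) is the usual replacement for the mutate_if/funs combination shown above.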

ordering columns in dataframe based on incomplete vector

I have a vector based on col names which looks like
x <- c("C", "A", "T")
My data frame, with row names and column names defined, looks like this:
names A B C D T
Dan 1 0 1 0 1
Joe 0 1 0 1 0
I want to order the dataframe so the columns in the vector appear first followed by columns not in the vector
names C A T B D
Dan 1 1 1 0 0
Joe 0 0 0 1 1
Thanks
The following will rearrange your data to set the columns specified in the vector x at the beginning, and the remaining columns in their original order afterwards.
x <- c("C", "A", "T")
mydata <- mydata[, c(x, setdiff(names(mydata), x))]
If the names column should stay in the first position and is not specified within x, use (thanks to @StevenBeaupré for pointing this out and providing the code):
mydata <- mydata[, c(names(mydata)[1], x, setdiff(names(mydata)[-1], x))]
Small data example:
mydata <- data.frame(names = c("Dan", "Joe"), A = c(1, 0), B = c(0,1),
C = c(1, 0), D = c(0,1), T = c(1, 0))
> mydata
names A B C D T
1 Dan 1 0 1 0 1
2 Joe 0 1 0 1 0
mydata <- mydata[, c(names(mydata)[1], x, setdiff(names(mydata)[-1], x))]
> mydata
names C A T B D
1 Dan 1 1 1 0 0
2 Joe 0 0 0 1 1
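If the vector x might contain names that are not columns of the data frame, indexing with it directly fails with an "undefined columns selected" error. A slightly more defensive sketch, assuming that such unknown names should simply be ignored (the entry "Q" and the names front, rest and reordered are made up for illustration):

```r
mydata <- data.frame(names = c("Dan", "Joe"), A = c(1, 0), B = c(0, 1),
                     C = c(1, 0), D = c(0, 1), T = c(1, 0))

x <- c("C", "A", "Q", "T")  # "Q" is not a column of mydata

# Keep only the requested columns that actually exist, in the requested order
front <- intersect(x, names(mydata))

# Everything else, in its original order, excluding the id column
rest <- setdiff(names(mydata), c("names", front))

reordered <- mydata[, c("names", front, rest)]
names(reordered)
```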

Generate pairwise movement data from sequence

I have a sequence which looks like this
SEQENCE
1 A
2 B
3 B
4 C
5 A
Now from this sequence, I want to get a matrix like this, where the element in the ith row and jth column denotes how many times a movement occurred from the ith row node to the jth column node:
A B C
A 0 1 0
B 0 1 1
C 1 0 0
How can I get this in R?
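1) Use table like this. Note that here the rows are the destination node and the columns the origin node, i.e. the transpose of the matrix in the question; swap the two arguments (or apply t) to get the question's orientation: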
s <- DF[, 1]
table(tail(s, -1), head(s, -1))
giving:
A B C
A 0 0 1
B 1 1 0
C 0 1 0
2) or like this. Since embed does not work with factors, we convert the factor to character:
s <- as.character(DF[, 1])
do.call(table, data.frame(embed(s, 2)))
giving:
X2
X1 A B C
A 0 0 1
B 1 1 0
C 0 1 0
3) xtabs also works:
s <- as.character(DF[, 1])
xtabs(data = data.frame(embed(s, 2)))
giving:
X2
X1 A B C
A 0 0 1
B 1 1 0
C 0 1 0
Note: The input DF in reproducible form is:
Lines <- " SEQENCE
1 A
2 B
3 B
4 C
5 A"
DF <- read.table(text = Lines, header = TRUE)
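The tables in the answers above put the destination node in the rows and the origin node in the columns. To get the orientation asked for in the question (row = from, column = to), the two arguments can be swapped. The sketch below also fixes the factor levels so that the result stays square even if some node never appears as an origin or as a destination; the names nodes, from, to and moves are made up for illustration.

```r
s <- c("A", "B", "B", "C", "A")

# All nodes, used as common levels for both dimensions
nodes <- sort(unique(s))

from <- factor(head(s, -1), levels = nodes)  # every element except the last
to   <- factor(tail(s, -1), levels = nodes)  # every element except the first

# Rows are the origin node, columns the destination node
moves <- table(from = from, to = to)
moves
```

With this sequence the four movements are A to B, B to B, B to C and C to A, which matches the matrix given in the question.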
