R Dummy-variable to be populated from multiple columns [duplicate]

I am a beginner in R and want to create dummy variables for a dataset.
I have a data set with a few columns, like below:
Dataset1
T1 T2 T3
A C B
A C B
A C B
A D C
B D C
B E F
I want to add dummy variables named dummy,A; dummy,B; dummy,C and so on, and set each one to 1 if its letter is present in any of T1, T2 or T3, else 0.
So the final data set should look like this:
T1 T2 T3 dummy,A dummy,B dummy,C dummy,D dummy,E dummy,F
A C B 1 1 1 0 0 0
A C B 1 1 1 0 0 0
A C B 1 1 1 0 0 0
A D C 1 0 1 1 0 0
B D C 0 1 1 1 0 0
B E F 0 1 0 0 1 1
So can anyone please suggest how I can achieve this?
Any help in this regard is really appreciated. Thanks!

We can use mtabulate from qdapTools. Transpose the 'Dataset1', convert it to data.frame, apply the mtabulate, change its column names (if needed) and cbind with the original 'Dataset1'
library(qdapTools)
d1 <- mtabulate(as.data.frame(t(Dataset1)))
row.names(d1) <- NULL
names(d1) <- paste0("dummy.", names(d1))
cbind(Dataset1, d1)
# T1 T2 T3 dummy.A dummy.B dummy.C dummy.D dummy.E dummy.F
#1 A C B 1 1 1 0 0 0
#2 A C B 1 1 1 0 0 0
#3 A C B 1 1 1 0 0 0
#4 A D C 1 0 1 1 0 0
#5 B D C 0 1 1 1 0 0
#6 B E F 0 1 0 0 1 1
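If installing qdapTools is not an option, the same idea can be sketched in base R. This is only a minimal sketch, assuming the three columns are character (or factor) vectors. The names lev, d1 and res are illustrative, and Dataset1 is rebuilt here from the question's table. Unlike a tabulation, this version yields a strict 0/1 indicator even if a letter occurs twice in one row.

```r
# Example data rebuilt from the question (assumed to be character columns)
Dataset1 <- data.frame(
  T1 = c("A", "A", "A", "A", "B", "B"),
  T2 = c("C", "C", "C", "D", "D", "E"),
  T3 = c("B", "B", "B", "C", "C", "F"),
  stringsAsFactors = FALSE
)

# All distinct values that occur in any of the columns
lev <- sort(unique(unlist(Dataset1)))

# For each value: is it present anywhere in the row? (+ turns TRUE/FALSE into 1/0)
d1 <- +sapply(lev, function(l) rowSums(Dataset1 == l) > 0)
colnames(d1) <- paste0("dummy.", lev)

res <- cbind(Dataset1, d1)
res
```
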

Related

Create variables with ones and zeros from a variable options in r [duplicate]

I'm trying to create new variables from the values of one variable in my dataframe. This is my initial dataframe:
d1 <- data.frame("id" = c(1,1,2,2,3,4,5), "type" = c("A","B","C","C","A","B","C"))
id type
1 1 A
2 1 B
3 2 C
4 2 C
5 3 A
6 4 B
7 5 C
So, I would like to create new variables depending on the value of "type" for each id. I would like to get this kind of dataframe:
d2 <- data.frame("id" = c(1,1,2,2,3,4,5), "type" = c("A","B","C","C","A","B","C"),
"type.A" = c(1,0,0,0,1,0,0), "type.B" = c(0,1,0,0,0,1,0),
"type.C" = c(0,0,1,1,0,0,1))
id type type.A type.B type.C
1 1 A 1 0 0
2 1 B 0 1 0
3 2 C 0 0 1
4 2 C 0 0 1
5 3 A 1 0 0
6 4 B 0 1 0
7 5 C 0 0 1
The idea is to put 1 in the new variable (type.A in this case) if the "type" for a specific "id" equals A, and 0 otherwise. Since this is a common problem in big data analysis (I think), I would like to know if there is a function that solves it.
cbind(d1, setNames(data.frame(+sapply(unique(d1$type), function(x)
d1$type == x)), unique(d1$type)))
# id type A B C
#1 1 A 1 0 0
#2 1 B 0 1 0
#3 2 C 0 0 1
#4 2 C 0 0 1
#5 3 A 1 0 0
#6 4 B 0 1 0
#7 5 C 0 0 1
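Another option worth considering is model.matrix from base R's stats package, which builds indicator columns from a factor. The sketch below is an illustration rather than the answerer's method. It drops the intercept with "- 1" so that every level gets its own column, and the sub() call only renames the columns to match the "type.A" style asked for in the question.

```r
d1 <- data.frame(id = c(1, 1, 2, 2, 3, 4, 5),
                 type = c("A", "B", "C", "C", "A", "B", "C"))

# One indicator column per level of 'type'; "- 1" removes the intercept
mm <- model.matrix(~ type - 1, data = d1)

# model.matrix names the columns typeA, typeB, ...; insert a dot to get type.A, ...
colnames(mm) <- sub("^type", "type.", colnames(mm))

d2 <- cbind(d1, mm)
d2
```
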

Filter multiple columns based on same criteria in R

I have a dataframe with multiple columns (more than 30) whose names are saved in a list. I would like to apply the same criterion to all of those columns without writing the condition for each column separately. The example below should help explain the problem.
A<-c("A","B","C","D","E","F","G","H","I")
B<-c(0,0,0,1,2,3,0,0,0)
C<-c(0,1,0,0,1,2,0,0,0)
D<-c(0,0,0,0,1,1,0,1,0)
E<-c(0,0,0,0,0,0,0,1,0)
data<-data.frame(A,B,C,D,E)
Let's say I have the above df as an example and I have saved the list of cols as below
list <- c("B","C","D","E")
I would like to use those cols with the same criteria as below
setDT(data)[B>=1 | C>=1 | D>=1 | E>=1]
And get the following result
A B C D E
1: B 0 1 0 0
2: D 1 0 0 0
3: E 2 1 1 0
4: F 3 2 1 0
5: H 0 0 1 1
However, is there a way to get the above answer without writing each individual column criterion (e.g. B>=1 | C>=1 ....), since I have more than 30 cols in the actual data? Thanks a lot
For your specific example of checking whether at least one value in a row is at least 1, you could use rowSums. This works here because the values are non-negative integers, so a positive row sum means at least one value is >= 1.
data[rowSums(data[,-1]) > 0, ]
# A B C D E
# 2 B 0 1 0 0
# 4 D 1 0 0 0
# 5 E 2 1 1 0
# 6 F 3 2 1 0
# 8 H 0 0 1 1
If you have other criteria in mind, you might as well consider using any within apply
ind <- apply(data[,-1], 1, function(x) {any(x >= 1)})
data[ind,]
# A B C D E
# 2 B 0 1 0 0
# 4 D 1 0 0 0
# 5 E 2 1 1 0
# 6 F 3 2 1 0
# 8 H 0 0 1 1
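The two approaches above select "every column except the first". When the column names are already stored in a character vector, as in the question, the criterion can be applied to exactly those columns. The following is a minimal base R sketch. The vector is called cols here (rather than list, to avoid masking the base function), which is an assumption of this example.

```r
data <- data.frame(
  A = c("A", "B", "C", "D", "E", "F", "G", "H", "I"),
  B = c(0, 0, 0, 1, 2, 3, 0, 0, 0),
  C = c(0, 1, 0, 0, 1, 2, 0, 0, 0),
  D = c(0, 0, 0, 0, 1, 1, 0, 1, 0),
  E = c(0, 0, 0, 0, 0, 0, 0, 1, 0)
)
cols <- c("B", "C", "D", "E")   # stored column names (hypothetical variable name)

# data[cols] >= 1 is a logical matrix; a row is kept if it has at least one TRUE
hit <- rowSums(data[cols] >= 1) > 0
res <- data[hit, ]
res
```

Because the comparison is done before summing, this variant does not rely on the values being non-negative.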
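dplyr::filter_at will do just that. (In dplyr 1.0.4 and later, filter_at is superseded; the equivalent call is filter(if_any(-A, ~ .x >= 1)).)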
library(dplyr)
data %>% filter_at(vars(-A),any_vars(.>=1))
# A B C D E
# 1 B 0 1 0 0
# 2 D 1 0 0 0
# 3 E 2 1 1 0
# 4 F 3 2 1 0
# 5 H 0 0 1 1
You could always use Reduce, this is nice because you can put any type of logic you want into the function:
A simple method might be:
data[Reduce("|", as.data.frame(data[,list] >= 1)),]
# A B C D E
#2 B 0 1 0 0
#4 D 1 0 0 0
#5 E 2 1 1 0
#6 F 3 2 1 0
#8 H 0 0 1 1
A little explanation: Reduce successively applies the same function to each element of x. In this case the "|" operator is applied to each of the logical columns of the data.frame.
If you wanted to do more complicated logic checks you could do that with your own anonymous function.
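As an illustration of the anonymous-function variant, the sketch below folds over the column names and keeps rows where any listed column is at least a threshold. The names cols and threshold are hypothetical, and a threshold of 2 is chosen only so that the result differs from the earlier examples.

```r
data <- data.frame(
  A = c("A", "B", "C", "D", "E", "F", "G", "H", "I"),
  B = c(0, 0, 0, 1, 2, 3, 0, 0, 0),
  C = c(0, 1, 0, 0, 1, 2, 0, 0, 0),
  D = c(0, 0, 0, 0, 1, 1, 0, 1, 0),
  E = c(0, 0, 0, 0, 0, 0, 0, 1, 0)
)
cols <- c("B", "C", "D", "E")   # hypothetical name for the stored column list
threshold <- 2                  # hypothetical criterion for this example

# Start from all-FALSE and OR in the test for one column at a time
keep <- Reduce(
  function(acc, col) acc | data[[col]] >= threshold,
  cols,
  rep(FALSE, nrow(data))
)
res <- data[keep, ]
res
```

Any per-column test can replace the comparison inside the function, as long as it returns one logical value per row.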
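Please check this approach using apply in R.
B<-c(0,0,0,1,2,3,0,0,0)
C<-c(0,1,0,0,1,2,0,0,0)
D<-c(0,0,0,0,1,1,0,1,0)
ef=data.frame(B,C,D)
# logical matrix: TRUE where a value meets the criterion (>= 1, as in the question)
con=apply(ef,2,function(x) x>=1 )
# keep the rows where at least one column is TRUE
ef[rowSums(con)>0,]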

R: Update adjacency matrix/data frame using pairwise combinations

Question
Let's say I have this dataframe:
# mock data set
df.size = 10
cluster.id<- sample(c(1:5), df.size, replace = TRUE)
letters <- sample(LETTERS[1:5], df.size, replace = TRUE)
test.set <- data.frame(cluster.id, letters)
Will be something like:
cluster.id letters
<int> <fctr>
1 5 A
2 4 B
3 4 B
4 3 A
5 3 E
6 3 D
7 3 C
8 2 A
9 2 E
10 1 A
Now I want to group these per cluster.id and see what kind of letters I can find within a cluster, so for example cluster 3 contains the letters A,E,D,C. Then I want to get all unique pairwise combinations (but not combinations with itself so no A,A e.g.): A,E ; A,D, A,C etc. Then I want to update the pairwise distance for these combination in an adjacency matrix/data frame.
Idea
# group by cluster.id
# per group get all (unique) pairwise combinations for the letters (excluding pairwise combinations with itself, e.g. A,A)
# update adjacency for each pairwise combinations
What I tried
# empty adjacency df
possible <- LETTERS
adj.df <- data.frame(matrix(0, ncol = length(possible), nrow = length(possible)))
colnames(adj.df) <- rownames(adj.df) <- possible
# what I tried
update.adj <- function( data ) {
for (comb in combn(data$letters,2)) {
# stuck here
}
}
test.set %>% group_by(cluster.id) %>% update.adj(.)
Probably there is an easy way to do this because I see adjacency matrices all the time, but I'm not able to figure it out.. Please let me know if it's not clear
Answer to comment
Answer to @Manuel Bickel:
For the data I gave as example (the table under "will be something like"):
This matrix will be A-->Z for the full dataset, keep that in mind.
A B C D E
A 0 0 1 1 2
B 0 0 0 0 0
C 1 0 0 1 1
D 1 0 1 0 1
E 2 0 1 1 0
I will explain what I did:
cluster.id letters
<int> <fctr>
1 5 A
2 4 B
3 4 B
4 3 A
5 3 E
6 3 D
7 3 C
8 2 A
9 2 E
10 1 A
Only the clusters containing more than 1 unique letter are relevant (because we don't want combinations with itself, e.g. cluster 4 contains only the letter B, so it would result in the combination B,B and is therefore not relevant):
4 3 A
5 3 E
6 3 D
7 3 C
8 2 A
9 2 E
Now I look for each cluster what pairwise combinations I can make:
cluster 3:
A,E
A,D
A,C
E,D
E,C
D,C
Update these combination in the adjacency matrix:
A B C D E
A 0 0 1 1 1
B 0 0 0 0 0
C 1 0 0 1 1
D 1 0 1 0 1
E 1 0 1 1 0
Then go to the next cluster
cluster 2
A,E
Update the adjacency matrix again:
A B C D E
A 0 0 1 1 2 <-- note the 2 now
B 0 0 0 0 0
C 1 0 0 1 1
D 1 0 1 0 1
E 2 0 1 1 0
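The manual walk-through above can be sketched as a loop over clusters that uses combn on the unique letters of each cluster. This is only an illustrative sketch of the procedure described. The names adj, grp, u and pairs are assumptions, and the matrix is limited to A to E for the example data.

```r
test.set <- data.frame(
  cluster.id = c(5, 4, 4, 3, 3, 3, 3, 2, 2, 1),
  letters = c("A", "B", "B", "A", "E", "D", "C", "A", "E", "A"),
  stringsAsFactors = FALSE
)

possible <- LETTERS[1:5]
adj <- matrix(0, length(possible), length(possible),
              dimnames = list(possible, possible))

# One character vector of letters per cluster
for (grp in split(test.set$letters, test.set$cluster.id)) {
  u <- unique(grp)
  if (length(u) < 2) next          # a single letter cannot form a pair
  pairs <- combn(u, 2)             # 2 x n matrix, one pair per column
  for (k in seq_len(ncol(pairs))) {
    a <- pairs[1, k]
    b <- pairs[2, k]
    adj[a, b] <- adj[a, b] + 1     # update both halves to keep the matrix symmetric
    adj[b, a] <- adj[b, a] + 1
  }
}
adj
```
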
As reaction to the huge dataset
library(reshape2)
test.set <- read.table(text = "
cluster.id letters
1 5 A
2 4 B
3 4 B
4 3 A
5 3 E
6 3 D
7 3 C
8 2 A
9 2 E
10 1 A", header = T, stringsAsFactors = F)
x1 <- reshape2::dcast(test.set, cluster.id ~ letters)
x1
#cluster.id A B C D E
#1 1 1 0 0 0 0
#2 2 1 0 0 0 1
#3 3 1 0 1 1 1
#4 4 0 2 0 0 0
#5 5 1 0 0 0 0
x2 <- table(test.set)
x2
# letters
#cluster.id A B C D E
# 1 1 0 0 0 0
# 2 1 0 0 0 1
# 3 1 0 1 1 1
# 4 0 2 0 0 0
# 5 1 0 0 0 0
x1.c <- crossprod(x1)
#Error in crossprod(x, y) :
# requires numeric/complex matrix/vector arguments
x2.c <- crossprod(x2)
#works fine
Following above comment, here the code of Tyler Rinker used with your data. I hope this is what you want.
UPDATE: Following below comments, I added a solution using the package reshape2 in order to be able to handle larger amounts of data.
test.set <- read.table(text = "
cluster.id letters
1 5 A
2 4 B
3 4 B
4 3 A
5 3 E
6 3 D
7 3 C
8 2 A
9 2 E
10 1 A", header = T, stringsAsFactors = F)
x <- table(test.set)
x
# letters
#cluster.id A B C D E
# 1 1 0 0 0 0
# 2 1 0 0 0 1
# 3 1 0 1 1 1
# 4 0 2 0 0 0
# 5 1 0 0 0 0
#base approach, based on answer by Tyler Rinker
x <- crossprod(x)
diag(x) <- 0 #this is to set matches such as AA, BB, etc. to zero
x
# letters
# letters
# A B C D E
# A 0 0 1 1 2
# B 0 0 0 0 0
# C 1 0 0 1 1
# D 1 0 1 0 1
# E 2 0 1 1 0
#reshape2 approach
library(reshape2)
x <- acast(test.set, cluster.id ~ letters, fun.aggregate = length)
x <- crossprod(x)
diag(x) <- 0
x
# A B C D E
# A 0 0 1 1 2
# B 0 0 0 0 0
# C 1 0 0 1 1
# D 1 0 1 0 1
# E 2 0 1 1 0
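One caveat that may matter for larger data: crossprod on a table of counts multiplies the counts, so a letter that occurs more than once within a cluster is counted more than once per pair. If each pair should count once per cluster, as the question describes, the table can be reduced to presence/absence first. The sketch below assumes this reading of the question. It adds one hypothetical extra row (cluster 2, letter A) to the example data purely to make the difference visible.

```r
# Original example data plus one hypothetical extra row (cluster 2, letter A)
test.set <- data.frame(
  cluster.id = c(5, 4, 4, 3, 3, 3, 3, 2, 2, 1, 2),
  letters = c("A", "B", "B", "A", "E", "D", "C", "A", "E", "A", "A"),
  stringsAsFactors = FALSE
)

counts <- unclass(table(test.set$cluster.id, test.set$letters))

# Count-based version: the duplicated A in cluster 2 inflates the A/E cell
adj.counts <- crossprod(counts)
diag(adj.counts) <- 0

# Presence/absence version: every pair counts at most once per cluster
present <- (counts > 0) * 1
adj.unique <- crossprod(present)
diag(adj.unique) <- 0

adj.counts["A", "E"]
adj.unique["A", "E"]
```
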

A Pivot table in r with binary output [duplicate]

I have the following dataset
#dataset
id attributes value
1 a,b,c 1
2 c,d 0
3 b,e 1
I wish to make a pivot table out of them and assign binary values to the attribute (1 to the attributes if they exist otherwise assign 0 to them). My ideal output will be the following:
#output
id a b c d e Value
1 1 1 1 0 0 1
2 0 0 1 1 0 0
3 0 1 0 0 1 1
Any tip is really appreciated.
We split the 'attributes' column by ',', get the frequency with mtabulate from qdapTools and cbind with the first and third column.
library(qdapTools)
cbind(df1[1], mtabulate(strsplit(df1$attributes, ",")), df1[3])
# id a b c d e value
#1 1 1 1 1 0 0 1
#2 2 0 0 1 1 0 0
#3 3 0 1 0 0 1 1
With base R:
attributes <- sort(unique(unlist(strsplit(as.character(df$attributes), split=','))))
cols <- as.data.frame(matrix(rep(0, nrow(df)*length(attributes)), ncol=length(attributes)))
names(cols) <- attributes
df <- cbind.data.frame(df, cols)
df <- as.data.frame(t(apply(df, 1, function(x){attributes <- strsplit(x['attributes'], split=','); x[unlist(attributes)] <- 1;x})))[c('id', attributes, 'value')]
df
id a b c d e value
1 1 1 1 1 0 0 1
2 2 0 0 1 1 0 0
3 3 0 1 0 0 1 1
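A shorter base R variant of the same idea is sketched below, assuming the input is called df1 and the attributes column is a character vector (both rebuilt here from the question's table). The names parts, lev, bin and out are illustrative.

```r
df1 <- data.frame(id = 1:3,
                  attributes = c("a,b,c", "c,d", "b,e"),
                  value = c(1, 0, 1),
                  stringsAsFactors = FALSE)

# Split each string into its attributes
parts <- strsplit(df1$attributes, ",", fixed = TRUE)

# All distinct attributes, used as the column set
lev <- sort(unique(unlist(parts)))

# One row per record: 1 if the attribute is present, else 0
bin <- t(sapply(parts, function(p) as.integer(lev %in% p)))
colnames(bin) <- lev

out <- cbind(df1["id"], bin, df1["value"])
out
```

Note that this sketch assumes at least two distinct attributes overall; with a single attribute, sapply returns a vector and the transpose would need adjusting.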

transform dataframe to a matrix in R [duplicate]

I have this data frame :
var1 var2 var3
A B 1
B C 2
B A 3
D C 4
B D 5
I would like to transform it into a matrix and add a column and a row that sum the values associated with each variable, like this, using R code:
A B C D Total
A 0 1 0 0 1
B 3 0 2 5 10
C 0 0 0 0 0
D 0 0 4 0 4
T 3 1 6 5
Can you suggest me a way of doing it ?
Thanks a lot!!
nms <- sort(unique(c(as.character(df$var1),as.character(df$var2))));
m <- matrix(vector(typeof(df$var3),1L),length(nms),length(nms),dimnames=list(nms,nms));
m[cbind(as.character(df$var1),as.character(df$var2))] <- df$var3;
m;
## A B C D
## A 0 1 0 0
## B 3 0 2 5
## C 0 0 0 0
## D 0 0 4 0
The as.character() coercions can be omitted if the var1 and var2 input columns are already character vectors.
Data
df <- data.frame(var1=c('A','B','B','D','B'),var2=c('B','C','A','C','D'),var3=c(1L,2L,3L,4L,
5L));
Marginal totals can be added as follows:
m <- cbind(m,Total=rowSums(m));
m <- rbind(m,T=colSums(m));
m;
## A B C D Total
## A 0 1 0 0 1
## B 3 0 2 5 10
## C 0 0 0 0 0
## D 0 0 4 0 4
## T 3 1 6 5 15
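For comparison, a similar result can be sketched with xtabs and addmargins from base R's stats package. This is an alternative illustration, not the method used above. It assumes both columns should share one common set of levels so that the matrix is square, and the margin is labelled "Sum" (the addmargins default) rather than "Total" and "T".

```r
df <- data.frame(var1 = c("A", "B", "B", "D", "B"),
                 var2 = c("B", "C", "A", "C", "D"),
                 var3 = c(1L, 2L, 3L, 4L, 5L))

# Common level set so that rows and columns both cover A..D
nms <- sort(unique(c(as.character(df$var1), as.character(df$var2))))
df$var1 <- factor(as.character(df$var1), levels = nms)
df$var2 <- factor(as.character(df$var2), levels = nms)

# Sum var3 within each var1/var2 cell, then append row and column totals
tab <- xtabs(var3 ~ var1 + var2, data = df)
m <- addmargins(tab)
m
```
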
