Transform event list dataframe into adjacency dataframe - r

I have a df in which every column represents an event and the cells contain the individuals, like this:
df=data.frame(topic1=c("a", "b","c", "d"), topic2=c("e","f", "g", "a"), topic3=c("b","c","g","h"))
I need to transform it into an adjacency df, like this:
  topic1 topic2 topic3
a      1      1      0
b      1      0      1
c      1      0      1
d      1      0      0
e      0      1      0
f      0      1      0
g      0      1      1
h      0      0      1
THX!

Form levs containing the levels in sorted order, and then for each column of df determine which of the levs are in it. This gives a logical matrix, which we can convert to numeric using +.
levs <- sort(unique(unlist(df))) # a b c d e f g h
+ sapply(df, function(x) levs %in% x)
giving:
     topic1 topic2 topic3
[1,]      1      1      0
[2,]      1      0      1
[3,]      1      0      1
[4,]      1      0      0
[5,]      0      1      0
[6,]      0      1      0
[7,]      0      1      1
[8,]      0      0      1
The last line could be written even more compactly as:
+ sapply(df, `%in%`, x = levs)
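Putting the pieces together, a minimal end-to-end sketch (assuming the df from the question; labelling the rows and converting to a data frame are additions here, since the question asked for a df):
levs <- sort(unique(unlist(df)))
adj <- +sapply(df, function(x) levs %in% x)   # logical matrix coerced to 0/1
rownames(adj) <- levs                         # label rows with the individuals
adj_df <- as.data.frame(adj)                  # convert to a data frame if needed
adj_df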

Related

How to merge two dataframes and keep only different columns (content)?

I have two data frames with the same number of rows and different numbers of columns; the column names are also different, but the content may be the same in some of them.
E.g. df1:
df1 <- data.frame("a" = c("0","1","0","1","0","0","0"),
                  "b" = c("1","1","1","1","1","0","0"),
                  "c" = c("1","1","0","0","1","0","0"),
                  "d" = c("1","1","1","1","1","1","1"))
df2:
df2 <- data.frame("e" = c("1","1","0","1","0","0","0"),
                  "f" = c("1","1","1","1","1","0","0"),
                  "g" = c("0","0","0","0","1","0","0"),
                  "h" = c("0","0","0","0","1","1","1"))
As you can see, column "b" of df1 and column "f" of df2 are equal. Therefore, the result I want is a new dataframe looking like this:
df3 <- data.frame("a" = c("0","1","0","1","0","0","0"),
                  "c" = c("1","1","0","0","1","0","0"),
                  "d" = c("1","1","1","1","1","1","1"),
                  "e" = c("1","1","0","1","0","0","0"),
                  "g" = c("0","0","0","0","1","0","0"),
                  "h" = c("0","0","0","0","1","1","1"))
NOTE: columns "b" and "f" (which were identical) are not in the new df3.
I have looked on the web but did not find an example of this. I think the main difficulty is that the merge is by content, not by column name.
This would do the job:
df3 <- cbind(df1, df2)
# transpose so columns become rows, drop every row that occurs more than once
# (both copies of each duplicate), then transpose back
df3 <- t(t(df3)[!(duplicated(t(df3)) | duplicated(t(df3), fromLast = TRUE)), ])
df3
# a c d e g h
#1 0 1 1 1 0 0
#2 1 1 1 1 0 0
#3 0 0 1 0 0 0
#4 1 0 1 1 0 0
#5 0 1 1 0 1 1
#6 0 0 1 0 0 1
#7 0 0 1 0 0 1
This will give you a matrix; you can convert the result back to a data frame if desired (see the sketch below).
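A small sketch of that conversion (the columns stay character, since the inputs were quoted strings):
df3 <- as.data.frame(df3, stringsAsFactors = FALSE)
str(df3)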
We can use sapply to check for the columns that perfectly match.
mat <- sapply(df1, function(x) sapply(df2, function(y) all(x == y)))
mat
# a b c d
#e FALSE FALSE FALSE FALSE
#f FALSE TRUE FALSE FALSE
#g FALSE FALSE FALSE FALSE
#h FALSE FALSE FALSE FALSE
Here we can see that column b from df1 and column f from df2 should be removed. We can do this by:
m2 <- which(mat, arr.ind = TRUE)
cbind(df1[-m2[, 2]], df2[-m2[, 1]])
# a c d e g h
#1 0 1 1 1 0 0
#2 1 1 1 1 0 0
#3 0 0 1 0 0 0
#4 1 0 1 1 0 0
#5 0 1 1 0 1 1
#6 0 0 1 0 0 1
#7 0 0 1 0 0 1
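One edge case worth noting (an addition, not part of this answer): if no columns match, m2 has zero rows, m2[, 1] and m2[, 2] are empty, and the indexing above would then drop every column. A hedged guard could be:
# guard against the no-match case, where m2[, 1] / m2[, 2] are empty
if (nrow(m2) == 0) {
  df3 <- cbind(df1, df2)                       # nothing to drop
} else {
  df3 <- cbind(df1[-m2[, 2]], df2[-m2[, 1]])
}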
Here is a more tidyverse solution.
library(dplyr)
library(tidyr)
library(tibble)   # rownames_to_column() comes from tibble

# based on Ronak's sapply approach
matches <- as.data.frame(sapply(df1, function(x) sapply(df2, function(y) identical(x, y)))) %>%
  rownames_to_column(var = "df2") %>%
  pivot_longer(-df2, names_to = "df1") %>%   # pivot longer
  filter(value)                              # keep only the matches

# programmatically build the list of names to remove
vars_remove <- c(matches$df1, matches$df2)   # the matched column names

df1 %>%
  bind_cols(df2) %>%
  select(-any_of(vars_remove))
a c d e g h
1 0 1 1 1 0 0
2 1 1 1 1 0 0
3 0 0 1 0 0 0
4 1 0 1 1 0 0
5 0 1 1 0 1 1
6 0 0 1 0 0 1
7 0 0 1 0 0 1
We can use outer from base R
mat <- outer(df1, df2, FUN = Vectorize(function(x, y) all(x == y)))
mat
# e f g h
#a FALSE FALSE FALSE FALSE
#b FALSE TRUE FALSE FALSE
#c FALSE FALSE FALSE FALSE
#d FALSE FALSE FALSE FALSE
Now, we can get the row/column names
m2 <- as.matrix(subset(as.data.frame.table(mat), Freq, select = -Freq))
Now we use 'm2' to remove the matched columns from 'df1' and 'df2', and cbind what remains:
cbind(df1[setdiff(names(df1), m2[,1])], df2[setdiff(names(df2), m2[,2])])
# a c d e g h
#1 0 1 1 1 0 0
#2 1 1 1 1 0 0
#3 0 0 1 0 0 0
#4 1 0 1 1 0 0
#5 0 1 1 0 1 1
#6 0 0 1 0 0 1
#7 0 0 1 0 0 1

unordered combination and store the result in a matrix in r

Say I have a list (a, b, c). I want to find all the possible combinations of them and store the result in a matrix like:
     a b c
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
[4,] 1 1 0
[5,] 1 0 1
[6,] 0 1 1
[7,] 1 1 1
I don't know how to do this. Thanks for the help!
To do exactly what you want, use permutations in the gtools package. This works as follows:
library(gtools)
m <- permutations(2, 3, v = c(0, 1), repeats.allowed = TRUE)
colnames(m) <- c('a', 'b', 'c')
# delete [0,0,0]
m <- m[-1, ]
Yields:
a b c
[1,] 0 0 1
[2,] 0 1 0
[3,] 0 1 1
[4,] 1 0 0
[5,] 1 0 1
[6,] 1 1 0
[7,] 1 1 1
The idea was taken from the comment section under this question:
Generate all combinations of length 2 using 3 letters
My adaptation is not very elegant... but it seems to do the job.
output <- expand.grid(rep(list(c('a', 'b', 'c')), 3))
colnames(output) <- c('a', 'b', 'c')
for (col in colnames(output)) {
  output[, col] <- as.character(output[, col])
  output[, col] <- ifelse(output[, col] == col, 1, 0)
}
output <- output[!duplicated(output), ]
rownames(output) <- NULL
print(output)
# a b c
# 1 1 0 0
# 2 0 0 0
# 3 1 1 0
# 4 0 1 0
# 5 1 0 1
# 6 0 0 1
# 7 1 1 1
# 8 0 1 1
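As a side note (not from either answer above): since the goal is just every 0/1 pattern over three named columns, expand.grid can also build them directly, which avoids the deduplication step. A short sketch:
output <- expand.grid(a = 0:1, b = 0:1, c = 0:1)
output <- output[rowSums(output) > 0, ]   # drop the all-zero row
output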

Transform ids -> items to {pairs of ids} -> items

I have a data.frame like this:
x1 <- data.frame(id=1:3,item=c("A","B","A","B","C","D"))
x1[order(x1$item),]
id item
1 1 A
3 3 A
2 2 B
4 1 B
5 2 C
6 3 D
I want to get:
id1 <- c(1,2,1,3,2,3)
id2 <- c(2,1,3,1,3,2)
A <- c(0,0,1,1,0,0)
B <- c(1,1,0,0,0,0)
C <- 0
D <- 0
datawanted <- data.frame(id1, id2, A, B, C, D)
id1 id2 A B C D
1 1 2 0 1 0 0
2 2 1 0 1 0 0
3 1 3 1 0 0 0
4 3 1 1 0 0 0
5 2 3 0 0 0 0
6 3 2 0 0 0 0
If person id1 and person id2 both have item B, then column B in datawanted gets 1, otherwise 0 (and likewise for the other items).
Can someone give me some suggestions or functions in R to deal with this problem?
Cool question. You have a bipartite graph, so following Gabor's tutorial...
library(igraph)
g = graph_from_edgelist(as.matrix(x1))
V(g)$type = grepl("[A-Z]", V(g)$name)
For OP's desired output, first we can extract the incidence matrix:
gi = get.incidence(g)
# A B C D
# 1 1 1 0 0
# 2 0 1 1 0
# 3 1 0 0 1
Note (thanks @thelatemail) that if you don't want to use igraph, you can get gi as table(x1).
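For instance, a quick sketch of that route (an addition here; a separate name is used so the igraph gi above is left untouched):
gi2 <- table(x1)   # id x item counts; same 0/1 pattern as get.incidence(g)
gi2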
Then, we look at the combinations of ids:
res = t(combn(nrow(gi), 2, function(x) c(
as.integer(rownames(gi)[x]),
pmin( gi[x[1], ], gi[x[2], ] )
)))
dimnames(res) <- list( NULL, c("id1", "id2", colnames(gi)))
# id1 id2 A B C D
# [1,] 1 2 0 1 0 0
# [2,] 1 3 1 0 0 0
# [3,] 2 3 0 0 0 0
This is essentially the OP's desired output; the OP had also included the redundant reversed rows (e.g., both 1,2 and 2,1).
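If those redundant ordered pairs are actually wanted (as in the question's datawanted), one hedged way to add them back is to stack res with its id columns swapped; rbind binds matrices by position, so the swapped values land under id1/id2 (row order differs from the example, but the content matches):
res_both <- rbind(res, res[, c(2, 1, 3:ncol(res))])
datawanted <- as.data.frame(res_both)
datawanted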
Fun reason to use a graph (ht Chris):
V(g)$color <- ifelse(V(g)$type, "red", "light blue")
V(g)$x <- (1:2)[ V(g)$type + 1 ]
V(g)$y <- ave(seq_along(V(g)), V(g)$type, FUN = seq_along)
plot(g)
Or, apparently this can be done more or less like
plot(g, layout = layout.bipartite(g)[,2:1])

Transforming dataframe into expanded matrix in r

Say I have the following dataframe:
dfx <- data.frame(Var1 = c("A", "B", "C", "D", "B", "C", "D", "C", "D", "D"),
                  Var2 = c("E", "E", "E", "E", "A", "A", "A", "B", "B", "C"),
                  Var1out = c(1, -1, -1, -1, 1, -1, -1, 1, -1, -1),
                  Var2out = c(-1, 1, 1, 1, -1, 1, 1, -1, 1, 1))
dfx
Var1 Var2 Var1out Var2out
1 A E 1 -1
2 B E -1 1
3 C E -1 1
4 D E -1 1
5 B A 1 -1
6 C A -1 1
7 D A -1 1
8 C B 1 -1
9 D B -1 1
10 D C -1 1
What you see here are 10 rows that correspond to match-ups between players A, B, C, D and E. They play each other once; the winner of each match-up is denoted by +1 and the loser by -1 (the result for the player in Var1 goes in Var1out, and the result for the player in Var2 goes in Var2out).
Desired output.
I wish to transform this dataframe into this output matrix (the order of rows is not important to me, but as you can see each row refers to a unique match-up):
A B C D E
1 1 0 0 0 -1
2 0 -1 0 0 1
3 0 0 -1 0 1
4 0 0 0 -1 1
5 -1 1 0 0 0
6 1 0 -1 0 0
7 1 0 0 -1 0
8 0 -1 1 0 0
9 0 1 0 -1 0
10 0 0 1 -1 0
What I've done:
I managed to make this matrix in a roundabout way. As roundabout ways tend to be slow and less satisfactory, I was wondering if anyone can spot a better way.
I first made sure that my two columns containing players had factor levels that contained every possible player that ever occurs (you'll note for instance that player E never occurs in Var1).
# Making sure Var1 and Var2 have same factor levels
levs <- unique(c(levels(dfx$Var1), levels(dfx$Var2))) #get all possible levels of factors
dfx$Var1 <- factor(dfx$Var1, levels=levs)
dfx$Var2 <- factor(dfx$Var2, levels=levs)
I next split the dataframe into two - one for Var1 and Var1out, and one for Var2 and Var2out:
library(dplyr)
temp.Var1 <- dfx %>% select(Var1, Var1out)
temp.Var2 <- dfx %>% select(Var2, Var2out)
Here I use model.matrix to expand columns by factor level:
mat.Var1<-with(temp.Var1, data.frame(model.matrix(~Var1+0)))
mat.Var2<-with(temp.Var2, data.frame(model.matrix(~Var2+0)))
I then replace, for each row, the '1' that indicates the presence of that player with the corresponding result, and add the two matrices:
mat1 <- apply(mat.Var1, 2, function(x) ifelse(x == 1, temp.Var1$Var1out, 0))
mat2 <- apply(mat.Var2, 2, function(x) ifelse(x == 1, temp.Var2$Var2out, 0))
matX <- mat1 + mat2
matX
Var1A Var1B Var1C Var1D Var1E
1 1 0 0 0 -1
2 0 -1 0 0 1
3 0 0 -1 0 1
4 0 0 0 -1 1
5 -1 1 0 0 0
6 1 0 -1 0 0
7 1 0 0 -1 0
8 0 -1 1 0 0
9 0 1 0 -1 0
10 0 0 1 -1 0
Although this works, I have a sense that I am probably missing simpler solutions for this problem. Thanks.
Create an empty matrix and use matrix indexing to fill the relevant values in:
cols <- unique(unlist(dfx[1:2]))
M <- matrix(0, nrow = nrow(dfx), ncol = length(cols), dimnames = list(NULL, cols))
M[cbind(sequence(nrow(dfx)), match(dfx$Var1, cols))] <- dfx$Var1out
M[cbind(sequence(nrow(dfx)), match(dfx$Var2, cols))] <- dfx$Var2out
M
# A B C D E
# [1,] 1 0 0 0 -1
# [2,] 0 -1 0 0 1
# [3,] 0 0 -1 0 1
# [4,] 0 0 0 -1 1
# [5,] -1 1 0 0 0
# [6,] 1 0 -1 0 0
# [7,] 1 0 0 -1 0
# [8,] 0 -1 1 0 0
# [9,] 0 1 0 -1 0
# [10,] 0 0 1 -1 0
Another way is to use acast
library(reshape2)
#added `use.names=FALSE` from #Ananda Mahto's comments
dfy <- data.frame(Var = unlist(dfx[, 1:2], use.names = FALSE),
                  VarOut = unlist(dfx[, 3:4], use.names = FALSE),
                  indx = 1:nrow(dfx))
acast(dfy, indx~Var, value.var="VarOut", fill=0)
# A B C D E
#1 1 0 0 0 -1
#2 0 -1 0 0 1
#3 0 0 -1 0 1
#4 0 0 0 -1 1
#5 -1 1 0 0 0
#6 1 0 -1 0 0
#7 1 0 0 -1 0
#8 0 -1 1 0 0
#9 0 1 0 -1 0
#10 0 0 1 -1 0
Or use spread
library(tidyr)
spread(dfy,Var, VarOut , fill=0)[,-1]
# A B C D E
#1 1 0 0 0 -1
#2 0 -1 0 0 1
#3 0 0 -1 0 1
#4 0 0 0 -1 1
#5 -1 1 0 0 0
#6 1 0 -1 0 0
#7 1 0 0 -1 0
#8 0 -1 1 0 0
#9 0 1 0 -1 0
#10 0 0 1 -1 0
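As an aside (not part of the original answer): spread() has since been superseded in tidyr by pivot_wider(), so a rough modern equivalent, assuming tidyr >= 1.1, might be:
pivot_wider(dfy, id_cols = indx, names_from = Var,
            values_from = VarOut, values_fill = 0)[, -1]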

Binning and Naming New Columns with Mean of Binned Columns

This has probably been asked already, but I could not find it. I have a data set where the column names are numbers and the row names are sample names (see below).
"599.773" "599.781" "599.789" "599.797" "599.804" "599.812" "599.82" "599.828"
"A" 0 0 0 0 0 2 1 4
"B" 0 0 0 0 0 1 0 3
"C" 0 0 0 0 2 1 0 1
"D" 3 0 0 0 3 1 0 0
I want to bin the columns, say every 4 columns, by summation, and then name the new columns with the mean of the binned columns. For the above table I would end up with:
"599.785" "599.816"
"A" 0 7
"B" 0 4
"C" 0 4
"D" 3 4
The new column names, 599.785 and 599.816, are the averages of the column names that were binned. I think something like cut would work for a vector of numbers, but I am not sure how to implement it for large data frames. Thanks for any help!
colnames <- c("599.773", "599.781", "599.789", "599.797",
              "599.804", "599.812", "599.82", "599.828")
mat <- matrix(scan(), nrow=4, byrow=TRUE)
0 0 0 0 0 2 1 4
0 0 0 0 0 1 0 3
0 0 0 0 2 1 0 1
3 0 0 0 3 1 0 0
colnames(mat)=colnames
rownames(mat) = LETTERS[1:4]
sRows <- function(mat, cols) rowSums(mat[, cols])
nbins <- dim(mat)[2]/4   # number of 4-column bins
sapply(1:nbins, function(base) sRows(mat, (4*(base - 1) + 1):(4*base)))
  [,1] [,2]
A    0    7
B    0    4
C    0    4
D    3    4
accum <- sapply(1:nbins, function(base)
                  sRows(mat, (4*(base - 1) + 1):(4*base)))
colnames(accum) <- sapply(1:nbins,
                          function(base)
                            mean(as.numeric(colnames(mat)[(4*(base - 1) + 1):(4*base)])))
accum
#-------
  599.785 599.816
A       0       7
B       0       4
C       0       4
D       3       4
First of all, using numeric values as column names is not good/standard practice.
Even so, here is a solution that gives the OP's desired output.
## read data without checking names
dt <- read.table(text='
"599.773" "599.781" "599.789" "599.797" "599.804" "599.812" "599.82" "599.828"
"A" 0 0 0 0 0 2 1 4
"B" 0 0 0 0 0 1 0 3
"C" 0 0 0 0 2 1 0 1
"D" 3 0 0 0 3 1 0 0',header=TRUE, check.names =FALSE)
cols <- as.numeric(colnames(dt))
## create a factor to group the columns
ff <- rep(c(TRUE, FALSE), each = length(cols)/2)
## use tapply to sum each group of columns row-wise
vals <- do.call(cbind, tapply(cols, ff, function(x) rowSums(dt[, paste0(x)])))
nn <- tapply(cols, ff, mean)
## name the columns with the bin means
colnames(vals) <- nn[colnames(vals)]
vals
599.816 599.785
A 7 0
B 4 0
C 4 0
D 4 3
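The ff above hard-codes two groups, which only covers the 8-column example. A hedged generalization that sums every 4 columns (assuming the number of columns is a multiple of 4) could be:
bin_size <- 4
grp <- rep(seq_len(ncol(dt) / bin_size), each = bin_size)   # 1,1,1,1,2,2,2,2,...
vals <- sapply(split(seq_along(cols), grp), function(idx) rowSums(dt[, idx]))
colnames(vals) <- sapply(split(cols, grp), mean)
vals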
