Forming a co-occurrence matrix from a data frame - r

I have a data frame which looks something like this:
id val
1 a
1 b
2 a
2 c
2 d
3 a
3 a
Think of each row as a label, val, that was given to some observation with an id.
What I ultimately want to get to is a "co-occurrence" matrix that looks something like this, where I get a count of how many times each letter appears within the same id as each other letter:
a b c d
a 1 1 1 1
b 1 0 0 0
c 1 0 0 1
d 1 0 1 0
I've been wracking my brain looking for ways to do this, but have come up empty so far. Any hints? Preferably using tidyverse tools, but open to other options as well at this point.
EDIT: the solutions to the question linked as a possible duplicate do not work in this case. I'm not sure why, but I suspect it has to do with that question having a data frame with 3 columns.

Here's a solution in base R. Not quite elegant but seems to work
temp = data.frame(do.call(cbind, lapply(split(df, df$id), function(a)
combn(a$val, 2))), stringsAsFactors = FALSE)
sapply(sort(unique(df$val)), function(rows)
sapply(sort(unique(df$val)), function(cols)
sum(sapply(temp, function(x)
identical(sort(x), sort(c(rows, cols)))))))
# a b c d
#a 1 1 1 1
#b 1 0 0 0
#c 1 0 0 1
#d 1 0 1 0
OR with igraph
temp = t(do.call(cbind, lapply(split(df, df$id), function(a) combn(a$val, 2))))
library(igraph)
as.matrix(get.adjacency(graph(temp, directed = FALSE)))
# a c b d
#a 1 1 1 1
#c 1 0 0 1
#b 1 0 0 0
#d 1 1 0 0
DATA
df = structure(list(id = c(1L, 1L, 2L, 2L, 2L, 3L, 3L),
val = c("a", "b", "a", "c", "d", "a", "a")),
.Names = c("id", "val"),
class = "data.frame",
row.names = c(NA, -7L))

A solution with dplyr + purrr:
library(dplyr)
library(purrr)
df %>%
split(.$id) %>%
map_dfr(function(x){
t(combn(x$val, 2)) %>%
data.frame(stringsAsFactors = FALSE)
}) %>%
mutate_all(~ factor(.x, levels = c("a", "b", "c", "d"))) %>%  # funs() is deprecated; a formula lambda does the same job
table() %>%
pmax(., t(.))
Result:
X2
X1 a b c d
a 1 1 1 1
b 1 0 0 0
c 1 0 0 1
d 1 0 1 0
Notes:
I first split the df by id, then used map_dfr from purrr to map the combn function to each id group.
combn finds all combinations of elements within a vector (length(vec) choose 2) and returns a matrix.
_dfr at the end of map_dfr means that the result will be a dataframe by row binding each element of the list. So this is effectively do.call(rbind, lapply()).
mutate_all makes sure that table retains all the levels needed, even if a letter does not exist in a column.
Finally, since after the table step, an upper triangular matrix is produced, I fed that matrix and its transpose into pmax
pmax finds the parallel maxima from the two inputs and returns a symmetric matrix as desired.
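The last two notes can be illustrated in isolation. The following is a minimal base R sketch, assuming a single id whose values are a, c and d (the names lev, pairs, upper and sym are made up for this illustration): combn enumerates the pairs, table counts them into a one-sided matrix, and pmax with the transpose mirrors the counts.

```r
lev <- c("a", "b", "c", "d")

# All unordered pairs within one id: one row per pair
pairs <- t(combn(c("a", "c", "d"), 2))

# Fixing the factor levels keeps a row and a column for "b",
# even though "b" never occurs in this id
upper <- unclass(table(factor(pairs[, 1], levels = lev),
                       factor(pairs[, 2], levels = lev)))

# Each pair is recorded once (first element -> row), so mirror it
sym <- pmax(upper, t(upper))
sym
```

Without the fixed levels, table would drop the row and column for any letter that never appears in that position, and the matrix and its transpose would no longer line up.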
Data:
df = read.table(text= "id val
1 a
1 b
2 a
2 c
2 d
3 a
3 a", header = TRUE, stringsAsFactors = FALSE)
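For larger data, enumerating every pair with combn can get slow. As an alternative sketch (not one of the answers above; tab and out are names chosen here), the same counts can be obtained from the id-by-val contingency table: the cross product gives the off-diagonal pair counts, and the diagonal is replaced by the number of within-id pairs of the same letter, n(n-1)/2.

```r
df <- data.frame(id  = c(1L, 1L, 2L, 2L, 2L, 3L, 3L),
                 val = c("a", "b", "a", "c", "d", "a", "a"),
                 stringsAsFactors = FALSE)

# Rows are ids, columns are letters, cells count occurrences
tab <- unclass(table(df$id, df$val))

# Off-diagonal: sum over ids of n_x * n_y = number of (x, y) pairs
out <- crossprod(tab)

# Diagonal: pairs of the same letter within one id, i.e. n * (n - 1) / 2
diag(out) <- colSums(tab * (tab - 1) / 2)
out
```

On the sample data this reproduces the matrix asked for in the question, including the single a-a pair contributed by id 3.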

Related

apply function removing 0 counts from table() output of ordered factors [duplicate]

This question already has an answer here:
R: Why am I not getting type or class "factor" after converting columns to factor?
Assume the following data.frame with columns of ordered factors:
dat0 <- data.frame(X1 = 1:5, X2 = 1:5, X3 = c(1,1:4), X4 = c(2,2:5))
dat <- data.frame(lapply(dat0, factor, ordered=TRUE, levels=1:5, labels=letters[1:5]))
I want to create a nice looking table that compiles how many a:e are in each column of dat (including any 0 counts). The function table() is an obvious choice.
My "clean" attempt at making this table does not work. See below:
The table() function works as expected (i.e., includes all 5 factor choices -- even if one or more has a 0 count) when applied to individual columns:
table(dat[,1])
a b c d e
1 1 1 1 1
table(dat[,3])
a b c d e
2 1 1 1 0
# note: that a 0 is provided for any factor missing
However, when I try to use an apply() function on the data.frame to include all column counts into one table, I get wonky resulting formatting:
apply(dat, 2, table)
$X1
a b c d e
1 1 1 1 1
$X2
a b c d e
1 1 1 1 1
$X3
a b c d
2 1 1 1
$X4
b c d e
2 1 1 1
I can demonstrate the cause of the issue by only including columns of my data.frame that have at least 1 count for each factor that is similar between the columns. (i.e., I can get my desired formatting outcome by removing any column with a 0 count for any factor):
apply(dat[1:2], 2, table) # only including columns of dat with all 5 letters (i.e., no 0 counts)
X1 X2
a 1 1
b 1 1
c 1 1
d 1 1
e 1 1
Question: Is there a simple workaround/solution here when using table() or am I going to have to find a different approach?
Note: I know I could simply cbind() the individual table results, but that's very tedious in my actual more complex data set.
We may use table in sapply.
sapply(dat, table)
# X1 X2 X3 X4
# a 1 1 2 0
# b 1 1 1 2
# c 1 1 1 1
# d 1 1 1 1
# e 1 1 0 1
Or vapply, which is faster, but we need to know the nlevels (i.e., the length of each result) in advance.
vapply(dat, table, numeric(nlevels(unlist(dat))))
# X1 X2 X3 X4
# a 1 1 2 0
# b 1 1 1 2
# c 1 1 1 1
# d 1 1 1 1
# e 1 1 0 1
If we don't urgently need the row names, we may use tabulate.
sapply(dat, tabulate, nlevels(unlist(dat)))
# X1 X2 X3 X4
# [1,] 1 1 2 0
# [2,] 1 1 1 2
# [3,] 1 1 1 1
# [4,] 1 1 1 1
# [5,] 1 1 0 1
If we know the number of levels beforehand, we may simplify these to vapply(dat, table, numeric(5L)) and sapply(dat, tabulate, 5L), which also gives a gain in speed.
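To make the role of FUN.VALUE concrete, here is a small self-contained sketch on two made-up factor columns (small, k, by_table and by_tabulate are illustrative names). vapply checks that every result matches the template numeric(k), while tabulate takes the number of bins directly.

```r
small <- data.frame(
  X1 = factor(c("a", "b", "c", "d", "e"), levels = letters[1:5]),
  X3 = factor(c("a", "a", "b", "c", "d"), levels = letters[1:5])
)
k <- nlevels(small$X1)

# FUN.VALUE is a template: each column must yield k numeric counts
by_table <- vapply(small, function(col) as.numeric(table(col)), numeric(k))
rownames(by_table) <- levels(small$X1)

# tabulate() counts the integer codes; nbins keeps trailing zero counts
by_tabulate <- sapply(small, tabulate, nbins = k)

by_table
by_tabulate
```

If a column ever produced a result of a different length, vapply would stop with an error instead of silently returning a list, which is the main practical difference from sapply.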
Here comes the benchmark
set.seed(42)
DAT <- dat[sample(nrow(dat),1e5, replace=TRUE), ]
r <- matrix(, 5L, dim(DAT)[2])
microbenchmark::microbenchmark(
t(data.frame(do.call(rbind,lapply(DAT, table)))),
sapply(DAT, table),
vapply(DAT, table, numeric(5L)),
vapply(DAT, table, numeric(nlevels(unlist(dat)))),
sapply(DAT, tabulate, 5L),
sapply(DAT, tabulate, nlevels(unlist(dat))),
`for`={for (j in seq_along(DAT)) r[, j] <- tabulate(DAT[, j], 5L)}
)
Unit: microseconds
                                              expr      min         lq       mean     median         uq       max neval cld
 t(data.frame(do.call(rbind, lapply(DAT, table)))) 9960.629 10101.4820 11662.6014 10221.6970 14459.0215 17422.732   100   c
                                sapply(DAT, table) 9690.340  9822.2150 11721.6487  9934.2045 14128.6330 19107.070   100   c
                   vapply(DAT, table, numeric(5L)) 9630.185  9729.9155 11313.4803  9816.3260 14017.8180 22655.129   100   c
 vapply(DAT, table, numeric(nlevels(unlist(dat)))) 9753.252  9890.5700 11309.0461  9976.4840 14110.4775 17906.082   100   c
                         sapply(DAT, tabulate, 5L)  725.613   742.7820   778.6458   785.3595   807.1935   916.700   100 a
       sapply(DAT, tabulate, nlevels(unlist(dat)))  848.600   891.1135   936.7825   939.8245   967.2390  1114.601   100 a
                                               for 3580.538  3846.5700  4059.3048  3922.1300  3981.4300 19752.024   100  b
Data:
dat <- structure(list(X1 = structure(1:5, levels = c("a", "b", "c",
"d", "e"), class = c("ordered", "factor")), X2 = structure(1:5, levels = c("a",
"b", "c", "d", "e"), class = c("ordered", "factor")), X3 = structure(c(1L,
1L, 2L, 3L, 4L), levels = c("a", "b", "c", "d", "e"), class = c("ordered",
"factor")), X4 = structure(c(2L, 2L, 3L, 4L, 5L), levels = c("a",
"b", "c", "d", "e"), class = c("ordered", "factor"))), class = "data.frame", row.names = c(NA,
-5L))
Solution:
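Use lapply rather than apply, as explained in ZheyuanLi's linked answer and comment.
Summary: The problem with apply is that it first coerces the data frame to a character matrix (via as.matrix), so the factor levels are lost; table then re-factors each column from the values that are present, and unused levels are not preserved. lapply works on the columns themselves, so each factor keeps its levels, and it returns a list.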
Use a combination of data.frame, do.call, rbind, and t (transpose) to get the data into the desired data.frame format:
t(data.frame(do.call(rbind,lapply(dat, table))))
X1 X2 X3 X4
a 1 1 2 0
b 1 1 1 2
c 1 1 1 1
d 1 1 1 1
e 1 1 0 1
Or:
As ZheyuanLi pointed out, one can simply use sapply(dat, table).
Also thanks jay.sf for showing how vapply works.
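The coercion behind this can be checked directly. A short sketch, using the question's data and calling as.matrix by hand to mimic what apply does to a data frame (m and counts are names chosen here):

```r
dat0 <- data.frame(X1 = 1:5, X2 = 1:5, X3 = c(1, 1:4), X4 = c(2, 2:5))
dat <- data.frame(lapply(dat0, factor, ordered = TRUE,
                         levels = 1:5, labels = letters[1:5]))

# apply() converts a data frame with as.matrix() before looping
m <- as.matrix(dat)
is.character(m)            # the factor levels are gone at this point

# table() on the character column only sees the values present
length(table(m[, "X3"]))   # "e" is missing
length(table(dat$X3))      # the factor still carries all five levels

# sapply() loops over the columns themselves, so zero counts survive
counts <- sapply(dat, table)
counts
```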

Adjacency Matrix from a dataframe

I am trying to convert an edgelist to an adjacency matrix.
Below is the sample data
#Sample Data
User<-c("1","1","2","3","4")
v1 <- c("b", "b", "a", "d", "c")
v2 <- c("c", "d", "c", "a", "a")
v3 <- c(0, 0, "d", 0, "b")
v4 <- c(0, 0, 0, 0, 0)
v5 <- c(0, 0, 0, 0, 0)
my_data<-data.frame(User, v1, v2, v3, v4, v5)
my_data
If you run this code you will get the below as output,
User v1 v2 v3 v4 v5
1 b c 0 0 0
1 b d 0 0 0
2 a c d 0 0
3 d a 0 0 0
4 c a b 0 0
Using the data, I want to create an adjacency matrix that looks as follows:
a b c d
a 0 0 2 2
b 0 0 1 1
c 2 1 0 1
d 2 1 1 0
Basically, the desired output displays how many times each pair appeared together in columns v1 to v5 of the sample data frame.
I have tried the AdjacencyFromEdgelist function from the dils library, and I also tried creating a matrix shell filled with NAs and filling it in by looping through the data frame.
However, I could not get either approach to work.
I think this may be close to what you have in mind. In the rows where there are more than 2 vertices, I considered every existing pairs:
library(igraph)
do.call(rbind, my_data[-1] |>
apply(1, \(x) x[x != 0]) |>
lapply(\(x) t(combn(x, m = 2)))) |>
graph_from_edgelist(directed = FALSE) %>%
as_adjacency_matrix()
4 x 4 sparse Matrix of class "dgCMatrix"
b c d a
b . 2 1 1
c 2 . 1 2
d 1 1 . 2
a 1 2 2 .
Or without the pipe operator, in base R syntax:
tmp <- apply(my_data[-1], 1, function(x) x[x != 0])
tmp <- do.call(rbind, lapply(tmp, function(x) t(combn(x, m = 2))))
my_graph <- graph_from_edgelist(tmp, directed = FALSE)
adj_mat <- as_adjacency_matrix(my_graph)
adj_mat
Another attempt, minus the need to calculate all the combinations with combn
sel <- my_data[-1] != 0
dat <- data.frame(row=row(my_data[-1])[sel], value = my_data[-1][sel])
out <- crossprod(table(dat))
diag(out) <- 0
out
# value
#value a b c d
# a 0 1 2 2
# b 1 0 2 1
# c 2 2 0 1
# d 2 1 1 0
Matches the result from @AnoushiravanR:
adj_mat[c("a","b","c","d"), c("a","b","c","d")]
#4 x 4 sparse Matrix of class "dgCMatrix"
# a b c d
#a . 1 2 2
#b 1 . 2 1
#c 2 2 . 1
#d 2 1 1 .
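Why crossprod works here: table(dat) is a row-by-label incidence matrix, and the (x, y) entry of its cross product sums, over rows, the product of the two indicator columns, which is the number of rows containing both labels. A reduced sketch on the first three label columns (v4 and v5 are all zeros and contribute nothing; vals, keep, inc and adj are names chosen for this illustration):

```r
my_data <- data.frame(
  User = c("1", "1", "2", "3", "4"),
  v1 = c("b", "b", "a", "d", "c"),
  v2 = c("c", "d", "c", "a", "a"),
  v3 = c("0", "0", "d", "0", "b"),
  stringsAsFactors = FALSE
)

vals <- as.matrix(my_data[-1])
keep <- vals != "0"                      # ignore the "0" placeholders

# Incidence matrix: one row per data row, one column per label
inc <- unclass(table(row(vals)[keep], vals[keep]))

# t(inc) %*% inc counts, for each pair of labels, the rows holding both
adj <- crossprod(inc)
diag(adj) <- 0                           # a label paired with itself is not an edge
adj
```

This assumes a label appears at most once per row; a repeated label would enter the products with its count rather than as a 0/1 indicator.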
Another igraph option
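library(igraph)
library(dplyr)
do.call(
  rbind,
  # the original snippet used df; assumed here to be my_data without the User column
  combn(my_data[-1], 2, setNames, nm = c("from", "to"), simplify = FALSE)
) %>%
  filter(from > 0 & to > 0) %>%
  arrange(from) %>%
  graph_from_data_frame(directed = FALSE) %>%
  as_adjacency_matrix(sparse = FALSE)  # replaces the deprecated get.adjacency()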
gives
a b c d
a 0 1 2 2
b 1 0 2 1
c 2 2 0 1
d 2 1 1 0

Converting table with missing values to matrix of counts

I have a table with an unequal number of elements in each row, with each element having a count of 1 or 2 appended to a string. I want to create a matrix of presence/absence of each string, but including the count (1, 2) and placing a zero if the string is not found.
From this:
V1 V2 V3 V4 V5
1 A cat:2 dog:1 mouse:1 horse:2
2 B dog:2 mouse:2 dolphin:2
3 C horse:2
4 D cat:1 mouse:2 dolphin:2
To this:
cat dog mouse horse dolphin
A 2 1 1 2 0
B 0 2 2 0 2
C 0 0 0 2 0
D 1 0 2 0 2
I have looked up previous solutions to similar problems:
Convert a dataframe to presence absence matrix
but they create a 0/1 presence/absence matrix, not including the count.
sample data:
DF <- structure(list(V1 = c("A", "B", "C", "D"),
V2 = c("cat:2", "dog:2", "horse:2", "cat:1"),
V3 = c("dog:1", "mouse:2", "", "mouse:2"),
V4 = c("mouse:1", "dolphin:2", "", "dolphin:2"),
V5 = c("horse:2", "", "", "")),
.Names = c("V1", "V2", "V3", "V4", "V5"),
class = "data.frame", row.names = c(NA, -4L))
Maybe some package could make this easier, but here is a solution. It won't be fast for large data, but it does the job:
#split the strings
tmp <- apply(DF[,-1], 1, strsplit, ":")
#extract the first strings
names <- lapply(tmp,function(x) c(na.omit(sapply(x, "[", 1))))
uniquenames <- unique(unlist(names))
#extract the numbers
reps <- lapply(tmp,function(x) as.numeric(na.omit(sapply(x, "[", 2))))
#make the numbers named vectors
res <- mapply(setNames, reps, names)
#subset the named vectors and combine result in a matrix
res <- do.call(rbind, lapply(res, "[",uniquenames))
#cosmetics
colnames(res) <- uniquenames
rownames(res) <- DF$V1
res[is.na(res)] <- 0
# cat dog mouse horse dolphin
#A 2 1 1 2 0
#B 0 2 2 0 2
#C 0 0 0 2 0
#D 1 0 2 0 2
You can separate the animals from the counts with separate from tidyr right after melting the data into long format, and then dcast to wide format using the counts as values (which need to be cast from character to numeric as a previous step).
library(reshape2)  # melt, dcast
library(tidyr)     # separate
library(dplyr)     # %>%, select
# DF is the sample data from the question (called data in the original snippet)
DF %>%
  melt("V1") %>%
  separate(value, c("animal", "count"), ":", fill = "left") %>%
  transform(count = as.numeric(count)) %>%
  dcast(V1 ~ animal, value.var = "count", fun.aggregate = sum) %>%
  select(-"NA")
# V1 cat dog dolphin horse mouse
# 1 A 2 1 0 2 1
# 2 B 0 2 2 0 2
# 3 C 0 0 0 2 0
# 4 D 1 0 2 0 2
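For comparison, a base R sketch of the same long-then-wide idea, using strsplit and xtabs instead of the packages above. This is an alternative, not either answerer's code; cells, ids, parts, long and res are illustrative names.

```r
DF <- data.frame(
  V1 = c("A", "B", "C", "D"),
  V2 = c("cat:2", "dog:2", "horse:2", "cat:1"),
  V3 = c("dog:1", "mouse:2", "", "mouse:2"),
  V4 = c("mouse:1", "dolphin:2", "", "dolphin:2"),
  V5 = c("horse:2", "", "", ""),
  stringsAsFactors = FALSE
)

# Long format: unlist() walks the columns one after another,
# so the row labels repeat once per column
cells <- unlist(DF[-1], use.names = FALSE)
ids   <- rep(DF$V1, times = ncol(DF) - 1)
keep  <- nzchar(cells)                       # drop the empty cells

parts <- strsplit(cells[keep], ":", fixed = TRUE)
long  <- data.frame(id     = ids[keep],
                    animal = sapply(parts, `[`, 1),
                    count  = as.numeric(sapply(parts, `[`, 2)))

# Wide format: xtabs() sums count per id/animal and fills gaps with 0
res <- xtabs(count ~ id + animal, data = long)
res
```

The columns come out in alphabetical order, as in the dcast result, rather than in order of first appearance.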

find columns present in vector and replace values that equal 1

I have a binary data set which looks like
a b c d
r1 1 1 0 0
r2 0 1 1 0
r3 1 0 0 1
And a vector
V <- c("a", "c")
I want a command that searches the column names and changes values in the matching columns, for example changing 1 to A. So the output would be:
a b c d
r1 A 1 0 0
r2 0 1 A 0
r3 A 0 0 1
Here is a vectorized way to do it,
df[names(df) %in% V] <- replace(df[names(df) %in% V], df[names(df) %in% V] == 1, 'A')
#or avoid calling the %in% part 3 times by assigning it, i.e.
i1 <- names(df) %in% V
df[i1] <- replace(df[i1], df[i1] == 1, 'A')
#or a more simplified syntax, compliments of @Cath,
df[, V][df[, V]==1] <- "A"
which gives,
a b c d
r1 A 1 0 0
r2 0 1 A 0
r3 A 0 0 1
A solution with dplyr:
library(dplyr)
V <- c("a", "c")
df %>%
mutate_at(V, ~replace(.x, .x == 1, 'A'))
# a b c d
# r1 A 1 0 0
# r2 0 1 A 0
# r3 A 0 0 1
mutate_at takes a data.frame and a vector of column names and applies the specified function to each of those columns.
DATA
df <- structure(list(a = c(1L, 0L, 1L), b = c(1L, 1L, 0L),
c = c(0L, 1L, 0L), d = c(0L, 0L, 1L)),
.Names = c("a", "b", "c", "d"),
class = "data.frame", row.names = c("r1", "r2", "r3"))
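A base R counterpart of the same column-wise idea, as a sketch: lapply applies replace to each selected column and the results are assigned back. Since 'A' is a character, the touched columns become character while the others keep their integer type.

```r
df <- data.frame(a = c(1L, 0L, 1L), b = c(1L, 1L, 0L),
                 c = c(0L, 1L, 0L), d = c(0L, 0L, 1L),
                 row.names = c("r1", "r2", "r3"))
V <- c("a", "c")

# Replace 1 by "A" column by column, only in the columns named in V
df[V] <- lapply(df[V], function(x) replace(x, x == 1, "A"))

df
sapply(df, class)   # a and c are now character, b and d stay integer
```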
If the left-hand side (LHS) and right-hand side (RHS) are of the same type, then data.table can be used to update only the selected "cells" in place, i.e., without copying the whole column:
library(data.table)
setDT(df)
for (s in V) df[get(s) == 1L, (s) := 99L] # replacement value is of type integer
df[]
a b c d
1: 99 1 0 0
2: 0 1 99 0
3: 99 0 0 1
To verify that only selected rows in each column are updated, we can check the addresses of each column before and after the update using:
df[, lapply(.SD, address), .SDcols = V]
(In addition, the verbose mode can be switched on by options(datatable.verbose = TRUE).)
In case LHS and RHS are of different type, a type conversion is required anyway. Therefore, the whole column needs to be replaced:
df[, (V) := lapply(.SD, function(x) replace(x, x == 1L, "A")), .SDcols = V]
df
a b c d
1: A 1 0 0
2: 0 1 A 0
3: A 0 0 1
Using address() shows that each of the affected columns has been copied. But only the affected columns are copied, the other columns haven't been touched. This is different to the other answers posted so far where the whole data frame is copied.

rbindfill like merge of list of vectors

I have a list of named vectors (see below, and the end for a dput version) that I would like to "merge" together into a matrix, filling in zeros wherever a vector doesn't contain a name (a character in this case). This doesn't seem that hard, but I haven't found a working base solution to the problem. I thought about using match, but that seems very costly in time when I'm sure there's a fancy way to use do.call and rbind together.
List of Named Vectors:
$greg
e i k l
1 2 1 1
$sam
! c e i t
1 1 1 2 1
$teacher
? c i k l
1 1 1 1 1
Final Desired Output
! ? c e i k l t
greg 0 0 0 1 2 1 1 0
sam 1 0 1 1 2 0 0 1
teacher 0 1 1 0 1 1 1 0
Likely this is the output people will give and filling NAs with 0 is easy
! ? c e i k l t
greg NA NA NA 1 2 1 1 NA
sam 1 NA 1 1 2 NA NA 1
teacher NA 1 1 NA 1 1 1 NA
Sample Data
L2 <- structure(list(greg = structure(c(1L, 2L, 1L, 1L), .Dim = 4L, .Dimnames = structure(list(
c("e", "i", "k", "l")), .Names = ""), class = "table"), sam = structure(c(1L,
1L, 1L, 2L, 1L), .Dim = 5L, .Dimnames = structure(list(c("!",
"c", "e", "i", "t")), .Names = ""), class = "table"), teacher = structure(c(1L,
1L, 1L, 1L, 1L), .Dim = 5L, .Dimnames = structure(list(c("?",
"c", "i", "k", "l")), .Names = ""), class = "table")), .Names = c("greg",
"sam", "teacher"))
Here's a fairly straightforward base solution:
# first determine all possible column names
cols <- sort(unique(unlist(lapply(L2,names), use.names=FALSE)))
# initialize the output
out <- matrix(0, length(L2), length(cols), dimnames=list(names(L2),cols))
# loop over list and fill in the matrix
for(i in seq_along(L2)) {
out[names(L2)[i], names(L2[[i]])] <- L2[[i]]
}
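The loop can also be written as a single assignment with a two-column character index matrix, which R accepts for matrices that have dimnames. A sketch on plain named vectors rather than the table objects in L2 (L, idx and out are names chosen here):

```r
L <- list(
  greg    = c(e = 1, i = 2, k = 1, l = 1),
  sam     = c("!" = 1, c = 1, e = 1, i = 2, t = 1),
  teacher = c("?" = 1, c = 1, i = 1, k = 1, l = 1)
)

# All column names that occur anywhere in the list
cols <- sort(unique(unlist(lapply(L, names), use.names = FALSE)))

# Start from zeros so that missing names need no further handling
out <- matrix(0, length(L), length(cols), dimnames = list(names(L), cols))

# One (row name, column name) pair per stored value
idx <- cbind(rep(names(L), lengths(L)),
             unlist(lapply(L, names), use.names = FALSE))
out[idx] <- unlist(L, use.names = FALSE)
out
```

Whether this is faster than the loop for a given list is not something measured here; it mainly avoids the explicit iteration.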
UPDATE with benchmarks:
f1 <- function(L2) {
cols <- sort(unique(unlist(lapply(L2,names), use.names=FALSE)))
out <- matrix(0, length(L2), length(cols), dimnames=list(names(L2),cols))
for(i in seq_along(L2)) out[names(L2)[i], names(L2[[i]])] <- L2[[i]]
out
}
f2 <- function(L2) {
L.names <- sort(unique(unlist(sapply(L2, names))))
L3 <- t(sapply(L2, function(x) x[L.names]))
colnames(L3) <- L.names
L3[is.na(L3)] <- 0
L3
}
f3 <- function(L2) {
m <- do.call(rbind, lapply(L2, as.data.frame))
m$row <- sub("[.].*", "", rownames(m))
m$Var1 <- factor(as.character(m$Var1))
xtabs(Freq ~ row + Var1, m)
}
library(rbenchmark)
benchmark(f1(L2), f2(L2), f3(L2), order="relative")[,1:5]
# test replications elapsed relative user.self
# 1 f1(L2) 100 0.022 1.000 0.020
# 2 f2(L2) 100 0.051 2.318 0.052
# 3 f3(L2) 100 0.788 35.818 0.760
set.seed(21)
L <- replicate(676, {n=sample(10,1); l=sample(26,n);
setNames(sample(6,n,TRUE), letters[l])}, simplify=FALSE)
names(L) <- levels(interaction(letters,LETTERS))
benchmark(f1(L), f2(L), order="relative")[,1:5]
# test replications elapsed relative user.self
# 1 f1(L) 100 1.84 1.000 1.828
# 2 f2(L) 100 4.24 2.304 4.220
I think something like this:
names <- sort(unique(unlist(lapply(L2, names), use.names=FALSE)))
L3 <- t(vapply(L2, function(x) x[names], FUN.VALUE=numeric(length(names))))
colnames(L3) <- names
L3[is.na(L3)] <- 0
reshape2 Solution. This can be readily done with the reshape2 package by melting the list into long form and then using dcast to reshape it back into wide form:
library(reshape2)
m <- melt(L2)
m$Var.1 <- factor(as.character(m$Var.1)) # optional - if columns should be sorted
dcast(m, L1 ~ Var.1, fill = 0)
L1 ! ? c e i k l t
1 greg 0 0 0 1 2 1 1 0
2 sam 1 0 1 1 2 0 0 1
3 teacher 0 1 1 0 1 1 1 0
Base Solution. And here is a corresponding base solution where the first two lines perform the melt, the next line ensures the columns will be sorted and the last line reshapes from long to wide:
m <- do.call(rbind, lapply(L2, as.data.frame))
m$row <- sub("[.].*", "", rownames(m))
m$Var1 <- factor(as.character(m$Var1))
xtabs(Freq ~ row + Var1, m)
Var1
row ! ? c e i k l t
greg 0 0 0 1 2 1 1 0
sam 1 0 1 1 2 0 0 1
teacher 0 1 1 0 1 1 1 0
EDIT: Added a base solution and modified the sort line.
While typing this I thought of this solution but wonder if there's a more efficient one:
chars <- sort(unique(unlist(lapply(L2, names))))
L3 <- lapply(L2, function(x){
nots <- chars[!chars %in% names(x)]
new <- rev(c(x, rep(0, length(nots))))
# seq_along() also covers the case where no names are missing (1:0 would index wrongly)
names(new)[seq_along(nots)] <- nots
new[order(names(new))]
})
do.call(rbind, L3)
Yielding:
! ? c e i k l t
greg 0 0 0 1 2 1 1 0
sam 1 0 1 1 2 0 0 1
teacher 0 1 1 0 1 1 1 0
