Converting a table with missing values to a matrix of counts - r

I have a table with an unequal number of elements in each row; each element is a string with a count of 1 or 2 appended. I want to create a presence/absence matrix of the strings, but containing the count (1 or 2), with a zero where a string is not found.
From this:
V1 V2 V3 V4 V5
1 A cat:2 dog:1 mouse:1 horse:2
2 B dog:2 mouse:2 dolphin:2
3 C horse:2
4 D cat:1 mouse:2 dolphin:2
To this:
cat dog mouse horse dolphin
A 2 1 1 2 0
B 0 2 2 0 2
C 0 0 0 2 0
D 1 0 2 0 2
I have looked up previous solutions to similar problems:
Convert a dataframe to presence absence matrix
but they create a 0/1 presence/absence matrix that does not include the counts.
Sample data:
DF <- structure(list(V1 = c("A", "B", "C", "D"),
V2 = c("cat:2", "dog:2", "horse:2", "cat:1"),
V3 = c("dog:1", "mouse:2", "", "mouse:2"),
V4 = c("mouse:1", "dolphin:2", "", "dolphin:2"),
V5 = c("horse:2", "", "", "")),
.Names = c("V1", "V2", "V3", "V4", "V5"),
class = "data.frame", row.names = c(NA, -4L))

Maybe some package could make this easier, but here is a solution. It won't be fast for large data, but it does the job:
#split the strings
tmp <- apply(DF[,-1], 1, strsplit, ":")
#extract the first strings
names <- lapply(tmp,function(x) c(na.omit(sapply(x, "[", 1))))
uniquenames <- unique(unlist(names))
#extract the numbers
reps <- lapply(tmp,function(x) as.numeric(na.omit(sapply(x, "[", 2))))
#make the numbers named vectors
res <- mapply(setNames, reps, names)
#subset the named vectors and combine result in a matrix
res <- do.call(rbind, lapply(res, "[",uniquenames))
#cosmetics
colnames(res) <- uniquenames
rownames(res) <- DF$V1
res[is.na(res)] <- 0
# cat dog mouse horse dolphin
#A 2 1 1 2 0
#B 0 2 2 0 2
#C 0 0 0 2 0
#D 1 0 2 0 2
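
For comparison, here is a base-R-only sketch (assuming the sample data is stored as DF, as above): reshape the data into a long id/animal/count data frame and let xtabs() build the zero-filled count matrix.
long <- do.call(rbind, lapply(seq_len(nrow(DF)), function(i) {
  x <- unlist(strsplit(unlist(DF[i, -1]), ":"))
  if (length(x) == 0) return(NULL)                   # row with no entries at all
  data.frame(id     = DF$V1[i],
             animal = x[c(TRUE, FALSE)],             # odd positions: names
             count  = as.numeric(x[c(FALSE, TRUE)])) # even positions: counts
}))
xtabs(count ~ id + animal, data = long)
#    animal
# id  cat dog dolphin horse mouse
#   A   2   1       0     2     1
#   B   0   2       2     0     2
#   C   0   0       0     2     0
#   D   1   0       2     0     2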

You can separate the animals from the counts with separate() from tidyr right after melting the data into long format, and then dcast to wide format using the counts as values (they need to be converted from character to numeric first).
library(reshape2)
library(tidyr)
library(dplyr)
DF %>%
  melt("V1") %>%
  separate(value, c("animal", "count"), ":", fill = "left") %>%
  transform(count = as.numeric(count)) %>%
  dcast(V1 ~ animal, value.var = "count", fun.aggregate = sum) %>%
  select(-"NA")
# V1 cat dog dolphin horse mouse
# 1 A 2 1 0 2 1
# 2 B 0 2 2 0 2
# 3 C 0 0 0 2 0
# 4 D 1 0 2 0 2
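The same idea also works with current tidyr alone, with pivot_longer()/pivot_wider() taking the place of melt()/dcast(); a rough sketch, again assuming the sample data is in DF:
library(dplyr)
library(tidyr)
DF %>%
  pivot_longer(-V1, values_to = "value") %>%   # wide to long
  filter(value != "") %>%                      # drop the empty cells
  separate(value, c("animal", "count"), sep = ":") %>%
  mutate(count = as.numeric(count)) %>%
  pivot_wider(id_cols = V1, names_from = animal,
              values_from = count, values_fill = 0)
# A 4 x 6 tibble with columns V1, cat, dog, mouse, horse, dolphin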

Related

Adjacency Matrix from a dataframe

I am trying to convert an edgelist to an adjacency matrix.
Below is the sample data
#Sample Data
User<-c("1","1","2","3","4")
v1 <- c("b", "b", "a", "d", "c")
v2 <- c("c", "d", "c", "a", "a")
v3 <- c(0, 0, "d", 0, "b")
v4 <- c(0, 0, 0, 0, 0)
v5 <- c(0, 0, 0, 0, 0)
my_data<-data.frame(User, v1, v2, v3, v4, v5)
my_data
If you run this code, you will get the following output:
User v1 v2 v3 v4 v5
1 b c 0 0 0
1 b d 0 0 0
2 a c d 0 0
3 d a 0 0 0
4 c a b 0 0
Using the data, I want to create an adjacency matrix that looks as follows:
a b c d
a 0 0 2 2
b 0 0 1 1
c 2 1 0 1
d 2 1 1 0
Basically, the desired output displays how many times each pair appeared in columns v1 through v5 of the sample data frame.
I have tried the AdjacencyFromEdgelist function from the dils library, and also tried to create a matrix shell of NAs and fill it in by looping through the data frame.
However, I could not get either approach to work.
I think this may be close to what you have in mind. In rows with more than two vertices, I considered every existing pair:
library(igraph)
do.call(rbind, my_data[-1] |>
  apply(1, \(x) x[x != 0]) |>
  lapply(\(x) t(combn(x, m = 2)))) |>
  graph_from_edgelist(directed = FALSE) |>
  as_adjacency_matrix()
4 x 4 sparse Matrix of class "dgCMatrix"
b c d a
b . 2 1 1
c 2 . 1 2
d 1 1 . 2
a 1 2 2 .
Or without the pipe operator, in base R:
tmp <- apply(my_data[-1], 1, function(x) x[x != 0])
tmp <- do.call(rbind, lapply(tmp, function(x) t(combn(x, m = 2))))
my_graph <- graph_from_edgelist(tmp, directed = FALSE)
adj_mat <- as_adjacency_matrix(my_graph)
adj_mat
Another attempt, minus the need to calculate all the combinations with combn
sel <- my_data[-1] != 0
dat <- data.frame(row=row(my_data[-1])[sel], value = my_data[-1][sel])
out <- crossprod(table(dat))
diag(out) <- 0
out
# value
#value a b c d
# a 0 1 2 2
# b 1 0 2 1
# c 2 2 0 1
# d 2 1 1 0
Matches the result from #AnoushiravanR:
adj_mat[c("a","b","c","d"), c("a","b","c","d")]
#4 x 4 sparse Matrix of class "dgCMatrix"
# a b c d
#a . 1 2 2
#b 1 . 2 1
#c 2 2 . 1
#d 2 1 1 .
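To see why the crossprod approach works: table(dat) is a row-by-label incidence table, and crossprod() counts, for every pair of labels, the number of rows in which both occur; the diagonal (how many rows each label appears in) is then zeroed out:
table(dat)
#    value
# row a b c d
#   1 0 1 1 0
#   2 0 1 0 1
#   3 1 0 1 1
#   4 1 0 0 1
#   5 1 1 1 0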
Another igraph option (dropping the User column before pairing up the vertex columns):
library(dplyr)
do.call(
  rbind,
  combn(my_data[-1], 2, setNames, nm = c("from", "to"), simplify = FALSE)
) %>%
  filter(from > 0 & to > 0) %>% # drop pairs involving the 0 placeholder
  arrange(from) %>%
  graph_from_data_frame(directed = FALSE) %>%
  get.adjacency(sparse = FALSE)
gives
a b c d
a 0 1 2 2
b 1 0 2 1
c 2 2 0 1
d 2 1 1 0

How to loop over a column in a data frame in R?

I'm quite new to R. I have a df with a V1 column. I would like to create a loop to calculate the ratio (cut-off values).
I want to take the first number, divide it by 301, and put the value in df$V2. Then I want to sum the first two numbers and divide by 301, and so on.
For example:
V2
1/301 (first value of df) 0.0033
2/301 (sum of the first two values of df) 0.0066
2/301 (sum of the first three values of df) 0.0066
df
V1
1
1
0
0
1
0
1
1
1
0
You can take the cumulative sum of the V1 values and divide it by 301.
df$V2 <- cumsum(df$V1)/301
df
# V1 V2
#1 1 0.00332
#2 1 0.00664
#3 0 0.00664
#4 0 0.00664
#5 1 0.00997
#6 0 0.00997
#7 1 0.01329
#8 1 0.01661
#9 1 0.01993
#10 0 0.01993
We can use transform to create 'V2' by dividing the cumulative sum of 'V1' by 301:
df <- transform(df, V2 = cumsum(V1)/301)
-output
df
# V1 V2
#1 1 0.003322259
#2 1 0.006644518
#3 0 0.006644518
#4 0 0.006644518
#5 1 0.009966777
#6 0 0.009966777
#7 1 0.013289037
#8 1 0.016611296
#9 1 0.019933555
#10 0 0.019933555
Or another option is Reduce with accumulate = TRUE
transform(df, V2 = Reduce(`+`, V1, accumulate = TRUE)/301)
Or, if we need a loop: loop over the sequence of rows and assign each value of 'V2' as the sum of the elements of 'V1' from 1 up to that row, divided by 301:
df$V2 <- 0
for(i in seq_len(nrow(df))) {
df$V2[i] <- sum(df$V1[1:i])/301
}
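A quick sanity check that the loop agrees with the vectorised cumsum version:
all.equal(df$V2, cumsum(df$V1)/301)
#[1] TRUE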
Or using tidyverse
library(dplyr)
df %>%
mutate(V2 = cumsum(V1)/301)
Or using accumulate
library(purrr)
df %>%
mutate(V2 = accumulate(V1, `+`)/301)
data
df <- structure(list(V1 = c(1, 1, 0, 0, 1, 0, 1, 1, 1, 0)),
  class = "data.frame", row.names = c(NA, -10L))

Get variable combination matrix

Data
We have numerous text strings that look like this (way longer in our real dataset):
df <- data.frame(
id = c('text1','text2','text3'),text = c('ABA','ABA','AAA')
)
>df
id text
1 text1 ABA
2 text2 ABA
3 text3 AAA
We want to create a matrix that tells how often a letter at position x is found together with the other letters at other positions, so in this case:
   1A 2A 2B 3A
3A  3  1  2  3
2B  2  0  2  2
2A  1  1  0  1
1A  3  1  2  3
What I tried
I previously converted the matrix to a binary matrix, looking like this:
structure(list(pos1_A = c(1, 1, 1), pos2_A = c(0, 0, 1), pos2_B = c(1,
1, 0), pos3_A = c(1, 1, 1)), class = "data.frame", row.names = c("text1",
"text2", "text3"))
pos1_A pos2_A pos2_B pos3_A
text1 1 0 1 1
text2 1 0 1 1
text3 1 1 0 1
Then I can run commands like cor to get correlations; however, instead of correlations I want frequencies.
Note this is different from questions about co-occurrences wherein the variable name itself (here position) is neglected, for example like "How to use R to create a word co-occurrence matrix"
Huge credit to #Ronak Shah for the answer here.
It's much simpler if we convert the categorical data to a numerical (binary) matrix, for example using this hacky but easy route via the homals package, and then apply the method by #Ronak Shah linked above:
# The dataset
df <- data.frame(
id = c('text1','text2','text3'),text = c('ABA','ABA','AAA')
)
# Split the strings into characters and add column names
library(dplyr)
library(tibble)
df2 <- df %>%
  splitstackshape::cSplit('text', sep = '', stripWhite = FALSE,
                          type.convert = FALSE, direction = 'wide') %>%
  column_to_rownames('id')
colnames(df2) <- paste0('pos', 1:ncol(df2))
# Convert to binary matrix (hacky way)
bin.mat <- homals:::expandFrame(df2, clean = F)
# Method by #Ronak Shah to get the frequency matrix
fun <- function(x, y) sum(bin.mat[, x] & bin.mat[, y])
n <- seq_along(bin.mat)
mat <- outer(n, n, Vectorize(fun))
dimnames(mat) <- list(names(bin.mat)[n], names(bin.mat[n]))
This produces the matrix:
>mat
pos1_A pos2_A pos2_B pos3_A
pos1_A 3 1 2 3
pos2_A 1 1 0 1
pos2_B 2 0 2 2
pos3_A 3 1 2 3
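If installing homals only for its unexported expandFrame() feels heavy, a base R sketch (assuming df2 as built above) produces an equivalent 0/1 data frame that can be used as bin.mat in the same way:
bin.mat <- as.data.frame(do.call(cbind, lapply(names(df2), function(p) {
  lev <- sort(unique(df2[[p]]))
  m <- 1 * outer(df2[[p]], lev, `==`)  # one 0/1 column per letter at this position
  colnames(m) <- paste(p, lev, sep = "_")
  m
})))
rownames(bin.mat) <- rownames(df2)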
Here's an alternative approach that produces a matrix as originally requested:
# Make all strings the same length:
df$text <- stringr::str_pad(df$text, side = "right", max(nchar(df$text)))
# Create a matrix with all letters labelled by their position:
all_vals <- apply(do.call(rbind, strsplit(df$text, "")), 1,
function(x) paste0(seq_along(x), x))
# Create a vector of all possible letter / position combos
all_labs <- do.call(paste0, expand.grid(seq(max(nchar(df$text))),
unique(unlist(strsplit(df$text, "")))))
# Create a function that will count all co-occurences per data frame row
f <- function(y, x) as.vector(outer(x, x, function(a, b) 1 * (a %in% y & b %in% y)))
# Create the results matrix and label it
m <- matrix(rowSums(apply(as.data.frame(all_vals), 2, f, all_labs)), nrow = length(all_labs))
rownames(m) <- all_labs
colnames(m) <- all_labs
m
#> 1A 2A 3A 1B 2B 3B
#> 1A 3 1 3 0 2 0
#> 2A 1 1 1 0 0 0
#> 3A 3 1 3 0 2 0
#> 1B 0 0 0 0 0 0
#> 2B 2 0 2 0 2 0
#> 3B 0 0 0 0 0 0
Created on 2020-05-24 by the reprex package (v0.3.0)
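If the all-zero rows and columns (letter/position combinations that never occur in the data, here 1B and 3B) are not wanted, they can be dropped afterwards:
m[rowSums(m) > 0, colSums(m) > 0]
#>    1A 2A 3A 2B
#> 1A  3  1  3  2
#> 2A  1  1  1  0
#> 3A  3  1  3  2
#> 2B  2  0  2  2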

Dispatch values in list column to separate columns

I have a data.table with a list column "c":
df <- data.table(a = 1:3, c = list(1L, 1:2, 1:3))
df
a c
1: 1 1
2: 2 1,2
3: 3 1,2,3
I want to create separate columns for the values in "c".
I create a set of new columns F_1, F_2, F_3:
mmax <- max(df$a)
flux <- paste("F", 1:mmax, sep = "_")
df[, (flux) := 0]
df
a c F_1 F_2 F_3
1: 1 1 0 0 0
2: 2 1,2 0 0 0
3: 3 1,2,3 0 0 0
I want to dispatch values in "c" to columns F_1, F_2, F_3 like this:
df
a c F_1 F_2 F_3
1: 1 1 1 0 0
2: 2 1,2 1 2 0
3: 3 1,2,3 1 2 3
What I have tried:
comp_vect <- function(vec, mmax){
  vec <- vec %>% unlist()
  n <- length(vec)
  answr <- c(vec, rep(0, l = mmax - n))
}
df[ , ..flux := mapply(comp_vect, c, mmax)]
The expected data.table is:
> df
a c F_1 F_2 F_3
1: 1 1 1 0 0
2: 2 1,2 1 2 0
3: 3 1,2,3 1 2 3
I followed a radically different approach: I rbinded the list column and then dcasted it, obtaining the desired result. The last step is to set the names.
library(data.table)
df <- data.table(a = 1:3, d = list(1L, c(1L, 2L), c(1L, 2L, 3L)))
df2 <- df[, rbind(d), by = a][, dcast(.SD, a ~ V1, fill = 0)]
setnames(df2, 2:4, flux)[]
a F_1 F_2 F_3
1: 1 1 0 0
2: 2 1 2 0
3: 3 1 2 3
where flux is the vector of names that you defined in your question.
Please note that I avoided using the column name c, as it may be confused with the function c().
Solution:
for(idx in seq(max(sapply(df$c, length)))){ # maximum number of values across all the elements of the list
  set(x = df,
      i = NULL,
      j = paste0("F_", idx), # column name
      value = sapply(df$c, function(x){
        if(is.na(x[idx])){
          return(0) # 0 instead of NA
        } else {
          return(x[idx])
        }
      })
  )
}
Explanation:
We can extract the values from a list like this:
sapply(df$c, function(ll) return(ll[1])) # first value
[1] 1 1 1
sapply(df$c, function(ll) return(ll[2])) # second value
[1] NA 2 2
sapply(df$c, function(ll) return(ll[3])) # third value
[1] NA NA 3
We see that if there is no value at a position, we get an NA.
We need an iterator to extract all the values at position idx. For that, we find the number of values in each element of df$c (the list) and keep the maximum:
max(sapply(df$c, length))
[1] 3
If we want zeros instead of NAs, we need a small function inside the sapply to convert them:
vec <- c(NA, 5, 1, NA)
sapply(vec, function(x) if(is.na(x)) return(0) else return(x))
[1] 0 5 1 0
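As a side note (not from the answers above), data.table's own transpose() can do the whole dispatch in one assignment, since its fill argument pads the shorter vectors; a sketch on the original df with its list column c:
library(data.table)
df <- data.table(a = 1:3, c = list(1L, 1:2, 1:3))
flux <- paste("F", 1:3, sep = "_")
# transpose() returns one vector per position, padded with 0 where a row has no value there
df[, (flux) := transpose(c, fill = 0L)]
df
#    a     c F_1 F_2 F_3
# 1: 1     1   1   0   0
# 2: 2   1,2   1   2   0
# 3: 3 1,2,3   1   2   3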

Forming a co-occurrence matrix from a data frame

I have a data frame which looks something like this:
id val
1 a
1 b
2 a
2 c
2 d
3 a
3 a
Think of each row as a label, val, that was given to some observation with an id.
What I ultimately want to get to is a "co-occurrence" matrix like the one below, counting how many times each letter appears within the same id as each other letter:
a b c d
a 1 1 1 1
b 1 0 0 0
c 1 0 0 1
d 1 0 1 0
I've been wracking my brain looking for ways to do this, but have come up empty so far. Any hints? Preferably using tidyverse tools, but open to other options as well at this point.
EDIT: the solutions to the question linked as a possible duplicate do not work in this case. I'm not sure why, but I suspect it has to do with that question having a data frame with 3 columns.
Here's a solution in base R. Not quite elegant, but it seems to work:
temp = data.frame(do.call(cbind, lapply(split(df, df$id), function(a)
  combn(a$val, 2))), stringsAsFactors = FALSE)
sapply(sort(unique(df$val)), function(rows)
  sapply(sort(unique(df$val)), function(cols)
    sum(sapply(temp, function(x)
      identical(sort(x), sort(c(rows, cols)))))))
# a b c d
#a 1 1 1 1
#b 1 0 0 0
#c 1 0 0 1
#d 1 0 1 0
OR with igraph
temp = t(do.call(cbind, lapply(split(df, df$id), function(a) combn(a$val, 2))))
library(igraph)
as.matrix(get.adjacency(graph(temp, directed = FALSE)))
# a c b d
#a 1 1 1 1
#c 1 0 0 1
#b 1 0 0 0
#d 1 1 0 0
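To compare this directly with the first result, reorder the rows and columns:
adj <- as.matrix(get.adjacency(graph(temp, directed = FALSE)))
adj[c("a", "b", "c", "d"), c("a", "b", "c", "d")]
# a b c d
#a 1 1 1 1
#b 1 0 0 0
#c 1 0 0 1
#d 1 0 1 0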
DATA
df = structure(list(id = c(1L, 1L, 2L, 2L, 2L, 3L, 3L),
val = c("a", "b", "a", "c", "d", "a", "a")),
.Names = c("id", "val"),
class = "data.frame",
row.names = c(NA, -7L))
A solution with dplyr + purrr:
library(dplyr)
library(purrr)
df %>%
  split(.$id) %>%
  map_dfr(function(x){
    t(combn(x$val, 2)) %>%
      data.frame(stringsAsFactors = FALSE)
  }) %>%
  mutate_all(funs(factor(., levels = c("a", "b", "c", "d")))) %>%
  table() %>%
  pmax(., t(.))
Result:
X2
X1 a b c d
a 1 1 1 1
b 1 0 0 0
c 1 0 0 1
d 1 0 1 0
Notes:
I first split the df by id, then used map_dfr from purrr to map the combn function to each id group.
combn finds all combinations of elements within a vector (length(vec) choose 2) and returns a matrix.
_dfr at the end of map_dfr means that the result will be a dataframe by row binding each element of the list. So this is effectively do.call(rbind, lapply()).
mutate_all makes sure that table retains all the levels needed even if a letter does not appear in a column.
Finally, since the table step produces an upper triangular matrix, I fed that matrix and its transpose into pmax.
pmax finds the parallel maxima from the two inputs and returns a symmetric matrix as desired.
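As a tiny illustration of that last step (a made-up 2 x 2 example), pmax of a matrix and its transpose mirrors the upper triangle into the lower triangle:
ex <- matrix(c(0, 0, 2, 1), 2, 2, dimnames = list(c("a", "b"), c("a", "b")))
ex
#  a b
#a 0 2
#b 0 1
pmax(ex, t(ex))
#  a b
#a 0 2
#b 2 1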
Data:
df = read.table(text= "id val
1 a
1 b
2 a
2 c
2 d
3 a
3 a", header = TRUE, stringsAsFactors = FALSE)
