Count instances of overlap in two vectors in R

I am hoping to create a matrix that shows a count of instances of overlapping values for a grouping variable based on a second variable. Specifically, I am hoping to determine the degree to which primary studies overlap across meta-analyses in order to create a network diagram.
So, in this example, I have three meta-analyses that include some portion of three primary studies.
df <- data.frame(metas = c(1,1,1,2,3,3), studies = c(1,3,2,1,2,3))
  metas studies
1     1       1
2     1       3
3     1       2
4     2       1
5     3       2
6     3       3
I would like it to return:
  v1 v2 v3
1  3  1  2
2  1  1  0
3  2  0  2
The value in row 1, column 1 indicates that Meta-analysis 1 had three studies in common with itself (i.e., it included three studies). Row 1, column 2 indicates that Meta-analysis 1 had one study in common with Meta-analysis 2. Row 1, column 3 indicates that Meta-analysis 1 had two studies in common with Meta-analysis 3.

I believe you are looking for a symmetric matrix of intersecting studies.
dfspl <- split(df$studies, df$metas)
out <- outer(seq_along(dfspl), seq_along(dfspl),
function(a, b) lengths(Map(intersect, dfspl[a], dfspl[b])))
out
# [,1] [,2] [,3]
# [1,] 3 1 2
# [2,] 1 1 0
# [3,] 2 0 2
If you need names on them, you can go with the names as defined by df$metas:
rownames(out) <- colnames(out) <- names(dfspl)
out
# 1 2 3
# 1 3 1 2
# 2 1 1 0
# 3 2 0 2
If you need the names defined as v plus the meta name, go with
rownames(out) <- colnames(out) <- paste0("v", names(dfspl))
out
# v1 v2 v3
# v1 3 1 2
# v2 1 1 0
# v3 2 0 2
If you need to understand what this is doing, outer creates an expansion of the two argument vectors, and passes them all at once to the function. For instance,
outer(seq_along(dfspl), seq_along(dfspl), function(a, b) { browser(); 1; })
# Called from: FUN(X, Y, ...)
debug at #1: [1] 1
# Browse[2]>
a
# [1] 1 2 3 1 2 3 1 2 3
# Browse[2]>
b
# [1] 1 1 1 2 2 2 3 3 3
# Browse[2]>
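Because outer hands the function whole vectors rather than one pair at a time, a function written for a single pair has to be vectorized before outer can use it. A minimal sketch, assuming the same df and dfspl as above (the helper name overlap_one is only illustrative), wraps such a function with Vectorize:

```r
df <- data.frame(metas = c(1,1,1,2,3,3), studies = c(1,3,2,1,2,3))
dfspl <- split(df$studies, df$metas)

# hypothetical helper: overlap count for ONE pair of list positions
overlap_one <- function(a, b) length(intersect(dfspl[[a]], dfspl[[b]]))

# Vectorize() loops the helper over the expanded a/b vectors that outer() supplies
out_v <- outer(seq_along(dfspl), seq_along(dfspl), Vectorize(overlap_one))
out_v
```

This should give the same matrix as the Map-based call, at the cost of one function call per pair.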
What we ultimately want to do is find the intersection of each pair of studies.
dfspl[[1]]
# [1] 1 3 2
dfspl[[3]]
# [1] 2 3
intersect(dfspl[[1]], dfspl[[3]])
# [1] 3 2
length(intersect(dfspl[[1]], dfspl[[3]]))
# [1] 2
Granted, we are doing it twice (once for 1 and 3, once for 3 and 1, which give the same result), so this is a little inefficient. It would be better to compute only the upper or lower triangle and copy it to the other.
Edited for a more efficient process (calculating each intersection pair only once, and never calculating a self-intersection):
eg <- expand.grid(a = seq_along(dfspl), b = seq_along(dfspl))
eg <- eg[ eg$a < eg$b, ]
eg
# a b
# 4 1 2
# 7 1 3
# 8 2 3
lens <- lengths(Map(intersect, dfspl[eg$a], dfspl[eg$b]))
lens
# 1 1 2 ## btw, these are just names, from eg$a
# 1 2 0
out <- matrix(nrow = length(dfspl), ncol = length(dfspl))
out[ cbind(eg$a, eg$b) ] <- lens
out
# [,1] [,2] [,3]
# [1,] NA 1 2
# [2,] NA NA 0
# [3,] NA NA NA
# mirror the upper triangle; transposing keeps the positions aligned for any matrix size
out[ lower.tri(out) ] <- t(out)[ lower.tri(out) ]
diag(out) <- lengths(dfspl)
out
# [,1] [,2] [,3]
# [1,] 3 1 2
# [2,] 1 1 0
# [3,] 2 0 2
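As a cross-check, the same counts can be obtained from an incidence table. This is a sketch of a different technique, not part of the approach above, and it assumes each meta/study pair should be counted at most once (hence the call to unique):

```r
df <- data.frame(metas = c(1,1,1,2,3,3), studies = c(1,3,2,1,2,3))

# rows = studies, columns = metas; 1 where a meta-analysis includes a study
inc <- table(unique(df)[, c("studies", "metas")])

# t(inc) %*% inc: entry [i, j] counts the studies shared by metas i and j
overlap <- crossprod(inc)
overlap
```

The diagonal holds the number of studies in each meta-analysis, and the row and column names come from df$metas.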

Same idea as @r2evans, also base R (and a bit less elegant) (edited as required):
# Create df using sample data:
df <- data.frame(metas = c(1,1,1,2,3,3), studies = c(1,7,2,1,2,3))
# Test for equality between the values in the metas vector and the rest
# of the values in the data frame, then count the matches per value:
v1 <- rowSums(data.frame(sapply(df$metas, `==`, unique(unlist(df)))))
# Construct a symmetric matrix from the vector (v1 must exist before diag(v1) is called):
m1 <- diag(v1); m1[,1] <- m1[1,] <- v1
# Coerce matrix to data frame, setting the names as desired and dropping non-matches:
keep <- which(rowSums(m1) > 0)
df_2 <- setNames(data.frame(m1[keep, keep]), paste0("v", seq_along(keep)))

Related

function to calculate score

Calculate sequence score based on score matrix.
sum(j[k])
j <- matrix(1:25, ncol = 5, nrow = 5)
diag(j) <- 0
j
n <- 1:5
k <- sample(n, 5, replace = FALSE)
k <- replicate(5, sample(n, 5, replace = FALSE))
j is the score matrix.
k is the sequence matrix.
Let's say k[1,] = 4 1 5 3 2
k[2,] = 2 5 4 2 4
Please help with two issues.
Issue 1:
Add one more column to matrix k (let's call it "score"). Based on the j matrix, the score for this sequence should be 48.
4 1 5 3 2 48
Issue 2:
k[2,] = 2 5 4 2 4. The sample function seems to produce wrong permutations. I don't want any repetition in a sequence; here 4 is repeated and 1 is missing. Is there a better way to generate random permutations?
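You had better double-check the result; without a reproducible example from your end it's difficult to confirm the values. Note that replicate returns each permutation as a column of k, so reading k by rows (as in k[2,]) can show repeated values even though every column is a valid permutation. That may be what you are seeing.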
set.seed(1)
k <- replicate(5, sample(5))
# each column is a random permutation of 1:5
k
# [,1] [,2] [,3] [,4] [,5]
# [1,] 2 5 2 3 5
# [2,] 5 4 1 5 1
# [3,] 4 2 3 4 2
# [4,] 3 3 4 1 4
# [5,] 1 1 5 2 3
j <- matrix(1:25, 5)
diag(j) <- 0
nr <- nrow(k)
# arrange successive values as a column pair
ix <- cbind(c(k[-nr,]), c(k[-1,]))
# use the column pair to reference indices in j
jx <- j[ix]
# arrange j-values into a matrix and sum by column, producing the scores
scores <- colSums(matrix(jx, nr-1))
cbind(t(k), scores)
# scores
# [1,] 2 5 4 3 1 59
# [2,] 5 4 2 3 1 44
# [3,] 2 1 3 4 5 55
# [4,] 3 5 4 1 2 53
# [5,] 5 1 2 4 3 42
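To check a single sequence by hand, the pairing step can be wrapped in a small function. This is a sketch that assumes the score is the sum of j over successive (from, to) pairs, which is what reproduces the 48 quoted in the question; the name score_sequence is illustrative:

```r
j <- matrix(1:25, nrow = 5)
diag(j) <- 0

score_sequence <- function(s, score_matrix) {
  # pair each element with its successor: (s1, s2), (s2, s3), ...
  idx <- cbind(s[-length(s)], s[-1])
  # a two-column matrix index picks one cell per row of idx
  sum(score_matrix[idx])
}

score_sequence(c(4, 1, 5, 3, 2), j)  # 48
score_sequence(c(2, 5, 4, 3, 1), j)  # 59
```

The second call matches the first row of the scores table above.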

Select unique values from a list of 3

I would like to list all unique combinations of vectors of length 3 where each element of the vector can range between 1 to 9.
First I list all such combinations:
df <- expand.grid(1:9, 1:9, 1:9)
Then I would like to remove the rows that contain repetitions.
For example:
1 1 9
9 1 1
1 9 1
should only be included once.
In other words if two lines have the same numbers and the same number of each number then it should only be included once.
Note that
8 8 8 or
9 9 9 is fine as long as it only appears once.
Based on your approach and the idea to remove repetitions:
df <- expand.grid(1:2, 1:2, 1:2)
# Var1 Var2 Var3
# 1 1 1 1
# 2 2 1 1
# 3 1 2 1
# 4 2 2 1
# 5 1 1 2
# 6 2 1 2
# 7 1 2 2
# 8 2 2 2
df2 <- unique(t(apply(df, 1, sort))) #class matrix
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] 1 1 2
# [3,] 1 2 2
# [4,] 2 2 2
df2 <- as.data.frame(df2) #class data.frame
There are probably more efficient methods, but if I understand you correctly, that is the result you want.
Maybe something like this (your data frame is small, so the cost is negligible):
len <- apply(df, 1, function(x) length(unique(x)))
two <- df[len == 2, ]
res <- rbind(df[len != 2, ], two[!duplicated(apply(two, 1, prod)), ])
Here is what is done:
Get the number of unique elements per row.
Build the result in two parts:
First argument of rbind: rows with either one unique element (e.g. 1 1 1, 7 7 7) or three (e.g. 5 8 7, 2 4 9) are included in the final result res as they are.
Second argument of rbind: for rows with two unique elements (e.g. 1 1 9, 3 5 3), we take the product per row and keep the first row for each distinct product (because, for example, 3 3 5, 3 5 3 and 5 3 3 all have the same product).
Note that this is only approximate: rows with three distinct values are kept in every ordering (5 8 7 and 7 8 5 both survive), and different rows can share a product (1 1 4 and 1 2 2 both give 4).
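Another way to look at the problem is to keep only the rows whose values are already in non-decreasing order, since every unordered triple has exactly one such arrangement. A minimal sketch, assuming the column names a, b and c (they are illustrative, not from the question):

```r
grid <- expand.grid(a = 1:9, b = 1:9, c = 1:9)

# keep one representative per multiset: the non-decreasing arrangement
res <- subset(grid, a <= b & b <= c)

nrow(res)  # 165, i.e. choose(11, 3)
```

Rows such as 8 8 8 are kept once, and 1 1 9 stands in for 1 9 1 and 9 1 1.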

Converting a matrix to a data.table list format [duplicate]

This seems like a simple problem but I am having trouble doing this in a fast manner.
Say I have a matrix and I want to sort it and store the indices of the elements in descending order of value. Is there a quick way to do this? Right now, I extract the maximum, store the result, change it to -2, and then extract the next maximum in a for loop, which is probably the most inefficient way to do it.
My problem actually requires me to work on a 20,000 X 20,000 matrix. Memory is not an issue. Any ideas about the fastest way to do it would be great.
For example if I have a matrix
m <- matrix(c(1,4,2,3),2,2)
m
     [,1] [,2]
[1,]    1    2
[2,]    4    3
I want the result to indicate the numbers in descending order:
row col val
2 1 4
2 2 3
1 2 2
1 1 1
Here's a possible data.table solution
library(data.table)
rows <- nrow(m) ; cols <- ncol(m)
res <- data.table(
row = rep(seq_len(rows), cols),
col = rep(seq_len(cols), each = rows),
val = c(m)
)
setorder(res, -val)
res
# row col val
# 1: 2 1 4
# 2: 2 2 3
# 3: 1 2 2
# 4: 1 1 1
Edit: a base R alternative
res <- cbind(
row = rep(seq_len(rows), cols),
col = rep(seq_len(cols), each = rows),
val = c(m)
)
res[order(-res[, 3]),]
# row col val
# [1,] 2 1 4
# [2,] 2 2 3
# [3,] 1 2 2
# [4,] 1 1 1
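A variation on the base R approach sorts the linear indices first and only then converts them to row/column positions. This is a sketch rather than a benchmarked solution; whether it is faster on a 20,000 x 20,000 matrix is an assumption that would need timing:

```r
m <- matrix(c(1, 4, 2, 3), 2, 2)

# linear (column-major) positions of the elements, largest value first
ord <- order(m, decreasing = TRUE)

# arrayInd() turns linear positions into (row, col) pairs
res <- cbind(arrayInd(ord, dim(m)), m[ord])
colnames(res) <- c("row", "col", "val")
res
```

For the example matrix this lists the cells in the order (2,1), (2,2), (1,2), (1,1).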

How to transform a list of user ratings into a matrix in R

I am working on a collaborative filtering problem, and I am having problems reshaping my raw data into a user-rating matrix. I am given a rating database with columns 'movie', 'user' and 'rating'. From this database, I would like to obtain a matrix of size #users x #movies, where each row indicates a user's ratings.
Here is a minimal working example:
# given this:
ratingDB <- data.frame(rbind(c(1,1,1),c(1,2,NA),c(1,3,0), c(2,1,1), c(2,2,1), c(2,3,0),
c(3,1,NA), c(3,2,NA), c(3,3,1)))
names(ratingDB) <- c('user', 'movie', 'liked')
#how do I get this?
userRating <- matrix(data = rbind(c(1,NA,0), c(1,1,0), c(NA,NA,1)), nrow=3)
I can solve the problem using two for loops, but this of course doesn't scale well. Can anybody help me with a vectorized solution?
This can be done without any loop. It works with the function matrix:
# sort the 'liked' values (this is not necessary for the example data)
vec <- with(ratingDB, liked[order(user, movie)])
# create a matrix
matrix(vec, nrow = length(unique(ratingDB$user)), byrow = TRUE)
[,1] [,2] [,3]
[1,] 1 NA 0
[2,] 1 1 0
[3,] NA NA 1
This will transform the vector stored in ratingDB$liked to a matrix. The argument byrow = TRUE allows arranging the data in rows (the default is by columns).
Update: What to do if the NA cases are not in the data frame?
(see comment by @steffen)
First, remove the rows containing NA:
subDB <- ratingDB[complete.cases(ratingDB), ]
user movie liked
1 1 1 1
3 1 3 0
4 2 1 1
5 2 2 1
6 2 3 0
9 3 3 1
The full data frame can be reconstructed. The function expand.grid is used to generate all combinations of user and movie:
full <- setNames(with(subDB, expand.grid(sort(unique(user)), sort(unique(movie)))),
c("user", "movie"))
user movie
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2
7 1 3
8 2 3
9 3 3
Now, the information of the sub data frame subDB and the full combination data frame full can be combined with the merge function:
ratingDB_2 <- merge(full, subDB, all = TRUE)
user movie liked
1 1 1 1
2 1 2 NA
3 1 3 0
4 2 1 1
5 2 2 1
6 2 3 0
7 3 1 NA
8 3 2 NA
9 3 3 1
The result is identical to the original data frame ratingDB. Hence, the same procedure can be applied to transform it into a matrix of liked values:
matrix(ratingDB_2$liked, nrow = length(unique(ratingDB_2$user)), byrow = TRUE)
[,1] [,2] [,3]
[1,] 1 NA 0
[2,] 1 1 0
[3,] NA NA 1
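When the NA rows are missing from the data, the matrix can also be filled directly by position, without rebuilding the full grid. A minimal sketch, assuming a long-format data frame like subDB above (the names users and movies are illustrative):

```r
# ratings in long format, with the NA rows already dropped
subDB <- data.frame(user  = c(1, 1, 2, 2, 2, 3),
                    movie = c(1, 3, 1, 2, 3, 3),
                    liked = c(1, 0, 1, 1, 0, 1))

users  <- sort(unique(subDB$user))
movies <- sort(unique(subDB$movie))

# start with an all-NA users x movies matrix
userRating <- matrix(NA_real_, nrow = length(users), ncol = length(movies),
                     dimnames = list(as.character(users), as.character(movies)))

# a two-column (row, col) index matrix assigns each rating to its cell
userRating[cbind(match(subDB$user, users), match(subDB$movie, movies))] <- subDB$liked
userRating
```

Cells that never receive a rating stay NA, which matches the matrix shown above. A user or movie with no ratings at all would not appear unless it is added to users or movies by hand.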
