I have a bipartite edgelist that I would like to convert into a unipartite graph of just the 'from' nodes. Because of the size of the data, I need to do this with a sparse matrix; easier solutions such as bipartite.projection(graph) freeze everything. My data looks like:
To From Weight
A1 B2 1
A1 B3 1
A2 B2 1
A3 B3 1
A3 B4 1
A4 B2 1
A4 B3 1
Using the Matrix package, I've created a sparse matrix with the correct dimensions, but for some reason only the diagonal is populated. For the sparse matrix I used:
myMat <- sparseMatrix(i = as.integer(as.factor(df$From)),
                      j = as.integer(as.factor(df$From)),
                      x = df$Weight,
                      dimnames = list(levels(as.factor(df$From)),
                                      levels(as.factor(df$From))))
returns:
B2 B3 B4
B2 2 . .
B3 . 2 .
B4 . . 1
The diagonal sums the weights, but the rest of the matrix is empty where I expected it to be filled with summed weights as well.
What I'd like is:
B2 B3 B4
B2 . 2 .
B3 2 . 1
B4 . 1 .
This is a one-mode matrix of the bipartite graph, with matrix[i,j] representing the count of df$To values connecting any two df$From values. That count would then be the weight I would give to edges in any network graph.
What about as_adjacency_matrix(), which returns a sparse matrix?
library(igraph)
df <- read.table(textConnection("
To From Weight
A1 B2 1
A1 B3 1
A2 B2 1
A3 B3 1
A3 B4 1"), header = TRUE, stringsAsFactors = FALSE)
g <- graph.data.frame(df[,1:2])
V(g)$type <- V(g)$name %in% df[,1]
is.bipartite(g)
as_adjacency_matrix(g)
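That returns the full two-mode adjacency matrix. As a possible follow-up (my addition, not part of the original answer), you can slice out the incidence block using the type flags set above and cross-multiply it to get the one-mode projection of the 'from' side, which is what the question asks for:
library(Matrix)
A <- as_adjacency_matrix(g, sparse = TRUE)
inc <- A[V(g)$type, !V(g)$type, drop = FALSE]  # "To" rows x "From" columns
proj <- crossprod(inc)  # From x From counts of shared "To" neighbours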
I ended up using some matrix algebra rather than a predefined function to get my result. By changing the sparseMatrix call just a tiny bit and then multiplying by the transpose, I got the matrix I was looking for:
myMat <- sparseMatrix(i = as.integer(as.factor(df$From)),
                      j = as.integer(as.factor(df$To)),
                      x = 1,
                      dimnames = list(levels(as.factor(df$From)),
                                      levels(as.factor(df$To))))
finalMat <- myMat %*% t(myMat)
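One caveat worth adding (my note, not part of the original answer): the crossproduct also puts each node's degree on the diagonal, while the desired output above has an empty diagonal. With the Matrix package you can zero it out and drop the explicit zeros:
diag(finalMat) <- 0
finalMat <- drop0(finalMat)  # remove the explicit zeros from sparse storage
As an aside, tcrossprod(myMat) computes the same product as myMat %*% t(myMat) without materializing the transpose.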
Suppose you have a factor variable whose level labels come in pairs
(such as 'a1' and 'a2', 'b1' and 'b2', etc.), and these pairs have unequal n-sizes.
x <- factor(c(rep("a1", 10), rep("a2", 15), rep("b1", 5),
              rep("b2", 30), rep("c1", 33), rep("c2", 22)))
> table(x)
a1 a2 b1 b2 c1 c2
10 15 5 30 33 22
But you want to randomly downsample the larger level of each pair to equalize their n-sizes. Here's the desired outcome:
a1 a2 b1 b2 c1 c2
10 10 5 5 22 22
I have found that caret::downSample() can downsample to equalize all the levels of
a factor:
x_ds <- caret::downSample(1:115, x)
table(x_ds$Class)
a1 a2 b1 b2 c1 c2
5 5 5 5 5 5
And I have the notion to use split() in conjunction with downSample(), but I'm having trouble figuring out a way to split on the level pairs. How could this be done?
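Here is one hedged sketch of that split()/downSample() idea, assuming the pair prefix can be recovered by stripping the trailing digit from each level name:
library(caret)
# group the observation indices by pair prefix ("a", "b", "c")
pieces <- split(seq_along(x), sub("\\d+$", "", x))
# downsample within each pair, then pool the resulting Class columns
balanced <- lapply(pieces, function(i) caret::downSample(i, droplevels(x[i])))
table(unlist(lapply(balanced, function(d) as.character(d$Class))))
# a1 a2 b1 b2 c1 c2
# 10 10  5  5 22 22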
When downloading LexisNexis newspaper articles, there are often a lot of duplicated articles in the corpus. I want to remove them, and I was thinking of doing so by using cosine similarity statistics, but I'm not sure how to automate this. Any ideas?
Your question is fairly thin on details - such as a reproducible example - but it's an interesting question and challenge. So here goes.
Let's say we have a corpus consisting of two sets of similar documents, { (a1, a2, a3), (b1, b2) } where the letters indicate similarity. We want to keep just one document when the others are "duplicates", defined as similarity exceeding a threshold, say 0.80.
We can use textstat_simil() to generate a similarity matrix, and then form pairwise sets directly from the returned dist object, and then keep just one of the similar sets.
library("quanteda")
# Loading required package: quanteda
# Package version: 1.3.14
mydocs <- c(a1 = "a a a a a b b c d w g j t",
            b1 = "l y y h x x x x x y y y y",
            a2 = "a a a a a b c s k w i r f",
            b2 = "p q w e d x x x x y y y y",
            a3 = "a a a a a b b x k w i r f")
mydfm <- dfm(mydocs)
(sim <- textstat_simil(mydfm))
# a1 b1 a2 b2
# b1 -0.22203788
# a2 0.80492203 -0.23090513
# b2 -0.23427416 0.90082239 -0.28140219
# a3 0.81167608 -0.09065452 0.92242890 -0.12530944
# create a data.frame of the unique pairs and their similarities
sim_pair_names <- t(combn(docnames(mydfm), 2))
sim_pairs <- data.frame(sim_pair_names,
                        sim = as.numeric(sim),
                        stringsAsFactors = FALSE)
sim_pairs
# X1 X2 sim
# 1 a1 b1 -0.22203788
# 2 a1 a2 0.80492203
# 3 a1 b2 -0.23427416
# 4 a1 a3 0.81167608
# 5 b1 a2 -0.23090513
# 6 b1 b2 0.90082239
# 7 b1 a3 -0.09065452
# 8 a2 b2 -0.28140219
# 9 a2 a3 0.92242890
# 10 b2 a3 -0.12530944
Subsetting this on our threshold condition, we can extract the names of the unlucky documents to be dropped, and feed this to a logical condition in dfm_subset().
# set the threshold for similarity
threshold <- 0.80
# discard one of the pair if similarity > threshold
todrop <- subset(sim_pairs, select = X1, subset = sim > threshold, drop = TRUE)
todrop
# [1] "a1" "a1" "b1" "a2"
# then subset the dfm, keeping only the "keepers"
dfm_subset(mydfm, !docnames(mydfm) %in% todrop)
# Document-feature matrix of: 2 documents, 20 features (62.5% sparse).
# 2 x 20 sparse Matrix of class "dfm"
# features
# docs a b c d w g j t l y h x s k i r f p q e
# b2 0 0 0 1 1 0 0 0 0 4 0 4 0 0 0 0 0 1 1 1
# a3 5 2 0 0 1 0 0 0 0 0 0 1 0 1 1 1 1 0 0 0
Other solutions to this problem of similar documents would be to form them into clusters, or to reduce the document matrix using principal components analysis, along the lines of latent semantic analysis.
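For instance, here is a rough sketch of the clustering idea (my addition, reusing sim and threshold from above): cut a hierarchical clustering of the dissimilarities at 1 - threshold and keep the first document of each cluster.
hc <- hclust(as.dist(1 - as.matrix(sim)))  # similarities -> distances
cl <- cutree(hc, h = 1 - threshold)        # near-duplicates share a cluster
dfm_subset(mydfm, !duplicated(cl))         # keep one document per cluster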
You already received some excellent answers. But if you prefer a more automated approach targeted at your specific use case, you can use the package LexisNexisTools (which I wrote). It comes with a function called lnt_similarity(), which does exactly what you were looking for. I wrote a quick tutorial with mock data here.
The main difference between the solutions here and in lnt_similarity() is that I also take into account word order, which can make a big difference in some rare cases (see this blog post).
I also suggest you think carefully about thresholds, as you might otherwise remove some articles wrongfully. I included a function called lnt_diff() to visualize the difference between two articles, so you can get a better grip on the data you are removing.
If you have thousands of documents, storing all the similarity scores takes a lot of RAM, but you can set a minimum threshold in textstat_proxy(), the underlying function of textstat_simil().
In this example, cosine similarity scores smaller than 0.9 are all ignored.
library("quanteda")
mydocs <- c(a1 = "a a a a a b b c d w g j t",
            b1 = "l y y h x x x x x y y y y",
            a2 = "a a a a a b c s k w i r f",
            b2 = "p q w e d x x x x y y y y",
            a3 = "a a a a a b b x k w i r f")
mydfm <- dfm(mydocs)
(sim <- textstat_proxy(mydfm, method = "cosine", min_proxy = 0.9))
# 5 x 5 sparse Matrix of class "dsTMatrix"
# a1 b1 a2 b2 a3
# a1 1 . . . .
# b1 . 1.0000000 . 0.9113423 .
# a2 . . 1.0000000 . 0.9415838
# b2 . 0.9113423 . 1.0000000 .
# a3 . . 0.9415838 . 1.0000000
matrix2list <- function(x) {
    # label each stored value with its row name, then split the values by column
    names(x@x) <- rownames(x)[x@i + 1]
    split(x@x, factor(x@j + 1, levels = seq(ncol(x)), labels = colnames(x)))
}
matrix2list(sim)
# $a1
# a1
# 1
#
# $b1
# b1
# 1
#
# $a2
# a2
# 1
#
# $b2
# b1 b2
# 0.9113423 1.0000000
#
# $a3
# a2 a3
# 0.9415838 1.0000000
See https://koheiw.net/?p=839 for the performance differences.
I have a dataframe with n columns and I need to obtain combinations of its variables:
E.g.:
df <- data.frame(A = c("a1","a2","a3","a4","a5","a6"),
                 B = c("a1","a1","a3","a3","a5","a5"),
                 C = c("a1","a1","a1","a3","a4","a4"),
                 D = c("a1","a1","a1","a3","a4","a5"))
I need to make a list with n-1 elements, each containing the unique rows of a subset of the dataframe's columns. The first element holds the unique rows over all columns, from the first to the last; each subsequent element drops the first column of the previous subset. Like this:
myList <- list(unique(df[, 1:ncol(df)]),
               unique(df[, 2:ncol(df)]),
               unique(df[, 3:ncol(df)]))
I managed to solve this with a for loop:
myList <- list()
for (i in 1:(ncol(df) - 1)) {
    myList[[i]] <- unique(df[, i:ncol(df)])
}
but I was left wondering whether there was a faster and more elegant way to do this.
With sapply():
sapply(1:(ncol(df) - 1),
       FUN = function(x, nc, df) unique(df[, x:nc]),
       nc = ncol(df), df = df)
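A small aside (mine): because the returned elements are data frames of different shapes, sapply() cannot simplify them and returns a list anyway, so lapply() states that intent directly:
lapply(seq_len(ncol(df) - 1), function(i) unique(df[, i:ncol(df)]))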
An elegant solution would be a recursion:
func <- function(df, n, lst) {
    if (ncol(df) == n) return(lst)
    func(df, n + 1, c(lst, list(unique(df[n:ncol(df)]))))
}
#> func(df,1, list())
#[[1]]
# A B C D
#1 a1 a1 a1 a1
#2 a2 a1 a1 a1
#3 a3 a3 a1 a1
#4 a4 a3 a3 a3
#5 a5 a5 a4 a4
#6 a6 a5 a4 a5
#[[2]]
# B C D
#1 a1 a1 a1
#3 a3 a1 a1
#4 a3 a3 a3
#5 a5 a4 a4
#6 a5 a4 a5
#[[3]]
# C D
#1 a1 a1
#4 a3 a3
#5 a4 a4
#6 a4 a5
I would like to achieve the following data.frame in R:
i1 i2 i3
1 A1 A2 A3
2 No A2 A3
3 A1 No A3
4 No No A3
5 A1 A2 No
6 No A2 No
7 A1 No No
8 No No No
In each column the variable can either be the concatenated string "A" and the column number or "No". The data.frame should contain all possible combinations.
My idea was to use expand.grid, but I don't know how to create the list dynamically. Or is there a better approach?
expand.grid(list(c("A1", "No"), c("A2", "No"), c("A3", "No")))
I guess you could create your own helper function, something like this:
MyList <- function(n) expand.grid(lapply(paste0("A", seq_len(n)), c, "No"))
Then simply pass it the number of elements (e.g., 3)
MyList(3)
# Var1 Var2 Var3
# 1 A1 A2 A3
# 2 No A2 A3
# 3 A1 No A3
# 4 No No A3
# 5 A1 A2 No
# 6 No A2 No
# 7 A1 No No
# 8 No No No
Alternatively, you could also try data.table's CJ equivalent, which should be much more efficient than expand.grid for a big n:
library(data.table)
DTCJ <- function(n) do.call(CJ, lapply(paste0("A", seq_len(n)), c, "No"))
DTCJ(3) # will return a sorted cross join
# V1 V2 V3
# 1: A1 A2 A3
# 2: A1 A2 No
# 3: A1 No A3
# 4: A1 No No
# 5: No A2 A3
# 6: No A2 No
# 7: No No A3
# 8: No No No
Another option is using Map with expand.grid
n <- 3
expand.grid(Map(c, paste0('A', seq_len(n)), 'No'))
Or
expand.grid(as.data.frame(rbind(paste0('A', seq_len(n)), 'No')))
Another option, only using the most fundamental functions in R, is to use the indices:
df <- data.frame(V1 = c('A','A','A','A', rep('No', 4)),
                 V2 = c('A','A','No','No','A','A','No','No'),
                 V3 = c('A','No','A','No','A','No','A','No'),
                 stringsAsFactors = FALSE)
to get the row and col indices of the elements we need to change:
pos <- which(df != 'No')
rindex <- ((pos - 1) %% nrow(df)) + 1   # row indices (a plain %% would return 0 for the last row)
cindex <- ((pos - 1) %/% nrow(df)) + 1  # column indices
the solution is then a single matrix-indexed assignment:
idx <- matrix(c(rindex, cindex), ncol = 2)
df[idx] <- paste0(df[idx], cindex)
> df
V1 V2 V3
1 A1 A2 A3
2 A1 A2 No
3 A1 No A3
4 A1 No No
5 No A2 A3
6 No A2 No
7 No No A3
8 No No No
Suppose the following situation. There are two tables, each one of them with data of different quality. Both of them have the same variables A, B and C. Variables in the first table are called A1, B1 and C1, while those in the second table are called A2, B2 and C2.
The first table can be updated with the second table. There are six possible mixed combinations (in addition to keeping all three variables from a single table):
A1, B1, C2
A1, B2, C1
A2, B1, C1
A1, B2, C2
A2, B1, C2
A2, B2, C1
The question is how to get that in R. What I'm using is what follows:
require(utils)
require(stringr)
vars <- c("A1", "B1", "C1", "A2", "B2", "C2")
combine <- function(data, n){
    com1 = combn(data, n)  # make all combinations
    com2 = c(str_sub(com1, end = -2L))  # strip the trailing number from each name
    com3 = matrix(com2, nrow = dim(com1)[1], ncol = dim(com1)[2])  # vector to matrix
    com3 = split(com3, rep(1:ncol(com3), each = nrow(com3)))  # matrix to list
    com3 = lapply(com3, duplicated)  # flag duplicated names within each combination
    com3 = lapply(com3, function(X){X[which(!any(X == TRUE))]})  # keep only duplicate-free combinations
    pos = which(as.numeric(com3) == 0)  # get positions of the duplicate-free combinations
    com3 = com1[, pos]  # pull those columns from the original combinations
    com3 = split(com3, rep(1:ncol(com3), each = nrow(com3)))  # matrix to list
    com3 = lapply(com3, sort)  # sort alphabetically
    com3 = as.data.frame(com3, stringsAsFactors = FALSE)  # list to data frame
    res = list(positions = pos, combinations = com3)  # return positions and combinations
    return(res)
}
combine(vars, 3)
$positions
[1] 1 4 6 10 11 15 17 20
$combinations
X1 X2 X3 X4 X5 X6 X7 X8
1 A1 A1 A1 A1 A2 A2 A2 A2
2 B1 B1 B2 B2 B1 B1 B2 B2
3 C1 C2 C1 C2 C1 C2 C1 C2
I'd like to know if anyone knows a more straightforward solution than creating all possible combinations and afterwards cleaning up the result as my function does.
You're overthinking the problem. Just use expand.grid:
> expand.grid(c('A1','A2'),c('B1','B2'),c('C1','C2'))
Var1 Var2 Var3
1 A1 B1 C1
2 A2 B1 C1
3 A1 B2 C1
4 A2 B2 C1
5 A1 B1 C2
6 A2 B1 C2
7 A1 B2 C2
8 A2 B2 C2
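If you would rather not hard-code the pairs, one possible generalization (my sketch, assuming the trailing digit always marks the table of origin) is to split the variable names on their prefix before calling expand.grid:
vars <- c("A1", "B1", "C1", "A2", "B2", "C2")
expand.grid(split(vars, sub("\\d+$", "", vars)))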