Second-order neighbors of graph nodes in R

I am looking for an efficient way to find the exact-distance neighborhoods of all nodes in a large graph. Even though igraph stores graphs as sparse matrices, igraph::ego blows up:
library(Matrix)
library(igraph)
library(ggplot2)

N <- 10^(1:5)
runtimes <- function(N) {
  g <- erdos.renyi.game(N, 1/N)
  system.time(ego(g, 2, mindist = 2))[3]
}
runtime <- sapply(N, runtimes)
qplot(log10(N), runtime, geom = "line")
Is there a more efficient way?

Using adjacency matrices directly provides a significant improvement.
# sparse adjacency-matrix calculation of indirect neighbors -------------------
library(Matrix)
library(data.table)

diff_sparse_mat <- function(A, B) {
  # Difference between sparse matrices.
  # Input: sparse matrices A and B
  # Output: C = (A & !B), element-wise difference, treating B as logical
  stopifnot(identical(dim(A), dim(B)))
  A <- as(A, "generalMatrix")
  AT <- as.data.table(summary(as(A, "TsparseMatrix")))
  setkeyv(AT, c("i", "j"))
  B <- drop0(B)
  B <- as(B, "generalMatrix")
  BT <- as.data.table(summary(as(B, "TsparseMatrix")))
  setkeyv(BT, c("i", "j"))
  C <- AT[!BT]  # anti-join: entries of A not present in B
  if (length(C) == 2) {  # pattern matrix: summary() returned only i and j
    return(sparseMatrix(i = C$i, j = C$j, dims = dim(A)))
  } else {
    return(sparseMatrix(i = C$i, j = C$j, x = C$x, dims = dim(A)))
  }
}
distance2_peers <- function(adj_mat) {
  # Returns a matrix of indirect neighbors, excluding the diagonal
  # Input: adjacency matrix A (assumed symmetric)
  # Output: (A %*% A & !A) with zero diagonal
  indirect <- forceSymmetric(adj_mat %*% adj_mat)
  indirect <- diff_sparse_mat(indirect, adj_mat)  # exclude direct neighbors
  indirect <- diff_sparse_mat(indirect, Diagonal(n = dim(indirect)[1]))  # exclude diagonal
  return(indirect)
}
For the Erdős–Rényi example, a network of 10^7 nodes (rather than 10^5) can now be analyzed in about half a minute:
N <- 10 ^ (1:7)
runtimes <- function(N) {
  g <- erdos.renyi.game(N, 1 / N, directed = FALSE)
  system.time(distance2_peers(as_adjacency_matrix(g)))[3]
}
runtime <- sapply(N, runtimes)
qplot(log10(N), runtime, geom = "line")
The resulting matrix contains at (i, j) the number of paths of length 2 from i to j (excluding paths that pass through i itself).
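If you need ego()-style neighbor lists rather than a matrix, the column-compressed slots of the result can be split cheaply; a sketch (my addition), relying on the symmetry of the result:
g <- erdos.renyi.game(1000, 1/1000, directed = FALSE)
d2 <- distance2_peers(as_adjacency_matrix(g))
col_of_entry <- rep(seq_len(ncol(d2)), diff(d2@p))  # column of each stored entry
peers <- split(d2@i + 1L, col_of_entry)  # distance-2 peers per node; nodes with none are dropped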

Finding a subset in a large dataset (>100) optimizing a parameter

I have a large matrix having this structure :
A B C
A' 9 2 0
B' 2 8 0
C' 0 1 7
The diagonal terms represent the interaction of an individual (A) with his/her brother/sister (A'). Off diagonal elements represent the interaction of an individual with individuals not part of the family.
From a large set of individuals (say a few hundred), I would like to find subsets (say of 10) of individuals having minimal interactions with individuals not part of the family.
I was thinking of using a genetic algorithm (to optimize a parameter that I could calculate from the matrix) but could not find any algorithm that deals with subsets.
Is there a package in R (preferably) that does this?
Thanks
I'll outline a binary genetic algorithm approach using the GA package interface, which you can use as a baseline for your own implementation. Because you have the constraint that the subset must have a specific length, I will create mutation and crossover operators that do not break that specification (otherwise those operators would mostly create infeasible individuals).
mutation
mutation <- function(obj, parent){
  vec <- as.vector(obj@population[parent, ])
  ind1 <- which(vec == 1)
  ind2 <- setdiff(seq_along(vec), ind1)
  ind1 <- sample(ind1, 1)
  ind2 <- sample(ind2, 1)
  vec[c(ind1, ind2)] <- c(0, 1)
  vec
}
Mutation picks one position holding a 1 and one position holding a 0 and changes their values to 0 and 1 respectively.
crossover
sizeCrossover <- function(size){
  function(obj, parents){
    vec1 <- obj@population[parents[1], ] == 1
    vec2 <- obj@population[parents[2], ] == 1
    c1 <- vec1 & vec2
    c1[sample(which(!c1), size - sum(c1))] <- 1
    c2 <- vec1 | vec2
    c2[sample(which(c2), sum(c2) - size)] <- 0
    list(children = rbind(c1, c2), fitness = c(NA, NA))
  }
}
Crossover is a variation of arithmetic crossover. In the case of & it additionally needs to change some 0s to 1s, and in the case of | some 1s to 0s, to restore the required subset size.
initial population
initialPopulation <- function(popSize, N, size){
  indices <- do.call(
    rbind,
    mapply(
      function(x, y) cbind(x, y),
      seq_len(popSize),
      replicate(popSize, sample(seq_len(N), size), simplify = FALSE),
      SIMPLIFY = FALSE
    )
  )
  mm <- matrix(0, nrow = popSize, ncol = N)
  mm[indices] <- 1
  mm
}
If you create a completely random initial population, most (if not all) individuals will be infeasible. You need to create a feasible initial population.
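A quick feasibility check (my addition, not part of the original answer):
# every generated individual should contain exactly `size` ones
pop <- initialPopulation(popSize = 5, N = 20, size = 10)
stopifnot(all(rowSums(pop) == 10))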
fitness
fitness <- function(vec, m, size){
  # size is unused here but must be accepted: GA::ga passes it through
  indices <- which(vec == 1)
  -sum(m[indices, indices])
}
GA::ga performs maximization, hence the minus sign.
random data and parameters
N <- 200 # matrix dimensions
size <- 10 # subset length
popSize <- 100 # population size for genetic algorithm
m <- matrix(sample(0:10, N^2, TRUE), nrow = N)
diag(m) <- 0
ga
obj <- GA::ga(
  type = "binary",
  nBits = N,
  run = 500,
  maxiter = 10000,
  popSize = popSize,
  fitness = fitness,
  m = m,
  size = size,
  suggestions = initialPopulation(popSize, N, size),
  mutation = mutation,
  crossover = sizeCrossover(size)
)
subset <- which(obj@solution[1, ] == 1)
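To confirm the run respected the size constraint and to inspect the achieved objective (my addition; fitnessValue is the slot GA uses for the best fitness):
stopifnot(length(subset) == size)  # the subset has exactly `size` members
sum(m[subset, subset])             # total interaction within the subset
-obj@fitnessValue                  # the same value, as reported by GA::ga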
note
I'm using sample as modified in Advanced R, so that sampling from a length-one vector returns that element instead of sampling from 1:x:
`%||%` <- function(a, b) if (is.null(a)) b else a  # null-default operator; in base R >= 4.4 and rlang
sample <- function(x, size = NULL, replace = FALSE, prob = NULL) {
  size <- size %||% length(x)
  x[sample.int(length(x), size, replace = replace, prob = prob)]
}
The problem in the question is not fully specified, so assume it is as follows. Let v be the vector of column sums of m - diag(diag(m)) and let k be an input specifying the required number of individuals in the subset. We wish to find the column names corresponding to the k smallest values in v; in the language of integer linear programming, we want the 0/1 vector x that minimizes v'x subject to sum(x) = k.
We use the inputs and v defined in the Note at the end.
1) sort To minimize this we simply take the column names corresponding to the k smallest values of v.
names(head(sort(v), k))
## [1] "C" "A"
2) knapsack This can also be expressed as a knapsack problem. knapsack maximizes, so we use max(v) - v to get the effect of minimization.
library(adagio)
res.knap <- knapsack(rep(1, length(v)), max(v) - v, k)
names(v)[res.knap$indices]
## [1] "A" "C"
3) linear programming We can also use integer linear programming with the following solution.
library(lpSolve)
res.lp <- lp("min", v, t(rep(1, length(v))), "=", k, all.bin = TRUE)
res.lp
## Success: the objective function is 2
names(v)[res.lp$solution == 1]
## [1] "A" "C"
4) genalg We use the rbga.bin genetic algorithm.
library(genalg)
set.seed(13)
f <- function(x) if (sum(x) == k) sum(x * v) else Inf
res.ga <- rbga.bin(size = length(v), evalFunc = f, popSize = 200,
mutation = .01, iters = 400)
cat(summary(res.ga))
## ...snip...
## GA Results
## Best Solution : 1 0 1
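The best chromosome can also be pulled out of the returned object programmatically (my addition; rbga.bin keeps the final population and its evaluations):
best <- res.ga$population[which.min(res.ga$evaluations), ]
names(v)[best == 1]
## [1] "A" "C"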
Note
The inputs are assumed to be:
k <- 2
m <- structure(c(9L, 2L, 0L, 2L, 8L, 1L, 0L, 0L, 7L), dim = c(3L, 3L),
               dimnames = list(c("A'", "B'", "C'"), c("A", "B", "C")))
v <- colSums(m - diag(diag(m)))
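For reference, v evaluates to out-of-family interaction sums of 2, 3 and 0 for A, B and C respectively, so the k = 2 smallest entries are indeed C and A:
v
## A B C
## 2 3 0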

Get the upper triangular matrix from a lower triangular matrix in R

I have a vector which is composed of data from a lower triangular matrix.
m_lower <- c(0.3663761172,0.4863082351,0.7465202620,0.4965009484,0.0749920755,4.4082686453,3.2714448866,0.7604404092,1.5994592726,0.2677065512,1.4247358706,1.8448569996,1.9888525802,0.6684931466,0.0909124206,1.2443815645,0.1329758523,0.4777177616,4.7059557222,0.0018111412,3.2430708925,1.7024842083,0.2973714654,1.8145898881,0.8277291485,0.4898066476,0.2827415558,0.2652730958,0.4801885476,1.9012667391,4.2655464241,0.6021593916,0.7127273433,3.6781491268,0.2084725974,0.3147488236,0.0977461927,0.1689097181,0.2176950021,0.0114681239,0.2621692606,0.1242180116,0.0530288130,0.0065052254,0.1241324065,0.3803137139,0.3877225011,0.1456193524,0.0238036494,0.6558033727,0.8803106660,0.0846636279,0.0375347721,0.4605712015,4.7307220442,0.5978980924,4.9502297322,1.6783343118,0.1872433954,0.0096240780,2.5275913377,0.7891683399,0.2747600533,0.8053761872,0.1541668145,0.2320415088,0.9950147294,0.5487573930,0.4876815384,0.0389633056,0.9807247107,2.6210927047,0.1429777740,0.1381647168,0.5751212069,3.1234289486,6.2097230541,0.6347384457,0.2290039612,0.1133552059,0.1790568649,0.0375669532,0.8222290651,0.1216843377,0.0296845741,0.1042203024,1.8920649062,1.3812280928,2.7648752363,0.0506352673,1.9935626852,1.4874099504,0.2729502243,0.2616840959,0.2617474854,0.0813626319,0.5760642438,0.3235973298,0.2360780322,0.4039630140,0.1098278657,0.2043569205,0.3537233928,0.1890846579,0.1392357526,5.5852889211,0.8895464606,4.4717950139,1.4398239163,4.1612760128,1.5346225800,0.6834217952,1.7567290546,1.1563462407,0.0455001087,0.1562431558,0.8705330892,0.5275899123,0.3237870293,1.2863038275,1.6973196296,0.5847098190,2.3223925288,0.5613444343,2.2124772293,1.4562826709,0.8400955943,0.1424881993,0.6683791808,0.8525094744,0.3708594594,1.1009684274,1.5736877623,0.2093061150,0.5452534503,7.3930448949,0.1652496048,0.8540872055,0.1483290331,0.0281819105,0.7922539949,0.2942266303,0.0628311599,0.1691295505,2.4279363796,0.1212368185,0.5582404808,0.1328161192,1.0350547406,5.7389081114,0.1187634510,0.2655712856,0.2854689127,0.1898174381,0.3563226261,0.5320306856,0.1195664207,0.9398889664,0.3591286333,0.0931564303,0.0782904570,5.1694252396,0.3460867771,0.4119580757,0.1127626303,0.6218762663,13.2700188970,0.1119347335,0.4291885095,0.3251606306,4.7433868446,1.9050191397,0.1906132162,0.1333666643,0.0419028810,2.6071857096,0.3753732407,0.3117898454,0.0429600236,0.1713773435,10.7412818563,1.7645514498,0.2214461620,1.7688809751,0.6670375727,0.3626269616,0.0956151063,2.0777759737,0.1852685974,0.3278837993)
[image: screenshot of the lower-triangular matrix]
How can I get the whole matrix or the upper triangular matrix from my vector which is composed of data from a lower triangular matrix?
I have no idea if this is what you want:
n <- (sqrt(1 + 8*length(m_lower)) + 1)/2  # solve n*(n-1)/2 = length(m_lower) for n
m <- matrix(nrow = n, ncol = n)
m[lower.tri(m)] <- m_lower
You can use lower.tri and upper.tri to achieve that, e.g.:
# Create some equivalent, but smaller toy data
n <- 5
X <- matrix(rnorm(n^2), n, n)  # random n x n matrix
m <- (t(X) + X)/2              # make a symmetric "m"
m_lower <- m[lower.tri(m, diag = TRUE)]
m_lower
# [1] -0.475060360 0.439727314 -0.332462307 0.628518993 -1.528984770 -0.115709110 -1.172260961
# [8] 0.006313545 0.018025707 0.016831925 0.228239570 1.450523622 -0.539834781 -0.152822864
#[15] -0.522060544
# Determine dimension (if square and diagonal included in m_lower)
l <- length(m_lower)
n <- (sqrt(1 + 8*l) - 1)/2  # Solve n*(n + 1)/2 = l for n
# Reconstruct
m2 <- matrix(NA, n, n)
m2[lower.tri(m2, diag = TRUE)] <- m_lower
m2[upper.tri(m2)] <- t(m2)[upper.tri(m2)] # If symmetric, fill also upper half
# Check
all(m == m2)
# [1] TRUE
Edit: Now wrapped into a function which takes additional arguments specifying whether the reconstruction should be symmetric and/or whether the diagonal is included in m_lower (see the answer from ThomasIsCoding).
# Create function for reconstruction
reconstruct <- function(m_lower, diag = TRUE, symmetric = TRUE) {
  # Determine dimension (assuming a square matrix)
  l <- length(m_lower)
  n <- (sqrt(1 + 8*l) + ifelse(diag, -1, 1))/2  # solve n*(n+1)/2 = l, or n*(n-1)/2 = l without diagonal
  m <- matrix(NA, n, n)
  # Reconstruct
  m[lower.tri(m, diag = diag)] <- m_lower
  if (symmetric) {  # if symmetric, fill also the upper half
    m[upper.tri(m)] <- t(m)[upper.tri(m)]
  }
  return(m)
}
m3 <- reconstruct(m_lower)
# Check
all(m == m3)
# [1] TRUE
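The diag = FALSE branch can be exercised the same way (my addition), feeding it a strict lower triangle:
m_strict <- m[lower.tri(m)]  # 10 values for the 5 x 5 example
m4 <- reconstruct(m_strict, diag = FALSE)
all(m[lower.tri(m)] == m4[lower.tri(m4)])
# [1] TRUE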

optimise algorithm for building a graph based on node weights

I am trying to improve a function that builds a network based on a score calculated from node attributes. The function tries to find the best subnetwork of a graph, maximizing the product of the nodes' attributes.
The function starts at a random node and searches among its first-degree neighbors; if some neighbors' scores satisfy a threshold, those neighbors are added, and the process continues until no more can be added (i.e., adding a neighbor no longer produces the desired increase in score). If no first-degree neighbor yields an increase in score, the function looks at second-degree neighbors. In that case there are usually several paths connecting the candidate node (the second-degree neighbor); the chosen path is the shortest one with the highest weight (one of the node attributes).
I could do some parallelization of the code, although I don't know how to implement it for this type of function.
The function is the following:
build_network <- function(G, seed, d = 2){
  net <- G
  score.fun <- function(g){
    Za <- sum(V(g)$weight * V(g)$RWRNodeweight) / sqrt(sum(V(g)$RWRNodeweight^2))
    k <- vcount(g)
    # genesets.length.null.stat is a list with the median of Za and sd of Za
    # calculated for 1000 replicates of networks of size k
    tmp <- genesets.length.null.stat[[as.character(k)]]
    Sa <- (Za - tmp[1]) / tmp[2]
  }
  best.fun <- function(in.nodes, out.nodes) {
    score <- -Inf; best <- character()
    for (node in out.nodes){
      subG.update <- induced.subgraph(net, c(in.nodes, node))
      if (score.fun(subG.update) > score){
        score <- score.fun(subG.update)
        best <- node
      }
    }
    list("node" = best, "score" = score)
  }
  subG <- induced.subgraph(net, seed)
  if (!is.connected(subG)) {  # the seed must be connected
    stop("Input seeds are disjoint")
  }
  while (TRUE) {
    in.nodes <- V(subG)$name
    node_num <- vcount(subG)
    subsum <- score.fun(subG)
    for (rad in 1:d) {
      tmp.neigh <- unlist(neighborhood(net, order = rad, nodes = V(subG)$name))
      pot.nodes <- V(net)[tmp.neigh]$name
      out.nodes <- setdiff(pot.nodes, in.nodes)
      if (length(out.nodes) == 0) break
      best_node <- best.fun(in.nodes, out.nodes)
      new_score <- best_node$score
      best_node <- best_node$node
      if (new_score > subsum + 0.01) {
        # node2treePath retrieves the shortest path with the highest node weights
        tmp <- unlist(lapply(best_node, function(x) node2treePath(net, V(subG)$name, x)))
        in.nodes <- c(tmp, V(subG)$name)
        subG <- induced.subgraph(net, in.nodes)
        break
      }
    }
    if (node_num == vcount(subG)) break
  }
  return(subG)
}
I am trying to apply this function to a graph of ~10,000 nodes. Here is an approximation of the code for running the function:
### generate some example data
library(igraph)
my_graph <- erdos.renyi.game(10000, 0.0003)
V(my_graph)$name <- 1:vcount(my_graph)
V(my_graph)$weight <- rnorm(10000)
V(my_graph)$RWRNodeweight <- runif(10000, min=0, max=0.05)
### Run the function
sublist <- list()
for (node in V(my_graph)$name) {
  subnet <- build_network(my_graph, node, 2)
  sublist[[node]] <- subnet
}
EDIT: here is the dput of head(genesets.length.null.stat)
structure(list(`1` = c(1.01397367504035, 1.18858228819048), `2` = c(1.61970348041337, 1.30189433386605), `3` = c(2.11767222957028, 1.36222065695878), `4` = c(2.47710421934929, 1.36968129959296), `5` = c(2.776011866622, 1.36318885187196), `6` = c(3.16885126246671, 1.42577861995897)), .Names = c("1", "2", "3", "4", "5", "6"))
Here is the node2treePath function:
node2treePath <- function(G, Tnodes, node){
  tmp.path <- get.all.shortest.paths(G, node, Tnodes)$res
  tmp.l <- unlist(lapply(tmp.path, length))
  index <- which(tmp.l == min(tmp.l))
  tmp.path <- tmp.path[index]
  tmp.sum <- unlist(lapply(tmp.path, function(x) return(sum(V(G)[x]$weight))))
  index <- which(tmp.sum == max(tmp.sum))
  selected.path <- tmp.path[index]
  collect <- unlist(lapply(selected.path, function(x) return(V(G)[x]$name)))
  return(collect)
}
For the logic you want (and I imagine you may wish to change it in ways incompatible with the other answers), the following code is about 30% faster. I used Rprof and profr and recoded some slow bits in trivial ways, e.g. returning an anonymous pair instead of a named list pair from one of your functions. The numerically named list with pairs of values for genesets.length.null.stat is very inefficient; I replaced it with two numeric vectors. You also call the V function a lot, which was a big time consumer: you can call it once and then query the result as needed.
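For completeness, this is the kind of profiling workflow referred to above (a sketch, assuming my_graph and the original build_network from the question are in scope):
Rprof("prof.out")
invisible(build_network(my_graph, V(my_graph)$name[1], 2))
Rprof(NULL)
head(summaryRprof("prof.out")$by.self)  # hot spots by self time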
# node2treePath is a function to retrieve the shortest path with the highest node weights
node2treePath_jw <- function(G, Tnodes, node){
  tmp.path <- get.all.shortest.paths(G, node, Tnodes)$res
  tmp.l <- vapply(tmp.path, length, integer(1))
  index <- which(tmp.l == min(tmp.l))
  tmp.path <- tmp.path[index]
  Vg <- V(G)
  tmp.sum <- vapply(tmp.path, function(x) sum(Vg[x]$weight), numeric(1))
  index <- which(tmp.sum == max(tmp.sum))
  selected.path <- tmp.path[index]
  sapply(selected.path, function(x) Vg[x]$name)
}
build_network_jw <- function(net, seed, d = 2){
  score.fun <- function(Vg, k){
    Za <- sum(Vg$weight * Vg$RWRNodeweight) / sqrt(sum(Vg$RWRNodeweight^2))
    (Za - genesets_jack_a[k]) / genesets_jack_b[k]
  }
  best.fun_jw <- function(in.nodes, out.nodes) {
    score <- -Inf
    best <- character()
    for (node in out.nodes) {
      subG.update <- induced.subgraph(net, c(in.nodes, node))
      Vsgu <- V(subG.update)
      Vsgu_count <- vcount(subG.update)
      sf <- score.fun(Vsgu, Vsgu_count)
      if (sf > score) {
        score <- sf
        best <- node
      }
    }
    list(best, score)
  }
  subG <- induced.subgraph(net, seed)
  if (!is.connected(subG)) {  # the seed must be connected
    stop("Input seeds are disjoint")
  }
  while (TRUE) {
    VsubG <- V(subG)
    Vnet <- V(net)
    in.nodes <- VsubG$name
    node_num <- vcount(subG)
    subsum <- score.fun(VsubG, node_num)
    for (rad in 1:d) {  # d = 2
      tmp.neigh <- unlist(neighborhood(net, order = rad, nodes = VsubG$name))
      pot.nodes <- Vnet[tmp.neigh]$name
      out.nodes <- setdiff(pot.nodes, in.nodes)
      if (length(out.nodes) == 0) break
      best_node <- best.fun_jw(in.nodes, out.nodes)
      new_score <- best_node[[2]]
      best_node <- best_node[[1]]
      if (new_score > subsum + 0.01) {
        tmp <- sapply(best_node, function(x) node2treePath_jw(net, VsubG$name, x))
        in.nodes <- c(tmp, VsubG$name)
        subG <- induced.subgraph(net, in.nodes)
        break
      }
    }
    if (node_num == vcount(subG)) break
  }
  subG
}
# The original functions are repeated below (with a stub for
# genesets.length.null.stat) so that the benchmark at the end is self-contained.
node2treePath <- function(G, Tnodes, node){
  tmp.path <- get.all.shortest.paths(G, node, Tnodes)$res
  tmp.l <- unlist(lapply(tmp.path, length))
  index <- which(tmp.l == min(tmp.l))
  tmp.path <- tmp.path[index]
  tmp.sum <- unlist(lapply(tmp.path, function(x) return(sum(V(G)[x]$weight))))
  index <- which(tmp.sum == max(tmp.sum))
  selected.path <- tmp.path[index]
  collect <- unlist(lapply(selected.path, function(x) return(V(G)[x]$name)))
  return(collect)
}
build_network <- function(net, seed, d = 2){
  #genesets.length.null.stat <- structure(list(`1` = c(1.01397367504035, 1.18858228819048), `2` = c(1.61970348041337, 1.30189433386605), `3` = c(2.11767222957028, 1.36222065695878), `4` = c(2.47710421934929, 1.36968129959296), `5` = c(2.776011866622, 1.36318885187196), `6` = c(3.16885126246671, 1.42577861995897)), .Names = c("1", "2", "3", "4", "5", "6"))
  genesets.length.null.stat <- lapply(1:500, function(x) c(runif(1) + x, runif(1) + x))
  names(genesets.length.null.stat) <- 1:500
  score.fun <- function(g){
    Za <- sum(V(g)$weight * V(g)$RWRNodeweight) / sqrt(sum(V(g)$RWRNodeweight^2))
    k <- vcount(g)
    tmp <- genesets.length.null.stat[[as.character(k)]]
    Sa <- (Za - tmp[1]) / tmp[2]
  }
  best.fun <- function(in.nodes, out.nodes) {
    score <- -Inf; best <- character()
    for (node in out.nodes){
      subG.update <- induced.subgraph(net, c(in.nodes, node))
      if (score.fun(subG.update) > score) {
        score <- score.fun(subG.update)
        best <- node
      }
    }
    list("node" = best, "score" = score)
  }
  subG <- induced.subgraph(net, seed)
  if (!is.connected(subG)) {  # the seed must be connected
    stop("Input seeds are disjoint")
  }
  while (TRUE) {
    in.nodes <- V(subG)$name
    node_num <- vcount(subG)
    subsum <- score.fun(subG)
    for (rad in 1:d) {
      tmp.neigh <- unlist(neighborhood(net, order = rad, nodes = V(subG)$name))
      pot.nodes <- V(net)[tmp.neigh]$name
      out.nodes <- setdiff(pot.nodes, in.nodes)
      if (length(out.nodes) == 0) break
      #message("length in.nodes = ", length(in.nodes))
      #message("length out.nodes = ", length(out.nodes))
      best_node <- best.fun(in.nodes, out.nodes)
      new_score <- best_node$score
      best_node <- best_node$node
      if (new_score > subsum + 0.01) {
        tmp <- unlist(lapply(best_node, function(x) node2treePath(net, V(subG)$name, x)))
        in.nodes <- c(tmp, V(subG)$name)
        subG <- induced.subgraph(net, in.nodes)
        break
      }
    }
    if (node_num == vcount(subG)) break
  }
  subG
}
library(igraph)
library(profr)
#genesets.length.null.stat <- lapply(1:500, function(x) c(runif(1)+x, runif(1)+x))
#names(genesets.length.null.stat) <- 1:500
set.seed(1)
genesets_jack_a = runif(500) + 1:500
genesets_jack_b = runif(500) + 1:500
do_it_jw <- function(n = 1000){
  my_graph <- erdos.renyi.game(n, 0.0003)
  V(my_graph)$name <- 1:vcount(my_graph)
  V(my_graph)$weight <- rnorm(n)
  V(my_graph)$RWRNodeweight <- runif(n, min = 0, max = 0.05)
  ### Run the function
  sublist <- list()
  Vmg <- V(my_graph)
  for (node in Vmg$name) {
    #message(node)
    subnet <- build_network_jw(my_graph, node, 2)
    sublist[[node]] <- subnet
  }
}
do_it <- function(n = 1000){
  my_graph <- erdos.renyi.game(n, 0.0003)
  V(my_graph)$name <- 1:vcount(my_graph)
  V(my_graph)$weight <- rnorm(n)
  V(my_graph)$RWRNodeweight <- runif(n, min = 0, max = 0.05)
  ### Run the function
  sublist <- list()
  Vmg <- V(my_graph)
  for (node in Vmg$name) {
    #message(node)
    subnet <- build_network(my_graph, node, 2)
    sublist[[node]] <- subnet
  }
}
library(microbenchmark)
mb <- microbenchmark(do_it(1000), do_it_jw(1000), times = 5)
print(mb)
Since your score function only depends on node attributes and not on edge attributes, the solution is not unique; you might want to search for a best tree instead. If you restructure your problem so that your nodes are edges and vice versa, you can probably just use e.g. Dijkstra's algorithm to find the best one. That is already in the igraph package as shortest.paths().
I can't read the R code, but based on your description: If the score threshold is constant, then this is easy to do in O(|V|+|E|+|C|^2) time, where |C| is the number of "good" components (this will be further explained shortly).
In a first pass, delete all nodes with score below the threshold. Then find all connected components in this new graph (this can be done in O(|V|+|E|) time by starting a DFS at each as-yet-unvisited node), calculate their scores by multiplying together all vertex weights in the component, and label each vertex with its component ID. This already tells you the "good" components -- the ones that don't require any 2nd-degree connections.
Suppose this produces |C| components. Create an empty hashtable H which has component-ID pairs for keys, and (length, weight) pairs for values. Now go back through each vertex v you deleted in the first pass: for each one, look at all its neighbours and record the shortest edge to each distinct component (this can be done using a length-|C| array to store the shortest edge to each component seen so far). After examining all of v's neighbours, count the number k of distinct components they fall into: if k >= 2, then v potentially should be used to connect some of these k(k-1)/2 pairs of components. For every pair of distinct components i and j that could be connected by v, update H with the weight and distance of this 2-edge connection as necessary: that is, if i and j are not yet joined together, then record that v joins them; otherwise, if they are already joined by some vertex u, only update H if v can do better (i.e., if it uses less total length and greater weight than u would). This step can be thought of as building a minimum spanning tree in a "component graph" derived from the original, pruned graph. The scores for each new "combined" component can easily be calculated as you go just by multiplying together the scores of the two constituent components.
Finally, simply return the component whose product is maximum.
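A minimal R sketch of the first pass (my addition; `score` and `threshold` are assumed stand-ins for the question's node attribute and constant threshold, and the second, component-merging pass is omitted):
library(igraph)
set.seed(1)
g <- erdos.renyi.game(100, 0.05)
V(g)$score <- runif(100, 0.5, 1.5)  # assumed node attribute
threshold <- 1                      # assumed constant threshold
# pass 1: drop below-threshold nodes, then find components in O(|V|+|E|)
g1 <- induced.subgraph(g, V(g)[V(g)$score >= threshold])
comp <- components(g1)
comp_score <- tapply(V(g1)$score, comp$membership, prod)  # multiplicative score
which.max(comp_score)  # best "good" component before any 2nd-degree merging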

Linear Independence of Large Sparse Matrices in R

I have three large matrices: I, G, and G^2. These are 4 million × 4 million matrices, and they are sparse. I would like to check whether they are linearly independent, and I would like to do this in R.
For small matrices, a way to do this is to vectorize each matrix: stack the columns on top of each other and test whether the matrix formed by the three stacked vectors has rank three.
However, due to the size of my problem, I am not sure how to proceed.
(1) Is there a way to vectorize a large sparse matrix into a very large sparse vector in R?
(2) Is there any other solution to the problem that could make this test efficient?
Thanks in advance
When converting your matrices to vectors, you can keep only the non-zero elements.
# Sample data
n <- 4e6
k <- n
library(Matrix)
I <- spMatrix(n, n, 1:n, 1:n, rep(1, n))
G <- spMatrix(n, n,
  sample(1:n, k, replace = TRUE),
  sample(1:n, k, replace = TRUE),
  sample(0:9, k, replace = TRUE)
)
G2 <- G %*% G
G2 <- as(G2, "dgTMatrix")  # for the j slot
# Only keep elements that are non-zero in at least one of the 3 matrices
i <- as.integer(c(G@i, G2@i, I@i) + 1)
j <- as.integer(c(G@j, G2@j, I@j) + 1)
ij <- cbind(i, j)
rankMatrix(cbind(G2[ij], G[ij], I[ij]))  # 3
# Another example
m <- ceiling(n/2) - 1
G <- spMatrix(n, n,
  c(1:n, 2*(1:m)),
  c(1:n, 2*(1:m) + 1),
  rep(1, n + m)
)
G2 <- as(G %*% G, "dgTMatrix")
i <- c(G@i, G2@i, I@i) + 1
j <- c(G@j, G2@j, I@j) + 1
ij <- cbind(i, j)
rankMatrix(cbind(G2[ij], G[ij], I[ij]))  # 2
(To speed things up, you could take only a small part of those vectors: if the rank is already 3, you know that they are independent; if it is 2, you can check whether the linear dependence relation also holds for the full vectors.)
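A sketch of that shortcut (my addition): check a random subsample of the stacked rows first, and only fall back to the full vectors when the rank comes back below 3.
set.seed(42)
sub <- sample(nrow(ij), min(1e4, nrow(ij)))
ij_s <- ij[sub, , drop = FALSE]
rankMatrix(cbind(G2[ij_s], G[ij_s], I[ij_s]))  # if this is already 3, the matrices are independent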

Ordering 1:17 by perfect square pairs

There was an interesting question on R-help:
"Take the numbers one up to 17. Can you write them out in a line so that every pair of numbers that are next to each other, adds up to give a square number?"
My solution is below and not particularly special. I'm curious about a more elegant and/or robust solution. Maybe a solution that can take an arbitrary set of numbers and order them like this, if possible?
sq.test <- function(a, b) {
  ## test for number pairs that sum to squares
  sqrt(sum(a, b)) == floor(sqrt(sum(a, b)))
}
ok.pairs <- function(n, vec) {
  ## given n as a member of vec,
  ## which other members of vec satisfy sq.test
  vec <- vec[vec != n]
  vec[sapply(vec, sq.test, b = n)]
}
grow.seq <- function(y) {
  ## given a starting point (y) and a pairs list (pl)
  ## grow the squaring sequence.
  ly <- length(y)
  if (ly == y[1]) return(y)
  ## this line is the one that breaks down on other number sets...
  y <- c(y, max(pl[[y[ly]]][!pl[[y[ly]]] %in% y]))
  y <- grow.seq(y)
  return(y)
}
## start vector
x <- 1:17
## get list of possible pairs
pl <- lapply(x, ok.pairs, vec = x)
## pick start at max since there are few combinations there
y <- max(x)
grow.seq(y)
You can use outer to compute the allowable pairs.
The resulting matrix is the adjacency matrix of a graph,
and you just want a Hamiltonian path on it.
# Allowable pairs form a graph
p <- outer(
  1:17, 1:17,
  function(u, v) round(sqrt(u + v), 6) == floor(sqrt(u + v))
)
rownames(p) <- colnames(p) <- 1:17
image(p, col = c(0, 1))
# Read the solution on the plot
library(igraph)
g <- graph.adjacency(p, "undirected")
V(g)$label <- V(g)$name
plot(g, layout=layout.fruchterman.reingold)
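To read the ordering off programmatically instead of from the plot, a small backtracking search over the same adjacency matrix p suffices (my addition):
# depth-first backtracking for a Hamiltonian path in p
ham <- function(path, n = 17) {
  if (length(path) == n) return(path)
  for (nxt in which(p[path[length(path)], ] & !(1:n %in% path))) {
    res <- ham(c(path, nxt), n)
    if (!is.null(res)) return(res)
  }
  NULL
}
for (s in 1:17) { res <- ham(s); if (!is.null(res)) break }
res
# [1] 16  9  7  2 14 11  5  4 12 13  3  6 10 15  1  8 17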
