Optimize network range function - r

I found a blog post by Richard Paquin Morel on computing a network range function in R (Burt 1981; Reagans and McEvily 2003). The function assigns a value to each network node based on the number of its contacts and the interconnectedness of those contacts. The network range can be computed across subgroups of nodes (e.g. female and male nodes), which are stored as vertex attributes.
The author's example is very illustrative, but it is based on a relatively small network (about 100 nodes). I have a network with about 200,000 nodes, which means the function is far too slow for my analyses.
Below is an example in which the size of a random graph generated with the Erdos-Renyi model can be varied to time the performance of the function.
I am unfamiliar with optimizing R code, but I think the adjacency matrix needs to be stored more efficiently (e.g., as a sparse matrix). My attempts so far have not produced a working function.
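For reference, the kind of direction I have in mind (a minimal sketch only, not a working replacement for netrange()) is to avoid building the dense adjacency matrix and instead use a sparse matrix or the edge list that igraph already stores, with g an igraph graph such as the example generated below:
library(igraph)
library(Matrix)
A <- as_adjacency_matrix(g, sparse = TRUE)       # sparse dgCMatrix instead of a dense matrix
el <- igraph::as_data_frame(g, what = "edges")   # or skip the matrix and use the edge list directly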
rm(list=ls())
library(igraph)
library(statnet)
library(intergraph)
library(tictoc)
set.seed(42)
## Source: https://ramorel.github.io/network-range/
## Function to find network range for each node in a network
## Arguments:
## net = adjacency matrix, igraph graph, or network object
## attr = Vector of attributes associated with each node in net
## directed = boolean indicating if the network is directed or not
netrange <- function(net, attr, directed = TRUE){
require(reshape2)
if (class(net) == "igraph") {
net <- as_adjacency_matrix(net, sparse = F)
}
else {
if(class(net) == "network") {
net <- as.matrix.network(net)
}
else {
net <- as.matrix(net)
}
}
if(nrow(net) != length(attr)) {
stop("Number of nodes must match length of attributes vector")
}
else {
if (directed == TRUE){
ns <- colnames(net)
el <- melt(net, varnames=c("ego", "alter"), value.name = "weight")
df <- cbind(rownames(net), attr)
el$ego_grp <- df[match(el[,1], df[,1]), 2]
el$alter_grp <- df[match(el[,2], df[,1]), 2]
#FINDING p_k, the strength of ties within each group
# z_iq = sum of strength of ties from nodes in group _k_ to all other alters
# z_ij = sum of strength of ties from nodes in group _k_ to alters in group _k_
z_iq <- sapply(unique(attr), function(x) {
sum(el[which(el$ego_grp==x), "weight"])
})
z_ij <- sapply(unique(attr), function(x) {
sum(el[which(el$ego_grp==x & el$alter_grp==x), "weight"])
})
p_k <- z_ij / z_iq
p_k[is.na(p_k)] <- 0
#FINDING p_ik, the strength of connection from person i to group k
# x_iq = sum of strength of ties for _i_ to alters in group _k_
# x_ij = sum of strength of ties for _i_ to all alters
x_ij <- sapply(colnames(net), function(x) {
sum(el[which(el$ego==x), "weight"])
}
)
x_iq <- list(NULL)
for(i in colnames(net)) {
x_iq[[i]] <- sapply(unique(attr), function(x) {
sum(el[which(el$ego==i & el$alter_grp==x), "weight"])
}
)
}
x_iq <- x_iq[-c(1)] #x_iq is now a list where each element is a vector of node _i_'s summed strength of ties to group _k_
p_ik <- lapply(1:length(x_iq),
function(x) x_iq[[x]] / x_ij[x])
# FINDING nd_i, the network diversity score for node _i_
nd_i <- sapply(1:length(p_ik),
function(x) 1 - sum(p_k*p_ik[[x]]^2, na.rm = F)
)
}
else {
ns <- colnames(net)
el <- melt(net, varnames=c("ego", "alter"), value.name = "weight")
dup <- data.frame(t(apply(el[,1:2],1,sort)))
el <- el[!duplicated(dup),]
df <- cbind(rownames(net), attr)
el$ego_grp <- df[match(el[,1], df[,1]), 2]
el$alter_grp <- df[match(el[,2], df[,1]), 2]
#FINDING p_k, the strength of ties within each group
# z_iq = sum of strength of ties from nodes in group _k_ to all other alters
# z_ij = sum of strength of ties from nodes in group _k_ to alters in group _k_
z_iq <- sapply(unique(attr), function(x) {
sum(el[which(el$ego_grp==x | el$alter_grp==x), "weight"])
})
z_ij <- sapply(unique(attr), function(x) {
sum(el[which(el$ego_grp==x & el$alter_grp==x), "weight"])
})
p_k <- z_ij / z_iq
p_k[is.na(p_k)] <- 0
#FINDING p_ik, the strength of connection from person i to group k
# x_iq = sum of strength of ties for _i_ to alters in group _k_
# x_ij = sum of strength of ties for _i_ to all alters
x_ij <- sapply(colnames(net), function(x) {
sum(el[which(el$ego==x | el$alter==x), "weight"])
}
)
x_iq <- list(NULL)
for(i in colnames(net)) {
x_iq[[i]] <- sapply(unique(attr), function(x) {
sum(el[which(el$ego==i & el$alter_grp==x), "weight"],
el[which(el$alter==i & el$ego_grp==x), "weight"])
}
)
}
x_iq <- x_iq[-c(1)] #x_iq is now a list where each element is a vector of node _i_'s summed strength of ties to group _k_
p_ik <- lapply(1:length(x_iq),
function(x) x_iq[[x]] / x_ij[x])
# FINDING nd_i, the network diversity score for node _i_
nd_i <- sapply(1:length(p_ik),
function(x) 1 - sum(p_k*p_ik[[x]]^2, na.rm = F)
)
}
return(nd_i)
}
}
# Generate exemplary network
g <- igraph::erdos.renyi.game(1000, 150, type = "gnm")
## Add categorical (binary) vertex feature: female
V(g)$female <- sample(c(0,1), replace=TRUE, size=length(V(g)))
V(g)$female
## transform igraph to statnet
net <- intergraph::asNetwork(g)
## Apply network function
tic()
range_female <- netrange(net,
net %v% "female",
directed = T)
seq_time <- toc()

The best thing to do when optimizing is to profile your code, e.g. with profvis, which shows where your bottlenecks are. I only looked at part of your function, though.
There we saw that x_ij <- sapply(colnames(net), function(x) { sum(el[which(el$ego==x | el$alter==x), "weight"]) }) takes terribly long.
I prefer data.table myself, so I just provide the code here to speed up that part so it takes only a few seconds. Try the same for creating x_iq.
library(data.table)
setDT(el)   # el is the melted edge list built inside netrange()
xx_ij <- el[, .(xxij = lapply(.SD, sum)), by = c("ego"), .SDcols = c("weight")]
xx_ij2 <- xx_ij$xxij
names(xx_ij2) <- xx_ij$ego
identical(x_ij, unlist(xx_ij2))
# TRUE
Profiling:
library(profvis)
p <- profvis({
## code here
})
f <- paste0("profile_", as.Date(now()), ".html")
htmlwidgets::saveWidget(p, f)
browseURL(f)
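Applying the same data.table idea to x_iq could look roughly like this for the directed case (a sketch only, assuming el already carries the ego_grp/alter_grp columns built inside netrange()):
library(data.table)
setDT(el)
# one row per ego, one column per group; missing ego/group combinations filled with 0
xx_iq <- dcast(el, ego ~ alter_grp, value.var = "weight", fun.aggregate = sum, fill = 0)
# back to the list-of-vectors structure that the p_ik step expects
m <- as.matrix(xx_iq[, -1])
rownames(m) <- xx_iq$ego
x_iq2 <- asplit(m, 1)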

Related

Best R package function to prepare sampling distribution for stratified sampling

I'm attempting to prepare a demonstration in R of how the repeated stratified random sampling of a small population results in a near-normal sampling distribution of means. As an example consider the R code below (which works but is very slow due to looping).
#Dummy population made up of dice throws - 18 per row
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
P1 <- as.data.frame(c(5,6,5,1,6,4,2,2,4,4,6,6,5,2,3,5,1,6))
P1$Zn <- 1
names(P1) <- c('Die','Zn')
Dt <- P1
P2 <- as.data.frame(c(2,5,4,5,5,5,3,3,2,5,6,1,2,5,4,3,6,1))
P2$Zn <- 2
names(P2) <- c('Die','Zn')
Dt <- rbind(Dt,P2)
# Empty dataframe to hold random draws
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Smps <- data.frame(Die = numeric(), Zn= numeric(),Drw = numeric())
# Helper to grab the rightmost characters of a value (used for progress reporting below)
right <- function(text, num_char) {
substr(text, nchar(text) - (num_char - 1), nchar(text))
}
# Draw stratified samples, one from each stratum
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
print(paste('Start','at',Sys.time()))
n <- 10000 # number of draws
r <- 2 # number of rows (the strata)
for (j in 1:n){
# for a 2 strata
for (i in 1:r){
#sub set strata
x <- subset(Dt, Dt$Zn == i)
# random sample
y <- x[sample(1:18,1),]
y$Drw <- j
#append sample
Smps <- rbind(Smps,y)
}
# report progress
if(right(j,3) == '000'){
print(paste(j,'at',Sys.time()))
flush.console()
}
}
# Compute the sample means
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Mns <-aggregate(Smps[, 1], list(Smps$Drw), mean)
# Density plot of means
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
d <- density(Mns$x)
plot(d,xlab = 'Means', las=1, main = '')
polygon(d, col="blue", border="blue")
I expect there is an R package with a function that does this type of stratified sampling, but I am struggling to find one that works in a way I can understand. Something that takes a data frame, a grouping field, and the number of samples to draw from each group has surely been written already, and would allow repeated sampling by group. Any pointers to working examples would be appreciated. Ideally, I would like to draw, say, 100,000 stratified samples from a known population with many more strata and then plot the distribution of the means (but quickly).
After a while away from this problem, I found a package called 'fifer' (https://www.rdocumentation.org/packages/fifer/versions/1.1) which seems to contain a stratified sampling function, but unfortunately the package does not work on the latest versions of R. I did, however, find a clever stratified function from Ananda Mahto (https://gist.github.com/mrdwab/6424112) which works well, but at the cost of having a rather long function in your script rather than the one line needed to load a package. My solution to the problem above using this function is below.
#Dummy population made up of dice throws - 18 per row
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
P1 <- as.data.frame(c(5,6,5,1,6,4,2,2,4,4,6,6,5,2,3,5,1,6))
P1$Zn <- 1
names(P1) <- c('Die','Zn')
Dt <- P1
P2 <- as.data.frame(c(2,5,4,5,5,5,3,3,2,5,6,1,2,5,4,3,6,1))
P2$Zn <- 2
names(P2) <- c('Die','Zn')
Dt <- rbind(Dt,P2)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Stratified function from the web
# https://gist.github.com/mrdwab/6424112
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
stratified <- function(df, group, size, select = NULL,
replace = FALSE, bothSets = FALSE) {
if (is.null(select)) {
df <- df
} else {
if (is.null(names(select))) stop("'select' must be a named list")
if (!all(names(select) %in% names(df)))
stop("Please verify your 'select' argument")
temp <- sapply(names(select),
function(x) df[[x]] %in% select[[x]])
df <- df[rowSums(temp) == length(select), ]
}
df.interaction <- interaction(df[group], drop = TRUE)
df.table <- table(df.interaction)
df.split <- split(df, df.interaction)
if (length(size) > 1) {
if (length(size) != length(df.split))
stop("Number of groups is ", length(df.split),
" but number of sizes supplied is ", length(size))
if (is.null(names(size))) {
n <- setNames(size, names(df.split))
message(sQuote("size"), " vector entered as:\n\nsize = structure(c(",
paste(n, collapse = ", "), "),\n.Names = c(",
paste(shQuote(names(n)), collapse = ", "), ")) \n\n")
} else {
ifelse(all(names(size) %in% names(df.split)),
n <- size[names(df.split)],
stop("Named vector supplied with names ",
paste(names(size), collapse = ", "),
"\n but the names for the group levels are ",
paste(names(df.split), collapse = ", ")))
}
} else if (size < 1) {
n <- round(df.table * size, digits = 0)
} else if (size >= 1) {
if (all(df.table >= size) || isTRUE(replace)) {
n <- setNames(rep(size, length.out = length(df.split)),
names(df.split))
} else {
message(
"Some groups\n---",
paste(names(df.table[df.table < size]), collapse = ", "),
"---\ncontain fewer observations",
" than desired number of samples.\n",
"All observations have been returned from those groups.")
n <- c(sapply(df.table[df.table >= size], function(x) x = size),
df.table[df.table < size])
}
}
temp <- lapply(
names(df.split),
function(x) df.split[[x]][sample(df.table[x],
n[x], replace = replace), ])
set1 <- do.call("rbind", temp)
if (isTRUE(bothSets)) {
set2 <- df[!rownames(df) %in% rownames(set1), ]
list(SET1 = set1, SET2 = set2)
} else {
set1
}
}
# Empty dataframe to hold random draws
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Smps <- data.frame(Die = numeric(), Zn = numeric(), j = numeric())
# Right function for reporting progress
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
right = function(text, num_char) {
substr(text, nchar(text) - (num_char-1), nchar(text))
}
# Draw stratified samples, one from each stratum
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
n <- 10000 # number of draws
for (j in 1:n){
y <- stratified(Dt,"Zn",1)
y <- cbind(y,j)
Smps <- rbind(Smps,y)
# report progress
if(right(j,3) == '000'){
print(paste(j,'at',Sys.time()))
flush.console()
}
}
# Compute the sample means
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Mns <-aggregate(Smps[, 1], list(Smps$j), mean)
# Density plot of means
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
d <- density(Mns$x)
plot(d,xlab = 'Means', las=1, main = '')
polygon(d, col="blue", border="blue")

Reducing a network by appending strong links to the spanning tree

I am trying to reduce a full graph so that it only shows the strongest links.
To do that I am doing this:
1. Multiply the original matrix by -1 (the values are in the interval [0,1]).
2. Obtain the minimum spanning tree; because of step 1 this is a spanning tree with the strongest links.
3. Append other strong links that are not part of the spanning tree until the average number of links per node is <= 5.
To perform these steps I have defined two functions that should be equivalent and I have to determine if the fast function output is the same as the slow function output.
Setup
# number of links
n <- 10
n2 <- sqrt(n)
library(igraph)
set.seed(1234)
mat <- matrix(rnorm(n, 1, 0.5), nrow = n2, ncol = n2)
mat[mat < 0] <- 0
colnames(mat) <- letters[1:n2]
rownames(mat) <- letters[1:n2]
diag(mat) <- 1
Function 1: binding rows after converting graph to data.frame
trim_network_1 <- function(mat, avg_links = 5, tolerance = 0.01) {
mat <- (-1) * mat
g <- graph_from_adjacency_matrix(mat, weighted = TRUE, mode = "undirected", diag = FALSE)
g_mst <- mst(g, algorithm = "prim")
threshold <- 0
avg_links_n <- FALSE
while (avg_links_n == FALSE) {
if (threshold < 1) {
message(sprintf("%s threshold...", threshold))
g_not_in_mst <- delete.edges(g, which(abs(E(g)$weight) <= threshold))
g_not_in_mst <- graph.difference(g_not_in_mst, g_mst)
g <- rbind(
as_data_frame(g_mst),
as_data_frame(g_not_in_mst)
)
g <- graph_from_data_frame(g, directed = F)
avg_links_n <- ifelse(mean(degree(g)) <= avg_links, TRUE, FALSE)
threshold <- threshold + tolerance
if (avg_links_n == TRUE) {
message(sprintf("%s threshold achieves the avg number of connections", threshold))
E(g)$weight <- (-1) * E(g)$weight
return(g)
}
} else {
warning("no threshold achieves the avg number of connections\nreturning maximum spanning tree")
avg_links_n <- TRUE
E(g_mst)$weight <- (-1) * E(g_mst)$weight
return(g_mst)
}
}
}
Function 2: Combine weights after merging graphs
trim_network_2 <- function(mat, avg_links = 5, tolerance = 0.01) {
mat <- (-1) * mat
g <- graph_from_adjacency_matrix(mat, weighted = TRUE, mode = "undirected", diag = FALSE)
g_mst <- mst(g, algorithm = "prim")
threshold <- 0
avg_links_n <- FALSE
while (avg_links_n == FALSE) {
if (threshold < 1) {
message(sprintf("%s threshold...", threshold))
g_not_in_mst <- delete.edges(g, which(abs(E(g)$weight) <= threshold))
g_not_in_mst <- graph.difference(g_not_in_mst, g_mst)
g <- graph.union(g_mst, g_not_in_mst)
E(g)$weight <- pmin(E(g)$weight_1, E(g)$weight_2, na.rm = T)
g <- remove.edge.attribute(g, "weight_1")
g <- remove.edge.attribute(g, "weight_2")
avg_links_n <- ifelse(mean(degree(g)) <= avg_links, TRUE, FALSE)
threshold <- threshold + tolerance
if (avg_links_n == TRUE) {
message(sprintf("%s threshold achieves the avg number of connections", threshold))
E(g)$weight <- (-1) * E(g)$weight
return(g)
}
} else {
warning("no threshold achieves the avg number of connections\nreturning maximum spanning tree")
avg_links_n <- TRUE
E(g_mst)$weight <- (-1) * E(g_mst)$weight
return(g_mst)
}
}
}
Comparing the outputs
g1 <- trim_network_1(mat)
g2 <- trim_network_2(mat)
g1 <- as_data_frame(g1)
g2 <- as_data_frame(g2)
g1w <- order(g1$weight)
g1 <- g1[g1w, ]
g2w <- order(g2$weight)
g2 <- g2[g2w, ]
# this is symmetric so A-B is the same as B-A (undirected graph)
# all.equal(g1, g2) doesn't always hold because of symmetry
all.equal(g1$weight, g2$weight)
all.equal(nrow(g1), nrow(g2))
I tried with graph.union(g_mst, g_not_in_mst) and, even after doing E(g)$weight <- pmin(E(g)$weight_1, E(g)$weight_2, na.rm = T), I cannot find a formal way to prove that the two results are equal. The point of comparing the two results is that one function is more efficient than the other.
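One way to make the comparison formal, sketched below, is to canonicalize each edge so that the smaller endpoint always comes first, sort the edge lists, and compare; this assumes "equal" means the same vertex pairs with the same weights, ignoring edge order and direction:
canonical_edges <- function(g) {
  df <- igraph::as_data_frame(g, what = "edges")
  ends <- t(apply(df[, c("from", "to")], 1, sort))   # A-B and B-A become identical
  out <- data.frame(from = ends[, 1], to = ends[, 2], weight = df$weight,
                    stringsAsFactors = FALSE)
  out <- out[order(out$from, out$to), ]
  rownames(out) <- NULL
  out
}
all.equal(canonical_edges(trim_network_1(mat)), canonical_edges(trim_network_2(mat)))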

How to implement the jaccard distance in kproto function

I am trying to implement the Jaccard distance in the kproto function (package clustMixType in R), but without any success. The aim is to do a cluster analysis of my dataset.
The Jaccard distance that I want to use is the complement of the Jaccard similarity coefficient, so
Jaccard distance = 1 - [a/(a+b+c)] = (b+c)/(a+b+c), or
Jaccard distance = 1 - [M11/(M01+M10+M11)] = (M01+M10)/(M01+M10+M11).
The source code of the kproto function is presented below. The variable d1 is the Euclidean distance for the numeric variables and the variable d2 is the distance based on the simple matching coefficient for the categorical variables (as factors).
It computes the distances between the observations and the prototypes, not between observations. The prototypes are calculated; they are not observations of the data set itself.
So my two questions are:
1) d2 is what I want to modify, but how?
2) should d1 be the sqrt of what is being calculated?
Thank you for all your help. It will be much appreciated.
Here is an excerpt of the dataset I'm working on, where V1 to V4 are factor (binary) variables (to use the Jaccard distance) and V5 to V8 are numeric variables (to use the Euclidean distance):
V1;V2;V3;V4;V5;V6;V7;V8
1;1;0;0;6;2;3;3
0;1;0;1;3;5;2;1
1;1;0;0;1;3;2;1
1;1;0;0;4;3;3;1
1;1;1;0;1;4;1;1
1;0;1;0;4;3;1;1
1;1;0;0;2;4;2;1
1;1;0;0;2;4;2;1
1;1;0;0;6;2;1;1
1;1;0;0;6;2;2;1
1;1;0;0;5;2;3;1
1;1;0;0;4;3;3;1
1;1;0;0;4;4;2;1
1;1;0;0;4;3;3;1
1;1;0;0;4;3;3;1
1;1;0;0;3;4;2;1
1;1;0;0;4;3;2;1
1;1;0;0;5;2;3;1
1;1;0;0;4;3;4;1
1;1;0;0;4;3;2;1
1;1;0;0;4;3;2;1
1;1;0;0;3;3;2;1
1;1;0;0;3;3;3;1
1;1;0;0;5;2;3;1
1;1;0;0;5;2;3;1
1;1;0;0;5;2;2;1
1;1;0;0;3;3;2;1
1;1;0;0;5;2;3;1
1;1;0;0;5;2;2;1
1;0;0;0;3;4;2;1
1;1;0;0;7;2;4;1
1;1;0;0;7;2;2;1
1;1;0;0;5;2;4;1
1;1;0;0;5;3;4;1
1;1;0;0;5;3;2;1
1;1;0;0;5;3;4;1
1;0;0;0;3;5;3;1
0;1;0;0;6;2;4;1
1;1;0;0;6;2;3;1
1;1;0;0;6;2;4;1
Let's take the first two observations from the dataset provided above as an example:
V1;V2;V3;V4;V5;V6;V7;V8
1;1;0;0;6;2;3;3
0;1;0;1;3;5;2;1
The algorithm first selects the k prototypes from the data set randomly, so let's suppose that the second observation is one of the initial prototypes. As I understand it, the algorithm creates a data frame called "protos", initially containing k random observations from the data set provided, so the second observation would be one of the rows of the "protos" data frame.
The combined distance used to cluster the observations would be d = d1 + lambda*d2. Lambda can also be a vector of individual weights for each variable. d is the distance between the observations in the data set provided and the "protos" matrix created initially with k random observations.
In this case, considering the first two observations presented, the calculated distances between the observation (yi) and the prototype (yk) would be as follows:
Euclidean distance for the numeric variables (V5 to V8):
d1 = sum[(yij - ykj)^2]^0.5
where,
k=1 to k clusters
i=1 to n observations
j=5 to 8 th variable
d1 = [(6-3)^2 + (2-5)^2 + (3-2)^2 + (3-1)^2]^0.5
d1 = [9+9+1+4]^0.5
d1 = 4.796
Jaccard, for the set of binary variables (V1 to V4):
d2=[(b+c)/(a+b+c)]
where,
a=1
b=1
c=1
are correspondences counts between the n observations and the k prototypes, for variables 1 to 4.
d2=[(1+1)/(1+1+1)]
d2=2/3
d2=0.667
So the combined distance between this specific observation and the initial prototype of that cluster is:
d=d1+d2
d=4.796+0.667
d=5.463
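The same arithmetic can be checked in a few lines of R (a quick sanity check only, not part of kproto):
y_i <- c(1, 1, 0, 0, 6, 2, 3, 3)            # first observation
y_k <- c(0, 1, 0, 1, 3, 5, 2, 1)            # prototype (second observation)
d1 <- sqrt(sum((y_i[5:8] - y_k[5:8])^2))    # Euclidean part: 4.796
a  <- sum(y_i[1:4] == 1 & y_k[1:4] == 1)    # 1/1 agreements: 1
bc <- sum(y_i[1:4] != y_k[1:4])             # mismatches (b + c): 2
d2 <- bc / (a + bc)                         # Jaccard distance: 0.667
d  <- d1 + d2                               # combined distance: 5.463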
The results, as I understand, are then stored row by row in a matrix called "d" of size [number of rows = number of observations, number of columns = number of clusters k].
I expect to correctly calculate the Euclidean and Jaccard distances by modifying the kproto function, while maintaining the steps and results of the original function.
NOTE: the final function should work on any number of observations, variables and prototypes, and not only to my specific dataset.
I've also tried to mix the codes from kproto (clustMixType package) and dist.binary (ade4 package), but they work in different ways.
#K-Prototypes algorithm
kproto.default <- function(x, k, lambda = NULL, iter.max = 100, nstart = 1, na.rm = TRUE, keep.data = TRUE, verbose = TRUE, ...){
# initial error checks
if(!is.data.frame(x)) stop("x should be a data frame!")
if(ncol(x) < 2) stop("For clustering x should contain at least two variables!")
if(iter.max < 1 | nstart < 1) stop("iter.max and nstart must not be specified < 1!")
if(!is.null(lambda)){
if(any(lambda < 0)) stop("lambda must be specified >= 0!")
if(!any(lambda > 0)) stop("lambda must be specified > 0 for at least one variable!")
}
# check for numeric and factor variables
numvars <- sapply(x, is.numeric)
anynum <- any(numvars)
catvars <- sapply(x, is.factor)
anyfact <- any(catvars)
if(!anynum) stop("\n No numeric variables in x! Try using kmodes() from package klaR...\n\n")
if(!anyfact) stop("\n No factor variables in x! Try using kmeans()...\n\n")
# treatment of missings
NAcount <- apply(x, 2, function(z) sum(is.na(z)))
if(verbose){
cat("# NAs in variables:\n")
print(NAcount)
}
if(any(NAcount == nrow(x))) stop(paste("Variable(s) have only NAs please remove them:",names(NAcount)[NAcount == nrow(x)],"!"))
if(na.rm) {
miss <- apply(x, 1, function(z) any(is.na(z)))
if(verbose){
cat(sum(miss), "observation(s) with NAs.\n")
if(sum(miss) > 0) message("Observations with NAs are removed.\n")
cat("\n")
}
x <- x[!miss,]
} # remove missings
if(!na.rm){
allNAs <- apply(x,1,function(z) all(is.na(z)))
if(sum(allNAs) > 0){
if(verbose) cat(sum(allNAs), "observation(s) where all variables NA.\n")
warning("No meaningful cluster assignment possible for observations where all variables NA.\n")
if(verbose) cat("\n")
}
}
if(nrow(x) == 1) stop("Only one observation clustering not meaningful.")
k_input <- k # store input k for nstart > 1 as clusters can be merged
# initialize prototypes
if(!is.data.frame(k)){
if (length(k) == 1){
if(as.integer(k) != k){k <- as.integer(k); warning(paste("k has been set to", k,"!"))}
if(nrow(x) < k) stop("Data frame has less observations than clusters!")
ids <- sample(nrow(x), k)
protos <- x[ids,]
}
if (length(k) > 1){
if(nrow(x) < length(k)) stop("Data frame has less observations than clusters!")
ids <- k
k <- length(ids)
if(length(unique(ids)) != length(ids)) stop("If k is specified as a vector it should contain different indices!")
if(any(ids<1)|any(ids>nrow(x))) stop("If k is specified as a vector all elements must be valid indices of x!")
#check for integer
protos <- x[ids,]
}
rm(ids)
}
if(is.data.frame(k)){
if(nrow(x) < nrow(k)) stop("Data frame has less observations than clusters!")
if(length(names(k)) != length(names(x))) stop("k and x have different numbers of columns!")
if(any(names(k) != names(x))) stop("k and x have different column names!")
if(anynum) {if( any(sapply(k, is.numeric) != numvars)) stop("Numeric variables of k and x do not match!")}
if(anyfact) {if( any(sapply(k, is.factor) != catvars)) stop("Factor variables of k and x do not match!")}
protos <- k
k <- nrow(protos)
}
if(k < 1) stop("Number of clusters k must not be smaller than 1!")
# automatic calculation of lambda
if(length(lambda) > 1) {if(length(lambda) != sum(c(numvars,catvars))) stop("If lambda is a vector, its length should be the sum of numeric and factor variables in the data frame!")}
if(is.null(lambda)){
if(anynum & anyfact){
vnum <- mean(sapply(x[,numvars, drop = FALSE], var, na.rm = TRUE))
vcat <- mean(sapply(x[,catvars, drop = FALSE], function(z) return(1-sum((table(z)/sum(!is.na(z)))^2))))
if (vnum == 0){
if(verbose) warning("All numerical variables have zero variance.")
anynum <- FALSE
}
if (vcat == 0){
if(verbose) warning("All categorical variables have zero variance.")
anyfact <- FALSE
}
if(anynum & anyfact){
lambda <- vnum/vcat
if(verbose) cat("Estimated lambda:", lambda, "\n\n")
}else{
lambda <- 1
}
}
}
# initialize clusters
clusters <- numeric(nrow(x))
tot.dists <- NULL
moved <- NULL
iter <- 1
# check for any equal prototypes and reduce cluster number in case of occurence
if(k > 1){
keep.protos <- rep(TRUE,k)
for(l in 1:(k-1)){
for(m in (l+1):k){
d1 <- sum((protos[l,numvars, drop = FALSE]-protos[m,numvars, drop = FALSE])^2) # euclidean for numerics
d2 <- sum(protos[l,catvars, drop = FALSE] != protos[m,catvars, drop = FALSE]) # wtd simple matching for categorics
if((d1+d2) == 0) keep.protos[m] <- FALSE
}
}
if(!all(keep.protos)){
protos <- protos[keep.protos,]
k <- sum(keep.protos)
if(verbose) message("Equal prototypes merged. Cluster number reduced to:", k, "\n\n")
}
}
# special case only one cluster
if(k == 1){clusters <- rep(1, nrow(x)); size <- table(clusters); iter <- iter.max} # REM: named vector size is needed later...
# start iterations for standard case (i.e. k > 1)
while(iter < iter.max){
# compute distances
nrows <- nrow(x)
dists <- matrix(NA, nrow=nrows, ncol = k)
for(i in 1:k){
#a0 <- proc.time()[3]
#d1 <- apply(x[,numvars],1, function(z) sum((z-protos[i,numvars])^2)) # euclidean for numerics
d1 <- (x[,numvars, drop = FALSE] - matrix(rep(as.numeric(protos[i, numvars, drop = FALSE]), nrows), nrow=nrows, byrow=T))^2
if(length(lambda) == 1) d1 <- rowSums(d1, na.rm = TRUE)
if(length(lambda) > 1) d1 <- as.matrix(d1) %*% lambda[numvars]
#a1 <- proc.time()[3]
#d2 <- lambda * apply(x[,catvars],1, function(z) sum((z != protos[i,catvars]))) # wtd simple matching for categorics
d2 <- sapply(which(catvars), function(j) return(x[,j] != rep(protos[i,j], nrows)) )
d2[is.na(d2)] <- FALSE
if(length(lambda) == 1) d2 <- lambda * rowSums(d2)
if(length(lambda) > 1) d2 <- as.matrix(d2) %*% lambda[catvars]
#a2 <- proc.time()[3]
dists[,i] <- d1 + d2
#cat(a1-a0, a2-a1, "\n")
}
# assign clusters
old.clusters <- clusters
# clusters <- apply(dists, 1, function(z) which.min(z))
clusters <- apply(dists, 1, function(z) {a <- which.min(z); if (length(a)>1) a <- sample(a,1); return(a)}) # sample in case of multiple minima
size <- table(clusters)
min.dists <- apply(cbind(clusters, dists), 1, function(z) z[z[1]+1])
within <- as.numeric(by(min.dists, clusters, sum))
tot.within <- sum(within)
# prevent from empty classes
#tot.within <- numeric(k)
#totw.list <- by(min.dists, clusters, sum)
#tot.within[names(totw.list)] <- as.numeric(totw.list)
# ...check for empty clusters and eventually reduce number of prototypes
if (length(size) < k){
k <- length(size)
protos <- protos[1:length(size),]
if(verbose) cat("Empty clusters occur. Cluster number reduced to:", k, "\n\n")
}
# trace
tot.dists <- c(tot.dists, sum(tot.within))
moved <- c(moved, sum(clusters != old.clusters))
# compute new prototypes
remids <- as.integer(names(size))
for(i in remids){
protos[which(remids == i), numvars] <- sapply(x[clusters==i, numvars, drop = FALSE], mean, na.rm = TRUE)
protos[which(remids == i), catvars] <- sapply(x[clusters==i, catvars, drop = FALSE], function(z) levels(z)[which.max(table(z))])
}
if(k == 1){clusters <- rep(1, length(clusters)); size <- table(clusters); iter <- iter.max; break}
# check for any equal prototypes and reduce cluster number in case of occurence
if(iter == (iter.max-1)){ # REM: for last iteration equal prototypes are allowed. otherwise less prototypes than assigned clusters.
keep.protos <- rep(TRUE,k)
for(l in 1:(k-1)){
for(m in (l+1):k){
d1 <- sum((protos[l,numvars, drop = FALSE]-protos[m,numvars, drop = FALSE])^2) # euclidean for numerics
d2 <- sum(protos[l,catvars, drop = FALSE] != protos[m,catvars, drop = FALSE]) # wtd simple matching for categorics
if((d1+d2) == 0) keep.protos[m] <- FALSE
}
}
if(!all(keep.protos)){
protos <- protos[keep.protos,]
k <- sum(keep.protos)
if(verbose) cat("Equal prototypes merged. Cluster number reduced to:", k, "\n\n")
}
}
# add stopping rules
if(moved[length(moved)] == 0) break
if(k == 1){clusters <- rep(1, length(clusters)); size <- table(clusters); iter <- iter.max; break}
#cat("iter", iter, "moved", moved[length(moved)], "tot.dists",tot.dists[length(tot.dists)],"\n" )
iter <- iter+1
}
### Final update of prototypes and dists
if(iter == iter.max){ # otherwise there have been no moves anymore and prototypes correspond to cluster assignments
# compute new prototypes
remids <- as.integer(names(size))
for(i in remids){
protos[which(remids == i), numvars] <- sapply(x[clusters==i, numvars, drop = FALSE], mean, na.rm = TRUE)
protos[which(remids == i), catvars] <- sapply(x[clusters==i, catvars, drop = FALSE], function(z) levels(z)[which.max(table(z))])
}
# compute distances
nrows <- nrow(x)
dists <- matrix(NA, nrow=nrows, ncol = k)
for(i in 1:k){
d1 <- (x[,numvars, drop = FALSE] - matrix(rep(as.numeric(protos[i, numvars, drop = FALSE]), nrows), nrow=nrows, byrow=T))^2
if(length(lambda) == 1) d1 <- rowSums(d1, na.rm = TRUE)
if(length(lambda) > 1) d1 <- as.matrix(d1) %*% lambda[numvars]
d2 <- sapply(which(catvars), function(j) return(x[,j] != rep(protos[i,j], nrows)) )
d2[is.na(d2)] <- FALSE
if(length(lambda) == 1) d2 <- lambda * rowSums(d2)
if(length(lambda) > 1) d2 <- as.matrix(d2) %*% lambda[catvars]
dists[,i] <- d1 + d2
}
size <- table(clusters)
min.dists <- apply(cbind(clusters, dists), 1, function(z) z[z[1]+1])
within <- as.numeric(by(min.dists, clusters, sum))
tot.within <- sum(within)
}
names(clusters) <- row.names(dists) <- row.names(x)
rownames(protos) <- NULL
# create result:
res <- list(cluster = clusters,
centers = protos,
lambda = lambda,
size = size,
withinss = within,
tot.withinss = tot.within,
dists = dists,
iter = iter,
trace = list(tot.dists = tot.dists, moved = moved))
# loop: if nstart > 1:
if(nstart > 1)
for(j in 2:nstart){
res.new <- kproto(x=x, k=k_input, lambda = lambda, iter.max = iter.max, nstart=1, verbose=verbose)
if(res.new$tot.withinss < res$tot.withinss) res <- res.new
}
if(keep.data) res$data = x
class(res) <- "kproto"
return(res)
}
I've managed to modify the functions to accept a variety of similarity measures and uploaded the R file at http://dx.doi.org/10.17632/63nyn9tjcd.1, in case someone needs it.
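For anyone attempting the same, here is a minimal sketch of how the simple-matching block for d2 inside the per-cluster distance loop might be swapped for a Jaccard distance; it assumes the categorical variables are binary factors with levels "0"/"1" and is not the code from the uploaded file:
# replacement for the d2 block inside the distance loop (i is the cluster index)
xc <- sapply(x[, catvars, drop = FALSE], function(z) as.numeric(as.character(z)))
pc <- as.numeric(as.character(unlist(protos[i, catvars, drop = FALSE])))
pmat <- matrix(pc, nrow = nrow(xc), ncol = length(pc), byrow = TRUE)
a  <- rowSums(xc == 1 & pmat == 1)              # 1/1 agreements per observation
bc <- rowSums(xc != pmat)                       # 0/1 and 1/0 mismatches per observation
d2 <- ifelse(a + bc == 0, 0, bc / (a + bc))     # Jaccard distance; all-zero pairs contribute 0
if (length(lambda) == 1) d2 <- lambda * d2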

Small bug in backpropagation algorithm in r

I've been trying to implement backpropagation in R, but I've been getting some strange results. It appears that after 1000 iterations of backprop, the program predicts 1 for all values. I was hoping it was a problem in the test function, but testing with smaller numbers of iterations shows that 0 is predicted as an output value in some instances. It seems that somewhere in iterating through the dataset, the weight updates tend toward increasing when they should be moving toward reducing the error.
I apologize that the code is difficult to read in spots. I'm working on this with a partner and I dislike the way he names variables. It's also not as fully commented as I'd like. Any help is appreciated.
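For reference, the conventional delta-rule update for a sigmoid output unit with squared error moves each weight against the error gradient; a minimal sketch with hypothetical names (out, target, x_j, w_j are illustrative, not taken from the code below):
delta <- (out - target) * out * (1 - out)       # derivative of the error w.r.t. the node's net input
w_j   <- w_j - learningRate * delta * x_j       # subtract to descend the gradient and reduce the error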
# initialize a global output vector and a global vector of data frames
createNeuralNet <- function(numberOfInputNodes,hiddenLayers,nodesInHiddenLayer){
L <<- initializeWeightDataFrames(numberOfInputNodes,nodesInHiddenLayer,hiddenLayers)
# print(L)
OutputList <<- initializeOutputVectors(hiddenLayers)
}
# creates a list of weight data frames
# each weight data frame uses the row as an index of the "tail" for a connection
# the "head" of the connection (where the arrow points) is in the column index
# the value in the weight data frame is the weight of that connection
# the last row is the weight between the bias and a particular node
initializeWeightDataFrames <- function(numberOfInputNodes, nodesPerHiddenLayer, numberOfHiddenLayers) {
weights <- vector("list", numberOfHiddenLayers + 1)
# this code simply creates empty data frames of the proper size so that they may
first <- read.csv(text=generateColumnNamesCSV(nodesPerHiddenLayer))
middle <- read.csv(text=generateColumnNamesCSV(nodesPerHiddenLayer))
# assume binary classifier, so output layer has 1 node
last <- read.csv(text=generateColumnNamesCSV(1))
first <- assignWeights(first, numberOfInputNodes + 1)
weights[[1]] <- first
# assign random weights to each row
if (numberOfHiddenLayers != 1) {
for (i in 1:numberOfHiddenLayers - 1) {
middle <- assignWeights(middle, nodesPerHiddenLayer + 1)
weights[[i+1]] <- middle
}
}
last <- assignWeights(last, nodesPerHiddenLayer + 1)
weights[[length(weights)]] <- last
return(weights)
}
# generate a comma-separated string of column names c1 thru cn for creating arbitrary size data frame
generateColumnNamesCSV <- function(n) {
namesCSV <- ""
if (n==1) {
return("c1")
}
for (i in 1:(n-1)) {
namesCSV <- paste0(namesCSV, "c", i, ",")
}
namesCSV <- paste0(namesCSV, "c", n)
return(namesCSV)
}
assignWeights <- function(weightDF, numRows) {
modifiedweightDF <- weightDF
for (rowNum in 1:numRows) {
# creates a bunch of random numbers from -1 to 1, used to populate a row
rowVector <- runif(length(weightDF))
for (i in 1:length(rowVector)) {
sign <- (-1)^round(runif(1))
rowVector[i] <- sign * rowVector[i]
}
modifiedweightDF[rowNum,] <- rowVector
}
return(modifiedweightDF)
}
# create an empty list of the right size, will hold vectors of node outputs in the future
initializeOutputVectors <- function(numberOfHiddenLayers) {
numberOfLayers <- numberOfHiddenLayers + 1
outputVectors <- vector("list", numberOfLayers)
return(outputVectors)
}
# this is the main loop that does feed-forward and back prop
trainNeuralNet <- function(trainingData,target,iterations){
count <- 0
# iterations is a constant for how many times the dataset should be iterated through
while(count<iterations){
print(count)
for(row in 1:nrow(trainingData)) { # for each row in the data set
#Feed Forward
# instance is the current row that's being looked at
instance <- trainingData[row,]
# print(instance)
for (l in 1:length(L)) { # for each weight data frame
# w is the current weights
w <- L[[l]]
#print(w)
Output <- rep(NA, length(w))
if (l!=1) {
# x is the values in the previous layer
# can't access the previous layer if you're on the first layer
x <- OutputList[[l-1]]
#print(x)
}
for (j in 1:ncol(w)) { # for each node j in the "head" layer
s <- 0
for (i in 1:(nrow(w)-1)) {
# calculate the weighted sum s of connection weights and node values
# this is used to calculate a node in the next layer
# check the instance if on the first layer
if (l==1) {
# print(i)
# print(instance[1,i])
# i+1 skips over the target column
s <- s + instance[1,i+1]*w[i,j]
# print(s)
# if the layer is 2 or more
}else{
# print(i)
#print(j)
# print(w)
# print(w[i,j])
s <- s + x[i]*w[i,j] # weighted sum
# sigmoid activation function value for node j
}
}
#print(s)
s <- s + w[nrow(w),j] # add weighted bias
# print("s")
# print(s)
# print("sigmoid s")
# print(sigmoid(s))
Output[j] <- sigmoid(s)
}
OutputList[[l]] <- Output
}
# print(OutputList)
# print("w")
# print(L)
# print("BAck prop Time")
#Back Propagation
out <- OutputList[length(OutputList)]
#print(OutputList)
outputError <- rep(NA, length(w))
outputErrorPresent <- rep(NA, length(w))
outputError[1] <- out[[1]]*(1-out[[1]])*(out[[1]]-target[row])
for (h in (length(L)):1) { # for each weight matrix in hidden area h (going backwards)
hiddenOutput <- OutputList[h]
#print("hiddenOutput")
#print(h)
if (row==1||row==2) {
# print("h")
# print(h)
# print("output error Present")
# print(outputErrorPresent)
}
if (h!=(length(L))) {
outputError <- outputErrorPresent
}
w <- L[[h]]
for (j in 1:(nrow(w))) { # for each node j in hidden layer h
#print("length w")
#print(length(w))
if (row==1||row==2) {
# print("j")
# print(j)
}
errSum <- 0
nextLayerNodes <- L[[h]]
# print(nextLayerNodes)
#print(class(nextLayerNodes))
for (k in 1:ncol(nextLayerNodes)) {
errSum <- errSum + outputError[k]*nextLayerNodes[j,k]
}
m <- 0
if (h == 1) {
m <- as.numeric(instance)
m <- m[-1]
} else {
m <- OutputList[h-1][[1]]
}
deltaWeight <- 0
for (k in 1:ncol(nextLayerNodes)) {
hiddenNodeError <- hiddenOutput[[1]][k]*(1- hiddenOutput[[1]][k])*errSum
if (j == nrow(w)) {
deltaWeight <- learningRate*hiddenNodeError
} else {
deltaWeight <- learningRate*hiddenNodeError*m[j]
}
# print(deltaWeight)
w[j,k] <- w[j,k] + deltaWeight
}
if (j != nrow(w)) {
outputErrorPresent[j] <- hiddenNodeError
}
}
L[[h]] <<- w
}
# print(OutputList)
}
count <- count +1
# print(L)
#calculate global error
}
########################repeat
# print("w")
}
sigmoid <- function(s){
sig <- 1/(1+exp(-s))
return(sig)
}
testNeuralNetwork <- function(testingData,testTarget){
correctCount <- 0
# run the same code as feed forward
# this time run it on testing examples and compare the outputs
for(row in 1:nrow(testingData)) { # for each test instance
#Feed Forward
instance <- testingData[row,]
#print(instance)
for (l in 1:length(L)) { # for each layer l
w <- L[[l]]
#print(w)
Output <- rep(NA, length(w))
if (l!=1) {
x <- OutputList[[l-1]]
#print(x)
}
for (j in 1:ncol(w)) { # for each node j in layer l
s <- 0
for (i in 1:(nrow(w)-1)) {
if (l==1) {
# i+1 skips over the target column
s <- s + instance[1,i+1]*w[i,j]
# print(s)
}else{
# print(i)
#print(j)
# print(w)
# print(w[i,j])
s <- s + x[i]*w[i,j] # weighted sum
# sigmoid activation function value for node j
}
}
#print(s)
s <- s + w[nrow(w),j] # add weighted bias
Output[j] <- sigmoid(s)
#print(sigmoid(s))
}
OutputList[[l]] <- Output
}
# print(OutputList)
outputVal <- threshold(OutputList[[length(OutputList)]])
if (outputVal==testTarget[row]) {
print(paste0(" ", outputVal, " Correct!"))
correctCount <- correctCount + 1
}else{
print(paste0(" ", outputVal, " Wrong."))
}
#print()
#print(paste0("s2 ",str))
}
}
# convert real-valued output to a binary classification
threshold <- function(value){
if (value>=0.5) {
return(1)
}else{
return(0)
}
}
# this modifies df by removing 30 random rows
# this means that the same df will be changed permanently, so be careful of that
# it also returns the 30 random rows as a test set
makeTestSet <- function(df, size) {
len <- 1:length(df[,1])
randRows <- sample(len, size, replace=F)
return(randRows)
}
Data <- read.csv(file = "Downloads/numericHouse-votes-84.csv", head = TRUE, sep = ",")
learningRate <<- 0.1
# assume that the first column of the data is the column that is to be predicted
# thus the number of inputs is 1 less than the number of columnns
numberOfInputNodes <- ncol(Data) - 1
randRows <- makeTestSet(Data,30) #change this to 30
testData <- Data[randRows,]
trainingData <- Data[-randRows,]
testTarget <- testData[,1]
#trainingData <- Data[,1:numberOfInputNodes]
trainingTarget <- trainingData[,1]
createNeuralNet(numberOfInputNodes,1,numberOfInputNodes)
iterations <- 100
trainNeuralNet(trainingData,trainingTarget,iterations)
testNeuralNetwork(testData,testTarget)
L

optimise algorithm for building a graph based on node weights

I am trying to improve a function that builds a network based on a score calculated from some node attributes. The function tries to find the best subnetwork of a graph, maximizing the product of the nodes' attributes.
The function starts at a random node and searches among its first-degree neighbours; if there are neighbours whose node scores satisfy a threshold, those neighbours are added to the first node, and the process continues until no more are added (i.e. adding a neighbour no longer produces the desired increase in the score). If no first-degree neighbour yields an increase in the score, the function looks at the second-degree neighbours. In that situation it is very likely that there are several paths connecting the node (the 2nd-degree neighbour); in this specific case the chosen path is the shortest one with the highest weight (one of the node attributes).
I could do some parallelization of the code, although I don't know how to implement it in this type of function.
The function is the following:
build_network <-
function (G, seed, d= 2){
net <- G
d <- d
score.fun<-function(g){
Za <- sum(V(g)$weight*V(g)$RWRNodeweight)/sqrt(sum(V(g)$RWRNodeweight^2))
k <- vcount(g)
tmp <- genesets.length.null.stat[[as.character(k)]] # genesets.length.null.stat is a list with the median of Za and sd of Za calculated for 1000 replicates of networks of size k
Sa <- (Za-tmp[1])/tmp[2]
}
best.fun<-function(in.nodes,out.nodes) {
score<-(-Inf); best<-character()
for(node in out.nodes){
subG.update<-induced.subgraph(net, c(in.nodes,node))
if( score.fun(subG.update) > score ){
score<-score.fun(subG.update)
best<-node
}
}
list("node"=best,"score"=score)
}
subG <- induced.subgraph(net, seed)
if (!is.connected(subG)) { #the seed must be connected
stop("Input seeds are disjoint")
}
while (TRUE) {
in.nodes <- V(subG)$name
node_num <- vcount(subG)
subsum <- score.fun(subG)
#subx <- V(subG)$name
for (rad in 1:d) {
tmp.neigh <- unlist(neighborhood(net, order = rad, nodes = V(subG)$name))
pot.nodes <- V(net)[tmp.neigh]$name
out.nodes <- setdiff(pot.nodes, in.nodes)
if (length(out.nodes) == 0) break
best_node<-best.fun(in.nodes, out.nodes)
new_score<-best_node$score
best_node<-best_node$node
if (new_score > subsum + 0.01) {
tmp <- unlist(lapply(best_node, function(x) node2treePath(net,V(subG)$name, x))) # node2treePath is a function to retrieve the shortest path with the highest node weights
in.nodes <- c(tmp, V(subG)$name)
subG <- induced.subgraph(net, in.nodes)
break
}
}
if (node_num == vcount(subG)) break
}
return(subG)
}
I am trying to apply this function to a graph of ~10,000 nodes. Here is an approximation of the code for running the function:
### generate some example data
library(igraph)
my_graph <- erdos.renyi.game(10000, 0.0003)
V(my_graph)$name <- 1:vcount(my_graph)
V(my_graph)$weight <- rnorm(10000)
V(my_graph)$RWRNodeweight <- runif(10000, min=0, max=0.05)
### Run the function
sublist = list()
for (node in V(my_graph)$name) {
subnet <- build_network(my_graph, node, d = 2)
sublist[[node]] <- subnet }
EDIT: here is the dput of head(genesets.length.null.stat)
structure(list(`1` = c(1.01397367504035, 1.18858228819048), `2` = c(1.61970348041337, 1.30189433386605), `3` = c(2.11767222957028, 1.36222065695878), `4` = c(2.47710421934929, 1.36968129959296), `5` = c(2.776011866622, 1.36318885187196), `6` = c(3.16885126246671, 1.42577861995897)), .Names = c("1", "2", "3", "4", "5", "6"))
Here is the node2treePath function:
node2treePath <- function (G, Tnodes, node){
tmp.path <- get.all.shortest.paths(G, node, Tnodes)$res
tmp.l <- unlist(lapply(tmp.path, length))
index <- which(tmp.l == min(tmp.l))
tmp.path = tmp.path[index]
tmp.sum <- unlist(lapply(tmp.path, function(x)return(sum(V(G)[x]$weight))))
index <- which(tmp.sum == max(tmp.sum))
selected.path = tmp.path[index]
collect <- unlist(lapply(selected.path, function(x)return(V(G)[x]$name)))
return(collect)
}
For the logic you want (and I imagine you may wish to change it in ways incompatible with the other answers), the following code is about 30% faster. I used Rprof and profr and recoded some slow bits in trivial ways, e.g. not passing a named list pair, just an anonymous pair, from one of your functions. The numerically named list with pairs of values for genesets.length.null.stat is very inefficient; I replaced it with two numeric vectors. You also call the V() function a lot, which was a big time consumer: as you can see, you can call it once and then query the result as needed.
# node2treePath is a function to retrieve the shortest path with the highest node weights
node2treePath_jw <- function(G, Tnodes, node){
tmp.path <- get.all.shortest.paths(G, node, Tnodes)$res
tmp.l <- vapply(tmp.path, length, integer(1))
index <- which(tmp.l == min(tmp.l))
tmp.path = tmp.path[index]
Vg <- V(G)
tmp.sum <- vapply(tmp.path, function(x) sum(Vg[x]$weight), numeric(1))
index <- which(tmp.sum == max(tmp.sum))
selected.path = tmp.path[index]
sapply(selected.path, function(x) Vg[x]$name)
}
build_network_jw <- function(net, seed, d= 2){
score.fun <- function(Vg, k){
Za <- sum(Vg$weight * Vg$RWRNodeweight) / sqrt(sum(Vg$RWRNodeweight^2))
(Za - genesets_jack_a[k]) / genesets_jack_b[k]
}
best.fun_jw <- function(in.nodes, out.nodes) {
score <- (-Inf)
best <- character()
for (node in out.nodes) {
subG.update <- induced.subgraph(net, c(in.nodes,node))
Vsgu <- V(subG.update)
Vsgu_count <- vcount(subG.update)
sf <- score.fun(Vsgu, Vsgu_count)
if (sf > score) {
score <- sf
best <- node
}
}
list(best, score)
}
subG <- induced.subgraph(net, seed)
if (!is.connected(subG)) { #the seed must be connected
stop("Input seeds are disjoint")
}
while (TRUE) {
VsubG <- V(subG)
Vnet <- V(net)
in.nodes <- VsubG$name
node_num <- vcount(subG)
subsum <- score.fun(VsubG, node_num)
for (rad in 1:d) { # d = 2
tmp.neigh <- unlist(neighborhood(net, order = rad, nodes = VsubG$name))
pot.nodes <- Vnet[tmp.neigh]$name
out.nodes <- setdiff(pot.nodes, in.nodes)
if (length(out.nodes) == 0) break
best_node <- best.fun_jw(in.nodes, out.nodes)
new_score <- best_node[[2]]
best_node <- best_node[[1]]
if (new_score > subsum + 0.01) {
tmp <- sapply(best_node, function(x) node2treePath_jw(net, VsubG$name, x))
in.nodes <- c(tmp, VsubG$name)
subG <- induced.subgraph(net, in.nodes)
break
}
}
if (node_num == vcount(subG)) break
}
subG
}
node2treePath <- function (G, Tnodes, node){
tmp.path <- get.all.shortest.paths(G, node, Tnodes)$res
tmp.l <- unlist(lapply(tmp.path, length))
index <- which(tmp.l == min(tmp.l))
tmp.path = tmp.path[index]
tmp.sum <- unlist(lapply(tmp.path, function(x)return(sum(V(G)[x]$weight))))
index <- which(tmp.sum == max(tmp.sum))
selected.path = tmp.path[index]
collect <- unlist(lapply(selected.path, function(x)return(V(G)[x]$name)))
return(collect)
}
build_network <- function (net, seed, d= 2){
#genesets.length.null.stat <- structure(list(`1` = c(1.01397367504035, 1.18858228819048), `2` = c(1.61970348041337, 1.30189433386605), `3` = c(2.11767222957028, 1.36222065695878), `4` = c(2.47710421934929, 1.36968129959296), `5` = c(2.776011866622, 1.36318885187196), `6` = c(3.16885126246671, 1.42577861995897)), .Names = c("1", "2", "3", "4", "5", "6"))
genesets.length.null.stat <- lapply(1:500, function(x) c(runif(1)+x, runif(1)+x))
names(genesets.length.null.stat) <- 1:500
score.fun<-function(g){
Za <- sum(V(g)$weight*V(g)$RWRNodeweight)/sqrt(sum(V(g)$RWRNodeweight^2))
k <- vcount(g)
tmp <- genesets.length.null.stat[[as.character(k)]] # genesets.length.null.stat is a list with the median of Za and sd of Za calculated for 1000 replicates of networks of size k
Sa <- (Za-tmp[1])/tmp[2]
}
best.fun <- function(in.nodes,out.nodes) {
score<-(-Inf); best<-character()
for (node in out.nodes){
subG.update<-induced.subgraph(net, c(in.nodes,node))
if (score.fun(subG.update) > score) {
score<-score.fun(subG.update)
best<-node
}
}
list("node"=best,"score"=score)
}
subG <- induced.subgraph(net, seed)
if (!is.connected(subG)) { #the seed must be connected
stop("Input seeds are disjoint")
}
while (TRUE) {
in.nodes <- V(subG)$name
node_num <- vcount(subG)
subsum <- score.fun(subG)
#subx <- V(subG)$name
for (rad in 1:d) {
tmp.neigh <- unlist(neighborhood(net, order = rad, nodes = V(subG)$name))
pot.nodes <- V(net)[tmp.neigh]$name
out.nodes <- setdiff(pot.nodes, in.nodes)
if (length(out.nodes) == 0) break
#message("length in.nodes = ", length(in.nodes))
#message("length out.nodes = ", length(out.nodes))
best_node<-best.fun(in.nodes, out.nodes)
new_score<-best_node$score
best_node<-best_node$node
if (new_score > subsum + 0.01) {
tmp <- unlist(lapply(best_node, function(x) node2treePath(net,V(subG)$name, x))) # node2treePath is a function to retrieve the shortest path with the highest node weights
in.nodes <- c(tmp, V(subG)$name)
subG <- induced.subgraph(net, in.nodes)
break
}
}
if (node_num == vcount(subG)) break
}
subG
}
library(igraph)
library(profr)
#genesets.length.null.stat <- lapply(1:500, function(x) c(runif(1)+x, runif(1)+x))
#names(genesets.length.null.stat) <- 1:500
set.seed(1)
genesets_jack_a = runif(500) + 1:500
genesets_jack_b = runif(500) + 1:500
do_it_jw <- function(n = 1000){
my_graph <- erdos.renyi.game(n, 0.0003)
V(my_graph)$name <- 1:vcount(my_graph)
V(my_graph)$weight <- rnorm(n)
V(my_graph)$RWRNodeweight <- runif(n, min = 0, max = 0.05)
### Run the function
sublist = list()
Vmg <- V(my_graph)
for (node in Vmg$name) {
#message(node)
subnet <- build_network_jw(my_graph, node, 2)
sublist[[node]] <- subnet }
}
do_it <- function(n = 1000){
my_graph <- erdos.renyi.game(n, 0.0003)
V(my_graph)$name <- 1:vcount(my_graph)
V(my_graph)$weight <- rnorm(n)
V(my_graph)$RWRNodeweight <- runif(n, min = 0, max = 0.05)
### Run the function
sublist = list()
Vmg <- V(my_graph)
for (node in Vmg$name) {
#message(node)
subnet <- build_network(my_graph, node, 2)
sublist[[node]] <- subnet }
}
library(microbenchmark)
mb <- microbenchmark(do_it(1000), do_it_jw(1000), times = 5)
print(mb)
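Since each seed's search is independent of the others, the outer loop over seeds can also be run in parallel; a minimal sketch with the parallel package (mclapply forks and therefore only parallelizes on Unix-alikes; my_graph here stands for a graph built as in do_it_jw()):
library(parallel)
sublist <- mclapply(V(my_graph)$name,
                    function(node) build_network_jw(my_graph, node, 2),
                    mc.cores = max(1, detectCores() - 1))
names(sublist) <- V(my_graph)$name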
Since your score function only depends on node attributes and not on edge attributes, the solution is not unique; you might want to search for a best tree instead. If you restructure your problem so that your nodes are edges and vice versa, you can probably just use e.g. Dijkstra's algorithm to find the best one. That is already in the igraph package as shortest.paths().
I can't read the R code, but based on your description: If the score threshold is constant, then this is easy to do in O(|V|+|E|+|C|^2) time, where |C| is the number of "good" components (this will be further explained shortly).
In a first pass, delete all nodes with score below the threshold. Then find all connected components in this new graph (this can be done in O(|V|+|E|) time by starting a DFS at each as-yet-unvisited node), calculate their scores by multiplying together all vertex weights in the component, and label each vertex with its component ID. This already tells you the "good" components -- the ones that don't require any 2nd-degree connections.
Suppose this produces |C| components. Create an empty hashtable H which has component-ID pairs for keys, and (length, weight) pairs for values. Now go back through each vertex v you deleted in the first pass: for each one, look at all its neighbours and record the shortest edge to each distinct component (this can be done using a length-|C| array to store the shortest edge to each component seen so far). After examining all of v's neighbours, count the number k of distinct components they fall into: if k >= 2, then v potentially should be used to connect some of these k(k-1)/2 pairs of components. For every pair of distinct components i and j that could be connected by v, update H with the weight and distance of this 2-edge connection as necessary: that is, if i and j are not yet joined together, then record that v joins them; otherwise, if they are already joined by some vertex u, only update H if v can do better (i.e., if it uses less total length and greater weight than u would). This step can be thought of as building a minimum spanning tree in a "component graph" derived from the original, pruned graph. The scores for each new "combined" component can easily be calculated as you go just by multiplying together the scores of the two constituent components.
Finally, simply return the component whose product is maximum.
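A sketch of the first pass of that approach in igraph terms (threshold pruning, component labelling, and per-component scores; the pairwise component-joining step through deleted vertices is left out), assuming a vertex attribute called score and a constant threshold:
library(igraph)
prune_and_score <- function(g, threshold) {
  g2 <- induced_subgraph(g, V(g)[V(g)$score >= threshold])   # delete nodes below the threshold
  comp <- components(g2)                                     # connected components, O(|V| + |E|)
  V(g2)$comp_id <- comp$membership                           # label each vertex with its component ID
  comp_score <- tapply(V(g2)$score, comp$membership, prod)   # product of vertex scores per component
  list(graph = g2, comp_score = comp_score,
       best = as.integer(names(which.max(comp_score))))      # the "good" component with the largest product
}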
