Batching batches of graphs in PyTorch Geometric

I'm working with some data that have a precise structure and I'm struggling to keep the same structure when using graphs. The data has a nested structure:
one sample has L sub-samples
one sub-sample has S HeteroData graphs
Probably you already see the problem. In the standard case one has data samples and creates batches from them (e.g. if batch size is B then one batch contains B graphs). But in my case I first need to group my S graphs into a sub-sample, then group the L sub-samples and finally create batches from them. In other words, if batch size is B, one batch would have B samples, each of them having L sub-samples, each of them having S graphs.
Is there a way to do that in PyTorch Geometric?
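One possible approach, offered as a sketch rather than a confirmed PyTorch Geometric recipe: flatten the nesting inside a custom collate function and keep index lists that record which sub-sample and which sample each graph came from. The code below is plain Python and treats the graphs as opaque objects; the function name `flatten_nested_batch` and the index names are hypothetical, and strings stand in for HeteroData objects.

```python
def flatten_nested_batch(samples):
    """Flatten B samples x L sub-samples x S graphs into one flat list.

    `samples` is a list of samples; each sample is a list of sub-samples;
    each sub-sample is a list of graph objects (treated as opaque here).
    Returns the flat graph list plus two index lists:
      graph_to_sub[g]  -> global sub-sample index of graph g
      sub_to_sample[k] -> sample index of global sub-sample k
    """
    graphs, graph_to_sub, sub_to_sample = [], [], []
    for sample_idx, sample in enumerate(samples):
        for sub_sample in sample:
            sub_idx = len(sub_to_sample)  # next global sub-sample index
            sub_to_sample.append(sample_idx)
            for graph in sub_sample:
                graphs.append(graph)
                graph_to_sub.append(sub_idx)
    return graphs, graph_to_sub, sub_to_sample


# Toy input: B=2 samples, L=2 sub-samples each, S=3 graphs each.
# Strings are placeholders for HeteroData objects (an assumption for the demo).
samples = [
    [[f"g{b}{l}{s}" for s in range(3)] for l in range(2)]
    for b in range(2)
]
graphs, graph_to_sub, sub_to_sample = flatten_nested_batch(samples)
# With PyTorch Geometric installed, `graphs` (a list of HeteroData) could
# then be passed to torch_geometric.data.Batch.from_data_list(graphs).
```

A function like this could be supplied as the `collate_fn` of a `torch.utils.data.DataLoader` whose dataset returns one nested sample per item. After pooling the graphs, the two index lists (converted to tensors) would let you aggregate first per sub-sample and then per sample, which recovers the B x L x S structure.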

Related

Translating a for-loop to perhaps an apply through a list

I have an R code question that has kept me from completing several tasks for the last year, but I am relatively new to R. I am trying to loop over a list to create two variables with a specified correlation structure. I have been able to "cobble" this together with a "for" loop. To further complicate matters, I need to be able to put the correlation number into a data frame two times.
For my ultimate usage, I am concerned about speed, efficiency, and long-term effectiveness of my code.
library(mvtnorm)
n = 100
d = NULL
col = c(0, .3, .5)
for (j in 1:length(col)) {
  X.corr = matrix(c(1, col[j], col[j], 1), nrow=2, ncol=2)
  x = rmvnorm(n, mean=c(0,0), sigma=X.corr)
  x1 = x[,1]
  x2 = x[,2]
}
d = rbind(d, c(j))
Let me describe my code so my logic is clear. This is part of a larger simulation. I am trying to draw 2 correlated variables with the rmvnorm function from the mvtnorm package, at 3 different correlation levels (one per pass), using 100 observations [toy data to get the coding correct]. d is an empty data frame. The 3 correlation levels occur in the following way: pass 1 uses correlation 0 to create the variables, and yes, other code will occur; pass 2 uses correlation .3 to create 2 new variables, and then other code will occur; pass 3 uses correlation .5 to create 2 new variables, and then other code will occur. Within my larger code, the for-loop gets the job done. The last line puts the number of the correlation into the data frame. I realize that as presented here it will only put 1 number into this data frame, but when it is incorporated into my larger code it works as desired by putting 3 different numbers in a single column (1=0, 2=.3, and 3=.5). To reiterate, the for-loop gets the job done, but I believe there is a better way, perhaps something in the apply family. I do not know how to construct this and still access which correlation is being used. Would someone help me develop this little piece of code? Thank you.

How to create graph with large number of points in R?

I have a large dataset containing more than 25000 nodes, organized in a .csv file. The structure is similar to the following:
node freq
a 3
b 2
c 5
I want to create a graph from these nodes in which edges between nodes are constructed by a function of the freq column. I have used the rgraph function from the sna package, as follows:
num_nodes <- length(data$node)
pLink = data$freq/10
# create 1 graph with nodes and link probability, graph loops = FALSE
graph_adj= rgraph(num_nodes,1,pLink,"graph",FALSE)
graph <- graph.adjacency(graph_adj, mode="undirected")
The above code runs with a small number of nodes, but with a large number of nodes the R session aborts with the following error:
Error: C stack usage 19924416 is too close to the limit
Is there another way to create a graph with the mentioned properties, that is, a large number of nodes with edges created according to a probability?

Run the script using hard drive

Given a set of n inputs, I want to generate all permutations of 0's and 1's (essentially the input matrix for a truth table). In order to do so, I am using the permutations command (using the gtools package) in R, as follows:
> permutations(2,n,v=c(0,1),repeats.allowed=TRUE)
where n is the number of inputs.
However, for a sufficiently large n (let's say 26), the size of the variable becomes very high (if n=26, the variable would be approx. 13GB in size). Given this, I wanted to know if there is any way (in R) of using the hard disk instead of creating the variable in RAM? (I might actually have to run this with n = 86, which would be impossible to do in RAM.)
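As an illustration of the underlying idea, sketched in Python rather than R: the rows of a truth table do not have to be stored at all, because row i is just the binary representation of i. The helper names `truth_table_rows` and `row_at` below are hypothetical, and this is a sketch of lazy generation, not a replacement for the gtools call.

```python
from itertools import islice, product


def truth_table_rows(n):
    """Return a lazy iterator over the rows of an n-input truth table."""
    # product() is lazy: only the current row is held in memory.
    return product((0, 1), repeat=n)


def row_at(index, n):
    """Return row `index` directly from the binary digits of the index."""
    return tuple((index >> (n - 1 - bit)) & 1 for bit in range(n))


# Take only the first four rows of the 26-input table; the other
# 2**26 - 4 rows are never materialized.
first_rows = list(islice(truth_table_rows(26), 4))
```

Generating rows on demand removes the memory problem for n = 26. For n = 86 the table has 2^86 (about 7.7e25) rows, so neither RAM nor disk helps there; only computing individual rows on request, as `row_at` does, stays feasible.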

Generating large adjacency matrix

I'm trying to generate an adjacency matrix from a csv.
The csv contains 2 columns, 1 for users and 1 for projects. The two columns form a bipartite graph, where each user can be part of multiple projects or none at all, but no edges between nodes of the same set (there are no repeating entries for the same user-project pair, but there are repeated entries of the same user or projects with different combinations for pairs).
I wrote a comparison that checks each user's projects against the entire project set using Matlab and ismember(a,b). The algorithm runs iteratively through each entry. In the end, I have an adjacency matrix M of size (|users| + |projects|) x (|users| + |projects|).
For small entry counts (< 15000) it works fast, but for a sample of more than 15000, Matlab stalls. I initialize the adjacency matrix with a zeros matrix (zeros(r,c)) and add the results of ismember(a,b) row by row. But for my Matlab, a zeros matrix zeros(15000,15000) almost maxes out the memory. I also tried making a zero matrix of that size in R (matrix(0, 15000, 15000)) and it also maxes out R's memory.
Is there a way to get around this? My full sample size is 597,000 rows (with ~70,000 users and ~35,000 projects) and I want to run a network analysis of it.
Also, I want to keep it in matrix format and not as an adjacency list, because I have a max-flow min-cut algorithm I want to run on the results and it only works with matrices.
Updated:
The data looks like this
User | Project
382 2429
385 2838
294 2502
... ...
It is taken from SourceForge using Zerlot from the University of Notre Dame, where each int value is a key in a SQL database.
I want to convert this affiliation data into a one-mode user-to-user adjacency matrix where each edge between users is a shared project.
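The conversion described above can be sketched without allocating a dense matrix. The Python below is an illustration, not the asker's Matlab code: it groups users by project and stores only the user pairs that actually share a project. The function name `user_projection` is hypothetical. The first three rows come from the sample data; the last two are invented here so that some users share a project.

```python
from collections import defaultdict
from itertools import combinations


def user_projection(rows):
    """Return the set of user-user edges (pairs sharing a project)."""
    members = defaultdict(set)
    for user, project in rows:
        members[project].add(user)
    edges = set()
    for users in members.values():
        # Every pair of users on the same project gets one edge.
        edges.update(combinations(sorted(users), 2))
    return edges


rows = [
    (382, 2429), (385, 2838), (294, 2502),  # from the sample data
    (385, 2429), (294, 2429),               # invented for illustration
]
edges = user_projection(rows)
```

The resulting edge set is the nonzero pattern of the one-mode adjacency matrix, so it could be loaded into a sparse matrix type if the downstream algorithm accepts one. Note that a project with k users contributes k(k-1)/2 pairs, so very large projects can still dominate memory.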

Clustering big data

I have a list like this:
A B score
B C score
A C score
......
where the first two columns contain the variable names and the third column contains the score between them. The total number of variables is 250,000 (A, B, C, ...), and the score is a float in [0,1]. The file is approximately 50 GB. The pairs A,B whose score is 1 have been removed, as more than half the entries were 1.
I wanted to perform hierarchical clustering on the data.
Should I convert the linear form to a matrix with 250,000 rows and 250,000 columns? Or should I partition the data and do the clustering?
I'm clueless with this. Please help!
Thanks.
Your input data already is the matrix.
However, hierarchical clustering usually scales as O(n^3). That won't work with your data set's size. Plus, implementations usually need more than one copy of the matrix. You may need 1TB of RAM then... 2 * 8 * 250000 * 250000 bytes is a lot.
Some special cases can run in O(n^2): SLINK does. If your data is nicely sorted, it should be possible to run single-link in a single pass over your file. But you will have to implement this yourself. Don't even think of using R or something fancy.
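A minimal sketch of the single-pass idea, under two assumptions that the answer does not confirm: the score is a distance, and the pairs arrive sorted by ascending score. This is a Kruskal-style union-find pass, not SLINK itself, and the function name `single_link_merges` is hypothetical. Only one parent entry per variable is kept in memory, never the full matrix.

```python
def single_link_merges(sorted_pairs):
    """Return single-link merges from pairs sorted by ascending distance.

    Each input item is (name_a, name_b, distance). A pair triggers a merge
    only if its endpoints are still in different clusters (union-find).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    merges = []
    for a, b, dist in sorted_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
            merges.append((a, b, dist))
    return merges


# Toy data, already sorted by distance (values invented for illustration).
pairs = [("A", "B", 0.1), ("B", "C", 0.2), ("A", "C", 0.3), ("C", "D", 0.7)]
merges = single_link_merges(pairs)
```

The list of merges, in order, is the single-link dendrogram: each entry joins two clusters at the given distance. Because `sorted_pairs` can be any iterable, it could be a generator that reads the 50 GB file line by line.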
