Specifying cross-network (multiplex) terms in R's ERGM package

I have a research question which deals with a multiplex network, i.e. there are multiple edge-types which may (or may not) co-occur between two nodes. For example:
library(igraph)
relations <- data.frame(
  from    = c(1, 1, 2, 3, 4),
  to      = c(2, 3, 3, 4, 5),
  link    = c(1, 1, 1, 0, 0),
  funding = c(0, 1, 0, 0, 1)
)
graph <- graph.data.frame(relations)
Note that two different types of relations exist, links and funding.
In my analysis I am interested in triad structures, especially transitive triads: if A -link-> B -link-> C, then A -fund-> C. I have seen multiplex networks in the literature (this great piece serves as an example), which also indicates that R should be capable of doing this.
My problem, however, is that I am unable to find any indication of the syntax for such triads across networks. I have seen that the PNet application can do this (TriangleABA from the manual), but I cannot reproduce it in R. I am looking for something like
fit <- ergm( fundnet ~ edges + odegree + ttriple(funding) + ttriple(link) )
Instead of specifying types, I could also construct the two-path matrices myself, e.g. using tcrossprod(funding, link) for a two-step via funding and then link.
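To make that matrix route concrete, something along these lines is what I have in mind (a rough sketch with toy matrices built from the data frame above; as far as I can tell, edgecov() is the ergm term that accepts such a dyadic covariate matrix):
library(ergm)
library(network)
# toy adjacency matrices for the two relation types (5 actors)
link    <- matrix(0, 5, 5)
funding <- matrix(0, 5, 5)
link[cbind(c(1, 1, 2), c(2, 3, 3))] <- 1   # rows of `relations` with link == 1
funding[cbind(c(1, 4), c(3, 5))]    <- 1   # rows of `relations` with funding == 1
# number of directed link-link two-paths i -> k -> j
# (other %*% / tcrossprod() combinations give other path orientations)
twopath_link <- link %*% link
# model the funding network, letting link two-paths predict a funding tie
fundnet <- as.network(funding, directed = TRUE)
fit <- ergm(fundnet ~ edges + edgecov(twopath_link))
summary(fit)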
Although I am not dead set on R, my network may, depending on operationalization, be around 100,000 nodes with 300,000 edges. In R, I am confident I can scale this on a cluster of machines, hence my preference.
I would strongly appreciate any remarks or (hopefully) solutions!

Related

Calculate output of dynamic fnn without linear algebra

Working on a project which involves evolving the FNN topology during runtime, I want to offload the NN output calculation to my GPU. Currently I am using Unity (soon switching to plain C# libraries without Unity or something similar), and I use HLSL-like compute shaders.
Since I am very restricted by HLSL's syntax (arrays/matrices with dynamic index ranges, dot product and matrix functions only working on pre-existing types like float2, and those "vectors" having a maximum length of 4 (float4)), I am looking for a formula with which I can calculate the output of the FNN within a single calculation (obviously still including a loop). My shader has the weights, input layer, bias and structure of the NN each as separate structured buffers. I need to find a formula with which I can calculate the output without using dot products or matrix-vector multiplications, since it is very hard and tortuous to implement those. I have tried finding such a formula for days, hopelessly...
Does anyone have a clue? Sadly there are not many resources on the internet concerning HLSL and this problem.
Thanks a lot!
EDIT:
Okay, I got some hints that my actual math-related question is not quite clear. Since I am kind of an artist myself, I made a little illustration.
There are of course multiple ways to calculate the output O. I want to note the following:
v3 = v1w1+v2w2
v4 = v1w3+v2w4
v5 = v3w5+v4w6 = w5*(v1w1+v2w2)+w6*(v1w3+v2w4)
This is an easy way to calculate the final output O in one step:
O1 = w5*(v1w1+v2w2)+w6*(v1w3+v2w4).
There are other quite cool ways to calculate the output in one step. Looking at the weights as matrices, e.g.
m1 = [[w1, w2], [w3, w4]] and m2 = [[w5, w6]].
This way we can calculate O as
L1 = m1 * I and O = m2 * L1
or in one step
O = m2*(m1*I)
which is the more elegant way. I would prefer it this way, but I can't do it with matrix-vector multiplications or any other linear algebra since I am restricted by my programming tools, so I have to stay with this shape:
O1 = w5*(v1w1+v2w2)+w6*(v1w3+v2w4).
Now this would be really easy if I had a neural network with a fixed topology. However, since it evolves during runtime, I have to find a function/formula with which I can calculate O independently of the topology. All the information I have is ONE list of ALL weights (w = [w1, w2, w3, w4, w5, w6]), a list of inputs (i = [v1, v2]) and the structure of the NN (s = [2, 2, 1]: 2 input nodes, one hidden layer with 2 nodes, and one output node).
However, I can't think of an algorithm that calculates the output efficiently from the given information.
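To make the goal concrete, here is a rough sketch (in R, just to show the arithmetic; the names and the per-target-neuron weight ordering are my own convention) of the kind of loop I would like to port to the shader, using only scalar multiply-adds:
# forward pass from a flat weight list `w`, an input vector, and a layer-size vector `s`
forward <- function(w, input, s) {
  act <- input                         # activations of the current layer
  idx <- 1                             # position in the flat weight list
  for (layer in 2:length(s)) {
    nxt <- numeric(s[layer])
    for (j in 1:s[layer]) {            # each neuron in the next layer
      acc <- 0
      for (i in 1:s[layer - 1]) {      # each neuron in the current layer
        acc <- acc + act[i] * w[idx]   # scalar multiply-add only
        idx <- idx + 1
      }
      nxt[j] <- acc                    # bias / activation function would go here
    }
    act <- nxt
  }
  act
}
# with w = [w1..w6], input = [v1, v2], s = [2, 2, 1] this reproduces
# O1 = w5*(v1*w1 + v2*w2) + w6*(v1*w3 + v2*w4)
forward(w = c(1, 2, 3, 4, 5, 6), input = c(0.5, -1), s = c(2, 2, 1))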

Graphframes PageRank performance: PySpark vs sparklyr

I am using Spark/GraphFrames from Python and from R. When I call PageRank on a small graph from Python, it is a lot slower than with R. Why is it so much slower with Python, considering that both Python and R are calling the same libraries?
I'll try to demonstrate the problem below.
Spark/GraphFrames includes example graphs, such as friends, as described on this link. This is a very small directed graph with 6 nodes and 8 edges (note that the example differs between versions of GraphFrames).
When I run the following piece of code with R, it takes almost no time to calculate PageRank:
library(graphframes)
library(sparklyr)
library(dplyr)
nodes <- read.csv('nodes.csv')
edges <- read.csv('edges.csv')
sc <- spark_connect(master = "local", version = "2.1.1")
nodes_tbl <- copy_to(sc, nodes)
edges_tbl <- copy_to(sc, edges)
graph <- gf_graphframe(nodes_tbl, edges_tbl)
ranks <- gf_pagerank(graph, reset_probability = 0.15, tol = 0.01)
print(ranks$vertices)
results <- as.data.frame(ranks$vertices)
results <- arrange(results, id)
results$pagerank <- results$pagerank / sum(results$pagerank)
print(results)
When I run the equivalent with PySpark, it takes 10 to 30 minutes:
from pyspark.sql import SparkSession
from graphframes.examples import Graphs
if __name__ == '__main__':
    sc = SparkSession.builder.master("local").getOrCreate()
    g = Graphs(sc).friends()
    results = g.pageRank(resetProbability=0.15, tol=0.01)
    results.vertices.select("id", "pagerank").show()
    results.edges.select("src", "dst", "weight").show()
I tried different versions of Spark and GraphFrames for Python so that they were aligned with the R settings.
In general, when you see such significant runtime differences between pieces of code that are apparently equivalent in different backends, you have to consider two possibilities:
They are not really equivalent. Despite using the same Java libraries under the hood, the paths which different languages use to interact with the JVM are not the same, and when the code reaches the JVM, it might not use the same call chain.
The methods are equivalent, but the configuration and/or data distribution is not the same.
In this particular case, the first and most obvious difference is how you load the data.
In sparklyr, copy_to.spark_connection uses only a single partition by default. With such small data this is often beneficial, as the parallelization/distribution overhead can be much higher than the computation cost, but it can also lead to miserable failures.
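If you want to verify this from the sparklyr side, you can inspect how many partitions the copied tables ended up with (a sketch, reusing nodes_tbl and edges_tbl from the question; sdf_num_partitions() should be available in reasonably recent sparklyr versions):
library(sparklyr)
# after copy_to() as in the question
sdf_num_partitions(nodes_tbl)   # expected: 1
sdf_num_partitions(edges_tbl)   # expected: 1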
In PySpark, the friends loader uses a standard parallelize call, which means that the number of partitions is taken from defaultParallelism.
Based on the master configuration the value is at least 1, but it can be affected by configuration options not visible here (like spark.default.parallelism).
However, as far as I can tell, these options shouldn't affect the runtime in this particular case. Moreover, the path the code takes before it reaches the JVM backend doesn't seem to differ enough between the two cases to explain the difference.
This suggests that the problem lies somewhere in the configuration. In general, there are at least two options which can significantly affect data distribution, and therefore the execution time:
spark.default.parallelism - used with RDD API to determine the number of partitions in different cases, including default post-shuffle distribution. For possible implications see for example Spark iteration time increasing exponentially when using join
It doesn't look like it affects your code here.
spark.sql.shuffle.partitions - used with Dataset API to determine the number of partitions after a shuffle (groupBy, join, etc.).
While the PageRank code uses the old GraphX API, and this parameter is not directly applicable there, the data is indexed with the Dataset API before it is passed to the older API.
If you check the source you'll see that both indexedEdges and indexVertices use joins, and therefore depend on spark.sql.shuffle.partitions.
Furthermore, the number of partitions set by the aforementioned methods will be inherited by the GraphX Graph object, significantly affecting execution time.
If you set spark.sql.shuffle.partitions to a minimum value, for example (given an active SparkSession named spark):
spark.conf.set("spark.sql.shuffle.partitions", 1)
the execution time on such small data should be negligible.
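For completeness, the equivalent setting on the sparklyr side can be supplied through the connection config (a sketch; adjust master and version to your setup):
library(sparklyr)
config <- spark_config()
config[["spark.sql.shuffle.partitions"]] <- 1
sc <- spark_connect(master = "local", version = "2.1.1", config = config)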
Conclusion:
Your environments are likely to use different values of spark.sql.shuffle.partitions.
General Directions:
If you see behavior like this and want to narrow down the problem, take a look at the Spark UI and see where things diverge. In this case you are likely to see significantly different numbers of tasks.

doubts regarding batch size and time steps in RNN

TensorFlow's RNN tutorial (https://www.tensorflow.org/tutorials/recurrent) mentions two parameters: batch size and time steps. I am confused by these concepts. In my opinion, RNNs introduce batching because the sequence to train on can be so long that backpropagation cannot cover all of it (exploding/vanishing gradients). So we divide the long training sequence into shorter sequences, each of which is a mini-batch and whose size is called the "batch size". Am I right here?
Regarding time steps, an RNN consists of only a cell (an LSTM or GRU cell, or another cell), and this cell is sequential. We can understand the sequential concept by unrolling it. But unrolling a sequential cell is a concept, not something real, which means we do not implement it in an unrolled way. Suppose the training sequence is a text corpus. Then we feed one word at a time to the RNN cell and update the weights. So why do we have time steps here? Combining this with my understanding of "batch size" above, I am even more confused. Do we feed the cell one word or multiple words (batch size)?
Batch size pertains to the number of training samples to consider at a time for updating your network weights. So, in a feedforward network, if you want to update your network weights based on gradients computed from one word at a time, your batch_size = 1.
As the gradients are computed from a single sample, this is computationally very cheap. On the other hand, it is also very erratic training.
To understand what happens during the training of such a feedforward network, I'll refer you to this very nice visual example of single_batch versus mini_batch versus single_sample training.
However, you want to understand what happens with your num_steps variable. This is not the same as your batch_size. As you might have noticed, so far I have referred to feedforward networks. In a feedforward network, the output is determined from the network inputs and the input-output relation is mapped by the learned network relations:
hidden_activations(t) = f(input(t))
output(t) = g(hidden_activations(t)) = g(f(input(t)))
After a training pass of size batch_size, the gradient of your loss function with respect to each of the network parameters is computed and your weights updated.
In a recurrent neural network (RNN), however, your network functions a tad differently:
hidden_activations(t) = f(input(t), hidden_activations(t-1))
output(t) = g(hidden_activations(t)) = g(f(input(t), hidden_activations(t-1)))
          = g(f(input(t), f(input(t-1), hidden_activations(t-2))))
          = g(f(input(t), f(input(t-1), ..., f(input(t=0), hidden_initial_state) ... )))
As you might have surmised from the naming, the network retains a memory of its previous state, and the neuron activations are now also dependent on the previous network state and, by extension, on all states the network has ever found itself in. Most RNNs employ a forgetfulness factor in order to attach more importance to more recent network states, but that is beside the point of your question.
Then, as you might surmise, it is computationally very, very expensive to calculate the gradients of the loss function with respect to the network parameters if you have to backpropagate through all states since the creation of your network. There is a neat little trick to speed up your computation: approximate your gradients with a subset of the historical network states, of size num_steps.
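Concretely, in the notation used above, truncating at num_steps means cutting the unrolled expression off after num_steps inputs and treating the earlier hidden state as a fixed starting point, so no gradient flows further back:
output(t) ≈ g(f(input(t), f(input(t-1), ..., f(input(t-num_steps+1), hidden_activations(t-num_steps)) ... )))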
If this conceptual discussion was not clear enough, you can also take a look at a more mathematical description of the above.
I found this diagram which helped me visualize the data structure.
From the image, 'batch size' is the number of examples of a sequence you want to train your RNN with for that batch. 'Values per timestep' are your inputs (in my case, my RNN takes 6 inputs), and finally, your time steps are the 'length', so to speak, of the sequence you're training on.
I'm also learning about recurrent neural nets and how to prepare batches for one of my projects (and stumbled upon this thread trying to figure it out).
Batching for feedforward and recurrent nets is slightly different, and when looking at different forums the terminology for both gets thrown around, which gets really confusing, so visualizing it is extremely helpful.
Hope this helps.
RNN's "batch size" is to speed up computation (as there're multiple lanes in parallel computation units); it's not mini-batch for backpropagation. An easy way to prove this is to play with different batch size values, an RNN cell with batch size=4 might be roughly 4 times faster than that of batch size=1 and their loss are usually very close.
As for an RNN's "time steps", let's look into the following code snippets from rnn.py. static_rnn() calls the cell for one input_ at a time, and BasicRNNCell.call() implements the forward-pass logic. In a text prediction case with, say, batch size = 8, we can think of input_ here as 8 words from different sentences in a big text corpus, not 8 consecutive words in one sentence.
In my experience, we decide the value of time steps based on how deep we would like to model in "time" or "sequential dependency". Again, to predict the next word in a text corpus with BasicRNNCell, small time steps might work. A large time step size, on the other hand, might suffer from the exploding gradient problem.
def static_rnn(cell,
               inputs,
               initial_state=None,
               dtype=None,
               sequence_length=None,
               scope=None):
  """Creates a recurrent neural network specified by RNNCell `cell`.
  The simplest form of RNN network generated is:
    state = cell.zero_state(...)
    outputs = []
    for input_ in inputs:
      output, state = cell(input_, state)
      outputs.append(output)
    return (outputs, state)
  """
class BasicRNNCell(_LayerRNNCell):
  def call(self, inputs, state):
    """Most basic RNN: output = new_state =
    act(W * input + U * state + B).
    """
    gate_inputs = math_ops.matmul(
        array_ops.concat([inputs, state], 1), self._kernel)
    gate_inputs = nn_ops.bias_add(gate_inputs, self._bias)
    output = self._activation(gate_inputs)
    return output, output
To visualize how these two parameters are related to the data set and the weights, Erik Hallström's post is worth reading. From this diagram and the above code snippets, it is obvious that an RNN's "batch size" will not affect the weights (wa, wb, and b), but its "time steps" does. So, one could decide an RNN's "time steps" based on the problem and network model, and its "batch size" based on the computation platform and data set.

customer segmentation in retail [closed]

I have a large sales database of a 'home and construction' retailer.
I need to know who the electricians, plumbers, painters, etc. among the customers are.
My first approach was to select the articles related to a specialty (wire [article] is related to electricians [specialty], for example) and then, based on customer sales, identify who those customers are.
But this is a lot of work.
My second approach is to do a cluster segmentation first, and then discover which cluster belongs to which specialty. (This is a lot better, because I would be able to discover new segments.)
But how can I do that? What type of clustering should I use: k-means, fuzzy? What variables should I feed to that model? Should I use PCA to know how many clusters to look for?
The header of my data (simplified):
customer_id | transaction_id | transaction_date | item_article_id | item_group_id | item_category_id | item_qty | sales_amt
Any help would be appreciated. (Sorry for my English.)
You want to identify classes of customers based on what they buy (I presume this is for marketing reasons). This calls for a clustering approach. I will talk you through the entire setup.
The clustering space
Let us first consider what exactly you are clustering: either orders or customers. In either case, the way you characterize the items and the distances between them is the same. I will discuss the basic case for orders first, and then explain the considerations that apply to clustering by customers instead.
For your purpose, an order is characterized by what articles were purchased, and possibly also how many of them. In terms of a space, this means that you have a dimension for each type of article (item_article_id), for example the "wire" dimension. If all you care about is whether an article is bought or not, each item has a coordinate of either 0 or 1 in each dimension. If some order includes wire but not pipe, then it has a value of 1 on the "wire" dimension and 0 on the "pipe" dimension.
However, there is something to say for caring about the quantities. Perhaps plumbers buy lots of glue while electricians buy only small amounts. In that case, you can set the coordinate in each dimension to the quantity of the corresponding article (presumably item_qty). So suppose you have three articles, wire, pipe and glue, then an order described by the vector (2, 3, 0) includes 2 wire, 3 pipe and 0 glue, while an order described by the vector (0, 1, 4) includes 0 wire, 1 pipe and 4 glue.
If there is a large spread in the quantities for a given article, i.e. if some orders include orders of magnitude more of some article than other orders, then it may be helpful to work with a log scale. Suppose you have these four orders:
2 wire, 2 pipe, 1 glue
3 wire, 2 pipe, 0 glue
0 wire, 100 pipe, 1 glue
0 wire, 300 pipe, 3 glue
The former two orders look like they may belong to electricians while the latter two look like they belong to plumbers. However, if you work with a linear scale, order 3 will turn out to be closer to orders 1 and 2 than to order 4. We fix that by using a log scale for the vectors that encode these orders (I use the base 10 logarithm here, but it does not matter which base you take because they differ only by a constant factor):
(0.30, 0.30, 0)
(0.48, 0.30, -2)
(-2, 2, 0)
(-2, 2.48, 0.48)
Now order 3 is closest to order 4, as we would expect. Note that I have used -2 as a special value to indicate the absence of an article, because the logarithm of 0 is not defined (log(x) tends to negative infinity as x tends to 0). -2 means that we pretend that the order included 1/100th of the article; you could make the special value more or less extreme, depending on how much weight you want to give to the fact that an article was not included.
The input to your clustering algorithm (regardless of which algorithm you take, see below) will be a position matrix with one row for each item (order or customer), one column for each dimension (article), and either the presence (0/1), amount, or logarithm of the amount in each cell, depending on which you choose based on the discussion above. If you cluster by customers, you can simply sum the amounts from all orders that belong to that customer before you calculate what goes into each cell of your position matrix (if you use the log scale, sum the amounts before taking the logarithm).
Clustering by orders rather than by customers gives you more detail, but also more noise. Customers may be consistent within an order but not between them; perhaps a customer sometimes behaves like a plumber and sometimes like an electrician. This is a pattern that you will only find if you cluster by orders. You will then find how often each customer belongs to each cluster; perhaps 70% of somebody's orders belong to the electrician type and 30% belong to the plumber type. On the other hand, a plumber may only buy pipe in one order and then only buy glue in the next order. Only if you cluster by customers and sum the amounts of their orders, you get a balanced view of what each customer needs on average.
From here on I will refer to your position matrix by the name my.matrix.
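As an illustration of how you could build my.matrix from the table in your question (a sketch; sales stands for your data frame, and the -2 sentinel follows the convention discussed above):
# order-by-article position matrix: rows = transactions, columns = articles,
# cells = total quantity bought in that transaction
my.matrix <- unclass(xtabs(item_qty ~ transaction_id + item_article_id, data = sales))
# customer-by-article variant: sum over all of a customer's orders instead
# my.matrix <- unclass(xtabs(item_qty ~ customer_id + item_article_id, data = sales))
# optional log scale, with -2 standing in for "article not bought at all"
my.matrix <- ifelse(my.matrix > 0, log10(my.matrix), -2)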
The clustering algorithm
If you want to be able to discover new customer types, you probably want to let the data speak for themselves as much as possible. A good old fashioned
hierarchical clustering with complete linkage (CLINK) may be an appropriate choice in this case. In R, you simply do hclust(dist(my.matrix)) (this will use the Euclidean distance measure, which is probably good enough in your case). It will join closely neighbouring items or clusters together until all items are categorized in a hierarchical tree. You can treat any branch of the tree as a cluster, observe typical article amounts for that branch and decide whether that branch represents a customer segment by itself, should be split in sub-branches, or joined with a sibling branch instead. The advantage is that you find the "full story" of which items and clusters of items are most similar to each other and how much. The disadvantage is that the outcome of the algorithm does not tell you where to draw the borders between your customer segments; you can cut up the clustering tree in many ways, so it's up to your interpretation how you want to identify your customer types.
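A sketch of that workflow (the cut into 5 segments at the end is only an example, not a recommendation):
# complete-linkage hierarchical clustering on Euclidean distances
hc <- hclust(dist(my.matrix))      # method = "complete" is the default
plot(hc)                           # inspect the tree to decide where to cut
segments <- cutree(hc, k = 5)      # e.g. cut the tree into 5 customer segments
table(segments)                    # how many items fall into each segment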
On the other hand, if you are comfortable fixing the number of clusters (k) beforehand, k-means is a very robust way to get just any segmentation of your customers in k distinct types. In R, you would do kmeans(my.matrix, k). For marketing purposes, it may be sufficient to have (say) 5 different profiles of customers that you make custom advertisement for, rather than treating all customers the same. With k-means you don't explore all of the diversity that is present in your data, but you might not need to do so anyway.
If you don't want to fix the number of clusters beforehand, but you also don't want to manually decide where to draw the borders between the segments afterwards, there is a third possibility. You start with the k-means algorithm, letting it generate a number of cluster centers that is much larger than the number of clusters you hope to end up with (for example, if you hope to end up with somewhere around 10 clusters, let the k-means algorithm look for 200 clusters). Then, use the mean shift algorithm to further cluster the resulting centers. You will end up with a smaller number of compact clusters. The approach is explained in more detail by James Li over here. You can use the mean shift algorithm in R with the ms function from the LPCM package; see this documentation.
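A rough sketch of that two-stage idea (the 200 centers and the bandwidth h are placeholders you would have to tune; check the LPCM documentation for the exact arguments of ms()):
library(LPCM)
# stage 1: deliberately over-segment with k-means
km <- kmeans(my.matrix, centers = 200)
# stage 2: merge the 200 centers into a smaller set of modes with mean shift
msfit <- ms(km$centers, h = 1)     # h is the kernel bandwidth (placeholder value)
str(msfit)                         # inspect the resulting modes and cluster labels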
About using PCA
PCA will not tell you how many clusters you need. PCA answers a different question: which variables seem to represent a common underlying (hidden) factor. In a sense, it is a way to cluster variables, i.e. properties of entities, not to cluster the entities themselves. The number of principal components (common underlying factors) is not indicative of the number of clusters needed. PCA can still be interesting if you want to learn something about the predictive value of each article about a customer's interests.
Sources
Michael J. Crawley, 2005. Statistics. An Introduction using R.
Gerry P. Quinn and Michael J. Keough, 2002. Experimental Design and Data Analysis for Biologists.
Wikipedia: hierarchical clustering, k-means, mean shift, PCA

What Are Large Graphs? What Is Large Graph Analysis? What Is Big Data? What Is Big Data Analysis?

I know what these are, as I have started working with them. But for now I just want to know the formal definitions of these terms.
Any help in this regard is highly appreciated.
In my opinion, there is no absolute, formal criterion for when a graph becomes 'large' or when an amount of data becomes 'big'. These adjectives are meaningless without a frame of reference.
For instance, when you say someone is 'tall', it is implicitly assumed that you are either comparing this person to yourself, or to a perceived average height of people. If you change your frame of reference and compare this person to, let's say Mount Everest, this person's height becomes negligible. I could give a billion other examples, but the take-home message is: there is no absolute notion of 'bigness' or 'smallness'. The notion of scale is a relative notion. Simple concept, but with very strong implication: in a sense, physics has been so successful because physicists understood it very early.
So, to answer this question, I think a good rule of thumb is:
'Large graphs' are graphs whose exploration requires computation times on a typical quad-core machine that are long compared to what people judge reasonable (an hour, a day; your patience may vary).
'Big data' is typically data which takes too much space to be stored on a single hard drive.
Of course these are just rules of thumb.
Usually, a graph whose nodes and arrows form sets is a small graph; otherwise, it is a large graph.
If we denote the collection of nodes of a graph G by G0 and the collection of arrows by G1, then let G0 = {1, 2} and G1 = {a, b, c}, with source(a) = target(a) = source(b) = target(c) = 1 and target(b) = source(c) = 2. This is a small graph. By contrast, the graph of sets and functions has all sets as nodes and all functions between sets as arrows; the source of a function is its domain, and its target is its codomain.
In this example, unlike the previous one, the nodes do not form a set. Thus the graph of sets and functions is a large graph.
More generally we refer to any kind of mathematical structure as ‘small’ if the collection(s) it is built on form sets, and ‘large’ otherwise.
