GraphFrames PageRank performance: PySpark vs sparklyr

I am using Spark/GraphFrames from Python and from R. When I call PageRank on a small graph from Python, it is a lot slower than with R. Why is it so much slower with Python, considering that both Python and R are calling the same libraries?
I'll try to demonstrate the problem below.
Spark/GraphFrames includes example graphs, such as friends, as described at this link. This is a very small directed graph with 6 nodes and 8 edges (note that the example graph is not the same across versions of GraphFrames).
When I run the following piece of code with R, it takes almost no time to calculate PageRank:
library(graphframes)
library(sparklyr)
library(dplyr)
nodes <- read.csv('nodes.csv')
edges <- read.csv('edges.csv')
sc <- spark_connect(master = "local", version = "2.1.1")
nodes_tbl <- copy_to(sc, nodes)
edges_tbl <- copy_to(sc, edges)
graph <- gf_graphframe(nodes_tbl, edges_tbl)
ranks <- gf_pagerank(graph, reset_probability = 0.15, tol = 0.01)
print(ranks$vertices)
results <- as.data.frame(ranks$vertices)
results <- arrange(results, id)
results$pagerank <- results$pagerank / sum(results$pagerank)
print(results)
When I run the equivalent with PySpark, it takes 10 to 30 minutes:
from pyspark.sql import SparkSession
from graphframes.examples import Graphs

if __name__ == '__main__':
    sc = SparkSession.builder.master("local").getOrCreate()
    g = Graphs(sc).friends()
    results = g.pageRank(resetProbability=0.15, tol=0.01)
    results.vertices.select("id", "pagerank").show()
    results.edges.select("src", "dst", "weight").show()
I tried different versions of Spark and GraphFrames for Python to align them with the R settings.

In general, when you see such significant runtime differences between pieces of code that are apparently equivalent but run through different backends, you have to consider two possibilities:
They are not really equivalent. Despite using the same Java libraries under the hood, the paths that different languages use to interact with the JVM are not the same, and once the code reaches the JVM, it might not use the same call chain.
The methods are equivalent but the configuration and/or data distribution is not the same.
In this particular case, the first and most obvious difference is how you load the data.
In sparklyr, copy_to.spark_connection uses only a single partition by default. With data this small that can often be beneficial, since the parallelization / distribution overhead can be much higher than the computation cost, but it can also lead to miserable failures.
In PySpark, the friends loader uses a standard parallelize, which means the number of partitions is determined by defaultParallelism.
Based on the master configuration the value is at least 1, but it can be affected by configuration options not visible here (like spark.default.parallelism).
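If you want to verify what your session is actually using, you can inspect these values directly from PySpark (a minimal sketch; the numbers you get depend entirely on your environment and configuration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()

# Parallelism used by RDD operations such as parallelize
print(spark.sparkContext.defaultParallelism)

# Post-shuffle partition count used by the DataFrame / Dataset API
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Partition count of a concrete DataFrame
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])
print(df.rdd.getNumPartitions())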
However, as far as I can tell, these options shouldn't affect the runtime in this particular case. Moreover, the path the code takes before it reaches the JVM backend doesn't seem to differ enough in the two cases to explain the difference.
This suggests that the problem lies somewhere in the configuration. In general, there are at least two options which can significantly affect data distribution, and therefore the execution time:
spark.default.parallelism - used by the RDD API to determine the number of partitions in different cases, including the default post-shuffle distribution. For possible implications see, for example, Spark iteration time increasing exponentially when using join.
It doesn't look like it affects your code here.
spark.sql.shuffle.partitions - used by the Dataset API to determine the number of partitions after a shuffle (groupBy, join, etc.).
While the PageRank code uses the old GraphX API, where this parameter is not directly applicable, the edges and vertices are first indexed with the Dataset API before the data is passed to the older API.
If you check the source you'll see that both indexedEdges and indexVertices use joins, and therefore depend on spark.sql.shuffle.partitions.
Furthermore, the number of partitions set by the aforementioned methods will be inherited by the GraphX Graph object, significantly affecting execution time.
If you set spark.sql.shuffle.partitions to a minimum value:
spark: SparkSession
spark.conf.set("spark.sql.shuffle.partitions", 1)
the execution time on such small data should be negligible.
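Applied to the PySpark code from the question, a minimal sketch (only the spark.sql.shuffle.partitions setting is new; the rest mirrors the original example):
from pyspark.sql import SparkSession
from graphframes.examples import Graphs

if __name__ == '__main__':
    spark = (SparkSession.builder
             .master("local")
             .config("spark.sql.shuffle.partitions", 1)
             .getOrCreate())
    g = Graphs(spark).friends()
    results = g.pageRank(resetProbability=0.15, tol=0.01)
    results.vertices.select("id", "pagerank").show()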
Conclusion:
Your environments are likely to use different values of spark.sql.shuffle.partitions.
General Directions:
If you see behavior like this and want to roughly narrow down the problem, take a look at the Spark UI and see where things diverge. In this case you're likely to see significantly different numbers of tasks.

Related

In Chainer, how to write a BPTT updater using multiple GPUs?

I can't find an example because the existing example only extends training.StandardUpdater and thus only uses one GPU.
I assume that you are talking about the BPTTUpdater of the ptb example of Chainer.
It's not straightforward to make the customized updater support learning on multiple GPUs. MultiprocessParallelUpdater hard-codes the way the gradient is computed (only the target link implementation is customizable), so you have to copy the overall implementation of MultiprocessParallelUpdater and modify the gradient computation parts. The file you have to copy and edit is chainer/training/updaters/multiprocess_parallel_updater.py.
There are two parts in this file that compute the gradient: one in _Worker.run, which represents a worker process task, and the other in MultiprocessParallelUpdater.update_core, which represents the master process task. You have to make this code do BPTT by modifying the code from _calc_loss to backward() in each of these two parts:
# Change self._master into self.model for _Worker.run code
loss = _calc_loss(self._master, batch)
self._master.cleargrads()
loss.backward()
It should be modified by inserting the logic of BPTTUpdater.update_core, shown below for reference.
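For reference, this is roughly the single-GPU BPTTUpdater from the ptb example (reproduced as a sketch; bprop_len is the truncation length you pass in). The loss accumulation and the backward() / unchain_backward() calls are what have to be transplanted into the two gradient-computation sites mentioned above:
import chainer
from chainer import training

class BPTTUpdater(training.StandardUpdater):

    def __init__(self, train_iter, optimizer, bprop_len, device):
        super(BPTTUpdater, self).__init__(train_iter, optimizer, device=device)
        self.bprop_len = bprop_len

    def update_core(self):
        loss = 0
        train_iter = self.get_iterator('main')
        optimizer = self.get_optimizer('main')

        # Accumulate the loss over bprop_len successive time steps
        for i in range(self.bprop_len):
            batch = train_iter.__next__()
            x, t = self.converter(batch, self.device)
            loss += optimizer.target(chainer.Variable(x), chainer.Variable(t))

        optimizer.target.cleargrads()  # clear old gradients
        loss.backward()                # backprop through the accumulated loss
        loss.unchain_backward()        # truncate the graph to save memory
        optimizer.update()             # apply the parameter update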
You also have to take care of the data iterators. MultiprocessParallelUpdater accepts the set of iterators that will be distributed to the master/worker processes. Since the ptb example uses a customized iterator (ParallelSequentialIterator), you have to make sure these iterators iterate over different portions of the dataset or use different initial offsets of word positions. This may require customizing ParallelSequentialIterator as well.

How to speed up the generation of a latin hypercube (LHS) design

I'm trying to generate an optimized LHS (Latin Hypercube Sampling) design in R, with sample size N = 400 and d = 7 variables, but it's taking forever. My PC is an HP Z820 workstation with 12 cores, 32 GB RAM, Windows 7 64-bit, and I'm running Microsoft R Open, which is a multicore version of R. The code has been running for half an hour, but I still don't see any results:
library(lhs)
lhs_design <- optimumLHS(n = 400, k = 7, verbose = TRUE)
It seems a bit weird. Is there anything I could do to speed it up? I heard that parallel computing may help with R, but I don't know how to use it, and I have no idea whether it speeds up only code that I write myself, or whether it could speed up an existing package function such as optimumLHS. I don't have to use the lhs package necessarily - my only requirement is that I would like to generate an LHS design which is optimized in terms of the S-optimality criterion, the maximin metric, or some other similar optimality criterion (thus, not just a vanilla LHS). If worst comes to worst, I could even accept a solution in a different environment than R, but it must be either MATLAB or an open source environment.
Just a little code to check performance.
library(lhs)
library(ggplot2)
performance <- c()
for (i in 1:100) {
  ptm <- proc.time()
  invisible(optimumLHS(n = i, k = 7, verbose = FALSE))
  time <- print(proc.time() - ptm)[[3]]
  performance <- rbind(performance, data.frame(time = time, n = i))
}
ggplot(performance, aes(x = n, y = time)) +
  geom_point()
Not looking too good. It seems to me you might be in for a very long wait indeed. Based on the algorithm, I don't think there is a way to speed things up via parallel processing, since to optimize the separation between sample points you need to know the locations of all the sample points. I think your only options for speeding this up will be to take a smaller sample or get access to a faster computer. It strikes me that, since this is something that only really has to be done once, perhaps there is a resource where you could get a properly sampled and optimized distribution that has already been computed?
So it looks like ~650 hours on my machine, which is very comparable to yours, to compute the design with n = 400.
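Since the question explicitly allows an open-source environment other than R, one possible workaround is SciPy's quasi-Monte Carlo module. This is a sketch, not a drop-in replacement: it optimizes for low discrepancy rather than the exact S-optimality / maximin criteria of optimumLHS, and it requires SciPy >= 1.8 for the optimization argument, but it is typically much faster than a full exchange-type optimization:
from scipy.stats import qmc

# "random-cd" performs random coordinate swaps to lower the centered discrepancy
sampler = qmc.LatinHypercube(d=7, optimization="random-cd", seed=42)
design = sampler.random(n=400)        # 400 x 7 design on [0, 1)^7

print(qmc.discrepancy(design))        # quality measure of the resulting design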

OpenMDAO 1.x relevance reduction

I have a component in OpenMDAO without outputs that serves to provide inputs to the rest of the group. apply_linear in that component is being called despite the fact that its output is not connected. Shouldn't the relevance reduction algorithm in OpenMDAO 1.x figure out that apply_linear for this component never needs to be called?
As it turns out, relevance reduction on a per-variable basis isn't turned on by default. You can turn it on with:
prob.root.ln_solver = LinearGaussSeidel()
prob.root.ln_solver.options['single_voi_relevance_reduction'] = True
This option is set to False by default because it uses more memory, allocating separate vectors for each quantity of interest (each vector is smaller because it only contains relevant variables, but the total size may be larger). Also, relevance reduction is only applicable when using Linear Gauss-Seidel as the top linear solver.
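For context, a minimal OpenMDAO 1.x sketch of where these lines go (the toy model, the component names, and the openmdao.api import are assumptions for illustration, not taken from the question):
from openmdao.api import Problem, Group, IndepVarComp, ExecComp, LinearGaussSeidel

prob = Problem(root=Group())
prob.root.add('p1', IndepVarComp('x', 3.0))
prob.root.add('comp', ExecComp('y = 2.0*x'))
prob.root.connect('p1.x', 'comp.x')

# Per-variable relevance reduction only works with LinearGaussSeidel on top
prob.root.ln_solver = LinearGaussSeidel()
prob.root.ln_solver.options['single_voi_relevance_reduction'] = True

prob.setup(check=False)
prob.run()
print(prob['comp.y'])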
My reputation isn't high enough yet to leave comments, so I'm just adding another answer instead. I just wanted to mention that if you're not running under MPI, activating single_voi_relevance_reduction is essentially free. The real increase in memory use isn't due to the vectors themselves, but instead it's due to the index arrays that we store in order to transfer the data from source arrays to target arrays. We're forced to use index arrays under MPI, because PETSc requires it, but when we're not using MPI we use python slice objects to do our data transfer. Slice objects require very little memory.

Cheat sheet for caffe / pycaffe?

Does anyone know whether there is a cheat sheet for all important pycaffe commands?
So far I have been using caffe only via the MATLAB interface and terminal + bash scripts.
I wanted to shift towards using IPython and work through the IPython notebook examples. However, I find it hard to get an overview of all the functions inside the caffe module for Python. (I'm also quite new to Python.)
The pycaffe tests and this file are the main gateway to the python coding interface.
First of all, you would like to choose whether to use Caffe with CPU or GPU. It is sufficient to call caffe.set_mode_cpu() or caffe.set_mode_gpu(), respectively.
Net
The main class that the pycaffe interface exposes is the Net. It has two constructors:
net = caffe.Net('/path/prototxt/descriptor/file', caffe.TRAIN)
which simply creates a Net (in this case using the Data Layer specified for training), or
net = caffe.Net('/path/prototxt/descriptor/file', '/path/caffemodel/weights/file', caffe.TEST)
which creates a Net and automatically loads the weights as saved in the provided caffemodel file - in this case using the Data Layer specified for testing.
A Net object has several attributes and methods. They can be found here. I will cite just the ones I use more often.
You can access the network blobs by means of Net.blobs. E.g.
data = net.blobs['data'].data
net.blobs['data'].data[...] = my_image
fc7_activations = net.blobs['fc7'].data
You can access the parameters (weights) too, in a similar way. E.g.
nice_edge_detectors = net.params['conv1'][0].data  # index 0 holds the weights, index 1 the biases
higher_level_filter = net.params['fc7'][0].data
OK, now it's time to actually feed the net with some data. To do so, you will use the forward() and backward() methods. For example, if you want to classify a single image:
net.blobs['data'].data[...] = my_image
net.forward() # equivalent to net.forward_all()
softmax_probabilities = net.blobs['prob'].data
The backward() method works analogously when one is interested in computing gradients; see the sketch below.
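For instance, a minimal sketch of pulling gradients back from the top blob. This assumes a net whose top blob is named 'prob' and whose prototxt sets force_backward: true (otherwise the gradient may not be propagated all the way to the data blob); the class index used here is arbitrary:
# Seed the top blob's diff with the gradient of the quantity of interest,
# e.g. the probability of one particular class, then propagate it backwards.
net.blobs['prob'].diff[...] = 0
net.blobs['prob'].diff[0, 42] = 1.0       # arbitrary class index, for illustration
net.backward()

input_gradient = net.blobs['data'].diff   # same shape as the input blob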
You can save the net weights to subsequently reuse them. It's just a matter of
net.save('/path/to/new/caffemodel/file')
Solver
The other core component exposed by pycaffe is the Solver. There are several types of solver, but I'm going to use only SGDSolver for the sake of clarity. It is needed in order to train a caffe model.
You can instantiate the solver with
solver = caffe.SGDSolver('/path/to/solver/prototxt/file')
The Solver will encapsulate the network you are training and, if present, the network used for testing. Note that they are usually the same network, only with a different Data Layer. The networks are accessible with
training_net = solver.net
test_net = solver.test_nets[0] # more than one test net is supported
Then, you can perform a solver iteration, that is, a forward/backward pass with weight update, typing just
solver.step(1)
or run the solver until the last iteration, with
solver.solve()
Other features
Note that pycaffe allows you to do more stuff, such as specifying the network architecture through a Python class or creating a new Layer type.
These features are less often used, but they are pretty easy to understand by reading the test cases.
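As an example, here is a short sketch of generating a network definition in Python with caffe.NetSpec (the layer names, sizes, and the LMDB path are made up for illustration):
import caffe
from caffe import layers as L, params as P

def simple_net(lmdb_path, batch_size):
    n = caffe.NetSpec()
    n.data, n.label = L.Data(batch_size=batch_size, backend=P.Data.LMDB,
                             source=lmdb_path,
                             transform_param=dict(scale=1. / 255), ntop=2)
    n.fc1 = L.InnerProduct(n.data, num_output=500,
                           weight_filler=dict(type='xavier'))
    n.relu1 = L.ReLU(n.fc1, in_place=True)
    n.score = L.InnerProduct(n.relu1, num_output=10,
                             weight_filler=dict(type='xavier'))
    n.loss = L.SoftmaxWithLoss(n.score, n.label)
    return n.to_proto()

# Write the generated prototxt, then load it with caffe.Net or a Solver as usual
with open('simple_net.prototxt', 'w') as f:
    f.write(str(simple_net('/path/to/train_lmdb', 64)))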
Please note that the answer by Flavio Ferrara has a little problem which may cause you to waste a lot of time:
net.blobs['data'].data[...] = my_image
net.forward()
The code above is ineffective if your first layer is a Data type layer, because when net.forward() is called it begins from the first layer, and your inserted data my_image will be overwritten. It will show no error but give you totally irrelevant output. The correct way is to specify the start and end layers, for example:
net.forward(start='conv1', end='fc')
Here is a GitHub repository of a Face Verification Experiment on the LFW dataset, using pycaffe and some MATLAB code. I guess it could help a lot, especially the caffe_ftr.py file.
https://github.com/AlfredXiangWu/face_verification_experiment
Besides, here is some short example code for using pycaffe for image classification:
http://codrspace.com/Jaleyhd/caffe-python-tutorial/
http://prog3.com/sbdm/blog/u011762313/article/details/48342495

R: clarification on memory management

Suppose I have a matrix bigm. I need to use a random subset of this matrix and give it to a machine learning algorithm such as, say, svm. The random subset of the matrix will only be known at runtime. Additionally, there are other parameters that are also chosen from a grid.
So, I have code that looks something like this:
foo = function(bigm, inTrain, moreParamsList) {
  parsList = c(list(data = bigm[inTrain, ]), moreParamsList)
  do.call(svm, parsList)
}
What I am seeking to know is whether R uses new memory to save that bigm[inTrain, ] object in parsList. (My guess is that it does.) What commands can I use to test such hypotheses myself? Additionally, is there a way of using a sub-matrix in R without using new memory?
Edit:
Also, assume I am calling foo using mclapply (on Linux) where bigm resides in the parent process. Does that mean I am making mc.cores number of copies of bigm or do all cores just use the object from the parent?
Any functions and heuristics of tracking memory location and consumption of objects being made in different cores?
Thanks.
I am just going to put down here what I found from my research on this topic:
I don't think using mclapply makes mc.cores copies of bigm, based on this excerpt from the multicore manual:
In a nutshell fork spawns a copy (child) of the current process, that can work in parallel
to the master (parent) process. At the point of forking both processes share exactly the
same state including the workspace, global options, loaded packages etc. Forking is
relatively cheap in modern operating systems and no real copy of the used memory is
created, instead both processes share the same memory and only modified parts are copied.
This makes fork an ideal tool for parallel processing since there is no need to setup the
parallel working environment, data and code is shared automatically from the start.
For the first part of your question, you can use tracemem:
This function marks an object so that a message is printed whenever the internal code copies the object
Here is an example:
a <- 1:10
tracemem(a)
## [1] "<0x000000001669cf00>"
b <- a ## b and a share memory (no message)
d <- stats::rnorm(10)
invisible(lm(d ~ a+log(b)))
## tracemem[0x000000001669cf00 -> 0x000000001669e298] ## object a is copied twice
## tracemem[0x000000001669cf00 -> 0x0000000016698a38]
untracemem(a)
You already found from the manual that mclapply isn't supposed to make copies of bigm.
But each thread needs to make its own copy of the smaller training matrix as it varies across the threads.
If you'd parallelize with e.g. snow, you'd need to have a copy of the data in each of the cluster nodes. However, in that case you could rewrite your problem in a way that only the smaller training matrices are handed over.
The search term for the general investigation of memory consumption behaviour is memory profiling. Unfortunately, AFAIK the available tools are not (yet) very comfortable, see e.g.
Monitor memory usage in R
Memory profiling in R - tools for summarizing
