How to write a BPTT updater using multiple GPUs in Chainer?

I can't find an example, because the existing example only extends training.StandardUpdater and thus only uses one GPU.

I assume that you are talking about the BPTTUpdater of the ptb example of Chainer.
It is not straightforward to make the customized updater support training on multiple GPUs. MultiprocessParallelUpdater hard-codes the way the gradient is computed (only the target link implementation is customizable), so you have to copy the overall implementation of MultiprocessParallelUpdater and modify the gradient computation parts. What you have to copy and edit is chainer/training/updaters/multiprocess_parallel_updater.py.
There are two parts in this file that compute gradients: one in _Worker.run, which represents a worker process task, and the other in MultiprocessParallelUpdater.update_core, which represents the master process task. You have to make this code do BPTT by modifying the lines starting from _calc_loss through backward in each of these two parts:
# Change self._master into self.model for _Worker.run code
loss = _calc_loss(self._master, batch)
self._master.cleargrads()
loss.backward()
These lines should be modified by inserting the logic of BPTTUpdater.update_core.
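As a rough sketch, they could be replaced by an accumulation loop in the spirit of BPTTUpdater.update_core. Note that bprop_len and the attribute names (self.iterator, self.converter, self.device) are assumptions taken from the ptb example and the worker code, so they may need adjusting for your Chainer version:
# Sketch for _Worker.run; in update_core use self._master instead of self.model
loss = 0
for _ in range(bprop_len):  # accumulate the loss over bprop_len time steps
    batch = self.iterator.next()
    x, t = self.converter(batch, self.device)
    loss += self.model(x, t)
self.model.cleargrads()
loss.backward()
loss.unchain_backward()  # truncate the computational graph (BPTT)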
You also have to take care of the data iterators. MultiprocessParallelUpdater accepts a set of iterators that will be distributed to the master/worker processes. Since the ptb example uses a customized iterator (ParallelSequentialIterator), you have to make sure that these iterators iterate over different portions of the dataset or use different initial offsets of word positions. This may require customizing ParallelSequentialIterator as well.
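For example, one hypothetical setup splits the training text into contiguous chunks, one per device; MyBPTTParallelUpdater stands for your copied and modified updater class, and the devices argument is only illustrative:
n_devices = 2
chunk = len(train) // n_devices
splits = [train[i * chunk:(i + 1) * chunk] for i in range(n_devices)]  # contiguous chunks keep the word order
iterators = [ParallelSequentialIterator(split, batch_size) for split in splits]
updater = MyBPTTParallelUpdater(iterators, optimizer, devices=[0, 1])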

Related

Julia: What are the right data structures for graph traversal?

I'm writing a bunch of recursive graph algorithms where graph nodes have parents, children, and a number of other properties. The algorithms can also create nodes dynamically, and make use of recursive functions.
What are the right data structures to use in this case? In C++ I would've implemented this via pointers (i.e. each node has a vector<Node*> parents, vector<Node*> children), but I'm not sure if Julia pointers are the right tool for that, or if there's something else ... ?
In Julia, the state of the art in this regard is the LightGraphs.jl library.
It uses adjacency lists for graph representation and assumes that the data for the nodes is kept outside the graph (for example in Vectors indexed by node identifiers) rather than inside it.
This approach is generally the most efficient and most convenient (operating on array indices rather than references).
LightGraphs.jl provides implementations of several typical graph algorithms and is usually the way to go when doing computations on graphs.
However, LightGraphs.jl's approach might be less convenient in scenarios where you are continuously adding and destroying many nodes in the graph at the same time.
Now, regarding an equivalent of the C++ approach you have proposed, it can be accomplished as:
struct MyNode{T}
    data::T
    children::Vector{MyNode}
    parents::Vector{MyNode}
    MyNode(data::T, children=MyNode[], parents=MyNode[]) where {T} = new{T}(data, children, parents)
end
And this API can be used as:
node1 = MyNode(nothing)
push!(node1.parents, MyNode("hello2"))
Finally, since LightGraphs.jl is a Julia standard, it is usually worth providing some bridging implementation so that your API can use LightGraphs.jl functions.
For an illustration of how this can be done, have a look at the SimpleHypergraphs.jl library.
EDIT:
Normally, for efficiency reasons, you will want the data field to be homogeneous across the graph; in that case it is better to use:
struct MyNode{T}
    data::T
    children::Vector{MyNode{T}}
    parents::Vector{MyNode{T}}
    MyNode(data::T, children=MyNode{T}[], parents=MyNode{T}[]) where {T} = new{T}(data, children, parents)
end

Graphframes PageRank performance: PySpark vs sparklyr

I am using Spark/GraphFrames from Python and from R. When I call PageRank on a small graph from Python, it is a lot slower than with R. Why is it so much slower with Python, considering that both Python and R are calling the same libraries?
I'll try to demonstrate the problem below.
Spark/GraphFrames includes examples of graphs, such as friends, as described on this link. This is a very small directed graph with 6 nodes and 8 edges (note that the example is not the same as in other versions of GraphFrames).
When I run the following piece of code with R, it takes almost no time to calculate PageRank:
library(graphframes)
library(sparklyr)
library(dplyr)
nodes <- read.csv('nodes.csv')
edges <- read.csv('edges.csv')
sc <- spark_connect(master = "local", version = "2.1.1")
nodes_tbl <- copy_to(sc, nodes)
edges_tbl <- copy_to(sc, edges)
graph <- gf_graphframe(nodes_tbl, edges_tbl)
ranks <- gf_pagerank(graph, reset_probability = 0.15, tol = 0.01)
print(ranks$vertices)
results <- as.data.frame(ranks$vertices)
results <- arrange(results, id)
results$pagerank <- results$pagerank / sum(results$pagerank)
print(results)
When I run the equivalent with PySpark, it takes 10 to 30 minutes:
from pyspark.sql import SparkSession
from graphframes.examples import Graphs

if __name__ == '__main__':
    sc = SparkSession.builder.master("local").getOrCreate()
    g = Graphs(sc).friends()
    results = g.pageRank(resetProbability=0.15, tol=0.01)
    results.vertices.select("id", "pagerank").show()
    results.edges.select("src", "dst", "weight").show()
I tried different versions of Spark and GraphFrames for Python to align with the R settings.
In general, when you see such significant runtime differences between pieces of code that are apparently equivalent in different backends, you have to consider two possibilities:
They are not really equivalent. Despite using the same Java libraries under the hood, the paths that different languages use to interact with the JVM are not the same, and once the code reaches the JVM, it might not use the same call chain.
The methods are equivalent but the configuration and/or data distribution is not the same.
In this particular case the first and the most obvious reason is how you load the data.
In sparklyr, copy_to.spark_connection uses only a single partition by default. With such small data this can often be beneficial, as the parallelization/distribution overhead can be much higher than the computation cost, but it can also lead to miserable failures.
In PySpark, the friends loader uses the standard parallelize, which means that the number of partitions is determined by defaultParallelism.
Based on the master configuration the value is at least 1, but it can be affected by configuration options not visible here (like spark.default.parallelism).
However, as far as I can tell, these options shouldn't affect the runtime in this particular case. Moreover, the path the code takes before it reaches the JVM backend doesn't seem to differ enough in the two cases to explain the difference.
This suggests that the problem lies somewhere in the configuration. In general there are at least two options which can significantly affect data distribution, and therefore the execution time:
spark.default.parallelism - used with RDD API to determine the number of partitions in different cases, including default post-shuffle distribution. For possible implications see for example Spark iteration time increasing exponentially when using join
It doesn't look like it affects your code here.
spark.sql.shuffle.partitions - used with Dataset API to determine the number of partitions after a shuffle (groupBy, join, etc.).
While the PageRank code uses the old GraphX API, and this parameter is not directly applicable there, passing the data to the older API involves indexing edges and vertices with the Dataset API.
If you check the source you'll see that both indexedEdges and indexVertices use joins, and therefore depend on spark.sql.shuffle.partitions.
Furthermore, the number of partitions set by the aforementioned methods will be inherited by the GraphX Graph object, significantly affecting execution time.
If you set spark.sql.shuffle.partitions to a minimum value:
# spark is the SparkSession
spark.conf.set("spark.sql.shuffle.partitions", 1)
the execution time on such small data should be negligible.
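In PySpark, the same setting can also be applied when building the session; this is just a sketch, and calling spark.conf.set before the first action should be equivalent:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local")
         .config("spark.sql.shuffle.partitions", 1)
         .getOrCreate())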
Conclusion:
Your environments are likely to use different values of spark.sql.shuffle.partitions.
General Directions:
If you see behavior like this and want to roughly narrow down the problem, you should take a look at the Spark UI and see where things diverge. In this case you're likely to see significantly different numbers of tasks.

How to accumulate gradient across mini-batch and then back-propagation in Chainer?

I am classifying video sequences and I need two things:
Because of limited GPU memory, I want to accumulate gradients across mini-batches, then average the gradient values, and then back-propagate.
I need to know how to shuffle between mini-batches but not inside each mini-batch, because I want each video sequence to keep its order.
Question 1:
You can run forward and backward on each mini-batch without calling optimizer.update(); after you have repeated forward and backward for the necessary number of mini-batches, call optimizer.update() once to update the parameters based on the accumulated gradients.
If you want to achieve this with the trainer module, I think you need to override StandardUpdater to define your own Updater class that does the above.
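As a minimal sketch of this approach without the trainer module (model, optimizer and train_iter are assumed to exist, model is assumed to return a loss, and n_accum is a hypothetical accumulation count):
import chainer

model.cleargrads()
for _ in range(n_accum):
    batch = train_iter.next()
    x, t = chainer.dataset.concat_examples(batch)
    loss = model(x, t) / n_accum  # divide so the accumulated gradient is an average
    loss.backward()               # gradients accumulate because cleargrads() is not called here
optimizer.update()                # one parameter update from the accumulated gradients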
Question 2:
Are you using the trainer module?
If so, you can define your own iterator to achieve this. See the references below for how to define an iterator class; a minimal sketch follows them.
https://github.com/chainer/chainer/blob/master/examples/ptb/train_ptb.py
http://corochann.com/training-rnn-with-simple-sequence-dataset-1304.html
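As a rough sketch of the second point, an iterator that shuffles the order of pre-built mini-batches while keeping the frames inside each mini-batch in temporal order could look like this (batches is assumed to be a list of ordered video-sequence mini-batches):
import numpy as np
import chainer

class BatchShuffleIterator(chainer.dataset.Iterator):
    """Shuffle the order of mini-batches each epoch, not their contents."""

    def __init__(self, batches):
        self.batches = batches
        self.epoch = 0
        self.is_new_epoch = False
        self._order = np.random.permutation(len(batches))
        self._index = 0

    def __next__(self):
        if self._index >= len(self.batches):
            self.epoch += 1
            self.is_new_epoch = True
            self._order = np.random.permutation(len(self.batches))
            self._index = 0
        else:
            self.is_new_epoch = False
        batch = self.batches[self._order[self._index]]
        self._index += 1
        return batch

    @property
    def epoch_detail(self):
        return self.epoch + self._index / len(self.batches)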

OpenMDAO 1.x relevance reduction

I have a component in OpenMDAO without outputs that serves to provide inputs to the rest of the group. apply_linear in that component is being called despite the fact that its output is not connected. Shouldn't the relevance reduction algorithm in OpenMDAO 1.x figure out that apply_linear for this component never needs to be called?
As it turns out, relevance reduction on a per-variable basis isn't turned on by default. You can turn it on with:
prob.root.ln_solver = LinearGaussSeidel()
prob.root.ln_solver.options['single_voi_relevance_reduction'] = True
This option is set to False by default because it uses more memory, allocating separate vectors for each quantity of interest (each vector is smaller because it only contains relevant variables, but the total size may be larger). Also, relevance reduction is only applicable when using LinearGaussSeidel as the top linear solver.
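For context, a minimal sketch of where these lines fit in an OpenMDAO 1.x script; MyGroup is a placeholder for your existing model group:
from openmdao.api import Problem, LinearGaussSeidel

prob = Problem(root=MyGroup())
prob.root.ln_solver = LinearGaussSeidel()
prob.root.ln_solver.options['single_voi_relevance_reduction'] = True
prob.setup()
prob.run()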
My reputation isn't high enough yet to leave comments, so I'm just adding another answer instead. I just wanted to mention that if you're not running under MPI, activating single_voi_relevance_reduction is essentially free. The real increase in memory use isn't due to the vectors themselves, but instead it's due to the index arrays that we store in order to transfer the data from source arrays to target arrays. We're forced to use index arrays under MPI, because PETSc requires it, but when we're not using MPI we use python slice objects to do our data transfer. Slice objects require very little memory.

Cheat sheet for caffe / pycaffe?

Does anyone know whether there is a cheat sheet for all important pycaffe commands?
I was so far using caffe only via Matlab interface and terminal + bash scripts.
I wanted to shift towards using ipython and work through the ipython notebook examples. However I find it hard to get an overview of all the functions that are inside the caffe module for python. (I'm also quite new to python).
The pycaffe tests and this file are the main gateway to the python coding interface.
First of all, you would like to choose whether to use Caffe with CPU or GPU. It is sufficient to call caffe.set_mode_cpu() or caffe.set_mode_gpu(), respectively.
Net
The main class that the pycaffe interface exposes is the Net. It has two constructors:
net = caffe.Net('/path/prototxt/descriptor/file', caffe.TRAIN)
which simply creates a Net (in this case using the Data Layer specified for training), or
net = caffe.Net('/path/prototxt/descriptor/file', '/path/caffemodel/weights/file', caffe.TEST)
which creates a Net and automatically loads the weights as saved in the provided caffemodel file - in this case using the Data Layer specified for testing.
A Net object has several attributes and methods. They can be found here. I will cite just the ones I use most often.
You can access the network blobs by means of Net.blobs. E.g.
data = net.blobs['data'].data
net.blobs['data'].data[...] = my_image
fc7_activations = net.blobs['fc7'].data
You can access the parameters (weights) too, in a similar way. Note that each entry of Net.params holds a list of blobs (weights first, then biases), so you index into it. E.g.
nice_edge_detectors = net.params['conv1'][0].data
higher_level_filter = net.params['fc7'][0].data
OK, now it's time to actually feed the net with some data. To do so, you will use the forward() and backward() methods. For instance, if you want to classify a single image:
net.blobs['data'].data[...] = my_image
net.forward() # equivalent to net.forward_all()
softmax_probabilities = net.blobs['prob'].data
The backward() method is equivalent, if one is interested in computing gradients.
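Going back to the single-image example, a typical preprocessing sketch with caffe.io looks like the following; the blob names 'data' and 'prob' and the preprocessing steps are assumptions that depend on your prototxt:
import caffe

transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))     # HxWxC -> CxHxW
transformer.set_raw_scale('data', 255)           # caffe.io.load_image returns values in [0, 1]
transformer.set_channel_swap('data', (2, 1, 0))  # RGB -> BGR
image = caffe.io.load_image('/path/to/image.jpg')
net.blobs['data'].data[...] = transformer.preprocess('data', image)
net.forward()
print(net.blobs['prob'].data.argmax())           # index of the most probable class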
You can save the net weights to subsequently reuse them. It's just a matter of
net.save('/path/to/new/caffemodel/file')
Solver
The other core component exposed by pycaffe is the Solver. There are several types of solver, but I'm going to use only SGDSolver for the sake of clarity. It is needed in order to train a caffe model.
You can instantiate the solver with
solver = caffe.SGDSolver('/path/to/solver/prototxt/file')
The Solver will encapsulate the network you are training and, if present, the network used for testing. Note that they are usually the same network, only with a different Data Layer. The networks are accessible with
training_net = solver.net
test_net = solver.test_nets[0] # more than one test net is supported
Then, you can perform a solver iteration, that is, a forward/backward pass with weight update, by typing just
solver.step(1)
or run the solver until the last iteration, with
solver.solve()
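A manual training loop with periodic logging can then be as simple as the following sketch; the 'loss' blob name is an assumption that depends on your network definition:
niter = 1000
for it in range(niter):
    solver.step(1)  # one forward/backward pass plus weight update
    if it % 100 == 0:
        print('iter', it, 'loss', float(solver.net.blobs['loss'].data))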
Other features
Note that pycaffe allows you to do more stuff, such as specifying the network architecture through a Python class or creating a new Layer type.
These features are less often used, but they are pretty easy to understand by reading the test cases.
Please note that the answer by Flavio Ferrara has a little problem which may cause you to waste a lot of time:
net.blobs['data'].data[...] = my_image
net.forward()
The code above is ineffective if your first layer is a Data type layer, because when net.forward() is called, it starts from the first layer, so your inserted data my_image will be overwritten. It will show no error but give you totally irrelevant output. The correct way is to specify the start and end layers, for example:
net.forward(start='conv1', end='fc')
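Putting it together, a rough sketch (the layer names 'conv1' and 'prob' are just examples carried over from above) would be:
net.blobs['data'].data[...] = my_image
net.forward(start='conv1')  # skip the Data layer so my_image is not overwritten
probs = net.blobs['prob'].data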
Here is a Github repository of Face Verification Experiment on LFW Dataset, using pycaffe and some matlab code. I guess it could help a lot, especially the caffe_ftr.py file.
https://github.com/AlfredXiangWu/face_verification_experiment
Besides, here is some short example code for using pycaffe for image classification:
http://codrspace.com/Jaleyhd/caffe-python-tutorial/
http://prog3.com/sbdm/blog/u011762313/article/details/48342495
