Cheat sheet for caffe / pycaffe? - jupyter-notebook

Does anyone know whether there is a cheat sheet for all important pycaffe commands?
I was so far using caffe only via Matlab interface and terminal + bash scripts.
I wanted to shift towards using ipython and work through the ipython notebook examples. However I find it hard to get an overview of all the functions that are inside the caffe module for python. (I'm also quite new to python).

The pycaffe tests and this file are the main gateway to the python coding interface.
First of all, you would like to choose whether to use Caffe with CPU or GPU. It is sufficient to call caffe.set_mode_cpu() or caffe.set_mode_gpu(), respectively.
Net
The main class that the pycaffe interface exposes is the Net. It has two constructors:
net = caffe.Net('/path/prototxt/descriptor/file', caffe.TRAIN)
which simply create a Net (in this case using the Data Layer specified for training), or
net = caffe.Net('/path/prototxt/descriptor/file', '/path/caffemodel/weights/file', caffe.TEST)
which creates a Net and automatically loads the weights as saved in the provided caffemodel file - in this case using the Data Layer specified for testing.
A Net object has several attributes and methods. They can be found here. I will cite just the ones I use more often.
You can access the network blobs by means of Net.blobs. E.g.
data = net.blobs['data'].data
net.blobs['data'].data[...] = my_image
fc7_activations = net.blobs['fc7'].data
You can access the parameters (weights) too, in a similar way. E.g.
nice_edge_detectors = net.params['conv1'].data
higher_level_filter = net.params['fc7'].data
Ok, now it's time to actually feed the net with some data. So, you will use backward() and forward() methods. So, if you want to classify a single image
net.blobs['data'].data[...] = my_image
net.forward() # equivalent to net.forward_all()
softmax_probabilities = net.blobs['prob'].data
The backward() method is equivalent, if one is interested in computing gradients.
You can save the net weights to subsequently reuse them. It's just a matter of
net.save('/path/to/new/caffemodel/file')
Solver
The other core component exposed by pycaffe is the Solver. There are several types of solver, but I'm going to use only SGDSolver for the sake of clarity. It is needed in order to train a caffe model.
You can instantiate the solver with
solver = caffe.SGDSolver('/path/to/solver/prototxt/file')
The Solver will encapsulate the network you are training and, if present, the network used for testing. Note that they are usually the same network, only with a different Data Layer. The networks are accessible with
training_net = solver.net
test_net = solver.test_nets[0] # more than one test net is supported
Then, you can perform a solver iteration, that is, a forward/backward pass with weight update, typing just
solver.step(1)
or run the solver until the last iteration, with
solver.solve()
Other features
Note that pycaffe allows you to do more stuff, such as specifying the network architecture through a Python class or creating a new Layer type.
These features are less often used, but they are pretty easy to understand by reading the test cases.

Please note that the answer by Flavio Ferrara has a litte problem which may cause you waste a lot of time:
net.blobs['data'].data[...] = my_image
net.forward()
The code above is noneffective if your first layer is a Data type layer, because when net.forward() is called, it will begin from the first layer, and then your inserted data my_image will be covered. So it will show no error but give you totally irrelevant output. The correct way is to assign the start and end layer, for example:
net.forward(start='conv1', end='fc')
Here is a Github repository of Face Verification Experiment on LFW Dataset, using pycaffe and some matlab code. I guess it could help a lot, especially the caffe_ftr.py file.
https://github.com/AlfredXiangWu/face_verification_experiment
Besides, here are some short example code of using pycaffe for image classification:
http://codrspace.com/Jaleyhd/caffe-python-tutorial/
http://prog3.com/sbdm/blog/u011762313/article/details/48342495

Related

Graphframes PageRank performance: PySpark vs sparklyr

I am using Spark/GraphFrames from Python and from R. When I call PageRank on a small graph from Python, it is a lot slower than with R. Why is it so much slower with Python, considering that both Python and R are calling the same libraries?
I'll try to demonstrate the problem below.
Spark/GraphFrames includes examples of graphs, such as friends, as described on this link. This is a very small directed graph with 6 nodes and 8 edges (note that the example is not the same compared to other versions of GraphFrames).
When I run the following piece of code with R, it takes almost not time to calculate PageRank:
library(graphframes)
library(sparklyr)
library(dplyr)
nodes <- read.csv('nodes.csv')
edges <- read.csv('edges.csv')
sc <- spark_connect(master = "local", version = "2.1.1")
nodes_tbl <- copy_to(sc, nodes)
edges_tbl <- copy_to(sc, edges)
graph <- gf_graphframe(nodes_tbl, edges_tbl)
ranks <- gf_pagerank(graph, reset_probability = 0.15, tol = 0.01)
print(ranks$vertices)
results <- as.data.frame(ranks$vertices)
results <- arrange(results, id)
results$pagerank <- results$pagerank / sum(results$pagerank)
print(results)
When I run the equivalent with PySpark, it takes 10 to 30 minutes:
from pyspark.sql import SparkSession
from graphframes.examples import Graphs
if __name__ == '__main__':
sc = SparkSession.builder.master("local").getOrCreate()
g = Graphs(sc).friends()
results = g.pageRank(resetProbability=0.15, tol=0.01)
results.vertices.select("id", "pagerank").show()
results.edges.select("src", "dst", "weight").show()
I tried different version of Spark and GraphFrames for Python to be aligned with the settings of R.
In, general when you see such significant runtime differences between pieces of code that are apparently equivalent in different backends you have to consider two possibilities:
There are not really equivalent. Despite using the same Java libraries under the hood, the path which different language use to interact with the JVM are not the same, and when the code reaches the JVM, it might not use the same call chain.
The methods are equivalent but the configuration and / or data distribution is not the same.
In this particular case the first and the most obvious reason is how you load the data.
In sparklyr copy_to.spark_connection uses by default only a single partition. With such small data it can be often beneficial, as parallelization / distribution overhead can be much higher than the computation cost, but can also lead to miserable failures.
In PySpark, friends loader uses standard parallelize - it means that the number of partitions will use defaultParallelism.
Based on the master configuration the value is at least 1, but it can be affected by configuration options not visible here (like spark.default.parallelism).
However, as far as I can tell tell, these options shouldn't affect the runtime in this particular case. Moreover the path before code reaches JVM backend in both cases, doesn't seem to differ enough to explain the difference.
This suggests that problem lies somewhere in the configuration. In general there are at least two options which can significantly affect data distribution, and therefore the execution time:
spark.default.parallelism - used with RDD API to determine the number of partitions in different cases, including default post-shuffle distribution. For possible implications see for example Spark iteration time increasing exponentially when using join
It doesn't look like it affects your code here.
spark.sql.shuffle.partitions - used with Dataset API to determine the number of partitions after a shuffle (groupBy, join, etc.).
While PageRank code uses old GraphX API, and this parameter is not directly applicable there, before data is passed to the older API, involves indexing edges and vertices with Dataset API.
If you check the source you'll see that both indexedEdges and indexVertices use joins, and therefore depend on spark.sql.shuffle.partitions.
Furthermore the number of partitions set by aforementioned methods will be inherited by the GraphX Graph object, significantly affecting execution time.
If you set spark.sql.shuffle.partitions to a minimum value:
spark: SparkSession
spark.conf.set("spark.sql.shuffle.partitions", 1)
the execution time on such small data should be negligible.
Conclusion:
You environments are likely to use different values of spark.sql.shuffle.partitions.
General Directions:
If you see behavior like this, and want to roughly narrow down the problem you should take a look at the Spark UI, and see where things diverge. In this case you're likely to see significantly different numbers of tasks.

In chainer, How to write BPTT updater using multiple GPUs?

I don't find example because existing example only extends training.StandardUpdater, thus only use One GPU.
I assume that you are talking about the BPTTUpdater of the ptb example of Chainer.
It's not straight forward to make the customized updater support learning on multiple GPUs. The MultiprocessParallelUpdater hard code the way to compute the gradient (only the target link implementation is customizable), so you have to copy the overall implementation of MultiprocessParallelUpdater and modify the gradient computation parts. What you have to copy and edit is chainer/training/updaters/multiprocess_parallel_updater.py.
There are two parts in this file that compute gradient; one in _Worker.run, which represents a worker process task, and the other in MultiprocessParallelUpdater.update_core, which represents the master process task. You have to make these code do BPTT by modifying the code starting from _calc_loss to backward in each of these two parts:
# Change self._master into self.model for _Worker.run code
loss = _calc_loss(self._master, batch)
self._master.cleargrads()
loss.backward()
It should be modified by inserting the code of BPTTUpdater.update_core.
You also have to take care on the data iterators. MultiprocessParallelUpdater accept the set of iterators that will be distributed to master/worker processes. Since the ptb example uses a customized iterator (ParallelSequentialIterator), you have to make sure that these iterators iterate over different portions of the dataset or using different initial offsets of word positions. It may require customization to ParalellSequentialIterator as well.

In ns-3 simulator when to use p2p nodes, wifistanodes and csmanodes

I'm trying to use NS-3 simulator to do some tests between random waypoint model and some other models. While during the simulation I found that when need to run the simulation we need to initiate the model by using a class called
MobilityHelper
Code below is the part of the code that I am using. During the initialization, there are some nodes need to be created beforehand such as
p2pNodes
csmaNodes
So what do these nodes mean, and in which situation need to use them? Are they specify to some specific mobility models? If so please give some details, many thanks!
NodeContainer p2pNodes;
p2pNodes.Create (3);
PointToPointHelper pointToPoint;
pointToPoint.SetDeviceAttribute ("DataRate", StringValue ("5Mbps"));
pointToPoint.SetChannelAttribute ("Delay", StringValue ("2ms"));
NetDeviceContainer p2pDevices;
p2pDevices = pointToPoint.Install (p2pNodes);
NodeContainer csmaNodes;
csmaNodes.Add (p2pNodes.Get (1));
csmaNodes.Create (nCsma);
CsmaHelper csma;
csma.SetChannelAttribute ("DataRate", StringValue ("100Mbps"));
csma.SetChannelAttribute ("Delay", TimeValue (NanoSeconds (6560)));
NetDeviceContainer csmaDevices;
csmaDevices = csma.Install (csmaNodes);
p2pNodes and csmaNodes are simply variable names for the NodeContainer used in this particular example in order to keep track of nodes for the two newtorks (a Point-to-Point and a CSMA). The fact that you name them p2pNodes or csmaNodes is only for your conveniency. What matters, is the types of NetDevices they get installed.
In any case, that has nothing to do with the MobilityModel that you install.
Both P2P and CSMA are wired networks and I wouldn't think of adding a random mobility on those. It wouldn't make sense to move around with a wire attached to you..!
Note that the above example code will crash, as you have created 3 p2pNodes, and a point-to-point link can only be instantiated between two nodes.
I would recommend to study the ns-3 tutorial to get an understanding of the consepts of Nodes, NodeContainers (ie. vectors of Nodes), NetDevices (ie. the network card/type), MobilityModel etc

What is the right way to duplicate an OpenCL kernel?

It seems that I can duplicate a kernel by get the program object and kernel name from the kernel. And then I can create a new one.
Is this the right way? It doesn't looks so good, though.
EDIT: To answer properly the question: Yes it is the correct way, there is no other way in CL 2.0 or earlier versions.
The compilation (and therefore, slow step) of the CL code creation is in the "program" creation (clProgramBuild + clProgramLink).
When you create a kernel. You are just creating a object that packs:
An entry point to a function in the program code
Parameters for input + output to that function
Some memory to remember all the above data between calls
It is an simple task that should be almost for free.
That is why it is preferred to have multiple kernel with different input parameters. Rather than one single kernel, and changing the parameters every loop.

How can I label my sub-processes for logging when using multicore and doMC in R

I have started using the doMC package for R as the parallel backend for parallelised plyr routines.
The parallelisation itself seems to be working fine (though I have yet to properly benchmark the speedup), my problem is that the logging is now asynchronous and messages from different cores are getting mixed in together. I could created different logfiles for each core, but I think I neater solution is to simply add a different label for each core. I am currently using the log4r package for my logging needs.
I remember when using MPI that each processor got a rank, which was a way of distinguishing each process from one another, so is there a way to do this with doMC? I did have the idea of extracting the PID, but this does seem messy and will change for every iteration.
I am open to ideas though, so any suggestions are welcome.
EDIT (2011-04-08): Going with the suggestion of one answer, I still have the issue of correctly identifying which subprocess I am currently inside, as I would either need separate closures for each log() call so that it writes to the correct file, or I would have a single log() function, but have some logic inside it determining which logfile to append to. In either case, I would still need some way of labelling the current subprocess, but I am not sure how to do this.
Is there an equivalent of the mpi_rank() function in the MPI library?
I think having multiple process write to the same file is a recipe for a disaster (it's just a log though, so maybe "disaster" is a bit strong).
Often times I parallelize work over chromosomes. Here is an example of what I'd do (I've mostly been using foreach/doMC):
foreach(chr=chromosomes, ...) %dopar% {
cat("+++", chr, "+++\n")
## ... some undoubtedly amazing code would then follow ...
}
And it wouldn't be unusual to get output that tramples over each other ... something like (not exactly) this:
+++chr1+++
+++chr2+++
++++chr3++chr4+++
... you get the idea ...
If I were in your shoes, I think I'd split the logs for each process and set their respective filenames to be unique with respect to something happening in that process's loop (like chr in my case above). Collate them later if you must ... ie. map/reduce your log files :-)

Resources