It seems that I can duplicate a kernel by getting the program object and kernel name from the existing kernel, and then creating a new one from those.
Is this the right way? It doesn't look very clean, though.
EDIT: To answer the question properly: yes, it is the correct way; there is no other way in CL 2.0 or earlier versions.
The compilation (and therefore the slow step) of CL code creation happens in the "program" creation (clBuildProgram + clLinkProgram).
When you create a kernel, you are just creating an object that packs:
An entry point to a function in the program code
Parameters for input + output to that function
Some memory to remember all the above data between calls
It is a simple task that should be almost free.
That is why it is preferable to have multiple kernels with different input parameters, rather than a single kernel whose parameters are changed on every loop iteration.
In short: How can I call, from within Rcpp C++ code, the agrep C internal function that gets called when users use the regular agrep function from base R?
In long: I have found multiple questions here about how to invoke, from within Rcpp, a C or C++ function created for another package (e.g. using C function from other package in Rcpp
and Rcpp: Call C function from a package within Rcpp).
The thing that I am trying to achieve, however, is at the same time simpler but also far less documented: it is to directly call, from within Rcpp, a .Internal C function that comes with base R rather than another package, without interfacing with R (that is, without doing what is said in Call R functions in Rcpp). How could I do that for the .Internal C function that lies underneath base R's agrep wrapper?
The specific function I am trying to call here is the agrep internal C function. And for context, what I am ultimately trying to achieve is to speed up a call to agrep for when millions of patterns must each be checked against each of millions of x targets.
Great question. The long and short of it is "you can't" (in many cases), unless the function is visible in one of the header files in "src/include/". At least not that easily.
Not long ago I had a similar fun challenge, where I tried to get access to the do_docall function (called by do.call), and it is not a simple task. First of all, it is not directly possible to just #include <agrep.c> (or something similar). That file simply isn't available for inclusion, as it is not a part of the "src/include". It is compiled and the uncompiled file is removed (not to mention that one should never "include" a .c file).
If one is willing to go the extra mile, then the next step one could look at is "copying" and "altering" the source code. Basically, find the function in "src/main/agrep.c", copy it into your package and then fix any errors you find.
Problems with this approach:
As documented in R-exts, the internal structure sexprec_info is not made public (this is the base structure for all objects in R). Many internal functions use fields within this structure, so one has to "copy" the structure into your source code to make it visible to your code specifically.
If you ever #include <Rcpp.h> prior to this file, you will need to go through each and every call to internal functions and likely prefix them with either R_ or Rf_.
The function may contain calls to other "internal" functions, which further need to be copied and altered for it to work.
You will also need to get a clear understanding of what CDR, CAR and similar macros do. The internal functions have a documented structure, where the first argument contains the full call passed to the function, and functions like those two are used to access parts of the call.
I did myself a solid and rewrote do_docall, changing the input format to avoid having to deal with this, but that takes time. The alternative is to create a pairlist according to the documentation, set its type to a call-sexp (the exact name escapes me at the moment) and pass the appropriate arguments for op, args and env.
And lastly, if you go through the steps and find that it is necessary to copy the internal structures of sexprec_info (as described later), you will need to be very careful about when you include Rinternals and Rcpp, as either of these will cause your code to crash and burn in the most beautifully silent way if your header and these are included in the wrong order! Note that this even goes for [[Rcpp::export]], which may indeed turn out to include them in the wrong order!
If you are willing to go this far down the drain, I would suggest carefully reading adv-R's "R's C interface" section and Chapters 2, 5 and 6 of R-exts, and maybe even the R Internals manual; finally, once that is done, take a look at do_docall from src/main/coerce.c and compare it to the implementation in my repository cmdline.arguments/src/utils/{cmd_coerce.h, cmd_coerce.c}. In this version I have:
Added all the internal structures that are not public, so that I can access their unmodified form (unmodified by the current session).
This includes the table used to store the currently used SEXPs, which was used as a lookup. This caused a problem because I can't access the modified version, so my code is slightly altered, with the old code blocked out by the macro #if --- defined(CMDLINE_ARGUMENTS_MAYBE_IN_THE_FUTURE). Luckily the code causing the problem had a static answer, so I could work around this (but that might not always be the case).
I added quite a few Rf_ prefixes, as the macro versions are not available (since I #include <Rcpp.h> at some point).
The code has been split into smaller functions to make it more readable (for my own sake).
The function has one additional argument (name), that is not used in the internal function, with some added errors (for my specific need).
This implementation will be frozen "for all time to come" as I've moved on to another branch (and this one is frozen for my own future benefit, if I ever want to walk down this path again).
I spent a few days scouring the internet for information on this and found 2 different posts talking about how this could be achieved, and my approach basically copies them. Whether this is actually allowed in a CRAN package is a whole other question (and not one that I will be testing out).
This approach also applies if you want to use non-public code from other packages, although there it is often as simple as copy-pasting their files into your repository.
As a final side note, you mention the intent is to "speed up" your code for when you have to perform millions upon millions of calls to agrep. This seems like a case where one should consider performing the task in parallel. Even after going through the steps outlined above, creating N parallel sessions to take care of K evaluations each (say 100,000) would be the first step to reduce computing time. Of course, each session should be given a whole batch and not a single call to agrep.
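For what it's worth, here is a rough sketch of that batching idea using the parallel package (the input file names, chunk count and agrep settings below are placeholders, not anything taken from the original question):
# Hedged sketch: split the patterns into a handful of batches and let each
# forked worker run plain agrep() over its whole batch
library(parallel)
patterns <- readLines("patterns.txt")   # placeholder input
targets  <- readLines("targets.txt")    # placeholder input
n_workers <- 8
batches <- split(patterns, cut(seq_along(patterns), n_workers, labels = FALSE))
hits <- mclapply(batches, function(batch) {
  lapply(batch, function(p) agrep(p, targets, max.distance = 0.1))
}, mc.cores = n_workers)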
Does anyone know whether there is a cheat sheet for all important pycaffe commands?
So far I have been using Caffe only via the MATLAB interface and the terminal + bash scripts.
I want to shift towards using IPython and work through the IPython notebook examples. However, I find it hard to get an overview of all the functions inside the caffe module for Python. (I'm also quite new to Python.)
The pycaffe tests and this file are the main gateway to the Python coding interface.
First of all, you would like to choose whether to use Caffe with CPU or GPU. It is sufficient to call caffe.set_mode_cpu() or caffe.set_mode_gpu(), respectively.
Net
The main class that the pycaffe interface exposes is the Net. It has two constructors:
net = caffe.Net('/path/prototxt/descriptor/file', caffe.TRAIN)
which simply creates a Net (in this case using the Data Layer specified for training), or
net = caffe.Net('/path/prototxt/descriptor/file', '/path/caffemodel/weights/file', caffe.TEST)
which creates a Net and automatically loads the weights as saved in the provided caffemodel file - in this case using the Data Layer specified for testing.
A Net object has several attributes and methods. They can be found here. I will cite just the ones I use most often.
You can access the network blobs by means of Net.blobs. E.g.
data = net.blobs['data'].data
net.blobs['data'].data[...] = my_image
fc7_activations = net.blobs['fc7'].data
You can access the parameters (weights) too, in a similar way. E.g.
nice_edge_detectors = net.params['conv1'][0].data  # [0] holds the weights, [1] the biases
higher_level_filter = net.params['fc7'][0].data
OK, now it's time to actually feed the net with some data. To do so, you will use the forward() and backward() methods. For example, if you want to classify a single image:
net.blobs['data'].data[...] = my_image
net.forward() # equivalent to net.forward_all()
softmax_probabilities = net.blobs['prob'].data
The backward() method is equivalent, if one is interested in computing gradients.
You can save the net weights to subsequently reuse them. It's just a matter of
net.save('/path/to/new/caffemodel/file')
Solver
The other core component exposed by pycaffe is the Solver. There are several types of solver, but I'm going to use only SGDSolver for the sake of clarity. It is needed in order to train a caffe model.
You can instantiate the solver with
solver = caffe.SGDSolver('/path/to/solver/prototxt/file')
The Solver will encapsulate the network you are training and, if present, the network used for testing. Note that they are usually the same network, only with a different Data Layer. The networks are accessible with
training_net = solver.net
test_net = solver.test_nets[0] # more than one test net is supported
Then, you can perform a solver iteration, that is, a forward/backward pass with weight update, typing just
solver.step(1)
or run the solver until the last iteration, with
solver.solve()
Other features
Note that pycaffe allows you to do more stuff, such as specifying the network architecture through a Python class or creating a new Layer type.
These features are less often used, but they are pretty easy to understand by reading the test cases.
Please note that the answer by Flavio Ferrara has a little problem which may cause you to waste a lot of time:
net.blobs['data'].data[...] = my_image
net.forward()
The code above is ineffective if your first layer is a Data-type layer, because when net.forward() is called, it begins from the first layer, and your inserted data my_image will be overwritten. So it will show no error but give you totally irrelevant output. The correct way is to specify the start and end layers, for example:
net.forward(start='conv1', end='fc')
Here is a GitHub repository of a face verification experiment on the LFW dataset, using pycaffe and some MATLAB code. I guess it could help a lot, especially the caffe_ftr.py file.
https://github.com/AlfredXiangWu/face_verification_experiment
In addition, here is some short example code for using pycaffe for image classification:
http://codrspace.com/Jaleyhd/caffe-python-tutorial/
http://prog3.com/sbdm/blog/u011762313/article/details/48342495
Suppose I have a matrix bigm. I need to use a random subset of this matrix and give it to a machine learning algorithm such as, say, svm. The random subset of the matrix will only be known at runtime. Additionally, there are other parameters that are also chosen from a grid.
So, I have code that looks something like this:
foo = function(bigm, inTrain, moreParamsList) {
  parsList = c(list(data = bigm[inTrain, ]), moreParamsList)
  do.call(svm, parsList)
}
What I am seeking to know is whether R uses new memory to save that bigm[inTrain, ] object in parsList. (My guess is that it does.) What commands can I use to test such hypotheses myself? Additionally, is there a way of using a sub-matrix in R without using new memory?
Edit:
Also, assume I am calling foo using mclapply (on Linux) where bigm resides in the parent process. Does that mean I am making mc.cores number of copies of bigm or do all cores just use the object from the parent?
Are there any functions or heuristics for tracking the memory location and consumption of objects created in the different cores?
Thanks.
I am just going to put in here what I found from my research on this topic:
I don't think using mclapply makes mc.cores copies of bigm, based on this passage from the manual for multicore:
In a nutshell fork spawns a copy (child) of the current process, that can work in parallel to the master (parent) process. At the point of forking both processes share exactly the same state including the workspace, global options, loaded packages etc. Forking is relatively cheap in modern operating systems and no real copy of the used memory is created, instead both processes share the same memory and only modified parts are copied. This makes fork an ideal tool for parallel processing since there is no need to setup the parallel working environment, data and code is shared automatically from the start.
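You can see this sharing from the R level with a tiny sketch (hedged: it assumes a fork-capable OS such as Linux, which matches the question, and the matrix size is arbitrary). The forked children read the parent's bigm without any explicit export or upfront copy:
library(parallel)
bigm <- matrix(rnorm(1e6), nrow = 1000)
## each forked child sees the parent's bigm via copy-on-write;
## only pages a child actually modifies would get duplicated
res <- mclapply(1:4, function(i) sum(bigm[, i]), mc.cores = 2)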
For the first part of your question, you can use tracemem:
This function marks an object so that a message is printed whenever the internal code copies the object
Here is an example:
a <- 1:10
tracemem(a)
## [1] "<0x000000001669cf00>"
b <- a ## b and a share memory (no message)
d <- stats::rnorm(10)
invisible(lm(d ~ a+log(b)))
## tracemem[0x000000001669cf00 -> 0x000000001669e298] ## object a is copied twice
## tracemem[0x000000001669cf00 -> 0x0000000016698a38]
untracemem(a)
You already found from the manual that mclapply isn't supposed to make copies of bigm.
But each worker still needs to build its own copy of the smaller training matrix, since that matrix differs across workers.
If you'd parallelize with e.g. snow, you'd need to have a copy of the data in each of the cluster nodes. However, in that case you could rewrite your problem in a way that only the smaller training matrices are handed over.
The search term for a general investigation of memory consumption behaviour is memory profiling. Unfortunately, AFAIK the available tools are not (yet) very convenient; see e.g.
Monitor memory usage in R
Memory profiling in R - tools for summarizing
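Tying this back to the original foo(), here is a small hedged sketch of how one could check, with tracemem, whether the sub-matrix handed to do.call gets duplicated again further down the call chain (e1071's svm and all object names are just illustrative; the question's data= argument is replaced by the x/y matrix interface):
library(e1071)
foo <- function(bigm, inTrain, moreParamsList) {
  sub <- bigm[inTrain, ]   ## the subsetting itself always allocates a fresh matrix
  tracemem(sub)            ## any *additional* duplication of 'sub' will now be reported
  parsList <- c(list(x = sub), moreParamsList)
  do.call(svm, parsList)
}
bigm <- matrix(rnorm(1e5), ncol = 10)
y <- factor(sample(c("a", "b"), nrow(bigm), replace = TRUE))
inTrain <- sample(nrow(bigm), 500)
fit <- foo(bigm, inTrain, list(y = y[inTrain]))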
I have to repeatedly serialize (big) R objects. To avoid repeated garbage collection of the resulting raw vectors (after profiling, it turns out that half of my script's running time is spent in gc!), I'd like to ask R to write directly into a memory buffer -- always the same one, since after each serialization I'd call a C function with .C that would work directly on this buffer; it is the result of this C function that interests me.
Is that possible? How unreasonable is it?
Thanks in advance.
I might not have understood your problem, but why don't you use your R object directly in C++ code via Rcpp? There will be no copy and you won't need any serialization.
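As a minimal sketch of that "no copy, no serialization" idea (assuming the Rcpp package is available; the function name and the computation are placeholders):
library(Rcpp)
cppFunction('
double sumSquares(NumericVector x) {
    // NumericVector is a thin wrapper around the existing R vector:
    // as long as the type matches (numeric -> NumericVector), no copy is made
    double total = 0;
    int n = x.size();
    for (int i = 0; i < n; ++i) total += x[i] * x[i];
    return total;
}
')
big_object <- rnorm(1e7)
sumSquares(big_object)   # works directly on the R memory, nothing is serialized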
I have started using the doMC package for R as the parallel backend for parallelised plyr routines.
The parallelisation itself seems to be working fine (though I have yet to properly benchmark the speedup), but my problem is that the logging is now asynchronous and messages from different cores get mixed in together. I could create different logfiles for each core, but I think a neater solution is to simply add a different label for each core. I am currently using the log4r package for my logging needs.
I remember that when using MPI each processor got a rank, which was a way of distinguishing the processes from one another, so is there a way to do this with doMC? I did have the idea of extracting the PID, but this seems messy and will change for every iteration.
I am open to ideas though, so any suggestions are welcome.
EDIT (2011-04-08): Going with the suggestion of one answer, I still have the issue of correctly identifying which subprocess I am currently inside, as I would either need separate closures for each log() call so that it writes to the correct file, or I would have a single log() function, but have some logic inside it determining which logfile to append to. In either case, I would still need some way of labelling the current subprocess, but I am not sure how to do this.
Is there an equivalent of the mpi_rank() function in the MPI library?
I think having multiple processes write to the same file is a recipe for disaster (it's just a log though, so maybe "disaster" is a bit strong).
Oftentimes I parallelize work over chromosomes. Here is an example of what I'd do (I've mostly been using foreach/doMC):
foreach(chr=chromosomes, ...) %dopar% {
  cat("+++", chr, "+++\n")
  ## ... some undoubtedly amazing code would then follow ...
}
And it wouldn't be unusual to get output that tramples over each other ... something like (not exactly) this:
+++chr1+++
+++chr2+++
++++chr3++chr4+++
... you get the idea ...
If I were in your shoes, I think I'd split the logs for each process and set their respective filenames to be unique with respect to something happening in that process's loop (like chr in my case above). Collate them later if you must ... i.e. map/reduce your log files :-)
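For example, a rough sketch of that per-task log-file approach (doMC and foreach are assumed to be installed; the log directory and the chromosome vector are placeholders):
library(doMC)
library(foreach)
registerDoMC(cores = 4)
chromosomes <- paste0("chr", 1:22)          # placeholder task list
dir.create("logs", showWarnings = FALSE)
results <- foreach(chr = chromosomes) %dopar% {
  ## one log file per task, labelled by the loop variable and the worker's PID
  logfile <- file.path("logs", sprintf("%s_pid%d.log", chr, Sys.getpid()))
  con <- file(logfile, open = "wt")
  cat("+++", chr, "+++\n", file = con)
  ## ... some undoubtedly amazing code would then follow ...
  close(con)
  chr
}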