When I write code in PyOpenCL, do I still need to write the kernels in C, or can I write them somehow in Python?
Yes, you still need to write the kernels in C.
It really is not much of a pain to deal with. And if you want a bit more abstraction, you can create a domain-specific language in Python that maps to parts of C kernels.
The reason C is required for writing kernels is that OpenCL exists to create extremely performant applications. To make the most of a GPU, you need to control the exact on-chip operations the application performs (such as bitwise operations) and how it allocates the GPU's memory spaces (global, constant, local, and private). C is a great language for that sort of control.
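For illustration, here is a minimal sketch of the usual PyOpenCL pattern: the kernel is OpenCL C embedded as a Python string, while all the host-side setup stays in Python.

```python
import numpy as np
import pyopencl as cl

# The kernel itself is still C (OpenCL C), embedded as a Python string.
KERNEL_SRC = """
__kernel void double_it(__global const float *src, __global float *dst) {
    int gid = get_global_id(0);
    dst[gid] = 2.0f * src[gid];
}
"""

a = np.arange(16, dtype=np.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

# Allocate device buffers and copy the input over.
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

# Compile the C kernel at runtime and launch one work-item per element.
prg = cl.Program(ctx, KERNEL_SRC).build()
prg.double_it(queue, a.shape, None, a_buf, out_buf)

result = np.empty_like(a)
cl.enqueue_copy(queue, result, out_buf)
print(result)
```

For the abstraction route mentioned above, PyOpenCL itself ships helpers such as pyopencl.elementwise.ElementwiseKernel, which generate the boilerplate around a short C expression string for you.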
I am writing a machine learning toolkit to run an algorithm with different settings in parallel (each process runs the algorithm for one setting). I am trying to decide between mpi4py and Python's built-in multiprocessing.
There are a few pros and cons I am considering.
Ease of use:
mpi4py: there seem to be more concepts to learn and a few more tricks to make it work well
multiprocessing: quite easy and clean API
Speed:
mpi4py: people say it is more low-level, so I expect it to be faster than Python's multiprocessing?
multiprocessing: much slower compared with mpi4py?
Clean and short code:
mpi4py: seems like more code to write
multiprocessing: preferred, easy-to-use API
The working context is that I aim to run the code on a single computer or a GPU server, not across different machines on a network (which only MPI can do).
Since the main goal is machine learning, the parallelization does not need to be highly optimized; the key goal is a code base that is easy, clean, and quick to maintain, while still exploiting the benefits of parallelization.
With the background described above, is multiprocessing enough, or is there a strong reason to use mpi4py?
With mpi4py you can divide the task across multiple processes, but on a single computer with a limited number of cores its usefulness will be limited. However, you might find it handy during training.
mpi4py is constructed on top of the MPI-1/2 specifications and provides an object-oriented interface that closely follows the MPI-2 C++ bindings.
MPI for Python provides MPI bindings for the Python language, allowing programmers to exploit multiple-processor computing systems.
MPI for Python supports convenient, pickle-based communication of generic Python objects as well as fast, near C-speed, direct array data communication of buffer-provider objects.
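Given the goals stated in the question (single machine, clean code, a sweep over settings), multiprocessing is likely enough. A minimal sketch, assuming a hypothetical run_algorithm(settings) function that trains and evaluates one configuration:

```python
from itertools import product
from multiprocessing import Pool

def run_algorithm(settings):
    """Hypothetical stand-in: train/evaluate one configuration
    and return its score."""
    learning_rate, num_trees = settings
    # ... fit the model here ...
    return learning_rate * num_trees  # placeholder result

if __name__ == "__main__":
    # Cartesian product of the hyperparameter grid.
    grid = list(product([0.01, 0.1, 1.0], [10, 100, 1000]))

    # One worker process per core by default; each call is independent,
    # which is exactly the "one process per setting" pattern.
    with Pool() as pool:
        scores = pool.map(run_algorithm, grid)

    for settings, score in zip(grid, scores):
        print(settings, score)
```

mpi4py mainly pays off across multiple machines; on a single node its chief speed advantage is the fast buffer-based Send/Recv for NumPy arrays (the capitalized methods), as opposed to the pickle-based lowercase send/recv.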
I'm looking to create Rust implementations of some small bioinformatics programs for my research. One of my main considerations is performance, and while I know that I could schedule the Rust program to run on a grid with qsub - the cluster I have access to uses Oracle's GridEngine - I'm worried that the fact that I'm not calling MPI directly will cause performance issues with the Rust program.
Will scheduling the program without using an MPI library hinder performance greatly? Should I use an MPI library in Rust, and if so, are there any known MPI libraries for Rust? I've looked for one but I haven't found anything.
I have used several supercomputing facilities (I'm an astrophysicist) and have often faced the same problem: I know C/C++ very well but prefer to work with other languages.
In general, any approach other than MPI will do, but consider that often such supercomputers have heavily optimised MPI libraries, often tailored for the specific hardware integrated in the cluster. It is difficult to tell how much the performance of your Rust programs will be affected if you do not use MPI, but the safest bet is to stay with the MPI implementation provided on the cluster.
There is no performance penalty in using a Rust wrapper around a C library like an MPI library, as the bottleneck is the time needed to transfer data (e.g., via MPI_Send) between nodes, not the negligible cost of an additional function call. (Moreover, as stated above, in Rust's case there is not even an additional function call.)
However, despite the very good FFI provided by Rust, it is not going to be easy to create MPI bindings. The problem lies in the fact that MPI is not a library but a specification. Popular MPI libraries are OpenMPI (http://www.open-mpi.org) and MPICH (http://www.mpich.org). Each of them differs slightly in the way it implements the standard, and they usually cover such differences using C preprocessor macros. Very few FFIs are able to deal with complex macros; I don't know how Rust scores here.
As an example, I am implementing an MPI program in Free Pascal, but I cannot use the existing MPICH bindings (http://wiki.lazarus.freepascal.org/MPICH), as the cluster I am using provides its own MPI library and I prefer to use it for the reason stated above. Those bindings assume that constants like MPI_BYTE are hardcoded integer constants, but in my library they are pointers to opaque structures that seem to be created when MPI_Init is called.
Julia bindings to MPI (https://github.com/lcw/MPI.jl) solve this problem by running C and Fortran programs during the installation that generate Julia code with the correct values for such constants. See e.g. https://github.com/lcw/MPI.jl/blob/master/deps/make_f_const.f
In my case I preferred to implement a middleware, i.e., a small C library which wraps MPI calls with a more "predictable" interface. (This is more or less what the Python and OCaml bindings do too; see https://forge.ocamlcore.org/projects/ocamlmpi/ and http://mpi4py.scipy.org.) Things are running smoothly; so far I haven't had any problems.
Will scheduling the program without using an MPI library hinder performance greatly?
There are lots of ways to carry out parallel computing. MPI is one, and as comments to your question indicate you can call MPI from Rust with a bit of gymnastics.
But there are other approaches, like the PGAS family (Chapel, OpenSHMEM, Co-array Fortran), or alternative messaging like what Charm++ uses.
MPI is "simply" providing a (very useful, highly portable, aggressively optimized) messaging abstraction, but as long as you have some way to manage the parallelism, you can run anything on a cluster.
I have a program which uses OpenCL to do math. How can I get the source code of the OpenCL kernels that execute on my GPU when this program does its calculations?
The most straightforward approach is to look for the kernel string in the application. Sometimes you'll be able to just find its source lying in some .cl file; otherwise, you can try to scan the application's binaries with something like strings. If the application is not purposefully obfuscating the kernel source, you're likely to find it using one of those methods.
A more bulletproof approach would be to catch the strings provided to the OpenCL API. You can even provide your own OpenCL implementation that just prints out the kernel strings in the relevant cl function. It's actually pretty easy: start with pocl and change the implementation of clCreateProgramWithSource to print out the input strings - this is a trivial code change.
You can then install that modified version as an OpenCL implementation and make sure the application uses it. This might be tricky if the application requires certain OpenCL capabilities, but your implementation can of course lie about those.
Notice that in the future, SPIR may make this sort of thing impossible: you'll be able to get an IR of the kernel, but not its source.
clGetProgramInfo(..., CL_PROGRAM_BINARIES, ...) gets you the compiled binary, but interpreting that is dependent upon the architecture. Various SDKs have different tools that might get you GPU assembly, though.
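For a program you control (or whose build you can re-create), here is a sketch of dumping those binaries with PyOpenCL; cl.program_info.BINARIES is PyOpenCL's spelling of CL_PROGRAM_BINARIES, and what the blob contains (real GPU ISA vs. an intermediate representation) is vendor-specific.

```python
import pyopencl as cl

ctx = cl.create_some_context()
src = "__kernel void noop(__global int *x) { x[0] = 0; }"
prg = cl.Program(ctx, src).build()

# CL_PROGRAM_BINARIES returns one blob per device the program was built for.
devices = prg.get_info(cl.program_info.DEVICES)
binaries = prg.get_info(cl.program_info.BINARIES)

for dev, blob in zip(devices, binaries):
    fname = "kernel_%s.bin" % dev.name.strip().replace(" ", "_")
    with open(fname, "wb") as f:
        f.write(blob)
    print("wrote", len(blob), "bytes for", dev.name)
```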
I currently have an MPI program written in C and I want to use a routine from ScaLAPACK.
I'm working on a parallel version of LDA, and one step is inverting a matrix.
I found a routine in ScaLAPACK that solves this, pdgetri.f (it's written in Fortran; I'm not sure a C routine exists), but I'm not sure how to configure it to work. I'm using Windows and an Intel dual-core laptop. The purpose is more didactic than performance-oriented.
ScaLAPACK relies on BLACS to provide an abstraction over whatever message-passing system is in use. If you have an existing MPI communicator established in your code, you can use blacs_gridmap to initialise a BLACS context which is mapped onto your communicator. That context can then be used to create ScaLAPACK distributed arrays, and those arrays can then be passed to ScaLAPACK routines which will operate on them.
How you tackle the C-Fortran interfacing problem will depend a lot on what compiler(s) you are using. If you have a "modern" compiler which supports Fortran 2003 features, you can use the C interoperability language features to write an interface wrapper for the functions you need, and then call them directly. On UNIX/Linux-style systems, F2C-style interfacing was the de facto way to call Fortran from C, although some of the details were usually compiler-specific. I don't use Windows at all, so I can't really help you if you can't use Fortran 2003 interoperability.
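As a small illustration of the F2C-style convention (on the serial LAPACK analogues of pdgetrf/pdgetri rather than ScaLAPACK itself), here is a sketch in Python's ctypes that inverts a matrix via dgetrf_/dgetri_. The trailing underscore and pass-by-reference arguments are exactly the details a C caller has to get right; whether your LAPACK is discoverable this way and uses 32-bit integers depends on how it was built.

```python
import ctypes
import ctypes.util
import numpy as np

# Locate a system LAPACK; the library name/path is platform-specific.
lapack = ctypes.CDLL(ctypes.util.find_library("lapack"))
lapack.dgetrf_.restype = None  # Fortran subroutines return nothing
lapack.dgetri_.restype = None

n = 3
# Fortran expects column-major storage.
a = np.array([[4.0, 2.0, 0.0],
              [2.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]], order="F")
ipiv = np.zeros(n, dtype=np.int32)
work = np.zeros(n * n)
n_c, lwork, info = ctypes.c_int(n), ctypes.c_int(n * n), ctypes.c_int(0)

def ptr(arr, ctype):
    return arr.ctypes.data_as(ctypes.POINTER(ctype))

# F2C convention: trailing underscore on the symbol, every argument
# passed by reference. dgetrf_ computes the LU factorization in place.
lapack.dgetrf_(ctypes.byref(n_c), ctypes.byref(n_c),
               ptr(a, ctypes.c_double), ctypes.byref(n_c),
               ptr(ipiv, ctypes.c_int), ctypes.byref(info))
# dgetri_ turns the LU factors into the inverse, also in place.
lapack.dgetri_(ctypes.byref(n_c), ptr(a, ctypes.c_double),
               ctypes.byref(n_c), ptr(ipiv, ctypes.c_int),
               ptr(work, ctypes.c_double), ctypes.byref(lwork),
               ctypes.byref(info))
print(a)  # the inverse, provided info.value == 0
```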
I have seen a few blog entries on this and have had a discussion or two with my teammates, but I would like to see what the Stack Overflow community thinks.
So why does the Adobe Alchemy tool create much faster-running Flash bytecode than the Flex compiler?
Also, when will the Flex compiler be able to make similar performance gains?
Will it require programmers to make specific use of special arrays or something of that nature to get the same performance?
Alchemy is an implementation of LLVM in ActionScript. Simply put, it's a virtual machine that uses a ByteArray as its memory store.
The C code compiled by Alchemy has direct access to "memory" (via some opcodes introduced in Flash 10), allowing it to move memory around at its leisure (including pointers to objects). This results in some, but by no means all, code running faster. Some types of code will actually run slower in Alchemy because it is a VM running on top of the AVM (another VM).
Additionally, Alchemy does not have native access to ActionScript classes and must access them through interop classes.
The Alchemy tool creates code that uses instructions in the Flash Player that aren't available to the regular compiler (and the talk is that these instructions were exposed especially for Alchemy).
Will the regular compiler eventually make similar gains? Hopefully. It has been shown a few times that the compiler creates substandard code, and there are a couple of projects which optimise the generated code. These may shame Adobe into improving.
Chances are, no, there won't be anything special a programmer needs to do to get these performance gains (though check out the optimisation blogs: writing loops in a particular way means they can be optimised better).