R gputools: clearing or reusing memory

I'm running a simulation using the gputools package and I want to make sure I don't run into memory problems as the simulation progresses. Does gputools automatically clear the GPU memory?
I checked the gputools documentation here but I don't see any mention of how memory is handled.

As far as I can tell, all the functions in gputools are just wrappers around lower level functions provided by BLAS and the NVIDIA CUDA toolkit.
I've briefly dug into the toolkit, and it does provide memory allocation and freeing functions for C/C++ users, but calling those from R would be highly impractical. So I think it's fairly safe to say that the allocation and freeing of GPU memory is handled by the lower-level functions that gputools wraps.
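To make that concrete, here is a minimal sketch (assuming the gputools function gpuMatMult and an available CUDA device); the comments reflect the interpretation above, namely that device memory is allocated and released inside each call rather than persisting between calls:

    # Minimal sketch, assuming gputools is installed and a CUDA GPU is available.
    library(gputools)

    a <- matrix(rnorm(1000 * 1000), nrow = 1000)
    b <- matrix(rnorm(1000 * 1000), nrow = 1000)

    for (i in 1:100) {
      # Presumably each call copies a and b to the device, multiplies,
      # copies the result back, and frees the device memory before returning.
      res <- gpuMatMult(a, b)
      # res is an ordinary R matrix in host memory.
    }

    # Host-side objects are still managed by R's garbage collector;
    # gc() affects the R objects, not GPU allocations.
    gc()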

Related

R with 36 cores is much slower than R with 8 cores

On the right-hand side is the result from an EC2 instance with 36 cores and 64 GB RAM, while on the left is my laptop with 8 cores and 8 GB RAM.
I'm new to running R on an AWS EC2 instance, so I probably need to configure my R installation to make use of the EC2 instance's raw compute power.
Could someone please advise on how to do this, or is there anything I'm missing here?
Thanks!
I found a better answer here:
https://github.com/csantill/RPerformanceWBLAS/blob/master/RPerformanceBLAS.md
We frequently hear that R is slow. If your code makes heavy use of
vector/matrix operations, you will see significant performance
improvements.
The precompiled R distribution downloaded from CRAN uses the reference
BLAS/LAPACK implementation for linear algebra operations. These
implementations are built to be stable and cross-platform compatible,
but they are not optimized for performance. Most R programmers use
these default libraries and are unaware that highly optimized
alternatives are available, and that switching to them can yield a
significant performance improvement.
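As a quick, hedged sketch using only base R: sessionInfo() (in R 3.4 and later) reports which BLAS/LAPACK libraries the session is linked against, and timing a matrix-heavy operation before and after switching to an optimized BLAS (e.g. OpenBLAS or MKL) gives a rough comparison.

    # Check which BLAS/LAPACK libraries this R session is using.
    sessionInfo()

    # A simple matrix-heavy benchmark: time a large crossproduct.
    # Rerun after switching BLAS libraries to compare timings.
    set.seed(1)
    m <- matrix(rnorm(4000 * 4000), nrow = 4000)
    system.time(crossprod(m))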

Confusion over Nvidia GPU packages in Julia, CuArrays and ArrayFire

I recently looked into GPU computation in Julia, where the choice of packages seemed confusing.
For example, CuArrays and ArrayFire seem to do the same thing, and ArrayFire appears to be the "official" package on Nvidia's developer blog (https://devblogs.nvidia.com/gpu-computing-julia-programming-language).
There are also the CUDAdrv and CUDAnative packages, which are confusing as well, since their functionality seems less straightforward than the others.
What do these packages do? Is there any difference between CuArrays and ArrayFire?
As explained in the blog post you shared, it is quite simple, as quoted below:
The Julia package ecosystem already contains quite a few GPU-related
packages, targeting different levels of abstraction as Figure 1 shows.
At the highest abstraction level, domain-specific packages like
MXNet.jl and TensorFlow.jl can transparently use the GPUs in your
system. More generic development is possible with ArrayFire.jl, and if
you need a specialized CUDA implementation of a linear algebra or deep
neural network algorithm you can use vendor-specific packages like
cuBLAS.jl or cuDNN.jl. All these packages are essentially wrappers
around native libraries, making use of Julia’s foreign function
interfaces (FFI) to call into the library’s API with minimal overhead.
The CUDAdrv and CUDAnative packages are meant for using the CUDA API directly and for writing kernels from Julia itself. I believe that is where CuArrays comes in handy: roughly speaking, it wraps native Julia arrays in a CUDA-accessible format.
ArrayFire, on the other hand, is a generic library that wraps all of the CUDA-provided domain-specific libraries (cuBLAS, cuSparse, cuSolver, cuFFT) in a nice interface (functions). Apart from the interface to CUDA's domain-specific libraries, ArrayFire itself provides a lot of other functions in the areas of statistics, image processing, computer vision, etc. It also has a nice JIT feature where the user's code is compiled into a runtime kernel, simply put. ArrayFire.jl is a language binding with some extra Julia-specific improvements at the wrapper level.
That's the general difference. From a developer's perspective, using a library (like ArrayFire) basically removes the burden of keeping up with the CUDA API and of maintaining/tweaking the kernels for optimum performance, which I think takes a lot of time.
PS: I am a member of the ArrayFire development team.

R does not engage my computer's GPU when running complex code

I am running RStudio (64-bit) on a Windows 10 laptop with an Nvidia GPU in it; however, when I run code, specifically Shiny apps, it takes a long time. The laptop has a GPU, but Task Manager shows that the GPU is not being utilized. Would the GPU make my program run faster? I do not know much about hardware, so forgive my ignorance regarding this.
In answer to your question: getting a new GPU would have no impact on the speed of your code.
By default, most R code is single-threaded, meaning it will only use one CPU core. There are various ways to do parallel processing (using more than one core) in R, and there are also packages that can make use of GPUs. However, it sounds like you are not using either of these; a minimal example of CPU parallelism is sketched below.
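For illustration, a minimal sketch with the base parallel package (slow_task here is a hypothetical stand-in for an expensive computation, not something from your app):

    # Run an expensive function on several CPU cores using base R's 'parallel' package.
    library(parallel)

    slow_task <- function(i) {
      sum(rnorm(1e6))   # stand-in for real work
    }

    n_cores <- max(1, detectCores() - 1)
    cl <- makeCluster(n_cores)      # socket cluster, works on Windows too
    results <- parLapply(cl, 1:8, slow_task)
    stopCluster(cl)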
There are various ways you could restructure your application to make it more efficient, but how to go about that is specific to your code. I would suggest asking a separate question about that, with the relevant code included.
Also, Hadley Wickham's excellent book Advanced R has techniques for profiling and benchmarking your code to improve performance: http://adv-r.had.co.nz/Profiling.html
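For example, base R's built-in profiler gives a quick view of where time is actually being spent (a minimal sketch; slow_task is again a hypothetical stand-in for your own code):

    # Profile a function with base R's sampling profiler, then summarise the results.
    slow_task <- function() {
      m <- matrix(rnorm(2000 * 2000), nrow = 2000)
      solve(crossprod(m))
    }

    Rprof("profile.out")        # start recording
    invisible(slow_task())
    Rprof(NULL)                 # stop recording
    summaryRprof("profile.out") # see which calls dominate the runtime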

PETSc with shared memory

I have an MPI parallel code using PETSc to solve a linear equation system with a matrix-free GMRES method. It works fine, but each process uses about the same amount of memory, independent of the number of processes I use. So when using many processes, memory usage gets excessive. I am wondering if there is a way around this and I think using a shared memory approach might be the way to go.
As I understand from the PETSc website, shared memory is supported by PETSc (MPI shared memory is used for simplicity), but I can't find any information on how to enable or use it. Is PETSc with shared memory a possible solution to my problem, and if so, is there any documentation on how to do this? Or is MPI shared memory used in PETSc by default, where possible, without any additional coding?

Rust on grid computing

I'm looking to create Rust implementations of some small bioinformatics programs for my research. One of my main considerations is performance, and while I know that I could schedule the Rust program to run on a grid with qsub - the cluster I have access to uses Oracle's GridEngine - I'm worried that the fact that I'm not calling MPI directly will cause performance issues with the Rust program.
Will scheduling the program without using an MPI library hinder performance greatly? Should I use an MPI library in Rust, and if so, are there any known MPI libraries for Rust? I've looked for one but I haven't found anything.
I have used several supercomputing facilities (I'm an astrophysicist) and have often faced the same problem: I know C/C++ very well but prefer to work with other languages.
In general, any approach other than MPI will do, but consider that often such supercomputers have heavily optimised MPI libraries, often tailored for the specific hardware integrated in the cluster. It is difficult to tell how much the performance of your Rust programs will be affected if you do not use MPI, but the safest bet is to stay with the MPI implementation provided on the cluster.
There is no performance penalty in using a Rust wrapper around a C library such as an MPI library, as the bottleneck is the time needed to transfer data between nodes (e.g. via MPI_Send), not the negligible cost of an additional function call. (Moreover, this is not even the case for Rust: there is no additional function call, as already stated above.)
However, despite the very good FFI provided by Rust, it is not going to be easy to create MPI bindings. The problem lies in the fact that MPI is not a library, but a specification. Popular MPI libraries are OpenMPI (http://www.open-mpi.org) and MPICH (http://www.mpich.org). Each of them differs slightly in the way they implement the standard, and they usually cover such differences using C preprocessor macros. Very few FFIs are able to deal with complex macros; I don't know how Rust scores here.
For instance, I am implementing an MPI program in Free Pascal, but I am not able to use the existing MPICH bindings (http://wiki.lazarus.freepascal.org/MPICH), as the cluster I am using provides its own MPI library and I prefer to use it, for the reason stated above. I was unable to reuse the MPICH bindings because they assume that constants like MPI_BYTE are hardcoded integer constants; in my case they are pointers to opaque structures that seem to be created when MPI_Init is called.
Julia bindings to MPI (https://github.com/lcw/MPI.jl) solve this problem by running C and Fortran programs during the installation that generate Julia code with the correct values for such constants. See e.g. https://github.com/lcw/MPI.jl/blob/master/deps/make_f_const.f
In my case I preferred to implement a middleware, i.e., a small C library which wraps MPI calls in a more "predictable" interface. (This is more or less what the Python and OCaml bindings do too; see https://forge.ocamlcore.org/projects/ocamlmpi/ and http://mpi4py.scipy.org.) Things are running smoothly; so far I haven't had any problems.
Will scheduling the program without using an MPI library hinder performance greatly?
There are lots of ways to carry out parallel computing. MPI is one, and as the comments to your question indicate, you can call MPI from Rust with a bit of gymnastics.
But there are other approaches, like the PGAS family (Chapel, OpenSHMEM, Co-array Fortran), or alternative messaging like what Charm++ uses.
MPI is "simply" providing a (very useful, highly portable, aggressively optimized) messaging abstraction, but as long as you have some way to manage the parallelism, you can run anything on a cluster.
