Chainer: ParallelUpdater performance vs MultiprocessParallelUpdater

I'd like to train a CNN on the CIFAR10 dataset with Chainer on multiple GPUs on a single node. I tried adapting this example to use ParallelUpdater, in a manner identical to the MNIST data-parallel example, but training performance was very poor -- slower than training on a single GPU, even though all 8 GPUs were being utilized. I switched to MultiprocessParallelUpdater and performance (iterations/sec) was much better.
Bad:
num_gpus = 8
chainer.cuda.get_device_from_id(0).use()
train_iter = chainer.iterators.SerialIterator(train, batch_size)
if num_gpus > 0:
    updater = training.updater.ParallelUpdater(
        train_iter,
        optimizer,
        devices={('main' if device == 0 else str(device)): device
                 for device in range(num_gpus)},
    )
else:
    updater = training.updater.StandardUpdater(train_iter, optimizer, device=0)
Good:
num_gpus = 8
devices = range(num_gpus)
train_iters = [chainer.iterators.MultiprocessIterator(i, batch_size, n_processes=num_gpus)
               for i in chainer.datasets.split_dataset_n_random(train, len(devices))]
test_iter = chainer.iterators.MultiprocessIterator(test, batch_size, repeat=False, n_processes=num_gpus)
device = 0 if num_gpus > 0 else -1  # -1 indicates CPU, 0 indicates the first GPU device.
if num_gpus > 0:
    updater = training.updaters.MultiprocessParallelUpdater(train_iters, optimizer, devices=range(num_gpus))
else:
    updater = training.updater.StandardUpdater(train_iters[0], optimizer, device=device)
I also ran this benchmarking script with 8 GPUs, using ParallelUpdater, but performance was likewise very poor: https://github.com/mitmul/chainer-cifar10/blob/master/train.py
My question is: how can I get good performance from ParallelUpdater, and what might I be doing wrong with it?
Thanks!

When using multiple GPUs there is some communication overhead, so each individual iteration can be slower. With the data-parallel method, however, you can use a much larger batch size and a larger learning rate, which can accelerate your training overall.
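As a rough illustration of that suggestion (my own sketch, not from the answer: the per-GPU batch size, the base learning rate, and the linear learning-rate scaling heuristic are all assumptions):

import chainer

num_gpus = 8
per_gpu_batch_size = 64   # whatever fits in one GPU's memory
base_lr = 0.05            # learning rate tuned for single-GPU training

# Data parallelism multiplies the effective batch size by the number of GPUs;
# a common heuristic is to scale the learning rate linearly with it.
global_batch_size = per_gpu_batch_size * num_gpus   # 512
scaled_lr = base_lr * num_gpus                      # 0.4

optimizer = chainer.optimizers.MomentumSGD(lr=scaled_lr)
# optimizer.setup(model)  # model here would be the asker's CNN, defined elsewhere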

I am not so familiar with ParallelUpdater, so my understanding might be wrong.
I suspect the purpose of ParallelUpdater is not speed; rather, its main purpose is to use GPU memory efficiently so that a large-batch gradient can be computed.
Reading the source code, the model update is done in a Python for loop, so because of the GIL (Global Interpreter Lock) I guess the computation itself is not actually done in parallel.
https://github.com/chainer/chainer/blob/master/chainer/training/updaters/parallel_updater.py#L118
As written above, you can use MultiprocessParallelUpdater if you want the speed benefit of multiple GPUs.
You can also consider ChainerMN, which is an extension library for multi-GPU training with Chainer (see its GitHub repository and documentation).
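For reference, here is a rough single-node ChainerMN sketch (my own illustration, not from the original answer; the toy SmallCNN model and all hyperparameters are placeholders, so check the ChainerMN documentation for the exact API of your version). It would be launched with one MPI process per GPU, e.g. mpiexec -n 8 python train_mn.py:

import chainer
import chainer.functions as F
import chainer.links as L
from chainer import training
import chainermn


class SmallCNN(chainer.Chain):
    # Toy CNN standing in for the asker's real model.
    def __init__(self, n_out=10):
        super(SmallCNN, self).__init__()
        with self.init_scope():
            self.conv = L.Convolution2D(None, 32, ksize=3, pad=1)
            self.fc = L.Linear(None, n_out)

    def __call__(self, x):
        h = F.relu(self.conv(x))
        h = F.max_pooling_2d(h, 2)
        return self.fc(h)


def main():
    comm = chainermn.create_communicator('pure_nccl')  # NCCL-based all-reduce
    device = comm.intra_rank                           # one GPU per MPI process
    chainer.cuda.get_device_from_id(device).use()

    model = L.Classifier(SmallCNN())
    model.to_gpu()

    # Wrap the optimizer so gradients are all-reduced across processes.
    optimizer = chainermn.create_multi_node_optimizer(
        chainer.optimizers.MomentumSGD(lr=0.05), comm)
    optimizer.setup(model)

    # Rank 0 loads the data; every process then receives its own shard.
    if comm.rank == 0:
        train, _ = chainer.datasets.get_cifar10()
    else:
        train = None
    train = chainermn.scatter_dataset(train, comm, shuffle=True)

    train_iter = chainer.iterators.SerialIterator(train, batch_size=64)
    updater = training.StandardUpdater(train_iter, optimizer, device=device)
    trainer = training.Trainer(updater, (20, 'epoch'))
    trainer.run()


if __name__ == '__main__':
    main()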

Related

R's gc() on parallel runs seems to dramatically under-report peak memory

In R I have a task that I'm trying to parallelize. Part of this is comparing run times and peak memory usage for different implementations of the task at hand. I'm using the peakRAM library to determine peak memory, which I think just uses gc() under the hood, since doing it manually gives the same peak-memory results.
The problem is that the results from peakRAM differ from the computer's task manager (or top on Linux). If I run single-core, these numbers are in the same ballpark, but even with 2 cores they are really different.
I'm parallelizing using pblapply in a manner similar to this.
times_parallel = peakRAM(
  pblapply(X = 1:10,
           FUN = \(x) data[iteration == x] %>% parallel_task(),
           cl = makeCluster(numcores, type = "FORK"))
)
With a single core, this process requires a peak of 30G of memory. But with 2 cores, peakRAM reports only about 3G of memory. Looking at top, however, shows that each of the 2 worker processes is using around 20-30G of memory at a time.
The only thing I can think of is that peakRAM is only reporting the memory of the main R process, but I see nothing in the gc() details that suggests this is happening.
The time reported by peakRAM seems appropriate: sub-linear gains at different core counts.

cuDF low GPU utilization

I have a task that involves running many queries on a dataframe. I compared the performance of running these queries on a Xeon CPU (pandas) vs. an RTX 2080 (cuDF). For a dataframe of 100k rows, the GPU is faster, but not by much. Looking at the nvidia-smi output, GPU utilization is around 3-4% while running the queries.
My question is what can I do to speed up the cuDF task and achieve high GPU utilization?
For example I can run 8 of these queries on 8 CPU cores in parallel for the CPU use case.
import cudf
import cupy as cp
import numpy as np

NUM_ELEMENTS = 100000
df = cudf.DataFrame()
df['value1'] = cp.random.sample(NUM_ELEMENTS)
df['value2'] = cp.random.sample(NUM_ELEMENTS)
df['value3'] = cp.random.sample(NUM_ELEMENTS)
c1 = np.random.random()
c2 = np.random.random()
c3 = np.random.random()
res = df.query('((value1 < #c1) & (value2 > #c2) & (value3 < #c3))')
Here is a sample code that doesn't take a lot of GPU cycles, however I want to run thousands of such queries on the data and I don't want to run them sequentially. Is there a way to run the multiple query() calls on a cuDF dataframe in parallel to maximize GPU utilization?
We're working towards enabling this in cuDF, but it is currently a limitation of the library. The parallelism mechanism you're looking for is CUDA streams (https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/). We don't yet support CUDA streams in the cuDF Python library, but we're actively working on it.
You may be able to work around this using a combination of CuPy and Numba along with their support for CUDA streams (https://docs.cupy.dev/en/stable/reference/generated/cupy.cuda.Stream.html, https://numba.pydata.org/numba-doc/dev/cuda-reference/host.html#stream-management), but you'd be in a very experimental area.
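As an illustration of that experimental route, here is a rough CuPy-only sketch (my own, not from the answer). It expresses each query as a boolean mask over raw CuPy arrays and enqueues it on its own CUDA stream; whether the kernels actually overlap depends on how much of the GPU each one occupies, so treat it as a starting point rather than a guaranteed speedup:

import cupy as cp

NUM_ELEMENTS = 100000
value1 = cp.random.sample(NUM_ELEMENTS)
value2 = cp.random.sample(NUM_ELEMENTS)
value3 = cp.random.sample(NUM_ELEMENTS)

# One (c1, c2, c3) triple per "query"; the real workload has thousands of them.
queries = [tuple(cp.random.rand(3).get()) for _ in range(8)]

streams = [cp.cuda.Stream(non_blocking=True) for _ in queries]
results = []

for (c1, c2, c3), stream in zip(queries, streams):
    with stream:
        # Kernels launched here are enqueued on this stream and may overlap
        # with kernels from the other streams if each one underutilizes the GPU.
        mask = (value1 < c1) & (value2 > c2) & (value3 < c3)
        results.append(value1[mask])

for stream in streams:
    stream.synchronize()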

How to speed up the generation of a latin hypercube (LHS) design

I'm trying to generate an optimized LHS (Latin Hypercube Sampling) design in R, with sample size N = 400 and d = 7 variables, but it's taking forever. My PC is an HP Z820 workstation with 12 cores, 32 GB RAM, Windows 7 64-bit, and I'm running Microsoft R Open, which is a multicore version of R. The code has been running for half an hour, but I still don't see any results:
library(lhs)
lhs_design <- optimumLHS(n = 400, k = 7, verbose = TRUE)
It seems a bit weird. Is there anything I could do to speed it up? I heard that parallel computing may help with R, but I don't know how to use it, and I have no idea whether it speeds up only code that I write myself, or whether it could speed up an existing package function such as optimumLHS. I don't have to use the lhs package necessarily -- my only requirement is that I would like to generate an LHS design which is optimized in terms of the S-optimality criterion, the maximin metric, or some other similar optimality criterion (thus, not just a vanilla LHS). If worse comes to worst, I could even accept a solution in a different environment than R, but it must be either MATLAB or an open-source environment.
Just a little code to check performance.
library(lhs)
library(ggplot2)

performance <- c()
for (i in 1:100) {
  ptm <- proc.time()
  invisible(optimumLHS(n = i, k = 7, verbose = FALSE))
  time <- print(proc.time() - ptm)[[3]]
  performance <- rbind(performance, data.frame(time = time, n = i))
}

ggplot(performance, aes(x = n, y = time)) +
  geom_point()
Not looking too good. It seems to me you might be in for a very long wait indeed. Based on the algorithm, I don't think there is a way to speed things up via parallel processing, since to optimize the separation between sample points you need to know the locations of all the sample points. I think your only options for speeding this up are to take a smaller sample or to get access to a faster computer. It strikes me that, since this is something that only really has to be done once, there might be a resource where you could simply get a properly sampled and optimized design that has already been computed.
So it looks like roughly 650 hours on my machine, which is very comparable to yours, to compute the n = 400 case.

multinode processing in R

I am trying to run R code on an HPC, but I am not sure how to take advantage of multiple nodes. The specific HPC I am using has 100 nodes and 36 cores per node.
Here is an example of the code.
n = 3600  ### This would be my ideal. Set to 3 on my laptop.
cl = makeCluster(n, "SOCK")
foreach(i = 1:length(files), .packages = c("raster", "dismo")) %dopar%
  Myfunction(files = files[i], template = comm.path, out = outdir)
This code works on my laptop and on the login node of the HPC, but it only uses 1 node. I just want to make sure I am taking advantage of all the cores that I can.
How specifically do I take advantage of multiple nodes, or is that handled "behind the scenes"?
If you are serious about HPC clusters, use an MPI cluster, not SOCK. MPI is the standard for non-shared-memory computing, and most clusters are optimized for it.
On an HPC system you also need a job script to start R. There are several ways to start it: you may use mpirun, or invoke the workers directly from R. The scheduler will set up the MPI environment, and R will figure out which nodes to use. Start small, with, say, 4 workers, and increase the number until you have reached the optimal level. Most tasks cannot efficiently use 3600 CPUs.
Finally, if you are using tens of CPUs over MPI, I strongly recommend the Rhpc package instead of Rmpi; it uses more efficient MPI communication and gives you a quite noticeable speed boost.
On a TORQUE-controlled system I am using something along these lines:
library(Rhpc)

Rhpc_initialize()
nodefile <- Sys.getenv("PBS_NODEFILE")
nodes <- readLines(nodefile)
commSize <- length(nodes)
cl <- Rhpc_getHandle(commSize)
Rhpc_Export(cl, c("data"))
...
result <- Rhpc_lapply(cl, 1:1000, runMySimulation)
...
Rhpc_finalize()
The TORQUE-specific part is the nodefile; that is how I know how many workers to create. In the job script I start R simply as Rscript >>output.txt myScript.R.
As a side note: are you sure myfun(files, ...) is correct? Perhaps you mean myfun(files[i], ...)?
Let us know how it goes, I am happy to help :-)

Cryptsetup Benchmark Values unrealistic

I am currently evaluating the AES encryption and decryption speed on my laptop and my workstation.
When executing
cryptsetup benchmark -c aes --key-size 128
I get normal results of almost 200 MB/s without the AES-NI extension.
When I load the extension with
modprobe aesni-intel
and perform the same benchmark, I get completely unrealistic results, for example 68021 MB/s on decrypt.
Any suggestions as to what might be causing these unrealistic results?
BTW: the OS on the laptop is Ubuntu; the workstation runs Gentoo.
I uninstalled the predefined Ubuntu package and installed cryptsetup from source.
When running
make check
the make script performs a single test, and those results are fine.
But when I install it via
make install
I again get these weird results.
Unrealistic benchmark results are usually caused by a wrong (as in totally invalid) approach to benchmarking.
Judging from the benchmark source, the benchmark core is (in horrendous pseudo-code):
totalTime = 0
totalSize = 0
while (totalTime < 1000) {
    (sampleTime, sampleSize) = processSingleSample
    totalTime += sampleTime
    totalSize += sampleSize
}
speed = totalSize / totalTime
Imagine a situation where the execution time of processSingleSample is close to zero: each iteration steadily increases totalSize, but on some iterations the measured time does not increase at all because the sample finishes within the timer's resolution. In the end totalTime is 1000 while totalSize is arbitrarily large, hence the resulting "speed" is arbitrarily large.
This benchmarking approach can still be useful when each individual iteration takes a significant amount of time, but in this particular case (especially after you enable AES-NI, which shrinks each individual iteration even further) it is simply not the right one.
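To make the failure mode concrete, here is a small Python simulation of that pseudo-code (a sketch of the described problem, not the actual cryptsetup code; the 1 ms timer granularity, sample sizes, and timings are invented for illustration):

import random

def run_benchmark(avg_sample_ms, sample_mb=1.0, budget_ms=1000.0):
    # Simulate the pseudo-code loop with per-sample durations truncated to
    # whole milliseconds, i.e. a timer coarser than a single fast sample.
    total_time_ms = 0.0
    total_size_mb = 0.0
    while total_time_ms < budget_ms:
        true_ms = random.expovariate(1.0 / avg_sample_ms)  # real duration
        measured_ms = int(true_ms)                         # truncated reading
        total_time_ms += measured_ms
        total_size_mb += sample_mb
    return total_size_mb / (total_time_ms / 1000.0)        # reported MB/s

random.seed(0)
# Slow samples (~10 ms each): the reported speed is close to the true ~100 MB/s.
print("slow case: %.0f MB/s" % run_benchmark(avg_sample_ms=10.0))
# Fast samples (~0.2 ms each, as with AES-NI): the true speed is ~5000 MB/s,
# but most samples are measured as 0 ms, so the reported figure explodes.
print("fast case: %.0f MB/s" % run_benchmark(avg_sample_ms=0.2))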
