I'm currently working on a research project that involves oversubscribing MPI applications to see how big a performance hit it causes. Long story short: I need to run an application compiled for, let's say, 32 processes (-np 32) on only 16 cores, even though my machine has 32 cores.
I have tried setting the number of slots available in the hostfile (i.e. localhost:16), but MPICH still uses all 32 cores.
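Roughly what I'm running (assuming the Hydra mpiexec; the benchmark binary name is just an example):

$ cat hostfile
localhost:16
$ mpiexec -f hostfile -np 32 ./bt.B.32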
Is it possible to force MPICH to do that?
PS: I'm using the NAS Parallel Benchmarks, and the host runs Ubuntu 14.04.
Related
I want to run my MPI program on two machines, one with Ubuntu 18.04 and the other with Windows 10. Is it possible to form a cluster with different operating systems? (I'm using MPICH.)
If so, how? I could not find any resources online.
You need the same version of MPI installed on all of the machines. In that case, it is possible to use MPICH2 on Linux and Windows machines simultaneously.
But it should be noted that the performance characteristics of each machine play a significant role: the application will be limited by the slowest processor, so it is not recommended to run MPI jobs across machines with very different performance characteristics. Also note that even if the hardware is identical, MPI performance will differ between MPICH2 on Linux and on Windows.
Finally, note that the last version of MPICH supported on Windows was MPICH2 1.4.1p1. MPICH is no longer supported on Windows at all, including under Cygwin, as stated in the MPICH FAQ.
Will R properly recognize all the cores on a multi-CPU Windows 10 Pro 64-bit machine? We are designing a parallel computing system with Intel CPUs, using two CPUs on one mainboard. There will be a total of 32 cores between the 2 CPUs (8 physical cores per CPU, up to 16 logical cores per CPU).
Before we spend the money, I want some confirmation that my R code will recognize and access all (or nearly all) of the cores; usually I put all but one into a cluster. I am using the R doParallel and foreach packages successfully on a Win 10 Pro 64-bit workstation with a single 4-core CPU (8 logical cores), and I can run my R code and request 7 cores with no trouble.
You should be able to use all of your cores in R. R successfully recognizes all available cores on both my laptop and my data science server.
Once you have access to one of the computers you want to use, you can find out how many cores R recognizes with the detectCores() function from the parallel package.
library(parallel)  # ships with base R
detectCores()      # number of logical cores the OS reports
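For the doParallel/foreach setup you describe, registering a cluster with all but one core would look roughly like this (a sketch only; it assumes the doParallel and foreach packages are installed):

library(doParallel)                    # also attaches foreach and parallel
cl <- makeCluster(detectCores() - 1)   # e.g. 31 workers on the planned 32-core box
registerDoParallel(cl)
res <- foreach(i = 1:8, .combine = c) %dopar% sqrt(i)  # toy parallel loop
stopCluster(cl)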
Taking a step back, you might want to reconsider the architecture of your system. If you have to run something so intense that it requires a battalion of multithreaded machines, you might want to think about rewriting your code to be more efficient, integrating Rcpp, or moving to a different language.
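If you do look at Rcpp, moving a hot loop into C++ can be as small as this (purely illustrative; it assumes the Rcpp package and a working C++ toolchain):

library(Rcpp)
cppFunction('
  double sumSquares(NumericVector x) {
    double s = 0;                          // accumulate the sum of squares in C++
    for (int i = 0; i < x.size(); i++) s += x[i] * x[i];
    return s;
  }')
sumSquares(rnorm(1e6))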
I recently installed Microsoft R Open but this message appears at startup of R:
"Multithreaded BLAS/LAPACK libraries detected. Using 2 cores for math algorithms."
On a Mac it's supposed to start using all 4 cores without any additional setup.
How can I change this to 3 or 4 cores?
Thank you
A very common way to set up multicore processing in RRO is to use setMKLthreads() from the Intel Math Kernel Library (MKL). However, to the best of my knowledge, there is no OS X-compatible MKL build yet (see here for more information).
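For reference, on the Windows/Linux builds of RRO/MRO that do bundle MKL, the call looks roughly like this (the package providing it has varied between releases; in later MRO versions it is RevoUtilsMath):

getMKLthreads()   # current number of threads MKL is using
setMKLthreads(4)  # let BLAS/LAPACK calls use 4 cores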
Another way to achieve multicore processing on OS X would be to use mclapply() from the parallel package, which works much like base R's lapply() (see the package's documentation here).
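A minimal sketch of that: mclapply() forks workers, so it behaves like lapply() but spreads the work over mc.cores processes (it silently falls back to a single core on Windows):

library(parallel)
squares <- mclapply(1:8, function(i) i^2, mc.cores = 4)  # run the calls on up to 4 cores
unlist(squares)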
However, before you dig into this, I suggest checking whether you really have a CPU with more than 2 physical cores. For instance, there are Intel i5 processors with either 2 or 4 physical cores depending on the model. A CPU with only 2 physical cores can still present a higher number of virtual cores to the OS. Since such i5 CPUs are frequently built into laptops, this could well be the case if you are using a MacBook.
See also this SO question for further information: Virtual core vs Physical core
Has anyone tried to compile glibc with -march=corei7 to see if there's any performance improvement over the version that comes by default with a Linux x86_64 distribution? GCC is compiled with -march=i686. I think (but am not sure) that the math library is also compiled the same way. Can anybody confirm this?
Most Linux distributions for x86 compile using only i686 instructions, but ask the compiler to schedule them for later processors. I haven't really followed later developments.
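If you want to check what a particular GCC build defaults to, something like this prints the effective -march/-mtune values (the exact output format varies between GCC versions):

$ gcc -Q --help=target | grep -E -- '-march=|-mtune='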
A long while back, separate versions of the system libraries built for different processor lines were common, but the performance differences were soon deemed too small to be worth the cost. And machines have become more uniform in performance in the meantime.
One thing that always has to be remembered is that today's machines are memory bound: a memory access takes a few hundred times longer than an instruction, and the gap is growing. Not to mention that this machine (an oldish laptop, top-of-the-line some 2 years back) has 4 cores (8 threads), all battling to get data and instructions from memory. Making the code run a tiny bit faster, so the CPU can wait even longer for RAM, isn't very productive.
I noticed an interesting problem. If I run the following code in R 2.12.0 (32-bit) on a Windows machine with a 3.00 GHz Core 2 Duo and 2 GB of RAM, it runs in less than one second. If I run it on a Unix box with sparc-sun-solaris2.10 (also 32-bit, though the box could run 64-bit), it takes 84 seconds. The clock speed of the Unix box is 2.5 GHz.

If I run top while the code is running, I notice that my R process only uses up to ~3.2% of the available CPU, even when more is available. Could this be part of the problem? I read the install manual, but nothing jumped out at me as an obvious solution. Is the Unix operating system somehow limiting available resources while Windows is not? Or is there some preferable way to compile R from source that was not done here? I apologize if I have not given enough information to answer the problem; this is not really my area of expertise.
t0 <- proc.time()[[3]]          # elapsed time at start
x <- rnorm(10000)
for (i in 1:10000) {
  sd(x)                         # recompute the standard deviation 10,000 times
}
print(proc.time()[[3]] - t0)    # elapsed seconds for the loop
Processors such as the T1 or T2 have a number of cores, and each core has a number of strands (hardware-level context switching). If you can run a multithreaded application, you'll get high aggregate throughput. A typical intended use case would be a Java-based web server processing e.g. 20-40 connections at the same time.
The downside of this type of processor is that the single-threaded performance of these SPARC chips is quite low. It looks like Oracle is aware of the issue; current development on the T4 focuses on improving single-threaded speed.
The T1 processor exposes 32 logical CPUs to the operating system. If that is your case, and the displayed value is the percentage of total computing power, then 1/32 ≈ 3.125%, which is close to what you saw.
To squeeze all the performance from a T1 processor, you need to make R use multiple CPUs, for example via the multicore package.
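A rough sketch of what that could look like for the loop in the question (the multicore package's mclapply() was later absorbed into base R's parallel package; the mc.cores value is just an assumption about how many strands you want to use):

library(multicore)
x <- rnorm(10000)
t0 <- proc.time()[[3]]
res <- mclapply(1:10000, function(i) sd(x), mc.cores = 8)  # spread the iterations over 8 strands
print(proc.time()[[3]] - t0)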