R Parallel Processing with Xeon Phi, minimal code changes?

Looking at buying a couple Xeon Phi 5110P, but trying to estimate how much code I have to change or other software needed.
Currently I make good use of R on a multi-core Windows machine (24 cores) by using the foreach package, passing it other packages (forecast, glmnet, etc.) to do my parallel processing.
With a Xeon Phi, I understand I would want to recompile R, per
https://software.intel.com/en-us/articles/running-r-with-support-for-intel-xeon-phi-coprocessors , and I understand this could be done with a trial version of Parallel Studio XE.
Do I then need to edit R's Makeconf file, adding the C/C++ flags for the Phi, and compile all the needed packages before the Parallel Studio trial expires? Or do I not need to edit the Makeconf to get the benefits of foreach on the Phi?
Seems like some of this will be handled automatically once R is compiled, with offloading done by the Math Kernel Library (MKL), but I'm not totally sure of this.
Somewhat related question: Is the Intel Xeon Phi usable without a costly Intel Compiler?
Also, revolutionanalytics.com seems to have a few related blog posts, but they are not entirely conclusive for me: http://blog.revolutionanalytics.com/2015/05/behold-the-power-of-parallel.html

If all you need is matrix operations, you can compile R with the MKL libraries per Running R with Support for Intel® Xeon Phi™ Coprocessors, which requires the Intel compiler. Microsoft R comes precompiled with MKL, but I was not able to get automatic offload to work with it; I had to compile R with the Intel compiler for it to work properly.
You could compile R with the trial version of the compiler during the trial period to see if it fits your purpose.
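For what it's worth, MKL's automatic offload is driven by environment variables rather than R code. Below is a minimal sketch, assuming an MKL-linked R build on a host with a Phi and MKL Automatic Offload support; the variable names (MKL_MIC_ENABLE, OFFLOAD_REPORT) come from Intel's MKL documentation, and in practice you would usually export them in the shell before starting R rather than set them from inside a session:
Sys.setenv(MKL_MIC_ENABLE = "1")   # ask MKL to offload eligible BLAS calls to the Phi
Sys.setenv(OFFLOAD_REPORT = "2")   # print per-call offload statistics for verification
a <- matrix(rnorm(4000 * 4000), nrow = 4000)
b <- matrix(rnorm(4000 * 4000), nrow = 4000)
system.time(a %*% b)               # large DGEMM calls are the typical offload candidates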
If you want to use something like the foreach package by setting up a cluster, I'm afraid you're out of luck, since each Phi node is a Linux computer. On page 3 of the R Installation and Administration manual (R-Admin) it says:
Cross-building is not possible: installing R builds a minimal version of R and then runs many R scripts to complete the build.
You would have to cross-compile from the Xeon host for the Xeon Phi nodes with the Intel compiler, and that is just not feasible.
The last way to utilize the Phi is to rewrite your code to call it directly. Rcpp provides an easy interface to C and C++ routines; if you find a C routine that runs well on the Phi, you can call it from within your R code. I have done this with CUDA: Rcpp is a thin layer, there are good examples of how to use it, and if you combine those with examples of calling the Phi card you can probably achieve your goal with less overhead.
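As an illustration of the Rcpp route, here is a generic sketch (this is plain host-side C++, not Phi offload code, and the function name is made up for the example); a real Phi workflow would replace the body with offload-capable C/C++ built by the Intel compiler:
library(Rcpp)

# compile a small C++ routine and expose it to R as an ordinary function
cppFunction('
double sumSquares(NumericVector x) {
  double total = 0.0;
  for (int i = 0; i < x.size(); ++i) total += x[i] * x[i];
  return total;
}
')

sumSquares(rnorm(1e6))   # called from R like any other function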
But if all you need is matrix ops, there is no quicker route to supercomputing than a good double-precision NVIDIA card and preloading nvBLAS during R startup.

Related

Using nvBLAS in R on Windows?

I am having trouble getting nvBLAS to work in R. I'm using RStudio on a Windows 10 machine, and I have no idea how to link nvBLAS and the original Rblas together for R to start up with both. From the nvBLAS documentation:
To use the NVBLAS Library, the user application must be relinked
against NVBLAS in addition to the original CPU Blas (technically only
NVBLAS is needed unless some BLAS routines not supported by NVBLAS are
used by the application). To be sure that the linker links against the
exposed symbols of NVBLAS and not the ones from the CPU Blas, the
NVBLAS Library needs to be put before the CPU Blas on the linkage
command line.
How exactly do I do this in Windows? Caveat, I am a pretty solid R user, but I am by no means an R expert or a computer scientist. I would ideally like to avoid using an Ubuntu-build for this as well.

Is there an R command to make Keras (TensorFlow-GPU) run on the CPU?

I'm running Keras in R and using Tensorflow-GPU backend. Is it possible to force Keras to run on CPU without re-installing the backend?
Let me give you 2 answers.
Answer #1 (normal answer)
No, unfortunately not. For keras, CPU and GPU are two different versions, one of which you select at install time.
It seems you remember selecting GPU at install time. I guess you're hoping that you were only setting a minor option rather than choosing a version of the program; unfortunately, you were selecting which version of keras to install.
Answer #2 (ok, maybe you can "trick" keras)
It seems you can use environment variable values to trick keras into thinking that your CPU is your GPU.
This seems like it may have unexpected results, but it seemed to work for these Python users.
I wouldn't worry about the fact that they are using Python. They are just using their language to set environment variables, so you can do the same in R or directly within your OS.
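For example, in R the trick boils down to hiding the GPU from TensorFlow before the backend initialises. This is a sketch assuming the TensorFlow backend honours the standard CUDA_VISIBLE_DEVICES variable (it must be set before keras/tensorflow is first used in the session):
Sys.setenv(CUDA_VISIBLE_DEVICES = "-1")   # hide all GPUs from TensorFlow

library(keras)

# with no visible GPU, the GPU build of TensorFlow falls back to the CPU
model <- keras_model_sequential() %>%
  layer_dense(units = 8, activation = "relu", input_shape = c(4)) %>%
  layer_dense(units = 1)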

Why is Intel Haswell XEON CPU sporadically miscomputing FFTs and ART?

During the last few days I observed behaviour of my new workstation that I couldn't explain. Doing some research on this problem, I found that there might be a bug in the Intel Haswell architecture as well as in the current Skylake generation.
Before writing about the possible bug, let me give you an overview of the hardware used, the program code and the problem itself.
Workstation hardware specification
INTEL Xeon E5-2680 V3 2500MHz 30M Cache 12Core
Supermicro SC745 BTQ -R1K28B-SQ
4 x 32GB ECC Registered DDR4-2133 Ram
INTEL SSD 730 Series 480 GB
NVIDIA Tesla C2075
NVIDIA TITAN
Operating system & program code in question
I'm currently running Ubuntu 15.04 64bit Desktop version, latest updates and kernel stuff installed. Besides using this machine to develop CUDA Kernels and stuff, I recently tested a pure C program.
The program does a sort of modified ART on quite large input data sets. The code executes some FFTs and takes quite some time to finish its calculation. I can't currently post / link to any source
code, as this is ongoing research that cannot be published. If you're not familiar with ART, here is a simple explanation of what it does: ART is a technique used to reconstruct the data received from a computed tomography machine into
visible images for diagnosis. Our version of the code reconstructs data sets of sizes like 2048x2048x512. Up until now, nothing too special nor rocket science involved. After some hours of debugging and fixing errors, the code was tested
against reference results and we can confirm that it works as it is supposed to. The only library the code uses is the standard math.h. There are no special compile parameters and no additional libraries that might bring in additional problems.
Observing the problem
The code implements ART using a technique to minimize the projections needed for reconstructing the data. So let's assume we can reconstruct one slice of data involving 25 projections. The code is started with exactly the same input data on 12 cores. Please note that the
implementation is not based on multithreading; currently 12 instances of the program are launched. I know this isn't the best way to do it, proper thread management is heavily advised, and this is already on the list of improvements :)
So when we run at least two instances of the program (each instance working on a separate data slice), the results of some projections are wrong in a random fashion. To give you an idea of the results, please see Table 1. Please note that the input data is always the same.
Running only one instance of the code, involving one core of the CPU, the results are all correct; even over several runs on one CPU core, the results remain correct. Only involving two or more cores generates the result pattern seen in Table 1.
Identifying the problem
Okay, this took quite some hours to get an idea of what was actually going wrong. We went through the whole code; most problems like this begin with a minor implementation mistake. But, well, no (of course we cannot prove the absence of bugs nor guarantee it). To verify our code, we used two different machines:
(Machine1) Intel Core i5 Quad-Core (Model from late 2009)
(Machine2) Virtual Machine running on Intel XEON 6core SandyBridge CPU
Surprisingly, both Machine1 and Machine2 always produce correct results. Even using all CPU cores, the results remain correct; not even one wrong result in over 50 runs on each machine. Code was compiled on every target machine without optimization options or any specific compiler settings.
So, reading the news led to the following findings:
Ars Technica - Skylake CPU freezes during complex workloads
PCWorld - How to test your PC for the Skylake bug
Intel Community - Simple instructions for freezing a Skylake processor
So the folks over at Prime95 and the Mersenne community seem to have been the first to discover and identify this nasty bug. The referenced postings and news support the suspicion that the problem only exists under heavy workload. Following my observation, I can confirm this behavior.
The question(s)
Have you / the community observed this problem on Haswell CPUs as well as on Skylake CPUs?
Since gcc performs AVX(2) optimization by default (whenever possible), would turning off this optimization help?
How can I compile my code and ensure that any optimization that might be affected by this bug is turned off? So far I have only read about a problem with the AVX2 instruction set in Haswell / Skylake architectures.
Solutions?
Okay, I can turn off all AVX2 optimizations, but this slows down my code. Intel might release a BIOS update to mainboard manufacturers that would modify the microcode in Intel CPUs. As it seems to be a hardware bug, this might be fixable by updating the CPU's microcode. I think it might be a valid option, as Intel CPUs use RISC-to-CISC translation mechanisms controlled by microcode.
EDIT: TechReport.com - Errata prompts Intel to disable TSX in Haswell, early Broadwell CPUs. I will check the microcode version in my CPU.
EDIT2: As of now (19.01.2016 15:39 CET) Memtest86+ v4.20 is running and testing the memory. As this seems to take quite some time to finish, I'll update the post tomorrow with results.
EDIT3: As of now (21.01.2016 09:35 CET) Memtest86+ finished two runs and passed. Not even one memory error. Updated the microcode of the CPU from revision 0x2d to revision 0x36. Currently preparing the source code for release here. The problem with the wrong results persists. As I'm not the author of the code in question, I have to double check that I don't post code I'm not allowed to. I'm also using the workstation and maintaining it.
EDIT4: (22.01.2016) (12:15 CET) Here is the Makefile used to compile the source code:
# VARIABLES ==================================================================
CC = gcc
CFLAGS = --std=c99 -Wall
#LDFLAGS = -lm -lgomp -fast -s -m64
LDFLAGS = -lm
OBJ = ArtReconstruction2Min.o

# RULES AND DEPENDENCIES =====================================================
# linking all object files
all: $(OBJ)
	$(CC) -o ART2Min $(OBJ) $(LDFLAGS)

# every .o file depends on the corresponding .c file; the -g option adds debugging information
%.o: %.c
	$(CC) -c -g $< $(CFLAGS)

# MAKE CLEAN =================================================================
clean:
	rm -f *.o
	rm -f ART2Min
and the gcc -v output:
gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.9/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.9.2-10ubuntu13' --with-bugurl=file:///usr/share/doc/gcc-4.9/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.9 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.9 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.9-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.9-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.9-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.9.2 (Ubuntu 4.9.2-10ubuntu13)
EDIT: Problem solved. I have to shout out a huge sorry to the community and a big thank you for your hints. Sorry to user anonymous, who seems to be involved in kernel development. What happened? We spent another two days debugging and fiddling around with the program code. No implementation problems were found. BUT: the main code involves another helper program, which calculates weights for the ART algorithm on demand. After debugging and testing, it turned out that this helper program messed up when running at least 4 processes. So this was NOT a kernel / hardware problem, but a software (memory access) problem.
Lessons learned:
Debug every tool that is involved in the calculation process.
Microcode was outdated. SuperMicro has been informed about this.
Ubuntu 15.04 possibly needs additional tools so that all cores of the CPU run at full speed. I achieved this by installing Ubuntu 14.04, after which all cores run at 2.5 GHz.
I need to buy some beers if we ever meet up at a conference.
So after three days of thinking, testing and fiddling around with the machine, I discovered the following observations today:
Ubuntu 15.04 runs the CPU at 420 - 650 MHz per core. Okay, I thought this was an energy-saving option, so I followed various guides to set the speed to the maximum (2.50 GHz). It didn't work. Checked with cpufreq-utils.
Results still remained wrong after several tests on this machine. Other (i5, i7, Xeon) machines produced correct results.
I read that other users experienced issues with Ubuntu 15.04 and the CPU frequency, so I decided to plug in an SSD and install Ubuntu 14.04. Checked again what the CPU frequency is now, and it showed 2.50 GHz as I expected.
Again started the reconstruction algorithm (which was now 4-5 times faster than on Ubuntu 15.04) and waited for the results. Okay. The results are correct now! I double checked, started 9 processes and compared results. Still correct.
So I can only assume that there might be a problem in the Ubuntu 15.04 kernel using SpeedStep on this CPU: the CPU in 15.04 ran all the time between 420 - 650 MHz, while the minimum CPU speed is expected to be 1.20 GHz and the maximum 3.30 GHz. If somebody wants to check, I can offer the source code and example data leading to this problem.
Sorry for suspecting this to be a CPU bug.
EDIT: after some more testing, the problem is only solved for some scenarios but not yet for all. I'll do more testing.
The Skylake-S/U Prime95 erratum is in the AVX (not AVX2) unit. It is fixed in microcode 0x56 (probably) and 0x6a (for sure). Such an erratum in Haswell is unlikely, but possible (especially on post-2014 Intel, where "validation" became an unwelcome cost instead of a tenet of quality).
Haswell has errata linked to the AVX unit, although HSE58 is rather unlikely to be at play (it only slows down the AVX unit). However, do try to place a few MFENCE instructions before the AVX2 computations. If this fixes it, report back immediately: it would mean we need to MFENCE every IRET in the kernel (HSE105).
Your processor has signature 0x306f2. Ensure you have microcode revision 0x36 or later; this microcode is in Intel's "Linux microcode update pack" from 2015-11-06.
EDIT: this wasn't really an answer, so I should have made it a comment instead; I apologise. Since the microcode update was not sufficient to fix the issue, it could still be a new erratum, an old but not-yet-worked-around erratum, or something else entirely (such as a code bug or a gcc code-generation bug).

How to profile an openmp code natively on Intel MIC?

I have an OpenMP code written in C. I executed the code on an Intel MIC on Stampede. I want to profile the code to find the hotspots so that it will be easier for me to optimize it further. I tried to use the profiler gprof, but I read somewhere that gprof cannot be used on the MIC directly. I tried to use perf by going through a tutorial. I could get to a certain step, after which, at the perf annotate step, executing the code gives me the error ")" unexpected. So I don't know how to proceed with profiling my code. Can anybody please help?
This is the site where I referred to the perf tutorial : sandsoftwaresound.net/perf/perf-tutorial-hot-spots/ .
80% of optimization for the Xeon Phi is the same as for the host (Xeon). Use gprof, printf, compiler options, and the rest of your toolkit, and carry your optimization as far as you can while executing your code on the host only. After you can do no more, then focus on specific Xeon Phi optimizations.
As you are on Stampede, I assume you are using the Intel compiler. The compiler has a lot of diagnostic capabilities to profile your code and even provide suggestions. I'd provide you with more specific URLs but am on vacation with limited bandwidth.
Though this isn't specific to your question, here are some other suggestions. If you aren't already using the Intel compiler, you'll most likely get a substantial boost from it; Intel compilers are danged good at optimizations, especially on Intel architectures. Also, you should use Intel MKL where possible. All of MKL's routines are optimized for the different IA architectures, and the ones most relevant to HPC are optimized specifically for the MIC.
You have a few options.
The heavyweight approach is to use Intel Vtune. Firstly add -g to your compiler flags.
I use VTune from the host command line quite a bit; here is the command I use to profile an application on the MIC. (This is executed on the host machine; VTune on the host uses ssh to launch the application on the MIC.)
amplxe-cl -collect knc-hotspots -source-search-dir=/mysrc/dir -search-dir=/mybin/dir -- ssh mic0 /home/me/myapp
This assumes the app on the MIC is at /home/me/myapp, and that the source directory and binary search directory are on the host. (With VTune update 15 at least, I need to specify both of these separately in order to get the VTune GUI to show me symbol info.)
Once your app has finished, run the Vtune GUI on the host with amplxe-gui and open your result set.
There are also some simplified open-source profiling tools developed by Intel that support the MIC, Speedometer and Overhead; you can find information about them here.
Hopefully this is enough info to get you started.

Running a loop in parallel using multiple cores in R [duplicate]

I have a quad-core laptop running Windows XP, but looking at Task Manager R only ever seems to use one processor at a time. How can I make R use all four processors and speed up my R programs?
I have a basic system I use where I parallelize my programs on the "for" loops. This method is simple once you understand what needs to be done. It only works for local computing, but that seems to be what you're after.
You'll need these libraries installed:
library("parallel")
library("foreach")
library("doParallel")
First you need to create your computing cluster. I usually do other stuff while running parallel programs, so I like to leave one core free. The "detectCores" function will return the number of cores in your computer.
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl, cores = detectCores() - 1)
Next, call your for loop with the "foreach" command, along with the %dopar% operator. I always use a "try" wrapper to make sure that any iterations where the operations fail are discarded, and don't disrupt the otherwise good data. You will need to specify the ".combine" parameter, and pass any necessary packages into the loop. Note that "i" is defined with an equals sign, not an "in" operator!
data <- foreach(i = 1:length(filenames), .packages = c("ncdf", "chron", "stats"),
                .combine = rbind) %dopar% {
  try({
    # your operations; line 1...
    # your operations; line 2...
    # your output
  })
}
Once you're done, clean up with:
stopCluster(cl)
The CRAN Task View on High-Performance Computing with R lists several options. XP is a restriction, but you can still get something like snow working with sockets within minutes.
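A minimal socket-cluster sketch with snow, which should work on XP (these are snow's standard functions; adjust the worker count to your machine):
library(snow)

cl <- makeCluster(4, type = "SOCK")       # four local workers over sockets
clusterApply(cl, 1:4, function(i) i^2)    # run a function on each worker
stopCluster(cl)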
As of version 2.15, R now comes with native support for multi-core computations. Just load the parallel package
library("parallel")
and check out the associated vignette
vignette("parallel")
I hear tell that Revolution R supports better multi-threading than the typical CRAN version of R, and Revolution also supports 64-bit R on Windows. I have been considering buying a copy, but I found their pricing opaque. There's no price list on their web site. Very odd.
I believe the multicore package works on XP. It gives some basic multi-process capability, especially by offering a drop-in replacement for lapply() and a simple way to evaluate an expression in a new process (mcparallel()).
On Windows I believe the best way to do this would probably be with foreach and snow as David Smith said.
However, Unix/Linux-based systems can compute using multiple processes with the 'multicore' package. It provides a high-level function, 'mclapply', that performs a parallel lapply across multiple cores. An advantage of the 'multicore' package is that each worker gets a private copy of the Global Environment that it may modify. Initially, this copy is just a pointer to the Global Environment, making the sharing of variables extremely quick if the Global Environment is treated as read-only.
Rmpi requires that the data be explicitly transferred between R processes instead of working with the 'multicore' closure approach.
-- Dan
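For illustration, a minimal 'multicore'-style sketch; mclapply now ships in the base parallel package, and mc.cores greater than 1 only works on POSIX systems:
library(parallel)

# forked workers share the parent's workspace copy-on-write
results <- mclapply(1:8, function(i) sum(rnorm(1e6)), mc.cores = 4)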
If you do a lot of matrix operations and you are using Windows, you can install revolutionanalytics.com/revolution-r-open for free; it comes with the Intel MKL libraries, which allow you to do multithreaded matrix operations. On Windows, if you take the libiomp5md.dll, Rblas.dll and Rlapack.dll files from that install and overwrite the ones in whatever R version you like to use, you'll have multithreaded matrix operations (typically a 10-20x speedup for matrix operations). Or you can use the Atlas Rblas.dll from prs.ism.ac.jp/~nakama/SurviveGotoBLAS2/binary/windows/x64, which also works on 64-bit R and is almost as fast as the MKL one. I found this the single easiest thing to do to drastically increase R's performance on Windows systems. Not sure why they don't come as standard on R Windows installs, in fact.
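A quick, informal way to check that the swapped-in BLAS is actually being picked up is to time a large matrix product before and after the swap (the sizes here are arbitrary):
m <- matrix(rnorm(4e6), nrow = 2000)
system.time(crossprod(m))   # should drop dramatically with a multithreaded BLAS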
On Windows, multithreading unfortunately is not well supported in R (unless you use OpenMP via Rcpp), and the available socket-based parallelization on Windows systems, e.g. via the package parallel, is very inefficient. On POSIX systems things are better, as you can use forking there (the multicore package is, I believe, the most efficient one). You could also try the package Rdsm for multithreading within a shared-memory model. I've got a version on my GitHub with the unix-only flag removed, which should also work on Windows (earlier, Windows wasn't supported because the dependency bigmemory supposedly didn't work on Windows, but now it seems it does):
library(devtools)
devtools::install_github('tomwenseleers/Rdsm')
library(Rdsm)
