As you know, MPI can run multiple processes even when there is only one processor with one core. Let's say I have a single dual-core processor. If I run a program with mpiexec.mpich -np 2 ./out, how can I be sure that the work was split between the two cores?
Probably the simplest way for you to confirm that you're running on both cores is to do something like a tight while loop that will spike the processor usage:
#include <mpi.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    /* spin forever so each rank pins a core at 100% */
    while (1) {}
}
Then you can look at your usage with something like top to make sure it's what you expect.
If you want fine-grained control over where your processes run, MPICH has options to let you do that. You can find all of the options on the wiki page. There are flags to let you bind to cores, hardware threads, sockets, etc.
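For example, with a reasonably recent MPICH (Hydra) you can request the binding directly on the command line; the exact flag names can vary between releases, so check the mpiexec help output or the wiki page:

mpiexec.mpich -np 2 -bind-to core ./out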
I would think this question has been asked thousands of times, but I simply cannot find many resources on the subject.
I would like to program my Arduino Uno (ATmega328P) using Atmel Studio and the C language, minus the Arduino Libraries. What I mean by this is that I would like to write code as follows:
#define F_CPU 16000000UL       /* Uno runs at 16 MHz; needed by util/delay.h */
#include <avr/io.h>
#include <util/delay.h>

#define BLINK_DELAY_MS 1000

int main(void) {
    /* set pin 5 of PORTB for output */
    DDRB |= _BV(DDB5);

    while (1) {
        /* set pin 5 high to turn led on */
        PORTB |= _BV(PORTB5);
        _delay_ms(BLINK_DELAY_MS);

        /* set pin 5 low to turn led off */
        PORTB &= ~_BV(PORTB5);
        _delay_ms(BLINK_DELAY_MS);
    }
}
Rather than code that is riddled with the oh so convenient Arduino functions. I want to get under the hood and get dirty with Arduinos!
That being said, I'm looking for any great learning sources that you may be able to offer so that I can expand my knowledge!
So far, the only somewhat useful source that I've managed to find is this page:
https://hekilledmywire.wordpress.com/2010/12/04/22/
However, the images are missing and it seems minimalistic anyway.
Provided you are familiar with C, I recommend that you
start with the AVR Libc reference
inspect iom328p.h for your processor specific definitions (located under ...\Atmel Toolchain\AVR8 GCC\Native\[#.#.####]\avr8-gnu-toolchain\avr\include\avr)
optionally, in Atmel Studio create a new ASF board project selecting device ATmega328p and inspect the sources loaded into your project folder from the "user_board" template (which anyway is a generic nearly empty set of *.h's providing space for things you may/may not need)
have the complete processor manual close to you at all times - the register and bit names found there match the definitions in the AVR libraries
Be aware that the libraries coming with Atmel Studio and the toolchain support the m328P, but the UNO board per se is not supported by the ASF. However, for basic programming you will be fine.
Adding some more detail on PORTB:
PORTB is defined in your processor-specific header iom328p.h (2nd bullet above), which is automatically included by <avr/io.h> when the correct processor is selected in Atmel Studio. In that header you find
#define PORTB _SFR_IO8(0x05)
Looking up the processor manual (4th bullet above, page 615) you see that PORTB is at I/O address 0x05 (q.e.d.). _SFR_IO8(..) itself is a macro defined in <avr/sfr_defs.h> that converts an I/O address into a memory address (yes, the lower registers are mapped twice, as I/O and as data memory; the memory address is 0x20 higher because the lowest memory addresses are occupied by R0 to R31).
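To see what that amounts to, the relevant macros in <avr/sfr_defs.h> expand roughly like this (simplified; the assembler case, where __SFR_OFFSET is 0, is left out):

#define __SFR_OFFSET 0x20                                /* I/O registers sit 0x20 into data space */
#define _MMIO_BYTE(mem_addr) (*(volatile uint8_t *)(mem_addr))
#define _SFR_IO8(io_addr) _MMIO_BYTE((io_addr) + __SFR_OFFSET)

/* so PORTB |= _BV(PORTB5); effectively becomes: */
/* (*(volatile uint8_t *)(0x05 + 0x20)) |= (1 << 5); */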
By including <avr/io.h> you get from the AVR library:
#include <avr/io.h>
// included by io.h
// #include <avr/sfr_defs.h>
// #include <avr/portpins.h>
// #include <avr/common.h>
// #include <avr/version.h>
// #include <avr/io(your_processor).h> via processor declaration ... fuses
// #include <avr/(maybe some more).h>
All these headers (and some more) finally let you program in C using the register/port/pin names you find in the processor manual.
There are some more useful libs, like
#include <stdint.h> // Type definitions, e.g. uint8_t
// #include "stdint-gcc.h"
#include <avr/power.h> // clock prescaler macro
#include <avr/interrupt.h> // interrupt macros
and you will find libs to support reading/writing from/to program (flash) memory, EEPROM, and so on.
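To give a feel for how these pieces fit together, here is a small, untested sketch (register and bit names as in the ATmega328P manual / iom328p.h) that blinks the Uno LED from a Timer1 compare interrupt instead of busy-waiting:

#include <avr/io.h>
#include <avr/interrupt.h>

ISR(TIMER1_COMPA_vect)
{
    PORTB ^= _BV(PORTB5);                         /* toggle pin 5 of PORTB (the LED) */
}

int main(void)
{
    DDRB |= _BV(DDB5);                            /* LED pin as output */

    TCCR1A = 0;                                   /* CTC mode, TOP = OCR1A */
    TCCR1B = _BV(WGM12) | _BV(CS12) | _BV(CS10);  /* prescaler 1024 */
    OCR1A  = 15624;                               /* 16 MHz / 1024 / 15625 = 1 Hz toggle */
    TIMSK1 = _BV(OCIE1A);                         /* enable compare-A interrupt */

    sei();                                        /* global interrupt enable */
    for (;;) { }                                  /* everything happens in the ISR */
}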
You can do what you want, but you are going to need a programmer like the Atmel-ICE, AVR Dragon, STK500 or an AVRISP mkII. There are a few more; Atmel has a number of programmers depending on your needs. There are also some 3rd-party programmers that are a lot cheaper. I have the STK500 and the Dragon, love them both, and they play nice with Atmel Studio 6.X.
A good learning resource is this book:
Make: AVR Programming By Elliot Williams.
What is the best way to expose code written with Qt to the scripting language Tcl? In the Qt code I use Qt data structures like QMap and QList rather than those in the STL, so SWIG may not recognize them, nor some other Qt macros.
You need a linking function with this signature:
extern "C" int theFunctionName(ClientData clientData, Tcl_Interp *interp,
int argc, char **argv)
which you register with Tcl_CreateCommand. (Or there's Tcl_CreateObjCommand, which uses a more efficient type management system, but in principle it's pretty similar.) This is what SWIG would construct for you, with lots of guidance, but it's not too hard to DIY and you'll probably end up with a better interface anyway. (SWIG's handling of enumerations and bit-sets is usually very unidiomatic from a Tcl perspective.)
That linking function has to decode the values in argv into the values that are passed to your Qt/C++ code, and then it uses Tcl_SetResult or Tcl_SetObjResult to store the result back in the Tcl_Interp context before returning TCL_OK or TCL_ERROR (for success or an exception).
I guess a QMap would look like a Tcl dictionary and a QList like a Tcl list, but the details depend on exactly what's going on (stuff like callbacks gets a little tricky, especially in threaded code!).
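To make that concrete, here is a rough, untested sketch using the Tcl_Obj API; computeValues and the command name are invented for illustration, and error handling is kept minimal:

#include <tcl.h>
#include <QList>

static QList<int> computeValues(int n)              /* stands in for your Qt/C++ code */
{
    QList<int> out;
    for (int i = 0; i < n; ++i)
        out.append(i * i);
    return out;
}

extern "C" int ComputeValuesCmd(ClientData, Tcl_Interp *interp,
                                int objc, Tcl_Obj *const objv[])
{
    int n;
    if (objc != 2) {
        Tcl_WrongNumArgs(interp, 1, objv, "n");
        return TCL_ERROR;
    }
    if (Tcl_GetIntFromObj(interp, objv[1], &n) != TCL_OK)
        return TCL_ERROR;

    /* QList<int> -> Tcl list */
    Tcl_Obj *result = Tcl_NewListObj(0, NULL);
    const QList<int> values = computeValues(n);
    for (int v : values)
        Tcl_ListObjAppendElement(interp, result, Tcl_NewIntObj(v));

    Tcl_SetObjResult(interp, result);
    return TCL_OK;
}

/* somewhere during interpreter setup: */
/* Tcl_CreateObjCommand(interp, "computeValues", ComputeValuesCmd, NULL, NULL); */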
I have a hybrid MPI/OpenMP code. I want to know the time spent in a particular function, let's say A, for every MPI process. This function is called inside OpenMP do/for loops, and also in a rather complicated way by various functions above it (i.e. some other functions, say B and C, may call A, and those calls may themselves be inside OpenMP do/for loops). I was planning to do it as follows:
#include <mpi.h>
#include <stdio.h>

extern int myRank;   /* assumed to be set once, right after MPI_Init */

double A()
{
    double time1, time2, result = 0.0;

    time1 = MPI_Wtime();
    // compute result...
    // Note: inside this function there are no OpenMP or MPI calls,
    // just pure computation of the result...
    time2 = MPI_Wtime();

    printf("myRank=%d timeSpent=%f\n", myRank, time2 - time1);
    return result;
}
Would the sum of all these times, per MPI process, be the total time spent in this function by that MPI process? If not, can you please show me how to get it correctly? Thanks!
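In other words, I want something equivalent to accumulating per thread and reducing at the end, roughly like this (untested sketch, names are just illustrative):

#include <stdio.h>
#include <omp.h>

static double timeInA = 0.0;
#pragma omp threadprivate(timeInA)   /* assumes a fixed OpenMP thread team (OMP_DYNAMIC=false) */

double A(void)
{
    double t0 = omp_get_wtime();     /* thread-safe timer, usable inside parallel regions */
    double result = 0.0;
    /* ... pure computation ... */
    timeInA += omp_get_wtime() - t0;
    return result;
}

void reportTimeInA(int myRank)
{
    double total = 0.0;
    #pragma omp parallel reduction(+:total)
    total += timeInA;                /* sum the per-thread accumulators */
    /* note: this sums time across threads, so it can exceed wall-clock time */
    printf("myRank=%d total time in A = %f s\n", myRank, total);
}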
We don't want to reinvent the wheel and we don't want to reinvent the MPI profiler. That would be hard.
There are very powerful tools available from the manufacturers of many cluster systems. For example, Cray machines usually come with CrayPat, which spits out magic.
Additionally, there is free software such as mpiP: http://mpip.sourceforge.net/
I would recommend not reinventing the wheel but using some professional-grade software already built for profiling, such as TAU, mpiP, or gprof...
Here's a decent presentation to get you started.
I know that there is no way to use std classes such as string, vector, map or set in a CUDA kernel. However, it's very uncomfortable without them. I have to write a lot of code in CUDA kernels, so I would like to use at least strings and vectors. I'm not talking about something like thrust. I want to be able to write something like this:
__global__ void kernel()
{
    cuda_vector<int> a;
    for (int i = 0; i < 10; i++)
        a.push_back(i);
}

int main()
{
    kernel<<<1, 512>>>();
    return 0;
}
This should create 512 threads, and in each thread I want to create a cuda_vector object and use it like std::vector. I didn't find any solution on the internet, so I started to write my own class. Each function of this class is defined as both a __host__ and __device__ function so that I can use it on both the CPU and GPU.
Theoretically, it can be implemented, but only on the Fermi architecture, because we need to allocate memory dynamically on the device. I have a GTX 580 and have started to write my own vector, but it's tiring and takes a lot of time. Isn't there any implementation I can use? I can't believe that there isn't any. Do so many software developers write CUDA code without it? And has no one tried to write their own version?
The reason you don't find something like std::vector for CUDA is performance. A traditional vector object doesn't fit well with the CUDA model. If you are planning on using only 512 threads and each one manages a std::vector-like object, your performance is going to be worse than running the same code on the CPU.
GPU threads are not like CPU threads; they should be as light as possible. Use thread blocks and shared memory to have the threads cooperate. If you are manipulating a string, each thread should work on one character; if you are using vectors on the CPU, pass an array of the data to the GPU and have each thread work on one element. Basically, think about how to solve the problem with the CUDA programming model as opposed to solving it with a CPU approach and then translating it to CUDA.
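To illustrate that pattern, here is a minimal, untested sketch (the data and sizes are just for illustration): the host keeps the std::vector, only its raw data goes to the GPU, and each thread touches exactly one element:

#include <cstdio>
#include <vector>

__global__ void squareEach(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= data[i];                        // one thread, one element
}

int main()
{
    const int n = 512;
    std::vector<int> host(n);
    for (int i = 0; i < n; ++i) host[i] = i;

    int *dev = NULL;
    cudaMalloc((void**)&dev, n * sizeof(int));
    cudaMemcpy(dev, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    squareEach<<<(n + 255) / 256, 256>>>(dev, n);

    cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("host[10] = %d\n", host[10]);           // expect 100
    return 0;
}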
I've not used it, but the CuPP framework may be of interest to you, especially the vector<T> implementation. Looks like it could do what you need it to do.
Is there a way to use C style function pointers in OpenCL?
In other words, I'd like to fill out an OpenCL struct with several values, as well as pointers to an OpenCL function. I'm not talking about going from a CPU function to a GPU function; I'm talking about going from a GPU function to a GPU function.
Is this possible?
--- EDIT ---
If not, is there a way around this? In CUDA we have object inheritance, and in 4.0 we even have virtual functions. About the only way I can find to implement runtime dispatch like this is to resort to if statements, and that will get ugly really fast.
From the OpenCL 1.1 specification:
Section 6.8 (Restrictions) (a):
The use of pointers is somewhat restricted. The following rules apply:
Arguments to kernel functions declared in a program that are pointers must be declared with the __global, __constant or __local qualifier.
A pointer declared with the __constant, __local or __global qualifier can only be assigned to a pointer declared with the __constant, __local or __global qualifier, respectively.
Pointers to functions are not allowed.
The usual workaround I use for this is macros. Evil, but currently inescapable. So I typically end up with something like:
#define FEATURENAME_START impl1_start
#define FEATURENAME_END impl1_end
I then either inject this into the kernel source at compilation time or pass it as an argument to the OpenCL compiler. It's not quite runtime dispatch in the usual sense, but it can still be chosen at runtime from the host's perspective, even if not the device's.
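To sketch what that looks like (identifiers are made up, and program/device creation is assumed to have happened in the usual way):

#include <stdio.h>
#include <CL/cl.h>

/* Kernel source: the "function pointer" FEATURENAME_START is resolved by the
   OpenCL compiler, not on the device at run time. */
static const char *source =
    "void impl1_start(__global float *d) { d[get_global_id(0)] = 1.0f; }\n"
    "void impl2_start(__global float *d) { d[get_global_id(0)] = 2.0f; }\n"
    "__kernel void run(__global float *d) { FEATURENAME_START(d); }\n";

/* Host side: pick the implementation when building the program.
   `program` (created from `source`) and `device` are assumed to exist already. */
static cl_int buildWithImpl(cl_program program, cl_device_id device,
                            const char *implName)
{
    char options[128];
    snprintf(options, sizeof options, "-DFEATURENAME_START=%s", implName);
    return clBuildProgram(program, 1, &device, options, NULL, NULL);
}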
AMD has future plans for hardware support for this, so there may be future extensions for them.