MPI IO Writing a file when offset is not known - mpi

I want to use MPI IO to write files.
The processes is in a while loop and it calls a function which generates random amount of data. I want to write this data to a single file.
How can I accomplish this?

It might help you:
http://social.microsoft.com/Forums/en-US/47b99479-37d9-43a4-b1e3-90efd3d52c0a/can-different-mpi-processes-write-data-in-one-file-with-different-offset?forum=windowshpcmpi
But maybe you ll need to pass a variable which contains the start point of every process..

in each iteration of your while loop, each process knows how much data will be written. Use MPI_SCAN to share that data, then MPI_File_write_at_all to write collectively:
incr = generate_random_data();
MPI_Scan(&incr, &new_offset, 1, MPI_LONG_LONG_INT,
MPI_SUM, MPI_COMM_WORLD);
new_offset -= incr;
ret = MPI_File_write_at_all(mpi_fh, new_offset, buf, count,
datatype, status);

Related

What is the rule behind instruction count in Intel PIN?

I wanted to count instructions in simple recursive fibo function O(2^n). I succeded to do so with bubble sort and matrix multiplication, but in this case it seemed like instruction count ignored my fibo function. Here is the code used for instrumentation:
// Insert a call at the entry point of a routine to increment the call count
RTN_InsertCall(rtn, IPOINT_BEFORE, (AFUNPTR)docount, IARG_PTR, &(rc->_rtnCount), IARG_END);
// For each instruction of the routine
for (INS ins = RTN_InsHead(rtn); INS_Valid(ins); ins = INS_Next(ins))
{
// Insert a call to docount to increment the instruction counter for this rtn
INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount, IARG_PTR, &(rc->_icount), IARG_END);
}
I started to wonder what's the difference between this program and the previous ones and my first thought was: here I'm not using an array.
This is what I realised after some manual tests:
a = 5; // instruction ignored by PIN and
// pretty much everything not using array
fibo[1] = 1 // instruction counted properly
a = fibo[1] // instruction ignored by PIN
So it seems like only instructions counted are writes to the memory (that's what I assume). After I changed my fibo function to this it works:
long fibonacciNumber(int n, long *fiboNumbers)
{
if (n < 2) {
fiboNumbers[n] = n;
return n;
}
fiboNumbers[n] = fiboNumbers[n-1] + fiboNumbers[n-2];
return fibonacciNumber(n - 1, fiboNumbers) + fibonacciNumber(n - 2, fiboNumbers);
}
But I would like to count instructions also for programs that aren't written by me. Is there a way to count all type of instrunctions? Is there any particular reason why only this instructions are counted? Any help appreciated.
//Edit
I used disassembly option in Visual Studio to check how it looks and it still makes no sense for me. I can't find the reason why only assingment to array is interpreted by PIN as instruction.
instruction_comparison
This exceeded all my expectations, counted as 2 instructions:
even 2 instructions, not one
PIN, like other low-level profiling and analysis tools, measures individual instructions, low-level orders like "add these two registers" or "load a value from that memory address". The sequence of instructions which a program comprises are generally produced from a high-level language like C++ through a compiler. An individual line of C++ code might be transformed into exactly one instruction, but it's also common for a line to translate to several instructions or even to zero instructions; and the instructions for a line of code may be interleaved with those of other instructions.
Your compiler can output an assembly-language file for your source code, showing what instructions were produced for which lines of code. (For GCC and Clang, this is done with the -S flag.) Note that reading the assembly code output from a compiler is not the best way to learn assembly. Also, I would point you to godbolt.org, a very convenient tool for analyzing assembly output.

Why does array += (without #.) produce so much memory allocation?

I don't understand why the += operation for arrays produces so much memory allocation, but it gets fixed when applying #.
function loop()
a = randn(10)
total = similar(a)
for i=1:1000
total += a
end
end
function loopdot()
a = randn(10)
total = similar(a)
for i=1:1000
#. total += a
end
end
loop()
loopdot()
Profile.clear_malloc_data()
loop()
loopdot()
produces
160000 total += a
and
0 #. total += a
total += a is the same as total = a + total which is a vectorized operation like:
out = similar(a)
for i in eachindex(a)
out[i] = total[i] + a[i]
end
total = out
since internally it's
total = +(total,a)
This is just like MATLAB, Python, or R and thus has a temporary array that is allocated for the vectorized operation, and then the = sets the reference of total to this new array. This is why vectorized operations are slow vs traditional low level loops and is one of the main reasons why using something like NumPy can be faster than Python but cannot fully reach C (because of these temporaries!).
#. total += a is the same as total .= total .+ a. This blog post explains that in Julia there is semantic dot fusion via anonymous functions, and thus corresponds to doing the following:
# Build an anonymous function for the fused operation
f! = (a,b,c) -> (a[i] = b[i] + c[i])
# Now loop it, since it's `.=` don't make a temporary
for i in eachindex(a)
f!(total,total,a)
end
which updates total in-place without creating a temporary array.
Fusion in Julia happens semantically: this conversion of dotted operations into an anonymous function plus a broadcast! call (which is essentially the loop I wrote there) is done at parsing time, and the anonymous function is compiled so that way this is efficient. This is very useful for other reasons as well. By overloading broadcast! on a generic f!, this is how things like GPUArrays.jl automatically build efficient single kernels which do in-place updates on the GPU. This is as opposed to MATLAB, Python, and R where different vectorized functions are considered as different function calls and thus must compute a return, hence the temporary array.

Is there a minimum string length for F() to be useful?

Is there a limit for short strings where using the F() macro brings more RAM overhead then saving?
For (a contrived) example:
Serial.print(F("\n"));
Serial.print(F("Hi"));
Serial.print(F("there!"));
Serial.print(F("How do you doyou how?"));
Would any one of those be more efficient without the F()?
I imagine it uses some RAM to iterate over the string and copy it from PROGMEM to RAM. I guess the question is: how much? Also, is heap fragmentation a concern here?
I'm looking at this purely from SRAM-conserving perspective.
From a purely SRAM-conserving perspective all of your examples are identical in that no SRAM is used. At run-time some RAM is used, but only momentarily on the stack. Keep in mind that calling println() (w/o any parameters) uses some stack/RAM.
For a single character it will take up less space in flash if a char is passed into print or println. For example:
Serial.print('\n');
The char will be in flash (not static RAM).
Using
Serial.print(F("\n"));
will create a string in flash memory that is two bytes long (newline char + null terminator) and will additionally pass a pointer to that string to print which is probably two bytes long.
Additionally at runtime, using the F macro will result in two fetches ('\n' and the null terminator) from flash. While fetches from flash are fast, passing in a char results in zero fetches from flash, which is a tiny bit faster.
I don't think there is any minimum size of the string to be useful. If you look at how the outputting is implemented in Print.cpp:
size_t Print::print(const __FlashStringHelper *ifsh)
{
PGM_P p = reinterpret_cast<PGM_P>(ifsh);
size_t n = 0;
while (1) {
unsigned char c = pgm_read_byte(p++);
if (c == 0) break;
n += write(c);
}
return n;
}
You can see from there that only one byte of RAM is used at a time (plus a couple of variables), as it pulls the string from PROGMEM a byte at a time. These are all on the stack so there is no ongoing overhead.
I imagine it uses some RAM to iterate over the string and copy it from PROGMEM to RAM. I guess the question is: how much?
No, it doesn't as I showed above. It outputs a byte at a time. There is no copying (in bulk) of the string into RAM first.
Also, is heap fragmentation a concern here?
No, the code does not use the heap.

Exit early on found in OpenCL

I'm trying to write an OpenCL implementation of memchr to help me learn how OpenCL works. What I'm planning to do is to assign each work item a chunk of memory to search. Then, inside each work item, it loops through the chunk searching for the character.
Especially if the buffer is large, I don't want the other threads to keep searching after an occurrence has already been found (assume there is only one occurrence of the character in any given buffer).
What I'm stuck on is how does a work item indicate, both to the host and other threads, when it has found the character?
Thanks,
One way you could do this is to use a global flag variable. You atomically set it to 1 when you find the value and other threads will check on that value when they are doing work.
For example:
__kernel test(__global int* buffer, __global volatile int* flag)
{
int tid = get_global_id(0);
int sx = get_global_size(0);
int i = tid;
while(buffer[i] != 8) //Whatever value we're trying to find.
{
int stop = atomic_add(&flag, 0); //Read the atomic value
if(stop)
break;
i = i + sx;
}
atomic_xchg(&flag, 1); //Set the atomic value
}
This might add more overhead than by just running the whole kernel (unless you are doing a lot of work on every iteration). In addition, this method won't work if each thread is just checking a single value in the array. Each thread must have multiple iterations of work.
Finally, I've seen instances where writing to an atomic variable doesn't immediately commit, so you need to check to see if this code will deadlock on your system because the write isn't committing.

Using memcpy to change a jnz to a jmp

Not used memcpy much but here's my code that doesn't work.
memcpy((PVOID)(enginebase+0x74C9D),(void *)0xEB,2);
(enginebase+0x74C9D) is a pointer location to the address of the bytes that I want to patch.
(void *)0xEB is the op code for the kind of jmp that I want.
Only problem is that this crashes the instant that the line tries to run, I don't know what I'm doing wrong, any incite?
The argument (void*)0xEB is saying to copy memory from address 0xEB; presumably you want something more like
unsigned char x = 0xEB;
memcpy((void*)(enginebase+0x74c9d), (void*)&x, 2);
in order to properly copy the value 0xEB to the destination. BTW, is 2 the right value to copy a single byte to program memory? Looks like it should be 1, since you're copying 1 byte. I'm also under the assumption that you can't just do
((char*)enginebase)[0x74c9d] = 0xEB;
for some reason? (I don't have any experience overwriting program memory intentionally)
memcpy() expect two pointers for the source and destination buffers. Your second argument is not a pointer but rather the data itself (it is the opcode of jnz, as you described it). If I understand correctly what you are trying to do, you should set an array with the opcode as its contetns, and provide memcpy() with the pointer to that array.
The program crashes b/c you try to reference a memory location out of your assigned space (address 0xEB).

Resources