Access Path in Zero-Copy in OpenCL

I am a little bit confused about how exactly zero-copy works.
1- I want to confirm that the following corresponds to zero-copy in OpenCL.
Case 1: the CPU directly accesses a memory object in GPU RAM
        (copies c2 and c3, marked X, are avoided)

  .........................
  .                       .
  .  SYSTEM        CPU    .
  .  RAM         c3 X     .
  .    <=====>            .
  ....|....................
      |          /
 c2 X |         /  PCI-E: the CPU reads/writes the memory
      |        /   object in GPU RAM directly, so copies
  ....|......./............    c3 and c2 are avoided (X)
  .  MEMORY <====>        .
  .  OBJECT    c1         .
  .                GPU    .
  .  GPU RAM              .
  .........................

Case 2: the GPU directly accesses a memory object in system RAM
        (copies c1 and c2, marked X, are avoided)

  .........................
  .                       .
  .  SYSTEM        CPU    .
  .  RAM       c3         .
  .   MEMORY <====>       .
  ....|.OBJECT.............
      |          \
 c2 X |           \  PCI-E: the GPU reads/writes the memory
      |            \   object in system RAM directly, so copies
  ....|.............\......    c2 and c1 are avoided (X)
  .   |                   .
  .    <=======>          .
  .  GPU     c1 X  GPU    .
  .  RAM                  .
  .........................
The GPU/CPU is accessing system/GPU RAM directly, without an explicit copy.
2- What is the advantage of having this? PCIe still limits the overall bandwidth.
Or is the only advantage that we can avoid copies c2 and c1/c3 in the situations above?

You are correct in your understanding of how zero-copy works. The basic premise is that you can access either the host memory from the device, or the device memory from the host without needing to do an intermediate buffering step in between.
You can perform zero-copy by creating buffers with the following flags:
CL_MEM_AMD_PERSISTENT_MEM //Device-Resident Memory
CL_MEM_ALLOC_HOST_PTR // Host-Resident Memory
Then, the buffers can be accessed using memory mapping semantics:
void* p = clEnqueueMapBuffer(queue, buffer, CL_TRUE, CL_MAP_WRITE, 0, size, 0, NULL, NULL, &err);
//Perform writes to the buffer p
err = clEnqueueUnmapMemObject(queue, buffer, p, 0, NULL, NULL);
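Putting those pieces together, a minimal sketch of the host-resident path might look like this (context, queue, src, and size are assumed to already exist inside some function; error checking trimmed):
cl_int err;
// Allocate a buffer that lives in host-visible memory (zero-copy candidate)
cl_mem buffer = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                               size, NULL, &err);
// Blocking map: returns a host pointer into the buffer's storage
void *p = clEnqueueMapBuffer(queue, buffer, CL_TRUE, CL_MAP_WRITE,
                             0, size, 0, NULL, NULL, &err);
memcpy(p, src, size); // write the payload directly into the mapped region
// Unmap so the device sees a consistent view of the data
err = clEnqueueUnmapMemObject(queue, buffer, p, 0, NULL, NULL);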
Using zero-copy you may be able to achieve better performance than an implementation that does the following:
1. Copy a file into a host buffer
2. Copy the buffer to the device
Instead you can do it all in one pass:
1. Memory-map the device-side buffer
2. Copy the file from host to device
3. Unmap the memory
On some implementations, the map and unmap calls can hide the cost of the data transfer. As in our example:
1. Memory-map the device-side buffer [actually creates a host-side buffer of the same size]
2. Copy the file from host to device [actually writes to that host-side buffer]
3. Unmap the memory [actually copies the data from the host buffer to the device buffer via clEnqueueWriteBuffer]
If the implementation behaves this way, there is no benefit to using the mapping approach. However, AMD's newer OpenCL drivers allow the data to be written directly, making the cost of mapping and unmapping almost zero. For discrete graphics cards the requests still travel over the PCIe bus, so data transfers can be slow.
In the case of an APU architecture, however, zero-copy semantics can greatly increase transfer speeds thanks to the APU's unified design, in which the PCIe bus is replaced by the Unified North Bridge (UNB) that allows for faster transfers.
BE AWARE that when using zero-copy semantics with memory mapping, you will see absolutely horrendous bandwidth when reading a device-side buffer from the host. This bandwidth is on the order of 0.01 Gb/s and can easily become a new bottleneck for your code.
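If you do need to move device-resident data back to the host on such a platform, an explicit read is usually the safer route; a minimal sketch (queue, buffer, host_dst, and size assumed to exist):
// Blocking read into an ordinary host array instead of reading through a map
cl_int err = clEnqueueReadBuffer(queue, buffer, CL_TRUE, 0, size,
                                 host_dst, 0, NULL, NULL);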
Sorry if this is too much information. This was my thesis topic.

Related

Assembly code . = 60^. in low.s file of UNIX V6 source code

The source code of UNIX V6 is available, and there is a book on it by J. Lions. From the book I know that the "." symbol means the current location. I do not understand the following:
"An assignment statement of the form
identifier = expression
associates a value and type with the identifier. In the example
. = 60^.
the operator '^' delivers the value of the first operand and the type of the second operand (in this case, 'location')."
The statement can be found in file low.s (0526). What does it mean? Does it actually change the PC register value and behave as a jump instruction? I know it is old code, but I want to understand it. Thank you.
In the 6th edition assembler, . is the location counter, an offset from the beginning of a segment (text, data, or bss). When the assembler starts processing a file, . in each segment is 0, and is incremented either by assignment to . or by the presence of data or instruction statements.
The statement . = 60^. means to take the value 60 (in octal), cast it to the type of the location counter (in this case, type data), and assign it to the location counter. You'll see several statements like this in low.s in the area where interrupt vectors are set up.
When the link editor combines multiple object files together, their text, data, and bss sections are concatenated (except for COMMON data, which gets allocated just once) and any references (such as labels) to instructions or data will be relocated appropriately.
Building the Unix kernel requires an extra step to make sure data meant to be in low memory get loaded at the proper address. After low.s and the rest of the Unix kernel object files have been linked together, sysfix is run to make the data section have a load address of 0, and to relocate all data references appropriately. So that . = 60^. statement has effectively set the location counter to physical address 60.

How to measure the amount of memory or RAM consumed by a code on Arduino Mega or Due

Can anybody tell me how to measure the RAM consumed by a particular piece of code running on an Arduino Mega or Due?
There are two kinds of numbers to this question:
global static usage and current run-time usage.
The estimated static usage can be determined by adding the following line (if it does not already exist) to
.\arduino-1.5.5\hardware\arduino\avr\boards.txt
uno.upload.maximum_ram_size=2048
This then allows the IDE to output the additional second line, as in the following example in its result window:
Binary sketch size: 25,880 bytes (of a 32,256 byte maximum)
Estimated used SRAM memory: 990 bytes (of a 2048 byte maximum)
To see the amount of memory used at any given point, including memory that exists only while inside functions and members (the stack) as well as the heap, I use the MemoryFree library at specific points in the code to reveal the high-water mark. Its readme explains how to save RAM that is unnecessarily/unintentionally used by prints.
Note: while the original Arduino IDE 1.0.5's boards.txt file does contain these ram_sizes, it does not actually display usage; the original Arduino IDE 1.5.5 does, as does Arduino ERW 1.0.5 (a non-supported fork).
In my Arduino IDE 2.1.0
I edited the file /usr/share/arduino/hardware/arduino/boards.txt,
but the second line did not appear.
After reading:
check-ram-memory-usage-arduino-optimization
measuring-free-memory
I tried:
"Show verbose output during compilation"
and ran avr-size /tmp/build4042914391435450796.tmp/XXXXXXX.cpp.elf
which gave me the memory used.
Best regards!
int freeRam () {
  extern int __heap_start, *__brkval;
  int v;
  // Free RAM is the gap between the top of the stack (address of the local
  // variable v) and the end of the heap (__brkval, or __heap_start if
  // nothing has been allocated yet)
  int fr = (int) &v - (__brkval == 0 ? (int) &__heap_start : (int) __brkval);
  Serial.print("Free ram: ");
  Serial.println(fr);
  return fr; // return the value so callers can use it too
}
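For example, one might call it at points of interest to find the worst case (a sketch; Serial.begin must run first, and the call sites are up to you):
void setup() {
  Serial.begin(9600);
  freeRam(); // baseline after globals are initialized
}

void loop() {
  // ... allocate buffers, enter deeply nested calls ...
  freeRam(); // the smallest value printed is your high-water mark
}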

Using the extra 16 bits in 64-bit pointers

I read that a 64-bit machine actually uses only 48 bits of address (specifically, I'm using an Intel Core i7).
I would expect that the extra 16 bits (bits 48-63) are irrelevant for the address and would be ignored. But when I try to access such an address I get an EXC_BAD_ACCESS signal.
My code is:
int *p1 = &val;
int *p2 = (int *)((long)p1 | 1ll<<48);//set bit 48, which should be irrelevant
int v = *p2; //Here I receive a signal EXC_BAD_ACCESS.
Why is this so? Is there a way to use those 16 bits?
This could be used to build a more cache-friendly linked list. Instead of using 8 bytes for the next pointer and 8 bytes for the key (due to alignment restrictions), the key could be embedded into the pointer.
The high-order bits are reserved in case the address bus is widened in the future, so you can't simply use them like that:
The AMD64 architecture defines a 64-bit virtual address format, of which the low-order 48 bits are used in current implementations (...) The architecture definition allows this limit to be raised in future implementations to the full 64 bits, extending the virtual address space to 16 EB (2^64 bytes). This is compared to just 4 GB (2^32 bytes) for the x86.
http://en.wikipedia.org/wiki/X86-64#Architectural_features
More importantly, according to the same article [Emphasis mine]:
... in the first implementations of the architecture, only the least significant 48 bits of a virtual address would actually be used in address translation (page table lookup). Further, bits 48 through 63 of any virtual address must be copies of bit 47 (in a manner akin to sign extension), or the processor will raise an exception. Addresses complying with this rule are referred to as "canonical form."
As the CPU will check the high bits even if they're unused, they're not really "irrelevant". You need to make sure that the address is canonical before using the pointer. Some other 64-bit architectures like ARM64 have the option to ignore the high bits, therefore you can store data in pointers much more easily.
That said, in x86_64 you're still free to use the high 16 bits if needed (if the virtual address is not wider than 48 bits, see below), but you have to check and fix the pointer value by sign-extending it before dereferencing.
Note that casting the pointer value to long is not the correct way to do it, because long is not guaranteed to be wide enough to store pointers. You need to use uintptr_t or intptr_t.
int *p1 = &val; // original pointer
uint8_t data = ...;
const uintptr_t MASK = (1ULL << 48) - 1; // keeps the low 48 address bits
// === Store data into the pointer ===
// Note: To be on the safe side and future-proof (because future implementations
// can increase the number of significant bits in the pointer), we should
// store values from the most significant bits down to the lower ones.
// The cast to uintptr_t before shifting matters: shifting a uint8_t
// (promoted to a 32-bit int) left by 56 would be undefined behavior.
int *p2 = (int *)(((uintptr_t)p1 & MASK) | ((uintptr_t)data << 56));
// === Get the data stored in the pointer ===
data = (uintptr_t)p2 >> 56;
// === Dereference the pointer ===
// Sign extend first to make the pointer canonical
// Note: Technically this is implementation defined. You may want a more
// standard-compliant way to sign-extend the value
intptr_t p3 = ((intptr_t)p2 << 16) >> 16;
val = *(int*)p3;
WebKit's JavaScriptCore, Mozilla's SpiderMonkey engine, and LuaJIT use this in the NaN-boxing technique. If the value is a NaN, the low 48 bits store the pointer to the object, with the high 16 bits serving as tag bits; otherwise it's a double value.
Previously Linux also used the 63rd bit of the GS base address to indicate whether the value was written by the kernel.
In reality you can usually use bit 47, too, because most modern 64-bit OSes split kernel and user space in half, so bit 47 is always zero in user space and you have the top 17 bits free for use.
You can also use the lower bits to store data. It's called a tagged pointer. If int is 4-byte aligned then the 2 low bits are always 0 and you can use them like in 32-bit architectures. For 64-bit values you can use the 3 low bits because they're already 8-byte aligned. Again you also need to clear those bits before dereferencing.
int *p1 = &val; // the pointer we want to store the value into
int tag = 1;
const uintptr_t MASK = ~0x03ULL;
// === Store the tag ===
int *p2 = (int *)(((uintptr_t)p1 & MASK) | tag);
// === Get the tag ===
tag = (uintptr_t)p2 & 0x03;
// === Get the referenced data ===
// Clear the 2 tag bits before using the pointer
intptr_t p3 = (uintptr_t)p2 & MASK;
val = *(int*)p3;
One famous user of this is the V8 engine with its SMI (small integer) optimization. The lowest bit in the address serves as a type tag:
if it's 1, the value is a pointer to the real data (objects, floats or bigger integers). The next higher bit (w) indicates whether the pointer is weak or strong. Just clear the tag bits and dereference it
if it's 0, it's a small integer. In 32-bit V8, or 64-bit V8 with pointer compression, it's a 31-bit int: do a signed right shift by 1 to restore the value. In 64-bit V8 without pointer compression it's a 32-bit int in the upper half
32-bit V8
|----- 32 bits -----|
Pointer: |_____address_____w1|
Smi: |___int31_value____0|
64-bit V8
|----- 32 bits -----|----- 32 bits -----|
Pointer: |________________address______________w1|
Smi: |____int32_value____|0000000000000000000|
https://v8.dev/blog/pointer-compression
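To illustrate the 64-bit layout above, a small decoding sketch (not V8's actual code; tag-bit meanings as described in the blog post):
#include <cstdint>

// 64-bit V8 without pointer compression:
// bit 0: 1 = heap pointer, 0 = Smi; bit 1 (w): weak/strong for pointers
bool is_smi(uint64_t v) { return (v & 1) == 0; }
int32_t smi_value(uint64_t v) { return (int32_t)(v >> 32); } // int32 in upper half
void *heap_object(uint64_t v) { return (void *)(v & ~(uint64_t)3); } // clear tag bits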
Intel has since published PML5 (5-level paging), which extends the virtual address space to 57 bits; on such a system you can use only the 7 high bits.
You can still use some workarounds to get more free bits, though. First, you can try to use a 32-bit pointer in a 64-bit OS. In Linux, if the x32 ABI is enabled, pointers are only 32 bits long. In Windows, just clear the /LARGEADDRESSAWARE flag and pointers then have only 32 significant bits, so you can use the upper 32 bits for your purposes. See How to detect X32 on Windows?. Another way is to use pointer compression tricks: How does the compressed pointer implementation in V8 differ from JVM's compressed Oops?
You can get even more bits by requesting that the OS allocate memory only in the low region. For example, if you can ensure that your application never uses more than 64 MB of memory, then you need only a 26-bit address. And if all allocations are 32-byte aligned then you gain 5 more bits, which means you can store 64 - 21 = 43 bits of information in the pointer!
I guess ZGC is one example of this. It uses only 42 bits for addressing, which allows for 2^42 bytes = 4 × 2^40 bytes = 4 TB.
ZGC therefore just reserves 16TB of address space (but not actually uses all of this memory) starting at address 4TB.
A first look into ZGC
It uses the bits in the pointer like this:
6 4 4 4 4 4 0
3 7 6 5 2 1 0
+-------------------+-+----+-----------------------------------------------+
|00000000 00000000 0|0|1111|11 11111111 11111111 11111111 11111111 11111111|
+-------------------+-+----+-----------------------------------------------+
| | | |
| | | * 41-0 Object Offset (42-bits, 4TB address space)
| | |
| | * 45-42 Metadata Bits (4-bits) 0001 = Marked0
| | 0010 = Marked1
| | 0100 = Remapped
| | 1000 = Finalizable
| |
| * 46-46 Unused (1-bit, always zero)
|
* 63-47 Fixed (17-bits, always zero)
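Based on that layout, pulling the fields out of such a colored pointer could look like this (illustrative only, not HotSpot's actual code):
#include <cstdint>

const uint64_t OFFSET_MASK = (1ULL << 42) - 1; // bits 41-0: object offset

uint64_t zgc_offset(uint64_t ptr) { return ptr & OFFSET_MASK; }
uint32_t zgc_metadata(uint64_t ptr) { return (uint32_t)((ptr >> 42) & 0xF); } // bits 45-42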
For more information on how to do that see
Allocating Memory Within A 2GB Range
How can I ensure that the virtual memory address allocated by VirtualAlloc is between 2-4GB
Allocate at low memory address
How to malloc in address range > 4 GiB
Custom heap/memory allocation ranges
Side note: using a linked list for cases where the keys are tiny compared to the pointers is a huge memory waste, and it's also slower due to bad cache locality. In fact, you shouldn't use linked lists in most real-life problems:
Bjarne Stroustrup says we must avoid linked lists
Why you should never, ever, EVER use linked-list in your code again
Number crunching: Why you should never, ever, EVER use linked-list in your code again
Bjarne Stroustrup: Why you should avoid Linked Lists
Are lists evil?—Bjarne Stroustrup
A standards-compliant way to canonicalize AMD/Intel x64 pointers (based on the current documentation of canonical pointers and 48-bit addressing) is
int *p2 = (int *)(((uintptr_t)p1 & ((1ull << 48) - 1)) |
~(((uintptr_t)p1 & (1ull << 47)) - 1));
This first clears the upper 16 bits of the pointer. Then, if bit 47 is 1, this sets bits 47 through 63, but if bit 47 is 0, this does a logical OR with the value 0 (no change).
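As a quick sanity check of that expression, wrapped in a function and fed two made-up values (one user-space-style, one kernel-space-style):
#include <cstdint>
#include <cstdio>

static uintptr_t canonicalize(uintptr_t p) {
    return (p & ((1ull << 48) - 1)) | ~((p & (1ull << 47)) - 1);
}

int main() {
    // Bit 47 clear: the upper 16 bits are simply stripped
    printf("%llx\n", (unsigned long long)canonicalize(0xABCD00007FFF0000ull)); // 7fff0000
    // Bit 47 set: the upper 16 bits become all ones
    printf("%llx\n", (unsigned long long)canonicalize(0x1234FFFF80001000ull)); // ffffffff80001000
    return 0;
}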
I guess no one has mentioned the possible use of bit fields (https://en.cppreference.com/w/cpp/language/bit_field) in this context, e.g.
#include <iostream> // needed for the std::cout usage example below

template<typename T>
struct My64Ptr
{
signed long long ptr : 48; // as per phuclv's comment, we need the type to be signed to be sign extended
unsigned long long ch : 8; // ...and, what's more, as Peter Cordes pointed out, it's better to mark signedness of bit field explicitly (before C++14)
unsigned long long b1 : 1; // Additionally, as Peter found out, types can differ by sign and it doesn't mean the beginning of another bit field (MSVC is particularly strict about it: other type == new bit field)
unsigned long long b2 : 1;
unsigned long long b3 : 1;
unsigned long long still5bitsLeft : 5;
inline My64Ptr(T* ptr) : ptr((long long) ptr)
{
}
inline operator T*()
{
return (T*) ptr;
}
inline T* operator->()
{
return (T*)ptr;
}
};
My64Ptr<const char> ptr ("abcdefg");
ptr.ch = 'Z';
ptr.b1 = true;
ptr.still5bitsLeft = 23;
std::cout << ptr << ", char=" << char(ptr.ch) << ", byte1=" << ptr.b1 <<
", 5bitsLeft=" << ptr.still5bitsLeft << " ...BTW: sizeof(ptr)=" << sizeof(ptr);
// The output is: abcdefg, char=Z, byte1=1, 5bitsLeft=23 ...BTW: sizeof(ptr)=8
// With all signed long long fields, the output would be: abcdefg, char=Z, byte1=-1, 5bitsLeft=-9 ...BTW: sizeof(ptr)=8
I think it may be quite a convenient way to try to make use of these 16 bits, if we really want to save some memory. All the bitwise (& and |) operations and cast to full 64-bit pointer are done by compiler (though, of course, executed in run time).
According to the Intel manuals (volume 1, section 3.3.7.1), linear addresses have to be in canonical form. This means that indeed only 48 bits are used, and the extra 16 bits are sign-extended. Moreover, the implementation is required to check whether an address is in that form and, if it is not, to generate an exception. That's why there is no way to use those additional 16 bits.
The reason it is done this way is quite simple. Currently a 48-bit virtual address space is more than enough (and, because of CPU production costs, there is no point in making it larger), but undoubtedly in the future the additional bits will be needed. If applications/kernels were to use them for their own purposes, compatibility problems would arise, and that is what CPU vendors want to avoid.
Only 48 bits of an address are actually translated in current implementations; that's enough to address a lot of RAM. But between your program running on the CPU core and the RAM is the memory management unit, part of the CPU. Your program is addressing virtual memory, and the MMU is responsible for translating between virtual addresses and physical addresses. The virtual addresses your program handles are 64 bits wide.
The value of a virtual address tells you nothing about the corresponding physical address. Indeed, because of how virtual memory systems work there's no guarantee that the corresponding physical address will be the same moment to moment. And if you get creative with mmap() you can make two or more virtual addresses point at the same physical address (wherever that happens to be). If you then write to any of those virtual addresses you're actually writing to just one physical address (wherever that happens to be). This sort of trick is quite useful in signal processing.
Thus when you tamper with bit 48 of your pointer (which holds a virtual address), the MMU can't find that new address in the table of memory allocated to your program by the OS (or by yourself using malloc()). It raises an interrupt in protest, the OS catches it, and it terminates your program with the signal you mention.
If you want to know more I suggest you Google "modern computer architecture" and do some reading about the hardware that underpins your program.
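As an aside, the mmap() trick mentioned above can be sketched on Linux like this (assumes memfd_create is available, glibc >= 2.27; error handling trimmed):
#include <sys/mman.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main() {
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    int fd = memfd_create("twice", 0);   // anonymous in-memory file
    ftruncate(fd, (off_t)len);

    // Two different virtual addresses backed by the same physical pages
    char *a = (char *)mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *b = (char *)mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    strcpy(a, "written via a");
    printf("read via b: %s\n", b);       // prints the same bytes
    return 0;
}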

is there any tool or IDE to debug R packages and run it step by step?

I am interested in some R packages and want to understand how they work. Is there any tool to run the methods in a package step by step and print the intermediate output?
The two previous answers already told you what base R and add-on packages can do.
As far as IDEs go, you have two choices:
The StatET plugin for Eclipse has some features for this (which I haven't used).
ESS for Emacs where the newest ESS releases have integrated ess-tracebug which does this too. Here is some documentation from when ess-tracebug was still a third-party project and not part of ESS proper. While I am an ESS user, I haven't tried this either yet.
Here is the help for ess-tracebug to give a flavor of what it can do:
Documentation:
Default ess-tracebug key bindings:
* Breakpoints:
b . Set BP (repeat to cycle BP type) . `ess-bp-set'
B . Set conditional BP . `ess-bp-set-conditional'
k . Kill BP . `ess-bp-kill'
K . Kill all BPs . `ess-bp-kill-all'
t . Toggle BP state . `ess-bp-toggle-state'
l . Set logger BP . `ess-bp-set-logger'
C-n . Goto next BP . `ess-bp-next'
C-p . Goto previous BP . `ess-bp-previous'
* General Debugging:
` . Show R Traceback . `ess-show-R-traceback'
e . Toggle error action (repeat to cycle). `ess-dbg-toggle-error-action'
d . Flag for debugging . `ess-dbg-flag-for-debugging'
u . Unflag for debugging . `ess-dbg-unflag-for-debugging'
w . Watch window . `ess-watch'
* Navigation to errors (emacs general functionality):
C-x `, M-g n . `next-error'
M-g p . `previous-error'
* Interactive Debugging:
c . Continue . `ess-dbg-command-c'
n . Next step . `ess-dbg-command-n'
p . Previous step . `previous-error'
q . Quit debugging . `ess-dbg-command-Q'
1..9. Enter recover frame . `ess-dbg-command-digit'
0 . Exit recover (also q,n,c) . `ess-dbg-command-digit'
* Input Ring:
i . Goto input event marker forwards . `ess-dbg-goto-input-event-marker'
I . Goto input event marker backwards . `ess-dbg-goto-input-event-marker'
* Misc:
s . Source current file . `ess-tracebug-source-current-file'
? . Show this help . `ess-tracebug-show-help'
I think the R functions debug and browser will let you do what you want.
There is the debug package, combined with its mtrace function. There is also a new debug module for Eclipse (as Dirk mentions in his answer), and similar capabilities might be added to RStudio in the future. Once these are in place, the question is how to have them debug the relevant functions, which basically means getting the list of all the functions you are interested in and tracing them.
P.S.: you might have a look at this: http://www.r-bloggers.com/what-does-this-package-look-like/
The IDE in Revolution R includes convenient visual debugging features similar to those found in MS Visual Studio. Although the software is proprietary and requires paying for a license, you can always download the free academic version.
http://www.revolutionanalytics.com/downloads/free-academic.php

How to avoid multiple writers in a named pipe?

I am writing a program with a named pipe with multiple readers and multiple writers. The idea is to use that named pipe to create pairs of reader/writer. That is:
A reads from the pipe
B writes to the pipe
(or vice versa)
Pair A-B created!
In order to ensure that only one process is reading and one is writing, I have used two locks with flock, like this:
Reader Code:
echo "[JOB $2, Part $REMAINING] Taking next machine..."
VMTAKEN=$( (
    flock -x 200
    cat $VMPIPE
) 200>$JOINQUEUELOCK )
echo "[JOB $2, Part $REMAINING] Machine $VMTAKEN taken..."
Writer Code:
(
    flock -x 200
    echo "[MACHINE $MACHINEID] I am inside the critical section"
    echo "$MACHINEID" > $VMPIPE
    echo "[MACHINE $MACHINEID] Going outside the critical section"
) 200>$VMQUEUELOCK
echo "[MACHINE $MACHINEID] Got new Job"
I sometimes get the following problem:
[MACHINE 3] I am inside the critical section
[JOB 1, Part 249] Taking next machine...
[MACHINE 3] Going outside the critical section
[MACHINE 1] I am inside the critical section
[MACHINE 1] Going outside the critical section
[MACHINE 1]: Got new Job
[MACHINE 3]: Got new Job
[JOB 1, Part 249] Machine 3
1 taken...
As you can see, another writer wrote before the reader finished reading. What can I do to get rid of this problem? Should I use an ACK pipe or something?
Thank you in advance
This would be a typical use for semaphores:
1. Create 2 semaphores: one for reading processes, the other for writing processes. Set each semaphore to the value 1.
2. Reading processes sem_wait(2) on the readers' semaphore until it is > 0, and lower it to zero when they get it.
3. Writing processes do the same with the semaphore intended for them.
4. A controlling process (which may also set up the semaphores initially) can check whether both semaphores are zero and assign the reader/writer pair.
5. Reader and writer release the semaphores (increasing them by 1 again) so the next reader or writer can get them.
For passing information between reader and writer, shared memory may be used; a sketch of the pattern follows.
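A minimal sketch of that scheme using POSIX named semaphores (the semaphore names and the C++ wrappers are illustrative, since the original scripts are bash; only the synchronization pattern is shown):
#include <semaphore.h>
#include <fcntl.h>

// Both sides open the same named semaphores; O_CREAT makes the first
// opener create them with an initial value of 1 (one slot each).
static sem_t *open_sem(const char *name) {
    return sem_open(name, O_CREAT, 0644, 1);
}

void reader_side() {
    sem_t *readers = open_sem("/pipe_readers");
    sem_wait(readers);   // block until no other reader is active
    // ... read exactly one message from the FIFO here ...
    sem_post(readers);   // release the reader slot
}

void writer_side() {
    sem_t *writers = open_sem("/pipe_writers");
    sem_wait(writers);   // block until no other writer is active
    // ... write exactly one message to the FIFO here ...
    sem_post(writers);   // release the writer slot
}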
