I read the book "Information Theory and Network Coding" and I have some doubts.
I already know that the global encoding kernels are obtained from the local encoding kernels, but I still do not know how to determine the local encoding kernels in a network (for example, the butterfly network), or how to use the local encoding kernels to obtain the global encoding kernels.
In the butterfly network shown in the picture, what does each element of a node's local encoding kernel matrix represent, and how is the value of each element determined?
In the butterfly network, the relation for the global encoding kernels, f_e = Σ_{d ∈ In(v)} k_{d,e} · f_d for e ∈ Out(v), is already given, so how do I determine which k_{d,e} each incoming f_d is multiplied by for a given edge e?
[figure: the butterfly network]
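My current understanding of the relation, using the coding node of the standard binary butterfly network as an example (the node and edge labels here are my own, not necessarily the book's): the node has two incoming edges e1, e2 with global kernels f_{e1} = (1, 0)^T and f_{e2} = (0, 1)^T, one outgoing edge e3, and local encoding kernel K = (k_{e1,e3}, k_{e2,e3})^T = (1, 1)^T, so

f_{e_3} = k_{e_1,e_3}\,f_{e_1} + k_{e_2,e_3}\,f_{e_2}
        = 1\cdot\begin{pmatrix}1\\0\end{pmatrix} + 1\cdot\begin{pmatrix}0\\1\end{pmatrix}
        = \begin{pmatrix}1\\1\end{pmatrix} \quad \text{over } \mathrm{GF}(2),

i.e. e3 carries the XOR of the two source bits. If I understand correctly, each entry k_{d,e} of a node's local encoding kernel is the coefficient by which the symbol arriving on incoming edge d is multiplied before being added into the symbol sent on outgoing edge e — please correct me if that reading is wrong.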
What is the actual difference of putting multiple kernels in a single program, or compiling a different program for each kernel, excluding source code organization? Specifically, is the register pressure dictated by the size of the program or by the actual kernel that is chosen within the program? Is the sum of all __local storage of all kernels allocated for the run of any of the kernels? Is there any other performance-related observation to make (e.g. code upload size to device, etc.)?
This could be device-specific; I speak from Intel GPU experience. Program-scope resources will only be visible to kernels in that program. Beyond that, register allocation is per-kernel; hence, 1 kernel in K programs vs. K kernels in 1 program has no effect on register pressure. You do build and link per-program, so compiling K kernels into one program is less efficient in terms of startup time if you don't use all of the K kernels.
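You can also see that these figures are tracked per kernel object by querying them after the build — a minimal sketch, assuming program and device are already set up and "foo" is one of your kernel names:

#include <stdio.h>
#include <CL/cl.h>

/* Print the compiler-reported local/private memory footprint of one kernel
   in an already-built program; each kernel reports its own numbers. */
static void report_kernel_usage(cl_program program, cl_device_id device,
                                const char *name)
{
    cl_int err;
    cl_kernel k = clCreateKernel(program, name, &err);
    if (err != CL_SUCCESS)
        return;

    cl_ulong local_bytes = 0, private_bytes = 0;
    clGetKernelWorkGroupInfo(k, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(local_bytes), &local_bytes, NULL);
    clGetKernelWorkGroupInfo(k, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                             sizeof(private_bytes), &private_bytes, NULL);
    printf("%s: local = %llu bytes, private = %llu bytes\n",
           name, (unsigned long long)local_bytes,
           (unsigned long long)private_bytes);
    clReleaseKernel(k);
}

__local buffers declared inside the kernel show up in the first number; the second is the per-work-item private memory the compiler reports (on many implementations spills end up here), which is a rough proxy for register pressure.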
I would like to run the full 6502 test suite by Klaus Dormann to test my Kansas Lava 6502 implementation. However, the code uses self-modification (see all uses of range_adr), which, while trivial to implement in an emulator, doesn't bode well for a hardware implementation: the program image needs to be stored in ROM, so the write-backs will be black-holed by whatever logic routes writes to the ROM- or RAM-backed parts of the address space.
The same problem, of course, applies both to synthesizing it to a real FPGA and to running it in a simulator (either the low-level VHDL one or the high-level Kansas Lava one).
Is there a way to run the test suite without doing a lengthy (in terms of cycles) dance of suspending the CPU, copying the program from some unaddressable ROM into an all-RAM memory byte-by-byte, and then initializing the CPU and letting it run? I'd prefer not doing this because simulating these extra cycles at startup will slow down running the test considerably.
Knee-jerk observations:
Despite coming as a 64 KB image, the test is actually just 14,093 bytes of actual content, from $0000 up to $370d, followed by a fill of $ff bytes up to the three vectors at $fffa–$ffff. So you'd need to copy at most 14,099 bytes rather than the prima facie 65,536.
Having set up that very test suite in an emulator I wrote yesterday (no, really), the full set of touched addresses — using [x, y] to denote the closed range, i.e. including both x and y — is:
[000a, 0012], [0100, 0101], [01f9, 01ff] (i.e. the stack and zero pages);
0200;
[0203, 0207];
04a8;
2cbb;
2cdc;
2eb1;
2ed2;
30a7;
30c8;
33f2;
3409;
353b; and
3552.
From the .lst version of the program, that means all you need to move are the variables with labels:
test_case;
ada2;
sba2;
range_adr;
... and either move or remove the routines that:
test AND immediate from 2cac down to 2cec;
test EOR immediate from 2ea2 to 2ee2;
test ORA immediate from 3098 to 30d8;
test decimal ADC/SBC immediate from 33e7 down to 3414 (specifically to include chkdadi and chksbi);
test binary ADC/SBC immediate from 3531 down to 355d.
All the immediate tests self modify the operand. If you're happy leaving that one addressing mode untested then it shouldn't be too troublesome.
So, I guess, edit those tests out of the original file, and you can safely relocate range_adr to the middle of the stack page if my simulation is accurate.
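If it helps as a sanity check, the write-routing decision those ranges imply boils down to a tiny predicate — plain C, purely illustrative (this is obviously not Kansas Lava code), and it assumes you have edited out the immediate-mode tests as above:

#include <stdbool.h>
#include <stdint.h>

/* True if a write to this address must actually land in RAM for the
   (edited) test suite to behave; everything else can stay ROM-backed.
   Ranges taken from the touched-address list above. */
static bool write_needs_ram(uint16_t a)
{
    return (a >= 0x000a && a <= 0x0012)   /* zero-page test variables */
        || (a >= 0x0100 && a <= 0x01ff)   /* the whole stack page     */
        || (a == 0x0200)
        || (a >= 0x0203 && a <= 0x0207)
        || (a == 0x04a8);
}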
Hi, I am working on Tmote Sky motes (MSP430 microcontroller) with Contiki OS. I want to know the number of instruction cycles used when I perform a multiplication operation in my program (in software).
Thank you,
Avijit
The MSP430 is a 16-bit architecture, so 32-bit values are not supported directly; a 32-bit operation is typically translated into a sequence of 16-bit operations in the generated assembly.
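For intuition, the decomposition looks roughly like the following (a schoolbook sketch, not the exact code your compiler emits):

#include <stdint.h>

/* 32x32 -> 32-bit multiply built from 16x16 -> 32-bit partial products,
   roughly what a 16-bit CPU has to do behind the scenes. */
static uint32_t mul32(uint32_t a, uint32_t b)
{
    uint16_t al = (uint16_t)a, ah = (uint16_t)(a >> 16);
    uint16_t bl = (uint16_t)b, bh = (uint16_t)(b >> 16);

    uint32_t lo  = (uint32_t)al * bl;            /* low x low           */
    uint32_t mid = (uint32_t)al * bh
                 + (uint32_t)ah * bl;            /* the two cross terms */

    /* ah*bh would land entirely above bit 31, so it is dropped for a
       32-bit result; only the low 16 bits of mid survive the shift. */
    return lo + (mid << 16);
}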
The execution times of 8-bit and 16-bit operations can be found in the TI application report "The MSP430 Hardware Multiplier":
Table 4. CPU Cycles Needed With Different Multiplication Modes

Operation                                Software loop   Hardware MPYer   Speed increase
Unsigned multiply (MPY)                  139...171       8                17.4...21.4
Unsigned multiply-and-accumulate (MAC)   137...169       8                17.1...21.1
Signed multiply (MPYS)                   145...179       8                18.1...22.4
Signed multiply-and-accumulate (MAC)     143...177       17               8.4...10.4
The HW multiplier should be active with default compilation settings, but check the generated object file with msp430-objdump to make sure.
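A quick way to check is something like the following (the file name is just a placeholder):

msp430-objdump -d myprogram.elf

and then look at whether the multiplication compiles into writes to the hardware multiplier registers (MPY/MPYS/MAC, OP2, RESLO/RESHI) or into a call to a software multiply routine.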
You can use naken_asm by Michael Kohn to disassemble an Intel hex or ELF file, and it will calculate the cycle counts for each instruction. I've used it in the past and the cycle counter is OK for the CPU (such as in your Tmote) but not fully supported for CPUX.
You can invoke it from the command line as simply as:
naken_util -disasm <infile>
where <infile> is the name of your hex or ELF file. The default processor is MSP430, but you'd need the assembly listing from your compiler in order to be able to match up the original code with the disassembled code which includes cycle counts.
Another alternative would be to use MSPDebug's tracer option which can track running software and provide an up-to-date instruction cycle count. However, I've never used it for that purpose so cannot provide an example.
I am wondering how to choose optimal local and global work sizes for different devices in OpenCL.
Is there any universal rule for AMD, NVIDIA, and Intel GPUs?
Should I analyze the physical build of the devices (number of multiprocessors, number of streaming processors per multiprocessor, etc.)?
Does it depend on the algorithm/implementation? I ask because I saw that some libraries (like ViennaCL) just test many combinations of local/global work sizes and choose the best one.
NVIDIA recommends that your (local) work-group size be a multiple of 32 (equal to one warp, which is their atomic unit of execution, meaning that 32 threads/work-items are scheduled atomically together). AMD, on the other hand, recommends a multiple of 64 (equal to one wavefront). I'm unsure about Intel, but you can find this type of information in their documentation.
So say you are doing some computation and you have 2300 work-items (the global size); 2300 is divisible by neither 64 nor 32. If you don't specify the local size, OpenCL may choose a bad local size for you. When your local size is not a multiple of the atomic unit of execution, you get idle threads, which leads to bad device utilization. Thus, it can be beneficial to add some "dummy" threads so that the global size becomes a multiple of 32/64, and then use a local size of 32/64 (the global size has to be divisible by the local size). For 2300 you can add 4 dummy threads/work-items, because 2304 is divisible by 32. In the actual kernel, you can write something like:
int globalID = get_global_id(0);
if (globalID >= realNumberOfThreads)
    globalID = 0;   // padding work-items redo work-item 0's work instead of running off the end
This will make the four extra threads do the same work as thread 0 (it is often faster to do a little extra work than to have many idle threads).
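On the host side, the padding just amounts to rounding the global size up to the next multiple of the local size — a small sketch, assuming queue and kernel are already set up:

size_t local_size  = 64;                     /* or 32, per the vendor advice above */
size_t real_items  = 2300;                   /* the real number of work-items      */
size_t global_size = ((real_items + local_size - 1) / local_size) * local_size;  /* 2304 */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_size, &local_size, 0, NULL, NULL);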
Hope that answered your question. GL HF!
If your processing essentially uses little memory (e.g. little kernel private state), you can choose the most intuitive global size for your problem and let OpenCL choose the local size for you.
See my answer here : https://stackoverflow.com/a/13762847/145757
If memory management is a central part of your algorithm and will have a great impact on performance, you should indeed go a little further and first check the maximum local size (which depends on the local/private memory usage of your kernel) using clGetKernelWorkGroupInfo; that in turn will determine your global size.
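The query itself looks roughly like this (a sketch, assuming kernel and device already exist):

size_t max_local = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(max_local), &max_local, NULL);
/* max_local is the largest work-group size this particular kernel can be
   launched with on this device, given its register/local-memory usage. */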
I have some big, big files that I work with and I use several different I/O functions to access them. The most common one is the bigmemory package.
When writing to the files, I've learned the hard way to flush output buffers, otherwise all bets are off on whether the data was saved. However, this can lead to some very long wait times while bigmemory does its thing (many minutes). I don't know why this happens - it doesn't always occur and it's not easily reproduced.
Is there some way to determine whether or not I/O buffers have been flushed in R, especially for bigmemory? If the operating system matters, then feel free to constrain the answer in that way.
If an answer can be generalized beyond bigmemory, that would be great, as I sometimes rely on other memory mapping functions or I/O streams.
If there are no good solutions to checking whether buffers have been flushed, are there cases in which it can be assumed that buffers have been flushed? I.e. besides using flush().
Update: I should clarify that these are all binary connections. @RichieCotton suggested isIncomplete(), though the help documentation only mentions text connections; it's not clear whether it is usable for binary connections.
Is this more convincing that isIncomplete() works with binary files?
# R process 1
zz <- file("~/test", "wb")
writeBin(c(1:100000),con=zz)
close(zz)
# R process 2
zz2 <- file("~/test", "rb")
inpp <- readBin(con=zz2, integer(), 10000)
while(isIncomplete(zz2)) {Sys.sleep(1); inpp <- c(inpp, readBin(zz2, integer(), 10000))}
close(zz2)
(Modified from the help(connections) file.)
I'll put forward my own answer, but I welcome anything that is clearer.
From what I've seen so far, the various connection functions, e.g. file, open, close, flush, isOpen, and isIncomplete (among others), are based on specific connection types, e.g. files, pipes, URLs, and a few other things.
In contrast, bigmemory has its own connection type and the bigmemory object is an S4 object with a slot for a memory address for operating system buffers. Once placed there, the OS is in charge of flushing those buffers. Since it's an OS responsibility, then getting information on "dirty" buffers requires interacting with the OS, not with R.
Thus, the answer for bigmemory is "no" as the data is stored in the kernel buffer, though it may be "yes" for other connections that are handled through STDIO (i.e. stored in "user space").
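To illustrate what "the OS is in charge" means in practice — this is the POSIX-level idea, not a call bigmemory itself exposes, and the file name is made up — a process that maps a file can ask the kernel to write its dirty pages back with msync():

#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a file, dirty a byte, then ask the kernel to flush the dirty pages
   back to disk (MS_SYNC blocks until the write-back completes). This is
   the OS-level analogue of flush() for memory-mapped data. */
int main(void)
{
    int fd = open("/tmp/example.bin", O_RDWR);
    if (fd < 0) return 1;

    size_t len = 4096;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;

    p[0] = 42;                  /* the mapped page is now dirty */
    msync(p, len, MS_SYNC);     /* force write-back to the file */

    munmap(p, len);
    close(fd);
    return 0;
}

Whether any given page is still dirty, though, is bookkeeping the kernel keeps to itself, which is why there is no clean way to ask that question from R.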
For more insight on the OS / kernel side of things, see this question on SO; I am investigating a couple of programs (not just R + bigmemory) that are producing buffer flushing curiosities, and that thread helped to enlighten me about the kernel side of things.