Optimal NEON vector structure for processing vectors of uint8_t type with Arm Cortex-A8 (32-bit) - vector

I am doing some image processing on an embedded system (BeagleBone Black) using OpenCV and need to write some code to take advantage of NEON optimization. Specifically, I would like to write a NEON optimized thresholding function and then a NEON optimized erosion/dilation function.
This is my first time writing NEON code and I don't have experience writing assmbly code, so I have been looking at examples and resources for the C-style NEON intrinsics. I believe that I can put some working code together, but am not sure how I should structure the vectors. According to page 2 of the "ARM NEON support in the ARM compiler" white paper:
"These registers can hold "vectors" of items which are 8, 16, 32 or 64
bits. The traditional advice when optimizing or porting algorithms
written in C/C++ is to use the natural type of the machine for data
handling (in the case of ARM 32 bits). The unwanted bits can then be
discarded by casting and/or shifting before storing to memory."
What exactly does this mean? Do I need to to restrict my NEON code to using uint32x4_t vectors rather than uint8x16_t? How would I go about loading the registers? Or does this mean than I need to take some special steps when using vst1q_u8 to store the data to memory?
I did find this example, which is untested but uses the uint8x16_t type. Does it adhere to the "32-bit" advice given above?
I would really appreciate it if someone could please elaborate on the above quotation and maybe provide a very simple working example.

The next sentence from the document you linked gives your answer.
The ability of NEON to specify the data width in the instruction and
hence use the whole register width for useful information means
keeping the natural type for the algorithm is both possible and
preferable.
Note, the document is distinguishing between the natural type of the machine (32-bit) and the natural type of the algorithm (in your case uint8_t).
The document is saying that in the past you would have written your code in such a way that it used 32-bit integers so that it could use the efficient machine instructions suited for 32-bit operations.
With Neon, this is not necessary. It is more useful to use the data type you actually want to use, as Neon can efficiently operate on those data types.
It will depend on your algorithm as to the optimal choice of register width (uint8x8_t or uint8x16_t).
To give a simple example of using the Neon intrinsics to add two sets of uint8_t:
#include <arm_neon.h>
void
foo (uint8_t a, uint8_t *b, uint8_t *c)
{
uint8x16_t t1 = vld1q_u8 (a);
uint8x16_t t2 = vld1q_u8 (b);
uint8x16_t t3 = vaddq_u8 (a, b);
vst1q_u8 (c, t3);
}

Related

Operate only on a subset of buffer in OpenCL kernel

Newbie to OpenCL here. I'm trying to convert a numerical method I've written to OpenCL for acceleration. I'm using the PyOpenCL package as I've written this once in Python already and as far as I can tell there's no compelling reason to use the C version. I'm all ears if I'm wrong on this, though.
I've managed to translate over most of the functionality I need in to OpenCL kernels. My question is on how to (properly) tell OpenCL to ignore my boundary/ghost cells. The reason I need to do this is that my method (for example) for point i accesses cells at [i-2:i+2], so if i=1, I'll run off the end of the array. So - I add some extra points that serve to prevent this, and then just tell my algorithm to only run on points [2:nPts-2]. It's easy to see how to do this with a for loop, but I'm a little more unclear on the 'right' way to do this for a kernel.
Is it sufficient to do, for example (pseudocode)
__kernel void myMethod(...) {
gid = get_global_id(0);
if (gid < nGhostCells || gid > nPts-nGhostCells) {
retVal[gid] = 0;
}
// Otherwise perform my calculations
}
or is there another/more appropriate way to enforce this constraint?
It looks sufficient.
Branching is same for nPts-nGhostCells*2 number of points and it is predictable if nPts and nGhostCells are compile-time constants. Even if it is not predictable, sufficiently large nPts vs nGhostCells (1024 vs 3) should not be distinctively slower than zero-branching version, except the latency of "or" operation. Even that "or" latency must be hidden behind array access latency, thanks to thread level parallelism.
At those "break" points, mostly 16 or 32 threads would lose some performance and only for several clock cycles because of the lock-step running of SIMD-like architectures.
If you happen to code some chaotic branching, like data-driven code path, then you should split them into different kernels(for different regions) or sort them before the kernel so that average branching between neighboring threads are minimized.

SIMD-8,SIMD-16 or SIMD-32 in opencl on gpgpu

I read couple of questions on SO for this topic(SIMD Mode), but still slight clarification/confirmation of how things work is required.
Why use SIMD if we have GPGPU?
SIMD intrinsics - are they usable on gpus?
CPU SIMD vs GPU SIMD?
Are following points correct,if I compile the code in SIMD-8 mode ?
1) it means 8 instructions of different work items are getting executing in parallel.
2) Does it mean All work items are executing the same instruction only?
3) if each wrok item code contains vload16 load then float16 operations and then vstore16 operations only. SIMD-8 mode will still work. I mean to say is it true GPU is till executing the same instruction (either vload16/ float16 / vstore16) for all 8 work items?
How should I understand this concept?
In the past many OpenCL vendors required to use vector types to be able to use SIMD. Nowadays OpenCL vendors are packing work items into SIMD so there is no need to use vector types. Whether is preffered to use vector types can be checked by querying for: CL_DEVICE_PREFERRED_VECTOR_WIDTH_<CHAR, SHORT, INT, LONG, FLOAT, DOUBLE>.
On Intel if vector type is used the vectorizer first scalarize them and then re-vectorize to make use of the wide instruction set. This is probably going to be similar on the other platforms.

Armadillo C++ linear algebra library : How to create vector of boolean

Recently I started using Armadillo C++ library. Given my C++ coding skills are not that great, I found this very friendly for linear algebra. I am also using that along with my matlab to speed things up for many of reconstruction algorithm.
I do need to create a vector of boolean and I would prefer using this library rather than . However, I could not figure out how to do it. I tried using uvec; but, documentation seems to indicate that it can not be used with boolean.
Any help would be appreciated.
Regards,
Dushyant
Consider using a matrix uchar_mat which is a typdef for Mat<unsigned char>, it should consume the same amount of memory as a matrix of boolean values.
The Armadillo documentation of version 7.8 states that a matrix Mat<type>, can be of the following types:
float, double, std::complex<float>, std::complex<double>, short, int, long, and unsigned versions of short, int, and long. The code on GitHub however contains typedef Mat <unsigned char> uchar_mat; in the file include/armadillo_bits/typedef_mat.hpp so you should also be able to use uchar_mat.
You will not save any memory by creating a matrix of bool values compared to a matrix of unsigned char values (a bool type consumes 8 bits). This is because in C++ every data type must be addressable; it must be at least 1 byte long so that it is possible to create a pointer pointing to it.

CUDA Thrust library: How can I create a host_vector of host_vectors of integers?

In C++ in order to create a vector that has 10 vectors of integers I would do the following:
std::vector< std::vector<int> > test(10);
Since I thought Thrust was using the same logic with the STL I tried doing the same:
thrust::host_vector< thrust::host_vector<int> > test(10);
However I got too many confusing errors. I tried doing:
thrust::host_vector< thrust::host_vector<int> > test;
and it worked, however I can't add anything to this vector. Doing
thrust::host_vector<int> temp(3);
test.push_back(temp);
would give me the same errors(too many to paste them here).
Also generally speaking when using Thrust, does it make a difference between using host_vector and the STL's vector?
Thank you in advance
Thrust's containers are only designed for POD (plain old data) types. It isn't possible to create multidimensional vectors by instantiating "vectors of vectors" in thrust, mostly because of the limitations on the GPU side which make it impossible to pass and use in the device code path.
There is some level of compatibility between C++ standard library types and algorithms and the thrust host implementation of those STL derived models, but you should really stick with host vectors when you want to work both with the host and device library back ends.

Program to mimic scanf() using system calls

As the Title says, i am trying out this last year's problem that wants me to write a program that works the same as scanf().
Ubuntu:
Here is my code:
#include<unistd.h>
#include<stdio.h>
int main()
{
int fd=0;
char buf[20];
read(0,buf,20);
printf("%s",buf);
}
Now my program does not work exactly the same.
How do i do that both the integer and character values can be stored since my given code just takes the character strings.
Also how do i make my input to take in any number of data, (only 20 characters in this case).
Doing this job thoroughly is a non-trivial exercise.
What you show does not emulate sscanf("%s", buffer); very well. There are at least two problems:
You limit the input to 20 characters.
You do not stop reading at the first white space character, leaving it and other characters behind to be read next time.
Note that the system calls cannot provide an 'unget' functionality; that has to be provided by the FILE * type. With file streams, you are guaranteed one character of pushback. I recently did some empirical research on the limits, finding values that the number of pushed back characters ranged from 1 (AIX, HP-UX) to 4 (Solaris) to 'big', meaning up to 4 KiB, possibly more, on Linux and MacOS X (BSD). Fortunately, scanf() only requires one character of pushback. (Well, that's the usual claim; I'm not sure whether that's feasible when distinguishing between "1.23e+f" and "1.23e+1"; the first needs three characters of lookahead, it seems to me, before it can tell that the e+f is not part of the number.)
If you are writing a plug-in replacement for scanf(), you are going to need to use the <stdarg.h> mechanism. Fortunately, all the arguments to scanf() after the format string are data pointers, and all data pointers are the same size. This simplifies some aspects of the code. However, you will be parsing the scan format string (a non-trivial exercise in its own right; see the recent discussion of print format string parsing) and then arranging to make the appropriate conversions and assignments.
Unless you have unusually stringent conditions imposed upon you, assume that you will use the character-level Standard I/O library functions such as getchar(), getc() and ungetc(). If you can't even use them, then write your own variants of them. Be aware that full integration with the rest of the I/O functions is tricky - things like fseek() complicate matters, and ensuring that pushed-back characters are properly consumed is also not entirely trivial.

Resources