Most efficient way of converting byte array to vector - opencl

What is the most efficient way of converting an array of 16 bytes into
a uint4 vector? Currently, I manually OR the bytes into uints, then set
the vector's components with the completed uints. Is there OpenCL support for performing this task?
This is for OpenCL 1.2
Edit: here is my code:
local uchar buffer[16];
uint v[4];
for (int i = 0; i < 4; ++i) {
v[i]=0;
for (int j = 0; j < 4; ++j) {
v[i] |= (buffer[(i<<2)+j]) << (j<<3);
}
}
uint4 result = (uint4)(v[0],v[1],v[2],v[3]);
Edit 2: buffer is actually a local buffer.

You should be able to convert it on the fly without copying the data:
local uchar buffer[16];
if(get_local_id(0) == 0)
{
for (int x = 0; x < 4; ++x)
{
buffer[x] = x + 1;
buffer[x + 4] = x + 2;
buffer[x + 8] = x + 3;
buffer[x + 12] = x + 4;
}
local uint4 *result = (local uint4*)buffer;
printf("0x%x 0x%x 0x%x 0x%x\n", (*result).x, (*result).y, (*result).z, (*result).w);
}
Result:
0x4030201 0x5040302 0x6050403 0x7060504
If you need to copy the data though you do:
uint4 result = *(local uint4*)buffer;
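For comparison, the byte order that the cast produces can be reproduced on the host in plain C. This is only a sketch: the helper names pack_le and bytes_to_words are made up for illustration, and it assumes a little-endian device, which is what the printed result above suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: combine 4 bytes into one 32-bit word,
   least significant byte first (little-endian order). */
static uint32_t pack_le(const uint8_t *b)
{
    assert(b != NULL);
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}

/* Convert 16 bytes into 4 words, mirroring the uint4 components x, y, z, w. */
static void bytes_to_words(const uint8_t bytes[16], uint32_t words[4])
{
    for (int i = 0; i < 4; ++i)
        words[i] = pack_le(bytes + 4 * i);
}
```

Filling the 16 bytes the same way as the kernel snippet above gives the same four words, so the explicit shifts and the pointer cast agree on a little-endian target.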

If you shape your data in a different way you have an instruction for that:
ushort[n] upsample (uchar[n] hi, uchar[n] lo){
result[i]= ((ushort)hi[i]<< 8) | lo[i]
}
uint[n] upsample (ushort[n] hi, ushort[n] lo){
result[i]= ((uint)hi[i]<< 16) | lo[i]
}
But you will need the bytes grouped by significance rather than by word, e.g. uchar4 b0 = (uchar4)(buffer[0], buffer[4], buffer[8], buffer[12]), and likewise b1, b2 and b3 starting at buffer[1], buffer[2] and buffer[3] (please check!)
In order to be able to just perform a simple:
uint4 result = upsample(upsample(b3, b2), upsample(b1, b0));
This is probably the fastest way of doing it, since it does vector operations.
If you have the data properly shaped of course....
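The way two levels of upsample combine four bytes into one word can be checked with scalar stand-ins in C. This is a rough sketch, not OpenCL code: upsample_u8, upsample_u16 and word_from_bytes are hypothetical names, and the shifts follow the OpenCL definition of upsample (8 bits when widening uchar, 16 bits when widening ushort).

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for upsample(uchar hi, uchar lo) (hypothetical name). */
static uint16_t upsample_u8(uint8_t hi, uint8_t lo)
{
    return (uint16_t)(((uint16_t)hi << 8) | lo);
}

/* Scalar stand-in for upsample(ushort hi, ushort lo) (hypothetical name). */
static uint32_t upsample_u16(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Build one little-endian word from bytes b0 (least significant) to b3. */
static uint32_t word_from_bytes(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    return upsample_u16(upsample_u8(b3, b2), upsample_u8(b1, b0));
}

/* Small self-check: bytes 1, 2, 3, 4 should combine to 0x04030201. */
static void self_check(void)
{
    assert(word_from_bytes(1, 2, 3, 4) == 0x04030201u);
}
```

In the vector version each argument would hold the same byte position of all four words, which is why the data has to be reshaped first.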
But if your data is aligned, you can just cast it, and it will work.
uint4 result = *((local uint4 *)(&buffer));
For your case I think it is not, so you can do something like:
uchar16 bufferR = (uchar16)(buffer[3], buffer[2], buffer[1], buffer[0], buffer[7], buffer[6], buffer[5], buffer[4], ....)
uint4 result = *((uint4 *)(&bufferR));
Or maybe align it in the portion of code that creates that block of uchar16

Related

how to send 100 digit in serial communication from processing to arduino

I am converting a pixel array to a String and sending this String to the Arduino, but I think the String is not converted properly, because I don't know whether Serial.write sends 8 bits or 8 characters. I also want to send a string of 100 characters over Serial.
Please send your suggestions and help me out with this problem.
Sorry in advance for any mistakes.
for(int x =0 ; x < img.height; x++)
{
for(int y=0; y <img.width; y++)
{
int i = x+y*width;
if(img.pixels[i] == color(0,0,0))
{
i=1;
}
else
{
i=0;
}
String s = str(i);
print(s);
Serial.write(s);
delay(2);
}
}
Also, please tell me how to end the string after 100 characters without using "\n" or "\r".
It seems you are looking for the code below:
for (int x = 0; x < img.height; x++) { // iterate over height
for (int y = 0; y < img.width; y++) { // iterate over width
int i = x + y * width;
if (img.pixels[i] == color(0, 0, 0)) { // determine if zero
Serial.write('1'); // send non zero char
} else {
Serial.write('0'); // send zero char
}
}
Serial.write("\n\r");
}
If you want to cluster your output in units the size of img.width you could do this:
for (int x = 0; x < img.height; x++) { // iterate over height
String s;
for (int y = 0; y < img.width; y++) { // iterate over width
int i = x + y * width;
if (img.pixels[i] == color(0, 0, 0)) { // determine if zero
s += '1'; // append a non zero char to string s
} else {
s += '0'; // append a zero char to string s
}
}
Serial.println(s);
}
Please remember:
Serial.write outputs raw binary value(s).
Serial.print outputs character(s).
Serial.println outputs character(s) and appends a newline character to output.
I have serious doubts about this calculation int i = x+y*width; as your data is probably structured as:
vertical data: 0 1 2
horizontal data: [row 0][row 1][row 2]
Instead of:
horizontal data: 0 1 2
vertical data: [column 0][column 1][column 2]
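To make the doubt about the index concrete: assuming the pixels are stored row by row (row-major), which is how Processing lays out pixels[], the index of a pixel is row * width + column. A minimal sketch in C, where pixel_index is a hypothetical helper name:

```c
#include <assert.h>

/* Hypothetical helper: index of pixel (row, col) in a row-major image. */
static int pixel_index(int row, int col, int width)
{
    assert(width > 0 && row >= 0 && col >= 0 && col < width);
    return row * width + col;
}
```

With the loops above, x counts rows (up to img.height) and y counts columns (up to img.width), so under this assumption the index would be y + x * img.width rather than x + y * width.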

vector field on gnuplot with u and v components

I'm solving the Navier-Stokes equations for incompressible fluid flow through a square region with an obstacle. As output I get the X and Y components of the velocity, each as an NxN matrix. How can I plot the vector field for it in gnuplot?
I found this answer but I can't understand what values to put for x, y, dx, dy.
Can anyone explain how to use my output to plot vector field?
UPDATE
I tried doing as @LutzL said, but something seems to be wrong with my code. Is everything alright with this code?
int main() {
ifstream finu("U"), finv("V");
int N = 41, M = 41;
auto
**u = new double *[N],
**v = new double *[N];
for (int i = 0; i < N; i++) {
u[i] = new double[M];
v[i] = new double[M];
}
double
dx = 1.0 / (N - 1),
dy = 1.0 / (M - 1);
for (int i = 0; i < N; i++) {
for (int j = 0; j < M; j++) {
finu >> u[i][j];
finv >> v[i][j];
}
}
ofstream foutvec("vec");
for (int i = 0; i < N; i++) {
for (int j = 0; j < M; j++) {
foutvec << dx * i << "\t" << dy * j << "\t" << u[i][j] << "\t" << v[i][j] << endl;
}
}
ofstream plt("graph.plt");
plt << "set term pngcairo"
"\nset title 'Navier-Stokes Equation'"
"\nset output 'vec.png'"
"\nplot 'vec' w vec";
plt.close();
system("gnuplot graph.plt");
return 0;
}
As an output I get a bit weird field.
You need to save your result in a text file with lines
x[i] y[j] X[i,j] Y[i,j]
for all of the pairs i,j. Then use gnuplot with the "traditional" vector field command.
You only need using if you put additional columns into that file and the vectors to display are not simply the 3rd and 4th columns. One use might be that you compute a scaling factor R[i,j] to display X/R, Y/R. Put that into the 5th column:
x[i] y[j] X[i,j] Y[i,j] R[i,j]
and call with using 1:2:($3/$5):($4/$5) to perform the scaling in gnuplot.
In the code in the update and the resulting image, one sees that the vector field is too large to plot. Scale with dt for some reasonable time step, in the gnuplot commands this could be done via
dt = 0.01
plot 'vec' u 1:2:(dt*$3):(dt*$4) w vec
The incomplete plot hints at an incomplete data file on the disk. Flush or close the output stream for the vector data before calling gnuplot.
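As a rough illustration of the file layout and of closing the file before plotting, here is a sketch in C. The function name write_vector_file and the row-major storage of u and v are assumptions made for the example, not part of the code in the question.

```c
#include <assert.h>
#include <stdio.h>

/* Write one "x y u v" line per grid point. u and v are n*m row-major arrays
   (an assumption for this sketch). Returns 0 on success, -1 on failure. */
static int write_vector_file(const char *path, const double *u, const double *v,
                             int n, int m, double dx, double dy)
{
    assert(path != NULL && u != NULL && v != NULL && n > 0 && m > 0);
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            fprintf(f, "%g\t%g\t%g\t%g\n", dx * i, dy * j,
                    u[i * m + j], v[i * m + j]);
    /* fclose flushes the buffered lines to disk; only start gnuplot after it. */
    return fclose(f) == 0 ? 0 : -1;
}
```

Only once this function has returned 0 is the whole file guaranteed to be handed to the operating system, so a call such as system("gnuplot graph.plt") belongs after it.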

OpenCL: Move data between __global memory

I am trying to move some data between two global memory buffers before running a kernel on it.
Here buffer contains data that needs to be written into array, but sadly not contiguously:
void exchange_2_halo_write(
__global float2 *array,
__global float *buffer,
const unsigned int im,
const unsigned int jm,
const unsigned int km
) {
const unsigned int v_dim = 2;
unsigned int i, j, k, v, i_buf = 0;
// Which vector component, ie along v_dim
for (v = 0; v < v_dim; v++) {
// top halo
for (k = 0; k < km; k++) {
for (i = 0; i < im; i++) {
((__global float*)&array[i + k*im*jm])[v] = buffer[i_buf];
i_buf++;
}
}
// bottom halo
for (k = 0; k < km; k++) {
for (i = 0; i < im; i++) {
((__global float*)&array[i + k*im*jm + im*(jm-1)])[v] = buffer[i_buf];
i_buf++;
}
}
// left halo
for (k = 0; k < km; k++) {
for (j = 1; j < jm-1; j++) {
((__global float*)&array[j*im + k*im*jm])[v] = buffer[i_buf];
i_buf++;
}
}
// right halo
for (k = 0; k < km; k++) {
for (j = 1; j < jm-1; j++) {
((__global float*)&array[j*im + k*im*jm + (im-1)])[v] = buffer[i_buf];
i_buf++;
}
}
}
}
This works really fine in C (with a few minor changes), and for the data size I need (im = 150, jm = 150, km = 90, buf_sz = 107280), it runs in about 0.02s.
I had expected the same code to be slower on the GPU, but not that much slower: it actually takes about 90 minutes to do the same thing (that's about 250000x slower!).
Simply doing a straight copy takes about 15 minutes, which clearly shows it is not the way to go.
for (i = 0; i < buf_sz; i++) {
array[i] = buffer[i];
}
In that case, I have seen that I can do something like this:
int xid = get_global_id(0);
array[xid] = buffer[xid];
which seems to work fine/quickly.
However, I do not know how to adapt this to use the conditions I have in the first code.
The top and bottom_halo parts have im contiguous elements to transfer to array, which I think means it could be ok to transfer easily. Sadly the left and right_halos don't.
Also with better code, can I expect to get somewhat close to the CPU time? If it is impossible to do it in, say, under 1s, it's probably going to be a waste.
Thank you.
Before the answer, one remark. When you do a for loop inside a kernel, like this:
for (i = 0; i < buf_sz; i++) {
array[i] = buffer[i];
}
And you launch, e.g., 512 work items, you are doing the whole copy 512 times, not doing it in parallel with 512 threads. So obviously it is going to be even slower, more than 512x slower!
That said, you can split it in this way, with a 2D global size of km x max(im,jm):
__kernel void exchange_2_halo_write(
__global float2 *array,
__global float *buffer,
const unsigned int im,
const unsigned int jm
) {
const unsigned int v_dim = 2;
const unsigned int k = get_global_id(0);
const unsigned int i = get_global_id(1);
const unsigned int km = get_global_size(0);
// Number of buffer elements per vector component
const unsigned int v_sz = 2*km*im + 2*km*(jm-2);
// Which vector component, ie along v_dim
for (unsigned int v = 0; v < v_dim; v++) {
if(i < im){
// top halo
((__global float*)&array[i + k*im*jm])[v] = buffer[v*v_sz + k*im + i];
// bottom halo
((__global float*)&array[i + k*im*jm + im*(jm-1)])[v] = buffer[v*v_sz + km*im + k*im + i];
}
if(i < jm-1 && i > 0){
// left halo
((__global float*)&array[i*im + k*im*jm])[v] = buffer[v*v_sz + 2*km*im + k*(jm-2) + (i-1)];
// right halo
((__global float*)&array[i*im + k*im*jm + (im-1)])[v] = buffer[v*v_sz + 2*km*im + km*(jm-2) + k*(jm-2) + (i-1)];
}
}
}
Other options are possible, like using local memory, but that is tedious work.
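When each work item computes its own position in buffer, the closed-form offsets have to reproduce the order in which the sequential loops in the question advance i_buf. The following C sketch writes those offsets down so they can be checked on the host. The function names are hypothetical, and the layout (per component: top, bottom, left, right, each ordered by k and then by i or j) is taken from the question's loops.

```c
#include <assert.h>

/* Elements stored per vector component: top + bottom rows, left + right columns. */
static unsigned halo_stride(unsigned im, unsigned jm, unsigned km)
{
    assert(im > 0 && jm >= 2 && km > 0);
    return 2 * km * im + 2 * km * (jm - 2);
}

/* Offset of the top-halo element (component v, layer k, column i). */
static unsigned top_offset(unsigned v, unsigned k, unsigned i,
                           unsigned im, unsigned jm, unsigned km)
{
    return v * halo_stride(im, jm, km) + k * im + i;
}

/* The bottom halo follows the complete top halo (km * im elements). */
static unsigned bottom_offset(unsigned v, unsigned k, unsigned i,
                              unsigned im, unsigned jm, unsigned km)
{
    return top_offset(v, k, i, im, jm, km) + km * im;
}

/* Offset of the left-halo element; j runs over 1 .. jm-2. */
static unsigned left_offset(unsigned v, unsigned k, unsigned j,
                            unsigned im, unsigned jm, unsigned km)
{
    return v * halo_stride(im, jm, km) + 2 * km * im + k * (jm - 2) + (j - 1);
}

/* The right halo follows the complete left halo (km * (jm-2) elements). */
static unsigned right_offset(unsigned v, unsigned k, unsigned j,
                             unsigned im, unsigned jm, unsigned km)
{
    return left_offset(v, k, j, im, jm, km) + km * (jm - 2);
}
```

For im = 150, jm = 150, km = 90 the stride is 53640 per component, which matches the buf_sz of 107280 quoted in the question for two components.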

why fundamental Frequency and magnitude are not null when microphone is off?

I would like to do real-time audio processing with Qt and display the spectrum using FFTW3.
What I've done in steps:
I capture any sound from the computer's audio device and fill it into the buffer.
I assign the sound samples to a double array.
I compute the fundamental frequency.
When I display the fundamental frequency and magnitude while the microphone is on but there is no signal (silence), the fundamental frequency is not what I expected: the code doesn't always return zero, and sometimes it returns 1500 Hz or 2000 Hz as the frequency.
And when the microphone is off (muted), the code doesn't return zero as the fundamental frequency but a number between 0 and 9000 Hz. Any help would be appreciated.
Here is my code:
QByteArray *buffer;
QAudioInput *audioInput;
audioInput = new QAudioInput(format, this);
//Check the number of samples in input buffer
qint64 len = audioInput->bytesReady();
//Limit sample size
if(len > 4096)
len = 4096;
//Read sound samples from input device to buffer
qint64 l = input->read(buffer.data(), len);
int input_size= BufferSize;
int output_size = input_size; //input_size/2+1;
fftw_plan p3;
double in[output_size];
fftw_complex out[output_size];
short *outdata = (short*)m_buffer.data();// assign sample into short array
int data_size = size_t(outdata);
int data_size1 = sizeof(outdata);
int count = 0;
double w = 0;
for(int i(chanelNumber); i < output_size/2; i= i + 2) //fill array in
{
w= 0.5 * (1 - cos(2*M_PI*i/output_size)); // Hann Windows
double x = 0;
if(i < data_size){
x = outdata[i];
}
if(count < output_size){
in[count] = x;// fill Array In with sample from buffer
count++;
}
}
for(int i=count; i<output_size; i++){
in[i] = 0;
}
p3 = fftw_plan_dft_r2c_1d(output_size, in, out, FFTW_ESTIMATE);// create Plan
fftw_execute(p3);// FFT
long Peak = 0;
double AmplitudeMax = 0;
for (int i = 0; i < (output_size/2); i++) {
double r1 = out[i][0] * out[i][0]; // real part squared
double im1 = out[i][1] * out[i][1]; // imaginary part squared
double t1 = r1 + im1;
//double t = 20*log(sqrt(t1));
double Magnitude = sqrt(t1)/(double)(output_size/2);
double f = (double)i*8000 / ((double)output_size/2);
if(Magnitude > AmplitudeMax)
{
AmplitudeMax = Magnitude;
Peak =2* i;
}
}
fftw_destroy_plan(p3);
return Peak*(static_cast<double>(8000)/output_size);
What you think is silence might contain some small amount of noise. The FFT of random noise will also appear random, and thus have a random magnitude peak. But it is possible that noise might come from equipment or electronics in the environment (fans, flyback transformers, etc.), or the power supply to your ADC or mic, thus showing some frequency biases.
If the noise level is low enough, normally one checks the level of the magnitude peak, compares it against a threshold, and cuts off frequency estimation reporting below this threshold.
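The thresholding idea can be sketched in C as follows. The function name peak_frequency, the skipping of bin 0 and the way the threshold is passed in are assumptions made for illustration; a suitable threshold value depends on the hardware and has to be found by measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Return the frequency in Hz of the largest magnitude bin, or 0.0 when that
   peak is below `threshold` and is therefore treated as silence.
   mag holds n_bins magnitudes (bins 0 .. n_bins-1) of an fft_size-point FFT. */
static double peak_frequency(const double *mag, size_t n_bins, size_t fft_size,
                             double sample_rate, double threshold)
{
    assert(mag != NULL && fft_size > 0);
    size_t peak = 0;
    double max_mag = 0.0;
    for (size_t i = 1; i < n_bins; ++i) {   /* skip bin 0 (DC offset) */
        if (mag[i] > max_mag) {
            max_mag = mag[i];
            peak = i;
        }
    }
    if (max_mag < threshold)
        return 0.0;                          /* too quiet: report no frequency */
    return (double)peak * sample_rate / (double)fft_size;
}
```

With this check, background noise whose peak stays under the threshold is reported as 0 Hz instead of a random frequency.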

Best way to copy 2D or 3D data to Local Memory

I am starting to do a lot of work in 3D for my OpenCL kernels for filtering. Is there an optimum way to copy a 2D or 3D subset from global memory into local or private memory?
The use for this could be to take a 3D dataset and apply a 3D kernel (or operate on the space occupied by the 3D kernel). Each thread is going to look at one pixel, crop the data around the pixel in 3 dimensions that is the size of a kernel (say 1, 3, 5, etc), copy this subset of data to local or private memory, and then compute, for example, the Standard Deviation of the subset of data.
The easiest and least efficient way is just by brute force:
__kernel void Filter_3D_StdDev(__global float *Data_3D_In,
int KernelSize){
//Note: KernelSize is always ODD
int k = get_global_id(0); //also z
int j = get_global_id(1); //also y
int i = get_global_id(2); //also x
//Convert 3D to 1D
int linear_coord = i + get_global_size(0)*j + get_global_size(0)*get_global_size(1)*k;
//private memory
float Subset[KernelSize*KernelSize*KernelSize];
int HalfKernel = (KernelSize - 1)/2; //compute the pixel radius
for(int z = -HalfKernel ; z <= HalfKernel; z++){
for(int y = -HalfKernel ; y <= HalfKernel; y++){
for(int x = -HalfKernel ; x <= HalfKernel; x++){
int index = (i + x) + get_global_size(0)*(j + y) + \
get_global_size(0)*get_global_size(1)*(k + z);
Subset[x + HalfKernel + (y + HalfKernel)*KernelSize + (z + HalfKernel)*KernelSize*KernelSize] = Data_3D_In[index];
}
}
}
//Filter subset here
}
This is horribly inefficient since so many reads are made from global memory. Is there a way to improve this?
My first thought is to use vload to reduce the number of loops, such as:
__kernel void Filter_3D_StdDev(__global float *Data_3D_In,
int KernelSize){
//Note: KernelSize is always ODD
int k = get_global_id(0); //also z
int j = get_global_id(1); //also y
int i = get_global_id(2); //also x
//Convert 3D to 1D
int linear_coord = i + get_global_size(0)*j + get_global_size(0)*get_global_size(1)*k;
//private memory
float Subset[KernelSize*KernelSize*KernelSize];
int HalfKernel = (KernelSize - 1)/2; //compute the pixel radius
for(int z = -HalfKernel ; z <= HalfKernel; z++){
for(int y = -HalfKernel ; y <= HalfKernel; y++){
//##TODO##
//Automatically determine which vload to use based on Kernel Size
//for now, use vload3
int index = (i + -HalfKernel) + get_global_size(0)*(j + y) + \
get_global_size(0)*get_global_size(1)*(k + z);
int subset_index = (y + HalfKernel)*KernelSize + (z + HalfKernel)*KernelSize*KernelSize;
//vload3/vstore3 multiply their offset argument by 3, so offset the pointer instead
float3 temp = vload3(0, Data_3D_In + index);
vstore3(temp, 0, Subset + subset_index);
}
}
//Filter subset here
}
Is there an even better way?
Thanks in Advance!
First off, you need to unroll those loops. For that the loop bounds must be known at compile time, so you will have to make several copies of the function or do string replacement on the source before you compile. Just as a test, do:
#define HALF_KERNEL_SIZE 2
#pragma unroll HALF_KERNEL_SIZE * 2 + 1
for(int z = -HALF_KERNEL_SIZE ; z <= HALF_KERNEL_SIZE ; z++){
#pragma unroll HALF_KERNEL_SIZE * 2 + 1
for(int y = -HALF_KERNEL_SIZE ; y <= HALF_KERNEL_SIZE ; y++){
On the GPU you should read the data into local memory (especially for the 5x5x5 kernels), because otherwise you go back to global memory A LOT for data you have already fetched. On the CPU it is not as big of an issue.
So do this exactly as you would do for convolution but with an extra dimension:
1. Read in a block (or cube) of memory into local memory for a number of threads.
2. Create a barrier to make sure all data is read before you continue.
3. Sample into your local memory using your local id as an offset.
4. Test various local workgroup sizes until you get best performance
Everything else is the same. For the larger kernels with a bigger overlap this will be orders of magnitude faster.
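The indexing behind steps 1 and 3 can be tried out on the host with a plain C emulation. This is a sketch under stated assumptions, not kernel code: load_tile, tile_sample and clamp_coord are hypothetical names, the work-group is assumed to be a cube of edge lsz, and border cells are clamped to the volume edge. In a real kernel the work-items of a group would share the copy in load_tile, and step 2 would be a barrier(CLK_LOCAL_MEM_FENCE) between loading and sampling.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp a coordinate to [0, n-1] so halo reads at the volume border stay valid. */
static int clamp_coord(int c, int n)
{
    return c < 0 ? 0 : (c >= n ? n - 1 : c);
}

/* Step 1: copy the cells needed by one work-group, plus a halo of `half` cells
   on every side, from the nx*ny*nz volume into `tile`. (gx, gy, gz) is the
   global position of the group's first work-item; tile edge is lsz + 2*half. */
static void load_tile(const float *vol, int nx, int ny, int nz,
                      int gx, int gy, int gz, int lsz, int half, float *tile)
{
    int t = lsz + 2 * half;
    assert(vol != NULL && tile != NULL && lsz > 0 && half >= 0);
    for (int z = 0; z < t; ++z)
        for (int y = 0; y < t; ++y)
            for (int x = 0; x < t; ++x) {
                int sx = clamp_coord(gx + x - half, nx);
                int sy = clamp_coord(gy + y - half, ny);
                int sz = clamp_coord(gz + z - half, nz);
                tile[x + t * (y + t * z)] = vol[sx + nx * (sy + ny * sz)];
            }
}

/* Step 3: read the neighbour at offset (dx, dy, dz), each in -half .. half,
   of the work-item with local id (lx, ly, lz). */
static float tile_sample(const float *tile, int lsz, int half,
                         int lx, int ly, int lz, int dx, int dy, int dz)
{
    int t = lsz + 2 * half;
    return tile[(lx + half + dx) + t * ((ly + half + dy) + t * (lz + half + dz))];
}
```

Each cell of the tile is fetched from the large array once per work-group, while every work-item reads its whole neighbourhood from the small tile. That reuse is where the saving for larger kernels comes from.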
