I would like to implement an image filtering algorithm using OpenCL but the image size is very large (4096 x 4096). I understand that the copy time to the OpenCL device may take too long.
Do you think it makes sense to address this problem by using a parallel copy in combination with OpenCL kernel execution?
E.g., below is my approach:
1) Split the full image into 2 parts.
2) Copy the first half to the device.
3) Execute the image filtering kernel on the device, then copy the 2nd half of the image to the device.
4) Wait until the kernel on the first half completes, then call the kernel again to process the 2nd half.
5) Wait until the 2nd half finishes.
Best regards,
The OpenCL thread of execution is completely independent of your application, so there is no need to "wait" after each call. Just enqueue all the commands to OpenCL and it should schedule them properly.
The only requirement is to have 2 queues, in order to be able to run commands in parallel. So you will need an I/O queue and an execution queue. A single queue (even in out-of-order mode) can never run 2 operations in parallel.
Here is an example approach with events; you can call clFlush() on the queues right after doing the enqueues to get them submitted sooner (shown below as the optional flush() calls).
//Create 2 queues (at creation only!)
mQueueIO = cl::CommandQueue(context, device[0], 0);
mQueueRun = cl::CommandQueue(context, device[0], 0);
//Everytime you run your image filter
//Queue the 2 writes
cl::Event wev1; //Event to know when the write finishes
mQueueIO.enqueueWriteBuffer(ImageBufferCL, CL_FALSE, 0, size/2, imageCPU, NULL, &wev1);
cl::Event wev2; //Event to know when the write finishes
mQueueIO.enqueueWriteBuffer(ImageBufferCL, CL_FALSE, size/2, size/2, imageCPU+size/2, NULL, &wev2);
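mQueueIO.flush(); //optional: the clFlush() mentioned above, so the writes start right away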
//Queue the 2 runs (with the proper dependency)
std::vector<cl::Event> wait;
wait.push_back(wev1);
cl::Event ev1; //Event to track the finish of the run command
mQueueRun.enqueueNDRangeKernel(kernel, cl::NDRange(0), cl::NDRange(size/2), cl::NDRange(localsize), &wait, &ev1);
wait[0] = wev2;
cl::Event ev2; //Event to track the finish of the run command
mQueueRun.enqueueNDRangeKernel(kernel, cl::NDRange(size/2), cl::NDRange(size/2), cl::NDRange(localsize), &wait, &ev2);
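mQueueRun.flush(); //optional: submit the kernels too, so each starts as soon as its write finishes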
//Read back the data when it has finished
std::vector<cl::Event> rev(2);
wait[0] = ev1;
mQueueIO.enqueueReadBuffer(ImageBufferCL, CL_FALSE, 0, size/2, imageCPU, &wait, &rev[0]);
wait[0] = ev2;
mQueueIO.enqueueReadBuffer(ImageBufferCL, CL_FALSE, size/2, size/2, imageCPU + size/2, &wait, &rev[1]);
rev[0].wait();
rev[1].wait();
Notice how I create 2 events for the writes; these are the wait events of the kernel executions. The 2 events for the executions are in turn the wait events of the reads.
In the last part I create another 2 events for the reads, but they are not really needed; you could use blocking reads instead.
Try using out-of-order queues - most implementations' hardware should support them. You'll want to use the global offset parameter in your kernels along with global_id where applicable. At some point you will get diminishing returns with a division strategy like this, but there should be a split count that gives a good payoff in latency reduction - I would guess [2, 100] is a good interval to brute-force profile. Be aware that only one kernel can write to a memory buffer at a time, so make sure the input buffer is const (read-only). Be aware that you must also merge the results of the N buffer splits into one output in another kernel - this means you will effectively write all pixels twice to GDS. OpenCL 2.0 may be able to save us all these divided writes with its image types, if you are able to use it.
cl::CommandQueue queue(context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);
cl::Event last_event;
std::vector<cl::Event> events;
std::vector<cl::Buffer> output_buffers; //initialize with however many splits you have; ensure there is enough for what is written, and update the kernel to only write to its relative region.
//you might achieve finer granularity with even more splits -
//just make sure the kernel uses the global offset;
//in that case, adjust this code into a loop
set_args(kernel, image_input, output_buffers[0]);
queue.enqueueNDRangeKernel(kernel, cl::NDRange(0, 0), cl::NDRange(cols * local_size[0], (rows/2) * local_size[1]), cl::NDRange(local_size[0], local_size[1]), NULL, &last_event); events.push_back(last_event);
set_args(kernel, image_input, output_buffers[1]);
queue.enqueueNDRangeKernel(kernel, cl::NDRange(0, (rows/2) * local_size[1]), cl::NDRange(cols * local_size[0], (rows - rows/2) * local_size[1]), cl::NDRange(local_size[0], local_size[1]), NULL, &last_event); events.push_back(last_event);
set_args(merge_buffers_kernel, output_buffers...); //pseudocode: pass every split buffer plus the final output
queue.enqueueNDRangeKernel(merge_buffers_kernel, cl::NDRange(0, 0), cl::NDRange(cols * local_size[0], rows * local_size[1]), cl::NDRange(local_size[0], local_size[1]), &events, &last_event);
events.push_back(last_event);
cl::Event::waitForEvents(events);
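On the kernel side, note that get_global_id() already includes the global offset passed at enqueue time, so the same kernel body can process any split; a minimal sketch (the argument list and the filter itself are placeholders, not the poster's actual kernel):
__kernel void filter(__global const uchar* in, __global uchar* out)
{
    const size_t x = get_global_id(0); //already includes the enqueue-time global offset
    const size_t y = get_global_id(1);
    //... apply the filter to pixel (x, y), writing only into this split's region of out ...
}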
I wrote a barebones program template in XC8 (1.37) that I use to develop and test new GLCD functions for the 18F family. Programming is done via a PICkit3. Since I need to quickly reprogram the code several times, it is really important that programming is as fast as possible.
Typically, the code size is around 2K and it takes less than 10 sec to program.
Everything is fine until I must use a font table, defined as:
const char font8[] = {....
Now, with just $400 bytes added, the compiler places the table at the end of ROM, and programming the 64K memory takes more than 1 minute.
Is there any way to avoid this?
I tried to manually limit the memory range in the MPLAB X options, but this is annoying and a little unsafe (sometimes part of the code is truncated).
A while back I had to write some code for emissions testing, where I needed to copy data between extreme ends of RAM. To do that I needed to specify the exact memory addresses. You can also use the C extension __at() construct. http://ww1.microchip.com/downloads/en/DeviceDoc/50002053F.pdf#page=27
int scanMode __at(0x200);
const char keys[] __at(123) = { 'r', 's', 'u', 'd' };
int modify(int x) __at(0x1000) {
return x * 2 + 3;
}
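For the font table from the question, the same construct should keep the programmed range compact; a rough sketch (0x0800 is only an example address - check the map file so it does not collide with your code):
const char font8[] __at(0x0800) = {
    /* ... font data ... */
};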
I am trying to allocate a 2GB buffer using huge TLB pages (1GB) and bind the memory region to a specific NUMA node.
To allocate the buffer using huge TLB page, I am using the following code:
shmid = shmget (IPC_PRIVATE, buf_size,
SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
buf = (uint64_t *) shmat (shmid, 0, 0);
Then, I called:
numa_tonode_memory (buf, buf_size, 3);
to move the buffer to a specific node.
When I run the program, as soon as I access buffer offset larger than 1GB, the program would stop with "Bus error (core dumped)".
Removing numa_tonode_memory would avoid the error; however, it would also defeat the purpose of allocating memory on a specific node.
I am wondering if there is any workaround for this problem,
Thank you,
When an out-of-memory error is raised in a parfor, is there any way to kill only one Matlab slave to free some memory instead of having the entire script terminate?
Here is what happens by default when an out-of-memory error occurs in a parfor: the script terminates, as shown in the screenshot below.
I wish there were a way to just kill one slave (i.e. remove a worker from the parpool), or stop using it, to release as much memory as possible from it.
If you get an out-of-memory error in the master process, there is no chance to fix this. For an out-of-memory error on a slave, this should do it:
The simple idea of the code: restart the parfor again and again with the missing data until you get all results. If one iteration fails, a flag (file) is written which lets all iterations throw an error as soon as the first error occurs. This way we get "out of the loop" without wasting time producing further out-of-memory errors.
%Your intended iterator
iterator=1:10;
%flags which indicate what succeeded
succeeded=false(size(iterator));
%result array
result=nan(size(iterator));
FLAG='ANY_WORKER_CRASHED';
while ~all(succeeded)
fprintf('Another try\n')
%determine which iterations should be done
todo=iterator(~succeeded);
%initialize array for the remaining results
partresult=nan(size(todo));
%initialize flags which indicate which iterations succeeded (we can not
%throw errors, as that throws away results)
partsucceeded=false(size(todo));
%flag indicates that any worker crashed. Have to use file based
%solution, don't know a better one.
delete(FLAG);
try
parfor falseindex=1:sum(~succeeded)
realindex=todo(falseindex);
try
% The flag is used to let all other workers jump out of the
% loop as soon as one calculation has crashed.
if exist(FLAG,'file')
error('some other worker crashed');
end
% insert your code here
%dummy code which randomly throws an exception
if rand<.5
error('hit out of memory')
end
partresult(falseindex)=realindex*2;
% End of user code
partsucceeded(falseindex)=true;
fprintf('trying to run %d and succeeded\n',realindex)
catch ME
% catch errors within workers to preserve work
partresult(falseindex)=nan;
partsucceeded(falseindex)=false;
fprintf('trying to run %d but it failed\n',realindex)
fclose(fopen(FLAG,'w'));
end
end
catch
%reduce poolsize by 1
newsize = matlabpool('size')-1;
matlabpool close
matlabpool(newsize)
end
%put the result of the current iteration into the full result
result(~succeeded)=partresult;
succeeded(~succeeded)=partsucceeded;
end
After quite a bit of research, and a lot of trial and error, I think I may have a decent, compact answer. What you're going to do is:
Declare some max memory value. You can set it dynamically using the MATLAB function memory, but I like to set it directly.
Call memory inside your parfor loop, which returns the memory information for that particular worker.
If the memory used by the worker exceeds the threshold, cancel the task that worker was working on. Now, here it gets a bit tricky. Depending on the way you're using parfor, you'll either need to delete or cancel either the task or the worker. I've verified that it works with the code below when there is one task per worker, on a remote cluster.
Insert the following code at the beginning of your parfor contents. Tweak as necessary.
memLimit = 280000000; %// This doesn't have to be in parfor. Everything else does.
memData = memory;
if memData.MemUsedMATLAB > memLimit
task = getCurrentTask();
cancel(task);
end
Enjoy! (Fun question, by the way.)
One other option to consider is that since R2013b, you can open a parallel pool with 'SpmdEnabled' set to false - this allows MATLAB worker processes to die without the whole pool being shut down - see the doc here: http://www.mathworks.co.uk/help/distcomp/parpool.html. Of course, you still need to arrange somehow to shut down the workers.
There is an example in the OpenCL NVIDIA SDK, oclCopyComputeOverlap, that uses 2 queues to alternately transfer buffers / execute kernels.
In this example mapped memory is used.
//pinned memory
cmPinnedSrcA = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, szBuffBytes, NULL, &ciErrNum);
//host pointer for pinned memory
fSourceA = (cl_float*)clEnqueueMapBuffer(cqCommandQueue[0], cmPinnedSrcA, CL_TRUE, CL_MAP_WRITE, 0, szBuffBytes, 0, NULL, NULL, &ciErrNum);
...
//normal device buffer
cmDevSrcA = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, szBuffBytes, NULL, &ciErrNum);
//write half the data from host pointer to device buffer
ciErrNum = clEnqueueWriteBuffer(cqCommandQueue[0], cmDevSrcA, CL_FALSE, 0, szHalfBuffer, (void*)&fSourceA[0], 0, NULL, NULL);
I have 2 questions:
1) Is there any need to use pinned memory for the overlap to occur? Couldn't fSourceA just be a simple host pointer:
fSourceA = (cl_float *)malloc(szBuffBytes);
...
//write random data in fSourceA
2) cmPinnedSrcA is not used in the kernel, instead cmDevSrcA is used. Doesn't the space occupied by the buffers on the device still grow? (space required for cmPinnedSrcA added to the space required for cmDevSrcA)
Thank you
If I understood your question properly:
1)
Yes, you can use any kind of memory (pinned, host pointer, etc.) and the overlap will still occur, as long as you use two queues and the HW/driver supports it.
But remember that the queues are always unsynchronized, so in this case events are needed to prevent the copy queue from copying inconsistent data that the running kernel is still using.
2) I think you are using twice the memory if you use pinned memory: one copy for the pinned buffer and another one for a temporary copy. But I am not 100% sure; maybe it is only a pointer.
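A minimal C-API sketch of the event dependency mentioned in 1), reusing names from the SDK sample above (cqCommandQueue[1], ckKernel and szGlobalWorkSize are assumptions about that sample): the write on the copy queue is made to wait for the kernel that still uses the same region.
cl_event kernel_done;
ciErrNum = clEnqueueNDRangeKernel(cqCommandQueue[1], ckKernel, 1, NULL,
                                  &szGlobalWorkSize, NULL, 0, NULL, &kernel_done);
//the next transfer on the copy queue must not start before that kernel has finished
ciErrNum = clEnqueueWriteBuffer(cqCommandQueue[0], cmDevSrcA, CL_FALSE, 0,
                                szHalfBuffer, (void*)fSourceA, 1, &kernel_done, NULL);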
I am performing a benchmark like the one shown below.
CHECK( context = clCreateContext(props, 1, &device, NULL, NULL, &_err); );
CHECK( queue = clCreateCommandQueue(context, device, 0, &_err); );
#define SYNC() clFinish(queue)
#define LAUNCH(glob, loc, kernel) OCL(clEnqueueNDRangeKernel(queue, kernel, 2,\
NULL, glob, loc,\
0, NULL, NULL))
/* Build program, set arguments over here */
START;
for (int i = 0; i < iter; i++) {
LAUNCH(global, local, plus_kernel);
}
SYNC();
STOP;
printf("Time taken (plus) : %lf\n", uSec / iter);
START;
for (int i = 0; i < iter; i++) {
LAUNCH(global, local, minus_kernel);
}
SYNC();
STOP;
printf("Time taken (minus): %lf\n", uSec / iter);
START;
for (int i = 0; i < iter; i++) {
LAUNCH(global, local, plus_kernel);
LAUNCH(global, local, minus_kernel);
}
SYNC();
STOP;
printf("Time taken (both) : %lf\n", uSec / iter);
The results look weird:
Time taken (plus) : 31.450000
Time taken (minus): 28.120000
Time taken (both) : 2256.380000
START and STOP are just macros that start and stop a timer; uSec is the measured time in microseconds.
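A plausible equivalent, purely as a sketch of what those macros might look like (the actual definitions are not shown here; gettimeofday() is an assumption):
#include <sys/time.h>
static struct timeval t0, t1;
static double uSec;
#define START gettimeofday(&t0, NULL)
#define STOP  do { gettimeofday(&t1, NULL); \
                   uSec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec); } while (0)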
I am not sure why queuing up both kernels is slowing them down (and only on AMD GPUs)!
EDIT: I am using a Radeon 7970.
EDIT: Both kernels are operating on independent memory. Here is the system information as well.
OS: Ubuntu 11.10
fglrxinfo:
display: :0 screen: 0
OpenGL vendor string: Advanced Micro Devices, Inc.
OpenGL renderer string: AMD Radeon HD 7900 Series
OpenGL version string: 4.2.11762 Compatibility Profile Context
I think the answer has to do with caching of data on newer GPUs (specifically the Radeon 7970, which uses the Graphics Core Next (GCN) architecture).
One of the advantages of this architecture is its caching capabilities (somewhat close to CPU caching at this point). If you perform calls like this:
PLUS
PLUS
PLUS
....
Then the memory stays resident in the inner caches of the GPU. On the other hand, if you make calls like this:
PLUS
MINUS
PLUS
MINUS
...
Where the two kernels have different memory objects associated with them, the data is kicked out of the caches on each CU and has to be brought back in from the much slower global memory.
Two easy ways to test if this is the case:
Run only Pluses with varying numbers of iterations. As the number of iterations increases, the average time will go down because the cost of the first run (which brings the data in) is amortized. You should also notice that all calls after the first take roughly the same time.
Make the Plus and Minus kernels run on the same memory objects (see the sketch below). If the reason for the slowdown is the caching of memory objects, then the overall run time should be roughly the average of the individual running times of PLUS and MINUS (depending perhaps on experiment 1).
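A sketch of experiment 2, assuming both kernels take a single buffer argument at index 0 and that shared_buffer is a cl_mem created beforehand (all names here are illustrative, not from the original code):
/* point both kernels at the same memory object, then rerun the "both" loop */
OCL(clSetKernelArg(plus_kernel,  0, sizeof(cl_mem), &shared_buffer));
OCL(clSetKernelArg(minus_kernel, 0, sizeof(cl_mem), &shared_buffer));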
Let me know if you find out if this is actually the case!