Allocating memory using huge pages and numa_tonode_memory gives "Bus Error" - numa

I am trying to allocate a 2 GB buffer using huge TLB pages (1 GB each) and bind the memory region to a specific NUMA node.
To allocate the buffer using huge TLB page, I am using the following code:
shmid = shmget (IPC_PRIVATE, buf_size,
SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
buf = (uint64_t *) shmat (shmid, 0, 0);
Then, I called:
numa_tonode_memory (buf, buf_size, 3);
to move the buffer to a specific node.
When I run the program, as soon as I access a buffer offset larger than 1 GB, it stops with "Bus error (core dumped)".
Removing numa_tonode_memory avoids the error, but it also defeats the purpose of allocating the memory on a specific node.
I am wondering if there is any workaround for this problem.
Thank you,

Related

Parallel copy and opencl kernel execution

I would like to implement an image filtering algorithm using OpenCL but the image size is very large (4096 x 4096). I understand that the copy time to the OpenCL device may take too long.
Do you think it makes sense to address this problem by using a parallel copy in combination with OpenCL kernel execution?
E.g., below is my approach:
1) Split the full image into 2 parts.
2) Copy the first half to the device.
3) Execute the image filtering kernel on the device, then copy the 2nd half of the image to the device.
4) Wait until the kernel on the first half completes, then call the kernel again to process the 2nd part.
5) Wait until the 2nd part finishes.
Best regards,
The OpenCL threads of execution are completely independent of your host application, so there is no need to "wait" after each call. Just enqueue all the commands and OpenCL should schedule them properly.
The only requirement is to have 2 queues, in order to be able to run commands in parallel: an I/O queue and an execution queue. A single queue (even in out-of-order mode) can never run 2 operations in parallel.
Here is one example approach with events; you can call clFlush() on the queues right after doing the enqueues to speed them up.
//Create 2 queues (at creation only!)
mQueueIO = cl::CommandQueue(context, device[0], 0);
mQueueRun = cl::CommandQueue(context, device[0], 0);
//Every time you run your image filter
//Queue the 2 writes
cl::Event wev1; //Event to know when the first write finishes
mQueueIO.enqueueWriteBuffer(ImageBufferCL, CL_FALSE, 0, size/2, imageCPU, NULL, &wev1);
cl::Event wev2; //Event to know when the second write finishes
mQueueIO.enqueueWriteBuffer(ImageBufferCL, CL_FALSE, size/2, size/2, imageCPU+size/2, NULL, &wev2);
//Queue the 2 runs (with the proper dependency)
std::vector<cl::Event> wait;
wait.push_back(wev1);
cl::Event ev1; //Event to track the finish of the run command
mQueueRun.enqueueNDRangeKernel(kernel, cl::NDRange(0), cl::NDRange(size/2), cl::NDRange(localsize), &wait, &ev1);
wait[0] = wev2;
cl::Event ev2; //Event to track the finish of the run command
mQueueRun.enqueueNDRangeKernel(kernel, cl::NDRange(size/2), cl::NDRange(size/2), cl::NDRange(localsize), &wait, &ev2);
//Read back the data when it has finished
std::vector<cl::Event> rev(2);
wait[0] = ev1;
mQueueIO.enqueueReadBuffer(ImageBufferCL, CL_FALSE, 0, size/2, imageCPU, &wait, &rev[0]);
wait[0] = ev2;
mQueueIO.enqueueReadBuffer(ImageBufferCL, CL_FALSE, size/2, size/2, imageCPU + size/2, &wait, &rev[1]);
rev[0].wait();
rev[1].wait();
Notice how I create 2 events for the writes; these are the wait events of the execution. The 2 events for the execution are in turn the wait events for the reads.
In the last part I create another 2 events for the reads, but they are not really needed; you could use blocking reads instead.
Try using out-of-order queues - the hardware behind most implementations should support them. You'll want to use the global offset parameter in your kernels along with get_global_id() where applicable. At some point you will hit diminishing returns with a division strategy like this, but there should exist a split count that gives a good payoff in latency reduction - I would guess [2, 100] is a good interval to brute-force profile. Be aware that only one kernel can write to a memory buffer at a time, and make sure the input buffer is const (read-only). Be aware that you must also merge the results of the N buffer splits into one output with a final kernel - this means you effectively write all pixels twice to global memory. OpenCL 2.0 may be able to save us all these divided writes with its image types, if you are able to use it.
cl::CommandQueue queue(context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);
cl::Event last_event;
std::vector<cl::Event> events;
//initialize with however many splits you have; ensure there are at least
//enough buffers for what is written, and perhaps update the kernel to only
//write to its relative region
std::vector<cl::Buffer> output_buffers;
//you might approach finer granularity with even more splits -
//just make sure the kernel is using the global offset,
//in which case adjust this code into a loop
set_args(kernel, image_input, image_outputs[0]);
queue.enqueueNDRangeKernel(kernel, cl::NDRange(0, 0), cl::NDRange(cols * local_size[0], (rows/2) * local_size[1]), cl::NDRange(local_size[0], local_size[1]), NULL, &last_event);
events.push_back(last_event);
set_args(kernel, image_input, image_outputs[1]);
queue.enqueueNDRangeKernel(kernel, cl::NDRange(0, (rows/2) * local_size[1]), cl::NDRange(cols * local_size[0], (rows - rows/2) * local_size[1]), cl::NDRange(local_size[0], local_size[1]), NULL, &last_event);
events.push_back(last_event);
//the merge waits on both split kernels
set_args(merge_buffers_kernel, output_buffers...);
queue.enqueueNDRangeKernel(merge_buffers_kernel, cl::NDRange(), cl::NDRange(cols * local_size[0], rows * local_size[1]), cl::NDRange(local_size[0], local_size[1]), &events, &last_event);
events.push_back(last_event);
cl::WaitForEvents(events);

Prefix Sum with global memory and an error with local memory

I have a Mali GPU which does not support local memory at all.
Every time I run code that uses local memory, I get errors from the device.
So I want to port my code to a version that only uses global memory.
Is it possible to run a prefix sum / parallel reduction algorithm on the GPU using global memory only?
EDITED: While debugging the error I found a strange thing: one particular line is causing it.
I have a line like this:
`#define LOG_LSIZE 8`
`#define LSIZE_SHIFT_VALUE 4`
`#define LOG_NUM_BANKS 2`
`#define GET_CONFLICT_OFFSET(lid) ((lid) >> LOG_NUM_BANKS)`
`#define LSIZE 32`
`__local int lm_sum[2][LSIZE + LOG_LSIZE];`
`**lm_sum[lid >> LSIZE_SHIFT_VALUE][bi] += lm_sum[lid >> LSIZE_SHIFT_VALUE][ai];**`
lid is the local id, and I used a work-group size of 32. I found that the highlighted line is the cause of the error. I tried using fixed values and found that I cannot use lm_sum on the right-hand side of a statement; if I do, I get an error. For example, this line also gives me an error:
int temp = lm_sum[0][0];
Any idea on what is going on?
Error:
`In initial.cpp***[14100.684249] Mali<ERROR, BASE_MMU>: In file: /home/jbmaster/work/01.LPD_OpenCL_RFS/01.arm_work_3_0_31/SEC_All_EVT0_TX013-BU-00001-r2p0-00rel0/TX013-BU-00001-r2p0-00rel0/driver/product/kernel/drivers/gpu/arm/t6xx/kbase/src/common/mali_kbase_mmu.c line: 1240 function:kbase_mmu_report_fault_and_kill
[14100.709724] Unhandled Page fault in AS0 at VA 0x00000002000EC1A0
[14100.709728] raw fault status 0x500003C3
[14100.709730] decoded fault status: SLAVE FAULT
[14100.709733] exception type 0xC3: TRANSLATION_FAULT
[14100.709736] access type 0x3: WRITE
[14100.709738] source id 0x5000
[14100.734958]
[14100.736432] Mali<ERROR, BASE_JD>: In file: /home/jbmaster/work/01.LPD_OpenCL_RFS/01.arm_work_3_0_31/SEC_All_EVT0_TX013-BU-00001-r2p0-00rel0/TX013-BU-00001-r2p0-00rel0/driver/product/kernel/drivers/gpu/arm/t6xx/kbase/src/common/mali_kbase_jm.c line: 899 function:kbase_job_slot_hardstop
[14100.761458] Issueing GPU soft-reset instead of hard stopping job due to a hardware issue
[14100.769517] `
Since even lm_sum[0][0] doesn't work, the memory for the array is simply not allocated. You said your GPU doesn't support local memory, yet you are trying to use lm_sum, which is declared to live in local memory (__local int lm_sum[2][LSIZE + LOG_LSIZE]).

ftruncate fails the second time

I'm trying to extend a shared memory object after shm_open and ftruncate succeed the first time. Here is the code:
char *uuid = GenerateUUID();
int fd = shm_open(uuid, O_RDWR|O_CREAT|O_EXCL, S_IRUSR|S_IWUSR);
if(fd == -1) perror("shm_open");
size_t shmSize = sizeof(container);
int ret = ftruncate(fd, shmSize);
perror("ftruncate first");
ret = ftruncate(fd, shmSize * 2);
perror("ftruncate second");
It passes the first ftruncate, but the second ftruncate fails with errno = 22, "Invalid argument".
I also tried to ftruncate the memory object after mmap; according to ftruncate's man page, the shared memory should be zero-filled out to the new length.
Besides, I also tried to ftruncate the memory object in the child process (this is an IPC scenario between two processes); there ftruncate fails with "Invalid fd, no such file or directory", even though I can shm_open and mmap successfully in the child process.
Any ideas? Thanks!
I think this is a known "feature" of shm_open(), ftruncate(), and mmap().
You have to ftruncate() the first time through to give the shared memory a length, but on subsequent calls ftruncate() gives that error number 22, which you can simply ignore.
The implementation in use seems to conform to an older specification, in which returning an error is allowed behavior for ftruncate(fd, length) when length exceeds the previous length:
If the file previously was smaller than this size, ftruncate() shall
either increase the size of the file or fail.

OpenCL overlap communication and computation

There is an example in the NVIDIA OpenCL SDK, oclCopyComputeOverlap, that uses 2 queues to alternately transfer buffers and execute kernels.
In this example mapped memory is used.
//pinned memory
cmPinnedSrcA = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, szBuffBytes, NULL, &ciErrNum);
//host pointer for pinned memory
fSourceA = (cl_float*)clEnqueueMapBuffer(cqCommandQueue[0], cmPinnedSrcA, CL_TRUE, CL_MAP_WRITE, 0, szBuffBytes, 0, NULL, NULL, &ciErrNum);
...
//normal device buffer
cmDevSrcA = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, szBuffBytes, NULL, &ciErrNum);
//write half the data from host pointer to device buffer
ciErrNum = clEnqueueWriteBuffer(cqCommandQueue[0], cmDevSrcA, CL_FALSE, 0, szHalfBuffer, (void*)&fSourceA[0], 0, NULL, NULL);
I have 2 questions:
1) Is there any need to use pinned memory for the overlap to occur? Couldn't fSourceA be just a simple host pointer:
fSourceA = (cl_float *)malloc(szBuffBytes);
...
//write random data in fSourceA
2) cmPinnedSrcA is not used in the kernel; cmDevSrcA is used instead. Doesn't the space occupied by the buffers on the device still grow (the space required for cmPinnedSrcA added to the space required for cmDevSrcA)?
Thank you
If I understood your question properly:
1) Yes, you can use any kind of memory (pinned, host pointer, etc.) and the overlap will still occur, as long as you use two queues and the HW/driver supports it.
But keep in mind that the queues are always unsynchronized, and in this case events are needed to prevent the copy queue from copying inconsistent data while the kernel is still running.
2) I think you are using the memory twice if you use pinned memory: once for the pinned buffer and once for a temporary copy. But I am not 100% sure; maybe it is only a pointer.

Qt QSharedMemory Segmentation Faults after Several Successful Writes

I'm using QSharedMemory to store some data and subsequently want to append data to what is contained there. So I call the following code several times with new data. The "audioBuffer" is the new data given to this function. I can call this function about 4-7 times (it varies) before it seg faults on the memcpy operation. The size of the QSharedMemory segment is huge, so in the few calls I make before seg faulting there is no question of memcpy copying data beyond its boundaries. Also, m_SharedAudioBuffer.errorString() reports no errors up to the memcpy operation. Currently, only one process uses this QSharedMemory segment. I also tried writing continually without appending each time, and that works fine, so something happens when I try to append more data to the shared memory segment. Any ideas? Thanks!
// Get the buffer size for the current audio buffer in shared memory
int bufferAudioDataSizeBytes = readFromSharedAudioBufferSizeMemory(); // This in number of bytes
// Create a bytearray with our data currently in the shared buffer
char* bufferAudioData = readFromSharedAudioBufferMemory();
QByteArray currentAudioStream = QByteArray::fromRawData(bufferAudioData,bufferAudioDataSizeBytes);
QByteArray currentAudioStreamDeepCopy(currentAudioStream);
currentAudioStreamDeepCopy.append(audioBuffer);
dataSize = currentAudioStreamDeepCopy.size();
//#if DEBUG
qDebug() << "Inserting audio buffer, new size is: " << dataSize;
//#endif
writeToSharedAudioBufferSizeMemory( dataSize ); // Just the size of what we received
// Write into the shared memory
m_SharedAudioBuffer.lock();
// Clear the buffer and define the copy locations
memset(m_SharedAudioBuffer.data(), '\0', m_SharedAudioBuffer.size());
char *to = (char*)m_SharedAudioBuffer.data();
char *from = (char*)audioBuffer.data();
// Now perform the actual copy operation to store the buffer
memcpy( to, from, dataSize );
// Release the lock
m_SharedAudioBuffer.unlock();
EDIT: Perhaps this is due to my target embedded device, which is very small. The available RAM is large when I am trying to write to shared memory, but I notice the entries below in the /tmp directory (which is only given 4 MB). The space in /tmp is nowhere near consumed, though, so I'm not sure why I couldn't allocate more memory; also, the QSharedMemory::create method never fails for my maximum size of 960000:
# cd /tmp/
# ls
QtSettings
lib
qipc_sharedmemory_AudioBufferData2a7d5f1a29e3d27dac65b4f350d76a0dfd442222
qipc_sharedmemory_AudioBufferSizeData6b7acc119f94322a6794cbca37ed63df07b733ab
qipc_systemsem_AudioBufferData2a7d5f1a29e3d27dac65b4f350d76a0dfd442222
qipc_systemsem_AudioBufferSizeData6b7acc119f94322a6794cbca37ed63df07b733ab
qtembedded-0
run
The problem seemed to be that I was using QByteArray::fromRawData on the pointer returned by the shared memory segment. When I copied that data explicitly using memcpy on this pointer, and then constructed my QByteArray from the copied data, the seg faults stopped.
