enqueueWriteBuffer locking up when sending non-32 bit aligned data - opencl

I am working on an OpenCL project and have run into an issue: when I send data from the CPU to global memory, the application sometimes locks up. This happens sporadically. I can run it x times in a row, and the next time it locks. It only appears to happen when I send data that is not 32 bits wide. For example, I can send float and int just fine, but when I try short, char, or half I get random lockups. It is not dying because of badly initialized data or something similar, because it does run, just not all the time. I also added some logging and found that it always locks up just after attempting to write one of these non-standard-size data arrays. I am running on an NVIDIA GeForce GT 330M. Below is a snippet of the code I am running to send the data. I am using the C++ interface on the host side.
cl_half *m_aryTest;
shared_ptr< cl::Buffer > m_bufTest;
m_aryTest = new cl_half[m_iNeuronCount];
m_bufTest = shared_ptr<cl::Buffer>( new cl::Buffer(m_lpNervousSystem->ActiveContext(), CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, sizeof(m_aryTest)*m_iNeuronCount, m_aryTest));
kernel.setArg(8, *(m_bufTest.get()));
printf("m_bufTest.\n");
m_lpQueue->enqueueWriteBuffer(*(m_bufTest.get()), CL_TRUE, 0, sizeof(m_aryTest)*m_iNeuronCount, m_aryTest, NULL, NULL);
Does anyone have any ideas on why this is happening?
Thanks
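One detail in the snippet above may be worth checking, although nothing in the thread confirms it as the cause: m_aryTest is declared as cl_half*, so sizeof(m_aryTest) is the size of a pointer (4 or 8 bytes), not the 2 bytes of one cl_half element. The size passed to cl::Buffer and enqueueWriteBuffer would then be larger than the array allocated with new. On a 32-bit build a pointer happens to be as wide as a float or an int, which would fit the observation that those types work. The sketch below only does that arithmetic; Python and the helper name overrun_bytes are used purely for illustration and are not part of the original code.

```python
import ctypes

# Element and pointer sizes as ctypes reports them on the machine running this.
HALF_SIZE = ctypes.sizeof(ctypes.c_uint16)     # cl_half is a 16-bit type
POINTER_SIZE = ctypes.sizeof(ctypes.c_void_p)  # what sizeof(m_aryTest) would yield


def overrun_bytes(element_size, pointer_size, count):
    """Bytes requested beyond the real array when the size uses sizeof(pointer)."""
    requested = pointer_size * count   # sizeof(m_aryTest) * m_iNeuronCount
    allocated = element_size * count   # what new cl_half[m_iNeuronCount] provides
    return max(0, requested - allocated)


if __name__ == "__main__":
    # Overrun for 1000 half-precision elements on this machine.
    print(overrun_bytes(HALF_SIZE, POINTER_SIZE, 1000))
```

If this turns out to be the cause, the usual C++ fix would be to pass sizeof(cl_half) * m_iNeuronCount (or sizeof(*m_aryTest) * m_iNeuronCount) in both calls.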

Related

how to flush page data in python using mmap

I am trying to map a region of FPGA memory into the host system:
resource0 = os.open("/sys/bus/pci/devices/0000:0b:00.0/resource0", os.O_RDWR | os.O_SYNC)
resource_size = os.fstat(resource0).st_size
mem = mmap.mmap(resource0, 65536, flags=mmap.MAP_SHARED, prot=mmap.PROT_WRITE|mmap.PROT_READ, offset= 0 )
If I flush my host page with
mem.flush()
and then print the contents, the data is the same as before; nothing gets cleared from the page:
print(mem[0:131072])
mem.flush()
print(mem[0:131072])
As I read it, the Python mmap docs say that flush clears the content:
https://docs.python.org/3.6/library/mmap.html
But when I test it, the content remains the same.
I am using Python 3.6.9.
Why do you expect flush to clear a page?
https://docs.python.org/2/library/mmap.html
flush([offset, size])
Flushes changes made to the in-memory copy of a file back to disk. Without use of this call there is no guarantee that changes are written back before the object is destroyed. If offset and size are specified, only changes to the given range of bytes will be flushed to disk; otherwise, the whole extent of the mapping is flushed. offset must be a multiple of the PAGESIZE or ALLOCATIONGRANULARITY.
So flush only writes changes back; it does not clear anything. If you want to clear the page, you have to assign new values to the mapping first and then write them back, i.e. flush them.
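The difference can be sketched with an ordinary temporary file standing in for the PCI resource (an assumption made so the example runs anywhere on a Unix system; the function name flush_then_clear is hypothetical). It shows that flush() alone leaves the mapped bytes unchanged, and that zeros only appear after they are assigned.

```python
import mmap
import os
import tempfile


def flush_then_clear(size=mmap.PAGESIZE):
    """Return the mapped bytes after a bare flush() and after an explicit clear."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\xff" * size)  # give the file some non-zero content
        mem = mmap.mmap(fd, size, flags=mmap.MAP_SHARED,
                        prot=mmap.PROT_READ | mmap.PROT_WRITE)
        try:
            mem.flush()                   # writes back changes; there are none yet
            after_flush = mem[:size]      # still the original 0xff bytes
            mem[:size] = b"\x00" * size   # clearing means assigning new values
            mem.flush()                   # now the zeros are written back
            after_clear = mem[:size]
        finally:
            mem.close()
    finally:
        os.close(fd)
        os.remove(path)
    return after_flush, after_clear
```

Whether writing zeros is meaningful for a device register region depends on the FPGA design, which the thread does not describe.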

Why does clEnqueueMapBuffer not map to the original pointer? (OpenCL, Caffe)

Assume that a CPU pointer (cpu_ptr_) already exists, and I then create a GPU buffer (cl_gpu_mem_) from it. The problem is that when I map the GPU buffer to a CPU pointer (mapped_ptr), mapped_ptr is not equal to the original pointer (cpu_ptr_), so CHECK_EQ(mapped_ptr, cpu_ptr_) raises an error.
cl_gpu_mem_ = clCreateBuffer(ctx.handle().get(),
CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
size_, cpu_ptr_, &err);
void *mapped_ptr = clEnqueueMapBuffer(
ctx.get_queue().handle().get(),
cl_gpu_mem_,
true,
CL_MAP_READ | CL_MAP_WRITE,
0, size_, 0, NULL, NULL, NULL);
CHECK_EQ(mapped_ptr, cpu_ptr_)
<< "Device claims it supports zero copy"
<< " but failed to create correct user ptr buffer";
I don't know why this error occurs at all. Would you please give me any advice for this problem, or any solution to it. Thank you very much.
OpenCL implementations are free to mirror the host pointer (making it non-zero-copy). On devices that support true zero copy (e.g. Intel GPUs), there are still typically constraints that determine whether the implementation can really use that host allocation directly or must mirror it. On Intel, the host address must be page aligned and the length must be a multiple of 128 bytes (an even cacheline). (I typically just page align both.) I am not sure what the requirements of AMD and others are.
Look into aligned_alloc, or over-allocate by a couple of extra pages and use a page boundary inside the allocation as the base.
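The over-allocation approach can be sketched as follows. This is only an illustration of the address arithmetic using ctypes; the helper names round_up and page_aligned_view are hypothetical, the OpenCL call itself is not shown, and the alignment figures are the ones quoted in the answer above rather than something verified here.

```python
import ctypes
import mmap

PAGE = mmap.PAGESIZE  # page size of the machine running this sketch


def round_up(value, multiple):
    """Smallest multiple of `multiple` that is >= value."""
    return -(-value // multiple) * multiple


def page_aligned_view(nbytes):
    """Over-allocate by one page and return (owner, view) with a page-aligned view."""
    padded = round_up(nbytes, PAGE)                   # length padded to whole pages
    raw = ctypes.create_string_buffer(padded + PAGE)  # one extra page of slack
    base = ctypes.addressof(raw)
    offset = round_up(base, PAGE) - base              # distance to the next page boundary
    view = (ctypes.c_char * padded).from_buffer(raw, offset)
    return raw, view                                  # keep `raw` alive while using `view`


if __name__ == "__main__":
    owner, view = page_aligned_view(1000)
    print(ctypes.addressof(view) % PAGE, ctypes.sizeof(view))
```

In C or C++ the same idea applies to a malloc'd block, or aligned_alloc / posix_memalign can be asked for the page alignment directly.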

Get total Latency - UDP Audio Communication

Okay, I am currently trying to make voice chat software using NAudio and C#.
But I currently have a problem: latency seems to get worse and worse the longer the application runs.
I am a total beginner, so I have no idea what could be causing it.
To troubleshoot, I would like to know whether I can get the total latency, to see how much it grows over time.
Total latency = input buffer + network latency + output buffer (and more if there is any; I am using UDP).
So if I have something like:
Label.text = TotalLatency();
the label would be updated continuously.
while (!bStop)
{
byte[] datanbefore = waveStream.GetBuffer();
autoResetEvent.WaitOne();
waveStream.Position = 0;
captureBuffer.Read(offset, waveStream, halfBuffer, LockFlag.None);
readFirstBufferPart = !readFirstBufferPart;
offset = readFirstBufferPart ? 0 : halfBuffer;
//TODO: Fix this ugly way of initializing differently.
//Mute Mic when button is checked
if (MuteMic.Checked)
{
waveStream = new MemoryStream(halfBuffer);
}
byte[] datanaudio = waveStream.GetBuffer();
udpClient.Send(datanaudio, datanaudio.Length, otherPartyIP.Address.ToString(), 5550);
}
So here is the sending part. I am not really sure how the buffering works, because I started the application from a free sample and have been changing it here and there, so some parts of the sample still remain. I think the buffering can be improved, though.
while (!bStop)
{
//Receive data.
byte[] byteData = udpClient.Receive(ref remoteEP);
waveProvider.AddSamples(byteData, 0, byteData.Length);
}
Here is the receive part, and it is much simpler: it just gets the data from the UDP socket, adds it to a buffer, and plays it.
You can work out roughly the input and output latency by knowing the buffer sizes of WaveIn and WaveOut. By default in NAudio they are each 100ms.
For network latency, you could try timestamping your audio packets although the clocks of both machines would need to be in sync.
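Both suggestions can be sketched in a few lines. The 8-byte header layout and the function names stamp, unstamp, and buffer_latency_ms are assumptions made for illustration, not part of NAudio or of the code in the question, and the network figure is only meaningful if the two machines' clocks are synchronised, as noted above.

```python
import struct
import time

# Hypothetical packet header: send time as a big-endian double (seconds since epoch).
HEADER = struct.Struct("!d")


def stamp(payload, now=None):
    """Prefix an audio payload with the time it was sent."""
    sent = time.time() if now is None else now
    return HEADER.pack(sent) + payload


def unstamp(packet, now=None):
    """Return (payload, one-way latency in ms) for a stamped packet."""
    received = time.time() if now is None else now
    (sent,) = HEADER.unpack_from(packet)
    return packet[HEADER.size:], (received - sent) * 1000.0


def buffer_latency_ms(buffer_bytes, sample_rate, channels, bytes_per_sample):
    """Playback time represented by a buffer of PCM audio, in milliseconds."""
    return 1000.0 * buffer_bytes / (sample_rate * channels * bytes_per_sample)
```

A rough total would then be the input buffer latency plus the measured network latency plus the amount of audio currently queued on the output side. If that last term keeps growing, the receiver is being fed faster than it plays.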

OpenBSD serial I/O: -lpthread makes read() block forever, even with termios VTIME set?

I have an FTDI USB serial device which I use via the termios serial API. I set up the port so that it will time-out on read() calls in half a second (by using the VTIME parameter), and this works on Linux as well as on FreeBSD. On OpenBSD 5.1, however, the read() call simply blocks forever when no data is available (see below.) I would expect read() to return 0 after 500ms.
Can anyone think of a reason that the termios API would behave differently under OpenBSD, at least with respect to the timeout feature?
EDIT: The no-timeout problem is caused by linking against pthread. Regardless of whether I'm actually using any pthreads, mutexes, etc., simply linking against that library causes read() to block forever instead of timing out based on the VTIME setting. Again, this problem only manifests on OpenBSD -- Linux and FreeBSD work as expected.
if ((sd = open(devPath, O_RDWR | O_NOCTTY)) >= 0)
{
struct termios newtio;
char input;
memset(&newtio, 0, sizeof(newtio));
// set options, including non-canonical mode
newtio.c_cflag = (CREAD | CS8 | CLOCAL);
newtio.c_lflag = 0;
// when waiting for responses, wait until we haven't received
// any characters for 0.5 seconds before timing out
newtio.c_cc[VTIME] = 5;
newtio.c_cc[VMIN] = 0;
// set the input and output baud rates to 7812
cfsetispeed(&newtio, 7812);
cfsetospeed(&newtio, 7812);
if ((tcflush(sd, TCIFLUSH) == 0) &&
(tcsetattr(sd, TCSANOW, &newtio) == 0))
{
read(sd, &input, 1); // even though VTIME is set on the device,
// this read() will block forever when no
// character is available in the Rx buffer
}
}
From the termios man page:
Another dependency is whether the O_NONBLOCK flag is set by open() or
fcntl(). If the O_NONBLOCK flag is clear, then the read request is
blocked until data is available or a signal has been received. If the
O_NONBLOCK flag is set, then the read request is completed, without
blocking, in one of three ways:
1. If there is enough data available to satisfy the entire
request, and the read completes successfully the number of
bytes read is returned.
2. If there is not enough data available to satisfy the entire
request, and the read completes successfully, having read as
much data as possible, the number of bytes read is returned.
3. If there is no data available, the read returns -1, with errno
set to EAGAIN.
Can you check whether this is the case?
Cheers.
Edit: The OP traced the problem back to linking with pthreads, which caused the read function to block. Upgrading to OpenBSD >5.2 resolved the issue, because the new rthreads implementation became the default threading library on OpenBSD. More info is in guenther@'s EuroBSD2012 slides.
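For cases where upgrading is not an option, one commonly used alternative is to enforce the timeout with select() instead of relying on VTIME. This is a workaround that does not come from the thread, and it has not been verified on the affected OpenBSD release. The function name read_with_timeout is hypothetical, and the sketch is in Python for brevity; the same pattern applies to select() in C.

```python
import os
import select


def read_with_timeout(fd, nbytes, timeout_s):
    """Read up to nbytes from fd, or return b"" if nothing arrives within timeout_s."""
    ready, _, _ = select.select([fd], [], [], timeout_s)
    if not ready:
        return b""               # timed out: no data became readable
    return os.read(fd, nbytes)   # data is available, so this should not block


if __name__ == "__main__":
    r, w = os.pipe()             # a pipe stands in for the serial port here
    print(read_with_timeout(r, 1, 0.5))
    os.write(w, b"x")
    print(read_with_timeout(r, 1, 0.5))
```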

Qt Streaming Large File via HTTP and Flushing to eMMC Flash

I'm streaming a large file (1 GB) via HTTP to my server in Qt on a very memory constrained embedded Linux device. When I first receive the header I determine where to write the data on the filesystem, create a QFile pointer to that location, and open the file for appending. There is an 'accumulate' function in the server that is called each time new data arrives on the socket. From that accumulate function I want to stream the data right to the file via write(). You can see my accumulate function below.
My problem is memory usage when doing this -- I run out of memory. Shouldn't I be able to flush() and fsync() each iteration of the accumulation and not have to worry about RAM usage? What am I doing wrong and how can I fix this? Thanks -
I open my file once before the accumulate function:
// Open the file
filePointerToWriteTo->open(QIODevice::WriteOnly | QIODevice::Append | QIODevice::Unbuffered);
Here is a portion of the accumulate function:
// Extract the QFile pointer from the QVariant
QFile *filePointerToWriteTo = (QFile *)(containerForPointer->pointer).value<void *>();
qDebug() << "APPENDING bytes: " << data.length();
// Write to the file and sync
filePointerToWriteTo->write(data);
filePointerToWriteTo->waitForBytesWritten(-1);
filePointerToWriteTo->flush(); // Flush
fsync(filePointerToWriteTo->handle()); // Make sure bytes are written to disk
EDIT:
I instrumented my code and the 'waitForBytesWritten(-1)' call ALWAYS returns 'false'. The docs say this should wait until data is written to the device.
Also, if I comment out only the 'write(data)' line, then my free memory never decreases. What could be going on? How does 'write' consume so much memory?
EDIT:
Now I am doing the following. I do not run out of memory, but my free memory drops to 2 MB and hovers there until the entire file is transferred, at which point the memory is released. If I kill the transfer in the middle, the kernel seems to hold on to the memory, because it stays at around 2 MB free until I restart the process and try to write to the same file. I still think I should be able to use and flush the memory on each iteration:
// Extract the QFile pointer from the QVariant
QFile *filePointerToWriteTo = (QFile *)(containerForPointer->pointer).value<void *>();
int numberOfBytesWritten = filePointerToWriteTo->write(data);
qDebug() << "APPENDING bytes: " << data.length() << " ACTUALLY WROTE: " << numberOfBytesWritten;
// Flush and sync
bool didWaitForWrite = filePointerToWriteTo->waitForBytesWritten(-1); // <----------------------- This ALWAYS returns false!
filePointerToWriteTo->flush(); // Flush
fsync(filePointerToWriteTo->handle()); // Make sure bytes are written to disk
fdatasync(filePointerToWriteTo->handle()); // Specific Sync
sync(); // Total sync
EDIT:
This kind of sounds like me misunderstanding Linux caching. After reading this post --> http://blog.scoutapp.com/articles/2010/10/06/determining-free-memory-on-linux, it's possible that I am misreading the output of 'free -mt'. I have been watching the 'free' field in that output and see it drop and hover around 2 MB during the massive file transfer. I would just like to see it return to a high level of free memory when the file transfer is done.
I think Linux is just caching everything it can and frees what it can spare around the 2 MB free memory limit. I do not run out of memory when receiving or sending out ~2 GB of files on a 512 MB RAM system. In my Qt program, after receiving all of the data, appending it to the file, and closing the file, I do the following in a QProcess so that my 'free' memory returns in the output of 'free -mt' run in a separate terminal:
// Now we've returned a large file - so free up cache in linux
QProcess freeCachedMemory;
freeCachedMemory.start("sh");
freeCachedMemory.write("sync; echo 3 > /proc/sys/vm/drop_caches\n"); // Sync to disk and clear Linux cache
freeCachedMemory.closeWriteChannel(); // Send EOF so that sh exits once the command has run
freeCachedMemory.waitForFinished();
freeCachedMemory.close();
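Rather than dropping the caches, the amount of memory the kernel could give back can be estimated from /proc/meminfo. The sketch below is a rough approximation under that assumption; the function names parse_meminfo and approx_available_kb are hypothetical, and the sum slightly overstates what is reclaimable because part of the page cache cannot be dropped.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:                        # skip blank or malformed lines
            fields[name.strip()] = int(parts[0])
    return fields


def approx_available_kb(fields):
    """Free memory plus buffers and page cache, in kB (an approximation)."""
    return fields["MemFree"] + fields.get("Buffers", 0) + fields.get("Cached", 0)


if __name__ == "__main__":
    with open("/proc/meminfo") as f:     # Linux only
        info = parse_meminfo(f.read())
    print(info["MemFree"], approx_available_kb(info))
```

Seen this way, a 'free' value of about 2 MB during the transfer is compatible with plenty of memory still being available to applications.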
