On x86, initializing a QImage on a worker thread can fail. (This is rare on x64.)
The probability increases when more parallel tasks run than the CPU has cores.
This occurs not only when reading from an image file, but also when initializing a plain QImage with a given size, or simply when calling QImage::copy().
Here is some code to work around it. Of course it is not perfect.
Please tell me a better way.
QImage createImageAsync(QString path)
{
    QImageReader reader(path);
    if (!reader.canRead())
        return QImage();

    // QImage processing sometimes fails
    QImage src;
    int count = 0;
    do {
        src = reader.read();
        if (!src.isNull())
            break;
        if (count++ < 1000) {                      // retry up to 1000 times
            QThread::currentThread()->usleep(1000); // wait 1 ms between attempts
            continue;
        }
        return QImage();                           // give up after ~1 s of retries
    } while (1);
    return src;
}
Essentially, this problem was caused by heap memory fragmentation.
One possible solution is therefore to replace the memory allocator with tcmalloc or jemalloc, for example.
In my application, limiting the number of image files opened at the same time was enough to avoid the problem in the x86 build.
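For illustration, limiting concurrency could look roughly like the following minimal sketch with a QSemaphore; the limit of four and the helper name are placeholders, not the code actually used in my application.

#include <QImage>
#include <QImageReader>
#include <QSemaphore>
#include <QString>

static QSemaphore g_decodeSlots(4);   // at most 4 images decoded concurrently (illustrative limit)

QImage createImageThrottled(const QString &path)
{
    g_decodeSlots.acquire();          // block until a decode slot is free
    QImageReader reader(path);
    QImage img = reader.read();       // may still be null on failure
    g_decodeSlots.release();
    return img;
}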
Related
I've been searching for a few days now without any luck.
Heap memory fragmentation results from heavy use of malloc() and free() on microcontrollers/Arduino.
If using them is unavoidable, how can I defragment the heap every now and then to make sure that the next malloc() call will find contiguous memory to allocate?
Create a pool of fixed-size memory chunks. When you need any amount of memory, allocate one of the chunks from the pool and use the portion you need. When you're done with it, return the chunk to the pool. The pool will never fragment because the chunks are always the same size. And if any chunks are available then they're always the right size. Sometimes you'll need only a couple of bytes and for a second you'll think that you're being wasteful for tying up an entire chunk. But then you'll remember that you're avoiding memory fragmentation and you'll feel better about it.
If the sizes of your allocations vary greatly, then one fixed-size memory pool might really be too wasteful. In that case you can make two or three fixed-size memory pools for small, medium, and large chunks. But make sure you return each chunk to the same pool that you obtained it from.
Use a queue or a linked list to organize the chunks in the pool. The chunks are removed from the queue/list when they're allocated. And they're returned to the queue/list when they're freed.
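As an illustration of the idea (not taken from either answer), here is a minimal sketch in plain C++ with a static arena and a stack of free chunk pointers; the sizes and names are arbitrary. The point is simply that every chunk has the same size, so the pool can never fragment.

#include <stddef.h>

#define CHUNK_SIZE  64   // every chunk has the same size
#define CHUNK_COUNT 8    // total chunks in the pool

static unsigned char arena[CHUNK_COUNT][CHUNK_SIZE];  // backing storage
static void *freeList[CHUNK_COUNT];                   // stack of free chunks
static int freeTop = 0;

void poolInit(void) {
    for (int i = 0; i < CHUNK_COUNT; i++)
        freeList[freeTop++] = arena[i];
}

void *poolAlloc(size_t n) {
    if (n > CHUNK_SIZE || freeTop == 0)
        return NULL;              // request too big, or no chunk available
    return freeList[--freeTop];   // hand out a whole chunk
}

void poolFree(void *p) {
    if (p != NULL)
        freeList[freeTop++] = p;  // return the chunk to the pool
}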
As #kkrambo mentioned, use a queue or linked list to keep track of the memory chunks. I've included an example specifically for Arduino; it does not have all of my desired bells and whistles.
I opted to keep only the pointers to the chunks of memory in the queue, as opposed to the chunks themselves (that seems to be what kkrambo suggests, although I do not know whether they meant exactly what I'm describing here).
#include <QueueArray.h>  // Arduino QueueArray library

const int MAX_POOL_SIZE = 4;
const int MAX_DATA_SIZE = 128;
const bool debugMsg = true;      // enable debug output over Serial

// Memory pool for data: the queue holds pointers to the chunks,
// not the chunks themselves
QueueArray<uint8_t*> dataPool;

void initDataPool() {
  uint8_t *ptr;
  for (int i = 0; i < MAX_POOL_SIZE; i++) {
    ptr = (uint8_t*) malloc(MAX_DATA_SIZE); // allocate a MAX_DATA_SIZE buffer
    dataPool.push(ptr);                     // push the buffer pointer onto the queue
  }
}

// Allocate a message data buffer from the data pool.
// If the pool still has a buffer available and the requested
// size fits, return a pointer to an available buffer.
static void* allocateDataBuff(size_t buffSize) {
  if (!dataPool.isEmpty()) {   // pool still has buffers available
    if (buffSize < MAX_DATA_SIZE) return dataPool.pop();
    if (debugMsg) Serial.println("allocateDataBuff: requested buffer size is too large");
    return NULL;
  }
  if (debugMsg) Serial.println("allocateDataBuff: pool exhausted, no buffers available at this time");
  return NULL;
}

// Return a buffer to the pool
static void deallocateDataBuff(uint8_t *ptr) {
  if (ptr != NULL) dataPool.push(ptr);
}

void setup() {
  Serial.begin(9600);
  initDataPool();
}

void loop() {
  // ........
  uint8_t* dataPtr = (uint8_t*) allocateDataBuff(100);
  // ....assign message to *dataPtr
  // ........
  deallocateDataBuff(dataPtr);
}
We have two Qt applications. App1 accepts a connection from App2 through QTcpServer and stores it in an instance of QTcpSocket* tcpSocket. App1 runs a simulation at 30 Hz. For each simulation run, a QByteArray consisting of a few kilobytes is sent using the following code (from the main/GUI thread):
QByteArray block;
/* lines omitted which write data into block */
tcpSocket->write(block, block.size());
tcpSocket->waitForBytesWritten(1);
The receiver socket listens to the QTcpSocket::readDataBlock signal (in main/GUI thread) and prints the corresponding time stamp to the GUI.
When both App1 and App2 run on the same system, the packages are perfectly in sync. However, when App1 and App2 run on different systems connected through a network, App2 is no longer in sync with the simulation in App1. The packages come in much more slowly. Even more surprising (and indicating our implementation is wrong) is the fact that when we stop the simulation loop, no more packages are received. This surprises us, because we expect from the TCP protocol that all packages will arrive eventually.
We built the TCP logic based on Qt's fortune example. The fortune server, however, is different, because it only sends one package per incoming client. Could someone identify what we have done wrong?
Note: we use MSVC2012 (App1), MSVC2010 (App2) and Qt 5.2.
Edit: By a package I mean the result of a single simulation experiment, which is a bunch of numbers written into QByteArray block. The first bytes, however, contain the length of the QByteArray, so that the client can check whether all data has been received. This is the code which is called when the signal QTcpSocket::readDataBlock is emitted:
QDataStream in(tcpSocket);
in.setVersion(QDataStream::Qt_5_2);

if (blockSize == 0) {
    if (tcpSocket->bytesAvailable() < (int)sizeof(quint16))
        return; // cannot yet read the size from the data block
    in >> blockSize; // read the data size for the data block
}

// if the whole data block has not yet been received, ignore it
if (tcpSocket->bytesAvailable() < blockSize)
    return;

// if we get here, the whole object is available to parse
QByteArray object;
in >> object;
blockSize = 0; // reset blockSize for handling the next package
return;
The problem in our implementation was caused by data packages piling up and by incorrect handling of packages that had only partially arrived.
The answer goes in the direction of Tcp packets using QTcpSocket. However, that answer could not be applied in a straightforward manner, because we rely on QDataStream instead of a plain QByteArray.
The following code (run each time QTcpSocket::readDataBlock is emitted) works for us and shows how a raw series of bytes can be read from a QDataStream. Unfortunately it seems that it is not possible to process the data in a cleaner way (using operator>>).
QDataStream in(tcpSocket);
in.setVersion(QDataStream::Qt_5_2);

while (tcpSocket->bytesAvailable())
{
    if (tcpSocket->bytesAvailable() < (int)(sizeof(quint16) + sizeof(quint8) + sizeof(quint32)))
        return; // cannot yet read the size and type info from the data block

    in >> blockSize;
    in >> dataType;

    // read and ignore the quint32 length prefix that QDataStream
    // writes when serializing a QByteArray
    char* temp = new char[4];
    int bufferSize = in.readRawData(temp, 4);
    delete[] temp;
    temp = NULL;

    QByteArray buffer;
    int objectSize = blockSize - (sizeof(quint16) + sizeof(quint8) + sizeof(quint32));
    temp = new char[objectSize];
    bufferSize = in.readRawData(temp, objectSize);
    buffer.append(temp, bufferSize);
    delete[] temp;
    temp = NULL;

    if (buffer.size() == objectSize)
    {
        // ready for parsing
    }
    else if (buffer.size() > objectSize)
    {
        // buffer size larger than expected object size, but still ready for parsing
    }
    else
    {
        // buffer size smaller than expected object size
        while (buffer.size() < objectSize)
        {
            tcpSocket->waitForReadyRead();
            char* chunk = new char[objectSize - buffer.size()];
            int chunkSize = in.readRawData(chunk, objectSize - buffer.size());
            buffer.append(chunk, chunkSize);
            delete[] chunk;
        }
        // now ready for parsing
    }

    if (dataType == 0)
    {
        // deserialize the object
    }
}
Please note that the first three bytes of the expected QDataStream are part of our own protocol: blockSize indicates the number of bytes for a complete single package, and dataType helps deserializing the binary chunk.
Edit
To reduce the latency of sending objects through the TCP connection, disabling packet bunching was very useful:
// disable Nagle's algorithm to avoid delay and bunching of small packages
tcpSocketPosData->setSocketOption(QAbstractSocket::LowDelayOption,1);
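For completeness, the matching sender side could look roughly like the sketch below, using the same hand-rolled framing (quint16 block size, quint8 type tag, then the serialized QByteArray). This is an illustrative reconstruction, not the actual sender code; the function name is made up.

#include <QByteArray>
#include <QDataStream>
#include <QTcpSocket>

// Illustrative sender-side framing matching the receiver above:
// [quint16 blockSize][quint8 dataType][QByteArray payload]
void sendObject(QTcpSocket *socket, const QByteArray &payload, quint8 dataType)
{
    QByteArray block;
    QDataStream out(&block, QIODevice::WriteOnly);
    out.setVersion(QDataStream::Qt_5_2);

    out << quint16(0);             // placeholder for the total block size
    out << dataType;               // application-level type tag
    out << payload;                // QDataStream writes a quint32 length + raw bytes

    out.device()->seek(0);
    out << quint16(block.size());  // patch in the real block size (includes the header)

    socket->write(block);
    socket->flush();
}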
I have written my code for a single Xeon Phi node (with 61 cores on it). I have two files. I have called MPI_Init(2) before calling any other MPI calls. I have found ntasks and rank, also using MPI calls. I have also included all the required libraries. Still I get an error. Can you please help me out with this?
In file 1:
int buffsize;
int *sendbuff, **recvbuff, buffsum;
int *shareRegion;

shareRegion = (int*)gInit(MPI_COMM_WORLD, buffsize, ntasks);  /* gInit is in file 2 */
buffsize = atoi(argv[1]);

sendbuff = (int *)malloc(sizeof(int) * buffsize);
if (taskid == 0) {
    recvbuff = (int **)malloc(sizeof(int *) * ntasks);
    recvbuff[0] = (int *)malloc(sizeof(int) * ntasks * buffsize);
    for (i = 1; i < ntasks; i++) recvbuff[i] = recvbuff[i-1] + buffsize;
}
else {
    recvbuff = (int **)malloc(sizeof(int *) * 1);
    recvbuff[0] = (int *)malloc(sizeof(int) * 1);
}

for (i = 0; i < buffsize; i++) {
    sendbuff[i] = 1;
}

MPI_Barrier(MPI_COMM_WORLD);

call(sendbuff, buffsize, shareRegion, recvbuff[0], buffsize, taskid, ntasks);
In file 2:
void* gInit(MPI_Comm comm, int size, int num_proc)
{
    int share_mem = shm_open("share_region", O_CREAT | O_RDWR, 0666);
    if (share_mem == -1)
        return NULL;

    int rank;
    MPI_Comm_rank(comm, &rank);

    if (ftruncate(share_mem, sizeof(int) * size * num_proc) == -1)
        return NULL;

    int* shared = mmap(NULL, sizeof(int) * size * num_proc, PROT_WRITE | PROT_READ, MAP_SHARED, share_mem, 0);
    if (shared == (void*)-1)
        printf("error in mem allocation (mmap)\n");

    *(shared + rank) = 0;
    MPI_Barrier(MPI_COMM_WORLD);
    return shared;
}
void call(int *sendbuff, int sendcount, volatile int *sharedRegion, int **recvbuff, int recvcount, int rank, int size)
{
    int i = 0;
    int k, j;

    j = rank * sendcount;
    for (i = 0; i < sendcount; i++)
    {
        sharedRegion[j] = sendbuff[i];
        j++;
    }

    if (rank == 0)
        for (k = 0; k < size; k++)
            for (i = 0; i < sendcount; i++)
            {
                j = 0;
                recvbuff[k][i] = sharedRegion[j];
                j++;
            }
}
Then I am doing some computation in file 1 on this recvbuff.
I get a segmentation fault while using the sharedRegion variable.
MPI implements the message-passing paradigm. That means processes (ranks) are isolated and generally run on a distributed machine. They communicate via explicit messages; recent versions also allow one-sided, but still explicit, data transfer. You cannot assume that shared memory is available to the processes. Have a look at any MPI tutorial to see how MPI is used.
Since you did not specify what kind of machine you are running on, any further suggestion is purely speculative. If you actually are on a shared-memory machine, you may want to use a real shared-memory paradigm instead, e.g. OpenMP.
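For illustration, the gather that the code above emulates through shared memory can be written with plain message passing, e.g. MPI_Gather. A minimal sketch (buffer names follow the question; error handling omitted):

/* Each rank contributes buffsize ints; rank 0 receives all of them, ordered by rank. */
#include <mpi.h>
#include <stdlib.h>

void gather_example(int *sendbuff, int buffsize, int taskid, int ntasks)
{
    int *gathered = NULL;
    if (taskid == 0)
        gathered = (int *)malloc(sizeof(int) * buffsize * ntasks);

    /* Every rank sends sendbuff; rank 0 ends up with ntasks*buffsize ints in gathered. */
    MPI_Gather(sendbuff, buffsize, MPI_INT,
               gathered, buffsize, MPI_INT,
               0, MPI_COMM_WORLD);

    /* ... rank 0 uses gathered ... */
    free(gathered);
}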
While it's possible to restrict MPI to only use one machine and have shared memory (see the RMA chapter, especially in MPI-3), if you're only ever going to use one machine, it's easier to use some other paradigm.
However, if you're going to use multiple nodes and have multiple ranks on one node (multi-core processors, for example), then it might be worth taking a look at MPI-3 RMA to see how it can help you with both locally shared memory and remote memory access. There are multiple papers out on the subject, but because they're so new, there aren't many good tutorials yet. You'll have to dig around a bit to find something useful to you.
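As a starting point, here is a minimal sketch of the MPI-3 shared-memory route (MPI_Comm_split_type plus MPI_Win_allocate_shared); error handling is omitted and the helper name is arbitrary:

#include <mpi.h>

/* Sketch: allocate one shared segment among the ranks on the same node. */
int *node_shared_alloc(int count_per_rank, MPI_Win *win, MPI_Comm *nodecomm)
{
    int noderank;
    /* Communicator containing only the ranks that can share memory (same node). */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, nodecomm);
    MPI_Comm_rank(*nodecomm, &noderank);

    /* Each rank contributes its own slice; the slices are contiguous on the node. */
    int *myslice;
    MPI_Win_allocate_shared(count_per_rank * sizeof(int), sizeof(int),
                            MPI_INFO_NULL, *nodecomm, &myslice, win);

    /* Query the start of rank 0's slice, i.e. the base of the whole region. */
    int *base;
    MPI_Aint size;
    int disp_unit;
    MPI_Win_shared_query(*win, 0, &size, &disp_unit, &base);
    return base;   /* base[noderank * count_per_rank] is this rank's slice */
}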
The ordering of these two lines:
shareRegion = (int*)gInit(MPI_COMM_WORLD, buffsize, ntasks); /* gInit is in file 2 */
buffsize=atoi(argv[1]);
suggests that buffsize could have different values before and after the call to gInit. If the buffsize passed as the first program argument is larger than the (still uninitialized) value buffsize had when gInit was called, out-of-bounds memory access will occur later and lead to a segmentation fault.
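In other words, reading the argument before sizing the shared region removes that particular hazard:

buffsize = atoi(argv[1]);                                    /* read the size first */
shareRegion = (int*)gInit(MPI_COMM_WORLD, buffsize, ntasks); /* then size the shared region with it */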
Hint: run your code as an MPI singleton (e.g. without mpirun) from inside a debugger (e.g. gdb) or change the limits so that cores would get dumped on error (e.g. with ulimit -c unlimited) and then examine the core file(s) with the debugger. Compiling with debug information (e.g. adding -g to the compiler options) helps a lot in such cases.
While writing some signal processing in CUDA I recently made huge progress in optimizing it. By using 1D textures and adjusting my access patterns I managed to get a 10× performance boost. (I had previously tried transaction-aligned prefetching from global into shared memory, but the nonuniform access patterns happening later messed up the warp→shared-memory bank association, I think.)
So now I'm facing the question of how CUDA textures and bindings interact with asynchronous memcpy.
Consider the following kernel
texture<...> mytexture;

__global__ void mykernel(float *pOut)
{
    pOut[threadIdx.x] = tex1Dfetch(mytexture, threadIdx.x);
}
The kernel is launched in multiple streams
extern void *sourcedata;
#define N_CUDA_STREAMS ...

cudaStream_t stream[N_CUDA_STREAMS];
void *d_pOut[N_CUDA_STREAMS];
void *d_texData[N_CUDA_STREAMS];

for (int k_stream = 0; k_stream < N_CUDA_STREAMS; k_stream++) {
    cudaStreamCreate(&stream[k_stream]);
    cudaMalloc(&d_pOut[k_stream], ...);
    cudaMalloc(&d_texData[k_stream], ...);
}

/* ... */

for (int i_datablock = 0; i_datablock < n_datablocks; i_datablock++) {
    int const k_stream = i_datablock % N_CUDA_STREAMS;
    cudaMemcpyAsync(d_texData[k_stream], (char*)sourcedata + i_datablock * blocksize, ..., stream[k_stream]);
    cudaBindTexture(0, &mytexture, d_texData[k_stream], ...);
    mykernel<<<..., stream[k_stream]>>>((float*)d_pOut[k_stream]);
}
Now what I wonder about is: since there is only one texture reference, what happens when I bind a buffer to the texture while kernels in other streams are accessing that texture? cudaBindTexture doesn't take a stream parameter, so I'm worried that by binding the texture to another device pointer while running kernels are asynchronously accessing said texture, I'll divert their accesses to the other data.
The CUDA documentation doesn't say anything about this. If I have to disentangle this to allow concurrent access, it seems I'd have to create a number of texture references and use a switch statement to choose between them, based on the stream number passed as a kernel launch parameter.
Unfortunately CUDA doesn't allow putting arrays of texture references on the device side, i.e. the following does not work:
texture<...> texarray[N_CUDA_STREAMS];
Layered textures are not an option, because the amount of data I have only fits within a plain 1D texture not bound to a CUDA array (see table F-2 in the CUDA 4.2 C Programming Guide).
Indeed you cannot unbind the texture while still using it in a different stream.
Since the number of streams doesn't need to be large to hide the asynchronous memcpys (2 would already do), you could use C++ templates to give each stream its own texture:
texture<float, 1, cudaReadModeElementType> mytexture1;
texture<float, 1, cudaReadModeElementType> mytexture2;

template<int TexSel> __device__ float myTex1Dfetch(int x);

template<> __device__ float myTex1Dfetch<1>(int x) { return tex1Dfetch(mytexture1, x); }
template<> __device__ float myTex1Dfetch<2>(int x) { return tex1Dfetch(mytexture2, x); }

template<int TexSel> __global__ void mykernel(float *pOut)
{
    pOut[threadIdx.x] = myTex1Dfetch<TexSel>(threadIdx.x);
}

int main(void)
{
    float *out_d[2];
    // ...
    mykernel<1><<<blocks, threads, 0, stream[0]>>>(out_d[0]);
    mykernel<2><<<blocks, threads, 0, stream[1]>>>(out_d[1]);
    // ...
}
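On the host side you would then bind each stream's buffer to its own texture reference once, up front, and pick the matching instantiation per stream. A sketch assuming two streams, assuming blocksize is the per-block byte count, and reusing the buffer names from the question (sourcedata, d_texData, d_pOut, stream, blocks, threads):

// Each stream keeps its own device buffer, so each texture reference
// can be bound once before the loop.
cudaBindTexture(NULL, mytexture1, d_texData[0], blocksize);
cudaBindTexture(NULL, mytexture2, d_texData[1], blocksize);

for (int i_datablock = 0; i_datablock < n_datablocks; i_datablock++) {
    int const k_stream = i_datablock % 2;

    // Stream-ordered: the copy completes before the kernel queued in the same stream runs.
    cudaMemcpyAsync(d_texData[k_stream],
                    (char*)sourcedata + i_datablock * blocksize,
                    blocksize, cudaMemcpyHostToDevice, stream[k_stream]);

    if (k_stream == 0)
        mykernel<1><<<blocks, threads, 0, stream[0]>>>((float*)d_pOut[0]);
    else
        mykernel<2><<<blocks, threads, 0, stream[1]>>>((float*)d_pOut[1]);
}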
I was thinking of trying my hand at some JIT compilation (just for the sake of learning), and it would be nice to have it work cross-platform since I run all three major OSes at home (Windows, OS X, Linux).
With that in mind, I want to know if there is any way to avoid using the Windows virtual memory functions to allocate memory with execute permission. It would be nice to just use malloc or new and point the processor at such a block.
Any tips?
DEP simply removes execute permission from every non-code page of memory. The application's code is loaded into memory that has execute permission, and there are lots of JITs that work on Windows/Linux/MacOSX even when DEP is active. This is because there is a way to dynamically allocate memory with the needed permissions set.
Usually, plain malloc should not be used, because memory permissions are per-page. Aligning malloc'ed memory to page boundaries is still possible, at the price of some overhead. If you don't use malloc, you will need some custom memory management (for the executable code only). Custom management is the common way of doing JIT.
There is a solution from the Chromium project, which uses JIT for the V8 JavaScript VM and is cross-platform. To be cross-platform, the needed function is implemented in several files, and the right one is selected at compile time.
Linux (chromium src/v8/src/platform-linux.cc): the flag is PROT_EXEC of mmap().
void* OS::Allocate(const size_t requested,
                   size_t* allocated,
                   bool is_executable) {
  const size_t msize = RoundUp(requested, AllocateAlignment());
  int prot = PROT_READ | PROT_WRITE | (is_executable ? PROT_EXEC : 0);
  void* addr = OS::GetRandomMmapAddr();
  void* mbase = mmap(addr, msize, prot, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (mbase == MAP_FAILED) {
    /** handle error */
    return NULL;
  }
  *allocated = msize;
  UpdateAllocatedSpaceLimits(mbase, msize);
  return mbase;
}
Win32 (src/v8/src/platform-win32.cc): the flag is PAGE_EXECUTE_READWRITE of VirtualAlloc.
void* OS::Allocate(const size_t requested,
                   size_t* allocated,
                   bool is_executable) {
  // The address range used to randomize RWX allocations in OS::Allocate.
  // Try not to map pages into the default range that windows loads DLLs.
  // Use a multiple of 64k to prevent committing unused memory.
  // Note: This does not guarantee RWX regions will be within the
  // range kAllocationRandomAddressMin to kAllocationRandomAddressMax.
#ifdef V8_HOST_ARCH_64_BIT
  static const intptr_t kAllocationRandomAddressMin = 0x0000000080000000;
  static const intptr_t kAllocationRandomAddressMax = 0x000003FFFFFF0000;
#else
  static const intptr_t kAllocationRandomAddressMin = 0x04000000;
  static const intptr_t kAllocationRandomAddressMax = 0x3FFF0000;
#endif

  // VirtualAlloc rounds allocated size to page size automatically.
  size_t msize = RoundUp(requested, static_cast<int>(GetPageSize()));
  intptr_t address = 0;

  // Windows XP SP2 allows Data Execution Prevention (DEP).
  int prot = is_executable ? PAGE_EXECUTE_READWRITE : PAGE_READWRITE;

  // For executable pages try and randomize the allocation address.
  if (prot == PAGE_EXECUTE_READWRITE &&
      msize >= static_cast<size_t>(Page::kPageSize)) {
    address = (V8::RandomPrivate(Isolate::Current()) << kPageSizeBits)
              | kAllocationRandomAddressMin;
    address &= kAllocationRandomAddressMax;
  }

  LPVOID mbase = VirtualAlloc(reinterpret_cast<void *>(address),
                              msize,
                              MEM_COMMIT | MEM_RESERVE,
                              prot);
  if (mbase == NULL && address != 0)
    mbase = VirtualAlloc(NULL, msize, MEM_COMMIT | MEM_RESERVE, prot);

  if (mbase == NULL) {
    LOG(ISOLATE, StringEvent("OS::Allocate", "VirtualAlloc failed"));
    return NULL;
  }

  ASSERT(IsAligned(reinterpret_cast<size_t>(mbase), OS::AllocateAlignment()));
  *allocated = msize;
  UpdateAllocatedSpaceLimits(mbase, static_cast<int>(msize));
  return mbase;
}
MacOS (src/v8/src/platform-macos.cc): the flag is PROT_EXEC of mmap, just like on Linux and other POSIX systems.
void* OS::Allocate(const size_t requested,
                   size_t* allocated,
                   bool is_executable) {
  const size_t msize = RoundUp(requested, getpagesize());
  int prot = PROT_READ | PROT_WRITE | (is_executable ? PROT_EXEC : 0);
  void* mbase = mmap(OS::GetRandomMmapAddr(),
                     msize,
                     prot,
                     MAP_PRIVATE | MAP_ANON,
                     kMmapFd,
                     kMmapFdOffset);
  if (mbase == MAP_FAILED) {
    LOG(Isolate::Current(), StringEvent("OS::Allocate", "mmap failed"));
    return NULL;
  }
  *allocated = msize;
  UpdateAllocatedSpaceLimits(mbase, msize);
  return mbase;
}
I also want to note that the bcdedit.exe-style approach should be used only for very old programs that create new executable code in memory but do not set execute permission on those pages. For newer programs, like Firefox or Chrome/Chromium, or any modern JIT, DEP should stay active, and the JIT will manage memory permissions in a fine-grained manner.
One possibility is to make it a requirement that Windows installations running your program be either configured for DEP AlwaysOff (bad idea) or DEP OptOut (better idea).
This can be configured (under WinXp SP2+ and Win2k3 SP1+ at least) by changing the boot.ini file to have the setting:
/noexecute=OptOut
and then configuring your individual program to opt out by choosing (under XP):
Start button
Control Panel
System
Advanced tab
Performance Settings button
Data Execution Prevention tab
This should allow you to execute code from within your program that's created on the fly in malloc() blocks.
Keep in mind that this makes your program more susceptible to attacks that DEP was meant to prevent.
It looks like this is also possible in Windows 2008 with the command:
bcdedit.exe /set {current} nx OptOut
But, to be honest, if you just want to minimise platform-dependent code, that's easy to do just by isolating the code into a single function, something like:
void *MallocWithoutDep(size_t sz) {
#if defined _IS_WINDOWS
    return VirtualMalloc(sz, OPT_DEP_OFF); // or whatever
#elif defined IS_LINUX
    // Do linuxy thing
#elif defined IS_MACOS
    // Do something almost certainly inexplicable
#endif
}
If you put all your platform dependent functions in their own files, the rest of your code is automatically platform-agnostic.
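For reference, a sketch of what such a function could look like with the real APIs (VirtualAlloc on Windows, mmap elsewhere); the names are illustrative and error handling is minimal:

#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <sys/mman.h>
#endif

// Request readable/writable/executable pages directly instead of disabling DEP.
void *AllocExecutable(size_t sz) {
#ifdef _WIN32
    // Committed RWX pages; consider switching to PAGE_EXECUTE_READ after
    // writing the generated code (W^X).
    return VirtualAlloc(NULL, sz, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
#else
    void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE | PROT_EXEC,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
#endif
}

void FreeExecutable(void *p, size_t sz) {
#ifdef _WIN32
    (void)sz;                      // size is not needed with MEM_RELEASE
    VirtualFree(p, 0, MEM_RELEASE);
#else
    munmap(p, sz);
#endif
}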