Multiple 4K videos in a GStreamer / QML app = black flickering

I'm developing a QML application that displays up to four 4K videos, using a DeckLink 8K Pro and the qmlglsink element.
The hardware setup is the following:
4 × 4K cameras
1 × DeckLink 8K Pro
1 × Dell Precision with two Intel Xeon E5-2698 CPUs (40 cores / 80 threads), 256 GB of memory, and an NVIDIA GeForce GTX 1080
When displaying all four videos at the same time in my app, I frequently get black frames (I see a black flash, then the video comes back). When displaying three videos, the blinks are much less frequent, and I don't think I've ever seen them when showing only one or two videos.
A few interesting observations:
if I start 4 instances of GStreamer with one 4K video each, there's no blink
if I start 4 instances of my app with one 4K video each, there's no blink
say I'm showing camera A twice and camera B twice (I have some hardware to duplicate the video streams); when a blink occurs, it occurs on both views of the same camera (A or B)
It looks to me like a threading problem, but adding a queue element to my pipeline does not fix it. Neither does setting QSG_RENDER_LOOP to threaded with qputenv (sketch below).
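For reference, QSG_RENDER_LOOP only takes effect if it is set before the QGuiApplication object is constructed; a minimal sketch (this main() is illustrative, not my actual code):

#include <QGuiApplication>
#include <QtGlobal>

int main(int argc, char* argv[])
{
    // The scene graph chooses its render loop when the application
    // object is created, so the variable must be set first.
    qputenv("QSG_RENDER_LOOP", "threaded");
    QGuiApplication app(argc, argv);
    // ... load the QML scene, build the pipelines, etc.
    return app.exec();
}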
Is there anything else I can do?
Here's some code used to create the pipeline:
// Create the source element (see below for the makeElement method):
GstElement* decklink_src = VideoView::makeElement("decklinkvideosrc", { {"device-number", 0}, {"mode", "2160p2997"}, {"profile", 4} });
// Create the pipeline (see below for the generatePipeline method):
GstElement* pipeline = generatePipeline(decklink_src, m_videoViewItem->findChild<QQuickItem*>("videoView"));
// Schedule rendering:
m_window->scheduleRenderJob(new SetPlaying(pipeline), QQuickWindow::BeforeSynchronizingStage);
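(SetPlaying is the small QRunnable from the qmlglsink example; roughly, with my member naming:)

class SetPlaying : public QRunnable
{
public:
    SetPlaying(GstElement* pipeline)
        : m_pipeline(pipeline ? GST_ELEMENT(gst_object_ref(pipeline)) : nullptr)
    {
    }
    ~SetPlaying()
    {
        if (m_pipeline)
            gst_object_unref(m_pipeline);
    }
    void run() override
    {
        // flip the pipeline to PLAYING once the scene graph is ready
        if (m_pipeline)
            gst_element_set_state(m_pipeline, GST_STATE_PLAYING);
    }
private:
    GstElement* m_pipeline;
};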
GstElement* VideoView::makeElement(const std::string& _elementName, const std::map<std::string, QVariant>& _params)
{
    GstElement* element = gst_element_factory_make(_elementName.c_str(), nullptr);
    for (const auto& attribute : _params)
    {
        const char* name = attribute.first.c_str();
        switch (attribute.second.type())
        {
        case QVariant::Type::Int:
            g_object_set(element, name, attribute.second.toInt(), nullptr);
            break;
        case QVariant::Type::String:
            // gst_util_set_object_arg parses string values, including enum nicks such as "2160p2997"
            gst_util_set_object_arg(G_OBJECT(element), name, attribute.second.toString().toStdString().c_str());
            break;
        default:
            break;
        }
    }
    return element;
}
GstElement* VideoView::generatePipeline(GstElement* _src, QQuickItem* _parent)
{
    GstElement* pipeline = gst_pipeline_new(nullptr);
    GstElement* queue = gst_element_factory_make("queue", nullptr);
    GstElement* glupload = gst_element_factory_make("glupload", nullptr);
    GstElement* glcolorconvert = gst_element_factory_make("glcolorconvert", nullptr);
    GstElement* sink = gst_element_factory_make("qmlglsink", nullptr);
    g_assert(_src && queue && glupload && glcolorconvert && sink);
    gst_bin_add_many(GST_BIN(pipeline), _src, queue, glupload, glcolorconvert, sink, nullptr);
    gst_element_link_many(_src, queue, glupload, glcolorconvert, sink, nullptr);
    g_object_set(sink, "widget", _parent, nullptr);
    return pipeline;
}
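For diagnosis, a bus watch like the following can show whether the blinks coincide with QoS or error messages (a sketch; it assumes a GLib main loop is running, which Qt's event loop uses on Linux, and the callback name is illustrative):

static gboolean onBusMessage(GstBus* /*bus*/, GstMessage* msg, gpointer /*user_data*/)
{
    switch (GST_MESSAGE_TYPE(msg))
    {
    case GST_MESSAGE_ERROR:
    {
        GError* err = nullptr;
        gchar* dbg = nullptr;
        gst_message_parse_error(msg, &err, &dbg);
        g_printerr("ERROR from %s: %s\n", GST_OBJECT_NAME(msg->src), err->message);
        g_clear_error(&err);
        g_free(dbg);
        break;
    }
    case GST_MESSAGE_QOS:
        // sinks post QoS messages when frames are dropped or late
        // (if their qos property is enabled)
        g_print("QoS from %s\n", GST_OBJECT_NAME(msg->src));
        break;
    default:
        break;
    }
    return G_SOURCE_CONTINUE;
}

// after generatePipeline():
GstBus* bus = gst_pipeline_get_bus(GST_PIPELINE(pipeline));
gst_bus_add_watch(bus, onBusMessage, nullptr);
gst_object_unref(bus);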

Related

SDL and Qt. Resizing the app causes display freeze when rendering from another thread

Qt: 5.14.1
SDL: 2.0.12
OS: Windows 10
I'm working on a video player and I'm using Qt for UI and SDL for rendering frames.
I created the SDL window by passing it the winId() handle of my rendering widget (which sits inside a layout).
This works perfectly when I start a non-threaded Play().
However, it causes some issues with playback when resizing or moving the app window. Nothing serious, but since the play code is not threaded, my frame queues fill up, which causes the video to speed up until it catches up with the audio.
I solved that by putting my play code inside a Win32 thread created with the CreateThread function.
Now when I move the window the video continues to play as intended, but when I resize the app, the rendering widget stops refreshing, and only the last frame displayed before the resize event is shown.
I can confirm that the video is still running and the correct frames are still being rendered. The displayed image can even be resized, but it is never refreshed.
A similar thing happened when I was testing Qt threads with SDL. Consider this code:
class TestThread : public QThread
{
public:
    TestThread(QObject *parent = NULL) : QThread(parent)
    {
    }
    void run() override
    {
        for (;;)
        {
            SDL_Delay(1000 / 60);
            // Basic square bouncing animation
            SDL_Rect spos;
            spos.h = 100;
            spos.w = 100;
            spos.y = 100;
            spos.x = position;
            SDL_SetRenderDrawColor(RendererRef, 0, 0, 0, 255);
            SDL_RenderFillRect(RendererRef, 0);
            SDL_SetRenderDrawColor(RendererRef, 0xFF, 0x0, 0x0, 0xFF);
            SDL_RenderFillRect(RendererRef, &spos);
            SDL_RenderPresent(RendererRef);
            if (position >= 500)
                dir = 0;
            else if (position <= 0)
                dir = 1;
            if (dir)
                position += 5;
            else
                position -= 5;
        }
    }
};
// called from the "Init SDL and Start Thread" button
...
// create a new borderless, resizable SDL window
WindowRef = SDL_CreateWindow("test", 10, 10, 1280, 800, SDL_WINDOW_RESIZABLE | SDL_WINDOW_BORDERLESS);
// create and start the thread
test_thread = new TestThread();
test_thread->start();
...
This creates a window separate from the Qt app window and starts rendering a bouncing square. However, if any resize event occurs in the Qt app, the rendering context is lost and the same thing that happens in my video player happens here.
I also found out that if I remove the SDL_RenderPresent call from the thread object and put it in the main Qt window, rendering continues after a resize event. However, this has proved completely unreliable and will sometimes freeze my app entirely.
I also can't figure out why my completely separate SDL window and renderer still freeze on resize.
I presume there is a clash somewhere between the SDL renderer/window and Qt's drawing machinery, but I'm at a loss here.
Also, it's only the resize behavior; everything else works.
Thanks.
Answer:
The SDL_Renderer needs to be destroyed and recreated on window resize, as does any SDL_Texture created with the previous renderer.
The same thing happens even without Qt.
However, I think this is just a workaround and not a real solution.
Here is a simple program that reproduces the issue:
int position = 0;
int dir = 0;
SDL_Window *window = NULL;
SDL_Renderer *sdlRenderer_ = NULL;

DWORD WINAPI MyThreadFunction(LPVOID lpParam)
{
    for (;;)
    {
        SDL_Delay(1000 / 60);
        // Basic square bouncing animation
        SDL_Rect spos;
        spos.h = 100;
        spos.w = 100;
        spos.y = 100;
        spos.x = position;
        SDL_SetRenderDrawColor(sdlRenderer_, 0, 0, 0, 255);
        SDL_RenderFillRect(sdlRenderer_, 0);
        SDL_SetRenderDrawColor(sdlRenderer_, 0xFF, 0x0, 0x0, 0xFF);
        SDL_RenderFillRect(sdlRenderer_, &spos);
        SDL_RenderPresent(sdlRenderer_);
        if (position >= 500)
            dir = 0;
        else if (position <= 0)
            dir = 1;
        if (dir)
            position += 5;
        else
            position -= 5;
    }
}
int APIENTRY wWinMain(_In_ HINSTANCE hInstance, _In_opt_ HINSTANCE hPrevInstance, _In_ LPWSTR lpCmdLine, _In_ int nCmdShow)
{
    SDL_Init(SDL_INIT_VIDEO);
    window = SDL_CreateWindow("test", SDL_WINDOWPOS_UNDEFINED, SDL_WINDOWPOS_UNDEFINED, 600, 600, SDL_WINDOW_SHOWN | SDL_WINDOW_RESIZABLE);
    if (!window)
        printf("Unable to create window");
    sdlRenderer_ = SDL_CreateRenderer(window, -1, SDL_RENDERER_ACCELERATED | SDL_RENDERER_PRESENTVSYNC | SDL_RENDERER_TARGETTEXTURE);
    if (!sdlRenderer_)
        printf("Unable to create renderer");
    HANDLE playHandle = CreateThread(0, 0, MyThreadFunction, 0, 0, 0);
    if (playHandle == NULL)
    {
        return 0;
    }
    SDL_Event e;
    while (1)
    {
        SDL_PollEvent(&e);
        if (e.type == SDL_WINDOWEVENT)
        {
            switch (e.window.event)
            {
            case SDL_WINDOWEVENT_SIZE_CHANGED:
            {
                int mWidth = e.window.data1;
                int mHeight = e.window.data2;
                SDL_DestroyRenderer(sdlRenderer_); // rendering stops on resize if this is commented out
                sdlRenderer_ = SDL_CreateRenderer(window, -1, SDL_RENDERER_ACCELERATED | SDL_RENDERER_PRESENTVSYNC | SDL_RENDERER_TARGETTEXTURE);
                break;
            }
            }
        }
    }
    return 0;
}
EDIT:
The real solution: the renderer doesn't need to be recreated. Separate the Qt code from the SDL code by creating a dedicated thread, and create all SDL objects in that thread, because the SDL_Renderer needs to live in the thread that handles SDL events.
Use SDL_PushEvent to signal this thread to render to the screen.
This way the textures don't need to be recreated.
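A minimal sketch of that arrangement (names, window parameters, and the frame-drawing step are illustrative):

// Registered once, from the SDL thread, with SDL_RegisterEvents(1).
static Uint32 RENDER_EVENT;

// All SDL objects are created and used in this one thread.
void sdlThreadMain()
{
    SDL_Init(SDL_INIT_VIDEO);
    RENDER_EVENT = SDL_RegisterEvents(1);
    SDL_Window* window = SDL_CreateWindow("video", SDL_WINDOWPOS_UNDEFINED,
        SDL_WINDOWPOS_UNDEFINED, 1280, 800, SDL_WINDOW_RESIZABLE);
    SDL_Renderer* renderer = SDL_CreateRenderer(window, -1,
        SDL_RENDERER_ACCELERATED | SDL_RENDERER_PRESENTVSYNC);

    SDL_Event e;
    while (SDL_WaitEvent(&e))
    {
        if (e.type == RENDER_EVENT)
        {
            // ... draw the pending frame here ...
            SDL_RenderPresent(renderer);
        }
        else if (e.type == SDL_QUIT)
            break;
    }
    SDL_DestroyRenderer(renderer);
    SDL_DestroyWindow(window);
}

// Called from the decoder (or any other) thread:
void requestRender()
{
    SDL_Event e;
    SDL_zero(e);
    e.type = RENDER_EVENT;
    SDL_PushEvent(&e); // SDL_PushEvent is thread-safe by design
}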

malloc() fails in QtConcurrent::run()

On x86, QImage initialization may fail on a worker thread (this is rare on x64).
The probability increases when more tasks run in parallel than the CPU has cores.
This occurs not only when reading from an image file, but also when initializing a plain QImage by specifying its size, or simply when calling QImage::copy().
This is code to avoid it. Of course it is not perfect.
Please tell me a better way.
QImage createImageAsync(QString path)
{
    QImageReader reader(path);
    if (!reader.canRead())
        return QImage();
    // QImage processing sometimes fails
    QImage src;
    int count = 0;
    do {
        src = reader.read();
        if (!src.isNull())
            break;
        if (src.isNull() && count++ < 1000) {
            QThread::usleep(1000); // back off, then retry the read
            continue;
        }
        return QImage();
    } while (1);
    return src;
}
Answer:
Essentially this problem was caused by heap memory fragmentation.
Therefore, one conceivable solution is to replace the memory allocator with tcmalloc or jemalloc, for example.
In my application, I confirmed that the problem can be sufficiently avoided in the x86 build by limiting the number of image files that are open at the same time, as sketched below.
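A sketch of that limiting approach using a counting semaphore (the limit of 4 and the names are mine, for illustration):

#include <QImage>
#include <QImageReader>
#include <QSemaphore>
#include <QString>

static QSemaphore g_decodeSlots(4); // at most 4 decodes in flight

QImage createImageLimited(const QString& path)
{
    g_decodeSlots.acquire();    // block while too many decodes are active
    QImageReader reader(path);
    QImage img = reader.read(); // null on failure
    g_decodeSlots.release();
    return img;
}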

Strange behavior of program on Qt 5

I'm writing two programs in Qt 5. The first ("Debug") program sends requests to the second ("Simulator") through a PCI-E device. The "Simulator" must respond as fast as possible. That is a separate issue; right now I want to ask about another strange effect. The "Debug" program writes 10 bytes from a QLineEdit to the PCI-E device and then waits for an answer from the "Simulator". When the answer comes fast enough, I see the right data bytes in the "Debug" program window; otherwise the PCI-E device returns a copy of the sent data and I see that instead. The issue is that the data can be sent in two ways: by clicking the Send button on the form, or by pressing the Return key. In both cases the data is sent from the following slot:
void MyWin::on_pushButton_Send_clicked()
{
    if (data_matched)
    {
        QString exp = ui.lineEdit_Data->text();
        QStringList list = exp.split(QRegExp("\\s"), QString::SkipEmptyParts);
        for (int i = 0; i < list.size(); i++)
        {
            quint8 a = list[i].toUInt(0, 16);
            data[i] = a;
        }
        write_insys(bHandle, data, DataSize);
        ui.textEdit->append( /* show sent bytes */ );
        read_insys(bHandle, data, DataSize);
        ui.textEdit->append( /* show received bytes */ );
    }
}
But in the second case (on a Return key press) the only difference is that the above slot is invoked from the following slot:
void MyWin::on_lineEdit_Data_returnPressed()
{
    on_pushButton_Send_clicked();
}
But the results:
1st case: 90% wrong answers
2nd case: 90% right answers
The code of write_insys and read_insys is absolutely trivial; they simply call the library functions:
bool write_insys(BRD_Handle handle, void* data, int size)
{
    S32 res = BRD_putMsg(handle, NODE0, data, (U32*)&size, BRDtim_FOREVER);
    return (res >= 0);
}

bool read_insys(BRD_Handle handle, void* data, int size)
{
    S32 res = BRD_getMsg(handle, NODE0, data, (U32*)&size, BRDtim_FOREVER);
    return (res >= 0);
}
Does anyone know why this might happen?
Windows 7, Qt 5.4.2, MSVC 2010.
edit: Most likely it is a Qt bug...

CUDA streams, texture binding and async memcpy

Writing some signal-processing code in CUDA, I recently made huge progress optimizing it. By using 1D textures and adjusting my access patterns I managed to get a 10× performance boost. (I had previously tried transaction-aligned prefetching from global into shared memory, but the non-uniform access patterns happening later messed up the warp→shared-memory bank association, I think.)
So now I'm facing the question of how CUDA textures and bindings interact with asynchronous memcpy.
Consider the following kernel:
texture<...> mytexture;

__global__ void mykernel(float *pOut)
{
    pOut[threadIdx.x] = tex1Dfetch(mytexture, threadIdx.x);
}
The kernel is launched in multiple streams:
extern void *sourcedata;
#define N_CUDA_STREAMS ...
cudaStream_t stream[N_CUDA_STREAMS];
void *d_pOut[N_CUDA_STREAMS];
void *d_texData[N_CUDA_STREAMS];

for(int k_stream = 0; k_stream < N_CUDA_STREAMS; k_stream++) {
    cudaStreamCreate(&stream[k_stream]);
    cudaMalloc(&d_pOut[k_stream], ...);
    cudaMalloc(&d_texData[k_stream], ...);
}

/* ... */

for(int i_datablock = 0; i_datablock < n_datablocks; i_datablock++) {
    int const k_stream = i_datablock % N_CUDA_STREAMS;
    cudaMemcpyAsync(d_texData[k_stream], (char*)sourcedata + i_datablock * blocksize, ..., stream[k_stream]);
    cudaBindTexture(0, &mytexture, d_texData[k_stream], ...);
    mykernel<<<..., stream[k_stream]>>>(d_pOut[k_stream]);
}
Now what I wonder about is: since there is only one texture reference, what happens when I bind a buffer to the texture while other streams' kernels are accessing it? cudaBindTexture doesn't take a stream parameter, so I'm worried that by binding the texture to another device pointer while running kernels are asynchronously accessing said texture, I'll divert their accesses to the other data.
The CUDA documentation doesn't say anything about this. If I have to disentangle this to allow concurrent access, it seems I'd have to create a number of texture references and use a switch statement to choose between them, based on the stream number passed as a kernel launch parameter.
Unfortunately CUDA doesn't allow arrays of texture references on the device side, i.e. the following does not work:
texture<...> texarray[N_CUDA_STREAMS];
Layered textures are not an option either, because the amount of data I have only fits in a plain 1D texture not bound to a CUDA array (see table F-2 in the CUDA 4.2 C Programming Guide).
Answer:
Indeed, you cannot rebind the texture while another stream's kernel may still be using it. Since the number of streams doesn't need to be large to hide the asynchronous memcpys (two would already do), you can use C++ templates to give each stream its own texture:
texture<float, 1, cudaReadModeElementType> mytexture1;
texture<float, 1, cudaReadModeElementType> mytexture2;

template<int TexSel> __device__ float myTex1Dfetch(int x);
template<> __device__ float myTex1Dfetch<1>(int x) { return tex1Dfetch(mytexture1, x); }
template<> __device__ float myTex1Dfetch<2>(int x) { return tex1Dfetch(mytexture2, x); }

template<int TexSel> __global__ void mykernel(float *pOut)
{
    pOut[threadIdx.x] = myTex1Dfetch<TexSel>(threadIdx.x);
}

int main(void)
{
    float *out_d[2];
    // ...
    // note: the stream is the fourth launch parameter;
    // the third is the dynamic shared-memory size
    mykernel<1><<<blocks, threads, 0, stream[0]>>>(out_d[0]);
    mykernel<2><<<blocks, threads, 0, stream[1]>>>(out_d[1]);
    // ...
}

JIT compilation and DEP

I was thinking of trying my hand at some JIT compilation (just for the sake of learning), and it would be nice to have it work cross-platform, since I run all three major OSes at home (Windows, OS X, Linux).
With that in mind, I want to know if there is any way to avoid using the Windows virtual-memory functions to allocate memory with execute permission. It would be nice to just use malloc or new and point the processor at such a block.
Any tips?
Answer:
DEP just removes execute permission from every non-code page of memory. The application's code is loaded into memory that does have execute permission, and there are lots of JITs that work on Windows/Linux/Mac OS X even with DEP active. That's because there is a way to dynamically allocate memory with the needed permissions set.
Usually plain malloc should not be used, because permissions are per page. Aligning malloc'ed memory to page boundaries is still possible, at the price of some overhead; otherwise you need some custom memory management (for the executable code only). Custom management is a common way of doing JIT.
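A sketch of that alignment idea on Linux (illustrative only: calling mprotect on heap memory is not portable, and the allocator may keep metadata near the block):

#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

void* alloc_executable(size_t sz)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t rounded = (sz + page - 1) & ~(page - 1);
    void* p = NULL;
    // allocate a whole number of pages, aligned to a page boundary
    if (posix_memalign(&p, page, rounded) != 0)
        return NULL;
    // flip execute permission on exactly those pages
    if (mprotect(p, rounded, PROT_READ | PROT_WRITE | PROT_EXEC) != 0) {
        free(p);
        return NULL;
    }
    return p;
}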
There is a solution in the Chromium project, which uses a JIT for the V8 JavaScript VM and is cross-platform. To be cross-platform, the needed function is implemented in several files, and the right one is selected at compile time.
Linux (chromium src/v8/src/platform-linux.cc): the flag is PROT_EXEC of mmap().
void* OS::Allocate(const size_t requested,
                   size_t* allocated,
                   bool is_executable) {
  const size_t msize = RoundUp(requested, AllocateAlignment());
  int prot = PROT_READ | PROT_WRITE | (is_executable ? PROT_EXEC : 0);
  void* addr = OS::GetRandomMmapAddr();
  void* mbase = mmap(addr, msize, prot, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (mbase == MAP_FAILED) {
    /** handle error */
    return NULL;
  }
  *allocated = msize;
  UpdateAllocatedSpaceLimits(mbase, msize);
  return mbase;
}
Win32 (src/v8/src/platform-win32.cc): the flag is PAGE_EXECUTE_READWRITE of VirtualAlloc.
void* OS::Allocate(const size_t requested,
                   size_t* allocated,
                   bool is_executable) {
  // The address range used to randomize RWX allocations in OS::Allocate.
  // Try not to map pages into the default range that windows loads DLLs.
  // Use a multiple of 64k to prevent committing unused memory.
  // Note: This does not guarantee RWX regions will be within the
  // range kAllocationRandomAddressMin to kAllocationRandomAddressMax.
#ifdef V8_HOST_ARCH_64_BIT
  static const intptr_t kAllocationRandomAddressMin = 0x0000000080000000;
  static const intptr_t kAllocationRandomAddressMax = 0x000003FFFFFF0000;
#else
  static const intptr_t kAllocationRandomAddressMin = 0x04000000;
  static const intptr_t kAllocationRandomAddressMax = 0x3FFF0000;
#endif
  // VirtualAlloc rounds allocated size to page size automatically.
  size_t msize = RoundUp(requested, static_cast<int>(GetPageSize()));
  intptr_t address = 0;
  // Windows XP SP2 allows Data Execution Prevention (DEP).
  int prot = is_executable ? PAGE_EXECUTE_READWRITE : PAGE_READWRITE;
  // For executable pages try to randomize the allocation address.
  if (prot == PAGE_EXECUTE_READWRITE &&
      msize >= static_cast<size_t>(Page::kPageSize)) {
    address = (V8::RandomPrivate(Isolate::Current()) << kPageSizeBits)
        | kAllocationRandomAddressMin;
    address &= kAllocationRandomAddressMax;
  }
  LPVOID mbase = VirtualAlloc(reinterpret_cast<void*>(address),
                              msize,
                              MEM_COMMIT | MEM_RESERVE,
                              prot);
  if (mbase == NULL && address != 0)
    mbase = VirtualAlloc(NULL, msize, MEM_COMMIT | MEM_RESERVE, prot);
  if (mbase == NULL) {
    LOG(ISOLATE, StringEvent("OS::Allocate", "VirtualAlloc failed"));
    return NULL;
  }
  ASSERT(IsAligned(reinterpret_cast<size_t>(mbase), OS::AllocateAlignment()));
  *allocated = msize;
  UpdateAllocatedSpaceLimits(mbase, static_cast<int>(msize));
  return mbase;
}
Mac OS (src/v8/src/platform-macos.cc): the flag is PROT_EXEC of mmap, just like on Linux and other POSIX systems.
void* OS::Allocate(const size_t requested,
                   size_t* allocated,
                   bool is_executable) {
  const size_t msize = RoundUp(requested, getpagesize());
  int prot = PROT_READ | PROT_WRITE | (is_executable ? PROT_EXEC : 0);
  void* mbase = mmap(OS::GetRandomMmapAddr(),
                     msize,
                     prot,
                     MAP_PRIVATE | MAP_ANON,
                     kMmapFd,
                     kMmapFdOffset);
  if (mbase == MAP_FAILED) {
    LOG(Isolate::Current(), StringEvent("OS::Allocate", "mmap failed"));
    return NULL;
  }
  *allocated = msize;
  UpdateAllocatedSpaceLimits(mbase, msize);
  return mbase;
}
I also want to note that the bcdedit.exe-style approach (below) should only be needed for very old programs that create new executable code in memory without setting the Exec property on those pages. For newer programs, like Firefox or Chrome/Chromium, or any modern JIT, DEP should stay active, and the JIT will manage memory permissions in a fine-grained manner.
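For illustration, a minimal sketch of that fine-grained pattern on POSIX (on Windows the analogous calls are VirtualAlloc and VirtualProtect; the helper is mine, not from the Chromium sources):

#include <string.h>
#include <sys/mman.h>

// Allocate writable memory, copy the generated machine code in,
// then flip the pages to read+execute before running them (W^X).
void* emit_code(const void* code, size_t len)
{
    void* buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;
    memcpy(buf, code, len);
    if (mprotect(buf, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(buf, len);
        return NULL;
    }
    return buf;
}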
Answer:
One possibility is to require that Windows installations running your program be configured either for DEP AlwaysOff (a bad idea) or DEP OptOut (a better idea).
This can be configured (under WinXP SP2+ and Win2k3 SP1+, at least) by changing the boot.ini file to contain the setting:
/noexecute=OptOut
and then configuring your individual program to opt out by choosing (under XP):
Start button
Control Panel
System
Advanced tab
Performance Settings button
Data Execution Prevention tab
This should allow you to execute code from within your program that's created on the fly in malloc() blocks.
Keep in mind that this makes your program more susceptible to attacks that DEP was meant to prevent.
It looks like this is also possible in Windows 2008 with the command:
bcdedit.exe /set {current} nx OptOut
But, to be honest, if you just want to minimise platform-dependent code, that's easy to do just by isolating the code into a single function, something like:
void *MallocWithoutDep(size_t sz) {
#if defined(_WIN32)
    // Windows: reserve and commit a read/write/execute region in one call
    return VirtualAlloc(NULL, sz, MEM_COMMIT | MEM_RESERVE,
                        PAGE_EXECUTE_READWRITE);
#else
    // Linux and Mac OS X: mmap an anonymous mapping with PROT_EXEC
    void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE | PROT_EXEC,
                   MAP_PRIVATE | MAP_ANON, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
#endif
}
If you put all your platform-dependent functions in their own files, the rest of your code is automatically platform-agnostic.
