Buffer starvation inside MS MPEG-2 Demultiplexer filter - DirectShow

My capture graph is dying. I have traced the problem to media-sample buffer starvation inside the Microsoft MPEG-2 Demultiplexer filter.
Processing stops inside CBaseAllocator::GetBuffer: the pool is exhausted, and the thread sleeps indefinitely, waiting for a buffer to be recycled.
0:866> ~~[3038]s
ntdll!NtWaitForSingleObject+0x14:
00007ffe`49199f74 c3 ret
0:094> k
# Child-SP RetAddr Call Site
00 00000035`807fede8 00007ffe`460b9252 ntdll!NtWaitForSingleObject+0x14
01 00000035`807fedf0 00007ffe`22a35f4e KERNELBASE!WaitForSingleObjectEx+0xa2
02 00000035`807fee90 00007ffe`35609460 QUARTZ!CBaseAllocator::GetBuffer+0x7e
03 00000035`807feec0 00007ffe`3560697a mpg2splt!CMediaSampleCopyBuffer::GetCopyBuffer+0x60
04 00000035`807fef60 00007ffe`35606cc9 mpg2splt!CBufferSourceManager::GetNewCopyBuffer+0x3a
05 00000035`807fefa0 00007ffe`356073de mpg2splt!CStreamParser::CopyStream+0x89
06 00000035`807feff0 00007ffe`35608325 mpg2splt!CMpeg2PESStreamParser::ProcessBuffer_+0x15a
07 00000035`807ff040 00007ffe`35610724 mpg2splt!CMpeg2PESStreamParser::ProcessSysBuffer+0x135
08 00000035`807ff090 00007ffe`3560fb2e mpg2splt!CStreamMapContext::Process+0xb4
09 00000035`807ff110 00007ffe`3560f621 mpg2splt!CTransportStreamMapper::ProcessTSPacket_+0x30e
0a 00000035`807ff2d0 00007ffe`355fd0c1 mpg2splt!CTransportStreamMapper::Process+0xf1
0b 00000035`807ff320 00007ffe`355f4eb8 mpg2splt!CMPEG2Controller::ProcessMediaSampleLocked+0x111
0c 00000035`807ff3a0 00007ffe`355f98a7 mpg2splt!CMPEG2Demultiplexer::ProcessMediaSampleLocked+0x7c
0d 00000035`807ff3f0 00007ffd`ba58cba3 mpg2splt!CMPEG2DemuxInputPin::Receive+0x87
0e 00000035`807ff480 00007ffd`ba58ca4d 0x00007ffd`ba58cba3
0f 00000035`807ff530 00007ffd`ba58c92e 0x00007ffd`ba58ca4d
10 00000035`807ff590 00007ffe`19b5222e 0x00007ffd`ba58c92e
11 00000035`807ff5d0 00007ffe`246e5402 clr!UMThunkStub+0x6e
12 00000035`807ff660 00007ffe`2472aa23 qedit!CSampleGrabber::Receive+0x1b2
13 00000035`807ff6d0 00007ffe`287ea6d6 qedit!CTransformInputPin::Receive+0x53
14 00000035`807ff700 00007ffe`287ea459 Obsidian_DSP_DirectShow!MulticastSourceFilter::UDP_consumerThreadProc+0x276 [s:\library\obsidian.dsp.directshow\multicastsourcefilter.cpp # 475]
15 00000035`807ff7f0 00007ffe`46f73034 Obsidian_DSP_DirectShow!MulticastSourceFilter::UDP_consumerThreadEntry+0x9 [s:\library\obsidian.dsp.directshow\multicastsourcefilter.cpp # 445]
16 00000035`807ff820 00007ffe`49171461 KERNEL32!BaseThreadInitThunk+0x14
17 00000035`807ff850 00000000`00000000 ntdll!RtlUserThreadStart+0x21
Here are a few facts about this particular graph:
- The source media is a heavily multiplexed MPEG2-TS UDP stream.
- The stream contains 14 SD TV programs, consuming 37.5 Mbps of network bandwidth.
- The problem occurs predictably during periods where the stream becomes heavily fragmented (the audio and video decoders emit a burst of samples with IsDiscontinuity() returning TRUE).
- According to windbg (and SOS), no managed or unmanaged locks are contended (no possibility of a deadlock).
- There is no evidence of a "runaway" thread (nothing is stuck in an infinite loop).
- The graph's final filter is a GDCL bridge, which then bridges the decoded samples to an MP4 muxer.
- The demuxer's video output is connected to an instance of the ffdshow decoder filter; its audio output is connected to an instance of the LAV audio decoder filter.
Am I right to suspect the problem could be inside either the ffdshow or the LAV filter? (Who else could be holding the demuxer's buffers?)
Any pointers or suggestions on how to trace why the buffer pool inside the demuxer is exhausted?

It looks like the memory allocator on a certain pin connection has all of its buffers in use with external references, so the thread fell asleep waiting for a buffer to be returned for recycling.
This is expected behavior, and the problem is either too few buffers or excessive referencing downstream.
You seem to be able to identify the pin connection from the call stack, and you could either increase the number of buffers or provide a custom memory allocator that expands on demand.
The easiest case is when your own filter is a part of the connection: you can affect the allocator during the negotiation phase, either by providing allocator requirements or by directly updating the allocator properties. In more complicated cases you can locate the existing connection and change the properties before the graph goes active. In even more complicated ones you can insert a no-op filter into the processing chain just for the purpose of getting in between and having direct access to the effective allocator.
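For illustration, a minimal sketch of the second approach: locate an existing connection and raise the buffer count before the graph goes active. The pin pointer and the factor of 4 are placeholders, and SetProperties fails once the allocator is committed, so this has to happen before Pause/Run:

    // Sketch: expand the allocator of an existing pin connection before the
    // graph goes active. pDemuxOut is a hypothetical IPin* to the starved
    // demultiplexer output pin; error handling is abbreviated.
    #include <dshow.h>

    HRESULT ExpandConnectionAllocator(IPin *pDemuxOut)
    {
        IPin *pDownstream = NULL;
        HRESULT hr = pDemuxOut->ConnectedTo(&pDownstream);
        if (FAILED(hr)) return hr;

        IMemInputPin *pMemInput = NULL;
        hr = pDownstream->QueryInterface(IID_IMemInputPin, (void**)&pMemInput);
        pDownstream->Release();
        if (FAILED(hr)) return hr;

        IMemAllocator *pAllocator = NULL;
        hr = pMemInput->GetAllocator(&pAllocator);
        pMemInput->Release();
        if (FAILED(hr)) return hr;

        ALLOCATOR_PROPERTIES props, actual;
        hr = pAllocator->GetProperties(&props);
        if (SUCCEEDED(hr))
        {
            props.cBuffers *= 4;  // e.g. quadruple the pool as an experiment
            hr = pAllocator->SetProperties(&props, &actual);
        }
        pAllocator->Release();
        return hr;
    }

Note that GetAllocator is not guaranteed to return the allocator actually selected during negotiation with every filter, so verify in the debugger that the buffer count really changed. If it did and the stall merely takes longer to appear, that points at excessive referencing downstream rather than a too-small pool.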

Related

The elegant way to handle ADCs with DMA in an RTOS

I'm currently setting up Azure RTOS (ThreadX on an STM32) with Ethernet, SPI and ADCs activated.
This STM32 has to pass through configuration information from time to time, coming from my PC over the Ethernet port.
It has to pass this information via SPI to two other STM32s, which makes the first STM32 the system controller / system interface. This will be a low-priority task, since the activation of the passed configuration will be started by sync lines running from the system controller to the two other STMs.
While doing so, the system controller has to read in ADC values constantly and pass them via Ethernet/TCP to my computer.
I've used the ThreadX TCP server example, as given by STM, as a starting point.
From there I've managed to set up three servers on three ports, communicating successfully with a Python script on my PC (as a first test).
Now come the two great questions:
1)
Since my input signal may contain frequencies up to 2.5 MHz, I want to digitize this signal at the full 5 MSPS (Nyquist) that ADC3 is capable of.
The smallest internally available data type at full resolution is uint16_t, which makes the data rate R = 16 bit * 5 MSPS = 80 Mbit/s (worst case; I bet there is optimization possible, e.g. 8 bits of resolution, which halves the data rate, but that resolution might not be enough; or 16 bits and an FFT afterwards, which is also sufficient, since I'm mostly interested in energy per frequency band, but initially I wanted to do that on my computer, for best flexibility).
Even if the Ethernet interface is capable of 100 Mbit/s, the TCP layer of NetX Duo, I bet, is not.
(There is also USB OTG available on this board, but since networked devices are in my opinion more versatile, I prefer using Ethernet; nevertheless, USB might be a backup solution.)
From my measurements, a data stream transmitted to the uC via TCP from within Python, and mirrored back to my PC within a thread, allows for relatively consistent 20 Mbit/s.
How do I push this speed to a better level?
(I think 20 Mbit/s is the back-and-forth data rate, so one-way may be faster.)
Now for the second question:
2)
The ADC within the STM32 is capable of storing data to memory via DMA.
There are two callbacks available: one at the half-full and one at the full buffer state.
My problem is mostly about the right way to read out the DMA buffer and/or trigger the conversions in the first place.
How do you do this the "right" way on an RTOS (such that you don't break the RT in RTOS)?
I see some options here; what are the pros/cons you can think of?
a) Let the ADC run freely, calling the callbacks at the respective fill levels and triggering a TCP transmission whenever one of the callbacks is reached (see the sketch after this list)
-> may lead to glitches due to insufficient speed of the TCP layer, in my opinion.
b) Let the ADC conversion be triggered by a thread, which is preempted and will later TCP-transmit the data as soon as the memory buffer is full
-> may lead to inconsistency in the converted values, since you get burst-style conversions with gaps in between while the buffer is read.
c) Let a thread trigger each conversion individually
-> a no-go I think, since threads are not scheduled often enough to get a decent sample frequency.
d) Let a free-running ADC trigger the callbacks, let a thread do the FFT, and transmit the data via TCP within another thread
-> may work, but is less flexible, since the data gets crunched within the uC.
--> Are there other ways you can think of / what do you think about the ways I named here?
--> What do you think about question 1)?
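To make option a) concrete, here is roughly the pattern I have in mind, as a minimal sketch: the DMA half/full callbacks only set an event flag, and a dedicated thread ships the completed half-buffer. All names (adc_events, adc_buf, net_send) are hypothetical, and it assumes the STM32 HAL on top of ThreadX:

    // Option a) as a sketch: free-running ADC+DMA in circular mode; the HAL
    // callbacks only set event flags, a ThreadX thread does the TCP send.
    // adc_events must be created with tx_event_flags_create() during init,
    // and the ADC started once with HAL_ADC_Start_DMA().
    #include "tx_api.h"
    #include "stm32h7xx_hal.h"   // or the HAL header for your part

    #define ADC_BUF_LEN 8192u
    #define EVT_HALF    0x01u
    #define EVT_FULL    0x02u

    static TX_EVENT_FLAGS_GROUP adc_events;
    static uint16_t adc_buf[ADC_BUF_LEN];

    extern "C" void HAL_ADC_ConvHalfCpltCallback(ADC_HandleTypeDef *hadc)
    {
        // First half of adc_buf is ready while the DMA fills the second half.
        tx_event_flags_set(&adc_events, EVT_HALF, TX_OR);
    }

    extern "C" void HAL_ADC_ConvCpltCallback(ADC_HandleTypeDef *hadc)
    {
        tx_event_flags_set(&adc_events, EVT_FULL, TX_OR);
    }

    void adc_tx_thread_entry(ULONG input)
    {
        for (;;)
        {
            ULONG flags = 0;
            tx_event_flags_get(&adc_events, EVT_HALF | EVT_FULL,
                               TX_OR_CLEAR, &flags, TX_WAIT_FOREVER);
            // If both flags are set here, the sender has fallen behind and
            // data was already overwritten -- count that as an overrun.
            const uint16_t *half = (flags & EVT_HALF)
                                   ? &adc_buf[0] : &adc_buf[ADC_BUF_LEN / 2];
            // net_send() stands in for the NetX Duo TCP send path:
            // net_send(half, (ADC_BUF_LEN / 2) * sizeof(uint16_t));
            (void)half;
        }
    }

The glitch risk then reduces to one measurable condition: the send path must drain half the buffer faster than the DMA fills the other half; otherwise the buffer has to grow, or the data has to be reduced on-chip as in option d).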
Have a nice day!

STM32H7 | Portenta H7 Data missing during DMA transfer (ADC to Memory)

I'm currently working on the STM32H747XI (Portenta H7). I'm programming ADC1 with DMA1 to get 16-bit data at 1 Msps.
I'm sorry, I can't share my entire code, so I will try to describe my configuration as precisely as possible.
I'm using ADC1 triggered by a 1 MHz timer. The ADC is working in continuous mode with the DMA in circular and double-buffer mode. I tried direct mode and burst with a full FIFO. I get no DMA error interrupt and no ADC overrun.
My peripherals are running, but I'm stuck on two issues. First issue: I'm filling a buffer of 8192 uint16_t values and sending it over USB CDC with the Arduino function USBserial.Write(buf, len). In this case the USB transfer goes fine, but some data in my buffer is wrong: the DMA increments the memory address but doesn't write, so no samples are missing, but the values are stale (they belong to the old buffer).
You can see the data plot below:
transfer with buffer of 8192 samples
If I double the buffer size, this issue is fixed, but another appears: the USB VCP transfer fails if the data buffer is longer than 16384 bytes, and some data gets cut. I tried to solve this with different send sizes and delays, but it doesn't work; I still get the same kind of cut.
Here is the data plot of the same script with a longer buffer: transfer with buffer of 16384 samples (32768 bytes)
Thank you for your help. I remain available.
For a fast check, try disabling the data cache. You're probably not managing the cache correctly, or you haven't disabled caching in the memory region you're using for DMA.
Peripherals are not aware of the cache, so you must manage it manually. You also have to align the buffers to cache lines in this case.
Refer to AN4839.
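If disabling the cache fixes it, the longer-term fix is explicit maintenance. A minimal sketch, assuming the CMSIS cache functions, the HAL's ALIGN_32BYTES macro, and a made-up buffer name:

    // Invalidate the D-cache over the half-buffer the CPU is about to read,
    // so stale cached lines don't mask what the DMA actually wrote.
    #include "stm32h7xx_hal.h"

    // Cortex-M7 cache lines are 32 bytes: align the buffer and keep its size
    // a multiple of 32 so invalidation cannot clip neighbouring data.
    ALIGN_32BYTES(static uint16_t adc_buf[8192]);

    extern "C" void HAL_ADC_ConvCpltCallback(ADC_HandleTypeDef *hadc)
    {
        SCB_InvalidateDCache_by_Addr((uint32_t *)&adc_buf[4096],
                                     4096 * sizeof(uint16_t));
        // The second half of adc_buf is now coherent and safe to send;
        // do the same for the first half in HAL_ADC_ConvHalfCpltCallback.
    }

The same effect can be had by configuring the buffer's memory region as non-cacheable with the MPU, which is the other option AN4839 describes.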

IMFMediaSession.Close() not working as intended?

According to https://learn.microsoft.com/en-us/windows/desktop/api/mfidl/nf-mfidl-imfmediasession-close , once IMFMediaSession::Close is called, I am supposed to receive an MESessionClosed event, which I get in most cases but not always.
I have a few customers with growing native memory leaks, and I think one of the reasons is either what I mentioned above or Media Foundation's interaction with the GPU driver, since I have analyzed dumps where I saw thousands of threads open in atiumd64.dll, in the method OpenAdapter:
00 000000b0`cecff8f8 00007ff8`c1cf9252 ntdll!NtWaitForSingleObject+0x14
01 000000b0`cecff900 00007ff8`752d2ccd KERNELBASE!WaitForSingleObjectEx+0xa2
02 000000b0`cecff9a0 00007ff8`757bf247 atiumd64!OpenAdapter+0x63ced
03 000000b0`cecff9d0 00007ff8`757bf3ee atiumd64!XdxInitXopAdapterServices+0x3d0a57
04 000000b0`cecffa00 00007ff8`c4293034 atiumd64!XdxInitXopAdapterServices+0x3d0bfe
05 000000b0`cecffa30 00007ff8`c4d91461 kernel32!BaseThreadInitThunk+0x14
06 000000b0`cecffa60 00000000`00000000 ntdll!RtlUserThreadStart+0x21
I had a total of 160000 topologies created over the span of 4 days, and some 100 did not raise MESessionClosed at all; I fear these are the ones causing the leak.
In the cases where no MESessionClosed is sent, I notice they all have one error in common: -1072870850, which is MF_E_SAMPLEALLOCATOR_EMPTY.
I would love to know if anyone has had experience with Media Foundation not raising MESessionClosed as documented.
The MESessionClosed event is created on completion of the asynchronously executed IMFMediaSession::Close call. Not getting it indicates a closing problem, perhaps a problem with one of the primitives participating in the topology, such as a failure to end streaming because of an outstanding or leaked reference on some object.
Given the description of the problem, perhaps the best way to address it is to attach a debugger to the process (live, or by creating a dump and reviewing it interactively), expecting to find a thread waiting for something to close or complete.
Your seeing MF_E_SAMPLEALLOCATOR_EMPTY earlier might suggest that a leaked pointer to one of the samples prevents a sample allocator inside one of the primitives from terminating, which in turn creates a deadlock.
Other than this, you might want to run mftrace on the process and compare the output produced by a session that closes properly against one that fails.
One other thing you are interested in, including putting it as part of the question, is understanding the topology, especially whether it has third-party or optional segments you can temporarily exclude. Since you cannot do much debugging of MF internals directly, your options to change the topology could help you narrow the scope of the issue down to the specific primitive that is giving you trouble. If the topology has your own primitives, you should review their termination behavior.
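To make the stuck sessions visible at the call site, the close sequence can also be instrumented: Close is asynchronous, and Shutdown should only follow once MESessionClosed arrives. A sketch with a blocking event loop for brevity (a real implementation would use BeginGetEvent/EndGetEvent with an IMFAsyncCallback plus a watchdog timeout, so the ~100 in 160000 sessions that never deliver the event get logged instead of hanging a thread):

    // Close an IMFMediaSession and wait for MESessionClosed before Shutdown().
    #include <mfapi.h>
    #include <mfidl.h>
    #include <mferror.h>

    HRESULT CloseSessionChecked(IMFMediaSession *pSession)
    {
        HRESULT hr = pSession->Close();            // asynchronous request
        if (FAILED(hr)) return hr;

        for (;;)
        {
            IMFMediaEvent *pEvent = NULL;
            hr = pSession->GetEvent(0, &pEvent);   // 0 = blocking wait
            if (FAILED(hr)) break;                 // e.g. MF_E_SHUTDOWN

            MediaEventType type = MEUnknown;
            pEvent->GetType(&type);
            pEvent->Release();
            if (type == MESessionClosed) break;    // now safe to Shutdown()
        }
        return pSession->Shutdown();
    }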

How to process RTP data for Microsoft directshow MPEG1 decoder

Starting from the videoprocessing project, I'm trying to build a DirectShow filter that connects to an RTSP server, becoming a source filter for the Windows MPEG-1 decoder (I cannot use other formats or decoders, since WinCE is the OS target).
My filter declares a media type with
majortype MEDIATYPE_Video
subtype MEDIASUBTYPE_MPEG1Payload
formattype FORMAT_MPEGVideo
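(In code, the declaration looks roughly like the sketch below, using the DirectShow base classes; seq_hdr and its length are hypothetical, captured from the stream's 000001 B3 sequence header, which the MPEG1VIDEOINFO format block is meant to carry.)

    // Sketch: output media type for MPEG-1 payload video.
    #include <streams.h>   // DirectShow base classes (CMediaType)

    HRESULT BuildMpeg1MediaType(CMediaType &mt,
                                const BYTE *seq_hdr, DWORD seq_hdr_len)
    {
        mt.SetType(&MEDIATYPE_Video);
        mt.SetSubtype(&MEDIASUBTYPE_MPEG1Payload);
        mt.SetFormatType(&FORMAT_MPEGVideo);
        mt.SetTemporalCompression(TRUE);

        // MPEG1VIDEOINFO carries the sequence header inline after the struct.
        DWORD cb = sizeof(MPEG1VIDEOINFO) + seq_hdr_len;
        MPEG1VIDEOINFO *pvi = (MPEG1VIDEOINFO *)mt.AllocFormatBuffer(cb);
        if (pvi == NULL) return E_OUTOFMEMORY;
        ZeroMemory(pvi, cb);
        pvi->cbSequenceHeader = seq_hdr_len;
        CopyMemory(pvi->bSequenceHeader, seq_hdr, seq_hdr_len);
        // hdr (a VIDEOINFOHEADER) -- bmiHeader, AvgTimePerFrame, etc. --
        // would be filled in here from the parsed sequence header.
        return S_OK;
    }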
Currently, when I connect my rtspSource filter to the CLSID_CMpegVideoCodec decoder, I get a black video.
However, if I replace the Windows decoder with the CLSID_LAV_VideoDecoderFilter provided by the LAVFilters project, the video is rendered correctly.
After reading "How to process raw UDP packets so that they can be decoded by a decoder filter in a directshow source filter", which deals with the same issue for H264 and MPEG-4, I also read RFC 2250 and depacketized the data accordingly, but the result is the same.
Currently I'm sending the decoder packets starting with the video stream start code
000001 00 (Picture)
or complete packets starting with
000001 B3 (Sequence Header)
which also contain within them the start codes
000001 B2 (User Data)
000001 B8 (Group Of Pictures)
000001 00 (Picture)
000001 01 (Slice)
Still referring to the previous link, which deals with the H264 and MPEG-4 cases, it speaks about "processing data for the decoder", but it is not clear to me exactly what the CLSID_CMpegVideoCodec filter expects after agreeing on the MEDIASUBTYPE_MPEG1Payload media type.
However, when I add the three bytes 000001 or the four bytes 00000100 at the beginning of each sample, the video is rendered, but with images updated only about every 2 seconds and the intermediate images lost.
I performed the tests both by setting the IMediaSample times with
SetTime(NULL, NULL)
and by setting
SetTime(start, start+1)
with:
start = (rtp_timestamp - rtp_timestamp_first_packet) + 300ms
following the answer to "Writing custom DirectShow RTSP/RTP Source push filter - timestamping data coming from live sources",
but the results do not change.
Any suggestions would be greatly appreciated.
Thanks in advance.

Extracting SMPTE timecode from audio stream

I'm working on an audio recording system. My task involves extracting the SMPTE timecode from an audio input stream generated by a synchronizer device. I'm using the ASIO SDK to get the timecode of each callback buffer, but it's always zero.
Perhaps somebody who has experience with the ASIO SDK (or any other platform/SDK that can be used to extract SMPTE timecode from an audio stream) could help me?
Regards,
Ben
LTC is straightforward, so if nothing else, you could just scan the audio stream for LTC data, as documented on Wikipedia. Each 80-bit frame ends with the sync word 0011 1111 1111 1101; just scan for that bit sequence to synchronize, then cast the buffer data starting after that sync sequence to an array of 80-bit struct timecode_t elements. If your buffer is sized as a multiple of 80, your calculations will be easier (but you do need to test for sync loss, because soundcards lose bits on overruns).
The hard part is that, if I am not mistaken, the timecode "bits" are not the same as the bits of the sampled audio stream, so you would have to implement logic to detect the bit sequence. This can just be a for loop checking for the proper signal changes and appending bits to a buffer as appropriate (then calling the function that interprets the buffer once it is full).
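A sketch of that loop (LTC is biphase-mark coded: every bit cell has a transition at its boundary, and a '1' adds an extra mid-cell transition), assuming clean signed mono samples and a known nominal bit-cell length in samples, e.g. sample_rate / (30 fps * 80 bits) = 20 samples per bit at 48 kHz:

    // Minimal biphase-mark (LTC) bit extractor: transitions spaced a full
    // bit cell apart decode as 0, two half-cell transitions decode as 1.
    // samples_per_bit is the nominal cell length; real input also needs
    // thresholding/AGC, which this sketch omits.
    #include <stdio.h>
    #include <stdint.h>

    #define SYNC_WORD 0x3FFD  // 0011 1111 1111 1101, ends every 80-bit frame

    void decode_ltc(const float *samples, int n, double samples_per_bit)
    {
        double since_last = 0.0;          // samples since last zero crossing
        int last_sign = samples[0] >= 0.0f;
        int half_pending = 0;             // saw one half-cell edge, expect another
        uint32_t shift = 0;               // last 16 decoded bits, for sync search

        for (int i = 1; i < n; i++)
        {
            since_last += 1.0;
            int sign = samples[i] >= 0.0f;
            if (sign == last_sign) continue;          // no zero crossing here
            last_sign = sign;

            if (since_last > 0.75 * samples_per_bit)  // full cell: bit 0
            {
                shift = (shift << 1) & 0xFFFF;
                half_pending = 0;
            }
            else if (half_pending)                    // second half-cell: bit 1
            {
                shift = ((shift << 1) | 1) & 0xFFFF;
                half_pending = 0;
            }
            else
                half_pending = 1;                     // first half-cell edge

            since_last = 0.0;
            if (shift == SYNC_WORD)
                printf("frame boundary at sample %d\n", i);
            // a full decoder would now collect the following frame's 64 data
            // bits (LSB first per field) and check for the next sync word
        }
    }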
