How to process Heap traces? - .net-traceprocessing

Heap traces are some of the heaviest traces ETW can capture, and it would be interesting to handle them through .NET trace processing.
Is it possible currently to get this data, or is it not supported?
I only see UseHeapSnapshots(), which, if I understand correctly, relates to heap snapshots rather than heap event capture.

Currently we do not have any special built-in support for these events. However, you can still find them in GenericEvents (which contains all Manifested and TraceLogging events in the trace) or in ClassicEvents (which contains all Classic and WPP events), or by writing an EventConsumer (see the Extensibility section of https://blogs.windows.com/buildingapps/2019/05/09/announcing-traceprocessor-preview-0-1-0/ for an example).
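For context, here is a minimal F# sketch of pulling these events out of GenericEvents, assuming the preview TraceProcessor API described in the linked post (the function name, trace path, and the idea of filtering on ProviderId are illustrative, not part of the answer):

open Microsoft.Windows.EventTracing

// Sketch: enumerate all manifested/TraceLogging events in an ETW trace.
let dumpGenericEvents (etlPath: string) =
    use trace = TraceProcessor.Create etlPath
    let pendingEvents = trace.UseGenericEvents()
    trace.Process()
    for e in pendingEvents.Result.Events do
        // Filter on e.ProviderId here to keep only the heap events of interest.
        printfn "%O %O" e.Timestamp e.ProviderId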

Related

Lock free non-allocating collection

I am looking for a collection data structure that is:
thread safe
lock free
non-allocating (amortized or pre-allocated is fine)
non-intrusive
not using exotic intrinsics
Element order does not matter. Stack, queue, bag, anything is fine. I have found plenty of examples that fulfill four of these five requirements, for example:
.NET's List is not thread safe.
If I put a mutex on it, then it's not lock free.
.NET's ConcurrentStack is thread safe, lock free, uses a simple CompareExchange, but allocates a new Node for each element.
If I move the next pointer from the Node to the element itself, then it's intrusive.
Array based lock free data structures tend to require multi-word intrinsics.
I feel like I'm missing something super obvious. This should be a solved problem.
.NET's ConcurrentQueue fulfills all five requirements. It does allocate when the backing storage runs out of space, similar to List<T>, but as long as there is spare capacity, no allocations occur. Unfortunately, the only way to reserve extra capacity up front is to initialize it with a collection of the desired size and then dequeue all the elements.
The same is true for .NET's ConcurrentBag.
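For illustration, a minimal F# sketch of that seed-and-drain trick (the element type and capacity are placeholders):

open System.Collections.Concurrent

// Seed the queue so its internal segments are allocated up front...
let queue = ConcurrentQueue<int>(Array.zeroCreate<int> 1024)
// ...then drain it, leaving the pre-allocated segments available for reuse.
let mutable item = 0
while queue.TryDequeue(&item) do ()
// Enqueue/TryDequeue pairs now stay allocation-free until the reserved
// capacity is exceeded.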

IMFMediaSession.Close() not working as intended?

According to https://learn.microsoft.com/en-us/windows/desktop/api/mfidl/nf-mfidl-imfmediasession-close , once IMFMediaSession::Close is called, I am supposed to receive an MESessionClosed event. I do get it in most cases, but not always.
I have a few customers with growing native memory leaks, and I think one of the reasons is either what I mentioned above or Media Foundation's interaction with the GPU driver, since I have analyzed dumps where I saw thousands of threads open in atiumd64.dll, in the OpenAdapter method:
00 000000b0`cecff8f8 00007ff8`c1cf9252 ntdll!NtWaitForSingleObject+0x14
01 000000b0`cecff900 00007ff8`752d2ccd KERNELBASE!WaitForSingleObjectEx+0xa2
02 000000b0`cecff9a0 00007ff8`757bf247 atiumd64!OpenAdapter+0x63ced
03 000000b0`cecff9d0 00007ff8`757bf3ee atiumd64!XdxInitXopAdapterServices+0x3d0a57
04 000000b0`cecffa00 00007ff8`c4293034 atiumd64!XdxInitXopAdapterServices+0x3d0bfe
05 000000b0`cecffa30 00007ff8`c4d91461 kernel32!BaseThreadInitThunk+0x14
06 000000b0`cecffa60 00000000`00000000 ntdll!RtlUserThreadStart+0x21
I had a total of 160,000 topologies created over the span of 4 days, and around 100 did not raise MESessionClosed at all; I fear these are the ones causing a leak.
In the cases where no MESessionClosed is sent, I notice that they all have one error in common: -1072870850, which is MF_E_SAMPLEALLOCATOR_EMPTY.
I would love to know if anyone has had experience with Media Foundation not raising MESessionClosed as documented.
The MESessionClosed event is raised when the asynchronously executed IMFMediaSession::Close call completes. Not receiving it indicates a closing problem, perhaps with one of the primitives participating in the topology, such as a failure to end streaming because of an outstanding or leaked reference on some object.
Given the description of the problem, perhaps the best way to address it is to attach a debugger to the process (live, or by creating a dump and reviewing it interactively), expecting to find a thread waiting for something to close or complete.
Seeing MF_E_SAMPLEALLOCATOR_EMPTY earlier might suggest that a leaked pointer to one of the samples prevents a sample allocator inside one of the primitives from terminating, which in turn creates a deadlock.
Other than this, you might want to run mftrace on the process and compare the output produced by a session that closes correctly against one that fails.
You are also interested in understanding the topology (it would be worth including in the question), especially whether it has third-party or optional segments you can temporarily exclude. Since you cannot do much debugging of Media Foundation internals directly, your ability to change the topology could help you narrow the scope of the issue to the specific primitive that is giving you trouble. If the topology includes your own primitives, you should review their termination behavior.

How to write with a single node in MPI

I want to implement some file io with the routines provided by MPI (in particular Open MPI).
Due to possible limitations of the environment, I wondered whether it is possible to limit the nodes responsible for I/O, so that all other nodes perform a hidden mpi_send to this group of processes, which then actually write the data. This would be useful in cases where, e.g., the master process runs on a node with a high-performance filesystem and the other nodes only have access to a low-performance filesystem where the binaries are stored.
I have already found some information that might be helpful, but I couldn't find out how to actually implement these things:
1: There is an attribute MPI_IO on the communicator, which tells which ranks provide standard-conforming I/O routines. As this is listed as an environmental inquiry, I don't see where I could modify it.
2: There is an info key io_nodes_list which seems to belong to file-related info objects. Unfortunately, the possible values for this key are not documented, and Open MPI doesn't seem to implement it in any way. Actually, I can't even get the filename from the info object returned by mpi_file_get_info...
As a workaround, I could imagine two things: either perform the I/O with standard Fortran routines, or create a new communicator that is responsible for I/O. But in both cases, the processes responsible for I/O would have to check for incoming data from the other processes and perform the communication and file interaction manually.
Is there a nice and automatic way to restrict the IO to certain nodes? If yes, how could I implement this?
You explicitly asked about OpenMPI, but there are two MPI-IO implementations in OpenMPI. The old workhorse is ROMIO, the MPI-IO implementation shared among just about every MPI implementation. OpenMPI also has OMPIO, but I don't know a whole lot about tuning that one.
Next, if you want things to happen automatically for you, you'll have to use collective i/o. The independent I/O routines cannot send a message to anyone else -- they are independent and there's no way to know if the other side will be listening.
With those preliminaries out of the way...
You are asking about "I/O aggregation". There is a bit of information about it in the context of another optimization called "deferred open" (which OMPIO calls "lazy open"):
https://press3.mcs.anl.gov/romio/2003/08/05/deferred-open/
In short, you can definitely say "only these N processes should do I/O", and then the collective I/O library will exchange data and make sure that happens. The optimization was developed some 15-odd years ago for just the situation you proposed: some nodes being better connected to storage than others (as was the case on the old ASCI Red machine, to give you a sense for how old this optimization is...)
I don't know where you got io_nodes_list. You probably want to use the MPI-IO info keys cb_config_list and cb_nodes.
So, you've got a cluster with master1, master2, master3, and compute1, compute2, compute3 (or whatever the hostnames actually are). You can do something like this (in C, sorry; I'm not proficient in Fortran):
MPI_Info info;
MPI_File fh;
MPI_Info_create(&info);
/* Aggregate all collective I/O through one process per master node. */
MPI_Info_set(info, "cb_config_list", "master1:1,master2:1,master3:1");
MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_CREATE|MPI_MODE_WRONLY, info, &fh);
With these hints, MPI_File_write_all will aggregate all the I/O through the MPI processes on master1, master2, and master3. ROMIO won't blow up your memory because it will chunk up the I/O into a smaller working set (specified with the "cb_buffer_size" hint: cranking this up, if you have the memory, is a good way to get better performance).
There is a ton of information about the hints you can set in the ROMIO users guide:
http://www.mcs.anl.gov/research/projects/romio/doc/users-guide/node6.html

A MailboxProcessor that operates with a LIFO logic

I am learning about F# agents (MailboxProcessor).
I am dealing with a rather unconventional problem.
I have one agent (dataSource) which is a source of streaming data. The data has to be processed by an array of agents (dataProcessor). We can consider dataProcessor as some sort of tracking device.
Data may flow in faster than the speed with which the dataProcessor may be able to process its input.
It is OK to have some delay. However, I have to ensure that the agent stays on top of its work and does not get buried under obsolete observations.
I am exploring ways to deal with this problem.
The first idea is to implement a stack (LIFO) in dataSource. dataSource would send over the latest observation available when dataProcessor becomes available to receive and process the data. This solution may work, but it may get complicated, as dataProcessor may need to be blocked and re-activated and communicate its status to dataSource, leading to a two-way communication problem. This problem may boil down to a blocking queue in the consumer-producer problem, but I am not sure.
The second idea is to have dataProcessor take care of message sorting. In this architecture, dataSource would simply post updates to dataProcessor's queue. dataProcessor would use Scan to fetch the latest data available in its queue. This may be the way to go. However, I am not sure if, in the current design of MailboxProcessor, it is possible to clear a queue of messages, deleting the older, obsolete ones. Furthermore, here, it is written that:
Unfortunately, the TryScan function in the current version of F# is
broken in two ways. Firstly, the whole point is to specify a timeout
but the implementation does not actually honor it. Specifically,
irrelevant messages reset the timer. Secondly, as with the other Scan
function, the message queue is examined under a lock that prevents any
other threads from posting for the duration of the scan, which can be
an arbitrarily long time. Consequently, the TryScan function itself
tends to lock-up concurrent systems and can even introduce deadlocks
because the caller's code is evaluated inside the lock (e.g. posting
from the function argument to Scan or TryScan can deadlock the agent
when the code under the lock blocks waiting to acquire the lock it is
already under).
Having the latest observation bounced back may be a problem.
The author of this post, Jon Harrop, suggests that:
I managed to architect around it and the resulting architecture was actually better. In essence, I eagerly Receive all messages and filter using my own local queue.
This idea is surely worth exploring but, before starting to play around with code, I would welcome some inputs on how I could structure my solution.
Thank you.
Sounds like you might need a destructive scan version of the mailbox processor. I implemented this with TPL Dataflow in a blog series that you might be interested in.
My blog is currently down for maintenance but I can point you to the posts in markdown format.
Part1
Part2
Part3
You can also check out the code on github
I also wrote about the issues with scan in my lurking horror post
Hope that helps...
tl;dr I would try this: take the Mailbox implementation from FSharp.Actor or Zach Bray's blog post, replace the ConcurrentQueue with a ConcurrentStack (plus add some bounded-capacity logic), and use this changed agent as a dispatcher to pass messages from dataSource to an army of dataProcessors implemented as ordinary MBPs or Actors.
tl;dr2 If workers are a scarce and slow resource, and we need to process the message that is the latest at the moment a worker becomes ready, then it all boils down to an agent with a stack instead of a queue (with some bounded-capacity logic) plus a BlockingQueue of workers. The dispatcher dequeues a ready worker, then pops a message from the stack and sends this message to the worker. After the job is done, the worker enqueues itself back to the queue when it becomes ready (e.g. before let! msg = inbox.Receive()). The dispatcher's consumer thread then blocks until any worker is ready, while the producer thread keeps the bounded stack updated. (A bounded stack could be done with an array + offset + size inside a lock; the one below is more complex than necessary.)
Details
MailBoxProcessor is designed to have only one consumer. This is even commented in the source code of MBP here (search for the word 'DRAGONS' :) )
If you post your data to an MBP, then only one thread can take it from the internal queue or stack.
In your particular use case, I would use ConcurrentStack directly, or better, wrapped in a BlockingCollection:
It will allow many concurrent consumers
It is very fast and thread safe
BlockingCollection has a BoundedCapacity property that allows you to limit the size of the collection. It throws on Add when full, but you can catch the exception or use TryAdd. If A is the main stack and B is a standby, then TryAdd to A; on false, Add to B and swap the two with Interlocked.Exchange, then process the needed messages in A, clear it, and make it the new standby (or use three stacks if processing A could take longer than B takes to fill up again). This way you do not block and do not lose any messages, but can discard unneeded ones in a controlled way.
BlockingCollection has methods like AddToAny/TakeFromAny, which work on arrays of BlockingCollections. This could help, e.g.:
dataSource produces messages to a BlockingCollection with ConcurrentStack implementation (BCCS)
another thread consumes messages from the BCCS and sends them to an array of processing BCCSs. You said that there is a lot of data, so you may sacrifice one thread to block and dispatch your messages indefinitely
each processing agent has its own BCCS, or is implemented as an Agent/Actor/MBP to which the dispatcher posts messages. In your case you need to send a message to only one processing agent, so you may store the processing agents in a circular buffer to always dispatch a message to the least recently used processor.
Something like this:
  (data stream produces 'T)
              |
      [dispatcher's BCCS]
              |
  (a dispatcher thread consumes 'T and pushes to processors,
   manages capacity of the BCCS and the LRU queue)
        |                                  |
  [processor1's BCCS/Actor/MBP] ... [processorN's BCCS/Actor/MBP]
        |                                  |
    (process)                          (process)
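A minimal F# sketch of this dispatcher, following the tl;dr2 recipe above (the message type, the capacity, and the processMsg handler are assumptions, not from the original answer):

open System.Collections.Concurrent

type Msg = { Value: float }   // hypothetical message type

// Bounded LIFO buffer: Take() returns the newest message first because
// the backing store is a ConcurrentStack.
let messages = new BlockingCollection<Msg>(ConcurrentStack<Msg>(), 1000)

// dataSource side. Note: when full, TryAdd rejects the *incoming* message;
// the standby-stack swap described above is one way to shed *stale* items instead.
let post m = messages.TryAdd m |> ignore

// Workers announce readiness by enqueuing a delivery callback.
let ready = new BlockingCollection<Msg -> unit>()

let makeWorker (processMsg: Msg -> Async<unit>) =
    MailboxProcessor<Msg>.Start(fun inbox ->
        let rec loop () = async {
            ready.Add(inbox.Post)         // announce readiness first
            let! msg = inbox.Receive()    // then wait for exactly one message
            do! processMsg msg
            return! loop () }
        loop ())

// Dispatcher thread: blocks until some worker is ready, then pops the
// newest message off the stack and hands it to that worker.
let dispatcher = System.Threading.Thread(fun () ->
    for deliver in ready.GetConsumingEnumerable() do
        deliver (messages.Take()))
dispatcher.IsBackground <- true
dispatcher.Start()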
Instead of ConcurrentStack, you may want to read about the heap data structure. If you need the latest message by some property of the messages, e.g. a timestamp, rather than by the order in which they arrive on the stack (e.g. if there can be delays in transit, so arrival order <> creation order), a heap will give you the latest message.
If you still need Agent semantics/APIs, you could read several sources in addition to Dave's links and somehow adapt the implementation to multiple concurrent consumers:
An interesting article by Zach Bray on an efficient Actors implementation. There you do need to replace (under the comment // Might want to schedule this call on another thread.) the line execute true with something like async { execute true } |> Async.Start, because otherwise the producing thread becomes the consuming thread, which is not good for a single fast producer. However, for a dispatcher like the one described above, this is exactly what is needed.
The FSharp.Actor (aka Fakka) development branch and the F# MBP source code (first link above) could be very useful for implementation details. The FSharp.Actor library has been frozen for several months, but there is some activity in the dev branch.
You should not miss the discussion about Fakka on Google Groups in this context.
I have a somewhat similar use case, and for the last two days I have researched everything I could find on F# Agents/Actors. This answer is a kind of TODO for myself to try these ideas, half of which were born while writing it.
The simplest solution is to greedily eat all messages in the inbox when one arrives and discard all but the most recent. Easily done using TryReceive:
let rec readLatestLoop oldMsg =
    async { let! newMsg = inbox.TryReceive 0
            match newMsg with
            | None -> return oldMsg
            | Some newMsg -> return! readLatestLoop newMsg }

let readLatest() =
    async { let! msg = inbox.Receive()
            return! readLatestLoop msg }
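And a hedged sketch of how those helpers might sit inside a complete agent (handle is a hypothetical processing function, not from the original answer):

let startProcessor (handle: 'Msg -> Async<unit>) =
    MailboxProcessor<'Msg>.Start(fun inbox ->
        // Same helpers as above, now with inbox in scope.
        let rec readLatestLoop oldMsg =
            async { let! newMsg = inbox.TryReceive 0
                    match newMsg with
                    | None -> return oldMsg
                    | Some m -> return! readLatestLoop m }
        let readLatest () =
            async { let! msg = inbox.Receive()
                    return! readLatestLoop msg }
        let rec loop () =
            async { let! msg = readLatest ()
                    do! handle msg        // only the newest message is processed
                    return! loop () }
        loop ())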
When faced with the same problem, I architected a more sophisticated and efficient solution I called cancellable streaming, described in an F# Journal article here. The idea is to start processing messages and then cancel that processing if they are superseded. This significantly improves concurrency if significant processing is being done.

Object not garbage collected, but contains no gcroots

Running into a prickly problem with our web app here. (ASP.NET 2.0, Windows Server 2008)
Our memory usage for the website grows and grows, even though I would expect it to remain at a fairly static level. (We have a small amount of data that gets stored in state.)
Wanting to find out what the problem is, I've run System.GC.Collect() a few times, taken a memory dump, and then loaded this memory dump into WinDbg.
When I do a DumpHeap -stat, I get an inordinately large number of a particular type hanging around in memory.
0000064280580b40 713471 79908752 PaymentOption
So, doing a DumpHeap -mt for this type, I get a stack of object references. Picking a few of these at random, I run !gcroot, and the command comes back reporting that no references are held to them.
To me, this is exactly when the GC should collect these items, but for some reason they have been left outstanding.
Can anybody offer an explanation as to what might be happening?
You could try using sosex.dll in Windbg, which is an extension written to help with .NET debugging. There is a command named !refs which is similar to !gcroot, in that it will show you all the objects referencing an object, plus it will show all the objects that it too is referencing.
In the example on the author's website, !refs is used against an object and the output looks like this:
0:000> !refs 0000000080000db8
Objects referenced by 0000000080000db8 (System.Threading.Mutex):
0000000080000ef0 32 Microsoft.Win32.SafeHandles.SafeWaitHandle
Objects referencing 0000000080000db8 (System.Threading.Mutex):
0000000080000e08 72 System.Threading.Mutex+<>c__DisplayClass3
0000000080000e50 64 System.Runtime.CompilerServices.RuntimeHelpers+CleanupCode
A few things:
GC.Collect won't help you do any debugging. The garbage collector is already being called: if any objects were available for collection it would have happened already.
Idle memory on a server is wasted memory. Are you sure memory is being 'leaked', or is it just that the framework has decided it can keep more things in memory, or keep more memory around for faster access? In this case I suspect you are leaking memory, but it's something to double-check.
It sounds like something you don't expect is keeping a reference to PaymentOption objects. Perhaps a static collection somewhere? Or a separate thread?
Does PaymentOption implement a finalizer by any chance? Does it call an STA COM object?
I'd be curious to see the output of !finalizequeue, to see whether the count of objects showing up on the heap is roughly the number waiting to be finalized. The output should look something like this:
generation 0 has 57 finalizable objects (0409b5cc->0409b6b0)
generation 1 has 55 finalizable objects (0409b4f0->0409b5cc)
generation 2 has 0 finalizable objects (0409b4f0->0409b4f0)
Ready for finalization 0 objects (0409b6b0->0409b6b0)
If the number of "Ready for finalization" objects continues to grow, and you're certain garbage collections are occurring (confirm via perfmon counters), then it might be a blocked finalizer thread. You might need to take several snapshots over the lifetime of the process (before a recycle) to confirm. I usually rely on the magic number of three, as long as the site is under some sort of load.
A bug in a finalizer can block the finalizer thread and prevent the objects from ever being collected.
If the PaymentOption object calls a legacy STA COM object, then this article ASP.NET Hang and OutOfMemory exceptions caused by STA components might point in the right direction.
Not without more info on your application. But we ran into some nasty memory problems a long time ago. Do you use ASP.NET caching? As Raymond Chen likes to say, "a poor caching strategy is indistinguishable from a memory leak."
Check out another tool, CLRProfiler.exe; it will help you traverse object reference trees to see where your objects are rooted.
You've heard this before - if you have to GC.Collect, something is wrong.
Is the PaymentOption object created in an asynchronous process, by any chance? I remember something about how, if you don't call EndInvoke, you can get problems like this.
I've been investigating the same issue myself and was asking why objects that had no references were not being collected.
Objects larger than 85,000 bytes are stored on the Large Object Heap, from which memory is freed up less frequently.
http://msdn.microsoft.com/en-us/magazine/cc534993.aspx
A single PaymentOption may not be that big, but are they contained within collections, or are they based on something like a DataSet? You should pick a few instances of the PaymentOption / collection / DataSet and then use the SOS !objsize command to see how big they are.
Unfortunately this doesn't really answer the question. I like to think I can trust the .NET Framework to take care of releasing unused memory whenever it needs to. However, I see a lot of memory being used by the worker process running the app I am looking at, even when memory looks quite tight on the server.
FYI, SOS in .NET 4 supports a few new commands that might be of assistance, namely !gcwhere (locates the generation of an object; sosex's !gcgen) and !findroots (does what it says on the tin; sosex's !refs).
Both are documented on the SOS documentation and mentioned on Tess Ferrandez's blog.
