Concerning TCP connection call while profiling

I'm profiling my Clojure program using VisualVM 1.3.9.
Randomly while profiling (sometimes after a couple of minutes), I see the following (notice the top entry):
My program has absolutely 0 network functionality on its own (it's a genetic algorithm test routine), and doesn't use any external libraries.
Should I be concerned? Is there a way to find out what's causing this?

Related

How to find an I/O bottleneck within an ASP.NET app

We have a high-traffic website that generates a lot of I/O. Within 10 minutes it reads over 10 GB of data (the w3wp process in question is visible in Task Manager). For memory and application hangs I have been using WinDbg with success, but I don't know how to find the object(s)/method(s) within a process that are responsible for the highest I/O.
Is this even possible?
Edit
The question is: is there a way to profile I/O operations in a .NET assembly, say, a list of threads sorted by highest disk I/O (or something similar that would help me figure out where to look)?
ANTS Performance Profiler
I have used this tool to great success - dealing with finding the specific instructions that were causing ~512 GB of memory on a high-volume web farm to get chewed up within 5-10 minutes. It sounds like a very similar situation to yours.
Now, to be realistic - it's not going to magically solve your problem. It still requires a lot of setup, thorough analysis and detective work. But this tool definitely took the problem from "practically unsolvable" to "solvable within days".
Update:
As I mentioned in the comments (and Ben Emmett echoed), we can use ANTS to monitor memory, file system handles - pretty much any resource consumption - and drill down the call stack to see the effects of specific routines.
I came across the tool AppDynamics Lite, which displays your application's call costs and performance visually. It might help you find out which functions are performing the most costly I/O operations.
Quoting:
Understand the health of your CLR with key metrics like response time, throughput, exception rate, and garbage collection time as well as key system resources like CPU, memory and disk I/O.
Worth giving it a shot, as there is a free 30-day trial. Hope it helps.
Ps: I'm not affiliated with AppDynamics in any way.
You can use the (free) Windows Performance Toolkit from Windows 8, which also runs on Windows Vista and later. There you can turn on system-wide profiling to see what is going on in all processes at once. No instrumentation is necessary. Only one reboot is required to set an arcane registry key, which WPRUI.exe does for you automatically.
With xperf you can enable I/O-init stack walking so that a call stack is captured for every I/O that is started. The only issue is that the stacks will be broken for 64-bit processes, which means you will see only the first method of your code above the BCL methods, because of a Windows 7 bug in the stack-walking capabilities of the OS.
A workaround is to NGen your assemblies, move to Server 2012, or switch to x86 for profiling to see deeper call stacks.
You will see all file I/O and CPU activity even without any call stacks, along with the file names and how long the hard disk was in use. That should give you a good idea of which part of your app is causing the disk I/O. From the partial call stacks you should be able to pinpoint your issue even without full stacks.
The tool will give you much more insight than any commercially available profiler, at the expense of having to learn how to use it. Since the call stacks do not end at your code or in user mode but go down into the kernel, you can also determine whether, for example, the virus scanner is causing significant I/O delays. But you do need to know how your processor works. This toolset was originally aimed at kernel developers, which explains why you see so many seemingly useless columns.
In the picture below you see file I/O and CPU consumption stacked. When you select your high-I/O file in the disk I/O graph, all related call stacks that were taken while that I/O was active are highlighted in the CPU consumption view. This way you can directly navigate from the I/O to your potentially blocked threads.

MPI_Isend/Irecv: Is it possible to access unused memory locations of the send buffer in the meantime?

I would like to speed up my MPI program by using asynchronous communication, but the elapsed time remains the same. The workflow is as follows.
Before:
1. MPI_Send/MPI_Recv halo (ca. 10 seconds)
2. process the whole array (ca. 12 seconds)
After:
1. MPI_Isend/MPI_Irecv halo (ca. 0.1 seconds)
2. process the array (without halo) (ca. 10 seconds)
3. MPI_Wait (ca. 10 seconds; should be ca. 0 seconds)
4. process the halo only (ca. 2 seconds)
Measurements showed that the communication and the processing of the array core take nearly the same time for common workloads, so the asynchronous approach should nearly hide the communication time.
But it doesn't.
One fact - and I think this could be the problem - is that the send buffer is also the array the calculations are made on. Is it possible that MPI serializes the memory access, even though the communication ONLY accesses the halo (via a derived datatype) and the computation ONLY reads the core of the array?
Does anybody know whether this is really the reason?
Is it perhaps implementation-dependent (I'm using OpenMPI)?
Thanks in advance.
It isn't the case that MPI serializes the memory accesses in the user code (that's beyond the library's power to do, in general), and it is true that exactly what does happen is implementation-specific.
But as a practical matter, MPI libraries don't do as much communication "in the background" as you might hope, and this is particularly true when using transports and networks like TCP + Ethernet, where there's no meaningful way to hand communication off to another set of hardware.
You can only be sure that the MPI library is actually doing something when you're running MPI library code, e.g. in an MPI function call. Often, a call to any of a number of MPI routines will nudge an implementation's "progress engine", which keeps track of in-flight messages and ushers them along. So, for instance, one thing you can quickly do is make calls to MPI_Test() on the requests within the compute loop to make sure things start happening well before the MPI_Wait(). There is of course overhead to this, but it's something that's easy to try and measure.
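A rough sketch of what that can look like (the request list and compute_row() are placeholder names, not from your code):

```cpp
// Sketch only: poke the progress engine with MPI_Testall while the interior
// is being computed. halo_requests and compute_row() are placeholders for
// whatever your halo exchange and stencil update actually look like.
#include <mpi.h>
#include <vector>

void compute_row(int row);  // hypothetical interior update for one row

void compute_interior_with_progress(std::vector<MPI_Request>& halo_requests,
                                    int first_row, int last_row)
{
    int flag = 0;
    for (int row = first_row; row < last_row; ++row) {
        compute_row(row);                            // work that does not touch the halo

        // Every few rows, nudge MPI so the in-flight halo messages make
        // progress well before the final wait.
        if (row % 16 == 0) {
            MPI_Testall(static_cast<int>(halo_requests.size()),
                        halo_requests.data(), &flag, MPI_STATUSES_IGNORE);
        }
    }
    MPI_Waitall(static_cast<int>(halo_requests.size()),
                halo_requests.data(), MPI_STATUSES_IGNORE);
}
```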
Of course you could imagine the MPI library would use some other mechanism to run things behind the scenes. Both MPICH2 and OpenMPI have played with separate "progress threads" which execute separately from the user code and do this ushering along in the background; but getting that to work well, and without tying up a processor while you're trying to run your computation, is a genuinely difficult problem. OpenMPI's progress threads implementation has long been experimental, and in fact is temporarily out of the current (1.6.x) release, although work continues. I'm not sure about MPICH2's support.
If you are using InfiniBand, where the network hardware has a lot of intelligence to it, then prospects brighten a bit. If you are willing to leave memory pinned (for the OpenFabrics transports), and/or you can use a vendor-specific module (mxm for Mellanox, psm for QLogic), then things can progress somewhat more rapidly. If you're using shared memory, then the knem kernel module can also help with intranode transport.
One other implementation-specific approach you can take, if memory isn't a big issue, is to try to use eager protocols for sending the data directly, or to send more data per chunk so fewer nudges of the progress engine are needed. "Eager protocol" here means that data is sent automatically at send time, rather than just initiating a set of handshakes which will eventually lead to the message being sent. The bad news is that this generally requires extra buffer memory for the library, but if that's not a problem and you know the number of incoming messages is bounded (e.g., by the number of halo neighbours you have), this can help a great deal. How to do this for, e.g., the shared-memory transport in OpenMPI is described on the OpenMPI page on tuning for shared memory, but similar parameters exist for other transports and often for other implementations. One nice tool that Intel MPI has is "mpitune", which automatically runs through a number of such parameters for best performance.
The MPI specification states:
A nonblocking send call indicates that the system may start copying data out of the send buffer. The sender should not modify any part of the send buffer after a nonblocking send operation is called, until the send completes.
So yes, you should copy your data to a dedicated send buffer first.
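For illustration, a minimal sketch of packing the halo into its own buffer before the nonblocking send; all of the names here are made up, not taken from your code:

```cpp
// Sketch: pack the halo into a dedicated send buffer so the compute array is
// never touched by an in-flight nonblocking send.
#include <mpi.h>
#include <algorithm>
#include <vector>

void send_halo(const std::vector<double>& grid, int halo_offset, int halo_count,
               int neighbour_rank, MPI_Comm comm, MPI_Request* request,
               std::vector<double>& send_buffer)
{
    send_buffer.resize(halo_count);
    // Copy the halo cells out of the working array...
    std::copy_n(grid.begin() + halo_offset, halo_count, send_buffer.begin());
    // ...and post the nonblocking send on the private copy, leaving the
    // original grid free for computation.
    MPI_Isend(send_buffer.data(), halo_count, MPI_DOUBLE,
              neighbour_rank, /*tag=*/0, comm, request);
}
```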

Boost asio async vs. blocking reads, UDP speed/quality

I have a quick and dirty proof of concept app that I wrote in C# that reads high data rate multicast UDP packets from the network. For various reasons the full implementation will be written in C++ and I am considering using boost asio. The C# version used a thread to receive the data using blocking reads. I had some problems with dropped packets if the computer was heavily loaded (generally with processing those packets in another thread).
What I would like to know is whether the async read operations in boost (which use overlapped I/O on Windows) will help ensure that I receive the packets and/or reduce the CPU time needed to receive them. The single thread doing blocking reads is pretty straightforward; using the async reads seems like a step up in complexity, but I think it would be worth it if it provided higher performance or dropped fewer packets on a heavily loaded system. Currently the data rate should be no higher than 60 Mb/s.
I've written some multicast handling code using boost::asio also. I would say that overall, in my experience there is a lot of added complexity to doing things in asio that may not make it easy for other people you work with to understand the code you end up writing.
That said, presumably the argument in favour of moving to asio instead of using lots of different threads to do the work is that you would have to do less context switching. This would clearly be true on a single-core box, but what about when you go multi-core? Are you planning to offload the work you receive to threads or just have a single thread doing the processing? If you go for a single-threaded approach you are going to end up in a situation where you could drop packets while waiting for that thread to process the work.
In the end it's swings and roundabouts. I'd say you want to get some fairly solid figures backing up your arguments for going down this route if you are going to do so, just because of all the added complexity it entails (a whole new paradigm for some people I'm sure).
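For what it's worth, here is a minimal sketch of what the async path looks like; it follows the shape of Boost's own multicast receiver example, with the group address, port and buffer size as placeholder values (make_address assumes a reasonably recent Boost; older versions use address::from_string):

```cpp
// Sketch: minimal asio multicast receiver with a self-re-arming async read loop.
#include <boost/asio.hpp>
#include <array>
#include <iostream>

using boost::asio::ip::udp;

class Receiver {
public:
    Receiver(boost::asio::io_context& io,
             const boost::asio::ip::address& listen_addr,
             const boost::asio::ip::address& group_addr,
             unsigned short port)
        : socket_(io)
    {
        udp::endpoint listen_endpoint(listen_addr, port);
        socket_.open(listen_endpoint.protocol());
        socket_.set_option(udp::socket::reuse_address(true));
        socket_.bind(listen_endpoint);
        socket_.set_option(boost::asio::ip::multicast::join_group(group_addr));
        do_receive();
    }

private:
    void do_receive()
    {
        socket_.async_receive_from(
            boost::asio::buffer(data_), sender_endpoint_,
            [this](boost::system::error_code ec, std::size_t length) {
                if (!ec) {
                    // Hand the packet off to a worker queue here rather than
                    // processing it inline, so the read loop stays hot.
                    std::cout << "got " << length << " bytes\n";
                }
                do_receive();  // re-arm the asynchronous read
            });
    }

    udp::socket socket_;
    udp::endpoint sender_endpoint_;
    std::array<char, 2048> data_{};
};

int main()
{
    boost::asio::io_context io;
    Receiver r(io,
               boost::asio::ip::make_address("0.0.0.0"),      // listen address
               boost::asio::ip::make_address("239.255.0.1"),  // placeholder group
               30001);                                        // placeholder port
    io.run();
}
```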

How is MPI I/O Implemented?

Long-Winded Background
I'm working on parallelising some code for cardiac electrophysiology simulations. Since users can specify their own simulations using an in-built scripting language, I have no way of knowing how to manage the trade-off of communication vs. computation. To combat this, I'm making a sort-of runtime profiler, which will decide how to handle the domain decomposition once it's seen the simulation to be run and the hardware environment that it has to work with.
My question is this:
How is MPI I/O implemented behind the scenes? Is each process actually writing to a single file on some other node, or is each process writing to some sparse file, which will get spliced back together when the file is closed?
Knowing this will help me decide whether to consider I/O operations as communication or computation, and adjust the balance accordingly…
Thanks in advance for any insight you can offer.
Ross
The mechanism for I/O is implementation-dependent. In addition, there is not a single style of I/O. Some I/O is cached by the remote ranks and collected by the mpirun process at the end of the run. Some I/O is written to local scratch space as required. Some I/O is written to a NAS/SAN-style high-performance shared file system.
Some MPIs use third-party libraries to support I/O to parallel file systems, and those details may be proprietary. Some file systems are local discs, others are SANs over fibre or InfiniBand.
How are you planning to actually measure the time spent in I/O? Are you planning to use the PMPI interface to intercept all the calls into the library?
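If the PMPI route is of interest, the sketch below shows the general shape: intercept one MPI-IO call, time it, and forward to the underlying implementation. It assumes an MPI-3 mpi.h (where the buffer parameter is const void*; drop the const for older MPIs), and MPI_File_write stands in for whichever calls you actually care about:

```cpp
// Sketch of PMPI interception for timing MPI I/O. Compile into the app (or an
// LD_PRELOAD-style shim); the MPI library's own entry points are reached via
// the PMPI_ names. Only MPI_File_write is wrapped here as an illustration.
#include <mpi.h>
#include <cstdio>

static double g_io_seconds = 0.0;

extern "C" int MPI_File_write(MPI_File fh, const void* buf, int count,
                              MPI_Datatype datatype, MPI_Status* status)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_File_write(fh, buf, count, datatype, status);  // real call
    g_io_seconds += MPI_Wtime() - t0;
    return rc;
}

extern "C" int MPI_Finalize(void)
{
    int rank = 0;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    std::printf("rank %d: %.3f s spent in MPI_File_write\n", rank, g_io_seconds);
    return PMPI_Finalize();
}
```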

Microcontroller + Verilog/VHDL simulator?

Over the years I've worked on a number of microcontroller-based projects; mostly with Microchip's PICs. I've used various microcontroller simulators, and while they can be very helpful at times, I often find myself frustrated. In real life microcontrollers never exist alone and the firmware's behavior is dependent on the environment. However, none of the sims I've used provide decent support for anything outside the microcontroller.
My first thought was to model the entire board in Verilog. But, I'd rather not create an entire CPU model, and I haven't had much luck finding existing models for the chips I use. Regardless, I really don't need, or want, to simulate the proc at that level of detail, and I'd like to retain the debugging facilities provided by a regular processor sim.
It seems to me that the ideal solution would be a hybrid simulator that interfaces a traditional processor simulator with a Verilog model.
Does such a thing exist?
I've used the Altera Nios II processor embedded on an FPGA. Altera provides a toolchain for simulating the CPU (with its software) together with your custom logic in a simulator. I suppose a similar setup can be achieved by downloading a VHDL/Verilog core of your CPU (did you try OpenCores? They have lots of stuff there).
But keep in mind that it is going to be mind-bogglingly slow, so don't expect to simulate whole complex processes this way. The best you can hope for is simulating fine software-hardware interaction points to debug problems. If you need a deeper simulation, consider running it on an FPGA with built-in monitoring code.
For the "simulate the whole board" approach,
The Free Model Foundry has a large number of models, some in VHDL, others in Verilog, that are available now, but you'll need to pay to have new models created. These are very helpful for being sure the board is built correctly.
But I think the more common approach when debugging your PIC is to just build a board, then work on the firmware. In the chip world (where the firmware is running on a microprocessor in a chip that hasn't gone to fab yet), people often resort to very expensive systems (or renting time on them) that allow compiling part of the design into an emulator while the rest of the design runs in the normal simulator environment. Without the barrier of an expensive mask set for the chip, the cost is just not justifiable for a circuit board. I've heard of some creative applications of Simulink (MathWorks) with FPGAs, but my recollection is that one either ran the system on the computer or programmed the device and ran the same thing in real time.
I believe both Cadence (ask about Palladium) and Mentor Graphics have that integrated solution if you have the money to spend on it.
What I have done recently is create an interface between the simulation environment and the host system. Different HDL simulators have different interfaces, and getting the simulator NOT to think in batch mode (the traditional simulation model) but instead to run forever like a real design is half of the problem.
Then, from the host, using C (or whatever), you can create abstractions that may or may not allow you to write your application software for whatever target (depending on what language and compiler capabilities you have). For example, you can make generic poke and peek functions; on the final target they actually poke and peek memory or I/O, but in simulation the same calls go through the abstraction to a testbench that performs the same memory cycle, as in the sketch below.
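A compressed sketch of that poke/peek idea; the register address is made up, and a std::map stands in for whatever socket/PLI transport actually talks to the testbench:

```cpp
// Sketch of a poke/peek abstraction shared between the real target and the
// simulation. Application code is written once against poke()/peek() and does
// not care which side it is talking to.
#include <cstdint>
#include <cstdio>
#include <map>

#if defined(TARGET_HARDWARE)
// On the real target, poke/peek are plain volatile memory accesses.
void poke(uintptr_t addr, uint32_t value)
{
    *reinterpret_cast<volatile uint32_t*>(addr) = value;
}
uint32_t peek(uintptr_t addr)
{
    return *reinterpret_cast<volatile uint32_t*>(addr);
}
#else
// In simulation the same calls would be forwarded (e.g. over a socket) to the
// HDL testbench, which drives an equivalent bus cycle. A std::map stands in
// for that transport here so the sketch is self-contained.
static std::map<uintptr_t, uint32_t> fake_bus;

void poke(uintptr_t addr, uint32_t value) { fake_bus[addr] = value; }
uint32_t peek(uintptr_t addr)             { return fake_bus[addr]; }
#endif

int main()
{
    const uintptr_t GPIO_REG = 0x40020000;   // hypothetical register address
    poke(GPIO_REG, peek(GPIO_REG) ^ 0x1);    // toggle an imaginary LED bit
    std::printf("GPIO = 0x%08x\n", peek(GPIO_REG));
}
```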
I went one step further and used (Berkeley) sockets between the host and the testbench so that the simulation can keep running while the host applications stop and start. Not unlike having a real processor with an OS, where you start applications, run them to completion, and start another. That applies at least to test applications; for delivery you probably only have one app.
By creating these abstraction layers I can write real applications that will be used on the target when it is built. Along the way you can use software simulation of the logic initially, then, if you like, build an FPGA with an abstraction interface (throw-away logic), say a UART for example. Replace the shim between the application's abstraction layer and the simulator with a UART interface, or whatever. Then, when you marry the processor and logic in the same chip or on the same board, replace the abstraction layer again with direct calls to whatever interfaces the applications always thought they were talking to. If something breaks and you have retained the abstraction layer, you can take the application back to the simulation model and have access to all of your logic internals.
Specifically, this time around I am using an HDL language, Cyclicity CDL, which is on SourceForge. The documentation needs some help, but the examples may get you going, and it produces synthesizable Verilog, so you get an extra win there. I threw out all the scripting batch stuff other than the bare minimum needed to connect and start a C simulation model, so my testbench is in C (well, C++ technically) and the sockets layer was done there. The output can be .vcd files, which gtkwave reads. Basically you can do the bulk of your HDL design using open-source software with no licenses, etc. By adding one or two lines of code to the CDL simulation part I was able to have it run as an infinite loop, which I can say works quite well; there don't appear to be any memory leaks, etc.
Both ModelSim and Cadence have standardized ways of connecting host C programs to the simulation world, and from there you can use IPC to get host applications talking to an abstraction-layer API.
This is probably way overkill for a PIC; I gave up on PICs a while ago for the faster and more C-friendly ARM-based micros anyway. There is (or was) an open-core PIC that you could simply incorporate into your simulation, even though that is not what you are trying to do here.
Not that I've seen. Your best bet is to properly define the interfaces and behavior between the uC and the FPGA, and then define a series of test waveforms that can be applied using an automated tester. You would have to build the automated tester out of an FPGA or uC (apply a waveform; watch interrupts, breakpoints, etc.), though perhaps a logic analyzer has some such functionality. If you really want, OpenCores.org has PIC- and AVR-like 8-bit uC cores defined in VHDL, so you could implement your entire project on the FPGA and then just debug that.
Generally there is no need to model the CPU at the RTL level, since you don't really care what it does bit by bit; you generally care about what it does, e.g. register values, memories, and bus accesses.
The simplest is called a Bus Functional Model (BFM). This just generates the reads and writes that the CPU does, often based on a text file. These are available for some CPUs and many popular buses (e.g. PCI, PCIe). These simulate super fast.
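For illustration only, a toy BFM driver might look like this; the "W addr data" / "R addr" stimulus format and the BusIf type are invented for the sketch:

```cpp
// Toy bus functional model driver: replay CPU bus transactions from a text
// file ("W <addr> <data>" or "R <addr>") against a bus-level interface.
#include <cstdint>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

struct BusIf {                       // stand-in for the DUT's bus-level API
    void write(uint32_t addr, uint32_t data) {
        std::cout << "WR 0x" << std::hex << addr << " <= 0x" << data << "\n";
    }
    uint32_t read(uint32_t addr) {
        std::cout << "RD 0x" << std::hex << addr << "\n";
        return 0;                    // a real BFM would sample the bus here
    }
};

int main(int argc, char** argv)
{
    if (argc < 2) { std::cerr << "usage: bfm <stimulus.txt>\n"; return 1; }
    BusIf bus;
    std::ifstream stimulus(argv[1]);
    std::string line;
    while (std::getline(stimulus, line)) {
        std::istringstream in(line);
        std::string op; uint32_t addr = 0, data = 0;
        in >> op >> std::hex >> addr;
        if (op == "W")      { in >> std::hex >> data; bus.write(addr, data); }
        else if (op == "R") { bus.read(addr); }
    }
}
```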
The next step up is a functional cycle-accurate model. Those simulate fast. They are often encrypted.
Last is a full RTL model. Those usually are only available if you are working closely with the CPU vendor, e.g. using their core in your ASIC. Typically these are encrypted, unless you are a huge company.
Memory models are typically cycle-accurate (e.g. Micron).
My workmates in the hardware department use FPGA simulation software quite often to find timing bugs and track down strange behaviours.
Simulating one or two milliseconds can take several hours, so using the simulator for anything but very small things is not feasible.
You may want to have a look at SystemC though. http://en.wikipedia.org/wiki/SystemC
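To give a flavour: a SystemC model is just C++ against the SystemC class library, so a trivial clocked counter looks like the following (purely illustrative, and it needs the SystemC headers and library to build):

```cpp
// Minimal SystemC illustration: a clocked counter plus a tiny top level.
#include <systemc.h>
#include <iostream>

SC_MODULE(Counter) {
    sc_in<bool>        clk;
    sc_out<sc_uint<8>> value;

    void tick() {
        value.write(value.read() + 1);   // increment on every rising edge
    }

    SC_CTOR(Counter) {
        SC_METHOD(tick);
        sensitive << clk.pos();
        dont_initialize();
    }
};

int sc_main(int, char*[])
{
    sc_clock clk("clk", 10, SC_NS);      // 10 ns clock period
    sc_signal<sc_uint<8>> value;

    Counter counter("counter");
    counter.clk(clk);
    counter.value(value);

    sc_start(100, SC_NS);                // simulate 100 ns
    std::cout << "counter = " << value.read() << std::endl;
    return 0;
}
```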
