I need a short C program that runs slower on a processor with HyperThreading than on one without it

I want to write a paper on compiler optimizations for HyperThreading. The first step is to investigate why a processor with HyperThreading (simultaneous multithreading) can deliver poorer performance than a processor without this technology. To do that, I need to find an application that performs better without HyperThreading, so I can run some hardware performance counters on it. Any suggestions on how or where I could find one?
So, to summarize: I know that HyperThreading's impact typically ranges from about -10% to +30%. I need a C application that falls into that 10% performance-penalty range.
Thanks.

Probably the main drawback of hyperthreading is the effective halving of cache sizes. Each thread will be populating the cache, and so each, in effect, has half the cache.
To create a programme which runs worse with hyperthreading than without, create a single-threaded programme which performs a task that just fits inside the L1 cache. Then add a second thread, which shares the workload but works from "the other end" of the data, as it were. You will find performance falls through the floor - this is because both threads now have to go out to the L2 cache.
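To make that concrete, here is a minimal sketch of such an experiment in C with pthreads. It is only a sketch, not a tuned benchmark: the buffer size, iteration count and the sibling logical CPU IDs (0 and 4) are assumptions you would have to adapt to your machine, and the single-threaded baseline is obtained by giving one worker the whole range and skipping the second thread.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>

#define BUF_BYTES (24 * 1024)   /* assumed to roughly fill a 32 KiB L1d cache */
#define ITERS     200000

static volatile uint8_t buf[BUF_BYTES];

struct range { int cpu; int lo; int hi; int backward; };

static void *worker(void *p)
{
    struct range *r = p;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(r->cpu, &set);                      /* pin this thread to one logical CPU */
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);

    uint64_t sum = 0;
    for (int it = 0; it < ITERS; it++) {
        if (r->backward)
            for (int i = r->hi - 1; i >= r->lo; i--) sum += buf[i];
        else
            for (int i = r->lo; i < r->hi; i++)      sum += buf[i];
    }
    return (void *)(uintptr_t)sum;
}

int main(void)
{
    /* Assumption: logical CPUs 0 and 4 are hyperthread siblings of one physical
     * core (check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list). */
    struct range first  = { 0, 0,             BUF_BYTES / 2, 0 };
    struct range second = { 4, BUF_BYTES / 2, BUF_BYTES,     1 };

    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, &first);
    pthread_create(&t1, NULL, worker, &second); /* second half, from the other end */
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    puts("done");
    return 0;
}

Time both variants (e.g. with time ./a.out) and compare; whether the penalty actually shows up depends on the exact cache sizes and the SMT implementation of the machine.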
Hyperthreading can dramatically improve or worsen performance. It is completely dependent on use. None of this -10%/+30% stuff - that's ridiculous.

I'm not familiar with compiler optimizations for HT, nor with the differences between HT on the i7 and on the P4 that David pointed out. However, you can expect some general behaviors.
Context switching is very expensive, so if you have one core and run two threads on it, switching back and forth between them always incurs a performance penalty. However, threads do not use the core all the time. For example, if a thread reads or writes memory, it just waits for the memory access to complete without using the core, often for more than 100 cycles. There are many other cases where a thread needs to stall like this, e.g. I/O operations, data dependencies, etc. Here HT helps, because it can swap out the waiting (or blocked) thread and execute another thread instead.
Therefore, if all threads are really unlikely to block, the extra context switching only causes overhead. Think of a very computation-bound application working on a small set of data.

Related

OpenCL work-items and streaming processors

What is the relation between a work-item and a streaming processor (CUDA core)? I read somewhere that the number of work-items should greatly exceed the number of cores, otherwise there is no performance improvement. But why is this so? I thought one core represents one work-item. Can someone help me to understand this?
Thanks
GPUs and most other hardware tend to do arithmetic much faster than they can access most of their available memory. Having many more work items than you have processors lets the scheduler stagger the memory use, while those work items which have already read their data are using the ALU hardware to do the processing.
Here is a good page about optimization in OpenCL. Scroll down to "2.4. Removing 'Costly' Global GPU Memory Access", where it goes into this concept.
The reason is mainly scheduling - a single core/processor/unit can usually run multiple threads and switch between them to hide memory latency (SMT). So it's generally a good idea for each core to have multiple threads queued up for it.
A thread will usually correspond to at least one work item, although depending on driver and hardware, multiple work-items might be combined into one thread, to make use of SIMD/vector capabilities of a core.
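As a small illustration of "many more work-items than cores", the host side can query how many compute units the device reports and then pick a global work size that oversubscribes them. This is only a sketch: error handling is omitted and the oversubscription factor is an arbitrary assumption, not a tuned value.

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_uint compute_units = 0;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(compute_units), &compute_units, NULL);

    /* Oversubscribe each compute unit many times over, so the scheduler always
     * has work-items whose data is already loaded (the factor 64*256 is a guess). */
    size_t global_size = (size_t)compute_units * 64 * 256;
    printf("compute units: %u, suggested global work size: %zu\n",
           compute_units, global_size);
    return 0;
}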

How to use non-blocking point-to-point MPI routines instead of collectives

In my program, I would like to heavily parallelize many mathematical calculations, the results of which are then written to an output file.
I successfully implemented that using collective communication (gather, scatter etc.), but I noticed that with these synchronizing routines the slowest of all processors dominates the execution time and heavily hurts overall performance, as the fast processors spend a lot of time waiting.
So I decided to switch to a scheme where one (master) processor is dedicated to receiving chunks of results and handling the file output, and all the other processors compute these results and send them to the master using non-blocking send routines.
Unfortunately, I don't really know how to implement the master code. Do I need to run an infinite loop with MPI_Recv(), listening for incoming messages? How do I know when to stop the loop? Can I combine MPI_Isend() and MPI_Recv(), or do both sides need to be non-blocking? How is this typically done?
Non-blocking collectives have been available since MPI 3.0. I would strongly recommend using those instead of implementing this on your own.
However, it may not help you after all. Eventually you need the data from all processes, even the slow ones, so you are likely to end up waiting at some point anyway. Non-blocking communication overlaps communication and computation, but it doesn't fix your load imbalance.
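For example, a non-blocking gather might look roughly like this (a sketch assuming one double result per rank; what computation you overlap with it is up to you):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double my_result = rank * 1.5;                 /* placeholder computation */
    double *all = NULL;
    if (rank == 0)
        all = malloc(size * sizeof *all);

    MPI_Request req;
    MPI_Igather(&my_result, 1, MPI_DOUBLE,
                all, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req);

    /* ... overlap: do more computation here while the gather is in flight ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);             /* results are valid after this */
    if (rank == 0) {
        for (int i = 0; i < size; i++)
            printf("%d: %f\n", i, all[i]);         /* file output would go here */
        free(all);
    }
    MPI_Finalize();
    return 0;
}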
Update (more or less a long clarification comment)
There are several layers to your question; I might have been confused by the title as to what kind of answer you were expecting. Maybe the question is rather:
How do I implement a centralized work queue in MPI?
This pops up regularly, most recently here. But that is actually often undesirable, because a central component quickly becomes a bottleneck in large-scale programs. So the actual problem you have is that your work decomposition and mapping are imbalanced. The more fundamental "X-question" is therefore:
How do I load balance an MPI application?
At that point you must provide more information about your mathematical problem and its current implementation, preferably in the form of a minimal, complete and verifiable example. Again, there is no standard solution. Load balancing is a huge research area; it may even be a topic for CS.SE rather than SO.
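That said, to answer the literal question: the master loop does not have to be infinite if it knows how many messages to expect. A minimal sketch (assuming each worker sends a fixed, known number of double results, so the master can simply count them; CHUNKS_PER_WORKER and the placeholder computation are invented for illustration) could look like this:

#include <mpi.h>
#include <stdio.h>

#define CHUNKS_PER_WORKER 10        /* assumed: every worker sends this many results */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                          /* master: receive and write output */
        int expected = (size - 1) * CHUNKS_PER_WORKER;
        for (int i = 0; i < expected; i++) {
            double result;
            MPI_Status st;
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            printf("from rank %d: %f\n", st.MPI_SOURCE, result);  /* file output here */
        }
    } else {                                  /* workers: compute and send */
        double result;
        MPI_Request req = MPI_REQUEST_NULL;
        for (int i = 0; i < CHUNKS_PER_WORKER; i++) {
            MPI_Wait(&req, MPI_STATUS_IGNORE);        /* previous send must finish
                                                         before the buffer is reused */
            result = rank * 100.0 + i;                /* placeholder computation */
            MPI_Isend(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}

If the number of results is not known in advance, the usual alternative is to let each worker send a final message with a distinct "done" tag and have the master stop once it has seen one from every worker.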

MPI vs OpenMP for shared memory

Let's say there is a computer with 4 CPUs, each having 2 cores, so 8 cores in total. With my limited understanding I think that all processors share the same memory in this case. Now, is it better to use OpenMP directly, or to use MPI to make it general, so that the code could work on both distributed and shared settings? Also, if I use MPI for a shared-memory setting, would performance decrease compared with OpenMP?
Whether you need or want MPI or OpenMP (or both) heavily depends on the type of application you are running, and whether your problem is mostly memory-bound or CPU-bound (or both). Furthermore, it depends on the type of hardware you are running on. A few examples:
Example 1
You need parallelization because you are running out of memory, e.g. you have a simulation and the problem size is so large that your data does not fit into the memory of a single node anymore. However, the operations you perform on the data are rather fast, so you do not need more computational power.
In this case you probably want to use MPI and start one MPI process on each node, thereby making maximum use of the available memory while limiting communication to the bare minimum.
Example 2
You usually have small datasets and only want to speed up your application, which is computationally heavy. Also, you do not want to spend much time thinking about parallelization, but rather about your algorithms in general.
In this case OpenMP is your first choice. You only need to add a few statements here and there (e.g. in front of your for loops that you want to accelerate), and if your program is not too complex, OpenMP will do the rest for you automatically.
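As an illustration of how little is needed in this case, here is a minimal sketch; the array size and the work inside the loop are placeholders, and it assumes you compile with OpenMP enabled (e.g. gcc -fopenmp).

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];               /* b is zero-initialized here */

    #pragma omp parallel for                /* the only OpenMP-specific line */
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i] + 1.0;            /* stand-in for the real computation */

    printf("ran with up to %d threads\n", omp_get_max_threads());
    return 0;
}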
Example 3
You want it all. You need more memory, i.e. more computing nodes, but you also want to speed up your calculations as much as possible, i.e. running on more than one core per node.
Now your hardware comes into play. From my personal experience, if you have only a few cores per node (4-8), the performance penalty created by the general overhead of using OpenMP (i.e. starting up the OpenMP threads etc.) is more than the overhead of processor-internal MPI communication (i.e. sending MPI messages between processes that actually share memory and would not need MPI to communicate).
However, if you are working on a machine with more cores per node (16+), it will become necessary to use a hybrid approach, i.e. parallelizing with MPI and OpenMP at the same time. In this case, hybrid parallelization will be necessary to make full use of your computational resources, but it is also the most difficult to code and to maintain.
Summary
If you have a problem that is small enough to be run on just one node, use OpenMP. If you know that you need more than one node (and thus definitely need MPI), but you favor code readability/effort over performance, use only MPI. If using MPI only does not give you the speedup you would like/require, you have to do it all and go hybrid.
To your second question (in case that did not become clear):
If your setup is such that you do not need MPI at all (because you will always run on only one node), use OpenMP, as it will be faster. But if you know that you need MPI anyway, I would start with that and only add OpenMP later, once you know that you've exhausted all reasonable optimization options for MPI.
With most distributed memory platforms nowadays consisting of SMP or NUMA nodes it just makes no sense to not use OpenMP. OpenMP and MPI can perfectly work together; OpenMP feeds the cores on each node and MPI communicates between the nodes. This is called hybrid programming. It was considered exotic 10 years ago but now it is becoming mainstream in High Performance Computing.
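The hybrid pattern itself is small; a sketch (assuming only the main thread of each rank calls MPI, so MPI_THREAD_FUNNELED is sufficient) looks like this:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* one MPI rank per node fans out into OpenMP threads on that node */
    #pragma omp parallel
    {
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}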
As for the question itself, the right answer, given the information provided, has always been one and the same: IT DEPENDS.
For use on a single shared-memory machine like that, I'd recommend OpenMP. It makes some aspects of the problem simpler and might be faster.
If you ever plan to move to a distributed memory machine, then use MPI. It'll save you solving the same problem twice.
The reason I say OpenMP might be faster is because a good implementation of MPI could be clever enough to spot that it's being used in a shared memory environment and optimise its behaviour accordingly.
Just for the bigger picture: hybrid programming has become popular because OpenMP benefits from the cache topology by using the same address space. Because MPI may have the same data replicated in memory (since processes can't share data), it can suffer from cache pressure due to that duplication.
On the other hand, if you partition your data correctly and each processor has a private cache, it may come to the point where your problem fits completely in cache. In this case you get superlinear speedups.
Speaking of cache, there are very different cache topologies on recent processors, so as always: IT DEPENDS...

Hardware Assisted Garbage Collection

I was thinking about ways that functional languages could be tied more directly to their hardware, and was wondering about any hardware implementations of garbage collection.
This would speed things up significantly as the hardware itself would implicitly handle all collection, rather than the runtime of some environment.
Is this what LISP Machines did? Has there been any further research into this idea? Is this too domain specific?
Thoughts? Objections? Discuss.
Because of Generational Collection, I'd have to say that tracing and copying are not huge bottlenecks to GC.
What would help is hardware-assisted READ barriers, which take away the need for 'stop the world' pauses when doing stack scans and marking the heap.
Azul Systems has done this: http://www.azulsystems.com/products/compute_appliance.htm
They gave a presentation at JavaOne on how their hardware modifications allowed for completely pauseless GC.
Another improvement would be hardware assisted write barriers for keeping track of remembered sets.
Generational GCs, and even more so G1 (Garbage First), reduce the amount of heap they have to scan by only scanning a partition and keeping a list of remembered sets for cross-partition pointers.
The problem is that this means ANY time the mutator sets a pointer, it also has to put an entry in the appropriate remembered set. So you have (small) overhead even when you're not GCing. If you can reduce this, you'd reduce both the pause times needed for GCing and improve overall program performance.
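To make that overhead concrete, here is a sketch of what a purely software card-table write barrier looks like; all names and sizes are invented for the example. The extra work in write_ref() on every pointer store is exactly what hardware-assisted write barriers would try to take off the mutator's hands.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define HEAP_SIZE  (1 << 20)
#define CARD_SHIFT 9                           /* one card covers 512 bytes of heap */
#define NUM_CARDS  (HEAP_SIZE >> CARD_SHIFT)

static _Alignas(sizeof(void *)) uint8_t heap[HEAP_SIZE];
static uint8_t card_table[NUM_CARDS];

static void write_ref(void **slot, void *value)
{
    *slot = value;                                     /* the actual pointer store */
    size_t off = (size_t)((uint8_t *)slot - heap);
    card_table[off >> CARD_SHIFT] = 1;                 /* the barrier: mark the card
                                                          so the GC rescans it later */
}

int main(void)
{
    memset(card_table, 0, sizeof card_table);
    void **slot = (void **)&heap[4096];
    write_ref(slot, &heap[8192]);          /* every mutator pointer store goes through this */
    printf("card %d is dirty: %d\n",
           4096 >> CARD_SHIFT, card_table[4096 >> CARD_SHIFT]);
    return 0;
}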
One obvious solution is to have memory pointers which are larger than your available RAM, for example 34-bit pointers on a 32-bit machine, or to use the uppermost 8 bits of a 32-bit pointer when you have only 16 MB of RAM (2^24). The Oberon machines at ETH Zurich used such a scheme with a lot of success until RAM became too cheap. That was around 1994, so the idea is quite old.
This gives you a couple of bits where you can store object state (like "this is a new object" and "I just touched this object"). When doing the GC, prefer objects with "this is new" and avoid "just touched".
This might actually see a renaissance because no one has 2^64 bytes of RAM (= 2^67 bits; there are about 10^80 ≈ 2^266 atoms in the universe, so it might never be possible to have that much RAM). This means you could use a couple of bits in today's machines if the VM can tell the OS how to map the memory.
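Purely as an illustration of the idea (not of any real VM), here is what stashing those state bits in the unused top of a 64-bit pointer can look like in C; the bit positions are arbitrary assumptions, and it only works because current machines don't use the full 64-bit address space.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define TAG_NEW      ((uintptr_t)1 << 62)   /* "this is a new object"       */
#define TAG_TOUCHED  ((uintptr_t)1 << 61)   /* "I just touched this object" */
#define TAG_MASK     (TAG_NEW | TAG_TOUCHED)

static void *untag(uintptr_t ref)          { return (void *)(ref & ~TAG_MASK); }
static uintptr_t tag(void *p, uintptr_t t) { return (uintptr_t)p | t; }

int main(void)
{
    int *obj = malloc(sizeof *obj);
    uintptr_t ref = tag(obj, TAG_NEW);      /* allocation marks the object as new   */

    *(int *)untag(ref) = 42;                /* accesses strip the tag bits first    */
    ref |= TAG_TOUCHED;                     /* a write barrier would set "touched"  */

    printf("value=%d new=%d touched=%d\n", *(int *)untag(ref),
           !!(ref & TAG_NEW), !!(ref & TAG_TOUCHED));
    free(untag(ref));
    return 0;
}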
Yes. Look at the related work sections of these 2 papers:
https://research.microsoft.com/en-us/um/people/simonpj/papers/parallel-gc/index.htm
http://www.filpizlo.com/papers/pizlo-ismm2007-stopless.pdf
Or at this one:
http://researcher.watson.ibm.com/researcher/files/us-bacon/Bacon12StallFree.pdf
There was an article on Lambda the Ultimate describing how you need a GC-aware virtual memory manager to have a really efficient GC, and VM mapping is done mostly by hardware these days. Here you are :)
You're a grad student, sounds like a good topic for a research grant to me.
Look into FPGA design and computer architecture; there are plenty of free processor designs available on http://www.opencores.org/
Garbage collection can be implemented as a background task; it's already intended for parallel operation.
Pete
I'm pretty sure some prototypes exist, but developing and producing hardware-specific features is very expensive. It took a very long time to get the MMU or TLB implemented at the hardware level, and those are quite easy to implement.
GC is too big to be implemented efficiently at the hardware level.
Older SPARC systems had tagged memory (33 bits) which could be used to mark addresses.
Not fitted today?
This came from their LISP heritage, IIRC.
One of my friends built a generational GC that tagged objects by stealing a bit from primitives. It worked better.
So, yes, it can be done, but nobody bothers tagging things anymore.
runT1mes' comments about hardware-assisted generational GC are worth reading.
Given a sufficiently big gate array (Virtex-III springs to mind), an enhanced MMU that supported GC activities would be possible.
Probably the most relevant piece of data needed here is: how much time (as a percentage of CPU time) is presently being spent on garbage collection? It may be a non-problem.
If we do go after this, we have to consider that the hardware is fooling with memory "behind our back". I guess this would be "another thread", in modern parlance. Some memory might be unavailable if it were being examined (maybe solvable with dual-port memory), and certainly if the HWGC is going to move stuff around, then it would have to lock out the other processes so it didn't interfere with them. And do it in a way that fits into the architecture and language(s) in use. So, yeah, domain specific.
Look what just appeared... in another post... Looking at Java's GC log.
I gather the biggest problem is CPU registers and the stack. One of the things you have to do during GC is traverse all the pointers in your system, which means knowing what those pointers are. If one of those pointers is currently in a CPU register then you have to traverse that as well. Similarly if you have a pointer on the stack. So every stack frame has to have some sort of map saying what is a pointer and what isn't, and before you do any GC traversing you have to get any pointers out into memory.
You also run into problems with closures and continuations, because suddenly your stack stops being a simple LIFO structure.
The obvious way is to never hold pointers on the CPU stack or in registers. Instead you have each stack frame as an object pointing to its predecessor. But that kills performance.
Several great minds at MIT back in the '80s created the SCHEME-79 chip, which directly interpreted a dialect of Scheme, was designed with LISP DSLs, and had hardware GC built in.
Why would it "speed things up"? What exactly should the hardware be doing?
It still has to traverse all references between objects, which means it has to run through a big chunk of data in main memory, and that is the main reason it takes time. What would you gain by this? Which particular operation is it that you think could be done faster with hardware support? Most of the work in a garbage collector is just following pointers/references, and occasionally copying live objects from one heap to another. How is this different from the instructions a CPU already supports?
With that said, why do you think we need faster garbage collection? A modern GC can already be very fast and efficient, to the point where it's basically a solved problem.

How do you pre-allocate memory to a process in Solaris?

My problem is:
I have a Perl script which uses a lot of memory (expected behaviour because of caching). But I noticed that the more caching I do, the slower it gets, and the process spends most of its time sleeping.
I thought pre-allocating memory to the process might speed up the performance.
Does someone have any ideas here?
Update:
I think I am not being very clear here, so I will put the question more clearly:
I am not looking for ways of pre-allocating inside the Perl script; I don't think that would help me much here. What I am interested in is a way to tell the OS to allocate X amount of memory for my Perl script, so that it does not have to compete with other processes coming in later.
Assume that I can't get around the memory usage. I am exploring ways of reducing that too, but I don't expect much improvement there.
FYI, I am working on a Solaris 10 machine.
What I gathered from your posting and comments is this:
Your program gets slow when memory use rises.
Your program increasingly spends its time sleeping, not computing.
Most likely explanation: sleeping means waiting for a resource to become available, and in this case the resource most likely is memory. Use the vmstat 1 command to verify. Have a look at the sr column: if it consistently goes beyond ~150, the system is desperate to free pages to satisfy demand. This is accompanied by high activity in the pi, po and fr columns.
If this is in fact the case, your best choices are:
Upgrade system memory to meet demand
Reduce memory usage to a level appropriate for the system at hand.
Preallocating memory will not help. In either case, memory demand will exceed the available main memory at some point. The kernel will then have to decide which pages need to be in memory now and which pages may be cleared and reused for the more urgently needed pages. If all regularly needed pages (the working set) exceed the size of main memory, the system is constantly moving pages to and from secondary storage (swap). The system is then said to be thrashing and spends little time doing useful work. There is nothing you can do about this except adding memory or using less of it.
From a comment:
The memory limitations are not very severe but the memory footprint easily grows to GBs and when we have competing processes for memory, it gets very slow. I want to reserve some memory from OS so that thrashing is minimal even when too many other processes come. Jagmal
Let's take a different tack then. The problem isn't really with your Perl script in particular. Instead, all the processes on the machine are consuming too much memory for the machine to handle as configured.
You can "reserve" memory, but that won't prevent thrashing. In fact, it could make the problem worse because the OS won't know if you are using the memory or just saving it for later.
I suspect you are suffering the tragedy of the commons. Am I right that many other users are on the machine in question? If so, this is more of a social problem than a technical problem. What you need is someone (probably the System Administrator) to step in and coordinate all the processes on the machine. They should find the most extravagant memory hogs and work with their programmers to reduce the cost on system resources. Further, they ought to arrange for processes to be scheduled so that resource allocation is efficient. Finally, they may need to get more or improved hardware to handle the expected system load.
Some questions you might ask yourself:
are my data structures really useful for the task at hand?
do I really have to cache that much?
can I throw away cached data after some time?
Perl does let you pre-extend arrays and pre-size hashes, if that turns out to be where the time goes:

my @array;
$#array = 1_000_000;  # pre-extend the array to one million elements,
                      # http://perldoc.perl.org/perldata.html#Scalar-values

my %hash;
keys(%hash) = 8192;   # pre-allocate hash buckets (rounded up to a power of two;
                      # same documentation section)
Not being familiar with your code, I'll venture some wild speculation here [grin] that these techniques aren't going to offer new great efficiencies to your script, but that the pre-allocation could help a little bit.
Good luck!
-- Douglas Hunter
I recently rediscovered an excellent Randal L. Schwartz article that includes preallocating an array. Assuming this is your problem, you can test preallocating with a variation on that code. But be sure to test the result.
The reason the script gets slower with more caching might be thrashing. Presumably the reason for caching in the first place is to increase performance. So a quick answer is: reduce caching.
Now there may be ways to modify your caching scheme so that it uses less main memory and avoids thrashing. For instance, you might find that caching to a file or database instead of to memory can boost performance. I've found that file system and database caching can be more efficient than application caching and can be shared among multiple instances.
Another idea might be to alter your algorithm to reduce memory usage in other areas. For instance, instead of pulling an entire file into memory, Perl programs tend to work better reading line by line.
Finally, have you explored the Memoize module? It might not be immediately applicable, but it could be a source of ideas.
I have not been able to find a way to do this yet. But I found out that (see this for details):

Memory allocated to lexicals (i.e. my() variables) cannot be reclaimed or reused even if they go out of scope. It is reserved in case the variables come back into scope. Memory allocated to global variables can be reused (within your program) by using undef() and/or delete().
So I believe a possibility here could be to check whether I can reduce the total memory footprint of the lexical variables at a given point in time.
It sounds like you are looking for limit or ulimit. But I suspect that will cause a script that goes over the limit to fail, which probably isn't what you want.
A better idea might be to share cached data between processes. Putting data in a database or in a file works well in my experience.
I hate to say it, but if your memory limitations are this severe, Perl is probably not the right language for this application. C would be a better choice, I'd think.
One thing you could do is use Solaris zones (containers).
You could put your process in a zone and allocate it resources like RAM and CPUs.
Here are two links to some tutorials :
Solaris Containers How To Guide
Zone Resource Control in the Solaris 10 08/07 OS
While it's not pre-allocating as you asked for, you may also want to look at the large page size options, so that when Perl has to ask the OS for more memory for your program, it gets it in larger chunks.
See Solaris Internals: Multiple Page Size Support for more information on the difference this makes and how to do it.
Look at http://metacpan.org/pod/Devel::Size
You could also inline a C function to do the above.
As far as I know, you cannot allocate memory directly from Perl. You can get around this by writing an XS module, or by using an inline C function like I mentioned.
