any pipeline programing example showing efficiency of pipeline - pipeline

I'm looking for programming example of pipeline where it execute the instruction/code using pipeline technique that could also be executed by nopipeline technique. so that I can see pipeline is better than nopipeline.
any help pls?

Your CPU pretends it does stuff in order, as far as people looking at it as a black box (that sets memory and reads memory) this is what it does. a pipeline is when it does a bit of every instruction in the pipeline and moves things along in it every cycle.
If your CPU only has one adder, then a second add instruction will stall the pipeline (the first will have to finish), compilers (GCC) are aware of this so use things like instruction scheduling to always keep the CPU busy.
You cannot "turn it off" to see that it is better. Instruction scheduling is one of the cheapest and most beneficial optimisations we can do, you get it free with assignment analysis and stuff.
You really want a book to talk about "before and after", it's also very hard to say "pipelining made this much of an increase" because it depends on what instructions were used, how many ALUs the CPU has, so forth.

Related

How to use non-blocking point-to-point MPI routines instead of collectives

In my programm, I would like to heavily parallelize many mathematical calculations, the results of which are then written to an output file.
I successfully implemented that using collective communication (gather, scatter etc.) but I noticed that using these synchronizing routines, the slowest among all processors dominates the execution time and heavily reduces overall computation time, as fast processors spend a lot of time waiting.
So I decided to switch to the scheme, where one (master) processor is dedicated to receiving chunks of results and handling the file output, and alle the other processors calculate these results and send them to the master using non-blocking send routines.
Unfortunately, I don't really know how to implement the master code; Do I need to run an infinite loop with MPI_Recv(), listening for incoming messages? How do I know when to stop the loop? Can I combine MPI_Isend() and MPI_Recv(), or do both method need to be non-blocking? How is this typically done?
MPI 3.1 provides non-blocking collectives. I would strongly recommend that instead of implementing it on your own.
However, it may not help you after all. Eventually you need the data from all processes, even the slow ones. So you are likely to wait at some point again. Non-blocking communication overlaps communication and computation, but it doesn't fix your load imbalances.
Update (more or less a long clarification comment)
There are several layers to your question, I might have been confused by the title as to what kind of answer you were expecting. Maybe the question is rather
How do I implement a centralized work queue in MPI?
This pops up regularly, most recently here. But that is actually often undesirable because a central component quickly becomes a bottleneck in large scale programs. So the actual problem you have, is that your work decomposition & mapping is imbalanced. So the more fundamental "X-question" is
How do I load balance an MPI application?
At that point you must provide more information about your mathematical problem and it's current implementation. Preferably in form of an [mcve]. Again, there is no standard solution. Load balancing is a huge research area. It may even be a topic for CS.SE rather than SO.

CISC and RISC - synchronous and asynchronous

I have two opposing ideas. I think that all the instructions in a RISC take the same time what makes me believe RISC is associated with synchronous.
At the same time i think that CISC should be synchronous and RISC should be asynchronous.
Can you tell me if RISC is associated with asynchronous and CISC is associated with synchronous? WHY???
Thanks!
What do you mean by synchronous and asynchronous. If you are talking about logic clocking, then cisc vs risc vs a hard disk controller dont matter, it is just some logic that you are clocking. And when it comes to processors if you are thinking in whether they can perform operations in series or parallel, it also doesnt matter between the two because both can be implemented either way. Some CISC processors are a shell around a risc or vliw or some flavor of other processor. If you are thinking parallel execution vs serial, in order or out of order execution, again the two dont matter with either one if you have the right logic and the right pipelines you can look ahead and execute out of order and in parallel.
It is like saying there are red sweaters and green sweaters made from red and green yarn. they are still just a combination of knit and purl stitches. Logic is just a lot of primitive gates, be it a disk controller chip, sound card chip, or processor, you can implement the logic in as many ways as you can implement a program to solve a particular problem. Machine code is nothing more than a bunch of bits fed into a bunch of gates, there is nothing special about how risc works vs cisc other than the number of bits you feed and the number of gates to implement. One might be easier to implement than another but it is still just some number of logic gates.

Is a preemptive multitasking OS possible on the interruptless DCPU-16?

I am looking into various OS designs in the hopes of writing a simple multitasking OS for the DCPU-16. However, everything I read about implementation of preemptive multitasking is centered around interrupts. It sounds like in the era of 16-bit hardware and software, cooperative multitasking was more common, but that requires every program to be written with multitasking in mind.
Is there any way to implement preemptive multitasking on an interruptless architecture? All I can think of is an interpreter which would dynamically switch tasks, but that would have a huge performance hit (possibly on the order of 10-20x+ if it had to parse every operation and didn't let anything run natively, I'm imagining).
Preemptive multitasking is normally implemented by having interrupt routines post status changes/interesting events to a scheduler, which decides which tasks to suspend, and which new tasks to start/continue based on priority. However, other interesting events can occur when a running task makes a call to an OS routine, which may have the same effect.
But all that matters is that some event is noted somewhere, and the scheduler decides who to run. So you can make all such event signalling/scheduling occur only only on OS calls.
You can add egregious calls to the scheduler at "convenient" points in various task application code to make your system switch more often. Whether it just switches, or uses some background information such as elapsed time since the last call is a scheduler detail.
Your system won't be as responsive as one driven by interrupts, but you've already given that up by choosing the CPU you did.
Actually, yes. The most effective method is to simply patch run-times in the loader. Kernel/daemon stuff can have custom patches for better responsiveness. Even better, if you have access to all the source, you can patch in the compiler.
The patch can consist of a distributed scheduler of sorts. Each program can be patched to have a very low-latency timer; on load, it will set the timer, and on each return from the scheduler, it will reset it. A simplistic method would allow code to simply do an
if (timer - start_timer) yield to scheduler;
which doesn't yield too big a performance hit. The main trouble is finding good points to pop them in. In between every function call is a start, and detecting loops and inserting them is primitive but effective if you really need to preempt responsively.
It's not perfect, but it'll work.
The main issue is making sure that the timer return is low latency; that way it is just a comparison and branch. Also, handling exceptions - errors in the code that cause, say, infinite loops - in some way. You can technically use a fairly simple hardware watchdog timer and assert a reset on the CPU without clearing any of the RAM; an in-RAM routine would be where RESET vector points, which would inspect and unwind the stack back to the program call (thus crashing the program but preserving everything else). It's sort of like a brute-force if-all-else-fails crash-the-program. Or you could POTENTIALLY change it to multi-task this way, RESET as an interrupt, but that is much more difficult.
So...yes. It's possible but complicated; using techniques from JIT compilers and dynamic translators (emulators use them).
This is a bit of a muddled explanation, I know, but I am very tired. If it's not clear enough I can come back and clear it up tomorrow.
By the way, asserting reset on a CPU mid-program sounds crazy, but it is a time-honored and proven technique. Early versions of Windows even did it to run compatibility mode on, I think 386's, properly, because there was no way to switch back to 32-bit from 16-bit mode. Other processors and OSes have done it too.
EDIT: So I did some research on what the DCPU is, haha. It's not a real CPU. I have no idea if you can assert reset in Notch's emulator, I would ask him. Handy technique, that is.
I think your assessment is correct. Preemptive multitasking occurs if the scheduler can interrupt (in the non-inflected, dictionary sense) a running task and switch to another autonomously. So there has to be some sort of actor that prompts the scheduler to action. If there are no interrupting devices (in the inflected, technical sense) then there's little you can do in general.
However, rather than switching to a full interpreter, one idea that occurs is just dynamically reprogramming supplied program code. So before entry into a process, the scheduler knows full process state, including what program counter value it's going to enter at. It can then scan forward from there, substituting, say, either the twentieth instruction code or the next jump instruction code that isn't immediately at the program counter with a jump back into the scheduler. When the process returns, the scheduler puts the original instruction back in. If it's a jump (conditional or otherwise) then it also effects the jump appropriately.
Of course, this scheme works only if the program code doesn't dynamically modify itself. And in that case you can preprocess it so that you know in advance where jumps are without a linear search. You could technically allow well-written self-modifying code if it were willing to nominate all addresses that may be modified, allowing you definitely to avoid those in your scheduler's dynamic modifications.
You'd end up sort of running an interpreter, but only for jumps.
another way is to keep to small tasks based on an event queue (like current GUI apps)
this is also cooperative but has the effect of not needing OS calls you just return from the task and then it will go on to the next task
if you then need to continue a task you need to pass the next "function" and a pointer to the data you need to the task queue

I need a short C programm that runs slower on a processor with HyperThreading than on one without it

I want to write a paper with Compiler Optimizations for HyperTreading. First step would be to investigate why a processor with HyperThreading( Simultaneous Multithreading) could lead to poorer performances than a processor without this technology. First step is to find an application that is better without HyperThreading, so i can run some hardware performance counters on it. Any suggest on how or where i could find one?
So, to summarize. I know that HyperThreading benefits are between -10% and +30%. I need a C application that falls in the 10% performance penalty.
Thanks.
Probably the main drawback of hyperthreading is the effective halving of cache sizes. Each thread will be populating the cache, and so each, in effect, has half the cache.
To create a programme which runs worse with hyperthreading than without, create a single threaded programme which performs a task which just fits inside L1 cache. Then add a second thread, which shares the workload, the works from "the other end" of the data, as it were. You will find performance falls through the floor - this is because both threads now must access L2 cache.
Hyperthreading can dramatically improve or worsen performance. It is completely dependent on use. None of this -10%/+30% stuff - that's ridiculous.
I'm not familiar with compiler optimizations for HT, nor the different between i7 HT and P4's as David pointed out. However, you can expect some general behaviors.
Context switching is very expensive. So if you have one core and run two threads on it simultaneously, switching back and forth one thread from the other always gives you performance penalty. However, threads do not use the core all the time. For example, if the thread reads or writes memory, it just waits for the memory access to be done, without using the core, usually for more than 100 cycles. There are many other cases that a thread need to stall like this, e.g., I/O operations, data dependencies, etc. Here HT helps, because it can ships out the waiting (or blocked) thread, and executes another thread instead.
Therefore, you can think if all threads are really unlikely to be blocked, then context switching will only cause overhead. Think about very computation-bounded application working on a small set of data.

How to increase my Web Application's Performance?

I have a ASP.NET web application (.NET 2008) using MS SQL server 2005. I want to increase the performance of the web site. Does anyone know of an article containing steps to do that, step by step, in SQL (indexes, etc.), and in the code?
Performance tuning is a very specific process. I don't know of any articles that discuss directly how to achieve this, but I can give you a brief overview of the steps I follow when I need to improve performance of an application/website.
Profile.
Start by gathering performance data. At the end of the tuning process you will need some numbers to compare to actually prove you have made a difference. This means you need to choose some specific processes that you monitor and record their performance and throughput.
For example, on your site you might record how long a login takes. You need to keep this very narrow. Pick a specific action that you want to record and time it. (Use a tool to do the timing, or put some Stopwatch code in you app to report times. Also, don't just run it once. Run it multiple times. Try to ensure you know all the environment set up so you can duplicate this again at the end.
Try to make this as close to your production environment as possible. Make sure your code is compiled in release mode, and running on real separate servers, not just all on one box etc.
Instrument.
Now you know what action you want to improve, and you have a target time to beat, you can instrument your code. This means injecting (manually or automatically) extra code that times each method call, or each line and records times and or memory usage right down the call stack.
There are lots of tools out their that can help you with this and automate some of it. (Microsoft's CLR profiler (free), Redgate - Ants (commercial), the higher editions of visual studio have stuff built in, and loads more) But you don't have to use automatic tools, it's perfectly acceptable to just use the Stopwatch class to time each block of your code. What you are looking for is a bottle neck. The likely hood is that you will find a high proportion of the overall time is spent in a very small bit of code.
Tune.
Now you have some timing data, you can start tuning.
There are two approaches to consider here. Firstly, take an overall perspective. Consider if you need to re design the whole call stack. Are you repeating something unnecessarily? Or are you just doing something you don't need to?
Secondly, now you have an idea of where your bottle neck is you can try and figure out ways to improve this bit of code. I can't offer much advice here, because it depends on what your bottle neck is, but just look to optimise it. Perhaps you need to cache data so you don't have to loop over it twice. Or batch up SQL calls so you can do just one. Or tighten your query filters so you return less data.
Re-profile.
This is the most important step that people often miss out. Once you have tuned your code, you absolutely must re-profile it in the same environment that you ran your initial profiling in. It is very common to make minor tweaks that you think might improve performance and actually end up degrading it because of some unknown way that the CLR handles something. This is much more common in managed languages because you often don't know exactly what is going on under the covers.
Now just repeat as necessary.
If you are likely to be performance tuning often I find it good to have a whole batch of automated performance tests that I can run that check the performance and throughput of various different activities. This way I can run these with every release and record performance changes each release. It also means that I can check that after a performance tuning session I know I haven't made the performance of some other area any worse.
When you are profiling, don't always just think about the time to run a single action. Also consider profiling under load, with lots of users logged in. Sometimes apps perform great when there's just one user connected, but when they hit a certain number of users suddenly the whole thing grinds to a halt. Perhaps because suddnely they are spending more time context switching or swapping memory in and out to disk. If it's throughput you want to improve you need to be figuring out what is causing the limit on throughput.
Finally. Check out this huge MSDN article on Improving .NET Application Performance and Scalability. Specifically, you might want to look at chapter 6 and chapter 17.
I think the best we can do from here is give you some pointers:
query less data from the sql server (caching, appropriate query filters)
write better queries (indexing, joins, paging, etc)
minimise any inappropriate blockages such as locks between different requests
make sure session-state hasn't exploded in size
use bigger metal / more metal
use appropriate looping code etc
But to stress; from here anything is guesswork. You need to profile to find the general area for the suckage, and then profile more to isolate the specific area(s); but start by looking at:
sql trace between web-server and sql-server
network trace between web-server and client (both directions)
cache / state servers if appropriate
CPU / memory utilisation on the web-server
I think First of all you have to find your Bottlenecks and then try to improve those.
This helps you to perform exactly where you have serios problem.
An in addition you needto improve your Connection to DB. For exampleusing a Lazy , Singletone Pattern and also create Batch request instead of single requests.
It help you to decrease DB connection.
Check your cache and suitable loop structures.
another thing is to use appropriate types, forexample if you need int donot create a long and etc
at the end ypu can use some Profiler (specially in SQL) andcheckif your queries implemented as well as possible.

Resources