Does Frama-C provide any tools for proving the run-time characteristics of a function such as execution time (possibly as instruction count) and heap memory space (counted as bytes allocated)?

Concerning execution time estimation
Frama-C works at the C level. The Metrics plug-in can provide a few metrics (such as statement count) on a version of the source very close to the original one (-metrics -metrics-ast cabs), or on the normalized source (often referred to as Cil code) that it uses. However, it does not have any knowledge of assembly code, therefore it cannot provide precise information about execution time at this level.
Since compiler optimizations impact code generation, the numbers given by Frama-C may or may not be close to what will be produced by a compiler, depending on which optimizations are enabled, what is known about the compiler and the target architecture, etc. In the general case, Frama-C cannot give any guarantees; in specific situations, it is possible to develop plug-ins to provide some of this information (e.g. the Cost plug-in, mentioned here, uses annotations to try and maintain some correspondence between source and compiled code, and then uses them to provide some execution time information).
Concerning memory size estimation
There is an option, -metrics-locals-size, which does a rough estimation of the stack memory usage by a function. As in the previous case, this is only an estimation based on the source code. Compilers are likely to stack-allocate temporary variables for computing temporary subexpressions, or for register spilling, so the numbers given by Frama-C cannot be used in a worst-case stack estimation.
Dynamic memory allocation is supported in ACSL, so in theory it is possible to write annotations concerning it. However, current plug-ins do not provide a direct way to handle this precisely; it might require writing a new plug-in or, at least, an abstract domain for Eva.
Eva currently handles dynamic allocation, but probably not precisely enough for estimating heap size in an interesting way. It is possible to develop an abstract domain for Eva that would keep track of this information (adding mallocs and subtracting frees) and compute an overapproximation of the heap memory space, but this would require being able to bound the number of iterations of loops containing allocations (otherwise the upper bound would be infinite). Precision would depend on the complexity of the program.
For runtime verification, the E-ACSL plug-in already tracks some stack/heap usage information (even though it is not currently exported to the user), so in theory one could write an assertion similar to //@ assert heap_size <= \old(heap_size) + 42;, and have it checked at runtime, when running the instrumented program.
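To make the idea concrete, here is a hand-rolled sketch of the same kind of check in plain C. It is not how E-ACSL works internally and it uses no Frama-C API: heap_size, tracked_malloc, tracked_free and sum_to are hypothetical names made up for this illustration. The counter is increased on each allocation and decreased on each deallocation, and the bound from the annotation above is checked with an ordinary assert.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical counter of live heap bytes; not an E-ACSL or Frama-C API. */
static size_t heap_size = 0;

/* Each block is prefixed with a header that remembers its size.
   The union pads the header so that the payload stays suitably aligned. */
typedef union {
    size_t size;
    max_align_t align;
} header_t;

void *tracked_malloc(size_t n) {
    header_t *h = malloc(sizeof(header_t) + n);
    if (h == NULL) return NULL;
    h->size = n;
    heap_size += n;            /* add on allocation */
    return h + 1;              /* payload starts right after the header */
}

void tracked_free(void *p) {
    if (p == NULL) return;
    header_t *h = (header_t *)p - 1;
    heap_size -= h->size;      /* subtract on deallocation */
    free(h);
}

/* Function under scrutiny: uses a temporary buffer and releases it,
   so it should leave at most 42 extra bytes on the heap. */
int sum_to(int n) {
    assert(n >= 0);
    size_t old_heap_size = heap_size;   /* plays the role of \old(heap_size) */
    int *buf = tracked_malloc((size_t)n * sizeof *buf);
    if (buf == NULL) return -1;
    int total = 0;
    for (int i = 0; i < n; i++) {
        buf[i] = i + 1;
        total += buf[i];
    }
    tracked_free(buf);
    assert(heap_size <= old_heap_size + 42);   /* run-time check of the bound */
    return total;
}
```

Like any run-time check, this only says something about the executions you actually run; it is not a proof for all inputs.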

To complement anol's answer, the PathCrawler plug-in (the online version can be used freely, but the plug-in itself is proprietary) has been used to generate sets of test cases covering all paths of C functions. This article explains under which assumptions this can be used as the basis for WCET measurement, but basically the issues are the ones already mentioned by anol: without a precise knowledge of the work done by the compiler and of the underlying hardware, which is not something Frama-C provides natively, things are going to be quite rough.
There has apparently been some recent work, done as a bachelor project in Amsterdam, taking the same route of using PathCrawler to generate execution traces covering a sufficiently large proportion of the search space.


Sparecode analysis in Frama-C

Sorry if this is detailed somewhere, I tried searching in the different documentations of Frama-C without luck.
I'm trying to do dead code elimination in my code, but I don't understand the results of the tool. Is there any paper / documentation that explains how this plugin works? I only know that it uses the results of the Value analysis.
Admittedly, the sparecode page on Frama-C's website is a bit terse. However, this is partly due to the fact that there's not much to parameterize in this plug-in. Mainly, it is a specialized form of the slicing plug-in, where the criterion is "preserve the state at the end of the program".
More generally, slicing consists in removing instructions that do not contribute to a user given criterion (e.g. the whole program state at a given point, the validity status of an ACSL annotation, or simply the fact that the program reaches a particular instruction).
In order to compute such a slice, slicing, hence sparecode, indeed relies on the results of Eva, mainly to obtain an over-approximation (as always with Eva) of the dependencies between the various memory locations involved at each point in the program (you might want to have a look at Chapter 7 of the Eva manual, which deals with dependencies). Very roughly speaking, the slice will consist in the transitive closure of the dependencies for the memory locations involved in the criterion (in presence of pointers and branches, this notion of "transitive closure" becomes a bit complicated to define formally, but the essence is there).
Now, with respect to dead code, there are two points worth noting:
As mentioned before, Eva provides an over-approximation of the behavior of the program. For slicing, this means that some statements might be kept even though they do not contribute to the slicing criterion in any concrete trace, but appear to do so in the abstract trace due to over-approximation. On the other hand, if a statement is not included, then it definitely does not contribute to the criterion.
For sparecode, not contributing to the final state of the program does not mean that the code is dead, but merely that all its side-effects are shadowed by further instructions. The simplest example of that would be the following:
int x;
int main() {
  x = 1;
  x = 2;
}
Here, the x=1 has no influence over the final state of the program, and only x=2 will be kept.
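To make the distinction between "shadowed" and "dead" a bit more tangible, here is a slightly larger sketch. The function run and its flag parameter are made up for illustration, and the comments describe the semantics of the C code only; what a given Frama-C version actually prints for it is not claimed here.

```c
#include <assert.h>

int x;

/* Returns the final value of x. */
int run(int flag) {
    x = 1;                /* executed, but overwritten before anything reads it:
                             shadowed, not dead */
    x = 2;                /* the only write that reaches the final state */
    if (flag > 0 && flag < 0) {
        x = 3;            /* never executed: the condition cannot hold,
                             so this one really is dead code */
    }
    assert(x == 2);       /* the final state does not depend on x = 1 */
    return x;
}
```

Both x = 1 and x = 3 are irrelevant to the final state, but for different reasons, which is why "not kept by sparecode" should not be read as "unreachable".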

How to detect type unstable functions in Julia

Setup: Let's say I have a reasonably detailed piece of software (in Julia), involving the interaction of several modules. I feel like it is running slower than it should. Typically the first culprit to check for is type unstable functions, i.e. functions where the compiler is unable to determine ahead of time what the output type will be.
Question: How can I detect these type unstable functions?
What I currently do: I use the profiling tools, e.g. the ProfileView.jl package of @tholy, to detect bottlenecks, under the assumption that type unstable functions will show up here (due to their excessive run-time). But what would be really nice is some sort of debugging tool that, after a routine is run, will spit out a list of functions where the compiler was unable to determine the output type ahead of time. Is this possible?
You could try TypeCheck.jl on the bits the profiler says are slow.
Julia 0.4 has @code_warntype as well.
In addition to the excellent suggestions of IainDunning, running julia with --track-allocation=user and analyzing the results with analyze_malloc from the Coverage package is a good way to quickly get a high-level overview. The principle is that type-instability triggers memory allocation, so looking for lines of code that have unexpected, large allocations is a good way to find the most egregious instances of type instability.
You can find more information about track-allocation in the manual, and even more performance-analysis options described.

Tuning Mathematical Parallel Codes

Assuming that I am interested in performance rather than portability of my linear algebra iterative multi-threaded solver and that I have the results of profiling my code in hand, how do I go about tuning my code to run optimally on that machine of my choice?
The algorithm involves Matrix-Vector multiplications, norms and dot-products. (FWIW, I am working on CG and GMRES).
I am working on codes which are of matrix size roughly equivalent to the full size of the RAM (~6GB). I'll be working on Intel i3 Laptop. I'll be linking my codes using Intel MKL.
Is there a good resource (PDF/book/paper) for learning manual tuning? There are numerous things that I learnt by doing, for instance that manual unrolling isn't always optimal, or how compiler flags behave, but I would prefer a centralized resource.
I need something to translate profiler information to improved performance. For instance, my profiler tells me that my stacks of one processor are being accessed by another or that my mulpd ASM is taking too much time. I have no clue what these mean and how I could use this information for improving my code.
My intention is to spend as much time as needed to squeeze out as much compute power as possible. It's more of a learning experience than for actual use or distribution as of now.
(I am concerned about manual tuning not auto-tuning)
Misc Details:
This differs from usual performance tuning since the major portions of the code are linked to Intel's proprietary MKL library.
Because of Memory Bandwidth issues in O(N^2) matrix-vector multiplications and dependencies, there is a limit to what I could manage on my own through simple observation.
I write in C and Fortran and I have tried both and as discussed a million times on SO, I found no difference in either if I tweak them appropriately.
Gosh, this still has no answers. After you've read this you'll still have no useful answers ...
You imply that you've already done all the obvious and generic things to make your codes fast. Specifically you have:
chosen the fastest algorithm for your problem (either that, or your problem is to optimise the implementation of an algorithm rather than to optimise the finding of a solution to a problem);
worked your compiler like a dog to squeeze out the last drop of execution speed;
linked in the best libraries you can find which are any use at all (and tested to ensure that they do in fact improve the performance of your program);
hand-crafted your memory access to optimise r/w performance;
done all the obvious little tricks that we all do (eg when comparing the norms of 2 vectors you don't need to take a square root to determine that one is 'larger' than another, ...);
hammered the parallel scalability of your program to within a gnat's whisker of the S==P line on your performance graphs;
always executed your program on the right size of job, for a given number of processors, to maximise some measure of performance;
and still you are not satisfied !
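As an aside, the "no square root" trick from the list above looks like this in C. This is only a minimal sketch; norm_squared and norm_greater are names invented here, and for vectors with very large or very small entries the squares can overflow or underflow where a scaled norm would not.

```c
#include <assert.h>
#include <stddef.h>

/* Squared Euclidean norm: sum of squares, no square root taken. */
double norm_squared(const double *v, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += v[i] * v[i];
    return s;
}

/* Returns 1 if ||a|| > ||b||, else 0. The square root is monotonically
   increasing on non-negative numbers, so comparing the squared norms
   gives the same ordering as comparing the norms themselves. */
int norm_greater(const double *a, const double *b, size_t n) {
    assert(a != NULL && b != NULL);
    return norm_squared(a, n) > norm_squared(b, n);
}
```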
Now, unfortunately, you are close to the bleeding edge and the information you seek is not to be found easily in books or on web-sites. Not even here on SO. Part of the reason for this is that you are now engaged in optimising your code on your platform and you are in the best position to diagnose problems and to fix them. But these problems are likely to be very local indeed; you might conclude that no-one else outside your immediate research group would be interested in what you do. I know you wouldn't be interested in any of the micro-optimisations I do on my code on my platform.
The second reason is that you have stepped into an area that is still an active research front and the useful lessons (if any) are published in the academic literature. For that you need access to a good research library, if you don't have one nearby then both the ACM and IEEE-CS Digital Libraries are good places to start. (Post or comment if you don't know what these are.)
In your position I'd be looking at journals on 2 topics: peta- and exa-scale computing for science and engineering, and compiler developments. I trust that the former is obvious, the latter may be less obvious: but if your compiler already did all the (useful) cutting-edge optimisations you wouldn't be asking this question and compiler-writers are working hard so that your successors won't have to.
You're probably looking for optimisations which like, say, loop unrolling, were relatively difficult to find implemented in compilers 25 years ago and which were therefore bleeding-edge back then, and which themselves will be old and established in another 25 years.
First, let me make explicit something that was originally only implicit in my 'answer': I am not prepared to spend long enough on SO to guide you through even a summary of the knowledge I have gained in 25+ years in scientific/engineering and high-performance computing. I am not given to writing books, but many are and Amazon will help you find them. This answer was way longer than most I care to post before I added this bit.
Now, to pick up on the points in your comment:
on 'hand-crafted memory access' start at the Wikipedia article on 'loop tiling' (see, you can't even rely on me to paste the URL here) and read out from there; you should be able to quickly pick up the terms you can use in further searches.
on 'working your compiler like a dog' I do indeed mean becoming familiar with its documentation and gaining a detailed understanding of the intentions and realities of the various options; ultimately you will have to do a lot of testing of compiler options to determine which are 'best' for your code on your platform(s).
on 'micro-optimisations', well here's a start: Performance Optimization of Numerically Intensive Codes. Don't run away with the idea that you will learn all (or even much) of what you want to learn from this book. It's now about 10 years old. The take away messages are:
performance optimisation requires intimacy with machine architecture;
performance optimisation is made up of 1001 individual steps and it's generally impossible to predict which ones will be most useful (and which ones actually harmful) without detailed understanding of a program and its run-time environment;
performance optimisation is a participation sport, you can't learn it without doing it;
performance optimisation requires obsessive attention to detail and good record-keeping.
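To give the 'loop tiling' pointer above something to hang on to, here is a minimal sketch for a matrix-vector product. It is not taken from the book or from MKL; matvec_naive, matvec_tiled and the tile size of 64 are assumptions made for illustration, and whether tiling pays off at all for your matrices on your machine is exactly the kind of thing you have to measure.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 64   /* assumed tile size; the best value depends on the cache */

/* Reference version: y = A * x for a row-major n-by-m matrix. */
void matvec_naive(const double *A, const double *x, double *y,
                  size_t n, size_t m) {
    assert(A != NULL && x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++) {
        double s = 0.0;
        for (size_t j = 0; j < m; j++) s += A[i * m + j] * x[j];
        y[i] = s;
    }
}

/* Tiled version: columns are processed in blocks of TILE, so the same
   slice of x is reused for every row before moving on to the next slice. */
void matvec_tiled(const double *A, const double *x, double *y,
                  size_t n, size_t m) {
    assert(A != NULL && x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++) y[i] = 0.0;
    for (size_t jj = 0; jj < m; jj += TILE) {
        size_t jend = (jj + TILE < m) ? jj + TILE : m;   /* last tile may be short */
        for (size_t i = 0; i < n; i++) {
            double s = 0.0;
            for (size_t j = jj; j < jend; j++) s += A[i * m + j] * x[j];
            y[i] += s;   /* accumulate this tile's partial dot product */
        }
    }
}
```

Note that the tiled version adds the terms in a different order, so with general floating-point data the two results can differ in the last bits.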
Oh, and never write a clever piece of optimisation that you can't easily un-write when the next compiler release implements a better approach. I spend a fair amount of time removing clever tricks from 20-year old Fortran that was justified (if at all) on the grounds of boosting execution performance but which now just confuses the programmer (it annoys the hell out of me too) and gets in the way of the compiler doing its job.
Finally, one piece of wisdom I am prepared to share: these days I do very little optimisation that is not under one of the items in my first list above; I find that the cost/benefit ratio of micro-optimisations is unfavourable to my employers.

Language without explicit memory alloc/dealloc AND without garbage collection

I was wondering if it is possible to create a programming language without explicit memory allocation/deallocation (like C, C++ ...) AND without garbage collection (like Java, C#...) by doing a full analysis at the end of each scope?
The obvious problem is that this would take some time at the end of each scope, but I was wondering if it has become feasible with all the processing power and multiple cores in current CPU's. Do such languages exist already?
I also was wondering if a variant of C++ where smart pointers are the only pointers that can be used, would be exactly such a language (or am I missing some problems with that?).
Well after some more research apparently it's this:
I was wondering why this isn't more popular. The disadvantages listed there don't seem that serious, and in my opinion the overhead shouldn't be that large. A (non-interpreted, properly written from the ground up) language with C family syntax and reference counting seems like a good idea to me.
The biggest problem with reference counting is that it is not a complete solution and is not capable of collecting a cyclic structure. The overhead is incurred every time you set a reference; for many kinds of problems this adds up quickly and can be worse than just waiting for a GC later. (Modern GC is quite advanced and awesome - don't count it down like that!!!)
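Here is a small sketch of the cycle problem, using a hand-written reference count in C. All the names (node_t, node_new, node_retain, node_release, node_set_next, leaked_after_release) are invented for this example; it is not any particular language's or library's implementation.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct node {
    int refcount;
    struct node *next;       /* owning (counted) reference to another node */
} node_t;

static int live_nodes = 0;   /* number of nodes not yet freed */

node_t *node_new(void) {
    node_t *n = malloc(sizeof *n);
    assert(n != NULL);
    n->refcount = 1;         /* the creator holds the first reference */
    n->next = NULL;
    live_nodes++;
    return n;
}

void node_retain(node_t *n) { if (n != NULL) n->refcount++; }

void node_release(node_t *n) {
    if (n == NULL) return;
    if (--n->refcount == 0) {
        node_release(n->next);   /* drop the reference this node held */
        free(n);
        live_nodes--;
    }
}

/* Sets a->next = b; note the counting work done on every assignment. */
void node_set_next(node_t *a, node_t *b) {
    node_retain(b);
    node_release(a->next);
    a->next = b;
}

/* Builds a -> b, optionally closes the cycle b -> a, drops the external
   references, and returns how many nodes reference counting failed to free. */
int leaked_after_release(int make_cycle) {
    int before = live_nodes;
    node_t *a = node_new();
    node_t *b = node_new();
    node_set_next(a, b);
    if (make_cycle) node_set_next(b, a);
    node_release(a);             /* drop the external references */
    node_release(b);
    int leaked = live_nodes - before;
    if (leaked == 2) {
        /* Both counts are stuck at 1: reclaim by hand so the demo
           itself does not leak. */
        free(a);
        free(b);
        live_nodes -= 2;
    }
    return leaked;
}
```

Without the cycle both nodes are freed; with it, each node keeps the other's count at 1 and neither is ever released by the counting scheme alone.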
What you are talking about is nothing special, and it shows up all the time. The C or C++ variant you are looking for is just plain regular C or C++.
For example write your program normally, but constrain yourself not to use any dynamic memory allocation (no new, delete, malloc, or free, or any of their friends, and make sure your libraries do the same), then you have that kind of system. You figure out in advance how much memory you need for everything you could do, and declare that memory statically (either function level static variables, or global variables). The compiler takes care of all the accounting the normal way, nothing special happens at the end of each scope, and no extra computation is necessary.
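As a minimal sketch of that style (the names samples, sample_push, sample_mean, sample_reset and the capacity of 128 are all made up for illustration), all the storage is declared statically and the program simply refuses work beyond the capacity it was built with:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SAMPLES 128   /* capacity decided at design time (assumed value) */

/* All storage is reserved statically: no malloc, no free, no collector. */
static double samples[MAX_SAMPLES];
static size_t sample_count = 0;

/* Returns 1 on success, 0 if the statically reserved capacity is used up. */
int sample_push(double v) {
    if (sample_count == MAX_SAMPLES) return 0;
    samples[sample_count++] = v;
    return 1;
}

/* Mean of the stored samples; requires at least one sample. */
double sample_mean(void) {
    assert(sample_count > 0);
    double s = 0.0;
    for (size_t i = 0; i < sample_count; i++) s += samples[i];
    return s / (double)sample_count;
}

/* "Deallocation" is just resetting the count. */
void sample_reset(void) { sample_count = 0; }
```

The price is visible in sample_push: when the reserved space runs out, the caller has to cope with a failure instead of the program growing.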
You can even configure your runtime environment to have a statically allocated stack space (this one isn't really under the compiler's control, more the linker and operating system environment). Just figure out how deep your function call chain goes, and how much memory it uses (with a profiler or similar tool), and set it in your link options.
Without dynamic memory allocation (and thus no deallocation through either garbage collection or explicit management), you are limited to the memory you declared when you wrote the program. But that's ok, many programs don't need dynamic memory, and are already written that way. The real need for this shows up in embedded and real-time systems when you absolutely, positively need to know exactly how long an operation will take, how much memory (and other resources) it will use, and that the running time and the use of those resources can't ever change.
The great thing about C and C++ is that the language requires so little from the environment, and gives you the tools to do so much, that smart pointers or statically allocated memory, or even some special scheme that you dream up can be implemented. Requiring their use, and the constraints you put on yourself, just becomes a policy decision. You can enforce that policy with code auditing (use scripts to scan the source or object files and don't permit linking to the dynamic memory libraries).

Just in Time compilation always faster?

Greetings to all the compiler designers here on Stack Overflow.
I am currently working on a project which focuses on developing a new scripting language for use with high-performance computing. The source code is first compiled into a byte code representation. The byte code is then loaded by the runtime, which performs aggressive (and possibly time consuming) optimizations on it (which go much further than what even most "ahead-of-time" compilers do; after all, that's the whole point of the project). Keep in mind the result of this process is still byte code.
The byte code is then run on a virtual machine. Currently, this virtual machine is implemented using a straight-forward jump table and a message pump. The virtual machine runs over the byte code with a pointer, loads the instruction under the pointer, looks up an instruction handler in the jump table and jumps into it. The instruction handler carries out the appropriate actions and finally returns control to the message loop. The virtual machine's instruction pointer is incremented and the whole process starts over again. The performance I am able to achieve with this approach is actually quite amazing. Of course, the code of the actual instruction handlers is again fine-tuned by hand.
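For readers who have not seen this design, a minimal sketch of such a dispatch loop in C follows. It is not the actual VM from my project: the four opcodes, vm_t, jump_table and vm_run are invented for illustration, and the sketch advances the instruction pointer at fetch time rather than after the handler returns.

```c
#include <assert.h>
#include <stddef.h>

enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT, OP_COUNT };
enum { STACK_MAX = 64 };

typedef struct {
    const int *ip;           /* instruction pointer into the byte code */
    int stack[STACK_MAX];
    int sp;                  /* number of values currently on the stack */
    int running;
} vm_t;

typedef void (*handler_t)(vm_t *);

/* Instruction handlers: statically compiled native code. */
static void op_push(vm_t *vm) {
    assert(vm->sp < STACK_MAX);
    vm->stack[vm->sp++] = *vm->ip++;   /* operand follows the opcode */
}
static void op_add(vm_t *vm) {
    assert(vm->sp >= 2);
    vm->sp--;
    vm->stack[vm->sp - 1] += vm->stack[vm->sp];
}
static void op_mul(vm_t *vm) {
    assert(vm->sp >= 2);
    vm->sp--;
    vm->stack[vm->sp - 1] *= vm->stack[vm->sp];
}
static void op_halt(vm_t *vm) { vm->running = 0; }

/* The jump table: one handler per opcode. */
static const handler_t jump_table[OP_COUNT] = {
    [OP_PUSH] = op_push, [OP_ADD] = op_add,
    [OP_MUL] = op_mul,   [OP_HALT] = op_halt
};

/* Runs the byte code until OP_HALT and returns the top of the stack. */
int vm_run(const int *code) {
    vm_t vm = { .ip = code, .sp = 0, .running = 1 };
    while (vm.running) {
        int opcode = *vm.ip++;             /* load instruction, advance pointer */
        assert(opcode >= 0 && opcode < OP_COUNT);
        jump_table[opcode](&vm);           /* indirect call through the table */
    }
    assert(vm.sp > 0);
    return vm.stack[vm.sp - 1];
}
```

For example, the program { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PUSH, 4, OP_MUL, OP_HALT } computes (2 + 3) * 4. Every instruction costs one load, one table lookup and one indirect call on top of the handler's own work, which is the overhead the question is about.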
Now most "professional" run-time environments (like Java, .NET, etc.) use Just-in-Time compilation to translate the byte code into native code before execution. A VM using a JIT does usually have much better performance than a byte code interpreter. Now the question is, since all an interpreter basically does is load an instruction and look up a jump target in a jump table (remember the instruction handler itself is statically compiled into the interpreter, so it is already native code), will the use of Just-in-Time compilation result in a performance gain or will it actually degrade performance? I cannot really imagine the jump table of the interpreter to degrade performance that much to make up the time that was spent on compiling that code using a JITer. I understand that a JITer can perform additional optimization on the code, but in my case very aggressive optimization is already performed on the byte code level prior to execution. Do you think I could gain more speed by replacing the interpreter by a JIT compiler? If so, why?
I understand that implementing both approaches and benchmarking will provide the most accurate answer to this question, but it might not be worth the time if there is a clear-cut answer.
The answer lies in the ratio of single-byte-code-instruction complexity to jump table overheads. If you're modelling high level operations like large matrix multiplications, then a little overhead will be insignificant. If you're incrementing a single integer, then of course that's being dramatically impacted by the jump table. Overall, the balance will depend upon the nature of the more time-critical tasks the language is used for. If it's meant to be a general purpose language, then it's more useful for everything to have minimal overhead as you don't know what will be used in a tight loop. To quickly quantify the potential improvement, simply benchmark some nested loops doing some simple operations (but ones that can't be optimised away) versus an equivalent C or C++ program.
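A sketch of what the C side of such a benchmark could look like is below. The function names and the i ^ j workload are arbitrary choices made for illustration; the accumulator is volatile only to stop the compiler from collapsing the loops, which also makes this baseline somewhat slower than fully optimised C would be.

```c
#include <assert.h>
#include <time.h>

/* Nested loops doing trivial integer work. The accumulator is volatile so
   the compiler cannot replace the loops with a precomputed constant. */
unsigned long nested_loops(unsigned long outer, unsigned long inner) {
    volatile unsigned long acc = 0;
    for (unsigned long i = 0; i < outer; i++)
        for (unsigned long j = 0; j < inner; j++)
            acc += i ^ j;
    return acc;
}

/* Returns the CPU seconds spent in one run; *result receives the checksum,
   which should match what the byte code version of the loops computes. */
double time_nested_loops(unsigned long outer, unsigned long inner,
                         unsigned long *result) {
    assert(result != NULL);
    clock_t start = clock();
    *result = nested_loops(outer, inner);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Run the same loops in your byte code, check that both sides produce the same checksum, and compare the times for loop counts large enough to swamp the timer resolution.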
When you use an interpreter, the code cache in your processor caches the interpreter code; not the byte code (which may be cached in the data cache). Since code caches are 2 to 3 times faster than data caches, IIRC; you may see a performance boost if you JIT compile. Also, the native, real code you are executing is probably PIC; something which can be avoided for JITted code.
Everything else depends on how optimized the byte code is, IMHO.
JIT can theoretically optimize better, since it has information not available at compile time (especially about typical runtime behavior). So it can, for example, do better branch prediction, unroll loops as needed, etc.
I am sure your jump table approach is OK, but I still think it would perform rather poorly compared to straight C code, don't you think?
