LLVM JIT speed up choices?

It's kind of subjective, but I'm having trouble getting the LLVM JIT up to speed. Jitting large bodies of code takes 50 times as long as just interpreting them, even with lazy compilation turned on.
So I was wondering: how can I speed jitting up, and what kind of settings can I use?
Any other recommendations?

I am sorry to say that LLVM just isn't very fast as a JIT compiler; it is better as an AOT/static compiler.
I have run into the same speed issues in my llvm-lua project. What I did was disable JIT compilation of large Lua functions. llvm-lua doesn't have lazy-compilation support turned on, since LLVM requires too much C-stack space to run from Lua coroutines.
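A minimal sketch of that size-threshold policy (the constant and names below are hypothetical, not llvm-lua's actual API): functions under a bytecode-size cutoff go to the JIT, everything larger stays in the interpreter.

```cpp
#include <cstddef>

// Hypothetical cutoff: functions with more bytecode than this stay interpreted,
// since JIT compilation time grows faster than the speedup pays back.
constexpr std::size_t kMaxJitBytecodeSize = 4096;

enum class ExecMode { Interpret, Jit };

// Decide per function whether handing it to the JIT is likely to pay off.
ExecMode choose_exec_mode(std::size_t bytecode_size) {
    return bytecode_size <= kMaxJitBytecodeSize ? ExecMode::Jit
                                                : ExecMode::Interpret;
}
```

The cutoff value would be tuned empirically per workload; the point is only that the decision is made per function, before any LLVM IR is generated.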
Also if you use this in your program's main() function:
llvm::cl::ParseCommandLineOptions(argc, argv, 0, true);
It will expose a lot of command-line options from LLVM, like '-time-passes', which enables timing of LLVM passes so you can see which parts of JIT compilation are taking the most time.

Related

Why is it not possible to execute compile-file in parallel?

I have a project that has a lot of files that are not managed by ASDF and are compiled manually. These files are completely independent and being able to compile them in parallel seemed like a way to reduce compilation time to me. My plan was to compile these files in parallel and then sequentially load produced FASL files. But after I parallelized compilation, I saw that there was literally zero performance improvement. Then I went to SBCL sources and found that compile-file takes a world lock, which essentially sequentializes the compilation.
My question is, what's the reason that compile-file takes this lock? While loading FASLs in parallel could indeed lead to some race conditions, it seemed to me that compilation of Lisp files should be independent and parallelizable.
The compiler is accessible from the language. You can do compile-time programming, have compiler macros, etc. Just as an illustration, there is (eval-when (:compile-toplevel) …). You cannot rule out compile-time effects in general, and this would have to be thread-safe everywhere. I guess the effort to make this robust is much bigger than anyone was willing to invest.
You might be able to start multiple Lisp images for parallel compilation, though. You just need to handle the dependency graph while orchestrating that.
UPDATE: I just stumbled upon a conversation that seems to imply that SBCL is closer to getting rid of that lock than I thought: https://sourceforge.net/p/sbcl/mailman/message/36473674/

Experimental support for keeping the Scala compiler resident with sbt.resident.limit?

I've just noticed in the code of sbt.BuiltinCommands that there's a Java command-line setting, sbt.resident.limit, that (quoting Experimental or In-progress):
Experimental support for keeping the Scala compiler resident. Enable
by passing -Dsbt.resident.limit=n to sbt, where n is an integer
indicating the maximum number of compilers to keep around.
Should an end user know about the switch? Where could it be useful? Is the feature headed for mainstream use, or is it so specialized that it's of almost no use?
We've experimented with keeping Scala compiler instances in memory to reduce the time it takes to perform incremental compilation. We found that the speed improvements were not as large as we expected. The complexity of resident compilation is really large, due to issues like memory leaks and sound symbol-table invalidation.
I think it's very unlikely we'll finish that experimental feature in any foreseeable future so I think we should remove any references to resident compilation mode from sbt sources.
I created an sbt ticket for tracking it: https://github.com/sbt/sbt/issues/1287
Feel free to grab it. I'm happy to assist you with any questions related to cleaning up sbt code from resident compilation mode.

Getting OpenCL program code from GPU

I have a program which uses OpenCL to do math. How can I get the OpenCL source code that executes on my GPU when this program does its calculations?
The most straightforward approach is to look for the kernel string in the application. Sometimes you'll be able to just find its source lying in some .cl file, otherwise you can try to scan the application's binaries with something like strings. If the application is not purposefully obfuscating the kernel source, you're likely to find it using one of those methods.
A more bulletproof approach would be to catch the strings provided to the OpenCL API. You can even provide your own OpenCL implementation that just prints out the kernel strings in the relevant cl function. It's actually pretty easy: start with pocl and change the implementation of clCreateProgramWithSource to print out the input strings - this is a trivial code change.
You can then install that modified version as an OpenCL implementation and make sure the application uses it. This might be tricky if the application requires certain OpenCL capabilities, but your implementation can of course lie about those.
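A sketch of the logging half of such a shim: this is how a wrapper around clCreateProgramWithSource would reassemble the kernel source from its (count, strings, lengths) arguments before forwarding the call. The forwarding itself (via dlsym(RTLD_NEXT, ...) in an LD_PRELOAD library, or a pocl code change) is omitted; the helper name is illustrative.

```cpp
#include <cstddef>
#include <string>

// Reassemble kernel source the way clCreateProgramWithSource receives it:
// per the OpenCL spec, 'lengths' may be null (every string is NUL-terminated),
// and a zero entry in 'lengths' also means "NUL-terminated".
std::string collect_kernel_source(unsigned count, const char** strings,
                                  const std::size_t* lengths) {
    std::string src;
    for (unsigned i = 0; i < count; ++i) {
        if (!lengths || lengths[i] == 0)
            src += strings[i];                   // NUL-terminated fragment
        else
            src.append(strings[i], lengths[i]);  // explicit-length fragment
    }
    return src;
}
```

The shim would call this inside its own clCreateProgramWithSource, dump the result to a file, then hand the arguments on to the real implementation.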
Notice that in the future, SPIR may make this sort of thing impossible - you'll be able to get an IR of the kernel, but not its source.
clGetProgramInfo(..., CL_PROGRAM_BINARIES, ...) gets you the compiled binary, but interpreting that depends on the architecture. Various SDKs have tools that might get you GPU assembly, though.

Can C/C++ software be compiled into bytecode for later execution? (Architecture independent unix software.)

I want to compile existing software into a representation that can later be run on different architectures (and OSes).
For that I need a (byte)code that can be easily run/emulated on another arch/OS (LLVM IR? Some RISC assembly?)
Some random ideas:
Compiling into JVM bytecode and running with java. Too restrictive? Are C compilers available?
MS CIL. Are C compilers available?
LLVM? Can the intermediate representation be run later?
Compiling into a RISC architecture such as MMIX. What about system calls?
Then there is the system-call mapping issue, but e.g. the BSDs have system-call translation layers.
Are there any already working systems that compile C/C++ into something that can later be run with an interpreter on another architecture?
Edit
Could I compile existing Unix software into a not-so-low-level binary which could be "emulated" more easily than running a full x86 emulator? Something more like the JVM than a Xen HVM.
There are several C to JVM compilers listed on Wikipedia's JVM page. I've never tried any of them, but they sound like an interesting exercise to build.
Because of its close association with the Java language, the JVM performs the strict runtime checks mandated by the Java specification. That requires C to bytecode compilers to provide their own "lax machine abstraction", for instance producing compiled code that uses a Java array to represent main memory (so pointers can be compiled to integers), and linking the C library to a centralized Java class that emulates system calls. Most or all of the compilers listed there use a similar approach.
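That "pointers as integers into one big array" idea can be sketched in a few lines (the names here are illustrative, not taken from any of those compilers): all of the translated C program's memory lives in one flat array, and a "pointer" is just an index into it.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One flat byte array stands in for the C program's whole address space,
// the way a C-to-JVM compiler would model it with a Java byte[] array.
struct LaxMachine {
    std::vector<std::uint8_t> memory;
    explicit LaxMachine(std::size_t size) : memory(size, 0) {}

    // A translated "C pointer" is just an integer offset into 'memory',
    // so pointer arithmetic becomes plain integer arithmetic.
    std::uint8_t load(std::uint32_t ptr) const { return memory[ptr]; }
    void store(std::uint32_t ptr, std::uint8_t value) { memory[ptr] = value; }
};
```

Multi-byte loads/stores and the C-library/system-call shims layer on top of this same model.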
C compiled to LLVM bitcode is not platform independent. Have a look at Google's Portable Native Client; they are trying to address that.
Adobe has Alchemy, which will let you compile C to Flash.
There are C to Java or even JavaScript compilers. However, due to differences in memory management, they aren't very usable.
Web Assembly is trying to address that now by creating a standard bytecode format for the web, but unlike the JVM bytecode, Web Assembly is more low level, working at the abstraction level of C/C++, and not Java, so it's more like what's typically called an "assembly language", which is what C/C++ code is normally compiled to.
LLVM is not a good solution for this problem. As beautiful as LLVM IR is, it is by no means machine independent, nor was it intended to be. It is very easy, and indeed necessary in some languages, to generate target dependent LLVM IR: sizeof(void*), for example, will be 4 or 8 or whatever when compiled into IR.
LLVM also does nothing to provide OS independence.
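A one-line function shows why: the front end must fold sizeof into a constant before LLVM IR even exists, so the IR contains a literal 8 (or 4) rather than anything target-neutral.

```cpp
#include <cstddef>

// Compiled with clang -emit-llvm on a 64-bit target, the body lowers to
// essentially "ret i64 8"; on a 32-bit target, "ret i32 4". The target's
// pointer width is baked into the IR by the front end.
std::size_t pointer_size() {
    return sizeof(void*);
}
```

The same happens with struct layout, alignment, and endianness-sensitive code, which is why bitcode from a C front end is tied to the target it was generated for.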
One interesting possibility might be QEMU. You could compile a program for a particular architecture and then use QEMU user space emulation to run it on different architectures. Unfortunately, this might solve the target machine problem, but doesn't solve the OS problem: QEMU Linux user mode emulation only works on Linux systems.
JVM is probably your best bet for both target and OS independence if you want to distribute binaries.
As Ankur mentions, C++/CLI may be a solution. You can use Mono to run it on Linux, as long as it has no native bits. But unless you already have a code base you are trying to port at minimal cost, using it would probably be counterproductive. If it makes sense in your situation, you should go with Java or C#.
Most people who go with C++ do it for performance reasons, but unless you play with very low level stuff, you'll be done coding earlier in a higher level language. This in turn gives you the time to optimize so that by the time you would have been done in C++, you'll have an even faster version in whatever higher level language you choose to use.
The real problem is that C and C++ are not architecture-independent languages. You can write things that are reasonably portable in them, but the compiler also hardcodes aspects of the machine via your code. Think about, for example, sizeof(long). Also, as Richard mentions, there's no OS independence. So unless the libraries you use happen to have the same conventions and exist on multiple platforms, you wouldn't be able to run the application.
Your best bet would be to write your code in a more portable language, or provide binaries for the platforms you care about.

JIT compilers for math

I am looking for a JIT compiler or a small compiler library that can be embedded in my program. I intend to use it to compile dynamically generated code that performs complex-number arithmetic. The generated code is very simple in structure: no loops, no conditionals, but it can be quite long (a few MB when compiled by GCC). The performance of the resulting machine code is important, while I don't really care about the speed of compilation itself. Which JIT compiler is best for my purpose? Thanks!
Detailed requirements
Support double-precision complex-number arithmetic
Support basic optimization
Support many CPUs (x86 and x86-64 at least)
Make use of SSE on supported CPUs
Support stack or a large set of registers for local variables
ANSI-C or C++ interface
Cross platform (mainly Linux, Unix)
You might want to take a look at LLVM.
CINT is a C++(ish) environment that offers the ability to mix compiled code and interpreted code. There is a set of optimization tools for the interpreter. ROOT extends this even further by supporting compile-and-link at run time (see the last section of http://root.cern.ch/drupal/content/cint-prompt), though it appears to use the system compiler and thus may not help. All the code is open source.
I make regular use of all these features as part of my work.
I don't know if it makes active use of SIMD instructions, but it seems to meet all your other requirements.
Since you are currently using the compile-to-dynamic-library-and-link-on-the-fly method, you might consider TCC, though I don't believe it does much optimization, and I suspect it does not support SIMD.
Sounds like you want to be able to compile on the fly and then dynamically load the compiled library (.DLL or .so). This would give you the best performance, with an ANSI-C or C++ interface. So, forget about JITing and consider spawning a C/C++ compiler to do the compilation.
This of course assumes that a compiler can be installed at the point where the dynamically generated code is actually generated.
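A sketch of the front half of that pipeline, emitting straight-line C for a complex-arithmetic expression (all names here are illustrative). In the full pipeline, the generated source would be written to disk, handed to the system compiler (e.g. cc -O2 -shared -fPIC gen.c -o gen.so), and the resulting symbol loaded with dlopen/dlsym.

```cpp
#include <sstream>
#include <string>

// Emit C source for a straight-line complex-arithmetic kernel computing
// (a + b) * a with C99 'double complex'. No loops, no conditionals, as in
// the question; the real generator would walk an expression tree instead
// of emitting a fixed expression.
std::string emit_kernel(const std::string& name) {
    std::ostringstream out;
    out << "#include <complex.h>\n"
        << "double complex " << name
        << "(double complex a, double complex b) {\n"
        << "    double complex t0 = a + b;\n"
        << "    return t0 * a;\n"
        << "}\n";
    return out.str();
}
```

Letting the optimizing C compiler handle the heavy lifting matches the question's priorities: compilation speed is sacrificed, but the machine code quality is whatever GCC/Clang at -O2 can deliver, SSE included.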
