How much instruction-level optimisation can a JIT apply?

To what extent can a JIT replace platform-independent code with processor-specific machine instructions?
For example, the x86 instruction set includes the BSWAP instruction to reverse a 32-bit integer's byte order. In Java, the Integer.reverseBytes() method is implemented with multiple bitwise masks and shifts, even though in x86 native code it could be a single BSWAP instruction. Are JITs (or static compilers, for that matter) able to make that substitution automatically, or is it too complex or simply not worth the compile-time cost for the speed gained?
(I know that this is in most cases a micro-optimisation, but I'm interested nonetheless.)

For this case, yes, the HotSpot server compiler can do this optimisation. The reverseBytes() methods are registered as vmIntrinsics in HotSpot. When the JIT compiler compiles one of these methods, it generates a special IR node instead of compiling the whole method body, and on x86 that node is translated into a bswap instruction. See src/share/vm/opto/library_call.cpp:
//---------------------------- inline_reverseBytes_int/long/char/short -------------------
// inline Integer.reverseBytes(int)
// inline Long.reverseBytes(long)
// inline Character.reverseBytes(char)
// inline Short.reverseBytes(short)
bool LibraryCallKit::inline_reverseBytes(vmIntrinsics::ID id) {
  assert(id == vmIntrinsics::_reverseBytes_i || id == vmIntrinsics::_reverseBytes_l ||
         id == vmIntrinsics::_reverseBytes_c || id == vmIntrinsics::_reverseBytes_s,
         "not reverse Bytes");
  if (id == vmIntrinsics::_reverseBytes_i && !Matcher::has_match_rule(Op_ReverseBytesI))  return false;
  if (id == vmIntrinsics::_reverseBytes_l && !Matcher::has_match_rule(Op_ReverseBytesL))  return false;
  if (id == vmIntrinsics::_reverseBytes_c && !Matcher::has_match_rule(Op_ReverseBytesUS)) return false;
  if (id == vmIntrinsics::_reverseBytes_s && !Matcher::has_match_rule(Op_ReverseBytesS))  return false;
  _sp += arg_size();  // restore stack pointer
  switch (id) {
  case vmIntrinsics::_reverseBytes_i:
    push(_gvn.transform(new (C, 2) ReverseBytesINode(0, pop())));
    break;
  case vmIntrinsics::_reverseBytes_l:
    push_pair(_gvn.transform(new (C, 2) ReverseBytesLNode(0, pop_pair())));
    break;
  case vmIntrinsics::_reverseBytes_c:
    push(_gvn.transform(new (C, 2) ReverseBytesUSNode(0, pop())));
    break;
  case vmIntrinsics::_reverseBytes_s:
    push(_gvn.transform(new (C, 2) ReverseBytesSNode(0, pop())));
    break;
  default:
    ;
  }
  return true;
}
and src/cpu/x86/vm/x86_64.ad:
instruct bytes_reverse_int(rRegI dst) %{
  match(Set dst (ReverseBytesI dst));

  format %{ "bswapl  $dst" %}
  opcode(0x0F, 0xC8); /* Opcode 0F /C8 */
  ins_encode( REX_reg(dst), OpcP, opc2_reg(dst) );
  ins_pipe( ialu_reg );
%}
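Static compilers attack the same problem through idiom recognition rather than registered intrinsics. As a hedged illustration (not from the question or the HotSpot sources): in my experience, recent GCC and Clang recognise the canonical shift-and-mask byte-swap pattern below and, at -O2 on x86, emit a single bswap instruction for it; both also provide __builtin_bswap32 if you want to request the instruction explicitly rather than rely on the pattern match.

#include <cstdint>
#include <cstdio>

// Canonical shift-and-mask byte swap. Modern GCC and Clang typically
// recognise this idiom and compile it to a single bswap on x86, while
// the portable fallback stays correct on targets without such an instruction.
std::uint32_t reverse_bytes(std::uint32_t v) {
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

int main() {
    std::printf("%08x\n", (unsigned) reverse_bytes(0x12345678u)); // prints 78563412
    return 0;
}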

Related

Arduino sketch repeats an instruction even when it doesn't have to

I'm programming an Arduino sketch that, in general terms, is a calculator that uses four numeric systems: decimal, binary, octal and hexadecimal.
When I ask the user for the numeric system, he inputs the desired system (1 for decimal, 2 for hexadecimal, 3 for octal and 4 for binary) with a keypad; then, after receiving this input, the Arduino prints the chosen system on an LCD. But this portion of the code seems to repeat itself indefinitely, without ever executing the part where the numbers and operands are entered. I can't input numbers other than 1, 2, 3 or 4, and if I press one of these numbers it just prints its system again, completely ignoring the previous input.
I've tried boolean flags to tell the program not to run that portion of the code if it has already been executed, but that doesn't seem to work.
This is the portion of the code that receives the input and validates it. The switch statement repeats the same pattern in the other three cases, changing the numeric system that is printed.
void loop()
{
  char base = calcuShift.getKey();
  if (base != NO_KEY && (base == '1' || base == '2' || base == '3' || base == '4')) {
    switch (base) {
      case '1':
        lcd.clear();
        lcd.setCursor(0,0);
        lcd.print("Sistema");
        lcd.setCursor(0,1);
        lcd.print("Decimal");
        delay(3000);
        lcd.clear();
        break;
After the switch case it must execute the following code:
char key;
if (digitalRead(A0) == HIGH) {
  key = calcuShift.getKey();
} else {
  key = calcu.getKey();
}
if (key != NO_KEY &&
    (key=='1' || key=='2' || key=='3' || key=='4' || key=='5' || key=='6' || key=='7' || key=='8' || key=='9') &&
    base == '1') {
  if (inicio == false) {
    num1 = num1 + key;
    int numLength = num1.length();
    lcd.setCursor(15-numLength,0);
    lcd.print(num1);
  } else {
    num2 = num2 + key;
    int numLength = num2.length();
    lcd.setCursor(15-numLength,1);
    lcd.print(num2);
    final = true;
  }
This is followed by other if conditionals that vary depending on the variable "base" (the one the user inputs at the beginning). If it's 1 (decimal) it accepts digits 0 through 9, if it's 2 (hexadecimal) it accepts digits 0 through F, and so on.
The user enters numbers via the variable key. The object calcuShift is just the normal keypad in shift mode, with letters and two extra operators instead of numbers, and the multiplication and division operators replaced by power and root operators.
I want my calculator to receive the desired numeric system, receive numbers in that system and perform operations with those numbers, returning an answer in the previously chosen system, but instead it just keeps re-reading the "base" input that selects the numeric system.
An Arduino sketch is intended to execute loop() as fast as possible, forever and ever.
Anything that is intended to run once after startup should go into setup().
Typically, you define a sort of finite state machine that behaves differently depending on whether the user is currently supposed to enter the number-base code, a digit, or an operator key. It stores the previous input (the current state) in some global or static variables.
Reaching the end of loop() does not mean the "program" is finished; it just means that all currently possible state changes have been checked and stored and, if necessary, the display has been updated.
And since loop() is fast enough to react immediately to any button press, it will normally do nothing at all, because no new key has been pressed.
If your code does something repeatedly forever that it should do only once, that is because you don't store the fact that it was already done. (Or you use the reset button as the user interface and allow "choose number base" only during the setup() phase.)
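A minimal sketch of such a state machine, assuming the same lcd, calcuShift and NO_KEY definitions as the original sketch; the names State, CHOOSING_BASE and ENTERING_NUMBER are made up for illustration, and all validation against the chosen base is left out:

// Hypothetical minimal state machine: the first accepted key selects the base,
// later key presses are treated as digits of the current number.
enum State { CHOOSING_BASE, ENTERING_NUMBER };

State state = CHOOSING_BASE;   // survives between loop() iterations
char base = 0;
String num1 = "";

void loop() {
  char key = calcuShift.getKey();
  if (key == NO_KEY) return;   // nothing pressed: do nothing this pass

  switch (state) {
    case CHOOSING_BASE:
      if (key >= '1' && key <= '4') {
        base = key;
        lcd.clear();
        lcd.print("Sistema");
        state = ENTERING_NUMBER;   // never ask for the base again
      }
      break;

    case ENTERING_NUMBER:
      num1 += key;                 // validate against 'base' as needed
      lcd.setCursor(15 - num1.length(), 0);
      lcd.print(num1);
      break;
  }
}

The key point is that state and base live outside loop(), so once the base has been chosen the sketch never falls back into the selection branch.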

How can lParam be cast to more than one structure?

I saw the piece of code below here. I tested it and it works all right.
// g_hLink is the handle of the SysLink control.
case WM_NOTIFY:
  switch (((LPNMHDR)lParam)->code) // CAST TO NMHDR*
  {
    case NM_CLICK: // Fall through to the next case.
    case NM_RETURN:
    {
      PNMLINK pNMLink = (PNMLINK)lParam; // CAST TO NMLINK*
      LITEM item = pNMLink->item;
      if ((((LPNMHDR)lParam)->hwndFrom == g_hLink) && (item.iLink == 0))
      {
        ShellExecute(NULL, L"open", item.szUrl, NULL, NULL, SW_SHOW);
      }
      else if (wcscmp(item.szID, L"idInfo") == 0)
      {
        MessageBox(hDlg, L"This isn't much help.", L"Example", MB_OK);
      }
      break;
    }
  }
  break;
The parameter lParam is cast to both NMHDR* and NMLINK*. The documentation of the WM_NOTIFY message says that lParam can be cast to NMHDR*, but NMLINK is a different structure that merely contains an NMHDR.
What actually happens when we cast lParam to either of these two structure types?
NMLINK contains NMHDR as its first element:
struct NMLINK {
    NMHDR hdr;
    LITEM item;
};
So a pointer to an NMLINK is equal to a pointer to its first member (the NMHDR sitting at offset 0); they are the same address. That is why you can cast the NMHDR* to an NMLINK* once the code field tells you that the notification really is an NMLINK.
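A small self-contained illustration of the same rule, with hypothetical Header/LinkMsg types standing in for NMHDR/NMLINK (none of this comes from the Windows headers): the receiver always reads the common header first and only downcasts to the larger structure once the code field says it is safe to do so.

#include <cstdio>

// Hypothetical stand-ins for NMHDR and NMLINK.
struct Header  { int code; };
struct LinkMsg { Header hdr; const char *url; };  // Header is the FIRST member

constexpr int MSG_LINK = 42;

void handleNotify(void *lParam) {                  // plays the role of WM_NOTIFY's lParam
    Header *hdr = static_cast<Header *>(lParam);   // always safe: the header comes first
    if (hdr->code == MSG_LINK) {
        // Safe only because the sender really passed a LinkMsg for this code.
        LinkMsg *link = static_cast<LinkMsg *>(lParam);
        std::printf("link: %s\n", link->url);
    }
}

int main() {
    LinkMsg msg{{MSG_LINK}, "https://example.com"};
    handleNotify(&msg);
    return 0;
}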

Qt optimization of a QByteArray conversion

I wrote a function to convert a hex string representation (like x00) of some binary data back to the data itself.
How to improve this code?
QByteArray restoreData(const QByteArray &data, const QString prepender = "x")
{
    QByteArray restoredData = data;
    return QByteArray::fromHex(restoredData.replace(prepender, ""));
}
How to improve this code?
Benchmark before optimizing this. Do not do premature optimization.
Beyond the main point: why would you want to optimize it?
1) If you are really so concerned about performance that this negligible piece of code matters, you would not use Qt in the first place, because Qt is inherently slow compared to a well-optimized framework.
2) If you are not that concerned about performance, then you should keep readability and maintainability in mind as the leading principle, in which case your code is fine.
You have not shown any real-world example of why exactly you want to optimize, either. This feels like an academic question without much practical use to me. It would be interesting to know more about the motivation.
That being said, several improvements, some of which are also optimizations, could be made to your code; then again, they are done not for optimization so much as for logical reasons.
1) "Prepender" is a bad name; it is usually called a "prefix" in English.
2) Use a plain char rather than a QString for a single character.
3) For the replacement there is no such thing as an empty character literal in C++, so the string-ish "" (or an empty QByteArray) is what you pass.
4) I would pass classes like that by reference rather than by value, even though QByteArray is CoW (implicitly shared).
5) I would not even use an argument for the prefix here, since it is always the same, so it does not really fit the definition of a variable.
6) There is no need to create an intermediate variable explicitly.
7) Make the function inline.
Therefore, you would be writing something like this:
inline QByteArray restoreData(QByteArray data)
{
    return QByteArray::fromHex(data.replace('x', ""));
}
Your code has a performance problem because of replace(). Replace itself is not very fast, and creating an intermediate QByteArray object slows the code down even more. If you are really concerned about performance, you can copy the QByteArray::fromHex implementation from the Qt sources and modify it for your needs. Luckily, its implementation is quite self-contained. I only changed / 2 to / 3 and added a --i line to skip the "x" characters.
QByteArray myFromHex(const QByteArray &hexEncoded)
{
    QByteArray res((hexEncoded.size() + 1) / 3, Qt::Uninitialized);
    uchar *result = (uchar *)res.data() + res.size();
    bool odd_digit = true;
    for (int i = hexEncoded.size() - 1; i >= 0; --i) {
        int ch = hexEncoded.at(i);
        int tmp;
        if (ch >= '0' && ch <= '9')
            tmp = ch - '0';
        else if (ch >= 'a' && ch <= 'f')
            tmp = ch - 'a' + 10;
        else if (ch >= 'A' && ch <= 'F')
            tmp = ch - 'A' + 10;
        else
            continue;
        if (odd_digit) {
            --result;
            *result = tmp;
            odd_digit = false;
        } else {
            *result |= tmp << 4;
            odd_digit = true;
            --i; // skip the "x" prefix character
        }
    }
    res.remove(0, result - (const uchar *)res.constData());
    return res;
}
Test:
qDebug() << QByteArray::fromHex("54455354"); // => "TEST"
qDebug() << myFromHex("x54x45x53x54"); // => "TEST"
This code can behave unexpectedly when hexEncoded is malformed (e.g. "x54x45x5" will be converted to "TU"). You can fix this somehow if it's a problem.
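One possible fix, as a sketch rather than anything from the original answer (looksLikeHexWithPrefix is a made-up helper name): verify that the input really is a sequence of x followed by two hex digits before decoding, and fall back to an empty result otherwise.

#include <QByteArray>

// Sketch: accept only the strict "xHHxHH..." format with no separators.
bool looksLikeHexWithPrefix(const QByteArray &encoded)
{
    if (encoded.size() % 3 != 0)
        return false;
    for (int i = 0; i < encoded.size(); i += 3) {
        if (encoded.at(i) != 'x')
            return false;
        for (int j = 1; j <= 2; ++j) {
            const char ch = encoded.at(i + j);
            const bool hex = (ch >= '0' && ch <= '9') ||
                             (ch >= 'a' && ch <= 'f') ||
                             (ch >= 'A' && ch <= 'F');
            if (!hex)
                return false;
        }
    }
    return true;
}

// Usage: QByteArray decoded = looksLikeHexWithPrefix(raw) ? myFromHex(raw) : QByteArray();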

Implicit conversion of 'BOOL' (aka 'signed char') to 'NSData *' is disallowed with ARC

While performing a migration of my project to Objective-C ARC, I got one error:
implicit conversion of 'BOOL' (aka 'signed char') to 'NSData *' is disallowed with ARC
The method Xcode is pointing to for this error returns NO or nil although its return type is NSData *:
- (NSData *)compressBytes:(Bytef *)bytes length:(NSUInteger)length error:(NSError **)err shouldFinish:(BOOL)shouldFinish
{
    if (length == 0) return nil;
    int status;
    if (status == myVariable) {
        break;
    } else if (status != y_OK) {
        if (err) {
            *err = [[self class] deflateErrorWithCode:status];
        }
        return NO;
    }
However, I am not quite sure how to fix that; any ideas would be appreciated.
Under ARC you are only allowed to return an object or nil. Period.
This is because ARC doesn't just require, it DEMANDS that you don't do anything fishy with pointers: object pointers either point to objects or are nil.
ARC is having fits because you are trying to stuff NO (a zero value) into a pointer. This violates the rules, and that is why you are getting an error.
We can't help you fix it because we don't know what the valid return values are supposed to be (why NO? Why not nil?), and since this appears to be a code fragment, it is hard to help you further. Sorry.
Just don't do that. NO is not in any sense a valid return value for that function. Your code was broken before ARC, and now it's still broken after.
Also, these lines:
int status;
if (status == myVariable) {
    break;
}
are, at best, equivalent to these:
if (myVariable == nil) {
    break;
}
except written in a really confusing way, and relying on the uninitialized local status happening to be zero. I'm pretty sure that's not what you wanted.
Basically, this method looks completely wrong.

JIT compilation and DEP

I was thinking of trying my hand at some JIT compilation (just for the sake of learning), and it would be nice to have it work cross-platform since I run all three major operating systems at home (Windows, OS X, Linux).
With that in mind, I want to know if there is any way to avoid using the Windows virtual-memory functions to allocate memory with execute permission. It would be nice to just use malloc or new and point the processor at such a block.
Any tips?
DEP simply removes execute permission from every non-code page of memory. An application's code is loaded into memory that does have execute permission, and there are plenty of JITs that work on Windows/Linux/Mac OS X even with DEP active. That is because there is a way to dynamically allocate memory with the needed permissions set.
Plain malloc usually can't be used directly, because permissions are granted per page. Aligning malloc'ed memory to page boundaries is still possible at the price of some overhead (a minimal sketch of that route follows); otherwise you need some custom memory management, just for the executable code. Custom management is the common way of doing JIT.
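To illustrate the "aligned allocation at the price of some overhead" route, here is a minimal POSIX sketch (not from the original answer). Note the caveats: POSIX only guarantees mprotect on mmap'ed memory, although Linux accepts any page-aligned range, and some platforms such as recent macOS refuse writable-and-executable pages outright.

#include <cstdlib>
#include <sys/mman.h>
#include <unistd.h>

// Sketch: page-aligned malloc-style allocation made executable after the fact.
void *alloc_executable(std::size_t size) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    size = (size + page - 1) & ~(page - 1);        // round up to whole pages

    void *mem = nullptr;
    if (posix_memalign(&mem, page, size) != 0)     // page-aligned allocation
        return nullptr;

    // Pay the per-page cost that plain malloc() cannot: add execute permission.
    if (mprotect(mem, size, PROT_READ | PROT_WRITE | PROT_EXEC) != 0) {
        std::free(mem);
        return nullptr;
    }
    return mem;                                    // release later with free()
}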
There is a solution in the Chromium project, which uses a JIT for the JavaScript V8 VM and which is cross-platform. To be cross-platform, the needed function is implemented in several platform-specific files that are selected at compile time.
Linux (chromium src/v8/src/platform-linux.cc): the flag is PROT_EXEC of mmap().
void* OS::Allocate(const size_t requested,
                   size_t* allocated,
                   bool is_executable) {
  const size_t msize = RoundUp(requested, AllocateAlignment());
  int prot = PROT_READ | PROT_WRITE | (is_executable ? PROT_EXEC : 0);
  void* addr = OS::GetRandomMmapAddr();
  void* mbase = mmap(addr, msize, prot, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (mbase == MAP_FAILED) {
    /** handle error */
    return NULL;
  }
  *allocated = msize;
  UpdateAllocatedSpaceLimits(mbase, msize);
  return mbase;
}
Win32 (src/v8/src/platform-win32.cc): the flag is PAGE_EXECUTE_READWRITE of VirtualAlloc.
void* OS::Allocate(const size_t requested,
                   size_t* allocated,
                   bool is_executable) {
  // The address range used to randomize RWX allocations in OS::Allocate
  // Try not to map pages into the default range that windows loads DLLs
  // Use a multiple of 64k to prevent committing unused memory.
  // Note: This does not guarantee RWX regions will be within the
  // range kAllocationRandomAddressMin to kAllocationRandomAddressMax
#ifdef V8_HOST_ARCH_64_BIT
  static const intptr_t kAllocationRandomAddressMin = 0x0000000080000000;
  static const intptr_t kAllocationRandomAddressMax = 0x000003FFFFFF0000;
#else
  static const intptr_t kAllocationRandomAddressMin = 0x04000000;
  static const intptr_t kAllocationRandomAddressMax = 0x3FFF0000;
#endif
  // VirtualAlloc rounds allocated size to page size automatically.
  size_t msize = RoundUp(requested, static_cast<int>(GetPageSize()));
  intptr_t address = 0;
  // Windows XP SP2 allows Data Excution Prevention (DEP).
  int prot = is_executable ? PAGE_EXECUTE_READWRITE : PAGE_READWRITE;
  // For exectutable pages try and randomize the allocation address
  if (prot == PAGE_EXECUTE_READWRITE &&
      msize >= static_cast<size_t>(Page::kPageSize)) {
    address = (V8::RandomPrivate(Isolate::Current()) << kPageSizeBits)
        | kAllocationRandomAddressMin;
    address &= kAllocationRandomAddressMax;
  }
  LPVOID mbase = VirtualAlloc(reinterpret_cast<void *>(address),
                              msize,
                              MEM_COMMIT | MEM_RESERVE,
                              prot);
  if (mbase == NULL && address != 0)
    mbase = VirtualAlloc(NULL, msize, MEM_COMMIT | MEM_RESERVE, prot);
  if (mbase == NULL) {
    LOG(ISOLATE, StringEvent("OS::Allocate", "VirtualAlloc failed"));
    return NULL;
  }
  ASSERT(IsAligned(reinterpret_cast<size_t>(mbase), OS::AllocateAlignment()));
  *allocated = msize;
  UpdateAllocatedSpaceLimits(mbase, static_cast<int>(msize));
  return mbase;
}
Mac OS (src/v8/src/platform-macos.cc): the flag is PROT_EXEC of mmap, just as on Linux and other POSIX systems.
void* OS::Allocate(const size_t requested,
                   size_t* allocated,
                   bool is_executable) {
  const size_t msize = RoundUp(requested, getpagesize());
  int prot = PROT_READ | PROT_WRITE | (is_executable ? PROT_EXEC : 0);
  void* mbase = mmap(OS::GetRandomMmapAddr(),
                     msize,
                     prot,
                     MAP_PRIVATE | MAP_ANON,
                     kMmapFd,
                     kMmapFdOffset);
  if (mbase == MAP_FAILED) {
    LOG(Isolate::Current(), StringEvent("OS::Allocate", "mmap failed"));
    return NULL;
  }
  *allocated = msize;
  UpdateAllocatedSpaceLimits(mbase, msize);
  return mbase;
}
I also want to note that the bcdedit.exe-style approach mentioned below should only be needed for very old programs that create new executable code in memory without setting the execute permission on those pages. Newer programs, like Firefox or Chrome/Chromium, or any modern JIT, should keep DEP active and manage memory permissions themselves in a fine-grained manner, along the lines of the sketch below.
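In practice, "fine-grained" usually means never leaving a page writable and executable at the same time: map it writable, emit the machine code, then flip it to read+execute before jumping into it. A simplified sketch for x86-64 Linux follows (the byte string encodes mov eax, 42 followed by ret; none of this is Chromium code):

#include <cstddef>
#include <cstring>
#include <sys/mman.h>

// Simplified W^X workflow: write, then protect, then run.
typedef int (*FortyTwoFn)();

int emit_and_run() {
    const std::size_t size = 4096;
    void *buf = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return -1;

    // mov eax, 42 ; ret
    const unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };
    std::memcpy(buf, code, sizeof(code));

    // Drop write permission before executing: DEP stays happy.
    if (mprotect(buf, size, PROT_READ | PROT_EXEC) != 0) {
        munmap(buf, size);
        return -1;
    }
    int result = reinterpret_cast<FortyTwoFn>(buf)();  // returns 42
    munmap(buf, size);
    return result;
}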
One possibility is to make it a requirement that Windows installations running your program be either configured for DEP AlwaysOff (bad idea) or DEP OptOut (better idea).
This can be configured (under WinXp SP2+ and Win2k3 SP1+ at least) by changing the boot.ini file to have the setting:
/noexecute=OptOut
and then configuring your individual program to opt out by choosing (under XP):
Start button
Control Panel
System
Advanced tab
Performance Settings button
Data Execution Prevention tab
This should allow you to execute code from within your program that's created on the fly in malloc() blocks.
Keep in mind that this makes your program more susceptible to attacks that DEP was meant to prevent.
It looks like this is also possible in Windows 2008 with the command:
bcdedit.exe /set {current} nx OptOut
But, to be honest, if you just want to minimise platform-dependent code, that's easy to do just by isolating the code into a single function, something like:
void *MallocWithoutDep(size_t sz) {
#if defined(_WIN32)
  // Windows: ask for a read/write/execute region up front.
  return VirtualAlloc(NULL, sz, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
#else
  // Linux and macOS: an anonymous executable mapping does the same job.
  void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE | PROT_EXEC,
                 MAP_PRIVATE | MAP_ANON, -1, 0);
  return p == MAP_FAILED ? NULL : p;
#endif
}
If you put all your platform-dependent functions in their own files, the rest of your code is automatically platform-agnostic.
