Intel C Compiler uses unaligned SIMD moves with aligned memory - intel

I am using a Haswell Core i7-4790K.
When I compile the following toy example with icc -O3 -std=c99 -march=core-avx2 -g:
#include <stdio.h>
#include <stdint.h>
#include <immintrin.h>

typedef struct {
    __m256i a;
    __m256i b;
    __m256i c;
} mystruct_t;

#define SIZE     1000
#define TEST_VAL 42

int _do(mystruct_t* array) {
    int value = 0;
    for (size_t i = 0; i < SIZE; ++i) {
        array[i].a = _mm256_set1_epi8(TEST_VAL + i*3    );
        array[i].b = _mm256_set1_epi8(TEST_VAL + i*3 + 1);
        array[i].c = _mm256_set1_epi8(TEST_VAL + i*3 + 2);
        value += _mm_popcnt_u32(_mm256_movemask_epi8(array[i].a)) +
                 _mm_popcnt_u32(_mm256_movemask_epi8(array[i].b)) +
                 _mm_popcnt_u32(_mm256_movemask_epi8(array[i].c));
    }
    return value;
}

int main() {
    mystruct_t* array = (mystruct_t*)_mm_malloc(SIZE * sizeof(*array), 32);
    printf("%d\n", _do(array));
    _mm_free(array);
}
The following ASM code is produced for the _do() function:
0x0000000000400bc0 <+0>: xor %eax,%eax
0x0000000000400bc2 <+2>: xor %ecx,%ecx
0x0000000000400bc4 <+4>: xor %edx,%edx
0x0000000000400bc6 <+6>: nopl (%rax)
0x0000000000400bc9 <+9>: nopl 0x0(%rax)
0x0000000000400bd0 <+16>: lea 0x2b(%rdx),%r8d
0x0000000000400bd4 <+20>: inc %ecx
0x0000000000400bd6 <+22>: lea 0x2a(%rdx),%esi
0x0000000000400bd9 <+25>: lea 0x2c(%rdx),%r9d
0x0000000000400bdd <+29>: add $0x3,%edx
0x0000000000400be0 <+32>: vmovd %r8d,%xmm1
0x0000000000400be5 <+37>: vpbroadcastb %xmm1,%ymm4
0x0000000000400bea <+42>: vmovd %esi,%xmm0
0x0000000000400bee <+46>: vpmovmskb %ymm4,%r11d
0x0000000000400bf2 <+50>: vmovd %r9d,%xmm2
0x0000000000400bf7 <+55>: vmovdqu %ymm4,0x20(%rdi)
0x0000000000400bfc <+60>: vpbroadcastb %xmm0,%ymm3
0x0000000000400c01 <+65>: vpbroadcastb %xmm2,%ymm5
0x0000000000400c06 <+70>: vpmovmskb %ymm3,%r10d
0x0000000000400c0a <+74>: vmovdqu %ymm3,(%rdi)
0x0000000000400c0e <+78>: vmovdqu %ymm5,0x40(%rdi)
0x0000000000400c13 <+83>: popcnt %r11d,%esi
0x0000000000400c18 <+88>: add $0x60,%rdi
0x0000000000400c1c <+92>: vpmovmskb %ymm5,%r11d
0x0000000000400c20 <+96>: popcnt %r10d,%r9d
0x0000000000400c25 <+101>: popcnt %r11d,%r8d
0x0000000000400c2a <+106>: add %esi,%r9d
0x0000000000400c2d <+109>: add %r8d,%r9d
0x0000000000400c30 <+112>: add %r9d,%eax
0x0000000000400c33 <+115>: cmp $0x3e8,%ecx
0x0000000000400c39 <+121>: jb 0x400bd0 <_do+16>
0x0000000000400c3b <+123>: vzeroupper
0x0000000000400c3e <+126>: retq
0x0000000000400c3f <+127>: nop
If I compile the same code using gcc-5 -O3 -std=c99 -mavx2 -march=native -g, the following ASM code is produced for the _do() function:
0x0000000000400650 <+0>: lea 0x17700(%rdi),%r9
0x0000000000400657 <+7>: mov $0x2a,%r8d
0x000000000040065d <+13>: xor %eax,%eax
0x000000000040065f <+15>: nop
0x0000000000400660 <+16>: lea 0x1(%r8),%edx
0x0000000000400664 <+20>: vmovd %r8d,%xmm2
0x0000000000400669 <+25>: xor %esi,%esi
0x000000000040066b <+27>: vpbroadcastb %xmm2,%ymm2
0x0000000000400670 <+32>: vmovd %edx,%xmm1
0x0000000000400674 <+36>: add $0x60,%rdi
0x0000000000400678 <+40>: lea 0x2(%r8),%edx
0x000000000040067c <+44>: vpbroadcastb %xmm1,%ymm1
0x0000000000400681 <+49>: vmovdqa %ymm2,-0x60(%rdi)
0x0000000000400686 <+54>: add $0x3,%r8d
0x000000000040068a <+58>: vmovd %edx,%xmm0
0x000000000040068e <+62>: vpmovmskb %ymm2,%edx
0x0000000000400692 <+66>: vmovdqa %ymm1,-0x40(%rdi)
0x0000000000400697 <+71>: vpbroadcastb %xmm0,%ymm0
0x000000000040069c <+76>: popcnt %edx,%esi
0x00000000004006a0 <+80>: vpmovmskb %ymm1,%edx
0x00000000004006a4 <+84>: popcnt %edx,%edx
0x00000000004006a8 <+88>: vpmovmskb %ymm0,%ecx
0x00000000004006ac <+92>: add %esi,%edx
0x00000000004006ae <+94>: vmovdqa %ymm0,-0x20(%rdi)
0x00000000004006b3 <+99>: popcnt %ecx,%ecx
0x00000000004006b7 <+103>: add %ecx,%edx
0x00000000004006b9 <+105>: add %edx,%eax
0x00000000004006bb <+107>: cmp %rdi,%r9
0x00000000004006be <+110>: jne 0x400660 <_do+16>
0x00000000004006c0 <+112>: vzeroupper
0x00000000004006c3 <+115>: retq
My questions are:
1) Why does icc use unaligned moves (vmovdqu), unlike gcc?
2) Is there a penalty when vmovdqu is used instead of vmovdqa on aligned memory?
P.S.: The behavior is the same when using SSE instructions/registers.
Thanks

There is no penalty to using VMOVDQU when the address is aligned. The behavior is identical to using VMOVDQA in that case.
As for "why" there may not be a single clear answer. It's possible that ICC does this deliberately so that users who later call _do with an unaligned argument will not crash, but it's also possible that it's simply emergent behavior of the compiler. Someone on the Intel compiler team could answer this question, the rest of us can only speculate.

There are three factors at play that solve the bigger problem:
a) faulting behavior may be useful for debugging, but it is not as good for production code - especially when a mix of 3rd-party libraries is involved; very few people would take a crash at a customer site over slightly slower performance of their software product
b) Intel microarchitectures solved the performance problem of the "unaligned" instruction forms on aligned data starting with Nehalem: they have the same performance as the "aligned" forms. AMD did it even before that, I think.
c) AVX+ improved the architectural behavior of the Load+OP forms over SSE to be non-faulting, so
VADDPS ymm0, ymm0, ymmword ptr [rax]; // no longer faults when rax is misaligned
Since for AVX+ we want the compiler to still have the liberty to use either a standalone load or a Load+OP instruction form when generating code from intrinsics, for code such as this:
_mm256_add_ps( a, *(__m256*)data_ptr );
with AVX+ a compiler can use the vMOVUs (VMOVUPS/VMOVUPD/VMOVDQU) for all loads and maintain uniform behavior with the Load+OP forms.
This is needed for when the source code changes slightly, or when code generation of the same code changes (e.g. between different compilers/versions, or due to inlining) and switches from a Load+OP instruction to standalone Load and OP instructions: the behavior of the load stays the same as with Load+OP, i.e. non-faulting.
So AVX, together with the above compiler practice and the use of "unaligned" store instruction forms, allows uniform non-faulting behavior for SIMD code overall, without a performance loss on aligned data.
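As a minimal illustration of that point (a sketch of my own; the function name add_from_memory is not from the question):

#include <immintrin.h>

/* The compiler may fold this load into the add (vaddps ymm, ymm, ymmword ptr [mem])
   or emit a standalone load followed by a register-register vaddps.  Under the
   practice described above, the standalone load would be vmovups, so both code
   generations are non-faulting on a misaligned pointer, and on aligned data
   neither costs anything extra. */
__m256 add_from_memory(__m256 a, const float *data_ptr)
{
    return _mm256_add_ps(a, *(const __m256 *)data_ptr);
}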
Of course, there are still (relatively rare) usage-targeted instructions for non-temporal stores (vMOVNTDQ/vMOVNTPS/vMOVNTPD) and loads from WC memory types (vMOVNTDQA) that maintain faulting behavior for misaligned addresses.
-Max Locktyukhin, Intel

Related

Return a pointer at a specific position - Assembly

I am a beginner in Assembly and I have a simple question.
This is my code:
BITS 64           ; 64-bit mode
global strchr     ; Export 'strchr'
SECTION .text     ; Code section

strchr:
    mov rcx, -1
.loop:
    inc rcx
    cmp byte [rdi+rcx], 0
    je exit_null
    cmp byte [rdi+rcx], sil
    jne .loop
    mov rax, [rdi+rcx]
    ret
exit_null:
    mov rax, 0
    ret
This compiles but doesn't work. I want to reproduce the function strchr, as you can see. When I test my function with a printf it crashes (the problem isn't the test).
I know I can inc rdi directly to move through the rdi argument and return it at the position I want.
But I just want to know if there is a way to return rdi offset by the position in rcx, to fix my code and probably improve it.
Your function strchr seems to expect two parameters:
pointer to a string in RDI, and
pointer to a character in RSI.
Register rcx is used as an index inside the string? In that case you should use al instead of cl. Be aware that you don't limit the search size. When the character referred to by RSI is not found in the string, it will probably trigger an exception. Perhaps you should test al loaded from [rdi+rcx] and quit further searching when al = 0.
If you want it to return a pointer to the first occurrence of the character inside the string, just replace mov rax,[rdi+rcx] with lea rax,[rdi+rcx].
Your code (from edit Version 2) does the following:
char* strchr ( char *p, char x ) {
    int i = -1;
    do {
        if ( p[i] == '\0' ) return null;
        i++;
    } while ( p[i] != x );
    return * (long long*) &(p[i]);
}
As @vitsoft says, your intention is to return a pointer, but the first return (in assembly) is returning a single quad word loaded from the address of the found character - 8 characters instead of an address.
It is unusual to increment in the middle of the loop.  It is also odd to start the index at -1.  On the first iteration, the loop continue condition looks at p[-1], which is not a good idea, since that's not part of the string you're being asked to search.  If that byte happens to be the nul character, it'll stop the search right there.
If you waited to increment until both tests are performed, then you would not be referencing p[-1], and you could also start the index at 0, which would be more usual.
You might consider capturing the character into a register instead of using a complex addressing mode three times.
Further, you could advance the pointer in rdi and forgo the index variable altogether.
Here's that in C:
char* strchr ( char *p, char x ) {
    for(;;) {
        char c = *p;
        if ( c == '\0' )
            break;
        if ( c == x )
            return p;
        p++;
    }
    return null;
}
Thanks to your help, I finally did it!
Thanks to the answer of Erik, I fixed a stupid mistake: I was comparing str[-1] to NULL, so it was causing an error.
And with the answer of vitsoft I switched mov to lea and it worked!
Here is my code:
strchr:
    mov rcx, -1
.loop:
    inc rcx
    cmp byte [rdi+rcx], 0
    je exit_null
    cmp byte [rdi+rcx], sil
    jne .loop
    lea rax, [rdi+rcx]
    ret
exit_null:
    mov rax, 0
    ret
The only bug remaining in the current version is loading 8 bytes of char data as the return value instead of just doing pointer math, using mov instead of lea. (After various edits removed and added different bugs, as reflected in different answers talking about different code).
But this is over-complicated as well as inefficient (two loads, and indexed addressing modes, and of course extra instructions to set up RCX).
Just increment the pointer since that's what you want to return anyway.
If you're going to loop 1 byte at a time instead of using SSE2 to check 16 bytes at once, strchr can be as simple as:
;; BITS 64 is useless unless you're writing a kernel with a mix of 32 and 64-bit code;
;; otherwise it only lets you shoot yourself in the foot by putting 64-bit machine code
;; in a 32-bit object file by accident.
global mystrchr
mystrchr:
.loop:                        ; do {
    movzx ecx, byte [rdi]     ;     c = *p;
    cmp   cl, sil             ;     if (c == needle) return p;
    je    .found
    inc   rdi                 ;     p++
    test  cl, cl
    jnz   .loop               ; } while (c != 0)
;; fell out of the loop on hitting the 0 terminator without finding a match
    xor   edi, edi            ; p = NULL
    ; optionally an extra ret here, or just fall through
.found:
    mov   rax, rdi            ; return p
    ret
I checked for a match before end-of-string so I'd still have the un-incremented pointer, and not have to decrement it in the "found" return path. If I started the loop with inc, I could use an [rdi - 1] addressing mode, still avoiding a separate counter. That's why I switched up the order of which branch was at the bottom of the loop vs. your code in the question.
Since we want to compare the character twice, against SIL and against zero, I loaded it into a register. This might not run any faster on modern x86-64 which can run 2 loads per clock as well as 2 branches (as long as at most one of them is taken).
Some Intel CPUs can micro-fuse and macro-fuse cmp reg,mem / jcc into a single load+compare-and-branch uop for the front-end, at least when the memory addressing mode is simple, not indexed. But not cmp [mem], imm/jcc, so we're not costing any extra uops for the front-end on Intel CPUs by separately loading into a register. (With movzx to avoid a false dependency from writing a partial register like mov cl, [rdi])
Note that if your caller is also written in assembly, it's easy to return multiple values, e.g. a status and a pointer (in the not-found case, perhaps to the terminating 0 would be useful). Many C standard library string functions are badly designed, notably strcpy, to not help the caller avoid redoing length-finding work.
Especially on modern CPUs with SIMD, explicit lengths are quite useful to have: a real-world strchr implementation would check alignment, or check that the given pointer isn't within 16 bytes of the end of a page. But memchr doesn't have to, if the size is >= 16: it could just do a movdqu load and pcmpeqb.
See Is it safe to read past the end of a buffer within the same page on x86 and x64? for details and a link to glibc strlen's hand-written asm. Also Find the first instance of a character using simd for real-world implementations like glibc's using pcmpeqb / pmovmskb. (And maybe pminub for the 0-terminator check to unroll over multiple vectors.)
SSE2 can go about 16x faster than the code in this answer for non-tiny strings. For very large strings, you might hit a memory bottleneck and "only" be about 8x faster.
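To make the pcmpeqb / pmovmskb idea above concrete, here is a rough sketch of a memchr-style search with SSE2 intrinsics (my own code, not glibc's). It assumes the length is known and the buffer readable, so whole 16-byte loads are safe and no page-boundary handling is needed; __builtin_ctz assumes GCC or Clang.

#include <emmintrin.h>
#include <stddef.h>

const char *memchr_sse2(const char *p, char x, size_t len)
{
    const __m128i needle = _mm_set1_epi8(x);
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(p + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask)                                  /* some byte matched */
            return p + i + __builtin_ctz(mask);    /* position of first match */
    }
    for (; i < len; ++i)                           /* scalar tail, < 16 bytes */
        if (p[i] == x)
            return p + i;
    return NULL;
}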

Optimizing mask function with ARM SIMD instructions

I was wondering if you could help me use NEON intrinsics to optimize this mask function. I already tried auto-vectorization using the -O3 gcc compiler flag, but the performance of the function was lower than with -O2, which turns off the auto-vectorization. For some reason the assembly code produced with -O3 is 1.5 times longer than the one with -O2.
void mask(unsigned int x, unsigned int y, uint32_t *s, uint32_t *m)
{
    unsigned int ixy;
    ixy = x * y;
    while (ixy--)
        *(s++) &= *(m++);
}
Probably I have to use the following intrinsics:
vld1q_u32 // to load 4 integers from s and m
vandq_u32 // to execute logical and between the 4 integers from s and m
vst1q_u32 // to store them back into s
However, I don't know how to do it in the most optimal way. For instance, should I increase s and m by 4 after loading, ANDing and storing? I am quite new to NEON, so I would really need some help.
I am using gcc 4.8.1 and I am compiling with the following cmd:
arm-linux-gnueabihf-gcc -mthumb -march=armv7-a -mtune=cortex-a9 -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=neon -O3 -fprefetch-loop-arrays name.c -o name
Thanks in advance
I would probably do it like this. I've included 4x loop unrolling. Preloading the cache is always a good idea and can speed things up another 25%. Since there's not much processing going on (it's mostly spending time loading and storing), it's best to load lots of registers, then process them as it gives time for the data to actually load. It assumes the data is an even multiple of 16 elements.
void fmask(unsigned int x, unsigned int y, uint32_t *s, uint32_t *m)
{
    unsigned int ixy;
    uint32x4_t srcA, srcB, srcC, srcD;
    uint32x4_t maskA, maskB, maskC, maskD;

    ixy = x * y;
    ixy /= 16; // process 16 at a time
    while (ixy--)
    {
        __builtin_prefetch(&s[64]); // preload the cache
        __builtin_prefetch(&m[64]);
        srcA  = vld1q_u32(&s[0]);
        maskA = vld1q_u32(&m[0]);
        srcB  = vld1q_u32(&s[4]);
        maskB = vld1q_u32(&m[4]);
        srcC  = vld1q_u32(&s[8]);
        maskC = vld1q_u32(&m[8]);
        srcD  = vld1q_u32(&s[12]);
        maskD = vld1q_u32(&m[12]);
        srcA  = vandq_u32(srcA, maskA);
        srcB  = vandq_u32(srcB, maskB);
        srcC  = vandq_u32(srcC, maskC);
        srcD  = vandq_u32(srcD, maskD);
        vst1q_u32(&s[0],  srcA);
        vst1q_u32(&s[4],  srcB);
        vst1q_u32(&s[8],  srcC);
        vst1q_u32(&s[12], srcD);
        s += 16;
        m += 16;
    }
}
I would start with the simplest version and take it as a reference to compare against future routines.
A good rule of thumb is to calculate needed things as soon as possible, not exactly when needed.
This means that an instruction can take X cycles to execute, but its result is not always immediately ready, so scheduling is important.
As an example, a simple scheduling schema for your case would be (pseudocode):
nn = n/4            // Assuming n is a multiple of 4
LOADI_S(0)          // Load and immediately after increment pointer
LOADI_M(0)          // Load and immediately after increment pointer
for( k=1; k<nn; k++ ){
    AND_SM(k-1)     // Inner op
    LOADI_S(k)      // Load and increment after
    LOADI_M(k)      // Load and increment after
    STORE_S(k-1)    // Store and increment after
}
AND_SM(nn-1)
STORE_S(nn-1)       // Store. No need to increment
By hoisting these instructions out of the inner loop, we ensure that the operations inside it don't depend on the result of the immediately preceding operation.
This schema can be further extended to take advantage of the time that would otherwise be lost waiting for the result of the previous operation.
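One way this schedule might look with NEON intrinsics (a sketch under the assumption that the element count n is a non-zero multiple of 4; the function name is mine):

#include <arm_neon.h>
#include <stdint.h>

void mask_pipelined(uint32_t *s, const uint32_t *m, unsigned int n)
{
    unsigned int nn = n / 4;                 /* number of 4-element blocks */
    uint32x4_t vs = vld1q_u32(s);            /* preload block 0 */
    uint32x4_t vm = vld1q_u32(m);

    for (unsigned int k = 1; k < nn; k++) {
        uint32x4_t r = vandq_u32(vs, vm);    /* work on block k-1 */
        vs = vld1q_u32(s + 4 * k);           /* start loading block k */
        vm = vld1q_u32(m + 4 * k);
        vst1q_u32(s + 4 * (k - 1), r);       /* store block k-1 */
    }
    vst1q_u32(s + 4 * (nn - 1), vandq_u32(vs, vm));  /* last block */
}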
Also, as intrinsics still depend on the optimizer, see what the compiler does under different optimization options. I prefer to use inline assembly, which is not difficult for small routines and gives you more control.

How should I declare a vector variable in OpenCL that can fully utilize GPU's vectorized feature

I'm using AMD-APP (1214.3). My code in OpenCL is as follows,
// W is an uint4 variable
uint4 T = (uint4)(1U, 2U, 3U, 4U);
T += W;
or I had also tried using constant data as follows,
// outside function scope
__constant uint4 X = (uint4)(1U, 2U, 3U, 4U);
// inside function
uint4 T = X;
T += W;
However, after compilation I saw that the assembly code contains multiple addition instructions to build the uint4 vector:
dcl_literal l16, 0x00000001, 0x00000001, 0x00000001, 0x00000001
dcl_literal l19, 0x00000002, 0x00000002, 0x00000002, 0x00000002
dcl_literal l18, 0x00000003, 0x00000003, 0x00000003, 0x00000003
dcl_literal l17, 0x00000004, 0x00000004, 0x00000004, 0x00000004
mov r66, l16
iadd r66, r66.xyz0, l17.000x
iadd r66, r66.xy0w, l18.00x0
iadd r66, r66.x0zw, l19.0x00
iadd r75, r75, r66
So, how could I write the vector initialization in OpenCL to get fewer instructions? For example, a single load and then an iadd, like the following:
dcl_literal l16, 0x00000001, 0x00000002, 0x00000003, 0x00000004
move r66, l16
iadd r75, r75, r66
Thanks for your help.
What you see in
dcl_literal l16, 0x00000001, 0x00000001, 0x00000001, 0x00000001
...
seems to be LLVM assembler. It's the output of the compiler front end, yet to be processed by the back end and translated into machine code. As it's not the final version, in my opinion there is no way to judge how optimal this code is.
As a guess, such a representation may be kept for better backward compatibility with legacy architectures, as it looks like VLIW instruction code.
Returning to OpenCL performance: a single memory (I/O) operation takes so long that any effort put into small instruction-level optimizations is a waste of time. That's why GPGPU performance is usually bandwidth-bound.

cannot jump into arduino boot loader

I want to jump from my application to the bootloader (I load code via Bluetooth and have an application command to jump to the bootloader).
The following works:
void* bl = (void *) 0x3c00;
goto *bl;
or
asm volatile { jmp BOOTL ::}
asm volatile { .org 0x3c00
BOOTL: }
(but code size grows to 0x3c00)
BUT, the most obvious option
asm volatile { jmp 0x3c00 ::}
does not (it seems it does not even produce code).
Any idea why?
The question as stated is not clear as to what is working and what is failing, or about your environment, which is important. That said, I guess you are saying that the void-pointer and/or "jmp BOOTL" versions work as desired, but make the code appear to be huge.
I tried it on Arduino IDE 1.0.5 and only saw less than half a kilobyte of code - not 16K or huge.
void* bl = (void *) 0x3c00;

void setup()
{
    // put your setup code here, to run once:
}

void loop()
{
    goto *bl;
    // put your main code here, to run repeatedly:
}
with a compile output of...
Binary sketch size: 474 bytes (of a 32,256 byte maximum)
Estimated used SRAM memory: 11 bytes (of a 2048 byte maximum)
I suspect your observation is that the linker is seeing the pointer out at 0x3C00, the location of the BOOTSECTOR (noting it is at the end of code space), so it only looks like it is huge. I suspect there is a lot of empty space in between. You may want to use "avr-objdump.exe -d output.elf" to see what it is actually doing, versus what you expect.
0x3C00 is a 16-bit word address.
Use 0x7800 in GCC if you are using goto. GCC uses byte addresses (0x3C00 * 2 = 0x7800).
Example:
void *bl = (void *) 0x7800;
goto *bl;
will create the following assembly language (see *.lss output file):
c4: 0c 94 00 3c jmp 0x7800 ; 0x7800 <__stack+0x6d01>
#define GO_TO_ADRR_FLASH_MEMORY_BOOT_LOADER asm volatile ("JMP 0x7800")
GO_TO_ADRR_FLASH_MEMORY_BOOT_LOADER;

klee with loops strange behaviour with similar code

I have a question about how KLEE (a symbolic execution tool) works in the case of loops with symbolic parameters:
int loop(int data) {
    int i, result = 0;
    for (i = 0; i < data; i++) {
        result += 1;
        //printf("result%d\n", result); //-- With this line klee gives different values for data
    }
    return result;
}

void main() {
    int n;
    klee_make_symbolic(&n, sizeof(int), "n");
    int result = loop(n);
}
If we execute KLEE on this code, it gives only one test case.
However, if we uncomment the printf(...), KLEE needs some kind of bound to stop the execution, because it keeps producing values for n:
--max-depth=200
I would like to understand why KLEE has this different behavior; it doesn't make sense to me. Why, if I don't have the printf in this code, does it not produce the same values?
I discovered that this happens when the --optimize option is used; without it, the behavior is the same in both cases. Does anyone know how KLEE's --optimize works?
Another question about the same program: in the paper they published, as I understand it, they say that their search heuristics avoid starvation, so the search should not run forever. But when I run it, it doesn't stop. Is it true that KLEE's execution should finish in the case of this loop?
Thanks in advance
What I want to know is why there is this different behavior with the --optimize option. Thanks.
In C/C++ there is the concept of "undefined behaviour" (see: "A Guide to Undefined Behavior in C and C++", parts 1-3, and "What Every C Programmer Should Know About Undefined Behavior", parts 1-3).
Overflow of signed integers is defined as undefined behaviour in order to allow the compiler to optimize stuff like this:
bool f(int x){ return x+1>x ? true : false ; }
Let's think: x+1>x in normal algebra is always true; in modulo arithmetic it is almost always true (except in the one case of overflow), so the compiler is allowed to turn it into:
true
Such practices enable a huge number of great optimizations. (Btw, if you want defined behaviour on overflow, use unsigned integers - this is used extensively in implementations of cryptographic algorithms.)
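A tiny demonstration of that difference (my own example; the behaviour shown relies only on the standard-mandated wraparound of unsigned arithmetic):

#include <stdio.h>
#include <limits.h>

int main(void)
{
    unsigned int u = UINT_MAX;
    u += 1;                 /* well defined: wraps around to 0 */
    printf("%u\n", u);      /* prints 0 */

    /* With a signed int, INT_MAX + 1 is undefined behaviour, so the
       compiler may assume it never happens and optimise accordingly. */
    return 0;
}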
On the other hand, this sometimes leads to surprising results, like this code:
int main(){
    int s=1, i=0;
    while (s>0) {
        ++i;
        s=2*s;
    }
    return i;
}
being optimised into an infinite loop. It's not a bug, it's a powerful feature! (Again: for defined behaviour, use unsigned.)
Let's generate assembly code for the above example:
$ g++ -O1 -S -o overflow_loop-O1.s overflow_loop.cpp
$ g++ -O2 -S -o overflow_loop-O2.s overflow_loop.cpp
Note how the loop part is compiled differently:
overflow_loop-O1.s:
(...)
.L2:
addl $1, %eax
cmpl $31, %eax
jne .L2
(...)
overflow_loop-O2.s:
(...)
.L2:
jmp .L2
(...)
I would advise you to check the assembly of your code at different optimisation levels (gcc -S -O0 vs gcc -S -O1 ... -O3).
