Is there a performance penalty for one-based array indexing? - julia

Is there a performance penalty (however small) for Julia using one-based array indexing, since machine code usually supports zero-based indexing more directly?

I did some snooping around and here is what I found (I used Julia 0.6 for all experiments below):
> arr = zeros(5);
> @code_llvm arr[1]
define double @jlsys_getindex_51990(i8** dereferenceable(40), i64) #0 !dbg !5 {
top:
%2 = add i64 %1, -1
%3 = bitcast i8** %0 to double**
%4 = load double*, double** %3, align 8
%5 = getelementptr double, double* %4, i64 %2
%6 = load double, double* %5, align 8
ret double %6
}
In this snippet %1 holds the actual index. Note the %2 = add i64 %1, -1. Julia does indeed use 0-based arrays under the hood and subtracts 1 from the index. This results in one additional LLVM instruction being generated, so the LLVM code looks slightly less efficient. However, how this extra arithmetic operation trickles down to native code is another question.
On x86 and amd64
> @code_native arr[1]
.text
Filename: array.jl
Source line: 520
leaq -1(%rsi), %rax
cmpq 24(%rdi), %rax
jae L20
movq (%rdi), %rax
movsd -8(%rax,%rsi,8), %xmm0 # xmm0 = mem[0],zero
retq
L20:
pushq %rbp
movq %rsp, %rbp
movq %rsp, %rcx
leaq -16(%rcx), %rax
movq %rax, %rsp
movq %rsi, -16(%rcx)
movl $1, %edx
movq %rax, %rsi
callq 0xffffffffffcbf392
nopw %cs:(%rax,%rax)
The good news on these architectures is that their addressing modes take a constant displacement, so arbitrary-based indexing comes essentially for free. The movsd -8(%rax,%rsi,8), %xmm0 and leaq -1(%rsi), %rax are the two instructions affected by Julia's 1-based indexing. Look at the movsd instruction: this one single instruction does both the indexing and the subtraction. The -8 displacement is the subtraction. If 0-based indexing were used, the instruction would be movsd (%rax,%rsi,8), %xmm0.
The other affected instruction is leaq -1(%rsi), %rax. However, since the bounds check needs the adjusted index in a register of its own while %rsi keeps the original value, a copy has to be made either way; under 0-based indexing an equivalent instruction would probably still be generated, just as leaq (%rsi), %rax.
So on x86 and amd64 machines, 1-based indexing results in slightly more complicated versions of the same instructions, but no additional instructions are generated. The code most probably runs exactly as fast as 0-based indexing would. If any slowdown exists, it is down to the specific microarchitecture and might show up in one CPU model and not another. That difference is down to the silicon, and I wouldn't worry about it.
Unfortunately, I don't know enough about ARM and other architectures, but the situation is probably similar.
Interfacing with another language
When interfacing with another language like C or Python, one always has to remember to subtract or add 1 when passing indices around. The compiler cannot help you because the other code is out of its reach. So there is a performance hit of one extra arithmetic operation in this case. But unless this happens in a really tight loop, the difference is negligible.
The elephant in the room
Well, the elephant in the room is bounds checking. Returning to the assembly snippet above, most of the generated code is concerned with it: the first three instructions and everything under the L20 label. The actual indexing is just the movq and movsd instructions. So if you care about really fast code, you will pay far more for bounds checking than for 1-based indexing. Fortunately, Julia offers ways to alleviate this problem through the use of @inbounds and --check-bounds=no.

The most likely possibility is that Julia simply subtracts 1 from the indices you provide and uses zero-based arrays under the hood. The performance penalty would then be the cost of the subtraction (almost certainly immaterial).
It would be easy enough to write two small bits of code to test the performance of each.

Related

What does adding two registers in square brackets mean?

What does adding two registers together in square brackets mean?
I have a question about these lines of code:
"mov al, [ebx+edx];"
"mov [ecx+edx],al;"
I know that the mov instruction moves values from source to destination. But I don't really know what [ebx+edx] and [ecx+edx] do.
Is it simply adding two registers and then saving values in the memory?
This adds the values of the two registers and uses the sum as a memory address, either to retrieve the value at that address:
MOV EDX, [EBX+EAX]
or store a value to that location:
MOV [EBX+EDX], ECX

Associativity gives us parallelizability. But what does commutativity give?

Alexander Stepanov notes in one of his brilliant lectures at A9 (highly recommended, by the way) that the associative property gives us parallelizability – an extremely useful and important trait these days that the compilers, CPUs and programmers themselves can leverage:
// expressions in parentheses can be done in parallel
// because matrix multiplication is associative
Matrix X = (A * B) * (C * D);
But what, if anything, does the commutative property give us? Reordering? Out of order execution?
Here is a more abstract answer with less emphasis on instruction level parallelism and more on thread level parallelism.
A common objective in parallelism is to do a reduction of information. A simple example is the dot product of two arrays
for(int i=0; i<N; i++) sum += x[i]*y[i];
If the operation is associative, each thread can calculate a partial sum. The final sum is then the sum of the partial sums.
If the operation is also commutative, the partial sums can be combined in any order. Otherwise they have to be combined in thread order.
One problem is that multiple threads can't write to the final sum at the same time without creating a race condition, so while one thread writes to the final sum the others have to wait. Summing in any order can therefore be more efficient, because it's often hard to guarantee that the threads finish in order.
Let's choose an example. Let's say there are two threads and therefore two partial sums.
If the operation is commutative we could have this case
thread2 finishes its partial sum
sum += thread2's partial sum
thread2 finishes writing to sum
thread1 finishes its partial sum
sum += thread1's partial sum
However if the operation does not commute we would have to do
thread2 finishes its partial sum
thread2 waits for thread1 to write to sum
thread1 finishes its partial sum
sum += thread1's partial sum
thread2 waits for thread1 to finish writing to sum
thread1 finishes writing to sum
sum += thread2's partial sum
Here is an example of the dot product with OpenMP
#pragma omp parallel for reduction(+: sum)
for(int i=0; i<N; i++) sum += x[i]*y[i];
The reduction clause assumes the operation (+ in this case) is commutative. Most people take this for granted.
If the operation is not commutative we would have to do something like this
float sum = 0;
#pragma omp parallel
{
    float sum_partial = 0;
    #pragma omp for schedule(static) nowait
    for(int i=0; i<N; i++) sum_partial += x[i]*y[i];
    #pragma omp for schedule(static) ordered
    for(int i=0; i<omp_get_num_threads(); i++) {
        #pragma omp ordered
        sum += sum_partial;
    }
}
The nowait clause tells OpenMP not to wait for each partial sum to finish. The ordered clause tells OpenMP to only write to sum in order of increasing thread number.
This method does the final sum linearly. However, it could be done in log2(omp_get_num_threads()) steps.
For example if we had four threads we could do the reduction in three sequential steps
calculate four partial sums in parallel: s1, s2, s3, s4
calculate in parallel: s5 = s1 + s2 with thread1 and s6 = s3 + s4 with thread2
calculate sum = s5 + s6 with thread1
That's one advantage of using the reduction clause: since it's a black box, it may do the reduction in log2(omp_get_num_threads()) steps. OpenMP 4.0 allows defining custom reductions, but it still assumes the operations are commutative, so it's not good for e.g. chained matrix multiplication. I'm not aware of an easy way with OpenMP to do the reduction in log2(omp_get_num_threads()) steps when the operations don't commute.
Some architectures, x86 being a prime example, have instructions where one of the sources is also the destination. If you still need the original value of the destination after the operation, you need an extra instruction to copy it to another register.
Commutative operations give you (or the compiler) a choice of which operand gets replaced with the result. So for example, compiling (with gcc 5.3 -O3 for x86-64 Linux calling convention):
// FP: a,b,c in xmm0,1,2. return value goes in xmm0
// Intel syntax ASM is op dest, src
// sd means Scalar Double (as opposed to packed vector, or to single-precision)
double comm(double a, double b, double c) { return (c+a) * (c+b); }
addsd xmm0, xmm2
addsd xmm1, xmm2
mulsd xmm0, xmm1
ret
double hard(double a, double b, double c) { return (c-a) * (c-b); }
movapd xmm3, xmm2 ; reg-reg copy: move Aligned Packed Double
subsd xmm2, xmm1
subsd xmm3, xmm0
movapd xmm0, xmm3
mulsd xmm0, xmm2
ret
double easy(double a, double b, double c) { return (a-c) * (b-c); }
subsd xmm0, xmm2
subsd xmm1, xmm2
mulsd xmm0, xmm1
ret
x86 also allows using memory operands as a source, so you can fold loads into ALU operations, like addsd xmm0, [my_constant]. (Using an ALU op with a memory destination sucks: it has to do a read-modify-write.) Commutative operations give more scope for doing this.
x86's AVX extension (introduced with Sandybridge, Jan 2011) added non-destructive versions of every existing instruction that used vector registers (same opcodes but with a multi-byte VEX prefix replacing all the previous prefixes and escape bytes). Other instruction-set extensions (like BMI/BMI2) also use the VEX coding scheme to introduce 3-operand non-destructive integer instructions, like PEXT r32a, r32b, r/m32: Parallel extract of bits from r32b using mask in r/m32. Result is written to r32a.
AVX also widened the vectors to 256b and added some new instructions. It's unfortunately nowhere near ubiquitous, and even Skylake Pentium/Celeron CPUs don't support it. It will be a long time before it's safe to ship binaries that assume AVX support. :(
Add -march=native to the compile options in the godbolt link above to see that AVX lets the compiler use just 3 instructions even for hard(). (godbolt runs on a Haswell server, so that includes AVX2 and BMI2):
double hard(double a, double b, double c) { return (c-a) * (c-b); }
vsubsd xmm0, xmm2, xmm0
vsubsd xmm1, xmm2, xmm1
vmulsd xmm0, xmm0, xmm1
ret

Dereference pointers in XMM register (gather)

If I have some pointer or pointer-like values packed into an SSE or AVX register, is there any particularly efficient way to dereference them, into another such register? ("Particularly efficient" meaning "more efficient than just using memory for the values".) Is there any way to dereference them all without writing an intermediate copy of the register out to memory?
Edit for clarification: that means, assuming 32-bit pointers and SSE, to index into four arbitrary memory areas at once with the four sections of an XMM register and return four results at once to another register. Or as close to "at once" as possible. (/edit)
Edit2: thanks to PaulR's answer I guess the terminology I'm looking for is "gather", and the question therefore is "what's the best way to implement gather for systems pre-AVX2?".
I assume there isn't an instruction for this since ...well, one doesn't appear to exist as far as I can tell and anyway it doesn't seem to be what SSE is designed for at all.
("Pointer-like value" meaning something like an integer index into an array pretending to be the heap; mechanically very different but conceptually the same thing. If, say, one wanted to use 32-bit or even 16-bit values regardless of the native pointer size, to fit more values in a register.)
Two possible reasons I can think of why one might want to do this:
I thought it might be interesting to explore using the SSE registers for general-purpose... stuff, perhaps to have four identical 'threads' processing potentially completely unrelated/non-contiguous data, slicing through the registers "vertically" rather than "horizontally" (i.e. instead of the way they were designed to be used).
to build something like romcc if for some reason (probably not a good one), one didn't want to write anything to memory, and therefore would need more register storage.
This might sound like an XY problem, but it isn't, it's just curiosity/stupidity. I'll go looking for nails once I have my hammer.
The question is not entirely clear, but if you want to dereference vector register elements then the only instructions which might help you here are AVX2's gathered loads, e.g. _mm256_i32gather_epi32 et al. See the AVX2 section of the Intel Intrinsics Guide.
SYNOPSIS
__m256i _mm256_i32gather_epi32 (int const* base_addr, __m256i vindex, const int scale)
#include "immintrin.h"
Instruction: vpgatherdd ymm, vm32x, ymm
CPUID Flag : AVX2
DESCRIPTION
Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.
OPERATION
FOR j := 0 to 7
i := j*32
dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
ENDFOR
dst[MAX:256] := 0
So if I understood this correctly, your title is misleading and you really want to:
index into the concatenation of all XMM registers
with an index held in a part of an XMM register
Right?
That's hard. And a little weird, but I'm OK with that.
Assuming crazy tricks are allowed, I propose self-modifying code: (not tested)
pextrb eax, xmm?, ? // question marks are the position of the pointer
mov edx, eax
shr eax, 1
and eax, 0x38
add eax, 0xC0 // C0 makes "hack" put its result in eax
mov [hack+4], al // xmm{al}
and edx, 15
mov [hack+5], dl // byte [dl] of xmm reg
call hack
pinsrb xmm?, eax, ? // put value back somewhere
...
hack:
db 66 0F 3A 14 00 00 // pextrb ?, ? ,?
ret
As far as I know, you can't do that with full ymm registers (yet?). With some more effort, you could extend it to xmm8-xmm15. It's easily adjustable to other "pointer" sizes and other element sizes.

how to use Base Pointer in Assembly 8086 to go through the stack?

I have an assigment in which I have to insert two digit integers into the stack. Search a number in the stack and return in which position this number is, print all the numbers in the stack and delete a number from the stack.
Right now I'm trying to print all the numbers in the stack by going through the stack using the base pointer, but my code doesn't work.
mov di,offset bp
mov ax, [di] ;trying to move the value stored at address di in the stack to ax
mov digito,ah
mov digito2,al
mov dl,digito
mov ah,02
int 21h
mov dl,digito2
mov ah,02
int 21h
mov ah,01
int 21h
So in this code I'm trying to print the two-digit number by getting bp into di (so later I can decrement it to go through the whole stack), then moving the number stored at that address into ax. I'm a newbie in assembly so I don't really know what I'm doing.
Thank you in advance for your time. (And sorry for my english)
Sorry for the delayed reply. First, bp doesn't really have an "offset", so you could remove that. Second, bp won't automatically point into the stack unless you have made it so (mov bp, sp).
You don't mention an OS, but int 21h identifies it as DOS... which is real mode, segmented memory model. mov ax, [di] defaults to mov ax, ds:[di]. If you've assembled this into a ".com" file, cs, ds, es, and ss are all the same. If you've assembled it into an ".exe" file, this is not so! You may want to write it as mov ax, ss:[di] to be sure. In contrast, mov ax, [bp] defaults to mov ax, ss:[bp], so you may want to use bp instead of di here. I suspect that's how you're "supposed" to do it. If you've got a ".com" file, you can forget about this part (in 32-bit code you can forget about it too, but that doesn't apply to you).
Then... your attempt to print a number isn't really going to work properly. Look for "How do I print a number?" examples for more information on that - too much to get into here...
This is too hard an assignment for a beginner, IMO (but "the instructor is always right" :) ).

Pointers and Indexes in Intel 8086 Assembly

I have a pointer to an array, DI.
Is it possible to go to the value pointed to by both DI and another pointer?
e.g:
mov bl,1
mov bh,10
inc [di+bl]
inc [di+bh]
And, on a related note, is there a single line opcode to swap the values of two registers? (In my case, BX and BP?)
For 16-bit programs, the only supported addressing forms are:
[BX+SI]
[BX+DI]
[BP+SI]
[BP+DI]
[SI]
[DI]
[BP]
[BX]
Each of these may include either an 8- or 16-bit constant displacement.
(Source: Intel Developer's Manual volume 2A, page 38)
The problem with the example provided is that bl and bh are eight-bit registers and cannot be used in an addressing mode. However, if you set bx to the desired value then inc [di+bx] (with a suitable size specifier, e.g. inc byte [di+bx]) is valid.
As for swapping "the high and low bits of a register," J-16 SDiZ's suggestion of ror bx, 8 is fine for exchanging bl and bh (and IIRC, it is the optimal way to do so). However, if you want to exchange bit 0 of (say) bl with bit 7 of bl, you'll need more logic than that.
DI is not a pointer, it is an index.
You can use ROR BX, 8 to swap the lower and higher bytes of a register.