strcmp and strcmp_sse functions in libc

I've seen that in libc.so the choice of which strcmp_sse variant to call is made by the function strcmp itself.
Here is the code:
strcmp:
.text:000000000007B9F0 cmp cs:__cpu_features.kind, 0
.text:000000000007B9F7 jnz short loc_7B9FE
.text:000000000007B9F9 call __init_cpu_features
.text:000000000007B9FE
.text:000000000007B9FE loc_7B9FE: ; CODE XREF: .text:000000000007B9F7j
.text:000000000007B9FE lea rax, __strcmp_sse2_unaligned
.text:000000000007BA05 test cs:__cpu_features.cpuid._eax, 10h
.text:000000000007BA0F jnz short locret_7BA2B
.text:000000000007BA11 lea rax, __strcmp_ssse3
.text:000000000007BA18 test cs:__cpu_features.cpuid._ecx, 200h
.text:000000000007BA22 jnz short locret_7BA2B
.text:000000000007BA24 lea rax, __strcmp_sse2
.text:000000000007BA2B
.text:000000000007BA2B locret_7BA2B: ; CODE XREF: .text:000000000007BA0Fj
.text:000000000007BA2B ; .text:000000000007BA22j
.text:000000000007BA2B retn
What I do not understand is that the address of the strcmp_sse function to call is placed in rax and never actually called. Therefore I am wondering: who is going to call *rax? When?

The Linux dynamic linker supports a special symbol type called STT_GNU_IFUNC, and strcmp is likely implemented as an IFUNC. 'Regular' symbols in a dynamic library are nothing more than a mapping from a name to an address. IFUNCs are a bit more complex than that: the address isn't readily available, and in order to obtain it the linker must execute a piece of code from the library itself. We are seeing an example of such a piece of code here. Note that in the x86_64 ABI a function returns its result in RAX, so the value left in RAX is the address the linker binds the symbol to.
This technique is typically used to pick the optimal implementation based on the CPU features. The selection logic runs only once, when the dynamic linker resolves the symbol (on the first call with lazy binding, or at load time otherwise); every later call to strcmp goes straight to the chosen implementation.
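The mechanism can be imitated in portable C. This is only a rough sketch of the idea: in the real case the dynamic linker runs the resolver and patches the symbol's slot, not user code. All names below (my_strcmp, resolve_my_strcmp, my_strcmp_slot, cpu_has_fast_path, resolver_runs) are hypothetical and do not exist in libc.
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature shared by all implementations. */
typedef int (*strcmp_fn)(const char *, const char *);

/* Stand-in for a baseline implementation such as __strcmp_sse2. */
static int my_strcmp_generic(const char *a, const char *b)
{
    while (*a && *a == *b) { a++; b++; }
    return (unsigned char)*a - (unsigned char)*b;
}

/* Stand-in for a tuned implementation; here it just reuses the generic one. */
static int my_strcmp_fast(const char *a, const char *b)
{
    return my_strcmp_generic(a, b);
}

/* Assumption: a simple flag standing in for __cpu_features. */
static int cpu_has_fast_path = 1;
static int resolver_runs = 0;   /* counts how often the resolver executes */

/* Plays the role of the code at `strcmp`: it returns an address and
   calls nothing itself. */
static strcmp_fn resolve_my_strcmp(void)
{
    resolver_runs++;
    return cpu_has_fast_path ? my_strcmp_fast : my_strcmp_generic;
}

/* Plays the role of the slot that the dynamic linker fills in. */
static strcmp_fn my_strcmp_slot = NULL;

/* Plays the role of the linker: resolve once, then always jump
   through the slot. */
static int my_strcmp(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    if (my_strcmp_slot == NULL)
        my_strcmp_slot = resolve_my_strcmp();
    return my_strcmp_slot(a, b);
}
```
However many times my_strcmp is called, resolve_my_strcmp runs once; after that the call costs one indirect jump, which is the property the IFUNC scheme is after.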

x86 assembly, moving data from an array to a register

I've been going over the book over and over again and cannot understand why this is giving me "improper operand type". It should work!
This is inline assembly in Visual Studio.
function(unsigned int* a){
unsigned int num;
_asm {
mov eax, a //This stores address (start of the array) in eax
mov num, dword ptr [eax*4] //This is the line I am having issues with.
On that last line I am trying to load the 4-byte value that is in the array, but I get error C2415: improper operand type.
What am I doing wrong? How do I copy 4 byte value from an array into a 32 bit register?

In Visual C++'s inline assembly, all variables are accessed as memory operands [1]; in other words, wherever you write num you can think that the compiler will replace it with dword ptr [ebp - something].
Now, this means that in the last mov you are effectively trying to perform a memory-memory mov, which isn't provided on x86. Use a temporary register instead:
mov eax, dword ptr [a] ; load value of 'a' (which is an address) in eax
mov eax, dword ptr [eax] ; dereference address, and load contents in eax
mov dword ptr [num], eax ; store value in 'num'
Notice that I removed the *4, as it doesn't really make sense to multiply a pointer by four - maybe you meant to use a as base plus some other index?
[1] Other compilers, such as gcc, provide means to control way more finely the interaction between inline assembly and compiler generated code, which provides great flexibility and power but has quite a steep learning curve and requires great care to get everything right.
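For comparison, here is a small C sketch of what the three mov instructions do, one statement per instruction. The function names first_element and element_at are made up for illustration. The second function shows the "base plus scaled index" reading of the original *4, which in assembly would be an operand of the form [eax + ecx*4].
```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the three-instruction sequence: load the pointer, dereference
   it into a temporary (the register), then store the temporary. */
unsigned int first_element(const unsigned int *a)
{
    assert(a != NULL);
    const unsigned int *addr = a;  /* mov eax, dword ptr [a]   */
    unsigned int tmp = *addr;      /* mov eax, dword ptr [eax] */
    unsigned int num = tmp;        /* mov dword ptr [num], eax */
    return num;
}

/* Base plus scaled index: the address is a + i * sizeof(unsigned int),
   the compiler applies the scale factor for us. */
unsigned int element_at(const unsigned int *a, size_t i)
{
    assert(a != NULL);
    return a[i];
}
```
The temporary in first_element is the point: the value has to pass through a register because a single mov cannot have two memory operands.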

What code is generated for Ada Array of Records loop?

For example for:
type PERSONCV is
record
name: String ( 1..4 );
age: Integer;
cvtext: String ( 1..2000 );
end record;
N: constant := 40000;
persons : array ( 1..N ) of PERSONCV;
function jobcandidate return Boolean is
iscandidate: Boolean := False;
begin
for p of persons loop -- what code is generated for this?
if p.age >= 18 then
iscandidate := true;
exit;
end if;
end loop;
return iscandidate;
end;
In C/C++ the loop part would typically be:
PERSONCV * p; // address pointer
int k = 0;
while ( k < N )
{
p = &persons [ k ]; // pointer to k'th record
if ( p->age >= 18 )...
...
k++ ;
}
I have read that Ada uses Value semantics for records.
Does the Ada loop above copy the k'th record to loop variable p?
e.g. like this in C/C++:
PERSONCV p; // object/variable
int k = 0;
while ( k < N )
{
memcpy ( &p, &persons [ k ], sizeof ( PERSONCV ) ); // copies k'th elem
if ( p.age >= 18 )...
...
k++ ;
}

Assuming you are using GNAT, there are two avenues of investigation.
The switch -gnatG will regenerate an Ada-like representation of what the front end of the compiler is going to pass to the back end (before any optimisations). In this case, I see
function resander__jobcandidate return boolean is
iscandidate : boolean := false;
L_1 : label
begin
L_1 : for C8b in 1 .. 40000 loop
p : resander__personcv renames resander__persons (C8b);
if p.age >= 18 then
iscandidate := true;
exit;
end if;
end loop L_1;
return iscandidate;
end resander__jobcandidate;
so the question is, how does renames get translated? Given that the record size is 2008 bytes, the chances of the compiler generating a copy are pretty much zero.
The second investigatory approach is to use the switch -S to keep the assembly code that the compiler normally emits to the assembler and then deletes. On macOS, this confirms that the generated code is like your first C++ version.
As an interesting sidelight, Ada 2012 allows an alternate implementation of jobcandidate:
function jobcandidate2 return Boolean is
(for some p of persons => p.age >= 18);
which generates identical code.

I suspect that what you have read about Ada is wrong, and probably worse, is encouraging you to think about Ada in the wrong way.
Ada's intent is to encourage thinking in the problem domain, i.e. to specify what should happen, rather than thinking in the solution domain, i.e. to implement the fine details of exactly how.
So here the intent is to loop over all Persons, exit returning True on meeting the first over 18, otherwise return False.
And that's it.
By and large Ada mandates nothing about the details of how it's done, provided those semantics are satisfied.
Then, the intent is, you just expect the compiler to do the right thing.
Now an individual compiler may choose one implementation over another - or may switch between implementations according to optimisation heuristics, taking into account which CPU it's compiling for, as well as the size of the objects (will they fit into a register?) etc.
You could imagine a CPU with many registers, where a single cache line read makes the copy implementation faster than operating in place (especially if there are no modifications to write back to P's contents), or other target CPUs where the reverse was true. Why would you want to stop the compiler picking the better implementation?
A good example of this is Ada's approach to parameter passing to subprograms - name, value or reference semantics really don't apply - instead, you specify the parameter passing mode - in, out, or in out describing the information flow to (or from) the subprogram. Intuitive, provides semantics that can be more rigorously checked, and leaves the compiler free to pick the best (fastest, smallest, depending on your goal) implementation that correctly obeys those semantics.
Now it would be possible for a specific Ada compiler to make poor choices, and 30 years ago when computers were barely big enough to run an Ada compiler at all, you might well have found performance compromised for simplicity in early releases of a compiler.
But we have thirty more years of compiler development now, running on more powerful computers. So, today, I'd expect the compiler to normally make the best choice. And if you find a specific compiler missing out on performance optimisations, file an enhancement request. Ada compilers aren't perfect, just like any other compiler.
In this specific example, I'd normally expect P to be a cursor into the array, and operations to happen in-place, i.e. reference semantics. Or possibly a hybrid between forms, where one memory fetch into a register serves several operations, like a partial form of value semantics.
If your interest is academic, you can easily look at the assembly output from whatever compiler you're using and find out. Or write all three versions above and benchmark them.
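As a starting point for such a comparison, here is a minimal C sketch of the two hand-written variants from the question, made self-contained. The names PersonCV, job_candidate_by_pointer and job_candidate_by_copy are my own, and the array is passed in with its length so that small test inputs can be used; this is not what any Ada compiler is required to generate.
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* C counterpart of the Ada record: 4 + 4 + 2000 bytes on typical targets. */
typedef struct {
    char name[4];
    int  age;
    char cvtext[2000];
} PersonCV;

/* Reference-style loop: p points at the k'th record, nothing is copied. */
bool job_candidate_by_pointer(const PersonCV *persons, size_t n)
{
    assert(persons != NULL || n == 0);
    for (size_t k = 0; k < n; k++) {
        const PersonCV *p = &persons[k];
        if (p->age >= 18)
            return true;
    }
    return false;
}

/* Value-style loop: the whole k'th record is copied into p first. */
bool job_candidate_by_copy(const PersonCV *persons, size_t n)
{
    assert(persons != NULL || n == 0);
    PersonCV p;
    for (size_t k = 0; k < n; k++) {
        memcpy(&p, &persons[k], sizeof p);
        if (p.age >= 18)
            return true;
    }
    return false;
}
```
Both functions return the same answer for any input, which is exactly why a compiler is free to choose either form; an optimiser may well remove the memcpy in the second one, so inspect the assembly before trusting a timing difference.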

Using a current compiler (GCC 7.0.0), I have copied your source to both an Ada program and a C++ program, using std::array<char, 4> etc. corresponding to String( 1..4 ) etc. Switches were simply -O2 for C++, and -O2 -gnatp for Ada, so as to use comparable settings regarding checked access to array elements, etc.
These are the results for jobcandidate:
C++:
movl $_ZN15Loop_Over_Array7personsE+4, %eax
movl $_ZN15Loop_Over_Array7personsE+80320004, %edx
jmp .L3
.L8:
addq $2008, %rax
cmpq %rdx, %rax
je .L7
.L3:
cmpl $17, (%rax)
jle .L8
movl $1, %eax
ret
.L7:
xorl %eax, %eax
ret
Ada:
movl $1, %eax
jmp .L5
.L10:
addq $1, %rax
cmpq $40001, %rax
je .L9
.L5:
imulq $2008, %rax, %rdx
cmpl $17, loop_over_array__persons-2004(%rdx)
jle .L10
movl $1, %eax
ret
.L9:
xorl %eax, %eax
ret
One difference I see is in how each implementation uses %edx and %eax for going from one element of the array to the next and for testing whether the end has been reached. Ada seems to imulq the index by the element size to set the cursor, while C++ seems to addq the element size to the pointer.
I haven't measured performance.

In a recursive factorial program, how are stack frames maintained when a function returns?

Example function implementation:
int facto(int x) {
    if (x == 1) {
        return 1;
    }
    else {
        return x * facto(x - 1);
    }
}
To put it more simply, let's take a stack -->
returns
|2(1)|----> 2(1) evaluates to 2
|3(2)|----> 3(2)<______________| evaluates to 6
|4(3)|----> 4(6)<______________| evaluates to 24
|5(4)|----> 5*(24)<____________| evaluates to 120
------ finally back to main...
When a function returns in this reverse manner, how does it know what exactly is behind it? The stack has activation records stored in it, but how do they know about each other, which one has been popped, and which one is on top?
How does the stack keep track of all variables within the function being executed? Besides this, how does it keep track of what code is being executed (stack pointer)? When returning from a function call, the result of that function is filled into a placeholder by using the stack pointer, but how does it know where to continue executing code? These are the basics of how the stack works, which I know, but for recursion I don't understand exactly how it works.

When a function returns, its stack frame is discarded (i.e. the complete local state is popped off the stack).
The details depend on the processor architecture and language.
Check the C calling conventions for x86 processors: http://en.wikipedia.org/wiki/X86_calling_conventions, http://en.wikibooks.org/wiki/X86_Disassembly/Functions_and_Stack_Frames and search for "PC Assembly Language" by Paul A. Carter (it's a bit outdated but it has a good explanation of C and Pascal calling conventions for the ia32 architecture).
In C in x86 processors:
a. The calling function pushes the parameters of the called function onto the stack in reverse order; the call instruction then pushes the return address.
push -6
push 2
call add # pushes the address of `end:` and then jumps to `add:` (add(2, -6))
end:
# ...
b. Then the called function pushes the base of the stack (the ebp register in ia32) (it is used to reference local variables in the caller function).
add:
push ebp
c. The called function sets ebp to the current stack pointer (this ebp will be the reference to access the local variables and parameters of the current function instance).
add:
# ...
mov ebp, esp
d. The called function reserves space on the stack for its local (automatic) variables by subtracting the size of the variables from the stack pointer.
add:
# ...
sub esp, 4 # Reserves 4 bytes for a variable
e. At the end of the called function it sets the stack pointer to ebp (i.e. frees its local variables), restores the ebp register of the caller function and returns to the return address (previously pushed by the call instruction in the caller).
add:
# ...
mov esp, ebp # frees local variables
pop ebp # restores old ebp
ret # pops `end:` and jumps there
f. Finally the caller adds to the stack pointer the space used by the parameters of the called function (i.e. frees the space used by the arguments).
# ...
end:
add esp, 8
Return values (unless they are bigger than the register) are returned in the eax register.
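For the recursive case specifically, the same steps just repeat: every call to facto pushes its own frame on top of its caller's, and a return always pops the newest frame, so the frame underneath is by construction the caller's. The C sketch below imitates this with an explicit array instead of the hardware stack. It is a simplification under stated assumptions: the name facto_explicit and the constant MAX_FRAMES are made up, each "frame" holds only the argument x, and the two loops stand in for the return address and saved ebp that a real frame would also contain.
```c
#include <assert.h>

#define MAX_FRAMES 32  /* hypothetical limit for this sketch */

unsigned long long facto_explicit(unsigned x)
{
    unsigned frames[MAX_FRAMES];   /* one entry per pending call */
    int top = 0;                   /* plays the role of the stack pointer */
    unsigned long long result;     /* plays the role of eax */

    assert(x >= 1 && x <= 20);     /* 20! still fits in 64 bits */

    /* Call phase: facto(x) cannot finish yet, so its x is saved in a
       frame and facto(x - 1) is "called". */
    while (x != 1) {
        frames[top++] = x;
        x = x - 1;
    }

    /* Base case: facto(1) returns 1. */
    result = 1;

    /* Return phase: pop the newest frame, which belongs to the caller of
       the call that just returned, and finish its multiplication. */
    while (top > 0) {
        unsigned saved_x = frames[--top];
        result = saved_x * result;
    }
    return result;
}
```
For facto_explicit(5) the array holds 5, 4, 3, 2 when the base case is reached, and the return phase multiplies them back in the order 2, 3, 4, 5, which matches the diagram in the question.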

What's the name of the popcount function in Julia?

What's the name of the function that tells you how many bits are set in some variable? This surely already exists in Base or maybe some standard library.

To quote Keno Fischer...
Try count_ones. As you can see it uses the popcnt instruction:
julia> code_native(count_ones,(Int64,))
.section __TEXT,__text,regular,pure_instructions
Filename: int.jl
Source line: 192
push RBP
mov RBP, RSP
Source line: 192
popcnt RAX, RDI
pop RBP
ret
Is your question in any way related to the Hacker News buzz about "Replacing a 32-bit loop count variable with 64-bit introduces crazy performance deviations"?

Assembly call memory address

This is probably a really stupid question, but how do you call a memory address in ASM? I am using the code call dword 557054 (557054 is where the code is located...) but I figure that it is calling 557054 + wherever the program got loaded into memory. I need this for my executable loader...

There are two ways to do this: you can use CALL or you can use JMP. The second is more flexible but requires you to do a little more work if you want some compatibility with C-style code.
Simple C function call using CALL:
push eax ; push args to stack
push ebx
call my_func ; my_func can be a c exported function or defined as a macro or asm function
