Recursive product of two numbers in MIPS32

The code isn't printing the product of two numbers. I am not sure how to print out the result, and I'm not 100% sure whether my return value is stored on the stack or not.
main:
#prompt 1
li $v0, 4 # Print the String at Label “Input”
la $a0, num1
syscall
li $v0, 5
syscall
move $a2, $v0
#prompt 2
li $v0, 4 # Print the String at Label “Input”
la $a0, num2
syscall
li $v0, 5 # Read integer from user
syscall
move $a1, $v0 # Pass integer to argument register $a1
jal multiply
add $a0, $v0, $zero
li $v0, 1
syscall
multiply:
bne $a1, 0, recurse
move $v1, $a1
jr $ra
recurse:
sub $sp, $sp, 12
sw $ra, 0($sp)
sw $a0, 4($sp)
sw $a1, 8($sp)
addiu $a1, $a1, -1
jal multiply
lw $a1, 4($sp)
add $v1, $v1, $a1
lw $ra, 0($sp)
addi $sp, $sp, 12
jr $ra
The expected value of 3 * 2 is 6, but it prints out 2; 3 * 3 returns 3.
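For reference, the recursion the code above is aiming for is multiplication by repeated addition. A minimal Python sketch of the intended logic (not a fix for the assembly, just the recurrence it should implement):

```python
def multiply(a, b):
    # Base case: anything times 0 is 0.
    if b == 0:
        return 0
    # Recursive step: a * b = a + a * (b - 1)
    return a + multiply(a, b - 1)

print(multiply(3, 2))  # 6
```

In the MIPS calling convention the result belongs in $v0, which is the register main reads before the print syscall.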

Related

Fibonacci sequence less than 1000 in R

I'm trying to print the Fibonacci sequence with values less than 1000 using a while loop in R.
So far,
fib <- c(1,1)
counter <-3
while (fib[counter-1]<1000){
fib[counter]<- fib[counter-2]+fib[counter-1]
counter = counter+1
}
fib
I have this code. Only the first two numbers are given: 1,1. This is printing:
1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597
How do I fix my code to print only the numbers less than 1000?
Instead of checking the value of the last element against 1000, for the expected output you should be checking the sum of the last two elements, like so:
fib <- c(1,1)
counter <-3
while (fib[counter-2]+fib[counter - 1]<1000){
fib[counter]<- fib[counter-2]+fib[counter-1]
counter = counter+1
}
fib
The issue with your approach is that by the time the condition (fib[counter-1] < 1000) in the while loop becomes FALSE, you have already added a number to fib which is greater than 1000.
You could return fib[-length(fib)] to remove the last number, or check the value before inserting it into fib.
fib <- c(1,1)
counter <-3
while (TRUE){
temp <- fib[counter-2] + fib[counter-1]
if(temp < 1000)
fib[counter] <- temp
else
break
counter = counter+1
}
fib
#[1] 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
You could change the while condition to sum the last 2 answers instead of just the last one:
fib <- c(1,1)
counter <-3
while (sum(fib[counter - 1:2]) < 1000){
fib[counter]<- fib[counter-2]+fib[counter-1]
counter = counter+1
}
fib
#> [1] 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
Or just get rid of counter completely:
fib <- c(1,1)
while (sum(fib[length(fib) - 0:1]) < 1000) fib <- c(fib, sum(fib[length(fib) - 0:1]))
fib
#> [1] 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
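For comparison, the same check-before-append idea is a short loop in Python (just a sketch of the logic, not R):

```python
fib = [1, 1]
# Only append the next term if it would stay under 1000.
while fib[-2] + fib[-1] < 1000:
    fib.append(fib[-2] + fib[-1])
print(fib)  # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987]
```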

Find the sum of the elements in array in mips assembly (using recursion)

So, this is the C++ code that I needed to translate to MIPS assembly:
int sum( int arr[], int size ) {
if ( size == 0 ) return 0 ;
else
return sum( arr, size - 1 ) + arr[size-1];
}
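Restating the C++ recursion as a Python sketch makes the stack discipline explicit: every value needed after the recursive call (arr and size here) must survive that call, which is exactly what the stack saves and restores in the assembly below are for.

```python
def array_sum(arr, size):
    # Base case: an empty prefix sums to 0.
    if size == 0:
        return 0
    # The caller must remember `arr` and `size` across this call,
    # just like the assembly must preserve them on the stack.
    return array_sum(arr, size - 1) + arr[size - 1]

print(array_sum([5, 1, 2, 3, 4], 5))  # 15
```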
And this is my attempt:
.data
arr: .word 5 1 2 3 4
sum:
li $v0, 0
beq $a1, $zero, out
addi $sp, $sp, -12
sw $ra, 8($sp)
sw $s1, 4($sp)
sw $s0, 0($sp)
move $s0, $a0
move $s1, $a1
addi $a1, $a1, -1
jal sum
addi $s1, $s1, -1
sll $s1, $s1, 2
addu $s0, $s0, $s1
lw $t0, 0($s0)
add $v0,$v0, $t0
lw $s0, 0($sp)
lw $s1, 4($sp)
lw $ra, 8($sp)
addi $sp, $sp, 12
jr $ra
out:
jr $ra
main:
la $a0, arr
li $a1, 5
jal sum
move $a0, $v0
li $v0, 1
syscall
li $v0, 10
syscall
So, I preserve three values across my recursive calls: ra, a0, and a1. But I get a bad-read exception, which probably occurs when I try to access the array's memory, and I cannot find the problem.
I'm not sure what is wrong with your code, but here is my example. I am using QtSpim for testing MIPS.
.text
fillArray:
li $t0, 0
loop:
beq $t0, $a1, endFillArray
mulo $t1, $t0, 4
add $t1, $a0, $t1
li $v0, 5
syscall
sw $v0, 0($t1)
addi $t0, $t0, 1
j loop
endFillArray:
jr $ra
sumArrayRec:
addi $sp, $sp,-16
sw $ra, 0($sp)
sw $a0, 4($sp)
beqz $a1, zero
addi $a1, $a1, -1
sw $a1, 8($sp)
jal sumArrayRec
sw $v0, 12($sp)
lw $t0, 4($sp)
lw $t2, 8($sp)
mulo $t2, $t2, 4
add $t0, $t0, $t2
lw $t1, 0($t0)
lw $t3, 12($sp)
add $v0, $t3, $t1
j endSumArrRec
zero:
li $v0, 0
endSumArrRec:
move $a0, $v0
lw $ra, 0($sp)
addi $sp, $sp, 16
jr $ra
main:
la $a0, str1
li $v0, 4
syscall
li $v0, 5
syscall
move $s0, $v0 #length arr
la $a0, str2
li $v0, 4
syscall
la $a0, arr
move $a1, $s0
jal fillArray # input arr elements
la $a0, arr
move $a1, $s0
jal sumArrayRec
move $s0, $v0
la $a0, str3
li $v0, 4
syscall
move $a0, $s0
li $v0, 1
syscall
li $v0, 10
syscall
.data
arr: .space 200
str1: .asciiz "Input length n = "
str2: .asciiz "Insert array elements: \n"
str3: .asciiz "Result: "

How to check the status of a parent and child process?

I have just written a simple piece of code to check how the child and parent process run. But I am not getting the desired output.
#include<stdio.h>
#include<stdlib.h>
#include<unistd.h>
#include<sys/types.h>
#include<sys/wait.h>
int main()
{
pid_t x;
int n=1;
x=fork();
if(x>0)
{
n+=2;
printf("Parent process exist %d\n",n);
}
else if(x==0)
{
n+=5;
printf(" Child process %d\n ",n);
}
printf("done %d",n);
return 0;
}
The code is very simple, but is there any hidden problem that gives unexpected output?
Warning: Answer is not rules lawyery!
Yes. n += 5 is not an atomic operation. It consists of three "sub-operations": load n, add 5, store n.
Except it doesn't even necessarily do that, since the compiler is free to go "hey, all this loading and storing of n into RAM is pointless when I can just keep the value in a register". Declare the variable as volatile int to force the loads and stores to actually happen.
The importance of that non-atomic thing can be seen with this example compilation and execution of your code (with the volatile change):
0 SYSCALL fork_into_register_zero
1 STORE 1 INTO RAM #386
2 COMPARE REGISTER #0 TO 0
3 JUMP IF <= TO INSTRUCTION #23
4 LOAD RAM #386 INTO REGISTER #1
5 ADD 2 TO REGISTER #1
6 STORE REGISTER #1 INTO RAM #386
7 LOAD "Parent process exist " INTO REGISTER #0
8 LOAD 1 INTO REGISTER #1
9 SYSCALL output_string_from_register_zero_to_file_descriptor_from_register_one
10 LOAD 1000000 INTO REGISTER #2
11 LOAD RAM #386 INTO REGISTER #1
12 STORE REGISTER #1 INTO REGISTER #0
13 DIVBY REGISTER #2 TO REGISTER #0
14 ADD '0' TO REGISTER #0
15 SYSCALL putch_from_register_zero
16 MODBY REGISTER #2 TO REGISTER #1
17 DIVBY 10 TO REGISTER #2
18 COMPARE REGISTER #2 TO 0
19 JUMP IF > TO INSTRUCTION #12
20 STORE '\n' TO REGISTER #0
21 SYSCALL putch_from_register_zero
22 COMPARE 1 TO 0
23 JUMP IF != TO INSTRUCTION 44
24 LOAD RAM #386 INTO REGISTER #1
25 ADD 5 TO REGISTER #1
26 STORE REGISTER #1 INTO RAM #386
27 LOAD " Child process " INTO REGISTER #0
28 LOAD 1 INTO REGISTER #1
29 SYSCALL output_string_from_register_zero_to_file_descriptor_from_register_one
30 LOAD 1000000 INTO REGISTER #2
31 LOAD RAM #386 INTO REGISTER #1
32 STORE REGISTER #1 INTO REGISTER #0
33 DIVBY REGISTER #2 TO REGISTER #0
34 ADD '0' TO REGISTER #0
35 SYSCALL putch_from_register_zero
36 MODBY REGISTER #2 TO REGISTER #1
37 DIVBY 10 TO REGISTER #2
38 COMPARE REGISTER #2 TO 0
39 JUMP IF > TO INSTRUCTION #12
40 STORE '\n' TO REGISTER #0
41 SYSCALL putch_from_register_zero
42 STORE ' ' TO REGISTER #0
43 SYSCALL putch_from_register_zero
44 LOAD "Done " INTO REGISTER #0
45 LOAD 1 INTO REGISTER #1
46 SYSCALL output_string_from_register_zero_to_file_descriptor_from_register_one
47 LOAD 1000000 INTO REGISTER #2
48 LOAD RAM #386 INTO REGISTER #1
49 STORE REGISTER #1 INTO REGISTER #0
50 DIVBY REGISTER #2 TO REGISTER #0
51 ADD '0' TO REGISTER #0
52 SYSCALL putch_from_register_zero
53 MODBY REGISTER #2 TO REGISTER #1
54 DIVBY 10 TO REGISTER #2
55 COMPARE REGISTER #2 TO 0
56 JUMP IF > TO INSTRUCTION #12
57 STORE '\n' TO REGISTER #0
58 SYSCALL putch_from_register_zero
59 STORE 0 TO REGISTER #0
60 RETURN
Here there's a kernel-level putch-to-stdout operation, because I don't think consistency is important enough to retype 23 lines of fake code. Note that SYSCALLs are handled by the kernel and so are atomic (when dealing with output to a file descriptor, at least).
Each one of these instructions is atomic. However, everything after the SYSCALL fork_into_register_zero is run twice, once in each process, with different values of REGISTER #0, and the two processes' output can be interleaved in any way. Let that sink in. In fact, the output could be this:
Child process 3
Done Parent process exist 33
Done 3
Doesn't seem right? That's concurrency for you!

Implementing the MATLAB filter2 function in R

I am currently implementing the MATLAB filter2 function in R, which is a method for 2D convolution. I have gotten the 2D convolution to work, but how the 'valid' option in filter2 works is not quite clear to me.
The MATLAB function is described here:
http://se.mathworks.com/help/matlab/ref/filter2.html
My implementation:
filter2D <- function(img, window) {
# Algorithm for 2D convolution
filter_center_index_y <- median(1:dim(window)[1])
filter_max_index_y <- dim(window)[1]
filter_center_index_x <- median(1:dim(window)[2])
filter_max_index_x <- dim(window)[2]
# For each position in the picture, 2D convolution is done by
# calculating a score for all overlapping values within the two matrices
x_min <- 1
x_max <- dim(img)[2]
y_min <- 1
y_max <- dim(img)[1]
df <- NULL
for (x_val in c(x_min:x_max)){
for (y_val in c(y_min:y_max)){
# Distanced from cell
img_dist_left <- x_val-1
img_dist_right <- x_max-x_val
img_dist_up <- y_val-1
img_dist_down <- y_max-y_val
# Overlapping filter cells
filter_x_start <- filter_center_index_x-img_dist_left
if (filter_x_start < 1) {
filter_x_start <- 1
}
filter_x_end <- filter_center_index_x+img_dist_right
if (filter_x_end > filter_max_index_x) {
filter_x_end <- filter_max_index_x
}
filter_y_start <- filter_center_index_y-img_dist_up
if (filter_y_start < 1) {
filter_y_start <- 1
}
filter_y_end <- filter_center_index_y+img_dist_down
if (filter_y_end > filter_max_index_y) {
filter_y_end <- filter_max_index_y
}
# Part of filter that overlaps
filter_overlap_matrix <- window[filter_y_start:filter_y_end, filter_x_start:filter_x_end]
# Overlapped image cells
image_x_start <- x_val-filter_center_index_x+1
if (image_x_start < 1) {
image_x_start <- 1
}
image_x_end <- x_val+filter_max_index_x-filter_center_index_x
if (image_x_end > x_max) {
image_x_end <- x_max
}
image_y_start <- y_val-filter_center_index_y+1
if (image_y_start < 1) {
image_y_start <- 1
}
image_y_end <- y_val+filter_max_index_y-filter_center_index_y
if (image_y_end > y_max) {
image_y_end <- y_max
}
# Part of image that is overlapped
image_overlap_matrix <- img[image_y_start:image_y_end, image_x_start:image_x_end]
# Calculating the cell value
cell_value <- sum(filter_overlap_matrix*image_overlap_matrix)
df = rbind(df,data.frame(x_val,y_val, cell_value))
}
}
# Axis labels
x_axis <- c(x_min:x_max)
y_axis <- c(y_min:y_max)
# Populating matrix
filter_matrix <- matrix(df[,3], nrow = x_max, ncol = y_max, dimnames = list(x_axis, y_axis))
return(filter_matrix)
}
Running the method:
> image
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 2 3 4 5 6
[2,] 7 8 9 10 11 12
[3,] 13 14 15 16 17 18
[4,] 19 20 21 22 23 24
[5,] 25 26 27 28 29 30
[6,] 31 32 33 34 35 36
> filter
[,1] [,2] [,3]
[1,] 1 2 1
[2,] 0 0 0
[3,] -1 -2 -1
> filter2D(image, filter)
1 2 3 4 5 6
1 -22 -32 -36 -40 -44 -35
2 -36 -48 -48 -48 -48 -36
3 -36 -48 -48 -48 -48 -36
4 -36 -48 -48 -48 -48 -36
5 -36 -48 -48 -48 -48 -36
6 76 104 108 112 116 89
This is the same output that filter2(image, filter) produces in MATLAB. However, when the option 'valid' is added, the following output is generated:
-48 -48 -48 -48
-48 -48 -48 -48
-48 -48 -48 -48
-48 -48 -48 -48
It is not entirely obvious how filter2 with the 'valid' option generates this. Is it just using the center values? Or is it doing something more sophisticated?
Before I start, your code is actually doing 2D correlation. 2D convolution requires that you perform a 180 degree rotation on the kernel before doing the weighted sum. Correlation and convolution are in fact the same operation if the kernel is symmetric under a 180 degree rotation (i.e. rotating the kernel by 180 degrees gives back the same kernel). I just wanted to make that clear before I start. Also, the documentation for filter2 does state that correlation is being performed.
The 'valid' option in MATLAB simply means that it should return only the outputs where the kernel fully overlaps the 2D signal when performing filtering. Because you have a 3 x 3 kernel, this means that at location (2,2) in the 2D signal for example, the kernel does not go outside of the signal boundaries. Therefore, what is returned is the extent of the filtered 2D signal where the kernel was fully inside the original 2D signal. To give you an example, if you placed the kernel at location (1,1), some of the kernel would go out of bounds. Handling out of bounds conditions when filtering can be done in many ways which may affect results and the interpretation of those results when it comes down to it. Therefore the 'valid' option is desired as you are using true information that forms the final result. You aren't interpolating or performing any estimations for data that goes beyond the borders of the 2D signal.
Simply put, you return a reduced matrix which removes the border elements. The filter having odd dimensions makes this easy: you simply remove the first and last floor(M/2) rows and the first and last floor(N/2) columns, where M x N is the size of your kernel. Therefore, because your kernel is 3 x 3, we need to remove 1 row from the top and 1 from the bottom, as well as 1 column from the left and 1 from the right. This results in the 4 x 4 grid of -48 values you see in the MATLAB output.
Therefore, if you want to use the 'valid' option in your code, simply make sure that you remove the border elements from your result. You can do this right at the end, before you return the matrix:
# Place your code here...
# ...
# ...
# Now we're at the end of your code
# Populating matrix
filter_matrix <- matrix(df[,3], nrow = x_max, ncol = y_max, dimnames = list(x_axis, y_axis))
# New - Determine rows and columns of matrix as well as the filter kernel
nrow_window <- nrow(window)
ncol_window <- ncol(window)
nrows <- nrow(filter_matrix)
ncols <- ncol(filter_matrix)
# New - Figure out where to cut off
row_cutoff <- floor(nrow_window/2)
col_cutoff <- floor(ncol_window/2)
# New - Remove out borders
filter_matrix <- filter_matrix[((1+row_cutoff):(nrows-row_cutoff)), ((1+col_cutoff):(ncols-col_cutoff))]
# Finally return matrix
return(filter_matrix)
Example Run
Using your data:
> image <- t(matrix(c(1:36), nrow=6, ncol=6))
> image
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 2 3 4 5 6
[2,] 7 8 9 10 11 12
[3,] 13 14 15 16 17 18
[4,] 19 20 21 22 23 24
[5,] 25 26 27 28 29 30
[6,] 31 32 33 34 35 36
> filter <- matrix(c(1,0,-1,2,0,-2,1,0,-1), nrow=3, ncol=3)
> filter
[,1] [,2] [,3]
[1,] 1 2 1
[2,] 0 0 0
[3,] -1 -2 -1
I ran the function and now I get:
> filter2D(image,filter)
2 3 4 5
2 -48 -48 -48 -48
3 -48 -48 -48 -48
4 -48 -48 -48 -48
5 -48 -48 -48 -48
I think it is worth leaving the row and column labels as they are, so you can see explicitly that not all of the signal is being returned, which is what the code currently does. That's up to you to decide, though.
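A compact sketch of the same 'valid' correlation directly in Python with numpy (assuming numpy is available): the kernel slides only over positions where it fully overlaps the image, and no 180 degree flip is applied, matching filter2.

```python
import numpy as np

image = np.arange(1, 37).reshape(6, 6)      # the 6 x 6 test image above
kernel = np.array([[ 1,  2,  1],
                   [ 0,  0,  0],
                   [-1, -2, -1]])

def filter2_valid(img, k):
    m, n = k.shape
    rows = img.shape[0] - m + 1   # only fully-overlapping placements
    cols = img.shape[1] - n + 1
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            # Correlation: elementwise product of the window and the kernel.
            out[i, j] = np.sum(img[i:i + m, j:j + n] * k)
    return out

print(filter2_valid(image, kernel))  # 4 x 4 matrix, every entry -48
```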

g++ math difference between ARM and INTEL platforms

The following program gives 8191 on INTEL platforms. On ARM platforms, it gives 8192 (the correct answer).
// g++ -o test test.cpp
#include <stdio.h>
int main( int argc, char *argv[])
{
double a = 8192.0 / (4 * 510);
long x = (long) (a * (4 * 510));
printf("%ld\n", x);
return 0;
}
Can anyone explain why? The problem goes away if I use any of the -O, -O2, or -O3 compile switches.
Thanks in advance!
long fun ( void )
{
double a = 8192.0 / (4 * 510);
long x = (long) (a * (4 * 510));
return(x);
}
g++ -c -O2 fun.c -o fun.o
objdump -D fun.o
0000000000000000 <_Z3funv>:
0: b8 00 20 00 00 mov $0x2000,%eax
5: c3 retq
No math at run time: the compiler did all the math, removing all of the dead code you had supplied.
gcc, same deal.
0000000000000000 <fun>:
0: b8 00 20 00 00 mov $0x2000,%eax
5: c3 retq
arm gcc optimized
00000000 <fun>:
0: e3a00a02 mov r0, #8192 ; 0x2000
4: e12fff1e bx lr
the raw binary for the double a is
0x40101010 0x10101010
and double(4*510) is
0x409FE000 0x00000000
those are computations done at compile time even unoptimized.
generic soft float arm
00000000 <fun>:
0: e92d4810 push {r4, fp, lr}
4: e28db008 add fp, sp, #8
8: e24dd014 sub sp, sp, #20
c: e28f404c add r4, pc, #76 ; 0x4c
10: e8940018 ldm r4, {r3, r4}
14: e50b3014 str r3, [fp, #-20]
18: e50b4010 str r4, [fp, #-16]
1c: e24b1014 sub r1, fp, #20
20: e8910003 ldm r1, {r0, r1}
24: e3a02000 mov r2, #0
28: e59f3038 ldr r3, [pc, #56] ; 68 <fun+0x68>
2c: ebfffffe bl 0 <__aeabi_dmul>
30: e1a03000 mov r3, r0
34: e1a04001 mov r4, r1
38: e1a00003 mov r0, r3
3c: e1a01004 mov r1, r4
40: ebfffffe bl 0 <__aeabi_d2iz>
44: e1a03000 mov r3, r0
48: e50b3018 str r3, [fp, #-24]
4c: e51b3018 ldr r3, [fp, #-24]
50: e1a00003 mov r0, r3
54: e24bd008 sub sp, fp, #8
58: e8bd4810 pop {r4, fp, lr}
5c: e12fff1e bx lr
60: 10101010 andsne r1, r0, r0, lsl r0
64: 40101010 andsmi r1, r0, r0, lsl r0
68: 409fe000 addsmi lr, pc, r0
6c: e1a00000 nop ; (mov r0,
intel
0000000000000000 <fun>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 48 b8 10 10 10 10 10 movabs $0x4010101010101010,%rax
b: 10 10 40
e: 48 89 45 f0 mov %rax,-0x10(%rbp)
12: f2 0f 10 4d f0 movsd -0x10(%rbp),%xmm1
17: f2 0f 10 05 00 00 00 movsd 0x0(%rip),%xmm0 # 1f <fun+0x1f>
1e: 00
1f: f2 0f 59 c1 mulsd %xmm1,%xmm0
23: f2 48 0f 2c c0 cvttsd2si %xmm0,%rax
28: 48 89 45 f8 mov %rax,-0x8(%rbp)
2c: 48 8b 45 f8 mov -0x8(%rbp),%rax
30: 5d pop %rbp
31: c3 retq
0000000000000000 <.rodata>:
0: 00 00 add %al,(%rax)
2: 00 00 add %al,(%rax)
4: 00 e0 add %ah,%al
6: 9f lahf
7: 40 rex
So you can see in the ARM output that it takes the 4.0156... value of a and 4*510 converted to double, and passes those into __aeabi_dmul (double multiply, no doubt). Then it converts the result from double to integer (d2iz), and there you go.
Intel, same deal, but with hardware floating-point instructions.
So if there is a difference (I would have to prep and fire up an ARM to see at least one ARM result; I have already posted that my Intel result for your program verbatim is 8192), it would be in one of the two floating-point operations (the multiply, or the double-to-integer conversion), where rounding choices may come into play.
This is obviously not a value that can be represented cleanly in base two (floating point):
0x40101010 0x10101010
Start doing math with that, and one of those trailing ones may cause a rounding difference.
Last and most important: just because the Pentium bug was famous, and we were led to believe they fixed it, floating-point units still have bugs... but usually the programmer falls into a floating-point accuracy trap before then, which is likely what you are seeing here, if you are actually seeing anything here at all...
The reason your problem goes away when you use optimisation flags is that the result is known a priori: the compiler can just replace x in the printf statement with 8192 and skip the computation entirely. In fact, I'm willing to bet money that it's the compiler that's responsible for the differences you observe.
This question is essentially 'how do computers store numbers', and that question is always is relevant to programming in C++ (or C). I recommend you look at the link before reading further.
Let's look at these two lines:
double a = 8192.0 / (4 * 510);
long x = (long) (a * (4 * 510));
For a start, note that you're multiplying two int constants together -- implicitly, 4 * 510 is the same as (int)(4 * 510). However, C (and C++) has a happy rule that when one operand of an arithmetic expression is floating-point, the calculation is done in floating-point arithmetic rather than integer arithmetic: it is done in double unless both operands are float (in which case it is done in float), and in integer arithmetic only when both operands are ints. I suspect that you might be running into precision issues on one of your targets. Let's make sure.
For the sake of my own curiosity, I've compiled your program to assembly by calling gcc -c -g -Wa,-a,-ad test.c > test.s. On two different versions of GCC, both on Unix-like OSes, this snippet always prints 8192 rather than 8191.
This particular combination of flags includes the corresponding line of C as an assembly comment, which makes it much easier to read what's happening. Here's the interesting bits, written in AT&T Syntax, i.e. commands have the form command source, destination.
30 5:testSpeed.c **** double a = 8192.0 / (4 * 510);
31 23 .loc 1 5 0
32 24 000f 48B81010 movabsq $4616207279229767696, %rax
33 24 10101010
34 24 1040
35 25 0019 488945F0 movq %rax, -16(%rbp)
Yikes! Let's break this down a bit. Lines 30 to 35 deal with the assignment of the quad-byte value 4616207279229767696 to the register rax, a processor register that holds values. The next line -- movq %rax, -16(%rbp) -- moves that data into a location in memory relative to rbp.
So, in other words, the compiler has assigned a to memory and forgotten about it.
The next set of lines are a bit more complicated.
36 6:testSpeed.c **** long x = (long) (a * (4 * 510));
37 26 .loc 1 6 0
38 27 001d F20F104D movsd -16(%rbp), %xmm1
39 27 F0
40 28 0022 F20F1005 movsd .LC1(%rip), %xmm0
41 28 00000000
42 29 002a F20F59C1 mulsd %xmm1, %xmm0
43 30 002e F2480F2C cvttsd2siq %xmm0, %rax
44 30 C0
45 31 0033 488945F8 movq %rax, -8(%rbp)
...
72 49 .LC1:
73 50 0008 00000000 .long 0
74 51 000c 00E09F40 .long 1084219392
75 52 .text
76 53 .Letext0:
Here, we start off by moving the contents of the memory location pointed to above -- i.e. a -- into a register (xmm1). We then take the data at .LC1 (shown at the end of the listing above) and jam it into another register (xmm0). Much to my surprise, we then do a scalar double-precision floating-point multiply (mulsd). We then truncate the result -- which is what your cast to long actually does -- via cvttsd2siq, and put the result somewhere (movq %rax, -8(%rbp)).
46 7:testSpeed.c **** printf("%ld\n", x);
47 32 .loc 1 7 0
48 33 0037 488B45F8 movq -8(%rbp), %rax
49 34 003b 4889C6 movq %rax, %rsi
50 35 003e BF000000 movl $.LC2, %edi
51 35 00
52 36 0043 B8000000 movl $0, %eax
53 36 00
54 37 0048 E8000000 call printf
55 37 00
56 8:testSpeed.c **** return 0;
The remainder of this code then just calls printf.
Now, let's do the same thing again, but with -O3, i.e. telling the compiler to be rather aggressively optimising. Here's a few choice snippets from the resulting assembly:
139 22 .loc 2 104 0
140 23 0004 BA002000 movl $8192, %edx
...
154 5:testSpeed.c **** double a = 8192.0 / (4 * 510);
155 6:testSpeed.c **** long x = (long) (a * (4 * 510));
156 7:testSpeed.c **** printf("%ld\n", x);
157 8:testSpeed.c **** return 0;
158 9:testSpeed.c **** }
...
In this instance, we see that the compiler hasn't even bothered to produce instructions from your code, and instead just inlines the right answer.
For the sake of argument, I did the same thing with long x=8192; printf("%ld\n",x);. The assembly is identical.
Something similar will happen for your ARM target, but the floating-point multiplies differ because it's a different processor (everything above only holds true for x86_64). Remember: if you see something you don't expect in C (or C++) programming, you need to stop and think about it. Fractions like 0.3 cannot be represented exactly in binary floating point!
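The rounding behaviour is easy to check in plain IEEE-754 double precision (what SSE2 on x86_64 and the ARM soft-float routines both implement). A quick Python sketch, with the x87 explanation as a hedged comment:

```python
a = 8192.0 / (4 * 510)   # 1024/255: not exactly representable, rounds slightly low
product = a * (4 * 510)  # exact result is 8192 - 2**-43, which rounds back to 8192.0
print(product, int(product))  # 8192.0 8192

# The usual explanation for the 8191 result is the x87 FPU: it keeps the
# product in 80-bit extended precision, where 8192 - 2**-43 IS representable,
# and the subsequent truncating cast to long chops 8191.999... down to 8191.
```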
