I was searching for an equivalent of the very convenient value_counts in pandas, applied to a column of a DataFrame in Julia.
Unfortunately I could not find anything here, so my solution for a value_counts on a Julia DataFrame is as follows. However, I don't like my solution very much, as it is not as convenient as pandas' .value_counts() method. So my question is: is there another (more convenient) option than this?
jdf = DataFrame(rand(Int8, (1000000, 3)))
which gives me:
│ Row │ x1 │ x2 │ x3 │
│ │ Int8 │ Int8 │ Int8 │
├─────────┼──────┼──────┼──────┤
│ 1 │ -97 │ 98 │ 79 │
│ 2 │ -77 │ -118 │ -19 │
⋮
│ 999998 │ -115 │ 17 │ 107 │
│ 999999 │ -43 │ -64 │ 72 │
│ 1000000 │ 40 │ -11 │ 31 │
Value count for the first column would be:
combine(nrow,groupby(jdf,:x1))
which returns:
│ Row │ x1 │ nrow │
│ │ Int8 │ Int64 │
├─────┼──────┼───────┤
│ 1 │ -97 │ 3942 │
│ 2 │ -77 │ 3986 │
⋮
│ 254 │ 12 │ 3899 │
│ 255 │ -92 │ 3973 │
│ 256 │ -49 │ 3952 │
In DataFrames.jl this is the way to get the result you want. In general the approach in DataFrames.jl is to have a minimal API. If you use combine(nrow,groupby(jdf,:x1)) often then you can just define:
value_counts(df, col) = combine(groupby(df, col), nrow)
in your script.
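If you additionally want the counts sorted in descending order, as pandas does by default, here is a minimal sketch (the sort order and the :count column name are my choice, and the nrow => :count renaming assumes a recent DataFrames.jl version):
using DataFrames

# pandas-like value_counts: count rows per group, largest count first
value_counts(df, col) = sort!(combine(groupby(df, col), nrow => :count), :count, rev = true)

value_counts(jdf, :x1)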
Alternative ways to achieve what you want are using FreqTables.jl or StatsBase.jl:
julia> freqtable(jdf, :x1)
256-element Named Array{Int64,1}
x1 │
─────┼─────
-128 │ 3875
-127 │ 3931
-126 │ 3924
⋮ ⋮
125 │ 3873
126 │ 3917
127 │ 3975
julia> countmap(jdf.x1)
Dict{Int8,Int64} with 256 entries:
-98 => 3925
-74 => 4054
11 => 3798
-56 => 3853
29 => 3765
-105 => 3918
⋮ => ⋮
(the difference between them is the type of the returned object)
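If you would rather have the countmap result back in tabular form, one possible sketch (the column names are my choice):
using DataFrames, StatsBase

cm = countmap(jdf.x1)
counts_df = DataFrame(value = collect(keys(cm)), count = collect(values(cm)))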
In terms of performance countmap is fastest, and combine is slowest:
julia> using BenchmarkTools
julia> @benchmark countmap($jdf.x1)
BenchmarkTools.Trial:
memory estimate: 16.80 KiB
allocs estimate: 14
--------------
minimum time: 436.000 μs (0.00% GC)
median time: 443.200 μs (0.00% GC)
mean time: 455.244 μs (0.22% GC)
maximum time: 5.362 ms (91.59% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark freqtable($jdf, :x1)
BenchmarkTools.Trial:
memory estimate: 37.22 KiB
allocs estimate: 86
--------------
minimum time: 7.972 ms (0.00% GC)
median time: 8.089 ms (0.00% GC)
mean time: 8.158 ms (0.00% GC)
maximum time: 10.016 ms (0.00% GC)
--------------
samples: 613
evals/sample: 1
julia> @benchmark combine(groupby($jdf,:x1), nrow)
BenchmarkTools.Trial:
memory estimate: 23.28 MiB
allocs estimate: 183
--------------
minimum time: 12.679 ms (0.00% GC)
median time: 14.572 ms (8.68% GC)
mean time: 15.239 ms (14.50% GC)
maximum time: 20.385 ms (21.83% GC)
--------------
samples: 328
evals/sample: 1
Note though that in combine most of the cost is grouping, so if you have the GroupedDataFrame object created already then combine is relatively fast:
julia> gdf = groupby(jdf,:x1);
julia> @benchmark combine($gdf, nrow)
BenchmarkTools.Trial:
memory estimate: 16.16 KiB
allocs estimate: 152
--------------
minimum time: 680.801 μs (0.00% GC)
median time: 714.800 μs (0.00% GC)
mean time: 737.568 μs (0.15% GC)
maximum time: 4.561 ms (83.47% GC)
--------------
samples: 6766
evals/sample: 1
EDIT
If you want a sorted dict then load DataStructures.jl and then do:
sort!(OrderedDict(countmap(jdf.x1)))
or
sort!(OrderedDict(countmap(jdf.x1)), byvalue=true)
depending on what you want to sort the dictionary by.
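For example, to mimic pandas' default of sorting by count in descending order (assuming, as in DataStructures.jl, that sort! on an OrderedDict forwards keyword arguments such as rev to Base's sort):
using DataStructures, StatsBase

sort!(OrderedDict(countmap(jdf.x1)), byvalue = true, rev = true)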
I would like to ask about extracting data from a given dataset (I guess it is similar to data decomposition). The goal is to decompose a dataset in order to extract the underlying features.
For example: extract the individual components of the volume of a rectangular prism without knowledge of the individual features (length, width, height).
Please recommend best practice to carry out this operation. Also, can you suggest any book or article which explains such a process in detail?
Update
The example code for the analysis:
using DataFrames
mutable struct rect
    length
    width
    height
end
r = rect(rand(Int, 40), rand(Int, 40), rand(Int, 40))
volume(rect) = rect.length .* rect.width .* rect.height
volume_val = volume(r)
df = DataFrame(:length => r.length, :width => r.width, :height => r.height, :volume => volume_val)
# For this dataframe, I would like to extract length, width and height from volume without using the volume equation
Thanks in advance!
Do you mean searching for which values of length, width, and height give a certain volume?
using DataFrames
mutable struct rect
    length
    width
    height
end
r = rect(rand(1:100, 10), rand(1:100, 10), rand(1:100, 10))
volume(rect) = rect.length .* rect.width .* rect.height
volume_val = volume(r)
julia> df = DataFrame(:length => r.length, :width=> r.width, :height=> r.height, :volume => volume_val)
10×4 DataFrame
Row │ length width height volume
│ Int64 Int64 Int64 Int64
─────┼───────────────────────────────
1 │ 41 82 58 194996
2 │ 41 57 92 215004
3 │ 88 42 63 232848
4 │ 32 98 12 37632
5 │ 26 65 14 23660
6 │ 94 26 40 97760
7 │ 14 72 65 65520
8 │ 51 72 79 290088
9 │ 36 50 26 46800
10 │ 63 22 94 130284
julia> df[df.volume .== 46800,:]
1×4 DataFrame
Row │ length width height volume
│ Int64 Int64 Int64 Int64
─────┼───────────────────────────────
1 │ 36 50 26 46800
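Equivalently, you can use filter with the column => predicate syntax (assuming a recent DataFrames.jl version that supports it):
filter(:volume => ==(46800), df)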
I am wondering why @btime reports one memory allocation per element in basic loops like these:
julia> using BenchmarkTools
julia> v=[1:15;]
15-element Array{Int64,1}:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
julia> @btime for vi in v end
1.420 μs (15 allocations: 480 bytes)
julia> @btime for i in eachindex(v) v[i]=-i end
2.355 μs (15 allocations: 464 bytes)
I do not know how to interpret this result:
is it a bug/artifact of @btime?
is there really one alloc per element? (this would ruin performance...)
julia> versioninfo()
Julia Version 1.5.1
Commit 697e782ab8 (2020-08-25 20:08 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, haswell)
You're benchmarking access to the global variable v, which is the very first performance tip you should be aware of.
With BenchmarkTools you can work around that by interpolating v:
julia> @btime for vi in v end
555.962 ns (15 allocations: 480 bytes)
julia> @btime for vi in $v end
1.630 ns (0 allocations: 0 bytes)
But note that in general it's better to put your code in functions. The global scope is just bad for performance:
julia> f(v) = for vi in v end
f (generic function with 1 method)
julia> @btime f(v)
11.410 ns (0 allocations: 0 bytes)
julia> @btime f($v)
1.413 ns (0 allocations: 0 bytes)
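The same applies to your second loop; a minimal sketch (I would expect the allocations to drop to zero here, though I have not timed it on your machine):
using BenchmarkTools

function g!(v)
    for i in eachindex(v)
        v[i] = -i   # same loop as above, now inside a function
    end
end

@btime g!($v);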
In particular, what do those dashed lines mean?
julia> @code_warntype gcd(4,5)
Body::Int64
31 1 ── %1 = %new(Base.:(#throw1#307))::Core.Compiler.Const(getfield(Base, Symbol("#throw1#307"))(), false)
32 │ %2 = (a === 0)::Bool │╻ ==
└─── goto #3 if not %2 │
2 ── %4 = (Base.flipsign_int)(b, b)::Int64 ││╻ flipsign
└─── return %4 │
33 3 ── %6 = (b === 0)::Bool │╻ ==
└─── goto #5 if not %6 │
4 ── %8 = (Base.flipsign_int)(a, a)::Int64 ││╻ flipsign
└─── return %8 │
34 5 ── %10 = (Base.cttz_int)(a)::Int64 │╻ trailing_zeros
35 │ %11 = (Base.cttz_int)(b)::Int64 │╻ trailing_zeros
36 │ %12 = (Base.slt_int)(%11, %10)::Bool │╻╷ min
│ %13 = (Base.ifelse)(%12, %11, %10)::Int64 ││
37 │ %14 = (Base.sle_int)(0, %10)::Bool │╻╷ >>
│ %15 = (Base.bitcast)(UInt64, %10)::UInt64 ││╻ unsigned
│ %16 = (Base.ashr_int)(a, %15)::Int64 ││╻ >>
│ %17 = (Base.neg_int)(%10)::Int64 ││╻ -
│ %18 = (Base.bitcast)(UInt64, %17)::UInt64 │││╻ reinterpret
│ %19 = (Base.shl_int)(a, %18)::Int64 ││╻ <<
│ %20 = (Base.ifelse)(%14, %16, %19)::Int64 ││
│ %21 = (Base.flipsign_int)(%20, %20)::Int64 ││╻ flipsign
│ %22 = (Base.bitcast)(UInt64, %21)::UInt64 ││╻ reinterpret
38 │ %23 = (Base.sle_int)(0, %11)::Bool │╻╷ >>
│ %24 = (Base.bitcast)(UInt64, %11)::UInt64 ││╻ unsigned
│ %25 = (Base.ashr_int)(b, %24)::Int64 ││╻ >>
│ %26 = (Base.neg_int)(%11)::Int64 ││╻ -
│ %27 = (Base.bitcast)(UInt64, %26)::UInt64 │││╻ reinterpret
│ %28 = (Base.shl_int)(b, %27)::Int64 ││╻ <<
│ %29 = (Base.ifelse)(%23, %25, %28)::Int64 ││
│ %30 = (Base.flipsign_int)(%29, %29)::Int64 ││╻ flipsign
└─── %31 = (Base.bitcast)(UInt64, %30)::UInt64 ││╻ reinterpret
39 6 ┄─ %32 = φ (#5 => %22, #15 => %40)::UInt64 │
│ %33 = φ (#5 => %31, #15 => %61)::UInt64 │
│ %34 = (%32 === %33)::Bool │╻ !=
│ %35 = (Base.not_int)(%34)::Bool ││╻ !
└─── goto #16 if not %35 │
40 7 ── %37 = (Base.ult_int)(%33, %32)::Bool │╻╷ >
└─── goto #9 if not %37 │
8 ── nothing │
43 9 ── %40 = φ (#8 => %33, #7 => %32)::UInt64 │
│ %41 = φ (#8 => %32, #7 => %33)::UInt64 │
│ %42 = (Base.sub_int)(%41, %40)::UInt64 │╻ -
44 │ %43 = (Base.cttz_int)(%42)::UInt64 │╻ trailing_zeros
│ %44 = (Core.lshr_int)(%43, 63)::UInt64 ││╻╷╷ Type
│ %45 = (Core.trunc_int)(Core.UInt8, %44)::UInt8 │││┃││ toInt64
│ %46 = (Core.eq_int)(%45, 0x01)::Bool ││││┃│ check_top_bit
└─── goto #11 if not %46 │││││
10 ─ invoke Core.throw_inexacterror(:check_top_bit::Symbol, UInt64::Any, %43::UInt64)
└─── $(Expr(:unreachable)) │││││
11 ─ goto #12 │││││
12 ─ %51 = (Core.bitcast)(Core.Int64, %43)::Int64 ││││
└─── goto #13 ││││
13 ─ goto #14 │││
14 ─ goto #15 ││
15 ─ %55 = (Base.sle_int)(0, %51)::Bool ││╻ <=
│ %56 = (Base.bitcast)(UInt64, %51)::UInt64 │││╻ reinterpret
│ %57 = (Base.lshr_int)(%42, %56)::UInt64 ││╻ >>
│ %58 = (Base.neg_int)(%51)::Int64 ││╻ -
│ %59 = (Base.bitcast)(UInt64, %58)::UInt64 │││╻ reinterpret
│ %60 = (Base.shl_int)(%42, %59)::UInt64 ││╻ <<
│ %61 = (Base.ifelse)(%55, %57, %60)::UInt64 ││
└─── goto #6 │
46 16 ─ %63 = (Base.sle_int)(0, %13)::Bool │╻╷ <<
│ %64 = (Base.bitcast)(UInt64, %13)::UInt64 ││╻ unsigned
│ %65 = (Base.shl_int)(%32, %64)::UInt64 ││╻ <<
│ %66 = (Base.neg_int)(%13)::Int64 ││╻ -
│ %67 = (Base.bitcast)(UInt64, %66)::UInt64 │││╻ reinterpret
│ %68 = (Base.lshr_int)(%32, %67)::UInt64 ││╻ >>
│ %69 = (Base.ifelse)(%63, %65, %68)::UInt64 ││
48 │ %70 = (Base.ult_int)(0x7fffffffffffffff, %69)::Bool╷ >
│ %71 = (Base.or_int)(false, %70)::Bool ││╻ <
└─── goto #18 if not %71 │
17 ─ invoke %1(_2::Int64, _3::Int64) │
└─── $(Expr(:unreachable)) │
49 18 ─ %75 = (Base.bitcast)(Int64, %69)::Int64 │╻ rem
└─── return %75 │
source code:
30 function gcd(a::T, b::T) where T<:Union{Int8,UInt8,Int16,UInt16,Int32,UInt32,Int64,UInt64,Int128,UInt128}
31 @noinline throw1(a, b) = throw(OverflowError("gcd($a, $b) overflows"))
32 a == 0 && return abs(b)
33 b == 0 && return abs(a)
34 za = trailing_zeros(a)
35 zb = trailing_zeros(b)
36 k = min(za, zb)
37 u = unsigned(abs(a >> za))
38 v = unsigned(abs(b >> zb))
39 while u != v
40 if u > v
41 u, v = v, u
42 end
43 v -= u
44 v >>= trailing_zeros(v)
45 end
46 r = u << k
47 # T(r) would throw InexactError; we want OverflowError instead
48 r > typemax(T) && throw1(a, b)
49 r % T
50 end
This is explained in detail here https://github.com/JuliaLang/julia/blob/master/base/compiler/ssair/show.jl#L170.
In summary:
the number of vertical lines indicates the level of nesting;
a half-size line (╷) indicates the start of a scope, and a full-size line (│) indicates a continuing scope;
an increased thickness of one of the lines marks the scope whose (heuristically chosen) name is printed to the right of the lines.
Consider the following 4 functions in Julia: they all pick/compute a random column of a matrix A and add a constant times this column to a vector z.
The difference between slow1 and fast1 is how z is updated, and likewise for slow2 and fast2.
The difference between the *1 functions and the *2 functions is whether the matrix A is passed to the functions or computed on the fly.
The odd thing is that for the *1 functions, fast1 is faster (as I would expect when using BLAS instead of +=), but for the *2 functions slow2 is faster.
On this computer I get the following timings (for the second run of each function):
@time slow1(A, z, 10000);
0.172560 seconds (110.01 k allocations: 940.102 MB, 12.98% gc time)
@time fast1(A, z, 10000);
0.142748 seconds (50.07 k allocations: 313.577 MB, 4.56% gc time)
@time slow2(complex(float(x)), complex(float(y)), z, 10000);
2.265950 seconds (120.01 k allocations: 1.529 GB, 1.20% gc time)
@time fast2(complex(float(x)), complex(float(y)), z, 10000);
4.351953 seconds (60.01 k allocations: 939.410 MB, 0.43% gc time)
Is there an explanation to this behaviour? And a way to make BLAS faster than +=?
M = 2^10
x = [-M:M-1;]
N = 2^9
y = [-N:N-1;]
A = cis( -2*pi*x*y' )
z = rand(2*M) + rand(2*M)*im
function slow1(A::Matrix{Complex{Float64}}, z::Vector{Complex{Float64}}, maxiter::Int)
    S = [1:size(A,2);]
    for iter = 1:maxiter
        idx = rand(S)
        col = A[:,idx]
        a = rand()
        z += a*col
    end
end

function fast1(A::Matrix{Complex{Float64}}, z::Vector{Complex{Float64}}, maxiter::Int)
    S = [1:size(A,2);]
    for iter = 1:maxiter
        idx = rand(S)
        col = A[:,idx]
        a = rand()
        BLAS.axpy!(a, col, z)
    end
end

function slow2(x::Vector{Complex{Float64}}, y::Vector{Complex{Float64}}, z::Vector{Complex{Float64}}, maxiter::Int)
    S = [1:length(y);]
    for iter = 1:maxiter
        idx = rand(S)
        col = cis( -2*pi*x*y[idx] )
        a = rand()
        z += a*col
    end
end

function fast2(x::Vector{Complex{Float64}}, y::Vector{Complex{Float64}}, z::Vector{Complex{Float64}}, maxiter::Int)
    S = [1:length(y);]
    for iter = 1:maxiter
        idx = rand(S)
        col = cis( -2*pi*x*y[idx] )
        a = rand()
        BLAS.axpy!(a, col, z)
    end
end
Update:
Profiling slow2:
2260 task.jl; anonymous; line: 92
2260 REPL.jl; eval_user_input; line: 63
2260 profile.jl; anonymous; line: 16
2175 /tmp/axpy.jl; slow2; line: 37
10 arraymath.jl; .*; line: 118
33 arraymath.jl; .*; line: 120
5 arraymath.jl; .*; line: 125
46 arraymath.jl; .*; line: 127
3 complex.jl; cis; line: 286
3 complex.jl; cis; line: 287
2066 operators.jl; cis; line: 374
72 complex.jl; cis; line: 286
1914 complex.jl; cis; line: 287
1 /tmp/axpy.jl; slow2; line: 38
84 /tmp/axpy.jl; slow2; line: 39
5 arraymath.jl; +; line: 96
39 arraymath.jl; +; line: 98
6 arraymath.jl; .*; line: 118
34 arraymath.jl; .*; line: 120
Profiling fast2:
4288 task.jl; anonymous; line: 92
4288 REPL.jl; eval_user_input; line: 63
4288 profile.jl; anonymous; line: 16
1 /tmp/axpy.jl; fast2; line: 47
1 random.jl; rand; line: 214
3537 /tmp/axpy.jl; fast2; line: 48
26 arraymath.jl; .*; line: 118
44 arraymath.jl; .*; line: 120
1 arraymath.jl; .*; line: 122
4 arraymath.jl; .*; line: 125
53 arraymath.jl; .*; line: 127
7 complex.jl; cis; line: 286
3399 operators.jl; cis; line: 374
116 complex.jl; cis; line: 286
3108 complex.jl; cis; line: 287
2 /tmp/axpy.jl; fast2; line: 49
748 /tmp/axpy.jl; fast2; line: 50
748 linalg/blas.jl; axpy!; line: 231
Oddly, the computing time of col differs even though the functions are identical up to this point.
But += is still relatively faster than axpy!.
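Since the profiles show that computing col via cis dominates, the remaining difference comes from how z is updated. As an aside, here is a sketch of a fused in-place update (assuming Julia 0.6+ dot-fusion, newer than the version used above) that avoids the temporaries z += a*col creates each iteration:
function fused2(x::Vector{Complex{Float64}}, y::Vector{Complex{Float64}}, z::Vector{Complex{Float64}}, maxiter::Int)
    for iter = 1:maxiter
        idx = rand(1:length(y))
        a = rand()
        # the whole right-hand side fuses into a single in-place loop over z,
        # so neither col nor a*col is materialized as a temporary array
        z .+= a .* cis.(-2pi .* x .* y[idx])
    end
end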
Some more info now that Julia 0.6 is out. To multiply a vector by a scalar in place, there are at least four options. Following Tim's suggestions, I used BenchmarkTools' @btime macro. It turns out that loop fusion, the most Julian way to write it, is on par with calling BLAS. That's something the Julia developers can be proud of!
using BenchmarkTools
function bmark(N)
    a = zeros(N);
    @btime $a *= -1.;
    @btime $a .*= -1.;
    @btime LinAlg.BLAS.scal!($N, -1.0, $a, 1);
    @btime scale!($a, -1.);
end
And the results for 10^5 numbers.
julia> bmark(10^5);
78.195 μs (2 allocations: 781.33 KiB)
35.102 μs (0 allocations: 0 bytes)
34.659 μs (0 allocations: 0 bytes)
34.664 μs (0 allocations: 0 bytes)
The profiling backtrace shows that scale! just calls BLAS in the background, so they should give the same best time.
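For reference, on Julia 1.x the names have changed: LinAlg is now LinearAlgebra and scale! was replaced by rmul!. A sketch of the equivalent calls (timings will of course vary by machine):
using BenchmarkTools, LinearAlgebra

a = zeros(10^5);

@btime $a .*= -1.0;                          # fused broadcast, in place
@btime BLAS.scal!(length($a), -1.0, $a, 1);  # direct BLAS call
@btime rmul!($a, -1.0);                      # replacement for scale!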
The following program gives 8191 on INTEL platforms. On ARM platforms, it gives 8192 (the correct answer).
// g++ -o test test.cpp
#include <stdio.h>
int main( int argc, char *argv[])
{
    double a = 8192.0 / (4 * 510);
    long x = (long) (a * (4 * 510));
    printf("%ld\n", x);
    return 0;
}
Can anyone explain why? The problem goes away if I use any of the -O, -O2, or -O3 compile switches.
Thanks in advance!
long fun ( void )
{
    double a = 8192.0 / (4 * 510);
    long x = (long) (a * (4 * 510));
    return(x);
}
g++ -c -O2 fun.c -o fun.o
objdump -D fun.o
0000000000000000 <_Z3funv>:
0: b8 00 20 00 00 mov $0x2000,%eax
5: c3 retq
No math at run time: the compiler did all the math at compile time, removing all of the dead code you had supplied.
gcc same deal.
0000000000000000 <fun>:
0: b8 00 20 00 00 mov $0x2000,%eax
5: c3 retq
arm gcc optimized
00000000 <fun>:
0: e3a00a02 mov r0, #8192 ; 0x2000
4: e12fff1e bx lr
the raw binary for the double a is
0x40101010 0x10101010
and double(4*510) is
0x409FE000 0x00000000
those are computations done at compile time even unoptimized.
generic soft float arm
00000000 <fun>:
0: e92d4810 push {r4, fp, lr}
4: e28db008 add fp, sp, #8
8: e24dd014 sub sp, sp, #20
c: e28f404c add r4, pc, #76 ; 0x4c
10: e8940018 ldm r4, {r3, r4}
14: e50b3014 str r3, [fp, #-20]
18: e50b4010 str r4, [fp, #-16]
1c: e24b1014 sub r1, fp, #20
20: e8910003 ldm r1, {r0, r1}
24: e3a02000 mov r2, #0
28: e59f3038 ldr r3, [pc, #56] ; 68 <fun+0x68>
2c: ebfffffe bl 0 <__aeabi_dmul>
30: e1a03000 mov r3, r0
34: e1a04001 mov r4, r1
38: e1a00003 mov r0, r3
3c: e1a01004 mov r1, r4
40: ebfffffe bl 0 <__aeabi_d2iz>
44: e1a03000 mov r3, r0
48: e50b3018 str r3, [fp, #-24]
4c: e51b3018 ldr r3, [fp, #-24]
50: e1a00003 mov r0, r3
54: e24bd008 sub sp, fp, #8
58: e8bd4810 pop {r4, fp, lr}
5c: e12fff1e bx lr
60: 10101010 andsne r1, r0, r0, lsl r0
64: 40101010 andsmi r1, r0, r0, lsl r0
68: 409fe000 addsmi lr, pc, r0
6c: e1a00000 nop ; (mov r0, r0)
intel
0000000000000000 <fun>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 48 b8 10 10 10 10 10 movabs $0x4010101010101010,%rax
b: 10 10 40
e: 48 89 45 f0 mov %rax,-0x10(%rbp)
12: f2 0f 10 4d f0 movsd -0x10(%rbp),%xmm1
17: f2 0f 10 05 00 00 00 movsd 0x0(%rip),%xmm0 # 1f <fun+0x1f>
1e: 00
1f: f2 0f 59 c1 mulsd %xmm1,%xmm0
23: f2 48 0f 2c c0 cvttsd2si %xmm0,%rax
28: 48 89 45 f8 mov %rax,-0x8(%rbp)
2c: 48 8b 45 f8 mov -0x8(%rbp),%rax
30: 5d pop %rbp
31: c3 retq
0000000000000000 <.rodata>:
0: 00 00 add %al,(%rax)
2: 00 00 add %al,(%rax)
4: 00 e0 add %ah,%al
6: 9f lahf
7: 40 rex
So you can see in the ARM code that it takes the 4.01... value of a and the 4*510 converted to a double value, and passes those into __aeabi_dmul (a double multiply, no doubt). Then it converts the result from double to integer (__aeabi_d2iz), and there you go.
Intel same deal but with hard float instructions.
So if there is a difference (I would have to prep and fire up an ARM to see at least one ARM result; as already posted, my Intel result for your program verbatim is 8192), it would be in one of the two floating-point operations (the multiply or the double-to-integer conversion), where rounding choices may come into play.
This is obviously not a value that can be represented exactly in base two (floating point):
0x40101010 0x10101010
Start doing math with that, and one of those trailing ones may cause a rounding difference.
Last and most important: even though the Pentium bug was famous and we were led to believe it was fixed, floating-point units still have bugs. But usually the programmer falls into a floating-point accuracy trap well before that, which is likely what you are seeing here, if you are actually seeing anything here...
The reason your problem goes away if you use optimisation flags is because this result is known a priori, i.e. the compiler can just replace x in the printf statement with 8192 and save memory. In fact, I'm willing to bet money that it's the compiler that's responsible for the differences you observe.
This question is essentially 'how do computers store numbers', and that question is always is relevant to programming in C++ (or C). I recommend you look at the link before reading further.
Let's look at these two lines:
double a = 8192.0 / (4 * 510);
long x = (long) (a * (4 * 510));
For a start, note that you're multiplying two int constants together -- implicitly, 4 * 510 is the same as (int)(4 * 510). However, C (and C++) has a happy rule that when one operand is floating-point, as in a * (4 * 510) where a is a double, the other operand is converted and the calculation is done in floating-point arithmetic rather than in integer arithmetic. The calculation is done in double unless both operands are float, in which case it is done in float; it is done in integer arithmetic only when both operands are integers. I suspect that you might be running into precision issues for your ARM target. Let's make sure.
For the sake of my own curiosity, I've compiled your program to assembly by calling gcc -c -g -Wa,-a,-ad test.c > test.s. On two different versions of GCC, both on Unix-like OSes, this snippet always prints 8192 rather than 8191.
This particular combination of flags includes the corresponding line of C as an assembly comment, which makes it much easier to read what's happening. Here's the interesting bits, written in AT&T Syntax, i.e. commands have the form command source, destination.
30 5:testSpeed.c **** double a = 8192.0 / (4 * 510);
31 23 .loc 1 5 0
32 24 000f 48B81010 movabsq $4616207279229767696, %rax
33 24 10101010
34 24 1040
35 25 0019 488945F0 movq %rax, -16(%rbp)
Yikes! Let's break this down a bit. Lines 30 to 36 deal with the assignment of the quad-byte value 4616207279229767696 to the register rax, a processor register that holds values. The next line -- movq %rax, -16(%rbp) -- moves that data into a location in memory pointed to by rbp.
So, in other words, the compiler has assigned a to memory and forgotten about it.
The next set of lines are a bit more complicated.
36 6:testSpeed.c **** long x = (long) (a * (4 * 510));
37 26 .loc 1 6 0
38 27 001d F20F104D movsd -16(%rbp), %xmm1
39 27 F0
40 28 0022 F20F1005 movsd .LC1(%rip), %xmm0
41 28 00000000
42 29 002a F20F59C1 mulsd %xmm1, %xmm0
43 30 002e F2480F2C cvttsd2siq %xmm0, %rax
44 30 C0
45 31 0033 488945F8 movq %rax, -8(%rbp)
...
72 49 .LC1:
73 50 0008 00000000 .long 0
74 51 000c 00E09F40 .long 1084219392
75 52 .text
76 53 .Letext0:
Here, we start off by moving the contents of the memory location pointed to above, i.e. a, into a register (xmm1). We then take the data at .LC1, shown in the snippet below, and jam it into another register (xmm0). Much to my surprise, we then do a scalar double-precision floating-point multiply (mulsd). We then truncate the result by calling cvttsd2siq (which is what your cast to long actually does), and put the result somewhere (movq %rax, -8(%rbp)).
46 7:testSpeed.c **** printf("%ld\n", x);
47 32 .loc 1 7 0
48 33 0037 488B45F8 movq -8(%rbp), %rax
49 34 003b 4889C6 movq %rax, %rsi
50 35 003e BF000000 movl $.LC2, %edi
51 35 00
52 36 0043 B8000000 movl $0, %eax
53 36 00
54 37 0048 E8000000 call printf
55 37 00
56 8:testSpeed.c **** return 0;
The remainder of this code then just calls printf.
Now, let's do the same thing again, but with -O3, i.e. telling the compiler to optimise rather aggressively. Here are a few choice snippets from the resulting assembly:
139 22 .loc 2 104 0
140 23 0004 BA002000 movl $8192, %edx
...
154 5:testSpeed.c **** double a = 8192.0 / (4 * 510);
155 6:testSpeed.c **** long x = (long) (a * (4 * 510));
156 7:testSpeed.c **** printf("%ld\n", x);
157 8:testSpeed.c **** return 0;
158 9:testSpeed.c **** }
...
In this instance, we see that the compiler hasn't even bothered to produce instructions from your code, and instead just inlines the right answer.
For the sake of argument, I did the same thing with long x=8192; printf("%ld\n",x);. The assembly is identical.
Something similar will happen for your ARM target, but the floating-point instructions are different because it's a different processor (everything above only holds true for x86_64). Remember: if you see something you don't expect in C (or C++) programming, you need to stop and think about it. Fractions like 0.3 cannot be represented exactly in binary floating point!
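The same effect is easy to demonstrate in any language with IEEE doubles; in Julia, for example:
julia> 0.1 + 0.2 == 0.3   # the nearest doubles to 0.1, 0.2 and 0.3 do not add up exactly
false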