(SBCL 2.2.0)
Playing around with the time function, I chanced upon something with dotimes that I cannot explain: after a certain limit, it takes forever to loop.
For example:
For 100000: (it barely registers)
(time (dotimes (i 100000)
(1+ 1)))
Evaluation took:
0.000 seconds of real time
0.000000 seconds of total run time (0.000000 user, 0.000000 system)
100.00% CPU
139,293 processor cycles
0 bytes consed
For 10000000:
(time (dotimes (i 10000000)
(1+ 1)))
Evaluation took:
0.010 seconds of real time
0.000000 seconds of total run time (0.000000 user, 0.000000 system)
0.00% CPU
10,130,013 processor cycles
0 bytes consed
For 100000000 (note that the time does not grow 10x over the previous run, only ~5x):
(time (dotimes (i 100000000)
(1+ 1)))
Evaluation took:
0.050 seconds of real time
0.031250 seconds of total run time (0.031250 user, 0.000000 system)
62.00% CPU
84,139,662 processor cycles
0 bytes consed
For 1000000000:
(time (dotimes (i 1000000000)
(1+ 1)))
Evaluation took:
0.404 seconds of real time
0.328125 seconds of total run time (0.328125 user, 0.000000 system)
81.19% CPU
942,942,189 processor cycles
0 bytes consed
For 10000000000: explodes
(time (dotimes (i 10000000000)
(1+ 1)))
Evaluation took:
153.227 seconds of real time
129.781250 seconds of total run time (128.781250 user, 1.000000 system)
[ Run times consist of 11.651 seconds GC time, and 118.131 seconds non-GC time. ]
84.70% CPU
1 form interpreted
370,698,869,379 processor cycles
109,232,098,992 bytes consed
before it was aborted by a non-local transfer of control.
Binary-searching around the numbers, I found that the crossover point is 4294967295, after which the loop never ends. Why does this happen so suddenly?
The one-word-but-slightly-wrong answer is 'bignums'. You are using a 32-bit SBCL, I assume, and 2^32 is 4294967296. For integers larger than that, the implementation has to do the arithmetic itself rather than letting the machine do it. Worse, integers this large are not machine integers and so, generally, need to be heap-allocated, which in turn means the GC now has work to do.
The more correct answer is that 2^32 - 1 will almost certainly be larger than the most positive fixnum in a 32-bit SBCL (you can check what this is by looking at most-positive-fixnum), but SBCL has support for using naked machine integers in some cases, and this is probably one of them.
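To see where these limits fall on your build, you can ask the image directly. A quick sketch (the exact values depend on whether the build is 32-bit or 64-bit):
* most-positive-fixnum
536870911                            ; 2^29 - 1 on a 32-bit build; 2^62 - 1 on a 64-bit one
* (typep 4294967295 'fixnum)
NIL                                  ; past the fixnum range on a 32-bit build...
* (typep 4294967295 '(unsigned-byte 32))
T                                    ; ...but still representable as a raw machine word
* (typep 4294967296 '(unsigned-byte 32))
NIL                                  ; one more and arithmetic needs heap-allocated bignums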
Related
I am trying to use MLJ on a DataFrame (30,000 rows x 8,000 columns), but every table operation seems to take a huge amount of time to compile yet is fast to run.
I have given an example with code below in which a 5 x 5000 DataFrame is generated; it gets stuck on the unpack line (line 3). When I run the same code on a 5 x 5 DataFrame, line 3 outputs “2.872309 seconds (9.09 M allocations: 565.673 MiB, 6.47% gc time, 99.84% compilation time)”.
This is a crazy amount of compilation time for a seemingly simple task, and I would like to know how I can reduce it.
Thank you,
Jack
using MLJ
using DataFrames
[line 1] @time arr = [[rand(1:10) for i in 1:5] for i in 1:5000];
output: 0.053668 seconds (200.76 k allocations: 11.360 MiB, 22.16% gc time, 99.16% compilation time)
[line 2] @time df = DataFrames.DataFrame(arr, :auto)
output: 0.267325 seconds (733.43 k allocations: 40.071 MiB, 4.29% gc time, 98.67% compilation time)
[line 3] @time y, X = unpack(df, ==(:x1));
does not finish running
It's not unexpected that the Julia compiler struggles with very wide DataFrames, which have (potentially) heterogeneous column types. That said, I'm not sure why this has to be a problem for this particular operation; I've checked with the MLJ maintainers, who can hopefully chime in.
In the meantime you can simply do
y, X = df.x1, select!(df, Not(:x1))
which is instantaneous. (Note that select! will drop x1 from your underlying data; if you want to copy the data, use select instead.)
Please don't cross-post a problem on multiple websites without linking.
The question has been answered at the Julia forum: https://discourse.julialang.org/t/simple-table-operation-has-very-large-compilation-time-with-mlj/82503/2. It was caused by a bug which is fixed in MLJBase 0.20.5.
Why is garbage collection so much slower when a large number of mutable structs are loaded in memory as compared with non-mutable structs? The object tree should have the same size in both cases.
julia> struct Foo
a::Float64
b::Float64
c::Float64
end
julia> mutable struct Bar
a::Float64
b::Float64
c::Float64
end
julia> @time dat1 = [Foo(0.0, 0.0, 0.0) for i in 1:1e9];
9.706709 seconds (371.88 k allocations: 22.371 GiB, 0.14% gc time)
julia> @time GC.gc(true)
0.104186 seconds, 100.00% gc time
julia> @time GC.gc(true)
0.124675 seconds, 100.00% gc time
julia> @time dat2 = [Bar(0.0, 0.0, 0.0) for i in 1:1e9];
71.870870 seconds (1.00 G allocations: 37.256 GiB, 73.88% gc time)
julia> @time GC.gc(true)
47.695473 seconds, 100.00% gc time
julia> @time GC.gc(true)
41.809898 seconds, 100.00% gc time
Non-mutable structs may be stored directly inside an Array. This will never happen for mutable structs. In your case, the Foo objects are all stored directly in dat1, so there is effectively just a single (albeit very large) allocation reachable after creating the Array.
In the case of dat2, every Bar object will have its own piece of memory allocated for it and the Array will contain references to these objects. So with dat2, you end up with 1G + 1 reachable allocations.
You can also see this using Base.sizeof:
julia> sizeof(dat1)
24000000000
julia> sizeof(dat2)
8000000000
You'll see that dat1 is 3 times as large, as every array entry contains the 3 Float64s directly, while the entries in dat2 take up just the space for a pointer each.
As a side note: for these kinds of tests it is a good idea to use BenchmarkTools.@btime instead of the built-in @time. It removes the compilation overhead from the result and also runs your code multiple times in order to give you a more representative result:
@btime dat1 = [Foo(0.0, 0.0, 0.0) for i in 1:1e6]
2.237 ms (2 allocations: 22.89 MiB)
@btime dat2 = [Bar(0.0, 0.0, 0.0) for i in 1:1e6]
6.878 ms (1000002 allocations: 38.15 MiB)
As seen above, this is particularly useful to debug allocations. For dat1 we get 2 allocations (one for the Array instance itself and one for the chunk of memory where the array stores its data), while with dat2 we have an additional allocation per element.
I ran some examples and got some results: for a large number of iterations the parallel version gives a good result, but for a smaller number of iterations it does worse than the sequential version.
I know there is a little overhead and that's absolutely fine, but is there any way to run a loop with a small number of iterations in parallel faster than sequentially?
x = 0
@time for i=1:200000000
x = Int(rand(Bool)) + x
end
7.503359 seconds (200.00 M allocations: 2.980 GiB, 2.66% gc time)
x = @time @parallel (+) for i=1:200000000
Int(rand(Bool))
end
0.432549 seconds (3.91 k allocations: 241.138 KiB)
I got a good result for the parallel version here, but not in the following example.
x2 = 0
@time for i=1:100000
x2 = Int(rand(Bool)) + x2
end
0.006025 seconds (98.97 k allocations: 1.510 MiB)
x2 = @time @parallel (+) for i=1:100000
Int(rand(Bool))
end
0.084736 seconds (3.87 k allocations: 239.122 KiB)
Doing things in parallel will ALWAYS be less efficient in terms of total work, because parallelism always carries synchronization overhead. The hope is to get the result earlier in wall-clock time than a pure sequential run (one computer, single core).
Your numbers are astonishing, and I found the cause.
First of all, allow Julia to use all cores; then, in the REPL:
julia> nworkers()
4
# original case to get correct relative times
julia> x = 0
julia> @time for i=1:200000000
x = Int(rand(Bool)) + x
end
7.864891 seconds (200.00 M allocations: 2.980 GiB, 1.62% gc time)
julia> x = @time @parallel (+) for i=1:200000000
Int(rand(Bool))
end
0.350262 seconds (4.08 k allocations: 254.165 KiB)
99991471
# now a correct benchmark
julia> function test()
x = 0
for i=1:200000000
x = Int(rand(Bool)) + x
end
end
julia> @time test()
0.465478 seconds (4 allocations: 160 bytes)
What happened?
Your first test case uses a global variable x, and that is terribly slow: the loop accesses a slow global variable 200,000,000 times.
In the second test case, the global variable x is assigned just once, so the poor performance does not come into play.
In my test case there is no global variable; I used a local variable. Local variables are much faster (due to better compiler optimizations).
Q: is there any way to run some loop with less amount of iteration in parallel way better than sequential way?
A: Yes.
1) Acquire more resources (processors to compute, memory to store), if all this makes sense for the problem at hand.
2) Arrange the workflow smarter -- benefit from register-based code, harness full cache-lines on each first fetch, and deploy re-use where possible (hard work? yes, it is hard work, but why repetitively pay 150+ [ns] instead of paying it once and re-using well-aligned neighbouring cells within ~30 [ns] latency costs, if NUMA permits?). A smarter workflow also often means code re-design aimed at increasing the resulting assembly code's "density" of computations and tweaking the code so as to better bypass the (optimising) superscalar processor hardware design tricks, which are of no use / positive benefit in highly tuned HPC computing payloads.
3) Avoid running into blocking resources & bottlenecks (central singularities such as a host's unique hardware source of randomness, IO devices, et al.).
4) Get familiar with your optimising compiler's internal options and "shortcuts" -- sometimes anti-patterns get generated at the cost of extended run-times.
5) Get the maximum from tweaking the underlying operating system. Not doing this, your optimised code still waits (and a lot) in the O/S scheduler's queue.
How much memory does a bit-vector use in SBCL?
Does each bit take 1 bit of memory?
Does each bit take 1 byte of memory?
Does each bit take 1 word of memory?
Bit vectors in SBCL are stored efficiently with one bit per bit, plus some small housekeeping overhead per vector.
They are also very efficient at bitwise operations, working a full word at a time. For example, BIT-XOR on a 64-bit platform will work on 64 bits of a bit-vector at once.
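As a small illustration, the optional third argument T tells BIT-XOR to store the result destructively into the first vector, so no new vector is consed:
* (let ((a (make-array 8 :element-type 'bit :initial-contents '(1 0 1 0 1 0 1 0)))
        (b (make-array 8 :element-type 'bit :initial-contents '(1 1 1 1 0 0 0 0))))
    (bit-xor a b t))
#*01011010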
From Common Lisp one can ask whether there is a specialized array type for bit elements:
* (UPGRADED-ARRAY-ELEMENT-TYPE 'bit)
BIT
That means that when you ask for a bit vector, CL provides you with an actual bit vector and not, say, a vector with 8-bit elements.
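You can also check the type of an array you actually create; on SBCL this reports a specialized simple bit vector (the printed form may vary by implementation):
* (type-of (make-array 40 :element-type 'bit))
(SIMPLE-BIT-VECTOR 40)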
Size of an object in SBCL
Alastair Bridgewater provided this function as an attempt to get the 'size' of an object in SBCL:
(defun get-object-size/octets (object)
  ;; Keep the object from moving while we look at its raw address.
  (sb-sys:without-gcing
    (nth-value 2 (sb-vm::reconstitute-object
                  ;; Strip the pointer lowtag from the tagged address and
                  ;; shift it into the fixnum-encoded form the object walker
                  ;; expects; the third return value is the size in octets.
                  (ash (logandc1 sb-vm:lowtag-mask
                                 (sb-kernel:get-lisp-obj-address object))
                       (- sb-vm:n-fixnum-tag-bits))))))
* (get-object-size/octets (make-array 40 :element-type 'bit :initial-element 1))
32
* (get-object-size/octets (make-array 400 :element-type 'bit :initial-element 1))
80
* (get-object-size/octets (make-array 4000 :element-type 'bit :initial-element 1))
528
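Those figures are consistent with one bit per element. Assuming (as a rough sketch of the 64-bit layout) that the payload is rounded up to whole machine words and that each vector carries a two-word header with double-word alignment, the expected sizes match:
(defun approx-bit-vector-octets (nbits)
  ;; payload rounded to whole words, plus a two-word header,
  ;; rounded up to double-word alignment -- an estimate only
  (let* ((data-words (ceiling nbits sb-vm:n-word-bits))
         (raw-octets (* sb-vm:n-word-bytes (+ 2 data-words)))
         (alignment  (* 2 sb-vm:n-word-bytes)))
    (* alignment (ceiling raw-octets alignment))))
* (mapcar #'approx-bit-vector-octets '(40 400 4000))
(32 80 528)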
I'm using Antik and GSLL for matrix calculations.
Sometimes I want to do arithmetic operations on a subgrid (for example, multiply a column of a matrix by 2.0). Right now I have to write code like this:
(setf (grid:column mat 0) (gsll:elt* (grid:column mat 0) 2.0))
With this code, Antik first has to create a temporary grid to store the intermediate results and then set it back into the original mat. Creating the temporary grid can be slow when the matrix is huge. So I hope I can do this directly on the original mat.
PS: gsll:elt* does in-place modification of the whole matrix, e.g. (gsll:elt* mat 2.0), and this is very efficient even when mat is huge.
Update:
I'm showing the result of my experiment here:
Code:
(let ((mat (grid:make-simple-grid
:grid-type 'grid:foreign-array
:initial-contents
(loop repeat 100 collect
(loop repeat 100 collect
(random 100))))))
(time (loop repeat 1000 do
(gsll:elt* mat 2.0)
(gsll:elt* mat 0.5)))
(time (loop repeat 1000 do
(setf (grid:row mat 0) (gsll:elt* (grid:row mat 0) 2.0))
(setf (grid:row mat 0) (gsll:elt* (grid:row mat 0) 0.5)))))
Result:
Evaluation took:
0.016 seconds of real time
0.016000 seconds of total run time (0.016000 user, 0.000000 system)
100.00% CPU
46,455,124 processor cycles
353,280 bytes consed
Evaluation took:
0.446 seconds of real time
0.444000 seconds of total run time (0.420000 user, 0.024000 system)
[ Run times consist of 0.008 seconds GC time, and 0.436 seconds non-GC time. ]
99.55% CPU
1,308,042,508 processor cycles
102,275,168 bytes consed
Note that the former calculation operates on the whole 100x100 matrix, yet it is even faster than the latter one (which touches only a 1x100 sub-matrix). The latter one also consed many more bytes due to the allocation of temporary storage.
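A workaround I could try is to scale the row element by element directly on the original mat. A rough sketch, assuming grid:aref (grid:gref in older Antik versions) is the element accessor and grid:dim1 gives the number of columns:
(defun scale-row! (mat row factor)
  ;; multiply one row of MAT in place; conses no temporary grid
  (loop for j below (grid:dim1 mat)
        do (setf (grid:aref mat row j)
                 (* factor (grid:aref mat row j))))
  mat)
;; usage: (scale-row! mat 0 2.0)
Per-element access on a foreign array still goes through the foreign interface, so this may not be as fast as a single gsll:elt* over the whole matrix, but it avoids consing the intermediate grid.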