I have Julia code (version 1.2) that performs a lot of operations on a 10000 x 10000 Array. Because I get an OutOfMemory() error when I run it, I'm exploring other options such as memory-mapping. Concerning the use of Mmap.mmap, I'm a bit confused about how to use the Array that I map to disk, since https://docs.julialang.org/en/v1/stdlib/Mmap/index.html gives little explanation. Here is the beginning of my code:
using Distances
using LinearAlgebra
using Distributions
using Mmap
data=Float32.(rand(10000,15))
Eucldist=pairwise(Euclidean(),data,dims=1)
D=maximum(Eucldist.^2)
sigma2hat=mean(((Eucldist.^2)./D)[tril!(trues(size((Eucldist.^2)./D)),-1)])
L=exp.(-(Eucldist.^2/D)/(2*sigma2hat))
L is the 10000 x 10000 Array with which I want to work, so I mapped it to my disk with
s = open("mmap.bin", "w+")
write(s, size(L,1))
write(s, size(L,2))
write(s, L)
close(s)
What am I supposed to do after that? The next step is to perform K=eigen(L) and apply other commands to K. Should that be K=eigen(L) or K=eigen(s)? What is the role of the object s, and when does it get involved? Moreover, I don't understand why I have to use Mmap.sync!, and when: after each line that follows eigen(L)? At the end of the code? How can I be sure that I'm using disk space instead of RAM? I would appreciate some pointers about memory-mapping. Thank you!
If memory usage is a concern, it is often best to re-assign your very large arrays to something small (for example a 2 x 2 matrix of the same element type), so that the old memory can be garbage collected, assuming you are done with those intermediate matrices. After that, call Mmap.mmap() on the stored data file, passing the array type and the dimensions of the data as the second and third arguments, and assign the return value to your variable, in this case L, so that L is bound to the file contents:
using Distances
using LinearAlgebra
using Distributions
using Mmap
function testmmap()
data = Float32.(rand(10000, 15))
Eucldist = pairwise(Euclidean(), data, dims=1)
D = maximum(Eucldist.^2)
sigma2hat = mean(((Eucldist.^2) ./ D)[tril!(trues(size((Eucldist.^2) ./ D)), -1)])
L = exp.(-(Eucldist.^2 / D) / (2 * sigma2hat))
s = open("./tmp/mmap.bin", "w+")
write(s, size(L,1))
write(s, size(L,2))
write(s, L)
close(s)
# deref and gc collect
Eucldist = data = L = zeros(Float32, 2, 2)
GC.gc()
s = open("./tmp/mmap.bin", "r+") # allow read and write
m = read(s, Int)
n = read(s, Int)
L = Mmap.mmap(s, Matrix{Float32}, (m, n)) # now L references the file contents
K = eigen(L)
K
end
testmmap()
@time testmmap() # 109.657995 seconds (17.48 k allocations: 4.673 GiB, 0.73% gc time)
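Regarding Mmap.sync!: it is only needed if you write through the mapped array and want those changes flushed to the file at a particular point (the OS flushes dirty pages on its own eventually). Also note that K = eigen(L) allocates ordinary in-memory arrays for the eigenvalues and eigenvectors; only L itself is file-backed. A minimal sketch of writing through the map, reusing the file written above:
s = open("./tmp/mmap.bin", "r+")           # "r+" so writes through the map are permitted
m = read(s, Int)
n = read(s, Int)
L = Mmap.mmap(s, Matrix{Float32}, (m, n))  # backed by the file, not by a fresh allocation
L[1, 1] = 0.0f0                            # modifies the file-backed memory in place
Mmap.sync!(L)                              # force the modified pages to be written to disk
close(s)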
Related
I have a big file (75GB) memory-mapped into an array d that I want to copy into another array m. Because I do not have 75GB of RAM available, I did:
for (i,v) in enumerate(d)
m[i] = v
end
in order to copy the file value by value. But I get a copy rate of ~2 MB/s on an SSD, where I would expect at least 50 MB/s for both reads and writes.
How could I optimize this copy rate?
=== [edit] ===
Following the suggestions in the comments, I changed my code to the following, which sped up the write rate to 15 MB/s:
function copydcimg(m::Array{UInt16,4}, d::Dcimg)
m .= d
Mmap.sync!(m)
end
copydcimg(m,d)
At this point, I think I should optimize the Dcimg code. This binary file is made of frames spaced by a timestamp. Here is the code I use to access the frames:
module dcimg
using Mmap
using TOML
struct Dcimg <: AbstractArray{UInt16,4} # struct allowing to access dcimg file
filename::String # filename of the dcimg
header::Int # header size in bytes
clock::Int # clock size in bytes
x::Int
y::Int
z::Int
t::Int
m # linear memory map
Dcimg(filename, header, clock, x, y, z, t) =
new(filename, header, clock, x, y, z, t,
Mmap.mmap(open(filename), Array{UInt16, 3},
(x*y+clock÷sizeof(UInt16), z, t), header)
)
end
# the following functions allow a Dcimg to be accessed like an Array
Base.size(D::Dcimg) = (D.x, D.y, D.z, D.t)
# skip clock
Base.getindex(D::Dcimg, i::Int) =
D.m[i + (i ÷ (D.x*D.y))*D.clock÷sizeof(UInt16)]
Base.getindex(D::Dcimg, x::Int, y::Int, z::Int, t::Int) =
D[x + D.x*((y-1) + D.y*((z-1) + D.z*(t-1)))]
# constructor that automatically parses the sizes from the companion TOML file
function Dcimg(pathtag)
p = TOML.parsefile(pathtag * ".toml")
return Dcimg(pathtag * ".dcimg",
# ...
)
end
export Dcimg, getframe
end
I got it! The solution was to copy the file chunk by chunk, say frame by frame (around 1024×720 UInt16 values). This way I reached 300 MB/s, which I didn't even know was possible on a single thread. Here is the code.
In module dcimg, I added the methods to access the file frame by frame
# get frame number n (starting from 1)
getframe(D::Dcimg,n::Int) =
reshape(D.m[
D.x*D.y*(n-1)+1 + (n-1)*D.clock÷sizeof(UInt16) : # cosmetic line break
D.x*D.y*n + (n-1)*D.clock÷sizeof(UInt16)
], D.x, D.y)
# get frame for layer z, time t (starting from 1)
getframe(D::Dcimg,z::Int,t::Int) =
getframe(D, z + D.z*(t-1))
Then I iterate over the frames within a loop:
function copyframes(m::Array{UInt16,4}, d::Dcimg)
N = d.z*d.t
F = d.x*d.y
for i in 1:N
m[(i-1)*F+1:i*F] = getframe(d, i)
end
end
copyframes(m,d)
Thanks all in comments for leading me to this.
===== edit =====
For further reading, you might look at:
dd: How to calculate optimal blocksize?
http://blog.tdg5.com/tuning-dd-block-size/
which give hints about the optimal block size to copy at a time.
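For reference, the same block-size idea written as a rough Julia sketch: blockcopy! and its default block size are just illustrative placeholders, and the blocksize argument is the tuning knob those links discuss.
function blockcopy!(dst, src; blocksize::Int = 2^20)
    length(dst) == length(src) || throw(DimensionMismatch("dst and src must have the same length"))
    i = 1
    while i <= length(src)
        j = min(i + blocksize - 1, length(src))
        copyto!(dst, i, src, i, j - i + 1)   # copy one contiguous block via linear indexing
        i = j + 1
    end
    return dst
end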
The following functions parallelize the processing of a list by first decomposing a list into large chunks and then processing each chunk.
let chunkList chunkSize (xs : list<'T>) =
query {
for idx in 0..(xs.Length - 1) do
groupBy (idx / chunkSize) into g
select (g |> Seq.map (fun idx -> xs.[idx]))
}
let par (foo: 'T -> 'S) (xs: list<'T>) =
xs
|> List.map (fun x -> async { return foo x })
|> Async.Parallel
|> Async.RunSynchronously
|> Array.toList
let parChunks chunkSize (f: 'T -> 'S) (xs: list<'T>) =
chunkList chunkSize xs |> Seq.map List.ofSeq |> List.ofSeq
|> par (List.map f)
|> List.concat
This function was used to test parChunks:
let g i = [1..1000000] |> List.map (fun x -> sqrt (float (1000 * x + 1))) |> List.head
Running the standard List.map and parChunks with a chunk size equal to half the list size, there was a performance gain:
List.map g [1..100];;
// Real: 00:00:28.979, CPU: 00:00:29.562
parChunks 50 g [1..100];;
// Real: 00:00:23.027, CPU: 00:00:24.687
However, with a chunk size equal to 1/4 the size of the list the performance was almost the same. I did not expect this since my processor (Intel 6700HQ) has four cores.
parChunks 25 g [1..100];;
// Real: 00:00:21.695, CPU: 00:00:24.437
Looking at the Performance tab in Task Manager, one sees that the four cores are never all in use.
Is there a way to make all four cores participate in this computation?
I think you are overcomplicating this problem.
The primary use of async workflows is not for CPU-bound work, it's for IO-bound work to avoid blocking threads while awaiting results that will arrive with some latency.
Although you can parallelise CPU-bound work using async, doing so is suboptimal.
What you want can be far more easily achieved by using the Array.Parallel module on Arrays rather than Lists.
let g i =
[|1..1000000|]
|> Array.Parallel.map (fun x -> sqrt (float (1000 * x + 1)))
|> Array.head
No need to write your own chunking and merging code, that's all handled for you and, by my measurements, it's much much faster.
In F#, async workflows run using the .Net ThreadPool class, which has GetMinThreads and GetMaxThreads methods. They use two out parameters to return the minimum or maximum number of threads that the thread pool is allowed to use, but in F# that gets converted to a function returning a tuple:
F# Interactive for F# 4.1
Freely distributed under the Apache 2.0 Open Source License
For help type #help;;
> open System.Threading ;;
> ThreadPool.GetMinThreads() ;;
val it : int * int = (4, 4)
> ThreadPool.GetMaxThreads() ;;
val it : int * int = (400, 200)
The two numbers are for "worker" threads and "asynchronous I/O" threads, respectively. My CPU has four cores, so the minimum number of both kinds of threads in the pool is 4. I don't know for certain that this is your problem, but try running ThreadPool.GetMinThreads() on your system and make sure that it's 4. If it's 2 for some reason, that could explain why you're not getting better performance.
See also https://stackoverflow.com/a/26041852/2314532 for an explanation of another possible performance problem with using async workflows for parallel processing. That could also be what's happening here.
Finally, there's one more thing I want to mention. As it currently stands, I'm genuinely surprised that you're getting any benefit out of your parallelism. That's because there's a cost to dividing up the list and concatenating it again. Since the F# list type is a singly-linked list, that cost is O(N), and those steps (divide and reassemble) cannot be parallelized.
The answer to that problem is to use a different data structure, like an RRB Tree, for any list of items that you plan to process in parallel: it's designed to be split and concatenated efficiently (effectively O(1) splits and joins, though the constant factor in joins is rather large). Unfortunately, there's currently no implementation of RRB trees in F#. I'm currently working on one, and estimate it may be ready in another month or so. You can subscribe to this GitHub issue if you want to find out when I've released the code I've been working on.
Good answers here but I will add some comments when it comes to performance and parallelism.
For performance in general, we like to avoid dynamic allocations because we don't want to waste precious cycles allocating objects (quite fast in .NET, slow in C/C++) or collecting them (quite slow).
We also like to minimize the memory footprint of objects and make sure they lay sequentially in memory (Arrays are our friends here) in order to make as efficient use of the CPU cache and prefetcher as possible. A cache miss might cost several hundred cycles.
I think it is important to always compare against a trivial, sequential yet efficiently implemented loop in order to have some sanity check of the parallel performance. Otherwise we might trick ourselves into thinking our parallel masterpiece is doing well when in reality it's outclassed by a simple loop.
It is also worth varying the size of the input data, both because of caching effects and because there is overhead in starting up a parallel computation.
With that said, I have prepared different versions of the following code:
module SequentialFold =
let compute (vs : float []) : float =
vs |> Array.fold (fun s v -> s + sqrt (1000. * v + 1.)) 0.
Then I compare the different versions on varying input sizes, in terms of both runtime and GC pressure, to see which does best.
The performance test is done in such a way that the total amount of work is always the same regardless of input size in order to make times comparable.
Here is the code:
open System
open System.Threading.Tasks
let clock =
let sw = System.Diagnostics.Stopwatch ()
sw.Start ()
fun () -> sw.ElapsedMilliseconds
let timeIt n a =
let r = a () // Warm-up
GC.Collect (2, GCCollectionMode.Forced, true)
let inline cc g = GC.CollectionCount g
let bcc0, bcc1, bcc2 = cc 0, cc 1, cc 2
let before = clock ()
for i = 1 to n do
a () |> ignore
let after = clock ()
let acc0, acc1, acc2 = cc 0, cc 1, cc 2
after - before, acc0 - bcc0, acc1 - bcc1, acc2 - bcc2, r
// compute implemented using tail recursion
module TailRecursion =
let compute (vs : float []) : float =
let rec loop s i =
if i < vs.Length then
let v = vs.[i]
loop (s + sqrt (1000. * v + 1.)) (i + 1)
else
s
loop 0. 0
// compute implemented using Array.fold
module SequentialFold =
let compute (vs : float []) : float =
vs |> Array.fold (fun s v -> s + sqrt (1000. * v + 1.)) 0.
// compute implemented using Array.map + Array.fold
module SequentialArray =
let compute (vs : float []) : float =
vs |> Array.map (fun v -> sqrt (1000. * v + 1.)) |> Array.fold (+) 0.
// compute implemented using Array.Parallel.map + Array.fold
module ParallelArray =
let compute (vs : float []) : float =
vs |> Array.Parallel.map (fun v -> sqrt (1000. * v + 1.)) |> Array.fold (+) 0.
// compute implemented using Parallel.For
module ParallelFor =
let compute (vs : float []) : float =
let lockObj = obj ()
let mutable sum = 0.
let options = ParallelOptions()
let init () = 0.
let body i pls s =
let v = i |> float
s + sqrt (1000. * v + 1.)
let localFinally ls =
lock lockObj <| fun () -> sum <- sum + ls
let pls = Parallel.For ( 0
, vs.Length
, options
, Func<float> init
, Func<int, ParallelLoopState, float, float> body
, Action<float> localFinally
)
sum
// compute implemented using Parallel.For with batches of size 100
module ParallelForBatched =
let compute (vs : float []) : float =
let inner = 100
let outer = vs.Length / inner + (if vs.Length % inner = 0 then 0 else 1)
let lockObj = obj ()
let mutable sum = 0.
let options = ParallelOptions()
let init () = 0.
let rec loop e s i =
if i < e then
let v = vs.[i]
loop e (s + sqrt (1000. * v + 1.)) (i + 1)
else
s
let body i pls s =
let b = i * inner
let e = b + inner |> min vs.Length
loop e s b
let localFinally ls =
lock lockObj <| fun () -> sum <- sum + ls
let pls = Parallel.For ( 0
, outer
, options
, Func<float> init
, Func<int, ParallelLoopState, float, float> body
, Action<float> localFinally
)
sum
[<EntryPoint>]
let main argv =
let count = 100000000
let outers =
[|
//10000000
100000
1000
10
|]
for outer in outers do
let inner = count / outer
let vs = Array.init inner float
let testCases =
[|
"TailRecursion" , fun () -> TailRecursion.compute vs
"Fold.Sequential" , fun () -> SequentialFold.compute vs
"Array.Sequential" , fun () -> SequentialArray.compute vs
"Array.Parallel" , fun () -> ParallelArray.compute vs
"Parallel.For" , fun () -> ParallelFor.compute vs
"Parallel.For.Batched" , fun () -> ParallelForBatched.compute vs
|]
printfn "Using outer = %A, inner = %A, total is: %A" outer inner count
for nm, a in testCases do
printfn " Running test case: %A" nm
let tm, cc0, cc1, cc2, r = timeIt outer a
printfn " it took %A ms with GC collects (%A, %A, %A), result is: %A" tm cc0 cc1 cc2 r
0
And here are the results (Intel I5, 4 cores):
Using outer = 100000, inner = 1000, total is: 100000000
Running test case: "TailRecursion"
it took 389L ms with GC collects (0, 0, 0), result is: 666162.111
Running test case: "Fold.Sequential"
it took 388L ms with GC collects (0, 0, 0), result is: 666162.111
Running test case: "Array.Sequential"
it took 628L ms with GC collects (255, 0, 0), result is: 666162.111
Running test case: "Array.Parallel"
it took 993L ms with GC collects (306, 2, 0), result is: 666162.111
Running test case: "Parallel.For"
it took 711L ms with GC collects (54, 2, 0), result is: 666162.111
Running test case: "Parallel.For.Batched"
it took 490L ms with GC collects (52, 2, 0), result is: 666162.111
Using outer = 1000, inner = 100000, total is: 100000000
Running test case: "TailRecursion"
it took 389L ms with GC collects (0, 0, 0), result is: 666661671.1
Running test case: "Fold.Sequential"
it took 388L ms with GC collects (0, 0, 0), result is: 666661671.1
Running test case: "Array.Sequential"
it took 738L ms with GC collects (249, 249, 249), result is: 666661671.1
Running test case: "Array.Parallel"
it took 565L ms with GC collects (249, 249, 249), result is: 666661671.1
Running test case: "Parallel.For"
it took 157L ms with GC collects (0, 0, 0), result is: 666661671.1
Running test case: "Parallel.For.Batched"
it took 110L ms with GC collects (0, 0, 0), result is: 666661671.1
Using outer = 10, inner = 10000000, total is: 100000000
Running test case: "TailRecursion"
it took 387L ms with GC collects (0, 0, 0), result is: 6.666666168e+11
Running test case: "Fold.Sequential"
it took 390L ms with GC collects (0, 0, 0), result is: 6.666666168e+11
Running test case: "Array.Sequential"
it took 811L ms with GC collects (3, 3, 3), result is: 6.666666168e+11
Running test case: "Array.Parallel"
it took 567L ms with GC collects (4, 4, 4), result is: 6.666666168e+11
Running test case: "Parallel.For"
it took 151L ms with GC collects (0, 0, 0), result is: 6.666666168e+11
Running test case: "Parallel.For.Batched"
it took 102L ms with GC collects (0, 0, 0), result is: 6.666666168e+11
TailRecursion and Fold.Sequential have similar performance.
Array.Sequential does worse because the job is split into two operations, map and fold. In addition, we get GC pressure because it allocates an extra array.
Array.Parallel is the same as Array.Sequential but uses Array.Parallel.map instead of Array.map. Here we see that there is an overhead to starting many small parallel computations: smaller input sizes generate more parallel computations, which costs significantly more performance. In addition, the performance is poor even though we use multiple cores, because the computation per element is very small and any benefit of spreading the job over several cores is consumed by the overhead of managing the distribution. Comparing the single-thread time of 390 ms with the parallel time of 990 ms, one might be surprised that it is only 3x worse, but in reality it is 12x worse, since all 4 cores are used to produce an answer 3x more slowly.
Parallel.For does better because it allows the parallel computation to take place without allocating a new array, and its internal overhead is likely lower. Here we manage to gain performance for larger sizes, but it still lags behind the sequential algorithms for smaller sizes because of the overhead of starting parallel computations.
Parallel.For.Batched tries to reduce that overhead by increasing the cost of each individual computation: it folds several array values in each parallel computation, essentially combining the TailRecursion algorithm with Parallel.For. Thanks to this we manage to hit an efficiency of 95% for larger sizes, which can be considered decent.
For a simple computation like this, AVX could be used as well, leading to a potential speedup of around 16x; the cost is that the code would get even hairier.
With a batched parallel for, we reached 95% of the expected speedup.
The point of this is that it's important to continuously measure performance of your parallel algorithms and compare them against trivial sequential implementations.
Question: I have a new type, type MyFloat; x::Float64; end. I want to perform a deepcopy on a Vector{MyFloat}. Using Julia v0.5.0 on Ubuntu 16.04, the operation runs roughly 150 times slower than a deepcopy call on an equivalent-length Vector{Float64}. Is it possible to speed up a deepcopy on my Vector{MyFloat}?
Code snippet: The 150 times slowdown can be seen with the following code snippet which can be pasted to the REPL:
#Just my own floating point type
type MyFloat
x::Float64
end
#This function performs N deepcopy operations on a Vector{MyFloat} of length J
function f1(J::Int, N::Int)
v = MyFloat.(rand(J))
x = [ deepcopy(v) for n = 1:N ]
end
#The same as f1, but on Vector{Float64} instead of Vector{MyFloat}
function f2(J::Int, N::Int)
v = rand(J)
x = [ deepcopy(v) for n = 1:N ]
end
#Pre-compilation step
f1(2, 2);
f2(2, 2);
#Timings
@time f1(100, 15000);
@time f2(100, 15000);
On my machine this produces:
julia> @time f1(100, 15000);
1.944410 seconds (4.61 M allocations: 167.888 MB, 7.72% gc time)
julia> @time f2(100, 15000);
0.013513 seconds (45.01 k allocations: 19.113 MB, 78.80% gc time)
Looking at the answer here it sounds like I can speed things up by defining my own copy method for MyFloat. I've tried things like:
Base.deepcopy(x::MyFloat)::MyFloat = MyFloat(x.x);
Base.deepcopy(v::Vector{MyFloat})::Vector{MyFloat} = [ MyFloat(y.x) for y in v ]
Base.copy(x::MyFloat)::MyFloat = MyFloat(x.x)
Base.copy(v::Vector{MyFloat})::Vector{MyFloat} = [ MyFloat(y.x) for y in v ]
but this doesn't make any difference.
Final note: Letting a = MyFloat.([1.0, 2.0]), I could just use b = copy(a) and there is no speed penalty. This is fine, as long as I am careful to only ever do operations like b[1] = MyFloat(3.0) (which will modify b but not a). But if I get sloppy and accidentally write b[1].x = 3.0, then this will modify both a and b.
By the way, it is entirely possible that I do not have a deep understanding of the differences between copy and deepcopy... I have read this great blog post (thanks @ChrisRackauckas), but I'm certainly a bit fuzzy about what is happening at a deeper level.
Try changing type MyFloat in the definition to immutable MyFloat or struct MyFloat (the keyword changed in 0.6). This makes the times almost equal.
As @Gnimuc mentioned, a mutable type, which is not a bitstype, makes Julia keep track of a lot of other stuff. See here and in the comments.
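For instance, re-running the question's benchmark with an immutable definition might look like the sketch below (Julia 0.5 syntax, as in the question; on 0.6+ the keyword is struct). This needs a fresh session, since a type cannot be redefined with a different mutability.
immutable MyFloat
    x::Float64
end
function f1(J::Int, N::Int)
    v = MyFloat.(rand(J))
    x = [ deepcopy(v) for n = 1:N ]
end
f1(2, 2)               # warm-up / compilation
@time f1(100, 15000)   # timing should now be close to the Vector{Float64} case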
Following up on How to add vectors to the columns of some array in Julia?, I would like some analogous clarifications for DataArrays.
Let y = randn(100, 2). I would like to create a matrix x containing the lagged values (with lags > 0) of y. I have already written code that seems to work properly (see below). I was wondering whether there is a better way of concatenating a DataArray than the one I have used.
T, n = size(y);
x = @data(zeros(T-lags, 0));
for lag in 1:lags
x = hcat(x, y[lags-lag+1:end-lag, :]);
end
Unless there is a specific reason to do otherwise, my recommendation would be to start with your DataArray x being the size that you want it to be and then fill in the column values you want.
This will give you better performance than if you need to recreate the DataArray for each new column, which is what any method for "adding" columns will actually be doing. It's conceivable that the DataArrays package has some prettier syntax for it than what you have in your question, but fundamentally that's what it would still be doing.
Thus, in a simplified version of your example, I would recommend:
using DataArrays
N = 5; T = 10;
X = @data(zeros(T, N));
initial_data_cols = 2; ## specify how much of the initial data is filled in
lags = size(X,2) - initial_data_cols
X[:,1:initial_data_cols] = rand(size(X,1), initial_data_cols) ## First two columns of X are fixed in advance
for lag in 1:lags
X[:,(lag+initial_data_cols)] = rand(size(X,1))
end
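Applied to the lag construction in the question, a preallocate-and-fill version might look like this sketch (assuming y and lags are defined as in the question; the column layout groups the n columns for each lag together):
using DataArrays
T, n = size(y)
x = @data(zeros(T - lags, n * lags))        # preallocate the whole result once
for lag in 1:lags
    cols = (lag - 1) * n + 1 : lag * n      # columns holding the lag-th lag of y
    x[:, cols] = y[lags - lag + 1 : end - lag, :]
end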
If you did find yourself in a situation where you need to add columns to an already created object, you could improve somewhat upon the code that you have by first creating all of the new objects together and then doing a single addition of them to your initial DataArray. E.g.
X = @data(zeros(10, 2))
X = [X rand(10,3)]
For instance, consider the difference in execution time, and number and quantity of memory allocations in the two examples below:
n = 10^5; m = 10;
A = @data rand(n,m);
n_newcol = 10;
function t1(A::AbstractArray, n_newcol)
n = size(A,1)
for idx = 1:n_newcol
A = hcat(A, zeros(n))
end
return A
end
function t2(A::AbstractArray, n_newcol)
n = size(A,1)
[A zeros(n, n_newcol)]
end
# Stats after running each function once to compile
@time r1 = t1(A, n_newcol); ## 0.154082 seconds (124 allocations: 125.888 MB, 75.33% gc time)
@time r2 = t2(A, n_newcol); ## 0.007981 seconds (9 allocations: 22.889 MB, 31.73% gc time)
I want to find the key corresponding to the min or max value of a dictionary in Julia. In Python I would do the following:
my_dict = {1:20, 2:10}
min(my_dict, key=my_dict.get)
Which would return the key 2.
How can I do the same in Julia?
my_dict = Dict(1=>20, 2=>10)
minimum(my_dict)
The latter returns 1=>20 instead of 2=>10 or 2.
You could use reduce like this, which returns the key of the smallest value in d (the first such key encountered, in case of ties):
reduce((x, y) -> d[x] ≤ d[y] ? x : y, keys(d))
This only works for non-empty Dicts, though. (But the notion of the “key of the minimal value of no values” does not really make sense, so that case should usually be handled separately anyway.)
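For example, with the dictionary from the question (here bound to d):
d = Dict(1 => 20, 2 => 10)
reduce((x, y) -> d[x] ≤ d[y] ? x : y, keys(d))   # returns 2, the key of the smallest value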
Edit regarding efficiency.
Consider these definitions (none of which handle empty collections)...
m1(d) = reduce((x, y) -> d[x] ≤ d[y] ? x : y, keys(d))
m2(d) = collect(keys(d))[indmin(collect(values(d)))]
function m3(d)
minindex(x, y) = d[x] ≤ d[y] ? x : y
reduce(minindex, keys(d))
end
function m4(d)
minkey, minvalue = next(d, start(d))[1]
for (key, value) in d
if value < minvalue
minkey = key
minvalue = value
end
end
minkey
end
...along with this code:
function benchmark(n)
d = Dict{Int, Int}(1 => 1)
m1(d); m2(d); m3(d); m4(d) # force compilation
while length(d) < n
setindex!(d, rand(-n:n), rand(-n:n))
end
@time m1(d)
@time m2(d)
@time m3(d)
@time m4(d)
end
Calling benchmark(10000000) will print something like this:
1.455388 seconds (30.00 M allocations: 457.748 MB, 4.30% gc time)
0.380472 seconds (6 allocations: 152.588 MB, 0.21% gc time)
0.982006 seconds (10.00 M allocations: 152.581 MB, 0.49% gc time)
0.204604 seconds
From this we can see that m2 (from user3580870's answer) is indeed faster than my original solution m1, by a factor of around 3 to 4, and it also uses less memory. This is apparently due to function call overhead, but also to the fact that the λ expression in m1 is not optimized very well. We can alleviate the second problem by defining a helper function as in m3, which is better than m1 but not as good as m2.
However, m2 still allocates O(n) memory, which can be avoided: If you really need the efficiency, you should use an explicit loop like in m4, which allocates almost no memory and is also faster.
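Note that m4 uses the pre-0.7 iteration protocol (start/next). On Julia 1.x the same explicit-loop idea might be written as in the following sketch (m4_v1 is just an illustrative name):
function m4_v1(d)
    minkey, minvalue = first(d)   # d must be non-empty
    for (key, value) in d
        if value < minvalue
            minkey = key
            minvalue = value
        end
    end
    return minkey
end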
another option is:
collect(keys(d))[indmin(collect(values(d)))]
It depends on the keys and values iterators returning elements in matching order, which is not guaranteed in general but does in fact hold for Dicts (and is guaranteed for OrderedDicts). Like the reduce answer, it requires d to be non-empty.
Why mention this when the reduce answer pretty much nails it? Because it is 3 to 4 times faster (at least on my computer)!
Here is another way to find the minimum, with both the key and the value:
my_dict = Dict(1 => 20, 2 =>10)
findmin(my_dict) gives the output below:
(10, 2)
To get only the key, use
findmin(my_dict)[2]
To get only the value, use
findmin(my_dict)[1]
Hope this helps.
If you only need the minimum value, you can use
minimum(values(my_dict))
If you need the key as well, I don't know a built-in function to do so, but you can easily write it yourself for numeric keys and values:
function find_min_key{K,V}(d::Dict{K,V})
minkey = typemax(K)
minval = typemax(V)
for key in keys(d)
if d[key] < minval
minkey = key
minval = d[key]
end
end
minkey => minval
end
my_dict = Dict(1=>20, 2=>10)
find_min_key(my_dict)
findmax(dict)[2]
findmin(dict)[2]
These return the key corresponding to the max and min value, respectively; here [2] is the index of the key in the returned tuple.