I wanted to have a look at the Julia language, so I wrote a little script to import a dataset I'm working with. But when I run the script, it turns out to be much slower than a similar script in R.
When I profile it, the profiler reports that all the cat commands perform badly.
The files look like this:
Identifier1 data_string1
Identifier2 data_string2
Identifier3 data_string3
Identifier4 data_string4
I primarily want to get the data_strings and split them up into a matrix of single characters.
This is a more or less minimal code example:
function loadfile()
    f = open("/file1")
    first = true
    m = Array(Any, 1,0)
    for ln in eachline(f)
        if ln[1] != '#' && ln[1] != '\n' && ln[1] != '/'
            s = split(ln[1:end-1])
            s = split(s[2],"")
            if first
                m = reshape(s,1,length(s))
                first = false
            else
                s = reshape(s,1,length(s))
                m = vcat(m, s)
            end
        end
    end
end
Any idea why Julia might be slow with the cat command, or how I can do it differently?
Thanks for any suggestions!
Using cat like that is slow because it requires a lot of memory allocations. Every vcat allocates a whole new array m that is mostly the same as the old m. Here is how I'd rewrite your code in a more Julian way, where m is only created at the end:
function loadfile2()
    f = open("./sotest.txt","r")
    first = true
    lines = Any[]
    for ln in eachline(f)
        # Skip comment lines, empty lines, and lines starting with '/'
        if ln[1] == '#' || ln[1] == '\n' || ln[1] == '/'
            continue
        end
        data_str = split(ln[1:end-1]," ")[2]
        data_chars = split(data_str,"")
        # Can make even faster (2x in my tests) with
        # data_chars = [data_str[i] for i in 1:length(data_str)]
        # But this inherently assumes ASCII data
        push!(lines, data_chars)
    end
    m = hcat(lines...)' # Stick column vectors together then transpose
end
I made a 10,000 line version of your example data and found the following performance:
Old version:
elapsed time: 3.937826405 seconds (3900659448 bytes allocated, 43.81% gc time)
elapsed time: 3.581752309 seconds (3900645648 bytes allocated, 36.02% gc time)
elapsed time: 3.57753696 seconds (3900645648 bytes allocated, 37.52% gc time)
New version:
elapsed time: 0.010351067 seconds (11568448 bytes allocated)
elapsed time: 0.011136188 seconds (11568448 bytes allocated)
elapsed time: 0.010654002 seconds (11568448 bytes allocated)
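For readers on Julia 1.x, the same "collect first, build the matrix once" idea can be sketched as follows. This is only an illustration under assumptions: the function name load_chars is hypothetical, the input is read from any IO object rather than the asker's file, and eachline is assumed to strip the trailing newline (its default behaviour in Julia 1.x).

```julia
# Sketch: gather each data string as a vector of characters,
# then assemble the matrix in a single step at the end.
function load_chars(io::IO)
    rows = Vector{Vector{Char}}()
    for ln in eachline(io)                 # newline already stripped
        isempty(ln) && continue            # skip blank lines
        ln[1] in ('#', '/') && continue    # skip comment-like lines
        fields = split(ln)                 # split on whitespace
        length(fields) < 2 && continue     # no data string on this line
        push!(rows, collect(fields[2]))    # String -> Vector{Char}
    end
    # Each row vector becomes a column; permutedims turns columns into rows.
    return permutedims(reduce(hcat, rows))
end

io = IOBuffer("# comment\nId1 ACGT\nId2 TGCA\n\nId3 GGCC\n")
m = load_chars(io)
```

Here every data string is assumed to have the same length; otherwise hcat would throw a dimension error.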
I'd like to run heavy computations in Julia for a fixed duration, for example 10 seconds. I tried this:
timer = Timer(10.0)
while isopen(timer)
    computation()
end
But this does not work, since the computations never let Julia's task scheduler take control. So I added yield() in the loop:
timer = Timer(10.0)
while isopen(timer)
    computation()
    yield()
end
But now there is significant overhead from calling yield(), especially when one call to computation() is short. I guess I could call yield() and isopen() only every 1000 iterations or so, but I would prefer a solution where I would not have to tweak the number of iterations every time I change the computations. Any ideas?
The pattern below uses threads. On my laptop it adds a latency of around 35 ms per 1,000,000 checks, which is more than acceptable for most jobs.
Tested on Julia 1.5 release candidate:
function should_stop(timeout=10)
    handle = Threads.Atomic{Bool}(false)
    mytask = Threads.@spawn begin
        sleep(timeout)
        Threads.atomic_or!(handle, true)
    end
    handle
end
function do_some_job_with_timeout()
    handle = should_stop(5)
    res = BigInt() # save results to some object
    mytask = Threads.@spawn begin
        for i in 1:10_000_000
            #TODO some complex computations here
            res += 1 # mutate the result object
            handle.value && break
        end
    end
    wait(mytask) # wait for the job to complete
    res
end
You can also use Distributed instead. The code below seems to have much better latency: only about 1 ms per 1,000,000 timeout checks.
using Distributed
using SharedArrays
# assumes at least one worker process is available, e.g. after addprocs(1)
function get_termination_handle(timeout=5,workerid::Int=workers()[end])::SharedArray{Bool}
    handle = SharedArray{Bool}([false])
    proc = @spawnat workerid begin
        sleep(timeout)
        handle[1] = true
    end
    handle
end
function fun_within_timeout()
    res = 0
    h = get_termination_handle(0.1)
    for i = 1:100_000_000
        res += i % 2 == 0 ? 1 : 0
        h[1] && break
    end
    res
end
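A simpler variant, which is not taken from the answers above and is offered only as a sketch, is to compare against a deadline computed from time_ns() on every iteration. It needs neither threads nor worker processes, at the cost of one clock read per call to the computation. The name run_for and its arguments are hypothetical.

```julia
# Run `step()` repeatedly until roughly `seconds` have elapsed.
# Returns the number of completed iterations.
function run_for(step, seconds::Real)
    deadline = time_ns() + round(UInt64, seconds * 1e9)
    n = 0
    while time_ns() < deadline
        step()
        n += 1
    end
    return n
end

acc = Ref(0)
n = run_for(0.05) do
    acc[] += 1      # stand-in for the real computation
end
```

If a single call to the computation is very short, the clock read may dominate; in that case the flag-based approaches above may be the better fit.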
Let x = randn(100, 2). I want to write x to its own file. This file will contain x, and only x, and x will only ever be of type Matrix{Float64}. In the past, I have always used HDF5 for this, but it occurs to me that this is overkill, since in this setup I will only have one array per file. Note that JLD uses HDF5, and so is also overkill.
1) What is the fastest method for reading and writing x assuming I will only ever want to read the entire matrix?
2) What is the fastest method for reading and writing x assuming I might want to read a slice of the matrix?
3) What is the fastest method for reading and writing x assuming I might want to read a slice of the matrix, or over-write a slice of the matrix (but not change the matrix size)?
You could use the serialize function, provided you heed the warnings in the documentation about non-guarantees between versions etc.
serialize(stream::IO, value)
Write an arbitrary value to a stream in an opaque format, such that it can be read back by deserialize. The read-back value will be as identical as possible to the original. In general, this process will not work if the reading and writing are done by different versions of Julia, or an instance of Julia with a different system image. Ptr values are serialized as all-zero bit patterns (NULL).
An 8-byte identifying header is written to the stream first. To avoid writing the header, construct a SerializationState and use it as the first argument to serialize instead. See also Serializer.writeheader.
Really though, JLD (or in fact, its successor, JLD2) is generally the recommended way*.
*Of particular interest to you might be the statements that: "JLD2 saves and loads Julia data structures in a format comprising a subset of HDF5, without any dependency on the HDF5 C library" and that "it typically outperforms the previous JLD package (sometimes by multiple orders of magnitude) and often outperforms Julia's built-in serializer".
Based on the suggestions made by Tasos above, I put together a rudimentary speed test for both writes and reads using 4 different methods:
h5 (using the HDF5 package)
jld (using the JLD2 package)
slz (using serialize and deserialize)
dat (write to a binary file, using the first 128 bits to store the dimension of the matrix)
I've pasted the test code at the bottom of this answer. The results are:
julia> @time f_write_test(N, "h5")
0.191555 seconds (2.11 k allocations: 76.380 MiB, 26.39% gc time)
julia> @time f_write_test(N, "jld")
0.774857 seconds (8.33 k allocations: 77.058 MiB, 0.32% gc time)
julia> @time f_write_test(N, "slz")
0.108687 seconds (2.61 k allocations: 76.495 MiB, 1.91% gc time)
julia> @time f_write_test(N, "dat")
0.087488 seconds (1.61 k allocations: 76.379 MiB, 1.08% gc time)
julia> @time f_read_test(N, "h5")
0.051646 seconds (5.81 k allocations: 76.515 MiB, 14.80% gc time)
julia> @time f_read_test(N, "jld")
0.071249 seconds (10.04 k allocations: 77.136 MiB, 57.60% gc time)
julia> @time f_read_test(N, "slz")
0.038967 seconds (3.11 k allocations: 76.527 MiB, 22.17% gc time)
julia> @time f_read_test(N, "dat")
0.068544 seconds (1.81 k allocations: 76.405 MiB, 59.21% gc time)
So for writes, the write to binary option outperforms even serialize, and is twice as fast as HDF5 and almost an order of magnitude faster than JLD2.
For reads, deserialize has the best performance, while HDF5, JLD2 and reading from binary are all fairly close in performance, with HDF5 being slightly ahead.
I haven't included a test for writing to slices, but may come back to this in the future. Obviously writing to slices is impossible using serialize (not to mention the versioning/system image issues that serialize also faces), and I'm not really sure how to do it using JLD2. My gut feeling is that writing a slice to binary will easily beat HDF5 if the slice is contiguous on disk, but will probably be significantly slower than HDF5 if it is non-contiguous and the HDF5 method optimally exploits chunking. If HDF5 doesn't exploit chunking (which implies knowing at write time what slices you will want), then I suspect the binary method will come out ahead.
In summary, I'm going to go with the binary method, as I think that at this stage it is clearly the overall winner.
I suspect that eventually, JLD2 will probably be the method of choice, but there is a fair way to go here (the package itself is very new so not much time for the community to work on optimisations etc).
Test code follows:
using JLD2, HDF5
f_write_h5(fp::String, x::Matrix{Float64}) = h5write(fp, "G/D", x)
f_write_jld(fp::String, x::Matrix{Float64}) = @save fp x
f_write_slz(fp::String, x::Matrix{Float64}) = open(fid->serialize(fid, x), fp, "w")
f_write_dat_inner(fid1::IOStream, x::Matrix{Float64}) = begin ; write(fid1, size(x,1)) ; write(fid1, size(x,2)) ; write(fid1, x) ; end
f_write_dat(fp::String, x::Matrix{Float64}) = open(fid1->f_write_dat_inner(fid1, x), fp, "w")
f_read_h5(fp::String) = h5read(fp, "G/D")
f_read_jld(fp::String) = @load fp x
f_read_slz(fp::String) = open(deserialize, fp, "r")
f_read_dat_inner(fid1::IOStream) = begin ; d1 = read(fid1, Int) ; d2 = read(fid1, Int) ; read(fid1, Float64, (d1, d2)) ; end
f_read_dat(fp::String) = open(f_read_dat_inner, fp, "r")
function f_write_test(N::Int, filetype::String)
    dp = "/home/colin/Temp/"
    filetype == "h5" && [ f_write_h5("$(dp)$(n).$(filetype)", randn(1000, 100)) for n = 1:N ]
    filetype == "jld" && [ f_write_jld("$(dp)$(n).$(filetype)", randn(1000, 100)) for n = 1:N ]
    filetype == "slz" && [ f_write_slz("$(dp)$(n).$(filetype)", randn(1000, 100)) for n = 1:N ]
    filetype == "dat" && [ f_write_dat("$(dp)$(n).$(filetype)", randn(1000, 100)) for n = 1:N ]
    #[ rm("$(dp)$(n).$(filetype)") for n = 1:N ]
end
function f_read_test(N::Int, filetype::String)
    dp = "/home/colin/Temp/"
    filetype == "h5" && [ f_read_h5("$(dp)$(n).$(filetype)") for n = 1:N ]
    filetype == "jld" && [ f_read_jld("$(dp)$(n).$(filetype)") for n = 1:N ]
    filetype == "slz" && [ f_read_slz("$(dp)$(n).$(filetype)") for n = 1:N ]
    filetype == "dat" && [ f_read_dat("$(dp)$(n).$(filetype)") for n = 1:N ]
    [ rm("$(dp)$(n).$(filetype)") for n = 1:N ]
end
f_write_test(1, "h5")
f_write_test(1, "jld")
f_write_test(1, "slz")
f_write_test(1, "dat")
f_read_test(1, "h5")
f_read_test(1, "jld")
f_read_test(1, "slz")
f_read_test(1, "dat")
N = 100
@time f_write_test(N, "h5")
@time f_write_test(N, "jld")
@time f_write_test(N, "slz")
@time f_write_test(N, "dat")
@time f_read_test(N, "h5")
@time f_read_test(N, "jld")
@time f_read_test(N, "slz")
@time f_read_test(N, "dat")
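For reference, here is a rough sketch of the "dat" layout (two integers for the dimensions, then the column-major data) written for Julia 1.x, where read! replaces the older read(fid, Float64, dims) form. It also shows one possible way to read or overwrite a slice in place by memory-mapping the payload with the Mmap standard library. The function names write_matrix, read_matrix and mmap_matrix are hypothetical, and this has not been benchmarked against the numbers above.

```julia
using Mmap

# Write two Int64 dimensions followed by the column-major payload.
function write_matrix(path::AbstractString, x::Matrix{Float64})
    open(path, "w") do io
        write(io, Int64(size(x, 1)), Int64(size(x, 2)))
        write(io, x)
    end
    return nothing
end

# Read the whole matrix back into memory.
function read_matrix(path::AbstractString)
    open(path, "r") do io
        d1 = read(io, Int64)
        d2 = read(io, Int64)
        return read!(io, Matrix{Float64}(undef, d1, d2))
    end
end

# Memory-map the payload so slices can be read or overwritten in place.
function mmap_matrix(path::AbstractString)
    io = open(path, "r+")
    d1 = read(io, Int64)
    d2 = read(io, Int64)
    # The mapping starts at the current stream position, just past the header.
    A = Mmap.mmap(io, Matrix{Float64}, (d1, d2))
    close(io)
    return A
end

path = tempname()
x = randn(100, 2)
write_matrix(path, x)
y = read_matrix(path)

A = mmap_matrix(path)
A[1:10, 2] .= 0.0      # overwrite a slice without rewriting the file
Mmap.sync!(A)          # flush the change to disk
z = read_matrix(path)
```

Because the data is stored column-major, slices that run down a column are contiguous on disk, which matches the intuition about contiguous slices given above.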
Julia has two built-in functions, readdlm and writedlm, for doing this:
julia> x = randn(5, 5)
5×5 Array{Float64,2}:
-1.2837 -0.641382 0.611415 0.965762 -0.962764
0.106015 -0.344429 1.40278 0.862094 0.324521
-0.603751 0.515505 0.381738 -0.167933 -0.171438
-1.79919 -0.224585 1.05507 -0.753046 0.0545622
-0.110378 -1.16155 0.774612 -0.0796534 -0.503871
julia> writedlm("txtmat.txt", x, use_mmap=true)
julia> readdlm("txtmat.txt", use_mmap=true)
5×5 Array{Float64,2}:
-1.2837 -0.641382 0.611415 0.965762 -0.962764
0.106015 -0.344429 1.40278 0.862094 0.324521
-0.603751 0.515505 0.381738 -0.167933 -0.171438
-1.79919 -0.224585 1.05507 -0.753046 0.0545622
-0.110378 -1.16155 0.774612 -0.0796534 -0.503871
This is definitely not the fastest way (use Mmap.mmap directly, as DanGetz suggested in the comments, if performance is a big deal), but it seems to be the simplest, and the output file is human-readable.
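On Julia 1.x these two functions live in the DelimitedFiles standard library rather than in Base, so a round trip would look roughly like the sketch below. The temporary path is an assumption made for illustration.

```julia
using DelimitedFiles

x = randn(5, 5)
path = tempname()          # hypothetical scratch file
writedlm(path, x)          # tab-delimited text, one matrix row per line
y = readdlm(path)          # parsed back as a Matrix{Float64}
```

Since the values go through a text representation, the file is larger and slower to parse than a binary dump, which is the trade-off for readability.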
I am trying to port some of my R code to Julia;
Basically I have rewritten the following R code in Julia:
function_compare <- function(x) {
  matrix_to_compare = expand.grid(c(0,1),c(0,1))
  which((rowSums(x==matrix_to_compare)==2) %in% TRUE)
}
> system.time(lapply(large_matrix,function_compare))
user system elapsed
38.812 0.024 38.828
> system.time(mclapply(large_matrix,function_compare,mc.cores=11))
user system elapsed
63.128 1.648 6.108
As one can notice I am getting significant speed-up when going from one core to 11. Now I am trying to do the same in Julia:
#Define cluster:
addprocs(11);
using Distributions;
@everywhere using Iterators;
d = Normal();
eps_1 = rand(d,1000000);
eps_2 = rand(d,1000000);
#Create a large matrix:
large_matrix = hcat(eps_1,eps_2).>=0;
indices = collect(1:4:1000000)
#Split large matrix:
large_matrix = [large_matrix[i:(i+3),:] for i in indices];
#Define the function to apply:
@everywhere function function_split(x)
    matrix_to_compare = transpose(reinterpret(Int,collect(product([0,1],[0,1])),(2,4)));
    matrix_to_compare = matrix_to_compare.>0;
    find(sum(x.==matrix_to_compare,2).==2)
end
@time map(function_split,large_matrix )
@time pmap(function_split,large_matrix )
5.167820 seconds (22.00 M allocations: 2.899 GB, 12.83% gc time)
18.569198 seconds (40.34 M allocations: 2.082 GB, 5.71% gc time)
As one can notice I am not getting any speed up with pmap. Maybe somebody can suggest alternatives.
I think that some of the problem here is that @parallel and pmap don't always handle moving data to and from the workers very well. Thus, they tend to work best in situations where what you are executing doesn't require very much data movement at all. I also suspect that there are probably things that could be done to improve their performance, but I'm not certain on the details.
For situations in which you do need more data moving around, it may be best to stick with options that directly call functions on workers, with those functions then accessing objects within the memory space of those workers. I give one example below, which speeds up your function using multiple workers. It uses perhaps the simplest option, which is @everywhere, but @spawn, remotecall() etc. are also worth considering, depending on your situation.
using Distributions;
@everywhere using Iterators;
d = Normal();
eps_1 = rand(d,1000000);
eps_2 = rand(d,1000000);
#Create a large matrix:
large_matrix = hcat(eps_1,eps_2).>=0;
indices = collect(1:4:1000000);
#Split large matrix:
large_matrix = [large_matrix[i:(i+3),:] for i in indices];
large_matrix = convert(Array{BitArray}, large_matrix);
function sendto(p::Int; args...)
    for (nm, val) in args
        @spawnat(p, eval(Main, Expr(:(=), nm, val)))
    end
end
getfrom(p::Int, nm::Symbol; mod=Main) = fetch(@spawnat(p, getfield(mod, nm)))
@everywhere function function_split(x::BitArray)
    matrix_to_compare = transpose(reinterpret(Int,collect(product([0,1],[0,1])),(2,4)));
    matrix_to_compare = matrix_to_compare.>0;
    find(sum(x.==matrix_to_compare,2).==2)
end
function distribute_data(X::Array, WorkerName::Symbol)
    size_per_worker = floor(Int,size(X,1) / nworkers())
    StartIdx = 1
    EndIdx = size_per_worker
    for (idx, pid) in enumerate(workers())
        if idx == nworkers()
            EndIdx = size(X,1)   # last worker takes whatever remains
        end
        @spawnat(pid, eval(Main, Expr(:(=), WorkerName, X[StartIdx:EndIdx])))
        StartIdx = EndIdx + 1
        EndIdx = EndIdx + size_per_worker - 1
    end
end
distribute_data(large_matrix, :large_matrix)
function parallel_split()
    @everywhere begin
        if myid() != 1
            result = map(function_split,large_matrix );
        end
    end
    results = cell(nworkers())
    for (idx, pid) in enumerate(workers())
        results[idx] = getfrom(pid, :result)
    end
    vcat(results...)
end
## results given after running once to compile
@time a = map(function_split,large_matrix); ## 6.499737 seconds (22.00 M allocations: 2.899 GB, 13.99% gc time)
@time b = parallel_split(); ## 1.097586 seconds (1.50 M allocations: 64.508 MB, 3.28% gc time)
julia> a == b
Note: even with this, the speedup is not perfect from the multiple processes. But, this is to be expected, since there is still a moderate amount of data to be returned as a result of your function, and that data's got to be moved, taking time.
P.S. See this post (Julia: How to copy data to another processor in Julia) or this package (https://github.com/ChrisRackauckas/ParallelDataTransfer.jl) for more on the sendto and getfrom functions I used here.
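The index arithmetic in distribute_data amounts to cutting 1:n into one contiguous block per worker. A small stand-alone sketch of that step, which balances the block sizes so that they differ by at most one, might look like this. The name chunk_ranges is hypothetical, and the sketch deliberately leaves out the transfer to the workers.

```julia
# Split 1:n into k contiguous ranges whose lengths differ by at most one.
function chunk_ranges(n::Integer, k::Integer)
    base, extra = divrem(n, k)
    ranges = Vector{UnitRange{Int}}(undef, k)
    start = 1
    for i in 1:k
        len = base + (i <= extra ? 1 : 0)   # first `extra` chunks get one more
        ranges[i] = start:(start + len - 1)
        start += len
    end
    return ranges
end

chunks = chunk_ranges(10, 3)   # three blocks covering 1:10
```

Each range could then be used to slice the data that is sent to one worker, for example X[chunks[idx]] for the idx-th worker.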
I have an algorithm that requires one column of an array to be replaced by another column of the same array.
I tried doing it with slices, and element-wise.
const M = 10^4
const N = 10^4
A = rand(Float32, M, N)
B = rand(Float32, N, M)
function copy_col!(A::Array{Float32,2},col1::Int,col2::Int)
    A[1:end,col2] = A[1:end,col1]
end
function copy_col2!(A::Array{Float32,2},col1::Int,col2::Int)
    for i in 1:size(A,1)
        A[i,col2] = A[i,col1]
    end
end
[Both functions+rand are called here once for compilation]
@time (for i in 1:20000 copy_col!(B, rand(1:size(B,2)),rand(1:size(B,2)) ); end )
@time (for i in 1:20000 copy_col2!(B, rand(1:size(B,2)),rand(1:size(B,2)) ); end )
>> 0.607899 seconds (314.81 k allocations: 769.879 MB, 25.05% gc time)
>> 0.213387 seconds (117.96 k allocations: 2.410 MB)
Why does copying using slices perform so much worse? Is there a better way than what copy_col2! does?
A[1:end,col1] first makes a copy of the indexed column, and that copy is then written into A[1:end,col2], so copy_col! allocates more and runs longer. There are sub, slice, and view, which may avoid the allocations in this case.
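As a sketch of the view-based approach on Julia 1.x (where view and @view cover what sub and slice did in older versions), the copy can be written as a broadcast assignment from a view, so no temporary column is created. The function name copy_col_view! is hypothetical.

```julia
# Copy column `src` onto column `dst` without allocating a temporary column.
function copy_col_view!(A::AbstractMatrix, src::Int, dst::Int)
    A[:, dst] .= @view A[:, src]   # in-place, element-wise assignment
    return A
end

A = Float32[1 2 3; 4 5 6]
copy_col_view!(A, 1, 3)
```

Whether this beats the explicit loop in copy_col2! would need to be measured; both avoid the extra allocation made by the slice version.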
I have the following code:
using HDF5
using JLD
# read contents of a file
t = readall("sourceFile")
# remove unnecessary characters
t = replace(t, r"( 1:1\.0+)|(( 1:1\.0+)|(([1-6]:)|((\|user )|(\|))))", "")
# convert string into Float64 array (approximately ~140 columns)
data = readdlm(IOBuffer(t), ' ', char(10))
# save array on the hard drive
save("data.jld", "data", data)
This works fine when I test it with a sourceFile that has 10^4 lines or fewer. However, when sourceFile has around 5*10^6 lines, it fails at t = replace(t, r"( 1:1\.0+)|(( 1:1\.0+)|(([1-6]:)|((\|user )|(\|))))", "") with the following message
This question is old, and based on an older version of Julia. However, it is useful to check whether this works on a recent version. I recently tested this in the latest 0.5 version of Julia, and the code above seems to work correctly with 5*10^6 lines of 600 characters each. The entire operation takes about 5G of peak memory on my laptop.
julia> t=[randstring(600) for i=1:5*10^6];
julia> writecsv("/Users/aviks/tmp/long.csv", t)
julia> t=readstring("/Users/aviks/tmp/long.csv");
julia> length(t)
julia> @time t = replace(t, r"( 1:1\.0+)|(( 1:1\.0+)|(([1-6]:)|((\|user )|(\|))))", "");
43.599660 seconds (137 allocations: 3.358 GB, 0.85% gc time)
(PS: Note that readall is now deprecated in favour of readstring).
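If peak memory is a concern, one option that is not part of the answer above is to apply the same pattern line by line instead of on one huge string. The sketch below assumes Julia 1.x, where replace takes a pattern => replacement pair, and uses in-memory buffers in place of real files. The names PATTERN and clean_lines are hypothetical.

```julia
# Same pattern as in the question, applied one line at a time.
const PATTERN = r"( 1:1\.0+)|(( 1:1\.0+)|(([1-6]:)|((\|user )|(\|))))"

function clean_lines(input::IO, output::IO)
    for line in eachline(input)
        println(output, replace(line, PATTERN => ""))
    end
    return nothing
end

src = IOBuffer("3 1:1.000 2:0.5 |user 4:7\n")
dst = IOBuffer()
clean_lines(src, dst)
cleaned = String(take!(dst))
```

With real files, input and output would be streams returned by open, so only one line is held in memory at a time.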