I am reading Julia performance tips,
https://docs.julialang.org/en/v1/manual/performance-tips/
At the beginning, it mentions two examples.
Example 1,
julia> x = rand(1000);

julia> function sum_global()
           s = 0.0
           for i in x
               s += i
           end
           return s
       end;

julia> @time sum_global()
  0.009639 seconds (7.36 k allocations: 300.310 KiB, 98.32% compilation time)
496.84883432553846

julia> @time sum_global()
  0.000140 seconds (3.49 k allocations: 70.313 KiB)
496.84883432553846
We see a lot of memory allocations.
Now example 2,
julia> x = rand(1000);

julia> function sum_arg(x)
           s = 0.0
           for i in x
               s += i
           end
           return s
       end;

julia> @time sum_arg(x)
  0.006202 seconds (4.18 k allocations: 217.860 KiB, 99.72% compilation time)
496.84883432553846

julia> @time sum_arg(x)
  0.000005 seconds (1 allocation: 16 bytes)
496.84883432553846
We see that by passing x as an argument to the function, the memory allocations almost disappear and the code is much faster.
My questions are: can anyone explain why example 1 needs so many allocations, and why example 2 does not need as many allocations as example 1? I am a little confused.
In both examples, we see that the second time we run the function it is always faster than the first time. Does that mean we need to run everything twice? If Julia is only fast on the second run, then what is the point? Why doesn't Julia just compile first and then run, like Fortran?
Is there any general rule for preventing memory allocations? Or do we always have to use @time to identify the issue?
Thanks!
Why does example 1 need so many allocations, and why does example 2 not need as many allocations as example 1?
Example 1 needs so many allocations because x is a global variable (defined outside the scope of the function sum_global). Therefore the type of the variable x can potentially change at any time, i.e. it is possible that:
1. you define x and sum_global,
2. you compile sum_global,
3. you redefine x (change its type) and run sum_global.
In particular, since Julia supports multithreading, both actions in step 3 could in general even happen in parallel (i.e. you could change the type of x in one thread while sum_global is running in another thread).
So because the type of x can change after sum_global is compiled, Julia, when compiling sum_global, has to ensure that the compiled code does not rely on the type that x had when the compilation took place. Instead, Julia in such cases allows the type of x to change dynamically. However, this dynamic nature of x means that its type has to be checked at run time (not compile time), and this dynamic checking is what causes the performance degradation and memory allocations.
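For illustration, here is a minimal hypothetical REPL session (my addition, not from the manual) showing why the compiler cannot assume anything about the type of a non-constant global:

julia> x = rand(1000);        # currently a Vector{Float64}

julia> sum_global();          # compiled without assuming the type of x

julia> x = "not an array";    # perfectly legal: the type of x just changed

julia> sum_global()           # now fails, but only at run time, not at compile time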
You could have fixed this by declaring x to be a const (as const ensures that the type of x may not change):
julia> const x = rand(1000);

julia> function sum_global()
           s = 0.0
           for i in x
               s += i
           end
           return s
       end;

julia> @time sum_global() # this is now fast
  0.000002 seconds
498.9290555615045
Why doesn't Julia just compile first and then run, like Fortran?
This is exactly what Julia does. However, the benefit of Julia is that it does the compilation automatically when needed. This allows for a smooth interactive development process.
If you wanted, you could compile the function before running it with the precompile function, and then run it separately (a minimal sketch is shown below). However, normally people just run the function without doing this explicitly.
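Here is a minimal sketch of that workflow, using sum_arg from the second example (the argument-type tuple has to match how you intend to call the function):

precompile(sum_arg, (Vector{Float64},))   # compile for Vector{Float64} without running it
@time sum_arg(x)                          # the first real call now skips most of the compilation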
The consequence is that if you use @time:
- The first time you run a function it reports both execution time and compilation time (and, as you can see in the examples you pasted, it tells you what percentage of the time was spent on compilation).
- On subsequent runs the function is already compiled, so only execution time is reported.
Is there any general rule to preventing memory allocations?
These rules are given exactly in the Performance Tips section of the manual that you are quoting in your question. The tip on using @time is a diagnostic tip there. All the other tips are rules that are recommended to get fast code. However, I understand that the list is long, so a shorter list that, in my experience, is good enough to start with is:
- Avoid global variables
- Avoid containers with abstract type parameters
- Write type-stable functions (see the sketch below this list)
- Avoid changing the type of a variable
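To illustrate the type-stability rule, here is a small self-contained sketch (the function names are made up for this example); @code_warntype highlights the problem in the unstable version:

# Type-unstable: s starts as an Int and may become a Float64 inside the loop,
# so the compiler cannot pin down its type.
function unstable_sum(v)
    s = 0
    for x in v
        s += x
    end
    return s
end

# Type-stable: s has the element type of v from the start.
function stable_sum(v)
    s = zero(eltype(v))
    for x in v
        s += x
    end
    return s
end

v = rand(1000)
@code_warntype unstable_sum(v)   # reports s as a Union of Int64 and Float64
@code_warntype stable_sum(v)     # everything is concretely typed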
Output of 1000 numbers by the command
c(1:1000)
takes 6 seconds in an R Markdown document. One can observe how the numbers are gradually displayed row by row. Similar observations with rep(1,1000) or rep('1',1000).
This happens on a new laptop with freshly installed RStudio.
There are no other programs running in the background, and all other programs on the laptop perform as quickly as expected.
Is this a built-in delay to give an impression of the amount of displayed data, and how can it be switched off?
technical details:
OS: Win10.19042
RStudio 1.3.1093
CPU: i7-9750H (6 cores)
RAM: 32GB
RStudio shows as busy while printing: after 3 seconds about 500 numbers were displayed, and after a further 3 seconds the whole output was complete.
I'm developing different discretization schemes, and in order to find out which is the most efficient one I would like to determine the maximum RAM consumption and the time it takes to do a specific task, such as solving a system of equations, overwriting a matrix, or writing the data to a file.
Is there any kind of code or tool for doing what I need?
I'm using Julia on Ubuntu, by the way, but I could do it on Windows as well.
Thanks a lot
I love using the built-in @time for this kind of thing. See "Measure performance with @time and pay attention to memory allocation". Example:
julia> @time myAwesomeFunction(tmp);
  1.293542 seconds (22.08 M allocations: 893.866 MiB, 6.62% gc time)
This prints out time, the number of memory allocations, the size of memory allocations, and the percent time spent garbage collecting ("gc"). Always run this at least twice—the first run will be dominated by compile times!
Also consider BenchmarkTools.jl. This will run the code multiple times, with some cool variable interpolation tricks, and give you better runtime/memory estimates:
julia> using BenchmarkTools, Compat
julia> @btime myAwesomeFunction($tmp);
  1.311 s (22080097 allocations: 893.87 MiB)
(My other favorite performance-related thing is the @code_* family of functions like @code_warntype.)
I think that BenchmarkTools.jl measures total memory use, not peak. I haven't found pure Julia code to measure this, but perhaps this thread is relevant.
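One rough workaround (my suggestion, not from that thread, and not a true peak-memory profiler) is to compare the process's maximum resident set size before and after the call using Sys.maxrss from Base:

before = Sys.maxrss()                    # peak RSS of the whole Julia process so far, in bytes
myAwesomeFunction(tmp)                   # the call being measured (from above)
println("peak RSS grew by roughly ", (Sys.maxrss() - before) / 2^20, " MiB")

Since maxrss only ever grows, the difference is just a coarse upper bound on how much the peak moved.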
I am trying to test the speed of Julia ODE solvers. I used the Lorenz equation in the tutorial:
using DifferentialEquations
using Plots
function lorenz(t,u,du)
    du[1] = 10.0*(u[2]-u[1])
    du[2] = u[1]*(28.0-u[3]) - u[2]
    du[3] = u[1]*u[2] - (8/3)*u[3]
end
u0 = [1.0;1.0;1.0]
tspan = (0.0,100.0)
prob = ODEProblem(lorenz,u0,tspan)
sol = solve(prob,reltol=1e-8,abstol=1e-8,saveat=collect(0:0.01:100))
Loading the packages took about 25 s at the beginning, and the code ran for 7 s on a Windows 10 quad-core laptop in a Jupyter notebook. I understand that Julia needs to precompile packages first; is that why the loading time was so long? I found 25 s unbearable. Also, when I ran the solver again using different initial values it took much less time (~1 s) to run. Why is that? Is this the typical speed?
Tl;dr:
1. Julia packages have a precompilation phase. This helps make all further using calls quicker, at the cost of the first one storing some compilation data. This is only triggered on each package update.
2. using has to pull in the package, which takes a little while (depending on how much can be precompiled).
3. Precompilation isn't "complete", so the first time you run a function, even from a package, it will have to compile.
4. Julia devs know about this, and there are already plans to get rid of (2) and (3) by making precompilation more complete. There are also plans to reduce compilation time, the details of which I don't know.
5. All Julia functions specialize on the types that are given, and each function is a separate type, so DiffEq's internal functions are specializing on each ODE function you give.
6. In most cases with long computations, (5) doesn't actually matter, since you aren't changing functions that often (if you are, consider changing parameters instead).
7. But (6) does matter when using it interactively. It makes it feel less "smooth".
8. We can get rid of this specialization on the ODE function, but it isn't the default because it causes a 2x-4x performance hit. Maybe it will be the default in the future.
9. Our timings post-precompilation on this problem are still better than things like SciPy's wrapped Fortran solvers on problems like this by 20x. So this is all a compilation-time problem, not a runtime problem. Compilation time is essentially constant (larger problems calling the same function have about the same compilation time), so this is really just an interactivity problem.
10. We (and Julia in general) can and will do better with interactivity in the future.
Full Explanation
This really isn't a DifferentialEquations.jl thing, this is just a Julia package thing. 25s would have to be including the precompilation time. The first time you load a Julia package it precompiles. Then that doesn't need to happen again until the next update. That's probably the longest initialization and it is quite long for DifferentialEquations.jl, but again that only happens each time you update the package code. Then, each time there's a small initialization cost for using. DiffEq is quite large, so it does take a bit to initialize:
@time using DifferentialEquations
  5.201393 seconds (4.16 M allocations: 235.883 MiB, 4.09% gc time)
Then as noted in the comments you also have:
@time using Plots
  6.499214 seconds (2.48 M allocations: 140.948 MiB, 0.74% gc time)
Then, the first time you run
function lorenz(t,u,du)
    du[1] = 10.0*(u[2]-u[1])
    du[2] = u[1]*(28.0-u[3]) - u[2]
    du[3] = u[1]*u[2] - (8/3)*u[3]
end
u0 = [1.0;1.0;1.0]
tspan = (0.0,100.0)
prob = ODEProblem(lorenz,u0,tspan)
@time sol = solve(prob,reltol=1e-8,abstol=1e-8,saveat=collect(0:0.01:100))
  6.993946 seconds (7.93 M allocations: 436.847 MiB, 1.47% gc time)
But then the second and third time:
0.010717 seconds (72.21 k allocations: 6.904 MiB)
0.011703 seconds (72.21 k allocations: 6.904 MiB)
So what's going on here? The first time Julia runs a function, it will compile it. So the first time you run solve, it will compile all of its internal functions as it runs. All subsequent calls will run without the compilation. DifferentialEquations.jl also specializes on the function itself, so if we change the function:
function lorenz2(t,u,du)
    du[1] = 10.0*(u[2]-u[1])
    du[2] = u[1]*(28.0-u[3]) - u[2]
    du[3] = u[1]*u[2] - (8/3)*u[3]
end
u0 = [1.0;1.0;1.0]
tspan = (0.0,100.0)
prob = ODEProblem(lorenz2,u0,tspan)
we will incur some of the compilation time again:
@time sol = solve(prob,reltol=1e-8,abstol=1e-8,saveat=collect(0:0.01:100))
  3.690755 seconds (4.36 M allocations: 239.806 MiB, 1.47% gc time)
So that's the what; now the why. There are a few things going on here. First of all, Julia packages do not fully precompile: they don't keep cached compiled versions of actual methods between sessions. This is something that is on the 1.x release list, and it would get rid of that first hit, making it similar to calling a C/Fortran package since you would just be hitting a lot of ahead-of-time (AOT) compiled functions. So that'll be nice, but for now just note that there is a startup time.
Now let's talk about changing the functions. Every function in Julia automatically specializes on its arguments (see this blog post for details). The key idea here is that every function in Julia is a separate concrete type. So, since the problem type here is parameterized on the function, changing the function triggers compilation. Note the exact relation: you can change parameters of the function (if you had parameters), you can change the initial conditions, etc., but it is only changing the type that triggers recompilation.
Is it worth it? Well, maybe. We want to specialize to make things fast for calculations that are difficult. Compilation time is constant (i.e. you can solve a 6-hour ODE and it'll still be a few seconds), so the computationally costly calculations aren't affected here. Monte Carlo simulations where you're running thousands of parameters and initial conditions aren't affected here, because if you're just changing the values of initial conditions and parameters it won't recompile. But interactive use where you are changing functions does get a second or so hit in there, which isn't nice. One answer from the Julia devs for this is to spend post-Julia-1.0 time speeding up compilation times, which is something I don't know the details of, but I am assured there is some low-hanging fruit there.
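To make that distinction concrete, here is a minimal sketch (my addition; it reuses lorenz and tspan from above, and the point is the behavior, not the exact timings):

# First solve with this function type: pays the one-time compilation cost.
prob1 = ODEProblem(lorenz, [1.0, 1.0, 1.0], tspan)
@time solve(prob1, reltol=1e-8, abstol=1e-8)

# New initial conditions, same function type: the compiled code is reused,
# so no recompilation happens and the call is fast.
prob2 = ODEProblem(lorenz, [2.0, 0.5, 1.5], tspan)
@time solve(prob2, reltol=1e-8, abstol=1e-8)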
Can we get rid of it? Yes. DiffEq Online doesn't recompile for each function because it's geared towards online use.
function lorenz3(t,u,du)
    du[1] = 10.0*(u[2]-u[1])
    du[2] = u[1]*(28.0-u[3]) - u[2]
    du[3] = u[1]*u[2] - (8/3)*u[3]
    nothing
end
u0 = [1.0;1.0;1.0]
tspan = (0.0,100.0)
f = NSODEFunction{true}(lorenz3,tspan[1],u0)
prob = ODEProblem{true}(f,u0,tspan)
@time sol = solve(prob,reltol=1e-8,abstol=1e-8,saveat=collect(0:0.01:100))
  1.505591 seconds (860.21 k allocations: 38.605 MiB, 0.95% gc time)
And now we can change the function and not incur compilation cost:
function lorenz4(t,u,du)
    du[1] = 10.0*(u[2]-u[1])
    du[2] = u[1]*(28.0-u[3]) - u[2]
    du[3] = u[1]*u[2] - (8/3)*u[3]
    nothing
end
u0 = [1.0;1.0;1.0]
tspan = (0.0,100.0)
f = NSODEFunction{true}(lorenz4,tspan[1],u0)
prob = ODEProblem{true}(f,u0,tspan)
@time sol = solve(prob,reltol=1e-8,abstol=1e-8,saveat=collect(0:0.01:100))
  0.038276 seconds (242.31 k allocations: 10.797 MiB, 22.50% gc time)
And tada, by wrapping the function in NSODEFunction (which is internally using FunctionWrappers.jl) it no longer specializes per-function and you hit the compilation time once per Julia session (and then once that's cached, once per package update). But notice that this has about a 2x-4x cost so I am not sure if it will be enabled by default. We could make this happen by default inside of the problem-type constructor (i.e. no extra specialization by default, but the user can opt into more speed at the cost of interactivity) but I am unsure what the better default is here (feel free to comment on the issue with your thoughts). But it will definitely get documented soon after Julia does its keyword argument changes and so "compilation-free" mode will be a standard way to use it, even if not default.
But just to put it into perspective,
import numpy as np
from scipy.integrate import odeint
y0 = [1.0,1.0,1.0]
t = np.linspace(0, 100, 10001)
def f(u,t):
    return [10.0*(u[1]-u[0]),u[0]*(28.0-u[2])-u[1],u[0]*u[1]-(8/3)*u[2]]
%timeit odeint(f,y0,t,atol=1e-8,rtol=1e-8)
1 loop, best of 3: 210 ms per loop
we're looking at whether this interactive convenience should be made the default, so that we would be 5x instead of 20x faster than SciPy's default here (though our default will usually be much more accurate than the default SciPy uses, but that's data for another time, which can be found in the benchmarks, or just ask). On the one hand it makes sense for ease of use, but on the other hand, if re-enabling the specialization for long calculations and Monte Carlo (which is where you really want speed) isn't widely known, then lots of people will take a 2x-4x performance hit, which could amount to extra days/weeks of computation. Ehh... tough choices.
So in the end there's a mixture of optimization choices and some precompilation features missing from Julia that affect interactivity without affecting the true runtime speed. If you're looking to estimate parameters using some big Monte Carlo, or solve a ton of SDEs, or solve a big PDE, we have that down. That was our first goal and we made sure to hit it as well as possible. But playing around in the REPL does have 2-3 second "glitches", which we also cannot ignore (better than playing around in C/Fortran, of course, but still not ideal for a REPL). For this, I've shown you that there are solutions already being developed and tested, and so hopefully this time next year we can have a better answer for that specific case.
PS
Two other things to note. If you're only using the ODE solvers, you can just do using OrdinaryDiffEq to avoid downloading/installing/compiling/importing all of DifferentialEquations.jl (this is described in the manual). Also, using saveat like that probably isn't the fastest way to solve this problem: solving it with far fewer points and using the dense output as necessary may be better here.
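A hedged sketch of the dense-output idea (my addition; prob is the problem from the question, and the calls are illustrative): the solution returned without saveat carries a continuous interpolation that can be evaluated wherever it is needed afterwards.

sol = solve(prob, reltol=1e-8, abstol=1e-8)   # no dense saveat grid: keep the default output
sol(50.0)                                     # evaluate the interpolation at t = 50
[sol(t) for t in 0:0.01:100]                  # the same fine grid, built only on demand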
Edit
I opened an issue detailing how we can reduce the "between function" compilation time without losing the speedup that specializing gives. I think this is something we can make a short-term priority since I agree that we could do better here.
Reading large text/CSV files in Julia takes a long time compared to Python. Here are the times to read a file that is 486.6 MB in size and has 153,895 rows and 644 columns.
Python 3.3 example
import pandas as pd
import time
start=time.time()
myData=pd.read_csv("C:\\myFile.txt",sep="|",header=None,low_memory=False)
print(time.time()-start)
Output: 19.90
R 3.0.2 example
system.time(myData<-read.delim("C:/myFile.txt",sep="|",header=F,
stringsAsFactors=F,na.strings=""))
Output:
User System Elapsed
181.13 1.07 182.32
Julia 0.2.0 (Julia Studio 0.4.4) example # 1
using DataFrames
timing = @time myData = readtable("C:/myFile.txt",separator='|',header=false)
Output:
elapsed time: 80.35 seconds (10319624244 bytes allocated)
Julia 0.2.0 (Julia Studio 0.4.4) example # 2
timing = @time myData = readdlm("C:/myFile.txt",'|',header=false)
Output:
elapsed time: 65.96 seconds (9087413564 bytes allocated)
Julia is faster than R, but quite slow compared to Python. What can I do differently to speed up reading a large text file?
A separate issue is that the size in memory is 18x the on-disk file size in Julia, but only 2.5x for Python. In Matlab, which I have found to be the most memory-efficient for large files, it is 2x the on-disk file size. Is there any particular reason for the large memory footprint in Julia?
The best answer is probably that I'm not as good a programmer as Wes.
In general, the code in DataFrames is much less well-optimized than the code in Pandas. I'm confident that we can catch up, but it will take some time as there's a lot of basic functionality that we need to implement first. Since there's so much that needs to be built in Julia, I tend to focus on doing things in three parts: (1) build any version, (2) build a correct version, (3) build a fast, correct version. For the work I do, Julia often doesn't offer any versions of essential functionality, so my work gets focused on (1) and (2). As more of the tools I need get built, it'll be easier to focus on performance.
As for memory usage, I think the answer is that we use a set of data structures when parsing tabular data that's much less efficient than those used by Pandas. If I knew the internals of Pandas better, I could list off places where we're less efficient, but for now I'll just speculate that one obvious failing is that we're reading the whole dataset into memory rather than grabbing chunks from disk. This certainly can be avoided and there are issues open for doing so. It's just a matter of time.
On that note, the readtable code is fairly easy to read. The most certain way to get readtable to be faster is to whip out the Julia profiler and start fixing the performance flaws it uncovers.
There is a relatively new julia package called CSV.jl by Jacob Quinn that provides a much faster CSV parser, in many cases on par with pandas: https://github.com/JuliaData/CSV.jl
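As a hedged sketch of what using it looks like (the exact keyword set depends on the CSV.jl version; a recent one with a DataFrame sink is assumed here):

using CSV, DataFrames

df = CSV.read("C:/myFile.txt", DataFrame; delim='|', header=false)   # parse directly into a DataFrame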
Note that the "n bytes allocated" output from @time is the total size of all allocated objects, ignoring how many of them might have been freed. This number is often much higher than the final size of live objects in memory. I don't know if this is what your memory size estimate is based on, but I wanted to point this out.
I've found a few things that can partially help this situation.
using the readdlm() function in Julia seems to work considerably faster (e.g. 3x on a recent trial) than readtable(). Of course, if you want the DataFrame object type, you'll then need to convert to it, which may eat up most or all of the speed improvement.
Specifying dimensions of your file can make a BIG difference, both in speed and in memory allocations. I ran this trial reading in a file that is 258.7 MB on disk:
julia> @time Data = readdlm("MyFile.txt", '\t', Float32, skipstart = 1);
 19.072266 seconds (221.60 M allocations: 6.573 GB, 3.34% gc time)

julia> @time Data = readdlm("MyFile.txt", '\t', Float32, skipstart = 1, dims = (File_Lengths[1], 62));
 10.309866 seconds (87 allocations: 528.331 MB, 0.03% gc time)
The type specification for your object matters a lot. For instance, if your data has strings in it, then the array that you read in will be of type Any, which is expensive memory-wise. If memory is really an issue, you may want to consider preprocessing your data by first converting the strings to integers, doing your computations, and then converting back. Also, if you don't need a ton of precision, using the Float32 type instead of Float64 can save a LOT of space. You can specify this when reading the file in, e.g.:
Data = readdlm("file.csv", ',', Float32)
Regarding memory usage, I've found in particular that the PooledDataArray type (from the DataArrays package) can be helpful in cutting down memory usage if your data has a lot of repeated values. The time to convert to this type is relatively large, so this isn't a time saver per se, but at least helps reduce the memory usage somewhat. E.g. when loading a data set with 19 million rows and 36 columns, 8 of which represented categorical variables for statistical analysis, this reduced the memory allocation of the object from 5x its size on disk to 4x its size. If there are even more repeated values, the memory reduction can be even more significant (I've had situations where the PooledDataArray cuts memory allocation in half).
It can also sometimes help to run the gc() (garbage collector) function after loading and formatting data to clear out any unneeded RAM allocation, though generally Julia will do this automatically pretty well.
Still though, despite all of this, I'll be looking forward to further developments on Julia to enable faster loading and more efficient memory usage for large data sets.
Let us first create a file you are talking about to provide reproducibility:
open("myFile.txt", "w") do io
foreach(i -> println(io, join(i+1:i+644, '|')), 1:153895)
end
Now I read this file in using Julia 1.4.2 and CSV.jl 0.7.1.
Single threaded:
julia> @time CSV.File("myFile.txt", delim='|', header=false);
  4.747160 seconds (1.55 M allocations: 1.281 GiB, 4.29% gc time)

julia> @time CSV.File("myFile.txt", delim='|', header=false);
  2.780213 seconds (13.72 k allocations: 1.206 GiB, 5.80% gc time)
and using e.g. 4 threads:
julia> @time CSV.File("myFile.txt", delim='|', header=false);
  4.546945 seconds (6.02 M allocations: 1.499 GiB, 5.05% gc time)

julia> @time CSV.File("myFile.txt", delim='|', header=false);
  0.812742 seconds (47.28 k allocations: 1.208 GiB)
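A side note (my addition): the thread count here comes from how the Julia session was started, e.g. via the JULIA_NUM_THREADS environment variable, not from an argument to CSV.File; you can check what you got with:

Threads.nthreads()   # number of threads available for parallel parsing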
In R it is:
> system.time(myData<-read.delim("myFile.txt",sep="|",header=F,
+ stringsAsFactors=F,na.strings=""))
user system elapsed
28.615 0.436 29.048
In Python (Pandas) it is:
>>> import pandas as pd
>>> import time
>>> start=time.time()
>>> myData=pd.read_csv("myFile.txt",sep="|",header=None,low_memory=False)
>>> print(time.time()-start)
25.95710587501526
Now if we test fread from R (which is fast) we get:
> system.time(fread("myFile.txt", sep="|", header=F,
stringsAsFactors=F, na.strings="", nThread=1))
user system elapsed
1.043 0.036 1.082
> system.time(fread("myFile.txt", sep="|", header=F,
stringsAsFactors=F, na.strings="", nThread=4))
user system elapsed
1.361 0.028 0.416
So in this case the summary is:
- despite the cost of compiling CSV.File in Julia, when you run it for the first time it is significantly faster than base R or Python,
- it is comparable in speed to fread in R (in this case slightly slower, but other benchmarks made here show cases when it is faster).
EDIT: Following the request, I have added a benchmark for a small file (10 columns, 100,000 rows), Julia vs Pandas.
Data preparation step:
open("myFile.txt", "w") do io
foreach(i -> println(io, join(i+1:i+10, '|')), 1:100_000)
end
CSV.jl, single threaded:
julia> @time CSV.File("myFile.txt", delim='|', header=false);
  1.898649 seconds (1.54 M allocations: 93.848 MiB, 1.48% gc time)

julia> @time CSV.File("myFile.txt", delim='|', header=false);
  0.029965 seconds (248 allocations: 17.037 MiB)
Pandas:
>>> import pandas as pd
>>> import time
>>> start=time.time()
>>> myData=pd.read_csv("myFile.txt",sep="|",header=None,low_memory=False)
>>> print(time.time()-start)
0.07587623596191406
Conclusions:
- the compilation cost is a one-time cost that has to be paid, and it is constant (it roughly does not depend on how big the file you want to read in is),
- for small files CSV.jl is faster than Pandas (if we exclude the compilation cost).
Now, if you would like to avoid having to pay the compilation cost in every fresh Julia session, this is doable with https://github.com/JuliaLang/PackageCompiler.jl.
From my experience, if you are doing data science work where, for example, you read in thousands of CSV files, I do not have a problem with waiting 2 seconds for the compilation if I can later save hours. It takes more than 2 seconds to write the code that reads in the files.
Of course, if you write a script that does little work and terminates after it is done, then it is a different use case, as compilation time would actually be the majority of the computational cost. In that case, using PackageCompiler.jl is the strategy I use (a minimal sketch follows).
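Here is a minimal sketch of that strategy (my addition; the sysimage file name is a placeholder, and the exact PackageCompiler.jl API can differ between its versions):

using PackageCompiler

# Bake CSV (and anything else you always use) into a custom system image,
# so its compiled code is available at startup.
create_sysimage([:CSV]; sysimage_path="csv_sysimage.so")

Starting Julia with julia --sysimage csv_sysimage.so then loads CSV without paying the parsing-related compilation cost again.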
In my experience, the best way to deal with larger text files is not to load them into Julia, but rather to stream them. This method has some additional fixed costs, but generally runs extremely quickly. Some pseudo-code is this:
function streamdat()
    mycsv = open("/path/to/text.csv", "r")    # <-- opens a path to your text file
    total = 0.0                               # <-- store a running sum here
    while !eof(mycsv)                         # <-- loop through each line of the file
        row = readline(mycsv)
        fields = split(row, "|")              # <-- split each line by |
        total += parse(Float64, fields[1])    # <-- accumulate the first column
    end
    close(mycsv)
    return total
end

streamdat()
The code above is just a simple sum, but this logic can be expanded to more complex problems.
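The same idea can be written with eachline, which also streams the file instead of loading it into memory (a sketch of my own; the path and delimiter mirror the example above):

function stream_sum(path)
    total = 0.0
    for row in eachline(path)                # reads one line at a time
        fields = split(row, "|")
        total += parse(Float64, fields[1])   # accumulate the first column
    end
    return total
end

stream_sum("/path/to/text.csv")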
using CSV
@time df = CSV.read("C:/Users/hafez/personal/r/tutorial for students/Book2.csv")
I recently tried this in Julia 1.4.2. I got a different response, and at first I didn't understand Julia. Then I posted the same thing in the Julia discussion forums, and then I understood that this code mostly reports compile time. Here you can find a benchmark.