Reading large text / CSV files in Julia takes a long time compared to Python. Here are the times to read a file that is 486.6 MB on disk and has 153895 rows and 644 columns.
Python 3.3 example
import pandas as pd
import time
start=time.time()
myData=pd.read_csv("C:\\myFile.txt",sep="|",header=None,low_memory=False)
print(time.time()-start)
Output: 19.90
R 3.0.2 example
system.time(myData<-read.delim("C:/myFile.txt",sep="|",header=F,
stringsAsFactors=F,na.strings=""))
Output:
User System Elapsed
181.13 1.07 182.32
Julia 0.2.0 (Julia Studio 0.4.4) example # 1
using DataFrames
timing = @time myData = readtable("C:/myFile.txt", separator='|', header=false)
Output:
elapsed time: 80.35 seconds (10319624244 bytes allocated)
Julia 0.2.0 (Julia Studio 0.4.4) example # 2
timing = @time myData = readdlm("C:/myFile.txt", '|', header=false)
Output:
elapsed time: 65.96 seconds (9087413564 bytes allocated)
Julia is faster than R, but quite slow compared to Python. What can I do differently to speed up reading a large text file?
A separate issue is that the in-memory size is 18x the on-disk file size in Julia, but only 2.5x for Python. In Matlab, which I have found to be the most memory-efficient for large files, it is 2x the on-disk file size. Any particular reason for the large in-memory size in Julia?
The best answer is probably that I'm not as good a programmer as Wes.
In general, the code in DataFrames is much less well-optimized than the code in Pandas. I'm confident that we can catch up, but it will take some time as there's a lot of basic functionality that we need to implement first. Since there's so much that needs to be built in Julia, I tend to focus on doing things in three parts: (1) build any version, (2) build a correct version, (3) build a fast, correct version. For the work I do, Julia often doesn't offer any versions of essential functionality, so my work gets focused on (1) and (2). As more of the tools I need get built, it'll be easier to focus on performance.
As for memory usage, I think the answer is that we use a set of data structures when parsing tabular data that is much less efficient than those used by Pandas. If I knew the internals of Pandas better, I could list the places where we're less efficient, but for now I'll just speculate that one obvious failing is that we're reading the whole dataset into memory rather than grabbing chunks from disk. This certainly can be avoided and there are issues open for doing so. It's just a matter of time.
On that note, the readtable code is fairly easy to read. The most certain way to get readtable to be faster is to whip out the Julia profiler and start fixing the performance flaws it uncovers.
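For reference, a minimal sketch of what that might look like (on current Julia the profiler lives in the Profile standard library; readtable is the function from the DataFrames version discussed above and has since been superseded by CSV.jl, so treat this as illustrative):

using Profile, DataFrames

# Collect samples while the read call runs, then print the hot spots by call site.
@profile readtable("C:/myFile.txt", separator='|', header=false)
Profile.print()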
There is a relatively new Julia package called CSV.jl by Jacob Quinn that provides a much faster CSV parser, in many cases on par with Pandas: https://github.com/JuliaData/CSV.jl
Note that the "n bytes allocated" output from @time is the total size of all allocated objects, ignoring how many of them might have been freed. This number is often much higher than the final size of live objects in memory. I don't know if this is what your memory size estimate is based on, but I wanted to point this out.
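If what you want is the size of the live object rather than cumulative allocations, Base.summarysize (available on reasonably recent Julia versions) gives a recursive estimate of an object's in-memory footprint; a minimal sketch, assuming the data has already been read into myData:

Base.summarysize(myData)            # bytes actually held by the object, recursively
Base.summarysize(myData) / 1024^2   # the same, in MiB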
I've found a few things that can partially help this situation.
Using the readdlm() function in Julia seems to be considerably faster (e.g. 3x in a recent trial) than readtable(). Of course, if you want the DataFrame object type, you'll then need to convert to it, which may eat up most or all of the speed improvement.
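A minimal sketch of that pattern (on Julia 0.7+ readdlm lives in the DelimitedFiles standard library; the DataFrame(matrix, :auto) constructor assumes a reasonably recent DataFrames version):

using DelimitedFiles, DataFrames

raw = readdlm("myFile.txt", '|', Float64)   # fast read into a plain matrix
df  = DataFrame(raw, :auto)                 # convert to a DataFrame; the copy may eat part of the speed gain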
Specifying dimensions of your file can make a BIG difference, both in speed and in memory allocations. I ran this trial reading in a file that is 258.7 MB on disk:
julia> @time Data = readdlm("MyFile.txt", '\t', Float32, skipstart = 1);
19.072266 seconds (221.60 M allocations: 6.573 GB, 3.34% gc time)
julia> @time Data = readdlm("MyFile.txt", '\t', Float32, skipstart = 1, dims = (File_Lengths[1], 62));
10.309866 seconds (87 allocations: 528.331 MB, 0.03% gc time)
The type specification for your object matters a lot. For instance, if your data has strings in it, then the array you read in will be of type Any, which is expensive memory-wise. If memory is really an issue, you may want to consider preprocessing your data by first converting the strings to integers, doing your computations, and then converting back. Also, if you don't need a ton of precision, using the Float32 type instead of Float64 can save a LOT of space. You can specify this when reading the file in, e.g.:
Data = readdlm("file.csv", ',', Float32)
Regarding memory usage, I've found in particular that the PooledDataArray type (from the DataArrays package) can be helpful in cutting down memory usage if your data has a lot of repeated values. The time to convert to this type is relatively large, so this isn't a time saver per se, but at least helps reduce the memory usage somewhat. E.g. when loading a data set with 19 million rows and 36 columns, 8 of which represented categorical variables for statistical analysis, this reduced the memory allocation of the object from 5x its size on disk to 4x its size. If there are even more repeated values, the memory reduction can be even more significant (I've had situations where the PooledDataArray cuts memory allocation in half).
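In the current package ecosystem the same idea is provided by PooledArrays.PooledArray (the DataArrays package has since been retired); a minimal sketch with a hypothetical column of repeated labels:

using PooledArrays

v = repeat(["low", "medium", "high"], 1_000_000)   # a column with many repeated values
p = PooledArray(v)                                 # unique values stored once, plus compact integer references
Base.summarysize(v), Base.summarysize(p)           # compare the in-memory sizes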
It can also sometimes help to run the garbage collector (gc() at the time this was written; GC.gc() on Julia 0.7 and later) after loading and formatting data, to clear out any unneeded RAM allocation, though generally Julia will do this automatically pretty well.
Still, despite all of this, I'm looking forward to further developments in Julia that enable faster loading and more efficient memory usage for large data sets.
Let us first create the file you are talking about, to provide reproducibility:
open("myFile.txt", "w") do io
    foreach(i -> println(io, join(i+1:i+644, '|')), 1:153895)
end
Now I read this file in using Julia 1.4.2 and CSV.jl 0.7.1.
Single threaded:
julia> @time CSV.File("myFile.txt", delim='|', header=false);
4.747160 seconds (1.55 M allocations: 1.281 GiB, 4.29% gc time)
julia> @time CSV.File("myFile.txt", delim='|', header=false);
2.780213 seconds (13.72 k allocations: 1.206 GiB, 5.80% gc time)
and using e.g. 4 threads:
julia> @time CSV.File("myFile.txt", delim='|', header=false);
4.546945 seconds (6.02 M allocations: 1.499 GiB, 5.05% gc time)
julia> @time CSV.File("myFile.txt", delim='|', header=false);
0.812742 seconds (47.28 k allocations: 1.208 GiB)
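Note that the thread count is fixed when the Julia session starts (for Julia 1.4 this means setting the JULIA_NUM_THREADS environment variable before launching; julia -t 4 also works on Julia 1.5 and later), and you can check it with:

julia> Threads.nthreads()
4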
In R it is:
> system.time(myData<-read.delim("myFile.txt",sep="|",header=F,
+ stringsAsFactors=F,na.strings=""))
user system elapsed
28.615 0.436 29.048
In Python (Pandas) it is:
>>> import pandas as pd
>>> import time
>>> start=time.time()
>>> myData=pd.read_csv("myFile.txt",sep="|",header=None,low_memory=False)
>>> print(time.time()-start)
25.95710587501526
Now if we test fread from R (from the data.table package, which is fast) we get:
> system.time(fread("myFile.txt", sep="|", header=F,
stringsAsFactors=F, na.strings="", nThread=1))
user system elapsed
1.043 0.036 1.082
> system.time(fread("myFile.txt", sep="|", header=F,
stringsAsFactors=F, na.strings="", nThread=4))
user system elapsed
1.361 0.028 0.416
So in this case the summary is:
despite the cost of compiling CSV.File in Julia when you run it for the first time, it is significantly faster than base R or Python (Pandas)
it is comparable in speed to fread in R (in this case slightly slower, but other benchmarks show cases where it is faster)
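As a side note, CSV.File gives you a row iterator; if you want a DataFrame you can materialize it directly (with the CSV.jl version used above and a compatible DataFrames), e.g.:

julia> using DataFrames

julia> df = DataFrame(CSV.File("myFile.txt", delim='|', header=false));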
EDIT: Following a request, I have added a benchmark for a small file (10 columns, 100,000 rows), Julia vs Pandas.
Data preparation step:
open("myFile.txt", "w") do io
    foreach(i -> println(io, join(i+1:i+10, '|')), 1:100_000)
end
CSV.jl, single threaded:
julia> @time CSV.File("myFile.txt", delim='|', header=false);
1.898649 seconds (1.54 M allocations: 93.848 MiB, 1.48% gc time)
julia> @time CSV.File("myFile.txt", delim='|', header=false);
0.029965 seconds (248 allocations: 17.037 MiB)
Pandas:
>>> import pandas as pd
>>> import time
>>> start=time.time()
>>> myData=pd.read_csv("myFile.txt",sep="|",header=None,low_memory=False)
>>> print(time.time()-start)
0.07587623596191406
Conclusions:
the compilation cost is a one-time cost that has to be paid, and it is roughly constant (it does not depend on how big the file you want to read in is)
for small files CSV.jl is faster than Pandas (if we exclude compilation cost)
Now, if you would like to avoid having to pay the compilation cost in every fresh Julia session, this is doable with https://github.com/JuliaLang/PackageCompiler.jl.
In my experience, if you are doing data science work where you, for example, read in thousands of CSV files, I do not have a problem with waiting 2 seconds for the compilation if I can later save hours. It takes more than 2 seconds to write the code that reads in the files.
Of course, if you write a script that does little work and terminates after it is done, then it is a different use case, as compilation time would then actually be the majority of the computational cost. In that case, using PackageCompiler.jl is the strategy I use.
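A minimal sketch of that strategy (assuming PackageCompiler 1.x; the sysimage file name is just an example):

using PackageCompiler

# Bake CSV into a custom system image, then start Julia with: julia --sysimage csv_sysimage.so
create_sysimage([:CSV]; sysimage_path = "csv_sysimage.so")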
In my experience, the best way to deal with larger text files is not to load them into Julia all at once, but rather to stream them. This method has some additional fixed costs, but generally runs extremely quickly. Some pseudocode is this:
function streamdat()
    mycsv = open("/path/to/text.csv", "r")   # <-- open a path to your text file
    total = 0.0                              # <-- store a running sum here
    while !eof(mycsv)                        # <-- loop through each line of the file
        row = readline(mycsv)
        fields = split(row, "|")             # <-- split each line by |
        total += parse(Float64, fields[1])   # <-- accumulate the first column
    end
    close(mycsv)
    return total
end
streamdat()
The code above is just a simple sum, but this logic can be expanded to more complex problems.
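For instance, the same column sum can be written a bit more idiomatically with eachline and a do-block open (path and delimiter as in the sketch above):

total = open("/path/to/text.csv", "r") do io
    # sum the first |-separated field of every line without materializing the file
    sum(parse(Float64, first(split(line, '|'))) for line in eachline(io))
end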
using CSV
@time df = CSV.read("C:/Users/hafez/personal/r/tutorial for students/Book2.csv")
Recently I tried this in Julia 1.4.2. I got a different result and, at first, I didn't understand Julia. Then I posted the same thing in the Julia discussion forums, and then I understood that this code mostly reports compile time. Here you can find the benchmark.
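To separate compilation from the actual read time, run the same call twice in one session (or use BenchmarkTools.@btime); note that current CSV.jl versions also require a sink argument such as DataFrame, as sketched below:

julia> using CSV, DataFrames

julia> @time df = CSV.read("Book2.csv", DataFrame);   # first call: dominated by compilation

julia> @time df = CSV.read("Book2.csv", DataFrame);   # second call: the actual read time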
Related
I am reading Julia performance tips,
https://docs.julialang.org/en/v1/manual/performance-tips/
At the beginning, it mentions two examples.
Example 1,
julia> x = rand(1000);
julia> function sum_global()
           s = 0.0
           for i in x
               s += i
           end
           return s
       end;
julia> @time sum_global()
0.009639 seconds (7.36 k allocations: 300.310 KiB, 98.32% compilation time)
496.84883432553846
julia> @time sum_global()
0.000140 seconds (3.49 k allocations: 70.313 KiB)
496.84883432553846
We see a lot of memory allocations.
Now example 2,
julia> x = rand(1000);
julia> function sum_arg(x)
           s = 0.0
           for i in x
               s += i
           end
           return s
       end;
julia> @time sum_arg(x)
0.006202 seconds (4.18 k allocations: 217.860 KiB, 99.72% compilation time)
496.84883432553846
julia> @time sum_arg(x)
0.000005 seconds (1 allocation: 16 bytes)
496.84883432553846
We see that by passing x as an argument to the function, memory allocations almost disappear and the speed is much higher.
My questions are, can anyone explain:
why does example 1 need so many allocations, and why does example 2 not need as many allocations as example 1?
I am a little confused.
In the two examples, we see that the second time we run the function, it is always faster than the first time.
Does that mean we need to run Julia twice? If Julia is only fast on the second run, then what is the point? Why doesn't Julia just compile first and then run, like Fortran?
Is there any general rule for preventing memory allocations? Or do we always have to use @time to identify the issue?
Thanks!
why does example 1 need so many allocations, and why does example 2 not need as many allocations as example 1?
Example 1 needs so many allocations because x is a global variable (defined outside the scope of the function sum_global). Therefore the type of the variable x can potentially change at any time, i.e. it is possible that:
you define x and sum_global
you compile sum_global
you redefine x (change its type) and run sum_global
In particular, as Julia supports multiple threads, both actions in step 3 could in general even happen in parallel (i.e. you could change the type of x in one thread while sum_global is running in another thread).
So, because the type of x can change after sum_global is compiled, Julia, when compiling sum_global, has to ensure that the compiled code does not rely on the type of x that was present when the compilation took place. Instead, Julia in such cases allows the type of x to be changed dynamically. However, this dynamic nature of x means that it has to be checked at run time (not compile time), and this dynamic checking of x causes the performance degradation and the memory allocations.
You could have fixed this by declaring x to be a const (as const ensures that the type of x may not change):
julia> const x = rand(1000);
julia> function sum_global()
           s = 0.0
           for i in x
               s += i
           end
           return s
       end;
julia> @time sum_global() # this is now fast
0.000002 seconds
498.9290555615045
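On Julia 1.8 or newer, another option is to declare a type for the global (in a fresh session), which likewise removes the instability; a minimal sketch:

julia> x::Vector{Float64} = rand(1000);   # typed global: its type can no longer change

julia> @time sum_global()                 # fast after the first (compiling) call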
Why doesn't Julia just compile first and then run, like Fortran?
This is exactly what Julia does. However, the benefit of Julia is that it performs the compilation automatically when needed. This gives you a smooth interactive development process.
If you wanted you could compile the function before it is run with the precompile function, and then run it separately. However, normally people just run the function without doing it explicitly.
The consequence is that if you use @time:
The first time you run a function it reports both execution time and compilation time (and, as you can see in the examples you have pasted, you get information about what percentage of the time was spent on compilation).
In consecutive runs the function is already compiled, so only execution time is reported.
Is there any general rule for preventing memory allocations?
These rules are exactly what is given in the Performance Tips section of the manual that you are quoting in your question. The tip on using @time is a diagnostic tip there. All the other tips are rules that are recommended to get fast code. However, I understand that the list is long, so a shorter list that, in my experience, is good enough to start with is:
Avoid global variables
Avoid containers with abstract type parameters
Write type stable functions (see the sketch after this list)
Avoid changing the type of a variable
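As a tiny illustration of the "type stable functions" point (with hypothetical functions), @code_warntype flags a function whose return type depends on a run-time value:

julia> unstable(x) = x > 0 ? 1 : 1.0;   # may return an Int or a Float64, depending on the value of x

julia> stable(x) = x > 0 ? 1.0 : 0.0;   # always returns a Float64

julia> @code_warntype unstable(1)        # the output highlights a Union{Float64, Int64} return type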
I'm new to Julia, and after some courses on numerical analysis, programming became a hobby of mine.
I ran some tests using all my cores (processes) and did the same with threads to compare. I noticed that heavier computation went better with the threaded loop than with the processes, but it was about the same when it came to addition (the operations were selected arbitrarily, just as an example).
After some research it's all kind of vague, and I ultimately want some perspective from someone who is using the same language, if it matters at all.
Some technical info: 8 physical cores; addprocs() added a vector of 16 workers, and nthreads() is 16.
using Distributed
addprocs()
@everywhere using SharedArrays;
@everywhere using BenchmarkTools;
function test(lim)
    r = zeros(Int64(lim / 16), Threads.nthreads())
    Threads.@threads for i in eachindex(r)
        r[Threads.threadid()] = (BigInt(i)^7 + 5) % 7
    end
    return sum(r)
end
@btime test(10^4) # 1.178 ms (240079 allocations: 3.98 MiB)
@everywhere function test2(lim)
    a = SharedArray{Int64}(lim)
    @sync @distributed for i = 1:lim
        a[i] = (BigInt(i)^7 + 5) % 7
    end
    return sum(a)
end
@btime test2(10^4) # 3.796 ms (4413 allocations: 189.02 KiB)
Note that your loops do very different things.
In the first loop each thread keeps updating the same single cell of the array. Most likely, since only a single memory cell is updated by each thread, the processor's caching mechanism can be used to speed things up.
In the second loop, on the other hand, each process is updating several different memory cells, and such caching is not possible.
The first array holds Float64 values, while the second holds Int64 values.
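A plausible sketch of the corrected versions (assuming the same setup as above, with workers added and SharedArrays loaded everywhere; each iteration now writes its own Int64 cell):

function test(lim)
    r = zeros(Int64, lim)               # Int64 result array, one cell per iteration
    Threads.@threads for i in 1:lim
        r[i] = (BigInt(i)^7 + 5) % 7     # each iteration writes its own cell
    end
    return sum(r)
end

@everywhere function test2(lim)
    a = SharedArray{Int64}(lim)
    @sync @distributed for i in 1:lim
        a[i] = (BigInt(i)^7 + 5) % 7
    end
    return sum(a)
end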
After correcting those things, the difference gets smaller (this is on my laptop; I have only 8 threads):
julia> @btime test(10^4)
2.781 ms (220037 allocations: 3.59 MiB)
29997
julia> @btime test2(10^4)
4.867 ms (2145 allocations: 90.14 KiB)
29997
Now, the other issue is that when Distributed is used you are doing inter-process communication, which does not occur when using Threads.
Basically, inter-process processing does not make sense for jobs lasting a few milliseconds. When you increase the processing volume, the difference should start to diminish.
So when to use what? It depends. Some general (somewhat subjective) guidelines are the following:
Processes are more robust (threads are still experimental)
Threads are easier as long as you do not need to use locking or atomic values
When the parallelism level goes beyond 16, threads become inefficient and Distributed should be used (this is my personal observation)
When writing utility packages for others, use threads - do not use Distributed inside a package. Explanation: if you add multi-threading to a package, its behavior can be transparent to the user. On the other hand, Julia's multiprocessing (Distributed package) abstraction does not distinguish between parallel and distributed - that is, your workers can be either local or remote. This makes a fundamental difference in how code is designed (e.g. SharedArrays vs DistributedArrays); moreover, the design of the code might also depend on e.g. the number of servers or the possibilities for limiting inter-node communication. Hence, normally, Distributed-related logic should be kept separate from a standard utility package, while multi-threaded functionality can just be made transparent to the package user. There are of course some exceptions to this rule, such as providing distributed data-processing server tools, but this is a general rule of thumb.
For huge scale computations I always use processes because you can easily go onto a computer cluster with them and distribute the workload across hundreds of machines.
I'm developing different discretization schemes, and in order to find out which is the most efficient one I would like to determine the maximum RAM consumption and the time it takes to do a specific task, such as solving a system of equations, overwriting a matrix, or writing the data to a file.
Is there any kind of code or something for doing what I need?
I'm using Julia in Ubuntu by the way, but I could do it in Windows as well.
Thanks a lot
I love using the built-in @time for this kind of thing. See "Measure performance with @time and pay attention to memory allocation". Example:
julia> @time myAwesomeFunction(tmp);
1.293542 seconds (22.08 M allocations: 893.866 MiB, 6.62% gc time)
This prints out time, the number of memory allocations, the size of memory allocations, and the percent time spent garbage collecting ("gc"). Always run this at least twice—the first run will be dominated by compile times!
Also consider BenchmarkTools.jl. This will run the code multiple times, with some cool variable interpolation tricks, and give you better runtime/memory estimates:
julia> using BenchmarkTools, Compat
julia> @btime myAwesomeFunction($tmp);
1.311 s (22080097 allocations: 893.87 MiB)
(My other favorite performance-related thing is the @code_* family of macros, like @code_warntype.)
I think that BenchmarkTools.jl measures total memory use, not peak. I haven't found pure Julia code to measure this, but perhaps this thread is relevant.
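One coarse built-in option is Sys.maxrss(), which reports the maximum resident set size of the process so far; comparing it before and after a call gives a rough, upper-bound sense of peak memory. A sketch, reusing the hypothetical function above:

before = Sys.maxrss()                  # peak resident set size so far, in bytes
myAwesomeFunction(tmp)
println("peak RSS grew by ", (Sys.maxrss() - before) / 1024^2, " MiB")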
I am running some simulations on a machine with 16GB of memory. First, I ran into some errors:
Error: cannot allocate vector of size 6000.1 Mb (the number might not be accurate)
Then I tried to allocate more memory to R by using:
memory.limit(1E10)
The reason for choosing such a big number is that memory.limit would not allow me to select a number less than my system's total memory:
In memory.size(size) : cannot decrease memory limit: ignored
After doing this, I can finish my simulations, but R took around 15GB of memory, which stopped me from doing any post-analysis.
I used object.size() to estimate the total memory used by all the generated variables, which came to only around 10GB. I could not figure out where R put the rest of the memory. So my question is: how do I reasonably allocate memory to R without blowing up my machine?
Thanks!
R is interpreted, so WYSINAWYG (what you see is not always what you get). As mentioned in the comments, you need more memory than is required to store your objects, due to copying of said objects. Also, besides being inefficient, nested for loops are possibly a bad idea because gc won't run in the innermost loop. If you have any of these, I suggest you try to remove them using vectorised methods, or manually call gc in your loops to force garbage collection, but be warned that this will slow things down somewhat.
The issue of the memory required for simple objects can be illustrated by the following example. This code grows a data.frame object. Watch the memory use before and after, and the resulting object size. A lot of garbage is allowed to accumulate before gc is invoked. I think garbage collection is more problematic on Windows than on *nix systems. I am not able to reproduce the example below on Mac OS X, but I can do so repeatedly on Windows. The loop and more explanation can be found in The R Inferno, page 13...
# Current memory usage in Mb
memory.size()
# [1] 130.61
n = 1000
# Run loop overwriting current objects
my.df <- data.frame(a=character(0), b=numeric(0))
for(i in 1:n) {
  this.N <- rpois(1, 10)
  my.df <- rbind(my.df, data.frame(a=sample(letters, this.N, replace=TRUE),
                                   b=runif(this.N)))
}
# Current memory usage afterwards (in Mb)
memory.size()
# [1] 136.34
# BUT... Size of my.df
print( object.size( my.df ) , units = "Mb" )
0.1 Mb
I am trying to read a large (~700Mb) .csv file into R.
The file contains an array of integers less than 256, with a header row and 2 header columns.
I use:
trainSet <- read.csv(trainFileName)
This eventually barfs with:
Loading Data...
R(2760) malloc: *** mmap(size=151552) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
R(2760) malloc: *** mmap(size=151552) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Error: cannot allocate vector of size 145 Kb
Execution halted
Looking at the memory usage, it conks out at about 3Gb usage on a 6Gb machine with zero page file usage at the time of the crash, so there may be another way to fix it.
If I use:
trainSet <- read.csv(trainFileName, header=TRUE, nrows=100)
classes <- sapply(trainSet, class)
I can see that all the columns are being loaded as "integer" which I think is 32 bits.
Clearly using 3Gb to load part of a 700Mb .csv file is far from efficient. I wonder if there's a way to tell R to use 8-bit numbers for the columns? This is what I've done in the past in Matlab and it's worked a treat; however, I can't seem to find any mention of an 8-bit type in R.
Does it exist? And how would I tell read.csv to use it?
Thanks in advance for any help.
The narrow answer is that the add-on package ff allows you to use a more compact representation.
The downside is that the different representation prevents you from passing the data to standard functions.
So you may need to rethink your approach: maybe sub-sampling the data, or getting more RAM.
Q: Can you tell R to use 8 bit numbers
A: No. (Edit: See Dirk's comments below. He's smarter than I am.)
Q: Will more RAM help?
A: Maybe. Assuming a 64 bit OS and a 64 bit instance of R are the starting point, then "Yes", otherwise "No".
Implicit question A: Will a .csv dataset that is 700 MB be 700 MB when read in by read.csv?
A: Maybe. If it really is all integers, it may be smaller or larger. It's going to take 4 bytes for each integer, and if most of your integers were in the range of -9 to 10, they might actually "expand" in size when stored as 4 bytes each. At the moment you are only using 1-3 bytes per value, so you would expect about a 50% increase in size. You would want to use colClasses="integer" in the read function. Otherwise they may get stored as factor or as 8-byte "numeric" if there are any data-input glitches.
Implicit question B: If you get the data into the workspace, will you be able to work with it?
A: Only maybe. You need, at a minimum, three times as much memory as your largest object, because of the way R copies on assignment, even if it is a copy to its own name.
Not trying to be snarky, but the way to fix this is documented in ?read.csv:
These functions can use a surprising amount of memory when reading
large files. There is extensive discussion in the ‘R Data
Import/Export’ manual, supplementing the notes here.
Less memory will be used if ‘colClasses’ is specified as one of
the six atomic vector classes. This can be particularly so when
reading a column that takes many distinct numeric values, as
storing each distinct value as a character string can take up to
14 times as much memory as storing it as an integer.
Using ‘nrows’, even as a mild over-estimate, will help memory
usage.
This example takes a while to run because of the I/O, even with my SSD, but there are no memory issues:
R> # In one R session
R> x <- matrix(sample(256,2e8,TRUE),ncol=2)
R> write.csv(x,"700mb.csv",row.names=FALSE)
R> # In a new R session
R> x <- read.csv("700mb.csv", colClasses=c("integer","integer"),
+ header=TRUE, nrows=1e8)
R> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 173632 9.3 350000 18.7 350000 18.7
Vcells 100276451 765.1 221142070 1687.2 200277306 1528.0
R> # Max memory used ~1.5Gb
R> print(object.size(x), units="Mb")
762.9 Mb