I have a binary file. If I want to read all the numeric data into an array at once, the code is:
y = Array{Float32}(undef, 1000000, 1);
read!("myfile.bin", y)
This gives me an array y of size 1000000×1, i.e. an Array{Float32, 2}.
My question: I don't want to read all the data into an array at once, since that uses a lot of memory. Instead I want to read one specific element from the binary file at a time. For example, I only want to read the third element in the binary file, which would be the third element of array y. How can I do that?
If you just want to read a single element, you don't need to read into an array:
io = open("myfile.bin", "r") # open file for reading
Nbytes = sizeof(Float32) # number of bytes per element
seek(io, (3-1)*Nbytes) # move to the 3rd element
val = read(io, Float32) # read a Float32 element
close(io)
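If you need to look up several individual elements, you could wrap this in a small helper; below is a minimal sketch (readelement is just an illustrative name, not an existing function):
# read the i-th element (1-based) of type T from a flat binary file
function readelement(filename::AbstractString, ::Type{T}, i::Integer) where {T}
    open(filename, "r") do io
        seek(io, (i - 1) * sizeof(T)) # byte offset of the i-th element
        read(io, T)                   # read and return a single value
    end
end

val = readelement("myfile.bin", Float32, 3) # third Float32 in the file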
BTW: if you want an array for your data, you should probably use a length-1000000 Array{Float32, 1} instead of a 1000000×1 Array{Float32, 2}:
y = Array{Float32}(undef, 1000000)
# or
y = Array{Float32, 1}(undef, 1000000)
# or
y = Vector{Float32}(undef, 1000000)
Alternatively, you could mmap the file to access it as an array:
using Mmap

fd = open("myfile.bin")
y = Mmap.mmap(fd, Vector{Float32}, 1000000)
println(y[3])
This uses virtual memory backed by the file, so the whole array does not need to be resident in RAM at once. You can also make the mapping writable.
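For instance, a minimal sketch of a writable mapping (assuming myfile.bin already exists and holds 1000000 Float32 values):
fd = open("myfile.bin", "r+")               # open for reading and writing
y = Mmap.mmap(fd, Vector{Float32}, 1000000) # writes to y go to file-backed pages
y[3] = 1.5f0                                # modify the third element in place
Mmap.sync!(y)                               # flush the change to disk
close(fd)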
If I understand correctly, in Julia m = [[1,2],[3,4]] is a vector of two vectors, each with two elements. And this fact won't change after applying sympify:
ms = sympy.sympify(m)
size(ms)
The output will still be (2,), as expected.
However, if I create a file test.txt with just one line [[1,2],[3,4]] and run
using SymPy
open("test.txt") do io
while ! eof(io)
m = readline(io)
println(m)
ms = sympy.sympify(m)
println(ms)
end
end
The output will be
[[1,2],[3,4]]
Sym[1 2; 3 4]
Namely, ms now suddenly changes to a two-by-two matrix! I really cannot understand this (since the dimension of m after readline remains (2,) as before). Could someone explain this to me?
As a workaround, you can define your own parsing function to read lines from a file into a vector of vectors:
function myparse(T::Type, s::String)
    [[parse(T, c) for c in element if isnumeric(c)] for element in split(s, "],[")]
end
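A quick check in the REPL (note that, since it parses one character at a time, myparse only handles single-digit, non-negative entries, which is enough for this example):
julia> myparse(Int, "[[1,2],[3,4]]")
2-element Vector{Vector{Int64}}:
 [1, 2]
 [3, 4]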
Then, you can simply do:
using SymPy
m = open("test.txt") do io
myparse(Int, readline(io))
end
println(typeof(m)) # Vector{Vector{Int64}} (alias for Array{Array{Int64, 1}, 1})
Then,
julia> ms = sympy.sympify(m)
2-element Vector{Sym}:
[1 2]
[3 4]
I have Julia code (version 1.2) which performs a lot of operations on a 10000 x 10000 Array. Due to an OutOfMemory() error when I run the code, I'm exploring other options to run it, such as memory-mapping. Concerning the use of Mmap.mmap, I'm a bit confused about how to use the Array that I map to my disk, since there is little explanation at https://docs.julialang.org/en/v1/stdlib/Mmap/index.html. Here is the beginning of my code:
using Distances
using LinearAlgebra
using Distributions
using Mmap
data=Float32.(rand(10000,15))
Eucldist=pairwise(Euclidean(),data,dims=1)
D=maximum(Eucldist.^2)
sigma2hat=mean(((Eucldist.^2)./D)[tril!(trues(size((Eucldist.^2)./D)),-1)])
L=exp.(-(Eucldist.^2/D)/(2*sigma2hat))
L is the 10000 x 10000 Array with which I want to work, so I mapped it to my disk with
s = open("mmap.bin", "w+")
write(s, size(L,1))
write(s, size(L,2))
write(s, L)
close(s)
What am I supposed to do after that? The next step is to perform K=eigen(L) and apply other commands to K. How should I do that? With K=eigen(L) or K=eigen(s)? What's the role of the object s and when does it get involved? Moreover, I don't understand why I have to use Mmap.sync! and when: after each line following eigen(L)? At the end of the code? How can I be sure that I'm using my disk space instead of RAM? I would like some pointers about memory-mapping, please. Thank you!
If memory usage is a concern, it is often best to re-assign your very large arrays to a small, type-compatible placeholder (assuming you are done with those intermediate matrices), so that the memory can be garbage collected. After that, you just call Mmap.mmap() on your stored data file, with the element type and the dimensions of the data as the second and third arguments, and assign the function's return value to your variable, in this case L, so that L is bound to the file contents:
using Distances
using LinearAlgebra
using Distributions
using Mmap
function testmmap()
    data = Float32.(rand(10000, 15))
    Eucldist = pairwise(Euclidean(), data, dims=1)
    D = maximum(Eucldist.^2)
    sigma2hat = mean(((Eucldist.^2) ./ D)[tril!(trues(size((Eucldist.^2) ./ D)), -1)])
    L = exp.(-(Eucldist.^2 / D) / (2 * sigma2hat))
    s = open("./tmp/mmap.bin", "w+")
    write(s, size(L,1))
    write(s, size(L,2))
    write(s, L)
    close(s)
    # deref and gc collect
    Eucldist = data = L = zeros(Float32, 2, 2)
    GC.gc()
    s = open("./tmp/mmap.bin", "r+") # allow read and write
    m = read(s, Int)
    n = read(s, Int)
    L = Mmap.mmap(s, Matrix{Float32}, (m, n)) # now L references the file contents
    K = eigen(L)
    K
end
testmmap()
@time testmmap() # 109.657995 seconds (17.48 k allocations: 4.673 GiB, 0.73% gc time)
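On the Mmap.sync! part of the question: because mmap.bin is reopened with "r+", writes into the mapped array go to file-backed pages, and Mmap.sync! flushes any pending changes back to the file. You only need it if you mutate the mapped array and want the file updated; a minimal sketch, placed after the mapping is created inside testmmap:
L[1, 1] = 0.0f0   # mutate the mapped array; the write targets file-backed pages
Mmap.sync!(L)     # flush pending changes from memory to mmap.bin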
I have a big file (75 GB) memory-mapped into an array d that I want to copy into another array m. Because I do not have 75 GB of RAM available, I did:
for (i,v) in enumerate(d)
    m[i] = v
end
This copies the file value by value, but I get a copy rate of ~2 MB/s on an SSD, where I would expect at least 50 MB/s for both reading and writing.
How could I optimize this copy rate?
=== [edit] ===
According to the comments, I changed my code to the following, which sped up the write rate to 15 MB/s:
function copydcimg(m::Array{UInt16,4}, d::Dcimg)
    m .= d
    Mmap.sync!(m)
end
copydcimg(m,d)
At this point, I think I should optimize the Dcimg code. This binary file is made of frames spaced by a timestamp. Here is the code I use to access the frames:
module dcimg

using Mmap
using TOML

struct Dcimg <: AbstractArray{UInt16,4} # struct allowing access to a dcimg file
    filename::String # filename of the dcimg
    header::Int      # header size in bytes
    clock::Int       # clock size in bytes
    x::Int
    y::Int
    z::Int
    t::Int
    m                # linear memory map
    Dcimg(filename, header, clock, x, y, z, t) =
        new(filename, header, clock, x, y, z, t,
            Mmap.mmap(open(filename), Array{UInt16, 3},
                      (x*y + clock÷sizeof(UInt16), z, t), header)
        )
end

# the following functions allow a DCIMG to be accessed like an Array
Base.size(D::Dcimg) = (D.x, D.y, D.z, D.t)

# skip the clock
Base.getindex(D::Dcimg, i::Int) =
    D.m[i + (i ÷ (D.x*D.y))*D.clock÷sizeof(UInt16)]
Base.getindex(D::Dcimg, x::Int, y::Int, z::Int, t::Int) =
    D[x + D.x*((y-1) + D.y*((z-1) + D.z*(t-1)))]

# allow the size to be parsed automatically
function Dcimg(pathtag)
    p = TOML.parsefile(pathtag * ".toml")
    return Dcimg(pathtag * ".dcimg",
        # ...
    )
end

export Dcimg, getframe

end
I got it! The solution was to copy the file chunk by chunk, let's say frame by frame (around 1024×720 UInt16 values). This way I reached 300 MB/s, which I didn't even know was possible in a single thread. Here is the code.
In module dcimg, I added the methods to access the file frame by frame:
# get frame number n (starting from 1)
getframe(D::Dcimg, n::Int) =
    reshape(D.m[
        D.x*D.y*(n-1)+1 + (n-1)*D.clock÷sizeof(UInt16) : # cosmetic line break
        D.x*D.y*n + (n-1)*D.clock÷sizeof(UInt16)
    ], D.x, D.y)
# get the frame for layer z, time t (both starting from 1)
getframe(D::Dcimg, z::Int, t::Int) =
    getframe(D, z + D.z*(t-1))
Iterating over the frames within a loop
function copyframes(m::Array{UInt16,4}, d::Dcimg)
    N = d.z*d.t
    F = d.x*d.y
    for i in 1:N
        m[(i-1)*F+1:i*F] = getframe(d, i)
    end
end
copyframes(m,d)
Thanks to everyone in the comments for leading me to this.
===== edit =====
For further reading, you might look at:
dd: How to calculate optimal blocksize?
http://blog.tdg5.com/tuning-dd-block-size/
which give hints about the optimal block size to copy at a time.
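In the same spirit, here is a minimal sketch of copying a mapped array in fixed-size blocks rather than frame-sized ones; copyblocks! and the default blocksize are only illustrative, and the block size should be tuned for your hardware:
function copyblocks!(dst::AbstractVector, src::AbstractVector; blocksize::Int = 2^20)
    length(dst) == length(src) || throw(DimensionMismatch("dst and src must have equal length"))
    for start in 1:blocksize:length(src)
        stop = min(start + blocksize - 1, length(src))
        dst[start:stop] = src[start:stop]   # copy one contiguous block
    end
    return dst
end
# hypothetical usage with the arrays above: copyblocks!(vec(m), vec(d))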
I would like to run a block of code that skips or exits a command if R goes over a specified memory limit at any time. To illustrate with a related example, the following code skips to the next iteration of the for loop if the code block takes more than a specified time limit. It will print: '1', 'skip', '2'.
params = c(1, 4, 2)
for (i in params) {
  tryCatch(
    expr = {
      evalWithTimeout({
        Sys.sleep(i)
        print(i)
      },
      timeout = 3) # go to the next iteration if the block takes more than 3 seconds
    },
    TimeoutException = function(x) cat("skip")
  )
}
I would like to do something similar, but skip or exit a command if R goes over a memory limit instead. For example, how can I make the following code print: '1', NOTHING, '2'? Note that the second matrix, with 1000 rows, should be skipped before it is fully built. Also, I will not know the size of the matrix/object that needs to be skipped beforehand; I will only know the memory_limit.
unknown = matrix(rnorm(1000*1000), ncol = 1000, nrow = 1000) #unknown object
memory_limit = object.size(unknown)-100000 #known memory limit that happens to be just under the object
##Evaluate_in_memory_limit##{
print(nrow(matrix(rnorm(1*1), ncol = 1, nrow = 1)))
print(nrow(unknown)) #this should be skipped
print(nrow(matrix(rnorm(2*2), ncol = 2, nrow = 2))),
limit = memory_limit
}
An idea:
You could calculate the size of the vector (or matrix) beforehand, if you know its length in advance. For vectors, the formula should be roughly:
integers: 40 + 4 * n bytes
numeric: 40 + 8 * n bytes
Check with e.g.
sapply((1:3)^10, function(n) object.size(numeric(n)))
# or for matrix
sapply((1:3)^10, function(n) object.size(matrix(numeric(n))))
Then use system('free') on Unix to determine the free memory.
Create your elements in a for loop and use the check in an if condition to call next and skip the iteration whenever the used memory would exceed what is available.
Following up on How to add vectors to the columns of some array in Julia?, I would like some analogous clarification for DataArrays.
Let y = randn(100, 2). I would like to create a matrix x with the lagged values (with lags > 0) of y. I have already written code which seems to work properly (see below). I was wondering whether there is a better way of concatenating a DataArray than the one I have used.
T, n = size(y);
x = @data(zeros(T-lags, 0));
for lag in 1:lags
    x = hcat(x, y[lags-lag+1:end-lag, :]);
end
Unless there is a specific reason to do otherwise, my recommendation would be to start with your DataArray x being the size that you want it to be and then fill in the column values you want.
This will give you better performance than recreating the DataArray for each new column, which is what any method for "adding" columns will actually do. It's conceivable that the DataArrays package has prettier syntax for this than what you have in your question, but fundamentally that's still what it would be doing.
Thus, in a simplified version of your example, I would recommend:
using DataArrays
N = 5; T = 10;
X = @data(zeros(T, N));
initial_data_cols = 2; ## specify how much of the initial data is filled in
lags = size(X,2) - initial_data_cols
X[:,1:initial_data_cols] = rand(size(X,1), initial_data_cols) ## First two columns of X are fixed in advance
for lag in 1:lags
    X[:,(lag+initial_data_cols)] = rand(size(X,1))
end
If you did find yourself in a situation where you need to add columns to an already created object, you could improve somewhat upon the code that you have by first creating all of the new objects together and then doing a single addition of them to your initial DataArray. E.g.
X = @data(zeros(10, 2))
X = [X rand(10,3)]
For instance, consider the difference in execution time and in the number and size of memory allocations in the two examples below:
n = 10^5; m = 10;
A = @data rand(n,m);
n_newcol = 10;
function t1(A::AbstractArray, n_newcol)
    n = size(A,1)
    for idx = 1:n_newcol
        A = hcat(A, zeros(n))
    end
    return A
end

function t2(A::AbstractArray, n_newcol)
    n = size(A,1)
    [A zeros(n, n_newcol)]
end
# Stats after running each function once to compile
@time r1 = t1(A, n_newcol); ## 0.154082 seconds (124 allocations: 125.888 MB, 75.33% gc time)
@time r2 = t2(A, n_newcol); ## 0.007981 seconds (9 allocations: 22.889 MB, 31.73% gc time)