How can I save an array of arrays in Julia using HDF5?
In my particular case I have a single array containing 10,000 arrays of varying lengths. I'd like the 10,000 arrays to be part of a "group", but creating new datasets / groups for each array makes reading the file very slow, so I am seeking an alternative.
You can flatten the array of arrays into a single 2 x N matrix, where one row contains the original data and the other row records the index i of the array each value originally came from.
using HDF5
# Define your array of arrays.
arr = [[1,2],[3,4,5]]
# Open your HDF5 file.
h5open("data.hdf5", "w") do f
    # Create a 2 x N dataset, where N is the combined length of all your arrays.
    # (d_create is the older HDF5.jl name; newer releases call it create_dataset.)
    N = sum(length.(arr))
    d_create(f, "X", Int, ((2,N),(2,-1)), "chunk", (1,1000))
    n = 1
    for i in 1:length(arr)
        m = length(arr[i])
        f["X"][1, n:n+m-1] = fill(i, m)   # row 1: index of the source array
        f["X"][2, n:n+m-1] = arr[i]       # row 2: the data itself
        n += m
    end
    print(f["X"][:,:])
end
The arrays are then stored as follows:
> [1 1 2 2 2; 1 2 3 4 5]
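To get the original arrays back after reading the dataset, the index row can be used to regroup the values. The following is a minimal sketch, assuming the 2 x N matrix has already been read into memory as X; the function name unflatten is hypothetical and not part of HDF5.jl.

```julia
# Rebuild the array of arrays from the flattened 2 x N layout.
# Row 1 holds the index of the source array, row 2 holds the data.
function unflatten(X::AbstractMatrix{<:Integer})
    ids  = X[1, :]
    vals = X[2, :]
    out = [eltype(X)[] for _ in 1:maximum(ids)]
    for (i, v) in zip(ids, vals)
        push!(out[i], v)   # append each value to the array it came from
    end
    return out
end

X = [1 1 2 2 2; 1 2 3 4 5]
restored = unflatten(X)
println(restored)
```

A single pass over the columns keeps the cost proportional to N, regardless of how many arrays were stored.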
Let's say I have a vector V, and I want to either turn this vector into multiple m x n matrices, or get multiple m x n matrices from this Vector V.
For the most basic example: Turn V = collect(1:75) into 3 5x5 matrices.
As far as I am aware this can be done by first calling reshape(V, 5, :) and then looping through the result. Is there a better way in Julia without using a loop?
If possible, a solution that can easily switch between row-major and column-major results is preferable.
TL;DR
m, n, n_matrices = 4, 2, 5
V = collect(1:m*n*n_matrices)
V = reshape(V, m, n, :)
V = permutedims(V, [2,1,3])
display(V)
From my limited knowledge about Julia:
When doing V = collect(1:m*n*n_matrices), you initialize a contiguous array in memory. From V you wish to create a container of m by n matrices. You can achieve this by doing V = reshape(V, m, n, :), after which you can access the first matrix with V[:,:,1]. The "container" in this case is just another array (thus you have a three-dimensional array), which we interpret here as "an array of matrices" (but you could also interpret it as a box). You can then transpose every matrix in your array by swapping the first two dimensions like this: permutedims(V, [2,1,3]).
How this works
From what I understand, n-dimensional arrays in Julia are contiguous arrays in memory when you don't do any "skipping" (e.g. V[1:2:end]). For example, the 2 x 4 matrix A:
1 3 5 7
2 4 6 8
is in memory just 1 2 3 4 5 6 7 8. You simply interpret the data in a specific way, where the first two numbers make up the first column, the next two numbers make up the second column, and so on. The reshape function simply specifies how you want to interpret the data in memory. So if we did reshape(A, 4, 2), we would interpret the numbers in memory as "the first four values make up the first column, the next four values make up the second column", and we would get:
1 5
2 6
3 7
4 8
We are basically doing the same thing here, but with an extra dimension.
From my observations, permutedims allocates a new array in this case, whereas reshape does not. Also, feel free to correct me if I am wrong.
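The difference between reinterpreting memory and copying it can be checked directly. Below is a small sketch, using the same sizes as the TL;DR, that compares reshape, permutedims, and Base's lazy PermutedDimsArray wrapper; the variable names A, P and L are only illustrative.

```julia
m, n, k = 4, 2, 5
V = collect(1:m*n*k)

A = reshape(V, m, n, :)               # same memory as V, viewed as k matrices of size m x n
P = permutedims(A, (2, 1, 3))         # new array: every matrix is transposed (n x m)
L = PermutedDimsArray(A, (2, 1, 3))   # lazy wrapper: same indexing as P, but no copy

println(size(A), " ", size(P))
println(P[:, :, 1] == transpose(A[:, :, 1]))

# Writing through V shows which results still share memory with it.
V[1] = -1
println((A[1, 1, 1], P[1, 1, 1], L[1, 1, 1]))
```

After the write, A and L reflect the change while P keeps the old value, which is consistent with permutedims making a copy.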
Old answer:
I don't know much about Julia, but in Python using NumPy I would have done something like this:
reshape(V, :, m, n)
EDIT: As @BatWannaBe states, the result is technically one array (but three-dimensional). You can always interpret a three-dimensional array as a container of 2D arrays, which from my understanding is what you are asking for.
Our project has a DLL with a function that returns a 2D array flattened into a 1D array of Microsoft VARIANT structures.
It looks like this (assume the VT type is int):
1 2 3
4 5 6
and it returns 1 2 3 4 5 6 (as an array).
The number of rows and columns in the array keeps changing (2x3, 4x5, or 1000x1).
I am able to call the function from .cpp using Rcpp. I need a way to convert the variant array into R data type(s).
One way is to convert it into a list, but to get it into row/column format I would need to process it again, which I want to avoid.
I cannot find any package in R that can help me here. Any suggestions? Can we create nested lists dynamically in Rcpp?
I have two multi-dimensional arrays, A3D and B3D. A3D has dimensions 2 x 2 x n, while B3D has dimensions m x 2 x n. Every 2D subarray in A3D is a symmetric matrix. For every i, I want to compute B3D[:,:,i] * A3D[:,:,i] * transpose(B3D[:,:,i]) and store the result in a multi-dimensional array. I tried the following Julia code to accomplish the task. However, the computation took around 4 seconds, which is quite slow. I am wondering whether the performance of my code could be improved. Below is my Julia code. Thanks for looking at my problem.
using LinearAlgebra  # provides Symmetric
m = 100;
n = 30_000; # this could be a very large number.
A3D = rand(2,2,n);
[A3D[:,:,i] = Symmetric(A3D[:,:,i]) for i in 1:n];
B3D = rand(m,2,n);
res3D = zeros(m,m,n);
# approach 1
@time [res3D[:,:,i] = eB*eA*transpose(eB)
    for (eA, eB, i) in zip(eachslice(A3D,dims=3),eachslice(B3D,dims=3),1:n)];
UPDATE:
I added another approach to tackle my problem (see below). Approach 2 is a bit better than approach 1. But, can we even improve the performance of my code further?
# approach 2
@inbounds for i = 1:n
    res3D[:,:,i] = B3D[:,:,i]*A3D[:,:,i]*B3D[:,:,i]';
end
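One direction that may help is to avoid the temporary copies created by slicing with B3D[:,:,i] and by the intermediate products. The following is only a sketch under that assumption, not a benchmarked solution: the function name sandwich! is hypothetical, it uses views plus LinearAlgebra's in-place mul!, and the sizes are kept small for illustration.

```julia
using LinearAlgebra

# Compute res3D[:,:,i] = B3D[:,:,i] * A3D[:,:,i] * B3D[:,:,i]' for every i,
# reusing one scratch buffer instead of allocating new slices each iteration.
function sandwich!(res3D, A3D, B3D)
    m, _, n = size(B3D)
    tmp = Matrix{Float64}(undef, m, 2)          # scratch space for B * A
    @views for i in 1:n
        B = B3D[:, :, i]                        # a view, not a copy
        mul!(tmp, B, A3D[:, :, i])              # tmp = B * A
        mul!(res3D[:, :, i], tmp, B')           # res = tmp * B'
    end
    return res3D
end

m, n = 4, 10
A3D = rand(2, 2, n)
for i in 1:n
    A3D[:, :, i] = Symmetric(A3D[:, :, i])
end
B3D = rand(m, 2, n)
res3D = sandwich!(zeros(m, m, n), A3D, B3D)
```

Putting the loop inside a function also matters on its own, since loops over non-constant globals are slow in Julia.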
I have an array of arrays, a
49455-element Array{Array{AbstractString,1},1}
the length varies, this is just one of many possibilities
I need to do a b = vcat(a...) giving me
195158-element Array{AbstractString,1}:
and convert it to a SharedArray to have all cores work on the strings in it (I'll convert to a Char matrix behind the curtains, but this is not important).
In a, every element is an array of some number of strings; I get the lengths with
map(x -> length(x), a)
49455-element Array{Int64,1}:
1
4
8
.
.
2
Is there a way I can easily restore the array b to the same dimensions as a?
With the Iterators.jl package (since superseded by IterTools.jl, which also provides a partition function with a step argument):
# `a` holds original. `b` holds flattened version. `newa` should == `a`
using Iterators # install using Pkg.add("Iterators")
lmap = map(length,a) # same length vector defined in OP
newa = [b[ib+1:ie] for (ib,ie) in partition([0;cumsum(lmap)],2,1)]
This is somewhat neat, and can also be used to produce a generator for the original vectors, but a for loop implementation should be just as fast and clear.
As a complement to Dan Getz's answer, we can also use zip instead of Iterators.jl's partition:
tails = cumsum(map(length,a))
heads = [1; tails .+ 1][1:end-1]  # broadcast with .+ so this also works on current Julia versions
newa = [b[i:j] for (i,j) in zip(heads,tails)]
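For reference, here is a self-contained version of the same idea that needs no extra package. It is a sketch with made-up sample data; only a, b and newa correspond to names used in the question.

```julia
# Sample data standing in for the real array of string arrays.
a = [["x"], ["p", "q", "r", "s"], ["u", "v"]]
b = reduce(vcat, a)                 # flattened version, same result as vcat(a...)

lens  = map(length, a)              # length of each original sub-array
tails = cumsum(lens)                # last index of each sub-array within b
heads = tails .- lens .+ 1          # first index of each sub-array within b

newa = [b[i:j] for (i, j) in zip(heads, tails)]
println(newa == a)
```

Replacing b[i:j] with view(b, i:j) would give slices that share memory with b instead of copies, which may matter if b is later backed by a SharedArray.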
In the following code I am using the Julia Optim package for finding an optimal matrix with respect to an objective function.
Unfortunately the provided optimize function only supports vectors, so I have to transform the matrix to a vector before passing it to the optimize function, and also transform it back when using it in the objective function.
function opt(A0,X)
    I1(A) = sum(maximum(X*A,1))
    function transform(A)
        # reshape matrix to vector
        return reshape(A,prod(size(A)))
    end
    function transformback(tA)
        # reshape vector to matrix
        return reshape(tA, size(A0))
    end
    obj(tA) = -I1(transformback(tA))
    result = optimize(obj, transform(A0), method = :nelder_mead)
    return transformback(result.minimum)
end
I think Julia is allocating new space for this every time and it feels slow, so what would be a more efficient way to tackle this problem?
So long as arrays contain elements that are considered immutable, which includes all primitives, the elements of an array are stored in one big contiguous blob of memory. So you can break dimension rules and simply treat a 2-dimensional array as a 1-dimensional array, which is what you want to do. So you don't strictly need to reshape, but I don't think reshape is your problem.
Arrays are column major and contiguous
Consider the following function
function enumerateArray(a)
    for i = 1:*(size(a)...)
        print(a[i])
    end
end
This function multiplies all of the dimensions of a together and then loops from 1 to that number assuming a is one dimensional.
When you define a as the following
julia> a = [ 1 2; 3 4; 5 6]
3x2 Array{Int64,2}:
1 2
3 4
5 6
The result is
julia> enumerateArray(a)
135246
This illustrates a couple of things.
Yes it actually works
Matrices are stored in column-major format
reshape
So, the question is why doesn't reshape use that fact? Well it does. Here's the julia source for reshape in array.c
a = (jl_array_t*)allocobj((sizeof(jl_array_t) + sizeof(void*) + ndimwords*sizeof(size_t) + 15)&-16);
So yes, a new array is created, but only the new dimension information is created; it points back to the original data, which is not copied. You can verify this simply like this:
julia> b = reshape(a,6);
julia> size(b)
(6,)
julia> size(a)
(3,2)
julia> b[4]=100
100
julia> a
3x2 Array{Int64,2}:
1 100
3 4
5 6
So setting the 4th element of b sets the (1,2) element of a.
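The same check can be written as a short script. This is a sketch for illustration; vec is the Base shorthand for flattening an array to one dimension, and the variable names are arbitrary.

```julia
a = [1 2; 3 4; 5 6]

b = reshape(a, 6)    # 1-D array over the same memory as a
v = vec(a)           # shorthand for the same flattening

b[4] = 100           # linear index 4 is element (1,2) in column-major order
println(a[1, 2])     # the write through b is visible in a
println(v[4])        # and through v as well

c = copy(v)          # an explicit copy is independent of a
c[1] = -1
println(a[1, 1])     # a is unchanged by writes to c
```

So passing vec(A0) to an optimizer and reshaping back inside the objective should only create small wrapper objects, not copies of the data.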
As for overall slowness
I1(A) = sum(maximum(X*A,1))
will create a new array.
You can use a couple of macros to track this down: @profile and @time. @time will additionally record the amount of memory allocated and can be put in front of any expression.
For example
julia> A = rand(1000,1000);
julia> X = rand(1000,1000);
julia> @time sum(maximum(X*A,1))
elapsed time: 0.484229671 seconds (8008640 bytes allocated)
266274.8435928134
The statistics recorded by @profile are output using Profile.print()
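If the allocation of X*A turns out to matter, one option is to write the product into a preallocated buffer. The sketch below assumes a current Julia version, where the reduction dimension is passed as the keyword dims=1 rather than as a positional argument; the names I1_inplace! and buf are hypothetical.

```julia
using LinearAlgebra

X = rand(100, 100)
A = rand(100, 100)

# Allocating version: X * A creates a new 100 x 100 matrix on every call.
I1(X, A) = sum(maximum(X * A, dims=1))

# Variant that reuses a caller-supplied buffer for the product.
function I1_inplace!(buf, X, A)
    mul!(buf, X, A)                   # buf = X * A, written in place
    return sum(maximum(buf, dims=1))  # still allocates the small 1 x n row of maxima
end

buf = Matrix{Float64}(undef, size(X, 1), size(A, 2))
println(I1(X, A) ≈ I1_inplace!(buf, X, A))
```

Wrapping either call in @time or @allocated (after one warm-up call, so compilation is not counted) shows how much memory each version requests.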
Also, most methods in Optim actually allow you to supply Arrays, not just Vectors. You could generalize the nelder_mead function to do the same.