Combining vectors of unequal length - julia

x = [1, 2, 3, 4]
y = [1, 2]
If I want to be able to operate on the two vectors with a default value filling in, what are the strategies?
E.g. would like to do the following and implicitly fill in with 0 or missing
x + y # would like [2, 4, 3, 4]
Ideally would like to do this in a generic way so that I could do arbitrary operations with the two.

Disregarding whether Julia has something built-in to do this, remember that Julia is fast. This means that you can write code to support this kind of need.
extend!(x, y::Vector, default=0) = extend!(x, length(y), default)
extend!(x, n::Int, default=0) = begin
    while length(x) < n
        push!(x, default)
    end
    x
end
Then when you have code such as you describe, you can symmetrically extend x and y:
x = [1, 2, 3, 4]
y = [1, 2]
extend!(x, y)
extend!(y, x)
x + y
==> [2, 4, 3, 4]
Note that this mutates y. In many cases, the desired length would come from outside the code and would be applied to both x and y. I can also imagine that 0 is a bad default in general (even though it is completely appropriate in your context of addition).
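If mutating the inputs is undesirable, a non-mutating variant is also possible. The following is only a sketch: padded_op and its default keyword are hypothetical names, not part of Julia or of the answer above. It relies on Base's get(A, i, default), which returns the default when i is out of bounds.

```julia
# Hypothetical helper: apply `op` elementwise, treating positions past the
# end of the shorter vector as `default`. Neither input is modified.
function padded_op(op, x::AbstractVector, y::AbstractVector; default=0)
    n = max(length(x), length(y))
    # get(v, i, default) returns v[i] for a valid index, otherwise `default`
    return [op(get(x, i, default), get(y, i, default)) for i in 1:n]
end

padded_op(+, [1, 2, 3, 4], [1, 2])                   # [2, 4, 3, 4]
padded_op(+, [1, 2, 3, 4], [1, 2]; default=missing)  # [2, 4, missing, missing]
```

Because op is an argument, the same helper covers arbitrary binary operations, at the cost of allocating a new result vector on each call.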
A comment below makes the worthy point that you should consider using append! instead of looping over push!. In fact, it is best to measure differences like that if you care about very small differences. I went ahead and tested:
julia> using BenchmarkTools
julia> extend1(x, n) = begin
while length(x) < n
push!(x, 0)
end
x
end
julia> @btime begin
x = rand(10)
sum(x)
end
59.815 ns (1 allocation: 160 bytes)
5.037723569560573
julia> @btime begin
x = rand(10)
extend1(x, 1000)
sum(x)
end
7.281 μs (8 allocations: 20.33 KiB)
6.079832879992913
julia> x = rand(10)
julia> @btime begin
x = rand(10)
append!(x, zeros(990))
sum(x)
end
1.290 μs (3 allocations: 15.91 KiB)
3.688526541987817
julia>
Pushing primitives in a loop is damned fast; allocating a vector of zeros so we can use append! is faster still (roughly 5x in the run above).
But the real lesson here is seen in the fact that the loop version takes microseconds to append nearly 1000 values (reallocating the array several times). Appending 10 values one by one takes just over 150ns (and append! is slightly faster). This is blindingly fast. Literally doing nothing in R or Python can take longer than this.
This difference would matter in some situations and would be undetectable in many others. If it matters, measure. If it doesn't, do the simplest thing that comes to mind because Julia has your back (performance-wise).
FURTHER UPDATE
Taking a hint from another of Colin's comments, here are results where we use append! but we don't allocate a list. Instead, we use a generator ... that is, a data structure that invents data when asked for it with an interface much like a list. The results are much better than what I showed above.
julia> @btime begin
x = rand(10)
append!(x, (0 for i in 1:990))
sum(x)
end
565.814 ns (2 allocations: 8.03 KiB)
Note the round brackets around the 0 for i in 1:990.
In the end, Colin was right. Using append! is much faster if we can avoid related overheads. Surprisingly, the base function Iterators.repeated(0, 990) is much slower.
But, no matter what, all of these options are pretty blazingly fast and all of them would probably be so fast that none of these subtle differences would matter.
Julia is fun!
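Putting the generator idea back into the original helper might look like the sketch below. The name extend_append! is made up for illustration, and I have not benchmarked this exact function, so measure before relying on it.

```julia
# Hypothetical variant of extend!: grow `x` to length `n` with one append!
# call fed by a generator, instead of pushing elements one at a time.
function extend_append!(x::Vector, n::Integer, default=0)
    shortfall = n - length(x)
    # Only append when x is actually shorter than n
    shortfall > 0 && append!(x, (default for _ in 1:shortfall))
    return x
end

extend_append!([1, 2], 4)      # [1, 2, 0, 0]
extend_append!([1, 2, 3], 2)   # unchanged: [1, 2, 3]
```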

Note that if you want to fill with missing or some other type different from the element type in your original vector, then you will need to change the type of your vectors to allow those new elements. The function below will handle any case.
function fillvectors(x, y, fillvalue=missing)
    xl = length(x)
    yl = length(y)
    if xl < yl
        x::Vector{Union{eltype(x), typeof(fillvalue)}} = x
        for i in xl+1:yl
            push!(x, fillvalue)
        end
    end
    if yl < xl
        y::Vector{Union{eltype(y), typeof(fillvalue)}} = y
        for i in yl+1:xl
            push!(y, fillvalue)
        end
    end
    return x, y
end
x = [1, 2, 3, 4]
y = [1, 2]
julia> (x, y) = fillvectors(x, y)
([1, 2, 3, 4], Union{Missing, Int64}[1, 2, missing, missing])
julia> y
4-element Vector{Union{Missing, Int64}}:
1
2
missing
missing
julia> (x, y) = fillvectors(x, y, 0)
([1, 2, 3, 4], [1, 2, 0, 0])
julia> y
4-element Vector{Int64}:
1
2
0
0
julia> (x, y) = fillvectors(x, y, 1.001)
([1, 2, 3, 4], Union{Float64, Int64}[1, 2, 1.001, 1.001])
julia> y
4-element Vector{Union{Float64, Int64}}:
1
2
1.001
1.001
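A shorter alternative is to let vcat pick the element type. This is a sketch only, and pad_to is a hypothetical name. Note that it behaves differently from fillvectors for mixed numeric types: vcat promotes Int and Float64 to Float64 rather than building a Union.

```julia
# Hypothetical helper: return a padded copy of `v` with at least `n` elements.
# vcat promotes the element type as needed, e.g. Int with missing becomes
# Union{Missing, Int}, and Int with Float64 becomes Float64.
pad_to(v::AbstractVector, n::Integer, fillvalue=missing) =
    length(v) >= n ? copy(v) : vcat(v, fill(fillvalue, n - length(v)))

pad_to([1, 2], 4)          # Union{Missing, Int64}[1, 2, missing, missing]
pad_to([1, 2], 4, 0)       # [1, 2, 0, 0]
pad_to([1, 2], 4, 1.001)   # [1.0, 2.0, 1.001, 1.001]
```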

Related

How to efficiently generate random unique series of bits with specific length?

Let's say I want to generate 3 unique random series of bits with a length of three. The possible output can be:
001 or [0, 0, 1]
010 or [0, 1, 0]
111 or [1, 1, 1]
#or
011 or [0, 1, 1]
110 or [1, 1, 0]
111 or [1, 1, 1]
# etc.
I provided two notations above (the Vector notation is preferred). The point is that they should be unique. I tried:
julia> unique(convert.(BitVector, rand.(Ref([0, 1]), repeat([3], 3))))
2-element Vector{BitVector}:
[0, 1, 1]
[0, 1, 0]
As you can see, this may yield only two unique BitVectors rather than three, which is expected with random draws. I can replace repeat([3], 3) with repeat([3], 5) to make it more likely that I get three unique sets:
julia> unique(convert.(BitVector, rand.(Ref([0, 1]), repeat([3], 5))))[1:3]
3-element Vector{BitVector}:
[1, 0, 0]
[1, 1, 1]
[1, 0, 1]
But I wonder if there's any better idea for this?
*However, I'm really curious about how I can efficiently generate the first notation for this question (like 101, 001, etc.).
Update: The following randBitSeq will be 3X faster. It generates unique random numbers first, then it fills a Boolean matrix with their binary values.
using StatsBase
function randBitSeq(N, L)
    M = Matrix{Bool}(undef,N,L)
    S = sample(0:2^L-1, N; replace=false)
    i = 0
    for n in S
        i += 1
        for j = 1:L
            if n > 0
                M[i,j] = isodd(n)
                n ÷= 2
            else
                M[i,j] = false
            end
        end
    end
    return M
end
@btime randBitSeq(50, 10)
1.350 μs (3 allocations: 9.17 KiB)
# vs.
@btime randseqset(50, 10)
3.050 μs (5 allocations: 10.84 KiB)
Constructing all possible combinations will exponentially eat memory. A better option is to generate a Set of N random binary series of length L each. Then add more series if the required number is not achieved. This seems much faster for N,L > 3.
function randSeq(N, L)
    s = Set(rand(Bool,L) for i=1:N)
    while length(s) < N
        push!(s, rand(Bool,L))
    end
    s
end
N = 50; L = 10
@btime randSeq($N, $L)
4.071 μs (57 allocations: 4.71 KiB)
Another nice option is:
using Random, StatsBase
function randseqset(N, L)
    L < sizeof(Int)*8-1 || error("Short seqs only")
    m = BitArray(undef, L, N)
    s = sample(0:(1<<L)-1, N; replace=false)
    map(i->digits!(@view(m[:,i]), s[i]; base=2), 1:N)
end
A version with simple vectors instead of views is:
function randseqset(N, L)
    L < sizeof(Int)*8-1 || error("Short seqs only")
    s = sample(0:(1<<L)-1, N; replace=false)
    map(i->digits!(Vector{Bool}(undef, L),
                   s[i]; base=2), 1:N)
end
It has the benefit of adapting to parameters a bit (inherited from the no-replace sample code). And it is quite performant and allocation thrifty.
For example:
julia> N = 10; L = 4;
julia> @btime randSeq($N, $L);
1.126 μs (16 allocations: 1.12 KiB)
julia> @btime randseqset($N, $L);
654.878 ns (5 allocations: 944 bytes)
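For readers without StatsBase, the same "sample distinct integers, then expand to bits" logic can be sketched with the Random standard library alone. rand_unique_bits is a hypothetical name, and randperm(2^L) materialises all 2^L values, so this is only sensible for small L.

```julia
using Random

# Hypothetical helper (not from the answers above): draw N distinct integers
# in 0:2^L-1, then expand each into its L binary digits, least significant
# bit first. Distinct integers guarantee distinct bit sequences.
function rand_unique_bits(N, L)
    N <= 2^L || error("only 2^L distinct sequences of length L exist")
    values = randperm(2^L)[1:N] .- 1
    return [Bool.(digits(v; base=2, pad=L)) for v in values]
end

seqs = rand_unique_bits(3, 3)   # three distinct length-3 bit vectors
```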
PS If Matrix{Bool} is preferable to BitMatrix, then replace the m = ... line with m = Matrix{Bool}(undef, L, N)
PPS As for the question about the strings, the following works (using same logic as above):
randseqstrset(N, L) = getindex.(
bitstring.(sample(0:(1<<L)-1, N; replace=false)),
Ref(sizeof(Int)*8-L+1:sizeof(Int)*8))
for example:
julia> randseqstrset(3,3)
3-element Vector{String}:
"101"
"000"
"011"
UPDATE: If speed is really an issue, another version can use some BitMatrix trickery:
function randBitSeq2(N, L)
    BM = BitMatrix(undef, 0,0)
    BM.chunks = sample(0:2^L-1, N; replace=false)
    BM.dims = (sizeof(Int64)*8, N)
    BM.len = sizeof(Int64)*8*N
    return @view(BM[1:L,:])
end
This version is called randBitSeq2 because it returns a matrix like randBitSeq but is twice as fast:
julia> @btime randBitSeq2(50,10);
1.882 μs (5 allocations: 9.20 KiB)
julia> @btime randBitSeq(50,10);
3.634 μs (3 allocations: 9.17 KiB)
Here's a thought: given you're only talking about bit strings of length 3, instead of trying to randomly generate them and then enforce uniqueness, how about just take the set of all 3 bit strings and shuffle it around, and then select 3.
For example, you could use collect(Iterators.product([0,1],[0,1],[0,1])) to generate all length 3 bit strings, and then shuffle(x)[1:3] from Random to sample without replacement, e.g.
julia> using Random
julia> shuffle(collect(Iterators.product([0,1],[0,1],[0,1])))[1:3]
3-element Array{Tuple{Int64,Int64,Int64},1}:
(0, 1, 1)
(1, 1, 1)
(1, 0, 1)
Also if you want BitVectors instead you can do
julia> shuffle(BitArray.(Iterators.product([0,1],[0,1],[0,1])))[1:3]
3-element Array{BitArray{1},1}:
[1, 0, 1]
[0, 1, 1]
[1, 1, 1]
Obviously this won't scale well with the length of the strings, but a suggestion for this small case.
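The same enumerate-and-shuffle idea can be written for an arbitrary (small) length L by splatting L copies of 0:1 into Iterators.product. This is a sketch; all_bits is a hypothetical name, and it allocates all 2^L sequences, so the scaling caveat above still applies.

```julia
using Random

# Hypothetical helper: every bit sequence of length L as a Vector{Int}.
# Allocates 2^L vectors, so only suitable for small L.
all_bits(L) = vec([collect(t) for t in Iterators.product(fill(0:1, L)...)])

# Sample 3 distinct sequences without replacement
pick = shuffle(all_bits(3))[1:3]
```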

Row-wise operations between matrices in Julia

I'm attempting to translate the equivalent of the following Python code (from SMT GEKPLS) into Julia:
def differences(X, Y):
D = X[:, np.newaxis, :] - Y[np.newaxis, :, :]
return D.reshape((-1, X.shape[1]))
So, given an input like this:
X = np.array([[1.0,1.0,1.0], [2.0,2.0,2.0]])
Y = np.array([[1.0,2.0,3.0], [4.0,5.0,6.0], [7.0,8.0,9.0]])
diff = differences(X,Y)
We get an output (diff) that looks like this:
[[ 0. -1. -2.]
[-3. -4. -5.]
[-6. -7. -8.]
[ 1. 0. -1.]
[-2. -3. -4.]
[-5. -6. -7.]]
What is an efficient way to do this with Julia code? I expect the X and Y input matrices to be quite large.
After some thinking, I came to this function:
function differences(X, Y)
    Rx = repeat(X, inner=(size(Y, 1), 1))
    Ry = repeat(Y, size(X, 1))
    Rx - Ry
end
I hope I was helpful.
Here's a version that avoids repeat, which creates unnecessary data duplication:
function diffs_row(X, Y)
    N = size(X, 2)
    # X' is N×m; reshape to N×1×m so it broadcasts against the N×n matrix Y'
    return reshape(reshape(X', N, 1, :) .- Y', N, :)'
end
The reason for all the adjoints ' is that it isn't really natural to operate row-wise in Julia. Julia arrays are column-major so reshape will retrieve data column-wise. If you decide instead to change the orientation of the data, you could write
function diffs_col(X, Y)
    N = size(X, 1)
    return reshape(reshape(X, N, 1, :) .- Y, N, :)
end
instead.
One often sees this when translating numpy code to Julia. Numpy is natively row-major, so the translation becomes a bit awkward. You should consider changing your data layout to be column major in many cases.
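To make the column-major point concrete, here is a small illustration (not part of the answer itself) of how vec and reshape walk a matrix column by column, and why a transpose is needed to get row order:

```julia
A = [1 2 3;
     4 5 6]

vec(A)                 # [1, 4, 2, 5, 3, 6]: columns are read first
reshape(A, 3, 2)       # [1 5; 4 3; 2 6]: the same column-wise order, refolded
vec(permutedims(A))    # [1, 2, 3, 4, 5, 6]: row order requires a transpose
```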
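This might be faster than other alternatives, while still being easy to understand. Here X and Y are vectors of vectors (one inner vector per row); for matrices, iterate over eachrow(X) and eachrow(Y) instead.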
[x .- y for x ∈ X for y ∈ Y]
6-element Vector{Vector{Float64}}:
[0.0, -1.0, -2.0]
[-3.0, -4.0, -5.0]
[-6.0, -7.0, -8.0]
[1.0, 0.0, -1.0]
[-2.0, -3.0, -4.0]
[-5.0, -6.0, -7.0]
The one thing I disliked about numpy is that one has to exactly remember each function in conjunction with a combination of input parameters. In Julia, the traditional loop can serve as an efficient drop-in replacement for most algorithms.
Addendum: The above might be the fastest solution as I said, provided that working with a Vector{Vector{Float64}} is not an issue. If it is, here is another solution that outputs a Matrix{Float64} while being fast as well.
function diffr(X,Y)
    i, l, m, n = 0, length(first(X)), length(X), length(Y)
    Z = Matrix{Float64}(undef, m*n, l)
    for x in X, y in Y
        Z[i+=1,:] .= x .- y
    end
    Z
end
And here is a performance comparison of all posted solutions on my computer.
@btime [x.-y for x∈$X for y∈$Y] # 312.245 ns (9 allocations: 656 bytes)
@btime diffr($X, $Y) # 73.868 ns (1 allocation: 208 bytes)
@btime differences($X, $Y) # 439.000 ns (12 allocations: 896 bytes)
@btime diffs_row($X, $Y) # 463.131 ns (11 allocations: 784 bytes)
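Since the question starts from matrices, here is a sketch of the same loop for Matrix inputs, assuming Julia 1.1 or later for eachrow. diffr_matrix is a hypothetical name and is not included in the timings above.

```julia
# Hypothetical matrix version of the loop approach: row k of the result is
# X[i, :] - Y[j, :], with j varying fastest (same order as the numpy code).
function diffr_matrix(X::AbstractMatrix, Y::AbstractMatrix)
    size(X, 2) == size(Y, 2) || error("X and Y need the same number of columns")
    m, n = size(X, 1), size(Y, 1)
    Z = Matrix{promote_type(eltype(X), eltype(Y))}(undef, m * n, size(X, 2))
    k = 0
    for x in eachrow(X), y in eachrow(Y)
        k += 1
        Z[k, :] .= x .- y
    end
    return Z
end

X = [1.0 1.0 1.0; 2.0 2.0 2.0]
Y = [1.0 2.0 3.0; 4.0 5.0 6.0; 7.0 8.0 9.0]
D = diffr_matrix(X, Y)   # 6×3 matrix matching the numpy output in the question
```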

How to broadcast set operation on array of sets in Julia?

I'm trying to perform set operations between a given set y and all items in some array of sets X as follows:
X=Array{Set}([Set([1,2,1]), Set([4,6,8 ]), Set([4,5])])
y=Set{Int16}([2,8,4])
z=broadcast(intersect, y, X)
println(z)
Which gives me empty sets, instead of sets with the singletons in y, for my example.
You have to protect y from being iterated over. Normally you would get an error, but unfortunately y has three elements, the same as vector X. Let us create a bigger vector to see the problem:
julia> X=Array{Set}([Set([1,2,1]), Set([4,6,8 ]), Set([4,5]), Set([7])])
4-element Array{Set,1}:
Set([2, 1])
Set([4, 8, 6])
Set([4, 5])
Set([7])
julia> y=Set{Int16}([2,8,4])
Set{Int16} with 3 elements:
4
2
8
julia> z=broadcast(intersect, y, X)
ERROR: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 3 and 4")
How to solve it - wrap y in a 0-dimensional container with Ref(y) like this:
julia> z=broadcast(intersect, Ref(y), X)
4-element Array{Set{Int16},1}:
Set([2])
Set([4, 8])
Set([4])
Set()
or equivalently just write:
julia> z=intersect.(Ref(y), X)
4-element Array{Set{Int16},1}:
Set([2])
Set([4, 8])
Set([4])
Set()
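For comparison, here are two other spellings that should give the same result as Ref(y): wrapping y in a 1-tuple, which broadcasting also treats as a single repeated value, and a plain comprehension. This is an illustrative sketch using Int sets rather than the Int16 set above.

```julia
X = [Set([1, 2]), Set([4, 6, 8]), Set([4, 5]), Set([7])]
y = Set([2, 8, 4])

a = intersect.(Ref(y), X)          # Ref marks y as a scalar for broadcasting
b = intersect.((y,), X)            # a 1-tuple is likewise broadcast as one value
c = [intersect(y, x) for x in X]   # no broadcasting at all

a == b == c                        # true
```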

Roll array, uniform circular shift

Given an array:
arr = [1, 2, 3, 4, 5]
I would like to shift all the elements.
shift!(arr, 2) => [4, 5, 1, 2, 3]
In Python, this is accomplished with Numpy using numpy.roll. How is this done in Julia?
No need to implement it yourself, there is a built-in function for this
julia> circshift(arr, 2)
5-element Array{Int64,1}:
4
5
1
2
3
It's also (slightly) more efficient than roll2 proposed in another answer:
julia> @btime circshift($arr, 2);
68.563 ns (1 allocation: 128 bytes)
julia> @btime roll2($arr, 2);
70.605 ns (4 allocations: 256 bytes)
Note, however, that none of the proposed functions operates in-place. They all create a new array. There is also the built-in circshift!(dest, src, shift) which operates in a preallocated dest (which, however, must be != src).
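A minimal illustration of the preallocated form, which may help if the shift happens repeatedly and you want to reuse a buffer:

```julia
arr = [1, 2, 3, 4, 5]
dest = similar(arr)          # preallocated output; must not be the same array as arr

circshift!(dest, arr, 2)     # dest is now [4, 5, 1, 2, 3]; arr is unchanged
circshift(arr, -2)           # a negative shift rolls the other way: [3, 4, 5, 1, 2]
```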
The function by Seanny123 does a lot of copying and can be improved to have a smaller memory footprint and execute faster. Consider:
function roll2(arr, step)
    len = length(arr)
    [view(arr,len-step+1:len); view(arr,1:len-step)]
end
arr = [1,2,3,4,5,6,7,8,9,10];
And now the times (REPL output):
julia> using BenchmarkTools
julia> @btime roll($arr,2);
124.254 ns (3 allocations: 400 bytes)
julia> @btime roll2($arr,2);
73.386 ns (4 allocations: 288 bytes)
Of course the fastest way is to change arr in-place.
You can write a simple function for this:
function roll(arr, step)
    return vcat(arr[end-step+1:end], arr[1:end-step])
end
println(roll(1:5, 2))
# => [4, 5, 1, 2, 3]
println(roll(1:6, 4))
# => [3, 4, 5, 6, 1, 2]

Equivalent of pandas 'clip' in Julia

In pandas, there is the clip function (see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.clip.html), which constrains values within the lower and upper bound provided by the user. What is the Julia equivalent? I.e., I would like to have:
> clip.([2 3 5 10],3,5)
> [3 3 5 5]
Obviously, I can write it myself, or use a combination of min and max, but I was surprised to find out there is none. StatsBase provides the trim and winsor functions, but these do not allow fixed values as input, but rather counts or percentiles (https://juliastats.github.io/StatsBase.jl/stable/robust.html).
You are probably looking for clamp:
help?> clamp
clamp(x, lo, hi)
Return x if lo <= x <= hi. If x > hi, return hi. If x < lo, return lo. Arguments are promoted to a common type.
This is a function for scalar x, but we can broadcast it over the vector using dot-notation:
julia> clamp.([2, 3, 5, 10], 3, 5)
4-element Array{Int64,1}:
3
3
5
5
If you don't care about the original array you can also use the in-place version clamp!, which modifies the input:
julia> A = [2, 3, 5, 10];
julia> clamp!(A, 3, 5);
julia> A
4-element Array{Int64,1}:
3
3
5
5
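As a quick check of the min/max combination mentioned in the question, the sketch below shows that it agrees with clamp on the example, including for the 1×4 matrix literal used there:

```julia
x = [2 3 5 10]                      # 1×4 matrix, as written in the question

clipped = clamp.(x, 3, 5)           # [3 3 5 5]
by_hand = min.(max.(x, 3), 5)       # the same result using min and max

clipped == by_hand                  # true
```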
