How to efficiently generate random unique series of bits with specific length? - julia

Let's say I want to generate 3 unique random series of bits with a length of three. The possible output can be:
001 or [0, 0, 1]
010 or [0, 1, 0]
111 or [1, 1, 1]
#or
011 or [0, 1, 1]
110 or [1, 1, 0]
111 or [1, 1, 1]
# etc.
I provided two notations above (the Vector notation is preferred). The point is where they should be unique. I tried:
julia> unique(convert.(BitVector, rand.(Ref([0, 1]), repeat([3], 3))))
2-element Vector{BitVector}:
[0, 1, 1]
[0, 1, 0]
As you can see, this can return only two unique BitVectors rather than three, which is expected here. I can replace repeat([3], 3) with a larger count such as repeat([3], 5) to make it more likely (though not guaranteed) that I get three unique vectors:
julia> unique(convert.(BitVector, rand.(Ref([0, 1]), repeat([3], 5))))[1:3]
3-element Vector{BitVector}:
[1, 0, 0]
[1, 1, 1]
[1, 0, 1]
But I wonder if there's any better idea for this?
*However, I'm really curious about how I can efficiently generate the first notation for this question (like 101, 001, etc.).

Update: The following randBitSeq will be 3X faster. It generates unique random numbers first, then it fills a Boolean matrix with their binary values.
using StatsBase
function randBitSeq(N, L)
    M = Matrix{Bool}(undef, N, L)
    S = sample(0:2^L-1, N; replace=false)
    i = 0
    for n in S
        i += 1
        for j = 1:L
            if n > 0
                M[i,j] = isodd(n)
                n ÷= 2
            else
                M[i,j] = false
            end
        end
    end
    return M
end
@btime randBitSeq(50, 10)
1.350 μs (3 allocations: 9.17 KiB)
# vs.
@btime randseqset(50, 10)
3.050 μs (5 allocations: 10.84 KiB)
Constructing all possible combinations will exponentially eat memory. A better option is to generate a Set of N random binary series of length L each. Then add more series if the required number is not achieved. This seems much faster for N,L > 3.
function randSeq(N, L)
    s = Set(rand(Bool,L) for i=1:N)
    while length(s) < N
        push!(s, rand(Bool,L))
    end
    s
end
N = 50; L = 10
@btime randSeq($N, $L)
4.071 μs (57 allocations: 4.71 KiB)
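One caveat of the Set-based approach: if N is larger than 2^L, the loop can never finish, because that many distinct sequences do not exist. The following is a minimal sketch of the same idea with an up-front check; the name randSeqChecked is made up for illustration and is not part of the answer above.

```julia
# Sketch only: same rejection idea as randSeq, plus a feasibility check.
# `randSeqChecked` is a hypothetical name, not from the original answer.
function randSeqChecked(N, L)
    # big(2)^L avoids integer overflow when L is large
    N <= big(2)^L || throw(ArgumentError("only 2^$L distinct sequences of length $L exist"))
    s = Set{Vector{Bool}}()
    while length(s) < N
        push!(s, rand(Bool, L))   # duplicates are ignored by the Set
    end
    return collect(s)             # Vector of N distinct Vector{Bool}
end
```

Calling randSeqChecked(3, 3) returns three distinct length-3 Boolean vectors, while randSeqChecked(9, 3) throws instead of looping forever.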

Another nice option is:
using Random, StatsBase
function randseqset(N, L)
    L < sizeof(Int)*8-1 || error("Short seqs only")
    m = BitArray(undef, L, N)
    s = sample(0:(1<<L)-1, N; replace=false)
    map(i->digits!(@view(m[:,i]), s[i]; base=2), 1:N)
end
A version with simple vectors instead of @views is:
function randseqset(N, L)
    L < sizeof(Int)*8-1 || error("Short seqs only")
    s = sample(0:(1<<L)-1, N; replace=false)
    map(i->digits!(Vector{Bool}(undef, L),
                   s[i]; base=2), 1:N)
end
It has the benefit of adapting to parameters a bit (inherited from the no-replace sample code). And it is quite performant and allocation thrifty.
For example:
julia> N = 10; L = 4;
julia> @btime randSeq($N, $L);
1.126 μs (16 allocations: 1.12 KiB)
julia> @btime randseqset($N, $L);
654.878 ns (5 allocations: 944 bytes)
PS If Matrix{Bool} is preferable to BitMatrix, then replace the m = ... line with m = Matrix{Bool}(undef, L, N)
PPS As for the question about the strings, the following works (using same logic as above):
randseqstrset(N, L) = getindex.(
bitstring.(sample(0:(1<<L)-1, N; replace=false)),
Ref(sizeof(Int)*8-L+1:sizeof(Int)*8))
for example:
julia> randseqstrset(3,3)
3-element Vector{String}:
"101"
"000"
"011"
UPDATE: If speed is really an issue, another version can use some BitMatrix trickery:
function randBitSeq2(N, L)
    BM = BitMatrix(undef, 0,0)
    BM.chunks = sample(0:2^L-1, N; replace=false)
    BM.dims = (sizeof(Int64)*8, N)
    BM.len = sizeof(Int64)*8*N
    return @view(BM[1:L,:])
end
This version is called randBitSeq2 because it returns a matrix like randBitSeq but is twice as fast:
julia> @btime randBitSeq2(50,10);
1.882 μs (5 allocations: 9.20 KiB)
julia> @btime randBitSeq(50,10);
3.634 μs (3 allocations: 9.17 KiB)

Here's a thought: given you're only talking about bit strings of length 3, instead of trying to randomly generate them and then enforce uniqueness, how about just take the set of all 3 bit strings and shuffle it around, and then select 3.
For example, you could use collect(Iterators.product([0,1],[0,1],[0,1])) to generate all length 3 bit strings, and then shuffle(x)[1:3] from Random to sample without replacement, e.g.
julia> using Random
julia> shuffle(collect(Iterators.product([0,1],[0,1],[0,1])))[1:3]
3-element Array{Tuple{Int64,Int64,Int64},1}:
(0, 1, 1)
(1, 1, 1)
(1, 0, 1)
Also if you want BitVectors instead you can do
julia> shuffle(BitArray.(Iterators.product([0,1],[0,1],[0,1])))[1:3]
3-element Array{BitArray{1},1}:
[1, 0, 1]
[0, 1, 1]
[1, 1, 1]
Obviously this won't scale well with the length of the strings, but a suggestion for this small case.
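The same enumerate-then-shuffle idea can be written for an arbitrary (still small) length L by splatting L copies of (0, 1) into Iterators.product. This is only a sketch; allbits and pickbits are hypothetical helper names, and the memory use grows as 2^L.

```julia
using Random

# Hypothetical helpers for illustration; only sensible for small L (2^L tuples are built).
allbits(L) = vec(collect(Iterators.product(fill((0, 1), L)...)))

# Shuffle all 2^L tuples and keep the first N, i.e. sample without replacement.
pickbits(N, L) = shuffle(allbits(L))[1:N]

pickbits(3, 3)   # three distinct 3-tuples of 0s and 1s, in random order
```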

Related

Combining vectors of unequal length

x = [1, 2, 3, 4]
y = [1, 2]
If I want to be able to operate on the two vectors with a default value filling in, what are the strategies?
E.g. would like to do the following and implicitly fill in with 0 or missing
x + y # would like [2, 4, 3, 4]
Ideally would like to do this in a generic way so that I could do arbitrary operations with the two.
Disregarding whether Julia has something built-in to do this, remember that Julia is fast. This means that you can write code to support this kind of need.
extend!(x, y::Vector, default=0) = extend!(x, length(y), default)
extend!(x, n::Int, default=0) = begin
    while length(x) < n
        push!(x, default)
    end
    x
end
Then when you have code such as you describe, you can symmetrically extend x and y:
x = [1, 2, 3, 4]
y = [1, 2]
extend!(x, y)
extend!(y, x)
x + y
==> [2, 4, 3, 4]
Note that this mutates y. In many cases, the desired length would come from outside the code and would be applied to both x and y. I can also imagine that 0 is a bad default in general (even though it is completely appropriate in your context of addition).
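If mutating the inputs is undesirable, one alternative is to leave both vectors alone and substitute the default on the fly. The sketch below is not from the answer above; padded and its default keyword are hypothetical names, and it assumes 1-based vectors.

```julia
# Sketch: apply `op` elementwise, treating missing positions as `default`.
# `padded` is a hypothetical helper; neither input is modified.
function padded(op, x, y; default=0)
    n = max(length(x), length(y))
    at(v, i) = i <= length(v) ? v[i] : default   # value or default past the end
    return [op(at(x, i), at(y, i)) for i in 1:n]
end

padded(+, [1, 2, 3, 4], [1, 2])   # [2, 4, 3, 4]
```

Because the operation is passed in, the same helper covers other cases, for example padded(*, x, y; default=1).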
A comment below makes the worthy point that you should consider using append! instead of looping over push!. In fact, it is best to measure differences like that if you care about very small differences. I went ahead and tested:
julia> using BenchmarkTools
julia> extend1(x, n) = begin
           while length(x) < n
               push!(x, 0)
           end
           x
       end
julia> @btime begin
           x = rand(10)
           sum(x)
       end
59.815 ns (1 allocation: 160 bytes)
5.037723569560573
julia> @btime begin
           x = rand(10)
           extend1(x, 1000)
           sum(x)
       end
7.281 μs (8 allocations: 20.33 KiB)
6.079832879992913
julia> x = rand(10)
julia> @btime begin
           x = rand(10)
           append!(x, zeros(990))
           sum(x)
       end
1.290 μs (3 allocations: 15.91 KiB)
3.688526541987817
Pushing primitives in a loop is damned fast; allocating a vector of zeros so we can use append! is faster still (about 1.3 μs versus 7.3 μs in the runs above).
But the real lesson here is seen in the fact that the loop version takes microseconds to append nearly 1000 values (reallocating the array several times). Appending 10 values one by one takes just over 150ns (and append! is slightly faster). This is blindingly fast. Literally doing nothing in R or Python can take longer than this.
This difference would matter in some situations and would be undetectable in many others. If it matters, measure. If it doesn't, do the simplest thing that comes to mind because Julia has your back (performance-wise).
FURTHER UPDATE
Taking a hint from another of Colin's comments, here are results where we use append! but we don't allocate a list. Instead, we use a generator ... that is, a data structure that invents data when asked for it with an interface much like a list. The results are much better than what I showed above.
julia> @btime begin
           x = rand(10)
           append!(x, (0 for i in 1:990))
           sum(x)
       end
565.814 ns (2 allocations: 8.03 KiB)
Note the round brackets around the 0 for i in 1:990.
In the end, Colin was right. Using append! is much faster if we can avoid related overheads. Surprisingly, the base function Iterators.repeated(0, 990) is much slower.
But, no matter what, all of these options are pretty blazingly fast and all of them would probably be so fast that none of these subtle differences would matter.
Julia is fun!
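Another way to pad in place, not measured in this answer, is to grow the vector once with resize! and then fill the new tail. The sketch below uses a hypothetical name, extend_resize!, and I have not benchmarked it against the variants above, so treat any speed expectation as an assumption.

```julia
# Sketch (unbenchmarked): grow once with resize!, then fill the new slots.
# `extend_resize!` is a hypothetical name, not from the original answer.
function extend_resize!(x, n, default=zero(eltype(x)))
    old = length(x)
    if n > old
        resize!(x, n)             # new slots hold arbitrary data until filled
        x[old+1:n] .= default     # in-place fill of the tail
    end
    return x
end

extend_resize!([1.0, 2.0], 4)     # [1.0, 2.0, 0.0, 0.0]
```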
Note that if you want to fill with missing or some other type different from the element type in your original vector, then you will need to change the type of your vectors to allow those new elements. The function below will handle any case.
function fillvectors(x, y, fillvalue=missing)
    xl = length(x)
    yl = length(y)
    if xl < yl
        x::Vector{Union{eltype(x), typeof(fillvalue)}} = x
        for i in xl+1:yl
            push!(x, fillvalue)
        end
    end
    if yl < xl
        y::Vector{Union{eltype(y), typeof(fillvalue)}} = y
        for i in yl+1:xl
            push!(y, fillvalue)
        end
    end
    return x, y
end
x = [1, 2, 3, 4]
y = [1, 2]
julia> (x, y) = fillvectors(x, y)
([1, 2, 3, 4], Union{Missing, Int64}[1, 2, missing, missing])
julia> y
4-element Vector{Union{Missing, Int64}}:
1
2
missing
missing
julia> (x, y) = fillvectors(x, y, 0)
([1, 2, 3, 4], [1, 2, 0, 0])
julia> y
4-element Vector{Int64}:
1
2
0
0
julia> (x, y) = fillvectors(x, y, 1.001)
([1, 2, 3, 4], Union{Float64, Int64}[1, 2, 1.001, 1.001])
julia> y
4-element Vector{Union{Float64, Int64}}:
1
2
1.001
1.001

Julia: Generate all non-repeating permutations in set with duplicates

Let's say, I have a vector x = [0, 0, 1, 1]
I want to generate all different permutations. However, the current permutations function in Julia does not recognize the presence of duplicates in the vector. Therefore in this case, it will output the exact same permutation three times (this one, one where both zeros are swapped and one where the ones are swapped).
Does anybody know a workaround? Because in larger system I end up with an out of bounds error...
Many thanks in advance! :)
permutations returns an iterator and hence running it through unique could be quite efficient with regard to memory usage.
julia> using Combinatorics  # provides permutations
julia> unique(permutations([0, 0, 1, 1]))
6-element Array{Array{Int64,1},1}:
[0, 0, 1, 1]
[0, 1, 0, 1]
[0, 1, 1, 0]
[1, 0, 0, 1]
[1, 0, 1, 0]
[1, 1, 0, 0]
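As a sanity check on the size of such a result, the number of distinct permutations of a vector with repeats is the multinomial coefficient n! / (c1! * c2! * ...), where the c's are the counts of each distinct value. The helper below is a sketch with a hypothetical name; it uses BigInt so that factorials of longer inputs do not overflow.

```julia
# Sketch: count distinct permutations without generating them.
# `count_unique_permutations` is a hypothetical helper name.
function count_unique_permutations(x)
    counts = Dict{eltype(x),Int}()
    for v in x
        counts[v] = get(counts, v, 0) + 1      # tally each distinct value
    end
    num = factorial(big(length(x)))
    den = prod(factorial(big(c)) for c in values(counts))
    return div(num, den)
end

count_unique_permutations([0, 0, 1, 1])   # 6, matching the 6-element result above
```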
I found this answer that I adapted. It expects a sorted vector (or at least repeated values should be together in the list).
julia> function unique_permutations(x::T, prefix=T()) where T
           if length(x) == 1
               return [[prefix; x]]
           else
               t = T[]
               for i in eachindex(x)
                   if i > firstindex(x) && x[i] == x[i-1]
                       continue
                   end
                   append!(t, unique_permutations([x[begin:i-1];x[i+1:end]], [prefix; x[i]]))
               end
               return t
           end
       end
julia> @btime unique_permutations([0,0,0,1,1,1]);
57.100 μs (1017 allocations: 56.83 KiB)
julia> @btime unique(permutations([0,0,0,1,1,1]));
152.400 μs (2174 allocations: 204.67 KiB)
julia> @btime unique_permutations([1;zeros(Int,100)]);
7.047 ms (108267 allocations: 10.95 MiB)
julia> @btime unique(permutations([1;zeros(Int,8)]));
88.355 ms (1088666 allocations: 121.82 MiB)

How to efficiently initialize huge sparse arrays in Julia?

There are two ways one can initialize a NXN sparse matrix, whose entries are to be read from one/multiple text files. Which one is faster? I need the more efficient one, as N is large, typically 10^6.
1). I could store the (x,y) indices in arrays x, y, the entries in an array v and declare
K = sparse(x, y, v);
2). I could declare
K = spzeros(N, N)
then read off the (i,j) coordinates and values v and insert them as
K[i,j]=v;
as they are being read.
I found no tips about this on Julia’s page on sparse arrays.
Don’t insert values one by one: that will be tremendously inefficient since the storage in the sparse matrix needs to be reallocated over and over again.
You can also use BenchmarkTools.jl to verify this:
julia> using SparseArrays
julia> using BenchmarkTools
julia> I = rand(1:1000, 1000); J = rand(1:1000, 1000); X = rand(1000);
julia> function fill_spzeros(I, J, X)
           x = spzeros(1000, 1000)
           @assert axes(I) == axes(J) == axes(X)
           @inbounds for i in eachindex(I)
               x[I[i], J[i]] = X[i]
           end
           x
       end
fill_spzeros (generic function with 1 method)
julia> @btime sparse($I, $J, $X);
10.713 μs (12 allocations: 55.80 KiB)
julia> @btime fill_spzeros($I, $J, $X);
96.068 μs (22 allocations: 40.83 KiB)
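For the file-reading case in the question, the recommended pattern amounts to collecting the triplets first and calling sparse once at the end. The sketch below assumes the entries have already been parsed into (i, j, v) tuples; build_sparse and the sample data are hypothetical. One thing to be aware of: sparse(I, J, V, m, n) adds up values that share the same (i, j), whereas the one-by-one assignment K[i,j] = v overwrites.

```julia
using SparseArrays

# Sketch: accumulate triplets, then build the matrix in a single call.
# `build_sparse` is a hypothetical helper; `entries` stands in for parsed file rows.
function build_sparse(entries, N)
    rows = Int[]; cols = Int[]; vals = Float64[]
    for (i, j, v) in entries
        push!(rows, i); push!(cols, j); push!(vals, v)
    end
    return sparse(rows, cols, vals, N, N)   # duplicate (i, j) entries are summed
end

K = build_sparse([(1, 1, 2.0), (3, 2, -1.0), (1, 1, 0.5)], 4)
K[1, 1]   # 2.5, because the two (1, 1) entries are added together
```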

Roll array, uniform circular shift

Given an array:
arr = [1, 2, 3, 4, 5]
I would like to shift all the elements.
shift!(arr, 2) => [4, 5, 1, 2, 3]
In Python, this is accomplished with Numpy using numpy.roll. How is this done in Julia?
No need to implement it yourself, there is a built-in function for this
julia> circshift(arr, 2)
5-element Array{Int64,1}:
4
5
1
2
3
It's also (slightly) more efficient than the roll2 function proposed in another answer:
julia> @btime circshift($arr, 2);
68.563 ns (1 allocation: 128 bytes)
julia> @btime roll2($arr, 2);
70.605 ns (4 allocations: 256 bytes)
Note, however, that none of the proposed functions operates in-place. They all create a new array. There is also the built-in circshift!(dest, src, shift) which operates in a preallocated dest (which, however, must be != src).
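A short illustration of the preallocated variant mentioned above, using only Base functions; the variable name dest is just for the example. A negative shift rotates in the other direction.

```julia
arr = [1, 2, 3, 4, 5]

# Write the shifted result into a preallocated buffer instead of a new array.
dest = similar(arr)
circshift!(dest, arr, 2)
dest                 # [4, 5, 1, 2, 3]

# A negative shift rotates the other way.
circshift(arr, -2)   # [3, 4, 5, 1, 2]
```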
The function by Seanny123 does a lot of copying and can be improved to have a smaller memory footprint and execute faster. Consider:
function roll2(arr, step)
    len = length(arr)
    [view(arr,len-step+1:len); view(arr,1:len-step)]
end
arr = [1,2,3,4,5,6,7,8,9,10];
And now the times (REPL output):
julia> using BenchmarkTools
julia> @btime roll($arr,2);
124.254 ns (3 allocations: 400 bytes)
julia> @btime roll2($arr,2);
73.386 ns (4 allocations: 288 bytes)
Of course the fastest way is to change arr in-place.
You can write a simple function for this:
function roll(arr, step)
    return vcat(arr[end-step+1:end], arr[1:end-step])
end
println(roll(1:5, 2))
# => [4, 5, 1, 2, 3]
println(roll(1:6, 4))
# => [3, 4, 5, 6, 1, 2]
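The roll function above assumes 0 <= step <= length(arr). A small variation, sketched here under the hypothetical name roll_mod, reduces the step modulo the length so that negative or oversized steps behave like circshift.

```julia
# Sketch: like `roll`, but accepts any integer step (hypothetical name).
function roll_mod(arr, step)
    n = length(arr)
    n == 0 && return collect(arr)      # nothing to rotate
    k = mod(step, n)                   # always in 0:n-1, also for negative steps
    return vcat(arr[end-k+1:end], arr[1:end-k])
end

roll_mod([1, 2, 3, 4, 5], 7)    # [4, 5, 1, 2, 3], same as a step of 2
roll_mod([1, 2, 3, 4, 5], -2)   # [3, 4, 5, 1, 2]
```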

How to find the index of the last maximum in julialang?

I have an array that contains repeated nonnegative integers, e.g., A=[5,5,5,0,1,1,0,0,0,3,3,0,0]. I would like to find the position of the last maximum in A. That is the largest index i such that A[i]>=A[j] for all j. In my example, i=3.
I tried to find the indices of all maximum of A then find the maximum of these indices:
A = [5,5,5,0,1,1,0,0,0,3,3,0,0];
Amax = maximum(A);
i = maximum(find(x -> x == Amax, A));
Is there any better way?
length(A) - indmax(@view A[end:-1:1]) + 1
should be pretty fast, but I didn't benchmark it.
EDIT: I should note that by definition @crstnbr's solution (to write the algorithm from scratch) is faster (how much faster is shown in Xiaodai's response). This is an attempt to do it using Julia's built-in array functions.
What about findlast(A.==maximum(A)) (which of course is conceptually similar to your approach)?
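The snippets in this thread were written for an older Julia; since Julia 1.0, indmax is called argmax and find is called findall. Under that assumption, the two ideas above can be sketched for current versions as follows (lastmax_findlast and lastmax_reverse are hypothetical names, and the second one assumes a 1-based Vector).

```julia
A = [5, 5, 5, 0, 1, 1, 0, 0, 0, 3, 3, 0, 0]

# findlast with a predicate avoids building the temporary Boolean array A .== maximum(A).
lastmax_findlast(A) = findlast(==(maximum(A)), A)

# argmax on a reversed view finds the first maximum from the end.
lastmax_reverse(A) = length(A) - argmax(@view A[end:-1:1]) + 1

lastmax_findlast(A)   # 3
lastmax_reverse(A)    # 3
```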
The fastest thing would probably be explicit loop implementation like this:
function lastindmax(x)
    k = 1
    m = x[1]
    @inbounds for i in eachindex(x)
        if x[i]>=m
            k = i
            m = x[i]
        end
    end
    return k
end
I tried @Michael's solution and @crstnbr's solution, and I found the latter much faster:
a = rand(Int8(1):Int8(5),1_000_000_000)
@time length(a) - indmax(@view a[end:-1:1]) + 1 # 19 seconds
@time length(a) - indmax(@view a[end:-1:1]) + 1 # 18 seconds
function lastindmax(x)
    k = 1
    m = x[1]
    @inbounds for i in eachindex(x)
        if x[i]>=m
            k = i
            m = x[i]
        end
    end
    return k
end
@time lastindmax(a) # 3 seconds
@time lastindmax(a) # 2.8 seconds
Michael's solution doesn't support Strings (ERROR: MethodError: no method matching view(::String, ::StepRange{Int64,Int64})) or sequences so I add another solution:
julia> lastimax(x) = maximum((j,i) for (i,j) in enumerate(x))[2]
julia> A="abžcdž"; lastimax(A) # unicode is OK
6
julia> lastimax(i^2 for i in -10:7)
1
If you would rather not have to catch an exception for an empty sequence:
julia> lastimax(x) = !isempty(x) ? maximum((j,i) for (i,j) in enumerate(x))[2] : 0;
julia> lastimax(i for i in 1:3 if i>4)
0
Simple(!) benchmarks:
This is up to 10 times slower than Michael's solution for Float64:
julia> mlastimax(A) = length(A) - indmax(@view A[end:-1:1]) + 1;
julia> A = rand(Float64, 1_000_000); @time lastimax(A); @time mlastimax(A)
0.166389 seconds (4.00 M allocations: 91.553 MiB, 4.63% gc time)
0.019560 seconds (6 allocations: 240 bytes)
80346
(I am surprised) it is 2 times faster for Int64!
julia> A = rand(Int64, 1_000_000); @time lastimax(A); @time mlastimax(A)
0.015453 seconds (10 allocations: 304 bytes)
0.031197 seconds (6 allocations: 240 bytes)
423400
it is 2-3 times slower for Strings
julia> A = ["A$i" for i in 1:1_000_000]; @time lastimax(A); @time mlastimax(A)
0.175117 seconds (2.00 M allocations: 61.035 MiB, 41.29% gc time)
0.077098 seconds (7 allocations: 272 bytes)
999999
EDIT2:
@crstnbr's solution is faster and works with Strings too (it doesn't work with generators). There is a difference between lastindmax and lastimax: the first returns a byte index, the second returns a character index:
julia> S = "1š3456789ž"
julia> length(S)
10
julia> lastindmax(S) # return value is bigger than length
11
julia> lastimax(S) # return character index (which is not byte index to String) of last max character
10
julia> S[chr2ind(S, lastimax(S))]
'ž': Unicode U+017e (category Ll: Letter, lowercase)
julia> S[chr2ind(S, lastimax(S))]==S[lastindmax(S)]
true
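The byte-index versus character-index distinction can be seen directly on the same string. As far as I know, chr2ind is no longer available in Julia 1.x and nextind(s, 0, n) gives the byte index of the n-th character instead; the sketch below relies on that assumption.

```julia
S = "1š3456789ž"

length(S)                 # 10 characters
lastindex(S)              # 11, the byte index where the last character starts
ncodeunits(S)             # 12 bytes, since 'š' and 'ž' take two bytes each in UTF-8

# Convert character position 10 to a byte index (replacement for chr2ind(S, 10)).
byteidx = nextind(S, 0, 10)
S[byteidx]                # 'ž'
```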
