Get only elements of one array that are in another array - julia

I'm learning Julia coming from Python. I want to get the elements of an array b that are also in an array a. My attempt in Julia is shown after the Python code that does what I need. My question is this: is there a better/faster way to do this in Julia? I'm suspicious of the simplicity of what I've written in Julia, and I worry that such a naive-looking solution might have suboptimal performance (again, coming from Python).
Python:
import numpy as np
a = np.array([1, 2, 3, 4])
b = np.array([7, 8, 2, 3, 5])
indices_b_in_a = np.nonzero(np.isin(b, a))
b_in_a = b[indices_b_in_a]
# array([2, 3])
Julia:
a = [1, 2, 3, 4];
b = [7, 8, 2, 3, 5];
indices_b_in_a = findall(ele -> ele in a, b);
b_in_a = b[indices_b_in_a];
#2-element Vector{Int64}:
# 2
# 3

Maybe this would be a helpful answer:
julia> intersect(Set(a), Set(b))
Set{Int64} with 2 elements:
2
3
# Or even
julia> intersect(a, b)
2-element Vector{Int64}:
2
3
Note that if b contains repeated values, this method does not exactly replicate your expected behavior, since intersect returns only unique values. If you have repeated elements, you need an element-by-element search instead; in that case, a binary search (on a sorted a) would be a good choice.
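To make the duplicate issue concrete, here is a small sketch. The sample values and the names unique_common, sorted_a, and b_in_a are only for illustration, and it assumes the elements of a can be sorted so that searchsorted can do the binary search.

```julia
a = [1, 2, 3, 4]
b = [7, 2, 8, 2, 3, 5]   # note the repeated 2

# intersect keeps each common value only once
unique_common = intersect(a, b)          # [2, 3]

# Element-by-element search in a sorted copy of a (searchsorted does a binary search);
# repeated values in b are preserved
sorted_a = sort(a)
b_in_a = filter(x -> !isempty(searchsorted(sorted_a, x)), b)   # [2, 2, 3]
```

The filter version keeps the order and multiplicity of b, which is what the findall-based code in the question does as well.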
Another approach is using broadcasting in Julia:
julia> a = rand(1:100, 1000);
b = rand(1:3000, 5000);
julia> b[in.(b, Ref(a))]
161-element Vector{Int64}:
8
5
70
73
⋮
# Exactly the same approach with a slightly different syntax
julia> b[b.∈Ref(a)]
161-element Vector{Int64}:
8
5
70
73
30
63
73
⋮
Q: What is the role of Ref in the above code block?
Ans: Wrapping a in Ref makes broadcasting treat a as a single (scalar) value instead of iterating over it. Otherwise, broadcasting would iterate over the elements of a and b simultaneously, which is not what we want here (even if both arrays happen to have the same length).
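A small sketch of the difference, using made-up arrays of equal length so that the version without Ref does not throw a dimension error (the names pairwise, membership, and same_result are just for illustration):

```julia
a = [1, 2, 3, 4]
b = [7, 8, 2, 3]          # same length as a on purpose

# Without Ref: pairs b[i] with a[i], so this asks "is 7 in 1?", "is 8 in 2?", ...
pairwise = in.(b, a)              # all false

# With Ref: a is treated as one value, so this asks "is b[i] anywhere in a?"
membership = in.(b, Ref(a))       # [false, false, true, true]

# A one-element tuple is another way to shield a from broadcasting
same_result = in.(b, (a,))
```

If a and b had different lengths, the version without Ref would fail with a DimensionMismatch instead of silently giving the wrong answer.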
Julia's syntax has its own conventions, but it is not that complicated. I say this because you mentioned:
I worry that such a naive looking solution...
Last but not least, do not forget to wrap your code in a function if you want to obtain a good performance in Julia.
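As a rough sketch of that advice, the lookup could be wrapped in a function like the one below. The name elements_in is hypothetical, and building a Set first is an assumption about what helps when a is large (hash lookup instead of scanning a for every element of b); it is worth benchmarking on your own data.

```julia
# Hypothetical helper: returns the elements of b that also occur in a,
# keeping the order and repetitions of b.
function elements_in(b, a)
    lookup = Set(a)                 # build the lookup table once
    return filter(in(lookup), b)    # in(lookup) is the function x -> x in lookup
end

a = [1, 2, 3, 4]
b = [7, 8, 2, 3, 5]
b_in_a = elements_in(b, a)          # [2, 3]
```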

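Another approach uses an array comprehension. Note that it compares every pair of elements, returns the matches in the order of a, and repeats a value once for each matching pair: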
julia> [i for i in a for j in b if i == j]
2-element Vector{Int64}:
2
3

Related

How do you access multi-dimension array by N array of index element-wise?

Suppose we have
A = [1 2; 3 4]
In numpy, the following syntax will produce
A[[1,2],[1,2]] = [1,4]
But in Julia, the same syntax selects the whole submatrix, which outputs
A[[1,2],[1,2]] = [1 2; 3 4]
Is there a concise way to achieve the same thing as numpy without using for loops?
To get what you want I would use CartesianIndex like this:
julia> A[CartesianIndex.([(1,1), (2,2)])]
2-element Vector{Int64}:
1
4
or
julia> A[[CartesianIndex(1,1), CartesianIndex(2,2)]]
2-element Vector{Int64}:
1
4
Like Bogumil said, you probably want to use CartesianIndex. But if you want to get your result by supplying a vector of indices for each dimension, as in your Python [1,2],[1,2] example, you need to zip these indices first:
julia> A[CartesianIndex.(zip([1,2], [1,2]))]
2-element Vector{Int64}:
1
4
How does this work? zip traverses both vectors of indices at the same time (like a zipper) and returns an iterator over the tuples of indices:
julia> zip([1,2],[1,2]) # is a lazy iterator
zip([1, 2], [1, 2])
julia> collect(zip([1,2],[1,2])) # collect to show all the tuples
2-element Vector{Tuple{Int64, Int64}}:
(1, 1)
(2, 2)
and then CartesianIndex turns them into cartesian indices, which can then be used to get the corresponding values in A:
julia> CartesianIndex.(zip([1,2],[1,2]))
2-element Vector{CartesianIndex{2}}:
CartesianIndex(1, 1)
CartesianIndex(2, 2)
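Two closely related spellings may also be worth knowing. This is only a sketch, and the names rows and cols are made up for the example: broadcasting the CartesianIndex constructor over both index vectors pairs them up without an explicit zip, and broadcasting getindex with A wrapped in Ref picks the elements directly.

```julia
A = [1 2; 3 4]
rows = [1, 2]
cols = [1, 2]

# Broadcasting the constructor pairs rows[i] with cols[i]
idx = CartesianIndex.(rows, cols)
via_cartesian = A[idx]                       # [1, 4]

# Broadcast getindex itself; Ref keeps A from being iterated
via_getindex = getindex.(Ref(A), rows, cols) # [1, 4]
```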

Create a Vector of Integers and missing Values

What a hassle...
I'm trying to create a vector of integers and missing values. This works fine:
b = [4, missing, missing, 3]
But I would actually like the vector to be longer with more missing values and therefore use repeat(), but this doesn't work
append!([1,2,3], repeat([missing], 1000))
and this also doesn't work
[1,2,3, repeat([missing], 1000)]
Please, help me out, here.
It is also worth noting that if you do not need an in-place operation with append!, it is much easier to use vertical concatenation:
julia> [[1, 2, 3]; repeat([missing], 2); 4; 5] # note ; that denotes vcat
7-element Array{Union{Missing, Int64},1}:
1
2
3
missing
missing
4
5
julia> vcat([1,2,3], repeat([missing], 2), 4, 5) # this is the same but using a different syntax
7-element Array{Union{Missing, Int64},1}:
1
2
3
missing
missing
4
5
The benefit of vcat is that it automatically does the type promotion (as opposed to append! in which case you have to correctly specify the eltype of the target container before the operation).
Note that because vcat does automatic type promotion, in corner cases you might get a different eltype for the result than with append!:
julia> x = [1, 2, 3]
3-element Array{Int64,1}:
1
2
3
julia> append!(x, [1.0, 2.0]) # conversion from Float64 to Int happens here
5-element Array{Int64,1}:
1
2
3
1
2
julia> [[1, 2, 3]; [1.0, 2.0]] # promotion of Int to Float64 happens in this case
5-element Array{Float64,1}:
1.0
2.0
3.0
1.0
2.0
See also https://docs.julialang.org/en/v1/manual/arrays/#man-array-literals.
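The conversion done by append! only succeeds when the values are exactly representable in the target eltype. A short sketch with a made-up non-integer value (the names promoted and err are illustrative):

```julia
# vcat promotes, so a non-integer Float64 is no problem
promoted = vcat([1, 2, 3], [1.5])          # Vector{Float64}

# append! must convert 1.5 to Int, which is not possible exactly
err = try
    append!([1, 2, 3], [1.5])
    nothing
catch e
    e
end
# err is an InexactError here
```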
This will work:
append!(Union{Int,Missing}[1,2,3], repeat([missing], 1000))
[1,2,3] creates just a Vector{Int}, and since Julia is strongly typed, a Vector{Int} cannot hold values that cannot be converted to Int, such as missing. Hence, when defining a container that you plan to hold more than one data type, you need to state that explicitly; here we have defined a Vector{Union{Int,Missing}}.
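If the final length is known in advance, another option is to allocate the whole vector as missing and then fill in the integers. This is a sketch under that assumption; the length 1003 and the names v and w are only for illustration.

```julia
# Allocate 1003 slots, all initialized to missing
v = Vector{Union{Int, Missing}}(missing, 1003)
v[1:3] = [1, 2, 3]

# Alternatively, widen an existing Vector{Int} first and then append
w = convert(Vector{Union{Int, Missing}}, [1, 2, 3])
append!(w, fill(missing, 1000))
```

Both give a 1003-element vector with eltype Union{Missing, Int64} whose first three entries are 1, 2, 3.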

Vectorized splatting

I'd like to be able to splat an array of tuples into a function in a vectorized fashion. For example, if I have the following function,
function foo(x, y)
x + y
end
and the following array of tuples,
args_array = [(1, 2), (3, 4), (5, 6)]
then I could use a list comprehension to obtain the desired result:
julia> [foo(args...) for args in args_array]
3-element Array{Int64,1}:
3
7
11
However, I would like to be able to use the dot vectorization notation for this operation:
julia> foo.(args_array...)
ERROR: MethodError: no method matching foo(::Int64, ::Int64, ::Int64)
But as you can see, that particular syntax doesn't work. Is there a vectorized way to do this?
foo.(args_array...) doesn't work because it's doing:
foo.((1, 2), (3, 4), (5, 6))
# which is roughly equivalent to
[foo(1,3,5), foo(2,4,6)]
In other words, it's taking each element of args_array as a separate argument and then broadcasting foo over those arguments. You want to broadcast foo over the elements directly. The trouble is that running:
foo.(args_array)
# is roughly equivalent to:
[foo((1,2)), foo((3,4)), foo((5,6))]
In other words, the broadcast syntax is just passing each tuple as a single argument to foo. We can fix that with a simple intermediate function:
julia> bar(args) = foo(args...);
julia> bar.(args_array)
3-element Array{Int64,1}:
3
7
11
Now that's doing what you want! You don't even need to define the named intermediate function if you don't want to. This is exactly equivalent:
julia> (args->foo(args...)).(args_array)
3-element Array{Int64,1}:
3
7
11
And in fact you can generalize this quite easily:
julia> splat(f) = args -> f(args...);
julia> (splat(foo)).(args_array)
3-element Array{Int64,1}:
3
7
11
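For what it's worth, recent Julia versions ship a helper of this kind: as far as I know, splat is exported from Base starting with Julia 1.9, so the sketch below assumes that version or later and spells it Base.splat to avoid clashing with the definition above. It also shows a variant for the two-argument case that broadcasts over the first and last components of each tuple.

```julia
foo(x, y) = x + y
args_array = [(1, 2), (3, 4), (5, 6)]

# Assumes Julia 1.9+: Base.splat(foo) calls foo(args...) for each tuple
with_splat = map(Base.splat(foo), args_array)          # [3, 7, 11]

# Two-argument case only: broadcast over the tuple components
with_parts = foo.(first.(args_array), last.(args_array))   # [3, 7, 11]
```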
You could zip the args_array, which effectively transposes the array of tuples:
julia> collect(zip(args_array...))
2-element Array{Tuple{Int64,Int64,Int64},1}:
(1, 3, 5)
(2, 4, 6)
Then you can broadcast foo over the transposed array (actually an iterator) of tuples:
julia> foo.(zip(args_array...)...)
(3, 7, 11)
However, this returns a tuple instead of an array. If you need the return value to be an array, you could use any of the following somewhat cryptic solutions:
julia> foo.(collect.(zip(args_array...))...)
3-element Array{Int64,1}:
3
7
11
julia> collect(foo.(zip(args_array...)...))
3-element Array{Int64,1}:
3
7
11
julia> [foo.(zip(args_array...)...)...]
3-element Array{Int64,1}:
3
7
11
How about
[foo(x,y) for (x,y) in args_array]

Equivalent of pandas 'clip' in Julia

In pandas, there is the clip function (see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.clip.html), which constrains values within the lower and upper bound provided by the user. What is the Julia equivalent? I.e., I would like to have:
> clip.([2 3 5 10],3,5)
> [3 3 5 5]
Obviously, I can write it myself, or use a combination of min and max, but I was surprised to find out there is none. StatsBase provides the trim and winsor functions, but these do not allow fixed values as input, but rather counts or percentiles (https://juliastats.github.io/StatsBase.jl/stable/robust.html).
You are probably looking for clamp:
help?> clamp
clamp(x, lo, hi)
Return x if lo <= x <= hi. If x > hi, return hi. If x < lo, return lo. Arguments are promoted to a common type.
This is a function for scalar x, but we can broadcast it over the vector using dot-notation:
julia> clamp.([2, 3, 5, 10], 3, 5)
4-element Array{Int64,1}:
3
3
5
5
If you don't care about the original array you can also use the in-place version clamp!, which modifies the input:
julia> A = [2, 3, 5, 10];
julia> clamp!(A, 3, 5);
julia> A
4-element Array{Int64,1}:
3
3
5
5
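As a quick check against the min/max combination mentioned in the question, and since broadcasting also accepts arrays for the bounds, here is a small sketch. The per-element bounds los and his are made-up values.

```julia
x = [2, 3, 5, 10]

# clamp is equivalent to nesting max and min
clamped = clamp.(x, 3, 5)               # [3, 3, 5, 5]
by_hand = min.(max.(x, 3), 5)           # same result

# Bounds can be arrays too: element i is clamped to [los[i], his[i]]
los = [0, 4, 4, 0]
his = [1, 6, 6, 20]
per_element = clamp.(x, los, his)       # [1, 4, 5, 10]
```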

Utilizing ndgrid/meshgrid functionality in Julia

I'm trying to find functionality in Julia similar to MATLAB's meshgrid or ndgrid. I know Julia has defined ndgrid in the examples but when I try to use it I get the following error.
UndefVarError: ndgrid not defined
Anyone know either how to get the builtin ndgrid function to work or possibly another function I haven't found or library that provides these methods (the builtin function would be preferred)? I'd rather not write my own in this case.
Thanks!
We prefer to avoid these functions, since they allocate arrays that usually aren't necessary. The values in these arrays have such a regular structure that they don't need to be stored; they can just be computed during iteration. For example, one alternative approach is to write an array comprehension:
julia> [ 10i + j for i=1:5, j=1:5 ]
5×5 Array{Int64,2}:
11 12 13 14 15
21 22 23 24 25
31 32 33 34 35
41 42 43 44 45
51 52 53 54 55
Or, you can write for loops, or iterate over a product iterator:
julia> collect(Iterators.product(1:2, 3:4))
2×2 Array{Tuple{Int64,Int64},2}:
(1, 3) (1, 4)
(2, 3) (2, 4)
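To illustrate "computed during iteration", the product iterator can feed a reduction directly, so no grid array is ever stored. A minimal sketch reusing the 10i + j example from above (the names lazy_total and dense_total are just for illustration):

```julia
# Sum 10i + j over the whole 5×5 grid without materializing it
lazy_total = sum(10i + j for (i, j) in Iterators.product(1:5, 1:5))

# Same value from the array comprehension, which does allocate the matrix
dense_total = sum([10i + j for i = 1:5, j = 1:5])
```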
I do find sometimes it's convenient to use some function like meshgrid in numpy. It's easy to do it with list comprehension:
function meshgrid(x, y)
X = [i for i in x, j in 1:length(y)]
Y = [j for i in 1:length(x), j in y]
return X, Y
end
e.g.
x = 1:4
y = 1:3
X, Y = meshgrid(x, y)
now
julia> X
4×3 Array{Int64,2}:
1 1 1
2 2 2
3 3 3
4 4 4
julia> Y
4×3 Array{Int64,2}:
1 2 3
1 2 3
1 2 3
1 2 3
However, I did not find this makes the code run faster than using iteration. Here's what I mean:
After defining
x = 1:1000
y = x
X, Y = meshgrid(x, y)
I benchmarked the following two functions
using Statistics
function fun1()
return mean(sqrt.(X.*X + Y.*Y))
end
function fun2()
sum = 0.0
for i in 1:1000
for j in 1:1000
sum += sqrt(i*i + j*j)
end
end
return sum / (1000*1000)
end
Here are the benchmark results:
julia> using BenchmarkTools
julia> @btime fun1()
8.310 ms (19 allocations: 30.52 MiB)
julia> @btime fun2()
1.671 ms (0 allocations: 0 bytes)
The meshgrid method is both significantly slower and uses more memory. Does any Julia expert know why? I understand Julia is a compiled language, unlike Python, so iteration won't be slower than vectorization, but I don't understand why the vectorized (array) calculation is many times slower than iteration. (For bigger N this difference is even larger.)
Edit: After reading this post, I have the following updated version of the 'meshgrid' method. The idea is to not create a meshgrid beforehand, but to do it in the calculation via Julia's powerful elementwise array operation:
x = collect(1:1000)
y = x'
function fun1v2()
mean(sqrt.(x .* x .+ y .* y))
end
The trick here is the .+ between a size-M column array and a size-N row array, which returns an M-by-N array. It does the 'meshgrid' for you. This function is nearly 3 times faster than fun1, albeit not as fast as fun2.
julia> @btime fun1v2()
3.189 ms (24 allocations: 7.63 MiB)
765.8435104896155
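One plausible reading of the allocation figures: a 1000×1000 array of 8-byte elements takes about 7.63 MiB, so the 30.52 MiB reported for fun1 matches four such temporaries (X.*X, Y.*Y, their sum, and the sqrt result, since the plain + is not fused), while the fully dotted fun1v2 allocates one. A generator avoids even that single array. The sketch below uses a hypothetical name fun3 and takes x and y as arguments rather than reading globals, which is an assumption about what helps performance here, not a measured result.

```julia
using Statistics

# Hypothetical variant: mean over a lazy two-dimensional generator,
# so no M-by-N array is materialized
function fun3(x, y)
    return mean(sqrt(xi * xi + yj * yj) for xi in x, yj in y)
end

result = fun3(1:1000, 1:1000)
```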
Above, @ChrisRackauckas suggests that the "proper way" to do this is with a lazy operator but he hadn't gotten around to it.
There is now a registered package with lazy ndgrid in it:
https://github.com/JuliaArrays/LazyGrids.jl
It is more general than the version in
VectorizedRoutines.jl
because it can handle vectors with different types, e.g.,
ndgrid(1:3, Float16.(0:2), ["x", "y", "z"]).
There are Literate.jl examples in the docs that show the lazy performance is pretty good.
Of course lazy meshgrid is just one step away:
meshgrid(y,x) = (ndgrid_lazy(x,y)[[2,1]]...,)

Resources