Irregular behaviour of vectors in OpenCL (1.2) kernels

So, I am trying to perform some operations inside an OpenCL kernel. I have a buffer named filter, a 3x3 matrix whose entries are all initialized to 1.
I pass this as an argument to the OpenCL kernel from the host side. The issue arises when I try to fetch this buffer on the device side as a float3 vector. For example:
__kernel void my_kernel(constant float3* restrict filter)
{
    float3 temp1 = filter[0];
    float3 temp2 = filter[1];
    float3 temp3 = filter[2];
}
The first two temp variables behave as expected and have all their components equal to 1. But the third temp variable (temp3) has only its x component set to 1; the y and z components are 0. When I fetch the buffer as plain floats instead, everything behaves as expected. Am I doing something wrong? I don't want to use the vload built-ins because of their overhead.

In OpenCL, float3 is just an alias for float4, so your 9 values fill the x, y, z, and w components of temp1 and temp2, which leaves just one value for temp3.x. You will probably need to use the vload3 built-in function.
See section 6.1.5 ("Alignment of Types") of the OpenCL specification for more information:
For 3-component vector data types, the size of the data type is 4 * sizeof(component). This means that a 3-component vector data type will be aligned to a 4 * sizeof(component) boundary. The vload3 and vstore3 built-in functions can be used to read and write, respectively, 3-component vector data types from an array of packed scalar data type.
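
For illustration, here is a minimal sketch of the vload3 approach, with the buffer passed as packed floats rather than float3 (the kernel name my_kernel is made up, matching the snippet above):

__kernel void my_kernel(constant float* restrict filter)
{
    /* vload3(n, p) reads three floats starting at p + 3*n, so the
       packed 3x3 matrix is read row by row with no 4-float padding */
    float3 temp1 = vload3(0, filter);
    float3 temp2 = vload3(1, filter);
    float3 temp3 = vload3(2, filter);
}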

Related

Tensor mode n product in Julia

I need the tensor mode-n product.
The definition of the tensor mode-n product can be seen here:
https://www.alexejgossmann.com/tensor_decomposition_tucker/
I found Python code, and I would like to convert this code into Julia.
import numpy as np

def mode_n_product(x, m, mode):
    x = np.asarray(x)
    m = np.asarray(m)
    if mode <= 0 or mode % 1 != 0:
        raise ValueError('`mode` must be a positive integer')
    if x.ndim < mode:
        raise ValueError('Invalid shape of X for mode = {}: {}'.format(mode, x.shape))
    if m.ndim != 2:
        raise ValueError('Invalid shape of M: {}'.format(m.shape))
    return np.swapaxes(np.swapaxes(x, mode - 1, -1).dot(m.T), mode - 1, -1)
I have found another answer using TensorToolbox.jl:
using TensorToolbox
X = rand(5,4,3);
A = rand(2,5);
ttm(X, A, n)  # X times A by mode n
One way is:
using TensorOperations
@tensor y[i1, i2, i3, out, i5] := x[i1, i2, i3, s, i5] * a[out, s]
This is literally the formula given at your link to define this, except that I changed the name of the summed index to s; you can use any index names you like, they are just markers. The sum is implicit, because s does not appear on the left.
There is nothing very special about putting the index out back in the same place. Like your Python code, @tensor permutes the dimensions of x in order to use ordinary matrix multiplication, and then permutes again to give y the requested order. The fewer permutations needed, the faster this will be.
Alternatively, you can try using LoopVectorization, Tullio; @tullio y[i1, i2, ... with the same notation. Instead of permuting in order to call a library matrix multiplication function, this writes a pure-Julia version which works with the array as it arrives.
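
If you prefer a direct Julia port of the numpy function above rather than a macro, a sketch of a generic mode-n product in plain Julia might look like this (the function name mirrors the Python version; it is an illustration, not library code):

function mode_n_product(x::AbstractArray, m::AbstractMatrix, mode::Integer)
    mode >= 1 || throw(ArgumentError("`mode` must be a positive integer"))
    ndims(x) >= mode || throw(ArgumentError("invalid shape of x for mode = $mode"))
    perm = collect(1:ndims(x))
    perm[mode], perm[end] = perm[end], perm[mode]   # swap mode-th and last axes
    xp = permutedims(x, perm)                       # bring mode n to the end
    sz = size(xp)
    y = reshape(reshape(xp, :, sz[end]) * transpose(m),
                sz[1:end-1]..., size(m, 1))         # contract with m along that axis
    return permutedims(y, perm)                     # swap the axes back
end

For the TensorToolbox example above, mode_n_product(X, A, 1) should return a 2x4x3 array, matching ttm(X, A, 1).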

Fortran pointer to smaller dimension array

I would like to create a pointer to an array with smaller dimension.
For example, I have some array arr(1:2, 1:10, 1:10).
Now I want to create a pointer to arr(1:1, 1:10, 1:10), dropping the first index (I am not sure what to call it), and a second pointer to arr(2:2, 1:10, 1:10).
I need this because I would like to send an array with 2 dimensions (a matrix) to a function.
Here is an indication of what I want to do, with pseudocode.
INTEGER, DIMENSION (1:2, 1:10, 1:10), TARGET :: BOUNDRIES
INTEGER, DIMENSION (:,:), POINTER :: LEFT_BOUNDRY

LEFT_BOUNDRY => BOUNDRIES(1,1:10,1:10)

DO i = 1,10
  DO j = 1,10
    write(*,*) LEFT_BOUNDRY(i,j)
  END DO
END DO
Is it possible to do it?
When we have a dummy argument in a function or a subroutine (collectively, a procedure), we have a corresponding actual argument when we execute that procedure. Consider the subroutine s:
subroutine s(x)
real x(5,2)
...
end subroutine s
The dummy argument x is in this case an explicit shape array, of rank 2, shape [5,2].
If we want to
call s(y)
where y is some real entity, we don't need y to be a whole array of rank 2 and shape [5,2]. We simply need y to have at least ten elements; a mechanism called storage association maps those ten elements to x once we are inside the subroutine.
Imagine, then
real y1(10), y2(1,10), y3(29)
call s(y1)
call s(y2)
call s(y3)
Each of these works (in the final case, it's just the first ten elements that become associated with the dummy argument).
Crucially, it's a so-called element sequence that is important when choosing the elements to associate with x. Consider
real y(5,12,10,10)
call s (y(1,1,1:2,3:7))
This is an array section of y of ten elements. Those ten elements together become x in the subroutine s.
To conclude: if you want to pass arr(2,1:10,1:10) (which is actually a rank-2 array section) to a rank-2 dummy argument which is an explicit-shape array of no more than 100 elements, everything is fine.
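
To make this concrete with the arrays from the question, here is a small self-contained sketch (the subroutine name show is made up); the rank-2 section boundries(1,:,:) has exactly 100 elements, which become associated with the explicit-shape dummy argument:

program demo
  implicit none
  integer :: boundries(2,10,10)
  boundries = 0
  boundries(1,:,:) = 1          ! the "left boundary" slice
  call show(boundries(1,:,:))   ! a rank-2 section with 100 elements
contains
  subroutine show(b)
    integer :: b(10,10)         ! explicit-shape dummy argument
    write(*,*) b(1,1), b(10,10) ! prints 1 1
  end subroutine show
end program demo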

julia: outer product function

In R, the function outer allows you to take the outer product of two vectors x and y, while providing a number of options for the actual function applied to each combination. For example, outer(x,y,'-') creates an "outer product" matrix of the elementwise differences between x and y. Does Julia have something similar?
Broadcasting is the Julia operation which occurs when you add dots to operators. When the two containers have the same size, it's an element-wise operation. Example: x.*y is element-wise if size(x)==size(y). However, when the shapes don't match, broadcasting really comes into effect. If one argument is a row vector and the other is a column vector, then the output will be 2D, with out[i,j] combining the ith element of the column vector with the jth element of the row vector. This means x .* y is a way to write the outer product when one of them is a row vector and the other is a column vector.
In general, what broadcast is doing is:
This is wasteful when dimensions get large, so Julia offers broadcast(), which expands singleton dimensions in array arguments to match the corresponding dimension in the other array without using extra memory
(This is from the Julia Manual)
But this generalizes to all of the other binary operators, so x .- y' is what you're looking for.
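
A short sketch with made-up values (current Julia syntax):

x = [1, 2, 3]           # treated as a column vector
y = [10, 20, 30]
x .- y'                 # 3x3 matrix with out[i,j] = x[i] - y[j], like R's outer(x, y, '-')
x .* y'                 # the ordinary outer product
broadcast(-, x, y')     # identical to x .- y', written as a function call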

How to know the index of the iterator when using map in Julia

I have an Array of arrays, called y:
y=Array(Vector{Int64}, 10)
which is basically a list of ten 1-dimensional arrays, each of length 5. Below is an example of how they are initialized:
using StatsBase   # sample comes from the StatsBase package

for i in 1:10
    y[i] = sample(1:20, 5)
end
Each 1-dimensional array includes 5 randomly sampled integers between 1 to 20.
Right now I am applying a map function which, for each of those 1-dimensional arrays in y, computes which numbers from 1 to 20 it excludes:
map(x->setdiff(1:20, x), y)
However, I want to make sure that when the function is applied to y[i], if the output of setdiff(1:20, y[i]) includes i, then i is excluded from the results. In other words, I want a function that works like
setdiff(deleteat!(Vector(1:20),i) ,y[i])
but with map.
Mainly, my question is whether you can access the index inside the map function.
P.S. I know how to do it with comprehensions; I want to know whether it is possible with map.
comprehension way:
[setdiff(deleteat!(Vector(1:20), index), value) for (index,value) in enumerate(y)]
Like this?
map(x -> setdiff(deleteat!(Vector(1:20), x[1]), x[2]), enumerate(y))
For your example, this gives:
[2,3,4,5,7,8,9,10,11,12,13,15,17,19,20]
[1,3,5,6,7,8,9,10,11,13,16,17,18,20]
....
[1,2,4,7,8,10,11,12,13,14,15,16,17,18]
[1,2,3,5,6,8,11,12,13,14,15,16,17,19,20]
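
As a purely illustrative variant, the (index, value) tuple produced by enumerate can also be destructured in the anonymous function, which may read more clearly than x[1] and x[2]:

map(((i, v),) -> setdiff(deleteat!(Vector(1:20), i), v), enumerate(y))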

Having an issue with MPI_GATHER/MPI_GATHERV in F90 with derived data types

I have a task in which I will have several data types together: a character string, several integers, and a double precision value, which together represent a solution to a problem.
At the moment, I have a "toy" F90 program that uses MPI with random numbers and a contrived character string for each processor. I want to have a data type that has the character string and the double precision random number together.
I will use MPI_REDUCE to get the minimum of the double precision values. I will have the data type for each process brought together to the root (rank = 0) via the MPI_GATHERV function.
My goal is to match up the minimum value from the random values with its data type. That would be the final answer. I have tried all sorts of ideas up to this point, but to no avail. I end up with "forrtl: severe SIGSEGV, segmentation fault occurred".
Now I have looked at several of the other postings too. For instance, I cannot use the "use mpif.h" statement on this particular system.
But, at last, here is the code:
program fredtype
implicit none
include '/opt/apps/intel15/mvapich2/2.1/include/mpif.h'
integer rank,size,ierror,tag,status(MPI_STATUS_SIZE),i,np,irank
integer blocklen(2),type(2),num,rcount(4)
double precision :: x,aout
character(len=4) :: y
type, BIND(C) :: mytype
double precision :: x,aout,test
character :: y
end type mytype
type(mytype) :: foo,foobag(4)
integer(KIND=MPI_ADDRESS_KIND) :: disp(2),base
call MPI_INIT(ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD,size,ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierror)
aout = 99999999999.99
call random_seed()
call random_number(x)
if(rank.eq.0)y="dogs"
if(rank.eq.1)y="cats"
if(rank.eq.2)y="tree"
if(rank.eq.3)y="woof"
print *,rank,x,y
call MPI_GET_ADDRESS(foo%x,disp(1),ierror)
call MPI_GET_ADDRESS(foo%y,disp(2),ierror)
base = disp(1)
disp(2) = disp(2) - base
blocklen(1) = 1
blocklen(2) = 1
type(1) = MPI_DOUBLE_PRECISION
type(2) = MPI_CHARACTER
call MPI_TYPE_CREATE_STRUCT(2,blocklen,disp,type,foo,ierror)
call MPI_TYPE_COMMIT(foo,ierror)
call MPI_REDUCE(x,aout,1,MPI_DOUBLE_PRECISION,MPI_MIN,0,MPI_COMM_WORLD,ierror)
call MPI_GATHER(num,1,MPI_INT,rcount,1,MPI_INT,0,MPI_COMM_WORLD)
call MPI_GATHERV(foo,num,type,foobag,rcount,disp,type,0,MPI_COMM_WORLD)
if(rank.eq.0)then
print *,'fin ',aout
end if
end program fredtype
Thank you for any help.
Sincerely,
Erin
Your code is definitely too confusing for me to try to fully fix it. So let's just assume that you have your type mytype defined as follows:
type, bind(C) :: mytype
double precision :: x, aout, test
character(len=4) :: y
end type mytype
(Remark: I've added len=4 to the definition of y, as it seemed to be missing from your original code. I might be wrong about that; if so, just adjust blocklen(2) in the subsequent code accordingly.)
Now let's assume that you only want to transfer the x and y fields of your variables of type mytype. For this, you'll need to create an appropriate derived MPI type, using first MPI_Type_create_struct() to define the basic types and their locations in your structure, and then MPI_Type_create_resized() to define the true extent and lower bound of the type, including holes.
The tricky part is usually evaluating what the lower bound and extent of your Fortran type are. Here, since the fields you transfer include the first and last fields of the type, and since you added bind(C), you could just use MPI_Type_get_extent() to get this information. However, if you hadn't included x or y (the first and last fields of the type) in the MPI data type, MPI_Type_get_extent() wouldn't have returned what you needed. So I'll propose an alternative (slightly more cumbersome) approach which will, I believe, always work:
integer :: ierror, typefoo, tmptypefoo
integer :: blocklen(2), types(2)
type(mytype) :: foobag(4)
integer(kind=MPI_ADDRESS_KIND) :: disp(2), lb, extent

! Absolute addresses of an element, of its x and y fields,
! and of the next element in the array
call MPI_Get_address( foobag(1), lb, ierror )
call MPI_Get_address( foobag(1)%x, disp(1), ierror )
call MPI_Get_address( foobag(1)%y, disp(2), ierror )
call MPI_Get_address( foobag(2), extent, ierror )

! Displacements relative to the start of the element; the distance
! between two consecutive elements is the true extent of the type
disp(1) = MPI_Aint_diff( disp(1), lb )
disp(2) = MPI_Aint_diff( disp(2), lb )
extent = MPI_Aint_diff( extent, lb )
lb = 0

blocklen(1) = 1   ! one double precision (x)
blocklen(2) = 4   ! four characters (y)
types(1) = MPI_DOUBLE_PRECISION
types(2) = MPI_CHARACTER

call MPI_Type_create_struct( 2, blocklen, disp, types, tmptypefoo, ierror )
call MPI_Type_create_resized( tmptypefoo, lb, extent, typefoo, ierror )
call MPI_Type_commit( typefoo, ierror )
So as you can see, lb serves as the base address for the displacements into the structure, and the type extent is computed from the relative addresses of two consecutive elements of an array of type mytype.
Then we create an intermediary MPI data type tmptypefoo, which only contains the information about the actual data we will transfer, and we extend it with the true lower bound and extent of the Fortran type to obtain typefoo. Finally, only this last type needs to be committed, as it alone will be used for data transfers.
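
For completeness, here is a hypothetical usage sketch (not taken from the original code): since every rank contributes exactly one element, a plain MPI_Gather with the committed type suffices, avoiding the rcount and displacement bookkeeping of MPI_Gatherv:

call MPI_Gather( foo, 1, typefoo,    &   ! each rank sends one mytype
                 foobag, 1, typefoo, &   ! rank 0 receives one per rank
                 0, MPI_COMM_WORLD, ierror )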
