pyopencl.LogicError: clEnqueueNDRangeKernel failed: invalid work item size - opencl

I am attempting to implement the dot_persist_kernel() shown here in Python using pyopencl, and I've been squashing numerous bugs along the way. But I've stumbled upon an issue that I can't crack:
self.program = cl.Program(self.ctx, code).build()
# code is a string with the code from the link given
a = cl_array.to_device(self.queue, np.random.rand(2**20).astype(np.float32))
b = cl_array.to_device(self.queue, np.random.rand(2**20).astype(np.float32))
c = 0.
mf = cl.mem_flags
c_buf = cl.Buffer(self.ctx, mf.WRITE_ONLY, 4)
MAX_COMPUTE_UNITS = cl.get_platforms()[0].get_devices()[0].max_compute_units
WORK_GROUPS_PER_CU = MAX_COMPUTE_UNITS * 4
ELEMENTS_PER_GROUP = a.size / WORK_GROUPS_PER_CU
ELEMENTS_PER_WORK_ITEM = ELEMENTS_PER_GROUP / 256
self.program.DotProduct(self.queue, a.shape, a.shape,
                        a.data, b.data, c_buf,
                        np.uint32(ELEMENTS_PER_GROUP),
                        np.uint32(ELEMENTS_PER_WORK_ITEM),
                        np.uint32(1028 * MAX_COMPUTE_UNITS))
Assuming an array of size 2^26, the constants will have values of:
MAX_COMPUTE_UNITS = 32 // from get_device()[0].max_compute_units
WORK_GROUPS_PER_CU = 128 // MAX_COMPUTE_UNITS * 4
ELEMENTS_PER_GROUP = 524288 // 2^19
ELEMENTS_PER_WORK_ITEM = 2048 // 2^11
The kernel header looks like:
#define LOCAL_GROUP_XDIM 256
// Kernel for part 1 of dot product, version 3.
__kernel __attribute__((reqd_work_group_size(LOCAL_GROUP_XDIM, 1, 1)))
void dot_persist_kernel(
    __global const double * x,  // input vector
    __global const double * y,  // input vector
    __global double * r,        // result vector
    uint n_per_group,           // elements processed per group
    uint n_per_work_item,       // elements processed per work item
    uint n                      // input vector size
)
The error that it is giving is:
Traceback (most recent call last):
  File "GPUCompute.py", line 102, in <module>
    gpu = GPUCompute()
  File "GPUCompute.py", line 87, in __init__
    np.uint32(1028 * MAX_COMPUTE_UNITS))
  File "C:\Miniconda2\lib\site-packages\pyopencl\__init__.py", line 512, in kernel_call
    global_offset, wait_for, g_times_l=g_times_l)
pyopencl.LogicError: clEnqueueNDRangeKernel failed: invalid work item size
I've tried shifting the numbers around a lot, to no avail. Ideas?

There were a few issues going on with the previous implementation, but this one is working:
WORK_GROUPS = cl.get_platforms()[0].get_devices()[0].max_compute_units * 4
ELEMENTS_PER_GROUP = np_a.size / WORK_GROUPS
LOCAL_GROUP_XDIM = 256
ELEMENTS_PER_WORK_ITEM = ELEMENTS_PER_GROUP / LOCAL_GROUP_XDIM
self.program = cl.Program(self.ctx, kernel).build()
self.program.DotProduct(
    self.queue, np_a.shape, (LOCAL_GROUP_XDIM,),  # kernel information
    cl_a, cl_b, cl_c,                             # data
    np.uint32(ELEMENTS_PER_GROUP),                # elements processed per group
    np.uint32(ELEMENTS_PER_WORK_ITEM),            # elements processed per work item
    np.uint32(np_a.size)                          # input vector size
)
It came down to a few things, but the biggest factor was that the second and third arguments passed to DotProduct() are supposed to be tuples, not ints like I had thought. :)
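For anyone hitting the same error: "invalid work item size" typically means the local size exceeds a device or kernel limit, or the global size is not compatible with it. A minimal sanity check along these lines (a sketch reusing self.ctx, self.program and the objects from the snippet above) can catch that before the enqueue call:
device = self.ctx.devices[0]
kernel = self.program.DotProduct

# Device-wide and kernel-specific upper bounds on the work-group size.
max_wg_device = device.max_work_group_size
max_wg_kernel = kernel.get_work_group_info(
    cl.kernel_work_group_info.WORK_GROUP_SIZE, device)

local_size = (LOCAL_GROUP_XDIM,)  # one entry per dimension, always a tuple
global_size = np_a.shape          # must be divisible by the local size

assert local_size[0] <= min(max_wg_device, max_wg_kernel)
assert global_size[0] % local_size[0] == 0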

Related

Error "cannot allocate memory block of size 67108864 Tb" in the R function false.nearest

The R function
tseriesChaos::false.nearest(series, m, d, t, rt=10, eps=sd(series)/10)
implements the false nearest neighbours algorithm to help decide the optimal embedding dimension.
I would like to apply it to the following series:
dput(x)
c(0.230960354326456, 0.229123906233121, 0.222750351085665, 0.230096143459004,
0.226315220913903, 0.228151669007238, 0.225775089121746, 0.229447985308415,
0.230096143459004, 0.232256670627633, 0.23722588311548, 0.236361672248029,
0.231716538835476, 0.229231932591552, 0.229880090742141, 0.229447985308415,
0.236901804040186, 0.234525224154694, 0.236577724964891, 0.240574700226855,
0.238090093982932, 0.233552986928811, 0.235929566814303, 0.228799827157827,
0.224694825537431, 0.225775089121746, 0.224694825537431, 0.221129955709193,
0.214540347844874, 0.213352057902128, 0.21054337258291, 0.208706924489575,
0.211083504375068, 0.212487847034676, 0.20903100356487, 0.206654423679378,
0.213027978826834, 0.211083504375068, 0.216160743221346, 0.213244031543697,
0.214324295128011, 0.216160743221346, 0.215512585070757, 0.218753375823701,
0.215836664146052, 0.225126930971157, 0.228367721724101, 0.23128443340175,
0.240574700226855, 0.244139570055093, 0.246732202657448, 0.248028518958626,
0.246300097223723, 0.245976018148428, 0.241762990169601, 0.245976018148428,
0.248892729826078, 0.258831154801772, 0.265744841741385, 0.259803392027655,
0.258831154801772, 0.261855892837852, 0.262504050988441, 0.262071945554715,
0.257102733066868, 0.270065896078643, 0.276655503942962, 0.280544452846495,
0.280004321054337, 0.276547477584531, 0.286485902560225, 0.278924057470023,
0.279140110186886, 0.272658528680998, 0.262828130063736, 0.26466457815707,
0.254726153181376, 0.264448525440207, 0.261207734687264, 0.269741817003349,
0.259587339310792, 0.256886680350005, 0.26163984012099, 0.252133520579021,
0.257858917575888, 0.255158258615102, 0.252457599654316, 0.251701415145295,
0.251161283353138, 0.251053256994707, 0.251917467862158, 0.24316733282921,
0.242195095603327, 0.249540887976666, 0.259263260235497, 0.259263260235497,
0.258399049368046, 0.252565626012747, 0.263800367289619, 0.262071945554715,
0.259695365669223, 0.256886680350005, 0.253213784163336, 0.260127471102949,
0.268769579777466, 0.271578265096684, 0.270173922437075, 0.267905368910014,
0.262071945554715, 0.262936156422167, 0.261855892837852, 0.262720103705304,
0.259047207518635, 0.263044182780598, 0.257102733066868, 0.259155233877066,
0.259155233877066, 0.250297072485687, 0.24089877930215, 0.239494436642541,
0.241546937452738, 0.24014259479313, 0.244355622771956, 0.242195095603327,
0.242303121961759, 0.241438911094307, 0.236901804040186, 0.238954304850383,
0.236793777681754, 0.239386410284109, 0.241546937452738, 0.24608404450686,
0.244139570055093, 0.237333909473912, 0.238954304850383, 0.240250621151561,
0.235281408663714, 0.234093118720968, 0.237657988549206, 0.246948255374311,
0.249432861618235, 0.246516149940585, 0.247164308091174, 0.252997731446473,
0.258399049368046, 0.258399049368046, 0.256238522199417, 0.268661553419034,
0.275143134924922, 0.273630765906881, 0.270281948795506, 0.265204709949228,
0.262071945554715, 0.258074970292751, 0.261747866479421, 0.260883655611969,
0.264124446364913, 0.267257210759425, 0.271146159662958, 0.273954844982176,
0.266933131684131, 0.269201685211192, 0.278383925677865, 0.278491952036297,
0.271146159662958, 0.272982607756293, 0.27503510856649, 0.282921032731987,
0.285297612617479, 0.285189586259047, 0.280436426488063, 0.287026034352382,
0.288538403370422, 0.286593928918656, 0.287998271578265, 0.285081559900616,
0.28464945446689, 0.279032083828454, 0.280112347412769, 0.278816031111591,
0.281624716430809, 0.278491952036297, 0.2802203737712, 0.279896294695906,
0.28097655828022, 0.276763530301394, 0.272550502322567, 0.276979583018256,
0.292643404990818, 0.28907853516258, 0.291239062331209, 0.293615642216701,
0.286918007993951, 0.287998271578265, 0.288322350653559, 0.280868531921789,
0.274386950415901, 0.271146159662958, 0.278275899319434, 0.277411688451982,
0.279140110186886, 0.28907853516258, 0.258939181160203, 0.256670627633142,
0.25278167872961, 0.255698390407259, 0.261423787404127, 0.260559576536675,
0.263692340931187, 0.260667602895106, 0.255158258615102, 0.257858917575888,
0.250081019768824, 0.245219833639408, 0.24684022901588, 0.244895754564114,
0.242195095603327, 0.246300097223723, 0.253861942313925, 0.253429836880199,
0.264988657232365, 0.260235497461381, 0.258831154801772, 0.258831154801772,
0.253213784163336, 0.249864967051961, 0.250081019768824, 0.245219833639408,
0.249756940693529, 0.245651939073134, 0.24835259803392, 0.24835259803392,
0.245867991789997, 0.248244571675489, 0.247056281732743, 0.249756940693529,
0.248676677109215, 0.251593388786864, 0.254186021389219, 0.250837204277844,
0.251593388786864, 0.248676677109215, 0.249540887976666, 0.251593388786864,
0.242627201037053, 0.242519174678622, 0.240250621151561, 0.240034568434698,
0.243059306470779, 0.244031543696662)
Hence, I used the code:
false.nearest(x, m=50, d=r, t=220, eps=1, rt=3)
Anyway, I obtained the error:
Error in false.nearest(x, m = 50, d = r, t = 220, eps = 1, rt = 3) :
cannot allocate memory block of size 67108864 Tb
I can't explain it: vector x has only 250 observations!
Looking at the false.nearest source code in the tseriesChaos package:
/*
False nearest neighbours algorithm.
in_series: input time series (scaled between 0 and 1)
in_length: time series length
in_m, in_d, in_t: embedding dimension, time delay, theiler window
in_eps: neighbourhood size
in_rt: escape factor
out: fraction of false nearests
out2: total number of nearests
*/
void falseNearest(double *in_series, int *in_length, int *in_m, int *in_d, int *in_t, double *in_eps, double *in_rt, double *out, int *out2) {
double eps, *series;
double dst;
double *dsts;
int *ids;
int m,d, t, length, blength;
int num, denum;
int i,j,md;
double rt;
int id;
boxSearch bs;
/*
BIND PARAMETERS
*/
m = *in_m;
d = *in_d;
t = *in_t;
rt = *in_rt;
eps=*in_eps;
series=in_series;
length=*in_length;
/**/
/*
INIT VARIABLES
*/
blength = length - m*d - t;
With your parameters set:
length. <- 250
m <- 50
d <- 3
t <- 220
(blength = length. - m*d - t)
[1] -120
blength is used as a parameter to R_alloc and should be positive; otherwise its sign bit is interpreted as part of a huge unsigned size, causing the memory allocation error:
dsts = (double*) R_alloc(blength, sizeof(double));
In this case, the largest value of m that keeps blength positive is m = 9 (m = 10 already gives blength = 0).
These constraints on the parameters are not documented in the package, nor does the package output an informative error message: the reason for the error is understood, but it is difficult to help further.
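As a practical workaround, the constraint can be checked up front before calling false.nearest; a small sketch in R (the helper name check_fnn_params is made up here):
check_fnn_params <- function(series, m, d, t) {
  blength <- length(series) - m * d - t   # same formula as in the C source above
  if (blength <= 0)
    stop("m*d + t = ", m * d + t,
         " must be smaller than length(series) = ", length(series))
  invisible(blength)
}

check_fnn_params(x, m = 50, d = 3, t = 220)  # errors, like false.nearest above
check_fnn_params(x, m = 9,  d = 3, t = 220)  # fine: blength is 3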

OpenCL vstoren does not store vector in scalar array

I have the kernel as below.
My question is: why is vstore8 not working? When the output is printed in the host code, it only contains 0s.
I put an "if(all(v == 0) == 1)" in the code to check whether the error was caused by copying the values from the int4s into the int8 v, but it was not that.
It seems like vstoren is doing nothing.
I am new to OpenCL so any help is appreciated.
__kernel void select_vec(__global int4 *input1,
                         __global int *input2,
                         __global int *output){
    //copy values in input arrays to vectors
    int i = get_global_id(0);
    int4 vA = input1[i];
    int4 vB = input1[i+1];
    __private int8 v = (int8)(vA.s0, vA.s1, vA.s2, vA.s3, vB.s0, vB.s1, vB.s2, vB.s3);
    __private int8 v1 = vload8(0, input2);
    __private int8 v2 = vload8(1, input2);

    int8 results;
    if(any(v > 10) == 1){
        //if there is any of the elements in v that are greater than 10
        // copy the corresponding elements from v1 for elements greater than 10
        // for elements less than or equal to 17, copy the corresponding elements from v2
        results = select(v1, v2, v > 10);
    }else{
        //results is the combination of the first half of v2 and v2
        results = (int8) (v1.lo, v2.lo);
    }

    /* for testing of the error is due to vstoren */
    // results = (int8) (1);

    //store results in output array
    vstore8(results, i, output);
}
Do you mean int8 v1 = vload8(i+0, input2);, int8 v2 = vload8(i+1, input2); and vstore8(results, i, output);?
Currently you read from the same memory addresses in input2 (0-7 for v1 and 8-15 for v2) and write to the same memory addresses in output (0-7) with all threads. This is a race condition: depending on v and on which thread writes to output last, you can get randomly different results. But if input2 starts with 0s at addresses 0-15 and output is initialized to all 0s, it will remain all 0s.
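For clarity, a sketch of the kernel with those per-work-item offsets applied (everything else unchanged; it assumes input2 and output are allocated large enough for the highest offsets used):
__kernel void select_vec(__global int4 *input1,
                         __global int *input2,
                         __global int *output){
    int i = get_global_id(0);
    int4 vA = input1[i];
    int4 vB = input1[i+1];
    int8 v = (int8)(vA, vB);            // same 8 components as before

    int8 v1 = vload8(i,     input2);    // ints 8*i     .. 8*i + 7
    int8 v2 = vload8(i + 1, input2);    // ints 8*(i+1) .. 8*(i+1) + 7

    int8 results;
    if(any(v > 10) == 1){
        // select() takes the component of v2 where v > 10, otherwise of v1
        results = select(v1, v2, v > 10);
    }else{
        results = (int8)(v1.lo, v2.lo);
    }

    vstore8(results, i, output);        // each work item writes its own 8 ints
}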

How to use the Windows API 'GetTcpTable' to find an available TCP port?

I'm trying to use the Windows API 'GetTcpTable', but I don't know how to prepare the data structure 'PMIB_TCPTABLE' and read the returned values from it.
PROCEDURE GetTcpTable EXTERNAL "Iphlpapi":U:
DEFINE OUTPUT PARAMETER mTcpTable AS HANDLE TO MEMPTR. // the API expects a pointer, so use HANDLE TO syntax to pass one
DEFINE INPUT-OUTPUT PARAMETER SizePointer AS HANDLE TO LONG.
DEFINE INPUT PARAMETER Order AS LONG.
DEFINE RETURN PARAMETER IPHLPAPI_DLL_LINKAGE AS LONG.
END PROCEDURE.
DEFINE VARIABLE mTcpTable AS MEMPTR NO-UNDO.
DEFINE VARIABLE SizePointer AS INT NO-UNDO.
DEFINE VARIABLE Order AS INT NO-UNDO.
DEFINE VARIABLE IPHLPAPI AS INT NO-UNDO.
DEFINE VARIABLE mTempValue AS MEMPTR NO-UNDO.
DEFINE VARIABLE dwNumEntries AS INT NO-UNDO.
DEFINE VARIABLE dwState AS INT NO-UNDO.
DEFINE VARIABLE dwLocalAddr AS INT64 NO-UNDO.
DEFINE VARIABLE dwLocalPort AS INT NO-UNDO.
DEFINE VARIABLE dwRemoteAddr AS INT64 NO-UNDO.
DEFINE VARIABLE dwRemotePort AS INT NO-UNDO.
DEFINE VARIABLE ix AS INTEGER NO-UNDO.
SizePointer = 4.
Order = 1.
SET-SIZE(mTcpTable) = SizePointer.
RUN GetTcpTable(OUTPUT mTcpTable, INPUT-OUTPUT SizePointer,INPUT Order, OUTPUT IPHLPAPI). //get ERROR_INSUFFICIENT_BUFFER and know the real buffer now
MESSAGE "IPHLPAPI is " IPHLPAPI SKIP // ERROR_INSUFFICIENT_BUFFER = 122
"SizePointer is " SizePointer SKIP
VIEW-AS ALERT-BOX.
SET-SIZE(mTcpTable) = 0.
SET-SIZE(mTcpTable) = SizePointer.
RUN GetTcpTable(OUTPUT mTcpTable, INPUT-OUTPUT SizePointer,INPUT Order, OUTPUT IPHLPAPI).
MESSAGE "IPHLPAPI is " IPHLPAPI SKIP // NO_ERROR = 0
"SizePointer is " SizePointer SKIP
GET-LONG(mTcpTable,1) SKIP //dwNumEntries
VIEW-AS ALERT-BOX.
IF IPHLPAPI = 0 THEN DO:
dwNumEntries = GET-LONG(mTcpTable,1).
OUTPUT TO VALUE ("C:\temp\debug.txt") UNBUFFERED.
DO ix = 0 TO dwNumEntries - 1.
dwState = GET-UNSIGNED-LONG(mTcpTable,5 + ix * 20). // get value of dwState
dwLocalAddr = GET-UNSIGNED-LONG(mTcpTable,9 + ix * 20). // get value of dwLocalAddr
SET-SIZE(mTempValue) = 2.
PUT-BYTE(mTempValue,2) = GET-UNSIGNED-SHORT(mTcpTable,13 + ix * 20).
PUT-BYTE(mTempValue,1) = GET-UNSIGNED-SHORT(mTcpTable,14 + ix * 20).
dwLocalPort = GET-UNSIGNED-SHORT(mTempValue,1). // get value of dwLocalPort, The maximum size of an IP port number is 16 bits, so only the lower 16 bits should be used. The upper 16 bits may contain uninitialized data.
SET-SIZE(mTempValue) = 0.
dwRemoteAddr = GET-UNSIGNED-LONG(mTcpTable,17 + ix * 20). // get value of dwRemoteAddr
SET-SIZE(mTempValue) = 2.
PUT-BYTE(mTempValue,2) = GET-UNSIGNED-SHORT(mTcpTable,21 + ix * 20).
PUT-BYTE(mTempValue,1) = GET-UNSIGNED-SHORT(mTcpTable,22 + ix * 20).
dwRemotePort = GET-UNSIGNED-SHORT(mTempValue,1). // get value of dwRemotePort, The maximum size of an IP port number is 16 bits, so only the lower 16 bits should be used. The upper 16 bits may contain uninitialized data.
SET-SIZE(mTempValue) = 0.
PUT UNFORMATTED dwState "~t" dwLocalAddr "~t" dwLocalPort "~t" dwRemoteAddr "~t" dwRemotePort "~r".
END.
OUTPUT CLOSE.
END.
SET-SIZE(mTcpTable) = 0.
Is there any better way to get the lower 16 bits? I'm not sure why I need to put the bytes into mTempValue in reversed order; is it related to x86 PCs being little-endian?
How do I get the right start position to read the values if I run the code on a computer with a different bitness?
I think the question can be closed now:
https://community.progress.com/s/feed/0D54Q000088eIeASAU?t=1602495359075&searchQuery
Thank you guys :)
Microsoft's documentation usually has the definitions of the structures (see https://learn.microsoft.com/en-us/windows/win32/api/tcpmib/ns-tcpmib-mib_tcptable). Once you know those, and any other structures they relate to or point at, you can create and manipulate them using MEMPTR data types and the various functions that operate on them: SET-SIZE(), GET-BYTES() and GET-POINTER-VALUE() are typically used. There is documentation on this at https://docs.progress.com/bundle/openedge-programmimg-interfaces/page/Shared-Library-and-DLL-Support.html .
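For reference, here is the same two-call pattern as a small C sketch. It also shows why the bytes had to be reversed above: dwLocalPort and dwRemotePort are stored in network byte order with only the lower 16 bits meaningful, so on a little-endian x86 machine ntohs() does exactly that swap.
#include <winsock2.h>
#include <ws2tcpip.h>
#include <iphlpapi.h>
#include <stdio.h>
#include <stdlib.h>
#pragma comment(lib, "iphlpapi.lib")
#pragma comment(lib, "ws2_32.lib")

int main(void) {
    DWORD size = 0, i;
    /* First call with an empty buffer: returns ERROR_INSUFFICIENT_BUFFER (122)
       and sets 'size' to the number of bytes required, as in the ABL code above. */
    GetTcpTable(NULL, &size, TRUE);

    PMIB_TCPTABLE table = (PMIB_TCPTABLE) malloc(size);
    if (table != NULL && GetTcpTable(table, &size, TRUE) == NO_ERROR) {
        for (i = 0; i < table->dwNumEntries; i++) {
            printf("state=%lu local port=%u remote port=%u\n",
                   table->table[i].dwState,
                   ntohs((u_short) table->table[i].dwLocalPort),
                   ntohs((u_short) table->table[i].dwRemotePort));
        }
    }
    free(table);
    return 0;
}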

Generating tuples containing Long for Vavr Property Checking

I need a pair of random longs for property checking with Vavr.
My implementation looks like this:
Gen<Long> longs = Gen.choose(Long.MIN_VALUE, Long.MAX_VALUE);
Arbitrary<Tuple2<Long, Long>> pairOfLongs = longs
        .flatMap(value -> random -> Tuple.of(value, longs.apply(random)))
        .arbitrary();
Is there any better/nicer way to do the same in Vavr?
Arbitrary<T> can be seen as a function of type
int -> Random -> T
Generating arbitrary integers
Because the sample size is of type int, it would be natural to do the following:
Arbitrary<Tuple2<Integer, Integer>> intPairs = size -> {
    Gen<Integer> ints = Gen.choose(-size, size);
    return random -> Tuple.of(ints.apply(random), ints.apply(random));
};
Let's test it:
Property.def("print int pairs")
.forAll(intPairs.peek(System.out::println))
.suchThat(pair -> true)
.check(10, 5);
Output:
(-9, 2)
(-2, -10)
(5, -2)
(3, 8)
(-10, 10)
Generating arbitrary long values
Currently we are not able to define a size of type long, so the workaround is to ignore the size and use the full long range:
Arbitrary<Tuple2<Long, Long>> longPairs = ignored -> {
    Gen<Long> longs = Gen.choose(Long.MIN_VALUE, Long.MAX_VALUE);
    return random -> Tuple.of(longs.apply(random), longs.apply(random));
};
Let's test it again:
Property.def("print long pairs")
.forAll(longPairs.peek(System.out::println))
.suchThat(pair -> true)
.check(0, 5);
Output:
(2766956995563010048, 1057025805628715008)
(-6881523912167376896, 7985876340547620864)
(7449864279215405056, 6862094372652388352)
(3203043896949684224, -2508953386204733440)
(1541228130048020480, 4106286124314660864)
Interpreting an integer size as long
The size parameter can be interpreted in a custom way. More specifically we could map a given int size to a long size:
Arbitrary<Tuple2<Long, Long>> longPairs = size -> {
    long longSize = ((long) size) << 32;
    Gen<Long> longs = Gen.choose(-longSize, longSize);
    return random -> Tuple.of(longs.apply(random), longs.apply(random));
};
However, the last example does not cover the full long range. Maybe it is possible to find a better mapping.
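One candidate (a sketch only, and not necessarily the better mapping hoped for above) is to scale the int size linearly onto the long range instead of shifting it:
Arbitrary<Tuple2<Long, Long>> longPairs = size -> {
    // size == Integer.MAX_VALUE then maps close to Long.MAX_VALUE
    long longSize = (long) size * (Long.MAX_VALUE / Integer.MAX_VALUE);
    Gen<Long> longs = Gen.choose(-longSize, longSize);
    return random -> Tuple.of(longs.apply(random), longs.apply(random));
};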
Disclaimer: I'm the author of Vavr (formerly known as Javaslang)

issue with OpenCL stencil code

I have a problem with a 4-point stencil OpenCL code. The code runs fine, but I don't get the symmetric final 2D values that are expected.
I suspect it is a problem with how values are updated in the kernel code. Here's the kernel code:
// kernel code
const char *source ="__kernel void line_compute(const double diagx, const double diagy,\
const double weightx, const double weighty, const int size_x,\
__global double* tab_new, __global double* r)\
{ int iy = get_global_id(0)+1;\
int ix = get_global_id(1)+1;\
double new_value, cell, cell_n, cell_s, cell_w, cell_e;\
double rk;\
cell_s = tab_new[(iy+1)*(size_x+2)+ix];\
cell_n = tab_new[(iy-1)*(size_x+2)+ix];\
cell_e = tab_new[iy*(size_x+2)+(ix+1)];\
cell_w = tab_new[iy*(size_x+2)+(ix-1)];\
cell = tab_new[iy*(size_x+2)+ix];\
new_value = weighty *( cell_n + cell_s + cell*diagy)+\
weightx *( cell_e + cell_w + cell*diagx);\
rk = cell - new_value;\
r[iy*(size_x+2)+ix] = rk *rk;\
barrier(CLK_GLOBAL_MEM_FENCE);\
tab_new[iy*(size_x+2)+ix] = new_value;\
}";
cell_s, cell_n, cell_e, cell_w represent the 4 neighbour values for the 2D stencil. I compute new_value and write it back after a barrier(CLK_GLOBAL_MEM_FENCE).
However, it seems there are conflicts between different work-items. How can I fix this?
The barrier(CLK_GLOBAL_MEM_FENCE) you use will not synchronize all work-items as intended; it only synchronizes accesses within a single work-group.
Not all work-groups are executed at the same time, because they are scheduled onto a limited number of physical cores, and global synchronization across work-groups is not possible within a kernel.
The solution is to write the output to a different buffer.
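A sketch of what that looks like here (written as plain OpenCL C rather than an embedded string, for readability): the kernel reads only from tab_old and writes only to tab_new, and the host swaps or copies the two buffers between iterations, so the barrier is no longer needed.
__kernel void line_compute(const double diagx, const double diagy,
                           const double weightx, const double weighty,
                           const int size_x,
                           __global const double* tab_old,  /* read-only input */
                           __global double* tab_new,        /* written output  */
                           __global double* r)
{
    int iy = get_global_id(0) + 1;
    int ix = get_global_id(1) + 1;

    double cell   = tab_old[iy * (size_x + 2) + ix];
    double cell_s = tab_old[(iy + 1) * (size_x + 2) + ix];
    double cell_n = tab_old[(iy - 1) * (size_x + 2) + ix];
    double cell_e = tab_old[iy * (size_x + 2) + (ix + 1)];
    double cell_w = tab_old[iy * (size_x + 2) + (ix - 1)];

    double new_value = weighty * (cell_n + cell_s + cell * diagy)
                     + weightx * (cell_e + cell_w + cell * diagx);

    double rk = cell - new_value;
    r[iy * (size_x + 2) + ix] = rk * rk;

    /* No work-item reads tab_new, so no barrier is required. */
    tab_new[iy * (size_x + 2) + ix] = new_value;
}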
