How to loop faster through gRPC response iterators with Python

I'm calling a gRPC service with Python that responds with a stream of about a million response objects. At the moment I'm using a list comprehension to access the one attribute I need from each response:
stub = QueryStub(grpc_channel)
return [response.attribute_i_need for response in stub.ResponseMethod(request)]
Accessing around a million attributes this way takes a while (around 2-3 minutes). Is there a way I can speed this up? I'm interested to know how people process such scenarios faster. I have also tried list(stub.ResponseMethod(request)) and [*stub.ResponseMethod(request)] to unpack or retrieve the objects faster; however, these approaches take even longer, since the response objects carry a lot of other metadata I don't need and it all gets stored.
PS: I don't necessarily need to store the attributes in memory; accessing them faster is what I'm trying to achieve.

According to this documentation, I would say you need to try two things:
working with the asyncio API (if you're not already doing so) by doing something like:
async def run(stub: QueryStub) -> None:
    async for response in stub.ResponseMethod(empty_pb2.Empty()):
        print(response.attribute_i_need)
Note that the Empty() is just because I do not know your API definition.
the second would be to try the experimental SingleThreadedUnaryStream feature (if applicable to your case) by doing:
options = [(grpc.experimental.ChannelOptions.SingleThreadedUnaryStream, 1)]
with grpc.insecure_channel(target='localhost:50051', options=options) as channel:
    stub = QueryStub(channel)
    # ... call stub.ResponseMethod(...) and iterate as above
What I tried
I don't really know if it covers your use case (you can give me more info on that and I'll update), but here is what I tried:
I have a schema like:
service TestService {
  rpc AMethod(google.protobuf.Empty) returns (stream Test) {} // stream is optional, I tried with both
}
message Test {
  repeated string message = 1;
  repeated string message2 = 2;
  repeated string message3 = 3;
  repeated string message4 = 4;
  repeated string message5 = 5;
  repeated string message6 = 6;
  repeated string message7 = 7;
  repeated string message8 = 8;
  repeated string message9 = 9;
  repeated string message10 = 10;
  repeated string message11 = 11;
}
on the server side (with asyncio) I have
async def AMethod(self, request: empty_pb2.Empty, unused_context) -> AsyncIterable[Test]:
    test = Test()
    for _ in range(10):
        test.message.append(randStr())
        # repeat the append for every other field, or not
    for _ in range(1000000):
        yield test
where randStr creates a random string of length 10000 (totally arbitrary).
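For reference, here is a minimal sketch of such a randStr helper (my assumption, since the original definition isn't shown):
import random
import string

def randStr(length: int = 10000) -> str:
    # build a random string of the given length from ASCII letters
    return ''.join(random.choices(string.ascii_letters, k=length))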
and on the client side (with SingleThreadedUnaryStream and asyncio)
async def run(stub: TesterStub) -> None:
    tests = stub.AMethod(empty_pb2.Empty())
    async for test in tests:
        print(test.message)
Benchmark
Note: This might vary depending on your machine
For the example with only one repeated field filled, I get an average (ran it 3 times) of 77 sec.
With all the fields filled it takes really long, so I tried providing smaller strings (10 in length) and it still takes too long. I think the mix of repeated fields and stream is not a good idea. I also tried without stream and got an average (ran it 3 times) of 45 sec.
My conclusion
This is really slow if all the repeated fields are filled with data, and ok-ish when only one is filled. But overall I think asyncio helps.
Furthermore, this documentation explains that Protocol Buffers are not designed to handle large messages; however, they are great for handling individual messages within a large data set.
I would suggest that, if I got your schema right, you rethink the API design, because it seems suboptimal.
But once again, I might not have understood the schema properly.
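If the goal is just to move one attribute per record as fast as possible, one direction for such a redesign is to batch many values into each streamed message, so the per-message overhead is paid far less often. A server-side sketch under my assumptions (AttributeBatch is a hypothetical message with a single repeated field, and produce_values a hypothetical source of the values):
async def AMethod(self, request: empty_pb2.Empty, unused_context) -> AsyncIterable[AttributeBatch]:
    batch = AttributeBatch()
    for value in produce_values():  # hypothetical source of the ~1M values
        batch.attribute_i_need.append(value)
        if len(batch.attribute_i_need) >= 1000:  # arbitrary batch size
            yield batch
            batch = AttributeBatch()
    if batch.attribute_i_need:  # flush the final partial batch
        yield batch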

I would advise you to loop through the responses with a plain for loop, if you aren't already doing so. But something needs to be said about that:
It is important to realize that everything you put in a loop gets executed for every loop iteration. The key to optimizing loops is to minimize what they do. Even operations that appear to be very fast will take a long time if repeated many times. Executing an operation that takes 1 microsecond a million times will take 1 second to complete.
Don't execute things like len(list) inside a loop, or even in its loop condition.
Example:
a = [i for i in range(1000000)]
length = len(a)
for i in a:
    print(i - length)
is much much faster than
a = [i for i in range(1000000)]
for i in a:
    print(i - len(a))
You can also use techniques like loop unrolling (https://en.wikipedia.org/wiki/Loop_unrolling), a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, an approach known as a space-time tradeoff.
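As a minimal sketch of manual loop unrolling in Python (note that in CPython the gain is usually modest, since interpreter overhead per operation dominates):
def sum_unrolled(values):
    total = 0
    i = 0
    n = len(values)
    # process four elements per iteration to reduce loop overhead
    while i + 4 <= n:
        total += values[i] + values[i + 1] + values[i + 2] + values[i + 3]
        i += 4
    # handle the remaining 0-3 elements
    while i < n:
        total += values[i]
        i += 1
    return total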
Using functions like map, filter, etc. instead of explicit for loops can also provide some performance improvements.
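Applied to the original question, a sketch using map with operator.attrgetter (stub and request are placeholders from the question; the speed-up over a list comprehension is typically small):
import operator

# extract the single needed attribute from every streamed response
get_attribute = operator.attrgetter('attribute_i_need')
attributes = list(map(get_attribute, stub.ResponseMethod(request)))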

Related

Parallel iteration over array with step size greater than 1

I'm working on a practice program for doing belief propagation stereo vision. The relevant aspect of that here is that I have a fairly long array representing every pixel in an image, and want to carry out an operation on every second entry in the array at each iteration of a for loop - first one half of the entries, and then at the next iteration the other half (this comes from an optimisation described by Felzenszwalb & Huttenlocher in their 2006 paper 'Efficient belief propagation for early vision'.) So, you could see it as having an outer for loop which runs a number of times, and for each iteration of that loop I iterate over half of the entries in the array.
I would like to parallelise the operation of iterating over the array like this, since I believe it would be thread-safe to do so, and of course potentially faster. The operation updates values inside the data structures representing the neighbouring pixels, which are not themselves used in a given iteration of the outer loop. Originally I just iterated over the entire array in one go, which meant that it was fairly trivial to carry this out - all I needed to do was put .Parallel between Array and .iteri. Changing to operating on every second array entry is trickier, however.
To make the change from simply iterating over every entry, I switched from Array.iteri (fun i p -> ... to using for i in startIndex..2..(ArrayLength - 1) do, where startIndex is either 1 or 0 depending on which one I used last (controlled by toggling a boolean). This means, though, that I can't simply use the really nice .Parallel to make things run in parallel.
I haven't been able to find anything specific about how to implement a parallel for loop in .NET which has a step size greater than 1. The best I could find was a paragraph in an old MSDN document on parallel programming in .NET, but that paragraph only makes a vague statement about transforming an index inside a loop body. I do not understand what is meant there.
I looked at Parallel.For and Parallel.ForEach, as well as creating a custom partitioner, but none of those seemed to include options for changing the step size.
The other option that occurred to me was to use a sequence expression such as
let getOddOrEvenArrayEntries myarray oddOrEven =
    seq {
        let startingIndex =
            if oddOrEven then
                1
            else
                0
        for i in startingIndex..2..(Array.length myarray - 1) do
            yield (i, myarray.[i])
    }
and then using PSeq.iteri from ParallelSeq, but I'm not sure whether it will work correctly with .NET Core 2.2. (Note that, currently at least, I need to know the index of the given element in the array, as it is used as the index into another array during the processing).
How can I go about iterating over every second element of an array in parallel? I.e. iterating over an array using a step size greater than 1?
You could try PSeq.mapi, which provides not only the sequence item as a parameter but also its index.
Here's a small example:
let res =
    nums
    |> PSeq.mapi (fun index item -> if index % 2 = 0 then item else item + 1)
You can also have a look at this sampling snippet; just be sure to substitute Seq with PSeq.

Generate non-repeating random number using timestamp in MarkLogic (XQuery)?

I want to generate a non-repeating random number with a timestamp in it. What could be the possible code for it?
I've tried using the sem:uuid-string() function, but it generates a 36-character string, which is very long.
I'd suggest taking a look at the ml-unique library. It provides 3 different methods for generating unique ids in MarkLogic, and explains the pros and cons of each. Maybe one of those fits your needs, or you can copy the code and adapt it as needed.
Note that a timestamp alone is not enough to guarantee uniqueness, particularly if generating multiple ids in one request, or when processing data in parallel.
The length of the uuid string makes the chance of collisions very small, by the way.
HTH!
It is not possible to generate a non-repeating random number and have the results fit into a finite size. If 36 bytes is too large, that further limits the theoretical maximum. The server itself uses 64-bit random numbers (effectively xdmp:random) for unique IDs. Attempting to do better, with respect to collision probability, is futile - no matter what or how long a URI you use, internally references will be created as a 64-bit random number or as a hash value. The recommended methods will not produce a colliding URI with lower probability than the server itself will, given non-colliding URIs of any size. Most likely, attempts at more complex 'random' URI generation will produce much worse results due to the subtlety of pseudo-random number algorithms.
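To put rough numbers on that (a standard birthday-bound estimate, my addition): among n random b-bit IDs the collision probability is approximately n^2 / 2^(b+1). With 64-bit IDs and n = 10^9, that is about 10^18 / 3.7x10^19 ≈ 2.7%, while the 122 random bits of a version-4 UUID give roughly 10^18 / 10^37 ≈ 10^-19 for the same n.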
The code below generates (with arbitrarily high probability) 10 different random numbers. Every iteration of the for loop inserts a newly generated random number into the MarkLogic database. The exception error((), 'BREAK') is thrown once 10 different numbers have been generated.
xquery version "1.0-ml";
xdmp:document-insert("/doc/random.xml", <root><a>{xdmp:random(100)}</a></root>);
try {
  for $i in (1 to 200) (: 200 can be replaced with a larger number to reduce the probability that 10 different random numbers are never selected :)
  return xdmp:invoke-function(
    function() as item()? {
      let $myrandom := xdmp:random(100),
          $last := count(doc("/doc/random.xml")/root/*)
      return
        if ($last lt 10) then (
          if (doc("/doc/random.xml")/root/a/text() = $myrandom) then ()
          else (xdmp:node-insert-after(doc("/doc/random.xml")/root/a[last()], <a>{$myrandom}</a>))
        )
        else (if ($last eq 10) then (error((), 'BREAK')) else ())
    },
    <options xmlns="xdmp:eval">
      <transaction-mode>update</transaction-mode>
      <transaction-mode>update-auto-commit</transaction-mode>
    </options>)
}
catch ($ex) {
  if ($ex/error:code eq 'BREAK') then ("10 different random numbers were generated") else xdmp:rethrow()
};

Iterating results back into an OpenCL kernel

I have written an OpenCL kernel that takes 25 million points and checks them relative to two lines (A & B). It then outputs two lists; i.e. set A of all of the points found to be beyond line A, and vice versa.
I'd like to run the kernel repeatedly, updating the input points with each of the line result sets in turn (and also updating the checking line). I'm guessing that reading the two result sets out of the kernel, forming them into arrays, and then passing them back in one at a time as inputs is quite a slow solution.
As an alternative, I've tested keeping a global index in the kernel that logs which points relate to which line. This is updated at each line-checking cycle. During each iteration, the index for each point in the overall set is switched to 0 (no line), A, B, and so forth (i.e. the related line id). In subsequent iterations, only points with an index that matches the 'live' set being checked in that cycle (i.e. tagged with A for set A) are tested further.
The problem is that, in each iteration, the kernels still have to check through the full index (i.e. all 25m points) to discover whether or not they are in the 'live' set. As a result, the speed of each cycle does not significantly improve as the size of the result set decreases over time. Again, this seems a slow solution; whilst it avoids passing too much information between GPU and CPU, it means that a large number of the work-items aren't doing very much work at all.
Is there an alternative solution to what I am trying to do here?
You could use atomics to sort the outputs into two arrays. I.e. if a point is in A, get its position by incrementing the A counter and putting it into A, and do the same for B.
Using global atomics on everything might be horribly slow (fast on AMD, slow on NVIDIA, no idea about other devices) - instead you can use a local atomic_inc on a zero-initialized local integer to do exactly the same thing (but only for the local set of work-items), and then at the end do an atomic_add to both global counters based on your local counters.
To put this more clearly in code (my explanation is not great):
__local int local_a, local_b;  // per-work-group output counters
__local int a_base, b_base;    // the group's base offsets into the global buffers
int lid = get_local_id(0);

// zero the local counters once per work-group
if (lid == 0)
{
    local_a = 0;
    local_b = 0;
}
barrier(CLK_LOCAL_MEM_FENCE);

// each work-item claims a slot in its group's A or B range
int id;
if (is_a)
    id = atomic_inc(&local_a);
else
    id = atomic_inc(&local_b);
barrier(CLK_LOCAL_MEM_FENCE);

// one work-item per group reserves the group's ranges in the global buffers
if (lid == 0)
{
    a_base = atomic_add(a_counter, local_a);
    b_base = atomic_add(b_counter, local_b);
}
barrier(CLK_LOCAL_MEM_FENCE);

// scatter each work-item's data into its claimed slot
if (is_a)
    a_buffer[id + a_base] = data;
else
    b_buffer[id + b_base] = data;
This involves faffing around with atomics, which are inherently slow, but depending on how quickly your dataset shrinks it might be much faster overall. Additionally, if the B data is not considered live, you can omit getting the B ids and all the atomics involving B, as well as the write-back.

How to avoid reading back in OpenCL

I am implementing an algorithm with OpenCL. I loop many times in C++ and call the same OpenCL kernel each time. The kernel generates the input data for the next iteration, together with the number of those data items. Currently, I read this number back in each loop iteration, for two purposes:
I use this number to decide how many work items I need for the next loop; and
I use this number to decide when to exit the loop (when the number is 0).
I found that this read-back takes most of the time of the loop. Is there any way to avoid it?
Generally speaking, if you need to call a kernel repeatedly, and the exit condition depends on the result generated by the kernel (not a fixed number of iterations), how can you do it efficiently? Is there anything like the occlusion query in OpenGL, where you can just issue a query instead of reading back from the GPU?
Reading a number back from a GPU kernel will always take tens to thousands of microseconds, or more.
If the controlling number is always decreasing, you can keep it in global memory, test it against the global id, and decide whether each work-item does work or not on each iteration. Use a barrier with a global memory fence to sync the threads...
kernel void x(global int * the_number, const int max_iterations, ... )
{
    int index = get_global_id(0);
    int count = 0; // stops an infinite loop
    while( index < the_number[0] && count < max_iterations )
    {
        count++;
        // loop code follows
        ....
        // Use one thread to decide what to do next
        if ( index == 0 )
        {
            the_number[0] = ... next value
        }
        // Barrier to sync threads. Note: barrier only synchronizes work-items
        // within one work-group, so this pattern assumes a single work-group.
        barrier( CLK_GLOBAL_MEM_FENCE );
    }
}
You have a couple of options here:
If possible, you can simply move the loop and the conditional into the kernel. Use a scheme where surplus work items do nothing, depending on the input for the current iteration.
If 1. isn't possible, I would recommend that you store the data generated by the "decision" kernel in a buffer and use that buffer to "direct" your other kernels.
Both these options will allow you to skip the readback.
I'm just finishing up some research where we had to tackle this exact problem!
We discovered a couple of things:
Use two (or more) buffers! Have the first iteration of the kernel operate on data in b1, then the next on b2, then on b1 again. In between each kernel call, read back the result of the other buffer and check to see if it's time to stop iterating. This works best when the kernel takes longer than a read; use a profiling tool to make sure you aren't waiting on reads (and if you are, increase the number of buffers).
Overshoot! Add a finishing check to each kernel, and call it several (hundreds of) times before copying data back. If your kernel is low-cost, this can work very well.
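As a host-side sketch of the ping-pong-buffer and overshoot ideas, here is roughly what that loop can look like, written with PyOpenCL for illustration (the kernel body, names, buffer sizes, and batch count are all hypothetical):
import numpy as np
import pyopencl as cl

# hypothetical kernel: consumes src, produces dst, and writes the number of
# live items for the next iteration into count[0]
KERNEL_SRC = """
kernel void step(global const float *src, global float *dst, global int *count) {
    /* iteration step goes here */
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prog = cl.Program(ctx, KERNEL_SRC).build()

n = 1 << 20
mf = cl.mem_flags
buf_a = cl.Buffer(ctx, mf.READ_WRITE, size=n * 4)
buf_b = cl.Buffer(ctx, mf.READ_WRITE, size=n * 4)
count_buf = cl.Buffer(ctx, mf.READ_WRITE, size=4)
cl.enqueue_copy(queue, count_buf, np.ones(1, dtype=np.int32))  # seed the counter

count = np.zeros(1, dtype=np.int32)
src, dst = buf_a, buf_b
done = False
while not done:
    # "overshoot": enqueue a batch of iterations before paying for a readback
    for _ in range(100):
        prog.step(queue, (n,), None, src, dst, count_buf)
        src, dst = dst, src  # ping-pong the buffers
    cl.enqueue_copy(queue, count, count_buf)  # the only host readback
    done = count[0] == 0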

What is an elegant way to abstract functions - not objects?

I have a function that logs into a sensor via telnet/pexpect and acts as a data collector.
I don't want to rewrite the part that logs in, grabs the data, and parses out relevant output from it (pexpect). However, I need to do different things with this code and the data it gathers.
For example, I may need to:
Time until the first reading is returned
Take the average of a varying number of sensor readings
Return the status (which is one piece of data) or the sensor reading (which is a separate piece of data) from the output
Ultimately, it should still log in and parse output the same way, and I want to use one code block for that part.
Higher up in the code, it's being used instantaneously. When I call it, I know what type of data I need to gather and that's that. Constructing objects is too clumsy.
My usage has outstripped adding more arguments to a single function.
Any ideas?
This is such a common situation, I'm surprised you haven't already done what everyone else does.
Refactor your function to decompose it into smaller functions.
Functions are objects, and can be passed as arguments to other functions.
def step1():
    ...  # whatever

def step2():
    ...  # whatever

def step2_alternative():
    ...  # whatever

def original(args):
    step1()
    step2()

def revised(args, step2_choice):
    step1()
    step2_choice()
Now you can do this.
revised(args, step2)
revised(args, step2_alternative)
It's just OO programming with function objects.
Could you pass a data processing function to the function you described as an argument?
That may be more or less elegant, depending on your taste.
(Forgive me: I know nothing about pexpect, and I may even have misunderstood your question!)
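A minimal sketch of that suggestion (all names here are hypothetical, since the original collector code isn't shown): the login-and-parse logic lives in one function, and each caller injects the processing it needs.
from typing import Callable

def collect_readings(host: str, process: Callable[[list], object]) -> object:
    # the shared part: log in via telnet/pexpect, grab the data, and parse
    # the relevant output; `readings` stands in for the parsed values
    readings = ["21.5", "21.7", "21.6"]  # placeholder for parsed sensor output
    return process(readings)

# different callers inject different processing steps
average = collect_readings("sensor-1", lambda rs: sum(map(float, rs)) / len(rs))
first = collect_readings("sensor-1", lambda rs: float(rs[0]))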
