Callback from "multiprocessing" with CFFI segfaults after ~100 iterations - python-cffi

A PyPy callback that works perfectly (in an infinite loop) when implemented straightforwardly as a method of a Python object segfaults after approximately 100 iterations when I move the Python object into a separate multiprocessing process.
In the main code I have:
import multiprocessing as mp

class Task(object):
    def __init__(self, com, lib):
        self.com = com  # communication queue
        self.lib = lib  # ffi library
        self.proc = mp.Process(target=self.spawn, args=(self.com,))
        self.register_callback()

    def spawn(self, com):
        print('%s spawned.' % self.name)
        # loop (keeping 'self' alive) until BREAK:
        while True:
            cmd = com.get()
            if cmd == self.BREAK:
                break
        print("%s stopped." % self.name)

    @ffi.callback("int(void*, Data*)")  # old cffi (ABI mode)
    def callback(self, data):
        # <work on data>
        return 1

    def register_callback(self):
        s = ffi.new_handle(self)
        self.lib.register_callback(s, self.callback)  # C-call
The idea is that multiple tasks should serve an equal number of callbacks concurrently. I have no clue what may cause the segfault, especially since it runs fine for the first ~100 iterations or so. Help much appreciated!

Solution
The handle 's' is garbage collected as soon as 'register_callback()' returns. Making the handle an attribute of 'self' and passing that attribute to the C call keeps it alive.
Standard CPython (cffi 1.6.0) segfaulted at the first iteration (i.e. gc was immediate) and gave me a crucial, informative error message. PyPy, on the other hand, segfaulted after approximately 100 iterations without providing a message. Both run fine now.
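The lifetime issue can be illustrated without CFFI. The following is a minimal sketch, assuming a hypothetical `Handle` class that stands in for the object returned by `ffi.new_handle()`; `LeakyTask`, `FixedTask` and `handle_survives` are illustrative names that do not appear in the original code. A weak reference shows whether the handle is still alive once `register_callback()` has returned.

```python
import gc
import weakref


class Handle:
    """Hypothetical stand-in for the object returned by ffi.new_handle()."""

    def __init__(self, target):
        self.target = target


class LeakyTask:
    def register_callback(self):
        s = Handle(self)        # only a local name refers to the handle
        return weakref.ref(s)   # a weak reference does not keep s alive


class FixedTask:
    def register_callback(self):
        self._handle = Handle(self)       # the attribute keeps the handle alive
        return weakref.ref(self._handle)


def handle_survives(task):
    ref = task.register_callback()
    gc.collect()  # force a collection so the result does not rely on refcounting
    return ref() is not None


leaky_alive = handle_survives(LeakyTask())
fixed_task = FixedTask()
fixed_alive = handle_survives(fixed_task)
print(leaky_alive, fixed_alive)
```

In the real code the equivalent change would be to store the result of `ffi.new_handle(self)` on `self` before handing it to the C library, so that the handle lives as long as the task does.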

Related

Dagster - Execute an @op only when all parallel executions are finished (DynamicOutput)

I have a problem in Dagster that I have not been able to solve.
I have the following configuration:
Step 1 gets the data from an endpoint.
Step 2 gets a list of customers dynamically.
Step 3 updates the database with the response from step 1, for each customer from step 2, in parallel.
Before calling step 3, I have a function named "parallelize_clients" that creates a DynamicOutput for each client from step 2, so that the step 3 updates run in parallel. Finally, I have a graph that joins the operations.
from dagster import DynamicOut, DynamicOutput, graph, op

@op()
def step_1_get_response():
    return {'exemple': 'data'}

@op()
def step_2_get_client_list():
    return ['client_1', 'client_2', 'client_3']  # the number of customers is dynamic

@op(out=DynamicOut())
def parallelize_clients(context, client_list):
    for client in client_list:
        yield DynamicOutput(client, mapping_key=str(client))

@op()
def step_3_update_database_cliente(response, client):
    ...  # OPERATION UPDATE IN DATABASE CLIENT

@graph()
def job_exemple_graph():
    response = step_1_get_response()
    clients_list = step_2_get_client_list()
    clients = parallelize_clients(clients_list)
    # run the functions in parallel
    clients.map(lambda client: step_3_update_database_cliente(response, client))
According to the documentation, an @op starts as soon as its dependencies are fulfilled, and ops that have no dependencies are executed immediately, in no particular order. For example, my step 1 and step 2 have no dependencies, so both run in parallel automatically. After the client list is returned, "parallelize_clients()" is executed, and finally the map in the graph dynamically creates one execution per client (DynamicOutput).
So far this works. Here is the problem: I need to execute a specific function only when step 3 has completely finished. Because step 3 is created dynamically, several executions run in parallel, and I have not found a way to run a function only after all of these parallel executions are finished.
In the graph I tried to put a call to an op "exemplolaststep()" (step 4) at the end. However, step 4 is executed together with step 1 and step 2, and I want step 4 to execute only after step 3. I have not been able to get this to work. Could someone help me?
I tried to create a fake dependency with
@op(ins={"start": In(Nothing)})
def step_4():
    pass
and in the graph I tried to pass the result of the map call into the step_4() call. Example:
@graph()
def job_exemple_graph():
    response = step_1_get_response()
    clients_list = step_2_get_client_list()
    clients = parallelize_clients(clients_list)
    # run the functions in parallel
    step_4(start=clients.map(lambda client: step_3_update_database_cliente(response, client)))
I have tried other approaches as well, however, to no avail.
You just need to add a .collect() call on the mapped function in your graph, to indicate that all the parallel operations should join before moving on. Something like
@graph()
def job_exemple_graph():
    response = step_1_get_response()
    clients_list = step_2_get_client_list()
    clients = parallelize_clients(clients_list)
    # run the functions in parallel
    step_4(
        start=clients.map(
            lambda client: step_3_update_database_cliente(response, client)
        ).collect()
    )

How can I utilize asyncio to make third party file operations faster?

I am using a third-party library called isort. isort has a function, isort.check_file, that opens and reads a file. In order to speed this up I attempted to wrap isort.check_file so that it runs asynchronously. check_file takes the file path; however, my attempt does not work.
...
coroutines = [self.check_file('c:\\example1.py'), self.check_file('c:\\example2.py')]
loop = asyncio.get_event_loop()
result = loop.run_until_complete(asyncio.gather(*coroutines))
...
async def check_file(self, changed_file):
    return isort.check_file(changed_file)
However, this does not seem to work. How can I make the library call isort.check_file be utilized correctly with asyncio.gather?
Better understanding of IO Bottleneck and GIL
What your async function check_file does is exactly what it would do without async in front. To get any meaningful benefit from asynchronous code, you must await some sort of awaitable, which requires the await keyword.
So basically what you did is:
import time

async def wait(n):
    time.sleep(n)
That does no good for asynchronous operation, because time.sleep blocks the event loop.
To make such a synchronous function asynchronous, assuming it is mostly IO-bound, you can use asyncio.to_thread (available since Python 3.9) instead.
import asyncio
import time

async def task():
    await asyncio.to_thread(time.sleep, 10)  # <- await + something that's awaitable
    # similar to await asyncio.sleep(10) now

async def main():
    tasks = [task() for _ in range(10)]
    await asyncio.gather(*tasks)

asyncio.run(main())
That essentially moves the IO-bound operation out of the main thread, so the main thread can do its work without waiting for the IO.
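The difference between the two versions can be measured directly. This is a minimal sketch, assuming an arbitrary `DELAY` of 0.2 seconds and three tasks; `blocking_task`, `threaded_task` and `timed` are illustrative names. The blocking coroutines should take roughly the sum of their delays, while the `asyncio.to_thread` version should finish in about one delay.

```python
import asyncio
import time

DELAY = 0.2  # seconds each simulated blocking call takes (arbitrary value)


async def blocking_task():
    time.sleep(DELAY)  # blocks the event loop; no other coroutine runs meanwhile


async def threaded_task():
    await asyncio.to_thread(time.sleep, DELAY)  # runs in a worker thread


async def timed(factory, count=3):
    """Run `count` coroutines concurrently and return the elapsed wall time."""
    start = time.perf_counter()
    await asyncio.gather(*(factory() for _ in range(count)))
    return time.perf_counter() - start


blocking_elapsed = asyncio.run(timed(blocking_task))
threaded_elapsed = asyncio.run(timed(threaded_task))
print(f"blocking: {blocking_elapsed:.2f}s, to_thread: {threaded_elapsed:.2f}s")
```

Exact timings depend on the machine, but the blocking version cannot finish faster than three delays back to back.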
But there's a catch: Python's Global Interpreter Lock (GIL).
Due to a limitation of CPython, the official Python implementation, only one thread can run Python code at any given moment, stalling all the others.
Then how do we achieve better performance just by moving IO to a different thread? Simply by releasing the GIL during IO operations.
IO operations basically go like this:
"Hey OS, please do this IO work for me. Wake me up when it's done."
Thread 1 goes to sleep
Some time later, the OS pokes Thread 1
"Your IO operation is done, take this and get back to work."
So all the thread does is nothing. In such cases, i.e. IO-bound work, the GIL can be safely released to let other threads run. Built-in functions like time.sleep, open(), etc. implement such GIL-release logic in their C code.
This doesn't change much in asyncio, which is internally a bunch of event checks and callbacks. Each asyncio.Task works like a thread to some degree: tasks ask the main loop to wake them up when an IO operation is done.
Now that these simplified basic concepts are sorted out, we can go back to your question.
CPU Bottleneck and IO Bottleneck
Basically, what you're up against is not an IO bottleneck. It's mostly a CPU bottleneck.
Loading merely a few KB of text from a local drive and then running a lot of intensive Python code afterward doesn't count as an IO-bound operation.
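A small experiment makes the point for CPU-bound work. This is a sketch under stated assumptions: `cpu_bound` is a hypothetical pure-Python workload standing in for the parsing isort does, and the job size is arbitrary. On a standard CPython build with the GIL, the two timings are usually similar, because only one thread executes Python bytecode at a time; exact numbers vary by machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def cpu_bound(n):
    """Pure-Python arithmetic with no IO, i.e. CPU-bound work."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_sequential(jobs):
    return [cpu_bound(n) for n in jobs]


def run_threaded(jobs):
    with ThreadPoolExecutor(max_workers=2) as executor:
        return list(executor.map(cpu_bound, jobs))


def timed(func, jobs):
    """Return the function's result together with the elapsed wall time."""
    start = time.perf_counter()
    result = func(jobs)
    return result, time.perf_counter() - start


jobs = [200_000] * 8  # arbitrary workload size
seq_result, seq_time = timed(run_sequential, jobs)
thr_result, thr_time = timed(run_threaded, jobs)
print(f"sequential: {seq_time:.3f}s, 2 threads: {thr_time:.3f}s")
```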
Testing
Let's consider the following test case:
Run isort.check_file for 10000 scripts in four ways:
Synchronously, just like normal Python code
Multithreaded, with 2 threads
Multiprocessing, with 2 processes
Asynchronous, using asyncio.to_thread
We can expect that:
Multithreaded will be slower than synchronous code, as there is very little IO work.
Multiprocessing will be slower for short workloads and faster for longer ones, because spawning processes and communicating with them takes time.
Asynchronous will be even slower than multithreaded, because asyncio has to deal with threads, which it is not really designed for.
With folder structure of:
├─ main.py
└─ import_messes
├─ lib_0.py
├─ lib_1.py
├─ lib_2.py
├─ lib_3.py
├─ lib_4.py
├─ lib_5.py
├─ lib_6.py
├─ lib_7.py
├─ lib_8.py
└─ lib_9.py
We'll load each of these 1000 times, for a total of 10000 loads.
Each file is filled with random imports I grabbed from asyncio.
from asyncio.base_events import *
from asyncio.coroutines import *
from asyncio.events import *
from asyncio.exceptions import *
from asyncio.futures import *
from asyncio.locks import *
from asyncio.protocols import *
from asyncio.runners import *
from asyncio.queues import *
from asyncio.streams import *
from asyncio.subprocess import *
from asyncio.tasks import *
from asyncio.threads import *
from asyncio.transports import *
Source code(main.py):
"""
asynchronous isort demo
"""
import pathlib
import asyncio
import itertools
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from timeit import timeit

import isort
from isort import format

# target dir with modules
FILE = pathlib.Path("./import_messes")


# Monkey-patching isort.format.create_terminal_printer to suppress Terminal bombarding.
# Totally not required nor recommended for normal use
class SuppressionPrinter:
    def __init__(self, *_, **__):
        pass

    def success(self, *_):
        pass

    def error(self, *_):
        pass

    def diff_line(self, *_):
        pass


isort.format.BasicPrinter = SuppressionPrinter


# -----------------------------
# Test functions

def filelist_gen():
    """Chain directory list multiple times to get meaningful difference"""
    yield from itertools.chain.from_iterable([FILE.iterdir() for _ in range(1000)])


def isort_synchronous(path_iter):
    """Synchronous usual isort use-case"""
    # return list of results
    return [isort.check_file(file) for file in path_iter]


def isort_thread(path_iter):
    """Threading isort"""
    # prepare thread pool
    with ThreadPoolExecutor(max_workers=2) as executor:
        # start loading
        futures = [executor.submit(isort.check_file, file) for file in path_iter]
        # return list of results
        return [fut.result() for fut in futures]


def isort_multiprocess(path_iter):
    """Multiprocessing isort"""
    # prepare process pool
    with ProcessPoolExecutor(max_workers=2) as executor:
        # start loading
        futures = [executor.submit(isort.check_file, file) for file in path_iter]
        # return list of results
        return [fut.result() for fut in futures]


async def isort_asynchronous(path_iter):
    """Asyncio isort using to_thread"""
    # create coroutines that delegate sync funcs to threads
    coroutines = [asyncio.to_thread(isort.check_file, file) for file in path_iter]
    # run coroutines and wait for results
    return await asyncio.gather(*coroutines)


if __name__ == '__main__':
    # run once, no repetition
    n = 1
    # synchronous runtime
    print(f"Sync func.: {timeit(lambda: isort_synchronous(filelist_gen()), number=n):.4f}")
    # threading demo
    print(f"Threading : {timeit(lambda: isort_thread(filelist_gen()), number=n):.4f}")
    # multiprocessing demo
    print(f"Multiproc.: {timeit(lambda: isort_multiprocess(filelist_gen()), number=n):.4f}")
    # asyncio to_thread demo
    print(f"to_thread : {timeit(lambda: asyncio.run(isort_asynchronous(filelist_gen())), number=n):.4f}")
Run results
Sync func.: 18.1764
Threading : 18.3138
Multiproc.: 9.5206
to_thread : 27.3645
(The results above were measured on an NVMe SSD.)
You can see that isort.check_file is not an IO-bound operation on fast IO devices. Therefore the best bet is multiprocessing, if it is really needed with such fast drives.
If the number of files is low in the 'fast IO device' situation above, say a hundred or fewer, multiprocessing will suffer even more than asyncio.to_thread, because the cost of spawning, communicating with, and killing processes overwhelms multiprocessing's benefits.
However, with slow IO devices such as HDDs, threading/async is a perfectly valid idea and can give a large boost in performance.
Experiment with your use case and adjust the core/thread count (max_workers) to best fit your environment.

After first run, Jupyter notebook with Python 3.6.1, using asyncio basic example gives: RuntimeError: Event loop is closed

In Jupyter Notebook (Python 3.6.1) I ran the basic Hello World example from the Python docs (18.5.3.1.1. Example: Hello World coroutine) and noticed that it gave me a RuntimeError. After spending a long time looking for the problem in the program (my understanding is that the docs may not be totally up to date), I finally noticed that it only happens from the second run onward, which I confirmed in a restarted kernel. I have since copied the same small program into two successive cells (In 1 and 2) and found that the error appears on the second cell, not the first, and on both cells thereafter. This repeats after restarting the kernel.
import asyncio

def hello_world(loop):
    print('Hello World')
    loop.stop()

loop = asyncio.get_event_loop()
# Schedule a call to hello_world()
loop.call_soon(hello_world, loop)
# Blocking call interrupted by loop.stop()
loop.run_forever()
loop.close()
The traceback:
RuntimeError Traceback (most recent call last)
<ipython-input-2-0930271bd896> in <module>()
6 loop = asyncio.get_event_loop()
7 # Blocking call which returns when the hello_world() coroutine
----> 8 loop.run_until_complete(hello_world())
9 loop.close()
/home/pontiac/anaconda3/lib/python3.6/asyncio/base_events.py in run_until_complete(self, future)
441 Return the Future's result, or raise its exception.
442 """
--> 443 self._check_closed()
444
445 new_task = not futures.isfuture(future)
/home/pontiac/anaconda3/lib/python3.6/asyncio/base_events.py in _check_closed(self)
355 def _check_closed(self):
356 if self._closed:
--> 357 raise RuntimeError('Event loop is closed')
358
359 def _asyncgen_finalizer_hook(self, agen):
RuntimeError: Event loop is closed
I don't get this error when running a file in the interpreter with all the Debug settings set. I am running this Notebook in my recently reinstalled Anaconda set up which only has the 3.6.1 python version installed.
The issue is that loop.close() makes the loop unavailable for future use. That is, you can never use a loop again after calling close. The loop stays around as an object, but almost all methods on the loop will raise an exception once the loop is closed. However, asyncio.get_event_loop() returns the same loop if you call it more than once. You often want this, so that multiple parts of an application get the same event loop.
However, if you plan on closing a loop, you are better off calling asyncio.new_event_loop rather than asyncio.get_event_loop. That will give you a fresh event loop. If you call new_event_loop rather than get_event_loop, you're responsible for making sure that the right loop gets used in all parts of the application that run in this thread. If you want to be able to run the code multiple times to test it, you could do something like:
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
After that, you'll find that asyncio.get_event_loop returns the same thing as loop. So if you do that near the top of your program, you will have a new fresh event loop each run of the code.
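The behaviour can be reproduced outside a notebook. The following is a minimal sketch; `run_hello` is an illustrative helper, and it passes the loop object around explicitly instead of calling asyncio.set_event_loop, which is an assumption made to keep the example self-contained. Each call creates and closes its own loop, so repeated runs succeed, while reusing an already closed loop raises the same RuntimeError as in the question.

```python
import asyncio


def run_hello():
    """Run the Hello World callback on a fresh loop and close it afterwards."""
    loop = asyncio.new_event_loop()  # a new loop per call, so closing it is safe
    messages = []

    def hello_world():
        messages.append("Hello World")
        loop.stop()

    loop.call_soon(hello_world)
    try:
        loop.run_forever()  # returns once hello_world() calls loop.stop()
    finally:
        loop.close()
    return loop, messages


first_loop, first = run_hello()
second_loop, second = run_hello()  # works: it does not reuse the closed loop

# Reusing a closed loop reproduces the error from the question.
try:
    first_loop.run_forever()
    reuse_error = None
except RuntimeError as exc:
    reuse_error = str(exc)

print(first, second, reuse_error)
```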

Multiprocessing with worker.run() works in series instead of in parallel?

I'm trying to create a program which in its essence works like this:
import multiprocessing
import time

def worker(numbers):
    print(numbers)
    time.sleep(2)
    return

if __name__ == '__main__':
    multiprocessing.set_start_method("spawn")
    p1 = multiprocessing.Process(target=worker, args=([0,1,2,3,4],))
    p2 = multiprocessing.Process(target=worker, args=([5,6,7,8],))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    while(1):
        p1.run()
        p2.run()
        p1.join()
        p2.join()
        print('Done!')
The first time the processes are called via p#.start(), they are executed in parallel. The second time they are called via the p#.run() method, they are executed in series.
How can I make sure the subsequent method calls are also performed in parallel?
Edit: It is important that the processes start together. It cannot happen that process 1 gets executed twice while process 2 only gets executed once.
Edit: I should also note that this code is running on a raspberry pi v3 model B.
As far as I know, a process object (like a thread) can only be started once. After that, calling the run method is just a simple function call that executes the target in the calling process. That's why it doesn't run in parallel.
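A short check shows where run() actually executes. This is a minimal sketch; `record_pid` and `run_round` are illustrative helpers that are not part of the original question. Calling run() on a Process object invokes the target in the current process, so the recorded PID equals the parent's PID. One way to get parallel rounds, assuming the work can be restarted each time, is to create fresh Process objects per round and call start() on all of them before joining.

```python
import multiprocessing
import os


def record_pid(sink):
    sink.append(os.getpid())  # remember which process executed the target


pids = []
proc = multiprocessing.Process(target=record_pid, args=(pids,))
proc.run()  # plain method call: no child process is created

ran_in_parent = pids == [os.getpid()]
print(ran_in_parent)


def run_round(worker, batches):
    """Start one fresh Process per batch, then wait for all of them."""
    procs = [multiprocessing.Process(target=worker, args=(batch,)) for batch in batches]
    for p in procs:
        p.start()  # every child is started before any join()
    for p in procs:
        p.join()
```

`run_round` is only defined here, not executed; in the question's program it would be called inside the loop, under the `if __name__ == '__main__':` guard, for example as `run_round(worker, [[0, 1, 2, 3, 4], [5, 6, 7, 8]])`.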

CUDA IPC Memcpy + MPI fails in Theano, works in pycuda

For learning purposes, I wrote a small C Python module that is supposed to perform an IPC cuda memcopy to transfer data between processes. For testing, I wrote equivalent programs: one using theano's CudaNdarray, and the other using pycuda. The problem is, even though the test programs are nearly identical, the pycuda version works while the theano version does not. It doesn't crash: it just produces incorrect results.
Below is the relevant function in the C module. Here is what it does: every process has two buffers, a source and a destination. Calling _sillycopy(source, dest, n) copies n elements from each process's source buffer to the neighboring process's dest array. So, if I have two processes, 0 and 1, process 0 will end up with process 1's source buffer and process 1 will end up with process 0's source buffer.
Note that to transfer cudaIpcMemHandle_t values between processes, I use MPI (this is a small part of a larger project which uses MPI). _sillycopy is called by another function, "sillycopy" which is exposed in Python by the standard Python C API methods.
void _sillycopy(float *source, float* dest, int n, MPI_Comm comm) {
    int localRank;
    int localSize;
    MPI_Comm_rank(comm, &localRank);
    MPI_Comm_size(comm, &localSize);

    // Figure out which process is to the "left".
    // m() performs a mod and treats negative numbers
    // appropriately
    int neighbor = m(localRank - 1, localSize);

    // Create a memory handle for *source and do a
    // wasteful Allgather to distribute to other processes
    // (could just use an MPI_Sendrecv, but irrelevant right now)
    cudaIpcMemHandle_t *memHandles = new cudaIpcMemHandle_t[localSize];
    cudaIpcGetMemHandle(memHandles + localRank, source);
    MPI_Allgather(
        memHandles + localRank, sizeof(cudaIpcMemHandle_t), MPI_BYTE,
        memHandles, sizeof(cudaIpcMemHandle_t), MPI_BYTE,
        comm);

    // Open the neighbor's mem handle so we can do a cudaMemcpy
    float *sourcePtr;
    cudaIpcOpenMemHandle((void**)&sourcePtr, memHandles[neighbor], cudaIpcMemLazyEnablePeerAccess);

    // Copy!
    cudaMemcpy(dest, sourcePtr, n * sizeof(float), cudaMemcpyDefault);
    cudaIpcCloseMemHandle(sourcePtr);
    delete [] memHandles;
}
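The intended data movement of _sillycopy can be sketched in plain Python, with no CUDA or MPI involved. This is only a model of the expected result: `left_neighbor` mirrors what m(localRank - 1, localSize) is described as doing, and `simulate_sillycopy` is a hypothetical helper in which each list plays the role of one process's source buffer.

```python
def left_neighbor(rank, size):
    """Rank of the process to the "left", wrapping around at rank 0."""
    # Python's % returns a non-negative result for a positive modulus,
    # so no extra handling of negative numbers is needed here.
    return (rank - 1) % size


def simulate_sillycopy(sources, n):
    """Return the dest buffer each rank should hold after the copy."""
    size = len(sources)
    return [list(sources[left_neighbor(rank, size)][:n]) for rank in range(size)]


# Two ranks: rank 0 holds zeros, rank 1 holds ones.
dests = simulate_sillycopy([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]], n=3)
print(dests)
```

With two ranks, each rank's destination should end up holding the other rank's source, which is the result the test programs below check for.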
Now here is the pycuda example. For reference, using int() on a_gpu and b_gpu returns the pointer to the underlying buffer's memory address on the device.
import sillymodule # sillycopy lives in here
import simplempi as mpi
import pycuda.driver as drv
import numpy as np
import atexit
import time
mpi.init()
drv.init()
# Make sure each process uses a different GPU
dev = drv.Device(mpi.rank())
ctx = dev.make_context()
atexit.register(ctx.pop)
shape = (2**26,)
# allocate host memory
a = np.ones(shape, np.float32)
b = np.zeros(shape, np.float32)
# allocate device memory
a_gpu = drv.mem_alloc(a.nbytes)
b_gpu = drv.mem_alloc(b.nbytes)
# copy host to device
drv.memcpy_htod(a_gpu, a)
drv.memcpy_htod(b_gpu, b)
# A few more host buffers
a_p = np.zeros(shape, np.float32)
b_p = np.zeros(shape, np.float32)
# Sanity check: this should fill a_p with 1's
drv.memcpy_dtoh(a_p, a_gpu)
# Verify that
print(a_p[0:10])
sillymodule.sillycopy(
    int(a_gpu),
    int(b_gpu),
    shape[0])
# After this, b_p should be all ones
drv.memcpy_dtoh(b_p, b_gpu)
print(b_p[0:10])
And now the theano version of the above code. Rather than using int() to get the buffers' address, the CudaNdarray way of accessing this is via the gpudata attribute.
import os
import simplempi as mpi
mpi.init()
# selects one GPU per process
os.environ['THEANO_FLAGS'] = "device=gpu{}".format(mpi.rank())
import theano.sandbox.cuda as cuda
import time
import numpy as np
import sillymodule
shape = (2 ** 24, )
# Allocate host data
a = np.ones(shape, np.float32)
b = np.zeros(shape, np.float32)
# Allocate device data
a_gpu = cuda.CudaNdarray.zeros(shape)
b_gpu = cuda.CudaNdarray.zeros(shape)
# Copy from host to device
a_gpu[:] = a[:]
b_gpu[:] = b[:]
# Should print 1's as a sanity check
print(np.asarray(a_gpu[0:10]))
sillymodule.sillycopy(
    a_gpu.gpudata,
    b_gpu.gpudata,
    shape[0])
# Should print 1's
print(np.asarray(b_gpu[0:10]))
Again, the pycuda code works perfectly and the theano version runs, but gives the wrong result. To be precise, at the end of the theano code, b_gpu is filled with garbage: neither 1's nor 0's, just random numbers as though it were copying from a wrong place in memory.
My original theory about why this fails had to do with CUDA contexts. I wondered whether theano was doing something with them that caused the CUDA calls made in sillycopy to run under a different CUDA context than the one used to create the GPU arrays. I don't think this is the case because:
I spent a lot of time digging deep into theano's code and saw no funny business being played with contexts.
I would expect such a problem to result in a bad crash rather than an incorrect result, and no crash happens.
A secondary thought is whether this has to do with the fact that theano spawns several threads, even when using a CUDA backend, which can be verified by running "ps huH p ". I don't know how threads might affect anything, but I have run out of obvious things to consider.
Any thoughts on this would be greatly appreciated!
For reference: the processes are launched in the normal OpenMPI way:
mpirun --np 2 python test_pycuda.py
