PyOpenCL enqueue_copy hanging when run on different devices

I'm having trouble getting a kernel to run on two different OpenCL platforms. The only difference between the platforms is that one is OpenCL 1.1 and the other is OpenCL 1.2, as shown below.
Code works on this device (OS X 10.8):
===============================================================
('Platform name:', 'Apple')
('Platform profile:', 'FULL_PROFILE')
('Platform vendor:', 'Apple')
('Platform version:', 'OpenCL 1.2 (Sep 20 2012 17:42:28)')
---------------------------------------------------------------
('Device name:', 'Intel(R) Core(TM) i5-3427U CPU @ 1.80GHz')
('Device type:', 'CPU')
('Device memory: ', 8192L, 'MB')
('Device max clock speed:', 1800, 'MHz')
('Device compute units:', 4)
Target device (Ubuntu 11.04):
===============================================================
('Platform name:', 'NVIDIA CUDA')
('Platform profile:', 'FULL_PROFILE')
('Platform vendor:', 'NVIDIA Corporation')
('Platform version:', 'OpenCL 1.1 CUDA 4.2.1')
---------------------------------------------------------------
('Device name:', 'Tesla M2050')
('Device type:', 'GPU')
('Device memory: ', 3071, 'MB')
('Device max clock speed:', 1147, 'MHz')
('Device compute units:', 14)
===============================================================
('Platform name:', 'NVIDIA CUDA')
('Platform profile:', 'FULL_PROFILE')
('Platform vendor:', 'NVIDIA Corporation')
('Platform version:', 'OpenCL 1.1 CUDA 4.2.1')
---------------------------------------------------------------
('Device name:', 'Tesla M2050')
('Device type:', 'GPU')
('Device memory: ', 3071, 'MB')
('Device max clock speed:', 1147, 'MHz')
('Device compute units:', 14)
I've traced what I believe to be the source of the hang to the following code:
# set up
host_array = numpy.array(arr)
device_buffer = pyopencl.Buffer(context, pyopencl.mem_flags.WRITE_ONLY, host_array.nbytes)
# run the kernel
program.run(queue, host_array.shape, None, device_buffer)
# copy the results back --- this call causes the code to hang ----
pyopencl.enqueue_copy(queue, host_array, device_buffer)
There are no code changes between the two devices and both devices are running PyOpenCL 2013.1. Am I missing something? Any suggestion is much appreciated.

Try adding a .wait() to the program.run call. This will show whether it's actually the kernel that's hanging rather than the copy.
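A minimal sketch of this check, assuming the names from the question (program.run, queue, host_array, device_buffer). The helper name `run_and_wait` is hypothetical. In PyOpenCL, calling a kernel returns an event, and waiting on that event blocks until the kernel has finished, so a hang inside this helper would point at the kernel rather than at the copy.

```python
def run_and_wait(program, queue, global_size, device_buffer):
    """Launch the kernel and block until it has finished.

    `program`, `queue` and `device_buffer` are the PyOpenCL objects from
    the question; `run_and_wait` itself is a hypothetical helper name.
    """
    # Calling a kernel enqueues it and returns an event object.
    event = program.run(queue, global_size, None, device_buffer)
    # Blocks here if the kernel itself never completes.
    event.wait()
    return event
```

With the question's variables, this would be called as `run_and_wait(program, queue, host_array.shape, device_buffer)` right before the `pyopencl.enqueue_copy` call.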

Turns out the problem was a threading issue. I was making my PyOpenCL calls from a second thread spawned with the threading module, while the context I was using had been created on the main thread. I believe this mismatch was what caused the hang.
To fix it, I made sure to create my context, queue, and program on the second thread instead of on the main thread.
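The fix described above can be sketched as follows. This is only an illustration under assumptions: `opencl_worker`, `run_on_second_thread`, and `make_objects` are hypothetical names, and the factory stands in for whatever creates the context, queue, and program. The point is that the factory is called inside the worker thread, so every OpenCL object is created on the thread that uses it.

```python
import threading

def opencl_worker(make_objects, work, results):
    """Thread target: create the OpenCL objects here, then use them.

    `make_objects` is a hypothetical factory. With PyOpenCL it might do
    something like:
        ctx = pyopencl.create_some_context()
        queue = pyopencl.CommandQueue(ctx)
        program = pyopencl.Program(ctx, kernel_source).build()
        return ctx, queue, program
    """
    objects = make_objects()          # runs on the worker thread
    results.append(work(*objects))    # kernel launch and copy go here

def run_on_second_thread(make_objects, work):
    """Run setup and work on one worker thread and return the work's result."""
    results = []
    thread = threading.Thread(target=opencl_worker,
                              args=(make_objects, work, results))
    thread.start()
    thread.join()
    return results[0]
```

Nothing OpenCL-related is created before `thread.start()`, which is the difference from the original setup where the context was created on the main thread.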


Why is my IR signal not powering on my TV

I am currently working on a personal project with pigpio and piscope on a Raspberry Pi 4.
I am trying to simulate my TV remote by sending IR signals through an IR LED connected to GPIO 23 and a GND pin (the setup is a simple IR LED with a 200 ohm resistor).
I searched the LIRC database for my TV remote's config file and did not find it, but I found another one (MKJ40653802-TV) which is said to also work for my TV, an LG 50PS3000:
https://www.remote-control-world.eu/lg-c-2_64/lg-mkj42519615-replacement-remote-control-p-4195
The config file:
begin remote
name MKJ40653802-TV
bits 16
flags SPACE_ENC|CONST_LENGTH
eps 30
aeps 100
header 9061 4473
one 591 1660
zero 591 521
ptrail 590
pre_data_bits 16
pre_data 0x20DF
gap 108029
toggle_bit_mask 0x0
begin codes
KEY_POWER 0x10EF # Was: power
After reading the LIRC documentation and explanations of how to construct an IR signal, I found a Python script that creates an IR waveform to be fired through an IR LED:
https://github.com/bschwind/ir-slinger/blob/master/pyslinger.py
I simply changed the NEC protocol parameters to the values present in the config file.
My power on/off hex value is 0x20DF23DC (pre-data + command), which I convert to 32-bit binary:
00100000110111110010001111011100
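The conversion from a hex value to the bit string can be checked in a few lines of Python. The helper name `hex_to_bits` is hypothetical; the values are the ones that appear in this question. The first is the value quoted above, and the second combines pre_data 0x20DF with KEY_POWER 0x10EF from the config file, which gives the string the script passes to send_code.

```python
def hex_to_bits(value, width=32):
    """Return `value` as a zero-padded binary string of `width` bits."""
    return format(value, "0{}b".format(width))

# Value quoted in the text above
print(hex_to_bits(0x20DF23DC))  # 00100000110111110010001111011100
# pre_data 0x20DF followed by KEY_POWER 0x10EF from the config file
print(hex_to_bits((0x20DF << 16) | 0x10EF))  # 00100000110111110001000011101111
```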
My code is below:
#!/usr/bin/env python3
# Python IR transmitter
# Requires pigpio library
# Supports NEC, RC-5 and raw IR.
# Danijel Tudek, Aug 2016
import subprocess
import ctypes
import time
# This is the struct required by pigpio library.
# We store the individual pulses and their duration here. (In an array of these structs.)
class Pulses_struct(ctypes.Structure):
    _fields_ = [("gpioOn", ctypes.c_uint32),
                ("gpioOff", ctypes.c_uint32),
                ("usDelay", ctypes.c_uint32)]
# Since both NEC and RC-5 protocols use the same method for generating waveform,
# it can be put in a separate class and called from both protocol's classes.
class Wave_generator():
    def __init__(self,protocol):
        self.protocol = protocol
        MAX_PULSES = 12000 # from pigpio.h
        Pulses_array = Pulses_struct * MAX_PULSES
        self.pulses = Pulses_array()
        self.pulse_count = 0
    def add_pulse(self, gpioOn, gpioOff, usDelay):
        self.pulses[self.pulse_count].gpioOn = gpioOn
        self.pulses[self.pulse_count].gpioOff = gpioOff
        self.pulses[self.pulse_count].usDelay = usDelay
        self.pulse_count += 1
    # Pull the specified output pin low
    def zero(self, duration):
        self.add_pulse(0, 1 << self.protocol.master.gpio_pin, duration)
    # Protocol-agnostic square wave generator
    def one(self, duration):
        period_time = 1000000.0 / self.protocol.frequency
        on_duration = int(round(period_time * self.protocol.duty_cycle))
        off_duration = int(round(period_time * (1.0 - self.protocol.duty_cycle)))
        total_periods = int(round(duration/period_time))
        total_pulses = total_periods * 2
        # Generate square wave on the specified output pin
        for i in range(total_pulses):
            if i % 2 == 0:
                self.add_pulse(1 << self.protocol.master.gpio_pin, 0, on_duration)
            else:
                self.add_pulse(0, 1 << self.protocol.master.gpio_pin, off_duration)
# NEC protocol class
class NEC():
    def __init__(self,
                 master,
                 frequency=38000,
                 duty_cycle=0.5,
                 leading_pulse_duration=9061,
                 leading_gap_duration=4473,
                 one_pulse_duration = 591,
                 one_gap_duration = 1660,
                 zero_pulse_duration = 591,
                 zero_gap_duration = 521,
                 trailing_pulse = [1, 590]):
        self.master = master
        self.wave_generator = Wave_generator(self)
        self.frequency = frequency # in Hz, 38000 per specification
        self.duty_cycle = duty_cycle # duty cycle of high state pulse
        # Durations of high pulse and low "gap".
        # The NEC protocol defines pulse and gap lengths, but we can never expect
        # that any given TV will follow the protocol specification.
        self.leading_pulse_duration = leading_pulse_duration # in microseconds, 9000 per specification
        self.leading_gap_duration = leading_gap_duration # in microseconds, 4500 per specification
        self.one_pulse_duration = one_pulse_duration # in microseconds, 562 per specification
        self.one_gap_duration = one_gap_duration # in microseconds, 1686 per specification
        self.zero_pulse_duration = zero_pulse_duration # in microseconds, 562 per specification
        self.zero_gap_duration = zero_gap_duration # in microseconds, 562 per specification
        self.trailing_pulse = trailing_pulse # trailing 562 microseconds pulse, some remotes send it, some don't
        print("NEC protocol initialized")
    # Send AGC burst before transmission
    def send_agc(self):
        print("Sending AGC burst")
        self.wave_generator.one(self.leading_pulse_duration)
        self.wave_generator.zero(self.leading_gap_duration)
    # Trailing pulse is just a burst with the duration of standard pulse.
    def send_trailing_pulse(self):
        print("Sending trailing pulse")
        self.wave_generator.one(self.trailing_pulse[1])
    # This function is processing IR code. Leaves room for possible manipulation
    # of the code before processing it.
    def process_code(self, ircode):
        if (self.leading_pulse_duration > 0) or (self.leading_gap_duration > 0):
            self.send_agc()
        for i in ircode:
            if i == "0":
                self.zero()
            elif i == "1":
                self.one()
            else:
                print("ERROR! Non-binary digit!")
                return 1
        if self.trailing_pulse[0] == 1:
            self.send_trailing_pulse()
        return 0
    # Generate zero or one in NEC protocol
    # Zero is represented by a pulse and a gap of the same length
    def zero(self):
        self.wave_generator.one(self.zero_pulse_duration)
        self.wave_generator.zero(self.zero_gap_duration)
    # One is represented by a pulse and a gap three times longer than the pulse
    def one(self):
        self.wave_generator.one(self.one_pulse_duration)
        self.wave_generator.zero(self.one_gap_duration)
# RC-5 protocol class
# Note: start bits are not implemented here due to inconsistency between manufacturers.
# Simply provide them with the rest of the IR code.
class RC5():
    def __init__(self,
                 master,
                 frequency=36000,
                 duty_cycle=0.33,
                 one_duration=889,
                 zero_duration=889):
        self.master = master
        self.wave_generator = Wave_generator(self)
        self.frequency = frequency # in Hz, 36000 per specification
        self.duty_cycle = duty_cycle # duty cycle of high state pulse
        # Durations of high pulse and low "gap".
        # Technically, they both should be the same in the RC-5 protocol, but we can never expect
        # that any given TV will follow the protocol specification.
        self.one_duration = one_duration # in microseconds, 889 per specification
        self.zero_duration = zero_duration # in microseconds, 889 per specification
        print("RC-5 protocol initialized")
    # This function is processing IR code. Leaves room for possible manipulation
    # of the code before processing it.
    def process_code(self, ircode):
        for i in ircode:
            if i == "0":
                self.zero()
            elif i == "1":
                self.one()
            else:
                print("ERROR! Non-binary digit!")
                return 1
        return 0
    # Generate zero or one in RC-5 protocol
    # Zero is represented by pulse-then-low signal
    def zero(self):
        self.wave_generator.one(self.zero_duration)
        self.wave_generator.zero(self.zero_duration)
    # One is represented by low-then-pulse signal
    def one(self):
        self.wave_generator.zero(self.one_duration)
        self.wave_generator.one(self.one_duration)
# RAW IR ones and zeroes. Specify length for one and zero and simply bitbang the GPIO.
# The default values are valid for one tested remote which didn't fit in NEC or RC-5 specifications.
# It can also be used in case you don't want to bother with deciphering raw bytes from IR receiver:
# i.e. instead of trying to figure out the protocol, simply define bit lengths and send them all here.
class RAW():
    def __init__(self,
                 master,
                 frequency=36000,
                 duty_cycle=0.33,
                 one_duration=520,
                 zero_duration=520):
        self.master = master
        self.wave_generator = Wave_generator(self)
        self.frequency = frequency # in Hz
        self.duty_cycle = duty_cycle # duty cycle of high state pulse
        self.one_duration = one_duration # in microseconds
        self.zero_duration = zero_duration # in microseconds
    def process_code(self, ircode):
        for i in ircode:
            if i == "0":
                self.zero()
            elif i == "1":
                self.one()
            else:
                print("ERROR! Non-binary digit!")
                return 1
        return 0
    # Generate raw zero or one.
    # Zero is represented by low (no signal) for a specified duration.
    def zero(self):
        self.wave_generator.zero(self.zero_duration)
    # One is represented by pulse for a specified duration.
    def one(self):
        self.wave_generator.one(self.one_duration)
class IR():
    def __init__(self, gpio_pin, protocol, protocol_config):
        print("Starting IR")
        print("Loading libpigpio.so")
        self.pigpio = ctypes.CDLL('libpigpio.so')
        print("Initializing pigpio")
        PI_OUTPUT = 1 # from pigpio.h
        self.pigpio.gpioInitialise()
        subprocess.Popen('piscope', shell=True)
        time.sleep(1)
        self.gpio_pin = gpio_pin
        print("Configuring pin %d as output" % self.gpio_pin)
        self.pigpio.gpioSetMode(self.gpio_pin, PI_OUTPUT) # pin 17 is used in LIRC by default
        print("Initializing protocol")
        if protocol == "NEC":
            self.protocol = NEC(self, **protocol_config)
        elif protocol == "RC-5":
            self.protocol = RC5(self, **protocol_config)
        elif protocol == "RAW":
            self.protocol = RAW(self, **protocol_config)
        else:
            print("Protocol not specified! Exiting...")
            return 1
        print("IR ready")
    # send_code takes care of sending the processed IR code to pigpio.
    # IR code itself is processed and converted to pigpio structs by protocol's classes.
    def send_code(self, ircode):
        print("Processing IR code: %s" % ircode)
        code = self.protocol.process_code(ircode)
        if code != 0:
            print("Error in processing IR code!")
            return 1
        clear = self.pigpio.gpioWaveClear()
        print(clear)
        if clear != 0:
            print("Error in clearing wave!")
            return 1
        pulses = self.pigpio.gpioWaveAddGeneric(self.protocol.wave_generator.pulse_count, self.protocol.wave_generator.pulses)
        if pulses < 0:
            print("Error in adding wave!")
            return 1
        wave_id = self.pigpio.gpioWaveCreate()
        # Unlike the C implementation, in Python the wave_id seems to always be 0.
        if wave_id >= 0:
            print("Sending wave...")
            result = self.pigpio.gpioWaveTxSend(wave_id, 0)
            if result >= 0:
                print("Success! (result: %d)" % result)
            else:
                print("Error! (result: %d)" % result)
                return 1
        else:
            print("Error creating wave: %d" % wave_id)
            return 1
        while self.pigpio.gpioWaveTxBusy():
            time.sleep(0.1)
        print("Deleting wave")
        self.pigpio.gpioWaveDelete(wave_id)
        print("Terminating pigpio")
        self.pigpio.gpioTerminate()
# Simply define the GPIO pin, protocol (NEC, RC-5 or RAW) and
# override the protocol defaults with the dictionary if required.
# Provide the IR code to the send_code() method.
# An example is given below.
if __name__ == "__main__":
    protocol = "NEC"
    gpio_pin = 23
    protocol_config = dict(one_pulse_duration = 591,
                           zero_pulse_duration = 591)
    ir = IR(gpio_pin, protocol, protocol_config)
    ir.send_code("00100000110111110001000011101111")
    print("Exiting IR")
When I launch the script it runs: I can see the IR LED blinking through my phone camera, and I can also see the generated waveform in piscope.
Everything looks correct to me, but I don't know why it's not powering on my TV.
Could you please help me with this problem? I don't know if I missed something or if I am using the wrong TV code.
Thanks a lot!
I tried other remote codes, and I tried the toggle bit mask on the first bit (toggle_bit_mask = 0x0).
I also tried other codes (on and off) from this page:
https://gist.github.com/francis2110/8f69843dd57ae07dce80
with no success.
It's working.
I just had to get close to the TV (less than 1 meter away).
So I am reworking my LED setup to add a transistor.
From what I have seen online, it should then work from longer distances.

Identify cause of high GHC memory consumption without profile build

I heard GHC is slow in terms of LOC per second. So I created this Haskell program:
module Main where
x1 = 1
x2 = 2
x3 = 3
...
x999998 = 999998
x999999 = 999999
x1000000 = 1000000
main = putStrLn "1M LOC!"
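A file like this can be generated with a short script. The following is a sketch in Python and not necessarily how the file was actually produced (the question does not say); `haskell_source` and `write_module` are hypothetical names.

```python
def haskell_source(n):
    """Return a Haskell module with n top-level bindings x1..xn."""
    lines = ["module Main where"]
    lines.extend("x{0} = {0}".format(i) for i in range(1, n + 1))
    lines.append('main = putStrLn "1M LOC!"')
    return "\n".join(lines) + "\n"

def write_module(path, n):
    """Write the generated module to `path`."""
    with open(path, "w") as f:
        f.write(haskell_source(n))
```

Calling `write_module("1Mloc.hs", 1000000)` would produce the full-size file, and smaller values of `n` give the 10K and 1K variants.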
And I can't even compile it! At least I can see the parser's throughput: assuming the reported time is in milliseconds, 1,000,000 lines in 23,683 ms is about 42 lines per millisecond:
[1 of 1] Compiling Main ( 1Mloc.hs, 1Mloc.o )
*** Parser [Main]:
Parser [Main]: alloc=22735369056 time=23683.420
As far as I know, GHC RTS must be recompiled with profiling enabled to start digging into the cause. Given I don't have profiled GHC, is there any chance to figure out what is causing this? I can't even collect statistics because it gets killed...
Killed process 16609 (ghc) total-vm:1074093288kB, anon-rss:6804448kB ...
Actually, I can't compile 10K LOC either. Going down to 1K LOC, at least I can see horrible productivity numbers. I realize this is a synthetic program, but what could be so bad about it?
Linking 1Kloc ...
   1,383,344,416 bytes allocated in the heap
     325,164,408 bytes copied during GC
      60,849,840 bytes maximum residency (9 sample(s))
         282,960 bytes maximum slop
              58 MB total memory in use (0 MB lost due to fragmentation)
                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       233 colls,     0 par    0.227s   0.230s     0.0010s    0.0066s
  Gen  1         9 colls,     0 par    0.149s   0.174s     0.0193s    0.0588s
  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
  SPARKS: 0(0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.522s  (  1.047s elapsed)
  GC      time    0.376s  (  0.404s elapsed)
  EXIT    time    0.000s  (  0.008s elapsed)
  Total   time    0.899s  (  1.460s elapsed)
  Alloc rate    2,647,974,824 bytes per MUT second
  Productivity  58.1% of total user, 71.7% of total elapsed

OpenCL only on AMD: CL_INVALID_ARG_SIZE

I have a kernel that runs on all platforms that I have access to except the AMD APP SDK 3.0 with an Intel CPU.
The platform is: OpenCL.Device(Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz on AMD Accelerated Parallel Processing
The MWE (sorry it's in Julia, but calls should be almost the same as in C):
using OpenCL
test_source = "
struct __attribute__((packed)) Test{
float3 f1;
int f2;
float f3;
};
__kernel void structest(struct Test a){}
"
device = first(cl.devices())
ctx = cl.Context(device)
prg = cl.Program(ctx, source = test_source)
queue = cl.CmdQueue(ctx)
cl.build!(prg)
structkernel = cl.Kernel(prg, "structest")
astruct = ((1f0, 2f0, 3f0, 0f0), Int32(0), 22f0)
sizeof(astruct)
# == 24 exactly the same as what sizeof(struct Test a) in the kernel returns
astruct_boxed = Ref(astruct)
cl.@check cl.api.clSetKernelArg(structkernel.id, cl.cl_uint(0), sizeof(astruct), astruct_boxed)
So I have confirmed that sizeof(astruct) and the size in the kernel match, but I still get a CL_INVALID_ARG_SIZE error. Is this a bug or am I missing something?
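The 24-byte figure can be reproduced on the host side with Python's ctypes. This sketch relies on the OpenCL rule that a float3 occupies the same 16 bytes as a float4; the class name `Test` mirrors the kernel struct, and the fourth float is padding, as in the Julia tuple above. It only checks the size arithmetic and does not explain the CL_INVALID_ARG_SIZE error.

```python
import ctypes

class Test(ctypes.Structure):
    # All fields are 4-byte aligned, so no padding is inserted between
    # them here, which matches the packed layout of the kernel struct.
    _fields_ = [
        ("f1", ctypes.c_float * 4),  # float3 stored as 4 floats (16 bytes)
        ("f2", ctypes.c_int32),      # int (4 bytes)
        ("f3", ctypes.c_float),      # float (4 bytes)
    ]

print(ctypes.sizeof(Test))  # 24
```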

using zip and drop in Julia

This code doesn't work for some reason:
collect(zip(drop([1,2,3], 1), drop([1,2,3], 1)))
I'm trying to drop the first element of a collection and zip up two copies of the result.
This code runs perfectly fine for me. Please check your version using versioninfo()
julia> collect(zip(drop([1,2,3], 1), drop([1,2,3], 1)))
2-element Array{Tuple{Int64,Int64},1}:
(2,2)
(3,3)
julia> versioninfo()
Julia Version 0.5.1
Commit 6445c82 (2017-03-05 13:25 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin13.4.0)
CPU: Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, ivybridge)
julia>

Keras not using multiple cores

Based on the famous check_blas.py script, I wrote this one to check that Theano can in fact use multiple cores:
import os
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['GOTO_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['THEANO_FLAGS'] = 'device=cpu,blas.ldflags=-lblas -lgfortran'
import numpy
import theano
import theano.tensor as T
M=2000
N=2000
K=2000
iters=100
order='C'
a = theano.shared(numpy.ones((M, N), dtype=theano.config.floatX, order=order))
b = theano.shared(numpy.ones((N, K), dtype=theano.config.floatX, order=order))
c = theano.shared(numpy.ones((M, K), dtype=theano.config.floatX, order=order))
f = theano.function([], updates=[(c, 0.4 * c + .8 * T.dot(a, b))])
for i in range(iters):
    f()
Running this as python3 check_theano.py shows that 8 threads are being used. More importantly, the code runs approximately 9 times faster than without the os.environ settings, when only 1 core is used: 7.863s vs 71.292s on a single run.
So, I would expect that Keras now also uses multiple cores when calling fit (or predict for that matter). However this is not the case for the following code:
import os
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['GOTO_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['THEANO_FLAGS'] = 'device=cpu,blas.ldflags=-lblas -lgfortran'
import numpy
from keras.models import Sequential
from keras.layers import Dense
coeffs = numpy.random.randn(100)
x = numpy.random.randn(100000, 100);
y = numpy.dot(x, coeffs) + numpy.random.randn(100000) * 0.01
model = Sequential()
model.add(Dense(20, input_shape=(100,)))
model.add(Dense(1, input_shape=(20,)))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit(x, y, verbose=0, nb_epoch=10)
This script uses only 1 core with this output:
Using Theano backend.
/home/herbert/venv3/lib/python3.4/site-packages/theano/tensor/signal/downsample.py:5: UserWarning: downsample module has been moved to the pool module.
warnings.warn("downsample module has been moved to the pool module.")
Why does the fit of Keras only use 1 core for the same setup? Is the check_blas.py script actually representative for neural network training calculations?
FYI:
(venv3)herbert@machine:~/ $ python3 -c 'import numpy, theano, keras; print(numpy.__version__); print(theano.__version__); print(keras.__version__);'
ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your nvcc installation and try again.
1.11.0
0.8.0rc1.dev-e6e88ce21df4fbb21c76e68da342e276548d4afd
0.3.2
(venv3)herbert@machine:~/ $
EDIT
I created a Theano implementation of a simple MLP as well, which also does not run multi-core:
import os
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['GOTO_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['THEANO_FLAGS'] = 'device=cpu,blas.ldflags=-lblas -lgfortran'
import numpy
import theano
import theano.tensor as T
M=2000
N=2000
K=2000
iters=100
order='C'
coeffs = numpy.random.randn(100)
x = numpy.random.randn(100000, 100).astype(theano.config.floatX)
y = (numpy.dot(x, coeffs) + numpy.random.randn(100000) * 0.01).astype(theano.config.floatX).reshape(100000, 1)
x_shared = theano.shared(x)
y_shared = theano.shared(y)
x_tensor = T.matrix('x')
y_tensor = T.matrix('y')
W0_values = numpy.asarray(
numpy.random.uniform(
low=-numpy.sqrt(6. / 120),
high=numpy.sqrt(6. / 120),
size=(100, 20)
),
dtype=theano.config.floatX
)
W0 = theano.shared(value=W0_values, name='W0', borrow=True)
b0_values = numpy.zeros((20,), dtype=theano.config.floatX)
b0 = theano.shared(value=b0_values, name='b0', borrow=True)
output0 = T.dot(x_tensor, W0) + b0
W1_values = numpy.asarray(
numpy.random.uniform(
low=-numpy.sqrt(6. / 120),
high=numpy.sqrt(6. / 120),
size=(20, 1)
),
dtype=theano.config.floatX
)
W1 = theano.shared(value=W1_values, name='W1', borrow=True)
b1_values = numpy.zeros((1,), dtype=theano.config.floatX)
b1 = theano.shared(value=b1_values, name='b1', borrow=True)
output1 = T.dot(output0, W1) + b1
params = [W0, b0, W1, b1]
cost = ((output1 - y_tensor) ** 2).sum()
gradients = [T.grad(cost, param) for param in params]
learning_rate = 0.0000001
updates = [
(param, param - learning_rate * gradient)
for param, gradient in zip(params, gradients)
]
train_model = theano.function(
inputs=[],#x_tensor, y_tensor],
outputs=cost,
updates=updates,
givens={
x_tensor: x_shared,
y_tensor: y_shared
}
)
errors = []
for i in range(1000):
    errors.append(train_model())
print(errors[0:50:])
Keras and TF on their own don't use all the cores and capacity of your CPU. If you want to use 100% of your CPU, multiprocessing.Pool lets you create a pool of jobs that need doing. The worker processes pick up these jobs and run them. When a process finishes a job, it picks up another one from the pool.
NB: If you want to just speed up this model, look into GPUs or changing the hyperparameters like batch size and number of neurons (layer size).
Here's how you can use multiprocessing to train multiple models at the same time (using processes running in parallel on each separate CPU core of your machine).
This answer was inspired by @repploved
import time
import signal
import multiprocessing
def init_worker():
    ''' Ignore SIGINT in the workers so Ctrl-C is handled by the parent process '''
    signal.signal(signal.SIGINT, signal.SIG_IGN)
def train_model(layer_size):
    '''
    This code is parallelized and runs on each process
    It trains a model with different layer sizes (hyperparameters)
    It saves the model and returns the score (error)
    '''
    import keras
    from keras.models import Sequential
    from keras.layers import Dense
    print(f'Training a model with layer size {layer_size}')
    # build your model here
    model_RNN = Sequential()
    model_RNN.add(Dense(layer_size))
    # fit the model (the bit that takes time!)
    model_RNN.fit(...)
    # lets demonstrate with a sleep timer
    time.sleep(5)
    # save trained model to a file
    model_RNN.save(...)
    # you can also return values eg. the eval score
    return model_RNN.evaluate(...)
# the guard keeps worker processes from re-running the pool setup on import
if __name__ == '__main__':
    num_workers = 4
    hyperparams = [800, 960, 1100]
    pool = multiprocessing.Pool(num_workers, init_worker)
    scores = pool.map(train_model, hyperparams)
    print(scores)
Output:
Training a model with layer size 800
Training a model with layer size 960
Training a model with layer size 1100
[{'size':960,'score':1.0}, {'size':800,'score':1.2}, {'size':1100,'score':0.7}]
This is easily demonstrated with a time.sleep in the code. You'll see that all 3 processes start the training job, and then they all finish at about the same time. If this were run in a single process, you'd have to wait for each to finish before starting the next (yawn!).
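A self-contained variant of the same pattern that runs without Keras, as a sketch: `fake_train` and `train_all` are hypothetical names, `fake_train` sleeps instead of fitting a model, and the scores are placeholder values derived from the input, not real results.

```python
import multiprocessing
import time

def fake_train(layer_size):
    """Stand-in for a training job; sleeps instead of fitting a model."""
    time.sleep(0.1)
    # Placeholder "score" derived from the input, not a real result.
    return {"size": layer_size, "score": layer_size / 1000.0}

def train_all(layer_sizes, num_workers=4):
    """Run one job per layer size in a pool of worker processes."""
    with multiprocessing.Pool(num_workers) as pool:
        # map() returns results in the same order as the inputs.
        return pool.map(fake_train, layer_sizes)

if __name__ == "__main__":
    print(train_all([800, 960, 1100]))
```

Because `pool.map` preserves input order, the returned list lines up with the list of layer sizes even when the jobs finish in a different order.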
