LLVM error out of memory when running tensorflow code - out-of-memory

I want to implement an attention mechanism to perform a speech recognition task using PyCharm on Ubuntu 16.04. My machine has 16 GB RAM and two 1070Ti GPUs.
Unfortunately, the following code always outputs "LLVM error:out of memory":
def attention(self, x_i, x, index):
"""
Attention model for speech recognition
:param x_i: the embedded input at time i
:param x: the embedded input of all times(x_j of attentions)
:param index: step of time
"""
e_i = []
c_i = []
for output in x:
output = tf.reshape(output, [-1, self.embedding_size])
atten_hidden = tf.tanh(tf.add(tf.matmul(x_i, self.attention_W), tf.matmul(output, self.attention_U)))
e_i_j = tf.matmul(atten_hidden, self.attention_V)
e_i.append(e_i_j)
e_i = tf.concat(e_i, axis=1)
# e_i = tf.exp(e_i)
alpha_i = tf.nn.softmax(e_i)
alpha_i = tf.split(alpha_i, self.sequence_length, 1)
# i!=j
for j, (alpha_i_j, output) in enumerate(zip(alpha_i, x)):
if j == index:
continue
else:
output = tf.reshape(output, [-1, self.embedding_size])
c_i_j = tf.multiply(alpha_i_j, output)
c_i.append(c_i_j)
c_i = tf.reshape(tf.concat(c_i, axis=1), [-1, self.sequence_length-1, self.embedding_size])
c_i = tf.reduce_sum(c_i, 1)
return c_i

you may be need to add more RAM to the machine
you could try an official attention mechanism

Related

How to save GPU memory usage in PyTorch

In PyTorch I wrote a very simple CNN discriminator and trained it. Now I need to deploy it to make predictions. But the target machine has a small GPU memory and got out of memory error. So I think that I can set requires_grad = False to prevent PyTorch from storing the gradient values. However I didn't find it making any difference.
There are about 5 millions of parameters in my model. But when predicting a single batch of input, it consumes about 1.2GB of memory. I think there should be no need for such large memory.
The question is how to save GPU memory usage when I just want to use my model to make predictions?
Here is a demo, I use discriminator.requires_grad_ to disable/enable autograd of all parameters. But it seems to be no use.
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as functional
from pynvml.smi import nvidia_smi
nvsmi = nvidia_smi.getInstance()
def getMemoryUsage():
usage = nvsmi.DeviceQuery("memory.used")["gpu"][0]["fb_memory_usage"]
return "%d %s" % (usage["used"], usage["unit"])
print("Before GPU Memory: %s" % getMemoryUsage())
class Discriminator(nn.Module):
def __init__(self):
super().__init__()
# trainable layers
# input: 2x256x256
self.conv1 = nn.Conv2d(2, 8, 5, padding=2) # 8x256x256
self.pool1 = nn.MaxPool2d(2) # 8x128x128
self.conv2 = nn.Conv2d(8, 32, 5, padding=2) # 32x128x128
self.pool2 = nn.MaxPool2d(2) # 32x64x64
self.conv3 = nn.Conv2d(32, 96, 5, padding=2) # 96x64x64
self.pool3 = nn.MaxPool2d(4) # 96x16x16
self.conv4 = nn.Conv2d(96, 256, 5, padding=2) # 256x16x16
self.pool4 = nn.MaxPool2d(4) # 256x4x4
self.num_flat_features = 4096
self.fc1 = nn.Linear(4096, 1024)
self.fc2 = nn.Linear(1024, 256)
self.fc3 = nn.Linear(256, 1)
# loss function
self.loss = nn.MSELoss()
# other properties
self.requires_grad = True
def forward(self, x):
y = x
y = self.conv1(y)
y = self.pool1(y)
y = functional.relu(y)
y = self.conv2(y)
y = self.pool2(y)
y = functional.relu(y)
y = self.conv3(y)
y = self.pool3(y)
y = functional.relu(y)
y = self.conv4(y)
y = self.pool4(y)
y = functional.relu(y)
y = y.view((-1,self.num_flat_features))
y = self.fc1(y)
y = functional.relu(y)
y = self.fc2(y)
y = functional.relu(y)
y = self.fc3(y)
y = torch.sigmoid(y)
return y
def predict(self, x, score_th=0.5):
if len(x.shape) == 3:
singlebatch = True
x = x.view([1]+list(x.shape))
else:
singlebatch = False
y = self.forward(x)
label = (y > float(score_th))
if singlebatch:
y = y.view(list(y.shape)[1:])
return label, y
def requires_grad_(self, requires_grad=True):
for parameter in self.parameters():
parameter.requires_grad_(requires_grad)
self.requires_grad = requires_grad
x = torch.cuda.FloatTensor(np.zeros([2, 256, 256]))
discriminator = Discriminator()
discriminator.to("cuda:0")
# comment/uncomment this line to make difference
discriminator.requires_grad_(False)
discriminator.predict(x)
print("Requires grad", discriminator.requires_grad)
print("After GPU Memory: %s" % getMemoryUsage())
By comment out the line discriminator.requires_grad_(False), I got output:
Before GPU Memory: 6350MiB
Requires grad True
After GPU Memory: 7547MiB
While by uncomment the line, I got:
Before GPU Memory: 6350MiB
Requires grad False
After GPU Memory: 7543MiB
You can use pynvml.
This python tool made Nvidia so you can Python query like this:
from pynvml.smi import nvidia_smi
nvsmi = nvidia_smi.getInstance()
nvsmi.DeviceQuery('memory.free, memory.total')
You can always also execute:
torch.cuda.empty_cache()
To empty the cache and you will find even more free memory that way.
Before calling torch.cuda.empty_cache() if you have objects you don't use anymore you can call this:
obj = None
And after that you call
gc.collect()
Try to use model.eval() with torch.no_grad() on your target machine when making predictions. model.eval() will switch model layers to eval mode. torch.no_grad() will deactivate autograd engine and as a result memory usage will be reduced.
x = torch.cuda.FloatTensor(np.zeros([2, 256, 256]))
discriminator = Discriminator()
discriminator.to("cuda:0")
discriminator.eval()
with torch.no_grad():
discriminator.predict(x)
I guess it's not relevant anymore for your specific problem, but you could take a look at Torchscript
It's a good way to decrease the size and complexity of your model. It also speeds up the prediction. Unfortunately it cant help with the training itself. It is just in general a good idea for deployment of pytorch models used in other hardware or embedded in c++ code for efficiency.
Cheers. :-)

Error when implementing RBF kernel bandwidth differentiation in Pytorch

I'm implementing an RBF network by using some beginer examples from Pytorch Website. I have a problem when implementing the kernel bandwidth differentiation for the network. Also, Iwould like to know whether my attempt ti implement the idea is fine. This is a code sample to reproduce the issue. Thanks
# -*- coding: utf-8 -*-
import torch
from torch.autograd import Variable
def kernel_product(x,y, mode = "gaussian", s = 1.):
x_i = x.unsqueeze(1)
y_j = y.unsqueeze(0)
xmy = ((x_i-y_j)**2).sum(2)
if mode == "gaussian" : K = torch.exp( - xmy/s**2) )
elif mode == "laplace" : K = torch.exp( - torch.sqrt(xmy + (s**2)))
elif mode == "energy" : K = torch.pow( xmy + (s**2), -.25 )
return torch.t(K)
class MyReLU(torch.autograd.Function):
"""
We can implement our own custom autograd Functions by subclassing
torch.autograd.Function and implementing the forward and backward passes
which operate on Tensors.
"""
#staticmethod
def forward(ctx, input):
"""
In the forward pass we receive a Tensor containing the input and return
a Tensor containing the output. ctx is a context object that can be used
to stash information for backward computation. You can cache arbitrary
objects for use in the backward pass using the ctx.save_for_backward method.
"""
ctx.save_for_backward(input)
return input.clamp(min=0)
#staticmethod
def backward(ctx, grad_output):
"""
In the backward pass we receive a Tensor containing the gradient of the loss
with respect to the output, and we need to compute the gradient of the loss
with respect to the input.
"""
input, = ctx.saved_tensors
grad_input = grad_output.clone()
grad_input[input < 0] = 0
return grad_input
dtype = torch.cuda.FloatTensor
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold input and outputs, and wrap them in Variables.
x = Variable(torch.randn(N, D_in).type(dtype), requires_grad=False)
y = Variable(torch.randn(N, D_out).type(dtype), requires_grad=False)
# Create random Tensors for weights, and wrap them in Variables.
w1 = Variable(torch.randn(H, D_in).type(dtype), requires_grad=True)
w2 = Variable(torch.randn(H, D_out).type(dtype), requires_grad=True)
# I've created this scalar variable (the kernel bandwidth)
s = Variable(torch.randn(1).type(dtype), requires_grad=True)
learning_rate = 1e-6
for t in range(500):
# To apply our Function, we use Function.apply method. We alias this as 'relu'.
relu = MyReLU.apply
# Forward pass: compute predicted y using operations on Variables; we compute
# ReLU using our custom autograd operation.
# y_pred = relu(x.mm(w1)).mm(w2)
y_pred = relu(kernel_product(w1, x, s)).mm(w2)
# Compute and print loss
loss = (y_pred - y).pow(2).sum()
print(t, loss.data[0])
# Use autograd to compute the backward pass.
loss.backward()
# Update weights using gradient descent
w1.data -= learning_rate * w1.grad.data
w2.data -= learning_rate * w2.grad.data
# Manually zero the gradients after updating weights
w1.grad.data.zero_()
w2.grad.data.zero_()
However I get this error, which dissapears when I simply use a fixed scalar in the default input parameter of kernel_product():
RuntimeError: eq() received an invalid combination of arguments - got (str), but expected one of:
* (float other)
didn't match because some of the arguments have invalid types: (str)
* (Variable other)
didn't match because some of the arguments have invalid types: (str)
Well, you are calling kernel_product(w1, x, s) where w1, x and s are torch Variable while the definition of the function is: kernel_product(x,y, mode = "gaussian", s = 1.). Seems like s should be a string specifying the mode.

Using bias in PyTorch for basic function approximation

Using R, it is very easy to approximate basic functions through a neural network:
library(nnet)
x <- sort(10*runif(50))
y <- sin(x)
nn <- nnet(x, y, size=4, maxit=10000, linout=TRUE, abstol=1.0e-8, reltol = 1.0e-9, Wts = seq(0, 1, by=1/12) )
plot(x, y)
x1 <- seq(0, 10, by=0.1)
lines(x1, predict(nn, data.frame(x=x1)), col="green")
predict( nn , data.frame(x=pi/2) )
A simple neural network with one hidden layer of a mere 4 neurons is sufficient to approximate a sine. (As per stackoverflow question Approximating function with Neural Network.)
But I cannot obtain the same in PyTorch.
In fact, the neural network created by R contains not only an input, four hidden and an output, but also two "bias" neurons - the first connected towards the hidden layer, the second towards the output.
The plot above is obtained through the following:
library(devtools)
library(scales)
library(reshape)
source_url('https://gist.github.com/fawda123/7471137/raw/cd6e6a0b0bdb4e065c597e52165e5ac887f5fe95/nnet_plot_update.r')
plot.nnet(nn$wts,struct=nn$n, pos.col='#007700',neg.col='#FF7777') ### this plots the graph
plot.nnet(nn$wts,struct=nn$n, pos.col='#007700',neg.col='#FF7777', wts.only=1) ### this prints the weights
Attempting the same with PyTorch produces a different network: the bias neurons are missing.
Following is an attempt to do in PyTorch what was done previously in R. The results will not be satisfactory: the function is not approximated. The most evident difference is that absence of the bias neurons.
import torch
from torch.autograd import Variable
import random
import math
N, D_in, H, D_out = 1000, 1, 4, 1
l_x = []
l_y = []
for a in range(1000):
r = random.random()*10
l_x.append( [r] )
l_y.append( [math.sin(r)] )
tx = torch.cuda.FloatTensor(l_x)
ty = torch.cuda.FloatTensor(l_y)
x = Variable(tx, requires_grad=False)
y = Variable(ty, requires_grad=False)
w1 = Variable(torch.randn(D_in, H ).type(torch.cuda.FloatTensor), requires_grad=True)
w2 = Variable(torch.randn(H, D_out).type(torch.cuda.FloatTensor), requires_grad=True)
learning_rate = 1e-5
for t in range(1000):
y_pred = x.mm(w1).clamp(min=0).mm(w2)
loss = (y_pred - y).pow(2).sum()
if t<10 or t%100==1: print(t, loss.data[0])
loss.backward()
w1.data -= learning_rate * w1.grad.data
w2.data -= learning_rate * w2.grad.data
w1.grad.data.zero_()
w2.grad.data.zero_()
t = [ [math.pi] ]
print( str(t) +" -> "+ str( (Variable(torch.cuda.FloatTensor( t ))).mm(w1).clamp(min=0).mm(w2).data ) )
t = [ [math.pi/2] ]
print( str(t) +" -> "+ str( (Variable(torch.cuda.FloatTensor( t ))).mm(w1).clamp(min=0).mm(w2).data ) )
How to make the network approximate to the given function (sine in this case), through either inserting the "bias" neurons or other missing detail?
Moreover: I have difficulties in understanding why R inserts the "bias". I found information that the bias could be akin to the "Intercept in a Regression Model" - I still find it not clear. Any information would be appreciated.
EDIT: an excellent explanation turned out to be at stackoverflow question Role of Bias in Neural Networks
EDIT:
An example to obtain the result, though using the "fuller" framework ("not reinventing the wheel") is as follows:
import torch
from torch.autograd import Variable
import torch.nn.functional as F
import math
N, D_in, H, D_out = 1000, 1, 4, 1
l_x = []
l_y = []
for a in range(1000):
t = (a/1000.0)*10
l_x.append( [t] )
l_y.append( [math.sin(t)] )
x = Variable( torch.FloatTensor(l_x) )
y = Variable( torch.FloatTensor(l_y) )
class Net(torch.nn.Module):
def __init__(self, n_feature, n_hidden, n_output):
super(Net, self).__init__()
self.to_hidden = torch.nn.Linear(n_feature, n_hidden)
self.to_output = torch.nn.Linear(n_hidden, n_output)
def forward(self, x):
x = self.to_hidden(x)
x = F.tanh(x) # activation function
x = self.to_output(x)
return x
net = Net(n_feature = D_in, n_hidden = H, n_output = D_out)
learning_rate = 0.01
optimizer = torch.optim.Adam( net.parameters() , lr=learning_rate )
for t in range(1000):
y_pred = net(x)
loss = (y_pred - y).pow(2).sum()
if t<10 or t%100==1: print(t, loss.data[0])
loss.backward()
optimizer.step()
optimizer.zero_grad()
t = [ [math.pi] ]
print( str(t) +" -> "+ str( net( Variable(torch.FloatTensor( t )) ) ) )
t = [ [math.pi/2] ]
print( str(t) +" -> "+ str( net( Variable(torch.FloatTensor( t )) ) ) )
Unfortunately, while this code works properly, it does not solve the matter of making the original, more "low level" code work as expected (e.g. introducing the bias).
Following up on #jdhao's comment - this is a super-simple PyTorch model that computes exactly what you want:
class LinearWithInputBias(nn.Linear):
def __init__(self, in_features, out_features, out_bias=True, in_bias=True):
nn.Linear.__init__(self, in_features, out_features, out_bias)
if in_bias:
in_bias = torch.zeros(1, out_features)
# in_bias.normal_() # if you want it to be randomly initialized
self._out_bias = nn.Parameter(in_bias)
def forward(self, x):
out = nn.Linear.forward(self, x)
try:
out = out + self._out_bias
except AttributeError:
pass
return out
However, there's an additional bug in your code: from what I can see, you don't train it - i.e. you do not call an optimizer (like torch.optim.SGD(mod.parameters()) before you delete the gradient information by calling grad.data.zero_().

Tensorflow: 6 layer CNN: OOM (use 10Gb GPU memory)

I am using the following code for running a 6 layer CNN with 2 FC layers on top (on Tesla K-80 GPU).
Somehow, it consumes entire memory 10GB and died out of memory.I know that i can reduce the batch_size and then run , but i also want to run with 15 or 20 CNN layers.Whats wrong with the following code and why it takes all the memory? How should i run the code for 15 layers CNN.
Code:
import model
with tf.Graph().as_default() as g_train:
filenames = tf.train.match_filenames_once(FLAGS.train_dir+'*.tfrecords')
filename_queue = tf.train.string_input_producer(filenames, shuffle=True, num_epochs=FLAGS.num_epochs)
feats,labels = get_batch_input(filename_queue, batch_size=FLAGS.batch_size)
### feats size=(batch_size, 100, 50)
logits = model.inference(feats, FLAGS.batch_size)
loss = model.loss(logits, labels, feats)
tvars = tf.trainable_variables()
global_step = tf.Variable(0, name='global_step', trainable=False)
# Add to the Graph operations that train the model.
train_op = model.training(loss, tvars, global_step, FLAGS.learning_rate, FLAGS.clip_gradients)
# Add the Op to compare the logits to the labels during evaluation.
eval_correct = model.evaluation(logits, labels, feats)
summary_op = tf.merge_all_summaries()
saver = tf.train.Saver(tf.all_variables(), max_to_keep=15)
# The op for initializing the variables.
init_op = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init_op)
summary_writer = tf.train.SummaryWriter(FLAGS.model_dir,
graph=sess.graph)
# Start input enqueue threads.
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
try:
step = 0
while not coord.should_stop():
_, loss_value = sess.run([train_op, loss])
if step % 100 == 0:
print('Step %d: loss = %.2f (%.3f sec)' % (step, loss_value))
# Update the events file.
summary_str = sess.run(summary_op)
summary_writer.add_summary(summary_str, step)
if (step == 0) or (step + 1) % 1000 == 0 or (step + 1) == FLAGS.max_steps:
ckpt_model = os.path.join(FLAGS.model_dir, 'model.ckpt')
saver.save(sess, ckpt_model, global_step=step)
#saver.save(sess, FLAGS.model_dir, global_step=step)
step += 1
except tf.errors.OutOfRangeError:
print('Done training for %d epochs, %d steps.' % (FLAGS.num_epochs, step))
finally:
coord.join(threads)
sess.close()
###################### File model.py ####################
def conv2d(x, W, b, strides=1):
# Conv2D wrapper, with bias and relu activation
x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1],
padding='SAME')
x = tf.nn.bias_add(x, b)
return tf.nn.relu(x)
def maxpool2d(x, k=2,s=2):
# MaxPool2D wrapper
return tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, s,
s,1],padding='SAME')
def inference(feats,batch_size):
#feats size (batch_size,100,50,1) #batch_size=256
conv1_w=tf.get_variable("conv1_w", [filter_size,filter_size,1,256],initializer=tf.uniform_unit_scaling_initializer())
conv1_b=tf.get_variable("conv1_b",[256])
conv1 = conv2d(feats, conv1_w, conv1_b,2)
conv1 = maxpool2d(conv1, k=2,s=2)
### This was replicated for 6 layers and the 2 FC connected layers are added
return logits
def training(loss, train_vars, global_step, learning_rate, clip_gradients):
# Add a scalar summary for the snapshot loss.
tf.scalar_summary(loss.op.name, loss)
grads, _ = tf.clip_by_global_norm(tf.gradients(loss, train_vars,aggregation_method=1), clip_gradients)
optimizer = tf.train.AdamOptimizer(learning_rate)
train_op = optimizer.apply_gradients(zip(grads, train_vars), global_step=global_step)
return train_op
I am not too sure what the model python library is. If it is something you wrote and can change the setting in the optimizer I would suggest the following which I use in my own code
train_step = tf.train.AdamOptimizer(learning_rate).minimize(cost, aggregation_method = tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N)
By default the aggeragetion_method is ADD_N but if you change it to EXPERIMENTAL_ACCUMULATE_N or EXPERIMENTAL_TREE this will greatly save memory. The main memory hog in these programs is that tensorflow must save the output values at every neuron so that it can compute the gradients. Changing the aggregation_method helps a lot from my experience.
Also BTW I don't think there is anything wrong with your code. I can run out of memory on small cov-nets as well.

Keras not using multiple cores

Based on the famous check_blas.py script, I wrote this one to check that theano can in fact use multiple cores:
import os
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['GOTO_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['THEANO_FLAGS'] = 'device=cpu,blas.ldflags=-lblas -lgfortran'
import numpy
import theano
import theano.tensor as T
M=2000
N=2000
K=2000
iters=100
order='C'
a = theano.shared(numpy.ones((M, N), dtype=theano.config.floatX, order=order))
b = theano.shared(numpy.ones((N, K), dtype=theano.config.floatX, order=order))
c = theano.shared(numpy.ones((M, K), dtype=theano.config.floatX, order=order))
f = theano.function([], updates=[(c, 0.4 * c + .8 * T.dot(a, b))])
for i in range(iters):
f(y)
Running this as python3 check_theano.py shows that 8 threads are being used. And more importantly, the code runs approximately 9 times faster than without the os.environ settings, which apply just 1 core: 7.863s vs 71.292s on a single run.
So, I would expect that Keras now also uses multiple cores when calling fit (or predict for that matter). However this is not the case for the following code:
import os
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['GOTO_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['THEANO_FLAGS'] = 'device=cpu,blas.ldflags=-lblas -lgfortran'
import numpy
from keras.models import Sequential
from keras.layers import Dense
coeffs = numpy.random.randn(100)
x = numpy.random.randn(100000, 100);
y = numpy.dot(x, coeffs) + numpy.random.randn(100000) * 0.01
model = Sequential()
model.add(Dense(20, input_shape=(100,)))
model.add(Dense(1, input_shape=(20,)))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit(x, y, verbose=0, nb_epoch=10)
This script uses only 1 core with this output:
Using Theano backend.
/home/herbert/venv3/lib/python3.4/site-packages/theano/tensor/signal/downsample.py:5: UserWarning: downsample module has been moved to the pool module.
warnings.warn("downsample module has been moved to the pool module.")
Why does the fit of Keras only use 1 core for the same setup? Is the check_blas.py script actually representative for neural network training calculations?
FYI:
(venv3)herbert#machine:~/ $ python3 -c 'import numpy, theano, keras; print(numpy.__version__); print(theano.__version__); print(keras.__version__);'
ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your nvcc installation and try again.
1.11.0
0.8.0rc1.dev-e6e88ce21df4fbb21c76e68da342e276548d4afd
0.3.2
(venv3)herbert#machine:~/ $
EDIT
I created a Theano implementaiton of a simple MLP as well, which also does not run multi-core:
import os
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['GOTO_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['THEANO_FLAGS'] = 'device=cpu,blas.ldflags=-lblas -lgfortran'
import numpy
import theano
import theano.tensor as T
M=2000
N=2000
K=2000
iters=100
order='C'
coeffs = numpy.random.randn(100)
x = numpy.random.randn(100000, 100).astype(theano.config.floatX)
y = (numpy.dot(x, coeffs) + numpy.random.randn(100000) * 0.01).astype(theano.config.floatX).reshape(100000, 1)
x_shared = theano.shared(x)
y_shared = theano.shared(y)
x_tensor = T.matrix('x')
y_tensor = T.matrix('y')
W0_values = numpy.asarray(
numpy.random.uniform(
low=-numpy.sqrt(6. / 120),
high=numpy.sqrt(6. / 120),
size=(100, 20)
),
dtype=theano.config.floatX
)
W0 = theano.shared(value=W0_values, name='W0', borrow=True)
b0_values = numpy.zeros((20,), dtype=theano.config.floatX)
b0 = theano.shared(value=b0_values, name='b0', borrow=True)
output0 = T.dot(x_tensor, W0) + b0
W1_values = numpy.asarray(
numpy.random.uniform(
low=-numpy.sqrt(6. / 120),
high=numpy.sqrt(6. / 120),
size=(20, 1)
),
dtype=theano.config.floatX
)
W1 = theano.shared(value=W1_values, name='W1', borrow=True)
b1_values = numpy.zeros((1,), dtype=theano.config.floatX)
b1 = theano.shared(value=b1_values, name='b1', borrow=True)
output1 = T.dot(output0, W1) + b1
params = [W0, b0, W1, b1]
cost = ((output1 - y_tensor) ** 2).sum()
gradients = [T.grad(cost, param) for param in params]
learning_rate = 0.0000001
updates = [
(param, param - learning_rate * gradient)
for param, gradient in zip(params, gradients)
]
train_model = theano.function(
inputs=[],#x_tensor, y_tensor],
outputs=cost,
updates=updates,
givens={
x_tensor: x_shared,
y_tensor: y_shared
}
)
errors = []
for i in range(1000):
errors.append(train_model())
print(errors[0:50:])
Keras and TF themselves don't use whole cores and capacity of CPU! If you are interested in using all 100% of your CPU then the multiprocessing.Pool basically creates a pool of jobs that need doing. The processes will pick up these jobs and run them. When a job is finished, the process will pick up another job from the pool.
NB: If you want to just speed up this model, look into GPUs or changing the hyperparameters like batch size and number of neurons (layer size).
Here's how you can use multiprocessing to train multiple models at the same time (using processes running in parallel on each separate CPU core of your machine).
This answer inspired by #repploved
import time
import signal
import multiprocessing
def init_worker():
''' Add KeyboardInterrupt exception to mutliprocessing workers '''
signal.signal(signal.SIGINT, signal.SIG_IGN)
def train_model(layer_size):
'''
This code is parallelized and runs on each process
It trains a model with different layer sizes (hyperparameters)
It saves the model and returns the score (error)
'''
import keras
from keras.models import Sequential
from keras.layers import Dense
print(f'Training a model with layer size {layer_size}')
# build your model here
model_RNN = Sequential()
model_RNN.add(Dense(layer_size))
# fit the model (the bit that takes time!)
model_RNN.fit(...)
# lets demonstrate with a sleep timer
time.sleep(5)
# save trained model to a file
model_RNN.save(...)
# you can also return values eg. the eval score
return model_RNN.evaluate(...)
num_workers = 4
hyperparams = [800, 960, 1100]
pool = multiprocessing.Pool(num_workers, init_worker)
scores = pool.map(train_model, hyperparams)
print(scores)
Output:
Training a model with layer size 800
Training a model with layer size 960
Training a model with layer size 1100
[{'size':960,'score':1.0}, {'size':800,'score':1.2}, {'size':1100,'score':0.7}]
This is easily demonstrated with a time.sleep in the code. You'll see that all 3 processes start the training job, and then they all finish at about the same time. If this was single processed, you'd have to wait for each to finish before starting the next (yawn!).

Resources