This is a snippet of my PyTorch code. My Jupyter notebook hangs when I use num_workers > 0, and I have spent a lot of time on this problem without finding an answer. I do not have a GPU and work only with a CPU.
class IndexedDataset(Dataset):
    def __init__(self, data, targets, test=False):
        self.dataset = data
        if not test:
            self.labels = targets.numpy()
            self.mask = np.concatenate((np.zeros(NUM_LABELED), np.ones(NUM_UNLABELED)))

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        image = self.dataset[idx]
        return image, self.labels[idx]

    def display(self, idx):
        plt.imshow(self.dataset[idx], cmap='gray')
        plt.show()
train_set = IndexedDataset(train_data, train_target, test = False)
test_set = IndexedDataset(test_data, test_target, test = True)
train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, num_workers=2)
test_loader = DataLoader(test_set, batch_size=BATCH_SIZE, num_workers=2)
Any help is appreciated.
When num_workers is greater than 0, PyTorch uses multiple processes for data loading.
Jupyter notebooks have known issues with multiprocessing.
One way to resolve this is not to use Jupyter notebooks: just write a normal .py file and run it from the command line.
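For example, a minimal train.py skeleton (the dataset here is a random placeholder, not the asker's IndexedDataset) could look like this:
# train.py -- run with: python train.py
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    # placeholder data; substitute your own IndexedDataset here
    dataset = TensorDataset(torch.randn(1000, 28, 28), torch.randint(0, 10, (1000,)))
    loader = DataLoader(dataset, batch_size=64, num_workers=2)
    for images, labels in loader:
        print(images.shape)

if __name__ == '__main__':
    # the guard lets worker processes re-import this module safely
    main()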
Or try what's suggested here: Jupyter notebook never finishes processing using multiprocessing (Python 3).
Since Jupyter Notebook doesn't support Python's standard multiprocessing, there are two thin wrapper libraries; you can install one of them, as mentioned here 1 and 2.
I prefer to solve the problem in two ways, without using any external libraries:
By converting my file from .ipynb to .py format, running it in the terminal, and putting my code under an if __name__ == '__main__': guard as follows:
...
...
train_set = IndexedDataset(train_data, train_target, test = False)
train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, num_workers=4)
if __name__ == '__main__':
    for images, labels in train_loader:
        print(images.shape)
With the multiprocessing library, as follows:
In try.ipynb:
import multiprocessing as mp
import processing as ps
...
...
train_set = IndexedDataset(train_data, train_target, test = False)
train_loader = DataLoader(train_set, batch_size=BATCH_SIZE)
if __name__ == "__main__":
    p = mp.Pool(8)
    r = p.map(ps.getShape, train_loader)
    print(r)
    p.close()
In processing.py file:
def getShape(data):
    for i in data:
        return i[0].shape
I am testing the BERT base and distilled BERT models from Huggingface in 4 scenarios, with batch_size = 1; the speeds are:
1) bert-base-uncased: 154ms per request
2) bert-base-uncased with quantization: 94ms per request
3) distilbert-base-uncased: 86ms per request
4) distilbert-base-uncased with quantization: 69ms per request
I am using the IMDB text as experimental data and set max_length=512, so the inputs are quite long. The CPU info on Ubuntu 18.04 is below:
cat /proc/cpuinfo | grep 'name'| uniq
model name : Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
The machine has 3 GPUs available for use:
Tesla V100-SXM2
It seems quite slow for a real-time application. Are those speeds normal for the BERT base model?
The testing code is below:
import pandas as pd
import torch.quantization
from transformers import AutoTokenizer, AutoModel, DistilBertTokenizer, DistilBertModel
def get_embedding(model, tokenizer, text):
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    outputs = model(**inputs)
    output_tensors = outputs[0][0]
    output_numpy = output_tensors.detach().numpy()
    embedding = output_numpy.tolist()[0]
    return embedding

def process_text(model, tokenizer, text_lines):
    for index, line in enumerate(text_lines):
        embedding = get_embedding(model, tokenizer, line)
        if index % 100 == 0:
            print('Current index: {}'.format(index))
import time
from datetime import timedelta
if __name__ == "__main__":
df = pd.read_csv('../data/train.csv', sep='\t')
df = df.head(1000)
text_lines = df['review']
text_line_count = len(text_lines)
print('Text size: {}'.format(text_line_count))
start = time.time()
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
process_text(model, tokenizer, text_lines)
end = time.time()
print('Total time spent with bert base: {}'.format(str(timedelta(seconds=end - start))))
model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
process_text(model, tokenizer, text_lines)
end2 = time.time()
print('Total time spent with bert base quantization: {}'.format(str(timedelta(seconds=end2 - end))))
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
process_text(model, tokenizer, text_lines)
end3 = time.time()
print('Total time spent with distilbert: {}'.format(str(timedelta(seconds=end3 - end2))))
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
process_text(model, tokenizer, text_lines)
end4 = time.time()
print('Total time spent with distilbert quantization: {}'.format(str(timedelta(seconds=end4 - end3))))
EDIT: Based on a suggestion, I changed the code to the following:
inputs = tokenizer(text_batch, padding=True, return_tensors="pt")
outputs = model(**inputs)
Here text_batch is a list of texts used as input.
No, you can speed it up.
First, why are you testing it with batch size 1?
Both the tokenizer and the model accept batched inputs. Basically, you can pass a 2D array/list that contains a single sample in each row. See the documentation for the tokenizer: https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.__call__ The same applies to the models.
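As an illustration, here is a minimal sketch of batched inference on CPU; the batch size of 8 and the use of torch.no_grad() are my own choices, not from the original post:
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text_lines = ["first review ...", "second review ...", "third review ..."]
batch_size = 8  # assumed value; tune for your CPU and memory

all_embeddings = []
for start in range(0, len(text_lines), batch_size):
    batch = list(text_lines[start:start + batch_size])
    # tokenize the whole batch at once; padding makes the sequences equal length
    inputs = tokenizer(batch, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():  # no gradients needed for inference
        outputs = model(**inputs)
    # outputs[0] has shape (batch_size, seq_len, hidden_size); keep the [CLS] token vector
    all_embeddings.extend(outputs[0][:, 0, :].numpy().tolist())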
Also, your for loop is sequential even if you use a batch size larger than 1. You can create a test dataset and then use the Trainer class with trainer.predict().
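A rough sketch of that approach follows; note that Trainer is normally used with a task head (here AutoModelForSequenceClassification), and the output_dir and batch size are placeholder values I chose, not from the original post:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

class ReviewDataset(torch.utils.data.Dataset):
    """Wraps pre-tokenized encodings so Trainer can batch them."""
    def __init__(self, encodings):
        self.encodings = encodings
    def __len__(self):
        return len(self.encodings["input_ids"])
    def __getitem__(self, idx):
        return {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

texts = ["a review ...", "another review ..."]
encodings = tokenizer(texts, padding=True, truncation=True, max_length=512)
dataset = ReviewDataset(encodings)

args = TrainingArguments(output_dir="tmp_out", per_device_eval_batch_size=32)
trainer = Trainer(model=model, args=args)
predictions = trainer.predict(dataset)  # batched prediction, no per-sample Python loop
print(predictions.predictions.shape)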
Also see this discussion of mine at the HF forums: https://discuss.huggingface.co/t/urgent-trainer-predict-and-model-generate-creates-totally-different-predictions/3426
My script gives different results if I run it on Colab, in a Jupyter notebook, or directly. The Colab result is wrong, as it treats the 3 equations as if they were not coupled.
import sympy as sp
from IPython.display import *
sp.init_printing(use_latex=True)
t,M_x0,My0,M_z0,w,k_2=sp.symbols('t,M_x0,My0,M_z0,omega,k_2',real=True)
M_x=sp.Function('M_x',real=True)(t)
M_y=sp.Function('M_y',real=True)(t)
M_z=sp.Function('M_z',real=True)(t)
e1=sp.Eq(sp.Derivative(M_x,t),w*M_y)
e2=sp.Eq(sp.Derivative(M_y,t),-w*M_x)
e3=sp.Eq(sp.Derivative(M_z,t),0)
sys3 = [e1,e2,e3]
sol= sp.dsolve(sys3)
display('Systeme :',sys3)
display('Solution :',sol)
The handling of systems of ODEs has been rewritten on SymPy master and will be very different in SymPy 1.7. This is what I get with the current master (which will become 1.7):
In [7]: sys3
Out[7]:
[d/dt(Mₓ(t)) = ω⋅M_y(t), d/dt(M_y(t)) = -ω⋅Mₓ(t), d/dt(M_z(t)) = 0]
In [8]: dsolve(sys3)
Out[8]: [Mₓ(t) = C₁⋅sin(ω⋅t) + C₂⋅cos(ω⋅t), M_y(t) = C₁⋅cos(ω⋅t) - C₂⋅sin(ω⋅t), M_z(t) = C₃]
I've been trying to code an auto-clicker, but first I needed to learn how to use the threading library. I don't know why the code just stops after the first iteration of the loop.
import time
from threading import Thread
def countdown(n):
    while n > 0:
        print('T-minus', n)
        n -= 1
        time.sleep(5)
t = Thread(target = countdown, args =(10, ))
t.start()
The output is only:
>>> T-minus 10
Can anyone help me?
I just understood what was going on! If you try to run a thread in a Jupyter cell, it won't show the output. I changed the code a bit to test it:
import time
from threading import Thread
TIMES = 0
def countdown(n):
    global TIMES
    while n > 0:
        TIMES += 1
        time.sleep(5)
t = Thread(target = countdown, args =(10, ))
t.start()
And in the next cell I just kept printing the TIMES value, and it was changing, but only in the background!
for i in range(10):
    time.sleep(3)
    print(TIMES)
I'm working in the jupyter notebook.
Is it possible to run python callbacks from a bokeh widget?
Yes, you can embed a Bokeh server app in a Jupyter notebook by defining a function that modifies a Bokeh document and passing it to show, e.g.:
from bokeh.layouts import column
from bokeh.models import ColumnDataSource, Slider
from bokeh.plotting import figure
from bokeh.sampledata.sea_surface_temperature import sea_surface_temperature

def modify_doc(doc):
    df = sea_surface_temperature.copy()
    source = ColumnDataSource(data=df)

    plot = figure(x_axis_type='datetime', y_range=(0, 25),
                  y_axis_label='Temperature (Celsius)',
                  title="Sea Surface Temperature at 43.18, -70.43")
    plot.line('time', 'temperature', source=source)

    def callback(attr, old, new):
        if new == 0:
            data = df
        else:
            data = df.rolling('{0}D'.format(new)).mean()
        source.data = ColumnDataSource(data=data).data

    slider = Slider(start=0, end=30, value=0, step=1, title="Smoothing by N Days")
    slider.on_change('value', callback)

    doc.add_root(column(slider, plot))
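To run this from inside the notebook, the usual pattern is to pass the function to show(); the notebook_url below is an assumption and should match wherever your Jupyter server is running:
from bokeh.io import output_notebook, show

output_notebook()
# show() runs the Bokeh application in the notebook, so the Python callbacks work
show(modify_doc, notebook_url="localhost:8888")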
You can see a complete example here:
https://github.com/bokeh/bokeh/blob/master/examples/howto/server_embed/notebook_embed.ipynb
(You will need to run the notebook locally)
Based on the famous check_blas.py script, I wrote this one to check that Theano can in fact use multiple cores:
import os
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['GOTO_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['THEANO_FLAGS'] = 'device=cpu,blas.ldflags=-lblas -lgfortran'
import numpy
import theano
import theano.tensor as T
M=2000
N=2000
K=2000
iters=100
order='C'
a = theano.shared(numpy.ones((M, N), dtype=theano.config.floatX, order=order))
b = theano.shared(numpy.ones((N, K), dtype=theano.config.floatX, order=order))
c = theano.shared(numpy.ones((M, K), dtype=theano.config.floatX, order=order))
f = theano.function([], updates=[(c, 0.4 * c + .8 * T.dot(a, b))])
for i in range(iters):
    f()
Running this as python3 check_theano.py shows that 8 threads are being used. More importantly, the code runs approximately 9 times faster than without the os.environ settings, which use just 1 core: 7.863s vs 71.292s on a single run.
So, I would expect that Keras now also uses multiple cores when calling fit (or predict for that matter). However this is not the case for the following code:
import os
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['GOTO_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['THEANO_FLAGS'] = 'device=cpu,blas.ldflags=-lblas -lgfortran'
import numpy
from keras.models import Sequential
from keras.layers import Dense
coeffs = numpy.random.randn(100)
x = numpy.random.randn(100000, 100);
y = numpy.dot(x, coeffs) + numpy.random.randn(100000) * 0.01
model = Sequential()
model.add(Dense(20, input_shape=(100,)))
model.add(Dense(1, input_shape=(20,)))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit(x, y, verbose=0, nb_epoch=10)
This script uses only 1 core with this output:
Using Theano backend.
/home/herbert/venv3/lib/python3.4/site-packages/theano/tensor/signal/downsample.py:5: UserWarning: downsample module has been moved to the pool module.
warnings.warn("downsample module has been moved to the pool module.")
Why does Keras' fit use only 1 core for the same setup? Is the check_blas.py script actually representative of neural-network training calculations?
FYI:
(venv3)herbert@machine:~/ $ python3 -c 'import numpy, theano, keras; print(numpy.__version__); print(theano.__version__); print(keras.__version__);'
ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your nvcc installation and try again.
1.11.0
0.8.0rc1.dev-e6e88ce21df4fbb21c76e68da342e276548d4afd
0.3.2
(venv3)herbert@machine:~/ $
EDIT
I created a Theano implementation of a simple MLP as well, which also does not run multi-core:
import os
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['GOTO_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['THEANO_FLAGS'] = 'device=cpu,blas.ldflags=-lblas -lgfortran'
import numpy
import theano
import theano.tensor as T
M=2000
N=2000
K=2000
iters=100
order='C'
coeffs = numpy.random.randn(100)
x = numpy.random.randn(100000, 100).astype(theano.config.floatX)
y = (numpy.dot(x, coeffs) + numpy.random.randn(100000) * 0.01).astype(theano.config.floatX).reshape(100000, 1)
x_shared = theano.shared(x)
y_shared = theano.shared(y)
x_tensor = T.matrix('x')
y_tensor = T.matrix('y')
W0_values = numpy.asarray(
    numpy.random.uniform(
        low=-numpy.sqrt(6. / 120),
        high=numpy.sqrt(6. / 120),
        size=(100, 20)
    ),
    dtype=theano.config.floatX
)
W0 = theano.shared(value=W0_values, name='W0', borrow=True)
b0_values = numpy.zeros((20,), dtype=theano.config.floatX)
b0 = theano.shared(value=b0_values, name='b0', borrow=True)
output0 = T.dot(x_tensor, W0) + b0
W1_values = numpy.asarray(
    numpy.random.uniform(
        low=-numpy.sqrt(6. / 120),
        high=numpy.sqrt(6. / 120),
        size=(20, 1)
    ),
    dtype=theano.config.floatX
)
W1 = theano.shared(value=W1_values, name='W1', borrow=True)
b1_values = numpy.zeros((1,), dtype=theano.config.floatX)
b1 = theano.shared(value=b1_values, name='b1', borrow=True)
output1 = T.dot(output0, W1) + b1
params = [W0, b0, W1, b1]
cost = ((output1 - y_tensor) ** 2).sum()
gradients = [T.grad(cost, param) for param in params]
learning_rate = 0.0000001
updates = [
    (param, param - learning_rate * gradient)
    for param, gradient in zip(params, gradients)
]
train_model = theano.function(
    inputs=[],  # [x_tensor, y_tensor]
    outputs=cost,
    updates=updates,
    givens={
        x_tensor: x_shared,
        y_tensor: y_shared
    }
)
errors = []
for i in range(1000):
    errors.append(train_model())
print(errors[0:50:])
Keras and TF themselves don't use all the cores and capacity of your CPU! If you are interested in using 100% of your CPU, multiprocessing.Pool basically creates a pool of jobs that need doing. The processes will pick up these jobs and run them; when a job is finished, the process will pick up another job from the pool.
NB: If you want to just speed up this model, look into GPUs or changing the hyperparameters like batch size and number of neurons (layer size).
Here's how you can use multiprocessing to train multiple models at the same time (using processes running in parallel on each separate CPU core of your machine).
This answer was inspired by @repploved.
import time
import signal
import multiprocessing
def init_worker():
    ''' Add KeyboardInterrupt exception to multiprocessing workers '''
    signal.signal(signal.SIGINT, signal.SIG_IGN)

def train_model(layer_size):
    '''
    This code is parallelized and runs on each process.
    It trains a model with different layer sizes (hyperparameters).
    It saves the model and returns the score (error).
    '''
    import keras
    from keras.models import Sequential
    from keras.layers import Dense

    print(f'Training a model with layer size {layer_size}')
    # build your model here
    model_RNN = Sequential()
    model_RNN.add(Dense(layer_size))
    # fit the model (the bit that takes time!)
    model_RNN.fit(...)
    # let's demonstrate with a sleep timer
    time.sleep(5)
    # save the trained model to a file
    model_RNN.save(...)
    # you can also return values, e.g. the eval score
    return model_RNN.evaluate(...)
num_workers = 4
hyperparams = [800, 960, 1100]
pool = multiprocessing.Pool(num_workers, init_worker)
scores = pool.map(train_model, hyperparams)
print(scores)
Output:
Training a model with layer size 800
Training a model with layer size 960
Training a model with layer size 1100
[{'size':960,'score':1.0}, {'size':800,'score':1.2}, {'size':1100,'score':0.7}]
This is easily demonstrated with a time.sleep in the code. You'll see that all 3 processes start the training job, and then they all finish at about the same time. If this were run in a single process, you'd have to wait for each one to finish before starting the next (yawn!).