Why does Featuretools slow down when I increase the number of Dask workers? - jupyter-notebook

I'm using an Amazon SageMaker notebook instance with 72 cores and 144 GB RAM, and I ran 2 tests on a sample of the whole dataset to check whether the Dask cluster was working.
The sample has 4500 rows and 735 columns covering 5 different "assets" (i.e. 147 columns per asset). The code filters the columns and creates a feature matrix for each filtered DataFrame.
First, I initialized the cluster as follows; I received 72 workers, and the run took 17 minutes. (I assume I created 72 workers with one core each.)
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(processes=True, n_workers=72, threads_per_worker=72)
client = Client(cluster)  # connect a client to the local cluster

def main():
    import featuretools as ft
    list_columns = list(df_concat_02.columns)
    list_df_features = []
    from tqdm.notebook import tqdm
    for asset in tqdm(list_columns, total=len(list_columns)):
        dataframe = df_sma.filter(regex="^" + asset, axis=1).reset_index()
        es = ft.EntitySet()
        es = es.entity_from_dataframe(entity_id='MARKET', dataframe=dataframe,
                                      index='index',
                                      time_index='Date')
        fm, features = ft.dfs(entityset=es,
                              target_entity='MARKET',
                              trans_primitives=['divide_numeric'],
                              agg_primitives=[],
                              max_depth=1,
                              verbose=True,
                              dask_kwargs={'cluster': client.scheduler.address}
                              )
        list_df_features.append(fm)
    return list_df_features

if __name__ == "__main__":
    list_df = main()
Second, I initialized the cluster as follows; I received 9 workers, and the run took 3.5 minutes. (I assume I created 9 workers with 8 cores each.)
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(processes=True)
client = Client(cluster)  # connect a client to the local cluster

def main():
    import featuretools as ft
    list_columns = list(df_concat_02.columns)
    list_df_features = []
    from tqdm.notebook import tqdm
    for asset in tqdm(list_columns, total=len(list_columns)):
        dataframe = df_sma.filter(regex="^" + asset, axis=1).reset_index()
        es = ft.EntitySet()
        es = es.entity_from_dataframe(entity_id='MARKET', dataframe=dataframe,
                                      index='index',
                                      time_index='Date')
        fm, features = ft.dfs(entityset=es,
                              target_entity='MARKET',
                              trans_primitives=['divide_numeric'],
                              agg_primitives=[],
                              max_depth=1,
                              verbose=True,
                              dask_kwargs={'cluster': client.scheduler.address}
                              )
        list_df_features.append(fm)
    return list_df_features

if __name__ == "__main__":
    list_df = main()
For me, this is mind-blowing, because I thought 72 workers would finish the work faster! Since I'm not a specialist in either Dask or Featuretools, I guess I'm setting something up wrong.
I would appreciate any kind of help and advice!
Thank you!

You are setting dask_kwargs in DFS correctly. I think the slowdown comes from the additional overhead and the fewer cores per worker. The more workers there are, the more overhead is spent transmitting data between them. Additionally, 8 cores in 1 worker can be leveraged to make computations run faster than 1 core in each of 8 workers.
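As a rough sketch (not from the original answer), you could create the cluster explicitly with a handful of multi-process workers that each get several threads; the exact split below is just an assumption for a 72-core machine and is worth benchmarking against other splits:
from dask.distributed import Client, LocalCluster
# assumed split: 9 processes x 8 threads, roughly matching the faster second test
cluster = LocalCluster(processes=True, n_workers=9, threads_per_worker=8)
client = Client(cluster)
print(client)  # reports the number of workers, threads, and memory available
# then pass the scheduler address to ft.dfs as in the question:
# dask_kwargs={'cluster': client.scheduler.address}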

Related

RuntimeError: quantile() q tensor must be same dtype as the input tensor in pytorch-forecasting

PyTorch-Forecasting version: 0.10.2
PyTorch version: 1.12.1
Python version: 3.10.4
Operating System: Windows
Expected behavior
No error
Actual behavior
The error is:
File c:\Users\josepeeterson.er\Miniconda3\envs\pytorch\lib\site-packages\pytorch_forecasting\metrics\base_metrics.py:979, in DistributionLoss.to_quantiles(self, y_pred, quantiles, n_samples)
977 except NotImplementedError: # resort to derive quantiles empirically
978 samples = torch.sort(self.sample(y_pred, n_samples), -1).values
--> 979 quantiles = torch.quantile(samples, torch.tensor(quantiles, device=samples.device), dim=2).permute(1, 2, 0)
980 return quantiles
RuntimeError: quantile() q tensor must be same dtype as the input tensor
How do I set them to be of the same datatype? This is happening internally, and I do not have control over it. I am not using any GPUs.
The link to the .csv file with input data is https://github.com/JosePeeterson/Demand_forecasting
The data is just sampled from a negative binomial distribution with parameters (9, 0.5) every 4 hours; the time in between is all zeros.
I just want to see if DeepAR can learn this pattern.
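For reference, here is a rough sketch of how data with this shape could be generated with NumPy (this is an assumption based on the description above, not the contents of the linked CSV; the column names mirror those used in the code below):
import numpy as np
import pandas as pd
# assumption: hourly steps, one negative binomial (9, 0.5) draw every 4th hour, zeros in between
rng = np.random.default_rng(0)
n_hours = 2000
values = np.zeros(n_hours)
values[::4] = rng.negative_binomial(9, 0.5, size=values[::4].shape[0])
data = pd.DataFrame({"series": 0, "time_idx": np.arange(n_hours), "value": values})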
Code to reproduce the problem
from pytorch_forecasting.data.examples import generate_ar_data
import matplotlib.pyplot as plt
import pandas as pd
from pytorch_forecasting.data import TimeSeriesDataSet
from pytorch_forecasting.data import NaNLabelEncoder
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor
import pytorch_lightning as pl
from pytorch_forecasting import NegativeBinomialDistributionLoss, DeepAR
import torch
from pytorch_forecasting.data.encoders import TorchNormalizer
data = pd.read_csv('1_f_nbinom_train.csv')
data["date"] = pd.Timestamp("2021-08-24") + pd.to_timedelta(data.time_idx, "H")
data['_hour_of_day'] = data["date"].dt.hour.astype(str)
data['_day_of_week'] = data["date"].dt.dayofweek.astype(str)
data['_day_of_month'] = data["date"].dt.day.astype(str)
data['_day_of_year'] = data["date"].dt.dayofyear.astype(str)
data['_week_of_year'] = data["date"].dt.weekofyear.astype(str)
data['_month_of_year'] = data["date"].dt.month.astype(str)
data['_year'] = data["date"].dt.year.astype(str)
max_encoder_length = 60
max_prediction_length = 20
training_cutoff = data["time_idx"].max() - max_prediction_length
training = TimeSeriesDataSet(
    data.iloc[0:-620],
    time_idx="time_idx",
    target="value",
    categorical_encoders={
        "series": NaNLabelEncoder(add_nan=True).fit(data.series),
        "_hour_of_day": NaNLabelEncoder(add_nan=True).fit(data._hour_of_day),
        "_day_of_week": NaNLabelEncoder(add_nan=True).fit(data._day_of_week),
        "_day_of_month": NaNLabelEncoder(add_nan=True).fit(data._day_of_month),
        "_day_of_year": NaNLabelEncoder(add_nan=True).fit(data._day_of_year),
        "_week_of_year": NaNLabelEncoder(add_nan=True).fit(data._week_of_year),
        "_year": NaNLabelEncoder(add_nan=True).fit(data._year),
    },
    group_ids=["series"],
    min_encoder_length=max_encoder_length,
    max_encoder_length=max_encoder_length,
    min_prediction_length=max_prediction_length,
    max_prediction_length=max_prediction_length,
    time_varying_unknown_reals=["value"],
    time_varying_known_categoricals=["_hour_of_day", "_day_of_week", "_day_of_month", "_day_of_year", "_week_of_year", "_year"],
    time_varying_known_reals=["time_idx"],
    add_relative_time_idx=False,
    randomize_length=None,
    scalers=[],
    target_normalizer=TorchNormalizer(method="identity", center=False, transformation=None),
)
validation = TimeSeriesDataSet.from_dataset(
    training,
    data.iloc[-620:-420],
    # predict=True,
    stop_randomization=True,
)
batch_size = 64
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=8)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size, num_workers=8)
# save datasets
training.save("training.pkl")
validation.save("validation.pkl")
early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=5, verbose=False, mode="min")
lr_logger = LearningRateMonitor()
trainer = pl.Trainer(
    max_epochs=10,
    gpus=0,
    gradient_clip_val=0.1,
    limit_train_batches=30,
    limit_val_batches=3,
    # fast_dev_run=True,
    # logger=logger,
    # profiler=True,
    callbacks=[lr_logger, early_stop_callback],
)
deepar = DeepAR.from_dataset(
    training,
    learning_rate=0.1,
    hidden_size=32,
    dropout=0.1,
    loss=NegativeBinomialDistributionLoss(),
    log_interval=10,
    log_val_interval=3,
    # reduce_on_plateau_patience=3,
)
print(f"Number of parameters in network: {deepar.size()/1e3:.1f}k")
torch.set_num_threads(10)
trainer.fit(
    deepar,
    train_dataloaders=train_dataloader,
    val_dataloaders=val_dataloader,
)
You need to cast samples with torch.tensor as shown below. Then save the modified base_metrics.py and rerun the code above.
except NotImplementedError:  # resort to derive quantiles empirically
    samples = torch.sort(self.sample(y_pred, n_samples), -1).values
    quantiles = torch.quantile(torch.tensor(samples), torch.tensor(quantiles, device=samples.device), dim=2).permute(1, 2, 0)
return quantiles
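For context, a minimal standalone sketch of the dtype requirement behind this error (purely illustrative, independent of pytorch-forecasting):
import torch
samples = torch.randn(2, 3, 100)                            # float32 by default
q_ok = torch.tensor([0.1, 0.5, 0.9])                        # same dtype: works
print(torch.quantile(samples, q_ok, dim=2).shape)
q_bad = torch.tensor([0.1, 0.5, 0.9], dtype=torch.float64)  # dtype mismatch
# torch.quantile(samples, q_bad, dim=2)  # would raise the RuntimeError quoted above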

Are these normal speeds for BERT pretrained model inference in PyTorch?

I am testing the BERT base and distilled BERT models from Hugging Face in 4 scenarios, all with batch_size = 1:
1) bert-base-uncased: 154ms per request
2) bert-base-uncased with quantization: 94ms per request
3) distilbert-base-uncased: 86ms per request
4) distilbert-base-uncased with quantization: 69ms per request
I am using IMDB text as the experimental data with max_length=512, so the inputs are quite long. The CPU info (Ubuntu 18.04) is below:
cat /proc/cpuinfo | grep 'name'| uniq
model name : Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
The machine has 3 GPUs available for use:
Tesla V100-SXM2
This seems quite slow for a real-time application. Are these speeds normal for the BERT base model?
The testing code is below:
import pandas as pd
import torch.quantization
from transformers import AutoTokenizer, AutoModel, DistilBertTokenizer, DistilBertModel
def get_embedding(model, tokenizer, text):
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    outputs = model(**inputs)
    output_tensors = outputs[0][0]
    output_numpy = output_tensors.detach().numpy()
    embedding = output_numpy.tolist()[0]
    return embedding

def process_text(model, tokenizer, text_lines):
    for index, line in enumerate(text_lines):
        embedding = get_embedding(model, tokenizer, line)
        if index % 100 == 0:
            print('Current index: {}'.format(index))

import time
from datetime import timedelta

if __name__ == "__main__":
    df = pd.read_csv('../data/train.csv', sep='\t')
    df = df.head(1000)
    text_lines = df['review']
    text_line_count = len(text_lines)
    print('Text size: {}'.format(text_line_count))
    start = time.time()
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    process_text(model, tokenizer, text_lines)
    end = time.time()
    print('Total time spent with bert base: {}'.format(str(timedelta(seconds=end - start))))
    model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
    process_text(model, tokenizer, text_lines)
    end2 = time.time()
    print('Total time spent with bert base quantization: {}'.format(str(timedelta(seconds=end2 - end))))
    tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
    model = DistilBertModel.from_pretrained("distilbert-base-uncased")
    process_text(model, tokenizer, text_lines)
    end3 = time.time()
    print('Total time spent with distilbert: {}'.format(str(timedelta(seconds=end3 - end2))))
    model = DistilBertModel.from_pretrained("distilbert-base-uncased")
    model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
    process_text(model, tokenizer, text_lines)
    end4 = time.time()
    print('Total time spent with distilbert quantization: {}'.format(str(timedelta(seconds=end4 - end3))))
EDIT: based on suggestion I changed to the following:
inputs = tokenizer(text_batch, padding=True, return_tensors="pt")
outputs = model(**inputs)
Here text_batch is a list of texts used as input.
No, you can speed it up.
First, why are you testing it with batch size 1?
Both the tokenizer and the model accept batched inputs. Basically, you can pass a 2D array/list that contains a single sample in each row. See the tokenizer documentation: https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.__call__ The same applies to the models.
Also, your for loop is sequential even if you use a batch size larger than 1. You can create a test dataset and then use the Trainer class with trainer.predict().
Also see this discussion of mine at the HF forums: https://discuss.huggingface.co/t/urgent-trainer-predict-and-model-generate-creates-totally-different-predictions/3426
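For illustration, a minimal sketch of batched inference along these lines (the example texts and the choice of the [CLS] position for the embedding are placeholders, not taken from the question):
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()
# hypothetical batch of reviews; in practice, slice text_lines into batches
text_batch = ["the movie was great", "the movie was terrible"]
inputs = tokenizer(text_batch, padding=True, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():  # no gradients needed for inference
    outputs = model(**inputs)
# one vector per input text, taken from the [CLS] position
embeddings = outputs.last_hidden_state[:, 0, :]
print(embeddings.shape)  # (batch_size, hidden_size)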

threading.Thread not executing loops on functions

I've been trying to code an auto-clicker, but first I needed to learn how to use the threading library. I don't know why the code seems to stop after the first pass through the loop.
import time
from threading import Thread

def countdown(n):
    while n > 0:
        print('T-minus', n)
        n -= 1
        time.sleep(5)

t = Thread(target=countdown, args=(10,))
t.start()
The output is only:
>>> T-minus 10
Can anyone help me?
I just understood what was going on! If you try to run a thread in a Jupyter cell, it won't show the output. I changed the code a bit to test it.
import time
from threading import Thread

TIMES = 0

def countdown(n):
    global TIMES
    while n > 0:
        TIMES += 1
        n -= 1
        time.sleep(5)

t = Thread(target=countdown, args=(10,))
t.start()
And in the next cell I just kept printing the TIMES value, and it was changing, just in the background!
for i in range(10):
    time.sleep(3)
    print(TIMES)

Keras not using multiple cores

Based on the well-known check_blas.py script, I wrote this one to check that Theano can in fact use multiple cores:
import os
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['GOTO_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['THEANO_FLAGS'] = 'device=cpu,blas.ldflags=-lblas -lgfortran'

import numpy
import theano
import theano.tensor as T

M = 2000
N = 2000
K = 2000
iters = 100
order = 'C'

a = theano.shared(numpy.ones((M, N), dtype=theano.config.floatX, order=order))
b = theano.shared(numpy.ones((N, K), dtype=theano.config.floatX, order=order))
c = theano.shared(numpy.ones((M, K), dtype=theano.config.floatX, order=order))
f = theano.function([], updates=[(c, 0.4 * c + .8 * T.dot(a, b))])

for i in range(iters):
    f()
Running this as python3 check_theano.py shows that 8 threads are being used. More importantly, the code runs approximately 9 times faster than without the os.environ settings, which use just 1 core: 7.863s vs 71.292s for a single run.
So, I would expect that Keras now also uses multiple cores when calling fit (or predict for that matter). However this is not the case for the following code:
import os
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['GOTO_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['THEANO_FLAGS'] = 'device=cpu,blas.ldflags=-lblas -lgfortran'
import numpy
from keras.models import Sequential
from keras.layers import Dense
coeffs = numpy.random.randn(100)
x = numpy.random.randn(100000, 100);
y = numpy.dot(x, coeffs) + numpy.random.randn(100000) * 0.01
model = Sequential()
model.add(Dense(20, input_shape=(100,)))
model.add(Dense(1, input_shape=(20,)))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit(x, y, verbose=0, nb_epoch=10)
This script uses only 1 core with this output:
Using Theano backend.
/home/herbert/venv3/lib/python3.4/site-packages/theano/tensor/signal/downsample.py:5: UserWarning: downsample module has been moved to the pool module.
warnings.warn("downsample module has been moved to the pool module.")
Why does the fit of Keras only use 1 core for the same setup? Is the check_blas.py script actually representative for neural network training calculations?
FYI:
(venv3)herbert@machine:~/ $ python3 -c 'import numpy, theano, keras; print(numpy.__version__); print(theano.__version__); print(keras.__version__);'
ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your nvcc installation and try again.
1.11.0
0.8.0rc1.dev-e6e88ce21df4fbb21c76e68da342e276548d4afd
0.3.2
(venv3)herbert@machine:~/ $
EDIT
I created a Theano implementation of a simple MLP as well, which also does not run multi-core:
import os
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['GOTO_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['THEANO_FLAGS'] = 'device=cpu,blas.ldflags=-lblas -lgfortran'

import numpy
import theano
import theano.tensor as T

M = 2000
N = 2000
K = 2000
iters = 100
order = 'C'

coeffs = numpy.random.randn(100)
x = numpy.random.randn(100000, 100).astype(theano.config.floatX)
y = (numpy.dot(x, coeffs) + numpy.random.randn(100000) * 0.01).astype(theano.config.floatX).reshape(100000, 1)
x_shared = theano.shared(x)
y_shared = theano.shared(y)

x_tensor = T.matrix('x')
y_tensor = T.matrix('y')

W0_values = numpy.asarray(
    numpy.random.uniform(
        low=-numpy.sqrt(6. / 120),
        high=numpy.sqrt(6. / 120),
        size=(100, 20)
    ),
    dtype=theano.config.floatX
)
W0 = theano.shared(value=W0_values, name='W0', borrow=True)

b0_values = numpy.zeros((20,), dtype=theano.config.floatX)
b0 = theano.shared(value=b0_values, name='b0', borrow=True)

output0 = T.dot(x_tensor, W0) + b0

W1_values = numpy.asarray(
    numpy.random.uniform(
        low=-numpy.sqrt(6. / 120),
        high=numpy.sqrt(6. / 120),
        size=(20, 1)
    ),
    dtype=theano.config.floatX
)
W1 = theano.shared(value=W1_values, name='W1', borrow=True)

b1_values = numpy.zeros((1,), dtype=theano.config.floatX)
b1 = theano.shared(value=b1_values, name='b1', borrow=True)

output1 = T.dot(output0, W1) + b1

params = [W0, b0, W1, b1]
cost = ((output1 - y_tensor) ** 2).sum()
gradients = [T.grad(cost, param) for param in params]
learning_rate = 0.0000001
updates = [
    (param, param - learning_rate * gradient)
    for param, gradient in zip(params, gradients)
]

train_model = theano.function(
    inputs=[],  # x_tensor, y_tensor
    outputs=cost,
    updates=updates,
    givens={
        x_tensor: x_shared,
        y_tensor: y_shared
    }
)

errors = []
for i in range(1000):
    errors.append(train_model())

print(errors[0:50:])
Keras and TF themselves don't use all the cores and capacity of your CPU! If you are interested in using 100% of your CPU, multiprocessing.Pool is an option: it basically creates a pool of jobs that need doing, worker processes pick up these jobs and run them, and when a job is finished, a process picks up the next job from the pool.
NB: If you just want to speed up this model, look into GPUs or changing hyperparameters like batch size and number of neurons (layer size).
Here's how you can use multiprocessing to train multiple models at the same time (using processes running in parallel on each separate CPU core of your machine).
This answer was inspired by @repploved.
import time
import signal
import multiprocessing

def init_worker():
    ''' Add KeyboardInterrupt exception to multiprocessing workers '''
    signal.signal(signal.SIGINT, signal.SIG_IGN)

def train_model(layer_size):
    '''
    This code is parallelized and runs on each process
    It trains a model with different layer sizes (hyperparameters)
    It saves the model and returns the score (error)
    '''
    import keras
    from keras.models import Sequential
    from keras.layers import Dense

    print(f'Training a model with layer size {layer_size}')
    # build your model here
    model_RNN = Sequential()
    model_RNN.add(Dense(layer_size))
    # fit the model (the bit that takes time!)
    model_RNN.fit(...)
    # lets demonstrate with a sleep timer
    time.sleep(5)
    # save trained model to a file
    model_RNN.save(...)
    # you can also return values eg. the eval score
    return model_RNN.evaluate(...)

num_workers = 4
hyperparams = [800, 960, 1100]

pool = multiprocessing.Pool(num_workers, init_worker)
scores = pool.map(train_model, hyperparams)
print(scores)
Output:
Training a model with layer size 800
Training a model with layer size 960
Training a model with layer size 1100
[{'size':960,'score':1.0}, {'size':800,'score':1.2}, {'size':1100,'score':0.7}]
This is easily demonstrated with the time.sleep in the code. You'll see that all 3 processes start the training job, and then they all finish at about the same time. If this were single-processed, you'd have to wait for each to finish before starting the next (yawn!).

How can I load a csv file which is too big in iPython?

How can I load a CSV file that is too big in IPython? It seems that it cannot be loaded into memory all at once.
You can use this code to read the file in chunks, and it will also distribute the chunks over multiple processes.
import pandas as pd
import multiprocessing as mp

LARGE_FILE = "yourfile.csv"
CHUNKSIZE = 100000  # processing 100,000 rows at a time

def process_frame(df):
    # process data frame
    return len(df)

if __name__ == '__main__':
    reader = pd.read_csv(LARGE_FILE, chunksize=CHUNKSIZE)
    pool = mp.Pool(4)  # use 4 processes

    funclist = []
    for df in reader:
        # process each data frame
        f = pool.apply_async(process_frame, [df])
        funclist.append(f)

    result = 0
    for f in funclist:
        result += f.get(timeout=10)  # timeout in 10 seconds
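If you don't need multiple processes, the same chunked pattern also works single-process; a minimal sketch (the file name and chunk size are placeholders):
import pandas as pd
total_rows = 0
for chunk in pd.read_csv("yourfile.csv", chunksize=100000):
    # process each chunk here; this example only counts rows
    total_rows += len(chunk)
print(total_rows)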
