I have a data-frame consisting of Time Series.
Date Index | Time Series 1 | Time Series 2 | ... and so on
I have used pyRserve to run a forecasting function using R.
I want to implement parallel processing using celery.
I have written the worker code in the following context.
def pipeR(k #input variable):
conn = pyRserve.connect(host = 'localhost', port = 6311)
# OPENING THE CONNECTION TO R
conn.r.i = k
# ASSIGNING THE PYTHON VARIABLE TO THAT OF IN THE R ENVIRONMENT
conn.voideval\('''
WKR_Func <- forecst(a)
{
...# FORECASTS THE TIMESERIES IN COLUMN a OF THE DATAFRAME
}
''')
conn.eval('forecst(i)')
# CALLING THE FUNCTION IN R
group(pipeR.s(k) for k in [...list of column headers...])()
To implement parallel processing, can I have a single port for all the worker process (like I did in the above code, port:6311) or should I have different ports for different worker processes ??
I'm currently getting an error
Error in socketConnection("localhost", port=port, server=TRUE, blocking=TRUE, : cannot open the connection
in R.
The problem got resolved when I opened different ports for each worker process...
def pipeR( k, Frequency, Horizon, Split, wd_path):
# GENERATING A RANDOM PORT
port = randint(1000,9999)
# OPENING THE PORT IN THE R ENVIRONMENT
conn0 = pyRserve.connect(host = 'localhost', port = 6311)
conn0.r.port = port
conn0.voidEval\
('''
library(Rserve)
Rserve(port = port, args = '--no-save')
''')
# OPENING THE PORT IN THE PYTHON ENVIRONMENT
conn = pyRserve.connect(host = 'localhost', port = port)
# ASSIGNING THE PYTHON VARIABLE TO THAT OF IN THE R ENVIRONMENT
conn.r.i = k
conn.voideval\
('''
WKR_Func <- forecst(a)
{
...# FORECASTS THE TIMESERIES IN COLUMN a OF THE DATAFRAME
}
''')
conn.eval/('forecst(i)')
conn0.close()
Related
I'm using an Amazon SageMaker Notebook that has 72 cores and 144 GB RAM, and I carried out 2 tests with a sample of the whole data to check if the Dask cluster was working.
The sample has 4500 rows and 735 columns from 5 different "assets" (I mean 147 columns for each asset). The code is filtering the columns and creating a feature matrix for each filtered Dataframe.
First, I initialized the cluster as follows, I received 72 workers, and got 17 minutes of running. (I assume I created 72 workers with one core each.)
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(processes=True,n_workers=72,threads_per_worker=72)
def main():
import featuretools as ft
list_columns = list(df_concat_02.columns)
list_df_features=[]
from tqdm.notebook import tqdm
for asset in tqdm(list_columns,total=len(list_columns)):
dataframe = df_sma.filter(regex="^"+asset, axis=1).reset_index()
es = ft.EntitySet()
es = es.entity_from_dataframe(entity_id = 'MARKET', dataframe =dataframe,
index = 'index',
time_index = 'Date')
fm, features = ft.dfs(entityset=es,
target_entity='MARKET',
trans_primitives = ['divide_numeric'],
agg_primitives = [],
max_depth=1,
verbose=True,
dask_kwargs={'cluster': client.scheduler.address}
)
list_df_features.append(fm)
return list_df_features
if __name__ == "__main__":
list_df = main()
Second, I initialized the cluster as follows, I received 9 workers, and got 3,5 minutes of running. (I assume I created 9 workers with 8 cores each.)
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(processes=True)
def main():
import featuretools as ft
list_columns = list(df_concat_02.columns)
list_df_features=[]
from tqdm.notebook import tqdm
for asset in tqdm(list_columns,total=len(list_columns)):
dataframe = df_sma.filter(regex="^"+asset, axis=1).reset_index()
es = ft.EntitySet()
es = es.entity_from_dataframe(entity_id = 'MARKET', dataframe =dataframe,
index = 'index',
time_index = 'Date')
fm, features = ft.dfs(entityset=es,
target_entity='MARKET',
trans_primitives = ['divide_numeric'],
agg_primitives = [],
max_depth=1,
verbose=True,
dask_kwargs={'cluster': client.scheduler.address}
)
list_df_features.append(fm)
return list_df_features
if __name__ == "__main__":
list_df = main()
For me, it's mind-blowing because I thought that 72 workers could carry the work out faster! Once I'm not a specialist neither in Dask nor in FeatureTools I guess that I'm setting something wrong.
I would appreciate any kind of help and advice!
Thank you!
You are correctly setting dask_kwargs in DFS. I think the slow down happens as a result of additional overhead and less cores in each worker. The more workers there are, the more overhead exists from transmitting data. Additionally, 8 cores from 1 worker can be leveraged to make computations run faster than 1 core from 8 workers.
I am trying to run Julia function via R using XRJulia package. Below is my code snippet.
## start
library(XRJulia)
prevInterface <- XR::getInterface()
if (is.null(prevInterface)) {
ev <- RJulia(.makeNew = TRUE)
} else {
ev <- RJulia(.makeNew = FALSE)
}
juliaAddToPath(directory = '/home/.julia/lib/v0.6/', package = NULL, evaluator = ev)
runjl <- juliaEval('function sum(a, b)
c= a+b;
return c
end
')
runjl_function <- JuliaFunction(runjl)
sum_result <- runjl_function(1, 5)
XR::rmInterface(XR::getInterface())
## end
This code is working fine. But few times when I am running above code multiple times I am getting
error: Unable to start Julia connection on port 1023: all connections
are in use.
How to close all connections of Julia and what is the systematic way..? Please suggest.
You have the function ServerQuit() in the RJuliaConnect:
https://github.com/johnmchambers/XRJulia/blob/master/R/RJuliaConnect.R
Based on the famous check_blas.py script, I wrote this one to check that theano can in fact use multiple cores:
import os
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['GOTO_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['THEANO_FLAGS'] = 'device=cpu,blas.ldflags=-lblas -lgfortran'
import numpy
import theano
import theano.tensor as T
M=2000
N=2000
K=2000
iters=100
order='C'
a = theano.shared(numpy.ones((M, N), dtype=theano.config.floatX, order=order))
b = theano.shared(numpy.ones((N, K), dtype=theano.config.floatX, order=order))
c = theano.shared(numpy.ones((M, K), dtype=theano.config.floatX, order=order))
f = theano.function([], updates=[(c, 0.4 * c + .8 * T.dot(a, b))])
for i in range(iters):
f(y)
Running this as python3 check_theano.py shows that 8 threads are being used. And more importantly, the code runs approximately 9 times faster than without the os.environ settings, which apply just 1 core: 7.863s vs 71.292s on a single run.
So, I would expect that Keras now also uses multiple cores when calling fit (or predict for that matter). However this is not the case for the following code:
import os
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['GOTO_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['THEANO_FLAGS'] = 'device=cpu,blas.ldflags=-lblas -lgfortran'
import numpy
from keras.models import Sequential
from keras.layers import Dense
coeffs = numpy.random.randn(100)
x = numpy.random.randn(100000, 100);
y = numpy.dot(x, coeffs) + numpy.random.randn(100000) * 0.01
model = Sequential()
model.add(Dense(20, input_shape=(100,)))
model.add(Dense(1, input_shape=(20,)))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit(x, y, verbose=0, nb_epoch=10)
This script uses only 1 core with this output:
Using Theano backend.
/home/herbert/venv3/lib/python3.4/site-packages/theano/tensor/signal/downsample.py:5: UserWarning: downsample module has been moved to the pool module.
warnings.warn("downsample module has been moved to the pool module.")
Why does the fit of Keras only use 1 core for the same setup? Is the check_blas.py script actually representative for neural network training calculations?
FYI:
(venv3)herbert#machine:~/ $ python3 -c 'import numpy, theano, keras; print(numpy.__version__); print(theano.__version__); print(keras.__version__);'
ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your nvcc installation and try again.
1.11.0
0.8.0rc1.dev-e6e88ce21df4fbb21c76e68da342e276548d4afd
0.3.2
(venv3)herbert#machine:~/ $
EDIT
I created a Theano implementaiton of a simple MLP as well, which also does not run multi-core:
import os
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['GOTO_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['THEANO_FLAGS'] = 'device=cpu,blas.ldflags=-lblas -lgfortran'
import numpy
import theano
import theano.tensor as T
M=2000
N=2000
K=2000
iters=100
order='C'
coeffs = numpy.random.randn(100)
x = numpy.random.randn(100000, 100).astype(theano.config.floatX)
y = (numpy.dot(x, coeffs) + numpy.random.randn(100000) * 0.01).astype(theano.config.floatX).reshape(100000, 1)
x_shared = theano.shared(x)
y_shared = theano.shared(y)
x_tensor = T.matrix('x')
y_tensor = T.matrix('y')
W0_values = numpy.asarray(
numpy.random.uniform(
low=-numpy.sqrt(6. / 120),
high=numpy.sqrt(6. / 120),
size=(100, 20)
),
dtype=theano.config.floatX
)
W0 = theano.shared(value=W0_values, name='W0', borrow=True)
b0_values = numpy.zeros((20,), dtype=theano.config.floatX)
b0 = theano.shared(value=b0_values, name='b0', borrow=True)
output0 = T.dot(x_tensor, W0) + b0
W1_values = numpy.asarray(
numpy.random.uniform(
low=-numpy.sqrt(6. / 120),
high=numpy.sqrt(6. / 120),
size=(20, 1)
),
dtype=theano.config.floatX
)
W1 = theano.shared(value=W1_values, name='W1', borrow=True)
b1_values = numpy.zeros((1,), dtype=theano.config.floatX)
b1 = theano.shared(value=b1_values, name='b1', borrow=True)
output1 = T.dot(output0, W1) + b1
params = [W0, b0, W1, b1]
cost = ((output1 - y_tensor) ** 2).sum()
gradients = [T.grad(cost, param) for param in params]
learning_rate = 0.0000001
updates = [
(param, param - learning_rate * gradient)
for param, gradient in zip(params, gradients)
]
train_model = theano.function(
inputs=[],#x_tensor, y_tensor],
outputs=cost,
updates=updates,
givens={
x_tensor: x_shared,
y_tensor: y_shared
}
)
errors = []
for i in range(1000):
errors.append(train_model())
print(errors[0:50:])
Keras and TF themselves don't use whole cores and capacity of CPU! If you are interested in using all 100% of your CPU then the multiprocessing.Pool basically creates a pool of jobs that need doing. The processes will pick up these jobs and run them. When a job is finished, the process will pick up another job from the pool.
NB: If you want to just speed up this model, look into GPUs or changing the hyperparameters like batch size and number of neurons (layer size).
Here's how you can use multiprocessing to train multiple models at the same time (using processes running in parallel on each separate CPU core of your machine).
This answer inspired by #repploved
import time
import signal
import multiprocessing
def init_worker():
''' Add KeyboardInterrupt exception to mutliprocessing workers '''
signal.signal(signal.SIGINT, signal.SIG_IGN)
def train_model(layer_size):
'''
This code is parallelized and runs on each process
It trains a model with different layer sizes (hyperparameters)
It saves the model and returns the score (error)
'''
import keras
from keras.models import Sequential
from keras.layers import Dense
print(f'Training a model with layer size {layer_size}')
# build your model here
model_RNN = Sequential()
model_RNN.add(Dense(layer_size))
# fit the model (the bit that takes time!)
model_RNN.fit(...)
# lets demonstrate with a sleep timer
time.sleep(5)
# save trained model to a file
model_RNN.save(...)
# you can also return values eg. the eval score
return model_RNN.evaluate(...)
num_workers = 4
hyperparams = [800, 960, 1100]
pool = multiprocessing.Pool(num_workers, init_worker)
scores = pool.map(train_model, hyperparams)
print(scores)
Output:
Training a model with layer size 800
Training a model with layer size 960
Training a model with layer size 1100
[{'size':960,'score':1.0}, {'size':800,'score':1.2}, {'size':1100,'score':0.7}]
This is easily demonstrated with a time.sleep in the code. You'll see that all 3 processes start the training job, and then they all finish at about the same time. If this was single processed, you'd have to wait for each to finish before starting the next (yawn!).
Is it possible to manually add a task to the redis queue, so that it can be executed by a redis worker?
As a simple example, I'm launching a worker using:
require('doRedis')
redisWorker('jobs')
In another R session, I'm creating a queue and would like to send a simple expression (for example: print("hello world")) to the queue so that it is executed by the worker.
I know how to do it using foreach:
require('doRedis')
registerDoRedis('jobs')
foreach(j=1,.combine=sum,.multicombine=TRUE) %dopar% {
print("hello world")
1
}
I would like to be able to add a task to the queue without using foreach. The reason is, I do not want to have my R session wait for the output (the script would write its results to the disk).
Here is what I have tried so far, based on code in the .doRedis() function:
data <- list(queue = "jobs")
queue <- data$queue
queueCounter <- sprintf("%s:counter", queue) # job task ID counter
ID <- redisIncr(queueCounter)
queueEnv <- sprintf("%s:%.0f.env",queue,ID) # R job environment
queueTasks <- sprintf("%s:%.0f",queue,ID) # Job tasks hash
queueResults <- sprintf("%s:%.0f.results",queue,ID) # Output values
queueStart <- sprintf("%s:%.0f.start*",queue,ID)
queueAlive <- sprintf("%s:%.0f.alive*",queue,ID)
# add the environment to the queue
redisSet(key = queueEnv,
value = list(expr=expression(),
exportenv=baseenv(),
packages=NULL)
# put tasks in queue
taskblock <- list(ex1 <- expression('print("hello world")'))
j <- 1
taskLabel <- I
task_id = as.character(taskLabel(j))
task <- list(task_id=task_id, args=taskblock)
redisHSet(key = queueTasks,
field = task_id,
value = task)
redisRPush(key = queue, value = ID)
It doesn't work, and I think (at least) the definition of the environment is wrong...
Any help is very welcome !
In a loop, I start processes from within R with the following function:
system(mycommand.exe, ... )
Is there an elegant way of setting the process priority of mycommand.exe?
You can use psnice. If you use fork you can spawn a new pid.
psnice(pid = Sys.getpid(), value = NA_integer_)
system(mycommand.exe, ...)
or
new_pid <- fork(slave=NULL)
psnice(pid = new_pid, value = NA_integer_)
https://stat.ethz.ch/R-manual/R-devel/library/tools/html/psnice.html
http://svitsrv25.epfl.ch/R-doc/library/fork/html/fork.html