Pitch Calculation Error via Autocorrelation Method - r

Aim : Pitch Calculation
Issue : The calculated pitch does not match the expected one. For instance, the output is approx. 'D3', however the expected output is 'C5'.
Source Sound : https://freewavesamples.com/1980s-casio-celesta-c5
Source Code
library("tuneR")
library("seewave")
#0: Acquisition of sample sound
snd_smpl = readWave(paste("~/Music/sample/1980s-Casio-Celesta-C5.wav"),
from = 0, to = 1, units = "seconds")
dur_smpl = duration(snd_smpl)
len_smpl = length(snd_smpl)
#1 : Pre-Processing Stage
#1.1 : Application of Hanning Window
n = 1:len_smpl
han_win = 0.5-0.5*cos(2*pi*n/(len_smpl-1))
wind_sig = han_win*snd_smpl#left
#2.1 : Auto-Correlation Calculation
rev_wind_sig = rev(wind_sig) #Reversing the windowed signal
acorr_1 = convolve(wind_sig, rev_wind_sig, type = "open")
# Obtaining the 2nd half of the correlation, to simplify calculation
n = 2*len_smpl-1
acorr_2 = (1/len_smpl)*acorr_1[len_smpl:n]
#2.2 : Note Calculation
min_index = which.min(acorr_2)
print(min_index)
fs = 44100
fo = fs/min_index #To obtain fundamental frequency
print(fo)
print(notenames(noteFromFF(fo)))
Output
> print(min_index)
[1] 37
> fs = 44100
> fo = fs/min_index
> print(fo)
[1] 1191.892
> print(notenames(noteFromFF(fo)))
[1] "d'''"
The entire calculation is performed in the Time Domain.
I'm currently using autocorrelation as a base to understand more about Pitch Detection & Analysis. I've tried to analyse the sample with 'Audacity' and the result is 'C5'. Hence, I'm wondering where actually the issue is.
Can you all help me find it?
Also, there are a few but important doubts:
How small should actually my analysis window be (20ms, 1s,..)?
Will reinforcement of the Autocorrelation Algorithm with AMDF and other similar algorithms make this Pitch Detection module more robust?

This whole analysis seems not correct. You should not use windowing in time domain analysis.
Attached a short solution in the python language; you can use it as pseudocode
from soundfile import read
from glob import glob
from scipy.signal import correlate, find_peaks
from matplotlib.pyplot import plot, show, xlim, title, xlabel
import numpy as np
%matplotlib inline
name = glob('*wav')[0]
samples, fs = read(name)
corr = correlate(samples, samples)
corr = corr[corr.size / 2:]
time = np.arange(corr.size) / float(fs)
ind = find_peaks(corr[time < 0.002])[0]
plot(time, corr)
plot(time[ind], corr[ind], '*')
xlim([0, 0.005])
title('Frequency = {} Hz'.format(1 / time[ind][0]))
xlabel('Time [Sec]')
show()

Related

RuntimeError: quantile() q tensor must be same dtype as the input tensor in pytorch-forecasting

PyTorch-Forecasting version: 0.10.2
PyTorch version:1.12.1
Python version:3.10.4
Operating System: windows
Expected behavior
No Error
Actual behavior
The Error is
File c:\Users\josepeeterson.er\Miniconda3\envs\pytorch\lib\site-packages\pytorch_forecasting\metrics\base_metrics.py:979, in DistributionLoss.to_quantiles(self, y_pred, quantiles, n_samples)
977 except NotImplementedError: # resort to derive quantiles empirically
978 samples = torch.sort(self.sample(y_pred, n_samples), -1).values
--> 979 quantiles = torch.quantile(samples, torch.tensor(quantiles, device=samples.device), dim=2).permute(1, 2, 0)
980 return quantiles
RuntimeError: quantile() q tensor must be same dtype as the input tensor
How do I set them to be of same datatype? This is happening internally. I do not have control over this. I am not using any GPUs.
The link to the .csv file with input data is https://github.com/JosePeeterson/Demand_forecasting
The data is just sampled from a negative binomila distribution wiht parameters (9,0.5) every 4 hours. the time inbetween is all zero.
I just want to see if DeepAR can learn this pattern.
Code to reproduce the problem
from pytorch_forecasting.data.examples import generate_ar_data
import matplotlib.pyplot as plt
import pandas as pd
from pytorch_forecasting.data import TimeSeriesDataSet
from pytorch_forecasting.data import NaNLabelEncoder
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor
import pytorch_lightning as pl
from pytorch_forecasting import NegativeBinomialDistributionLoss, DeepAR
import torch
from pytorch_forecasting.data.encoders import TorchNormalizer
data = [pd.read_csv('1_f_nbinom_train.csv')]
data["date"] = pd.Timestamp("2021-08-24") + pd.to_timedelta(data.time_idx, "H")
data['_hour_of_day'] = str(data["date"].dt.hour)
data['_day_of_week'] = str(data["date"].dt.dayofweek)
data['_day_of_month'] = str(data["date"].dt.day)
data['_day_of_year'] = str(data["date"].dt.dayofyear)
data['_week_of_year'] = str(data["date"].dt.weekofyear)
data['_month_of_year'] = str(data["date"].dt.month)
data['_year'] = str(data["date"].dt.year)
max_encoder_length = 60
max_prediction_length = 20
training_cutoff = data["time_idx"].max() - max_prediction_length
training = TimeSeriesDataSet(
data.iloc[0:-620],
time_idx="time_idx",
target="value",
categorical_encoders={"series": NaNLabelEncoder(add_nan=True).fit(data.series), "_hour_of_day": NaNLabelEncoder(add_nan=True).fit(data._hour_of_day), \
"_day_of_week": NaNLabelEncoder(add_nan=True).fit(data._day_of_week), "_day_of_month" : NaNLabelEncoder(add_nan=True).fit(data._day_of_month), "_day_of_year" : NaNLabelEncoder(add_nan=True).fit(data._day_of_year), \
"_week_of_year": NaNLabelEncoder(add_nan=True).fit(data._week_of_year), "_year": NaNLabelEncoder(add_nan=True).fit(data._year)},
group_ids=["series"],
min_encoder_length=max_encoder_length,
max_encoder_length=max_encoder_length,
min_prediction_length=max_prediction_length,
max_prediction_length=max_prediction_length,
time_varying_unknown_reals=["value"],
time_varying_known_categoricals=["_hour_of_day","_day_of_week","_day_of_month","_day_of_year","_week_of_year","_year" ],
time_varying_known_reals=["time_idx"],
add_relative_time_idx=False,
randomize_length=None,
scalers=[],
target_normalizer=TorchNormalizer(method="identity",center=False,transformation=None )
)
validation = TimeSeriesDataSet.from_dataset(
training,
data.iloc[-620:-420],
# predict=True,
stop_randomization=True,
)
batch_size = 64
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=8)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size, num_workers=8)
# save datasets
training.save("training.pkl")
validation.save("validation.pkl")
early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=5, verbose=False, mode="min")
lr_logger = LearningRateMonitor()
trainer = pl.Trainer(
max_epochs=10,
gpus=0,
gradient_clip_val=0.1,
limit_train_batches=30,
limit_val_batches=3,
# fast_dev_run=True,
# logger=logger,
# profiler=True,
callbacks=[lr_logger, early_stop_callback],
)
deepar = DeepAR.from_dataset(
training,
learning_rate=0.1,
hidden_size=32,
dropout=0.1,
loss=NegativeBinomialDistributionLoss(),
log_interval=10,
log_val_interval=3,
# reduce_on_plateau_patience=3,
)
print(f"Number of parameters in network: {deepar.size()/1e3:.1f}k")
torch.set_num_threads(10)
trainer.fit(
deepar,
train_dataloaders=train_dataloader,
val_dataloaders=val_dataloader,
)
Need to cast samples to torch.tensor as shown below. Then save this base_metrics.py and rerun above code.
except NotImplementedError: # resort to derive quantiles empirically
samples = torch.sort(self.sample(y_pred, n_samples), -1).values
quantiles = torch.quantile(torch.tensor(samples), torch.tensor(quantiles, device=samples.device), dim=2).permute(1, 2, 0)
return quantiles

Optimizing Distributed I/O with serial output

I am having trouble understanding how to optimize a distributed component with a serial output. This is my attempt with an example problem given in the openmdao docs.
import numpy as np
import openmdao.api as om
from openmdao.utils.array_utils import evenly_distrib_idxs
from openmdao.utils.mpi import MPI
class MixedDistrib2(om.ExplicitComponent):
def setup(self):
# Distributed Input
self.add_input('in_dist', shape_by_conn=True, distributed=True)
# Serial Input
self.add_input('in_serial', val=1)
# Distributed Output
self.add_output('out_dist', copy_shape='in_dist', distributed=True)
# Serial Output
self.add_output('out_serial', copy_shape='in_serial')
#self.declare_partials('*','*', method='cs')
def compute(self, inputs, outputs):
x = inputs['in_dist']
y = inputs['in_serial']
# "Computationally Intensive" operation that we wish to parallelize.
f_x = x**2 - 2.0*x + 4.0
# These operations are repeated on all procs.
f_y = y ** 0.5
g_y = y**2 + 3.0*y - 5.0
# Compute square root of our portion of the distributed input.
g_x = x ** 0.5
# Distributed output
outputs['out_dist'] = f_x + f_y
# Serial output
if MPI and comm.size > 1:
# We need to gather the summed values to compute the total sum over all procs.
local_sum = np.array(np.sum(g_x))
total_sum = local_sum.copy()
self.comm.Allreduce(local_sum, total_sum, op=MPI.SUM)
outputs['out_serial'] = g_y * total_sum
else:
# Recommended to make sure your code can run in serial too, for testing.
outputs['out_serial'] = g_y * np.sum(g_x)
size = 7
if MPI:
comm = MPI.COMM_WORLD
rank = comm.rank
sizes, offsets = evenly_distrib_idxs(comm.size, size)
else:
# When running in serial, the entire variable is on rank 0.
rank = 0
sizes = {rank : size}
offsets = {rank : 0}
prob = om.Problem()
model = prob.model
# Create a distributed source for the distributed input.
ivc = om.IndepVarComp()
ivc.add_output('x_dist', np.zeros(sizes[rank]), distributed=True)
ivc.add_output('x_serial', val=1)
model.add_subsystem("indep", ivc)
model.add_subsystem("D1", MixedDistrib2())
model.add_subsystem('con_cmp1', om.ExecComp('con1 = y**2'), promotes=['con1', 'y'])
model.connect('indep.x_dist', 'D1.in_dist')
model.connect('indep.x_serial', ['D1.in_serial','y'])
prob.driver = om.ScipyOptimizeDriver()
prob.driver.options['optimizer'] = 'SLSQP'
model.add_design_var('indep.x_serial', lower=5, upper=10)
model.add_constraint('con1', upper=90)
model.add_objective('D1.out_serial')
prob.setup(force_alloc_complex=True)
#prob.setup()
# Set initial values of distributed variable.
x_dist_init = [1,1,1,1,1,1,1]
prob.set_val('indep.x_dist', x_dist_init)
# Set initial values of serial variable.
prob.set_val('indep.x_serial', 10)
#prob.run_model()
prob.run_driver()
print('x_dist', prob.get_val('indep.x_dist', get_remote=True))
print('x_serial', prob.get_val('indep.x_serial'))
print('Obj', prob.get_val('D1.out_serial'))
The problem is with defining partials with 'fd' or 'cs'. I cannot define partials of serial output w.r.t distributed input. So I used prob.setup(force_alloc_complex=True) to use complex step. But gives me this warning DerivativesWarning:Constraints or objectives [('D1.out_serial', inds=[0])] cannot be impacted by the design variables of the problem. I understand this is because the total derivative is 0 which causes the warning but I dont understand the reason. Clearly the total derivative should not be 0 here. But I guess this is because I didn't explicitly declare_partials in the component. I tried removing the distributed components and ran it again with declare_partials and this works correctly(code below).
import numpy as np
import openmdao.api as om
class MixedDistrib2(om.ExplicitComponent):
def setup(self):
self.add_input('in_dist', np.zeros(7))
self.add_input('in_serial', val=1)
self.add_output('out_serial', val=0)
self.declare_partials('*','*', method='cs')
def compute(self, inputs, outputs):
x = inputs['in_dist']
y = inputs['in_serial']
g_y = y**2 + 3.0*y - 5.0
g_x = x ** 0.5
outputs['out_serial'] = g_y * np.sum(g_x)
prob = om.Problem()
model = prob.model
model.add_subsystem("D1", MixedDistrib2(), promotes_inputs=['in_dist', 'in_serial'], promotes_outputs=['out_serial'])
model.add_subsystem('con_cmp1', om.ExecComp('con1 = in_serial**2'), promotes=['con1', 'in_serial'])
prob.driver = om.ScipyOptimizeDriver()
prob.driver.options['optimizer'] = 'SLSQP'
model.add_design_var('in_serial', lower=5, upper=10)
model.add_constraint('con1', upper=90)
model.add_objective('out_serial')
prob.setup(force_alloc_complex=True)
prob.set_val('in_dist', [1,1,1,1,1,1,1])
prob.set_val('in_serial', 10)
prob.run_model()
prob.check_totals()
prob.run_driver()
print('x_dist', prob.get_val('in_dist', get_remote=True))
print('x_serial', prob.get_val('in_serial'))
print('Obj', prob.get_val('out_serial'))
What I am trying to understand is
How to use 'fd' or 'cs' in Distributed component with a serial output?
What is the meaning of prob.setup(force_alloc_complex=True) ? Is not forcing to use cs in all the components in the problem ? If so why does the total derivative becomes 0?
When I run your code in OpenMDAO V 3.11.0 (after uncommenting the declare_partials call) I get the following error:
RuntimeError: 'D1' <class MixedDistrib2>: component has defined partial ('out_serial', 'in_dist') which is a serial output wrt a distributed input. This is only supported using the matrix free API.
As the error indicates, you can't use the matrix-based api for derivatives in this situations. The reasons why are a bit subtle, and probably outside the scope of what needs to be delt with to answer your question here. It boils down to OpenMDAO not knowing why kind of distributed operations are being done in the compute and having no way to manage those details when you propagate things in reverse.
So you need to use the matrix-free derivative APIs in this situation. When you use the matrix-free APIs you DO NOT declare any partials, because you don't want OpenMDAO to allocate any memory for you to store partials in (and you wouldn't use that memory even if it did).
I've coded them for your example here, but I need to note a few important details:
Your example has a distributed IVC, but as of OpenMDAO V3.11.0 you can't get total derivatives with respect to distributed design variables. I assume you just made it that way to make your simple test case, but in case your real problem was set up this way, you need to note this and not do it this way. Instead, make the IVC serial, and use src indices to distribute the correct parts to each proc.
In the example below, the derivatives are correct. However, there seems to be a bug in the check_partials output when running in paralle. So the reverse mode partials look like they are off by a factor of the comm size... this will have to get fixed in later releases.
I only did the derivatives for out_serial. out_dist will work similarly and is left as an excersize for the reader :)
You'll notice that I duplicates some code in the compute and compute_jacvec_product methods. You can abstract this duplicate code out into its own method (or call compute from within compute_jacvec_product by providing your own output dictionary). However, you might be asking why the duplicate call is needed at all? Why can't u store the values from the compute call. The answer is, in large part, that OpenMDAO does not guarantee that compute is always called before compute_jacvec_product. However, I'll also point out that this kind of code duplication is very AD-like. Any AD code will have the same kind of duplication built in, even though you don't see it.
import numpy as np
import openmdao.api as om
from openmdao.utils.array_utils import evenly_distrib_idxs
from openmdao.utils.mpi import MPI
class MixedDistrib2(om.ExplicitComponent):
def setup(self):
# Distributed Input
self.add_input('in_dist', shape_by_conn=True, distributed=True)
# Serial Input
self.add_input('in_serial', val=1)
# Distributed Output
self.add_output('out_dist', copy_shape='in_dist', distributed=True)
# Serial Output
self.add_output('out_serial', copy_shape='in_serial')
# self.declare_partials('*','*', method='fd')
def compute(self, inputs, outputs):
x = inputs['in_dist']
y = inputs['in_serial']
# "Computationally Intensive" operation that we wish to parallelize.
f_x = x**2 - 2.0*x + 4.0
# These operations are repeated on all procs.
f_y = y ** 0.5
g_y = y**2 + 3.0*y - 5.0
# Compute square root of our portion of the distributed input.
g_x = x ** 0.5
# Distributed output
outputs['out_dist'] = f_x + f_y
# Serial output
if MPI and comm.size > 1:
# We need to gather the summed values to compute the total sum over all procs.
local_sum = np.array(np.sum(g_x))
total_sum = local_sum.copy()
self.comm.Allreduce(local_sum, total_sum, op=MPI.SUM)
outputs['out_serial'] = g_y * total_sum
else:
# Recommended to make sure your code can run in serial too, for testing.
outputs['out_serial'] = g_y * np.sum(g_x)
def compute_jacvec_product(self, inputs, d_inputs, d_outputs, mode):
x = inputs['in_dist']
y = inputs['in_serial']
g_y = y**2 + 3.0*y - 5.0
# "Computationally Intensive" operation that we wish to parallelize.
f_x = x**2 - 2.0*x + 4.0
# These operations are repeated on all procs.
f_y = y ** 0.5
g_y = y**2 + 3.0*y - 5.0
# Compute square root of our portion of the distributed input.
g_x = x ** 0.5
# Distributed output
out_dist = f_x + f_y
# Serial output
if MPI and comm.size > 1:
# We need to gather the summed values to compute the total sum over all procs.
local_sum = np.array(np.sum(g_x))
total_sum = local_sum.copy()
self.comm.Allreduce(local_sum, total_sum, op=MPI.SUM)
# total_sum
else:
# Recommended to make sure your code can run in serial too, for testing.
total_sum = np.sum(g_x)
num_x = len(x)
d_f_x__d_x = np.diag(2*x - 2.)
d_f_y__d_y = np.ones(num_x)*0.5*y**-0.5
d_g_y__d_y = 2*y + 3.
d_g_x__d_x = 0.5*x**-0.5
d_out_dist__d_x = d_f_x__d_x # square matrix
d_out_dist__d_y = d_f_y__d_y # num_x,1
d_out_serial__d_y = d_g_y__d_y # scalar
d_out_serial__d_x = g_y*d_g_x__d_x.reshape((1,num_x))
if mode == 'fwd':
if 'out_serial' in d_outputs:
if 'in_dist' in d_inputs:
d_outputs['out_serial'] += d_out_serial__d_x.dot(d_inputs['in_dist'])
if 'in_serial' in d_inputs:
d_outputs['out_serial'] += d_out_serial__d_y.dot(d_inputs['in_serial'])
elif mode == 'rev':
if 'out_serial' in d_outputs:
if 'in_dist' in d_inputs:
d_inputs['in_dist'] += d_out_serial__d_x.T.dot(d_outputs['out_serial'])
if 'in_serial' in d_inputs:
d_inputs['in_serial'] += total_sum*d_out_serial__d_y.T.dot(d_outputs['out_serial'])
size = 7
if MPI:
comm = MPI.COMM_WORLD
rank = comm.rank
sizes, offsets = evenly_distrib_idxs(comm.size, size)
else:
# When running in serial, the entire variable is on rank 0.
rank = 0
sizes = {rank : size}
offsets = {rank : 0}
prob = om.Problem()
model = prob.model
# Create a distributed source for the distributed input.
ivc = om.IndepVarComp()
ivc.add_output('x_dist', np.zeros(sizes[rank]), distributed=True)
ivc.add_output('x_serial', val=1)
model.add_subsystem("indep", ivc)
model.add_subsystem("D1", MixedDistrib2())
model.add_subsystem('con_cmp1', om.ExecComp('con1 = y**2'), promotes=['con1', 'y'])
model.connect('indep.x_dist', 'D1.in_dist')
model.connect('indep.x_serial', ['D1.in_serial','y'])
prob.driver = om.ScipyOptimizeDriver()
prob.driver.options['optimizer'] = 'SLSQP'
model.add_design_var('indep.x_serial', lower=5, upper=10)
model.add_constraint('con1', upper=90)
model.add_objective('D1.out_serial')
prob.setup(force_alloc_complex=True)
#prob.setup()
# Set initial values of distributed variable.
x_dist_init = np.ones(sizes[rank])
prob.set_val('indep.x_dist', x_dist_init)
# Set initial values of serial variable.
prob.set_val('indep.x_serial', 10)
prob.run_model()
prob.check_partials()
# prob.run_driver()
print('x_dist', prob.get_val('indep.x_dist', get_remote=True))
print('x_serial', prob.get_val('indep.x_serial'))
print('Obj', prob.get_val('D1.out_serial'))

Tensorflow: 6 layer CNN: OOM (use 10Gb GPU memory)

I am using the following code for running a 6 layer CNN with 2 FC layers on top (on Tesla K-80 GPU).
Somehow, it consumes entire memory 10GB and died out of memory.I know that i can reduce the batch_size and then run , but i also want to run with 15 or 20 CNN layers.Whats wrong with the following code and why it takes all the memory? How should i run the code for 15 layers CNN.
Code:
import model
with tf.Graph().as_default() as g_train:
filenames = tf.train.match_filenames_once(FLAGS.train_dir+'*.tfrecords')
filename_queue = tf.train.string_input_producer(filenames, shuffle=True, num_epochs=FLAGS.num_epochs)
feats,labels = get_batch_input(filename_queue, batch_size=FLAGS.batch_size)
### feats size=(batch_size, 100, 50)
logits = model.inference(feats, FLAGS.batch_size)
loss = model.loss(logits, labels, feats)
tvars = tf.trainable_variables()
global_step = tf.Variable(0, name='global_step', trainable=False)
# Add to the Graph operations that train the model.
train_op = model.training(loss, tvars, global_step, FLAGS.learning_rate, FLAGS.clip_gradients)
# Add the Op to compare the logits to the labels during evaluation.
eval_correct = model.evaluation(logits, labels, feats)
summary_op = tf.merge_all_summaries()
saver = tf.train.Saver(tf.all_variables(), max_to_keep=15)
# The op for initializing the variables.
init_op = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init_op)
summary_writer = tf.train.SummaryWriter(FLAGS.model_dir,
graph=sess.graph)
# Start input enqueue threads.
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
try:
step = 0
while not coord.should_stop():
_, loss_value = sess.run([train_op, loss])
if step % 100 == 0:
print('Step %d: loss = %.2f (%.3f sec)' % (step, loss_value))
# Update the events file.
summary_str = sess.run(summary_op)
summary_writer.add_summary(summary_str, step)
if (step == 0) or (step + 1) % 1000 == 0 or (step + 1) == FLAGS.max_steps:
ckpt_model = os.path.join(FLAGS.model_dir, 'model.ckpt')
saver.save(sess, ckpt_model, global_step=step)
#saver.save(sess, FLAGS.model_dir, global_step=step)
step += 1
except tf.errors.OutOfRangeError:
print('Done training for %d epochs, %d steps.' % (FLAGS.num_epochs, step))
finally:
coord.join(threads)
sess.close()
###################### File model.py ####################
def conv2d(x, W, b, strides=1):
# Conv2D wrapper, with bias and relu activation
x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1],
padding='SAME')
x = tf.nn.bias_add(x, b)
return tf.nn.relu(x)
def maxpool2d(x, k=2,s=2):
# MaxPool2D wrapper
return tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, s,
s,1],padding='SAME')
def inference(feats,batch_size):
#feats size (batch_size,100,50,1) #batch_size=256
conv1_w=tf.get_variable("conv1_w", [filter_size,filter_size,1,256],initializer=tf.uniform_unit_scaling_initializer())
conv1_b=tf.get_variable("conv1_b",[256])
conv1 = conv2d(feats, conv1_w, conv1_b,2)
conv1 = maxpool2d(conv1, k=2,s=2)
### This was replicated for 6 layers and the 2 FC connected layers are added
return logits
def training(loss, train_vars, global_step, learning_rate, clip_gradients):
# Add a scalar summary for the snapshot loss.
tf.scalar_summary(loss.op.name, loss)
grads, _ = tf.clip_by_global_norm(tf.gradients(loss, train_vars,aggregation_method=1), clip_gradients)
optimizer = tf.train.AdamOptimizer(learning_rate)
train_op = optimizer.apply_gradients(zip(grads, train_vars), global_step=global_step)
return train_op
I am not too sure what the model python library is. If it is something you wrote and can change the setting in the optimizer I would suggest the following which I use in my own code
train_step = tf.train.AdamOptimizer(learning_rate).minimize(cost, aggregation_method = tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N)
By default the aggeragetion_method is ADD_N but if you change it to EXPERIMENTAL_ACCUMULATE_N or EXPERIMENTAL_TREE this will greatly save memory. The main memory hog in these programs is that tensorflow must save the output values at every neuron so that it can compute the gradients. Changing the aggregation_method helps a lot from my experience.
Also BTW I don't think there is anything wrong with your code. I can run out of memory on small cov-nets as well.

Keras not using multiple cores

Based on the famous check_blas.py script, I wrote this one to check that theano can in fact use multiple cores:
import os
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['GOTO_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['THEANO_FLAGS'] = 'device=cpu,blas.ldflags=-lblas -lgfortran'
import numpy
import theano
import theano.tensor as T
M=2000
N=2000
K=2000
iters=100
order='C'
a = theano.shared(numpy.ones((M, N), dtype=theano.config.floatX, order=order))
b = theano.shared(numpy.ones((N, K), dtype=theano.config.floatX, order=order))
c = theano.shared(numpy.ones((M, K), dtype=theano.config.floatX, order=order))
f = theano.function([], updates=[(c, 0.4 * c + .8 * T.dot(a, b))])
for i in range(iters):
f(y)
Running this as python3 check_theano.py shows that 8 threads are being used. And more importantly, the code runs approximately 9 times faster than without the os.environ settings, which apply just 1 core: 7.863s vs 71.292s on a single run.
So, I would expect that Keras now also uses multiple cores when calling fit (or predict for that matter). However this is not the case for the following code:
import os
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['GOTO_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['THEANO_FLAGS'] = 'device=cpu,blas.ldflags=-lblas -lgfortran'
import numpy
from keras.models import Sequential
from keras.layers import Dense
coeffs = numpy.random.randn(100)
x = numpy.random.randn(100000, 100);
y = numpy.dot(x, coeffs) + numpy.random.randn(100000) * 0.01
model = Sequential()
model.add(Dense(20, input_shape=(100,)))
model.add(Dense(1, input_shape=(20,)))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit(x, y, verbose=0, nb_epoch=10)
This script uses only 1 core with this output:
Using Theano backend.
/home/herbert/venv3/lib/python3.4/site-packages/theano/tensor/signal/downsample.py:5: UserWarning: downsample module has been moved to the pool module.
warnings.warn("downsample module has been moved to the pool module.")
Why does the fit of Keras only use 1 core for the same setup? Is the check_blas.py script actually representative for neural network training calculations?
FYI:
(venv3)herbert#machine:~/ $ python3 -c 'import numpy, theano, keras; print(numpy.__version__); print(theano.__version__); print(keras.__version__);'
ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your nvcc installation and try again.
1.11.0
0.8.0rc1.dev-e6e88ce21df4fbb21c76e68da342e276548d4afd
0.3.2
(venv3)herbert#machine:~/ $
EDIT
I created a Theano implementaiton of a simple MLP as well, which also does not run multi-core:
import os
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['GOTO_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['THEANO_FLAGS'] = 'device=cpu,blas.ldflags=-lblas -lgfortran'
import numpy
import theano
import theano.tensor as T
M=2000
N=2000
K=2000
iters=100
order='C'
coeffs = numpy.random.randn(100)
x = numpy.random.randn(100000, 100).astype(theano.config.floatX)
y = (numpy.dot(x, coeffs) + numpy.random.randn(100000) * 0.01).astype(theano.config.floatX).reshape(100000, 1)
x_shared = theano.shared(x)
y_shared = theano.shared(y)
x_tensor = T.matrix('x')
y_tensor = T.matrix('y')
W0_values = numpy.asarray(
numpy.random.uniform(
low=-numpy.sqrt(6. / 120),
high=numpy.sqrt(6. / 120),
size=(100, 20)
),
dtype=theano.config.floatX
)
W0 = theano.shared(value=W0_values, name='W0', borrow=True)
b0_values = numpy.zeros((20,), dtype=theano.config.floatX)
b0 = theano.shared(value=b0_values, name='b0', borrow=True)
output0 = T.dot(x_tensor, W0) + b0
W1_values = numpy.asarray(
numpy.random.uniform(
low=-numpy.sqrt(6. / 120),
high=numpy.sqrt(6. / 120),
size=(20, 1)
),
dtype=theano.config.floatX
)
W1 = theano.shared(value=W1_values, name='W1', borrow=True)
b1_values = numpy.zeros((1,), dtype=theano.config.floatX)
b1 = theano.shared(value=b1_values, name='b1', borrow=True)
output1 = T.dot(output0, W1) + b1
params = [W0, b0, W1, b1]
cost = ((output1 - y_tensor) ** 2).sum()
gradients = [T.grad(cost, param) for param in params]
learning_rate = 0.0000001
updates = [
(param, param - learning_rate * gradient)
for param, gradient in zip(params, gradients)
]
train_model = theano.function(
inputs=[],#x_tensor, y_tensor],
outputs=cost,
updates=updates,
givens={
x_tensor: x_shared,
y_tensor: y_shared
}
)
errors = []
for i in range(1000):
errors.append(train_model())
print(errors[0:50:])
Keras and TF themselves don't use whole cores and capacity of CPU! If you are interested in using all 100% of your CPU then the multiprocessing.Pool basically creates a pool of jobs that need doing. The processes will pick up these jobs and run them. When a job is finished, the process will pick up another job from the pool.
NB: If you want to just speed up this model, look into GPUs or changing the hyperparameters like batch size and number of neurons (layer size).
Here's how you can use multiprocessing to train multiple models at the same time (using processes running in parallel on each separate CPU core of your machine).
This answer inspired by #repploved
import time
import signal
import multiprocessing
def init_worker():
''' Add KeyboardInterrupt exception to mutliprocessing workers '''
signal.signal(signal.SIGINT, signal.SIG_IGN)
def train_model(layer_size):
'''
This code is parallelized and runs on each process
It trains a model with different layer sizes (hyperparameters)
It saves the model and returns the score (error)
'''
import keras
from keras.models import Sequential
from keras.layers import Dense
print(f'Training a model with layer size {layer_size}')
# build your model here
model_RNN = Sequential()
model_RNN.add(Dense(layer_size))
# fit the model (the bit that takes time!)
model_RNN.fit(...)
# lets demonstrate with a sleep timer
time.sleep(5)
# save trained model to a file
model_RNN.save(...)
# you can also return values eg. the eval score
return model_RNN.evaluate(...)
num_workers = 4
hyperparams = [800, 960, 1100]
pool = multiprocessing.Pool(num_workers, init_worker)
scores = pool.map(train_model, hyperparams)
print(scores)
Output:
Training a model with layer size 800
Training a model with layer size 960
Training a model with layer size 1100
[{'size':960,'score':1.0}, {'size':800,'score':1.2}, {'size':1100,'score':0.7}]
This is easily demonstrated with a time.sleep in the code. You'll see that all 3 processes start the training job, and then they all finish at about the same time. If this was single processed, you'd have to wait for each to finish before starting the next (yawn!).

sign recognition like hand written digits example in scikit-learn (python)

I watch out this example: http://scikit-learn.org/stable/auto_examples/plot_digits_classification.html#example-plot-digits-classification-py
on handwritten digits in scikit-learn python library.
i would like to prepare a 3d array (N * a* b) where N is my images number (75) and a* b is the matrix of an image (like in the example a 8x8 shape).
My problem is: i have signs in a different shapes for every image: (202, 230), (250, 322).. and give me
this error: ValueError: array dimensions must agree except for d_0 in this code:
#here there is the error:
grigiume = np.dstack(listagrigie)
print(grigiume.shape)
grigiume=np.rollaxis(grigiume,-1)
print(grigiume.shape)
There is a manner to resize all images in a standard size (i.e. 200x200) or a manner to have a 3d array with matrix(a,b) where a != from b and do not give me an error in this code:
data = digits.images.reshape((n_samples, -1))
classifier.fit(data[:n_samples / 2], digits.target[:n_samples / 2])
My code:
import os
import glob
import numpy as np
from numpy import array
listagrigie = []
path = 'resize2/'
for infile in glob.glob( os.path.join(path, '*.jpg') ):
print("current file is: " + infile )
colorato = cv2.imread(infile)
grigiscala = cv2.cvtColor(colorato,cv2.COLOR_BGR2GRAY)
listagrigie.append(grigiscala)
print(len(listagrigie))
#here there is the error:
grigiume = np.dstack(listagrigie)
print(grigiume.shape)
grigiume=np.rollaxis(grigiume,-1)
print(grigiume.shape)
#last step
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
# Create a classifier: a support vector classifier
classifier = svm.SVC(gamma=0.001)
# We learn the digits on the first half of the digits
classifier.fit(data[:n_samples / 2], digits.target[:n_samples / 2])
# Now predict the value of the digit on the second half:
expected = digits.target[n_samples / 2:]
predicted = classifier.predict(data[n_samples / 2:])
print "Classification report for classifier %s:\n%s\n" % (
classifier, metrics.classification_report(expected, predicted))
print "Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted)
for index, (image, prediction) in enumerate(
zip(digits.images[n_samples / 2:], predicted)[:4]):
pl.subplot(2, 4, index + 5)
pl.axis('off')
pl.imshow(image, cmap=pl.cm.gray_r, interpolation='nearest')
pl.title('Prediction: %i' % prediction)
pl.show()
You have to resize all your images to a fixed size. For instance using the Image class of PIL or Pillow:
from PIL import Image
image = Image.open("/path/to/input_image.jpeg")
image.thumbnail((200, 200), Image.ANTIALIAS)
image.save("/path/to/output_image.jpeg")
Edit: the above won't work, try instead resize:
from PIL import Image
image = Image.open("/path/to/input_image.jpeg")
image = image.resize((200, 200), Image.ANTIALIAS)
image.save("/path/to/output_image.jpeg")
Edit 2: there might be a way to preserve the aspect ratio and pad the rest with black pixels but I don't know how to do in a few PIL calls. You could use PIL.Image.thumbnail and use numpy to do the padding though.

Resources