In PyTorch data parallel mode, how to use a global tensor?

In the example below, I wish z_proto could be global across the different GPUs. However, in data parallel mode, it is split across the GPUs as well. How can I solve this problem? Thank you.
import torch.nn as nn
import torch.nn.functional as F

class SequencePrototypeTokenClassification(nn.Module):
    def __init__(self, seq_model, label_num):
        super(SequencePrototypeTokenClassification, self).__init__()
        self.seq_model = seq_model
        self.label_num = label_num

    def forward(self, input_ids, token_type_ids, attention_mask, labels, z_proto, n_query, target_inds):
        # seq_model is a BERT-style encoder; euclidean_dist is defined elsewhere.
        z, _ = self.seq_model(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
        z_dim = z.size(-1)
        zq = z.squeeze().view(-1, z_dim)
        dists = euclidean_dist(zq, z_proto)
        log_p_y = F.log_softmax(-dists, dim=1).view(-1, self.label_num)
        loss_val = -log_p_y.gather(1, target_inds).squeeze().view(-1).mean()
        _, y_hat = log_p_y.max(1)
        return loss_val, y_hat

Based on your above code, z_proto seems to be one of the arguments of the forward function and not part of the model. Therefore, simply storing it in a tensor on the main GPU would enable it to have the same value across GPUs.
Edit
Based on the documentation, it seems that DataParallel splits all the inputs to the forward pass function across the GPUs. One way to circumvent this is to store the tensor as a class variable inside the model object itself. You can update its value before calling the forward function if it's not a static variable.
class SequencePrototypeTokenClassification(nn.Module):
    def __init__(self, seq_model, label_num):
        ...
        self.z_proto = None
        ...

...
# Training loop
...
model.z_proto = value
model.forward()
...

It turns out that DataParallel only replicates the nn.Parameter attributes of the nn.Module. So I randomly initialized an nn.Parameter named z_proto in the module and copied the value of the tensor z_proto into that parameter. The parameter was then replicated across the 4 GPUs.
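For reference, a minimal sketch of that workaround, assuming the prototypes have shape (label_num, proto_dim), where proto_dim is the encoder's hidden size; proto_dim and new_z_proto are hypothetical names here:
import torch
import torch.nn as nn

class SequencePrototypeTokenClassification(nn.Module):
    def __init__(self, seq_model, label_num, proto_dim):
        super(SequencePrototypeTokenClassification, self).__init__()
        self.seq_model = seq_model
        self.label_num = label_num
        # A registered nn.Parameter is replicated to every GPU by DataParallel.
        self.z_proto = nn.Parameter(torch.randn(label_num, proto_dim), requires_grad=False)

# In the training loop, copy the freshly computed prototypes into the parameter in place.
# (Use model.module.z_proto if the model is wrapped in nn.DataParallel.)
with torch.no_grad():
    model.z_proto.copy_(new_z_proto)  # new_z_proto: the prototype tensor computed elsewhere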

Related

Do you have to write a custom training loop to get the loss value in each epoch with Flux.jl?

Flux.jl provides a helpful train! function which, when paired with the @epochs macro, can serve as the main training loop. However, unlike most custom training loops, there is no output on the accuracy/loss of the model during each training epoch. The train! function does provide an optional callback which seems like it could be used to show the training accuracy, but I am unsure how I would do this. Is it possible to get these values using @epochs and train!, or do I have to write a custom training loop?
One pattern is that you can construct the loss function as a do block. This anonymous function becomes the first argument of train!. It can contain whatever printing or logging you want, before returning the loss, which is then used for computing the gradient.
julia> m = Dense(ones(2,2));
julia> Flux.@epochs 3 Flux.train!(params(m), ([1,2], [3,4]), Descent(0.05)) do d
           res = m(d)
           @show res[1]  # intermediate value just printed
           tot = sum(res)
           @show tot     # final value is used for the gradient
       end
[ Info: Epoch 1
res[1] = 3.0
tot = 6.0
res[1] = 6.3999999999999995
tot = 12.799999999999999
[ Info: Epoch 2
res[1] = 2.0999999999999996
tot = 4.199999999999999
res[1] = 4.499999999999999
tot = 8.999999999999998
[ Info: Epoch 3
res[1] = 1.2
tot = 2.4
res[1] = 2.599999999999999
tot = 5.199999999999998
There are many ways to do this (for better or worse, Julia is a TIMTOWTDI language after all). I think there are even a few packages that help with this (I'll leave suggesting those to someone who knows them better).
But I would warn you against fearing custom loops. A custom training loop in Flux isn't as hard nor as advanced a feature as it sounds. Custom loops are good and idiomatic: Julia has fast loops, so we can just make use of them directly and insert what we need where we need it by writing normal code, rather than having to use complicated and less clear callbacks. You can find the docs on them here.

Problem with extracting values from vector for for loop

I am trying to extract values from a vector to generate random numbers from a GEV distribution, but I keep getting an error. This is my code:
x=rand(Truncated(Poisson(2),0,10),10)
t=[]
for i in 1:10 append!(t, maximum(rand(GeneralizedExtremeValue(2,4,3, x[i])))
I am new to this program and I think I am not passing the variable x properly. Any help will be appreciated. Thanks
If I am correctly understanding what you are trying to do, you might want something more like
x = rand(Truncated(Poisson(2),0,10),10)
t = Float64[]
for i in 1:10
    append!(t, max(rand(GeneralizedExtremeValue(2,4,3)), x[i]))
end
Among other things, you were missing a paren, and probably want max instead of maximum here.
Also, while it would technically work, t = [] creates an empty array of type Any, which tends to be very inefficient, so you can avoid that by just telling Julia what type you want that array to hold with e.g. t = Float64[].
Finally, since you already know t only needs to hold ten results, you can make this even more efficient by pre-allocating t:
x = rand(Truncated(Poisson(2),0,10),10)
t = Array{Float64}(undef,10)
for i in 1:10
    t[i] = max(rand(GeneralizedExtremeValue(2,4,3)), x[i])
end

Reusing external functions with different argument names in OpenMDAO

I am trying to generate an MDAO problem from an external specification. This requires the automated creation of groups, disciplines and variables. I would like to reuse some analytical functions, but with different arguments. I have to assume that the argument names can differ between reused instances, so I am looking for a way to formulate analytical functions without requiring consistency between the keys in the function's dictionary-style inputs/outputs parameters and the discipline's input and output variables.
Is it possible (and if so, how) to employ one of the reusable functions MyReusableFunction / MyReusableFunctionAlt in the following example?
import openmdao.api as om

### External information
# I can choose the format of disciplinary functions. Some alternatives:

def MyNonReusableFunction1(inputs, outputs):  # <- The way it works
    # I have to use keys 'A', 'B', 'C' here
    outputs['C'] = inputs['A']*inputs['B']

def MyNonReusableFunction2(inputs, outputs):  # <- The way it works
    # I have to use keys 'D', 'E', 'F' here
    outputs['F'] = inputs['D']*inputs['E']

def MyReusableFunction(x, y):  # <- The way I want it to work
    return x*y

def MyReusableFunctionAlt(inputs, outputs):  # <- This would also be fine
    outputs['z'] = inputs['x']*inputs['y']

# Given structure of the problem
disciplines = {
    'D1': {
        'inputs': ['A', 'B'],
        'outputs': ['C'],
        'function': MyReusableFunction},  # <- instead of MyNonReusableFunction1
    'D2': {
        'inputs': ['D', 'E'],
        'outputs': ['F'],
        'function': MyReusableFunction},  # <- instead of MyNonReusableFunction2
}
connections = [('D2.F', 'D1.B')]

### My script starts here
problem = om.Problem()

for disc_name, disc_data in disciplines.items():
    discipline = om.ExplicitComponent()
    discipline.compute = disc_data['function']
    for param_in in disc_data['inputs']:
        discipline.add_input(param_in, 1)
    for param_out in disc_data['outputs']:
        discipline.add_output(param_out, 1)
    problem.model.add_subsystem(disc_name, discipline)

for connection in connections:
    problem.model.connect(connection[0], connection[1])
This feels like a use case for user-defined function registration in ExecComps. This is a brand new feature.
http://openmdao.org/twodocs/versions/latest/features/building_blocks/components/exec_comp.html#registering-user-functions
An example of its use is here:
http://openmdao.org/twodocs/versions/latest/features/building_blocks/components/exec_comp.html#execcomp-example-user-function-registration
This will handle derivatives for you, using either complex step or finite difference, depending on whether the given function is complex-safe.
Here's an example using your code. It doesn't perfectly replicate the dictionary that stores the user functions, but it's probably a bit easier to get there with this path, as opposed to reassigning compute.
import openmdao.api as om
def MyReusableFunction(x, y): # <- The way I want it to work
return x*y
connections = [('D2.F', 'D1.B')]
problem = om.Problem()
om.ExecComp.register('myfunc', MyReusableFunction, complex_safe=True)
D1 = om.ExecComp('C = myfunc(A, B)')
D2 = om.ExecComp('F = myfunc(D, E)')
problem.model.add_subsystem('D1', D1)
problem.model.add_subsystem('D2', D2)
for connection in connections:
    problem.model.connect(connection[0], connection[1])
problem.setup()
problem.run_model()
Let us know if this will not work for your use-case.
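For what it's worth, here is a rough sketch of how the ExecComp expression strings could be generated from your disciplines dictionary, assuming each discipline has exactly one output and each distinct callable is registered once under its __name__. This is just an illustration on top of the register/ExecComp pattern above, not a tested recipe.
import openmdao.api as om

def MyReusableFunction(x, y):
    return x*y

disciplines = {
    'D1': {'inputs': ['A', 'B'], 'outputs': ['C'], 'function': MyReusableFunction},
    'D2': {'inputs': ['D', 'E'], 'outputs': ['F'], 'function': MyReusableFunction},
}
connections = [('D2.F', 'D1.B')]

problem = om.Problem()

# Register each distinct callable once so ExecComp expressions can refer to it by name.
registered = set()
for disc_data in disciplines.values():
    func = disc_data['function']
    if func.__name__ not in registered:
        om.ExecComp.register(func.__name__, func, complex_safe=True)
        registered.add(func.__name__)

# Build an expression like 'C = MyReusableFunction(A, B)' for each discipline.
for disc_name, disc_data in disciplines.items():
    expr = '{} = {}({})'.format(disc_data['outputs'][0],
                                disc_data['function'].__name__,
                                ', '.join(disc_data['inputs']))
    problem.model.add_subsystem(disc_name, om.ExecComp(expr))

for src, tgt in connections:
    problem.model.connect(src, tgt)

problem.setup()
problem.run_model()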

Which python code will be included in the Dask graph

Which Python code written on the client side of Dask is really added to the task graph?
In this script, for example, I am reading a 4-dimensional HDF5 dataset, using a loop over the fourth dimension.
I calculate the sum for each step of that dimension (called g here, for generation) and subtract the result of this generation from the one before it.
Then I call deriv.visualize() to see how it generates the graph.
import h5py
import dask.array as da

alive = []
derivate = []
board = []
deriv = 0
rest_1 = 0

hf5 = h5py.File('Datata.h5', 'r')
hds5 = hf5.get('dataset')
list(hf5.keys())
last_gen = hds5.attrs.get('last_gen')

# generations is defined elsewhere (e.g. from last_gen)
for g in range(0, generations):
    board = hds5[g]
    arr = da.asarray(board, chunks=(4, 5, 4))
    res = arr.sum()
    if g != 0:
        deriv = res - rest_1
    rest_1 = res

deriv.visualize()
Here is the graph I am getting.
Here, without calling .compute(), the subtract operator is apparently added to the Dask graph. How do we explain this?
If I add a .compute() so that it reads res = arr.sum().compute() and keep the rest as it is, where will the subtraction be executed? On the client side, or in one of the workers?
Another, more general question: if I want to keep the partial sums in the workers, and perform the subtraction (of the partial sums of the current and previous generation) in the workers, is there a way to say that I want these operations to be performed on the same chunks over different generations? (For example, worker 0 would always operate on the first 3 rows of each generation, somewhat like in MPI, even if it's not the same thing at all.)
Dask does not look at your Python code, and so cannot see anything other than what you give to it. In this case it is these two lines:
arr = da.asarray(x, chunks=(4,5,4))
res = arr.sum().compute()
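To make the laziness concrete, here is a small self-contained sketch (with toy arrays, not your dataset): arithmetic on Dask objects produces more lazy Dask objects, so the subtraction becomes a task in the graph; once you call .compute(), you get a plain NumPy value back on the client, and any further arithmetic on it runs there immediately.
import dask.array as da

a = da.ones((4, 4), chunks=(2, 2))
b = 2 * da.ones((4, 4), chunks=(2, 2))

# Lazy version: nothing runs yet, and the subtraction is just another graph node.
res_lazy = a.sum()
deriv_lazy = b.sum() - res_lazy
# deriv_lazy.visualize()   # the graph contains the sums and the subtraction
print(type(deriv_lazy))    # a 0-d dask array

# Eager version: .compute() returns a plain NumPy scalar on the client,
# so the subtraction below is ordinary client-side arithmetic, not a graph task.
res_eager = a.sum().compute()
deriv_eager = b.sum().compute() - res_eager
print(type(deriv_eager))   # numpy.float64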

fixing race condition in tensorflow run

I would like to maintain a variable on the GPU, and perform some operations on that variable in place. The following snippet is a minimalish example of this.
import numpy as np
import tensorflow as tf

with tf.Graph().as_default():
    i = tf.placeholder(tf.int32, [4], name='i')
    y = tf.placeholder(tf.float32, [4], name='y')
    _x = tf.get_variable('x', [4], initializer=tf.random_normal_initializer())
    x = _x + tf.reduce_sum(tf.mul(_x, y))
    assign_op = tf.assign(_x, x).op
    permute_op = tf.assign(_x, tf.gather(_x, i))

    ii = np.array([1, 2, 3, 0])
    yy = np.random.randn(4)

    s = tf.Session()
    s.run(tf.initialize_all_variables())
    xxx0 = s.run(_x)
    s.run([permute_op, assign_op], feed_dict={i: ii, y: yy})
    xxx1 = s.run(_x)

    print('assigned then permuted', np.allclose((xxx0+np.dot(xxx0,yy))[ii], xxx1))
    print('permuted then assigned', np.allclose((xxx0[ii]+np.dot(xxx0[ii], yy)), xxx1))
The problem is that this program is ambiguous, in terms of the ordering of the assign_op and permute_op operations. Hence, one or the other of the final two print statements will be true, but which one that is varies randomly across multiple runs of the program. I could break this into two steps, the first running the permute_op and the second running the assign_op, but it seems this will be less efficient.
Is there an efficient way of breaking the race condition, and making the results predictable?
The easiest way to order the two assignments is to use the result of the first assignment as the variable input to the second one. This creates a data dependency between the assignments, which gives them a deterministic order. For example:
assigned = tf.assign(_x, x)
permuted = tf.assign(assigned, tf.gather(assigned, i))
sess.run(permuted.op) # Runs both assignments.
Note that I reversed the order of the permutation and assignment operations from what you said in your question, because doing the permutation first and then updating still has a race. Even if this isn't the semantics you wanted, the principle should hopefully be clear.
An alternative approach is to use with tf.control_dependencies(ops): blocks, where ops is a list of operations (such as assignments) that must run before the operations in the with block. This is slightly trickier to use, because you have to be careful about reading the updated value of a variable. (Like a non-volatile variable in C, the read may be cached.) The typical idiom to force a read is to use tf.identity(var.ref()), so the example would look something like:
assign_op = tf.assign(_x, x).op
with tf.control_dependencies([assign_op]):
    # Read updated value of `_x` after `assign_op`.
    new_perm = tf.gather(tf.identity(_x.ref()), i)
    permute_op = tf.assign(_x, new_perm).op

sess.run(permute_op)  # Runs both assignments.
