OpenMDAO 1.x relevance reduction - openmdao

I have a component in OpenMDAO without outputs that serves to provide inputs to the rest of the group. apply_linear in that component is being called despite the fact that its output is not connected. Shouldn't the relevance reduction algorithm in OpenMDAO 1.x figure out that apply_linear for this component never needs to be called?

As it turns out, relevance reduction on a per-variable basis isn't turned on by default. You can turn it on with:
from openmdao.api import LinearGaussSeidel
prob.root.ln_solver = LinearGaussSeidel()
prob.root.ln_solver.options['single_voi_relevance_reduction'] = True
This option is set to False by default because it uses more memory: a separate vector is allocated for each quantity of interest (each of those vectors is smaller, since it only contains the relevant variables, but the total size may still be larger). Also, relevance reduction is only applicable when Linear Gauss-Seidel is the top-level linear solver.
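For context, a minimal end-to-end sketch showing where those two lines go (the IndepVarComp/ExecComp model here is just a placeholder, not the asker's setup):

from openmdao.api import Problem, Group, IndepVarComp, ExecComp, LinearGaussSeidel

prob = Problem(root=Group())
prob.root.add('p', IndepVarComp('x', 3.0))           # placeholder source component
prob.root.add('comp', ExecComp('y = 2.0*x'))         # placeholder downstream component
prob.root.connect('p.x', 'comp.x')

prob.root.ln_solver = LinearGaussSeidel()            # relevance reduction needs LGS at the top
prob.root.ln_solver.options['single_voi_relevance_reduction'] = True

prob.setup()
prob.run()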

My reputation isn't high enough yet to leave comments, so I'm just adding another answer instead. I just wanted to mention that if you're not running under MPI, activating single_voi_relevance_reduction is essentially free. The real increase in memory use isn't due to the vectors themselves, but instead it's due to the index arrays that we store in order to transfer the data from source arrays to target arrays. We're forced to use index arrays under MPI, because PETSc requires it, but when we're not using MPI we use python slice objects to do our data transfer. Slice objects require very little memory.

Related

Large Discrete States for DQN when using ReinforcementLearning.jl

I am using the Julia package ReinforcementLearning.jl. I would like to benefit from the fact that DQN does not require enumerating and going through the whole state space. So my question is how to describe state_space for discrete environments without having to enumerate the states. In other words, assume states are represented by an array of N elements and each of these elements can take M possible values; I would like to avoid enumerating the M^N potential states and instead have some generative function.
I have implemented DQN using ReinforcementLearning.jl for environments where actions and states are discrete. To do so, I enumerated the states in the state_space definition. It works quite well, but the enumeration prevents me from getting the computational advantages of DQN.

How to decide which mode to use for 'kaiming_normal' initialization

I have read several codes that do layer initialization using PyTorch's nn.init.kaiming_normal_(). Some of them use the fan_in mode, which is the default. Of the many examples, one can be found here and is shown below.
init.kaiming_normal(m.weight.data, a=0, mode='fan_in')
However, sometimes I see people using the fan_out mode, as seen here and shown below.
if isinstance(m, nn.Conv2d):
    nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
Can someone give me some guidelines or tips to help me decide which mode to select? Furthermore, I am working on image super-resolution and denoising tasks using PyTorch; which mode would be more beneficial for those?
According to the documentation:
Choosing 'fan_in' preserves the magnitude of the variance of the weights in the forward pass. Choosing 'fan_out' preserves the magnitudes in the backwards pass.
and according to Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification - He, K. et al. (2015):
We note that it is sufficient to use either Eqn.(14) or Eqn.(10)
where Eqn.(10) and Eqn.(14) correspond to fan_in and fan_out respectively. Furthermore:
This means that if the initialization properly scales the backward signal, then this is also the case for the forward signal; and vice versa. For all models in this paper, both forms can make them converge
So, all in all, it doesn't matter much; it's more about what you are after. I assume that if you suspect your backward pass might be more "chaotic" (greater variance), it is worth changing the mode to fan_out. This might happen when the loss oscillates a lot (e.g. very easy examples followed by very hard ones).
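For reference, the two conditions behind those equation numbers (restated here in the paper's notation, with $n_l$ the fan-in and $\hat{n}_l$ the fan-out of layer $l$; treat this as my paraphrase of He et al. 2015 rather than a verbatim quote) are

$$\tfrac{1}{2}\, n_l \,\mathrm{Var}[w_l] = 1 \quad \text{(Eqn. 10, fan\_in)}, \qquad \tfrac{1}{2}\, \hat{n}_l \,\mathrm{Var}[w_l] = 1 \quad \text{(Eqn. 14, fan\_out)},$$

i.e. a zero-mean Gaussian with standard deviation $\sqrt{2/n_l}$ or $\sqrt{2/\hat{n}_l}$, which is what kaiming_normal_ samples from when nonlinearity='relu'.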
The correct choice of nonlinearity is more important, where nonlinearity is the activation you are using after the layer you are currently initializing. The current defaults set it to leaky_relu with a=0, which is effectively the same as relu. If you are using leaky_relu, you should change a to its negative slope.
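As a concrete sketch (the tiny network and the init_weights helper below are illustrative, not taken from the linked code), applying either mode together with the matching nonlinearity setting looks like this:

import torch.nn as nn

def init_weights(m):
    # Convolutions followed by ReLU: 'fan_out' preserves variance in the backward pass.
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    # A layer followed by LeakyReLU(0.2): pass the slope as `a` ('fan_in' is the default mode).
    elif isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, a=0.2, mode='fan_in', nonlinearity='leaky_relu')
        nn.init.zeros_(m.bias)

model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, kernel_size=3, padding=1),
)
model.apply(init_weights)   # runs init_weights on every submodule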

In chainer, How to write BPTT updater using multiple GPUs?

I can't find an example, because the existing example only extends training.StandardUpdater and thus only uses one GPU.
I assume that you are talking about the BPTTUpdater of the ptb example of Chainer.
It's not straightforward to make the customized updater support learning on multiple GPUs. MultiprocessParallelUpdater hard-codes the way the gradient is computed (only the target link implementation is customizable), so you have to copy the overall implementation of MultiprocessParallelUpdater and modify the gradient-computation parts. What you have to copy and edit is chainer/training/updaters/multiprocess_parallel_updater.py.
There are two parts in this file that compute gradients: one in _Worker.run, which represents a worker process's task, and the other in MultiprocessParallelUpdater.update_core, which represents the master process's task. You have to make this code do BPTT by modifying the code starting from _calc_loss to backward in each of these two parts:
# For the _Worker.run code, change self._master into self.model
loss = _calc_loss(self._master, batch)
self._master.cleargrads()
loss.backward()
It should be modified by inserting the code of BPTTUpdater.update_core.
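For illustration only, here is a rough sketch of what that inserted block could look like, borrowing the loss-accumulation pattern of BPTTUpdater.update_core (bprop_len is assumed to be stored on the updater, and the optimizer/gradient-gathering plumbing of MultiprocessParallelUpdater is left out):

# Truncated-BPTT version of the _calc_loss / cleargrads / backward block above
# (inside _Worker.run, use self.model instead of self._master).
loss = 0
for _ in range(self.bprop_len):              # bprop_len: truncation length (assumed attribute)
    batch = self.get_iterator('main').__next__()
    x, t = self.converter(batch, self.device)
    loss += self._master(x, t)               # accumulate the loss over the window
self._master.cleargrads()
loss.backward()
loss.unchain_backward()                      # cut the graph so memory does not keep growing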
You also have to take care of the data iterators. MultiprocessParallelUpdater accepts a set of iterators that will be distributed to the master/worker processes. Since the ptb example uses a customized iterator (ParallelSequentialIterator), you have to make sure that these iterators iterate over different portions of the dataset or use different initial offsets of word positions. It may require customization of ParallelSequentialIterator as well.

Finite difference between old and new OpenMDAO

So I am converting a code from the old OpenMDAO to the new OpenMDAO. All the outputs and the partial gradients have been verified as correct. At first the problem would not optimize at all, and then I realized that the old code had some components that did not provide gradients, so they were automatically finite differenced. So I added fd_options['force_fd'] = True to those components, but it still does not optimize to the right value. I checked the total derivative and it was still not correct. It also takes quite a bit longer to do each iteration than the old OpenMDAO did. The only way I can get my new code to optimize to the same value as the old OpenMDAO code is to set every component to finite difference, even the components that provide gradients. So I have a few questions about how finite differencing works in the old and the new OpenMDAO:
When the old OpenMDAO did automatic finite differencing, did it only do it on the outputs and inputs needed for the optimization, or did it calculate the entire Jacobian for all the inputs and outputs? Same question for the new OpenMDAO when you set 'force_fd' to True.
Can you provide some parts of the Jacobian of a component and have it finite difference the rest? In the old OpenMDAO, did it finite difference any gradients that were not provided unless you set missing_deriv_policy = 'assume_zero'?
So, the old OpenMDAO looked for groups of components without derivatives, and bundled them together into a group that could be finite differenced together. New OpenMDAO doesn't do that, so each of those components would be finite differenced separately.
We don't support that yet, and didn't in old OpenMDAO. We do have a story up on our pivotal tracker though, so we will eventually have this feature.
What I suspect might be happening for you is that the finite-difference groupings happened to be better in classic OpenMDAO. Consider one component with one input and 10 outputs connected to a second component with 10 inputs and 1 output. If you finite difference them together, only one execution is required. If you finite difference them individually, you need one execution of component one and 10 executions of component two. This could cause a noticeable or even major performance hit.
Individual FD vs. group FD can also cause accuracy problems if there is an important input whose scaling is vastly different from the other variables, so that the default FD step size of 1.0e-6 is no good. (Note: you can set a step_size when you add a param or output, and it overrides the default for that var.)
Luckily, new OpenMDAO has a way to recreate what you had in old OpenMDAO, but it is not automatic. What you would need to do is take a look at your model, figure out which components can be FD'd together, then create a sub-Group and move those components into it. You can set fd_options['force_fd'] to True on the group, and it'll finite difference that group together. So, for example, if you have A -> B -> C, with no components in between and none of them providing derivatives, you can move A, B, and C into a new sub-Group with force_fd set to True.
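A minimal sketch of that layout (the ExecComp stand-ins and variable names are placeholders for your real A, B, and C components):

from openmdao.api import Problem, Group, ExecComp

root = Group()
fd_group = Group()
root.add('fd_group', fd_group)                    # sub-group that will be FD'd as one unit
fd_group.add('A', ExecComp('y = 2.0*x'))          # stand-ins for your components
fd_group.add('B', ExecComp('y = x**2'))
fd_group.add('C', ExecComp('y = x - 3.0'))
fd_group.connect('A.y', 'B.x')
fd_group.connect('B.y', 'C.x')
fd_group.fd_options['force_fd'] = True            # finite difference the whole chain together

prob = Problem(root=root)
prob.setup()
prob.run()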
If that doesn't fix things, we may have to look more deeply at your model.

OpenCl and power iteration method (eigendecomposition)

I'm new to OpenCL and I'm trying to implement the power iteration method (described over here) for matrix sizes over 100000x100000!
Actually, I have no idea how to implement this, because a work-group has the restriction CL_DEVICE_MAX_WORK_GROUP_SIZE (so I can't make one work-group with 1000000 work-items), but on each iteration step I need to synchronize and normalize the vector.
1) So is it possible to do all the calculations inside one kernel? (I think the answer is no if the matrix size is larger than CL_DEVICE_MAX_WORK_GROUP_SIZE)
2) Can I put a "while" loop in the host code? And is it still profitable to use the GPU in this case?
something like:
while (condition)
{
    kernel calling
    synchronization
}
2: Yes, you can put a while loop in the host code. Whether this is still profitable in terms of performance depends on whether the kernel that is called achieves a good speedup. My personal preference is not to pack too much logic into a single kernel, because smaller kernels are easier to maintain and sometimes easier to optimize. But of course, invoking a kernel has a (small) overhead that has to be taken into account. And whether combining two kernels into one can bring a speedup (or new potential for optimizations) depends on what the kernels are actually doing. But in this case (matrix multiplication and vector normalization), I'd personally start with two different kernels that are invoked from the host in a while loop.
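To make the host-side loop concrete, here is a small PyOpenCL-flavoured sketch (assumptions: a dense matrix small enough to fit in device memory, a naive one-row-per-work-item matvec kernel, a fixed iteration count, and normalization done on the host for brevity; none of this is tuned):

import numpy as np
import pyopencl as cl

n = 4096                                            # kept small; a dense 100000x100000 matrix won't fit anyway
A = np.random.rand(n, n).astype(np.float32)
x = np.random.rand(n).astype(np.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
A_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=A)
x_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=x)
y_buf = cl.Buffer(ctx, mf.READ_WRITE, size=x.nbytes)

prg = cl.Program(ctx, """
__kernel void matvec(__global const float *A, __global const float *x,
                     __global float *y, const int n)
{
    int i = get_global_id(0);                       /* one work-item per output row */
    float acc = 0.0f;
    for (int j = 0; j < n; ++j)
        acc += A[i * n + j] * x[j];
    y[i] = acc;
}
""").build()

for it in range(100):                               # the "while" loop lives on the host
    prg.matvec(queue, (n,), None, A_buf, x_buf, y_buf, np.int32(n))
    y = np.empty_like(x)
    cl.enqueue_copy(queue, y, y_buf)                # blocking read, so it also acts as the sync point
    y /= np.linalg.norm(y)                          # normalization (could be a second kernel instead)
    cl.enqueue_copy(queue, x_buf, y)                # feed the normalized vector back in

Note that the global size (n,) can be far larger than CL_DEVICE_MAX_WORK_GROUP_SIZE; passing None for the local size lets the runtime split the work into as many work-groups as it needs.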
1: Since a 100000x100000 matrix with float values will take at least 40GB of memory, you'll have to think about the approach in general anyhow. There is a vast amount of literature on matrix operations, their parallelization, and the corresponding implementations on the GPU. One important aspect from the "high level" point of view is whether the matrices are dense or sparse ( http://en.wikipedia.org/wiki/Sparse_matrix ). Depending on the sparsity, it might even be possible to handle 100000x100000 matrices in main memory. Apart from that, you might consider having a look at a library for matrix operations (e.g. http://viennacl.sourceforge.net/ ), because implementing an efficient matrix multiplication is challenging, particularly for sparse matrices. But if you want to go the whole way on your own: Good luck ;-)

And ... CL_DEVICE_MAX_WORK_GROUP_SIZE imposes no limitation on the problem size. In fact, the problem size (that is, the total number of work-items) in OpenCL is virtually unlimited. If your CL_DEVICE_MAX_WORK_GROUP_SIZE is 256 and you want to handle 10000000000 elements, then you create 10000000000/256 work-groups and let OpenCL take care of how they are actually dispatched and executed.

For matrix operations, CL_DEVICE_MAX_WORK_GROUP_SIZE is primarily relevant when you want to use local memory (and you will have to, in order to achieve good performance): the size of the work-groups thus implicitly defines how large your chunks of local memory may be.
