I'm trying to get a parallel workflow to run in which I'm evaluating over 1000 parallel cases inside a ParallelGroup. If I run on a small number of cores it doesn't crash, but increasing the number of nodes at some point raises an error, which suggests it is related to how the problem is partitioned.
I'm getting an error from the deep dungeons of OpenMDAO and PETSc, relating to the target indices when setting up the communication tables, as far as I can see. Below is the traceback of the error:
File "/home/frza/git/OpenMDAO/openmdao/core/group.py", line 454, in _setup_vectors
impl=self._impl, alloc_derivs=alloc_derivs)
File "/home/frza/git/OpenMDAO/openmdao/core/group.py", line 1456, in _setup_data_transfer
self._setup_data_transfer(my_params, None, alloc_derivs)
File "/home/frza/git/OpenMDAO/openmdao/core/petsc_impl.py", line 125, in create_data_xfer
File "/home/frza/git/OpenMDAO/openmdao/core/petsc_impl.py", line 397, in __init__
tgt_idx_set = PETSc.IS().createGeneral(tgt_idxs, comm=comm)
File "PETSc/IS.pyx", line 74, in petsc4py.PETSc.IS.createGeneral (src/petsc4py.PETSc.c:74696)
tgt_idx_set = PETSc.IS().createGeneral(tgt_idxs, comm=comm)
File "PETSc/arraynpy.pxi", line 121, in petsc4py.PETSc.iarray (src/petsc4py.PETSc.c:8230)
TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'
This answer:
https://scicomp.stackexchange.com/questions/2355/32bit-64bit-issue-when-working-with-numpy-and-petsc4py/2356#2356
led me to look for where you set up the tgt_idxs vector, to see whether it's defined with the correct dtype PETSc.IntType. But so far I only get "Petsc has generated inconsistent data" errors when I try to set the dtype of the arrays I think may be causing the error.
I've not yet tried to reinstall PETSc with --with-64-bit-indices as suggested in the answer I linked to. Do you run PETSc configured this way?
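For reference, the same TypeError should be reproducible in isolation on a default (32-bit-index) PETSc build by handing createGeneral an int64 index array; a minimal sketch, independent of OpenMDAO:
import numpy as np
from petsc4py import PETSc

# numpy's 'safe' casting rule never allows int64 -> int32, so on a PETSc
# built with 32-bit indices this raises the same TypeError as in the traceback
idxs = np.arange(10, dtype=np.int64)
iset = PETSc.IS().createGeneral(idxs, comm=PETSc.COMM_SELF)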
edit:
I've now set up a stripped down version of the problem that replicates the error I get:
import numpy as np
from openmdao.api import Component, Group, Problem, IndepVarComp, \
    ParallelGroup


class Model(Component):

    def __init__(self, nsec, nx, nch):
        super(Model, self).__init__()
        self.add_output('outputs', shape=[nx+1, nch*6*3*nsec])

    def solve_nonlinear(self, params, unknowns, resids):
        pass


class Aggregate(Component):

    def __init__(self, nsec, ncase, nx, nch, nsec_env=12):
        super(Aggregate, self).__init__()

        self.ncase = ncase
        for i in range(ncase):
            self.add_param('outputs_sec%03d' % i, shape=[nx+1, nch*6*3*nsec])

        for i in range(nsec):
            self.add_output('aoutput_sec%03d' % i, shape=[nsec_env, 6])

    def solve_nonlinear(self, params, unknowns, resids):
        pass


class ParModel(Group):

    def __init__(self, nsec, ncase, nx, nch, nsec_env=12):
        super(ParModel, self).__init__()

        pg = self.add('pg', ParallelGroup())

        promotes = ['aoutput_sec%03d' % i for i in range(nsec)]
        self.add('agg', Aggregate(nsec, ncase, nx, nch, nsec_env), promotes=promotes)

        for i in range(ncase):
            pg.add('case%03d' % i, Model(nsec, nx, nch))
            self.connect('pg.case%03d.outputs' % i, 'agg.outputs_sec%03d' % i)


if __name__ == '__main__':

    from openmdao.core.mpi_wrap import MPI

    if MPI:
        from openmdao.core.petsc_impl import PetscImpl as impl
    else:
        from openmdao.core.basic_impl import BasicImpl as impl

    p = Problem(impl=impl, root=Group())
    root = p.root

    root.add('dlb', ParModel(20, 1084, 36, 6))

    import time
    t0 = time.time()
    p.setup()
    print 'setup time', time.time() - t0
Having done that, I can also see that the data size ends up being enormous due to the many cases we evaluate. I'll see if we can somehow reduce the data sizes. I can't actually get this to run at all now, since it crashes either with this error:
petsc4py.PETSc.Error: error code 75
[77] VecCreateMPIWithArray() line 320 in /home/MET/Python-2.7.10_Intel/opt/petsc-3.6.2/src/vec/vec/impls/mpi/pbvec.c
[77] VecSetSizes() line 1374 in /home/MET/Python-2.7.10_Intel/opt/petsc-3.6.2/src/vec/vec/interface/vector.c
[77] Arguments are incompatible
[77] Local size 86633280 cannot be larger than global size 73393408
or with the TypeError shown earlier.
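For reference, the combined size of the parameters on the aggregation component works out to exactly the local size reported in that error (a quick check, using the dimensions from the script above):
nsec, ncase, nx, nch = 20, 1084, 36, 6
per_case = (nx + 1) * nch * 6 * 3 * nsec   # 79,920 values per case
total = ncase * per_case                   # 86,633,280 values across all cases
print(per_case, total)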
The data sizes that you're running with are definitely larger than can be expressed by 32-bit indices, so recompiling with --with-64-bit-indices makes sense if you're not able to decrease your data size. OpenMDAO uses PETSc.IntType internally for our indices, so they should become 64 bits in size if you recompile.
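A quick way to check which index width your PETSc build uses from Python (a minimal sketch, assuming petsc4py is importable in the same environment):
import numpy as np
from petsc4py import PETSc

# 'int32' for a default build, 'int64' for one configured --with-64-bit-indices
print(np.dtype(PETSc.IntType))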
I've never used that option on PETSc. A while back we did have some problems scaling up to larger numbers of cores, but we determined that the problem for us was with how OpenMPI was compiled. Recompiling OpenMPI fixed our issues.
Since this error shows up during setup, we don't need to actually run the model to test it. If you can provide us with the model that shows the problem, and we can run it, then we can at least verify whether the same problem happens on our clusters.
It would also be good to know how many cores you can successfully run on and at what point it breaks down.
I was practicing plotting when by mistake I assigned ylabel and title as below:
plt.ylabel = "No. of Hospitals"
plt.title = 'Hospitals by State'
This replaced the two functions with plain strings, which I confirmed by inspecting them.
Then I changed the statement to set these correctly to:
plt.ylabel("No. of Hospitals")
plt.title('Hospitals by State')
Now, I get the error
TypeError: 'str' object is not callable
In a Stack Overflow answer I read that once the functions have been overwritten like this, the only way to fix it is to restart the kernel. I don't want to restart the kernel and rerun all 500+ jobs above this mistake. I also tried importing matplotlib and seaborn again, aliasing pyplot as plt2, but that didn't work either.
I was wondering: is there a way to reset these functions back to normal now that they have been replaced by strings?
I do understand that I could write the dataframe of interest to a file and then read it back in a new notebook. However, I'm sure many will agree that knowing how to reset the functions back to normal would help many people in the future and avoid costly workarounds.
You may redefine them, like:
plt.title = lambda *args, **kwargs: plt.gca().set_title(*args, **kwargs)
plt.ylabel = lambda *args, **kwargs: plt.gca().set_ylabel(*args, **kwargs)
See also the original code of matplotlib.pyplot.title and matplotlib.pyplot.ylabel.
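Alternatively, reloading pyplot rebinds the overwritten names to the real functions; a short sketch (note that reloading re-runs pyplot's import-time code, which is usually harmless in a notebook):
import importlib
import matplotlib.pyplot as plt

plt = importlib.reload(plt)   # re-executes pyplot, restoring plt.title and plt.ylabel
plt.title('Hospitals by State')
plt.ylabel("No. of Hospitals")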
I'm hitting the RuntimeError shown below with some big modules like matplotlib. For example, take the following code:
import importlib
obj = importlib.import_module('matplotlib')
obj_entries = obj.__dict__
Between runs, the length of obj_entries can vary from 108 up to the expected 157 entries; in particular, pyplot (like some other submodules) can be missing. It works reliably when I step through manually in debug mode, with a len() computed right after extracting the dict, but when run automatically it does not work well.
The following error occurs:
RuntimeError: dictionary changed size during iteration
python-BaseException
I'm using a plain Python 3.10 on Windows. Swapping Python versions changes nothing at all.
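The loop that actually raises the error isn't shown above, but assuming it iterates over obj_entries directly, iterating over a snapshot avoids the RuntimeError itself (though not the varying number of entries):
import importlib

obj = importlib.import_module('matplotlib')
# list(...) takes a snapshot, so imports triggered while looping
# cannot change the dict mid-iteration
for name, value in list(obj.__dict__.items()):
    print(name)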
During my attempts I found a few interesting things. Calling repr() on the module before reading its dict helps. But if the module object is passed between classes as a variable, lazy importing seems more likely to happen: not all names show up, while the command-line interpreter does the opposite and returns what I expect. So the chunk of code below helps bypass this behaviour...
Note: I use pkgutil.iter_modules(some_path) to discover the modules, which yields them in pkgutil's ModuleInfo form.
import importlib.util
import pkgutil

# some_path is the search path mentioned in the note above
for module_info in pkgutil.iter_modules(some_path):   # pkgutil.ModuleInfo
    name = module_info.name
    finder = module_info.module_finder
    spec = finder.find_spec(name)
    module_obj = importlib.util.module_from_spec(spec)
    loader = module_obj.__loader__
    loader.exec_module(module_obj)
I'm still unfamiliar with the internals of the import machinery, so links to a more detailed (spot-on) explanation would be very helpful.
I am trying to run a simple mathematical problem in parallel in OpenMDAO 2.5.0. The problem is an adapted version of the example in the OpenMDAO docs found here: http://openmdao.org/twodocs/versions/latest/features/core_features/grouping_components/parallel_group.html. It has some extra components and connections and uses promotions instead of connections.
from openmdao.api import Problem, IndepVarComp, ParallelGroup, ExecComp, Group, NonlinearBlockGS
prob = Problem()
model = prob.model
model.add_subsystem('p1', IndepVarComp('x1', 1.0), promotes=['x1'])
model.add_subsystem('p2', IndepVarComp('x2', 1.0), promotes=['x2'])
cycle = model.add_subsystem('cycle', Group(), promotes=['*'])
parallel = cycle.add_subsystem('parallel', ParallelGroup(), promotes=['*'])
parallel.add_subsystem('c1', ExecComp(['y1=(-2.0*x1+z)/3']), promotes=['x1', 'y1', 'z'])
parallel.add_subsystem('c2', ExecComp(['y2=(5.0*x2-z)/6']), promotes=['x2', 'y2', 'z'])
cycle.add_subsystem('c3', ExecComp(['z=(3.0*y1+7.0*y2)/10']), promotes=['y1', 'y2', 'z'])
model.add_subsystem('c4', ExecComp(['z2 = y1+y2']), promotes=['z2', 'y1', 'y2'])
cycle.nonlinear_solver = NonlinearBlockGS()
prob.setup(mode='fwd')
prob.set_solver_print(level=2)
prob.run_model()
print(prob['z2'])
print(prob['z'])
print(prob['y1'])
print(prob['y2'])
When I run this code in series, it works as expected with no errors.
However, when I run this code in parallel with:
mpirun -n 2 python Test.py
I get this error for the first process:
RuntimeError: The promoted name y1 is invalid because it refers to multiple inputs: [cycle.c3.y1 ,c4.y1]. Access the value from the connected output variable cycle.parallel.c1.y1 instead.
and this error for the second process:
RuntimeError: The promoted name y2 is invalid because it refers to multiple inputs: [cycle.c3.y2 ,c4.y2]. Access the value from the connected output variable cycle.parallel.c2.y2 instead.
So my question is: why does this example give an error in the promotes names when running in parallel, while it is running without problems in series? Is it only allowed to use connections when running in parallel or are promoted variables okay as well?
As of OpenMDAO V2.5, you have to be a little careful about how you access problem variables in parallel.
If you look at the full stack trace of your model, you can see that the error is being thrown right at the end when you call
print(prob['y1'])
print(prob['y2'])
What is happening here is that you have set the model up so that y1 only exists on proc 0 and y2 only exists on proc 1. Then you try to get values that don't exist on that proc, and you get an (admittedly not very clear) error.
You can fix this with the following minor modification to your script:
from openmdao.api import Problem, IndepVarComp, ParallelGroup, ExecComp, Group, NonlinearBlockJac
prob = Problem()
model = prob.model
model.add_subsystem('p1', IndepVarComp('x1', 1.0), promotes=['x1'])
model.add_subsystem('p2', IndepVarComp('x2', 1.0), promotes=['x2'])
cycle = model.add_subsystem('cycle', Group(), promotes=['*'])
parallel = cycle.add_subsystem('parallel', ParallelGroup(), promotes=['*'])
parallel.add_subsystem('c1', ExecComp(['y1=(-2.0*x1+z)/3']), promotes=['x1', 'y1', 'z'])
parallel.add_subsystem('c2', ExecComp(['y2=(5.0*x2-z)/6']), promotes=['x2', 'y2', 'z'])
cycle.add_subsystem('c3', ExecComp(['z=(3.0*y1+7.0*y2)/10']), promotes=['y1', 'y2', 'z'])
model.add_subsystem('c4', ExecComp(['z2 = y1+y2']), promotes=['z2', 'y1', 'y2'])
cycle.nonlinear_solver = NonlinearBlockJac()
prob.setup(mode='fwd')
prob.set_solver_print(level=2)
prob.run_model()
print(prob['z2'])
print(prob['z'])
if prob.model.comm.rank == 0:
    print(prob['y1'])
if prob.model.comm.rank == 1:
    print(prob['y2'])
There are a few minor issues with this: 1) it means your script is now different for serial and parallel runs, and 2) it's annoying. So we're working on a fix that will let things work more cleanly by automatically doing an MPI broadcast when you try to get a value that's not on your proc. That will be released in V2.6.
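In the meantime, a hand-rolled broadcast along those lines looks something like this (a sketch only, assuming mpi4py is available and the variable placement described above; under MPI, prob.model.comm is an mpi4py communicator):
comm = prob.model.comm
# each value is read only on the proc that owns it, then broadcast to everyone
y1 = comm.bcast(prob['y1'] if comm.rank == 0 else None, root=0)
y2 = comm.bcast(prob['y2'] if comm.rank == 1 else None, root=1)
if comm.rank == 0:
    print(y1, y2)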
One other small note: I changed your NL solver over to NonlinearBlockJac. That is the Block Jacobi solver, which is designed to work in parallel. You could also use the Newton solver in parallel. The Gauss-Seidel solver won't actually let you get parallel speedups.
For learning purposes, I wrote a small Python C extension module that is supposed to perform an IPC CUDA memcpy to transfer data between processes. For testing, I wrote two equivalent programs: one using theano's CudaNdarray, and the other using pycuda. The problem is that even though the test programs are nearly identical, the pycuda version works while the theano version does not. It doesn't crash: it just produces incorrect results.
Below is the relevant function in the C module. Here is what it does: every process has two buffers, a source and a destination. Calling _sillycopy(source, dest, n) copies n elements from each process's source buffer to the neighboring process's dest array. So, with two processes, 0 and 1, process 0 will end up with process 1's source buffer and process 1 will end up with process 0's source buffer.
Note that to transfer cudaIpcMemHandle_t values between processes, I use MPI (this is a small part of a larger project which uses MPI). _sillycopy is called by another function, "sillycopy" which is exposed in Python by the standard Python C API methods.
void _sillycopy(float *source, float* dest, int n, MPI_Comm comm) {
    int localRank;
    int localSize;
    MPI_Comm_rank(comm, &localRank);
    MPI_Comm_size(comm, &localSize);

    // Figure out which process is to the "left".
    // m() performs a mod and treats negative numbers
    // appropriately
    int neighbor = m(localRank - 1, localSize);

    // Create a memory handle for *source and do a
    // wasteful Allgather to distribute to other processes
    // (could just use an MPI_Sendrecv, but irrelevant right now)
    cudaIpcMemHandle_t *memHandles = new cudaIpcMemHandle_t[localSize];
    cudaIpcGetMemHandle(memHandles + localRank, source);
    MPI_Allgather(
        memHandles + localRank, sizeof(cudaIpcMemHandle_t), MPI_BYTE,
        memHandles, sizeof(cudaIpcMemHandle_t), MPI_BYTE,
        comm);

    // Open the neighbor's mem handle so we can do a cudaMemcpy
    float *sourcePtr;
    cudaIpcOpenMemHandle((void**)&sourcePtr, memHandles[neighbor], cudaIpcMemLazyEnablePeerAccess);

    // Copy!
    cudaMemcpy(dest, sourcePtr, n * sizeof(float), cudaMemcpyDefault);

    cudaIpcCloseMemHandle(sourcePtr);
    delete [] memHandles;
}
Now here is the pycuda example. For reference, using int() on a_gpu and b_gpu returns the pointer to the underlying buffer's memory address on the device.
import sillymodule # sillycopy lives in here
import simplempi as mpi
import pycuda.driver as drv
import numpy as np
import atexit
import time
mpi.init()
drv.init()
# Make sure each process uses a different GPU
dev = drv.Device(mpi.rank())
ctx = dev.make_context()
atexit.register(ctx.pop)
shape = (2**26,)
# allocate host memory
a = np.ones(shape, np.float32)
b = np.zeros(shape, np.float32)
# allocate device memory
a_gpu = drv.mem_alloc(a.nbytes)
b_gpu = drv.mem_alloc(b.nbytes)
# copy host to device
drv.memcpy_htod(a_gpu, a)
drv.memcpy_htod(b_gpu, b)
# A few more host buffers
a_p = np.zeros(shape, np.float32)
b_p = np.zeros(shape, np.float32)
# Sanity check: this should fill a_p with 1's
drv.memcpy_dtoh(a_p, a_gpu)
# Verify that
print(a_p[0:10])
sillymodule.sillycopy(
    int(a_gpu),
    int(b_gpu),
    shape[0])
# After this, b_p should have all ones
drv.memcpy_dtoh(b_p, b_gpu)
print(b_p[0:10])
And now the theano version of the above code. Rather than using int() to get the buffers' address, the CudaNdarray way of accessing this is via the gpudata attribute.
import os
import simplempi as mpi
mpi.init()
# selects one GPU per process
os.environ['THEANO_FLAGS'] = "device=gpu{}".format(mpi.rank())
import theano.sandbox.cuda as cuda
import time
import numpy as np
import time
import sillymodule
shape = (2 ** 24, )
# Allocate host data
a = np.ones(shape, np.float32)
b = np.zeros(shape, np.float32)
# Allocate device data
a_gpu = cuda.CudaNdarray.zeros(shape)
b_gpu = cuda.CudaNdarray.zeros(shape)
# Copy from host to device
a_gpu[:] = a[:]
b_gpu[:] = b[:]
# Should print 1's as a sanity check
print(np.asarray(a_gpu[0:10]))
sillymodule.sillycopy(
    a_gpu.gpudata,
    b_gpu.gpudata,
    shape[0])
# Should print 1's
print(np.asarray(b_gpu[0:10]))
Again, the pycuda code works perfectly and the theano version runs, but gives the wrong result. To be precise, at the end of the theano code, b_gpu is filled with garbage: neither 1's nor 0's, just random numbers as though it were copying from a wrong place in memory.
My original theory regarding why this was failing had to do with CUDA contexts. I wondered whether theano was doing something with them that meant the CUDA calls made in sillycopy were run under a different CUDA context than the one used to create the GPU arrays. I don't think this is the case, because:
I spent a lot of time digging deep in theano's code and saw no funny business being played with contexts
I would expect such a problem to result in a hard crash rather than an incorrect result, and no crash happens.
A secondary thought is whether this has to do with the fact that theano spawns several threads, even when using a CUDA backend, which can be verified by running "ps huH p <pid>". I don't know how threads might affect anything, but I have run out of obvious things to consider.
Any thoughts on this would be greatly appreciated!
For reference: the processes are launched in the normal OpenMPI way:
mpirun --np 2 python test_pycuda.py
A PyPy cffi callback that works perfectly (in an infinite loop) when implemented (straightforwardly) as a method of a Python object segfaults after approximately 100 iterations when I move the Python object into a separate multiprocessing process.
In the main code I have:
import multiprocessing as mp

class Task(object):

    def __init__(self, com, lib):
        self.com = com   # communication queue
        self.lib = lib   # ffi library
        self.proc = mp.Process(target=self.spawn, args=(self.com,))
        self.register_callback()

    def spawn(self, com):
        print('%s spawned.' % self.name)
        # loop (keeping 'self' alive) until BREAK:
        while True:
            cmd = com.get()
            if cmd == self.BREAK:
                break
        print("%s stopped." % self.name)

    # @ffi.callback("int(void*, Data*)")   # old cffi (ABI mode)
    def callback(self, data):
        # <work on data>
        return 1

    def register_callback(self):
        s = ffi.new_handle(self)
        self.lib.register_callback(s, self.callback)   # C-call
The idea is that multiple tasks should serve an equal number of callbacks concurrently. I have no clue what may cause the segfault, especially since it runs fine for the first ~100 iterations or so. Help much appreciated!
Solution
The handle 's' is garbage collected when register_callback() returns. Making the handle an attribute of 'self' (and passing that) keeps it alive.
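A minimal sketch of that change (the attribute name is just illustrative):
def register_callback(self):
    # keep the handle referenced by self so it outlives this call
    self._handle = ffi.new_handle(self)
    self.lib.register_callback(self._handle, self.callback)   # C-call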
Standard CPython (cffi 1.6.0) segfaulted on the very first iteration (i.e. the garbage collection was immediate) and gave me a crucial, informative error message. PyPy, on the other hand, segfaulted after approximately 100 iterations without providing a message... Both run fine now.