Below is sample code where the BsplinesComp is combined with either an ExplicitComponent or an ExternalCodeComp.
Both perform the same calculation, and both components' gradients are computed with finite differences.
If I run the Bsplines + ExplicitComponent version, the result is reached within 2-3 iterations.
If I run the Bsplines + ExternalCodeComp version, I have to wait a long time. In that case the framework tries to find the gradient of the output with respect to each input point. For example, 9 control points are interpolated to 70 points in the bspline component, so the external component has to be evaluated as many times as there are interpolated points (70 times).
So when the bspline is combined with an expensive external code, finite differencing requires as many evaluations as the number of points it is interpolated to, and this becomes the bottleneck of the computation.
Based on this, I have two questions:
1. Given that ExternalCodeComp is based on ExplicitComponent, what is the major difference that causes this difference in behaviour (considering both have an input of shape=70)?
2. In the scenario above, where the bspline is combined with an expensive external code, is there a more efficient way of combining them, apart from the one shown here?
MAIN CODE: the 'external' variable toggles between the ExternalCodeComp and ExplicitComponent cases. Set it to True/False to run the two cases described above.
from openmdao.components.bsplines_comp import BsplinesComp
from openmdao.api import IndepVarComp, Problem, ExplicitComponent, ExecComp, ExternalCodeComp
from openmdao.api import ScipyOptimizeDriver, SqliteRecorder, CaseReader
import matplotlib.pyplot as plt
import numpy as np

external = True  # True runs the ExternalCodeComp case, False the ExplicitComponent case

rr = np.arange(0, 70, 1)
"Explicit component for the area under the line calculation"
class AreaComp(ExplicitComponent):
def initialize(self):
self.options.declare('lenrr', int)
self.options.declare('rr', types=np.ndarray)
def setup(self):
self.add_input('h', shape=lenrr)
self.add_output('area')
self.declare_partials(of='area', wrt='h', method='fd')
def compute(self, inputs, outputs):
rr = self.options['rr']
outputs['area'] = np.trapz(rr,inputs['h'])
class ExternalAreaComp(ExternalCodeComp):

    def setup(self):
        self.add_input('h', shape=70)
        self.add_output('area')

        self.input_file = 'paraboloid_input.dat'
        self.output_file = 'paraboloid_output.dat'

        # providing these is optional; the component will verify that any input
        # files exist before execution and that the output files exist after.
        self.options['external_input_files'] = [self.input_file]
        self.options['external_output_files'] = [self.output_file]
        self.options['command'] = [
            'python', 'extcode_paraboloid.py', self.input_file, self.output_file
        ]

        # this external code does not provide derivatives; use finite difference
        self.declare_partials(of='*', wrt='*', method='fd')

    def compute(self, inputs, outputs):
        h = inputs['h']

        # generate the input file for the external code
        np.savetxt(self.input_file, h)

        # the parent compute function actually runs the external code
        super(ExternalAreaComp, self).compute(inputs, outputs)

        # parse the output file from the external code and set the value of 'area'
        f_xy = np.load('a.npy')
        outputs['area'] = f_xy
prob = Problem()
model = prob.model

n_cp = 9
lenrr = len(rr)

# Initialize the design variables
x = np.random.rand(n_cp)

model.add_subsystem('px', IndepVarComp('x', val=x))
model.add_subsystem('interp', BsplinesComp(num_control_points=n_cp,
                                           num_points=lenrr,
                                           in_name='h_cp',
                                           out_name='h'))
if external:
    comp = ExternalAreaComp()
    model.add_subsystem('AreaComp', comp)
else:
    comp = AreaComp(lenrr=lenrr, rr=rr)
    model.add_subsystem('AreaComp', comp)

case_recorder_filename2 = 'cases4.sql'
recorder2 = SqliteRecorder(case_recorder_filename2)
comp.add_recorder(recorder2)
comp.recording_options['record_outputs'] = True
comp.recording_options['record_inputs'] = True

model.connect('px.x', 'interp.h_cp')
model.connect('interp.h', 'AreaComp.h')

model.add_constraint('interp.h', lower=0.9, upper=1, indices=[0])

prob.driver = ScipyOptimizeDriver()
prob.driver.options['optimizer'] = 'SLSQP'
prob.driver.options['disp'] = True
# prob.driver.options['optimizer'] = 'COBYLA'
# prob.driver.options['disp'] = True
prob.driver.options['tol'] = 1e-9

model.add_design_var('px.x', lower=1, upper=10)
model.add_objective('AreaComp.area', scaler=1)

prob.setup(check=True)
# prob.run_model()
prob.run_driver()

cr = CaseReader(case_recorder_filename2)
case_keys = cr.system_cases.list_cases()
cou = -1
for case_key in case_keys:
    cou = cou + 1
    case = cr.system_cases.get_case(case_key)
    plt.plot(rr, case.inputs['h'], '-*')
The external code extcode_paraboloid.py is below:
import numpy as np

if __name__ == '__main__':
    import sys

    input_filename = sys.argv[1]
    output_filename = sys.argv[2]

    h = np.loadtxt(input_filename)
    rr = np.arange(0, 70, 1)
    rk = np.trapz(rr, h)
    np.save('a', np.array(rk))
In both cases your code takes 3 iterations to run. The wall time for the external code is much, much longer simply because of the cost of file I/O plus the requirement to make a system call to spool up a new process each time your function is called.
Yep, system calls are that expensive, and file I/O isn't cheap either. If you have a more costly analysis it's less of a big deal, but you can see why it should be avoided if at all possible.
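If you want to see that overhead for yourself, here is a rough, self-contained timing sketch (my own illustration, not part of the original answer; the toy workload and call count are arbitrary):

import subprocess
import timeit

n_calls = 20
# the same toy "analysis" run in-process vs. spawned as a new python process each call
in_proc = timeit.timeit(lambda: sum(range(1000)), number=n_calls)
ext = timeit.timeit(
    lambda: subprocess.run(['python', '-c', 'print(sum(range(1000)))'],
                           capture_output=True),
    number=n_calls)
print("in-process: {:.4f} s, subprocess: {:.4f} s for {} calls".format(
    in_proc, ext, n_calls))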
In this case you can reduce your FD cost, though. Since you have only 9 bspline variables, you have correctly deduced that you could run far fewer FD steps. You want to use the approximate semi-total derivative feature in OpenMDAO v2.4 to set up FD across the group instead of across each individual component.
It's as simple as this:
.
.
.
if external:
    comp = ExternalAreaComp()
    model.add_subsystem('AreaComp', comp)
else:
    comp = AreaComp(lenrr=lenrr, rr=rr)
    model.add_subsystem('AreaComp', comp)

model.approx_totals()
.
.
.
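A variation on the same idea (my own sketch, not part of the original answer): if you want the FD approximation only across the bspline-plus-external-code portion rather than the whole model, you can wrap those two components in their own Group and call approx_totals on just that group, so the finite differencing is done with respect to the 9 control points rather than the 70 interpolated points (connection and constraint paths elsewhere in the script would need updating accordingly):

from openmdao.api import Group

sub = model.add_subsystem('sub', Group())
sub.add_subsystem('interp', BsplinesComp(num_control_points=n_cp,
                                         num_points=lenrr,
                                         in_name='h_cp',
                                         out_name='h'))
sub.add_subsystem('AreaComp', ExternalAreaComp())
sub.connect('interp.h', 'AreaComp.h')
# FD across this sub-group: one perturbation per control point (9),
# not one per interpolated point (70)
sub.approx_totals(method='fd')

model.connect('px.x', 'sub.interp.h_cp')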
I am trying to understand graph isomorphism networks (GIN) and graph attention networks (GAT) through PyTorch for some classification tasks.
However, I can't find already-implemented projects to read and understand as hints.
There are some for GCN, and they are OK.
I wanted to know if anyone can suggest any kind of material, apart from raw theoretical papers, that I can refer to.
Graph isomorphism networks (GIN) can be built using the TensorFlow and Spektral libraries.
Here is an example of a GIN network built using the above-mentioned libraries:
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model
from spektral.layers import GINConv, GlobalAvgPool

class GIN0(Model):
    def __init__(self, channels, n_layers):
        super().__init__()
        self.conv1 = GINConv(channels, epsilon=0, mlp_hidden=[channels, channels])
        self.convs = []
        for _ in range(1, n_layers):
            self.convs.append(
                GINConv(channels, epsilon=0, mlp_hidden=[channels, channels])
            )
        self.pool = GlobalAvgPool()
        self.dense1 = Dense(channels, activation="relu")

    def call(self, inputs):
        x, a, i = inputs
        x = self.conv1([x, a])
        for conv in self.convs:
            x = conv([x, a])
        x = self.pool([x, i])
        return self.dense1(x)
You can use this model for training and testing just like any other TensorFlow model, with some limitations.
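For completeness, here is a minimal training sketch of my own (not from the original answer). The dataset choice TUDataset('PROTEINS'), the DisjointLoader, and all hyperparameters are illustrative assumptions; for a real classifier you would also replace GIN0's final relu layer with a Dense output sized to the number of classes:

import tensorflow as tf
from spektral.data import DisjointLoader
from spektral.datasets import TUDataset

dataset = TUDataset('PROTEINS')  # any spektral graph-level Dataset works here
epochs = 10                      # illustrative
loader = DisjointLoader(dataset, batch_size=32, epochs=epochs)

# n_labels output units; in practice, swap GIN0's final relu for a logits layer
model = GIN0(channels=dataset.n_labels, n_layers=3)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True))

# loader.load() yields ((x, a, i), y) batches in disjoint mode
model.fit(loader.load(), steps_per_epoch=loader.steps_per_epoch, epochs=epochs)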
My objective is to train a very simple CNN on MNIST using TensorFlow, convert it to TensorRT, and use it to perform inference on the MNIST test set using TensorRT, all on a Jetson Nano, but I am getting several errors and warnings, including "OutOfMemory Error in GpuMemory: 0". To try to reduce the memory footprint, I also created a script where I simply load the TensorRT model (already converted and saved by the previous script) and use it to perform inference on a small subset of the MNIST test set (100 floating-point values), but I am still getting the same out-of-memory error. The entire directory containing the TensorRT model is only 488 KB, and the 100 test points can't be taking up very much memory, so I am confused about why GPU memory is running out. What could be the reason for this, and how can I solve it?
Another thing that seems suspicious is that some of the TensorFlow logging info messages are being printed multiple times, e.g. "Successfully opened dynamic library libcudart", "Successfully opened dynamic library libcublas", "ARM64 does not support NUMA - returning NUMA node zero". What could be the reason for this (e.g. dynamic libraries being opened over and over again), and could it have something to do with why the GPU memory keeps running out?
Shown below are the two Python scripts; the console output from each is too long to post on Stack Overflow, but it can be seen attached to this Gist: https://gist.github.com/jakelevi1996/8a86f2c2257001afc939343891ee5de7
"""
Example script which trains a simple CNN for 1 epoch on a subset of MNIST, and
converts the model to TensorRT format, for enhanced performance which fully
utilises the NVIDIA GPU, and then performs inference.
Useful resources:
- https://stackoverflow.com/questions/58846828/how-to-convert-tensorflow-2-0-savedmodel-to-tensorrt
- https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#worflow-with-savedmodel
- https://www.tensorflow.org/api_docs/python/tf/experimental/tensorrt/Converter
- https://github.com/tensorflow/tensorflow/issues/34339
- https://github.com/tensorflow/tensorrt/blob/master/tftrt/examples/image-classification/image_classification.py
Tested on the NVIDIA Jetson Nano, Python 3.6.9, tensorflow 2.1.0+nv20.4, numpy
1.16.1
"""
import os
from time import perf_counter
import numpy as np
t0 = perf_counter()
import tensorflow as tf
from tensorflow.keras import datasets, layers, models, Input
from tensorflow.python.compiler.tensorrt import trt_convert as trt
from tensorflow.python.saved_model import signature_constants
from tensorflow.python.saved_model import tag_constants
from tensorflow.python.framework import convert_to_constants
tf.compat.v1.enable_eager_execution() # see github issue above
# Get training and test data
(x_train, y_train), (x_test, y_test) = datasets.mnist.load_data()
x_train = np.expand_dims(x_train, -1) / 255.0
x_test = np.expand_dims(x_test, -1) / 255.0
# Create model
model = models.Sequential()
# model.add(Input(shape=x_train.shape[1:], batch_size=batch_size))
model.add(layers.Conv2D(10, (5, 5), activation='relu', padding="same"))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(10))
# Compile and train model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(
    x_train[:10000], y_train[:10000], validation_data=(x_test, y_test),
    batch_size=100, epochs=1,
)
# Save model
print("Saving model...")
current_dir = os.path.dirname(os.path.abspath(__file__))
model_dir = os.path.join(current_dir, "CNN_MNIST")
if not os.path.isdir(model_dir): os.makedirs(model_dir)
# model.save(model_dir)
tf.saved_model.save(model, model_dir)
# Convert to TRT format
trt_model_dir = os.path.join(current_dir, "CNN_MNIST_TRT")
converter = trt.TrtGraphConverterV2(input_saved_model_dir=model_dir)
converter.convert()
converter.save(trt_model_dir)
t1 = perf_counter()
print("Finished TRT conversion; time taken = {:.3f} s".format(t1 - t0))
# Make predictions using saved model, and print the results (NB using an alias
# for tf.saved_model.load, because the normal way of calling this function
# throws an error because for some reason it is expecting a sess)
saved_model_loaded = tf.compat.v1.saved_model.load_v2(
    export_dir=trt_model_dir, tags=[tag_constants.SERVING])
graph_func = saved_model_loaded.signatures[
    signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
graph_func = convert_to_constants.convert_variables_to_constants_v2(graph_func)
x_test_tensor = tf.convert_to_tensor(x_test, dtype=tf.float32)
preds = graph_func(x_test_tensor)[0].numpy()
print(preds.shape, y_test.shape)
accuracy = list(preds.argmax(axis=1) == y_test).count(True) / y_test.size
print("Accuracy of predictions = {:.2f} %".format(accuracy * 100))
"""
Example script which trains a simple CNN for 1 epoch on a subset of MNIST, and
converts the model to TensorRT format, for enhanced performance which fully
utilises the NVIDIA GPU.
Useful resources:
- https://stackoverflow.com/questions/58846828/how-to-convert-tensorflow-2-0-savedmodel-to-tensorrt
- https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#worflow-with-savedmodel
- https://www.tensorflow.org/api_docs/python/tf/experimental/tensorrt/Converter
- https://github.com/tensorflow/tensorflow/issues/34339
- https://github.com/tensorflow/tensorrt/blob/master/tftrt/examples/image-classification/image_classification.py
Tested on the NVIDIA Jetson Nano, Python 3.6.9, tensorflow 2.1.0+nv20.4, numpy
1.16.1
"""
import os
from time import perf_counter
import numpy as np
t0 = perf_counter()
import tensorflow as tf
from tensorflow.keras import datasets
from tensorflow.python.saved_model import signature_constants
from tensorflow.python.saved_model import tag_constants
from tensorflow.python.framework import convert_to_constants
tf.compat.v1.enable_eager_execution() # see github issue above
# Get training and test data
(x_train, y_train), (x_test, y_test) = datasets.mnist.load_data()
x_train = np.expand_dims(x_train, -1) / 255.0
x_test = np.expand_dims(x_test, -1) / 255.0
# TEMPORARY: just use 100 test points to minimise GPU memory
num_points = 100
x_test, y_test = x_test[:num_points], y_test[:num_points]
current_dir = os.path.dirname(os.path.abspath(__file__))
trt_model_dir = os.path.join(current_dir, "CNN_MNIST_TRT")
# Make predictions using saved model, and print the results (NB using an alias
# for tf.saved_model.load, because the normal way of calling this function
# throws an error because for some reason it is expecting a sess)
saved_model_loaded = tf.compat.v1.saved_model.load_v2(
    export_dir=trt_model_dir, tags=[tag_constants.SERVING])
graph_func = saved_model_loaded.signatures[
    signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
graph_func = convert_to_constants.convert_variables_to_constants_v2(graph_func)
x_test_tensor = tf.convert_to_tensor(x_test, dtype=tf.float32)
preds = graph_func(x_test_tensor)[0].numpy()
print(preds.shape, y_test.shape)
accuracy = list(preds.argmax(axis=1) == y_test).count(True) / y_test.size
print("Accuracy of predictions = {:.2f} %".format(accuracy * 100))
t1 = perf_counter()
print("Finished inference; time taken = {:.3f} s".format(t1 - t0))
I had the same error on a Jetson TX2. I think it comes from the shared memory between the GPU and the CPU; either TensorFlow doesn't allow enough memory or the OS limits the allocation.
To fix this, you can allow memory growth:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
Or you can force TensorFlow to allocate a fixed amount of memory:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only allocate 2 GB of memory on the first GPU
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=2048)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
These examples come from https://www.tensorflow.org/guide/gpu
I see in the logs that it created a GPU device with only 638 MB:
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 638 MB memory)
And then it tried to allocate 1 GiB:
Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.00GiB (rounded to 1073742336).
Also, it's clear that the GPU device has more memory than 638 MB. It's visible here in the logs:
2020-06-23 23:06:36.463934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: NVIDIA Tegra X1 computeCapability: 5.3
coreClock: 0.9216GHz coreCount: 1 deviceMemorySize: 3.87GiB deviceMemoryBandwidth: 194.55MiB/s
So maybe your GPU is running some other calculation?
I'm trying to demux an array with 3 axes, as seen in the following sample code:
import openmdao.api as om
p = om.Problem()
ivc = p.model.add_subsystem('idv', om.IndepVarComp(), promotes=['*'])
ivc.add_output(name='f0', shape=(2,3), units='m', val=[[111,112,113], [121,122,123]])
ivc.add_output(name='f1', shape=(2,3), units='m', val=[[211,212,213], [221,222,223]])
ivc.add_output(name='f2', shape=(2,3), units='m', val=[[311,312,313], [321,322,323]])
ivc.add_output(name='f3', shape=(2,3), units='m', val=[[411,412,413], [421,422,423]])
mux_comp = p.model.add_subsystem(name='mux', subsys=om.MuxComp(vec_size=4))
mux_comp.add_var('r', shape=(2,3), axis=0, units='m')
demux_comp = p.model.add_subsystem(name='demux', subsys=om.DemuxComp(vec_size=4))
demux_comp.add_var('g', shape=(4,2,3), axis=0, units='m')
p.model.connect('f0', 'mux.r_0')
p.model.connect('f1', 'mux.r_1')
p.model.connect('f2', 'mux.r_2')
p.model.connect('f3', 'mux.r_3')
p.model.connect('mux.r', 'demux.g')
p.setup()
p.run_model()
print(p['mux.r'])
print(p['mux.r'].shape)
print(p['demux.g_0'])
print(p['demux.g_1'])
print(p['demux.g_2'])
print(p['demux.g_3'])
print(p['demux.g_0'].shape)
When this runs, I get the following error:
RuntimeError: DemuxComp (demux): d(g_0)/d(g): declare_partials has been called with rows and cols, which should be arrays of equal length, but rows is length 6 while cols is length 2.
As the error pertained only to the partials, I took a look at the demux_comp.py code in the OpenMDAO library and modified the declare_partials line from
self.declare_partials(of=out_name, wrt=var, rows=rs, cols=cs, val=1.0)
to
self.declare_partials(of=out_name, wrt=var, val=1.0)
This allowed the code to run successfully and output the properly demuxed variables. Will this have any adverse effects on the rest of my code and optimizations?
It looks like you've uncovered a bug. We'll get on fixing this.
While your fix will allow it to run, the partials will be incorrect. In the meantime, you'd be better off replacing the declare_partials call with instructions to use finite difference or complex step:
self.declare_partials(of=out_name, wrt=var, method='cs')
for complex step, or method='fd' for finite difference.
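As a quick sanity check after making that change (a minimal sketch, assuming the Problem p from the question), you can compare the declared partials against reference values:

p.setup(force_alloc_complex=True)  # complex step needs complex allocation at setup
p.run_model()
data = p.check_partials(compact_print=True)  # prints per-component comparisons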
I am currently working on graph classification on the IMDB-Binary dataset using deep learning, specifically the PyTorch Geometric environment.
I have split my data into test/train samples that are lists of tuples containing a graph and its label. One thing I've had to do is treat the different graphs as a "Batch", a large disconnected graph, using torch_geometric.data.Batch. To start, I am using a data loader with the following collate function:
def collate(samples):
    graphs, labels = map(list, zip(*samples))
    datalist = make_datalist(graphs)
    datalist = Batch.from_data_list(datalist)
    return datalist, torch.tensor(labels)
and my classifier is the following:
class Classifier(nn.Module):
    def __init__(self, in_dim, hidden_dim, n_classes):
        super(Classifier, self).__init__()
        self.conv1 = GraphConv(in_dim, hidden_dim)
        self.conv2 = GraphConv(hidden_dim, hidden_dim)
        self.classify = nn.Linear(hidden_dim, n_classes)

    def forward(self, g):
        # Use node degree as the initial node feature. For undirected graphs,
        # the in-degree is the same as the out-degree.
        h = g.in_degrees
        # Perform graph convolution and activation function.
        h = F.relu(self.conv1(g, h))
        h = F.relu(self.conv2(g, h))
        g.ndata['h'] = h
        # Calculate graph representation by averaging all the node representations.
        hg = dgl.mean_nodes(g, 'h')
        return self.classify(hg)
It simply averages the node representations of each graph and feeds the result to an MLP.
The problem I run into is that during prediction on a batch, I get the error
AttributeError: 'Batch' object has no attribute 'local_var'
and I can't find where it may come from. Would anyone know?
Thank you for taking the time to read!
I am also experimenting with PyTorch Geometric and its dataset capabilities.
Maybe the following information will help someone in the future:
I was facing AttributeErrors when I forgot to define @property-annotated getters/setters for my dataset class attributes. See https://docs.python.org/3.7/library/functions.html#property
I think that to answer your question we need more information about your make_datalist function.
However, here are the links to the Batch class:
https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html
https://pytorch-geometric.readthedocs.io/en/latest/_modules/torch_geometric/data/batch.html#Batch
And indeed, there is nothing like a local_var attribute there.
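One possible lead (my own guess, not confirmed from the code shown): local_var is an attribute of DGL graphs, and the Classifier above mixes DGL calls (g.ndata, dgl.mean_nodes) with a torch_geometric Batch. A classifier written purely against torch_geometric might look like this sketch:

import torch
import torch.nn.functional as F
from torch import nn
from torch_geometric.nn import GraphConv, global_mean_pool

class PyGClassifier(nn.Module):
    def __init__(self, in_dim, hidden_dim, n_classes):
        super().__init__()
        self.conv1 = GraphConv(in_dim, hidden_dim)
        self.conv2 = GraphConv(hidden_dim, hidden_dim)
        self.classify = nn.Linear(hidden_dim, n_classes)

    def forward(self, batch):
        x, edge_index = batch.x, batch.edge_index
        h = F.relu(self.conv1(x, edge_index))
        h = F.relu(self.conv2(h, edge_index))
        # average node representations per graph (batch.batch maps nodes to graphs)
        hg = global_mean_pool(h, batch.batch)
        return self.classify(hg)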
I noticed that prob.compute_totals() returns a wrong answer when prob.model.approx_totals() has not been called beforehand. Defining the partial derivatives manually or computing them by finite differences doesn't change anything; the answer remains wrong unless prob.model.approx_totals() is called first. Also, the call to compute_totals is actually faster when approx_totals has been called, compared to when it hasn't. This seems counter-intuitive with manually defined partials, since approx_totals is supposed to add an unnecessary finite-difference computation.
Here is an MWE based on the Sellar example from the OpenMDAO documentation. I noticed the same behaviour in OpenAeroStruct, although the differences there are smaller than in this example.
import openmdao.api as om
from openmdao.test_suite.components.sellar_feature import SellarMDA
prob = om.Problem()
prob.model = SellarMDA()
prob.driver = om.ScipyOptimizeDriver()
prob.driver.options['optimizer'] = 'SLSQP'
prob.driver.options['tol'] = 1e-8
prob.model.add_design_var('x', lower=0, upper=10)
prob.model.add_design_var('z', lower=0, upper=10)
prob.model.add_objective('obj')
prob.model.add_constraint('con1', upper=0)
prob.model.add_constraint('con2', upper=0)
prob.setup()
prob.set_solver_print(level=0)
prob.model.approx_totals() # Commenting this line gives the wrong result
prob.run_driver()
totals = prob.compute_totals(of=['obj'], wrt=['x', 'z'])
print("""
Obj = {}
x = {}
z = {}
y1 = {}
y2 = {}
Totals = {}""".format(prob['obj'][0], prob['x'][0], prob['z'][0],
                      prob['y1'][0], prob['y2'][0], totals))
The good result, with approx_totals:
Optimization terminated successfully. (Exit mode 0)
Current function value: 3.183393951729169
Iterations: 6
Function evaluations: 6
Gradient evaluations: 6
Optimization Complete
-----------------------------------
Obj = 3.183393951729169
x = 0.0
z = 1.977638883487764
y1 = 3.1600000000897133
y2 = 3.755277766976125
Totals = OrderedDict([(('obj', 'x'), array([[0.94051147]])), (('obj', 'z'), array([[3.50849282, 1.72901602]]))])
The wrong result, without approx_totals:
Optimization terminated successfully. (Exit mode 0)
Current function value: 3.1833939532752136
Iterations: 11
Function evaluations: 12
Gradient evaluations: 11
Optimization Complete
-----------------------------------
Obj = 3.1833939532752136
x = 4.401421628747386e-15
z = 1.9776388839289216
y1 = 3.1600000016563765
y2 = 3.755277767857951
Totals = OrderedDict([(('obj', 'x'), array([[0.99341446]])), (('obj', 'z'), array([[3.90585351, 1.97002055]]))])
In this example, the problem is that you have a cycle in SellarMDA, but the model does not contain a linear solver that can compute the total derivatives across the cycle. One way to check this is to run "openmdao check myfilename.py" at the command line. I ran it on your model and got the following warnings:
INFO: checking comp_has_no_outputs
INFO: checking dup_inputs
INFO: checking missing_recorders
WARNING: The Problem has no recorder of any kind attached
INFO: checking out_of_order
INFO: checking solvers
WARNING: Group 'cycle' contains cycles [['d1', 'd2']], but does not have an iterative linear solver.
INFO: checking system
There are a couple of remedies for this. You could manually add a linear solver such as DirectSolver or PETScKrylov to the "cycle" group. You could also import SellarMDALinearSolver instead of SellarMDA. SellarMDALinearSolver uses a Newton solver to converge the cycle and a DirectSolver to compute the derivatives, whereas SellarMDA uses NonlinearBlockGS to converge the cycle but unfortunately does not contain an appropriate linear solver for computing the derivatives. These components are used in a variety of testing roles, but in retrospect, I think we should probably add a LinearBlockGS to SellarMDA in the future, so that total derivatives can be computed without modification. For now though, you'll have to use SellarMDALinearSolver or add the solver yourself.
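A minimal sketch of the first remedy (assuming prob from the question, and that the cycle group is reachable as prob.model.cycle, as the warning suggests):

import openmdao.api as om

# attach a linear solver that can solve for totals across the d1 <-> d2 cycle
prob.model.cycle.linear_solver = om.DirectSolver()
# or, keeping the Gauss-Seidel flavour of the nonlinear solver:
# prob.model.cycle.linear_solver = om.LinearBlockGS()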
BTW, I suspect the optimization was slower because the derivatives were so bad. It took twice as many iterations, though it still somehow managed to get pretty close to the answer.
You mentioned similar symptoms in your OpenAeroStruct model. I would suspect that either 1) a subcomponent has an error in its analytical derivatives, or 2) the linear solvers are not set up correctly (maybe you have a cycle somewhere without a good linear solver in that group or a parent group). I think Problem.check_partials and Problem.check_totals will give you more insight into where the problem could be; there is more info on these in the OpenMDAO documentation.
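Both checks are one-liners (a minimal sketch, assuming prob has been set up and run as in your MWE):

prob.check_partials(compact_print=True)        # component-level derivative check
prob.check_totals(of=['obj'], wrt=['x', 'z'])  # total-derivative check across the model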