Managing external code errors - OpenMDAO

I am trying to run an external code in OpenMDAO 2 that outputs some minor error messages to the Windows shell as part of its run process. These error messages do not affect the results, and the code itself runs normally. However, OpenMDAO raises a fault and stops whenever it detects these error messages. Is it possible for OpenMDAO to ignore such situations and continue running the analysis? I have tried setting the fail_hard option to False, but that doesn't seem to change the behavior, except that OpenMDAO raises an AnalysisError instead of a RuntimeError.

We can implement a feature to let you specify allowable return codes. As long as you can enumerate which return codes are not errors, I think this will solve your problem.
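For reference, a minimal Python sketch of how that could look, assuming a later OpenMDAO release where ExternalCodeComp exposes an allowed_return_codes option (the command, file names, and variables below are placeholders):

import openmdao.api as om

class WarningTolerantComp(om.ExternalCodeComp):
    """Wraps an external code whose nonzero exit codes are harmless."""

    def setup(self):
        self.add_input('x', val=0.0)
        self.add_output('f', val=0.0)

        # Placeholder command; replace with the real executable and files.
        self.options['command'] = ['my_code.exe', 'input.dat', 'output.dat']

        # Exit codes to treat as success instead of raising an error
        # (0 is the usual default; 1 is listed here as an example).
        self.options['allowed_return_codes'] = [0, 1]

    def compute(self, inputs, outputs):
        # write input.dat from inputs here ...
        super().compute(inputs, outputs)  # runs the external command
        # ... then parse output.dat into outputs here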

Related

How to check which CUDA error arises in which asynchronous CUDA call?

Suppose we have the following situation:
launch_kernel_a<<<n_blocks, n_threads>>>(...);
launch_kernel_b<<<n_blocks, n_threads>>>(...);
cudaDeviceSynchronize();
if (cudaGetLastError() != cudaSuccess)
{
    // Handle error
    ...
}
My understanding is that in the above, execution errors occurring during the asynchronous execution of either kernel may be returned by cudaGetLastError(). In that case, how do I figure out which kernel caused the error to occur during runtime?
My understanding is that in the above, execution errors occurring during the asynchronous execution of either kernel may be returned by cudaGetLastError().
That is correct. The runtime API will return the last error which was encountered. It isn't possible to know from which call in a sequence of asynchronous API calls an error was generated.
In that case, how do I figure out which kernel caused the error to occur during runtime?
You can't. You would require some kind of additional API call between the two kernel launches to determine the error. The crudest would be a cudaDeviceSynchronize() call, although that would serialize the operations if they actually did overlap (although I see no stream usage so that is probably not happening here).
As noted in comments -- most kernel runtime errors will result in context destruction, so if you got an error from the first kernel, the second kernel will abort or refuse to run anyway and that is probably fatal to your whole application.
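For illustration, a minimal sketch of that crude approach; kernel_a, kernel_b, and the launch configuration are placeholders. Checking after each launch and synchronizing in between pins the error to a single kernel, at the cost of serializing them:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel_a() { /* ... */ }
__global__ void kernel_b() { /* ... */ }

// Returns the first error seen while launching and running one kernel.
static cudaError_t check(const char* name)
{
    cudaError_t err = cudaGetLastError();                   // launch-time errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // runtime errors
    if (err != cudaSuccess)
        fprintf(stderr, "%s failed: %s\n", name, cudaGetErrorString(err));
    return err;
}

int main()
{
    kernel_a<<<1, 32>>>();
    if (check("kernel_a") != cudaSuccess) return 1;

    kernel_b<<<1, 32>>>();
    if (check("kernel_b") != cudaSuccess) return 1;
    return 0;
}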

Azure Machine Learning throws error "Invalid graph: You have invalid compute target(s) in node(s)" while running the pipeline

I am facing a strange issue while dealing with Azure Machine Learning (Preview) interface.
I have designed a training pipeline, which was getting initiated on a certain compute target (a 2-node cluster with minimal configuration). However, it used to take a lot of time to execute, so I tried to create a new training cluster (an 8-node cluster with a higher configuration). During this process, I created and deleted some of the training clusters.
But strangely, since then, submitting the pipeline fails with the error "Invalid graph: You have invalid compute target(s) in node(s)".
Could you please advise on this situation?
Thanks,
Mitul
I bet this was pretty frustrating. A common debugging strategy I have is to delete compute targets and create new ones. Perhaps this was another "transient" error?
The issue has been fixed and the fix will be rolled out soon. Meanwhile, as a temporary workaround, you can refresh the page to make it work.

How come Stackdriver messes up my error grouping

In my experience the Stackdriver Error Reporting service groups unrelated errors together. This is a big problem for me on several levels:
The titles often do not correlate to the reported errors in "recent samples", so I have to look at the samples for each error to see what errors really happened, because the title can't be trusted.
I might set an error to "muted", and as a result other errors that are grouped under the same title are no longer reported. It might take me months to discover that certain errors have been happening that I wasn't aware of.
In general, I have no overview of which errors are happening and at what rate.
This all seems to violate basic functionality for an error reporting system, so I think I must be missing something.
The code is running on Firebase Functions, i.e. the Firebase flavor of Google Cloud Functions, and is written in TypeScript (compiled to JavaScript with a Firebase predeploy script).
I log errors using console.error with arguments formatted as Error instances, like console.error(new Error('some error message')). AFAIK that is the correct way for code running on Node.js.
Is there anything special I can do to make Stackdriver understand my code better?
I have this in a root of my functions deployment:
import * as sourceMaps from "source-map-support";
sourceMaps.install();
Below is a screenshot of one error category. You see that the error title is "The service is currently unavailable", yet the samples contain errors for "Request contains an invalid argument" and "This request was already locked..."
The errors about the service and the invalid argument could both be related to the FCM service, so there is some correlation, although I think these are very different errors.
The error about the request lock is something completely unrelated. The word "request" means something really different in that context, and that word is the only relationship I can see.
Error Reporting supports JavaScript but not TypeScript, as mentioned in the documentation for the product. Nevertheless, you should take a look at your logs and check that they are properly formatted to be ingested by Error Reporting.
Also, keep in mind that errors are grouped based on the guidelines in this document, so the grouping you see may simply be a consequence of those rules.
Hope you find this useful.
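For what it's worth, a minimal TypeScript sketch of a pattern that keeps log entries parseable, assuming an HTTPS function; doWork is a hypothetical helper. The point is that the logged value is always an Error instance, so its stack trace reaches the log entry and Error Reporting can parse it for grouping:

import * as functions from "firebase-functions";

// Hypothetical placeholder for the real business logic.
async function doWork(body: unknown): Promise<void> {
  // ...
}

export const demo = functions.https.onRequest(async (req, res) => {
  try {
    await doWork(req.body);
    res.sendStatus(200);
  } catch (err) {
    // Error Reporting parses the stack trace to group entries, so log the
    // Error instance itself rather than a bare string or object literal.
    console.error(err instanceof Error ? err : new Error(String(err)));
    res.sendStatus(500);
  }
});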

BizTalk Parallel Convoy with separate TimeOutException handling not able to build, with error message "fatal error X1001: unknown system exception"

Consider the following basic structure of a Parallel Convoy pattern in BizTalk 2016: a Parallel Actions shape with two active Receive shapes, combined with a single correlation set that is initialized by both active receives.
My issue arose when I wanted to have separate exception handling: one for the left receive and one for the right receive. So I've put a scope with a timeout around the left receive (Scope_1), and wrapped that scope in another scope (Scope_3) to catch the timeout exception.
For some reason this isn't allowed, and I get back "fatal error X1001: unknown system exception" at build time.
However, if I wrap Scope_3 around both active receives, it builds successfully.
What's the significant difference here for BizTalk to not allow separate timeout exception handling in this scenario?
By the way:
It doesn't matter what type of exception I try to catch, or whether my scopes are Long Running transactions or not; the error occurs all the same.
If I make a separate correlation set for each receive, the error does not occur, but of course that's not what I want, because then it would no longer be a parallel convoy.
Setting the scopes to synchronized does not affect the behavior.
The significant difference is that the orchestration starts up when it receives the first message, which may not be the receive inside Scope_1. In that scenario the timer would never be started. And if the first message was the one in Scope_1, that receive won't time out because it has already arrived, while nothing is timing out for the receive in Scope_2.
Having the timer around both receives sets the timeout in both scenarios.
What you could do is keep the timeout scope as in your second example and set a flag to indicate which message was received, then use that flag in your exception block.
The other option is a first Receive shape that initializes the correlation set, followed by a second Receive shape that follows the correlation, with the timeout on that.
First, I am able to replicate your issue.
Although Visual Studio reports this as an unknown system exception, to me it looks like unreachable code, based on the receive shape inside the scope (Scope_3) that is trying to initialize your correlation. There is a possibility that you won't be able to initialize the correlation the same way your left scope (Scope_2) does if your main scope (Scope_1) runs into an exception.
The only way I can think of is to use different correlation sets; you can set your send port to follow these correlation sets.
Without using correlation sets, this does not give an error at build time. To me this looks like an MS bug: VS should point out the unreachable code, not raise a fatal error.

TDWALLETERROR(543): Teradata Wallet error. The helper process is already being traced

When I start my jobs using FastExport, they sometimes end with an error:
TDWALLETERROR(543): Teradata Wallet error. The helper process is already being traced
When I restart them, they work.
I'm using the saved-key protection scheme.
Can someone explain to me why this error occurs and how to fix it?
It looks like you have a trace activated in one of the scripts run on the system.
Teradata has sniffer code that attempts to detect whether tracing is running during the wallet invocation, which is what triggers this error.