Azure Machine Learning throws error "Invalid graph: You have invalid compute target(s) in node(s)" while running the pipeline - azure-machine-learning-studio

I am facing a strange issue with the Azure Machine Learning (Preview) interface.
I designed a training pipeline that ran on a certain compute target (a 2-node cluster with a minimal configuration). However, execution took a long time, so I tried to create a new training cluster (an 8-node cluster with a higher configuration). In the process, I created and deleted several training clusters.
Strangely, since then, submitting the pipeline fails with the error "Invalid graph: You have invalid compute target(s) in node(s)".
Could you please advise on this situation?
Thanks,
Mitul

I bet this was pretty frustrating. A common debugging strategy I have is to delete compute targets and create new ones. Perhaps this was another "transient" error?

A fix for this issue has been made and will be rolled out soon. Meanwhile, as a temporary workaround, you can refresh the page to make it work.

Related

How does Hydra's sweeper, specifically the Ax-sweeper, free/allocate memory?

So I'm using Hydra 1.1 and hydra-ax-sweeper==1.1.5 to manage my configuration and run some hyperparameter optimization on the minerl environment. For this purpose, I load a lot of data into memory (peaking around 50 GB while loading, dropping to 30 GB once fully loaded) with multiprocessing (via PyTorch).
On a normal run this is not a problem (my machine has 90+ GB of RAM); one training run finishes without any issue.
However, when I run the same code with the -m option (and hydra/sweeper: ax in the config), it stops after about 2-3 sweeper runs, getting stuck at the data-loading phase because all of the system's memory (plus swap) is occupied.
At first I thought this was some issue with the minerl environment code, which starts Java code in a subprocess. So I tried to run my code without the environment (only the 30 GB of data), and I still hit the same issue, so I suspect a memory leak somewhere in the Hydra sweeper.
So my question is: how does the Hydra sweeper (or Ax-sweeper) work between sweeps? I always had the impression that it runs the main(cfg: DictConfig) function decorated with @hydra.main(...), takes its scalar return value (the score), and runs the Bayesian optimizer with that score, with main() called like an ordinary function (everything inside being properly deallocated/garbage-collected between sweep runs).
Is this not the case? Should I then load the data somewhere outside main() and keep it between sweeps?
Thank you very much in advance!
The hydra-ax-sweeper may run trials in parallel, depending on the result of calling the get_max_parallelism function defined in ax.service.ax_client.
I suspect that your machine is running out of memory because of this parallelism.
Hydra's Ax plugin does not currently have a config group for configuring this max_parallelism setting, so it is automatically set by ax.
Loading the data outside of main (as you suggested) may be a good workaround for this issue; see the sketch after this answer.
Hydra sweepers in general do not have a facility to control concurrency; this is the responsibility of the launcher you are using.
The built-in basic launcher runs the jobs serially, so it should not trigger memory issues.
If you are using other launchers, you may need to control their parallelism via launcher-specific parameters.
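To make that workaround concrete, here is a minimal sketch in Python. It assumes the serial basic launcher, so every sweep run executes in the same process and a module-level cache survives between runs; load_dataset, the cfg.data.path config key, and the placeholder training logic are hypothetical stand-ins for your own code:

import functools

import hydra
from omegaconf import DictConfig

@functools.lru_cache(maxsize=1)
def load_dataset(path: str):
    # Hypothetical loader: the expensive ~30 GB load happens once per
    # process; every later sweep run reuses the cached object.
    return list(range(1_000))  # placeholder for the real data

@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> float:
    data = load_dataset(cfg.data.path)  # assumed config key
    score = float(len(data))            # placeholder for real training
    return score                        # scalar consumed by the Ax sweeper

if __name__ == "__main__":
    main()

Note that this only helps when sweep runs share a process; a launcher that spawns a fresh process per job will reload the data each time.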

How to resolve celery.backends.rpc.BacklogLimitExceeded error

I am using Celery with Flask. After working fine for a good long while, my Celery setup has started raising a celery.backends.rpc.BacklogLimitExceeded error.
My config values are below:
CELERY_BROKER_URL = 'amqp://'
CELERY_TRACK_STARTED = True
CELERY_RESULT_BACKEND = 'rpc'
CELERY_RESULT_PERSISTENT = False
Can anyone explain why the error is appearing and how to resolve it?
I have checked the docs here, which don't provide any resolution for the issue.
Possibly it's because the process consuming the results is not keeping up with the process producing them. This can result in a large number of unprocessed results building up - this is the "backlog". When the size of the backlog exceeds an arbitrary limit, BacklogLimitExceeded is raised by Celery.
You could try adding more consumers to process the results, or setting a shorter value for the result_expires setting.
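For example, using Celery 4+ style configuration (a sketch, not your exact setup), shortening how long results are kept looks like this:

from celery import Celery

app = Celery("tasks", broker="amqp://", backend="rpc://")

# Expire stored results after 10 minutes instead of the default 1 day,
# so unconsumed results are discarded before the backlog limit is hit.
app.conf.result_expires = 600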
The discussion on this closed celery issue may help:
Seems like the database backends would be a much better fit for this purpose.
The amqp/RPC result backends need to send one message per state update, while for the database-based backends (Redis, SQLAlchemy, Django, MongoDB, cache, etc.) every new state update overwrites the old one.
The "amqp" result backend is not recommended at all, since it creates one queue per task, which is required to mimic the database-based backends, where multiple processes can retrieve the result.
The RPC result backend is preferred for RPC-style calls where only the process that initiated the task can retrieve the result.
But if you want persistent, multi-consumer results, you should store them in a database.
Using rabbitmq as a broker and redis for results is a great combination, but using an SQL database for results works well too.
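As a sketch of that combination (hostnames, ports, and credentials here are assumptions), the configuration change is small:

from celery import Celery

# RabbitMQ stays the broker; Redis stores results. Each new state update
# overwrites the previous one in Redis, so no per-task queue backlog can
# build up the way it does with the rpc/amqp backends.
app = Celery(
    "tasks",
    broker="amqp://guest:guest@localhost:5672//",
    backend="redis://localhost:6379/0",
)

@app.task
def add(x, y):
    return x + y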

Unity Hololens WorldAnchorTransferBatch ExportAsync fails most of the time

I'm trying to create a hologram that is shared among multiple Hololenses by following Microsoft's Holograms 240 tutorial at https://developer.microsoft.com/en-us/windows/holographic/holograms_240
The network connection is working fine, and I have all the correct permissions in my project settings. The issue is that, aside from a rare occurrence, the call to WorldAnchorTransferBatch.ExportAsync() fails with SerializationCompletionReason.UnknownError and prints out "Failed to world lock serialization failed".
I haven't been able to find any information about this error in my searching. Does anyone know what the cause of this could be?

LoadRunner test scenario won't start in the Controller and throws: "Failed to start/stop Service Virtualization"

I have a LoadRunner test scenario. After opening it with the LoadRunner Controller, I click the "Start Scenario" button. The scenario should run for 2 hours, but it stops after 1 minute with the following error:
Failed to stop Service Virtualization.
Failed to start Service Virtualization.
It seems that you have unwittingly activated an integration with HP Service Virtualization (SV for short) without having SV installed on the same machine. To remove it, open the SV configuration dialog and uncheck all the entries.
I solved the problem with a lot of effort.
The problem is that when you have a test scenario that worked before and you remove some scripts (groups) from it, the scenario stops working. What I did was create a new scenario from scratch with the same test scripts, and voilà, it worked. I hope everyone will pay attention to this in the future.

Indefinite Provisioning of EMR Cluster with Segue in R

I am trying to use the R package called Segue by JD Long, which is lauded as the ultimate in simplicity for using R with AWS by a book I read called "Parallel R".
However, for the 2nd day in a row I've run into a problem where I initiate the creation of a cluster and it just says STARTING indefinitely.
I tried this on OS X and in Linux with clusters of sizes 2, 6, 10, 20, and 25. I let them all run for at least 6 hours. I have no problem starting a cluster in the AWS EMR Management Console, though I have no clue how to connect Segue/R to a cluster that was started in the Management Console instead of via createCluster().
So my question is: is there either some way to troubleshoot the provisioning of the cluster, or a way to bypass the problem by creating the cluster manually and somehow getting Segue to work with it?
Here's an example of what I'm seeing:
> library(segue)
Loading required package: rJava
Loading required package: caTools
Segue did not find your AWS credentials. Please run the setCredentials() function.
> setCredentials("xxx", "xxx")
> emr.handle <- createCluster(numInstances=10)
STARTING - 2013-07-12 10:36:44
STARTING - 2013-07-12 10:37:15
STARTING - 2013-07-12 10:37:46
STARTING - 2013-07-12 10:38:17
.... this goes on for hours and hours and hours...
UPDATE: After 36 hours and many failed attempts, this began working (randomly...) when I tried it with 1 node. I then tried it with 10 nodes and it worked great. To my knowledge nothing changed locally or on AWS...
I am answering my own question on behalf of the AWS support rep who gave me the following belated explanation:
The problem with the EMR creation is the Availability Zone specified (us-east-1c). This Availability Zone is now constrained and doesn't allow the creation of new instances, so the job was trying to create the instances in an infinite loop.
You can see information about constrained AZ here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-regions-availability-zones
"As Availability Zones grow over time, our ability to expand them can become constrained. If this happens, we might restrict you from launching an instance in a constrained Availability Zone unless you already have an instance in that Availability Zone. Eventually, we might also remove the constrained Availability Zone from the list of Availability Zones for new customers. Therefore, your account might have a different number of available Availability Zones in a region than another account."
So you need to specify another AZ, or, as I recommend, not specify any AZ at all, so that EMR can select any available one.
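Segue's createCluster() hides these details, but as an illustration of the rep's advice (not Segue's actual internals), here is how the same idea looks with boto3 in Python: simply omitting the Placement entry lets EMR pick any available zone. The cluster name, instance types, release label, and roles are assumptions.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# No Instances["Placement"]["AvailabilityZone"] entry: EMR is free to
# choose any zone with capacity instead of looping on a constrained one
# such as us-east-1c.
response = emr.run_job_flow(
    Name="segue-style-cluster",
    ReleaseLabel="emr-5.36.0",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 10,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])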
I found this thread: https://groups.google.com/forum/#!topic/segue-r/GBd15jsFXkY
on Google Groups, where the topic of availability zones came up before. The zone that was set as the new default in that thread was the zone causing problems for me. I am attempting to edit the source of Segue.
Jason, I'm the author of Segue so maybe I can help.
Please look under the details section in the lower part of the AWS console and see if you can determine whether the bootstrap sequences completed. This is an odd problem, because typically an error at this stage is pervasive across all users. However, I can't reproduce this one.
