I want to run the same script on two different nodes at the same time. How is this possible?
Currently I am able to run one script on one node only.
Related
I have two independent Python codes, each one with its own inputs and outputs.
I would like to have two IPython consoles and two Variable Explorers, each one associated with one of the two codes. That way, I could switch back and forth between my two independent codes. I DON'T need to use results from one code in the other.
I thought I could do it by unchecking the "Use a single instance" option in "Tools > Preferences > General", but this doesn't seem to achieve what I want: I still see only one Python console and one Variable Explorer.
Is there any way to do it?
(Spyder maintainer here) You can open two consoles in the same Spyder instance by going to the menu
Consoles > New console
Each console is independent and has its own Variable Explorer view.
For additional isolation you can also go to the menu
Run > Configuration per file
and select the option called Execute in a dedicated console. That will execute each of your codes in its own dedicated console every time you run it (our default is to run code in the console that currently has focus).
The API in Airflow seems to suggest it is built around backfilling, catching up, and scheduling runs at regular intervals.
I have an ETL that extracts data on S3, where each node in the DAG is produced from the versions of its previous node (the node its data comes from). For example, here are the nodes of the DAG:
ImageNet-mono
ImageNet-removed-red
ImageNet-mono-scaled-to-100x100
ImageNet-removed-red-scaled-to-100x100
where ImageNet-mono is the previous node of ImageNet-mono-scaled-to-100x100 and
where ImageNet-removed-red is the previous node of ImageNet-removed-red-scaled-to-100x100
Both of them go through the scaled-to-100x100 transformation pipeline but produce different data, since their inputs are different.
As you can see, no date is involved. Is Airflow a good fit?
EDIT
Currently, the graph is simple enough to be managed manually, with fewer than 10 nodes. They won't run at a regular interval. Instead, as soon as someone updates the code for a node, I have to run the downstream nodes manually, one by one: python GetImageNet.py removed-red, then python scale.py 100 100 ImageNet-removed-red, and then python scale.py 100 100 ImageNet-mono. I am looking for a way to manage the graph so that one click triggers the run.
I think it's fine to use Airflow as long as you find it useful to use the DAG representation. If your DAG does not need to be executed on a regular schedule, you can set the schedule to None instead of a crontab. You can then trigger your DAG via the API or manually via the web interface.
If you want to run specific tasks you can trigger your DAG and mark tasks as success or clear them using the web interface.
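For concreteness, here is a minimal sketch of such a DAG, assuming Airflow 2.x; the dag_id, task_ids and file path are made up, and the bash commands are the ones from the question above.

    # dags/imagenet_pipeline.py -- sketch only; adjust names and paths to your setup
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="imagenet_pipeline",
        schedule_interval=None,           # no regular schedule: manual or API triggers only
        start_date=datetime(2021, 1, 1),
        catchup=False,                    # no backfilling
    ) as dag:
        mono = BashOperator(
            task_id="imagenet_mono",
            bash_command="python GetImageNet.py mono",
        )
        removed_red = BashOperator(
            task_id="imagenet_removed_red",
            bash_command="python GetImageNet.py removed-red",
        )
        mono_scaled = BashOperator(
            task_id="imagenet_mono_scaled_to_100x100",
            bash_command="python scale.py 100 100 ImageNet-mono",
        )
        removed_red_scaled = BashOperator(
            task_id="imagenet_removed_red_scaled_to_100x100",
            bash_command="python scale.py 100 100 ImageNet-removed-red",
        )

        # upstream node -> downstream node, mirroring the graph described above
        mono >> mono_scaled
        removed_red >> removed_red_scaled

With schedule_interval=None the DAG never runs on its own; the whole graph is started with one click on "Trigger DAG" in the web UI, with airflow dags trigger imagenet_pipeline on the command line, or through the REST API.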
I am working on a hybrid programming exercise using OpenMP and MPI that has to be tested on a cluster. The goal is basically to make an index of relevant words after processing a large set of files. What I am confused about is whether these files have to be sent to the respective nodes from the root, or whether they should already be there on the respective nodes. Thanks.
I have an R script that creates multiple scripts and submits them simultaneously to a computer cluster. After all of those scripts have completed and their output has been written to the respective folders, I would like to automatically launch another R script that works on these outputs.
I haven't been able to figure out whether there is a way to do this in R: the function 'wait' is not what I want, since the scripts are submitted as different jobs and each of them completes and writes its output file at a different time, but I actually want to run the subsequent script only after all of the outputs appear.
One way I thought of is to count the files that have been created and, if the correct number of output files is there, submit the next script. However, to do this I guess I would need to keep a script open that checks for the presence of the files every now and then, and I am not sure that is a good idea, since it probably takes a day or more before the first scripts complete.
Can you please help me find a solution?
Thank you very much for your help
-fra
I think you are looking at this the wrong way:
This is not an R problem at all; R just happens to be the client of your batch jobs.
This is an issue that the queue / batch processors on your cluster can address.
Worst case, you could just wait/sleep in a shell (or R script) until a 'final condition reached' file has been touched (see the sketch below).
Inter-dependencies can be expressed with make, too.
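For illustration, here is a minimal sketch of that last-resort polling idea, written in Python (the same pattern is a few lines of Sys.sleep/file.exists in R, or a shell while loop); the sentinel file name and the follow-up command are hypothetical.

    # poll_and_launch.py -- wait for a sentinel file touched by the last batch job,
    # then launch the downstream R script. All names are placeholders.
    import subprocess
    import time
    from pathlib import Path

    SENTINEL = Path("results/final_condition_reached")  # touched when everything is done
    POLL_SECONDS = 300  # the jobs take hours, so checking every 5 minutes is plenty

    while not SENTINEL.exists():
        time.sleep(POLL_SECONDS)

    # all outputs are in place: run the script that works on them
    subprocess.run(["Rscript", "analyse_outputs.R"], check=True)

The cleaner route is still to let the scheduler express the dependency itself, e.g. qsub -W depend=afterok:<jobid> on PBS/Torque or sbatch --dependency=afterok:<jobid> on Slurm, so nothing has to poll at all.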
I am submitting jobs to the batch process one after the other.
How do I control this so that the second batch job runs only when the first one is finished?
Right now both jobs execute simultaneously, which I don't want to happen.
There are two options: you can do this through code, or via manual setup. The manual method is fairly easy: just go to (Basic > Inquiries > Batch Job), create a new batch job and save it. Then click "View Tasks" and create a new task; this will be your first batch task. Choose your class, description, batch group, etc., then save. Click "Parameters" to set up the parameters.
After that, you can set up your dependent task. Make sure both of your tasks have descriptions. Add your second batch task and save. Then, in the lower left corner, click on the task that you want to have a condition, add a row there, and set up your conditions so that one task won't start until the other has completed.
Via X++ code, you would create a BatchHeader where you set up basically the same thing we just did manually. You use the .addDependency method to make one task dependent on the completion of the other. This walkthrough will get you started with a job that creates the batch header, and you'll just have to play around to get the dependency working.
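As a rough illustration of that code route, here is a minimal X++ sketch in the classic AX 2009/2012 style; FirstBatchTask and SecondBatchTask are hypothetical stand-ins for your own RunBaseBatch classes, and the caption is made up.

    static void createDependentBatchJob(Args _args)
    {
        BatchHeader     batchHeader;
        FirstBatchTask  task1;
        SecondBatchTask task2;

        batchHeader = BatchHeader::construct();
        batchHeader.parmCaption("Two dependent batch tasks");

        task1 = new FirstBatchTask();    // your first batch task
        task2 = new SecondBatchTask();   // your second batch task

        batchHeader.addTask(task1);
        batchHeader.addTask(task2);

        // task2 will not start until task1 has finished successfully
        batchHeader.addDependency(task2, task1, BatchDependencyStatus::Finished);

        batchHeader.save();
    }

BatchDependencyStatus also has Error and FinishedOrError values if the second task should run after a failure, or regardless of how the first one ends.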