I am importing a workflow from another cluster. To avoid conflicts I do the following:
- I set "pk": 0 in the workflow JSON
- I delete "dependencies": []
- I change 'parent_directory' for the new cluster
This way Hue imports the new workflow, but it also creates a lot of copies named name+timestamp. How can I avoid this problem?
Would someone let me know if there is a way to override the default failure notification method?
I am planning to send failure notifications to SNS; however, this means I will have to change all the existing DAGs and add an on_failure_callback method to each of them.
I was wondering if there is a way to override the existing notification method so that I don't need to change all the DAGs,
or to configure a global hook for all the DAGs, so that I don't need to add on_failure_callback to each one.
You can use a cluster policy to mutate the task right after the DAG is parsed.
For example, this function could apply a specific queue property when using a specific operator, or enforce a task timeout policy, making sure that no tasks run for more than 48 hours. Here’s an example of what this may look like inside your airflow_local_settings.py:
from datetime import timedelta

def policy(task):
    if task.__class__.__name__ == 'HivePartitionSensor':
        task.queue = "sensor_queue"
    if task.timeout > timedelta(hours=48):
        task.timeout = timedelta(hours=48)
For Airflow 2.0, this policy should look like this:
def task_policy(task):
    if task.__class__.__name__ == 'HivePartitionSensor':
        task.queue = "sensor_queue"
    if task.timeout > timedelta(hours=48):
        task.timeout = timedelta(hours=48)
The policy function has been renamed to task_policy.
In a similar way, you can modify other attributes, e.g. on_execute_callback, on_failure_callback, on_success_callback, on_retry_callback.
The airflow_local_settings.py file must be in one of the directories that are in sys.path. The easiest way to take advantage of this is that Airflow adds the directory ~/airflow/config to sys.path at startup, so you just need to create a ~/airflow/config/airflow_local_settings.py file.
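For the SNS case in the question, the same task_policy hook can attach a default on_failure_callback to every task that does not already define one. The following is only a rough sketch of that idea; the notify_sns helper, the boto3 call, and the topic ARN are illustrative assumptions, not part of the original answer:

import boto3

def notify_sns(context):
    # Hypothetical helper: publish a short failure message to a placeholder SNS topic.
    ti = context["task_instance"]
    boto3.client("sns").publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:airflow-failures",  # placeholder ARN
        Message=f"Task {ti.task_id} in DAG {ti.dag_id} failed on {context['ds']}.",
    )

def task_policy(task):
    # Only attach the callback when the DAG author has not set one explicitly.
    if task.on_failure_callback is None:
        task.on_failure_callback = notify_sns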
Context: I've defined an Airflow DAG which performs an operation, compute_metrics, on some data for an entity based on a parameter called org. Underneath, something like myapi.compute_metrics(org) is called. This flow will mostly be run on an ad-hoc basis.
Problem: I'd like to be able to select the org to run the flow against when I manually trigger the DAG from the Airflow UI.
The most straightforward solution I can think of is to generate n different DAGs, one for each org. The DAGs would have ids like compute_metrics_1, compute_metrics_2, etc., and when I need to trigger compute_metrics for a single org, I can pick the DAG for that org. This doesn't scale as I add orgs and more types of computation.
I've done some research and it seems that I can create a Flask blueprint for Airflow, which, to my understanding, extends the UI. In this extended UI I can add input components, like a text box, for picking an org and then pass that as a conf to a DagRun which is manually created by the blueprint. Is that correct? I'm imagining I could write something like:
from datetime import datetime

from airflow import settings
from airflow.models import DagRun
from airflow.utils.state import State

session = settings.Session()
execution_date = datetime.now()
run_id = 'external_trigger_' + execution_date.isoformat()
trigger = DagRun(
    dag_id='general_compute_metrics_needs_org_id',
    run_id=run_id,
    state=State.RUNNING,
    execution_date=execution_date,
    external_trigger=True,
    conf=org_ui_component.text)  # pass the org id from a component in the blueprint
session.add(trigger)
session.commit()  # I don't know if this would actually be scheduled by the scheduler
Is my idea sound? Is there a better way to achieve what I want?
I've done some research and it seems that I can create a Flask blueprint for Airflow, which, to my understanding, extends the UI.
The blueprint extends the API. If you want some UI for it, you'll need to serve a template view. The most feature-complete way of achieving this is to develop your own Airflow Plugin.
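If it helps to see the shape of that, here is a minimal, hypothetical sketch of registering a Flask blueprint through an Airflow plugin; the plugin name, blueprint name, and template folder are illustrative, not from the original answer:

from flask import Blueprint
from airflow.plugins_manager import AirflowPlugin

compute_metrics_bp = Blueprint(
    "compute_metrics_ui",
    __name__,
    template_folder="templates",  # would hold the form template with the org text box
)

class ComputeMetricsUIPlugin(AirflowPlugin):
    # Registering the blueprint here makes the webserver serve its routes and templates.
    name = "compute_metrics_ui_plugin"
    flask_blueprints = [compute_metrics_bp]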
If you want to manually create DagRuns, you can use this trigger as a reference. For simplicity, I'd trigger the DAG with the API.
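As a rough illustration of triggering with a conf payload over HTTP (this assumes Airflow 2's stable REST API with basic auth enabled; the host, credentials, and org value are placeholders):

import requests

# Hypothetical call: POST a new DagRun with the org passed in "conf".
response = requests.post(
    "http://localhost:8080/api/v1/dags/compute_metrics/dagRuns",
    json={"conf": {"org": "acme"}},
    auth=("admin", "admin"),  # placeholder credentials
)
response.raise_for_status()
print(response.json())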
And specifically about your problem: I would have a single DAG, compute_metrics, that reads the org from an Airflow Variable. Variables are global and can be set dynamically. You can prefix the variable name with something like the DagRun id to make it unique and thus safe for concurrent runs.
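A minimal sketch of that single-DAG approach (assuming Airflow 2 imports; the DAG id, variable names, and the print stand-in for myapi.compute_metrics are illustrative assumptions):

from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator

def compute_metrics(**context):
    # Prefixing the variable with the run_id keeps concurrent runs independent,
    # as suggested above; fall back to a plain "org" variable if none is set.
    org = Variable.get(
        f"org_{context['run_id']}",
        default_var=Variable.get("org", default_var=None),
    )
    print(f"computing metrics for org={org}")  # stand-in for myapi.compute_metrics(org)

with DAG(
    dag_id="compute_metrics",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # ad-hoc, manually triggered
    catchup=False,
) as dag:
    PythonOperator(task_id="compute_metrics", python_callable=compute_metrics)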
I have set up a JanusGraph cluster with Cassandra + ES. The cluster has been set up to support ConfiguredGraphFactory, and I am connecting to the Gremlin cluster remotely. I have set up a client and am able to create a graph using:
client.submit(String.format("ConfiguredGraphFactory.create(\"%s\")", graphName));
However, I am not able to get the traversal source of the graph created using the Gremlin driver. Do I have to write raw Gremlin queries and traverse the graph using client.submit, or is there a way to get it through the Gremlin driver using EmptyGraph.instance()?
To get the remote traversal reference, you need to pass in a variable name that is bound to your graph traversal. This binding is usually done as part of the "globals" in your start-up script when you start the remote server (the start-up script is configured to run as part of the gremlin-server.yaml).
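For illustration only, this is roughly what the client side of such a static binding looks like with the Python Gremlin driver (the question uses the Java driver; the host and the traversal-source name g are assumptions that must match the server's globals):

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

# "g" must match a traversal source bound in the server's start-up script globals.
remote = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(remote)
print(g.V().limit(1).toList())
remote.close()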
There is currently no inherent way to dynamically bind a variable to a graph or traversal reference, but I plan on fixing this at some point.
A short term fix is to bind your graph and traversal references to a method that will be variably defined, and then create some mechanism to change the variable dynamically.
To further explain a potential solution:
Update your server's startup script to bind g to something variable:
globals << [g : DynamicBindingTool.getBoundGraphTraversal()]
Create DynamicBindingTool, which has to do two things:
A. Provide a way to setBoundGraph() which may look something like:
def setBoundGraph(graphName) {
    this.boundGraph = ConfiguredGraphFactory.open(graphName);
}
B. Provide a way to getBoundGraphTraversal() which may look something like:
def getBoundGraphTraversal() {
    return this.boundGraph.traversal();
}
You can include these sorts of functions in your start-up script or perhaps even create a separate jar that you attach to your Gremlin Server.
Finally, I would like to note that the proposed example solution does not take into account a multi-node JanusGraph cluster, i.e. your notion of the current bound graph would not be shared across the JG nodes. To make this a multi-node solution, you can update the functions to store the bound graph in an external database or even piggyback on a JanusGraph graph.
For example, something like this would be a multi-node safe implementation:
def setBoundGraph(graphName) {
    def managementGraph = ConfiguredGraphFactory.open("managementGraph");
    // drop any previously bound graph marker
    managementGraph.traversal().V().has("boundGraph", true).drop().iterate();
    def v = managementGraph.addVertex();
    v.property("boundGraph", true);
    v.property("graph.graphname", graphName);
    managementGraph.tx().commit();
}
and:
def getBoundGraphTraversal() {
    def managementGraph = ConfiguredGraphFactory.open("managementGraph");
    // read the name of the currently bound graph
    def graphName = managementGraph.traversal().V().has("boundGraph", true).values("graph.graphname").next();
    return ConfiguredGraphFactory.open(graphName).traversal();
}
EDIT:
Unfortunately, the above "short-term fix" will not work, as the global bindings are evaluated once and stored in a Map for the duration of the server life cycle. Please see here for more information and updates on fixes: https://issues.apache.org/jira/browse/TINKERPOP-1839.
I'm trying to figure out how to add ngrx to my current project structure.
I've got all the concepts around ngrx/redux. Nevertheless, I can't quite figure out how to rebuild the project structure in order to integrate it into my project.
Where do I need to add reducers?
Where do I need to add states?
What about actions?
Where should the Store be: in a service, or injected into each component?
Where or when should the data be fetched from the server?
Is there any project structure best practice over there?
First, you should take a look at the @ngrx/store documentation: https://github.com/ngrx/store#setup
I've made a small (fun) project to demonstrate how to use:
- angular
- ngrx/store
- ngrx/effects
- normalized state
- selectors
You can find it here : https://github.com/maxime1992/pizza-sync
To give you some info about how it works:
- core.module.ts is where I declare my root reducer
- root.reducer.ts is where I build the root reducer and compose it with middlewares according to the dev/prod env
- For a reducer, I keep every related part together (interface(s) + reducer + effects + selectors)
Then within a component, to access the store, simply do the following:
Inject the store:
constructor(private _store$: Store<IStore>) { }
Get data from either
a) A selector (ex)
this._pizzasCategories$ = this._store$.let(getCategoriesAndPizzas());
b) Directly from the store (ex)
this._idCurrentUser$ = this._store$
    .select(state => state.users.idCurrentUser);
Notice that I didn't subscribe, which means I'm using the async pipe in my view, so Angular subscribes to the Observable for me.
But of course you can also do it by hand and use subscribe in your ts.
PS: I'll be releasing a starter with all of that already set up, to avoid losing time with it when starting a new project. I'll try to release it this week and I'll update this post as soon as it's done. Maybe it might help you.
EDIT 12/05/17
Late but I finally released the starter :) !
https://github.com/maxime1992/angular-ngrx-starter
When I define a task it gets called for each project in a multi-project build:
import sbt._
import Keys._
import IO._
object EnsimePlugin extends Plugin {

  val ensime = TaskKey[Unit](
    "generateEnsime",
    "Generate the ENSIME configuration for this project")

  override val projectSettings = Seq(
    ensime := generateEnsime(
      (thisProject in Test).value,
      (update in Test).value
    )
  )

  private def generateEnsime(proj: ResolvedProject, update: UpdateReport): Unit = {
    println(s"called by ${proj.id}")
  }
}
How can I define a task so that it is only called for the root project?
Commands are usually discouraged, but is this perhaps a valid use of one, e.g. like the sbt-idea plugin?
From the official docs about Aggregation:
In the project doing the aggregating, the root project in this case,
you can control aggregation per-task.
It describes the aggregate key scoped to a task with the value false as:
aggregate in update := false
Use commands for session processing that would otherwise require additional steps in a task. That doesn't necessarily mean it's harder with tasks, but my understanding of commands vs tasks is that the former are better suited for session manipulation. I might be wrong, though; in your particular case commands are not needed whatsoever.