Is it possible to generate jobs in Dagster dynamically using configuration from a database?

Currently, my database has multiple departments. I need to apply a data pipeline to all of these departments with different configurations.
I want to load the configuration for each department from a database, then use these configurations to generate a list of jobs in Dagster.
For example, I have 3 tenants:
Department1: Configuration1
Department2: Configuration2
Department3: Configuration3
This information is stored in my database.
How can I load this information and dynamically create 3 jobs (pipelines):
Pipeline1 for Department1 with Configuration1
Pipeline2 for Department2 with Configuration2
Pipeline3 for Department3 with Configuration3
Is it possible to do this in Dagster? I can do it with Airflow (by dynamically generating DAGs), but I'm not sure how to do it in Dagster, since I cannot load database configuration outside of an op/job.

In Dagster, your @repository function is just a regular function, so you can run arbitrary code in there to query your database and generate jobs dynamically:
from dagster import repository

@repository
def my_repo():
    configs = ...  # some query to your database
    jobs = []
    for config in configs:
        jobs.append(get_job_for_my_config(config))
    return jobs
If you expect the database call to be somewhat expensive in terms of time, you can look into making your repository lazy-loaded; the Dagster RepositoryDefinition docs detail how to do that.
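For completeness, here is a minimal sketch of what get_job_for_my_config could look like; the op, its config_schema, and the shape of the config dict are hypothetical and just illustrate building one job per department:
from dagster import job, op

@op(config_schema={"department": str})
def process_department(context):
    # Placeholder op: in practice this would run the department's pipeline logic.
    context.log.info(f"Processing {context.op_config['department']}")

def get_job_for_my_config(config):
    # Each department gets its own uniquely named job with its config baked in.
    @job(
        name=f"pipeline_{config['department']}",
        config={
            "ops": {
                "process_department": {
                    "config": {"department": config["department"]}
                }
            }
        },
    )
    def department_job():
        process_department()

    return department_job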

Related

Which Aurora table stores DAG variable information?

I have an Airflow DAG which calls a particular bash command using a variable. On the backend, we have an Aurora DB. Are there any tables in the Aurora DB which store information about the variables used in Airflow DAGs? I need to create a report out of it, hence the request to access the variables from the backend.
I tried using the operational_insights schema but could not find any tables with the desired information.
If you are using Airflow Variables, you should be able to query a list of them with the REST API no matter which backend you use.
curl "http://<your Airflow host>/api/v1/variables" --user "login:password"
This is preferred over querying the Airflow metadata database directly, because if you accidentally modify or drop a table you can corrupt your Airflow installation.
With that caveat: the standard table where Airflow Variables are stored is variable, so after logging into the database, SELECT * FROM variable; should return the list.
Again, this is for Airflow Variables. From your question I am not entirely sure whether you mean those or, more generally, any variables that tasks use. In the latter case you might be looking for the rendered_fields of the task instances, which can also be retrieved using the API.
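If you prefer Python over curl, here is a minimal sketch of the same call (the host and credentials are placeholders, and this assumes basic auth is enabled on the webserver):
import requests

resp = requests.get(
    "http://<your Airflow host>/api/v1/variables",
    auth=("login", "password"),
)
resp.raise_for_status()
for var in resp.json()["variables"]:
    # Each entry exposes the variable's key and value.
    print(var["key"], var["value"])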

How to configure Druid batch indexing jobs to use a dynamic EMR cluster for batch ingestion?

I am trying to automate Druid batch ingestion using Airflow. My data pipeline creates an EMR cluster on demand and shuts it down once Druid indexing is completed. But for Druid we need to have the Hadoop configuration files in the Druid server folder (ref), which is blocking me from using dynamic EMR clusters. Can we override the Hadoop connection details in the job configuration, or is there a way to let multiple indexing jobs use different EMR clusters?
I tried overriding the parameters (the Hadoop configuration) from core-site.xml, yarn-site.xml, mapred-site.xml, and hdfs-site.xml as job properties in the Druid indexing job. It worked, and in that case there is no need to copy those files to the Druid server.
I used the Python program below to convert the properties from the XML files into JSON key-value pairs. You can do the same for all of the files and pass everything in the indexing job payload. This can be automated with Airflow after creating the different EMR clusters.
import json
import os

import xmltodict

path = 'mypath'
file = 'yarn-site.xml'

with open(os.path.join(path, file)) as xml_file:
    data_dict = xmltodict.parse(xml_file.read())

# Flatten the <property><name>/<value> pairs into a plain dict for the job payload.
druid_dict = {prop.get('name'): prop.get('value') for prop in data_dict.get('configuration').get('property')}
print(json.dumps(druid_dict))
In researching how this might be done, I found the hadoopDependencyCoordinates property here, which seems relevant: https://druid.apache.org/docs/0.22.1/ingestion/hadoop.html#task-syntax
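For illustration, here is a hedged sketch of how the converted properties could be attached to the indexing task payload and submitted to the Overlord; the dataSchema/ioConfig contents and the Overlord URL are placeholders:
import json
import requests

indexing_task = {
    "type": "index_hadoop",
    "spec": {
        "dataSchema": {},   # placeholder: your datasource schema
        "ioConfig": {},     # placeholder: your input spec
        "tuningConfig": {
            "type": "hadoop",
            "jobProperties": druid_dict,  # the key/value pairs built above
        },
    },
}

resp = requests.post(
    "http://<druid-overlord>:8090/druid/indexer/v1/task",
    data=json.dumps(indexing_task),
    headers={"Content-Type": "application/json"},
)
print(resp.json())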

Cloudera Post deployment config updates

In Cloudera, is there a way to update a list of configurations at once using the CM API or curl?
Currently I am updating them one by one using the CM API below.
services_api_instance.update_service_config()
How can we update all the configurations stored in a JSON/config file at once?
The CM API endpoint you're looking for is PUT /cm/deployment. From the CM API documentation:
Apply the supplied deployment description to the system. This will create the clusters, services, hosts and other objects specified in the argument. This call does not allow for any merge conflicts. If an entity already exists in the system, this call will fail. You can request, however, that all entities in the system are deleted before instantiating the new ones.
This basically allows you to configure all your services with one call rather than doing them one at a time.
If you are using services that require a database (Hive, Hue, Oozie, ...) then make sure you set them up before you call the API. It expects all the parameters you pass in to work, so external dependencies must be resolved first.
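As a rough sketch (the host, API version, credentials, and file name are placeholders), the call could look like this from Python, assuming deployment.json already follows the ApiDeployment format:
import json
import requests

with open("deployment.json") as f:
    deployment = json.load(f)

resp = requests.put(
    "http://<cm-host>:7180/api/v19/cm/deployment",
    params={"deleteCurrentDeployment": "true"},  # optional: remove existing entities first
    json=deployment,
    auth=("admin", "admin"),
)
resp.raise_for_status()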

Windows Workflow - Creating a reusable task list (bookmarks?)

I'm looking at migrating business processes into Windows Workflow; the client app will be ASP/MVC and the workflows are likely to be hosted via IIS.
I want to create a common 'simple task' activity which can be used across multiple workflows. Activity properties would look something like this:
Related customer
Assigned agent
Prompt ("Please review PO #12345")
Text for 'true' button ("Accept")
Text for 'false' button ("Reject")
Variable to store result in
Once the workflow hits this activity, a task should be put into a DB table. The web app will query the table and show the agent a list of tasks they need to complete. Once they hit Accept/Reject, the workflow needs to resume.
It's the last bit that I'm stuck on. What do I need to store in the DB table to resume a workflow? Given that the tasks table will be used by multiple workflows, how would I instantiate the workflow to resume it? I've looked at bookmarks, but they assume that you know the type of workflow that you're resuming. Do I need to use reflection, or is there a method in WF where I can pass a workflow id and it will instantiate it?
You can use a workflow service and control it via the ControlEndpoint.
For more info about ControlEndpoint, see:
http://msdn.microsoft.com/en-us/library/ee358723.aspx

SQL Server load balancing: optimize the number of hits or optimize the query?

When we developers write data access code, what should we really worry about if the application is to scale well and handle the load/hits?
Given this simple problem, how would you solve it in a scalable manner?
1. ProjectResource is a class (encapsulating resources assigned to a Project)
2. Each resource assigned to a Project is a User class
3. Each User in the Project also has a ReportingHead and a ProjectManager, who are also instances of User
4. Finally, there is a Project class containing the project details
Legend of classes used
User
Project
ProjectResource
Table Diagram
ProjectResource
ResourceId
ProjectId
UserId
ReportingHead
ProjectManager
Class Diagram
ProjectResource
ResourceId : String / Guid
Project : Project
User : User
ReportingHead : User
ProjectManager : User
note:
All the user information is stored in the User table
All the Project information is stored in the project table
Here's the problem. When the application requests the resources in a Project, the operations below are followed:
First, get the records for the Project
Get the UserId and make a request (using the Users DAL) to get the User instance
Get the ProjectId and make a request (using the Projects DAL) to get the Project information
Finally, assign the Users and Project to the instance of ProjectResource
Clearly you can see 3 DB calls are made here to populate a single ProjectResource, but the concerns and who manages the objects are clearly defined. This is the way I have planned to do it, since there is also connection pooling available in SQL Server & ADO.NET.
There is also another way, where all the details are retrieved in a single hit using table inner joins and then populated.
Which way should I really be taking, and why?
Extras:
.NET 2.0, ASP.NET 2.0, C#, SQL Server 2005, DB on the same machine hosting the application.
For best performance and scalability, you should minimize the number of round-trips to the DB. To prove that to yourself, just run some benchmarks; it becomes clear very quickly.
One approach to a single round-trip is to use joins. Another is to return multiple result sets. The latter can be helpful in eliminating possible duplicate data.
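The question's stack is ADO.NET/C#, but purely as an illustrative sketch of the single round-trip join approach, here is its shape via Python/pyodbc; the connection string and the non-key column names are placeholders inferred from the table diagram above:
import pyodbc

SQL = """
SELECT pr.ResourceId,
       p.Name  AS ProjectName,     -- placeholder column
       u.Name  AS ResourceName,    -- placeholder column
       rh.Name AS ReportingHead,   -- placeholder column
       pm.Name AS ProjectManager   -- placeholder column
FROM ProjectResource pr
JOIN Project p  ON p.ProjectId = pr.ProjectId
JOIN [User] u   ON u.UserId    = pr.UserId
JOIN [User] rh  ON rh.UserId   = pr.ReportingHead
JOIN [User] pm  ON pm.UserId   = pr.ProjectManager
WHERE pr.ProjectId = ?
"""

conn = pyodbc.connect("DRIVER={SQL Server};SERVER=.;DATABASE=mydb;Trusted_Connection=yes")
for row in conn.cursor().execute(SQL, 1):  # 1 = example ProjectId
    # One result set gives everything needed to populate a ProjectResource.
    print(row.ResourceId, row.ProjectName, row.ResourceName)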
