Airflow DAG processing for new clients

I'm new to Airflow, and I understand how a DAG's start_date + schedule_interval works for my own company, for example pulling Google Ads historical data.
But we're doing analysis on data for multiple clients.
How do I structure DAGs to handle new clients' data? I don't want to create a new DAG for every client onboarded, even programmatically; that seems poor.
Also, since new clients can come in at any time, I'd like to pull data when clients create accounts, which is event driven. I could use the experimental Event Driven API, but is there a better approach?
Is Airflow not a good solution for this use case?

You can maintain the client data in persistent storage (a database) and schedule your DAG to reference that data and call the APIs accordingly.
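For example, here is a minimal sketch of a single DAG that looks up the active clients and fans out one task per client via dynamic task mapping (assuming Airflow 2.4+; get_active_clients and pull_client_data are hypothetical placeholders for your own lookup and ingestion logic):

from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@hourly", start_date=datetime(2024, 1, 1), catchup=False)
def client_ingest():
    @task
    def get_active_clients():
        # e.g. SELECT client_id FROM clients WHERE active, via a DB hook
        return ["client_a", "client_b"]  # placeholder

    @task
    def pull_client_data(client_id):
        # call the ads API for this client and land the data somewhere durable
        ...

    # dynamic task mapping: one mapped task instance per client, no per-client DAGs
    pull_client_data.expand(client_id=get_active_clients())

client_ingest()

New clients onboarded into the table are picked up automatically on the next scheduled run.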
If you need this to be as close to real time as possible and event driven, I can think of AWS solutions like Lambda and S3, which are more appropriate for an event-driven architecture.
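If you want to keep Airflow in the picture, one way to bridge the two (a sketch only, assuming Airflow 2's stable REST API and a hypothetical account-created event payload) is a small Lambda handler that triggers a DAG run with the new client id in conf:

# Lambda handler sketch: trigger an Airflow DAG run when a client account is
# created. The Airflow URL, credentials and event shape are assumptions.
import json
import urllib.request

AIRFLOW_URL = "https://airflow.example.com/api/v1/dags/client_ingest/dagRuns"

def handler(event, context):
    client_id = event["detail"]["client_id"]  # assumed event payload
    body = json.dumps({"conf": {"client_id": client_id}}).encode()
    req = urllib.request.Request(
        AIRFLOW_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic <credentials>",  # placeholder auth
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return {"status": resp.status}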
Let me know if this answers your question.

Related

Best way to send periodic spreadsheet-style reports from a Redshift DWH?

Ello M8's,
Currently in charge of an Airflow DAG that sends periodic reports directly to people's emails as a custom CSV, and sometimes to external S3.
The DAG implementation I'm working with is great for what it already does, but it's difficult to extend and scale. In the case of a refactor, I'm wondering what the proper tool/method is to accomplish automated reports from a Redshift DWH. Any tips? Kind of wish AWS had a reporting service on top of Redshift, but maybe Airflow is their answer to that in the first place.

Firebase Functions - Do something every 10 minutes

I'm building an app using Firebase. There's an admin setting I want to build that essentially populates one of my nodes with data. It would do this every 10 minutes, and there are 50-80 data points that I'd want to add.
So this would take roughly 13 hours total; I'd probably want to expand this in the future, though. I would only call this function maybe once a week.
I would simply do this using setTimeout, but I've heard this can be expensive? Does someone know roughly how expensive this would be?
I'm not that experienced with cron jobs; is there a better way of doing this with Firebase? It's not something I want to have running constantly, and it's not at a specific time, but just whenever I need it. I'd also potentially want to have the job running multiple times at the same time, which seems to be super easy using Firebase Functions.
Any ideas? Thank you!
You would have to trigger the functions from an outside source.
One way you could do that is by creating and subscribing to a pub-sub system:
const functions = require('firebase-functions');

// Runs whenever a message is published to the 'some-subscription' topic
exports.someSubscription = functions.pubsub.topic('some-subscription').onPublish((event) => {
  // Your code here
});
Another way you could trigger the functions is by exposing them as HTTP endpoints and hitting those endpoints; see the Firebase documentation for resources on creating HTTP endpoints.
Both of these approaches require some sort of scheduling service to either publish a new event to your topic or hit the HTTP endpoint. You will need to either implement such a service yourself or use one of the scheduling services out there that can do it for you.
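For the scheduling piece, even a plain cron entry that calls the HTTP-triggered function will do. A minimal sketch, shown in Python only because any small script or scheduler works here; the function URL is a placeholder for your deployed endpoint:

# trigger.py -- run from cron (e.g. */10 * * * *) to hit the HTTP-triggered
# Cloud Function. The URL below is a placeholder for your deployed function.
import urllib.request

FUNCTION_URL = "https://us-central1-your-project.cloudfunctions.net/populateNode"

def main():
    with urllib.request.urlopen(FUNCTION_URL) as resp:
        print(resp.status, resp.read().decode())

if __name__ == "__main__":
    main()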

Apple Watch complication network requests

I'm creating a weather application that pulls its information from an online API.
I am able to get the information successfully in the GlanceController and in the InterfaceController. However, I'm a little unsure as to how I should do this for the complication. Can I perform a network request within the ComplicationController class?
If so, how would I go about doing this?
You'll run into issues related to asynchronously fetching data from within the complication data source, mostly due to the data being received after the timeline update is complete.
Apple recommends that you fetch the data from a different part of your app, and have it available in advance of any complication update:
The job of your data source class is to provide ClockKit with any requested data as quickly as possible. The implementations of your data source methods should be minimal. Do not use your data source methods to fetch data from the network, compute values, or do anything that might delay the delivery of that data. If you need to fetch or compute the data for your complication, do it in your iOS app or in other parts of your WatchKit extension, and cache the data in a place where your complication data source can access it. The only thing your data source methods should do is take the cached data and put it into the format that ClockKit requires.
Other ways to approach it:
The best way to update your complication (from your phone once you have received updated weather data) is to use transferCurrentComplicationUserInfo.
Alternatively, you could have your watch app or glance cache its most recent weather details to have them on hand for the next scheduled update.
If you absolutely must handle it from the complication:
You could have the scheduled timeline update get the extension to start an NSURLSession background task to asynchronously download the information from your weather service. The first (scheduled) update will then end, with no new data. Once the new weather data is received, you could then perform a second (manual) update to reload the complication timeline using the just-received data.
I don't have any personal experience with that approach, mostly because of the unnecessary need for back-to-back timeline updates.

Load data on server startup and refresh on a regular time interval using Spring 3.1

I am new to the Spring framework. I would like to pull data from the database and set that data in the application context. Whenever the data changes in the database, the data in the context should be refreshed accordingly. Please help me out; what would be the best approach?
If you want to refresh the data in your application context whenever it changes in the database, I have bad news for you: that's not really possible (or at least not easy and straightforward).
Most common databases are passive in the sense that they won't let you subscribe to specific events (like a data update), because this would require additional IPC between the database and the subscribed application, and that is generally not the main purpose of a database.
In any case, something like this will be database-specific, so if you really want it, check the API docs of your database; there is a chance you'll find a means of doing it there. Again, this probably won't be a very flexible or robust solution.
In the general case you go one of three routes:
1. Pull your data from the database every time it is needed.
2. Pull your data from the database every time it is needed, but add some cache.
3. Implement an application-level component that manages the data: when data is requested, it fetches it from the database if it is missing from the cache; when data is updated, it updates it both in the cache and in the database (see the sketch below).
(1) and (2) are pretty much your only options if your data can be updated from outside your application. (3) is a good way to go if the data can only be updated from within your application and the amount of data is small enough to justify caching it.
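To make option (3) concrete, here is a minimal cache-aside sketch. The pattern itself is language-agnostic; it is shown in Python only for brevity, and load_from_db/save_to_db are placeholders for your actual data access code (in Spring you would typically wire this up with its caching abstraction instead).

# Minimal cache-aside component: reads hit the in-memory cache first and fall
# back to the database; writes go through this component so the cache and the
# database stay in sync. load_from_db/save_to_db are placeholders.
class DataManager:
    def __init__(self, load_from_db, save_to_db):
        self._cache = {}
        self._load = load_from_db
        self._save = save_to_db

    def get(self, key):
        if key not in self._cache:      # cache miss: read from the database
            self._cache[key] = self._load(key)
        return self._cache[key]

    def update(self, key, value):
        self._save(key, value)          # write to the database first
        self._cache[key] = value        # then refresh the cache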
Hope this will help.

Create workflow service instances for large number of records at once

I'm working on a business problem that requires importing files containing thousands of records. Each record has to be registered in a workflow as an individual record that goes through its own workflow.
The WF4 Corporate Purchase Process example has a good solution: in the first step it creates bookmarks for all the required record IDs, so the workflow can be resumed with the rest of the actions for each individual record/ID.
I would like to know how to implement the same thing using workflow services, as I could get the benefits of AppFabric for my workflows.
Are there any other solutions for handling a batch of records/IDs? Otherwise the workflow service has to be called thousands of times just to register every record in a workflow instance, which is not a good solution.
I would like to know how to implement the same thing using workflow services, as I could get the benefits of AppFabric for my workflows.
This is pretty straightforward. You're going to have one workflow that reads the file and loops through the results using the existing looping activities. Then, inside the loop, you start up the workflow that each record needs (the "Service") by calling its endpoint with a Send activity.
Now, as for the workflow that is the Service, you're going to have a Receive activity at the top of the workflow with CanCreateInstance set to true. Everything after the Receive is no different from any other workflow. You may consider having a Send activity right after the Receive just to let the caller know that the Service has started, but that's not a requirement; the Receive is required because it forces WF to build the workflow for hosting in the WorkflowServiceHost.
Are there any other solutions for handling a batch of records/IDs? Otherwise the workflow service has to be called thousands of times just to register every record in a workflow instance, which is not a good solution.
Are you indicating that a web server receiving thousands of requests is not a good solution? Consider the fact that an IIS server can handle roughly 25-50 requests per instant in time, per core. Now consider the fact that your loop that's loading the workflows isn't going to average more than maybe 5 in that instant of time, and probably more like 1 or 2.
I don't think the web server is going to be your issue. I've started up literally tens of thousands of workflows on a server via a loop just like the one you're going to build, and it didn't break a sweat.
One way would be to use WCF's MSMQ binding to launch your workflows. Requests can come in normally through HTTP, and WCF would route them to MSMQ and process the load. You can throttle how many workflow instances are used through the MSMQ binding + IIS settings.
Download this Word document that describes setting up a workflow application with WCF and MSMQ: http://www.microsoft.com/en-us/download/details.aspx?id=21245
In the spirit of doing the simplest thing that could work, you can bring the sub-workflow into the main workflow as an activity and use a ParallelForEach to execute the branch for each input from your file. No extra invoking is required, and the tooling supports this out of the box because all workflows are activities. Hosting the main process in a service, so you can avoid contention with the rest of your IIS users (real people though they may be), might be a good idea.
I do agree that calling IIS or a WCF service thousands of times is not a problem, though, unless you want to do it all in a few seconds!
It is important to remember that one of the good things about workflow is that it has fairly low overhead (compared to other workflow products), so you should be more concerned with what your workflow does than with the idea of launching lots of instances. Batches like your example are very common.
