Airflow structure/organization of Dags and tasks

My questions:
What is a good directory structure for organizing your DAGs and tasks? (The example DAGs show only a couple of tasks.)
I currently keep my DAGs at the root of the dags folder and my tasks in separate directories; I am not sure whether this is the right way to do it.
Should we use zip files? https://github.com/apache/incubator-airflow/blob/a1f4227bee1a70531cfa90769149322513cb6f92/airflow/models.py#L280

I use something like this.
A project is normally something completely separate or unique, for example DAGs to process files that we receive from a certain client, which will be completely unrelated to everything else (almost certainly a separate database schema).
I have my operators, hooks, and some helper scripts (delete all Airflow data for a certain DAG, etc.) in a common folder.
I used to have a single git repository for the entire Airflow folder, but now I have a separate git repository per project (this makes it more organized and easier to grant permissions on Gitlab, since the projects are so unrelated). This means that each project folder also has a .git, a .gitignore, etc.
I tend to save the raw data and then 'rest' a modified copy of the data, which is exactly what gets copied into the database. I have to heavily modify some of the raw data because different clients send different formats (Excel, web scraping, HTML email scraping, flat files, queries from SalesForce or other database sources...).
Example tree:
├───dags
│   ├───common
│   │   ├───hooks
│   │   │       pysftp_hook.py
│   │   │
│   │   ├───operators
│   │   │       docker_sftp.py
│   │   │       postgres_templated_operator.py
│   │   │
│   │   └───scripts
│   │           delete.py
│   │
│   ├───project_1
│   │   │   dag_1.py
│   │   │   dag_2.py
│   │   │
│   │   └───sql
│   │           dim.sql
│   │           fact.sql
│   │           select.sql
│   │           update.sql
│   │           view.sql
│   │
│   └───project_2
│       │   dag_1.py
│       │   dag_2.py
│       │
│       └───sql
│               dim.sql
│               fact.sql
│               select.sql
│               update.sql
│               view.sql
│
└───data
    ├───project_1
    │   ├───modified
    │   │       file_20180101.csv
    │   │       file_20180102.csv
    │   │
    │   └───raw
    │           file_20180101.csv
    │           file_20180102.csv
    │
    └───project_2
        ├───modified
        │       file_20180101.csv
        │       file_20180102.csv
        │
        └───raw
                file_20180101.csv
                file_20180102.csv
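To illustrate the raw/modified split in the tree above, a small helper could derive both paths for a given project and date. This is only a sketch: `DATA_ROOT`, `data_paths`, and the `file_YYYYMMDD.csv` naming are assumptions read off the example tree, not code from the setup described here.

```python
from datetime import date
from pathlib import Path

# Hypothetical root; the example tree keeps `data` next to `dags`.
DATA_ROOT = Path("data")


def data_paths(project: str, day: date, root: Path = DATA_ROOT) -> dict:
    """Return the raw and modified file paths for one project and one day."""
    filename = f"file_{day:%Y%m%d}.csv"
    return {
        "raw": root / project / "raw" / filename,
        "modified": root / project / "modified" / filename,
    }


paths = data_paths("project_1", date(2018, 1, 1))
print(paths["raw"].as_posix())       # data/project_1/raw/file_20180101.csv
print(paths["modified"].as_posix())  # data/project_1/modified/file_20180101.csv
```

Keeping the path logic in one place means a DAG only has to name the project and the run date.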
Update October 2021. I have a single repository for all projects now. All of my transformation scripts are in the plugins folder (which also contains hooks and operators - basically any code which I import into my DAGs). DAG code I try to keep pretty bare so it basically just dictates the schedules and where data is loaded to and from.
├───dags
│   │
│   ├───project_1
│   │       dag_1.py
│   │       dag_2.py
│   │
│   └───project_2
│           dag_1.py
│           dag_2.py
│
├───plugins
│   ├───hooks
│   │       pysftp_hook.py
│   │       servicenow_hook.py
│   │
│   ├───sensors
│   │       ftp_sensor.py
│   │       sql_sensor.py
│   │
│   ├───operators
│   │       servicenow_to_azure_blob_operator.py
│   │       postgres_templated_operator.py
│   │
│   └───scripts
│       ├───project_1
│       │       transform_cases.py
│       │       common.py
│       ├───project_2
│       │       transform_surveys.py
│       │       common.py
│       └───common
│               helper.py
│               dataset_writer.py
│   .airflowignore
│   Dockerfile
│   docker-stack-airflow.yml
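One practical effect of keeping DAG files bare is that the transformation code under plugins can be plain Python with no Airflow imports, so it can be run and tested on its own. Below is a minimal sketch of what a module such as scripts/project_1/transform_cases.py might contain. The function name, the column names, and the cleaning rules are all hypothetical.

```python
import csv
import io


def transform_cases(raw_csv: str) -> str:
    """Normalize a raw CSV export into the shape loaded into the database.

    Hypothetical rules: lower-case the header names, strip whitespace
    from every value, and drop rows that have no case_id.
    """
    reader = csv.DictReader(io.StringIO(raw_csv))
    fieldnames = [name.strip().lower() for name in reader.fieldnames or []]

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Skip the None key that DictReader uses for surplus columns.
        cleaned = {
            key.strip().lower(): (value or "").strip()
            for key, value in row.items()
            if key is not None
        }
        if cleaned.get("case_id"):
            writer.writerow(cleaned)
    return out.getvalue()


raw = "Case_ID, Status\n 101 , open \n, closed\n"
print(transform_cases(raw))
```

A DAG in dags/project_1 would then only import this function and wire it to a schedule and to the source and target locations.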

I would love to benchmark folder structures with other people as well. Maybe it depends on what you are using Airflow for, but I will share my case. I am building data pipelines for a data warehouse, so at a high level I basically have two steps:
Dump a lot of data into a data lake (directly accessible only to a few people).
Load data from the data lake into an analytic database, where the data will be modeled and exposed to dashboard applications (many SQL queries to model the data).
Today I organize the files into three main folders that try to reflect the logic above:
├── dags
│   ├── dag_1.py
│   └── dag_2.py
├── data-lake
│   ├── data-source-1
│   └── data-source-2
└── dw
    ├── cubes
    │   ├── cube_1.sql
    │   └── cube_2.sql
    ├── dims
    │   ├── dim_1.sql
    │   └── dim_2.sql
    └── facts
        ├── fact_1.sql
        └── fact_2.sql
This is more or less my basic folder structure.
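With the SQL split into folders like this, a loader could walk the dw folder layer by layer. The sketch below is only an illustration: the load order (dims, then facts, then cubes) and the names `LAYER_ORDER` and `ordered_sql_files` are assumptions, not something the structure above dictates.

```python
from pathlib import Path

# Assumed load order: dimensions first, then facts, then cubes.
LAYER_ORDER = ["dims", "facts", "cubes"]


def ordered_sql_files(dw_root: Path) -> list:
    """List the .sql files under dw_root, layer by layer, sorted by name."""
    files = []
    for layer in LAYER_ORDER:
        files.extend(sorted((dw_root / layer).glob("*.sql")))
    return files


if __name__ == "__main__":
    # Prints nothing if there is no dw folder in the working directory.
    for path in ordered_sql_files(Path("dw")):
        print(path.as_posix())
```

A DAG could then create one task per returned file and chain the tasks in that order.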

I am using Google Cloud Composer. I have to manage multiple projects with some additional SQL scripts, and I want to sync everything via gsutil rsync. Hence I use the following structure:
├───dags
│ │
│ ├───project_1
│ │
│ ├───dag_bag.py
│ │
│ ├───.airflowignore
│ │
│ ├───dag_1
│ │ dag.py
│ │ script.sql
│
├───plugins
│ │
│ ├───hooks
│ │ hook_1.py
│ │
│ ├───sensors
│ │ sensor_1.py
│ │
│ ├───operators
│ │ operator_1.py
And the file dag_bag.py contains these lines:
from airflow.models import DagBag
dag_bag = DagBag(dag_folder="/home/airflow/gcs/dags/project_1", include_examples=False)

Related

Deploying a basic next js app to firebase

I am deploying my NextJS app to firebase:
├───cache
│ ├───eslint
│ ├───images
│ │ ├───0WpnypBiLOeeiB3X0hVqOF0ll3SnkCsXd8yikMk4S34=
│ │ ├───1B8bGvhirJapcL5XmETMz1V3wVjjJAMe1CslxcDdVsA=
│ │ ├───5jeYnvvTrMRqPdLG++t4+flNiWSPYRi1ruSYhtVt4+A=
│ │ ├───fRkQP6QcfhhaBsX-8AHEVdjFj5cqrxmUNpEGchi7k5I=
│ │ ├───GQdSB-hnJuAXrDTKhALySs7hR3iq-F+b441ZwGY-Auc=
│ │ ├───hVI5b3aXH6yVKWpdXeQx1jIEJ86lkY2pkPOFUZK7l-k=
│ │ ├───mk63Qs1Sf2mImq3xg3G3yB2Zz4EPU5zyXkRNtxpHTF4=
│ │ ├───nwntPVsb9esdqbS5n1qHSnlBkYmWXGeLI1RPdQ823BA=
│ │ ├───PoCvg2wPdvFypgsmeVezHw+TcuEnNu8btEjSWpE9Zfs=
│ │ ├───tmBop5HuFCS0TfjZMskUrU8WLyjuckmrBAJT8H-8mA4=
│ │ ├───TMHEYoVnd6-4qFL5njYlP1xz6jFswOJGoaWtVKcuLTc=
│ │ └───zyuPavU9-TkEuRoxj2a6eJ+JjC+19Kd+2HElWk8WlxI=
│ ├───swc
│ │ └───plugins
│ │ └───v3
│ └───webpack
│ ├───client-development
│ ├───client-development-fallback
│ ├───client-production
│ ├───server-development
│ └───server-production
├───server
│ ├───chunks
│ └───pages
│ ├───api
│ │ └───trends
│ │ └───place
│ ├───tweet
│ └───user
│ └───[id]
└───static
├───8lMsDJmHpxf3_6VKO1ZNN
├───chunks
│ └───pages
│ ├───tweet
│ └───user
│ └───[id]
└───css
This is the tree of my .next folder, and I am totally stuck. I have deployed the app, but the index.html file is empty, and I want it to render the contents of a JS file in the chunks folder (index.html does not appear in the tree above, but it is there). Since index.html is empty, I see a blank page when I deploy. What should I add to index.html to show the content of that file?
Please help. I have tried adding script tags and more, but to no avail.

Google Calendar API - Fetch instances of recurring events where original recurring event exists in calendar

I am currently working with Google Calendar API to fetch instances of recurring events, to retrieve the recurrence rule.
As suggested in https://stackoverflow.com/a/30505720/9524080, I am using singleEvents param while calling Events#list.
This allows me to fetch all instances of recurring events present in my calendar, each with a link to the original recurring event via recurringEventId.
By retrieving the event via this id, I am able to figure out the recurrence rule.
This is working as expected but there is an edge case.
When I am an attendee of an instance of a recurring event without being invited to the original recurring event, I can't use Events#get to retrieve the original recurring event, as it isn't present in my calendar (a 404 is returned).
┌────────────────────────────────────────────────────┐ ┌────────────────────────────────────────────────────┐
│ My calendar │ │ Calendar 2 │
│ │ │ │
│ ┌──────────────────┐ │ │ ┌──────────────────┐ │
│ │ │ │ │ │ │ │
│ │ Recurring event │ │ │ │ Recurring event │ │
│ │ │ │ │ │ │ │
│ │ A │ │ │ │ B │ │
│ │ │ │ │ │ │ │
│ └──────────────────┘ │ │ └──────────────────┘ │
│ │ │ │
│ ┌──────────────────┐ ┌──────────────────┐ │ │ ┌──────────────────┐ │
│ │ │ │ │ │ │ │ │ │
│ │ Instance of │ │ Instance of │ │ │ │ Instance of │ │
│ │ recurring event │ │ recurring event │ │ │ │ recurring event │ │
│ │ A1 │ │ B1 │ │ │ │ B1 │ │
│ │ │ │ │ │ │ │ │ │
│ └──────────────────┘ └──────────────────┘ │ │ └──────────────────┘ │
│ │ │ │
│ │ │ │
└────────────────────────────────────────────────────┘ └────────────────────────────────────────────────────┘
(I can fetch A via A1, but not B via B1 because I only have instance B1 in my calendar and not B)
Is there any way, while using Events#list, to return only the instances whose original recurring event is available in my calendar?

In terms of permissions, you are explicitly given permission to just one instance of a recurring event that will be interpreted as an isolated event (from the perspective of the invitee), therefore getting a 404 as an invitee while trying to use Events:get using the main event ID can be considered an expected behavior.
So all of this is pretty much expected because of how Google manages permissions. Take Google Drive as an example. Say I have a folder that contains a couple of files and I give you access to only one specific file. If you then try to use that access to list all the files in the folder, it will fail, because you are not supposed to have access to the rest of the files.
References:
Invite people to your Calendar event
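Since the 404 is expected, one possible workaround is to filter on the client side: list the instances as before, then keep only those whose original event can actually be fetched. The sketch below shows that idea without calling the real API. `instances_with_reachable_master` and `fetch_master` are hypothetical names, and `fetch_master` stands for a wrapper around Events#get that returns None on a 404.

```python
from typing import Callable, Iterable, Optional


def instances_with_reachable_master(
    instances: Iterable[dict],
    fetch_master: Callable[[str], Optional[dict]],
) -> list:
    """Keep only instances whose original recurring event can be fetched.

    `instances` are event dicts as returned by Events#list with
    singleEvents set. `fetch_master` is a caller-supplied function that
    wraps Events#get and returns None when the API answers 404.
    """
    cache = {}
    kept = []
    for event in instances:
        master_id = event.get("recurringEventId")
        if master_id is None:
            continue  # not an instance of a recurring event
        if master_id not in cache:
            # Look up each original event once, however many instances it has.
            cache[master_id] = fetch_master(master_id)
        if cache[master_id] is not None:
            kept.append(event)
    return kept


# Stand-in for the real API: only recurring event "A" exists in this calendar.
calendar = {"A": {"id": "A", "recurrence": ["RRULE:FREQ=WEEKLY"]}}
listed = [
    {"id": "A1", "recurringEventId": "A"},
    {"id": "B1", "recurringEventId": "B"},
]
print([e["id"] for e in instances_with_reachable_master(listed, calendar.get)])
# ['A1']
```

The cache keeps the number of Events#get calls down to one per distinct recurringEventId.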

Gatsby/NextJS: group pages by domains

I'm trying to understand how to migrate our structure to Gatsby/NextJS.
src
├── scripts
├── components
├── domains
│ ├── ca.cdn.domain-1.com
│ │ ├── global
│ │ └── pages
│ │ │ ├── page.html
│ │ │ └── page-2.html
│ ├── m.cdn.domain-2.de
│ │ ├── global
│ │ └── pages
│ │ │ ├── page.html
│ │ │ └── page-2.html
The page.html files can have different HTML templates, but they share some common components from src.
Currently, we generate static pages per domain with Gulp. I tried Gatsby today, and pages can only live in the src folder. Any suggestions on how to use Gatsby/NextJS with multiple domains in one repo?
I also described the question in more detail here for NextJS.

Is there a way to minimize Qt required Dlls?

I am trying to deploy a simple Qt-based chat program that uses a WebWidget for the chat itself, QListWidgets, and some labels, as well as QWebSocket for the network connection. But I need to add 120 MB of files to deploy it.
These are my QT and CONFIG variables in the .pro file:
CONFIG += qt release
QT += gui websockets webkitwidgets widgets
This is the list of files I had to add:
│ D3Dcompiler_47.dll
│ icudt54.dll
│ icuin54.dll
│ icuuc54.dll
│ libEGL.dll
│ libgcc_s_dw2-1.dll
│ libGLESV2.dll
│ libstdc++-6.dll
│ libwinpthread-1.dll
│ opengl32sw.dll
│ Qt5Core.dll
│ Qt5Gui.dll
│ Qt5Multimedia.dll
│ Qt5MultimediaWidgets.dll
│ Qt5Network.dll
│ Qt5OpenGL.dll
│ Qt5Positioning.dll
│ Qt5PrintSupport.dll
│ Qt5Qml.dll
│ Qt5Quick.dll
│ Qt5Sensors.dll
│ Qt5Sql.dll
│ Qt5Svg.dll
│ Qt5WebChannel.dll
│ Qt5WebKit.dll
│ Qt5WebKitWidgets.dll
│ Qt5WebSockets.dll
│ Qt5Widgets.dll
│
├───audio
│ qtaudio_windows.dll
│
├───bearer
│ qgenericbearer.dll
│ qnativewifibearer.dll
│
├───iconengines
│ qsvgicon.dll
│
├───imageformats
│ qdds.dll
│ qgif.dll
│ qicns.dll
│ qico.dll
│ qjp2.dll
│ qjpeg.dll
│ qmng.dll
│ qsvg.dll
│ qtga.dll
│ qtiff.dll
│ qwbmp.dll
│ qwebp.dll
│
├───mediaservice
│ dsengine.dll
│ qtmedia_audioengine.dll
│
├───platforms
│ qwindows.dll
│
├───playlistformats
│ qtmultimedia_m3u.dll
│
├───position
│ qtposition_positionpoll.dll
│
├───printsupport
│ windowsprintersupport.dll
│
├───sensorgestures
│ qtsensorgestures_plugin.dll
│ qtsensorgestures_shakeplugin.dll
│
├───sensors
│ qtsensors_generic.dll
│
├───sqldrivers
│ qsqlite.dll
│ qsqlmysql.dll
│ qsqlodbc.dll
│ qsqlpsql.dll
│
└───translations
qt_ca.qm
qt_cs.qm
qt_de.qm
qt_en.qm
qt_fi.qm
qt_fr.qm
qt_he.qm
qt_hu.qm
qt_it.qm
qt_ja.qm
qt_ko.qm
qt_lv.qm
qt_ru.qm
qt_sk.qm
qt_uk.qm
QtPositioning, Sql dlls, Qml and QtQuick? The last time I deployed a Qt program was with Qt4, and I remember having fewer dependencies. Is there something wrong?

You might want to do your own Qt build and cut it down as much as possible. It will still be a mess, but a smaller one. Remove optional modules you don't need, resort to using system libraries instead of those bundled with Qt wherever possible, don't use ICU - that alone will cut almost 30MB of dependencies.
The best option is to use a static build and link statically, but there are plenty of limitations at play, you either need a commercial license or to open your code, and still, deployment for QML projects is and has been broken for years. Sadly, it seems like making the lives of all of those using Qt for free as miserable as possible has become quite a priority, in order to force developers into spending on the expensive commercial license, which is the sole remedy to the situation, or at least it will be hopefully by the time Qt 5.7 is released.
BTW, if those DLLs got pulled in by the deployment tool, I advise against trusting it. I tried it literally yesterday, and it turned out to be completely broken: it failed to pull in half of the needed DLLs, half of those it pulled in weren't actually needed, and in terms of qml files, it did even worse.
If not by the deployment tool, those extra dlls are probably indirect dependencies - for example the web sockets define a QML API, so they might pull QML in as a dependency, which itself pulls a cascade of other modules and libraries. You should investigate if you can build those modules without their QML side.

Tool to create ASCII graph from a set of vertices and edges?

Is there a tool that takes as input a series of vertices and edges, and outputs a graph in ASCII/Unicode format?
Thanks,
Kevin

In addition to Graph::Easy mentioned by @nibot, there are a couple of other tools around for this:
Vijual (Clojure): http://lisperati.com/vijual/
ascii-graphs (Scala): https://github.com/mdr/ascii-graphs
(Disclaimer: I'm the developer of the latter).

Yes! Perl has Graph::Easy, as described in this Hacker News comment.
Here's some output from the online demo:
  ........      +---------+      +-----+
  : Bonn : -->  | Berlin  | ..>  | Ulm |
  :......:      +---------+      +-----+
                  H
                  H train
                  v
                +---------+
                | Koblenz |
                +---------+

For anyone reading this post in 2022, check out Diagon.
There is both a command-line tool, diagon, and a website.
You can create several kinds of ASCII visualization from text, such as:
DAG
flowchart
sequence diagram
mathematical expression (without Latex)
DAG example:
┌─────┐┌─────────┐┌─────┐
│socks││underwear││shirt│
└┬────┘└┬─┬──────┘└┬─┬──┘
 │      │┌▽─────┐  │┌▽───────┐
 │      ││pants │  ││tie     │
 │      │└┬──┬──┘  │└┬───────┘
┌▽──────▽─▽┐┌▽─────▽┐│
│shoes     ││belt   ││
└──────────┘└┬──────┘│
┌────────────▽───────▽┐
│jacket               │
└─────────────────────┘
Also worth a look: https://www.plantuml.com/

I might recommend graphviz -- I don't know if it has an ascii-art output, but it does support a heap of other useful formats, and perhaps you can find a converter to ascii art from one of those formats.

Another option: https://www.npmjs.com/package/ascii-seq
input.txt
From -- message -> To .. response -> From
-- line --
Another -- msg -> Dest
Self -- abc -> Self
npx ascii-seq input.txt or cat input.txt | npx ascii-seq
┌──────┐ ┌────┐┌─────────┐┌──────┐┌──────┐
│ From │ │ To ││ Another ││ Dest ││ Self │
└───┬──┘ └──┬─┘└────┬────┘└───┬──┘└───┬──┘
│ │ │ │ │
├── message ──>│ │ │ │
│ │ │ │ │
│<∙ response ∙∙┤ │ │ │
│ │ │ │ │
│ │ │ │ │
────────────────────── line ───────────────────────
│ │ │ │ │
│ │ │ │ │
│ │ ├─ msg ──>│ │
│ │ │ │ │
│ │ │ │ ├──┐
│ │ │ │ │ abc
│ │ │ │ │<─┘
│ │ │ │ │
│ │ │ │ │
│ │ │ │ │

Yes, it's called Unix directories and the 'tree' command.
Output example:
db
├── colors
│   ├── green
│   └── nongreen
└── person
    └── type
        ├── alien
        │   └── colors -> db/colors
        ├── female
        │   └── colors -> db/colors
        └── male
            └── colors -> db/colors
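If creating real directories is not convenient, the same kind of output can be produced directly from a list of edges. The sketch below is a small stand-alone function, not part of the tree command. `render_tree` is a hypothetical name, and it assumes the edges form a tree (one parent per node, no cycles), so it does not cover general graphs.

```python
from collections import defaultdict


def render_tree(edges, root):
    """Render (parent, child) edges as tree-style text lines, starting at root.

    Assumes the edges form a tree: one parent per node and no cycles.
    """
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    lines = [root]

    def walk(node, prefix):
        kids = children[node]
        for index, kid in enumerate(kids):
            last = index == len(kids) - 1
            lines.append(prefix + ("└── " if last else "├── ") + kid)
            # Children of the last entry get blank padding instead of a bar.
            walk(kid, prefix + ("    " if last else "│   "))

    walk(root, "")
    return "\n".join(lines)


edges = [
    ("db", "colors"),
    ("db", "person"),
    ("colors", "green"),
    ("colors", "nongreen"),
    ("person", "type"),
]
print(render_tree(edges, "db"))
```

Children are printed in the order their edges appear in the input list.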
