Airflow/Composer recommended folder structure

Airflow/Composer recommended folder structure - airflow

Do you guys have any recommended for Composer folder/directories structure? The way it should be structured is different from the way our internal Airflow server is using right now.
Based on Google documentation: https://cloud.google.com/composer/docs/concepts/cloud-storage:
plugins/: Stores your custom plugins, operators, hooks
dags/: store dags and any data the web server needs to parse a dag.
data/: Stores the data that tasks produce and use.
This is an example of how I organize my dags folder:
I had trouble before when I put the key.json file in the data/ folder and the dags cannot be parsed using the keys in the data/ folder. So now I tend to put all the support files in the dags/ folder.
Would the performance of the scheduler be impacted if I put the supported files (sql, keys, schema) for the dag in the dags/ folder? Is there a good use case to use the data/ folder?
It would be helpful if you guys can show me an example of how to structure the composer folder to support multiple projects with different dags, plugins and supported files.
Right now, we only have 1 Github for the entire Airflow folder. Is it better to have a separate git per project?
Thanks!

The impact on the scheduler should be fairly minimal as long as the files you place in the dags folder are not .py files; however, you can also place the files in the plugins folder which is also synced via copy.
I would use top-level folders to separate projects (e.g. dags/projectA/dagA.py), or even separate environments if the projects are large enough.

First question:
I had trouble before when I put the key.json file in the data/ folder and the dags cannot be parsed using the keys in the data/ folder. So now I tend to put all the support files in the dags/ folder.
You only need to set the correct configuration to read these files from the data/ or plugins/ folders. The difference between the filesystem between running airflow in composer against running it locally is that the path to these folders changes.
To help you with that, in another post I describe a solution to find the correct path for these folders. I quote my comment from the post:
"If the path I entered does not work for you, you will need to find the path for your Cloud Composer instance. This is not hard to find. In any DAG you could simply log the sys.path variable and see the path printed."
Second question:
Would the performance of the scheduler be impacted if I put the supported files (sql, keys, schema) for the dag in the dags/ folder? Is there a good use case to use the data/ folder?
Yes, at least the scheduler needs to check if these are python scripts or not. It's not really that much but it does have an impact.
If you solve the issue of reading from the data or plugins folders, you should move these files to those folders.
Third question:
Right now, we only have 1 Github for the entire Airflow folder. Is it better to have a separate git per project?
If your different projects require different pypi packages, having separate repositories for each one and different airflow environments too would be ideal. The reason for this is that you'll leverage the risk of falling into pypi packages dependencies errors and reduce build times.
In the other hand, if your projects will use the same pypi packages, I'd suggest to keep everything in a single repository until it becomes easier to have every project in a different repo. But having everything in a single repo makes the deployment easier.

Related

Use Git in SSH to pull specific directories

Total newbie question but what is the best practice when it comes to using SSH with Git? I'm working on a WordPress project. In the root I have gulp and other dev files/folders like SASS and Scripts that I don't need on the server and in the same project I have my WordPress folder that contains a theme and a few custom plugins. As you can imagine when the theme or any of the plugins are ready to be deployed I don't want to pull everything in my repository on the server. So far as a newbie I've always just pull and pushed the entire repository and used FTP to upload what I need to the server, so how is this done with SSH and Git and is there a better way to have my setup?
EDIT: To make my question a little bit more clear let me give you an example of what I think my issue is. In my main project folder, I have a SASS folder next to my WordPress folder. All I really need to deploy to the server is the WordPress folder. My build process that happens on my dev machine combines all of the SASS files into a single CSS that is then placed into the WordPress folder. I need the SASS folder to be tracked by Git so that any other developer can pull them and continue developing so I can't have git ignore it. However none of those SASS files need to be on the server for WordPress to work either. I just simply need to deploy the WordPress folder and everything that's in it.
I understand the idea of creating a bare repository on the server and using post-receive hook to point the git folder sitting outside your web root to point to where the web root is. But that's basically how GIT and SSH work and that's not answering my concern.

Not with Git
Git is not designed to pull specific files or directories only. It's a directed acyclic graph with binary blobs as objects and sometimes multiple objects get compacted into a single larger object.
Due to Git design, your specific request is not possible.
Alternatives
post-receive hook
If your website only contains simple static files then it's okay to push to a git repository over SSH. In reality, it's unlikely your repository will be large as long as you don't have non-text files.
Take for example the following setup.
/var/lib/www - apache web dir which is the cloned copy of www.git
/var/lib/www.git - a bare git repository.
/var/lib/www.git/hooks/post-recieve - A server side git hook. It can be a shell script that pulls the www repository when this repository is updated.
Sample post-recieve hook script:
#!/bin/bash
cd /home/sam/sandbox/git-hooks/www
unset GIT_DIR
git fetch origin master
git reset --hard origin/master
Zip up build in a tar.gz
At the end of your build you can zip up your files in a tar.gz. This file should be hosted somewhere (perhaps GitHub releases if you're using GitHub). Some enterprises use on premise artifact hosting like Nexus or Artifactory.
The idea being: you have a tested artifact that has a specific sha256sum. The artifact you test is the exact same artifact which eventually goes to production.
Diving into more detail such as continuous integration, continuous delivery, and the software development life cycle might be out of scope for your question.

No best practice.
Git is for source control, not for deployment. There is no best practice for using git this way because git is not a deployment tool. You also don't need git history on your server. In fact, you don't need git at all unless you insist on using it for deployment. You are welcome to use it this way but it's not ideal because of exactly the kind of problem you're asking about.
What is the best practice?
There are a number of tools you could use to handle your deployments. Most of the tools generally let you set up a series of steps that let you deploy the code you want into the environment you want. You could go with simple tools such as Phing or Deployer in the PHP world, or something more sophisticated like Puppet or Chef if you have more complex needs. You could just write your own bash scripts if what you need is really very simple. I recommend Phing or Deployer given the info you've provided. https://deployer.org/ https://www.phing.info/
You'll just configure whichever tool you want to ssh into your target box and copy over only the files you want into the directory you want on the server, in whatever way you would like to do that. Usually, you have the script copy files into a temp dir, tarball them up, ssh them over and untar them. After that, you'll usually do some additional work on the server to move files around, change symlinks, whatever else you might need to do.
What about compiled SASS, ES6 js files, or modern static stuff?
All you need to do is add steps to the handle the static files and where you want them to go. Include the generated static files in your tarball when you push stuff up, and put them in the right directories in the server once you untar it.
When you configured your SASS compiler, and whatever other pre-compiled static code you may have - you configured it to create a destination file. That is, the file(s) of actual CSS and JS that they generate. That's all you need to bring along - and if you have the destination directory set to be inside your wordpress theme, you may not even have to pay all that much special attention to it's handling. You may need to move them somewhere else once they are on the server but that all depends on the specific setup in your server, which I think is outside the scope of this question.
Additional Notes:
You didn't ask about this but I thought it was worth mentioning, that you shouldn't be sending the entire wordpress repository every time you update. Just like you don't need the uncompiled SASS code, you also don't need to be repackaging core WordPress. You don't even need to be commiting core WordPress, its a dependency and you don't need to be changing it.
All that should be getting committed by you is your theme and plugin code, and the uncompiled static files. Compiled static files and external dependencies like the WordPress core don't belong in your git history. For deployment purposes, WordPress should already be installed. The stuff in your tarballs should just be plugins and themes, and additional static files if they aren't already in there for some reason.
TLDR;
Don't use git for this. Use a tool like Phing or Deployer. Build your static files into your theme, and create phing/deployer scripts that tarball up only the code you want, SSH's it over to your server, and untars it into the directories you want. If you have some special location on the server for your static files, just make sure to add steps in your script for that.

So, based on your question and comment, there are three computers involved. There is a web server (when you say "server", I take it as a Web server in this scenario, or the server computer that runs a Web server program). There is another server where your git repo is hosted. And, there is your dev workstation. Is this correct?
It seems like, you have a cloned git repo on your Web server. Your current practice/workflow appears to be (1) (based on your expression "SSH'ed into my server") you log into the web server via SSH (just like Telnet) from your workstation (SSH is just a protocol, which can be used for different purposes). (2) you pull from your repo on hosted service (e.g., github), and (3) deploy it to your "www" directory on the same server. Is this correct?
(I can think of an alternative scenario based on your use of the word "FTP", etc., but let's focus on the above scenario, for now.)
Now, your question is, whenever you "pull" (on your Web server), you feel like you are pulling everything from your repo on your hosted service. And, is there a better way? Am I understanding your question correctly?
If so, as another commenter suggested, git (and, any version control system, in general) is very good at fetching "deltas" only. If you are worried about "fetching everything" every time you pull (the step (2) above), then your worry is unfounded.
Now, the question is, why do you have a git repo on your Web server, if that is indeed the case? This is a pretty legit setup and I've done this before (e.g., on EC2). But, as a best practice, people generally don't do that on "production" servers. It's because you have to "build" your web app, and you really don't want to do that on production servers.
The next question is, what do you exactly do in Step (3)? The build process (whatever process you use) typically generates an "output" which can be directly deployed to the web server. (The convention is the output is generally a single folder, "public", "www", "dist", or whatever, or a single file (e.g., tar.gz, zip, jar, war), etc.) Regardless of whether you build the deployable output on your dev workstation (or, a build machine) or on your Web server, you don't generally do "deltas" in this context. Even if you've only changed a single file (say, a CSS file), you generally build the whole output again (instead of, say, just replacing the changed CSS file only). When you use FTP to upload files, etc., you can selectively upload certain files and/or directories, etc., but as a general practice, we don't do that. We always build the complete output from scratch and deploy it to the Web server. (This is mainly to reduce the potential deployment errors and increase the reliability.)
So, to answer your question, (A) If you are pulling git repo on your Web server, you should really change that practice, and move the build process to your dev computer or a dedicated build machine. (BTW, services like github, gitlab, TFS, ... provide the build service for you.) (B) If you are currently selectively FTP'ing your web app files to your Web server, then you should really consider adopting some kind of formal build, and deployment, process moving forward.

After your SASS build process is done use scp or rsync to move the files to the prod server:
scp -r /[local wordpress dir]/wp-content/themes/your-theme/ username#your.prod.server.com:/path/to/dir/wordpress/wp-content/themes/
scp -r /[local wordpress dir]/wp-content/plugins/* username#your.prod.server.com:/path/to/dir/wordpress/wp-content/plugins/

I am working in a project and using git ssh with bitbucket following is the process i am using it may work for you also if not please correct me :
Step 1 ->I have setup git and create repo in bit-bucket.
Step 2 ->And setup project with my local and linked with my repo.
Step 3 ->connect my server using ssh.
Step 4 ->Work in my local and commit and push all changes in my git repo.
Step 5 ->Run git pull on ssh so all changes deployed in my server.
I am using above process and i love this process.i have used .gitignore file that is not required for push on my repo.
Thanks

Deploying 2 apps from the same git using CodeDeploy

We have an app with web and worker nodes - the code for both is in the same git but gets deployed to different autoscaling groups. The problem is that there is only one appspec file, however the deployment scripts (AfterInstall, AppStart, etc.) for the web/worker nodes are different. How would I go about setting my CodeDeploy to deploy both apps and execute different deployment scripts ?
(Right now we have an appspec file that just invokes chef recipes that execute different actions based on the role of the node)

I know this question is very old, but I had the same question/issue recently and found an easy way to make it work.
I have added two appspec files on the same git: appspec-staging.yml, appspec-storybook.yml.
Also added two buildspec files buildspec-staging.yml, buildspec-storybook.yml(AWS CodeBuild allows specify the buildspec file).
The idea is after the build is done, we will copy and rename the specific appspec-xx.yml file to the final appspec.yml file, so finally, in the stage of CodeDeploy, we will have a proper appspec.yml file to deploy. Below command is for linux environment.
post_build:
commands:
- mv appspec-staging.yml appspec.yml

Update - according to an Amazon technical support representative it is not possible.
They recommend having separate gits for different environments (prod,staging,dev,etc.) and different apps.
Makes it harder to share code, but probably doable.

You can make use of environment variables exposed by the agent in your deployment scripts to identify which deployment group is being deployed.
Here's how you can use them https://blogs.aws.amazon.com/application-management/post/Tx1PX2XMPLYPULD/Using-CodeDeploy-Environment-Variables
Thanks,
Surya.

The way I have got around this is to have an appspec.yml.web and an appspec.yml.worker in the root of the project. I then have two jobs in Jenkins; one each that correspond to the worker and the web deployments. In each, it renames the appropriate one to be just appspec.yml and does the bundling to send to codedeploy.

Use Fossil for system files?

As a new user of Fossil, I'm curious if there are any negative implications with using Fossil to store things like /etc/, /usr/local/etc files from Unix like systems like FreeBSD & OpenBSD. If I'm doing this for multiple systems, I think I'd create a branch with each hostname to track those files.
Q1: Have you done this? Do you prefer a different VCS to handle the system files?
Q2: Lots of changes have happened in Fossil over the years and I'm curious if it's possible to restrict who can merge branches with trunk. From reading earlier threads it wasn't possible but there are two workarounds:
a) tell people not to merge to trunk
b) have people clone and trunk maintainer pick up changes from their repo

System configuration files stored in /etc, /var or /usr/local/etc can generally only edited by the root user. But since root has complete access to the whole system, a mistaken command there can have dire consequences.
For that reason I generally use another location to keep edited configuration files, a directory in my home-directory that I call setup, which is under control of git. Since I have multiple machines running FreeBSD, each machine gets its own subdirectory. There is a special subdirectory of setup called shared for those configuration files that are used on multiple machines. Maintaining multiple copies of identical files in separate repositories or even branches can be a lot of extra work.
My workflow is the following;
Edit a configuration file in my repository.
Copy it to its proper location.
Test the changes. If problems occur, go back to step 1.
Commit the changes to the revision control system. Copy the
committed files to their proper location.
Initially I had a shell script (basically a list of install commands) to install the files for me. But I also wanted to see the differences between the working tree and the installed files.
So for my convenience, I wrote a script called deploy to help me with this. It can tell me which files in the repo are different from the installed files and can show me the differences. It can also install files to their proper locations.

Single git repo setup tracking multiple locations on hard drive

I'm very new to the world of git (done some svn in the past) and would like some advice on trying to accomplish the following.
My current workflow is that I setup the static html files using Middleman to get the base HTML structure and styles before porting over to a Wordpress template. These static files are located at C:/git/project-name/HTMLTemplates.
My wordpress setup uses Xampp so the theme files are kept in C:/Xampp/wordpress/wp-content/themes/project-theme.
What I would like to do is have a single git repo that tracks the changes of the two different locations (HTMLTemplates and project-theme)
Is this at all possible, or do I simply create two individual repos (eg: proect-static and project-wordpress)?

No, there is no mechanism in git for this. Git assumes that all files that it manages (the "working copy") live in a single directory (and subdirectories); there is no support for managing two separate directories in in repo.
So you'll have to somehow keep everything in one directory, probably as subdirectories HTMLTemplates and theme or similar.
You could use two git repos, but I'd strongly advise against this. A single repo should contain a whole "project", i.e. everything needed to build one piece of software (excluding things like external libraries). If you split your project across two repositories, you cannot usefully branch and merge (because you'd have to do it in both repos simultaneously), you cannot easily check out old versions etc..
To solve your problem, I see a few possible solutions:
Have some build / deployment script that copies everything to the right places. You probably alread have a script that invokes Middleman, and possibly tells Wordpress to refresh its cache, so you could add it there.
Set up a symbolic link for the wordpress directory. On UNIX-like systems this is easy and commonly done. On Windows, you can create "junction points", which I believe work similarly.
Configure Wordpress / Apache to read the directory directly from your git working copy. The path should be configurable.
I would prefer the first solution; this has the added advantage that it will decouple your development environment from the server configuration. This will make it easier if your setup later changes or your project needs to run in a different environment (development on a different machine, someone else also wants to work on your project, you want to deploy to a hosted server somewhere etc.).
Note: The problem is, I believe, that your are trying to use git as a deployment tool. While many people do this, git is not really suitable for this purpose. Deployment should usually be a separate step.

How to manage multiple alfresco repositories?

Problem description:
I have multiple alfresco installations (development, testing, production) of one project.
I need to copy files under Data Dictionary folder (Scripts, Templates, Web Scripts) from one to another in one direction (development -> testing -> production).
Current solution:
I copy files manually via webdav, which is annoying and unreliable (I can forget to copy some.).
Desired solution:
I'd like to have I tool, which will copy changed files at my command, what they are ready for the next step. I had an idea, that it could internally use a Git repository with branches for each installation, being able to fetch the files from devel and push the files to testing and production. This way (with Git) it could also support reverting changes.
It looks like a quite common problem, but I wasn't able to google something about it, so I'm asking here. Does such a tool exist or is there a better way of managing multiple repositories?

If you have a brand new installation of your development/testing/production Alfresco instances, you could simply migrate alf_data dir content, that contains by default db, indexes, content-store, backup files. If you need, you could migrate the "shared" folder too, or at least some files from the shared folder as could be some Alfresco customization (custom scripts or similar). Here is the link that helps with migration steps:
http://wiki.alfresco.com/wiki/System_Migration
Otherwise, if you need only to move a folder from Data Dictionary, or a set of documents, you could use ACP in order to achieve that. Here is the wiki for doing this: http://wiki.alfresco.com/wiki/Export_and_Import

You could do this via FTP. When your want to deploy new changes, you can go with manual client like FileZila to download changes from Dev, then upload them to test.
But you can also automate FTP, so that it can run a scheduled check if there are new things on, say, dev and push them to test.
If you use Git for source control, you could also do this via git-ftp. Hold a copy of Data Dictionary in your source folder, then add some sort of pre-commit check, which will see if you changed any of those files. If you did, on commit it will push the change to dev and test.

I think Relication service AF is suitable for you.
http://wiki.alfresco.com/wiki/Alfresco_Community_3.4.a#Replication

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex