Dataflow job startup takes too long when triggered from Composer - airflow

I have a static pipeline with the following structure:
main.py
setup.py
requirements.txt
module1/
    __init__.py
    functions.py
module2/
    __init__.py
    functions.py
dist/
    setup_tarball
setup.py packages the local modules and requirements.txt lists the non-native PyPI dependencies that the Dataflow worker nodes need. The Dataflow options are written as follows:
import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
from module2.functions import function_to_use

dataflow_options = ['--extra_package=./dist/setup_tarball',
                    '--temp_location=<gcs_temp_location>',
                    '--runner=DataflowRunner',
                    '--region=us-central1',
                    '--requirements_file=./requirements.txt']
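For reference, the setup.py that produces the tarball in dist/ is roughly of this shape (just a sketch; the package name, version and install_requires values are placeholders, not the real ones):

# Sketch of a setup.py that bundles the local modules for the Dataflow workers;
# name, version and install_requires below are illustrative placeholders.
import setuptools

setuptools.setup(
    name='my-dataflow-pipeline',              # hypothetical package name
    version='0.0.1',
    packages=setuptools.find_packages(),      # picks up module1 and module2
    install_requires=[],                      # PyPI deps can also live in requirements.txt
)

Running python setup.py sdist then drops the tarball into dist/, which is what --extra_package points at.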
So then the pipeline will run something like this:
options = PipelineOptions(dataflow_options)
p = beam.Pipeline(options=options)
transform = (p | ReadFromText(gcs_url) | beam.Map(function_to_use) | WriteToText(gcs_output_url))
p.run().wait_until_finish()
Running this locally takes Dataflow around 6 minutes to complete, with most of the time going to worker startup. I then tried automating this code with Composer and re-arranged the layout as follows: the main (DAG) function in the dags folder, the modules in plugins, and setup_tarball and requirements.txt in the data folder. So the only parameters that really changed are:
'--extra_package=/home/airflow/gcs/data/setup_tarball'
'--requirements_file=/home/airflow/gcs/data/requirements.txt'
When I run this modified code in Composer, it works, but it takes much, much longer. Once the worker starts up, it takes anywhere from 20-30 minutes before actually running the pipeline (which itself takes only a few seconds). This is much longer than triggering Dataflow from my local code, which took only 6 minutes to complete. I realize this question is very general, but since the code works, I don't think it's related to the Airflow task itself. Where would be a reasonable place to start troubleshooting this problem? At the Airflow level, what can be modified? How does Composer (Airflow) interact with Dataflow, and what could potentially cause this bottleneck?
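For context, the DAG itself is essentially just a thin wrapper that calls the same pipeline code from a PythonOperator. A simplified sketch (Airflow 1.10-style imports; the dag_id, schedule and run_pipeline() wrapper are illustrative rather than the exact code) looks like this:

# Simplified sketch of the Composer DAG; ids, schedule and placeholders are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10-style import

import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.options.pipeline_options import PipelineOptions


def run_pipeline():
    # Same pipeline as above, but pointing at the files synced to the Composer GCS bucket.
    dataflow_options = ['--extra_package=/home/airflow/gcs/data/setup_tarball',
                        '--temp_location=<gcs_temp_location>',
                        '--runner=DataflowRunner',
                        '--region=us-central1',
                        '--requirements_file=/home/airflow/gcs/data/requirements.txt']
    options = PipelineOptions(dataflow_options)
    p = beam.Pipeline(options=options)
    (p | ReadFromText('<gcs_url>')
       | beam.Map(lambda line: line)      # placeholder for function_to_use
       | WriteToText('<gcs_output_url>'))
    p.run().wait_until_finish()


with DAG(dag_id='dataflow_pipeline',
         start_date=datetime(2020, 1, 1),
         schedule_interval=None) as dag:
    trigger_dataflow = PythonOperator(
        task_id='trigger_dataflow',
        python_callable=run_pipeline,
    )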

It turns out that the problem was with Composer itself. The fix was to increase the capacity of the Composer environment, i.e., increase the vCPUs. I'm not sure why this would be the case, so if anyone has an idea of the reason behind this issue, your input would be much appreciated!

Related

Airflow on conda - folder structure

I have a bare-bones Airflow installation on conda. I managed to create custom operators by putting them at this path:
airflow/dags/operators/custom_operator.py
then calling from dag as:
from operators.custom_operator import CustomOperator
How can I instead achieve this folder structure:
airflow/operators/custom_operator.py
which would be imported from the DAG as:
from airflow.operators.custom_operator import CustomOperator
If you think that's a bad approach, please point it out in your answer/comment; I'm happy to tweak my approach if there are better design patterns.
Interestingly, the solution here is in airflow.cfg (your Airflow config file): move the dags_folder parameter one directory up, to $AIRFLOW_HOME. So instead of having:
....
[core]
dags_folder = /home/user/airflow/dags
....
Just make it:
....
[core]
dags_folder = /home/user/airflow
....
Airflow will apparently look recursively for DAGs and pick up only the objects defined as DAGs, so you can keep a clean folder structure with custom operators, utility functions, custom sensors, etc. outside the dags/ folder.
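As a quick illustration (the operator name, parameter and execute() body are placeholders, not taken from the question), a custom operator kept outside dags/ is just an ordinary BaseOperator subclass, e.g. in airflow/operators/custom_operator.py:

# airflow/operators/custom_operator.py -- placeholder example of a custom operator
# living outside the dags/ folder; my_param and the log message are illustrative.
from airflow.models import BaseOperator


class CustomOperator(BaseOperator):

    def __init__(self, my_param=None, **kwargs):
        super().__init__(**kwargs)
        self.my_param = my_param

    def execute(self, context):
        # Whatever the operator should actually do goes here.
        self.log.info('CustomOperator running with my_param=%s', self.my_param)

A DAG file can then import it using the layout described above.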

Simple python app deployed but not deployed

I just installed dokku for the first time and I'm struggling with an apparently very simple problem. I made a sample Python app that just logs an env variable:
import os
import time

API_TOKEN = os.getenv('API_TOKEN')

while True:
    print(f'API_TOKEN is {API_TOKEN}')
    time.sleep(1)
With a Procfile like this:
worker: python temp.py
The deploy looks normal and successful; however, if I try to look at the logs, dokku says:
App <app name> has not been deployed.
Am I missing something very trivial?
Thanks in advance!
By default, dokku only scales up the web process if it is present. Any workers or other processes are scaled to 0, which is reported as "App <app name> has not been deployed".
To deploy your app you need to log onto the box and scale up the worker by running:
dokku ps:scale <app name> worker=1
Change 1 to a larger number if you want more workers.
If you often deploy the app to different dokku instances and have to search for this solution over and over again, you can instead create a file in the root of your app called DOKKU_SCALE. In it you can set the default scale of all the processes, like so:
worker=1
That reminds me, I need to go do that now. It is driving me nuts.

Pact-node dependencies are very large, any way to reduce the size?

We have implemented contract testing using Pact for our AngularJS frontends and Java backends.
I've noticed that the node_modules/@pact-foundation directory is pretty huge (pact-node v4.3.2):
du -sh node_modules/@pact-foundation/
741M    node_modules/@pact-foundation/
The JS UIs are only ever consumers, but the dependencies seem to pull in the following:
ls node_modules/@pact-foundation/
pact-mock-service pact-node pact-provider-verifier-linux-x64
pact-mock-service-linux-x64 pact-provider-verifier
Is there any way to pull in a smaller set of dependencies?
Edit: it seems the reason for this is as follows:
du -sh pact-node/node_modules/@pact-foundation/pact-mock-service/build/*
1.9M    pact-node/node_modules/@pact-foundation/pact-mock-service/build/pact-mock_service-0.8.2
8.9M    pact-node/node_modules/@pact-foundation/pact-mock-service/build/pact-mock-service-0.8.2-1-linux-x86_64.tar.gz
8.5M    pact-node/node_modules/@pact-foundation/pact-mock-service/build/pact-mock-service-0.8.2-1-linux-x86.tar.gz
9.2M    pact-node/node_modules/@pact-foundation/pact-mock-service/build/pact-mock-service-0.8.2-1-osx.tar.gz
12M     pact-node/node_modules/@pact-foundation/pact-mock-service/build/pact-mock-service-0.8.2-1-win32.zip
50M     pact-node/node_modules/@pact-foundation/pact-mock-service/build/pact-mock-service-darwin
48M     pact-node/node_modules/@pact-foundation/pact-mock-service/build/pact-mock-service-linux-ia32
50M     pact-node/node_modules/@pact-foundation/pact-mock-service/build/pact-mock-service-linux-x64
51M     pact-node/node_modules/@pact-foundation/pact-mock-service/build/pact-mock-service-win32
pact-node depends on pact-mock-service, and the bundled dependency includes the mock service for all OSes.
Edit 2:
Changing my dependency to the following:
"@pact-foundation/pact-node": "6.9.0",
and adding a resolution (I'm using yarn, not npm):
"resolutions": {
  "@pact-foundation/pact-node": "6.9.0"
}
brings the total size of the dependencies down to:
du -sh node_modules/@pact-foundation/*
1.7M    node_modules/@pact-foundation/pact-node
170M    node_modules/@pact-foundation/pact-standalone
Sadly, no, not yet.
Currently, our main Pact application is written in Ruby and is packaged with Travelling Ruby, a way to package Ruby apps for different OSes/architectures. Originally, the intention was to download only the OS/arch-specific binary so you don't have to load everything; however, a bug in npm is causing issues with optional dependencies when a package-lock.json is committed into a repository. To work around this issue, we ended up having to package them all together, which I particularly dislike.
However, the good news is that we are working on this problem. We are currently reimplementing our Pact application in Rust, which compiles down to native binaries without all the extra baggage that comes with Ruby and will drastically reduce the overall size of the binaries. It isn't finalized just yet, but it is still being worked on, so please be patient.
Thanks.

Can Ansible unarchive be made to write static folder modification times?

I am writing a build process for a WordPress installation using Ansible. It doesn't have an application-level build system at the moment, and I've chosen Ansible so that it can cleanly integrate with server build scripts, letting me bring up a working server at the touch of a button.
Most of my WordPress plugins are being installed with the unarchive feature, pointing to versioned plugin builds on the official wordpress.org installation server. I've encountered a problem with just one of these, which is that it is always being marked as "changed" even though the files are exactly the same.
Having examined the state of ls -Rl before and after, I noticed that this plugin (WordPress HTTPS) is the only one to use internal sub-directories, and upon each decompression, the modification time of folders is getting bumped.
It may be useful to know that this is a project build script with a connection type of local, so I guess that means SSH is not being used.
Here is a snippet of my playbook:
- name: Install the W3 Total Cache plugin
  unarchive: >
    src=https://downloads.wordpress.org/plugin/w3-total-cache.0.9.4.1.zip
    dest=wp-content/plugins
    copy=no

- name: Install the WP DB Manager plugin
  unarchive: >
    src=https://downloads.wordpress.org/plugin/wp-dbmanager.2.78.1.zip
    dest=wp-content/plugins
    copy=no

# #todo Since this has internal sub-folders, need to work out
# how to preserve timestamps of the original folders rather than
# re-writing them, which forces Ansible to record a change of
# server state.
- name: Install the WordPress HTTPS plugin
  unarchive: >
    src=https://downloads.wordpress.org/plugin/wordpress-https.3.3.6.zip
    dest=wp-content/plugins
    copy=no
One hacky way of fixing this is to use ls -R before and after, using options to include file sizes but not timestamps, and then md5sum that output. I could then mark it as changed if there is a change in checksum. It'd work but it's not very elegant (and I'd want to do that for all plugins, for consistency).
Another approach is to abandon the task if a plugin file already exists, but that would cause problems when I bump the plugin version number to the latest copy.
Thus, ideally, I am looking for a switch to present to unarchive to say that I want the folder modification times from the zip file, not from playbook runtime. Is it possible?
Update: a commenter asked if the file contents could have changed in any way. To determine whether they have, I wrote this script, which creates a checksum for (1) all file contents and (2) all file/directory timestamps:
#!/bin/bash
# Save pwd and then change dir to root location
STARTDIR=`pwd`
cd `dirname $0`/../..
# Clear collation file
echo > /tmp/wp-checksum
# List all files recursively
find wp-content/plugins/wordpress-https/ -type f | while read file
do
    #echo $file
    cat $file >> /tmp/wp-checksum
done
# Get checksum of file contents
sha1sum /tmp/wp-checksum
# Get checksum of file sizes
ls -Rl wp-content/plugins/wordpress-https/ | sha1sum
# Go back to original dir
cd $STARTDIR
I ran this as part of my playbook (running it in isolation using tags) and received this:
PLAY [Set this playbook to run locally] ****************************************
TASK [setup] *******************************************************************
ok: [localhost]
TASK [jonblog : Run checksum command] ******************************************
changed: [localhost]
TASK [jonblog : debug] *********************************************************
ok: [localhost] => {
"checksum_before.stdout_lines": [
"374fadc4df1578f78fd60b1be6758477c2c533fa /tmp/wp-checksum",
"10d66f7bdbbdd3af531d1b11a3db3059a5868838 -"
]
}
TASK [jonblog : Install the WordPress HTTPS plugin] ***************
changed: [localhost]
TASK [jonblog : Run checksum command] ******************************************
changed: [localhost]
TASK [jonblog : debug] *********************************************************
ok: [localhost] => {
"checksum_after.stdout_lines": [
"374fadc4df1578f78fd60b1be6758477c2c533fa /tmp/wp-checksum",
"719c9da94b525e723b1abe188ee9f5bbaf121f3f -"
]
}
PLAY RECAP *********************************************************************
localhost : ok=6 changed=3 unreachable=0 failed=0
The debug lines reflect the checksum hash of the contents of the files (this is identical) and then the checksum hash of ls -Rl of the file structure (this has changed). This is in keeping with my prior manual finding that directory checksums are changing.
So, what can I do next to track down why folder modification times are incorrectly flagging this operation as changed?
Rather than overwriting all files each time and finding a way to keep the same modification datetime, you may want to use the creates option of the unarchive module.
As you may already know, this tells Ansible that a specific file/folder will be created as a result of the task, so the task will not run again if that file/folder already exists.
See http://docs.ansible.com/ansible/unarchive_module.html#options
My solution is to modify the checksum script and to make that a permanent feature of the Ansible process. It feels a bit hacky to do my own checksumming, when Ansible should do it for me, but it works.
New answers that explain that I am doing something wrong, or that a new version of Ansible fixes the problem, would be most welcome.
If I get a moment, I will raise this as a possible bug with the Ansible team. However, I do sometimes wonder about the effort/reward ratio of raising bugs on a busy tracker - I already have one item outstanding that has been waiting a while, and I've chosen to work around that too.
Update (18 months later)
This Ansible build system never made it into live use. It felt like I was always working around something. Recently, when I decided I needed to move my blog to another server, I finally Dockerised it. This took several weeks (since there are a surprising number of things to think about in a real WordPress installation), but in general I found the process much nicer than using orchestration tools.

How to pass command line arguments to a Meteor app?

I would like to pass command line arguments to my Meteor app on start up.
For example, --dev, --test or --prod, indicating whether it is running in the dev, test or prod environment. It can then load different resources on startup, etc.
I tried something like this in /server/server.js:
var arguments = process.argv.splice(2);
console.log('cmd args: ' + JSON.stringify(arguments,0,4));
Then I ran a test, and quite a few others, with just random command line arguments:
meteor --dev
The output in the console is only this:
cmd args: [
"--keepalive"
]
What is the best way to get command line arguments into a Meteor app?
Or is this even the correct way to solve the higher-level problem? And if not, what is the correct way to distinguish between running environments?
Meteor doesn't forward command line args to your app if it doesn't recognize them. You have several possibilities:
Rewrite parts of meteor.js to forward unrecognized args. This shouldn't be too hard, but it's not a pretty solution. And if updates occur you are in trouble. :D
You could write a small config file and change the behaviour of your app based on the configuration options in there. Take a look at this question.
The easiest thing to do is to use environment variables. You can read env vars in Node like this. After that you can start your app the "express.js" way: $ METEOR_ENV=production meteor
I hope I could help you! :)
The reason it doesn't work is that the meteor command starts a proxy (which receives the arguments you give) and then starts the Meteor app with --keepalive.
process.argv will have the correct values if you build the app with meteor build --directory /my/build/path and run it.
