How can I configure an optional input path in the Oozie job.properties file? I want the default value for that input path to be null, and I will set the path when running the Oozie job using the -D parameter.
Does anyone have any ideas how to do it?
Thanks!
We're running an Airflow cluster using the puckel/airflow Docker image with docker-compose. Airflow's scheduler container writes its logs to /usr/local/airflow/logs/scheduler.
The problem is that the log files are not rotated, so disk usage grows until the disk is full. A DAG for cleaning up the log directory is available, but it runs on the worker node, so the log directory on the scheduler container is never cleaned up.
I'm looking for a way to send the scheduler log to stdout or to an S3/GCS bucket, but I haven't been able to find one. Is there any way to do this?
Finally I managed to output the scheduler's log to stdout.
Here you can find how to use a custom logger in Airflow. The default logging config is available on GitHub.
What you have to do is:
(1) Create a custom logging config in ${AIRFLOW_HOME}/config/log_config.py.
# Setting processor (scheduler, etc..) logs output to stdout
# Referring https://www.astronomer.io/guides/logging
# This file is created following https://airflow.apache.org/docs/apache-airflow/2.0.0/logging-monitoring/logging-tasks.html#advanced-configuration
from copy import deepcopy
from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG
import sys
LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)
LOGGING_CONFIG["handlers"]["processor"] = {
    "class": "logging.StreamHandler",
    "formatter": "airflow",
    "stream": sys.stdout,
}
(2) Set logging_config_class property to config.log_config.LOGGING_CONFIG in airflow.cfg
logging_config_class = config.log_config.LOGGING_CONFIG
(3) [Optional] Add $AIRFLOW_HOME to the PYTHONPATH environment variable.
export PYTHONPATH="${PYTHONPATH}:${AIRFLOW_HOME}"
Actually, you can set logging_config_class to any path, as long as Python is able to import the package.
Setting handler.processor to airflow.utils.log.logging_mixin.RedirectStdHandler didn't work for me. It used too much memory.
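The handler override in step (1) can be tried out without an Airflow installation. The following is a minimal sketch using only the standard library; BASE_CONFIG, the demo.processor logger name and build_stdout_config are hypothetical stand-ins invented for illustration, not Airflow's real DEFAULT_LOGGING_CONFIG or API.

```python
import logging
import logging.config
import sys
from copy import deepcopy

# Hypothetical stand-in for Airflow's DEFAULT_LOGGING_CONFIG (not the real one).
BASE_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {"airflow": {"format": "%(levelname)s - %(message)s"}},
    "handlers": {
        "processor": {"class": "logging.NullHandler", "formatter": "airflow"},
    },
    "loggers": {
        "demo.processor": {
            "handlers": ["processor"],
            "level": "INFO",
            "propagate": False,
        },
    },
}


def build_stdout_config(base, stream=None):
    """Copy the base config and point the 'processor' handler at a stream."""
    config = deepcopy(base)  # leave the original dict untouched
    config["handlers"]["processor"] = {
        "class": "logging.StreamHandler",
        "formatter": "airflow",
        "stream": stream if stream is not None else sys.stdout,
    }
    return config


if __name__ == "__main__":
    logging.config.dictConfig(build_stdout_config(BASE_CONFIG))
    logging.getLogger("demo.processor").info("written to stdout")
```

Copying the base config before changing it is the same design choice as the deepcopy in log_config.py: only the one handler is replaced and everything else keeps its default.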
remote_logging=True in airflow.cfg is the key.
Please check the thread here for detailed steps.
You can extend the image with the following or do so in airflow.cfg
ENV AIRFLOW__LOGGING__REMOTE_LOGGING=True
ENV AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID=gcp_conn_id
ENV AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER=gs://bucket_name/AIRFLOW_LOGS
The connection referenced by gcp_conn_id needs permission to create and delete objects in GCS.
With an Oozie coordinator and workflow, I see the following in the Coord Job Log for a specific action:
JOB[0134742-190911204352052-oozie-oozi-C] ACTION[0134742-190911204352052-oozie-oozi-C#1] [0134742-190911204352052-oozie-oozi-C#1]::CoordActionInputCheck:: Missing deps: ${coord:latest(0)}#${coord:latest(0)}#${coord:latest(0)}#${coord:latest(0)}#${coord:latest(0)}#${coord:latest(0)}
It seems the full path names are missing. If the path name is not specified in the coordinator with latest(0), the paths are available as seen here:
JOB[0134742-190911204352052-oozie-oozi-C] ACTION[0134742-190911204352052-oozie-oozi-C#1] [0134742-190911204352052-oozie-oozi-C#1]::CoordActionInputCheck:: Missing deps:hdfs://labs-xxx/data/funcxx/inputs/uploads/reports-for-targeting/20190923/14
Later the path is resolved as:
JOB[0134742-190911204352052-oozie-oozi-C] ACTION[0134742-190911204352052-oozie-oozi-C#1] [0134742-190911204352052-oozie-oozi-C#1]::ActionInputCheck:: File:hdfs://labs-xxx/data/funcxx/inputs/uploads/reports-for-targeting/20190923/14, Exists? :true
How can I see the full path name instead of the ${coord:latest(0)} strings?
You can check this via the Oozie CLI:
oozie job -info 0134742-190911204352052-oozie-oozi-C#1
Airflow creates a unittest.cfg file in the AIRFLOW_HOME environment variable path.
My question is: how can I point to unittest.cfg in the same way that I point to airflow.cfg via the environment variable AIRFLOW_CONFIG?
The reason why I want to do this is because I don't want to have any config files in the AIRFLOW_HOME directory.
Also, if anyone knows, could you please explain what unittest.cfg is for? I could not find any documentation on it.
The unittest.cfg test configuration file is the default configuration file used when Airflow runs in test mode.
Test mode can be activated by setting the unit_test_mode option in airflow.cfg, or the AIRFLOW__CORE__UNIT_TEST_MODE environment variable, to True.
When test mode is active, values in the test configuration file override those in airflow.cfg at runtime.
# Source: https://github.com/apache/airflow/blob/1.10.5/airflow/configuration.py#L558,L561
def get_airflow_test_config(airflow_home):
    if 'AIRFLOW_TEST_CONFIG' not in os.environ:
        return os.path.join(airflow_home, 'unittests.cfg')
    return expand_env_var(os.environ['AIRFLOW_TEST_CONFIG'])
The AIRFLOW_TEST_CONFIG environment variable can be set to the path of your test configuration file.
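To see what that lookup does without importing Airflow, here is a rough standalone approximation. resolve_test_config is a hypothetical name, the environ parameter exists only to make the sketch easy to test, and the expansion step is an assumption standing in for Airflow's expand_env_var helper. The default file name is spelled as in the snippet above.

```python
import os


def resolve_test_config(airflow_home, environ=None):
    """Approximate the lookup shown above; not Airflow's own function."""
    environ = os.environ if environ is None else environ
    override = environ.get("AIRFLOW_TEST_CONFIG")
    if override is None:
        # No override set: fall back to a file inside AIRFLOW_HOME.
        return os.path.join(airflow_home, "unittests.cfg")
    # Assumption: expand "~" and "$VAR" roughly the way expand_env_var does.
    return os.path.expanduser(os.path.expandvars(override))


if __name__ == "__main__":
    print(resolve_test_config("/opt/airflow", {}))
    print(resolve_test_config(
        "/opt/airflow", {"AIRFLOW_TEST_CONFIG": "/etc/airflow/test.cfg"}))
```

So exporting AIRFLOW_TEST_CONFIG before starting Airflow should keep the test configuration file out of the AIRFLOW_HOME directory.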
Following the documentation in the wiki, I'm trying to use KeyczarTool to generate a new keyset. Has anyone else come across this FileNotFoundException? KeyczarTool.jar has rwx permissions, and I also tried running it via sudo.
From docs
Command Usage:
create --location=/path/to/keys --purpose=(crypt|sign) [--name="A name"] [--asymmetric=(dsa|rsa|ec)]
Creates a new, empty key set in the given location.
This key set must have a purpose of either "crypt" or "sign"
and may optionally be given a name. The optional version
flag will generate a public key set of the given algorithm.
The "dsa" and "ec" asymmetric values are valid only for sets
with "sign" purpose.
Cmd:
$ java -jar KeyczarTool-0.71f-060112.jar create --location=/keys --purpose=crypt -name="first key" --asymmetric=rsa
output:
org.keyczar.exceptions.KeyczarException: Unable to write to: /keys/meta
    at org.keyczar.KeyczarTool.create(KeyczarTool.java:366)
    at org.keyczar.KeyczarTool.main(KeyczarTool.java:123)
Caused by: java.io.FileNotFoundException: /keys/meta (No such file or directory)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:145)
    at org.keyczar.KeyczarTool.create(KeyczarTool.java:362)
    ... 1 more
With the current version of Java Keyczar, the "keys" directory needs to be created before running the program.
This is a known issue; KeyczarTool should create directories automatically.
As @jbtule kindly pointed out, you must create the keys directory first. But also include a . before the slash.
The working command is:
$ java -jar KeyczarTool-0.71f-060112.jar create --location=./keys --purpose=crypt -name="first key" --asymmetric=rsa
(Get-Item $SymLink).LastWriteTime returns the symlink's last modified time, not the target's modified time.
How do I get the target's last modified time?
There appears to be no direct way, so for now this has to be done in two steps:
Get the path of the SymLink's target
Get the LastWriteTime from the target's path
To determine whether it's a symlink: Check if SymLink - PowerShell
To get the path:
use the Dir command's summary output, from which the target information can be extracted with a regex,
or use a native API call: GetFinalPathNameByHandle; see: Calling Unmanaged Code from PS
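The two-step idea is not specific to PowerShell. As a rough illustration in Python rather than PowerShell, the sketch below resolves the link first and then reads the timestamp from the resolved path; target_last_modified is a hypothetical helper name, and the result is a Unix timestamp rather than a LastWriteTime object.

```python
import os


def target_last_modified(link_path):
    """Return (target path, target mtime) for a symlink, in two steps."""
    # Step 1: resolve the link to the path of its final target.
    target_path = os.path.realpath(link_path)
    # Step 2: read the modification time from the target, not the link.
    return target_path, os.path.getmtime(target_path)
```

For comparison, os.lstat(link_path).st_mtime reports the timestamp of the link itself, which corresponds to the behavior described in the question.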