Jupyterhub K8s - Issue with Changing User from Jovyan to NB_USER - jupyter-notebook

Everything works well until we want to set NB_USER to the logged-in user. After changing the config to run as root with start.sh as the default cmd, we get the error below in the log and the container fails to start. Any help is highly appreciated.
After running the container as root, the log shows the following error:
Set username to: user1
Relocating home dir to /home/user1
mv: cannot move '/home/jovyan' to '/home/user1': Device or resource busy
Here is the config.yaml:
singleuser:
  defaultUrl: "/lab"
  uid: 0
  fsGid: 0
hub:
  extraConfig: |
    c.KubeSpawner.args = ['--allow-root']
    c.Spawner.cmd = ['start.sh', 'jupyterhub-singleuser']

    def notebook_dir_hook(spawner):
        spawner.environment = {'NB_USER': spawner.user.name, 'NB_UID': '1500'}

    c.Spawner.pre_spawn_hook = notebook_dir_hook

    from kubernetes import client

    def modify_pod_hook(spawner, pod):
        pod.spec.containers[0].security_context = client.V1SecurityContext(
            privileged=True,
            capabilities=client.V1Capabilities(
                add=['SYS_ADMIN']
            )
        )
        return pod

    c.KubeSpawner.modify_pod_hook = modify_pod_hook

Related

New eks node instance not able to join cluster, getting "cni plugin not initialized"

I am pretty new to Terraform and trying to create a new EKS cluster with a node group and a launch template. The EKS cluster, node group, launch template, and nodes were all created successfully. However, when I changed the desired size of the node group (using Terraform or the AWS management console), it would fail. No error was reported in the node group's Health issues tab. I dug further and found that new instances were launched by the Auto Scaling group, but the new ones were not able to join the cluster.
Looking into the troubled instances, I found the following log entries by running "sudo journalctl -f -u kubelet":
Jan 27 19:32:32 ip-10-102-21-129.us-east-2.compute.internal kubelet[3168]: E0127 19:32:32.612322 3168 eviction_manager.go:254] "Eviction manager: failed to get summary stats" err="failed to get node info: node "ip-10-102-21-129.us-east-2.compute.internal" not found"
Jan 27 19:32:32 ip-10-102-21-129.us-east-2.compute.internal kubelet[3168]: E0127 19:32:32.654501 3168 kubelet.go:2427] "Error getting node" err="node "ip-10-102-21-129.us-east-2.compute.internal" not found"
Jan 27 19:32:32 ip-10-102-21-129.us-east-2.compute.internal kubelet[3168]: E0127 19:32:32.755473 3168 kubelet.go:2427] "Error getting node" err="node "ip-10-102-21-129.us-east-2.compute.internal" not found"
Jan 27 19:32:32 ip-10-102-21-129.us-east-2.compute.internal kubelet[3168]: E0127 19:32:32.776238 3168 kubelet.go:2352] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Jan 27 19:32:32 ip-10-102-21-129.us-east-2.compute.internal kubelet[3168]: E0127 19:32:32.856199 3168 kubelet.go:2427] "Error getting node" err="node "ip-10-102-21-129.us-east-2.compute.internal" not found"
It looked like the issue had something to do with the CNI add-on. I googled it, and others suggest checking the logs inside the /var/log/aws-routed-eni directory. I could find that directory and its logs on the working nodes (the ones created initially when the EKS cluster was created), but the same directory and log files do not exist on the newly launched instances (the ones created after the cluster was created, by changing the desired node size).
The image I used for the node-group is ami-0af5eb518f7616978 (amazon/amazon-eks-node-1.24-v20230105)
Here is what my script looks like:
resource "aws_eks_cluster" "eks-cluster" {
name = var.mod_cluster_name
role_arn = var.mod_eks_nodes_role
version = "1.24"
vpc_config {
security_group_ids = [var.mod_cluster_security_group_id]
subnet_ids = var.mod_private_subnets
endpoint_private_access = "true"
endpoint_public_access = "true"
}
}
resource "aws_eks_node_group" "eks-cluster-ng" {
cluster_name = aws_eks_cluster.eks-cluster.name
node_group_name = "eks-cluster-ng"
node_role_arn = var.mod_eks_nodes_role
subnet_ids = var.mod_private_subnets
#instance_types = ["t3a.medium"]
scaling_config {
desired_size = var.mod_asg_desired_size
max_size = var.mod_asg_max_size
min_size = var.mod_asg_min_size
}
launch_template {
#name = aws_launch_template.eks_launch_template.name
id = aws_launch_template.eks_launch_template.id
version = aws_launch_template.eks_launch_template.latest_version
}
lifecycle {
create_before_destroy = true
}
}
resource "aws_launch_template" "eks_launch_template" {
name = join("", [aws_eks_cluster.eks-cluster.name, "-launch-template"])
vpc_security_group_ids = [var.mod_node_security_group_id]
block_device_mappings {
device_name = "/dev/xvda"
ebs {
volume_size = var.mod_ebs_volume_size
volume_type = "gp2"
#encrypted = false
}
}
lifecycle {
create_before_destroy = true
}
image_id = var.mod_ami_id
instance_type = var.mod_eks_node_instance_type
metadata_options {
http_endpoint = "enabled"
http_tokens = "required"
http_put_response_hop_limit = 2
}
user_data = base64encode(<<-EOF
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="
--==MYBOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"
#!/bin/bash
set -ex
exec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1
B64_CLUSTER_CA=${aws_eks_cluster.eks-cluster.certificate_authority[0].data}
API_SERVER_URL=${aws_eks_cluster.eks-cluster.endpoint}
K8S_CLUSTER_DNS_IP=172.20.0.10
/etc/eks/bootstrap.sh ${aws_eks_cluster.eks-cluster.name} --apiserver-endpoint $API_SERVER_URL --b64-cluster-ca $B64_CLUSTER_CA
--==MYBOUNDARY==--\
EOF
)
tag_specifications {
resource_type = "instance"
tags = {
Name = "EKS-MANAGED-NODE"
}
}
}
Another thing I noticed is that I tagged the instance Name as "EKS-MANAGED-NODE". That tag showed up correctly on the nodes created when the EKS cluster was created. However, on any new nodes created afterward, the Name changed to "EKS-MANAGED-NODEGROUP-NODE".
I wonder if that indicates an issue.
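As a side note, here is a small boto3 sketch (the eks:cluster-name tag key and the cluster name are assumptions taken from the bootstrap log further down) that can be used to compare the Name tag and AMI of the old vs. new node-group instances:
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")
paginator = ec2.get_paginator("describe_instances")
# Filter on the tag EKS managed node groups put on their instances (assumed key).
for page in paginator.paginate(
    Filters=[{"Name": "tag:eks:cluster-name", "Values": ["dev-test-search-eks-oVpBNP0e"]}]
):
    for reservation in page["Reservations"]:
        for inst in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            print(inst["InstanceId"], inst["ImageId"], tags.get("Name"))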
I checked the log and confirmed that the user data was picked up and ran when the instances started up:
sh-4.2$ more user-data.log
B64_CLUSTER_CA=LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUMvakNDQWVhZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJek1ERXlOekU
0TlRrMU1Wb1hEVE16TURFeU5E (deleted the rest)
API_SERVER_URL=https://EC283069E9FF1B33CD6C59F3E3D0A1B9.gr7.us-east-2.eks.amazonaws.com
K8S_CLUSTER_DNS_IP=172.20.0.10
/etc/eks/bootstrap.sh dev-test-search-eks-oVpBNP0e --apiserver-endpoint https://EC283069E9FF1B33CD6C59F3E3D0A1B9.gr7.us-east-2.eks.amazonaws.com --b64-cluster-ca LS0tLS
1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUMvakND...(deleted the rest)
Using kubelet version 1.24.7
true
Using containerd as the container runtime
true
‘/etc/eks/containerd/containerd-config.toml’ -> ‘/etc/containerd/config.toml’
‘/etc/eks/containerd/sandbox-image.service’ -> ‘/etc/systemd/system/sandbox-image.service’
Created symlink from /etc/systemd/system/multi-user.target.wants/containerd.service to /usr/lib/systemd/system/containerd.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/sandbox-image.service to /etc/systemd/system/sandbox-image.service.
‘/etc/eks/containerd/kubelet-containerd.service’ -> ‘/etc/systemd/system/kubelet.service’
Created symlink from /etc/sy
I confirmed that the role being specified has all the required permissions; the role is used in another EKS cluster, and I am trying to create a new cluster based on that existing one using Terraform.
I tried removing the launch template and letting AWS use the default one. Then the new nodes had no issue joining the cluster.
I looked at my launch template script and at the registry documentation https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/launch_template and nowhere is it mentioned that I need to manually add or run the CNI plugin.
So I don't understand why the CNI plugin was not installed automatically and why the instances are not able to join the cluster.
Any help is appreciated.

Deploying flask with sqlite to heroku

When I push my app to Heroku, it does not seem to use data_base.db, and if I try to add a user to the database I get an error that the table is not found:
2021-08-15T13:02:39.226938+00:00 app[web.1]: cursor.execute(statement, parameters)
2021-08-15T13:02:39.226938+00:00 app[web.1]: sqlite3.OperationalError: no such table: users
Does Heroku forbid the use of SQLite, or am I doing something wrong?
This is config.py
import random
import string


class Config:
    SECRET_KEY = ''.join(random.choice(string.ascii_letters + string.digits) for i in range(60))
    SQLALCHEMY_DATABASE_URI = 'sqlite:///data_base.db'
    SQLALCHEMY_TRACK_MODIFICATIONS = False


class DevelopmentConfig(Config):
    DEBUG = True


class ProductionConfig(Config):
    DEBUG = False


config = {
    'dev': DevelopmentConfig,
    'prod': ProductionConfig,
}
replace this
SQLALCHEMY_DATABASE_URI = 'sqlite:///data_base.db'
with
import os  # at the top of config.py

uri = os.getenv("DATABASE_URL", "")  # Heroku sets this when a Postgres add-on is attached
if uri.startswith("postgres://"):
    # SQLAlchemy rejects the legacy "postgres://" scheme Heroku uses
    uri = uri.replace("postgres://", "postgresql://", 1)
SQLALCHEMY_DATABASE_URI = uri or 'sqlite:///data_base.db'
This simply tells the app to use the Heroku Postgres database when deployed (Heroku dynos have an ephemeral filesystem, so an on-disk SQLite file is not persisted across deploys and restarts) and to fall back to SQLite locally.
Don't forget to add psycopg2 to the requirements.txt file.
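If it helps, here is a rough sketch of how the config dict above might be consumed in an app factory; the create_app function and the FLASK_CONFIG variable name are assumptions for illustration, not part of the question:
import os

from flask import Flask

from config import config  # the dict defined in config.py above


def create_app():
    app = Flask(__name__)
    # 'prod' on Heroku, 'dev' locally; driven by an env var we assume you set.
    app.config.from_object(config[os.getenv('FLASK_CONFIG', 'dev')])
    return app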

Autoscaling using Ceilometer/Aodh failed to trigger an alarm in the openstack rocky

Here is the document I refer to
1. sample_server.yaml
type: os.nova.server
version: 1.0
properties:
  name: cirros_server
  flavor: m1.small
  image: b86fb462-c5c2-4a08-9fe4-c9f86d05763d
  networks:
    - network: external-net
2. Execute the following command lines:
# openstack cluster create --profile pserver --desired-capacity 2 mycluster
# openstack cluster receiver create --type webhook --cluster mycluster --action CLUSTER_SCALE_OUT --params count=2 r_01
# export ALRM_URL01='http://vip:8777/v1/webhooks/aac3433a-40de-4d7d-830c-e0035f2a4d13/trigger?V=1&count=2'
# aodh alarm create --type gnocchi_resources_threshold --aggregation-method mean --name cpu-high --metric cpu_util --threshold 70 --comparison-operator gt --granularity 300 --evaluation-periods 1 --alarm-action $ALRM_URL01 --repeat-actions False --query metadata.user_metadata.cluster_id=$MYCLUSTER_ID --resource-type instance --resource-id f7e0e8a6-51a3-422d-b631-7ddaf65b3dfb
3. Log into each cluster node and run some CPU-burning workloads there to drive the CPU utilization high.
I added log output to /usr/lib/python2.7/site-packages/aodh/notifier/rest.py where the alert request is triggered:
class RestAlarmNotifier(notifier.AlarmNotifier):
    def notify(self, action, alarm_id, alarm_name, severity, previous,
               current, reason, reason_data, headers=None):
        body = {'alarm_name': alarm_name, 'alarm_id': alarm_id,
                'severity': severity, 'previous': previous,
                'current': current, 'reason': reason,
                'reason_data': reason_data}
        headers['content-type'] = 'application/json'
        kwargs = {'data': json.dumps(body),
                  'headers': headers}
        max_retries = self.conf.rest_notifier_max_retries
        session = requests.Session()
        LOG.info('#########################')
        LOG.info(session)
        LOG.info(kwargs)
        LOG.info(action.geturl())
        LOG.info('#########################')
        session.mount(action.geturl(),
                      requests.adapters.HTTPAdapter(max_retries=max_retries))
        resp = session.post(action.geturl(), **kwargs)
        LOG.info('$$$$$$$$$$$$$$$$$$$$$$$')
        LOG.info(resp.content)
        LOG.info('$$$$$$$$$$$$$$$$$$$$$$$')
Some error messages are output in the /var/log/aodh/notifier.log log, as follows:
(screenshot: error messages from /var/log/aodh/notifier.log)
The error is caused by adding the body request parameter; a direct POST request succeeds, for example a curl request without the body parameter:
curl -g -i -X POST 'http://vip:8777/v1/webhooks/34e91386-7176-4b30-bc17-5c3503712696/trigger?V=1'
Aodh related version packages are as follows:
python2-aodhclient-1.1.1-1.el7.noarch
openstack-aodh-api-7.0.0-1.el7.noarch
openstack-aodh-common-7.0.0-1.el7.noarch
openstack-aodh-listener-7.0.0-1.el7.noarch
python-aodh-7.0.0-1.el7.noarch
openstack-aodh-notifier-7.0.0-1.el7.noarch
openstack-aodh-evaluator-7.0.0-1.el7.noarch
openstack-aodh-expirer-7.0.0-1.el7.noarch
Can anyone point me in the right direction? Thanks.
The problem has been solved. Here is the document I referred to.
Modify aodh's rest.py (aodh/notifier/rest.py):
https://github.com/openstack/aodh/blob/master/aodh/notifier/rest.py#L79
Under the headers['content-type'] line, add this line: headers['openstack-api-version'] = 'clustering 1.10'
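For reference, a minimal sketch of the patched section inside RestAlarmNotifier.notify(); everything else stays as in the snippet above:
headers['content-type'] = 'application/json'
# Added line: pin the Senlin clustering API microversion so the webhook
# accepts the POST with a JSON body (value taken from the referenced document).
headers['openstack-api-version'] = 'clustering 1.10'
kwargs = {'data': json.dumps(body),
          'headers': headers}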
Then restart the aodh service.

Removing Airflow task logs

I'm running 5 DAGs which have generated a total of about 6 GB of log data in the base_log_folder over a month. I just added a remote_base_log_folder, but it seems it does not exclude logging to the base_log_folder.
Is there any way to automatically remove old log files, rotate them, or force Airflow not to log on disk (base_log_folder) and only log to remote storage?
Please refer to https://github.com/teamclairvoyant/airflow-maintenance-dags
This plugin has DAGs that can kill halted tasks and clean up logs.
You can grab the concepts and come up with a new DAG that cleans up as per your requirements.
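For illustration, a minimal sketch of such a cleanup DAG; the log path, 30-day retention, and DAG id are assumptions, not something taken from that repository:
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

with DAG(
    dag_id="log_cleanup_example",  # hypothetical DAG id
    start_date=days_ago(1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(
        task_id="delete_old_task_logs",
        # BASE_LOG_FOLDER assumed to be /usr/local/airflow/logs; adjust to your setup.
        bash_command="find /usr/local/airflow/logs -mindepth 2 -type d -mtime +30 -prune -exec rm -rf {} +",
    )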
We remove the Task logs by implementing our own FileTaskHandler, and then pointing to it in the airflow.cfg. So, we overwrite the default LogHandler to keep only N task logs, without scheduling additional DAGs.
We are using Airflow==1.10.1.
[core]
logging_config_class = log_config.LOGGING_CONFIG
log_config.LOGGING_CONFIG
BASE_LOG_FOLDER = conf.get('core', 'BASE_LOG_FOLDER')

FOLDER_TASK_TEMPLATE = '{{ ti.dag_id }}/{{ ti.task_id }}'
FILENAME_TEMPLATE = '{{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log'

LOGGING_CONFIG = {
    'formatters': {},
    'handlers': {
        '...': {},
        'task': {
            'class': 'file_task_handler.FileTaskRotationHandler',
            'formatter': 'airflow.job',
            'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
            'filename_template': FILENAME_TEMPLATE,
            'folder_task_template': FOLDER_TASK_TEMPLATE,
            'retention': 20
        },
        '...': {}
    },
    'loggers': {
        'airflow.task': {
            'handlers': ['task'],
            'level': JOB_LOG_LEVEL,
            'propagate': False,
        },
        'airflow.task_runner': {
            'handlers': ['task'],
            'level': LOG_LEVEL,
            'propagate': True,
        },
        '...': {}
    }
}
file_task_handler.FileTaskRotationHandler
import os
import shutil

from airflow.utils.helpers import parse_template_string
from airflow.utils.log.file_task_handler import FileTaskHandler


class FileTaskRotationHandler(FileTaskHandler):

    def __init__(self, base_log_folder, filename_template, folder_task_template, retention):
        """
        :param base_log_folder: base log folder in which to place logs.
        :param filename_template: template filename string.
        :param folder_task_template: template folder task path.
        :param retention: number of folder logs to keep.
        """
        super(FileTaskRotationHandler, self).__init__(base_log_folder, filename_template)
        self.retention = retention
        self.folder_task_template, self.folder_task_template_jinja_template = \
            parse_template_string(folder_task_template)

    @staticmethod
    def _get_directories(path='.'):
        return next(os.walk(path))[1]

    def _render_folder_task_path(self, ti):
        if self.folder_task_template_jinja_template:
            jinja_context = ti.get_template_context()
            return self.folder_task_template_jinja_template.render(**jinja_context)
        return self.folder_task_template.format(dag_id=ti.dag_id, task_id=ti.task_id)

    def _init_file(self, ti):
        relative_path = self._render_folder_task_path(ti)
        folder_task_path = os.path.join(self.local_base, relative_path)
        # Keep only the newest `retention` try folders and delete the rest.
        subfolders = self._get_directories(folder_task_path)
        to_remove = set(subfolders) - set(subfolders[-self.retention:])

        for dir_to_remove in to_remove:
            full_dir_to_remove = os.path.join(folder_task_path, dir_to_remove)
            print('Removing', full_dir_to_remove)
            shutil.rmtree(full_dir_to_remove)

        return FileTaskHandler._init_file(self, ti)
Airflow maintainers don't consider truncating logs to be part of Airflow core logic (see this discussion), and in this issue the maintainers suggest changing LOG_LEVEL to avoid too much log data.
And in this PR, we can learn how to change the log level in airflow.cfg.
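For reference, a sketch of what that looks like in airflow.cfg; the section is [core] on the 1.10.x line and [logging] on newer releases, so check your version's configuration reference:
[logging]
logging_level = WARNING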
Good luck.
I know it sounds savage, but have you tried pointing base_log_folder to /dev/null? I use Airflow as part of a container, so I don't care about the files either, as long as the logger pipes to STDOUT as well.
Not sure how well this plays with S3 though.
For your concrete problems, I have some suggestions.
For those, you would always need a specialized logging config as described in this answer: https://stackoverflow.com/a/54195537/2668430
automatically remove old log files and rotate them
I don't have any practical experience with the TimedRotatingFileHandler from the Python standard library yet, but you might give it a try:
https://docs.python.org/3/library/logging.handlers.html#timedrotatingfilehandler
It not only offers to rotate your files based on a time interval, but if you specify the backupCount parameter, it even deletes your old log files:
If backupCount is nonzero, at most backupCount files will be kept, and if more would be created when rollover occurs, the oldest one is deleted. The deletion logic uses the interval to determine which files to delete, so changing the interval may leave old files lying around.
Which sounds pretty much like the best solution for your first problem.
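For illustration, a minimal sketch of a TimedRotatingFileHandler wired into a logging dictConfig; the handler name, file path, and retention below are assumptions for the example, not Airflow defaults:
import os

# A dictConfig-style snippet; in Airflow you would point logging_config_class
# at a dict like this instead of calling logging.config.dictConfig() yourself.
LOGGING_CONFIG = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'rotating_task_file': {  # hypothetical handler name
            'class': 'logging.handlers.TimedRotatingFileHandler',
            'filename': os.path.expanduser('~/airflow/logs/task.log'),  # example path
            'when': 'midnight',  # rotate once per day
            'backupCount': 7,    # keep at most 7 rotated files; older ones are deleted
        },
    },
    'loggers': {
        'airflow.task': {
            'handlers': ['rotating_task_file'],
            'level': 'INFO',
        },
    },
}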
force airflow to not log on disk (base_log_folder), but only in remote storage?
In this case you should specify the logging config in such a way that you do not have any logging handlers that write to a file, i.e. remove all FileHandlers.
Rather, try to find logging handlers that send the output directly to a remote address.
E.g. CMRESHandler which logs directly to ElasticSearch but needs some extra fields in the log calls.
Alternatively, write your own handler class and let it inherit from the Python standard library's HTTPHandler.
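For example, a rough sketch of such a handler; the host, path, and field selection are made up for illustration:
import logging
import logging.handlers


class RemoteOnlyHandler(logging.handlers.HTTPHandler):
    """Hypothetical handler that ships log records to a remote collector over HTTP."""

    def mapLogRecord(self, record):
        # Send only the fields the remote side needs.
        return {
            'logger': record.name,
            'level': record.levelname,
            'message': record.getMessage(),
        }


# Example wiring; 'logs.example.com:8080' and '/ingest' are placeholders.
handler = RemoteOnlyHandler(host='logs.example.com:8080', url='/ingest', method='POST')
logging.getLogger('airflow.task').addHandler(handler)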
A final suggestion would be to combine the TimedRotatingFileHandler with an ElasticSearch-plus-Filebeat setup, so you would be able to store your logs inside ElasticSearch (i.e. remotely) while not keeping a huge amount of logs on your Airflow disk, since they will be removed by the backupCount retention policy of your TimedRotatingFileHandler.
Usually Apache Airflow grabs disk space for three reasons:
1. Airflow scheduler log files
2. MySQL binary logs [major]
3. xcom table records
To clean these up on a regular basis, I have set up a DAG which runs daily, purges the binary logs, and truncates the xcom table to free up disk space.
You might also need to install mysql-connector-python (pip install mysql-connector-python).
To clean up the scheduler log files, I delete them manually twice a week, to avoid the risk of deleting logs that are still needed for some reason.
I clean the log files with the sudo rm -rd airflow/logs/ command.
Below is my Python code for reference:
"""Example DAG demonstrating the usage of the PythonOperator."""
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.utils.dates import days_ago

args = {
    'owner': 'airflow',
    'email_on_failure': True,
    'retries': 1,
    'email': ['Your Email Id'],
    'retry_delay': timedelta(minutes=5)
}

dag = DAG(
    dag_id='airflow_logs_cleanup',
    default_args=args,
    schedule_interval='@daily',
    start_date=days_ago(0),
    catchup=False,
    max_active_runs=1,
    tags=['airflow_maintenance'],
)


def truncate_table():
    import mysql.connector

    connection = mysql.connector.connect(host='localhost',
                                         database='db_name',
                                         user='username',
                                         password='your password',
                                         auth_plugin='mysql_native_password')
    cursor = connection.cursor()
    sql_select_query = """TRUNCATE TABLE xcom"""
    cursor.execute(sql_select_query)
    connection.commit()
    connection.close()
    print("XCOM Table truncated successfully")


def delete_binary_logs():
    import mysql.connector
    from datetime import datetime

    date = datetime.today().strftime('%Y-%m-%d')
    connection = mysql.connector.connect(host='localhost',
                                         database='db_name',
                                         user='username',
                                         password='your_password',
                                         auth_plugin='mysql_native_password')
    cursor = connection.cursor()
    query = 'PURGE BINARY LOGS BEFORE ' + "'" + str(date) + "'"
    sql_select_query = query
    cursor.execute(sql_select_query)
    connection.commit()
    connection.close()
    print("Binary logs deleted successfully")


t1 = PythonOperator(
    task_id='truncate_table',
    python_callable=truncate_table, dag=dag
)

t2 = PythonOperator(
    task_id='delete_binary_logs',
    python_callable=delete_binary_logs, dag=dag
)

t2 << t1
I am surprised, but it worked for me. Update your config as below:
base_log_folder = ""
It was tested with MinIO and with S3.
Our solution looks a lot like Franzi's:
Running on Airflow 2.0.1 (py3.8)
Override default logging configuration
Since we use a Helm chart for the Airflow deployment, it was easiest to push an env var there, but it can also be done in airflow.cfg or using ENV in the Dockerfile.
# Set custom logging configuration to enable log rotation for task logging
AIRFLOW__LOGGING__LOGGING_CONFIG_CLASS: "airflow_plugins.settings.airflow_local_settings.DEFAULT_LOGGING_CONFIG"
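For reference, the same setting expressed in airflow.cfg instead of an environment variable (assuming Airflow 2.x, where it lives in the [logging] section):
[logging]
logging_config_class = airflow_plugins.settings.airflow_local_settings.DEFAULT_LOGGING_CONFIG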
Then we added the logging configuration, together with the custom log handler, to a Python module that we build and install in the Docker image, as described here: https://airflow.apache.org/docs/apache-airflow/stable/modules_management.html
Logging configuration snippet
This is just a copy of the default from the Airflow codebase, but the task logger gets a different handler.
DEFAULT_LOGGING_CONFIG: Dict[str, Any] = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'airflow': {'format': LOG_FORMAT},
        'airflow_coloured': {
            'format': COLORED_LOG_FORMAT if COLORED_LOG else LOG_FORMAT,
            'class': COLORED_FORMATTER_CLASS if COLORED_LOG else 'logging.Formatter',
        },
    },
    'handlers': {
        'console': {
            'class': 'airflow.utils.log.logging_mixin.RedirectStdHandler',
            'formatter': 'airflow_coloured',
            'stream': 'sys.stdout',
        },
        'task': {
            'class': 'airflow_plugins.log.rotating_file_task_handler.RotatingFileTaskHandler',
            'formatter': 'airflow',
            'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
            'filename_template': FILENAME_TEMPLATE,
            'maxBytes': 10485760,  # 10MB
            'backupCount': 6,
        },
        ...
RotatingFileTaskHandler
And finally the custom handler which is just a merge of the logging.handlers.RotatingFileHandler and the FileTaskHandler.
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
"""File logging handler for tasks."""
import logging
import logging.handlers  # needed for RotatingFileHandler below
import os
from pathlib import Path
from typing import TYPE_CHECKING, Optional

import requests

from airflow.configuration import AirflowConfigException, conf
from airflow.utils.helpers import parse_template_string

if TYPE_CHECKING:
    from airflow.models import TaskInstance


class RotatingFileTaskHandler(logging.Handler):
    """
    FileTaskHandler is a python log handler that handles and reads
    task instance logs. It creates and delegates log handling
    to `logging.FileHandler` after receiving task instance context.
    It reads logs from task instance's host machine.

    :param base_log_folder: Base log folder to place logs.
    :param filename_template: template filename string
    """

    def __init__(self, base_log_folder: str, filename_template: str, maxBytes=0, backupCount=0):
        self.max_bytes = maxBytes
        self.backup_count = backupCount
        super().__init__()
        self.handler = None  # type: Optional[logging.FileHandler]
        self.local_base = base_log_folder
        self.filename_template, self.filename_jinja_template = parse_template_string(filename_template)

    def set_context(self, ti: "TaskInstance"):
        """
        Provide task_instance context to airflow task handler.

        :param ti: task instance object
        """
        local_loc = self._init_file(ti)
        self.handler = logging.handlers.RotatingFileHandler(
            filename=local_loc,
            mode='a',
            maxBytes=self.max_bytes,
            backupCount=self.backup_count,
            encoding='utf-8',
            delay=False,
        )
        if self.formatter:
            self.handler.setFormatter(self.formatter)
        self.handler.setLevel(self.level)

    def emit(self, record):
        if self.handler:
            self.handler.emit(record)

    def flush(self):
        if self.handler:
            self.handler.flush()

    def close(self):
        if self.handler:
            self.handler.close()

    def _render_filename(self, ti, try_number):
        if self.filename_jinja_template:
            if hasattr(ti, 'task'):
                jinja_context = ti.get_template_context()
                jinja_context['try_number'] = try_number
            else:
                jinja_context = {
                    'ti': ti,
                    'ts': ti.execution_date.isoformat(),
                    'try_number': try_number,
                }
            return self.filename_jinja_template.render(**jinja_context)

        return self.filename_template.format(
            dag_id=ti.dag_id,
            task_id=ti.task_id,
            execution_date=ti.execution_date.isoformat(),
            try_number=try_number,
        )

    def _read_grouped_logs(self):
        return False

    def _read(self, ti, try_number, metadata=None):  # pylint: disable=unused-argument
        """
        Template method that contains custom logic of reading
        logs given the try_number.

        :param ti: task instance record
        :param try_number: current try_number to read log from
        :param metadata: log metadata,
                         can be used for steaming log reading and auto-tailing.
        :return: log message as a string and metadata.
        """
        # Task instance here might be different from task instance when
        # initializing the handler. Thus explicitly getting log location
        # is needed to get correct log path.
        log_relative_path = self._render_filename(ti, try_number)
        location = os.path.join(self.local_base, log_relative_path)

        log = ""

        if os.path.exists(location):
            try:
                with open(location) as file:
                    log += f"*** Reading local file: {location}\n"
                    log += "".join(file.readlines())
            except Exception as e:  # pylint: disable=broad-except
                log = f"*** Failed to load local log file: {location}\n"
                log += "*** {}\n".format(str(e))
        elif conf.get('core', 'executor') == 'KubernetesExecutor':  # pylint: disable=too-many-nested-blocks
            try:
                from airflow.kubernetes.kube_client import get_kube_client

                kube_client = get_kube_client()

                if len(ti.hostname) >= 63:
                    # Kubernetes takes the pod name and truncates it for the hostname. This truncated hostname
                    # is returned for the fqdn to comply with the 63 character limit imposed by DNS standards
                    # on any label of a FQDN.
                    pod_list = kube_client.list_namespaced_pod(conf.get('kubernetes', 'namespace'))
                    matches = [
                        pod.metadata.name
                        for pod in pod_list.items
                        if pod.metadata.name.startswith(ti.hostname)
                    ]
                    if len(matches) == 1:
                        if len(matches[0]) > len(ti.hostname):
                            ti.hostname = matches[0]

                log += '*** Trying to get logs (last 100 lines) from worker pod {} ***\n\n'.format(
                    ti.hostname
                )

                res = kube_client.read_namespaced_pod_log(
                    name=ti.hostname,
                    namespace=conf.get('kubernetes', 'namespace'),
                    container='base',
                    follow=False,
                    tail_lines=100,
                    _preload_content=False,
                )

                for line in res:
                    log += line.decode()

            except Exception as f:  # pylint: disable=broad-except
                log += '*** Unable to fetch logs from worker pod {} ***\n{}\n\n'.format(ti.hostname, str(f))
        else:
            url = os.path.join("http://{ti.hostname}:{worker_log_server_port}/log", log_relative_path).format(
                ti=ti, worker_log_server_port=conf.get('celery', 'WORKER_LOG_SERVER_PORT')
            )
            log += f"*** Log file does not exist: {location}\n"
            log += f"*** Fetching from: {url}\n"
            try:
                timeout = None  # No timeout
                try:
                    timeout = conf.getint('webserver', 'log_fetch_timeout_sec')
                except (AirflowConfigException, ValueError):
                    pass

                response = requests.get(url, timeout=timeout)
                response.encoding = "utf-8"

                # Check if the resource was properly fetched
                response.raise_for_status()

                log += '\n' + response.text
            except Exception as e:  # pylint: disable=broad-except
                log += "*** Failed to fetch log file from worker. {}\n".format(str(e))

        return log, {'end_of_log': True}

    def read(self, task_instance, try_number=None, metadata=None):
        """
        Read logs of given task instance from local machine.

        :param task_instance: task instance object
        :param try_number: task instance try_number to read logs from. If None
                           it returns all logs separated by try_number
        :param metadata: log metadata,
                         can be used for steaming log reading and auto-tailing.
        :return: a list of listed tuples which order log string by host
        """
        # Task instance increments its try number when it starts to run.
        # So the log for a particular task try will only show up when
        # try number gets incremented in DB, i.e logs produced the time
        # after cli run and before try_number + 1 in DB will not be displayed.
        if try_number is None:
            next_try = task_instance.next_try_number
            try_numbers = list(range(1, next_try))
        elif try_number < 1:
            logs = [
                [('default_host', f'Error fetching the logs. Try number {try_number} is invalid.')],
            ]
            return logs, [{'end_of_log': True}]
        else:
            try_numbers = [try_number]

        logs = [''] * len(try_numbers)
        metadata_array = [{}] * len(try_numbers)
        for i, try_number_element in enumerate(try_numbers):
            log, metadata = self._read(task_instance, try_number_element, metadata)
            # es_task_handler return logs grouped by host. wrap other handler returning log string
            # with default/ empty host so that UI can render the response in the same way
            logs[i] = log if self._read_grouped_logs() else [(task_instance.hostname, log)]
            metadata_array[i] = metadata

        return logs, metadata_array

    def _init_file(self, ti):
        """
        Create log directory and give it correct permissions.

        :param ti: task instance object
        :return: relative log path of the given task instance
        """
        # To handle log writing when tasks are impersonated, the log files need to
        # be writable by the user that runs the Airflow command and the user
        # that is impersonated. This is mainly to handle corner cases with the
        # SubDagOperator. When the SubDagOperator is run, all of the operators
        # run under the impersonated user and create appropriate log files
        # as the impersonated user. However, if the user manually runs tasks
        # of the SubDagOperator through the UI, then the log files are created
        # by the user that runs the Airflow command. For example, the Airflow
        # run command may be run by the `airflow_sudoable` user, but the Airflow
        # tasks may be run by the `airflow` user. If the log files are not
        # writable by both users, then it's possible that re-running a task
        # via the UI (or vice versa) results in a permission error as the task
        # tries to write to a log file created by the other user.
        relative_path = self._render_filename(ti, ti.try_number)
        full_path = os.path.join(self.local_base, relative_path)
        directory = os.path.dirname(full_path)
        # Create the log file and give it group writable permissions
        # TODO(aoen): Make log dirs and logs globally readable for now since the SubDag
        # operator is not compatible with impersonation (e.g. if a Celery executor is used
        # for a SubDag operator and the SubDag operator has a different owner than the
        # parent DAG)
        Path(directory).mkdir(mode=0o777, parents=True, exist_ok=True)

        if not os.path.exists(full_path):
            open(full_path, "a").close()
            # TODO: Investigate using 444 instead of 666.
            os.chmod(full_path, 0o666)

        return full_path
One final note: the links in the Airflow UI to the logs will now only open the latest log file, not the older rotated files, which are only accessible by means of SSH or any other interface with access to the Airflow logging path.
I don't think there is a rotation mechanism, but you can store them in S3 or Google Cloud Storage as described here: https://airflow.incubator.apache.org/configuration.html#logs

Trouble connecting Firefox WebDriver with Selenium in Python

I'm running into an error when trying to start the Selenium Firefox driver. It seems like others have hit snags at this step, and there is no readily available solution online, so hopefully this question will be broadly helpful. It seems like Firefox is failing to establish an HTTP server interface when initiated through Selenium's driver. It appears that I can run Firefox from the command line with no errors.
I should specify that I am doing this via SSH login to a Linux container. I'm running Python 2.7 on Ubuntu 14.04 LTS (GNU/Linux 3.16.3-elastic x86_64). I have the latest version of Selenium (2.44) installed, and I'm using Firefox 34.0. I'm using xvfb to spoof a display.
Below is my code, the error logs, and some related source code.
from selenium import webdriver
d = webdriver.Firefox()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/firefox/webdriver.py", line 59, in __init__
self.binary, timeout),
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/firefox/extension_connection.py", line 47, in __init__
self.binary.launch_browser(self.profile)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/firefox/firefox_binary.py", line 66, in launch_browser
self._wait_until_connectable()
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/firefox/firefox_binary.py", line 105, in _wait_until_connectable
raise WebDriverException("Can't load the profile. Profile "
selenium.common.exceptions.WebDriverException: Message: Can't load the profile. Profile Dir: %s If you specified a log_file in the FirefoxBinary constructor, check it for details.
That error is raised here, due to a timeout:
def _wait_until_connectable(self):
    """Blocks until the extension is connectable in the firefox."""
    count = 0
    while not utils.is_connectable(self.profile.port):
        if self.process.poll() is not None:
            # Browser has exited
            raise WebDriverException("The browser appears to have exited "
                                     "before we could connect. If you specified a log_file in "
                                     "the FirefoxBinary constructor, check it for details.")
        if count == 30:
            self.kill()
            raise WebDriverException("Can't load the profile. Profile "
                                     "Dir: %s If you specified a log_file in the "
                                     "FirefoxBinary constructor, check it for details.")
        count += 1
        time.sleep(1)
    return True
In the log file:
tail -f logs/firefox_binary.log
1418661895753 addons.xpi DEBUG checkForChanges
1418661895847 addons.xpi DEBUG No changes found
1418661895853 addons.manager DEBUG Registering shutdown blocker for XPIProvider
1418661895854 addons.manager DEBUG Registering shutdown blocker for LightweightThemeManager
1418661895857 addons.manager DEBUG Registering shutdown blocker for OpenH264Provider
1418661895858 addons.manager DEBUG Registering shutdown blocker for PluginProvider
System JS : ERROR (null):0 - uncaught exception: 2147746065
JavaScript error: file:///tmp/tmplkLsLs/extensions/fxdriver#googlecode.com/components/driver-component.js, line 11507: NS_ERROR_NOT_AVAILABLE: Component is not available'Component is not available' when calling method: [nsIHttpServer::start]
*** Blocklist::_preloadBlocklistFile: blocklist is disabled
1418661908552 addons.manager DEBUG Registering shutdown blocker for <unnamed-provider>
One more point of information: early on, during the Firefox driver initialization, socket.bind(('127.0.0.1', 0)) was failing with a "Can't assign requested address" error. I changed the address to ('0.0.0.0', 0) and edited the localhost entry in my /etc/hosts, and was able to bind that way. Not sure if that could be causing the current failure, though.
Edits per Louis's request below; I mark the two lines where I changed the localhost address.
def free_port():
    """
    Determines a free port using sockets.
    """
    free_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    free_socket.bind(('0.0.0.0', 0))  # changed from 127.0.0.1
    free_socket.listen(5)
    port = free_socket.getsockname()[1]
    free_socket.close()
    return port


def is_connectable(port):
    """
    Tries to connect to the server at port to see if it is running.

    :Args:
     - port: The port to connect.
    """
    try:
        socket_ = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        socket_.settimeout(1)
        socket_.connect(("0.0.0.0", port))  # changed again
        socket_.close()
        return True
    except socket.error:
        return False
Here's the constructor from webdriver.py:
def __init__(self, firefox_profile=None, firefox_binary=None, timeout=30,
             capabilities=None, proxy=None):

    self.binary = firefox_binary
    self.profile = firefox_profile

    if self.profile is None:
        self.profile = FirefoxProfile()

    self.profile.native_events_enabled = (
        self.NATIVE_EVENTS_ALLOWED and self.profile.native_events_enabled)

    if self.binary is None:
        self.binary = FirefoxBinary()

    if capabilities is None:
        capabilities = DesiredCapabilities.FIREFOX

    if proxy is not None:
        proxy.add_to_capabilities(capabilities)

    RemoteWebDriver.__init__(self,
        command_executor=ExtensionConnection("127.0.0.1", self.profile,
                                             self.binary, timeout),
        desired_capabilities=capabilities,
        keep_alive=True)
    self._is_remote = False
Here's the constructor from extension_connection.py:
def __init__(self, host, firefox_profile, firefox_binary=None, timeout=30):
    self.profile = firefox_profile
    self.binary = firefox_binary
    HOST = host
    if self.binary is None:
        self.binary = FirefoxBinary()

    if HOST is None:
        HOST = "127.0.0.1"

    PORT = utils.free_port()
    self.profile.port = PORT
    self.profile.update_preferences()

    self.profile.add_extension()

    self.binary.launch_browser(self.profile)
    _URL = "http://%s:%d/hub" % (HOST, PORT)

    RemoteConnection.__init__(
        self, _URL, keep_alive=True)
In posting my edits to Louis's comment, I saw that my localhost issue turned up in other locations, as the host is hardcoded in twice more. I had my server master address the issue, changed everything in the source back to 127.0.0.1, and the problem was solved. Thanks for prompting me, @louis, and I'm sorry my question wasn't more interesting. I will accept my own answer after two days, when SO allows me.
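For anyone hitting the same thing, a tiny sanity check (just a sketch) that loopback binding works, which is what the driver's free_port() and is_connectable() helpers rely on:
import socket

# Succeeds once /etc/hosts maps localhost to 127.0.0.1 again; raises
# "Cannot assign requested address" when loopback is misconfigured.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("127.0.0.1", 0))
print("bound to", s.getsockname())
s.close()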
