Failure to install Airflow Helm chart on GKE due to failing migration

I have been trying to set up Airflow using the official Airflow Helm chart from ArtifactHub on a GKE cluster, but I am running into a couple of issues.
First, I get these errors from the pods:
Failed to load logs: container "scheduler" in pod "airflow-scheduler-6b6cc9db4-qmvbw" is waiting to start: PodInitializing
Reason: BadRequest (400)
Then from the init containers, I get the following error:
Traceback (most recent call last):
File "/home/airflow/.local/bin/airflow", line 8, in <module>
sys.exit(main())
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/__main__.py", line 39, in main
args.func(args)
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/cli/cli_parser.py", line 52, in command
return func(*args, **kwargs)
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/cli/commands/db_command.py", line 138, in check_migrations
db.check_migrations(timeout=args.migration_wait_timeout)
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/utils/db.py", line 739, in check_migrations
f"There are still unapplied migrations after {timeout} seconds. Migration"
TimeoutError: There are still unapplied migrations after 60 seconds. MigrationHead(s) in DB: set() | Migration Head(s) in Source Code: {'ecb43d2a1842'}
The Postgres pods look okay.
The second issue is that I get an error that the pods cannot be scheduled because no nodes are available, and cluster autoscaling doesn't help: even after new nodes are added, the new pods still won't land on them.
The odd part is that I upgraded the nodes to 32 processors and 64 GB of memory, but the error persists, so I assume the cause is something else.
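For context, these are the kinds of commands that can help narrow this down; the job and init-container names below assume the official chart's defaults with a release called airflow in the airflow namespace, so adjust them to your install:

# Check whether the chart's database-migration job ran, and what it logged
kubectl get jobs -n airflow
kubectl logs -n airflow job/airflow-run-airflow-migrations

# Inspect the scheduler's init container that waits for the migrations
kubectl logs -n airflow airflow-scheduler-6b6cc9db4-qmvbw -c wait-for-airflow-migrations

# See why the pending pods are unschedulable (resource requests, taints, node selectors)
kubectl describe pod -n airflow airflow-scheduler-6b6cc9db4-qmvbw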

Related

Airflow KubernetesPodOperator retry not working

I am running Airflow 1.10.12 in CeleryExecutor mode on a Kubernetes cluster. The Airflow DAG creates a pod on the same Kubernetes cluster, but in a different namespace.
On a good run, the pod runs a Spark job successfully. If it fails, Airflow should retry: on the first retry attempt the failed pod should receive a new label, metadata.labels.already_checked: 'True', and on the second retry attempt a new pod should be launched. However, the label is not being applied as intended.
A snapshot of the error message:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/contrib/operators/kubernetes_pod_operator.py", line 282, in execute
final_state, result = self.handle_pod_overlap(labels, try_numbers_match, launcher, pod_list)
File "/usr/local/lib/python3.6/site-packages/airflow/contrib/operators/kubernetes_pod_operator.py", line 312, in handle_pod_overlap
final_state, result = self.monitor_launched_pod(launcher, pod_list.items[0])
File "/usr/local/lib/python3.6/site-packages/airflow/contrib/operators/kubernetes_pod_operator.py", line 432, in monitor_launched_pod
'Pod returned a failure: {state}'.format(state=final_state)
airflow.exceptions.AirflowException: Pod returned a failure: failed
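For context, a minimal sketch of the kind of task being described, assuming Airflow 1.10.x's contrib import path and an existing dag object; the task name, namespace, and image below are placeholders:

from datetime import timedelta

from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

# Hypothetical task: launch a pod in another namespace, let Airflow handle retries,
# and keep the failed pod around so the operator's pod-overlap/labeling logic runs on retry.
submit_spark = KubernetesPodOperator(
    task_id='submit_spark_job',
    name='submit-spark-job',
    namespace='spark-jobs',
    image='my-registry/spark-job:latest',
    is_delete_operator_pod=False,  # keep failed pods so a retry can find them
    retries=2,
    retry_delay=timedelta(minutes=5),
    dag=dag,
)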

Apache Airflow 1.10.9: enabling statsd makes the scheduler crash

My Airflow runs in CeleryExecutor mode with PostgreSQL 12. Everything works well except when I turn statsd on:
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow
The schedulers can still render jobs, but the jobs are not running; the scheduler log shows the error below:
[SQL: SELECT count(*) AS count_1
FROM task_instance
WHERE task_instance.pool = %(pool_1)s AND task_instance.state IN (%(state_1)s, %(state_2)s)]
[parameters: {'pool_1': 'default_pool', 'state_1': 'running', 'state_2': 'queued'}]
(Background on this error at: http://sqlalche.me/e/4xp6)
Traceback (most recent call last):
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1246, in _execute_context
cursor, statement, parameters, context
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 588, in do_execute
cursor.execute(statement, parameters)
psycopg2.errors.ProtocolViolation: invalid frontend message type 97
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/jobs/scheduler_job.py", line 1495, in _validate_and_run_task_instances
self._process_and_execute_tasks(simple_dag_bag)
File "/usr/local/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 588, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.DatabaseError: (psycopg2.errors.ProtocolViolation) invalid frontend message type 97
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
If I disable statsd, everything resumes working. Is this a bug in Airflow? Any advice on how to resolve it?
I faced the same error, and after a few tests I was able to get statsd metrics working. Typically, you will see the error if the following conditions are met:
Statsd enabled (set to True)
SQLAlchemy connection pool set to True
Scheduler stderr log enabled (by redirecting the error log to a file, which is where you will see this error)
In my case, even though the scheduler kept throwing these error logs, statsd metrics were still delivered and tasks were scheduled as they should be. I don't know how to measure the impact, and I don't want to sacrifice the sql_alchemy connection pool either, so I leave statsd turned off.
(I guess other people are not seeing the error because they are missing the third condition above.)
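For reference, a minimal airflow.cfg sketch of the combination described above (section placement as in Airflow 1.10; verify against your own config):

[core]
# SQLAlchemy connection pooling enabled (one of the conditions above)
sql_alchemy_pool_enabled = True

[scheduler]
# statsd metrics enabled (the other condition)
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow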

AWS SAM Local dotnetcore2.1 exception when running API Gateway

Setup
Windows 10
Docker for Windows v18.09.0
AWS SAM CLI v0.10.0
Python 3.7.0
AWS CLI v1.16.67
dotnet core sdk v2.1.403
Powershell v5.1.17134.407
Problem
I'm following the quickstart for AWS SAM Local (as well as the readme generated once the init command is executed below), using the dotnetcore2.1 runtime.
I've run the following command to initialise AWS SAM for use with dotnetcore2.1
sam init --runtime dotnetcore2.1
Then I created the package by running
build.ps1 --target=package
Finally I start the local API Gateway service by running
sam local start-api
I then open a browser and navigate to http://localhost:3000/hello where I'm presented with the following:
PS C:\Users\user_name\Documents\Workspace\messaround\aws-sam\sam-app> sam local start-api
2019-01-04 10:39:15 Found credentials in shared credentials file: ~/.aws/credentials
2019-01-04 10:39:15 Mounting HelloWorldFunction at http://127.0.0.1:3000/hello [GET]
2019-01-04 10:39:15 You can now browse to the above endpoints to invoke your functions. You do not need to restart/reload SAM CLI while working on your functions changes will be reflected instantly/automatically. You only need to restart SAM CLI if you update your AWS SAM template
2019-01-04 10:39:16 * Running on http://127.0.0.1:3000/ (Press CTRL+C to quit)
2019-01-04 10:40:10 Invoking HelloWorld::HelloWorld.Function::FunctionHandler (dotnetcore2.1)
2019-01-04 10:40:10 Decompressing C:\Users\user_name\Documents\Workspace\messaround\aws-sam\sam-app\artifacts\HelloWorld.zip
Fetching lambci/lambda:dotnetcore2.1 Docker container image......
2019-01-04 10:40:13 Mounting C:\Users\user_name\AppData\Local\Temp\tmpq0zka7a7 as /var/task:ro inside runtime container
2019-01-04 10:40:14 Exception on /hello [GET]
Traceback (most recent call last):
File "C:\Users\user_name\AppData\Roaming\Python\Python37\site-packages\docker\api\client.py", line 246, in _raise_for_status
response.raise_for_status()
File "C:\Users\user_name\AppData\Roaming\Python\Python37\site-packages\requests\models.py", line 940, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http+docker://localnpipe/v1.35/containers/102dda11417068e01873242be2383c78c7ad4e2739fd4f8b42c1e0ea494d2bbb/start
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\user_name\AppData\Roaming\Python\Python37\site-packages\flask\app.py", line 2292, in wsgi_app
response = self.full_dispatch_request()
File "C:\Users\user_name\AppData\Roaming\Python\Python37\site-packages\flask\app.py", line 1815, in full_dispatch_request
rv = self.handle_user_exception(e)
File "C:\Users\user_name\AppData\Roaming\Python\Python37\site-packages\flask\app.py", line 1718, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "C:\Users\user_name\AppData\Roaming\Python\Python37\site-packages\flask\_compat.py", line 35, in reraise
raise value
File "C:\Users\user_name\AppData\Roaming\Python\Python37\site-packages\flask\app.py", line 1813, in full_dispatch_request
rv = self.dispatch_request()
File "C:\Users\user_name\AppData\Roaming\Python\Python37\site-packages\flask\app.py", line 1799, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "C:\Users\user_name\AppData\Roaming\Python\Python37\site-packages\samcli\local\apigw\local_apigw_service.py", line 153, in _request_handler
self.lambda_runner.invoke(route.function_name, event, stdout=stdout_stream_writer, stderr=self.stderr)
File "C:\Users\user_name\AppData\Roaming\Python\Python37\site-packages\samcli\commands\local\lib\local_lambda.py", line 85, in invoke
self.local_runtime.invoke(config, event, debug_context=self.debug_context, stdout=stdout, stderr=stderr)
File "C:\Users\user_name\AppData\Roaming\Python\Python37\site-packages\samcli\local\lambdafn\runtime.py", line 86, in invoke
self._container_manager.run(container)
File "C:\Users\user_name\AppData\Roaming\Python\Python37\site-packages\samcli\local\docker\manager.py", line 98, in run
container.start(input_data=input_data)
File "C:\Users\user_name\AppData\Roaming\Python\Python37\site-packages\samcli\local\docker\container.py", line 187, in start
real_container.start()
File "C:\Users\user_name\AppData\Roaming\Python\Python37\site-packages\docker\models\containers.py", line 390, in start
return self.client.api.start(self.id, **kwargs)
File "C:\Users\user_name\AppData\Roaming\Python\Python37\site-packages\docker\utils\decorators.py", line 19, in wrapped
return f(self, resource_id, *args, **kwargs)
File "C:\Users\user_name\AppData\Roaming\Python\Python37\site-packages\docker\api\container.py", line 1075, in start
self._raise_for_status(res)
File "C:\Users\user_name\AppData\Roaming\Python\Python37\site-packages\docker\api\client.py", line 248, in _raise_for_status
raise create_api_error_from_http_exception(e)
File "C:\Users\user_name\AppData\Roaming\Python\Python37\site-packages\docker\errors.py", line 31, in create_api_error_from_http_exception
raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 500 Server Error: Internal Server Error ("error while creating mount source path '/host_mnt/c/Users/user_name/AppData/Local/Temp/tmpq0zka7a7': mkdir /host_mnt/c/Users/user_name/AppData: permission denied")
2019-01-04 10:40:14 127.0.0.1 - - [04/Jan/2019 10:40:14] "GET /hello HTTP/1.1" 502 -
2019-01-04 10:40:14 127.0.0.1 - - [04/Jan/2019 10:40:14] "GET /favicon.ico HTTP/1.1" 403 -
What I've tried
Resetting the shared drive credentials
Initially I thought this was a permissions error between my Windows drive and the VM running Docker. After searching the Docker forums I found this article, which I've followed; however, this doesn't seem to have changed the error message.
Any suggestions would be gratefully received. Thanks.
This is how I fixed my problem:
When SAM CLI sees a zip, it unzips it into a temp directory (which looks to be C:/Users/user_name/AppData/Local/Temp/tmpq0zka7a7 in your case).
Docker must have access to that folder.
In my case, I had created a local user to give Docker access to shared drives, and that local user didn't have access to C:/Users/user_name.
I gave it access and got my problem sorted. Maybe you can fix it the same way.
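For example, assuming the shared-drive account was a hypothetical local user named docker-user, access could be granted from an elevated PowerShell prompt with something like:

# Hypothetical: grant the Docker shared-drive account read/execute on the profile folder,
# with (OI)(CI) so the permission is inherited by files and subfolders
icacls "C:\Users\user_name" /grant "docker-user:(OI)(CI)RX"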
Try to run the following:
docker run --rm -v c:/Users/user_name:/data alpine ls /data
It should list the contents of c:/Users/user_name if all is fine.
Good luck!

AirflowException: Celery command failed - The recorded hostname does not match this instance's hostname

I'm running Airflow in a clustered environment on two AWS EC2 instances, one for the master and one for the worker. The worker node periodically throws this error when running "$ airflow worker":
[2018-08-09 16:15:43,553] {jobs.py:2574} WARNING - The recorded hostname ip-1.2.3.4 does not match this instance's hostname ip-1.2.3.4.eco.tanonprod.comanyname.io
Traceback (most recent call last):
File "/usr/bin/airflow", line 27, in <module>
args.func(args)
File "/usr/local/lib/python3.6/site-packages/airflow/bin/cli.py", line 387, in run
run_job.run()
File "/usr/local/lib/python3.6/site-packages/airflow/jobs.py", line 198, in run
self._execute()
File "/usr/local/lib/python3.6/site-packages/airflow/jobs.py", line 2527, in _execute
self.heartbeat()
File "/usr/local/lib/python3.6/site-packages/airflow/jobs.py", line 182, in heartbeat
self.heartbeat_callback(session=session)
File "/usr/local/lib/python3.6/site-packages/airflow/utils/db.py", line 50, in wrapper
result = func(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/airflow/jobs.py", line 2575, in heartbeat_callback
raise AirflowException("Hostname of job runner does not match")
airflow.exceptions.AirflowException: Hostname of job runner does not match
[2018-08-09 16:15:43,671] {celery_executor.py:54} ERROR - Command 'airflow run arl_source_emr_test_dag runEmrStep2WaiterTask 2018-08-07T00:00:00 --local -sd /var/lib/airflow/dags/arl_source_emr_test_dag.py' returned non-zero exit status 1.
[2018-08-09 16:15:43,681: ERROR/ForkPoolWorker-30] Task airflow.executors.celery_executor.execute_command[875a4da9-582e-4c10-92aa-5407f3b46d5f] raised unexpected: AirflowException('Celery command failed',)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/executors/celery_executor.py", line 52, in execute_command
subprocess.check_call(command, shell=True)
File "/usr/lib64/python3.6/subprocess.py", line 291, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'airflow run arl_source_emr_test_dag runEmrStep2WaiterTask 2018-08-07T00:00:00 --local -sd /var/lib/airflow/dags/arl_source_emr_test_dag.py' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.6/dist-packages/celery/app/trace.py", line 382, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/lib/python3.6/dist-packages/celery/app/trace.py", line 641, in __protected_call__
return self.run(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/airflow/executors/celery_executor.py", line 55, in execute_command
raise AirflowException('Celery command failed')
airflow.exceptions.AirflowException: Celery command failed
When this error occurs, the task is marked as failed in Airflow, which fails my DAG even though nothing actually went wrong in the task.
I'm using Redis as my queue and PostgreSQL as my metadata database; both are external AWS services. I'm running all of this in my company environment, which is why the full name of the server is ip-1.2.3.4.eco.tanonprod.comanyname.io. It looks like Airflow wants this full name somewhere, but I have no idea where I need to fix the value so that it gets ip-1.2.3.4.eco.tanonprod.comanyname.io instead of just ip-1.2.3.4.
The really weird thing about this issue is that it doesn't always happen. It seems to happen randomly every once in a while when I run a DAG, and it occurs sporadically across all of my DAGs, not just one. I find the sporadic behavior strange, because it means other task runs handle the hostname just fine.
Note: I've changed the real IP address to 1.2.3.4 for privacy reasons.
Answer:
https://github.com/apache/incubator-airflow/pull/2484
This is exactly the problem I am having, and other Airflow users on AWS EC2 instances are experiencing it as well.
The hostname is set when the task instance runs, via self.hostname = socket.getfqdn(), where socket is the Python standard-library socket module.
The comparison that triggers this error is:
fqdn = socket.getfqdn()
if fqdn != ti.hostname:
    logging.warning("The recorded hostname {ti.hostname} "
                    "does not match this instance's hostname "
                    "{fqdn}".format(**locals()))
    raise AirflowException("Hostname of job runner does not match")
It seems like the hostname on the EC2 instance is changing on you while the worker is running. Perhaps try manually setting the hostname as described here https://forums.aws.amazon.com/thread.jspa?threadID=246906 and see if that sticks.
I had a similar problem on my Mac. Setting hostname_callable = socket:gethostname in airflow.cfg fixed it.
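For reference, a minimal sketch of the corresponding airflow.cfg entry (Airflow 1.x uses the module:attribute colon syntax shown here; Airflow 2 switched to a dotted path):

[core]
# default is socket:getfqdn; socket:gethostname records the short hostname instead
hostname_callable = socket:gethostname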
Personally, when running on my Mac, I found that I got similar errors when the Mac went to sleep while I was running a long job. The solution was to go into System Preferences -> Energy Saver and check "Prevent computer from sleeping automatically when the display is off."

Error executing the 'install' workflow in Cloudify

I'm trying to run a workflow in Cloudify, but when I run the command:
cfy executions start -w install -d teste003 --debug --include-logs
The following error occurs:
Execution of workflow 'install' for deployment 'teste003' timed out. * Run 'cfy executions cancel --execution-id c12ac2b2-fd34-4a04-a4bc-252871f9e166' to cancel the running workflow.
* Run 'cfy events list --tail --include-logs --execution-id c12ac2b2-fd34-4a04-a4bc-252871f9e166' to retrieve the execution's events/logs
Traceback (most recent call last):
File "/home/ubuntu/cloudify/bin/cfy", line 9, in <module>
load_entry_point('cloudify==3.2.1', 'console_scripts', 'cfy')()
File "/home/ubuntu/cloudify/local/lib/python2.7/site-packages/cloudify_cli/cli.py", line 37, in main
args.handler(args)
File "/home/ubuntu/cloudify/local/lib/python2.7/site-packages/cloudify_cli/cli.py", line 143, in command_cmd_handler
command['handler'](**kwargs)
File "/home/ubuntu/cloudify/local/lib/python2.7/site-packages/cloudify_cli/commands/executions.py", line 174, in start
raise SuppressedCloudifyCliError()
SuppressedCloudifyCliError
Below is my aws-ec2-blueprint.yaml file:
tosca_definitions_version: cloudify_dsl_1_1

imports:
  - http://www.getcloudify.org/spec/cloudify/3.2.1/types.yaml
  - http://www.getcloudify.org/spec/aws-plugin/1.2.1/plugin.yaml
  - http://www.getcloudify.org/spec/diamond-plugin/1.2.1/plugin.yaml

inputs:
  image:
    description: >
      Image to be used when launching agent VM's
  size:
    description: >
      Flavor of the agent VM's
  agent_user:
    description: >
      User for connecting to agent VM's

node_templates:
  mongod_host:
    type: cloudify.aws.nodes.Instance
    properties:
      image_id: { get_input: image }
      instance_type: { get_input: size }
My inputs.yaml:
image: ami-d05e75b8
size: m3.medium
agent_user: ubuntu
Any suggestions?
It is hard to know why the install timed out without seeing the logs of the install.
In most cases it is related to the connection to the spawned VM or an install process that keeps failing.
I would try to check the following (a couple of example checks are sketched after this list):
AWS permissions to spawn a VM and connect to it
Security groups, port 22 open to the manager VM
Internet access of the spawned VM
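For example, a couple of checks along those lines (the execution ID is the one from the output above; the security group name is a placeholder for whatever your blueprint created):

# Tail the execution's events/logs to see where the install stalls
cfy events list --tail --include-logs --execution-id c12ac2b2-fd34-4a04-a4bc-252871f9e166

# Placeholder check that the agent security group allows SSH (port 22) from the manager
aws ec2 describe-security-groups --group-names <agent-security-group> \
    --query 'SecurityGroups[].IpPermissions'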
