No response from requested API (longer than 30 min) while using SimpleHttpOperator - airflow

Apache Airflow version
2.1.2
Operating System
CentOS Linux 7
Deployment
Docker-Compose
What happened:
I'm using SimpleHttpOperator to request an API that often takes longer than 30 minutes to respond, but the task stays in the running state even after the API has already returned a response.
What I expected to happen:
I expected the task using SimpleHttpOperator to switch from running to success so that downstream tasks could be queued.
How to reproduce:
Create a SimpleHttpOperator task that requests an API needing at least 30 minutes to respond. Trigger the task and you will find that even after the API has returned a response, the task is still in the running state.
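For reference, a minimal sketch of the kind of task described above; the connection id, endpoint, and timeout value are placeholders, and extra_options is forwarded to the underlying requests call:

from datetime import datetime

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator

with DAG(
    dag_id="long_http_call",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    long_call = SimpleHttpOperator(
        task_id="long_call",
        http_conn_id="my_slow_api",   # placeholder connection pointing at the slow API
        endpoint="jobs/run-long",     # placeholder endpoint that takes 30+ minutes to respond
        method="GET",
        # extra_options is passed through to requests; without an explicit
        # timeout the HTTP call can wait on the connection indefinitely.
        extra_options={"timeout": 3600},
    )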

Related

Continuously running send pipeline instance

An instance of a BizTalk send pipeline has started to run continuously. On 09/12/2021 an attempt was made to send a file via SFTP, which retried several times but ultimately failed due to a network issue. The error from the event logs is:
The adapter failed to transmit message going to send port "Deliver Outgoing - SFTP" with URL "sftp://xxx.xxxxxx.co.nz:22/To_****/%SourceFileName%". It will be retransmitted after the retry interval specified for this Send Port. Details:"WinSCP.SessionRemoteException: Network error: Software caused connection abort.
For some reason BizTalk made another send attempt at 1:49pm on 10/12/2021 which succeeded as confirmed by the administrator of the SFTP site. Despite this, BizTalk continued making intermittent send attempts and the pipeline instance is still running. The same file has been sent 4 times to the SFTP server.
The pipeline instance in theory should have suspended at 9:47pm on 09/12/2021. I have not been able to confirm definitively whether anybody resumed it, but it seems unlikely at this stage. In any case, after sending successfully the pipeline instance should have terminated and should not be re-executing intermittently.
Does anybody know what could account for this behaviour? This is occurring on BTS2020 with CU2 applied.
I've sent messages over SFTP where the WinSCP interpretation of the date-modified attribute doesn't work with a specific type of SFTP server.
With the WinSCP GUI a dialogue box appears and you can disregard this error, but this option isn't available with BizTalk's GUI. This error appears when a file with the same filename already exists on the server and is supposed to be overwritten.
My solution was to create a pipeline component that removes the existing %SourceFileName% on the server first. The pipeline component (just like the WinSCP GUI) can then disregard the modified-date.

Single Node Artifactory - deploy using AWS ECS fails with current node still available

Maybe I'm just approaching this wrong.
Single Instance mode (non-HA)
AWS-RDS Postgres Database
Deploying via ECS
Currently have Artifactory-Pro building a docker container and deploying to ECS via CI/CD. The initial deploy goes fine. Everything stands up, database migrations occur, and the instance runs.
However, when doing an update to the task, a new task spins up. It then adds entries to the access_topology with the new container IP and a unique node id, but they stay unhealthy. The logs then just bomb out with failure messages (below; due to the existing heartbeat of the other node).
If I first stop the running task and then start a new task, it spins up properly (probably due to heartbeat loss).
In the typical ECS world, the new task is spun up until it's deemed healthy, and then the older task is killed off.
Either scenario creates orphaned NODE records that stay marked healthy; I'm also trying to figure out how to garbage collect and purge those.
Any thoughts on this?
Errors are below – it appears that it won't properly join because of an active heartbeat and because it is not HA. However, I want this node to stand up so I can topple the other. Thanks –
Cluster join: Successfully joined jfmd#01es5dmfhar6gcy5abyj4rwpkc with node id ip-10-10-3-248.us-XXXX-1.compute.internal
Application could not be initialized: Current Artifactory node last heartbeat is: 1607609142483. Stopping Artifactory since the local server is running as PRO/OSS but found other servers in registry
Error occurred when refreshing domain cache all domain endpoint failed : Fetch domains from http://localhost:8046/distribution/api/v1/events/domains failed (returned 404), Fetch domains from http://localhost:8046/artifactory/api/events/domains failed (returned 404), [domain_client]"
Retry 20 Elapsed 16.84 secs failed: Couldn't access another access peer. [localhost:8046]. Status code: UNAVAILABLE. HTTP status code 503
Status code: UNAVAILABLE. HTTP status code 503
1607609184634,invalid content-type: text/plain; charset=utf-8
1607609184634,"headers: Metadata(:status=503,content-type=text/plain; charset=utf-8,content-length=19,date=Thu, 10 Dec 2020 14:06:24 GMT)"
1607609184634,DATA-----------------------------
1607609184634,Service Unavailable. Trying again
This is not possible without an HA configuration. Since this is not an HA configuration, the application will not start up if there is another application still "alive". In this case, "alive" is defined as having written a heartbeat within the last X seconds (I believe this is 10 by default).
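Given that constraint, one way to mimic the working "stop the old task, then start the new one" flow during ECS deployments is to keep the service from ever running the old and new tasks side by side; a sketch with boto3, where the cluster and service names are placeholders:

import boto3

ecs = boto3.client("ecs")

# Force ECS to stop the old task before starting the replacement, so the
# single-node (PRO/OSS) heartbeat check never sees two live Artifactory
# nodes at once. Cluster and service names below are placeholders.
ecs.update_service(
    cluster="artifactory-cluster",
    service="artifactory",
    deploymentConfiguration={
        "maximumPercent": 100,       # never run old + new tasks at the same time
        "minimumHealthyPercent": 0,  # allow the old task to stop before the new one starts
    },
)

The trade-off is a short outage during each deployment, which is unavoidable as long as only one PRO/OSS node may hold the heartbeat.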

gRPC not dropping disconnected channel

Steps to reproduce
1. Start the server
2. Send a client RPC to the server
3. Restart the server
4. Using the same client, send another RPC. The call will fail
5. Send another RPC; this call will succeed
Also, I found that if the server is left stopped for a long time before starting up again, the call in step 5 will return "channel is in state TRANSIENT_FAILURE" as well.
Example code: https://github.com/whs/grpc-repro
(Install from requirements.txt then run main.py)
Expected result
All calls should succeed.
Tested with Python grpcio==1.19.0 server/client and with a go-grpc server. I tried setting grpc.max_connection_age_grace_ms, grpc.max_connection_age_ms, grpc.max_connection_idle_ms, grpc.keepalive_time_ms, and grpc.keepalive_permit_without_calls, but they don't seem to help.
This question is a duplicate of https://groups.google.com/forum/#!msg/grpc-io/199V_iF0NMw/NahHz_vMBwAJ.
The feature you probably want is "wait_for_ready". In case of TRANSIENT_FAILURE (the server being temporarily unavailable), it will automatically wait for the channel to become READY again instead of failing. Read more about wait-for-ready.
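For reference, a minimal sketch of what that looks like with Python grpcio, assuming the stock helloworld example stubs rather than the repro repo's own service:

import grpc
# Assumes the generated stubs from the gRPC helloworld example
# (helloworld_pb2 / helloworld_pb2_grpc); substitute your own service.
import helloworld_pb2
import helloworld_pb2_grpc

channel = grpc.insecure_channel("localhost:50051")
stub = helloworld_pb2_grpc.GreeterStub(channel)

# wait_for_ready=True queues the RPC while the channel is in
# TRANSIENT_FAILURE and sends it once the channel becomes READY again,
# instead of failing fast; the timeout still bounds the total wait.
response = stub.SayHello(
    helloworld_pb2.HelloRequest(name="world"),
    timeout=30,
    wait_for_ready=True,
)
print(response.message)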

Timeout/synchronisation logic in UART communication

Background:
I use UART for communication between two devices (with different OSs, one running Linux and one running Windows). The Windows application acts as the master (sending commands), and the Linux application acts as the responder: it performs the corresponding operations and returns a response.
Windows app: The application sends a command and waits for the response. If Linux has not responded within some timeout (say, 10 seconds), it comes out of the wait and notifies the user of a timeout error (and the user may send the next command).
Linux app: The application waits for a command, processes it (say, for 5 seconds at most), and then sends the response to Windows.
Problem: If, due to some error or issue, Linux responds after the Windows app's timeout (say, after 15 seconds), the Windows application has already aborted the command thinking it timed out and has sent the next command. The response to the first command is then treated as the response to the current one, which is not correct.
Solution: I thought of appending the command byte as the first byte of the Linux response, so the Windows application can check whether the response belongs to the current command and ignore it if not. But this too has a limitation: if two consecutive commands are the same, the stale response will still match and the problem recurs.
What other logic can I implement to solve this?
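For illustration, the echo-the-command-byte scheme described above might look roughly like this on the Windows side; a sketch assuming pyserial, a single-byte command tag, and a fixed-size response frame (port name and sizes are placeholders):

import serial  # pyserial; port name and frame size below are illustrative

RESPONSE_SIZE = 8                                  # assumed fixed-size response frame
ser = serial.Serial("COM3", 115200, timeout=10)    # 10-second read timeout

def send_command(command, payload=b""):
    """Send one command (a single tag byte plus payload) and return the response
    payload, or None on timeout or when the response echoes a different command byte."""
    ser.reset_input_buffer()            # discard any late response left over from a previous command
    ser.write(command + payload)
    response = ser.read(RESPONSE_SIZE)  # returns short/empty data if the timeout expires
    if not response:
        return None                     # timeout: report a timeout error to the user
    if response[:1] != command:
        return None                     # response is tagged with a different command byte: ignore it
    return response[1:]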

gunicorn doesn't process simultaneous requests concurrently

I am trying to serve long-running requests using gunicorn and its async workers, but I can't find any examples that I can get to work. I used the example here, but tweaked it to add a fake delay (sleep for 5 s) before returning the response:
import time

def app(environ, start_response):
    data = "Hello, World!\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(data)))
    ])
    time.sleep(5)  # fake 5-second delay before returning the response
    return iter([data])
Then I run gunicorn like so:
gunicorn -w 4 myapp:app -k gevent
When I open two browser tabs, type http://127.0.0.1:8000/ into both, and send the requests almost at the same time, the requests appear to be processed sequentially - one returns after 5 seconds and the other returns after a further 5 seconds.
Q. I am guessing the sleep isn't gevent friendly? But there are 4 workers, so even if the worker type were 'sync', two workers should handle two requests simultaneously?
I just ran into the same thing and opened a question here: Requests not being distributed across gunicorn workers. The upshot is that the browser appears to serialize access to the same page. I'm guessing this has something to do with cacheability, i.e. the browser thinks the page is likely cacheable, waits until it loads, finds out it isn't, so it makes another request, and so on.
Give gevent.sleep a shot instead of time.sleep.
It's weird that this is happening with -w 4, but -k gevent is an async worker type, so it's possible gunicorn is feeding both requests to the same worker. Assuming that's what's happening, time.sleep will block your process unless you use gevent.monkey.patch_all().
When using gunicorn with a non-blocking worker type like gevent, it will use ONLY ONE process to deal with requests, so it's no surprise that your 5-second work was carried out sequentially.
The async worker is useful when your workload is light and the request rate is rapid; in that case gunicorn can use the time otherwise wasted waiting on IO (like waiting for the socket to become writable so the response can be written to it) by switching to another request assigned to the same worker.
UPDATE
I was wrong.
Even with a non-blocking worker type, the worker setting in gunicorn means each worker is a separate process that runs its own queue.
So if the time.sleep calls are run in different processes, they will run simultaneously, but when they are run in the same worker, they will be carried out sequentially.
The problem is that the gunicorn load balancer may not have distributed the two requests to two different worker processes. You can check the current process with os.getpid().
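Putting those suggestions together, a sketch of a gevent-friendly version of the example handler that also logs which worker process served each request (run it with the same gunicorn command as above; monkey-patching at startup is the alternative if you want to keep time.sleep):

import os
import gevent

def app(environ, start_response):
    data = "Hello, World!\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(data))),
    ])
    # gevent.sleep cooperatively yields, so a single gevent worker can serve
    # other requests during the 5-second delay (time.sleep would block the
    # worker unless gevent.monkey.patch_all() is applied at startup).
    gevent.sleep(5)
    # Log which worker process handled the request, to see how gunicorn
    # distributed the two concurrent requests.
    print("served by pid", os.getpid())
    return iter([data])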

Resources