Artifactory access to remote repos with certificates/passwords fails after server restart - artifactory

We have a recurring problem wherein our remote repositories that use either certificates or passwords to access their upstream become inaccessible after a server restart. If, however, we simply view one of them in the admin UI, change anything in it (something trivial, even) and save, all of the remotes simultaneously regain their access.
For example, we have a remote to https://registry.redhat.io, which uses a password and token.
There is no proxy
Socket Timeout (ms): 60000
Advanced config options:
Cache:
Unused Artifacts Cleanup Period (Hr): 168
Metadata Retrieval Cache Period (Sec): 0
Metadata Retrieval Cache Timeout (Sec): 60
Assumed Offline Period (Sec): 0
Missed Retrieval Cache Period (Sec): 0
Others:
Nothing checked
It's running on Ubuntu 20.04.5.
The server typically reboots about every 2 weeks after a kernel update comes through. Is there anything I can do to ensure those connections don't get severed each time?
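Until the root cause is found, the "touch and save" workaround could probably be scripted after each reboot via the repository configuration REST API, i.e. fetching a remote repository's config and posting it straight back. A rough sketch, assuming a hypothetical repo key redhat-remote, an admin token in ARTIFACTORY_TOKEN, and the server at artifactory.example.com:
# Fetch the remote repo's current configuration
curl -sS -H "Authorization: Bearer $ARTIFACTORY_TOKEN" \
  https://artifactory.example.com/artifactory/api/repositories/redhat-remote -o repo.json
# Post it back unchanged, which should be equivalent to "edit something trivial and save" in the UI
curl -sS -X POST -H "Authorization: Bearer $ARTIFACTORY_TOKEN" -H "Content-Type: application/json" \
  --data @repo.json \
  https://artifactory.example.com/artifactory/api/repositories/redhat-remote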

Related

Single Node Artifactory - deploy using AWS ECS fails with current node still available

Maybe I'm just approaching this wrong.
Single Instance mode (non-HA)
AWS-RDS Postgres Database
Deploying via ECS
We currently have Artifactory Pro built into a Docker container and deployed to ECS via CI/CD. The initial deploy goes fine: everything stands up, database migrations occur, and the instance runs.
However, when the task is updated, a new task spins up. It adds entries to access_topology with the new container IP and a unique node ID, but they stay unhealthy. The logs then bomb out with the failure messages below, due to the existing heartbeat of the other node.
If I first stop the running task and then start a new one, it spins up properly (probably because the heartbeat has lapsed).
In the typical ECS workflow, the new task is spun up until it's deemed healthy, and then the older task is killed off.
Either scenario leaves orphaned node records that stay healthy; I'm also trying to figure out how to garbage-collect and purge those.
Any thoughts on this?
Errors are below. It appears the node won't properly join because of an active heartbeat and because this is not an HA setup. However, I want this node to stand up so I can topple the other. Thanks.
Cluster join: Successfully joined jfmd#01es5dmfhar6gcy5abyj4rwpkc with node id ip-10-10-3-248.us-XXXX-1.compute.internal
Application could not be initialized: Current Artifactory node last heartbeat is: 1607609142483. Stopping Artifactory since the local server is running as PRO/OSS but found other servers in registry
Error occurred when refreshing domain cache all domain endpoint failed : Fetch domains from http://localhost:8046/distribution/api/v1/events/domains failed (returned 404), Fetch domains from http://localhost:8046/artifactory/api/events/domains failed (returned 404), [domain_client]"
Retry 20 Elapsed 16.84 secs failed: Couldn't access another access peer. [localhost:8046]. Status code: UNAVAILABLE. HTTP status code 503
Status code: UNAVAILABLE. HTTP status code 503
1607609184634,invalid content-type: text/plain; charset=utf-8
1607609184634,"headers: Metadata(:status=503,content-type=text/plain; charset=utf-8,content-length=19,date=Thu, 10 Dec 2020 14:06:24 GMT)"
1607609184634,DATA-----------------------------
1607609184634,Service Unavailable. Trying again
This is not possible without an HA configuration. Since this is not an HA configuration, the application will not start up if there is another application still "alive". In this case, "alive" is defined as having written a heartbeat within the last X seconds (I believe this is 10 by default).
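If the goal is simply to let a non-HA node replace the old one cleanly, one option (a sketch, not something Artifactory itself prescribes) is to tell ECS to stop the old task before starting the new one, so the heartbeat has lapsed by the time the replacement boots. With a desired count of 1 that is just the deployment configuration; cluster and service names below are placeholders:
aws ecs update-service --cluster my-cluster --service artifactory \
  --deployment-configuration minimumHealthyPercent=0,maximumPercent=100
This trades a brief outage during each deploy for the stop-then-start ordering that the PRO/OSS heartbeat check effectively demands.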

Unable to enable receive locations (many of them)

Whenever I try to enable these receive locations, I get a SQL timeout error. Moreover, there are many service instances and messages that are in the Active or Dehydrated (Queued, awaiting processing) state.
TITLE: BizTalk Server Administration
Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. (.Net SqlClient Data Provider)
For help, click: http://go.microsoft.com/fwlink?ProdName=Microsoft+SQL+Server&EvtSrc=MSSQLServer&EvtID=-2&LinkId=20476
ADDITIONAL INFORMATION:
The wait operation timed out
BUTTONS:
OK
I had tried restarting the host instances, but it did not help. The environment is a clustered server. There appears to be no performance issue: no memory or CPU spikes.
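Since the timeout comes from the SqlClient side, it may be worth looking at the MessageBox database directly for blocking and for a bloated Spool table before touching BizTalk again; a rough sketch against the default database name (adjust BizTalkMsgBoxDb if yours differs):
-- Check for blocking sessions on the SQL instance hosting the MessageBox
EXEC sp_who2;
-- Rough depth of the Spool table; very large counts often go hand in hand with admin-console timeouts
SELECT COUNT(*) AS SpoolDepth FROM BizTalkMsgBoxDb.dbo.Spool WITH (NOLOCK);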

How can I debug buffering with HTTP.sys?

I am running Windows 8.1 and I have an integration test suite that leverages HostableWebCore to spin up isolated ASP.NET web server processes. For performance reasons, I am launching 8 of these at a time and once they are started up I send a very simple web request to each, which is handled by an MVC application loaded into each. Every instance is listening on a different port.
The problem is that the requests are getting held up (I believe) in HTTP.sys (or whatever it is called these days). If I look at Fiddler, I can see all 8 requests immediately (within a couple of milliseconds) hit the ServerGotRequest state. However, the requests sit in this state for 20-100 seconds, depending on how many I run in parallel at a time.
The reason I suspect this is an HTTP.sys problem is that the amount of time I have to wait for any of them to respond increases with the number of hosting applications I spin up in parallel. If I only launch a single hosting application, it will start responding in ~20 seconds. If I spin up 2, they will both start responding in ~30 seconds. If I spin up 4, ~40 seconds. If I spin up 8, ~100 seconds (which is the default WebClient request timeout).
Because of this long delay, I have enough time to attach a debugger and put a breakpoint in my controller action and that breakpoint will be hit after the 20-100 second delay, suggesting that my process hasn't yet received the request. All of the hosts are sitting idle for those 20-100 seconds after ~5-10 seconds of cold start CPU churning. All of the hosts appear to receive the requests at the same time, as if something was blocking any request from going through and then all of a sudden let everything through.
My problem is, I have been unable to locate any information related to how one can debug HTTP.sys. How can I see what it is doing? What is causing the block? Why is it waiting to forward on the requests to the workers? Why do they all come through together?
Alternatively, if someone has any idea how I can work around this and get the requests to come through immediately (without the waiting) I would very much appreciate it.
Another note: I can see System (PID 4) immediately register to listen on the port I have specified as soon as the hosting applications launch.
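For reference, one generic way to capture what HTTP.sys itself is doing is its ETW provider, Microsoft-Windows-HttpService; a hedged sketch from an elevated prompt (trace and file names are arbitrary):
logman start httptrace -p Microsoft-Windows-HttpService 0xFFFF 0x5 -o httptrace.etl -ets
rem ...reproduce the slow requests...
logman stop httptrace -ets
rem Convert the trace to CSV for inspection
tracerpt httptrace.etl -o httptrace.csv -of CSV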
Additional Information:
This is what one of my hosting apps looks like under netsh http show servicestate
Server session ID: FD0000012000004C
Version: 2.0
State: Active
Properties:
Max bandwidth: 4294967295
Timeouts:
Entity body timeout (secs): 120
Drain entity body timeout (secs): 120
Request queue timeout (secs): 120
Idle connection timeout (secs): 120
Header wait timeout (secs): 120
Minimum send rate (bytes/sec): 150
URL groups:
URL group ID: FB00000140000018
State: Active
Request queue name: IntegrationTestAppPool10451{974E3BB1-7774-432B-98DB-99850825B023}
Properties:
Max bandwidth: inherited
Max connections: inherited
Timeouts:
Timeout values inherited
Logging information:
Log directory: C:\inetpub\logs\LogFiles\W3SVC1
Log format: 0
Number of registered URLs: 2
Registered URLs:
HTTP://LOCALHOST:10451/
HTTP://*:10451/
Request queue name: IntegrationTestAppPool10451{974E3BB1-7774-432B-98DB-99850825B023}
Version: 2.0
State: Active
Request queue 503 verbosity level: Basic
Max requests: 1000
Number of active processes attached: 1
Controller process ID: 12812
Process IDs:
12812
Answering this mainly for posterity. It turns out that my problem wasn't HTTP.sys; it was ASP.NET. ASP.NET takes a shared lock when it compiles files, and that lock is identified by System.Web.HttpRuntime.AppDomainAppId. I believe that since all of my apps are built dynamically from a common applicationHost.config file, they all have the same AppDomainAppId (/LM/W3SVC/1/ROOT). This means they all share one lock, so page compilation effectively happens sequentially across all of the apps.
Because the apps keep entering and leaving that shared lock, none of them reaches the end of its compilation quickly, and they all tend to finish around the same time: once one of them makes it through, the others are close behind and finish just after.
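A sketch of the kind of change that might give each host its own AppDomainAppId, assuming each HostableWebCore process can be pointed at its own copy of applicationHost.config: give every generated site a distinct id, since the site id is what ends up in /LM/W3SVC/<id>/ROOT. Site names, paths and ports below are made up:
<sites>
  <!-- Hypothetical: one site per test host, each with a distinct id -->
  <site name="IntegrationTestSite10451" id="10451">
    <application path="/">
      <virtualDirectory path="/" physicalPath="C:\tests\app10451" />
    </application>
    <bindings>
      <binding protocol="http" bindingInformation="*:10451:localhost" />
    </bindings>
  </site>
</sites>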

ODP.NET Connection Pooling Issues - Fault Tolerance After Database Goes Down

I have a WebAPI service using ODP.NET to make connections to several Oracle databases. Normally the web service is hit several times a second and never has long periods of inactivity. In our test site, however, we did not use it for 2-3 days. This morning, we hit the service and got "connection request timeout" exceptions from ODP.NET, suggesting that the connection pool was out of available connections. We are closing the connections after use. The service was working fine before the idle period, but today the very first query got the timeout exception. Our app pool in IIS is configured to never reset.
My question then is: what can cause the connection pool to fill with bad connections after a period of inactivity, such that these connections are not cleaned up in the usual 3-minute cycle? It only happened to 2 of our 3 databases, and Validate Connection=true is set for all of them.
EDIT
So after talking to the DBA, there is some difference between a connection/session being killed manually or by timeout and the database server severing the TCP connections. In this case, the TCP connections were severed as part of a regular backup (why is not important here). I guess this happens when the whole database server goes offline at once. The basis of the question still applies, I think: why is ODP.NET unable to clean up severed connections over time? There is a performance counter that refers to "Stasis" connections; could those connections be stuck in that state? I would think that it should be able to see that a connection is no longer active (Validate Connection=True), kill it and not return it to the pool.
Granted, this problem can be solved by just resetting the app pool every time the database goes down. I would still like to configure ODP.NET connection pooling to be more fault tolerant.
I have run into this same issue, and the only solution I have found is to use the Connection Lifetime connection string parameter in conjunction with Validate Connection.
In my particular case, the connection timeout was set at the server, and the connections in the pool would time out but not be swept out of the pool, resulting in errors.
Setting both the Connection Lifetime and the Validate Connection parameters has resolved the issue.
Make sure the Connection Lifetime value that you choose is less than the server connection inactivity timeout.
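For illustration, a connection string along these lines (values are made up; the point is to keep Connection Lifetime, in seconds, below the server-side inactivity timeout):
Data Source=MYDB;User Id=app_user;Password=secret;Pooling=true;Validate Connection=true;Connection Lifetime=120;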
The recommended solution is to use ODP.NET Fast Connection Failover (FCF). FCF will automatically remove invalid connections from the pool such that you don't need to use Validate Connection, Connection Lifetime, nor clear the pool.
To use FCF, set "HA events=true", use connection pooling, and have your DBA set up Fast Application Notification (FAN) on the server side. FAN is what alerts the ODP.NET pool when a DB service or node goes down or is rebooted. Upon receiving the message, ODP.NET knows which connections to remove from the pool and removes them, leaving all other valid connections untouched.
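For illustration, the client-side half of that setup is just the pooling and HA attributes on the connection string (the FAN setup itself is server-side work for the DBA; names below are placeholders):
Data Source=MYDB;User Id=app_user;Password=secret;Pooling=true;HA Events=true;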
Something else is going on here. Min Pool Size and some of the other settings help when the connection is severed by things like DBA-configured idle timeouts and firewall TCP idle timeouts; 'connection request timeout' occurs when creating a new connection.
This could be a simple network problem. There could be something interfering with DNS resolution of the servers. Another case is not having fully qualified entries in tnsnames. I've been bitten by the latter a couple of times.
The other issue is the one you've already recognized - full pool.
Double-check that you don't have a connection leak somewhere. A missing .Close is one thing, but if you're not using a 'using' statement, a try/finally is required, as an unhandled exception could be thrown prior to the .Close.
I would use perfmon to monitor some of the connection statistics to start: NumberOfPooledConnections, NumberOfActiveConnections, etc.
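A quick way to sample those counters outside the perfmon UI is typeperf; the exact counter category depends on the ODP.NET driver and version installed, so the name below is an assumption to check against what typeperf -q actually lists:
rem Find the ODP.NET counter set (the category name varies by driver/version)
typeperf -q | findstr /i "oracle odp"
rem Then sample the pool counters every 5 seconds (category name here is a guess)
typeperf "\ODP.NET, Managed Driver(*)\NumberOfPooledConnections" "\ODP.NET, Managed Driver(*)\NumberOfActiveConnections" -si 5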

Get rsyslog forwarding messages after remote server restart

I have syslog successfully forwarding logs to an upstream server like so:
$MainMsgQueueType LinkedList
$MainMsgQueueSize 10000
$MainMsgQueueDiscardMark 8000
$MainMsgQueueDiscardSeverity 1
$MainMsgQueueSaveOnShutdown off
$MainMsgQueueTimeoutEnqueue 0
$ActionQueueType LinkedList # in memory queue
$ActionQueueFileName fwdRule1 # unique name prefix for spool files
$ActionQueueSize 10000 # Only allow 10000 elements in the queue
$ActionQueueDiscardMark 8000 # Only allow 8000 elements in the queue before dropping msgs
$ActionQueueDiscardSeverity 1 # Discard Alert,Critical,Error,Warning,Notice,Info,Debug, NOT Emergency
$ActionQueueSaveOnShutdown off # do not save messages to disk on shutdown
$ActionQueueTimeoutEnqueue 0
$ActionResumeRetryCount -1 # infinite retries if host is down
$RepeatedMsgReduction off
*.* @@remoteserver.mynetwork.com:5544
On the remoteserver I have something that talks syslog and listens on that port. To test, I have a simple log client that logs 100 messages a second to syslog.
This all works fine, and I have configured the queues above so that in the event that the remoteserver is unavailable, the queues start filling up, and then eventually messages get discarded, thus safeguarding syslog from blocking its logging clients.
When I stop the remote log sink on remoteserver:5544, rsyslog is still stable (queues filling up / full), but when I restart the remote log sink a while later, rsyslog detects the server again and re-establishes a TCP connection.
HOWEVER, rsyslog then only forwards 1 message to it, despite the queue having many thousands of messages in it and the logging client continuing to log 100 messages a second.
How can I make rsyslog start forwarding messages again once it has detected the remote server is back up (without restarting rsyslog)?
I am using rsyslog 4.6.2-2.
I am using, and want to keep using, TCP.
The problem, in case anybody comes across this, was that the work directory was set to:
$WorkDirectory /var/spool/rsyslog
And the above config does this:
$ActionQueueFileName fwdRule1
Even though it's supposed to be an in-memory queue. Because of this, when the queue reached 800 (bizarrely, not 8000), disk-assisted mode was activated, and rsyslog attempted to write messages to /var/spool/rsyslog. This directory didn't exist. Randomly (hence a race condition, and a bug in rsyslog, must exist), after continually trying to open a queue file on disk in that directory, rsyslog got into a twisted state, gave up, and went back to queueing messages in memory until it hit the 10,000 high mark. Restarting the downstream logserver failed to make it recover.
Taking out all references to $ActionQueueFileName and making sure $WorkDirectory exists fixed the issue.
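In config terms, the fix looked roughly like this (a sketch: either drop the $ActionQueueFileName directive so the queue stays purely in memory, or keep it and make sure the work directory really exists and is writable by rsyslog, e.g. mkdir -p /var/spool/rsyslog):
$WorkDirectory /var/spool/rsyslog   # must exist if any queue may go disk-assisted
$ActionQueueType LinkedList         # in-memory queue
$ActionQueueSize 10000
$ActionQueueDiscardMark 8000
$ActionQueueDiscardSeverity 1
$ActionResumeRetryCount -1
# no $ActionQueueFileName, so rsyslog never tries to spill this queue to disk
*.* @@remoteserver.mynetwork.com:5544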
