Pre-started agent (or an agent left alive on the host by a Cloudify exception) causes a file-not-found exception when installing an application - cloudify

@Barak Sorry to bother you again.
A pre-started agent, or an agent left alive on the host after a Cloudify exception, causes a FileNotFoundException when an application is installed on that host.
In the whole process of application deployment, agent installation takes up half the time, so I want to pre-start the agents (via a command) on all the virtual machines. After starting them, all the agents can be seen in the Hosts tab of the gs-webui. But when I then deploy applications, an exception occurs and the deployment fails.
The exception is:
Failed to execute entry: jetty_install.groovy;
Caused by:
org.cloudifysource.usm.USMException: Event lifecycle external process exited with abnormal status code: 1
Caught: java.io.FileNotFoundException: /home/vagrant/gigaspaces/work/processing-units/jettyTest_jetty_1_140282317/ext/Xmx512m (/home/vagrant/gigaspaces/work/processing-units/jettyTest_jetty_1_140282317/ext/Xmx512m)
java.io.FileNotFoundException: /home/vagrant/gigaspaces/work/processing-units/jettyTest_jetty_1_140282317/ext/Xmx512m (/home/vagrant/gigaspaces/work/processing-units/jettyTest_jetty_1_140282317/ext/Xmx512m)

Cloudify will only use, or shut down, agents that it has started. So just starting an agent and attaching it to the manager will not work.
You are going to need a custom cloud driver, possibly re-using the existing BYON cloud driver. This cloud driver will allocate a machine from the pool, launch the process that starts the agent, and then pass the compute instance back to Cloudify. In the MachineDetails object that the cloud driver returns, you should set the 'agentRunning' field to true, and Cloudify will use this agent.
For this to work correctly, you will need to generate the required environment file so that the agent is configured to work with the cluster. You can check out an example here:
https://github.com/CloudifySource/cloudify/blob/master/esc/src/main/java/org/cloudifysource/esc/driver/provisioning/privateEc2/PrivateEC2CloudifyDriver.java
The above example starts a node using CloudFormation, passing it the required settings. See how this is done here:
https://github.com/CloudifySource/cloudify/blob/master/esc/src/main/java/org/cloudifysource/esc/driver/provisioning/privateEc2/PrivateEC2CloudifyDriver.java#L1158
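The driver flow described above can be sketched in plain Java. This is a hedged sketch, not the real API: MachineDetails here is a simplified stand-in for Cloudify's own class, PreStartedAgentDriver is hypothetical, and the actual agent-launch step (ssh/bootstrap with the generated environment file) is reduced to a comment.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Stand-in for Cloudify's MachineDetails (field names are illustrative).
class MachineDetails {
    String privateAddress;
    boolean agentRunning; // true = "an agent is already running, don't install one"
}

// Hypothetical BYON-style driver: allocate a host from a static pool, start
// the agent out-of-band, and report the machine back with agentRunning set.
class PreStartedAgentDriver {
    private final Queue<String> pool = new ArrayDeque<>();

    PreStartedAgentDriver(String... hosts) {
        for (String h : hosts) pool.add(h);
    }

    MachineDetails startMachine() {
        // 1. Allocate a machine from the pool.
        String host = pool.remove();

        // 2. Launch the agent process on the host (ssh/bootstrap omitted).
        //    The agent must be started with the generated environment file
        //    so it is configured to join this cluster's manager.

        // 3. Pass the compute instance back to Cloudify, marking the agent
        //    as already running so Cloudify reuses it instead of installing one.
        MachineDetails md = new MachineDetails();
        md.privateAddress = host;
        md.agentRunning = true;
        return md;
    }
}
```

The key detail is step 3: without the agentRunning flag, Cloudify assumes it must install and start an agent itself, which conflicts with the pre-started one.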

Related

Java client connection failed after the NebulaGraph database was restarted

Due to a slow query, NebulaGraph went down.
I ran `enter code here`, and then the connection to the NebulaGraph database was lost. The error message is `java.net.SocketException: Broken pipe (Write failed)`.
I would like to ask whether this is normal, and whether the client could be optimized to automatically recognize this and re-initialize the connection pool.
The application uses the Java client com.vesoft.client-3.3.0, and the connection pool is com.vesoft.nebula.client.graph.SessionPool.
This is normal: when the database kernel restarts, the application's current connections also become invalid. It would be better to catch the corresponding exception in the business layer and re-initialize the session pool. Note also that a session pool and a connection pool are two different concepts.
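The catch-and-reinitialize pattern suggested here can be sketched in plain Java. This is a hedged sketch: QueryPool is a hypothetical stand-in for com.vesoft.nebula.client.graph.SessionPool (the real client's method names and exception types differ), and the recovery retries exactly once.

```java
// Hypothetical stand-in for the NebulaGraph session pool; the real client's
// method names and exception types are not reproduced here.
interface QueryPool {
    String execute(String stmt); // may throw RuntimeException when the connection is broken
    void reinit();               // tear down and rebuild all pooled sessions
}

// Business-layer wrapper: catch the broken-connection error once,
// re-initialize the pool, and retry the statement.
class ResilientClient {
    private final QueryPool pool;

    ResilientClient(QueryPool pool) {
        this.pool = pool;
    }

    String executeWithRecovery(String stmt) {
        try {
            return pool.execute(stmt);
        } catch (RuntimeException brokenPipe) { // e.g. "Broken pipe (Write failed)"
            pool.reinit();                      // rebuild sessions after the DB restart
            return pool.execute(stmt);          // retry once; rethrows if still down
        }
    }
}
```

If the database is still down, the second execute fails and the error propagates to the caller, which is usually the right behavior.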

Troubleshoot AWS Fargate health check for Spring Actuator

I have a Spring Boot application with an accessible /health endpoint, deployed in AWS ECS Fargate. Sometimes the container is stopped with a "Task failed container health checks" message. This happens sometimes once a day, sometimes once a week, and may depend on the load. This is the health-check command specified in the task definition:
CMD-SHELL,curl -f http://localhost/actuator/health || exit 1
My question is how to troubleshoot what AWS receives when the health check fails.
In case anyone else lands here because of failing container health checks (not the same as ELB health checks), AWS provides some basic advice:
Check that the command works from inside the container. In my case I had not installed curl in the container image, but when I tested it from outside the container it worked fine, which fooled me into thinking it was working.
Check the task logs in CloudWatch
If the checks are only failing sometimes (especially under load), you can try increasing the timeout, but also check the task metrics (memory and CPU usage). Garbage collection can cause the task to pause, and if all the vCPUs are busy handling other requests, the health check may be delayed, so you may need to allocate more memory and/or vCPUs to the task.
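For reference, the timeout and retry knobs mentioned above live in the task definition's healthCheck block; a sketch with illustrative values:

```json
"healthCheck": {
  "command": ["CMD-SHELL", "curl -f http://localhost/actuator/health || exit 1"],
  "interval": 30,
  "timeout": 10,
  "retries": 3,
  "startPeriod": 60
}
```

A longer timeout and a generous startPeriod give the JVM room for warm-up and GC pauses before a failed check counts against the container.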
Thanks @John Velonis,
I don't have enough reputation to comment on your answer, so I'm posting this as a separate answer.
In my case, the ECS container kept getting UNKNOWN status from the ECS cluster, even though I could access the health check successfully. When I read this post I checked my base image, node:14.19.1-alpine3.14, and found that it doesn't have the curl command,
so I had to install it in the Dockerfile:
RUN apk --no-cache add curl

MSDTC fails for self-hosted NServiceBus ASP.NET endpoints but not other processes

I have a Windows 2008 R2 server that hosts many back-end NServiceBus endpoints. All of the services that rely on the NServiceBus.Host.exe host (installed as Windows Services) are able to interact with MSDTC perfectly, averaging a small handful of concurrent distributed transactions throughout the day. There are two small Web.API applications, however, that self-host NServiceBus endpoints (as publishers) and constantly receive the following error when trying to process subscription requests:
NServiceBus.Transports.Msmq.MsmqDequeueStrategy Error in receiving messages.
System.Transactions.TransactionAbortedException: The transaction has aborted. --->
System.Transactions.TransactionManagerCommunicationException: Communication with the underlying transaction manager has failed. --->
System.Runtime.InteropServices.COMException: The Transaction Manager is not available. (Exception from HRESULT: 0x8004D01B)
   at System.Transactions.Oletx.IDtcProxyShimFactory.ConnectToProxy(String nodeName, Guid resourceManagerIdentifier, IntPtr managedIdentifier, Boolean& nodeNameMatches, UInt32& whereaboutsSize, CoTaskMemHandle& whereaboutsBuffer, IResourceManagerShim& resourceManagerShim)
   at System.Transactions.Oletx.DtcTransactionManager.Initialize()
   --- End of inner exception stack trace ---
   at System.Transactions.Oletx.OletxTransactionManager.ProxyException(COMException comException)
   at System.Transactions.Oletx.DtcTransactionManager.Initialize()
   at System.Transactions.Oletx.DtcTransactionManager.get_ProxyShimFactory()
   at System.Transactions.Oletx.OletxTransactionManager.CreateTransaction(TransactionOptions properties)
   at System.Transactions.TransactionStatePromoted.EnterState(InternalTransaction tx)
   --- End of inner exception stack trace ---
   at System.Transactions.TransactionStateAborted.CheckForFinishedTransaction(InternalTransaction tx)
   at System.Transactions.Transaction.Promote()
   at System.Transactions.TransactionInterop.ConvertToOletxTransaction(Transaction transaction)
   at System.Transactions.TransactionInterop.GetDtcTransaction(Transaction transaction)
   at System.Messaging.MessageQueue.StaleSafeReceiveMessage(UInt32 timeout, Int32 action, MQPROPS properties, NativeOverlapped* overlapped, ReceiveCallback receiveCallback, CursorHandle cursorHandle, IntPtr transaction)
   at System.Messaging.MessageQueue.ReceiveCurrent(TimeSpan timeout, Int32 action, CursorHandle cursor, MessagePropertyFilter filter, MessageQueueTransaction internalTransaction, MessageQueueTransactionType transactionType)
   at System.Messaging.MessageQueue.Receive(TimeSpan timeout, MessageQueueTransactionType transactionType)
   at NServiceBus.Transports.Msmq.MsmqDequeueStrategy.ReceiveMessage(Func`1 receive) in c:\BuildAgent\work\31f8c64a6e8a2d7c\src\NServiceBus.Core\Transports\Msmq\MsmqDequeueStrategy.cs:line 313
Some other notes:
Both the erroring application pools' identities and the Windows Services' Log On users are the same.
This actually worked well before a recent reboot, as the Web.API services were able to successfully process subscription requests; they are still able to publish messages just fine (though publishing does not automatically use MSDTC, and we are not using a TransactionScope explicitly). Since the reboot, we simply get the above error whenever a subscription request message sits in either of the Web.API publishers' input queues.
I've used both procmon.exe and MSDTC tracing and have found nothing of interest. The typical event viewer logs also do not provide any information.
All endpoints are running .NET 4.5 and NServiceBus 4.6
We cannot recreate this in any other environment.
Additional notes from below conversations
The thread which throws the exception is pure NServiceBus subscription management, where none of "my" code is involved. When the application pool starts the w3wp.exe worker process on demand, NSB spawns a worker thread, unbeknownst to the application, to process subscription requests. It should only ever work across the publisher's input queue and the subscription storage (which I'm also using MSMQ for, in a queue right beside the other), i.e. no other server is involved to my knowledge.
The "code" of the website didn't change across reboots, and the application pool stopped and restarted several times before the reboot without issue.
Not really an answer, but too long for a comment.
What part of your operation requires DTC? A Distributed Transaction gets enlisted automatically when needed, usually when you are talking to two different DTC-supporting bits of infrastructure (e.g. MSMQ and a database).
You said you tested via DTC tracing. Do you mean DTCPing? Did you test by having it run on both machines (or all machines, if there are more than two involved in the transaction)? The DTCPing tool is pretty esoteric, and its output can be confusing.
Also, if it did work before the reboot, is it possible the reboot reset firewall settings? Firewalls are a common cause of DTC problems.
Also, I assume you checked and rechecked your DTC settings on the local machine? Did you ensure that your MSMQ queues are set up to be transactional?
From your comments:
Note that this particular failure occurs when attempting to dequeue a message from a local private MSMQ queue [...]
The stack trace makes it appear that that's all it's doing, but I suspect that as it attempts the dequeue it is also trying to enlist the transaction across multiple servers. See below.
Why MSDTC? It's the original way to support exactly-once messaging in NServiceBus (see here).
Right, but what I'm asking is why the particular operation requires a distributed transaction. If all a handler is doing is reading from a queue and (for example) writing output to the console, MSDTC will never be enlisted, even though the handler is wrapped in a transaction scope. It will simply use a local transaction to read from the queue. The escalation to a distributed transaction is automatic, and only happens when it is needed to support multiple bits of infrastructure.
So if you recently deployed code in a handler that writes data to a new database server, you may be getting a failure because you are now enlisting a transaction that includes the new server, which may be where the failure is happening.
So determining all the servers involved in the distributed transaction is the first step. The next step would be to check the DTC settings on all involved servers. If DTC settings aren't the problem, I'd recommend testing communication between the servers using DTCPing. The NServiceBus documentation has some good instructions for using DTCPing.
What "fixed" this for us in the production environment was adding the application pool identity user to the local Administrators group on the server. Unfortunately we don't have time to determine what setting required that security setup, as this isn't a required configuration in other similar servers. Also, this isn't the most desirable solution from a security perspective, but in our particular situation, we're willing to live with it.

BizTalk message restore

Requirement: Updating BizTalk application to a new version
Problem: The MSI import does not go through if there are running/suspended instances. Termination would result in loss of messages
What did I try:
I had about 100+ messages in the MessageBox, some active, some in suspended-resumable status.
I took a backup of BizTalkMsgBoxDb, terminated all instances from the BizTalk Admin console, and then restored BizTalkMsgBoxDb.
I was expecting the messages to be back in BizTalkMsgBoxDb, but when I queried from the BizTalk Admin console I didn't find any of the messages.
Did I miss anything?
If your changes do not include any changes to ports etc., try replacing the assemblies in the GAC and then restarting your host instances.
Doing a backup of just one of the BizTalk databases and restoring it is a very dangerous practice and I would strongly advise against it as it can cause some very nasty side effects.
The normal deployment process would be to switch off the receive locations, allow any running processes to finish, and resume or terminate any messages/orchestrations as appropriate.
Only once there are no longer any suspended or running processes/messages would you unenlist all the orchestrations and do the deployment.
If there are some long-running processes that cannot be completed or terminated inside the deployment window, then you would have to look at doing a side-by-side deployment. That involves changing the version number of all the DLLs, deploying the new version, then switching off the receive locations of the old version and switching on those of the new one.
When the old version has finished its work, you stop it and un-deploy it.

LoadRunner - Monitoring Linux counters gives RPC error

The Linux distribution is Red Hat. I'm monitoring Linux counters with the LoadRunner Controller's System Resources Graphs - Unix Resources. Monitoring works properly and graphs are plotted in real time, but after a few minutes errors appear:
Monitor name :UNIX Resources. Internal rpc error (error code:2).
Machine: 31.2.2.63. Hint: Check that RPC on this machine is up and running.
Check that rstat daemon on this machine is up and running
(use rpcinfo utility for this verification).
Details: RPC: RPC call failed.
RPC-TCP: recv()/recvfrom() failed.
RPC-TCP: Timeout reached. (entry point: Factory::CollectData).
[MsgId: MMSG-47197]
I logged on to the Linux server and found rstatd still running. After clearing the measurements in the Controller's Unix Resources and adding them again, monitoring started working again, but after a few minutes the same error occurred.
What might cause this error? Is it due to network traffic?
Consider using SiteScope, which has been the preferred monitoring foundation for collecting UNIX/Linux status since version 8.0 of LoadRunner. Every LoadRunner license since version 8 has come with a 500-point SiteScope license in the box for this purpose. More points are available upon request for test-exclusive use of the instance.
