How to deal with RedisMessageListenerContainer death - spring-data-redis

I've encountered a case where the Redis pub/sub RedisMessageListenerContainer in my Spring Boot application died with:
ERROR .RedisMessageListenerContainer: SubscriptionTask aborted with exception:
org.springframework.dao.QueryTimeoutException: Redis command timed out; nested exception is com.lambdaworks.redis.RedisCommandTimeoutException: Command timed out
at org.springframework.data.redis.connection.lettuce.LettuceExceptionConverter.convert(LettuceExceptionConverter.java:66)
at org.springframework.data.redis.connection.lettuce.LettuceExceptionConverter.convert(LettuceExceptionConverter.java:41)
at org.springframework.data.redis.PassThroughExceptionTranslationStrategy.translate(PassThroughExceptionTranslationStrategy.java:37)
at org.springframework.data.redis.FallbackExceptionTranslationStrategy.translate(FallbackExceptionTranslationStrategy.java:37)
at org.springframework.data.redis.connection.lettuce.LettuceConnection.convertLettuceAccessException(LettuceConnection.java:330)
at org.springframework.data.redis.connection.lettuce.LettuceConnection.subscribe(LettuceConnection.java:3179)
at org.springframework.data.redis.listener.RedisMessageListenerContainer$SubscriptionTask.eventuallyPerformSubscription(RedisMessageListenerContainer.java:790)
at org.springframework.data.redis.listener.RedisMessageListenerContainer$SubscriptionTask.run(RedisMessageListenerContainer.java:746)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.lambdaworks.redis.RedisCommandTimeoutException: Command timed out
at com.lambdaworks.redis.LettuceFutures.await(LettuceFutures.java:113)
at com.lambdaworks.redis.LettuceFutures.awaitOrCancel(LettuceFutures.java:92)
at com.lambdaworks.redis.FutureSyncInvocationHandler.handleInvocation(FutureSyncInvocationHandler.java:63)
at com.lambdaworks.redis.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:80)
at com.sun.proxy.$Proxy156.subscribe(Unknown Source)
at org.springframework.data.redis.connection.lettuce.LettuceSubscription.doSubscribe(LettuceSubscription.java:63)
at org.springframework.data.redis.connection.util.AbstractSubscription.subscribe(AbstractSubscription.java:142)
at org.springframework.data.redis.connection.lettuce.LettuceConnection.subscribe(LettuceConnection.java:3176)
... 3 common frames omitted
..
I think this shouldn't be an unrecoverable error in the first place, because it's a temporary connection issue (and a TransientDataAccessException), but apparently the application needs to deal with exceptions in those cases.
Currently this leaves the application in an unacceptable state. It merely logs the error, whereas I would need either to kill the application so it gets replaced or, better, to restart the container and ideally report via /health that the application is impaired until it has recovered.
Is there anything I'm overlooking that is less awkward than either trying to start() the container every x seconds or subclassing it, overriding handleSubscriptionException(), and trying to act from there? The latter needs much deeper integration with internals than I'd like to have in my code, but it's what I went with so far:
RedisMessageListenerContainer container = new RedisMessageListenerContainer() {
@Override
protected void handleSubscriptionException(Throwable ex) {
super.handleSubscriptionException(ex); // don't know what actually happened in here and no way to find out :/
if (ex instanceof RedisConnectionFailureException) {
// handled by super hopefully, don't care
} else if (ex instanceof InterruptedException){
// can ignore those I guess
} else if (ex instanceof TransientDataAccessException || ex instanceof RecoverableDataAccessException) {
// try to restart in those cases?
if (isRunning()) {
logger.error("Connection failure occurred. Restarting subscription task manually due to " + ex, ex);
sleepBeforeRecoveryAttempt();
start(); // best we can do
}
} else {
// otherwise shutdown and hope for the best next time
if (isRunning()) {
logger.warn("Shutting down application due to unknown exception " + ex, ex);
context.close();
}
}
}
};
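A minimal sketch of the "restart and report via /health" idea described above, using only the JDK. SubscriptionSupervisor, startWithRetry and markUnhealthy are hypothetical names, not part of spring-data-redis. The sketch assumes the supplied action throws when re-subscribing fails; that may not hold for the container's own start(), since the subscription runs on its own thread, so treat the wiring as an open question.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical helper: retries a start action and remembers whether it is healthy. */
final class SubscriptionSupervisor {
    private final Runnable startAction;   // assumption: throws a RuntimeException on failure
    private final long retryDelayMillis;  // pause between attempts
    private final int maxAttempts;        // give up after this many attempts
    private final AtomicBoolean healthy = new AtomicBoolean(false);

    SubscriptionSupervisor(Runnable startAction, long retryDelayMillis, int maxAttempts) {
        this.startAction = startAction;
        this.retryDelayMillis = retryDelayMillis;
        this.maxAttempts = maxAttempts;
    }

    /** Runs the start action until it succeeds or the attempts are used up. */
    boolean startWithRetry() {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                startAction.run();
                healthy.set(true);
                return true;
            } catch (RuntimeException ex) {
                healthy.set(false);
                if (attempt == maxAttempts) {
                    break;
                }
                try {
                    Thread.sleep(retryDelayMillis);
                } catch (InterruptedException interrupted) {
                    // Preserve the interrupt flag and stop retrying.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    /** Call this when a failure is observed, e.g. from an exception handler. */
    void markUnhealthy() {
        healthy.set(false);
    }

    /** Value a health check could report. */
    boolean isHealthy() {
        return healthy.get();
    }
}
```

An overridden handleSubscriptionException() could call markUnhealthy() and a health check could read isHealthy(); whether a failed start can be detected synchronously is something to verify against the container version in use.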

Related

Close Connection on SessionClient - AWS Neptune

I use AWS Neptune.
I am trying to make my queries transactional with a sessioned client (as described in https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-gremlin-sessions.html). But when I do, closing the client throws an exception. A similar issue is described here: https://groups.google.com/g/janusgraph-users/c/N1TPbUU7Szw
My code looks like this:
@Bean
public Cluster gremlinCluster()
{
return Cluster.build()
.addContactPoint(GREMLIN_ENDPOINT)
.port(GREMLIN_PORT)
.enableSsl(GREMLIN_SSL_ENABLED)
.keyCertChainFile("classpath:SFSRootCAG2.pem")
.create();
}
private void runInTransaction()
{
String sessionId = UUID.randomUUID().toString();
Client.SessionedClient client = cluster.connect(sessionId);
try
{
client.submit("query...");
}
finally
{
if (client != null)
{
client.close();
}
}
}
And the exception is:
INFO (ConnectionPool.java:225) - Signalled closing of connection pool on Host{address=...} with core size of 1
WARN (Connection.java:322) - Timeout while trying to close connection on ... - force closing - server will close session on shutdown or expiration.
java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
Is there any suggestion?
This might be a connectivity problem with the server which you are not able to observe while sending the query because you are not waiting for the future to complete.
When you do a client.submit("query...");, you receive a future. You need to wait for that future to complete to observe any exceptions (or success).
I would suggest the following:
Try hitting the server with a health status call using curl to verify connectivity with the server.
Replace the client.submit("query..."); with client.submit("query...").all().join(); to get the error during connection with the server.
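To illustrate why waiting on the future matters, here is a small JDK-only sketch. submitFailing() is a hypothetical stand-in for an asynchronous submit whose result fails; it is not the Gremlin driver API. The point is only that the failure stays hidden until something waits on the future.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

final class FutureErrorDemo {
    /** Hypothetical stand-in for an async call that fails on the server side. */
    static CompletableFuture<String> submitFailing() {
        CompletableFuture<String> future = new CompletableFuture<>();
        future.completeExceptionally(new IllegalStateException("server unreachable"));
        return future;
    }

    /** Fire-and-forget: the failure stays inside the future, so this returns normally. */
    static boolean fireAndForget() {
        submitFailing();
        return true;
    }

    /** Waits for the result; returns the failure message, or null if the call succeeded. */
    static String waitAndReport() {
        try {
            submitFailing().join();
            return null;
        } catch (CompletionException ex) {
            // join() wraps the original failure; the cause carries the real error.
            return ex.getCause().getMessage();
        }
    }
}
```

With the real driver, the equivalent of waitAndReport() is the .all().join() call suggested above.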

What if only send without recv in my Thrift client?

I'm implementing a Thrift client in order to make connection to a built-in scribe server.
Everything is going OK if I use a standard Log method, like this:
public boolean log(List<LogEntry> messages) {
boolean ret = false;
PooledClient client = borrowClient();
try {
if ((client != null) && (client.getClient() != null)) {
ResultCode result = client.getClient().Log(messages);
ret = (result != null && result.equals(ResultCode.OK));
returnClient(client);
}
} catch (Exception ex) {
logger.error(LogUtil.stackTrace(ex));
invalidClient(client);
}
return ret;
}
However, when I use send_Log instead:
public void send_Log(List<LogEntry> messages) {
PooledClient client = borrowClient();
try {
if ((client != null) && (client.getClient() != null)) {
client.getClient().send_Log(messages);
returnClient(client);
}
} catch (Exception ex) {
logger.error(LogUtil.stackTrace(ex));
invalidClient(client);
}
}
It actually causes some problems:
The total number of network connections to port 1463 (the default port for a scribe server) grows very large, and they are always in the CLOSE_WAIT state.
Because my application got stuck without throwing any error, I think it may be an issue with the network connection.
what if send without recv
As this is clearly TCP, the sender will block (in blocking mode), or incur EAGAIN/EWOULDBLOCK in non-blocking mode. EDIT It is now clear that you want to send without receiving the reply. You can do that by just sending and then closing the socket, but that may cause the peer to incur ECONNRESET, which may upset it. You should really implement the application protocol correctly.
1/ Total network connection to port 1463 (default port for a scribe server) is going to increase so much, and always in a CLOSE_WAIT state.
Lots of ports in CLOSE_WAIT state indicates a socket leak on the part of the local application.
2/ Cause my application got stuck without throwing any error. I think it may be an issue with network connection.
It is an issue with sending and not receiving.
Since you labelled this as a Thrift related question, the answer is oneway.
service foo {
oneway void FireAndForget(1: some args)
}
The oneway keyword does exactly what the name suggests. You get a client implementation that only sends and does not wait for anything to be returned from the server. This rule also includes exceptions. Hence a oneway method must always be void and can't throw any exceptions.
However, when I use send_Log instead ...
client.getClient().send_Log(messages);
Neither of the Thrift-generated send_Xxx and recv_Xxx methods is meant to be public, which is why they are usually private or protected. They should not be called directly unless you are sure you know what you are doing (and that is evidently not the case here).
And since the real question is about performance: Why don't you just delegate the call(s) into a secondary thread? That way the I/O will not block the UI.
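A minimal sketch of delegating the blocking call to a secondary thread, using only the JDK. AsyncLogger and logAsync are hypothetical names, and the blockingLog function stands in for the regular Log(...) call from the question; nothing here is generated by Thrift.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical wrapper that runs a blocking log call on a single worker thread. */
final class AsyncLogger<M> implements AutoCloseable {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final Function<List<M>, Boolean> blockingLog; // assumption: wraps the normal Log(...) call

    AsyncLogger(Function<List<M>, Boolean> blockingLog) {
        this.blockingLog = blockingLog;
    }

    /** Queues the messages and returns immediately; the Future reports the outcome. */
    Future<Boolean> logAsync(List<M> messages) {
        // Copy the list so later changes by the caller do not affect the queued batch.
        List<M> snapshot = new ArrayList<>(messages);
        Callable<Boolean> task = () -> blockingLog.apply(snapshot);
        return worker.submit(task);
    }

    /** Stops accepting new work; batches already queued are still sent. */
    @Override
    public void close() {
        worker.shutdown();
    }
}
```

The caller keeps the full send-and-receive protocol intact while no longer blocking on it; the returned Future can be ignored or checked later.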

SignalR connect error

I use SignalR 2.0.0 on Windows Server 2012 / IIS 8, with two environments on two different IPs.
The service in one environment is up and the other is down (on purpose).
I use the WebSocket protocol.
I have the following scenario:
I am connected to the first environment and want to connect to the second.
I disconnect from the first environment and try to connect to the second; I get an error (which is the correct behavior).
I then try to reconnect to the first environment, but I still get the same error.
The error is "Error during negotiation request."
After refreshing the browser, I can connect to the first environment again.
What am I doing wrong?
This is part of my code:
function connect(host)
{
var hubConnection = $.hubConnection('');
hubConnection.url = host;
hubConnection.start()
.done(open)
.fail(error);
}
function open()
{
console.log('login success')
}
function disconnect()
{
var self = this,
hubConnection = $.hubConnection("");
console.log('disconnect ')
hubConnection.stop(true, true);
}
function error(error)
{
var self = this,
hubConnection = $.hubConnection("");
console.log('connection error ')
if(error && hubConnection.state !== $.connection.connectionState.connected)
{
.....
.....
// logic to determine which environment IP was the previous one
connect(environment ip)
}
}
//occured when button disconnect clicked
function disconnectFromFirstEnvironmentAndConnectToSecond()
{
disconnect();
connect(second environment ip);
}
.....
.....
connect(first environment ip);
You're not retaining your first connection reference.
That is, you create a HubConnection and never capture it in a scope that can be used later; when you disconnect, connection.stop does nothing because it isn't called on the HubConnection that was originally started.
This could ultimately lead to you having too many concurrently open requests which will then not allow you to negotiate with a server hence your error.
I'd recommend fixing how you stop/start connections. Next if the issue still occurs I'd inspect the network traffic to ensure that valid requests are being made.

Getting App_Start Code First Migrations to work with Miniprofiler

I am running code first migrations. (EF 4.3.1)
I am also running Miniprofiler.
I run my code first migrations through code on App_Start.
My code looks like this:
public static int IsMigrating = 0;
private static void UpdateDatabase()
{
try
{
if (0 == System.Threading.Interlocked.Exchange(ref IsMigrating, 1))
{
try
{
// Automatically migrate database to catch up.
Elmah.ErrorLog.GetDefault(null).Log(new Elmah.Error(new Exception("Checking db for pending migrations.")));
var dbMigrator = new DbMigrator(new Ninja.Data.Migrations.Configuration());
var pendingMigrations = string.Join(", ", dbMigrator.GetPendingMigrations().ToArray());
Elmah.ErrorLog.GetDefault(null).Log(new Elmah.Error(new Exception("The database needs these code updates: " + pendingMigrations)));
dbMigrator.Update();
Elmah.ErrorLog.GetDefault(null).Log(new Elmah.Error(new Exception("Done upgrading database.")));
}
finally
{
System.Threading.Interlocked.Exchange(ref IsMigrating, 0);
}
}
}
catch (System.Data.Entity.Migrations.Infrastructure.AutomaticDataLossException ex)
{
Elmah.ErrorLog.GetDefault(null).Log(new Elmah.Error(ex));
}
catch (Exception ex)
{
Elmah.ErrorLog.GetDefault(null).Log(new Elmah.Error(ex));
}
}
The problem is that just as my database update is about to be called, my app throws an exception, which I think comes from the first web page request. It says:
Unable to update database to match the current model because there are pending changes and automatic migration is disabled. Either write the pending model changes to a code-based migration or enable automatic migration. Set DbMigrationsConfiguration.AutomaticMigrationsEnabled to true to enable automatic migration.
I think my homepage is creating the DbContext, and triggering this error, before my database update has finished.
How would you go about solving this?
Should I make the context wait using locks, or is there an easier way?
More interestingly, if I start and stop the app a few times, the database changes are pushed and the error goes away.
So I need a way to make the first database request wait until the migrations started in App_Start have finished.
Thoughts?

httpconnection.getResponseCode() giving EOF exception

I am using HttpConnection to connect to a web server. Sometimes a request fails with an
EOFException when calling httpconnection.getResponseCode().
I am setting the following headers while making the connection:
HttpConnection httpconnection = (HttpConnection) Connector.open(url.concat(";interface=wifi"));
httpconnection.setRequestProperty("User-Agent","Profile/MIDP-2.0 Configuration/CLDC-1.0");
httpconnection.setRequestProperty("Content-Language", "en-US");
I close all connections properly after processing each request. Is this exception caused by exceeding the maximum number of connections?
It's an internal server error, which returns status code 500 in the response.
It may be caused by an incorrect request, but server-side code or overload may also be the reason.
If you have access to the server, check its event logs.
See also
500 EOF when chunk header expected
Why might LWP::UserAgent be failing with '500 EOF'?
500 EOF instead of reponse status line in perl script
Apache 1.3 error - Unexpected EOF reading HTTP status - connectionreset
Error 500!
UPDATE: On the other hand, if it's not a response message but a real exception, then it may simply be a bug, as in older Java versions.
A workaround may be to put getResponseCode() inside a try/catch and call it a second time on exception:
int responseCode = -1;
try {
responseCode = con.getResponseCode();
} catch (IOException ex1) {
//check if it's eof, if yes retrieve code again
if (ex1.getMessage() != null && ex1.getMessage().indexOf("EOF") != -1) {
try {
responseCode = con.getResponseCode();
} catch (IOException ex2) {
System.out.println(ex2.getMessage());
// handle exception
}
} else {
System.out.println(ex1.getMessage());
// handle exception
}
}
Regarding the limit on the number of connections, read:
What Is - Maximum number of simultaneous connections
How To - Close connections
If you are using HttpTransportSE, add this header before invoking the "call" method:
ArrayList<HeaderProperty> headerPropertyArrayList = new ArrayList<HeaderProperty>();
headerPropertyArrayList.add(new HeaderProperty("Connection", "close"));
transport.call(SOAP_ACTION, envelope, headerPropertyArrayList);

Resources