Google Cloud Bigtable: repeated grpc error code 13, then suddenly success

In short, we are sometimes seeing that a small number of Cloud Bigtable queries fail repeatedly (tens or even hundreds of times in a row) with the error rpc error: code = 13 desc = "server closed the stream without sending trailers" until (usually) the query finally works.
In detail, our setup is as follows:
We are running a collection (< 10) of Go services on Google Compute Engine. Each service leases tasks from a pair of PULL task queues. Each task contains an ID of a bigtable row. The task handler executes the following query:
row, err := tbl.ReadRow(ctx, <my-row-id>,
    bigtable.RowFilter(bigtable.ChainFilters(
        bigtable.FamilyFilter(<my-column-family>),
        bigtable.LatestNFilter(1))))
If the query fails then the task handler simply returns. Since we lease tasks with a lease time between 10 and 15 minutes, a little while later the lease on that task will expire, it will be leased again, and we'll retry. The tasks have a max retry count of 1000, so they can be retried many times over a long period. In a small number of cases, a particular task will fail with the grpc error above. The task will typically fail with this same error every time it runs, for hours or days on end, before (seemingly out of the blue) eventually succeeding (or the task runs out of retries and dies).
Since this often takes so long, it seems unrelated to server load. For example, right now on a Sunday morning these servers are very lightly loaded, and yet I see plenty of these errors when I tail the logs. From this answer, I had originally thought that this might be due to trying to query for a large amount of data, perhaps near the maximum that Cloud Bigtable will support. However, I now see that this is not the case; I can find many examples where tasks that have failed many times finally succeed and report that only a small amount of data (e.g. <1 MB) was retrieved.
What else should I be looking at here?
edit: From further testing I now know that this is completely machine (client) independent. If I tail the log on one of the task leasing machines, wait for a "server closed the stream without sending trailers" error, and then try a one-off ReadRow query to the same rowId from another, unrelated, totally unused machine, I get the same error repeatedly.

This error is typically caused by having more than 256MB of data in your reply.
However, there is currently a bug in our server-side error handling code that allows some characters in HTTP/2 trailers that are not permitted by the spec. This means that some error messages containing invalid characters will be surfaced as this kind of error. This should be fixed early next year.

Related

RMAN does not delete archive logs not applied on GG Downstream

There is an issue in a topology of 3 hosts.
The primary has a scheduled OS task (every hour) that deletes archive logs older than 3 hours with RMAN. The archivelog deletion policy is configured to "Applied on all standby".
There are 2 remote log_archive_dest entries: a Physical Standby and a Downstream. Every day an "RMAN-08120: WARNING: archived log not deleted, not yet applied by standby" appears in the task's logs, and then it resolves itself in 2-3 hours.
I've checked V$ARCHIVE_LOG during the issue and figured out that the redo logs are not applied on the Downstream server. I have not caught the issue on the Downstream server itself yet; during the "good" periods all the apply processes are enabled, yet the dba_apply_progress view tells me that the apply_time of the messages is 01-JAN-1970.
The dba_capture view shows that the capture processes' statuses are ENABLED, and status_change_time is approximately the time the RMAN issue resolved.
I'm new to GoldenGate, Streams and downstream capture. I've read the Oracle reference docs but couldn't find anything about a schedule for capture or apply processes.
Could someone please help me figure out what else to check or what to read?
Grateful for every response.
Thanks.

How to resolve celery.backends.rpc.BacklogLimitExceeded error

I am using Celery with Flask. After working for a good long while, my Celery setup is showing a celery.backends.rpc.BacklogLimitExceeded error.
My config values are below:
CELERY_BROKER_URL = 'amqp://'
CELERY_TRACK_STARTED = True
CELERY_RESULT_BACKEND = 'rpc'
CELERY_RESULT_PERSISTENT = False
Can anyone explain why the error is appearing and how to resolve it?
I have checked the docs here, which don't provide any resolution for the issue.
Possibly because the process consuming the results is not keeping up with the process that is producing them? This can result in a large number of unprocessed results building up - this is the "backlog". When the size of the backlog exceeds an arbitrary limit, BacklogLimitExceeded is raised by Celery.
You could try adding more consumers to process the results, or set a shorter value for the result_expires setting.
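As a rough sketch of the result_expires suggestion (the app name and Celery-4-style config here are illustrative, not taken from the question):
from celery import Celery
# Hypothetical app; the broker and rpc backend mirror the question's setup.
app = Celery('tasks', broker='amqp://', backend='rpc://')
# Expire stored results after 10 minutes (the default is one day) so that
# unconsumed results are cleaned up before they pile into a large backlog.
app.conf.result_expires = 600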
The discussion on this closed celery issue may help:
Seems like the database backends would be a much better fit for this purpose.
The amqp/RPC result backends need to send one message per state update, while for the database-based backends (redis, sqla, django, mongodb, cache, etc.) every new state update overwrites the old one.
The "amqp" result backend is not recommended at all since it creates one queue per task, which is required to mimic the database-based backends where multiple processes can retrieve the result.
The RPC result backend is preferred for RPC-style calls where only the process that initiated the task can retrieve the result.
But if you want persistent, multi-consumer results, you should store them in a database.
Using rabbitmq as a broker and redis for results is a great combination, but using an SQL database for results works well too.
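For instance, a minimal sketch of that combination, keeping RabbitMQ as the broker and moving results to Redis (the URLs are illustrative defaults, not taken from the question):
from celery import Celery
# Broker stays on RabbitMQ; results move from the rpc backend to Redis,
# where each new state update overwrites the previous one.
app = Celery('tasks',
             broker='amqp://localhost//',
             backend='redis://localhost:6379/0')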

How long does Firebase throttle you?

Even with debug enabled for RemoteConfig, I still managed to get the following:
Error fetching remote config values Optional(Error Domain=com.google.remoteconfig.ErrorDomain Code=8002 "(null)"
UserInfo={error_throttled_end_time_seconds=1483110267.054194})
Here is my debug code:
let debug = FIRRemoteConfigSettings(developerModeEnabled: true)
FIRRemoteConfig.remoteConfig().configSettings = debug!
Shouldn't the above prevent throttling?
How long will the throttle error remain in effect?
I've experienced the same error due to throttling. I was calling FIRRemoteConfig.remoteConfig().fetchWithExpirationDuration with an expiry that was less than 60 seconds.
To immediately get around this issue during testing, use an alternative device; the throttling occurs against a particular device. E.g. move from your simulator to a physical device.
The intention is not to have a single client flooding the server with fetch requests every second. Make sensible use of the caching it offers out of the box and fetch only when necessary.
When you receive this error, plug the value of error_throttled_end_time_seconds into an epoch converter (like this one at https://www.epochconverter.com) and it will tell you the time when throttling ends. I've tested this myself, and the throttling remains in effect for 1 hour from the first moment you are throttled. So either wait an hour or try some of the other recommendations given here.
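If you prefer to do the conversion locally, a short Python snippet (using the timestamp from the error above) gives the same answer:
from datetime import datetime, timezone
# error_throttled_end_time_seconds from the error in the question
throttled_end = 1483110267.054194
print(datetime.fromtimestamp(throttled_end, tz=timezone.utc))
# -> 2016-12-30 15:04:27.054194+00:00, i.e. when the throttling ends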
UPDATE: Also, if you continue making config requests and receive the throttle error, the expire timeout does not increase (i.e. "you are not further penalized").
The quick and easy hack to get your app running is to delete the application and reinstall it. Firebase identifies your device as a new device after reinstalling.
Hope it helps and saves you time.

Flink + Kafka: Why am I losing messages?

I have written a very simple Flink streaming job which takes data from Kafka using a FlinkKafkaConsumer082.
protected DataStream<String> getKafkaStream(StreamExecutionEnvironment env, String topic) {
    Properties result = new Properties();
    result.put("bootstrap.servers", getBrokerUrl());
    result.put("zookeeper.connect", getZookeeperUrl());
    result.put("group.id", getGroup());
    return env.addSource(
        new FlinkKafkaConsumer082<>(
            topic,
            new SimpleStringSchema(), result));
}
This works very well and whenever I put something into the topic on Kafka, it is received by my Flink job and processed. Now I tried to see what happens if my Flink job isn't online for some reason. So I shut down the Flink job and kept sending messages to Kafka. Then I started my Flink job again and was expecting that it would process the messages that were sent in the meantime.
However, I got this message:
No prior offsets found for some partitions in topic collector.Customer. Fetched the following start offsets [FetchPartition {partition=0, offset=25}]
So it basically ignored all messages that came in since the last shutdown of the Flink job and just started to read at the end of the queue. From the documentation of FlinkKafkaConsumer082 I gathered that it automatically takes care of synchronizing the processed offsets with the Kafka broker. However, that doesn't seem to be the case.
I am using a single-node Kafka installation (the one that comes with the Kafka distribution) with a single-node ZooKeeper installation (also the one that is bundled with the Kafka distribution).
I suspect it is some kind of misconfiguration or something like that, but I really don't know where to start looking. Has anyone else had this issue and maybe solved it?
I found the reason. You need to explicitly enable checkpointing in the StreamExecutionEnvironment to make the Kafka connector write the processed offsets to Zookeeper. If you don't enable it, the Kafka connector will not write the last read offset and it will therefore not be able to resume from there when the collecting Job is restarted. So be sure to write:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(); // <-- this is the important part
Anatoly's suggestion for changing the initial offset is probably still a good idea, in case checkpointing fails for some reason.
https://kafka.apache.org/08/configuration.html
Set auto.offset.reset to smallest (by default it's largest):
auto.offset.reset: What to do when there is no initial offset in Zookeeper or if an offset is out of range:
smallest: automatically reset the offset to the smallest offset
largest: automatically reset the offset to the largest offset
anything else: throw exception to the consumer.
If this is set to largest, the consumer may lose some messages when the number of partitions, for the topics it subscribes to, changes on the broker. To prevent data loss during partition addition, set auto.offset.reset to smallest.
Also make sure getGroup() returns the same group id after a restart.

How to prevent a Hangfire recurring job from restarting after 30 minutes of continuous execution

I am working on an ASP.NET MVC 5 web application, and I am facing a problem using the Hangfire tool to run long-running background jobs. The problem is that if the job execution exceeds 30 minutes, Hangfire will automatically initiate another job, so I end up having two similar jobs running at the same time.
Now I have the following:-
Asp.net mvc-5
IIS-8
Hangfire 1.4.6
Windows server 2012
Now I have defined a Hangfire recurring job to run at 17:00 each day. The background job mainly scans our network for servers and VMs and updates the DB, and the recurring job sends an email after completing the execution.
The recurring job used to work well when its execution took less than 30 minutes. But today, as our system has grown, the recurring job completed after 40 minutes instead of the 22-25 minutes it used to take, and I received 2 emails instead of one (and the time between the emails was around 30 minutes). Now I have re-run the job manually and I have noted that the problem is as follows:
"when the recurring job reaches 30 minutes of continuous execution, a
new instance of the recurring job will start, so I will have two
instances instead of one running at the same time, so that why I received 2 emails."
Now if the recurring job takes less than 30 minutes (for example 29 minutes) I will not face any problem, but if the recurring job's execution exceeds 30 minutes then, for one reason or another, Hangfire will initiate a new job.
Although when I access the Hangfire dashboard during the execution of the job I find that there is only one active job, when I monitor our DB I can see from the SQL profiler that there are two jobs accessing it. This happens 30 minutes after the beginning of the recurring job (at 17:30 in our case), and that is why I received 2 emails, which means 2 recurring jobs were running in the background instead of one.
So can anyone advise on this please: how can I prevent Hangfire from automatically initiating a new recurring job if the current recurring job's execution exceeds 30 minutes?
Thanks
Did you look at the InvisibilityTimeout setting in the Hangfire docs?
The default SQL Server job storage implementation uses a regular table as a job queue. To be sure that a job will not be lost in case of unexpected process termination, it is deleted from the queue only upon successful completion.
To make it invisible to other workers, an UPDATE statement with an OUTPUT clause is used to fetch a queued job and update the FetchedAt value (which signals to other workers that it was fetched) in an atomic way. Other workers see the fetched timestamp and ignore the job. But to handle process termination, they will ignore a job only during a specified amount of time (30 minutes by default).
Although this mechanism ensures that every job will be processed, sometimes it may cause either long retry latency or multiple job executions. Consider the following scenario:
Worker A fetched a job (which runs for an hour) and started it at 12:00.
Worker B fetched the same job at 12:30, because the default invisibility timeout had expired.
Worker C did not fetch the same job at 13:00, because it will have been deleted after successful completion.
If you are using cancellation tokens, the token will be set for Worker A at 12:30, and at 13:00 for Worker B. This may lead to your long-running job never being executed. If you aren't using cancellation tokens, it will be executed concurrently by Worker A and Worker B (from 12:30), but Worker C will not fetch it, because it will have been deleted after successful completion.
So, if you have long-running jobs, it is better to configure the invisibility timeout interval:
var options = new SqlServerStorageOptions
{
    InvisibilityTimeout = TimeSpan.FromMinutes(30) // default value
};
GlobalConfiguration.Configuration.UseSqlServerStorage("<name or connection string>", options);
As of Hangfire 1.5 this option is now Obsolete. Jobs that are being worked on are invisible to other workers.
Say goodbye to confusing invisibility timeout with unexpected background job retries after 30 minutes (by default) when using SQL Server. The new Hangfire.SqlServer implementation uses plain old transactions to fetch background jobs and hide them from other workers.
Even after ungraceful shutdown, the job will be available for other workers instantly, without any delays.
I was having trouble finding documentation on how to do this properly for a PostgreSQL database; every example I saw was using SQL Server. I found that the invisibility timeout is a property inside the PostgreSqlStorageOptions object, here: https://github.com/frankhommers/Hangfire.PostgreSql/blob/master/src/Hangfire.PostgreSql/PostgreSqlStorageOptions.cs#L36. Luckily, through trial and error I was able to figure out that UsePostgreSqlStorage has an overload that accepts this object. For .NET Core 2.0, when you are setting up the Hangfire PostgreSQL DB in the ConfigureServices method of the Startup class, add this (the default timeout is 30 minutes):
services.AddHangfire(config =>
    config.UsePostgreSqlStorage(Configuration.GetConnectionString("Hangfire1ConnectionString"),
        new PostgreSqlStorageOptions {
            InvisibilityTimeout = TimeSpan.FromMinutes(720)
        }));
I had this problem when using Hangfire.MemoryStorage as the storage provider. With memory storage you need to set the FetchNextJobTimeout in the MemoryStorageOptions, otherwise by default jobs will time out after 30 minutes and a new job will be executed.
var options = new MemoryStorageOptions
{
    FetchNextJobTimeout = TimeSpan.FromDays(1)
};
GlobalConfiguration.Configuration.UseMemoryStorage(options);
Just would like to point out that even though the following is stated:
As of Hangfire 1.5 this option is now Obsolete. Jobs that are being worked on are invisible to other workers.
Say goodbye to confusing invisibility timeout with unexpected background job retries after 30 minutes (by default) when using SQL Server. New Hangfire.SqlServer implementation uses plain old transactions to fetch background jobs and hide them from other workers.
Even after ungraceful shutdown, the job will be available for other workers instantly, without any delays.
It seems that for many people using MySQL, PostgreSQL, MongoDB, InvisibilityTimeout is still the way to go: https://github.com/HangfireIO/Hangfire/issues/1197
