RMAN does not delete archive logs not applied on GG Downstream - oracle-golden-gate

There is an issue in a topology of 3 hosts.
The primary has a scheduled OS task (every hour) that uses RMAN to delete archive logs older than 3 hours. The archivelog deletion policy is configured to "Applied on all standby".
There are 2 remote log_archive_dest entries - a Physical Standby and a Downstream capture database. Every day a "RMAN-08120: WARNING: archived log not deleted, not yet applied by standby" warning appears in the task's logs, and then it resolves itself within 2-3 hours.
I've checked V$ARCHIVED_LOG during the issue and found that the redo logs are not applied on the Downstream server. I have not caught the issue on the Downstream server itself yet, but even during the "good" periods all the apply processes are enabled, while the dba_apply_progress view tells me that the apply_time of the messages is 01-JAN-1970.
The dba_capture view shows that the capture processes' statuses are ENABLED and that status_change_time is approximately the time the RMAN issue resolved.
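For concreteness, here is a minimal sketch of that per-destination check. The godror driver and the connection string are assumptions, and the same SELECT can just as well be run in SQL*Plus:

package main

import (
    "database/sql"
    "fmt"
    "log"

    _ "github.com/godror/godror" // assumed Oracle driver; any database/sql driver works
)

func main() {
    // Placeholder connection string for the primary database.
    db, err := sql.Open("godror", `system/password@primary-host:1521/ORCLPDB`)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // For every remote destination (dest_id > 1), report the oldest sequence
    // still marked as not applied and how many logs are waiting. The destination
    // feeding the Downstream capture database is the one expected to lag during
    // the RMAN-08120 periods.
    rows, err := db.Query(`
        SELECT dest_id, MIN(sequence#) AS oldest_unapplied, COUNT(*) AS unapplied
        FROM   v$archived_log
        WHERE  applied = 'NO' AND dest_id > 1
        GROUP  BY dest_id`)
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()

    for rows.Next() {
        var destID, oldest, count int64
        if err := rows.Scan(&destID, &oldest, &count); err != nil {
            log.Fatal(err)
        }
        fmt.Printf("dest_id=%d oldest unapplied sequence=%d unapplied logs=%d\n", destID, oldest, count)
    }
}

Comparing this backlog per dest_id against the RMAN deletion window shows which destination is holding the logs back.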
I'm new to GoldenGate, Streams, and downstream capture. I've read the Oracle reference docs but couldn't find anything about a schedule for capture or apply processes.
Could someone please help me figure out what else to check or what to read?
Grateful for every response.
Thanks.

Related

nova-compute service State changes every second (UP<->DOWN)

The status of my nova-compute service changes from UP to DOWN every second. This causes instance creation to sometimes fail and sometimes succeed.
Please let me know if you need any additional information.
Thank you.
(Screenshots: Compute Service State 1, Compute Service State 2)
This can happen when the time is out of sync between nodes, especially if the nodes running nova-api have even minor clock drift.
Can you run date on all nodes at the same time?

What are the possible reasons, the consumer leaves the consumer group?

I am struggling to figure out why the consumer is getting stopped.
The issue is that the consumer gets stopped after a certain time (around 4:52), but up to that point it is able to consume and process messages.
As per my understanding, the reason the consumer stops is that it is not able to commit the offset within max.poll.interval.ms (the processing time is longer than max.poll.interval.ms).
Are there any other reasons?
Here are my basic consumer properties:
max.poll.records = 2
auto.offset.reset = latest
max.poll.interval.ms = 300000
idle.poll.interval = 60000 (between two polls)
no.of.consumers =1
consumer.group.id = test2
listener.auto.start = true
I see some statements in the log:
Received user wakeup,
Raising WakeupException in response to user wakeup,
Executing onLeavePrepare with generation Generation
Can someone help with this?
Note: We are a consumer of the Event Hub, and we see this issue on that connectivity. When we connect to Kafka we do not see any issues.
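For context, this is roughly how those properties map onto a bare consumer loop. The snippet below is only an illustration (it uses the confluent-kafka-go client, and the broker address and topic are placeholders), since the actual setup here is Spring Kafka:

package main

import (
    "fmt"
    "log"
    "time"

    "github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

func main() {
    // Placeholder broker and topic; the settings mirror the properties above.
    c, err := kafka.NewConsumer(&kafka.ConfigMap{
        "bootstrap.servers":    "localhost:9092",
        "group.id":             "test2",
        "auto.offset.reset":    "latest",
        "max.poll.interval.ms": 300000, // 5 minutes
    })
    if err != nil {
        log.Fatal(err)
    }
    defer c.Close()

    if err := c.Subscribe("my-topic", nil); err != nil {
        log.Fatal(err)
    }

    for {
        // If the gap between two polls (processing included) exceeds
        // max.poll.interval.ms, the consumer is considered dead, leaves the
        // group, and its partitions are rebalanced to other members.
        msg, err := c.ReadMessage(5 * time.Second)
        if err != nil {
            continue // a timeout here is normal when the topic is idle
        }
        fmt.Printf("got %s\n", string(msg.Value))
    }
}

Note that the "Received user wakeup" / WakeupException lines usually mean the consumer was asked to stop (for example when the listener container or the pod shuts down), rather than the broker evicting it for exceeding max.poll.interval.ms.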
@Gary, can you please help with this?
It seems everything is fine on the Kafka configuration side. The issue we found is in the Docker pods, where the health of the pod was not being reported, because the livenessProbe port was different from the application's configured port (by mistake I had hard-coded a different port). Anyway, thanks for this forum!

Airflow Dependencies Blocking Task From Getting Scheduled

I have an Airflow instance that had been running with no problems for 2 months until Sunday. There was a blackout in a system on which my Airflow tasks depend, and some tasks were queued for 2 days. After that we decided it was better to mark all the tasks for that day as failed and just lose that data.
Nevertheless, now all the new tasks get triggered at the proper time but they are never set to any state (neither queued nor running). I checked the logs and I see this output:
Dependencies Blocking Task From Getting Scheduled
All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless:
The scheduler is down or under heavy load
The following configuration values may be limiting the number of queueable processes: parallelism, dag_concurrency, max_active_dag_runs_per_dag, non_pooled_task_slot_count
This task instance already ran and had its state changed manually (e.g. cleared in the UI)
I get the impression the third point is the reason why it is not working.
The scheduler and the webserver were working; nevertheless, I restarted the scheduler and I am still getting the same outcome. I also deleted the data in the MySQL database for one job and it is still not running.
I also saw a couple of posts saying that tasks do not run when depends_on_past is set to true and the previous run failed, so the next one will never be executed. I checked that and it is not my case.
Any input would be really appreciated.
Any ideas? Thanks
While debugging a similar issue I found this setting: AIRFLOW__SCHEDULER__MAX_DAGRUNS_PER_LOOP_TO_SCHEDULE (see http://airflow.apache.org/docs/apache-airflow/2.0.1/configurations-ref.html#max-dagruns-per-loop-to-schedule). Checking the Airflow code, it seems that the scheduler queries for DAG runs to examine (to consider running task instances for), and this query is limited to that number of rows (20 by default). So if you have more than 20 DAG runs that are blocked in some way (in our case because task instances were in up_for_retry), it won't consider the other DAG runs even though those could run fine.
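One quick way to check whether you are anywhere near that limit is to count the DAG runs stuck in the running state in the metadata database. A rough sketch against the MySQL metadata DB mentioned above (the DSN is a placeholder, and the same query can be run directly in a MySQL client):

package main

import (
    "database/sql"
    "fmt"
    "log"

    _ "github.com/go-sql-driver/mysql" // the metadata DB is MySQL in this setup
)

func main() {
    // Placeholder DSN for the Airflow metadata database.
    db, err := sql.Open("mysql", "airflow:password@tcp(mysql-host:3306)/airflow")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // DAG runs sitting in 'running' with blocked task instances eat into the
    // per-loop budget (max_dagruns_per_loop_to_schedule, 20 by default).
    rows, err := db.Query(`
        SELECT dag_id, COUNT(*)
        FROM   dag_run
        WHERE  state = 'running'
        GROUP  BY dag_id
        ORDER  BY COUNT(*) DESC`)
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()

    for rows.Next() {
        var dagID string
        var n int
        if err := rows.Scan(&dagID, &n); err != nil {
            log.Fatal(err)
        }
        fmt.Printf("%-40s running dag runs: %d\n", dagID, n)
    }
}

If the total number of such runs is above the limit, raising the setting or clearing the stuck runs lets the scheduler reach the newer ones.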

Google Cloud Bigtable: repeated grpc error code 13, then suddenly success

In short, we sometimes see that a small number of Cloud Bigtable queries fail repeatedly (tens or even hundreds of times in a row) with the error rpc error: code = 13 desc = "server closed the stream without sending trailers" until (usually) the query finally works.
In detail, our setup is as follows:
We are running a collection (< 10) of Go services on Google Compute Engine. Each service leases tasks from a pair of PULL task queues. Each task contains the ID of a Bigtable row. The task handler executes the following query:
row, err := tbl.ReadRow(ctx, <my-row-id>,
    bigtable.RowFilter(bigtable.ChainFilters(
        bigtable.FamilyFilter(<my-column-family>),
        bigtable.LatestNFilter(1))))
If the query fails then the task handler simply returns. Since we lease tasks with a lease time between 10 and 15 minutes, a little while later the lease on that task will expire, it will be leased again, and we'll retry. The tasks have a max retry count of 1000, so they can be retried many times over a long period. In a small number of cases, a particular task will fail with the gRPC error above. The task will typically fail with this same error every time it runs, for hours or days on end, before (seemingly out of the blue) eventually succeeding (or the task runs out of retries and dies).
Since this often takes so long, it seems unrelated to server load. For example right now on a Sunday morning, these servers are very lightly loaded, and yet I see plenty of these errors when I tail the logs. From this answer, I had originally thought that this might be due to trying to query for a large amount of data, perhaps near the max limit that cloud bigtable will support. However I now see that this is not the case; I can find many examples where tasks that have failed many times finally succeed and report only a small amount of data (e.g. <1 MB) was retrieved.
What else should I be looking at here?
Edit: From further testing I now know that this is completely machine (client) independent. If I tail the log on one of the task-leasing machines, wait for a "server closed the stream without sending trailers" error, and then try a one-off ReadRow query for the same rowId from another, unrelated, totally unused machine, I get the same error repeatedly.
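Not an answer, but for anyone trying to reproduce this from a single machine, here is a self-contained sketch of the same ReadRow call wrapped in an explicit retry keyed on the gRPC status code (code 13 is codes.Internal). Project, instance, table, row key, and column family are placeholders:

package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "cloud.google.com/go/bigtable"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

func main() {
    ctx := context.Background()

    // Placeholders for project, instance, and table.
    client, err := bigtable.NewClient(ctx, "my-project", "my-instance")
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()
    tbl := client.Open("my-table")

    for attempt := 1; attempt <= 10; attempt++ {
        row, err := tbl.ReadRow(ctx, "my-row-id",
            bigtable.RowFilter(bigtable.ChainFilters(
                bigtable.FamilyFilter("my-column-family"),
                bigtable.LatestNFilter(1))))
        if err == nil {
            fmt.Printf("attempt %d: read %d column families\n", attempt, len(row))
            return
        }
        code := status.Code(err)
        // code 13 is codes.Internal, the "server closed the stream without
        // sending trailers" case; only keep retrying on codes that may clear.
        if code != codes.Internal && code != codes.Unavailable && code != codes.DeadlineExceeded {
            log.Fatalf("non-retryable error: %v", err)
        }
        log.Printf("attempt %d: code=%s err=%v", attempt, code, err)
        time.Sleep(time.Duration(attempt) * time.Second)
    }
    log.Fatal("row still unreadable after 10 attempts")
}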
This error is typically caused by having more than 256MB of data in your reply.
However, there is currently a bug in our server-side error handling code that allows some invalid characters in HTTP/2 trailers, which is not allowed by the spec. This means that some error messages that contain invalid characters will be surfaced as this kind of error. This should be fixed early next year.

SolrCloud shard recovery

I'm a SolrCloud newbie; my setup is 3 shards, 3 replicas, and an external ZooKeeper.
Today I found shard3 down; replica3 had taken over as leader, so indexing was going to replica3, not shard3. I stopped Tomcat/Solr in reverse order (R3, R2, R1, S3, S2, S1) and restarted in forward order (S1, S2, S3, R1, R2, R3). I did not delete any tlog or replication.properties files. The cloud graph shows all hosts with their correct assignments; as I understand it, these assignments are set in ZooKeeper on the first startup.
My question is: how does the data that was indexed to replica3 get back to the revived shard3?
And surprisingly, shard3 = 87G while replica3 = 80G.
Confused!
Dan,
The size of the replicas is not important, only the number of documents the collection has.
The way Solr works, you can have deleted documents in your collection that are only removed during merge operations; this extra 7G can be deleted documents.
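One way to check whether deleted documents explain the gap is to compare numDocs with maxDoc on both cores, e.g. via the Luke request handler. A small sketch (core URLs are placeholders; the same numbers are also visible on each core's admin page):

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

// lukeIndex holds the part of the Luke handler response we care about.
type lukeIndex struct {
    Index struct {
        NumDocs int64 `json:"numDocs"`
        MaxDoc  int64 `json:"maxDoc"`
    } `json:"index"`
}

func main() {
    // Placeholder core URLs for the old leader and the replica.
    cores := map[string]string{
        "shard3":   "http://shard3-host:8080/solr/collection1",
        "replica3": "http://replica3-host:8080/solr/collection1",
    }

    for name, base := range cores {
        resp, err := http.Get(base + "/admin/luke?wt=json&numTerms=0")
        if err != nil {
            log.Fatal(err)
        }
        var li lukeIndex
        if err := json.NewDecoder(resp.Body).Decode(&li); err != nil {
            log.Fatal(err)
        }
        resp.Body.Close()

        // maxDoc - numDocs is the count of deleted-but-not-yet-merged docs;
        // a large value here accounts for extra index size on disk.
        fmt.Printf("%-8s numDocs=%d maxDoc=%d deleted=%d\n",
            name, li.Index.NumDocs, li.Index.MaxDoc, li.Index.MaxDoc-li.Index.NumDocs)
    }
}

If both cores report the same numDocs, the data is in sync and the size difference is just segments still carrying deleted documents that have not been merged away.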
1) As far as I know, when shard3 is up, live, and running, it is ZooKeeper that does the data sync job between the shard and replica3.
2) Regarding your second question, maybe replica3 is in an optimized state and hence you see a smaller data size, while shard3 is yet to be optimized by Solr. (This is just a wild guess.)
