SolrCloud shard recovery

I'm a SolrCloud newbie; my setup is 3 shards, 3 replicas, and an external ZooKeeper ensemble.
Today I found shard3 down and replica3 had taken over as leader, so indexing was going to replica3 rather than shard3. I stopped Tomcat/Solr in reverse order (R3, R2, R1, S3, S2, S1) and restarted in forward order (S1, S2, S3, R1, R2, R3). I did not delete any tlog or replication.properties files. The cloud graph shows all hosts with their correct assignments; as I understand it, these assignments are set in ZooKeeper on first startup.
My question is: how does the data that was indexed to replica3 get back to the revived shard3?
And, surprisingly, shard3 is 87G while replica3 is 80G.
Confused!

Dan,
The size of the replicas is not important, only the number of documents the collection has.
The way Solr works, you can have deleted documents in your index that are only physically removed during merge operations, so the extra 7G may well be deleted documents.

1) As far as I know, once shard3 is back up, live and running, it syncs from the current leader (replica3): ZooKeeper only holds the cluster state and leader assignment, while Solr's recovery process replays the missed updates from the leader's tlog or, if it has fallen too far behind, pulls a full index replication from it.
2) Regarding your second question, maybe replica3 has already been optimized (merged) and shard3 has not yet been optimized by Solr, which is why you see the smaller size on replica3. (This is just a wild guess.)
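If you want to check whether the 7G gap really is deleted documents, the per-core Luke handler reports numDocs, maxDoc, and deletedDocs. A rough sketch using Python's requests library; the hosts and core names below are placeholders for your own shard3/replica3 cores:

    import requests

    # Hypothetical host/core names -- replace with your own shard3 cores.
    cores = [
        "http://solr-host1:8983/solr/collection1_shard3_replica1",
        "http://solr-host2:8983/solr/collection1_shard3_replica2",
    ]

    for core in cores:
        # The Luke request handler returns index-level stats for the core.
        resp = requests.get(f"{core}/admin/luke", params={"numTerms": 0, "wt": "json"})
        index = resp.json()["index"]
        print(core)
        print("  numDocs:    ", index["numDocs"])      # live (searchable) documents
        print("  maxDoc:     ", index["maxDoc"])       # live + deleted-but-not-merged
        print("  deletedDocs:", index["deletedDocs"])  # reclaimed on the next merge

If numDocs matches across the two cores and deletedDocs accounts for the size difference, the indexes are in sync and the extra gigabytes will disappear after a merge.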

RMAN does not delete archive logs not applied on GG Downstream

There is an issue in a topology of three hosts.
The primary has a scheduled OS task (every hour) that deletes archive logs older than 3 hours with RMAN. The archivelog deletion policy is configured to "Applied on all standby".
There are two remote log_archive_dest entries: the physical standby and the Downstream. Every day a "RMAN-08120: WARNING: archived log not deleted, not yet applied by standby" appears in the task's logs, and then it resolves itself within 2-3 hours.
I've checked V$ARCHIVED_LOG during the issue and found that the redo logs are not applied on the Downstream server. I have not caught the issue on the Downstream server itself yet; during the "good" periods all the apply processes are enabled, but the DBA_APPLY_PROGRESS view tells me that the apply_time of the messages is 01-JAN-1970.
The DBA_CAPTURE view shows the capture processes' status as ENABLED, with a status_change_time of approximately the time the RMAN issue resolved.
I'm new to GoldenGate, Streams, and Downstream capture. I've read the Oracle reference docs, but couldn't find anything about a schedule for capture or apply processes.
Could someone please help me figure out what else to check or what to read?
Grateful for every response.
Thanks.
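Not an answer, but for anyone running the same checks: the per-destination applied status and the downstream apply/capture state can be pulled with a few queries. A rough sketch using Python and cx_Oracle; the connect strings are placeholders, the v$archived_log query runs on the primary, and the DBA_* views are queried on the downstream database:

    import cx_Oracle

    # Placeholder connect strings -- substitute your own hosts and credentials.
    primary = cx_Oracle.connect("system/secret@primary-host/ORCL")
    downstream = cx_Oracle.connect("system/secret@downstream-host/DWN")

    # On the primary: per-destination applied status of recent archived logs.
    with primary.cursor() as cur:
        cur.execute("""
            SELECT dest_id, sequence#, applied, completion_time
              FROM v$archived_log
             WHERE completion_time > SYSDATE - 1/4
             ORDER BY dest_id, sequence#""")
        for row in cur:
            print(row)

    # On the downstream: apply progress and capture state (the views from the question).
    with downstream.cursor() as cur:
        cur.execute("SELECT apply_name, apply_time, applied_message_create_time "
                    "FROM dba_apply_progress")
        print(cur.fetchall())
        cur.execute("SELECT capture_name, status, status_change_time FROM dba_capture")
        print(cur.fetchall())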

Cosmos DB Emulator hangs when pumping continuation token, segmented query

I have just added a new feature to an app I'm building. It uses the same working Cosmos/Table storage code that other features use to query and pump result segments from the Cosmos DB Emulator via the Tables API.
The emulator is running with:
/EnableTableEndpoint /PartitionCount=50
This is because I read that the emulator defaults to 5 unlimited containers and/or 25 limited ones, and since this is a Tables API app, the table containers are created as unlimited.
The table being queried is the 6th to be created and contains just 1 document.
It either takes around 30 seconds to run a simple query, "tripping" my Too Many Requests error handling/retry in the process, or it hangs seemingly forever with no results returned and the emulator has to be shut down.
My understanding is that with 50 partitions I can create 10 unlimited tables/collections, since each is "worth" 5. See the documentation.
I have tried with rate limiting on and off, and jacked the RU/s to 10,000 on the table. It always fails to query this one table. The data, including the files on disk, has been cleared many times.
It seems like a bug in the emulator. Note that the "Sorry..." error that I would expect to see upon creation of the 6th unlimited table, as per the docs, is never encountered.
After switching to a real Cosmos DB instance on Azure, this is looking like a problem with my dodgy code.
Confirmed: my dodgy code.
Stand down everyone. As you were.
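For reference, the continuation-token pumping in question looks roughly like this with the Python azure-data-tables SDK (the app itself uses different code; the connection string, table name, and filter below are placeholders):

    from azure.data.tables import TableClient

    # Placeholder: use the Tables connection string reported by the emulator,
    # or a real Cosmos DB Tables connection string.
    conn_str = "<your-cosmos-tables-connection-string>"
    client = TableClient.from_connection_string(conn_str, table_name="MyTable")

    # Query in segments and pump each page via its continuation token.
    pages = client.query_entities(
        query_filter="PartitionKey eq 'p1'",
        results_per_page=100,
    ).by_page()

    for page in pages:
        for entity in page:
            print(dict(entity))
        # by_page() exposes the token for the segment just consumed;
        # None means there are no more segments to pump.
        if pages.continuation_token is None:
            break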

How to resolve celery.backends.rpc.BacklogLimitExceeded error

I am using Celery with Flask. After working for a good long while, my Celery setup is showing a celery.backends.rpc.BacklogLimitExceeded error.
My config values are below:
CELERY_BROKER_URL = 'amqp://'
CELERY_TRACK_STARTED = True
CELERY_RESULT_BACKEND = 'rpc'
CELERY_RESULT_PERSISTENT = False
Can anyone explain why the error is appearing and how to resolve it?
I have checked the docs here, which don't provide any resolution for the issue.
Possibly because your process consuming the results is not keeping up with the process that is producing the results? This can result in a large number of unprocessed results building up - this is the "backlog". When the size of the backlog exceeds an arbitrary limit, BacklogLimitExceeded is raised by celery.
You could try adding more consumers to process the results? Or set a shorter value for the result_expires setting?
The discussion on this closed celery issue may help:
Seems like the database backends would be a much better fit for this purpose.
The amqp/RPC result backends need to send one message per state update, while for the database-based backends (redis, sqla, django, mongodb, cache, etc.) every new state update will overwrite the old one.
The "amqp" result backend is not recommended at all, since it creates one queue per task, which is required to mimic the database-based backends where multiple processes can retrieve the result.
The RPC result backend is preferred for RPC-style calls, where only the process that initiated the task can retrieve the result.
But if you want persistent, multi-consumer results, you should store them in a database.
Using rabbitmq as a broker and redis for results is a great combination, but using an SQL database for results works well too.
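As a concrete example of the suggestions above, here is a minimal sketch of the same config switched to a Redis result backend with a shorter result expiry. The Redis URL is an assumption (a local instance), and on newer Celery versions the lowercase setting names (result_backend, result_expires) are preferred:

    CELERY_BROKER_URL = 'amqp://'
    CELERY_TRACK_STARTED = True

    # Keep results in Redis: each new state update overwrites the previous one
    # instead of queueing another message, so no backlog can build up.
    CELERY_RESULT_BACKEND = 'redis://localhost:6379/0'

    # Expire stored results after an hour so forgotten results get cleaned up
    # (assumed local Redis; new-style setting name is result_expires).
    CELERY_TASK_RESULT_EXPIRES = 3600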

Galera 'Deadlock found when trying to get lock' on commit when a node drops

I'm seeing 'Deadlock found when trying to get lock' when issuing a COMMIT while a different node in the WAN Galera cluster has recently had connectivity issues (at virtually the same time as the COMMIT). In this particular situation I'm inserting data into a single table with an auto_increment PK, no FKs, and no other unique constraints. According to logging, the node issuing the COMMIT has yet to recognize that the other node has experienced any issues (the cluster size has yet to change when the deadlock exception is thrown).
I initially assumed this error had to do with the auto_increment_increment and auto_increment_offset values changing when the cluster size changed, leading to PK conflicts, so I tried to simplify matters by configuring Galera not to manage those values at all and manually set appropriate values across the cluster, but that didn't solve the problem.
Based on the Galera documentation it sounds like the committing node verifies that the transaction doesn't cause any issues with the other nodes in the cluster. Based on my auto_increment_* configuration I know that the auto_increment id shouldn't conflict, so I'm assuming at this point that the committing node is attempting to check the status of the transaction with all nodes, including the node which recently, and very temporarily (< 1 min), went offline, and it rejects the transaction because it can't get a response from the node currently experiencing issues.
I'm relatively new to Galera (8 months) and I was hoping a seasoned Galera veteran might have some advice on the best way of handling this situation. I'm aware of the "retry the transaction" approach, but that strikes me as a bit of a hack, and I'm hoping there is an alternate solution, or at least some additional information as to the underlying cause of this particular issue.
Thanks
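For what it's worth, the "retry the transaction" approach mentioned above is usually just a small wrapper around the commit: Galera reports certification failures to the client with MySQL error 1213, the same code as an ordinary InnoDB deadlock. A rough sketch using PyMySQL; the connection details, table, and statement are placeholders:

    import time
    import pymysql

    WSREP_DEADLOCK = 1213  # ER_LOCK_DEADLOCK; Galera certification failures reuse this code

    def execute_with_retry(conn, sql, params=(), retries=3, delay=0.1):
        """Execute and commit, retrying when the commit is rejected with a deadlock error."""
        for attempt in range(retries):
            try:
                with conn.cursor() as cur:
                    cur.execute(sql, params)
                conn.commit()
                return
            except pymysql.MySQLError as exc:
                conn.rollback()
                code = exc.args[0] if exc.args else None
                if code != WSREP_DEADLOCK or attempt == retries - 1:
                    raise
                time.sleep(delay * (attempt + 1))  # brief backoff, then replay the transaction

    # Placeholder connection details and statement.
    conn = pymysql.connect(host="galera-node1", user="app", password="secret", database="appdb")
    execute_with_retry(conn, "INSERT INTO my_table (col) VALUES (%s)", ("value",))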

Force drop a collection in MongoDB

I can't delete a collection; it's telling me false every single time.
When I do a getCollections(), it gives me a lot of tmp.mr.mapreduce_1299189206_1618_inc collections (the ones I want to drop). I thought they were deleted during disconnection, but they're not (in my case).
Then when I do: db["tmp.mr.mapreduce_1299188705_5595"].drop() I always get false and it's not deleted.
The logs are not really helpful:
Wed Mar 9 11:05:51 [conn4] CMD: drop customers.tmp.mr.mapreduce_1299188705_5595
Now I've maxed out my namespaces and I cannot create more collections. Help?
BTW, I can take it down; this is not production (and even in production I could take it down too).
"Now I've maxed out my namespaces and I cannot create more collections. Help?"
By default MongoDB has 20k namespaces, so that's a lot of dead M/R collections. According to the logs, the DB is getting the request to drop the collection. So the question now is whether or not MongoDB has gotten into a bad state.
Can you take it down and re-start to ensure that all connections are closed?
Is it a lot of data? Can you take it down and run --repair?
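Once drops work again (after a restart or --repair), clearing the dead M/R collections in bulk is simple to script. A sketch with a current PyMongo driver; the connection URI is an assumption, and the database name comes from the log line in the question:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # assumed local mongod
    db = client["customers"]  # database name taken from the log line above

    # Drop every leftover temporary map/reduce collection.
    for name in db.list_collection_names():
        if name.startswith("tmp.mr."):
            print("dropping", name)
            db.drop_collection(name)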
