Scalable delete-all from Google Cloud Datastore - google-cloud-datastore

I'm trying to implement a complete backup/restore function for my Google App Engine/Datastore solution. I'm using the recommended export/import mechanism (https://cloud.google.com/datastore/docs/export-import-entities) for periodic backup and for restore.
One thing I cannot wrap my head around is how to restore to an empty datastore. The import function won't clear the datastore before importing, so I have to implement a total wipe of the datastore myself. (A way to clear the datastore would also be useful for testing purposes, etc.)
The datastore admin is not an option since it's being phased out.
The recommended way, according to the google documentation, is to use the bulk delete: https://cloud.google.com/dataflow/docs/templates/provided-templates#cloud-datastore-bulk-delete.
The problem with this method is that I will have to launch 1 dataflow job for each namespace/kind combination. And I have a multi-tenant solution with one namespace per tenant and around 20 kinds per namespace. Thus, if I have e.g. 100 tenants, that would give 2000 dataflow jobs to wipe the datastore. But the default quota is 25 simultaneous jobs... Yes, I can contact Google to get a higher quota, but the difference in numbers suggests that I'm doing it wrong.
So, any suggestions on how to wipe my entire datastore? I'm hoping for a scalable solution (that won't exceed request timeout limits etc) where I don't have to write hundreds of lines of code...

One possibility is to create a simple first-generation Python 2.7 GAE application (or just a service) in that project and use the ndb library (typically more efficient than the generic Datastore APIs) to implement on-demand selective/total datastore wiping as desired, along the lines described in How to delete all the entries from google datastore?

This solution deletes all entries in all namespaces.
By using ndb.metadata, no model classes are needed.
And by using ndb.delete_multi_async it will be able to handle a reasonably large datastore before hitting a request time limit.
from google.appengine.api import namespace_manager
from google.appengine.ext import ndb
...
def clearDb():
    # Walk every namespace and every kind via the metadata APIs,
    # so no model classes are needed.
    for namespace in ndb.metadata.get_namespaces():
        namespace_manager.set_namespace(namespace)
        for kind in ndb.metadata.get_kinds():
            keys = [k for k in ndb.Query(kind=kind).iter(keys_only=True)]
            # Issue the deletes asynchronously so the RPCs run in parallel.
            ndb.delete_multi_async(keys)
The solution is a combination of the answers:
GAE, delete NDB namespace
https://stackoverflow.com/a/46802370/10612548
Refer to the latter for tips on how to improve it as time limits are hit and how to avoid instance explosion.
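If the datastore is too large for a single request, one option (a rough sketch of my own, not code from the linked answers; the batch size, function names and re-enqueueing scheme are assumptions) is to push each namespace/kind wipe onto the task queue with the deferred library:
from google.appengine.api import namespace_manager
from google.appengine.ext import deferred, ndb

def wipe_kind(namespace, kind, batch_size=500):
    # Delete one batch of keys, then re-enqueue itself until the kind is empty.
    namespace_manager.set_namespace(namespace)
    keys = ndb.Query(kind=kind).fetch(batch_size, keys_only=True)
    if keys:
        ndb.delete_multi(keys)
        deferred.defer(wipe_kind, namespace, kind, batch_size)

def clear_db_deferred():
    # Enqueue one small task per namespace/kind pair instead of deleting inline.
    for namespace in ndb.metadata.get_namespaces():
        namespace_manager.set_namespace(namespace)
        for kind in ndb.metadata.get_kinds():
            deferred.defer(wipe_kind, namespace, kind)
Each deferred task then runs in its own request, so the request deadline only ever applies to one batch.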

Related

Can I use wildcards when deleting Google Cloud Tasks?

I'm very new to Google Cloud Tasks.
I'm wondering, is there a way to use wildcards when deleting a task? For example, if I potentially had 3 tasks in queue using the following ID naming structure...
id-123-task-1
id-123-task-2
id-123-task-3
Could I simply delete id-123-task-* to delete all 3, or would I have to delete all 3 specific IDs every time? I guess I'm trying to limit the number of API invocations required to delete everything related to 'id-123'.
As of today, wildcards are not supported in Google Cloud Tasks, so I cannot confirm that passing a task ID such as id-123-task-*, as you mentioned, would delete all the tasks.
Nonetheless, if you are creating tasks with a specific purpose in mind, you could create a separate queue for that kind of task.
Not only will you win in terms of organizing your tasks, but when you want to delete them all, you will only need to purge all tasks from that queue, making only 1 API invocation.
Here you can see how to purge all tasks from a specified queue, and also how to delete tasks and queues.
I have also attached the API documentation in case you need further information about purging queues in Cloud Tasks.
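As a rough sketch (assuming the google-cloud-tasks Python client; the project, location and queue names below are placeholders), purging the whole queue is a single call:
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
# Fully-qualified queue name: projects/<project>/locations/<location>/queues/<queue>
queue_name = client.queue_path("my-project", "us-central1", "id-123-queue")

# One API invocation removes every task currently in the queue.
client.purge_queue(name=queue_name)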
As stated here, take into account that if you purge all the tasks from a queue:
Do not create new tasks immediately after purging a queue. Wait at least a second. Tasks created in close temporal proximity to a purge call will also be purged.
Also, if you are using named tasks, as stated here:
You can assign your own name to a task by using the name parameter. However, this introduces significant performance overhead, resulting in increased latencies and potentially increased error rates associated with named tasks. These costs can be magnified significantly if tasks are named sequentially, such as with timestamps.
As a consequence, if you are using named tasks, the documentation recommends using a well-distributed prefix for task names, such as a hash of the contents.
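For instance, a small illustrative sketch of such a prefix (the naming scheme is just an assumption based on the id-123-task-* example above):
import hashlib

def task_name(task_id):
    # Prepend a short hash so names are well distributed rather than sequential.
    prefix = hashlib.sha256(task_id.encode("utf-8")).hexdigest()[:8]
    return "{}-{}".format(prefix, task_id)

print(task_name("id-123-task-1"))  # something like "3f5a9c1d-id-123-task-1"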
I think this is the best solution if you would like to limit the number of API calls.
I hope it helps.

About the pattern to overcome the one update per second/entity limit on google datastore

I read this document, and among its several very relevant topics, some are key to a scalability problem I am facing.
Basically, the document states that it is possible to overcome the one-update-per-second limit per entity, a limit that drove me to Redis in a use case that otherwise would not have demanded it.
"a (google) software engineer in the Datastore team had mentioned a technique to obtain much higher throughput than one update per second on an entity group"
"The basic idea of Job Aggregation is to use a single thread to process a batch of updates. Because there is only one thread and only one transaction open on the entity group, there are no transaction failures due to concurrent updates. You can find similar ideas in other storage products such as VoltDb and Redis."
This is very useful to me, but I don't have any clue how this works.
Would just creating a service and serialising upserts (via a pull queue) to a specific kind solve the issue? How could Datastore be sure that no other thread would suddenly begin to upsert?
Thanks
It is important to keep in mind that Job Aggregation is not part of Datastore. As the documentation says, you need to use a single batch of updates. You can take a look at Batch operations to see how to upsert multiple entities.
Regarding your second question, Datastore is not responsible for ensuring that no other thread begins to upsert; you must ensure that this does not happen in order to get better performance.
Datastore best practices lists other practices that Google recommends for better performance.
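To make the idea concrete, here is a rough sketch of a single aggregation worker (the Counter model, queue name and payload format are assumptions of mine, not part of any Datastore documentation) that leases a batch of queued updates and applies them in one transaction:
import json

from google.appengine.api import taskqueue
from google.appengine.ext import ndb

class Counter(ndb.Model):
    value = ndb.IntegerProperty(default=0)

def aggregate_updates():
    # Single worker: lease a batch of pending updates from a pull queue.
    queue = taskqueue.Queue("counter-updates")
    tasks = queue.lease_tasks(lease_seconds=60, max_tasks=100)
    if not tasks:
        return

    @ndb.transactional
    def apply_batch():
        # Apply the whole batch inside one transaction on the entity group,
        # so there is only one writer and no contention.
        counter = Counter.get_or_insert("shared-counter")
        for task in tasks:
            counter.value += json.loads(task.payload)["delta"]
        counter.put()

    apply_batch()
    queue.delete_tasks(tasks)
Because a single thread owns the batch, there are no transaction failures from concurrent updates, which is the point of the quoted technique.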

How to minimise Firebase Function Latency

As per the documentation, Firebase Functions are currently supported in 4 regions only: "us-central1", "us-east1", "europe-west1", "asia-northeast1".
That means locations further away would incur more latency, and often that translates to lower performance.
How can this limitation be worked around?
1) Choose a location that is closest to you. You can set up test cloud functions in different regions and measure the round-trip latency. Only you can discover the specifics of your location.
2) Focus your software architecture on infrastructure that is locally available.
Use the client-side Firestore library directly as much as possible. It supports offline data, queueing data to send out later if you don't have internet, and caching read data locally - you can't get faster latency than that! So make sure you use Firestore for CRUD operations.
3) Architect to use Cloud Functions for batch and background processing. If any business-logic processing is required, write the data to Firestore (using the client libraries), and have a Cloud Functions trigger do some processing upon the write data-event. Have that trigger update the record with the additional processing and state. I believe that if you're using the client-side libraries there is a way to have the updated data automatically pushed back to the client side.
You also have the bonus benefit of being able to control authorisation with Firestore Auth, whereas Functions don't have an admin-level authorisation control.
4) Reduce chatter: minimise the number of Cloud Function calls overall, and ensure your Cloud Functions do more in one go and return more complete data per call.

Is it possible to use my data store in Spring Cloud Dataflow (for example, Apache Ignite or another InMemory store) for Spring Cloud Stream?

I saw in the tests that Spring Cloud Data Flow used a HashMap to store the SpringDefinition. Is it possible to override the configuration of DataFlowServerConfiguration so that streams and tasks are stored in memory, for example in the same kind of HashMap? If so, how?
I don't think it would be a trivial change. The server needs a backend to store its metadata. By default it actually uses an in-memory H2 database, and it relies on the Spring Data JPA abstraction to give users the chance to select their RDBMS.
Storing on a different storage engine would require not only replacing all the *Repository definitions in several configuration modules, but also reworking the pre-population of data that we do. It would become a bit hard to maintain over time.
Is there a reason why a traditional RDBMS is not suitable here? Or, if you want in-memory, why not just go with the ephemeral approach of H2?

Querying namespaces using Dataflow's DatastoreIO

Is it possible to query entities in a specific namespace when using Dataflow's DatastoreIO?
As of today, unfortunately no - DatastoreIO does not support reading entities in a specific namespace, due to limitations of the Datastore QuerySplitter API, which is used to read the results of a query in parallel. We are tracking the issue internally, and your feedback is valuable for prioritizing it.
If the number of entities your pipeline reads from Datastore is small enough (or the rest of the processing heavy enough) that reading them sequentially (but processing in parallel) would be OK, you can try the workaround suggested in Google Cloud Dataflow User-Defined MySQL Source.
You can also try exporting your data to BigQuery and processing it there, using BigQuery's querying capabilities or Dataflow's BigQueryIO connectors - those have no parallelism limitations.
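For the second option, a rough sketch with the Beam Python SDK (the project, dataset and table names are placeholders, and this assumes the relevant kind has already been exported/loaded into BigQuery):
import apache_beam as beam

with beam.Pipeline() as p:
    rows = (
        p
        | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
            query="SELECT * FROM `my-project.my_dataset.my_kind`",
            use_standard_sql=True)
        | "Process" >> beam.Map(lambda row: row)  # replace with real processing
    )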
