Querying namespaces using Dataflow's DatastoreIO - google-cloud-datastore

Is it possible to query entities in a specific namespace when using Dataflow's DatastoreIO?

As of today, unfortunately no - DatastoreIO does not support reading entities from a specific namespace, due to limitations of the Datastore QuerySplitter API, which is used to read the results of a query in parallel. We are tracking the issue internally, and your feedback is valuable for prioritizing it.
If the number of entities your pipeline reads from Datastore is small enough (or the rest of the processing heavy enough) that reading them sequentially, while still processing in parallel, would be acceptable, you can try the workaround suggested in Google Cloud Dataflow User-Defined MySQL Source.
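For illustration only, here is a minimal sketch of that idea written against the Beam Python SDK and the google-cloud-datastore client; the project, namespace, and kind names are hypothetical placeholders. A single dummy element feeds a DoFn that reads the namespace sequentially, and a Reshuffle redistributes the entities for parallel downstream processing.

import apache_beam as beam
from google.cloud import datastore

class ReadNamespaceFn(beam.DoFn):
    # Reads all entities of one kind from one namespace, sequentially.
    def __init__(self, project, namespace, kind):
        self._project = project
        self._namespace = namespace
        self._kind = kind

    def process(self, _):
        client = datastore.Client(project=self._project, namespace=self._namespace)
        for entity in client.query(kind=self._kind).fetch():
            yield dict(entity)  # plain dicts are easier to pass downstream

with beam.Pipeline() as p:
    entities = (p
                | 'Seed' >> beam.Create([None])  # one element -> one sequential reader
                | 'ReadNamespace' >> beam.ParDo(ReadNamespaceFn('my-project', 'tenant-a', 'Order'))
                | 'Redistribute' >> beam.Reshuffle())  # redistribute for parallel processing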
You can also try exporting your data to BigQuery and processing it there, using BigQuery's querying capabilities or Dataflow's BigQueryIO connectors - those have no parallelism limitations.

Related

Possible to create pipeline that writes an SQL database to MongoDB daily?

TL;DR I'd like to combine the power of BigQuery with my MERN-stack application. Is it better to (a) use nodejs-bigquery to write a Node/Express API directly against BigQuery, or (b) create a daily job that writes my (entire) BigQuery DB over to MongoDB, and then use mongoose to write a Node/Express API against MongoDB?
I need to determine the best approach for combining a data ETL workflow that creates a BigQuery database, with a react/node web application. The data ETL uses Airflow to create a workflow that (a) backs up daily data into GCS, (b) writes that data to BigQuery database, and (c) runs a bunch of SQL to create additional tables in BigQuery. It seems to me that my only two options are to:
(1) Do a daily write/convert/transfer/migrate (whatever the correct verb is) from the BigQuery database to MongoDB. I already have a node/express API written using mongoose, connected to a MongoDB cluster, and this approach would let me keep that API.
(2) Use the nodejs-bigquery library to create a node API that is directly connected to BigQuery. My app would change from a MERN stack to a (BQ)ERN stack. I would have to re-write the node/express API to work with BigQuery, but I would no longer need MongoDB (nor have to transfer data daily from BigQuery to Mongo). However, BigQuery can be a very slow database when you are looking for a single entry, since it is not meant to be used like Mongo or a SQL database (it has no indexes, so a single-row lookup runs as slowly as a full table scan). Most of my API calls ask for very little data from the database.
I am not sure which approach is best. I don't know whether having two databases for one web application is bad practice. I don't know whether (1), with the daily transfers from one DB to the other, is feasible, and I don't know how slow BigQuery will be if I use it directly from my API. I think that if (1) is easy to add to my data engineering workflow, then it is preferred, but again, I am not sure.
I am going with (1). It shouldn't be too much work to write a Python script that queries tables from BigQuery, transforms the rows, and writes collections to Mongo. There are some things to handle (incremental changes, etc.), but this is much easier to handle than writing a whole new node/BigQuery API.
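A minimal sketch of what such a daily transfer script might look like, assuming the google-cloud-bigquery and pymongo libraries; the dataset, table, id column, and connection details are all placeholders:

from google.cloud import bigquery
from pymongo import MongoClient, ReplaceOne

def transfer_table(dataset, table, mongo_uri, mongo_db):
    # Pull the table; incremental logic (e.g. partition filters) would go into this SQL.
    bq = bigquery.Client()
    rows = bq.query('SELECT * FROM `{}.{}`'.format(dataset, table)).result()

    collection = MongoClient(mongo_uri)[mongo_db][table]

    # Upsert by a stable id column so the daily job can be re-run safely.
    ops = [ReplaceOne({'_id': row['id']}, dict(row.items()), upsert=True) for row in rows]
    if ops:
        collection.bulk_write(ops)

if __name__ == '__main__':
    transfer_table('my_dataset', 'users', 'mongodb://localhost:27017', 'my_app')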
FWIW, in a past life I worked on a web e-commerce site that had 4 different DB back ends (Mongo, MySQL, Redis, Elasticsearch), so more than one is not an issue at all, but you need to consider one of them the DB of record, i.e. if anything does not match between them, one is the source of truth and the other is suspect. In my example, Redis and Elasticsearch were nearly ephemeral: blow them away and they get recreated from the underlying MySQL and Mongo sources. Now, MySQL and Mongo at the same time was a bit odd, in that we were doing a slow-roll migration. This meant various record types were being transitioned from MySQL over to Mongo. The process looked a bit like:
- The ORM layer writes to both MySQL and Mongo; reads still come from MySQL.
- Data is regularly compared between the two.
- A few months elapse with no irregularities; writes to MySQL are turned off and reads are moved to Mongo.
The end goal was no more MySQL; everything was Mongo. I ran down that tangent because it seems like you could do something similar: write to both DBs in whatever DB abstraction layer you use (ORM, DAO, or whatever else), and eventually move the reads, as appropriate, to wherever they need to go; a rough sketch follows below. If you need large batches for writes, you could buffer at that abstraction layer until a threshold of your choosing is reached before sending them.
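For concreteness, the dual-write pattern at the abstraction layer might look roughly like this in Python; the repository classes are hypothetical stand-ins for your real data access code:

class InMemoryRepo(object):
    # Stand-in for a real MySQL/Mongo DAO, only here to make the sketch runnable.
    def __init__(self):
        self._rows = {}
    def save(self, record):
        self._rows[record['id']] = record
    def get(self, record_id):
        return self._rows.get(record_id)

class UserRepository(object):
    def __init__(self, mysql_repo, mongo_repo, read_from='mysql'):
        self._mysql = mysql_repo
        self._mongo = mongo_repo
        self._read_from = read_from  # flip to 'mongo' once the data is trusted

    def save(self, user):
        # Write to both stores during the migration window; one remains the DB of record.
        self._mysql.save(user)
        self._mongo.save(user)

    def get(self, user_id):
        repo = self._mysql if self._read_from == 'mysql' else self._mongo
        return repo.get(user_id)

repo = UserRepository(InMemoryRepo(), InMemoryRepo())
repo.save({'id': 1, 'name': 'Ada'})
print(repo.get(1))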
With all that said, depending on your data complexity, a nightly ETL job would be completely doable as well, but you do run into the extra complexity of managing and monitoring that additional process. Another potential downside is the data is always stale by a day.

Scalable delete-all from Google Cloud Datastore

I'm trying to implement a complete backup/restore function for my Google App Engine/Datastore solution. I'm using the recommended https://cloud.google.com/datastore/docs/export-import-entities for periodic backup and for restore.
One thing I cannot wrap my head around is how to restore to an empty datastore. The import function won't clear the datastore before importing, so I have to implement a total wipe of the datastore myself. (And a way to clear the datastore might also be a good thing for test purposes, etc.)
The datastore admin is not an option since it's being phased out.
The recommended way, according to the google documentation, is to use the bulk delete: https://cloud.google.com/dataflow/docs/templates/provided-templates#cloud-datastore-bulk-delete.
The problem with this method is that I will have to launch 1 dataflow job for each namespace/kind combination. And I have a multi-tenant solution with one namespace per tenant and around 20 kinds per namespace. Thus, if I have e.g. 100 tenants, that would give 2000 dataflow jobs to wipe the datastore. But the default quota is 25 simultaneous jobs... Yes, I can contact Google to get a higher quota, but the difference in numbers suggests that I'm doing it wrong.
So, any suggestions on how to wipe my entire datastore? I'm hoping for a scalable solution (that won't exceed request timeout limits etc) where I don't have to write hundreds of lines of code...
One possibility is to create a simple 1st-generation Python 2.7 GAE application (or just a service) in that project and use the ndb library (typically more efficient than the generic Datastore APIs) to implement on-demand selective or total datastore wiping as desired, along the lines described in How to delete all the entries from google datastore?
This solution deletes all entries in all namespaces.
By using ndb.metadata, no model classes are needed.
And by using ndb.delete_multi_async it will be able to handle a reasonably large datastore before hitting a request time limit.
from google.appengine.api import namespace_manager
from google.appengine.ext import ndb
...
def clearDb():
    # Walk every namespace, then every kind within it, deleting all entities by key.
    for namespace in ndb.metadata.get_namespaces():
        namespace_manager.set_namespace(namespace)
        for kind in ndb.metadata.get_kinds():
            keys = [k for k in ndb.Query(kind=kind).iter(keys_only=True)]
            ndb.delete_multi_async(keys)
The solution is a combination of the answers:
GAE, delete NDB namespace
https://stackoverflow.com/a/46802370/10612548
Refer to the latter for tips on how to improve it as time limits are hit and how to avoid instance explosion.
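One such improvement, sketched very roughly along the lines of that answer: fan the work out so that each namespace is wiped in its own deferred task and therefore gets its own request time budget. This assumes the 1st-generation Python 2.7 runtime with the deferred library enabled in app.yaml.

from google.appengine.api import namespace_manager
from google.appengine.ext import deferred, ndb

def clear_namespace(namespace):
    # Runs in its own task queue request, isolated from the other namespaces.
    namespace_manager.set_namespace(namespace)
    for kind in ndb.metadata.get_kinds():
        keys = ndb.Query(kind=kind).fetch(keys_only=True)
        ndb.delete_multi(keys)

def clear_db_fanout():
    # Enqueue one deletion task per namespace.
    for namespace in ndb.metadata.get_namespaces():
        deferred.defer(clear_namespace, namespace)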

Objectify vs Java or Python API

We are currently using Google Cloud Datastore and Objectify to return query results to the front end. I am currently doing performance comparisons between Datastore and Cloud Storage for returning lists of key values.
My question is whether using Objectify will perform better than the Java or Python low-level APIs, or whether they should be the same. If the performance is not better with Objectify then I can safely use the regular APIs for my performance tests.
Any help appreciated.
Thanks,
b/
This is a weird question. The performance of the Python and Java low-level APIs is wildly different because of the performance of the runtimes. Objectify is a thin object-mapping layer on top of the Java low-level API. In general it does not add significant computational cost to do this mapping, although it is possible to create structures and patterns that do (especially with lifecycle callbacks). The "worst" of it is that Objectify does some class introspection on your entities at boot, which might or might not be significant depending on how many entity classes you have.
If you are asking this question, you are almost certainly prematurely optimizing.
Objectify allows you to write code faster and makes it easier to maintain, at the expense of a very small, arguably negligible, performance penalty.
You can mix the low-level API with Objectify in the same application as necessary. If you ever notice a spot where the performance difference is significant (which is unlikely if you use Objectify correctly), you can always re-write that part in low-level API code.
Thanks for the responses. I am not currently trying to optimise the application as such, but trying to assess whether our data can be stored in Cloud Storage instead of Datastore without incurring a significant performance hit when retrieving the keys.
We constantly reload our data and thus incur a large ingestion cost with Datastore each time we do so. If we used Cloud Storage instead, this cost would be minimal.
This is an option which Google's architects have suggested so we are just doing some due diligence on it.
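For that kind of due diligence, a rough timing comparison can be scripted directly against both back ends. A sketch, assuming the google-cloud-storage and google-cloud-datastore client libraries; the bucket, blob, and kind names are placeholders:

import time
from google.cloud import datastore, storage

def time_gcs_read(bucket_name, blob_name):
    # e.g. a JSON/CSV dump of the key values kept as a single object
    start = time.perf_counter()
    data = storage.Client().bucket(bucket_name).blob(blob_name).download_as_bytes()
    return time.perf_counter() - start, len(data)

def time_datastore_keys(kind, limit=10000):
    start = time.perf_counter()
    query = datastore.Client().query(kind=kind)
    query.keys_only()
    keys = list(query.fetch(limit=limit))
    return time.perf_counter() - start, len(keys)

print('Cloud Storage:', time_gcs_read('my-bucket', 'key-values.json'))
print('Datastore:', time_datastore_keys('MyKind'))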

Is it possible to use my data store in Spring Cloud Dataflow (for example, Apache Ignite or another InMemory store) for Spring Cloud Stream?

I saw in the tests that Spring Cloud Data Flow stores the StreamDefinition in a HashMap. Is it possible to override the configuration of DataFlowServerConfiguration so that streams and tasks are stored in memory, for example in the same HashMap, and if so, how?
I don't think it would be a trivial change. The server needs a backend to store its metadata. By default it actually uses H2 in memory, and it relies on the Spring Data JPA abstraction to give users the chance to select their own RDBMS.
Storing this on a different storage engine would require not only replacing all the *Repository definitions in several configuration modules, but also reworking some pre-population of data that we do. It would become a bit hard to maintain over time.
Is there a reason why a traditional RDBMS is not suitable here? Or, if you want in-memory, why not just go with the ephemeral approach of H2?

Best UI interface/Language to query MarkLogic Data

We will be moving from Oracle to MarkLogic 8 as our datastore and will be using MarkLogic's Java API to talk to the data.
I am looking for a UI tool (like SQL Developer for Oracle) that can be used with MarkLogic. I found that MarkLogic's Query Manager can be used for accessing data, but I see multiple options with respect to language:
SQL
SPARQL
XQuery
JavaScript
We need to perform CRUD operations and search the data, and our testing team knows SQL (from Oracle), so I am confused about which route I should follow and on what basis I should decide which one or two are better to explore. We are most likely to use the JSON document type.
Any help/suggestions would be helpful.
You already mention you will be using the MarkLogic Java Client API; that should cover most of the common needs you will have, including search, CRUD, facets, lexicon values, and also custom extensions through REST extensions, since the Client API leverages the MarkLogic REST API. It saves you from having to code inside MarkLogic to a large extent.
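Because the Client API sits on top of the REST API, you can also get a quick feel for CRUD and search from any HTTP client. A sketch in Python using requests, assuming a default REST instance on port 8000 with digest authentication; the host, credentials, and document URIs are placeholders:

import requests
from requests.auth import HTTPDigestAuth

BASE = 'http://localhost:8000/v1'
AUTH = HTTPDigestAuth('admin', 'admin')

# Create or update a JSON document
requests.put(BASE + '/documents', params={'uri': '/orders/1.json'},
             json={'orderId': 1, 'status': 'open'}, auth=AUTH)

# Read it back
doc = requests.get(BASE + '/documents', params={'uri': '/orders/1.json'}, auth=AUTH).json()

# Simple string search across documents
hits = requests.get(BASE + '/search', params={'q': 'open', 'format': 'json'}, auth=AUTH).json()

# Delete the document
requests.delete(BASE + '/documents', params={'uri': '/orders/1.json'}, auth=AUTH)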
Apart from that, you can run ad hoc commands from the Query Console using any of the above-mentioned languages. SQL will require the presence of a so-called SQL view (see also your earlier question Using SQL in Query Manager in MarkLogic). SPARQL will require enabling the triple index and ingesting RDF data.
That leaves XQuery and JavaScript, which have pretty much identical expressive power and performance. If you are unfamiliar with XQuery and XML languages in general, JavaScript might be more appealing.
HTH!
