Firebase performance - fetching nodes

Two firebase performance questions:
The docs present flat data as a best practice when structuring your database. However, if I want to retrieve a few nodes of data together (a JOIN query in SQL terms), that means several network requests. Does Firebase optimize this use case (on the server or client side)? How?
When fetching a specific node by its full path, is there any need to index it? (The docs only discuss actual queries, and I'm not sure this case counts as a query.)
Thanks

Doing a "client-side join" in Firebase is not nearly as expensive as you might expect. See this answer: Speed up fetching posts for my social network app by using query instead of observing a single event repeatedly
If you directly access the node (only calling new Firebase() and child()), no query is involved, so you won't need an index. If you're calling orderByChild() or orderByValue(), you should add an index.
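To illustrate the "client-side join", here is a minimal sketch in Python. An in-memory dict stands in for the Realtime Database, and fetch_node is a hypothetical helper standing in for a direct-path read; in a real app the SDK pipelines these reads over a single connection, which is why the per-key fetches are cheaper than they look.

```python
# Client-side "join": fetch a user's post IDs, then fetch each post by key.
# The database is simulated with an in-memory dict; in a real app each
# fetch_node call would be a Firebase read over one pipelined connection.

DB = {
    "user_posts/alice": {"p1": True, "p3": True},
    "posts/p1": {"title": "Hello"},
    "posts/p3": {"title": "World"},
}

def fetch_node(path):
    """Stand-in for a direct-path read such as ref.child(path).get()."""
    return DB.get(path)

def fetch_user_posts(user):
    post_ids = fetch_node(f"user_posts/{user}") or {}
    # One read per key; over a real connection these round trips overlap.
    return [fetch_node(f"posts/{pid}") for pid in sorted(post_ids)]

print(fetch_user_posts("alice"))
# → [{'title': 'Hello'}, {'title': 'World'}]
```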

Related

Magnolia Headless Delivery API and personalized pages with filter query

We are using the personalization module to setup page variants (page-level) using a headless approach (JS Frontend).
Reading the docs, I understood that there is either a Query nodes or a Get children scenario. It looks like page variants are only handled when not using the Query nodes case; unfortunately, I can neither order nor filter the results in that case.
Is there any chance to use filter and orderBy params but also returning page variants based on my request traits? How would such a request look like?
For performance reasons, variant filtering is not supported on queries. Hence, short of writing your own endpoint, there is no built-in solution.
As an alternative/workaround, you can run the query and then, for the path of each result, make a call to the individual-node-retrieval endpoint to fetch that result's variant; but that is slow and wastes bandwidth. Fetching the list of nodes you want via the GQL endpoint and then getting the variant for each is slightly better, but not by much.
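The workaround above can be sketched as follows. This is only an illustration of the N+1 call pattern, not Magnolia's actual API: both endpoints are simulated with in-memory data, and query_nodes / get_node_variant are hypothetical wrappers around the respective Delivery API calls.

```python
# Workaround sketch: run the (unpersonalized) query first, then resolve a
# variant per result via the single-node endpoint. Both "endpoints" are
# simulated here; the wrapper names are made up for illustration.

QUERY_RESULTS = [{"@path": "/home/a"}, {"@path": "/home/b"}]
VARIANTS = {
    "/home/a": {"@path": "/home/a", "variant": "campaign"},
    "/home/b": {"@path": "/home/b", "variant": "default"},
}

def query_nodes():
    # Stand-in for the Query-nodes endpoint (no variant resolution).
    return list(QUERY_RESULTS)

def get_node_variant(path, traits=None):
    # Stand-in for the single-node endpoint, which does resolve variants.
    return VARIANTS[path]

def personalized_query(traits):
    # N+1 calls: one query plus one variant fetch per result -
    # this is the bandwidth cost the answer warns about.
    return [get_node_variant(r["@path"], traits) for r in query_nodes()]

print([r["variant"] for r in personalized_query({"segment": "returning"})])
# → ['campaign', 'default']
```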

Combining multiple Firestore queries to get specific results (with pagination)

I am working on small app the allows users to browse items based on various filters they select in the view.
After looking through the Firebase documentation, I realised that the sort of compound query I'm trying to create is not possible, since Firestore only supports a single "IN" operator per query. To get around this, the docs say to use multiple separate queries and then merge the results on the client side.
https://firebase.google.com/docs/firestore/query-data/queries#query_limitations
Cloud Firestore provides limited support for logical OR queries. The in and array-contains-any operators support a logical OR of up to 10 equality (==) or array-contains conditions on a single field. For other cases, create a separate query for each OR condition and merge the query results in your app.
I can see how this would work normally, but what if I only want to show the user ten results per page? How would I implement pagination, since I don't want to send lots of results back to the user each time?
My first thought was to paginate each separate query and then merge them, but if I'm only getting a small sample back from the db, I'm not sure how I would compare and merge it with the other queries' results on the client side.
Any help would be much appreciated since I'm hoping I don't have to move away from firestore and start over in an SQL db.
Say you want to show 10 results on a page. You will need to get 10 results for each of the subqueries, and then merge the results client-side. You will be overreading quite a bit of data, but that's unfortunately unavoidable in such an implementation.
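The merge-and-paginate approach can be sketched in Python. This is an illustrative in-memory version, assuming all subqueries are ordered by the same field: fetch page_size documents per OR branch, dedupe by id, sort, keep the top page_size, and use the last kept sort value as the cursor that each subquery restarts "after" on the next page.

```python
# Merge-and-paginate sketch: overread page_size docs from each subquery,
# dedupe by document id, sort on the shared order field, keep one page.
# The returned cursor is the sort value of the last kept doc; each
# subquery's next fetch would start after that cursor.

def merge_page(subquery_results, page_size, order_key):
    seen, merged = set(), []
    for results in subquery_results:        # one list per OR branch
        for doc in results[:page_size]:     # overread: page_size per branch
            if doc["id"] not in seen:
                seen.add(doc["id"])
                merged.append(doc)
    merged.sort(key=lambda d: d[order_key])
    page = merged[:page_size]
    cursor = page[-1][order_key] if page else None
    return page, cursor

red = [{"id": "a", "price": 5}, {"id": "c", "price": 9}]
blue = [{"id": "b", "price": 7}, {"id": "a", "price": 5}]  # "a" is a dup
page, cursor = merge_page([red, blue], page_size=2, order_key="price")
print([d["id"] for d in page], cursor)
# → ['a', 'b'] 7
```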
The (preferred) alternative is usually to find a data model that allows you to implement the use-case with a single query. It is impossible to say generically how to do that, but it typically involves adding a field for the OR condition.
Say you want to get all results where either "fieldA" is "Red" or "fieldB" is "Blue". By adding a field "fieldA_is_Red_or_fieldB_is_Blue", you could then perform a single query on that field. This may seem horribly contrived in this example, but in many use-cases it is more reasonable and may be a good way to implement your OR use-case with a single query.
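The denormalized OR-field idea can be sketched like this. The field and document names are made up for illustration; the point is that the flag is maintained at write time, so the OR condition collapses into a single equality filter at query time.

```python
# Denormalized OR-field sketch: compute a boolean field when writing each
# document, so that "fieldA == Red OR fieldB == Blue" becomes a single
# equality query on that field.

def with_or_flag(doc):
    doc = dict(doc)
    doc["fieldA_is_Red_or_fieldB_is_Blue"] = (
        doc.get("fieldA") == "Red" or doc.get("fieldB") == "Blue")
    return doc

docs = [with_or_flag(d) for d in (
    {"id": 1, "fieldA": "Red",   "fieldB": "Green"},
    {"id": 2, "fieldA": "Green", "fieldB": "Blue"},
    {"id": 3, "fieldA": "Green", "fieldB": "Green"},
)]

# Equivalent of .where("fieldA_is_Red_or_fieldB_is_Blue", "==", True):
hits = [d["id"] for d in docs if d["fieldA_is_Red_or_fieldB_is_Blue"]]
print(hits)
# → [1, 2]
```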
You could just create a complex where condition.
Take a look at the where property in https://www.npmjs.com/package/firebase-firestore-helper
Disclaimer: I am the creator of this library. It helps manipulate objects in Firebase Firestore (and adds caching).
Enjoy!

Lookup the existence of a large number of keys (up to 1M) in datastore

We have a table with 100M rows in google cloud datastore. What is the most efficient way to look up the existence of a large number of keys (500K-1M)?
For context, a use case could be that we have a big content datastore (think of all the webpages in a domain) containing pre-crawled content and metadata for each document. Each document, however, could be liked by many users. Now, when a new user says he/she likes documents {a1, a2, ..., an}, we want to tell whether all these documents ak (k in 1..n) are already crawled. That's the reason we want to do the lookup mentioned above. If there is a subset of documents that we don't have yet, we would start to crawl them immediately. Yes, the ultimate goal is to retrieve all these documents' content and use it to build the user profile.
My current thought is to issue a bunch of batch lookup requests. Each lookup request can contain up to 1K keys [1]. However, to check the existence of every key in a set of 1M, I would still need to issue 1000 requests.
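The batching described above can be sketched as follows. lookup_batch is a hypothetical stand-in for a batched datastore read (e.g. a get_multi-style call); here the datastore is simulated with an in-memory set.

```python
# Chunked batch-lookup sketch: split the key set into 1000-key batches
# (the per-request limit) and issue one existence lookup per batch,
# collecting the keys that are not yet in the datastore.

EXISTING = {f"doc{i}" for i in range(0, 2500, 2)}  # fake datastore contents

def lookup_batch(keys):
    """Stand-in for one batched datastore lookup request."""
    return {k for k in keys if k in EXISTING}

def find_missing(keys, batch_size=1000):
    missing = []
    for i in range(0, len(keys), batch_size):
        batch = keys[i:i + batch_size]
        found = lookup_batch(batch)
        missing.extend(k for k in batch if k not in found)
    return missing

keys = [f"doc{i}" for i in range(2500)]
missing = find_missing(keys)       # 3 simulated requests for 2500 keys
print(len(missing))
# → 1250
```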
An alternative is to use a customized middle layer to provide quick lookups (for example, using a Bloom filter or something similar) to save the round trips of multiple requests. Assuming we never delete keys, every time we insert a key we add it through the middle layer, and the Bloom filter keeps track of which keys we have (with a tolerable false-positive rate). Since this is a custom layer, we could provide it as a microservice without a request limit; say, one able to answer a single request asking for the existence of 1M keys. However, this definitely increases our design/implementation complexity.
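A minimal Bloom filter for such a middle layer might look like the following sketch. It guarantees no false negatives (a key that was added is always reported as present), with a false-positive rate tuned by the bit-array size and hash count; the sizes here are arbitrary.

```python
# Minimal Bloom-filter sketch: k hash positions per key, one shared bit
# array. might_contain never returns False for an added key; it may
# rarely return True for a key that was never added (false positive).
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=7):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Derive k positions by salting the key; sha256 is overkill but simple.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = BloomFilter()
for url in ("https://a.example/1", "https://a.example/2"):
    bf.add(url)
print(bf.might_contain("https://a.example/1"))   # True
print(bf.might_contain("https://b.example/9"))   # almost certainly False
```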
Is there any more efficient way to do this? Maybe a better design? Thanks!
[1] https://cloud.google.com/datastore/docs/concepts/limits
I'd suggest breaking down the problem in a more scalable (and less costly) approach.
In the use case you mentioned you can deal with one document at a time, each document having a corresponding entity in the datastore.
The webpage URL uniquely identifies the page, so you can use it to generate a unique key/identifier for the respective entity. With a single key lookup (strongly consistent) you can then determine whether the entity exists, i.e. whether the webpage has already been considered for crawling. If it hasn't, a new entity is created and a crawling job is launched for it.
The length of the entity key can be an issue; see How long (max characters) can a datastore entity key_name be? Is it bad to have very long key_names?. To avoid it you can store the URL as a property of the webpage entity. You'll then have to query for the entity by the url property to determine whether the webpage has already been considered for crawling. This is only eventually consistent, meaning that it may take a while from when the document entity is created (and its crawling job launched) until it appears in the query results. Not a big deal; it can be addressed with a bit of logic in the crawling job to prevent and/or remove duplicate documents.
I'd keep the "like" information as small entities mapping a document to a user, separated from the document and from the user entities, to prevent the drawbacks of maintaining possibly very long lists in a single entity, see Manage nested list of entities within entities in Google Cloud Datastore and Creating your own activity logging in GAE/P.
When a user likes a webpage with a particular URL you just have to check if the matching document entity exists:
- if it does, just create the like mapping entity
- if it doesn't, and you used the above-mentioned unique key identifiers:
  - create the document entity and launch its crawling job
  - create the like mapping entity
- otherwise:
  - launch the crawling job, which creates the document entity and takes care of deduplication
  - launch a delayed job to create the mapping entity later, when the (unique) document entity becomes available (possibly chained off the crawling job; some retry logic may be needed)
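The unique-key branch of the logic above can be sketched like this. The datastore and job queue are simulated with in-memory structures, and the function names are made up for illustration; the key point is the single strongly-consistent lookup per like, instead of massive batch lookups.

```python
# Decision-logic sketch for one "like" event, using URL-derived keys.
# Datastore is simulated with a dict, like mappings with a set, and
# crawl jobs are just collected in a list.
import hashlib

documents, likes, crawl_jobs = {}, set(), []

def doc_key(url):
    # Hash the URL to sidestep the entity key-length limit.
    return hashlib.sha1(url.encode()).hexdigest()

def handle_like(user_id, url):
    key = doc_key(url)
    if key not in documents:                # single key lookup
        documents[key] = {"url": url, "crawled": False}
        crawl_jobs.append(key)              # launch crawling job
    likes.add((user_id, key))               # small like-mapping entity

handle_like("u1", "https://example.com/a")
handle_like("u2", "https://example.com/a")  # no second crawl job
print(len(documents), len(likes), len(crawl_jobs))
# → 1 2 1
```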
Checking if a user liked a particular document becomes a simple query for one such mapping entity (with a bit of care as it's also eventually consistent).
With such a scheme in place you no longer have to make those massive lookups; you only do one at a time, which is fine. A user liking documents one at a time is IMHO more natural than providing a large list of liked documents.

Firebase web - transaction on query

Can I run a transaction on a query referring to multiple locations?
In the docs I see that, for example, startAt returns a firebase.database.Query, which has a ref property of type firebase.database.Reference, which has the transaction method.
So can I do:
ref.startAt(ver).ref.transaction(transactionUpdate).then(... ?
Would the transaction then operate on multiple locations and update them correctly ?
What I'm trying to do is to get all locations since a particular version (key) and then mark them as 'read' so that a writing client will not update them. For that I need a transaction rather than a simple update.
Thx!
The answer is "no" to all questions.
The ref property of a Query gives you the reference of the node on which you set up the query. Consider how you built the query in the first place. In other words, ref.startAt(x).ref is equivalent to ref.
Manipulating a reference (navigating to children, adding query options, etc.) is completely independent of any query results. It's just local, trivial path manipulation, very similar to formatting a URL.
Transactions can only operate on a single node, by definition, using that node's value snapshots for incremental updates. They cannot "operate on multiple locations and update them correctly". These are not SQL transactions; the only thing they have in common is the name, which can unfortunately be confusing.
The starting node doesn't have to be a leaf node. But if you start a transaction on a "parent" node, the client will have to download every child to create a whole snapshot, potentially multiple times if any of them is modified by another client.
This is most certainly a very slow, fragile and expensive operation, both for the user and you, the owner of the database. In general, it's not recommended to run transactions if the node might grow unbounded.
I suggest revising the presented strategy. Updating "all children" just to store a "read" marker simply does not scale.
You could for example store the last read ID of the client in a single node, and write security rules to enforce that no data with an ID less than this may be modified.
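The last-read-marker idea can be sketched in plain Python. This only models the invariant a security rule would enforce server-side (the rule syntax itself is not shown): the reader stores a single cursor, and writes at or below it are rejected.

```python
# Last-read-marker sketch: instead of flagging every child as "read",
# store one cursor per reader and reject writes at or below it. This
# mimics the check a security rule would perform on each write.

last_read = {}   # reader id -> highest key already consumed

def mark_read_up_to(reader, key):
    last_read[reader] = max(last_read.get(reader, ""), key)

def can_write(reader, key):
    # A rule would compare the incoming key against the stored cursor;
    # push-style keys sort lexicographically by creation time.
    return key > last_read.get(reader, "")

mark_read_up_to("clientA", "-Nk0005")
print(can_write("clientA", "-Nk0003"))   # False: already consumed
print(can_write("clientA", "-Nk0009"))   # True: newer than the cursor
```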

Can I use ZCatalog's Query Plan to Optimise Catalog Queries?

I'm wondering if I can make use of the information provided by the Query Report and Query Plan tabs on the portal catalog. Can I optimize ZCatalog queries based on the query report? How does ZCatalog's Query Plan differ from a query plan in an SQL database?
The query plan information is used to improve catalog performance, but you cannot optimize your own queries based on plan information.
The catalog only builds up that information as needed, based on your index sizes; unlike a SQL database the catalog does not plan each query based on such information but rather looks up pre-calculated plans from the structure reflected in the Query Plan tab.
The query report tab does give you information about which indexes are performing poorly for your code. You may want to rethink code that uses those combinations of indexes, and/or look into why those indexes performed poorly; perhaps your query didn't narrow the result quickly enough, or the slow index is very large, indicating that your ZODB cache may be too small to hold it or that other results keep pushing it out.
On the whole, for large applications it is a good idea to retain the query plan; in one project we dump cache information before stopping instances and reload that after starting again, and that includes the catalog query plan:
plan = site.portal_catalog.getCatalogPlan()
with open(PLAN_PATH, 'w') as out:
    out.write(plan)

and on load:

if os.path.exists(PLAN_PATH):
    from Products.ZCatalog.plan import PriorityMap
    try:
        PriorityMap.load_from_path(PLAN_PATH)
    except Exception:
        pass
