Is there a way to count the number of results of a freebase MQL query? - freebase

For example, I want to get the number of Freebase concepts that have a page in the English Wikipedia.
Or I want to count the concepts that have the type book/literature.

Yes, you can use "return":"count" to get the count of results for a query instead of the actual results of the query. For "hard" queries this may time out, but for simple queries like the number of books or instances of another type, it should work fine.
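To make the shape of such a query concrete, here is a minimal sketch that only builds the query object. The type id "/book/book" is an assumption used for illustration, and how the query is sent to the service is left out entirely.

```javascript
// Build an MQL query that asks for a count instead of the matching objects.
// The type id passed in is an assumption; substitute the id of the type you want to count.
function buildCountQuery(typeId) {
  return { type: typeId, return: "count" };
}

const countQuery = buildCountQuery("/book/book");
console.log(JSON.stringify(countQuery));
```

The only difference from an ordinary query is the "return" directive, so any constraints you would normally add go next to it in the same object.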

Firestore multiple range query

Let's say I have a collection of cars and I want to filter them by price range and by year range. I know that Firestore has strict limitations for performance reasons, so something like:
db.collection("products")
.where('price','>=', 70000)
.where('price','<=', 90000)
.where('year','>=', 2015)
.where('year','<=', 2018)
will throw an error:
Invalid query. All where filters with an inequality (<, <=, >, or >=) must be on the same field.
So is there any other way to perform this kind of query without filtering the data locally? Maybe some kind of indexing or tricky data organization?
The error message and documentation are quite explicit on this: a Firestore query can only perform range filtering on a single field. Since you're trying to filter ranges on both price and year, that is not possible in a single Firestore query.
There are two common ways around this:
Perform filtering on one field in the query, and on the other field in your client-side code.
Combine the values of the two ranges into a single field in some way that allows your use-case with a single field. This is incredibly non-trivial, and the only successful example of such a combination that I know of is using geohashes to filter on latitude and longitude.
Given the difference in effort between these two, I'd recommend picking the first option.
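A minimal sketch of the first option, assuming the price range was already applied in the query and the returned documents were converted to plain objects. The helper name filterByYear and the sample data are made up for illustration.

```javascript
// Hypothetical helper: `products` stands in for the documents returned by a
// query that already applied the price range on the server.
function filterByYear(products, minYear, maxYear) {
  return products.filter((p) => p.year >= minYear && p.year <= maxYear);
}

// Sample data standing in for the result of the price-range query.
const fetched = [
  { id: "a", price: 72000, year: 2014 },
  { id: "b", price: 80000, year: 2016 },
  { id: "c", price: 89000, year: 2019 },
  { id: "d", price: 75000, year: 2018 },
];

const matches = filterByYear(fetched, 2015, 2018);
console.log(matches.map((p) => p.id)); // [ 'b', 'd' ]
```

The cost is that every document in the price range is read, including those dropped by the year filter, so it works best when the server-side filter is the more selective of the two.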
A third option is to model your data differently, so that your use-case becomes easier to implement. The most direct implementation of this would be to put all products from 2015-2018 into a single collection. Then you could query that collection with db.collection("products-2015-2018").where('price','>=', 70000).where('price','<=', 90000).
A more general alternative would be to store the products in a collection for each year, and then perform 4 queries to get the results you're looking for: one for each of the collections products-2015, products-2016, products-2017, and products-2018.
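The per-year variant could be sketched roughly like this. The helpers yearCollections and mergeResults are hypothetical, and the actual Firestore calls are only indicated in a comment because they need a live database.

```javascript
// Hypothetical helper: list the per-year collection names for a year range.
function yearCollections(prefix, fromYear, toYear) {
  const names = [];
  for (let y = fromYear; y <= toYear; y++) {
    names.push(`${prefix}-${y}`);
  }
  return names;
}

// Hypothetical helper: flatten the per-collection result arrays into one list.
function mergeResults(resultSets) {
  return resultSets.flat();
}

const names = yearCollections("products", 2015, 2018);
console.log(names);

// With a real database, each name would get its own price-range query, e.g.
//   db.collection(name).where('price', '>=', 70000).where('price', '<=', 90000).get()
// and the resulting document arrays would be passed to mergeResults.
```

Since each product lives in exactly one year collection, the merged list needs no de-duplication.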
I recommend reading the documentation on compound queries and their limitations, and watching the video on Cloud Firestore queries.
You can't run range filters on multiple fields because of the documented query limitations, but at a small cost to the UI you can still get there by indexing the year as a category, like this:
db.collection("products")
.where('price','>=', 70000)
.where('price','<=', 90000)
.where('yearCategory','in', ['new', 'old'])
Of course, new and old go out of date, so you can instead group the years into yearCategory values like yr-2014-2017, yr-2017-2020, and so on. The in operator can only take 10 elements per query, which gives you an idea of how wide each year bucket needs to be.
You can write yearCategory at insert time or, for a value with a large range such as a number of likes, run a separate process that polls the data and updates the category.
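One way the bucketing could look, as a sketch only. The bucket width, the base year and both helper names are assumptions, and the buckets here do not overlap, unlike the yr-2014-2017 / yr-2017-2020 example.

```javascript
const BUCKET_SIZE = 4;  // assumption: each category covers four years
const BASE_YEAR = 2014; // assumption: a bucket boundary falls on this year
const MAX_IN_VALUES = 10; // limit on "in" values stated above

// Value to store in the yearCategory field when a product is written.
function yearCategory(year) {
  const start = BASE_YEAR + Math.floor((year - BASE_YEAR) / BUCKET_SIZE) * BUCKET_SIZE;
  return `yr-${start}-${start + BUCKET_SIZE - 1}`;
}

// Categories to pass to the "in" filter for a requested year range.
function categoriesForRange(fromYear, toYear) {
  const categories = [];
  for (let y = fromYear; y <= toYear; y++) {
    const c = yearCategory(y);
    if (!categories.includes(c)) categories.push(c);
  }
  if (categories.length > MAX_IN_VALUES) {
    throw new Error("year range spans more categories than one query allows");
  }
  return categories;
}

console.log(categoriesForRange(2015, 2018)); // [ 'yr-2014-2017', 'yr-2018-2021' ]
```

The buckets are coarser than the requested range, so the result can include years just outside it (2014 and 2019-2021 in this example), which still have to be trimmed on the client.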
In Flutter, you can do something like this:
// Apply one bound in the query itself.
final _queryList = await db.collection("products").where('price', isGreaterThanOrEqualTo: 70000).get();
// Apply the remaining bound locally on the returned documents.
final _docL1 = _queryList.docs.where((doc) => doc['price'] <= 90000);
Chain further local filters as needed. Firestore only accepts a limited set of filters in the query itself, so fetch the data with those and then filter the rest on the client according to your needs.

Combining multiple Firestore queries to get specific results (with pagination)

I am working on a small app that allows users to browse items based on various filters they select in the view.
After looking through the Firebase documentation, I realised that the sort of compound query I'm trying to create is not possible, since Firestore only supports a single "IN" operator per query. To get around this, the docs say to use multiple separate queries and then merge the results on the client side.
https://firebase.google.com/docs/firestore/query-data/queries#query_limitations
Cloud Firestore provides limited support for logical OR queries. The in, and array-contains-any operators support a logical OR of up to 10 equality (==) or array-contains conditions on a single field. For other cases, create a separate query for each OR condition and merge the query results in your app.
I can see how this would work normally, but what if I only wanted to show the user ten results per page? How would I implement pagination here, since I don't want to send lots of results back to the user each time?
My first thought would be to paginate each separate query and then merge them but then if I'm only getting a small sample back from the db I'm not sure how I would compare and merge them with the other queries on the client side.
Any help would be much appreciated since I'm hoping I don't have to move away from firestore and start over in an SQL db.
Say you want to show 10 results on a page. You will need to get 10 results for each of the subqueries, and then merge the results client-side. You will be overreading quite a bit of data, but that's unfortunately unavoidable in such an implementation.
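A rough sketch of the merge step, assuming every subquery was ordered by the same key and limited to the page size, and that results were converted to plain objects with an id. The function name mergePage and the sample data are made up.

```javascript
// Merge the pages returned by several subqueries into one page.
// Documents matching more than one subquery are kept once, by id.
function mergePage(resultSets, pageSize, compare) {
  const byId = new Map();
  for (const docs of resultSets) {
    for (const doc of docs) {
      if (!byId.has(doc.id)) byId.set(doc.id, doc);
    }
  }
  return [...byId.values()].sort(compare).slice(0, pageSize);
}

// Two subquery results, both ordered by name, with one overlapping document.
const colourMatches = [{ id: "1", name: "apple" }, { id: "3", name: "cherry" }];
const sizeMatches = [{ id: "2", name: "banana" }, { id: "3", name: "cherry" }];

const page = mergePage([colourMatches, sizeMatches], 2, (a, b) => a.name.localeCompare(b.name));
console.log(page.map((d) => d.id)); // [ '1', '2' ]
```

For the next page, one option is to restart each subquery after the sort key of the last document shown and merge again. That relies on the sort key being unique enough to act as a cursor, which is an assumption you would have to check against your data.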
The (preferred) alternative is usually to find a data model that allows you to implement the use-case with a single query. It is impossible to say generically how to do that, but it typically involves adding a field for the OR condition.
Say you want to get all results where either "fieldA" is "Red" or "fieldB" is "Blue". By adding a field "fieldA_is_Red_or_fieldB_is_Blue", you could then perform a single query on that field. This may seem horribly contrived in this example, but in many use-cases it is more reasonable and may be a good way to implement your OR use-case with a single query.
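As an illustration, the extra field could be computed whenever a document is written. The helper name withOrField is hypothetical; the field names follow the example above.

```javascript
// Return a copy of the item with the precomputed OR condition added.
function withOrField(item) {
  return {
    ...item,
    fieldA_is_Red_or_fieldB_is_Blue: item.fieldA === "Red" || item.fieldB === "Blue",
  };
}

console.log(withOrField({ fieldA: "Red", fieldB: "Green" }).fieldA_is_Red_or_fieldB_is_Blue);   // true
console.log(withOrField({ fieldA: "Green", fieldB: "Green" }).fieldA_is_Red_or_fieldB_is_Blue); // false
```

A single equality filter on that field then replaces the two separate queries, and ordinary pagination applies. The trade-off is that the field has to be recomputed every time fieldA or fieldB changes.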
You could just create a complex where
Take a look at the where property in https://www.npmjs.com/package/firebase-firestore-helper
Disclaimer: I am the creator of this library. It helps to manipulate objects in Firebase Firestore (and adds Cache)
Enjoy!

AWS Neptune Gremlin paginate on hashed edge ID

I have a very large data set, close to 500 million edges, in which almost all edges need to be traversed. I'm trying to parallelize these traversals by paginating on the edge IDs, which are MD5 hashes. I tried queries like the following:
g.E().hasLabel('foo').has(id, TextP.startingWith('AAA')) for page 1
g.E().hasLabel('foo').has(id, TextP.startingWith('AAB')) for page 2
But each query seems to be doing a full scan and not just a subset. How do you recommend I go about pagination?
I suggest that you run the profile step on your queries to see how much is actually being traversed.
Using the startingWith predicate on id doesn't seem like an optimized solution to me, since it probably uses a hash index rather than a range index.
I would try to prefix on another string property, or even add a random [1..n] 'replica' property and filter using .has('replica', i) to get the best performance, especially on such a large graph.
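To illustrate the replica idea, here is a sketch that picks a replica number for a new edge and generates one query string per replica. Both helper names are hypothetical, and the strings are only built, not executed against a graph.

```javascript
// Pick a random replica number in [1..n] to store on an edge when it is created.
function assignReplica(n) {
  return 1 + Math.floor(Math.random() * n);
}

// Build one traversal per replica so that each worker gets its own slice.
function replicaQueries(label, n) {
  const queries = [];
  for (let i = 1; i <= n; i++) {
    queries.push(`g.E().hasLabel('${label}').has('replica', ${i})`);
  }
  return queries;
}

console.log(replicaQueries("foo", 3).join("\n"));
```

With a uniform random assignment each slice holds roughly 1/n of the edges, so n can be chosen to match the number of parallel workers.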

Is there a way to get the list of indexed words from the MarkLogic universal index?

I am working with MarkLogic.
I have a database of around 27000 documents.
What I want to do is retrieve the keywords that occur most frequently in the documents returned by any search query.
I am currently using XQuery functions to count the frequency of each word in the set of all documents retrieved as the query result. However, this is quite inefficient.
I was thinking that it would help if I could get the list of words that MarkLogic has indexed.
So is there a way to retrieve the list of indexed words from the universal index of MarkLogic?
Normally you would use something like this in MarkLogic:
(
for $v in cts:element-values(xs:QName("myelem"))
let $f := cts:frequency($v)
order by $f descending
return $v
)[1 to 10]
This kind of functionality is built into the search:search library, which is very convenient to use.
Unfortunately, you cannot use this approach on values from cts:words and similar functions. There is a little trick that could get you close, though: instead of using cts:frequency, you could use xdmp:estimate on a cts:search to get a fragment count:
(
for $v in cts:words()
let $f := xdmp:estimate(cts:search(collection(), $v))
order by $f descending
return $v
)[1 to 10]
This is slower than the cts:frequency approach, but still much faster than naively running through all documents.
HTH!
What if your search contains multiple terms? How will you calculate the order?
What if some of your terms are very common in your corpus of documents, and others are very rare? Should the count of "the" contribute more to the score than "protease", or should they contribute the same?
If the words occur in the title vs elsewhere in the document, should that matter?
What if one document is relatively short, and another is quite long. How do you account for that?
These are some of the basic questions that come up when trying to determine relevancy. Most search engines use a combination of term frequency (how often do the terms occur in your documents), and document frequency (how many documents contain the terms). They can also use the location of the terms in your documents to determine a score, and they can also account for document length in determining a score.
MarkLogic uses a combination of term frequency and document frequency to determine relevance by default. These factors (and others) are used to determine a relevance score for your search criteria, and this score is the default sorting for results returned by search:search from the search API or the low-level cts:search and its supporting operators.
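To make the interplay of term frequency and document frequency concrete, here is a textbook-style log tf-idf sketch. It is not MarkLogic's actual scoring formula, only an illustration of the general idea, and the function names and sample documents are made up.

```javascript
// Split text into lowercase word tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Score each document against the (lowercase) search terms.
// tf: occurrences of the term in the document.
// df: number of documents that contain the term.
function tfidfScores(docs, terms) {
  const tokenized = docs.map(tokenize);
  const n = docs.length;
  return tokenized.map((tokens) => {
    let score = 0;
    for (const term of terms) {
      const tf = tokens.filter((t) => t === term).length;
      if (tf === 0) continue;
      const df = tokenized.filter((ts) => ts.includes(term)).length;
      // Repeated occurrences help, but with diminishing returns;
      // terms found in fewer documents weigh more.
      score += (1 + Math.log(tf)) * Math.log(1 + n / df);
    }
    return score;
  });
}

const corpus = [
  "the protease cleaves the protein",
  "the cat sat on the mat",
  "the end",
];
const scores = tfidfScores(corpus, ["the", "protease"]);
console.log(scores);
```

In this small corpus the first document scores highest because it is the only one containing the rare term "protease", while "the", which occurs in every document, contributes comparatively little.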
You can look at the details of the options for cts:search to learn about some of the different scoring options. See 'score-logtfidf' and others here:
http://community.marklogic.com/pubs/5.0/apidocs/SearchBuiltins.html#cts:search
I would also look at the search developers guide:
http://community.marklogic.com/pubs/5.0/books/search-dev-guide.pdf
Many of the concepts are under consideration by the XQuery working group as enhancements for a future version of XQuery. They aren't part of the language today. MarkLogic has been at the forefront of search for a number of years, so you'll find there are many features in the product, and a lot of discussion related to this area in the archives.
"Is there a way to retrieve the list of indexed words from the universal index of marklogic?" No. The universal index is a hash index, so it contains hashes not words.
As noted by others you can create value-based lexicons that can list their contents. Some of these also include frequency information. However, I have another suggestion: cts:distinctive-terms() will identify the most distinctive terms from a sequence of nodes, which could be the current page of search results. You can control whether the output terms are just words, or include more complex terms such as element-word or phrase. See the docs for more details.
http://docs.marklogic.com/5.0doc/docapp.xqy#display.xqy?fname=http://pubs/5.0doc/apidoc/SearchBuiltins.xml&category=SearchBuiltins&function=cts:distinctive-terms
I have used cts:distinctive-terms(). In my case it gives mostly wildcarded terms, which are not of much use. Further, it is suitable for finding distinctive terms in a single document. When I try to run it on many documents, it is quite slow.
What I want to implement is a dynamic facet that is populated with the keywords of the documents that come up in the search result. I have implemented it, but it is inefficient because it counts the frequency of all the words in the documents. I want it to be a suggestion or recommendation feature: if you have searched for this particular term or phrase, then you may be interested in these suggested terms or phrases. So I want an efficient method to find the terms that are common in the result set of documents of a search.
I tried cts:words() as suggested. It gives words similar to the search query word and the number of documents in which each is contained. What it does not take into account is the set of search result documents. It just shows the number of documents that contain similar words in the whole database, irrespective of whether these documents are present in the search result or not.