Is there a way to write queries for questions like
list of movies which are produced and directed by the same person?
In SPARQL or SQL this is easy, but is it possible to write this as a single query in MQL?
In general, can MQL be used for queries that require trace variables and conditional statements?
Update: A lengthier discussion on this topic at https://groups.google.com/forum/#!topic/freebase-discuss/EfB04zznvco
No, this is not possible in MQL. Those sorts of queries often take longer to execute and would time out in our web API.
Here's about as close as you can get in MQL:
[{
  "id": null,
  "type1:type": "/film/director",
  "type2:type": "/film/producer",
  "name": null,
  "/film/director/film": [{}],
  "/film/producer/film": [{}]
}]
Then you just need to find the intersection of the films that they've directed and the films they've produced.
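Assuming the API returns results shaped like that query, the client-side intersection is a one-liner over sets (the person and film names below are purely illustrative):

```python
# Illustrative response for one result of the MQL query above.
person = {
    "id": "/en/some_filmmaker",
    "name": "Some Filmmaker",
    "/film/director/film": [{"name": "Film A"}, {"name": "Film B"}],
    "/film/producer/film": [{"name": "Film B"}, {"name": "Film C"}],
}

directed = {f["name"] for f in person["/film/director/film"]}
produced = {f["name"] for f in person["/film/producer/film"]}

# Films this person both directed and produced.
both = directed & produced
print(sorted(both))  # ['Film B']
```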
Usually, anything that resembles a recommendation system is better off being run offline using the Freebase data dumps.
Related
I saw a similar question here: Is there a workaround for the Firebase Query "IN" Limit to 10?
The point is that with the "in" query the results are a union, but with "not-in" I need the intersection, and instead it gives me all the documents. Does anyone know how to do this?
As @samthecodingman mentioned, it's hard to provide specific advice without examples or code, but I've had to deal with this a few times, and there are a few generalized strategies you can take:
Restructure your data - You can use up to 100 equality operators, so one possible approach is to store your filters/tags as a map, for example:
{
  id: 1234567890,
  ...
  filters: {
    filter1: true,
    filter2: true,
    filter3: true
  }
}
If a doc doesn't have a particular tag, you could simply omit it, or you could set it to false, depending on your use case.
Note, however, that you may need to create composite indexes if you want to combine equality operators with inequality operators (see the docs). If you have too many filters, this will get unwieldy quickly.
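SDK calls aside, the matching semantics of that map layout can be sketched in plain Python (the documents below are made up; in Firestore the equivalent server-side filter would be equality checks on the map's keys):

```python
docs = [
    {"id": 1, "filters": {"filter1": True, "filter2": True}},
    {"id": 2, "filters": {"filter1": True}},
    {"id": 3, "filters": {"filter2": True, "filter3": True}},
]

def matches(doc, required):
    """True if the doc carries every required filter (equality on map keys)."""
    return all(doc["filters"].get(f) is True for f in required)

hits = [d["id"] for d in docs if matches(d, ["filter1", "filter2"])]
print(hits)  # [1]
```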
Query everything and cache locally - As you mentioned, fetching all the data repeatedly can get expensive. But if it doesn't change too often or it isn't critical to get the changes in real time, you can cache it locally and refresh at some interval (hourly or daily, for example).
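A minimal sketch of that cache-and-refresh pattern, with fetch_all_docs standing in for the real Firestore read:

```python
import time

CACHE_TTL_SECONDS = 3600  # refresh hourly

_cache = {"data": None, "fetched_at": 0.0}

def fetch_all_docs():
    # Stand-in for the real query that fetches everything from Firestore.
    return [{"id": 1}, {"id": 2}]

def get_docs():
    """Return cached results, refetching only when the TTL has expired."""
    now = time.time()
    if _cache["data"] is None or now - _cache["fetched_at"] > CACHE_TTL_SECONDS:
        _cache["data"] = fetch_all_docs()
        _cache["fetched_at"] = now
    return _cache["data"]
```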
Implement Full-Text Search - If neither of the previous options will work for you, you can always implement full-text search using one of the services Firebase recommends like Elastic. These are typically far more efficient for use-cases with a high number of tags/filters, but obviously there's an upfront time cost for setup and potentially an ongoing monetary cost if your usage is higher than the free tiers these services offer.
I'm using Azure CosmosDB with the SQL api and I am trying to create, in my frontend, a graph that represents, in a month, all the documents that have been uploaded each specific day. The graph should be at most a month long. Below I have attached a screenshot of a mock of my idea. After some discussion in the comments I will add the data schema too.
Example of the data message (partition key is /message/deviceId)
{
  "message": {
    "deviceId": "device01",
    "timestamp": "2018-07-25T08:47:16,094",
    "payload": "6c,65,33"
  },
  "id": "ff670801-de08-422c-be0a-fa67e6324bb8",
  "_rid": "75klAPTTTHADAAAAAAAAAA==",
  "_self": "dbs/75klAA==/colls/75klAPTTTHA=/docs/75klAPTTTHADAAAAAAAAAA==/",
  "_etag": "\"0000bc1d-0000-0000-0000-5c112e5a0000\"",
  "_attachments": "attachments/",
  "_ts": 1544629850
}
Now my question is: what is the best way to get this kind of data? I usually go for the quick and easy Functions approach, but I don't think it would work here, since I would need to fetch nearly a full month's worth of data just to count how many uploads happened; that would cost a lot of time and money.
Is there an alternative way of gathering this sort of data? Would you recommend another approach? If so, which one? I would prefer not to add any more services, since I am already working on a relatively large project and am still familiarizing myself with all these services.
EDIT: Would it be a bad idea to create some sort of document that keeps all the information about the current month, like an array of days? The query would then only run for the days that are not yet in the array.
Thanks a lot in advance for the help!
I'm from the CosmosDB engineering team. From your question, I understand that you need counts of documents updated per day in the last month.
You could do this in two ways:
Issue a COUNT() query with a _ts filter for each day that you're interested in. This is currently sub-optimal - we are working on serving aggregates much more efficiently, and on GROUP BY support as well, but we don't have a fixed date for these features yet. If the number of documents is small enough and your collection does not have a heavy workload, you could still stick with this option.
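As an illustration, a per-day count might look like the following (the _ts bounds are hypothetical Unix epoch seconds covering 2018-12-12 UTC, the day of the example document's _ts):

```sql
SELECT VALUE COUNT(1)
FROM c
WHERE c._ts >= 1544572800 AND c._ts < 1544659200
```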
You could set up a change feed pipeline from your source collection, capture all the changes, and update a separate metadata document that tracks the number of updates per day. Here's a link to working with the change feed processor: https://learn.microsoft.com/en-us/azure/cosmos-db/change-feed
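The aggregation side of such a pipeline can be sketched in plain Python; the documents below stand in for what the change feed processor would hand you, and only _ts matters here:

```python
from collections import Counter
from datetime import datetime, timezone

def day_of(ts):
    """Map a Cosmos DB _ts (Unix epoch seconds) to a YYYY-MM-DD bucket (UTC)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")

# Documents as they might arrive from the change feed (made-up timestamps).
changes = [{"_ts": 1544629850}, {"_ts": 1544630000}, {"_ts": 1544716250}]

# Per-day counts, ready to be written into a metadata document.
counts = Counter(day_of(doc["_ts"]) for doc in changes)
print(dict(counts))  # {'2018-12-12': 2, '2018-12-13': 1}
```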
I want to store data of the following form in azure cosmos db:
{
  "id": "guid",
  "name": "a name",
  "tenantId": "guid",
  "filter1": true,
  "filter2": false,
  "hierarchicalData": {}
}
Each document will be up to a few megabytes in size.
I need to be able to return a {id, name} list (100 < count < 10k, per tenant) for a given search by {tenantId,filter1,filter2}.
From the documentation, I see I can do an SQL query with a projection, but am not sure if there is a better way.
Is there an ideal way to do the above while making efficient use of RUs?
Is there an ideal way to do the above while making efficient use of
RUs?
It's hard to say that there is one best way to make efficient use of RUs and improve query performance.
For your situation, you could certainly use a SQL query with the specific filters. Here are several ways to improve your query performance:
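For instance, a projected query over your document shape might look like this (the literal values are placeholders):

```sql
SELECT c.id, c.name
FROM c
WHERE c.tenantId = "some-tenant-guid"
  AND c.filter1 = true
  AND c.filter2 = false
```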
1. Add a partition key.
If your data is partitioned, then when you provide the partition key in your SQL query, only the matching partition is scanned, which saves RUs. Please refer to the document.
2. Use a recent SDK.
The Azure Cosmos DB SDKs are constantly being improved to provide the best performance. See the Azure Cosmos DB SDK pages to determine the most recent SDK and review improvements.
3. Exclude unused paths from indexing for faster writes.
Cosmos DB's indexing policy also allows you to specify which document paths to include or exclude from indexing by leveraging Indexing Paths (IndexingPolicy.IncludedPaths and IndexingPolicy.ExcludedPaths). The use of indexing paths can offer improved write performance and lower index storage for scenarios in which the query patterns are known beforehand.
4. Use continuation tokens if the result set is large.
Paging through the data with continuation tokens improves query performance. Doc:
https://www.kevinkuszyk.com/2016/08/19/paging-through-query-results-in-azure-documentdb/
For more details, please refer to here.
I have a single collection in Cosmos DB where documents are separated in two types. Let's call them board and pin.
Board:
{
  "id": "board-1",
  "description": "A collection of nice pins",
  "author": "user-a",
  "moments": [
    { "id": "pin-1" },
    { "id": "pin-2" },
    { "id": "pin-3" }
  ]
}
Pin:
{
  "id": "pin-1",
  "description": "Number 1 is the best pin",
  "author": "user-b"
}
I know how to query a single board or pin based on its id. But I need a query that, given the id of a board, gives me all the pins contained in that board. It would also be good if I could filter out one or more fields of the pins.
Example: not returning the author to the client.
{
  "id": "pin-1",
  "description": "Number 1 is the best pin"
},
{
  "id": "pin-2",
  "description": "Number 2 is very funny"
}, etc.
I know I could handle this logic in the client app by making two requests, but is it possible to write a query for Cosmos DB that handles this?
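For reference, the two-request approach amounts to the following (plain Python over hypothetical in-memory stand-ins; a real implementation would make two DocumentDB calls):

```python
# Hypothetical in-memory stand-ins for the two DocumentDB lookups.
BOARDS = {
    "board-1": {"id": "board-1", "moments": [{"id": "pin-1"}, {"id": "pin-2"}]},
}
PINS = {
    "pin-1": {"id": "pin-1", "description": "Number 1 is the best pin", "author": "user-b"},
    "pin-2": {"id": "pin-2", "description": "Number 2 is very funny", "author": "user-c"},
}

def pins_for_board(board_id):
    board = BOARDS[board_id]                      # request 1: fetch the board
    pin_ids = [m["id"] for m in board["moments"]]
    pins = [PINS[pid] for pid in pin_ids]         # request 2: fetch the pins
    # Project away fields the client should not see (e.g. author).
    return [{"id": p["id"], "description": p["description"]} for p in pins]

print(pins_for_board("board-1"))
```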
Short answer: no, you currently cannot join different documents in a single SQL query.
DocumentDB is schemaless and there is no hard concept of "references" like in a relational DB world. The referencing ids you have in your documents are just regular string data to DocumentDB and their special meaning (of linking to other documents) exists only in your application. Querying is currently just finding documents or parts of a document by some given predicates. It is carried out on documents independently of each other.
As a side note: I imagine this is by design, as such a restriction enables parallelism and probably contributes to the low latency they intend to deliver.
This does not mean that what you need is impossible. Options to consider:
Option 1: reference redesign
If you had a data design where the board-pin relationship was stored on the pin side, then you could query all pins in board-1 with a single query, along the lines of:
select * from pin where pin.boardId = @boardId
It's quite common to denormalize your data model to some extent to optimize RU usage. Sometimes it is beneficial to duplicate some parent information in the referencing documents, or even to store the relationship on both ends if the data is not too volatile and is read a lot from both sides. As a downside, keeping data in sync on writes becomes a bit more complicated. Mmmm, tradeoffs...
If redesign is an option then see the talk Modeling Data for NoSQL Document Databases from //build/2016 by Ryan CrawCour and David Makogon. It should give you some ideas to consider.
When designing data for DocumentDB, keep in mind that storage is relatively cheap; processing power (RUs) is what you pay for.
Option 2: stored procedures
If you want or need to optimize storage/latency, cannot modify the data design, and really, really need to perform such a query in a single round trip, then you could build a stored procedure to run the queries server-side and pack the results into a single JSON message returned from DocumentDB.
See Azure Cosmos DB server-side programming: Stored procedures, database triggers, and UDFs for more detail about what can be done and how.
I imagine you may get slightly better latency (due to the single call) and slightly worse overall RU usage (extra work for SP execution, the transaction, and merging the results), but definitely test your case before committing.
I consider this option a bit dirty because:
1. Combining documents according to higher-level needs is application logic and hence should live in your application layer, not in the database.
2. JS in DocumentDB is more cumbersome to develop, debug and maintain.
Option 3: change nothing
.. and just do the 2 calls. It's simple and may just as likely end up the best solution in the long run (considering overall cost of design, development, maintenance, changes, etc..).
As I understand it, the Freebase taxonomy generally boils down to this hierarchy:
Domain Category > Domain > Type > Topic
I have an application that receives input and does a bit of natural language processing that spits out a bunch of terms--some useful and some not. In an initial effort to systematically "decide" whether a term is useful, my thought is to "test" it against Freebase by assuming it's a topic and seeing whether Freebase has the term classified under at least one type.
So what I'm trying to do now is, given a topic, find its type IDs (and names, ideally). If none are returned, that tells me something about the so-called topic. If one or more types is returned, then I not only have some measure of the term's usefulness, but also an ability to overlay the Freebase taxonomy and give folks a different method of accessing it (via that tree metaphor).
For example, I might receive "Politics", "Political organization", "administration", "photo", "MSN", etc. from the NLP engine. What kind of MQL query can tell me which type(s) are connected to those topics, if any?
Thanks for your help.
UPDATE
I just had one of those grandiose head slap moments. I stepped away from the query I'd been tinkering with for a while and when I got back, I saw the error of my ways. I was trying to make this way too difficult and, as always, the simple solution that I couldn't see was exactly what I needed to see:
[{
  "id": null,
  "name": "Politics",
  "type": [{ "id": null, "name": null }]
}]
This leads me to a slightly different question, though. What I get back is multiple topics, one of which is /en/politics, and a bunch of others whose ids are /m/..., etc. I understand that the Freebase system is complex, but I'm a long way from understanding that complexity. For this kind of exercise, am I most likely to want the /en/ topic?
In general, the /en/ topics are more notable than /m/ topics. The /m/ IDs are automatically assigned to any new topic that gets added to Freebase, but the /en/ keys have to be added manually or semi-automatically by the community. So far, most of the /en/ keys come from Wikipedia (which has its own notability requirements), but they can come from anywhere.
Here is a list of some of the other popular namespaces that are used in Freebase.
Also, since you mentioned using NLP to match topics from text to Freebase, you might be interested in reading about the experimental Reconciliation API. This is how you would find the "best match" for a topic given the contextual clues available in your data.