I found only "SQLite uses a cost-based query planner that estimates the CPU and disk I/O costs of various competing query plans and chooses the plan that it thinks will be the fastest." from documents of documents, is there any paper or book specify the cost function with more details ? thx.
There is only the source code (where.c).
The query planner details are tweaked in every version.
I need to save CosmosDB documents that contain a large list - larger than the document limit of 2 MB.
I'd like to split that list into an 'attachment' associated with the "main document".
But this documentation page briefly mentions that
Attachment feature is being depreciated [sic]
What's the deprecation plan? Will newly created collections (in the future) stop supporting attachments?
The same page of documentation mentions a limit of 2GB for "Maximum attachment size per Account".
What does that mean? 2GB per attachment? 2 GB total for all attachments?
I recommend not taking a dependency on attachments. We are still planning on deprecating them but have not started in earnest on this.
Depending on your access patterns for this data, you may want to break it up into individual documents or model it in some other way. CRUD operations on large documents can be very costly and will experience high latency because of the large payload in each request.
If you have an unbounded array, its items should definitely be stored as individual documents, or the data should be modeled such that increasing size does not eventually cause performance issues. If your data is updated frequently, it should be modeled so that the frequently updated portions are separate from properties that are static.
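As a rough illustration of that advice, here is a minimal sketch using the azure-cosmos Python SDK; the endpoint, container name, partition key field "pk", and document shapes are all made up, so treat it as one possible modeling rather than the recommended one:

    from azure.cosmos import CosmosClient

    # Hypothetical account, database and container names -- adjust to your setup.
    client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
    container = client.get_database_client("cars-db").get_container_client("cars")

    # The parent document keeps only the bounded, mostly static properties.
    container.upsert_item({
        "id": "car-123",
        "pk": "car-123",      # partition key value shared by the whole "family" of documents
        "type": "car",
        "color": "blue",
        "price": 15000,
    })

    # Each entry of the formerly unbounded list becomes its own small document.
    service_records = [{"date": "2021-05-01", "miles": 48000},
                       {"date": "2022-06-01", "miles": 61000}]
    for i, record in enumerate(service_records):
        container.upsert_item({
            "id": f"car-123-service-{i}",
            "pk": "car-123",  # same partition key keeps parent and children together
            "type": "serviceRecord",
            **record,
        })

Reading or updating a single record then touches a small document instead of rewriting one huge list, and a query scoped to the shared partition key can still fetch the whole family cheaply.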
This article describes scenarios and considerations when modeling data in Cosmos and may help you come up with a more efficient model.
https://learn.microsoft.com/en-us/azure/cosmos-db/modeling-data
Hope this is helpful.
System Design Question:
You are given a dataset of a few million used cars and information about them -- miles, color, price, etc. You have to create an API endpoint in two days that allows users to query the dataset.
This was the answer I gave:
Use a relational database (let's say PostgreSQL) to house the data. Expose a GET endpoint that takes query string parameters corresponding to the attributes in the dataset, parses them and uses them to query the database. The endpoint can also track which attributes are queried the most and add indexes to those attributes to speed up the queries. I was asked how I would handle a range (e.g. "car with 50,000 <= miles <= 100,000") to which I said this can be handled by the query string parameter and translated into the SQL query by the GET endpoint.
Feedback
I was told in feedback afterwards that this answer "didn't convey a strong understanding of how to design web systems." I was hoping for some insights as to where my solution may have been insufficient/weak or may have overlooked something about designing web systems.
Note: I reconstructed my answer from memory so it may be clearer here than it was in the interview.
Thanks for any help!
As already discussed in the comments, the interviewer wanted to hear something about SQL injection. There are some countermeasures you can take to avoid SQL injection. These are (most probably not a complete list, but they should give a hint of what to look out for):
Use prepared statements (see the sketch after this list)
Take care with access restrictions (in the database as well as on the OS)
Validate the user input
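To make the first point concrete, here is a minimal sketch of the endpoint from the question using parameterized queries with Flask and psycopg2; the table name, column names and connection string are assumptions, and it also covers the mileage-range case mentioned in the interview:

    from flask import Flask, request, jsonify
    import psycopg2

    app = Flask(__name__)

    @app.route("/cars")
    def search_cars():
        # Whitelist of filterable columns; unknown query parameters are ignored.
        allowed = {"color", "make", "model"}
        clauses, params = [], []
        for key, value in request.args.items():
            if key in allowed:
                clauses.append(f"{key} = %s")  # column names come from the whitelist, never from user input
                params.append(value)
        # Range filter, e.g. /cars?min_miles=50000&max_miles=100000
        if "min_miles" in request.args and "max_miles" in request.args:
            clauses.append("miles BETWEEN %s AND %s")
            params.extend([request.args["min_miles"], request.args["max_miles"]])
        sql = "SELECT id, make, model, color, miles, price FROM cars"
        if clauses:
            sql += " WHERE " + " AND ".join(clauses)
        conn = psycopg2.connect("dbname=cars")  # assumed connection string
        with conn, conn.cursor() as cur:
            cur.execute(sql, params)  # values are bound by the driver, never concatenated into the SQL
            rows = cur.fetchall()
        conn.close()
        return jsonify(rows)

The key point is that user-supplied values reach the database only as bound parameters, so they can never change the structure of the SQL statement.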
What is the conflict resolution strategy for DynamoDB? The white paper on Dynamo talks about returning multiple versions from GetItem to be resolved by the client.
This SO question says that Dynamo and DynamoDB are different and GetItem returns only one value. In that case, what is the conflict resolution strategy that DynamoDB employs?
See this
"Conflicts can arise if applications update the same item in different regions at about the same time. To ensure eventual consistency, DynamoDB global tables use a “last writer wins” reconciliation between concurrent updates, where DynamoDB makes a best effort to determine the last writer. With this conflict resolution mechanism, all of the replicas will agree on the latest update, and converge toward a state in which they all have identical data."
So the latest write wins, based on a best-effort determination of the last writer across the replicas.
As stated, your question is not very clear: "What is the conflict resolution strategy for DynamoDB" - what conflicts? Are you referring to potentially inconsistent reads?
DynamoDB, for GetItem queries, allows both eventually consistent and strongly consistent reads, configurable with a parameter on the request (as described in the docs here: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadConsistency.html). For strongly consistent reads the value returned is the most recent value at the time the query was executed. For eventually consistent reads it is possible to read a slightly out-of-date version of an item, but there is no "conflict resolution" per se.
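For example, with boto3 (the table and key names below are hypothetical), read consistency is just a flag on the GetItem call:

    import boto3

    table = boto3.resource("dynamodb").Table("Cars")  # hypothetical table name

    # Eventually consistent read (the default): may return a slightly stale item.
    eventual = table.get_item(Key={"car_id": "123"})

    # Strongly consistent read: returns the most recent committed value.
    strong = table.get_item(Key={"car_id": "123"}, ConsistentRead=True)

    print(strong.get("Item"))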
You may be thinking about conditional updates which allow for requests to fail if an expected condition is not met at the time the query is executed.
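A conditional update with boto3 might look like the sketch below (attribute names and values are made up); if the condition does not hold when the request is executed, DynamoDB rejects it with a ConditionalCheckFailedException instead of silently overwriting:

    import boto3
    from botocore.exceptions import ClientError

    table = boto3.resource("dynamodb").Table("Cars")  # hypothetical table name

    try:
        table.update_item(
            Key={"car_id": "123"},
            UpdateExpression="SET price = :new",
            ConditionExpression="price = :expected",  # only apply if the price is still what we last read
            ExpressionAttributeValues={":new": 14500, ":expected": 15000},
        )
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            print("Item changed since we read it; re-read and retry.")
        else:
            raise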
For SQLite, given a SQL query Q, I am trying to figure out how to get the estimated query execution cost for Q out of the SQLite's query optimizer using C++ API.
I've searched for this and found lots of discussion on SQLite's website about this cost and how it is used internally by the query optimizer, but I cannot locate any C++ API call for getting this cost.
Hence, I'd guess maybe such a call is not implemented, but perhaps someone might know a way of getting this cost out of SQLite?
The documentation says:
The data returned by the EXPLAIN QUERY PLAN command is intended for interactive debugging only. The output format may change between SQLite releases. Applications should not depend on the output format of the EXPLAIN QUERY PLAN command.
Anything that is part of the API would need to be supported forever, so the details of the query optimizer cannot be made part of the API.
If you really want those values and know what you're doing, you could modify the SQLite source code to return them.
The sqlite3_stmt_status() function allows you to retrieve some performance-related metrics for a statement, but these are not estimates; they are the actual costs measured after execution.
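For what it's worth, here is a minimal sketch of inspecting the chosen plan interactively, shown with Python's sqlite3 module and a made-up table; the same EXPLAIN QUERY PLAN statement can be prepared and stepped through the C/C++ API, but as the warning above says, it exposes the chosen plan, not the planner's internal cost numbers:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cars (id INTEGER PRIMARY KEY, miles INT, color TEXT)")
    conn.execute("CREATE INDEX idx_cars_miles ON cars(miles)")

    # EXPLAIN QUERY PLAN reports which indexes/scans were chosen; the output
    # format is meant for debugging and may change between SQLite releases.
    for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM cars WHERE miles BETWEEN 50000 AND 100000"
    ):
        print(row)

    conn.close()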
I have a huge directed graph: It consists of 1.6 million nodes and 30 million edges. I want the users to be able to find all the shortest connections (including incoming and outgoing edges) between two nodes of the graph (via a web interface). At the moment I have stored the graph in a PostgreSQL database. But that solution is not very efficient and elegant, I basically need to store all the edges of the graph twice (see my question PostgreSQL: How to optimize my database for storing and querying a huge graph).
It was suggested to me to use a GraphDB like neo4j or AllegroGraph. However, the free version of AllegroGraph is limited to 50 million nodes and also has a very high-level API (RDF), which seems too powerful and complex for my problem. Neo4j, on the other hand, has only a very low-level API (and the python interface is not mature yet). Both of them seem to be more suited to problems where nodes and edges are frequently added to or removed from a graph. For a simple search on a graph, these GraphDBs seem to be too complex.
One idea I had would be to "misuse" a search engine like Lucene for the job, since I'm basically only searching connections in a graph.
Another idea would be to have a server process storing the whole graph (500 MB to 1 GB) in memory. The clients could then query the server process and could traverse the graph very quickly, since the graph is stored in memory. Is there an easy way to write such a server (preferably in Python) using some existing framework?
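As a rough sketch of the in-memory idea (not a full framework; the input format is assumed to be one "src dst" edge per line), a plain Python process can hold adjacency lists for both directions and answer shortest-connection queries with a breadth-first search. Wrapping this in a small HTTP layer would give the web interface:

    from collections import defaultdict, deque

    def load_graph(path):
        """Build adjacency lists for both edge directions from an edge-list file."""
        out_edges, in_edges = defaultdict(list), defaultdict(list)
        with open(path) as f:
            for line in f:
                src, dst = line.split()
                out_edges[src].append(dst)
                in_edges[dst].append(src)
        return out_edges, in_edges

    def shortest_connection(out_edges, in_edges, start, goal):
        """BFS over both incoming and outgoing edges; returns one shortest path or None."""
        parents = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                path = []
                while node is not None:
                    path.append(node)
                    node = parents[node]
                return path[::-1]
            for neighbor in out_edges[node] + in_edges[node]:
                if neighbor not in parents:
                    parents[neighbor] = node
                    queue.append(neighbor)
        return None

This returns a single shortest path; recording all predecessors at the same BFS depth instead of one parent would enumerate every shortest connection.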
Which technology would you use to store and query such a huge readonly graph?
LinkedIn has to manage a sizeable graph. It may be instructive to check out this info on their architecture. Note particularly how they cache their entire graph in memory.
There is also OrientDB, an open-source document-graph DBMS with a commercially friendly license (Apache 2). It has a simple API, a SQL-like language, ACID transactions, and support for the Gremlin graph language.
Its SQL has extensions for trees and graphs. Example:
select from Account where friends traverse (1,7) (address.city.country.name = 'New Zealand')
This returns all the Accounts with at least one friend who lives in New Zealand, where "friends" is traversed recursively up to a depth of 7.
I have a directed graph for which I (mis)used Lucene.
Each edge was stored as a Document, with the nodes as Fields of the document that I could then search for.
It performs well enough, and query times for fetching inbound and outbound links from a node would be acceptable to a user using it as a web-based tool. But for computationally intensive batch calculations where I am running many hundreds of thousands of queries, I am not satisfied with the query times I'm getting. I get the sense that I am definitely misusing Lucene, so I'm working on a second Berkeley DB-based implementation so that I can do a side-by-side comparison of the two. If I get a chance to post the results here, I will.
However, my data requirements are much larger than yours at > 3GB, more than could fit in my available memory. As a result the Lucene index I used was on disk, but with Lucene you can use a "RAMDirectory" index in which case the whole thing will be stored in memory, which may well suit your needs.
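Here is a minimal sketch of the same edge-as-document idea, using the pure-Python Whoosh library as a stand-in for Lucene (the field names and toy edges are made up): each edge is a document with "src" and "dst" fields, so outbound links are a lookup on "src" and inbound links a lookup on "dst":

    import tempfile
    from whoosh.fields import Schema, ID
    from whoosh.index import create_in

    # One document per edge, with both endpoints stored for retrieval.
    schema = Schema(src=ID(stored=True), dst=ID(stored=True))
    ix = create_in(tempfile.mkdtemp(), schema)

    writer = ix.writer()
    for src, dst in [("a", "b"), ("a", "c"), ("c", "a")]:  # toy edge list
        writer.add_document(src=src, dst=dst)
    writer.commit()

    with ix.searcher() as searcher:
        outbound = [hit["dst"] for hit in searcher.documents(src="a")]  # edges leaving "a"
        inbound = [hit["src"] for hit in searcher.documents(dst="a")]   # edges entering "a"
        print(outbound, inbound)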
Correct me if I'm wrong, but since each node is essentially a list of linked nodes, it seems to me that a DB with a schema is more of a burden than an advantage.
It also sounds like Google App Engine would be right up your alley:
It's optimized for reading - and there's memcached if you want it even faster
It's distributed - so the size doesn't affect efficiency
Of course, if you somehow rely on a relational DB to find the path, it won't work for you...
And I just noticed that the question is 4 months old.
So you have a graph as your data and want to perform a classic graph operation. I can't see what other technology could fit better than a graph database.