I am working on improving the performance of an application that runs on Google Cloud Datastore.
The application runs queries like the following in sequence:
Select * from table where type = "role" and roleId = "admin"
Select * from table where type = "role" and roleId = "editor"
Select * from table where type = "role" and roleId = "reader"
Select * from table where type = "role" and roleId = "writer"
......
The issue is that query 1 takes more than 2 seconds,
while queries 2-4 need only milliseconds.
It seems as if the Java client needs to wake up Google Cloud Datastore (or construct the connection for the first time, which is time consuming), after which the other queries are very quick.
My question is: why does the first query take 100 times longer than the consecutive queries that follow, and how can I make sure every query responds quickly, with no delay on the first one?
I hope a Google Cloud Datastore expert can help. Thanks.
Nicholas's comment is essentially the answer to this question, in general form: a connection is being established on the first request and can be reused afterwards. If anybody would like more technical information, they can read the documentation and even send feedback on the docs if more clarity is needed in any aspect.
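A common mitigation is to pay that first-connection cost at application startup rather than on the first user-facing query. Below is a minimal warm-up sketch using the Python client (the question mentions Java, but the idea carries over); the 'role' kind is taken from the queries above:

from google.cloud import datastore

client = datastore.Client()

# Issue one cheap keys-only query at startup so the channel,
# credentials, and connection are established before real traffic.
warmup = client.query(kind="role")
warmup.keys_only()
list(warmup.fetch(limit=1))

After this runs once, subsequent queries reuse the established connection and should respond in milliseconds, as observed in the question.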
Consider a simple blog post schema with the following columns:
ID
Author
Category
Status
CreatedDateTime
UpdatedDateTime
Now assume the following queries:
query by ID
query by Author, paginated
query by (Author, Status), sorted by CreatedDateTime, paginated
query by (Category, Status), sorted by CreatedDateTime, paginated
So it seems that, without doing much work, SimpleDB would be the easier way to implement this?
SimpleDB is barely supported by AWS any more - you can't even find it in the AWS console - so while it may work for you, personally I would be deciding between DynamoDB and DocumentDB (assuming you want NoSQL). I don't think there is any reason to start a new project on such an old offering at this point.
You should use DynamoDB because it has a lot of useful features such as Point in Time Recovery, transactions, encryption-at-rest, and activity streams that SimpleDB does not have.
If you're operating on a small scale, DynamoDB has the advantage that it allows you to set a maximum capacity for your table, which means you can make sure you stay in the free tier.
If you're operating at a larger scale, DynamoDB automatically handles all of the partitioning of your data (and has, for all practical purposes, limitless capacity), whereas SimpleDB has a limit of 10 GB per domain (aka "table") and you are required to manage any horizontal partitioning across domains that you might need.
Finally, there are signs that SimpleDB is already on a deprecation path. For example, if you look at the SimpleDB release notes, you will see that the last update was in 2011, whereas DynamoDB had several new features announced at the last re:Invent conference. Also, there are a number of reddit posts (such as here, here, and here) where the general consensus is that SimpleDB is already deprecated, and in some of the threads, Jeff Barr even commented and did not contradict any of the assertions that SimpleDB is deprecated.
That being said, in DynamoDB, you can support your desired queries.
You will need two Global Secondary Indexes, which use a composite sort key. Your queries can be supported with the following schema:
ID — hash key of your table
Author — hash key of the Author-Status-CreatedDateTime-index
Category — hash key of the Category-Status-CreatedDateTime-index
Status
CreatedDateTime
UpdatedDateTime
Status-CreatedDateTime — sort key of Author-Status-CreatedDateTime-index and Category-Status-CreatedDateTime-index. This is a composite attribute that exists to enable some of your queries. It is simply the value of Status with a separator character (I'll assume it's # for the rest of this answer), and CreatedDateTime appended to the end. (Personal opinion here: use ISO-8601 timestamps instead of unix timestamps. It will make troubleshooting a lot easier.)
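To make the composite attribute concrete, here is a hedged boto3 sketch of writing one blog post item; the table name and attribute values are hypothetical:

import boto3
from datetime import datetime, timezone

table = boto3.resource("dynamodb").Table("BlogPosts")  # hypothetical table name

created = datetime.now(timezone.utc).isoformat()  # ISO-8601, as recommended above
table.put_item(Item={
    "ID": "post-123",
    "Author": "alice",
    "Category": "technology",
    "Status": "published",
    "CreatedDateTime": created,
    "UpdatedDateTime": created,
    # Composite GSI sort key: Status, the "#" separator, then CreatedDateTime
    "Status-CreatedDateTime": f"published#{created}",
})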
Using this schema, you can satisfy all of your queries.
query by ID:
Simply perform a GetItem request on the main table using the blog post Id.
query by Author, paginated:
Perform a Query on the Author-Status-CreatedDateTime-index with a key condition expression of Author = :author.
query by (Author, Status), sorted by CreatedDateTime, paginated:
Perform a Query on the Author-Status-CreatedDateTime-index with a key condition expression of Author = :author and begins_with(Status-CreatedDateTime, :status). The results will be returned in order of ascending CreatedDateTime.
query by (Category, Status), sorted by CreatedDateTime, paginated:
Perform a Query on the Category-Status-CreatedDateTime-index with a key condition expression of Category = :category and begins_with(Status-CreatedDateTime, :status). The results will be returned in order of ascending CreatedDateTime. (Additionally, if you wanted to get all the blog posts in the "technology" category that have the status published and were created in 2019, you could use a key condition expression of Category = "technology" and begins_with(Status-CreatedDateTime, "published#2019").)
The sort order of the results can be controlled using the ScanIndexForward field of the Query request. The default is true (sort ascending); but by setting it to false DynamoDB will return results in descending order.
DynamoDB has built in support for paginating the results of a Query operation. Basically, any time that there are more results that were not returned, the query response will contain a lastEvaluatedKey which you can pass into your next query request to pick up where you left off. (See Query Pagination for more details about how it works.)
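As an illustration, here is a hedged boto3 sketch of the (Author, Status) query with pagination; the table name and key values are hypothetical, while the index and attribute names follow the schema above:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("BlogPosts")  # hypothetical table name
condition = (Key("Author").eq("alice")
             & Key("Status-CreatedDateTime").begins_with("published#"))

items, start_key = [], None
while True:
    kwargs = {
        "IndexName": "Author-Status-CreatedDateTime-index",
        "KeyConditionExpression": condition,
        "ScanIndexForward": False,  # newest first
        "Limit": 10,
    }
    if start_key:
        kwargs["ExclusiveStartKey"] = start_key  # resume where the last page ended
    resp = table.query(**kwargs)
    items.extend(resp["Items"])
    start_key = resp.get("LastEvaluatedKey")
    if not start_key:  # no more pages
        break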
On the other hand, if you're already familiar with SQL, and you want to make this as easy for yourself as possible, consider just using the Aurora Serverless Data API.
I am using the Python client SDK for Datastore (google-cloud-datastore) version 1.4.0. I am trying to run a keys-only query fetch:
query = client.query(kind = 'SomeEntity')
query.keys_only()
The query filter has an EQUAL condition on field1 and a GREATER_THAN_OR_EQUAL condition on field2. Ordering is done on field2.
For fetch, I am specifying a limit:
query_iter = query.fetch(start_cursor=cursor, limit=100)
page = next(query_iter.pages)
keyList = [entity.key for entity in page]
nextCursor = query_iter.next_page_token
Though there are around 50 entities satisfying this query, each fetch returns only around 10-15 results plus a cursor. I can use the cursor to get all the results, but this adds call overhead.
Is this behavior expected?
A keys_only query is limited to 1000 entries in a single call. This operation counts as a single entity read.
For other limitations of Datastore, please refer to the detailed table in the documentation.
However, in your code you specified a cursor as the starting point for a subsequent retrieval operation. A query can be limited without a cursor:
query = client.query(kind='SomeEntity')
query.keys_only()
tasks = list(query.fetch(limit=100))
For detailed instructions on how to use limits and cursors, please refer to the Google Cloud Datastore documentation.
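If you do need every matching key, here is a minimal sketch of the cursor loop, reusing the kind and page size from the question:

from google.cloud import datastore

client = datastore.Client()
query = client.query(kind='SomeEntity')
query.keys_only()

keys, cursor = [], None
while True:
    query_iter = query.fetch(start_cursor=cursor, limit=100)
    page = next(query_iter.pages)
    keys.extend(entity.key for entity in page)
    cursor = query_iter.next_page_token
    if cursor is None:  # all pages consumed
        break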
In Azure Cosmos DB (SQL API) the following query charges 9356.66 RU's:
SELECT * FROM Core c WHERE c.id = @id -- @id is a GUID
In contrast the following more complex query charges only 6.84 RU's:
SELECT TOP 10 * FROM Core c WHERE c.type = "Agent"
The documents in both examples are pretty small, having only a handful of attributes. Also, the document collection does not use any custom indexing policy. The collection contains 105685 documents.
To me this sounds as if there is no properly working index on the "id" field in place.
How is this possible and how can this be fixed?
Updates:
Without the TOP keyword the second query charges 3516.35 RU's and returns 100000 records.
The partition key is "/partition" and its values are 0 or 1 (evenly distributed).
If you have a partitioned collection, you need to specify the partition key if you want to make requests most efficiently. Cross-partition queries are really expensive (and slower) in Cosmos, because partition data can be stored in different places.
Try following:
SELECT * FROM Core c WHERE c.id = @id AND c.partition = @partition
Or specify the partition key in the feed options if you're using the Cosmos DB SDK.
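For example, with the Python SDK (azure-cosmos); the account URL, key, and GUID below are hypothetical placeholders:

from azure.cosmos import CosmosClient

url = "https://myaccount.documents.azure.com:443/"  # hypothetical account
key = "<primary-key>"
client = CosmosClient(url, credential=key)
container = client.get_database_client("mydb").get_container_client("Core")

items = list(container.query_items(
    query="SELECT * FROM Core c WHERE c.id = @id",
    parameters=[{"name": "@id", "value": "00000000-0000-0000-0000-000000000000"}],
    partition_key=0,  # scopes the query to a single partition
))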
Let me know if this helps.
I assume the solution is the same as posted here:
Azure DocumentDB Query by Id is very slow
I will close my own question once I am able to verify this with Microsoft Support.
I've got this query:
UPDATE linkeddb...table SET field1 = 'Y' WHERE column1 = '1234'
This takes 23 seconds to select and update one row.
But if I use openquery (which I don't want to) then it only takes half a second.
The reason I don't want to use openquery is so I can add parameters to my query securely and be safe from SQL injections.
Does anyone know of any reason for it to be running so slowly?
Here's a thought as an alternative. Create a stored procedure on the remote server to perform the update and then call that procedure from your local instance.
/* On remote server */
create procedure UpdateTable
    @field1 char(1),
    @column1 varchar(50)
as
update table
set field1 = @field1
where column1 = @column1
go

/* On local server */
exec linkeddb...UpdateTable @field1 = 'Y', @column1 = '1234'
If you're looking for the why, here's a possibility from Linchi Shea's Blog:
To create the best query plans when you are using a table on a linked server, the query processor must have data distribution statistics from the linked server. Users that have limited permissions on any columns of the table might not have sufficient permissions to obtain all the useful statistics, and might receive a less efficient query plan and experience poor performance. If the linked server is an instance of SQL Server, to obtain all available statistics, the user must own the table or be a member of the sysadmin fixed server role, the db_owner fixed database role, or the db_ddladmin fixed database role on the linked server.
(Because of Linchi's post, this clarification has been added to the latest BooksOnline SQL documentation).
In other words, if the linked server is set up with a user that has limited permissions, then SQL can't retrieve accurate statistics for the table and might choose a poor method for executing a query, including retrieving all rows.
Here's a related SO question about linked server query performance. Their conclusion was: use OpenQuery for best performance.
Update: some additional excellent posts about linked server performance from Linchi's blog.
Is column1 the primary key? Probably not. Try to select the records to update using the primary key (WHERE PK_field = xxx); otherwise (sometimes?) all records will be read to find the PK of the records to update.
Is column1 a varchar field? Is that why you are surrounding the value 1234 with single quotation marks? Or is that simply a typo in your question?
I just started learning LINQ to SQL, and so far I'm impressed with the ease of use and good performance.
I used to think that when doing LINQ queries like
from Customer in DB.Customers where Customer.Age > 30 select Customer
LINQ gets all customers from the database ("SELECT * FROM Customers"), moves them into a Customers array, and then searches that array using .NET methods. This would be very inefficient: what if there are hundreds of thousands of customers in the database? Making such big SELECT queries would kill the web application.
Now, after experiencing how fast LINQ to SQL actually is, I'm starting to suspect that when executing the query I just wrote, LINQ somehow converts it to a SQL query string:
SELECT * FROM Customers WHERE Age > 30
And it will run the query only when necessary.
So my question is: am I right? And when is the query actually run?
The reason why I'm asking is not only because I want to understand how it works in order to build good optimized applications, but because I came across the following problem.
I have 2 tables, one of them is Books, the other has information on how many books were sold on certain days. My goal is to select books that had at least 50 sales/day in past 10 days. It's done with this simple query:
from Book in DB.Books where (from Sale in DB.Sales where Sale.SalesAmount >= 50 && Sale.DateOfSale >= DateTime.Now.AddDays(-10) select Sale.BookID).Contains(Book.ID) select Book
The point is, I have to use this checking part in several queries, so I decided to create an array with the IDs of all popular books:
var popularBooksIDs = from Sale in DB.Sales where Sale.SalesAmount >= 50 && Sale.DateOfSale >= DateTime.Now.AddDays(-10) select Sale.BookID;
BUT when I try to do the query now:
from Book in DB.Books where popularBooksIDs.Contains(Book.ID) select Book
It doesn't work! That's why I think we can't use these kinds of shortcuts in LINQ to SQL queries, just as we can't use them in real SQL. We have to create straightforward queries, am I right?
You are correct. LINQ to SQL does create the actual SQL to retrieve your results.
As for your shortcuts, there are ways to work around the limitations:
var popularBooksIds = DB.Sales
    .Where(s => s.SalesAmount >= 50
        && s.DateOfSale >= DateTime.Now.AddDays(-10))
    .Select(s => s.BookID)  // the question's Sale.BookID, not s.Id
    .ToList();

// This should work: the ToList() above materializes the IDs in memory.
// The ToList() below forces the Books table into memory and then uses
// LINQ to Objects for the Contains filter.
var popularBooksSelect = DB.Books
    .ToList()
    .Where(b => popularBooksIds.Contains(b.ID));
Yes, the query gets translated to a SQL string, and the underlying SQL can be different depending on what you are trying to do... so you have to be careful in that regard. Check out a tool called LINQPad; you can try your query in it and see the executed SQL.
Also, it runs when iterating through the collection or calling a method on it like ToList().
Entity Framework or LINQ queries can be tricky sometimes. Sometimes you are surprised at the efficiency of the generated SQL query, and sometimes the query is so complicated and inefficient that you would smack your forehead.
The best approach is that if you have any suspicions about a query, run a SQL profiler on the backend to monitor all incoming queries. That way you know exactly what is being passed to the SQL server and can correct any inefficiencies if need be.
http://damieng.com/blog/2008/07/30/linq-to-sql-log-to-debug-window-file-memory-or-multiple-writers
This will help you to see what queries are being run and when. Also, Damien's blog is full of other LINQ to SQL goodness.
You can generate an EXISTS clause by using the .Any method. I have had more success that way than trying to generate IN clauses, because LINQ to SQL likes to retrieve all the data and pass it all back in as parameters to the query.
In LINQ to SQL, IQueryable expression fragments can be combined to create a single query; it will try to keep everything as an IQueryable for as long as it can, before you do something that cannot be expressed in SQL. When you call ToList, you are explicitly asking it to resolve that query into an IEnumerable stored in memory.
In most cases you are better off not selecting the book ids in advance. Keep the fragment for popular books in a single place in the code and use it when necessary, to build on another query. An IQueryable is just an expression tree, which is resolved into SQL at some other point.
If you think your application will perform better by storing the popular books elsewhere (memcache or whatever), then you may consider pulling them out beforehand and checking against that later. This will mean each book ID is passed in as a sproc parameter and used in an IN clause.