Understanding HQL queries on collection objects

This is similar to a question I asked earlier. The answers to that question partially solved my problem, but I'm still struggling to perform the kind of search I described there; furthermore, I'm simply having trouble understanding how Hibernate chooses what to return in different scenarios.
Here's my mapping:
Client {
    @OneToMany(mappedBy="client", cascade=CascadeType.ALL)
    private Set<Group> groups = new HashSet<Group>();
}
Group {
    @ManyToOne(cascade=CascadeType.ALL)
    private Client client = new Client();
    private String name;
    private String state; // two-char state code
    private String extId; // unique identifier; candidate key, but not the @Id.
}
Queries by name are substring matches (i.e., LIKE with wildcards on both ends of the parameter); state and extId are matched by equality.
The following query returns a single client, with only the matching group attached, even if other groups are associated to the client (note again that extId will only return one group):
select distinct client from Client as client
inner join client.groups as grp
where grp.extId = :extId
This query returns a single client, but with all associated groups attached, regardless of whether the group's state code matches the criteria:
select distinct client from Client as client
inner join client.groups as grp
where grp.state= :state
Finally, this query returns a separate copy of the client for each matched group, and each copy contains all of its associated groups, regardless of whether the group's name matches the criteria:
select distinct client from Client as client
inner join client.groups as grp
where grp.name like :name
I'm new to Hibernate, and I'm finding it immensely frustrating that I'm unable to predict what a given query will return. All three queries are nearly identical apart from small changes in the WHERE clause, yet I get radically different results from each. I've spent time reviewing the documentation, but I can't find where this behavior is explained. Can anyone help shed some light on this?
Finally, what I really need to do is to return Clients when querying by Group, and have the client only contain the Groups which match the search criteria. Is there a single-shot way I can construct an HQL query to do so, or will I have to do multiple queries and build my objects up in code?
Thanks.

The answer to this is twofold. First, there was a problem with the test harness, which was (sensibly) using transaction rollback to create test instances without leaving artifacts in the database; that was the source of the odd results I was seeing from the queries.
Second, I managed to return just the values I wanted in the collections by simply changing to an outer fetch join.
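A sketch of what that looks like for the mapping above (reconstructed here rather than copied from my code, using the state criterion as an example):

select distinct client from Client as client
left join fetch client.groups as grp
where grp.state = :state

Because the groups collection is populated from the fetch join itself, only the groups matching the restriction end up attached to the returned clients (with the caveat that the in-memory collection then no longer reflects everything in the database).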

Related

Running into some SQLite limitation using IN operator

I have a query that uses WHERE id IN (1,2,3,...) where the list (1,2,3,...) is dynamically generated from an array of integers (not using parameters). Now I have a particular query that takes roughly 500ms with 26623 ids but 50s (100x slower) with 26624 ids.
I couldn't find anything that looks related in https://sqlite.org/limits.html
SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params(search) AS params
JOIN flows ON flows.request_id = req.id
WHERE search NOT IN ('', '?')
AND flows.id IN (1,2,3) /* <=== here more than 26623 IDs make it super slow */
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC
Before I try to make that reproducible in isolation (e.g. search_params is a custom virtual table), does anyone know what limitation I might be running into? It's not the number of IDs per se, since a different query runs just fine with the same IDs.
SQLite version 3.36.0 via better-sqlite3 (Node.js) with a readonly database. The only pragma I use is journal_mode = WAL.
Compiled with (https://github.com/JoshuaWise/better-sqlite3/blob/master/docs/compilation.md#bundled-configuration):
SQLITE_DQS=0
SQLITE_LIKE_DOESNT_MATCH_BLOBS
SQLITE_THREADSAFE=2
SQLITE_USE_URI=0
SQLITE_DEFAULT_MEMSTATUS=0
SQLITE_OMIT_DEPRECATED
SQLITE_OMIT_GET_TABLE
SQLITE_OMIT_TCL_VARIABLE
SQLITE_OMIT_PROGRESS_CALLBACK
SQLITE_OMIT_SHARED_CACHE
SQLITE_TRACE_SIZE_LIMIT=32
SQLITE_DEFAULT_CACHE_SIZE=-16000
SQLITE_DEFAULT_FOREIGN_KEYS=1
SQLITE_DEFAULT_WAL_SYNCHRONOUS=1
SQLITE_ENABLE_MATH_FUNCTIONS
SQLITE_ENABLE_DESERIALIZE
SQLITE_ENABLE_COLUMN_METADATA
SQLITE_ENABLE_UPDATE_DELETE_LIMIT
SQLITE_ENABLE_STAT4
SQLITE_ENABLE_FTS3_PARENTHESIS
SQLITE_ENABLE_FTS3
SQLITE_ENABLE_FTS4
SQLITE_ENABLE_FTS5
SQLITE_ENABLE_JSON1
SQLITE_ENABLE_RTREE
SQLITE_ENABLE_GEOPOLY
SQLITE_INTROSPECTION_PRAGMAS
SQLITE_SOUNDEX
HAVE_STDINT_H=1
HAVE_INT8_T=1
HAVE_INT16_T=1
HAVE_INT32_T=1
HAVE_UINT8_T=1
HAVE_UINT16_T=1
HAVE_UINT32_T=1
Here's the answer from the SQLite forums. Essentially this is a combination of how the query planner handles IN literals and the cost estimates my virtual table reports; I'm hitting exactly the point where the query planner switches to a different plan.
SQLite NGQP is a cost based query planner. The IN () operator with a list of literal values gets implemented as a kind of temporary table; sometimes SQLite decides to create an index and do lookups, other times it decides to use that table as the outermost loop of the query.
EXPLAIN QUERY PLAN should show that in a more concise manner.
If compiled in DEBUG mode with WHERETRACE enabled, the .wheretrace command will show how SQLite NGQP reaches its plan. Essential input is the return values from the xBestIndex method of your virtual table, especially the "number of rows" and the "estimated cost". It is paramount to deliver accurate estimates. Cost should reflect processing cost relative to SQLite native tables.
Note that you can name the IN table by making it a CTE and CROSS JOIN to force the query plan that works fast.
https://sqlite.org/forum/forumpost/a3d68ed8b40cf583?t=h
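As a concrete sketch of that CTE + CROSS JOIN suggestion applied to the query above (untested here; ids is just an illustrative name for the CTE):

WITH ids(id) AS (VALUES (1), (2), (3))
SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params(search) AS params
JOIN flows ON flows.request_id = req.id
CROSS JOIN ids ON ids.id = flows.id
WHERE search NOT IN ('', '?')
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC

In SQLite, CROSS JOIN is an instruction to the planner not to reorder the joined tables, so the ID list cannot be promoted to the outermost loop.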
The workaround I use is json_each: I serialize the array of integers into a JSON string and pass that instead of a literal list. In my particular use-case this has some other benefits as well (e.g. I can bind a single parameter and re-use the query with any number of IDs), so I don't mind doing that:
SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params(search) AS params
JOIN flows ON flows.request_id = req.id
WHERE search NOT IN ('', '?')
-AND flows.id IN (1,2,3)
+AND flows.id IN (SELECT value FROM json_each('[1,2,3]'))
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC
I also know that the generic virtual table implementation of better-sqlite3 makes a trade-off between being easy to use (it's ridiculously easy) and achieving maximum performance.

CosmosDB, very long index that's also the partition key

We are storing a folder tree; the number of items is huge, so we have partitioned on the parent folder.
When we issue queries such as
SELECT * FROM root WHERE root.parentPath = "\\server\share\shortpath" AND root.isFile
The RUs are very low and the performance is very good.
But when we have a long path, e.g.
SELECT * FROM root WHERE root.parentPath = "\\server\share\a very\long\path\longer\than\this" AND root.isFile
The RUs go up to 5000 and the performance suffers.
parentPath works well as a partition key as all queries include this field in the filter.
If I add another clause to the query it also becomes very fast, e.g. something like AND root.name = 'filename'.
It's almost as if it's scanning the entire partition based on the hash derived from the key.
The query returns NO DATA, which is fine; it's someone looking for child folders under a given node. But once the path gets deep, it just gets very slow.
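For reference, the variant with the extra filter, which stays fast, looks like this (the file name is only an illustrative value):

SELECT * FROM root WHERE root.parentPath = "\\server\share\a very\long\path\longer\than\this" AND root.isFile AND root.name = 'filename'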
Query Metrics
x-ms-documentdb-query-metrics:
totalExecutionTimeInMs=1807.61;
queryCompileTimeInMs=0.08;
queryLogicalPlanBuildTimeInMs=0.04;
queryPhysicalPlanBuildTimeInMs=0.06;
queryOptimizationTimeInMs=0.01;
VMExecutionTimeInMs=1807.11;
indexLookupTimeInMs=0.65;
documentLoadTimeInMs=1247.08;
systemFunctionExecuteTimeInMs=0.00;
userFunctionExecuteTimeInMs=0.00;
retrievedDocumentCount=72554;
retrievedDocumentSize=59561577;
outputDocumentCount=0;
outputDocumentSize=49;
writeOutputTimeInMs=0.00;
indexUtilizationRatio=0.00
This is because of a path length limit in Indexing v1.
We have increased the path length limit to a larger value in the new index layout, so migrating the collections to this new layout would fix the issue and provide many performance benefits.
We have rolled out the new index layout for new collections by default. If it is possible for you to recreate the current collection and migrate existing data over there, it would be great. Otherwise, an alternative is to trigger the migration process to move existing collections to the new index layout. The following C# method can be used to do that:
static async Task UpgradeCollectionToIndexV2Async(
    DocumentClient client,
    string databaseId,
    string collectionId)
{
    // Read the current collection definition, bump the index layout version, and replace it.
    DocumentCollection collection = (await client.ReadDocumentCollectionAsync(
        string.Format("/dbs/{0}/colls/{1}", databaseId, collectionId))).Resource;
    collection.SetPropertyValue("IndexVersion", 2);
    ResourceResponse<DocumentCollection> replacedCollection = await client.ReplaceDocumentCollectionAsync(collection);
    Console.WriteLine(string.Format(CultureInfo.InvariantCulture,
        "Upgraded indexing version for database {0}, collection {1} to v2", databaseId, collectionId));
}
It could take several hours for the migration to complete, depending on the amount of data in the collection. The issue should be addressed once it is completed.
(This was copy pasted from an email conversation we had to resolve this issue)

Is it okay to use .Query<table_name> when updating SQLite using Xamarin?

I have taken over some code and I see that database updates are performed like this:
dbcon = DependencyService.Get<ISQLite>().GetConnection();

public void UpdateAnswered(string id)
{
    lock (locker)
    {
        dbcon.Query<Phrase>("UPDATE Phrase SET Answered = Answered + 1 " +
                            "WHERE Id = ?", id);
    }
}
I am new to using SQLite with Xamarin, but it looks strange to me that this update is handled with dbcon.Query and that the table name is passed as the type parameter (Query<Phrase>). Can someone confirm whether this is the optimal way to handle a table update? And why is it coded as a query with the table type being passed?
Update<T>
This method allows you to pass in an instance of an object that is stored in the database and has a primary key. SQLite then recognizes the primary key and updates the rest of the object's values.
You would just call connection.Update( phrase ); where phrase is an instance of the Phrase class with the properties you want to set. Be aware that all columns except the ID will be updated.
Query<T>
Performs a query and returns the results. The type parameter specifies the type of the items returned. This is most appropriate for SELECT queries.
Execute
This returns the number of rows affected by the query as an int. It is probably the best choice for your UPDATE statement, after the Update<T> method.
ExecuteScalar<T>
Use for queries that return scalar types - like COUNT, etc., where T is the type of the value.
In summary, Update<T> is the most natural way to update a row in the database (given an instance you already have), but Query<T> and Execute are very useful if you just want to UPDATE a single column, as in your example.
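For example, a rough equivalent of the method in the question, rewritten with Execute rather than Query<T> (a sketch only; it assumes the same dbcon, locker, and Phrase table shown above):

public void UpdateAnswered(string id)
{
    lock (locker)
    {
        // Execute runs the statement and returns the number of rows affected,
        // instead of trying to materialize Phrase objects from the result.
        dbcon.Execute("UPDATE Phrase SET Answered = Answered + 1 WHERE Id = ?", id);
    }
}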

Access database UPDATE table with subquery

I never should've assumed that knowing MySQL would keep me safe using Access.
I have two tables: users and scores
users table contains: id(auto increment primary key), username, password, etc..
scores table contains: id (number - foreign key to users.id), highScore
I've previously asked help for INSERT command, which now works as it should. Now I've got issues with a similar UPDATE command.
The non-working command looks like this:
string updateCommand = @"UPDATE scores
    SET
        id = (SELECT id FROM users WHERE username = @username),
        highScore = @score
    WHERE highScore = (SELECT MIN(highScore) FROM scores);";
which throws a: Operation must use an updateable query.
To rationalize what I'm trying to accomplish here: I'm INSERT-ing high scores until I reach 10 scores in the table; after that, instead of adding new scores and filling up the database needlessly, I decided it'd be more sensible to just "overwrite" the currently lowest score using UPDATE.
I am supplied a username and the high score, and since the scores table contains only the id, I need to look up the id of the current user; that's what the first subquery is doing. The second subquery in the WHERE clause specifies which score to replace (though there is possibly a bug here if multiple people share the lowest score; any ideas how to fix that?).
I've also tried using OUTER RIGHT JOIN like this:
string updateCommand = @"UPDATE scores
    OUTER RIGHT JOIN users ON scores.id = users.id
    SET
        scores.id = users.id,
        scores.highScore = @score
    WHERE (highScore = (SELECT MIN(highScore) FROM scores)) AND (username = @username);";
With no luck (I get a generic "Syntax error in UPDATE statement.").
Browsing the net I've found that I possibly "can't" use subqueries in UPDATE statements but I seem to find conflicting opinions on the matter.
I've also tried using the DLookup function in place of subqueries like:
#"...
id = DLookup(""id"", ""users"", ""username = #username""),
...
WHERE highScore = DLookup(""MIN(highScore)"", ""scores"");";
Ellipses represent omitted code, which is identical to the code above.
Also as a last resort I've tried dividing into multiple queries however userId query which looks like this:
string userIdQuery = "SELECT id FROM users WHERE username = @username"
seems to return null, judging by the NullReferenceException I receive (Object reference not set to an instance of an object.) when trying to use the variable userId after I've done this:
int userId = 0;
userId = (Int32)command.ExecuteScalar();
I'm supposed to get an integer, however I think I'm getting a null. The almost identical query for getting the minimum high score works flawlessly and the int variable is filled with the correct value, so I'm assuming the problem is in the query itself somehow. I've tried adding single quotes around the @username parameter, assuming it might not be recognizing the string, but it seems that's not it.
Phew.. took me a while to write this. Anyone got any ideas on how to make this all work? If you need more info let me know.
So after some messing around I've found the causes of my troubles. The downside is that I ended up with more code, because I avoided subqueries as much as possible; at least in my experience, Access doesn't really like subqueries in UPDATE or INSERT commands.
What I did first is split the command into 3 separate ones:
"SELECT id FROM users WHERE username = ?;" - To get the id of the user whose score
I'm putting in the database.
#"SELECT scores.id, scores.highScore, scores.dateTime FROM scores WHERE (((scores.highScore)=DMin(""highScore"",""scores"")));" - which gets the id, high score
and time when the entry was... well entered, of the lowest score currently in the high scores list. Thanks to a suggestion from HansUp I used DMin function instead of a subquery with MIN to avoid the Must use an updateable query error. The extraneous parentheses are due to Access since this command was generated by the Access query designer and I'm too afraid to change anything lest I break it.
#"UPDATE scores SET scores.id = ?, scores.highScore = ?, scores.[dateTime] = Now() WHERE (((scores.id)=?) AND ((scores.highScore)=?) AND ((scores.dateTime)=?));" - The update command itself, not much to say here except that it takes the previously extracted data and uses it as values for the command.
One thing I noticed is that even once I got the command working, .ExecuteNonQuery() would always return 0 rows affected. After poking around I found out that named parameters for commands against Access in ASP.NET / C# don't always work, and that positional ? placeholders should be used instead. It's kind of inconvenient, but I can't complain too much.
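To illustrate the positional-parameter point, here is a trimmed sketch of how the first of the three commands can be wired up with OleDb (the method name and variables are just for illustration; the key detail is that the Access OLE DB provider ignores parameter names and binds strictly by the order of the ? placeholders):

static int? GetUserId(OleDbConnection connection, string username)
{
    using (var command = new OleDbCommand("SELECT id FROM users WHERE username = ?;", connection))
    {
        // The name given here is ignored by the provider; only the position matters.
        command.Parameters.AddWithValue("?", username);
        object result = command.ExecuteScalar();   // null when no row matches
        return result == null ? (int?)null : Convert.ToInt32(result);
    }
}

Checking the result for null before converting also avoids the NullReferenceException mentioned above when the username isn't found.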

SQLite - Get a specific row index for a Sorted/Filtered Query

I'm creating a caching system to take data from an SQLite database table using a sorted/filtered query and display it. The tables I'm pulling from can be potentially very large and, of course, I need to minimize impact on memory by only retaining a maximum number of rows in memory at any given time. This is easily done by using LIMIT and OFFSET to load only the records I need and update the cache as needed. Implementing this is trivial. The problem I'm having is determining where the insertion index is for a new record inserted into a particular query so I can update my UI appropriately. Is there an easy way to do this? So far the ideas I've had are:
Dump the entire cache, re-count the Query results (there's no guarantee the new row will be included), refresh the cache and refresh the entire UI. I hope it's obvious why that's not really desirable.
Use my own algorithm to determine whether the new row is included in the current query, whether it is included in the current cached results, and at what index it should be inserted if it's within the current cached scope. The biggest downfall of this approach is its complexity and the risk that my own sorting/filtering algorithm won't match SQLite's.
Of course, what I want is to be able to ask SQLite: Given 'Query A' what is the index of 'Row B', without loading the entire query results. However, so far I haven't been able to find a way to do this.
I don't think it matters but this is all occurring on an iOS device, using the objective-c programming language.
More Info
The Query and subsequent cache are based on user input. Essentially the user can re-sort and filter (or search) to alter the results they're seeing. My reluctance to simply recreate the cache on insertions (and edits, actually) is about providing a 'smoother' UI experience.
I should point out that I'm leaning toward option "2" at the moment. I played around with creating my own caching/indexing system by loading all the records in a table and performing the sort/filter in memory using my own algorithms. So much of the code needed to determine whether and/or where a particular record is in the cache is already there, so I'm slightly predisposed to use it. The danger lies in having a cache that doesn't match the underlying query. If I include a record in the cache that the query wouldn't return, I'll be in trouble and probably crash.
You don't need record numbers.
Save the values of the ordered field in the first and last records of the LIMITed query result.
Then you can use these to check whether the new record falls into this range.
In other words, assuming that you order by the Name field, and that the original query was this:
SELECT Name, ...
FROM mytab
WHERE some_conditions
ORDER BY Name
LIMIT x OFFSET y
then try to get at the new record with a similar query:
SELECT 1
FROM mytab
WHERE some_conditions
AND PrimaryKey = LastInsertedValue
AND Name BETWEEN CachedMin AND CachedMax
Similarly, to find out before (or after) which record the new record was inserted, start directly after the inserted record and use a limit of one, like this:
SELECT Name
FROM mytab
WHERE some_conditions
AND Name > MyInsertedName
AND Name BETWEEN CachedMin AND CachedMax
ORDER BY Name
LIMIT 1
This doesn't give you a number; you still have to check where the returned Name is in your cache.
Typically you'd expect a cache to be invalidated if there were underlying data changes. I think dropping it and starting over will be your simplest, maintainable solution. I would recommend it unless you have a very good reason.
You could write another query that just returned the row count (example below) to see if your cache should be invalidated. That would save recreating the cache when it did not change.
SELECT name,address FROM people WHERE area_code=970;
SELECT COUNT(rowid) FROM people WHERE area_code=970;
The information you'd need from sqlite to know when your cache was invalidated would require some rather intimate knowledge of how the query and/or index was working. I would say that is fairly high coupling.
Otherwise, you'd want to know where it was inserted with regards to the sorting. You would probably key each page on the sorted field. Delete anything greater than the insert/delete field. Any time you change the sorting you'd drop everything.
Something like the below would be a start if you were using C++. I realize you aren't using C++, but hopefully it's evident what I'm trying to do.
#include <set>
#include <string>
#include <vector>

struct Person {
    std::string name;
    std::string addr;
};

struct Page {
    std::string key;                  // sort-field value that starts this page
    std::vector<Person> persons;

    struct Less {
        bool operator()(const Page &lhs, const Page &rhs) const {
            return lhs.key.compare(rhs.key) < 0;
        }
    };
};

typedef std::set<Page, Page::Less> pages_t;
pages_t pages;

bool sql_insert(const Person &person);   // performs the actual INSERT; defined elsewhere

void insert(const Person &person) {
    if (sql_insert(person)) {
        // First cached page whose key is >= the inserted row's sort value.
        pages_t::iterator drop_cache_start = pages.lower_bound(Page{person.name, {}});
        // ... drop this page and everything after it
    }
}
You'd have to do some wrangling to get different data types of key to work nicely, but it's possible.
Theoretically you could just leave the pages out of it and only use the objects themselves, but then the database would no longer "own" the data. If you only fill pages from the database, you'll have fewer data-consistency worries.
This may be a bit off topic, but you aren't re-implementing views, are you? A view doesn't cache per se, but it isn't clear whether caching is a requirement of your project.
The solution I came up with is not exactly simple, but it's currently working well. I realized that the index of a record in a query's results is also the count of all the records that precede it. What I needed to do was 'convert' all the ORDER statements in the query into a series of WHERE statements that return only the preceding records, and take a count of those records. It's trickier than it sounds (or maybe not... it sounds tricky). The biggest issue I had was making sure the query was, in fact, sorted in a way I could predict. This meant I needed an order column in the Order Parameters that was based on a column with unique values. So, whenever a user sorts on a column, I append another order parameter on a unique column (I used a "Modified Date Stamp") to break ties.
Creating the WHERE portion of the statement requires more than just tacking on a bunch of ANDs. It's easier to demonstrate. Say you have 3 Order columns: "LastName" ASC, "FirstName" DESC, and "Modified Stamp" ASC (the tie breaker). The WHERE statement would have to look something like this ('?' = record value):
WHERE
"LastName" < ? OR
("LastName" = ? AND "FirstName" > ?) OR
("LastName" = ? AND "FirstName" = ? AND "Modified Stamp" < ?)
Each set of WHERE conditions grouped together by parentheses is a tie-breaker. If the record values of "LastName" are in fact equal, we must then look at "FirstName", and finally "Modified Stamp". Obviously, this statement can get really long if you're sorting by a bunch of order parameters.
There's still one problem with the above solution. Comparisons against NULL values never evaluate to true, and yet when sorting ascending SQLite puts NULL values first. Therefore, in order to deal with NULL values appropriately you've got to add another layer of complication. First, all equality operators, =, must be replaced by IS. Second, all < operations must be wrapped with an OR ... IS NULL so that NULL values are included on the < side. This turns the above clause into:
WHERE
("LastName" < ? OR "LastName" IS NULL) OR
("LastName" IS ? AND "FirstName" > ?) OR
("LastName" IS ? AND "FirstName" IS ? AND ("Modified Stamp" < ? OR "Modified Stamp" IS NULL))
I then take a count of the rowid using the WHERE clause above.
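Put together, the count that yields the insertion index looks roughly like this (mytable is a stand-in for the real table; in practice the query's own filter conditions are ANDed in as well):

SELECT COUNT(rowid)
FROM mytable
WHERE
    ("LastName" < ? OR "LastName" IS NULL) OR
    ("LastName" IS ? AND "FirstName" > ?) OR
    ("LastName" IS ? AND "FirstName" IS ? AND ("Modified Stamp" < ? OR "Modified Stamp" IS NULL));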
It turned out easy enough for me to do mostly because I had already constructed a set of objects to represent various aspects of my SQL Statement which could be assembled to generate the statement. I can't even imagine trying to manipulate a SQL statement like this any other way.
So far, I've tested using this on several iOS devices with up to 10,000 records in a table and I've had no noticeable performance issues. Of course, it's designed for single record edits/insertions so I don't really need it to be super fast/efficient.
