Linq 'contains' query taking too long - asp.net

I have this query:
var newComponents = from ic in importedComponents
                    where !existingComponents.Contains(ic)
                    select ic;
importedComponents and existingComponents are of type List<ImportedComponent>, and exist only in memory (are not tied to a data context). In this instance, importedComponents has just over 6,100 items, and existingComponents has 511 items.
This statement is taking too long to complete (I don't know exactly how long; I stopped the script after 20 minutes). I've tried the following with no improvement in execution speed:
var existingComponentIDs = from ec in existingComponents
                           select ec.ID;
var newComponents = from ic in importedComponents
                    where !existingComponentIDs.Contains(ic.ID)
                    select ic;
Any help will be much appreciated.

The problem is the quadratic complexity of this algorithm. Put all the existing component IDs into a HashSet and use the HashSet.Contains method; it has O(1) lookup cost, compared to O(N) for Contains/Any on a list.
The morelinq project contains a method that does all of that in one convenient step: ExceptBy.
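A minimal sketch of that approach, assuming ImportedComponent.ID is an int (any key type with proper equality semantics works):

// Build the lookup once, then each membership test is O(1).
var existingIDs = new HashSet<int>(existingComponents.Select(c => c.ID));
var newComponents = importedComponents
    .Where(ic => !existingIDs.Contains(ic.ID))
    .ToList();
// Or, with the morelinq package mentioned above:
// var newComponents = importedComponents.ExceptBy(existingComponents, c => c.ID).ToList();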

You could use Except to get the set difference:
var existingComponentIDs = existingComponents.Select(c => c.ID);
var importedComponentIDs = importedComponents.Select(c => c.ID);
var newComponentIDs = importedComponentIDs.Except(existingComponentIDs);
var newComponents = from ic in importedComponents
                    join newID in newComponentIDs on ic.ID equals newID
                    select ic;
foreach (var c in newComponents)
{
    // insert into database?
}
Why is LINQ JOIN so much faster than linking with WHERE?
In short: the Join method can set up a hash table to use as an index to quickly zip two tables together.

Well, based on the logic and numbers you provided, you are performing roughly 3,117,100 comparisons (6,100 × 511) when you run that statement. That is not entirely accurate, because the condition may be satisfied before scanning the entire list, but you get my point.
With collections this large you are going to want to use a collection where you can index your key (in this case the component ID) to help reduce the overhead of the search. The thing to remember is that even though LINQ looks like SQL, there are no magic indexes here; it is mainly for convenience. In fact, I have seen articles where a LINQ lookup is actually a bit slower than a brute-force lookup.
EDIT: If it is possible, I would suggest trying a Dictionary or SortedList for your values; either one should give much better lookup performance here.
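For example, a sketch using a Dictionary keyed by ID (assuming ID is an int and is unique within existingComponents):

var existingByID = existingComponents.ToDictionary(c => c.ID);
var newComponents = importedComponents
    .Where(ic => !existingByID.ContainsKey(ic.ID))
    .ToList();
// existingByID.TryGetValue(someID, out var match) also hands back the stored component when you need it.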

Related

Cosmos DB read a single document without partition key

A container has a function called ReadItemAsync. The problem is I do not have the partition key, but only the id of the document. What is the best approach to get just a single item then?
Do I have to get it from a collection? Like:
// Raw SQL form:
var allItemsQuery = VesselContainer.GetItemQueryIterator<CoachVessel>("SELECT * FROM c where c.id=....");

// Or the LINQ form, iterating the results:
var q = VesselContainer.GetItemLinqQueryable<CoachVessel>();
var iterator = q.ToFeedIterator();
var result = new List<CoachVessel>();
while (iterator.HasMoreResults)
{
    foreach (var item in await iterator.ReadNextAsync())
    {
        result.Add(item);
    }
}
Posting as answer.
Yes, you have to do a fan-out query, but id is only guaranteed to be unique per partition key, so even then you may end up with multiple items. Frankly speaking, if you don't have the partition key for a point read, then the database model is not correct and it (or the application itself) should be redesigned.
Additionally, for small, single-partition collections this cross-partition query will not be too expensive, because the collection is small. However, once the database starts to scale out, it will get increasingly slower and more expensive, as the query fans out to an ever-increasing number of physical partitions. As stated above, I would strongly recommend you modify the app to pass the partition key value in the request. This allows a single point-read operation, which is extremely fast and efficient.
Good luck.
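For reference, once the caller can supply the partition key, the point read is a one-liner (a sketch; id and partitionKeyValue are illustrative variable names):

// Point read: requires both the document id and the partition key value.
ItemResponse<CoachVessel> response =
    await VesselContainer.ReadItemAsync<CoachVessel>(id, new PartitionKey(partitionKeyValue));
CoachVessel vessel = response.Resource;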
Try using ReadItemAsync, like:
// Note: PartitionKey.None only applies to non-partitioned (legacy) containers.
dynamic log = await container.ReadItemAsync<dynamic>(ID, PartitionKey.None);

SQLite data retrieve with select taking too long

I have created a table with SQLite for my Corona/Lua app. It's a hash table with roughly 700,000 rows. The table has two columns: the hashcode (a string) and the value (another string). During the program I need to fetch data several times by providing the hashcode.
I'm using something like this code to get the data:
for p in db:nrows([[SELECT * FROM test WHERE id=']] .. hashcode .. [[';]]) do
    print(p)
    -- p = returned value --
end
This statement, though, is taking far too long to run.
Thanks.
Edit:
Success!
The mistake was with the primary key. I set the hashcode as the primary key, as below, and the retrieval time went back to normal:
CREATE TABLE IF NOT EXISTS test (id STRING PRIMARY KEY, array);
I also prepared the statements in advance, as you said:
stmt = db:prepare("SELECT * FROM test WHERE id = ?;")
[...]
stmt:bind(1,s)
for p in stmt:nrows() do
The only problem is that the db file size, which was around 18 MB, grew to 29.5 MB.
You should create the table with id as a unique primary key; this will automatically make an index.
create table if not exists test
(
    id text primary key,
    val text
);
You should not construct statements using string concatenation; this is a security issue (SQL injection), so avoid getting into this habit. Also, you should prepare statements in advance, at program initialization, and run the prepared statements.
Something like this... initially:
hashcode_query_stmt = db:prepare("SELECT * FROM test WHERE id = ?;")
then for each use:
hashcode_query_stmt:bind_values(hashcode)
for p in hashcode_query_stmt:urows() do ... end
Ensure that there is an index on the id/hashcode column. Without one, such queries will be slow, slow, slow. This index should probably be unique.
If only selecting the value/hashcode (SELECT value FROM ..), it may be beneficial to have a covering index over (id, value) as that can avoid additional seeking to the row data (see SQLite Query Planning). Try it with and without such a covering index.
Also, it may be worthwhile to employ caching if the same hashcodes are queried multiple times.
As already stated, make sure you have an index on id.
If you can't change the table schema now, you can add an index ad hoc:
CREATE INDEX test_id ON test (id);
About hashes: if you are computing hashes in your software to speed up searches, don't!
SQLite will treat your supplied hashes like any regular string/blob. Also, RDBMSs are optimized for efficient searching, which can be greatly improved with indexes.
Unless you're hashing to save space, you are wasting processor time computing hashes in your application.

Doctrine2 OFFSET and LIMIT

I want to use limit and offset in my query, but the number of records returned does not match. When I'm not using offset and limit, the function returns 26 objects, but after setting the methods
->setMaxResults(5)
->setFirstResult(10)
the number is 1 ...
What's going on?
What you are probably hitting is a typical problem you get when fetch-joining in DQL. It is a very simple issue that derives from the fact that offset and limit are applied to a result set that is not yet hydrated and has to be normalized (see the documentation about first and max results).
If you want to avoid the problem (even with more complex joined or fetch-joined results), you will need to use the ORM Paginator API. Using the paginator basically triggers multiple queries to:
compute the number of records in the resultset according to your offset/limit
compute the different identifiers of the root entity of your query (with applied max/first results)
retrieve joined results (without applied first/max results)
Its usage is quite simple:
$query = $em->createQuery($fetchJoinQuery);
$paginator = new \Doctrine\ORM\Tools\Pagination\Paginator($query);
$query->setFirstResult(20);
$query->setMaxResults(100);
foreach ($paginator as $result) {
    var_dump($result->getId());
}
This will print 100 items starting from the one at offset 20, regardless of the number of joined or fetch-joined results.
While this may seem inefficient, it's the safest way to handle the problem of fetch-joined results causing apparently scrambled offsets and limits. You can see how this is handled by diving into the internals of the ORM Paginator.

SQLite - Get a specific row index for a Sorted/Filtered Query

I'm creating a caching system to take data from an SQLite database table using a sorted/filtered query and display it. The tables I'm pulling from can be potentially very large and, of course, I need to minimize impact on memory by only retaining a maximum number of rows in memory at any given time. This is easily done by using LIMIT and OFFSET to load only the records I need and update the cache as needed. Implementing this is trivial. The problem I'm having is determining where the insertion index is for a new record inserted into a particular query so I can update my UI appropriately. Is there an easy way to do this? So far the ideas I've had are:
Dump the entire cache, re-count the Query results (there's no guarantee the new row will be included), refresh the cache and refresh the entire UI. I hope it's obvious why that's not really desirable.
Use my own algorithm to determine whether the new row is included in the current query, whether it is included in the current cached results, and at what index it should be inserted if it's within the current cached scope. The biggest downfall of this approach is its complexity and the risk that my own sorting/filtering algorithm won't match SQLite's.
Of course, what I want is to be able to ask SQLite: Given 'Query A' what is the index of 'Row B', without loading the entire query results. However, so far I haven't been able to find a way to do this.
I don't think it matters but this is all occurring on an iOS device, using the objective-c programming language.
More Info
The query and its cache are based on user input. Essentially the user can re-sort and filter (or search) to alter the results they're seeing. My reluctance to simply recreate the cache on insertions (and edits, actually) comes from wanting to provide a 'smoother' UI experience.
I should point out that I'm leaning toward option "2" at the moment. I played around with creating my own caching/indexing system by loading all the records in a table and performing the sort/filter in memory using my own algorithms. So much of the code needed to determine whether and/or where a particular record is in the cache is already there, so I'm slightly predisposed to use it. The danger lies in having a cache that doesn't match the underlying query. If I include a record in the cache that the query wouldn't return, I'll be in trouble and probably crash.
You don't need record numbers.
Save the values of the ordered field in the first and last records of the LIMITed query result.
Then you can use these to check whether the new record falls into this range.
In other words, assuming that you order by the Name field, and that the original query was this:
SELECT Name, ...
FROM mytab
WHERE some_conditions
ORDER BY Name
LIMIT x OFFSET y
then try to get at the new record with a similar query:
SELECT 1
FROM mytab
WHERE some_conditions
AND PrimaryKey = LastInsertedValue
AND Name BETWEEN CachedMin AND CachedMax
Similarly, to find out which record the new record was inserted before (or after), start directly after the inserted record and use a limit of one, like this:
SELECT Name
FROM mytab
WHERE some_conditions
AND Name > MyInsertedName
AND Name BETWEEN CachedMin AND CachedMax
ORDER BY Name
LIMIT 1
This doesn't give you a number; you still have to check where the returned Name is in your cache.
Typically you'd expect a cache to be invalidated if there were underlying data changes. I think dropping it and starting over will be your simplest, maintainable solution. I would recommend it unless you have a very good reason.
You could write another query that just returned the row count (example below) to see if your cache should be invalidated. That would save recreating the cache when it did not change.
SELECT name,address FROM people WHERE area_code=970;
SELECT COUNT(rowid) FROM people WHERE area_code=970;
The information you'd need from sqlite to know when your cache was invalidated would require some rather intimate knowledge of how the query and/or index was working. I would say that is fairly high coupling.
Otherwise, you'd want to know where it was inserted with regards to the sorting. You would probably key each page on the sorted field. Delete anything greater than the insert/delete field. Any time you change the sorting you'd drop everything.
Something like the below would be a start if you were using C++. I realize you aren't doing C++, but hopefully it makes clear what I'm trying to do.
#include <set>
#include <string>
#include <vector>

struct Person {
    std::string name;
    std::string addr;
};

struct Page {
    std::string key;
    std::vector<Person> persons;

    struct Less {
        bool operator()(const Page &lhs, const Page &rhs) const {
            return lhs.key.compare(rhs.key) < 0;
        }
    };
};

typedef std::set<Page, Page::Less> pages_t;
pages_t pages;

bool sql_insert(const Person &person); // your own SQLite INSERT wrapper

void insert(const Person &person) {
    if (sql_insert(person)) {
        // Build a probe Page keyed on the sorted field so we can search the set.
        Page probe;
        probe.key = person.name;
        pages_t::iterator drop_cache_start = pages.lower_bound(probe);
        //... drop this page and everything after it
    }
}
You'd have to do some wrangling to get different datatypes of key to work nicely, but it's possible.
Theoretically you could just leave the pages out of it and only use the objects themselves. The database would no longer "own" the data, though. If you only fill pages from the database, then you'll have fewer data-consistency worries.
This may be a bit off topic, but you aren't re-implementing views, are you? It doesn't cache per se, but it isn't clear if that is a requirement of your project.
The solution I came up with is not exactly simple, but it's currently working well. I realized that the index of a record in a query's results is also the count of all of its preceding records. What I needed to do was 'convert' all the ORDER clauses in the query to a series of WHERE clauses that would return only the preceding records, and take a count of those records. It's trickier than it sounds (or maybe not... it sounds tricky). The biggest issue I had was making sure the query was, in fact, sorted in a way I could predict. This meant I needed an order column in the order parameters that was based on a column with unique values. So, whenever a user sorts on a column, I append to the statement another order parameter on a unique column (I used a "Modified Date Stamp") to break ties.
Creating the WHERE portion of the statement requires more than just tacking on a bunch of ANDs. It's easier to demonstrate. Say you have 3 Order columns: "LastName" ASC, "FirstName" DESC, and "Modified Stamp" ASC (the tie breaker). The WHERE statement would have to look something like this ('?' = record value):
WHERE
"LastName" < ? OR
("LastName" = ? AND "FirstName" > ?) OR
("LastName" = ? AND "FirstName" = ? AND "Modified Stamp" < ?)
Each set of WHERE parameters grouped together by parentheses is a tie breaker: if the record values of "LastName" are equal, we must then look at "FirstName", and finally "Modified Stamp". Obviously, this statement can get really long if you're sorting by a lot of order parameters.
There's still one problem with the above solution. Comparisons against NULL values never evaluate to true (they yield NULL), and yet when sorting, SQLite puts NULL values first. Therefore, to deal with NULL values appropriately you have to add another layer of complication. First, all equality operations (=) must be replaced by IS. Second, all < operations must be wrapped with an OR ... IS NULL so that NULL values are picked up on the < side. This turns the above statement into:
WHERE
("LastName" < ? OR "LastName" IS NULL) OR
("LastName" IS ? AND "FirstName" > ?) OR
("LastName" IS ? AND "FirstName" IS ? AND ("Modified Stamp" < ? OR "Modified Stamp" IS NULL))
I then take a count of the RowID using the above WHERE clause.
It turned out easy enough for me to do mostly because I had already constructed a set of objects to represent various aspects of my SQL Statement which could be assembled to generate the statement. I can't even imagine trying to manipulate a SQL statement like this any other way.
So far, I've tested using this on several iOS devices with up to 10,000 records in a table and I've had no noticeable performance issues. Of course, it's designed for single record edits/insertions so I don't really need it to be super fast/efficient.

LINQ to SQL and DataPager

I'm using LINQ to SQL to search a fairly large database and am unsure of the best approach to perform paging with a DataPager. I am aware of the Skip() and Take() methods and have those working properly. However, I'm unable to use the count of the results for the datapager, as they will always be the page size as determined in the Take() method.
For example:
var result = (from c in db.Customers
              where c.FirstName == "JimBob"
              select c).Skip(0).Take(10);
This query will always return 10 or fewer results, even if there are 1000 JimBobs. As a result, the DataPager will always think there's a single page, and users aren't able to navigate across the entire result set. I've seen one online article where the author just wrote another query to get the total count and called that.
Something like:
int resultCount = (from c in db.Customers
                   where c.FirstName == "JimBob"
                   select c).Count();
and used that value for the DataPager. But I'd really rather not have to copy and paste every query into a separate call where I want to page the results for obvious reasons. Is there an easier way to do this that can be reused across multiple queries?
Thanks.
In situations like this I sometimes return the total record count as a field in my result set from the db.
Basically you have only two options: write another query specifically for the count, or return it as a column in the results.
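A sketch of the second option, projecting the total alongside each row (I haven't verified how efficiently LINQ to SQL translates the inner Count(); it should become a subquery, but measure it, or fall back to two queries as in the next answer):

var query = db.Customers.Where(c => c.FirstName == "JimBob");
var page = query
    .Select(c => new { Customer = c, TotalCount = query.Count() })
    .Skip(0)
    .Take(10)
    .ToList();
int total = page.Count > 0 ? page[0].TotalCount : 0;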
Remember that LINQ provides deferred query execution...
var qry = from c in db.Customers
          where c.FirstName == "JimBob"
          select c;

int resultCount = qry.Count();
var results = qry.Skip(0).Take(10);
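To make that reusable across queries, the count-plus-page pattern can be wrapped in a small extension method (a sketch; PagedResult and ToPagedResult are illustrative names, not part of LINQ to SQL):

public class PagedResult<T>
{
    public int TotalCount { get; set; }
    public List<T> Items { get; set; }
}

public static class PagingExtensions
{
    public static PagedResult<T> ToPagedResult<T>(this IQueryable<T> source, int skip, int take)
    {
        return new PagedResult<T>
        {
            TotalCount = source.Count(),                    // issues one COUNT query
            Items = source.Skip(skip).Take(take).ToList()   // issues one page query
        };
    }
}

// Usage:
// var page = db.Customers.Where(c => c.FirstName == "JimBob").ToPagedResult(0, 10);
// Bind page.Items to the grid and page.TotalCount to the DataPager.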
