I am a beginner with Progress 4GL. I am confused by the following logic, especially how the index actually works.
I have added 2 fields to one index. As you can see below, I have written three queries.
Query 1 uses the index and matches on both indexed fields to retrieve the data.
Query 2 uses the same index but matches on only one of the indexed fields.
Query 3 uses the same indexed fields plus one non-indexed field.
define temp-table tt_creldata no-undo
field tt_cscx_order as character
field tt_cscx_part as character
field tt_cscx_shipfrom as character
index tt_cscx
tt_cscx_order
tt_cscx_part
.
**Query 1:**
find first tt_creldata use-index tt_cscx
where tt_cscx_order = "153"
and tt_cscx_part = "113" no-lock no-error.
**Query 2:**
find first tt_creldata use-index tt_cscx
where tt_cscx_order = "153" no-lock no-error.
**Query 3:**
find first tt_creldata use-index tt_cscx
where tt_cscx_order = "153"
and tt_cscx_part = "113"
and tt_cscx_shipfrom = "US" no-lock no-error.
Question 1: Which query gives the best performance?
Question 2: What happens if I don't use one of the indexed fields when I specify USE-INDEX?
Question 3: What happens if I add a non-indexed field when I specify USE-INDEX?
As a general rule of thumb, you should never use use-index.
The AVM will select one or more indexes to use for a query at compile time, and by forcing it to use one of your choosing, you are removing the possibility of this.
Having extra, possibly non-indexed, fields in your where clause will only affect the indexes chosen if you let the AVM choose (i.e. don't use use-index). This is also true if you don't use indexed fields in your query.
You can see which indexes are used if you compile the program with the xref or xref-xml options and look for the SEARCH entries.
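For example, a quick way to do that (a sketch - assume the code above is saved as checkidx.p; the file names are just for illustration):

COMPILE checkidx.p XREF checkidx.xrf.

Then open checkidx.xrf and look at the SEARCH lines: each one names the buffer and the index the compiler chose, and a WHOLE-INDEX flag on a SEARCH line tells you the bracket could not be narrowed and a full index scan will be done.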
As nwahmaet says, you should never use USE-INDEX. In this case it is especially pointless because there is only one index. In cases where there are multiple indexes a FIND statement will only use one of them no matter how complex the WHERE clause but the compiler will almost always do a better job picking an efficient index than you will. (The FOR EACH statement and its associated dynamic queries are capable of using multiple indexes. FIND is always limited to just one index.) In those rare cases where you think you are doing a better job you should thoroughly document why your choice is better and include detailed test cases and results.
All of your queries are using FIRST. This is necessary because your index is not defined as unique. That may be your intent but it seems unusual. And it means that in the event of duplicate records with the same key values you are magically making the "first" record more special than the others. Which is a data normalization faux pas (you are making "firstness" an attribute of the data) and a bug waiting to happen.
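If order + part really is meant to identify a single record, the unique variant would look something like this (a sketch - whether those two fields form the full business key is an assumption, the question doesn't say):

define temp-table tt_creldata no-undo
    field tt_cscx_order    as character
    field tt_cscx_part     as character
    field tt_cscx_shipfrom as character
    index tt_cscx is unique primary
        tt_cscx_order
        tt_cscx_part
    .

/* no FIRST and no USE-INDEX needed - the WHERE clause fully resolves the unique index */
find tt_creldata
     where tt_creldata.tt_cscx_order = "153"
       and tt_creldata.tt_cscx_part  = "113"
     no-lock no-error.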
FIND FIRST and USE-INDEX are often used together to (try to) cover up for each other's deficiencies. By specifying a particular index the FIRST becomes more consistent. Likewise, FIRST is often used to "cure" performance issues that arise from insufficient index definitions, inadequate WHERE clauses or choosing FIND when FOR EACH would have been more appropriate.
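For instance, if the real intent of query 1 is "process every matching record" rather than "grab one of them", a FOR EACH over the same temp-table (a sketch) needs neither FIRST nor USE-INDEX:

for each tt_creldata no-lock
    where tt_creldata.tt_cscx_order = "153"
      and tt_creldata.tt_cscx_part  = "113":
    /* process each matching record here */
end.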
None of these queries are going to perform notably faster than the others.
Query 2 may, or may not, return the same record as query 1. For instance, if there is a part = "112" then query 2 will have a different "first" record. But it will be just as fast to return as query 1.
Likewise, query 3 may have a different result depending on which records contain shipfrom = "US". In the best case, where the very first record with order = "153" and part = "113" also satisfies shipfrom = "US", it will be the same speed as the others.
However, query 3 might be a lot slower depending on how many records have to be scanned before one is found that has shipfrom = "US" since that field is not a part of any index and matching it will, therefore, require scanning records until one is found which matches. That might be the first record or it might be the 10 zillionth.
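If filtering on shipfrom is a regular access pattern (an assumption - the question doesn't say), the usual fix is simply to add it to the index so that all three equality matches can be satisfied by the index bracket instead of by scanning:

index tt_cscx
    tt_cscx_order
    tt_cscx_part
    tt_cscx_shipfrom
    .

With that definition, query 3 no longer has to read and discard records that fail the shipfrom test.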
Related
I've been reading the DynamoDB docs and was unable to work out whether it makes sense to query a Global Secondary Index using the 'contains' operator.
My problem is as follows: my DynamoDB document has a list of embedded objects, and every object has a 'code' field which is unique:
{
"entities":[
{"code":"entity1Code", "name":"entity1Name"},
{"code":"entity2Code", "name":"entity2Name"}
]
}
I want to be able to get all documents that contain entities with entity.code = X.
For this purpose I'm considering adding a Global Secondary Index that would contain all entity.codes present in the current DB document, separated by commas. So the example above would look like:
{
"entities":[
{"code":"entity1Code", "name":"entity1Name"},
{"code":"entity2Code", "name":"entity2Name"}
],
"entitiesGlobalSecondaryIndex":"entityCode1,entityCode2"
}
And then I would like to apply a filter expression on entitiesGlobalSecondaryIndex, something like: entitiesGlobalSecondaryIndex contains entityCode1.
Would this be efficient, or does using a global secondary index not make sense in this way, so that DynamoDB will simply check the condition against every document, which is essentially a scan?
Any help is very appreciated,
Thanks
The contains operator of a query cannot be run on a partition key. In order for a query to use any sort of operator (contains, begins with, >, <, etc.) you must have a range attribute, aka your sort key.
You can very well set up a GSI with some value as your PK and this code as your SK. However, GSIs are a replication of the table - there is a slight potential for the data in a GSI to lag behind that of the master copy. If the query you're running against this GSI isn't run very often, then you're probably safe from that.
However, if you are trying to do this to the entire table at once, then it's no better than a scan.
If what you need is a specific Code to return all its documents at once, then you could do a GSI with that as the PK. If you add a date field as the SK of this GSI it would even be time sorted. If you query against that code in that index, you'll get every single one of them.
Since you may have multiple codes, and if there aren't too many per document, you could maybe use a sparse index: if you have an entity with code "AAAA" then you also have an attribute named AAAA (or AAAAflag or something). It is always null/does not exist unless the entities list contains that code. If you do a GSI on this AAAAflag attribute, it will only contain documents that contain that entity code, and ignore all documents where this attribute does not exist. This may work for you if you can also provide a good PK on this to keep the numbers well partitioned and if you don't have too many codes.
Filter expressions, by the way, are different from all of the above. Filter expressions are run on the data that would be returned, after it is already read out of the table. This is useful if you have a multi-access-pattern setup but don't want a particular call to get all the documents associated with a particular PK - in the interest of keeping the data your code is working with concise. The query with a filter expression still retrieves everything from that query, but only presents what makes it past the filter.
If you are only querying against a particular PK at any given time and you want to know if it contains any entities of x, then a filter expression would work perfectly. Of course, this is only per PK and not for your entire table.
If all you need is numbers, then you could do a count attribute on the document, or a meta document on that partition that contains these values and could be queried directly.
Lastly, and I have no idea if this would work or not: if your entities attribute is a map type you might very well be able to filter against entities.code - and maybe even with entities.code.contains(value) if it were an SK - but I do not know whether this is possible.
I have a procedure that assigns values and sends them back. I need to implement a change so that it skips the assigning process whenever it finds a duplicate IBAN code. It would be in this FOR EACH - some kind of IF, or something else. Basically, when it finds an IBAN code that was already used and assigned, it should not assign it a second or third time. I am new to OpenEdge Progress, so it is still hard for me to understand the syntax correctly and write the code by myself. So if anyone could explain how I should implement this, or give any advice or tips, I would be very thankful.
FOR EACH viewpoint WHERE viewpoint.cif = cif.cif AND NOT viewpoint.close NO-LOCK:
    DEFINE VARIABLE cIban AS CHARACTER NO-UNDO.
    FIND FIRST paaa WHERE paaa.cif EQ cif.cif AND paaa.paaa = viewpoint.aaa AND NOT paaa.close NO-LOCK NO-ERROR.
    cIban = viewpoint.aaa.
    IF AVAILABLE paaa THEN DO:
        cIban = paaa.vaaa.
        CREATE tt_account_rights.
        ASSIGN
            tt_account_rights.iban = cIban.
    END.
END.
You have not shown the definition of tt_account_rights but assuming that "iban" is a uniquely indexed field in tt_account_rights you probably want something like:
DEFINE VARIABLE cIban AS CHARACTER NO-UNDO.
FOR EACH viewpoint WHERE viewpoint.cif = cif.cif AND NOT viewpoint.close NO-LOCK:
    FIND FIRST paaa WHERE paaa.cif EQ cif.cif AND paaa.paaa = viewpoint.aaa AND NOT paaa.close NO-LOCK NO-ERROR.
    cIban = viewpoint.aaa.
    IF AVAILABLE paaa THEN DO:
        cIban = paaa.vaaa.
        find tt_account_rights where tt_account_rights.iban = cIban no-error.
        if not available tt_account_rights then
        do:
            CREATE tt_account_rights.
            ASSIGN
                tt_account_rights.iban = cIban.
        end.
    END.
END.
Some bonus perspective:
1) Try to express elements of the WHERE clause as equality matches whenever possible. This is the most significant contributor to query efficiency. So instead of saying "NOT viewpoint.close" code it as "viewpoint.close = NO".
2) Do NOT automatically throw FIRST after every FIND. You may have been exposed to some code where that is the "standard". It is none the less bad coding. If the FIND is unique it adds no value (it does NOT improve performance in that case). If the FIND is not unique and you do as you have done above and assign a value from that record you are, effectively, making that FIRST record special. Which is a violation of 3rd normal form (there is now a fact about the record which is not related to the key, the whole key and nothing but the key). What if the 2nd record has a different iBan? What if different WHERE clauses return different "1st" records?
There are cases where FIRST is appropriate. The point is that it is not ALWAYS correct and it should not be added to every FIND statement without any thought about why you are putting it there and what the impact of that keyword really is.
3) It is clearer to put the NO-LOCK (or EXCLUSIVE-LOCK or SHARE-LOCK) immediately after the table name rather than towards the end of the statement. The syntax works either way but from a readability perspective it is better to have the lock phrase right by the table.
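Putting those three points together, the loop from the answer above might look something like this (a sketch; the duplicate check is expressed here with CAN-FIND, which also happens to be one of the legitimate uses of FIRST - a pure existence test - assuming iban were not uniquely indexed):

DEFINE VARIABLE cIban AS CHARACTER NO-UNDO.

FOR EACH viewpoint NO-LOCK                    /* lock phrase right next to the table name */
    WHERE viewpoint.cif   = cif.cif
      AND viewpoint.close = NO:               /* equality match instead of NOT viewpoint.close */

    /* FIRST is kept only because this key is assumed to be non-unique */
    FIND FIRST paaa NO-LOCK
         WHERE paaa.cif   = cif.cif
           AND paaa.paaa  = viewpoint.aaa
           AND paaa.close = NO
         NO-ERROR.

    cIban = viewpoint.aaa.

    IF AVAILABLE paaa THEN DO:
        cIban = paaa.vaaa.

        IF NOT CAN-FIND(FIRST tt_account_rights
                        WHERE tt_account_rights.iban = cIban) THEN DO:
            CREATE tt_account_rights.
            ASSIGN tt_account_rights.iban = cIban.
        END.
    END.
END.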
What is the meaning of FOR EACH and FOR FIRST? Example below:
FOR EACH <db> NO-LOCK,
FIRST <db> OF <db> NO-LOCK:
DISPLAY ..
Also, why do we need to use NO-LOCK for every table every time?
Let's answer by giving an example based on the Progress demo DB:
FOR EACH Customer WHERE Customer.Country = "USA" NO-LOCK,
FIRST Salesrep NO-LOCK WHERE Salesrep.salesrep = Customer.Salesrep:
/* your code block */
END.
The FOR EACH Block is an iterating block (loop) that integrates data access (and a few more features like error handling and frame scoping if you want to go that far back).
So the code in "your code block" is executed for every Customer record matching the criteria and it also fetches the matching Salesrep records. The join between Customer and Salesrep is an inner join. So you'll only be processing Customers where the Salesrep exists as well.
FOR statement documentation (includes EACH and FIRST keywords)
NO-LOCK documentation
Google is your friend and documentation on packages is usually quite user-friendly.
Try not to ask questions that can be solved by a simple search on StackOverflow.
FOR EACH table
Selects a set of records and starts a block to process those records.
NO-LOCK means what it says, the records are retrieved from the database without any record locking. So you might get a "dirty read" (uncommitted data) and someone else might change the data while you are looking at that record.
That sounds awful but, in reality, NO-LOCK reads are almost always what you want to use. If you do need to update a NO-LOCK record you can just FIND CURRENT with a lock.
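For example, against the sports2000 demo database (a sketch):

find customer no-lock where customer.custNum = 1 no-error.

if available customer then do transaction:
    /* re-fetch the same record with a lock so it can be updated */
    find current customer exclusive-lock.
    customer.comments = "Reviewed " + string(today).
end.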
FOR EACH NO-LOCK can return large numbers of records in a single network message whereas the other lock types are one record at a time - this makes NO-LOCK quite a bit faster for many purposes. And even without the performance argument you probably don't want to be taking out large numbers of locks and preventing other users running inquiries all the time.
Your example lacks a WHERE clause so, by default, every record in the table is returned using the primary index. If you specify a WHERE clause you will potentially only have a subset of the data to loop through and the index selection may be impacted. You can also add a lot of other options like BY to specify sort order.
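For example, with a WHERE clause and a BY (sports2000 again, as a sketch):

for each customer no-lock
    where customer.country = "USA"
    by customer.name:
    display customer.custNum customer.name.
end.

Here the WHERE limits the result set to USA customers and the BY sorts them by name regardless of which index was used for selection.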
FOR FIRST is somewhat similar to FOR EACH except that you only return, at most, a single record, even if the WHERE clause is empty or would otherwise specify a larger result set. BUT BE CAREFUL - the "FIRST" is deceptive. Even if you specify a sort order using BY, the rule is "selection, then sorting". At most only one record gets selected, so the BY doesn't matter. The index dictated by the WHERE clause (or lack of a WHERE) determines the sort order. So if you request something like:
FOR FIRST customer NO-LOCK BY discount:
DISPLAY custNum name discount.
END.
You will fetch customer #1, not customer #41 as you might have expected. (Try the code above with the sports2000 database. Replace FIRST with EACH in a second run.)
FOR EACH table1 NO-LOCK,
FIRST table2 NO-LOCK OF table1:
or
FOR EACH customer NO-LOCK,
FIRST salesRep NO-LOCK OF customer:
DISPLAY custnum name customer.salesRep.
END.
Is a join. The OF is a shortcut telling the compiler to find fields that the two tables have in common to build an implied WHERE clause from. This is one of those "makes a nice demo" features that you don't want to use in real code. It obfuscates the relationship between the tables and makes your code much harder to follow. Don't do that. Instead write out the complete WHERE clause. Perhaps like this:
for each customer no-lock,
first salesRep no-lock where salesRep.salesRep = customer.salesRep:
display custnum name customer.salesRep.
end.
I have the following declaration for a collection:
TYPE T_TABLE1 IS TABLE OF TABLE_1%ROWTYPE INDEX BY BINARY_INTEGER;
tbl1_u T_TABLE1;
tbl1_i T_TABLE1;
This table will keep growing and, at the end, will be used in a FORALL loop to do an insert or update on TABLE_1.
Now there might be cases where I want to delete a certain element, so I am planning to create a procedure which will take the KEY (unique) and delete the matching element if that key is found.
PSEUDO CODE:
FOR i IN tbl1_u.FIRST .. tbl1_u.LAST
LOOP
    IF tbl1_u(i).key = key THEN
        tbl1_u.DELETE(i);
    END IF;
END LOOP;
My question is,
Once I delete a particular element, would the collection adjust automatically, i.e. would the index i be taken over by the next element, or would that particular index remain null/invalid and possibly give me an exception if I use it in a FORALL INSERT/UPDATE?
I don't think I can pass a TABLE_1%ROWTYPE object to a procedure - do I have to create a record type?
Any other tips regarding managing collections for bulk delete/update/insert would be appreciated. Remember, I would be dealing with 2 tables: if I am inserting/updating in TABLE_1 then I am deleting from TABLE_2, and vice versa.
Given that TABLE_1.KEY is unique you might consider using that as the index to your associative arrays. That way you can delete from the collections using the KEY value, which according to the pseudocode is available when doing the deletions. This would also save you having to iterate through the table to find the KEY you want, as the KEY would be the index - so your "deletion" pseudo-code would become:
tbl1_u.delete(key);
To answer your questions:
Since you're using associative arrays, when an element is deleted there is no "empty" space in the collection. The indexes for the elements, however, don't actually change. Therefore you need to use the collection.PRIOR and collection.NEXT methods to loop through the collection. But again, if you use the KEY value as the index you may not need to loop through the collections at all.
You can pass a TABLE_1%ROWTYPE as a parameter to a PL/SQL procedure or function.
You might want to consider using a MERGE statement which could handle doing the inserts and updates in one step. This might allow you to maintain only a single collection. Might be worth looking in to.
Share and enjoy.
I'm creating a caching system to take data from an SQLite database table using a sorted/filtered query and display it. The tables I'm pulling from can be potentially very large and, of course, I need to minimize impact on memory by only retaining a maximum number of rows in memory at any given time. This is easily done by using LIMIT and OFFSET to load only the records I need and update the cache as needed. Implementing this is trivial. The problem I'm having is determining where the insertion index is for a new record inserted into a particular query so I can update my UI appropriately. Is there an easy way to do this? So far the ideas I've had are:
Dump the entire cache, re-count the Query results (there's no guarantee the new row will be included), refresh the cache and refresh the entire UI. I hope it's obvious why that's not really desirable.
Use my own algorithm to determine whether the new row is included in the current query, whether it is included in the current cached results, and at what index it should be inserted if it's within the current cached scope. The biggest downfall of this approach is its complexity and the risk that my own sorting/filtering algorithm won't match SQLite's.
Of course, what I want is to be able to ask SQLite: Given 'Query A' what is the index of 'Row B', without loading the entire query results. However, so far I haven't been able to find a way to do this.
I don't think it matters but this is all occurring on an iOS device, using the objective-c programming language.
More Info
The query and subsequent cache are based on user input. Essentially the user can re-sort and filter (or search) to alter the results they're seeing. My reluctance to simply recreate the cache on insertions (and edits, actually) comes from wanting to provide a 'smoother' UI experience.
I should point out that I'm leaning toward option "2" at the moment. I played around with creating my own caching/indexing system by loading all the records in a table and performing the sort/filter in memory using my own algorithms. So much of the code needed to determine whether and/or where a particular record is in the cache is already there, so I'm slightly predisposed to use it. The danger lies in having a cache that doesn't match the underlying query. If I include a record in the cache that the query wouldn't return, I'll be in trouble and probably crash.
You don't need record numbers.
Save the values of the ordered field in the first and last records of the LIMITed query result.
Then you can use these to check whether the new record falls into this range.
In other words, assuming that you order by the Name field, and that the original query was this:
SELECT Name, ...
FROM mytab
WHERE some_conditions
ORDER BY Name
LIMIT x OFFSET y
then try to get at the new record with a similar query:
SELECT 1
FROM mytab
WHERE some_conditions
AND PrimaryKey = LastInsertedValue
AND Name BETWEEN CachedMin AND CachedMax
Similarly, to find out before (or after) which record the new record was inserted, start directly after the inserted record and use a limit of one, like this:
SELECT Name
FROM mytab
WHERE some_conditions
AND Name > MyInsertedName
AND Name BETWEEN CachedMin AND CachedMax
ORDER BY Name
LIMIT 1
This doesn't give you a number; you still have to check where the returned Name is in your cache.
Typically you'd expect a cache to be invalidated if there were underlying data changes. I think dropping it and starting over will be your simplest, maintainable solution. I would recommend it unless you have a very good reason.
You could write another query that just returned the row count (example below) to see if your cache should be invalidated. That would save recreating the cache when it did not change.
SELECT name,address FROM people WHERE area_code=970;
SELECT COUNT(rowid) FROM people WHERE area_code=970;
The information you'd need from sqlite to know when your cache was invalidated would require some rather intimate knowledge of how the query and/or index was working. I would say that is fairly high coupling.
Otherwise, you'd want to know where it was inserted with regards to the sorting. You would probably key each page on the sorted field. Delete anything greater than the insert/delete field. Any time you change the sorting you'd drop everything.
Something like the below would be a start if you were using C++. I realize you aren't doing C++, but hopefully it is evident as to what I'm trying to do.
#include <set>
#include <string>
#include <vector>

struct Person {
    std::string name;
    std::string addr;
};

struct Page {
    std::string key;
    std::vector<Person> persons;

    struct Less {
        bool operator()(const Page &lhs, const Page &rhs) const {
            return lhs.key.compare(rhs.key) < 0;
        }
    };
};

typedef std::set<Page, Page::Less> pages_t;
pages_t pages;

bool sql_insert(const Person &person); // assumed: performs the actual SQLite INSERT

void insert(const Person &person) {
    if (sql_insert(person)) {
        // find the first cached page whose key sorts at or after the new record's key
        pages_t::iterator drop_cache_start = pages.lower_bound(Page{person.name, {}});
        //... drop this page and everything after it
    }
}
You'd have to do some wrangling to get different datatypes of key to work nicely, but it's possible.
Theoretically you could just leave the pages out of it and only use the objects themselves. The database would no longer "own" the data though. If you only fill pages from the database, then you'll have less data consistency worries.
This may be a bit off topic, you aren't re-implementing views are you? It doesn't cache per se, but it isn't clear if that is a requirement of your project.
The solution I came up with is not exactly simple, but it's currently working well. I realized that the index of a record in a query statement is also the count of all its preceding records. What I needed to do was 'convert' all the ORDER statements in the query to a series of WHERE statements that would return only the preceding records and take a count of those records. It's trickier than it sounds (or maybe not... it sounds tricky). The biggest issue I had was making sure the query was, in fact, sorted in a way I could predict. This meant I needed to have an order column in the order parameters that was based off a column with unique values. So, whenever a user sorts on a column, I append to the statement another order parameter on a unique column (I used a "Modified Date Stamp") to break ties.
Creating the WHERE portion of the statement requires more than just tacking on a bunch of ANDs. It's easier to demonstrate. Say you have 3 Order columns: "LastName" ASC, "FirstName" DESC, and "Modified Stamp" ASC (the tie breaker). The WHERE statement would have to look something like this ('?' = record value):
WHERE
"LastName" < ? OR
("LastName" = ? AND "FirstName" > ?) OR
("LastName" = ? AND "FirstName" = ? AND "Modified Stamp" < ?)
Each set of WHERE parameters grouped together by parenthesis are tie breakers. If, in fact, the record values of "LastName" are equal, we must then look at "FirstName", and finally "Modified Stamp". Obviously, this statement can get really long if you're sorting by a bunch of order parameters.
There's still one problem with the above solution. Mathematical operations on NULL values always return false, and yet when sorting, SQLite sorts NULL values first. Therefore, in order to deal with NULL values appropriately you've got to add another layer of complication. First, all mathematical equality operations, =, must be replaced by IS. Second, all < operations must be nested with an OR IS NULL to include NULL values appropriately on the < operator. This turns the above operation into:
WHERE
("LastName" < ? OR "LastName" IS NULL) OR
("LastName" IS ? AND "FirstName" > ?) OR
("LastName" IS ? AND "FirstName" IS ? AND ("Modified Stamp" < ? OR "Modified Stamp" IS NULL))
I then take a count of the RowID using the above WHERE parameter.
It turned out easy enough for me to do mostly because I had already constructed a set of objects to represent various aspects of my SQL Statement which could be assembled to generate the statement. I can't even imagine trying to manipulate a SQL statement like this any other way.
So far, I've tested using this on several iOS devices with up to 10,000 records in a table and I've had no noticeable performance issues. Of course, it's designed for single record edits/insertions so I don't really need it to be super fast/efficient.