What is the difference when using brackets on the entire index? - OpenEdge

/* Method 1 */
FOR EACH customer NO-LOCK WHERE customer.name EQ 'John':
    DISPLAY customer.name.
END.

/* Method 2 */
FOR EACH customer WHERE (customer.name EQ 'John'):
    DISPLAY customer.name.
END.
Could you please explain how the compiler will act when brackets are used?

Your second example is NOT faster. It is actually slower because you failed to specify NO-LOCK. By default you will therefore get the record(s) with a SHARE-LOCK which will then need to be unlocked as each record goes out of scope. This extra work takes time and results in a slower query.
If you are connecting client/server it can be orders of magnitude slower because:
a) each record will require 3 network messages: one to ask for it, another to return it, and a third to unlock it.
b) NO-LOCK queries can pack multiple records into a response message. This is much more efficient than requesting and sending them one at a time, and since there is no lock, nothing needs to be unlocked.
(In your sample you are only getting one record so the difference is pretty small.)
"Index Brackets" are an entirely different concept from grouping sub-expressions with "(" and ")". An index bracket is a set of records specified by elements of the WHERE clause. In both of your examples above you have an equality match on a field which is the leading component of a unique index. So the "bracket" is exactly one record. (Assuming that this is the standard sports database!)
If you specify something more complicated like:
for each order no-lock
    where order.custid = 1
      and order.ship-date >= 7/1/2019
      and order.ship-date <= 7/31/2019:
    display order.custid order.ship-date.
end.
Progress uses a static, rule-based query optimizer. The indexes are chosen at compile time. The rules are complicated, but the most important, by far, is that equality matches on leading components keep your index in play. Range matches are next most useful, but once you get a range match no further fields will be considered for index selection.
There can be cases where using parentheses to group elements of the WHERE clause has an impact on index selection. (The case where there is only one field being selected on is not one of them.) Generally this is going to be a situation where you have a complex query with a mix of AND and OR elements. If nothing else, the use of parentheses in these situations makes operator precedence much less error prone and easier for a human to read.
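If it helps to see that rule spelled out, here is a toy model in Python (not ABL, and not the real optimizer; it only illustrates "equality keeps the index in play, a range match is used but ends consideration"):

def usable_prefix(index_fields, predicates):
    """predicates maps field name -> 'equality' or 'range'."""
    used = []
    for field in index_fields:
        kind = predicates.get(field)
        if kind == "equality":
            used.append((field, kind))   # equality: keep walking the index
        elif kind == "range":
            used.append((field, kind))   # range: usable, but stop here
            break
        else:
            break                        # no predicate: later fields can't help
    return used

# Index on (custid, ship-date), as in the order query above:
print(usable_prefix(["custid", "ship-date"],
                    {"custid": "equality", "ship-date": "range"}))
# -> [('custid', 'equality'), ('ship-date', 'range')]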
There is a lot of material on the index selection process available. I suggest that you start here: https://documentation.progress.com/output/ua/OpenEdge_latest/index.html#page/wp-abl-triggers/general-rules-for-choosing-a-single-index.html
There are also excellent presentations at every PUG Challenge on the topic. If you cannot attend in person (you should), many of them are available for download: http://pugchallenge.org/downloads.html

Related

DynamoDB top item per partition

We are new to DynamoDB and struggling with what seems like it would be a simple task.
It is not actually related to stocks (it's about recording machine results over time) but the stock example is the simplest I can think of that illustrates the goal and problems we're facing.
The two query scenarios are:
All historical values of given stock symbol <= We think we have this figured out
The latest value of all stock symbols <= We do not have a good solution here!
Assume that updates are not synchronized; e.g. the moment of the last update record for TSLA may be different than for AMZN.
The 3 attributes are just { Symbol, Moment, Value }. We could make the hash_key Symbol, range_key Moment, and believe we could achieve the first query easily/efficiently.
We also assume we could get the latest value for a single, specified Symbol following https://stackoverflow.com/a/12008398
The SQL solution for getting the latest value for each Symbol would look a lot like https://stackoverflow.com/a/6841644
But... we can't come up with anything efficient for DynamoDB.
Is it possible to do this without either retrieving everything or making multiple round trips?
The best idea we have so far is to somehow use update triggers or streams to track the latest record per Symbol and essentially keep that cached. That could be in a separate table or the same table with extra info like a column IsLatestForMachineKey (effectively a bool). With every insert, you'd grab the one where IsLatestForMachineKey=1, compare the Moment and if the insertion is newer, set the new one to 1 and the older one to 0.
This is starting to feel complicated enough that I question whether we're taking the right approach at all, or maybe DynamoDB itself is a bad fit for this, even though the use case seems so simple and common.
There is a way that is fairly straightforward, in my opinion.
Rather than using a GSI, just use two tables with (almost) the exact same schema. The hash key of both should be symbol. They should both have moment and value. Pick one of the tables to be stocks-current and the other to be stocks-historical. stocks-current has no range key. stocks-historical uses moment as a range key.
Whenever you write an item, write it to both tables. If you need strong consistency between the two tables, use the TransactWriteItems API.
If your data might arrive out of order, you can add a ConditionExpression to prevent newer data in stocks-current from being overwritten by out of order data.
The read operations are pretty straightforward, but I’ll state them anyway. To get the latest value for everything, scan the stocks-current table. To get historical data for a stock, query the stocks-historical table with no range key condition.
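Here is a minimal boto3 sketch of the write path (the table names follow the answer above; the attribute names and the exact condition are assumptions to adapt). One trade-off to note: with TransactWriteItems a failed condition cancels the whole transaction, so an out-of-order reading would not reach stocks-historical either; if that matters, do the two writes separately.

import boto3

dynamodb = boto3.client("dynamodb")

def put_stock_value(symbol, moment, value):
    # One reading goes to both tables atomically.
    item = {
        "symbol": {"S": symbol},
        "moment": {"S": moment},   # ISO-8601 strings sort chronologically
        "value": {"N": str(value)},
    }
    dynamodb.transact_write_items(
        TransactItems=[
            {"Put": {"TableName": "stocks-historical", "Item": item}},
            {"Put": {
                "TableName": "stocks-current",
                "Item": item,
                # Only overwrite if the row is new or this reading is at least
                # as recent, so out-of-order data can't clobber the latest value.
                "ConditionExpression":
                    "attribute_not_exists(#s) OR #m <= :m",
                "ExpressionAttributeNames": {"#s": "symbol", "#m": "moment"},
                "ExpressionAttributeValues": {":m": {"S": moment}},
            }},
        ]
    )

# Latest value of all symbols: scan the small stocks-current table.
# All history for one symbol: query stocks-historical on symbol alone.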

Search query to find documents that have multiple elements

I have a few XML documents in marklogic which have the structure
<abc:doc>
  <abc:doc-meta>
    <abc:meetings>
      <abc:meeting>
      </abc:meeting>
      <abc:meeting>
      </abc:meeting>
    </abc:meetings>
  </abc:doc-meta>
</abc:doc>
We can have more than one <abc:meeting> element under the <abc:meetings> element.
I am trying to write a cts:search query to get only documents that have more than one <abc:meeting> element in the document.
Please advise
This is tricky. Ideally, you'd want to drive searches from indexes for best performance. Unfortunately, MarkLogic doesn't keep track of element counts in its universal index, and aggregating counts from a range index can be cumbersome.
The overall simplest solution would be to add a count attribute on abc:meetings, and then add a range index on that. It does mean you'd have to change your data, and you'd have to keep that attribute in sync with each change.
You could also just search on the presence of abc:meeting with cts:element-query(), and append an XPath predicate to count the number of elements afterwards. Something like:
cts:search(
collection(),
cts:element-query(xs:QName('abc:meeting'), cts:true-query())
)[count(.//abc:meeting) > 1]
If not many documents contain meetings, this might work fairly well for you, but it still requires pulling up all documents containing meetings, so it could be expensive.
I played with the thought of leveraging cts:near-query(), but that is driven by word positions, so it depends on the actual number of tokens inside a meeting. If that were always an exact number of tokens (unlikely, I'd guess), you could use the minimal-distance option on a double cts:element-query() wrapped in a cts:near-query(). It might help optimize the previous option a little, though.
The most performant option I can think of right now involves adding a user-defined aggregate function (UDF). It unfortunately means compiling C++ code. I happen to have written such a UDF in the past that you should be able to use as-is after compilation and installation. For details, see:
https://github.com/grtjn/doc-count-udf
and
http://docs.marklogic.com/guide/app-dev/aggregateUDFs
HTH!
It boils down to how many "a few" is. If it's thousands or fewer, then what grtjn presents above (a cts:search plus an XPath expression) will work fine. If it's more, I'd add the count attribute to abc:meetings and then use a pre-commit trigger (e.g. on the collection of these documents) to ensure that the count attribute value is kept in sync. You'd need a range index to be able to query for "documents that have a count of meetings of 2 or greater".
Of course, if all you need to query on is whether there's more than one meeting, then just add a "multiple" attribute to abc:meetings with a value of "true". Then you don't need a range index - you can do a cts:element-attribute-value-query on abc:meetings and multiple="true".

Check if record exists - performance

I have a table that will contain large amounts of data. The purpose of this table is user transactions.
I will be inserting into this table from a web-service, which a third party will be calling, frequently.
The third party will be supplying a reference code (most probably a string).
The requirement here is that I will need to check whether this reference code has already been inserted. If it exists, just return the details and do nothing else. If it doesn't, create the transaction as expected. The reasoning behind this is the possibility of loss of communication with the service after the request is received.
I have some performance concerns with this, as the search will be done on a string value, and also on a large table. Most of the time the transaction will not exist in the database, as this is just a precaution.
I am not asking for code here, but for the best approach for performance.
As your subject indicates, if you are evaluating EXISTS (SELECT 1 FROM SomeTable ...) then there will not be much of a performance penalty. The inner query does not actually materialize a bunch of 1s; the engine only needs to find a first match to evaluate the result to a boolean.
The other aspect is a non-clustered index on the reference code field. If the reference code is a fixed-length string (say, CHAR(50)), the B-tree will also be optimal.
I am not sure about your data consistency requirements, but I expect the normal READ COMMITTED isolation level will not do any harm unless you have highly transactional reads and writes.
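For illustration, here is the shape of that check in Python with sqlite3 (the schema is made up; the point is the unique index on the reference code, which makes the lookup a cheap B-tree seek and catches races between check and insert):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        id       INTEGER PRIMARY KEY,
        ref_code TEXT NOT NULL,
        details  TEXT
    )
""")
# The unique index is what keeps the existence check fast on a large table.
conn.execute("CREATE UNIQUE INDEX ux_ref_code ON transactions (ref_code)")

def record_transaction(ref_code, details):
    """Return existing details if ref_code was seen before, else insert."""
    row = conn.execute(
        "SELECT details FROM transactions WHERE ref_code = ?",
        (ref_code,),
    ).fetchone()
    if row is not None:
        return row[0]    # duplicate request: return details, do nothing else
    # A concurrent duplicate would raise sqlite3.IntegrityError here thanks
    # to the unique index; production code should catch it and re-read.
    conn.execute(
        "INSERT INTO transactions (ref_code, details) VALUES (?, ?)",
        (ref_code, details),
    )
    conn.commit()
    return details

print(record_transaction("REF-001", "first call"))   # inserts
print(record_transaction("REF-001", "retry"))        # returns "first call"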

When to include an index (automated heuristic)

I have a piece of software which takes in a database, and uses it to produce graphs based on what the user wants (primarily queries of the form SELECT AVG(<input1>) AS x, AVG(<input2>) AS y FROM <input3> WHERE <key> IN (<vals..>) AND ...). This works nicely.
I have a simple script that is passed an (often large) number of files, each describing a row:
name=foo
x=12
y=23.4
....... etc.......
The script goes through each file, saving the variable names, and an INSERT query for each. It then loads the variable names, sort | uniq's them, and makes a CREATE TABLE statement out of them (sqlite, amusingly enough, is ok with having all columns be NUMERIC, even if they actually end up containing text data). Once this is done, it then executes the INSERTS (in a single transaction, otherwise it would take ages).
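A condensed Python sketch of that flow (the table name data is hypothetical; every column is typed NUMERIC exactly as described):

import sqlite3

def load(db_path, paths):
    rows, columns = [], set()
    for path in paths:
        row = {}
        with open(path) as f:
            for line in f:
                if "=" in line:
                    name, _, value = line.partition("=")
                    row[name.strip()] = value.strip()
        columns.update(row)
        rows.append(row)
    cols = sorted(columns)                          # the sort | uniq step
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE data (%s)"
                 % ", ".join('"%s" NUMERIC' % c for c in cols))
    with conn:                                      # one transaction for all INSERTs
        conn.executemany(
            "INSERT INTO data VALUES (%s)" % ", ".join("?" * len(cols)),
            ([row.get(c) for c in cols] for row in rows),
        )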
To improve performance, I added a basic index on each column. However, this increases database size somewhat significantly, and only provides a moderate improvement.
Data comes in three basic types:
single value, indicating things like program version, etc.
a few values (<10), indicating things like input parameters used
many values (>1000), primarily output data.
The first type obviously shouldn't need an index, since it will never be sorted upon.
The second type should have an index, because it will commonly be filtered by.
The third type probably shouldn't need an index, because it will be used in output.
It would be annoying to determine which type a particular value is before it is put in the database, but it is possible.
My question is twofold:
Is there some hidden cost to extraneous indexes, beyond the size increase that I have seen?
Is there a better way to index for filtration queries of the form WHERE foo IN (5) AND bar IN (12,14,15)? Note that I don't know which columns the user will pick, beyond that they will be type 2 columns.
Read the relevant documentation:
Query Planning;
Query Optimizer Overview;
EXPLAIN QUERY PLAN.
The most important thing for optimizing queries is avoiding I/O, so tables with fewer than ten rows should not be indexed: all the data fits into a single page anyway, and an index would just force SQLite to read another page.
Indexes are important when you are looking up records in a big table.
Extraneous indexes make table updates slower, because each index needs to be updated as well.
SQLite can use at most one index per table in a query.
This particular query could be optimized best by having a single index on the two columns foo and bar.
However, creating such indexes for all possible combinations of lookup columns is most likely not worth the effort.
If the queries are generated dynamically, the best idea probably is to create one index for each column that has good selectivity, and rely on SQLite to pick the best one.
And don't forget to run ANALYZE.
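A quick way to see which index SQLite picks, sketched with Python's sqlite3 (column names follow the example query above; the index name is made up):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (foo NUMERIC, bar NUMERIC, x NUMERIC, y NUMERIC)")
# A composite index covering both filter columns of the example query.
conn.execute("CREATE INDEX idx_foo_bar ON t (foo, bar)")
conn.execute("ANALYZE")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT AVG(x) AS x, AVG(y) AS y FROM t "
    "WHERE foo IN (5) AND bar IN (12, 14, 15)"
).fetchall()
for row in plan:
    print(row)   # expect: SEARCH t USING INDEX idx_foo_bar (foo=? AND bar=?)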

Formulas to generate a unique id?

I would like to get a few ideas on generating unique IDs without using a GUID. Preferably I would like the unique value to be of type int32.
I'm looking for something that can be used for database primary key as well as being url friendly.
Can these be considered unique?
(int)DateTime.Now.Ticks
(int)DateTime.Now * RandomNumber
Any other ideas?
Thanks
EDIT: Well, I am trying to practice Domain-Driven Design and all my entities need to have an ID upon creation to be valid. I could in theory call into the DB to get an auto-incremented number, but would rather steer clear of this as DB-related stuff is getting into the Domain.
It depends on how unique you need it to be and how many items you need to give IDs to. Your best bet may be assigning them sequentially; if you try to get fancy you'll likely run into the Birthday Paradox (collisions are more likely than you might expect) or (as in your case 1 above) be forced to limit the rate at which you can issue them.
Your 1) above is a little better than your 2) for most cases; it's rate limited (you can't issue more than 1 ID per tick) but not susceptible to the Birthday Paradox. Your 2) is just throwing bits away. It might be slightly better to XOR with the random number, but in any case I don't think the rand is buying you anything; it just hides the problem and makes it harder to fix.
Are these considered Globally Unique?
1) (int)DateTime.Now.Ticks
2) (int)DateTime.Now * RandomNumber
Neither option is globally unique.
Option 1 - This is only unique if you can guarantee no more than one ID is generated per tick. From your description, it does not sound like this would work.
Option 2 - Random numbers are pseudo random, but not guaranteed to be unique. With that already in mind, we can reduce the DateTime portion of this option to a similar problem to option 1.
If you want a globally unique ID that is an int32, one good way would be a synchronous service of some sort that returns sequential IDs. I guess it depends on what your definition of global is. If you had something larger than an int32 to work with, and you mean global on a given network, then maybe you could use the IP address with a sequence number appended, where the sequence number is generated synchronously across processes.
If you have other unique identifiers besides IP address, then that would obviously be a better choice for displaying as part of a URL.
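As a sketch of that sequential-service idea (in-process Python for illustration; a real version would live behind a network service or in the database so all processes share the counter):

import itertools
import threading

class SequentialIdGenerator:
    """Hands out int32 IDs; unique only within this single process."""
    def __init__(self, start=1):
        self._counter = itertools.count(start)
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:                 # safe across threads
            n = next(self._counter)
        if n > 0x7FFFFFFF:               # stay within int32
            raise OverflowError("int32 ID space exhausted")
        return n

gen = SequentialIdGenerator()
print(gen.next_id())  # 1
print(gen.next_id())  # 2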
You can use the RNGCryptoServiceProvider class if you are using .NET:
RNGCryptoServiceProvider Class