In ActivePivot, what is the most efficient way to configure DISTINCT COUNT aggregation?
For instance, suppose I want to configure a measure that, for each cell, returns the number of distinct products contributing to that cell.
As ActivePivot supports the MDX language, you can do it in MDX. Here is an example that defines an MDX calculated member counting the distinct desks contributing to a cell. (This query runs on the ActivePivot Sandbox sample application.)
WITH
Member [Measures].[Desk Count] AS Count(
Descendants(
[Bookings].[Desk].CurrentMember,
[Bookings].[Desk].[Desk]
),
EXCLUDEEMPTY
)
SELECT NON EMPTY Hierarchize(
DrilldownLevel(
[Underlyings].[Products].[ALL].[AllMember]
)
) ON ROWS
FROM [EquityDerivativesCube]
WHERE [Measures].[Desk Count]
But the most efficient way is to use a post processor, because post processors run in the core aggregation engine of ActivePivot, while the MDX engine operates at a higher layer. The "LEAF_COUNT" post processor is designed for exactly that purpose. Here is how you would declare it in the Sandbox application:
<postProcessor name="DeskCount" pluginKey="LEAF_COUNT" formatter="LONG[#,###]">
<properties>
<entry key="leafLevels" value="Desk#Desk#Bookings" />
</properties>
</postProcessor>
As the post processor must be declared in the cube configuration, it is not as flexible as the MDX solution, which a user can apply to any hierarchy at the last minute. But it is more performant, especially for hierarchies with large cardinalities.
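Once declared, the post processor behaves like any other measure in queries. A minimal sketch, assuming the post processor is exposed under its declared name (DeskCount):
SELECT
NON EMPTY Hierarchize(
DrilldownLevel(
[Underlyings].[Products].[ALL].[AllMember]
)
) ON ROWS,
[Measures].[DeskCount] ON COLUMNS
FROM [EquityDerivativesCube]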
In XQuery on MarkLogic, how do you sort dynamically?
let $sortelement := 'Salary'
for $doc in collection('employee')
order by $doc/$sortelement
return $doc
PS: The sorting will change based on user input, e.g. date or name in place of Salary.
If Salary is the name of an element, then you could select any element generically in the XPath with * and apply a predicate filter testing whether its local-name() matches the variable for the selected element, $sortelement:
let $sortelement := 'Salary'
for $doc in collection('employee')
order by $doc/*[local-name() eq $sortelement]
return $doc
This manner of sorting all items in the collection may work with a smaller number of documents, but if you are working with hundreds of thousands or millions of documents, you may find that pulling back all docs is either slow or blows out the Expanded Tree Cache.
A more efficient solution is to create range indexes on the elements you intend to sort on, and then perform a search with options specified to order the results by cts:index-order with an appropriate reference to the indexed item, such as cts:element-reference(), cts:json-property-reference(), or cts:field-reference().
For example:
let $sortelement := 'Salary'
return
cts:search(doc(),
cts:collection-query("employee"),
cts:index-order(cts:element-reference(xs:QName($sortelement)))
)
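If the range index does not exist yet, it can be added in the Admin UI or scripted through the Admin API. A hedged sketch, assuming a string range index on the Salary element in the empty namespace and a database named "Documents" (both names are assumptions):
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
(: describe a string range index on the Salary element :)
let $index := admin:database-range-element-index(
    "string", "", "Salary",
    "http://marklogic.com/collation/", fn:false())
return admin:save-configuration(
    admin:database-add-range-element-index(
        $config, xdmp:database("Documents"), $index))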
Not recommended, because the chances of introducing security issues, runtime crashes and just 'bad results' are much higher and more difficult to control --
BUT available as a last resort.
ALL XQuery can be dynamically created as a string and then evaluated using xdmp:eval.
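A minimal sketch of that last-resort approach (the element name is concatenated straight into the query string, which is exactly where the injection risk comes from):
let $sortelement := 'Salary'
let $query := concat(
    'for $doc in collection("employee") ',
    'order by $doc/', $sortelement, ' ',
    'return $doc')
(: evaluates the dynamically built string -- sanitize $sortelement first :)
return xdmp:eval($query)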
Much better to follow the guidance of Mads and use the search APIs instead of XQuery FLWOR expressions -- note that these APIs actually 'compile down' to a data structure. This is what the 'cts constructors' do: https://docs.marklogic.com/cts/constructors
I find it helps to think of cts searches as a structured search described by data -- the cts:xxx functions are simply helpers that create the data structure.
(They don't actually do any searching; they build up a data structure that is used to do the searching.)
If you look at the source of the search:xxx APIs you can see how this is done.
I have a query that uses WHERE id IN (1,2,3,...) where the list (1,2,3,...) is dynamically generated from an array of integers (not using parameters). Now I have a particular query that takes roughly 500ms with 26623 ids but 50s (100x slower) with 26624 ids.
I couldn't find anything that looks related in https://sqlite.org/limits.html
SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params(search) AS params
JOIN flows ON flows.request_id = req.id
WHERE search NOT IN ('', '?')
AND flows.id IN (1,2,3) /* <=== here more than 26623 IDs make it super slow */
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC
Before I try to make this reproducible in isolation (e.g. search_params is a custom virtual table), does anyone know what limitation I might be running into? It's not the number of IDs per se, since a different query runs just fine with the same IDs.
SQLite version 3.36.0 via better-sqlite3 (Node.js) with a readonly database. The only pragma I use is journal_mode = WAL.
Compiled with (https://github.com/JoshuaWise/better-sqlite3/blob/master/docs/compilation.md#bundled-configuration):
SQLITE_DQS=0
SQLITE_LIKE_DOESNT_MATCH_BLOBS
SQLITE_THREADSAFE=2
SQLITE_USE_URI=0
SQLITE_DEFAULT_MEMSTATUS=0
SQLITE_OMIT_DEPRECATED
SQLITE_OMIT_GET_TABLE
SQLITE_OMIT_TCL_VARIABLE
SQLITE_OMIT_PROGRESS_CALLBACK
SQLITE_OMIT_SHARED_CACHE
SQLITE_TRACE_SIZE_LIMIT=32
SQLITE_DEFAULT_CACHE_SIZE=-16000
SQLITE_DEFAULT_FOREIGN_KEYS=1
SQLITE_DEFAULT_WAL_SYNCHRONOUS=1
SQLITE_ENABLE_MATH_FUNCTIONS
SQLITE_ENABLE_DESERIALIZE
SQLITE_ENABLE_COLUMN_METADATA
SQLITE_ENABLE_UPDATE_DELETE_LIMIT
SQLITE_ENABLE_STAT4
SQLITE_ENABLE_FTS3_PARENTHESIS
SQLITE_ENABLE_FTS3
SQLITE_ENABLE_FTS4
SQLITE_ENABLE_FTS5
SQLITE_ENABLE_JSON1
SQLITE_ENABLE_RTREE
SQLITE_ENABLE_GEOPOLY
SQLITE_INTROSPECTION_PRAGMAS
SQLITE_SOUNDEX
HAVE_STDINT_H=1
HAVE_INT8_T=1
HAVE_INT16_T=1
HAVE_INT32_T=1
HAVE_UINT8_T=1
HAVE_UINT16_T=1
HAVE_UINT32_T=1
Here's the answer from the SQLite forums. Essentially this is a combination of how the query planner handles IN literals and what cost estimates my virtual table reports. That means I'm hitting exactly the point where the query planner starts making a different decision.
The SQLite NGQP is a cost-based query planner. The IN () operator with a list of literal values gets implemented as a kind of temporary table; sometimes SQLite decides to create an index and do lookups, other times it decides to use that table as the outermost loop of the query.
EXPLAIN QUERY PLAN should show that in a more concise manner.
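For reference, that just means prefixing the query above (a sketch; the plan output shows whether flows is searched by the ID list or scanned as the outer loop):
EXPLAIN QUERY PLAN
SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params(search) AS params
JOIN flows ON flows.request_id = req.id
WHERE search NOT IN ('', '?')
AND flows.id IN (1, 2, 3)
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC;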
If compiled in DEBUG mode with WHERETRACE enabled, the .wheretrace command will show how the SQLite NGQP reaches its plan. Essential input is the return values from the xBestIndex method of your virtual table, especially the "number of rows" and the "estimated cost". It is paramount to deliver accurate estimates. Cost should reflect processing cost relative to SQLite native tables.
Note that you can name the IN table by making it a CTE and CROSS JOIN to force the query plan that works fast.
https://sqlite.org/forum/forumpost/a3d68ed8b40cf583?t=h
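A hedged sketch of that CTE + CROSS JOIN shaping for the query above (in SQLite, CROSS JOIN is a hint that forbids reordering, so the named ID list stays the outermost loop):
WITH ids(id) AS (VALUES (1), (2), (3))
SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM ids
CROSS JOIN flows ON flows.id = ids.id
JOIN view_requests AS req ON req.id = flows.request_id, search_params(search) AS params
WHERE search NOT IN ('', '?')
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC;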
The workaround I use is json_each: I serialize the array of integers into a JSON string and bind that. In my particular use case this has some other benefits as well (e.g. I can bind a single parameter and reuse the query with any number of IDs), so I don't mind doing that:
SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params(search) AS params
JOIN flows ON flows.request_id = req.id
WHERE search NOT IN ('', '?')
-AND flows.id IN (1,2,3)
+AND flows.id IN (SELECT value FROM json_each('[1,2,3]'))
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC
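With better-sqlite3 that binding looks roughly like this (a sketch; db is an open Database and ids is the integer array from the question):
// prepare once, reuse with any number of IDs
const stmt = db.prepare(`
  SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
  FROM view_requests AS req, search_params(search) AS params
  JOIN flows ON flows.request_id = req.id
  WHERE search NOT IN ('', '?')
    AND flows.id IN (SELECT value FROM json_each(?))
  GROUP BY params.name
  ORDER BY json_array_length("values") DESC, params.name ASC
`);
// the whole ID list travels as a single JSON string parameter
const rows = stmt.all(JSON.stringify(ids));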
I also know that the generic virtual table implementation of better-sqlite3 makes a trade-off between being easy to use (it's ridiculously easy) and achieving maximum performance.
I have a DynamoDB instance with a partition key and sort key. Let's say that they are organisation (hash key) and employee id (sort key).
I want to retrieve all employees whose IDs are in a list. They all work for the same organisation, but they are not all of the employees of that organisation.
In SQL I'd do something like:
select * from table where organisation_id = 'org' and employee_id in [list of ids]
There does not seem to be an equivalent in DynamoDB.
My choices seem to be:
1) Iterate over all employee IDs using a Query OR
2) Use BatchGetItem and provide organisation_id:employee_id for all items
The first seems like it will be slower, as it involves multiple requests, while the second is a single request but may consume more RCUs.
Which of these is the preferred solution to this problem? Or am I missing a better third way?
I would iterate your list using GetItem, adding each employee found to a collection. This approach isn't slow - DynamoDB is designed specifically for getting lots of items fast using their keys.
There is no need to use a Query, as you have both the partition key and the range key. You would only use a Query if, say, you wanted all employees of one organisation.
If your list is particularly large you could use BatchGetItem, which will create multiple parallel threads and therefore reduce latency. You won't find much of a difference, though, unless you have a lot of items to get.
By the way, DynamoDB does have an 'IN' operator, but you can't use it in KeyConditions.
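For illustration, a minimal BatchGetItem sketch with the AWS SDK for JavaScript v3 (the table name and key attribute names are assumptions; BatchGetItem accepts at most 100 keys per request, so longer lists need chunking, and production code should retry any UnprocessedKeys):
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, BatchGetCommand } from "@aws-sdk/lib-dynamodb";

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}));

async function getEmployees(organisationId: string, employeeIds: string[]) {
  // one key pair (partition key + sort key) per employee to fetch
  const result = await client.send(new BatchGetCommand({
    RequestItems: {
      employees: {
        Keys: employeeIds.map((id) => ({
          organisation_id: organisationId,
          employee_id: id,
        })),
      },
    },
  }));
  return result.Responses?.employees ?? [];
}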
I want to use a rowset variable as a scalar variable.
#cnt = Select count(*) from #tab1;
If (#cnt > 0) then
#cnt1= select * from #tab2;
End;
Is it possible?
======================================
I want to gate complex U-SQL code on some condition, let's say based on some control table. In my original code I wrote 10-15 U-SQL statements, and I want to wrap them in the IF statement. I don't want to use a cross join, because then the engine again starts trying to join the tables, and there is no significant saving in execution time. The point of the IF statement is that if the condition is not met, the complete piece of code should not execute. Is that possible?
To add to wBob's and Alex's answers:
U-SQL does not provide data driven control flow within a script. The current IF statement requires the expression to be evaluated at compile time.
Consider a U-SQL script as just a single declarative query. So you have the following options:
1) Express your problem with relational expressions. This means that you will have to write a (cross) join to guard the execution. If you feel that the query optimizer does a bad job at optimizing such guards (e.g., it evaluates the expensive side of the join before the cheap guard), please report an issue and we will take a look.
2) Split your script into several scripts and look at the result of each script before doing your next step. This is a form of orchestration that you can do with ADF or by writing your own orchestration with Powershell or any of the SDKs. The caveat here is that you will have to write intermediate results into files and download the files into your orchestration layer.
Having said this, it theoretically is possible to extend the language algebra with a "don't execute the remaining part of this operator tree if a condition is not satisfied" operator. However, that is a major work item and can lead to very large query plans during compilation that may go beyond the current limits. If you feel that neither 1 nor 2 above is sufficient for your scenario, please add your vote to https://feedback.azure.com/forums/327234-data-lake/suggestions/17635906-please-add-dynamic-if-evaluation-to-u-sql.
#cnt1 =
SELECT #tab2.*
FROM #tab2
CROSS JOIN (SELECT COUNT(*) AS cnt FROM #tab1) AS c
WHERE c.cnt > 0;
(Adding explanation) CROSS JOIN returns the Cartesian product of all rows from #tab2 and the single row generated by the COUNT query. The WHERE condition then ensures that the result of the query is all rows from #tab2 if COUNT(*) > 0, and no rows otherwise.
When I try to filter the CustAccount field on CustTableListPage, it takes too long to filter. On the other fields there is no latency. I am filtering on just part of the account number, like "*123".
I have reindexed CustTable and also updated statistics, but there is no appreciable difference at all.
When I added the list page's query to a view, it filters the CustAccount field as quickly as the other fields.
Any suggestion?
Edit:
Our version is AX 2012 R2 CU8. It is not a user-specific problem; it occurs for every user. The interaction class has some customizations, but just for setting some buttons' enabled/disabled properties, etc. I tried to look at the query execution; what I found is not clear, something like FETCH_API_CURSOR_000000..x.
Record a trace of this execution and locate the bottleneck.
Keep in mind that wildcards (such as *) have to be used with care. Using a filter string that starts with a wildcard kills all performance, because the SQL indexes cannot be used.
Using a wildcard at the end
Imagine that you have a dictionary and have to list all the words starting with 'Foo'. You can skip all entries before 'F', then all those before 'Fo', then all those before 'Foo', and start your result list from there.
Similarly, asking the underlying SQL engine to list all CustAccount entries starting with '123' (= filter string '123*') allows using an index on CustAccount to quickly skip to the relevant data.
Using a wildcard at the start
Imagine that you still have that dictionary and have to list all the words ending with 'ing'. You would have no choice other than going through the entire dictionary and checking the ending of every word (due to the alphabetical sorting).
This explains why asking the SQL engine to list all CustAccount entries ending with '123' (= filter string '*123') means that all CustAccount values must be investigated. So the AOS loops through all the entries and uses an SQL cursor to do this. That is the FETCH_API_CURSOR statement you see on the SQL level.
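The same effect is easy to see in plain SQL against the AX database (a sketch on the standard CustTable schema; only the first predicate is sargable and can seek on an ACCOUNTNUM index):
-- trailing wildcard: index seek, fast
SELECT ACCOUNTNUM FROM CUSTTABLE WHERE ACCOUNTNUM LIKE '123%';
-- leading wildcard: every ACCOUNTNUM must be examined, full scan
SELECT ACCOUNTNUM FROM CUSTTABLE WHERE ACCOUNTNUM LIKE '%123';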
Possible solutions
Educate your end user that using a wildcard at the beginning of a filter string will always be slow on a large table.
Step up the SQL server hardware / allocated resources (faster CPU, more RAM, faster disk, ...).
Create a full text index on CustAccount (not a fan of this one and performance impact should be thoroughly investigated).
I've solved the problem. The CustTableListPage query had a sort over the DirPartyTable.Name field. When I removed this sort, filtering with a wildcard worked like a charm.