What happens during a query using Mondrian? - olap

I know Mondrian converts MDX queries into relational queries and returns the result. But are there any details about this process?
I use the sample cube HR. Here is the MDX:
WITH
SET [~ROWS] AS
TopCount({[Time].[Time].[Month].Members}, 3, [Measures].[Org Salary])
SELECT
NON EMPTY {[Measures].[Org Salary]} ON COLUMNS,
NON EMPTY [~ROWS] ON ROWS
FROM [HR]
And this is the SQL generated from the MDX. I found it in the log:
select
"time_by_day"."the_year" as "c0",
"time_by_day"."quarter" as "c1",
"time_by_day"."the_month" as "c2",
"time_by_day"."month_of_year" as "c3",
sum("salary"."salary_paid") as "c4"
from
"salary" as "salary",
"time_by_day" as "time_by_day"
where
"time_by_day"."the_year" = 1997
and
"salary"."pay_date" = "time_by_day"."the_date"
group by
"time_by_day"."the_year",
"time_by_day"."quarter",
"time_by_day"."the_month",
"time_by_day"."month_of_year"
order by
CASE WHEN sum("salary"."salary_paid") IS NULL THEN 1 ELSE 0 END, sum("salary"."salary_paid") DESC,
CASE WHEN "time_by_day"."the_year" IS NULL THEN 1 ELSE 0 END, "time_by_day"."the_year" ASC,
CASE WHEN "time_by_day"."quarter" IS NULL THEN 1 ELSE 0 END, "time_by_day"."quarter" ASC,
CASE WHEN "time_by_day"."the_month" IS NULL THEN 1 ELSE 0 END, "time_by_day"."the_month" ASC,
CASE WHEN "time_by_day"."month_of_year" IS NULL THEN 1 ELSE 0 END, "time_by_day"."month_of_year" ASC
I changed the TopCount from 3 to 10 and got the same SQL. And the SQL contains nothing like LIMIT.
So I am wondering what happens during a query. I searched and didn't find any useful information. Can anybody help?
Thank you.
Longxing

If you look at your Mondrian log you will see information about this, like:
SqlMemberSource.getMemberChildren
SqlTupleReader.readTuples
SqlStatisticsProvider.getColumnCardinality
Segment.load
Mondrian is reading the tuples and loading the segments via SQL, letting the database do some of the calculations and doing the rest internally. Based on your MDX, schema, and setup it will split the work between the database and its own internal calculations, trying to solve the query in a way that is optimal for performance and memory, while also building the best possible cache for future queries.
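You can see this split in your example: the generated SQL groups by month and orders by the summed salary, but there is no LIMIT, which suggests the TopCount cutoff (3 or 10) is applied by Mondrian itself on the rows it reads back; that would explain why changing the argument does not change the SQL. Purely as an illustration (a hypothetical sketch, not SQL that Mondrian emits, and dialect-dependent), a fully pushed-down TopCount would look something like:

select "time_by_day"."the_year", "time_by_day"."the_month",
       sum("salary"."salary_paid") as "org_salary"
from "salary", "time_by_day"
where "time_by_day"."the_year" = 1997
  and "salary"."pay_date" = "time_by_day"."the_date"
group by "time_by_day"."the_year", "time_by_day"."the_month"
order by sum("salary"."salary_paid") desc
limit 3

By not pushing the limit down, Mondrian presumably keeps the full set of month segments in its cache, consistent with the caching behaviour described above.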

Related

Get the number of records from MDX query with Subcubes

I'm developing a system to generate MDX queries from entity "FilterCriterias" and related info like the number of records of a query, so I need a generic way to get the number of records of an MDX query that uses subcubes. In a normal query I do something like:
WITH
MEMBER [MyCount] AS
Count([Date].[Date].MEMBERS)
SELECT
{[MyCount]} ON 0
FROM [Adventure Works];
But I have problems when I use this approach in slightly more complex queries like this one:
WITH
MEMBER [MyCount] AS
Count([Date].[Date].MEMBERS)
SELECT
{[MyCount]} ON 0
FROM
(
SELECT
{[Measures].[Sales Amount]} ON 0
,{[Date].[Date].&[20050701] : [Date].[Date].&[20051231]} ON 1
FROM
(
SELECT
{[Sales Channel].[Sales Channel].&[Internet]} ON 0
FROM [Adventure Works]
)
);
I guess the logical response would be the number of [Date].[Date] members left in the subcube, but I get a result without columns and rows. I'm a newbie in the MDX language and I don't understand this behavior. Is there some generic way to get the number of records from a "base" query, just like SELECT COUNT(*) FROM (...) in plain SQL?
The structure is quite different to a relational SELECT COUNT(*) FROM (...).
I believe that the structure of a sub-select will be very similar to that of a sub-cube, and reading through the definition from MSDN (https://msdn.microsoft.com/en-us/library/ms144774.aspx) of what a sub-cube contains tells us that it isn't a straight filter like in a relational query.
Admittedly I still find this behaviour rather "enigmatic" (a polite way of saying "I do not understand it").
Is there a workaround?
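One possible workaround (an untested sketch against Adventure Works, and not strictly equivalent to the sub-select semantics) is to drop the sub-selects: put the channel slice in the WHERE clause and count the dates in the range directly, optionally filtering to dates that actually have sales:

WITH
MEMBER [Measures].[MyCount] AS
Count(
  Filter(
    {[Date].[Date].&[20050701] : [Date].[Date].&[20051231]},
    NOT IsEmpty([Measures].[Sales Amount])
  )
)
SELECT
{[Measures].[MyCount]} ON 0
FROM [Adventure Works]
WHERE ([Sales Channel].[Sales Channel].&[Internet]);

If you just want the raw size of the date range regardless of sales, drop the Filter and count the range set on its own.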

OrientDB Query Result set of vertices with an empty collection of edges, vertices

This might be a simple question but I am confused; please help.
I am using OrientDB 2.1.9 and I am experimenting with VehicleHistoryGraph database.
From Studio, in Browse mode, I set the limit to 9 records only. Now I am entering this simple query:
select out() from Person
The result set I am getting back is 9 records, BUT only two have bought a vehicle. The rest are displayed with empty collections []. This is no good; I am confused. I would expect to get back only those two vertices with collections of edges!
How do I get back these two persons that bought something?
I also noticed that there is this unwind operator in select. Is it useful in this case? Can you give an example?
Your query asks for out(), so out() is computed in all cases, and you're shown the results. If you only want the rows for which out().size() > 0 then you can construct a query like this:
select out() from Person let $n = out().size() where $n > 0
If you think that one ought to be able to write this more succinctly, e.g. like so:
select out() as n from Person where n > 0
then join the club (e.g. by supporting this enhancement request).
(select out() from Person where out().size() > 0 is supported.)
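Regarding the unwind question: as far as I recall (treat this as a hedged sketch, not verified on 2.1.9), UNWIND expands a collection projection into one row per element, so something like

select out() as bought from Person unwind bought

would give you one row per bought vehicle rather than one collection per person; it solves a different problem than filtering out the vertices whose out() is empty.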

Poor SP performance from ASP.NET

I have a stored procedure that handles sorting, filtering and paging (using Row_Number) and some funky trickery :) The SP is running against a table with ~140k rows.
The whole thing works great and for at least the first few dozen pages is super quick.
However, if I try to navigate to higher pages (e.g. head to the last page of 10k) the whole thing comes to a grinding halt and results in a SQL timeout error.
If I run the same query, using the same parameters, inside a Management Studio query window, the response is instant irrespective of the page number I pass in.
At the moment it's test code that simply binds to an ASP:DataGrid in .NET 3.5.
The SP looks like this:
BEGIN
WITH Keys
AS (
SELECT
TOP (@PageNumber * @PageSize) ROW_NUMBER() OVER (ORDER BY JobNumber DESC) as rn
,P1.jobNumber
,P1.CustID
,P1.DateIn
,P1.DateDue
,P1.DateOut
FROM vw_Jobs_List P1
WHERE
(@CustomerID = 0 OR CustID = @CustomerID) AND
(JobNumber LIKE '%'+@FilterExpression+'%'
OR OrderNumber LIKE '%'+@FilterExpression+'%'
OR [Description] LIKE '%'+@FilterExpression+'%'
OR Client LIKE '%'+@FilterExpression+'%')
ORDER BY P1.JobNumber DESC ), SelectedKeys
AS (
SELECT
TOP (@PageSize) SK.rn
,SK.JobNumber
,SK.CustID
,SK.DateIn
,SK.DateDue
,SK.DateOut
FROM Keys SK
WHERE SK.rn > ((@PageNumber-1) * @PageSize)
ORDER BY SK.JobNumber DESC)
SELECT
SK.rn
,J.JobNumber
,J.Description
,J.Client
,SK.CustID
,OrderNumber
,CAST(DateAdd(d, -2, CAST(isnull(SK.DateIn,0) AS DateTime)) AS nvarchar) AS DateIn
,CAST(DateAdd(d, -2, CAST(isnull(SK.DateDue,0) AS DateTime)) AS nvarchar) AS DateDue
,CAST(DateAdd(d, -2, CAST(isnull(SK.DateOut,0) AS DateTime)) AS nvarchar) AS DateOut
,Del_Method
,Ticket#
,InvoiceEmailed
,InvoicePrinted
,InvoiceExported
,InvoiceComplete
,JobStatus
FROM SelectedKeys SK
JOIN vw_Jobs_List J ON j.JobNumber=SK.JobNumber
ORDER BY SK.JobNumber DESC
END
And it's called via
sp_jobs (PageNumber,PageSize,FilterExpression,OrderBy,CustomerID)
e.g.
sp_Jobs '13702','10','','JobNumberDESC','0'
Can anyone shed any light on what might be the cause of the dramatic difference in performance between SQL query window and an asp.net page executing a dataset?
Check out the "WITH RECOMPILE" option
http://www.techrepublic.com/article/understanding-sql-servers-with-recompile-option/5662581
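As a hedged sketch of how that could be applied here (parameter names taken from your call; the types and lengths are assumptions, since they aren't shown in the question), WITH RECOMPILE forces a fresh plan per execution, so a plan built for page 1 is not reused for page 10,000:

ALTER PROCEDURE sp_Jobs
    @PageNumber int,
    @PageSize int,
    @FilterExpression nvarchar(100),  -- type/length assumed
    @OrderBy nvarchar(50),            -- type/length assumed
    @CustomerID int
WITH RECOMPILE
AS
BEGIN
    -- existing procedure body unchanged
END

A lighter-weight alternative is adding OPTION (RECOMPILE) to just the final SELECT, which recompiles only that statement instead of the whole procedure.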
I have run into similar problems where the execution plan on stored procedures will work great for a while, but then get a new plan because the options changed. So, it will be "optimized" for one case and then perform "table scans" for another option. Here is what I have tried in the past:
1. Re-execute the stored procedure to calculate a new execution plan, then keep an eye on it.
2. Break the stored procedure up into separate stored procedures, one per option, so that each can be optimized; the overall stored procedure then simply calls each "optimized" stored procedure.
3. Bring the records into an object and perform all of the "funky trickery" in code; this also gives you the option to "cache" the results.
Obviously options #2 and #3 are better than option #1. I am honestly finding that option #3 is becoming the best bet in most cases.
I just thought of another option, #4: instead of performing your "inner selects" in one query, you could put the results of your inner selects into temporary tables and then JOIN on those results (see the sketch below). I would still push for option #3 if possible, but I understand that sometimes you just need to keep working the stored procedure until it "works".
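A rough sketch of option #4 against your procedure (hypothetical; the LIKE filters are omitted for brevity and would go in the first WHERE clause):

SELECT TOP (@PageNumber * @PageSize)
       ROW_NUMBER() OVER (ORDER BY JobNumber DESC) AS rn,
       JobNumber
INTO   #Keys
FROM   vw_Jobs_List
WHERE  (@CustomerID = 0 OR CustID = @CustomerID)
ORDER BY JobNumber DESC;

SELECT J.*
FROM   #Keys K
JOIN   vw_Jobs_List J ON J.JobNumber = K.JobNumber
WHERE  K.rn > ((@PageNumber - 1) * @PageSize)
ORDER BY K.JobNumber DESC;

DROP TABLE #Keys;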
Good luck.

SQL sorting , paging, filtering best practices in ASP.NET

I am wondering how Google does it. I have a lot of slow queries when it comes to page count and total number of results. Google returns a count value of 250,000,00 in a fraction of a second.
I am dealing with grid views. I have built a custom pager for a GridView that requires an SQL query to return a page count based on the filters set by the user. There are at least 5 filters, which include a keyword, a category and subcategory, a date range, and a sort expression for sorting. The query contains about 10 massive table LEFT JOINs.
This query is executed every time a search is performed, and an execution lasts 30 seconds on average, be it a count or a select. I believe what's making it slow is my combination of inclusive and exclusive date range filters. I have replaced (<=, >=) with BETWEEN ... AND but I still experience the same problem.
See the query here:
http://friendpaste.com/4G2uZexRfhd3sSVROqjZEc
I have problems with a long date range parameter.
Check my table that contains the dates:
http://friendpaste.com/1HrC0L62hFR4DghE6ypIRp
UPDATE [9/17/2010] I minimized my date query and removed the time.
I tried reducing the joins for my count query (I am actually having a problem with my filter count, which takes too long to return a result of 60k rows).
SELECT COUNT(DISTINCT esched.course_id)
FROM courses c
LEFT JOIN events_schedule esched
ON c.course_id = esched.course_id
LEFT JOIN course_categories cc
ON cc.course_id = c.course_id
LEFT JOIN categories cat
ON cat.category_id = cc.category_id
WHERE 1 = 1
AND c.course_type = 1
AND active = 1
AND c.country_id = 52
AND c.course_title LIKE '%cook%'
AND cat.main_category_id = 40
AND cat.category_id = 360
AND (
('2010-09-01' <= esched.date_start OR '2010-09-01' <= esched.date_end)
AND
('2010-09-25' >= esched.date_start OR '2010-09-25' >= esched.date_end)
)
I just noticed that my query is quite fast when I have a filter on my main or sub category fields. However, when I only have a date filter and the range is a month or a week, it needs to count a lot of rows and takes 30 seconds on average.
These are the static fields:
AND c.course_type = 1
AND active = 1
AND c.country_id = 52
UPDATE [9/17/2010]: If I create a hash of these three fields and store it in one field, will it change the speed?
These are my dynamic fields:
AND c.course_title LIKE '%cook%'
AND cat.main_category_id = 40
AND cat.category_id = 360
// ?DateStart and ?DateEnd
UPDATE [9/17/2010]: Now my problem is the leading % in the LIKE query.
I will post an updated EXPLAIN.
Search engines like Google use very complex behind-the-scenes algorithms to index searches. Essentially, they have already determined which words occur on each page as well as the relative importance of those words and the relative importance of the pages (relative to other pages). These indexes are very quick because they are based on bitwise indexing.
Consider the following google searches:
custom: 542 million Google hits
pager: 10.8 million
custom pager: 1.26 million
Essentially what they have done is created a record for the word custom and in that record they have placed a 1 for every page that contains it and a 0 for every page that doesn't contain it. Then they zip it up because there are a lot more 0s than 1s. They do the same for pager.
When the search custom pager comes in, they unzip both records and perform a bitwise AND on them; this results in an array of bits whose length is the total number of pages they have indexed and whose number of 1s is the hit count for the search. The position of each bit corresponds to a particular page, which is known in advance, and they only have to look up the full details of the first 10 to display on the first page.
This is oversimplified, but that is the general principle.
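If you want to see the principle in miniature (a toy illustration only, using MySQL's integer bit operators, nothing like Google's real infrastructure):

-- 'custom' bitmap = 10110 (binary) = 22, 'pager' bitmap = 11000 (binary) = 24,
-- one bit per indexed page; BIT_COUNT counts the set bits of the AND
SELECT BIT_COUNT(22 & 24) AS pages_matching_both;  -- 10110 AND 11000 = 10000, i.e. 1 page

The AND leaves a 1 only where both words occur, and counting the set bits gives the hit count.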
Oh yes, they also have huge banks of servers performing the indexing and huge banks of servers responding to search requests. HUGE banks of servers!
This makes them a lot quicker than anything that could be done in a relational database.
Now, to your question: Could you paste some sample SQL for us to look at?
One thing you could try is changing the order in which the tables and joins appear in your SQL statement. I know that it seems it shouldn't make a difference, but it certainly can. If you put the most restrictive joins earlier in the statement then you could well end up with fewer overall joins performed within the database.
A real world example. Say you wanted to find all of the entries in the phonebook under the name 'Johnson', with the number beginning with '7'. One way would be to look for all the numbers beginning with 7 and then join that with the numbers belonging to people called 'Johnson'. In fact it would be far quicker to perform the filtering the other way around even if you had indexing on both names and numbers. This is because the name 'Johnson' is more restrictive than the number 7.
So order does count, and database software is not always good at determining in advance which joins to perform first. I'm not sure about MySQL, as my experience is mostly with SQL Server, which uses index statistics to decide the order in which to perform joins. These stats get out of date after a number of inserts, updates and deletes, so they have to be recomputed periodically. If MySQL has something similar, you could try this.
UPDATE
I have looked at the query that you posted. Ten left joins is not unusual and should perform fine as long as you have the right indexes in place. Yours is not a complicated query.
What you need to do is break this query down to its fundamentals. Comment out the lookup joins such as those to currency, course_stats, countries, states and cities along with the corresponding fields in the select statement. Does it still run as slowly? Probably not. But it is probably still not ideal.
So comment out all of the rest until you just have the courses and the GROUP BY course_id and ORDER BY course_id. Then experiment with adding the left joins back in to see which one has the greatest impact. Then, focusing on the ones with the greatest impact on performance, change the order of the joins. This is the trial-and-error approach. It would be a lot better for you to take a look at the indexes on the columns that you are joining on.
For example, the line cm.method_id = c.method_id would require a primary key on course_methodologies.method_id and a foreign key index on courses.method_id and so on. Also, all of the fields in the where, group by and order by clauses need indexes.
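For instance (a hypothetical sketch; adjust to your actual table and column names, and skip any index that already exists):

CREATE INDEX idx_courses_method   ON courses (method_id);
CREATE INDEX idx_esched_course    ON events_schedule (course_id);
CREATE INDEX idx_esched_dates     ON events_schedule (date_start, date_end);
CREATE INDEX idx_coursecat_course ON course_categories (course_id, category_id);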
Good luck
UPDATE 2
You seriously need to look at the date filtering on this query. What are you trying to do?
AND ((('2010-09-01 00:00:00' <= esched.date_start
AND esched.date_start <= '2010-09-25 00:00:00')
OR ('2010-09-01 00:00:00' <= esched.date_end
AND esched.date_end <= '2010-09-25 00:00:00'))
OR ((esched.date_start <= '2010-09-01 00:00:00'
AND '2010-09-01 00:00:00' <= esched.date_end)
OR (esched.date_start <= '2010-09-25 00:00:00'
AND '2010-09-25 00:00:00' <= esched.date_end)))
Can be re-written as:
AND (
-- date_start is within the range - fine
(esched.date_start BETWEEN '2010-09-01 00:00:00' AND '2010-09-25 00:00:00')
-- date_end is within the range - fine
OR (esched.date_end BETWEEN '2010-09-01 00:00:00' AND '2010-09-25 00:00:00')
OR (esched.date_start <= '2010-09-01 00:00:00' AND esched.date_end >= '2010-09-01 00:00:00')
OR (esched.date_start <= '2010-09-25 00:00:00' AND esched.date_end >= '2010-09-25 00:00:00')
)
On your update: you mention you suspect the problem is in the date filters.
All those date checks can be summed up in a single check:
esched.date_end >= '2010-09-01 00:00:00' and esched.date_start <= '2010-09-25 00:00:00'
If with the above it behaves the same, check if the following returns quickly / is picking your indexes:
SELECT COUNT(DISTINCT esched.course_id)
FROM events_schedule esched
WHERE esched.date_end >= '2010-09-01 00:00:00' and esched.date_start <= '2010-09-25 00:00:00'
PS: I think that when using the join, you can do SELECT COUNT(c.course_id) to count the main course records in the query directly, i.e. you might not need the DISTINCT that way.
Re: the update that, after the change, most of the time now goes to the wildcard search:
Use a MySQL full-text search. Make sure to check the full-text restrictions; one important one is that it's only supported on MyISAM tables. I must say that I haven't really used MySQL full-text search, and I'm not sure how it impacts the use of other indexes in the query.
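A hedged sketch of what that could look like (standard MySQL FULLTEXT syntax; table and column names taken from your query, and it assumes the table is, or can be made, MyISAM):

ALTER TABLE courses ADD FULLTEXT INDEX ft_course_title (course_title);

SELECT COUNT(DISTINCT esched.course_id)
FROM courses c
JOIN events_schedule esched ON c.course_id = esched.course_id
WHERE MATCH(c.course_title) AGAINST('cook*' IN BOOLEAN MODE);

Note that full-text matching works on whole words, so 'cook' alone would not match 'cooking'; the trailing * in boolean mode gives a prefix match, which is still different from your current '%cook%'.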
If you can't use a full-text search, IMHO you are out of luck with your current approach, since a regular index can't be used to check whether a word is contained in any part of the text.
If that's the case, you might want to switch that specific part of the approach and introduce a tag/keyword-based approach. Unlike categories, you can assign multiple tags to each item, so it's flexible yet doesn't have the free-text issue.

SQLite - getting number of rows in a database

I want to get the number of rows in my table using max(id). When it returns NULL - if there are no rows in the table - I want to return 0. And when there are rows I want to return max(id) + 1.
My rows are numbered from 0 and auto-incremented.
Here is my statement:
SELECT CASE WHEN MAX(id) != NULL THEN (MAX(id) + 1) ELSE 0 END FROM words
But it is always returning me 0. What have I done wrong?
You can query the actual number of rows with SELECT COUNT(*) FROM tblName
see https://www.w3schools.com/sql/sql_count_avg_sum.asp
If you want to use MAX(id) instead of the count, then after reading the comments from Pax, the following SQL will give you what you want:
SELECT COALESCE(MAX(id)+1, 0) FROM words
In SQL, NULL = NULL is false, you usually have to use IS NULL:
SELECT CASE WHEN MAX(id) IS NULL THEN 0 ELSE (MAX(id) + 1) END FROM words
But, if you want the number of rows, you should just use count(id), since your solution will give 10 if your rows are (0,1,3,5,9) when it should give 5.
If you can guarantee you will always have ids from 0 to N, max(id)+1 may be faster depending on the index implementation (it may be faster to traverse the right side of a balanced tree than to traverse the whole tree, counting).
But that's very implementation-specific and I would advise against relying on it, not least because it locks your performance to a specific DBMS.
Not sure if I understand your question, but max(id) won't give you the number of rows at all. For example, if you have only one row with id = 13 (let's say you deleted the previous rows), you'll have max(id) = 13 but the number of rows is 1. The correct (and fastest) solution is to use count(). BTW, if you wonder why there's a star, it's because you can also count rows based on a criterion.
I had the same problem, if I understand your question correctly: I wanted to know the last inserted id after every insert operation in SQLite. I tried the following statement:
select * from table_name order by id desc limit 1
The id is the first column and primary key of table_name; the statement above shows me the record with the largest id.
But the premise is that you have never deleted any row, so the number of ids equals the number of rows.
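As a hedged aside (not part of the original approach): if what you actually need is the id of the row you just inserted, SQLite exposes last_insert_rowid() for exactly that, which avoids the max(id)/row-count assumptions altogether:

SELECT last_insert_rowid();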
Extending VolkerK's answer: to make the code a little more readable, you can use AS to reference the count, example below:
SELECT COUNT(*) AS c from profile
This makes for much easier reading in some frameworks. For example, I'm using Exponent's (React Native) SQLite integration, and without the AS clause the code is pretty ugly.
