How do I add KPI targets to my cube that are at a higher grain than my fact table?

I have a simple star schema with two dimensions: course and student. My fact table is an enrolment on a course. I have KPI Values set up which use data in the fact table (e.g. the percentage of students that completed the course). All is working great.
I now need to add KPI Goals that are at a different grain than the fact table. The goals are at the course level, but should also work at department level, and for whatever combination of dimension attributes is selected. I have the numerators and denominators for the KPI Goals, so I want to aggregate these when there are multiple courses involved, before dividing to get the actual percentage goal.
How can I implement this? From my understanding I should have only one fact table in my star schema. In that case, would I perhaps store the higher-grain values in the fact table? Or would I create a dimension with these values in it? Or is there some alternative solution?

In most cases I would expect KPI measures to be calculated from the existing measures in your cube, so can you get away from the idea of fact table changes and just set up the KPIs as calculated members in the cube or in MDX?
Your issue is complicated by the KPI granularity being different, yes... but I would simply hide the KPI measures when such a level of granularity is being displayed. You can implement this within the calculated measure definition too.
For example, I have used ISLEAF() to detect when a measure is about to be shown at the bottom level and return blank/NULL. Or you can check the level number of any relevant dimensions.
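A minimal MDX sketch of that idea, assuming the goal numerator and denominator have been loaded into the cube as measures; the hierarchy and measure names here are illustrative, not from the question:

```mdx
-- Hypothetical names throughout; the pattern, not the identifiers, is the point.
CREATE MEMBER CURRENTCUBE.[Measures].[Completion Goal] AS
    IIF(
        ISLEAF([Student].[Student].CURRENTMEMBER),
        NULL,   -- the goal has no meaning at individual-student grain, so show blank
        [Measures].[Goal Numerator] / [Measures].[Goal Denominator]
    ),
    FORMAT_STRING = "Percent";
```

Because the numerator and denominator aggregate naturally across courses and departments, the division happens after aggregation, which is the behaviour the question asks for.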

Related

MariaDB partitioning on the last 3 months

First, let me explain my problem:
I have a table that will receive approximately 5,000,000 records per year, and these records must be kept for at least 10 years (the exact retention period is not yet defined). The rows are events from production machines. I generate a report plus a dashboard displaying fairly complex information (average number of events per 10 minutes over a month, charts, ...) and I also want to be able to see the records themselves. The data displayed will mostly come from the last 2 months; viewing the rest of the data must always be possible, but a lower access speed is acceptable.
I am working on MariaDB v10.1.12.
The idea was to create a partition covering the last 3 months. I realize now that this is not so easy: I have not found any way to define such a partition, because it is impossible to base a partition on NOW(), CURRENT_DATE(), etc., either directly or indirectly via a computed column.
Do you have any ideas? Perhaps a solution other than partitioning.
Thank you in advance.
I recommend PARTITION BY RANGE(TO_DAYS(...)). If you are only now breaking the table into partitions, I would recommend annual partitions for data before this year, then quarterly or monthly partitions from here on. Yes, in theory that leads to an ever-growing number of partitions, but I predict that you will revamp the data structure within a few years.
20-50 partitions is a good number. More than that leads to inefficiencies due to the multitude of partitions; fewer than that leads to asking "why bother?".
Use InnoDB. Design the PRIMARY KEY carefully, since it may be useful as the primary index into the data.
Usually it is best to put the date/timestamp column last in any indexes; putting it first would be redundant, since partition pruning happens first.
More on partitioning.
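As a rough illustration of that layout (the table and column names are invented, not from the question):

```sql
CREATE TABLE machine_event (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    machine_id INT UNSIGNED    NOT NULL,
    event_ts   DATETIME        NOT NULL,
    payload    VARCHAR(255),
    PRIMARY KEY (id, event_ts)   -- the partitioning column must be part of every unique key
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(event_ts)) (
    PARTITION p_hist    VALUES LESS THAN (TO_DAYS('2016-01-01')),  -- everything before this year
    PARTITION p2016_01  VALUES LESS THAN (TO_DAYS('2016-02-01')),
    PARTITION p2016_02  VALUES LESS THAN (TO_DAYS('2016-03-01')),
    PARTITION p_future  VALUES LESS THAN MAXVALUE
);

-- Old data can later be removed almost instantly:
-- ALTER TABLE machine_event DROP PARTITION p_hist;
-- New months are split out of p_future with ALTER TABLE ... REORGANIZE PARTITION.
```

Queries that filter on event_ts then only touch the relevant partitions (partition pruning).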
It sounds like a main purpose for the table is to summarize the data for graphing, etc. In that case, it may be very beneficial to build and maintain "summary table(s)" of counts and subtotals over selected time intervals. Do roughly 100 rows get added up for each 10-minute interval? If so, then a summary table based on 10-minute intervals would have 1/100th as many rows, and the queries would be much faster. Plus, you could 'denormalize' the summary tables to make them even simpler.
More on Summary tables.
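A hedged sketch of what such a 10-minute summary table and its maintenance statement might look like, reusing the invented machine_event table from the sketch above:

```sql
CREATE TABLE event_summary_10min (
    machine_id  INT UNSIGNED NOT NULL,
    slot_start  DATETIME     NOT NULL,   -- start of the 10-minute interval
    event_count INT UNSIGNED NOT NULL,
    PRIMARY KEY (machine_id, slot_start)
) ENGINE=InnoDB;

-- Run periodically to fold recent raw rows into the summary; in practice you would
-- track the last-summarized timestamp instead of NOW() to avoid double counting.
INSERT INTO event_summary_10min (machine_id, slot_start, event_count)
SELECT machine_id,
       FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(event_ts) / 600) * 600) AS slot_start,
       COUNT(*)
FROM   machine_event
WHERE  event_ts >= NOW() - INTERVAL 10 MINUTE
GROUP  BY machine_id, slot_start
ON DUPLICATE KEY UPDATE event_count = event_count + VALUES(event_count);
```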
It might be worth it to gather data for 10 minutes into a staging table, then summarize it into the summary table. And also throw the raw data into the big table.
Or, if the summary tables have everything you need, you could abandon the big table. Or, as a compromise, keep 12 months' worth of data (partitioned by month), and DROP PARTITION for older data. Meanwhile, the summary tables can continue to grow (although they will be much smaller).
Table partitioning is an advanced feature. It is not indexing, but a physical rearrangement of the table's data, so nothing is duplicated: new rows are simply stored in whichever partition matches the predefined partitioning ranges.
You must still specify the month-range criteria in your queries as usual, and you MUST create indexes on columns that are not used in the partitioning expression. When you run a SELECT, the engine handles the pruning and merging of partitions (if required) in the background, so you can treat a partitioned table exactly like a normal table.
For more details, please check the MariaDB partitioning overview.

MicroStrategy Intelligent Cube data aggregation

I created a cube that contains five attributes and a metric, and I want to create a document from this cube with different visualisations for each attribute. The problem is that the data in the cube is aggregated based on all attributes in the cube, so when you add a grid with one attribute and the metric, the numbers are not correct.
Is there any way to make the metric dynamically aggregate depending on the attribute in use?
This depends on what kind of metric you have in the cube. The best way to achieve aggregation across all attributes is obviously to have the most granular, least aggregated data in the cube, but understandably this is not always possible.
If your metric is a simple SUM metric, then you can set the dynamic aggregation setting on the metric to SUM and it should aggregate appropriately regardless of the attributes you place on your document/report, unless your attribute relationships are not set up correctly or there are many-to-many relationships between some of those attributes.
If your metric is a distinct count metric, the approach is slightly different and has been covered previously in a few places. Here is one, written for an older version of MicroStrategy, but the logic can still be applied to newer versions:
http://community.microstrategy.com/t5/tkb/articleprintpage/tkb-id/architect/article-id/1695

Property graph (Neo4j) design: single node with multiple relations, or new nodes for each event occurrence?

Let's say I have two leagues, L1 and L2. Each league can have multiple rounds, like Playoffs, Quarterfinals, Semifinals and Finals. Moreover, I also need to represent the happens_after fact: Quarterfinals happen after Playoffs, Semifinals happen after the Quarterfinals, and Finals happen after the Semifinals.
Questions
Should my graph have one node for each of these rounds, with each league linking to those rounds? This way we are just creating new relationships (e.g. both L1 and L2 will have a relationship to Playoffs) but there is only one Playoffs node. However, this limits the happens_after relationship, because some leagues can have more rounds (e.g. Round 2 can come before the Quarterfinals). Is there a better way to represent this?
Use-cases
Need to be able to find all the rounds of a given league.
Need to be able to find the order of all the rounds of a given league and the dates each of these happened.
In general, everything that has an identity of its own should become a node. Relationships tie the "things" together.
I'm not sure if I fully understand your domain. L1, L2 and each round would be nodes. The relationship league -> round indicates that a given league takes part in that round.
The temporal order of the rounds can be modelled by having BEFORE and/or AFTER relationships among them. This way you build a linked (or doubly linked) list of rounds.
Another way to express temporal order would be to store an indexed timestamp property on the round. If you're just interested in before/after and not in absolute time, the first approach (the linked list) seems to fit better.
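As an illustrative Cypher sketch of the linked-list approach, using per-league round nodes so that each league can have its own ordering (the labels, relationship types and property names are assumptions, not from the question):

```cypher
// Create one league with its own chain of rounds.
CREATE (l1:League {name: 'L1'}),
       (po:Round {name: 'Playoffs',      date: '2016-04-01'}),
       (qf:Round {name: 'Quarterfinals', date: '2016-04-15'}),
       (sf:Round {name: 'Semifinals',    date: '2016-05-01'}),
       (fi:Round {name: 'Finals',        date: '2016-05-15'}),
       (l1)-[:HAS_ROUND]->(po), (l1)-[:HAS_ROUND]->(qf),
       (l1)-[:HAS_ROUND]->(sf), (l1)-[:HAS_ROUND]->(fi),
       (po)-[:NEXT]->(qf), (qf)-[:NEXT]->(sf), (sf)-[:NEXT]->(fi);

// All rounds of a league, in order, with their dates.
MATCH (l:League {name: 'L1'})-[:HAS_ROUND]->(first:Round)
WHERE NOT ( ()-[:NEXT]->(first) )
MATCH p = (first)-[:NEXT*0..]->(r:Round)
RETURN r.name, r.date
ORDER BY length(p);
```

This covers both use-cases: HAS_ROUND gives all rounds of a league, and the NEXT chain (or the date property) gives their order.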

Build an undirected weighted graph by matching N vertices

Problem:
I want to suggest the top 10 most compatible matches for a particular user by comparing his/her 'interests' with the interests of all others. I'm building an undirected weighted graph between users, where the weight = the match score between the two users.
I already have a set S of N users. For any user U in S, I have a set of interests I. After a long time (a week?) I create a new user U with a set of interests and add it to S. To generate the graph for this new user, I compare the interest set I of the new user with the interest sets of all the users in S, iteratively. The problem is with this "all the users" part.
Let's talk about the function for comparing interests. An interest in a set of interests I is a string. I'm comparing two strings/interests using WikipediaMiner (it uses Wikipedia links to infer how closely related two strings are, e.g. Billie Jean & Thriller ==> high match, Brad Pitt & Jamaica ==> low match, and so on). I've asked a question about this too (to see if there's a better solution than the one I'm currently using).
So, the above function takes non-negligible time, and in total it will take a HUGE amount of time when we compare thousands (maybe millions?) of users and their hundreds of interests. For 100,000 users, I can't afford to make 100,000 user comparisons within a small time (<30 sec) this way. But I have to give the top 10 recommendations within 30 seconds, possibly a preliminary recommendation, and then improve on it over the next minute or so with better recommendations. Simply comparing the 1 new user against the N users sequentially is too slow.
Question:
Please suggest an algorithm, method or tool using which I can improve my situation or solve my problem.
I can only sketch an approach to the problem, since the outcome of the steps below depends on the nature of the inter-relations between interests.
Step 1: As your title says, build an undirected weighted graph with interests as vertices and the weighted match between them as edges.
Step 2: Cluster the interests (the most complex part). K-means is a commonly used clustering algorithm, but it works on a K-dimensional vector space (see the Wikipedia article for how K-means works): it minimizes, over all clusters, the sum of squared distances between each point and its cluster's centre. In your case there are no coordinates available, so see whether you can apply the same minimizing logic by creating some kind of rule for the distance between two vertices: a higher match means a smaller distance, and vice versa (what are the different matching levels provided by WikipediaMiner?). Choose, as the 'mean' of a cluster, the most connected vertex in the chosen set; PageRank sounds like a good option for finding the most connected vertex. "Pair-counting F-measure" sounds like it suits your need (a weighted graph); check what other options are available.
(Note: keep revising this step until a suitable clustering algorithm is found, along with the right calibration for the distance rule, the number of clusters, etc.)
Step 3: Evaluate the clusters. From here on it is a matter of calibrating a couple of things to fit your needs. Examine the clusters and re-evaluate: the number of clusters, the inter-cluster distances, the distances between vertices inside clusters, the sizes of the clusters, and the time/precision trade-off (compare the final match results with results obtained without any clustering). Go back to step 2 until this evaluation is satisfactory.
Step 4: Examine the new interest. Iterate through all clusters, calculate the new interest's connectivity to each cluster, sort the clusters by connectivity, and for the top x% of sorted clusters, sort and filter out the highly connected interests.
Step 5: Match users. Reverse-look-up the set of all users having the interests obtained in step 4, compare all interests for both users (the new user and each candidate), and generate a score.
Step 6: Apart from the above, you can distribute the load across multiple systems/processors (e.g. machine n handles cluster n), based on the traffic and so on. What is the application for this problem, and what is the expected traffic?
Another way to find the connectivity between the new interest and the "set of interests in cluster C":
WikipediaMiner runs on a set of wiki documents; let me call that set the UNIVERSE.
1. For each cluster, fetch and maintain (indexed; Lucene might be handy) the "set of highly relevant documents" (I'll call it the HRDC) out of the UNIVERSE, so you have N HRDCs if you have N clusters.
2. When a new interest comes in, compute "connectivity with cluster" = (hit ratio of the interest in the HRDC) / (hit ratio of the interest in the UNIVERSE) for each HRDC.
3. Sort the "connectivity with cluster" values and choose the most highly connected clusters.
4. Depending on the time/precision trade-off that suits you, compare the new interest either with all the vertices in the cluster or only with the highly connected vertices (found using PageRank).
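A rough Python sketch of that connectivity ratio, assuming the per-cluster and universe document indexes have already been built; a plain dict stands in for something like Lucene here, so all names are assumptions:

```python
def hit_ratio(interest, index, total_docs):
    """Fraction of documents in the index that mention the interest."""
    if total_docs == 0:
        return 0.0
    return len(index.get(interest, set())) / total_docs


def cluster_connectivity(interest, hrdc_index, hrdc_size, universe_index, universe_size):
    """Connectivity = hit ratio in the cluster's HRDC / hit ratio in the UNIVERSE."""
    universe_ratio = hit_ratio(interest, universe_index, universe_size)
    if universe_ratio == 0.0:
        return 0.0  # the corpus knows nothing about this interest
    return hit_ratio(interest, hrdc_index, hrdc_size) / universe_ratio


def rank_clusters(interest, clusters, universe_index, universe_size):
    """Rank clusters for a new interest, given {cluster_name: (index, size)}."""
    scores = {name: cluster_connectivity(interest, idx, size, universe_index, universe_size)
              for name, (idx, size) in clusters.items()}
    return sorted(scores, key=scores.get, reverse=True)
```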
One flaw is that you're basing your algorithm's complexity on the wrong thing. The real issue is that you have to compare each unique interest against every other unique interest (and that interest against itself).
If all of the interests are unique, then there is probably nothing you can do. However, if you have a lot of duplicate interests, you can perhaps speed up the algorithm in the following way.
Create a graph that associates each interest with the users that have that interest, in a way that allows fast look-ups.
Create a graph that shows how each interest relates to every other interest, also in a way that allows fast look-ups.
Then, when a new user is added, their interests are compared to all other unique interests and the results are stored in the graph. You can use that information to build a list of users with similar interests; that list will then need to be filtered somehow to bring it down to the top 10.
Finally, add the new user and their interests to the graph of users and interests. This is done last so that the user with the most closely matched interests isn't the user themselves.
Note:
There might be some statistical shortcuts you could take, something like: A is related to B, B is related to C, C is related to D, therefore A is related to B, C, and D. However, using those kinds of shortcuts likely requires a much better understanding of how your comparison function works, which is a bit beyond my expertise.
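A hedged Python sketch of those two look-up structures; similarity() stands in for the expensive WikipediaMiner comparison and is an assumption, not a real API:

```python
from collections import defaultdict

users_by_interest = defaultdict(set)   # interest -> users having it
interest_similarity = {}               # frozenset({a, b}) -> cached score


def compare_new_interests(new_interests, similarity):
    """Compare each new unique interest against the known ones exactly once."""
    for a in new_interests:
        for b in users_by_interest:
            key = frozenset((a, b))
            if a != b and key not in interest_similarity:
                interest_similarity[key] = similarity(a, b)


def top_matches(new_interests, k=10):
    """Score existing users by summing cached similarities over their interests."""
    scores = defaultdict(float)
    for a in new_interests:
        for b, users in users_by_interest.items():
            s = 1.0 if a == b else interest_similarity.get(frozenset((a, b)), 0.0)
            for user in users:
                scores[user] += s
    return sorted(scores, key=scores.get, reverse=True)[:k]


def register_user(user_id, interests):
    """Called last, so a new user never matches against themselves."""
    for interest in interests:
        users_by_interest[interest].add(user_id)
```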
Approximate solution:
I forgot to mention it earlier, but what you're looking at when comparing users or interests is a "nearest neighbour search" in high dimensions, meaning that for exact solutions a linear search generally works better than fancy data structures. So approximation is probably the best way to go if you need it to be faster.
To obtain a quick approximate solution (without guarantees as to how close it is), you'll need a data structure that lets you quickly determine which users are likely to be similar to a new user.
One way to build that structure:
Pick 300 random users. These will be the seed users for 300 clusters. Ideally you'd use the 300 users that are least closely related to each other, but that's probably not practical; still, it might be wise to ensure that no seed user is too closely related to the other seeds (measured as the sum or average of its comparisons with them).
The clusters are then filled by each user joining the cluster whose representative (seed) user most closely matches them.
The top 10 can then be determined by picking the 10 most closely related users from that cluster.
If you ensure that the number of clusters and the number of users per cluster both stay fairly close to sqrt(number of users), then you obtain a fair approximation in O(sqrt(N)) by only checking the points within one cluster. You can improve that approximation by including users in additional clusters and by checking the representative users of each cluster. The more clusters you check, the closer you get towards O(N) and an exact solution, although there's probably no way to say how close the current solution is to the exact one. Chances are you start to hit diminishing returns after checking more than about log(sqrt(N)) clusters in total, which would put you at O(sqrt(N) log(sqrt(N))).
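A minimal sketch of that clustering shortcut; compare() is a placeholder for the expensive user-to-user comparison, and every name here is an assumption:

```python
import math
import random


def build_clusters(users, compare, n_clusters=None):
    """Pick random seed users and assign every user to its best-matching seed."""
    n_clusters = n_clusters or max(1, int(math.sqrt(len(users))))
    seeds = random.sample(users, n_clusters)
    clusters = {seed: [] for seed in seeds}
    for user in users:
        best_seed = max(seeds, key=lambda seed: compare(user, seed))
        clusters[best_seed].append(user)
    return clusters


def approximate_top_matches(new_user, clusters, compare, k=10, clusters_to_check=1):
    """Only compare the new user against members of the best-matching cluster(s)."""
    ranked_seeds = sorted(clusters, key=lambda seed: compare(new_user, seed), reverse=True)
    candidates = []
    for seed in ranked_seeds[:clusters_to_check]:
        candidates.extend(clusters[seed])
    return sorted(candidates, key=lambda u: compare(new_user, u), reverse=True)[:k]
```

Increasing clusters_to_check trades speed for accuracy, which is the O(sqrt(N)) to O(N) range described above.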
A few thoughts...
This is not exactly a graph-theory solution.
Assuming a finite set of interests, maintain for each user a bit sequence in which each bit represents whether or not the user has that interest.
For a new user, simply AND (bitwise multiply) their bit sequence with each existing user's bit sequence and count the set bits in the result, which gives an idea of how closely their interests match.
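A small sketch of that bit-sequence idea in Python, with a made-up interest vocabulary and made-up users:

```python
interests = ["rock", "jazz", "hiking", "chess", "cooking"]
bit_of = {name: 1 << i for i, name in enumerate(interests)}


def to_bits(user_interests):
    """Encode a user's interests as an integer bit mask."""
    mask = 0
    for name in user_interests:
        mask |= bit_of[name]
    return mask


def match_score(a, b):
    """Number of shared interests: population count of the bitwise AND."""
    return bin(a & b).count("1")


existing_users = {
    "alice": to_bits(["rock", "hiking", "chess"]),
    "bob":   to_bits(["jazz", "cooking"]),
}

new_user = to_bits(["rock", "chess", "cooking"])
ranked = sorted(existing_users,
                key=lambda u: match_score(new_user, existing_users[u]),
                reverse=True)
print(ranked[:10])  # the top 10 most compatible existing users
```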

How do you design an OLAP Database?

I need a mental process to design an OLAP database...
Essentially for standard relational it'd be (loosely):
Identify Entities
Identify Relationships
Identify Properties of Entities
For each property:
Ensure property can be related to only one entity
Ensure property is directly related to entity
For OLAP databases, I understand the terminology, the motivation and the structure; however, I have no clue as to how to decompose my relational model into an OLAP model.
Identify Dimensions (or By's)
These are anything that you may want to analyse or group your report by. Every table in the source database is a potential dimension. Dimensions should be hierarchical if possible, e.g. your Date dimension should have a year > month > day hierarchy; similarly, Location should have, for example, a Country > Region > City hierarchy. This allows your OLAP tool to calculate aggregations more efficiently.
Identify Measures
These are the KPIs, or the actual numerical information your client wants to see. They are usually capable of being aggregated, so any non-flag, non-key numeric field in the source database is a potential measure.
Arrange these in a star schema, with the measures in the central 'fact' table and FK relationships to the applicable dimension tables. Measures should be stored at the lowest dimension hierarchy level.
Identify the 'grain' of the fact table; this is essentially the level of detail held. It is usually determined by the reporting requirements, the data granularity available in the source, and the performance requirements of the reporting solution. You may identify the grain as you go, or you may approach it as a final step once all the important data has been identified. I tend to have a final step to ensure the grain is consistent between my fact tables.
The final step is identifying slowly changing dimensions, and the requirements for these. For example, if the customer dimension includes an element of their address and they move, how is that to be handled?
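To make the star-schema and grain steps concrete, here is a purely illustrative DDL sketch; the business domain and every name are invented: two dimensions, a fact table at the grain of one row per sale line, and a type-2 slowly changing customer dimension.

```sql
CREATE TABLE dim_date (
    date_key   INT PRIMARY KEY,          -- e.g. 20160115
    full_date  DATE NOT NULL,
    year       SMALLINT NOT NULL,        -- year > month > day hierarchy
    month      TINYINT NOT NULL,
    day        TINYINT NOT NULL
);

CREATE TABLE dim_customer (
    customer_key INT PRIMARY KEY,        -- surrogate key
    customer_id  INT NOT NULL,           -- business key
    city         VARCHAR(50),
    valid_from   DATE NOT NULL,          -- SCD type 2: one row per address version
    valid_to     DATE NULL
);

CREATE TABLE fact_sales (
    date_key     INT NOT NULL REFERENCES dim_date(date_key),
    customer_key INT NOT NULL REFERENCES dim_customer(customer_key),
    quantity     INT NOT NULL,           -- measures stored at the lowest grain
    amount       DECIMAL(10, 2) NOT NULL
);
```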
One important point when identifying the dimensions and measures is the final cardinality you are choosing for the model.
Let's say that data is entered into your relational database throughout the day.
Maybe you don't need to visualize or aggregate the measures by hour, or even by day; you can choose weekly or monthly granularity, etc.
