Local Cube - Is there a reason to use OLTP's grain? - olap

I am building a local OLAP cube based on data gathered from several OLTP sources. Please note that I am doing this programmatically and do not have access to tools like SSAS or MDX-based tools.
My requirements are somewhat different than the operational requirements of the OLTP system users. I know that "in theory" it would be preferable to retain the most atomic grain available to me, but I don't see a reason to include the lowest level of data in the cube.
For example (I am simplifying), I have a measure field like "Price". Additionally, each sales fact has a Version attribute with values such as:
List (Original/Initial)
Initial Quote
Adjusted Quote
Sold
These describe the internal development of our pricing and are critical to the reports that I create.
However, for my reporting purposes, I will always want to know the value of all Versions whenever I am referencing a given transaction. Therefore, I am considering pivoting measures like Price by Version in the cube (Version will still be its own entity in the data model), resulting in measures like:
PriceList
PriceQuotedInitial
PriceQuotedAdjusted
PriceSold
Since only one Version is ever effective at a given point in time, we do not need to aggregate across multiple Versions.
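To make the proposed pivot concrete, here is a minimal sketch in plain Python. The sample rows, the field names (transaction_id, version, price), and the helper discount_from_list are hypothetical and only illustrate the idea; the measure names are the ones listed above.

```python
from collections import defaultdict

# Hypothetical long-format fact rows: one row per (transaction, Version).
long_rows = [
    {"transaction_id": 1, "version": "List", "price": 120.0},
    {"transaction_id": 1, "version": "Initial Quote", "price": 110.0},
    {"transaction_id": 1, "version": "Adjusted Quote", "price": 105.0},
    {"transaction_id": 1, "version": "Sold", "price": 100.0},
    {"transaction_id": 2, "version": "List", "price": 80.0},
    {"transaction_id": 2, "version": "Sold", "price": 75.0},
]

# Map each Version value to the pivoted measure name from the question.
MEASURE_BY_VERSION = {
    "List": "PriceList",
    "Initial Quote": "PriceQuotedInitial",
    "Adjusted Quote": "PriceQuotedAdjusted",
    "Sold": "PriceSold",
}


def pivot_by_version(rows):
    """Collapse one-row-per-Version facts into one row per transaction."""
    # Every transaction starts with all four measures set to None.
    wide = defaultdict(lambda: dict.fromkeys(MEASURE_BY_VERSION.values()))
    for row in rows:
        measure = MEASURE_BY_VERSION[row["version"]]
        wide[row["transaction_id"]][measure] = row["price"]
    return dict(wide)


def discount_from_list(wide_row):
    """Example calculated measure: PriceList minus PriceSold, if both exist."""
    if wide_row["PriceList"] is None or wide_row["PriceSold"] is None:
        return None
    return wide_row["PriceList"] - wide_row["PriceSold"]


wide_rows = pivot_by_version(long_rows)
```

With the pivoted rows, a cross-Version comparison becomes a calculation within a single record, at the cost of no longer being able to filter by Version.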
Known Advantages
Since this will be a local cube file, it appears this approach would simplify the creation of several required calculated measures that compare Price across different Versions (creating calculated measures at various levels of aggregation would not be an issue if I were doing this with MDX).
It would also reduce the number of records by a factor of between 3 and 6, which would significantly boost performance for a local cube.
Known Disadvantages
While the data model will match the business process, the cube would not store the data at the most atomic level. An analyst would need to distinguish between Versions by Measure selection, and could not filter by Version - they would always get all available Versions.
This approach will greatly increase the number of Measures. For example, there is not just one Price we are tracking, but several price components and other Measures we track for each transaction. So if we track a dozen true Measures for each transaction, that might end up being 50-60 Measures if I take this approach.
I understand that for very large Fact tables, it would be preferable to factor all possible fields out of the Fact table into Dimensions for performance purposes, but I am not sure whether this is the case when using a local cube, as in all likelihood, I will put fewer than 50,000 records into any given cube file, given the limitations of local cubes.
Are there other drawbacks to this approach that I'm missing?

Related

Should bulk data be included in the graph?

I have been using ArangoDB for a while now for smaller system requirements and love it. We have recently been tasked by a client to analyze a large amount of financial data which is currently housed in SQL but I was hoping to more efficiently query the data in ArangoDB.
One of the simpler requirements is to roll up GL entry amounts to determine account totals across their general ledger. There are approximately 2200 accounts in their general ledger, with a maximum depth of approximately 10. There are approximately 150 million GL entries, and I was wondering what the most efficient method of aggregating account totals would be.
I plan on using a graph to manage the account hierarchy/structure, but should edges be created for all 150 million GL entries, or is it more efficient to traverse the inbound relationships and run subqueries on the GL entry collections to total the amounts?
I would normally just run the tests myself but I am struggling with simply loading the data in my local instance of arango and thought I would get some insight while I work at loading the data.
Thanks in advance!
What benefit are you looking to gain by moving the data into a graph model? If it's to build connections between accounts, customers, GLs, and such, then it might be best to go with a hybrid model.
It's possible to build a hierarchical graph style relationship between your accounts and GL's, but then store your GL entries in a flat document collection.
This way you can use AQL style graph queries to quickly determine relationships between accounts and GLs. If you need to SUM entries in a GL, then you can have queries that identify the GL._id's and then sum the flat collections that have foreign keys that reference the GL._id they are associated with.
By adding indexes on your foreign keys you will speed up queries, and by using Foxx Micro Services you can provide a layer of abstraction between a REST style query and the actual data model you are using. That way if you find you need to change your database model under the covers, by updating your Foxx MicroServices the consumer doesn't need to be aware of those changes.
I can't answer your question on performance, you'll just need to ensure your hardware is appropriately spec'ed.
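As a rough illustration of the hybrid model, here is a minimal sketch in plain Python rather than AQL. The account names, the parent_of mapping, and the entry fields are hypothetical; the point is only that the entries are summed once per account in a flat pass, and the small hierarchy is then used to push those sums up to the ancestors.

```python
from collections import defaultdict

# Hypothetical account hierarchy: child account -> parent account (None = root).
parent_of = {
    "assets": None,
    "cash": "assets",
    "checking": "cash",
    "savings": "cash",
    "receivables": "assets",
}

# Hypothetical flat GL entries, each carrying a foreign key to its account.
gl_entries = [
    {"account": "checking", "amount": 100.0},
    {"account": "checking", "amount": -25.0},
    {"account": "savings", "amount": 500.0},
    {"account": "receivables", "amount": 40.0},
]


def rollup_totals(parent_of, gl_entries):
    """Sum entries per account once, then add each sum to every ancestor."""
    direct = defaultdict(float)
    for entry in gl_entries:  # single pass over the flat collection
        direct[entry["account"]] += entry["amount"]

    totals = {account: 0.0 for account in parent_of}
    for account, amount in direct.items():
        node = account
        while node is not None:  # walk the short hierarchy, not the entries
            totals[node] += amount
            node = parent_of[node]
    return totals


totals = rollup_totals(parent_of, gl_entries)
```

With about 2200 accounts and a depth of about 10, the hierarchy walk is tiny compared with the pass over the entries, which is why the entries themselves do not need to be edges in this sketch.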

Categorical Clustering of Users Reading Habits

I have a data set with a set of users and a history of documents they have read, all the documents have metadata attributes (think topic, country, author) associated with them.
I want to cluster the users based on their reading history per one of the metadata attributes associated with the documents they have clicked on. This attribute has 7 possible categorical values and I want to prove a hypothesis that there is a pattern to the users' reading habits and they can be divided into seven clusters. In other words, that users will often read documents based on one of the 7 possible values in the particular metadata category.
Anyone have any advice on how to do this especially in R, like specific packages? I realize that the standard k-means algorithm won't work well in this case since the data is categorical and not numeric.
Cluster analysis cannot be used to prove anything.
The results are highly sensitive to normalization, feature selection, and choice of distance metric. So no result is trustworthy. Most results you get out are outright useless. So it's as reliable as a proof by example.
Cluster analysis should only be used for exploratory analysis, i.e. to find patterns that you then need to study with other methods.
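For the exploratory step, one simple starting point is to summarize each user's reading history as shares per category before trying any clustering. The sketch below is in Python rather than R, and the click log, user names, and category values are hypothetical; it proves nothing on its own and only produces a table to look at.

```python
from collections import Counter

# Hypothetical click log: (user, value of the metadata attribute of the document read).
reads = [
    ("alice", "sports"), ("alice", "sports"), ("alice", "politics"),
    ("bob", "tech"), ("bob", "tech"), ("bob", "tech"), ("bob", "sports"),
]


def category_shares(reads):
    """Return, per user, the fraction of reads falling in each category."""
    counts = {}
    for user, category in reads:
        counts.setdefault(user, Counter())[category] += 1
    return {
        user: {cat: n / sum(c.values()) for cat, n in c.items()}
        for user, c in counts.items()
    }


def dominant_category(user_shares):
    """Return (category, share) for the user's most-read category."""
    return max(user_shares.items(), key=lambda item: item[1])


shares = category_shares(reads)
```

If most users have one clearly dominant category, that is a pattern worth testing with other methods; if the shares are spread evenly, seven clean clusters are unlikely.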

StatsD/Graphite Naming Conventions for Metrics

I'm beginning the process of instrumenting a web application, and using StatsD to gather as many relevant metrics as possible. For instance, here are a few examples of the high-level metric names I'm currently using:
http.responseTime
http.status.4xx
http.status.5xx
view.renderTime
oauth.begin.facebook
oauth.complete.facebook
oauth.time.facebook
users.active
...and there are many, many more. What I'm grappling with right now is establishing a consistent hierarchy and set of naming conventions for the various metrics, so that the current ones make sense and that there are logical buckets within which to add future metrics.
My question is twofold:
What relevant metrics are you gathering that you have found indispensable?
What naming structure are you using to categorize metrics?
This is a question that has no definitive answer but here's how we do it at Datadog (we are a hosted monitoring service so we tend to obsess over these things).
1. Which metrics are indispensable? It depends on the beholder. But at a high level, for each team, it is any metric that is as close to their goals as possible (which may not be the easiest to gather).
System metrics (e.g. system load, memory, etc.) are trivial to gather but seldom actionable, because it is too hard to reliably connect them to a probable cause.
On the other hand, the number of completed product tours matters to anyone tasked with making sure new users are happy from the first minute they use the product. StatsD makes this kind of stuff trivially easy to collect.
We have also found that the core set of key metrics for any team changes as the product evolves, so there is a continuous editorial process.
Which in turn means that anyone in the company needs to be able to pick and choose which metrics matter to them. No permissions asked, no friction to get to the data.
2. Naming structure. The highest level of hierarchy is the product line or the process. Our web frontend is internally called dogweb, so all the metrics from that component are prefixed with "dogweb.". The next level of hierarchy is the sub-component, e.g. "dogweb.db.", "dogweb.http.", etc.
The last level of hierarchy is the thing being measured (e.g. renderTime or responseTime).
The unresolved issue in Graphite is the encoding of metric metadata in the metric name (and selection using *, e.g. dogweb.http.browser.*.renderTime). It's clever but can get in the way.
We ended up implementing explicit metadata in our data model, but this is not in statsd/graphite so I will leave the details out. If you want to know more, contact me directly.
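The three-level naming scheme and the wildcard selection can be sketched as follows. This is plain Python, not the StatsD or Graphite API; the helper names metric_name and matches are hypothetical, as are the db measure and the browser segments. The matcher assumes that * stands for exactly one dot-separated segment.

```python
from fnmatch import fnmatchcase


def metric_name(product, component, measure):
    """Join the three hierarchy levels into a dotted metric name."""
    return ".".join((product, component, measure))


def matches(pattern, name):
    """Segment-wise wildcard match: '*' stands for exactly one dotted segment."""
    pattern_parts = pattern.split(".")
    name_parts = name.split(".")
    if len(pattern_parts) != len(name_parts):
        return False
    return all(fnmatchcase(n, p) for p, n in zip(pattern_parts, name_parts))


names = [
    metric_name("dogweb", "http", "responseTime"),
    "dogweb.http.browser.firefox.renderTime",  # metadata (browser) encoded in the name
    "dogweb.http.browser.chrome.renderTime",
    metric_name("dogweb", "db", "queryTime"),
]

# Select the render time across all browsers.
selected = [n for n in names if matches("dogweb.http.browser.*.renderTime", n)]
```

The sketch also shows the drawback mentioned above: once metadata such as the browser is part of the name, every query has to know at which position in the name it sits.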

How do you design an OLAP Database?

I need a mental process to design an OLAP database...
Essentially for standard relational it'd be (loosely):
Identify Entities
Identify Relationships
Identify Properties of Entities
For each property:
Ensure property can be related to only one entity
Ensure property is directly related to entity
For OLAP databases, I understand the terminology, the motivation and the structure; however, I have no clue as to how to decompose my relational model into an OLAP model.
Identify Dimensions (or By's)
These are anything that you may want to analyse/group your report by. Every table in the source database is a potential Dimension. Dimensions should be hierarchical if possible, e.g. your Date dimension should have a Year, Month, Day hierarchy. Similarly, Location should have, for example, a Country, Region, City hierarchy. This will allow your OLAP tool to calculate aggregations more efficiently.
Identify Measures
These are the KPIs, or the actual numerical information your client wants to see. They are usually capable of being aggregated, so any non-flag, non-key numeric field in the source database is a potential measure.
Arrange in star schema, with Measures in the center 'Fact' table, and FK relations to applicable Dimension tables. Measures should be stored at the lowest dimension hierarchy level.
Identify the 'Grain' of the fact table; this is essentially the 'level of detail' held. It is usually determined by the reporting requirements, the data granularity available in the source, and the performance requirements of the reporting solution. You may identify the grain as you go, or you may approach it as a final step once all the important data has been identified. I tend to have a final step to ensure the grain is consistent between my fact tables.
The final step is identifying slowly changing dimensions, and the requirements for these. For example, if the customer dimension includes an element of their address and they move, how is that to be handled?
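The steps above can be sketched as a tiny star schema using Python's built-in sqlite3 module. The table names, column names, and sample rows are hypothetical; the sketch only shows a fact table stored at the lowest level of two hierarchical dimensions and aggregated up to a higher level.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimensions carry the hierarchy levels to group by.
    CREATE TABLE dim_date (
        date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER, day INTEGER);
    CREATE TABLE dim_location (
        location_id INTEGER PRIMARY KEY, country TEXT, region TEXT, city TEXT);
    -- Fact table: one row per (day, city), the lowest level of both hierarchies.
    CREATE TABLE fact_sales (
        date_id INTEGER REFERENCES dim_date(date_id),
        location_id INTEGER REFERENCES dim_location(location_id),
        amount REAL);
""")
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?, ?)",
                 [(1, 2023, 12, 31), (2, 2024, 1, 1)])
conn.executemany("INSERT INTO dim_location VALUES (?, ?, ?, ?)",
                 [(1, "US", "West", "Seattle"), (2, "US", "East", "Boston")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(1, 1, 10.0), (2, 1, 20.0), (2, 2, 5.0)])


def sales_by(level_column, dim_table, key):
    """Aggregate the measure up to one level of a dimension hierarchy."""
    # Identifiers are interpolated only because they come from this script,
    # never from user input.
    query = (
        f"SELECT d.{level_column}, SUM(f.amount) FROM fact_sales f "
        f"JOIN {dim_table} d ON f.{key} = d.{key} "
        f"GROUP BY d.{level_column} ORDER BY d.{level_column}"
    )
    return conn.execute(query).fetchall()


by_year = sales_by("year", "dim_date", "date_id")
by_region = sales_by("region", "dim_location", "location_id")
```

Because the facts sit at the day and city level, the same table can answer questions by year, month, country, or region without being reloaded.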
One important point in identifying the Dimensions and Measures is the final granularity that you choose for the model.
Let's say that data is entered into your relational database throughout the day.
Maybe you don't need to visualize or aggregate the measures by hour, or even by day. You can choose a weekly granularity, a monthly one, etc.
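As a small illustration of choosing a coarser grain, the sketch below collapses timestamped source rows into one fact per ISO week. The sample rows and the function name to_weekly_grain are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical source rows captured throughout the day: (timestamp, amount).
source_rows = [
    (datetime(2024, 1, 1, 9, 15), 10.0),   # Monday, ISO week 1
    (datetime(2024, 1, 3, 17, 40), 5.0),   # Wednesday, ISO week 1
    (datetime(2024, 1, 8, 8, 5), 7.5),     # Monday, ISO week 2
]


def to_weekly_grain(rows):
    """Collapse timestamped rows to one fact per ISO (year, week)."""
    weekly = defaultdict(float)
    for timestamp, amount in rows:
        iso_year, iso_week, _ = timestamp.isocalendar()
        weekly[(iso_year, iso_week)] += amount
    return dict(weekly)


weekly_facts = to_weekly_grain(source_rows)
```

The trade-off is that once the facts are stored per week, daily or hourly questions can no longer be answered from the model.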

How to handle large amounts of data for a web statistics module

I'm developing a statistics module for my website that will help me measure conversion rates, and other interesting data.
The mechanism I use is to store an entry in a statistics table in my DB each time a user enters a specific zone (I avoid duplicate records with the help of cookies).
For example, I have the following zones:
Website - a general zone used to count unique users as I stopped trusting Google Analytics lately.
Category - self descriptive.
Minisite - self descriptive.
Product Image - whenever a user sees a product and the lead submission form.
The problem is that after a month, my statistics table is packed with a lot of rows, and the ASP.NET pages I wrote to parse the data load really slowly.
I thought about writing a service that would somehow parse the data, but I can't see any way to do that without losing flexibility.
My questions:
How do large-scale data parsing applications like Google Analytics load the data so fast?
What is the best way for me to do it?
Maybe my DB design is wrong and I should store the data in only one table?
Thanks for anyone that helps,
Eytan.
The basic approach you're looking for is called aggregation.
You are interested in certain functions calculated over your data. Instead of calculating them "online" when the page that displays them loads, you calculate them offline, either via a nightly batch process or incrementally when the log record is written.
A simple enhancement would be to store counts per user/session, instead of storing every hit and counting them. That would reduce your analytic processing requirements by a factor in the order of the hits per session. Of course it would increase processing costs when inserting log entries.
Another kind of aggregation is called online analytical processing, which only aggregates along some dimensions of your data and lets users aggregate the other dimensions in a browsing mode. This trades off performance, storage and flexibility.
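The incremental variant can be sketched like this. It is a minimal in-memory illustration, assuming a counter keyed by zone and day; the names daily_zone_hits, record_hit, and hits_for_zone are hypothetical, and in practice the counters would live in a database table.

```python
from collections import Counter
from datetime import date

# Running totals keyed by (zone, day); updated as each hit is logged.
daily_zone_hits = Counter()


def record_hit(zone, day):
    """Aggregate at write time instead of storing one row per hit."""
    daily_zone_hits[(zone, day)] += 1


def hits_for_zone(zone):
    """Report query: reads a handful of counters, not every raw hit."""
    return sum(n for (z, _), n in daily_zone_hits.items() if z == zone)


record_hit("Website", date(2024, 1, 1))
record_hit("Website", date(2024, 1, 1))
record_hit("Category", date(2024, 1, 1))
record_hit("Website", date(2024, 1, 2))
```

Each insert now costs a little more, but the reporting page only reads one counter per zone and day.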
It seems like you could do well by using two databases. One is for transactional data and it handles all of the INSERT statements. The other is for reporting and handles all of your query requests.
You can index the snot out of the reporting database, and/or denormalize the data so fewer joins are used in the queries. Periodically export data from the transaction database to the reporting database. This act will improve the reporting response time along with the aggregation ideas mentioned earlier.
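A minimal sketch of the two-database idea, using two in-memory SQLite databases from Python's standard library as stand-ins. The table names, the export_new_hits function, and the "last exported id" high-water mark are assumptions made for illustration; a real setup would run the export on a schedule.

```python
import sqlite3

oltp = sqlite3.connect(":memory:")       # handles the INSERT statements
reporting = sqlite3.connect(":memory:")  # handles the report queries

oltp.execute("CREATE TABLE hits (id INTEGER PRIMARY KEY, zone TEXT, day TEXT)")
reporting.execute(
    "CREATE TABLE hits_by_zone_day (zone TEXT, day TEXT, hits INTEGER, "
    "PRIMARY KEY (zone, day))"
)


def export_new_hits(last_exported_id):
    """Copy hits newer than the high-water mark into the denormalized summary."""
    rows = oltp.execute(
        "SELECT zone, day, COUNT(*), MAX(id) FROM hits "
        "WHERE id > ? GROUP BY zone, day",
        (last_exported_id,),
    ).fetchall()
    for zone, day, count, _ in rows:
        # Add to an existing summary row, or create it if it is missing.
        cursor = reporting.execute(
            "UPDATE hits_by_zone_day SET hits = hits + ? WHERE zone = ? AND day = ?",
            (count, zone, day),
        )
        if cursor.rowcount == 0:
            reporting.execute(
                "INSERT INTO hits_by_zone_day VALUES (?, ?, ?)", (zone, day, count)
            )
    reporting.commit()
    # New high-water mark: the largest id seen, or the old mark if nothing was new.
    return max((row[3] for row in rows), default=last_exported_id)


oltp.executemany(
    "INSERT INTO hits (zone, day) VALUES (?, ?)",
    [("Website", "2024-01-01"), ("Website", "2024-01-01"), ("Category", "2024-01-01")],
)
mark = export_new_hits(0)
oltp.execute("INSERT INTO hits (zone, day) VALUES (?, ?)", ("Website", "2024-01-01"))
mark = export_new_hits(mark)
```

The reporting table here is already aggregated, so report pages never scan the raw hits.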
Another trick to know is partitioning. Look up how that's done in the database of your choice - but basically the idea is that you tell your database to keep a table partitioned into several subtables, each with an identical definition, based on some value.
In your case, what is very useful is "range partitioning" -- choosing the partition based on the range into which a value falls. If you partition by date range, you can create separate sub-tables for each week (or each day, or each month -- depends on how you use your data and how much of it there is).
This means that if you specify a date range when you issue a query, the data that is outside that range will not even be considered; that can lead to very significant time savings, even better than an index (an index covers every row, so it grows with your data, whereas with one partition per day a query only touches the partitions for the days it asks about).
This makes both online queries (ones issued when you hit your ASP page), and the aggregation queries you use to pre-calculate necessary statistics, much faster.
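To illustrate why a date-range query can skip most of the data, here is a toy model of range partitioning in plain Python. A real database does this internally once the table is declared as partitioned; the monthly partition key and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import date

# One "sub-table" per (year, month); each holds rows with an identical shape.
partitions = defaultdict(list)


def insert_hit(day, zone):
    """Route the row to its partition based on the date range it falls into."""
    partitions[(day.year, day.month)].append((day, zone))


def count_hits(start, end):
    """Count hits in [start, end], scanning only partitions that overlap the range."""
    scanned = 0
    total = 0
    for (year, month), rows in partitions.items():
        if (start.year, start.month) <= (year, month) <= (end.year, end.month):
            scanned += 1
            total += sum(1 for day, _ in rows if start <= day <= end)
    return total, scanned


insert_hit(date(2024, 1, 15), "Website")
insert_hit(date(2024, 2, 3), "Website")
insert_hit(date(2024, 2, 20), "Category")
insert_hit(date(2024, 3, 9), "Website")

total, scanned = count_hits(date(2024, 2, 1), date(2024, 2, 29))
```

Only the February partition is scanned for the February query; the rows for January and March are never touched, no matter how many there are.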
