How to measure information content or quantify information in a query vector? - information-theory

I have a query that consists of a number of attributes. I would like to measure the amount of information in that query, and then measure how much that information content is reduced (or unaffected) if I remove a certain attribute.
Is there a measure or metric from information theory that solves this problem?
Thanks
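As a minimal sketch of one common information-theoretic approach, assuming the query attributes can be treated as discrete random variables observed over a sample or log of queries: Shannon entropy measures the information content, and the drop in joint entropy when an attribute is removed quantifies how much that attribute contributed. The toy query log and attribute names below are purely illustrative.

    from collections import Counter
    from math import log2

    def entropy(values):
        """Shannon entropy (in bits) of a list of observed values."""
        counts = Counter(values)
        n = len(values)
        return -sum((c / n) * log2(c / n) for c in counts.values())

    # Hypothetical query log: each row is one query, columns are attributes.
    queries = [
        {"city": "Rome",  "keyword": "pizza", "radius": 5},
        {"city": "Rome",  "keyword": "pasta", "radius": 5},
        {"city": "Milan", "keyword": "pizza", "radius": 10},
    ]
    attrs = ["city", "keyword", "radius"]

    # Joint entropy of the full query vs. joint entropy with one attribute removed.
    full = entropy([tuple(q[a] for a in attrs) for q in queries])
    for removed in attrs:
        kept = [a for a in attrs if a != removed]
        reduced = entropy([tuple(q[a] for a in kept) for q in queries])
        print(f"dropping {removed!r} loses {full - reduced:.3f} bits")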

Related

What happens when the top-k query does not find enough documents to satisfy the k constraint?

I am evaluating the top-k range query using NDCG. Given a spatial area and a query keyword, my top-k range query must return k documents in the given area that are textually relevant to the query keyword.
In my scenario, the range query usually finds only one document to return. But I have to compare this query to another one that can find more objects in the given area with the same keyword. This is possible because of an approach I am testing that improves the objects' descriptions.
I cannot figure out how to use NDCG to compare these two queries in this scenario. I would like to compare Query A and Query B using NDCG@5 and NDCG@10, but Query A only finds one object. Query A will have a high NDCG value despite its lower ability to find objects (probably the maximum value of one), while Query B finds more objects (in my opinion, a better solution) but gets a lower NDCG value than Query A.
You can consider looking at a different measure, e.g. Recall@10, if you care less about the ranking for your application.
NDCG is a measure designed for web search, where you really want to penalize a system that doesn't return the best item as the topmost result, which is why it has an exponential decay factor. This makes sense for navigational queries like "stackoverflow": you will look quite bad if you don't return that website first.
It sounds like you are building something a little more sophisticated, where the user cares about many results. Therefore, a more recall-oriented measure (one that cares more about getting multiple things right than about the ranking) may make more sense.
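A minimal sketch of Recall@k, assuming you have a set of judged-relevant document ids and a ranked list of returned ids; all names and values are illustrative.

    def recall_at_k(ranked_ids, relevant_ids, k):
        """Fraction of all relevant documents that appear in the top-k results."""
        if not relevant_ids:
            return 0.0
        hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
        return hits / len(relevant_ids)

    # Query A returns one relevant document, Query B returns several.
    relevant = {"d1", "d2", "d3", "d4"}
    print(recall_at_k(["d1"], relevant, 10))                    # 0.25
    print(recall_at_k(["d2", "d9", "d3", "d1"], relevant, 10))  # 0.75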
Regarding "its lower ability to find more objects": I'd also double-check your implementation of NDCG. You always want to divide by the DCG of the ideal ranking, regardless of what actually gets returned. It sounds like your Query A returns 1 correct object, while Query B returns more correct objects, but not at high ranks? Either way, you would expect Query A's DCG to be divided by the DCG of a perfect ranking -- one built from the 10, 20, or thousands of "correct" objects. It may be that you just don't have enough judgments, and therefore your "perfect ranking" is too small, and therefore you aren't penalizing Query A enough.
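A minimal sketch of NDCG@k that always normalizes by the ideal DCG computed from all known judgments, not just from whatever a query happened to return. The relevance grades and document ids are illustrative (this version uses linear gains; an exponential gain such as 2^rel - 1 is also common).

    from math import log2

    def dcg(gains):
        # Position i gets the standard log2(i + 2) discount (rank 1 -> log2(2) = 1).
        return sum(g / log2(i + 2) for i, g in enumerate(gains))

    def ndcg_at_k(ranked_ids, judged_relevance, k):
        """judged_relevance: dict of doc_id -> graded relevance (0 = not relevant)."""
        gains = [judged_relevance.get(doc, 0) for doc in ranked_ids[:k]]
        ideal = sorted(judged_relevance.values(), reverse=True)[:k]
        ideal_dcg = dcg(ideal)
        return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

    # Five judged-relevant documents exist in the area; Query A returns only one of them.
    judgments = {"d1": 3, "d2": 2, "d3": 2, "d4": 1, "d5": 1}
    print(ndcg_at_k(["d1"], judgments, 5))                    # penalized, because the ideal list has 5 docs
    print(ndcg_at_k(["d2", "d1", "d3", "d4"], judgments, 5))  # higher, more relevant docs returned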

Microstrategy intelligent cube data aggregation

I created a cube that contains five attributes and a metric, and I want to create a document from this cube with different visualisations for each attribute. The problem is that data in the cube is aggregated based on all attributes in the cube, so when you add a grid with one attribute and the metric, the numbers will not be correct.
Is there any way to make the metric dynamically aggregate depending on the attribute in use?
This depends on what kind of metric you have in the cube. The best way to achieve aggregation across all attributes is obviously to have the most granular, least aggregated data in the cube, but understandably this is not always possible.
If your metric is a simple SUM metric, then you can set the dynamic aggregation setting on the metric to SUM, and it should perform SUMs appropriately regardless of the attributes you place on your document/report, unless your attribute relationships are not set up correctly or there are many-to-many relationships between some of those attributes.
If your metric is a distinct count metric, then the approach is slightly different and has been covered previously in a few places. Here is one, written for an older version of Microstrategy, but the logic can still be applied to newer versions:
http://community.microstrategy.com/t5/tkb/articleprintpage/tkb-id/architect/article-id/1695
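This is not MicroStrategy itself, just a pandas sketch of why an additive SUM metric can be dynamically re-aggregated to any subset of attributes, while a distinct count computed at the cube's grain cannot. All column names and values are illustrative.

    import pandas as pd

    cube = pd.DataFrame({
        "region":   ["North", "North", "South", "South"],
        "product":  ["A", "B", "A", "B"],
        "customer": ["c1", "c1", "c2", "c1"],
        "revenue":  [100, 50, 70, 30],
    })

    # SUM is additive: rolling the cube's grain up to one attribute is always correct.
    print(cube.groupby("region")["revenue"].sum())

    # Distinct count is NOT additive: summing per-grain distinct counts overstates it.
    per_grain = cube.groupby(["region", "product"])["customer"].nunique()
    print(per_grain.groupby(level="region").sum())        # wrong: North -> 2 (double-counts c1)
    print(cube.groupby("region")["customer"].nunique())   # right: North -> 1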

Is there a way to find out the number of hits/lookups on a particular Oracle table?

Is there a way to find out the number of hits/lookups on a particular Oracle table (i.e., how often a table is queried over a given period of time) without resorting to auditing (FGA)?
I'm able to get some information from gv$SQL, gv$SQL_AREA and dba_tab_modifications, but it's not up to the mark.
If you are licensed to use the AWR, dba_hist_seg_stat has information about the I/O (logical and physical) done on each segment during each snapshot. If you aren't licensed to use the AWR, you can query the v$segstat and v$statname views (joining on statistic#). There are a ton of statistics you can get information about, most of which you couldn't care less about. Something like "consistent gets" would be a reasonable thing to look at, but you can get a ton of detail depending on how you want to slice and dice the data. The downside, though, is that the data isn't historical -- you'd need to do things like save off the current values on a regular basis if you want to track activity over time.
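As a sketch of querying the segment statistics from a script, using the python-oracledb driver and the V$SEGMENT_STATISTICS convenience view (the same data as V$SEGSTAT, with object names already resolved) rather than the raw V$SEGSTAT/V$STATNAME join. The connection details, owner and table name are placeholders.

    import oracledb

    # Placeholder credentials and DSN -- replace with your own.
    conn = oracledb.connect(user="monitor", password="secret", dsn="dbhost/orclpdb1")
    cur = conn.cursor()

    cur.execute(
        """
        SELECT statistic_name, value
        FROM   v$segment_statistics
        WHERE  owner = :owner
        AND    object_name = :tab
        AND    statistic_name IN ('logical reads', 'physical reads')
        """,
        {"owner": "APP_SCHEMA", "tab": "ORDERS"},
    )
    for name, value in cur:
        print(name, value)

    # Note: these counters reset at instance startup and are not historical;
    # to track activity over time you would snapshot them periodically yourself.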

How do I add KPI targets to my cube that are at a higher grain to my fact table?

I have a simple star schema with 2 dimensions; course and student. My fact table is an enrolment on a course. I have KPI Values set up which use data in the fact table (e.g. percentage of students that completed course). All is working great.
I now need to add KPI Goals, though, that are at a different grain to the fact table. The goals are at the course level, but should also work at department level and for whatever combination of dimension attributes is selected. I have the numerators and denominators for the KPI Goals, so I want to aggregate these when multiple courses are involved, before dividing to get the actual percentage goal.
How can I implement this? From my understanding I should only have one fact table in my star schema. So in that case would I perhaps store the higher-grain values in the fact table? Or would I create a dimension with these values in it? Or is there some alternative solution?
In most cases I would expect KPI measures to be calculated from the existing measures in your cube, so can you get away from the idea of fact table changes, and just set up KPIs as calculated members in the cube or MDX?
Your issue is complicated by the KPI granularity being different, yes...but I would just hide KPI measures when such a level of granularity was being displayed. You can implement this within the calculated measure definition too.
For example, I have used ISLEAF() to detect if a measure is about to be shown at the bottom level, and return blank/NULL. Or you can check the level number of any relevant dimensions.
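To illustrate the question's "aggregate the numerators and denominators before dividing" goal logic, here is a pandas sketch (not SSAS/MDX) of course-level goals rolled up to department level; the table and column names are illustrative.

    import pandas as pd

    goals = pd.DataFrame({
        "department": ["Science", "Science", "Arts"],
        "course":     ["Physics", "Biology", "History"],
        "goal_num":   [80, 30, 45],    # e.g. target completions per course
        "goal_den":   [100, 50, 50],   # e.g. enrolments the target is based on
    })

    # Wrong: averaging per-course percentages ignores course size.
    goals["pct"] = goals["goal_num"] / goals["goal_den"]
    print(goals.groupby("department")["pct"].mean())        # Science -> 0.70

    # Right: sum numerators and denominators first, then divide.
    dept = goals.groupby("department")[["goal_num", "goal_den"]].sum()
    print(dept["goal_num"] / dept["goal_den"])               # Science -> 0.733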

How do you design an OLAP Database?

I need a mental process to design an OLAP database...
Essentially for standard relational it'd be (loosely):
Identify Entities
Identify Relationships
Identify Properties of Entities
For each property:
Ensure property can be related to only one entity
Ensure property is directly related to entity
For OLAP databases, I understand the terminology, the motivation and the structure; however, I have no clue as to how to decompose my relational model into an OLAP model.
Identify Dimensions (or By's)
These are anything that you may want to analyse or group your report by. Every table in the source database is a potential dimension. Dimensions should be hierarchical if possible, e.g. your Date dimension should have a year, month, day hierarchy; similarly, Location should have, for example, a Country, Region, City hierarchy. This will allow your OLAP tool to calculate aggregations more efficiently.
Identify Measures
These are the KPIs, or the actual numerical information your client wants to see. They are usually capable of being aggregated, so any non-flag, non-key numeric field in the source database is a potential measure.
Arrange these in a star schema, with measures in the central 'Fact' table and FK relations to the applicable dimension tables. Measures should be stored at the lowest dimension hierarchy level (see the sketch after these steps).
Identify the 'Grain' of the fact table; this is essentially the level of detail held. It is usually determined by the reporting requirements, the data granularity available in the source, and the performance requirements of the reporting solution. You may identify the grain as you go, or you may approach it as a final step once all the important data has been identified. I tend to have a final step to ensure the grain is consistent between my fact tables.
The final step is identifying slowly changing dimensions and the requirements for these. For example, if the customer dimension includes an element of their address and they move, how is that to be handled?
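A minimal star-schema sketch in SQLite (not a full OLAP engine): measures sit in a central fact table at the lowest grain, with FKs to hierarchical dimensions, and aggregations roll up by whichever dimension level you group on. All table and column names are illustrative.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
        CREATE TABLE dim_location (loc_key INTEGER PRIMARY KEY, city TEXT, region TEXT, country TEXT);
        CREATE TABLE fact_sales (
            date_key INTEGER REFERENCES dim_date(date_key),
            loc_key  INTEGER REFERENCES dim_location(loc_key),
            amount   REAL   -- measure stored at the lowest grain (per day, per city)
        );
        INSERT INTO dim_date VALUES (1, '2024-01-01', '2024-01', 2024), (2, '2024-02-01', '2024-02', 2024);
        INSERT INTO dim_location VALUES (1, 'Lyon', 'Auvergne-Rhone-Alpes', 'France'),
                                        (2, 'Paris', 'Ile-de-France', 'France');
        INSERT INTO fact_sales VALUES (1, 1, 100.0), (1, 2, 250.0), (2, 1, 75.0);
    """)

    # Roll the measure up one hierarchy level (month) and across another dimension (country).
    for row in con.execute("""
            SELECT d.month, l.country, SUM(f.amount)
            FROM   fact_sales f
            JOIN   dim_date d     ON d.date_key = f.date_key
            JOIN   dim_location l ON l.loc_key  = f.loc_key
            GROUP  BY d.month, l.country
            """):
        print(row)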
One important point in identifying the Dimensions and Measures is the final cardinality that you are choosing for the model.
Let's say that data entry into your relational database happens throughout the day.
Maybe you don't need to visualize or aggregate the measures by hour, or even by day. You can choose weekly granularity, or monthly, etc.
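A small pandas sketch of that choice of grain: daily data rolled up to the weekly or monthly granularity actually needed for analysis. The data and column names are illustrative.

    import pandas as pd

    # Hypothetical daily-grain measure.
    daily = pd.DataFrame(
        {"amount": range(1, 61)},
        index=pd.date_range("2024-01-01", periods=60, freq="D"),
    )

    weekly  = daily.resample("W").sum()    # week-level grain
    monthly = daily.resample("MS").sum()   # month-level grain (month-start labels)
    print(weekly.head())
    print(monthly)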
