Sum and distinct-count measures (star schema design koan) - OLAP

I am quite a beginner in data warehouse design. I have read some theory, but recently I met a practical problem with the design of an OLAP cube. I use a star schema.
Let's say I have 2 dimension tables and 1 fact table:
Dimension Gazetteer:
dimension_id
country_name
province_name
district_name
Dimension Device:
dimension_id
device_category
device_subcategory
Fact table:
gazetteer_id
device_dimension_id
hazard_id (measure column)
area_m2 (measure column)
A "business object" (which is a mine field actually) can have multiple devices, is located in a single location (Gazetteer) and ocuppies X square meters.
So in order to know which device categories there are, I created a fact per each device in hazard like this:
+--------------+---------------------+-----------------------+---------+
| gazetteer_id | device_dimension_id | hazard_id             | area_m2 |
+--------------+---------------------+-----------------------+---------+
| 123          | 321                 | 0a0a-502c-11aa1331e98 |    6000 |
| 123          | 654                 | 0a0a-502c-11aa1331e98 |    6000 |
| 123          | 987                 | 0a0a-502c-11aa1331e98 |    6000 |
+--------------+---------------------+-----------------------+---------+
I defined a measure "number of hazards" as distinct-count of hazard_id.
I also defined a "total area occupied" measure as a sum of area_m2.
Now I can use the Gazetteer and Device dimensions and know how many hazards there are for given dimension members.
But the problem is area_m2: because it is defined as a sum, it gives a value n times higher than the actual area, where n is the number of devices in the hazard object. For example, the data above would give 18000 m2 instead of 6000 m2.
How would you solve this problem?
I am using the Pentaho stack.
Thanks in advance

[moved from comment]
If a hazard_id is a minefield, and you're looking at mines-by-region (gazetteer) and size-of-minefields-by-gazetteer, you could make a Hazard dimension which holds the area of the hazard; or possibly make a null-device entry in the Device dimension table, where only the null-device row gets area_m2 set and the real device rows get area_m2 = 0.
If you need to answer queries like "total area of minefields containing device 321", the second approach isn't going to answer them easily, which suggests that making a Hazard dimension might be the better approach.
I would also consider adding a device-count fact, which could hold the number of devices of each type per hazard.
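To make the double counting concrete, here is a minimal sketch in plain Python (hypothetical rows mirroring the fact table above, not Pentaho/Mondrian code) showing why a plain SUM over the device-grain rows overstates the area, and how keeping one area value per hazard fixes it:

# Hypothetical device-grain fact rows, as in the question's table.
facts = [
    {"gazetteer_id": 123, "device_id": 321, "hazard_id": "0a0a-502c-11aa1331e98", "area_m2": 6000},
    {"gazetteer_id": 123, "device_id": 654, "hazard_id": "0a0a-502c-11aa1331e98", "area_m2": 6000},
    {"gazetteer_id": 123, "device_id": 987, "hazard_id": "0a0a-502c-11aa1331e98", "area_m2": 6000},
]

# A naive SUM over the device-grain fact table multiplies the area by the
# number of device rows per hazard:
print(sum(f["area_m2"] for f in facts))        # 18000 -- wrong

# Keeping the area at hazard grain (one value per hazard_id), as a Hazard
# dimension or a hazard-grain fact table would, fixes it:
area_per_hazard = {f["hazard_id"]: f["area_m2"] for f in facts}
print(sum(area_per_hazard.values()))           # 6000 -- correct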

Related

Calculate the best distribution for a group of numbers that can FIT into a specific number

I have what I think is an interesting question about Google Sheets and some maths. Here is the scenario:
4 numbers as follows:
64.20 | 107 | 535 | 1070
A reference number into which the previous numbers need to fit, leaving the minimum possible residue, while recording the number of times each of them fits into the reference number. For example, say the reference number is the following:
806.45
So here is the problem:
I'm calculating how many times those 4 numbers can fit into the reference number, starting from the highest number and working down to the lowest, like this:
| 1070 | => =IF(E12/((I15+J15)+IF(H17,K17,0)+IF(H19,K19,0)) > 0,ROUNDDOWN(E12/((I15+J15)+IF(H17,K17,0)+IF(H19,K19,0))),0)
| 535 | => =IF(H15>0,ROUNDDOWN((E12-K15-IF(H17,K17,0)-IF(H19,K19,0))/(I14+J14)),ROUNDDOWN(E12/((I14+J14)+IF(H17,K17,0)+IF(H19,K19,0))))
| 107 | => =IF(OR(H15>0,H14>0),ROUNDDOWN((E12-K15-K14-IF(H17,K17,0)-IF(H19,K19,0))/(I13+J13)),ROUNDDOWN((E12-IF(H17,K17,0)-IF(H19,K19,0))/(I13+J13)))
| 64.20 | => =IF(OR(H15>0,H14>0,H13>0),ROUNDDOWN((E12-K15-K14-K13-IF(H17,K17,0)-IF(H19,K19,0))/(I12+J12)),ROUNDDOWN((E12-IF(H17,K17,0)-IF(H19,K19,0))/(I12+J12)))
As you can notice, I'm checking whether the higher values fit at least once, so I can subtract that amount from the original number and calculate again how many times the lower number fits into the result of that subtraction. You can also see that I'm including some checkboxes in the formula in order to add a fixed number to the main number.
This actually works, and as you can see in the example, the result is:
| 1070 | -> Fits 0 times
| 535 | -> Fits 1 time
| 107 | -> Fits 2 times
| 64.20 | -> Fits 0 times
The residue of 806.45 in this example is: 57.45
But each number that needs to fit into the main number needs to take the others into consideration; if you solve this exercise manually, you can get something much better, like this:
| 1070 | -> Fits 0 times
| 535 | -> Fits 1 time
| 107 | -> Fits 0 times
| 64.20 | -> Fits 4 times
The residue of 806.45 in this example is: 14.65
When I'm talking about residue, I mean the result of the subtraction. I'm sorry if this is not clear; it's hard for me to explain maths in English, since it is not my native language. Please see the spreadsheet and make a copy to better understand what I'm trying to do, or suggest a way I could explain it better.
So what would you do to make it work more efficiently and "smartly", with the minimum possible residue after the calculation?
Here is the Google spreadsheet for reference and practice; please make a copy so others can try their own solutions:
LINK TO SPREADSHEET
Thanks in advance for any help or hints.
Delete all current formulas in H12:H15.
Then place this mega-formula in H12:
=ArrayFormula(QUERY(SPLIT(FLATTEN(SPLIT(VLOOKUP(E12,QUERY(SPLIT(FLATTEN(QUERY(SPLIT(FLATTEN(QUERY(SPLIT(FLATTEN(SEQUENCE(ROUNDUP(E12/I12),1,0)&" "&I12&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)&" "&I13)&"|"&(SEQUENCE(ROUNDUP(E12/I12),1,0)*I12)+(TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)*I13))),"|"),"Select Col1")&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I14),1,0)&" "&I14)&"|"&QUERY(SPLIT(FLATTEN(SEQUENCE(ROUNDUP(E12/I12),1,0)&" "&I12&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)&" "&I13)&"|"&((SEQUENCE(ROUNDUP(E12/I12),1,0)*I12)+(TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)*I13)))),"|"),"Select Col2")+TRANSPOSE(SEQUENCE(ROUNDUP(E12/I14),1,0)*I14)),"|"),"Select Col1")&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I15),1,0)&" "&I15)&"|"&QUERY(SPLIT(FLATTEN(QUERY(SPLIT(FLATTEN(SEQUENCE(ROUNDUP(E12/I12),1,0)&" "&I12&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)&" "&I13)&"|"&(SEQUENCE(ROUNDUP(E12/I12),1,0)*I12)+(TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)*I13))),"|"),"Select Col1")&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I14),1,0)&" "&I14)&"|"&QUERY(SPLIT(FLATTEN(SEQUENCE(ROUNDUP(E12/I12),1,0)&" "&I12&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)&" "&I13)&"|"&((SEQUENCE(ROUNDUP(E12/I12),1,0)*I12)+(TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)*I13)))),"|"),"Select Col2")+TRANSPOSE(SEQUENCE(ROUNDUP(E12/I14),1,0)*I14)),"|"),"Select Col2")+TRANSPOSE(SEQUENCE(ROUNDUP(E12/I15),1,0)*I15)),"|"),"Select Col2, Col1 WHERE Col2 <= "&E12&" ORDER BY Col2 Asc, Col1 Desc"),2,TRUE)," / ",0,0))," "),"Select Col1"))
Typically, I explain my formulas. In this case, I trust that readers will understand why I cannot explain it. I can only offer it in working order.
To briefly give the general idea: this formula figures out how many times each of the four numbers fits into the target number alone, and then adds every possible combination of all of those. Those are then limited to only the combinations less than or equal to the target number and sorted smallest to largest by total. Then a VLOOKUP looks up the target number in that list, returns the closest match, SPLITs the multiples from the amounts (which, by the end, have been concatenated into long strings), and returns only the multiples.
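For readers who want the logic without the mega-formula, here is a brute-force sketch of the same idea in plain Python (not Google Sheets; the values and target are taken from the question): enumerate every combination of multiples that stays at or under the target and keep the one with the least residue.

from itertools import product

values = [64.20, 107, 535, 1070]
target = 806.45

# Upper bound on how many times each value could fit on its own.
ranges = [range(int(target // v) + 1) for v in values]

best_counts, best_total = None, 0.0
for counts in product(*ranges):
    total = sum(c * v for c, v in zip(counts, values))
    if total <= target and total > best_total:
        best_counts, best_total = counts, total

print(best_counts)                    # (4, 0, 1, 0): 4 x 64.20 and 1 x 535
print(round(target - best_total, 2))  # 14.65, matching the manual solution

The search space here is only a few hundred combinations, so exhaustive enumeration is perfectly affordable at this size.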

SumoLogic: Can I have a graph of the min/max difference?

I want to show a graph of the minimum value, the maximum value, and the difference between maximum and minimum for each timeslice.
It works OK for min and max:
| parse "FromPosition *)" as FromPosition
| timeslice 2h
| max(FromPosition) ,min(FromPosition) group by _timeslice
but I couldn't find the correct way to specify the difference.
e.g.
| (max(FromPosition)- min(FromPosition)) as diffFromPosition by _timeslice
returns the error: Unexpected token 'b' found.
I've tried a few different combinations to declare them on different lines as suggested on https://help.sumologic.com/05Search/Search-Query-Language/aaGroup. e.g.
| int(FromPosition) as intFromPosition
| max(intFromPosition) as maxFromPosition , min(intFromPosition) as minFromPosition
| (maxFromPosition - minFromPosition) as diffFromPosition
| diffFromPosition by _timeslice
without success.
Can anyone suggest the correct syntax?
Try this:
| parse "FromPosition *)" as FromPosition
| timeslice 2h
| max(FromPosition), min(FromPosition) by _timeslice
| _max - _min as diffFromPosition
| fields _timeslice, diffFromPosition
The group by is there for the min and max functions, so they know what range to work with; it is not the group by for the overall search query. That's why you were getting the syntax errors, and it's one reason I prefer to just use by, as above.
For these kinds of queries I usually prefer a box plot where you would just do:
| min(FromPosition), pct(FromPosition, 25), pct(FromPosition, 50), pct(FromPosition, 75), max(FromPosition) by _timeslice
Then select box plot as the graph type. It looks great on a dashboard and provides a lot of detailed information about deviation and such at a glance.

How to get average of last N numbers in a stream with static memory

I have a stream of numbers, and in every cycle I need to compute the average of the last N of them. This can, of course, be solved using an array where I store the last N numbers; in every cycle I shift it, add the new number, and compute the average.
N = 3
+---+-----+
| a | avg |
+---+-----+
| 1 |     |
| 2 |     |
| 3 | 2.0 |
| 4 | 3.0 |
| 3 | 3.3 |
| 3 | 3.3 |
| 5 | 3.7 |
| 4 | 4.0 |
| 5 | 4.7 |
+---+-----+
The first N numbers (where there "isn't enough data for computing the average") don't interest me much, so the results there may be anything/undefined.
My question is: can this be done without using an array, that is, with a static amount of memory? If so, how?
I'll do the coding myself - I just need to know the theory.
Thanks
Think of this as a black box containing some state. If you control the input stream, you can draw conclusions about the state. In your sliding-window, array-based approach, it is fairly obvious that if you feed a bunch of zeros into the algorithm after the original input, you get a bunch of averages with a decreasing number of non-zero values taken into account. The last one has just one original non-zero value, so if you multiply it by N you get the last input back. Using that and the second-to-last output, which accounts for two non-zero inputs, you can reconstruct the second-to-last input, and so on.
So essentially your algorithm needs to maintain sufficient state to reconstruct the last N elements of input, at least if you formulate it as an online algorithm. I don't think an offline algorithm can do any better, except if you allow it to read the input multiple times, but I don't have as strong an argument for that.
Of course, in some theoretical models you can avoid the array and e.g. encode all the state into a single arbitrary-length integer, but that's just cheating the theory, and it doesn't make any difference in practice.
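For completeness, here is a minimal sketch (plain Python) of the fixed-memory approach this argument says you cannot beat: a ring buffer of exactly N slots plus a running sum. It still stores the last N values, as argued above, but the memory is static and each update is O(1) with no shifting.

class RunningAverage:
    """Average of the last n values, using exactly n slots of state."""
    def __init__(self, n):
        self.buf = [0.0] * n   # fixed-size ring buffer
        self.i = 0             # slot holding the oldest value, overwritten next
        self.total = 0.0       # running sum of the buffer contents

    def add(self, x):
        self.total += x - self.buf[self.i]  # swap the oldest value out of the sum
        self.buf[self.i] = x
        self.i = (self.i + 1) % len(self.buf)
        return self.total / len(self.buf)   # undefined for the first n-1 calls

avg = RunningAverage(3)
for x in [1, 2, 3, 4, 3, 3, 5, 4, 5]:
    print(round(avg.add(x), 1))
# Prints 0.3, 1.0 (warm-up), then 2.0, 3.0, 3.3, 3.3, 3.7, 4.0, 4.7,
# matching the table above from the third value on.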

Azure Machine Learning - How to train with a very limited dataset

I am a beginner, and I need some advice on how to go about modelling the scenario below.
I am consuming ~5000 rows of data on average from an external system every day. The number of incoming rows is between 4950 and 5050. I want to build an alerting mechanism that tells me when the number of incoming rows is not normal, i.e., I want a solution to let me know if I get, say, 2500 rows on a given day (50% less than average) or, say, 15000 rows (way more than average).
Sample data as below:
| Day     | Size of incoming data (in MB) | Number of Rows | Label |
| Weekday | 3.44                          | 5000           | Y     |
| Weekday | 3.3                           | 4999           | Y     |
| Weekday | 3.1                           | 4955           | Y     |
| Weekday | 3.44                          | 5000           | Y     |
| Weekend | 4.1                           | 5050           | N     |
My initial thought was to use some anomaly detection algorithm. I tried using the Principal Component Analysis algorithm to detect the anomaly: I collected the total number of rows I receive every day and used it for training the model. But after training with the data I had, which is quite limited (fewer than 500 observations), I found that the accuracy was very poor. One-Class SVM also did not give me good results.
I used "Number of rows" as a Categorical Feature, Label as... the label, and ignored the rest of the parameters, as they are of no interest to me in this case. Irrespective of the day and the size of the incoming data, my logic revolves around the number of rows only.
Also, I don't have any negative scenario so far, meaning I have never received far too few or far too many records. So I labeled all days on which I received 5050 rows as anomalous; the rest I labeled as normal.
I do realize that I am doing something fundamentally wrong here. The question is: does my scenario even qualify for machine learning? (I believe it does, but I wanted your opinion.)
If yes, how do I deal with such a limited set of training data, where I hardly have any sample anomalies? And is it really an anomaly-detection problem, or can I just use some classification algorithm to get a better result?
Thanks
Please see the time-series anomaly detection module; it should do what you need:
https://msdn.microsoft.com/library/azure/96b98cc0-50df-46ff-bc18-c0665d69f3e3?f=255&MSPPError=-2147217396
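As a simpler, hypothetical baseline (plain Python, not the Azure module linked above): since your logic revolves around the number of rows only, a plain statistical threshold over recent daily counts may already do what you need, with no labeled anomalies at all.

from statistics import mean, stdev

def is_anomalous(history, today, z_threshold=3.0):
    """Flag today's row count if it lies more than z_threshold standard
    deviations from the mean of recent daily counts."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold

history = [5000, 4999, 4955, 5000, 5050]  # sample counts from the question
print(is_anomalous(history, 2500))  # True: far fewer rows than normal
print(is_anomalous(history, 5020))  # False: within the normal band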

Need a solution for designing my database that has some potential permutation complexity?

I am building a website where I need to make sure that the number of "coins" and the number of "users" won't kill the database if they increase too quickly. I first posted this on Mathematica (thinking it was a maths website, but found out it's not). If this is the wrong place, please let me know and I'll move it accordingly. However, it does boil down to solving a complex problem: will my database explode if the users increase too quickly?
Here's the problem:
I am trying to confirm whether the following equations would work for my problem. The problem is that I have USERS (U) and COINS (C).
There are millions of different coins.
One user may have the same coin another user has. (i.e. both users have coin A)
Users can trade coins with each other. (i.e. Trade coin A for coin B)
Each user can trade any coin for another coin, so long as:
they don't trade a coin for the same coin (i.e. can't trade coin A for another coin A)
they can't trade with themselves (i.e. I can't offer my own Coin A for my own Coin B)
So, effectively, rows like these are stored in the database:
trade_id | user_id | offer_id | want_id
1        | 1       | A        | B
2        | 2       | B        | C
So in the above data structure, user 1 offers coin A and wants coin B, and user 2 offers coin B and wants coin C. This is how I propose to store the data, and I need to know: if I get 1000 users, and each of them has 15 coins, how many relationships will get built in this table if each user offers each coin to another user? Will it explode exponentially? Will it be scalable?
In the case of 2 users with 2 coins, user 1 can trade each of his two coins for the other user's two coins, and vice versa. That makes 4 total possible trade relationships that can be set up. However, keep in mind that if user 1 offers A for B, user 2 can't offer B for A (because that relationship already exists).
What would the equation be to figure out how many TRADES can happen with U users and C coins?
Currently, I have two candidate solutions, but neither seems to be 100% right. The two possible equations I have so far:
U! x C!
C x C x (U-1) x U
(where C = coins and U = users).
Any thoughts on getting a more exact equation? How can I know, without a shadow of a doubt, that if we scale to 1000 users with 10 coins each, this table won't explode into millions of records?
If we just think first about how many users can trade with other users, you could make a table with the allowable combinations:
              user 1
          | 1 | 2 | 3 | 4 | 5 | 6 | ...
       ---+---+---+---+---+---+---+----
        1 | N | Y | Y | Y | Y | Y | ...
user 2  2 | Y | N | Y | Y | Y | Y | ...
        3 | Y | Y | N | Y | Y | Y | ...
The total number of entries in the table is U * U, and there are U N's down the diagonal.
There are two possibilities, depending on whether order matters: is trade(user_A, user_B) the same as trade(user_B, user_A) or not? If order matters, the number of possible trades is the number of Y's in the table, which is U * U - U, or (U-1) * U. If order is irrelevant, it's half that number, (U-1) * U / 2, which gives the triangular numbers. Let's assume order is irrelevant.
Now, if we have two users, the situation with coins is similar. Order does matter here, so there are C * (C-1) possible trades between the two users.
Finally, multiply the two together: (U-1) * U * C * (C-1) / 2.
The good thing is that this is a polynomial, roughly U^2 * C^2, so it will not grow too quickly. The thing to watch out for is exponential growth, like calculating moves in chess. You're well clear of that.
One of the possibilities in your question had U!, which is the number of ways to arrange U distinct objects into a sequence. That grows even faster than exponentially.
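As a sanity check, here is a small brute-force enumeration (plain Python) confirming the closed form (U-1) * U * C * (C-1) / 2 for a few small values of U and C:

from itertools import product

def count_trades(users, coins):
    """Count unordered trades: two distinct users exchanging two distinct
    coins, where (offer, want) and (want, offer) describe the same trade."""
    n = 0
    for u1, c1, u2, c2 in product(range(users), range(coins),
                                  range(users), range(coins)):
        if u1 != u2 and c1 != c2:
            n += 1
    return n // 2  # every unordered trade was counted twice

for U, C in [(2, 2), (3, 3), (10, 5)]:
    assert count_trades(U, C) == (U - 1) * U * C * (C - 1) // 2
print("formula matches brute force")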
There are U possible users and there are C possible coins.
Hence there are OWNS = CxU possible "coins owned by an individual".
Hence there are also OWNS "possible offerings for a trade".
But a trade is a pair of two such offerings, restricted by the rule that the two persons acting as offerer cannot be the same, and neither can the two offered coins be the same. So the number of candidates for completing a "possible offering" into a "complete trade" is (C-1)x(U-1).
The number of possible ordered pairs that form a "full-blown trade" is thus
CxUx(C-1)x(U-1)
And then this is still to be divided by two because of the permutation issue (trades are a set of two (person,coin) pairs, not an ordered pair).
But please note that this sort of question is actually an extremely silly one to worry about in the world of "real" database design!
I need to know that if I get 1,000 users, and each of them have 15 coins, how many relationships will get built in this table if each user offers each coin to another user.
The most that can happen is that all 1,000 users each trade all of their 15 coins, for 7,500 trades. This is 15,000 coins up for trade (1,000 users x 15 coins); since it takes at least 2 coins to make a trade, you divide 15,000 by 2 to get the maximum number of trades: 7,500.
Your trade table is basically a Cartesian product of the number of users times the number of coins, divided by 2.
(U x C) / 2
I'm assuming users aren't trading for the sake of trading; that is, they want particular coins, and once they get those coins, they won't trade again.
Also, most relational databases can handle millions and even billions of rows in a table.
Just make sure you have an index on trade_id, and one on (user_id, trade_id), in your Trade table.
The way I understand this is that you are designing an offer table, i.e. user A may offer coin a in exchange for coin b, but not to a specific user; any other user may take the offer. If this is the case, the maximum number of offers is proportional to the number of users U and to the square of the number of coins C.
The maximum number of possible trades (disregarding direction) is
C(C-1)/2.
Every user can offer all the possible trades, as long as every user is offering the trades in the same direction, without any trade being matched. So the absolute maximum number of records in the offer table is
C(C-1)/2*U.
If trades are allowed between more than two users, though, the number of unmatchable offers drops to just over half of that. E.g. if A offers a for b, B offers b for c, and C offers c for a, then a trade could be accomplished in a triangle: A gets b from B, B gets c from C, and C gets a from A.
The maximum number of rows in the table can then be found by splitting the C coins into two groups and offering any coin in the first group in exchange for any coin in the second. We get the maximum number of combinations when the groups are of equal size, C/2. The number of combinations is
C/2*C/2 = C^2/4.
Every user may offer all these trades without there being any possible trade. So the maximum number of rows is
C^2/4*U
which is just over half of
C(C-1)/2*U = 2*(C^2/4*U) - C/2*U.
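A quick plain-Python check of the splitting step above: among all ways to divide the C coins into two groups, the product of the group sizes, and hence the number of one-directional offers per user, is maximized by the even split, which is where the C^2/4 factor comes from (even C assumed here).

def max_unmatched_offers_per_user(C):
    """Split C coins into two groups of sizes k and C-k; every user offers
    group-1 coins for group-2 coins only, so no offer can be matched.
    The best split maximizes k * (C - k)."""
    return max(k * (C - k) for k in range(C + 1))

for C in (4, 6, 10):
    assert max_unmatched_offers_per_user(C) == C * C // 4  # even split wins
print("even split confirmed")

With U users all offering in the same direction, this gives the C^2/4 * U row bound above.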
