modeling scenario with mostly semi-additive facts - erd

Im learning dimensional modeling and Im trying to create a model. I was thinking about a social media platform which rates hotels. The platform has following data:
hotel information: name and address
a user can rate hotels (1-5 points)
a user can write comments
platform stores the date of the comments
hotel can answer via comment and it stores the date of it
the platform stores the total number of each rating level (i.e.: all rates with 1 point, all rates with 2 point etc.)
platform stores information of the user: sex, name, total number of votes he/she made and address
First, I tried to define which information belongs to a dimension or fact table
(here I also checked which one is additive/semi additive/non-additive)
I realized my example is kind of difficult, because it’s hard to decide if it belongs to a fact table or dimension.
I would like to hear some advice. Would someone agree with my model?
This is how I would model it:
Hotel information -> hotel dimension
User rating -> additive fact – because I can aggregate them with all dimensions
User comment -> semi additive? – because I can aggregate them with the date dimension (I don’t know if my argument is correct, but I know I would have new comments every day, which is for me a reason to store it in a fact table
Answer as comment -> same handling like with the user comments
Date of comment-> dimension
Total Number of all votes (1/2/3/4/5) -> semi-additive facts – makes no sense to aggregate them, since its already total but I would get the average
User information sex and name, address -> user-dimension
User Information: total number of votes -> could be dimension or fact. It depends how often it changes. If it changes often, I store it in a fact. If its not that often, then dimension
I still have question, hope someone can help me:
My Question: should I create two date dimensions, or can I store both information in one date dimension?
2nd Question: each user and hotel just have one address. Are there arguments, to separate the address dimension in a own hierarchy? Can I create a 1:1 relationship to a user dimension and address dimension?

For your model, it looks well considered, but here are some thoughts:
User comment (and answers to comments): they are an event to be captured (with new ones each day, as you mention) so are factual, with dimensionality of the commenter, type of comment, date, and the measure is at least a 'count' which is additive. But you don't want to store big text in a fact so you would put that in a dimension by itself which is 1:1 with the fact, for situations where you need to query on the comment itself.
Total Number of all votes (1/2/3/4/5) are, as you say, already aggregates, mostly for performance. Totals should be easy from the raw data itself so maybe not worthwhile to store them at all. You might also consider updating the hotel dimension with columns (hotel A has 5 '1' votes and 4 '2' votes) that you'd update as you go on, for easy filtering and categorisation.
User Information: total number of votes: it is factual information about a user (dimension) and it depends on whether you always just want to 'find it out' about a person or whether you are likely to use it to filter other information (i.e. show me all reviews for users who have made 10-20 votes). In that case you might store the total in the user dimension (and/or a banding, like 'number of reviews range' with 10-20, 20-30). You can update dimensions often if you need to, but you're right, it could still just live as a fact only.
As for date dimensions, if the 'grain' is 'day' then you only need one dimension, that you refer to from multiple facts.
As for addresses, you're right that there are arguments on both sides! Many people separate addresses into their own dimension, referred to from the other dimensions that use them. Kimball suggests you can do that behind the scenes if necessary, but prefers for each dimension to have its own set of address columns(but modelled as consistently as possible).

Related

Check if any combination of binary variables is correlated/has impact on an ordinal dependent variable

I am working on a case to finish my (not so advanced) data scientist course and I have already been helped a lot by topics here, thanks!
Unfortunately now I am stuck again and cannot find an existing answer.
My data comes from a bike shop and I want to see if products bought during customers' first registered purchase are related to/have impact on how important they will become to the shop in the future. I have grouped customers into 5 clusters (from those who registered and made never any registered purchase again, through these who made 2-3 purchases for little money, those who made a few purchases for a lot of money to those who purchase stuff regularly and really bring a lot of money to this bike shop), I have ordered them into an ordinal dependent variable.
As the independent variables I have prepared 20+ binary variables that identify products/services bought during the first purchase from this shop (first purchase as a registered customer). One row per customer. So I want to check the idea if there are combinations of products (probably "extras" to the bike purchase) that can increase the chance that a customer would register and hopefully stay as a loyal customer for the future.
The dream would be be able to say, for example, if you buy a cheap or middle-cheap bike during this first purchase you probably don't contribute so much to the bike shop in a long term so you have low grade on the dependent variable. But those who bought a middle-cheap bike AND a helmet AND a lock (probably to special price) are more likely to become one of the loyal registered customers bringing money for a longer time.
There might be no relation like that but I want to test that anyways. Implementation of the result could be being able to recommend an extra product during a purchase (with a good price on it).
I am learning R during this course. We went through some techniques and first I was imagining it would be possible to work with the neural networks (just cause it sounded most fun to try), having all these products as input in the sparse matrix and the customers clusters as the output (I hoped it was similar to the examples I read about with sparse matrix with pixels from a picture as the input and numbers 1-9 as the output) but then I was told that this actually is based on pictures and real patterns and in my case I don't even know if there is any.
Then I was thinking I could try with the ordinal forest. But it doesn't predict my clusters well, not at all (2 out of 5 clusters get no predictions). But that is OK, I don't expect the first purchase to be able to predict all the customers future. But I would really want to see if there are combinations of products that might increase the chance that a customer ends up in one of the "higher" clusters on the loyalty scale.
I am not sure if this was clear enough. :) Do you think that there is any way of testing my idea? What could I try to do? Let me know if you need more information.

Frustrating Formula - Help Needed with Google Sheets formula/method

I am struggling to come up with a formula that fits certain criteria and was hoping someone with a better math brain than me might be able to help. What I have is a Google Sheets based tool that determines how much a someone has purchased of a product and then calculates the amount of times a special additional offer will be redeemed based on the amount spent.
As an example, the offer has three tiers to it. Though the actual costs will be variable for different offers let's say the first tier is gained with a $10 purchase, the second with a $20 purchase and the third with a $35 purchase (the only real relationship between the prices is that they get higher for each tier but there is no specific pattern to the costing of different offers). So if the customer bought $35 worth of goods they would get three free gifts, if they bought $45 worth they would get 4 and then an additional spend of $5 (totaling $50) would then allow them to redeem 5 gifts in total. It can be considered like filling a bucket, each time you hit the red line you get a new gift, when the bucket is full it's emptied and the process begins again.
If each tier of the offer was the same cost (e.g. $5, $10 and $15) this would be a simple case of division by the total purchase amount but as there is no specific relationship between the cost of the tiers (they are based on the value of the contents) I am having trouble coming up with a simple 'bucket filling' formula or calculation method that will work for any price ranges given to it. My current solution involved taking the modulus, subtracting offer amounts from the purchase amount etc. but provides plenty of cases where it breaks . If anyone could give me a start or provide some information that might help in my quest I would be highly appreciative and let me know if my explanation is unclear.! Thanks in advance and all the best
EDIT:
The user has three tiers and then the offer wraps around to the start after the initial three are unlocked once, looping until the offer has been maxed out. Avoiding a long sheet with a dynamic column of prices would be preferable and a small, multicell formula would be ideal
What you need is a lookup table. Create a table with the tier value in the left column, and the corresponding number of gifts for that tier value in the right column. Then you can use Vlookup to match the amount spent to correct tier.
I am not quite sure about, everything into one entire formula(is there a formula for loop and building arrays?)
from my understanding the tier amounts are viable, so every time you add a new tier with a new price limit then it must be calculated with a new limit price number...wouldn't it be much easier to write such module in javascript than in a google sheet? :o
anyways here is my workaround, that may could help you to find an idea
Example Doc
https://docs.google.com/spreadsheets/d/1z6mwkxqc2NyLJsH16NFWyL01y0jGcKrNNtuYcJS5dNw/edit#gid=0
my approach :
- enter purchases value
-> filter all items based by smaller than or equal "<=" (save all item somewhere as placeholder)
-> then decrease the purchases value by amount of existing number(max value) based on filtered items
-> save the new purchases value somewhere and begin from filtering again and decreasing the purchases value
(this needs to be done as many times again, till the purchases is empty)
after that, sums up all placeholder

SSAS facts sharing the same dimension

I'm building a cube with 2 fact tables that share some dimensions.
In the example below, I have Fact_Employee, Fact_Manager, Dim_Date, Dim_Country, Dim_Employee and Dim_Manager, with the respective links.
In SSAS I've created one Dim_Country. In the Cube "Dimension Usage" I am creating 2 dimensions (Man_Country and Emp_Country) and linking to the respective measure groups.
My Fact_Employee has the key for the Dim_Manager, so I can relate them.
My problem here is, when in the pivot table I drag the Man_Country, Emp_Country, Emp_Amount and Man_Amount, this doesn't work because I'm getting the list of all Manager Countries not related to the Manager Number and then the Employee Countries are correctly linked to the Employee Number, but are duplicate.
The below image shows the result Pivot table and what I am trying to get.
What do I need to change in the data source view or cube dimension usage to have the correct results.
The users should be able to filter the pivot by, for example, Manager Country to see all the employee Countries and Numbers and the amounts (for Managers and Employees).
Many thanks in advance for any help.
Regards,
PC
If you have country dimension then you should use this dimension for both measure groups, just remember to configure dimension usage for this dimension vs both measure groups.
There are special cases where you would want to separate those dimensions, f.eks:if you want them to act separately - let say you have a fact table with parcels and you need to have both DimFromCountry and DimToCountry. In this case you would want to use role playing dimension - it is same dimension then, but connected differently.

Database relations query

I have two database tables
this is a sample database of a Ticketing system.
Figure 1: Sample table of air ticket.
Figure 2: Sample table of tax.
Requirement:
When ticket is made from the interface, it has multiple taxes of different names every time.
How can I store this information i.e. 'n' number of taxes for each ticket with different names every time.
I have tried to make many to many relationship but the problem is:
For each ticket if the tax is not setup, then need to add the tax first.
Any optimal solution for this?
"the problem is: For each ticket if the tax is not setup, then need to
add the tax first."
This is not a real-life problem. In real life governments declare taxes well in advance of collecting them, This gives organizations sufficient time to amend their systems which need to handle taxes. Tax is never a surprise.
"But this is very tiring solution for the end user.... to make bunch
of tax setup for each ticket"
This sort of thing is reference data, and is the duty of the system developer (hint: that's you) to populate the reference data tables. Or at least provide a screen where the user can create or amend various taxes. This is a different function from defining a ticket type.
The Ticket Creation screen should have a drop-down list (or similar widget) displaying all the existing taxes, which allows the user to pick the relevant one(s). If you reall think it's necessary you can include a link to the Create Tax screen, but that really is a very confusing workflow.
If the commentators are correct, and this is a ticket purchasing function, then your design is seriously wrong. Sales taxes must be included automatically to the cost of the purcahse as part of the transaction. Otherwise nobody would pay any tax.

Determining the popularity of a video with ratings and views

I am about to embark on a new project - a video website. Users will be able to register, and vote on videos by clicking "like" or "dislike", or something to that effect. In any event, it will be a 2-option voting system, not a 5-star system.
Every X number of days, I will be generating a "chart" of the most popular videos. So my question is: how should I determine the popularity of a given video?
If I went the route of tallying up the videos with the most views, this could have the effect of exceptionally bad videos making it to the of the charts (just because they're so bad).
If I go the route of a scoring system based on the amount of "like" and "dislike" votes (eg. 100 like votes, and 50 dislike votes equals a score of 2), videos with few views could appear on the top of the charts.
So, what I need to do is a combination of the two. Barring, of course, spammy views and votes.
What's your guys' thoughts on the subject?
Edit: the following tags were removed: [mysql] [postgresql], to make room for other, more representative tags; the SQL technology used in the intended implementation does not seem to bear much on the considerations regarding the rating model per-se.
You seem to be missing the point that likes and dislikes in movies are anything but objective even within the context of a relatively homogeneous group of "voters". Think how the term "Chix Flix" or the success story called "NetFlix", illustrate this subjectivity...
Yet, if you persist in implementing the model you suggest, there are several hidden variables and system dynamics that need to be acknowledged and possibly taken into account in the rating's formula.
the existence of a third, implicit, value of the vote: "No vote"
i.e. when someone views the movie page and yet doesn't vote, either way.
The problem of dealing with this extra value is its ambiguity: do people not vote because they didn't see the movie or because they neither truly like nor disliked it? Very likely a bit of both, therefore we can/should use the count of the "Page views without vote" in the formula, to boost (somewhat) the rating of movies that do not generate a strong (positive or negative) sentiment (lest the "polarizing" movies will appear more notorious or popular)
the bandwagon effect
Past a certain threshold, and particularly if the rating and/or vote counts is visible before the page view, the rating and vote counts can influence the way people decide to vote (either way) or even decide to abstain from voting. The implication is that the total vote and/or view counts do not relate linearly to the effective rating.
"quality" vs. "notoriety"
Vote ratios in general (eg "likes" / "total" or "likes"/"dislikes" etc.) are indicative of the "quality" of a movie (note the quotes around quality...), whereby the number of votes (and of views) is indicative of the notoriety ("name recognition" etc.) of a movie.
statistical representativity
Very small vote and/or view counts are to be handled carefully because they introduce much volatility in the rating. Phrased otherwise, small samples make for not so statically representative ratings.
trends (the time variable)
At the risk of complicating the model, consider keeping [some] record of when votes/view happened, to allow identifying "hot" (and "cooling") movies in the collection. This info may inform the rating logic, but also may be used to direct the users towards currently hot items. BTW, hence feeding the bandwagon effect mentioned :-( but also, increasing the voting sample size :-).
All these considerations suggest caution in implementing this rating system. It also hints at the likely need of including statistics about the complete set of movies into the rating formula for an individual movie. In other words, do not rate a given movie solely on the basis of the its own vote/view counts but also on say the average vote counts a move receives, the maximum view a movie page gets etc. In fact, an iterative process, whereby movies are [roughly] ranked at first and then the ranking is recalculated by using the statistics of groups of movies similarly rated may provide a better system (provided the formulas are "fair" and somehow converge)
A standard trick is to start with a neutral baseline: say 10 likes and 10 dislikes that gives a score of 1. The first few votes don't change the ratio too much, but as votes accumulate, the baseline is overwhelmed. The exact choice of the baseline values will influence the rating of a new movie (the two values don't have to be equal), and how many votes are needed to change the rating substantially.

Resources