Frustrating Formula - Help Needed with Google Sheets formula/method - math

I am struggling to come up with a formula that fits certain criteria and was hoping someone with a better math brain than me might be able to help. What I have is a Google Sheets based tool that determines how much someone has purchased of a product and then calculates the number of times a special additional offer will be redeemed based on the amount spent.
As an example, the offer has three tiers to it. Though the actual costs will be variable for different offers, let's say the first tier is gained with a $10 purchase, the second with a $20 purchase and the third with a $35 purchase (the only real relationship between the prices is that they get higher for each tier; there is no specific pattern to the costing of different offers). So if the customer bought $35 worth of goods they would get three free gifts, if they bought $45 worth they would get 4, and an additional spend of $5 (totaling $50) would then allow them to redeem 5 gifts in total. It can be thought of as filling a bucket: each time the level hits the red line you get a new gift, and when the bucket is full it is emptied and the process begins again.
If each tier of the offer increased by the same amount (e.g. $5, $10 and $15) this would be a simple case of dividing the total purchase amount, but as there is no specific relationship between the costs of the tiers (they are based on the value of the contents) I am having trouble coming up with a simple 'bucket filling' formula or calculation method that will work for any price ranges given to it. My current solution involved taking the modulus, subtracting offer amounts from the purchase amount, etc., but it has plenty of cases where it breaks. If anyone could give me a start or provide some information that might help in my quest I would be highly appreciative, and let me know if my explanation is unclear! Thanks in advance and all the best.
EDIT:
The user has three tiers and then the offer wraps around to the start after the initial three are unlocked once, looping until the offer has been maxed out. Avoiding a long sheet with a dynamic column of prices would be preferable, and a small, multi-cell formula would be ideal.

What you need is a lookup table. Create a table with the tier value in the left column, and the corresponding number of gifts for that tier value in the right column. Then you can use VLOOKUP to match the amount spent to the correct tier.
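In case it helps to see the range-match idea outside the sheet, here is a rough TypeScript sketch of the same behaviour (an illustration only; the table values are made up): given rows sorted by tier threshold, return the gift count of the largest threshold that does not exceed the amount spent, which is what VLOOKUP does when its last argument (is_sorted) is TRUE.

    // Sketch only: rows are [tierThreshold, giftsAtThatTier], sorted ascending.
    function lookupGifts(amountSpent: number, table: Array<[number, number]>): number {
      let gifts = 0;
      for (const [threshold, giftCount] of table) {
        if (amountSpent >= threshold) gifts = giftCount; // keep the last row we still reach
        else break;
      }
      return gifts;
    }

    // lookupGifts(25, [[10, 1], [20, 2], [35, 3]]) -> 2

Note this only covers one pass through the tiers; it does not handle the wrap-around described in the edit.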

I am not quite sure how to get everything into one single formula (is there a formula for looping and building arrays?).
From my understanding the tier amounts are variable, so every time you add a new tier with a new price limit it has to be calculated against that new limit... wouldn't it be much easier to write such a module in JavaScript than in a Google Sheet? :o
Anyway, here is my workaround; it may help you find an idea.
Example Doc
https://docs.google.com/spreadsheets/d/1z6mwkxqc2NyLJsH16NFWyL01y0jGcKrNNtuYcJS5dNw/edit#gid=0
My approach:
- enter the purchase value
- filter all tier amounts that are smaller than or equal to ("<=") it (save the matching items somewhere as a placeholder)
- then decrease the purchase value by the largest of the filtered tier amounts
- save the new purchase value somewhere and start again, filtering and decreasing
(this needs to be repeated until the purchase value is used up)
- after that, sum up all the placeholders (a sketch of this loop is below)
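For what it's worth, here is a minimal TypeScript sketch of that loop (the tier prices are just the $10/$20/$35 example from the question; nothing here is a final implementation):

    // Count gifts for a purchase, given cumulative tier thresholds within one "bucket".
    // Assumes positive tier prices.
    function countGifts(purchase: number, tiers: number[]): number {
      const sorted = [...tiers].sort((a, b) => a - b);
      let remaining = purchase;
      let gifts = 0;
      // Each pass is one bucket: count every threshold the remaining spend reaches,
      // then carry the leftover into the next bucket and repeat.
      while (sorted.length > 0 && remaining >= sorted[0]) {
        const reached = sorted.filter((t) => t <= remaining);
        gifts += reached.length;
        remaining -= Math.max(...reached);
      }
      return gifts;
    }

    // countGifts(45, [10, 20, 35]) -> 4 (three gifts from the first bucket, one from the next)

In a sheet you would have to unroll this loop into a column of intermediate rows (or use a script), which is why a small multi-cell formula is hard here.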

Related

Check if any combination of binary variables is correlated/has impact on an ordinal dependent variable

I am working on a case to finish my (not so advanced) data science course and I have already been helped a lot by topics here, thanks!
Unfortunately now I am stuck again and cannot find an existing answer.
My data comes from a bike shop and I want to see if products bought during customers' first registered purchase are related to / have an impact on how important they will become to the shop in the future. I have grouped customers into 5 clusters (from those who registered and never made any registered purchase again, through those who made 2-3 purchases for little money and those who made a few purchases for a lot of money, to those who purchase regularly and really bring a lot of money to this bike shop), and I have ordered them into an ordinal dependent variable.
As the independent variables I have prepared 20+ binary variables that identify products/services bought during the first purchase from this shop (the first purchase as a registered customer), one row per customer. So I want to check whether there are combinations of products (probably "extras" to the bike purchase) that can increase the chance that a customer registers and hopefully stays a loyal customer in the future.
The dream would be to be able to say, for example: if you buy a cheap or mid-range bike during this first purchase you probably won't contribute much to the bike shop in the long term, so you get a low grade on the dependent variable; but those who bought a mid-range bike AND a helmet AND a lock (probably at a special price) are more likely to become loyal registered customers bringing in money for a longer time.
There might be no relation like that, but I want to test it anyway. A practical use of the result could be recommending an extra product during a purchase (with a good price on it).
I am learning R during this course. We went through some techniques, and at first I imagined it would be possible to work with neural networks (just because it sounded the most fun to try), having all these products as input in a sparse matrix and the customer clusters as the output (I hoped it was similar to the examples I read about, with a sparse matrix of pixels from a picture as the input and the numbers 1-9 as the output), but then I was told that that approach relies on pictures containing real patterns, and in my case I don't even know if there is any pattern.
Then I thought I could try an ordinal forest. But it doesn't predict my clusters well at all (2 out of 5 clusters get no predictions). That is OK, I don't expect the first purchase to be able to predict a customer's entire future. But I would really like to see if there are combinations of products that might increase the chance that a customer ends up in one of the "higher" clusters on the loyalty scale.
I am not sure if this was clear enough. :) Do you think that there is any way of testing my idea? What could I try to do? Let me know if you need more information.

Firestore best practices: Make one document per entry or group entries of a given month within one document

I am doubting myself on how I should approach this problem.
My users are able to record many parts of their day, including activities, mood, health measurement (heart bpm, glucose), exercise, meals.
I originally thought that I should create one document per entry (i.e. one entry per day). However, when displaying data to the user it is rarely on a day-by-day basis but rather month-by-month (charts).
Should I model my Firestore DB in relation to my views or would it be better to just save each entry for each day and then just query?
I am just thinking that it will be more efficient in many parts of the app to have the entries grouped by month than by day.
Am I thinking this right or is there really no benefit? (i.e. maybe the amount of data transferred offsets the costs of unnecessary queries).
If you plan to save each entry for each day and then query to find the result, then the more documents you have, the more documents you will query. That will increase your reads per day and so may be more expensive than a month-by-month grouping.
To answer your last question:
Let's take an example: if you have a collection with 100 documents and you query 20 of them, it will only count as 20 reads, not 100 as we might expect.
Just a reminder that with Firebase you can read up to 50k documents per day for free; after this limit it is $0.06 per 100k reads.
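As a rough sketch of the difference (assuming a layout like users/{uid}/entries/{day} versus users/{uid}/months/{YYYY-MM}, which is just an illustration and not something from your post), using the Firebase v9 web SDK:

    import { initializeApp } from "firebase/app";
    import { getFirestore, doc, getDoc, collection, query, where, getDocs } from "firebase/firestore";

    const app = initializeApp({ /* your config */ });
    const db = getFirestore(app);

    // Day-per-document model: a monthly chart needs ~30 document reads.
    async function loadMonthByDays(uid: string, month: string) {
      const q = query(
        collection(db, "users", uid, "entries"),
        where("month", "==", month) // assumes each daily entry stores its month
      );
      const snap = await getDocs(q); // ~30 reads for a full month
      return snap.docs.map((d) => d.data());
    }

    // Month-per-document model: the same chart costs a single read,
    // as long as a month of entries stays under the 1 MiB document size limit.
    async function loadMonthDoc(uid: string, month: string) {
      const snap = await getDoc(doc(db, "users", uid, "months", month)); // 1 read
      return snap.data()?.entries ?? [];
    }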
I hope it will help you.
Have a nice day !

modeling scenario with mostly semi-additive facts

I'm learning dimensional modeling and I'm trying to create a model. I was thinking about a social media platform which rates hotels. The platform has the following data:
hotel information: name and address
a user can rate hotels (1-5 points)
a user can write comments
the platform stores the date of the comments
a hotel can answer via a comment, and the platform stores its date
the platform stores the total number of each rating level (i.e. all ratings with 1 point, all ratings with 2 points, etc.)
the platform stores information about the user: sex, name, total number of votes he/she has made, and address
First, I tried to define which information belongs to a dimension or fact table
(here I also checked which ones are additive/semi-additive/non-additive)
I realized my example is kind of difficult, because it's hard to decide whether each item belongs in a fact table or a dimension.
I would like to hear some advice. Would someone agree with my model?
This is how I would model it:
Hotel information -> hotel dimension
User rating -> additive fact – because I can aggregate it across all dimensions
User comment -> semi-additive? – because I can aggregate them along the date dimension (I don't know if my argument is correct, but I know I would have new comments every day, which for me is a reason to store it in a fact table)
Answer as comment -> same handling like with the user comments
Date of comment-> dimension
Total number of all votes (1/2/3/4/5) -> semi-additive facts – it makes no sense to aggregate them, since each is already a total, but I could take the average
User information sex and name, address -> user-dimension
User information: total number of votes -> could be a dimension or a fact. It depends how often it changes. If it changes often, I store it in a fact; if not that often, then in a dimension.
I still have some questions; I hope someone can help me:
My question: should I create two date dimensions, or can I store both dates in one date dimension?
2nd question: each user and hotel has just one address. Are there arguments for separating the address into its own dimension/hierarchy? Can I create a 1:1 relationship between a user dimension and an address dimension?
For your model, it looks well considered, but here are some thoughts:
User comment (and answers to comments): they are an event to be captured (with new ones each day, as you mention) so are factual, with dimensionality of the commenter, type of comment, date, and the measure is at least a 'count' which is additive. But you don't want to store big text in a fact so you would put that in a dimension by itself which is 1:1 with the fact, for situations where you need to query on the comment itself.
Total Number of all votes (1/2/3/4/5) are, as you say, already aggregates, mostly for performance. Totals should be easy from the raw data itself so maybe not worthwhile to store them at all. You might also consider updating the hotel dimension with columns (hotel A has 5 '1' votes and 4 '2' votes) that you'd update as you go on, for easy filtering and categorisation.
User Information: total number of votes: it is factual information about a user (dimension) and it depends on whether you always just want to 'find it out' about a person or whether you are likely to use it to filter other information (i.e. show me all reviews for users who have made 10-20 votes). In that case you might store the total in the user dimension (and/or a banding, like 'number of reviews range' with 10-20, 20-30). You can update dimensions often if you need to, but you're right, it could still just live as a fact only.
As for date dimensions, if the 'grain' is 'day' then you only need one dimension, that you refer to from multiple facts.
As for addresses, you're right that there are arguments on both sides! Many people separate addresses into their own dimension, referred to from the other dimensions that use them. Kimball suggests you can do that behind the scenes if necessary, but prefers for each dimension to have its own set of address columns (but modelled as consistently as possible).
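To make the comment-fact idea a bit more concrete, here is a rough sketch of the shapes involved, written as TypeScript interfaces purely as notation (all table and column names are invented for the example):

    // Sketch only: names are illustrative, not from the post.
    interface DimDate        { dateKey: number; calendarDate: string; }  // day grain, shared by all facts
    interface DimUser        { userKey: number; name: string; sex: string; /* address columns */ }
    interface DimHotel       { hotelKey: number; name: string; /* address columns */ }
    interface DimCommentText { commentKey: number; fullText: string; }   // 1:1 with the fact, keeps big text out of it

    // One row per comment event; the additive measure is just a count of 1.
    interface FactComment {
      dateKey: number;     // -> DimDate
      userKey: number;     // -> DimUser (the commenter)
      hotelKey: number;    // -> DimHotel
      commentKey: number;  // -> DimCommentText
      commentType: "user" | "hotel_answer";
      commentCount: 1;
    }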

Determining the popularity of a video with ratings and views

I am about to embark on a new project - a video website. Users will be able to register, and vote on videos by clicking "like" or "dislike", or something to that effect. In any event, it will be a 2-option voting system, not a 5-star system.
Every X number of days, I will be generating a "chart" of the most popular videos. So my question is: how should I determine the popularity of a given video?
If I went the route of tallying up the videos with the most views, this could have the effect of exceptionally bad videos making it to the top of the charts (just because they're so bad).
If I go the route of a scoring system based on the amount of "like" and "dislike" votes (eg. 100 like votes, and 50 dislike votes equals a score of 2), videos with few views could appear on the top of the charts.
So, what I need to do is a combination of the two. Barring, of course, spammy views and votes.
What's your guys' thoughts on the subject?
Edit: the following tags were removed: [mysql] [postgresql], to make room for other, more representative tags; the SQL technology used in the intended implementation does not seem to bear much on the considerations regarding the rating model per se.
You seem to be missing the point that likes and dislikes of movies are anything but objective, even within the context of a relatively homogeneous group of "voters". Think of how the term "Chix Flix" or the success story called "NetFlix" illustrates this subjectivity...
Yet, if you persist in implementing the model you suggest, there are several hidden variables and system dynamics that need to be acknowledged and possibly taken into account in the rating's formula.
the existence of a third, implicit, value of the vote: "No vote"
i.e. when someone views the movie page and yet doesn't vote, either way.
The problem with this extra value is its ambiguity: do people not vote because they didn't see the movie, or because they neither truly liked nor disliked it? Very likely a bit of both; therefore we can/should use the count of "page views without a vote" in the formula, to boost (somewhat) the rating of movies that do not generate a strong (positive or negative) sentiment (otherwise the "polarizing" movies will appear more notorious or popular).
the bandwagon effect
Past a certain threshold, and particularly if the rating and/or vote counts are visible before the page view, the rating and vote counts can influence the way people decide to vote (either way) or even decide to abstain from voting. The implication is that the total vote and/or view counts do not relate linearly to the effective rating.
"quality" vs. "notoriety"
Vote ratios in general (eg "likes" / "total" or "likes"/"dislikes" etc.) are indicative of the "quality" of a movie (note the quotes around quality...), whereby the number of votes (and of views) is indicative of the notoriety ("name recognition" etc.) of a movie.
statistical representativity
Very small vote and/or view counts are to be handled carefully because they introduce much volatility into the rating. Phrased otherwise, small samples make for not very statistically representative ratings.
trends (the time variable)
At the risk of complicating the model, consider keeping [some] record of when votes/views happened, to allow identifying "hot" (and "cooling") movies in the collection. This info may inform the rating logic, but may also be used to direct users towards currently hot items. BTW, this feeds the bandwagon effect mentioned above :-( but also increases the voting sample size :-).
All these considerations suggest caution in implementing this rating system. They also hint at the likely need to include statistics about the complete set of movies in the rating formula for an individual movie. In other words, do not rate a given movie solely on the basis of its own vote/view counts, but also on, say, the average vote count a movie receives, the maximum views a movie page gets, etc. In fact, an iterative process, whereby movies are [roughly] ranked at first and then the ranking is recalculated using the statistics of groups of similarly rated movies, may provide a better system (provided the formulas are "fair" and somehow converge).
A standard trick is to start with a neutral baseline: say 10 likes and 10 dislikes, which gives a score of 1. The first few votes don't change the ratio too much, but as votes accumulate, the baseline is overwhelmed. The exact choice of the baseline values will influence the rating of a new movie (the two values don't have to be equal) and how many votes are needed to change the rating substantially.
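As a small sketch of that baseline (prior) trick, using the like/dislike ratio score from the question and an arbitrary 10/10 baseline:

    // score = (likes + priorLikes) / (dislikes + priorDislikes); the 10/10 prior is arbitrary.
    function score(likes: number, dislikes: number, priorLikes = 10, priorDislikes = 10): number {
      return (likes + priorLikes) / (dislikes + priorDislikes);
    }

    // score(0, 0)    -> 1.0   (a brand-new video starts at the neutral score)
    // score(3, 0)    -> 1.3   (a few early likes move it only slightly)
    // score(100, 50) -> ~1.83 (with many votes the prior barely matters; the raw ratio is 2)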

ASP.Net + SQL Server Design/Implementation Suggestions

I'm constructing a prototype site for my company, which has several thousand employees, and I'm running into a wall regarding the implementation of a specific requirement.
To simplify it down with an example, let's say each user has a bank account with interest. Every 5 minutes or so (this can vary) the interest pays out. When the user hits the site, they see a timer counting down to when the interest is supposed to pay out.
The wall I'm running into is that it just feels dirty to have a windows service (or whatever) constantly hitting the database looking for accounts that need to 'pay out' and take action accordingly.
This seems to be the only solution in my head right now, but I'm fairly certain a service running a query to retrieve a sorted result set of accounts that need to be 'paid out' just won't cut it.
Thank you in advance for any ideas and suggestions!
Rather than updating records, just calculate the accrued interest on the fly.
This sort of math is pretty straightforward; the calculations are very likely to be orders of magnitude faster than the continuous updating.
Something like the following:
WITH depositswithperiods AS (
    SELECT accountid, depositamount, interestrate,
           -- number of completed 5-minute accrual periods since the deposit
           FLOOR(DATEDIFF(n, deposit_timestamp, GETDATE()) / 5) AS accrualperiods
    FROM deposits
)
SELECT accountid,
       SUM(depositamount) AS TotalDeposits,
       -- compound interest: principal * (1 + rate per period) ^ periods
       SUM(depositamount * POWER(1 + interestrate, accrualperiods)) AS Balance
FROM depositswithperiods
GROUP BY accountid
I assumed compound interest above, and no withdrawals.
The addition of withdrawals would require creating a group of deposits for each time period, taking the sum of those to get a net deposit for each time period, and then calculating the interest on those groups.
I don't know if the interest analogy will hold for your actual use case. If the database doesn't need to be kept up to date for all users at all times, you could apply the AddInterest operation multiple times at once when you need an up-to-date value. That is, whenever the value is displayed, or when the account balance is about to change.
You could do a single nightly update for all accounts.
A good thing to think of when doing this kind of thing is DateTime.
If you are charged 10 pence a minute for a phone call, there isn't a computer sitting there counting every second and working out minutes... It just records the date/time at the start, and the datetime at the end.
As others suggest, just calculate it when the user tries to view it.
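In the same spirit, here is a rough sketch of that lazy, on-demand calculation (the 5-minute period and the names are illustrative, and it assumes compound interest per period as in the query above):

    // Nothing runs in the background: accrue interest only when the value is needed.
    const PERIOD_MS = 5 * 60 * 1000; // one accrual period = 5 minutes (example value)

    function currentBalance(depositAmount: number, ratePerPeriod: number, depositedAt: Date, now = new Date()): number {
      const periods = Math.max(0, Math.floor((now.getTime() - depositedAt.getTime()) / PERIOD_MS));
      return depositAmount * Math.pow(1 + ratePerPeriod, periods);
    }

    // e.g. a 100 deposit at 0.1% per period, made an hour ago (12 completed periods):
    // currentBalance(100, 0.001, new Date(Date.now() - 60 * 60 * 1000)) ≈ 101.21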
