Bayesian average for email response rates - r

I have a dataset with different email codes, email recipients and a flag of whether they responded to the email. I calculated the past response rates for each person, for the emails preceding the current email (sum of responses / number of emails). It looks something like this:
email_code  responded  person  number_of_emails  response_rate  date
wy2         1          A       0                 0              2022/01/12
na3         1          A       1                 100            2022/01/22
li3         0          A       2                 100            2022/01/23
pa4         1          A       3                 66             2022/01/24
However, this doesn't seem right. Imagine that person A received 1 email and replied to it, so their response rate will be 100%. Person B received 10 emails and replied to 9 of them, so their response rate will be 90%. But person B is more likely to respond.
I think I need to calculate some Bayesian average, in a similar vein to this post and this website. However, these websites show how to do this for ratings, and I do not know how I can adapt the formula to my case.
Any help/suggestions would be greatly appreciated!

The post on SO perfectly describes how you can calculate the Bayesian rating, IMO.
I quote:
rating = (v / (v + m)) * R +
(m / (v + m)) * C;
The variables are:
R – The item's own rating. R is the average of the item's votes. (For example, if an item has no votes, its R is 0. If someone gives it 5 stars, R becomes 5. If someone else gives it 1 star, R becomes 3, the average of [1, 5]. And so on.)
C – The average item's rating. Find the R of every single item in the database, including the current one, and take the average of them; that is C. (Suppose there are 4 items in the database, and their ratings are [2, 3, 5, 5]. C is 3.75, the average of those numbers.)
v – The number of votes for an item. (To give another example, if 5 people have cast votes on an item, v is 5.)
m – The tuneable parameter. The amount of "smoothing" applied to the rating is based on the number of votes (v) in relation to m. Adjust m until the results satisfy you. And don't misinterpret IMDb's description of m as "minimum votes required to be listed" – this system is perfectly capable of ranking items with less votes than m.
So in your case:
R is the response rate, i.e. the number of replies / the number of received emails. If someone hasn't received any emails, set R to 0 to avoid division by zero. If they haven't responded to any received emails, their R is of course zero.
C is the sum of the Rs of all recipients divided by the number of recipients.
v is the number of received emails. If someone received 10 emails, their v will be 10. If they haven't received any emails, their v will be zero.
m is, as described in the original post, the tuneable parameter.
Further quote from the original post which describes m very well:
All the formula does is: add m imaginary votes, each with a value of C, before calculating the average. In the beginning, when there isn't enough data (i.e. the number of votes is dramatically less than m), this causes the blanks to be filled in with average data. However, as votes accumulates, eventually the imaginary votes will be drowned out by real ones.
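To make this concrete, here is a minimal sketch in Python of the formula applied to response rates (the function and variable names are mine, not from the original post):

```python
def bayesian_rate(responses, emails, global_rate, m):
    """Shrink one recipient's response rate toward the global average.

    responses   - replies this person has sent
    emails      - emails this person has received (v in the formula)
    global_rate - average response rate over all recipients (C)
    m           - tuneable smoothing parameter: m imaginary emails
                  answered at the global rate are added before averaging
    """
    if emails == 0:
        return global_rate  # no data yet: fall back entirely on the prior
    r = responses / emails  # the person's raw response rate (R)
    return (emails / (emails + m)) * r + (m / (emails + m)) * global_rate

# Person A: 1 email, 1 reply. Person B: 10 emails, 9 replies.
# With a global average of 0.5 and m = 5, B now ranks above A:
a = bayesian_rate(1, 1, 0.5, 5)    # ~0.583
b = bayesian_rate(9, 10, 0.5, 5)   # ~0.767
```

A reasonable starting point for m is the typical number of emails a recipient has received; tune it until the ordering looks sensible on your data.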

Related

Design an algorithm that minimises the load on the most heavily loaded server

Reading the book by Aziz & Prakash (2021), I am a bit stuck on problem 3.7 and the associated solution, which I am trying to implement.
The problem says :
You have n users with unique hashes h1 through hn and
m servers, numbered 1 to m. User i has Bi bytes to store. You need to
find numbers K1 through Km such that all users with hashes between
Kj and Kj+1 get assigned to server j. Design an algorithm to find the
numbers K1 through Km that minimize the load on the most heavily
loaded server.
The solution says:
Let L(a,b) be the maximum load on a server when
users with hashes h1 through ha are assigned to servers S1 through Sb in
an optimal way so that the max load is minimised. We observe the
following recurrence:
L(a,b) = min over x of max( L(x, b-1), B_{x+1} + ... + B_a )
In other words, we find the right value of x such that if we pack the
first x users into b - 1 servers and the remaining users into the last server, the max
load on a given server is minimized.
Using this relationship, we can tabulate the values of L until we get
L(n,m). While computing L(a,b), with L tabulated
for all lower values of a and b, we need to find the right value of x to
minimize the load. As we increase x, L(x,b-1) in the above expression increases while the sum term decreases, so we can binary search for the x that minimises their max.
I know that we can probably use some sort of dynamic programming, but how could we possibly implement this idea into a code?
The dynamic programming algorithm is defined fairly well given that formula: Implementing a top-down DP algorithm just needs you to loop from x = 1 to a and record which one minimizes that max(L(x,b-1), sum(B_i)) expression.
There is, however, a simpler (and faster) greedy/binary search algorithm for this problem that you should consider, which goes like this:
Compute prefix sums for B
Find the minimum value of L such that we can partition B into m contiguous subarrays whose maximum sum is equal to L.
We know 1 <= L <= sum(B). So, perform a binary search to find L, with a helper function canSplit(v) that tests whether we can split B into such subarrays of sum <= v.
canSplit(v) works greedily: Remove as many elements from the start of B as possible so that our sum does not exceed v. Repeat this a total of m times; return True if we've used all of B.
You can use the prefix sums to run canSplit in O(m log n) time, with an additional inner binary search.
Given L, use the same strategy as the canSplit function to determine the m-1 partition points; find the m partition boundaries from there.
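The steps above can be sketched in Python as follows. For clarity, canSplit uses a plain linear scan instead of the inner binary search over prefix sums, giving O(n log sum(B)) overall; the function names are mine:

```python
def min_max_load(B, m):
    """Smallest achievable maximum load when splitting the ordered list B
    of user sizes into at most m contiguous groups (one per server)."""

    def can_split(v):
        # Greedy: each server takes as many consecutive users as fit in v.
        servers, load = 1, 0
        for b in B:
            if b > v:                 # one user alone already exceeds v
                return False
            if load + b > v:          # open the next server
                servers += 1
                load = b
            else:
                load += b
        return servers <= m

    lo, hi = max(B), sum(B)           # the answer L lies in this range
    while lo < hi:
        mid = (lo + hi) // 2
        if can_split(mid):
            hi = mid                  # mid is feasible, try smaller
        else:
            lo = mid + 1              # mid is too small
    return lo

print(min_max_load([10, 20, 30, 40], 2))   # 60  ([10, 20, 30] | [40])
```

Once L is known, one more greedy pass yields the actual partition points K1..Km.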

How does one approach this challenge asked in an Amazon Interview?

I am struggling to optimise this past Amazon interview question involving a DAG.
This is what I tried (The code is long and I would rather explain it)-
Basically, since the graph is a DAG and the relation is transitive, a simple traversal for every node should be enough.
So for every node I would by transitivity traverse through all the possibilities to get the end vertices and then compare these end vertices to get
the most noisy person.
In my second step I have actually found one such (maybe the only one) most noisy person for all the vertices of the traversal in step 2. So I memoize all of this in a mapping and mark the vertices of the traversal as visited.
So I am basically maintaining an adjacency list for the graph, A visited/non visited mapping and a mapping for the output (the most noisy person for every vertex).
In this way by the time I get a query I would not have to recompute anything (in case of duplicate queries).
The above code works, but since I cannot test it with test cases it may or may not pass the time limit. Is there a faster solution (maybe using DP) to this? I feel I am not exploiting the transitive and anti-symmetric conditions enough.
Obviously I am not checking the cases where a person is less wealthy than the current person. But for instance, if I have pairs like (1,2)(1,3)(1,4)...etc. and maybe (2,6)(2,7)(7,8), etc., then to find a more wealthy person than 1 I have to traverse through every neighbor of 1, and then the neighbors of every neighbor as well, I guess. This is done only once, as I store the results.
Question Part 1
Question Part 2
Edit(Added question Text)-
Rounaq is graduating this year. And he is going to be rich. Very rich. So rich that he has decided to have
a structured way to measure his richness. Hence he goes around town asking people about their wealth,
and notes down that information.
Rounaq notes down the pair (Xi; Yi) if person Xi has more wealth than person Yi. He also notes down
the degree of quietness, Ki, of each person. Rounaq believes that noisy persons are a nuisance. Hence, for
each of his friends Ai, he wants to determine the most noisy (least quiet) person among those who have
wealth more than Ai.
Note that "has more wealth than" is a transitive and anti-symmetric relation. Hence if a has more wealth
than b, and b has more wealth than c then a has more wealth than c. Moreover, if a has more wealth than
b, then b cannot have more wealth than a.
Your task in this problem is to help Rounaq determine the most noisy person among the people having
more wealth for each of his friends ai, given the information Rounaq has collected from the town.
Input
First line contains T: The number of test cases
Each Test case has the following format:
N
K1 K2 K3 K4 : : : Kn
M
X1 Y1
X2 Y2
. . .
. . .
XM YM
Q
A1
A2
. . .
. . .
AQ
N: The number of people in town
M: Number of pairs for which Rounaq has been able to obtain the wealth
information
Q: Number of Rounaq’s Friends
Ki: Degree of quietness of the person i
Xi; Yi: The pairs Rounaq has noted down (Pair of distinct values)
Ai: Rounaq’s ith friend
For each of Rounaq’s friends print a single integer - the degree of quietness of the most noisy person as required or -1 if there is no wealthier person for that friend.
Perform a topological sort on the pairs X, Y. Then iterate from the most wealthy down to the least wealthy, and store the most noisy person seen so far:
less wealthy -> most wealthy
<- person with lowest K so far <-
Then for each query, binary search the first person with greater wealth than the friend. The value we stored is the most noisy person with greater wealth than the friend.
UPDATE
It seems that we cannot rely on the data allowing for a complete topological sort. In this case, traverse sections of the graph that lead from known greatest to least wealth, storing for each person visited the most noisy person seen so far. The example you provided might look something like:
    3 - 5
   /    |
  1 - 2 |
       \|
        4
Traversals:
1 <- 3 <- 5
1 <- 2
4 <- 2
4 <- 5
(Input)
2 1
2 4
3 1
5 3
5 4
8 2 16 26 16
(Queries and solution)
3 4 3 5 5
16 2 16 -1 -1
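The traversal idea can be folded into a single topological pass: process people from wealthiest to poorest and push the minimum quietness seen so far along each edge. A Python sketch using the example data above (the function name is mine):

```python
from collections import defaultdict, deque
from math import inf

def most_noisy(n, K, pairs, queries):
    """K[i-1] is the quietness of person i; (x, y) in pairs means x is
    wealthier than y. Returns, per query a, the quietness of the
    least-quiet person strictly wealthier than a, or -1 if none."""
    adj = defaultdict(list)
    indeg = [0] * (n + 1)
    for x, y in pairs:
        adj[x].append(y)
        indeg[y] += 1
    best = [inf] * (n + 1)        # min quietness among strict ancestors
    queue = deque(v for v in range(1, n + 1) if indeg[v] == 0)
    while queue:                  # Kahn's algorithm: topological order
        u = queue.popleft()
        for v in adj[u]:
            best[v] = min(best[v], best[u], K[u - 1])
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return [best[a] if best[a] != inf else -1 for a in queries]

print(most_noisy(5, [8, 2, 16, 26, 16],
                 [(2, 1), (2, 4), (3, 1), (5, 3), (5, 4)],
                 [3, 4, 3, 5, 5]))   # [16, 2, 16, -1, -1]
```

This answers every query in O(1) after O(N + M) preprocessing, so duplicate queries cost nothing extra.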

Calculate the number of trips in graph traversal

Hello Stack Overflow Community,
I'm attempting to solve this problem:
https://uva.onlinejudge.org/index.php?option=com_onlinejudge&Itemid=8&page=show_problem&problem=1040
The problem is to find the best path based on capacity between edges. I get that this can be solved using dynamic programming, but I'm confused by the example they provide:
According to the problem description, if someone is trying to get 99 people from city 1 to 7, the route should be 1-2-4-7, which I get, since the weight of each edge represents the maximum number of passengers that can go at once. What I don't get is that the description says it takes at least 5 trips. Where does the 5 come from? If I take this route, I calculate 4 trips: since 25 is the most limited edge on the route, you need ceil(99 / 25) = 4 trips at least. Is this a typo, or am I missing something?
Given the first line of the problem statement:
Mr. G. works as a tourist guide.
It is likely that Mr. G must always be present on the bus, so he occupies one seat on every trip. The equation for the number of trips x is then:
x = (ceil(x) + number_of_passengers) / best_route_capacity
rather than simply:
x = number_of_passengers / best_route_capacity
or, for your numbers:
x = (ceil(x) + 99) / 25
which is satisfied by x = 4.16, i.e. ceil(x) = 5 trips.
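Equivalently: if the guide occupies one of the 25 seats on every crossing, each trip moves at most 24 paying passengers, so 99 passengers need ceil(99 / 24) = 5 trips. A one-liner to check (the names are mine):

```python
from math import ceil

def trips(passengers, capacity):
    # The guide takes one seat per crossing, leaving capacity - 1
    # seats for passengers on each trip.
    return ceil(passengers / (capacity - 1))

print(trips(99, 25))   # 5
```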

How to calculate the expected cost?

I am not good at probability, and I know this is not directly a coding problem, but I hope you can help me with it. While solving a computation problem I ran into this difficulty:
Problem definition:
The Little Elephant from the Zoo of Lviv is going to the Birthday
Party of the Big Hippo tomorrow. Now he wants to prepare a gift for
the Big Hippo. He has N balloons, numbered from 1 to N. The i-th
balloon has the color Ci and it costs Pi dollars. The gift for the Big
Hippo will be any subset (chosen randomly, possibly empty) of the
balloons such that the number of different colors in that subset is at
least M. Help Little Elephant to find the expected cost of the gift.
Input
The first line of the input contains a single integer T - the number
of test cases. T test cases follow. The first line of each test case
contains a pair of integers N and M. The next N lines contain N pairs
of integers Ci and Pi, one pair per line.
Output
In T lines print T real numbers - the answers for the corresponding test cases. Your answer will be considered correct if it has at most 10^-6 absolute or relative error.
Example
Input:
2
2 2
1 4
2 7
2 1
1 4
2 7
Output:
11.000000000
7.333333333
So, here I don't understand why the expected cost of the gift for the second case is 7.333333333. The expected cost equals Summation[x * P(x)], and according to this formula it should be 33/2?
Yes, it is a CodeChef question. But I am not asking for the solution or the algorithm (because if I took the algorithm from elsewhere it would not improve my coding ability). I just don't understand their example, and hence I am not able to start thinking about the algorithm.
Please help. Thanks in advance!
For the second case (M = 1) there are three possible gifts, {1}, {2} and {1, 2}, with costs 4, 7 and 11. Each is equally likely, so the expected cost is (4 + 7 + 11) / 3 = 22 / 3 = 7.33333.
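Since N is tiny in the example, the expectation can be checked by brute-force enumeration of all qualifying subsets (a sketch; expected_cost is my own helper name):

```python
from itertools import combinations

def expected_cost(balloons, m):
    """balloons is a list of (color, price) pairs. Averages the total
    price over every subset (the empty one included) that contains at
    least m distinct colors; each qualifying subset is equally likely."""
    costs = []
    for r in range(len(balloons) + 1):
        for subset in combinations(balloons, r):
            if len({color for color, _ in subset}) >= m:
                costs.append(sum(price for _, price in subset))
    return sum(costs) / len(costs)

balloons = [(1, 4), (2, 7)]
print(expected_cost(balloons, 2))   # 11.0       (first test case)
print(expected_cost(balloons, 1))   # 7.333...   (second test case)
```

This brute force is only for understanding the example; the real problem needs a smarter algorithm for large N.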

Need a solution for designing my database that has some potential permutation complexity?

I am building a website where I need to make sure that the number of "coins" and the number of "users" won't kill the database if they increase too quickly. I first posted this on mathematica (thinking it's a maths website, but found out it's not). If this is the wrong place, please let me know and I'll move it accordingly. However, it does boil down to solving a complex problem: will my database explode if the users increase too quickly?
Here's the problem:
I am trying to confirm whether the following equations would work for my problem. The problem is that I have USERS (u) and I have COINS (c).
There are millions of different coins.
One user may have the same coin another user has. (i.e. both users have coin A)
Users can trade coins with each other. (i.e. Trade coin A for coin B)
Each user can trade any coin with another coin, so long as:
they don't trade a coin for the same coin (i.e. can't trade coin A for another coin A)
they can't trade with themselves (i.e. I can't offer my own Coin A for my own Coin B)
So, effectively, there are database rows stored in the DB:
trade_id | user_id | offer_id | want_id
1 | 1 | A | B
2 | 2 | B | C
So in the above data structure, user 1 offers coin A and wants coin B, and user 2 offers coin B and wants coin C. This is how I propose to store the data, and I need to know: if I get 1000 users, and each of them has 15 coins, how many relationships will get built in this table if each user offers each coin to another user? Will it explode exponentially? Will it be scalable? etc.
In the case of 2 users with 2 coins, you'd have user 1 being able to trade his two coins with the other user's two coins, and vice versa. That makes 4 total possible trade relationships that can be set up. However, keep in mind that if user 1 offers A for B, user 2 can't offer B for A (because that relationship already exists).
What would the equation be to figure out how many TRADES can happen with U users and C coins?
Currently, I have one of two solutions, but neither seem to be 100% right. The two possible equations I have so far:
U! x C!
C x C x (U-1) x U
(where C = coins, and U = users);
Any thoughts on getting a more exact equation? How can I know without a shadow of a doubt, that if we scale to 1000 users with 10 coins each, that this table won't explode into millions of records?
If we just think about how many users can trade with other users, you could make a table of the allowable combinations:
                 user 1
           1 | 2 | 3 | 4 | 5 | 6 | ...
         ________________________________
       1 | N | Y | Y | Y | Y | Y | ...
user 2 2 | Y | N | Y | Y | Y | Y | ...
       3 | Y | Y | N | Y | Y | Y | ...
The total number of entries in the table is U * U, and there are U N's down the diagonal.
There are two possibilities depending on whether order matters: is trade(user_A, user_B) the same as trade(user_B, user_A) or not? If order matters, the number of possible trades is the number of Y's in the table, which is U * U - U, or (U-1) * U. If order is irrelevant, it's half that number, (U-1) * U / 2, the triangular numbers. Let's assume order is irrelevant.
Now if we have two users the situation with coins is similar. Order does matter here so it is C * (C-1) possible trades between the users.
Finally multiply the two together (U-1) * U * C * (C-1) / 2.
The good thing is that this is a polynomial, roughly U^2 * C^2, so it will not grow too quickly. The thing to watch out for is exponential growth, like calculating moves in chess. You're well clear of that.
One of the possibilities in your question had U!, which is the number of ways to arrange U distinct objects into a sequence. That would grow even faster than exponentially.
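Plugging the numbers from the question into the formula above (the helper name is mine):

```python
def max_trades(users, coins):
    # (U-1) * U / 2 unordered pairs of users, times C * (C-1)
    # ordered pairs of distinct coins.
    return (users - 1) * users * coins * (coins - 1) // 2

print(max_trades(1000, 15))   # 104895000 -- about 10^8, clearly polynomial
```

Roughly a hundred million possible trades is a large but bounded number; doubling the user count quadruples it rather than exploding exponentially.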
There are U possible users and there are C possible coins.
Hence there are OWNS = CxU possible "coins owned by an individual".
Hence there are also OWNS "possible offerings for a trade".
But a trade is a pair of two such offerings, restricted by the rule that the two persons acting as offerer cannot be the same, and neither can the offered coin be the same. So the number of candidates for completing a "possible offering" to form a "complete trade" is (C-1)x(U-1).
The number of possible ordered pairs that form a "full-blown trade" is thus
CxUx(C-1)x(U-1)
And then this is still to be divided by two because of the permutation issue (trades are a set of two (person,coin) pairs, not an ordered pair).
But please note that this sort of question is actually an extremely silly one to worry about in the world of "real" database design!
I need to know that if I get 1,000 users, and each of them have 15 coins, how many relationships will get built in this table if each user offers each coin to another user.
The most that can happen is all 1,000 users each trade all of their 15 coins, for 7,500 trades. This is 15,000 coins up for trade (1,000 users x 15 coins). Since it takes at least 2 coins to trade, you divide 15,000 by 2 to get the maximum number of trades, 7,500.
Your trade table is basically a Cartesian product of the number of users times the number of coins, divided by 2.
(U x C) / 2
I'm assuming users aren't trading for the sake of trading. That they want particular coins and once they get the coins, won't trade again.
Also, most relational databases can handle millions and even billions of rows in a table.
Just make sure you have an index on trade_id, and an index on (user_id, trade_id), in your Trade table.
The way I understand this is that you are designing an offer table. I.e. user A may offer coin a in exchange for coin b, but not to a specific user. Any other user may take the offer. If this is the case, the maximum number of offers is proportional to the number of users U and the square of the number of coins C.
The maximum number of possible trades (disregarding direction) is
C(C-1)/2.
Every user can offer all the possible trades, as long as every user is offering the trades in the same direction, without any trade being matched. So the absolute maximum number of records in the offer table is
C(C-1)/2*U.
If trades between more than two users are allowed, though, the number decreases to just over half of that. E.g. if A offers a for b, B offers b for c and C offers c for a, then a trade can be accomplished in a triangle: A gets b from B, B gets c from C, and C gets a from A.
The maximum number of rows in the table can then be calculated by splitting the C coins into two groups and offering any coin in the first group in exchange for any coin in the second. We get the maximum number of combinations when the groups are of equal size, C/2. The number of combinations is
C/2*C/2 = C^2/4.
Every user may offer all these trades without there being any possible trade. So the maximum number of rows is
C^2/4*U
which is just over half of
C(C-1)/2*U = 2*(C^2/4*U) - C/2*U.
