Maths: conflicting approaches to averaging two averages?

A product is rated according to two features. The ratings for each feature are averaged, and the results delivered to us in a web service. The data might look something like:
Product_One: {
    "Feature_A": {
        "TotalReviewCount": 14,
        "AverageOverallRating": 4.9286
    },
    "Feature_B": {
        "TotalReviewCount": 42,
        "AverageOverallRating": 4.3571
    }
}
My working for calculating (to one decimal place) the average overall rating for the Product is:
4.9286 + 4.3571 = 9.2857
9.2857 / 2 = 4.64285
round(4.64285) = 4.6
A colleague has presented a different working, resulting in a different number:
(14 * 4.9286) + (42 * 4.3571) = 251.9986
251.9986 / (42 + 14) = 4.499975
round(4.499975) = 4.5
Whose is... best? Is one wrong?

Your colleague calculated the average per review (weighting each feature's average by its review count), while you calculated the average per feature, without taking the number of reviews into account.
In general, the weighted average is more reliable, because features with more reviews carry more information (compare with the center of mass).
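To make the difference concrete, here is a minimal Python sketch of both calculations, using the counts and averages from the example payload above:

# Per-feature review counts and averages from the example payload
ratings = {"Feature_A": (14, 4.9286), "Feature_B": (42, 4.3571)}

# Unweighted: average the two feature averages (the first working)
unweighted = sum(avg for _, avg in ratings.values()) / len(ratings)

# Weighted: weight each feature average by its review count (the colleague's working)
total_reviews = sum(n for n, _ in ratings.values())
weighted = sum(n * avg for n, avg in ratings.values()) / total_reviews

print(round(unweighted, 1))  # 4.6
print(round(weighted, 1))    # 4.5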

Related

Distance formula calculation with ordinal positions

I'm following one of the BigQuery courses from Google's Skill Boost program. Using a dataset with football (soccer) stats, they're calculating the impact of shot distance on the likelihood of scoring a goal.
I don't quite get how the shot distance is calculated in this part:
SQRT(
  POW((100 - positions[ORDINAL(1)].x) * 105/100, 2) +
  POW((50 - positions[ORDINAL(1)].y) * 68/100, 2)
) AS shotDistance
I know the distance formula is used (d=√((x_2-x_1)²+(y_2-y_1)²)) but:
why use ORDINAL(1)? How does it work in this example?
why subtract first from 100 and then from 50?
For the record, positions is a repeated field, with x,y int64 nested underneath. x and y have values between 1 and 100, representing the % of the pitch where an event (e.g. a pass) was initiated or terminated.
The whole code is as follows:
WITH Shots AS
(
  SELECT
    *,
    /* 101 is known Tag for 'goals' from goals table */
    (101 IN UNNEST(tags.id)) AS isGoal,
    /* Translate 0-100 (x,y) coordinate-based distances to absolute positions
       using "average" field dimensions of 105x68 before combining in 2D dist calc */
    SQRT(
      POW((100 - positions[ORDINAL(1)].x) * 105/100, 2) +
      POW((50 - positions[ORDINAL(1)].y) * 68/100, 2)
    ) AS shotDistance
  FROM
    `soccer.events`
  WHERE
    /* Includes both "open play" & free kick shots (including penalties) */
    eventName = 'Shot' OR
    (eventName = 'Free Kick' AND subEventName IN ('Free kick shot', 'Penalty'))
)
SELECT
  ROUND(shotDistance, 0) AS ShotDistRound0,
  COUNT(*) AS numShots,
  SUM(IF(isGoal, 1, 0)) AS numGoals,
  AVG(IF(isGoal, 1, 0)) AS goalPct
FROM
  Shots
WHERE
  shotDistance <= 50
GROUP BY
  ShotDistRound0
ORDER BY
  ShotDistRound0
Thanks
why use ORDINAL(1)? How does it work in this example?
As per the BigQuery array documentation
To access elements from the arrays in this column, you must specify which type of indexing you want to use: either OFFSET, for zero-based indexes, or ORDINAL, for one-based indexes.
So taking a sample array to access the first element you would do the following:
array = [7, 5, 8]
array[OFFSET(0)] = 7
array[ORDINAL(1)] = 7
So in this example it is used to get the coordinates of where the shot took place (which in this data is the first set of x,y coordinates).
why subtract first from 100 and then from 50?
The 100 and the 50 represent the position of the goal on the field.
The end point of the shot is assumed to be the middle of the goal. Along the x axis (0-100), 100 is the end line of the field; along the y axis the goal sits midway between the sidelines, so 50 is the middle of the goal mouth.
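If it helps to see the arithmetic outside SQL, here is a minimal Python sketch of the same calculation; the 105x68 pitch dimensions and the goal centre at (100, 50) come from the query above, while the sample coordinates are made up:

import math

def shot_distance(x, y):
    # Scale pitch-percentage coordinates to a 105 x 68 m pitch and measure
    # the distance to the centre of the goal at (100, 50)
    dx = (100 - x) * 105 / 100  # metres along the length of the pitch
    dy = (50 - y) * 68 / 100    # metres across the width of the pitch
    return math.sqrt(dx ** 2 + dy ** 2)

print(shot_distance(90, 50))  # 10.5 -- a hypothetical shot straight in front of goal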

Precision--recall curves in image retrieval domain

I am working on a loop-closure detection problem across two different seasons, e.g., summer and fall. I need to make precision-recall curves. Suppose I have taken 500 images from summer and 500 images from fall. I have a distance matrix between them.
But I am totally confused about how to make precision-recall curves. For each image from one season, I will get 500 nearest images in ascending (distance) order. I know the definitions of precision and recall, but I can't get close to the solution of this problem. Looking forward to any kind of help, comments, or advice. Thanks in advance.
In precision-recall plots each point is a pair of precision and recall values. In your case, I guess, you'd need to compute those values for each image and then average them.
Imagine you have 1000 images in total and only 100 images that belong to summer. If you take the 500 closest images to some "summer" image, precision in the best case (when the images belonging to the class all come first) would be:
precision(summer) = 100 / (100 + 400) = (retrieved summer images) / (retrieved summer images + other retrieved images) = 0.2
And recall:
recall(summer) = 100 / (100 + 0) = (retrieved summer images) / (retrieved summer images + not retrieved summer images) = 1
As you can see, it has high recall because all the summer images were retrieved, but low precision, because there are only 100 images, and 400 other images don't belong to the class.
Now, if you take the first 100 images instead of 500, both recall and precision would equal 1.
If you take 50 first images, then precision would be still 1, but recall would drop to 0.5.
So, by varying the number of images you can get points for the precision-recall curve. For the above-described example these points would be (0.2, 1), (1, 1), (1, 0.5).
You could compute these values for each of the 1000 images using different thresholds.
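As a rough sketch of how that sweep might look in Python (the labels, distance matrix, and query index below are all placeholders for your own data):

import numpy as np

def pr_curve(distances, labels, query_idx, query_label):
    # Rank all images by distance to the query, nearest first,
    # excluding the query image itself
    order = np.argsort(distances[query_idx])
    order = order[order != query_idx]
    relevant = labels[order] == query_label   # True where the retrieved image matches
    hits = np.cumsum(relevant)                # relevant images within the top k
    k = np.arange(1, len(order) + 1)
    precision = hits / k
    recall = hits / relevant.sum()
    return precision, recall                  # one (precision, recall) pair per cutoff k

# Toy stand-in: 6 images, the first 3 labelled "summer" (0), the rest "fall" (1)
labels = np.array([0, 0, 0, 1, 1, 1])
distances = np.random.rand(6, 6)              # replace with your distance matrix
p, r = pr_curve(distances, labels, query_idx=0, query_label=0)

Averaging these per-query curves over all images of a season would then give the season-level precision-recall curve.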

Formula for incremental payment plan starting low and ending high

I think this is simple, but maybe I'm overthinking it, or I'm just crap at math.
I'm trying to work out a formula for an incremental payment plan calculator without interest, one that starts with a low payment and ends in the 8th month with a higher payment.
$6,600 / 8 = $825 per month
The above gives $825 per month for 8 months.
I want the first payment to start low and increase each month, with the last payment being the highest, until the $6,600 is paid.
How would I work this out in math terms?
In some sense you are underthinking it rather than overthinking it, since there are infinitely many solutions and you haven't given any criteria for choosing between those solutions.
Presumably you want the increments to be the same size each month.
Let x be the initial amount and y the monthly step size
You want
x + (x+y) + (x + 2y) + ... + (x + 7y) = 6600
or, since 0 + 1 + ... + 7 = 28,
8x + 28y = 6600
Mathematically, this equation has infinitely many solutions. If you specify that x,y are positive and that furthermore, x has at most 2 decimal places so as to be exactly expressible as currency, there are still a very large number of solutions.
What you can do is solve for y in terms of x to get that:
y = (1650 - 2x)/7
But -- you would still have to pick x. This formula would allow you to explore the trade-off between x and y. For example, if you pick x = 500 then y is (approximately) 92.86 (you would probably have to adjust the final payment by a few pennies to get it to balance out in the end).
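A minimal Python sketch of that trade-off (x = 500 is just the example value picked above, and rounding the final payment to absorb the leftover pennies is one possible convention):

def payment_plan(total, months, first_payment):
    # Solve total = months*x + y*(0 + 1 + ... + (months - 1)) for the step y
    steps = months * (months - 1) // 2               # 28 when months == 8
    y = (total - months * first_payment) / steps
    payments = [round(first_payment + i * y, 2) for i in range(months)]
    payments[-1] += round(total - sum(payments), 2)  # balance out the pennies
    return payments

plan = payment_plan(6600, 8, 500)
print(plan)                  # [500.0, 592.86, 685.71, ..., 1150.0]
print(round(sum(plan), 2))   # 6600.0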

Reconstructing a signal from its discrete fourier transform in R

I am trying to replicate the following figure in R: (adapted from http://link.springer.com/article/10.1007/PL00011669)
The basic concept of the figure is to show the first few components of a DFT, plotted in the time domain, and then show a reconstructed wave in the time domain using only these components (X') relative to the original data (X). I would like to slightly modify the above figure such that all of the lines shown are overlaid on a single plot.
I have been trying to adapt the figure with some real data sampled at 60 Hz. For example:
## 3 second sample where: time is in seconds and var is the variable of interest
temp = data.frame(time=seq(from=0,to=3,by=1/60),
var = c(0.054,0.054,0.054,0.072,0.072,0.072,0.072,0.09,0.09,0.108,0.126,0.126,
0.126,0.126,0.126,0.144,0.144,0.144,0.144,0.144,0.162,0.162,0.144,0.126,
0.126,0.108,0.144,0.162,0.18,0.162,0.126,0.126,0.108,0.108,0.126,0.144,
0.162,0.144,0.144,0.144,0.144,0.162,0.162,0.126,0.108,0.09,0.09,0.072,
0.054,0.054,0.054,0.036,0.036,0.018,0.018,0.018,0.018,0,0.018,0,
0,0,-0.018,0,0,0,-0.018,0,-0.018,-0.018,0,-0.018,
-0.018,-0.018,-0.018,-0.036,-0.036,-0.054,-0.054,-0.072,-0.072,-0.072,-0.072,-0.072,
-0.09,-0.09,-0.108,-0.126,-0.126,-0.126,-0.144,-0.144,-0.144,-0.162,-0.162,-0.18,
-0.162,-0.162,-0.162,-0.162,-0.144,-0.144,-0.144,-0.126,-0.126,-0.108,-0.108,-0.09,
-0.072,-0.054,-0.036,-0.018,0,0,0,0,0.018,0.018,0.036,0.054,
0.054,0.054,0.054,0.054,0.054,0.054,0.054,0.054,0.054,0.072,0.054,0.072,
0.072,0.072,0.072,0.072,0.072,0.054,0.054,0.054,0.036,0.036,0.036,0.036,
0.036,0.054,0.054,0.072,0.09,0.072,0.036,0.036,0.018,0.018,0.018,0.018,
0.036,0.036,0.036,0.036,0.018,0,-0.018,-0.018,-0.018,-0.018,-0.018,0,
-0.018,-0.036,-0.036,-0.018,-0.018,-0.018,-0.036,0,0,-0.018,-0.018,-0.018,-0.018))
## plot the original data (requires ggplot2)
library(ggplot2)
ggplot(temp, aes(x=time, y=var)) + geom_line()
I believe that I can use fft() to eventually accomplish this goal; however, the leap from the output of fft() to my goal is a bit unclear.
I realize that this question is somewhat similar to: How do I calculate amplitude and phase angle of fft() output from real-valued input? but I am more specifically interested in the actual code for the specific data above.
Please note that I am relatively new to time series analysis so any clarity you could provide w.r.t. putting the output of fft() in context, or any package you could recommend that would accomplish this task efficiently would be appreciated.
Thank you
Matlab is your best tool, and the specific function is just fft(). To use it, first determine several basic parameters of your time domain data:
1. Time duration T, which equals 3 s.
2. Sampling interval T_s, which equals 1/60 s.
3. Frequency domain resolution f_s, which equals the frequency difference between two adjacent Fourier basis functions. You may define f_s according to your needs; however, the smallest possible f_s equals 1/T = 0.333 Hz. As a result, if you want better frequency domain resolution (smaller f_s), you need longer time domain data.
4. Maximum frequency f_M, which equals 1/(2*T_s) = 30 Hz according to the Shannon sampling theorem.
5. DFT length N, which equals 2*f_M/f_s.
Then find the specific frequencies of the four Fourier basis functions that you want to use to approximate the data, for example 3, 6, 9 and 12 Hz. So f_s = 3 Hz, and N = 2*f_M/f_s = 20.
Your Matlab code looks like this:
var=[0.054,0.054,0.054 ...]; % input all your data points here
f_full=fft(var,20); % Do 20-point fft
f_useful=f_full(2:5); % You are interested with the lowest four frequencies except DC
Here f_useful contains the four complex coefficients of four Fourier basis. To reconstruct var, do the following:
% Generate basis functions
dt=0:1/60:3;
df=[3:3:12];
basis1=exp(1j*2*pi*df(1)*dt);
basis2=exp(1j*2*pi*df(2)*dt);
basis3=exp(1j*2*pi*df(3)*dt);
basis4=exp(1j*2*pi*df(4)*dt);
% Reconstruct var
var_recon=basis1*f_useful(1)+...
basis2*f_useful(2)+...
basis3*f_useful(3)+...
basis4*f_useful(4);
var_recon=real(var_recon); % note: you may also need to scale by 2/N to recover the original amplitude
% Plot both curves
figure;
plot(var);
hold on;
plot(var_recon);
Adapt this code to your paper :)
Adapting my own post from Signal Processing. I think it's still relevant for those in Python.
I am no expert in this topic, but have some useful examples to share.
The more Fourier components you keep, the closer you'll mimic the original signal.
This example shows what happens when you keep 10, 20, and finally all n components. Assuming x and y are your data vectors:
import numpy
from matplotlib import pyplot as plt

n = len(y)
COMPONENTS = [10, 20, n]

for c in COMPONENTS:
    colors = numpy.linspace(start=100, stop=255, num=c)
    for i in range(c):
        Y = numpy.fft.fft(y)
        numpy.put(Y, range(i+1, n), 0.0)  # zero out everything past the first i+1 components
        ifft = numpy.fft.ifft(Y)
        plt.plot(x, ifft.real, color=plt.cm.Reds(int(colors[i])), alpha=.70)
    plt.title("First {c} fourier components".format(c=c))
    plt.plot(x, y, label="Original dataset", linewidth=2.0)
    plt.grid(linestyle='dashed')
    plt.legend()
    plt.show()
For the book's dataset, keeping up to 4, 10, and n components:
For your dataset, keeping up to 4, 10, and n components:

Logarithmic distribution

First of all, math is not my area.
Imagine a problem like this:
I have an amount of money to spend, say 500, and I need to spend it over a fixed number of days, say 20. I have a fixed maximum to spend per day, like 50. I don't have to spend money every day.
Now I need to know how to calculate the amount of money to spend each day to get a spending curve like the following:
My goal is a function that takes an amount of money and a number of days, and returns a tuple with the day number and the amount of money for that day.
I know I need to use logarithms of some kind, and I've tried pretty much everything that my brain can handle. I've been looking at Wolfram MathWorld and this formula:
y = a + b ln x
But it does not really help me.
A hint or example in PHP, Python or C# would be great, but any language will do.
PLEASE let me know if you need any more information or if the question is vague; I really want to solve this. Thank you!
I don't understand why you want a log distribution. A parabolic one will do to obtain the curve form you want:
spend[day] = a * day^2 + c
where:
a = (6 * (TD - TA)) / (TD * (-1 - 3*TD + 4*TD^2))
c = -((1 + 3*TD - 6*TA*TD + 2*TD^2) / (-1 - 3*TD + 4*TD^2))
TA = Total Amount
TD = Total Days
With this, the amount you spend on the last day is exactly 1.
For your example values (amount 500, days 20), the spend starts at about 38.33 on day 1 and falls to 1 on day 20.
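Here is a minimal Python sketch of that parabola (TA and TD as defined above; the printed values just follow from the formulas):

def parabolic_plan(TA, TD):
    # spend[d] = a*d^2 + c, with the coefficients chosen so that the total
    # over TD days is TA and the spend on the last day is exactly 1
    denom = -1 - 3 * TD + 4 * TD ** 2
    a = 6 * (TD - TA) / (TD * denom)
    c = -(1 + 3 * TD - 6 * TA * TD + 2 * TD ** 2) / denom
    return [(d, a * d ** 2 + c) for d in range(1, TD + 1)]

plan = parabolic_plan(500, 20)
print(plan[0])                  # (1, 38.33...) -- the biggest spend, on day 1
print(plan[-1])                 # (20, 1.0)     -- the last day is 1
print(sum(s for _, s in plan))  # 500.0 (up to float rounding)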
Are you sure what you are asking for is not a linear equation?
For example
y = f(x) = -50x + 500, and the total number of days would be the x where y = 0.
