I'm following one of the BigQuery courses in Google's Skill Boost program. Using a dataset of football (soccer) statistics, the course calculates the impact of shot distance on the likelihood of scoring a goal.
I don't quite understand how the shot distance is calculated in this part:
SQRT(
  POW((100 - positions[ORDINAL(1)].x) * 105/100, 2) +
  POW((50 - positions[ORDINAL(1)].y) * 68/100, 2)
) AS shotDistance
I know the distance formula is being used (d = √((x_2-x_1)² + (y_2-y_1)²)), but:
why use ORDINAL(1)? How does it work in this example?
why subtract first from 100 and then from 50?
For the record, positions is a repeated field with x, y INT64 values nested underneath. x and y take values between 1 and 100, representing the percentage of the pitch at which an event (e.g. a pass) was initiated or terminated.
The whole code is as follows:
WITH
Shots AS
(
  SELECT
    *,
    /* 101 is known Tag for 'goals' from goals table */
    (101 IN UNNEST(tags.id)) AS isGoal,
    /* Translate 0-100 (x,y) coordinate-based distances to absolute positions
       using "average" field dimensions of 105x68 before combining in 2D dist calc */
    SQRT(
      POW((100 - positions[ORDINAL(1)].x) * 105/100, 2) +
      POW((50 - positions[ORDINAL(1)].y) * 68/100, 2)
    ) AS shotDistance
  FROM
    `soccer.events`
  WHERE
    /* Includes both "open play" & free kick shots (including penalties) */
    eventName = 'Shot' OR
    (eventName = 'Free Kick' AND subEventName IN ('Free kick shot', 'Penalty'))
)
SELECT
  ROUND(shotDistance, 0) AS ShotDistRound0,
  COUNT(*) AS numShots,
  SUM(IF(isGoal, 1, 0)) AS numGoals,
  AVG(IF(isGoal, 1, 0)) AS goalPct
FROM
  Shots
WHERE
  shotDistance <= 50
GROUP BY
  ShotDistRound0
ORDER BY
  ShotDistRound0
Thanks
why use ORDINAL(1)? How does it work in this example?
As per the BigQuery array documentation
To access elements from the arrays in this column, you must specify
which type of indexing you want to use: either OFFSET, for zero-based
indexes, or ORDINAL, for one-based indexes.
So, taking a sample array, to access the first element you would do the following:
array = [7, 5, 8]
array[OFFSET(0)] = 7
array[ORDINAL(1)] = 7
So in this example it is used to get the coordinates of where the shot took place (which in this data is the first set of x,y coordinates).
why subtract first from 100 and then from 50?
The values 100 and 50 represent the position of the goal on the field.
The end point of the shot is assumed to be the middle of the goal. On the x axis, which runs from 0 to 100, 100 is the end line of the field; on the y axis the goal is in the middle of the field, an equal distance from each sideline, so 50 is the midpoint of the goal.
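To make the arithmetic concrete, here is a minimal Python sketch of the same calculation (my own illustration, not part of the course), assuming the first element of positions holds the shot's starting coordinates and the goal's middle sits at (100, 50) in percent coordinates:

import math

# Average pitch dimensions in metres, as used in the query.
PITCH_LENGTH_M = 105
PITCH_WIDTH_M = 68

def shot_distance(positions):
    """Distance in metres from the shot's origin to the middle of the goal.

    positions mirrors the repeated BigQuery field: a list of dicts holding
    percentage coordinates; positions[0] corresponds to positions[ORDINAL(1)].
    """
    x_pct = positions[0]["x"]   # 0-100, where 100 is the opponent's end line
    y_pct = positions[0]["y"]   # 0-100, where 50 is the middle of the pitch (and of the goal)

    # Convert the percentage offsets to the goal's position into metres.
    dx_m = (100 - x_pct) * PITCH_LENGTH_M / 100
    dy_m = (50 - y_pct) * PITCH_WIDTH_M / 100

    return math.sqrt(dx_m ** 2 + dy_m ** 2)

# A shot from roughly the penalty-spot area, about (89, 50) in percent terms:
print(shot_distance([{"x": 89, "y": 50}]))  # ~11.55 m

The 105/100 and 68/100 factors simply rescale the percentage offsets to metres on an "average" 105 x 68 pitch, as noted in the query comment.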
I am working on a loop-closure detection problem across two different seasons, e.g. summer and fall. I need to make precision-recall curves. Suppose I have taken 500 images from summer and 500 images from fall, and I have a distance matrix between them.
But I am totally confused about how to make precision-recall curves. For each image from one season, I will get the 500 nearest images in ascending order of distance. I know the definitions of precision and recall, but I can't get close to a solution for this problem. Looking forward to any kind of help, comments, or advice. Thanks in advance.
In precision-recall plots each point is a pair of precision and recall values. In your case, I guess, you'd need to compute those values for each image and then average them.
Imagine you have 1000 images in total and only 100 images that belong to summer. If you take the 500 closest images to some "summer" image, precision in the best case (when the images belonging to the class come first) would be:
precision(summer) = 100 / (100 + 400) = (retrieved summer images) / (retrieved summer images + other retrieved images) = 0.2
And recall:
recall(summer) = 100 / (100 + 0) = (retrieved summer images) / (retrieved summer images + not retrieved summer images) = 1
As you can see, it has high recall because all the summer images were retrieved, but low precision, because only 100 of the 500 retrieved images belong to the class and the other 400 do not.
Now, if you take the first 100 images instead of 500, both recall and precision would equal 1.
If you take 50 first images, then precision would be still 1, but recall would drop to 0.5.
So, by varying the number of images you can get points for the precision-recall curve. For the above-described example these points would be (0.2, 1), (1, 1), (1, 0.5).
You could compute these values for each of the 1000 images using different thresholds.
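To make that concrete, here is a minimal Python sketch of the thresholding idea (my own illustration, not from the original answer). The names dist_matrix and labels are made up for the sketch: dist_matrix is the square distance matrix and labels holds ground-truth place IDs. It sweeps the number of retrieved nearest neighbours and averages precision and recall over all queries:

import numpy as np

# Hypothetical inputs (replace with your real data):
#   dist_matrix: (N, N) array, dist_matrix[i, j] = distance between image i and image j
#   labels:      (N,) array of ground-truth place IDs, so labels[i] == labels[j]
#                means images i and j show the same place (a true loop closure)
rng = np.random.default_rng(0)
N = 1000
dist_matrix = rng.random((N, N))
labels = rng.integers(0, 100, size=N)          # toy ground truth for the sketch

precisions, recalls = [], []
for k in [10, 50, 100, 200, 500]:              # number of retrieved images per query
    p_per_query, r_per_query = [], []
    for i in range(N):
        order = np.argsort(dist_matrix[i])     # all images sorted by distance to query i
        order = order[order != i]              # drop the query image itself
        retrieved = order[:k]
        relevant = (labels == labels[i])
        relevant[i] = False                    # the query does not count as a match
        true_pos = relevant[retrieved].sum()
        p_per_query.append(true_pos / k)
        r_per_query.append(true_pos / max(relevant.sum(), 1))
    precisions.append(np.mean(p_per_query))
    recalls.append(np.mean(r_per_query))

# One (precision, recall) point per retrieval size.
print(list(zip(precisions, recalls)))

Plotting the averaged recall on the x axis against the averaged precision on the y axis for each retrieval size then gives the precision-recall curve; sweeping a distance threshold instead of a fixed k works the same way.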
I am trying to replicate the following figure in R: (adapted from http://link.springer.com/article/10.1007/PL00011669)
The basic concept of the figure is to show the first few components of a DFT, plotted in the time domain, and then show a reconstructed wave in the time domain using only these components (X') relative to the original data (X). I would like to slightly modify the above figure such that all of the lines shown are overlaid on a single plot.
I have been trying to adapt the figure with some real data sampled at 60 Hz. For example:
## 3 second sample where: time is in seconds and var is the variable of interest
temp = data.frame(time=seq(from=0,to=3,by=1/60),
var = c(0.054,0.054,0.054,0.072,0.072,0.072,0.072,0.09,0.09,0.108,0.126,0.126,
0.126,0.126,0.126,0.144,0.144,0.144,0.144,0.144,0.162,0.162,0.144,0.126,
0.126,0.108,0.144,0.162,0.18,0.162,0.126,0.126,0.108,0.108,0.126,0.144,
0.162,0.144,0.144,0.144,0.144,0.162,0.162,0.126,0.108,0.09,0.09,0.072,
0.054,0.054,0.054,0.036,0.036,0.018,0.018,0.018,0.018,0,0.018,0,
0,0,-0.018,0,0,0,-0.018,0,-0.018,-0.018,0,-0.018,
-0.018,-0.018,-0.018,-0.036,-0.036,-0.054,-0.054,-0.072,-0.072,-0.072,-0.072,-0.072,
-0.09,-0.09,-0.108,-0.126,-0.126,-0.126,-0.144,-0.144,-0.144,-0.162,-0.162,-0.18,
-0.162,-0.162,-0.162,-0.162,-0.144,-0.144,-0.144,-0.126,-0.126,-0.108,-0.108,-0.09,
-0.072,-0.054,-0.036,-0.018,0,0,0,0,0.018,0.018,0.036,0.054,
0.054,0.054,0.054,0.054,0.054,0.054,0.054,0.054,0.054,0.072,0.054,0.072,
0.072,0.072,0.072,0.072,0.072,0.054,0.054,0.054,0.036,0.036,0.036,0.036,
0.036,0.054,0.054,0.072,0.09,0.072,0.036,0.036,0.018,0.018,0.018,0.018,
0.036,0.036,0.036,0.036,0.018,0,-0.018,-0.018,-0.018,-0.018,-0.018,0,
-0.018,-0.036,-0.036,-0.018,-0.018,-0.018,-0.036,0,0,-0.018,-0.018,-0.018,-0.018))
## plot the original data
library(ggplot2)
ggplot(temp, aes(x = time, y = var)) + geom_line()
I believe that I can use fft() to eventually accomplish this goal however the leap from the output of fft() to my goal is a bit unclear.
I realize that this question is somewhat similar to: How do I calculate amplitude and phase angle of fft() output from real-valued input? but I am more specifically interested in the actual code for the specific data above.
Please note that I am relatively new to time series analysis so any clarity you could provide w.r.t. putting the output of fft() in context, or any package you could recommend that would accomplish this task efficiently would be appreciated.
Thank you
Matlab is your best tool, and the specific function is just fft(). To use it, first determine several basic parameters of your time domain data:
1. Time duration T, which equals 3 s.
2. Sampling interval T_s, which equals 1/60 s.
3. Frequency-domain resolution f_s, which equals the frequency difference between two adjacent Fourier basis functions. You may define f_s according to your needs; however, the smallest possible f_s equals 1/T = 0.333 Hz. As a result, if you want better frequency-domain resolution (a smaller f_s), you need longer time-domain data.
4. Maximum frequency f_M, which equals 1/(2*T_s) = 30 Hz according to the Shannon sampling theorem.
5. DFT length N, which equals 2*f_M/f_s.
Then find the specific frequencies of the four Fourier basis functions that you want to use to approximate the data, for example 3, 6, 9 and 12 Hz. So f_s = 3 Hz, and N = 2*f_M/f_s = 20.
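As a quick sanity check of that arithmetic, here is a short Python sketch of my own (not part of the original answer):

# Sanity check of the DFT parameters described above.
T = 3                      # time duration in seconds
T_s = 1 / 60               # sampling interval in seconds
f_M = 1 / (2 * T_s)        # maximum (Nyquist) frequency -> 30.0 Hz
f_s = 3                    # chosen frequency resolution in Hz (bases at 3, 6, 9, 12 Hz)
N = int(2 * f_M / f_s)     # DFT length -> 20

print(f_M, N)              # 30.0 20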
Your Matlab code looks like this:
var=[0.054,0.054,0.054 ...]; % input all your data points here
f_full=fft(var,20); % Do 20-point fft
f_useful=f_full(2:5); % You are interested in the lowest four frequencies, excluding DC
Here f_useful contains the four complex coefficients of the four Fourier basis functions. To reconstruct var, do the following:
% Generate basis functions
dt=0:1/60:3;
df=[3:3:12];
basis1=exp(1j*2*pi*df(1)*dt);
basis2=exp(1j*2*pi*df(2)*dt);
basis3=exp(1j*2*pi*df(3)*dt);
basis4=exp(1j*2*pi*df(4)*dt);
% Reconstruct var
var_recon=basis1*f_useful(1)+...
basis2*f_useful(2)+...
basis3*f_useful(3)+...
basis4*f_useful(4);
var_recon=real(var_recon);
% Plot both curves
figure;
plot(var);
hold on;
plot(var_recon);
Adapt this code to your paper :)
Adapting my own post from Signal Processing. I think it's still relevant for those in Python.
I am no expert in this topic, but have some useful examples to share.
The more Fourier components you keep, the closer you'll mimic the original signal.
This example shows what happens when you keep 10, 20, ...up to n components. Assuming x and y are your data vectors.
import numpy
from matplotlib import pyplot as plt

n = len(y)
COMPONENTS = [10, 20, n]

for c in COMPONENTS:
    colors = numpy.linspace(start=100, stop=255, num=c)
    for i in range(c):
        Y = numpy.fft.fft(y)
        numpy.put(Y, range(i+1, n), 0.0)   # zero out everything beyond the first i+1 components
        ifft = numpy.fft.ifft(Y)
        plt.plot(x, ifft.real, color=plt.cm.Reds(int(colors[i])), alpha=.70)
    plt.title("First {c} Fourier components".format(c=c))
    plt.plot(x, y, label="Original dataset", linewidth=2.0)
    plt.grid(linestyle='dashed')
    plt.legend()
    plt.show()
For the book's dataset, keeping up to 4, 10, and n components:
For your dataset, keeping up to 4, 10, and n components: