How to calculate the distance of an object [migrated]

I have two screenshots (1920x1080) of a game: one with a 348-pixel-tall object that is 1 meter from the camera, and another with a 138-pixel-tall rendering of the same object. Given that the camera's field of view is 90 degrees, how can I precisely measure the object's distance from the camera in the second screenshot?
I tried using a formula to determine the object's distance from the camera based on the object's on-screen height and the known camera distance, but the results were inaccurate.

Precise answer: without knowing the projection model, this is not possible.
An ad-hoc engineer's answer: one may assume a pinhole camera model. This is probably a good assumption, because it is the model with the fewest odd distortions and the most convenient geometric properties, and for a virtual game camera with a horizontal* 90° opening angle it seems reasonable.
Now you can translate your question into simple equations based on the intercept theorem and trigonometry:
(I) w_1 / f = s / d_1
(II) w_2 / f = s / d_2
(III) w_45 / f = tan(45°)
where
f = focal length (unknown)
s = object height (unknown)
d_1 = object distance in image 1 = 1 m
d_2 = object distance in image 2 (unknown)
w_1 = object height in image coordinates in image 1 = 348 px
w_2 = object height in image coordinates in image 2 = 138 px
w_45 = 1920 px / 2
What remains is to solve this equation system:
(I) => f * s = w_1 * d_1
(II) => f * s = w_2 * d_2
___
=> w_1 * d_1 = w_2 * d_2
=> d_2 = w_1 * d_1 / w_2 = 348 px / (138 px) * 1 m = 2.52... m
So you do not even need equation (III). You would only need it if you were interested in the focal length f = 960 px or the object height s = 0.3625 m.
*I assume the 90° refers to the horizontal opening angle, since that is the common convention and matches what you drew in your question.
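
For anyone who wants to check the numbers, here is a minimal Python sketch of the same computation (variable names simply follow the equations above, under the stated pinhole assumption):

import math

# Known quantities from the question
w1 = 348.0          # object height in image 1, in pixels
w2 = 138.0          # object height in image 2, in pixels
d1 = 1.0            # object distance in image 1, in meters
image_width = 1920.0
hfov_deg = 90.0     # assumed horizontal opening angle

# Equation (III): half the image width subtends half the field of view
f = (image_width / 2) / math.tan(math.radians(hfov_deg / 2))   # focal length in pixels

# Equations (I) and (II) combined: w1 * d1 = w2 * d2
d2 = w1 * d1 / w2   # distance in image 2, in meters
s = w1 * d1 / f     # physical object size, in meters

print(f, d2, s)     # approx. 960.0, 2.52, 0.3625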

This might be of interest. I don't think this approach is entirely accurate because of perspective distortion, but in this case I think a fair estimate of the distance D from the camera can be obtained by assuming a constant pixels-per-unit-angle near the centre of the picture. Assume the objects are poles of height h and the camera is pointing at their centres. Suppose the angle from the centre of the camera view to the top of the pole at 1 m is Theta_1, and for the other object it is Theta_2. Then
tan Theta_1 = h/2
tan Theta_2 = h/(2D)
(tan Theta_1) / (tan Theta_2) = D
Taking a first order approximation,
D ≈ Theta_1 / Theta_2
If we also assume 348 Theta_2 = 138 Theta_1, then we get the answer posted above. For a second-order approximation, which should hopefully be more accurate, particularly in more extreme cases:
D = (sin Theta_1)(cos Theta_2) / ((sin Theta_2)(cos Theta_1))
  ≈ Theta_1 (1 - Theta_2^2/2) / (Theta_2 (1 - Theta_1^2/2))
  ≈ (Theta_1/Theta_2)(1 - Theta_2^2/2)(1 + Theta_1^2/2)
  ≈ (Theta_1/Theta_2)(1 + Theta_1^2/2 - Theta_2^2/2)
So if 348 Theta_2 = 138 Theta_1,
D ≈ (348/138)(1 + Theta_1^2 (1/2 - (138/348)^2/2))
If 1920 pixels spans pi/2 radians, then we can estimate
Theta_1 ≈ 348*pi/(2*2*1920) = 0.142353417
giving
D ≈ (348/138)(1.008538919)
which suggests a difference of less than 1 percent from the first-order approximation in this case.
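
A quick numerical check in Python (a sketch; it simply evaluates the exact tangent ratio tan Theta_1 / tan Theta_2 alongside the two approximations, under the same constant-pixels-per-angle assumption):

import math

theta1 = 348 * (math.pi / 2) / (2 * 1920)     # half-height angle at 1 m
theta2 = 138 * (math.pi / 2) / (2 * 1920)     # half-height angle at distance D

exact = math.tan(theta1) / math.tan(theta2)   # ratio of tangents
first_order = theta1 / theta2                 # = 348/138
second_order = first_order * (1 + theta1**2 / 2 - theta2**2 / 2)

print(exact, first_order, second_order)       # roughly 2.536, 2.522, 2.543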

Related

Calculate time for triangular and / or trapezoidal motion profile with initial velocity

I want to calculate the time it takes to reach the target position from the initial position, taking some initial velocities into account.
i: Initial Position: 0.0 m
t: Target Position: 10.0 m
a: Acc: 0.3 m/s2
d: Dec: 0.3 m/s2
v: Max Velocity: 0.5 m/s
u: Initial Velocities: [0.0 m/s, -0.5 m/s, 5.0 m/s]
First I calculate the distance needed to accelerate to the maximum velocity and then decelerate back to zero.
v_dist = ((v² * (a + d)) / (a * d)) / 2.0
v_dist = ((0.25 * 0.6) / 0.09) / 2.0
v_dist = ((0.15) / (0.09)) / 2.0
v_dist = 0.833333
If we do not travel at least this distance, we never reach the maximum velocity, and the result is always a triangular motion profile.
For a Triangular Motion Profile I use the following formula:
t = sqrt(2.0 * abs(t-i) * ((a+d)/(a*d)))
Which results in:
t = sqrt(2.0 * abs(10.0) * ((0.3+0.3)/(0.3*0.3)))
t = sqrt(20 * (0.6/(0.09)))
t = sqrt(20 * (6.6667))
t = sqrt(133.333)
t = 11.547
Unfortunately this formula does not take the initial velocity into account, and I cannot figure out where to insert it. I also have difficulty wrapping my head around the ((a+d)/(a*d)) part.
How can I adjust the formula so that it takes an initial velocity into account, even if the current direction of motion is opposite to the direction of the target position?
For a Trapezoidal Profile I use the following formula:
t = (abs(t-i) - ((a * (v/a)²) /2) + ((d * (v/d)²) /2)) / v + (v/a) + (v/d)
For this formula I have the same problem as with the triangular one: I do not understand where to put u (the initial velocity) so that it is processed correctly.
Look at the picture showing the V(t) diagram (velocity vs. time) for the trapezoidal profile (ACDE) and the triangular one (BFG); the abscissa values are arbitrary here.
The ordinate of point A is the initial velocity, the ordinate of C and D is the maximum velocity, and the ordinate of E is whatever velocity is needed at the end.
The abscissa of C and F is the moment when acceleration ends, the abscissa of D and F is when deceleration starts, and E and G are the stop moments.
The slopes of AC and BF are the accelerations; the slopes of DE and FG are the decelerations.
The area under the polyline is the distance.
So for the trapezoidal profile you can calculate the times needed for acceleration and deceleration, then find the time spent on the CD segment that provides the needed distance (the sum of the areas 0AC1, 1CD3 and 3DE).
For the triangular profile, find the times of the BF and FG segments (they are interdependent) that provide the needed distance (the sum of the areas 0BF4 and 4FG).
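
Here is a rough Python sketch of that structure, assuming the initial velocity u already points toward the target with 0 <= u <= v_max and a full stop at the target; braking first when the initial velocity points away from the target (and the u > v_max case) is left out. The function name profile_time is made up.

import math

def profile_time(distance, u, v_max, a, d):
    # Time to cover distance (> 0), starting at velocity u (0 <= u <= v_max),
    # ending at rest, with acceleration a and deceleration d.
    d_acc = (v_max**2 - u**2) / (2 * a)   # distance to accelerate u -> v_max
    d_dec = v_max**2 / (2 * d)            # distance to decelerate v_max -> 0

    if d_acc + d_dec <= distance:
        # Trapezoidal profile: accelerate, cruise, decelerate
        t_cruise = (distance - d_acc - d_dec) / v_max
        return (v_max - u) / a + t_cruise + v_max / d

    # Triangular profile: peak velocity v_p from (v_p^2 - u^2)/(2a) + v_p^2/(2d) = distance
    v_p = math.sqrt((2 * a * d * distance + d * u**2) / (a + d))
    return (v_p - u) / a + v_p / d

print(profile_time(10.0, 0.0, 0.5, 0.3, 0.3))   # trapezoidal: ~21.67 s
print(profile_time(0.5, 0.0, 0.5, 0.3, 0.3))    # triangular:  ~2.58 s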

Estimate visible bounds of webcam using diagonal fov

I'm using a Logitech C920 webcam (specs here) and I need to estimate its visible bounds before installing it at the user's location.
I see that it has a diagonal FOV of 78°. So, following the math described here, we have
W = D * r / sqrt(1 + r^2) and H = D / sqrt(1 + r^2),
where H is the vertical FOV, W is the horizontal FOV, D is the diagonal FOV and r is the aspect ratio.
Considering an aspect ratio of 16/9, that gives me approximately W = 67.9829 and H = 38.2403.
So I create a frustum using W and H.
The problem is: a slice of this frustum isn't 16:9. Is this because of the numeric approximations, or am I doing something else wrong?
Does the camera crop a bigger image?
How can I compute effectively what will be the visible frustum?
Thank you very much!
The formulae you have are for distances, not for angles. You would need to calculate the distance using the tangent:
D = 2 * tan(diagonalFov / 2)
Then you can go ahead with your formula; H and W will again be distance values. If you need the corresponding angles, you can use the arctangent:
verticalFov = 2 * arctan(H / 2)
horizontalFov = 2 * arctan(W / 2)
For your values, you'll get
verticalFov = 43.3067°
horizontalFov = 70.428°
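
A minimal Python sketch of this calculation (the rectilinear-lens, square-pixel assumption is mine, and the function name is made up):

import math

def fov_from_diagonal(diag_fov_deg, aspect_w=16.0, aspect_h=9.0):
    # Split a diagonal FOV into horizontal and vertical FOV (degrees),
    # assuming a rectilinear (pinhole) lens and square pixels.
    half_diag = math.tan(math.radians(diag_fov_deg) / 2)   # half-diagonal at unit focal length
    diag = math.hypot(aspect_w, aspect_h)
    half_w = half_diag * aspect_w / diag
    half_h = half_diag * aspect_h / diag
    return 2 * math.degrees(math.atan(half_w)), 2 * math.degrees(math.atan(half_h))

print(fov_from_diagonal(78.0))   # approximately (70.43, 43.31)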

Given sum of two numbers A and B find maximum product of A and B

Given the sum of A and B, call it S, I have to find the maximum product A*B, but there is one condition: the value of A must lie in the range [P, Q]. How do I proceed?
If there is no range constraint, the task is pretty simple using calculus (set the derivative to zero).
For example, how do I find the maximum product when
A + B = 99
and the value of A must be in the range [10, 20]?
O(N) will not be sufficient for the problem.
Clearly, B = S - A and you need to maximize A * (S - A).
You know from algebra that A * (S - A) achieves a maximum when A = S / 2.
If S / 2 falls in the allowed range [P, Q], then the maximum value is S^2 / 4.
Otherwise, by monotonicity the maximum value is reached at one of the bounds and is the largest of P * (S - P) and Q * (S - Q).
This is an O(1) solution.
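
In code this is just a three-candidate check (a sketch that treats A as a real number; if A must be an integer, test floor(S/2) and ceil(S/2) instead of S/2):

def max_product(S, P, Q):
    # Maximum of A * (S - A) for A in [P, Q], in O(1)
    candidates = [P, Q]
    if P <= S / 2 <= Q:              # unconstrained optimum lies inside the range
        candidates.append(S / 2)
    return max(a * (S - a) for a in candidates)

print(max_product(99, 10, 20))       # 20 * 79 = 1580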
This is actually a maths question, nothing to do with programming. Even if you are given it as a programming question, you should first understand it mathematically.
You can think of it geometrically as "if the perimeter of my rectangle is fixed at S, how can I achieve the maximum area?"
The answer is by making sides of equal length, and turning it into a square. If you're constrained, you have to get as close to a square as possible.
You can use calculus to show it formally:
A+B = S, so B = S-A
AB therefore = A(S-A)
A is allowed to vary, so write it as x
y = x(S-x) = -x^2 + Sx
This is a quadratic; its graph will look like an upside-down parabola.
You want the maximum, so you're looking for the top of the parabola
dy/dx = 0
-2x + S = 0
x = S/2
A better way of looking at it is to start with a rectangle with sides P and q and area Pq = A (here A denotes the area), and say P is the longer edge.
Now make it more oblong by making the longer edge P slightly longer and the shorter edge q slightly shorter, both by the same amount delta, so that P + q does not change, and we can show that the area goes down:
Pq = A
(P + delta) * (q - delta)
= Pq + (q - P)*delta - delta^2
= A + (q - P)*delta
(throwing away the delta^2 term, as it vanishes as delta shrinks to 0)
= A + (something negative)*delta
= A - something positive
i.e. < A

Kinect intrinsic parameters from field of view

Microsoft states that the field-of-view angles for the Kinect are 43 degrees vertical and 57 degrees horizontal (stated here). Given these, can we calculate the intrinsic parameters, i.e. the focal length and centre of projection? I assume the centre of projection can be given as (0,0,0)?
Thanks
EDIT: some more information on what I'm trying to do
I have a dataset of images recorded with a Kinect, and I am trying to convert pixel positions (x_screen, y_screen, together with z_world in mm) to real-world coordinates.
If I know the camera is placed at point (x', y', z') in the real-world coordinate system, is it sufficient to find the real-world coordinates by doing the following:
x_world = (x_screen - c_x) * z_world / f_x
y_world = (y_screen - c_y) * z_world / f_y
where c_x = x', c_y = y' and f_x, f_y are the focal lengths? Also, how can I find the focal length given just knowledge of the field of view?
Thanks
If you equate the world origin (0,0,0) with the camera focus (center of projection as you call it) and you assume the camera is pointing along the positive z-axis, then the situation looks like this in the plane x=0:
Here the axes are z (horizontal) and y (vertical). The subscript v is for "viewport" or screen, and w is for world.
If I get your meaning correctly, you know h (the screen height in pixels) as well as zw, yv and xv, and you want to know yw and xw. Note this calculation puts (0,0) at the center of the viewport; adjust appropriately for the usual screen coordinate system with (0,0) in the upper-left corner. Apply a little trig:
tan(43/2) = (h/2) / f = h / (2f), so f = h / ( 2 tan(43/2) )
and similar triangles
yw / zw = yv / f also xw / zw = xv / f
Solve:
yw = zw * yv / f and xw = zw * xv / f
Note this assumes the "focal length" of the camera is the same in the x-direction. It doesn't have to be. For best accuracy in xw, you should recalculate with f = w / (2 tan(57/2)), where w is the screen width. This is because f isn't a true focal length; it's just a conversion constant. If the camera's pixels are square and the optics have no aberrations, these two f calculations will give the same result.
NB: In a deleted (improper) answer the OP seemed to say that it isn't zw that's known, but rather the length D of the hypotenuse from the origin to (xw, yw, zw). In this case just note that zw = D * f / sqrt(xv² + yv² + f²) (assuming the camera pixels are square; some scaling is necessary if not). Then you can proceed as above.
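
Putting the answer together, here is a small Python sketch of the back-projection (the pinhole model, the 640x480 resolution defaults and the function name are assumptions on my part):

import math

def pixel_to_world(x_screen, y_screen, z_world_mm,
                   width=640, height=480, hfov_deg=57.0, vfov_deg=43.0):
    # Back-project a pixel plus its depth into camera-space coordinates (mm),
    # with (0,0) at the top-left of the image and the principal point at the centre.
    f_x = width / (2 * math.tan(math.radians(hfov_deg) / 2))
    f_y = height / (2 * math.tan(math.radians(vfov_deg) / 2))
    c_x, c_y = width / 2, height / 2
    x_world = (x_screen - c_x) * z_world_mm / f_x
    y_world = (y_screen - c_y) * z_world_mm / f_y
    return x_world, y_world, z_world_mm

print(pixel_to_world(320, 240, 1000.0))   # centre pixel -> (0.0, 0.0, 1000.0)

Note this gives coordinates relative to the camera; if the camera sits at (x', y', z') in the world you still have to apply that offset (and any rotation) afterwards, rather than folding it into c_x and c_y.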
I cannot add a comment since my reputation here is too low, but I would point out that the Kinect's camera angles are generally not the same as a normal photo camera's, due to the video stream format and its sensor chip. So the SDK mentioning 57 degrees and 43 degrees might refer to different angular resolutions for height and width.
It sends a bitmap of 320x240 pixels, and those pixels relate to
Horizontal FOV: 58.5° (distributed over 320 horizontal pixels)
Vertical FOV: 45.6° (distributed over 240 vertical pixels).
Z is known and your angle is known, so I suppose the law of sines can then get you the proper locations: https://en.wikipedia.org/wiki/Law_of_sines

Vertex shader world transform, why do we use 4 dimensional vectors?

From this site: http://www.toymaker.info/Games/html/vertex_shaders.html
We have the following code snippet:
// transformations provided by the app, constant Uniform data
float4x4 matWorldViewProj: WORLDVIEWPROJECTION;
// the format of our vertex data
struct VS_OUTPUT
{
float4 Pos : POSITION;
};
// Simple Vertex Shader - carry out transformation
VS_OUTPUT VS(float4 Pos : POSITION)
{
VS_OUTPUT Out = (VS_OUTPUT)0;
Out.Pos = mul(Pos,matWorldViewProj);
return Out;
}
My question is: why does the struct VS_OUTPUT have a 4 dimensional vector as its position? Isn't position just x, y and z?
Because you need the w coordinate for the perspective calculation. After you output from the vertex shader, DirectX performs a perspective divide by dividing by w.
Essentially if you have 32768, -32768, 32768, 65536 as your output vertex position then after w divide you get 0.5, -0.5, 0.5, 1. At this point the w can be discarded as it is no longer needed. This information is then passed through the viewport matrix which transforms it to usable 2D coordinates.
Edit: If you look at how a matrix multiplication is performed using the projection matrix you can see how the values get placed in the correct places.
Taking the projection matrix specified in D3DXMatrixPerspectiveLH
2*zn/w 0 0 0
0 2*zn/h 0 0
0 0 zf/(zf-zn) 1
0 0 zn*zf/(zn-zf) 0
And applying it to a random x, y, z, 1 (Note for a vertex position w will always be 1) vertex input value you get the following
x' = ((2*zn/w) * x) + (0 * y) + (0 * z) + (0 * w)
y' = (0 * x) + ((2*zn/h) * y) + (0 * z) + (0 * w)
z' = (0 * x) + (0 * y) + ((zf/(zf-zn)) * z) + ((zn*zf/(zn-zf)) * w)
w' = (0 * x) + (0 * y) + (1 * z) + (0 * w)
Instantly you can see that w and z are different. The w coord now just contains the z coordinate passed to the projection matrix. z contains something far more complicated.
So, assume we have an input position of (2, 1, 5, 1), a zn (Z-near) of 1, a zf (Z-far) of 10, a w (width) of 1 and an h (height) of 1.
Passing these values through we get
x' = ((2 * 1)/1) * 2
y' = ((2 * 1)/1) * 1
z' = ((10/(10-1)) * 5) + ((10 * 1/(1-10)) * 1)
w' = 5
expanding that we then get
x' = 4
y' = 2
z' = 4.44
w' = 5
We then perform final perspective divide and we get
x'' = 0.8
y'' = 0.4
z'' = 0.889
w'' = 1
And now we have our final coordinate position. This assumes that x and y range from -1 to 1 and z ranges from 0 to 1. As you can see, the vertex is on-screen.
As a bizarre bonus, you can see that if |x'|, |y'| or |z'| is larger than |w'|, or z' is less than 0, the vertex is off-screen. This information is used for clipping the triangle to the screen.
Anyway, I think that's a pretty comprehensive answer :D
Edit 2: Be warned, I am using row-major matrices; column-major matrices are transposed.
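
If you want to verify those numbers yourself, here is a short Python/NumPy sketch of the same row-vector times row-major matrix multiplication followed by the perspective divide:

import numpy as np

zn, zf, w, h = 1.0, 10.0, 1.0, 1.0
proj = np.array([
    [2 * zn / w, 0.0,        0.0,                  0.0],
    [0.0,        2 * zn / h, 0.0,                  0.0],
    [0.0,        0.0,        zf / (zf - zn),       1.0],
    [0.0,        0.0,        zn * zf / (zn - zf),  0.0],
])

pos = np.array([2.0, 1.0, 5.0, 1.0])   # x, y, z, w (w = 1 for a position)
clip = pos @ proj                      # row vector times row-major matrix
ndc = clip / clip[3]                   # perspective divide by w'

print(clip)   # [4.  2.  4.444...  5.]
print(ndc)    # [0.8  0.4  0.888...  1.]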
Rotation is specified by a 3-dimensional matrix and translation by a vector. You can perform both transforms in a "single" operation by combining them into a single 3 x 4 matrix (three rows, four columns):
rx1 rx2 rx3 tx1
ry1 ry2 ry3 ty1
rz1 rz2 rz3 tz1
However, as this isn't square, there are various operations that can't be performed on it (inversion, for one). By adding an extra row (that does nothing):
0 0 0 1
all these operations become possible (if not easy).
As Goz explains in his answer, by making that "1" a non-identity value the matrix becomes a perspective transformation.
Clipping is an important part of this process, as it helps to visualize what happens to the geometry. The clipping stage essentially discards any point in a primitive that is outside of a 2-unit cube centered around the origin (OK, you have to reconstruct primitives that are partially clipped but that doesn't matter here).
It would be possible to construct a matrix that directly mapped your world space coordinates to such a cube, but gradual movement from the far plane to the near plane would be linear. That is to say that a move of one foot (towards the viewer) when one mile away from the viewer would cause the same increase in size as a move of one foot when several feet from the camera.
However, if we have another coordinate in our vector (w), we can divide the vector component-wise by w, and our primitives won't exhibit the above behavior, but we can still make them end up inside the 2-unit cube above.
For further explanations see http://www.opengl.org/resources/faq/technical/depthbuffer.htm#0060 and http://en.wikipedia.org/wiki/Transformation_matrix#Perspective_projection.
A simple answer would be to say that if you don't tell the pipeline what w is then you haven't given it enough information about your projection. This can be verified directly without understanding what the pipeline does with it...
As you probably know the 4x4 matrix can be split into parts based on what each part does. The 3x3 matrix at the top left is altered when you do rotation or scale operations. The fourth column is altered when you do a translation. If you ever inspect a perspective matrix, it alters the bottom row of the matrix. If you then look at how a Matrix-Vector multiplication is done, you see that the bottom row of the matrix ONLY affects the resultant w component of the vector. So if you don't tell the pipeline about w it won't have all your information.
