Efficient conditional ceiling and floor in HLSL - math

I have been trying to figure this out for quite some time, but I can't get it quite right. What I want to do is round a float up or down to an integer, depending on the sign of a second float.
I basically need a function that should work like this:
float roundParam(float val, float dir)
{
    if (dir >= 0)
        return ceil(val);
    else
        return floor(val);
}
This is of course VERY inefficient, as it requires a branch per vector component. I came up with this, but it breaks for integers:
float roundParam(float val, float dir)
{
return round(val + 0.5 * sign(dir));
}

Thanks to @wim and his observation that floor(x) = -ceil(-x) and ceil(x) = -floor(-x), I was able to create this function that solved the problem:
float3 roundParam(float3 val, float3 dir)
{
float3 dirSign = sign(dir);
return dirSign * floor(dirSign * val) + dirSign;
}

In C you can use the following function, which vectorizes well. Maybe you can use the same idea in HLSL. This solution is only suitable if you don't care about the difference between +0 and -0 (signed zero) for dir.
float roundParam_v2(float val, float dir)
{
    union fl_i32{float f; int i;} x, y, d;
    x.f = val;
    d.f = dir;
    d.i = d.i & 0x80000000; /* extract the sign bit */
    x.i = x.i ^ d.i;        /* multiply x by -1.0f if the sign bit is set */
    y.f = ceilf(x.f);       /* note that floor(z) = -ceil(-z) */
    y.i = y.i ^ d.i;        /* multiply y by -1.0f if the sign bit is set */
    return y.f;
}

How about:
float roundParam(float val, float dir)
{
return ceil(val)*(float)(dir>=0)+floor(val)*(float)(dir<0);
}
It can probably be optimized further, but the compiler likely does that already.
By the way, if you add the [flatten] attribute to the if statement, the compiler will probably optimize the original version as well. For such a simple branch, it is most likely flattened whether you add the attribute or not.
It would be interesting to check the compiled code and see whether the branch has already been removed. I'm currently away from my machine, so I cannot check.

Perhaps use an array of function pointers selected by dir?
Below is C. It is unclear how usable such an approach is in an HLSL shader.
float roundParam(float val, float dir) {
static float (*f[2])(float) = {ceilf, floorf};
return f[!!signbit(dir)](val);
}

Related

I don't Understand - *(uint*)((byte*)p + Offset)

I am having trouble understanding this code and would like a good explanation.
The following function takes in a hex file and modifies the address without overwriting everything else.
Can someone explain to me how it's doing that?
unsafe void WriteUint32(void* p, int Offset, uint value)
{
*(uint*)((byte*)p + Offset) = value;
}
If we ignore Offset, you have
*(uint*)((byte*)p) = value;
which is just assigning value to what p points to, interpreted as a uint.
Adding Offset just changes the pointer to where value is being assigned.
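The same byte-offset write can be sketched in C. This is an illustration, not the original C# code: write_uint32 is a made-up name, and memcpy is used instead of a pointer cast so that the write is also valid at unaligned offsets.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: stores a 32-bit value at `offset` bytes past p,
   leaving every other byte of the buffer untouched. */
void write_uint32(void *p, size_t offset, uint32_t value)
{
    /* View the buffer as bytes so that pointer arithmetic counts in bytes. */
    unsigned char *bytes = (unsigned char *)p;
    /* Copy the four bytes of value to the target position. */
    memcpy(bytes + offset, &value, sizeof value);
}
```

Only the four bytes starting at the offset change, which is why the rest of the data is not overwritten.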

Manual loop unrolling with known maximum size

Please take a look at this code in an OpenCL kernel:
uint point_color = 4278190080;
float point_percent = 1.0f;
float near_pixel_size = (...);
float far_pixel_size = (...);
float delta_pixel_size = far_pixel_size - near_pixel_size;
float3 near = (...);
float3 far = (...);
float3 direction = normalize(far - near);
point_position = (...) + 10;
for (size_t p = 0; p < point_count; p++, point_position += 4)
{
float3 point = (float3)(point_list[point_position], point_list[point_position + 1], point_list[point_position + 2]);
float projection = dot(point - near, direction);
float3 projected = near + direction * projection;
float rejection_length = distance(point, projected);
float percent = projection / segment_length;
float pixel_size = near_pixel_size + percent * delta_pixel_size;
bool is_candidate = (pixel_size > rejection_length && point_percent > percent);
point_color = (is_candidate ? (uint)point_list[point_position + 3] | 4278190080 : point_color);
point_percent = (is_candidate ? percent : point_percent);
}
This code attempts to find the point in a list that is nearest to the line segment between far and near, and to assign its color to point_color and its "percentual distance" to point_percent. (Incidentally, the code seems to be OK.)
The number of elements specified by point_count is variable, so I cannot assume too much about it, save for one thing: point_count will always be less than or equal to 8. That's a fixed fact in my code and data.
I would like to unroll this loop manually, and I'm afraid I will need to use lots of
value = (point_count < constant ? new_value : value)
for all lines in it. In your experience, will such a strategy increase performance in my kernel?
And yes, I know, I should be performing some benchmarking by myself; I just wanted to ask someone with lots of experience in OpenCL before actually attempting this on my own.
Most OpenCL drivers (that I'm familiar with, at least) support the use of #pragma unroll to unroll loops at compile time. Simply use it like so:
#pragma unroll
for (int i = 0; i < 4; i++) {
/* ... */
}
It's effectively the same as unrolling it manually, with none of the effort. In your case, this would probably look more like:
if (pointCount == 1) {
/* ... */
} else if (pointCount == 2) {
#pragma unroll
for (int i = 0; i < 2; i++) { /* ... */ }
} else if (pointCount == 3) {
#pragma unroll
for (int i = 0; i < 3; i++) { /* ... */ }
}
I can't say for certain whether there will be an improvement, but there's one way to find out. If pointCount is constant for the local work group for example, it might improve performance, but if it's completely variable, this might actually make things worse.
You can read more about it here.
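Because point_count never exceeds 8, another option is to loop to the fixed maximum and mask out the unused iterations, which gives the compiler a constant trip count to unroll. Below is a minimal sketch of that idea in plain C, not the asker's kernel: MAX_POINTS, nearest_value and the simplified minimum search are assumptions made for illustration.

```c
#include <math.h>

#define MAX_POINTS 8  /* assumed upper bound on count, as stated in the question */

/* Hypothetical helper: returns the smallest of the first `count` values,
   or INFINITY when count is 0. */
float nearest_value(const float *values, int count)
{
    float best = INFINITY;
    /* Constant trip count: easy for a compiler to unroll fully. */
    for (int i = 0; i < MAX_POINTS; i++) {
        int active = i < count;
        /* Inactive slots reuse `best`, so values[i] is never read out of range. */
        float candidate = active ? values[i] : best;
        best = candidate < best ? candidate : best;
    }
    return best;
}
```

Whether this beats the variable-length loop on a given device still has to be measured.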

Qt optimization of a QByteArray conversion

I wrote a function to convert a hexadecimal string representation (like x00) of some binary data to the data itself.
How to improve this code?
QByteArray restoreData(const QByteArray &data, const QString prepender = "x")
{
QByteArray restoredData = data;
return QByteArray::fromHex(restoredData.replace(prepender, ""));
}
How to improve this code?
Benchmark before optimizing this. Do not do premature optimization.
Beyond the main point: Why would you like to optimize it?
1) If you are really that concerned about performance where this negligible code from performance point of view matters, you would not use Qt in the first place because Qt is inherently slow compared to a well-optimized framework.
2) If you are not that concerned about performance, then you should keep the readability and maintenance in mind as leading principle, in which case your code is fine.
You have not shown any real-world example of why exactly you want to optimize either. This feels like an academic question without much practical use to me. It would be interesting to know more about the motivation.
That being said, several improvement items, which are also optimization, could be done in your code, but then again: it is not done for optimization, but more like logical reasons.
1) Prepender is bad name; it is usually called "prefix" in the English language.
2) You wish to use QChar as opposed to QString for a character.
3) Note that C++ has no empty character literal, so '' is not an option for the replacement; it has to remain the empty string "".
4) I would pass classes like that with reference as opposed to value semantics even if it is CoW (implicitly shared).
5) I would not even use an argument here for the prefix since it is always the same, so it does not really fit the definition of variable.
6) It is needless to create an interim variable explicitly.
7) Make the function inline.
Therefore, you would be writing something like this:
QByteArray restoreData(QByteArray data)
{
    return QByteArray::fromHex(data.replace('x', ""));
}
Your code has a performance problem because of replace(). Replace itself is not very fast, and creating intermediate QByteArray object slows the code down even more. If you are really concerned about performance, you can copy QByteArray::fromHex implementation from Qt sources and modify it for your needs. Luckily, its implementation is quite self-contained. I only changed / 2 to / 3 and added --i line to skip "x" characters.
QByteArray myFromHex(const QByteArray &hexEncoded)
{
QByteArray res((hexEncoded.size() + 1)/ 3, Qt::Uninitialized);
uchar *result = (uchar *)res.data() + res.size();
bool odd_digit = true;
for (int i = hexEncoded.size() - 1; i >= 0; --i) {
int ch = hexEncoded.at(i);
int tmp;
if (ch >= '0' && ch <= '9')
tmp = ch - '0';
else if (ch >= 'a' && ch <= 'f')
tmp = ch - 'a' + 10;
else if (ch >= 'A' && ch <= 'F')
tmp = ch - 'A' + 10;
else
continue;
if (odd_digit) {
--result;
*result = tmp;
odd_digit = false;
} else {
*result |= tmp << 4;
odd_digit = true;
--i;
}
}
res.remove(0, result - (const uchar *)res.constData());
return res;
}
Test:
qDebug() << QByteArray::fromHex("54455354"); // => "TEST"
qDebug() << myFromHex("x54x45x53x54"); // => "TEST"
This code can behave unexpectedly when hexEncoded is malformed (e.g. "x54x45x5" will be converted to "TU"). You can fix this if it's a problem.
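One way to handle malformed input is to decode front to back and reject anything that is not exactly an x followed by two hex digits. The following is a minimal sketch in plain C rather than Qt; decode_prefixed_hex and hex_digit are made-up names, and returning -1 on error is an arbitrary choice for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: value of one hex digit, or -1 if c is not a hex digit. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: decodes text such as "x54x45" into out.
   Returns the number of bytes written, or -1 if the input is malformed
   or does not fit into out_cap bytes. */
int decode_prefixed_hex(const char *text, unsigned char *out, size_t out_cap)
{
    size_t n = 0;
    while (*text != '\0') {
        if (text[0] != 'x')
            return -1;
        int hi = hex_digit(text[1]);   /* a terminating '\0' yields -1 here */
        if (hi < 0)
            return -1;
        int lo = hex_digit(text[2]);   /* only read after text[1] was valid */
        if (lo < 0)
            return -1;
        if (n == out_cap)
            return -1;
        out[n++] = (unsigned char)((hi << 4) | lo);
        text += 3;                     /* skip "xHH" */
    }
    return (int)n;
}
```

With this layout the truncated input "x54x45x5" is rejected instead of being decoded to "TU".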

Potential floating point issue with cosine acceleration curve

I am using a cosine curve to apply a force on an object between the range [0, pi]. By my calculations, that should give me a sine curve for the velocity which, at t=pi/2 should have a velocity of 1.0f
However, for the simplest of examples, I get a top speed of 0.753.
Now if this is a floating point issue, that is fine, but that is a very significant error so I am having trouble accepting that it is (and if it is, why is there such a huge error computing these values).
Some code:
// the function that gives the force to apply (totalTime = pi, maxForce = 1.0 in this example)
return ((Mathf.Cos(time * (Mathf.PI / totalTime)) * maxForce));
// the engine stores this value and in the next fixed update applies it to the rigidbody
// the mass is 1 so isn't affecting the result
engine.ApplyAccelerateForce(applyingForce * ship.rigidbody2D.mass);
Update
There is no gravity being applied to the object, no other objects in the world for it to interact with and no drag. I'm also using a RigidBody2D so the object is only moving on the plane.
Update 2
Ok have tried a super simple example and I get the result I am expecting so there must be something in my code. Will update once I have isolated what is different.
For the record, super simple code:
float forceThisFrame;
float startTime;
// Use this for initialization
void Start () {
forceThisFrame = 0.0f;
startTime = Time.fixedTime;
}
// Update is called once per frame
void Update () {
float time = Time.fixedTime - startTime;
if(time <= Mathf.PI)
{
forceThisFrame = Mathf.Cos (time);
if(time >= (Mathf.PI /2.0f)- 0.01f && time <= (Mathf.PI /2.0f) + 0.01f)
{
print ("Speed: " + rigidbody2D.velocity);
}
}
else
{
forceThisFrame = 0.0f;
}
}
void FixedUpdate()
{
rigidbody2D.AddForce(forceThisFrame * Vector2.up);
}
Update 3
I have changed my original code to match the above example as near as I can (remaining differences listed below) and I still get the discrepancy.
Here are my results of velocity against time. Neither of them makes sense to me: with a constant force of 1N, the result should be a linear velocity function v(t) = t, but that isn't quite what is produced by either example.
Remaining differences:
The code that is "calculating" the force (now just returning 1) is being run via a non-unity DLL, though the code itself resides within a Unity DLL (can explain more but can't believe this is relevant!)
The behaviour that is applying the force to the rigid body is a separate behaviour.
One is moving a cube in an empty environment, the other is moving a Model3D and there is a plane nearby - tried a cube with same code in broken project, same problem
Other than that, I can't see any difference and I certainly can't see why any of those things would affect it. They both apply a force of 1 on an object every fixed update.
For the cosine case this isn't a floating point issue, per se, it's an integration issue.
[In your 'fixed' acceleration case there are clearly also minor floating point issues].
Obviously acceleration is proportional to force (F = ma) but you can't just simply add the acceleration to get the velocity, especially if the time interval between frames is not constant.
Simplifying things by assuming that the inter-frame acceleration is constant, and therefore following v = u + at (or alternately ∂v = a.∂t) you need to scale the effect of the acceleration in proportion to the time elapsed since the last frame. It follows that the smaller ∂t is, the more accurate your integration.
This was a multi-part problem that started with me not fully understanding Update vs. FixedUpdate in Unity, see this question on GameDev.SE for more info on that part.
My "fix" from that was advancing a timer in step with the fixed update so as not to apply the force at the wrong time. The remaining problem, as demonstrated by Eric Postpischil, was that FixedUpdate, despite its name, is not called every 0.02s but at most every 0.02s. The fix was to scale the force applied in my Update to accommodate missed fixed updates. My code ended up looking something like:
Called From Update
float oldTime = time;
time = Time.fixedTime - startTime;
float variableFixedDeltaTime = time - oldTime;
float fixedRatio = variableFixedDeltaTime / Time.fixedDeltaTime;
if(time <= totalTime)
{
applyingForce = forceFunction.GetValue(time) * fixedRatio;
Vector2 currentVelocity = ship.rigidbody2D.velocity;
Vector2 direction = new Vector2(ship.transform.right.x, ship.transform.right.y);
float velocityAlongDir = Vector2.Dot(currentVelocity, direction);
float velocityPrediction = velocityAlongDir + (applyingForce * lDeltaTime);
if(time > 0.0f && // we are not interested if we are just starting
((velocityPrediction < 0.0f && velocityAlongDir > 0.0f ) ||
(velocityPrediction > 0.0f && velocityAlongDir < 0.0f ) ))
{
float ratio = Mathf.Abs((velocityAlongDir / (applyingForce * lDeltaTime)));
applyingForce = applyingForce * ratio;
// We have reversed the direction so we must have arrived
Deactivate();
}
engine.ApplyAccelerateForce(applyingForce);
}
Where ApplyAccelerateForce does:
public void ApplyAccelerateForce(float requestedForce)
{
forceToApply += requestedForce;
}
Called from FixedUpdate
rigidbody2D.AddForce(forceToApply * new Vector2(transform.right.x, transform.right.y));
forceToApply = 0.0f;

Recursive Stackoverflow Error

Alright, so I'm trying to make a Java program to solve a picross board, but I keep getting a Stackoverflow error. I'm currently just teaching myself a little Java, and so I like to use the things I know rather than finding a solution online, although my way is obviously not as efficient. The only way I could think of solving this was through a type of brute force, trying every possibility. The thing is, I know that this function works because it works for smaller sized boards, the only problem is that with larger boards, I tend to get errors before the function finishes.
so char[][] a is just the game board with all the X's and O's. int[][] b is an array with the numbers assigned for the picross board like the numbers on the top and to the left of the game. isDone() just checks if the board matches up with the given numbers, and shift() shifts one column down. I didn't want to paste my entire program, so if you need more information, let me know. Thanks!
I added the code for shift since someone asked. Shift just moves all the chars in one column up one cell.
Update: I'm thinking that maybe my code isn't spinning through every combination, and so it skips over the correct answer. Can anyone verify if this is actually trying every possible combination? Because that would explain why I'm getting stack overflow errors. On the other hand though, how many iterations can this go through before it's too much?
public static void shifter(char[][] a, int[][] b, int[] clockwork)
{
boolean correct = true;
correct = isDone(a, b);
if(correct)
return;
clockwork[a[0].length - 1]++;
for(int x = a[0].length - 1; x > 0; x--)
{
if(clockwork[x] > a.length)
{
shift(a, x - 1);
clockwork[x - 1]++;
clockwork[x] = 1;
}
correct = isDone(a, b);
if(correct)
return;
}
shift(a, a[0].length - 1);
correct = isDone(a, b);
if(correct)
return;
shifter(a, b, clockwork);
return;
}
public static char[][] shift(char[][] a, int y)
{
char temp = a[0][y];
for(int shifter = 0; shifter < a.length - 1; shifter++)
{
a[shifter][y] = a[shifter + 1][y];
}
a[a.length - 1][y] = temp;
return a;
}
Check the recursive call and give it a termination condition:
if (terminate condition)
{
    return;
}
else
{
    call shifter()
}
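In shifter the recursive call is the last statement and happens once per attempted configuration, so the stack depth grows with the number of attempts; the JVM does not eliminate such tail calls. A loop keeps the depth constant. The following is a minimal sketch in C of an odometer-style enumeration without recursion; next_combination and count_combinations are made-up names, and the board check is only indicated by a comment. The same loop structure carries over to Java directly.

```c
#include <stddef.h>

/* Hypothetical helper: advances counter[0..n-1] like an odometer in the given base.
   Returns 1 if a new combination was produced, 0 once all have been visited. */
int next_combination(int *counter, size_t n, int base)
{
    for (size_t i = n; i-- > 0; ) {
        if (++counter[i] < base)
            return 1;
        counter[i] = 0;   /* this digit wrapped, carry into the next one */
    }
    return 0;
}

/* Visits every combination in a loop, so the stack depth stays constant. */
unsigned long count_combinations(size_t n, int base)
{
    int counter[16] = {0};   /* assumes n <= 16 for this sketch */
    unsigned long visited = 0;
    do {
        visited++;           /* a real solver would test the board here */
    } while (next_combination(counter, n, base));
    return visited;
}
```

The number of combinations grows as base to the power n, so even without a stack overflow a brute-force search can become too slow for large boards.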
