Eigen::SelfAdjointView::rankUpdate slower than A += w*w.transpose()

I tested the speed of Eigen::SelfAdjointView::rankUpdate with Eigen::Matrix4d,
comparing it to the naive A += w*w.transpose(),
and it was 2 times slower.
What am I doing wrong?
Can I speed up this computation?

For small fixed-size expressions you can't save anything with SelfAdjointView::rankUpdate. Rather, it adds overhead because it needs to make sure that only elements of one half are modified. In your case a simple
A.noalias() += w*w.adjoint();
should give near optimal code (adding the .noalias() avoids a copy into a temporary).
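To make the trade-off concrete, here is a minimal plain C++ sketch of the two kinds of update. It does not use Eigen, and the type aliases and function names are made up for illustration; it is not how Eigen implements either operation. It only shows that the triangular variant touches 10 of the 16 entries of a 4x4 matrix, so there is very little arithmetic to save, while the loop bounds become less regular.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for a fixed-size 4-vector and 4x4 matrix.
using Vec4 = std::array<double, 4>;
using Mat4 = std::array<std::array<double, 4>, 4>;

// Full rank-1 update: A += w * w^T, touching all 16 entries.
inline void fullRankOneUpdate(Mat4& A, const Vec4& w) {
    for (std::size_t i = 0; i < 4; ++i)
        for (std::size_t j = 0; j < 4; ++j)
            A[i][j] += w[i] * w[j];
}

// Triangular rank-1 update: only the lower half (j <= i) is modified,
// which is the guarantee a self-adjoint view has to maintain.
inline void lowerRankOneUpdate(Mat4& A, const Vec4& w) {
    for (std::size_t i = 0; i < 4; ++i)
        for (std::size_t j = 0; j <= i; ++j)
            A[i][j] += w[i] * w[j];
}
```

Both functions produce the same lower triangle; the second simply leaves the strictly upper part untouched.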

Related

Rf_allocVector only allocates and does not zero out memory

The original motivation is that I have a dynamically sized array of floats that I want to pass to R through Rcpp without incurring the cost of either zeroing it out or making a deep copy.
Originally I had thought that there might be some way to take a heap-allocated array, make R's GC aware of it, and then wrap it with other data to create an "Rcpp::NumericVector", but it seems that's not possible, or at least not doable with my current knowledge.
However, and correct me if I'm wrong, it looks like simply constructing a NumericVector with a size N and then using it as an N-sized allocation will call R.h's Rf_allocVector, which does not zero out the allocated array. I tested it in a small C program that gets dyn.loaded into R, and the contents look like garbage values. I also took a peek at the assembly and there doesn't seem to be any zeroing out.
Can anyone confirm this or offer an alternative solution?

Welcome to StackOverflow.
You marked this rcpp but that is a function from the C API of R -- whereas the Rcpp API offers you its constructors which do in fact set the memory to zero:
> Rcpp::cppFunction("NumericVector goodVec(int n) { return NumericVector(n); }")
> sum(goodVec(1e7))
[1] 0
>
This creates a dynamically allocated vector using R's memory functions. The vector is indistinguishable from R's own. And it has the memory set to zero, as we use R_Calloc, which is documented in Writing R Extensions as setting the memory to zero. (We may also use memcpy() explicitly; you can check the sources.)
So in short, you just have yourself confused over what the C API of R, as well as Rcpp offer, and what is easiest to use when. Keep reading documentation, running and writing examples, and studying existing code. It's all out there!

How do I trim a slice and get the indices (indexes) of the result?

This seems like a simple problem:
let slice = " some wacky text. ";
let trimmed = slice.trim();
// how do I get the index of the start and end within the original slice?
Attempt 1
Look for an alternative API. trim wraps trim_matches, which deals with indices internally anyway, so let's copy this code! But this uses std::str::pattern::Pattern, which is unstable and thus can't be used outside std in stable Rust.
Attempt 2
Just use trim and calculate the slice indices from the pointers. There's a nice as_ptr_range method, but it's also unstable; luckily, as the PR says, there's an easy workaround.
let slice_ptr = slice.as_ptr();
let trimmed_ptr = trimmed.as_ptr();
// don't bother about the end (we can use trimmed.len())
Now that we've got some pointers, we need their difference. sub is not the right method for this. offset_from is, but it's unstable (as noted in the design, its only valid use is to compare two pointers into the same slice, which is exactly what we want to do; unfortunately, it's yet another thing delayed by the details).
Now, there are hackier ways of solving this problem. We could transmute the pointers to usize (we know the element size is 1 byte, so no need to multiply). But this is most likely the Undefined Behaviour type of unsafe, so let's not go there.
Attempt 3
Edit: the source problem is easy to solve directly, so probably the answer in this case is roll-my-own. Possibly I should just close this.

OpenCL clEnqueueCopyImageToBuffer with stride

I have an OpenCL buffer containing a 2D image.
This image has a stride bigger than its width.
I need to make an OpenCL image from this buffer.
The problem is that the function clEnqueueCopyImageToBuffer does not take a stride as an input parameter.
Is it possible to make an OpenCL image from an OpenCL buffer (with a stride bigger than the width) with only one copy, or faster?
One way to solve this problem is to write my own kernel, but maybe there is a neater solution?

Unfortunately, there is no method in the OpenCL specification which allows you to directly create an image from a buffer when the buffer data has a stride not equal to the image width. The most efficient solution would probably be to write your own kernel to do this.
The simplest solution that doesn't involve writing your own kernel would be to copy one line at a time with clEnqueueCopyBufferToImage. If your image is big enough, it might be that the performance of this technique would be reasonably comparable to the hand-written kernel, but you would have to try it out to see.
I didn't include the clEnqueueCopyBufferRect approach in my original answer because my first instinct was that the extra copy would kill performance. However, the comments above got me thinking about it further, and I was interested enough to implement all three approaches to see what the performance was actually like.
As I suspected, the fastest approach was to implement a kernel to do this directly. However, copying the data over line by line was significantly slower than I had anticipated. Copying the buffer into an intermediate buffer with clEnqueueCopyBufferRect is actually a pretty good compromise between performance and simplicity, although it is still a couple of times slower than the kernel implementation.
The source code for this little experiment can be found here. I was copying a 1020x1020 image with a stride of 1024, and the timings are averaged over 8 runs.
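As a rough CPU-side analogy of what all three approaches have to accomplish, the sketch below repacks a strided source into a tightly packed destination, one row at a time. It is plain C++ rather than OpenCL, the function name packRows and its parameters are invented for illustration, and it says nothing about the relative device-side performance reported above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Copies `height` rows of `widthBytes` bytes each from a source whose rows
// start every `strideBytes` bytes into a tightly packed destination.
// Hypothetical helper, host memory only.
inline std::vector<unsigned char> packRows(const std::vector<unsigned char>& src,
                                           std::size_t widthBytes,
                                           std::size_t height,
                                           std::size_t strideBytes) {
    assert(strideBytes >= widthBytes);
    assert(height == 0 || src.size() >= (height - 1) * strideBytes + widthBytes);

    std::vector<unsigned char> dst(widthBytes * height);
    if (widthBytes == 0) return dst;

    for (std::size_t row = 0; row < height; ++row) {
        // The padding bytes at the end of each source row are skipped.
        std::memcpy(dst.data() + row * widthBytes,
                    src.data() + row * strideBytes,
                    widthBytes);
    }
    return dst;
}
```

For the experiment described above, the row width would be that of 1020 pixels and the stride that of 1024 pixels, each multiplied by the pixel size in bytes.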

how can i make this equation faster

Right now I have a bottleneck in the program I'm trying to write. I am trying to use a pinch gesture to control the scale of a UIImage. It's the calculation of the scale that is causing the program to slow down and become choppy. Below is the equation.
currentScale = (currentDistance / initialDistance) * scaleMod;
scaleMod is whatever the current scale was when the user took their fingers off the screen, so the next time the user does a pinch, the old scale is essentially the starting point of the new scaling action.

1) Can't you calculate scaleMod / initialDistance once, rather than every time currentDistance changes? That way you only have to multiply that value by currentDistance, which removes a divide.
2) Make sure that this is actually the bottleneck. It most likely isn't, unless you're doing something really wrong.

For any type of the three vars, this calculation can easily be done millions of times per second with little performance impact. Your problem is elsewhere.

If you fix the scaleMod and initialDistance to powers of 2 you could use shifts for faster multiplication and division.
See here for reference.

You could store scaleMod / initialDistance. When scaling is active (the user's fingers are still on the screen), multiply that value by currentDistance as needed.
Once the user has finished pinching, store the new scaleMod / initialDistance value for the next time pinching happens.
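A minimal sketch of that idea follows, written in C++ rather than the asker's Objective-C. PinchScaler and its members are hypothetical names, not part of any UIKit API. The division happens once when a pinch begins, and each subsequent update is a single multiply.

```cpp
#include <cassert>

// Hypothetical helper that caches scaleMod / initialDistance per pinch.
struct PinchScaler {
    double factor = 1.0;        // scaleMod / initialDistance, computed once per pinch
    double currentScale = 1.0;  // result of the most recent update

    // Call when a pinch starts. Returns false (and changes nothing) if
    // initialDistance is not positive, which avoids dividing by zero.
    bool begin(double initialDistance, double scaleMod) {
        if (initialDistance <= 0.0) return false;
        factor = scaleMod / initialDistance;
        return true;
    }

    // Call whenever the finger distance changes: one multiply, no divide.
    double update(double currentDistance) {
        currentScale = currentDistance * factor;
        return currentScale;
    }
};
```

When the pinch ends, currentScale would be passed as scaleMod to the next begin call, so the old scale becomes the starting point of the new gesture.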

If you are doing the computation with int (or another integer type), see if you can do it using float precision. Floating-point divide is faster than integer divide (fewer bits to divide, assuming your CPU has a floating-point unit).
Also, try to factor out division as multiplication by a reciprocal.

Check that initialDistance != 0 first! :)

What is the fastest way to draw an Image on another image?

I have 3 Bitmap pointers.
Bitmap* totalCanvas = new Bitmap(400, 300, PixelFormat32bppARGB); // final canvas
Bitmap* bottomLayer = new Bitmap(400, 300,PixelFormat32bppARGB); // background
Bitmap* topLayer = new Bitmap(XXX); // always changed.
I will draw complex background on bottomLayer. I don't want to redraw complex background on totalCanvas again and again, so I stored it in bottomLayer.
topLayer changes frequently.
I want to draw bottomLayer to totalCanvas. Which is the fastest way?
Graphics canvas(totalCanvas);
canvas.DrawImage(bottomLayer, 0, 0); // step 1
canvas.DrawImage(topLayer ,XXXXX); // step 2
I want step1 to be as fast as possible. Can anyone give me some sample?
Thanks very much!
Thanks for unwind's answer. I wrote the following code:
Graphics canvas(totalCanvas);
for (int i = 0; i < 100; ++i)
{
canvas.DrawImage(bottomLayer, 0,0);
}
this part takes 968ms... it is too slow...

Almost all GDI+ operations should be implemented by the driver to run as much as possible on the GPU. This should mean that a simple 2D bitmap copy operation is going to be "fast enough", for even quite large values of "enough".
My recommendation is the obvious one: don't sweat it by spending time hunting for a "fastest" way of doing this. You have formulated the problem very clearly, so just try implementing it that clearly, by doing it as you've outlined in the question. Then you can of course go ahead and benchmark it and decide to continue the hunt.
A simple illustration:
A 32 bpp 400x300 bitmap is about 469 KB in size. According to this handy table, an Nvidia GeForce 4 MX from 2002 has a theoretical memory bandwidth of 2.6 GB/s. Assuming the copy is done in pure "overwrite" mode, i.e. no blending of the existing surface (which sounds right, as your copy is basically a way of "clearing" the frame to the copy's source data), and an overhead factor of four just to be safe, we get:
(2.6 * 2^30 / (4 * 469 * 2^10)) = 1453
This means your copy should run at 1453 FPS, which I happily assume to be "good enough".
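The back-of-the-envelope estimate above can be reproduced with a few lines of C++. The function names are made up for illustration, and the result is only the theoretical bound from the stated bandwidth figure, not a measurement.

```cpp
#include <cassert>

// Size of one frame in KiB for the given dimensions and bytes per pixel.
inline double frameSizeKiB(int width, int height, int bytesPerPixel) {
    return static_cast<double>(width) * height * bytesPerPixel / 1024.0;
}

// Theoretical full-frame copies per second, given memory bandwidth in GiB/s,
// frame size in KiB, and a safety factor for overhead.
inline double estimatedCopiesPerSecond(double bandwidthGiBPerSec,
                                       double frameKiB,
                                       double overheadFactor) {
    assert(frameKiB > 0.0 && overheadFactor > 0.0);
    const double bytesPerSecond = bandwidthGiBPerSec * 1024.0 * 1024.0 * 1024.0;
    const double bytesPerFrame = frameKiB * 1024.0;
    return bytesPerSecond / (overheadFactor * bytesPerFrame);
}
```

Calling frameSizeKiB(400, 300, 4) gives 468.75, which the answer rounds to 469, and estimatedCopiesPerSecond(2.6, 469.0, 4.0) gives roughly 1453.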

If at all possible (and it looks like it from your code), using DrawImageUnscaled will be significantly faster than DrawImage. Or if you are using the same image over and over again, create a TextureBrush and use that.
The problem with GDI+ is that, for the most part, it is unaccelerated. To get lightning-fast drawing speeds you really need GDI and BitBlt, which is a serious pain in the butt to use with GDI+, especially if you are in managed code (hard to tell if you are using managed C++ or straight C++).
See this post for more information about fast graphics in .NET.
