I am trying to use Flex Profiler to improve the application performance (loading time, etc). I have seen the profiler results for the current desgn. I want to compare these results with a new design for the same set of data. Is there some direct way to do it? I don't know any way to save the current profiling results in history and compare it later with the results of a new design.
Otherwise I have to do it manually, write the two results in a notepad and then compare it.
Thanks in advance.
Your stated goal is to improve aspects of the application performance (loading time, etc.) I have similar issues in other languages (C#, C++, C, etc.) I suggest that you focus not so much on the timing measurements that the Flex profiler gives you, but rather use it to extract a small number of samples of the call stack while it is being slow. Don't deal in summaries, but rather examine those stack samples closely. This may bend your mind a little bit, because it will not give you particularly precise time measurements. What it will tell you is which lines of code you need to focus on to get your speedup, and it will give you a very rough idea of how much speedup you can expect. To get the exact amount of speedup, you can time it afterward. (I just use a stopwatch. If I'm getting the load time down from 2 minutes to 10 seconds, timing it is not a high-tech problem.)
(If you are wondering how/why this works, it works because the reason for the program being slower than it's going to be is that it's requesting work to be done, mostly by method calls, that you are going to avoid executing so much. For the amount of time being spent in those method calls, they are sitting exposed on the stack, where you can easily see them. For example, if there is a line of code that is costing you 60% of the time, and you take 5 stack samples, it will appear on 3 samples, plus or minus 1, roughly, regardless of whether it is executed once or a million times. So any such line that shows up on multiple stacks is a possible target for optimization, and targets for optimization will appear on multiple stack samples if you take enough.
The hard part about this is learning not to be distracted by all the profiling results that are irrelevant. Milliseconds, average or total, for methods, are irrelevant. Invocation counts are irrelevant. "Self time" is irrelevant. The call graph is irrelevant. Some packages worry about recursion - it's irrelevant. CPU-bound vs. I/O bound - irrelevant. What is relevant is the fraction of stack samples that individual lines of code appear on.)
ADDED: If you do this, you'll notice a "magnification effect". Suppose you have two independent performance problems, A and B, where A costs 50% and B costs 25%. If you fix A, total time drops by 50%, so now B takes 50% of the remaining time and is easier to find. On the other hand, if you happen to fix B first, time drops by 25%, so A is magnified to 67%. Any problem you fix makes the others appear bigger, so you can keep going until you just can't squeeze it any more.
Related
I am working on my bachelor's final project, which is about the comparison between Apache Spark Streaming and Apache Flink (only streaming) and I have just arrived to "Physical partitioning" in Flink's documentation. The matter is that in this documentation it doesn't explain well how this two transformations work. Directly from the documentation:
shuffle(): Partitions elements randomly according to a uniform distribution.
rebalance(): Partitions elements round-robin, creating equal load per partition. Useful for performance optimisation in the presence of data skew.
Source: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/datastream_api.html#physical-partitioning
Both are automatically done, so what I understand is that they both redistribute equally (shuffle() > uniform distribution & rebalance() > round-robin) and randomly the data. Then I deduce that rebalance() distributes the data in a better way ("equal load per partitions") so the tasks have to process the same amount of data, but shuffle() may create bigger and smaller partitions. Then, in which cases might you prefer to use shuffle() than rebalance()?
The only thing that comes to my mind is that probably rebalance()requires some processing time so in some cases it might use more time to do the rebalancing than the time it will improve in the future transformations.
I have been looking for this and nobody has talked about this, only in a mailing list of Flink, but they don't explain how shuffle() works.
Thanks to Sneftel who has helped me to improve my question asking me things to let me rethink about what I wanted to ask; and to Till who answered quite well my question. :D
As the documentation states, shuffle will randomly distribute the data whereas rebalance will distribute the data in a round robin fashion. The latter is more efficient since you don't have to compute a random number. Moreover, depending on the randomness, you might end up with some kind of not so uniform distribution.
On the other hand, rebalance will always start sending the first element to the first channel. Thus, if you have only few elements (fewer elements than subtasks), then only some of the subtasks will receive elements, because you always start to send the first element to the first subtask. In the streaming case this should eventually not matter because you usually have an unbounded input stream.
The actual reason why both methods exist is a historically reason. shuffle was introduced first. In order to make the batch an streaming API more similar, rebalance was then introduced.
This statement by Flink is misleading:
Useful for performance optimisation in the presence of data skew.
Since it's used to describe rebalance, but not shuffle, it suggests it's the distinguishing factor. My understanding of it was that if some items are slow to process and some fast, the partitioner will use the next free channel to send the item to. But this is not the case, compare the code for rebalance and shuffle. The rebalance just adds to next channel regardless how busy it is.
// rebalance
nextChannelToSendTo = (nextChannelToSendTo + 1) % numberOfChannels;
// shuffle
nextChannelToSendTo = random.nextInt(numberOfChannels);
The statement can be also understood differently: the "load" doesn't mean actual processing time, just the number of items. If your original partitioning has skew (vastly different number of items in partitions), the operation will assign items to partitions uniformly. However in this case it applies to both operations.
My conclusion: shuffle and rebalance do the same thing, but rebalance does it slightly more efficiently. But the difference is so small that it's unlikely that you'll notice it, java.util.Random can generate 70m random numbers in a single thread on my machine.
I'm working on a game (using Game Maker: Studio Professional v1.99.355) that needs to have both user-modifiable level geometry and AI pathfinding based on platformer physics. Because of this, I need a way to dynamically figure out which platforms can be reached from which other platforms in order to build a node graph I can feed to A*.
My current approach is, more or less, this:
For each platform consider each other platform in the level.
For each of those platforms, if it is obviously unreachable (due to being higher than the maximum jump height, for example) do not form a link and move on to next platform.
If a link seems possible, place an ai_character instance on the starting platform and (within the current step event) simulate a jump attempt.
3.a Repeat this jump attempt for each possible starting position on the starting platform.
If this attempt is successful, record the data necessary to replicate it in real time and move on to the next platform.
If not, do not form a link.
Repeat for all platforms.
This approach works, more or less, and produces a link structure that when visualised looks like this:
linked platforms (Hyperlink because no rep.)
In this example the mostly-concealed pink ghost in the lower right corner is trying to reach the black and white box. The light blue rectangles are just there to highlight where recognised platforms are, the actual platforms are the rows of grey boxes. Link lines are green at the origin and red at the destination.
The huge, glaring problem with this approach is that for a level of only 17 platforms (as shown above) it takes over a second to generate the node graph. The reason for this is obvious, the yellow text in the screen centre shows us how long it took to build the graph: over 24,000(!) simulated frames, each with attendant collision checks against every block - I literally just run the character's step event in a while loop so everything it would normally do to handle platformer movement in a frame it now does 24,000 times.
This is, clearly, unacceptable. If it scales this badly at a mere 17 platforms then it'll be a joke at the hundreds I need to support. Heck, at this geometric time cost it might take years.
In an effort to speed things up, I've focused on the other important debugging number, the tests counter: 239. If I simply tried every possible combination of starting and destination platforms, I would need to run 17 * 16 = 272 tests. By figuring out various ways to predict whether a jump is impossible I have managed to lower the number of expensive tests run by a whopping 33 (12%!). However the more exceptions and special cases I add to the code the more convinced I am that the actual problem is in the jump simulation code, which brings me at long last to my question:
How would you determine, with complete reliability, whether it is possible for a character to jump from one platform to another, preferably without needing to simulate the whole jump?
My specific platform physics:
Jumps are fixed height, unless you hit a ceiling.
Horizontal movement has no acceleration or inertia.
Horizontal air control is allowed.
Further info:
I found this video, which describes a similar problem but which doesn't provide a good solution. This is literally the only resource I've found.
You could limit the amount of comparisons by only comparing nearby platforms. I would probably only check the horizontal distance between platforms, and if it is wider than the longest jump possible, then don't bother checking for a link between those two. But you might have done this since you checked for the max height of a jump.
I glanced at the video and it gave me an idea. Instead of looking at all platforms to find which jumps are impossible, what if you did the opposite? Try placing an AI character on all platforms and see which other platforms they can reach. That's certainly easier to implement if your enemies can't change direction in midair though. Oh well, brainstorming is the key to finding something.
Several ideas you could try out:
Limit the amount of comparisons you need to make by using a spatial data structure, like a quad tree. This would allow you to severely limit how many platforms you're even trying to check. This is mostly the same as what you're currently doing, but a bit more generic.
Try to pre-compute some jump trajectories ahead of time. This will not catch all use cases that you have - as you allow for full horizontal control - but might allow you to catch some common cases more quickly
Consider some kind of walkability grid instead of a link generation scheme. When geometry is modified, compute which parts of the level are walkable and which are not, with some resolution (something similar to the dimensions of your agent might be good starting point). You could also filter them with a height, so that grid tiles that are higher than your jump height, and you can't drop from a higher place on to them, are marked as unwalkable. Then, when you compute your pathfinding, as part of your pathfinding step you can compute when you start a jump, if a path is actually executable ('start a jump, I can go vertically no more than 5 tiles, and after the peak of the jump, i always fall down vertically with some speed).
On our HPC cluster, one of the users runs mpiblast jobs on upward of 30 cores. These will typically end up on about 10 different nodes, the nodes normally being shared between users. Although these jobs occasionally scale fairly well and can effectively use about 90% of the cores available, often scaling is very bad with jobs only accumulating CPU-time corresponding to around 10% of the cores available.
Should mpiblast scale better in general? Does anyone know what factors might lead to poor scaling?
mpiblast should work faster in general but there's no guarantee that the scaling would be better. Few factors are there:
For parallel processing, you need to make sure that the nodes that are being used are not idle/not being used properly. That is one of the main reason for poor scaling!
Also, it depends on the files you are using for BLAST. For instance, there are some parameters in mpiblast, you should go through them first.
But in general, mpiblast should scale well when the nodes are equally being used that means greatly load-balanced :)
Is it possible to estimate the heat generated by an individual process in runtime.
Temperature readings of the processor is easily accessible but what I need is process specific information.
Is it possible to map information such as cpu utilization, io, running time, memory usage etc to get some kind of an estimate?
I'm gonna say no. Because the overall temperature of your system components isn't a simple mathematical equation with everything that's moving and switching either.
Heat generated by and inside a computer is dependent on many external factors like hardware setup, ambient temperature of the room, possibly the age of the components, is there dust on them or in the fans, was the cooling paste correctly applied on the CPU or elsewhere, where heat sinks are present, how is heat being dissipated etc.etc.. In short, again, no.
Additionally, your computer runs a LOT of processes at any given time apart from the ones that you control (and "control" is a relative term). Even if it is possible to access certain sensory data for individual components (like you can see to some extent in the BIOS) then interpolating one single process' generated temperature in regard to the total is, well, impossible.
At the lowest levels (gate networks, control signalling etc.), an external individual no longer has any means to observe or measure what's going on but there as well, things are in a changing state, a variable amount of electricity is being used and thus a variable amount of heat generated.
Pertaining to your second question: that's basically what your task manager does. There are countless examples and articles on the internet on how to get that done in a plethora of programming languages.
That is, unless some of the actually smart people in this merry little community of keytappers and screengazers say that it IS actually possible, at which point I will be thoroughly amazed...
EDIT: Monitoring the processes is a first step in what you're looking for. take a look at How to detect a process start & end using c# in windows? and be sure to follow up on duplicates like the one mentioned by Hans.
You could take a look at PowerTOP or some other tool that monitors power usage. I am not sure how accurate it is across different systems but a power estimation should provide at least some relative information as the heat generated assuming the processes you are comparing are running in similar manners on hardware. In reality there are just too many factors to predict power, much less heat, effectively but you may be able to get an idea of the usage.
After working for a while developing games, I've been exposed to both variable frame rates (where you work out how much time has passed since the last tick and update actor movement accordingly) and fixed frame rates (where you work out how much time has passed and choose either to tick a fixed amount of time or sleep until the next window comes).
Which method works best for specific situations? Please consider:
Catering to different system specifications;
Ease of development/maintenance;
Ease of porting;
Final performance.
I lean towards a variable framerate model, but internally some systems are ticked on a fixed timestep. This is quite easy to do by using a time accumulator. Physics is one system which is best run on a fixed timestep, and ticked multiple times per frame if necessary to avoid a loss in stability and keep the simulation smooth.
A bit of code to demonstrate the use of an accumulator:
const float STEP = 60.f / 1000.f;
float accumulator = 0.f;
void Update(float delta)
{
accumulator += delta;
while(accumulator > STEP)
{
Simulate(STEP);
accumulator -= STEP;
}
}
This is not perfect by any means but presents the basic idea - there are many ways to improve on this model. Obviously there are issues to be sorted out when the input framerate is obscenely slow. However, the big advantage is that no matter how fast or slow the delta is, the simulation is moving at a smooth rate in "player time" - which is where any problems will be perceived by the user.
Generally I don't get into the graphics & audio side of things, but I don't think they are affected as much as Physics, input and network code.
It seems that most 3D developers prefer variable FPS: the Quake, Doom and Unreal engines both scale up and down based on system performance.
At the very least you have to compensate for too fast frame rates (unlike 80's games running in the 90's, way too fast)
Your main loop should be parameterized by the timestep anyhow, and as long as it's not too long, a decent integrator like RK4 should handle the physics smoothly Some types of animation (keyframed sprites) could be a pain to parameterize. Network code will need to be smart as well, to avoid players with faster machines from shooting too many bullets for example, but this kind of throttling will need to be done for latency compensation anyhow (the animation parameterization would help hide network lag too)
The timing code will need to be modified for each platform, but it's a small localized change (though some systems make extremely accurate timing difficult, Windows, Mac, Linux seem ok)
Variable frame rates allow for maximum performance. Fixed frame rates allow for consistent performance but will never reach max on all systems (that's seems to be a show stopper for any serious game)
If you are writing a networked 3D game where performance matters I'd have to say, bite the bullet and implement variable frame rates.
If it's a 2D puzzle game you probably can get away with a fixed frame rate, maybe slightly parameterized for super slow computers and next years models.
One option that I, as a user, would like to see more often is dynamically changing the level of detail (in the broad sense, not just the technical sense) when framerates vary outside of a certian envelope. If you are rendering at 5FPS, then turn off bump-mapping. If you are rendering at 90FPS, increase the bells and whistles a bit, and give the user some prettier images to waste their CPU and GPU with.
If done right, the user should get the best experince out of the game without having to go into the settings screen and tweak themselves, and you should have to worry less, as a level designer, about keeping the polygon count the same across difference scenes.
Of course, I say this as a user of games, and not a serious one at that -- I've never attempted to write a nontrivial game.
The main problem I've encountered with variable length frame times is floating point precision, and variable frame times can surprise you in how they bite you.
If, for example, you're adding the frame time * velocity to a position, and frame time gets very small, and position is largish, your objects can slow down or stop moving because all your delta was lost due to precision. You can compensate for this using a separate error accumulator, but it's a pain.
Having fixed (or at least a lower bound on frame length) frame times allows you to control how much FP error you need to take into account.
My experience is fairly limited to somewhat simple games (developed with SDL and C++) but I have found that it is quite easy just to implement a static frame rate. Are you working with 2d or 3d games? I would assume that more complex 3d environments would benefit more from a variable frame rate and that the difficulty would be greater.