I'm looking to apply the paradigms from my experience in using Redux for state management in web applications into an iOS app that will be written in Swift. The reason full native is being chosen is in part due to the requirement of handling large datasets on the device and needing to render UIs around these large datasets (specifically, looking at large polygons to be rendered on a map - our worst case so far is 24 000 points in memory, and an optimisation of that for rendering on screen).
Given that a functional state management system should not be mutating the existing state, what options are there to limit the overhead for dealing with large datasets that have a cost for deep copying?
While I have some ideas about potentially modifying our data structure away from vectors of numbers to vectors of indexable/id-able structures that can be operated on more granularly, there still remains the issue (as I can see) that at least one large copy operation per array, per update would need to happen.
Are there sensible ways of dealing with this issue in a way that stays true to the idea that you never mutate the state directly?
Related
I read in a stackoverflow post that (link here)
By using predictable (e.g. sequential) IDs for documents, you increase the chance you'll hit hotspots in the backend infrastructure. This decreases the scalability of the write operations.
I would like if anyone could explain better on the limitations that can occur when using sequential or user provided id.
Cloud Firestore scales horizontally by allocated key ranges to machines. As load increases beyond a certain threshold on a single machine, it will split the range being served by it and assign it to 2 machines.
Let's say you just starting writing to Cloud Firestore, which means a single server is currently handling the entire range.
When you are writing new documents with random Ids, when we split the range into 2, each machine will end up with roughly the same load. As load increases, we continue to split into more machines, with each one getting roughly the same load. This scales well.
When you are writing new documents with sequential Ids, if you exceed the write rate a single machine can handle, the system will try to split the range into 2. Unfortunately, one half will get no load, and the other half the full load! This doesn't scale well as you can never get more than a single machine to handle your write load.
In the case where a single machine is running more load than it can optimally handle, we call this "hot spotting". Sequential Ids mean we cannot scale to handle more load. Incidentally, this same concept applies to index entries too, which is why we warn sequential index values such as timestamps of now as well.
So, how much is too much load? We generally say 500 writes/second is what a single machine will handle, although this will naturally vary depending on a lot of factors, such as how big a document you are writing, number of transactions, etc.
With this in mind, you can see that smaller more consistent workloads aren't a problem, but if you want something that scales based on traffic, sequential document ids or index values will naturally limit you to what a single machine in the database can keep up with.
Is it possible to estimate the heat generated by an individual process in runtime.
Temperature readings of the processor is easily accessible but what I need is process specific information.
Is it possible to map information such as cpu utilization, io, running time, memory usage etc to get some kind of an estimate?
I'm gonna say no. Because the overall temperature of your system components isn't a simple mathematical equation with everything that's moving and switching either.
Heat generated by and inside a computer is dependent on many external factors like hardware setup, ambient temperature of the room, possibly the age of the components, is there dust on them or in the fans, was the cooling paste correctly applied on the CPU or elsewhere, where heat sinks are present, how is heat being dissipated etc.etc.. In short, again, no.
Additionally, your computer runs a LOT of processes at any given time apart from the ones that you control (and "control" is a relative term). Even if it is possible to access certain sensory data for individual components (like you can see to some extent in the BIOS) then interpolating one single process' generated temperature in regard to the total is, well, impossible.
At the lowest levels (gate networks, control signalling etc.), an external individual no longer has any means to observe or measure what's going on but there as well, things are in a changing state, a variable amount of electricity is being used and thus a variable amount of heat generated.
Pertaining to your second question: that's basically what your task manager does. There are countless examples and articles on the internet on how to get that done in a plethora of programming languages.
That is, unless some of the actually smart people in this merry little community of keytappers and screengazers say that it IS actually possible, at which point I will be thoroughly amazed...
EDIT: Monitoring the processes is a first step in what you're looking for. take a look at How to detect a process start & end using c# in windows? and be sure to follow up on duplicates like the one mentioned by Hans.
You could take a look at PowerTOP or some other tool that monitors power usage. I am not sure how accurate it is across different systems but a power estimation should provide at least some relative information as the heat generated assuming the processes you are comparing are running in similar manners on hardware. In reality there are just too many factors to predict power, much less heat, effectively but you may be able to get an idea of the usage.
I have a filtering algorithm that needs to be applied recursively and I am not sure if MapReduce is suitable for this job. W/o giving too much away, I can say that each object that is being filtered is characterized by a collection if ordered list or queue.
The data is not huge, just about 250MB when I export from SQL to
CSV.
The mapping step is simple: the head of the list contains an object that can classify the list as belonging to one of N mapping nodes. the filtration algorithm at each node works on the collection of lists assigned to the node and at the end of the filtration, either a list remains the same as before the filtration or the head of the list is removed.
The reduce function is simple too: all the map jobs' lists are brought together and may have to be written back to disk.
When all the N nodes have returned their output, the mapping step is repeated with this new set of data.
Note: N can be as much as 2000 nodes.
Simple, but it requires perhaps up to a 1000 recursions before the algorithm's termination conditions are met.
My question is would this job be suitable for Hadoop? If not, what are my options?
The main strength of Hadoop is its ability to transparently distribute work on a large number of machines. In order to fully benefit from Hadoop your application has to be characterized, at least by the following three things:
work with large amounts of data (data which is distributed in the cluster of machines) - which would be impossible to store on one machine
be data-parallelizable (i.e. chunks of the original data can be manipulated independently from other chunks)
the problem which the application is trying to solve lends itself nicely to the MapReduce (scatter - gather) model.
It seems that out of these 3, your application has only the last 2 characteristics (with the observation that you are trying to recursively use a scatter - gather procedure - which means a large number of jobs - equal to the recursion depth; see last paragraph why this might not be appropriate for hadoop).
Given the amount of data you're trying to process, I don't see any reason why you wouldn't do it on a single machine, completely in memory. If you think you can benefit from processing that small amount of data in parallel, I would recommend focusing on multicore processing than on distributed data intensive processing. Of course, using the processing power of a networked cluster is tempting but this comes at a cost: mainly the time inefficiency given by the network communication (network being the most contended resource in a hadoop cluster) and by the I/O. In scenarios which are well-fitted to the Hadoop framework these inefficiency can be ignored because of the efficiency gained by distributing the data and the associated work on that data.
As I can see, you need 1000 jobs. The setup and the cleanup of all those jobs would be an unnecessary overhead for your scenario. Also, the overhead of network transfer is not necessary, in my opinion.
Recursive algos are hard in the distributed systems since they can lead to a quick starvation. Any middleware that would work for that needs to support distributed continuations, i.e. the ability to make a "recursive" call without holding the resources (like threads) of the calling side.
GridGain is one product that natively supports distributed continuations.
THe litmus test on distributed continuations: try to develop a naive fibonacci implementation in distributed context using recursive calls. Here's the GridGain's example that implements this using continuations.
Hope it helps.
Q&D, but I suggest you read a comparison of MongoDB and Hadoop:
http://www.osintegrators.com/whitepapers/MongoHadoopWP/index.html
Without knowing more, it's hard to tell. You might want to try both. Post your results if you do!
I have an OpenCL kernel that calculates total force on a particle exerted by other particles in the system, and then another one that integrates the particle position/velocity. I would like to parallelize these kernels across multiple GPUs, basically assigning some amount of particles to each GPU. However, I have to run this kernel multiple times, and the result from each GPU is used on every other. Let me explain that a little further:
Say you have particle 0 on GPU 0, and particle 1 on GPU 1. The force on particle 0 is changed, as is the force on particle 1, and then their positions and velocities are changed accordingly by the integrator. Then, these new positions need to be placed on each GPU (both GPUs need to know where both particle 0 and particle 1 are) and these new positions are used to calculate the forces on each particle in the next step, which is used by the integrator, whose results are used to calculate forces, etc, etc. Essentially, all the buffers need to contain the same information by the time the force calculations roll around.
So, the question is: What is the best way to synchronize buffers across GPUs, given that each GPU has a different buffer? They cannot have a single shared buffer if I want to keep parallelism, as per my last question (though, if there is a way to create a shared buffer and still keep multiple GPUs, I'm all for that). I suspect that copying the results each step will cause more slowdown than it's worth to parallelize the algorithm across GPUs.
I did find this thread, but the answer was not very definitive and applied only to a single buffer across all GPUs. I would like to know, specifically, for Nvidia GPUs (more specifically, the Tesla M2090).
EDIT: Actually, as per this thread on the Khronos forums, a representative from the OpenCL working group says that a single buffer on a shared context does indeed get spread across multiple GPUs, with each one making sure that it has the latest info in memory. However, I'm not seeing that behavior on Nvidia GPUs; when I use watch -n .5 nvidia-smi while my program is running in the background, I see one GPU's memory usage go up for a while, and then go down while another GPU's memory usage goes up. Is there anyone out there that can point me in the right direction with this? Maybe it's just their implementation?
It sounds like you are having implementation trouble.
There's a great presentation from SIGGRAPH that shows a few different ways to utilize multiple GPUs with shared memory. The slides are here.
I imagine that, in your current setup, you have a single context containing multiple devices with multiple command queues. This is probably the right way to go, for what you're doing.
Appendix A of the OpenCL 1.2 specification says that:
OpenCL memory objects, [...] are created using a context and can be shared across multiple command-queues created using the same context.
Further:
The application needs to implement appropriate synchronization across threads on the host processor to ensure that the changes to the state of a shared object [...] happen in the correct order [...] when multiple command-queues in multiple threads are making changes to the state of a shared object.
So it would seem to me that your kernel that calculates particle position and velocity needs to depend on your kernel that calculates the inter-particle forces. It sounds like you already know that.
To put things more in terms of your question:
What is the best way to synchronize buffers across GPUs, given that each GPU has a different buffer?
... I think the answer is "don't have the buffers be separate." Use the same cl_mem object between two devices by having that cl_mem object come from the same context.
As for where the data actually lives... as you pointed out, that's implementation-defined (at least as far as I can tell from the spec). You probably shouldn't worry about where the data is living, and just access the data from both command queues.
I realize this could create some serious performance concerns. Implementations will likely evolve and get better, so if you write your code according to the spec now, it'll probably run better in the future.
Another thing you could try in order to get a better (or a least different) buffer-sharing behavior would be to make the particle data a map.
If it's any help, our setup (a bunch of nodes with dual C2070s) seem to share buffers fairly optimally. Sometimes, the data is kept on only one device, other times it might have the data exist in both places.
All in all, I think the answer here is to do it in the best way the spec provides and hope for the best in terms of implementation.
I hope I was helpful,
Ryan
After working for a while developing games, I've been exposed to both variable frame rates (where you work out how much time has passed since the last tick and update actor movement accordingly) and fixed frame rates (where you work out how much time has passed and choose either to tick a fixed amount of time or sleep until the next window comes).
Which method works best for specific situations? Please consider:
Catering to different system specifications;
Ease of development/maintenance;
Ease of porting;
Final performance.
I lean towards a variable framerate model, but internally some systems are ticked on a fixed timestep. This is quite easy to do by using a time accumulator. Physics is one system which is best run on a fixed timestep, and ticked multiple times per frame if necessary to avoid a loss in stability and keep the simulation smooth.
A bit of code to demonstrate the use of an accumulator:
const float STEP = 60.f / 1000.f;
float accumulator = 0.f;
void Update(float delta)
{
accumulator += delta;
while(accumulator > STEP)
{
Simulate(STEP);
accumulator -= STEP;
}
}
This is not perfect by any means but presents the basic idea - there are many ways to improve on this model. Obviously there are issues to be sorted out when the input framerate is obscenely slow. However, the big advantage is that no matter how fast or slow the delta is, the simulation is moving at a smooth rate in "player time" - which is where any problems will be perceived by the user.
Generally I don't get into the graphics & audio side of things, but I don't think they are affected as much as Physics, input and network code.
It seems that most 3D developers prefer variable FPS: the Quake, Doom and Unreal engines both scale up and down based on system performance.
At the very least you have to compensate for too fast frame rates (unlike 80's games running in the 90's, way too fast)
Your main loop should be parameterized by the timestep anyhow, and as long as it's not too long, a decent integrator like RK4 should handle the physics smoothly Some types of animation (keyframed sprites) could be a pain to parameterize. Network code will need to be smart as well, to avoid players with faster machines from shooting too many bullets for example, but this kind of throttling will need to be done for latency compensation anyhow (the animation parameterization would help hide network lag too)
The timing code will need to be modified for each platform, but it's a small localized change (though some systems make extremely accurate timing difficult, Windows, Mac, Linux seem ok)
Variable frame rates allow for maximum performance. Fixed frame rates allow for consistent performance but will never reach max on all systems (that's seems to be a show stopper for any serious game)
If you are writing a networked 3D game where performance matters I'd have to say, bite the bullet and implement variable frame rates.
If it's a 2D puzzle game you probably can get away with a fixed frame rate, maybe slightly parameterized for super slow computers and next years models.
One option that I, as a user, would like to see more often is dynamically changing the level of detail (in the broad sense, not just the technical sense) when framerates vary outside of a certian envelope. If you are rendering at 5FPS, then turn off bump-mapping. If you are rendering at 90FPS, increase the bells and whistles a bit, and give the user some prettier images to waste their CPU and GPU with.
If done right, the user should get the best experince out of the game without having to go into the settings screen and tweak themselves, and you should have to worry less, as a level designer, about keeping the polygon count the same across difference scenes.
Of course, I say this as a user of games, and not a serious one at that -- I've never attempted to write a nontrivial game.
The main problem I've encountered with variable length frame times is floating point precision, and variable frame times can surprise you in how they bite you.
If, for example, you're adding the frame time * velocity to a position, and frame time gets very small, and position is largish, your objects can slow down or stop moving because all your delta was lost due to precision. You can compensate for this using a separate error accumulator, but it's a pain.
Having fixed (or at least a lower bound on frame length) frame times allows you to control how much FP error you need to take into account.
My experience is fairly limited to somewhat simple games (developed with SDL and C++) but I have found that it is quite easy just to implement a static frame rate. Are you working with 2d or 3d games? I would assume that more complex 3d environments would benefit more from a variable frame rate and that the difficulty would be greater.