I happen to have a large number of files to read and process in R (~20,000 files, ~40 GB in total).
I was thinking about parallelizing the read, yet one philosophical question comes to mind regarding parallelization. Perhaps the question is simply wrong and my wording is imprecise, since I am not an expert on the subject, so please correct me where I am wrong: even with parallelization, the disk's read head still needs to access each file sequentially (there is only one read head traversing the disk). We are parallelizing the CPU work, but at some point, does the mechanical read become a hindrance to the CPU parallelization? Would separating the files into clusters help the read, since we would then be trying to parallelize the physical read as well?
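To make the idea concrete, here is a minimal sketch of what a parallelized read in R might look like; the directory and the process_file() function are placeholders for whatever your actual data and processing are:

library(parallel)

files <- list.files("path/to/your/files", full.names = TRUE)   # placeholder directory

process_file <- function(f) {
  dat <- read.csv(f)   # or readRDS()/readLines(), whatever matches your format
  summary(dat)         # placeholder for the real per-file processing
}

# mclapply() forks one worker per core (Unix-like systems only);
# on Windows use makeCluster() + parLapply() instead.
results <- mclapply(files, process_file, mc.cores = detectCores() - 1)

Whether this is actually faster than a plain lapply() depends on exactly the question above: if the disk is the bottleneck, adding workers mostly adds seek contention; if the per-file processing dominates, the speed-up can be close to linear.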
Related
I'm working on a scientific project with a huge database, think 100 billion lines. Everything is split up into chunks, and chunks can reach 1 GB or more. I'm using a cluster and run a fairly simple function (it's already optimized). I set what I believe is the maximum memory limit for my cluster, 128 GB, but my script still reaches the memory limit.
My question is: I have access to multiple nodes and multiple cores per node. Does using parallelization solve my problem of hitting the memory limit? I've never used parallelization, so I have no idea if this is worth learning. Thanks!
I know almost nothing about parallel computing, so this question might be very stupid, and maybe it is impossible to do what I would like to.
I am using a Linux cluster with 40 nodes; however, since I don't know how to write parallel code in R, I am limited to using only one. On this node I am trying to analyse data which floods the memory (around 64 GB). So my problem isn't lack of computational power but rather the memory limitation.
My question is whether it is even possible to use some R package (like doSnow) for implicit parallelisation across 2-3 nodes to increase the RAM limit, or whether I would have to rewrite the script from the ground up to make it explicitly parallelised?
Sorry if my question is naive; any suggestions are welcome.
Thanks,
Simon
I don't think there is such a package. The reason is that it would not make much sense to have one. Memory access is very fast, and accessing data from another computer over the network is very slow compared to that. So if such a package existed it would be almost useless, since the processor would need to wait for data over the network all the time, and this would make the computation very very slow.
This is true for common computing clusters, built from off-the-shelf hardware. If you happen to have a special cluster where remote memory access is fast, and is provided as a service of the operating system, then of course it might be not that bad.
Otherwise, what you need to do is to try to divide up the problem into multiple pieces, manually, and then parallelize, either using R, or another tool.
An alternative to this would be to keep some of the data on the disk, instead of loading all of it into the memory. You still need to (kind of) divide up the problem, to make sure that the part of the data in the memory is used for a reasonable amount of time for computation, before loading another part of the data.
Whether it is worth (or even possible) to do either of these depends completely on your application.
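As a rough illustration of the second option, here is a minimal sketch in R that keeps only one chunk in memory at a time; the chunk files and the compute_on_chunk() function are made up for the example:

chunk_files <- list.files("chunks", pattern = "\\.rds$", full.names = TRUE)  # hypothetical chunk files

partial_results <- vector("list", length(chunk_files))
for (i in seq_along(chunk_files)) {
  chunk <- readRDS(chunk_files[i])                   # load one chunk into memory
  partial_results[[i]] <- compute_on_chunk(chunk)    # placeholder: do enough work to amortize the load
  rm(chunk); gc()                                    # free the chunk before reading the next one
}
final_result <- do.call(rbind, partial_results)      # combine the per-chunk results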
Btw. a good list of high performance computing tools in R is here:
http://cran.r-project.org/web/views/HighPerformanceComputing.html
For future reference:
You may want to have a look at the two packages "snow" and "parallel".
The "snow" package extends the functionality of apply/lapply/sapply... to work on more than one core and/or more than one node.
Of course, you can perform simple parallel computing using more than one core:
#SBATCH --cpus-per-task= (enter some number here)
You can also perform parallel computing using more than one node (preferably with the previously mentioned libraries) using:
#SBATCH --ntasks-per-node= (enter some number here)
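If it helps, a minimal sketch of the R side might look like the following; the host names, my_function and my_data_chunks are placeholders, and with "snow" the calls are nearly identical:

library(parallel)

# Hypothetical host names; in practice you would build this vector from the
# node list your scheduler exposes (e.g. the SLURM_JOB_NODELIST variable).
hosts <- c("node01", "node01", "node02", "node02")

cl <- makeCluster(hosts, type = "PSOCK")   # one R worker per entry in 'hosts'
clusterExport(cl, "my_function")           # ship the (placeholder) function to the workers
results <- parLapply(cl, my_data_chunks, my_function)
stopCluster(cl)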
However, for some workloads you may want to consider using Python instead of R, where parallelism can be much more efficient using "Dask" workers.
You might want to take a look at TidalScale, which can allow you to aggregate nodes on your cluster to run a single instance of Linux with the collective resources of the underlying nodes. www.tidalscale.com. Though the R application may be inherently single threaded, you'll be able to provide your R application with a single, simple coherent memory space across the nodes that will be transparent to your application.
Good luck with your project!
I'm working on an algorithm which needs very fast random access to video frames in a possibly long video (minimum 30 minutes). I am currently using OpenCV's VideoCapture to read my video, but the seeking functionality is either broken or very slow. The best I found until now is using the MJPEG codec inside a MKV container, but it's not fast enough.
I can choose any video format or even create a new one. Storage space is not a problem (within reason, of course). The only requirement is the fastest possible seek time to any location in the video. Ideally, I would like to be able to access multiple frames simultaneously, taking advantage of my quad-core CPU.
I know that relational databases are very good at storing large volumes of data, they allow simultaneous read access, and they're very fast when using indexes.
Is SQLite a good fit for my specific needs ? I plan to store each video frame compressed in JPEG, and use an index on the frame number to access them quickly.
EDIT : for me a frame is just an image, not the entire video. A 30 min video @ 25 fps contains 30*60*25 = 45,000 frames, and I want to be able to quickly get any of them by its number.
EDIT : For those who might be interested, I finally implemented a custom video container saving each frame in fixed-size blocks (consequently, the position of any frame can be computed directly!). The images are compressed with the turbojpeg library and file accesses are multi-threaded (to be NCQ-friendly). The bottleneck is no longer the HDD and I finally obtained much better performance :)
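To make the "position can be computed directly" remark concrete, here is a minimal sketch of the idea (in R, to stay consistent with the other examples in this thread; the header and block sizes are made up, and the real implementation described above is in C++ with turbojpeg):

# Hypothetical container layout: a fixed-size header followed by
# fixed-size blocks, one JPEG-compressed frame per block.
header_size <- 1024          # made-up header size in bytes
block_size  <- 200 * 1024    # made-up block size in bytes

read_frame <- function(path, frame_number) {          # frame_number starting at 0
  offset <- header_size + frame_number * block_size   # the position is pure arithmetic, no index needed
  con <- file(path, open = "rb")
  on.exit(close(con))
  seek(con, where = offset, origin = "start")
  readBin(con, what = "raw", n = block_size)           # raw JPEG bytes, decoded by the caller
}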
I don't think using SQLite (or any other database engine) is a good solution for your problem. A database is not a filesystem.
If what you need is very fast random access, then stick to the filesystem; it was designed for this kind of usage and optimized with this in mind. As per your comment, a 5-hour video would require 450k files; well, that's not a problem in my opinion. Certainly, directory listing will be a bit slow, but you will get the absolute fastest possible random access. And it will certainly be faster than SQLite, because you're one level of abstraction lower.
And if you're really worried about directory listing times, you just have to organize your folder structure like a tree. That will get you longer paths, but fast listing.
Keep a high-level perspective. The problem is that OpenCV isn't fast enough at seeking in the source video. This could be because:
Codecs are not OpenCV's strength
The source video is not encoded for efficient seeking
Your machine has a lot of dedicated graphics hardware to leverage, but it does not have specialized capabilities for randomly seeking within a 17 GB dataset, be it a file, a database, or a set of files. The disk will take a few milliseconds per seek. It will be better on an SSD, but still not great. Then you wait for the data to load into main memory. And you have to generate all that data in the first place.
Use ffmpeg, which should handle decoding very efficiently, perhaps even using the GPU. Here is a tutorial. (Disclaimer, I haven't used it myself.)
You might preprocess the video to add keyframes. In principle this shouldn't require a complete re-encode, at least for MPEG, but I don't know much about the specifics. MJPEG essentially turns all frames into keyframes, but you can find a middle ground and maybe seek 1.5x faster at a 2x size cost. But avoid hitting the disk.
As for SQLite, that is a fine solution to the problem of seeking within 17 GB of data. The notion that databases aren't optimized for random access is poppycock. Of course they are. A filesystem is a kind of database. Random access in 17 GB is slow because of hardware, not software.
I would recommend against using the filesystem for this task, because it's a shared resource synchronized with the rest of the machine. Also, creating half a million files (and deleting them when finished) will take a long time. That is not what a filesystem is specialized for. You can get around that, though, by storing several images to each file. But then you need some format to find the desired image, and then why not put them all in one file?
Indeed, (if going the 17 GB route) why not ignore the entire problem and put everything in virtual memory? VM is just as good at making the disk seek as SQLite or the filesystem. As long as the OS knows it's OK for the process to use that much memory, and you're using 64-bit pointers, it should be a fine solution, and the first thing to try.
I have a C++ code using MPI which is executed in a sequential-parallel-sequential pattern, and this pattern is repeated in a time loop.
While validating the code against the serial version, I do get a reduction in time for the parallel part, and in fact the reduction is almost linear in the number of processors.
The problem I am facing is that the time required for the sequential part also increases considerably when using a higher number of processors.
The parallel part takes less time to execute than the total sequential time of the entire program.
Therefore, although there is a reduction in time for the parallel part when using more processors, the saving is largely lost due to the increase in time spent executing the sequential part. The sequential part also includes a large number of computations at each time step and writing the data to an output file at certain specified times.
All the processors run during the execution of the sequential part; the data is gathered to the root processor after the parallel computation, and only the root processor is allowed to write the file.
Can anyone suggest an efficient way to handle the serial part (large number of operations + writing the file) of a parallel code? I am happy to clarify any point if required.
Thanks in advance.
First of all, do the file writing from a separate thread (or a separate process, in MPI terms), so the other threads can use your cores for computation.
Then, check why your parallel version is much slower than the sequential one. Often this means you create tasks that are too small, so that communication between threads (synchronization) eats your performance. Think about whether tasks can be combined into chunks and complete chunks processed in parallel.
And, of course, use a profiler that works well in a multithreaded environment.
[EDIT]
sequential part = the part of your logic that cannot be (and is not) parallelized, do you mean the same? The sequential part on a multicore machine can run a bit slower, probably because of the OS scheduler or something like that, but it's weird that you see a noticeable difference.
The disk is sequential by nature, so writing to it from many threads doesn't give any benefit, but it can lead to a situation where many threads try to write simultaneously and wait for each other instead of doing something useful.
BTW, what MPI implementation do you use?
Your problem description is too high-level; provide some pseudo-code or something similar, as this would help us to help you.
My problem is:
I have a Perl script which uses a lot of memory (expected behaviour because of caching). But I noticed that the more I cache, the slower it gets, and the process spends most of its time in sleep mode.
I thought pre-allocating memory to the process might speed up the performance.
Does someone have any ideas here?
Update:
I think I am not being very clear here. I will put question in clearer way:
I am not looking for ways of pre-allocating inside the Perl script. I don't think that would help me much here. What I am interested in is a way to tell the OS to allocate X amount of memory for my Perl script so that it does not have to compete with other processes coming in later.
Assume that I can't get away from the memory usage. I am exploring ways of reducing that too, but I don't expect much improvement there.
FYI, I am working on a Solaris 10 machine.
What I gathered from your posting and comments is this:
Your program gets slow when memory use rises.
Your program increasingly spends time sleeping, not computing.
Most likely explanation: sleeping means waiting for a resource to become available, and in this case the resource most likely is memory. Use the vmstat 1 command to verify. Have a look at the sr column. If it consistently goes beyond ~150, the system is desperate to free pages to satisfy demand. This is accompanied by high activity in the pi, po and fr columns.
If this is in fact the case, your best choices are:
Upgrade system memory to meet demand
Reduce memory usage to a level appropriate for the system at hand.
Preallocating memory will not help. In either case memory demand will exceed the available main memory at some point. The kernel will then have to decide which pages need to be in memory now and which pages may be cleared and reused for the more urgently needed pages. If all regularly needed pages (the working set) exceed the size of main memory, the system is constantly moving pages to and from secondary storage (swap). The system is then said to be thrashing and spends little time doing useful work. There is nothing you can do about this except adding memory or using less of it.
From a comment:
The memory limitations are not very severe, but the memory footprint easily grows to GBs, and when we have competing processes for memory, it gets very slow. I want to reserve some memory from the OS so that thrashing is minimal even when too many other processes come. -- Jagmal
Let's take a different tack then. The problem isn't really with your Perl script in particular. Instead, all the processes on the machine are consuming too much memory for the machine to handle as configured.
You can "reserve" memory, but that won't prevent thrashing. In fact, it could make the problem worse because the OS won't know if you are using the memory or just saving it for later.
I suspect you are suffering the tragedy of the commons. Am I right that many other users are on the machine in question? If so, this is more of a social problem than a technical problem. What you need is someone (probably the System Administrator) to step in and coordinate all the processes on the machine. They should find the most extravagant memory hogs and work with their programmers to reduce the cost on system resources. Further, they ought to arrange for processes to be scheduled so that resource allocation is efficient. Finally, they may need to get more or improved hardware to handle the expected system load.
Some questions you might ask yourself:
are my data structures really useful for the task at hand?
do I really have to cache that much?
can I throw away cached data after some time?
my @array;
$#array = 1_000_000;  # pre-extend the array to one million elements,
                      # see http://perldoc.perl.org/perldata.html#Scalar-values

my %hash;
keys(%hash) = 8192;   # pre-allocate hash buckets (same documentation section)
Not being familiar with your code, I'll venture some wild speculation here [grin] that these techniques aren't going to offer new great efficiencies to your script, but that the pre-allocation could help a little bit.
Good luck!
-- Douglas Hunter
I recently rediscovered an excellent Randal L. Schwartz article that includes preallocating an array. Assuming this is your problem, you can test preallocating with a variation on that code. But be sure to test the result.
The reason the script gets slower with more caching might be thrashing. Presumably the reason for caching in the first place is to increase performance. So a quick answer is: reduce caching.
Now there may be ways to modify your caching scheme so that it uses less main memory and avoids thrashing. For instance, you might find that caching to a file or database instead of to memory can boost performance. I've found that file system and database caching can be more efficient than application caching and can be shared among multiple instances.
Another idea might be to alter your algorithm to reduce memory usage in other areas. For instance, instead of pulling an entire file into memory, Perl programs tend to work better reading line by line.
Finally, have you explored the Memoize module? It might not be immediately applicable, but it could be a source of ideas.
I could not find a way to do this yet.
But I found out that (see this for details):
Memory allocated to lexicals (i.e. my() variables) cannot be reclaimed or reused even if they go out of scope. It is reserved in case the variables come back into scope. Memory allocated to global variables can be reused (within your program) by using undef() and/or delete().
So I believe a possibility here could be to check whether I can reduce the total memory footprint of the lexical variables at a given point in time.
It sounds like you are looking for limit or ulimit. But I suspect that will cause a script that goes over the limit to fail, which probably isn't what you want.
A better idea might be to share cached data between processes. Putting data in a database or in a file works well in my experience.
I hate to say it, but if your memory limitations are this severe, Perl is probably not the right language for this application. C would be a better choice, I'd think.
One thing you could do is to use Solaris zones (containers).
You could put your process in a zone and allocate it resources like RAM and CPUs.
Here are two links to some tutorials :
Solaris Containers How To Guide
Zone Resource Control in the Solaris 10 08/07 OS
While it's not pre-allocating as you asked for, you may also want to look at the large page size options, so that when Perl has to ask the OS for more memory for your program, it gets it in larger chunks.
See Solaris Internals: Multiple Page Size Support for more information on the difference this makes and how to do it.
Look at http://metacpan.org/pod/Devel::Size
You could also inline a C function to do the above.
As far as I know, you cannot allocate memory directly from Perl. You can get around this by writing an XS module, or using an inline C function like I mentioned.