I'm working on an algorithm which needs very fast random access to video frames in a possibly long video (minimum 30 minutes). I am currently using OpenCV's VideoCapture to read my video, but the seeking functionality is either broken or very slow. The best I have found so far is the MJPEG codec inside an MKV container, but it's not fast enough.
I can choose any video format or even create a new one. Storage space is not a problem (to some extent, of course). The only requirement is to get the fastest possible seeking time to any location in the video. Ideally, I would like to be able to access multiple frames simultaneously, taking advantage of my quad-core CPU.
I know that relational databases are very good at storing large volumes of data, they allow simultaneous read accesses, and they're very fast when using indexes.
Is SQLite a good fit for my specific needs? I plan to store each video frame compressed in JPEG, and use an index on the frame number to access them quickly.
EDIT: For me, a frame is just an image, not the entire video. A 30-minute video at 25 fps contains 30*60*25 = 45000 frames, and I want to be able to quickly get one of them using its number.
EDIT : For those who could be interested, I finally implemented a custom video container saving each frame in fixed-sized blocks (consequently, the position of any frame can be directly computed !). The images are compressed with the turbojpeg library and file accesses are multi-threaded (to be NCQ-friendly). The bottleneck is not the HDD anymore and I finally obtained much better perfs :)
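To illustrate the fixed-size block idea from the edit above, here is a minimal sketch in Python. It is not the author's implementation: the block size, the 4-byte length header, and all function names are assumptions made for the example, and the "frames" are placeholder byte strings rather than turbojpeg output. The point is only that the offset of frame N is N * BLOCK_SIZE, so no index lookup is needed, and several reads can be in flight at once.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 64 * 1024  # assumed upper bound for one compressed frame
HEADER = 4              # assumed: bytes storing the payload length in each block


def write_container(path, frames):
    """Write each frame into its own fixed-size, zero-padded block."""
    with open(path, "wb") as f:
        for payload in frames:
            if len(payload) > BLOCK_SIZE - HEADER:
                raise ValueError("frame does not fit in one block")
            block = len(payload).to_bytes(HEADER, "little") + payload
            f.write(block.ljust(BLOCK_SIZE, b"\0"))


def read_frame(path, index):
    """Read frame `index`; its offset is computed directly from the index."""
    with open(path, "rb") as f:
        f.seek(index * BLOCK_SIZE)
        block = f.read(BLOCK_SIZE)
    size = int.from_bytes(block[:HEADER], "little")
    return block[HEADER:HEADER + size]


def read_frames(path, indices, workers=4):
    """Issue several reads concurrently; results keep the order of `indices`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: read_frame(path, i), indices))


def demo():
    """Round-trip three placeholder frames through a temporary container file."""
    frames = [b"frame-%d" % i for i in range(3)]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "video.blk")
        write_container(path, frames)
        return read_frames(path, [2, 0, 1])
```

The trade-off is wasted space: every block is as large as the largest frame you allow, which is acceptable here only because storage was stated not to be a concern.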
I don't think using SQLite (or any other database engine) is a good solution for your problem. A database is not a filesystem.
If what you need is very fast random access, then stick to the filesystem, it was designed for this kind of usage, and optimized with this in mind. As per your comment, you say a 5h video would require 450k files, well, that's not a problem in my opinion. Certainly, directory listing will be a bit long, but you will get the absolute fastest possible random access. And it will certainly be faster than SQLite because you're one level of abstraction under.
And if you're really worried about directory listing times, you just have to organize your folder structure like a tree. That will get you longer paths, but fast listing.
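A tree layout like the one suggested above could be sketched as follows. The two-level scheme, the 1000-files-per-directory bucket size, and the `frame_path` name are assumptions for illustration; any scheme that bounds the number of entries per directory would serve the same purpose.

```python
import os


def frame_path(root, frame_number, per_dir=1000):
    """Map a frame number to root/<bucket>/<frame>.jpg.

    Each bucket directory holds at most `per_dir` files, so listing any
    single directory stays cheap even with hundreds of thousands of frames.
    """
    bucket = frame_number // per_dir
    return os.path.join(root, f"{bucket:04d}", f"{frame_number:07d}.jpg")
```

With these assumed defaults, the 450k frames of a 5h video would be spread over 450 directories of 1000 files each.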
Keep a high level perspective. The problem is that OpenCV isn't fast enough at seeking in the source video. This could be because:
Codecs are not OpenCV's strength
The source video is not encoded for efficient seeking
Your machine has a lot of dedicated graphics hardware to leverage, but it does not have specialized capabilities for randomly seeking within a 17 GB dataset, be it a file, a database, or a set of files. The disk will take a few milliseconds per seek. It will be better for an SSD but still not so great. Then you wait for it to load into main memory. And you have to generate all that data in the first place.
Use ffmpeg, which should handle decoding very efficiently, perhaps even using the GPU. Here is a tutorial. (Disclaimer, I haven't used it myself.)
You might preprocess the video to add key frames. In principle this shouldn't require completely re-encoding, at least for MPEG, but I don't know much about specifics. MJPEG essentially turns all frames into keyframes, but you can find a middle ground and maybe seek 1.5x faster at a 2x size cost. But avoid hitting the disk.
As for SQLite, that is a fine solution to the problem of seeking within 17 GB of data. The notion that databases aren't optimized for random access is poppycock. Of course they are. A filesystem is a kind of database. Random access in 17 GB is slow because of hardware, not software.
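For reference, the SQLite layout described in the question (one JPEG per row, looked up by frame number) might look like the sketch below, using Python's built-in sqlite3 module. The table and column names are assumptions. Declaring the frame number as INTEGER PRIMARY KEY makes it the table's row id, so the lookup is a single B-tree search and no separate index is required.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a frame store; ':memory:' is used here only for the demo."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS frames ("
        " frame_no INTEGER PRIMARY KEY,"  # alias for SQLite's rowid
        " jpeg BLOB NOT NULL)"
    )
    return conn


def put_frame(conn, frame_no, jpeg_bytes):
    """Store one compressed frame under its frame number."""
    conn.execute(
        "INSERT OR REPLACE INTO frames (frame_no, jpeg) VALUES (?, ?)",
        (frame_no, jpeg_bytes),
    )


def get_frame(conn, frame_no):
    """Return the stored bytes for a frame, or None if it is missing."""
    row = conn.execute(
        "SELECT jpeg FROM frames WHERE frame_no = ?", (frame_no,)
    ).fetchone()
    return None if row is None else row[0]
```

Whether this is fast enough still depends on the hardware argument made above; the sketch only shows that the schema itself is simple.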
I would recommend against using the filesystem for this task, because it's a shared resource synchronized with the rest of the machine. Also, creating half a million files (and deleting them when finished) will take a long time. That is not what a filesystem is specialized for. You can get around that, though, by storing several images to each file. But then you need some format to find the desired image, and then why not put them all in one file?
Indeed, (if going the 17 GB route) why not ignore the entire problem and put everything in virtual memory? VM is just as good at making the disk seek as SQLite or the filesystem. As long as the OS knows it's OK for the process to use that much memory, and you're using 64-bit pointers, it should be a fine solution, and the first thing to try.
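A minimal sketch of the virtual-memory approach, assuming the frames are stored uncompressed and all have the same size: map the file and let the OS page in whatever is touched. The tiny 16-byte frame size and the helper names are assumptions chosen to keep the demo small; real frames would be far larger.

```python
import mmap
import os
import tempfile

FRAME_BYTES = 16  # assumed fixed size of one raw frame (tiny, for the demo only)


def frame_view(mm, index):
    """Return the bytes of frame `index`; the OS pages them in on demand."""
    start = index * FRAME_BYTES
    return mm[start:start + FRAME_BYTES]


def demo():
    """Write four placeholder frames, map the file, and fetch frame 2."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "frames.raw")
        with open(path, "wb") as f:
            for i in range(4):
                f.write(bytes([i]) * FRAME_BYTES)
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                return frame_view(mm, 2)
```

As noted above, mapping something like 17 GB requires a 64-bit process.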
In my case, raw data is stored in a NoSQL database. Before training an ML model, I have to preprocess that raw data. Once I have preprocessed it, what is the best way to keep the preprocessed data?
1. keep it in memory
2. keep it in another table in NoSQL
3. can you recommend other options?
Depends on your use case, size of the data, tech stack and machine learning framework / library. Truth be told, without knowledge of your data and requirements, no-one on SO will be able to give you a complete answer.
In terms of passing data to the model/ running the model, load it in memory. Look at batching your data into the model if you hit memory limits. Or use an AWS EMR cluster!
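Batching, as mentioned above, can be as simple as the generator below. It is a generic sketch, not tied to any particular ML library; the `batches` name is an assumption. Only one batch is held in memory at a time, so the full dataset never has to fit.

```python
from itertools import islice


def batches(records, batch_size):
    """Yield successive lists of up to `batch_size` records from any iterable."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

Each yielded batch would then be handed to whatever incremental training step your framework provides.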
For the question on storing the data, I’ll use the previous answer’s example of Spark and try to give some general rules.
If the processed data is “Big” and regularly accessed (eg once a month/week/day), then store it in a distributed manner, then load into memory when running the model.
For Spark, best bet is to write it as partitioned parquet files or to a Hive Data Warehouse.
The key thing about those two is that they are distributed. Spark will create N parquet files containing all your data. When it comes to reading the dataset into memory (before running your model), it can read from many files at once - saving a lot of time. Tensorflow does a similar thing with the TFRecords format.
If your NoSQL database is distributed, then you can potentially use that.
If it won’t be regularly accessed and is “small”, then just run the code from scratch & load into memory.
If the processing takes no time at all and it’s not used for other work, then there’s no point storing it. It’s a waste of time. Don’t even think about it. Just focus on your model, get the data in memory and get running.
If the data won’t be regularly accessed but is “Big”, then time to think hard!
You need to carefully think about the trade off of processing time vs. data storage capability.
How much will it cost to store this data?
How often is it needed?
Is it business critical?
When someone asks for this, is it always a “needed to be done yesterday” request?
Etc.
—-
The Spark framework is a good solution for what you want to do. Learn more about it here: spark. Spark for machine learning: here.
Here's a situation I am facing right now at work:
we currently have 300GB+ of production data (and it grows considerably every day). It's in a mongodb cluster
data science team members are working on a few algorithms that require access to all of this data at once, and those algorithms may update data in place; hence, they have replicated the data in the dev environment for their use until they are sure their code works
if multiple devs are running their algorithms, then all or some of them may end up with unexpected output because other algorithms are also updating the data
this problem could be easily solved if everyone had their own copy of the data!
however, given the volume of data, it's not feasible for me to provide them (8 developers right now) with their own exclusive copy every day. Even if I automate this process, we'll have to wait until the copy is completed over the wire
I am hoping for a future-proof approach, considering we'll be dealing with TBs of data quite soon
I am assuming that many organizations would be facing such issues, and wondering how do other folks approach such a case.
I'd highly appreciate any pointers, leads, solutions for this.
Thanks
You can try using snapshots on the replicated data so each developer can have his own "copy" of the data. See Snapshots definition and consult your cloud provider if it can provide writable snapshots.
Note that snapshots are created almost instantly, and at the moment of creation they require almost no storage space, because the technology stores pointers to the data rather than the data itself. Unfortunately, each snapshot can grow up to the original volume size, because any change to the data triggers a physical copy: the technology behind the process is usually CoW (copy-on-write). So there is a serious danger that uncontrolled snapshots can "eat" all your free storage space.
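The storage behaviour described above can be illustrated with a toy model. This is not how any particular storage system or cloud provider implements snapshots; the `Snapshot` class and its methods are assumptions made purely to show why a fresh snapshot costs nothing and why it grows with every block a developer modifies.

```python
class Snapshot:
    """Toy writable snapshot: shares base blocks, stores only changed ones."""

    def __init__(self, base_blocks):
        self.base = base_blocks  # shared with the origin, never modified here
        self.changed = {}        # block index -> private copy

    def read(self, index):
        """Serve a block from private storage if changed, else from the base."""
        return self.changed.get(index, self.base[index])

    def write(self, index, data):
        """The first write to a block allocates private storage for it."""
        self.changed[index] = data

    def private_blocks(self):
        """Number of blocks this snapshot occupies on its own."""
        return len(self.changed)
```

In this model, a snapshot that rewrites every block ends up with as many private blocks as the base volume, which is the growth risk mentioned above.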
I have a single OpenCL kernel that needs access to different memory objects at the same time, and those memory objects can't all fit in the device memory, so I need to only have at any given iteration the ones that are needed. How can I maintain only the objects needed without having to destroy and transfer again the objects that are already there and need to stay there?
In an ideal world I could copy each object individually as needed, but given that there's no such thing as arrays of pointers on the device, how could I do this if I need 50 of them in the same instance of a kernel?
For context: I use one pretty general-purpose kernel that takes care of all my graphics needs by following a list of things to do, but now I need it to access images, tiles from huge images or smaller mipmap versions, many at once. It might need the exact same objects with no changes for thousands of iterations, or it might need to remove some objects but not some others on a regular basis.
I want to write an application that removes data from a hard drive. Are there any standards that I need to adhere to which will ensure that my software removes at least the bare minimum, or should I just use off the shelf software? If so any advice?
I think any "standard" you may encounter won't be any less science fiction or science mysticism than anything you come up with yourself. Basically, as long as you physically overwrite the data (even just once), there's no commercial forensic service that - even in the face of any amount of money you throw at them - will claim to be able to recover your data.
(Any "overwrite 35 times with rotating bit patterns" advice may have been true for coarsely spaced magnetic tapes in the 1970s, but it is entirely irrelevant for contemporary hard disks).
The far more important problem you have to solve is how to overwrite data physically. This is essentially impossible through any sort of application or even OS programming, and you'll have to find a way to talk to the hardware properly and get a reliable confirmation that the location you intended to write to has indeed been written to, and that there aren't any relocations of the clusters in question to other parts of the disk that might leak the data.
So in essence this is a very low-level question that'll probably have you poring over your hard disk manufacturer's manuals quite a bit if you want a genuine solution.
Please define "data removal". Is this scrubbing in order to make undeletions impossible; or simply deletion of data ?
It is common to write over a file several times with a random bitpattern, if one wants to make sure it cannot be recovered. Due to the analog nature of the magnetic bit patterns, it might be possible to recover overwritten data in some circumstances.
In any case, a normal file system delete operation can be reverted in most cases. When you delete a file (using a normal file system delete operation), you remove the file allocation table entry, not the data.
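The overwrite-before-delete approach mentioned above can be sketched at the application level as follows. This is only a logical overwrite: as another answer in this thread points out, the filesystem or the drive may relocate data, so the sketch gives no guarantee that the original physical sectors are touched. The function name, the chunk size, and the single-pass default are assumptions.

```python
import os
import tempfile


def overwrite_file(path, passes=1, chunk=1024 * 1024):
    """Overwrite a file's contents in place with random bytes, then sync."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            remaining = size
            while remaining > 0:
                n = min(chunk, remaining)
                f.write(os.urandom(n))
                remaining -= n
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the device


def demo():
    """Overwrite a small temporary file and report what is left in it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "secret.txt")
        with open(path, "wb") as f:
            f.write(b"top secret" * 100)
        overwrite_file(path)
        with open(path, "rb") as f:
            data = f.read()
        return len(data), b"top secret" in data
```

The file keeps its length and would still have to be deleted afterwards.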
There are standards... see http://en.wikipedia.org/wiki/Data_erasure
You don't give any details, so it is hard to tell whether they apply to your situation. Deleting a file with the OS's built-in file deletion can almost always be reverted. OTOH, formatting a drive (NOT quick format) is usually OK, except when you deal with sensitive data (such as data from clients, patients, finance, or other security-relevant material). In that case, use the above-mentioned standards, which usually prescribe different numbers of rounds and patterns for overwriting the data, to make it nearly impossible to revert the deletion. In really, really sensitive cases you first use the best of these methods, then format the drive, then use that method again, and then destroy the drive physically (which means real destruction, not only removing the electronics or similar!).
The best way to avoid all this hassle is to plan for this kind of thing and to use strong, proven full-disk encryption (with a key NOT stored on the drive electronics or media!). This way you can simply format the drive (NOT quick) and then sell it, for example, since data under any strong encryption looks like "random data" and is (if implemented correctly) absolutely useless without the key(s).
My problem is:
I have a Perl script which uses a lot of memory (expected behaviour because of caching). But I noticed that the more caching I do, the slower it gets, and the process spends most of its time in sleep mode.
I thought pre-allocating memory to the process might speed up the performance.
Does someone have any ideas here?
Update:
I think I am not being very clear here. I will put question in clearer way:
I am not looking for ways of pre-allocating inside the Perl script. I don't think that would help me much here. What I am interested in is a way to tell the OS to allocate X amount of memory for my Perl script so that it does not have to compete with other processes coming in later.
Assume that I can't get away from the memory usage. I am exploring ways of reducing that too, but I don't expect much improvement there.
FYI, I am working on a Solaris 10 machine.
What I gathered from your posting and comments is this:
Your program gets slow when memory use rises
Your program increasingly spends time sleeping, not computing.
Most likely explanation: Sleeping means waiting for a resource to become available. In this case the resource most likely is memory. Use the vmstat 1 command to verify. Have a look at the sr column. If it goes beyond ~150 consistently, the system is desperate to free pages to satisfy demand. This is accompanied by high activity in the pi, po and fr columns.
If this is in fact the case, your best choices are:
Upgrade system memory to meet demand
Reduce memory usage to a level appropriate for the system at hand.
Preallocating memory will not help. In either case memory demand will exceed the available main memory at some point. The kernel will then have to decide which pages need to be in memory now and which pages may be cleared and reused for the more urgently needed pages. If the set of regularly needed pages (the working set) exceeds the size of main memory, the system is constantly moving pages from and to secondary storage (swap). The system is then said to be thrashing and spends little time doing useful work. There is nothing you can do about this except adding memory or using less of it.
From a comment:
The memory limitations are not very severe but the memory footprint easily grows to GBs and when we have competing processes for memory, it gets very slow. I want to reserve some memory from OS so that thrashing is minimal even when too many other processes come. Jagmal
Let's take a different tack then. The problem isn't really with your Perl script in particular. Instead, all the processes on the machine are consuming too much memory for the machine to handle as configured.
You can "reserve" memory, but that won't prevent thrashing. In fact, it could make the problem worse because the OS won't know if you are using the memory or just saving it for later.
I suspect you are suffering the tragedy of the commons. Am I right that many other users are on the machine in question? If so, this is more of a social problem than a technical problem. What you need is someone (probably the System Administrator) to step in and coordinate all the processes on the machine. They should find the most extravagant memory hogs and work with their programmers to reduce the cost on system resources. Further, they ought to arrange for processes to be scheduled so that resource allocation is efficient. Finally, they may need to get more or improved hardware to handle the expected system load.
Some questions you might ask yourself:
are my data structures really useful for the task at hand?
do I really have to cache that much?
can I throw away cached data after some time?
my @array;
$#array = 1_000_000; # pre-extend array to one million elements,
# http://perldoc.perl.org/perldata.html#Scalar-values
my %hash;
keys(%hash) = 8192; # pre-allocate hash buckets
# (same documentation section)
Not being familiar with your code, I'll venture some wild speculation here [grin] that these techniques aren't going to offer new great efficiencies to your script, but that the pre-allocation could help a little bit.
Good luck!
-- Douglas Hunter
I recently rediscovered an excellent Randal L. Schwartz article that includes preallocating an array. Assuming this is your problem, you can test preallocating with a variation on that code. But be sure to test the result.
The reason the script gets slower with more caching might be thrashing. Presumably the reason for caching in the first place is to increase performance. So a quick answer is: reduce caching.
Now there may be ways to modify your caching scheme so that it uses less main memory and avoids thrashing. For instance, you might find that caching to a file or database instead of to memory can boost performance. I've found that file system and database caching can be more efficient than application caching and can be shared among multiple instances.
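The idea of caching to a file instead of to process memory could look like the sketch below. It is written in Python rather than Perl, uses the standard shelve module as the on-disk store, and the `cached` helper and its arguments are assumptions for illustration; it is a sketch of the idea, not a drop-in replacement for your script's cache.

```python
import os
import shelve
import tempfile


def cached(db, key, compute):
    """Return db[key], computing and storing the value on first use."""
    if key not in db:
        db[key] = compute(key)
    return db[key]


def demo():
    """Look the same key up twice; the expensive function runs only once."""
    calls = []

    def expensive(key):
        calls.append(key)
        return key.upper()

    with tempfile.TemporaryDirectory() as tmp:
        with shelve.open(os.path.join(tmp, "cache")) as db:
            first = cached(db, "abc", expensive)
            second = cached(db, "abc", expensive)
    return first, second, len(calls)
```

Because the values live on disk, the process's resident memory stays small, and the OS file cache decides how much of the data stays in RAM.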
Another idea might be to alter your algorithm to reduce memory usage in other areas. For instance, instead of pulling an entire file into memory, Perl programs tend to work better reading line by line.
Finally, have you explored the Memoize module? It might not be immediately applicable, but it could be a source of ideas.
I could not find a way to do this yet.
But, I found out that (See this for details)
Memory allocated to lexicals (i.e. my() variables) cannot be reclaimed or reused even if they go out of scope. It is reserved in case the variables come back into scope. Memory allocated to global variables can be reused (within your program) by using undef() and/or delete().
So, I believe a possibility here could be to check if I can reduce the total memory footprint of lexical variables at a given point in time.
It sounds like you are looking for limit or ulimit. But I suspect that will cause a script that goes over the limit to fail, which probably isn't what you want.
A better idea might be to share cached data between processes. Putting data in a database or in a file works well in my experience.
I hate to say it, but if your memory limitations are this severe, Perl is probably not the right language for this application. C would be a better choice, I'd think.
One thing you could do is to use Solaris zones (containers).
You could put your process in a zone and allocate it resources like RAM and CPUs.
Here are two links to some tutorials:
Solaris Containers How To Guide
Zone Resource Control in the Solaris 10 08/07 OS
While it's not pre-allocating as you asked for, you may also want to look at the large page size options, so that when Perl has to ask the OS for more memory for your program, it gets it in larger chunks.
See Solaris Internals: Multiple Page Size Support for more information on the difference this makes and how to do it.
Look at http://metacpan.org/pod/Devel::Size
You could also inline a C function to do the above.
As far as I know, you cannot allocate memory directly from Perl. You can get around this by writing an XS module, or using an inline C function like I mentioned.