Retrieve video file duration (time) using R

I’m writing some code to delete video files that I don’t need. The videos are CCTV footage recorded 24/7, but the recording software saves them as roughly one-hour files, and the durations are not exact, which is the problem. I’m only interested in keeping videos from a particular part of the day (which varies), and because the duration of each video is not exact this is causing me problems.
The video file name has a date and time stamp, but only for the start, so if I could find the duration everything becomes simple algebra.
So my question is simple: is it possible to get the duration (time) of video files using R?
A couple of other useful details: the videos come from several cameras and each camera has a different recording frame rate, so using file.info to return the file size and derive the length of the video is not an option. Also, the video files are in .avi format.
Cheers
Patrao

As far as I know, there are no ready-made packages that handle video files in R (the way MATLAB does). This isn't a pure R solution, but it gets the job done: I installed the CLI interface to MediaInfo and called it from R using system().
wolf <- system("q:/mi_cli/mediainfo.exe Krofel_video2volk2.AVI", intern = TRUE)
wolf # output by MediaInfo
[1] "General"
[2] "Complete name : Krofel_video2volk2.AVI"
[3] "Format : AVI"
[4] "Format/Info : Audio Video Interleave"
[5] "File size : 10.7 MiB"
[6] "Duration : 11s 188ms"
[7] "Overall bit rate : 8 016 Kbps"
...
[37] "Channel count : 1 channel"
[38] "Sampling rate : 8 000 Hz"
[39] "Bit depth : 16 bits"
[40] "Stream size : 174 KiB (2%)"
[41] "Alignment : Aligned on interleaves"
[42] "Interleave, duration : 63 ms (1.00 video frame)"
# Find where Duration is (general) and extract it.
find.duration <- grepl("Duration", wolf)
wolf[find.duration][1]# 1 = General, 2 = Video, 3 = Audio
[1] "Duration : 11s 188ms"
Have fun parsing the time.
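For the parsing step, here is a minimal sketch that turns the General duration line of the wolf vector above into seconds. It assumes the "11s 188ms" format shown here; other MediaInfo builds format the duration differently (e.g. with minutes or as hh:mm:ss), so adjust the regular expressions to taste.
dur_line <- wolf[grepl("Duration", wolf)][1]                     # "Duration : 11s 188ms"
dur_str  <- sub(".*: ", "", dur_line)                            # "11s 188ms"
secs     <- as.numeric(sub("s.*", "", dur_str))                  # 11
ms       <- as.numeric(sub("ms", "", sub(".*s ", "", dur_str)))  # 188
duration <- secs + ms / 1000                                     # 11.188 seconds
Alternatively, if your MediaInfo build supports the --Inform option (e.g. --Inform="General;%Duration%"), it should print the duration in milliseconds directly and you can skip the regex work.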

This might be a bit low level, but if you're up to the task of parsing binary data, look up a copy of the AVI spec and figure out how to get both the number of video frames and the frame rate.
If you look at one of the AVI files in a hex editor, you will see a series of LIST chunks at the beginning. A little farther into this chunk will be a vids chunk. Immediately following vids should be a human-readable four-character code (FourCC) identifying the video codec, probably something like mjpg (MJPEG) or avc1 (H.264) for a camera. 20 bytes after that will be 4 bytes stored in little-endian notation that give the frame rate. Skip another 4 bytes, and the next 4 bytes will be another little-endian number giving the total number of video frames.
I'm looking at a sample AVI file right now where the numbers are: frame rate = 24 and # of frames = 0x37EB = 14315. This works out to 9m56s, which is accurate for this file.
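If you would rather do this from R than a hex editor, here is a rough sketch following the offsets described above. It is not a full AVI parser: it assumes the 'vids' stream header sits in the first 8 KB of the file and that the header's scale field is 1, so that the rate field read below is the frame rate directly (strictly, frame rate = rate / scale).
avi_duration_seconds <- function(path) {
  hdr <- readBin(path, what = "raw", n = 8192)       # header chunks live near the start
  pat <- charToRaw("vids")
  idx <- NA_integer_
  for (i in seq_len(length(hdr) - 3)) {              # locate the ASCII bytes "vids"
    if (identical(hdr[i:(i + 3)], pat)) { idx <- i; break }
  }
  if (is.na(idx)) stop("no 'vids' stream header found in the first 8 KB")
  u32 <- function(off) readBin(hdr[off:(off + 3)], what = "integer",
                               size = 4, endian = "little")
  rate   <- u32(idx + 24)   # 20 bytes past the start of the codec FourCC
  frames <- u32(idx + 32)   # skip 4 more bytes, then the total frame count
  frames / rate
}
For the sample file mentioned above, 14315 frames / 24 fps gives roughly 596.5 s, i.e. 9m56s.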

Read pdf page size on .NET Core

I have only one requirement: I need to read the PDF page size and determine whether any page is bigger than 17x17 inches, so that I don't send it to an external service which rejects such PDFs.
Is there any free library that works on .NET Core? I wasn't able to find one. Or has anyone implemented this by reading the binary file?
A PDF does not HAVE TO declare a page size up front, since every page can be a different size; thus 100 pages may be 100 different page sizes.
However, many PDFs contain a plain-text entry for one or more pages, so you can (depending on how the file was constructed) search the file as text for /MediaBox and/or /CropBox dimensions.
So the first PDF example I picked and opened to search for /MediaBox in WordPad tells me it is 210 mm x 297 mm (i.e. my local A4): /MediaBox [0 0 594.95996 841.91998], and for that 3-page file all 3 entries are the same.
You can try that from the command line as
type "filename.pdf" | find /i "/media"
but it may not work in all cases, so for a bigger chance of a result (but more chaff) use
type "filename.pdf" | findstr /i "^/media ^/crop"
The values are based on the default number of points per inch (so they can be divided by 72 as a rough guide); however, that's not really your aim, since you know you don't want anything more than 17 x 72 = 1224.
So in simple terms, if either value is over 1224 the file could be rejected as "TOO BIG".
HOWEVER, I also need to consider the two leading values (the box is [llx lly urx ury]): if a lower value is +100 the limit effectively becomes 100 more, and more importantly, if it is -100 then your desired 17" restriction already fails at 1124, because 1124 - (-100) = 1224.
So you can write a simple test in any method or language (even CMD); however, it would take too much expanding to cover all cases, SO:
Seriously, I would use / shell out to a one-line command-line tool like xpdf/poppler pdfinfo to parse all the different types of PDF, and then grep its output.
The output of the two is similar, with many lines, but for your need:
xpdf\pdfinfo -box filename
gives Page size: 594.96 x 841.92 pts (A4) (rotated 0 degrees)
and
poppler\pdfinfo -box filename
gives Page size: 594.96 x 841.92 pts (A4)
Thus, to check that the file does not exceed 17" in either direction, it should be easy to write a comparison testing that both values are under 1224.01.
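To illustrate that final comparison (not a .NET library, just the same shell-out-and-grep idea, sketched here in R to match the other examples on this page, and assuming pdfinfo is on the PATH; the identical two steps port to .NET Core via Process.Start):
too_big <- function(pdf, limit_pts = 17 * 72) {              # 17" x 72 pt/in = 1224 pt
  out  <- system(paste("pdfinfo -box", shQuote(pdf)), intern = TRUE)
  size <- grep("^Page size:", out, value = TRUE)             # e.g. "Page size: 594.96 x 841.92 pts (A4)"
  nums <- regmatches(size, gregexpr("[0-9]+\\.?[0-9]*", size))
  any(vapply(nums, function(v) any(as.numeric(v[1:2]) > limit_pts), logical(1)))
}
Note that by default pdfinfo reports the size of the first page only; check its -f/-l options if page sizes can vary within a file.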

Optimizing `data.table::fread` speed advice

I am experiencing read speeds that I believe are much slower than should be expected when trying to read a fairly large file in R with fread.
The file is ~60m rows x 147 columns, out of which I am only selecting 27 columns, directly in the fread call using select; only 23 of the 27 are found in the actual file. (Probably I inputted some of the strings incorrectly but I guess that matters less.)
data.table::fread("..\\TOI\\TOI_RAW_APextracted.csv",
verbose = TRUE,
select = cols2Select)
The system being used is an Azure VM with a 16-core Intel Xeon and 114 GB of RAM, running Windows 10.
I'm also using R 3.5.2, RStudio 1.2.1335 and data.table 1.12.0
I should also add that the file is a csv file that I have transferred onto the local drive of the VM, so there is no network / ethernet involved. I am not sure how Azure VMs work and what drives they use, but I would assume it's something equivalent to an SSD. Nothing else is running / being processed on the VM at the same time.
Please find below the verbose output of fread:
omp_get_max_threads() = 16
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 16 threads (omp_get_max_threads()=16, nth=16)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file ..\TOI\TOI_RAW_APextracted.csv
  File opened, size = 49.00GB (52608776250 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<"POLNO","ProdType","ProductCod>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=',' with 100 lines of 147 fields using quote rule 0
  Detected 147 columns on line 1. This line is either column names or first data row. Line starts as: <<"POLNO","ProdType","ProductCod>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 147
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because (52608776248 bytes from row 1 to eof) / (2 * 85068 jump0size) == 309216
  Type codes (jump 000) : A5AA5555A5AA5AAAA57777777555555552222AAAAAA25755555577555757AA5AA5AAAAA5555AAA2A...2222277555 Quote rule 0
  Type codes (jump 001) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777555577555777AA5AA5AAAAA7555AAAAA...2222277555 Quote rule 0
  Type codes (jump 002) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277555 Quote rule 0
  Type codes (jump 003) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277775 Quote rule 0
  Type codes (jump 010) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277775 Quote rule 0
  Type codes (jump 031) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775 Quote rule 0
  Type codes (jump 098) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775 Quote rule 0
  Type codes (jump 100) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775 Quote rule 0
  'header' determined to be true due to column 2 containing a string on row 1 and a lower type (int32) in the rest of the 10045 sample rows
  =====
  Sampled 10045 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 2 to the end of last row: 52608774311
  Line length: mean=956.51 sd=35.58 min=823 max=1063
  Estimated number of rows: 52608774311 / 956.51 = 55000757
  Initial alloc = 60500832 rows (55000757 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 124 drop user overrides : 05000005A0005AA0A0000770000077000A000A00000000770700000000000000A00A000000000000...0000000000
[10] Allocate memory for the datatable
  Allocating 23 column slots (147 - 124 dropped) with 60500832 rows
[11] Read the data
  jumps=[0..50176), chunk_size=1048484, total_size=52608774311
|--------------------------------------------------|
|==================================================|
  jumps=[0..50176), chunk_size=1048484, total_size=52608774311
|--------------------------------------------------|
|==================================================|
  Read 54964696 rows x 23 columns from 49.00GB (52608776250 bytes) file in 30:26.810 wall clock time
[12] Finalizing the datatable
  Type counts:
       124 : drop   '0'
         3 : int32  '5'
         7 : float64 '7'
        13 : string 'A'
=============================
   0.000s ( 0%) Memory map 48.996GB file
   0.035s ( 0%) sep=',' ncol=147 and header detection
   0.001s ( 0%) Column type detection using 10045 sample rows
   6.000s ( 0%) Allocation of 60500832 rows x 147 cols (9.466GB) of which 54964696 ( 91%) rows used
1820.775s (100%) Reading 50176 chunks (0 swept) of 1.000MB (each chunk 1095 rows) using 16 threads
   + 1653.728s ( 91%) Parse to row-major thread buffers (grown 32 times)
   +   22.774s ( 1%) Transpose
   +  144.273s ( 8%) Waiting
  24.545s ( 1%) Rereading 1 columns due to out-of-sample type exceptions
1826.810s        Total
Column 2 ("ProdType") bumped from 'int32' to 'string' due to <<"B810">> on row 14
Basically, I would like to find out whether this is just normal or whether there is anything I can do to improve these read speeds. Based on various benchmarks I've seen around, and my own experience and intuition with fread on smaller files, I would have expected this to be read much, much quicker.
Also, I was wondering whether the multi-core capabilities are being fully used, as I have heard that under Windows this might not always be straightforward. My knowledge around this topic is pretty limited, unfortunately, but it does appear from the verbose output that fread is detecting 16 cores.
Thoughts:
(1) If you are using Windows, use Microsoft R Open, even more so if the cloud is Azure; there may even be some coordination between R Open and the Azure client. Because of Intel's MKL and Microsoft's built-in enhancements, I find Microsoft R Open faster on Windows.
(2) I suspect select and drop are applied after a full file read. Maybe read the whole file and subset or filter afterwards.
(3) I think a restart is overkill. I run gc three times every so often, like this: gc();gc();gc(). I have heard others say this does nothing, but at least it makes me feel better. Actually, I notice it helps me on Windows.
(4) The latest versions of data.table's fread are implementing YAML support. This looks promising.
(5) setDTthreads(0) uses all the cores, but too much parallelization can work against you. Try halving your cores (see the sketch below).
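For what it's worth, a minimal sketch of point (5): check the thread count and cap it before the read (cols2Select as in the question; the cap of 8 is just the "half your cores" suggestion above, not a tuned value):
library(data.table)
getDTthreads()        # how many threads fread will currently use
setDTthreads(8)       # cap at half of the VM's 16 cores, per point (5)
dt <- fread("..\\TOI\\TOI_RAW_APextracted.csv",
            select = cols2Select,
            verbose = TRUE)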

Open only part of an image (JPEG/TIFF etc.) in R

I am analysing very large images in R, in the order of tens of thousands of pixels square. Unfortunately, even with 64 GB RAM, these images sometimes fail to fit into memory, and when they do I can only open one at a time, precluding parallelisation.
My current strategy is to load them using the jpeg or tiff packages, e.g.:
image <- readJPEG('image.jpg')
However, as I am only performing simple mathematical manipulations (summing, thresholding etc.) that could be done piece by piece, is it possible to open only part of an image at a time by specifying the dimensions to load? If so, I could write a loop to open 1024 x 1024 pixel tiles. The jpeg and tiff packages do not offer an option to do this.
If you are working with very large images, libvips is probably your best bet. You can shell out to it from R using system().
Your question is not very specific, but let's make a 10,000x10,000 pixel black-to-white gradient TIFF with ImageMagick:
convert -size 10000x10000 gradient: -depth 8 a.tif
Now threshold that at 50% with vips and check memory required:
vips im_thresh a.tif b.tif 128 --vips-leak
memory: high-water mark 292.21 MB
Pretty frugal, no? By comparison, the equivalent ImageMagick command requires 1.6GB of RAM:
/usr/bin/time -l convert a.tif -threshold 50% b.tif
Sample Output
...
1603895296 maximum resident set size
...
How about adding 64 to every pixel using im_gadd which does:
usage: vips im_gadd a in1 b in2 c out
where:
a is of type "double"
in1 is of type "image"
b is of type "double"
in2 is of type "image"
c is of type "double"
out is of type "image"
calculate a*in1 + b*in2 + c = outfile
So we use:
vips im_gadd 1 a.tif 0 b.tif 64 c.tif --vips-leak
memory: high-water mark 584.41 MB
Need to do some statistics?
vips im_stats c.tif
band minimum maximum sum sum^2 mean deviation
all 64 319 1.915e+10 4.20922e+12 191.5 73.6206
1 64 319 1.915e+10 4.20922e+12 191.5 73.6206
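To tie this back to R: these are plain command-line calls, so they can be shelled out from R with system(), as mentioned at the start of this answer (file names as in the examples above, and vips assumed to be on the PATH):
system("vips im_thresh a.tif b.tif 128")                 # threshold at 128, as above
system("vips im_gadd 1 a.tif 0 b.tif 64 c.tif")          # 1*a.tif + 0*b.tif + 64 -> c.tif
stats <- system("vips im_stats c.tif", intern = TRUE)    # statistics table as a character vector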
As it turns out, there is an R package, RBioFormats, that lets you specify which part of an image to open (though it is not available on CRAN). It can be installed from GitHub as follows:
source("https://bioconductor.org/biocLite.R")
biocLite("aoles/RBioFormats") # You might need to first run `install.packages("devtools")`
library(RBioFormats)
The dimensions of the image can be read from the metadata without having to open the image:
metadata <- read.metadata('image.tiff')
xdim <- metadata@.Data[[1]]$sizeX
ydim <- metadata@.Data[[1]]$sizeY
Suppose we want to load the top-left 512 x 512 pixels; we use the subset argument:
image <- read.image('image.tiff', subset = list(x = 1:512, y = 1:512))
From this it is trivial to write a loop to iteratively process a whole large image (see the sketch below). RBioFormats is an R interface to the Java Bio-Formats library and will open TIFFs, PNGs and JPEGs, as well as many proprietary imaging formats.
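A minimal sketch of such a loop, assuming xdim and ydim were read from the metadata as above and that the per-tile work is something simple like summing or thresholding:
tile <- 1024
for (x0 in seq(1, xdim, by = tile)) {
  for (y0 in seq(1, ydim, by = tile)) {
    xs  <- x0:min(x0 + tile - 1, xdim)        # clip the last tile at the image edge
    ys  <- y0:min(y0 + tile - 1, ydim)
    img <- read.image('image.tiff', subset = list(x = xs, y = ys))
    # ... sum, threshold, etc. on this tile and accumulate the result
  }
}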

Loading wikidata dump

I'm loading all geographic entries (Q56061) from the Wikidata JSON dump.
The whole dump contains about 16M entries according to the Wikidata:Statistics page.
Using python3.4 + ijson + libyajl2, it takes about 93 hours of CPU time (AMD Phenom II X4 945, 3 GHz) just to parse the file.
Using online sequential item queries for the 2.3M entries of interest takes about 134 hours in total.
Is there some more optimal way to perform this task?
(Maybe something like OpenStreetMap's PBF format and the osmosis tool.)
My loading code and estimations were wrong.
Using ijson.backends.yajl2_cffi gives about 15 hours for full parsing + filtering + storing to database.

Convert a DCR video file

I have a DCR video file (the file has a .dcr extension) coming from a video surveillance device (I don't know the make and model of the recorder).
I'm unable to read it with VLC or Media Player, it won't open in VirtualDub, and it can't be converted with the standard "ffmpeg.exe video.dcr output.avi" command line.
But I'm able to get a very basic** playback of the video stream with the MPC-HC player from the Combined Community Codec Pack. Unfortunately, the audio stream (which is what I'm after) will not play.
According to the MPC-HC player file info, I'm dealing with this:
General
Format : MPEG-4 Visual
File size : 459 MiB
Video
Format : MPEG-4 Visual
Format profile : Advanced Simple#L5
Format settings, BVOP : Yes
Format settings, QPel : No
Format settings, GMC : No warppoints
Format settings, Matrix : Default (H.263)
Muxing mode : Packed bitstream
Width : 640 pixels
Height : 480 pixels
Display aspect ratio : 4:3
Frame rate : 25.000 fps
Color space : YUV
Chroma subsampling : 4:2:0
Bit depth : 8 bits
Scan type : Progressive
Compression mode : Lossy
Writing library : XviD 64
** By very basic, I mean I can play the video, but not seek through it, and there is no keyframe at all in the video output.
Hopefully some of you guys will have dealt with DCR files from video surveillance equipment.
I would recommend downloading the latest version of the digital court player from http://www.bisdigital.com. This is the program that many courts use to view video from court hearings in .dcr format. You should be able to pause, rewind, fast forward, as well as control the speed of the video playback.
