I am analysing very large images in R, on the order of tens of thousands of pixels square. Unfortunately, even with 64 GB of RAM, these images sometimes fail to fit into memory, and even when they do fit I can only open one at a time, which precludes parallelisation.
My current strategy is to load them using the jpeg or tiff packages, e.g.:
image <- readJPEG('image.jpg')
However, as I am only performing simple mathematical manipulations (summing, thresholding, etc.) that could be done piece by piece, is it possible to open only part of an image at a time by specifying the dimensions to load? If so, I could write a loop to process 1024 x 1024 tiles. The jpeg and tiff packages do not offer an option to do this.
If you are working with very large images, libvips is probably your best bet. You can shell out to it from R using system().
Your question is not very specific, but let's make a 10,000x10,000 pixel black-to-white gradient TIFF with ImageMagick:
convert -size 10000x10000 gradient: -depth 8 a.tif
Now threshold that at 50% with vips and check memory required:
vips im_thresh a.tif b.tif 128 --vips-leak
memory: high-water mark 292.21 MB
Pretty frugal, no? By comparison, the equivalent ImageMagick command requires 1.6GB of RAM:
/usr/bin/time -l convert a.tif -threshold 50% b.tif
Sample Output
...
1603895296 maximum resident set size
...
How about adding 64 to every pixel using im_gadd, which does:
usage: vips im_gadd a in1 b in2 c out
where:
a is of type "double"
in1 is of type "image"
b is of type "double"
in2 is of type "image"
c is of type "double"
out is of type "image"
calculate a*in1 + b*in2 + c = outfile
So we use:
vips im_gadd 1 a.tif 0 b.tif 64 c.tif --vips-leak
memory: high-water mark 584.41 MB
Need to do some statistics?
vips im_stats c.tif
band minimum maximum sum sum^2 mean deviation
all 64 319 1.915e+10 4.20922e+12 191.5 73.6206
1 64 319 1.915e+10 4.20922e+12 191.5 73.6206
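If you would rather drive this from R than from a shell, a minimal sketch (assuming the vips binary is installed and on your PATH) is simply to wrap the same command with system2():
system2("vips", c("im_thresh", "a.tif", "b.tif", "128", "--vips-leak"))
vips streams the image through in small tiles itself, so R never has to hold the full image in memory.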
As it turns out, there is an R package, RBioFormats, that allows you to specify the part of an image to be opened (though it is not available on CRAN). It can be installed from GitHub as follows:
source("https://bioconductor.org/biocLite.R")
biocLite("aoles/RBioFormats") # You might need to first run `install.packages("devtools")`
library(RBioFormats)
The dimensions of the image can be read from the metadata without having to open the image:
metadata <- read.metadata('image.tiff')
xdim <- metadata@.Data[[1]]$sizeX
ydim <- metadata@.Data[[1]]$sizeY
Suppose that we want to load the top-left 512 x 512 pixels; we use the subset argument:
image <- read.image('image.tiff', subset = list(X = 1:512, Y = 1:512))
From this it is trivial to write a loop to iteratively process a whole large image. RBioFormats is an R interface to the Java Bio-Formats library and will open TIFFs, PNGs and JPEGs, as well as many proprietary imaging formats.
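For instance, a rough sketch of such a loop (the 1024-pixel tile size and the processing step are placeholders; the subset naming follows the example above):
tile <- 1024
for (x0 in seq(1, xdim, by = tile)) {
  for (y0 in seq(1, ydim, by = tile)) {
    xs <- x0:min(x0 + tile - 1, xdim)
    ys <- y0:min(y0 + tile - 1, ydim)
    img <- read.image('image.tiff', subset = list(X = xs, Y = ys))
    # process the tile here (e.g. sum it or threshold it) and store the result
  }
}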
Please let me know any other system/code I need to include, as I am not very familiar with writing out images to my computer. I am creating 360 PNG files as follows:
for (theta in 1:360) {
  ic <- as.character(theta)
  if (theta < 10) ic <- paste("00", ic, sep = "")
  if (theta >= 10 & theta < 100) ic <- paste("0", ic, sep = "")  # make filenames the same length
  fn <- paste("c:iris360\\HW4_", ic, ".png", sep = "")  # filename
  png(fn, width = 1000, height = 1000)  # save as *.png
  p3(X1, X2, r = 100, theta = theta, mainL = paste("theta =", theta))
  # legend("topleft", pch = 16, cex = 1.5, col = allcl)
  dev.off()
}
system("magick c:iris360\\HW4*.png c:iris.gif")
where p3 is just a function that takes my matrices X1 and X2 and plots the points and their segments (let me know if I need to include it as well). However, I get this error:
magick: must specify image size `iris360HW4*.png' @ error/raw.c/ReadRAWImage/140.
I am unable to open the GIF file, as my Mac says it is damaged or uses a file format that Preview does not recognize.
Update 1: I replaced fn's declaration with
fn <- sprintf("c:iris360/HW4_%03i.png", theta)
as well as replacing ic with sprintf("%03i", theta) everywhere it appeared, but I still got the same "specify image size" error.
When I run the system command directly in my terminal, I still get the same error asking me to specify the image size.
Magick needs to know several things (e.g., image size, delay between frames, images to use, destination file name) in order to convert a stack of PNGs into a GIF. See GIF Animations and Animation Meta-data.
magick -delay 100 -size 100x100 xc:SkyBlue \
-page +5+10 balloon.gif -page +35+30 medical.gif \
-page +62+50 present.gif -page +10+55 shading.gif \
-loop 0 animation.gif
So it looks like you need to change
system("magick c:iris360\\HW4*.png c:iris.gif")
to something more like
system("magick -delay 10 -size 100x100 —loop 0 c:iris360\\HW4*.png c:iris.gif")
I have only one requirement: I need to read the PDF page size and determine whether a page is bigger than 17 x 17 inches, so as not to send it to an external service that rejects such PDFs.
Is there any free library that works on .NET Core? I wasn't able to find one. Or has anyone implemented this by reading the binary file directly?
A PDF does not HAVE TO declare a single page size globally, since every page can be a different size; thus 100 pages may have 100 different page sizes.
However, many PDFs contain a plain-text entry for one or more pages, so you can (depending on how the file was constructed) parse it as text for the /MediaBox and/or /CropBox dimensions.
So the first example PDF I pick on and open in WordPad to search for /MediaBox tells me it is 210 mm x 297 mm (i.e. my local A4): /MediaBox [0 0 594.95996 841.91998], and for a 3-page file all 3 entries are the same.
You can try that from the command line with
type "filename.pdf" | find /i "/media"
but that may not work in all cases, so for a better chance of a result (with more chaff) use
type "filename.pdf" | findstr /i "^/media ^/crop"
The values are in points, with 72 points per inch by default (so they can be divided by 72 as a rough guide); however, that's not quite your aim, since you know you don't want more than 17 x 72 = 1224.
So in simple terms, if either value is over 1224 I can reject the file as "TOO BIG".
HOWEVER, I also need to consider those two 0 values (the box origin): if one were +100 then the limit becomes 100 more, and more importantly, if one were -100 then your desired 17" restriction would already be exceeded at 1124.
So you can write a simple test in any method or language (even CMD); however, that would require too much expanding to cover all cases, so:
Seriously, I would use / shell out to a one-line command-line tool like xpdf/poppler pdfinfo to parse all the different types of PDF, and then grep its output.
The output is similar for both, with many lines, but for your need:
xpdf\pdfinfo -box filename
gives Page size: 594.96 x 841.92 pts (A4) (rotated 0 degrees)
and
poppler\pdfinfo -box filename
gives Page size: 594.96 x 841.92 pts (A4)
Thus, to check that the file does not exceed 17" in either direction, it should be easy to set up a comparison testing that both values are under 1224.01.
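The check itself is language-agnostic; as an illustration, here is a rough sketch in R that shells out to pdfinfo (assuming pdfinfo is installed and that its "Page size:" line looks like the output above):
out  <- system2("pdfinfo", c("-box", "filename.pdf"), stdout = TRUE)
sz   <- grep("^Page size:", out, value = TRUE)[1]   # note: this looks at the first page's box only
dims <- as.numeric(regmatches(sz, gregexpr("[0-9.]+", sz))[[1]][1:2])
if (any(dims >= 1224.01)) message("TOO BIG") else message("OK")  # 17 in x 72 pt = 1224 pt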
I am experiencing read speeds that I believe are much slower than should be expected when trying to read a fairly large file in R with fread.
The file is ~60m rows x 147 columns, out of which I am only selecting 27 columns, directly in the fread call using select; only 23 of the 27 are found in the actual file. (Probably I typed some of the column names incorrectly, but I guess that matters less.)
data.table::fread("..\\TOI\\TOI_RAW_APextracted.csv",
verbose = TRUE,
select = cols2Select)
The system being used is an Azure VM with a 16-core Intel Xeon and 114 GB of RAM, running Windows 10.
I'm also using R 3.5.2, RStudio 1.2.1335 and data.table 1.12.0
I should also add that the file is a csv file that I have transferred onto the local drive of the VM, so there is no network / ethernet involved. I am not sure how Azure VMs work and what drives they use, but I would assume it's something equivalent to an SSD. Nothing else is running / being processed on the VM at the same time.
Please find below the verbose output of fread:
omp_get_max_threads() = 16
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 16 threads (omp_get_max_threads()=16, nth=16)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file ..\TOI\TOI_RAW_APextracted.csv
  File opened, size = 49.00GB (52608776250 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<"POLNO","ProdType","ProductCod>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=',' with 100 lines of 147 fields using quote rule 0
  Detected 147 columns on line 1. This line is either column names or first data row. Line starts as: <<"POLNO","ProdType","ProductCod>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 147
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because (52608776248 bytes from row 1 to eof) / (2 * 85068 jump0size) == 309216
  Type codes (jump 000) : A5AA5555A5AA5AAAA57777777555555552222AAAAAA25755555577555757AA5AA5AAAAA5555AAA2A...2222277555  Quote rule 0
  Type codes (jump 001) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777555577555777AA5AA5AAAAA7555AAAAA...2222277555  Quote rule 0
  Type codes (jump 002) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277555  Quote rule 0
  Type codes (jump 003) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  Type codes (jump 010) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  Type codes (jump 031) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  Type codes (jump 098) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  Type codes (jump 100) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  'header' determined to be true due to column 2 containing a string on row 1 and a lower type (int32) in the rest of the 10045 sample rows
  =====
  Sampled 10045 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 2 to the end of last row: 52608774311
  Line length: mean=956.51 sd=35.58 min=823 max=1063
  Estimated number of rows: 52608774311 / 956.51 = 55000757
  Initial alloc = 60500832 rows (55000757 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 124 drop user overrides : 05000005A0005AA0A0000770000077000A000A00000000770700000000000000A00A000000000000...0000000000
[10] Allocate memory for the datatable
  Allocating 23 column slots (147 - 124 dropped) with 60500832 rows
[11] Read the data
  jumps=[0..50176), chunk_size=1048484, total_size=52608774311
  |--------------------------------------------------|
  |==================================================|
  jumps=[0..50176), chunk_size=1048484, total_size=52608774311
  |--------------------------------------------------|
  |==================================================|
  Read 54964696 rows x 23 columns from 49.00GB (52608776250 bytes) file in 30:26.810 wall clock time
[12] Finalizing the datatable
  Type counts:
       124 : drop      '0'
         3 : int32     '5'
         7 : float64   '7'
        13 : string    'A'
=============================
   0.000s (  0%) Memory map 48.996GB file
   0.035s (  0%) sep=',' ncol=147 and header detection
   0.001s (  0%) Column type detection using 10045 sample rows
   6.000s (  0%) Allocation of 60500832 rows x 147 cols (9.466GB) of which 54964696 ( 91%) rows used
1820.775s (100%) Reading 50176 chunks (0 swept) of 1.000MB (each chunk 1095 rows) using 16 threads
   + 1653.728s ( 91%) Parse to row-major thread buffers (grown 32 times)
   +   22.774s (  1%) Transpose
   +  144.273s (  8%) Waiting
  24.545s (  1%) Rereading 1 columns due to out-of-sample type exceptions
1826.810s        Total
Column 2 ("ProdType") bumped from 'int32' to 'string' due to <<"B810">> on row 14
Basically, I would like to find out whether this is just normal or whether there is anything I can do to improve these reading speeds. Based on various benchmarks I've seen around, and on my own experience and intuition with fread on smaller files, I would have expected this to be read much more quickly.
Also I was wondering if the multi-core capabilities are fully being used, as I have heard that under Windows this might not always be straightforward. My knowledge around this topic is pretty limited unfortunately, but it does appear from the verbose output that fread is detecting 16 cores.
Thoughts:
(1) If you are using Windows, use Microsoft R Open, even more so if the cloud is Azure; there may be some coordination between R Open and the Azure client. Because of Intel's MKL and Microsoft's built-in enhancements, I find Microsoft R Open faster on Windows.
(2) I suspect 'select' and 'drop' are applied after a full file read. Maybe read the whole file and subset or filter afterwards.
(3) I think a restart is overkill. I run gc thrice every so often, like this: gc();gc();gc(); I have heard others say this does nothing, but at least it makes me feel better. Actually, I notice it helps me on Windows.
(4) The latest versions of data.table's fread are implementing 'YAML'. This looks promising.
(5) setDTthreads(0) uses all the cores, but too much parallelisation can work against you; try halving your cores (see the sketch after this list).
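For (5), a minimal sketch (the thread count is illustrative):
library(data.table)
setDTthreads(8)               # half of the 16 cores; setDTthreads(0) would use them all
getDTthreads(verbose = TRUE)  # confirm how many threads fread will actually use
DT <- fread("..\\TOI\\TOI_RAW_APextracted.csv",
            select = cols2Select,
            verbose = TRUE)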
I want to calculate the space left on my embedded target.
The Arduino IDE shows this in the output window:
Sketch uses 9544 bytes (3%) of program storage space. Maximum is 262144 bytes.
avr-size has a -C option that reports usage as a percentage:
$ avr-size -C --mcu=atmega32u4 build/myproject.hex
AVR Memory Usage
----------------
Device: atmega32u4
Program: 8392 bytes (25.6% Full)
(.text + .data + .bootloader)
Data: 2196 bytes (85.8% Full)
(.data + .bss + .noinit)
However, I'm actually writing a CMake file to develop code for an Arduino board with an Arm Cortex M0 CPU, so I use arm-none-eabi-size, which shows the code size like this:
[100%] Built target hex
text data bss dec hex filename
8184 208 1988 10380 288c build/myproject
[100%] Built target size
*** Finished ***
Is there a way to calculate the program and data space left on the device? Or do I need to regex the output and calculate percent of a hard-coded value?
If you are using the arm-none-eabi toolchain, you can add the linker option -Wl,--print-memory-usage, which prints RAM and flash usage as percentages. The output looks like this:
Memory region Used Size Region Size %age Used
RAM: 8968 B 20 KB 43.79%
FLASH: 34604 B 128 KB 26.40%
I am using a Makefile generated by CubeMX; to enable this printout, I added the option at the end of the LDFLAGS line. For CMake, this thread might be useful.
Because I constantly reach the memory limit in my R session (8 GB Windows PC), I have started removing the big objects I have loaded. However, once I reach this limit, removing objects seems not to help.
So, I was wondering if there's a way to get the R session size. I know that it's possible to retrieve an object's size (as seen in this thread). I want to know whether there's a way to measure the complete R session size, though (loaded packages, objects, etc.).
Thank you!
I personally use this function to get the available memory:
getAvailMem <- function(format = TRUE) {
  gc()
  if (Sys.info()[["sysname"]] == "Windows") {
    # memory.limit() and memory.size() report megabytes, so convert to bytes
    memfree <- 1024^2 * (utils::memory.limit() - utils::memory.size())
  } else {
    # On Linux, read the free memory (reported in kB) from /proc/meminfo
    # http://stackoverflow.com/a/6457769/6103040
    memfree <- 1024 * as.numeric(
      system("awk '/MemFree/ {print $2}' /proc/meminfo", intern = TRUE))
  }
  `if`(format, format(structure(memfree, class = "object_size"),
                      units = "auto"), memfree)
}
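A quick usage example (the reported value is of course machine-dependent):
getAvailMem()
#> 5.2 Gb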
To get the total memory used by R, you may try mem_used() from the pryr package. Unlike memory.size, it is not OS-dependent, because it uses the R function gc() underneath. Have a look at the function body, and also at pryr:::node_size and pryr:::show_bytes.
pryr::mem_used()
The help file ?pryr::mem_used describes
R breaks down memory usage into Vcells (memory used by vectors) and
Ncells (memory used by everything else). However, neither this
distinction nor the "gc trigger" and "max used" columns are typically
important. What we're usually most interested in is the first
column: the total memory used. This function wraps around gc() to
return the total amount of memory (in megabytes) currently used by R.
You can also use pryr::mem_change to track changes in the memory used by your R code. Try the example in its documentation page.
The numbers such as 28L and 56L used for the node size in pryr:::node_size come from the help file of ?gc, which describes:
gc returns a matrix with rows "Ncells" (cons cells), usually 28 bytes
each on 32-bit systems and 56 bytes on 64-bit systems, and "Vcells"
(vector cells, 8 bytes each),
After removing a large object, run gc() to free the memory.
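For example (the object name and size are purely illustrative):
x <- numeric(1e8)   # roughly 800 MB of doubles
pryr::mem_used()
rm(x)               # drop the only reference to the object
gc()                # let R release the memory and report the new usage
pryr::mem_used()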