Classify a large collection of image files in R

I have a large collection of image files for a book, and the publisher wants a list where files are classified by "type" (greyscale graph, b/w halftone image, color image, line drawing, etc.). This is a hard problem in general, but perhaps I can do some of it automatically using image-processing tools, e.g., ImageMagick via the R magick package.
I think ImageMagick is the right tool, but I don't really know how to use it for this purpose.
What I have is just a list of fig numbers & file names:
1.1 ch01-intro/fig/alcohol-risk.jpg
1.2 ch01-intro/fig/measels.png
1.3 ch01-intro/fig/numbers.png
1.4 ch01-intro/fig/Lascaux-bull-chamber.jpg
...
Can someone help get me started?
Edit: This was probably an ill-framed or overly broad question as initially stated. I thought that ImageMagick identify or the R magick::image_info() function could help, so the initial question perhaps should have been: "How to extract image information from a list of files [in R]". I can pose that separately, if not already asked.
An initial attempt at this gave me the following for my first few images:
library(magick)

# `files`: character vector of image paths (from the figure list above)
# initialize an empty data frame to hold the results of `image_info`
figinfo <- data.frame(
  format = character(),
  width = numeric(),
  height = numeric(),
  colorspace = character(),
  matte = logical(),
  filesize = numeric(),
  density = character(),
  stringsAsFactors = FALSE
)

for (i in seq_along(files)) {
  img <- image_read(files[i])
  info <- image_info(img)
  figinfo[i, ] <- info
}
I get:
> figinfo
format width height colorspace matte filesize density
1 JPEG 661 733 sRGB FALSE 41884 72x72
2 PNG 838 591 sRGB TRUE 98276 38x38
3 PNG 990 721 sRGB TRUE 427253 38x38
4 JPEG 798 219 sRGB FALSE 99845 300x300
I conclude that this doesn't help much in answering the question I posed, of how to classify these images.
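Aside (my addition, not in the original post): image_read() accepts a vector of paths, so the same table can be built without the explicit loop. A minimal sketch, assuming all the files are readable:
library(magick)
# image_read() reads all files into one magick-image object;
# image_info() then returns one row per image
figinfo <- image_info(image_read(files))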
Edit 2: Before closing this question: the advice to look into direct use of ImageMagick identify was helpful. https://imagemagick.org/script/escape.php
In particular, %[type] is closer to what I need. It is not exposed in magick::image_info(), so I may have to write a shell script or call system() in a loop.
For the record, here is how I can extract the relevant attributes of these image files using identify directly.
# Get image characteristics via ImageMagick identify
# from: https://imagemagick.org/script/escape.php
#
# -format elements:
#   %m       image file format
#   %f       filename
#   %[type]  image type
#   %r       image class and colorspace
#   %k       number of unique colors
#   %h       image height in pixels
#   %w       image width in pixels
identify -format "%m,%f,%[type],%r,%k,%hx%w" imagefile
> identify -format "%m,%f,%[type],%r,%k,%hx%w" Quipu.png
PNG,Quipu.png,GrayscaleAlpha,DirectClass Gray Matte,16,449x299
The %[type] attribute takes me towards what I want.

To close this question:
In an R context, I was successful in using system(..., intern=TRUE) for this task, as follows, with some manual fixups:
# Use identify directly via system()
# function to run identify for one file
get_info <- function(file) {
  cmd <- 'identify -quiet -format "%f,%m,%[type],%r,%h,%w,%x,%y"'
  info <- system(paste(cmd, file), intern = TRUE)
  unlist(strsplit(info, ","))
}
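# Caveat (my addition, not in the original): paste(cmd, file) breaks on
# paths containing spaces; wrapping the path as shQuote(file) guards
# against that.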
# Note: row assignment below stores everything as character, so the
# numeric columns are converted explicitly afterwards
figinfo <- data.frame(
  filename = character(),
  format = character(),
  type = character(),
  class = character(),
  height = numeric(),
  width = numeric(),
  xres = numeric(),
  yres = numeric(),
  stringsAsFactors = FALSE
)
for (i in seq_along(files)) {
  info <- get_info(files[i])
  # manual fixup: strip the redundant "DirectClass " prefix from the class field
  info[4] <- sub("DirectClass ", "", info[4])
  figinfo[i, ] <- info
}
figinfo$height <- as.numeric(figinfo$height)
figinfo$width <- as.numeric(figinfo$width)
figinfo$xres <- round(as.numeric(figinfo$xres))
figinfo$yres <- round(as.numeric(figinfo$yres))
Then I have more or less what I want:
> str(figinfo)
'data.frame': 161 obs. of 8 variables:
$ filename: chr "mileyears4.png" "alcohol-risk.jpg" "measels.png" "numbers.png" ...
$ format : chr "PNG" "JPEG" "PNG" "PNG" ...
$ type : chr "Palette" "TrueColor" "TrueColorAlpha" "TrueColorAlpha" ...
$ class : chr "PseudoClass sRGB " "sRGB " "sRGB Matte" "sRGB Matte" ...
$ height : num 500 733 591 721 219 ...
$ width : num 720 661 838 990 798 ...
$ xres : num 72 72 38 38 300 38 300 38 28 38 ...
$ yres : num 72 72 38 38 300 38 300 38 28 38 ...
>
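From here, a crude first pass at the classification itself could key off the type column. A minimal sketch; the mapping below from ImageMagick's %[type] values to the publisher's categories is my own assumption and would need manual review:
# Sketch: map %[type] onto rough publication categories (assumed mapping)
figinfo$category <- with(figinfo, ifelse(
  grepl("Bilevel", type), "line drawing (b/w)",
  ifelse(grepl("Grayscale", type), "greyscale or b/w halftone",
         "color image")))
table(figinfo$category)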

Related

Split a PDF file into multiple files of 2 pages each in R

I have a PDF document with 300 pages. I need to split this file into 150 files, each containing 2 pages. For example, the 1st file would contain pages 1 & 2 of the original, the 2nd file pages 3 & 4, and so on.
Maybe I can use the "pdftools" package, but I don't know how.
1) pdftools. Assuming that the input PDF is in the current directory and the outputs are to go into the same directory, change the inputs below, get the number of pages num, compute the st and en vectors of start and end page numbers, and repeatedly call pdf_subset. Note that the pdf_length and pdf_subset functions come from the qpdf R package but are also made available by the pdftools R package, which imports and re-exports them.
library(pdftools)

# inputs
infile <- "a.pdf"  # input pdf
prefix <- "out_"   # output pdfs will begin with this prefix

num <- pdf_length(infile)
st <- seq(1, num, 2)
en <- pmin(st + 1, num)
for (i in seq_along(st)) {
  outfile <- sprintf("%s%0*d.pdf", prefix, nchar(num), i)
  pdf_subset(infile, pages = st[i]:en[i], output = outfile)
}
2) pdfbox. The Apache pdfbox utility can split into files of 2 pages each. Download the .jar command-line utilities file from pdfbox and make sure you have Java installed. Then run this, assuming your input file is a.pdf in the current directory (or run the quoted part directly from the command line, without the quotes and without R). The jar file name below may need to be changed if a later version is used; the one named is the latest at the time of writing (not counting alpha versions).
system("java -jar pdfbox-app-2.0.26.jar PDFSplit -split 2 a.pdf")
3) animation/pdftk. Another option is to install the pdftk program, change the inputs at the top of the script below, and run it. This gets the number of pages in the input, num, using pdftk, then computes the start and end page numbers, st and en, and invokes pdftk repeatedly, once for each st/en pair, to extract those pages into another file.
library(animation)

# inputs
PDFTK <- "~/../bin/pdftk.exe"  # path to pdftk
infile <- "a.pdf"              # input pdf
prefix <- "out_"               # output pdfs will begin with this prefix

ani.options(pdftk = Sys.glob(PDFTK))

tmp <- tempfile()
dump_data <- pdftk(infile, "dump_data", tmp)
g <- grep("NumberOfPages", readLines(tmp), value = TRUE)
num <- as.numeric(sub(".* ", "", g))

st <- seq(1, num, 2)
en <- pmin(st + 1, num)
for (i in seq_along(st)) {
  outfile <- sprintf("%s%0*d.pdf", prefix, nchar(num), i)
  pdftk(infile, sprintf("cat %d-%d", st[i], en[i]), outfile)
}
Neither pdftools nor qpdf (on which the former depends) supports splitting PDF files by anything other than "every page". You will likely need to rely on an external program; I'm confident you can get pdftk to do it by calling it once for each 2-page output.
I have a 36-page PDF here named quux.pdf in the current working directory.
str(pdftools::pdf_info("quux.pdf"))
# List of 11
# $ version : chr "1.5"
# $ pages : int 36
# $ encrypted : logi FALSE
# $ linearized : logi FALSE
# $ keys :List of 8
# ..$ Producer : chr "pdfTeX-1.40.24"
# ..$ Author : chr ""
# ..$ Title : chr ""
# ..$ Subject : chr ""
# ..$ Creator : chr "LaTeX via pandoc"
# ..$ Keywords : chr ""
# ..$ Trapped : chr ""
# ..$ PTEX.Fullbanner: chr "This is pdfTeX, Version 3.141592653-2.6-1.40.24 (TeX Live 2022) kpathsea version 6.3.4"
# $ created : POSIXct[1:1], format: "2022-05-17 22:54:40"
# $ modified : POSIXct[1:1], format: "2022-05-17 22:54:40"
# $ metadata : chr ""
# $ locked : logi FALSE
# $ attachments: logi FALSE
# $ layout : chr "no_layout"
I also have pdftk installed and available on the PATH:
Sys.which("pdftk")
# pdftk
# "C:\\PROGRA~2\\PDFtk Server\\bin\\pdftk.exe"
With this, I can run an external script to create 2-page PDFs:
list.files(pattern = "pdf$")
# [1] "quux.pdf"
pages <- seq(pdftools::pdf_info("quux.pdf")$pages)
pages <- split(pages, (pages - 1) %/% 2)
pages[1:3]
# $`0`
# [1] 1 2
# $`1`
# [1] 3 4
# $`2`
# [1] 5 6
for (pg in pages) {
  system(sprintf("pdftk quux.pdf cat %s-%s output out_%02i-%02i.pdf",
                 min(pg), max(pg), min(pg), max(pg)))
}
list.files(pattern = "pdf$")
# [1] "out_01-02.pdf" "out_03-04.pdf" "out_05-06.pdf" "out_07-08.pdf"
# [5] "out_09-10.pdf" "out_11-12.pdf" "out_13-14.pdf" "out_15-16.pdf"
# [9] "out_17-18.pdf" "out_19-20.pdf" "out_21-22.pdf" "out_23-24.pdf"
# [13] "out_25-26.pdf" "out_27-28.pdf" "out_29-30.pdf" "out_31-32.pdf"
# [17] "out_33-34.pdf" "out_35-36.pdf" "quux.pdf"
str(pdftools::pdf_info("out_01-02.pdf"))
# List of 11
# $ version : chr "1.5"
# $ pages : int 2
# $ encrypted : logi FALSE
# $ linearized : logi FALSE
# $ keys :List of 2
# ..$ Creator : chr "pdftk 2.02 - www.pdftk.com"
# ..$ Producer: chr "itext-paulo-155 (itextpdf.sf.net-lowagie.com)"
# $ created : POSIXct[1:1], format: "2022-05-18 09:37:56"
# $ modified : POSIXct[1:1], format: "2022-05-18 09:37:56"
# $ metadata : chr ""
# $ locked : logi FALSE
# $ attachments: logi FALSE
# $ layout : chr "no_layout"

Writing a loop to go through a large list with sublists and save these sublists

I would like to extract data from a large list with many sub-lists, called summary10: https://www.dropbox.com/s/uiair94p0v7z2zr/summary10.csv?dl=0
This file is a compilation of dose-response curve fits by patient and drug. I am sharing a small file with just 10 patients and 105 drugs, with x and y as the readout for each fit, each 100 points long.
I would like to save all the fits for each patient and every drug in a separate file.
I tried to write the list into a data frame to use the tidyverse but didn't manage. I have only started out with R, so this is very complex for me.
for (i in 1:length(summary10)) {
  for (j in 1:length(summary10[[i]])) {
    x1 <- summary10[[i]][[j]][[1]]
    y1 <- summary10[[i]][[j]][[2]]
    print(summary10[[i]][[j]])
  }
}
The loop works, but I don't know how to save the results in separate files in a way that lets me know what is what. I tried something I found online, but it doesn't work:
for (i in 1:length(summary10)) {
  for (j in 1:length(summary10[[i]])) {
    x1 <- summary10[[i]][[j]][[1]]
    y1 <- summary10[[i]][[j]][[2]]
    resp <- cbind(x1, y1)
    write.csv(resp, file = paste0(summary[[i]], ".-csv"), row.names = FALSE)
  }
}
Error in file(file, ifelse(append, "a", "w")) : invalid 'description' argument
In addition:
Warning message: In if (file == "") file <- stdout() else if (is.character(file)) { : the condition has length > 1 and only the first element will be used
It's really hard to anticipate what goes wrong when we cannot see how you made summary10. There is no way I am going to guess how you got from your tabular file to a list of lists (or whatever summary10 may be).
But in the end, your error indicates that you are providing an illicit filename in the file = paste0(summary[[i]], ".-csv") argument. First tip on debugging is simply printing to console. Try this on for size:
resp <- cbind(x1, y1)
cat(paste0(summary[[i]], ".-csv"), '\n')  # <-----
# use `cat` to print to the console the contents of your expressions
write.csv(resp, file = paste0(summary[[i]], ".-csv"), row.names = FALSE)
What is it? It should evaluate to a simple string, say B.M.21.S.-csv, but that might not be the case.
At first glance, I would guess you've misspelled your variable: summary is usually a function, whereas you are probably looking for summary10. Still, the i'th element of summary10 looks like it could be a list itself, so your expression will fail to produce a simple string.
Update with summary10
I always recommend using str to examine the structure of an object. For lists, use the argument max.level to avoid printing endless nested lists:
> str(summary10, max.level=1)
List of 10
$ B-HR-25 :List of 106
$ B-SR-22 :List of 106
$ B-VHR-01:List of 106
$ B-SR-23 :List of 106
$ B-SR-24 :List of 106
$ B-HR-21 :List of 106
$ B-M-21 :List of 106
$ B-SR-21 :List of 106
$ B-MR-01 :List of 106
$ B-M-01 :List of 106
And then a step further in:
> str(summary10[[1]], max.level=2)
List of 106
$ PP242 :List of 2
..$ x: num [1:100] 1 1.1 1.2 1.32 1.45 ...
..$ y: num [1:100] 0.923 0.922 0.921 0.92 0.919 ...
$ AZD8055 :List of 2
..$ x: num [1:100] 1 1.1 1.2 1.32 1.45 ...
..$ y: num [1:100] 0.953 0.953 0.953 0.952 0.952 ...
So object summary10 is a collection of patients (lists of lists); summary10[1] is the collection containing the first patient, summary10[[1]] the first patient (a list itself) with their responses to drugs.
So what happens when you try to make a filename from summary10[[i]]? Try it, I won't print the output here. Back to str(summary10), the patients' designations ("B-HR-25", etc.) are the names of the entries. Get them with names(summary10). As an exercise, compare names(summary10), names(summary10)[1], names(summary10[1]) and names(summary10[[1]]).
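For completeness, a minimal sketch of how the writing loop might look once the filenames are built with names() (my sketch, assuming the structure shown by str() above):
patients <- names(summary10)
for (i in seq_along(summary10)) {
  drugs <- names(summary10[[i]])
  for (j in seq_along(summary10[[i]])) {
    resp <- data.frame(x = summary10[[i]][[j]]$x,
                       y = summary10[[i]][[j]]$y)
    # one file per patient/drug pair, e.g. "B-HR-25_PP242.csv"
    write.csv(resp, file = paste0(patients[i], "_", drugs[j], ".csv"),
              row.names = FALSE)
  }
}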

Define attribute classes during readOGR

Is there any way to declare the data type of the attribute columns when importing, for example, an ESRI Shapefile with the readOGR command?
For example, I would like to keep the leading zeros in my key column (id_code):
example <- readOGR("example.shp", "example")
str(example@data)
#'data.frame': 7149 obs. of 22 variables:
# $ id_code: num 101 102 103 104 105 106 107 108 109 110 ...
The result should be something like this:
str(example@data)
#'data.frame': 7149 obs. of 22 variables:
# $ id_code: chr "0101" "0102" "0103" "0104" "0105" "0106"...
I am looking for something similar to colClasses in the read.csv() function.
Yes, you can declare the data type when importing by specifying the encoding, ogrDrivers and use_iconv options in readOGR.
Please see ?readOGR.
From the documentation for the encoding option:
default NULL; if set to a character string, and the driver is “ESRI Shapefile”, and use_iconv is FALSE, it is passed to the CPL Option “SHAPE_ENCODING” immediately before reading the DBF of a shapefile. If use_iconv is TRUE, and encoding is not NULL, it will be used to convert input strings from the given value to the native encoding for the system/platform.
You may also want to look into ogrInfo.
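If the id_code column still comes in as numeric, one workaround (my sketch, assuming the codes are always four digits wide) is to reformat the column after import:
library(rgdal)
example <- readOGR("example.shp", "example")
# restore leading zeros; assumes id_code is always 4 digits wide
example@data$id_code <- sprintf("%04d", as.integer(example@data$id_code))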

Excel misses values from numeric vector with write.table function in R

I am trying to save a data.frame from R so that it can be read in Excel. I have done this with several other data.frames that have the same structure as this one, so far without problems. But for some reason, when I save this data.frame and then open it with Excel, many of the numerical values in the columns FreqDev and LengthDev are not displayed; instead, the rows show a string of "#" symbols.
My data.frame looks like this:
head(RegPartV)
LogFreq Word PhonCV WordClass FreqDev LengthDev Irregular
1277 28.395 geweest CV-CVVCC V 5.464336 -1.1518498 FALSE
903 25.647 gemaakt CV-CVVCC V 4.885296 -1.1518498 FALSE
752 23.304 gehad CV-CVC V 4.391595 -2.1100420 FALSE
610 22.765 gebracht CV-CCVCC V 4.278021 -0.6727537 FALSE
1312 22.041 gezegd CV-CVCC V 4.125465 -1.6309459 FALSE
647 21.987 gedaan CV-CVVC V 4.114086 -1.6309459 FALSE
The type of information in the data.frame is:
str(RegPartV)
'data.frame': 2096 obs. of 7 variables:
$ LogFreq : num 28.4 25.6 23.3 22.8 22 ...
$ Word : chr "geweest" "gemaakt" "gehad" "gebracht" ...
$ PhonCV : chr "CV-CVVCC" "CV-CVVCC" "CV-CVC" "CV-CCVCC" ...
$ WordClass: Factor w/ 1 level "V": 1 1 1 1 1 1 1 1 1 1 ...
$ FreqDev : num 5.46 4.89 4.39 4.28 4.13 ...
$ LengthDev: num -1.152 -1.152 -2.11 -0.673 -1.631 ...
$ Irregular: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
What is strange is that if I hover over the numerical cells that now contain only # symbols (in the Excel file), I see a trace of the numbers that used to be there in the original R data.frame. For example, the values of these columns for the first row of the data.frame are:
> RegPartV[1, c(5,6)]
FreqDev LengthDev
1277 5.464336 -1.15185
And if I hover over the corresponding Excel cells (that contain only # symbols), I see:
54643356148468
and
-115184982188519
So the numbers are still there, but for some reason either R or Excel lost track of where the decimal point was.
The method I am using to save the data.frame (the same one I've used for structurally equivalent data.frames) is:
write.table(RegPartV, file = "RegPartV", quote = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE)
Then I open the file with Excel and would expect to see all the info there, but for some reason I'm having this numeric problem with this particular data.frame.
Any suggestions for getting an Excel-readable data.frame are very welcome.
Thanks in advance.
From your problem description I suspect that you have "," as the default decimal separator in Excel. Either change the default in Excel or add dec="," to the write.table command.
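For instance, a version of the original call with an explicit decimal separator (a sketch; it assumes the comma-separator diagnosis above is correct):
write.table(RegPartV, file = "RegPartV", quote = FALSE, sep = "\t",
            dec = ",", row.names = FALSE, col.names = TRUE)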
That isn't actually an error: "#" means that a value is too long to fit into the column. Widen the column and you'll see the proper contents.

Writing a Simple Triplet Matrix to a File?

I am using the tm package to compute a term-document matrix for a dataset. I now have to write the term-document matrix to a file, but when I use the write functions in R I get an error.
Here is the code which I am using and the error I am getting:
data("crude")
tdm <- TermDocumentMatrix(crude, control = list(weighting = weightTfIdf, stopwords = TRUE))
dtm <- DocumentTermMatrix(crude, control = list(weighting = weightTfIdf, stopwords = TRUE))
and this is the error while I use the write.table command on this data:
Error in cat(list(...), file, sep, fill, labels, append) : argument 1 (type 'list') cannot be handled by 'cat'
I understand that tdm is an object of class simple triplet matrix, but how can I write it to a simple text file?
I think I might be misunderstanding the question, but if all you want to do is export the term document matrix to a file, then how about this:
m <- inspect(tdm)
DF <- as.data.frame(m, stringsAsFactors = FALSE)
write.table(DF)
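# (note, my addition: write.table(DF) prints to the console; pass a file
# argument, e.g. write.table(DF, file = "tdm.txt"), to write to disk)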
Is that what you're after mate?
Hope that helps a little,
Tony Breyal
Should the file be "human-readable"? If not, use dump, dput, or save. If so, convert your list into a data.frame.
Edit: You can convert your list into a matrix if each list element is of equal length by doing matrix(unlist(list.name), nrow=length(list.name[[1]])) or something like that (or with plyr).
Why aren't you doing your SVM analysis in R (e.g. with kernlab)?
Edit 2: Ok, I looked at your data, and it isn't easy to convert into a matrix because the list elements aren't equal length:
> is.list(tdm)
[1] TRUE
> str(tdm)
List of 7
$ i : int [1:1475] 15 29 151 152 173 205 215 216 227 228 ...
$ j : int [1:1475] 1 1 1 1 1 1 1 1 1 1 ...
$ v : Named num [1:1475] 3.32 4.32 2.32 2 2.32 ...
..- attr(*, "names")= chr [1:1475] "1.50" "16.00" "barrel," "barrel." ...
$ nrow : int 985
$ ncol : int 20
$ dimnames :List of 2
..$ Terms: chr [1:985] "(bpd)" "(bpd)." "(gcc)" "(it) appears to be nearing a crossroads with regard to\nderegulation, both as it pertains to investments and imports," ...
..$ Docs : chr [1:20] "127" "144" "191" "194" ...
$ Weighting: chr [1:2] "term frequency - inverse document frequency" "tf-idf"
- attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
In order to convert this to a matrix, you will need to either take elements of this list (e.g. i, j) or else do some other manipulation.
Edit 3: Just to conclude my commentary here: these objects are intended to be used with the inspect function (see the package vignette).
As discussed, in order to use a function like write.table, you will need to convert your list into a matrix, which requires some manipulation of that list such that you have several vectors of equal length. Looking at the structure of these tm objects: this will be very difficult to do, and I suggest you work with the helper functions that are included with that package.
dtmMatrix <- as.matrix(dtm)
write.csv(dtmMatrix, 'mydata.csv')
This certainly does the work. However, when I tried it on a very large DTM (25000 by 35000), it gave errors relating to lack of memory space.
I used the following method:
dtm <- DocumentTermMatrix(corpus)
dtm1 <- removeSparseTerms(dtm, 0.998)  # max allowed sparsity 0.998
m <- inspect(dtm1)
DF <- as.data.frame(m, stringsAsFactors = FALSE)
write.csv(DF, "mydata0.998sparse.csv")
This reduced the size of the document-term matrix to a great extent!
You can increase the maximum allowed sparsity (closer to 1) to include more terms in DF.
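If even the sparsity-trimmed matrix is too large to densify, another option (my suggestion, not from the answers above) is to write the triplet representation directly, since the i, j and v slots shown in str(tdm) are plain vectors:
# Write the sparse triplets themselves; avoids building a dense matrix.
# Terms() and Docs() are tm's accessors for the dimension names.
triplets <- data.frame(term  = Terms(tdm)[tdm$i],
                       doc   = Docs(tdm)[tdm$j],
                       tfidf = tdm$v)
write.csv(triplets, "tdm_triplets.csv", row.names = FALSE)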
