How can I losslessly crop a JPEG in R?

I am new to R. I have a folder full of RGB images which are not all of the same dimensions. My requirement is to have them all in the same dimensions, which involves resizing a bunch of them. I wrote the following code to get this done:
#EBImage
library(EBImage)

path = "G:/Images/"
file.names = dir(path, full.names = TRUE, pattern = ".jpeg")
reqd_dim = c(3099, 2329, 3)
sprintf("Number of Image Files is: %d", length(file.names))

for (i in 1:length(file.names)) {
  correction_flag = FALSE
  print("Loop Number:")
  flush.console()
  print(i)
  flush.console()
  img = readImage(file.names[i])
  # Checking if the dimensions are the same
  for (j in 1:length(reqd_dim)) {
    if (dim(img)[j] != reqd_dim[j]) {
      correction_flag = TRUE
      break
    }
  }
  if (correction_flag == TRUE) {
    print("Correcting dimensions of the image")
    flush.console()
    writeImage(img[1:3099, 1:2329, 1:3], file.names[i], quality = 100)
  }
}
My problem is that while the original images are between 500 and 600 KB in size, the resized ones end up between 1.8 and 2 MB. In my particular case the images come in one of two sizes, 3100x2329 or 3099x2329, so resizing just means removing the extra column of pixels to make all images 3099x2329. I would be fine with the file size going down a bit, as I expect some information to be lost; but in my case the file size increases more than three-fold.
Alternatively, I have thought of converting the images into matrices (which EBImage supports) and removing the extra column. But I have two issues here: one is that I don't know how to do it, and two is that even if I found a way to do it, I'm afraid I might lose some information if I ever needed to convert it back to an image.
I'm open to an improvement over this approach, or to a totally different one. My only requirement is that I need to be able to resize my images in R without adding or losing any information (apart from the information in the pixels to be removed).

To perform lossless JPEG cropping you can use jpegtran, an external command-line tool distributed as part of the IJG libjpeg library. For example, the following command removes the last column of pixels from a 768x512 image and writes the result to a new file:
jpegtran -crop 767x512+0+0 -optimize image.jpg >cropped.jpg
The -crop switch specifies the rectangular subarea WxH+X+Y to retain, and -optimize reduces file size without quality loss by optimizing the Huffman tables. For a complete list of switches see jpegtran -help. Note that the output must go to a different file: redirecting onto the input, as in >image.jpg, would make the shell truncate the input before jpegtran gets to read it.
Once jpegtran is installed on your system, it can be invoked from R by system(). The following example first takes a sample image and saves it as JPEG. The image is then cropped, and the pixel values are compared to the values from the original image.
library("EBImage")
# resave a sample image as JPG
f = system.file("images", "sample.png", package="EBImage")
writeImage(readImage(f), "image.jpg", quality=90)
# do the cropping
system("jpegtran -crop 767x512+0+0 -optimize image.jpg >cropped.jpg")
# compare file size
file.size("image.jpg", "cropped.jpg")
## [1] 65880 65005
original = readImage("image.jpg")
dim(original)
## [1] 768 512
cropped = readImage("cropped.jpg")
dim(cropped)
## [1] 767 512
# check whether original values are retained
identical(original[1:767,], cropped)
## [1] TRUE
Back to your specific use case: your script could be further improved by examining image dimensions without actually loading the whole pixel array into R. For this you could, for example, use RBioFormats to read only the image metadata containing the dimensions (a minimal sketch follows). Alternatively, the identify command-line tool distributed as part of the ImageMagick suite can retrieve the image dimensions, as illustrated in the rewritten loop further below.
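The metadata route might look like this; this is only a sketch, assuming RBioFormats is installed and that its coreMetadata() accessor exposes sizeX/sizeY fields (inspect str() of the returned object if the names differ on your version):
library(RBioFormats)
# read only the metadata, not the pixel array
meta = coreMetadata(read.metadata(file.names[1]))
c(meta$sizeX, meta$sizeY)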
path = "G:/Images/"
file.names = dir(path, full.names = TRUE, pattern =".jpeg")
reqd_dim = c(3099,2329,3)
cat(sprintf("Number of Image Files is: %d\n", length(file.names)))
for (i in seq_along(file.names)) {
file = file.names[i]
cat(sprintf("Checking dimensions of image number %d: ", i))
flush.console()
cmd = paste('identify -format "c(%w, %h)"', file)
res = eval(parse(text=system(cmd, intern=TRUE)))
# Checking if the dimensions are the same
if ( all(res==reqd_dim) ) {
cat("OK\n")
flush.console()
}
else {
cat("Correcting\n")
flush.console()
system(sprintf("jpegtran -crop %dx%d+0+0 -optimize %s >%s",
reqd_dim[1], reqd_dim[2], file, file))
}
}


Failure of unz() to unzip from a zip file offset of more than 2^31 bytes

I have been obtaining .zip archives of genome annotation from NCBI (mainly gff files). In order to save disk space I prefer not to unzip the archive, but to read these files directly into R using unz(). However, it seems that unz() is unable to extract files from near the end of 'large' zip files:
ncbi.zip <- "file_location/name.zip"
files <- unzip(ncbi.zip, list=TRUE)
gff.files <- files$Name[ grep("gff$", files$Name) ]
## this works
gff.128 <- readLines( unz(ncbi.zip, gff.files[128]) )
## this gives an empty data structure (read.table() stops
## with an error saying no lines or similar)
gff.129 <- readLines( unz(ncbi.zip, gff.files[129]) )
## there are 31 more gff files after the 129th one.
## no lines are read from any of these.
The zip file itself seems to be fine; I can unzip the specific files using unzip on the command line and unzip -t does not report any errors.
I've tried this with R versions 3.5 (openSUSE Leap 15.1), 3.6, and 4.2 (CentOS 7) and with more than one zip file, and I get exactly the same result.
I attached strace to R whilst reading in the 128th and 129th files. In both cases I get a lot of lseek calls towards the end of the file (offset 2845892608, larger than 2^31) to start with; this is where I assume the zip directory can be found. For the 128th file (the one that can be read), I eventually get an lseek to an offset slightly below 2^31, followed by a set of lseeks and reads (that extend beyond 2^31).
For the 129th file, I get the same reads towards the end of the file, but then rather than finding a position within the file I get:
lseek(3, 2845933568, SEEK_SET) = 2845933568
lseek(3, 4294963200, SEEK_SET) = 4294963200
read(3, "", 4096) = 0
lseek(3, 4095, SEEK_CUR) = 4294967295
read(3, "", 4096) = 0
This is a bit weird since the file itself is only about 2.8 GB; 4294967295 is of course 2^32 - 1.
To me this feels like an integer overflow bug, and I am considering posting a bug report. But I am wondering if anyone has seen something similar before, or if I am doing something stupid.
Having done what I should have started with (reading the zip64 format specification), it's actually clear that this is not an integer overflow error.
Zip files contain a central directory at the end of the archive; this contains, amongst other things, the names of the compressed files and the offset of the compressed data in the zip archive. The offset (and file size) fields are only given 4 bytes each in the standard directory entry; when a value is too large to fit there, it should instead be given in the extra fields section and the standard field should be set to 0xFFFFFFFF. Since this offset is what gets used when reading the file, it seems clear that the problem lies in the parsing of the extra field.
I had a look at the source code for R 4.2.1 and it seems that the problem is due to the way the value in the standard offset field is tested:
if(file_info.uncompressed_size == (ZPOS64_T)(unsigned long)-1)
On platforms where unsigned long is 64 bits wide, (unsigned long)-1 equals 2^64 - 1 rather than the 0xFFFFFFFF marker actually stored in the 4-byte field, so the test never matches; changing the comparison to == 0xFFFFFFFF seems to fix the problem.
I've submitted a bug report to R. Hopefully changing the check will not have any unintended consequences and the issue will be fixed.
Still, I'm curious as to whether anyone else has come across the same issue. Seems a bit unlikely that my experience is unique.
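In the meantime, a workaround sketch: since the command-line unzip reads these archives fine, the affected member can be extracted to a temporary directory via system2() and read from there (this assumes unzip is on the PATH):
tmp <- tempdir()
# extract just the one gff file that unz() cannot reach
system2("unzip", args = c("-o", shQuote(ncbi.zip), shQuote(gff.files[129]),
                          "-d", shQuote(tmp)))
gff.129 <- readLines(file.path(tmp, gff.files[129]))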

I am trying to save 360 PNG files as a GIF with ImageMagick in R (I am working on macOS)

Please let me know any other system/code I need to include, as I am not as familiar with writing out images to my computer. I am creating 360 PNG files as follows:
for (theta in 1:360) {
  ic = as.character(theta)
  if (theta < 10) ic = paste("00", ic, sep = "")
  if (theta >= 10 & theta < 100) ic = paste("0", ic, sep = "") # make filenames the same length
  fn = paste("c:iris360\\HW4_", ic, ".png", sep = "") # filename
  png(fn, width = 1000, height = 1000) # save as *.png
  p3(X1, X2, r = 100, theta = theta, mainL = paste("theta =", theta))
  # legend("topleft", pch = 16, cex = 1.5, col = allcl)
  dev.off()
}
system("magick c:iris360\\HW4*.png c:iris.gif")
where p3 is just a function that takes my matrices X1 and X2 and plots the points and their segments (let me know if I need to include it as well). However, I get this error:
magick: must specify image size `iris360HW4*.png' @ error/raw.c/ReadRAWImage/140.
I am unable to open the GIF file; my Mac says it is damaged or uses a file format that Preview does not recognize.
Update 1: I replaced fn's declaration with
fn <- sprintf("c:iris360/HW4_%03i.png", theta)
as well as replacing ic with sprintf("%03i", theta) everywhere it appeared, but I still got the same "specify image size" error.
When I run the system command in my terminal, I still get the same error asking me to specify the image size.
Magick needs to know several things (e.g., image size, delay between frames, images to use, destination file name) in order to convert a stack of PNGs into a GIF. See GIF Animations and Animation Meta-data:
magick -delay 100 -size 100x100 xc:SkyBlue \
-page +5+10 balloon.gif -page +35+30 medical.gif \
-page +62+50 present.gif -page +10+55 shading.gif \
-loop 0 animation.gif
So it looks like you need to change
system("magick c:iris360\\HW4*.png c:iris.gif")
to something more like
system("magick -delay 10 -size 100x100 —loop 0 c:iris360\\HW4*.png c:iris.gif")

Overlapping lines in Gnuplot when exporting

I'm trying to plot a discrete Brownian path in gnuplot, which involves a lot of overlapping lines. This is how it's displayed in the qt terminal (I have generated the image with a screenshot):
Notice how the overlapping lines get colored in a stronger color, which is beautiful.
If I export it in png, with
set term pngcairo size 1366,768 enhanced
I obtain this:
All the lines have the same intensity. Setting transparent doesn't help, either.
The same happens with this MWE:
set term pngcairo size 1366,768 background '#000000' enhanced
set output "image.png"
unset key
set border 0
unset xtics
unset ytics
set samples 1e6
set xrange [0:0.1]
p sin(1/x) w l lw 0.3
set output
I'm running gnuplot -d each time so my local config does not get loaded. How should I export the plot to obtain the same effect as in the GUI?
Here are some results of my investigation:
I couldn't achieve beautiful results with pngcairo either; opacity isn't added when two curves overlap each other.
Exporting to SVG and converting to PNG looked a bit better, either with inkscape -z -e image.png -w 1600 -h 1200 image.svg or with convert -density 3000 -resize 1600x1200 image.svg image.png. This step could be included in gnuplot as a system command.
It is possible to export the qt render to PNG directly from the qt window: first menu icon on the left → Export to image.
This process could in theory be automated directly from Gnuplot, without user interaction. A patch has been submitted: https://sourceforge.net/p/gnuplot/patches/665/. As far as I can tell, it hasn't yet been integrated into Gnuplot 5.0.x.
Here is a related discussion on Gnuplot-dev.
If you feel adventurous, you could try to recompile Gnuplot with the applied patch. The submitter might be able to help you.
Very off-topic in this question, but as a workaround I have made a Julia script that replicates the image feeling I am looking for. I will post it here in case anybody finds it useful.
using Images

function paint(Ny, Nx, iters=1e6; stepsize=50)
    randstep() = rand([-1; 1])
    x = Nx ÷ 2
    y = Ny ÷ 2
    M = zeros(Nx, Ny)
    for i in 1:iters
        rx = randstep()
        ry = randstep()
        for i in 1:stepsize
            x = mod1(x + rx, Nx)
            y = mod1(y + ry, Ny)
            M[x, y] += 1
        end
    end
    clamped = M / maximum(M)
    img = [Colors.RGB(0, mm, 0) for mm in clamped]
end

img = convert(Image, paint(1366, 768, 1e4, stepsize=10))
save("coolbrownianwalk.png", img)
This produces images like this:

Looking for algorithm to do long pair wise nucleotide alignments

I am trying to scan for possible SNPs and indels by aligning scaffolds to subsequences from a reference genome (the raw reads are not available). I am using R/Bioconductor and the pairwiseAlignment function from the Biostrings package.
This was working fine for smaller scaffolds, but failed when I tried to align a 56 kbp scaffold, with the error message:
Error in QualityScaledXStringSet.pairwiseAlignment(pattern = pattern,
: cannot allocate memory block of size 17179869183.7 Gb
I am not sure if this is a bug or not. I was under the impression that the Needleman-Wunsch algorithm used by pairwiseAlignment is O(n*m), which I thought would imply a computational demand on the order of 3.1E9 operations (56k * 56k ~= 3.1E9). It seems the Needleman-Wunsch similarity matrix should likewise take up on the order of 3.1 GB of memory. I am not sure whether I am misremembering big-O notation, or whether that is actually the memory overhead needed to build the alignment given the overhead of the R scripting environment.
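For scale, a quick back-of-the-envelope check (assuming, and this is only a guess at Biostrings' internals, one 8-byte double per cell of the dynamic-programming matrix):
n <- 56170; m <- 56168
n * m             # ~3.2e9 cells
n * m * 8 / 2^30  # ~23.5 GiB if stored as 8-byte doubles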
Does anybody have suggestions for a better alignment algorithm to use for aligning longer sequences? An initial alignment was already done using BLAST to find the region of the reference genome to align to. I am not entirely confident in BLAST's reliability for correctly placing indels, and I have not yet been able to find an API as good as the one provided by Biostrings for parsing the raw BLAST alignments.
By the way, here is a code snippet that replicates the problem:
library("Biostrings")
scaffold_set = read.DNAStringSet(scaffold_file_name) #scaffold_set is a DNAStringSet instance
scafseq = scaffold_set[[scaffold_name]] #scaf_seq is a "DNAString" instance
genome = read.DNAStringSet(genome_file_name)[[1]] #genome is a "DNAString" instance
#qstart, qend, substart, subend are all from intial BLAST alignment step
scaf_sub = subseq(scafseq, start=qstart, end=qend) #56170-letter "DNAString" instance
genomic_sub = subseq(genome, start=substart, end=subend) #56168-letter "DNAString" instance
curalign = pairwiseAlignment(pattern = scaf_sub, subject = genomic_sub)
#that last line gives the error:
#Error in .Call2("XStringSet_align_pairwiseAlignment", pattern, subject, :
#cannot allocate memory block of size 17179869182.9 Gb
The error does not happen with shorter alignments (hundreds of bases); I have not yet found the length cutoff at which it starts happening.
So I use Clustal as an alignment tool. I'm not sure about its specific performance, but it has never given me issues when doing multiple sequence alignments of large quantities. Here is a script that runs over a whole directory of .fasta files and aligns them. You can modify the flags on the system call to suit your input/output needs; just look at the Clustal documentation. This is in Perl, as I don't use R much for alignments. You need to edit the executable path in the script to match where Clustal is on your computer.
#!/usr/bin/perl
use warnings;

print "Please type the list file name of protein fasta files to align (end the directory path with a / or this will fail!): ";
$directory = <STDIN>;
chomp $directory;

opendir (DIR, $directory) or die $!;
my @file = readdir DIR;
closedir DIR;

my $add = "_align.fasta";

foreach $file (@file) {
    my $infile = "$directory$file";
    (my $fileprefix = $infile) =~ s/\.[^.]+$//;
    my $outfile = "$fileprefix$add";
    system "/Users/Wes/Desktop/eggNOG_files/clustalw-2.1-macosx/clustalw2 -INFILE=$infile -OUTFILE=$outfile -OUTPUT=FASTA -tree";
}
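Since the question is R-centric, here is a rough R equivalent of the Perl loop above; a sketch only, with the clustalw2 path and flags copied from the script (adjust both for your machine):
fasta.files <- list.files("path/to/fastas", pattern = "\\.fasta$", full.names = TRUE)
clustal <- "/Users/Wes/Desktop/eggNOG_files/clustalw-2.1-macosx/clustalw2"
for (f in fasta.files) {
  out <- sub("\\.fasta$", "_align.fasta", f)
  # shell out to clustalw2, writing a FASTA alignment next to each input
  system2(clustal, args = c(paste0("-INFILE=", f),
                            paste0("-OUTFILE=", out),
                            "-OUTPUT=FASTA", "-tree"))
}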

R: While loop input

I am a bit new to R and have a question about a program I am trying to write. I am hoping to take in as many files as a user pleases with a while loop (eventually using read.table on each), but it keeps breaking on me.
Here is what I have so far:
cat("Please enter the full path for your files, if you have no more files to add enter 'X': ")
fil<-readLines(con="stdin", 1)
cat(fil, "\n")
while (!input=='X' | !input=='x'){
inputfile=input
input<- readline("Please enter the full path for your files, if you have no more files to add enter 'X': ")
}
if(input=='X' | input=='x'){
exit -1
}
When I run it (from the command line, on UNIX) I get these results:
> library("lattice")
>
> cat("Please enter the full path for your files, if you have no more files to add enter 'X': ")
Please enter the full path for your files, if you have no more files to add enter 'X': > fil<-readLines(con="stdin", 1)
x
> cat(fil, "\n")
x
> while (!input=='X' | !input=='x'){
+ inputfile=input
+ input<- readline("Please enter the full path for your files, if you have no more files to add enter 'X': ")
+ }
Error: object 'input' not found
Execution halted
I am not quite sure how to fix the problem, but I am pretty sure it is something simple.
Any suggestions?
Thanks!
When you first run the script, input doesn't exist. Assign
input <- c()
before your while statement, or put
inputfile = input
after the input <- readline(...) line.
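Note also that the loop condition as written, !input=='X' | !input=='x', is always TRUE, since any value fails at least one of the two comparisons; the exit test needs "neither" (&&) semantics. A minimal corrected sketch (with the prompt shortened here for readability):
input <- readline("Enter a file path, or 'X' to finish: ")
files <- character(0)
while (input != 'X' && input != 'x') {
  files <- c(files, input)
  input <- readline("Enter a file path, or 'X' to finish: ")
}
# `files` now holds the paths, ready for read.table()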
I'm not exactly sure what the underlying problem is in your case; it may be that you're inputting the directory path incorrectly.
Here's a solution I've used a few times; it makes things much easier for the user. Basically, your code will not require user input at all; it only requires that your files follow a certain naming convention.
setwd("Your/Working/Directory") #This doesn't change
filecontents <- 1
i <- 1
while (filecontents != 0) {
mydata.csv <- try(read.csv(paste("CSV_file_",i,".csv", sep = ""), header = FALSE), silent = TRUE)
if (typeof(mydata.csv) != "list") { #checks to see if the imported data is a list
filecontents <- 0
}
else {
assign(paste('dataset',i, sep=''), mydata)
#Whatever operations you want to do on the files.
i <- i + 1
}
}
As you can see, the naming convention for the files is CSV_file_n, where n is any number of input files (I took this code out of one of my programs, in which I load CSVs). One of the problems I kept having was error messages popping up when my code looked for a file that wasn't there. With this loop, those messages won't arise: if it assigns the contents of a nonexistent file to mydata.csv, it merely checks whether mydata.csv is a list. If it is, it continues operating; if not, it stops. If you're worried about differentiating between data from different files within the code, just insert any relevant information about the file in a constant location within the file itself. For example, in my CSVs, the third column always contained the name of the image from which I gathered the information contained in the rest of the CSV.
Hope this helps you a bit, even though I see you've already got a solution :-). It's really just an option if you want your program to be more autonomous.
