Check whether file is binary - r

I would like to have a function is.binary that gives the following results
tmp_vec=1:3
save(tmp_vec,file="temp_vec.RData")
write.csv(tmp_vec,"temp_vec.csv")
is.binary("temp_vec.RData")
#TRUE
is.binary("temp_vec.csv")
#FALSE
Is there a function like this in R?
The best solution I was able to come up with is
is.binary=function(filename) {
is_binary=TRUE
#suppress warnings and try to read file with binary reader
#if it throws an error, set is_binary to FALSE
withCallingHandlers(expr=tryCatch(load(filename),
error=function(err) is_binary<<-FALSE),
warning=function(w) invokeRestart("muffleWarning"))
#since R loads the objects of the binary file into the memory, delete them
#maybe this is unnecessary ...
#... I do not know how R handles memory of objects that go out of scope
rm(list=ls()[ls()!="is_binary"])
is_binary
}
Obviously, this function is not very efficient when dealing with large files. Any pointers to better solutions? Thank you!
Edit: I am on a Windows machine.

All files on a computer are binary files. The exceptions are those that happen to look like text files. And even those can have different encodings that are required to actually read the files properly.
If you're just trying to distinguish between the results of save() and write.csv(), the easiest thing to check would be if the file is compressed. By default save() will compress a file. You can look for the magic number for compression in the file. Here's one way to do that
is_compressed <- function(filename, magic.number=as.raw(c("0x1f","0x8b"))) {
fh<-file(filename, "rb")
on.exit(close(fh))
magic <- readBin(fh, "raw", length(magic.number))
if(length(magic) != length(magic.number)) return(FALSE)
if(all(magic == magic.number)) return(TRUE)
return (FALSE)
}
This only requires reading two bytes of a file rather than the whole thing. Once you know it's compressed you can then try to look for the magic number for an rData file. Here's a more complete function
is_rdata <- function(filename) {
#check for magic number
#https://github.com/wch/r-source/blob/b99d403f4b7337553acb2d2108c7a00e6c19f908/src/main/saveload.c#L1786
fh <- if(!is_compressed(filename))
file(filename, "rb")
else {
gzfile(filename, "rb")
}
on.exit(close(fh))
magic <- rawToChar(readBin(fh, "raw", 5))
if(nchar(magic)<5) return(FALSE)
if(magic %in% c("RDA1\n","RDB1\n","RDX1\n","RDA2\n","RDB2\n","RDX2\n")) return(TRUE)
return (FALSE)
}
We can test with
dd <- data.frame(a=1:4, b=letters[1:4])
save(dd, file="test1.file")
write.csv(dd, file="test2.file")
is_rdata("test1.file")
is_rdata("test2.file")
of course, if you are careful about your file extensions, that would probably be the easiest way to identify files.

Related

What can be the problem when "eem_import_dir" is working properly?

eem_import_dir is supposed to "Reads Rdata and RDa files with one eemlist each. The eemlists are combined into one and returned." However, in my case, only one file is read at the time, and thus no combination is happening...
I don't expect any error in the function which is part of the staRdom package. I guess my limited R knowledge limits my understanding of the function and what could be wrong.
All files are the same class (eemlist) and in the same format. Tried changing the folder, filenames, etc. Can someone please help me understand the requirements of the function? Why is only one file read at the time and not all combined?
function (dir)
{
eem_files <- dir(dir, pattern = ".RData$|.RDa$", ignore.case = TRUE) %>%
paste0(dir, "/", .)
for (file in eem_files) {
file <- load(file)
if (get(file) %>% class() == "eemlist") {
if (exists("eem_list"))
eem_list <- eem_bind(eem_list, get(file))
else eem_list <- get(file)
}
else {
warning(paste0(file, " is no object of class eemlist!"))
}
NULL
}
eem_list
}

Detecting invalid or corrupt jpg files with jpeg package in R

I'd like to use the jpeg package (or similar) to detect corrupted .jpg files. I am sharing this code with users who have had trouble installing exiftool so I'd prefer to use packages that do not require that program.
I want my code to catch images that are completely corrupt or that are partially corrupt (i.e., you can see part of the image, but some of it is cut off).
When an image is corrupt, the readJPEG function returns:
Error in readJPEG(photos[35]) :
JPEG decompression error: Not a JPEG file: starts with 0x7b 0x28
When an image is partially corrupt, the function returns:
JPEG decompression: Corrupt JPEG data: premature end of data segment
I want to write a function that will return FALSE if the image is "good" and TRUE if it is corrupted or partially corrupted. So far, I can't get my function to work if the image is partially corrupted (it returns FALSE). What am I doing wrong?
Here's an example of a "partially corrupt" image - the bottom half got cut off when it was transferred to a new device.
library(jpeg)
# Function to "catch" bad photos
is_corrupted <- function(x){
tryCatch({
check <- readJPEG(x)
return(FALSE)
},
error = function(e)
return(TRUE),
warning = function(w)
return(TRUE),
message = function(m)
return(TRUE)
)
}
EDIT: Try number 2...
I created a modified function based on Ben's suggestions, but it still isn't returning TRUE if an image is completely corrupt. I also don't like how it tests the photo twice. Any recommendations appreciated!
To test the function, you can use three jpgs... (1) any valid jpg from your computer, (2) the "partially corrupt" file linked in this question, and (3) reference a file that doesn't exist to throw an error that will be caught by tryCatch (e.g., is_corrupted("").
is_corrupted <- function(x){
message <- capture.output(check2 <- readJPEG(x), type = "message")
if(length(message) > 0) {
corrupt <- TRUE
} else {
corrupt <- tryCatch({
check <- readJPEG(x)
return(FALSE)
},
error = function(e) # catch "corrupt" images
return(TRUE)
)
}
return(corrupt)
}
I agree, this one is tricky. I think you need to have the error checking before the capturing part. I will post a temporary (ugly) solution, and hopefully someone else posts a more elegant and straightforward one.
readJPEG2 <- purrr::safely(readJPEG)
Let purrr do the error checking and if there is none, proceed with examining the output:
fun <- function(x){
if(is.null(readJPEG2(x)$error)){
message2 <- capture.output(readJPEG(x), type = "message")
if(length(message2) > 0){
return("partially corrupted")
} else {
return("complete")
}
} else {
return("corrupted")
}
}
I do not know how robust this solution is but maybe it helps you even so.

Batch-reading mesh3D objects with the 'file2mesh' function from the 'Morpho' package

I am trying to batch-read a series of ply-meshes (as mesh3D objects), in order to slide semilandmarks with 'slider3d'. However, when I try to use a loop to read those files, I am told that the object 'Mesh' could not be found. This indicates that a mesh object must first be created in order to then be altered in a loop. How do I solve this?
Is there a simple function in the 'rgl' package that I overlooked?
Or is there an alternative to read all 3D-meshes in one folder, and create a list that I can use to match files downstream?
library(Morpho)
FilesPLY <- list.files("HumerusPLY",pattern="*.ply")
for(j in 1:length(FilesPLY)){
Mesh[j] <- file2mesh(paste("HumerusPLY/",FilesPLY[j],sep=""), clean = TRUE, readcol = FALSE)
}
Error: Object 'Mesh' could not be found.
One way to solve the problem is by creating a list of empty files, and then reading the meshes into the empty files. Oddly enough the first read-out results in an error, but it sets up the system for the read-in. I don't understand the problem behind it, but it works. Thus, here is the temporary solution:
library(Morpho)
# Read ply-list from subfolder "HumerusPLY/"; Create Mesh series of objects, and fill them
FilesPLY <- list.files("HumerusPLY/",pattern="*.ply")
for(i in 1:length(FilesPLY)) {
assign(paste("Mesh",i,sep=""), i)
}
meshlist <- c(1:length(FilesPLY))
for (i in 1:length(meshlist)){
meshlist[i] <- paste("Mesh",meshlist[i],sep="")
}
meshlist <- noquote(meshlist)
ls()
##read ply-files; the second read fixes an error, but does not work without the first read
for(j in 1:length(meshlist)){
meshlist[j] <- file2mesh(paste("HumerusPLY/",FilesPLY[j],sep=""), clean = TRUE, readcol = FALSE)
}
for(j in 1:length(meshlist)){
meshlist[[j]] <- file2mesh(paste("HumerusPLY/",FilesPLY[j],sep=""), clean = TRUE, readcol = FALSE)
}

TryCatch with parLapply (Parallel package) in R

I am trying to run something on a very large dataset. Basically, I want to loop through all files in a folder and run the function fromJSON on it. However, I want it to skip over files that produce an error. I have built a function using tryCatch however, that only works when i use the function lappy and not parLapply.
Here is my code for my exception handling function:
readJson <- function (file) {
require(jsonlite)
dat <- tryCatch(
{
fromJSON(file, flatten=TRUE)
},
error = function(cond) {
message(cond)
return(NA)
},
warning = function(cond) {
message(cond)
return(NULL)
}
)
return(dat)
}
and then I call parLapply on a character vector files which contains the full paths to the JSON files:
dat<- parLapply(cl,files,readJson)
that produces an error when it reaches a file that doesn't end properly and does not create the list 'dat' by skipping over the problematic file. Which is what the readJson function was supposed to mitigate.
When I use regular lapply, however it works perfectly fine. It generates the errors, however, it still creates the list by skipping over the erroneous file.
any ideas on how I could use exception handling with parLappy parallel such that it will skip over the problematic files and generate the list?
In your error handler function cond is an error condition. message(cond) signals this condition, which is caught on the workers and transmitted as an error to the master. Either remove the message calls or replace them with something like
message(conditionMessage(cond))
You won't see anything on the master though, so removing is probably best.
What you could do is something like this (with another example, reproducible):
test1 <- function(i) {
dat <- NA
try({
if (runif(1) < 0.8) {
dat <- rnorm(i)
} else {
stop("Error!")
}
})
return(dat)
}
cl <- parallel::makeCluster(3)
dat <- parallel::parLapply(cl, 1:100, test1)
See this related question for other solutions. I think using foreach with .errorhandling = "pass" would be another good solution.

Chinese characters function R studio

I am trying to do text mining in Chinese with R.
In my data set, I have a column with people's comment like "连锁店购买的". And I have 2 other columns that I created thanks to JiebaR. These hold the segmented message ("连锁店", "购买", "的") and the keywords from these messages ("连锁店", "购买"). The keyword selection removes "不"("no" in Chinese) so I am trying to fetch it back from the words and add it to the keywords. Simple, right ?
To have a clean code, I put all my functions in a separate file and source it in my main file. And NOW something VERY weird happens : the function works when it's in the main file but doesn't work when it's in the file that I source ! (I just copied and pasted the function from my main to the "function" file and run the source(...) line...).
fetchingNeg <- function(df){
for (i in 1:nrow(df)){
if ("不" %in% unlist(df[i,]$words)){
df[i,]$keywords <- list(append(unlist(df[i,]$keywords),"不"))
}
}
return(df)
}
So I found the error : Encoding !
There was a character c that I knew was "不" but when I was doing print("不" == c) it would give FALSE... "不" is not encoded in UTF-8 in this case, so to make my code work I had to change it to
fetchingNeg <- function(df){
for (i in 1:nrow(df)){
# "不" is "\u{4e0d}" in UTF-8
if ("\u{4e0d}" %in% unlist(df[i,]$words)){
df[i,]$keywords <- list(append(unlist(df[i,]$keywords),"\u{4e0d}"))
}
}
return(df)
}

Resources