I'm trying to run this function within a function, based loosely off of this. Since xPDF can convert PDFs to PNGs directly, I skipped the ImageMagick conversion step. I also dropped the faulty logic around the function(i) output name, because pdftopng requires a root name (here "ocrbook-000001.png") and throws an error when it goes looking for a PNG named after the original PDF.
My issue is now with getting Tesseract to do anything with my PNG files. I get the error:
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Error in pixCreateNoInit: pix_malloc fail for data
Error in pixCreate: pixd not made
Error in pixReadStreamPng: pix not made
Error in pixReadStream: png: no pix returned
Error in pixRead: pix not read
Error during processing.
Here is my code:
lapply(myfiles, function(i){
  # convert pages 1-10 of the PDF to PNGs at 600 dpi, rooted at "ocrbook"
  shell(shQuote(paste0("pdftopng -f 1 -l 10 -r 600 ", i, " ocrbook")))
  mypngs <- list.files(path = dest, pattern = "png", full.names = TRUE)
  # OCR each PNG to "out", then delete the PNG
  lapply(mypngs, function(z){
    shell(shQuote(paste0("tesseract ", z, " out")))
    file.remove(paste0(z))
  })
})
The issue, apparently, was that the DPI was set too high for Tesseract to handle: changing the -r (resolution) parameter from 600 to 150 appears to have corrected it. There seems to be a maximum resolution beyond which Tesseract no longer knows what to do with the image.
I have also corrected my code from a static naming convention to a more dynamic one that mimics the files' original names.
dest <- "C:\\users\\YOURNAME\\desktop"

# convert pages 1-10 of each PDF to PPMs at 150 dpi, named after the source file
files <- tools::file_path_sans_ext(list.files(path = dest, pattern = "pdf", full.names = TRUE))
lapply(files, function(i){
  shell(shQuote(paste0("pdftoppm -f 1 -l 10 -r 150 ", i, ".pdf", " ", i)))
})

# convert each PPM to a TIFF, then delete the PPM
myppms <- tools::file_path_sans_ext(list.files(path = dest, pattern = "ppm", full.names = TRUE))
lapply(myppms, function(y){
  shell(shQuote(paste0("magick ", y, ".ppm", " ", y, ".tif")))
  file.remove(paste0(y, ".ppm"))
})

# OCR each TIFF to a same-named text file, then delete the TIFF
mytiffs <- tools::file_path_sans_ext(list.files(path = dest, pattern = "tif", full.names = TRUE))
lapply(mytiffs, function(z){
  shell(shQuote(paste0("tesseract ", z, ".tif", " ", z)))
  file.remove(paste0(z, ".tif"))
})
Background
It sounds like you already solved your problem. Yay! I'm writing this answer because I encountered a very similar problem calling tesseract from R and wanted to share some of the workarounds I came up with in case anyone else stumbles across the post and needs further troubleshooting ideas.
In my case I was converting a bunch of faxes (about 3,000 individual PDF files, most of them between 1 and 15 pages) to text. I used an apply function to store the text of each individual fax as a separate entry in a list (length = number of faxes = ~3,000). The list was then flattened into a vector, that vector was combined with a vector of file names to make a data frame, and finally I wrote the data frame to a CSV file. (See below for the code I used.)
The problem was I kept getting the same string of errors that you got:
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Error in pixCreateNoInit: pix_malloc fail for data
Error in pixCreate: pixd not made
Error in pixReadStreamPng: pix not made
Error in pixReadStream: png: no pix returned
Error in pixRead: pix not read
Error during processing.
Followed by this error: Error in FUN(X[[i]], ...) : basic_string::_M_construct null not valid
What I think the problem is
What was weird for me was that I re-ran the code multiple times, and the error always occurred on a different fax. It also seemed to happen more often when I was doing something else that used a lot of RAM or CPU (opening Microsoft Teams, etc.). I tried changing the DPI as suggested in the first answer, and that didn't seem to help.
It was also noticeable that while this code was running I was regularly using close to 100% of RAM and 50% of CPU (based on the Windows Task Manager).
When I ran this process (on a similar batch of about 3,000 faxes) on a Linux machine with significantly more RAM and CPU, I never encountered the problem.
basic_string::_M_construct null not valid appears to be a C++ error. I'm not familiar with C++, but it sounds like a bit of a catch-all error indicating that something that should have been created wasn't.
Based on all that, I think the problem is that R runs out of memory, and in response the memory available to some of the underlying Tesseract processes gets throttled. There is then not enough memory to convert a PDF to a PNG and extract the text, which is what throws those Leptonica errors. That in turn means a text string never gets created where one is expected, producing the final C++ error: basic_string::_M_construct null not valid. It's possible that lowering the DPI is what gave your process enough memory to complete, but the fundamental underlying problem may have been the memory, not the DPI.
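One quick way to check whether R itself is the process eating the RAM is to look at what gc() reports while the job runs; this is just base R, not anything tesseract-specific:
gc()  # the "max used (Mb)" column shows R's peak memory use this session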
Possible workarounds
So, I'm not sure about any of what I just said, but running with that assumption, here are some ideas I came up with for people running the tesseract package in R who encounter similar problems:
Switch from RStudio to Rgui. This alone solved my problem: I was able to complete the whole 3,000-fax process without any errors using Rgui. Rgui also used between 100 and 400 MB instead of the 1,000+ MB RStudio used, and about 25% of CPU instead of 50%. Putting R on the path and running R from the console, or running R in the background, might reduce memory use even further.
Close any memory-intensive processes while the code is running. Microsoft Teams, videoconferencing, streaming, Docker on Windows and the Windows Subsystem for Linux are all huge memory hogs.
Lower the DPI. As suggested by the first answer, this would also probably reduce memory use.
Break the process up. Running my processes in batches of about 500 might also have reduced the amount of working memory R has to hold before writing to file (a sketch of this is included after the example code below).
These are all quick and easy solutions that can be done from R without having to learn C++ or upgrade hardware. A more durable solution probably would require doing more customization of the tesseract parameters, implementing the process in C++, changing memory allocation settings for R and the operating system, or buying more RAM.
Example Code
# Load Libraries
library(tesseract)

dir.create("finished_data")

# Define Functions
ocr2 <- function(pdf_path){
  # tell tesseract which language to guess
  eng <- tesseract("eng")
  # convert to png first
  # pngfile <- pdftools::pdf_convert(pdf_path, dpi = 300)
  # tell tesseract to convert the pdf at pdf_path
  separated_pages <- tesseract::ocr(pdf_path, engine = eng)
  # combine all the pages into one page
  combined_pages <- paste(separated_pages, collapse = "**new page**")
  # I delete png files as I go to avoid overfilling the hard drive
  # because work computer has no hard drive space :'(
  png_file_paths <- list.files(pattern = "png$")
  file.remove(png_file_paths)
  combined_pages
}

# find pdf_paths
fax_file_paths <- list.files(path = "./raw_data",
                             pattern = "pdf$",
                             recursive = TRUE)

# this converts all the pdfs to text using the ocr
faxes <- lapply(paste0("./raw_data/", fax_file_paths), ocr2)

fax_table <- data.frame(file_name = fax_file_paths, file_text = unlist(faxes))

write.csv(fax_table,
          file = paste0("./finished_data/faxes_", format(Sys.Date(), "%b-%d-%Y"), "_test.csv"),
          row.names = FALSE)
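For the "break the process up" idea above, here is a hedged sketch of running the same pipeline in batches, reusing ocr2 and fax_file_paths from the code above; the batch size of 500 matches what I mentioned earlier, but the per-batch output file names are just an illustration:
# process the faxes in batches of ~500 and write each batch to its own CSV,
# so R never holds all ~3000 results in memory at once
batches <- split(fax_file_paths, ceiling(seq_along(fax_file_paths) / 500))
for (b in seq_along(batches)) {
  faxes <- lapply(paste0("./raw_data/", batches[[b]]), ocr2)
  fax_table <- data.frame(file_name = batches[[b]], file_text = unlist(faxes))
  write.csv(fax_table,
            file = paste0("./finished_data/faxes_batch_", b, ".csv"),
            row.names = FALSE)
  rm(faxes, fax_table)
  gc()  # free the finished batch before starting the next one
}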
Related
I've got a program that repeatedly loads largish datasets that are stored in R's Rds format. Here's a silly example that has all of the salient features:
# make and save the data
big_data <- matrix(rnorm(1e4^2), 1e4)  # ~800 MB; big enough that loading it is slow
saveRDS(big_data, file = "big_data.Rds")
# write a program that uses the data
big_data <- readRDS("big_data.Rds")
BIGGER_data <- big_data+rnorm(1)
print("hooray!")
# save this in a text file called `my_program.R`
# run this program a bunch
for (i in 1:1000) {
  system("Rscript my_program.R")
}
The bottleneck is loading the data. But what if I had a separate process somewhere that held the data in memory?
Maybe something like this:
# write a program to hold the data in memory
big_data <- readRDS("big_data.Rds")
# save this as `holder.R` open a terminal and do
Rscript holder.R
Now there is a process running somewhere with my data in memory. How can I get it from a different R session? (I'm assuming that this would be faster than loading it -- but is this correct?)
Maybe something like this:
# write another program:
big_data <- get_big_data_from_holder()
BIGGER_data <- big_data+1
print("yahoo!")
# save this as `my_improved_program.R`
# now do the following:
for (i in 1:1000) {
  system("Rscript my_improved_program.R")
}
So I guess my question is what would the function get_big_data_from_holder() look like? Is it possible to do this? Practical?
Backstory: I'm trying to work around what appears to be a memory leak in R's interface to keras/tensorflow, which I've described here. The workaround is to let the OS clean up all of the cruft left over from a TF session, so that I can run TF sessions one after another without my computer slowing to a crawl.
Edit: maybe I could do this with a clone() system call? Conceptually I can imagine that I'd clone the process running holder and then run all of the commands in the program that depend on the data that's loaded. But I don't know how this would be done.
You may also improve the performance of saving and loading the data by turning off compression:
saveRDS(..., compress = FALSE)
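For example, a minimal sketch with the big_data object from the question (system.time is just there to see the effect):
saveRDS(big_data, file = "big_data.Rds", compress = FALSE)  # skip compression
system.time(big_data <- readRDS("big_data.Rds"))            # time the reload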
You may find my filematrix package useful for storing and quickly accessing the big matrix.
To create it, run:
big_data = matrix(rnorm(1e4^2), 1e4)
library(filematrix)
fm = fm.create.from.matrix('matrix_file', big_data)
close(fm)
To access it from another R session:
library(filematrix)
fm = fm.open('matrix_file')
show(fm[1:3,1:3])
close(fm)
I'm trying to load IDATs using the Bioconductor methylumi package, with a script that has worked for previous files, but I'm coming across the unfamiliar error below when I use methylumIDAT:
> mset450k <- methylumIDAT(sampleSheet$Basename, idatPath=idatPath)
0 HumanMethylation27 samples found
48 HumanMethylation450 samples found
Error in readBin(con, what = "integer", n = n, size = 4, endian = "little", :
invalid 'n' argument
I'm unsure which n this is referring to, and how to resolve it. Searching for information on readBin arguments hasn't helped me understand it in this context.
The closest question I could find on here was this one about an audio file where the problem appears to be file size:
Invalid 'n' argument error in readBin() when trying to load a large (4GB+ audio file)
I'm not using a larger data set than on previous runs - in fact, when I run the script with a different folder of idats (same number of files) and a different sample sheet (same format) it works fine, so I don't think file size is the issue here.
Below is the full script
> idatPath<-c("idats2")
> sampleSheet<-read.csv("CrestarSampleSheet2.csv", stringsAsFactors = FALSE)
> sampleSheet<-cbind(paste(sampleSheet$CHIP.ID, sampleSheet$CHIP.Location,
sep = "_"), sampleSheet)
> colnames(sampleSheet)[1]<-"Basename"
> mset450k <- methylumIDAT(sampleSheet$Basename, idatPath=idatPath)
I have since solved this. The problem was that one of the idat files in my folder had not been fully downloaded before loading (my computer crashed during the download).
I figured out which one it was by loading files by individual file name (in large batches, then gradually smaller batches), e.g.
mset450k <- methylumIDAT(c("200397870043_R01C02", "200397870043_R02C02",
                           "200397870043_R03C02", "200397870043_R03C01"),
                         idatPath = idatPath)
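A hedged sketch of automating that search, wrapping the same methylumIDAT call in tryCatch and recording which barcodes fail to load (the bad vector name is mine):
bad <- character()
for (b in sampleSheet$Basename) {
  ok <- tryCatch({
    methylumIDAT(b, idatPath = idatPath)
    TRUE
  }, error = function(e) FALSE)
  if (!ok) bad <- c(bad, b)  # barcodes whose IDAT files will not read
}
bad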
I'm using the quantmod package and was loading a week's worth of data for each stock symbol. There were approximately 6,400 symbols retrieved by the stockSymbols() function, but when it reached around 5003 I got:
Error in file(fname, "w"): cannot open the connection
cannot open file 'path to temp file': Too many open files
Is there a work around or a setting that can change the limit on the number of open files that R permits?
That is a shell / OS limit which is handed down from the OS to R.
If you are on Linux, see man bash and look for ulimit: [...]
Edit: and credit to Josh for the reminder of another limit in R's connection code. A simple test script like this
N <- 130
fvec <- vector(length = N, mode = "list")
for (i in 1:N) {
  fname <- paste0("/tmp/foo", i, ".tmp")
  fvec[[i]] <- file(fname, "w")
}
Sys.sleep(3)
for (i in 1:N) {
  close(fvec[[i]])
}
seems to die when N > 128 but makes it fine up to somewhere near that value. Right now N = 125 worked for me; higher values die.
In a nutshell, you need to organize your program so that it can operate with fewer concurrently open file handles. Else, you may need to rebuild R with a higher default for open connections and make sure your OS lets you have as many too.
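As a hedged aside, base R can also show you which connections are currently open, which may help track down where the handles are being used up:
showConnections(all = TRUE)  # list every connection R currently has open
closeAllConnections()        # blunt instrument: close them all (use with care)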
I am trying to download 460,000 files from an FTP server (files I got from the TRMM archive data). I made a list of all the files and split them into different jobs, but can anyone help me run those jobs at the same time in R? Just an example of what I have tried to do:
my.list <- readLines("1998-2010.txt")  # lists the ftp address of each file

job1 <- for (i in 1:1000) {
  download.file(my.list[i], name[i], mode = "wb")
}
job2 <- for (i in 1001:2000) {
  download.file(my.list[i], name[i], mode = "wb")
}
job3 <- for (i in 2001:3000) {
  download.file(my.list[i], name[i], mode = "wb")
}
Now I'm stuck on how to run all of the jobs at the same time.
I'd appreciate your help.
Don't do that. Really. Don't. It won't be any faster, because the limiting factor is going to be the network speed. You'll just end up with a large number of even slower downloads, then the server will give up and throw you off, and you'll end up with a large number of half-downloaded files.
Downloading multiple files at once will also increase the disk load, since your PC is now trying to save a large number of files at the same time.
Here's another solution.
Use R (or some other tool; it's one line of awk starting from your list) to write an HTML file that just looks like this:
<a href="...">file-1.dat</a>
<a href="...">file-2.dat</a>
and so on. Now open this file in your web browser and use a download manager (e.g. DownThemAll for Firefox) to download all the links. You can specify the number of simultaneous downloads, how many times to retry failures, and so on with DownThemAll.
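If you would rather stay in R for that step, a minimal sketch (assuming my.list from the question holds the full ftp URLs; downloads.html is just a name I picked):
# write one anchor tag per URL so a download manager can pick them all up
links <- paste0('<a href="', my.list, '">', basename(my.list), '</a>')
writeLines(c("<html><body>", links, "</body></html>"), "downloads.html")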
A good option is to use mclapply or parLapply from the built-in parallel package. You then make a function that accepts a list of files that need to be downloaded:
library(parallel)
download_list <- function(file_list) {
  # download each file in this sublist, saving it under its base name
  lapply(file_list, function(url) download.file(url, destfile = basename(url), mode = "wb"))
}
# split the full vector of URLs into sublists of 1000 each
list_of_file_lists <- split(my_list, ceiling(seq_along(my_list) / 1000))
mclapply(list_of_file_lists, download_list)
I think it is wise to first split the big list of files into a set of sublists, since each element of the list fed to mclapply is dispatched as a unit of work. If that list is big and the processing time per item is small, the overhead of parallelisation is probably going to make the downloading slower instead of faster.
Do note that mclapply only works on Unix-alikes such as Linux (it relies on forking); parLapply should also work fine under Windows.
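A hedged sketch of the Windows-friendly variant, reusing download_list and list_of_file_lists from above (the cluster size of 4 is an arbitrary choice):
library(parallel)
cl <- makeCluster(4)                              # 4 worker processes
parLapply(cl, list_of_file_lists, download_list)  # same shape as the mclapply call
stopCluster(cl)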
Another approach: first make a while loop that looks at the destination files that already exist. If the predefined destination file name is already taken, the script generates a new one, so each download gets its own destination file. You can then parallelise the script: with 5 cores on my machine, I get 5 destination files being written to disk at a time. An lapply call can also be used to do this.
For example:
id <- 0
newDestinationFile <- "File.xlsx"
# keep generating names until one does not clash with an existing .xlsx file
while (newDestinationFile %in% list.files(path = getwd(), pattern = "[.]xlsx")) {
  newDestinationFile <- paste0("File", id, ".xlsx")
  id <- id + 1
}
# URLS is the address of the file to fetch in this run of the script
download.file(url = URLS, method = "libcurl", mode = "wb",
              quiet = TRUE, destfile = newDestinationFile)
I'm trying to normalize a large number of Affymetrix CEL files using R. However, some of them appear to be truncated, so when reading them I get the error:
Cel file xxx does not seem to have the correct dimensions
And the normalization stops. Manually removing the corrupted files and restarting every time would take very long. Do you know if there is a fast way (in R or with a tool) to detect the corrupted files?
PS: I'm 99.99% sure I'm normalizing CELs from the same platform together; it's really just truncated files :-)
One simple suggestion:
Can you just use a tryCatch block around your read.table call (or whichever read command you're using), and skip a file when you get that error message? You can also compile a list of corrupted files within the catch block (I recommend doing that so you are tracking corrupted files for future reference when running a big batch process like this). Here's the pseudocode:
corrupted.files <- character()
for (i in seq_along(files)) {
  x <- tryCatch(
    read.table(file = files[i]),
    error = function(e) {
      # record the truncated file and skip it; re-throw any other error
      if (grepl("correct dimensions", conditionMessage(e))) {
        corrupted.files <<- c(corrupted.files, files[i]); NULL
      } else stop(e)
    },
    finally = print(paste("finished with", files[i], "at", Sys.time()))
  )
  if (!is.null(x)) {
    # do something with the uncorrupted data
  }
}