error using methylumIDAT, invalid n argument for readBin - r

I'm trying to load IDATs using the Bioconductor methylumi package with a script that has worked for previous files, but I'm getting the unfamiliar error below when I call methylumIDAT:
> mset450k <- methylumIDAT(sampleSheet$Basename, idatPath = idatPath)
0 HumanMethylation27 samples found
48 HumanMethylation450 samples found
Error in readBin(con, what = "integer", n = n, size = 4, endian = "little", :
invalid 'n' argument
I'm unsure which n this refers to, or how to resolve it. Searching for information on readBin's arguments hasn't helped me understand it in this context.
The closest question I could find on here was this one about an audio file where the problem appears to be file size:
Invalid 'n' argument error in readBin() when trying to load a large (4GB+) audio file
I'm not using a larger data set than in previous runs. In fact, when I run the script with a different folder of IDATs (same number of files) and a different sample sheet (same format), it works fine, so I don't think file size is the issue here.
Below is the full script:
> idatPath<-c("idats2")
> sampleSheet<-read.csv("CrestarSampleSheet2.csv", stringsAsFactors = FALSE)
> sampleSheet<-cbind(paste(sampleSheet$CHIP.ID, sampleSheet$CHIP.Location,
sep = "_"), sampleSheet)
> colnames(sampleSheet)[1]<-"Basename"
> mset450k <- methylumIDAT(sampleSheet$Basename, idatPath=idatPath)

I have since solved this. The problem was that one of the IDAT files in my folder had not been fully downloaded before loading (my computer crashed during the download).
I figured out which one it was by loading files by individual file names (in large batches, then gradually smaller batches), e.g.:
mset450k <- methylumIDAT(c("200397870043_R01C02", "200397870043_R02C02",
                           "200397870043_R03C02", "200397870043_R03C01"),
                         idatPath = idatPath)
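For anyone hitting the same error, a quicker way to spot a truncated download (not from the original post; a minimal sketch assuming the IDATs live under idatPath) is to compare file sizes, since a partially downloaded IDAT is usually much smaller than its siblings:

# list all IDAT files and their sizes; a truncated download usually stands out
idat_files <- list.files(idatPath, pattern = "idat$", full.names = TRUE, ignore.case = TRUE)
sizes <- file.info(idat_files)$size
# sort ascending so suspiciously small files appear first
head(data.frame(file = basename(idat_files), size = sizes)[order(sizes), ])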

Related

Tesseract "Error in pixCreateNoInit: pix_malloc fail for data"

I'm trying to run this function within a function, based loosely on this. However, since xPDF can convert PDFs to PNGs, I skipped the ImageMagick conversion step, as well as the faulty logic in the function(i) step: pdftopng requires a root name (which here yields "ocrbook-000001.png") and throws an error when it looks for a PNG named after the original PDF's file name.
My issue now is getting Tesseract to do anything with my PNG files. I get the error:
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Error in pixCreateNoInit: pix_malloc fail for data
Error in pixCreate: pixd not made
Error in pixReadStreamPng: pix not made
Error in pixReadStream: png: no pix returned
Error in pixRead: pix not read
Error during processing.
Here is my code:
lapply(myfiles, function(i){
  shell(shQuote(paste0("pdftopng -f 1 -l 10 -r 600 ", i, " ocrbook")))
  mypngs <- list.files(path = dest, pattern = "png", full.names = TRUE)
  lapply(mypngs, function(z){
    shell(shQuote(paste0("tesseract ", z, " out")))
    file.remove(paste0(z))
  })
})
The issue, apparently, was that the DPI was set too high for Tesseract to handle. Changing the pdftopng DPI parameter from 600 to 150 appears to have corrected it; there seems to be a maximum DPI above which Tesseract can't process the image.
I have also changed my code from a static naming convention to a more dynamic one that mimics the files' original names.
dest <- "C:\\users\\YOURNAME\\desktop"
files <- tools::file_path_sans_ext(list.files(path = dest, pattern = "pdf", full.names = TRUE))
lapply(files, function(i){
shell(shQuote(paste0("pdftoppm -f 1 -l 10 -r 150 ", i,".pdf", " ",i)))
})
myppms <- tools::file_path_sans_ext(list.files(path = dest, pattern = "ppm", full.names = TRUE))
lapply(myppms, function(y){
shell(shQuote(paste0("magick ", y,".ppm"," ",y,".tif")))
file.remove(paste0(y,".ppm"))
})
mytiffs <- tools::file_path_sans_ext(list.files(path = dest, pattern = "tif", full.names = TRUE))
lapply(mytiffs, function(z){
shell(shQuote(paste0("tesseract ", z,".tif", " ",z)))
file.remove(paste0(z,".tif"))
})
Background
It sounds like you already solved your problem. Yay! I'm writing this answer because I encountered a very similar problem calling tesseract from R and wanted to share some of the workarounds I came up with in case anyone else stumbles across the post and needs further troubleshooting ideas.
In my case I was converting a batch of faxes (about 3,000 individual PDF files, most of them 1-15 pages) to text. I used an apply function to store the text from each fax as a separate entry in a list (length = number of faxes = ~3,000). The list was then flattened to a vector, that vector was combined with a vector of file names to make a data frame, and finally I wrote the data frame to a CSV file. (See below for the code I used.)
The problem was I kept getting the same string of errors that you got:
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Error in pixCreateNoInit: pix_malloc fail for data
Error in pixCreate: pixd not made
Error in pixReadStreamPng: pix not made
Error in pixReadStream: png: no pix returned
Error in pixRead: pix not read
Error during processing.
Followed by this error: Error in FUN(X[[i]], ...) : basic_string::_M_construct null not valid
What I think the problem is
What was weird for me was that when I re-ran the code multiple times, the error always occurred on a different fax. It also seemed to occur more often when I was doing something else that used a lot of RAM or CPU (opening Microsoft Teams, etc.). I tried changing the DPI as suggested in the first answer, and that didn't seem to help.
It was also noticeable that while this code was running I was regularly using close to 100% of RAM and 50% of CPU (based on the Windows Task Manager).
When I ran this process (on a similar batch of about 3,000 faxes) on a Linux machine with significantly more RAM and CPU, I never encountered this problem.
basic_string::_M_construct null not valid appears to be a C++ error. I'm not familiar with C++, but it sounds like a catch-all error indicating that something that should have been created wasn't.
Based on all that, I think the problem is that R runs out of memory, and in response the memory available to some of the underlying Tesseract processes gets throttled. There isn't enough memory to convert a PDF to a PNG and then extract the text, which is what throws these errors. That in turn means a text blob doesn't get created where one is expected, producing the final C++ error: basic_string::_M_construct null not valid. It's possible that lowering the DPI is what gave your process enough memory to complete, but the fundamental underlying problem may have been memory, not DPI.
Possible workarounds
So, I'm not sure about any of what I just said, but running with that assumption, here are some ideas I came up with for people running the tesseract package in R who encounter similar problems:
Switch from RStudio to Rgui: This alone solved my problem. I was able to complete the whole 3,000-fax process without any errors using Rgui. Rgui also used between 100-400 MB instead of the 1,000+ MB that RStudio used, and about 25% of CPU instead of 50%. Putting R in the path and running it from the console, or running R in the background, might reduce memory use even further.
Close any memory-intensive processes while the code is running: Microsoft Teams, videoconferencing, streaming, Docker on Windows, and the Windows Subsystem for Linux are all huge memory hogs.
Lower the DPI: As suggested by the first answer, this would also probably reduce memory use.
Break the process up: Running the job in batches of about 500 faxes might also reduce the amount of working memory R has to hold before writing to file (see the sketch after the example code below).
These are all quick and easy solutions that can be done from R without having to learn C++ or upgrade hardware. A more durable solution would probably require more customization of the Tesseract parameters, implementing the process in C++, changing memory-allocation settings for R and the operating system, or buying more RAM.
Example Code
# Load Libraries
library(tesseract)
dir.create("finished_data")

# Define Functions
ocr2 <- function(pdf_path){
  # tell tesseract which language to guess
  eng <- tesseract("eng")
  # convert to png first
  # pngfile <- pdftools::pdf_convert(pdf_path, dpi = 300)
  # tell tesseract to convert the pdf at pdf_path
  seperated_pages <- tesseract::ocr(pdf_path, engine = eng)
  # combine all the pages into one page
  combined_pages <- paste(seperated_pages, collapse = "**new page**")
  # I delete png files as I go to avoid overfilling the hard drive
  # because work computer has no hard drive space :'(
  png_file_paths <- list.files(pattern = "png$")
  file.remove(png_file_paths)
  combined_pages
}

# find pdf_paths
fax_file_paths <- list.files(path = "./raw_data",
                             pattern = "pdf$",
                             recursive = TRUE)

# this converts all the pdfs to text using the ocr
faxes <- lapply(paste0("./raw_data/", fax_file_paths), ocr2)

fax_table <- data.frame(file_name = fax_file_paths, file_text = unlist(faxes))
write.csv(fax_table,
          file = paste0("./finished_data/faxes_", format(Sys.Date(), "%b-%d-%Y"), "_test.csv"),
          row.names = FALSE)
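If you want to try the batching workaround from above, here is a minimal sketch (not from the original post) that layers it on the ocr2() helper, writing each batch to its own CSV so partial results survive a crash; the batch size of 500 is only an illustration:

# split the file paths into batches of roughly 500 and OCR one batch at a time
batch_size <- 500
batches <- split(fax_file_paths, ceiling(seq_along(fax_file_paths) / batch_size))

for (b in seq_along(batches)) {
  batch_paths <- batches[[b]]
  batch_text  <- lapply(paste0("./raw_data/", batch_paths), ocr2)
  batch_table <- data.frame(file_name = batch_paths, file_text = unlist(batch_text))
  # write each batch separately so memory is freed and partial results are kept
  write.csv(batch_table,
            file = paste0("./finished_data/faxes_batch_", b, ".csv"),
            row.names = FALSE)
  rm(batch_text, batch_table); gc()
}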

Error in R: File size and implied file size differ, consider trying repair=TRUE

I have a huge number of .shp files, named by serial number (e.g. shp1, shp2, ...). I want to read them into R, but some of them are empty, so I get the error:
"Error in read.shape(filen = fn, verbose = verbose, repair = repair) :
File size and implied file size differ, consider trying repair=TRUE"
My question is how to skip those empty .shp files and continue reading the rest in a loop. I think tryCatch is the answer, but I have never used it. Can someone help me?
Thanks.
mylist <- list()
for (i in 1:5000){
  mylist[[i]] <- readShapeSpatial(paste0("inhere/shp", i, ".shp"))
}
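A minimal sketch of the tryCatch approach (it assumes the same file naming and readShapeSpatial call as in the question, with maptools loaded) wraps each read so a failure leaves a NULL entry instead of stopping the loop:

mylist <- vector("list", 5000)
for (i in 1:5000){
  mylist[[i]] <- tryCatch(
    readShapeSpatial(paste0("inhere/shp", i, ".shp")),
    error = function(e) NULL   # skip empty or corrupt shapefiles and keep looping
  )
}
# drop the entries that failed to read
mylist <- Filter(Negate(is.null), mylist)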

XMLtoDataFrame: "Duplicate subscripts for columns" When trying to load multiple files (R)

I'm trying to load many .xml files from a folder in R. The files are contained in one folder and its subfolders. I tried to get them all in one shot using the following code:
allfiles <- list.files("MyDirectory", pattern = '*.xml', recursive = TRUE, full.names = TRUE)
hope <- do.call(rbind.fill,lapply(allfiles,xmlToDataFrame))
Unfortunately, this is the result:
"Error in `[<-.data.frame`(`*tmp*`, i, names(nodes[[i]]), value = c("12786998421436773", : duplicate subscripts for columns'"
I'm not even quite sure what the problem is. It works when I do it file by file, but there are upwards of 30,000 files, so that's not really feasible. I've tried a different way (using a for loop), but then I get an error like "xml file is not of type XML", even though the file name it reports ends in '.xml'.
Any clarity would be greatly appreciated.
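One way to narrow this down (not part of the original question; a sketch assuming the same allfiles vector and that the XML and plyr packages are loaded) is to parse each file inside tryCatch, so the conversion keeps going and you end up with a list of the files that fail:

parsed <- lapply(allfiles, function(f) {
  tryCatch(xmlToDataFrame(f), error = function(e) NULL)  # NULL marks a failing file
})
bad_files <- allfiles[sapply(parsed, is.null)]   # inspect these files individually
hope <- do.call(rbind.fill, parsed[!sapply(parsed, is.null)])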

Reading large csv file with missing data using bigmemory package in R

I am using large datasets for my research (4.72 GB) and I discovered the "bigmemory" package in R, which supposedly handles large datasets (up to around 10 GB). However, when I use read.big.matrix to read a CSV file, I get the following error:
> x <- read.big.matrix("x.csv", type = "integer", header=TRUE, backingfile="file.bin", descriptorfile="file.desc")
Error in read.big.matrix("x.csv", type = "integer", header = TRUE,
: Dimension mismatch between header row and first data row.
I think the issue is that the CSV file is not complete, i.e., it is missing values in several cells. I tried removing header = TRUE, but then R aborts and restarts the session.
Does anyone have experience with reading large csv files with missing data using read.big.matrix?
This may not solve your problem directly, but you might find a package of mine, filematrix, useful. The relevant function is fm.create.from.text.file.
Please let me know if it works for your data file.
Did you check the bigmemory PDF at https://cran.r-project.org/web/packages/bigmemory/bigmemory.pdf?
It is clearly described right there.
write.big.matrix(x, 'IrisData.txt', col.names=TRUE, row.names=TRUE)
y <- read.big.matrix("IrisData.txt", header=TRUE, has.row.names=TRUE)
# The following would fail with a dimension mismatch:
if (FALSE) y <- read.big.matrix("IrisData.txt", header=TRUE)
Basically, the error means there is a column of row names in the CSV file. If you don't pass has.row.names=TRUE, bigmemory treats the row names as a separate data column, and you get the dimension mismatch.
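Applied to the original call, the fix suggested above might look like this (a sketch, assuming x.csv really does carry a row-name column):

x <- read.big.matrix("x.csv", type = "integer", header = TRUE,
                     has.row.names = TRUE,
                     backingfile = "file.bin", descriptorfile = "file.desc")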
I personally found the data.table package more useful for dealing with large data sets, YMMV.
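For what it's worth, a minimal sketch of that route (not from the original answer) would be:

library(data.table)
# fread reads large delimited files quickly; empty cells come in as NA
x <- fread("x.csv", header = TRUE)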

Troubles with cbc.read.table function in R [duplicate]

Possible Duplicate:
Some issues trying to read a file with cbc.read.table function in R + using filter while reading files
a) I'm trying to read a relatively big .txt file with the function cbc.read.table from the colbycol package in R. According to what I've been reading, this package makes the job easier when we have large files (more than a GB to be read into R) and we don't need all of the columns/variables for our analysis. I also read that cbc.read.table supports the same parameters as read.table. However, if I pass the parameter nrows (in order to get a preview of my file in R), I get the following error:
# My line of code; I'm just reading columns 5, 6, 7, 8 out of 27
i.can <- cbc.read.table("xxx.txt", header = TRUE, sep = "\t", just.read = 5:8, nrows = 20)

# error message
Error in read.table(file, nrows = 50, sep = sep, header = header, ...) :
formal argument "nrows" matched by multiple actual arguments
So, my question is: how can I solve this problem?
b) After that, I tried to read all instances with the following code:
i.can.b <- cbc.read.table("xxx.txt", header = TRUE, sep = "\t", just.read = 4:8)  # done perfectly
my.df <- as.data.frame(i.can.b)  # getting error in this line
Error in readSingleKey(con, map, key) : unable to obtain value for key 'Company' #Company is a string column in my data set
So, my question is again: How can I solve this?
c) Do you know a way in which I can filter (by conditions on instances) while reading files?
In reply to a):
cbc.read.table() reads in the data in 50-row chunks:
tmp.data <- read.table(file, nrows = 50, sep = sep, header = header, ...)
Since the function already assigns the nrows argument the value 50 internally, when it also passes along the nrows you specify, read.table() receives two nrows arguments, which produces the error. To me, this looks like a bug. To get around it, you could modify cbc.read.table() to handle a user-specified nrows argument, or to accept something like a max.rows argument (and perhaps pass the change along to the maintainer as a potential patch). Alternatively, you can specify the sample.pct argument, which gives the proportion of rows to read; so, if the file contains 100 rows and you only want 50, use sample.pct = 0.5.
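Applied to the original call, that workaround might look like the sketch below; it assumes sample.pct behaves as described above, and the exact proportion depends on how many rows the file actually has:

library(colbycol)
# preview columns 5-8 by sampling roughly 1% of rows instead of passing nrows
i.can <- cbc.read.table("xxx.txt", header = TRUE, sep = "\t",
                        just.read = 5:8, sample.pct = 0.01)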
In reply to b):
Not sure what that error means. It is hard to diagnose without a reproducible example. Do you get the same error if you read in a smaller file?
In reply to c):
I generally prefer storing very large character data in a relational database, such as MySQL. In your case it might be easier to use the RSQLite package, which embeds an SQLite engine within R; SQL SELECT queries can then be used to retrieve conditional subsets of the data. Other packages for larger-than-memory data can be found under "Large memory and out-of-memory data" at http://cran.r-project.org/web/views/HighPerformanceComputing.html
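A minimal sketch of the RSQLite route (not from the original answer; the table name, column name, and filter value are made up for illustration):

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "mydata.sqlite")
# load the delimited file into a table once (for a very large file this step could be chunked)
dbWriteTable(con, "mydata", read.table("xxx.txt", header = TRUE, sep = "\t"))
# filter rows while reading, instead of loading everything into R
subset_df <- dbGetQuery(con, "SELECT * FROM mydata WHERE Company = 'ACME'")
dbDisconnect(con)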
