read_tsv stalls: is this an encoding issue?

read_tsv stalls: is this an encoding issue? - r

I apologize in advance for the lack of specificity of this post but I can't provide a reproducible example in this case. I'm trying to read a tab-separated data file with R readr's read_tsv. The data is from a confidential source so I can't share it, even just the problematic part. read_tsv stalls around 20% of reading progress and unless I kill R quickly, my RAM usage starts blowing up to the point that my computer freezes (I'm on Ubuntu 18.04). Specifically, I'm running:
read_tsv(file = path_to_file,
skip = 10e6,
n_max = 1e5)
I'm skipping lines and setting n_max to vaguely isolate where the problem is and run faster tests. I also tried setting read_tsv's locale to locale(encoding = 'latin1') without success. I tried inspecting this problematic part by reading it with readr's read_lines:
read_lines(file = path_to_file,
skip = 10e6,
n_max = 1e5)
There's no reading problem there: I'm getting a list of character strings. I ran validUTF8 on all of them and they all seem valid. I just have no idea what type of problem could cause read_tsv to stall. Any ideas?

I solved the problem. It seems like it came from inappropriate handling of quoting characters with the default read_tsv quote option. Using quotes = "" instead made it work smoothly.

Related

Tesseract "Error in pixCreateNoInit: pix_malloc fail for data"

trying to run this function within a function based loosely off of this, however, since xPDF can convert PDFs to PNGs, I skipped the ImageMagick conversion step, as well as the faulty logic with the function(i) process, since pdftopng requires a root name and that is "ocrbook-000001.png" in this case and throws an error when looking for a PNG of the original PDF's file name.
My issue is now with getting Tesseract to do anything with my PNG files. I get the error:
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Error in pixCreateNoInit: pix_malloc fail for data
Error in pixCreate: pixd not made
Error in pixReadStreamPng: pix not made
Error in pixReadStream: png: no pix returned
Error in pixRead: pix not read
Error during processing.
Here is my code:
lapply(myfiles, function(i){
shell(shQuote(paste0("pdftopng -f 1 -l 10 -r 600 ", i, " ocrbook")))
mypngs <- list.files(path = dest, pattern = "png", full.names = TRUE)
lapply(mypngs, function(z){
shell(shQuote(paste0("tesseract ", z, " out")))
file.remove(paste0(z))
})
})

The issue was the DPI set too high for Tesseract to handle, apparently. Changing the PDFtoPNG DPI parameter from 600 to 150 appears to have corrected the issue. There seems to be a max DPI for Tesseract to understand and know what to do.
I have also corrected my code from a static naming convention to a more dynamic one that mimics the file's original names.
dest <- "C:\\users\\YOURNAME\\desktop"
files <- tools::file_path_sans_ext(list.files(path = dest, pattern = "pdf", full.names = TRUE))
lapply(files, function(i){
shell(shQuote(paste0("pdftoppm -f 1 -l 10 -r 150 ", i,".pdf", " ",i)))
})
myppms <- tools::file_path_sans_ext(list.files(path = dest, pattern = "ppm", full.names = TRUE))
lapply(myppms, function(y){
shell(shQuote(paste0("magick ", y,".ppm"," ",y,".tif")))
file.remove(paste0(y,".ppm"))
})
mytiffs <- tools::file_path_sans_ext(list.files(path = dest, pattern = "tif", full.names = TRUE))
lapply(mytiffs, function(z){
shell(shQuote(paste0("tesseract ", z,".tif", " ",z)))
file.remove(paste0(z,".tif"))
})

Background
It sounds like you already solved your problem. Yay! I'm writing this answer because I encountered a very similar problem calling tesseract from R and wanted to share some of the workarounds I came up with in case anyone else stumbles across the post and needs further troubleshooting ideas.
In my case I was converting a bunch of faxes (about 3000 individual pdf files, most of them between 1-15 pages) to text. I used an apply function to make the text from each individual fax as a separate entry in a list (length = number of faxes = ~ 3000). Then the faxes were converted to a vector and then that vector was combined with a vector of file names to make a data frame. Finally I wrote the data frame to a csv file. (See below for the code I used).
The problem was I kept getting the same string of errors that you got:
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Error in pixCreateNoInit: pix_malloc fail for data
Error in pixCreate: pixd not made
Error in pixReadStreamPng: pix not made
Error in pixReadStream: png: no pix returned
Error in pixRead: pix not read
Error during processing.
Followed by this error: error in FUN(X[[i]], ...) : basic_string::_M_construct null not valid
What I think the problem is
What was weird for me was that I re-ran the code multiple times and it was always a different fax where the error occurred. It seemed to also occur more often when I was trying to do something else that used a lot of RAM or CPU (opening microsoft teams etc.). I tried changing the DPI as suggested in the first answer and that didn't seem to work.
It was also noticeable that while this code was running I was regularly using close to 100% of RAM and 50% of CPU (based on windows task manager).
When I ran this process (on a similiar batch of about 3,000 faxes) on linux machine with significantly more RAM and CPU I never encountered this problem.
basic_string::_M_construct null not valid, appears to be a c++ error. I'm not familiar with c++, but it sort of sounds like it's a bit of a catch all error that might indicate something that should have been created wasn't created.
Based on all that, I think the problem is that R runs out of memory and in response somehow the memory available to some of the underlying tesseract processes gets throttled off. This means there's not enough memory to convert a pdf to a png and then extract the text which is what throws these errors. This leads to a text blob not getting created where one is expected and the final C++ error of : basic_string::_M_construct null not valid It's possible that lowering the dpi is what gave your process enough memory to complete, but maybe the fundamental underlying problem was the memory not the DPI.
Possible workarounds
So, I'm not sure about any of what I just said, but running with that assumption, here's some ideas I came up with for people running the tesseract package in R who encounter similar problems:
Switch from Rstudio to Rgui: This alone solved my problem. I was able to complete the whole 3000 fax process without any errors using Rgui. Rgui also used between 100-400 MB instead 1000+ that Rstudio used, and about 25% of CPU instead of 50%. Putting R in the path and running R from the console or running R in the background might reduce memory use even further.
Close any memory intensive processes while the code is running. Microsoft teams, videoconferencing, streaming, docker on windows and the windows linux subsystem are all huge memory hogs.
lower DPI As suggested by the first answer, this would also probably reduce memory use.
break the process up. I think running my processes in batches of about 500 might have also reduced the amount of working memory R has to take up before writing to file.
These are all quick and easy solutions that can be done from R without having to learn C++ or upgrade hardware. A more durable solution probably would require doing more customization of the tesseract parameters, implementing the process in C++, changing memory allocation settings for R and the operating system, or buying more RAM.
Example Code
# Load Libraries
library(tesseract)
dir.create("finished_data")
# Define Functions
ocr2 <- function(pdf_path){
# tell tesseract which language to guess
eng <- tesseract("eng")
#convert to png first
#pngfile <- pdftools::pdf_convert(pdf_path, dpi = 300)
# tell tesseract to convert the pdf at pdf_path
seperated_pages <- tesseract::ocr(pdf_path, engine = eng)
#combine all the pages into one page
combined_pages <- paste(seperated_pages, collapse = "**new page**")
# I delete png files as I go to avoid overfilling the hard drive
# because work computer has no hard drive space :'(
png_file_paths <- list.files(pattern = "png$")
file.remove(png_file_paths)
combined_pages
}
# find pdf_paths
fax_file_paths <- list.files(path="./raw_data",
pattern = "pdf$",
recursive = TRUE)
#this converts all the pdfs to text using the ocr
faxes <- lapply(paste0("./raw_data/",fax_file_paths),
ocr2)
fax_table <- data.frame(file_name= fax_file_paths, file_text= unlist(faxes))
write.csv(fax_table, file = paste0("./finished_data/faxes_",format(Sys.Date(),"%b-%d-%Y"), "_test.csv"),row.names = FALSE)

Inconsistent results between fread() and read.table() for .tsv file in R

My question is in response to two issues I encountered in reading a .tsv file published that contains campaign finance data.
First, the file has a null character that terminates input and throws the error 'embedded nul in string: 'NAVARRO b\0\023 POWERS' when using data.table::fread(). I understand that there are a number of potential solutions to this problem but I was hoping to find something within R. Having seen the skipNul option in read.table(), I decided to give it a shot.
That brings me to the second issue: read.table() with reasonable settings (comment.char = "", quote = "", fill = T) is not throwing an error but it is also not detecting the same filesize that data.table::fread() identified (~100k rows with read.table() vs. ~8M rows with data.table::fread()). The fread() answer seems to be more correct as the file size is ~1.5GB and data.table::fread() identifies valid data when reading in rows leading up to where the error seems to be.
Here is a link to the code and output for the issue.
Any ideas on why read.table() is returning such different results? fread() operates by guessing characteristics of the input file but it doesn't seem to be guessing any exotic options that I didn't use in read.table().
Thanks for your help!
NOTE
I do not know anything about the file in question other than the source and what information it contains. The source is from the California Secretary of State by the way. At any rate, the file is too large to open in excel or notepad so I haven't been able to visually examine the file besides looking at a handful of rows in R.

I couldn't figure out an R way to deal with the issue but I was able to use a python script that relies on pandas:
import pandas as pd
import os
os.chdir(path = "C:/Users/taylor/Dropbox/WildPolicy/Data/Campaign finance/CalAccess/DATA")
receipts_chunked = pd.read_table("RCPT_CD.TSV", sep = "\t", error_bad_lines = False, low_memory = False, chunksize = 5e5)
chunk_num = 0
for chunk in receipts_chunked:
chunk_num = chunk_num + 1
file_name = "receipts_chunked_" + str(chunk_num) + ".csv"
print("Output file:", file_name)
chunk.to_csv(file_name, sep = ",", index = False)
The problem with this route is that, with 'error_bad_lines = False', problem rows are simply skipped instead of erroring out. There are only a handful of error cases (out of ~8 million rows) but this is still suboptimal obviously.

Getting rid of BOM between SAS and R

I used SAS to save a tab-delimited text file with utf8 encoding on a windows machine. Then I tried to open this in R:
read.table(myfile, header =TRUE, sep = "\t")
To my surprise, the data was totally messed up, but only in a sneaky way. Number values changed randomly, but the overall layout looked normal, so it took me a while to notice the problem, which I'm assuming now is the BOM.
This is not a new issue of course; they address it briefly here, and recommend using
read.table(myfile, fileEncoding = "UTF-8", header =TRUE, sep = "\t")
However, this made no improvement! My only solution was to suppress the header, with or without the fileEncoding argument:
read.table(myfile, fileEncoding = "UTF-8", header =FALSE, sep = "\t")
read.table(myfile, header =FALSE, sep = "\t")
In either case, I have to do some funny business to replace the column names with the first row, but only after I remove some version of the BOM that appears at the beginning of the first column name (<U+FEFF> if I use fileEncoding and
ï»¿ if I don't use fileEncoding).
Isn't there a simple way to just remove the BOM and use read.table without any special arguments?
Update for #Joe:
The SAS that I used:
FILENAME myfile 'C:\Documents ... file.txt' encoding="utf-8";
proc export data=lib.sastable
outfile=myfile
dbms=tab replace;
putnames=yes;
run;
Update on further weirdness: Using fileEncoding="UTF-8-BOM" as #Joe suggested in his solution below seems to remove the BOM. However, it did not fix my original motivating problem, which is corruption in the data; the header row is fine, but weirdly the last few digits of the first column of numbers gets messed up. I'll give Joe credit for his answer -- maybe my problem is not actually a BOM issue?
Hack solution: Use fileEncoding="UTF-8-BOM" AND also include the argument colClasses = "character". No idea why this works to fix the data corruption issue -- could be the topic of a future question.

As per your link, it looks like it works for me with:
read.table('c:\\temp\\testfile.txt',fileEncoding='UTF-8-BOM',header=TRUE,sep='\t')
note the -BOM in the file encoding.
This is in 2.1 Variations on read.table in the r documentation. Under 12 Encoding, see "Under UNIX you might need...", which apparently applies even on Windows now (for me, at least).

or you can use the sas system option options=NOBOMFILE the write a uft-8 file without the BOM.

Can scan (or any import function) return partial results after it bump into errors?

Is there anything I can do to get partial results from after bumping into errors in a big file? I am using the following command to import data from files. This is the fastest way I know, but it's not robust. It can easily screw up everything because of a small error. I hope at least there is way that scan(or any reader) can quickly return which row/line has the error, or partial results it read (than I will have an idea where the error is). Then, I can skip enough lines to recover over 99% good data.
rawData = scan(file = "rawData.csv", what = scanformat, sep = ",", skip = 1, quiet = TRUE, fill = TRUE, na.strings = c("-", "NA", "Na","N"))
All importing data tutorials I found seem to assume the files are in good shape. I didn't find a useful hint to deal with dirty files.
I will sincerely appreciate any hint or suggestion! It was really frustrating.

Idea1: Open a file connection (with file function) and then scan line by line (with nlines=1). Put each scan into try to recover after reading a bad line.
Idea2: Use readLines to read the file in raw format; then use strsplit to parse. You can analyse this output to find bad lines and remove it.

The count.fields function will preprocess a table like file and give you how many fields it found on each line (in the sense that read.table will look for fields). This is often a quick way to identify lines that have a problem because they will show a different number of fields from what is expected (or just different from the majority of other lines).

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

read_tsv stalls: is this an encoding issue? - r

I solved the problem. It seems like it came from inappropriate handling of quoting characters with the default read_tsv quote option. Using quotes = "" instead made it work smoothly.

Related

Tesseract "Error in pixCreateNoInit: pix_malloc fail for data"

Inconsistent results between fread() and read.table() for .tsv file in R

Getting rid of BOM between SAS and R

More problems with "incomplete final line"

Can scan (or any import function) return partial results after it bump into errors?

Categories

Resources