I'm trying to write a script to read a series of PDFs, OCR them using the tesseract package, and then work with the text I can extract.
So far, I'm at the following:
ReportDensity <- list()
AllReports <- list.files(path = "path",pattern = "*.PDF",full.names=TRUE)
and then I needed the page count for each PDF so that I can read the image data:
for (i in seq(AllReports))
ReportDensity[[i]] <- pdf_info(AllReports[[i]])
ReportDensity <- lapply(ReportDensity, `[[`, 2)
Now, what I want to do is save each page of each PDF as a separate image file so that I can OCR it.
for (i in seq(AllReports))
for (j in 1:ReportDensity[[i]])
(assign(paste0("Report_",i,"_Page_",j),image_read_pdf(AllReports[[i]],pages = ReportDensity[j])))
The error message I receive is:
"Error in poppler_render_page(loadfile(pdf), page, dpi, opw, upw, antialiasing, :
Invalid page."
which I believe to be because I wrote the loop incorrectly. I have tested the code by manually putting in image/page numbers, and it loads correctly.
I'm hoping that the end result would be a series of image files of the form "Report_ReportNumber_PageNumber" that I could then process.
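For reference, a minimal sketch of how the loop could look once the page index is passed directly; this assumes the pdftools and magick packages and the objects built above, and the key change from the code above is pages = j rather than pages = ReportDensity[j]:

library(pdftools)
library(magick)

# Sketch only: pass the page index j itself to image_read_pdf(),
# not the page count stored in ReportDensity.
for (i in seq_along(AllReports)) {
  for (j in seq_len(ReportDensity[[i]])) {
    assign(paste0("Report_", i, "_Page_", j),
           image_read_pdf(AllReports[[i]], pages = j))
  }
}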
PDFs are mostly text (most often);
I usually extract text from PDFs using Python's pdf2txt, page by page, run on the shell through a call like:
i=pagenumber
system(paste("pdf2txt -p", i, "-o text.txt pdffile.pdf"))
Then you can grep text from each page; the -o flag can output HTML or XML, which you can scrape with library(rvest).
pdfimages extracts the images contained in PDFs, and you can OCR those:
system(paste("pdfimages -f", i, "-l", i, "-p -png pdffile.pdf imagefile"))
That may output a lot of PNGs from a single page; they come out numbered:
system(paste0("tesseract imagefile-",i,"-006.png out6"))
tesseract has several parameters you must tune before getting a decent result.
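Putting those pieces together, a rough sketch of the whole flow under my own assumptions: pdfimages and tesseract are installed and on the PATH, pdffile.pdf sits in the working directory, and the output file naming may need adjusting to whatever pdfimages writes on your system:

# extract all embedded images; -p adds the page number to each file name
system("pdfimages -p -png pdffile.pdf imagefile")

# OCR each extracted image; tesseract writes <basename>.txt next to it
for (png in list.files(pattern = "^imagefile-.*\\.png$")) {
  system(paste("tesseract", png, sub("\\.png$", "", png)))
}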
Related
I have 500+ .json files from which I am trying to extract a specific element. I cannot figure out why I cannot read more than one at a time.
This works:
library(jsonlite)
files<-list.files('~/JSON')
file1<-fromJSON(readLines('~/JSON/file1.json'),flatten=TRUE)
result<-as.data.frame(source=file1$element$subdata$data)
However, regardless of using different json packages (eg RJSONIO), I cannot apply this to the entire contents of files. The error I continue to get is...
Attempt to run the same code as a function over all contents in the file list:
for (i in files) {
  fromJSON(readLines(i), flatten = TRUE)
  as.data.frame(i)$element$subdata$data
}
My goal is to loop through all 500+ files and extract the data and their contents. Specifically, if a file has the element 'subdata$data', I want to extract the list and put them all in a data frame.
Note: the files are being read as ASCII (Windows OS). This does not have a negative effect on single extractions, but for the loop I get 'invalid character bytes'.
Update 1/25/2019
Ran the following but returned errors...
files<-list.files('~/JSON')
out<-lapply(files,function (fn) {
o<-fromJSON(file(i),flatten=TRUE)
as.data.frame(i)$element$subdata$data
})
Error in file(i): object 'i' not found
Also updated the function, this time with UTF-8 errors...
files<-list.files('~/JSON')
out<-lapply(files,function (i,fn) {
o<-fromJSON(file(i),flatten=TRUE)
as.data.frame(i)$element$subdata$data
})
Error in parse_con(txt,bigint_as_char):
lexical error: invalid bytes in UTF8 string. (right here)------^
Latest Update
Think I found a solution to the crazy 'bytes' problem. When I run readLines on the .json file, I can then apply fromJSON,
e.g.
json<-readLines('~/JSON/file1.json')
jsonread<-fromJSON(json)
jsondf<-as.data.frame(jsonread$element$subdata$data)
#returns a dataframe with the correct information
Problem is, I cannot apply readLines to all the files within the JSON folder (PATH). If I can get help with that, I think I can run...
files<-list.files('~/JSON')
for (i in files){
a<-readLines(i)
o<-fromJSON(file(a),flatten=TRUE)
as.data.frame(i)$element$subdata}
Needed Steps
apply readLines to all 500 .json files in the JSON folder
apply fromJSON to the files from step 1
create a data.frame that returns entries if the list (from fromJSON) contains $element$subdata$data
Thoughts?
Solution (Workaround?)
Unfortunately, fromJSON still runs into trouble with the .json files. My guess is that my GET method (httr) is unable to wait/delay and load the 'pretty print', and is thus grabbing the raw .json, which in turn gives odd characters and, as a result, the ubiquitous '------^' error. Nevertheless, I was able to put together a solution; please see below. I want to post it for future folks that may have the same problem with .json files not working nicely with any R json package.
#keeping the same 'files' variable as earlier
raw_data<-lapply(files,readLines)
dat<-do.call(rbind,raw_data)
dat2<-as.data.frame(dat,stringsAsFactors=FALSE)
#check to see json contents were read-in
dat2[1,1]
library(tidyr)
dat3<-separate_rows(dat2,sep='')
x<-unlist(raw_data)
x<-gsub('[[:punct:]]', ' ',x)
#Identify elements wanted in original .json and apply regex
y<-regmatches(x,regexec('.*SubElement2 *(.*?) *Text.*',x))
for loops never return anything, so you must save all valuable data yourself.
You call as.data.frame(i), which creates a frame with exactly one element, the filename; probably not what you want to keep.
(Minor) Use fromJSON(file(i),...).
Since you want to capture these into one frame, I suggest something along the lines of:
out <- lapply(files, function(fn) {
o <- fromJSON(file(fn), flatten = TRUE)
as.data.frame(o)$element$subdata$data
})
allout <- do.call(rbind.data.frame, out)
### alternatives:
allout <- dplyr::bind_rows(out)
allout <- data.table::rbindlist(out)
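One practical note, as an assumption about the setup rather than part of the answer: list.files('~/JSON') returns bare file names, so either setwd() to that folder or ask list.files for full paths; and since only some files contain $element$subdata$data, you can drop the empty results before binding. A sketch (the extraction line is a slight variation on the answer's):

files <- list.files('~/JSON', pattern = "\\.json$", full.names = TRUE)

out <- lapply(files, function(fn) {
  o <- fromJSON(file(fn), flatten = TRUE)
  o$element$subdata$data          # NULL when the file lacks this element
})

# keep only the files that actually contained the element, then bind
out <- Filter(Negate(is.null), out)
allout <- do.call(rbind.data.frame, out)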
I am trying to write R code where I input a URL and output (save on the hard drive) a .txt file. I created a large list of URLs using the "edgarWebR" package. An example would be "https://www.sec.gov/Archives/edgar/data/1131013/000119312518074650/d442610dncsr.htm". Basically:
open the link
Copy everything (CTRL+A, CTRL+C)
open an empty text file and paste the content (CTRL+V)
save .txt file under specified name
(in a looped fashion of course). I am inclined to "hard code it" (as in, open the website in a browser using browseURL(...) and use "send keys" commands), but I am afraid that it will not run very smoothly. However, other commands (such as readLines()) seem to copy the HTML structure and therefore return not only the text.
In the end I am interested in a short paragraph of each of those shareholder letters (containing only text; therefore tables/graphs are of no concern in my particular setup).
Is anyone aware of an R function that would help?
thanks in advance!
Let me know in case the below code works for you. xpathSApply can be applied to different HTML components as well; in your case only paragraphs are required.
library(RCurl)
library(XML)
# Create character vector of urls
urls <- c("url1", "url2", "url3")
for ( url in urls) {
# download html
html <- getURL(url, followlocation = TRUE)
# parse html
doc = htmlParse(html, asText=TRUE)
plain.text <- xpathSApply(doc, "//p", xmlValue)
# write the extracted text to a file
# depends whether you need separate files for each url or same
fileConn<-file(paste(url, "txt", sep="."))
writeLines(paste(plain.text, collapse = "\n"), fileConn)
close(fileConn)
}
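If you prefer the rvest stack used elsewhere in this thread, a roughly equivalent sketch; the numbered output file names are my own assumption, not part of the answer above:

library(rvest)

urls <- c("url1", "url2", "url3")

for (i in seq_along(urls)) {
  # parse the page and keep only the text inside <p> tags
  plain.text <- html_text(html_nodes(read_html(urls[i]), "p"))
  # one numbered .txt file per url
  writeLines(paste(plain.text, collapse = "\n"), paste0("letter_", i, ".txt"))
}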
Thanks everyone for your input. It turns out that any HTML conversion took too much time given the number of websites I need to parse. The (working) solution probably violates some best-practice guidelines, but it does the job.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import clipboard  # provides paste(); the 'clipboard' package is assumed here

driver = webdriver.Firefox(executable_path=path + '/codes_ml/geckodriver/geckodriver.exe')  # initialize driver
# it is fine to open the driver just once
# loop over the urls and pull the text (report_urls is assumed to be defined earlier)
for report_url in report_urls:
    driver.get(report_url)
    element = driver.find_element_by_css_selector("body")
    element.send_keys(Keys.CONTROL + 'a')
    element.send_keys(Keys.CONTROL + 'c')
    text = clipboard.paste()
I am learning python (using 3.5). I realize I will probably take a bit of heat for posting my question. Here goes: I have literally reviewed several hundred posts, help docs, etc. all in an attempt to construct the code I need. No luck thus far. I hope someone can help me. I have a set of URLs say, 18 or more. Only 2 illustrated here:
[1] "http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/111915.html"
[2] "http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/092215.htm"
I need to scrape all the data (text) behind each url and write out to individual text files (one for each URL) for future topic model analysis. Right now, I pull in the urls through R using rvest. I then take each url (one at a time, by code) into python and do the following:
from urllib.request import urlopen
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen('http://www.senate.mo.gov/media/14info/chappelle-nadal/Columns/012314-Condensed.html').read())
txt = soup.find('div', {'class': 'body'})
print(soup.get_text())
#print(soup.prettify()) not much help

#store the info in an object, then write out the object
test = soup.get_text()

#below does write a file
#how to take my BS object and get it in
open_file = open('23Jan2014cplNadal1.txt', 'w')
open_file.write(test)
open_file.close()
The above gets me partially to my target. It leaves me just a little clean up regarding the text, but that's okay. The problem is that it is labor intensive.
Is there a way to:
1. Write a clean text file (without invisibles, etc.) out from R with all listed urls?
2. For Python 3.5: take all the urls, once they are in a clean single file (one url per line), and have some iterative process retrieve the text behind each url and write out a text file for each URL's data (text) to a location on my hard drive?
I have to do this process for approximately 1000 state-level senators. Any help or direction is greatly appreciated.
Edit to original: Thank you so much all. To N. Velasquez: I tried the following:
urls<-c("http://www.senate.mo.gov/media/14info/Chappelle-Nadal/releases/120114.html",
"http://www.senate.mo.gov/media/14info/Chappelle-Nadal/releases/110614.htm"
)
for (url in urls) {
download.file(url, destfile = basename(url), method="curl", mode ="w", extra="-k")
}
HTML files are then written out to my working directory. However, is there a way to write out text files instead of HTML files? I've read the download.file documentation and can't seem to figure out a way to write individual text files. Regarding the suggestion of a for loop: is what I illustrate what you mean for me to attempt? Thank you!
The answer for 1 is: Sure!
The following code will loop you through the html list and export atomic TXTs, as per your request.
Note that through rvest and html_node() you could get a much more structured dataset, with recurring parts of the HTML stored separately (header, office info, main body, URL, etc.).
library(rvest)
urls <- (c("http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/111915.html", "http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/092215.htm"))
ht <- list()
for (i in 1:length(urls))
{
  ht[i] <- html_text(html_node(read_html(urls[i]), xpath = '//*[@id="mainContent"]'), trim = TRUE)
  ht <- gsub("[\r\n]","",ht)
  writeLines(ht[i], paste("DOC_", i, ".txt", sep =""))
}
Look for the DOC_1.txt and DOC_2.txt in your working directory.
I have been trying to do OCR within R (reading PDF data that is stored as scanned images). I have been reading about this at http://electricarchaeology.ca/2014/07/15/doing-ocr-within-r/
This a very good post.
Effectively 3 steps:
convert pdf to ppm (an image format)
convert ppm to tif ready for tesseract (using ImageMagick for convert)
convert tif to text file
The effective code for the above 3 steps as per the link post:
lapply(myfiles, function(i){
# convert pdf to ppm (an image format), just pages 1-10 of the PDF
# but you can change that easily, just remove or edit the
# -f 1 -l 10 bit in the line below
shell(shQuote(paste0("F:/xpdf/bin64/pdftoppm.exe ", i, " -f 1 -l 10 -r 600 ocrbook")))
# convert ppm to tif ready for tesseract
shell(shQuote(paste0("F:/ImageMagick-6.9.1-Q16/convert.exe *.ppm ", i, ".tif")))
# convert tif to text file
shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))
# delete tif file
file.remove(paste0(i, ".tif" ))
})
The first two steps work fine (although they take a good amount of time for 4 pages of a PDF, but I will look into the scalability part later; first I am trying to see whether this works at all).
While running the 3rd step, i.e.
shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))
I get this error:
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
or Tesseract crashes.
Any workaround or root cause analysis would be appreciated.
By using "tesseract", I created a sample script which works.Even it works for scanned PDF's too.
library(tesseract)
library(pdftools)
# Render the PDF pages to TIFF images
img_file <- pdftools::pdf_convert("F:/gowtham/A/B/invoice.pdf", format = 'tiff', dpi = 400)
# Extract text from the images
text <- ocr(img_file)
write.table(text, "F:/gowtham/A/B/mydata.txt")
I'm new to R and programming, so guide me if anything is wrong. Hope this helps you.
The newly released tesseract package might be worth checking out. It allows you to perform the whole process inside of R without the shell calls.
Taking the procedure as used in the help documentation of the tesseract package your function would look something like this:
lapply(myfiles, function(i){
  # convert the pdf page to a tiff image and perform tesseract OCR on it
  # read the embedded text layer with pdf_text() (not needed for the OCR itself)
  pdf <- pdf_text(i)
  # render the first page of the pdf to a bitmap
  bitmap <- pdf_render_page(i, dpi = 300, numeric = TRUE)
  tiff::writeTIFF(bitmap, paste0(i, ".tiff"))
  # perform OCR on the .tiff file
  out <- ocr(paste0(i, ".tiff"))
  # delete the tiff file
  file.remove(paste0(i, ".tiff"))
  out
})
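For completeness, a short usage sketch under my own assumptions (the file pattern, paths, and output naming are placeholders, not from the tesseract documentation):

library(pdftools)
library(tesseract)

# placeholder input list; adjust the path/pattern to your PDFs
myfiles <- list.files(pattern = "\\.pdf$", full.names = TRUE)

ocr_text <- lapply(myfiles, function(i) {
  bitmap <- pdf_render_page(i, dpi = 300, numeric = TRUE)
  tiff::writeTIFF(bitmap, paste0(i, ".tiff"))
  out <- ocr(paste0(i, ".tiff"))
  file.remove(paste0(i, ".tiff"))
  out
})

# one .txt file per PDF, named after the source file
invisible(Map(writeLines, ocr_text, paste0(myfiles, ".txt")))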
I have a file that I open using wdGet(filename="exOut.doc",visible=FALSE). This file already has images in it that I've inserted using html and cat(img, file=outputDoc, sep="\n", append=TRUE).
I need to insert a table at the end of the document, but wdTable(format(head(testTable))) places the table at the very top of the word document. How can I fix this?
Also, a second problem: I have a lot of tables I need to insert into my document and hence make use of a loop. Below is sample code that demonstrates my problem. Here's the really weird part: when I step through the code and run each line one after another, it produces no error and I get an output document. If I run everything at once, I get a 'cannot open the connection' error. I don't understand how this can be. How is it possible that running each line one at a time produces a different result than running all of that exact same code at once?
rm(list=ls())
library(R2wd)
library(png)
outputForNow<-"C:\\Users\\dirkh_000\\Downloads\\"
outputDoc<-paste(outputForNow,"exOut.doc",sep="")
setwd(outputForNow)
# Some example plots
for(i in 1:3)
{
dir.create(file.path(paste("folder",i,sep="")))
setwd(paste("folder",i,sep="")) # Note that images are all in different folders
png(paste0("ex", i, ".png"))
plot(1:5)
title(paste("plot", i))
dev.off()
setwd(outputForNow)
}
setwd(outputForNow)
# Start empty word doc
cat("<body>", file="exOut.doc", sep="\n")
# Retrieve a list of all folders
folders<-dir()[file.info(dir())$isdir]
folders<-folders[!is.na(folders)]
# Cycle through all folders in working directory
for(folder in folders){
setwd(paste(outputForNow,folder,sep=""))
# select all png files in working directory
for(i in list.files(pattern="*.png"))
{
temp<-paste0('<img src=','\"',gsub("[\\]","/",folder),"/", i, '\">')
cat(temp, file=outputDoc, sep="\n", append=TRUE)
setwd(paste(outputForNow,folder,sep=""))
}
setwd(outputForNow)
cat("</body>", file="exOut.doc", sep="\n", append=TRUE)
testTable<-as.data.frame(cbind(1,2,3))
wdGet(filename="exOut.doc",visible=FALSE)
wdTable(format(head(testTable))) ## This produces a table at the top and not the bottom of the document
wdSave(outputDoc)
wdQuit() # NOTE that this means that the document is closed and opened over and over again in the loop otherwise cat() will throw an error
}
The above code produces:
Error in file(file, ifelse(append, "a", "w")) :
cannot open the connection
Can anyone tell me why this occurs and how to fix it? Please and thank you. Please do recommend a completely different approach if you know I'm going about this the wrong way, but please also explain what it is that I'm doing wrong.
To load the DescTools package and start a Word document, use something like this (obviously, modified for your path structure):
library(DescTools)
library(RDCOMClient)
report <- GetNewWrd(template = "C:/Users/Rees/Documents/R/win-library/3.0/R2DOCX/templates/TEMPLATE_03.docx")
ADDED BASED ON COMMENT
Create a template for your report in Word. Perhaps you call it TEMPLATE.docx. Save it in your Documents directory (or whatever directory you keep Word documents in). Then
report <- GetNewWrd(template = "C:/Users/dirkh_000/Documents/TEMPLATE.docx")
Thereafter, each time you create a plot, add this line:
WrdPlot(wrd = report)
The plot is inserted in the TEMPLATE.docx Word document in the specified directory.
The same goes for WrdTable(wrd = report).
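To make the flow concrete, here is only a rough outline of how the question's loop might look with this approach; it sticks to the calls named above (GetNewWrd(), WrdPlot(), WrdTable()), the template path is the one assumed earlier, and the exact arguments, in particular how WrdTable() receives the table contents, should be checked against the DescTools documentation:

library(DescTools)
library(RDCOMClient)

# open the Word document once, from your template (path is an assumption)
report <- GetNewWrd(template = "C:/Users/dirkh_000/Documents/TEMPLATE.docx")

for (i in 1:3) {
  # recreate each example plot and push it straight into the document
  plot(1:5)
  title(paste("plot", i))
  WrdPlot(wrd = report)

  # then append the corresponding table
  testTable <- as.data.frame(cbind(1, 2, 3))
  WrdTable(wrd = report)  # see ?WrdTable / ?ToWrd for how to fill in testTable
}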