How to extract specific parts of messy PDFs in R?

I need to extract specific parts of a large corpus of PDF documents. The PDFs are large, messy reports containing all kinds of numeric, alphabetic and other information. The files are of different lengths but have unified content and sections across them. The documents have a Table of Contents listing the section names. For example:
Table of Contents:
Item 1. Business 1
Item 1A. Risk Factors 2
Item 1B. Unresolved Staff Comments 5
Item 2. Properties 10
Item N........
..........text I do not care about...........
Item 1A. Risk Factors
.....text I am interested in getting.......
(section ends)
Item 1B. Unresolved Staff Comments
..........text I do not care about...........
I have no problem reading them in and analyzing them as a whole but I need to pull out only the text between "Item 1A. Risk Factors" and "Item 1B. Unresolved Staff Comments".
I used the pdftools, tm, quanteda and readtext packages.
This is the part of the code I use to read in my docs. I created a directory called "PDF" where I placed my PDFs, and another directory where R will place the converted ".txt" files.
pdf_directory <- paste0(getwd(), "/PDF")
txt_directory <- paste0(getwd(), "/Texts")
Then I create a list of files using the list.files() function.
files <- list.files(pdf_directory, pattern = ".pdf", recursive = FALSE,
                    full.names = TRUE)
files
After that, I create a function that extracts the text from each PDF and writes it to a ".txt" file named after the original file.
extract <- function(filename) {
  print(filename)
  try({
    text <- pdf_text(filename)
  })
  f <- gsub("(.*)/([^/]*).pdf", "\\2", filename)
  write(text, file.path(txt_directory, paste0(f, ".txt")))
}
for (file in files) {
  extract(file)
}
After this step I get stuck and do not know how to proceed. I am not sure if I should try to extract the section of interest when I read the data in; if so, I suppose I would have to wrestle with the chunk where I create the function -- f <- gsub("(.*)/([^/]*).pdf", "\\2", filename)? I apologize for such questions, but I am teaching myself R.
I also tried the following code on just one file instead of the whole corpus:
start <- grep("^\\*\\*\\* ITEM 1A. RISK FACTORS", text_df$text) + 1
stop  <- grep("^ITEM 1B. UNRESOLVED STAFF COMMENTS", text_df$text) - 1
lines <- raw[start:stop]
scd   <- paste0(".*", start, "(.*)", "\n", stop, ".*")
gsub(scd, "\\1", name_of_file)
but it did not help me in any way.

I don't really see why you would write the files to .txt first, so I did it all in one go.
What threw me off a little is that your patterns contain runs of extra spaces. You can match those with the regular expression \\s+.
library(stringr)

files <- c("https://corporate.exxonmobil.com/-/media/Global/Files/investor-relations/investor-relations-publications-archive/ExxonMobil-2016-Form-10-K.pdf",
           "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf")

relevant_l <- lapply(files, function(file) {
  # print status message
  message("processing: ", basename(file))
  lines <- unlist(stringr::str_split(pdftools::pdf_text(file), "\n"))
  start <- stringr::str_which(lines, "ITEM 1A.\\s+RISK FACTORS")
  end   <- stringr::str_which(lines, "ITEM 1B.\\s+UNRESOLVED STAFF COMMENTS")
  # cover a few different outcomes depending on what was found
  if (length(start) == 1 & length(end) == 1) {
    relevant <- lines[start:end]
  } else if (length(start) == 0 | length(end) == 0) {
    relevant <- "Pattern not found"
  } else {
    relevant <- "Problems found"
  }
  return(relevant)
})
#> processing: ExxonMobil-2016-Form-10-K.pdf
#> processing: dummy.pdf
names(relevant_l) <- basename(files)
sapply(relevant_l, head)
#> $`ExxonMobil-2016-Form-10-K.pdf`
#> [1] "ITEM 1A. RISK FACTORS\r"
#> [2] "ExxonMobil’s financial and operating results are subject to a variety of risks inherent in the global oil, gas, and petrochemical\r"
#> [3] "businesses. Many of these risk factors are not within the Company’s control and could adversely affect our business, our financial\r"
#> [4] "and operating results, or our financial condition. These risk factors include:\r"
#> [5] "Supply and Demand\r"
#> [6] "The oil, gas, and petrochemical businesses are fundamentally commodity businesses. This means ExxonMobil’s operations and\r"
#>
#> $dummy.pdf
#> [1] "Pattern not found"
I would return the results as a list and then use the original file names to name the list elements. Let me know if you have questions. I use the stringr package since it's fast and consistent in dealing with strings, but the commands str_which and grep are pretty much the same.
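If you then want a single string per document for downstream analysis (you mention quanteda), here is a minimal follow-up sketch, assuming relevant_l was built as above and that quanteda is installed:
# collapse the extracted lines into one string per document
risk_texts <- sapply(relevant_l, paste, collapse = "\n")

# e.g. build a quanteda corpus from the named character vector
library(quanteda)
risk_corpus <- corpus(risk_texts)
summary(risk_corpus)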

Related

What does this code below do? (in R)

#Read state of union file
speech <- readLines("stateoftheunion1790-2012.txt")
head(speech)
What does this code do after it reads the file? I was told it will give a list where each entry is the text between consecutive ***'s, but what does that mean?
x <- grep("^\\*{3}", speech)
list.speeches <- list()
for (i in 1:length(x)) {
  if (i == 1) {
    list.speeches[[i]] <- paste(speech[1:x[1]], collapse = " ")
  } else {
    list.speeches[[i]] <- paste(speech[x[i - 1]:x[i]], collapse = " ")
  }
}
It looks like you're new to SO; welcome to the community! As @Allan Cameron pointed out, whenever you ask questions, especially if you want great answers quickly, it's best to make your question reproducible. This includes sample data, like the output from dput() or reprex::reprex(). Check it out: making R reproducible questions.
I've detailed each part of the code with coding comments. Feel free to ask questions if you've got any.
speech <- readLines("https://raw.githubusercontent.com/jdwilson4/Intro-Data-Science/master/Data/stateoftheunion1790-2012.txt")

head(speech) # print the first 6 rows captured in the object speech
# [1] "The Project Gutenberg EBook of Complete State of the Union Addresses,"
# [2] "from 1790 to the Present"
# [3] ""
# [4] "Character set encoding: UTF8"
# [5] ""
# [6] "The addresses are separated by three asterisks"

x <- grep("^\\*{3}", speech)
# searches the speech character vector for the indices of lines starting with three asterisks ***

list.speeches <- list() # create a list to store the results

for (i in 1:length(x)) {   # for each index that coincided with three asterisks
  if (i == 1) {            # if it's the first set of asterisks ***
    # capture all vector elements up to the first set of 3 asterisks,
    # i.e. the file information and who gave each of the speeches
    list.speeches[[i]] <- paste(speech[1:x[1]], collapse = " ")
  } else {
    # capture everything between each pair of subsequent indices, i.e. all
    # rows of each speech (currently separated by ***), placing each
    # complete speech in a different list position
    list.speeches[[i]] <- paste(speech[x[i - 1]:x[i]], collapse = " ")
  }
}

Searching strings within elements of a list and extracting information

I am looking to search the contents of files stored in a list and, where strings match, pull specific information from that list.
# importing files as a list
folder   <- "~/cars/"
carfiles <- list.files(folder)
len      <- length(carfiles)
files    <- vector(mode = "list", length = len)
Each file is a csv, each element of the list is a file and each subelement is a row of the file.
# first element == column headers
files[[1]][[1]]
[1] "CarName,RegNumber,CarInfo"
# second element == first row of car info in first file
files[[1]][[2]]
[1] "BMWi3,BM19 XYZ,Colour:White Mileage:9,000 ..."
# third element == second row of car info in first file, etc.
files[[1]][[3]]
[1] "BMWX5,BM19 ABC,Colour:Black Mileage:10,000 ..."
For example, here I would like to isolate which sub-element (row in the original Excel file) of the file contains the string "White" and pull the RegNumber information.
Note: I have several hundred files to search through.
This is what I have tried:
if (str_detect(files[[i]], "White") == TRUE) {
  sapply(files, "[[", 2)
}
But it is not isolating the list correctly, which I think is mainly due to my limited knowledge of lists. Any advice on how to identify and extract the information would be helpful.
Shortened output of dput(head(files)):
dput(head(files))
list('Carfile1' = c("CarName,RegNumber,CarInfo",
                    "BMWi3,BM19 XYZ,Colour:White Mileage:9,000 ...",
                    "BMWX5,BM19 ABC,Colour:Black Mileage:10,000 ...",
                    "BMWX5,BM19 DEF,Colour:Black Mileage:10,100 ...",
                    "BMWZ3,BM19 GHI,Colour:Red Mileage:10,102 ..."),
     'Carfile2' = c("CarName,RegNumber,CarInfo",
                    "AudiA1,BM18 XYZ,Colour:White Mileage:29,000 ...",
                    "AudiA2,BM18 ABC,Colour:Black Mileage:30,000 ...",
                    "AudiA3,BM18 DEF,Colour:Black Mileage:30,100 ...",
                    "AudiA4,BM18 GHI,Colour:Red Mileage:30,102 ..."))
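A minimal sketch of one way to approach this, assuming each list element is a character vector of comma-separated rows as in the dput() output above (treating RegNumber as the second comma-separated field is an assumption based on that sample):
library(stringr)

# for each file, find the rows mentioning "White" and pull the RegNumber field
white_regs <- lapply(files, function(rows) {
  hits <- rows[str_detect(rows, "White")]           # rows containing "White"
  # split each matching row on commas; RegNumber is assumed to be the 2nd field
  vapply(strsplit(hits, ","), `[`, character(1), 2)
})

white_regs
# $Carfile1
# [1] "BM19 XYZ"
#
# $Carfile2
# [1] "BM18 XYZ"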

how to search through a column of links looking for string matches in r?

I have a data table with a list of .txt links in one column. I am looking for a way for R to search within each link to see if the file contains either of the strings "discount rate" or "discounted cash flow". I then want R to create 2 columns next to each link (one for "discount rate" and one for "discounted cash flow") containing a 1 if the string is present and a 0 if not.
Here's a small list of sample links that I would like to sift through:
http://www.sec.gov/Archives/edgar/data/1015328/0000913849-04-000510.txt
http://www.sec.gov/Archives/edgar/data/1460306/0001460306-09-000001.txt
http://www.sec.gov/Archives/edgar/data/1063761/0001047469-04-028294.txt
http://www.sec.gov/Archives/edgar/data/1230588/0001178913-09-000260.txt
http://www.sec.gov/Archives/edgar/data/1288246/0001193125-04-155851.txt
http://www.sec.gov/Archives/edgar/data/1436866/0001172661-09-000349.txt
http://www.sec.gov/Archives/edgar/data/1089044/0001047469-04-026535.txt
http://www.sec.gov/Archives/edgar/data/1274057/0001047469-04-023386.txt
http://www.sec.gov/Archives/edgar/data/1300379/0001047469-04-026642.txt
http://www.sec.gov/Archives/edgar/data/1402440/0001225208-09-007496.txt
http://www.sec.gov/Archives/edgar/data/35527/0001193125-04-161618.txt
Maybe something like this...
checktext <- function(file, text) {
  filecontents <- readLines(file)
  return(as.numeric(any(grepl(text, filecontents, ignore.case = TRUE))))
}

df$DR  <- sapply(df$file_name, checktext, "discount rate")
df$DCF <- sapply(df$file_name, checktext, "discounted cash flow")
A much faster version, thanks to Gregor's comment below, would be
checktext <- function(file, text) {
  filecontents <- readLines(file)
  sapply(text, function(x) as.numeric(any(grepl(x, filecontents,
                                                ignore.case = TRUE))))
}

df[, c("DR", "DCF")] <- t(sapply(df$file_name, checktext,
                                 c("discount rate", "discounted cash flow")))
Or if you are doing it from URLs rather than local files, replace df$file_name with df$websiteURL in the above. It worked for me on the short list you provided.
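For completeness, here is a small usage sketch with a couple of the URLs from the question (the data frame and the websiteURL column name are just assumptions for illustration; readLines() needs internet access to fetch these links):
urls <- c("http://www.sec.gov/Archives/edgar/data/1015328/0000913849-04-000510.txt",
          "http://www.sec.gov/Archives/edgar/data/1460306/0001460306-09-000001.txt")

df <- data.frame(websiteURL = urls, stringsAsFactors = FALSE)

df[, c("DR", "DCF")] <- t(sapply(df$websiteURL, checktext,
                                 c("discount rate", "discounted cash flow")))
df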

Parsing XML for Ancient Greek Plays with speaker and dialogue

I am currently trying to read Greek plays which are available online as XML files into a data frame with a dialogue and speaker column.
I run the following commands to download the XML and parse the dialogue and speakers.
library(XML)
library(RCurl)
url <- "http://www.perseus.tufts.edu/hopper/dltext?doc=Perseus%3Atext%3A1999.01.0186"
html <- getURL(url, followlocation = TRUE)
doc <- htmlParse(html, asText=TRUE)
plain.text <- xpathSApply(doc, "//p", xmlValue)
speakersc <- xpathSApply(doc, "//speaker", xmlValue)
dialogue <- data.frame(text = plain.text, stringsAsFactors = FALSE)
speakers <- data.frame(text = speakersc, stringsAsFactors = FALSE)
However, I then encounter a problem: the dialogue yields 300 rows (for 300 distinct lines in the play), but the speakers yield only 297.
The problem is caused by the structure of the XML, reproduced below, where the <speaker> tag is not repeated for continued dialogue that is interrupted by a stage direction. Because I have to separate the dialogue
by the <p> tag, this yields two dialogue rows but only one speaker row, without duplicating the speaker accordingly.
<speaker>Creon</speaker>
<stage>To the Guard.</stage>
<p>
You can take yourself wherever you please,
<milestone n="445" unit="line" ed="p"/>
free and clear of a heavy charge.
<stage>Exit Guard.</stage>
</p>
</sp>
<sp>
<stage>To Antigone.</stage>
<p>You, however, tell me—not at length, but briefly—did you know that an edict had forbidden this?</p>
</sp>
How can I parse the XML so the data will correctly yield the same number of dialogue rows for the same number of corresponding speaker rows?
For the above example, I would like the resulting data frame to either contain two rows for Creon's dialogue, corresponding to the two lines of dialogue before and after the stage direction, or one row which treats Creon's dialogue as one line, ignoring the separation caused by the stage direction.
Thank you very much for your help.
Consider using XPath's forward-looking following-sibling axis to search for the next <p> tag when speaker is empty, all while iterating through <sp>, which is the parent of <speaker> and <p>:
# ALL SP NODES
sp <- xpathSApply(doc, "//body/descendant::sp", xmlValue)

# ITERATE THROUGH EACH SP BY NODE INDEX TO CREATE LIST OF DFs
dfList <- lapply(seq_along(sp), function(i){
  data.frame(
    speakers = xpathSApply(doc, paste0("concat(//body/descendant::sp[", i, "]/speaker,'')"), xmlValue),
    dialogue = xpathSApply(doc, paste0("concat(//body/descendant::sp[", i, "]/speaker/following-sibling::p[1], ' ',
                                        //body/descendant::sp[position()=", i + 1, " and not(speaker)]/p[1])"), xmlValue)
  )
})

# ROW BIND LIST OF DFs AND SUBSET EMPTY SPEAKER/DIALOGUE
finaldf <- subset(do.call(rbind, dfList), speakers != "" & dialogue != "")
# SPECIFIC ROWS IN OP'S HIGHLIGHT
finaldf[85,]
# speakers
# 85 Creon
#
# dialogue
# 85 You can take yourself wherever you please,free and clear of a heavy
# charge.Exit Guard. You, however, tell me—not at length, but
# briefly—did you know that an edict had forbidden this?
finaldf[86,]
# speakers dialogue
# 87 Antigone I knew it. How could I not? It was public.
Another option is to read the URL and make some updates before parsing the XML: in this case, replace milestone tags with a space to avoid mashing words together, remove stage tags, and then fix the sp node without a speaker.
x <- readLines(url)
x <- gsub("<milestone[^>]*>", " ", x) # add space
x <- gsub("<stage>[^>]*stage>", "", x) # no space
x <- paste(x, collapse = "")
x <- gsub("</p></sp><sp><p>", "", x) # fix sp without speaker
Now the XML has the same number of sp and speaker tags.
doc <- xmlParse(x)
summary(doc)
        p        sp   speaker      div2 placeName
      299       297       297        51        25  ...
Finally, get the sp nodes and parse speaker and paragraph.
sp <- getNodeSet(doc, "//sp")
s1 <- sapply(sp, xpathSApply, ".//speaker", xmlValue)

# collapse the 1 node with 2 <p>
p1 <- lapply(sp, xpathSApply, ".//p", xmlValue)
p1 <- trimws(sapply(p1, paste, collapse = " "))

speakers <- data.frame(speaker = s1, dialogue = p1)
speaker dialogue
1 Antigone Ismene, my sister, true child of my own mother, do you know any evil o...
2 Ismene To me no word of our friends, Antigone, either bringing joy or bringin...
3 Antigone I knew it well, so I was trying to bring you outside the courtyard gat...
4 Ismene Hear what? It is clear that you are brooding on some dark news.
5 Antigone Why not? Has not Creon destined our brothers, the one to honored buri...
6 Ismene Poor sister, if things have come to this, what would I profit by loose...
7 Antigone Consider whether you will share the toil and the task.
8 Ismene What are you hazarding? What do you intend?
9 Antigone Will you join your hand to mine in order to lift his corpse?
10 Ismene You plan to bury him—when it is forbidden to the city?
...

R error when lemmatizing a corpus of documents with wordnet

I'm trying to lemmatize a corpus of documents in R with the wordnet library. This is the code:
corpus.documents <- Corpus(VectorSource(vector.documents))
corpus.documents <- tm_map(corpus.documents, removePunctuation)

library(wordnet)
lapply(corpus.documents, function(x) {
  x.filter <- getTermFilter("ContainsFilter", x, TRUE)
  terms <- getIndexTerms("NOUN", 1, x.filter)
  sapply(terms, getLemma)
})
but when running this, I get this error:
Error in .jnew(paste("com.nexagis.jawbone.filter", type, sep = "."), word, :
java.lang.NoSuchMethodError: <init>
and these are the stack calls:
5 stop(structure(list(message = "java.lang.NoSuchMethodError: <init>",
call = .jnew(paste("com.nexagis.jawbone.filter", type, sep = "."),
word, ignoreCase), jobj = <S4 object of class structure("jobjRef", package
="rJava")>), .Names = c("message",
"call", "jobj"), class = c("NoSuchMethodError", "IncompatibleClassChangeError", ...
4 .jnew(paste("com.nexagis.jawbone.filter", type, sep = "."), word,
ignoreCase)
3 getTermFilter("ContainsFilter", x, TRUE)
2 FUN(X[[1L]], ...)
1 lapply(corpus.documents, function(x) {
x.filter <- getTermFilter("ContainsFilter", x, TRUE)
terms <- getIndexTerms("NOUN", 1, x.filter)
sapply(terms, getLemma) ...
what's wrong?
So this does not address your use of wordnet, but does provide an option for lemmatizing that might work for you (and is better, IMO...). This uses the MorphAdorner API developed at Northwestern University. You can find detailed documentation here. In the code below I'm using their Adorner for Plain Text API.
# MorphAdorner (Northwestern University) web service
adorn <- function(text) {
  require(httr)
  require(XML)
  url <- "http://devadorner.northwestern.edu/maserver/partofspeechtagger"
  response <- GET(url, query = list(text = text, media = "xml",
                                    xmlOutputType = "outputPlainXML",
                                    corpusConfig = "ncf", # Nineteenth Century Fiction
                                    includeInputText = "false", outputReg = "true"))
  doc <- content(response, type = "text/xml")
  words <- doc["//adornedWord"]
  xmlToDataFrame(doc, nodes = words)
}

library(tm)
vector.documents <- c("Here is some text.",
                      "This might possibly be some additional text, but then again, maybe not...",
                      "This is an abstruse grammatical construction having as it's sole intention the demonstration of MorhAdorner's capability.")
corpus.documents <- Corpus(VectorSource(vector.documents))
lapply(corpus.documents, function(x) adorn(as.character(x)))
# [[1]]
# token spelling standardSpelling lemmata partsOfSpeech
# 1 Here Here Here here av
# 2 is is is be vbz
# 3 some some some some d
# 4 text text text text n1
# 5 . . . . .
# ...
I'm just showing the lemmatization of the first "document". partsOfSpeech follows the NUPOS convention.
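If you only need the lemmata, a short follow-up sketch (assuming the adorn() calls above succeeded, i.e. the web service is reachable) could keep just that column:
adorned <- lapply(corpus.documents, function(x) adorn(as.character(x)))

# keep only the lemmata column of each adorned data frame
lemmas <- lapply(adorned, function(df) as.character(df$lemmata))
lemmas[[1]]
# [1] "here" "be"   "some" "text" "."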
This answers your question, but does not really solve your problem. There is another answer above that attempts to provide an actual solution.
There are several issues with the way you are using the wordnet package, described below, but the bottom line is that even after addressing these, I could not get wordnet to produce anything but gibberish.
First: You can't just install the wordnet package in R, you have to install Wordnet on your computer, or at least download the dictionaries. Then, before you use the package, you need to run initDict("path to wordnet dictionaries").
Second: It looks like getTermFilter(...) expects a character argument for x. The way you have it set up, you are passing an object of type PlainTextDocument. So you need to use as.character(x) to convert that to its contained text, or you get the java error in your question.
Third: It looks like getTermFilter(...) expects single words (or phrases). For instance, if you pass "This is a phrase" to getTermFilter(...) it will look up "This is a phrase" in the dictionary. It will not find it of course, so getIndexTerms(...) returns NULL and getLemma(...) fails... So you have to parse the text of your PlainTextDocument into words first.
Finally, I'm not sure it's a good idea to remove punctuation. For instance "it's" will be converted to "its" but these are different words with different meanings, and they lemmatize differently.
Rolling all this up:
library(tm)
vector.documents <- c("This is a line of text.", "This is another one.")
corpus.documents <- Corpus(VectorSource(vector.documents))
corpus.documents <- tm_map(corpus.documents, removePunctuation)

library(wordnet)
initDict("C:/Program Files (x86)/WordNet/2.1/dict")
lapply(corpus.documents, function(x) {
  sapply(unlist(strsplit(as.character(x), "[[:space:]]+")), function(word) {
    x.filter <- getTermFilter("StartsWithFilter", word, TRUE)
    terms <- getIndexTerms("NOUN", 1, x.filter)
    if (!is.null(terms)) sapply(terms, getLemma)
  })
})
# [[1]]
# This is a line of text
# "thistle" "isaac" "a" "line" "off-axis reflector" "text"
As you can see, the output is still gibberish. "This" is lemmatized as "thistle" and so on. It may be that I have the dictionaries configured improperly, so you might have better luck. If you are committed to wordnet, for some reason, I suggest you contact the package authors.
