Importing data from a PDF to HTML using R

Is there a way to import data from a .pdf file into HTML format using R?
I tried with the following code:
library(tm)
filename = "file.pdf"
doc <- readPDF(control = list(text = "-layout"))(elem = list(uri = filename),language = "en",id = "id1")
head(doc)
The output, displayed in the HTML document, looks like this:
## $content
## [1] " sample data"
## [2] ""
## [3] " records"
## [4] ""
## [5] " 31 July 2017"
## [6] ""
## [7] ""
## [8] "R Markdown setup
## [9] ""
## [10] ""
## [11] "R Markdown"
## [12] ""
## [13] "This is an R Markdown document. Markdown is a simple formatting syntax for"
## [14] "authoring HTML, PDF, and MS Word documents. For more details on using R"
## [15] "Markdown see http://rmarkdown.rstudio.com."
## [16] "When you click the Knit button a document will be generated that includes"
## [17] "both content as well as the output of any embedded R code chunks within the"
## [18] "document. You can embed an R code chunk like this:"
## [19] "{r cars} summary(cars)"
Please help!

I downloaded the PDF file available here: https://fie.org/competition/2022/152/results/pools/pdf?lang=en
With the following code, I have been able to convert the PDF file to an HTML file:
library(RDCOMClient)
path_PDF <- "C:\\pdf_with_table.pdf"
path_Html <- "C:\\temp.html"
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
ConfirmConversions = FALSE)
doc$SaveAs2(path_Html, FileFormat = 9) # saves to html
From my point of view, it would be more straightforward to extract the tables directly from the PDF, or to convert the PDF to a Word file and extract the tables from the Word file.
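For instance, here is a minimal sketch with the pdftools package (the path reuses the one above; turning the page text into a clean table is left as an exercise, since it depends on the PDF's layout):
library(pdftools)

# one character string per page, with the rough page layout preserved
pages <- pdf_text("C:\\pdf_with_table.pdf")

# split a page into lines as a starting point for parsing the table
page1_lines <- strsplit(pages[1], "\n")[[1]]
head(page1_lines)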

How to edit an R markdown YAML header programmatically?

Take for example this Rmd file - https://github.com/rstudio/learnr/blob/master/inst/tutorials/ex-setup-r/ex-setup-r.Rmd
The YAML header has this -
output:
  learnr::tutorial:
    progressive: true
    allow_skip: true
I would like to change this to -
output:
  ioslides_presentation:
    widescreen: true
Is there a way to make this edit programmatically, i.e. can I write a function that takes an Rmd file as input, edits the YAML header, and produces a new Rmd file?
Thanks!
I think a quick function could do this.
change_yaml_matter <- function(input_file, ..., output_file) {
  input_lines <- readLines(input_file)
  delimiters <- grep("^---\\s*$", input_lines)
  if (!length(delimiters)) {
    stop("unable to find yaml delimiters")
  } else if (length(delimiters) == 1L) {
    if (delimiters[1] == 1L) {
      stop("cannot find second delimiter, first is on line 1")
    } else {
      # found just one set, assume it is *closing* the yaml matter;
      # fake a preceding line of delimiter
      delimiters <- c(0L, delimiters[1])
    }
  }
  delimiters <- delimiters[1:2]
  yaml_list <- yaml::yaml.load(input_lines[(delimiters[1] + 1):(delimiters[2] - 1)])
  dots <- list(...)
  yaml_list <- c(yaml_list[setdiff(names(yaml_list), names(dots))], dots)
  output_lines <- c(
    if (delimiters[1] > 0) input_lines[1:(delimiters[1])],
    strsplit(yaml::as.yaml(yaml_list), "\n")[[1]],
    input_lines[-(1:(delimiters[2] - 1))]
  )
  if (missing(output_file)) {
    return(output_lines)
  } else {
    writeLines(output_lines, con = output_file)
    return(invisible(output_lines))
  }
}
Where ... is whatever you want it to be. Meaning: if you want to replace the output: component of the yaml matter, then you give a named list as output=list(...).
If I use the R Markdown document from a previous answer of mine, unchanged, it looks like this:
readLines("~/StackOverflow/1883604/62095186.Rmd")
# [1] "---"
# [2] "title: Hello"
# [3] "output: html_document"
# [4] "params:"
# [5] " intab: TRUE"
# [6] "---"
# [7] ""
# [8] "# Headline 1"
# [9] ""
# [10] "## Headline 2 `r if (params$intab) \"{.tabset}\"`"
# [11] ""
# [12] "### Headline 3 in a tab"
# [13] ""
# [14] "### Headline 4 in a tab"
# [15] ""
# [16] "### Headline 5 in a tab"
# [17] ""
# [18] ""
And to change the output portion, I add a nested named list as:
change_yaml_matter("~/StackOverflow/1883604/62095186.Rmd",
output=list(ioslides_presentation=list(widescreen=TRUE)))
# [1] "---"
# [2] "title: Hello"
# [3] "params:"
# [4] " intab: yes"
# [5] "output:"
# [6] " ioslides_presentation:"
# [7] " widescreen: yes"
# [8] "---"
# [9] ""
# [10] "# Headline 1"
# [11] ""
# [12] "## Headline 2 `r if (params$intab) \"{.tabset}\"`"
# [13] ""
# [14] "### Headline 3 in a tab"
# [15] ""
# [16] "### Headline 4 in a tab"
# [17] ""
# [18] "### Headline 5 in a tab"
# [19] ""
# [20] ""
You can change just about any portion of the yaml matter. (The only things you cannot change, I suspect, are if you happen to have yaml parameters named input_file or output_file. If you actually have Rmd files with those yaml top-level parameters, then you can easily rename the named arguments here to be something else, such as Mxyzptlk and something else ... you're unlikely to see those in production.)
Notes:
This did not save anything to a file; you have to do that yourself. Add output_file="path/to/new.Rmd" to your call, and it will write a new file.
When you do include output_file= in the arguments, if you choose not to catch the return value, it will appear to return nothing. This is due to the invisible() in my return; if you really want to see and save the result, either capture it to a variable and look at that, or wrap the function call in parens, as in (change_yaml_matter(...)).
The trick for YAML is to know that yaml:: treats every top-level key as the named element of a list, and its contents are recursively lists in the same manner. For instance,
str(yaml::yaml.load("
---
top1:
  level2a:
    level3a: 123
    level3b: 456
  level2b: 789
top2: quux
---"))
# List of 2
# $ top1:List of 2
# ..$ level2a:List of 2
# .. ..$ level3a: int 123
# .. ..$ level3b: int 456
# ..$ level2b: int 789
# $ top2: chr "quux"
To assign new values, just provide nested named lists.
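For example, to flip the intab parameter of the document shown above while leaving everything else alone (a quick sketch):
change_yaml_matter("~/StackOverflow/1883604/62095186.Rmd",
                   params = list(intab = FALSE))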
I modified it slightly: with the version given above, if an existing YAML element is changed, it is moved to the end of the YAML header; with my modification, existing elements with changed values keep their position in the header.
change_yaml_matter <- function(input_file, ..., output_file) {
  input_lines <- readLines(input_file)
  delimiters <- grep("^---\\s*$", input_lines)
  if (!length(delimiters)) {
    stop("unable to find yaml delimiters")
  } else if (length(delimiters) == 1L) {
    if (delimiters[1] == 1L) {
      stop("cannot find second delimiter, first is on line 1")
    } else {
      # found just one set, assume it is *closing* the yaml matter;
      # fake a preceding line of delimiter
      delimiters <- c(0L, delimiters[1])
    }
  }
  delimiters <- delimiters[1:2]
  yaml_list <- yaml::yaml.load(
    input_lines[(delimiters[1] + 1):(delimiters[2] - 1)])
  dots <- list(...)
  # update existing yaml elements in place so they keep their position;
  # append genuinely new elements at the end
  for (element_name in names(dots)) {
    if (element_name %in% names(yaml_list)) {
      yaml_list[element_name] <- dots[element_name]
    } else {
      yaml_list <- c(yaml_list, dots[element_name])
    }
  }
  output_lines <- c(
    if (delimiters[1] > 0) input_lines[1:(delimiters[1])],
    strsplit(yaml::as.yaml(yaml_list), "\n")[[1]],
    input_lines[-(1:(delimiters[2] - 1))]
  )
  if (missing(output_file)) {
    return(output_lines)
  } else {
    writeLines(output_lines, con = output_file)
    return(invisible(output_lines))
  }
}
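A quick usage sketch (the file names are hypothetical); with this version an existing output entry is updated in place instead of being moved to the end of the header:
change_yaml_matter("report.Rmd",
                   output = list(ioslides_presentation = list(widescreen = TRUE)),
                   output_file = "report_ioslides.Rmd")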

Removing special apostrophes from French article contractions when tokenizing

I am currently running an STM (structural topic model) on a series of articles from the French newspaper Le Monde. The model is working just great, but I have a problem with the pre-processing of the text.
I'm currently using the quanteda package and the tm package for things like removing words, removing numbers, etc.
There's only one thing, though, that doesn't seem to work.
As some of you might know, in French the masculine definite article -le- contracts to -l'- before vowels. I've tried to remove -l'- (and similar things like -d'-) as words with removeWords:
lmt67 <- removeWords(lmt67, c( "l'","d'","qu'il", "n'", "a", "dans"))
but it only works when they appear as separate words, not when the article is attached to a word, as in -l'arbre- (the tree).
Frustrated, I've tried to give it a simple gsub
lmt67 <- gsub("l'","",lmt67)
but that doesn't seem to be working either.
Now, what's a better way to do this, ideally through a c(...) vector so that I can pass it a series of expressions all together?
Just for context, lmt67 is a "large character" with 30,000 elements/articles, obtained by using the texts() function on data imported from txt files.
Thanks to anyone willing to help me.
I'll outline two ways to do this using quanteda and quanteda-related tools. First, let's define a slightly longer text, with more prefix cases for French. Notice the inclusion of the ’ apostrophe as well as the ASCII 39 simple apostrophe.
txt <- c(doc1 = "M. Trump, lors d’une réunion convoquée d’urgence à la Maison Blanche,
n’en a pas dit mot devant la presse. En réalité, il s’agit d’une
mesure essentiellement commerciale de ce pays qui l'importe.",
doc2 = "Réfugié à Bruxelles, l’indépendantiste catalan a désigné comme
successeur Jordi Sanchez, partisan de l’indépendance catalane,
actuellement en prison pour sédition.")
The first method uses pattern matches for the simple ASCII 39 apostrophe plus a bunch of Unicode variants, matched through the Unicode category "Pf" ("Punctuation, Final quote"). Note, however, that quanteda does its best to normalize the quotes at the tokenization stage - see "l'indépendance" in the second document, for instance.
The second way below uses a French part-of-speech tagger integrated with quanteda, which allows a similar selection after recognizing and separating the prefixes, and then removing determiners (among other parts of speech).
1. quanteda tokens
toks <- tokens(txt, remove_punct = TRUE)
# remove stopwords
toks <- tokens_remove(toks, stopwords("french"))
toks
# tokens from 2 documents.
# doc1 :
# [1] "M" "Trump" "lors" "d'une" "réunion"
# [6] "convoquée" "d'urgence" "à" "la" "Maison"
# [11] "Blanche" "n'en" "a" "pas" "dit"
# [16] "mot" "devant" "la" "presse" "En"
# [21] "réalité" "il" "s'agit" "d'une" "mesure"
# [26] "essentiellement" "commerciale" "de" "ce" "pays"
# [31] "qui" "l'importe"
#
# doc2 :
# [1] "Réfugié" "à" "Bruxelles" "l'indépendantiste"
# [5] "catalan" "a" "désigné" "comme"
# [9] "successeur" "Jordi" "Sanchez" "partisan"
# [13] "de" "l'indépendance" "catalane" "actuellement"
# [17] "en" "prison" "pour" "sédition"
Then, we apply the pattern to match l', d', or s', using a regular expression replacement on the types (the unique tokens):
toks <- tokens_replace(
toks,
types(toks),
stringi::stri_replace_all_regex(types(toks), "[lsd]['\\p{Pf}]", "")
)
# tokens from 2 documents.
# doc1 :
# [1] "M" "Trump" "lors" "une" "réunion"
# [6] "convoquée" "urgence" "à" "la" "Maison"
# [11] "Blanche" "n'en" "a" "pas" "dit"
# [16] "mot" "devant" "la" "presse" "En"
# [21] "réalité" "il" "agit" "une" "mesure"
# [26] "essentiellement" "commerciale" "de" "ce" "pays"
# [31] "qui" "importe"
#
# doc2 :
# [1] "Réfugié" "à" "Bruxelles" "indépendantiste" "catalan"
# [6] "a" "désigné" "comme" "successeur" "Jordi"
# [11] "Sanchez" "partisan" "de" "indépendance" "catalane"
# [16] "actuellement" "En" "prison" "pour" "sédition"
From the resulting toks object you can form a dfm and then proceed to fit the STM.
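For instance, a minimal sketch of that last step (the number of topics K = 10 is an arbitrary placeholder):
library(quanteda)
library(stm)

dfmat <- dfm(toks)                    # document-feature matrix from the cleaned tokens
out   <- convert(dfmat, to = "stm")   # reshape into the documents/vocab format stm() expects
mod   <- stm(out$documents, out$vocab, K = 10)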
2. using spacyr
This will involve more sophisticated part-of-speech tagging and then converting the tagged object into quanteda tokens. This requires first that you install Python, spacy, and the French language model. (See https://spacy.io/usage/models.)
library(spacyr)
spacy_initialize(model = "fr", python_executable = "/anaconda/bin/python")
# successfully initialized (spaCy Version: 2.0.1, language model: fr)
toks <- spacy_parse(txt, lemma = FALSE) %>%
as.tokens(include_pos = "pos")
toks
# tokens from 2 documents.
# doc1 :
# [1] "M./NOUN" "Trump/PROPN" ",/PUNCT"
# [4] "lors/ADV" "d’/PUNCT" "une/DET"
# [7] "réunion/NOUN" "convoquée/VERB" "d’/ADP"
# [10] "urgence/NOUN" "à/ADP" "la/DET"
# [13] "Maison/PROPN" "Blanche/PROPN" ",/PUNCT"
# [16] "\n /SPACE" "n’/VERB" "en/PRON"
# [19] "a/AUX" "pas/ADV" "dit/VERB"
# [22] "mot/ADV" "devant/ADP" "la/DET"
# [25] "presse/NOUN" "./PUNCT" "En/ADP"
# [28] "réalité/NOUN" ",/PUNCT" "il/PRON"
# [31] "s’/AUX" "agit/VERB" "d’/ADP"
# [34] "une/DET" "\n /SPACE" "mesure/NOUN"
# [37] "essentiellement/ADV" "commerciale/ADJ" "de/ADP"
# [40] "ce/DET" "pays/NOUN" "qui/PRON"
# [43] "l'/DET" "importe/NOUN" "./PUNCT"
#
# doc2 :
# [1] "Réfugié/VERB" "à/ADP" "Bruxelles/PROPN"
# [4] ",/PUNCT" "l’/PRON" "indépendantiste/ADJ"
# [7] "catalan/VERB" "a/AUX" "désigné/VERB"
# [10] "comme/ADP" "\n /SPACE" "successeur/NOUN"
# [13] "Jordi/PROPN" "Sanchez/PROPN" ",/PUNCT"
# [16] "partisan/VERB" "de/ADP" "l’/DET"
# [19] "indépendance/ADJ" "catalane/ADJ" ",/PUNCT"
# [22] "\n /SPACE" "actuellement/ADV" "en/ADP"
# [25] "prison/NOUN" "pour/ADP" "sédition/NOUN"
# [28] "./PUNCT"
Then we can use the default glob-matching to remove the parts of speech in which we are probably not interested, including the newline:
toks <- tokens_remove(toks, c("*/DET", "*/PUNCT", "\n*", "*/ADP", "*/AUX", "*/PRON"))
toks
# doc1 :
# [1] "M./NOUN" "Trump/PROPN" "lors/ADV" "réunion/NOUN" "convoquée/VERB"
# [6] "urgence/NOUN" "Maison/PROPN" "Blanche/PROPN" "n’/VERB" "pas/ADV"
# [11] "dit/VERB" "mot/ADV" "presse/NOUN" "réalité/NOUN" "agit/VERB"
# [16] "mesure/NOUN" "essentiellement/ADV" "commerciale/ADJ" "pays/NOUN" "importe/NOUN"
#
# doc2 :
# [1] "Réfugié/VERB" "Bruxelles/PROPN" "indépendantiste/ADJ" "catalan/VERB" "désigné/VERB"
# [6] "successeur/NOUN" "Jordi/PROPN" "Sanchez/PROPN" "partisan/VERB" "indépendance/ADJ"
# [11] "catalane/ADJ" "actuellement/ADV" "prison/NOUN" "sédition/NOUN"
Then we can remove the tags, which you probably don't want in your STM - but you could leave them if you prefer.
## remove the tags
toks <- tokens_replace(toks, types(toks),
stringi::stri_replace_all_regex(types(toks), "/[A-Z]+$", ""))
toks
# tokens from 2 documents.
# doc1 :
# [1] "M." "Trump" "lors" "réunion" "convoquée"
# [6] "urgence" "Maison" "Blanche" "n’" "pas"
# [11] "dit" "mot" "presse" "réalité" "agit"
# [16] "mesure" "essentiellement" "commerciale" "pays" "importe"
#
# doc2 :
# [1] "Réfugié" "Bruxelles" "indépendantiste" "catalan" "désigné"
# [6] "successeur" "Jordi" "Sanchez" "partisan" "indépendance"
# [11] "catalane" "actuellement" "prison" "sédition"
From there, you can use the toks object to form your dfm and fit the model.
Here's a scrape from the current page at Le Monde's website. Notice that the apostrophe they use is not the same character as the single-quote here "'":
text <- "Réfugié à Bruxelles, l’indépendantiste catalan a désigné comme successeur Jordi Sanchez, partisan de l’indépendance catalane, actuellement en prison pour sédition."
It has a little angle and is not actually "straight down" when I view it. You need to copy that character into your sub()/gsub() call:
sub("l’", "", text)
[1] "Réfugié à Bruxelles, indépendantiste catalan a désigné comme successeur Jordi Sanchez, partisan de l’indépendance catalane, actuellement en prison pour sédition."
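If you want to handle several elided forms at once, and cover both the curly and the straight apostrophe, one option (a sketch mirroring the [lsd] pattern used in the quanteda answer above) is a single character-class regex:
# strip l', d', s', n', qu' at the start of a word, for both ' and ’
gsub("\\b(qu|[lsdn])['’]", "", text)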

Saving Result in different vector in R

The following are the URLs from which I wish to extract data:
> links
[1] "https://www.makemytrip.com/holidays-india/"
[2] "https://www.makemytrip.com/holidays-india/"
[3] "https://www.yatra.com/india-tour-packages"
[4] "http://www.thomascook.in/tcportal/international-holidays"
[5] "https://www.yatra.com/holidays"
[6] "https://www.travelguru.com/holiday-packages/domestic-packages.shtml"
[7] "https://www.chanbrothers.com/package"
[8] "https://www.tourmyindia.com/packagetours.html"
[9] "http://traveltriangle.com/tour-packages"
[10] "http://www.coxandkings.com/bharatdeko/"
[11] "https://www.sotc.in/india-tour-packages"
I have managed to do it using:
library(RCurl)
library(XML)

for (i in seq_along(links)) {
  html <- getURL(links[i], followlocation = TRUE)
  # parse the html
  doc <- htmlParse(html, asText = TRUE)
  plain.text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
}
But the thing is, all of the extracted data ends up saved in the same "plain.text" object. How do I get a separate "plain.text" for each link?
Thank you.
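One way to keep each page's text separate (a sketch using the same RCurl/XML calls as in the question) is to build a list with lapply(), so each element holds the text of one page:
library(RCurl)
library(XML)

page_text <- lapply(links, function(u) {
  html <- getURL(u, followlocation = TRUE)
  doc  <- htmlParse(html, asText = TRUE)
  xpathSApply(doc,
              "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]",
              xmlValue)
})
names(page_text) <- links  # page_text[[1]], page_text[[2]], ... each hold one page's text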

automating downloading a .iqy file and reading the data from it

I have an ".iqy" file that contains a link from which I need to download data, which I then need to read for further cleaning.
I am able to do it manually by entering the link present on the 3rd line of the file, which I can see using:
con <- file("ABC1.iqy", "r", blocking = FALSE)
readLines(con = con, n = -1L, ok = TRUE, warn = FALSE, encoding = 'unknown')
Output:
[1] "WEB"
[2] "1"
[3] "https:abc.../excel/execution/EPnx?view=vrs" [4] ""
[5] ""
[6] "Selection=AllTables"
[7] "Formatting=None"
[8] "PreFormattedTextToColumns=True"
[9] "ConsecutiveDelimitersAsOne=True"
[10] "SingleBlockTextImport=False"
[11] "DisableDateRecognition=False"
[12] "DisableRedirections=False"
[13] ""
I need to automate this instead of doing it manually. Is there any option in R that I can use?
simply use download.file :)
con <- file("ABC1.iqy", "r", blocking = FALSE)
dest_path <- "ABC.file"
download.file(readLines(con=con,n=-1L,ok=TRUE, warn=FALSE,encoding='unknown')[3],destfile= dest_path)
if you can't read the file you get, try :
download.file(readLines(con = con, n = -1L, ok = TRUE, warn = FALSE, encoding = 'unknown')[3],
              destfile = dest_path, mode = "wb")
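To automate the whole thing, you could wrap both steps in a small helper (a sketch; the file names are just illustrative):
download_iqy <- function(iqy_path, dest_path) {
  # the 3rd line of an .iqy file holds the query URL
  iqy_lines <- readLines(iqy_path, warn = FALSE)
  download.file(iqy_lines[3], destfile = dest_path, mode = "wb")
  invisible(dest_path)
}

download_iqy("ABC1.iqy", "ABC.file")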

R: Extract words from a website

I am attempting to extract all words that start with a particular phrase from a website. The website I am using is:
http://docs.ggplot2.org/current/
I want to extract all the words that start with "stat_". I should get 21 names like "stat_identity" in return. I have the following code:
stats <- readLines("http://docs.ggplot2.org/current/")
head(stats)
grep("stat_{1[a-z]", stats, value=TRUE)
I am returned every line containing the phrase "stat_". I just want to extract the "stat_" words. So I tried something else:
gsub("\b^stat_[a-z]+ ", "", stats)
I think the output I got was an empty string, " ", where a "stat_" phrase would be? So now I'm trying to think of ways to extract all the text and set everything that is not a "stat_" phrase to empty strings. Does anyone have any ideas on how to get my desired output?
rvest & stringr to the rescue:
library(xml2)
library(rvest)
library(stringr)
pg <- read_html("http://docs.ggplot2.org/current/")
unique(str_match_all(html_text(html_nodes(pg, "body")),
                     "(stat_[[:alnum:]_]+)")[[1]][,2])
## [1] "stat_bin" "stat_bin2dCount"
## [3] "stat_bindot" "stat_binhexBin"
## [5] "stat_boxplot" "stat_contour"
## [7] "stat_density" "stat_density2d"
## [9] "stat_ecdf" "stat_functionSuperimpose"
## [11] "stat_identity" "stat_qqCalculation"
## [13] "stat_quantile" "stat_smooth"
## [15] "stat_spokeConvert" "stat_sum"
## [17] "stat_summarySummarise" "stat_summary_hexApply"
## [19] "stat_summary2dApply" "stat_uniqueRemove"
## [21] "stat_ydensity" "stat_defaults"
Unless you need the links (then you can use other rvest functions), this removes all the markup for you and just gives you the text of the website.
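If you do want the links as well, here is a sketch with the same rvest objects (it assumes the hrefs on that page look like "stat_identity.html"):
hrefs <- html_attr(html_nodes(pg, "a"), "href")
stat_pages <- unique(grep("^stat_", hrefs, value = TRUE))
stat_pages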
