I'm trying to extract data from a Canadian Act for a project (in this case, the Food and Drugs Act) and import it into R. I want to break it up into two parts: first, the table of contents (pic 1); second, the body of the Act (pic 2). But I do not want the French part (je suis désolé). I have tried using tabulizer's extract_area(), but I don't want to have to select the area by hand 90 times (I'm going to do this for multiple pieces of legislation).
Obviously I don't have a minimal reproducible example coded out... But the pdf is downloadable here: https://laws-lois.justice.gc.ca/eng/acts/F-27/
Option 2 is to write something to pull it out via XML, but I'm a bit less used to working with XML files. Unless it's incredibly annoying to do using either pdftools or tabulizer, I'd prefer the answer using one of those libraries (mostly for learning purposes).
I've seen some similar questions on Stack Overflow, but they're all confusingly written or designed for tables, which this is not. I am not a quant/data science researcher by training, so an explanation would be super helpful (but not required).
Here's an option that reads in the PDF text and detects the language. You're probably going to have to do a lot of text cleanup after reading in the PDF; I'll assume you don't care about retaining formatting.
library(pdftools)
a = pdf_text('F-27.pdf')
#split text to get sentence chunks, mostly.
b = sapply(a,strsplit,'\r\n')
#do a bunch of other text cleanup, here's an example using the third list element. You can expand this to cover all of b with a loop or list function like sapply.
#Two spaces should hopefully retain most sentence-like fragments, you can get more sophisticated:
d = strsplit(b[[3]], '  ')[[1]]
library(cld3) #language tool to detect french and english
x = sapply(d,detect_language)
#Keep only English
x[x=='en']
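If that page-by-page approach works for you, the same steps can be applied across every page in b rather than just the third one. A rough sketch (untested against the actual file; the double-space split is the same heuristic as above):
#apply the split/detect/filter steps to every page
eng_text = unlist(lapply(b, function(page) {
  frags = unlist(strsplit(page, '  '))   #split each line on double spaces
  frags = trimws(frags)
  frags = frags[nchar(frags) > 0]        #drop empty fragments
  langs = sapply(frags, cld3::detect_language)
  frags[!is.na(langs) & langs == 'en']   #keep English fragments only
}))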
I am working on analyzing some text data from a Ticketing system. I am pulling out long text fields from the tickets and need to analyze which words are being used and which ones are being used the most. But I need it to list all of the words.
The data is in an Excel file. Using tm, I have made some edits to the data and removed stop words and other words that aren't really important to what I am looking for. I have already made this into a corpus.
When I run the following code, it kind of gives me what I need, but it does not actually give me all of the words. I know that this is going to be a long list, but that is fine.
dtm <- DocumentTermMatrix(hardwareCN.Clean)
dtmDataFrame1 <- as.data.frame(inspect(dtm))
colSums(dtmDataFrame1)
This gives me only about 10 words, but I know that there are many, many more than that. I also need to be able to export the result so I can share it.
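For context, the end result I'm after is a full word-frequency table exported to a file, roughly like this sketch (I'm guessing as.matrix() is what I need instead of inspect(); the file name is just a placeholder):
# full term counts rather than the truncated preview from inspect()
m <- as.matrix(dtm)
wordFreq <- sort(colSums(m), decreasing = TRUE)
# export the complete list so it can be shared
write.csv(data.frame(word = names(wordFreq), count = wordFreq),
          "hardware_word_counts.csv", row.names = FALSE)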
Thanks
I have around 1140 terms in three documents (after removing sparse terms). I want information about the clusters. I have produced clusters as shown in the attached image, but I am unable to read them. I have also tried k-means clustering, but the same problem persists. I am not so much interested in all the terms; a few clearly defined clusters (three or four) would do the job. I have been using the tm package in R for text mining.
Secondly, I am also looking to find associations between terms within a single document. For this, how can I split a text file into several text files? That is, if my file has three sentences:
Doc: "My name is ABC. I live in XYZ. I am cousin of TUV."
I would like to split it as:
Doc_1: My name is ABC.
Doc_2: I live in XYZ.
Doc_3: I am cousin of TUV.
So that I have three rows (and their term columns) in the dtm instead of a single row of terms.
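Something along these lines is roughly what I have in mind for the split, though I haven't worked it out fully (a sketch using base R and tm; the sentence regex is only approximate):
doc <- "My name is ABC. I live in XYZ. I am cousin of TUV."
# split after sentence-ending punctuation followed by whitespace
sentences <- unlist(strsplit(doc, "(?<=[.!?])\\s+", perl = TRUE))
# one "document" per sentence, so the dtm gets one row per sentence
library(tm)
sentCorpus <- VCorpus(VectorSource(sentences))
dtm <- DocumentTermMatrix(sentCorpus)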
You ask more than one question; I will address your first one. It is unrealistic to put 1140 strings in your graph and expect to see anything. You need a way to see a bit of it at a time. You can cut the tree and look at smaller pieces in the lower part of the tree to control how much you are seeing at once.
Here is an example. Even with 150 points, it is hard to see what is going on.
D = as.dendrogram(hclust(dist(iris[,1:4])))
plot(D)
But if you cut the tree, you can look at individual lower branches and understand that part.
Cuts = cut(D, 4)
plot(Cuts$lower[[2]])
Of course, you will need to experiment around a bit to find good places to cut your tree.
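If what you ultimately want is a few labelled clusters rather than the picture itself, cutree (in base stats) may also help; a sketch on the same iris example (the choice of k = 4 is arbitrary here):
HC = hclust(dist(iris[,1:4]))
# assign every observation (in your case, every term) to one of 4 clusters
groups = cutree(HC, k = 4)
table(groups)                        # cluster sizes
split(rownames(iris), groups)[1:2]   # members of the first two clusters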
Is there an easy way to automate the conversion of an R data frame to a pretty Word table in APA format for publishing manuscripts? I'm currently doing this by saving the table as a CSV, opening that in Excel, copying the Excel table to Word, and formatting it there, but I'm hoping there is a way to automate the formatting in R, so that when I convert it to Word it is already in APA format, because Word is poor at automation.
Basically, I want to continue writing the manuscript itself in Word while doing my analyses in R, then gather all the results in R into a table (with manually modifiable formatting) with a script and convert it to whatever format I can simply copy-paste into Word (so that the formatting actually holds). When I need to modify the table, I would make the changes in R and rerun the script without having to change anything in Word.
I don't want to learn LaTeX, because everyone in my field uses Word with features like track changes, and I use the Zotero add-in for citations, so it's simpler to keep the writing separate from the analyses. Also, I am a psychologist, not a coder, so learning a lot of new technologies just for this is probably not worth the effort for me. New technologies typically bring new technical problems, and I am aiming to make my workflow quicker, but not at the cost of unpredictability (which may make it slower exactly when I cannot afford it).
I found an R+knitr+rmarkdown+pander+pandoc solution "with as little overhead as possible", but it still seems quite heavy, because I don't know any of those technologies apart from R. And I'm not eager to start learning all of them, as they seem to be aimed at doing the writing and everything else in R to the very end, while I want to separate my writing from my code; I never need code in my writing, only the result tables. In addition, based on the examples, that approach seems to fetch values directly from R code (e.g., from summary() to create a descriptive table), while I need to be able to tinker with my table manually before converting it, for instance adding the title and notes (like a specific note for one cell, explained at the bottom). I also found R2wd, but it seems to be an older attempt at the same "whole workflow in R" problem as the solution above. SWord does not seem to work anymore.
Any suggestions?
(Just to let you know, I am the author of the packages I recommend here...)
You can use the ReporteRs package to output your table to Word. Here is a tutorial (not mine):
http://www.sthda.com/english/wiki/create-and-format-word-documents-using-r-software-and-reporters-package
FlexTable objects let you format and arrange tables easily with some standard R code. For example, to set the 2nd column in bold, the code looks like this:
myFlexTable[, 2] = textBold()
There are (old) examples here:
http://davidgohel.github.io/ReporteRs/flextable_examples.html
These objects can be added to a Word report using the function addFlexTable, and the Word report can be generated with the function writeDoc.
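Roughly, the whole flow looks like this (just a sketch; the data and file names are placeholders):
library(ReporteRs)
doc = docx()                                 # new empty Word document
myFlexTable = FlexTable(data = head(iris))   # build a table from a data.frame
myFlexTable[, 2] = textBold()                # e.g. set the 2nd column in bold
doc = addFlexTable(doc, myFlexTable)         # add the table to the document
writeDoc(doc, file = "my_table.docx")        # write the .docx file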
If you are working in RStudio, you can print the object and it will be rendered in the HTML viewer, so you can export it to Word when you are satisfied with its content.
You can even add real Word footnotes (see the link below)
http://davidgohel.github.io/ReporteRs/pot_objects.html#pot_footnotes
If you need more tabular output, I also recommend the rtable package, which handles xtable objects (and other things I have had to develop to satisfy my colleagues or customers); a quick demo can be seen here:
http://davidgohel.github.io/tabular/
Hope it helps...
I have had the same need and ended up using the package htmlTable, which is quite 'cost-efficient'. It creates an HTML table (in RStudio it appears in the "Viewer" window in the bottom right), which I simply select with the mouse and copy-paste into Word. (Start selecting from the bottom of the table and drag the mouse upwards; that way you are sure to include the start of the HTML code.) Word handles these tables quite nicely. The syntax is quite simple, involving just the function htmlTable(), but it can still produce somewhat more complex tables, such as grouped rows and primary and secondary column headers (i.e., column headers spanning more than one column). Check out the examples in the vignette.
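To give an idea of the syntax, here is a minimal sketch with made-up data, using grouped rows and a spanning column header:
library(htmlTable)
mx <- matrix(round(rnorm(8), 2), ncol = 2,
             dimnames = list(paste("Row", 1:4), c("Estimate", "SE")))
htmlTable(mx,
          rgroup   = c("Group A", "Group B"),  # primary row groups
          n.rgroup = c(2, 2),
          cgroup   = "Model 1",                # header spanning both columns
          n.cgroup = 2)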
One note of caution: htmlTable will not work with factor variables, i.e., they will come out as integer numbers (according to their factor levels). So read the data using stringsAsFactors = FALSE or convert them using as.character().
Including trailing zeroes can be done using the txtRound function. Example:
mini_table <- data.frame(Name="A", x=runif(20), stringsAsFactors = FALSE)
txt <- txtRound(mini_table, 2)
It is not completely straightforward to assign formatting such as bold or italics, but it can be done by wrapping the table contents in HTML code. If, for instance, you want to make an entire column bold, it can be done like this (please note the use of single and double quotation marks inside paste0):
library(plyr)
mini_table <- data.frame(Name="A", x=runif(20), stringsAsFactors = FALSE)
txt <- txtRound(mini_table, 2)
txt$x <- aaply(txt$x, 1, function(x)
  paste0("<span style='font-weight:bold'>", x, "</span>")
)
htmlTable(txt)
Of course, that would be easier to do in Word. However, it is more interesting to add formatting to numbers according to some criteria. For instance, if we want to emphasize all values of x that are less than 0.2 by applying bold font, we can modify the code above as follows:
library(plyr)
mini_table <- data.frame(Name="A", x=runif(20), stringsAsFactors = FALSE)
txt <- txtRound(mini_table, 2)
txt$x <- aaply(txt$x, 1, function(x)
  if (as.numeric(x) < 0.2) {
    paste0("<span style='font-weight:bold'>", x, "</span>")
  } else {
    paste0("<span>", x, "</span>")
  })
htmlTable(txt)
If you want even fancier emphasis, you can for instance replace the bold font with a red background color by using span style='background-color: red' in the code above. All these changes carry over to Word, at least on my computer (Windows 7).
The short answer is "not really." I've never had much luck getting well-formatted tables into MS Word. The best approach I can offer requires using Rmarkdown to render your tables into an HTML file. You can copy and paste your results from the HTML file into MS Word, but I make no guarantees about how well the formatting will follow.
To format your tables, you can try something like the xtable package, or the pixiedust package. But again, no guarantees that the formatting will transfer.
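As a minimal illustration of that route, here is a sketch using xtable to produce HTML that can be pasted (or rendered via Rmarkdown) and then copied into Word; the data and file name are placeholders:
library(xtable)
tab <- xtable(head(mtcars), digits = 2, caption = "Descriptive statistics")
# write the table as HTML; open it in a browser and copy-paste into Word
print(tab, type = "html", file = "my_table.html")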
I have a Haskell program that produces a text file, which is then read by R. My current solution works, but I am asking whether there is a better solution and whether it is worth changing the current approach.
Currently my Haskell program produces the following output (simplified example):
mylist <- list(
list(c("b"),c("b","E"),c("b","E","P"),c("b","T"),c("b","P","T"),c("b","E","T"),c("b","E","P","T"))
, list(c("b"),c("b","T"),c("b","N"),c("b","E"),c("b","E","T"),c("b","N","T"),c("b","N","E"),c("b","N","E","T"))
, list(c("b","N"),c("b","E","N"),c("b","N","T"),c("b","E","N","T"))
)
myListNames <- c("Name1","Name2","Name3")
This output is saved to a text file that is simply sourced from within R. I then access the two variables mylist and myListNames.
The data: I am generating 9 text files. Each list entry represents a feature; there are at most 120 different features, and a feature name can be 20 characters long. Please note that "feature" here has nothing to do with statistics. In the dummy example, b would be 20 characters long in the real-world data. Each sublist is about 5 to 45 elements long, but an outlier might have 500,000 list entries.
The current approach works reasonably well. But is there another way to store a list of lists as a text file that might be better suited?
I used the approach suggested by Ricardo Saporta (outputting JSON instead of R code). It worked like a charm, and I used the R library RJSONIO for the JSON parsing on the R side.
Many thanks to Ricardo Saporta!
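For anyone following the same route, the reading side is essentially just this (a sketch; the file name and exact JSON layout are assumptions based on the example above):
library(RJSONIO)
# the Haskell program now writes JSON instead of R source, e.g.
# {"Name1": [["b"], ["b","E"], ["b","E","P"]], "Name2": [["b"], ["b","T"]]}
parsed <- fromJSON("mylist.json")
myListNames <- names(parsed)
mylist <- unname(parsed)   # a list of lists of character vectors, as before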
I'm reading the R FAQ source in texinfo, and thinking that it would be easier to manage and extend if it were parsed as an R structure. There are several existing examples related to this:
the fortunes package
bibtex entries
Rd files
each with some desirable features.
In my opinion, FAQs are underused in the R community because they lack i) easy access from the R command line (i.e., through an R package); ii) powerful search functions; iii) cross-references; iv) extensions for contributed packages. Drawing ideas from the bibtex and fortunes packages, we could conceive a new system where:
FAQs can be searched from R. Typical calls would resemble the fortune() interface: faq("lattice print"), faq() # surprise me!, faq(51), or faq(package = "ggplot2") (a toy sketch of such an interface follows after this list).
Packages can provide their own FAQ.rda, the format of which is not clear yet (see below)
Sweave/knitr drivers are provided to output nicely formatted Markdown/LaTeX, etc.
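Purely to make the first idea concrete, here is a toy sketch of such an interface, assuming entries are stored in a data frame with title/entry/category columns (all names and entries below are invented):
faq_db <- data.frame(
  title    = c("How do I print a lattice plot inside a script?",
               "Why does R convert my strings to factors?"),
  entry    = c("Wrap the plot call in print().",
               "Use stringsAsFactors = FALSE when building the data frame."),
  category = c("graphics", "data"),
  stringsAsFactors = FALSE
)
faq <- function(pattern, entries = faq_db) {
  if (missing(pattern))                     # faq(): surprise me
    return(entries[sample(nrow(entries), 1), ])
  if (is.numeric(pattern))                  # faq(51): entry by index
    return(entries[pattern, ])
  hits <- grepl(pattern, paste(entries$title, entries$entry),
                ignore.case = TRUE)         # naive full-text search
  entries[hits, ]
}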
QUESTION
I'm not sure what the best input format is, however, either for converting the existing FAQ or for adding new entries.
It is rather cumbersome to use R syntax with a tree of nested lists (or an ad hoc S3/S4/reference class or structure), e.g.:
\list(title = "Something to be \\escaped", entry = "long text with quotes, links and broken characters", category = c("windows", "mac", "test"))
Rd documentation, even though not an R structure per se (it is more a subset of LaTeX with its own parser), can perhaps provide a more appealing example of an input format. It also has a set of tools to parse the structure in R. However, its current purpose is rather specific and different, being oriented towards general documentation of R functions, not FAQ entries. Its syntax is not ideal either; I think a more modern markup, something like Markdown, would be more readable.
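The parsing tools I am referring to are in the base tools package; a minimal sketch (the file name is a placeholder):
library(tools)
rd <- parse_Rd("man/faq_entry.Rd")   # parse an Rd file into an R object
str(rd, max.level = 1)               # tagged components such as \title, \description, ...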
Is there something else out there, maybe examples of parsing Markdown files into R structures? Or an example of repurposing Rd files away from their intended use?
To summarise
I would like to come up with:
1- a good design for an R structure (class, perhaps) that would extend the fortune package to more general entries such as FAQ items
2- a more convenient format to enter new FAQs (rather than the current texinfo format)
3- a parser, either written in R or some other language (bison?) to convert the existing FAQ into the new structure (1), and/or the new input format (2) into the R structure.
Update 2: in the last two days of the bounty period I got two answers, both interesting but completely different. Because the question is quite vast (arguably ill-posed), neither answer provides a complete solution, so I will not (for now, anyway) accept an answer. As for the bounty, I'll award it to the most up-voted answer before the bounty expires, wishing there were a way to split it more equally.
(This addresses point 3.)
You can convert the texinfo file to XML
wget http://cran.r-project.org/doc/FAQ/R-FAQ.texi
makeinfo --xml R-FAQ.texi
and then read it with the XML package.
library(XML)
doc <- xmlParse("R-FAQ.xml")
r <- xpathSApply(doc, "//node", function(u) {
  list(list(
    title = xpathSApply(u, "nodename", xmlValue),
    contents = as(u, "character")
  ))
})
free(doc)
But it is much easier to convert it to text
makeinfo --plaintext R-FAQ.texi > R-FAQ.txt
and parse the result manually.
doc <- readLines("R-FAQ.txt")
# Split the document into questions
# i.e., around lines like ****** or ======.
i <- grep("[*=]{5}", doc) - 1
i <- c(1,i)
j <- rep(seq_along(i)[-length(i)], diff(i))
stopifnot(length(j) == length(doc))
faq <- split(doc, j)
# Clean the result: since the questions
# are in the subsections, we can discard the sections.
faq <- faq[ sapply(faq, function(u) length(grep("[*]", u[2])) == 0) ]
# Use the result
cat(faq[[ sample(seq_along(faq),1) ]], sep="\n")
I'm a little unclear on your goals. You seem to want all the R-related documentation converted into some format which R can manipulate, presumably so that one can write R routines to extract information from the documentation better.
There seem to be three assumptions here.
1) That it will be easy to convert these different document formats (texinfo, Rd files, etc.) to some standard form with (I emphasize) some implicit uniform structure and semantics.
Because if you cannot map them all to a single structure, you'll have to write separate R tools for each type and perhaps for each individual document, and then the post-conversion tool work will overwhelm the benefit.
2) That R is the right language in which to write such document-processing tools; I suspect you're a little biased towards R because you work in R and don't want to contemplate "leaving" the development environment to get information about working with R better. I'm not an R expert, but I think R is mainly a numerical language, and it does not offer any special help for string handling, pattern recognition, natural language parsing or inference, all of which I'd expect to play an important part in extracting information from the converted documents, which largely contain natural language. I'm not suggesting a specific alternative language (Prolog??), but if you succeed with the conversion to normal form (task 1), you might be better off carefully choosing the target language for processing.
3) That you can actually extract useful information from those structures. Library science was what the 20th century tried to push; now we're all into "Information Retrieval" and "Data Fusion" methods. But in fact reasoning about informal documents has defeated most of the attempts to do it. There are no obvious systems that organize raw text and extract deep value from it (IBM's Jeopardy-winning Watson system being the apparent exception, but even there it isn't clear what Watson "knows"; would you want Watson to answer the question "Should the surgeon open you with a knife?" no matter how much raw text you gave it?). The point is that you might succeed in converting the data, but it isn't clear what you can successfully do with it.
All that said, most markup systems on text have markup structure and raw text. One can "parse" those into tree-like structures (or graph-like structures if you assume certain things are reliable cross-references; texinfo certainly has these). XML is widely pushed as a carrier for such parsed structures, and being able to represent arbitrary trees or graphs it is ... OK ... for capturing such trees or graphs. [People then push RDF or OWL or some other knowledge-encoding system that uses XML, but this doesn't change the problem; you pick a canonical target independent of R.] So what you really want is something that will read the various marked-up structures (texinfo, Rd files) and spit out XML or equivalent trees/graphs. Here I think you are doomed to building separate O(N) parsers to cover all the N markup styles; how otherwise would a tool know what the markup (and therefore the parse) means? (You can imagine a system that could read marked-up documents when given a description of the markup, but even this is O(N): somebody still has to describe the markup.) Once this parsing into a uniform notation is done, you can then use an easily built R parser to read the XML (assuming one doesn't already exist), or, if R isn't the right answer, parse it with whatever the right answer is.
There are tools that help you build parsers and parse trees for arbitrary languages (and even translators from the parse trees to other forms). ANTLR is one; it is used by enough people that you might even accidentally find a texinfo parser somebody already built. Our DMS Software Reengineering Toolkit is another; after parsing, DMS will export an XML document with the parse tree directly (but it won't necessarily be in that uniform representation you ideally want). These tools will likely make it relatively easy to read the markup and represent it in XML.
But I think your real problem will be deciding what you want to extract and do with it, and then finding a way to do that. Unless you have a clear idea of how to do the latter, building all the up-front parsers just seems like a lot of work with unclear payoff. Maybe you have a simpler goal ("manage and extend", but those words can hide a lot) that's more doable.