Using readtext to extract text from XML in R

I am not used to working with XML files but need to extract text from various fields in XML files. Specifically, I've downloaded and saved XML files like the following: https://www.federalregister.gov/documents/full_text/xml/2007/09/18/07-4595.xml. I'm interested in the text within the tag "regtext" in this and other similar XML files.
I've downloaded the XML files and stored them on my computer, but when I set the directory and attempt to use the readtext package to read from the XML files, I get the following error:
regtext <- readtext("/regdata/RegDataValidation", text_field = "regtext")
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
Start tag expected, '<' not found [4]
I've tried to search the error, but nothing I've come across has helped me figure out what might be going on. This basic command works like a charm on any number of other document types, including .csv or .docx, but for some reason it just doesn't seem to recognize the files I'm trying to work with here. Any pointers would be much appreciated--I'm a novice, and the readtext documentation doesn't give examples of how to work with XML.
Pursuant to comments below, I've also tried to specify a single saved XML file, as follows:
> regtext <- readtext("/regdata/RegDataValidation/0579- AC01.xml", text_field = "regtext")
Error in xml2_to_dataframe(xml) :
The xml format does not fit for the extraction without xPath
Use xPath method instead
In addition: There were 50 or more warnings (use warnings() to see the first 50)
I tried to specify an xPath expression on a single file, and this did not return any errors, but it didn't actually extract any text (even though there should be plenty of text within the "regtext" node):
> regtext <- readtext("/regdata/RegDataValidation/0579- AC01.xml", text_field = "/regtext/*")
I end up with a dataframe with the correct doc_id, but no text.

From the error messages, the readtext function appears to be converting the XML file into a plain text document, which the XML package then rejects as invalid.
It is also likely that the XML parser is case-sensitive and distinguishes "regtext" from "REGTEXT" (the tag in the file is uppercase).
Here is a solution using the xml2 package. (I find this package provides a simpler interface and is easier to use.)
library(xml2)
url <- "https://www.federalregister.gov/documents/full_text/xml/2007/09/18/07-4595.xml"
page <- read_xml(url)
# parse out the nodes within the "REGTEXT" sections
regtext <- xml_find_all(page, ".//REGTEXT")
# convert the REGTEXT nodes into a vector of strings
xml_text(regtext)
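If you need something shaped like readtext output downstream (a doc_id column and a text column), you can wrap the extracted strings yourself. A minimal sketch, continuing from the code above; the doc_id here is just the file name taken from the URL:
regtext_df <- data.frame(doc_id = "07-4595.xml",
                         text = xml_text(regtext),
                         stringsAsFactors = FALSE)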

Related

R: filename list result not recognized for actually reading the file (filename character encoding problem)

I get .xlsx files from various sources, to read and analyse the data in R, which works great. The files are big, 10+ MB. So far, readxl::read_xlsx was the only solution that worked. xlsx::read.xls produced only error messages: Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : java.lang.OutOfMemoryError: GC overhead limit exceeded
Problem: some files have non-standard letters in the filename, e.g. displayed in Windows 10/Explorer as '...ü...xlsx' (the character 'ü' somewhere in the filename). When I read all filenames in the folder in R, I get '...u"...xlsx'. I check for duplicates of the filenames from different folders before I actually read the files. However, when it comes to reading the above file, I get the error message '... file does not exist', no matter if I use
the path/filename character variable directly obtained from list.files (showing '...u"...xlsx')
the string constant '...u"...xlsx'
the string constant '...ü...xlsx'
As far as I understand, the problem arises from equivalent, yet not identical, Unicode compositions. I have no influence on how these characters are originally encoded. Therefore I see no way to read the file, other than (so far manually) renaming the file in Windows Explorer, changing an 'ü' coded as 'u+"' to 'ü'.
Questions:
is there a workaround within R? (Keep in mind the requirement to use read_xlsx, unless a yet unknown package works with huge files.)
if not possible within R, what would be the best option to change filenames automatically ('u+"' to 'ü')? I need to keep the 'ü' (or ä, ö, and others) in order to connect the analysis results back to the input, preferably without additional (non-standard) software (e.g. a command shell).
EDIT:
To read the list of files, dir_ls works (as suggested), but it returns an even stranger-looking filename (the 'ö' comes back in yet another encoded form), which in turn cannot be read (found) by read_xlsx either.
Try using the fs library. My workflow looks something like this:
library(tidyverse)
library(lubridate)
library(fs)
library(readxl)
directory_to_read <- getwd()
file_names_to_read <- dir_ls(path = directory_to_read,
                             recurse = FALSE,  # set this to TRUE to read all subdirectories
                             glob = "*.xls*",
                             ignore.case = TRUE) %>%  # this is to ignore upper/lower case extensions
  # use this to weed out temp files - I constantly have this problem
  str_subset(string = .,
             regex(pattern = "\\/~\\$", ignore_case = TRUE),  # use \\ before $ else it will not work
             negate = TRUE)  # TRUE returns non-matching patterns
map(file_names_to_read[4], read_excel)  # e.g. read just the fourth file
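One further idea, not part of the original answer: if the mismatch really is composed versus decomposed Unicode (an 'ü' stored as 'u' plus a combining diaeresis), normalising the names before comparing or joining may help. A sketch using the stringi package; the folder path is hypothetical:
library(stringi)
raw_names <- list.files("C:/data/xlsx", full.names = TRUE)  # hypothetical folder
nfc_names <- stri_trans_nfc(raw_names)  # convert to the composed (NFC) form
# match/de-duplicate on nfc_names, but keep passing the untouched raw_names
# (the exact strings the file system reported) to read_xlsx()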

xpathSApply returns empty but xmlValue gives the correct value using XML package in R

Recently, I have wanted to parse an XML file using R and the XML package.
After learning from a website tutorial, I can obtain the values I want from a specified XML tag.
But when I parse my own XML file (the file is here: https://drive.google.com/open?id=1yr0klZ0eChTAPmZpnlYL-hQ3-42Sply2), I always obtain an empty list. (Actually, I want the value of the PC-ID_id tag.)
library(XML)
xml = xmlParse('./100000.xml', useInternalNodes = TRUE)
xpathSApply(xml, '//PC-ID_id', xmlValue)
# list()
xmltop = xmlRoot(xml)
xmlValue(xmltop[['PC-AssaySubmit_assay']]
[['PC-AssaySubmit_assay_descr']]
[['PC-AssayDescription']]
[['PC-AssayDescription_aid']]
[['PC-ID']]
[['PC-ID_id']]
)
# "100000"
Can you help me figure out why an empty list is returned when using xpathSApply()?
Thanks.
The problem is likely because your XML document uses a default namespace. You need to properly account for that when creating your xpath expression.
The newer xml2 library makes it easier to strip off the default namespace. So, for example, you can do
library(xml2)
read_xml('./100000.xml') %>%
  xml_ns_strip() %>%
  xml_find_all("//PC-ID_id") %>%
  xml_text()
# [1] "100000"
If you really, really want to use the XML package, then check out the possible answers from this question
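For completeness, here is a sketch of the namespace-aware route with the XML package (my addition, not taken from the linked question): bind the document's default namespace to a prefix of your own and use that prefix in the XPath. This assumes the default namespace is the first one declared on the root element.
library(XML)
doc <- xmlParse('./100000.xml')
ns <- c(d = xmlNamespaceDefinitions(xmlRoot(doc), simplify = TRUE)[[1]])  # default namespace URI
xpathSApply(doc, '//d:PC-ID_id', xmlValue, namespaces = ns)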

Why can I only read one .json file at a time?

I have 500+ .json files that I am trying to get a specific element out of. I cannot figure out why I cannot read more than one at a time.
This works:
library(jsonlite)
files <- list.files('~/JSON')
file1 <- fromJSON(readLines('~/JSON/file1.json'), flatten = TRUE)
result <- as.data.frame(source = file1$element$subdata$data)
However, regardless of using different json packages (eg RJSONIO), I cannot apply this to the entire contents of files. The error I continue to get is...
Attempt to run the same code as a function over all contents of the file list:
for (i in files) {
  fromJSON(readLines(i), flatten = TRUE)
  as.data.frame(i)$element$subdata$data
}
My goal is to loop through all 500+ files and extract the data and its contents. Specifically, if the file has the element 'subdata$data', I want to extract the list and put them all in a data frame.
Note: files are being read as ASCII (Windows OS). This does not have a negative effect on single extractions, but for the loop I get 'invalid character bytes'.
Update 1/25/2019
Ran the following but returned errors...
files <- list.files('~/JSON')
out <- lapply(files, function (fn) {
  o <- fromJSON(file(i), flatten = TRUE)
  as.data.frame(i)$element$subdata$data
})
Error in file(i): object 'i' not found
Also updated the function; this time I get UTF-8 errors...
files <- list.files('~/JSON')
out <- lapply(files, function (i, fn) {
  o <- fromJSON(file(i), flatten = TRUE)
  as.data.frame(i)$element$subdata$data
})
Error in parse_con(txt,bigint_as_char):
lexical error: invalid bytes in UTF8 string. (right here)------^
Latest Update
Think I found a solution to the crazy 'bytes' problem. When I run readLines on the .json file, I can then apply fromJSON,
e.g.
json <- readLines('~/JSON')
jsonread <- fromJSON(json)
jsondf <- as.data.frame(jsonread$element$subdata$data)
# returns a data frame with the correct information
Problem is, I cannot apply readLines to all the files within the JSON folder (PATH). If I can get help with that, I think I can run...
files <- list.files('~/JSON')
for (i in files) {
  a <- readLines(i)
  o <- fromJSON(file(a), flatten = TRUE)
  as.data.frame(i)$element$subdata
}
Needed Steps
apply readLines to all 500+ .json files in the JSON folder
apply fromJSON to the files from step 1
create a data.frame that returns entries if the list (fromJSON) contains $element$subdata$data
Thoughts?
Solution (Workaround?)
Unfortunately, fromJSON still runs into trouble with these .json files. My guess is that my GET method (httr) is unable to wait for and load the 'pretty print' version and is therefore grabbing the raw .json, which in turn produces odd characters and, as a result, the ubiquitous '------^' error. Nevertheless, I was able to put together a solution; please see below. I want to post it for future folks who may have the same problem with .json files not working nicely with any R JSON package.
# keeping the same 'files' variable as earlier
raw_data <- lapply(files, readLines)
dat <- do.call(rbind, raw_data)
dat2 <- as.data.frame(dat, stringsAsFactors = FALSE)
# check that the json contents were read in
dat2[1, 1]
library(tidyr)
dat3 <- separate_rows(dat2, sep = '')
x <- unlist(raw_data)
x <- gsub('[[:punct:]]', ' ', x)
# identify the elements wanted in the original .json and apply regex
y <- regmatches(x, regexec('.*SubElement2 *(.*?) *Text.*', x))
for loops never return anything, so you must save all valuable data yourself.
You call as.data.frame(i) which is creating a frame with exactly one element, the filename, probably not what you want to keep.
(Minor) Use fromJSON(file(i),...).
Since you want to capture these into one frame, I suggest something along the lines of:
out <- lapply(files, function(fn) {
  o <- fromJSON(file(fn), flatten = TRUE)
  as.data.frame(o)$element$subdata$data
})
allout <- do.call(rbind.data.frame, out)
### alternatives:
allout <- dplyr::bind_rows(out)
allout <- data.table::rbindlist(out)
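One extra caveat worth adding (my note, not part of the original answer): list.files() returns bare file names unless full.names = TRUE, so if '~/JSON' is not the working directory the reads will fail before any JSON parsing happens. Files that lack the wanted element can also be dropped before row-binding:
files <- list.files('~/JSON', pattern = '\\.json$', full.names = TRUE)
out <- lapply(files, function(fn) {
  o <- fromJSON(file(fn), flatten = TRUE)
  as.data.frame(o)$element$subdata$data  # NULL when the element is absent
})
allout <- do.call(rbind.data.frame, Filter(Negate(is.null), out))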

Parseing XML by R always return XML declaration error

I am new to XML.
I downloaded an XML file, called ipg140722, from Google (http://www.google.com/googlebooks/uspto-patents-grants-text.html). I am using Windows 8.1 and R 3.1.1.
library(XML)
url <- "E:\\clouddownload\\R-download\\ipg140722.xml"
indata <- xmlTreeParse(url)
XML declaration allowed only at the start of the document
Extra content at the end of the document
error: 1: XML declaration allowed only at the start of the document
2: Extra content at the end of the document
What is the problem?
Note: This post is edited from the original version.
The object lesson here is that just because a file has an xml extension does not mean it is well formed XML.
If #MartinMorgan is correct about the file, Google seems to have taken all the patents approved during the week of 2014-07-22 (last week), converted them to XML, strung them together into a single text file, and given that an xml extension. Clearly this is not well-formed XML. So the challenge is to deconstruct that file. Here is a way to do it in R.
lines <- readLines("ipg140722.xml")
start <- grep('<?xml version="1.0" encoding="UTF-8"?>',lines,fixed=T)
end <- c(start[-1]-1,length(lines))
library(XML)
get.xml <- function(i) {
  txt <- paste(lines[start[i]:end[i]], collapse = "\n")
  # print(i)
  xmlTreeParse(txt, asText = TRUE)
  # return(i)
}
docs <- lapply(1:10,get.xml)
class(docs[[1]])
# [1] "XMLInternalDocument" "XMLAbstractDocument"
So now docs is a list of parsed XML documents. These can be accessed individually as, e.g., docs[[1]], or collectively using something like the code below, which extracts the invention title from each document.
sapply(docs,function(doc) xmlValue(doc["//invention-title"][[1]]))
# [1] "Phallus retention harness" "Dress/coat"
# [3] "Shirt" "Shirt"
# [5] "Sandal" "Shoe"
# [7] "Footwear" "Flexible athletic shoe sole"
# [9] "Shoe outsole with a surface ornamentation contrast" "Shoe sole"
And no, I did not make up the name of the first patent.
Response to OP's comment
My original post, which detected the start of a new document using:
start <- grep("xml version",lines,fixed=T)
was too naive: it turns out the phrase "xml version" appears in the text of some of the patents. So this was breaking (some of) the documents prematurely, resulting in malformed XML. The code above fixes that problem. If you un-comment the two lines in the function get.xml(...) and run the code above with
docs <- lapply(1:length(start),get.xml)
you will see that all 6961 documents parse correctly.
But there is another problem: the parsed XML is very large, so if you leave these lines as comments and try to parse the full set, you run out of memory about half way through (or I did, on an 8GB system). There are two ways to work around this. The first is to do the parsing in blocks (say 2000 documents at a time). The second is to extract whatever information you need for your CSV file in get.xml(...) and discard the parsed document at each step.
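A minimal sketch of that second approach (my addition, assuming the only field needed is the invention title): extract the value inside the function and free the parsed document before moving on.
get.title <- function(i) {
  txt <- paste(lines[start[i]:end[i]], collapse = "\n")
  doc <- xmlParse(txt, asText = TRUE)
  title <- xmlValue(getNodeSet(doc, "//invention-title")[[1]])
  free(doc)  # release the C-level document so memory does not accumulate
  title
}
titles <- sapply(seq_along(start), get.title)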

how do I create a corpus of *.docx files with tm?

I have a mixed filetype collection of MS Word documents. Some files are *.doc and some are *.docx. I'm learning to use tm and I've (more or less*) successfully created a corpus composed of the *.doc files using this:
ex_eng <- Corpus(DirSource('~/R/expertise/corpus/english'),
                 readerControl = list(reader = readDOC,
                                      language = 'en_CA',
                                      load = TRUE));
This command does not handle *.docx files. I assume that I need a different reader. From this article, I understand that I could write my own (given a good understanding of the .docx format which I do not currently have).
The readDOC reader uses antiword to parse *.doc files. Is there a similar application that will parse *.docx files?
Or better still, is there already a standard way of creating a corpus of *.docx files using tm?
* more or less, because although the files go in and are readable, I get this warning for every document: In readLines(y, encoding = x$Encoding) : incomplete final line found on 'path/to/a/file.doc'
.docx files are zipped XML files. If you execute this:
> uzfil <- unzip(file.choose())
And then pick a .docx file in your directory, you get:
> str(uzfil)
chr [1:13] "./[Content_Types].xml" "./_rels/.rels" "./word/_rels/document.xml.rels" ...
> uzfil
[1] "./[Content_Types].xml" "./_rels/.rels" "./word/_rels/document.xml.rels"
[4] "./word/document.xml" "./word/theme/theme1.xml" "./docProps/thumbnail.jpeg"
[7] "./word/settings.xml" "./word/webSettings.xml" "./word/styles.xml"
[10] "./docProps/core.xml" "./word/numbering.xml" "./word/fontTable.xml"
[13] "./docProps/app.xml"
This will also silently unpack all of those files to your working directory. The "./word/document.xml" file has the words you are looking for, so you can probably read them with one of the XML tools in the XML package. I'm guessing you would do something along the lines of:
library(XML)
xtext <- xmlTreeParse(uzfil[4], useInternalNodes = TRUE)
Actually you will probably need to save this to a temp-directory and add that path to the file name, "./word/document.xml".
You may want to use the further steps provided by #GaborGrothendieck in this answer: How to extract xml data from a CrossRef using R?
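If you go the XML route, xml2 tends to be the simpler tool for this step. A rough sketch (my addition, not from the linked answer), assuming the standard WordprocessingML layout in which each paragraph of a .docx is a w:p element inside word/document.xml:
library(xml2)
doc <- read_xml("./word/document.xml")  # path created by unzip() above
paras <- xml_find_all(doc, ".//w:p")  # xml2 reads the 'w' namespace prefix from the document itself
text <- xml_text(paras)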
I ended up using docx2txt to convert the .docx files to text. Then I created a corpus from them like this:
ex_eng <- Corpus(DirSource('~/R/expertise/corpus/english'),
                 readerControl = list(reader = readPlain,
                                      language = 'en_CA',
                                      load = TRUE));
I figure I could probably hack the readDOC reader so that it would use docx2txt or antiword as needed, but this works.
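Another option worth noting (my addition): the readtext package, used at the top of this page, reads *.docx files directly and returns a data frame with doc_id and text columns, which tm's DataframeSource() accepts, so a corpus can be built without converting the files first. A sketch:
library(readtext)
library(tm)
rt <- readtext('~/R/expertise/corpus/english/*.docx')
ex_eng <- VCorpus(DataframeSource(as.data.frame(rt)))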
