R XPath getNodeSet "matches" command

I have an XML file:
<?xml version="1.0" encoding="UTF-8"?>
<doc>
  <!-- A comment -->
  <a xmlns="http://www.tei-c.org/ns/1.0">
    <w>word
    </w>
    <w>wording
    </w>
  </a>
</doc>
I would like to return nodes containing "word" but not "wording".
library(XML) # I have nothing against using library(xml2) or library(xml2r) instead
doc <- xmlParse("file.xml", encoding = "UTF-8")
x <- c(x="http://www.tei-c.org/ns/1.0")
# starts-with seems to find the words just fine
test1 <- getNodeSet(doc, "//x:w[starts-with(., 'word')]", x)
# but R doesn't seem to allow "matches" to be included
# in the xpath query, hence none of the following work:
test1 <- getNodeSet(doc, "//x:w[[matches(., 'word')]]", x)
test1 <- getNodeSet(doc, "//x:w[#*[matches(., 'word')]]", x)
test1 <- getNodeSet(doc, "//x:w[matches(., '^word$')]", x)
test1 <- getNodeSet(doc, "//x:w[#*[matches(., '^word$')]]", x)
Update: If I use matches in any combination, I get the following error and an empty list as the result.
xmlXPathCompOpEval: function matches not found
XPath error : Unregistered function
XPath error : Invalid expression
XPath error : Stack usage error
Error in xpathApply.XMLInternalDocument(doc, path, fun, ..., namespaces = namespaces, :
error evaluating xpath expression //x:w[matches(., '^word$')]
If I look for "//x:w[@*[contains(., '^word$')]]" based on the advice below, I get the following warning and an empty list as the result:
Warning message:
In xpathApply.XMLInternalDocument(doc, path, fun, ..., namespaces = namespaces, :
the XPath query has no namespace, but the target document has a default namespace.
This is often an error and may explain why you obtained no results
I imagine I am just using the wrong commands. What should I change to make it work? Thanks!

Thanks for updating your question to include the error message. It's like going to a doctor and asking for treatment to solve your problem -- you definitely want to let him know what specific symptoms you've noticed!
And this error message confirms that the matches() function is missing. That indicates that the XML package you're using (built on libxml2) supports only XPath 1.0, which does not have matches() or any other regular-expression features. BaseX, on the other hand, supports XPath 2.0 (in fact it supports XPath 3.0, IIRC), so it can handle matches().
Regarding how to do what you want in XPath 1.0, it's not entirely clear what you'd like to do. You mentioned using word boundary markers, so you could try something like
getNodeSet(doc, "//x:w[contains(normalize-space(concat(' ', ., ' ')),
' word ')]", x)
This selects <w> elements whose text contains word as a whole whitespace-delimited token: at the beginning and/or end of the text, or preceded/followed by whitespace. (Note that normalize-space() has to go inside concat(); the other way round it would strip the sentinel spaces we just added.) If you want to treat certain non-whitespace characters as word boundaries, you could translate them to whitespace using translate().
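Putting it together on the sample file (a minimal sketch; file.xml holds the XML from the question):
library(XML)
doc <- xmlParse("file.xml", encoding = "UTF-8")
ns <- c(x = "http://www.tei-c.org/ns/1.0")
# Padding the normalized text with spaces means ' word ' can only match
# the whole token, never a prefix such as "wording".
nodes <- getNodeSet(doc, "//x:w[contains(concat(' ', normalize-space(.), ' '), ' word ')]", ns)
sapply(nodes, xmlValue)
# [1] "word\n"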

Related

drake::vis_drake_graph and drake::make() fail when regex string variables are defined within the plan

First, thank you for this amazing package!
I am applying a long list of treatments to a huge corpus of texts.
Before I even begin working with the corpus, I have to load lexicons from csv files and define regex replacement strings using those lexicons. I, therefore, have to put the [import lexicon] > [define regex strings] process within the drake plan, as the lexicons are often updated upstream.
The regex strings are defined as variables, such as:
nt_left = "(?:[.”]\\s*)"
Unfortunately, drake sends this error when drawing the dependency graph (vis_drake_graph) or building the plan (make):
Error in base::parse(text = code, keep.source = FALSE) :
<text>:1:3: ':' unexpected
1: (?:
^
How can I avoid these errors, given that I cannot escape the regex any further (it works as is)?
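For reference, a minimal plan along the lines described above might look like this (the file names, target names, and the final gsub() step are hypothetical stand-ins, not the asker's actual code):
library(drake)
plan <- drake_plan(
  lexicon = read.csv(file_in("lexicon.csv"), stringsAsFactors = FALSE),
  corpus_text = readLines(file_in("corpus.txt")),
  # the regex replacement string that reportedly triggers the parse error
  nt_left = "(?:[.”]\\s*)",
  treated = gsub(nt_left, "", corpus_text)
)
make(plan)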

How can I keep toJSON from quoting my JSON string in R?

I am using OpenCPU and R to create a web API that takes in some inputs and returns a topoJSON file from a database, as well as some other information. OpenCPU automatically pushes the output through toJSON, which results in JSON output that contains quoted JSON (i.e., the topoJSON). This is obviously not ideal--especially since it then gets incredibly cluttered with backslash-escaped quotes (\"). I tried using fromJSON to convert it to an R object, which could then be converted back (which is incredibly inefficient), but it returns a slightly different syntax and the result is that it doesn't work.
I feel like there should be some way to convert the string to some other type of object that results in toJSON calling a different handler that tells it to just leave it alone, but I can't figure out how to do that.
> s <- '{"type":"Topology","objects":{"map": "0"}}'
> fromJSON(s)
$type
[1] "Topology"
$objects
$objects$map
[1] "0"
> toJSON(fromJSON(s))
{"type":["Topology"],"objects":{"map":["0"]}}
That's just the beginning of the file (I replaced the actual map with "0"), and as you can see, brackets appeared around "Topology" and "0". Alternately, if I just keep it as a string, I end up with this mess:
> toJSON(s)
["{\"type\":\"Topology\",\"objects\":{\"0000595ab81ec4f34__csv\": \"0\"}}"]
Is there any way to fix this so that I just get the verbatim string but without quotes and backticks?
EDIT: Note that because I'm using OpenCPU, the output needs to come from toJSON (so no other function can be used, unfortunately), and I can't do any post-processing.
To me it seems you just want the values rather than vectors. Set auto_unbox = TRUE to turn length-one vectors into scalar values:
toJSON(fromJSON(s), auto_unbox = TRUE)
# {"type":"Topology","objects":{"map":"0"}}
That does print without escaping for me (using jsonlite_1.5). Maybe you are using an older version of jsonlite. You can also get around that by using cat() to print the result. You won't see the slashes when you do that.
cat(toJSON(fromJSON(s), auto_unbox = TRUE))
You can manually unbox the relevant entries:
library(jsonlite)
s <- '{"type":"Topology","objects":{"map": "0"}}'
j <- fromJSON(s)
j$type <- unbox(j$type)
j$objects$map <- unbox(j$objects$map)
toJSON(j)
# {"type":"Topology","objects":{"map":"0"}}

Index in xpath expression

In a related post,
How to select specified node within Xpath node sets by index with Selenium?,
it is mentioned that there is "no index i in xpath".
I am trying to use an index in an R loop within an XPath expression such as
getNodeSet(xmlfile, '//first[i]/second/third')
Clearly, according to the above post it works perfectly when replacing 'i' with '1', but not e.g. for i <- 1.
However, the workaround in the above post (i.e. using ['+i+']) does not seem to work.
Any ideas on how to make indices work in XPath expressions?
'//first[i]/second/third' is just a string, so you can build the expression yourself with R's string-building function paste0() (R doesn't use + for string concatenation).
getNodeSet(xmlfile, paste0('//first[', i, ']/second/third'))
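Inside a loop that might look like this (a minimal sketch, assuming xmlfile is an already-parsed document):
results <- lapply(1:3, function(i) {
  # build the XPath string for the i-th <first> element
  getNodeSet(xmlfile, paste0('//first[', i, ']/second/third'))
})
sprintf() works just as well: getNodeSet(xmlfile, sprintf('//first[%d]/second/third', i)).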

HTML scraping - R scrapeR

I am trying to parse data encoded in HTML format. Example of the string I am trying to parse is:
Simplify the polynomial by combining like terms. <img src=\"/flx/math/inline/3x%2B12-11x%2B14\" class=\"x-math\" alt=\"3x+12-11x+14\" />
I want to get the text before <img and the text in alt=
Desired output:
Simplify the polynomial by combining like terms. 3x+12-11x+14
I tried scrapeR.
y1 = scrape(str1) # the above string is in str1 (as a vector)
I get the following error message
Error in which(value == defs) :
argument "code" is missing, with no default
Has anyone played with scrapeR? I am not sure what "code" refers to, as it is an option that is not described in the manual. I'm just trying to see which default value is affecting this.
Here's one way to extract that information
str1<-"Simplify the polynomial by combining like terms. <img src=\"/flx/math/inline/3x%2B12-11x%2B14\" class=\"x-math\" alt=\"3x+12-11x+14\" />"
library(scrapeR)
y <- scrape(object = "str1")[[1]] # just get the first result
pretext <- sapply(xpathSApply(y, "//img/preceding::text()"), xmlValue)
alttext <- xpathSApply(y, "//img/@alt")
paste(pretext, alttext)
#[1] "Simplify the polynomial by combining like terms. 3x+12-11x+14"
scrape() returns an HTML/XML document that you can work with using functions like xpathSApply() to find nodes and extract values.
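If you would rather avoid scrapeR altogether, here is a sketch of the same extraction with the xml2 package (a different package from the one used above):
library(xml2)
doc <- read_html(str1)
# text node immediately before the <img>, then the img's alt attribute
pretext <- xml_text(xml_find_first(doc, "//img/preceding::text()"))
alttext <- xml_attr(xml_find_first(doc, "//img"), "alt")
paste(pretext, alttext)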

XPath node determination

I'm all new to scraping and I'm trying to understand XPath using R. My objective is to create a vector of people from this website. I'm able to do it using:
library(plyr) # for ldply()
r <- htmlTreeParse(e) ## e is after getURL
g.k <- r[[3]][[1]][[2]][[3]][[2]][[2]][[2]][[1]][[4]]
l <- g.k[names(g.k) == "text"]
u <- ldply(l, function(x) {
  w <- xmlValue(x)
  return(w)
})
However, this is cumbersome and I'd prefer to use XPath. How do I go about referencing the path detailed above? Is there a function for this, or can I submit my path somehow referenced as above?
I've come to
kk <- xpathApply(htmlTreeParse(e, useInternalNodes = TRUE), "//body//text//div//div//p//text()", function(k) xmlValue(k))
But this leaves me a lot of cleaning up to do and I assume it can be done better.
Regards,
//M
EDIT: Sorry for the lack of clarity, but I'm all new to this and rather confused. The XML document is unfortunately too large to paste here. I guess my question is whether there is some easy way to find the names of these nodes and the structure of the document, besides using view source? I've come a little closer to what I'd like:
e2 <- getNodeSet(htmlTreeParse(e, useInternalNodes = TRUE), "//p")[[5]]
gives me the list of what I want, though still as XML with <br> tags. I thought running
kk <- xpathApply(e2, "//text()", function(k) xmlValue(k))
would produce a list that could later be unlisted; however, it produces a list with more garbage than e2 displays.
Is there a way to do this directly:
kk <- xpathApply(htmlTreeParse(e, useInternalNodes = TRUE), "//p[5]//text()", function(k) xmlValue(k))
Link to the web page below: I'm trying to get the names, and only the names, from the page.
getURL("http://legeforeningen.no/id/1712")
I ended up with
xml = htmlTreeParse("http://legeforeningen.no/id/1712", useInternalNodes=TRUE)
(no need for RCurl) and then
sub(",.*$", "", unlist(xpathApply(xml, "//p[4]/text()", xmlValue)))
(subset in xpath) which leaves a final line that is not a name. One could do the text processing in XML, too, but then one would iterate at the R level.
n <- xpathApply(xml, "count(//p[4]/text())") - 1L
sapply(seq_len(n), function(i) {
  xpathApply(xml, sprintf('substring-before(//p[4]/text()[%d], ",")', i))
})
Unfortunately, this does not pick up names that do not contain a comma.
Use a mixture of XPath and string manipulation.
# Retrieve and parse the page.
library(XML)
library(RCurl)
page <- getURL("http://legeforeningen.no/id/1712")
parsed <- htmlTreeParse(page, useInternalNodes = TRUE)
Inspecting the parsed variable, which contains the page's source, tells us that instead of sensibly using a list tag (like <ul>), the author just put a paragraph (<p>) of text split with line breaks (<br />). We use XPath to retrieve the <p> elements.
# Inspection tells us we want the fifth paragraph.
name_nodes <- xpathApply(parsed, "//p")[[5]]
Now we convert to character, split on the <br> tags and remove empty lines.
all_names <- as(name_nodes, "character")
all_names <- gsub("</?p>", "", all_names)
all_names <- strsplit(all_names, "<br />")[[1]]
all_names <- all_names[nzchar(all_names)]
all_names
Optionally, separate the names of people and their locations.
strsplit(all_names, ", ")
Or, more prettily, with stringr:
library(stringr)
str_split_fixed(all_names, ", ", 2)
