Extracting Attributes from XML File - r

I have a sample XML file that I have parsed in R:
<ROUGHDRAFT_FILE MV="00" MMV="00"
tId="0000">
<HEADER Location="Utah" dateCreated="1/1/99">
</HEADER>
<COVERSHEET>
<PRIMIARY_INFO eName="John Smith" pList="XXXXX"
type="Remodel" cNumber="00000"
policyNumber="00000000000" />
</COVERSHEET>
</ROUGHDRAFT_FILE>
I load the XML from a path stored in the variable file. This is my code:
xml <- xmlParse(file)
This works fine.
When I try to pull the attributes, it gives me an error:
EstAttribs <- xpathApply(xml, path="//PRIMIARY_INFO", xml_attrs )
Error in UseMethod("xpathApply") :
no applicable method for 'xpathApply' applied to an object of class "c('XMLDocument', 'XMLAbstractDocument')"
Any recommendations on how I can fix this? Do I have to specify something for xml_attrs?

MrFlick has already given you one answer; here is another that might be useful. As he suggested, don't mix functions from the XML package with those from rvest and xml2.
# here is the rvest and xml2 solution
# older rvest versions attached xml2 automatically; newer ones only import it,
# so xml2 is loaded explicitly here
library(xml2)
library(rvest)
xml_file <- read_xml("test.xml")
xml_file %>%
  xml_find_all('//PRIMIARY_INFO') %>%
  xml_attrs()  # all attributes; use xml_attr('eName') for a single one
[[1]]
eName pList type cNumber policyNumber
"John Smith" "XXXXX" "Remodel" "00000" "00000000000"
# this solution is purely using XML - as suggested by MrFlick
library(XML)
xml_file <- xmlParse("test.xml")
xpathApply(xml_file, path="//PRIMIARY_INFO", xmlAttrs )
[[1]]
eName pList type cNumber policyNumber
"John Smith" "XXXXX" "Remodel" "00000" "00000000000"
I think this SO question might contain useful info for you.
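For completeness, here is a minimal sketch of pulling one attribute versus all attributes with xml2 alone. It assumes an inline copy of the sample document (with matching root tags) instead of test.xml, and the variable names doc, info, e_name and all_attrs are made up for illustration.

```r
library(xml2)

# Inline copy of the sample document so the sketch does not depend on test.xml
doc <- read_xml('<ROUGHDRAFT_FILE MV="00" MMV="00" tId="0000">
  <HEADER Location="Utah" dateCreated="1/1/99"/>
  <COVERSHEET>
    <PRIMIARY_INFO eName="John Smith" pList="XXXXX" type="Remodel"
                   cNumber="00000" policyNumber="00000000000"/>
  </COVERSHEET>
</ROUGHDRAFT_FILE>')

# First PRIMIARY_INFO element anywhere in the document
info <- xml_find_first(doc, "//PRIMIARY_INFO")

# One attribute, returned as a character string
e_name <- xml_attr(info, "eName")

# All attributes, returned as a named character vector
all_attrs <- xml_attrs(info)
```

xml_attr() is usually the more convenient of the two when you only need one field, because it returns NA for nodes that lack the attribute instead of dropping them.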

Related

Template literals in R

In JavaScript, template literals may be used to insert a dynamic value into text.
Here is an example:
const name = "John";
console.log(`Hello ${name}, how are you?`);
This would print Hello John, how are you? to the console.
An approach to this in R would be to use paste0:
name = "John"
print(paste0("Hello ", name, ", how are you?"))
This becomes quite a hassle if you are dealing with long texts that require multiple dynamic variables.
What are some alternatives to template literals in R?
You can also use str_glue from the stringr package to directly refer to an R object:
library(stringr)
name = "John"
str_glue("Hello {name}, how are you?")
# Hello John, how are you?
Take a look at the versatile sprintf() function:
name = "John"
sprintf( "Hello %s, how are you", name )
#[1] "Hello John, how are you"
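As a rough sketch of how sprintf() scales to several dynamic values, which was the hassle mentioned in the question: the variables age and city below are made up for illustration and are not part of the original example.

```r
name <- "John"
age  <- 30L      # hypothetical extra variable
city <- "Utah"   # hypothetical extra variable

# %s formats a string, %d an integer; arguments are matched by position
msg <- sprintf("Hello %s, you are %d and live in %s.", name, age, city)
print(msg)

# sprintf() is vectorised, so one template can produce several strings
greetings <- sprintf("Hello %s, how are you?", c("John", "Jane"))
print(greetings)
```

This stays in base R, so it needs no extra package, at the cost of keeping the placeholders and the arguments in the same order by hand.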

Extract attributes with same name for all nodes in an xml file using R

I am trying to extract all attributes (with the same name) within an xml file. Currently using the xml2 package and was hoping to have success with the xml_attr or xml_attrs functions.
library(xml2)
# basic xml file
x <- read_xml("<a>
<b><c>123</c></b>
<b><c>456</c></b>
</a>")
# add a few attributes with the same name, "FakeID"
xml_set_attr(xml_child(x, 'b[1]'), 'FakeID', '11111')
xml_set_attr(xml_child(x, 'b[2]'), 'FakeID', '22222')
xml_set_attr(xml_child(xml_child(x, 'b[2]'), 'c'), 'FakeID', '33333')
# this will give me attributes only when I call a specific child node
xml_attr(xml_child(x, 'b[1]'), 'FakeID')
# this does not give me any attributes with the name "FakeID" because the current node
# doesn't have that attribute
xml_attr(x, 'FakeID')
What I am ultimately hoping for is a vector that gives the value of the attribute "FakeID" for every node in the XML that has it: c('11111', '22222', '33333')
I used the rvest package because it re-exports the %>% operator; the XML functions themselves come from xml2 (older rvest versions attached xml2 automatically, newer ones only import it, so it is loaded explicitly here). I wrote your XML out as a string to be clear about what is in there, and added a second attribute to your first node.
In html_elements() (called xml_nodes() in older rvest versions, now deprecated) I select all nodes with the * CSS selector and keep only those that have the FakeID attribute with [FakeID]. xml_attr() then returns the FakeID value of each selected node as a character vector.
library(xml2)
library(rvest)
"<a>
<b FakeID=\"11111\" RealID=\"abcde\">
<c>123</c>
</b>
<b FakeID=\"22222\">
<c FakeID=\"33333\">456</c>
</b>
</a>" %>%
  read_xml() %>%
  html_elements("*[FakeID]") %>%
  xml_attr("FakeID")
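The same selection can also be sketched with xml2 alone, using an XPath expression instead of a CSS selector. The variable names nodes and fake_ids are made up for illustration.

```r
library(xml2)

x <- read_xml('<a>
  <b FakeID="11111" RealID="abcde"><c>123</c></b>
  <b FakeID="22222"><c FakeID="33333">456</c></b>
</a>')

# //*[@FakeID] matches every element, at any depth, that has a FakeID attribute
nodes <- xml_find_all(x, "//*[@FakeID]")

# xml_attr() is vectorised over the node set and returns a character vector
fake_ids <- xml_attr(nodes, "FakeID")
fake_ids
# [1] "11111" "22222" "33333"
```

The values come back in document order, so the nested c node follows its parent b node.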

R xpathSApply --> extracting Attribute gives empty result

I am trying to extract the value of the attribute "NAME" from an XML document in R.
<NN ID_NAME="107232" ID_NTYP="6" NAME="dSpace_ECat1Error.STS" KOMMENTAR="dSpace_ECat1Error.STS" IS_SYSTEM="0" IS_LOCKED="0" DTYP="Ganzzahl" ADIM="" AFMT=""/>
<NN ID_NAME="107233" ID_NTYP="6" NAME="dSpace_ECat2Error.STS" KOMMENTAR="dSpace_ECat2Error.STS" IS_SYSTEM="0" IS_LOCKED="0" DTYP="Ganzzahl" ADIM="" AFMT=""/>
The result should be like this:
dSpace_ECat1Error.STS
dSpace_ECat2Error.STS
I use this function:
xpathSApply(root,"//NN[@NAME]",xmlValue)
But as a result I just get empty strings ("").
What have I done wrong?
Thanks in advance!
I just found a solution myself:
erg<-xpathSApply(root,"//NN",xmlGetAttr,'NAME')
There should be a better tutorial for this particular XML function in R.
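To make the difference visible, here is a small sketch. "//NN[@NAME]" selects the NN elements that have a NAME attribute, and xmlValue then returns their text content, which is empty. The ROOT wrapper element and the trimmed attribute list are assumptions made so that the snippet is a well-formed, self-contained document.

```r
library(XML)

# Trimmed copy of the two NN elements; the <ROOT> wrapper is an assumption
txt <- '<ROOT>
  <NN ID_NAME="107232" NAME="dSpace_ECat1Error.STS"/>
  <NN ID_NAME="107233" NAME="dSpace_ECat2Error.STS"/>
</ROOT>'
doc  <- xmlParse(txt, asText = TRUE)
root <- xmlRoot(doc)

# Selects the NN elements themselves; they contain no text
empty <- xpathSApply(root, "//NN[@NAME]", xmlValue)

# Reads the NAME attribute from each selected element
by_attr <- xpathSApply(root, "//NN", xmlGetAttr, "NAME")

# Alternative: select the attribute nodes directly in the XPath expression
by_path <- as.character(xpathSApply(root, "//NN/@NAME"))
```

Both by_attr and by_path should hold the two NAME values; which form you prefer is mostly a matter of taste.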

NodeSet as character

I want to get a NodeSet, with the getNodeSet function from the XML package, and write it as text in a file.
For example:
> getNodeSet(htmlParse("http://www.google.fr/"), "//div[@id='hplogo']")[[1]]
<div title="Google" align="left" id="hplogo" onload="window.lol&&lol()" style="height:110px;width:276px;background:url(/images/srpr/logo9w.png) no-repeat">
<div nowrap="" style="color:#777;font-size:16px;font-weight:bold;position:relative;top:70px;left:218px">France</div>
</div>
I want to save this whole node, unchanged, to a file.
The problem is that we can't write the object directly with:
writeLines(getNodeSet(...), file)
And as.character(getNodeSet(...)) returns a C pointer.
How can I do this? Thank you.
To save an XML object to a file, use saveXML, e.g.,
url = "http://www.google.fr/"
nodes = getNodeSet(htmlParse(url), "//div[@id='hplogo']")[[1]]
fl <- saveXML(nodes, tempfile())
readLines(fl)
There has to be a better way, but until then you can capture what the print method for an XMLNode outputs:
nodes <- getNodeSet(...)
sapply(nodes, function(x)paste(capture.output(print(x)), collapse = ""))
I know this might be a bit outdated, but I ran into the same problem and wanted to leave this for future reference. After some searching and struggling, the answer turned out to be as simple as:
htmlnodes <- toString(nodes)
writeLines(htmlnodes, file)
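Here is a self-contained sketch of the same idea that does not need a network connection. The small HTML string stands in for the Google page and is an assumption for illustration, as are the names node_txt and out.

```r
library(XML)

# Stand-in for the downloaded page
html <- '<html><body><div id="hplogo" title="Google"><div>France</div></div></body></html>'
doc  <- htmlParse(html, asText = TRUE)
node <- getNodeSet(doc, "//div[@id='hplogo']")[[1]]

# With no file argument, saveXML() returns the serialised node as a string
node_txt <- saveXML(node)

# Write the string to a temporary file and read it back to check
out <- tempfile(fileext = ".html")
writeLines(node_txt, out)
readLines(out)
```

Going through a string first is handy when you want to inspect or post-process the markup before it reaches the file.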

removing data with tags from a vector

I have a character vector which contains HTML tags, e.g.
abc <- "welcome <span class=\"r\">abc</span> Have fun!"
I want to remove these tags and get the following vector, e.g.
abc <- "welcome Have fun"
Try
> gsub("(<[^>]*>)","",abc)
What this says is "substitute every instance of < followed by anything that isn't a >, up to a >, with nothing".
You can't just do gsub("<.*>","",abc) because regexps are greedy: the .* would match up to the last > in your text (and you'd lose the 'abc' in your example).
This solution might fail if you've got > in your tags, but is <foo class=">" > legal? Doubtless someone will come up with another answer that involves parsing the HTML with a heavyweight XML package.
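A quick sketch of the two patterns side by side, using the string from the question. The variable names are made up for illustration.

```r
abc <- "welcome <span class=\"r\">abc</span> Have fun!"

# Negated character class: each tag is matched on its own,
# so the text between the tags is kept
tags_only <- gsub("<[^>]*>", "", abc)
tags_only
# [1] "welcome abc Have fun!"

# Greedy .* runs from the first < to the last >,
# so everything in between (including 'abc') is removed
greedy <- gsub("<.*>", "", abc)
greedy
# [1] "welcome  Have fun!"

# A lazy quantifier behaves like the negated class on this input
lazy <- gsub("<.*?>", "", abc, perl = TRUE)
```

Note that the greedy result keeps both spaces that surrounded the span, so you may want to squeeze whitespace afterwards.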
You can convert your piece of HTML to an XML document with htmlParse or htmlTreeParse.
You can then convert it to text, i.e., strip all the tags, with xmlValue.
abc <- "welcome <span class=\"r\">abc</span> Have fun!"
library(XML)
#doc <- htmlParse(abc, asText=TRUE)
doc <- htmlTreeParse(abc, asText=TRUE)
xmlValue( xmlRoot(doc) )
If you also want to remove the contents of the tags (here, the span elements), you can use xmlDOMApply to transform the XML tree.
f <- function(x) if(xmlName(x) == "span") xmlTextNode(" ") else x
d <- xmlDOMApply( xmlRoot(doc), f )
xmlValue(d)

Resources