Accessing XML data in R that is several layers embedded - r

I'm working with a well-structured XML file. So far, I have successfully accessed elements of this dataset that are only one layer/subfield deep. However, now I need to access one type of data that is more deeply embedded within this data structure, and the expected method is not working...
Excerpt from the XML data; this is the "target" field that I need to access, where each node (i.e. drug) can have between 0 and N targets (I am arbitrarily setting N to 20 for now, since I'm not sure what this value is for the entire dataset):
<targets> --> 51st field in each node
<target> --> there are a variable number of targets per drug
<id>BE0000048</id> --> this is the value I want for each Target
<name>Prothrombin</name>
<organism>Human</organism>
<actions>
<action>inhibitor</action>
</actions>
<references>
<articles>
<article>
<pubmed-id>10505536</pubmed-id>
<citation>Turpie AG: Anticoagulants in acute coronary syndromes.
...
I have determined that the main Target field that I need is Field 51 within each node's structure, thus the hardcoded value below. I would think that accessing the i'th node's id value within the j'th target within the node's Target field should have an index of [[i]][[51]][[j]][[1]] or [[i]][[51]][[j]][['id']]:
This is my code that isn't working as expected:
Target <- array(1:NumNodes, dim=c(1,NumNodes,MaxTargets))
for (i in 1:NumNodes){
  for (j in 1:MaxTargets){
    Target[i][j] <- Data[[i]][[51]][[j]][[1]]
  }
}
The behavior I'm seeing is that I can extend the subscripts out numerous levels on the command line, and never narrow the result any more than the following:
> Data[[1]][[51]][[1]][[1]][[1]][[1]][[1]][[1]][[1]][[1]]
[1] "BE0000048ProthrombinHumaninhibitor10505536Turpie AG: Anticoagulants...
It doesn't seem to matter how many subscripts I add; all of the fields in the Target subfield are always conjoined and don't seem to be able to be separated...
Confusingly, when I run my code, I get the following error message:
Error in Data[[i]][[51]][[1]] : subscript out of bounds
... which doesn't seem to make sense, given that I am limiting i to the number of nodes, and that no error is thrown for even the ridiculously long list of subscripts shown above when I run that expression on the command line...
Thanks in advance for any insights you can provide.

Thanks for your suggestion, cderv; I will plan to check out the xml2 package and XPATH. I really appreciate your willingness to provide an example.
I am pasting what should be a functional subset of my XML file; however, now instead of the "targets" field being the 51st field, it is the sixth. Again, it is the targets --> target --> id value that I want to report for each target, with each node having a variable number of target values. My code follows the XML content.
<?xml version="1.0" encoding="UTF-8"?>
<drugbank xmlns="http://www.drugbank.ca" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.drugbank.ca http://www.drugbank.ca/docs/drugbank.xsd" version="5.0" exported-on="2017-07-06">
<drug type="biotech" created="2005-06-13" updated="2016-08-17">
<drugbank-id primary="true">DB00001</drugbank-id>
<drugbank-id>BTD00024</drugbank-id>
<drugbank-id>BIOD00024</drugbank-id>
<name>Lepirudin</name>
<description>Lepirudin is identical to natural hirudin except for substitution of leucine for isoleucine at the N-terminal end of the molecule and the absence of a sulfate group on the tyrosine at position 63. It is produced via yeast cells. Bayer ceased the production of lepirudin (Refludan) effective May 31, 2012.</description>
<targets>
<target>
<id>BE0000048</id>
<name>Prothrombin</name>
<organism>Human</organism>
<actions>
<action>inhibitor</action>
</actions>
<references>
<articles>
<article>
<pubmed-id>10505536</pubmed-id>
<citation>Turpie AG: Anticoagulants in acute coronary syndromes. Am J Cardiol. 1999 Sep 2;84(5A):2M-6M.</citation>
</article>
<article>
<pubmed-id>10912644</pubmed-id>
<citation>Warkentin TE: Venous thromboembolism in heparin-induced thrombocytopenia. Curr Opin Pulm Med. 2000 Jul;6(4):343-51.</citation>
</article>
</articles>
</references>
<known-action>yes</known-action>
</target>
</targets>
</drug>
</drugbank>
Now that I have significantly truncated the above file, my code is now giving an error message that any subscripts above Data[[1]][[1]] are out of bounds, but hopefully this code gives you an idea of what I'm aiming to do...
library(XML)
# Save the database file as a tree structure
xmldata = xmlRoot(xmlTreeParse("DrugBank_TruncatedDatabase_v4_Tiny.xml"))
# Number of nodes in the entire database file
NumNodes <- xmlSize(xmldata)
MaxTargets <- 20
Data <- xmlSApply(xmldata, function(x) xmlSApply(x, xmlValue))
Target <- array(1:NumNodes, dim=c(1,NumNodes,MaxTargets))
for (i in 1:NumNodes){
  for (j in 1:MaxTargets){
    Target[i][j] <- Data[[i]][[5]][[j]][[1]]
  }
}
Thanks for your input!

Related

How to loop through xml nodes in R

I need to split an XML document into multiple nodes, and then split each node separately into further sub-nodes. I am using the xpathSApply/getNodeSet functions in the XML package. But it seems that once the XML document is split into nodes, each node is of class "internal node", and I cannot perform XPath operations on it unless I save it as XML using saveXML(). Any ideas on how this can be done without having to call saveXML()?
For example, consider sample xml below:
<array>
<ResidentialProperty>
<Listing>
<StreetAddress>
<StreetNumber>11111</StreetNumber>
<StreetName>111th</StreetName>
<StreetSuffix>Avenue Ct</StreetSuffix>
<StateOrProvince>WA</StateOrProvince>
</StreetAddress>
<MLSInformation>
<ListingStatus Status="Active"/>
<StatusChangeDate>2015-07-05T23:48:53.410</StatusChangeDate>
</MLSInformation>
<GeographicData>
<Latitude>11.111111</Latitude>
<Longitude>-111.111111</Longitude>
<County>Pierce</County>
</GeographicData>
</ResidentialProperty>
<ResidentialProperty>
<Listing>
<StreetAddress>
<StreetNumber>11211</StreetNumber>
<StreetName>11111334th</StreetName>
<StreetSuffix>Av1enue Ct</StreetSuffix>
<StateOrProvince>WA</StateOrProvince>
</StreetAddress>
<MLSInformation>
<ListingStatus Status="Active"/>
<StatusChangeDate>2017-07-05T23:48:53.410</StatusChangeDate>
</MLSInformation>
<GeographicData>
<Latitude>11.111111</Latitude>
<Longitude>-111.111111</Longitude>
<County>Pie2rce</County>
</GeographicData>
</ResidentialProperty>
</array>
I am intending to split the above into:
1. Two separate nodes with root ResidentialProperty
2. Then be able to perform XPATH operations on each of these nodes.
P.S: This is sample data and not similar to the actual data set I am working with. Just tried to use this to explain the problem I am trying to solve.
EDIT : I think I've misunderstood the question. New approach.
We use xpathApply, toString.XMLNode and xmlParseString to extract specific nodes in 2 objects.
Parse the XML file and extract the nodes:
library(XML)
doc=xmlParse("pathtoyourXML.xml")
result1=xmlParseString(toString.XMLNode(xpathApply(doc,"(//ResidentialProperty)[1]")))
result2=xmlParseString(toString.XMLNode(xpathApply(doc,"(//ResidentialProperty)[2]")))
We now have two objects, which we can query with:
from.result1=xpathApply(result1,"//StreetAddress")
from.result2=xpathApply(result2,"//StreetAddress")
Side note: your XML is not valid. The Listing elements are not closed.
EDIT 2: In fact, you can use xpathApply on a previously "extracted" nodeset:
foo=xpathApply(doc,"(//ResidentialProperty)[2]")
xpathApply(foo[[1]],"//StreetAddress")
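Note that foo[[1]] is still attached to the original document, so an absolute path such as //StreetAddress searches the whole document rather than only the result of the previous XPath expression ((//ResidentialProperty)[2]). To restrict the search to the extracted node, use a relative path such as .//StreetAddress.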

PHPUnit: Flat XML to BIT(1) is always 1

I am trying to run PHPUnit tests and I have a problem with putting a false (0) into a BIT(1) column.
My XML looks like (without pasting my entire XML File):
<!-- Panama -->
<ztest_user
id="3"
firstname="Eddie"
lastname="Van Halen"
password="dabac4cd92d71106e74406be137a578e1b3ce436420f66d2df5be458301d2154"
salt="zU`Rr$X=YjxHRp8o"
email="eddie#vanhalen.com"
roleid="2"
date_added=""
date_modified=""
enabled="1"
confirmed="0"
graduationyear="2020"
yearjoined="2019"
birthday="1955-01-26"
profileimageid=""
/>
When I run the tests, this entry gets put into the database as shown, except for the confirmed column, which gets a 1 instead of a 0.
My thoughts are that the BIT(1) column gets set if there is any data and not to what the value of confirmed actually is.
How should I go about fixing this?

Why does my query report "Steps within a path expression must yield nodes"?

I'm relatively new to XQuery and I'm using an XML file with the following format (MODSXML):
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.loc.gov/mods/v3
http://www.loc.gov/standards/mods/v3/mods-3-0.xsd">
<mods ID="ISI:000330282600027" version="3.0">
<titleInfo>
<title>{Minimum Relative Entropy for Quantum Estimation: Feasibility and General
Solution}</title>
</titleInfo>
I'm trying to retrieve the titles of all articles contained in the XML file. The expression I'm using is the following:
for $x in collection("ExemploBibtex")/"quantuminformation.xml"/modsCollection/mods/titleInfo/title
return <title>$x/text()</title>
When I try to run this expression in BaseX, I get the following error:
"[XPTY0019] Steps within a path expression must yield nodes; xs:string
found."
Can anybody tell me what's wrong? The result I was expecting was a list with all the titles in the document.
Okay, problem solved in the BaseX Mailing List :D
I needed to declare the namespace. So now I'm using:
declare namespace v3 ="http://www.loc.gov/mods/v3";
for $doc in collection('ExemploBibtex')
where matches(document-uri($doc), 'quantuminformation.xml')
return $doc/v3:modsCollection/v3:mods/v3:titleInfo/v3:title/text()
And it works.
The problem is here:
collection("ExemploBibtex")/"quantuminformation.xml"/modsCollection
This returns a string with content quantuminformation.xml for each file/root node in the ExemploBibtex collection, and then tries to perform an axis step on each of these strings -- which is not allowed.
It seems you want to access the document quantuminformation.xml within the collection ExemploBibtex. To open a specific file of a collection, use the following syntax instead:
collection("ExemploBibtex/quantuminformation.xml")/modsCollection
I cut off the last axis steps for readability and to keep the code lines short; simply add them again, they're fine.

Cannot execute TREC customised file in Terrier

I'm having a problem executing the evaluation part of a TREC file using the Terrier tools. I implemented query expansion in the TREC file, so it gives me weighted terms in the tag. What I want to do is use this customized TREC file as input for WT10G using Terrier. I have succeeded in indexing WT10G with Terrier, so my next step is to retrieve an evaluation from this file.
Here is an example of modified TREC file:
<top>
<num> Number: 501
<title> peirce^570.66156
<desc> Description:
What is the difference between deduction and induction in the
process of reasoning?
<narr> Narrative:
A relevant document will contrast inductive and deductive reasoning.
A document that discusses only one or the other is not relevant.
</top>
When I try to input that file in Terrier, Terrier processes it as:
See the yellow rectangle. It treats the query as two inputs instead of a single input with a weight. I read in the documentation that Terrier accepts weighted-term queries as input, http://terrier.org/docs/v3.5/querylanguage.html (I tried the interactive part and it works with weighted terms). Does anyone know how to solve this problem?
Thank you

Setting an attribute in a Nokogiri::XML::NodeSet with css

I was trying to set an attribute for a specific element in an xml file and I was having success using
doc.css('Object').attr("Id").value = timestamp
This was fine until the situation where 'Object' doesn't exist causing an exception in the program and quitting. To avoid that I wanted to use Nodeset as it'll just be empty instead.
doc.css('Object').each do |element|
  element.attr("Id").value = timestamp
end
However this returns with the error that value= is an undefined method. It's probably something simple but I'm new to Ruby and CSS so any help would be great.
The problem has little to do with CSS, since Nokogiri uses CSS selectors as an alternative to using XPath selectors, and both are only used to provide a path to a node, or nodes.
It looks like you're overthinking this, and making it harder than it needs to be. Here's what I'd do:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<xml>
<foo>
<Object>bar</Object>
</foo>
</xml>
EOT
doc.at('Object')["Id"] = Time.now.to_s
Looking at doc at this point shows:
puts doc.to_xml
# >> <?xml version="1.0"?>
# >> <xml>
# >> <foo>
# >> <Object Id="2014-01-28 19:13:32 -0700">bar</Object>
# >> </foo>
# >> </xml>
It's really important to understand the difference between at, at_css and at_xpath, which return the first matching Node, and search, css and xpath, which return NodeSets. A NodeSet is akin to an array containing Nodes. When you know that, your statement:
doc.css('Object').attr("Id").value = timestamp
won't make much sense, especially since that's not how the attr method is defined:
attr(key, value = nil, &blk)
You'd need to use:
doc.css('Object').attr("Id", value)
which would assign value to all Id attributes for every <Object> node in the document.
But, again, that's not the right choice; instead, you should use at or at_css to return the single node.
This was fine until the situation where 'Object' doesn't exist
If no <Object> node exists, then it gets more interesting, and you have to determine what to do. You can insert it, or, you can simply move along and do nothing.
To see if a node exists is simple:
object_node = doc.at('Object')
if object_node
  object_node['Id'] = Time.now.to_s
else
  # ... insert it
end
Inserting a node involves locating the place where you want it, then adding it:
doc.at('foo').add_child("<Object Id='#{ Time.now }'>baz</Object>")
puts doc.to_xml
# >> <?xml version="1.0"?>
# >> <xml>
# >> <foo>
# >> <Object>bar</Object>
# >> <Object Id="2014-01-28 19:39:38 -0700">baz</Object></foo>
# >> </xml>
I didn't try to make the XML output pretty. That isn't important in XML; it merely needs to be syntactically correct.
Also note that it's possible to insert a node, or nodes, by defining them as a string of XML. Nokogiri will parse it into the appropriate XML and graft it in where you said. You could also go the long route by defining a NodeSet or Node, then inserting it, but, in general, that makes uglier code and causes you to do a lot more work, which results in less readable code for those who follow in your footsteps maintaining the source.

Resources