Here is an example of the XQuery output that I get:
<clinic>
<Name xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">Healthy Kids Pediatrics</Name>
<Address xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">510 W 27th St, Los Angeles, CA 90007</Address>
<PhoneNumberList>213-555-5845</PhoneNumberList>
<NumberOfPatientGroups>2</NumberOfPatientGroups>
</clinic>
As you can see, strange xmlns:xsi declarations are being added to the <Name> and <Address> tags.
The funny thing is that if I go to the top of my XML file:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="vaccination.xsl"?>
<Vaccination xsi:noNamespaceSchemaLocation="vaccination.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
and remove the phrase
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
then my XQuery XML output looks like this (which is what I want):
<clinic>
<Name>Healthy Kids Pediatrics</Name>
<Address>510 W 27th St, Los Angeles, CA 90007</Address>
<PhoneNumberList>213-555-5845</PhoneNumberList>
<NumberOfPatientGroups>2</NumberOfPatientGroups>
</clinic>
BUT, when I view my XML in my browser, it will give an error and display something like:
XML Parsing Error: prefix not bound to a namespace
Location: file:///C:/Users/Pac/Desktop/csci585-hw3/vaccination.xml
Line Number 3, Column 1:<Vaccination xsi:noNamespaceSchemaLocation="vaccination.xsd">
^
Does anyone have an idea of how to remove those xmlns:xsi declarations from my XQuery output without breaking my XML/XSL?
Removing the namespace declaration from the top node makes the XML document invalid, as the xsi prefix is used but not declared. This should have caused an error when you try to load the document in a query.
I assume that the Name and Address nodes are copied directly from the source document and the other nodes are constructed.
When copying a node from the source document, the in scope namespaces from the source node are combined with the in scope namespaces in the node that contains the copy. The way these are combined is specified by the copy-namespaces-mode.
In your case you want namespaces to be inherited from the parent node (the node in the query), but you do not want to preserve namespaces in the source document where they are unnecessary.
This can be achieved by adding the following line to the top of the query:
declare copy-namespaces no-preserve, inherit;
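The point about the undeclared xsi prefix can be checked outside of XQuery as well. The following is a minimal sketch in Python's standard library (not the asker's toolchain); the two strings are cut-down stand-ins for the root element of vaccination.xml, and is_well_formed is a helper name made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Root element with the xsi prefix declared (as in the original file).
WITH_DECL = (
    '<Vaccination xsi:noNamespaceSchemaLocation="vaccination.xsd" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"/>'
)
# Same element after deleting the xmlns:xsi declaration: the prefix is
# still used by the attribute but is no longer bound to a namespace.
WITHOUT_DECL = '<Vaccination xsi:noNamespaceSchemaLocation="vaccination.xsd"/>'


def is_well_formed(text: str) -> bool:
    """Return True if a namespace-aware parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(WITH_DECL))
print(is_well_formed(WITHOUT_DECL))
```

This is why the fix belongs in the query (copy-namespaces) rather than in the source document.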
I have an XML document which reads like this:
<xml>
<web:Web>
<web:Total>4000</web:Total>
<web:Offset>0</web:Offset>
</web:Web>
</xml>
My question is: how do I access them using a library like BeautifulSoup in Python?
Something like xmlDom.web["Web"].Total does not work.
BeautifulSoup isn't a DOM library per se (it doesn't implement the DOM APIs). To make matters more complicated, you're using namespaces in that xml fragment. To parse that specific piece of XML, you'd use BeautifulSoup as follows:
from BeautifulSoup import BeautifulSoup
xml = """<xml>
<web:Web>
<web:Total>4000</web:Total>
<web:Offset>0</web:Offset>
</web:Web>
</xml>"""
doc = BeautifulSoup( xml )
print doc.find( 'web:total' ).string
print doc.find( 'web:offset' ).string
If you weren't using namespaces, the code could look like this:
from BeautifulSoup import BeautifulSoup
xml = """<xml>
<Web>
<Total>4000</Total>
<Offset>0</Offset>
</Web>
</xml>"""
doc = BeautifulSoup( xml )
print doc.xml.web.total.string
print doc.xml.web.offset.string
The key here is that BeautifulSoup doesn't know (or care) anything about namespaces. Thus web:Web is treated like a web:web tag instead of as a Web tag belonging to the web namespace. While BeautifulSoup adds web:web to the xml element dictionary, Python syntax doesn't recognize web:web as a single identifier.
You can learn more about it by reading the documentation.
This is an old question but somebody might not know that at least BeautifulSoup 4 does handle namespaces well if you pass 'xml' as second argument to the constructor:
from bs4 import BeautifulSoup
soup = BeautifulSoup("""<xml>
<web:Web>
<web:Total>4000</web:Total>
<web:Offset>0</web:Offset>
</web:Web>
</xml>""", 'xml')
print(soup.prettify())
<?xml version="1.0" encoding="utf-8"?>
<xml>
<Web>
<Total>
4000
</Total>
<Offset>
0
</Offset>
</Web>
</xml>
Environment
import bs4
bs4.__version__
---
'4.10.0'
import sys
print(sys.version)
---
3.8.10 (default, Nov 26 2021, 20:14:08)
[GCC 9.3.0]
BS4/XML Parser on XML with namespace definition
from bs4 import BeautifulSoup
xbrl_with_namespace = """
<?xml version="1.0" encoding="UTF-8"?>
<xbrl
xmlns:dei="http://xbrl.sec.gov/dei/2020-01-31"
>
<dei:EntityRegistrantName>
Hoge, Inc.
</dei:EntityRegistrantName>
</xbrl>
"""
soup = BeautifulSoup(xbrl_with_namespace, 'xml')
registrant = soup.find("dei:EntityRegistrantName")
print(registrant.prettify())
---
<dei:EntityRegistrantName>
Hoge, Inc.
</dei:EntityRegistrantName>
BS4/XML Parser on XML without namespace definition
xbrl_without_namespace = """
<?xml version="1.0" encoding="UTF-8"?>
<xbrl>
<dei:EntityRegistrantName>
Hoge, Inc.
</dei:EntityRegistrantName>
</xbrl>
"""
soup = BeautifulSoup(xbrl_without_namespace, 'xml')
registrant = soup.find("dei:EntityRegistrantName")
print(registrant)
---
None
BS4/HTML Parser on XML without namespace definition
The BS4 HTML parser treats <namespace>:<tag> as a single tag name, and it also converts the name to lowercase.
soup = BeautifulSoup(xbrl_without_namespace, 'html.parser')
registrant = soup.find("dei:EntityRegistrantName".lower())
print(registrant)
---
<dei:entityregistrantname>
Hoge, Inc.
</dei:entityregistrantname>
A search with capital letters does not match, because the tag names have been converted to lowercase.
registrant = soup.find("dei:EntityRegistrantName")
print(registrant)
---
None
Conclusion
Provide the namespace definitions if you want to use namespaces with the XML parser, OR
Use the HTML parser and search with all-lowercase tag names.
You should explicitly define your namespace on the root element using the xmlns:prefix="URI" syntax (see examples here), and then access your tag via prefix:tag from BeautifulSoup. Keep in mind that you should also explicitly tell BeautifulSoup how to process your document, in this case as XML:
xml = BeautifulSoup(xml_content, 'xml')
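For comparison, here is a minimal sketch of the same idea (declare the prefix on the root, then address elements as prefix:tag) using Python's standard-library ElementTree instead of BeautifulSoup. The namespace URI http://example.com/web is a placeholder chosen for this example; the original fragment never says what the web prefix maps to.

```python
import xml.etree.ElementTree as ET

# The question's fragment, with a (hypothetical) namespace URI declared
# on the root element so that the web: prefix is bound.
XML = """<xml xmlns:web="http://example.com/web">
  <web:Web>
    <web:Total>4000</web:Total>
    <web:Offset>0</web:Offset>
  </web:Web>
</xml>"""

# Prefix-to-URI map used only by the search calls below; the prefix
# here does not have to match the one used inside the document.
NS = {"web": "http://example.com/web"}

root = ET.fromstring(XML)
total = root.find("web:Web/web:Total", NS).text
offset = root.find("web:Web/web:Offset", NS).text
print(total, offset)
```

Without the xmlns:web declaration this parser rejects the document outright, which is the stricter counterpart of the None result shown for the BS4 XML parser above.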
For the examples below I’m assuming you:
have your namespaces declared at the top of your XML file: xmlns:ns_name="http://example.com"
have your XML parsed as xml: BeautifulSoup(data, 'xml')
Extracting known tags in a namespace
If <ns_name:tag_name> is known, the find() and find_all() methods will work just fine - as mentioned in this thread already.
# extract the first element with tag name
xml_soup.find('web:Web')
# extract all elements with tag name
xml_soup.find_all('web:Web')
Searching within a namespace with CSS selectors
BS4 also allows you to search within namespaces using CSS selectors by using a prefix: your namespace, a pipe symbol | and finally your CSS selector. Template: ns_name|css_selector.
# select all elements in the namespace 'web'
xml_soup.select('web|*')
# selecting specific elements within the namespace 'web'
xml_soup.select('web|Web > Total')
More complex searches within a namespace
For anything more complex, you’ll want to write a custom boolean function:
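import re
def ns_and_regex_match(tag) -> bool:
    # keep only tags whose namespace prefix is 'web'
    if tag.prefix != 'web':
        return False
    # ...and whose tag name starts with 'Off'
    return bool(re.search('^Off.*$', tag.name))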
xml_soup.find_all(ns_and_regex_match)
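If BeautifulSoup and lxml are not available, the same "namespace plus regex" filter can be sketched with the standard library alone. This is an illustration under assumptions, not the BS4 API: ElementTree exposes namespaced tags as '{uri}local' strings rather than a prefix attribute, the URI below is a placeholder, and split_tag is a helper invented for the example.

```python
import re
import xml.etree.ElementTree as ET

WEB_NS = "http://example.com/web"  # hypothetical URI for the web: prefix

XML = f"""<xml xmlns:web="{WEB_NS}">
  <web:Web>
    <web:Total>4000</web:Total>
    <web:Offset>0</web:Offset>
  </web:Web>
</xml>"""


def split_tag(tag: str):
    """Split '{uri}local' into (uri, local); uri is '' if un-namespaced."""
    if tag.startswith("{"):
        uri, _, local = tag[1:].partition("}")
        return uri, local
    return "", tag


def ns_and_regex_match(elem, ns_uri: str, pattern: str) -> bool:
    """True if elem is in the given namespace and its local name matches."""
    uri, local = split_tag(elem.tag)
    return uri == ns_uri and re.search(pattern, local) is not None


root = ET.fromstring(XML)
matches = [
    split_tag(e.tag)[1]
    for e in root.iter()
    if ns_and_regex_match(e, WEB_NS, r"^Off")
]
print(matches)
```

Matching on the namespace URI rather than the prefix keeps the filter working even if a document binds the same namespace to a different prefix.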
Hello everyone.
I am new to Atom and am using it to view XML files. (I haven't set up any additional packages yet. Version 1.19.4)
One of my XML files consists of elements with many attributes. For example:
<book id="test_xml">
<class name="First_row" attrib_01="Grape" attrib_02="Apple" attrib_03="banana" attrib_04="Water melon" attrib_05="Orange" ... (and so on )
</book>
Every class has at least 50 attributes.
The first time I opened this XML file in the Atom editor, it showed every class on a single line. (This is what I want.) But when I edit an attribute value ("Melon" to "Apple"), the editor suddenly breaks the line and turns the single line into multiple lines, as below.
<book id="Fruit">
<class name="First_row" attrib_01="Grape" attrib_02="Apple"
attrib_03="banana" attrib_04="Water melon"
attrib_05="Orange" ... (and so on )
</book>
Without changing the XML format, how can I prevent the single line from being split into multiple lines?
Thank you.
I'm working with a well-structured XML file. So far, I have successfully accessed elements of this dataset that are only one layer/subfield deep. However, now I need to access one type of data that is more deeply embedded within this data structure, and the expected method is not working...
Excerpt from the XML data; this is the "target" field that I need to access, where each node (i.e. drug) can have between 0 and N targets (I am arbitrarily setting N to 20 for now, since I'm not sure what this value is for the entire dataset):
<targets> --> 51st field in each node
<target> --> there are a variable number of targets per drug
<id>BE0000048</id> --> this is the value I want for each Target
<name>Prothrombin</name>
<organism>Human</organism>
<actions>
<action>inhibitor</action>
</actions>
<references>
<articles>
<article>
<pubmed-id>10505536</pubmed-id>
<citation>Turpie AG: Anticoagulants in acute coronary syndromes.
...
I have determined that the main Target field that I need is Field 51 within each node's structure, thus the hardcoded value below. I would think that accessing the i'th node's id value within the j'th target within the node's Target field should have an index of [[i]][[51]][[j]][[1]] or [[i]][[51]][[j]][['id']]:
This is my code that isn't working as expected:
Target <- array(1:NumNodes, dim=c(1,NumNodes,MaxTargets))
for (i in 1:NumNodes){
for (j in 1:MaxTargets){
Target[i][j] <- Data[[i]][[51]][[j]][[1]]
}
}
The behavior I'm seeing is that I can extend the subscripts out numerous levels on the command line, and never narrow the result any more than the following:
> Data[[1]][[51]][[1]][[1]][[1]][[1]][[1]][[1]][[1]][[1]]
[1] "BE0000048ProthrombinHumaninhibitor10505536Turpie AG: Anticoagulants...
It doesn't seem to matter how many subscripts I add; all of the fields in the Target subfield are always conjoined and don't seem to be able to be separated...
Confusingly, when I run my code, I get the following error message:
Error in Data[[i]][[51]][[1]] : subscript out of bounds
... which doesn't seem to make sense, given that I am limiting i to the number of nodes, and that no error is thrown for even the ridiculously long list of subscripts shown above when I query that phrase on the command line...
Thanks in advance for any insights you can provide.
Thanks for your suggestion, cderv; I will plan to check out the xml2 package and XPATH. I really appreciate your willingness to provide an example.
I am pasting what should be a functional subset of my XML file; however, now instead of the "targets" field being the 51st field, it is the sixth. Again, it is the targets --> target --> id value that I want to report for each target, with each node having a variable number of target values. My code follows the XML content.
<?xml version="1.0" encoding="UTF-8"?>
<drugbank xmlns="http://www.drugbank.ca" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.drugbank.ca http://www.drugbank.ca/docs/drugbank.xsd" version="5.0" exported-on="2017-07-06">
<drug type="biotech" created="2005-06-13" updated="2016-08-17">
<drugbank-id primary="true">DB00001</drugbank-id>
<drugbank-id>BTD00024</drugbank-id>
<drugbank-id>BIOD00024</drugbank-id>
<name>Lepirudin</name>
<description>Lepirudin is identical to natural hirudin except for substitution of leucine for isoleucine at the N-terminal end of the molecule and the absence of a sulfate group on the tyrosine at position 63. It is produced via yeast cells. Bayer ceased the production of lepirudin (Refludan) effective May 31, 2012.</description>
<targets>
<target>
<id>BE0000048</id>
<name>Prothrombin</name>
<organism>Human</organism>
<actions>
<action>inhibitor</action>
</actions>
<references>
<articles>
<article>
<pubmed-id>10505536</pubmed-id>
<citation>Turpie AG: Anticoagulants in acute coronary syndromes. Am J Cardiol. 1999 Sep 2;84(5A):2M-6M.</citation>
</article>
<article>
<pubmed-id>10912644</pubmed-id>
<citation>Warkentin TE: Venous thromboembolism in heparin-induced thrombocytopenia. Curr Opin Pulm Med. 2000 Jul;6(4):343-51.</citation>
</article>
</articles>
</references>
<known-action>yes</known-action>
</target>
</targets>
</drug>
</drugbank>
Now that I have significantly truncated the above file, my code is now giving an error message that any subscripts above Data[[1]][[1]] are out of bounds, but hopefully this code gives you an idea of what I'm aiming to do...
library(XML)
# Save the database file as a tree structure
xmldata = xmlRoot(xmlTreeParse("DrugBank_TruncatedDatabase_v4_Tiny.xml"))
# Number of nodes in the entire database file
NumNodes <- xmlSize(xmldata)
MaxTargets <- 20
Data <- xmlSApply(xmldata, function(x) xmlSApply(x, xmlValue))
Target <- array(1:NumNodes, dim=c(1,NumNodes,MaxTargets))
for (i in 1:NumNodes){
for (j in 1:MaxTargets){
Target[i][j] <- Data[[i]][[5]][[j]][[1]]
}
}
Thanks for your input!
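The goal described above (one id per target, with a variable number of targets per drug) is easier to reach with path queries than with positional subscripts, and the default namespace on the drugbank root has to be taken into account. The following is a rough sketch in Python's standard library rather than R, shown only to illustrate the shape of the query; the first drug is trimmed from the posted file and the second drug is a made-up entry with no targets.

```python
import xml.etree.ElementTree as ET

XML = """<drugbank xmlns="http://www.drugbank.ca">
  <drug>
    <name>Lepirudin</name>
    <targets>
      <target><id>BE0000048</id><name>Prothrombin</name></target>
    </targets>
  </drug>
  <drug>
    <name>Hypothetical drug</name>
    <targets/>
  </drug>
</drugbank>"""

# Every element inherits the default namespace, so each path step needs
# a prefix that is mapped to that URI here.
NS = {"db": "http://www.drugbank.ca"}

root = ET.fromstring(XML)
targets_by_drug = {}
for drug in root.findall("db:drug", NS):
    name = drug.findtext("db:name", namespaces=NS)
    # findall returns an empty list when a drug has no targets, so there
    # is no need to guess a maximum number of targets in advance.
    targets_by_drug[name] = [
        target.findtext("db:id", namespaces=NS)
        for target in drug.findall("db:targets/db:target", NS)
    ]

print(targets_by_drug)
```

Because the elements are looked up by name, the result does not depend on whether targets is the 51st or the sixth child of a drug.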
I'm relatively new to XQuery and I'm using a XML with the following format (MODSXML):
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.loc.gov/mods/v3
http://www.loc.gov/standards/mods/v3/mods-3-0.xsd">
<mods ID="ISI:000330282600027" version="3.0">
<titleInfo>
<title>{Minimum Relative Entropy for Quantum Estimation: Feasibility and General
Solution}</title>
</titleInfo>
I'm trying to retrieve all titles of the articles contained in the XML file. The expression I'm using is the following:
for $x in collection("ExemploBibtex")/"quantuminformation.xml"/modsCollection/mods/titleInfo/title
return <title>$x/text()</title>
When I try to run this expression on BaseX, I get the following error:
"[XPTY0019] Steps within a path expression must yield nodes; xs:string
found."
Can anybody tell me what's wrong? The result I was expecting was a list with all the titles in the document.
Okay, problem solved in the BaseX Mailing List :D
I needed to declare the namespace. So now I'm using:
declare namespace v3 ="http://www.loc.gov/mods/v3";
for $doc in collection('ExemploBibtex')
where matches(document-uri($doc), 'quantuminformation.xml')
return $doc/v3:modsCollection/v3:mods/v3:titleInfo/v3:title/text()
And it works.
The problem is here:
collection("ExemploBibtex")/"quantuminformation.xml"/modsCollection
This returns a string with content quantuminformation.xml for each file/root node in the ExemploBibtex collection, and then tries to perform an axis step on each of these strings -- which is not allowed.
It seems you want to access the document quantuminformation.xml within the collection ExemploBibtex. To open a specific file of a collection, use the following syntax instead:
collection("ExemploBibtex/quantuminformation.xml")/modsCollection
I cut off the last axis steps for readability and to keep the code lines short; simply add them again, they're fine.
I was trying to set an attribute for a specific element in an xml file and I was having success using
doc.css('Object').attr("Id").value = timestamp
This was fine until the situation where 'Object' doesn't exist, causing an exception in the program and quitting. To avoid that I wanted to iterate over the NodeSet, as it'll just be empty instead.
doc.css('Object').each do |element|
element.attr("Id").value = timestamp
end
However this returns with the error that value= is an undefined method. It's probably something simple but I'm new to Ruby and CSS so any help would be great.
The problem has little to do with CSS, since Nokogiri uses CSS selectors as an alternative to using XPath selectors, and both are only used to provide a path to a node, or nodes.
It looks like you're overthinking this, and making it harder than it needs to be. Here's what I'd do:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<xml>
<foo>
<Object>bar</Object>
</foo>
</xml>
EOT
doc.at('Object')["Id"] = Time.now.to_s
Looking at doc at this point shows:
puts doc.to_xml
# >> <?xml version="1.0"?>
# >> <xml>
# >> <foo>
# >> <Object Id="2014-01-28 19:13:32 -0700">bar</Object>
# >> </foo>
# >> </xml>
It's really important to understand the difference between at, at_css and at_xpath, which return the first matching Node, and search, css and xpath, which return NodeSets. A NodeSet is akin to an array containing Nodes. When you know that, your statement:
doc.css('Object').attr("Id").value = timestamp
won't make much sense, especially since that's not how the attr method is defined:
attr(key, value = nil, &blk)
You'd need to use:
doc.css('Object').attr("Id", value)
which would assign value to all Id attributes for every <Object> node in the document.
But, again, that's not the right choice; instead you should use at or at_css to return the single node.
This was fine until the situation where 'Object' doesn't exist
If no <Object> node exists, then it gets more interesting, and you have to determine what to do. You can insert it, or, you can simply move along and do nothing.
To see if a node exists is simple:
object_node = doc.at('Object')
if object_node
object_node['Id'] = Time.now.to_s
else
# ... insert it
end
Inserting a node involves locating the place where you want to insert it, then adding it:
doc.at('foo').add_child("<Object Id='#{ Time.now }'>baz</Object>")
puts doc.to_xml
# >> <?xml version="1.0"?>
# >> <xml>
# >> <foo>
# >> <Object>bar</Object>
# >> <Object Id="2014-01-28 19:39:38 -0700">baz</Object></foo>
# >> </xml>
I didn't try to make the XML output pretty. That isn't important in XML; it merely needs to be syntactically correct.
Also note that it's possible to insert a node, or nodes, by defining them as a string of XML. Nokogiri will parse it into the appropriate XML and graft it in where you said. You could also go the long route by defining a NodeSet or Node, then inserting it, but, in general, that makes uglier code and causes you to do a lot more work, which results in less readable code for those who follow in your footsteps maintaining the source.
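The check-then-set-or-insert pattern above is not specific to Nokogiri. As a loose analogy only, here is how it might look with Python's standard-library ElementTree, where find plays the role of at (first match or nothing) and findall the role of css (a possibly empty list). The function name stamp_or_insert and the sample documents are invented for this sketch.

```python
import xml.etree.ElementTree as ET


def stamp_or_insert(root, timestamp: str):
    """Set Id on the first <Object>, or add a new <Object> under <foo>."""
    node = root.find(".//Object")
    # Compare against None explicitly: an Element without children is
    # falsy, so a plain `if node:` would wrongly skip a found node.
    if node is not None:
        node.set("Id", timestamp)
        return node
    parent = root.find("foo")
    if parent is None:
        parent = root  # fall back to the root if <foo> is missing
    new_node = ET.SubElement(parent, "Object", {"Id": timestamp})
    new_node.text = "baz"
    return new_node


with_object = ET.fromstring("<xml><foo><Object>bar</Object></foo></xml>")
without_object = ET.fromstring("<xml><foo/></xml>")

stamp_or_insert(with_object, "2014-01-28 19:13:32 -0700")
stamp_or_insert(without_object, "2014-01-28 19:39:38 -0700")

print(ET.tostring(with_object, encoding="unicode"))
print(ET.tostring(without_object, encoding="unicode"))
```

In both libraries the key step is the same: ask for a single node, test whether anything came back, and only then touch its attributes.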