Setting an attribute in a Nokogiri::XML::NodeSet with css - css

I was trying to set an attribute for a specific element in an xml file and I was having success using
doc.css('Object').attr("Id").value = timestamp
This was fine until the situation where 'Object' doesn't exist causing an exception in the program and quitting. To avoid that I wanted to use Nodeset as it'll just be empty instead.
doc.css('Object').each do |element|
element.attr("Id").value = timestamp
end
However this returns with the error that value= is an undefined method. It's probably something simple but I'm new to Ruby and CSS so any help would be great.

The problem has little to do with CSS, since Nokogiri uses CSS selectors as an alternative to using XPath selectors, and both are only used to provide a path to a node, or nodes.
It looks like you're overthinking this, and making it harder than it needs to be. Here's what I'd do:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<xml>
<foo>
<Object>bar</Object>
</foo>
</xml>
EOT
doc.at('Object')["Id"] = Time.now.to_s
Looking at doc at this point shows:
puts doc.to_xml
# >> <?xml version="1.0"?>
# >> <xml>
# >> <foo>
# >> <Object Id="2014-01-28 19:13:32 -0700">bar</Object>
# >> </foo>
# >> </xml>
It's really important to understand the difference between at, at_css and at_xpath, which return the first matching Node, and search, css and xpath, which return NodeSets. A NodeSet is akin to an array containing Nodes. When you know that, your statement:
doc.css('Object').attr("Id").value = timestamp
won't make much sense, especially since that's not how the attr method is defined:
attr(key, value = nil, &blk)
You'd need to use:
doc.css('Object').attr("Id", value)
which would assign value to all Id attributes for every <Object> node in the document.
But, again, that's not the right choice, instead you should use at or at_css to return the single node.
This was fine until the situation where 'Object' doesn't exist
If no <Object> node exists, then it gets more interesting, and you have to determine what to do. You can insert it, or, you can simply move along and do nothing.
To see if a node exists is simple:
object_node = doc.at('Object')
if object_node
object_node['Id'] = Time.now.to_s
else
# ... insert it
end
To insert a node involves locating the place you want to insert it, then add it:
doc.at('foo').add_child("<Object Id='#{ Time.now }'>baz</Object>")
puts doc.to_xml
# >> <?xml version="1.0"?>
# >> <xml>
# >> <foo>
# >> <Object>bar</Object>
# >> <Object Id="2014-01-28 19:39:38 -0700">baz</Object></foo>
# >> </xml>
I didn't try to make the XML output pretty, which isn't important in XML, it merely needs to be syntactically correct.
Also note that it's possible to insert a node, or nodes, by defining them as a string of XML. Nokogiri will parse it into the appropriate XML and graft it in where you said. You could also go the long route by define a NodeSet or Node, then inserting it, but, in general, that makes uglier code and causes you to do a lot more work, which results in less readable code for those who follow in your footsteps maintaining the source.

Related

Not able to access certain JSON properties in Autoloader

I have a JSON file that is loaded by two different Autoloaders.
One uses schema evolution and besides replacing spaces in the json property names, writes the json directly to a delta table, and I can see all the values are there properly.
In the second one I am mapping to a defined schema and only use a subset of properties. So use a lot of withColumn and then a select to narrows to my defined column list.
Autoloader definition:
df = (spark
.readStream
.format('cloudFiles')
.option('cloudFiles.format', 'json')
.option('multiLine', 'true')
.option('cloudFiles.schemaEvolutionMode','rescue')
.option('cloudFiles.includeExistingFiles','true')
.option('cloudFiles.schemaLocation', bronze_schema)
.option('cloudFiles.inferColumnTypes', 'true')
.option('pathGlobFilter','*.json')
.load(upload_path)
.transform(lambda df: remove_spaces_from_columns(df))
.withColumn(...
Writer:
df.writeStream.format('delta') \
.queryName(al_stream_name) \
.outputMode('append') \
.option('checkpointLocation', checkpoint_path) \
.option('mergeSchema', 'true') \
.trigger(once = True) \
.table(bronze_table)
Issue is that some of the source columns are ok load and I get their values, and others are constantly null in the output table.
For example:
.withColumn('vl_rating', col('risk_severity.value')) # works
.withColumn('status', col('status.name')) # always null
...
.select(
'rating',
'status',
...
json is quite simple, these are all string values, they are always populated. The same code works against another simular json file in another autoloader without issue.
I have run out of ideas to fault find on this. My imports are minimal, outside of Autoloader the JSON loads fine.
e.g
%python
import pyspark.sql.functions as psf
jsontest = spark.read.option('inferSchema','true').json('dbfs:....json')
df = jsontest.withColumn('status', psf.col('status.name')).select('status')
display(df)
Results in the values of the status.name property of the json file
Any ideas would be greatly appreciated.
I have found generally what is causing this. Interesting cause!
I am scanning a whole directory of json files, and the schema evolves over time (as expected). But when I clear out the autoloader schema and checkpoint directories and only scan the latest json file it all works correctly.
So what I surmise is that something in schema evolution with the older json files causes Autoloader to get into a state where it will not put certain properties into the stream to the writer.
If anyone has any recommendation on how to implement some data quality analysis in an Autoloader I would be most appreciative if you would share.

xmlstarlet how can I specify s specific class

This code successfully makes enabled="false for both lines below.
How can I change the following so enabled = false for only the second line?
xmlstarlet ed --inplace --update '//ResultCollector/#enabled' --value 'false' "${scriptLocation}"
<ResultCollector guiclass="SimpleDataWriter" testclass="ResultCollector" testname="Simple Data Writer" enabled="true">
<ResultCollector guiclass="ViewResultsFullVisualizer" testclass="ResultCollector" testname="View Results Tree" enabled="true">
XPath allows specifying a specific xml element by order of appearance inside its parent:
--update '//ResultCollector[2]/#enabled'
The above expression selects all ResultCollector elements that appear as second under their parent for processing.
More generally, chances are your application will be safer selecting elements by an embedded information (such as a tag value) instead of by order:
--update '//ResultCollector[#guiclass="ViewResultsFullVisualizer"]/#enabled'
If it suits you, the above expression selects for processing all ResultCollector elements whose tag guiclass is ViewResultsFullVisualizer. In your example this also causes only the second ResultCollector to be updated as well.

Why is "//*" the only xPath query that works when I'm parsing this XML in R using the XML-package?

I'm parsing a Swedish library catalogue using R and the XML-package. Using the library's API, I'm getting XML back from a url containing my query.
I'd like to use xPath queries to parse each record, but everything I do with xPath of the XML-package returns blank lists, everything except "//*". I'm no expert in either xml-parsing nor xPath, but I suspect that it has to do with the xml that my API returns to me.
This is a simple example of one single post in the catalogue:
library(XML)
example.url <- "http://libris.kb.se/sru/swepub?version=1.1&operation=searchRetrieve&query=mat:dok&maximumRecords=1&recordSchema=mods"
doc = xmlParse(example.url)
# Title
works <- xmlRoot(doc)[[4]][["record"]][["recordData"]][["mods"]][["titleInfo"]][["title"]][[1]]
doesntwork <- getNodeSet(doc, "//title")
# The only xPath that returns anything
onlythisworks <- getNodeSet(doc, "//*")
If this has something to do with namespaces (as these answers sugests), all I understan about it is that the API returns data that seems to have namespaces defined in the initial tag, and that I could use that, but this doesn't help me:
# Namespaces are confusing:
title <- getNodeSet(xmlRoot(doc), "//xsi:title", namespaces = c(xsi = "http://www.w3.org/2001/XMLSchema-instance"))
Here's (again) the example return data that I'm trying to parse.
You have to use the right namespace.
Try the following
doesntwork <- getNodeSet(doc, "//mods:title")
#[[1]]
#<title>Horizontal Slot Waveguides for Silicon Photonics Back-End Integration [Elektronisk resurs]</title>
#
#[[2]]
#<title>TRITA-ICT/MAP AVH, 2014:17 \
# </title>
#
#attr(,"class")
#[1] "XMLNodeSet"
BTW: I usually get the namespaces via
nsDefs=xmlNamespaceDefinitions(doc,simplify = TRUE,recursive=TRUE)
But this throws an error in your case. It complains that there are different URIs for the same name space prefix. According to
this site this does not seem to be good coding style.
Update as per OP's comment
I am myself not an xml expert, but here is my take: You can define default namespaces via <tag xmlns=URI>. Non default namespaces are of the form <tag xmlns:a=URI> with a being the respective namespace name.
The problem with your document is that there are two different default namespaces. The first being in <searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/" ... >. The second is in <mods xmlns="http://www.loc.gov/mods/v3" ... >. Also, you will find the second default namespace URI in the first tag as xmlns:mods="http://www.loc.gov/mods/v3" (where it is non-default). This seems rather messy. Now, the <title> tag is within the <mods> tag. I think that the default namespace defined in <mods> gets overridden by the non default namespace of searchRetrieveResponse (because they have the same URI). So although <mods> and all following tags (like <title>) seems to have default namespaces they actually have the xmlns:mods namespace. But this does not apply to the tag <numberOfRecords> (because it's outside of <mods>). You can access this node via
getNodeSet(doc, "//ns:numberOfRecords",
namespaces = c(ns="http://www.loc.gov/zing/srw/"))
Here you extract the default namespace defined in <searchRetrieveResponse> and give it a name (ns in our case). Then you can explicitly use the default namespace name in your xPath query.

Why does my query report "Steps within a path expression must yield nodes"?

I'm relatively new to XQuery and I'm using a XML with the following format (MODSXML):
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.loc.gov/mods/v3
http://www.loc.gov/standards/mods/v3/mods-3-0.xsd">
<mods ID="ISI:000330282600027" version="3.0">
<titleInfo>
<title>{Minimum Relative Entropy for Quantum Estimation: Feasibility and General
Solution}</title>
</titleInfo>
I'm trying to retrieva all titles of the articles contained on the XML file. The expression I'm using is the following:
for $x in collection("ExemploBibtex")/"quantuminformation.xml"/modsCollection/mods/titleInfo/title
return <title>$x/text()</title>
When I try to run this expression on Base, I get the following error:
"[XPTY0019] Steps within a path expression must yield nodes; xs:string
found."
Can anybody tell me what's wrong? The result I was expecting was a list with all the titles in the document.
Okay, problem solved in the BaseX Mailing List :D
I needed to declare the namespace. So now I'm using:
declare namespace v3 ="http://www.loc.gov/mods/v3";
for $doc in collection('ExemploBibtex')
where matches(document-uri($doc), 'quantuminformation.xml')
return $doc/v3:modsCollection/v3:mods/v3:titleInfo/v3:title/text()
And it works.
The problem is here:
collection("ExemploBibtex")/"quantuminformation.xml"/modsCollection
This returns a string with content quantuminformation.xml for each file/root node in the ExemploBibtex collection, and then tries to perform an axis step on each of these strings -- which is not allowed.
It seems you want to access to document quantuminformation.xml within the collection ExemploBibtex. To open a specific file of a collection, use following syntax instead:
collection("ExemploBibtex/quantuminformation.xml")/modsCollection
I cut of the last axis steps for readability and keeping the code lines short; simply add them again, they're fine.

How do I find/use op:except with the multiple xml files?

How do I find/use op:except with the multiple xml files?
I've gotten the nodes from file 1 and file 2, and in the xquery expression I'm tring to find the op:except of those two. When I use op:except, I end up getting an empty set.
XML File 1:
<a>txt</a>
<a>txt2</a>
<a>txt3</a>
XML File 2:
<a>txt2</a>
<a>txt4</a>
<a>txt3</a>
I want output from op:($nodesfromfile1, $nodesfromfile2) to be:
<a>txt</a>
It effectively comes down to the single line behind the return in the following code. You could put that in a function if you like, but it is already very dense, so maybe not worth it..
let $file1 := (
<a>txt</a>,
<a>txt2</a>,
<a>txt3</a>
)
let $file2 := (
<a>txt2</a>,
<a>txt4</a>,
<a>txt3</a>
)
return
$file1[not(. = $file2)]
Note, you also have the 'except' keyword ($file1 except $file2), but that works on node identity which won't work if the nodes comes from different files.
By the way, above code uses string-equality for comparison. If you would prefer to do a comparison on full node-structure, you could also use the deep-equal() function.
HTH!

Resources