R parsing plist XML

Sorry, edited with one more little nuance! I had simplified my raw file a little too much in the example I provided, so while your solution works beautifully as-is, what if there are a few extra things thrown into the second line? Those seem to throw off the xml_find_all(page, "//event"), since now it can't find that node. How can I get the script to ignore the extras (or maybe what is the right search term to incorporate them?) Thanks!!!
I'm new to working with xml, and I have some speech xml files that I'm trying to flatten into dataframes in R, but I can't get them to be read using some of the standard functions in the XML package. I think the problem is the plist format, because some of the other answers that I've tried to apply don't work on these files.
My files look as follows (*****second line edited):
<?xml version="1.0" encoding="us-ascii"?>
<event id="111" extraInfo="CivilwarSpeeches" xmlns="someurl">
<meta>
<title>Gettysburg</title>
<date>1863-11-19</date>
<organizations>
<org>Union</org>
</organizations>
<people>
<person id="0" type="President">Honest Abe</person>
</people>
</meta>
<body>
<section name="Address">
<speaker id="0">
<plist>
<p>Four score and seven years ago</p>
</plist>
</speaker>
</section>
</body>
</event>
And I would like to end up with a dataframe that links some of the info in the two sections, something like
Section|Speaker|Speaker Type| Speaker Name|Body
Address|0 |President | Honest Abe |Four score and seven years ago
I found the answer to "Parsing XML file with known structure and repeating elements" fairly helpful, but it still can't seem to unpack my data.
Any help would be appreciated!

I prefer the xml2 package over the XML package.
This is a fairly straightforward problem: read the data in, parse out the desired attributes and nodes, and assemble them into a data frame.
library(xml2)
page<-read_xml('<?xml version="1.0" encoding="us-ascii"?>
<event id="111">
<meta>
<title>Gettysburg</title>
<date>1863-11-19</date>
<organizations>
<org>Union</org>
</organizations>
<people>
<person id="0" type="President">Honest Abe</person>
</people>
</meta>
<body>
<section name="Address">
<speaker id="0">
<plist>
<p>Four score and seven years ago</p>
</plist> </speaker> </section> </body> </event>')
#get the nodes
nodes<-xml_find_all(page, "//event")
#parse the requested information out of each node
Section<- xml_attr(xml_find_first(nodes, ".//section"), "name")
Speaker<- xml_attr(xml_find_first(nodes, ".//person"), "id")
SpeakerType<- xml_attr(xml_find_first(nodes, ".//person"), "type")
SpeakerName<- xml_text(xml_find_first(nodes, ".//person"))
Body<- xml_text(xml_find_first(nodes, ".//plist/p"))
#put together into a data.frame
answer<-data.frame(Section, Speaker, SpeakerType, SpeakerName, Body)
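The code is set up to parse a series of "event" nodes. For clarity, I use five separate steps to extract each requested field and then combine them into the final data frame.
Part of the justification for this is to maintain alignment in case some "event" nodes are missing part of the requested information. This could be simplified, but if your dataset is small, there shouldn't be much of a performance impact.
Regarding the edit: an xmlns attribute on the root element puts every element in a default namespace, which is why "//event" stops matching. One option is to call xml_ns_strip(page) right after read_xml() so the same XPath expressions keep working.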
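For readers who want to see the same flattening logic outside R, here is a minimal Python sketch using only the standard library. It is not the answerer's code: the names SAMPLE and flatten_event are made up for illustration, the sample omits the XML declaration, and it assumes the root carries a default namespace as in the edited question. Unlike the R code above, it takes the Speaker column from the speaker element's id and uses that id to look up the matching person.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the question's file, including the xmlns on the root.
SAMPLE = """<event id="111" extraInfo="CivilwarSpeeches" xmlns="someurl">
<meta>
<title>Gettysburg</title>
<people>
<person id="0" type="President">Honest Abe</person>
</people>
</meta>
<body>
<section name="Address">
<speaker id="0">
<plist>
<p>Four score and seven years ago</p>
</plist>
</speaker>
</section>
</body>
</event>"""


def flatten_event(xml_text):
    """Return one dict per (section, speaker) pair, joined to the person list."""
    root = ET.fromstring(xml_text)
    # Drop the "{namespace}" prefix from every tag so plain names match.
    for element in root.iter():
        element.tag = element.tag.rsplit("}", 1)[-1]
    # Index people by id so speakers can be linked to their metadata.
    people = {person.get("id"): person for person in root.iter("person")}
    rows = []
    for section in root.iter("section"):
        for speaker in section.iter("speaker"):
            person = people.get(speaker.get("id"))
            rows.append({
                "Section": section.get("name"),
                "Speaker": speaker.get("id"),
                "SpeakerType": person.get("type") if person is not None else None,
                "SpeakerName": person.text if person is not None else None,
                "Body": " ".join(p.text or "" for p in speaker.iter("p")),
            })
    return rows


rows = flatten_event(SAMPLE)
```

Missing people are kept as None rather than dropped, which preserves the row alignment the answer mentions.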

Related

Marklogic: what are field range query and path range query cts functions using xquery

I have been following the documentation to understand cts:field-range-query and cts:path-range-query. These are the links I used:
https://docs.marklogic.com/cts:field-range-query
https://docs.marklogic.com/cts:path-range-query
In cts:path-range-query, I didn't understand the output. How do you compare a string with < or >?
cts:search(doc(),cts:path-range-query("/name/fname",">","Jim"),"filtered")
=>
<?xml version="1.0" encoding="UTF-8"?>
<name><fname>John</fname><mname>Rob</mname><lname>Goldings</lname></name>
<?xml version="1.0" encoding="UTF-8"?>
<name><fname>Ooi</fname><mname>Ben</mname><lname>Fu</lname></name>
In cts:field-range-query, I didn't understand the output either.
cts:search(doc(),cts:field-range-query("aname",">","Jim Kurla"));
(:
returns the following:
<?xml version="1.0" encoding="UTF-8"?>
<name>
<fname>John</fname>
<mname>Rob</mname>
<lname>Goldings</lname>
</name>
<?xml version="1.0" encoding="UTF-8"?>
<name>
<fname>Ooi</fname>
<mname>Ben</mname>
<lname>Fu</lname>
</name>
:)
Sorry if this is silly, but I have been trying to understand this for several days and somehow I don't get it. I'd really appreciate the help.
String comparison is based on alphanumeric comparison. It actually depends on the collation, but the default is based on Unicode (UCA Root Collation with case and diacritic sensitivity). A comes before B, but a comes after B, and also alpha comes after Zeta. More confusingly, 10 comes before 2 as well.
In your examples the path query only looks at fname where Jim comes before both John and Ooi.
The second example is likely a field with multiple paths, including fname, mname, and lname. The > is satisfied if any name value in the document is larger than Jim. Goldings, Ben, and Fu come before Jim alphabetically, but other names such as John and Ooi come after it, so that query returns both documents as well.
It is more fun to repeat the queries with Lee. The path query will then return 1 result only (the second), but the field is likely still returning both.
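The comparisons described above can be tried out with a small Python sketch. This is only an analogy: Python compares strings by Unicode code point, which is an assumption made here for illustration and is not necessarily the ordering your MarkLogic collation uses. The helper names path_range_gt and field_range_gt are hypothetical, and the field is assumed to cover fname, mname, and lname.

```python
# The two documents from the question, reduced to plain dictionaries.
DOCS = [
    {"fname": "John", "mname": "Rob", "lname": "Goldings"},
    {"fname": "Ooi", "mname": "Ben", "lname": "Fu"},
]


def path_range_gt(docs, key, bound):
    """Mimic a path range query: compare one specific element against the bound."""
    return [doc for doc in docs if doc[key] > bound]


def field_range_gt(docs, keys, bound):
    """Mimic a multi-path field: match if ANY covered value exceeds the bound."""
    return [doc for doc in docs if any(doc[key] > bound for key in keys)]


NAME_KEYS = ("fname", "mname", "lname")

# Code point ordering: upper case sorts before lower case, digits compare as text.
ordering_examples = ("A" < "B", "a" > "B", "alpha" > "Zeta", "10" < "2")

jim_by_path = path_range_gt(DOCS, "fname", "Jim")       # both documents
lee_by_path = path_range_gt(DOCS, "fname", "Lee")       # only the second document
lee_by_field = field_range_gt(DOCS, NAME_KEYS, "Lee")   # both documents again
```

With "Lee" as the bound, the path version keeps only the document whose fname is Ooi, while the field version still matches the first document because Rob sorts after Lee.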

BI Publisher conditional field masking

I have the following code on a field in a PeopleSoft BI Publisher RTF template, where it masks the bank account number except for the last 4 digits.
<?xdofx:lpad('',length(Bank_Account__)-4,'*')?>
<?xdoxslt:rtrim(xdoxslt:right(Bank_Account__,4))?>
The problem is that sometimes the bank account number is shorter than 4 digits, and when this happens the lpad function raises a negative array error.
Can I wrap some kind of conditional IF statement around this so that it checks the length of the bank account number and, if it is longer than 5 digits, masks the last 4 digits, or else (for bank account numbers shorter than 5 digits) masks just the last 2 digits? What would this look like?
Thanks in advance!
EDIT:
I should add that the existing code above is already wrapped in the following IF statement:
<?if#inlines:Bank_Account__!=''?>
So the entire statement is:
<?if#inlines:Bank_Account__!=''?>
<?xdofx:lpad('',length(Bank_Account__)-4,'*')?>
<?xdoxslt:rtrim(xdoxslt:right(Bank_Account__,4))?>
<?end if?>
I would just like to add in the conditional logic to check the bank account length and subsequently perform either of the above masking.
EDIT 2:
Here is my setup with your suggested changes, but I don't think I have the logic nested right, and the syntax may also be an issue.
Edit 3:
Here is the modified code, and the resulting error message:
The if statements can be nested, but since BIP does not have an else clause, the second if condition has to check for the negative case.
Maybe this might work:
<?if#inlines:Bank_Account__!=''?>
<?if#inlines:string-length(Bank_Account__)>4?>
<?xdofx:lpad('',length(Bank_Account__)-4,'*')?><?xdoxslt:rtrim(xdoxslt:right(Bank_Account__,4))?>
<?end if?>
<?if#inlines:string-length(Bank_Account__)<=4?>
<?xdofx:lpad('','2','*')?><?xdoxslt:rtrim(xdoxslt:right(Bank_Account__,string-length(Bank_Account__)-2))?>
<?end if?>
<?end if?>
Update: Here is a screenshot of what I got:
Here is the xml snippet I used.
<?xml version="1.0"?>
<root>
<record>
<Bank_Account__>123456</Bank_Account__>
</record>
<record>
<Bank_Account__>12345</Bank_Account__>
</record>
<record>
<Bank_Account__>1234</Bank_Account__>
</record>
<record>
<Bank_Account__>123</Bank_Account__>
</record>
<record>
<Bank_Account__>12</Bank_Account__>
</record>
</root>
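To make the branching in the template easier to follow, here is a rough Python sketch of the same logic applied to the sample values above. It is an illustration only, not BI Publisher syntax: the function name mask_account is hypothetical, and the guard against a negative length for accounts shorter than 2 characters is an added assumption that the template itself does not contain.

```python
def mask_account(account: str) -> str:
    """Mirror the nested-if masking: two complementary branches, no else."""
    if account == "":
        # Outer if: empty values are left alone.
        return ""
    length = len(account)
    if length > 4:
        # First branch: pad with '*' and keep the last 4 characters.
        return "*" * (length - 4) + account[-4:].rstrip()
    # Second branch (length <= 4): two '*' plus the remaining right-hand characters.
    keep = max(length - 2, 0)  # assumption: never ask for a negative count
    tail = account[-keep:] if keep else ""
    return "**" + tail.rstrip()


SAMPLE_ACCOUNTS = ["123456", "12345", "1234", "123", "12"]
masked = [mask_account(value) for value in SAMPLE_ACCOUNTS]
```

The two inner conditions are written as complements of each other (length > 4 and length <= 4), which is how the missing else clause is worked around.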
Download working files from here
There are some more functions available for other ways to implement this requirement.

How to dynamically fetch value using cts:search in MarkLogic?

My database has "n" documents, and I need to search for documents dynamically using the element name and value I provide. I explain it below.
Sample documents in my database:
document1-
<root>
<id1>12345</id1>
<value>Country</value>
<node1>somevalue</node1>
<node2>somevalue</node2>
<node3>somevalue</node3>
<node4>somevalue</node4>
.......................
</root>
document2-
<root>
<id2>34567</id2>
<value>Fruits</value>
<node1>somevalue</node1>
<node2>somevalue</node2>
<node3>somevalue</node3>
<node4>somevalue</node4>
.......................
</root>
I need to pass the input parameters to a REST endpoint to perform my operation, and the input XML document for the REST call is as follows:
INPUT XML-
<root>
<id>id1</id>
<idvalue>12345</idvalue>
.......................
</root>
The output I need is shown in this example:
Example: search the database for all documents that have the element id1 with the value 12345.
Any suggestions?
You can explore Query By Example (QBE) of MarkLogic. For more details go to URL https://docs.marklogic.com/guide/search-dev/qbe
XPath can extract the input values for constructing a cts.elementValueQuery().
Something similar to the following should work in SJS:
cts.search(cts.elementValueQuery(
xs.QName(fn.string(input.xpath('/root/id'))),
fn.string(input.xpath('/root/idvalue'))
))
Or similar to the following in XQuery:
cts:search(fn:collection(), cts:element-value-query(
xs:QName(fn:string($input/root/id)),
fn:string($input/root/idvalue)
))
For more information, see http://docs.marklogic.com/cts.elementValueQuery
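The idea of reading the element name and value from the input document and building the query from them can be sketched in Python as follows. This is an in-memory analogy using the standard library, not MarkLogic code; the function name matching_documents and the trimmed sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed versions of the question's sample documents and input.
DOCUMENT1 = "<root><id1>12345</id1><value>Country</value></root>"
DOCUMENT2 = "<root><id2>34567</id2><value>Fruits</value></root>"
INPUT_XML = "<root><id>id1</id><idvalue>12345</idvalue></root>"


def matching_documents(documents, input_xml):
    """Return the documents containing the element/value pair named in the input."""
    request = ET.fromstring(input_xml)
    element_name = request.findtext("id")    # e.g. "id1"
    wanted_value = request.findtext("idvalue")  # e.g. "12345"
    hits = []
    for document in documents:
        root = ET.fromstring(document)
        # Same idea as an element value query: any element with that name and value.
        if any(el.text == wanted_value for el in root.iter(element_name)):
            hits.append(document)
    return hits


hits = matching_documents([DOCUMENT1, DOCUMENT2], INPUT_XML)
```

In MarkLogic the lookup would be resolved from indexes rather than by scanning every document, so this only illustrates how the query is assembled from the input.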
Hope that helps.

Entity replacement not working in a KML/XML file, how do I use this data?

Basically, I want to put information into a balloon in the Maps API. This is the KML file; the data is stored using SimpleData tags, and I am trying to access it from the BalloonStyle text tag.
But it doesn't work: the balloon simply displays $[something]. After some research, I discovered that entity replacement may no longer be supported for SimpleData tags.
So how do I handle the data? I got it from an ogr2ogr conversion of a shapefile, and I don't know how to adjust its output to make it use ExtendedData and Data tags.
Thanks for your help.
You can replace the <SchemaData><SimpleData> elements with <Data><value> elements using a text editor, preferably one that supports regular-expression search and replace, such as Notepad++.
You start with this:
<ExtendedData>
<SchemaData schemaUrl="#biblioteche">
<SimpleData name="INDIRIZZO">VIA SAN VITTORE, 21</SimpleData>
<SimpleData name="TIPOLOGIA">BIBLIOTECHE</SimpleData>
...
<SimpleData name="ID">0</SimpleData>
</SchemaData>
</ExtendedData>
And need to convert to this form:
<ExtendedData>
<Data name="INDIRIZZO">
<value>VIA SAN VITTORE, 21</value>
</Data>
<Data name="TIPOLOGIA">
<value>BIBLIOTECHE</value>
</Data>
...
<Data name="ID">
<value>0</value>
</Data>
</ExtendedData>
Globally make the following replacements (in this order):
# |Find what |Replace with
1.|<SchemaData schemaUrl="#biblioteche"> |(empty)
2.|</SchemaData> |(empty)
3.|<SimpleData |<Data
4.|(<Data name=".*?">) |\1<value>
5.|</SimpleData> |</value></Data>
Steps 1 and 2 have an empty replacement, so those tags are simply deleted.
Step 4 is the only step that needs to be done as a regular expression.
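If you would rather script the replacements than run them by hand in an editor, the five steps can be sketched in Python as below. The function name simpledata_to_data is hypothetical, and step 1 is generalized to any schemaUrl value, which is an assumption beyond the literal find string given above.

```python
import re

SAMPLE_KML = """<ExtendedData>
<SchemaData schemaUrl="#biblioteche">
<SimpleData name="INDIRIZZO">VIA SAN VITTORE, 21</SimpleData>
<SimpleData name="ID">0</SimpleData>
</SchemaData>
</ExtendedData>"""


def simpledata_to_data(kml: str) -> str:
    """Apply the five replacements, in order, to a KML string."""
    # Steps 1 and 2: delete the SchemaData wrapper tags (any schemaUrl, by assumption).
    kml = re.sub(r'<SchemaData schemaUrl="[^"]*">', "", kml)
    kml = kml.replace("</SchemaData>", "")
    # Step 3: rename the opening tag; closing tags are untouched because of the '/'.
    kml = kml.replace("<SimpleData", "<Data")
    # Step 4: the only true regular expression, inserting <value> after each opening tag.
    kml = re.sub(r'(<Data name=".*?">)', r"\1<value>", kml)
    # Step 5: close the value and rename the closing tag.
    kml = kml.replace("</SimpleData>", "</value></Data>")
    return kml


converted = simpledata_to_data(SAMPLE_KML)
```

The result puts each <Data> element on a single line and leaves blank lines where the SchemaData tags were; both are harmless in XML.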
working example

What is cstyle in XSLT?

My XSLT is shown below.
aic is a namespace.
What is cstyle?
is it a built-in XSLT element/function?
Or an element within the expected input xml?
<xsl:stylesheet exclude-result-prefixes="aic"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:aic="http://ns.adobe.com/AdobeInCopy/2.0/" >
<xsl:template match="/">
</xsl:template>
<xsl:template match="aic:cstyle[contains(@name,'bold')]">
</xsl:template>
</xsl:stylesheet>
It is an element within the expected input XML. The XPaths in an XSLT's match attributes are generally applied to contents from the input XML.
Exactly as in my answer to your previous question, aic:cstyle is a selector that matches elements whose local name is cstyle and whose namespace URI is http://ns.adobe.com/AdobeInCopy/2.0/ (the URI bound to the aic prefix in the xsl:stylesheet element). Thus
<xsl:template match="aic:cstyle[contains(@name,'bold')]">
is a template that will apply to any {http://ns.adobe.com/AdobeInCopy/2.0/}cstyle element that has a name attribute that contains the substring bold. (So, to answer your question directly: the expression in question will match elements in the input streams for which the stylesheet was written.)
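To see what such a match pattern selects, here is a small Python sketch that picks out the same elements from a made-up input document. The sample XML and the function name bold_cstyles are hypothetical. The sample deliberately binds the namespace to the prefix x instead of aic, to show that the namespace URI matters and the prefix does not.

```python
import xml.etree.ElementTree as ET

AIC_NS = "http://ns.adobe.com/AdobeInCopy/2.0/"

# Hypothetical input: two namespaced cstyle elements and one without a namespace.
SAMPLE_INPUT = (
    '<story xmlns:x="http://ns.adobe.com/AdobeInCopy/2.0/">'
    '<x:cstyle name="bold italic">A</x:cstyle>'
    '<x:cstyle name="plain">B</x:cstyle>'
    '<cstyle name="bold">C</cstyle>'
    "</story>"
)


def bold_cstyles(xml_text):
    """Select {AIC namespace}cstyle elements whose name attribute contains 'bold'."""
    root = ET.fromstring(xml_text)
    qualified_name = f"{{{AIC_NS}}}cstyle"  # ElementTree spells it {uri}localname
    return [
        element
        for element in root.iter(qualified_name)
        if "bold" in element.get("name", "")
    ]


selected_texts = [element.text for element in bold_cstyles(SAMPLE_INPUT)]
```

Only the first element is selected: the second has no "bold" in its name attribute, and the third is not in the AdobeInCopy namespace.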
As with any new programming language, I would strongly recommend that you find a decent tutorial and work through that to get comfortable with the syntax and idioms of the language through simple examples before you start trying to decode a large and complex XSLT that you've inherited from elsewhere.
