XML data extraction where not all parent nodes contain the child node - r

I have an XML data file in which each user has opened an account, and in some cases the account has been terminated. The data omits the termination value when the account has not been terminated, which makes the information difficult to extract.
Here is a reproducible example (where only users 1 and 3 have had their accounts terminated):
library(XML)
my_xml <- xmlParse('<accounts>
<user>
<id>1</id>
<start>2015-01-01</start>
<termination>2015-01-21</termination>
</user>
<user>
<id>2</id>
<start>2015-01-01</start>
</user>
<user>
<id>3</id>
<start>2015-02-01</start>
<termination>2015-04-21</termination>
</user>
<user>
<id>4</id>
<start>2015-03-01</start>
</user>
<user>
<id>5</id>
<start>2015-04-01</start>
</user>
</accounts>')
To create a data.frame I tried using sapply. However, because it does not return NA when a user has no termination value, the columns end up with different lengths and the code produces an error: arguments imply differing number of rows: 5, 2
accounts <- data.frame(id=sapply(my_xml["//user//id"], xmlValue),
start=sapply(my_xml["//user//start"], xmlValue),
termination=sapply(my_xml["//user//termination"], xmlValue)
)
Any suggestions on how to solve this problem ?

I prefer the xml2 package over the XML package because I find its syntax easier to use.
This is a straightforward problem. Find all of the user nodes and then parse out the id and termination nodes from each one. With xml2, xml_find_first() returns one result per input node, and xml_text() gives NA for any user where the node is not found.
library(xml2)
my_xml <- read_xml('<accounts>
<user>
<id>1</id>
<start>2015-01-01</start>
<termination>2015-01-21</termination>
</user>
<user>
<id>2</id>
<start>2015-01-01</start>
</user>
<user>
<id>3</id>
<start>2015-02-01</start>
<termination>2015-04-21</termination>
</user>
<user>
<id>4</id>
<start>2015-03-01</start>
</user>
<user>
<id>5</id>
<start>2015-04-01</start>
</user>
</accounts>')
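# select every <user> node; each one becomes a row
usernodes <- xml_find_all(my_xml, ".//user")
# one lookup per user, so the results always have one entry per user
ids <- xml_text(xml_find_first(usernodes, ".//id"))
# users without a <termination> child get NA here
terms <- xml_text(xml_find_first(usernodes, ".//termination"))
answer <- data.frame(ids, terms)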
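For comparison, the same per-user lookup can be sketched outside R. This is a minimal Python illustration using only the standard library, not part of the original answer; the sample document is a shortened copy of the one in the question, and the function name extract_accounts is made up for the example. The idea is the same as in the xml2 code: query each user node separately, so a missing termination yields an empty value instead of shortening the column.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample data from the question.
XML = """<accounts>
  <user><id>1</id><start>2015-01-01</start><termination>2015-01-21</termination></user>
  <user><id>2</id><start>2015-01-01</start></user>
  <user><id>3</id><start>2015-02-01</start><termination>2015-04-21</termination></user>
</accounts>"""


def extract_accounts(xml_text):
    """Return one dict per <user>; absent children become None."""
    root = ET.fromstring(xml_text)
    rows = []
    for user in root.iter("user"):
        rows.append({
            "id": user.findtext("id"),
            "start": user.findtext("start"),
            # findtext() returns None when the child element does not exist
            "termination": user.findtext("termination"),
        })
    return rows


rows = extract_accounts(XML)
for row in rows:
    print(row)
```

Because every row is built from its own parent node, all columns have the same length by construction.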

I managed to find a solution in the question "XPath in R: return NA if node is missing":
accounts <- data.frame(id=sapply(my_xml["//user//id"], xmlValue),
start=sapply(my_xml["//user//start"], xmlValue),
termination=sapply(xpathApply(my_xml, "//user",
function(x){
if("termination" %in% names(x))
xmlValue(x[["termination"]])
else NA}), function(x) x))

Related

Convert in R a XML with ASCII Entity Names to a basic XML

I have the following XML file:
<?xpacket begin="???" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.4-c006 80.159825, 2016/09/16-03:31:08 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/"
xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
xmlns:pdfx_1_="ns.adobe.org/pdfx/1.3/">
<xmp:CreateDate>2021-05-30T11:17:35+02:00</xmp:CreateDate>
<xmp:CreatorTool>TeX</xmp:CreatorTool>
<xmp:ModifyDate>2021-05-30T12:12:25+02:00</xmp:ModifyDate>
<xmp:MetadataDate>2021-05-30T12:12:25+02:00</xmp:MetadataDate>
<pdfx:PTEX.Fullbanner>This is pdfTeX, Version 3.14159265-2.6-1.40.20 (TeX Live 2019) kpathsea version 6.3.1</pdfx:PTEX.Fullbanner>
<pdf:Producer>pdfTeX-1.40.20</pdf:Producer>
<pdf:Trapped>Unknown</pdf:Trapped>
<pdf:Keywords/>
<dc:format>application/pdf</dc:format>
<xmpMM:DocumentID>uuid:38d0617c-0385-5941-a87d-cc4a1e54bd76</xmpMM:DocumentID>
<xmpMM:InstanceID>uuid:d056c61c-55c6-5f44-8c0e-fe6e911c2ed9</xmpMM:InstanceID>
<pdfx_1_:Dataframe>
&lt;?xml version="1.0"?&gt;
&lt;dataframe name="expData"
xmlns="url"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="url"&gt;
&lt;column name="DATA" type="ratio"&gt;
&lt;value&gt;14&lt;/value&gt;
&lt;value&gt;18&lt;/value&gt;
&lt;value&gt;21&lt;/value&gt;
&lt;value&gt;35&lt;/value&gt;
&lt;value&gt;44&lt;/value&gt;
&lt;value&gt;50&lt;/value&gt;
&lt;value&gt;3&lt;/value&gt;
&lt;value&gt;5&lt;/value&gt;
&lt;value&gt;7&lt;/value&gt;
&lt;/column&gt;
&lt;/dataframe&gt;
</pdfx_1_:Dataframe>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
As you can see, the Dataframe tag of the pdfwe namespace has another XML document inside it. I need to extract this XML and convert it to normal XML with no ASCII entity names, like the following:
<?xml version="1.0"?>
<dataframe name="expData"
xmlns="url"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="url">
<column name="DATA" type="ratio">
<value>14</value>
<value>18</value>
<value>21</value>
<value>35</value>
<value>44</value>
<value>50</value>
<value>3</value>
<value>5</value>
<value>7</value>
</column>
</dataframe>
To extract what's inside pdfwe:dafra I'm using the function xml_find_all(x, ".//pdfwe:dafra") of the xml2 package but I'm not getting the result I want.
To convert the Entity Names I'm using the function xml2::xml_text(xml2::read_xml(paste0("<x>", md, "</x>"))) but I'm not getting the results I want either.
Thanks in advance!
The solution is a multi-step process: extract the database node, convert it to text, clean it up, and then convert it back to XML with the read_xml() function.
library(xml2)
library(magrittr) # provides the %>% pipe used below
page <- read_xml('<?xpacket begin="???" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.4-c006 80.159825, 2016/09/16-03:31:08">
.....') #read in the entire file
xml_ns(page) #show namespaces
#extract the database
db <- xml_find_first(page, ".//pdfx_1_:Dataframe")
#convert to text and strip leading whitespace
dbtext <- xml_text(db) %>% trimws()
#read the text in and convert to xml
xml_db <- read_xml(dbtext)
xml_ns(xml_db) #show namespaces
#extract the requested information from database
#shown here for demonstration purposes
xml_db %>% xml_find_all(".//d1:column") %>% xml_find_all(".//d1:value") %>% xml_text()
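The same extract, trim and re-parse sequence can be sketched in Python with the standard library. This is only an illustration under stated assumptions: the embedded document is assumed to be stored as entity-escaped text (&lt; and &gt;), as the question title suggests; the wrapper element and the function name extract_embedded are made up; and the sample is cut down to two values.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down stand-in for the XMP packet: the inner XML is
# stored as escaped text inside the pdfx_1_:Dataframe element.
OUTER = """<wrapper xmlns:pdfx_1_="ns.adobe.org/pdfx/1.3/">
  <pdfx_1_:Dataframe>
    &lt;?xml version="1.0"?&gt;
    &lt;dataframe name="expData" xmlns="url"&gt;
      &lt;column name="DATA" type="ratio"&gt;
        &lt;value&gt;14&lt;/value&gt;
        &lt;value&gt;18&lt;/value&gt;
      &lt;/column&gt;
    &lt;/dataframe&gt;
  </pdfx_1_:Dataframe>
</wrapper>"""

# Prefix-to-URI map; "d" is an arbitrary prefix chosen for the inner default namespace.
NS = {"pdfx_1_": "ns.adobe.org/pdfx/1.3/", "d": "url"}


def extract_embedded(outer_xml):
    """Find the wrapper node, unescape its text, and parse it as XML."""
    outer = ET.fromstring(outer_xml)
    node = outer.find(".//pdfx_1_:Dataframe", NS)
    # The parser has already turned &lt; / &gt; back into < / >.
    # The XML declaration must be the very first thing in the string,
    # so leading whitespace has to be stripped before re-parsing.
    inner_text = node.text.strip()
    return ET.fromstring(inner_text)


inner = extract_embedded(OUTER)
values = [v.text for v in inner.findall(".//d:value", NS)]
print(values)  # ['14', '18']
```

As in the R version, stripping the leading whitespace is the step that matters, because a parser rejects an XML declaration that is not at the start of the input.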

Map the next node of input XML into previous node of XML schema

I have a source and a destination schema like the following.
Source Schema:
<Root>
<STDS>
<COD>
<NAM>
<AGE>
</STDS>
</Root>
Destination Schema:
<Root>
<Students>
<Code100>
<Name>
<Age>
<Code50>
<Name>
<Age>
</Code50>
</Code100>
</Students>
</Root>
In the source input, STDS is unbounded. Node COD can have three values: 100, 200 and 50. Any STDS node with COD = 50 should be nested inside the closest preceding STDS node with COD = 100.
I have an input like this:
<Root>
<STDS>
<COD>200</COD>
<NAM>ABC</NAM>
<AGE>20</AGE>
</STDS>
<STDS>
<COD>100</COD>
<NAM>XYZ</NAM>
<AGE>21</AGE>
</STDS>
<STDS>
<COD>50</COD>
<NAM>JJJ</NAM>
<AGE>22</AGE>
</STDS>
<STDS>
<COD>200</COD>
<NAM>JKL</NAM>
<AGE>23</AGE>
</STDS>
<STDS>
<COD>100</COD>
<NAM>MMM</NAM>
<AGE>24</AGE>
</STDS>
<STDS>
<COD>50</COD>
<NAM>NNN</NAM>
<AGE>25</AGE>
</STDS>
<STDS>
<COD>50</COD>
<NAM>LLL</NAM>
<AGE>26</AGE>
</STDS>
</Root>
I need an output like the following:
<Root>
<Students>
<Code200>
<Name>ABC</Name>
<Age>20</Age>
</Code200>
<Code100>
<Name>XYZ</Name>
<Age>21</Age>
<Code50>
<Name>JJJ</Name>
<Age>22</Age>
</Code50>
</Code100>
<Code200>
<Name>JKL</Name>
<Age>23</Age>
</Code200>
<Code100>
<Name>MMM</Name>
<Age>24</Age>
<Code50>
<Name>NNN</Name>
<Age>25</Age>
</Code50>
<Code50>
<Name>LLL</Name>
<Age>26</Age>
</Code50>
</Code100>
</Students>
</Root>
I want to achieve this in BizTalk mapper without using custom XSLT.
All you need to do is:
Link <COD> to <Code50>, <Code100>, etc. through an Equal Functoid that tests for 50, 100, etc.
Depending on the composition of the schemas, you may also have to link <STDS> to <CodeXXX> through Looping Functoids.
As with your other questions, which you should also switch back to Functoids, this can be done with Functoids only; you just have to try some combinations.
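This is not a BizTalk map, but the nesting rule from the question can be written down in a few lines of plain Python, which may help when checking what the map should produce. It is a sketch under assumptions: the function name regroup is made up, a Code50 record is attached to the most recent Code100 record, and a Code50 record with no earlier Code100 is left at the top level (the question does not say what should happen in that case).

```python
import xml.etree.ElementTree as ET

# Shortened version of the input from the question.
SOURCE = """<Root>
  <STDS><COD>200</COD><NAM>ABC</NAM><AGE>20</AGE></STDS>
  <STDS><COD>100</COD><NAM>XYZ</NAM><AGE>21</AGE></STDS>
  <STDS><COD>50</COD><NAM>JJJ</NAM><AGE>22</AGE></STDS>
  <STDS><COD>100</COD><NAM>MMM</NAM><AGE>24</AGE></STDS>
  <STDS><COD>50</COD><NAM>NNN</NAM><AGE>25</AGE></STDS>
  <STDS><COD>50</COD><NAM>LLL</NAM><AGE>26</AGE></STDS>
</Root>"""


def regroup(source_xml):
    """Build Root/Students, nesting each Code50 under the last Code100 seen."""
    src = ET.fromstring(source_xml)
    out_root = ET.Element("Root")
    students = ET.SubElement(out_root, "Students")
    last_100 = None  # most recent Code100 output node, if any
    for stds in src.findall("STDS"):
        code = stds.findtext("COD")
        # Assumption: a Code50 with no earlier Code100 stays at the top level.
        if code == "50" and last_100 is not None:
            parent = last_100
        else:
            parent = students
        node = ET.SubElement(parent, "Code" + code)
        ET.SubElement(node, "Name").text = stds.findtext("NAM")
        ET.SubElement(node, "Age").text = stds.findtext("AGE")
        if code == "100":
            last_100 = node
    return out_root


result = regroup(SOURCE)
print(ET.tostring(result, encoding="unicode"))
```

The only state carried between records is the last Code100 node, which is the same "previous record" dependency the map has to express.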

xmlstarlet "does not work" for XMLs with namespaces

I'm using mediainfo to get some XML information about a movie:
mediainfo --Output=XML Krtek\ a\ buldozer-jdvwqZUEbhc.mkv | xmlstarlet format
which outputs:
<?xml version="1.0" encoding="UTF-8"?>
<MediaInfo xmlns="https://mediaarea.net/mediainfo" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://mediaarea.net/mediainfo https://mediaarea.net/mediainfo/mediainfo_2_0.xsd" version="2.0">
<creatingLibrary version="18.03" url="https://mediaarea.net/MediaInfo">MediaInfoLib</creatingLibrary>
<media ref="Krtek a buldozer-jdvwqZUEbhc.mkv">
<track type="General">
<UniqueID>101120522676894244607292274887483611459</UniqueID>
<VideoCount>1</VideoCount>
<AudioCount>1</AudioCount>
<FileExtension>mkv</FileExtension>
<Format>Matroska</Format>
<Format_Version>4</Format_Version>
<FileSize>60132643</FileSize>
<Duration>374.101</Duration>
<OverallBitRate>1285912</OverallBitRate>
<FrameRate>25.000</FrameRate>
<FrameCount>9352</FrameCount>
<IsStreamable>Yes</IsStreamable>
<File_Modified_Date>UTC 2018-10-15 07:09:29</File_Modified_Date>
<File_Modified_Date_Local>2018-10-15 09:09:29</File_Modified_Date_Local>
<Encoded_Application>Lavf57.71.100</Encoded_Application>
<Encoded_Library>Lavf57.71.100</Encoded_Library>
<extra>
<ErrorDetectionType>Per level 1</ErrorDetectionType>
</extra>
</track>
<track type="Video">
<StreamOrder>0</StreamOrder>
<ID>1</ID>
<UniqueID>1</UniqueID>
<Format>AVC</Format>
<Format_Profile>High</Format_Profile>
<Format_Level>4</Format_Level>
<Format_Settings_CABAC>Yes</Format_Settings_CABAC>
<Format_Settings_RefFrames>3</Format_Settings_RefFrames>
<CodecID>V_MPEG4/ISO/AVC</CodecID>
<Duration>374.080000000</Duration>
<Width>1920</Width>
<Height>1080</Height>
<Stored_Height>1088</Stored_Height>
<Sampled_Width>1920</Sampled_Width>
<Sampled_Height>1080</Sampled_Height>
<PixelAspectRatio>1.000</PixelAspectRatio>
<DisplayAspectRatio>1.778</DisplayAspectRatio>
<FrameRate_Mode>CFR</FrameRate_Mode>
<FrameRate_Mode_Original>VFR</FrameRate_Mode_Original>
<FrameRate>25.000</FrameRate>
<FrameCount>9352</FrameCount>
<ColorSpace>YUV</ColorSpace>
<ChromaSubsampling>4:2:0</ChromaSubsampling>
<BitDepth>8</BitDepth>
<ScanType>Progressive</ScanType>
<Delay>0.000</Delay>
<Default>Yes</Default>
<Forced>No</Forced>
<colour_range>Limited</colour_range>
<colour_description_present>Yes</colour_description_present>
<colour_primaries>BT.709</colour_primaries>
<transfer_characteristics>BT.709</transfer_characteristics>
<matrix_coefficients>BT.709</matrix_coefficients>
</track>
<track type="Audio">
<StreamOrder>1</StreamOrder>
<ID>2</ID>
<UniqueID>2</UniqueID>
<Format>Opus</Format>
<CodecID>A_OPUS</CodecID>
<Duration>374.101000000</Duration>
<Channels>2</Channels>
<ChannelPositions>Front: L R</ChannelPositions>
<SamplingRate>48000</SamplingRate>
<SamplingCount>17956848</SamplingCount>
<BitDepth>32</BitDepth>
<Compression_Mode>Lossy</Compression_Mode>
<Delay>0.000</Delay>
<Delay_Source>Container</Delay_Source>
<Language>en</Language>
<Default>Yes</Default>
<Forced>No</Forced>
</track>
</media>
</MediaInfo>
now say that I want to get all IDs:
... | xmlstarlet sel -t -v "//ID"
and nothing is printed. What? Why? Well, it turned out that if I remove all attributes from the tag on the second line, the same selection command works. Now I understand that xmlstarlet (probably) works just fine and I'm just missing some magic flag or syntax that lets it process XML files with defined namespaces. Can someone advise?
You need to declare the namespace with the -N option and use its prefix in the query, like <prefix>:<element>:
... | xmlstarlet sel -N n="https://mediaarea.net/mediainfo" -t -v "//n:ID"
From the help page:
-N <name>=<value>
- predefine namespaces (name without 'xmlns:')
ex: xsql=urn:oracle-xsql
Multiple -N options are allowed.
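The behaviour is not specific to xmlstarlet: in XPath, an unprefixed name such as ID only matches elements in no namespace, so any tool needs a prefix bound to the default namespace URI. A small Python sketch with the standard library shows the same effect; the document is a cut-down, made-up stand-in for the mediainfo output and the prefix n is arbitrary.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the mediainfo output; only the default namespace matters here.
DOC = """<MediaInfo xmlns="https://mediaarea.net/mediainfo" version="2.0">
  <media ref="example.mkv">
    <track type="Video"><ID>1</ID><Format>AVC</Format></track>
    <track type="Audio"><ID>2</ID><Format>Opus</Format></track>
  </media>
</MediaInfo>"""

# Bind an arbitrary prefix to the namespace URI, like -N n=... in xmlstarlet.
NS = {"n": "https://mediaarea.net/mediainfo"}

root = ET.fromstring(DOC)

# Unprefixed name: looks for ID in no namespace and finds nothing.
unqualified = [e.text for e in root.findall(".//ID")]
# Prefixed name: matches ID in the mediainfo namespace.
qualified = [e.text for e in root.findall(".//n:ID", NS)]

print(unqualified)  # []
print(qualified)    # ['1', '2']
```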

Import XML to R data frame

I am trying to import an xml file into R. It is of the format below with an event on each row followed by a number of attributes - which ones depend on the event type. This file is 0.7GB and future versions may be much bigger. I would like to create a data frame with each event on a new row and all the possible attributes in separate columns (meaning some will be empty depending on the event type). I have looked elsewhere for answers but they all seem to be dealing with XML files in a tree structure and I can't work out how to apply them to this format.
I am new to R and have no experience with XML files so please give me the "for dummies" answer with plenty of explanation. Thanks!
<?xml version="1.0" encoding="utf-8"?>
<events version="1.0">
<event time="21510.0" type="actend" person="3" link="1" actType="h" />
<event time="21510.0" type="departure" person="3" link="1" legMode="car" />
<event time="21510.0" type="PersonEntersVehicle" person="3" vehicle="3" />
<event time="21510.0" type="vehicle enters traffic" person="3" link="1" vehicle="3" networkMode="car" relativePosition="1.0" />
...
</events>
You can try something like this:
original_xml <- '<?xml version="1.0" encoding="utf-8"?>
<events version="1.0">
<event time="21510.0" type="actend" person="3" link="1" actType="h" />
<event time="21510.0" type="departure" person="3" link="1" legMode="car" />
<event time="21510.0" type="PersonEntersVehicle" person="3" vehicle="3" />
<event time="21510.0" type="vehicle enters traffic" person="3" link="1" vehicle="3" networkMode="car" relativePosition="1.0" />
</events>'
library(xml2)
data2 <- xml_children(read_xml(original_xml))
attr_names <- unique(names(unlist(xml_attrs(data2))))
xmlDataFrame <- as.data.frame(sapply(attr_names, function (attr) {
xml_attr(data2, attr = attr)
}), stringsAsFactors = FALSE)
#-- since all columns are strings, you may want to turn the numeric columns to numeric
xmlDataFrame[, c("time", "person", "link", "vehicle")] <- sapply(xmlDataFrame[, c("time", "person", "link", "vehicle")], as.numeric)
If you have additional "numeric" columns, you can add them at the end to convert the data to its proper class.
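Since the file is 0.7GB and may grow, reading it as a stream can be worth considering instead of loading the whole tree. The following is a rough Python sketch of that idea using only the standard library, not a drop-in replacement for the R code above; the function name read_events is made up, and the sample is a shortened copy of the data in the question.

```python
import io
import xml.etree.ElementTree as ET

# Shortened sample; a real file path can be passed to read_events() instead.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<events version="1.0">
  <event time="21510.0" type="actend" person="3" link="1" actType="h" />
  <event time="21510.0" type="departure" person="3" link="1" legMode="car" />
  <event time="21510.0" type="PersonEntersVehicle" person="3" vehicle="3" />
</events>"""


def read_events(source):
    """Stream <event> elements and return (column names, list of row dicts)."""
    rows, columns = [], []
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "event":
            row = dict(elem.attrib)  # copy the attributes before clearing
            for name in row:
                if name not in columns:
                    columns.append(name)  # keep first-seen column order
            rows.append(row)
            elem.clear()  # drop the element's contents once they are copied
    # Give every row every column; attributes an event lacks become None.
    table = [{c: r.get(c) for c in columns} for r in rows]
    return columns, table


columns, table = read_events(io.BytesIO(SAMPLE))
print(columns)
for row in table:
    print(row)
```

The rows are still collected in memory here; for very large inputs they could be written out in batches (for example to CSV) inside the loop instead.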

Adding XML element for before element with different value from previous element group

I have an XML file I am reading in the code-behind C# ASP.Net page which lists staff details and is organised according to the team they belong to. See below:
<user>
<firstname>Tony</firstname>
<surname>Smith</surname>
<team>Board A</team>
</user>
<user>
<firstname>Paula</firstname>
<surname>Ram</surname>
<team>Board A</team>
</user>
<user>
<firstname>Linda</firstname>
<surname>Smith</surname>
<team>Board b</team>
</user>
<user>
<firstname>Sam </firstname>
<surname>Peak</surname>
<team>Board b</team>
</user>
What I would like to do is the following:
<group>Board A</group>
<user>
<firstname>Tony</firstname>
<surname>Smith</surname>
<team>Board A</team>
</user>
<user>
<firstname>Paula</firstname>
<surname>Ram</surname>
<team>Board A</team>
</user>
<group>Board B</group>
<user>
<firstname>Linda</firstname>
<surname>Smith</surname>
<team>Board b</team>
</user>
<user>
<firstname>Sam </firstname>
<surname>Peak</surname>
<team>Board b</team>
</user>
So basically, I want to insert a new element containing the name of the team that the user elements following it have in their team element. I'm not sure if I'm making any sense?
Ta.
Momo
With XML it would be somewhat redundant to use the structure of your second example, because you can already group your elements by team with XPath (or LINQ). In XPath:
"//user[team='Board b']"
If you have a System.Xml.XmlNode:
string currentGroup = "Board b"; // get this from some control; the comparison is case-sensitive
XmlNodeList groups = xdoc.SelectNodes(String.Format("//user[team='{0}']", currentGroup));
Or if you have a System.Xml.XPath.XPathNavigator:
string currentGroup = "Board b";
XPathNodeIterator groups = xNav.Select(String.Format("//user[team='{0}']", currentGroup));
Or you can place your "users" in the "group" element:
<group name="Board A">
<user>
<firstname>Tony</firstname>
<surname>Smith</surname>
<!-- <team>Board A</team> -->
</user>
<user>
<firstname>Paula</firstname>
<surname>Ram</surname>
</user>
</group>
And then as an example of how to select them:
string currentGroup = "Board B";
XmlNodeList groups = xdoc.SelectNodes(String.Format("//user[parent::group/@name='{0}']", currentGroup));
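To illustrate the second layout, here is a short sketch of turning the flat list into group elements and then selecting by group name. It is written in Python with the standard library rather than C#, purely as an illustration of the idea; the users and groups wrapper elements and the function name group_by_team are assumptions, since the question does not show the root element.

```python
import xml.etree.ElementTree as ET

# The <users> root is assumed; the question only shows the <user> elements.
FLAT = """<users>
  <user><firstname>Tony</firstname><surname>Smith</surname><team>Board A</team></user>
  <user><firstname>Paula</firstname><surname>Ram</surname><team>Board A</team></user>
  <user><firstname>Linda</firstname><surname>Smith</surname><team>Board b</team></user>
</users>"""


def group_by_team(flat_xml):
    """Wrap users in <group name="..."> elements, one per distinct team."""
    src = ET.fromstring(flat_xml)
    out = ET.Element("groups")  # hypothetical wrapper element
    groups = {}
    for user in src.findall("user"):
        team = user.findtext("team")
        if team not in groups:
            groups[team] = ET.SubElement(out, "group", name=team)
        groups[team].append(user)
    return out


grouped = group_by_team(FLAT)

# Select the users of one group by the group's name attribute.
board_a = grouped.findall("./group[@name='Board A']/user")
print([u.findtext("firstname") for u in board_a])  # ['Tony', 'Paula']
```

Team names are compared exactly, so "Board b" and "Board B" would end up in different groups unless they are normalised first.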

Resources