Import XML to R data frame - r

I am trying to import an xml file into R. It is of the format below with an event on each row followed by a number of attributes - which ones depend on the event type. This file is 0.7GB and future versions may be much bigger. I would like to create a data frame with each event on a new row and all the possible attributes in separate columns (meaning some will be empty depending on the event type). I have looked elsewhere for answers but they all seem to be dealing with XML files in a tree structure and I can't work out how to apply them to this format.
I am new to R and have no experience with XML files so please give me the "for dummies" answer with plenty of explanation. Thanks!
<?xml version="1.0" encoding="utf-8"?>
<events version="1.0">
<event time="21510.0" type="actend" person="3" link="1" actType="h" />
<event time="21510.0" type="departure" person="3" link="1" legMode="car" />
<event time="21510.0" type="PersonEntersVehicle" person="3" vehicle="3" />
<event time="21510.0" type="vehicle enters traffic" person="3" link="1" vehicle="3" networkMode="car" relativePosition="1.0" />
...
</events>

You can try something like this:
original_xml <- '<?xml version="1.0" encoding="utf-8"?>
<events version="1.0">
<event time="21510.0" type="actend" person="3" link="1" actType="h" />
<event time="21510.0" type="departure" person="3" link="1" legMode="car" />
<event time="21510.0" type="PersonEntersVehicle" person="3" vehicle="3" />
<event time="21510.0" type="vehicle enters traffic" person="3" link="1" vehicle="3" networkMode="car" relativePosition="1.0" />
</events>'
library(xml2)
data2 <- xml_children(read_xml(original_xml))
attr_names <- unique(names(unlist(xml_attrs(data2))))
xmlDataFrame <- as.data.frame(sapply(attr_names, function (attr) {
xml_attr(data2, attr = attr)
}), stringsAsFactors = FALSE)
#-- since all columns are strings, you may want to turn the numeric columns to numeric
xmlDataFrame[, c("time", "person", "link", "vehicle")] <- sapply(xmlDataFrame[, c("time", "person", "link", "vehicle")], as.numeric)
If you have additional "numeric" columns, you can add them at the end to convert the data to its proper class.
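As a variation on the answer above, the numeric columns do not have to be listed by hand. The sketch below assumes the same `<event>` layout as in the question and lets base R's `type.convert()` guess each column's class; the names `events`, `cols` and `events_df` are only illustrative. It still loads the whole document into memory, so it may not be enough on its own for very large files.

```r
library(xml2)

# Sample input copied from the question (two events are enough to show the idea)
original_xml <- '<?xml version="1.0" encoding="utf-8"?>
<events version="1.0">
<event time="21510.0" type="actend" person="3" link="1" actType="h" />
<event time="21510.0" type="PersonEntersVehicle" person="3" vehicle="3" />
</events>'

events <- xml_find_all(read_xml(original_xml), "//event")
attrs  <- xml_attrs(events)                      # one named character vector per event
cols   <- unique(unlist(lapply(attrs, names)))   # every attribute name seen anywhere

# Indexing by name yields NA for attributes an event does not have
rows <- lapply(attrs, function(a) unname(a[cols]))
events_df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(events_df) <- cols

# Let R guess each column's type; as.is = TRUE keeps text columns as character
events_df[] <- lapply(events_df, type.convert, as.is = TRUE)
str(events_df)
```

Columns that only some event types carry (here `actType` and `vehicle`) end up as `NA` on the rows where the attribute is absent.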

Related

WSO2 : Transforming response xml

I would like to turn this xml response into something more easily readable.
<?xml version="1.0" encoding="ISO-8859-1"?><soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<SOAP-ENV:Header xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"/>
<soap:Body>
<executeResponse xmlns="urn:GCE">
<BusinessViewServiceexecuteOut xmlns="http://www.generix.fr/technicalframework/businesscomponent/applicationmodule/common" xmlns:ns2="http://www.generixgroup.com/processus/configuration/scheduler" xmlns:ns3="http://www.generix.fr/technicalframework/business/service/common">
<xmlpres><?xml version = '1.0' encoding = 'UTF-8'?> <VueTable type="View" name="Table" habctr="true" total_business_row="2" nbline="400" confNbline="400" numpage="1" nbpage="1">
<JTblView name="JTblView" type="ViewObject" maxfetchsize="999" maxfetchsizeexceeded="false">
<JTblViewRow current="true" type="ViewRow" index="1" business_row_index="1">
<Cletbl precision="6" type="VARCHAR" pk="true">
<business_data>N</business_data>
</Cletbl>
<Codtbl precision="6" type="VARCHAR" pk="true">
<business_data>001</business_data>
</Codtbl>
<Lib1 precision="30" type="VARCHAR">
<business_data>Non</business_data>
</Lib1>
<Lib2 precision="30" type="VARCHAR">
<business_data/>
</Lib2>
<Lir precision="10" type="VARCHAR">
<business_data>Non</business_data>
</Lir>
</JTblViewRow>
<JTblViewRow type="ViewRow" index="2" business_row_index="2">
<Cletbl precision="6" type="VARCHAR" pk="true">
<business_data>O</business_data>
</Cletbl>
<Codtbl precision="6" type="VARCHAR" pk="true">
<business_data>001</business_data>
</Codtbl>
<Lib1 precision="30" type="VARCHAR">
<business_data>Oui</business_data>
</Lib1>
<Lib2 precision="30" type="VARCHAR">
<business_data/>
</Lib2>
<Lir precision="10" type="VARCHAR">
<business_data>Oui</business_data>
</Lir>
</JTblViewRow>
</JTblView>
</VueTable></xmlpres>
</BusinessViewServiceexecuteOut>
</executeResponse>
</soap:Body></soap:Envelope>
If I could at least extract the content of "xmlpres", I would ideally turn it into something like this:
<table><row><code></code><libelle></libelle></row></table>
I would then convert that into a JSON response, but I can't see how. I only get the entire output, either as XML or as a JSON stream containing everything, which is not usable.
Create an out-mediation sequence with the following content and attach it to the respective API and try out the scenario. This is to extract the xmlpres content and send that as the response to the client
<?xml version="1.0" encoding="UTF-8"?>
<sequence xmlns="http://ws.apache.org/ns/synapse" name="out-sequence">
<!-- extract the xmlpres content and store as OM element -->
<property name="XMLBody"
expression="$body//soap:Body//generic:xmlpres"
xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:gce="urn:GCE"
xmlns:generic="http://www.generix.fr/technicalframework/businesscomponent/applicationmodule/common" type="OM" />
<!-- pass the extracted property as response body -->
<enrich>
<source type="property" property="XMLBody" />
<target type="body" />
</enrich>
</sequence>
Hope this helps you to extract and send the response accordingly.

Convert in R a XML with ASCII Entity Names to a basic XML

I have the following XML file:
<?xpacket begin="???" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.4-c006 80.159825, 2016/09/16-03:31:08 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/"
xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
xmlns:pdfx_1_="ns.adobe.org/pdfx/1.3/">
<xmp:CreateDate>2021-05-30T11:17:35+02:00</xmp:CreateDate>
<xmp:CreatorTool>TeX</xmp:CreatorTool>
<xmp:ModifyDate>2021-05-30T12:12:25+02:00</xmp:ModifyDate>
<xmp:MetadataDate>2021-05-30T12:12:25+02:00</xmp:MetadataDate>
<pdfx:PTEX.Fullbanner>This is pdfTeX, Version 3.14159265-2.6-1.40.20 (TeX Live 2019) kpathsea version 6.3.1</pdfx:PTEX.Fullbanner>
<pdf:Producer>pdfTeX-1.40.20</pdf:Producer>
<pdf:Trapped>Unknown</pdf:Trapped>
<pdf:Keywords/>
<dc:format>application/pdf</dc:format>
<xmpMM:DocumentID>uuid:38d0617c-0385-5941-a87d-cc4a1e54bd76</xmpMM:DocumentID>
<xmpMM:InstanceID>uuid:d056c61c-55c6-5f44-8c0e-fe6e911c2ed9</xmpMM:InstanceID>
<pdfx_1_:Dataframe>
<?xml version="1.0"?>
<dataframe name="expData"
xmlns="url"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="url">
<column name="DATA" type="ratio">
<value>14</value>
<value>18</value>
<value>21</value>
<value>35</value>
<value>44</value>
<value>50</value>
<value>3</value>
<value>5</value>
<value>7</value>
</column>
</dataframe>
</pdfx_1_:Dataframe>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
As you can see, the tag Dataframe of the namespace pdfwe has another XML document inside it. I need to extract this XML and convert it to a normal XML with no ASCII Entity Names like the following:
<?xml version="1.0"?>
<dataframe name="expData"
xmlns="url"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="url">
<column name="DATA" type="ratio">
<value>14</value>
<value>18</value>
<value>21</value>
<value>35</value>
<value>44</value>
<value>50</value>
<value>3</value>
<value>5</value>
<value>7</value>
</column>
</dataframe>
To extract what's inside pdfwe:dafra I'm using the function xml_find_all(x, ".//pdfwe:dafra") of the xml2 package but I'm not getting the result I want.
To convert the Entity Names I'm using the function xml2::xml_text(xml2::read_xml(paste0("<x>", md, "</x>"))) but I'm not getting the results I want either.
Thanks in advance!
The solution is a multi-step process: extract the Dataframe node, convert it to text, clean it up, and then convert it back to XML with the read_xml() function.
library(xml2)
library(magrittr) #provides the %>% pipe used below
page <- read_xml('<?xpacket begin="???" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.4-c006 80.159825, 2016/09/16-03:31:08">
.....') #read in the entire file
xml_ns(page) #show namespaces
#extract the Dataframe node
db <- xml_find_first(page, ".//pdfx_1_:Dataframe")
#convert to text (this decodes the entities) and strip surrounding whitespace
dbtext <- xml_text(db) %>% trimws()
#read the text in and convert to xml
xml_db <- read_xml(dbtext)
xml_ns(xml_db) #show namespaces
#extract the requested information from database
#shown here for demonstration purposes
xml_db %>% xml_find_all(".//d1:column") %>% xml_find_all(".//d1:value") %>% xml_text()
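The same steps can be tried on a small self-contained input. This is only a sketch: the wrapper document, the `p` prefix and the `http://example.com/...` namespace URIs are made up for illustration, and the inner XML is entity-escaped by hand to mimic the situation described in the question.

```r
library(xml2)

# Outer document whose <p:Dataframe> node holds entity-escaped XML as plain text
outer <- read_xml('<wrapper xmlns:p="http://example.com/p">
  <p:Dataframe>
    &lt;dataframe name="expData" xmlns="http://example.com/df"&gt;
      &lt;column name="DATA"&gt;&lt;value&gt;14&lt;/value&gt;&lt;value&gt;18&lt;/value&gt;&lt;/column&gt;
    &lt;/dataframe&gt;
  </p:Dataframe>
</wrapper>')

# Step 1: locate the node that carries the embedded document
node <- xml_find_first(outer, ".//p:Dataframe")

# Step 2: xml_text() decodes &lt; and &gt;; trimws() drops the surrounding whitespace
inner_text <- trimws(xml_text(node))

# Step 3: parse the decoded text as a document of its own
inner <- read_xml(inner_text)

# The inner default namespace is exposed by xml2 under the prefix d1
values <- as.numeric(xml_text(xml_find_all(inner, ".//d1:value")))
values
```

`write_xml(inner, "some_file.xml")` would then save the extracted document if a file is needed rather than an R object.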

Parse HTML/XML characters in R

I am reading in the following XML as a text file in R:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE score-partwise PUBLIC
"-//Recordare//DTD MusicXML 3.0 Partwise//EN"
"http://www.musicxml.org/dtds/partwise.dtd">
<score-partwise version="3.0">
<part-list>
<score-part id="P1">
<part-name>Music</part-name>
</score-part>
</part-list>
<part id="P1">
<measure number="1">
<attributes>
<divisions>1</divisions>
<key>
<fifths>0</fifths>
</key>
<time>
<beats>4</beats>
<beat-type>4</beat-type>
</time>
<clef>
<sign>G</sign>
<line>2</line>
</clef>
</attributes>
<note>
<pitch>
<step>C</step>
<octave>4</octave>
</pitch>
<duration>4</duration>
<type>whole</type>
</note>
</measure>
</part>
</score-partwise>
R:
library(readtext)
xml <- readtext("musicxml.txt")$text
I am then trying to render this in Javascript via Shiny by feeding my XML text to a Javascript function. NB: Working outside of R.
shiny::tags$script(paste0('var osmd = new opensheetmusicdisplay.OpenSheetMusicDisplay(\"sheet-music\", {drawingParameters: "compact",
drawPartNames: false, drawMeasureNumbers: false, drawMetronomeMarks: false, drawTitle: false});
var loadPromise = osmd.load(\'',xml,'\');
loadPromise.then(function(){
osmd.render();
});
'))
However, when I concatenate the XML string above, it does not work because characters are escaped, e.g., one line becomes:
&lt;note&gt;
I tried using the unescape_xml function here (with and without the tags removed), but this does not solve the problem. It leaves me with:
"Music1044G2C44whole"
So how can I end up with a concatenated string with none of the escaped characters? It must just be a string and not another R object.
You need to wrap the contents of the tag call with shiny::HTML to ensure it is passed unescaped:
shiny::tags$script(shiny::HTML(paste0(
'var osmd = new opensheetmusicdisplay.OpenSheetMusicDisplay(\"sheet-music\",
{ drawingParameters: "compact",
drawPartNames: false,
drawMeasureNumbers: false,
drawMetronomeMarks: false,
drawTitle: false});
var loadPromise = osmd.load(\'',xml,'\');
loadPromise.then(function(){ osmd.render() });')))
Which gives you:
<script>var osmd = new opensheetmusicdisplay.OpenSheetMusicDisplay("sheet-music",
{ drawingParameters: "compact",
drawPartNames: false,
drawMeasureNumbers: false,
drawMetronomeMarks: false,
drawTitle: false});
var loadPromise = osmd.load('<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE score-partwise PUBLIC
"-//Recordare//DTD MusicXML 3.0 Partwise//EN"
"http://www.musicxml.org/dtds/partwise.dtd">
<score-partwise version="3.0">
<part-list>
<score-part id="P1">
<part-name>Music</part-name>
</score-part>
</part-list>
<part id="P1">
<measure number="1">
<attributes>
<divisions>1</divisions>
<key>
<fifths>0</fifths>
</key>
<time>
<beats>4</beats>
<beat-type>4</beat-type>
</time>
<clef>
<sign>G</sign>
<line>2</line>
</clef>
</attributes>
<note>
<pitch>
<step>C</step>
<octave>4</octave>
</pitch>
<duration>4</duration>
<type>whole</type>
</note>
</measure>
</part>
</score-partwise>');
loadPromise.then(function(){ osmd.render() });</script>
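One thing to watch in the output above: a JavaScript string written with single quotes cannot contain raw line breaks, and any single quote inside the XML would end the string early. A possible way around this, sketched here as an assumption rather than as part of the answer above, is to let the jsonlite package produce the string literal, since a JSON string is also a valid JavaScript string. The shortened `xml` value and the name `js_literal` are illustrative only.

```r
library(jsonlite)

# Shortened stand-in for the MusicXML text read from the file
xml <- "<?xml version=\"1.0\"?>\n<note>\n  <step>C</step>\n  <lyric>it's</lyric>\n</note>"

# toJSON() quotes the string and escapes newlines and double quotes
js_literal <- toJSON(xml, auto_unbox = TRUE)

# Build the JavaScript; the literal already carries its own quotes
script_body <- paste0("var loadPromise = osmd.load(", js_literal, ");")
cat(script_body, "\n")

# In the app this would still be wrapped as in the answer above:
# shiny::tags$script(shiny::HTML(script_body))
```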

how do you convert xml file into data frame in R

I am trying to parse this xml and place it on data frame form:
file content looks like this:
<?xml version="1.0" encoding="utf-8" ?>
<dashboardreport name="Incident_Rules" version="7.2.5.1022" reportdate="2019-02-20T14:45:57.352-05:00" description="">
<source name="app1">
<filters summary="last 30 minutes (auto)">
<filter>tf:DiagnoseTimeframe?1550690157352:1550691957352</filter>
</filters>
</source>
<reportheader>
<reportdetails>
<user>user1</user>
</reportdetails>
</reportheader>
<data>
<incidentchartdashlet name="Incident Chart" description="">
<incidentchartrecords structuretype="tree">
<incidentchartrecord rule="Database Exception" systemprofile="app1" />
<incidentchartrecord rule="Response time greater than 30 minutes" systemprofile="app1" />
<incidentchartrecord rule="JVM Heap Utilization > 90%" systemprofile="app1" />
</incidentchartrecords>
</incidentchartdashlet>
</data>
</dashboardreport>
The data frame needs to be like this:
Source Name Rule
App1 Database Exception
App1 Response time greater than 30 minutes
App1 JVM Heap Utilization > 90%
Need to extract "Source name" and "incidentchartrecord rule". I have tried something like this:
library("XML")
doc <- read_xml(file)
dat<-xml_find_all(doc, ".//incidentchartrecord") %>%
map_df(function(x) {
xml_find_all(x, ".//incidentchartrecord") %>%
map_df(~as.list(xml_attrs(.))) %>%
select(rule) %>%
mutate(node=xml_attr(x, "incidentchartrecord"))
})
Any ideas?
Here's an approach that works. I used xml2, instead; that's where the xml_find_all & xml_attr functions are found.
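library(xml2)
doc <- read_xml("test.xml")
# name attribute of the <source> element
source_name <- xml_attr(xml_find_all(doc, ".//source"), "name")
# rule attribute of every <incidentchartrecord>
rules <- xml_attr(xml_find_all(doc, ".//incidentchartrecord"), "rule")
# a single source name is recycled across all rule rows
df <- data.frame(Source.Name = source_name, Rule = rules, stringsAsFactors = FALSE)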
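If a report could ever contain more than one source, recycling a single name would no longer be correct. A hedged variation, assuming that the `systemprofile` attribute of each record always matches the source name (as it does in the sample), reads both columns from the records themselves. The inline document below is a shortened copy of the sample.

```r
library(xml2)

# Shortened copy of the report from the question
doc <- read_xml('<dashboardreport name="Incident_Rules">
  <source name="app1"/>
  <data>
    <incidentchartrecord rule="Database Exception" systemprofile="app1"/>
    <incidentchartrecord rule="JVM Heap Utilization &gt; 90%" systemprofile="app1"/>
  </data>
</dashboardreport>')

records <- xml_find_all(doc, ".//incidentchartrecord")

# Both columns come from the same node set, so they always have equal length
df <- data.frame(
  Source.Name = xml_attr(records, "systemprofile"),
  Rule        = xml_attr(records, "rule"),
  stringsAsFactors = FALSE
)
df
```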

XML to list and back to XML

Situation: "Software" to R and back to "Software". The only interface for "Software" is XML.
In R, I need to make a few changes in the file, so I convert it to a list and edit it.
library(XML)
myFile = xmlParse("myXML")
xml_data <- xmlToList(myFile)
xml_data$timetable$train$.attrs[6] = "HelloNewWorld"
Now I need to convert this list "xml_data" back to XML.
I found some functions like this:
listToXml <- function(item, tag) {
# just a textnode, or empty node with attributes
if(typeof(item) != 'list') {
if (length(item) > 1) {
xml <- xmlNode(tag)
for (name in names(item)) {
xmlAttrs(xml)[[name]] <- item[[name]]
}
return(xml)
} else {
return(xmlNode(tag, item))
}
}
# create the node
if (identical(names(item), c("text", ".attrs"))) {
# special case a node with text and attributes
xml <- xmlNode(tag, item[['text']])
} else {
# node with child nodes
xml <- xmlNode(tag)
for(i in 1:length(item)) {
if (names(item)[i] != ".attrs") {
xml <- append.xmlNode(xml, listToXml(item[[i]], names(item)[i]))
}
}
}
# add attributes to node
attrs <- item[['.attrs']]
for (name in names(attrs)) {
xmlAttrs(xml)[[name]] <- attrs[[name]]
}
return(xml)
}
But this doesn't work...
Any help or hints appreciated!
Thanks!
In the linked picture you can see the current xml-file. Highlighted in yellow the values that I need to change.
Link:
https://i.stack.imgur.com/remzj.png
Consider XSLT, the special-purpose language designed to transform XML files. There is no need to rebuild the entire tree in R. With R's xslt package (available on CRAN), an extension of xml2, you can transform an input source and write the output to screen or file.
Using the identity transform to copy the document as is, the XSLT below then rewrites one attribute of the <train> tag, @source, similar to your code attempt above, which targets the sixth attribute.
XML (sample input from the railML wiki page)
<?xml version="1.0" encoding="UTF-8"?>
<railml xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance" xsi:noNamespaceSchemaLocation="timetable.xsd">
<timetable version="1.1">
<train trainID="RX 100.2" type="planned" source="opentrack">
<timetableentries>
<entry posID="ZU" departure="06:08:00" type="begin"/>
<entry posID="ZWI" departure="06:10:30" type="pass"/>
<entry posID="ZOER" arrival="06:16:00" departure="06:17:00" minStopTime="9" type="stop"/>
<entry posID="WS" departure="06:21:00" type="pass"/>
<entry posID="DUE" departure="06:23:00" type="pass"/>
<entry posID="SCW" departure="06:27:00" type="pass"/>
<entry posID="NAE" departure="06:29:00" type="pass"/>
<entry posID="UST" arrival="06:34:30" type="stop"/>
</timetableentries>
</train>
</timetable>
</railml>
XSLT (save as .xsl file, rewrites the @source attribute)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="@source">
<xsl:attribute name="source">HelloNewWorld</xsl:attribute>
</xsl:template>
</xsl:stylesheet>
R
library(xml2)
library(xslt)
# read_xml() takes the file path directly
doc <- read_xml("/path/to/Input.xml")
style <- read_xml("/path/to/XLSTScript.xsl")
new_xml <- xml_xslt(doc, style)
# OUTPUT TO SCREEN
cat(as.character(new_xml))
# OUTPUT TO FILE
write_xml(new_xml, "/path/to/Output.xml")
Output
<?xml version="1.0" encoding="UTF-8"?>
<railml xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance" xsi:noNamespaceSchemaLocation="timetable.xsd">
<timetable version="1.1">
<train trainID="RX 100.2" type="planned" source="HelloNewWorld">
<timetableentries>
<entry posID="ZU" departure="06:08:00" type="begin"/>
<entry posID="ZWI" departure="06:10:30" type="pass"/>
<entry posID="ZOER" arrival="06:16:00" departure="06:17:00" minStopTime="9" type="stop"/>
<entry posID="WS" departure="06:21:00" type="pass"/>
<entry posID="DUE" departure="06:23:00" type="pass"/>
<entry posID="SCW" departure="06:27:00" type="pass"/>
<entry posID="NAE" departure="06:29:00" type="pass"/>
<entry posID="UST" arrival="06:34:30" type="stop"/>
</timetableentries>
</train>
</timetable>
</railml>
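For a change as small as a single attribute, a lighter alternative to XSLT (a different technique from the answer above) is to edit the parsed document in place with xml2's `xml_set_attr()` and write it back out, without converting to a list at all. The sketch below uses a shortened copy of the sample input; the XPath `.//train` assumes there is one train to change.

```r
library(xml2)

# Shortened copy of the railML sample used above
doc <- read_xml('<railml>
  <timetable version="1.1">
    <train trainID="RX 100.2" type="planned" source="opentrack">
      <timetableentries>
        <entry posID="ZU" departure="06:08:00" type="begin"/>
      </timetableentries>
    </train>
  </timetable>
</railml>')

# Locate the node and overwrite the attribute; the document is modified in place
train <- xml_find_first(doc, ".//train")
xml_set_attr(train, "source", "HelloNewWorld")

# Write the edited document back out (a temporary file here, for illustration)
out_path <- tempfile(fileext = ".xml")
write_xml(doc, out_path)
cat(as.character(doc))
```

XSLT remains the better fit when many nodes have to be rewritten by rule, while the in-place edit keeps everything in R for one-off changes.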
I found it hard to apply many of the answers listed here, so I wonder if this set of simple Java XML XPathHelper utilities may help others. You can find the source code here. I didn't write it all myself but adapted code I found, so I can't take all the credit, but it works, it is compact, and I hope it helps others.
String xmlPayLoad = readFileAsString(payLoadPath + "/payLoad.xml");
TreeMap<String, String> header = new TreeMap<String, String>();
XPathHelperCommon xph = new XPathHelperCommon();
header = xph.findMultipleXMLItems(xmlPayLoad, "//header/*");
header.put("type", "newProcess");
xmlPayLoad = xph.modifyMultipleXMLItems(xmlPayLoad, "//header/*", header);
The primitive XML header could be something like this:
<header>
<type>process</type>
<ruleBaseVersion>0</ruleBaseVersion>
<ruleBaseCommitment>0</ruleBaseCommitment>
<sequenceId>0</sequenceId>
<priortiseSID>0</priortiseSID>
<monitorIncomingEvents>0</monitorIncomingEvents>
<activityCount>0</activityCount>
<taskElapsedTime>0</taskElapsedTime>
<processStartTime>0</processStartTime>
<processElapsedTime>0</processElapsedTime>
<eventElapsedTime>0</eventElapsedTime>
<status>0</status>
</header>

Resources