I am trying to parse this xml and place it on data frame form:
file content looks like this:
<?xml version="1.0" encoding="utf-8" ?>
- <dashboardreport name="Incident_Rules" version="7.2.5.1022" reportdate="2019-02-20T14:45:57.352-05:00" description="">
- <source name="app1">
- <filters summary="last 30 minutes (auto)">
<filter>tf:DiagnoseTimeframe?1550690157352:1550691957352</filter>
</filters>
</source>
- <reportheader>
- <reportdetails>
<user>user1</user>
</reportdetails>
</reportheader>
- <data>
- <incidentchartdashlet name="Incident Chart" description="">
- <incidentchartrecords structuretype="tree">
<incidentchartrecord rule="Database Exception" systemprofile="app1" />
<incidentchartrecord rule="Response time greater than 30 minutes" systemprofile="app1" />
<incidentchartrecord rule="JVM Heap Utilization > 90%" systemprofile="app1" />
</incidentchartrecords>
</incidentchartdashlet>
</data>
</dashboardreport>
The data frame needs to be like this:
Source Name Rule
App1 Database Exception
App1 Response time greater than 30 minutes
App1 JVM Heap Utilization > 90%
Need to extract "Source name" and "incidentchartrecord rule". I have tried something like this:
library("XML")
doc <- read_xml(file)
dat<-xml_find_all(doc, ".//incidentchartrecord") %>%
map_df(function(x) {
xml_find_all(x, ".//incidentchartrecord") %>%
map_df(~as.list(xml_attrs(.))) %>%
select(rule) %>%
mutate(node=xml_attr(x, "incidentchartrecord"))
})
Any ideas?
Here's an approach that works. I used xml2, instead; that's where the xml_find_all & xml_attr functions are found.
library(xml2)
doc <- read_xml("test.xml")
source <- xml_attr(xml_find_all(doc,".//source"), "name")
rules <- xml_attr(xml_find_all(doc, ".//incidentchartrecord"), "rule")
df <- data.frame("Source.Name" = source, Rule=rules, stringsAsFactors=F)
Related
I have an xml response containing Body and Header nodes, how can I access the value of the $Envelope$Body$checkVatResponse$valid node?
For some reason I already can't find the Body using xml_find_all
library(httr)
library(dplyr)
library(rvest)
library(xml2)
body = r'[<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" >
<soapenv:Header/>
<soapenv:Body>
<urn:checkVat xmlns:urn="urn:ec.europa.eu:taxud:vies:services:checkVat:types">
<urn:countryCode>NL</urn:countryCode>
<urn:vatNumber>800938495B01</urn:vatNumber>
</urn:checkVat>
</soapenv:Body>
</soapenv:Envelope>]'
r <- POST("http://ec.europa.eu/taxation_customs/vies/services/checkVatTestService", body = body)
stop_for_status(r)
content(r) %>% xml_find_all('//Body')
content(r) %>% xml2::as_list()
res <- content(r)
xml_children(res) %>% xml_name()
# [1] "Header" "Body"
xml_find_all(res,'.//Body')
# {xml_nodeset (0)}
When working with XML data, you need to be mindful of the namespaces used in the file. You need to previx namespaced nodes with the correct namespace. To extract the valid value you can use
content(r) %>% xml_find_all('//env:Body/ns2:checkVatResponse/ns2:valid')
To see all the namespaces used by the file you can run
content(r) %>% xml_ns()
# env <-> http://schemas.xmlsoap.org/soap/envelope/
# ns2 <-> urn:ec.europa.eu:taxud:vies:services:checkVat:types
I have the following XML file:
<?xpacket begin="???" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.4-c006 80.159825, 2016/09/16-03:31:08 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/"
xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
xmlns:pdfx_1_="ns.adobe.org/pdfx/1.3/">
<xmp:CreateDate>2021-05-30T11:17:35+02:00</xmp:CreateDate>
<xmp:CreatorTool>TeX</xmp:CreatorTool>
<xmp:ModifyDate>2021-05-30T12:12:25+02:00</xmp:ModifyDate>
<xmp:MetadataDate>2021-05-30T12:12:25+02:00</xmp:MetadataDate>
<pdfx:PTEX.Fullbanner>This is pdfTeX, Version 3.14159265-2.6-1.40.20 (TeX Live 2019) kpathsea version 6.3.1</pdfx:PTEX.Fullbanner>
<pdf:Producer>pdfTeX-1.40.20</pdf:Producer>
<pdf:Trapped>Unknown</pdf:Trapped>
<pdf:Keywords/>
<dc:format>application/pdf</dc:format>
<xmpMM:DocumentID>uuid:38d0617c-0385-5941-a87d-cc4a1e54bd76</xmpMM:DocumentID>
<xmpMM:InstanceID>uuid:d056c61c-55c6-5f44-8c0e-fe6e911c2ed9</xmpMM:InstanceID>
<pdfwe:dafra>
<?xml version="1.0"?>
<dataframe name="expData"
xmlns="url"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="url">
<column name="DATA" type="ratio">
<value>14</value>
<value>18</value>
<value>21</value>
<value>35</value>
<value>44</value>
<value>50</value>
<value>3</value>
<value>5</value>
<value>7</value>
</column>
</dataframe>
</pdfx_1_:Dataframe>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
As you can see, the tag Dataframe of the namespace pdfwe have inside it another XML. I need to extract this XML and convert it to a normal XML with no ASCII Entity Names like the following:
<?xml version="1.0"?>
<dataframe name="expData"
xmlns="url"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="url">
<column name="DATA" type="ratio">
<value>14</value>
<value>18</value>
<value>21</value>
<value>35</value>
<value>44</value>
<value>50</value>
<value>3</value>
<value>5</value>
<value>7</value>
</column>
</dataframe>
To extract what's inside pdfwe:dafra I'm using the function xml_find_all(x, ".//pdfwe:dafra") of the xml2 package but I'm not getting the result I want.
To convert the Entity Names I'm using the function xml2::xml_text(xml2::read_xml(paste0("<x>", md, "</x>"))) but I'm not getting the results I want either.
Thanks in advance!
The solution is a multi step process, extract the database node, convert to text, clean up and then convert back to xml with the read_xml() function.
library(xml2)
page <- read_xml('<?xpacket begin="???" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.4-c006 80.159825, 2016/09/16-03:31:08">
.....') #read in the entire file
xml_ns(page) #show namespaces
#extract the database
db <- xml_find_first(page, ".//pdfx_1_:Dataframe")
#convert to text and strip leading whitespace
dbtext <- xml_text(db) %>% trimws()
#read the text in and convert to xml
xml_db <- read_xml(dbtext)
xml_ns(xml_db) #show namespaces
#extract the requested information from database
#shown here for demonstration purposes
xml_db %>% xml_find_all(".//d1:column") %>% xml_find_all(".//d1:value") %>% xml_text()
I am a newbie in XML and R and would like to ask you for a help. I need to extract data from XML into a dataframe in R. The XML file is following:
<?xml version="1.0" encoding="UTF-8"?>
-<Report xmlns="Tlg_Table_Begin_Ende_ValueIds" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" txtHeader="Table" Name="Tlg_Table_Begin_Ende_ValueIds" xsi:schemaLocation="Tlg_Table_Begin_Ende_ValueIds http://nwlph01/ReportServer_HISTORIAN?%2FTemplates%2FPublic%2FTags%2FTlg_Table_Begin_Ende_ValueIds&rs%3AFormat=XML&rc%3ASchema=True">
-<table1 textbox7="Flags" textbox6="Quality" textbox5="Value" textbox4="Timestamp" textbox2="Tag name">
-<Detail_Collection>
<Detail Flags="8392704" Quality="128" TimeStamp2="3758.203125 " TimeStamp="3/13/2019 3:15:00 PM 3/13/2019 3:15:00 PM" TagName="SystemArchive\0101___FIT101G/UM.PV_Out#Value"/>
<Detail Flags="8392704" Quality="128" TimeStamp2="3771.9267578125 " TimeStamp="3/13/2019 3:15:01 PM 3/13/2019 3:15:01 PM" TagName="SystemArchive\0101___FIT101G/UM.PV_Out#Value"/>
<Detail Flags="8392704" Quality="128" TimeStamp2="3783.43823242188 " TimeStamp="3/13/2019 3:15:02 PM 3/13/2019 3:15:02 PM" TagName="SystemArchive\0101___FIT101G/UM.PV_Out#Value"/>
</Detail_Collection>
</table1>
</Report>
I am using following codes:
library("xml2")
df <- read_xml("lh_01.xml")
But what I receive is:
Warning message:
In doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
xmlns: URI Tlg_Table_Begin_Ende_ValueIds is not absolute [100]
Do you have any idea what am I suppose to do?
Thank you in advance.
Searching Stackoverflow delivers e.g the folloeing URI is not absolute error - sorry I am not an XML expert what the error in your specific case may be; my know-how only goes so far as to find your xmlns URI unusual.
Situation: "Software" to R and back to "Software". The only interface for "Software" is xml.
In R, I need to make a few changes in the file so i convert it to a list and make some changes.
library(XML)
myFile = xmlParse("myXML")
xml_data <- xmlToList(myFile)
xml_data$timetable$train$.attrs[6] = "HelloNewWorld"
Now i need to convert this list "xml_data" it back to xml.
I found some functions like this:
function(item, tag) {
# just a textnode, or empty node with attributes
if(typeof(item) != 'list') {
if (length(item) > 1) {
xml <- xmlNode(tag)
for (name in names(item)) {
xmlAttrs(xml)[[name]] <- item[[name]]
}
return(xml)
} else {
return(xmlNode(tag, item))
}
}
# create the node
if (identical(names(item), c("text", ".attrs"))) {
# special case a node with text and attributes
xml <- xmlNode(tag, item[['text']])
} else {
# node with child nodes
xml <- xmlNode(tag)
for(i in 1:length(item)) {
if (names(item)[i] != ".attrs") {
xml <- append.xmlNode(xml, listToXml(item[[i]], names(item)[i]))
}
}
}
# add attributes to node
attrs <- item[['.attrs']]
for (name in names(attrs)) {
xmlAttrs(xml)[[name]] <- attrs[[name]]
}
return(xml)
}
But this doesnt work...
Any help or hints appreciated!
Thanks!
In the linked picture you can see the current xml-file. Highlighted in yellow the values that I need to change.
Link:
https://i.stack.imgur.com/remzj.png
Consider XSLT, the special-purpose language designed to transform XML files. No need to rewrite the entire tree in R. Using its xslt package (available on CRAN-R), extension of xml2, you can transform an input source and write output to screen or file.
Using the Identity Transform to copy document as is, below XSLT then rewrites one of the attributes in <train> tag, #source, similar to your above code attempt but with sixth attribute.
XML (sample input from railIML Wiki page)
<?xml version="1.0" encoding="UTF-8"?>
<railml xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance" xsi:noNamespaceSchemaLocation="timetable.xsd">
<timetable version="1.1">
<train trainID="RX 100.2" type="planned" source="opentrack">
<timetableentries>
<entry posID="ZU" departure="06:08:00" type="begin"/>
<entry posID="ZWI" departure="06:10:30" type="pass"/>
<entry posID="ZOER" arrival="06:16:00" departure="06:17:00" minStopTime="9" type="stop"/>
<entry posID="WS" departure="06:21:00" type="pass"/>
<entry posID="DUE" departure="06:23:00" type="pass"/>
<entry posID="SCW" departure="06:27:00" type="pass"/>
<entry posID="NAE" departure="06:29:00" type="pass"/>
<entry posID="UST" arrival="06:34:30" type="stop"/>
</timetableentries>
</train>
</timetable>
</railml>
XSLT (save as .xsl file, rewrites the #source attribute)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|#*">
<xsl:copy>
<xsl:apply-templates select="node()|#*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="#source">
<xsl:attribute name="source">HelloNewWorld</xsl:attribute>
</xsl:template>
</xsl:stylesheet>
R
library(xslt)
doc <- read_xml("/path/to/Input.xml", package = "xslt")
style <- read_xml("/path/to/XLSTScript.xsl", package = "xslt")
new_xml <- xml_xslt(doc, style)
# OUTPUT TO SCREEN
cat(as.character(new_xml))
# OUTPUT TO FILE
write_xml(new_xml, "/path/to/Output.xml")
Output
<?xml version="1.0" encoding="UTF-8"?>
<railml xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance" xsi:noNamespaceSchemaLocation="timetable.xsd">
<timetable version="1.1">
<train trainID="RX 100.2" type="planned" source="HelloNewWorld">
<timetableentries>
<entry posID="ZU" departure="06:08:00" type="begin"/>
<entry posID="ZWI" departure="06:10:30" type="pass"/>
<entry posID="ZOER" arrival="06:16:00" departure="06:17:00" minStopTime="9" type="stop"/>
<entry posID="WS" departure="06:21:00" type="pass"/>
<entry posID="DUE" departure="06:23:00" type="pass"/>
<entry posID="SCW" departure="06:27:00" type="pass"/>
<entry posID="NAE" departure="06:29:00" type="pass"/>
<entry posID="UST" arrival="06:34:30" type="stop"/>
</timetableentries>
</train>
</timetable>
</railml>
I found it hard to apply many of the answers listed here so I wonder if this set of simple Java XML XPathHelper Unities may help others. You can find the source code here. I didn't write it all myself but adapted code I found, so I can't take all the credit but it works and it is compact and hope it helps others.
String xmlPayLoad = readFileAsString(payLoadPath + "/payLoad.xml");
TreeMap<String, String> header = new TreeMap<String, String>();
XPathHelperCommon xph = new XPathHelperCommon();
header = xph.findMultipleXMLItems(xmlPayLoad, "//header/*");
header.put("type", "newProcess");
xmlPayLoad = xph.modifyMultipleXMLItems(xmlPayLoad, "//header/*", header);
The primative XML header could be something like this:
<header>
<type>process</type>
<ruleBaseVersion>0</ruleBaseVersion>
<ruleBaseCommitment>0</ruleBaseCommitment>
<sequenceId>0</sequenceId>
<priortiseSID>0</priortiseSID>
<monitorIncomingEvents>0</monitorIncomingEvents>
<activityCount>0</activityCount>
<taskElapsedTime>0</taskElapsedTime>
<processStartTime>0</processStartTime>
<processElapsedTime>0</processElapsedTime>
<eventElapsedTime>0</eventElapsedTime>
<status>0</status>
</header>
I am trying to import an xml file into R. It is of the format below with an event on each row followed by a number of attributes - which ones depend on the event type. This file is 0.7GB and future versions may be much bigger. I would like to create a data frame with each event on a new row and all the possible attributes in separate columns (meaning some will be empty depending on the event type). I have looked elsewhere for answers but they all seem to be dealing with XML files in a tree structure and I can't work out how to apply them to this format.
I am new to R and have no experience with XML files so please give me the "for dummies" answer with plenty of explanation. Thanks!
<?xml version="1.0" encoding="utf-8"?>
<events version="1.0">
<event time="21510.0" type="actend" person="3" link="1" actType="h" />
<event time="21510.0" type="departure" person="3" link="1" legMode="car" />
<event time="21510.0" type="PersonEntersVehicle" person="3" vehicle="3" />
<event time="21510.0" type="vehicle enters traffic" person="3" link="1" vehicle="3" networkMode="car" relativePosition="1.0" />
...
</events>
You can try something like this:
original_xml <- '<?xml version="1.0" encoding="utf-8"?>
<events version="1.0">
<event time="21510.0" type="actend" person="3" link="1" actType="h" />
<event time="21510.0" type="departure" person="3" link="1" legMode="car" />
<event time="21510.0" type="PersonEntersVehicle" person="3" vehicle="3" />
<event time="21510.0" type="vehicle enters traffic" person="3" link="1" vehicle="3" networkMode="car" relativePosition="1.0" />
</events>'
library(xml2)
data2 <- xml_children(read_xml(original_xml))
attr_names <- unique(names(unlist(xml_attrs(data2))))
xmlDataFrame <- as.data.frame(sapply(attr_names, function (attr) {
xml_attr(data2, attr = attr)
}), stringsAsFactors = FALSE)
#-- since all columns are strings, you may want to turn the numeric columns to numeric
xmlDataFrame[, c("time", "person", "link", "vehicle")] <- sapply(xmlDataFrame[, c("time", "person", "link", "vehicle")], as.numeric)
If you have additional "numeric" columns, you can add them at the end to convert the data to its proper class.