R proper way to parse XML

I have an XML response containing Header and Body nodes. How can I access the value of the $Envelope$Body$checkVatResponse$valid node?
For some reason I can't even find the Body node using xml_find_all:
library(httr)
library(dplyr)
library(rvest)
library(xml2)
body = r'[<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" >
<soapenv:Header/>
<soapenv:Body>
<urn:checkVat xmlns:urn="urn:ec.europa.eu:taxud:vies:services:checkVat:types">
<urn:countryCode>NL</urn:countryCode>
<urn:vatNumber>800938495B01</urn:vatNumber>
</urn:checkVat>
</soapenv:Body>
</soapenv:Envelope>]'
r <- POST("http://ec.europa.eu/taxation_customs/vies/services/checkVatTestService", body = body)
stop_for_status(r)
content(r) %>% xml_find_all('//Body')
content(r) %>% xml2::as_list()
res <- content(r)
xml_children(res) %>% xml_name()
# [1] "Header" "Body"
xml_find_all(res,'.//Body')
# {xml_nodeset (0)}

When working with XML data, you need to be mindful of the namespaces used in the file: namespaced nodes must be prefixed with the correct namespace in the XPath expression. To extract the valid value you can use
content(r) %>% xml_find_all('//env:Body/ns2:checkVatResponse/ns2:valid')
To see all the namespaces used by the file you can run
content(r) %>% xml_ns()
# env <-> http://schemas.xmlsoap.org/soap/envelope/
# ns2 <-> urn:ec.europa.eu:taxud:vies:services:checkVat:types
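The same idea can be tried offline. The sketch below parses a hypothetical response string (its shape is assumed from the namespaces listed above, not copied from the real service) and reads the valid node with a namespace-prefixed XPath. It also shows a prefix-independent alternative using local-name(), which may be handy when the prefixes are not known in advance.

```r
library(xml2)

# Hypothetical response text; the real service reply may differ in detail.
response_text <- '<env:Envelope xmlns:env="http://schemas.xmlsoap.org/soap/envelope/">
  <env:Header/>
  <env:Body>
    <ns2:checkVatResponse xmlns:ns2="urn:ec.europa.eu:taxud:vies:services:checkVat:types">
      <ns2:countryCode>NL</ns2:countryCode>
      <ns2:vatNumber>800938495B01</ns2:vatNumber>
      <ns2:valid>true</ns2:valid>
    </ns2:checkVatResponse>
  </env:Body>
</env:Envelope>'

res <- read_xml(response_text)

# Without a prefix nothing matches, because Body lives in a namespace.
no_prefix <- xml_find_all(res, "//Body")

# With the prefixes reported by xml_ns(res), the node is found.
valid <- xml_text(xml_find_first(res, "//env:Body/ns2:checkVatResponse/ns2:valid"))

# Alternative that ignores prefixes entirely.
valid_local <- xml_text(xml_find_first(res, "//*[local-name()='valid']"))
```

Both valid and valid_local are character strings here; convert with something like identical(valid, "true") if a logical is needed.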

Related

Convert in R a XML with ASCII Entity Names to a basic XML

I have the following XML file:
<?xpacket begin="???" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.4-c006 80.159825, 2016/09/16-03:31:08 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/"
xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
xmlns:pdfx_1_="ns.adobe.org/pdfx/1.3/">
<xmp:CreateDate>2021-05-30T11:17:35+02:00</xmp:CreateDate>
<xmp:CreatorTool>TeX</xmp:CreatorTool>
<xmp:ModifyDate>2021-05-30T12:12:25+02:00</xmp:ModifyDate>
<xmp:MetadataDate>2021-05-30T12:12:25+02:00</xmp:MetadataDate>
<pdfx:PTEX.Fullbanner>This is pdfTeX, Version 3.14159265-2.6-1.40.20 (TeX Live 2019) kpathsea version 6.3.1</pdfx:PTEX.Fullbanner>
<pdf:Producer>pdfTeX-1.40.20</pdf:Producer>
<pdf:Trapped>Unknown</pdf:Trapped>
<pdf:Keywords/>
<dc:format>application/pdf</dc:format>
<xmpMM:DocumentID>uuid:38d0617c-0385-5941-a87d-cc4a1e54bd76</xmpMM:DocumentID>
<xmpMM:InstanceID>uuid:d056c61c-55c6-5f44-8c0e-fe6e911c2ed9</xmpMM:InstanceID>
<pdfwe:dafra>
<?xml version="1.0"?>
<dataframe name="expData"
xmlns="url"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="url">
<column name="DATA" type="ratio">
<value>14</value>
<value>18</value>
<value>21</value>
<value>35</value>
<value>44</value>
<value>50</value>
<value>3</value>
<value>5</value>
<value>7</value>
</column>
</dataframe>
</pdfx_1_:Dataframe>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
As you can see, the tag Dataframe of the namespace pdfwe has another XML document inside it. I need to extract this XML and convert it to a normal XML without ASCII entity names, like the following:
<?xml version="1.0"?>
<dataframe name="expData"
xmlns="url"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="url">
<column name="DATA" type="ratio">
<value>14</value>
<value>18</value>
<value>21</value>
<value>35</value>
<value>44</value>
<value>50</value>
<value>3</value>
<value>5</value>
<value>7</value>
</column>
</dataframe>
To extract what's inside pdfwe:dafra I'm using the function xml_find_all(x, ".//pdfwe:dafra") of the xml2 package but I'm not getting the result I want.
To convert the Entity Names I'm using the function xml2::xml_text(xml2::read_xml(paste0("<x>", md, "</x>"))) but I'm not getting the results I want either.
Thanks in advance!
The solution is a multi-step process: extract the database node, convert it to text, clean it up, and then convert it back to XML with the read_xml() function.
library(xml2)
library(magrittr) # provides the %>% pipe used below
page <- read_xml('<?xpacket begin="???" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.4-c006 80.159825, 2016/09/16-03:31:08">
.....') #read in the entire file
xml_ns(page) #show namespaces
#extract the database
db <- xml_find_first(page, ".//pdfx_1_:Dataframe")
#convert to text and strip leading whitespace
dbtext <- xml_text(db) %>% trimws()
#read the text in and convert to xml
xml_db <- read_xml(dbtext)
xml_ns(xml_db) #show namespaces
#extract the requested information from database
#shown here for demonstration purposes
xml_db %>% xml_find_all(".//d1:column") %>% xml_find_all(".//d1:value") %>% xml_text()
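Since the snippet above elides most of the file, here is a small self-contained sketch of the same steps. The outer document, both namespace URIs (http://example.com/...) and the two sample values are made up for illustration; only the overall shape follows the question.

```r
library(xml2)

# Hypothetical outer document: the inner XML is stored as entity-escaped text.
outer_text <- '<root xmlns:pdfx_1_="http://example.com/pdfx/1.3/">
  <pdfx_1_:Dataframe>
    &lt;?xml version="1.0"?&gt;
    &lt;dataframe name="expData" xmlns="http://example.com/dataframe"&gt;
      &lt;column name="DATA" type="ratio"&gt;
        &lt;value&gt;14&lt;/value&gt;
        &lt;value&gt;18&lt;/value&gt;
      &lt;/column&gt;
    &lt;/dataframe&gt;
  </pdfx_1_:Dataframe>
</root>'

outer_doc <- read_xml(outer_text)

# Step 1: locate the node that holds the escaped XML.
db <- xml_find_first(outer_doc, ".//pdfx_1_:Dataframe")

# Step 2: xml_text() decodes the entities; trimws() removes the leading
# whitespace so the XML declaration sits at the very start of the string.
inner_text <- trimws(xml_text(db))

# Step 3: parse the decoded text as its own document.
inner_doc <- read_xml(inner_text)

# The inner default namespace is exposed by xml2 under the prefix d1.
values <- as.numeric(xml_text(xml_find_all(inner_doc, ".//d1:value")))
```

If the plain XML file is the goal rather than the values, write_xml(inner_doc, "some_file.xml") should save the decoded document.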

POST request in R: error in upload_file() with xml

I'm trying to create a POST request, but the body parameter isn't working as I expected.
The POST_bodyRequest.xml file
<?xml version="1.0" encoding="UTF-8"?>
<rs:alarm-request throttlesize="0"
xmlns:rs="http://www.ca.com/spectrum/restful/schema/request"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.ca.com/spectrum/restful/schema/request ../../../xsd/Request.xsd ">
<rs:requested-attribute id="0x10000"/>
<rs:requested-attribute id="0x10001"/>
<rs:requested-attribute id="0x10009"/>
<rs:requested-attribute id="0x1000a"/>
<rs:requested-attribute id="0x1006e"/>
<rs:requested-attribute id="0x11ee8"/>
</rs:alarm-request>
The code, basically the POST call
xml <- upload_file("POST_bodyRequest.xml")
r2 <- POST(url, login.password, body = list(xml))
status_code(r2)
The first thing to note is that the content of the file isn't stored in the xml variable:
> xml <- upload_file("POST_bodyRequest.xml")
> xml
Form file: POST_bodyRequest.xml (type: application/xml)
> str(xml)
List of 2
$ path: chr "D:\\MPM\\POST_bodyRequest.xml"
$ type: chr "application/xml"
- attr(*, "class")= chr "form_file"
Therefore, the POST call returns an error
> r2 <- POST(url, login.password, body = list(xml))
Error: All components of body must be named
> status_code(r2)
[1] 415
I've also tried to read the XML file using xmlParse(). In this case the file content is read as expected, but I get the same error when calling POST.
> xml <- xmlParse(file = "POST_bodyRequest.xml")
> r2 <- POST(url, autenticacao, body = list(xml))
Error: All components of body must be named
> xml
<?xml version="1.0" encoding="UTF-8"?>
<rs:alarm-request xmlns:rs="http://www.ca.com/spectrum/restful/schema/request" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" throttlesize="0" xsi:schemaLocation="http://www.ca.com/spectrum/restful/schema/request ../../../xsd/Request.xsd ">
<rs:requested-attribute id="0x10000"/>
<rs:requested-attribute id="0x10001"/>
<rs:requested-attribute id="0x10009"/>
<rs:requested-attribute id="0x1000a"/>
<rs:requested-attribute id="0x1006e"/>
<rs:requested-attribute id="0x11ee8"/>
</rs:alarm-request>
> str(list(xml))
List of 1
$ :Classes 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr>
> status_code(r2)
[1] 415
I had no trouble with GET requests in R. The POST request works fine with SoapUI. So, what am I doing wrong?
Well, it did work. The problem had nothing to do with the upload_file() function or the XML file: the url variable hadn't been updated to the POST version. Thanks for the confirmations; I'm marking this issue as resolved.
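For reference, the message "All components of body must be named" is what httr reports when body is an unnamed list, because a list body is treated as a set of named form fields. A minimal sketch of passing the file directly instead is shown here. The file content is a shortened copy of the request above, and post_xml is a hypothetical helper, not part of httr.

```r
library(httr)

# Write a shortened request body to a temporary file.
path <- tempfile(fileext = ".xml")
writeLines(c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<rs:alarm-request throttlesize="0"',
  '    xmlns:rs="http://www.ca.com/spectrum/restful/schema/request">',
  '  <rs:requested-attribute id="0x10000"/>',
  '</rs:alarm-request>'
), path)

# upload_file() stores the path and MIME type, as the str() output above shows.
xml_body <- upload_file(path, type = "application/xml")

# Hypothetical helper: hand the form_file object to `body` as-is,
# not wrapped in list().
post_xml <- function(url, body, ...) {
  POST(url, ..., body = body)
}

# Usage, assuming url and login.password are defined as in the question:
# r2 <- post_xml(url, xml_body, login.password)
# status_code(r2)
```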

How to import data from XML to R?

I am a newbie in XML and R and would like to ask for help. I need to extract data from XML into a data frame in R. The XML file is the following:
<?xml version="1.0" encoding="UTF-8"?>
<Report xmlns="Tlg_Table_Begin_Ende_ValueIds" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" txtHeader="Table" Name="Tlg_Table_Begin_Ende_ValueIds" xsi:schemaLocation="Tlg_Table_Begin_Ende_ValueIds http://nwlph01/ReportServer_HISTORIAN?%2FTemplates%2FPublic%2FTags%2FTlg_Table_Begin_Ende_ValueIds&rs%3AFormat=XML&rc%3ASchema=True">
<table1 textbox7="Flags" textbox6="Quality" textbox5="Value" textbox4="Timestamp" textbox2="Tag name">
<Detail_Collection>
<Detail Flags="8392704" Quality="128" TimeStamp2="3758.203125 " TimeStamp="3/13/2019 3:15:00 PM 3/13/2019 3:15:00 PM" TagName="SystemArchive\0101___FIT101G/UM.PV_Out#Value"/>
<Detail Flags="8392704" Quality="128" TimeStamp2="3771.9267578125 " TimeStamp="3/13/2019 3:15:01 PM 3/13/2019 3:15:01 PM" TagName="SystemArchive\0101___FIT101G/UM.PV_Out#Value"/>
<Detail Flags="8392704" Quality="128" TimeStamp2="3783.43823242188 " TimeStamp="3/13/2019 3:15:02 PM 3/13/2019 3:15:02 PM" TagName="SystemArchive\0101___FIT101G/UM.PV_Out#Value"/>
</Detail_Collection>
</table1>
</Report>
I am using the following code:
library("xml2")
df <- read_xml("lh_01.xml")
But what I receive is:
Warning message:
In doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
xmlns: URI Tlg_Table_Begin_Ende_ValueIds is not absolute [100]
Do you have any idea what I am supposed to do?
Thank you in advance.
Searching Stack Overflow turns up, for example, the question "URI is not absolute error". Sorry, I am not an XML expert, so I can't say what the problem is in your specific case; my know-how only goes far enough to find your xmlns URI unusual.
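Note that the message in the question is a warning rather than an error, so read_xml() may well have returned a usable document. The sketch below works on a shortened, hypothetical copy of the report (simplified tag names, two rows) and builds a data frame from the Detail attributes. It matches elements by local-name() so the unusual default namespace does not get in the way. Treating TimeStamp2 as the measured value is a guess based on the sample.

```r
library(xml2)

# Shortened, hypothetical copy of the report from the question.
report_text <- '<?xml version="1.0" encoding="UTF-8"?>
<Report xmlns="Tlg_Table_Begin_Ende_ValueIds" Name="Tlg_Table_Begin_Ende_ValueIds">
  <table1>
    <Detail_Collection>
      <Detail Flags="8392704" Quality="128" TimeStamp2="3758.203125 " TagName="FIT101G"/>
      <Detail Flags="8392704" Quality="128" TimeStamp2="3771.9267578125 " TagName="FIT101G"/>
    </Detail_Collection>
  </table1>
</Report>'

# The non-absolute namespace URI only triggers a warning; silence it here.
doc <- suppressWarnings(read_xml(report_text))

# Match every Detail element regardless of its namespace.
details <- xml_find_all(doc, "//*[local-name()='Detail']")

# One column per attribute of interest; trimws() drops the trailing space.
df <- data.frame(
  TagName = xml_attr(details, "TagName"),
  Value   = as.numeric(trimws(xml_attr(details, "TimeStamp2"))),
  Quality = as.integer(xml_attr(details, "Quality")),
  stringsAsFactors = FALSE
)
```

With the real file, replacing report_text by the path "lh_01.xml" should be the only change needed, assuming the file is well-formed.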

how do you convert xml file into data frame in R

I am trying to parse this XML and put it into a data frame.
The file content looks like this:
<?xml version="1.0" encoding="utf-8" ?>
<dashboardreport name="Incident_Rules" version="7.2.5.1022" reportdate="2019-02-20T14:45:57.352-05:00" description="">
<source name="app1">
<filters summary="last 30 minutes (auto)">
<filter>tf:DiagnoseTimeframe?1550690157352:1550691957352</filter>
</filters>
</source>
<reportheader>
<reportdetails>
<user>user1</user>
</reportdetails>
</reportheader>
<data>
<incidentchartdashlet name="Incident Chart" description="">
<incidentchartrecords structuretype="tree">
<incidentchartrecord rule="Database Exception" systemprofile="app1" />
<incidentchartrecord rule="Response time greater than 30 minutes" systemprofile="app1" />
<incidentchartrecord rule="JVM Heap Utilization > 90%" systemprofile="app1" />
</incidentchartrecords>
</incidentchartdashlet>
</data>
</dashboardreport>
The data frame needs to be like this:
Source Name    Rule
App1           Database Exception
App1           Response time greater than 30 minutes
App1           JVM Heap Utilization > 90%
I need to extract the source name and the incidentchartrecord rule. I have tried something like this:
library("XML")
doc <- read_xml(file)
dat<-xml_find_all(doc, ".//incidentchartrecord") %>%
map_df(function(x) {
xml_find_all(x, ".//incidentchartrecord") %>%
map_df(~as.list(xml_attrs(.))) %>%
select(rule) %>%
mutate(node=xml_attr(x, "incidentchartrecord"))
})
Any ideas?
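Here's an approach that works. I used xml2 instead, since that's where the xml_find_all and xml_attr functions are found.
library(xml2)
doc <- read_xml("test.xml")
# name attribute of the <source> element (a single value in this file)
source_name <- xml_attr(xml_find_all(doc, ".//source"), "name")
# rule attribute of every <incidentchartrecord> element
rules <- xml_attr(xml_find_all(doc, ".//incidentchartrecord"), "rule")
# the single source name is recycled across all rules
df <- data.frame(Source.Name = source_name, Rule = rules, stringsAsFactors = FALSE)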

SOAP Client with WSDL for R

I'm trying to write code for a SOAP client in R using the SSOAP package. This was my initial code:
wsdl <- getURL("http://sistemas.cvm.gov.br/webservices/Sistemas/SCW/CDocs/WsDownloadInfs.asmx?WSDL")
def <- processWSDL(doc, verbose = TRUE)
ff <- genSOAPClientInterface(def = def, verbose = TRUE)
But I think the WSDL documentation is too complex (multi-dimensional) for the functions. I tried (among many other things) to simplify the WSDL by choosing just one service. That helped with the processWSDL function, but I still cannot generate the client functions. The error message is:
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
Error during wrapup: evaluation nested too deeply: infinite recursion / options(expressions=)?
Please, could somebody help me?
The RCurl package helps us to do this (see an example on http://www.stat.wvu.edu/~jharner/courses/stat623/docs/RCurlJSS.pdf):
library(RCurl)
library(XML)
###############
#### Login ####
###############
headerfields = c(
Accept = "text/xml",
Accept = "multipart/*",
'Content-Type' = "text/xml; charset=utf-8",
SOAPAction = "http://www.cvm.gov.br/webservices/Login"
)
body = "<?xml version='1.0' encoding='utf-8'?>
<soap:Envelope xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' xmlns:xsd='http://www.w3.org/2001/XMLSchema' xmlns:soap='http://schemas.xmlsoap.org/soap/envelope/'>
<soap:Header>
<sessaoIdHeader xmlns='http://www.cvm.gov.br/webservices/'>
<Guid>8200ac01-bfb5-46d6-a625-38108141fb33</Guid>
<IdSessao>135128883</IdSessao>
</sessaoIdHeader>
</soap:Header>
<soap:Body>
<Login xmlns='http://www.cvm.gov.br/webservices/'>
<iNrSist>XXXX</iNrSist>
<strSenha>XXXXX</strSenha>
</Login>
</soap:Body>
</soap:Envelope>"
reader = basicTextGatherer()
curlPerform(
url = "http://sistemas.cvm.gov.br/webservices/Sistemas/SCW/CDocs/WsDownloadInfs.asmx",
httpheader = headerfields,
postfields = body,
writefunction = reader$update
)
xml <- reader$value()
xml
You have to do something like this for each of the functions in http://sistemas.cvm.gov.br/webservices/Sistemas/SCW/CDocs/WsDownloadInfs.asmx
If you have something easier (or more elegant) and want to share it, you're welcome to!
Thanks!
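One possible alternative, sketched here with httr instead of RCurl, is to wrap the envelope construction and the POST call in small helpers so that each service function needs only its own envelope. Both build_login_envelope and send_soap are hypothetical names. The soap:Header block from the request above is left out for brevity and may need to be restored if the service requires it. The values are inserted without XML escaping, which is only safe for simple inputs.

```r
library(xml2)

# Hypothetical helper: fill the Login envelope from the example above.
# Inputs are not XML-escaped, so avoid characters such as < and &.
build_login_envelope <- function(system_id, password) {
  sprintf(
    "<?xml version='1.0' encoding='utf-8'?>
<soap:Envelope xmlns:soap='http://schemas.xmlsoap.org/soap/envelope/'>
  <soap:Body>
    <Login xmlns='http://www.cvm.gov.br/webservices/'>
      <iNrSist>%s</iNrSist>
      <strSenha>%s</strSenha>
    </Login>
  </soap:Body>
</soap:Envelope>",
    system_id, password
  )
}

# Hypothetical helper: POST an envelope with the SOAPAction header and
# return the parsed reply. Requires the httr package when called.
send_soap <- function(url, action, envelope) {
  resp <- httr::POST(
    url,
    body = envelope,
    httr::content_type("text/xml; charset=utf-8"),
    httr::add_headers(SOAPAction = action)
  )
  httr::stop_for_status(resp)
  read_xml(httr::content(resp, as = "text", encoding = "UTF-8"))
}

# Build an envelope with placeholder credentials and check it parses.
envelope <- build_login_envelope("1234", "secret")
envelope_doc <- read_xml(envelope)

# Usage (not run; needs network access and real credentials):
# reply <- send_soap(
#   "http://sistemas.cvm.gov.br/webservices/Sistemas/SCW/CDocs/WsDownloadInfs.asmx",
#   "http://www.cvm.gov.br/webservices/Login",
#   envelope
# )
```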
