I am a newbie in XML and R and would like to ask you for help. I need to extract data from an XML file into a data frame in R. The XML file is the following:
<?xml version="1.0" encoding="UTF-8"?>
<Report xmlns="Tlg_Table_Begin_Ende_ValueIds" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" txtHeader="Table" Name="Tlg_Table_Begin_Ende_ValueIds" xsi:schemaLocation="Tlg_Table_Begin_Ende_ValueIds http://nwlph01/ReportServer_HISTORIAN?%2FTemplates%2FPublic%2FTags%2FTlg_Table_Begin_Ende_ValueIds&rs%3AFormat=XML&rc%3ASchema=True">
<table1 textbox7="Flags" textbox6="Quality" textbox5="Value" textbox4="Timestamp" textbox2="Tag name">
<Detail_Collection>
<Detail Flags="8392704" Quality="128" TimeStamp2="3758.203125 " TimeStamp="3/13/2019 3:15:00 PM 3/13/2019 3:15:00 PM" TagName="SystemArchive\0101___FIT101G/UM.PV_Out#Value"/>
<Detail Flags="8392704" Quality="128" TimeStamp2="3771.9267578125 " TimeStamp="3/13/2019 3:15:01 PM 3/13/2019 3:15:01 PM" TagName="SystemArchive\0101___FIT101G/UM.PV_Out#Value"/>
<Detail Flags="8392704" Quality="128" TimeStamp2="3783.43823242188 " TimeStamp="3/13/2019 3:15:02 PM 3/13/2019 3:15:02 PM" TagName="SystemArchive\0101___FIT101G/UM.PV_Out#Value"/>
</Detail_Collection>
</table1>
</Report>
I am using the following code:
library("xml2")
df <- read_xml("lh_01.xml")
But what I receive is:
Warning message:
In doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
xmlns: URI Tlg_Table_Begin_Ende_ValueIds is not absolute [100]
Do you have any idea what I am supposed to do?
Thank you in advance.
Searching Stack Overflow turns up, for example, the following "URI is not absolute" error. Sorry, I am not an XML expert and can't say what the cause in your specific case may be; my know-how only goes so far as to find your xmlns URI unusual.
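For what it's worth, note that this is only a warning, not an error: read_xml() still returns a parsed document, so the Detail attributes can be pulled straight into a data frame. A minimal sketch, assuming the attribute-to-column mapping below is what you want and that TimeStamp2 holds the measured value:
library(xml2)
doc <- read_xml("lh_01.xml")   #the warning about the non-absolute URI can be ignored
xml_ns_strip(doc)              #drop the default namespace so plain XPath works
details <- xml_find_all(doc, ".//Detail")
df <- data.frame(
  TagName   = xml_attr(details, "TagName"),
  TimeStamp = xml_attr(details, "TimeStamp"),
  Value     = as.numeric(xml_attr(details, "TimeStamp2")),
  Quality   = xml_attr(details, "Quality"),
  Flags     = xml_attr(details, "Flags"),
  stringsAsFactors = FALSE
)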
I have the following XML file:
<?xpacket begin="???" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.4-c006 80.159825, 2016/09/16-03:31:08 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/"
xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
xmlns:pdfx_1_="ns.adobe.org/pdfx/1.3/">
<xmp:CreateDate>2021-05-30T11:17:35+02:00</xmp:CreateDate>
<xmp:CreatorTool>TeX</xmp:CreatorTool>
<xmp:ModifyDate>2021-05-30T12:12:25+02:00</xmp:ModifyDate>
<xmp:MetadataDate>2021-05-30T12:12:25+02:00</xmp:MetadataDate>
<pdfx:PTEX.Fullbanner>This is pdfTeX, Version 3.14159265-2.6-1.40.20 (TeX Live 2019) kpathsea version 6.3.1</pdfx:PTEX.Fullbanner>
<pdf:Producer>pdfTeX-1.40.20</pdf:Producer>
<pdf:Trapped>Unknown</pdf:Trapped>
<pdf:Keywords/>
<dc:format>application/pdf</dc:format>
<xmpMM:DocumentID>uuid:38d0617c-0385-5941-a87d-cc4a1e54bd76</xmpMM:DocumentID>
<xmpMM:InstanceID>uuid:d056c61c-55c6-5f44-8c0e-fe6e911c2ed9</xmpMM:InstanceID>
<pdfx_1_:Dataframe>
<?xml version="1.0"?>
<dataframe name="expData"
xmlns="url"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="url">
<column name="DATA" type="ratio">
<value>14</value>
<value>18</value>
<value>21</value>
<value>35</value>
<value>44</value>
<value>50</value>
<value>3</value>
<value>5</value>
<value>7</value>
</column>
</dataframe>
</pdfx_1_:Dataframe>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
As you can see, the tag Dataframe of the namespace pdfx_1_ has another XML document inside it. I need to extract this XML and convert it to normal XML with no ASCII entity names, like the following:
<?xml version="1.0"?>
<dataframe name="expData"
xmlns="url"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="url">
<column name="DATA" type="ratio">
<value>14</value>
<value>18</value>
<value>21</value>
<value>35</value>
<value>44</value>
<value>50</value>
<value>3</value>
<value>5</value>
<value>7</value>
</column>
</dataframe>
To extract what's inside pdfx_1_:Dataframe I'm using the function xml_find_all(x, ".//pdfx_1_:Dataframe") from the xml2 package, but I'm not getting the result I want.
To convert the entity names I'm using the function xml2::xml_text(xml2::read_xml(paste0("<x>", md, "</x>"))), but I'm not getting the results I want either.
Thanks in advance!
The solution is a multi-step process: extract the Dataframe node, convert it to text, clean it up, and then convert it back to XML with the read_xml() function.
library(xml2)
library(magrittr)   #for the %>% pipe used below
page <- read_xml('<?xpacket begin="???" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.4-c006 80.159825, 2016/09/16-03:31:08">
.....') #read in the entire file
xml_ns(page) #show namespaces
#extract the Dataframe node
db <- xml_find_first(page, ".//pdfx_1_:Dataframe")
#convert its contents to text and strip leading/trailing whitespace
dbtext <- xml_text(db) %>% trimws()
#read the text back in and convert it to xml
xml_db <- read_xml(dbtext)
xml_ns(xml_db) #show namespaces of the embedded document
#extract the requested information from the embedded dataframe
#shown here for demonstration purposes
xml_db %>% xml_find_all(".//d1:column") %>% xml_find_all(".//d1:value") %>% xml_text()
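As a final step, the extracted values can be put into a regular data frame. A minimal sketch, assuming the single DATA column shown above is what you are after and that its values should be numeric:
vals <- xml_db %>% xml_find_all(".//d1:value") %>% xml_text()
expData <- data.frame(DATA = as.numeric(vals))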
Problem
I have an XML file that I would like to parse in R. I know that this file is not corrupted because the following Python code seems to work:
>>> import xml.etree.ElementTree as ET
>>> xml_tree = ET.parse(PATH_TO_MY_XML_FILE)
>>> do_my_regular_xml_stuff_that_seems_to_work_no_problem(xml_tree)
Now, when I try to run the following code in R, I get an error message:
> library("XML")
> xml_tree <- XML::xmlParse(PATH_TO_MY_XML_FILE)
Error in nchar(text_repr): invalid multibyte string, element 1
Traceback:
Alright, maybe the parser doesn't recognize the encoding. Luckily this should be specified in a decent XML file. So, I go to my shell and check:
$ head -n1 PATH_TO_MY_XML_FILE
??<?xml version="1.0" encoding="utf-16"?>
Now, I can go back to R and explicitly pass on the encoding, only to face the next error message where I got stuck now:
> library("XML")
> xml_tree <- XML::xmlParse(PATH_TO_MY_XML_FILE, encoding='UTF-16')
Start tag expected, '<' not found
Error: 1: Start tag expected, '<' not found
Traceback:
1. XML::xmlParse(filePath, encoding = "UTF-16")
2. (function (msg, ...)
. {
. if (length(grep("\\\n$", msg)) == 0)
. paste(msg, "\n", sep = "")
. if (immediate)
. cat(msg)
. if (length(msg) == 0) {
. e = simpleError(paste(1:length(messages), messages, sep = ": ",
. collapse = ""))
. class(e) = c(class, class(e))
. stop(e)
. }
. messages <<- c(messages, msg)
. })(character(0))
A last attempt to check (in R) if the file is in fact "UTF-16" encoded yields:
> f <- file(filePath, 'r', encoding = "UTF-16")
> firstLine <- readLines(f, n=1)
> close(f)
> print(firstLine)
[1] "<?xml version=\"1.0\" encoding=\"utf-16\"?>"
Which looks just about right to me.
Question(s)
Does anyone know what is happening here? Is this a bug in the XML library? Is the file maybe not 'UTF-16' encoded, even though it claims it is? What are the two question marks ?? that I see when I print the file in the shell? These question marks don't appear when the file is read in properly...
Is this a bug in the XML library?
I think there could be a bug here. If I generate a valid UTF-16 XML document, which will have an initial byte-order mark:
$ echo '<a>😊</a>' | iconv -t utf-16 >a-utf16.xml
$ xxd a-utf16.xml
00000000: fffe 3c00 6100 3e00 3dd8 0ade 3c00 2f00 ..<.a.>.=...<./.
00000010: 6100 3e00 0a00 a.>...
then I can parse it with:
> XML::xmlParse('a-utf16.xml')
<?xml version="1.0"?>
<a>😊</a>
but not if I specify the encoding:
> XML::xmlParse('a-utf16.xml', encoding='utf-16')
Start tag expected, '<' not found
Error: 1: Start tag expected, '<' not found
Your original problem was when you weren't specifying the encoding. However:
I know that this file is not corrupted because the following Python code seems to work
That's a good hint, but I think you'll find edge cases where that doesn't hold. Try iconv for a second opinion on whether the file is encoded correctly.
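If it helps, you can also inspect the first few bytes from R; a correctly encoded UTF-16 file should begin with a byte-order mark (FF FE for little-endian, FE FF for big-endian), which is probably what the two leading question marks in your head output are. A small sketch, reusing the PATH_TO_MY_XML_FILE placeholder from above:
first_bytes <- readBin(PATH_TO_MY_XML_FILE, what = "raw", n = 4)
print(first_bytes)   #expect ff fe (or fe ff) at the start for UTF-16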
For a more specific response, you'll need to post a reproducible XML file.
I am trying to parse this XML and put it in data frame form.
The file content looks like this:
<?xml version="1.0" encoding="utf-8" ?>
<dashboardreport name="Incident_Rules" version="7.2.5.1022" reportdate="2019-02-20T14:45:57.352-05:00" description="">
<source name="app1">
<filters summary="last 30 minutes (auto)">
<filter>tf:DiagnoseTimeframe?1550690157352:1550691957352</filter>
</filters>
</source>
<reportheader>
<reportdetails>
<user>user1</user>
</reportdetails>
</reportheader>
<data>
<incidentchartdashlet name="Incident Chart" description="">
<incidentchartrecords structuretype="tree">
<incidentchartrecord rule="Database Exception" systemprofile="app1" />
<incidentchartrecord rule="Response time greater than 30 minutes" systemprofile="app1" />
<incidentchartrecord rule="JVM Heap Utilization > 90%" systemprofile="app1" />
</incidentchartrecords>
</incidentchartdashlet>
</data>
</dashboardreport>
The data frame needs to be like this:
Source Name    Rule
App1           Database Exception
App1           Response time greater than 30 minutes
App1           JVM Heap Utilization > 90%
I need to extract the source "name" and the incidentchartrecord "rule" attributes. I have tried something like this:
library("XML")
doc <- read_xml(file)
dat<-xml_find_all(doc, ".//incidentchartrecord") %>%
map_df(function(x) {
xml_find_all(x, ".//incidentchartrecord") %>%
map_df(~as.list(xml_attrs(.))) %>%
select(rule) %>%
mutate(node=xml_attr(x, "incidentchartrecord"))
})
Any ideas?
Here's an approach that works. I used xml2 instead; that's where the xml_find_all and xml_attr functions come from.
library(xml2)
doc <- read_xml("test.xml")
source <- xml_attr(xml_find_all(doc,".//source"), "name")
rules <- xml_attr(xml_find_all(doc, ".//incidentchartrecord"), "rule")
df <- data.frame("Source.Name" = source, Rule=rules, stringsAsFactors=F)
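With the file above, df should come out looking something like this (the single source name app1 is recycled across the three rules):
  Source.Name                                   Rule
1        app1                     Database Exception
2        app1 Response time greater than 30 minutes
3        app1             JVM Heap Utilization > 90%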
I can download a file from the internet easily enough using code such as this:
myurl <- "http://www.jatma.or.jp/toukei/xls/13_01.xls"
download.file(myurl, destfile = myfilepath, mode = 'wb')
However, usually I want to check the date the file was last modified before I download it. I can do this very easily in Perl using the LWP::Simple package. I've poked through the documentation for RCurl (which I admit I understand only poorly) and the closest thing I can find is the basicHeaderGatherer function.
library(RCurl)
if(url.exists("http://www.jatma.or.jp/toukei/xls/13_01.xls")) {
h = basicHeaderGatherer()
foo <- getURL("http://www.jatma.or.jp/toukei/xls/13_01.xls",
headerfunction = h$update)
names(h$value())
h$value()
}
h$value()[3]
By using the code above I can eventually access the 'Last-Modified' attribute, but not without generating errors as per the output below. How can I clean up my code to avoid this error and access the 'Last-Modified' attribute in a straightforward manner?
(Please note: this answer looks promising but it generates similar error messages to those shown below, so it doesn't resolve this particular issue.)
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) (from #3) :
embedded nul in string: ' \021เกฑ\032 \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0>\0\003\0 \t\0\006\0\0\0\0\0\0\0\0\0\0\0\001\0\0\09\0\0\0\0\0\0\0\0\020\0\0 \0\0\0\0 \0\0\0\08\0\0\0 \t\b\020\0\0\006\005\0g2 \a \0\002\0\006\006\0\0 \0\002\0 \004 \0\002\0\0\0 \0\0\0\\\0p\0\003\0\0CVC B\0\002\0 \004a\001\002\0\0\0 \001\0\0=\001\002\0$\0 \0\002\0\021\0\031\0\002\0\0\0\022\0\002\0\0\0\023\0\002\0\0\0 \001\002\0\0\0 \001\002\0\0\0=\0\022\0 \017\0xKX/8\0\0\0\
> h$value()[3]
Last-Modified
"Fri, 06 Dec 2013 05:33:53 GMT"
>
library(RCurl)
url.exists("http://www.jatma.or.jp/toukei/xls/13_01.xls", .header=T)["Last-Modified"]
# Last-Modified
# "Fri, 06 Dec 2013 05:33:53 GMT"
I built a web service in a C# web application. I'm returning a list of objects as the web service result. I need to know how to read that list of items one by one in a loop.
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body>
<checkAvailabilityResponse xmlns="http://tempuri.org/">
<checkAvailabilityResult>
<Shedule>
<Sid>int</Sid>
<Fid>int</Fid>
<FromLocation>string</FromLocation>
<FromTime>dateTime</FromTime>
<ToLocation>string</ToLocation>
<ToTime>dateTime</ToTime>
<PriceSeatA>double</PriceSeatA>
<PriceSeatB>double</PriceSeatB>
<PriceSeatC>double</PriceSeatC>
</Shedule>
<Shedule>
<Sid>int</Sid>
<Fid>int</Fid>
<FromLocation>string</FromLocation>
<FromTime>dateTime</FromTime>
<ToLocation>string</ToLocation>
<ToTime>dateTime</ToTime>
<PriceSeatA>double</PriceSeatA>
<PriceSeatB>double</PriceSeatB>
<PriceSeatC>double</PriceSeatC>
</Shedule>
</checkAvailabilityResult>
</checkAvailabilityResponse>
</soap:Body>
</soap:Envelope>
This is the way I tried it:
SriLankanWebService.Service1SoapClient air1 = new AgentPortal.SriLankanWebService.Service1SoapClient();
List<Shedule> air1Response = (List<Shedule>)air1.checkAvailability(drpFrom.SelectedValue.ToString(), drpTo.SelectedValue.ToString(), DateTime.Parse(txtDepartOn.Text));
When I try it, it says:
Error 1 Cannot implicitly convert type 'AgentPortal.SriLankanWebService.Shedule[]' to 'System.Collections.Generic.List<AgentPortal.Shedule>' D:\DCBSD\AgentPortal\AgentPortal\Home.aspx.cs 32 46 AgentPortal
I need to use it in a loop.
Please update the last line of your code from:
List<Shedule> air1Response = (List<Shedule>)air1.checkAvailability(drpFrom.SelectedValue.ToString(), drpTo.SelectedValue.ToString(), DateTime.Parse(txtDepartOn.Text));
to
AgentPortal.SriLankanWebService.Shedule[] air1Response = air1.checkAvailability(drpFrom.SelectedValue.ToString(), drpTo.SelectedValue.ToString(), DateTime.Parse(txtDepartOn.Text));
That will fix the issue; the web service proxy returns an array of Shedule objects, and you can then read its items one by one with a normal foreach loop.