R XML parsing to data-frame - r

I have various XML files with information as shown below. I'm having difficulty parsing this variable XML format into a dataframe that can handle both differing numbers of metrics and duplicated Properties tags.
<ProducedFruits>
<FruitType>
<FruitName>Apple</FruitName>
<FruitMetrics>
<Properties Sugars="27.51" Rate="5.03" />
<Properties Sugars="219.39" Rate="12.19" />
<Properties Sugars="266.34" Rate="75.9" />
</FruitMetrics>
</FruitType>
<FruitType>
<FruitName>Lime</FruitName>
<FruitMetrics>
<Properties Sugars="1884.2" Rate="5" />
<Properties Sugars="1884.2" Rate="98.3" />
</FruitMetrics>
</FruitType>
<FruitType>
<FruitName>Lemon</FruitName>
<FruitMetrics>
<Properties Sugars="1064.77" Rate="5" />
<Properties Sugars="1064.77" Rate="56" />
</FruitMetrics>
</FruitType>
<FruitType>
<FruitName>Banana</FruitName>
<FruitMetrics>
<Properties Sugars="113" Rate="12" />
<Properties Sugars="113" Rate="79" />
</FruitMetrics>
</FruitType>
</ProducedFruits>
Each file may be somewhat different, so ideally I would like to create something that can handle the inconsistent number of values, preserves the fruit name, and creates a dataframe like the one at the bottom.
[image: the desired dataframe]

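To read your XML into R as a dataframe you can use the XML package (https://cran.r-project.org/web/packages/XML/). Parse the file with data <- XML::xmlParse("doc.xml"), convert the parsed document to a nested list with xml_data <- XML::xmlToList(data), and then try xml_df <- as.data.frame(xml_data) (per: How to parse XML to R data frame). Note that as.data.frame() on such a nested list may not give one row per Properties element, so files with repeated tags usually need the nodes extracted explicitly.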
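For XML shaped like the sample in the question, one option is to walk the FruitType nodes and build one row per Properties element. The following is a minimal sketch with the XML package, assuming the element and attribute names from the question (FruitType, FruitName, FruitMetrics, Properties, Sugars, Rate). The inline fruit_xml string and the fruit_df name are illustrative only, and the sample is shortened to two fruits.

```r
library(XML)

# Shortened copy of the sample from the question (illustrative input)
fruit_xml <- '<ProducedFruits>
<FruitType>
<FruitName>Apple</FruitName>
<FruitMetrics>
<Properties Sugars="27.51" Rate="5.03" />
<Properties Sugars="219.39" Rate="12.19" />
<Properties Sugars="266.34" Rate="75.9" />
</FruitMetrics>
</FruitType>
<FruitType>
<FruitName>Lime</FruitName>
<FruitMetrics>
<Properties Sugars="1884.2" Rate="5" />
<Properties Sugars="1884.2" Rate="98.3" />
</FruitMetrics>
</FruitType>
</ProducedFruits>'

doc <- xmlParse(fruit_xml, asText = TRUE)

# Build one data frame per FruitType, with one row per Properties child,
# so the number of Properties elements can differ between fruits
rows <- lapply(getNodeSet(doc, "//FruitType"), function(fruit) {
  props <- getNodeSet(fruit, "./FruitMetrics/Properties")
  data.frame(
    FruitName = rep(xmlValue(fruit[["FruitName"]]), length(props)),
    Sugars = as.numeric(sapply(props, xmlGetAttr, "Sugars")),
    Rate = as.numeric(sapply(props, xmlGetAttr, "Rate")),
    stringsAsFactors = FALSE
  )
})

# Stack the per-fruit pieces into a single data frame
fruit_df <- do.call(rbind, rows)
print(fruit_df)
```

To read from a file instead, the xmlParse(fruit_xml, asText = TRUE) call could be swapped for xmlParse("doc.xml"). The fruit name is repeated on every row so it stays attached to each Sugars/Rate pair.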

Related

R read in largish XML file containing multiple tables

I want to efficiently read in an XML File (200mb in Size) consisting of multiple tables.
Sketch of the Structure:
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:od="urn:schemas-microsoft-com:officedata">
<xsd:schema>
<xsd:element name="dataroot">
<xsd:element name="dataroot">
</xsd:schema>
<dataroot>
<TABLE1>
[here comes the data]
</TABLE1>
<TABLE2>
[here comes the data]
</TABLE2>
...
</dataroot>
</root>
How do I read in all or specific tables into a data.frame? And what is probably the most efficient way to do so?
Perhaps as a starter:
library(XML)
library(data.table)
xmldoc <- xmlParse("data.xml")
d <- getNodeSet(xmldoc, "//dataroot//TABLE1")
size <- xmlSize(d)
dt <- rbindlist(lapply(1:size, function(i) {
as.list(getChildrenStrings(d[[i]]))
}), fill = TRUE)
This works OK but is not particularly fast. How can I do this with xml2? The package docs are not particularly enlightening.
Also, I want to loop over all tables but couldn't figure out the XPath.
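As a possible starting point for the xml2 part of the question, the sketch below selects every record under dataroot, groups the records by element name, and binds each group into one table. It assumes the layout sketched above (records are direct children of dataroot, fields are direct children of each record, and no default namespace is declared). The inline document, the read_table helper, and the field names id, name and x are hypothetical. No speed claim is made here; timing on the real 200 MB file would be needed.

```r
library(xml2)
library(data.table)

# Tiny stand-in for the real file (hypothetical field names)
doc <- read_xml('<root><dataroot>
<TABLE1><id>1</id><name>a</name></TABLE1>
<TABLE1><id>2</id></TABLE1>
<TABLE2><x>9</x></TABLE2>
</dataroot></root>')

# Every direct child of dataroot is one record of some table
records <- xml_find_all(doc, "/root/dataroot/*")

# Group record positions by element name (TABLE1, TABLE2, ...)
by_table <- split(seq_along(records), xml_name(records))

# Turn the records of one table into a data.table, one row per record
read_table <- function(idx) {
  rows <- lapply(records[idx], function(rec) {
    fields <- xml_children(rec)
    as.list(setNames(xml_text(fields), xml_name(fields)))
  })
  rbindlist(rows, fill = TRUE)  # fill = TRUE pads missing fields with NA
}

tables <- lapply(by_table, read_table)
print(tables$TABLE1)
```

For a file on disk, read_xml("data.xml") would replace the inline string. A single table can be read by restricting the XPath, for example xml_find_all(doc, "/root/dataroot/TABLE1").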

How to set weight for geo-elem-pair-query in MarkLogic structure query?

I'm trying to rewrite the following cts query using XML structure query:
cts:element-pair-geospatial-query(
fn:QName("http://www.example.com/2009/foo","wgs84"),
fn:QName("http://www.example.com/2009/foo","latitude"),
fn:QName("http://www.example.com/2009/foo","longitude"),
cts:circle("#12 53.411541,-2.9900994"),
("coordinate-system=wgs84","score-function=reciprocal","slope-factor=4"),
32
)
I converted it into:
<geo-elem-pair-query>
<parent ns="http://www.example.com/2009/foo" name="wgs84" />
<lat ns="http://www.example.com/2009/foo" name="latitude" />
<lon ns="http://www.example.com/2009/foo" name="longitude" />
<fragment-scope>documents</fragment-scope>
<geo-option>coordinate-system=wgs84</geo-option>
<geo-option>score-function=reciprocal</geo-option>
<geo-option>slope-factor=4</geo-option>
<circle>
<radius>12.0</radius>
<point>
<latitude>53.411541</latitude>
<longitude>-2.9900994</longitude>
</point>
</circle>
</geo-elem-pair-query>
Unfortunately, I don't know how to add a weight to <geo-elem-pair-query>. According to the MarkLogic documentation, this seems to be unsupported (although the cts equivalent supports it). I've tried adding <weight>32.0</weight>, but it doesn't work.
Do you know if there is a way to add weight to geo-elem-pair-query structure query?

Drillthrough to underlying text data in icCube?

How can I set up a model in icCube that allows drilling down to the details when the details contain text fields?
The idea is to get a list whose columns include the text fields (in combination with amount fields), just like a simple SQL statement would give.
I have tried the following:
a) Added a technical dimension that is linked to the rows (via row number) and added MIN aggregation for the text fields, with the idea of using these when a DRILLTHROUGH MDX statement is invoked. The DRILLTHROUGH function works, but it does not give the values for the measures next to each other. The result is like:
b) Gave each unique line a line number and loaded the line number as the lowest detail in one of the dimensions. Added attributes for the text and date items for the "drillthrough" columns. Next, added calculated measures to get the property for these attributes. The drillthrough is now effectively a drill-by to the lowest details. It works, but it is not nice because it blows up my dimension.
c) Tried to use the widget data source SQL, but it is not available for text files, and it does not work for MS Access files (too slow).
The preferred solution should work in the dashboards and in any XMLA/REST API interface.
An example is enclosed below.
The schema file:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<schemaFactory revisionNumber="7">
<schemaDefinition name="drilltrhough-text" description="" group="Issues" loadOnStartup="false">
<activateIncrementalLoad>false</activateIncrementalLoad>
<useUnknownMembersInFacts>true</useUnknownMembersInFacts>
<autoCleanUpTableColumns>false</autoCleanUpTableColumns>
<useFactPartitioning>false</useFactPartitioning>
<callGarbageCollector>NONE</callGarbageCollector>
<backup>NONE</backup>
<nonEmptyCachePolicy>NONE</nonEmptyCachePolicy>
<nonEmptyCacheType>REGULAR</nonEmptyCacheType>
<nonEmptyCachePersistency>MEMORY</nonEmptyCachePersistency>
<storagePolicy>DEFAULT</storagePolicy>
<hierarchyUniqueNameStyle>IncludeDimensionName</hierarchyUniqueNameStyle>
<inMemoryDS name="data">
<memoryDataTable tableName="data" rowLimit="-1" id="d9429713-9be8-4c63-9b40-4a20388e7563">
<column name="dimension" tableType="STRING" type="STRING" selected="true" primaryKey="false"/>
<column name="amount" tableType="STRING" type="STRING" selected="true" primaryKey="false"/>
<column name="text" tableType="STRING" type="STRING" selected="true" primaryKey="false"/>
<addRowNumber>false</addRowNumber>
<stringDateConverter></stringDateConverter>
<trimStrings>true</trimStrings>
<columnSeparator>,</columnSeparator>
<commentMarker>#</commentMarker>
<dataAsString>dimension, amount, text
a, 10,some text
b, 20, some more text
c, ,text without an amount</dataAsString>
</memoryDataTable>
</inMemoryDS>
<multiLevelDimension dataTableId="d9429713-9be8-4c63-9b40-4a20388e7563" isTimeDimension="false" isDefaultTimeDimension="false" isIndexingByRange="false" id="86d118f0-71ba-4826-a6ac-343eac96fb05" name="Dimension">
<multiLevelHierarchy hasAllLevel="true" allLevelName="All-Level" allMemberName="All" name="Dimension" isDefault="true">
<level name="Dimension - L" nameUnique="false" nameUniqueInParent="false" keyUnique="false" ignoreNameCollision="false">
<nameCol name="dimension"/>
<orderType>BY_NAME</orderType>
<orderKind>ASC</orderKind>
</level>
</multiLevelHierarchy>
</multiLevelDimension>
<cube id="caa9c520-f953-4c77-9e72-76c8668170f7" name="Cube">
<defaultFacts measureGroupName="Facts" partitioningLevelName="" partitioningType="NONE" newGeneration="true" dataTableId="d9429713-9be8-4c63-9b40-4a20388e7563" aggregateDataSourceFacts="false" unresolvedRowsBehavior="ERROR">
<rowFactAggregationType>ADD_ROW</rowFactAggregationType>
<measure name="Amount" aggregationType="SUM">
<dataColumn name="amount"/>
</measure>
<measure name="Text" aggregationType="MIN">
<dataColumn name="text"/>
</measure>
<links dimensionId="86d118f0-71ba-4826-a6ac-343eac96fb05">
<viewLinks type="LAST_LEVEL">
<toColumns name="dimension"/>
</viewLinks>
</links>
</defaultFacts>
</cube>
</schemaDefinition>
</schemaFactory>
The MDX:
drillthrough
select [Measures].members on 0
, [Dimension].[Dimension].[Dimension - L] on 1
from [cube]
return Name([Dimension])
The result:
This is not related to having a measure of type STRING.
You're performing a multi-cell result drillthrough (which is an extension of standard MDX in icCube). In that case, the result is "organized" per result cell, meaning each of the [Measures] ends up in its own category (you can add another Amount measure and you'll see the same behavior).
Instead, you should perform a single-cell drillthrough:
drillthrough
select [Dimension].[Dimension].[Dimension - L].[a] on 0
from [cube]
And the result should look like:
You can see that [Measures].[Info] is on the same row as all the other measures.
Hope that helps.

Xquery html formatting

I'm new to XQuery. I have a requirement to rewrite an API response into a custom XML format.
Input file format:
<root>
<_1>
<dataType>
<name>XVar(Osmo [mOsmol/kg])</name>
<term>M185</term>
<type>XVar</type>
</dataType>
<values>305</values>
<values>335</values>
</_1>
<_2>
<dataType>
<name>XVar(DO (2) [%])</name>
<term>M199</term>
<type>XVar</type>
</dataType>
<values>12</values>
<values>33</values>
</_2>
<_3>
<dataType>
<name>Maturity</name>
<type>Maturity</type>
</dataType>
<values>0</values>
<values>0.73600054</values>
</_3>
</root>
Expected output:
<element>
<XVar(Osmo [mOsmol/kg])>305</XVar(Osmo [mOsmol/kg])>
<XVar(Osmo [mOsmol/kg])>335</XVar(Osmo [mOsmol/kg])>
<XVar(DO (2) [%])>12</XVar(DO (2) [%])>
<XVar(DO (2) [%])>33</XVar(DO (2) [%])>
<Maturity>0</Maturity>
<Maturity>0.73600054</Maturity>
</element>
The number of nodes (dataType -> name) will vary between input files, and the values are dynamic.
I am currently using the code below:
let $input := /root
for $i in $input//values
return
<element>
<name>{$i/../dataType/name/text()}</name>
<values>{$i/text()}</values>
</element>
But all the data comes out in <name> and <values> elements. My requirement is to use {$i/../dataType/name/text()} as the node name and {$i/text()} as its value.
For the sample input file, there should ideally be three different node names with their values.
Can anyone help me with this?

Reading xml data from oracle table column and parsing it in R

I have a scenario in R.
I have connected to an Oracle database from R through the RODBC package, and one column of a table contains XML data. When I call the xmlParse function on it, it raises the error "XML content does not seem to be XML", and class(xmldata) is data.frame.
When I copy the XML data into a new XML file and parse that with xmlParse, it is parsed correctly and class(sourcefile) is XMLInternalDocument.
The error is raised because you are running XML::xmlParse on a dataframe object, which is the return value of RODBC::sqlQuery(), and not on the underlying XML content. Simply index the column and row to get the specific XML content.
As an example, the code below reads an XML document (the top 5 Stack Overflow users in the R tag) into a dataframe, runs xmlParse to reproduce the error, and then runs another xmlParse call that resolves it.
Dataframe build (replicating sqlQuery)
library(XML)
txt <- '<?xml version="1.0"?>
<stackoverflow>
<group lang="r">
<topusers>
<user>akrun</user>
<link>https://stackoverflow.com/users/3732271/akrun</link>
<location>Bengaluru, Karnataka, India</location>
<year_rep>15,900</year_rep>
<total_rep>328,573</total_rep>
<tag1>r</tag1>
<tag2>dataframe</tag2>
<tag3>dplyr</tag3>
</topusers>
<topusers>
<user>Dirk Eddelbuettel</user>
<link>https://stackoverflow.com/users/143305/dirk-eddelbuettel</link>
<location>Chicago, IL, United States </location>
<year_rep>5,588</year_rep>
<total_rep>253,481</total_rep>
<tag1>r</tag1>
<tag2>rcpp</tag2>
<tag3>c++</tag3>
</topusers>
<topusers>
<user>42-</user>
<link>https://stackoverflow.com/users/1855677/42</link>
<location>Alameda, CA</location>
<year_rep>4,143</year_rep>
<total_rep>193,407</total_rep>
<tag1>r</tag1>
<tag2>dataframe</tag2>
<tag3>plot</tag3>
</topusers>
<topusers>
<user>A5C1D2H2I1M1N2O1R2T1</user>
<link>https://stackoverflow.com/users/1270695/a5c1d2h2i1m1n2o1r2t1</link>
<location>Chennai, India</location>
<year_rep>3,982</year_rep>
<total_rep>141,425</total_rep>
<tag1>r</tag1>
<tag2>dataframe</tag2>
<tag3>reshape</tag3>
</topusers>
<topusers>
<user>Gavin Simpson</user>
<link>https://stackoverflow.com/users/429846/gavin-simpson</link>
<location>Regina, Canada </location>
<year_rep>2,780</year_rep>
<total_rep>124,779</total_rep>
<tag1>r</tag1>
<tag2>plot</tag2>
<tag3>dataframe</tag3>
</topusers>
</group>
</stackoverflow>'
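# Keep the XML as character, not factor (factor was the default before R 4.0)
res <- data.frame(Col1 = txt, stringsAsFactors = FALSE)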
Error line
result1 <- xmlParse(res, asText=TRUE)
# Error: XML content does not seem to be XML: '1'
Resolved line (which yields no error)
# SINGLE XML
result1 <- xmlParse(res$Col1[[1]], asText=TRUE)
# MULTIPLE XML (ACROSS ALL ROWS)
result_list <- lapply(res$Col1, xmlParse, asText=TRUE)
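Once each cell parses, a natural next step is to flatten the parsed documents into one dataframe. The sketch below is one way to do that with XML::xmlToDataFrame, under stated assumptions: the two-row res dataframe, the Col1 column name, the parse_cell helper, and the user/name/rep elements are all hypothetical stand-ins for whatever sqlQuery() actually returns.

```r
library(XML)

# Stand-in for the result of RODBC::sqlQuery(): one XML string per row
# (column name Col1 and the XML layout are assumptions for illustration)
res <- data.frame(
  Col1 = c(
    "<users><user><name>a</name><rep>10</rep></user></users>",
    "<users><user><name>b</name><rep>20</rep></user><user><name>c</name><rep>30</rep></user></users>"
  ),
  stringsAsFactors = FALSE
)

# Parse one cell and return its <user> records as a data frame
parse_cell <- function(x) {
  doc <- xmlParse(x, asText = TRUE)
  xmlToDataFrame(nodes = getNodeSet(doc, "//user"), stringsAsFactors = FALSE)
}

# Apply to every row and stack the results
users_df <- do.call(rbind, lapply(res$Col1, parse_cell))
print(users_df)
```

xmlToDataFrame returns every column as character, so numeric fields such as rep would still need as.numeric() afterwards.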

Resources