XQuery tumbling window: group by start item of first window - xquery

Using BaseX 9.7.3, I have a sorted list of names that has been produced using a tumbling window clause.
A snippet of the data looks like this:
<data>
<group>
<key id="0c7b0bca-0349-489c-b45f-2612f3134a76">ovid</key>
<key id="f77ab9c2-0be3-4348-809d-ab245e630f81">ovid 43 b c-17 or 18 a d</key>
</group>
<group>
<key id="39b9d6c2-85a5-4c72-a83e-2a52e548fc3b">ovid 43 bc</key>
<key id="acf5b3c0-8fd4-4e0c-950b-a40683bab431">ovid 43 bc-17 ad</key>
<key id="cc57be53-9ca8-4b5e-97cf-1aeca798cded">ovid 43 bc-17 ad or 18 a</key>
<key id="8395e750-1e52-4152-9d37-8c8f4e389fd3">ovid 43 bc-17 ad or 18 ad</key>
</group>
<group>
<key id="0be07fc6-d9bf-4d56-8352-1885b4dd6574">ovid 43 bc-17 or 18</key>
<key id="e3aafc69-56b0-4632-a96c-26ca448c6c2d">ovid 43 bc-17 or 18 ad</key>
</group>
<group>
<key id="f9615365-4a32-442b-9e20-9c5abb0e6fa0">ovide</key>
<key id="c7b45a8d-79a3-4e79-b32b-8d918f67a7b0">ovide 0043 av j-c-0017</key>
</group>
</data>
I would like to further group the data so that, in this example, a group would begin with "ovid" and end with "ovid 43 bc-17 or 18 ad."
Desired output:
<data>
<group>
<key id="0c7b0bca-0349-489c-b45f-2612f3134a76">ovid</key>
<key id="f77ab9c2-0be3-4348-809d-ab245e630f81">ovid 43 b c-17 or 18 a d</key>
<key id="39b9d6c2-85a5-4c72-a83e-2a52e548fc3b">ovid 43 bc</key>
<key id="acf5b3c0-8fd4-4e0c-950b-a40683bab431">ovid 43 bc-17 ad</key>
<key id="cc57be53-9ca8-4b5e-97cf-1aeca798cded">ovid 43 bc-17 ad or 18 a</key>
<key id="8395e750-1e52-4152-9d37-8c8f4e389fd3">ovid 43 bc-17 ad or 18 ad</key>
<key id="0be07fc6-d9bf-4d56-8352-1885b4dd6574">ovid 43 bc-17 or 18</key>
<key id="e3aafc69-56b0-4632-a96c-26ca448c6c2d">ovid 43 bc-17 or 18 ad</key>
</group>
<group>
<key id="f9615365-4a32-442b-9e20-9c5abb0e6fa0">ovide</key>
<key id="c7b45a8d-79a3-4e79-b32b-8d918f67a7b0">ovide 0043 av j-c-0017</key>
</group>
</data>
I have the following query, but it simply reproduces the input document:
<data>{
for tumbling window $entry in /*/group/key
start $s at $sp previous $sprev next $snext when starts-with($snext, $s)
end $e at $ep next $enext when not(starts-with($enext, $e))
return
<group>{
for $k in $entry
return (
<key id="{$k/@id}">{data($k)}</key>
)
}</group>
}</data>
Is it possible to compare the start item of the first group ("ovid") to subsequent entries that start with that token? I want to exclude "ovide," even though it starts with "ovid."

With the extended (Java-like) regular expressions that Saxon supports, I think
for tumbling window $w in /data/group/key
start $s when true()
end next $n when not(matches($n, '^' || $s || '\b', ';j'))
return
<group>{$w}</group>
gives the two groups you want.
I have now also checked that the ';j' flag works with BaseX 9.7.2 as well.
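To check the word-boundary logic outside XQuery, here is a minimal Python sketch of the same idea: a group stays open as long as the next name starts with the group's first name followed by a word boundary. The function name group_by_start_token is made up for illustration. Unlike the XQuery above, the sketch escapes the start item before using it as a pattern, which is an assumption about how names containing regex metacharacters should be treated.

```python
import re

def group_by_start_token(keys):
    """Group consecutive names under the first name of each group."""
    groups = []
    current = None
    pattern = None
    for key in keys:
        if pattern is None or not pattern.match(key):
            # The name does not continue the open group: start a new one.
            current = [key]
            groups.append(current)
            # re.escape is an addition of this sketch (literal match of the start item).
            pattern = re.compile(re.escape(key) + r"\b")
        else:
            current.append(key)
    return groups

keys = [
    "ovid",
    "ovid 43 b c-17 or 18 a d",
    "ovid 43 bc",
    "ovid 43 bc-17 ad",
    "ovid 43 bc-17 ad or 18 a",
    "ovid 43 bc-17 ad or 18 ad",
    "ovid 43 bc-17 or 18",
    "ovid 43 bc-17 or 18 ad",
    "ovide",
    "ovide 0043 av j-c-0017",
]
groups = group_by_start_token(keys)
for group in groups:
    print(group[0], "->", len(group), "entries")
```

"ovide" opens its own group because no word boundary follows "ovid" in it, which is the exclusion asked for in the question.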

Related

Nesting in XPATH

I have the following XML data, for which I would like to create an XPath expression that I think might contain a nested count().
Here is the XML data for 5 CD rentals:
<?xml version="1.0"?>
<DataBase>
<!-- CD Rental 1 -->
<Rental>
<cd>
<title>title1</title>
</cd>
<person uniqueID = "1">
<name>name1</name>
</person>
</Rental>
<!-- CD Rental 2 -->
<Rental>
<cd>
<title>title2</title>
</cd>
<person uniqueID = "2">
<name>name2</name>
</person>
</Rental>
<!-- CD Rental 3 -->
<Rental>
<cd>
<title>title3</title>
</cd>
<person uniqueID = "1">
<name>name1</name>
</person>
</Rental>
<!-- CD Rental 4 -->
<Rental>
<cd>
<title>title4</title>
</cd>
<person uniqueID = "3">
<name>name3</name>
</person>
</Rental>
<!-- CD Rental 5 -->
<Rental>
<cd>
<title>title5</title>
</cd>
<person uniqueID = "2">
<name>name2</name>
</person>
</Rental>
</DataBase>
The XPath I have in mind should count the number of persons who rented multiple CDs.
In the above XML data, the persons named name1 and name2 each rented 2 CDs, while name3 rented only 1 CD, so the answer I am expecting is 2. What could be a possible XPath for this?
One possible XPath expression would be:
count(//name[.=preceding::name][not(. = following::name)])
A brief explanation of the expression inside count():
//name[.=preceding::name]: selects every name element that has a preceding name element with the same value, in other words every repeated occurrence of a name
[not(. = following::name)]: further filters those name elements to keep only the last occurrence of each duplicated name, so each repeated name is counted once
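As a cross-check of the expected result, the same count can be sketched in Python with the standard library. Note the assumption: this sketch groups rentals by the uniqueID attribute rather than by the name text that the XPath compares, and count_repeat_renters is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import Counter

XML = """<DataBase>
  <Rental><cd><title>title1</title></cd><person uniqueID="1"><name>name1</name></person></Rental>
  <Rental><cd><title>title2</title></cd><person uniqueID="2"><name>name2</name></person></Rental>
  <Rental><cd><title>title3</title></cd><person uniqueID="1"><name>name1</name></person></Rental>
  <Rental><cd><title>title4</title></cd><person uniqueID="3"><name>name3</name></person></Rental>
  <Rental><cd><title>title5</title></cd><person uniqueID="2"><name>name2</name></person></Rental>
</DataBase>"""

def count_repeat_renters(xml_text):
    """Count persons (by uniqueID) that appear in more than one Rental."""
    root = ET.fromstring(xml_text)
    rentals_per_person = Counter(p.get("uniqueID") for p in root.iter("person"))
    return sum(1 for n in rentals_per_person.values() if n > 1)

repeat_renters = count_repeat_renters(XML)
print(repeat_renters)
```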

How to parse a complex xml in R into a dataframe?

I want to parse a nested XML file with the layout below in R and load it into a data frame. I tried several ways, including the XML and xml2 packages, but could not get it to work.
<?xml version="1.0" encoding="UTF-8"?>
<Targets>
<Target TYPE="myserver.mgmt.Metric" NAME="metric1">
<Attribute NAME="name" VALUE="metric1"></Attribute>
<Attribute NAME="Value" VALUE="2.4"></Attribute>
<Attribute NAME="collectionTime" VALUE="1525118288000"></Attribute>
<Attribute NAME="State" VALUE="normal"></Attribute>
<Attribute NAME="ObjectName" VALUE="obj1"></Attribute>
<Attribute NAME="ValueHistory" VALUE="5072"></Attribute>
</Target>
...
<Target TYPE="myserver.mgmt.Metric" NAME="metric999">
<Attribute NAME="name" VALUE="metric999"></Attribute>
<Attribute NAME="Value" VALUE="60.35"></Attribute>
<Attribute NAME="collectionTime" VALUE="1525118288000"></Attribute>
<Attribute NAME="State" VALUE="normal"></Attribute>
<Attribute NAME="ObjectName" VALUE="obj1"></Attribute>
<Attribute NAME="ValueHistory" VALUE="9550"></Attribute>
</Target>
</Targets>
The final outcome I am looking to get is:
name Value collectionTime State ObjectName ValueHistory
metric1 2.4 1525118288000 normal obj1 5072
metric2 60.35 1525118288000 normal obj2 9550
Any help is appreciated.
We can make use of XML with tidyverse
library(XML)
library(tidyverse)
lst1 <- getNodeSet(xml1, path = "//Target")
map_df(seq_along(lst1), ~
XML:::xmlAttrsToDataFrame(lst1[[.x]]) %>%
mutate_all(as.character) %>%
deframe %>%
as.list %>%
as_tibble) %>%
mutate_all(type.convert, as.is = TRUE)
# A tibble: 2 x 6
# name Value collectionTime State ObjectName ValueHistory
# <chr> <dbl> <dbl> <chr> <chr> <int>
#1 metric1 2.4 1525118288000 normal obj1 5072
#2 metric999 60.4 1525118288000 normal obj1 9550
data
xml1 <- xmlParse('<?xml version="1.0" encoding="UTF-8"?>
<Targets>
<Target TYPE="myserver.mgmt.Metric" NAME="metric1">
<Attribute NAME="name" VALUE="metric1"></Attribute>
<Attribute NAME="Value" VALUE="2.4"></Attribute>
<Attribute NAME="collectionTime" VALUE="1525118288000"></Attribute>
<Attribute NAME="State" VALUE="normal"></Attribute>
<Attribute NAME="ObjectName" VALUE="obj1"></Attribute>
<Attribute NAME="ValueHistory" VALUE="5072"></Attribute>
</Target>
<Target TYPE="myserver.mgmt.Metric" NAME="metric999">
<Attribute NAME="name" VALUE="metric999"></Attribute>
<Attribute NAME="Value" VALUE="60.35"></Attribute>
<Attribute NAME="collectionTime" VALUE="1525118288000"></Attribute>
<Attribute NAME="State" VALUE="normal"></Attribute>
<Attribute NAME="ObjectName" VALUE="obj1"></Attribute>
<Attribute NAME="ValueHistory" VALUE="9550"></Attribute>
</Target>
</Targets>
')
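For comparison, the reshaping step itself (one row per Target, one column per Attribute NAME) can be sketched in Python with the standard library. This is only an illustration of the idea, not a replacement for the R answer: targets_to_rows is a hypothetical name, and all values are left as strings, whereas the R code converts types with type.convert.

```python
import xml.etree.ElementTree as ET

XML = """<Targets>
  <Target TYPE="myserver.mgmt.Metric" NAME="metric1">
    <Attribute NAME="name" VALUE="metric1"></Attribute>
    <Attribute NAME="Value" VALUE="2.4"></Attribute>
    <Attribute NAME="State" VALUE="normal"></Attribute>
  </Target>
  <Target TYPE="myserver.mgmt.Metric" NAME="metric999">
    <Attribute NAME="name" VALUE="metric999"></Attribute>
    <Attribute NAME="Value" VALUE="60.35"></Attribute>
    <Attribute NAME="State" VALUE="normal"></Attribute>
  </Target>
</Targets>"""

def targets_to_rows(xml_text):
    """Return one dict per Target, keyed by each Attribute's NAME."""
    root = ET.fromstring(xml_text)
    rows = []
    for target in root.iter("Target"):
        # Each Attribute contributes one column: NAME -> VALUE.
        rows.append({a.get("NAME"): a.get("VALUE") for a in target.findall("Attribute")})
    return rows

rows = targets_to_rows(XML)
for row in rows:
    print(row)
```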

Grouping in XQuery FLWOR to generate a space-separated list of values for each key

I have an XML file with 5 articles, each of which looks like this:
<root>
<article>
<c0>
<number>
</number>
<price>
</price>
</c0>
<c1>
<name> NewArtc1
</name>
<nameUs>TheBest
</nameUs>
</c1>
<c2>
<name> c2_1
</name>
<nameUs>NotTheBest
</nameUs>
</c2>
<c2>
<name> c2_2
</name>
<nameUs>TheBest
</nameUs>
</c2>
<c2>
<name> c2_3
</name>
<nameUs>NotTheBest
</nameUs>
</c2>
<c2>
<name> c2_4
</name>
<nameUs>TheBest
</nameUs>
</c2>
<c2>
<name> c2_5
</name>
<nameUs>NotTheBest
</nameUs>
</c2>
</article>
<article> ...
</root>
Each article has several characteristics (c0, c1, c2, etc.). Using XQuery (FLWOR), I need to return the name of c1, followed by the names of those c2 elements whose nameUs text matches the nameUs text of c1 (there are 5 c2 elements per article). The output format should be, for each article:
<c1>
<name>NewArtc1</name>
<c2s>c2_2 c2_4</c2s>
</c1>
<c1>
<name>NewArtc2</name>
<c2s>c2_3 c2_5</c2s>
</c1>
Please help me. I've written this, but it returns pairs, not the output format I need:
for $a in doc("articles.xml")//article, $i in 1 to 5
let $b:=$a/c1/nameUs
let $c:=$a/c2[$i]/nameUs
where $b/text() and $b/text()=$c/text()
return <c1>{$b/name}<c2>{$c/name/text()}</c2></c1>
Maybe try something like...
for $article in /root/article
let $name := $article/c1/name/normalize-space()
let $nameUs := $article/c1/nameUs/normalize-space()
return
<c1>
<name>{$name}</name>
<c2s>{$article/c2[normalize-space(nameUs)=$nameUs]/name/normalize-space()}</c2s>
</c1>
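The filtering logic of that query (normalize whitespace, then keep the c2 names whose nameUs equals the c1 nameUs) can be sketched in Python as follows. The helper names normalize_space and c2_names_matching_c1 are made up for this example, and the sample document is a shortened version of the one in the question.

```python
import xml.etree.ElementTree as ET

XML = """<root>
  <article>
    <c1><name> NewArtc1
    </name><nameUs>TheBest
    </nameUs></c1>
    <c2><name> c2_1
    </name><nameUs>NotTheBest
    </nameUs></c2>
    <c2><name> c2_2
    </name><nameUs>TheBest
    </nameUs></c2>
    <c2><name> c2_4
    </name><nameUs>TheBest
    </nameUs></c2>
  </article>
</root>"""

def normalize_space(text):
    """Rough equivalent of XPath normalize-space(): trim and collapse whitespace."""
    return " ".join((text or "").split())

def c2_names_matching_c1(xml_text):
    """For each article, pair the c1 name with a space-separated list of matching c2 names."""
    root = ET.fromstring(xml_text)
    result = []
    for article in root.iter("article"):
        c1 = article.find("c1")
        name = normalize_space(c1.findtext("name"))
        name_us = normalize_space(c1.findtext("nameUs"))
        matches = [
            normalize_space(c2.findtext("name"))
            for c2 in article.findall("c2")
            if normalize_space(c2.findtext("nameUs")) == name_us
        ]
        result.append((name, " ".join(matches)))
    return result

pairs = c2_names_matching_c1(XML)
print(pairs)
```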

Extract deep XML structure

I have the following XML file that I want to parse using R. The XML has a deep structure and also there are varied number of subnodes.
<?xml version="1.0" encoding="UTF-8"?>
<Alert date="20161223_2" type="full">
<Records>
<Person Id="100">
<PersonNameDetails>
<PersonNames id="Name1">
<ReferenceGroup ReferenceGroupCode="ABC"/>
<ReferenceGroup ReferenceGroupCode="DEF"/>
<PersonNameValue>
<FirstName>Carl Bangouvounda</FirstName>
<Surname>Toziz</Surname>
</PersonNameValue>
</PersonNames>
<PersonNames id="Name2">
<ReferenceGroup ReferenceGroupCode="ABC"/>
<ReferenceGroup ReferenceGroupCode="GHI" ReferenceGroupLanguageCode="en"/>
<ReferenceGroup ReferenceGroupCode="JKL"/>
<ReferenceGroup ReferenceGroupCode="MNO"/>
<ReferenceGroup ReferenceGroupCode="DEF"/>
<PersonNameValue>
<FirstName>Tozize</FirstName>
<Surname>Bangouvonda</Surname>
</PersonNameValue>
</PersonNames>
<PersonNames id="Name3">
<ReferenceGroup ReferenceGroupCode="MNO"/>
<PersonNameValue>
<FirstName>Carol</FirstName>
<Surname>Tozize</Surname>
</PersonNameValue>
</PersonNames>
<PersonNames id="Name4">
<ReferenceGroup ReferenceGroupCode="PQR"/>
<ReferenceGroup ReferenceGroupCode="MNO"/>
<PersonNameValue>
<FirstName>Carol</FirstName>
<MiddleName>Bangouvonda</MiddleName>
<Surname>Tozize</Surname>
</PersonNameValue>
</PersonNames>
<PersonNames id="Name5">
<ReferenceGroup ReferenceGroupCode="GHI" ReferenceGroupLanguageCode="en"/>
<ReferenceGroup ReferenceGroupCode="JKL"/>
<ReferenceGroup ReferenceGroupCode="DEF"/>
<PersonNameValue>
<FirstName>Carl Bangouvonda</FirstName>
<Surname>Toziz</Surname>
</PersonNameValue>
</PersonNames>
</PersonNameDetails>
</Person>
</Records>
</Alert>
The expected output is as below:
-----------------------------------------------------------
Id | id | ReferenceGroup | FirstName | MiddleName | Surname
-----------------------------------------------------------
100 | Name1 | ABC, DEF | Carl Bangouvounda | NA | Toziz
-----------------------------------------------------------
100 | Name2 | ABC, GHI, JKL, MNO, DEF | Tozize | NA | Bangouvonda
-----------------------------------------------------------
100 | Name3 | MNO | Carol | NA | Tozize
-----------------------------------------------------------
100 | Name4 | PQR, MNO | Carol | Bangouvonda | Tozize
-----------------------------------------------------------
100 | Name5 | GHI, JKL, DEF | Carl Bangouvonda | NA | Toziz
-----------------------------------------------------------
Id comes from an attribute of the Person element, and all other columns come from PersonNameDetails. I would also like to concatenate the ReferenceGroupCode values within the same PersonNames element into one string.
I followed the advice to convert to XSLT with the following code:
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" method="xml"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/Alert ">
<xsl:copy>
<xsl:apply-templates select="Records"/>
</xsl:copy>
</xsl:template>
<xsl:template match="Records">
<xsl:apply-templates select="Person"/>
</xsl:template>
<xsl:template match="Person">
<xsl:apply-templates select="PersonNameDetails"/>
</xsl:template>
<xsl:template match="PersonNameDetails">
<xsl:apply-templates select="PersonNames"/>
</xsl:template>
<xsl:template match="PersonNames">
<xsl:apply-templates select="PersonNameValue"/>
</xsl:template>
<xsl:template match="PersonNameValue">
<PersonNameValue>
<Id><xsl:value-of select="ancestor::Person/@Id"/></Id>
<id><xsl:value-of select="ancestor::PersonNames/@id"/></id>
<xsl:copy-of select="FirstName"/>
<MiddleName><xsl:value-of select="MiddleName"/></MiddleName>
<Surname><xsl:value-of select="Surname"/></Surname>
<ReferenceGroupCode><xsl:value-of select="ancestor::PersonNames/ReferenceGroup/@ReferenceGroupCode"/></ReferenceGroupCode>
</PersonNameValue>
</xsl:template>
</xsl:transform>
How can I change the XSLT code so that the ReferenceGroupCode output is the following?
<ReferenceGroupCode>ABC,DEF</ReferenceGroupCode>
Any help is highly appreciated.
Not sure about XSLT, but you could use XPath on the PersonNames nodes and write a function to handle missing or multiple values.
library(XML)
doc <- xmlParse( "<your XML file>")
x <- getNodeSet(doc, "//PersonNames")
xpath2 <-function(x, ...){
y <- xpathSApply(x, ...)
ifelse(length(y) == 0, NA, paste(y, collapse=", "))
}
y <- data.frame(
id = sapply(x, xpath2, ".", xmlGetAttr, "id"),
ReferenceGroup= sapply(x, xpath2, ".//ReferenceGroup", xmlGetAttr, "ReferenceGroupCode"),
FirstName = sapply(x, xpath2, ".//FirstName", xmlValue),
MiddleName = sapply(x, xpath2, ".//MiddleName", xmlValue),
Surname = sapply(x, xpath2, ".//Surname", xmlValue)
)
id ReferenceGroup FirstName MiddleName Surname
1 Name1 ABC, DEF Carl Bangouvounda <NA> Toziz
2 Name2 ABC, GHI, JKL, MNO, DEF Tozize <NA> Bangouvonda
3 Name3 MNO Carol <NA> Tozize
4 Name4 PQR, MNO Carol Bangouvonda Tozize
5 Name5 GHI, JKL, DEF Carl Bangouvonda <NA> Toziz
And maybe add the Person Id by counting the number of PersonNames nodes?
n <- xpathSApply(doc, "//Person/PersonNameDetails", xmlSize)
y$ID <- rep( xpathSApply(doc, "//Person", xmlGetAttr, "Id"), n)
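The same flattening (one row per PersonNames element, ReferenceGroupCode values joined into one string, missing children left empty) can be sketched in Python with the standard library. This is an illustration under assumptions: person_name_rows is a hypothetical name, the sample contains only two of the five PersonNames elements, and missing values come back as None rather than NA.

```python
import xml.etree.ElementTree as ET

XML = """<Alert date="20161223_2" type="full">
  <Records>
    <Person Id="100">
      <PersonNameDetails>
        <PersonNames id="Name3">
          <ReferenceGroup ReferenceGroupCode="MNO"/>
          <PersonNameValue><FirstName>Carol</FirstName><Surname>Tozize</Surname></PersonNameValue>
        </PersonNames>
        <PersonNames id="Name4">
          <ReferenceGroup ReferenceGroupCode="PQR"/>
          <ReferenceGroup ReferenceGroupCode="MNO"/>
          <PersonNameValue><FirstName>Carol</FirstName><MiddleName>Bangouvonda</MiddleName><Surname>Tozize</Surname></PersonNameValue>
        </PersonNames>
      </PersonNameDetails>
    </Person>
  </Records>
</Alert>"""

def person_name_rows(xml_text):
    """Return one dict per PersonNames element, carrying the parent Person Id."""
    root = ET.fromstring(xml_text)
    rows = []
    for person in root.iter("Person"):
        for names in person.iter("PersonNames"):
            codes = [rg.get("ReferenceGroupCode") for rg in names.findall("ReferenceGroup")]
            value = names.find("PersonNameValue")
            rows.append({
                "Id": person.get("Id"),
                "id": names.get("id"),
                "ReferenceGroup": ", ".join(codes),
                "FirstName": value.findtext("FirstName"),
                "MiddleName": value.findtext("MiddleName"),  # None when absent
                "Surname": value.findtext("Surname"),
            })
    return rows

name_rows = person_name_rows(XML)
for row in name_rows:
    print(row)
```

Iterating from each Person down to its PersonNames avoids the separate step of repeating the Id by node count.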

Weighted edges in R/igraph

I'm using R & the igraph package to plot a graph written in graphml and I want to use the weight parameter included in this syntax
<edge id="e389" source="w4" target="w0">
<data key="d1">0.166666666667</data>
</edge>
I can get the values with
weight = E(f)$weight # f is the graph
but I don't know how to use weight before calculating the df = degree(f)
For further information: all nodes are connected to each other and the weight is 1 / (number_of_nodes - 1) so the degree for each node should be 1.
graphml file
<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
<key id="d0" for="node" attr.name="label" attr.type="string"/>
<key id="d1" for="edge" attr.name="weight" attr.type="float"/>
<key id="d2" for="node" attr.name="type" attr.type="string"/>
<key id="d3" for="node" attr.name="tweet" attr.type="int"/>
<key id="d4" for="node" attr.name="color" attr.type="string"/>
<graph id="G" edgedefault="undirected">
<node id="w4">
<data key="d0">value1</data>
<data key="d2">word</data>
<data key="d1">0.166666666667</data>
<data key="d4">green</data>
</node>
.
.
.
<node id="w2">
<data key="d0">value2</data>
<data key="d2">word</data>
<data key="d1">0.166666666667</data>
<data key="d4">green</data>
</node>
<edge id="e389" source="w4" target="w0">
<data key="d1">0.166666666667</data>
</edge>
Most likely degree() is not what you are looking for, because it ignores edge weights. Are you perhaps looking for the graph.strength() function?
# create fully connected graph
g <- graph.full(10)
# assign weights such that every weight is 1 / (number_of_nodes - 1)
E(g)$weight <- 1/( length( V(g) ) -1 )
# calculate the "weighted degree"
graph.strength(g)
[1] 1 1 1 1 1 1 1 1 1 1
Alternatively, are you maybe looking for the normalized degree?
degree( g, normalized = TRUE )
[1] 1 1 1 1 1 1 1 1 1 1
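To make the difference from a plain degree concrete, here is a small Python sketch of what a "weighted degree" (strength) computes: for each node, the sum of the weights of its incident edges. It does not use igraph; the function name strength and the seven node ids are assumptions chosen to match the 1/6 weight in the question.

```python
def strength(nodes, edges):
    """Sum the weights of all edges touching each node (undirected graph)."""
    totals = {node: 0.0 for node in nodes}
    for source, target, weight in edges:
        totals[source] += weight
        totals[target] += weight
    return totals

# Fully connected graph on 7 nodes, each edge weighted 1 / (number_of_nodes - 1).
nodes = [f"w{i}" for i in range(7)]
weight = 1 / (len(nodes) - 1)
edges = [(a, b, weight) for i, a in enumerate(nodes) for b in nodes[i + 1:]]

result = strength(nodes, edges)
for node, total in result.items():
    print(node, round(total, 6))
```

Every node has 6 incident edges of weight 1/6, so each total comes out at 1 up to floating-point rounding.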
