Extract deep XML structure in R

I have the following XML file that I want to parse using R. The XML has a deep structure, and the number of subnodes varies.
<?xml version="1.0" encoding="UTF-8"?>
<Alert date="20161223_2" type="full">
<Records>
<Person Id="100">
<PersonNameDetails>
<PersonNames id="Name1">
<ReferenceGroup ReferenceGroupCode="ABC"/>
<ReferenceGroup ReferenceGroupCode="DEF"/>
<PersonNameValue>
<FirstName>Carl Bangouvounda</FirstName>
<Surname>Toziz</Surname>
</PersonNameValue>
</PersonNames>
<PersonNames id="Name2">
<ReferenceGroup ReferenceGroupCode="ABC"/>
<ReferenceGroup ReferenceGroupCode="GHI" ReferenceGroupLanguageCode="en"/>
<ReferenceGroup ReferenceGroupCode="JKL"/>
<ReferenceGroup ReferenceGroupCode="MNO"/>
<ReferenceGroup ReferenceGroupCode="DEF"/>
<PersonNameValue>
<FirstName>Tozize</FirstName>
<Surname>Bangouvonda</Surname>
</PersonNameValue>
</PersonNames>
<PersonNames id="Name3">
<ReferenceGroup ReferenceGroupCode="MNO"/>
<PersonNameValue>
<FirstName>Carol</FirstName>
<Surname>Tozize</Surname>
</PersonNameValue>
</PersonNames>
<PersonNames id="Name4">
<ReferenceGroup ReferenceGroupCode="PQR"/>
<ReferenceGroup ReferenceGroupCode="MNO"/>
<PersonNameValue>
<FirstName>Carol</FirstName>
<MiddleName>Bangouvonda</MiddleName>
<Surname>Tozize</Surname>
</PersonNameValue>
</PersonNames>
<PersonNames id="Name5">
<ReferenceGroup ReferenceGroupCode="GHI" ReferenceGroupLanguageCode="en"/>
<ReferenceGroup ReferenceGroupCode="JKL"/>
<ReferenceGroup ReferenceGroupCode="DEF"/>
<PersonNameValue>
<FirstName>Carl Bangouvonda</FirstName>
<Surname>Toziz</Surname>
</PersonNameValue>
</PersonNames>
</PersonNameDetails>
</Person>
</Records>
</Alert>
The expected output is as below:
Id  | id    | ReferenceGroup          | FirstName         | MiddleName  | Surname
----|-------|-------------------------|-------------------|-------------|------------
100 | Name1 | ABC, DEF                | Carl Bangouvounda | NA          | Toziz
100 | Name2 | ABC, GHI, JKL, MNO, DEF | Tozize            | NA          | Bangouvonda
100 | Name3 | MNO                     | Carol             | NA          | Tozize
100 | Name4 | PQR, MNO                | Carol             | Bangouvonda | Tozize
100 | Name5 | GHI, JKL, DEF           | Carl Bangouvonda  | NA          | Toziz
Id comes from the Person element's attribute; all the other columns come from PersonNameDetails. I would also like to concatenate the ReferenceGroupCode values into one string within the same PersonNames element.
I followed advice to preprocess the file with XSLT, using the following stylesheet:
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" method="xml"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/Alert ">
<xsl:copy>
<xsl:apply-templates select="Records"/>
</xsl:copy>
</xsl:template>
<xsl:template match="Records">
<xsl:apply-templates select="Person"/>
</xsl:template>
<xsl:template match="Person">
<xsl:apply-templates select="PersonNameDetails"/>
</xsl:template>
<xsl:template match="PersonNameDetails">
<xsl:apply-templates select="PersonNames"/>
</xsl:template>
<xsl:template match="PersonNames">
<xsl:apply-templates select="PersonNameValue"/>
</xsl:template>
<xsl:template match="PersonNameValue">
<PersonNameValue>
<Id><xsl:value-of select="ancestor::Person/@Id"/></Id>
<id><xsl:value-of select="ancestor::PersonNames/@id"/></id>
<xsl:copy-of select="FirstName"/>
<MiddleName><xsl:value-of select="MiddleName"/></MiddleName>
<Surname><xsl:value-of select="Surname"/></Surname>
<ReferenceGroupCode><xsl:value-of select="ancestor::PersonNames/ReferenceGroup/@ReferenceGroupCode"/></ReferenceGroupCode>
</PersonNameValue>
</xsl:template>
</xsl:transform>
How should I change the XSLT so that the ReferenceGroup output becomes
<ReferenceGroupCode>ABC,DEF</ReferenceGroupCode>
Any help is highly appreciated.
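The usual XSLT 1.0 idiom for joining values is an xsl:for-each that writes a separator after every item except the last. An untested sketch of the changed ReferenceGroupCode element, with the rest of the stylesheet as above:
<ReferenceGroupCode>
<xsl:for-each select="ancestor::PersonNames/ReferenceGroup">
<xsl:value-of select="@ReferenceGroupCode"/>
<!-- comma between values, none after the last -->
<xsl:if test="position() != last()">,</xsl:if>
</xsl:for-each>
</ReferenceGroupCode>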

I'm not sure about XSLT, but you could use XPath on the PersonNames nodes (with the XML package) and write a small helper to handle missing or multiple values.
library(XML)
doc <- xmlParse("<your XML file>")
x <- getNodeSet(doc, "//PersonNames")
# apply an XPath query and collapse the result:
# NA when nothing matches, comma-separated when there are several matches
xpath2 <- function(x, ...){
  y <- xpathSApply(x, ...)
  ifelse(length(y) == 0, NA, paste(y, collapse = ", "))
}
y <- data.frame(
  id = sapply(x, xpath2, ".", xmlGetAttr, "id"),
  ReferenceGroup = sapply(x, xpath2, ".//ReferenceGroup", xmlGetAttr, "ReferenceGroupCode"),
  FirstName = sapply(x, xpath2, ".//FirstName", xmlValue),
  MiddleName = sapply(x, xpath2, ".//MiddleName", xmlValue),
  Surname = sapply(x, xpath2, ".//Surname", xmlValue)
)
id ReferenceGroup FirstName MiddleName Surname
1 Name1 ABC, DEF Carl Bangouvounda <NA> Toziz
2 Name2 ABC, GHI, JKL, MNO, DEF Tozize <NA> Bangouvonda
3 Name3 MNO Carol <NA> Tozize
4 Name4 PQR, MNO Carol Bangouvonda Tozize
5 Name5 GHI, JKL, DEF Carl Bangouvonda <NA> Toziz
And to add the Person Id, repeat each Person's Id attribute once per PersonNames node under it:
n <- xpathSApply(doc, "//Person/PersonNameDetails", xmlSize)
y$ID <- rep( xpathSApply(doc, "//Person", xmlGetAttr, "Id"), n)
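To get the expected column order, you could then move ID to the front:
y <- y[, c("ID", "id", "ReferenceGroup", "FirstName", "MiddleName", "Surname")]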

Related

Problem with replacing a comma with a period

I replace commas with periods in a data.frame column:
data[,22] <- as.numeric(sub(",", ".", sub(".", "", data[,22], fixed=TRUE), fixed=TRUE))
But some values already look like 110.00, 120.00, 130.00..., and after the replacement they come out as 11000.0, 12000.0, 13000.0.
What I would like to get is 110.0, 120.0, 130.0...
My data.frame column 22:
| n |
|--------|
| 92,5 |
| 94,5 |
| 96,5 |
| 110.00|
| 120.00|
| 130.00|
What I want to get:
| n |
|--------|
| 92.5 |
| 94.5 |
| 96.5 |
| 110.0|
| 120.0|
| 130.0|
or
| n |
|--------|
| 92.5 |
| 94.5 |
| 96.5 |
| 110.00|
| 120.00|
| 130.00|
Don't replace the periods: your inner sub(".", "", fixed=TRUE) deletes the first period, which is exactly what turns 110.00 into 11000. The values with periods are already in the format you want, so replace only the commas and convert to numeric.
data[[22]] <- as.numeric(sub(',', '.', fixed = TRUE, data[[22]]))
Using str_replace:
library(stringr)
data[[22]] <- as.numeric(str_replace(data[[22]], fixed(","), "."))
You can use gsub like below:
transform(
  df,
  n = as.numeric(gsub("\\D", ".", n))
)
where every non-digit character, i.e. "," or ".", is replaced by ".".
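A quick sanity check on both value styles (this relies on each value containing exactly one decimal separator):
as.numeric(gsub("\\D", ".", c("92,5", "110.00")))
#> [1]  92.5 110.0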

Nesting in XPATH

I have the following XML data, for which I would like to create an XPath statement that I think might contain a nested count().
Here is the XML data for 5 CD rentals:
<?xml version="1.0"?>
<DataBase>
<!-- CD Rental 1 -->
<Rental>
<cd>
<title>title1</title>
</cd>
<person uniqueID = "1">
<name>name1</name>
</person>
</Rental>
<!-- CD Rental 2 -->
<Rental>
<cd>
<title>title2</title>
</cd>
<person uniqueID = "2">
<name>name2</name>
</person>
</Rental>
<!-- CD Rental 3 -->
<Rental>
<cd>
<title>title3</title>
</cd>
<person uniqueID = "1">
<name>name1</name>
</person>
</Rental>
<!-- CD Rental 4 -->
<Rental>
<cd>
<title>title4</title>
</cd>
<person uniqueID = "3">
<name>name3</name>
</person>
</Rental>
<!-- CD Rental 5 -->
<Rental>
<cd>
<title>title5</title>
</cd>
<person uniqueID = "2">
<name>name2</name>
</person>
</Rental>
</DataBase>
The XPath I had in mind was:
Count the number of persons who rented multiple CDs.
In the XML above, the persons named name1 and name2 each rented 2 CDs, while name3 rented only 1, so the answer I expect is 2. What could be a possible XPath for this?
One possible XPath expression would be:
count(//name[.=preceding::name][not(. = following::name)])
xpathtester demo
A brief explanation of the expression inside count():
//name[.=preceding::name]: find all name elements that have a preceding name element with the same value, in other words the duplicated names
[not(. = following::name)]: further filter those elements to keep only the last occurrence of each duplicated name (XPath's approximation of "distinct")
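To sanity-check the result from R (a sketch using the XML package; rentals.xml is a hypothetical file name for the document above):
library(XML)
doc <- xmlParse("rentals.xml")              # hypothetical file name
nm <- xpathSApply(doc, "//name", xmlValue)  # all renter names
sum(table(nm) > 1)                          # persons with more than one rental
#> [1] 2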

Dynamically convert date to Timestamp [without mentioning date format] in spark scala/python [duplicate]

This question already has answers here:
Cast column containing multiple string date formats to DateTime in Spark
(3 answers)
Closed 5 years ago.
I have a requirement to convert raw date strings to timestamps.
data
id,date,date1,date2,date3
1,161129,19960316,992503,20140205
2,961209,19950325,992206,20140503
3,110620,19960522,991610,20131302
4,160928,19930506,992205,20160112
5,021002,20000326,991503,20131112
6,160721,19960909,991212,20151511
7,160721,20150101,990809,20140809
8,100903,20151212,990605,20011803
9,070713,20170526,990702,19911010
Here the columns "date", "date1", "date2" and "date3" hold dates as strings. I usually convert a raw date with unix_timestamp("<col>","<format>").cast("timestamp"), but now I don't want to hard-code the format: I want a dynamic method, because more columns may be added to my table later, and a static approach won't cope with that.
Some columns hold 6-character dates, where the first 2 characters are the year and the next 4 are the day and month, i.e. yyddmm or yymmdd. Other columns hold 8-character dates, where the first 4 characters are the year and the next 4 are the day and month, i.e. yyyyddmm or yyyymmdd.
Within each column the format is consistent, so it has to be detected dynamically and the column converted to a timestamp without hard-coding. For example, in 161129 the trailing 29 is greater than 12, so it must be the day, which pins that column to yyMMdd.
The output should be timestamps:
+---+-------------------+-------------------+-------------------+-------------------+
| id| date| date1| date2| date3|
+---+-------------------+-------------------+-------------------+-------------------+
| 1|2016-11-29 00:00:00|1996-03-16 00:00:00|1999-03-25 00:00:00|2014-05-02 00:00:00|
| 2|1996-12-09 00:00:00|1995-03-25 00:00:00|1999-06-22 00:00:00|2014-03-05 00:00:00|
| 3|2011-06-20 00:00:00|1996-05-22 00:00:00|1999-10-16 00:00:00|2013-02-13 00:00:00|
| 4|2016-09-28 00:00:00|1993-05-06 00:00:00|1999-05-22 00:00:00|2016-12-01 00:00:00|
| 5|2002-10-02 00:00:00|2000-03-26 00:00:00|1999-03-15 00:00:00|2013-12-11 00:00:00|
| 6|2016-07-21 00:00:00|1996-09-09 00:00:00|1999-12-12 00:00:00|2015-11-15 00:00:00|
| 7|2016-07-21 00:00:00|2015-01-01 00:00:00|1999-09-08 00:00:00|2014-09-08 00:00:00|
| 8|2010-09-03 00:00:00|2015-12-12 00:00:00|1999-05-06 00:00:00|2001-03-18 00:00:00|
| 9|2007-07-13 00:00:00|2017-05-26 00:00:00|1999-02-07 00:00:00|1991-10-10 00:00:00|
+---+-------------------+-------------------+-------------------+-------------------+
To meet the above requirement, I wrote a UDF with some conditions to detect the format of each date column.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def udf_1(x):
    # infer the date format from the string length and the digit ranges
    if len(x) == 6 and int(x[-2:]) > 12: return "yyMMdd"
    elif len(x) == 8 and int(x[-2:]) > 12: return "yyyyMMdd"
    elif len(x) == 6 and int(x[2:4]) < 12 and int(x[-2:]) > 12: return "yyMMdd"
    elif len(x) == 8 and int(x[4:6]) < 12 and int(x[-2:]) > 12: return "yyyyMMdd"
    elif len(x) == 6 and int(x[2:4]) > 12 and int(x[-2:]) < 12: return "yyddMM"
    elif len(x) == 8 and int(x[4:6]) > 12 and int(x[-2:]) < 12: return "yyyyddMM"
    elif len(x) == 6 and int(x[2:4]) <= 12 and int(x[-2:]) <= 12: return "N"   # ambiguous
    elif len(x) == 8 and int(x[4:6]) <= 12 and int(x[-2:]) <= 12: return "NA"  # ambiguous
    else: return "null"

udf_2 = udf(udf_1, StringType())
c1 = c.withColumn("date_formate", udf_2("date"))
c2 = c1.withColumn("date1_formate", udf_2("date1"))
c3 = c2.withColumn("date2_formate", udf_2("date2"))
c4 = c3.withColumn("date3_formate", udf_2("date3"))
c4.show()
With these conditions I extracted a format for most rows; when both the day and the month positions are <= 12 the format is ambiguous, so I return "N" for 6-character dates and "NA" for 8-character dates.
+------+--------+------+---------+---+------------+-------------+-------------+-------------+
| date| date1| date2| date3| id|date_formate|date1_formate|date2_formate|date3_formate|
+------+--------+------+---------+---+------------+-------------+-------------+-------------+
|161129|19960316|992503| 20140205| 1| yyMMdd| yyyyMMdd| yyddMM| NA|
|961209|19950325|992206| 20140503| 2| N| yyyyMMdd| yyddMM| NA|
|110620|19960522|991610| 20131302| 3| yyMMdd| yyyyMMdd| yyddMM| yyyyddMM|
|160928|19930506|992205| 20160112| 4| yyMMdd| NA| yyddMM| NA|
|021002|20000326|991503| 20131112| 5| N| yyyyMMdd| yyddMM| NA|
|160421|19960909|991212| 20151511| 6| yyMMdd| NA| N| yyyyddMM|
|160721|20150101|990809| 20140809| 7| yyMMdd| NA| N| NA|
|100903|20151212|990605| 20011803| 8| N| NA| N| yyyyddMM|
|070713|20170526|990702|19911010 | 9| yyMMdd| yyyyMMdd| N| yyyyddMM|
+------+--------+------+---------+---+------------+-------------+-------------+-------------+
Now I take the first unambiguous format found in each column, store it in a variable, and pass that variable to unix_timestamp to convert the raw dates to timestamps.
from pyspark.sql.functions import unix_timestamp

# note: ('NA' or 'N') in Python evaluates to just 'NA',
# so use isin to exclude both ambiguous markers
r1 = c4.where(~c4.date_formate.isin('NA', 'N'))[['date_formate']].first().date_formate
t_s = unix_timestamp("date", r1).cast("timestamp")
c5 = c4.withColumn("date", t_s)
r2 = c5.where(~c5.date1_formate.isin('NA', 'N'))[['date1_formate']].first().date1_formate
t_s1 = unix_timestamp("date1", r2).cast("timestamp")
c6 = c5.withColumn("date1", t_s1)
r3 = c6.where(~c6.date2_formate.isin('NA', 'N'))[['date2_formate']].first().date2_formate
t_s2 = unix_timestamp("date2", r3).cast("timestamp")
c7 = c6.withColumn("date2", t_s2)
r4 = c7.where(~c7.date3_formate.isin('NA', 'N'))[['date3_formate']].first().date3_formate
t_s3 = unix_timestamp("date3", r4).cast("timestamp")
c8 = c7.withColumn("date3", t_s3)
c8.select("id", "date", "date1", "date2", "date3").show()
Output
+---+-------------------+-------------------+-------------------+-------------------+
| id| date| date1| date2| date3|
+---+-------------------+-------------------+-------------------+-------------------+
| 1|2016-11-29 00:00:00|1996-03-16 00:00:00|1999-03-25 00:00:00|2014-05-02 00:00:00|
| 2|1996-12-09 00:00:00|1995-03-25 00:00:00|1999-06-22 00:00:00|2014-03-05 00:00:00|
| 3|2011-06-20 00:00:00|1996-05-22 00:00:00|1999-10-16 00:00:00|2013-02-13 00:00:00|
| 4|2016-09-28 00:00:00|1993-05-06 00:00:00|1999-05-22 00:00:00|2016-12-01 00:00:00|
| 5|2002-10-02 00:00:00|2000-03-26 00:00:00|1999-03-15 00:00:00|2013-12-11 00:00:00|
| 6|2016-07-21 00:00:00|1996-09-09 00:00:00|1999-12-12 00:00:00|2015-11-15 00:00:00|
| 7|2016-07-21 00:00:00|2015-01-01 00:00:00|1999-09-08 00:00:00|2014-09-08 00:00:00|
| 8|2010-09-03 00:00:00|2015-12-12 00:00:00|1999-05-06 00:00:00|2001-03-18 00:00:00|
| 9|2007-07-13 00:00:00|2017-05-26 00:00:00|1999-02-07 00:00:00|1991-10-10 00:00:00|
+---+-------------------+-------------------+-------------------+-------------------+

Backslash escaped characters in JavaCC token

I'm writing a JavaCC parser for a character stream like this:
Abc \(Def\) Gh (Ij; Kl); Mno (Pqr)
and it should be tokenized like this:
Abc \(Def\) Gh
LPAREN
Ij
SEMICOLON
Kl
RPAREN
SEMICOLON
Mno
LPAREN
Pqr
RPAREN
The current token definition is:
TOKEN:
{
  < WORDCHAR : (~[";", "(", ")"])+ >
| <LPAREN: "(">
| <RPAREN: ")">
| <SEMICOLON: ";">
}
How should I change the WORDCHAR token to include backslash escaped parentheses but not parentheses without leading backslash?
Add the backslash-escaped parentheses as two-character alternatives inside WORDCHAR. Because JavaCC always takes the longest possible match, "\(" and "\)" are consumed as part of the word, while a bare parenthesis still falls through to LPAREN or RPAREN:
TOKEN:
{
  < WORDCHAR : (~[";", "(", ")"] | "\\(" | "\\)")+ >
| <LPAREN: "(">
| <RPAREN: ")">
| <SEMICOLON: ";">
}

kable function: "id" in the columns

When I print a table with the knitr::kable function, the word "id" appears in the column names. How can I change that?
Example:
> x <- structure(c(42.3076923076923, 53.8461538461538, 96.1538461538462,
2.56410256410256, 1.28205128205128, 3.84615384615385,
44.8717948717949, 55.1282051282051, 100),
.Dim = c(3L, 3L),
.Dimnames = structure(list(Condition1 = c("Yes", "No", "Sum"),
Condition2 = c("Yes", "No", "Sum")),
.Names = c("Condition1", "Condition2")), class = c("table", "matrix"))
> print(x)
Condition2
Condition1 Yes No Sum
Yes 42,31 2,56 44,87
No 53,85 1,28 55,13
Sum 96,15 3,85 100,00
> library(knitr)
> kable(x)
|id | Yes| No| Sum|
|:----|-----:|-----:|------:|
|Yes | 42,3| 2,56| 44,9|
|No | 53,8| 1,28| 55,1|
|Sum | 96,2| 3,85| 100,0|
Edit: I found the reason for this behavior in the knitr:::kable_mark function, but now I don't understand how to make it more flexible.
An alternative to kable might be the general S3 method of pander:
> library(pander)
> pander(x, style = 'rmarkdown')
| | Yes | No | Sum |
|:---------:|:-----:|:-----:|:-----:|
| **Yes** | 42.31 | 2.564 | 44.87 |
| **No** | 53.85 | 1.282 | 55.13 |
| **Sum** | 96.15 | 3.846 | 100 |
If you need the decimal mark to be a comma, set the relevant option beforehand in your R session:
> panderOptions('decimal.mark', ',')
> pander(x, style = 'rmarkdown')
| | Yes | No | Sum |
|:---------:|:-----:|:-----:|:-----:|
| **Yes** | 42,31 | 2,564 | 44,87 |
| **No** | 53,85 | 1,282 | 55,13 |
| **Sum** | 96,15 | 3,846 | 100 |
There are also some other possible tweaks: http://rapporter.github.io/pander/#pander-options
I think the easiest way is to rip out and replace kable_mark completely. Note: this is quite dirty – but it seems to work, and there is no current way to customise how kable_mark works (you could submit a patch to knitr though).
km <- edit(knitr:::kable_mark)
# Now edit the code and remove lines 7 and 8.
unlockBinding('kable_mark', environment(knitr:::kable_mark))
assign('kable_mark', km, envir=environment(knitr:::kable_mark))
Explanation: First we edit the function and store the amended definition in a temporary variable. We remove the two lines
if (grepl("^\\s*$", cn[1L]))
cn[1L] = "id"
… of course you can also hard-code the amended function rather than editing it, or change the function around completely.
Next we use unlockBinding to make knitr:::kable_mark overridable. If we don’t do this, the next assign command wouldn’t work.
Finally, we assign the patched function back to knitr:::kable_mark. Done.
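For the last two steps, utils::assignInNamespace is a slightly tidier equivalent (still a hack, and the same caveats about patching an unexported function apply):
# replaces the unlockBinding/assign pair above
assignInNamespace("kable_mark", km, ns = "knitr")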
