Isolating data from a single XML nodeset in R xml2

I am trying to iteratively isolate and manipulate nodesets from an XML document, but I am seeing strange behavior from the xml_find_all() function in the xml2 package in R. Can someone please help me understand the scope of functions applied to a nodeset?
Here is an example:
library( xml2 )
library( dplyr )
doc <- read_xml( "<MEMBERS>
<CUSTOMER>
<ID>178</ID>
<FIRST.NAME>Alvaro</FIRST.NAME>
<LAST.NAME>Juarez</LAST.NAME>
<ADDRESS>123 Park Ave</ADDRESS>
<ZIP>57701</ZIP>
</CUSTOMER>
<CUSTOMER>
<ID>934</ID>
<FIRST.NAME>Janette</FIRST.NAME>
<LAST.NAME>Johnson</LAST.NAME>
<ADDRESS>456 Candy Ln</ADDRESS>
<ZIP>57701</ZIP>
</CUSTOMER>
</MEMBERS>" )
doc %>% xml_find_all( '//*') %>% xml_path()
# [1] "/MEMBERS" "/MEMBERS/CUSTOMER[1]"
# [3] "/MEMBERS/CUSTOMER[1]/ID" "/MEMBERS/CUSTOMER[1]/FIRST.NAME"
# [5] "/MEMBERS/CUSTOMER[1]/LAST.NAME" "/MEMBERS/CUSTOMER[1]/ADDRESS"
# [7] "/MEMBERS/CUSTOMER[1]/ZIP" "/MEMBERS/CUSTOMER[2]"
# [9] "/MEMBERS/CUSTOMER[2]/ID" "/MEMBERS/CUSTOMER[2]/FIRST.NAME"
#[11] "/MEMBERS/CUSTOMER[2]/LAST.NAME" "/MEMBERS/CUSTOMER[2]/ADDRESS"
#[13] "/MEMBERS/CUSTOMER[2]/ZIP"
The object customer.01 is a single node that contains data from that customer only:
kids <- xml_children( doc )
customer.01 <- kids[[1]]
customer.01
# {xml_node}
# <CUSTOMER>
# [1] <ID>178</ID>
# [2] <FIRST.NAME>Alvaro</FIRST.NAME>
# [3] <LAST.NAME>Juarez</LAST.NAME>
# [4] <ADDRESS>123 Park Ave</ADDRESS>
# [5] <ZIP>57701</ZIP>
Why does xml_find_all(), applied to the customer.01 node, return the ID for the second customer as well?
xml_find_all( customer.01, "//MEMBERS/CUSTOMER/ID" )
# {xml_nodeset (2)}
# [1] <ID>178</ID>
# [2] <ID>934</ID>
How do I return only values from that nodeset?
Update
Ok, so here's a small wrinkle in the solution below, again related to the scope of xml_find_all(). The documentation says it can be applied to a document, node, or nodeset. However...
This case works when applied to a nodeset:
library( xml2 )
url <- "https://s3.amazonaws.com/irs-form-990/201501279349300635_public.xml"
doc <- read_xml( url )
xml_ns_strip( doc )
nd <- xml_find_all( doc, "//LiquidationOfAssetsDetail|//LiquidationDetail" )
nodei <- nd[[1]]
nodei
# {xml_node}
# <LiquidationOfAssetsDetail>
# [1] <AssetsDistriOrExpnssPaidDesc>LAND</AssetsDistriOrExpnssPaidDesc>
# [2] <DistributionDt>2014-11-04</DistributionDt>
# [3] <MethodOfFMVDeterminationTxt>SEE ATTACH</MethodOfFMVDeterminationTxt>
# [4] <EIN>abcdefghi</EIN>
# [5] <BusinessName>\n <BusinessNameLine1Txt>GREENSBURG PUBLIC LIBRARY</BusinessNameLine1Txt>\n</BusinessName>
# [6] <USAddress>\n <AddressLine1Txt>1110 E MAIN ST</AddressLine1Txt>\n <CityNm>GREENSBURG</CityNm>\n <StateAbbreviationCd>IN</StateAb ...
# [7] <IRCSectionTxt>501(C)(3)</IRCSectionTxt>
xml_text( xml_find_all( nodei, "AssetsDistriOrExpnssPaidDesc" ) )
# [1] "LAND"
But not this one:
nodei <- xml_children( nd[[1]] )
nodei
# {xml_nodeset (7)}
# [1] <AssetsDistriOrExpnssPaidDesc>LAND</AssetsDistriOrExpnssPaidDesc>
# [2] <DistributionDt>2014-11-04</DistributionDt>
# [3] <MethodOfFMVDeterminationTxt>SEE ATTACH</MethodOfFMVDeterminationTxt>
# [4] <EIN>abcdefghi</EIN>
# [5] <BusinessName>\n <BusinessNameLine1Txt>GREENSBURG PUBLIC LIBRARY</BusinessNameLine1Txt>\n</BusinessName>
# [6] <USAddress>\n <AddressLine1Txt>1110 E MAIN ST</AddressLine1Txt>\n <CityNm>GREENSBURG</CityNm>\n <StateAbbreviationCd>IN</StateAb ...
# [7] <IRCSectionTxt>501(C)(3)</IRCSectionTxt>
xml_text( xml_find_all( nodei, "AssetsDistriOrExpnssPaidDesc" ) )
# character(0)
I'm guessing this is a problem with applying xml_find_all() to each element of a nodeset rather than a scoping issue?

Currently, you are searching with an absolute path from the document root, using XPath's double forward slash, //, which means "find all items in the document that match this path"; that includes both customers' IDs.
To select particular child nodes under a specific node, simply use a relative path from the selected node:
xml_find_all(customer.01, "ID")
# {xml_nodeset (1)}
# [1] <ID>178</ID>
xml_find_all(customer.01, "FIRST.NAME|LAST.NAME")
# {xml_nodeset (2)}
# [1] <FIRST.NAME>Alvaro</FIRST.NAME>
# [2] <LAST.NAME>Juarez</LAST.NAME>
xml_find_all(customer.01, "*")
# {xml_nodeset (5)}
# [1] <ID>178</ID>
# [2] <FIRST.NAME>Alvaro</FIRST.NAME>
# [3] <LAST.NAME>Juarez</LAST.NAME>
# [4] <ADDRESS>123 Park Ave</ADDRESS>
# [5] <ZIP>57701</ZIP>
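For the original goal of iterating over customers, here is a minimal sketch of the pattern (assuming the doc object from above): each relative XPath is evaluated from the individual CUSTOMER node, so the results stay scoped to that customer.
customers <- xml_find_all(doc, "//CUSTOMER")
do.call(rbind, lapply(customers, function(node) {
  data.frame(
    id         = xml_text(xml_find_first(node, "ID")),
    first.name = xml_text(xml_find_first(node, "FIRST.NAME")),
    last.name  = xml_text(xml_find_first(node, "LAST.NAME")),
    zip        = xml_text(xml_find_first(node, "ZIP")),
    stringsAsFactors = FALSE
  )
}))
As for the wrinkle with the children nodeset: there the target elements are the nodes of nodei themselves, not their children, so a child-axis search like "AssetsDistriOrExpnssPaidDesc" finds nothing. XPath's self axis should match them directly, e.g. xml_text(xml_find_all(nodei, "self::AssetsDistriOrExpnssPaidDesc")).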

Related

Using xml2 to add child node with an attribute

I'm using R's xml2 package to edit an XML document. I'd like to add a node with a specific XML attribute, but I don't seem to understand the syntax of xml_add_child().
Adding a node works great:
library(xml2)
my_xml <- read_xml("<fruits><apple/><banana/></fruits>")
xml_add_child(.x = my_xml, .value = "coconut")
my_xml
# {xml_document}
# <fruits>
# [1] <apple/>
# [2] <banana/>
# [3] <coconut/>
and according to my understanding of the documentation, I should be able to add an attribute to the node by using the ellipsis argument to provide a named vector of text:
my_xml <- read_xml("<fruits><apple/><banana/></fruits>")
xml_add_child(.x = my_xml, .value = "coconut", c(id="new"))
my_xml
# {xml_document}
# <fruits>
# [1] <apple/>
# [2] <banana/>
# [3] <coconut>new</coconut>
However, this appears to simply insert the text into the node, as it does when the text is unnamed. The attribute doesn't show up at all.
What I'd like to get is this:
# {xml_document}
# <fruits>
# [1] <apple/>
# [2] <banana/>
# [3] <coconut id="new"/>
Any thoughts? I'm aware that I can set attributes manually after the fact using xml_attr<- but my use case doesn't support that method very well.
Just remove the c(); attributes must be supplied as individually named arguments in ..., not as a named vector:
xml_add_child(.x = my_xml, .value = "coconut", id = "new")
Output:
> my_xml
{xml_document}
<fruits>
[1] <apple/>
[2] <banana/>
[3] <coconut id="new"/>
Data:
my_xml <- read_xml("<fruits><apple/><banana/></fruits>")
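As a footnote, the "after the fact" route mentioned in the question also takes only one extra line, since xml_add_child() returns the newly created node; a small sketch using xml2's xml_set_attr():
my_xml <- read_xml("<fruits><apple/><banana/></fruits>")
node <- xml_add_child(my_xml, "coconut")  # returns the new node (invisibly)
xml_set_attr(node, "id", "new")           # set the attribute afterwards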

lapply() with XPath to obtain all text after a specific tag not working

Background:
I am scraping the website below to obtain a list of all people named under each respective section of the editorial board.
In total, there are 6 sections, each one beginning with a <b>...</b> part. (It actually should be 5, but the code is a bit messy.)
My goal:
I want to get a list of all people per section (a list of 6 elements called people).
My approach:
I try to fetch all the text, i.e. text(), after each respective <b>...</b> tag.
However, with the following R-code and XPath, I fail to get the correct list:
journal_url <- "https://aepi.biomedcentral.com/about/editorial-board"
webpage <- xml2::read_html(url(journal_url))
# get a list of 6 sections
all_sections <- rvest::html_nodes(webpage, css = '#editorialboard p')
# the following does not work properly
people <- lapply(all_sections, function(x) rvest::html_nodes(x, xpath = '//b/following-sibling::text()'))
The mistaken outcome:
Instead of giving me a list of 6 elements with the people per section, it gives me a list of 6 elements, each containing all people from every section.
The expected outcome:
The expected output would start with:
people
[[1]]
[1] Shichuo Li
[[2]]
[1] Zhen Hong
[2] Hermann Stefan
[3] Dong Zhou
[[3]]
[1] Jie Mu
# etc etc
An XPath beginning with the double forward slash searches the whole document, even when the context object is a single node. Anchor the expression to the current node with the . selector:
people <- lapply(all_sections, function(x) {
rvest::html_nodes(x, xpath = './b/following-sibling::text()')
})
Output:
[[1]]
{xml_nodeset (1)}
[1] Shichuo Li,
[[2]]
{xml_nodeset (3)}
[1] Zhen Hong,
[2] Hermann Stefan,
[3] Dong Zhou,
[[3]]
{xml_nodeset (0)}
[[4]]
{xml_nodeset (1)}
[1] Jie Mu,
[[5]]
{xml_nodeset (2)}
[1] Bing Liang,
[2] Weijia Jiang,
[[6]]
{xml_nodeset (35)}
[1] Aye Mye Min Aye,
[2] Sándor Beniczky,
[3] Ingmar Blümcke,
[4] Martin J. Brodie,
[5] Eric Chan,
[6] Yanchun Deng,
[7] Ding Ding,
[8] Yuwu Jiang,
[9] Hennric Jokeit,
[10] Heung Dong Kim,
[11] Patrick Kwan,
[12] Byung In Lee,
[13] Weiping Liao,
[14] Xiaoyan Liu,
[15] Guoming Luan,
[16] Imad M. Najm,
[17] Terence O'Brien,
[18] Jiong Qin,
[19] Markus Reuber,
[20] Ley J.W. Sander,
...
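To get from these nodesets to the plain character vectors shown in the expected output, one extra cleanup step is needed; a small sketch, assuming the trailing commas seen above:
people <- lapply(people, function(ns) {
  trimws(sub(",\\s*$", "", rvest::html_text(ns)))
})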

How to scrape id from each div class in rvest?

Each div.grpl-grp clearfix (each club element) on this page has its own id:
https://uws-community.symplicity.com/index.php?s=student_group
I am trying to scrape each of these ids, but my current method, shown below, does not work. What am I doing wrong?
url <- 'https://uws-community.symplicity.com/index.php?s=student_group'
page <- html_session(url)
id_nodes <- html_nodes(page, "div.grpl-grp clearfix") %>% html_attrs("id")
Try XPath instead:
library(magrittr)
library(rvest)
doc <- read_html("https://uws-community.symplicity.com/index.php?s=student_group")
html_nodes(doc, xpath=".//div[contains(@class, 'grpl-grp') and contains(@class, 'clearfix')]") %>%
html_attr("id")
## [1] "grpl_5bf9ea61bc46eaeff075cf8043c27c92" "grpl_17e4ea613be85fe019efcf728fb6361d"
## [3] "grpl_d593eb48fe26d58f616515366a1e677b" "grpl_5b445690da34b7cff962ee2bf254db9e"
## [5] "grpl_cd1ebcef22852bdb5301a243803a2909" "grpl_0a7da33f968a919ecfa06486f0787bc7"
## [7] "grpl_a6a6cbf50b45d1ef05f8965c69f462de" "grpl_3fed7efb36173632ae2eef14393f37fc"
## [9] "grpl_f4e1e263109725bd4f99db9f70552b65" "grpl_2be038a5d159bf753fceb26cfdf596c2"
## [11] "grpl_918f9dec53fe5d36c1f98f5136f2ae7d" "grpl_f365b501f1e9833ca0cf8c504e37d11c"
## [13] "grpl_2f302fcce440ec1463beb73c6d7af070" "grpl_26b6771768df4a002e44ad6ec01fa36d"
## [15] "grpl_5e260344fd093628f3326a162996513a" "grpl_3604e5b44c0428dfc982c1bfc852fef2"
## [17] "grpl_9ab9bced3514bd8b2e0e18da8a3c7977" "grpl_6364bed0a4d3f45cd5b1fc929e320cb3"
## [19] "grpl_ba21e3c819afe6a32110585ac379f5d9" "grpl_9964a3732044fceffb4dc9b5645856ba"
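For what it's worth, the CSS route from the question was nearly right as well: the two classes need to be chained with dots, and html_attr() (singular) used to pull a single attribute. A sketch that should return the same ids:
html_nodes(doc, "div.grpl-grp.clearfix") %>%
  html_attr("id")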

How to parse a sub-path of an XML file using the xml2 package

I have the following XML page, which I need to parse using xml2.
However, with this code, I cannot get the list under the subcellularLocation XPath:
library(xml2)
xmlfile <- "https://www.uniprot.org/uniprot/P09429.xml"
doc <- xmlfile %>%
xml2::read_xml()
xml_name(doc)
xml_children(doc)
x <- xml_find_all(doc, "//subcellularLocation")
xml_path(x)
# character(0)
What is the right way to do it?
Update
The desired output is a vector:
[1] "Nucleus"
[2] "Chromosome"
[3] "Cytoplasm"
[4] "Secreted"
[5] "Cell membrane"
[6] "Peripheral membrane protein"
[7] "Extracellular side"
[8] "Endosome"
[9] "Endoplasmic reticulum-Golgi intermediate compartment"
Use x <- xml_find_all(doc, "//d1:subcellularLocation")
Whenever you meet a troublesome problem, checking the documentation is the first thing to do: run ?xml_find_all and you will see this (at the end of the page):
# Namespaces ---------------------------------------------------------------
# If the document uses namespaces, you'll need use xml_ns to form
# a unique mapping between full namespace url and a short prefix
x <- read_xml('
<root xmlns:f = "http://foo.com" xmlns:g = "http://bar.com">
<f:doc><g:baz /></f:doc>
<f:doc><g:baz /></f:doc>
</root>
')
xml_find_all(x, ".//f:doc")
xml_find_all(x, ".//f:doc", xml_ns(x))
So you then go to check xml_ns(doc) and find
d1 <-> http://uniprot.org/uniprot
xsi <-> http://www.w3.org/2001/XMLSchema-instance
Update
xml_find_all(doc, "//d1:subcellularLocation") %>%
  xml_children() %>%
  xml_text()
## [1] "Nucleus"
## [2] "Chromosome"
## [3] "Cytoplasm"
## [4] "Secreted"
## [5] "Cell membrane"
## [6] "Peripheral membrane protein"
## [7] "Extracellular side"
## [8] "Endosome"
## [9] "Endoplasmic reticulum-Golgi intermediate compartment"
Alternatively, you can use the rvest package (note that read_html() lowercases element names, hence the lowercase selector):
library(rvest)
a <- read_html(xmlfile) %>%
  html_nodes("subcellularlocation")
a %>% html_children() %>% html_text()
[1] "Nucleus" "Chromosome"
[3] "Cytoplasm" "Secreted"
[5] "Cell membrane" "Peripheral membrane protein"
[7] "Extracellular side" "Endosome"
[9] "Endoplasmic reticulum-Golgi intermediate compartment"
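A third option, used in the first question above, is to strip the namespaces so that plain element names match; a sketch that trades namespace safety for convenience:
library(xml2)
doc <- read_xml("https://www.uniprot.org/uniprot/P09429.xml")
xml_ns_strip(doc)  # drops the d1/xsi namespace prefixes in place
xml_text(xml_children(xml_find_all(doc, "//subcellularLocation")))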

removing special apostrophes from French article contractions when tokenizing

I am currently running an STM (structural topic model) on a series of articles from the French newspaper Le Monde. The model is working just great, but I have a problem with the pre-processing of the text.
I'm currently using the quanteda package and the tm package for doing things like removing words, removing numbers...etc...
There's only one thing, though, that doesn't seem to work.
As some of you might know, in French the masculine definite article le contracts to l' before vowels. I've tried to remove l' (and similar contractions, like d') as words with removeWords:
lmt67 <- removeWords(lmt67, c( "l'","d'","qu'il", "n'", "a", "dans"))
but it only works on words that stand apart from the rest of the text, not on articles attached to a word, as in l'arbre (the tree).
Frustrated, I've tried to give it a simple gsub
lmt67 <- gsub("l'","",lmt67)
but that doesn't seem to be working either.
Now, what's a better way to do this, and possibly through a c(...) vector so that I can give it a series of expressions all together?
Just as context, lmt67 is a "large character" with 30,000 elements/articles, obtained by applying the texts() function to data imported from txt files.
Thanks to anyone that will want to help me.
I'll outline two ways to do this using quanteda and quanteda-related tools. First, let's define a slightly longer text, with more prefix cases for French. Notice the inclusion of the ’ apostrophe as well as the ASCII 39 simple apostrophe.
txt <- c(doc1 = "M. Trump, lors d’une réunion convoquée d’urgence à la Maison Blanche,
n’en a pas dit mot devant la presse. En réalité, il s’agit d’une
mesure essentiellement commerciale de ce pays qui l'importe.",
doc2 = "Réfugié à Bruxelles, l’indépendantiste catalan a désigné comme
successeur Jordi Sanchez, partisan de l’indépendance catalane,
actuellement en prison pour sédition.")
The first method uses pattern matches for the simple ASCII 39 apostrophe plus a number of Unicode variants, matched through the Unicode category "Pf" ("Punctuation, Final quote"). However, quanteda does its best to normalize the quotes at the tokenization stage; see "l'indépendance" in the second document, for instance.
The second way below uses a French part-of-speech tagger integrated with quanteda, which allows a similar selection after recognizing and separating the prefixes, and then removing determiners (among other parts of speech).
1. quanteda tokens
toks <- tokens(txt, remove_punct = TRUE)
# remove stopwords
toks <- tokens_remove(toks, stopwords("french"))
toks
# tokens from 2 documents.
# doc1 :
# [1] "M" "Trump" "lors" "d'une" "réunion"
# [6] "convoquée" "d'urgence" "à" "la" "Maison"
# [11] "Blanche" "n'en" "a" "pas" "dit"
# [16] "mot" "devant" "la" "presse" "En"
# [21] "réalité" "il" "s'agit" "d'une" "mesure"
# [26] "essentiellement" "commerciale" "de" "ce" "pays"
# [31] "qui" "l'importe"
#
# doc2 :
# [1] "Réfugié" "à" "Bruxelles" "l'indépendantiste"
# [5] "catalan" "a" "désigné" "comme"
# [9] "successeur" "Jordi" "Sanchez" "partisan"
# [13] "de" "l'indépendance" "catalane" "actuellement"
# [17] "en" "prison" "pour" "sédition"
Then, we apply the pattern to match l', d', or s', using a regular expression replacement on the types (the unique tokens):
toks <- tokens_replace(
toks,
types(toks),
stringi::stri_replace_all_regex(types(toks), "[lsd]['\\p{Pf}]", "")
)
# tokens from 2 documents.
# doc1 :
# [1] "M" "Trump" "lors" "une" "réunion"
# [6] "convoquée" "urgence" "à" "la" "Maison"
# [11] "Blanche" "n'en" "a" "pas" "dit"
# [16] "mot" "devant" "la" "presse" "En"
# [21] "réalité" "il" "agit" "une" "mesure"
# [26] "essentiellement" "commerciale" "de" "ce" "pays"
# [31] "qui" "importe"
#
# doc2 :
# [1] "Réfugié" "à" "Bruxelles" "indépendantiste" "catalan"
# [6] "a" "désigné" "comme" "successeur" "Jordi"
# [11] "Sanchez" "partisan" "de" "indépendance" "catalane"
# [16] "actuellement" "En" "prison" "pour" "sédition"
From the resulting toks object you can form a dfm and then proceed to fit the STM.
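A hedged sketch of that last step: quanteda's convert() supports an "stm" target, though the actual stm call (K, prevalence, metadata) is up to you.
library(quanteda)
dfmat <- dfm(toks)                      # document-feature matrix
stm_input <- convert(dfmat, to = "stm")
# stm::stm(stm_input$documents, stm_input$vocab, K = 10)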
2. using spacyr
This will involve more sophisticated part-of-speech tagging and then converting the tagged object into quanteda tokens. This requires first that you install Python, spacy, and the French language model. (See https://spacy.io/usage/models.)
library(spacyr)
spacy_initialize(model = "fr", python_executable = "/anaconda/bin/python")
# successfully initialized (spaCy Version: 2.0.1, language model: fr)
toks <- spacy_parse(txt, lemma = FALSE) %>%
as.tokens(include_pos = "pos")
toks
# tokens from 2 documents.
# doc1 :
# [1] "M./NOUN" "Trump/PROPN" ",/PUNCT"
# [4] "lors/ADV" "d’/PUNCT" "une/DET"
# [7] "réunion/NOUN" "convoquée/VERB" "d’/ADP"
# [10] "urgence/NOUN" "à/ADP" "la/DET"
# [13] "Maison/PROPN" "Blanche/PROPN" ",/PUNCT"
# [16] "\n /SPACE" "n’/VERB" "en/PRON"
# [19] "a/AUX" "pas/ADV" "dit/VERB"
# [22] "mot/ADV" "devant/ADP" "la/DET"
# [25] "presse/NOUN" "./PUNCT" "En/ADP"
# [28] "réalité/NOUN" ",/PUNCT" "il/PRON"
# [31] "s’/AUX" "agit/VERB" "d’/ADP"
# [34] "une/DET" "\n /SPACE" "mesure/NOUN"
# [37] "essentiellement/ADV" "commerciale/ADJ" "de/ADP"
# [40] "ce/DET" "pays/NOUN" "qui/PRON"
# [43] "l'/DET" "importe/NOUN" "./PUNCT"
#
# doc2 :
# [1] "Réfugié/VERB" "à/ADP" "Bruxelles/PROPN"
# [4] ",/PUNCT" "l’/PRON" "indépendantiste/ADJ"
# [7] "catalan/VERB" "a/AUX" "désigné/VERB"
# [10] "comme/ADP" "\n /SPACE" "successeur/NOUN"
# [13] "Jordi/PROPN" "Sanchez/PROPN" ",/PUNCT"
# [16] "partisan/VERB" "de/ADP" "l’/DET"
# [19] "indépendance/ADJ" "catalane/ADJ" ",/PUNCT"
# [22] "\n /SPACE" "actuellement/ADV" "en/ADP"
# [25] "prison/NOUN" "pour/ADP" "sédition/NOUN"
# [28] "./PUNCT"
Then we can use the default glob-matching to remove the parts of speech in which we are probably not interested, including the newline tokens:
toks <- tokens_remove(toks, c("*/DET", "*/PUNCT", "\n*", "*/ADP", "*/AUX", "*/PRON"))
toks
# doc1 :
# [1] "M./NOUN" "Trump/PROPN" "lors/ADV" "réunion/NOUN" "convoquée/VERB"
# [6] "urgence/NOUN" "Maison/PROPN" "Blanche/PROPN" "n’/VERB" "pas/ADV"
# [11] "dit/VERB" "mot/ADV" "presse/NOUN" "réalité/NOUN" "agit/VERB"
# [16] "mesure/NOUN" "essentiellement/ADV" "commerciale/ADJ" "pays/NOUN" "importe/NOUN"
#
# doc2 :
# [1] "Réfugié/VERB" "Bruxelles/PROPN" "indépendantiste/ADJ" "catalan/VERB" "désigné/VERB"
# [6] "successeur/NOUN" "Jordi/PROPN" "Sanchez/PROPN" "partisan/VERB" "indépendance/ADJ"
# [11] "catalane/ADJ" "actuellement/ADV" "prison/NOUN" "sédition/NOUN"
Then we can remove the tags, which you probably don't want in your STM - but you could leave them if you prefer.
## remove the tags
toks <- tokens_replace(toks, types(toks),
stringi::stri_replace_all_regex(types(toks), "/[A-Z]+$", ""))
toks
# tokens from 2 documents.
# doc1 :
# [1] "M." "Trump" "lors" "réunion" "convoquée"
# [6] "urgence" "Maison" "Blanche" "n’" "pas"
# [11] "dit" "mot" "presse" "réalité" "agit"
# [16] "mesure" "essentiellement" "commerciale" "pays" "importe"
#
# doc2 :
# [1] "Réfugié" "Bruxelles" "indépendantiste" "catalan" "désigné"
# [6] "successeur" "Jordi" "Sanchez" "partisan" "indépendance"
# [11] "catalane" "actuellement" "prison" "sédition"
From there, you can use the toks object to form your dfm and fit the model.
Here's a sentence scraped from the current page of Le Monde's website. Notice that the apostrophe they use is not the same character as the straight single quote "'":
text <- "Réfugié à Bruxelles, l’indépendantiste catalan a désigné comme successeur Jordi Sanchez, partisan de l’indépendance catalane, actuellement en prison pour sédition."
It has a little angle and is not actually "straight down" when viewed. You need to copy that exact character into your sub() or gsub() call:
sub("l’", "", text)
# [1] "Réfugié à Bruxelles, indépendantiste catalan a désigné comme successeur Jordi Sanchez, partisan de l’indépendance catalane, actuellement en prison pour sédition."
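Finally, to answer the c(...) part of the question directly: a minimal base-R sketch that strips a whole set of elided forms in one pass, covering both the straight and the typographic apostrophe (the contraction list is illustrative, not exhaustive, and lmt67 is the object from the question):
contractions <- c("l", "d", "s", "n", "qu", "j", "m", "t", "c")
pattern <- paste0("\\b(", paste(contractions, collapse = "|"), ")['’]")
lmt67 <- gsub(pattern, "", lmt67, perl = TRUE)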
