xml_find_all returns {xml_nodeset (0)} - R

I recently downloaded the KML file from this map and tried to use the xml2 package to extract information about the campsites, e.g. the geolocation, the facilities around the sites, etc., but I got {xml_nodeset (0)} at the end.
Below is the code I have used:
library(xml2)
campsites <- read_xml("file_path")
xml_find_all(campsites, ".//Placemark")
Here is the structure of the KML file (you may also try xml_structure(campsites)):
> library(magrittr)
> campsites
{xml_document}
<kml>
[1] <Document>\n<description><![CDATA[powered by WordPress & MapsMarker.com]] ...
>
> campsites %>% xml_children %>% xml_children %>% xml_children
{xml_nodeset (55)}
[1] <IconStyle>\n <Icon>\n <href>http://www.mountaineering-lohas.org/wp-content/uploads/leaflet-maps-marker-icons/tents.png</href>\n </Icon>\n</IconStyle>
[2] <IconStyle>\n <Icon>\n <href>http://www.mountaineering-lohas.org/wp-content/uploads/leaflet-maps-marker-icons/tents-1.png</href>\n </Icon>\n</IconStyle>
[3] <IconStyle>\n <Icon>\n <href>http://www.mountaineering-lohas.org/wp-content/uploads/leaflet-maps-marker-icons/tents1.png</href>\n </Icon>\n</IconStyle>
[4] <name>香港營地 Hong Kong Camp Site</name>
[5] <Placemark id="marker-1">\n<styleUrl>#tents</styleUrl>\n<name>流水響營地 ( Lau Shui Heung Camp Site )</name>\n<TimeStamp><when>2013-02-21T04:02:29+08: ...
[6] <Placemark id="marker-2">\n<styleUrl>#tents</styleUrl>\n<name>鶴藪營地(Hok Tau Camp Site)</name>\n<TimeStamp><when>2013-02-21T04:02:18+08:00</when></Tim ...
[7] <Placemark id="marker-3">\n<styleUrl>#tents</styleUrl>\n<name>涌背營地(Chung Pui Camp Site)</name>\n<TimeStamp><when>2013-02-22T11:02:02+08:00</when></T ...
[8] <Placemark id="marker-4">\n<styleUrl>#tents</styleUrl>\n<name>東平洲營地 (Tung Ping Chau Campsite)</name>\n<TimeStamp><when>2013-02-22T11:02:39+08:00</ ...
[9] <Placemark id="marker-5">\n<styleUrl>#tents</styleUrl>\n<name>灣仔南營地(Wan Tsai Peninsula South Campsite)</name>\n<TimeStamp><when>2013-02-22T11:02:2 ...
[10] <Placemark id="marker-6">\n<styleUrl>#tents</styleUrl>\n<name>灣仔西營地 (Wan Tsai Peninsula West Campsite)</name>\n<TimeStamp><when>2013-02-22T11:02:3 ...
...
As you can see, there are nodes named "Placemark", so why can't I find them using xml_find_all? Did I make a mistake in my code?
Thanks!

It looks like you have a few namespaces. If you add the prefix to your XPath you can get the nodeset.
xml_ns(campsites)
# d1 <-> http://www.opengis.net/kml/2.2
# atom <-> http://www.w3.org/2005/Atom
# gx <-> http://www.google.com/kml/ext/2.2
xml_find_all(campsites, ".//d1:Placemark", xml_ns(campsites))
# {xml_nodeset (45)}
# [1] <Placemark id="marker-1">\n<styleUrl>#tents</styleUrl>\n<name>流水響營地 ( La ...
# [2] <Placemark id="marker-2">\n<styleUrl>#tents</styleUrl>\n<name>鶴藪營地(Hok T ...
# ...
To get the text in the CDATA, you could use something like
xml_text(xml_find_all(campsites, "//d1:description", xml_ns(campsites)))
# or "//d1:description/text()"
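Alternatively, if you would rather not carry a namespace prefix through every query, xml2 also has xml_ns_strip(), which removes the namespaces from the document (in place). A minimal sketch:
library(xml2)
campsites <- read_xml("file_path")
xml_ns_strip(campsites)                  # drop the kml/atom/gx namespaces
xml_find_all(campsites, ".//Placemark")  # plain XPath now matches
# {xml_nodeset (45)}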

Related

lapply() with XPath to obtain all text after a specific tag not working

Background:
I am scraping this website to obtain a list of all people named under a respective section of the editorial board.
In total, there are 6 sections, each one beginning with a <b>...</b> part. (It actually should be 5, but the code is a bit messy.)
My goal:
I want to get a list of all people per section (a list of 6 elements called people).
My approach:
I try to fetch all the text, or text(), after each respective <b>...</b> tag.
However, with the following R code and XPath, I fail to get the correct list:
journal_url <- "https://aepi.biomedcentral.com/about/editorial-board"
webpage <- xml2::read_html(url(journal_url))
# get a list of 6 sections
all_sections <- rvest::html_nodes(webpage, css = '#editorialboard p')
# the following does not work properly
people <- lapply(all_sections, function(x) rvest::html_nodes(x, xpath = '//b/following-sibling::text()'))
The mistaken outcome:
Instead of giving me a list of 6 elements with the people per section, it gives me a list of 6 elements, each containing all people from every section.
The expected outcome:
The expected output would start with:
people
[[1]]
[1] Shichuo Li
[[2]]
[1] Zhen Hong
[2] Hermann Stefan
[3] Dong Zhou
[[3]]
[1] Jie Mu
# etc etc
The double forward slash XPath // selects matching nodes across the whole document, even when the search starts from a single node. Use the current node selector . to anchor the search to each section:
people <- lapply(all_sections, function(x) {
  rvest::html_nodes(x, xpath = './b/following-sibling::text()')
})
Output:
[[1]]
{xml_nodeset (1)}
[1] Shichuo Li,
[[2]]
{xml_nodeset (3)}
[1] Zhen Hong,
[2] Hermann Stefan,
[3] Dong Zhou,
[[3]]
{xml_nodeset (0)}
[[4]]
{xml_nodeset (1)}
[1] Jie Mu,
[[5]]
{xml_nodeset (2)}
[1] Bing Liang,
[2] Weijia Jiang,
[[6]]
{xml_nodeset (35)}
[1] Aye Mye Min Aye,
[2] Sándor Beniczky,
[3] Ingmar Blümcke,
[4] Martin J. Brodie,
[5] Eric Chan,
[6] Yanchun Deng,
[7] Ding Ding,
[8] Yuwu Jiang,
[9] Hennric Jokeit,
[10] Heung Dong Kim,
[11] Patrick Kwan,
[12] Byung In Lee,
[13] Weiping Liao,
[14] Xiaoyan Liu,
[15] Guoming Luan,
[16] Imad M. Najm,
[17] Terence O'Brien,
[18] Jiong Qin,
[19] Markus Reuber,
[20] Ley J.W. Sander,
...
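If you want plain character vectors rather than nodesets, you can map html_text() over the list; the sub()/trimws() cleanup below is just one way to drop the trailing commas (a sketch):
people_clean <- lapply(people, function(ns) {
  txt <- rvest::html_text(ns)
  trimws(sub(",\\s*$", "", txt))  # drop trailing comma and whitespace
})
people_clean[[2]]
# [1] "Zhen Hong"      "Hermann Stefan" "Dong Zhou"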

reading address and lat,long from xml_node in R (mapsapi package)

I'm trying to get information about an address using the mapsapi package in R.
My code looks as follows:
library(mapsapi)
library(XML)
library(RCurl)
string <- "Pariser Platz 1, 10117 Berlin"
test <- mp_geocode(string)
xml <- xml_child(test[[string]],2)
xml
Now I'm getting this kind of XML node:
{xml_node}
<result>
[1] <type>street_address</type>
[2] <formatted_address>Pariser Platz 1, 10117 Berlin, Germany</formatted_address>
[3] <address_component>\n <long_name>1</long_name>\n <short_name>1</short_name>\n <type>street_number</type>\n</address_component>
[4] <address_component>\n <long_name>Pariser Platz</long_name>\n <short_name>Pariser Platz</short_name>\n <type>route</type>\n</address_component>
[5] <address_component>\n <long_name>Mitte</long_name>\n <short_name>Mitte</short_name>\n <type>political</type>\n <type>sublocality</type>\n <type>sublocality_level_1</type>\n</address_component>
[6] <address_component>\n <long_name>Berlin</long_name>\n <short_name>Berlin</short_name>\n <type>locality</type>\n <type>political</type>\n</address_component>
[7] <address_component>\n <long_name>Berlin</long_name>\n <short_name>Berlin</short_name>\n <type>administrative_area_level_1</type>\n <type>political</type>\n</address_component>
[8] <address_component>\n <long_name>Germany</long_name>\n <short_name>DE</short_name>\n <type>country</type>\n <type>political</type>\n</address_component>
[9] <address_component>\n <long_name>10117</long_name>\n <short_name>10117</short_name>\n <type>postal_code</type>\n</address_component>
[10] <geometry>\n <location>\n <lat>52.5160964</lat>\n <lng>13.3779369</lng>\n </location>\n <location_type>ROOFTOP</location_type>\n <viewport>\n <southwest>\n <lat>52.5147474</lat>\n <lng>13.37658 ...
[11] <place_id>ChIJnYvtVcZRqEcRl6Kftq66b6Y</place_id>
So how can I extract the street number, address, city, zip, lat and long out of this XML into proper variables?
Thanks for your help!
Regards
I've made accessing this type of information easy in my googleway package
library(googleway)
## you're using Google's API, and they require you to have an API key
## so you'll need to get one
set_key("GOOGLE_API_KEY")
## perform query
res <- google_geocode("Pariser Platz 1, 10117 Berlin")
With the res result you can use geocode_coordinates() to extract the coordinates, and geocode_address_components() to get the street number and the other address components.
## coordinates
geocode_coordinates(res)
# lat lng
# 1 52.5161 13.37794
geocode_address_components(res)
# long_name short_name types
# 1 1 1 street_number
# 2 Pariser Platz Pariser Platz route
# 3 Mitte Mitte political, sublocality, sublocality_level_1
# 4 Berlin Berlin locality, political
# 5 Berlin Berlin administrative_area_level_1, political
# 6 Germany DE country, political
# 7 10117 10117 postal_code
You can look at str(res) to see the full list of items returned from Google's API
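To pull a single component such as the postal code out of that data frame, something like this sketch works; grepl() is used because the types column may come back as a list-column, as the printed output above suggests:
components <- geocode_address_components(res)
components$long_name[grepl("postal_code", components$types)]
# [1] "10117"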
Alternatively, you can also use ggmap::geocode():
> library(ggmap)
> geocode(location = "Pariser Platz 1, 10117 Berlin", output = 'latlon' )
Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Pariser%20Platz%201,%2010117%20Berlin&sensor=false
lon lat
1 13.37794 52.5161
Changing the output parameter can give you a very detailed list output (if required):
> geocode(location = "Pariser Platz 1, 10117 Berlin", output = 'all' )
Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Pariser%20Platz%201,%2010117%20Berlin&sensor=false
$results
$results[[1]]
$results[[1]]$address_components
$results[[1]]$address_components[[1]]
$results[[1]]$address_components[[1]]$long_name
[1] "1"
$results[[1]]$address_components[[1]]$short_name
[1] "1"
$results[[1]]$address_components[[1]]$types
[1] "street_number"
$results[[1]]$address_components[[2]]
$results[[1]]$address_components[[2]]$long_name
[1] "Pariser Platz"
$results[[1]]$address_components[[2]]$short_name
[1] "Pariser Platz"
$results[[1]]$address_components[[2]]$types
[1] "route"
$results[[1]]$address_components[[3]]
$results[[1]]$address_components[[3]]$long_name
[1] "Mitte"
$results[[1]]$address_components[[3]]$short_name
[1] "Mitte"
$results[[1]]$address_components[[3]]$types
[1] "political" "sublocality" "sublocality_level_1"
$results[[1]]$address_components[[4]]
$results[[1]]$address_components[[4]]$long_name
[1] "Berlin"
$results[[1]]$address_components[[4]]$short_name
[1] "Berlin"
$results[[1]]$address_components[[4]]$types
[1] "locality" "political"
$results[[1]]$address_components[[5]]
$results[[1]]$address_components[[5]]$long_name
[1] "Berlin"
$results[[1]]$address_components[[5]]$short_name
[1] "Berlin"
$results[[1]]$address_components[[5]]$types
[1] "administrative_area_level_1" "political"
$results[[1]]$address_components[[6]]
$results[[1]]$address_components[[6]]$long_name
[1] "Germany"
$results[[1]]$address_components[[6]]$short_name
[1] "DE"
$results[[1]]$address_components[[6]]$types
[1] "country" "political"
$results[[1]]$address_components[[7]]
$results[[1]]$address_components[[7]]$long_name
[1] "10117"
$results[[1]]$address_components[[7]]$short_name
[1] "10117"
$results[[1]]$address_components[[7]]$types
[1] "postal_code"
$results[[1]]$formatted_address
[1] "Pariser Platz 1, 10117 Berlin, Germany"
$results[[1]]$geometry
$results[[1]]$geometry$location
$results[[1]]$geometry$location$lat
[1] 52.5161
$results[[1]]$geometry$location$lng
[1] 13.37794
$results[[1]]$geometry$location_type
[1] "ROOFTOP"
$results[[1]]$geometry$viewport
$results[[1]]$geometry$viewport$northeast
$results[[1]]$geometry$viewport$northeast$lat
[1] 52.51745
$results[[1]]$geometry$viewport$northeast$lng
[1] 13.37929
$results[[1]]$geometry$viewport$southwest
$results[[1]]$geometry$viewport$southwest$lat
[1] 52.51475
$results[[1]]$geometry$viewport$southwest$lng
[1] 13.37659
$results[[1]]$place_id
[1] "ChIJnYvtVcZRqEcRl6Kftq66b6Y"
$results[[1]]$types
[1] "street_address"
$status
[1] "OK"
You can find more info in the function help section.
Sometimes the call may fail with the following message:
Warning message:
geocode failed with status OVER_QUERY_LIMIT, location = "Pariser Platz 1, 10117 Berlin"
Generally if you try after a few seconds it works fine. You can always check the remaining queries left in your quota with geocodeQueryCheck:
> geocodeQueryCheck()
2490 geocoding queries remaining.
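If you hit the limit repeatedly, a simple retry wrapper helps. This is a sketch, not part of ggmap, and the wait time is an arbitrary choice:
geocode_retry <- function(location, tries = 3, wait = 2) {
  for (k in seq_len(tries)) {
    res <- suppressWarnings(ggmap::geocode(location))
    if (!all(is.na(res))) return(res)  # got coordinates back
    Sys.sleep(wait)                    # back off before retrying
  }
  res
}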

How to parse a sub-path of an XML file using the xml2 package

I have an XML file (linked in the code below) that I need to parse using xml2.
However, with this code, I cannot get the list under the subcellularLocation XPath:
library(xml2)
xmlfile <- "https://www.uniprot.org/uniprot/P09429.xml"
doc <- xmlfile %>%
  xml2::read_xml()
xml_name(doc)
xml_children(doc)
x <- xml_find_all(doc, "//subcellularLocation")
xml_path(x)
# character(0)
What is the right way to do it?
Update
The desired output is a vector:
[1] "Nucleus"
[2] "Chromosome"
[3] "Cytoplasm"
[4] "Secreted"
[5] "Cell membrane"
[6] "Peripheral membrane protein"
[7] "Extracellular side"
[8] "Endosome"
[9] "Endoplasmic reticulum-Golgi intermediate compartment"
Use x <- xml_find_all(doc, "//d1:subcellularLocation")
Whenever you run into a puzzling problem, checking the documentation is the first thing to do: see ?xml_find_all and you will find this (at the end of the page)
# Namespaces ---------------------------------------------------------------
# If the document uses namespaces, you'll need use xml_ns to form
# a unique mapping between full namespace url and a short prefix
x <- read_xml('
<root xmlns:f = "http://foo.com" xmlns:g = "http://bar.com">
<f:doc><g:baz /></f:doc>
<f:doc><g:baz /></f:doc>
</root>
')
xml_find_all(x, ".//f:doc")
xml_find_all(x, ".//f:doc", xml_ns(x))
So you then go to check xml_ns(doc) and find
d1 <-> http://uniprot.org/uniprot
xsi <-> http://www.w3.org/2001/XMLSchema-instance
Update
xml_find_all(doc, "//d1:subcellularLocation") %>%
  xml_children() %>%
  xml_text()
## [1] "Nucleus"
## [2] "Chromosome"
## [3] "Cytoplasm"
## [4] "Secreted"
## [5] "Cell membrane"
## [6] "Peripheral membrane protein"
## [7] "Extracellular side"
## [8] "Endosome"
## [9] "Endoplasmic reticulum-Golgi intermediate compartment"
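The same values can also be had with a single XPath by selecting all children of subcellularLocation with the * wildcard, which is equivalent to the xml_children() step above:
xml_text(xml_find_all(doc, "//d1:subcellularLocation/*"))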
If you don't mind, you can use the rvest package instead (note that read_html() lowercases tag names, hence the lowercase selector):
library(rvest)
a <- read_html(xmlfile) %>%
  html_nodes("subcellularlocation")
a %>% html_children() %>% html_text()
[1] "Nucleus" "Chromosome"
[3] "Cytoplasm" "Secreted"
[5] "Cell membrane" "Peripheral membrane protein"
[7] "Extracellular side" "Endosome"
[9] "Endoplasmic reticulum-Golgi intermediate compartment"

Looping through in web scraping in r

I want to web-scrape a list of drugs from the BNF website https://bnf.nice.org.uk/drug/
Let's take carbamazepine as an example: https://bnf.nice.org.uk/drug/carbamazepine.html#indicationsAndDoses
I want the following code to loop through each of the indications within that drug and return the patient type and dosage for each indication. This becomes a problem when I finally want to make a data frame, because there are 7 indications but around 9 patient types and dosages.
Currently, I get an indications variable which looks like this:
[1] "Focal and secondary generalised tonic-clonic seizures"
[2] "Primary generalised tonic-clonic seizures"
[3] "Trigeminal neuralgia"
[4] "Prophylaxis of bipolar disorder unresponsive to lithium"
[5] "Adjunct in acute alcohol withdrawal "
[6] "Diabetic neuropathy"
[7] "Focal and generalised tonic-clonic seizures"
And a patient group variable which looks like this:
[1] "For \n Adult\n "
[2] "For \n Elderly\n "
[3] "For \n Adult\n "
[4] "For \n Adult\n "
[5] "For \n Adult\n "
[6] "For \n Adult\n "
[7] "For \n Adult\n "
[8] "For \n Child 1 month–11 years\n "
[9] "For \n Child 12–17 years\n "
I want it as follows:
Indication Pt group
[1] "Focal and secondary generalised tonic-clonic seizures" For Adult
[1] "Focal and secondary generalised tonic-clonic seizures" For elderly
[2] "Primary generalised tonic-clonic seizures" For Adult
and so on..
Here is my code-
url_list <- paste0("https://bnf.nice.org.uk/drug/", druglist, ".html#indicationsAndDoses")
url_list
## The scraping bit - we are going to extract key bits of information for each drug in the list and create a data frame
drug_table <- data.frame() # an empty data frame
for(i in seq_along(url_list)){
## Extract drug name
drug <- read_html(url_list[i]) %>%
html_nodes("span") %>%
html_text() %>%
.[7]
## Extract indication
indication <- read_html(url_list[i]) %>%
html_nodes(".indication") %>%
html_text()%>%
unique
## Extract patient group
for (j in seq_along(indication)){
pt_group <- read_html(url_list[i]) %>%
html_nodes(".patientGroupList") %>%
html_text()
ln <- length(pt_group)
## Extract dose info per patient group
dose <- read_html(url_list[i]) %>%
html_nodes("p") %>%
html_text() %>%
.[2:(1+ln)]
## Combine pt group and dose
dose1 <- cbind(pt_group, dose)
}
## Create the data frame
drug_df <- data.frame(Drug = drug, Indication = indication, Dose = dose1)
## Combine data
drug_table <- bind_rows(drug_table, drug_df)
}
That site is actually blocked for me! I can't see anything there, but I can tell you, basically, how it should be done.
The html_nodes() function extracts every element matching a selector into a nodeset; each node can then become a row in an R data frame.
library(rvest)
## Loading required package: xml2
# Define the url once.
URL <- "https://scistarter.com/finder?phrase=&lat=&lng=&activity=At%20the%20beach&topic=&search_filters=&search_audience=&page=1#view-projects"
scistarter_html <- read_html(URL)
scistarter_html
## {xml_document}
## <html class="no-js" lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body>\n \n \n <svg style="position: absolute; width: 0; he ...
We’re able to retrieve the same HTML code we saw in our browser. This is not useful yet, but it does show that we can reach the page from R. Now we will begin filtering through the HTML to find the data we’re after.
The data we want are stored in a table, which we can tell by looking at the “Inspect Element” window.
This grabs all the nodes that have links in them.
scistarter_html %>%
html_nodes("a") %>%
head()
## {xml_nodeset (6)}
## [1] <a href="/index.html" class="site-header__branding" title="go to the ...
## [2] My Account
## [3] Project Finder
## [4] Event Finder
## [5] People Finder
## [6] log in
In a more complex example, we could use this to “crawl” the page, but that’s for another day.
Every div on the page:
scistarter_html %>%
html_nodes("div") %>%
head()
## {xml_nodeset (6)}
## [1] <div class="site-header__nav js-hamburger b-utility">\n <butt ...
## [2] <div class="site-header__nav__body js-hamburger__body">\n < ...
## [3] <div class="nav-tools">\n <div class="nav-tools__search"> ...
## [4] <div class="nav-tools__search">\n <div class="field">\n ...
## [5] <div class="field">\n <form method="get" action="/fin ...
## [6] <div class="input-group input-group--flush">\n <d ...
… the nav-tools div. This selects by CSS class, i.e. elements where class="nav-tools".
scistarter_html %>%
html_nodes("div.nav-tools") %>%
head()
## {xml_nodeset (1)}
## [1] <div class="nav-tools">\n <div class="nav-tools__search"> ...
We can select nodes by id as follows.
scistarter_html %>%
html_nodes("div#project-listing") %>%
head()
## {xml_nodeset (1)}
## [1] <div id="project-listing" class="subtabContent">\n \n ...
All the tables as follows:
scistarter_html %>%
html_nodes("table") %>%
head()
## {xml_nodeset (6)}
## [1] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [2] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [3] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [4] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [5] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
## [6] <table class="table-project-2-col u-mb-0">\n<legend class="u-visuall ...
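Once the <table> nodes are located, html_table() parses each one into a data frame (a sketch; what the columns contain depends entirely on the page):
tables <- scistarter_html %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)
length(tables)  # one data frame per <table> node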
See the related link below for more info:
https://rpubs.com/Radcliffe/superbowl
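Coming back to the question's BNF code: the fix for pairing each indication with its own patient groups is the same scoping idea, i.e. locate a wrapper node per indication and search within it, instead of collecting all .indication and .patientGroupList nodes separately. A sketch, assuming each indication and its patient groups share a common wrapper (the .indicationAndDoseGroup selector is hypothetical; inspect the page for the real one):
library(rvest)
page <- read_html(url_list[i])
sections <- html_nodes(page, ".indicationAndDoseGroup")  # hypothetical selector
drug_df <- do.call(rbind, lapply(sections, function(sec) {
  data.frame(
    Indication = html_text(html_node(sec, ".indication")),
    PtGroup    = trimws(html_text(html_nodes(sec, ".patientGroupList")))
  )
}))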

Isolating data from single XML nodeset in R xml2

I am trying to iteratively isolate and manipulate nodesets from an XML document, but I am getting a strange behavior in the xml_find_all() function in the xml2 package in R. Can someone please help me understand the scope of functions applied to a nodeset?
Here is an example:
library( xml2 )
library( dplyr )
doc <- read_xml( "<MEMBERS>
<CUSTOMER>
<ID>178</ID>
<FIRST.NAME>Alvaro</FIRST.NAME>
<LAST.NAME>Juarez</LAST.NAME>
<ADDRESS>123 Park Ave</ADDRESS>
<ZIP>57701</ZIP>
</CUSTOMER>
<CUSTOMER>
<ID>934</ID>
<FIRST.NAME>Janette</FIRST.NAME>
<LAST.NAME>Johnson</LAST.NAME>
<ADDRESS>456 Candy Ln</ADDRESS>
<ZIP>57701</ZIP>
</CUSTOMER>
</MEMBERS>" )
doc %>% xml_find_all( '//*') %>% xml_path()
# [1] "/MEMBERS" "/MEMBERS/CUSTOMER[1]"
# [3] "/MEMBERS/CUSTOMER[1]/ID" "/MEMBERS/CUSTOMER[1]/FIRST.NAME"
# [5] "/MEMBERS/CUSTOMER[1]/LAST.NAME" "/MEMBERS/CUSTOMER[1]/ADDRESS"
# [7] "/MEMBERS/CUSTOMER[1]/ZIP" "/MEMBERS/CUSTOMER[2]"
# [9] "/MEMBERS/CUSTOMER[2]/ID" "/MEMBERS/CUSTOMER[2]/FIRST.NAME"
#[11] "/MEMBERS/CUSTOMER[2]/LAST.NAME" "/MEMBERS/CUSTOMER[2]/ADDRESS"
#[13] "/MEMBERS/CUSTOMER[2]/ZIP"
The object customer.01 below is a single node that contains data from that customer only.
kids <- xml_children( doc )
customer.01 <- kids[[1]]
customer.01
# {xml_node}
# <CUSTOMER>
# [1] <ID>178</ID>
# [2] <FIRST.NAME>Alvaro</FIRST.NAME>
# [3] <LAST.NAME>Juarez</LAST.NAME>
# [4] <ADDRESS>123 Park Ave</ADDRESS>
# [5] <ZIP>57701</ZIP>
Why does the function, applied to the customer.01 node, return the ID of the second customer as well?
xml_find_all( customer.01, "//MEMBERS/CUSTOMER/ID" )
# {xml_nodeset (2)}
# [1] <ID>178</ID>
# [2] <ID>934</ID>
How do I return only values from that nodeset?
~~~
Ok, so here's a small wrinkle in the solution below, again related to the scope of the xml_find_all() function. The documentation says it can be applied to a document, node, or nodeset. However...
This case works when applied to a nodeset:
library( xml2 )
url <- "https://s3.amazonaws.com/irs-form-990/201501279349300635_public.xml"
doc <- read_xml( url )
xml_ns_strip( doc )
nd <- xml_find_all( doc, "//LiquidationOfAssetsDetail|//LiquidationDetail" )
nodei <- nd[[1]]
nodei
# {xml_node}
# <LiquidationOfAssetsDetail>
# [1] <AssetsDistriOrExpnssPaidDesc>LAND</AssetsDistriOrExpnssPaidDesc>
# [2] <DistributionDt>2014-11-04</DistributionDt>
# [3] <MethodOfFMVDeterminationTxt>SEE ATTACH</MethodOfFMVDeterminationTxt>
# [4] <EIN>abcdefghi</EIN>
# [5] <BusinessName>\n <BusinessNameLine1Txt>GREENSBURG PUBLIC LIBRARY</BusinessNameLine1Txt>\n</BusinessName>
# [6] <USAddress>\n <AddressLine1Txt>1110 E MAIN ST</AddressLine1Txt>\n <CityNm>GREENSBURG</CityNm>\n <StateAbbreviationCd>IN</StateAb ...
# [7] <IRCSectionTxt>501(C)(3)</IRCSectionTxt>
xml_text( xml_find_all( nodei, "AssetsDistriOrExpnssPaidDesc" ) )
# [1] "LAND"
But not this one:
nodei <- xml_children( nd[[1]] )
nodei
# {xml_nodeset (7)}
# [1] <AssetsDistriOrExpnssPaidDesc>LAND</AssetsDistriOrExpnssPaidDesc>
# [2] <DistributionDt>2014-11-04</DistributionDt>
# [3] <MethodOfFMVDeterminationTxt>SEE ATTACH</MethodOfFMVDeterminationTxt>
# [4] <EIN>abcdefghi</EIN>
# [5] <BusinessName>\n <BusinessNameLine1Txt>GREENSBURG PUBLIC LIBRARY</BusinessNameLine1Txt>\n</BusinessName>
# [6] <USAddress>\n <AddressLine1Txt>1110 E MAIN ST</AddressLine1Txt>\n <CityNm>GREENSBURG</CityNm>\n <StateAbbreviationCd>IN</StateAb ...
# [7] <IRCSectionTxt>501(C)(3)</IRCSectionTxt>
xml_text( xml_find_all( nodei, "AssetsDistriOrExpnssPaidDesc" ) )
# character(0)
I'm guessing this is a problem with applying xml_find_all() to all elements of a nodeset, rather than a scoping issue?
Currently you are searching with an absolute path from the root: XPath's double forward slash, //, means "find all items in the document that match this path", which includes both customers' IDs.
For particular child nodes under a specific node, simply use a relative path from the selected node:
xml_find_all(customer.01, "ID")
# {xml_nodeset (1)}
# [1] <ID>178</ID>
xml_find_all(customer.01, "FIRST.NAME|LAST.NAME")
# {xml_nodeset (2)}
# [1] <FIRST.NAME>Alvaro</FIRST.NAME>
# [2] <LAST.NAME>Juarez</LAST.NAME>
xml_find_all(customer.01, "*")
# {xml_nodeset (5)}
# [1] <ID>178</ID>
# [2] <FIRST.NAME>Alvaro</FIRST.NAME>
# [3] <LAST.NAME>Juarez</LAST.NAME>
# [4] <ADDRESS>123 Park Ave</ADDRESS>
# [5] <ZIP>57701</ZIP>
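To apply the relative search across every customer and collect the results into a data frame, loop over the nodeset (a minimal sketch using xml_find_first() per field, on the MEMBERS document from the top of the question):
customers <- xml_find_all(doc, "//CUSTOMER")
do.call(rbind, lapply(customers, function(node) {
  data.frame(
    id    = xml_text(xml_find_first(node, "ID")),
    first = xml_text(xml_find_first(node, "FIRST.NAME")),
    last  = xml_text(xml_find_first(node, "LAST.NAME"))
  )
}))
#    id   first    last
# 1 178  Alvaro  Juarez
# 2 934 Janette Johnson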
