Using xml_replace leaves behind some formatting - r

I am trying to replace some nodes of an XML document with text using the xml2 library in R. In the example below I'm trying to turn all the "name" nodes into text, but the final result still has the "<" and "/>" around the text.
library(xml2)
x <- read_xml(
"<scenario>
<event>
<dataProbeEvent>
<name>LogSurvResHigh</name>
</dataProbeEvent>
</event>
<event>
<accumulateEvent>
<name>SetSurvOut</name>
</accumulateEvent>
</event>
</scenario>")
x
> {xml_document}
<scenario>
[1] <event>\n <dataProbeEvent>\n <name>LogSurvResHigh</name>\n </dataProbeEvent>\n ...
[2] <event>\n <accumulateEvent>\n <name>SetSurvOut</name>\n </accumulateEvent>\n</ ...
namerefs <- xml_find_all(x, './/name')
replacements <- xml_text(namerefs)
xml_replace(namerefs, replacements)
> {xml_document}
<scenario>
[1] <event>\n <dataProbeEvent>\n <LogSurvResHigh/>\n </dataProbeEvent>\n</event>
[2] <event>\n <accumulateEvent>\n <SetSurvOut/>\n </accumulateEvent>\n</event>
What I want it to look like is:
> {xml_document}
<scenario>
[1] <event>\n <dataProbeEvent>\n LogSurvResHigh\n </dataProbeEvent>\n</event>
[2] <event>\n <accumulateEvent>\n SetSurvOut\n </accumulateEvent>\n</event>

You can use the following:
x <- as.character(x)
x_sub <- gsub("<name[^>]*>|<\\/name>","",x)
x <- read_xml(x_sub)
x
{xml_document}
<scenario>
[1] <event>\n <dataProbeEvent>\n LogSurvResHigh\n </dataProbeEvent>\n</event>
[2] <event>\n <accumulateEvent>\n SetSurvOut\n </accumulateEvent>\n</event>
Because the pattern is <name[^>]*>, it also matches opening name tags that carry attributes (something like ref-type="bibr" rid="CR8"), so those tags are removed as well.
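To see what the pattern does and does not match, here is a minimal base R sketch on a toy string (the string and the <names> tag are made up for illustration). It also shows a stricter variant, which is my own suggestion rather than part of the answer above, for documents that contain other tags whose names merely start with "name".

```r
# Toy input (hypothetical): a name tag with an attribute, plus a
# different tag whose name happens to start with "name".
s <- '<a><name rid="CR8">X</name><names>Y</names></a>'

# Pattern from the answer: every opening tag that starts with "<name"
# is removed, so the opening <names> tag is hit as well.
loose <- gsub("<name[^>]*>|</name>", "", s)
loose
# [1] "<a>XY</names></a>"

# Stricter variant: "name" must be followed by ">" or by whitespace
# (the start of an attribute list).
strict <- gsub("<name(\\s[^>]*)?>|</name>", "", s)
strict
# [1] "<a>X<names>Y</names></a>"
```

The loose pattern leaves a dangling </names> behind, which would make read_xml() fail on the result, so the stricter form is the safer choice if such tags can occur.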

XML documents are a nested data structure, much like a list in R. If you remove a node, everything inside that node is removed with it. I generally find it easier to convert the document to a flat type (such as a character string), strip out what isn't wanted, and convert the result back to XML if needed.
The alternative would be to use xml2 to locate the parent node that you want and then use xml_text to extract the text. But I believe this drops all newline characters.
x_char <- as.character(x)
x_noname <- gsub("<name>|<\\/name>","",x_char)
x_noname
x_noname <- read_xml(x_noname)
x_noname
# {xml_document}
# <scenario>
# [1] <event>\n <dataProbeEvent>\n LogSurvResHigh\n </dataProbeEvent>\n</event>
# [2] <event>\n <accumulateEvent>\n SetSurvOut\n </accumulateEvent>\n</event>

Using xml2 to add child node with an attribute

I'm using R's xml2 package to edit an XML document. I'd like to add a node with a specific XML attribute, but I don't seem to understand the syntax of xml_add_child.
Adding a node works great:
library(xml2)
my_xml <- read_xml("<fruits><apple/><banana/></fruits>")
xml_add_child(.x = my_xml, .value = "coconut")
my_xml
# {xml_document}
# <fruits>
# [1] <apple/>
# [2] <banana/>
# [3] <coconut/>
According to my understanding of the documentation, I should be able to add an attribute to the node by using the ellipsis argument to provide a named vector of text:
my_xml <- read_xml("<fruits><apple/><banana/></fruits>")
xml_add_child(.x = my_xml, .value = "coconut", c(id="new"))
my_xml
# {xml_document}
# <fruits>
# [1] <apple/>
# [2] <banana/>
# [3] <coconut>new</coconut>
However, this appears to simply insert the text into the node, as it does when the text is unnamed. The attribute doesn't show up at all.
What I'd like to get is this:
# {xml_document}
# <fruits>
# [1] <apple/>
# [2] <banana/>
# [3] <coconut id="new"/>
Any thoughts? I'm aware that I can set attributes manually after the fact using xml_attr<- but my use case doesn't support that method very well.
Just remove the c()
xml_add_child(.x = my_xml, .value = "coconut", id = "new")
-output
> my_xml
{xml_document}
<fruits>
[1] <apple/>
[2] <banana/>
[3] <coconut id="new"/>
data
my_xml <- read_xml("<fruits><apple/><banana/></fruits>")
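As far as I can tell, the reason is that c(id = "new") reaches xml_add_child as one unnamed argument, so it is treated as text, whereas id = "new" is a named argument and becomes an attribute. If the attributes already live in a named vector, one way to spread them into named arguments is do.call. The sketch below assumes a hypothetical attrs vector and checks the result with xml_attr.

```r
library(xml2)

my_xml <- read_xml("<fruits><apple/><banana/></fruits>")

# Attributes held in a named character vector (hypothetical example data)
attrs <- c(id = "new", colour = "brown")

# Spread the vector into separate named arguments; this is equivalent to
# xml_add_child(my_xml, "coconut", id = "new", colour = "brown")
do.call(xml_add_child, c(list(my_xml, "coconut"), as.list(attrs)))

# Read the attributes back from the new node
coconut <- xml_find_first(my_xml, "coconut")
xml_attr(coconut, "id")
xml_attr(coconut, "colour")
```

This keeps everything in a single call, which may suit cases where setting attributes afterwards with xml_attr<- is awkward.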

Can't append values to a list in R

I got a list with a weird format:
[[1]]
[1] "Freq.2432.40862794099" "Freq.2792.87280096993" "Freq.2955.16577598796"
[4] "Freq.3161.12982491516" "Freq.3194.19720315405" "Freq.3218.83311568825"
[7] "Freq.3265.37951283662" "Freq.3317.86908506493" "Freq.3900.50408838719"
[10] "Freq.4073.33935633108" "Freq.4302.8830598659" "Freq.4404.80065271461"
[13] "Freq.4469.12305573234" "Freq.4567.90688886175" "Freq.4965.4984006347"
[16] "Freq.5854.45161215455" "Freq.5905.64933878776" "Freq.6175.68130655941"
[19] "Freq.6433.22411185796" "Freq.6631.46775487994" "Freq.6958.20015968149"
[22] "Freq.7469.83422424355" "Freq.8602.43342069553" "Freq.8766.14436081853"
[25] "Freq.8811.22677706485" "Freq.8915.90029255773" "Freq.9131.39810096"
[28] "Freq.9378.82122607608"
Never saw that [[1]] in a list before, and the problem is that I can't append things to this list.
How can I solve this?
This is a list whose first (and only) element is a character vector. The [[1]] is simply how R prints the first element of a list. A list that holds other vectors or lists like this is often called a nested list.
a <- c(1,2,3)
b <- c(4,5,6)
lst <- list(a,b)
In this code snippet we create two vectors and put them into a list (called lst here so that it does not mask the base function list()). You can access an element of the list using double brackets, like so:
lst[[1]]
> [1] 1 2 3
To change that element, or to append to it, use the normal assignment syntax on the double-bracket expression.
lst[[1]] <- c(7,8,9)
lst[[1]]
> [1] 7 8 9
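For the appending part of the question, here is a minimal sketch. The object name freqs and the shortened values are stand-ins for the list in the question, not the original data.

```r
# A list with a single character-vector element, shaped like the one
# in the question (values shortened for illustration)
freqs <- list(c("Freq.2432", "Freq.2792"))

# Append a value to the vector stored in the first element
freqs[[1]] <- c(freqs[[1]], "Freq.2955")

# Append a new element to the list itself
freqs <- c(freqs, list("Freq.3161"))

length(freqs[[1]])  # 3
length(freqs)       # 2

# If the list wrapper is not needed at all, flatten to a plain vector
flat <- unlist(freqs)
flat
# [1] "Freq.2432" "Freq.2792" "Freq.2955" "Freq.3161"
```

Which form to use depends on whether the new values belong inside the existing vector or next to it as a separate list element.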

How do I use file.path() on a list of subdirectories

I want to add "_quants" to a list of folder names contained in samples$sample. When I use the following:
files <- file.path(dir, "quants", samples$sample, "_quants")
> dir
[1] "E:/ubuntu-shared/salmonTutorial/"
> samples$sample
[1] DRR016125 DRR016126 DRR016127 DRR016128 DRR016129 DRR016130 DRR016131 DRR016132 DRR016133 DRR016134 DRR016135 DRR016136 DRR016137 DRR016138 DRR016139
[16] DRR016140
16 Levels: DRR016125 DRR016126 DRR016127 DRR016128 DRR016129 DRR016130 DRR016131 DRR016132 DRR016133 DRR016134 DRR016135 DRR016136 DRR016137 ... DRR016140
I get:
[1] "E:/ubuntu-shared/salmonTutorial//quants/DRR016125/_quants"
How do I remove the double // and append "_quants" to "DRR016125" using file.path() to get the desired:
[1] "E:/ubuntu-shared/salmonTutorial/quants/DRR016125_quants"
[2] "E:/ubuntu-shared/salmonTutorial/quants/DRR016126_quants"
Solution using base::paste0:
dir <- "E:/ubuntu-shared/salmonTutorial/"
samples <- list(sample = c("DRR016125", "DRR016126", "DRR016127"))
paste0(dir, "quants/", samples$sample, "_quants")
[1] "E:/ubuntu-shared/salmonTutorial/quants/DRR016125_quants"
[2] "E:/ubuntu-shared/salmonTutorial/quants/DRR016126_quants"
[3] "E:/ubuntu-shared/salmonTutorial/quants/DRR016127_quants"
paste0 concatenates its arguments element by element with no separator (after converting them to character), so the "/" after "quants" has to be written out explicitly. Because it is vectorised, you get one path per sample.
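If you would rather keep file.path(), as the question asks, a minimal sketch is to strip the trailing slash from dir first and attach the suffix with paste0() before building the path. The name base_dir is introduced here for illustration, and the sample column is a factor as in the question.

```r
dir <- "E:/ubuntu-shared/salmonTutorial/"
samples <- data.frame(sample = factor(c("DRR016125", "DRR016126")))

# Drop any trailing slash so file.path() does not produce "//"
base_dir <- sub("/+$", "", dir)

# paste0() attaches the suffix to each sample name (factors are converted
# to character); file.path() then inserts the "/" separators
files <- file.path(base_dir, "quants", paste0(samples$sample, "_quants"))
files
# [1] "E:/ubuntu-shared/salmonTutorial/quants/DRR016125_quants"
# [2] "E:/ubuntu-shared/salmonTutorial/quants/DRR016126_quants"
```

The point is that every argument to file.path() becomes its own path component, so "_quants" has to be joined to the sample name before the call rather than passed separately.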

Isolating data from single XML nodeset in R xml2

I am trying to iteratively isolate and manipulate nodesets from an XML document, but I am getting a strange behavior in the xml_find_all() function in the xml2 package in R. Can someone please help me understand the scope of functions applied to a nodeset?
Here is an example:
library( xml2 )
library( dplyr )
doc <- read_xml( "<MEMBERS>
<CUSTOMER>
<ID>178</ID>
<FIRST.NAME>Alvaro</FIRST.NAME>
<LAST.NAME>Juarez</LAST.NAME>
<ADDRESS>123 Park Ave</ADDRESS>
<ZIP>57701</ZIP>
</CUSTOMER>
<CUSTOMER>
<ID>934</ID>
<FIRST.NAME>Janette</FIRST.NAME>
<LAST.NAME>Johnson</LAST.NAME>
<ADDRESS>456 Candy Ln</ADDRESS>
<ZIP>57701</ZIP>
</CUSTOMER>
</MEMBERS>" )
doc %>% xml_find_all( '//*') %>% xml_path()
# [1] "/MEMBERS" "/MEMBERS/CUSTOMER[1]"
# [3] "/MEMBERS/CUSTOMER[1]/ID" "/MEMBERS/CUSTOMER[1]/FIRST.NAME"
# [5] "/MEMBERS/CUSTOMER[1]/LAST.NAME" "/MEMBERS/CUSTOMER[1]/ADDRESS"
# [7] "/MEMBERS/CUSTOMER[1]/ZIP" "/MEMBERS/CUSTOMER[2]"
# [9] "/MEMBERS/CUSTOMER[2]/ID" "/MEMBERS/CUSTOMER[2]/FIRST.NAME"
#[11] "/MEMBERS/CUSTOMER[2]/LAST.NAME" "/MEMBERS/CUSTOMER[2]/ADDRESS"
#[13] "/MEMBERS/CUSTOMER[2]/ZIP"
The object customer.01 is a nodeset that contains data from that customer only.
kids <- xml_children( doc )
customer.01 <- kids[[1]]
customer.01
# {xml_node}
# <CUSTOMER>
# [1] <ID>178</ID>
# [2] <FIRST.NAME>Alvaro</FIRST.NAME>
# [3] <LAST.NAME>Juarez</LAST.NAME>
# [4] <ADDRESS>123 Park Ave</ADDRESS>
# [5] <ZIP>57701</ZIP>
Why does the function, applied to the customer.01 nodeset, return the ID for customer.02 as well?
xml_find_all( customer.01, "//MEMBERS/CUSTOMER/ID" )
# {xml_nodeset (2)}
# [1] <ID>178</ID>
# [2] <ID>934</ID>
How do I return only values from that nodeset?
~~~
Ok, so here's a small wrinkle in the solution below, again related to scope of the xml_find_all() function. It says that it can be applied to a document, node, or nodeset. However...
This case works when applied to a nodeset:
library( xml2 )
url <- "https://s3.amazonaws.com/irs-form-990/201501279349300635_public.xml"
doc <- read_xml( url )
xml_ns_strip( doc )
nd <- xml_find_all( doc, "//LiquidationOfAssetsDetail|//LiquidationDetail" )
nodei <- nd[[1]]
nodei
# {xml_node}
# <LiquidationOfAssetsDetail>
# [1] <AssetsDistriOrExpnssPaidDesc>LAND</AssetsDistriOrExpnssPaidDesc>
# [2] <DistributionDt>2014-11-04</DistributionDt>
# [3] <MethodOfFMVDeterminationTxt>SEE ATTACH</MethodOfFMVDeterminationTxt>
# [4] <EIN>abcdefghi</EIN>
# [5] <BusinessName>\n <BusinessNameLine1Txt>GREENSBURG PUBLIC LIBRARY</BusinessNameLine1Txt>\n</BusinessName>
# [6] <USAddress>\n <AddressLine1Txt>1110 E MAIN ST</AddressLine1Txt>\n <CityNm>GREENSBURG</CityNm>\n <StateAbbreviationCd>IN</StateAb ...
# [7] <IRCSectionTxt>501(C)(3)</IRCSectionTxt>
xml_text( xml_find_all( nodei, "AssetsDistriOrExpnssPaidDesc" ) )
# [1] "LAND"
But not this one:
nodei <- xml_children( nd[[1]] )
nodei
# {xml_nodeset (7)}
# [1] <AssetsDistriOrExpnssPaidDesc>LAND</AssetsDistriOrExpnssPaidDesc>
# [2] <DistributionDt>2014-11-04</DistributionDt>
# [3] <MethodOfFMVDeterminationTxt>SEE ATTACH</MethodOfFMVDeterminationTxt>
# [4] <EIN>abcdefghi</EIN>
# [5] <BusinessName>\n <BusinessNameLine1Txt>GREENSBURG PUBLIC LIBRARY</BusinessNameLine1Txt>\n</BusinessName>
# [6] <USAddress>\n <AddressLine1Txt>1110 E MAIN ST</AddressLine1Txt>\n <CityNm>GREENSBURG</CityNm>\n <StateAbbreviationCd>IN</StateAb ...
# [7] <IRCSectionTxt>501(C)(3)</IRCSectionTxt>
xml_text( xml_find_all( nodei, "AssetsDistriOrExpnssPaidDesc" ) )
# character(0)
I'm guessing this is a problem applying xml_find_all() to all elements of a nodeset rather than a scoping issue?
You are currently searching with an absolute path. An XPath expression that starts with a double forward slash, //, is evaluated from the document root no matter which node you pass in, so it finds every node in the document that matches the path, and that includes both customers' IDs.
To get child nodes of one specific node, use a path relative to the selected node:
xml_find_all(customer.01, "ID")
# {xml_nodeset (1)}
# [1] <ID>178</ID>
xml_find_all(customer.01, "FIRST.NAME|LAST.NAME")
# {xml_nodeset (2)}
# [1] <FIRST.NAME>Alvaro</FIRST.NAME>
# [2] <LAST.NAME>Juarez</LAST.NAME>
xml_find_all(customer.01, "*")
# {xml_nodeset (5)}
# [1] <ID>178</ID>
# [2] <FIRST.NAME>Alvaro</FIRST.NAME>
# [3] <LAST.NAME>Juarez</LAST.NAME>
# [4] <ADDRESS>123 Park Ave</ADDRESS>
# [5] <ZIP>57701</ZIP>
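The follow-up case in the question, where the search is run on the nodeset returned by xml_children(), seems to come down to the same rule: a relative path such as "ID" looks for children of each node in the set, and those leaf nodes have none. A sketch on a cut-down version of the customer data, using the standard XPath self:: axis to test the nodes in the set themselves:

```r
library(xml2)

doc <- read_xml("<MEMBERS>
  <CUSTOMER><ID>178</ID><ZIP>57701</ZIP></CUSTOMER>
  <CUSTOMER><ID>934</ID><ZIP>57701</ZIP></CUSTOMER>
</MEMBERS>")

customer.01 <- xml_children(doc)[[1]]

# Absolute path: searched from the document root, so both customers match
abs_ids <- xml_text(xml_find_all(customer.01, "//ID"))    # "178" "934"

# Relative paths: searched from customer.01 only
rel_ids  <- xml_text(xml_find_all(customer.01, "ID"))     # "178"
desc_ids <- xml_text(xml_find_all(customer.01, ".//ID"))  # "178"

# Nodeset of the customer's children: "ID" now means "an ID child of each
# of these nodes", which none of them has
fields    <- xml_children(customer.01)
child_hit <- xml_text(xml_find_all(fields, "ID"))         # character(0)

# self::ID tests the nodes in the set themselves
self_hit  <- xml_text(xml_find_all(fields, "self::ID"))   # "178"
```

So in the IRS example, searching the single node nd[[1]] with a relative path works, while searching its children needs self:: (or the search should simply stay on the parent node).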

Nested List Parsing with jsonlite

This is the second time that I have faced this recently, so I wanted to reach out to see if there is a better way to parse dataframes returned from jsonlite when one of the elements is an array that ends up stored as a list column in the dataframe.
I know that this is part of the power of jsonlite, but I am not sure how to work with this nested structure. In the end, I suppose that I can write my own custom parsing, but given that I am almost there, I wanted to see how to work with this data.
For example:
## options
options(stringsAsFactors=F)
## packages
library(httr)
library(jsonlite)
## setup
gameid="2015020759"
SEASON = '20152016'
BASE = "http://live.nhl.com/GameData/"
URL = paste0(BASE, SEASON, "/", gameid, "/PlayByPlay.json")
## get the data
x <- GET(URL)
## parse
api_response <- content(x, as="text")
api_response <- jsonlite::fromJSON(api_response, flatten=TRUE)
## get the data of interest
pbp <- api_response$data$game$plays$play
colnames(pbp)
And exploring what comes back:
> class(pbp$aoi)
[1] "list"
> class(pbp$desc)
[1] "character"
> class(pbp$xcoord)
[1] "integer"
From above, the column pbp$aoi is a list. Here are a few entries:
> head(pbp$aoi)
[[1]]
[1] 8465009 8470638 8471695 8473419 8475792 8475902
[[2]]
[1] 8470626 8471276 8471695 8476525 8476792 8477956
[[3]]
[1] 8469619 8471695 8473492 8474625 8475727 8476525
[[4]]
[1] 8469619 8471695 8473492 8474625 8475727 8476525
[[5]]
[1] 8469619 8471695 8473492 8474625 8475727 8476525
[[6]]
[1] 8469619 8471695 8473492 8474625 8475727 8475902
I don't really care if I parse these lists in the same dataframe, but what do I have for options to parse out the data?
I would prefer to take the data out of the lists and parse them into a dataframe that can be "related" to the original record it came from.
Thanks in advance for your help.
Thanks to the comment from @hrbmstr, I was able to get what I wanted using unnest() from tidyr.
library(dplyr); library(tidyr)
select(pbp, eventid, aoi) %>% unnest(cols = aoi) %>% head()
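For anyone who wants to see what this reshaping does without extra packages, here is a base R sketch on a toy data frame. The eventid values are made up, and the aoi values are a few of the player IDs printed in the question; it illustrates the idea and is not a replacement for the tidyr call.

```r
# Toy stand-in for pbp: one row per event, with a list column of player IDs
pbp_toy <- data.frame(eventid = c(1L, 2L))
pbp_toy$aoi <- list(c(8465009L, 8470638L),
                    c(8470626L, 8471276L, 8471695L))

# Repeat each eventid once per ID in its list entry, then flatten the list
long <- data.frame(
  eventid = rep(pbp_toy$eventid, lengths(pbp_toy$aoi)),
  aoi     = unlist(pbp_toy$aoi)
)
long
#   eventid     aoi
# 1       1 8465009
# 2       1 8470638
# 3       2 8470626
# 4       2 8471276
# 5       2 8471695
```

The result has one row per (event, player) pair, and eventid is the key that relates each row back to the original record.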
