The xml2 package allows users to create XML documents. I'm trying to create a document using the pipe operator %>% to add various combinations of child and sibling nodes. I cannot figure out how to create a child node within a child node that is following by the original child's sibling (see example below).
Is it possible to "rise" up a level to then create more nodes or must they be created outside of the chained commands?
What I want
library(xml2)
x1 <- read_xml("<parent><child>1</child><child><grandchild>2</grandchild></child><child>3</child><child>4</child></parent>")
message(x1)
#> <?xml version="1.0" encoding="UTF-8"?>
#> <parent>
#> <child>1</child>
#> <child>
#> <grandchild>2</grandchild>
#> </child>
#> <child>3</child>
#> <child>4</child>
#> </parent>
What I'm creating that's wrong
library(magrittr)
library(xml2)
x2 <- xml_new_document()
x2 %>%
xml_add_child("parent") %>%
xml_add_child("child", 1) %>%
xml_add_sibling("child", 4, .where="after") %>%
xml_add_sibling("child", 3) %>%
xml_add_sibling("child", .where="before") %>%
xml_add_child("grandchild", 2)
message(x2)
#> <?xml version="1.0" encoding="UTF-8"?>
#> <parent>
#> <child>1</child>
#> <child>4</child>
#> <child>
#> <grandchild>2</grandchild>
#> </child>
#> <child>3</child>
#> </parent>
Solution using XML package
This is actually fairly straightforward if done using the XML package.
library(XML)
x2 <- newXMLNode("parent")
invisible(newXMLNode("child", 1, parent=x2))
invisible(newXMLNode("child", newXMLNode("grandchild", 2), parent=x2))
invisible(newXMLNode("child", 3, parent=x2))
invisible(newXMLNode("child", 4, parent=x2))
x2
#> <?xml version="1.0" encoding="UTF-8"?>
#> <parent>
#> <child>1</child>
#> <child>
#> <grandchild>2</grandchild>
#> </child>
#> <child>3</child>
#> <child>4</child>
#> </parent>
I'm going to start by saying that I think this is generally a bad idea. xml2 works using pointers, which means that it has reference semantics ("pass by reference"), which is not the typical behavior in R. Functions in xml2 work by producing side effects on the XML tree, not by returning values like in functional programming ("pass by value").
This means that piping is basically the wrong principle. You just need a series of steps that modify the object in the correct order.
That said, you can do:
library("magrittr")
library("xml2")
x2 <- xml_new_document()
x2 %>%
xml_add_child(., "parent") %>%
{
xml_add_child(., "child", 1, .where = "after")
(xml_add_child(., "child") %>% xml_add_child("grandchild", 2))
xml_add_child(., "child", 3, .where = "after")
xml_add_child(., "child", 4, .where = "after")
}
message(x2)
## <?xml version="1.0" encoding="UTF-8"?>
## <parent>
## <child>1</child>
## <child>
## <grandchild>2</grandchild>
## </child>
## <child>3</child>
## <child>4</child>
## </parent>
The . tells the %>% where to place the "parent" node in subsequent calls to xml_add_child(). The ()-bracketed expression in the middle takes advantage of the fact that you want to pipe into the "child" node then pipe that child node into the grandchild node.
Another option, if you really want to use pipes throughout is to use the %T>% pipe, instead of the %>% pipe (or rather, a mix of the two). The difference between the two is the following:
> 1:3 %>% mean() %>% str()
num 2
> 1:3 %T>% mean() %>% str()
int [1:3] 1 2 3
The %T>% pipe pushes the value of the lefthand side expression into the righthand side expression, but further pushes it into the subsequent expression. This means you can call functions in the middle of a pipeline for their side effects and continue to pass the earlier object reference forward in the pipeline.
This is what you're trying to do when you say "rise up a level" - that is, back up to a previous value in the pipeline and work from there. So you need to just %T>% pipe until you get to a point where you want to %>% pipe (e.g., to create the grandchild) and then return to %T>% piping to continue carrying the parent object reference forward. An example:
x3 <- xml_new_document()
x3 %>%
xml_add_child("parent") %T>%
xml_add_child("child", 1, .where = "after") %T>%
{xml_add_child(., "child") %>% xml_add_child("grandchild", 2)} %T>%
xml_add_child("child", 3, .where = "after") %>%
xml_add_child("child", 4, .where = "after")
message(x3)
## <?xml version="1.0" encoding="UTF-8"?>
## <parent>
## <child>1</child>
## <child>
## <grandchild>2</grandchild>
## </child>
## <child>3</child>
## <child>4</child>
## </parent>
Note the final %>% instead of %T>%. If you swapped %>% for %T>% the value of the whole pipeline would be the "parent" node tree only:
{xml_document}
<parent>
[1] <child>1</child>
[2] <child>\n <grandchild>2</grandchild>\n</child>
[3] <child>3</child>
[4] <child>4</child>
(Which - again - ultimately doesn't really matter because we're actually building x3 using side effects, but it will print the parent node tree to the console, which is probably confusing.)
Again, I'd suggest not using the pipe at all given the awkwardness, but it's up to you. A better way is just to preserve each object you want to attach a child to and then refer to it again each time. Like in the first example, save the parent node as p, skip all the pipes, and just refer to p everywhere that . is used in the example code.
Related
In the XML package in R, it is possible to create a new xmlTree object with a namespace, e.g. using:
library(XML)
d = xmlTree("foo", namespaces = list(prefix = "url"))
d$doc()
# <?xml version="1.0"?>
# <foo xmlns:prefix="url"/>
How do I create a default namespace, without the prefix bar, such that it looks like the following?
# <?xml version="1.0"?>
# <foo xmlns="url"/>
The following does not produce what I expected.
library(XML)
d = xmlTree("foo", namespaces = list("url"))
d$doc()
# <?xml version="1.0"?>
# <url:foo xmlns:url="<dummy>"/>
There seems to be a difference between nameless lists and lists with an empty name in R.
1 - A nameless list:
list("url")
# [[1]]
# [1] "url"
names(list("url"))
# NULL
2 - A named list:
list(prefix = "url")
# $prefix
# [1] "url"
names(list(prefix = "url"))
# [1] "prefix"
3 - An incorrectly initialised empty-name list:
list("" = "url")
# Error: attempt to use zero-length variable name
4 - An hacky way to initialise an empty-name list:
setNames(list(prefix = "url"), "")
# [[1]]
# [1] "url"
names(setNames(list(prefix = "url"), ""))
# [1] ""
It would seem 1. and 4. are identical, however, in the package XML they produce different results. The first gives the incorrect XML as mentioned in the OP, whereas option 4. produces:
library(XML)
d = d = xmlTree("foo", namespaces = setNames(list(prefix = "url"), ""))
d$doc()
# <?xml version="1.0"?>
# <foo xmlns="url"/>
Can someone explain to me why I get a different result when I run the convertToDisplayTime function inside mutate than when I run it on its own? The correct result is the one I obtain when I run it on its own. Also, why do I get these warnings? It feels like I might be passing the whole timeInSeconds column as an argument when I call convertToDisplayTime in the mutate function, but I'm not sure that I really understand the mechanics in play here.
library('tidyverse')
#> Warning: package 'tibble' was built under R version 4.1.2
convertToDisplayTime <- function(timeInSeconds){
## Takes a time in seconds and converts it
## to a xx:xx:xx string format
if(timeInSeconds>86400){ #Not handling time over a day
stop(simpleError("Enter a time below 86400 seconds (1 day)"))
} else if(timeInSeconds>3600){
numberOfMinutes = 0
numberOfHours = timeInSeconds%/%3600
remainingSeconds = timeInSeconds%%3600
if(remainingSeconds>60){
numberOfMinutes = remainingSeconds%/%60
remainingSeconds = remainingSeconds%%60
}
if(numberOfMinutes<10){displayMinutes = paste0("0",numberOfMinutes)}
else{displayMinutes = numberOfMinutes}
remainingSeconds = round(remainingSeconds)
if(remainingSeconds<10){displaySeconds = paste0("0",remainingSeconds)}
else{displaySeconds = remainingSeconds}
return(paste0(numberOfHours,":",displayMinutes,":", displaySeconds))
} else if(timeInSeconds>60){
numberOfMinutes = timeInSeconds%/%60
remainingSeconds = timeInSeconds%%60
remainingSeconds = round(remainingSeconds)
if(remainingSeconds<10){displaySeconds = paste0("0",remainingSeconds)}
else{displaySeconds = remainingSeconds}
return(paste0(numberOfMinutes,":", displaySeconds))
} else{
return(paste0("0:",timeInSeconds))
}
}
(df <- tibble(timeInSeconds = c(2710.46, 2705.04, 2691.66, 2708.10)) %>% mutate(displayTime = convertToDisplayTime(timeInSeconds)))
#> Warning in if (timeInSeconds > 86400) {: the condition has length > 1 and only
#> the first element will be used
#> Warning in if (timeInSeconds > 3600) {: the condition has length > 1 and only
#> the first element will be used
#> Warning in if (timeInSeconds > 60) {: the condition has length > 1 and only the
#> first element will be used
#> Warning in if (remainingSeconds < 10) {: the condition has length > 1 and only
#> the first element will be used
#> # A tibble: 4 x 2
#> timeInSeconds displayTime
#> <dbl> <chr>
#> 1 2710. 45:10
#> 2 2705. 45:5
#> 3 2692. 44:52
#> 4 2708. 45:8
convertToDisplayTime(2710.46)
#> [1] "45:10"
convertToDisplayTime(2705.04)
#> [1] "45:05"
convertToDisplayTime(2691.66)
#> [1] "44:52"
convertToDisplayTime(2708.10)
#> [1] "45:08"
Created on 2022-01-06 by the reprex package (v2.0.1)
Like mentioned in the comments, the problem here is that your function is not vectorized: it works with a single value for an input and outputs a single value. However, this does not work when the input is a vector of values, hence the condition has length 1 warning you get:
1: Problem with `mutate()` column `displayTime`.\
ℹ `displayTime = convertToDisplayTime(timeInSeconds)`.
ℹ the condition has length > 1 and only the first element will be used
Here, when you use dplyr::mutate, you're technically trying to feed a vector to your function, which is not formatted to process it.
Several options you may consider:
1. The "fast and ugly" way:
df <- data.frame(timeInSeconds = c(2710.46, 2705.04, 2691.66, 2708.10))
## This one does not work
df %>% mutate(displayTime = convertToDisplayTime(timeInSeconds))
## This one works
df %>%
rowwise() %>%
mutate(displayTime = convertToDisplayTime(timeInSeconds)) %>%
ungroup()
dplyr::rowwise() allows dplyr::mutate() to work on each row independently, rather than by columns. I assume this is the behavior you initially expected. dplyr::ungroup() sorta reverts rowwise, eg. go back to the default column-wise behavior.
I may be a little harsh on this one, but this is the kind of trick that I used back when I did not quite understand my way around dataframes and their manipulation...
2. Vectorize directly from your dplyr verbs:
df %>%
mutate(displayTime = base::mapply(convertToDisplayTime, timeInSeconds))
## or
df %>%
mutate(displayTime = purrr::map_chr(timeInSeconds, convertToDisplayTime))
Both options are similar.
3. Vectorize your function:
convertToDisplayTime_vec <- base::Vectorize(convertToDisplayTime)
# class(convertToDisplayTime_vec)
df %>% mutate(displayTime = convertToDisplayTime_vec(timeInSeconds))
## or
convertToDisplayTime_vec2 <- function(timeInSeconds_vec) {
mapply(FUN = convertToDisplayTime, timeInSeconds_vec)
}
# class(convertToDisplayTime_vec2)
df %>%
mutate(displayTime = convertToDisplayTime_vec2(timeInSeconds))
# Still works on single variables!
# convertToDisplayTime_vec2(6475)
This is my favourite option, as once it is implemented you can use it either on single variables, vectors or dataframes, without worring about it.
A little documentation to dig a little into the subject.
PS: As an aside, a little tip worth remembering: you may want to be careful when manipulating data.frame and tibble objects. Despite their similarity, they have slight differences, and some functions deal differently with one or the other, or actually convert one to the other without your noticing...
rvest doesn't seem to offer any way to extract text from parent object only (ignoring children). One workaround uses xml_remove(), which mutates the original object - all the way up the memory chain given R's default lazy evaluation.
I look to rlang::duplicate(), which is supposed for "modifying the copy leaves the original object intact", but the clone does not appear to be truly independent. For example:
require(rvest)
h = '<ul>
<li id="target">
text to extract
<ul><li>text to ignore</li><li>this too</li></ul>
</li>
</ul>'
doc = xml2::read_html(h)
x = html_node(doc, '#target')
html_text(x)
#> [1] "\ntext to extract\ntext to ignorethis too"
Now clone x, remove its children, and extract the text:
x2 = rlang::duplicate(x, shallow = FALSE)
children = html_children(x2)
xml2::xml_remove(children)
html_text(x2)
#> [1] "\ntext to extract\n"
That works as intended, however x has also been mutated:
html_text(x)
#> [1] "\ntext to extract\n"
Any suggestions why and how to workaround this? I do not want to start re-attaching children..
First of all let me say that I think yoo can solve the issue without copying the data. I'm not an expert in xpath, but I think you can use it to just select only direct text descendents, ignoring text nested in other xml nodes. I.e. the following seems to do the trick without any copy (x defined as in your question):
html_text(html_elements(x, xpath = "text()"))
# [1] "\ntext to extract\n"
That being said, I also have an answer to the question on how to make a deep copy:
The problem is that rlang::duplicate() can only copy R data structures. However, rvest builds on xml2, and xml2 builds on the C library libxml2.
When you create the xml_node object in R, the corresponding data structure is created in libxml2. On the R side, there is basically just a pointer to the libxml2 object. So rlang::duplicate() will only create a copy of that pointer, but not of the underlying data. It cannot do so, because it has no access to it as it is in a different library (that rlang doesn't know of).
The easiest way to create a copy of the underlying data seems to be to serialize and deserialze the xml. I suspect this is not very efficent though.
Example:
Read in the original data:
require(rvest)
h <- '<ul>
<li id="target">
text to extract
<ul><li>text to ignore</li><li>this too</li></ul>
</li>
</ul>'
doc <- xml2::read_html(h)
x <- html_node(doc, '#target')
Create two copies - one with rlang:duplicate() and one with xml2::xml_unserialize():
x1 <- rlang::duplicate(x, shallow = FALSE)
x2 <- xml2::xml_unserialize(xml2::xml_serialize(x, NULL))
Check that x and x1 are in fact identical, while x2 is a true copy (the memory locations you get will be of course be different to the ones shown here):
x$doc
# <pointer: 0x0000023911334ea0>
x1$doc
# <pointer: 0x0000023911334ea0>
# --> same as x
x2$doc
# <pointer: 0x00000239113377d0>
# --> different to x
Test that everything works as intented:
children <- html_children(x2)
xml2::xml_remove(children)
html_text(x2)
# [1] "\n text to extract\n "
html_text(x)
# [1] "\n text to extract\n text to ignorethis too"
Another potential solution (maybe a more general approach) is to use the html_children() function to obtain the text of all the child nodes and then remove that from the full text.
require(rvest)
h = '<ul>
<li id="target">
text to extract
<ul><li>text to ignore</li><li>this too</li></ul>
</li>
</ul>'
doc = xml2::read_html(h)
x = html_node(doc, '#target')
fulltext <- html_text(x)
# [1] "\ntext to extract\ntext to ignorethis too"
#find the text in the children nodes
childtext <- html_children(x) %>% html_text()
# "text to ignorethis too"
#replace the child node text with a numm
gsub(childtext, "", fulltext) %>% trimws()
#"text to extract"
#alternative using the text from the first child node
firstchild <- xml_child(x, search=1) %>% xml_text()
gsub(paste0(firstchild, ".*"), "", fulltext)
Of course, if there are additional newline "\n" or formatting character, the gsub() may break.
I have several XML files which are missing the initial tag. For example, this is the proper formatted file:-
<?xml version="1.0"?>
<UDI>
<Test_Equipment_Number>3300061-01</Test_Equipment_Number>
<Test_SW_Number>3300062</Test_SW_Number>
<Test_SW_Version>2.1</Test_SW_Version>
<GTIN>(01)00884838088597</GTIN>
<LOT></LOT>
<Date_of_Mfg>(11)20190322</Date_of_Mfg>
<Device_SN>(21)1160001242</Device_SN>
<Material_Number>(96)300001287651</Material_Number>
<PCBA_WO_and_SN>00190311-0001242</PCBA_WO_and_SN>
<FW_Version>06</FW_Version>
<Model>324PHB</Model>
</UDI>
And this is the file with missing initial tag:-
<Test_Equipment_Number>3300011-01</Test_Equipment_Number>
<Test_SW_Number>3300012</Test_SW_Number>
<Test_SW_Version>5.1</Test_SW_Version>
<GTIN>(01)00884838085497</GTIN>
<LOT></LOT>
<Date_of_Mfg>(11)20190411</Date_of_Mfg>
<Device_SN>(21)1120104548</Device_SN>
<Material_Number>(96)300000267981</Material_Number>
<PCBA_WO_and_SN>000143-00000793</PCBA_WO_and_SN>
<FW_Version>V01.0001</FW_Version>
<Model>7000PHW</Model>
How could I read the file with missing initial tag in R Programming Language ?
One option would be to parse the xml fragment by specifying a top node to be added:
# install.packages('XML')
library(XML)
fragment <-
'<Test_Equipment_Number>3300011-01</Test_Equipment_Number>
<Test_SW_Number>3300012</Test_SW_Number>
<Test_SW_Version>5.1</Test_SW_Version>
<GTIN>(01)00884838085497</GTIN>
<LOT></LOT>
<Date_of_Mfg>(11)20190411</Date_of_Mfg>
<Device_SN>(21)1120104548</Device_SN>
<Material_Number>(96)300000267981</Material_Number>
<PCBA_WO_and_SN>000143-00000793</PCBA_WO_and_SN>
<FW_Version>V01.0001</FW_Version>
<Model>7000PHW</Model>'
XML::parseXMLAndAdd(fragment, top = 'content')
#> <content>
#> <Test_Equipment_Number>3300011-01</Test_Equipment_Number>
#> <Test_SW_Number>3300012</Test_SW_Number>
#> <Test_SW_Version>5.1</Test_SW_Version>
#> <GTIN>(01)00884838085497</GTIN>
#> <LOT/>
#> <Date_of_Mfg>(11)20190411</Date_of_Mfg>
#> <Device_SN>(21)1120104548</Device_SN>
#> <Material_Number>(96)300000267981</Material_Number>
#> <PCBA_WO_and_SN>000143-00000793</PCBA_WO_and_SN>
#> <FW_Version>V01.0001</FW_Version>
#> <Model>7000PHW</Model>
#> </content>
I am trying to extract informations from source code of a specific website
In the source code there are lines:
# [[4]]
# <script type="text/javascript">
# <![CDATA[
# <!-- // <![CDATA[
# var wp_dot_addparams = {
# "cid": "148938",
# "ctype": "article",
# "ctags": "dziejesiewkulturze,piraci z karaibów,Charlie Hebdo,Scorpions",
# "cauthor": "",
# "csource": "film.wp.pl",
# "cpageno": 1,
# "cpagemax": 1,
# "cdate": "2015-02-18"
# };
# // ]]]]><![CDATA[> -->
# ]]>
# </script>
From which I'd like to extract:
"ctags": "dziejesiewkulturze,piraci z karaibów,Charlie Hebdo,Scorpions",
Does anyone know how I should specify the selector in html_nodes function in rvest package in R?
html("http://film.wp.pl/id,148938,title,dziejesiewkulturze-Codzienna-dawka-informacji-kulturalnych-180215-WIDEO,wiadomosc.html") %>%
html_nodes("script")
Extract the JSON object from the element's text (tidy the selector up while you're at it)
Parse it as a list using jsonlite's fromJSON() function.
You can access it directly using "$ctags"
library(jsonlite)
json <- html("http://film.wp.pl/id,148938,title,dziejesiewkulturze-Codzienna-dawka-informacji-kulturalnych-180215-WIDEO,wiadomosc.html") %>%
html_nodes("script:contains('var wp_dot_addparams')") %>%
gsub(x=., pattern=".*var wp_dot_addparams = (\\{.*\\});.*",replacement="\\1") %>%
fromJSON()
json$ctags
[1] "dziejesiewkulturze,piraci z karaibów,Charlie Hebdo,Scorpions"