R rvest keeping italics in text when scraping


I'm looking to scrape some messages from an online message board.
Currently I am using:
html_nodes(conv,'.talk-post.message') %>%
html_text(trim = TRUE)
For the message:
I'm back now and slowly getting back to speed.
This gives:
"\nI'm back now and slowly getting back to speed.\n"
Which works fine, but removes all html formatting. I would like to retain an indication of where the text has italics tags (similarly for underlining and bold).
I appreciate I could use toString.XMLNode instead, but then that keeps all html tags, not just the three required.
"{xml_nodeset (1)}\n[1] <div class=\"talk-post message\">\\n<p><i>I'm back now and slowly getting back to speed.</i><br>
Are there any more elegant solutions to this?

You can use the XML library to get all the strings in the div.
> library(XML)
> txtNode <- "<div><i>Hello</i></div><div><b>World</b></div><div><b><i>!</i></b></div>"
> html <- htmlParse(txtNode)
> html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<div><i>Hello</i></div>
<div><b>World</b></div>
<div><b><i>!</i></b></div>
</body></html>
>
> lNode <- getNodeSet(html, "//div")
> lNode
[[1]]
<div>
<i>Hello</i>
</div>
[[2]]
<div>
<b>World</b>
</div>
[[3]]
<div>
<b>
<i>!</i>
</b>
</div>
attr(,"class")
[1] "XMLNodeSet"
>
> lapply(lNode, function(x) toString.XMLNode(x[[1]]))
[[1]]
[1] "<i>Hello</i> "
[[2]]
[1] "<b>World</b> "
[[3]]
[1] "<b>\n <i>!</i>\n</b> "
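A minimal sketch of one way to keep only the three tags asked about, assuming the node has already been serialized to a string (for example with as.character() on an rvest node, or toString.XMLNode as above). The helper name keep_inline_tags is hypothetical, and a regular expression is only a rough tool for HTML, so treat this as an illustration and not a robust parser:

```r
# Hypothetical helper: strip every tag except the ones listed in `keep`.
keep_inline_tags <- function(html_string, keep = c("i", "b", "u")) {
  # Match any opening or closing tag whose name is NOT one of `keep`
  pattern <- sprintf("</?(?!(?:%s)\\b)[a-zA-Z][^>]*>", paste(keep, collapse = "|"))
  trimws(gsub(pattern, "", html_string, perl = TRUE))
}

# Stand-in for the serialized message node from the question
raw <- "<div class=\"talk-post message\">\n<p><i>I'm back now and slowly getting back to speed.</i><br></p>\n</div>"

cleaned <- keep_inline_tags(raw)
cleaned
# [1] "<i>I'm back now and slowly getting back to speed.</i>"
```

The div, p and br tags are removed while the i tags stay in place, so the italic span can still be located in the text.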

Related

R markdown: datable within collapsible section

Below is a R-markdown document with <details> tags to create collapsible sections.
Can you help me to render the datatable from section 2 in the html output?
Minimal reproducible example
### Section 1
<details> <summary>Click to expand</summary>
```{r, echo=FALSE}
head(iris)
```
</details>
### Section 2
<details> <summary>Click to expand</summary>
```{r, echo=FALSE}
DT::datatable(iris)
```
</details>

This is not really an answer, but it is slightly too long to be a comment, so I've included it here. Hopefully someone can use it to work out an actual answer:
The good news is that it is definitely "possible". The bad news is that "it is not easy". With my limited knowledge of web development, the problem seems to be that DT::datatable (or, more correctly, htmlwidgets:::print.html_widget) creates an entire HTML web page in a temporary file, and this is the default method for visualizing DT::datatable. The file itself looks something like
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<style>body{background-color:white;}</style>
<script src="lib/htmlwidgets-1.5.3/htmlwidgets.js"></script>
<script src="lib/jquery-1.12.4/jquery.min.js"></script>
<link href="lib/datatables-css-0.0.0/datatables-crosstalk.css" rel="stylesheet" />
<script src="lib/datatables-binding-0.16/datatables.js"></script>
<link href="lib/dt-core-1.10.20/css/jquery.dataTables.min.css" rel="stylesheet" />
<link href="lib/dt-core-1.10.20/css/jquery.dataTables.extra.css" rel="stylesheet" />
<script src="lib/dt-core-1.10.20/js/jquery.dataTables.min.js"></script>
<link href="lib/crosstalk-1.1.0.1/css/crosstalk.css" rel="stylesheet" />
<script src="lib/crosstalk-1.1.0.1/js/crosstalk.min.js"></script>
</head>
<body>
<div id="htmlwidget_container">
<div id="htmlwidget-cd5f37d21433eb2088ae" style="width:960px;height:500px;" class="datatables html-widget"></div>
</div>
<script type="application/json" data-for="htmlwidget-cd5f37d21433eb2088ae">{"x":{"filter":"none","data":[["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42","43","44","45","46","47","48","49","50","51","52","53","54","55","56","57","58","59","60","61","62","63","64","65","66","67","68","69","70","71","72","73","74","75","76","77","78","79","80","81","82","83","84","85","86","87","88","89","90","91","92","93","94","95","96","97","98","99","100","101","102","103","104","105","106","107","108","109","110","111","112","113","114","115","116","117","118","119","120","121","122","123","124","125","126","127","128","129","130","131","132","133","134","135","136","137","138","139","140","141","142","143","144","145","146","147","148","149","150"],[5.1,4.9,4.7,4.6,5,5.4,4.6,5,4.4,4.9,5.4,4.8,4.8,4.3,5.8,5.7,5.4,5.1,5.7,5.1,5.4,5.1,4.6,5.1,4.8,5,5,5.2,5.2,4.7,4.8,5.4,5.2,5.5,4.9,5,5.5,4.9,4.4,5.1,5,4.5,4.4,5,5.1,4.8,5.1,4.6,5.3,5,7,6.4,6.9,5.5,6.5,5.7,6.3,4.9,6.6,5.2,5,5.9,6,6.1,5.6,6.7,5.6,5.8,6.2,5.6,5.9,6.1,6.3,6.1,6.4,6.6,6.8,6.7,6,5.7,5.5,5.5,5.8,6,5.4,6,6.7,6.3,5.6,5.5,5.5,6.1,5.8,5,5.6,5.7,5.7,6.2,5.1,5.7,6.3,5.8,7.1,6.3,6.5,7.6,4.9,7.3,6.7,7.2,6.5,6.4,6.8,5.7,5.8,6.4,6.5,7.7,7.7,6,6.9,5.6,7.7,6.3,6.7,7.2,6.2,6.1,6.4,7.2,7.4,7.9,6.4,6.3,6.1,7.7,6.3,6.4,6,6.9,6.7,6.9,5.8,6.8,6.7,6.7,6.3,6.5,6.2,5.9],[3.5,3,3.2,3.1,3.6,3.9,3.4,3.4,2.9,3.1,3.7,3.4,3,3,4,4.4,3.9,3.5,3.8,3.8,3.4,3.7,3.6,3.3,3.4,3,3.4,3.5,3.4,3.2,3.1,3.4,4.1,4.2,3.1,3.2,3.5,3.6,3,3.4,3.5,2.3,3.2,3.5,3.8,3,3.8,3.2,3.7,3.3,3.2,3.2,3.1,2.3,2.8,2.8,3.3,2.4,2.9,2.7,2,3,2.2,2.9,2.9,3.1,3,2.7,2.2,2.5,3.2,2.8,2.5,2.8,2.9,3,2.8,3,2.9,2.6,2.4,2.4,2.7,2.7,3,3.4,3.1,2.3,3,2.5,2.6,3,2.6,2.3,2.7,3,2.9,2.9,2.5,2.8,3.3,2.7,3,2.9,3,3,2.5,2.9,2.5,3.6,3.2,2.7,3,2.5,2.8,3.2,3,3.8,2.6,2.2,3.2,2.8,2.8,2.7,3.3,3.2,2.8,3,2.8,3,2.8,3.8,2.8,2.8,2.6,3,3.4,3.1,3,3.1,3.1,3.1,2.7,3.2,3.3,3,
2.5,3,3.4,3],[1.4,1.4,1.3,1.5,1.4,1.7,1.4,1.5,1.4,1.5,1.5,1.6,1.4,1.1,1.2,1.5,1.3,1.4,1.7,1.5,1.7,1.5,1,1.7,1.9,1.6,1.6,1.5,1.4,1.6,1.6,1.5,1.5,1.4,1.5,1.2,1.3,1.4,1.3,1.5,1.3,1.3,1.3,1.6,1.9,1.4,1.6,1.4,1.5,1.4,4.7,4.5,4.9,4,4.6,4.5,4.7,3.3,4.6,3.9,3.5,4.2,4,4.7,3.6,4.4,4.5,4.1,4.5,3.9,4.8,4,4.9,4.7,4.3,4.4,4.8,5,4.5,3.5,3.8,3.7,3.9,5.1,4.5,4.5,4.7,4.4,4.1,4,4.4,4.6,4,3.3,4.2,4.2,4.2,4.3,3,4.1,6,5.1,5.9,5.6,5.8,6.6,4.5,6.3,5.8,6.1,5.1,5.3,5.5,5,5.1,5.3,5.5,6.7,6.9,5,5.7,4.9,6.7,4.9,5.7,6,4.8,4.9,5.6,5.8,6.1,6.4,5.6,5.1,5.6,6.1,5.6,5.5,4.8,5.4,5.6,5.1,5.1,5.9,5.7,5.2,5,5.2,5.4,5.1],[0.2,0.2,0.2,0.2,0.2,0.4,0.3,0.2,0.2,0.1,0.2,0.2,0.1,0.1,0.2,0.4,0.4,0.3,0.3,0.3,0.2,0.4,0.2,0.5,0.2,0.2,0.4,0.2,0.2,0.2,0.2,0.4,0.1,0.2,0.2,0.2,0.2,0.1,0.2,0.2,0.3,0.3,0.2,0.6,0.4,0.3,0.2,0.2,0.2,0.2,1.4,1.5,1.5,1.3,1.5,1.3,1.6,1,1.3,1.4,1,1.5,1,1.4,1.3,1.4,1.5,1,1.5,1.1,1.8,1.3,1.5,1.2,1.3,1.4,1.4,1.7,1.5,1,1.1,1,1.2,1.6,1.5,1.6,1.5,1.3,1.3,1.3,1.2,1.4,1.2,1,1.3,1.2,1.3,1.3,1.1,1.3,2.5,1.9,2.1,1.8,2.2,2.1,1.7,1.8,1.8,2.5,2,1.9,2.1,2,2.4,2.3,1.8,2.2,2.3,1.5,2.3,2,2,1.8,2.1,1.8,1.8,1.8,2.1,1.6,1.9,2,2.2,1.5,1.4,2.3,2.4,1.8,1.8,2.1,2.4,2.3,1.9,2.3,2.5,2.3,1.9,2,2.3,1.8],["setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","setosa","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versic
olor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","versicolor","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica","virginica"]],"container":"<table class=\"display\">\n <thead>\n <tr>\n <th> <\/th>\n <th>Sepal.Length<\/th>\n <th>Sepal.Width<\/th>\n <th>Petal.Length<\/th>\n <th>Petal.Width<\/th>\n <th>Species<\/th>\n <\/tr>\n <\/thead>\n<\/table>","options":{"columnDefs":[{"className":"dt-right","targets":[1,2,3,4]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>
<script type="application/htmlwidget-sizing" data-for="htmlwidget-cd5f37d21433eb2088ae">{"viewer":{"width":450,"height":350,"padding":15,"fill":true},"browser":{"width":960,"height":500,"padding":40,"fill":false}}</script>
</body>
</html>
with data and header being obviously variable depending on the data.
Now we can inspect and edit the page while it is open:
1. In RStudio, call DT::datatable(iris) to show the table.
2. Click "Show in new window" to open it in your preferred browser.
3. Right-click anywhere on the page and click "Inspect element (Q)".
4. Right-click on "body" and click "Edit as HTML".
5. Finally, add <details> and </details> at the start and end of <body>.
Then we can see that it works as expected: the table is hidden while the section is closed and shown once it is opened.
So it is clearly "possible". The problem is extracting the code. Walking down into DT::datatable, you will eventually find that it calls htmlwidgets:::print.html_widget to open the actual HTML page. This lets us write a function that repeats those steps and extracts the HTML code used in the widget:
#' Generate html and make dependencies available in directory for a DT::datatable
#'
#' @param x a data.frame or DT::datatable
#' @param dir the (root) directory for the project/dependencies. See details
#' @param background background for the html widget
#' @param libdir directory to export dependencies to
#'
#' @details This function generates the html that is usually generated when
#' printing DT::datatable, and exports dependencies to a given directory, making
#' it useful for embedding the html into a markdown file or shiny script, either
#' running and saving this in the pre-amble/header or interactively. "dir"
#' can be used to specify the project root, with "libdir" specifying the path
#' relative from "dir" to place dependencies. Note that this likely forces
#' the html file to be placed in the project root, and not a sub-folder of the
#' project.
datatable_html <- function(x, dir = getwd(), background = "white", libdir = 'lib'){
  # Accept either a plain data.frame or an existing datatable widget
  if(is.data.frame(x))
    x <- DT::datatable(x)
  # from htmlwidgets:::print.html_widget
  x <- htmltools::as.tags(x, standalone = TRUE)
  # from htmltools::save_html (called by print.html_widget)
  x <- htmltools::renderTags(x)
  # Copy each dependency into libdir and make its path relative to dir
  deps <- lapply(x$dependencies, function(dep) {
    dep <- htmltools::copyDependencyToDir(dep,
                                          libdir,
                                          FALSE)
    dep <- htmltools::makeDependencyRelative(dep, dir, FALSE)
    dep
  })
  # Only add <body> tags if the rendered html does not already start with one
  bodyBegin <- if (!isTRUE(grepl("<body\\b", x$html[1],
                                 ignore.case = TRUE))) {
    "<body>"
  }
  bodyEnd <- if (!is.null(bodyBegin)) {
    "</body>"
  }
  c("<!DOCTYPE html>", "<html>", "<head>",
    "<meta charset=\"utf-8\"/>", sprintf("<style>body{background-color:%s;}</style>",
                                         htmltools::htmlEscape(background)),
    htmltools::renderDependencies(deps, c("href", "file")), x$head, "</head>",
    bodyBegin, # <=== body starts here, maybe remove?
    x$html,
    bodyEnd, # <=== Body ends here, maybe remove?
    "</html>")
}
dt_html <- datatable_html(DT::datatable(iris))
# print (very large output):
cat(dt_html, sep = "\n")
Now dt_html contains the HTML segments in a vector, and the dependencies are copied to {dir}/{libdir}, which should be a folder under the root of the markdown project. A few things to note: the HTML vector has the dependencies in dt_html[6] (these may have to be included in the markdown preamble), and the HTML for the widget itself is in dt_html[10], with the body tags in dt_html[9] and dt_html[11] respectively.
I am not skilled enough with HTML embedding in R Markdown to be sure where to go from here, but I am certain there are some aficionados out there with the proficiency to use this and provide the final part of the answer. I assume that a combination of document dependencies for the header and the HTML segment in dt_html should do the job.
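As a possible shortcut for the last step, here is a minimal sketch that builds the collapsible section with htmltools instead of writing the <details> tags by hand. This is an assumption on my part and not something verified in the answer above: it relies on knitr rendering htmltools tags (and the widget dependencies they carry) when the object is the last value of a chunk, and the table may need its column widths adjusted after being opened. The object name collapsible is just an illustration:

```r
library(htmltools)

# Wrap the widget in <details>/<summary> tags built by htmltools,
# so the widget's dependencies travel with the tag object.
collapsible <- tags$details(
  tags$summary("Click to expand"),
  DT::datatable(iris)
)

# In an R Markdown chunk (echo=FALSE), leave the object as the last expression
collapsible
```

Because the section is produced inside the chunk, the hand-written <details> lines around the chunk would be dropped in this variant.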

Out of the box, reactable will work with similar output and configurations.
Within an Rmd document:
<details><summary>Click to expand</summary>
```{r}
library(reactable)
reactable(mtcars)
```
</details>

This should work:
# Section 3
```{r}
library(shiny)
library(DT) # make sure you load DT *after* shiny
# Render
renderDataTable({
datatable(iris) %>% formatStyle(
'Sepal.Width',
backgroundColor = styleInterval(3.4, c('gray', 'yellow'))
)
})
```
It really bugged me that I couldn't figure it out, so I googled a bit and this should help you: https://blog.rstudio.com/2015/06/24/dt-an-r-interface-to-the-datatables-library/

rvest - select href tag string

I am using rvest.
> pgsession %>% jump_to(urls[2]) %>% read_html() %>% html_nodes("a")
{xml_nodeset (114)}
[1] Date
[2] Kennwort ändern
[3] Benutzernamen ändern
[4] Abmelden
...
However, I would only like to get back the a tags whose href attribute contains Mitglieder/Detail.
For example, a result should look like this:
[1] /Mitglieder/Detail/1213412
...
I tried, for example, a[href~=\"Mitglieder\ as the CSS selector, but I get nothing back as a result.
Any suggestions how to change this css selector?
I appreciate your replies!
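A minimal sketch of one selector that may do this, assuming the goal is a substring match on href. In CSS, ~= only matches a whole whitespace-separated word inside the attribute value, while *= matches a substring. The inline HTML below is a made-up stand-in for the real page, which is behind a login session:

```r
library(rvest)

# Made-up stand-in for the page returned by the session
page <- read_html(
  '<a href="/Mitglieder/Detail/1213412">Member</a>
   <a href="/Konto/Abmelden">Abmelden</a>'
)

# [href*='...'] keeps links whose href contains the given substring;
# html_attr() then returns the href string itself instead of the node
links <- page %>%
  html_nodes("a[href*='Mitglieder/Detail']") %>%
  html_attr("href")

links
# [1] "/Mitglieder/Detail/1213412"
```

With the original session, the same two calls would be appended after read_html() in the pipeline from the question.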

R- html_nodes doesnt find selector

I wanted to scrape some data with the "rvest" package from the URL http://www.finanzen.ch/kurse/historisch/ABB/SWL/1.1.2001_27.10.2015
I wanted to get the table with the following selector (copied via inspect option from chrome):
#historic-price-list > div > div.content > table
But html_nodes doesn't work:
> url="http://www.finanzen.ch/kurse/historisch/ABB/SWL/1.1.2001_27.10.2015"
> css_selector="#historic-price-list > div > div.content > table"
> html(url) %>% html_nodes(css_selector)
{xml_nodeset (0)}
What I can find is:
> css_selector="#historic-price-list"
> html(url) %>% html_nodes(css_selector)
{xml_nodeset (1)}
[1] <div id="historic-price-list"/>
But it does not go any further than that.
Maybe someone got an idea why?
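The printed <div id="historic-price-list"/> has no children in the HTML that R received, which would explain why the longer selector matches nothing. One possible reason, which is an assumption here and not confirmed by anything in the question, is that the browser fills the table in afterwards with JavaScript, so Chrome's inspector shows more than the downloaded source. A minimal sketch of how to check this, using an inline stand-in for the downloaded page:

```r
library(rvest)

# Stand-in for the downloaded page: the container exists but is empty
page <- read_html('<html><body><div id="historic-price-list"></div></body></html>')

container <- html_nodes(page, "#historic-price-list")
n_children <- xml2::xml_length(container)  # child elements inside the container

# Any selector that descends into the empty container finds nothing
tables <- html_nodes(page, "#historic-price-list table")

length(container)  # 1: the container itself is found
n_children         # 0: there is nothing inside it to select
length(tables)     # 0
```

If xml_length() returns 0 for the real page as well, the table has to be fetched from wherever the page loads it, and not from this HTML.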

Trouble rendering Css Data in Django?

So I'm trying to export a section of a website to PDF, and I'm able to output the HTML data properly, but the CSS code just appears as text in the PDF.
def exportPDf(results, css, html):
    result = StringIO.StringIO()

    results_2 = StringIO.StringIO(results.encode("UTF-8"))
    css_encode = StringIO.StringIO(css.encode("UTF-8"))

    pdf = pisa.pisaDocument(results_2, result)  # ISO-8859-1

    if not pdf.err:
        return HttpResponse(result.getvalue(), mimetype='application/pdf')
    return HttpResponse('We had some errors<pre>%s</pre>' % escape(html))

def get_data(request):
    results = request.GET['css'] + request.GET['html']
    html = request.GET['html']
    css = request.GET['css']
    return exportPDf(results, css, html)
Again, the HTML is fine. It's just the CSS part that doesn't render. It outputs the actual CSS code to the PDF.

If you've set up your CSS as such:
<style type="text/css">
body {
color:#fff;
}
</style>
Try wrapping your css with comments:
<style type="text/css">
<!--
body {
color:#fff;
}
-->
</style>
This wraps the CSS in an HTML comment, so it should no longer be printed as text. Since I can't see how your code is rendered, this is just a guess, but let me know if it does indeed work :)

Retrieve Image Source with XPath in R using XML-Package

I'd like to retrieve the image source within the below HTML code block, but can't find the right syntax.
library(XML)
library(RCurl)
script <- getURL("http://www.floraweb.de/pflanzenarten/druck.xsql?suchnr=4346")
(doc <- htmlParse(script))
<div class="divider"><hr></div>
<div id="contentblock"><div id="content">
<h1>Alle Angaben</h1>
<p>Zu der von Ihnen gewählten Pflanzenart liegen folgende Informationen vor:</p>
<p>Wissenschaftlicher Name: Poa badensis agg. </p>
<p>Deutscher Name: Artengruppe Badener Rispengras</p>
<p>Familienzugehörigkeit: Poaceae, Süßgräser</p>
<p>Status: keine Angaben </p>
<p class="centeredcontent"><img border="0" src="../bilder/Arten/dummy.tmb.jpg"></p>
Desired result:
"../bilder/Arten/dummy.tmb.jpg"
Any pointers are greatly appreciated!

Try the following:
script <- getURL("http://www.floraweb.de/pflanzenarten/druck.xsql?suchnr=4346")
doc <- htmlTreeParse(script,useInternalNodes=T)
img <- xpathSApply(doc, '//*/p[@class="centeredcontent"]/img', xmlAttrs)
> img[2]
[1] "../bilder/Arten/dummy.tmb.jpg"
The use of the internal representation may be necessary.
EDIT:
I just looked up htmlParse, and it is equivalent to htmlTreeParse(useInternalNodes = TRUE).
@Martin Morgan thanks, I have added it below:
doc <- htmlParse("http://www.floraweb.de/pflanzenarten/druck.xsql?suchnr=4346")
xpathSApply(doc, '//*/p[@class="centeredcontent"]/img/@src')

Use:
//div[@id='contentblock']
/div/p[@class='centeredcontent']
/img/@src
This selects the src attribute of the img child of any p element whose class attribute has the value "centeredcontent" and that (the p element) is a child of a div that is a child of a div whose id attribute has the value "contentblock".
If you want to get the value of this attribute directly, use:
string(//div[@id='contentblock']
/div/p[@class='centeredcontent']
/img/@src)
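A minimal sketch of running that expression from R with the XML package, assuming the page structure shown in the question. The HTML string below is a shortened copy of that block so the example does not depend on the live site:

```r
library(XML)

# Shortened copy of the HTML block from the question
html <- paste0(
  '<div id="contentblock"><div id="content">',
  '<h1>Alle Angaben</h1>',
  '<p class="centeredcontent"><img border="0" src="../bilder/Arten/dummy.tmb.jpg"></p>',
  '</div></div>'
)
doc <- htmlParse(html, asText = TRUE)

# Selecting the attribute node returns its value; as.character() drops the name
src <- as.character(xpathSApply(
  doc,
  "//div[@id='contentblock']/div/p[@class='centeredcontent']/img/@src"
))

src
# [1] "../bilder/Arten/dummy.tmb.jpg"
```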
