Unescaping parsed strings produced by R XML package?

I have been cribbing off of the very helpful responses on Scraping html tables into R data frames using the XML package to scrape some html off the web and work with it in R.
The XML package seems to be pretty thorough about escaping non-alphabetic characters in text strings. Is there a simple way in XML or some other package that would reverse some/all of the character escaping that passing my data through XML did? I started to do it myself, but after encountering cases like 'Representative Joaquín Castro' thought 'there must be a better solution...'
Just for clarity, using the XML package to parse this HTML
library(XML)
apos_str <- c("<b>Tim O'Reilly</b>")
apos_str.parsed <- htmlTreeParse(apos_str, error=function(...){})
apos_str.parsed$children$html[[1]][[1]]
would produce
<b>Tim O&apos;Reilly</b>
And I'd ideally like a function or package that would search for that
&apos;
and turn it back into
'<b>Tim O'Reilly</b>'
Edit: To clarify, from the comments below: I get how to do this for the particular case of apostrophes, or for any other character I see in my data. What I'm looking for is a package where someone has worked this out more generally.
Research I've done so far:
- Read everything I could find in the XML documentation on escaping.
- Looked for a promising package on the CRAN NLP page.
- Did a search for 'unescape [R]' and 'reverse escape [R]' here on SO.
Wasn't able to make any headway so thought I would bring the question here.

I'm not sure I understand the difficulty. String processing for replacements is done with the base regex functions: sub, gsub, regexpr, gregexpr
?sub # the same help page will also discuss 'gsub'
txt <- '<b>Tim O&apos;Reilly</b>'
sub("\\&apos;", "'", txt)
[1] "<b>Tim O'Reilly</b>"
If you had a list of values that occur between "&" and ";" you could split on those and then recombine. I suppose it is possible that you were hoping someone had already done that. You should clarify what level of abstraction you were hoping to achieve.
EDIT:
A blogger discusses the specific case of "&apos" http://fishbowl.pastiche.org/2003/07/01/the_curse_of_apos/
I've done some further research of my own. Those are not properly called "escapes" but rather "named entities". I cannot find any references to them in the rhelp archives. I have downloaded the XML listing from the w3.org website that defines these "entities" and am trying to convert it to a tabular form that would support search and replace. But your comment about 'Representative JoaquÃ­n Castro' has me puzzled. The odd characters are not in the form "&#xxx;", so what exactly are you asking for? Please post a suitable test case with the expected output.
EDIT 2: There was a basically identical question from Michael Friendly that just got answered by David Carlson on Rhelp. Here's the link to the posting on the Rhelp archives:
https://stat.ethz.ch/pipermail/r-help/2012-August/321478.html
He's already done a better job than I had on creating a translation table and has included code to march through html text. (and a bonus... he included &apos). And a next-day followup from Michael Friendly has wrapped the process up in a function. You can follow the link on the Archives page.
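For anyone who just wants the shape of the table-driven approach, here is a minimal base-R sketch. The function name unescape_entities and the tiny entity_table are made up for illustration and are not the code from the R-help post; a real table would cover the full w3.org entity list. It also handles decimal numeric references of the form &#nnn;.

```r
# Hypothetical, deliberately incomplete lookup table of named entities.
# "&amp;" comes last so that "&amp;lt;" is unescaped only once (to "&lt;").
entity_table <- c(
  "&apos;" = "'",
  "&quot;" = "\"",
  "&lt;"   = "<",
  "&gt;"   = ">",
  "&amp;"  = "&"
)

unescape_entities <- function(x) {
  # Replace decimal numeric character references such as &#39;
  m <- gregexpr("&#[0-9]+;", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(refs) {
    codes <- as.integer(gsub("[^0-9]", "", refs))
    vapply(codes, intToUtf8, character(1))
  })
  # Replace named entities by literal (fixed) matching, in table order
  for (ent in names(entity_table)) {
    x <- gsub(ent, entity_table[[ent]], x, fixed = TRUE)
  }
  x
}

unescape_entities("<b>Tim O&apos;Reilly</b>")
```

The same call works on a whole character vector, since gsub and regmatches are both vectorised over x.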

Related

Link to R6 method from separate package in help pages and pkgdown

Cross posted from: https://community.rstudio.com/t/link-to-r6-method-from-separate-package-in-help-pages-and-pkgdown/134702
I'm currently writing an R package and would like to link to the help page for an R6 method in a separate package. The page I want to link to is here: https://mc-stan.org/cmdstanr/reference/model-method-sample.html, and there is an .Rd file for the method as well (https://github.com/stan-dev/cmdstanr/blob/master/man/model-method-sample.Rd). Finally, I can also access the help page from R directly with ?cmdstanr::`model-method-sample`.
However, when I try to add a link to my own help page using the normal link to another package syntax described here, [cmdstanr::`model-method-sample`], I get this error:
Warning: Link to unknown topic: cmdstanr::`model-method-sample`
I feel like there must be some way to link to this help page, given that it definitely exists and has an .Rd page, but I haven't found a solution yet. Has anyone else run into this problem or know the solution?
I think you (or Roxygen) are using the wrong syntax for the link. According to Writing R Extensions, the Rd syntax should be:
\link[cmdstanr]{model-method-sample}
I'm not sure how to generate this from Roxygen, but it appears to work as-is if I put it in Roxygen comments.
If you want the link with different text, the syntax is
\link[cmdstanr:model-method-sample]{link text}
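As a rough sketch of how that Rd link might sit inside a Roxygen block: the function fit_model and its documentation text below are hypothetical, and only the \link[...]{...} syntax is taken from the answer above.

```r
#' Fit a model (hypothetical wrapper, for illustration only)
#'
#' Sampling arguments are described in
#' \link[cmdstanr:model-method-sample]{the cmdstanr sample method}.
#'
#' @param ... Arguments forwarded to the sampler.
#' @export
fit_model <- function(...) {
  # Placeholder body; a real wrapper would call into cmdstanr here.
  invisible(list(...))
}
```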

Retaining newlines and other content when reading in a screenplay in R

I figure readLines() or maybe the tm package will be the way to go for this, but I was wondering what recommendations folks have for reading a .txt screenplay. Demarcating parts of the screenplay will require case-sensitive and newline-sensitive regexes and such, so I need to preserve all those elements in the content that is read. Thoughts?
read_file() from the readr package (part of the tidyverse) does a nice job of this: it reads the whole file into a single string, with newlines and case left intact.
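If you would rather stay in base R, here is a small sketch of the same idea. The sample screenplay text and the heading regex are invented for the example; readr::read_file(path) should give roughly the equivalent of the collapsed string in one call.

```r
# Write a tiny made-up screenplay to a temporary file
path <- tempfile(fileext = ".txt")
writeLines(c("INT. KITCHEN - DAY", "", "JOHN", "Where is the coffee?"), path)

# One element per line; blank lines are kept as "" and case is untouched
lines <- readLines(path)

# Collapse to a single string if newline-sensitive regexes are needed
script <- paste(lines, collapse = "\n")

# Case-sensitive match for scene headings at the start of a line
headings <- grep("^(INT|EXT)\\.", lines, value = TRUE)
```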

Indenting code within .R script without using a function

I am new to R, and despite searching the forums I have been unable to find a solution to indenting code within both the Source window and Document Outline (Ctrl+Shift+O).
An example is shown below.
Ideally, I would want the code to function as below when pressing Alt+O
This function does seem to be implemented in some fashion as you get the indented code with functions but this is less than ideal.
# Section 1 -----------------------------------------------------------
function(x) {
# Section 1A ===========================================================
}
Has anyone found a work-around to implement this?
Not a fix but a workaround:
Any whitespace after a "." is included in the heading, so a "." followed by a tab or space can be used to create indented headers preceded by a ".".
# Section title ---------------------------------------------------------------
# . Subsection A --------------------------------------------------------------
# . . A.1 ---------------------------------------------------------------------
Would still be nice to see this implemented the way it is in R markdown, but in the mean time might make it easier to navigate scripts using sub-headers.
Screenshot of example script using dot-tab to indent headers
For what it's worth, this sort of nested indent is implemented for Markdown sections (e.g. in R Markdown documents).
However, this sort of nesting is not implemented for sections in plain R scripts. You might consider filing this as a feature request for the RStudio team.
Thanks @Foztarz. I posted this as an issue on the RStudio GitHub about a year ago. They said it was a worthy enhancement, but they keep pushing it to the next version of RStudio lol. My work-around was similar: I used Alt codes to insert symbols I found a bit more visually appealing than ".".
# ▬ Section A ------
# ▐ ▬ Section A.1-----------
Here's what it looks like inside RStudio
Two additions:
Box-drawing characters look much better than the characters mentioned in @Patrick's answer (https://stackoverflow.com/a/63812437/13776976).
From RStudio 1.4 on, you can indicate sub-sections by additional '#' at the start of the line, see RStudio How To Article
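A short sketch of what the extra-'#' style might look like in a script, assuming RStudio 1.4 or later. The section names and the toy data are made up; the number of leading '#' characters is what sets the nesting level in the document outline.

```r
# Load data ----
raw_data <- data.frame(x = 1:3)

## Clean data ----
clean_data <- raw_data[raw_data$x > 1, , drop = FALSE]

### Check result ----
n_rows <- nrow(clean_data)
```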

sentence index for getOpenIE method in coreNLP for R

I'm using the R wrapper for Stanford's CoreNLP tools, specifically the getOpenIE method, in order to extract relation triples. This appears to work fine but I'm a bit confused about the output. Whereas getCoreference returns a data frame with a sentence column, getOpenIE does not, and subject_start, subject_end, etc. seem to be in-sentence references.
How can I determine the exact position of those elements in a document?
You want to look at the XML output. All the getOpenIE/getCoreference functions do is parse the XML anyway. I had to edit those functions as well to get sentence information.

How to access the help/documentation .Rd source files in R?

In R, one very neat feature is that the source code of functions is accessible as objects in the workspace.
Thus, if I wanted to know the source code of, for example, grep() I can simply type grep into the console and read the code.
Similarly, I can read the documentation for grep by typing ?grep into the console.
Question: How can I get the source code for the documentation of a function? In other words, where do I find the .Rd files?
I find studying the source of well-written code an excellent way of learning the idioms. Now I want to study how to write documentation for some very specific cases. I have not been able to find the documentation files for any of the base R functions in my R installation. Perhaps I have been looking in the wrong place.
It seems you can extract the Rd sources from an installed R. I'm using R-devel (2011-09-05 r56942).
Get the database of Rd for the base package.
library(tools)
db <- Rd_db("base")
Search for "grep.Rd" in the names of the Rd DB, for example:
grep("grep.Rd", names(db), value = TRUE)
[1] "d:/murdoch/recent/R64/src/library/base/man/agrep.Rd"
[2] "d:/murdoch/recent/R64/src/library/base/man/grep.Rd"
Get just the Rd object for grep.
db[grep("/grep.Rd", names(db))]
$`d:/murdoch/recent/R64/src/library/base/man/grep.Rd`
\title{Pattern Matching and Replacement}
\name{grep}
\alias{grep}
\alias{grepl}
\alias{sub}
\alias{gsub}
\alias{regexpr}
\alias{gregexpr}
\alias{regexec}
\keyword{character}
\keyword{utilities}
\description{
\code{grep}, \code{grepl}, \code{regexpr} and \code{gregexpr} search
for matches to argument \code{pattern} within each element of a
character vector: they differ in the format of and amount of detail in
the results.
\code{sub} and \code{gsub} perform replacement of the first and all
matches respectively.
}\usage{
...
...
There are tools for getting the components from the Rd objects, so you can refine searching to keywords or name, see examples in ?Rd_db and try this.
lapply(db, tools:::.Rd_get_metadata, "name")
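To read one help page back as Rd source text, something along these lines should work. This is a sketch: it assumes the base package's help database is available in your installation, and the regex allows for names that are either bare file names or full paths, since that seems to vary between builds.

```r
library(tools)

db <- Rd_db("base")

# Match "grep.Rd" exactly, with or without a leading directory path
idx <- grep("(^|/)grep\\.Rd$", names(db))
rd <- db[[idx[1]]]

# as.character() on an Rd object returns its pieces; deparse = TRUE tries
# to restore escapes so the pasted result reads like the original source
rd_source <- paste(as.character(rd, deparse = TRUE), collapse = "")

# Print the first few hundred characters of the Rd source
cat(substr(rd_source, 1, 300), "\n")
```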

Resources