I have been cribbing off of the very helpful responses on Scraping html tables into R data frames using the XML package to scrape some html off the web and work with it in R.
The XML package seems to be pretty thorough about escaping non-alphabetic characters in text strings. Is there a simple way in XML or some other package that would reverse some/all of the character escaping that passing my data through XML did? I started to do it myself, but after encountering cases like 'Representative JoaquÃÂn Castro' thought 'there must be a better solution...'
Just for clarity, using the XML package to parse this HTML
library(XML)
apos_str <- c("<b>Tim O'Reilly</b>")
apos_str.parsed <- htmlTreeParse(apos_str, error=function(...){})
apos_str.parsed$children$html[[1]][[1]]
would produce
<b>Tim O&apos;Reilly</b>
And I'd ideally like a function or package that would search for that
&apos;
and turn it back into
'<b>Tim O'Reilly</b>'
Edit: To clarify, from the comments below, I get how to do this for the particular case of apostrophes, or any other character I see in my data. What I'm looking for is a package where someone has worked this out more generally.
Research I've done so far:
- Read everything I could find in the XML documentation on escaping.
- Looked for a promising package on the CRAN NLP page.
- Did a search for 'unescape [R]' and 'reverse escape [R]' here on SO.
Wasn't able to make any headway so thought I would bring the question here.
I'm not sure I understand the difficulty. String replacement is done with the base regex functions: sub, gsub, regexpr, gregexpr
?sub # the same help page will also discuss 'gsub'
txt <- '<b>Tim O&apos;Reilly</b>'
sub("&apos;", "'", txt, fixed = TRUE)
[1] "<b>Tim O'Reilly</b>"
If you had a list of values that occur between "&" and ";" you could split on those and then recombine. I suppose it is possible that you were hoping someone had already done that. You should clarify what level of abstraction you were hoping to achieve.
EDIT:
A blogger discusses the specific case of "&apos" http://fishbowl.pastiche.org/2003/07/01/the_curse_of_apos/
I've done some further research of my own. Those are not properly called "escapes" but rather "named entities". I cannot find any references to them in the rhelp archives. I have downloaded the XML listing from the w3.org website that defines these "entities" and am trying to convert it to a tabular form that would support search and replace. But your comment about 'Representative JoaquÃÂn Castro' has me puzzled. The odd characters are not in the form "&#xxx", so ........... what exactly are you asking for? Please post a suitable test case with the expected output.
EDIT 2: There was a basically identical question from Michael Friendly that just got answered by David Carlson on Rhelp. Here's the link to the posting on the Rhelp archives:
https://stat.ethz.ch/pipermail/r-help/2012-August/321478.html
He's already done a better job than I had on creating a translation table and has included code to march through html text. (and a bonus... he included &apos). And a next-day followup from Michael Friendly has wrapped the process up in a function. You can follow the link on the Archives page.
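For what it's worth, the translation-table idea can be sketched in base R. This is only a minimal illustration and not the code from the R-help thread: the function name unescape_html and the five-entry entities table are made up here, and the table is deliberately incomplete (extend it from the W3C entity list as needed). It also handles decimal references such as &#39; only, not hexadecimal ones.

```r
# Hypothetical, deliberately incomplete lookup table of named entities.
# "&amp;" comes last so that "&amp;lt;" ends up as "&lt;" rather than "<".
entities <- c("&apos;" = "'", "&quot;" = "\"", "&lt;" = "<",
              "&gt;" = ">", "&amp;" = "&")

# Hypothetical helper: reverse a small set of HTML entities in a character vector.
unescape_html <- function(x) {
  # Decimal numeric references such as &#39;
  m <- gregexpr("&#[0-9]+;", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(hits) {
    codes <- as.integer(gsub("[^0-9]", "", hits))
    vapply(codes, intToUtf8, character(1))
  })
  # Named entities, replaced literally in table order
  for (nm in names(entities)) {
    x <- gsub(nm, entities[[nm]], x, fixed = TRUE)
  }
  x
}

unescape_html("<b>Tim O&apos;Reilly</b>")
```

Note that this does nothing for mis-encoded text like the 'Castro' example above; that is an encoding problem rather than an entity problem.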
Cross posted from: https://community.rstudio.com/t/link-to-r6-method-from-separate-package-in-help-pages-and-pkgdown/134702
I'm currently writing an R package and would like to link to the help page for an R6 method in a separate package. The page I want to link to is here: https://mc-stan.org/cmdstanr/reference/model-method-sample.html, and there is an .Rd file for the method as well (https://github.com/stan-dev/cmdstanr/blob/master/man/model-method-sample.Rd). Finally, I can also access the help page from R directly with ?cmdstanr::`model-method-sample`.
However, when I try to add a link to my own help page using the normal link to another package syntax described here, [cmdstanr::`model-method-sample`], I get this error:
Warning: Link to unknown topic: cmdstanr::`model-method-sample`
I feel like there must be some way to link to this help page, given that it definitely exists and has an .Rd page, but I haven't found a solution yet. Has anyone else run into this problem or know the solution?
I think you (or Roxygen) are using the wrong syntax for the link. According to Writing R Extensions, the Rd syntax should be:
\link[cmdstanr]{model-method-sample}
I'm not sure how to generate this from Roxygen, but it appears to work as-is if I put it in Roxygen comments.
If you want the link with different text, the syntax is
\link[cmdstanr:model-method-sample]{link text}
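To show where these would sit, here is a rough sketch of a Roxygen block that uses both forms. The function fit_model and its body are hypothetical placeholders, made up purely to give the comments something to attach to; only the two \link forms come from the answer above.

```r
#' Fit a model (hypothetical wrapper, for illustration only)
#'
#' Arguments are meant to be forwarded to the CmdStan sampler. See
#' \link[cmdstanr:model-method-sample]{the sample method} for details.
#'
#' @param ... Arguments to forward.
#' @seealso \link[cmdstanr]{model-method-sample}
fit_model <- function(...) {
  # Placeholder body: just collect the arguments.
  list(...)
}
```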
I figure readLines() or maybe the tm package will be the way to go for this, but I was wondering what recommendations folks have for reading a .txt screenplay. Demarcating parts of the screenplay will require case-sensitive and newline-sensitive regexes and such, so I need to preserve all those elements in the content that is read. Thoughts?
read_file() from the readr package (part of the tidyverse) does a nice job of this.
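A small sketch of the idea, using base R so it runs without extra packages. The sample screenplay lines and the INT./EXT. heading pattern are assumptions made up for the example; readr::read_file(path) should give a similar single string with the newlines kept.

```r
# Write a tiny made-up screenplay to a temporary file.
path <- tempfile(fileext = ".txt")
writeLines(c("INT. KITCHEN - DAY", "", "ALICE", "Where is the coffee?"), path)

# Read the whole file into one string, keeping case and newlines.
script <- paste(readLines(path), collapse = "\n")

# Multi-line, case-sensitive regex: lines that start with INT. or EXT.
headings <- regmatches(
  script,
  gregexpr("(?m)^(INT|EXT)\\..*$", script, perl = TRUE)
)[[1]]

# Lines made up only of capital letters and spaces, taken here as speaker names.
speakers <- grep("^[A-Z][A-Z ]+$", readLines(path), value = TRUE)
```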
I am new to R, and despite searching the forums I have been unable to find a solution to indenting code within both the Source window and Document Outline (Ctrl+Shift+O).
An example is shown below.
Ideally, I would want the code to function as below when pressing Alt+O
This function does seem to be implemented in some fashion as you get the indented code with functions but this is less than ideal.
# Section 1 -----------------------------------------------------------
function(x) {
# Section 1A ===========================================================
}
Has anyone found a work-around to implement this?
Not a fix but a workaround:
Any whitespace after a "." is included in the heading, so a "." followed by a tab or space can be used to create indented headers preceded by a ".".
# Section title ---------------------------------------------------------------
# . Subsection A --------------------------------------------------------------
# . . A.1 ---------------------------------------------------------------------
It would still be nice to see this implemented the way it is in R Markdown, but in the meantime this might make it easier to navigate scripts using sub-headers.
Screenshot of example script using dot-tab to indent headers
For what it's worth, this sort of nested indent is implemented for Markdown sections (e.g. in R Markdown documents).
However, this sort of nesting is not implemented for sections in plain R scripts. You might consider filing this as a feature request for the RStudio team.
Thanks @Foztarz. I posted this as an issue about a year ago on GitHub. They said it was a worthy enhancement, but they keep pushing it to the next version of RStudio lol. My work-around was similar. I used Alt codes to insert symbols I found a bit more visually appealing than the "."
# ▬ Section A ------
# ▐ ▬ Section A.1-----------
Here's what it looks like inside RStudio
Two additions:
Box-drawing characters look much better than the characters mentioned in @Patrick's answer (https://stackoverflow.com/a/63812437/13776976)
From RStudio 1.4 on, you can indicate sub-sections by additional '#' at the start of the line, see RStudio How To Article
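As a rough illustration of the second point, a script laid out like this should show nested entries in the Document Outline in RStudio 1.4 or later. The section names and the assignments are placeholders.

```r
# Section 1 ----
x <- 1

## Subsection 1A ----
y <- x + 1

### Sub-subsection 1A.i ----
z <- y + 1
```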
I'm using the R wrapper for Stanford's CoreNLP tools, specifically the getOpenIE method, in order to extract relation triples. This appears to work fine, but I'm a bit confused about the output. Whereas getCoreference returns a data frame with a sentence column, getOpenIE does not, and subject_start, subject_end, etc. seem to be in-sentence references.
How can I determine the exact position of those elements in a document?
You want to look at the XML output. All the getOpenIE/getCoreference functions do is parse the XML anyway. I had to edit those functions as well to get sentence information.
In R, one very neat feature is that the source code of functions is accessible as objects in the workspace.
Thus, if I wanted to know the source code of, for example, grep() I can simply type grep into the console and read the code.
Similarly, I can read the documentation for grep by typing ?grep into the console.
Question: How can I get the source code for the documentation of a function? In other words, where do I find the .Rd files?
I find studying the source of well-written code an excellent way of learning the idioms. Now I want to study how to write documentation for some very specific cases. I have not been able to find the documentation files for any of the base R functions in my R installation. Perhaps I have been looking in the wrong place.
It seems you can extract the Rd sources from an installed R. I'm using R-devel (2011-09-05 r56942).
Get the database of Rd for the base package.
library(tools)
db <- Rd_db("base")
Search for "grep.Rd" in the names of the Rd DB, for example:
grep("grep.Rd", names(db), value = TRUE)
[1] "d:/murdoch/recent/R64/src/library/base/man/agrep.Rd"
[2] "d:/murdoch/recent/R64/src/library/base/man/grep.Rd"
Get just the Rd object for grep.
db[grep("/grep.Rd", names(db))]
$`d:/murdoch/recent/R64/src/library/base/man/grep.Rd`
\title{Pattern Matching and Replacement}
\name{grep}
\alias{grep}
\alias{grepl}
\alias{sub}
\alias{gsub}
\alias{regexpr}
\alias{gregexpr}
\alias{regexec}
\keyword{character}
\keyword{utilities}
\description{
\code{grep}, \code{grepl}, \code{regexpr} and \code{gregexpr} search
for matches to argument \code{pattern} within each element of a
character vector: they differ in the format of and amount of detail in
the results.
\code{sub} and \code{gsub} perform replacement of the first and all
matches respectively.
}\usage{
...
...
There are tools for getting the components from the Rd objects, so you can refine searching to keywords or name, see examples in ?Rd_db and try this.
lapply(db, tools:::.Rd_get_metadata, "name")
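If the goal is to read the Rd source as plain text, something along these lines may help. It is a sketch that assumes the Rd database is available in the installed R; it matches on the file name only because names(db) may or may not include a path, and the output location is just a temporary file.

```r
library(tools)

db <- Rd_db("base")

# Match on the file name only, since names(db) may or may not include a path.
idx <- which(basename(names(db)) == "grep.Rd")
grep_rd <- db[[idx]]

# Collapse the parsed Rd object back into Rd source text.
rd_text <- paste(as.character(grep_rd), collapse = "")

# Write it out for reading in an editor (temporary file, illustration only).
out <- file.path(tempdir(), "grep.Rd")
writeLines(rd_text, out)
```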