Is there a nice way to extract the R help page for an installed package in the form of an R object (e.g. a list)? I would like to expose help pages in the form of standardized JSON or XML schemas. However, getting the R help info out of the database is harder than I thought.
I hacked together a function a while ago to get the HTML of an R help manual page. However, I would rather have a general R object that contains this information, which I can then render to JSON/XML/HTML, etc. I looked into the helpr package from Hadley, but it seems to be a bit of an overkill for my purpose.
Edited with Hadley's suggestion
You can do this a bit more easily:
getHTMLhelp <- function(...){
  thefile <- help(...)
  capture.output(
    tools:::Rd2HTML(utils:::.getHelpFile(thefile))
  )
}
Using tools:::Rd2txt instead of tools:::Rd2HTML will give you plain text (see the sketch after the examples below). Just getting the file, without any parsing, gives you the original Rd format, so you can write a custom parsing function to turn it into an object (see the solution of @Jeroen below, which does a good job of extracting all the info into a list).
This function takes exactly the same arguments as help() and returns a character vector with one element per line of the file, e.g.:
> HelpAnova <- getHTMLhelp(anova, package=stats)
> head(HelpAnova)
[1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">"
[2] "<html><head><title>R: Anova Tables</title>"
[3] "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">"
[4] "<link rel=\"stylesheet\" type=\"text/css\" href=\"R.css\">"
[5] "</head><body>"
[6] ""
Or:
> HelpGam <- getHTMLhelp(gamm,package=mgcv)
> head(HelpGam)
[1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">"
[2] "<html><head><title>R: Generalized Additive Mixed Models</title>"
[3] "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">"
[4] "<link rel=\"stylesheet\" type=\"text/css\" href=\"R.css\">"
[5] "</head><body>"
[6] ""
So below is what I hacked together. However, I have yet to test it on many help files to see if it works in general.
Rd2list <- function(Rd){
  # Name each section after its Rd tag, dropping the leading backslash
  names(Rd) <- substring(sapply(Rd, attr, "Rd_tag"), 2)
  # Set the arguments section aside; it needs special treatment
  temp_args <- Rd$arguments
  Rd$arguments <- NULL
  # Collapse every remaining section into a single string
  myrd <- lapply(Rd, unlist)
  myrd <- lapply(myrd, paste, collapse = "")
  # Keep only the \item entries and turn each into c(arg, description)
  temp_args <- temp_args[sapply(temp_args, attr, "Rd_tag") == "\\item"]
  temp_args <- lapply(temp_args, lapply, paste, collapse = "")
  temp_args <- lapply(temp_args, "names<-", c("arg", "description"))
  myrd$arguments <- temp_args
  return(myrd)
}
getHelpList <- function(...){
  thefile <- help(...)
  myrd <- utils:::.getHelpFile(thefile)
  Rd2list(myrd)
}
And then you would do something like:
myhelp <- getHelpList("qplot", package = "ggplot2")
cat(jsonlite::toJSON(myhelp))
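If you want to inspect the list before serializing it, something like the following should work; the exact element names depend on which Rd sections the help page contains, so the values shown in the comments are only typical guesses:
names(myhelp)          # e.g. "title", "name", "description", "usage", "arguments", ...
myhelp$title           # the page title as a single string
myhelp$arguments[[1]]  # first argument, as c(arg = "...", description = "...")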
I have a sample XML file that I am parsing in R:
<ROUGHDRAFT_FILE MV="00" MMV="00" tId="0000">
  <HEADER Location="Utah" dateCreated="1/1/99">
  </HEADER>
  <COVERSHEET>
    <PRIMIARY_INFO eName="John Smith" pList="XXXXX"
                   type="Remodel" cNumber="00000"
                   policyNumber="00000000000" />
  </COVERSHEET>
</ROUGHDRAFT_FILE>
After I load the XML and name it file, I get an error. This is my code:
xml <- xmlParse(file)
This works fine. When I try to pull the attributes, it gives me an error:
EstAttribs <- xpathApply(xml, path="//PRIMIARY_INFO", xml_attrs )
Error in UseMethod("xpathApply") :
no applicable method for 'xpathApply' applied to an object of class "c('XMLDocument', 'XMLAbstractDocument')"
Any recommendations on how I can fix this? Do I have to specify something for xml_attrs?
MrFlick has already given you one answer; here is another that might be useful. As he suggested, don't try to mix functions from the XML library with rvest and xml2.
# here is the rvest and xml2 solution
# rvest calls xml2 since it is a dependency
library(rvest)
xml_file <- read_xml("test.xml")
xml_file %>%
  xml_find_all('//PRIMIARY_INFO') %>%
  xml_attrs()
[[1]]
eName pList type cNumber policyNumber
"John Smith" "XXXXX" "Remodel" "00000" "00000000000"
# this solution is purely using XML - as suggested by MrFlick
library(XML)
xml_file <- xmlParse("test.xml")
xpathApply(xml_file, path="//PRIMIARY_INFO", xmlAttrs )
[[1]]
eName pList type cNumber policyNumber
"John Smith" "XXXXX" "Remodel" "00000" "00000000000"
I think this SO question might contain useful info for you.
I am trying to scrape image sources from different websites. I used rvest to do that. The problem I encounter is that I have a vector of strings containing the sources, but I need to extract the source itself from each of them.
Here are the first few entries:
> string
{xml_nodeset (100)}
[1] <td class="no-wrap currency-name" data-sort="Bitcoin">\n<img src="https://s2.coinmarketc ...
[2] <td class="no-wrap currency-name" data-sort="Ethereum">\n<img src="https://s2.coinmarket ...
[3] <td class="no-wrap currency-name" data-sort="Ripple">\n <img src="https://s2.coinmarketc ...
What I need is basically the part coming after src=", so for the first one "https://s2.coinmarketcap.com/static/img/coins/16x16/1.png" (the console doesn't show the full strings, but this is what appears after the dots ..., and more content follows it as well).
Any help is appreciated as I am a bit stuck here.
You can do:
library(rvest)
read_html("https://coinmarketcap.com/coins/")%>%
html_nodes("td img")%>%html_attr("src")
[1] "https://s2.coinmarketcap.com/static/img/coins/16x16/1.png"
[2] "https://s2.coinmarketcap.com/generated/sparklines/web/7d/usd/1.png"
[3] "https://s2.coinmarketcap.com/static/img/coins/16x16/1027.png"
[4] "https://s2.coinmarketcap.com/generated/sparklines/web/7d/usd/1027.png"
[5] "https://s2.coinmarketcap.com/static/img/coins/16x16/52.png"
[6] "https://s2.coinmarketcap.com/generated/sparklines/web/7d/usd/52.png"
[7] "https://s2.coinmarketcap.com/static/img/coins/16x16/1831.png"
[8] "https://s2.coinmarketcap.com/generated/sparklines/web/7d/usd/1831.png"
...
As pointed out in the comments, a regular expression should do it:
myhtml <- gsub('^.*https://\\s*|\\s*.png.*$', "", string)
myhtml <- paste0("https://", myhtml, ".png")
The first line extracts the part of each string contained between https:// and .png, and the second one pastes https:// back on the front and .png on the end so that you end up with a valid source URL.
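As a quick illustration of what the two lines do, here is the idea applied to a shortened, made-up node string (it assumes the string contains exactly one https:// URL ending in .png):
node_txt <- '<td class="no-wrap currency-name"><img src="https://s2.coinmarketcap.com/static/img/coins/16x16/1.png" class="logo"></td>'
tmp <- gsub('^.*https://\\s*|\\s*.png.*$', "", node_txt)
tmp                              # "s2.coinmarketcap.com/static/img/coins/16x16/1"
paste0("https://", tmp, ".png")  # "https://s2.coinmarketcap.com/static/img/coins/16x16/1.png"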
I want to parse a court document I downloaded in XML format, but the response type is application/xhtml+xml, and I get an error when turning this XHTML document into XML in R so that I can extract the information I need. See below. Can anyone help? Thank you.
resp_xml <- readRDS("had_NH_xml.rds")
# Load xml2 (and httr, which provides http_type() and content())
library(xml2)
library(httr)
# Check response is XML
http_type(resp_xml)
[1] "application/xhtml+xml"
# Examine returned text with content()
NH_text <- content(resp_xml, as = "text")
NH_text
[1] "<!DOCTYPE html>\n<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n \t<meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge\" /><link type=\"text/css\" rel=\"stylesheet\" href=\"/csologin/javax.faces.resource/theme.css.jsf?ln=primefaces-redmond\" /><link type=\"text/css\" rel=\"stylesheet\" href=\"/csologin/javax.faces.resource/primefaces.css.jsf?ln=primefaces&v=5.3\" /><script type=\"text/javascript\" src=\"/csologin/javax.faces.resource/jquery/jquery.js.jsf?ln=primefaces&v=5.3\"></script><script type=\"text/javascript\" src=\"/csologin/javax.faces.resource/jquery/jquery-plugins.js.jsf?ln=primefaces&v=5.3\"></script><script type=\"text/javascript\" src=\"/csologin/javax.faces.resource/primefaces.js.jsf?ln=primefaces&v=5.3\"></script><script type=\"text/javascript\" src=\"/csologin/javax.faces.resource/primefaces-extensions.js.jsf?ln=primefaces-extensions&v=4.0.0\"></script><link type=\"text/css\" rel=\"stylesheet\" href=\"/csologin/javax.faces.resou... <truncated>
>
> # Check htmltidy package: https://cran.r-project.org/web/packages/htmltidy/htmltidy.pdf
>
# Turn NH_text into an XML document
NH_xml <- read_xml(NH_text)
Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, :
  Entity 'nbsp' not defined [26]
Named HTML entities are invalid in XML (regardless of what any potential troll comments might otherwise "suggest"). I do not know R programming, but what I can tell you is that you need to do string replacement for the following array of entities:
'&nbsp;','&gt;','&lt;'
...and replace them with the following strings:
' ','>','<'
In PHP this would simply be:
$f = array('&nbsp;','&gt;','&lt;');
$r = array(' ','>','<');
$a = str_ireplace($f,$r,$a);
...and each corresponding key/value pair would be replaced. I'm not confident enough to post R code from looking at basic tutorials, though.
What I can tell you is that if you clean out those strings (and any doctype declaration), then as long as the rest of the markup is not malformed, it should parse just fine as application/xml.
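For reference, a minimal R sketch of the same substitution idea, assuming the raw page text is in NH_text as in the question (whether this alone is enough will depend on what else is in the document):
library(xml2)
clean_text <- NH_text
clean_text <- gsub("&nbsp;", " ", clean_text, fixed = TRUE)
clean_text <- gsub("&gt;", ">", clean_text, fixed = TRUE)
clean_text <- gsub("&lt;", "<", clean_text, fixed = TRUE)
NH_xml <- read_xml(clean_text)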
I am attempting to add a new stemmer that works using a table lookup method. If h is the hash that contains the stemming operation, it is encoded as follows: keys are words before stemming and values are words post-stemming.
I would like to ideally add a custom hash that allows me to do the following
myCorpus = tm_map(myCorpus, replaceWords, h)
The replaceWords function is applied to each document in myCorpus and uses the hash to stem the contents of the document.
Here is the sample code for my replaceWords function:
hash_replace <- function(x, h) {
  # Return the stemmed form if x is a key in the hash, otherwise x unchanged
  if (length(h[[x]]) > 0) {
    return(h[[x]])
  } else {
    return(x)
  }
}
replaceWords <- function(x, h) {
  y <- tolower(unlist(strsplit(x, " ")))
  y <- y[which(as.logical(nchar(y)))]   # drop empty tokens
  z <- unlist(lapply(y, hash_replace, h))
  return(paste(unlist(z), collapse = ' '))
}
Although this works, the transformed corpus no longer contains content of type "TextDocument" or "PlainTextDocument", but of type "character".
I tried using
return(as.PlainTextDocument(paste(unlist(z),collapse=' ')))
but that gives me an error when I try to run it.
In previous versions of R's tm package, I did see a replaceWords function that allowed for synonym and WordNet based substitution, but I no longer see it in the current version of the tm package (in particular, it is not listed by getTransformations()).
Does anybody out there have ideas on how I can make this happen?
Any help is greatly appreciated.
Thanks,
Shivani Rao
You just need to use the PlainTextDocument function instead of as.PlainTextDocument. R automatically returns the last statement in your function, so it works if you just make the last line:
PlainTextDocument(paste(unlist(z),collapse=' '))
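For example, the replaceWords function from the question could be rewritten like this (a sketch reusing the hash_replace helper defined above; depending on your version of tm you may also need to wrap it in content_transformer when passing it to tm_map):
library(tm)
replaceWords <- function(x, h) {
  y <- tolower(unlist(strsplit(as.character(x), " ")))
  y <- y[nchar(y) > 0]
  z <- unlist(lapply(y, hash_replace, h))
  PlainTextDocument(paste(z, collapse = " "))  # last expression is returned
}
myCorpus <- tm_map(myCorpus, replaceWords, h)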
I have a character vector which contains HTML tags, e.g.
abc <- "welcome <span class=\"r\">abc</span> Have fun!"
I want to remove these tags and get the following vector, e.g.
abc <- "welcome Have fun"
Try
> gsub("(<[^>]*>)","",abc)
What this says is "substitute every instance of < followed by anything that isn't a >, up to a >, with nothing".
You can't just do gsub("<.*>", "", abc), because regexps are greedy and the .* would match up to the last > in your text (and you'd lose the 'abc' in your example).
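To see the difference concretely on the example string:
abc <- "welcome <span class=\"r\">abc</span> Have fun!"
gsub("(<[^>]*>)", "", abc)  # "welcome abc Have fun!"
gsub("<.*>", "", abc)       # "welcome  Have fun!" (greedy: everything from the first < to the last > is removed)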
This solution might fail if you've got > in your tags - but is <foo class=">" > legal? Doubtless someone will come up with another answer that involves parsing the HTML with a heavyweight XML package.
You can convert your piece of HTML to an XML document with htmlParse or htmlTreeParse. You can then convert it to text, i.e., strip all the tags, with xmlValue.
abc <- "welcome <span class=\"r\">abc</span> Have fun!"
library(XML)
#doc <- htmlParse(abc, asText=TRUE)
doc <- htmlTreeParse(abc, asText=TRUE)
xmlValue( xmlRoot(doc) )
If you also want to remove the contents of the tags (the "abc" inside the span), you can use xmlDOMApply to transform the XML tree.
f <- function(x) if(xmlName(x) == "span") xmlTextNode(" ") else x
d <- xmlDOMApply( xmlRoot(doc), f )
xmlValue(d)