Edit: In fact, it appears that htmlTreeParse doesn't parse KML files well. In that case, xmlTreeParse is what is needed.
I am trying to parse a huge KML file in R. My issue is using XPath to "navigate" through the nodes of the tree. However I approach the problem, I can't manage to do it, as the functions are made for XML and HTML files.
My final goal is to get a list of strings of all the nodes under the Placemark node.
# parse kml file:
pc2 <- htmlTreeParse(file = "http://www.doogal.co.uk/kml/EC.kml")
pc3 <- htmlTreeParse(file = "http://www.doogal.co.uk/kml/EC.kml", useInternalNodes = T)
# doesn't work
pc2["//#Placemark"]
# doesn't work either
xpathApply(pc3, "//#Placemark")
Is there a way to do this, or does the KML format block it entirely?
So far, the only way I have found is to do it manually with calls to each node, but that is not best practice.
pc4 <- htmlTreeParse(file = "http://www.doogal.co.uk/kml/EC.kml")$doc$children$kml ....
+ for loop
Edit: There is a strange effect here: when I download the file, it is a KML file beginning with a kml tag. When I use htmlTreeParse, it adds an HTML level:
<!DOCTYPE html PUBLIC "-//EN" "http://www.w3">
<?xml version="1.0" encoding="UTF-8"?>
<!-- comment here-->
<html>
<body>
<kml xmlns="http://www.opengis.net/kml/2.2">
<document>
my document here
</document></kml></body></html>
And the HTML parser reacts strangely to this. To correct it, I use xmlTreeParse and it works fine in the end.
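For reference, here is a minimal sketch of how the xmlTreeParse route can look (the XPath expression and namespace binding are illustrative assumptions: KML declares a default namespace, so it must be bound to a prefix before XPath expressions will match):
library(XML)
# parse the KML as XML rather than HTML
doc <- xmlTreeParse(file = "http://www.doogal.co.uk/kml/EC.kml", useInternalNodes = TRUE)
# bind the KML default namespace to a prefix ("k") for use in XPath
ns <- c(k = "http://www.opengis.net/kml/2.2")
# grab every Placemark node...
placemarks <- xpathApply(doc, "//k:Placemark", namespaces = ns)
# ...or pull the text of every node under each Placemark directly
texts <- xpathSApply(doc, "//k:Placemark//text()", xmlValue, namespaces = ns)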
I have never worked with APIs, so this is my first try; I don't even know if what I'm trying to do is possible.
I'm trying to obtain pictures of cells from the SwissBioPics API (https://www.npmjs.com/package/%40swissprot/swissbiopics%2Dvisualizer) and have them in my R session.
res <- httr::GET('https://www.swissbiopics.org/static/swissbiopics.js',
                 query = list(taxisid = '9606', sls = 'SL0073', gos = '0005641'))
result <- httr::content(res$content)
but I'm getting this error:
Error in httr::content(res$content) : is.response(x) is not TRUE
Any clues?
After many misadventures, I have your answer as promised!
Since it involves interactive imagery, courtesy of JavaScript and HTML, this solution must be run as a .Rmd file within RStudio. The interactive imagery can also be accessed in the eponymous .html file, output by knitr when you click the Knit button in RStudio.
Step 1: Project Setup
Create a new R project my_pics in RStudio, under a new directory. From within this project, create a new R Notebook (here my_book.Rmd), which should end up right next to my_pics.Rproj under the aforementioned directory.
Step 2: Supporting Files
Under that same directory, create a ./snippets subdirectory. The latter should contain the following two .txt files, copied (and corrected) from the swissbiopics-visualizer documentation:
templates.txt: the first code block given in the documentation. Reproduced here with necessary EOF and syntactically corrected comments:
<template id="sibSwissBioPicsStyle">
<style>
ul > li > a {
font-style:oblique;
}
ul.notpresent li > .subcell_description {
display:none;
}
</style>
</template>
<template id="sibSwissBioPicsSlLiItem">
<li class="subcellular_location">
<a class="subcell_name"></a> <!-- the class name is required and textContent will be set on it -->
<span class="subcell_description"></span> <!-- the class name is required and textContent will be set on it -->
</li>
</template>
custom_element.txt: the third code block given in the documentation. Reproduced here with necessary EOF:
<script type="module" src="https://www.swissbiopics.org/static/swissbiopics.js"></script>
<script defer>
if (! window.customElements.get("sib-swissbiopics-sl"))
window.customElements.define("sib-swissbiopics-sl", SwissBioPicsSL);
</script>
Mind you, these .txt files can just as easily be saved as .html files; only the file extensions would need to be updated in the default values of the templates and custom_element parameters of the make_html() function in the my_book.Rmd code below.
Step 3: Interact in RStudio
Now we are ready! In my_book.Rmd, write the following:
---
title: "R Notebook"
output: html_document
---
```{r}
library(htmltools)
library(readr)
library(rlang)
```
# Functions #
Here are the functions that do the trick. The snippets used by `make_html()` are copied from the [documentation](https://www.npmjs.com/package/%40swissprot/swissbiopics-visualizer#usage) for `swissbiopics-visualizer`, and (after fixing the HTML comments) pasted into `.txt` files (`templates.txt` and `custom_element.txt`) under the `./snippets` subdirectory, which lies within the directory containing this `.Rproj`.
```{r}
# Create comma-separated list from vectorized (or listed) items, safely escaped.
csl <- function(items) {
return(paste("\"", paste(htmltools::htmlEscape(unlist(items)), collapse = ",", sep = ""), "\"", sep = ""))
}
# Create the HTML for the interactive imagery given by the parameters. Assembly
# process is as described in the documentation for 'swissbiopics-visualizer':
# https://www.npmjs.com/package/%40swissprot/swissbiopics-visualizer#usage
make_html <- function(# The NCBI taxonomy ID.
tax_id,
# The IDs of the cellular elements to highlight.
sl_ids,
# The filepath to (or raw HTML text of) the templates
# snippet.
templates = "./snippets/templates.txt",
# The filepath to (or raw HTML text of) the custom element
# snippet.
custom_element = "./snippets/custom_element.txt",
# Further arguments to 'readr::read_file()', which might
# be useful to process snippet encodings across platforms.
...) {
# Escape any strings supplied.
tax_id <- csl(tax_id[1])
sl_ids <- csl(sl_ids)
# Compile all the HTML snippets into a list:
elements <- list()
# Include the templates (as read)...
elements$templates <- readr::read_file(file = templates, ...)
# ...then include the line (created here) to target the right picture...
elements$identifier <- "<sib-swissbiopics-sl taxid=%s sls=%s></sib-swissbiopics-sl>"
elements$identifier <- sprintf(fmt = elements$identifier, tax_id, sl_ids)
# ...and finally include the definition (as read) for the custom element.
elements$custom_element <- readr::read_file(file = custom_element, ...)
# Append these snippets together, into the full HTML code.
return(paste(unlist(elements), collapse = "\n"))
}
# Display the interactive imagery given by the parameters, visible in both
# RStudio (crowded) and the R Markdown file (well laid out).
visualize <- function(# The NCBI taxonomy ID.
taxid = "9606",
# A list (or vector) of the UniProtKB subcellular location
# (SL) IDs for the cellular elements to highlight.
sls = list("SL0073"),
# Further arguments to 'make_html()'.
...
) {
# Embed the HTML text where this function is called.
return(htmltools::HTML(make_html(tax_id = taxid, sl_ids = sls, ...)))
}
```
# Results #
Here we `visualize()` the **interactive** image, also accessible on [SwissBioPics](https://www.swissbiopics.org):
```{r}
visualize(sls = list("SL0073", "SL0138"))
```
Note
Observe how (in this case) we "lazily" use the default value ("9606") for taxid, without having to specify it. Observe also how we can simultaneously highlight not one but multiple separate components, namely the Contractile vacuole ("SL0073") and the Cell cortex ("SL0138").
Now below that last chunk where visualize() is called
```{r}
visualize(sls = list("SL0073", "SL0138"))
```
you will see the interactive output.
Sadly, it appears extremely crowded in RStudio, and an HTML wizard might be needed to alter the supporting .txt (or .html) files, to achieve properly formatted HTML within this IDE.
Step 4: Embed in Reports
As with any .Rmd file, RStudio gives you the option to Knit the Markdown results into a .html file, which can be easily accessed and beautifully formatted as a report!
With my_book.Rmd open in RStudio, click the Knit button, and my_book.html should appear within that same directory. You can open this .html file in a web browser (I used Chrome) to see it in all its glory!
In Conclusion
With either of these two interactive images, you can hover to highlight the various components and layers of the diagram. Furthermore, clicking on any definition will take you by hyperlink to its profile on UniProt.
Many of the remaining limitations are due to the swissbiopics-visualizer API itself. For example, there seems to be a malfunction in its mapping from GO IDs to SL IDs, via this dataset. As such, you should provide only SL codes to visualize().
That said, if you can wrangle that HTML and bend its layout to your will, the sky is the limit!
Enjoy!
Bonus
Here's a demo of the same interactive output, embedded right here in Stack Overflow! Unfortunately, it's unstable and horribly unwieldy in this environment, so I've left it as a mere "footnote":
<template id="sibSwissBioPicsStyle">
<style>
ul > li > a {
font-style:oblique;
}
ul.notpresent li > .subcell_description {
display:none;
}
</style>
</template>
<template id="sibSwissBioPicsSlLiItem">
<li class="subcellular_location">
<a class="subcell_name"></a> <!-- the class name is required and textContent will be set on it -->
<span class="subcell_description"></span> <!-- the class name is required and textContent will be set on it -->
</li>
</template>
<sib-swissbiopics-sl taxid="9606" sls="SL0073,SL0138" ></sib-swissbiopics-sl>
<script type="module" src="https://www.swissbiopics.org/static/swissbiopics.js"></script>
<script defer>
if (! window.customElements.get("sib-swissbiopics-sl"))
window.customElements.define("sib-swissbiopics-sl", SwissBioPicsSL);
</script>
You have to call the content() function on res, not on res$content. Then you get raw content, which needs to be converted, e.g. via
base::rawToChar(content(res))
which results in a string containing some JS code:
base::rawToChar(content(res))
[1] "var SwissBioPics;SwissBioPics=(()=>....
I have only quickly looked at the website, but what about just downloading the files? It also goes through the API.
qurl = "https://www.swissbiopics.org/api/image/Chlamydomona_cells.svg"
fl = file.path(tempdir(), basename(qurl))
download.file(qurl, fl)
Once on disk, you can load the image into R, e.g. via the magick package:
require(magick)
img = image_read_svg(fl)
print(img)
I am trying to write R code where I input a URL and output (save on the hard drive) a .txt file. I created a large list of URLs using the "edgarWebR" package. An example would be "https://www.sec.gov/Archives/edgar/data/1131013/000119312518074650/d442610dncsr.htm". Basically:
open the link
copy everything (CTRL+A, CTRL+C)
open an empty text file and paste the content (CTRL+V)
save the .txt file under a specified name
(in a looped fashion, of course). I am inclined to "hard code" it (as in: open the website in a browser using browseURL(...) and "send keys" commands), but I am afraid that it would not run very smoothly. However, other commands (such as readLines()) seem to copy the HTML structure (therefore returning not only the text).
In the end I am interested in a short paragraph of each of those shareholder letters (containing only text; tables/graphs are of no concern in my particular setup).
Is anyone aware of an R function that would help?
Thanks in advance!
Let me know in case the code below works for you. xpathSApply can be applied to different HTML components as well; in your case, only paragraphs are required.
library(RCurl)
library(XML)
# Create character vector of urls
urls <- c("url1", "url2", "url3")
for (url in urls) {
  # download html
  html <- getURL(url, followlocation = TRUE)
  # parse html
  doc <- htmlParse(html, asText = TRUE)
  # extract the text of every <p> element
  plain.text <- xpathSApply(doc, "//p", xmlValue)
  # write one file per url; basename() drops the protocol and path
  # separators so the resulting file name is legal
  fileConn <- file(paste(basename(url), "txt", sep = "."))
  writeLines(paste(plain.text, collapse = "\n"), fileConn)
  close(fileConn)
}
Thanks everyone for your input. It turns out that any HTML conversion took too much time given the amount of websites I need to parse. The (working) solution probably violates some best-practice guidelines, but it does the job.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import clipboard  # provides clipboard.paste(); install with 'pip install clipboard'

driver = webdriver.Firefox(executable_path=path + '/codes_ml/geckodriver/geckodriver.exe')  # initialize driver
# it is fine to open the driver just once and reuse it
# loop over the urls to collect the text
driver.get(report_url)
element = driver.find_element_by_css_selector("body")
element.send_keys(Keys.CONTROL + 'a')  # select all
element.send_keys(Keys.CONTROL + 'c')  # copy to clipboard
text = clipboard.paste()               # read the clipboard back into Python
I have a few thousand XML files that I would like to read into R. The problem is that some of these files have three special characters "ï»¿" at the beginning of the file that stop xmlTreeParse from reading the XML file. The error that I get is the following...
Error: 1: Start tag expected, '<' not found
This is due to the first line in the xml file that is the following...
ï»¿<?xml version="1.0" encoding="utf-8"?>
If I manually remove the characters using notepad, I have this in the beginning of the xml file and I am able to read the xml file...
<?xml version="1.0" encoding="utf-8"?>
I'd like to be able to remove the characters automatically. The following is the code that I have written currently.
filenames <- list.files("...filepath...", pattern="*.xml", full.names=TRUE)
files <- lapply(filenames, function(f) {
xmlfile <-tryCatch(xmlTreeParse(file = f), error=function(e) print(f))
xmltop <- xmlRoot(xmlfile)
plantcat <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
name <- unname(plantcat$EntityNames)
return(name)
})
I'm wondering how I can read the XML file in by removing the special characters in R. I have tried tryCatch, as you can see above, but I'm not sure how I can edit the XML file without actually reading it in first. Any help would be appreciated!
Edit: Using the following parsing code fixed the problem. I think when I opened the XML file in Notepad it was showing "ï»¿", but in reality it was the UTF-8 byte order mark. It's possible that this was due to the encoding of the file, but I'm not sure of the specifics. Thank you @Prem.
xmlfile <- xmlTreeParse(gsub("ï»¿", "", readLines(f)), asText = TRUE)
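For completeness, a sketch of that fix integrated into the loop from the question (assuming the same readLines/gsub pattern works for every file):
filenames <- list.files("...filepath...", pattern = "*.xml", full.names = TRUE)
files <- lapply(filenames, function(f) {
  # strip the stray BOM characters before parsing
  xmlfile <- xmlTreeParse(gsub("ï»¿", "", readLines(f)), asText = TRUE)
  xmltop <- xmlRoot(xmlfile)
  plantcat <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
  unname(plantcat$EntityNames)
})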
The special chars at the beginning might come from a different encoding for the file, especially if your XML contains some special characters.
Try to specify the encoding. To identify which encoding is used, open the file in a hex editor and read the first bytes.
My hunch is that your special chars comes from BOM:
http://unicode.org/faq/utf_bom.html
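If you want to check for the BOM explicitly before deciding how to read a file, here is a minimal sketch in R (the file name is hypothetical):
f <- "myfile.xml"
# read the first three bytes and compare them to the UTF-8 BOM (EF BB BF)
first_bytes <- readBin(f, what = "raw", n = 3)
if (identical(first_bytes, as.raw(c(0xEF, 0xBB, 0xBF)))) {
  # opening the connection as "UTF-8-BOM" makes R strip the BOM for you
  con <- file(f, encoding = "UTF-8-BOM")
  lines <- readLines(con)
  close(con)
}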
In your code, use readLines to read the file, and then gsub can be used to remove the junk value from the string.
xmlfile <- xmlTreeParse(gsub("ï»¿", "", readLines(f)), asText = TRUE)
Have you tried the gsub function? It is a very convenient function for character replacement (and deletion). This works for me:
gsub('ï»¿', '', string, fixed=TRUE)
On a string = 'ï»¿<?xml version="1.0" encoding="utf-8"?>' variable.
EDIT: I would also suggest using sed if you're on a GNU/Linux machine. It's a very powerful tool that deals perfectly with this kind of task.
I am new to XML.
I downloaded an XML file, called ipg140722, from Google, http://www.google.com/googlebooks/uspto-patents-grants-text.html. I am using Windows 8.1 and R 3.1.1.
library(XML)
url<- "E:\\clouddownload\\R-download\\ipg140722.xml"
indata<- xmlTreeParse(url)
XML declaration allowed only at the start of the document
Extra content at the end of the document
error: 1: XML declaration allowed only at the start of the document
2: Extra content at the end of the document
What is the problem?
Note: This post is edited from the original version.
The object lesson here is that just because a file has an xml extension does not mean it is well formed XML.
If @MartinMorgan is correct about the file, Google seems to have taken all the patents approved during the week of 2014-07-22 (last week), converted them to XML, strung them together into a single text file, and given that an xml extension. Clearly this is not well-formed XML. So the challenge is to deconstruct that file. Here is a way to do it in R.
lines <- readLines("ipg140722.xml")
start <- grep('<?xml version="1.0" encoding="UTF-8"?>',lines,fixed=T)
end <- c(start[-1]-1,length(lines))
library(XML)
get.xml <- function(i) {
txt <- paste(lines[start[i]:end[i]],collapse="\n")
# print(i)
xmlTreeParse(txt,asText=T)
# return(i)
}
docs <- lapply(1:10,get.xml)
class(docs[[1]])
# [1] "XMLInternalDocument" "XMLAbstractDocument"
So now docs is a list of parsed XML documents. These can be accessed individually as, e.g., docs[[1]], or collectively using something like the code below, which extracts the invention title from each document.
sapply(docs,function(doc) xmlValue(doc["//invention-title"][[1]]))
# [1] "Phallus retention harness" "Dress/coat"
# [3] "Shirt" "Shirt"
# [5] "Sandal" "Shoe"
# [7] "Footwear" "Flexible athletic shoe sole"
# [9] "Shoe outsole with a surface ornamentation contrast" "Shoe sole"
And no, I did not make up the name of the first patent.
Response to OP's comment
My original post, which detected the start of a new document using:
start <- grep("xml version",lines,fixed=T)
was too naive: it turns out the phrase "xml version" appears in the text of some of the patents, so this was breaking (some of) the documents prematurely, resulting in malformed XML. The code above fixes that problem. If you un-comment the two lines in the function get.xml(...) and run the code above with
docs <- lapply(1:length(start),get.xml)
you will see that all 6961 documents parse correctly.
But there is another problem: the parsed XML is very large, so if you leave these lines as comments and try to parse the full set, you run out of memory about half way through (or I did, on an 8GB system). There are two ways to work around this. The first is to do the parsing in blocks (say 2000 documents at a time). The second is to extract whatever information you need for your CSV file in get.xml(...) and discard the parsed document at each step.
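Here is a sketch of the second workaround, reusing the lines and start objects from above (assuming, for illustration, that the invention title is the only field you need): extract the field inside the function and release the parsed document at each step.
get.title <- function(i) {
  txt <- paste(lines[start[i]:end[i]], collapse = "\n")
  doc <- xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)
  title <- xpathSApply(doc, "//invention-title", xmlValue)[1]
  free(doc)  # explicitly release the internal document to reclaim memory
  title
}
titles <- sapply(1:length(start), get.title)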
Is there a standalone library of OpenOffice's formula renderer? I'm looking for something that can take plain text (e.g. E = mc^2) in the same syntax as used by OpenOffice, and convert to png or pdf fragments.
(note: I don't need the WYSIWYG editor, just the renderer. Basically I would like to work in OpenOffice to interactively edit my formulas, and then copy the source text for use in other contexts w/o needing OpenOffice to render them.)
I'm using unoconv to convert OpenOffice/LibreOffice documents to PDF.
However, first I had to create some input document with a formula.
Unfortunately, it is not possible to use just the formula editor to create an ODF file, because the output PDF file would contain weird headers and footers.
Therefore, I created a simple text document (in Writer) and embedded the formula as a single object (aligned as a character). I saved the ODT file, unzipped it (since ODT is just a ZIP) and edited the content. Then, I identified what files can be deleted and formatted the remaining files to get a minimal example.
In my example, the formula itself is located in Formula/content.xml. It should be easy to change just the code within the <annotation>...</annotation> tags in an automated way.
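For example, here is a minimal sketch in R of that automated swap (the replacement formula is hypothetical; the path matches the layout below):
path <- "formula.odt.unzipped/Formula/content.xml"
xml <- paste(readLines(path), collapse = "\n")
new_formula <- "E = m c^2"  # hypothetical new StarMath code
# (?s) lets '.' span newlines; '.*?' keeps the match inside one annotation
xml <- sub("(?s)(<annotation[^>]*>).*?(</annotation>)",
           paste0("\\1\n", new_formula, "\n\\2"),
           xml, perl = TRUE)
writeLines(xml, path)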
Finally, I zipped the directory and produced a new ODT file.
Then, using unoconv and pdfcrop, I produced a nice formula as PDF.
# this trick prevents zip from creating an additional directory
cd formula.odt.unzipped
zip -r ../formula.odt .
cd ..
unoconv -f pdf formula.odt # ODT to PDF
pdfcrop formula.pdf # keep only the formula
# you can convert the PDF to bitmap as follows
convert -density 300x300 formula-crop.pdf formula.png
Here is the minimal content of the unzipped ODT directory for formula.odt:
formula.odt.unzipped/Formula/content.xml
formula.odt.unzipped/META-INF/manifest.xml
formula.odt.unzipped/content.xml
File formula.odt.unzipped/Formula/content.xml contains:
<?xml version="1.0" encoding="UTF-8"?>
<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
<semantics>
<annotation encoding="StarMath 5.0">
f ( x ) = sum from { { i = 0 } } to { infinity } { {f^{(i)}(0)} over {i!} x^i}
</annotation>
</semantics>
</math>
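For reference, the StarMath code above encodes the Maclaurin series of f, which in LaTeX notation reads:
f(x) = \sum_{i=0}^{\infty} \frac{f^{(i)}(0)}{i!} x^i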
File formula.odt.unzipped/content.xml contains:
<?xml version="1.0" encoding="UTF-8"?>
<office:document-content
xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
xmlns:xlink="http://www.w3.org/1999/xlink">
<office:body>
<office:text>
<text:p>
<draw:frame>
<draw:object xlink:href="./Formula"/>
</draw:frame>
</text:p>
</office:text>
</office:body>
</office:document-content>
File formula.odt.unzipped/META-INF/manifest.xml contains:
<?xml version="1.0" encoding="UTF-8"?>
<manifest:manifest xmlns:manifest="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0" manifest:version="1.2">
<manifest:file-entry manifest:full-path="/" manifest:version="1.2" manifest:media-type="application/vnd.oasis.opendocument.text"/>
<manifest:file-entry manifest:full-path="content.xml" manifest:media-type="text/xml"/>
<manifest:file-entry manifest:full-path="Formula/content.xml" manifest:media-type="text/xml"/>
<manifest:file-entry manifest:full-path="Formula/" manifest:version="1.2" manifest:media-type="application/vnd.oasis.opendocument.formula"/>
</manifest:manifest>
There are several web services that run LaTeX for you and return an image. For instance, http://rogercortesi.com/eqn/