HTML pages not retained in list while using mclapply - r

When simply using lapply, the read_html page results are retained.
library(xml2)
lapply(c("https://www.analyticsvidhya.com/blog/2018/06/datahack-radio-1-machine-learning-competitions-with-kaggle-ceo-anthony-goldbloom/","https://www.analyticsvidhya.com/blog/2018/09/datahack-radio-lyft-dr-alok-gupta/"), function(x){read_html(x)})
#> [[1]]
#> {xml_document}
#> <html>
#> [1] <head lang="en-US" prefix="og: http://ogp.me/ns#">\n<meta http-equiv ...
#> [2] <body class="post-template-default single single-post postid-45087 s ...
#>
#> [[2]]
#> {xml_document}
#> <html>
#> [1] <head lang="en-US" prefix="og: http://ogp.me/ns#">\n<meta http-equiv ...
#> [2] <body class="post-template-default single single-post postid-46725 s ...
While using Parallel mclapply:
library(xml2)
library(parallel)
mclapply(c("https://www.analyticsvidhya.com/blog/2018/06/datahack-radio-1-machine-learning-competitions-with-kaggle-ceo-anthony-goldbloom/","https://www.analyticsvidhya.com/blog/2018/09/datahack-radio-lyft-dr-alok-gupta/"), function(x){read_html(x)}, mc.cores = 2)
#> [[1]]
#> {xml_document}
#>
#> [[2]]
#> {xml_document}
I can't figure out why this is happening; even with foreach I'm not able to get the same results as with a normal lapply. Help!

Time to sew
(I mean, you used the word thread so I'm not passing up the opportunity for a pun or three).
Deep in the manual page for ?parallel::mclapply you'll eventually see that it works by:
forking processes
serializing results
eventually gathering up these serialized results and combining them into one object
You can read ?serialize to see the method used.
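For instance, an ordinary R object serializes to a plain raw vector without any fuss; a quick sketch (the exact bytes beyond the header will vary with your R/serialization version):
serialize(1:3, connection = NULL)[1:2]
## [1] 58 0a   ("X\n", the serialization format magic; the remaining bytes encode 1:3)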
Why can't we serialize xml_document/html_document objects?
First, let's make one:
library(xml2)
(doc <- read_html("<p>hi there!</p>"))
## {xml_document}
## <html>
## [1] <body><p>hi there!</p></body>
and look at the structure:
str(doc)
## List of 2
## $ node:<externalptr>
## $ doc :<externalptr>
## - attr(*, "class")= chr [1:2] "xml_document" "xml_node"
doc$node
## <pointer: 0x7ff45ab17ce0>
Hrm. Those are <externalptr> objects. What does ?"externalptr-class" (eventually) say abt them?
…
"externalptr" # raw external pointers for use in C code
Since it's not a built-in object and the data is hidden away and only accessible via the package interface, R can't serialize it on its own and needs help. (That hex string — 0x7ff45ab17ce0 — is the memory pointer to where this opaque data is hidden).
"You can't be serious…"
Totally am.
In the event you're from Missouri (the "Show Me" state), we can see what happens without the complexity of parallel ops and raw connection object serialization machinations by just trying to save the document above to an RDS file and read it back:
tf <- tempfile(fileext = ".rds")
saveRDS(doc, tf)
str(doc2 <- readRDS(tf))
## List of 2
## $ node:<externalptr>
## $ doc :<externalptr>
## - attr(*, "class")= chr [1:2] "xml_document" "xml_node"
Now, you may be all like "AHA! See, it works!" Aaaaand…you'd be wrong:
doc2$node
## <pointer: 0x0>
The 0x0 means it's not pointing to anything. You've lost all that data. It's gone. Forever. (But, it had a good run so we should not be too sad abt it).
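And any attempt to actually use that resurrected object should fail with something along the lines of:
xml_find_all(doc2, ".//p")
## Error: external pointer is not valid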
This has been discussed by the xml2 devs and — rather than make life easier for us — they punted and made ?xml_serialize.
Wait…there's an xml_serialize but it's kinda not all that useful?
Yep. And, it gets even better worse.
Hopefully your curiosity was sufficiently piqued that you went ahead and found out what this quite seriously named xml_serialize() function does. If not, this is R, so to find out just type its name without the () to get:
function (object, connection, ...)
{
    if (is.character(connection)) {
        connection <- file(connection, "w", raw = TRUE)
        on.exit(close(connection))
    }
    serialize(structure(as.character(object, ...), class = "xml_serialized_document"),
        connection)
}
Apart from wiring up some connection bits, the complex sorcery behind this xml_serialize function is, well, just as.character(). (Kind of a let-down, actually.)
Since parallel ops perform (idiomatically) the equivalent of saveRDS() => readRDS() when you return an xml_document, html_document (or their _node[s] siblings) in a parallel apply you eventually get back a whole pile of nothing.
What can a content thief innocent scraper do to overcome this devastating limitation?
You are left with (at minimum) four choices:
🤓 expand the complexity of your function in the parallel apply to process the XML/HTML document into a data frame, vector or list of objects that can all be serialized automagically by R so they can be combined for you
be cool 😎 and have one parallel apply that saves off the HTML into files (the HTTP ops are likely the slow bit anyway) and then a non-parallel operation that reads them sequentially and processes them — which it looks like you were going to do anyway. Note that you're kind of being a leech and rly bad netizen if you don't do the HTML caching to file anyway since you're showing you really don't care about the bandwidth and CPU costs of the content you're purloining scraping.
don't be cool by doing ^^ 😔 and, instead, use as.character(read_html(…)) to return raw, serializable, character HTML directly from your parallel apply and then re-xml2 them back in the rest of your program (see the sketch just after this list)
😱 fork the xml2 📦, layer in a proper serialization hack and don't bother PR'ing it since you'll likely spend a lot of time trying to convince them it's worth it and still end up failing since this "externalptr serializing" is tricksy business, fraught with peril and you likely missed some edge cases (i.e. Hadley/Jim/etc know what they're doing and if they punted, it's prbly something not worth doing).
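Here's a minimal sketch of that third option, using the two URLs from the question (the title extraction at the end is just for illustration):
library(xml2)
library(parallel)

urls <- c("https://www.analyticsvidhya.com/blog/2018/06/datahack-radio-1-machine-learning-competitions-with-kaggle-ceo-anthony-goldbloom/",
          "https://www.analyticsvidhya.com/blog/2018/09/datahack-radio-lyft-dr-alok-gupta/")

# the parallel part returns plain character HTML, which serializes just fine...
raw_html <- mclapply(urls, function(x) as.character(read_html(x)), mc.cores = 2)

# ...then re-parse sequentially in the main process and extract whatever you need
docs   <- lapply(raw_html, read_html)
titles <- vapply(docs, function(d) xml_text(xml_find_first(d, "//title")), character(1))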
In reality, rather than use xml2::read_html() to grab the content, I'd use httr::GET() + httr::content(…, as="text") instead (if you're being cool and caching the pages vs callously wasting other folks' resources) since read_html() uses libxml2 under the covers and transforms the document (even if sometimes just a little), and it's better to have untransformed, raw, cached source data than something mangled by software that thinks it's smarter than we are.
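Something along these lines, where the cache directory and the file-naming scheme are just assumptions for illustration:
library(httr)

cache_dir <- "html_cache"                      # hypothetical cache location
dir.create(cache_dir, showWarnings = FALSE)

fetch_and_cache <- function(u) {
  fil <- file.path(cache_dir, paste0(gsub("[^A-Za-z0-9]+", "_", u), ".html"))
  if (!file.exists(fil)) {
    res <- GET(u)
    stop_for_status(res)
    writeLines(content(res, as = "text", encoding = "UTF-8"), fil)
  }
  fil                                          # parse later with xml2::read_html(fil)
}
The HTTP fetching can happen in parallel (file paths are just characters, so they survive the serialization round trip), and the parsing can then be done sequentially over the cached files.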
FIN
There really isn't any more I can do to clarify this than the above, verbose-mode blathering. Hopefully this expansion helps others grok what's going on as well.

Related

Writing help information for user defined functions in R

I frequently use user defined functions in my code.
RStudio supports automatic code completion using the Tab key. I find this amazing because I can always quickly see what is supposed to go in the (...) of functions/calls.
However, my user defined functions just show the parameters, no additional info and obviously, no help page.
This isn't so much of a pain for me, but when I share code I think it would be useful to have some information at hand besides the # comments in every line.
Nowadays, when I share, my lines usually look like this
myfun <- function(x1, x2, x3, ...){
  # This is a function for this and that
  # x1 is a factor, x2 is an integer ...
  # This line of code is useful for transformation of x2 by x1
  some code here
  # Now we do this other thing
  more code
  # This is where the magic happens
  return(magic)
}
I think this line-by-line commenting is great, but I'd like to improve it and make some things handy, just like for every other function.
Not really an answer, but if you are interested in exploring this further, you should start at the rcompgen-help page (although that's not a function name) and also examine the code of:
rc.settings
Also, executing this allows you to see what the .CompletionEnv has in it for currently loaded packages:
names(rc.status())
#-----
[1] "attached_packages" "comps" "linebuffer" "start"
[5] "options" "help_topics" "isFirstArg" "fileName"
[9] "end" "token" "fguess" "settings"
And if you just look at:
rc.status()$help_topics
... you see the character items that the tab-completion mechanism uses for matching. On my machine at the moment there are 8881 items in that vector.
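For example, in an interactive session, right after triggering a completion with <Tab> you can poke at what the completer knew (a sketch; the results will obviously depend on what you have loaded):
st <- rc.status()
length(st$help_topics)                          # how many topics the completer knows about
grep("^readRDS", st$help_topics, value = TRUE)  # check whether a given topic is among them
st$comps                                        # the completions offered for the last <Tab>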

SnowballC in R stems "many" and "only"

I am using SnowballC to process a text document, but I realized that it stems words such as "many" and "only" even though they are not supposed to be stemmed.
> library(SnowballC)
>
> str <- c("many", "only", "things")
> str.stemmed <- stemDocument(str)
> str.stemmed
[1] "mani" "onli" "thing"
>
> dic <- c("many", "only", "online", "things")
> str.complete <- stemCompletion(str.stemmed, dic)
> str.complete
mani onli thing
"" "online" "things"
You can see that after stemming, "many" and "only" became "mani" and "onli", which cannot be completed back with stemCompletion later on, since "mani" is not the beginning of "many". Notice how "onli" gets completed to "online" instead of the original "only".
Why is that? Is there a way to fix this?
Stemming is often executed as a set of rules for stripping all affixes--both derivational and inflectional--from a word, leaving its root. Lemmatization typically only removes inflectional affixes. Stemming is a much more aggressive version of lemmatization. Given what you want, it seems like you'd prefer lemmatization.
To compare the two, most lemmatizers are limited to a few rules for dealing with affixes to nouns and verbs in English (-ed, -s, -ing, for example). There are a few irregular cases they have to handle, but with some training data, many are probably covered.
Stemmers are expected to dig deeper. As a result, the space of possible transformations they can make is bigger, so you're a lot more likely to end up with errors.
To see what's happening in your data, let's look at the specifics.
online -> onli: why on earth would this happen? Not totally sure on this one; there's probably some rule that tries to cater to words like medic-ine and medic-al, sub-mari-ne and mari-ne, imagi-ne and imagi-na-tion.
only -> onli, many -> mani: These seem particularly strange, but are probably more reasonable than the previous rule--especially in the context of dealing with verbs that end in -ed. If you're stemming the words denied, studied, modified, specified, you'll want them to be equivalent to their uninflected forms deny, study, modify, specify.
You could have a rule to transform each verb into the uninflected form, but the authors here chose to make the roots the forms ending in -i. To ensure that these match, -y endings had to be transformed to -i as well.
With a lemmatizer, you might get more predictable results. Since they only remove inflectional affixes, you'd get only, many, online, and thing, as you wanted. Both a good stemmer and lemmatizer can work well, but the stemmer does more stuff and therefore has more room for error.
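If you want to stay in R, here is a minimal sketch of the lemmatization route, assuming the third-party textstem package (not part of SnowballC or tm) is installed:
library(textstem)
lemmatize_words(c("many", "only", "online", "things"))
## expected to give something like: "many" "only" "online" "thing"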
That is how stemmers work. You've got a (smallish) set of rules that reduce most words to something resembling a canonical form (a stem), but not quite. There are many other corner cases you will find, so many in fact that I hesitate to call them corner cases, e.g.
many -> mani
other -> other
corner -> corner
cases -> case
in -> in
sentences -> sentenc
What you want is a lemmatiser. Have a look at this question for a more detailed explanation:
Stemmers vs Lemmatizers

In memory data processing in R?: save -> readBin ->?

How can I access R data originally saved with save() and later read back with readBin()?
Let me try to explain:
I have saved some data (mostly matrices and lists) to a file using save().
Later I transformed this file (encrypted it) and saved the result using writeBin().
Since the file is transformed, I cannot get the data back directly with load() but need to read it with readBin() and perform the opposite transformation in memory.
The problem is that after reading with readBin() and transforming, the data are in memory, but I cannot access them as R objects (such as matrices or lists), since they are not recognized as such (there is just a single binary object).
The easiest way would be to use this binary object as a connection for load().
Unfortunately, load() does not accept in-memory binary connections.
I guess .Internal(loadFromConn2(...)) may be the key to this, but I do not know the details of its internal workings.
Is there any way to make R recognize binary data stored in memory as the original R objects (matrices, lists, etc.)?
The encryption code I am using is available at: http://pastebin.com/eVfVQYwn
Thanks in advance.
(If you aren't interested in learning how to research this type of
problem in the future, skip to "Results", far below.)
Long Story ...
Knowing some things about how R objects are stored with save
will inform you on how to retrieve them with load. From help(save):
save(..., list = character(),
     file = stop("'file' must be specified"),
     ascii = FALSE, version = NULL, envir = parent.frame(),
     compress = !ascii, compression_level,
     eval.promises = TRUE, precheck = TRUE)
The default for compress will be !ascii which means compress will
be TRUE, so:
compress: logical or character string specifying whether saving to a
named file is to use compression. 'TRUE' corresponds to
'gzip' compression, ...
The key here is that it defaults to 'gzip' compression. From here,
let's look at help(load):
'load' ... can read a compressed file (see 'save') directly from a
file or from a suitable connection (including a call to
'url').
(Emphasis added by me.) This implies both that it will take a
connection (that is not an actual file), and that it "forces"
compressed-ness. My typical go-to function for faking file connections
is textConnection, though this does not work with binary files, and
its help page doesn't provide a reference for binary equivalence.
Continued from help(load):
A not-open connection will be opened in mode '"rb"' and closed after
use. Any connection other than a 'gzfile' or 'gzcon'
connection will be wrapped in 'gzcon' to allow compressed saves to
be handled ...
Diving a little tangentially (remember the previous mention of gzip
compression?), help(gzcon):
Compressed output will contain embedded NUL bytes, and so 'con'
is not permitted to be a 'textConnection' opened with 'open =
"w"'. Use a writable 'rawConnection' to compress data into
a variable.
Aha! Now we see that there is a function rawConnection which one
would (correctly) infer is the binary-mode equivalent of
textConnection.
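A quick sanity check that rawConnection really does behave like a readable binary connection:
rc <- rawConnection(as.raw(c(0x68, 0x69)), open = "rb")
readBin(rc, what = raw(), n = 2)
## [1] 68 69
close(rc)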
Results (aka "long story short, too late")
Your pastebin code is interesting but unfortunately moot.
Reproducible examples
make things easier for people considering answering your question.
Your problem statement, restated:
set.seed(1234)
fn <- 'test-mjaniec.Rdata'
(myvar1 <- rnorm(5))
## [1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247
(myvar2 <- sample(letters, 5))
## [1] "s" "n" "g" "v" "x"
save(myvar1, myvar2, file=fn)
rm(myvar1, myvar2) ## ls() shows they are no longer available
x.raw <- readBin(fn, what=raw(), n=file.info(fn)$size)
head(x.raw)
## [1] 1f 8b 08 00 00 00
## how to access the data stored in `x.raw`?
The answer:
load(rawConnection(x.raw, open='rb'))
(Confirmation:)
myvar1
## [1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247
myvar2
## [1] "s" "n" "g" "v" "x"
(It works with your encryption code, too, by the way.)
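For the encryption use case, here is a purely illustrative round trip; a single-byte XOR stands in for your real transformation, and the .enc file name is just made up:
key <- as.raw(0x5A)                                           # stand-in "secret"
enc <- as.raw(bitwXor(as.integer(x.raw), as.integer(key)))    # transform
writeBin(enc, "test-mjaniec.enc")                             # store the transformed bytes
enc2 <- readBin("test-mjaniec.enc", what = raw(),
                n = file.info("test-mjaniec.enc")$size)
dec <- as.raw(bitwXor(as.integer(enc2), as.integer(key)))     # reverse the transform in memory
load(rawConnection(dec, open = "rb"))                         # same rawConnection trick as above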

R: How can I disable truncation of listing of package functions?

How can I list all of the results that used to occur when typing packageName<tab>, i.e. the full list offered via auto-completion? In R 2.15.0, I get the following for Matrix::<tab>:
> library(Matrix)
> Matrix::
Matrix::.__C__abIndex Matrix::.__C__atomicVector Matrix::.__C__BunchKaufman Matrix::.__C__CHMfactor Matrix::.__C__CHMsimpl
Matrix::.__C__CHMsuper Matrix::.__C__Cholesky Matrix::.__C__CholeskyFactorization Matrix::.__C__compMatrix Matrix::.__C__corMatrix
Matrix::.__C__CsparseMatrix Matrix::.__C__dCHMsimpl Matrix::.__C__dCHMsuper Matrix::.__C__ddenseMatrix Matrix::.__C__ddiMatrix
Matrix::.__C__denseLU Matrix::.__C__denseMatrix Matrix::.__C__dgCMatrix Matrix::.__C__dgeMatrix Matrix::.__C__dgRMatrix
Matrix::.__C__dgTMatrix Matrix::.__C__diagonalMatrix Matrix::.__C__dMatrix Matrix::.__C__dpoMatrix Matrix::.__C__dppMatrix
Matrix::.__C__dsCMatrix Matrix::.__C__dsparseMatrix Matrix::.__C__dsparseVector Matrix::.__C__dspMatrix Matrix::.__C__dsRMatrix
Matrix::.__C__dsTMatrix Matrix::.__C__dsyMatrix Matrix::.__C__dtCMatrix Matrix::.__C__dtpMatrix Matrix::.__C__dtrMatrix
Matrix::.__C__dtRMatrix Matrix::.__C__dtTMatrix Matrix::.__C__generalMatrix Matrix::.__C__iMatrix Matrix::.__C__index
Matrix::.__C__isparseVector Matrix::.__C__ldenseMatrix Matrix::.__C__ldiMatrix Matrix::.__C__lgCMatrix Matrix::.__C__lgeMatrix
Matrix::.__C__lgRMatrix Matrix::.__C__lgTMatrix Matrix::.__C__lMatrix Matrix::.__C__lsCMatrix Matrix::.__C__lsparseMatrix
[...truncated]
That [...truncated] message is irritating and I want to produce the full listing. Which option/flag/knob/configuration/incantation do I need to invoke in order to avoid the truncation? I have this impression that I used to see the full list, but not anymore - perhaps that was on a different OS (e.g. Linux).
I know that ls("package:Matrix") is one useful approach, but it is not the same as setting an option, and the list is different.
Unfortunately, on Windows, it looks like this behavior is hard-wired into the C code used to construct the console. So the answer seems to be that "no, you can't disable it" (at least not without modifying the sources and then recompiling R from scratch).
Here are the relevant lines from $RHOME/src/gnuwin32/console.c:
909 static void performCompletion(control c)
910 {
911 ConsoleData p = getdata(c);
912 int i, alen, alen2, max_show = 10, cursor_position = p->c - prompt_wid;
...
...
1001 if (alen > max_show)
1002 consolewrites(c, "\n[...truncated]\n");
You are correct that on some other platforms, all of the results are printed out. (I often use Emacs, for instance, and it pops all results of tab completion up in a separate buffer).
As an interesting side note, rcompgen, the backend that actually performs the tab-completion (as opposed to printing results to the console) does always find all completions. It's just that Windows doesn't then print them out for us to see.
You can verify that this happens even on Windows by typing:
library(Matrix)
Matrix::
## Then type <TAB> <TAB>
## Then type <RET>
rc.status() ## Careful not to use tab-completion to complete rc.status !
matches <- rc.status()$comps
length(matches) # -> 288
matches # -> lots of symbols starting with 'Matrix::'
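And if you want to actually see the whole list despite the console truncation, just print it yourself:
writeLines(matches)   # one completion per line, nothing truncated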
For more details about the backend, and the functions and options that control its behavior, see ?rcompgen.

How can I use R (Rcurl/XML packages ?!) to scrape this webpage?

I have a (somewhat complex) web scraping challenge that I wish to accomplish and would love some direction (to whatever level you feel like sharing). Here goes:
I would like to go through all the "species pages" present in this link:
http://gtrnadb.ucsc.edu/
So for each of them I will go to:
The species page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/)
And then to the "Secondary Structures" page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/Aero_pern-structs.html)
Inside that link I wish to scrape the data in the page so that I will have a long list containing this data (for example):
chr.trna3 (1-77) Length: 77 bp
Type: Ala Anticodon: CGC at 35-37 (35-37) Score: 93.45
Seq: GGGCCGGTAGCTCAGCCtGGAAGAGCGCCGCCCTCGCACGGCGGAGGcCCCGGGTTCAAATCCCGGCCGGTCCACCA
Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<....
Where each line will have its own list (inside the list for each "trna" inside the list for each animal)
I remember coming across the packages RCurl and XML (in R) that could allow for such a task, but I don't know how to use them. So what I would love to have is:
1. Some suggestion on how to build such code.
2. A recommendation for how to learn the knowledge needed for performing such a task.
Thanks for any help,
Tal
Tal,
You could use R and the XML package to do this, but (damn) that is some poorly formed HTML you are trying to parse. In fact, in most cases you would want to be using the readHTMLTable() function, which is covered in this previous thread.
Given this ugly HTML, however, we will have to use the RCurl package to pull the raw HTML and create some custom functions to parse it. This problem has two components:
Get all of the genome URLs from the base webpage (http://gtrnadb.ucsc.edu/) using the getURLContent() function in the RCurl package and some regex magic :-)
Then take that list of URLs, scrape the data you are looking for, and stick it into a data.frame.
So, here goes...
library(RCurl)
### 1) First task is to get all of the web links we will need ##
base_url<-"http://gtrnadb.ucsc.edu/"
base_html<-getURLContent(base_url)[[1]]
links<-strsplit(base_html,"a href=")[[1]]
get_data_url<-function(s) {
  u_split1<-strsplit(s,"/")[[1]][1]
  u_split2<-strsplit(u_split1,'\\"')[[1]][2]
  ifelse(grep("[[:upper:]]",u_split2)==1 & length(strsplit(u_split2,"#")[[1]])<2,
         return(u_split2),return(NA))
}
# Extract only those elements that are relevant
genomes<-unlist(lapply(links,get_data_url))
genomes<-genomes[which(is.na(genomes)==FALSE)]
### 2) Now, scrape the genome data from all of those URLS ###
# This requires two complementary functions that are designed specifically
# for the UCSC website. The first parses the data from a -structs.html page
# and the second collects that data in to a multi-dimensional list
parse_genomes<-function(g) {
  g_split1<-strsplit(g,"\n")[[1]]
  g_split1<-g_split1[2:5]
  # Pull all of the data and stick it in a list
  g_split2<-strsplit(g_split1[1],"\t")[[1]]
  ID<-g_split2[1]                            # Sequence ID
  LEN<-strsplit(g_split2[2],": ")[[1]][2]    # Length
  g_split3<-strsplit(g_split1[2],"\t")[[1]]
  TYPE<-strsplit(g_split3[1],": ")[[1]][2]   # Type
  AC<-strsplit(g_split3[2],": ")[[1]][2]     # Anticodon
  SEQ<-strsplit(g_split1[3],": ")[[1]][2]    # Sequence
  STR<-strsplit(g_split1[4],": ")[[1]][2]    # Structure string
  return(c(ID,LEN,TYPE,AC,SEQ,STR))
}
# This will be a high dimensional list with all of the data, you can then manipulate as you like
get_structs<-function(u) {
  struct_url<-paste(base_url,u,"/",u,"-structs.html",sep="")
  raw_data<-getURLContent(struct_url)
  s_split1<-strsplit(raw_data,"<PRE>")[[1]]
  all_data<-s_split1[seq(3,length(s_split1))]
  data_list<-lapply(all_data,parse_genomes)
  for (d in 1:length(data_list)) {data_list[[d]]<-append(data_list[[d]],u)}
  return(data_list)
}
# Collect data, manipulate, and create data frame (with slight cleaning)
genomes_list<-lapply(genomes[1:2],get_structs) # Limit to the first two genomes (Bdist & Spurp), a full scrape will take a LONG time
genomes_rows<-unlist(genomes_list,recursive=FALSE) # The recursive=FALSE saves a lot of work, now we can just do a straightforward manipulation
genome_data<-t(sapply(genomes_rows,rbind))
colnames(genome_data)<-c("ID","LEN","TYPE","AC","SEQ","STR","NAME")
genome_data<-as.data.frame(genome_data)
genome_data<-subset(genome_data,ID!="</PRE>") # Some malformed web pages produce bad rows, but we can remove them
head(genome_data)
The resulting data frame contains seven columns related to each genome entry: ID, length, type, anticodon, sequence, structure string, and name. The name column contains the base genome, which was my best guess for data organization. Here is what it looks like:
head(genome_data)
ID LEN TYPE AC SEQ
1 Scaffold17302.trna1 (1426-1498) 73 bp Ala AGC at 34-36 (1459-1461) AGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTTTCCA
2 Scaffold20851.trna5 (43038-43110) 73 bp Ala AGC at 34-36 (43071-43073) AGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTCTCCA
3 Scaffold20851.trna8 (45975-46047) 73 bp Ala AGC at 34-36 (46008-46010) TGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTCTCCA
4 Scaffold17302.trna2 (2514-2586) 73 bp Ala AGC at 34-36 (2547-2549) GGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACAGGGATCGATGCCCGGGTTCTCCA
5 Scaffold51754.trna5 (253637-253565) 73 bp Ala AGC at 34-36 (253604-253602) CGGGGGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTCCTCCA
6 Scaffold17302.trna4 (6027-6099) 73 bp Ala AGC at 34-36 (6060-6062) GGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGAGTTCTCCA
STR NAME
1 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
2 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
3 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
4 >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>.>>>.......<<<.<<<<<<<<. Spurp
5 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
6 >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<......>>>>.......<<<<.<<<<<<<. Spurp
I hope this helps, and thanks for the fun little Sunday afternoon R challenge!
Just tried it using Mozenda (http://www.mozenda.com). After roughly 10 minutes I had an agent that could scrape the data as you describe. You may be able to get all of this data just using their free trial. Coding is fun, if you have time, but it looks like you may already have a solution coded for you. Nice job Drew.
Interesting problem, and I agree that R is cool, but somehow I find R to be a bit cumbersome in this respect. I prefer to get the data in an intermediate plain-text form first so that I can verify that the data is correct at every step... If the data is already in its final form, or you want to upload your data somewhere, RCurl is very useful.
The simplest approach in my opinion would be to (on Linux/Unix/Mac, or in Cygwin) just mirror the entire http://gtrnadb.ucsc.edu/ site (using wget), take the files named *-structs.html, sed or awk out the data you would like, and format it for reading into R.
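A rough sketch of the R side of that mirror-then-munge route, assuming you have already run something like wget --mirror against the site (the local directory name is what wget would create by default):
files <- list.files("gtrnadb.ucsc.edu", pattern = "-structs\\.html$",
                    recursive = TRUE, full.names = TRUE)
pages <- lapply(files, readLines, warn = FALSE)
# each element of 'pages' is the raw HTML of one -structs.html page,
# ready for gsub()/regmatches()-style munging before assembling a data frame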
I'm sure there would be lots of other ways also.
