I have a (somewhat complex) web scraping challenge that I wish to accomplish and would love some direction (to whatever level you feel like sharing). Here goes:
I would like to go through all the "species pages" present in this link:
http://gtrnadb.ucsc.edu/
So for each of them I will go to:
The species page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/)
And then to the "Secondary Structures" page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/Aero_pern-structs.html)
Inside that link I wish to scrape the data on the page so that I will have a long list containing this data (for example):
chr.trna3 (1-77) Length: 77 bp
Type: Ala Anticodon: CGC at 35-37 (35-37) Score: 93.45
Seq: GGGCCGGTAGCTCAGCCtGGAAGAGCGCCGCCCTCGCACGGCGGAGGcCCCGGGTTCAAATCCCGGCCGGTCCACCA
Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<....
Where each line will have its own list (inside the list for each "trna" inside the list for each animal).
I remember coming across the packages RCurl and XML (in R), which could allow for such a task. But I don't know how to use them. So what I would love to have is:
1. Some suggestions on how to build such code.
2. A recommendation on how to learn the knowledge needed for performing such a task.
Thanks for any help,
Tal
Tal,
You could use R and the XML package to do this, but (damn) that is some poorly formed HTML you are trying to parse. In fact, in most cases you would want to be using the readHTMLTable() function, which is covered in this previous thread.
Given this ugly HTML, however, we will have to use the RCurl package to pull the raw HTML and create some custom functions to parse it. This problem has two components:
Get all of the genome URLs from the base webpage (http://gtrnadb.ucsc.edu/) using the getURLContent() function in the RCurl package and some regex magic :-)
Then take that list of URLs, scrape the data you are looking for, and stick it into a data.frame.
So, here goes...
library(RCurl)
### 1) First task is to get all of the web links we will need ##
base_url<-"http://gtrnadb.ucsc.edu/"
base_html<-getURLContent(base_url)[[1]]
links<-strsplit(base_html,"a href=")[[1]]
get_data_url <- function(s) {
  u_split1 <- strsplit(s, "/")[[1]][1]
  u_split2 <- strsplit(u_split1, '\\"')[[1]][2]
  # Keep only links that look like genome directories: capitalized, no "#" anchor
  if (!is.na(u_split2) && grepl("[[:upper:]]", u_split2) &&
      length(strsplit(u_split2, "#")[[1]]) < 2) {
    return(u_split2)
  }
  NA
}
# Extract only those elements that are relevant
genomes <- unlist(lapply(links, get_data_url))
genomes <- genomes[!is.na(genomes)]
### 2) Now, scrape the genome data from all of those URLS ###
# This requires two complementary functions that are designed specifically
# for the UCSC website. The first parses the data from a -structs.html page
# and the second collects that data into a multi-dimensional list
parse_genomes <- function(g) {
  g_split1 <- strsplit(g, "\n")[[1]]
  g_split1 <- g_split1[2:5]
  # Pull all of the data and stick it in a list
  g_split2 <- strsplit(g_split1[1], "\t")[[1]]
  ID <- g_split2[1]                            # Sequence ID
  LEN <- strsplit(g_split2[2], ": ")[[1]][2]   # Length
  g_split3 <- strsplit(g_split1[2], "\t")[[1]]
  TYPE <- strsplit(g_split3[1], ": ")[[1]][2]  # Type
  AC <- strsplit(g_split3[2], ": ")[[1]][2]    # Anticodon
  SEQ <- strsplit(g_split1[3], ": ")[[1]][2]   # Sequence
  STR <- strsplit(g_split1[4], ": ")[[1]][2]   # Structure string
  return(c(ID, LEN, TYPE, AC, SEQ, STR))
}
# This will be a high-dimensional list with all of the data; you can then manipulate it as you like
get_structs <- function(u) {
  struct_url <- paste(base_url, u, "/", u, "-structs.html", sep = "")
  raw_data <- getURLContent(struct_url)
  s_split1 <- strsplit(raw_data, "<PRE>")[[1]]
  all_data <- s_split1[seq(3, length(s_split1))]
  data_list <- lapply(all_data, parse_genomes)
  for (d in seq_along(data_list)) { data_list[[d]] <- append(data_list[[d]], u) }
  return(data_list)
}
# Collect data, manipulate, and create data frame (with slight cleaning)
genomes_list<-lapply(genomes[1:2],get_structs) # Limit to the first two genomes (Bdist & Spurp), a full scrape will take a LONG time
genomes_rows<-unlist(genomes_list,recursive=FALSE) # The recursive=FALSE saves a lot of work; now we can just do a straightforward manipulation
genome_data<-t(sapply(genomes_rows,rbind))
colnames(genome_data)<-c("ID","LEN","TYPE","AC","SEQ","STR","NAME")
genome_data<-as.data.frame(genome_data)
genome_data<-subset(genome_data,ID!="</PRE>") # Some malformed web pages produce bad rows, but we can remove them
head(genome_data)
The resulting data frame contains seven columns related to each genome entry: ID, length, type, anticodon, sequence, structure string, and name. The name column contains the base genome, which was my best guess for data organization. Here is what it looks like:
head(genome_data)
ID LEN TYPE AC SEQ
1 Scaffold17302.trna1 (1426-1498) 73 bp Ala AGC at 34-36 (1459-1461) AGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTTTCCA
2 Scaffold20851.trna5 (43038-43110) 73 bp Ala AGC at 34-36 (43071-43073) AGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTCTCCA
3 Scaffold20851.trna8 (45975-46047) 73 bp Ala AGC at 34-36 (46008-46010) TGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTCTCCA
4 Scaffold17302.trna2 (2514-2586) 73 bp Ala AGC at 34-36 (2547-2549) GGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACAGGGATCGATGCCCGGGTTCTCCA
5 Scaffold51754.trna5 (253637-253565) 73 bp Ala AGC at 34-36 (253604-253602) CGGGGGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTCCTCCA
6 Scaffold17302.trna4 (6027-6099) 73 bp Ala AGC at 34-36 (6060-6062) GGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGAGTTCTCCA
STR NAME
1 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
2 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
3 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
4 >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>.>>>.......<<<.<<<<<<<<. Spurp
5 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
6 >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<......>>>>.......<<<<.<<<<<<<. Spurp
I hope this helps, and thanks for the fun little Sunday afternoon R challenge!
Just tried it using Mozenda (http://www.mozenda.com). After roughly 10 minutes I had an agent that could scrape the data as you describe. You may be able to get all of this data just using their free trial. Coding is fun, if you have time, but it looks like you may already have a solution coded for you. Nice job Drew.
Interesting problem, and I agree that R is cool, but somehow I find R to be a bit cumbersome in this respect. I seem to prefer to get the data in intermediate plain-text form first, in order to be able to verify that the data is correct at every step... RCurl is very useful if the data is already in its final form, or for uploading your data somewhere.
Simplest, in my opinion, would be to (on Linux/UNIX/Mac, or in Cygwin) just mirror the entire http://gtrnadb.ucsc.edu/ site (using wget), take the files named *-structs.html, sed or awk out the data you would like, and format it for reading into R.
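For illustration, a rough sketch of that route driven from R (it assumes wget is on the PATH; the recursion depth and accept pattern are guesses that may need adjusting for the live site):

# Mirror only the secondary-structure pages (wget deletes non-matching files)
system("wget -r -l 2 -A '*-structs.html' http://gtrnadb.ucsc.edu/")
# Collect the local copies as plain text you can inspect at every step
struct_files <- list.files("gtrnadb.ucsc.edu", pattern = "-structs\\.html$",
                           recursive = TRUE, full.names = TRUE)
raw_pages <- lapply(struct_files, readLines)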
I'm sure there would be lots of other ways also.
Related
I am trying to write code to store a matrix to a variable directly from Matrix Market's website. Below is a sample URL that I'd use:
https://math.nist.gov/pub/MatrixMarket2/Harwell-Boeing/bcsstruc1/bcsstk01.mtx.gz
The example URL downloads a bcsstk01.mtx.gz file. I need to extract the bcsstk01.mtx file from it, and then use MatrixMarket.mmread() so I can save it to a variable.
I first tried saving the downloaded file (or URL location) to a variable with A = HTTP.get(), but a lack of online resources and a lack of knowledge led to no results. Then I used HTTP.download() and got the .mtx.gz file, but I can't unzip it. And finally, MatrixMarket.mmread() cannot read .gz files. So I'm stuck with a downloaded file I can't do anything with unless I manually unzip it.
Using the info from the link in the comments and some fiddling, I managed to get the following:
using TranscodingStreams, CodecZlib
using Downloads

# Download the gzipped file into an in-memory buffer
stream = PipeBuffer()
Downloads.download("https://math.nist.gov/pub/MatrixMarket2/Harwell-Boeing/bcsstruc1/bcsstk01.mtx.gz", stream)

# Wrap the buffer in a decompressing stream and read the text line by line
openstream = TranscodingStream(GzipDecompressor(), stream)
for line in eachline(openstream)
    println(line)
end
This prints:
%%MatrixMarket matrix coordinate real symmetric
48 48 224
1 1 2.8322685185200e+06
5 1 1.0000000000000e+06
6 1 2.0833333333300e+06
7 1 -3.3333333333300e+03
...
which I suppose is the desired data. From there, one option would be to write the decompressed lines out to a temporary .mtx file and pass that file's path to MatrixMarket.mmread() to finally land the matrix in a variable.
I am working with a large number of image files within several subdirectories of one parent folder.
I am attempting to run an ImageJ macro to batch-process the images (specifically, I am trying to stitch together a series of images taken on the microscope into single images). Unfortunately, I don't think I can run this as a plain ImageJ macro because the images were taken with varying grid sizes, i.e. some are 2x3, some are 3x3, some are 3x2, etc.
I've written an R script that is able to evaluate the image folders and determine the grid size, now I am trying to feed that information to my ImageJ macro to batch process the folder.
The issue I am running into seems like it should be easy to solve, but I haven't had any luck figuring it out: in R, I have a data.frame that I need to pass to the system command line-by-line with the columns concatenated into a single character string delimited by *'s.
Here's an example from the data.frame I have in R:
X xcoord ycoord input
1 4_10249_XY01_Fused_CH2 2 3 /XY01
2 4_10249_XY02_Fused_CH2 2 2 /XY02
3 4_10249_XY03_Fused_CH2 3 3 /XY03
4 4_10249_XY04_Fused_CH2 2 2 /XY04
5 4_10249_XY05_Fused_CH2 2 2 /XY05
6 4_10249_XY06_Fused_CH2 2 3 /XY06
Here's what each row needs to be transformed into so that ImageJ can understand it:
4_10249_XY01_Fused_CH2*2*3*/XY01
4_10249_XY02_Fused_CH2*2*2*/XY02
4_10249_XY03_Fused_CH2*3*3*/XY03
4_10249_XY04_Fused_CH2*2*2*/XY04
4_10249_XY05_Fused_CH2*2*2*/XY05
4_10249_XY06_Fused_CH2*2*3*/XY06
I tried achieving this with a for loop inside of a function that I thought would pass each row into the system command, but the macro only runs for the first line, not for any of the others.
macro <- function(i) {
for (row in 1:nrow(i)) {
df<-paste(i$X, i$xcoord, i$ycoord, i$input, sep='*')
}
system2('/Applications/Fiji.app/Contents/MacOS/ImageJ-macosx', args=c('-batch "/Users/All Stitched CH2.ijm"', df))
}
macro(table)
I think this is because the for loop is not maintaining the list-form of the data.frame. How do I concatenate the table by row and maintain the list-structure? I don't know if I'm asking the right question, but hopefully I'm close enough that someone here understands what I'm trying to do.
I appreciate any help or tips you can provide!
Turns out taking a break helps a lot!
I came back to this after lunch and came up with an easy solution (duh!). I thought I would post it in case anyone comes along later with a similar issue.
I used stringr to combine the columns of my data table, then put the result back into list form using as.list. Finally, to feed the list into my macro, I edited the macro function to contain only the system command and then used lapply to apply it to my list of inputs. Here is what my code looks like in the end:
library(stringr)
tablecombined <- str_c(table$X, table$xcoord, table$ycoord, table$input, sep = "*")
listylist <- as.list(tablecombined)
macro <- function(i) {
  system2('/Applications/Fiji.app/Contents/MacOS/ImageJ-macosx',
          args = c('-batch "/Users/All Stitched CH2.ijm"', i))
}
runme <- lapply(listylist, macro)
Note: I am using the system2 command because it can take arguments, which is necessary for me to be able to feed it a series of images to iterate over. I started with the solution posted here: How can I call/execute an imageJ macro with R?
but needed additional flexibility for my specific situation. Hopefully someone may find this useful in the future when running ImageJ Macros from R!
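One small aside (my own suggestion, not part of the original solution): if a path contains spaces, building the -batch argument with base R's shQuote() is a bit safer than embedding the quotes by hand:

macro <- function(i) {
  # shQuote() wraps the path in quotes suitable for the shell
  system2('/Applications/Fiji.app/Contents/MacOS/ImageJ-macosx',
          args = c('-batch', shQuote('/Users/All Stitched CH2.ijm'), i))
}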
I apologize if this question has already been asked using terminology I don't recognize, but it doesn't appear to have been.
I am using the comm2sci function from the taxize package to search for the scientific names for a database of over 120,000 rows of common names. Here is a subset of 10:
commnames <- c("WESTERN CAPERCAILLIE", "AARDVARK", "AARDWOLF", "ABACO ISLAND BOA",
"ABBOTT'S DAY GECKO", "ABDIM'S STORK", "ABRONIA GRAMINEA", "ABYSSINIAN BLUE
WINGED GOOSE",
"ABYSSINIAN CAT", "ABYSSINIAN GROUND HORNBILL")
When searching with the NCBI database in this function, it asks for user input if the common name is generic/general and not species-specific; for example, the following call will ask for clarification on "AARDVARK" and expect you to enter '1', '2', or 'return' for NA.
install.packages("taxize")
library(taxize)
ncbioutput <- comm2sci(commnames, db = "ncbi")  # querying the NCBI database
Because of this, I cannot rely on this function to find the names of the 120,000 species without sitting and pressing 'return' every few minutes. I know this question sounds taxize-specific, but I've run into this situation in the past with other functions as well. My question is: is there a general way to place the comm2sci call in a conditional statement that will return a specific value when user input is prompted? Or otherwise write a function that will return some input when prompted?
All searches related to this tell me how to ask for user input but not how to override user queries. These are two of the question threads I've found, but I can't seem to apply them to my situation: Make R wait for console input?, Switch R script from non-interactive to interactive
I hope this was clear. Thank you very much for your time!
So the get_* functions used internally all ask for user input by default when there is more than one option. But each of those functions has a sister function with a trailing underscore, e.g. get_uid_(), that does not prompt for input and returns all the data. You can use that to get all the data, then process it however you like.
Made some changes to comm2sci, so update first: devtools::install_github("ropensci/taxize")
Here's an example.
library(taxize)
commnames <- c("WESTERN CAPERCAILLIE", "AARDVARK", "AARDWOLF", "ABACO ISLAND BOA",
"ABBOTT'S DAY GECKO", "ABDIM'S STORK", "ABRONIA GRAMINEA",
"ABYSSINIAN BLUE WINGED GOOSE",
"ABYSSINIAN CAT", "ABYSSINIAN GROUND HORNBILL")
Then use get_uid_() to get all the data:
ids <- get_uid_(commnames)
Process the results in ids as you like. Here, for brevity, we'll just grab the first row of each:
ids <- lapply(ids, function(z) z[1,])
Then grab the UIDs out:
ids <- as.uid(unname(vapply(ids, "[[", "", "uid")), check = FALSE)
And pass them to comm2sci():
comm2sci(ids)
$`100830`
[1] "Tetrao urogallus"
$`9818`
[1] "Orycteropus afer"
$`9680`
[1] "Proteles cristatus"
$`51745`
[1] "Chilabothrus exsul"
$`8565`
[1] "Gekko"
$`39789`
[1] "Ciconia abdimii"
$`278977`
[1] "Abronia graminea"
$`8865`
[1] "Cyanochen cyanopterus"
$`9685`
[1] "Felis catus"
$`153643`
[1] "Bucorvus abyssinicus"
Note that NCBI returns common names from get_uid/get_uid_, so you can just go ahead and pluck those out if you want.
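If you then want the result as a simple lookup table, a minimal sketch (assuming each element of the comm2sci() result is a length-one character vector, as in the output above):

sci <- comm2sci(ids)
# Pair each input common name with its UID and scientific name
lookup <- data.frame(common = commnames,
                     uid = names(sci),
                     scientific = unlist(sci, use.names = FALSE),
                     stringsAsFactors = FALSE)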
I'm planning to use R to do some simple text mining tasks. Specifically, I would like to do the following:
Automatically read each html file within a folder, then
For each file, do frequency count of some particular words (e.g., "financial constraint" "oil export" etc.), then
Automatically write the output to a .csv file using the following data structure (e.g., file 1 has "financial constraint" showing 3 times and "oil export" 4 times, etc.):
file_name  count_financial_constraint  count_oil_export
1          3                           4
2          0                           3
3          4                           0
4          1                           2
Can anyone please let me know where I should start? So far I think I've figured out how to clean the HTML files and do the count, but I'm still not sure how to automate the process (I really need this, as I have around 5 folders each containing about 1000 HTML files). Thanks!
Try this:
gethtml<-function(path=".") {
files<-list.files(path)
setwd(path)
html<-grepl("*.html",files)
files<-files[html]
htmlcount<-vector()
for (i in files) {
htmlcount[i]<- ##### add function that reads html file and counts it
}
return(sum(htmlcount))
}
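To get from the per-file counts to the CSV layout in the question, a hedged sketch (the folder name, phrases, and output file are illustrative):

phrases <- c("financial constraint", "oil export")  # illustrative phrases
counts <- sapply(phrases, function(p) gethtml("folder1", p))
# One row per file, one count column per phrase
out <- data.frame(file_name = rownames(counts), counts, row.names = NULL)
colnames(out) <- c("file_name", paste0("count_", gsub(" ", "_", phrases)))
write.csv(out, "folder1_counts.csv", row.names = FALSE)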
R is not intended for doing rigorous text parsing. Consequently, the tools for such tasks are limited. If you insist on doing it with R then you had better get familiar with regular expressions, and have a look at this.
However, I highly recommend using Python with the BeautifulSoup library, which is specifically designed for this task.
Suppose we have the files file1.bin, file2.bin, ..., file1460.bin in the directory C:\R\Data, and we want to read them in and loop over them in groups of four, taking the average of files 1 to 4, then 5 to 8, and so on up to 1460. In the end we will get 365 averages.
I tried to have them in a list, but did not know how to make the loop.
How do I read multiple files and manipulate them in R?
I have been spending countless hours trying to figure it out. Any help is appreciated.
results <- numeric(365)  # 1460 files averaged in groups of 4
for (i in 1:365) {
  # Files ((i - 1) * 4 + 1) through (i * 4) feed the i-th average
  results[i] <- mean(unlist(yourlist[((i - 1) * 4 + 1):(i * 4)]))
}
YMMV with the mean() call depending on how each file is stored, but that structure is how you could loop through the data once it is loaded.
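For the loading step the question asks about, a minimal sketch (the readBin() arguments here are assumptions; adjust what, n, and endian to match how the files were written):

files <- sprintf("C:/R/Data/file%d.bin", 1:1460)
# Read each binary file into a numeric vector (format is an assumption)
yourlist <- lapply(files, function(f) readBin(f, what = "numeric", n = 1e6))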