extracting node information - r

Using the XML library, I have parsed a web page
basicInfo <- htmlParse(myURL, isURL = TRUE)
the relevant section of which is
<div class="col-left"><h1 class="tourney-name">Price Cutter Charity Championship Pres'd by Dr Pep</h1><img class="tour-logo" alt="Nationwide Tour" src="http://a.espncdn.com/i/golf/leaderboard11/logo-nationwide-tour.png"/></div>
I can manage to extract the tournament name
tourney <- xpathSApply(basicInfo, "//*/div[@class='col-left']", xmlValue)
but I also wish to know the tour it is from, using the alt attribute of the image. In this case I want to get the result "Nationwide Tour".
TIA, and apologies for the scrolling required.

Don't know R but I'm pretty good with XPath
Try this:
tourney_name <- xpathSApply(basicInfo, "//*/div[@class='col-left']/h1/text()", xmlValue)
tourney_loc <- xpathSApply(basicInfo, "//*/div[@class='col-left']/img/@alt", xmlValue)
Note the use of "@" to extract attributes and text() to extract text nodes (it looks like R did this automatically); my revised tourney_name XPath should do the same thing, but it is clearer which part is being extracted.
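For completeness, a minimal self-contained R sketch of the corrected extraction (using the XML package; the HTML fragment from the question stands in for the parsed page, and xmlGetAttr is used for the attribute):
library(XML)

# The fragment from the question, used here in place of htmlParse(myURL, isURL = TRUE)
snippet <- '<div class="col-left"><h1 class="tourney-name">Price Cutter Charity Championship Pres\'d by Dr Pep</h1><img class="tour-logo" alt="Nationwide Tour" src="http://a.espncdn.com/i/golf/leaderboard11/logo-nationwide-tour.png"/></div>'
basicInfo <- htmlParse(snippet, asText = TRUE)

tourney_name <- xpathSApply(basicInfo, "//div[@class='col-left']/h1", xmlValue)
tour_name    <- xpathSApply(basicInfo, "//div[@class='col-left']/img", xmlGetAttr, "alt")
tour_name
# [1] "Nationwide Tour"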

Related

rvest - Extract data from OMIM

EDIT: After some research and help from others, I figured out that what I was trying to do is not ethical. I asked for OMIM API permission on the OMIM website and advise anyone who needs to do the same kind of thing to do likewise.
I am quite inexperienced in HTML.
Using keywords like 'ciliary' and 'primary', I want to search OMIM, go into the first 5 links listed, save the text within those links, and scrape data based on keywords like 'homozygous', 'heterozygous', etc.
What I have done so far:
library(rvest)

rvestedOMIM <- function() {
  clinicKeyWord1 <- c('primary', 'ciliary')
  OMIM1 <- paste0("https://www.omim.org/search/?index=entry&start=1&limit=10&sort=score+desc%2C+prefix_sort+desc&search=", clinicKeyWord1[1], "+", clinicKeyWord1[2])
  webpage <- read_html(OMIM1)
  rank_data_html <- html_nodes(webpage, '.mim-hint')
  # Go into the first 5 links and extract the data based on keywords
  allLinks <- rank_data_html[grep('a href', rank_data_html)]
  allLinks <- allLinks[grep('omim', allLinks)]
}
At the moment, I am stuck at going through the links listed in the first OMIM search (with the 'primary' and 'ciliary' keywords). The allLinks object within the function I wrote is intended to extract the links,
e.g.
244400. CILIARY DYSKINESIA, PRIMARY, 1; CILD1
(https://www.omim.org/entry/244400?search=ciliary%20primary&highlight=primary%20ciliary)
608644. CILIARY DYSKINESIA, PRIMARY, 3; CILD3
(https://www.omim.org/entry/608644?search=ciliary%20primary&highlight=primary%20ciliary)
Even if I could only scrape the OMIM ids in the links (244400 or 608644), I could navigate to the entries myself; that is a workaround I thought of in case I couldn't scrape the full links.
Thank you for your help
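For the narrow problem of pulling the entry links out of the search page, a hedged rvest sketch follows (bearing in mind the EDIT above: the OMIM API, with permission, is the right route). It assumes the result links are ordinary <a> tags whose href contains "/entry/" and is relative to the site root.
library(rvest)

# Sketch only -- prefer the OMIM API once permission is granted (see EDIT above).
# Assumes the result links are plain <a> tags with hrefs containing "/entry/".
webpage    <- read_html(OMIM1)                     # OMIM1 built as in the function above
all_hrefs  <- html_attr(html_nodes(webpage, "a"), "href")
entryLinks <- grep("/entry/", all_hrefs, value = TRUE)
entryLinks <- paste0("https://www.omim.org", entryLinks)[1:5]  # first five hits
entryIds   <- sub(".*/entry/([0-9]+).*", "\\1", entryLinks)    # e.g. "244400"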

Retrieve whole lyrics from URL

I am trying to retrieve the whole lyrics of a band from the web.
I have noticed that they build URLs using ".../firstletter/bandname/songname.html"
Here is an example.
http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html
I was thinking about creating a function that would read.csv the URLs.
That part was kind of easy, because I can get the song titles by a simple copy-paste and save them as .csv. Then I can use that vector, passing each value to the function in order to construct the URL.
But I tried to read the first one just to see what it looks like, and found that there would be too much "cleaning the data" if my goal is to build a csv file with each lyric.
x <- read.csv(url("http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html"))
I think my approach is not the best (or maybe I need a better data cleaning strategy)
The HTML page has a tell on where the lyrics begin:
Usage of azlyrics.com content by any third-party lyrics provider is prohibited by our licensing agreement. Sorry about that.
Taking advantage of that, you can detect this string, and then read everything up to the end of the div:
m <- readLines("http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html")
giveaway <- "Sorry about that."
# You can add the full line in case you think one of the lyrics might have this sentence in it.
start <- grep(giveaway, m) + 1 # Where the lyrics start
end <- grep("</div>", m[start:length(m)])[1] + start - 1
# Take the first </div> after the start of the lyrics, then convert the relative position back to an absolute index
lyrics <- paste(gsub("<br>|</div>", "", m[start:end]), collapse = "\n")
# This is just an example of how to clear the remaining tags and join the text.
And then:
> cat(lyrics) #using cat() prints the line breaks
Ridin' down the highway
Goin' to a show
Stop in all the byways
Playin' rock 'n' roll
.
.
.
Well it's a long way
It's a long way, you should've told me
It's a long way, such a long way
Assuming that "cleaning the data" means you would be parsing through HTML tags, I recommend using a DOM scraping library that extracts only the lyrics text from the page; you can then save those lyrics to CSV, a database, or wherever. That way you wouldn't have to do any data cleaning. I don't know what programming language you're using, but a simple Google search will show you a lot of DOM querying and parsing libraries for any language.
Here is an example with PHP
http://simplehtmldom.sourceforge.net/manual.htm
$html = file_get_html('http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html');
// Find the div that follows the "ringtone" div; it holds the lyrics
$lyrics = $html->find('div.ringtone', 1)->next_sibling();
print($lyrics->innertext);
Now you have the lyrics; save them (code not tested).
If you're using R, use the rvest library linked below. You will be able to query the DOM and extract the lyrics easily.
https://github.com/hadley/rvest
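If you stay in R, a rough rvest equivalent of the PHP snippet above (a sketch, untested against the live page, and assuming the same "ringtone"-div-followed-by-lyrics-div layout) would be:
library(rvest)

# Take the <div> that follows the "ringtone" div, as the PHP snippet does, and keep its text
page   <- read_html("http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html")
lyrics <- html_text(html_node(page, xpath = "//div[@class='ringtone']/following-sibling::div[1]"))
cat(lyrics)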

readHTMLTables -- Retrieving Country Names and urls of articles related to the heads of governments

I'd like to make a map of the current world presidents.
For this, I want to scrape the images of each president from wikipedia.
The first step is getting data from the wiki page:
http://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government
I have trouble getting the country names and president page URLs because the table has rowspans.
For the moment, my code looks like the code below, but it doesn't work because of the row spanning.
library(XML)
u = "http://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government"
doc = htmlParse(u)
tb = getNodeSet(doc, "//table")[[3]]
stateNames <- readHTMLTable(tb)$State
presidentUrls <- xpathSApply(tb, "//table/tr/td[2]/a[2]/@href")
Any idea welcome!
Mat
If there is heterogeneity in the table, I don't think we can deal with the problem with a single line of code. In your case, some td elements have colspan=2 while the others don't, so they can be selected and processed separately with filters like the following:
nations1 <- xpathSApply(tb, "//table/tr[td[@colspan='2']]/td[1]/a/text()")
nations2 <- xpathSApply(tb, "//table/tr[count(td)=3]/td[1]/a/text()")
Should you meet other types of conditions in the table, just keep in mind that XPath has more.
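The same two row filters can be reused to pull the article links; treat the td index below as an assumption to verify against the rowspan layout of the page:
# Reuse the two row filters for the links; the td index is an assumption
# that needs checking against the actual rowspan layout of the table.
urls1 <- xpathSApply(tb, "//table/tr[td[@colspan='2']]/td[2]/a/@href")
urls2 <- xpathSApply(tb, "//table/tr[count(td)=3]/td[2]/a/@href")
presidentUrls <- c(urls1, urls2)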

xpath node determination

I'm all new to scraping and I'm trying to understand XPath using R. My objective is to create a vector of people from this website. I'm able to do it using:
library(XML)   # for htmlTreeParse / xmlValue
library(plyr)  # for ldply

r <- htmlTreeParse(e)  ## e is the result of getURL
g.k <- r[[3]][[1]][[2]][[3]][[2]][[2]][[2]][[1]][[4]]
l <- g.k[names(g.k) == "text"]
u <- ldply(l, function(x) {
  w <- xmlValue(x)
  return(w)
})
However, this is cumbersome and I'd prefer to use XPath. How do I go about referencing the path detailed above? Is there a function for this, or can I submit my path somehow referenced as above?
I've come to
kk <- xpathApply(htmlTreeParse(e, useInternalNodes = TRUE), "//body//text//div//div//p//text()", function(k) xmlValue(k))
But this leaves me a lot of cleaning up to do and I assume it can be done better.
Regards,
//M
EDIT: Sorry for the lack of clarity, but I'm all new to this and rather confused. Unfortunately, the XML document is too large to be pasted. I guess my question is whether there is some easy way to find the names of these nodes / the structure of the document, besides using view source? I've come a little closer to what I'd like:
e2 <- getNodeSet(htmlTreeParse(e, useInternalNodes = TRUE), "//p")[[5]]
gives me the list of what I want, although still as XML with br tags. I thought running
kk <- xpathApply(e2, "//text()", function(k) xmlValue(k))
would provide a list that could later be unlisted; however, it provides a list with more garbage than e2 displays.
Is there a way to do this directly:
kk <- xpathApply(htmlTreeParse(e, useInternalNodes = TRUE), "//p[5]//text()", function(k) xmlValue(k))
The link to the web page is below; I'm trying to get the names, and only the names, from the page.
getURL("http://legeforeningen.no/id/1712")
I ended up with
xml = htmlTreeParse("http://legeforeningen.no/id/1712", useInternalNodes=TRUE)
(no need for RCurl) and then
sub(",.*$", "", unlist(xpathApply(xml, "//p[4]/text()", xmlValue)))
(subsetting in XPath), which leaves a final line that is not a name. One could do the text processing in XPath, too, but then one would iterate at the R level:
n <- xpathApply(xml, "count(//p[4]/text())") - 1L
sapply(seq_len(n), function(i) {
  xpathApply(xml, sprintf('substring-before(//p[4]/text()[%d], ",")', i))
})
Unfortunately, this does not pick up names that do not contain a comma.
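A small alternative sketch that keeps comma-free names: take every text node of the same //p[4], drop the trailing non-name line, and strip the location only where a comma is present.
# Alternative: keep comma-free names by taking all the text nodes,
# dropping the trailing non-name line, and stripping ", location" where present.
nms <- unlist(xpathApply(xml, "//p[4]/text()", xmlValue))
nms <- trimws(head(nms, -1))   # drop the final line that is not a name
sub(",.*$", "", nms)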
Use a mixture of XPath and string manipulation.
#Retrieve and parse the page.
library(XML)
library(RCurl)
page <- getURL("http://legeforeningen.no/id/1712")
parsed <- htmlTreeParse(page, useInternalNodes = TRUE)
Inspecting the parsed variable, which contains the page's source, tells us that instead of sensibly using a list tag (like <ul>), the author just put a paragraph (<p>) of text split with line breaks (<br />). We use XPath to retrieve the <p> elements.
# Inspection tells us we want the fifth paragraph.
name_nodes <- xpathApply(parsed, "//p")[[5]]
Now we convert to character, split on the <br> tags and remove empty lines.
all_names <- as(name_nodes, "character")
all_names <- gsub("</?p>", "", all_names)
all_names <- strsplit(all_names, "<br />")[[1]]
all_names <- all_names[nzchar(all_names)]
all_names
Optionally, separate the names of people and their locations.
strsplit(all_names, ", ")
Or more prettily with stringr.
library(stringr)
str_split_fixed(all_names, ", ", 2)

Web Scraping (in R?)

I want to get the names of the companies in the middle column of this page (written in bold in blue), as well as the location indicator of the person who is registering the complaint (e.g. "India, Delhi", written in green). Basically, I want a table (or data frame) with two columns, one for company, and the other for location. Any ideas?
You can easily do this using the XML package in R. Here is the code
url = "http://www.consumercomplaints.in/bysubcategory/mobile-service-providers/page/1.html"
doc = htmlTreeParse(url, useInternalNodes = T)
profiles = xpathSApply(doc, "//a[contains(@href, 'profile')]", xmlValue)
profiles = profiles[!(1:length(profiles) %% 2)]
states = xpathSApply(doc, "//a[contains(@href, 'bystate')]", xmlValue)
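To end up with the requested two-column table, the two vectors can then be combined (this assumes profiles and states line up one-to-one on the page, which is worth checking via their lengths):
# Combine into the requested two-column table
complaints <- data.frame(company = profiles, location = states,
                         stringsAsFactors = FALSE)
head(complaints)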
To match the titles in blue bold, the trick is to open the source code of the page and look at what comes before and after what you are looking for; then you use a regex.
preg_match_all('/>[a-zA-Z0-9]+<\/a><\/h4><\/td>/', $str, $matches);
for ($i = 0; $i < sizeof($matches[0]); $i++)
    echo $matches[0][$i];
You may check this.
