rvest - Extract data from OMIM - r

EDIT: After some research and help from others, I realized that what I was trying to do is not ethical. I have requested OMIM API access from the OMIM website instead, and I advise anyone who needs to do the same kind of thing to do likewise.
I am quite inexperienced with HTML.
Using keywords like 'ciliary' and 'primary', I want to search OMIM, follow the first 5 links listed, save the text behind those links, and scrape data from it based on keywords like 'homozygous', 'heterozygous', etc.
What I have done so far:
library(rvest)

rvestedOMIM <- function() {
  clinicKeyWord1 <- c('primary', 'ciliary')
  OMIM1 <- paste0("https://www.omim.org/search/?index=entry&start=1&limit=10&sort=score+desc%2C+prefix_sort+desc&search=",
                  clinicKeyWord1[1], "+", clinicKeyWord1[2])
  webpage <- read_html(OMIM1)
  # nodes that hold the search-result entries
  rank_data_html <- html_nodes(webpage, '.mim-hint')
  # Go into the first 5 links and extract the data based on keywords
  allLinks <- rank_data_html[grep('a href', rank_data_html)]
  allLinks <- allLinks[grep('omim', allLinks)]
}
At the moment, I am stuck at going through the links listed in the first OMIM search (with the 'primary' and 'ciliary' keywords). The allLinks object within the function I wrote is intended to extract the links,
e.g.
244400. CILIARY DYSKINESIA, PRIMARY, 1; CILD1
(https://www.omim.org/entry/244400?search=ciliary%20primary&highlight=primary%20ciliary)
608644. CILIARY DYSKINESIA, PRIMARY, 3; CILD3
(https://www.omim.org/entry/608644?search=ciliary%20primary&highlight=primary%20ciliary)
Even if I could only scrape the OMIM ids in the links (244400 or 608644), I could navigate to the entries myself; that is the workaround I had in mind in case I couldn't scrape the full links.
Thank you for your help
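For completeness, here is roughly where I was heading with allLinks, written as a plain link extraction. This is only a sketch reusing the OMIM1 URL from the function above; per the edit at the top, the right route is the OMIM API, so treat it purely as an rvest illustration, and the "/entry/" filter is my assumption about OMIM's URL structure:
library(rvest)

# Sketch: pull all hrefs from the search page and keep the OMIM entry links
searchPage <- read_html(OMIM1)
links <- html_attr(html_nodes(searchPage, "a"), "href")
entryLinks <- head(links[grepl("/entry/", links)], 5)
# the bare OMIM ids, e.g. "244400"
omimIds <- sub(".*/entry/([0-9]+).*", "\\1", entryLinks)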

Related

How to search tweets containing either one hashtag or other ones

I need to get tweets that contain at least one of the following hashtags: #EUwahl #Euwahlen #Europawahl #Europawahlen. This means I am looking for tweets containing at least one of those hashtags, but they can also contain more of them. Furthermore, each of these tweets must also mention one out of seven specific users (e.g. #AfD).
So far I only know how to search Twitter for a single hashtag or for all of several hashtags at once. That is, I am familiar with the AND operator (using a + between the hashtags) but not with the operator for OR.
This is an example of a code I have used so far to do any searches in Twitter:
euelection <- searchTwitter("#EUwahl", n=1000, since = "2019-05-01",until = "2019-05-26")
I can install twitteR, but it requires an authentication key which is not very easy for me to get.
The principle is to join the terms with OR, separated by spaces. Here is an example with rtweet:
library(rtweet)

# your tags
TAGS <- c("#EUwahl", "#Europawahl")
# build the search term
SEARCH <- paste(TAGS, collapse = " OR ")
# do the search (you can also use twitteR)
test <- search_tweets(SEARCH, n = 100)
# the text of the tweets found
head(test$text)
# check which tweet contains which tag
tab <- sapply(TAGS, function(i) as.numeric(grepl(i, test$text, ignore.case = TRUE)))
# all of them contain either #EUwahl or #Europawahl
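To also enforce the "one of seven users must be mentioned" condition from the question, one option (an untested sketch; the account list below is made up) is to AND a second OR-group onto the query, since a plain space between terms means AND in Twitter's search syntax:
# hypothetical placeholders for the seven accounts that must be mentioned
USERS <- c("@AfD", "@CDU")
# (#EUwahl OR #Europawahl) (@AfD OR @CDU) -- the space between the groups means AND
SEARCH2 <- paste0("(", paste(TAGS, collapse = " OR "), ") (",
                  paste(USERS, collapse = " OR "), ")")
test2 <- search_tweets(SEARCH2, n = 100)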

how to use rvest to scrape same kind of datapoint but labelled with different id

If I want to use rvest to scrape a particular data point (name, address, phone, etc.) repeated in different sections of this page, where each one starts with a similar but not identical span id, such as:
docs-internal-guid-049ac94a-f34e-5729-b053-30567fdf050a
docs-internal-guid-765e48e9-f34b-7c88-5d95-042a93fcfda3
what's the best approach? Finding and copying each id is not viable. Thanks
Edit:
You can use the following script to retrieve all starred restaurants:
library("rvest")
url_base <- "http://www.straitstimes.com/lifestyle/food/full-list-of-michelin-starred-restaurants-for-2017"
data <- read_html(url_base) %>%
html_nodes("h3") %>%
html_text()
This also gives you the headers ("ONE MICHELIN STAR", "TWO MICHELIN STARS", "THREE MICHELIN STARS"), but this might even be helpful.
Background to the script:
Fortunately, all of the relevant information, and only that, sits inside the h3 elements. The script gives you a character vector as output. Of course, you can process this further, e.g. with %>% as.data.frame(), or store / process the data however you want.
------------------- old answer -------------------
Could you maybe provide the URL of that particular page? It sounds like you have to find the right CSS selector (nth-child(x)) that you can use in a loop.
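As for the prefix-style ids in the original question: a CSS attribute selector matches them without listing each id. A rough sketch (the prefix comes from the question; the URL is a placeholder, since the page was never posted):
library(rvest)

page <- read_html("http://example.com/the-page-in-question")  # placeholder URL
# [id^='...'] matches every span whose id starts with the given prefix
spans <- html_nodes(page, "span[id^='docs-internal-guid-']")
html_text(spans)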

Retrieve whole lyrics from URL

I am trying to retrieve the whole lyrics of a band from the web.
I have noticed that they build URLs using ".../firstletter/bandname/songname.html"
Here is an example.
http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html
I was thinking about creating a function that would read.csv() each URL.
That part was kind of easy because I can get the titles with a simple copy-paste and save them as a .csv. Then I can pass that vector through the function, value by value, to construct each URL.
But I tried to read the first one just to see what it looks like, and I found that there would be too much data cleaning to do if my goal is to build a csv file with each lyric.
x <-read.csv(url("http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html"))
I think my approach is not the best (or maybe I need a better data cleaning strategy)
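For the URL-building part, what I have in mind is something like this (a sketch; titles stands for the vector read in from my .csv, already formatted the way azlyrics expects):
titles <- c("itsalongwaytothetopifyouwannarocknroll", "highwaytohell")  # from my .csv
urls <- paste0("http://www.azlyrics.com/lyrics/acdc/", titles, ".html")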
The HTML page has a tell on where the lyrics begin:
Usage of azlyrics.com content by any third-party lyrics provider is prohibited by our licensing agreement. Sorry about that.
Taking advantage of that, you can detect this string, and then read everything up to the end of the div:
m <- readLines("http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html")
# You can match the full licensing-notice line in case one of the lyrics contains this sentence.
giveaway <- "Sorry about that."
# Where the lyrics start: the line after the giveaway
start <- grep(giveaway, m) + 1
# Take the first </div> after the start of the lyrics and convert it back to an absolute position
end <- grep("</div>", m[start:length(m)])[1] + start - 1
# Clear the remaining tags and join the text; this is just an example of how to do it
lyrics <- paste(gsub("<br>|</div>", "", m[start:end]), collapse = "\n")
And then:
> cat(lyrics) #using cat() prints the line breaks
Ridin' down the highway
Goin' to a show
Stop in all the byways
Playin' rock 'n' roll
.
.
.
Well it's a long way
It's a long way, you should've told me
It's a long way, such a long way
Assuming that "cleaning the data" means you would be parsing through html tags. I recommend using DOM scraping library that would extract only the text lyrics from the page and save those lyrics to CSV, database or wherever. That way you wouldn't have to do any data cleaning. I don't know what programming language your using, but a simple google search will show you a lot of DOM querying and parsing libraries for any language.
Here is an example with PHP
http://simplehtmldom.sourceforge.net/manual.htm
$html = file_get_html('http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html');
// The lyrics live in the div that follows the "ringtone" div
$lyrics = $html->find('div.ringtone', 1)->next_sibling();
print($lyrics->innertext);
Now you have the lyrics; save them (code not tested).
If you're using R, use the library linked below. You will be able to query the DOM and extract the lyrics easily, as in the sketch after the link.
https://github.com/hadley/rvest
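A rough rvest equivalent of the PHP snippet above (an untested sketch; it assumes the lyrics still sit in the first div following the ringtone div, which may change):
library(rvest)

page <- read_html("http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html")
# the lyrics div has no class or id, so step to the sibling after div.ringtone
lyrics_node <- html_node(page, xpath = "//div[contains(@class, 'ringtone')]/following-sibling::div[1]")
cat(html_text(lyrics_node))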

readHTMLTables -- Retrieving Country Names and urls of articles related to the heads of governments

I'd like to make a map of the current world presidents.
For this, I want to scrape the images of each president from wikipedia.
The first step is getting data from the wiki page:
http://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government
I am having trouble getting the country names and the URLs of the presidents' pages because the table has rowspans.
For the moment, my code looks like the below, but it doesn't work because of the row spanning.
library(XML)
u = "http://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government"
doc = htmlParse(u)
tb = getNodeSet(doc, "//table")[[3]]
stateNames <- readHTMLTable(tb)$State
presidentUrls <- xpathSApply(tb, "//table/tr/td[2]/a[2]/@href")
Any idea welcome!
Mat
If there is heterogeneity in the table, I don't think we can deal with the problem with a single line of code. In your case, some td elements have colspan=2 while the others don't, so they can be selected and processed separately with filters like the following:
nations1 <- xpathSApply(tb, "//table/tr[td[@colspan='2']]/td[1]/a/text()")
nations2 <- xpathSApply(tb, "//table/tr[count(td)=3]/td[1]/a/text()")
Should you meet other kinds of irregularities in the table, just keep in mind that XPath offers more predicates like these.
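The same row filters can be reused for the president URLs by pulling the href attribute off the anchors, roughly like this (a sketch; the td and a indices are assumptions about the table layout and may need adjusting):
# rows whose first cell spans two columns vs. regular three-cell rows
urls1 <- xpathSApply(tb, "//table/tr[td[@colspan='2']]/td[2]/a[1]", xmlGetAttr, "href")
urls2 <- xpathSApply(tb, "//table/tr[count(td)=3]/td[2]/a[1]", xmlGetAttr, "href")
presidentUrls <- c(urls1, urls2)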

xpath node determination

I'm all new to scraping and I'm trying to understand XPath using R. My objective is to create a vector of people from this website. I'm able to do it using:
library(XML)
library(plyr)

r <- htmlTreeParse(e)  ## e is the page source from getURL
g.k <- r[[3]][[1]][[2]][[3]][[2]][[2]][[2]][[1]][[4]]
l <- g.k[names(g.k) == "text"]
u <- ldply(l, function(x) {
  w <- xmlValue(x)
  return(w)
})
However, this is cumbersome and I'd prefer to use XPath. How do I go about referencing the path detailed above? Is there a function for this, or can I submit my path somehow referenced as above?
I've come to
xpathApply( htmlTreeParse(e, useInt=T), "//body//text//div//div//p//text()", function(k) xmlValue(k))->kk
But this leaves me a lot of cleaning up to do and I assume it can be done better.
Regards,
//M
EDIT: Sorry for the lack of clarity, but I'm all new to this and rather confused. The XML document is unfortunately too large to paste. I guess my question is whether there is some easy way to find the names of these nodes / the structure of the document, besides using view source? I've come a little closer to what I'd like:
getNodeSet(htmlTreeParse(e, useInt=T), "//p")[[5]]->e2
gives me the list of what I want, though still as XML with br tags. I thought running
xpathApply(e2, "//text()", function(k) xmlValue(k))->kk
would provide a list that could later be unlisted. However, it provides a list with more garbage than e2 displays.
Is there a way to do this directly:
xpathApply(htmlTreeParse(e, useInt=T), "//p[5]//text()", function(k) xmlValue(k))->kk
Link to the web page (I'm trying to get the names, and only the names, from the page):
getURL("http://legeforeningen.no/id/1712")
I ended up with
xml = htmlTreeParse("http://legeforeningen.no/id/1712", useInternalNodes=TRUE)
(no need for RCurl) and then
sub(",.*$", "", unlist(xpathApply(xml, "//p[4]/text()", xmlValue)))
(subset in xpath) which leaves a final line that is not a name. One could do the text processing in XML, too, but then one would iterate at the R level.
n <- xpathApply(xml, "count(//p[4]/text())") - 1L
sapply(seq_len(n), function(i) {
xpathApply(xml, sprintf('substring-before(//p[4]/text()[%d], ",")', i))
})
Unfortunately, this does not pick up names that do not contain a comma.
Use a mixture of xpath and string manipulation.
#Retrieve and parse the page.
library(XML)
library(RCurl)
page <- getURL("http://legeforeningen.no/id/1712")
parsed <- htmlTreeParse(page, useInternalNodes = TRUE)
Inspecting the parsed variable, which contains the page's source, tells us that instead of sensibly using a list tag (like <ul>), the author just put down a paragraph (<p>) of text split with line breaks (<br />). We use XPath to retrieve the <p> elements.
# Inspection tells us we want the fifth paragraph.
name_nodes <- xpathApply(parsed, "//p")[[5]]
Now we convert to character, split on the <br> tags and remove empty lines.
all_names <- as(name_nodes, "character")
all_names <- gsub("</?p>", "", all_names)
all_names <- strsplit(all_names, "<br />")[[1]]
all_names <- all_names[nzchar(all_names)]
all_names
Optionally, separate the names of people and their locations.
strsplit(all_names, ", ")
Or, more prettily, with stringr:
library(stringr)
str_split_fixed(all_names, ", ", 2)
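If you prefer rvest over the XML package, the same idea looks roughly like this (a sketch; the fifth-paragraph index and the comma-splitting assumption carry over from above):
library(rvest)
library(stringr)

page <- read_html("http://legeforeningen.no/id/1712")
# take the fifth <p> as raw HTML, then split it on the <br /> tags
p5 <- as.character(html_nodes(page, "p")[[5]])
names_raw <- strsplit(gsub("</?p[^>]*>", "", p5), "<br ?/?>")[[1]]
names_raw <- trimws(names_raw[nzchar(trimws(names_raw))])
str_split_fixed(names_raw, ", ", 2)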
