Web Scraping (in R?)

I want to get the names of the companies in the middle column of this page (written in bold in blue), as well as the location indicator of the person who is registering the complaint (e.g. "India, Delhi", written in green). Basically, I want a table (or data frame) with two columns, one for company, and the other for location. Any ideas?

You can easily do this using the XML package in R. Here is the code:
library(XML)
url <- "http://www.consumercomplaints.in/bysubcategory/mobile-service-providers/page/1.html"
doc <- htmlTreeParse(url, useInternalNodes = TRUE)
# company names: links whose href contains 'profile' (each appears twice, so keep every second one)
profiles <- xpathSApply(doc, "//a[contains(@href, 'profile')]", xmlValue)
profiles <- profiles[!(1:length(profiles) %% 2)]
# locations: links whose href contains 'bystate'
states <- xpathSApply(doc, "//a[contains(@href, 'bystate')]", xmlValue)
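Assuming the two vectors line up one-to-one after the filtering above, you can then combine them into the two-column data frame the question asks for:
# Combine into a data frame (assumes profiles and states have the same length
# and matching order on this page).
complaints <- data.frame(company = profiles, location = states, stringsAsFactors = FALSE)
head(complaints)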

This matches the titles in blue bold. The trick is to open the page source, look at what comes just before and after the thing you are looking for, and then use a regex (here in PHP):
// collect every full match, then print them
preg_match_all('/>[a-zA-Z0-9 ]+<\/a><\/h4><\/td>/', $str, $matches);
foreach ($matches[0] as $match)
    echo $match;
You may check this.

Related

Issue scraping a collapsible table using rvest

I am trying to scrape information from multiple collapsible tables from a website called APIS.
An example of what I'm trying to collect is here http://www.apis.ac.uk/select-feature?site=1001814&SiteType=SSSI&submit=Next
Ideally I'd like to have the drop-down heading followed by the information underneath, but when using rvest I can't seem to get it to select the correct section from the HTML.
I'm reasonably new to R; this is what I have from watching some videos about scraping:
library(rvest)
link <- "http://www.apis.ac.uk/select-feature?site=1001814&SiteType=SSSI&submit=Next"
page <- read_html(link)
name <- page %>% html_nodes(".tab-tables :nth-child(1)") %>% html_text()
The "name" value comes back as "character (empty)".
It may be because I'm new to this and there's a really obvious answer, but any help would be appreciated.
The data for each tab comes from additional requests you can find in the browser network tab when pressing F5 to refresh the page. For example, the nutrients info comes from:
http://www.apis.ac.uk/sites/default/files/AJAX/srcl_2019/apis_tab_nnut.php?ajax=true&site=1001814&BH=&populateBH=true
Which you can think of more generally as:
scheme='http'
netloc='www.apis.ac.uk'
path='/sites/default/files/AJAX/srcl_2019/apis_tab_nnut.php'
params=''
query='ajax=true&site=1001814&BH=&populateBH=true'
fragment=''
So, you would make your request to those urls you see in the network tab.
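For a single tab, a minimal (untested) sketch of requesting that endpoint directly with rvest and parsing whatever tables come back:
library(rvest)
# The nutrients endpoint shown above, requested directly
nnut_url <- "http://www.apis.ac.uk/sites/default/files/AJAX/srcl_2019/apis_tab_nnut.php?ajax=true&site=1001814&BH=&populateBH=true"
nnut_tables <- read_html(nnut_url) %>% html_nodes("table") %>% html_table(fill = TRUE)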
If you want to determine these urls dynamically, then make a request, as you did, to the landing page, and use a regex to pull the url paths (see above) out of the response text. This can be done with the pattern url: "(\\/sites\\/default\\/files\\/.*?)".
You then need to add the protocol + domain (scheme and netloc) to the returned matches based on landing page protocol and domain.
There are some additional query string parameters, which come after the ?, and these can also be retrieved dynamically when reconstructing the urls from the response text; they are visible in the page source.
You probably want to extract each of those data param specs for the Ajax requests, e.g. with the pattern data:\\s\\((.*?)\\), and then have a custom function that turns the matches into the query string suffix to append to the previously retrieved urls.
Something like the following:
library(rvest)
library(magrittr)
library(stringr)
# Turn a scraped "data: (...)" spec into a query string, substituting in the site code.
get_query_string <- function(match, site_code) {
  string <- paste0(
    "?",
    gsub("siteCode", site_code, gsub('["{}]', "", gsub(",\\s+", "&", gsub(":\\s+", "=", match))))
  )
  return(string)
}
link <- "http://www.apis.ac.uk/select-feature?site=1001814&SiteType=SSSI&submit=Next"
page <- read_html(link) %>% toString()
# Endpoint paths regexed out of the page source, prefixed with protocol + domain
links <- paste0("http://www.apis.ac.uk", stringr::str_match_all(page, 'url: "(\\/sites\\/default\\/files\\/.*?)"')[[1]][, 2])
# The data param specs for each request, and the site code they reference
params <- stringr::str_match_all(page, "data:\\s\\((.*?)\\),")[[1]][, 2]
site_code <- stringr::str_match_all(page, 'var siteCode = "(.*?)"')[[1]][, 2]
params <- lapply(params, get_query_string, site_code)
urls <- paste0(links, params)
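From there, a minimal (untested) sketch of actually requesting each constructed url and parsing whatever tables come back:
# Sketch: each tab fragment may have a different structure, so inspect the results.
tab_data <- lapply(urls, function(u) {
  read_html(u) %>% html_nodes("table") %>% html_table(fill = TRUE)
})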

rvest - Extract data from OMIM

EDIT: After some research and help from others, I figured out that what I was trying to do is not ethical. I requested OMIM API access from the OMIM website and advise anyone who needs to do the same sort of thing to do likewise.
I am quite inexperienced in HTML.
Using keywords like 'ciliary' and 'primary', I want to search OMIM, go into the first 5 links listed, save the text within those links, and extract data based on keywords like 'homozygous', 'heterozygous', etc.
What I have done so far:
library(rvest)
rvestedOMIM <- function() {
  clinicKeyWord1 <- c('primary', 'ciliary')
  OMIM1 <- paste0("https://www.omim.org/search/?index=entry&start=1&limit=10&sort=score+desc%2C+prefix_sort+desc&search=", clinicKeyWord1[1], "+", clinicKeyWord1[2])
  webpage <- read_html(OMIM1)
  rank_data_html <- html_nodes(webpage, '.mim-hint')
  # Go into the first 5 links and extract the data based on keywords
  allLinks <- rank_data_html[grep('a href', rank_data_html)]
  allLinks <- allLinks[grep('omim', allLinks)]
}
At the moment, I am stuck at going through the links listed in the first OMIM search (with the 'primary' and 'ciliary' keywords). The allLinks object within the function I wrote is intended to extract those links,
e.g.
244400. CILIARY DYSKINESIA, PRIMARY, 1; CILD1
(https://www.omim.org/entry/244400?search=ciliary%20primary&highlight=primary%20ciliary)
608644. CILIARY DYSKINESIA, PRIMARY, 3; CILD3
(https://www.omim.org/entry/608644?search=ciliary%20primary&highlight=primary%20ciliary)
Even if I could only scrape the OMIM IDs in the links (244400 or 608644), I could navigate to the entries myself; that is the workaround I had in mind in case I couldn't scrape the full link.
Thank you for your help
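Per the EDIT above, the OMIM API is the right route here, but for reference the generic rvest pattern for pulling entry links out of search results (on sites whose terms permit it) is a sketch like the following, assuming webpage is the parsed search page from the function above and that '.mim-hint a' is a workable selector:
# Hypothetical sketch: collect hrefs from anchors inside the result nodes,
# keep the /entry/ ones, and take the first five.
entry_links <- webpage %>% html_nodes(".mim-hint a") %>% html_attr("href")
entry_links <- head(unique(entry_links[grepl("/entry/", entry_links)]), 5)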

readHTMLTables -- Retrieving Country Names and urls of articles related to the heads of governments

I'd like to make a map of the current world presidents.
For this, I want to scrape the images of each president from wikipedia.
The first step is getting data from the wiki page:
http://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government
I have trouble getting the country names and president page urls because the table has rowspans.
For the moment, my code looks like the one below, but it doesn't work because of the row spanning.
library(XML)
u = "http://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government"
doc = htmlParse(u)
tb = getNodeSet(doc, "//table")[[3]]
stateNames <- readHTMLTable(tb)$State
presidentUrls <- xpathSApply(tb, "//table/tr/td[2]/a[2]/@href")
Any idea welcome!
Mat
If there is heterogeneity in the table, I don't think we can deal with the problem in a single line of code. In your case, some td elements have colspan=2 while the others don't, so they can be selected and processed separately with filters like the following:
nations1 <- xpathSApply(tb, "//table/tr[td[@colspan='2']]/td[1]/a/text()")
nations2 <- xpathSApply(tb, "//table/tr[count(td)=3]/td[1]/a/text()")
Should you meet other kinds of conditions in the table, just keep in mind that XPath has more to offer.
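To also get the president page urls and build the state/url table, a sketch along the same lines; the td/a positions below mirror the question's original XPath and are assumptions that may need adjusting per row type against the live page:
# Pull the link hrefs from the same two row shapes, then stack both groups.
# Lengths should match the name vectors row for row; verify against the page.
urls1 <- xpathSApply(tb, "//table/tr[td[@colspan='2']]/td[2]/a[2]/@href")
urls2 <- xpathSApply(tb, "//table/tr[count(td)=3]/td[2]/a[2]/@href")
states <- sapply(c(nations1, nations2), xmlValue)
heads <- data.frame(state = states,
                    url = paste0("http://en.wikipedia.org", c(urls1, urls2)),
                    stringsAsFactors = FALSE)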

extracting node information

Using the XML library, I have parsed a web page
basicInfo <- htmlParse(myURL, isURL = TRUE)
the relevant section of which is
<div class="col-left"><h1 class="tourney-name">Price Cutter Charity Championship Pres'd by Dr Pep</h1><img class="tour-logo" alt="Nationwide Tour" src="http://a.espncdn.com/i/golf/leaderboard11/logo-nationwide-tour.png"/></div>
I can manage to extract the tournament name
tourney <- xpathSApply(basicInfo, "//*/div[@class='col-left']", xmlValue)
but also wish to know the tour it is from using the alt tag. In this case I want to get the result "Nationwide Tour"
TIA and apologies for scrolling required
Don't know R but I'm pretty good with XPath
Try this:
tourney_name <- xpathSApply(basicInfo, "//*/div[@class='col-left']/h1/text()", xmlValue)
tourney_loc <- xpathSApply(basicInfo, "//*/div[@class='col-left']/img/@alt", xmlValue)
Note the use of "@" to extract attributes and text() to extract text nodes (it looks like R did this automatically); my revised tourney_name XPath should do the same thing, but it makes clearer which part is being extracted.
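As a quick self-contained check, essentially the same expressions can be run against the HTML fragment quoted above:
# Self-contained check using the <div class="col-left"> fragment shown above.
library(XML)
snippet <- '<div class="col-left"><h1 class="tourney-name">Price Cutter Charity Championship Pres\'d by Dr Pep</h1><img class="tour-logo" alt="Nationwide Tour" src="http://a.espncdn.com/i/golf/leaderboard11/logo-nationwide-tour.png"/></div>'
basicInfo <- htmlParse(snippet, asText = TRUE)
xpathSApply(basicInfo, "//div[@class='col-left']/h1/text()", xmlValue)  # tournament name
xpathSApply(basicInfo, "//div[@class='col-left']/img/@alt")             # "Nationwide Tour"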

xpath node determination

I'm all new to scraping and I'm trying to understand xpath using R. My objective is to create a vector of people from this website. I'm able to do it using:
library(XML)
library(plyr)
r <- htmlTreeParse(e)  ## e is the page source fetched with getURL
g.k <- r[[3]][[1]][[2]][[3]][[2]][[2]][[2]][[1]][[4]]
l <- g.k[names(g.k) == "text"]
u <- ldply(l, function(x) {
  w <- xmlValue(x)
  return(w)
})
However, this is cumbersome and I'd prefer to use xpath. How do I go about referencing the path detailed above? Is there a function for this, or can I somehow submit my path referenced as above?
I've come to
kk <- xpathApply(htmlTreeParse(e, useInternalNodes = TRUE), "//body//text//div//div//p//text()", function(k) xmlValue(k))
But this leaves me a lot of cleaning up to do and I assume it can be done better.
Regards,
//M
EDIT: Sorry for the lack of clarity, but I'm new to this and rather confused. The XML document is unfortunately too large to paste. I guess my question is whether there is some easy way to find the names of these nodes / the structure of the document, besides using view source? I've come a little closer to what I'd like:
e2 <- getNodeSet(htmlTreeParse(e, useInternalNodes = TRUE), "//p")[[5]]
gives me the list of what I want, though still as XML with br tags. I thought running
kk <- xpathApply(e2, "//text()", function(k) xmlValue(k))
would provide a list that could later be unlisted; however, it returns a list with more garbage than e2 displays.
Is there a way to do this directly:
kk <- xpathApply(htmlTreeParse(e, useInternalNodes = TRUE), "//p[5]//text()", function(k) xmlValue(k))
Link to the web page: I'm trying to get the names, and only the names, from the page.
getURL("http://legeforeningen.no/id/1712")
I ended up with
xml = htmlTreeParse("http://legeforeningen.no/id/1712", useInternalNodes=TRUE)
(no need for RCurl) and then
sub(",.*$", "", unlist(xpathApply(xml, "//p[4]/text()", xmlValue)))
(subset in xpath) which leaves a final line that is not a name. One could do the text processing in XML, too, but then one would iterate at the R level.
n <- xpathApply(xml, "count(//p[4]/text())") - 1L  # drop the final line, which is not a name
sapply(seq_len(n), function(i) {
  xpathApply(xml, sprintf('substring-before(//p[4]/text()[%d], ",")', i))
})
Unfortunately, this does not pick up names that do not contain a comma.
Use a mixture of xpath and string manipulation.
#Retrieve and parse the page.
library(XML)
library(RCurl)
page <- getURL("http://legeforeningen.no/id/1712")
parsed <- htmlTreeParse(page, useInternalNodes = TRUE)
Inspecting the parsed variable, which contains the page's source, tells us that instead of sensibly using a list tag (like <ul>), the author just put a paragraph (<p>) of text split with line breaks (<br />). We use xpath to retrieve the <p> elements.
#Inspection tells us we want the fifth paragraph.
name_nodes <- xpathApply(parsed, "//p")[[5]]
Now we convert to character, split on the <br> tags and remove empty lines.
all_names <- as(name_nodes, "character")
all_names <- gsub("</?p>", "", all_names)
all_names <- strsplit(all_names, "<br />")[[1]]
all_names <- all_names[nzchar(all_names)]
all_names
Optionally, separate the names of people and their locations.
strsplit(all_names, ", ")
Or, more neatly, with stringr.
library(stringr)
str_split_fixed(all_names, ", ", 2)
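If a two-column data frame is the end goal, a small follow-on sketch (the "name" and "location" column labels are assumptions based on the ", " split above):
# Wrap the split pieces in a data frame for downstream use.
people <- as.data.frame(str_split_fixed(all_names, ", ", 2), stringsAsFactors = FALSE)
names(people) <- c("name", "location")
head(people)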
