I am trying to scrape data from a website that lists the ratings of multiple products. Say a product category has 800 brands; with 10 brands per page, I would need to scrape data from 80 pages. For example, here is the data for baby care, where there are 24 pages' worth of brands that I need: http://www.goodguide.com/products?category_id=152775-baby-care&sort_order=DESC#!rf%3D%26rf%3D%26rf%3D%26cat%3D152775%26page%3D1%26filter%3D%26sort_by_type%3Drating%26sort_order%3DDESC%26meta_ontology_node_id%3D
The page number (the 1 in page%3D1 near the end of that URL) is the only thing that changes as we move from page to page, so I thought it would be straightforward to write a loop in R. But what I find is that as I move to page 2, the page does not reload; instead, just the results are updated, after about 5 seconds. R does not wait for those 5 seconds, so I ended up with the data from the first page 26 times.
I also tried entering the page 2 URL directly and running my code without a loop. Same story: I got the page 1 results. I am sure I can't be the only one facing this. Any help is appreciated. I have attached the code.
Thanks a million, and I hope my question is clear enough.
# build the URL and scrape each page
library(XML)      # htmlParse, xmlGetAttr, xmlValue
library(selectr)  # querySelector, querySelectorAll

N <- matrix(NA, 26, 15)
R <- matrix(NA, 26, 60)
for (n in 1:26) {
  # paste0 avoids the spaces that paste() would insert; n is the page number
  url <- paste0("http://www.goodguide.com/products?category_id=152775-baby-care&sort_order=DESC#!rf%3D%26rf%3D%26rf%3D%26cat%3D152775%26page%3D", n, "%26filter%3D%26sort_by_type%3Drating%26sort_order%3DDESC%26meta_ontology_node_id%3D")
  raw.data <- readLines(url)
  Parse <- htmlParse(raw.data)
  A <- querySelector(Parse, "div.results-container")
  Name <- querySelectorAll(A, "div.reviews>a")
  Ratings <- querySelectorAll(A, "div.value")
  N[n, ] <- sapply(Name, function(x) xmlGetAttr(x, "href"))
  R[n, ] <- sapply(Ratings, xmlValue)
}
Referring to the HTML source reveals that the URLs you want can be simplified to this structure:
http://www.goodguide.com/products?category_id=152775-baby-care&page=2&sort_order=DESC
The content of these URLs is retrieved by R as expected.
Note that you can also go straight to:
u <- sprintf('http://www.goodguide.com/products?category_id=152775-baby-care&page=%s&sort_order=DESC', n)
Parse <- htmlParse(u)
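Putting that together with the loop from your question, here is a sketch of the whole scrape using the simplified URL (this assumes the matrix dimensions and CSS selectors from your original code still match the page):
library(XML)      # htmlParse, xmlGetAttr, xmlValue
library(selectr)  # querySelector, querySelectorAll

N <- matrix(NA, 26, 15)
R <- matrix(NA, 26, 60)
for (n in 1:26) {
  u <- sprintf('http://www.goodguide.com/products?category_id=152775-baby-care&page=%s&sort_order=DESC', n)
  Parse <- htmlParse(u)                                  # htmlParse fetches the URL directly
  A <- querySelector(Parse, "div.results-container")
  N[n, ] <- sapply(querySelectorAll(A, "div.reviews>a"),
                   function(x) xmlGetAttr(x, "href"))
  R[n, ] <- sapply(querySelectorAll(A, "div.value"), xmlValue)
}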
I'm trying to use the rvest library in order to annotate some data from a URL.
This process worked well with previous URLs and data, but this time I'm having difficulties finding the associated CSS selector for the data I want to extract.
The URL is:
https://gnomad.broadinstitute.org/variant/8-52733231-G-A?dataset=gnomad_r2_1
I am particularly interested in the value of the European allele frequency (see image):
I have tried to find the associated CSS selector for this number by searching the source code. I have also tried some Chrome extensions that reveal selectors (SelectorGadget), but all my attempts were in vain. The result is always the same: {xml_nodeset (0)}
My code is:
library(rvest)

url <- 'https://gnomad.broadinstitute.org/variant/8-52733231-G-A?dataset=gnomad_r2_1'
webpage <- read_html(url)
europ.freq <- html_nodes(webpage, 'CSS_tag')  # 'CSS_tag' stands for the selector I cannot find
europ.freq
Can anyone help me with this?
Thanks a lot in advance!
WOW Kikoralston!
Thank you very much for your detailed answer and kindness. I really appreciate it! :D
After some problems defining the binary path of Firefox, which I finally solved with:
rs <- rsDriver(browser = "firefox",
               extraCapabilities = list(`moz:firefoxOptions` = list(binary = "C:/Program Files (x86)/Mozilla Firefox/firefox.exe")))
It worked perfectly! :D
Thanks a lot!
This is going to be a long answer, but when I started scraping these kinds of websites I struggled with it, so I just want to give you some context on what is going on.
Why is this happening?
The reason you are getting an empty node set ({xml_nodeset (0)}) in your results is simple: the nodes are not there!
You can check this yourself by printing the raw HTML source code that read_html(url) returns in this case:
> webpage %>% as.character()
[1] "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n<title>gnomAD</title>\n<meta charset=\"utf-8\">\n<meta name=\"description\" content=\"The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available for the wider scientific community.\">\n<meta name=\"viewport\" content=\"width=device-width,initial-scale=1\">\n<link rel=\"stylesheet\" href=\"https://cdnjs.cloudflare.com/ajax/libs/normalize/7.0.0/normalize.min.css\">\n<link rel=\"search\" href=\"/opensearch.xml\" type=\"application/opensearchdescription+xml\" title=\"gnomAD\">\n<style>body,html{height:100%;width:100%;background-color:#fafafa;font-family:-apple-system,BlinkMacSystemFont,\"Segoe UI\",Roboto,Oxygen-Sans,Ubuntu,Cantarell,\"Helvetica Neue\",sans-serif;font-size:14px}#media print{body,html{background:0 0}}</style>\n</head>\n<body>\n<div id=\"root\"></div>\n<script async src=\"https://www.googletagmanager.com/gtag/js?id=UA-85987017-1\"></script><script>function gtag(){dataLayer.push(arguments)}window.gaTrackingId=\"UA-85987017-1\",window.dataLayer=window.dataLayer||[],gtag(\"js\",new Date),gtag(\"config\",\"UA-85987017-1\",{send_page_view:!1}),window.gtag=gtag</script><script src=\"/js/bundle-8c98b722b9c7c14b5bbc.js\"></script>\n</body>\n</html>\n"
As you can see, the HTML code is pretty small; there is not much there. This is because the request you make to the URL from R returns just an empty template for the webpage. In your web browser, this webpage is built dynamically: your first request starts a chain of subsequent requests that execute JS scripts, download images, download data, and so on. All these subsequent requests update the content of the webpage so that you get the end result you see in your browser.
If you look in Chrome's developer tools (image below), you can see this whole chain of requests.
The request at the top is the one you made with read_html. The webpage it returned then gets updated by all these subsequent requests.
Usually, these kinds of dashboard applications make one specific request somewhere in the middle that returns a JSON string with all the data needed to populate the website. In Chrome you can filter for those responses by clicking the "XHR" button at the top of the screen. In this specific case, it is the request I marked with a red arrow.
So one solution is to go through the JSON responses, find the one that has the data you want, then repeat that request from R (in this case it is a POST request) and parse the response with JSON parsing functions (like jsonlite::fromJSON()).
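For illustration, here is a rough sketch of that replay-the-XHR approach; the endpoint and the request body below are placeholders you would copy from the marked XHR request in the network tab (I have not verified them), so treat this as the shape of the code rather than a working call:
library(httr)
library(jsonlite)

resp <- POST(
  "https://gnomad.broadinstitute.org/api",                    # assumed endpoint, check the network tab
  body = list(query = "<query body copied from the XHR request>"),
  encode = "json")
dat <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(dat)   # inspect the parsed list to locate the allele-frequency fields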
But my experience with this is that sometimes it works and sometimes it does not. I am not always sure how to reproduce that specific request to the server: sometimes there are formatting issues, sometimes there are state variables that your browser sets and you have to figure out how to set them yourself...
A probably better solution is to use RSelenium (which I only recently started using myself).
Solution with RSelenium
A couple of nice tutorials on RSelenium:
https://cran.r-project.org/web/packages/RSelenium/vignettes/basics.html
http://joshuamccrain.com/tutorials/web_scraping_R_selenium.html
With RSelenium you are actually running your web browser, so all those subsequent calls also happen: the webpage gets built dynamically and you end up with the nodes (and the data) you are looking for.
I am running a Mac and I could not make RSelenium work with Google Chrome, but with Firefox it worked perfectly. Maybe that will work for you too. I am still learning the details of RSelenium, so you may need to change some of these parameters in rsDriver.
library(rvest)
library(RSelenium)
rD <- rsDriver(browser = "firefox", port = 4568L,
               chromever = NULL,
               iedrver = NULL,      # yes, the argument really is spelled "iedrver"
               phantomver = NULL,
               version = "3.141.59")
remDr <- rD[["client"]]
remDr$navigate('https://gnomad.broadinstitute.org/variant/8-52733231-G-A?dataset=gnomad_r2_1')
# remDr$getPageSource() returns a list with the html code as string
# use read_html to convert it into something that rvest can parse
webpage <- remDr$getPageSource()[[1]] %>% read_html()
# html_table() returns all tables in the webpage as data frames
# the one you want is the second one
df.table <- webpage %>% html_table() %>% .[[2]]
# close browser and stop server
rD$client$close()
rD$server$stop()
print(df.table)
(Note that I did not search by CSS selector because in your example it was easier to just grab the whole table, but you could also get a specific node by CSS selector if you wanted to.)
And the result is:
Population Population Allele Count Allele Number Number of Homozygotes Allele Frequency
1 European (non-Finnish) European (non-Finnish) 13779 65426 0 0.2106
2 African/African-American African/African-American 2264 14050 0 0.1611
3 South Asian South Asian 2193 13982 0 0.1568
4 Latino/Admixed American Latino/Admixed American 2792 18084 0 0.1544
5 East Asian East Asian 1869 12272 0 0.1523
6 European (Finnish) European (Finnish) 1665 12042 0 0.1383
7 Ashkenazi Jewish Ashkenazi Jewish 592 4344 0 0.1363
8 Other Other 509 3888 0 0.1309
9 Female Female 11460 66844 0 0.1714
10 Male Male 14203 77244 0 0.1839
11 Total Total 25663 144088 0 0.1781
The data value you want is in df.table$`Allele Frequency`[1].
There are more things you can do, like clicking elements on the webpage or submitting forms to update the data on the page. I suggest having a look at the two links I listed above. I am still learning about it too.
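As a taste of that, a minimal sketch of interacting with a page through the remDr client (run it while the session is still open, i.e. before rD$client$close(); the CSS selectors below are made-up placeholders, not selectors from the gnomAD page):
# hypothetical selectors -- replace them with real ones from the page you scrape
btn <- remDr$findElement(using = "css selector", "#some-button")
btn$clickElement()                                          # simulate a click
box <- remDr$findElement(using = "css selector", "input[name='q']")
box$sendKeysToElement(list("search text", key = "enter"))   # type text and press Enter
# after the page updates, re-read the source and parse it with rvest again
webpage <- remDr$getPageSource()[[1]] %>% read_html()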
I am trying to scrape some Indeed job postings for personal use (code below); however, I currently have to go to the last page to find out its "index" or page number before I can iterate from the first page to the last.
I wanted to make this automatic, where I only provide the URL and the function takes care of the rest. Could anyone help me out? Also, since I will be scraping a couple hundred pages, I fear that I will get kicked out, so I want to make sure I keep as much data as possible; that is why I write to a CSV file on each iteration, as in the example below. Is there a better way to do that too?
Indeed didn't give me an API key, so this is the only method I know. Here is the code:
## sequencing the pages based on the results (here I just did pages 1 to 5)
page_results <- seq(from = 10, to = 50, by = 10)
first_page_url <- "https://www.indeed.com/jobs?q=data+analyst&l=United+States"
for(i in seq_along(page_results)) {
Sys.sleep(1)
url <- paste0(first_page_url, "&start=", page_results[i]) # later pages append &start=10, &start=20, and so on
page <- xml2::read_html(url)
####
#bunch of scraping from each page, method for that is implemented already
#....
####
print(i) # prints the page index, so it will print 1 to 5
# I also want to write the CSV line by line, so if some error happens I at least keep everything scraped before the error
# is there anything more efficient than this?
write.table(as.data.frame(i), "i.csv", sep = ",", col.names = !file.exists("i.csv"), append = T)
}
I took this advice and wanted to close this question to reduce the number of open questions, so I answered my own question. Thank you, SO community, for always helping out.
"I think the manual approach where you decide to give the page start and page end makes more sense, and "scraping friendly" because you can control how much pages you want to get (plus respects the company servers). You know after a while you see same job descriptions. So stick with current approach in my opinion. About writing the .csv file per iteration, I think that's fine. Someone better than me should definitely say something. Because I don't have enough knowledge in R yet." – UltaPitt
So, here's the current situation:
1) I have 2000+ lines of R code that produces a couple dozen text files. This code runs in under 10 seconds.
2) I then manually paste each of these text files into a website, wait ~1 minute for the website's response (they're big text files), then manually copy and paste the response into Excel, and finally save them as text files again. This takes hours and is prone to user error.
3) Another ~600 lines of R code then combines these dozens of text files into a single analysis. This takes a couple of minutes.
I'd like to automate step 2, and I think I'm close; I just can't quite get it to work. Here's some sample code:
library(xml2)
library(rvest)
textString <- "C2-Boulder1 37.79927 -119.21545 3408.2 std 3.5 2.78 0.98934 0.0001 2012 ; C2-Boulder1 Be-10 quartz 581428 7934 07KNSTD ;"
url <- "http://hess.ess.washington.edu/math/v3/v3_age_in.html"
balcoForm <- html_form(read_html(url))[[1]]
set_values(balcoForm, summary = "no", text_block = textString)
balcoResults <- submit_form(html_session(url), balcoForm, submit = "text_block")
balcoResults
The code runs, and every time I've run it, balcoResults comes back with "Status: 200". Success! Except the file size is 0...
I don't know where the problem is, but my best guess is that the text block isn't getting filled out before the form is submitted. If I go to the website (http://hess.ess.washington.edu/math/v3/v3_age_in.html) and manually submit an empty form, it produces a blank webpage: pure white, nothing on it.
The problem with this potential explanation (and me fixing the code) is that I don't know why the text block wouldn't be filled out. The results of set_values tells me that "text_block" has 120 characters in it. This is the correct length for textString. I don't know why these 120 characters wouldn't be pasted into the web form.
An alternative possibility is that R isn't waiting long enough to get a response from the website, but this seems less likely because a single sample (as here) runs quickly and the status code of the response is 200.
Yesterday I took the DataCamp course on "Working with Web Data in R." I've explored GET and POST from the httr package, but I don't know how to pick apart the GET response to modify the form and then have POST submit it. I've considered trying the package RSelenium, but according to what I've read, I'd have to download and install a "Selenium Server". This intimidates me, but I could probably do it -- if I was convinced that RSelenium would solve my problem. When I look on CRAN at the function names in the RSelenium package, it's not clear which ones would help me. Without firm knowledge for how RSelenium would solve my problem, or even if it would, this seems like a poor return on the time investment required. (But if you guys told me it was the way to go, and which functions to use, I'd be happy to do it.)
I've explored SO for fixes, but none of the posts that I've found have helped. I've looked here, here, and here, to list three.
Any suggestions?
After two days of thinking, I spotted the problem: I didn't assign the result of the set_values function to a variable (if that's the right R terminology).
Here's the corrected code:
library(xml2)
library(rvest)
textString <- "C2-Boulder1 37.79927 -119.21545 3408.2 std 3.5 2.78 0.98934 0.0001 2012 ; C2-Boulder1 Be-10 quartz 581428 7934 07KNSTD ;"
url <- "http://hess.ess.washington.edu/math/v3/v3_age_in.html"
balcoForm <- html_form(read_html(url))[[1]]
balcoForm <- set_values(balcoForm, summary = "no", text_block = textString)
balcoResults <- submit_form(html_session(url), balcoForm, submit = "text_block")
balcoResults
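Since the original workflow pasted the website's response into Excel, a natural next step is to save what the server sends back. A minimal sketch, assuming the session object returned by submit_form exposes the underlying httr response in its $response field (which is how the rvest versions that provide submit_form/html_session behave):
library(httr)   # for content()

# pull the body of the server's reply out of the submitted session, as text
resultText <- content(balcoResults$response, as = "text", encoding = "UTF-8")

# save it so the downstream ~600 lines of R code can pick it up
writeLines(resultText, "balco_output.txt")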
I'm trying in R to get the list of tickers from every exchange covered by Quandl.
There are 2 ways:
1) For every exchange they provide a zipped CSV with all tickers. The URL looks like this (XXXXXXXXXXXXXXXXXXXX is the API key, YYY is the exchange code):
https://www.quandl.com/api/v3/databases/YYY/codes?api_key=XXXXXXXXXXXXXXXXXXXX
This looks pretty promising, but I was not able to read the file with read.table or, e.g., fread, and I don't know why. Is it because of the API key? read.table is supposed to read zip files with no problem.
2) I was able to get further with the second way. They provide a URL to a CSV of tickers, e.g.:
https://www.quandl.com/api/v3/datasets.csv?database_code=YYY&per_page=100&sort_by=id&page=1&api_key=XXXXXXXXXXXXXXXXXXXX
As you can see, the URL contains a page number. The problem is that they only mention below, in the text, that you need to request this URL many times (e.g. 56 times for the LSE) in order to get the full list. I was able to do it like this:
library(data.table)  # fread

pages  <- 1:100                     # "100" is taken just to be big enough
Source <- c("LSE","FSE", ...)       # vector of exchange codes
QUANDL_API_KEY <- "XXXXXXXXXXXXXXXXXXXXXXXXXX"

# note: sprintf() recycles Source against pages element by element here
TICKERS <- lapply(
  sprintf("https://www.quandl.com/api/v3/datasets.csv?database_code=%s&per_page=100&sort_by=id&page=%s&api_key=%s",
          Source, pages, QUANDL_API_KEY),
  FUN = fread,
  stringsAsFactors = FALSE)
TICKERS <- do.call(rbind, TICKERS)
The problem is that I just put 100 pages, but when R tries to fetch a non-existing page (e.g. #57) it throws an error and does not go any further. I was trying to do something like IFERROR, but failed.
Could you please give me some hints?
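For reference, the closest R analogue to an IFERROR guard is tryCatch. Here is a minimal sketch of using it, paging each exchange until a request errors out or comes back empty instead of hard-coding 100 pages (only a sketch built on the URL pattern from the question):
library(data.table)   # fread, rbindlist

get_exchange_tickers <- function(code, api_key) {
  out  <- list()
  page <- 1
  repeat {
    u <- sprintf("https://www.quandl.com/api/v3/datasets.csv?database_code=%s&per_page=100&sort_by=id&page=%s&api_key=%s",
                 code, page, api_key)
    dt <- tryCatch(fread(u), error = function(e) NULL)   # NULL when the page doesn't exist
    if (is.null(dt) || nrow(dt) == 0) break              # ran past the last page
    out[[page]] <- dt
    page <- page + 1
  }
  rbindlist(out)
}

# TICKERS <- rbindlist(lapply(Source, get_exchange_tickers, api_key = QUANDL_API_KEY))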
I am fairly new to R and am having trouble pulling data from the Forbes website.
My current code is:
library(XML)  # readHTMLTable

url <- "http://www.forbes.com/global2000/list/#page:1_sort:0_direction:asc_search:_filter:All%20industries_filter:All%20countries_filter:All%20states"
data <- readHTMLTable(url)
However, when I change the page number in the URL from 1 to 2 (or to any other number), the data that is pulled is the same data as from page 1. For some reason R does not pull the data from the correct page, yet if you manually paste the link with a specific page number into the browser, it works fine.
Does anyone have an idea as to why this is happening?
Thanks!
This appears to be an issue caused by the URL fragment, which is the part after the pound sign. A fragment essentially creates an anchor on a page and directs your browser to jump to that particular location; it is handled entirely by the browser and is never sent to the server with the request.
You might be having this trouble because readHTMLTable() is probably not built to handle URL fragments. See if you can find a version of the same table that does not require # in the URL.
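As a quick illustration of what the fragment is, httr::parse_url separates it out (the URL below is the page-2 version from the question):
library(httr)

u <- parse_url("http://www.forbes.com/global2000/list/#page:2_sort:0_direction:asc_search:_filter:All%20industries_filter:All%20countries_filter:All%20states")
u$fragment   # the "page:2_..." part -- it stays on the client side and is never sent to the server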
Here are some helpful links that might shed light on what you are experiencing:
What is it when a link has a pound "#" sign in it
https://support.microsoft.com/kb/202261/en-us
If I come across anything else that's helpful, I'll share it in follow-up comments.
What you might need to do is use the URLencode() function in R.
kdb.url <- "http://m1:5000/q.csv?select from data0 where folio0 = `KF"
kdb.url <- URLencode(kdb.url)
df <- read.csv(kdb.url, header=TRUE)
You might have meta-characters in your URL too (mine are the spaces and the backtick).
>kdb.url
[1] "http://m1:5000/q.csv?select%20from%20data0%20where%20folio0%20=%20%60KF"
They think of everything, those R guys.