Issue scraping a collapsible table using rvest - r

I am trying to scrape information from multiple collapsible tables from a website called APIS.
An example of what I'm trying to collect is here http://www.apis.ac.uk/select-feature?site=1001814&SiteType=SSSI&submit=Next
Ideally I'd like to have each drop-down heading followed by the information underneath it, but when using rvest I can't seem to get it to select the correct section of the HTML.
I'm reasonably new to R; this is what I have from watching some videos about scraping:
link = "http://www.apis.ac.uk/select-feature?site=1001814&SiteType=SSSI&submit=Next"
page = read_html(link)
name = page %>% html_nodes(".tab-tables :nth-child(1)") %>% html_text()
the "name" value displays "Character (empty)"
It may be because I'm new to this and there's a really obvious answer but any help would be appreciated

The data for each tab comes from additional requests, which you can find in the browser's network tab when pressing F5 to refresh the page. For example, the nutrients info comes from:
http://www.apis.ac.uk/sites/default/files/AJAX/srcl_2019/apis_tab_nnut.php?ajax=true&site=1001814&BH=&populateBH=true
Which you can think of more generally as:
scheme='http'
netloc='www.apis.ac.uk'
path='/sites/default/files/AJAX/srcl_2019/apis_tab_nnut.php'
params=''
query='ajax=true&site=1001814&BH=&populateBH=true'
fragment=''
So you would make your requests to those URLs you see in the network tab.
If you want to determine these URLs dynamically, make a request to the landing page (as you did), then regex the path portion (see above) of each URL out of the response text. This can be done with the pattern url: "(\\/sites\\/default\\/files\\/.*?)".
You then prepend the protocol and domain (scheme and netloc) of the landing page to the returned matches.
There are some additional query-string parameters (everything after the ?), which can also be retrieved dynamically when reconstructing the URLs from the response text; you can see them in the page source.
You probably want to extract each of those data parameter specs for the AJAX requests, e.g. with data:\\s\\((.*?)\\), then use a custom function to turn the matches into the query-string suffix to append to the previously retrieved URLs.
Something like the following:
library(rvest)
library(magrittr)
library(stringr)
get_query_string <- function(match, site_code) {
  string <- paste0(
    "?",
    gsub("siteCode", site_code, gsub('["{}]', "", gsub(",\\s+", "&", gsub(":\\s+", "=", match))))
  )
  return(string)
}
link <- "http://www.apis.ac.uk/select-feature?site=1001814&SiteType=SSSI&submit=Next"
page <- read_html(link) %>% toString()
# paths of the AJAX endpoints, prefixed with the scheme and domain
links <- paste0("http://www.apis.ac.uk", stringr::str_match_all(page, 'url: "(\\/sites\\/default\\/files\\/.*?)"')[[1]][, 2])
# data parameter specs for each AJAX call, plus the site code they reference
params <- stringr::str_match_all(page, "data:\\s\\((.*?)\\),")[[1]][, 2]
site_code <- stringr::str_match_all(page, 'var siteCode = "(.*?)"')[[1]][, 2]
# turn each spec into a query string and build the full tab URLs
params <- lapply(params, get_query_string, site_code)
urls <- paste0(links, params)
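From there you would request each of those URLs and parse what comes back. A rough sketch (assuming each endpoint returns an HTML fragment containing ordinary table elements, which is worth verifying per tab):
# fetch each tab fragment and pull out any tables it contains
tab_tables <- lapply(urls, function(u) {
  read_html(u) %>%
    html_nodes("table") %>%
    html_table()
})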

Related

Web scraping with R?

I have a data frame with a column that contains a URL.
test = data.frame(id = 1, url = "https://www.georisques.gouv.fr/risques/installations/donnees/details/0030.12015")
Using this, I would like to retrieve an element from the web page. Specifically, I would like to retrieve the value of the activity state.
https://zupimages.net/viewer.php?id=20/51/t1fx.png
Through my research, I was able to find code that selects the element by its XPath.
library(rvest)
page = read_html("https://www.georisques.gouv.fr/risques/installations/donnees/details/0030.12015")
page %>% html_nodes(xpath = '//*[@id="detailAttributFiche"]/div/p') %>% html_text() %>% as.character()
character(0)
As you can see, I always get character(0), as if it couldn't read the whole page. I suspect some JavaScript part is not loading properly...
How can I do this?
Thank you.
The data is from this link (the etatActiviteInst parameter): https://www.georisques.gouv.fr/webappReport/ws/installations/etablissement/0030-12015
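A minimal sketch with jsonlite (the exact location of etatActiviteInst within the returned JSON is an assumption; inspect the response to confirm):
library(jsonlite)
# request the JSON the page itself loads and read the activity-state field
res <- fromJSON("https://www.georisques.gouv.fr/webappReport/ws/installations/etablissement/0030-12015")
res$etatActiviteInst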

R based Web Scraper for Cabela's using rvest

Maybe slightly out of the ordinary, but I want to track down a particular rifle that I am interested in purchasing. I am familiar with R, so I started down that path, but I'm guessing there are better options.
What I want to do is check a web page hourly to see if the availability has changed. If it has, I get a text message.
I started out using rvest and twilio. The problem is that I can't figure out how to get all the way down to the data I need. The page has an "Add to cart" button that is hidden with the CSS style display:none if the item isn't available.
I've tried various ways of getting down to that particular div using id names, CSS classes, XPath, etc., but keep coming up with nothing.
Any ideas? Is it the formatting of the div name, or do I have to manually dig through each nested div?
EDIT: I was able to find the right XPath. But as pointed out below, you can't see the style.
EDIT 2: In the out-of-stock div, the text "In Select Stores Only" is displayed, but I can't figure out how to isolate it.
#Schedule script to run every hour
library(rvest)
library(twilio)
#vars for sms
Sys.setenv(TWILIO_SID = "xxxxxxxxxxx")
Sys.setenv(TWILIO_TOKEN = "xxxxxxxxxxx")
#example, two url's - one with in stock item, one without
OutStockURL <- read_html("https://www.cabelas.com/shop/en/marlin-1895sbl-lever-action-rifle?searchTerm=1895%20sbl")
InStockURL <- read_html("https://www.cabelas.com/shop/en/thompson-center-venture-ii-bolt-action-centerfire-rifle")
#div id that contains information on if product is in stock or not
instockdivid <- "WC_Sku_List_TableContent_3074457345620110138_Price & Availability_1_16_grid"
outstockdivid <- "WC_Sku_List_TableContent_24936_Price & Availability_1_15_grid"
#inside the div is a button that is either displayed or not based on availability
instockbutton <- 'id="SKU_List_Widget_Add2CartButton_3074457345620110857_table"'
outstockbutton <- 'id="SKU_List_Widget_Add2CartButton_3074457345617539137_table"'
#if item is unavailable, button style is set to display:none - style="display: none;"
test <- InStockURL %>%
  html_nodes("div")
#xpath to buttons
test <- InStockURL %>%
  html_nodes(xpath = '//*[@id="SKU_List_Widget_Add2CartButton_3074457345620110857_table"]')
test2 <- OutStockURL %>%
  html_nodes(xpath = '//*[@id="SKU_List_Widget_Add2CartButton_3074457345617539137_table"]')
#not sure where to go from here to see if the button is visible or not
#if button is displayed, send a text message
tw_send_message(
  to = "+15555555555",
  from = "+5555555555",
  body = paste("Your Item Is Available!")
)
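One hedged idea, building on EDIT 2 rather than the hidden style attribute (an untested sketch, not a confirmed solution): search the fetched page text for the out-of-stock phrase.
#untested sketch: treat the item as out of stock if the page text contains the phrase from EDIT 2
page_text <- html_text(OutStockURL)
out_of_stock <- grepl("In Select Stores Only", page_text, fixed = TRUE)
if (!out_of_stock) {
  tw_send_message(
    to = "+15555555555",
    from = "+5555555555",
    body = "Your Item Is Available!"
  )
}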

extract links of subsequent images in div#data-old-hires

With some help, I am able to extract the landing/main image of a URL. However, I would like to be able to extract the subsequent images as well.
require(rvest)
url <-"https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-
Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-
spons&keywords=lunch+bag&psc=1"
webpage <- read_html(url)
r <- webpage %>%
html_nodes("#landingImage") %>%
html_attr("data-a-dynamic-image")
imglink <- strsplit(r, '"')[[1]][2]
print(imglink)
This gives the correct output for the main image. However, I would also like to extract the links of the images shown when I roll over the other thumbnails of the same product. Essentially, I would like the output to contain the following links:
https://images-na.ssl-images-amazon.com/images/I/81bF%2Ba21WLL.UY500.jpg
https://images-na.ssl-images-amazon.com/images/I/81HVwttGJAL.UY500.jpg
https://images-na.ssl-images-amazon.com/images/I/81Z1wxLn-uL.UY500.jpg
https://images-na.ssl-images-amazon.com/images/I/91iKg%2BKqKML.UY500.jpg
https://images-na.ssl-images-amazon.com/images/I/91zhpH7%2B8gL.UY500.jpg
Many thanks
As requested, the Python script is at the bottom. To make this applicable across languages, the answer is in two parts: 1) a high-level pseudo-code description of the steps, which can be carried out with R, Python, and many other languages; 2) a Python example.
An R script to obtain the string is shown at the end (steps 1-3 of the process).
1) Process:
Obtain the HTML via a GET request
Regex out a substring from one of the script tags; it is in fact what the jQuery on the page uses to provide the image links from JSON
The regex pattern is
jQuery\.parseJSON\(\'(.*)\'\);
The explanation is:
Basically, the contained JSON object is gathered starting at the { before "dataInJson" and ending before the characters '). That extracts the JSON object as a string. The first capturing group (.*) gathers everything between the start and end strings (excluding both).
Only the first match is wanted, so it must be extracted from the matches returned. It is then handed to a JSON-parsing library that can take a string and return a JSON object.
That JSON object is then looped over by the colorImages key (in Python the structure is a dictionary; R will be slightly different) to get the product colours, which in turn are used to access the actual URLs themselves.
2) Those steps shown in Python
import requests #library to handle xhr GET
import re #library to handle regex
import json
headers = {'User-Agent' : 'Mozilla/5.0', 'Referer':'https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc='}
r = requests.get('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc=1', headers = headers)
p1 = re.compile(r'jQuery\.parseJSON\(\'(.*)\'\);')
data = p1.findall(r.text)[0]
json_source = json.loads(data)
for colour in json_source['colorImages']:
    for image in json_source['colorImages'][colour]:
        print(image['large'])
Output:
All the links for the product in all colours; large image links only (so the URLs appear slightly different and are more numerous, but they are the same images).
R script to regex out required string and generate JSON:
library(rvest)
library(jsonlite)
library(stringr)
con <- url('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc=1', "rb")
page = read_html(con)
page %>%
  html_nodes(xpath = ".//script[contains(., 'colorImages')]") %>%
  html_text() %>%
  as.character() %>%
  str_match(., "jQuery\\.parseJSON\\(\\'(.*)\\'\\);") -> res
json = fromJSON(res[,2][2])
They've updated the page so now just use:
Python:
import requests #library to handle xhr GET
import re #library to handle regex
headers = {'User-Agent' : 'Mozilla/5.0', 'Referer':'https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc='}
r = requests.get('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc=1', headers = headers)
p1 = re.compile(r'"large":"(.*?)"')
links = p1.findall(r.text)
print(links)
R:
library(rvest)
library(stringr)
con <- url('https://www.amazon.in/Livwell-Multipurpose-MultiColor-Polka-%20Lunch/dp/B07LGTPM3D/ref=sr_1_1_sspa?ie=UTF8&qid=1548701326&sr=8-1-%20spons&keywords=lunch+bag&psc=1', "rb")
page = read_html(con)
res <- page %>%
  html_nodes(xpath = ".//script[contains(., 'var data')]") %>%
  html_text() %>%
  as.character() %>%
  str_match_all(., '"large":"(.*?)"')
print(res[[1]][,2])

Send expression to website return dynamic result (picture)

I use http://www.regexper.com a lot to view pictorial representations of regular expressions. Ideally, I would like a way to:
send a regular expression to the site
open the site with that expression displayed
For example, let's use the regex "\\s*foo[A-Z]\\d{2,3}". I'd go to the site and paste \s*foo[A-Z]\d{2,3} (note the removal of the double slashes), and it returns a diagram of the expression.
I'd like to do this from within R: create a wrapper function like view_regex("\\s*foo[A-Z]\\d{2,3}") and have the page with the visual diagram (http://www.regexper.com/#%5Cs*foo%5BA-Z%5D%5Cd%7B2%2C3%7D) open in the default browser.
I think RCurl may be appropriate, but this is new territory for me. I also see the double slash as a problem, because http://www.regexper.com expects single slashes and R needs double. I can get R to return a single slash to the console using cat, as follows, so this may be the way to approach it.
x <- "\\s*foo[A-Z]\\d{2,3}"
cat(x)
\s*foo[A-Z]\d{2,3}
Try something like this:
Query <- function(searchPattern, browse = TRUE) {
  finalURL <- paste0("http://www.regexper.com/#",
                     URLencode(searchPattern))
  if (isTRUE(browse)) browseURL(finalURL)
  else finalURL
}
x <- "\\s*foo[A-Z]\\d{2,3}"
Query(x) ## Will open in the browser
Query(x, FALSE) ## Will return the URL expected
# [1] "http://www.regexper.com/#%5cs*foo[A-Z]%5cd%7b2,3%7d"
The above function simply pastes together the web URL prefix ("http://www.regexper.com/#") and the encoded form of the search pattern you want to query.
After that, there are two options:
Open the result in the browser
Just return the full encoded URL

How to wait for webpage to load before reading lines in R?

I am using R to scrape some webpages. One of these pages is a redirect to a new page. When I use readLines on this page, like so:
test <- readLines('http://zfin.org/cgi-bin/webdriver?MIval=aa-markerselect.apg&marker_type=GENE&query_results=t&input_name=anxa5b&compare=contains&WINSIZE=25')
I get the still-redirecting page instead of the final page, http://zfin.org/ZDB-GENE-030131-9076. I want to use the redirection page because its URL contains the input_name parameter, which makes it easy to grab pages for different input names.
How can I get the HTML of the final page?
The redirection page: http://zfin.org/cgi-bin/webdriver?MIval=aa-markerselect.apg&marker_type=GENE&query_results=t&input_name=anxa5b&compare=contains&WINSIZE=25
The final page: http://zfin.org/ZDB-GENE-030131-9076
I don't know how to wait until the redirection, but in the source code of the page before the redirection you can see (in a script tag) a JavaScript function replaceLocation, which contains the path to which you are redirected: replaceLocation(\"/ZDB-GENE-030131-9076\").
So I suggest parsing the code and getting this path.
Here is my solution:
library(RCurl)
library(XML)
url <- "http://zfin.org/cgi-bin/webdriver?MIval=aa-markerselect.apg&marker_type=GENE&query_results=t&input_name=anxa5b&compare=contains&WINSIZE=25"
domain <- "http://zfin.org"
doc <- htmlParse(getURL(url, useragent='R'))
scripts <- xpathSApply(doc, "//script", xmlValue)
script <- scripts[which(lapply(lapply(scripts, grep, pattern = "replaceLocation\\([^url]"), length) > 0)]
# > script
# [1] "\n \n\t \n\t replaceLocation(\"/ZDB-GENE-030131-9076\")\n \n \n\t"
new.url <- paste0(domain, gsub('.*\\"(.*)\\".*', '\\1', script))
readLines(new.url)
xpathSApply(doc, "//script", xmlValue) to get all the scripts in the source code.
script <- scripts[which(lapply(lapply(scripts, grep, pattern = "replaceLocation\\([^url]"), length) > 0)] to get the script containing the function with the redirecting path.
("replaceLocation\\([^url]" You need to exclude "url" cause there is two replaceLocationfunctions, one with the object url and the other one with the evaluated object (a string))
And finaly gsub('.*\\"(.*)\\".*', '\\1', script) to get only what you need in the script, the argument of the function, the path.
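For reference, a rough rvest/stringr version of the same idea (a sketch reusing the url and domain objects defined above; it assumes the replaceLocation("...") call still appears in a script tag):
library(rvest)
library(stringr)
doc <- read_html(url)
scripts <- doc %>% html_nodes("script") %>% html_text()
# only the call with a quoted string argument carries the target path
m <- str_match(scripts, 'replaceLocation\\("(.*?)"\\)')[, 2]
new.url <- paste0(domain, m[!is.na(m)][1])
readLines(new.url)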
Hope this helps!
