R based Web Scraper for Cabela's using rvest

Maybe slightly out of the ordinary, but I want to track down a particular rifle that I am interested in purchasing. I am familiar with R, so I started down that path, but I'm guessing there are better options.
What I want to do is check a web page hourly to see if the availability has changed. If it has, I get a text message.
I started out using rvest and twilio. The problem is that I can't figure out how to get all the way down to the data that I need. The page has an "Add to cart" button that is hidden with the CSS style display:none when the item isn't available.
I've tried various ways of getting down to that particular div using id names, CSS classes, XPath, etc., but keep coming up with nothing.
Any ideas? Is it the formatting of the div name, or do I have to manually dig through each nested div?
EDIT: I was able to find an XPath that works, but as pointed out below, you can't see the style.
EDIT 2: In the out-of-stock div, the text "In Select Stores Only" is displayed, but I can't figure out how to isolate it.
#Schedule script to run every hour
library(rvest)
library(twilio)
#vars for sms
Sys.setenv(TWILIO_SID = "xxxxxxxxxxx")
Sys.setenv(TWILIO_TOKEN = "xxxxxxxxxxx")
#example, two url's - one with in stock item, one without
OutStockURL <- read_html("https://www.cabelas.com/shop/en/marlin-1895sbl-lever-action-rifle?searchTerm=1895%20sbl")
InStockURL <- read_html("https://www.cabelas.com/shop/en/thompson-center-venture-ii-bolt-action-centerfire-rifle")
#div id that contains information on if product is in stock or not
instockdivid <- "WC_Sku_List_TableContent_3074457345620110138_Price & Availability_1_16_grid"
outstockdivid <- "WC_Sku_List_TableContent_24936_Price & Availability_1_15_grid"
#inside the div is a button that is either displayed or not based on availability
instockbutton <- 'id="SKU_List_Widget_Add2CartButton_3074457345620110857_table"'
outstockbutton <- 'id="SKU_List_Widget_Add2CartButton_3074457345617539137_table"'
#if item is unavailable, button style is set to display:none - style="display: none;"
test <- InStockURL %>%
  html_nodes("div")
#xpath to buttons
test <- InStockURL %>%
  html_nodes(xpath = '//*[@id="SKU_List_Widget_Add2CartButton_3074457345620110857_table"]')
test2 <- OutStockURL %>%
  html_nodes(xpath = '//*[@id="SKU_List_Widget_Add2CartButton_3074457345617539137_table"]')
#not sure where to go from here to see if the button is visible or not
#if button is displayed, send email
tw_send_message(
  to = "+15555555555",
  from = "+5555555555",
  body = paste("Your Item Is Available!")
)
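For what it's worth, here is a rough sketch of one way to finish the check, assuming the raw HTML returned by read_html actually carries the inline style="display: none;" and the "In Select Stores Only" text (what the server sends may differ from what the browser renders). The helper names below are made up for illustration:
#Hypothetical helper: TRUE if the button exists and is not styled display:none
button_visible <- function(page, button_id) {
  btn <- html_nodes(page, xpath = paste0('//*[@id="', button_id, '"]'))
  if (length(btn) == 0) return(FALSE)
  style <- html_attr(btn[1], "style")
  is.na(style) || !grepl("display:\\s*none", style)
}
#Alternative check: look for the out-of-stock text mentioned in EDIT 2
out_of_stock <- function(page) {
  grepl("In Select Stores Only", html_text(page), fixed = TRUE)
}
if (button_visible(InStockURL, "SKU_List_Widget_Add2CartButton_3074457345620110857_table") &&
    !out_of_stock(InStockURL)) {
  tw_send_message(
    to = "+15555555555",
    from = "+5555555555",
    body = "Your Item Is Available!"
  )
}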

Related

Issue scraping a collapsible table using rvest

I am trying to scrape information from multiple collapsible tables from a website called APIS.
An example of what I'm trying to collect is here http://www.apis.ac.uk/select-feature?site=1001814&SiteType=SSSI&submit=Next
Ideally I'd like to have the drop-down heading followed by the information underneath, though when using rvest I can't seem to get it to select the correct section of the HTML.
I'm reasonably new to R; this is what I have from watching some videos about scraping:
library(rvest)
link = "http://www.apis.ac.uk/select-feature?site=1001814&SiteType=SSSI&submit=Next"
page = read_html(link)
name = page %>% html_nodes(".tab-tables :nth-child(1)") %>% html_text()
The "name" value displays "character (empty)".
It may be because I'm new to this and there's a really obvious answer, but any help would be appreciated.
The data for each tab comes from additional requests you can find in the browser network tab when pressing F5 to refresh the page. For example, the nutrients info comes from:
http://www.apis.ac.uk/sites/default/files/AJAX/srcl_2019/apis_tab_nnut.php?ajax=true&site=1001814&BH=&populateBH=true
Which you can think of more generally as:
scheme='http'
netloc='www.apis.ac.uk'
path='/sites/default/files/AJAX/srcl_2019/apis_tab_nnut.php'
params=''
query='ajax=true&site=1001814&BH=&populateBH=true'
fragment=''
So, you would make your request to those urls you see in the network tab.
If you want to dynamically determine these URLs, make a request to the landing page (as you did), then regex the path (see above) of each URL out of the response text. This can be done using the following pattern: url: "(\\/sites\\/default\\/files\\/.*?)".
You then need to add the protocol and domain (scheme and netloc) to the returned matches, based on the landing page's protocol and domain.
There are some additional query string parameters, which come after the ?, that can also be retrieved dynamically when reconstructing the URLs from the response text. You can see these within the page source.
You probably want to extract each of those data param specs for the Ajax requests, e.g. with data:\\s\\((.*?)\\), then have a custom function which turns the matches into the required query string suffix to append to the previously retrieved URLs.
Something like the following:
library(rvest)
library(magrittr)
library(stringr)
get_query_string <- function(match, site_code) {
  string <- paste0(
    "?",
    gsub("siteCode", site_code, gsub('["{}]', "", gsub(",\\s+", "&", gsub(":\\s+", "=", match))))
  )
  return(string)
}
link <- "http://www.apis.ac.uk/select-feature?site=1001814&SiteType=SSSI&submit=Next"
page <- read_html(link) %>% toString()
links <- paste0("http://www.apis.ac.uk", stringr::str_match_all(page, 'url: "(\\/sites\\/default\\/files\\/.*?)"')[[1]][, 2])
params <- stringr::str_match_all(page, "data:\\s\\((.*?)\\),")[[1]][, 2]
site_code <- stringr::str_match_all(page, 'var siteCode = "(.*?)"')[[1]][, 2]
params <- lapply(params, get_query_string, site_code)
urls <- paste0(links, params)
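From there, a minimal sketch of how you might request one of the reconstructed URLs and pull any table out of the response with rvest (assuming the Ajax endpoint returns an HTML fragment containing a table; adjust per tab):
#Request the first reconstructed tab URL and parse any tables in it
tab_page <- read_html(urls[1])
tab_tables <- tab_page %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)
#Inspect the first table, if one was found
if (length(tab_tables) > 0) head(tab_tables[[1]])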

Web scraping with R?

I have a data frame with a column that contains a URL.
test = data.frame(id = 1, url = "https://www.georisques.gouv.fr/risques/installations/donnees/details/0030.12015")
Using this, I would like to retrieve an element in the web page. Specifically, I would like to retrieve the value of the activity state.
https://zupimages.net/viewer.php?id=20/51/t1fx.png
Through my research, I found code that selects the element by its XPath.
library(rvest)
page = read_html("https://www.georisques.gouv.fr/risques/installations/donnees/details/0030.12015")
page %>% html_nodes(xpath = '//*[@id="detailAttributFiche"]/div/p') %>% html_text() %>% as.character()
character(0)
As you can see, I always get "character(0)" back, as if it couldn't read the whole page. I suspect some JavaScript part is not loading properly...
What can I do?
Thank you.
The data is from this link (the etatActiviteInst parameter): https://www.georisques.gouv.fr/webappReport/ws/installations/etablissement/0030-12015
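A rough sketch of reading that endpoint directly (this assumes the endpoint returns JSON with a top-level etatActiviteInst field, as indicated above; the exact nesting may differ, and jsonlite is used here instead of rvest):
library(jsonlite)
#Fetch the JSON the page itself loads, then read the activity state field
etab <- fromJSON("https://www.georisques.gouv.fr/webappReport/ws/installations/etablissement/0030-12015")
etab$etatActiviteInst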

rvest - Extract data from OMIM

EDIT: After some research and help from others, I figured out that what I was trying to do is not ethical. I requested OMIM API access from the OMIM website and advise anyone who needs to do the same kind of thing to do likewise.
I am quite inexperienced in HTML.
Using keywords like 'ciliary' and 'primary', I want to go into OMIM, open the first 5 links listed, save the text within those links, and scrape data based on keywords like 'homozygous', 'heterozygous', etc.
What I have done so far:
library(rvest)
rvestedOMIM <- function() {
  clinicKeyWord1 <- c('primary', 'ciliary')
  OMIM1 <- paste0("https://www.omim.org/search/?index=entry&start=1&limit=10&sort=score+desc%2C+prefix_sort+desc&search=", clinicKeyWord1[1], "+", clinicKeyWord1[2])
  webpage <- read_html(OMIM1)
  rank_data_html <- html_nodes(webpage, '.mim-hint')
  # Go into first 5 links and extract the data based on keywords
  allLinks <- rank_data_html[grep('a href', rank_data_html)]
  allLinks <- allLinks[grep('omim', allLinks)]
}
At the moment, I am stuck at going through the links listed in the first OMIM search (with the 'primary' and 'ciliary' keywords). The allLinks object within the function is intended to extract those links,
e.g.
244400. CILIARY DYSKINESIA, PRIMARY, 1; CILD1
(https://www.omim.org/entry/244400?search=ciliary%20primary&highlight=primary%20ciliary)
608644. CILIARY DYSKINESIA, PRIMARY, 3; CILD3
(https://www.omim.org/entry/608644?search=ciliary%20primary&highlight=primary%20ciliary)
Even if I could only scrape the OMIM IDs in the links (244400 or 608644), I could navigate to the pages myself; that's a workaround I thought of in case I couldn't scrape the full links.
Thank you for your help

Reading in html with R rvest. How do I check if a CSS selector class contains anything?

This is my first attempt at dealing with HTML and CSS selectors. I am using the R package rvest to scrape the Billboard Top 100 website. Some of the data that I am interested in include this week's rank, the song, whether or not the song is new, and whether or not the song has any awards.
I am able to get the song name and rank with the following:
library(rvest)
URL <- "http://www.billboard.com/charts/hot-100/2017-09-30"
webpage <- read_html(URL)
current_week_rank <- html_nodes(webpage, '.chart-row__current-week')
current_week_rank <- as.numeric(html_text(current_week_rank))
My problem comes with the new and award indicators. The songs are listed in rows, with each of the 100 contained in:
<article class="chart-row chart-row--1 js-chart-row" ...>
</article>
If a song is new, it will have a class within it like:
<div class="chart-row__new-indicator">
If a song has an award, there will be this class within it:
<div class="chart-row__award-indicator">
Is there a way that I can look at all 100 instances of class="chart-row chart-row--1 js-chart-row" ... and see if either of these exists within it? The output that I get from current_week_rank is one column of 100 values. I am hoping that there is a way to get this so that I have one observation for each song.
Thank you for any help or advice.
This basically amounts to a tailored version of the Q&A I indicated above. I can't tell with 100% certainty whether the XPath or condition is working as intended, since there's only one row in your example page with a <div class="chart-row__new-indicator">, and that row also happens to have a <div class="chart-row__award-indicator"> tag as well.
#xpath to focus on the 100 rows of interest
primary_xp = '//div[@class="chart-row__primary"]'
#xpath which subselects the rows you're after
check_xp = paste('div[@class="chart-row__award-indicator" or',
                 '@class="chart-row__new-indicator"]')
webpage %>% html_nodes(xpath = primary_xp) %>%
  #row__primary nodes with no such child nodes
  #  come back NA, and hence so does html_attr('class')
  html_node(xpath = check_xp) %>%
  #! is a bit extraneous, as it only flips FALSE to TRUE
  #  for the rows you're after (necessity depends on
  #  particulars of your application)
  html_attr('class') %>% is.na %>% `!`
FWIW, you may be able to shorten check_xp to the following:
check_xp = 'div[contains(@class, "indicator")]'
This certainly covers both "chart-row__award-indicator" and "chart-row__new-indicator", but it would also pick up other nodes with a class containing "indicator", if such a tag exists (you'll have to determine this for yourself).
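To tie this together, a small sketch (assuming both pipelines return one element per chart row, so the lengths line up with current_week_rank) that combines the rank with the indicator flags in a data frame:
#Class of the matching indicator div per row (NA when neither indicator is present)
indicator_class <- webpage %>%
  html_nodes(xpath = primary_xp) %>%
  html_node(xpath = check_xp) %>%
  html_attr('class')
songs <- data.frame(
  rank  = current_week_rank,
  new   = grepl("new-indicator", indicator_class),    #TRUE if the new-indicator div exists
  award = grepl("award-indicator", indicator_class),  #TRUE if the award-indicator div exists
  stringsAsFactors = FALSE
)
head(songs)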

how to use rvest to scrape same kind of datapoint but labelled with different id

If I want to use rvest to scrape a particular data point (name, address, phone, etc.) that is repeated in different sections of this page, where the sections all start with similar, but not identical, span ids, such as:
docs-internal-guid-049ac94a-f34e-5729-b053-30567fdf050a
docs-internal-guid-765e48e9-f34b-7c88-5d95-042a93fcfda3
what's the best approach? Finding and copying each id is not viable. Thanks.
Edit:
You can use the following script to retrieve all of the starred restaurants:
library("rvest")
url_base <- "http://www.straitstimes.com/lifestyle/food/full-list-of-michelin-starred-restaurants-for-2017"
data <- read_html(url_base) %>%
  html_nodes("h3") %>%
  html_text()
This also gives you the headers ("ONE MICHELIN STAR", "TWO MICHELIN STARS", "THREE MICHELIN STARS"), but this might even be helpful.
Background to the script:
Fortunately, all of the relevant information, and only the relevant information, is within the h3 selector. The script gives you a character vector as output. Of course, you can elaborate further on this with e.g. %>% as.data.frame(), or however else you want to store / process the data.
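For instance, here is a sketch of one way to turn that character vector into a data frame mapping each restaurant to its star heading (the heading strings are the ones quoted above; whether they match the live page exactly is an assumption to verify):
headings <- c("ONE MICHELIN STAR", "TWO MICHELIN STARS", "THREE MICHELIN STARS")
is_heading <- trimws(data) %in% headings
group <- cumsum(is_heading)           #which heading block each line falls under
keep  <- !is_heading & group > 0      #drop the headings and anything before the first one
restaurants <- data.frame(
  stars = trimws(data)[is_heading][group[keep]],  #heading text for each restaurant
  name  = trimws(data)[keep],
  stringsAsFactors = FALSE
)
head(restaurants)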
------------------- old answer -------------------
Could you maybe provide the URL of that particular page? To me it sounds like you have to find the right CSS selector (nth-child(x)) that you can use in a loop.
