Web scraping with R?

I have a data frame with a column containing a URL:
test = data.frame(id = 1, url = "https://www.georisques.gouv.fr/risques/installations/donnees/details/0030.12015")
Using this, I would like to retrieve an element from the web page. Specifically, I would like to retrieve the value of the activity state.
(Screenshot of the field: https://zupimages.net/viewer.php?id=20/51/t1fx.png)
From my research, I was able to find code that selects the element by its XPath:
library(rvest)
page = read_html("https://www.georisques.gouv.fr/risques/installations/donnees/details/0030.12015")
page %>% html_nodes(xpath = '//*[@id="detailAttributFiche"]/div/p') %>% html_text() %>% as.character()
character(0)
As you can see, I always get character(0) back, as if it couldn't read the whole page. I suspect some JavaScript part is not loading properly...
How can I do this?
Thank you.

The data comes from this endpoint (see the etatActiviteInst field): https://www.georisques.gouv.fr/webappReport/ws/installations/etablissement/0030-12015
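If it helps, here is a minimal sketch of querying that endpoint directly with httr and jsonlite (the exact shape of the JSON is an assumption; inspect the parsed list with str() if the field sits deeper):
library(httr)
library(jsonlite)

# The JSON endpoint behind the page (note the dash in the id instead of the dot)
resp <- GET("https://www.georisques.gouv.fr/webappReport/ws/installations/etablissement/0030-12015")
data <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))

# The activity state should be exposed as the etatActiviteInst field
data$etatActiviteInst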

Related

How to replace comma with a dot in GTM for JSON structured data?

I am a noob with structured data implementation and don't have any coding knowledge.
I have spent a week looking for how to fix a warning about price in Google's Structured Data Testing Tool.
My prices are written with a comma, which Google does not accept.
By checking the http://schema.org/price it tells me that "Use '.' (Unicode 'FULL STOP' (U+002E)) rather than ',' to indicate a decimal point. Avoid using these symbols as a readability separator."
I have a CSS variable element #PdtPrixRef captured in a variable "Product-price" with a comma ("12,5"), but I can't find how to replace it in my structured data with the value "12.5"... Can someone help me?
Here is my current script:
[Screenshot: my current GTM script]
Should I add something to my script, or create a Custom JavaScript variable?
I think it's something like
value.replace(",", ".")
but I don't know how to write the full, proper function from beginning to end...
Yes, you can just create a Custom JavaScript variable.
Here is the code:
function() {
  var price = {{Product-price}};
  return price.replace(",", ".");
}
Then use this variable in your JSON-LD script.
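One caveat: if {{Product-price}} ever resolves to a number rather than a string, .replace won't exist on it, so coercing to a string first is a safer variant:
function() {
  // Coerce to a string first in case the variable resolves to a number
  var price = String({{Product-price}});
  return price.replace(",", ".");
}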

Issue scraping a collapsible table using rvest

I am trying to scrape information from multiple collapsible tables on a website called APIS.
An example of what I'm trying to collect is here: http://www.apis.ac.uk/select-feature?site=1001814&SiteType=SSSI&submit=Next
Ideally I'd like to have each drop-down heading followed by the information underneath it, but when using rvest I can't seem to get it to select the correct section from the HTML.
I'm reasonably new to R; this is what I have from watching some videos about scraping:
link = "http://www.apis.ac.uk/select-feature?site=1001814&SiteType=SSSI&submit=Next"
page = read_html(link)
name = page %>% html_nodes(".tab-tables :nth-child(1)") %>% html_text()
the "name" value displays "Character (empty)"
It may be because I'm new to this and there's a really obvious answer but any help would be appreciated
The data for each tab comes from additional requests you can find in the browser network tab when pressing F5 to refresh the page. For example, the nutrients info comes from:
http://www.apis.ac.uk/sites/default/files/AJAX/srcl_2019/apis_tab_nnut.php?ajax=true&site=1001814&BH=&populateBH=true
Which you can think of more generally as:
scheme='http'
netloc='www.apis.ac.uk'
path='/sites/default/files/AJAX/srcl_2019/apis_tab_nnut.php'
params=''
query='ajax=true&site=1001814&BH=&populateBH=true'
fragment=''
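In R, those pieces can be reassembled with httr::modify_url, for example (a sketch using the query values shown above):
library(httr)

tab_url <- modify_url(
  "http://www.apis.ac.uk",
  path = "/sites/default/files/AJAX/srcl_2019/apis_tab_nnut.php",
  query = list(ajax = "true", site = "1001814", BH = "", populateBH = "true")
)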
So, you would make your request to those urls you see in the network tab.
If you want to dynamically determine these urls, then make a request, as you did, to the landing page, then regex out from the response text the path (see above) of the urls. This can be done using the following pattern url: "(\\/sites\\/default\\/files\\/.*?)".
You then need to add the protocol + domain (scheme and netloc) to the returned matches based on landing page protocol and domain.
There are some additional query string parameters, which come after the ?, and these can also be retrieved dynamically if you reconstruct the urls from the response text. You can see them within the page source.
You probably want to extract each of those data param specs for the Ajax requests e.g. with data:\\s\\((.*?)\\), then have a custom function which turns the matches into the required query string suffix to add to the previously retrieved urls.
Something like the following:
library(rvest)
library(magrittr)
library(stringr)

# Build the "?key=value&..." query string suffix from a scraped "data: {...}"
# spec, substituting the real site code for the siteCode placeholder
get_query_string <- function(match, site_code) {
  string <- paste0(
    "?",
    gsub("siteCode", site_code,
         gsub('["{}]', "", gsub(",\\s+", "&", gsub(":\\s+", "=", match))))
  )
  return(string)
}

link <- "http://www.apis.ac.uk/select-feature?site=1001814&SiteType=SSSI&submit=Next"
page <- read_html(link) %>% toString()

# Regex the Ajax endpoint paths, the data param specs, and the site code
# out of the landing page source
links <- paste0("http://www.apis.ac.uk",
                stringr::str_match_all(page, 'url: "(\\/sites\\/default\\/files\\/.*?)"')[[1]][, 2])
params <- stringr::str_match_all(page, "data:\\s\\((.*?)\\),")[[1]][, 2]
site_code <- stringr::str_match_all(page, 'var siteCode = "(.*?)"')[[1]][, 2]

# Reassemble the full request urls
params <- lapply(params, get_query_string, site_code)
urls <- paste0(links, params)
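From there, a possible next step is to request each reconstructed url and parse whatever tables come back. This assumes the endpoints return HTML fragments containing table markup, which is worth confirming in the network tab:
# Fetch the first endpoint and pull its table(s) into data frames
tabs <- read_html(urls[1]) %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)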

R based Web Scraper for Cabela's using rvest

Maybe slightly out of the ordinary, but I want to track down a particular rifle that I am interested in purchasing. I am familiar with R, so I started down that path, but I'm guessing there are better options.
What I want to do is check a web page hourly to see if the availability has changed. If it has, I get a text message.
I started out using rvest and twilio. The problem is that I can't figure out how to get all the way down to the data I need. The page has an "Add to cart" button that is hidden with the CSS style display: none when the item isn't available.
I've tried various ways of getting down to that particular div using id names, CSS classes, XPath, etc., but keep coming up with nothing.
Any ideas? Is it the formatting of the div name? Or do I have to manually dig through each nested div?
EDIT: I was able to find the right XPath. But as pointed out below, you can't see the style.
EDIT 2: in the out-of-stock div, the text "In Select Stores Only" is displayed, but I can't figure out how to isolate it.
#Schedule script to run every hour
library(rvest)
library(twilio)
#vars for sms
Sys.setenv(TWILIO_SID = "xxxxxxxxxxx")
Sys.setenv(TWILIO_TOKEN = "xxxxxxxxxxx")
#example, two url's - one with in stock item, one without
OutStockURL <- read_html("https://www.cabelas.com/shop/en/marlin-1895sbl-lever-action-rifle?searchTerm=1895%20sbl")
InStockURL <- read_html("https://www.cabelas.com/shop/en/thompson-center-venture-ii-bolt-action-centerfire-rifle")
#div id that contains information on if product is in stock or not
instockdivid <- "WC_Sku_List_TableContent_3074457345620110138_Price & Availability_1_16_grid"
outstockdivid <- "WC_Sku_List_TableContent_24936_Price & Availability_1_15_grid"
#inside the div is a button that is either displayed or not based on availability
instockbutton <- 'id="SKU_List_Widget_Add2CartButton_3074457345620110857_table"'
outstockbutton <- 'id="SKU_List_Widget_Add2CartButton_3074457345617539137_table"'
#if item is unavailable, button style is set to display:none - style="display: none;"
test <- InStockURL %>%
  html_nodes("div")
#xpath to buttons (note @id, not #id)
test <- InStockURL %>%
  html_nodes(xpath = '//*[@id="SKU_List_Widget_Add2CartButton_3074457345620110857_table"]')
test2 <- OutStockURL %>%
  html_nodes(xpath = '//*[@id="SKU_List_Widget_Add2CartButton_3074457345617539137_table"]')
#not sure where to go from here to see if the button is visible or not
#if the button is displayed, send a text message
tw_send_message(
  to = "+15555555555",
  from = "+5555555555",
  body = paste("Your Item Is Available!")
)
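For EDIT 2, one possible (untested) approach is to sidestep the style question entirely and test for the "In Select Stores Only" text in the static HTML, treating its presence as out of stock:
# Hypothetical check: look for the out-of-stock text anywhere in the page
page_text <- OutStockURL %>% html_nodes("div") %>% html_text()
out_of_stock <- any(grepl("In Select Stores Only", page_text, fixed = TRUE))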

xpath returning empty text when web-scraping in r

I'm trying to scrape information from https://www.kff.org/interactive/subsidy-calculator. For instance, set state=California, zip=90001, income=20000, no employer coverage, 1 person (1 adult, no children), age 21, no tobacco.
We get the following:
https://www.kff.org/interactive/subsidy-calculator/#state=ca&zip=94704&income-type=dollars&income=20000&employer-coverage=0&people=1&alternate-plan-family=individual&adult-count=1&adults%5B0%5D%5Bage%5D=21&adults%5B0%5D%5Btobacco%5D=0&child-count=0
I would like to get the numbers for "estimated financial help" and "your cost for a silver plan" (they are bolded in blue in the grey "Results" box; for some reason I can't upload the screenshot). When I use the XPath for those numbers, I get back an empty string. This is not the case when I retrieve some other text outside the grey box. I wonder what could be wrong. I have attached my code below. Please forgive me if this is a stupid question; I'm very new to web scraping. Thank you!
state = tolower('CA')
zip = 94704
income = 20000
people = 1
adult = 1
children = 0
url = paste0("https://www.kff.org/interactive/subsidy-calculator/#state=", state, "&zip=", zip, "&income-type=dollars&income=", income, "&employer-coverage=0&people=", people, "&alternate-plan-family=individual&adult-count=", adult, "&adults%5B0%5D%5Bage%5D=21&adults%5B0%5D%5Btobacco%5D=0&child-count=", children)
# This returns an empty string
r = read_html(url) %>%
  html_nodes(xpath = '//*[@id="subsidy-calculator-new"]/div[5]/div/div/dl/dd[1]/span') %>%
  html_text()
# This returns "Number of children (20 and younger) enrolling in Marketplace coverage", a line that's not in the grey box.
r = read_html(url) %>%
  html_nodes(xpath = '//*[@id="subsidy-form"]/div[2]/div[3]/div[3]/p') %>%
  html_text()
The values are generated by scripts that run on the page. Your current method doesn't allow for this, hence your result. You are likely better off using a method that allows scripts to run, such as RSelenium.
The form you complete, #subsidy-form, feeds values into a template in a script tag, #results-template. The associated calculations are covered in this script: https://www.kff.org/wp-content/themes/kaiser-foundation-2016/interactives/subsidy-calculator/2019/calculator.js?ver=1.7.7 where you will find the logic and the preset values, such as the poverty lines per year.
The simplest quick view is probably to inspect the JavaScript variables when the new SubsidyCalculator object is created to process the form, i.e. the js starting with var sc = new SubsidyCalculator. You could 'reverse engineer' those variables with your values, plus the values returned from the JSON below, which I think (but haven't confirmed) feed the six variables beginning with kff_sc, according to zipcode, into the calculator, e.g. silver: kff_sc.silver. You get an idea of the ballpark figures from the default values given at the top of the script.
Figures in relation to zipcode are retrieved from this: https://www.kff.org/wp-content/themes/kaiser-foundation-2016/interactives/subsidy-calculator/2019/json/zips/94.json where the last two numbers before .json are the first two digits of the zipcode. You can determine this from the input-validation script: https://www.kff.org/wp-content/themes/kaiser-foundation-2016/interactives/subsidy-calculator/2019/shared.js?ver=1.7.7
var bucket = $( this ).val().substring( 0, 2 );
if ( kff_sc.buckets[bucket] ) return;
$.ajax( '/wp-content/themes/vip/kaiser-foundation-2016/interactives/subsidy-calculator/2019/json/zips/' + bucket + '.json',
The first two digits determine the bucket.
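With the zip 94704, for example, the bucket is "94", and the lookup can be reproduced in R (a sketch; the field names inside the JSON are not confirmed here):
library(jsonlite)

bucket <- substr("94704", 1, 2)
zips <- fromJSON(paste0("https://www.kff.org/wp-content/themes/kaiser-foundation-2016/interactives/subsidy-calculator/2019/json/zips/", bucket, ".json"))
str(zips)  # inspect for the kff_sc values mentioned above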
All in all, you could likely implement your own calculator, but you would be reinventing the wheel. It seems easier to just automate the browser and then extract the resulting values.
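For completeness, a minimal RSelenium sketch, assuming a local driver is available; the CSS selector is an assumption to adjust after inspecting the rendered grey box:
library(RSelenium)

rD <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
remDr <- rD$client
remDr$navigate(url)  # url built as in the question
Sys.sleep(3)         # crude wait for the page scripts to fill in the results

# Selector is a guess; confirm it against the rendered results box
elems <- remDr$findElements(using = "css selector", "dl dd span")
vals <- sapply(elems, function(e) e$getElementText()[[1]])

remDr$close()
rD$server$stop()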

How can I read multiple web addresses with a "%" sign in the address that blocks dynamic iteration with sprintf?

The code I use to scrape a multi-page website relies on sprintf, which iterates by substituting the URL's dynamic part %d with the page number. But the website I scrape recently added query parameters containing literal "%" signs, so sprintf now throws an error in the mapping function I use:
library(rvest)
library(purrr)

url_base <- "https://www.xxxxxx.com/girne?s-r=S&property_type=1&property=&min_price=&max_price=&currency=1&min_m2=&max_m2=&title-type%5B0%5D=1&page=%d&sort=mr"
map_df(1:10, function(i) {
  emlak <- read_html(sprintf(url_base, i))
  fiyat <- emlak %>% html_nodes("#properties .price") %>% html_text()
  alan <- emlak %>% html_nodes(".glyphicons-vector-path-square+ .detail-value") %>% html_text()
  ilanno <- emlak %>% html_nodes(".fa-hashtag+ .detail-value") %>% html_text()
  bolge <- emlak %>% html_nodes("#properties figure") %>% html_text()
  data.frame(fiyat, alan, ilanno, bolge, stringsAsFactors = FALSE)
}) -> emlak_table3
Is there any way to define the dynamic iterator other than with "%"? I would like to keep the same procedure to scrape the site and download each page's data.
To insert a literal % with sprintf, use %%, i.e. sprintf('Your rate: %.1f%%', 31.4).
Thus, every place in your string where you need a literal '%', use two; every place where you need to insert a value, use one.
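Applied to the URL in the question, every literal % in the query string is doubled while the page placeholder stays %d:
url_base <- "https://www.xxxxxx.com/girne?s-r=S&property_type=1&property=&min_price=&max_price=&currency=1&min_m2=&max_m2=&title-type%%5B0%%5D=1&page=%d&sort=mr"
sprintf(url_base, 1)
# "https://www.xxxxxx.com/girne?...&title-type%5B0%5D=1&page=1&sort=mr"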
