R - Web scraping item price

I'm trying to write an R script that checks prices on a popular Swiss website.
Following the methodology explained here: https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/ I tried to use rvest for that:
library(rvest)
url <- "https://www.galaxus.ch/fr/s8/product/quiksilver-everyday-stretch-l-shorts-de-bain-10246344"
webpage <- read_html(url)
Unfortunately, I have limited html/css knowledge and the content of webpage is very obscure to me.
I tried inspecting the page with Google Chrome and it looks like the price is located in something named priceEnergyWrapper--2ZNIJ, but I cannot find any trace of that in webpage. I did not have any more luck using SelectorGadget.
Can anybody help me get the price out of webpage?

Since it is dynamically generated, you will need RSelenium.
Your code should be something like:
library(RSelenium)
# rsDriver() starts a Selenium server and already opens the browser session,
# so no separate open() call is needed
driver <- rsDriver(browser = "chrome")
rem_driver <- driver[["client"]]
rem_driver$navigate("https://www.galaxus.ch/fr/s8/product/quiksilver-everyday-stretch-l-shorts-de-bain-10246344")
This asks Selenium to load the page in a real browser, which also runs its JavaScript, so all of the HTML you see when inspecting the rendered page should be available.
Now do:
# note: the locator strategy is 'class name', not 'class'
rem_driver$findElement(using = 'class name', value = 'priceEnergyWrapper--2ZNIJ')
You should now have the element that contains the price, which at the time of checking the website was 25 CHF.
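To actually pull the number out, a minimal sketch of the next step could look like this. The class name is copied from the question (auto-generated suffixes like --2ZNIJ tend to change over time), and the text-cleaning step assumes the price is the only number in the element:
# Keep a handle on the element so its text can be read out
price_elem <- rem_driver$findElement(using = 'class name',
                                     value = 'priceEnergyWrapper--2ZNIJ')
price_text <- price_elem$getElementText()[[1]]   # e.g. "25 CHF" or similar
price_text

# Strip everything except digits and the decimal point (assumes a simple price format)
as.numeric(gsub("[^0-9.]", "", price_text))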
PS: I do not scrape websites for others unless I am sure that the owners of the websites do not object to crawlers/scrapers/bots. Hence, my code only illustrates how to go about it with Selenium; I have not tested it personally. You should, however, get the general idea and the reason for using a tool like Selenium. You should also find out whether you are legally allowed to scrape this website, and do the same for any other sites in the future.
Additional resources to read about RSelenium:
https://ropensci.org/tutorials/rselenium_tutorial/

Related

Scrape google visualization table from a webpage into R

Trying to scrape a Market Valuation table from this webpage:
https://www.starcapital.de/en/research/stock-market-valuation/
The website is dynamic and asks for user location. The table of interest is listed as class "google-visualization-table-table".
I have tried the following R code
library(rvest)

url <- "https://www.starcapital.de/en/research/stock-market-valuation/"
valuation <- url %>%
  read_html() %>%   # html() is deprecated; read_html() is the current rvest function
  html_nodes(xpath = '//*[@id="infotable_div2"]/div/div/table') %>%
  html_table()
valuation <- valuation[[1]]
and I get no error but no results. What is wrong?
This is a problem you will run into pretty often when scraping websites. The problem here is that this webpage is dynamic: it uses JavaScript to create the visualization after the page loads, and, crucially, after rvest has downloaded the page, which is why you don't see the table with your code. I confirmed this by disabling JavaScript in Chrome and seeing that the chart is missing from the page.
That said, you aren't out of luck! I again used Chrome's Developer Tools' Network pane to look through the requests the page was making. Pages like this that create charts dynamically often make a separate network request to grab data before creating the chart. After some scrolling and poking around, I saw one that looks like the dataset you're interested in:
https://www.starcapital.de/fileadmin/charts/Res_Aktienmarktbewertungen_FundamentalKZ_Tbl.php?lang=en
Open that up in your browser and take a look. Let me know if that's the data you were hoping to get. It's in a somewhat custom-looking JSON format so you may end up needing to write a bit of code to get it into R. Check out the jsonlite package for manipulating the JSON and the httr package for getting the data from that URL into R.
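As a rough sketch of that suggestion, and assuming the endpoint above still responds, fetching and inspecting the payload could look roughly like this (the parsing step will almost certainly need adjusting once you see the actual format):
library(httr)
library(jsonlite)

# Data endpoint spotted in Chrome's Network pane; it may change or disappear
data_url <- "https://www.starcapital.de/fileadmin/charts/Res_Aktienmarktbewertungen_FundamentalKZ_Tbl.php?lang=en"

resp <- GET(data_url)
stop_for_status(resp)                                  # fail early on HTTP errors
raw_txt <- content(resp, as = "text", encoding = "UTF-8")

# If the response is valid JSON this yields a list / data frame;
# the format looked custom, so expect to massage it further
parsed <- tryCatch(fromJSON(raw_txt), error = function(e) raw_txt)
str(parsed, max.level = 1)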
Edit: An alternative approach would be to use an R package that can run the dynamic part of the page (that gets the data to make the chart/table) such as splashr. There are a few other R packages out there that can do this but that's one I'm familiar with.

Scrape a page when URL does not change with page number - R

I want to scrape all the URLs from this page:
http://www.domainia.nl/QuarantaineList.aspx
I am able to scrape the first page; however, I cannot move to the next page, because the page number is not in the URL. So how can I change the page while scraping? I've been looking into RSelenium, but could not get it working.
I'm running the next code to get at least the first page:
library(rvest)
library(stringr)

# Constructing the URLs to scrape
baseURL <- "http://www.domainia.nl/quarantaine/"
date <- gsub("-", "/", Sys.Date())
URL <- paste0(baseURL, date)

# Scraping the page
page <- read_html(URL) %>% html_nodes("td") %>% html_text()
links <- str_subset(page, pattern = "^\r\n.*.nl$")
links <- gsub(pattern = "\r\n", "", links) %>% trimws()
I've looked at the site; it's using a Javascript POST to refresh its contents.
Originally an HTTP POST was meant to send information to a server, for example the contents of a form somebody filled in. As such, it often includes information about the page you are coming from, which means you will probably need more information than just "page n".
If you want to get another page, like your browser would show you, you need to send a similar request. The httr package includes a POST function; I think you should take a look at that.
For knowing what to post, I think it's most useful to capture what your browser does and copy that. In Chrome, you can use Inspect and its Network tab to see what is sent and received; I bet other browsers have similar tools.
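Purely as an illustration of that idea, replaying such a request with httr might look like the sketch below. The form field names and values are placeholders for whatever you actually see in the captured request; nothing here has been verified against this particular site:
library(httr)
library(rvest)

# Hypothetical replay of the POST the browser sends when you click "page 2";
# every field must be copied from the request captured in the Network tab
resp <- POST(
  "http://www.domainia.nl/QuarantaineList.aspx",
  body = list(
    "__EVENTTARGET"   = "ctl00$SomePager",            # placeholder name
    "__EVENTARGUMENT" = "Page$2",                     # placeholder value
    "__VIEWSTATE"     = "<value from the captured request>"
  ),
  encode = "form"
)

page2 <- read_html(content(resp, as = "text", encoding = "UTF-8"))
page2 %>% html_nodes("td") %>% html_text() %>% head()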
However, it looks like that website makes its money by showing that information, and if some other source showed the same things, they'd lose money. Therefore I doubt it's that easy to emulate; I think some part of the request differs every time, yet needs to be exactly right. For example, they could build in checks to see whether the entire page was rendered, instead of discarded like you do. So I wouldn't be surprised if they intentionally make it very hard to do what you are trying to do.
Which brings me to an entirely different solution: ask them!
When I tried scraping a website with dynamically generated content for the first time, I was struggling as well. Until I explored the website some more, and saw that they had a link where you could download the entire thing, tidied up, in a nice csv-format.
And for a webserver, people trying to scrape the site is often an inconvenience: it demands resources from the server, a lot more than someone simply downloading a file.
It's quite possible they'll tell you "no", but if they really don't want you to get their data, I bet they've made it difficult to scrape. Maybe you'll just get banned if you make too many requests from the same IP, maybe some other method.
And it's also entirely possible that they don't want their data in the hands of a competitor, but that they'll give it to you if you only use it for a particular purpose.
(too big for a comment, and the original also included a salient image, but not an answer, per se)
Emil is spot on, except that this is an ASP.NET/SharePoint-esque site with binary "view states" and other really daft web practices that will make it nigh impossible to scrape with just httr.
When you do use the Network tab (again, as Emil astutely suggests), you can also use curlconverter to automatically build httr VERB functions out of requests "Copied as cURL".
For this site, assuming it's legal to scrape (it has no robots.txt, and I am not fluent in Dutch and did not see an obvious "terms and conditions"-like link), you can use something like splashr or Selenium to navigate, click and scrape, since they act like a real browser.

Scrape football elo-ratings with rvest

I am trying to harvest the world football elo ratings with rvest but I keep getting an empty list
Using Inspect Element in Google Chrome I get the xpath //*[(@id = "maintable_2014_World_Cup_start")]/div[6]
library(rvest)
library(dplyr)
page<-"http://www.eloratings.net/2014_World_Cup_start"
elo_rating <- read_html(page) %>%
  html_nodes(xpath = '//*[@id="maintable_World"]/div[6]') %>%
  html_table()
I get an empty list
Searching online and within SE, I came across this, and perhaps it has something to do with JavaScript (which I know nothing about). Also, when looking at the page source (with Google Chrome) I see a lot of calls to JavaScript.
Lastly, I came across this R publication, with an example of extracting data from the same website, but when I try to replicate the R code, I still get empty lists and empty character objects.
I went through many threads here on SE (this, this, this) but I can't find a solution.
If the obstacle is JavaScript, is there anything I can do to extract the data?
The obstacle does seem to be JavaScript, as the tables are generated by it. I think you need to use PhantomJS to render the page and then grab the tables. See this page for help.
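In rough outline, that approach means asking PhantomJS to save the rendered page to disk and then pointing rvest at the saved file. Below is a sketch that assumes phantomjs is installed and on your PATH; the final selector is a placeholder you would narrow down once you see the rendered HTML:
library(rvest)

# Small PhantomJS script: load the page, give the JavaScript time to run,
# then write the rendered HTML to a local file
js <- "
var page = require('webpage').create();
var fs = require('fs');
page.open('http://www.eloratings.net/2014_World_Cup_start', function(status) {
  window.setTimeout(function() {
    fs.write('elo_rendered.html', page.content, 'w');
    phantom.exit();
  }, 2500);
});
"
writeLines(js, "render_elo.js")

system("phantomjs render_elo.js")   # assumes phantomjs is on the PATH

elo_tables <- read_html("elo_rendered.html") %>%
  html_nodes("table") %>%           # placeholder: narrow down once you see the output
  html_table(fill = TRUE)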

How to use R to scrape javascript html href link that uses dopostback

I'm trying to scrape this URL of tennis league scores: http://tennislink.usta.com/leagues/Main/statsandstandings.aspx#&&s=2%7C%7C%7C%7C4.0%7C%7CM%7C%7C2016%7C%7C9%7C%7C310%7C%7C
My goal is to automate scraping the results of my teams for analysis.
Using rvest and PhantomJS I can easily scrape the table on the above link and create an R data frame with the five columns.
However, I also want to capture the href for each row so that I can follow the link and scrape the details for each row. When I "inspect" the first element of a row (the element with the embedded link) I don't see a URL but rather this:
<a id="ctl00_mainContent_rptYearForTeamResults_ctl00_rptYearTeamsInfo_ctl16_LinkButton1" href="javascript:__doPostBack('ctl00$mainContent$rptYearForTeamResults$ctl00$rptYearTeamsInfo$ctl16$LinkButton1','')" class="">Text appears here that I can easily scrape</a>
I've searched for how to scrape __doPostBack links in R but have not found anything useful. I did find references to RSelenium and have looked at the CRAN RSelenium site, but could not find anything about dealing with __doPostBack.
I also found references to PhantomJS, which allowed me to scrape the table.
I have successfully scraped HTML at other times programmatically using R and rvest, including capturing URLs embedded directly in the HTML with href=, following those URLs programmatically, and continuing the scraping for thousands of records.
However, __doPostBack has stumped me; I have no JavaScript skills.
I've tried to find clues using "inspect element" that would allow me to simulate the __doPostBack in R, but nothing jumps out at me.
I would appreciate any help.
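No answer is recorded here, but since the href only fires __doPostBack rather than pointing at a real URL, one plausible route is to let RSelenium click the link in a real browser and then scrape whatever page results. The sketch below is untested against this site; the element id is copied from the snippet in the question and will differ for other rows:
library(RSelenium)
library(rvest)

driver <- rsDriver(browser = "chrome")
remDr  <- driver[["client"]]
remDr$navigate("http://tennislink.usta.com/leagues/Main/statsandstandings.aspx#&&s=2%7C%7C%7C%7C4.0%7C%7CM%7C%7C2016%7C%7C9%7C%7C310%7C%7C")

# Click the link that fires __doPostBack; the id comes from the question's HTML snippet
link <- remDr$findElement(using = "id",
                          value = "ctl00_mainContent_rptYearForTeamResults_ctl00_rptYearTeamsInfo_ctl16_LinkButton1")
link$clickElement()
Sys.sleep(2)   # crude wait for the postback to finish

# Hand the rendered detail page to rvest for the actual scraping
detail <- read_html(remDr$getPageSource()[[1]])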

Web scraping Oracle (ATG) Commerce

I am new to web scraping, and I use the following tools and method to scrape:
I use R (with packages Curl, XML, etc.) to read the web pages (from a URL link), and the htmlTreeParse function to parse the HTML page.
Then, in order to get the data I want, I first use the developer tools in Chrome to inspect the code.
When I know in which node the data are, I use xpathApply to get them.
Usually, it works well. But I had an issue with this site: http://www.sephora.fr/Parfum/Parfum-Femme/C309/2
When you click on the link, the page loads, but it is in fact page 1 (of the products).
You have to load the URL again (by entering it a second time) in order to get page 2.
When I use the usual process to read the data, the htmlTreeParse function always gives me page 1.
I tried to understand this website a bit more:
It seems that it is built with Oracle Commerce (ATG Commerce).
The "real" URL is hidden, and when you click on a filter (for instance, you select a brand), you get a URL with a requestid: http://www.sephora.fr/Parfum/Parfum-Femme/C309?_requestid=285099
This doesn't reveal which selection I made.
Could you please help:
How can I access more products?
Thank you
I found the solution: Selenium! I think it is the ultimate tool for web scraping. I posted several questions concerning web scraping; now, with RSelenium, almost everything is possible.
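To make that concrete, a minimal sketch of how RSelenium can get past the "always page 1" problem: the browser actually renders page 2 when asked for it, so you can then hand the rendered source to rvest (or keep parsing with XML). The product selector below is a placeholder, not verified against the current Sephora markup:
library(RSelenium)
library(rvest)

driver <- rsDriver(browser = "chrome")
remDr  <- driver[["client"]]

# In a real browser, page 2 actually renders when you navigate to it
remDr$navigate("http://www.sephora.fr/Parfum/Parfum-Femme/C309/2")
Sys.sleep(3)   # let the product grid finish loading

page2 <- read_html(remDr$getPageSource()[[1]])

# Placeholder selector: inspect the rendered page to find the real product nodes
page2 %>% html_nodes(".product-tile") %>% html_text() %>% head()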
