Trying to scrape a Market Valuation table from this webpage:
https://www.starcapital.de/en/research/stock-market-valuation/
The website is dynamic and asks for user location. The table of interest is listed as class "google-visualization-table-table".
I have tried the following R code:
library(rvest)

url <- "https://www.starcapital.de/en/research/stock-market-valuation/"
valuation <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="infotable_div2"]/div/div/table') %>%
  html_table()
valuation <- valuation[[1]]
and I get no error but no results. What is wrong?
This is a problem you will run into pretty often when scraping websites. The problem here is that this webpage is dynamic: it uses JavaScript to create the visualization, and that happens after the page loads and, crucially, after rvest has already downloaded the page, which is why your code doesn't see the table. I confirmed this by disabling JavaScript in Chrome: the chart is then missing from the page.
That said, you aren't out of luck! I again used Chrome's Developer Tools' Network pane to look through the requests the page was making. Pages like this that create charts dynamically often make a separate network request to grab data before creating the chart. After some scrolling and poking around, I saw one that looks like the dataset you're interested in:
https://www.starcapital.de/fileadmin/charts/Res_Aktienmarktbewertungen_FundamentalKZ_Tbl.php?lang=en
Open that up in your browser and take a look. Let me know if that's the data you were hoping to get. It's in a somewhat custom-looking JSON format so you may end up needing to write a bit of code to get it into R. Check out the jsonlite package for manipulating the JSON and the httr package for getting the data from that URL into R.
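If it is, a minimal sketch of that fetch-and-parse workflow might look like the following (untested; the exact JSON structure of that endpoint is unknown here, so the unpacking step will almost certainly need adapting):

```r
library(httr)
library(jsonlite)

url <- "https://www.starcapital.de/fileadmin/charts/Res_Aktienmarktbewertungen_FundamentalKZ_Tbl.php?lang=en"
resp <- GET(url)
stop_for_status(resp)

# Read the body as text and parse it; simplifyVector tries to
# collapse JSON arrays of records into a data frame
txt <- content(resp, as = "text", encoding = "UTF-8")
valuation <- fromJSON(txt, simplifyVector = TRUE)
str(valuation)  # inspect the structure before reshaping it
```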
Edit: An alternative approach would be to use an R package that can run the dynamic part of the page (that gets the data to make the chart/table) such as splashr. There are a few other R packages out there that can do this but that's one I'm familiar with.
Related
I'm not a web developer, so please bear with me.
https://www.etoro.com/people/hyjbrighter/chart
I know that there are several libraries for plotting graphs in JavaScript, but how can I check whether a specific page is using Highcharts or a competitor?
I expect to find some kind of JSON in the source code, but how can I find it?
The trick is to open the Network tab of Dev Tools, reload the page, and search for the piece of data that you want to scrape. Here I saw the number 21361.15, searched for it, and found that the JSON comes from https://www.etoro.com/sapi/userstats/CopySim/Username/hyjbrighter/OneYearAgo?callback=angular.callbacks._0&client_request_id=2ce991a6-0943-4111-abd3-6906ca92e45c.
But in this situation you need to strip the query parameters (the callback and client_request_id) to actually get the proper information.
I don't know which language you use, if you use Python, here is the code:
import requests
import pandas

# Fetch the chart data from the JSON endpoint, with the query parameters removed
data = requests.get("https://www.etoro.com/sapi/userstats/CopySim/Username/hyjbrighter/OneYearAgo").json()['simulation']['oneYearAgo']['chart']
data = pandas.DataFrame(data)
print(data)
If you use R, use the jsonlite package.
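For completeness, the R equivalent of the Python snippet above might look roughly like this (untested sketch; jsonlite can fetch a URL directly and will simplify the list of records into a data frame):

```r
library(jsonlite)

# Same endpoint as above, with the query parameters stripped
url <- "https://www.etoro.com/sapi/userstats/CopySim/Username/hyjbrighter/OneYearAgo"
data <- fromJSON(url)$simulation$oneYearAgo$chart
head(data)
```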
I'm trying to write an R script that checks prices on a popular Swiss website.
Following the methodology explained here: https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/ I tried to use rvest for that:
library(rvest)
url <- "https://www.galaxus.ch/fr/s8/product/quiksilver-everyday-stretch-l-shorts-de-bain-10246344"
webpage <- read_html(url)
Unfortunately, I have limited html/css knowledge and the content of webpage is very obscure to me.
I tried inspecting the page with Google Chrome and it looks like the price is located in something named priceEnergyWrapper--2ZNIJ, but I cannot find any trace of that in webpage. I had no more luck using SelectorGadget.
Can anybody help me get the price out of webpage?
Since it is dynamically generated, you will need RSelenium.
Your code should be something like:
library(RSelenium)

# rsDriver() starts a Selenium server and already opens a browser session
driver <- rsDriver(browser = "chrome")
rem_driver <- driver[["client"]]
rem_driver$navigate("https://www.galaxus.ch/fr/s8/product/quiksilver-everyday-stretch-l-shorts-de-bain-10246344")
This asks Selenium to open the page; once the entire page has loaded, all the HTML you would see by clicking View Page Source should be available.
Now do:
rem_driver$findElement(using = 'class', value = 'priceEnergyWrapper--2ZNIJ')
You should now see the necessary HTML to get the price value out of it, which at the time of checking the website is 25 CHF.
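To actually pull the price string out, you can assign the element and call getElementText() on it (again untested; getElementText() returns a list, hence the [[1]]):

```r
# Continuing from the session opened above: locate the price
# wrapper and read its visible text
price_elem <- rem_driver$findElement(using = "class", value = "priceEnergyWrapper--2ZNIJ")
price_text <- price_elem$getElementText()[[1]]
print(price_text)
```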
PS: I do not scrape websites for others unless I am sure that the owners of the websites do not object to crawlers/scrapers/bots. Hence, my code is a sketch of how to go about it with Selenium; I have not tested it personally. However, you should more or less get the general idea and the reason for using a tool like Selenium. You should also find out whether you are legally allowed to scrape this website, and any others in the future.
Additional resources to read about RSelenium:
https://ropensci.org/tutorials/rselenium_tutorial/
I want to scrape all the URLs from this page:
http://www.domainia.nl/QuarantaineList.aspx
I am able to scrape the first page; however, I cannot move to the next page, because the page number is not part of the URL. So how can I change pages while scraping? I've been looking into RSelenium, but could not get it working.
I'm running the following code to get at least the first page:
library(rvest)
library(stringr)

# Constructing the URL to scrape
baseURL <- "http://www.domainia.nl/quarantaine/"
date <- gsub("-", "/", Sys.Date())
URL <- paste0(baseURL, date)

# Scraping the page
page <- read_html(URL) %>% html_nodes("td") %>% html_text()
links <- str_subset(page, pattern = "^\r\n.*.nl$")
links <- gsub(pattern = "\r\n", "", links) %>% trimws()
I've looked at the site; it's using a JavaScript POST to refresh its contents.
Originally, an HTTP POST was meant to send information to a server, for example the contents of a form somebody filled in. As such, it often includes information about the page you are coming from, which means you will probably need more information than just "page n".
If you want to get another page, like your browser would show you, you need to send a similar request. The httr package includes a POST function; I think you should take a look at that.
To know what to post, it's most useful to capture what your browser does and copy that. In Chrome, open the Developer Tools and use the Network tab to see what is sent and received; I bet other browsers have similar tools.
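As a rough illustration of what that might look like with httr (the field names and values below are placeholders; copy the real ones from the captured request in the Network tab):

```r
library(httr)

# Placeholder field names: ASP.NET pages typically post hidden fields
# like __VIEWSTATE and __EVENTTARGET; copy the real values from DevTools
resp <- POST(
  "http://www.domainia.nl/QuarantaineList.aspx",
  body = list(
    "__VIEWSTATE"     = "<value copied from the page source>",
    "__EVENTTARGET"   = "<id of the pager control>",
    "__EVENTARGUMENT" = "2"  # e.g. the page number you want
  ),
  encode = "form"
)
page2 <- content(resp, as = "text")
```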
However, it looks like that website makes its money by showing that information, and if some other source would show the same things, they'd lose money. Therefore I doubt if it's that easy to emulate, I think some part of the request differs every time, yet needs to be exactly right. For example, they could build checks to see if the entire page was rendered, instead of discarded like you do. So I wouldn't be surprised if they intentionally make it very hard to do what you are trying to do.
Which brings me to an entirely different solution: ask them!
When I tried scraping a website with dynamically generated content for the first time, I was struggling as well. Until I explored the website some more, and saw that they had a link where you could download the entire thing, tidied up, in a nice csv-format.
And for a webserver, people trying to scrape their website is often inconvenient, it also demands resources from the server, a lot more than someone downloading a file.
It's quite possible they'll tell you "no", but if they really don't want you to get their data, I bet they've made it difficult to scrape. Maybe you'll just get banned if you make too many requests from the same IP, maybe some other method.
And it's also entirely possible that they don't want their data in the hands of a competitor, but that they'll give it to you if you only use it for a particular purpose.
(too big for a comment, but not an answer, per se)
Emil is spot on, except that this is an ASP.NET/SharePoint-esque site with binary "view states" and other really daft web practices that will make it nigh impossible to scrape with just httr.
When you do use the Network tab (again, as Emil astutely suggests) you can also use curlconverter to automatically build httr VERB functions out of requests "Copied as cURL".
For this site — assuming it's legal to scrape (it has no robots.txt and I am not fluent in Dutch and did not see an obvious "terms and conditions"-like link) — you can use something like splashr or Selenium to navigate, click and scrape, since they act like a real browser.
I am trying to harvest the World Football Elo Ratings with rvest, but I keep getting an empty list.
Using Inspect Element in Google Chrome I get the XPath //*[(@id = "maintable_2014_World_Cup_start")]/div[6]
library(rvest)
library(dplyr)

page <- "http://www.eloratings.net/2014_World_Cup_start"
elo_rating <- read_html(page) %>%
  html_nodes(xpath = '//*[@id="maintable_World"]/div[6]') %>%
  html_table()
I get an empty list
Searching online and within SE, I came across this, and perhaps it has something to do with JavaScript (which I know nothing about). Also, when looking at the page source (with Google Chrome) I see a lot of calls to JavaScript.
Lastly, I came across this R publication with an example of extracting data from the same website, but when I try to replicate the R code, I still get empty lists and empty character objects.
I went through many threads here on SE (this, this, this) but I can't find a solution.
If the obstacle is javascript, is there anything I can do to extract the data?
The obstacle does seem to be JavaScript, as the tables are generated by it. I think you need to use PhantomJS to render the tables and grab them. See this page for help.
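One common pattern is to have PhantomJS render the page, dump the final HTML to a file, and then read that file with rvest. A rough, untested sketch (assumes the phantomjs binary is on your PATH; note the rendered page may still use divs rather than a real table element, in which case html_table() won't apply directly):

```r
# A small PhantomJS script that loads the page, waits for the
# JavaScript to run, and prints the rendered HTML to stdout
js <- '
var page = require("webpage").create();
page.open("http://www.eloratings.net/2014_World_Cup_start", function () {
  window.setTimeout(function () {
    console.log(page.content);
    phantom.exit();
  }, 3000);  // crude 3-second wait for the tables to render
});
'
writeLines(js, "render.js")
system("phantomjs render.js > rendered.html")

library(rvest)
doc <- read_html("rendered.html")
nodes <- html_nodes(doc, xpath = '//*[@id="maintable_World"]/div[6]')
```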
I am a beginner in R and was trying to extract different table data from different Websites. I was able to perform the basic data scraping, but I am stuck while trying to extract data from the following table.
url: https://www.nseindia.com/live_market/dynaContent/live_watch/equities_stock_watch.htm?cat=N
I tried using the read_html and html_nodes functions with both CSS and XPath selectors, but they do not return a value. Could anyone advise me on how to proceed?
So the problem you're facing is that rvest will read the source of a page, but it won't execute the JavaScript on the page. The table is created by executing JavaScript once the source has been loaded.
Your best option is to look into RSelenium. RSelenium actually launches and drives a browser window; once the JavaScript has executed, you can query the current source (what you would see if you right-click in Chrome and select Inspect).
However, RSelenium was pulled from CRAN because some dependencies were pulled from CRAN, so you'll probably need to use MRAN to install it.
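Once RSelenium is installed, a typical (untested) pattern is to let the browser execute the JavaScript and then hand the rendered source to rvest:

```r
library(RSelenium)
library(rvest)

driver <- rsDriver(browser = "chrome")
remdr <- driver[["client"]]
remdr$navigate("https://www.nseindia.com/live_market/dynaContent/live_watch/equities_stock_watch.htm?cat=N")
Sys.sleep(5)  # crude wait for the JavaScript to populate the table

# Grab the rendered HTML and let rvest extract any tables
src <- remdr$getPageSource()[[1]]
tables <- read_html(src) %>% html_nodes("table") %>% html_table(fill = TRUE)

remdr$close()
driver[["server"]]$stop()
```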