I am a beginner in R and have been trying to extract table data from different websites. I can handle basic scraping, but I am stuck trying to extract data from the following table.
url: https://www.nseindia.com/live_market/dynaContent/live_watch/equities_stock_watch.htm?cat=N
I tried the read_html and html_nodes functions with both CSS and XPath selectors, but they return nothing. Could anyone advise me on how to proceed?
So the problem you're facing is that rvest reads the source of a page, but it won't execute the JavaScript on the page. The table is created by JavaScript after the source has loaded.
Your best option is to look into RSelenium, which actually launches and drives a browser window. Once the JavaScript has executed, you can query the current source (what you would see if you right-click in Chrome and select Inspect).
However, RSelenium was pulled from CRAN because some dependencies were pulled from CRAN, so you'll probably need to use MRAN to install it.
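For illustration, here is a minimal sketch of that approach, assuming a version of RSelenium that provides rsDriver() and a working Chrome driver; which table node to pick is an assumption you'll need to adjust after inspecting the rendered page:

library(RSelenium)
library(rvest)

rD <- rsDriver(browser = "chrome")    # starts a Selenium server and a browser
remDr <- rD$client
remDr$navigate("https://www.nseindia.com/live_market/dynaContent/live_watch/equities_stock_watch.htm?cat=N")
Sys.sleep(5)    # give the JavaScript time to build the table

src <- remDr$getPageSource()[[1]]    # the rendered source, after the JS has run
tables <- read_html(src) %>%
  html_nodes("table") %>%            # pick out the right table after inspecting
  html_table(fill = TRUE)

remDr$close()
rD$server$stop()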
Trying to scrape a market valuation table from this webpage:
https://www.starcapital.de/en/research/stock-market-valuation/
The website is dynamic and asks for the user's location. The table of interest has the class "google-visualization-table-table".
I have tried the following R code:
library(rvest)
url <- "https://www.starcapital.de/en/research/stock-market-valuation/"
valuation <- url %>%
  read_html() %>%    # html() is deprecated; read_html() is the current function
  html_nodes(xpath = '//*[@id="infotable_div2"]/div/div/table') %>%
  html_table()
valuation <- valuation[[1]]
and I get no error but no results. What is wrong?
This is a problem you will run into pretty often when scraping websites. The problem here is that this webpage is dynamic: it uses JavaScript to create the visualization, and this happens after the page loads and, crucially, after rvest downloads the page, which is why you don't see the table with your code. I confirmed this by disabling JavaScript in Chrome; the chart is then missing from the page.
That said, you aren't out of luck! I again used Chrome's Developer Tools' Network pane to look through the requests the page was making. Pages like this that create charts dynamically often make a separate network request to grab data before creating the chart. After some scrolling and poking around, I saw one that looks like the dataset you're interested in:
https://www.starcapital.de/fileadmin/charts/Res_Aktienmarktbewertungen_FundamentalKZ_Tbl.php?lang=en
Open that up in your browser and take a look. Let me know if that's the data you were hoping to get. It's in a somewhat custom-looking JSON format so you may end up needing to write a bit of code to get it into R. Check out the jsonlite package for manipulating the JSON and the httr package for getting the data from that URL into R.
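As a starting point, here is a hedged sketch with httr and jsonlite; whether fromJSON() parses the payload directly depends on that custom-looking format, so you may need to inspect and clean the text first:

library(httr)
library(jsonlite)

url <- "https://www.starcapital.de/fileadmin/charts/Res_Aktienmarktbewertungen_FundamentalKZ_Tbl.php?lang=en"
resp <- GET(url)
txt  <- content(resp, as = "text", encoding = "UTF-8")

# if the payload is valid JSON this gives a list / data frame;
# otherwise inspect txt and strip any wrapper before parsing
dat <- fromJSON(txt)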
Edit: An alternative approach would be to use an R package that can run the dynamic part of the page (the part that fetches the data for the chart/table), such as splashr. There are a few other R packages out there that can do this, but that's the one I'm familiar with.
I'm trying to scrape this URL of tennis league scores: http://tennislink.usta.com/leagues/Main/statsandstandings.aspx#&&s=2%7C%7C%7C%7C4.0%7C%7CM%7C%7C2016%7C%7C9%7C%7C310%7C%7C. My goal is to automate scraping the results of my teams for analysis.
Using rvest and PhantomJS, I can easily scrape the table at the above link and create an R data frame with the five columns.
However, I also want to capture the href for each row so that I can follow the link and scrape the details for each row. When I inspect the first element of a row (the element with the embedded link), I don't see a URL but rather this:
<a id="ctl00_mainContent_rptYearForTeamResults_ctl00_rptYearTeamsInfo_ctl16_LinkButton1" href="javascript:__doPostBack('ctl00$mainContent$rptYearForTeamResults$ctl00$rptYearTeamsInfo$ctl16$LinkButton1','')" class="">Text appears here that I can easily scrape</a>
I've searched for how to scrape __doPostBack links in R but have not found anything useful. I did find references to RSelenium and have looked at the RSelenium site on CRAN, but could not find anything that deals with __doPostBack.
I also found references to PhantomJS, which is what allowed me to scrape the table.
I have successfully scraped HTML at other times programmatically using R and rvest, including capturing URLs embedded directly in the HTML with href=, following those URLs programmatically, and continuing the scraping for thousands of records.
However, __doPostBack has stumped me; I have no JavaScript skills.
I've tried to find clues using "inspect element" that would allow me to simulate the __doPostBack in R, but nothing jumps out at me.
I would appreciate any help.
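Since RSelenium drives a real browser, one possible approach is to let the browser execute the postback by clicking the element itself. Here is a minimal sketch, assuming a working RSelenium setup; the selector targets the id suffix from the anchor shown above, and the wait time is a guess:

library(RSelenium)

rD <- rsDriver(browser = "chrome")
remDr <- rD$client
remDr$navigate("http://tennislink.usta.com/leagues/Main/statsandstandings.aspx#&&s=2%7C%7C%7C%7C4.0%7C%7CM%7C%7C2016%7C%7C9%7C%7C310%7C%7C")

# find the anchor by the tail of its ASP.NET id and click it;
# the click runs the __doPostBack JavaScript for us
elem <- remDr$findElement(using = "css selector", "a[id$='LinkButton1']")
elem$clickElement()
Sys.sleep(3)    # wait for the postback round trip

detail_src <- remDr$getPageSource()[[1]]    # scrape this with rvest as before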
I am new to web scraping, and I use the following tools and method to scrape:
I use R (with the RCurl, XML, etc. packages) to read the web pages (given a URL), and the htmlTreeParse function to parse the HTML page.
Then, in order to find the data I want, I first use the developer tools in Chrome to inspect the code.
Once I know which node the data is in, I use xpathApply to get it.
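For context, a sketch of that process with the RCurl and XML packages; the XPath here is a placeholder for whatever node you find when inspecting:

library(RCurl)
library(XML)

page <- getURL("http://www.sephora.fr/Parfum/Parfum-Femme/C309/2")
doc  <- htmlTreeParse(page, useInternalNodes = TRUE)

# grab the text of every node matched by the (placeholder) XPath
prods <- xpathApply(doc, "//div[@class='product-name']", xmlValue)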
Usually, it works well. But I had an issue with this site: http://www.sephora.fr/Parfum/Parfum-Femme/C309/2
When you click on the link, the page loads, and it is in fact page 1 (of the products).
You have to load the URL again (by entering it a second time) in order to get page 2.
When I use my usual process to read the data, htmlTreeParse always gives me page 1.
I tried to understand this website a bit more:
It seems to be built with Oracle Commerce (ATG Commerce).
The "real" URL is hidden, and when you click on a filter (for instance, when you select a brand), you get a URL with a request id: http://www.sephora.fr/Parfum/Parfum-Femme/C309?_requestid=285099
This doesn't reveal which selection was made.
Could you please help:
How can I access more products?
Thank you
I found the solution: Selenium! I think it is the ultimate tool for web scraping. I have posted several questions about web scraping; now, with RSelenium, almost everything is possible.
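A minimal sketch of what that looks like with RSelenium; the double navigate() mirrors the load-the-URL-twice behaviour described above and is an assumption about how the site tracks the page number in the session:

library(RSelenium)
library(rvest)

rD <- rsDriver(browser = "chrome")
remDr <- rD$client

url2 <- "http://www.sephora.fr/Parfum/Parfum-Femme/C309/2"
remDr$navigate(url2)    # first load: the site serves page 1
remDr$navigate(url2)    # second load: the session now serves page 2

src <- read_html(remDr$getPageSource()[[1]])
# from here, extract the product nodes with html_nodes() as usual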
I have a COM add-in in Excel, from RIMES. It basically allows me to get data, and there is a specific button to refresh all the data. I load the data into R and do calculations on it. I would like to avoid having to open and refresh the Excel file before loading the data; in other words, I would like something that "clicks on refresh all" for me. I figured I could write a VBA script to do that, but I cannot figure out what functions the Excel add-in exposes. How can one "explore" the COM add-in?
Thanks
I was finally able to find a proper solution: it turns out there are templates created from the COM add-in that expose a refresh function, which is then available in VBA. One can then use a VBScript to run everything without opening Excel.
How To Create a VB.Script (StackOverflow)
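For reference, a similar refresh can also be driven from the R side on Windows with the RDCOMClient package. This is only a sketch: whether the workbook's standard RefreshAll call triggers the RIMES add-in is an assumption to verify, and the file path is a placeholder.

library(RDCOMClient)

xl <- COMCreate("Excel.Application")
xl[["Visible"]] <- FALSE
wb <- xl[["Workbooks"]]$Open("C:/path/to/your/file.xlsx")    # placeholder path

wb$RefreshAll()    # the COM equivalent of clicking Data > Refresh All
Sys.sleep(10)      # crude wait for the refresh to finish

wb$Save()
xl$Quit()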
Cheers
I'm trying to write some programs to download a lot of economic data (on the order of hundreds of distinct tables from different websites, that'd need to be updated frequently). Take this website:
http://www.oecd-ilibrary.org/economics/country-statistical-profiles-key-tables-from-oecd_20752288
I want an R program to be able to click on one of those little green buttons that will download an xls file, so I don't have to click it by hand. Is there a package / function in R for this type of thing? (And if not, is there another simple-ish way to do it?)
Thanks!
The buttons just link to .xls files. So you could take the URL that a button points to and feed it to a script or function that does the downloading. There are plenty of packages, like RCurl, that you could use to manage the download.
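A minimal sketch of that approach, assuming you have already copied the direct .xls link from one of the buttons (the URL below is a placeholder, not a real file):

xls_url  <- "http://www.oecd-ilibrary.org/path/to/table.xls"    # placeholder
destfile <- "oecd_table.xls"

download.file(xls_url, destfile, mode = "wb")    # "wb" keeps binary files intact

# then read it in, e.g. with the readxl package:
# library(readxl)
# dat <- read_excel(destfile)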