I am trying to harvest the World Football Elo Ratings with rvest, but I keep getting an empty list.
Using Inspect Element in Google Chrome I get the XPath //*[@id="maintable_2014_World_Cup_start"]/div[6]
library(rvest)
library(dplyr)
page <- "http://www.eloratings.net/2014_World_Cup_start"
elo_rating <- read_html(page) %>%
  html_nodes(xpath = '//*[@id="maintable_World"]/div[6]') %>%
  html_table()
I get an empty list
Searching online and within SE, I came across this, and perhaps it has something to do with JavaScript (which I know nothing about). Also, when looking at the page source (with Google Chrome), I see a lot of calls to JavaScript.
Lastly, I came across this R publication, with an example of extracting data from the same website, but when I try to replicate the R code, I still get empty lists and empty character objects
I went through many threads here on SE (this, this, this) but I can't find a solution.
If the obstacle is JavaScript, is there anything I can do to extract the data?
The obstacle does seem to be JavaScript, as the tables are generated by it. I think you need to use PhantomJS to render the page and then grab the tables. See this page for help.
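For example, something along these lines should work. This is a minimal sketch, assuming the phantomjs binary is installed and on your PATH and that the rendered page contains ordinary table elements (both assumptions, since I have not run it against the site):
library(rvest)
# Minimal PhantomJS script: load the page, let its JavaScript run, then save the rendered HTML
js <- '
var page = require("webpage").create();
var fs = require("fs");
page.open("http://www.eloratings.net/2014_World_Cup_start", function(status) {
  setTimeout(function() {                    // give the page time to build its tables
    fs.write("elo_rendered.html", page.content, "w");
    phantom.exit();
  }, 2000);
});
'
writeLines(js, "scrape_elo.js")
system("phantomjs scrape_elo.js")            # assumes phantomjs is on your PATH
# The saved file now contains the rendered tables, which rvest can read as usual
elo_tables <- read_html("elo_rendered.html") %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)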
Trying to scrape a Market Valuation table from this webpage:
https://www.starcapital.de/en/research/stock-market-valuation/
The website is dynamic and asks for user location. The table of interest is listed as class "google-visualization-table-table".
I have tried the following R code:
library(rvest)
url <- "https://www.starcapital.de/en/research/stock-market-valuation/"
valuation <- url %>%
  read_html() %>%   # html() is deprecated in current rvest; read_html() replaces it
  html_nodes(xpath = '//*[@id="infotable_div2"]/div/div/table') %>%
  html_table()
valuation <- valuation[[1]]
and I get no error but no results. What is wrong?
This is a problem you will run into pretty often when scraping websites. The problem here is that this webpage is dynamic. That is, it uses JavaScript to create the visualization, and this happens after the page loads and, crucially here, after rvest downloads the page, which is why you don't see it with your code. I confirmed this by disabling JavaScript in Chrome and seeing that the chart is missing from the page.
That said, you aren't out of luck! I again used Chrome's Developer Tools' Network pane to look through the requests the page was making. Pages like this that create charts dynamically often make a separate network request to grab data before creating the chart. After some scrolling and poking around, I saw one that looks like the dataset you're interested in:
https://www.starcapital.de/fileadmin/charts/Res_Aktienmarktbewertungen_FundamentalKZ_Tbl.php?lang=en
Open that up in your browser and take a look. Let me know if that's the data you were hoping to get. It's in a somewhat custom-looking JSON format so you may end up needing to write a bit of code to get it into R. Check out the jsonlite package for manipulating the JSON and the httr package for getting the data from that URL into R.
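As a rough sketch of that last step (untested, and the exact shape of the JSON is an assumption, so inspect the parsed object before relying on it):
library(httr)
library(jsonlite)
data_url <- "https://www.starcapital.de/fileadmin/charts/Res_Aktienmarktbewertungen_FundamentalKZ_Tbl.php?lang=en"
resp <- GET(data_url)
stop_for_status(resp)                            # fail loudly if the request didn't succeed
txt <- content(resp, as = "text", encoding = "UTF-8")
parsed <- fromJSON(txt, simplifyVector = TRUE)   # may need tweaking if the payload isn't strictly valid JSON
str(parsed, max.level = 2)                       # look at the structure before reshaping it into a data frame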
Edit: An alternative approach would be to use an R package that can run the dynamic part of the page (that gets the data to make the chart/table) such as splashr. There are a few other R packages out there that can do this but that's one I'm familiar with.
I'm trying to write an R script checking prices on a popular Swiss website.
Following the methodology explained here: https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/ I tried to use rvest for that:
library(rvest)
url <- "https://www.galaxus.ch/fr/s8/product/quiksilver-everyday-stretch-l-shorts-de-bain-10246344"
webpage <- read_html(url)
Unfortunately, I have limited html/css knowledge and the content of webpage is very obscure to me.
I tried inspecting the page with Google Chrome and it looks like the price is located in something named priceEnergyWrapper--2ZNIJ, but I cannot find any trace of that in webpage. I did not have any more luck with SelectorGadget.
Can anybody help me get the price out of webpage?
Since the price is dynamically generated, you will need RSelenium.
Your code should be something like:
library(RSelenium)
driver <- rsDriver(browser = "chrome")
rem_driver <- driver[["client"]]
rem_driver$open()   # the client returned by rsDriver is usually already open; this is only needed if it is not
rem_driver$navigate("https://www.galaxus.ch/fr/s8/product/quiksilver-everyday-stretch-l-shorts-de-bain-10246344")
This asks Selenium to load the entire page, including the parts generated by JavaScript, so all the HTML you would otherwise see in the browser should now be available.
Now do:
rem_driver$findElement(using = "class name", value = "priceEnergyWrapper--2ZNIJ")
You should now see the necessary HTML to get the price value out of it, which at the time of checking the website is 25 CHF.
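A sketch of pulling the text out of that element (again untested, and the class name is simply what the inspector showed, so it may change when the site is rebuilt):
price_elem <- rem_driver$findElement(using = "class name",
                                     value = "priceEnergyWrapper--2ZNIJ")
price_elem$getElementText()[[1]]   # should return the displayed price text, e.g. something like "25 CHF"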
PS: I do not scrape websites for others unless I am sure that the owners do not object to crawlers/scrapers/bots. Hence, my code only illustrates how to go about it with Selenium; I have not tested it personally. You should still get the general idea and the reason for using a tool like Selenium. You should also find out whether you are legally allowed to scrape this website, and do the same for any others in the future.
Additional resources to read about RSelenium:
https://ropensci.org/tutorials/rselenium_tutorial/
I'm trying to scrape this URL of tennis league scores: http://tennislink.usta.com/leagues/Main/statsandstandings.aspx#&&s=2%7C%7C%7C%7C4.0%7C%7CM%7C%7C2016%7C%7C9%7C%7C310%7C%7C
My goal is to automate scraping the results of my teams for analysis.
Using rvest and PhantomJS I can easily scrape the table on the above link and create an R data frame with the five columns.
However, I also want to capture the href for each row so that I can follow the link and scrape the details for each row. When I "inspect" the first element of a row (the element with the embedded link) I don't see the URL but rather see this:
<a id="ctl00_mainContent_rptYearForTeamResults_ctl00_rptYearTeamsInfo_ctl16_LinkButton1" href="javascript:__doPostBack('ctl00$mainContent$rptYearForTeamResults$ctl00$rptYearTeamsInfo$ctl16$LinkButton1','')" class="">Text appears here that I can easily scrape</a>
I've searched for how to scrape __doPostBack links in R but have not found anything useful. I did find references to RSelenium and have looked at the CRAN RSelenium documentation, but could not find anything about dealing with __doPostBack.
I also found references to PhantomJS, which allowed me to scrape the table.
I have successfully scraped HTML programmatically at other times using R and rvest, including capturing URLs embedded directly in the HTML with href, following those URLs programmatically, and continuing the scraping for thousands of records.
However, __doPostBack has stumped me; I have no JavaScript skills.
I've tried to find clues using "inspect element" that would allow me to simulate the __doPostBack in R, but nothing jumps out at me.
I would appreciate any help.
I am having some trouble picking up data from the following website: https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Dashboard/Medicare-Drug-Spending/Drug_Spending_Dashboard.html
Its interactive nature makes this difficult, but I really just want the first table you see, with the different drug names. I tried inspecting different elements in Chrome to find the data source, but I cannot find any raw files. Any ideas on how I could approach this problem?
It could be a project well beyond my skills right now, but I've got around one full month to spend on it, so I think I can do it. What I want to build is this: gather news about a specific subject from various sources. Easy, right? Just get the RSS feeds and display them on a page. Well, I want something more advanced: duplicates removed and a customized presentation (that is, being able to define/change the format in which the news headlines are displayed).
I've played a bit with Yahoo Pipes and some other tools and I am facing two big problems:
Some sources don't provide RSS feeds. How do I create one?
What's the best method to find and remove duplicates? I thought about comparing the headlines and checking whether they match by more than, say, 50%. Is that good practice, though?
Please add any other things (problems, suggestions, whatever) I might not have considered.
Duplication is a nasty issue. What I eventually ended up doing (a code sketch follows below):
1. Strip out all HTML tags except for links (Although I started using regex, I was burned. I eventually moved to custom parsing to remove tags)
2. Strip out all whitespace
3. Case-desensitize
4. Hash all that with MD5.
Here's why you leave the link in:
A comment might be as simple as "Yes, this sucks". "Yes, this sucks" could be a common comment. BUT if the text "this sucks" is linked to different things, then it is not a duplicate comment.
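If you wanted to prototype the same fingerprinting idea in R, it might look roughly like this (a sketch only, not the code I used back then; it leans on xml2 for the tag stripping and digest for the MD5):
library(xml2)
library(digest)
dedupe_key <- function(html_fragment) {
  doc   <- read_html(paste0("<div>", html_fragment, "</div>"))
  hrefs <- xml_attr(xml_find_all(doc, ".//a"), "href")    # keep the link targets (step 1)
  text  <- xml_text(doc)                                  # drop every other tag
  key   <- paste(c(text, hrefs), collapse = "")
  key   <- gsub("\\s+", "", key)                          # strip whitespace (step 2)
  key   <- tolower(key)                                   # case-desensitize (step 3)
  digest(key, algo = "md5", serialize = FALSE)            # hash the result (step 4)
}
# Same text, different link targets: different keys, so not treated as duplicates
dedupe_key('Yes, <a href="http://example.com/a">this</a> sucks') ==
  dedupe_key('Yes, <a href="http://example.com/b">this</a> sucks')   # FALSE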
Additionally, you will find that HTML tag escaping is weird with RSS feeds. You would think that a stray < would be double-encoded as &amp;lt;, but it is not: it arrives as &lt;. And so do actual HTML tags: a <p> arrives as &lt;p&gt;.
I eventually copied all the known HTML tags as parsed by Mozilla Firefox and manually recognized those tags.
Creating an RSS feed from HTML is quite nasty, and I can only point you to services such as Spinn3r, which are fantastic at de-duplication and content extraction. These services typically use probability-based algorithms that are above me. I know of one provider that got away with regexing pages (they had to know that a certain page was MySpace-based or Blogger-based), but they did not perform admirably.
You might want to try to use the YQL module to scrape a webpage that doesn't provide RSS. Here's a sample of a YQL statement to scrape HTML.
About duplicates, take a look at this pipe.
Customized presentation: if you want it truly customized you'll have to manipulate the pipe results yourself, e.g. get them as JSON and manipulate them with JavaScript, or process them server-side.