I'm pretty much brand new to web scraping with rvest, and really new to most everything except Qlik coding.
I am attempting to scrape data from BoardGameGeek (see the link below). Using Inspect, it certainly seems possible, yet rvest is not finding the tags. I first thought I had to go through the whole JavaScript process using V8 (JavaScript is called at the top of the HTML), but when I just use html_text on the whole document, all the information I need is in there.
UPDATE: It appears to be JSON. I used a combination of Notepad++ and a web tool to clean it and load it into R. Any recommendations on tutorials/demos for how to do this systematically? I have all the links I need to loop through, but I'm not sure how to go from the html_text output to clean JSON input via code.
I provided examples below, but I need to scrape the majority of the data elements available, so I'm not looking for code to copy and paste but rather the best method to pursue.
Link: https://boardgamegeek.com/boardgame/63888/innovation
HTML example I am trying to pull from. span returns nothing with html_nodes, so I couldn't even start there.
<span ng-if="min > 0" class="ng-binding ng-scope">45</span>
OR
<a title="Civilization" ng-href="/boardgamecategory/1015/civilization" class="ng-binding" href="/boardgamecategory/1015/civilization">Civilization</a>
JavaScript sections at the top of the page look like this (there are about 8 of them):
<script type="text/javascript" src="https://cf.geekdo-static.com/static/geekcollection_master2_5e84926ab7e90.js"></script>
When I just use html_text on the whole object, I can see all the elements I am looking for, e.g.:
\"minplaytime\":\"30\" OR {\"name\":\"Deck, Bag, and Pool Building\"
I'm assuming this is JSON? Is there a way to parse the html_text output, or is there another method? Is it easier just to run the JavaScript at the top of the page using V8? Is there an easy guide for this?
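For anyone following along, a rough sketch of going from read_html to parsed JSON might look like the following (the inline <script> location and the geekitemPreload variable name are assumptions about how the page embeds the data; check the actual page source and adjust):
library(rvest)
library(jsonlite)
url <- "https://boardgamegeek.com/boardgame/63888/innovation"
page <- read_html(url)
# Grab the text of every inline <script> tag and keep the one that appears
# to assign the big JSON blob ("geekitemPreload" is a guess; confirm the
# actual variable name in the page source).
scripts <- page %>% html_nodes("script") %>% html_text()
blob <- scripts[grepl("geekitemPreload", scripts)][1]
# Trim everything outside the outermost { ... } so only the JSON remains
# (this assumes the object is the only brace-delimited block in that script),
# then parse it into a nested list.
start <- regexpr("\\{", blob)
end <- max(gregexpr("\\}", blob)[[1]])
game <- fromJSON(substr(blob, start, end))
# Explore the structure to find fields such as minplaytime.
str(game, max.level = 2)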
Are you aware that BGG has an API? Documentation can be found here: URL
The data is provided as an XML file. For your example you can get the ID of your game - it's 63888 (it's in the URL) - so the XML file can be found at: https://www.boardgamegeek.com/xmlapi2/thing?id=63888
You can read the info with this code:
library(dplyr)
library(rvest)
game_data <- read_xml("https://www.boardgamegeek.com/xmlapi2/thing?id=63888")
game_data %>%
  html_nodes("name[type=primary]") %>%
  html_attr("value") %>%
  as.character()
#> [1] "Innovation"
By inspecting the XML file you can choose which node you want to extract.
Created on 2020-04-06 by the reprex package (v0.3.0)
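The same pattern extends to other fields. For example, the categories mentioned in the question should be retrievable like this (assuming they appear as <link type="boardgamecategory"> nodes in the XML, which is worth confirming in a browser):
# Category names, e.g. "Civilization" (node/attribute names assumed from
# the xmlapi2 output; verify against the actual XML).
game_data %>%
  html_nodes("link[type=boardgamecategory]") %>%
  html_attr("value")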
Related
I'm working on scripting some dataset downloads in R from the Center for Survey and Survey/Registrar data, this nesstar-based data archive: http://cssr.surveybank.aau.dk/webview
Poking around, I've found there are bookmarkable links for each dataset in each format, e.g., http://cssr.surveybank.aau.dk/webview/velocity?format=STATA&includeDocumentation=on&execute=&ddiformat=pdf&study=http%3A%2F%2F172.18.36.233%3A80%2Fobj%2FfStudy%2FElectionStudy-1973&analysismode=table&v=2&mode=download
There's no username or password required to use the site, so that's one bullet dodged. But the next step is to click on the "Download" button, and that's where I'm stumped. The question "Using R to 'click' a download file button on a webpage" sounds like it should be right on point, but this webpage actually isn't similar. Unlike that one, this button is not part of a form, so my efforts using html_form() and submit_form() predictably got nowhere. (And it's not a link, so of course follow_link() won't work either.) The following gets me to the right node, but doesn't actually click the button.
library(magrittr)
library(rvest)
url <- "http://cssr.surveybank.aau.dk/webview/velocity?format=STATA&includeDocumentation=on&execute=&ddiformat=pdf&study=http%3A%2F%2F172.18.36.233%3A80%2Fobj%2FfStudy%2FElectionStudy-1973&analysismode=table&v=2&mode=download"
s <- html_session(url)
download_button <- s %>% html_node(".button")
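For reference, the simplest httr attempt would be a plain GET on the bookmarkable URL (a sketch only: it assumes the button does nothing beyond requesting that same URL, and the output filename is arbitrary):
library(httr)
# If the server returns the Stata file directly for a GET on the bookmarkable
# URL, this saves it to disk; if the "Download" button fires extra JavaScript
# or a POST, this will just fetch the HTML page instead.
resp <- GET(url, write_disk("ElectionStudy-1973.dta", overwrite = TRUE))
stop_for_status(resp)
headers(resp)[["content-type"]]  # check what actually came back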
Now that RSelenium is back on CRAN (yay!), I suppose I could go in that direction instead, but I'd really prefer an rvest or httr-based solution. If anyone could help, I'd really appreciate it.
For months I have been able to read this page, but starting Wednesday, it freezes.
myURL <- "http://www.nasdaq.com/symbol/fb"
webpage <- readLines(myURL)
I've tried:
- read_html (rvest)
- html_session (rvest); also reset the user agent - no change.
- readLines; this used to be all I needed. Now it freezes like every other approach.
- GET (httr)
- getURL (RCurl)
Tried all of these both through RStudio on a Windows box and directly in R on an Ubuntu server. It freezes everywhere.
Poked around with the Chrome Developer Tools on the network tab to try to understand why this loads easily in browser and not at all in R. I didn't see any smoking gun, but I'm not an expert.
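A sketch of the httr attempt with an explicit user agent and a hard timeout (both values arbitrary), which at least fails fast instead of hanging the session:
library(httr)
# Browser-like user agent plus a hard timeout so the request errors out
# after 10 seconds rather than freezing indefinitely.
resp <- GET(
  "http://www.nasdaq.com/symbol/fb",
  user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),
  timeout(10)
)
status_code(resp)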
If anyone can figure out how to get the page without it freezing, that is all the help I need to get unstuck. Thanks!
I'm not sure which parts of the webpage you want to collect, but I had success getting some of the vital info with this code:
library(rvest)
library(dplyr)
url <- "https://www.nasdaq.com/symbol/fb"
foo <- read_html(url)
html_nodes(foo, css = "b") %>% html_text()
Are you able to run the code above? Does it give you what you need? Depending on which pieces of data you need from the website, you might need to use a tool like SelectorGadget to find the css values that you need.
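For example, once SelectorGadget gives you a selector, you can drop it straight into html_nodes (the selector below is purely illustrative, not taken from the live page):
# Illustrative selector only -- replace with whatever SelectorGadget reports.
html_nodes(foo, css = "#quote-header .price") %>% html_text()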
I hope that this helps. If it doesn't, please elaborate.
Specifically I am trying to parse Amazon reviews of a product with the rvest library in R.
reviews_url <- "https://www.amazon.com/Magic-Bullet-Blender-Small-Silver/product-reviews/B012T634SM/ref=cm_cr_getr_d_paging_btm_1?ie=UTF8&reviewerType=all_reviews&pageNumber=1"
amazon_review <- read_html(reviews_url)
reviewRaw <- amazon_review %>%
  html_nodes(".review-text") %>%
  html_text()
The problem I am facing is that if I rerun the code I sometimes get different outputs, as if it somehow parsed a different site. Sometimes it is the right output.
How can I fix this?
I already tried using the RSelenium package and its WebDriver to load the page and give it time to load, but it does not help.
Interestingly, the output alternates between two alternatives: either the reviews are parsed correctly or they are not. The wrong alternative always looks the same, however.
There definitely is some pattern there, but I just can't get my head around what the problem could be. It might have something to do with the way the reviews are loaded at Amazon?
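One crude workaround to consider is a retry wrapper like the sketch below (the emptiness check is an assumption - adjust it to whatever the wrong output actually contains):
library(rvest)
# Re-fetch the page until the reviews parse, up to a fixed number of attempts.
get_reviews <- function(url, tries = 5) {
  for (i in seq_len(tries)) {
    out <- read_html(url) %>%
      html_nodes(".review-text") %>%
      html_text()
    if (length(out) > 0) return(trimws(out))
    Sys.sleep(2)  # brief pause before trying again
  }
  character(0)    # give up after 'tries' attempts
}
reviewRaw <- get_reviews(reviews_url)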
Anyways, I am thankful for any idea to solve this.
Best regards.
I'm trying to use LDAvis for the first time, but have run into the following issue:
After running serVis on my JSON object,
serVis(json, out.dir = 'LDAvis', open.browser = FALSE)
the 5 expected files are created (i.e., d3.v3.js, index.html, lda.css, lda.json, and ldavis.js). As I understand LDAvis, opening the html file should open the interactive viewer. However, in doing this, only a blank webpage is opened.
I've compared the html source code with that from LDAvis projects found online, and they are the same. This was built using Christopher Gandrud's script found here where the LDA results come from the topicmodels package and used the Gibbs method. The underlying data uses ~45K documents with ~15K unique terms. For what it's worth, the lda.json file seems a bit small at ~6MB.
Unfortunately, this issue seems too large to provide sample data or reproducible code. (If I could isolate the issue more, then perhaps I could add sample code.) Instead, I was hoping if readers had any ideas for the cause of this issue or if it has come about before.
Thanks ahead for any feedback!
I've resolved the issue after realizing that most web browsers restrict access to local files. For Chrome, the executable needs to be launched with the flag "--allow-file-access-from-files". Otherwise no error is shown when opening the LDAvis output unless you inspect the HTML elements manually.
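An alternative that avoids browser flags altogether (assuming installing another package is acceptable) is to serve the output directory over a local HTTP server, for example with the servr package:
# Serving the directory over http:// instead of opening it via file://
# sidesteps the local-file access restriction entirely.
# install.packages("servr")   # if not already installed
servr::httd("LDAvis")         # then open the printed localhost URL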
I have been cribbing off of the very helpful responses to "Scraping html tables into R data frames using the XML package" to scrape some HTML off the web and work with it in R.
The XML package seems to be pretty thorough about escaping non-alphabetic characters in text strings. Is there a simple way, in XML or some other package, to reverse some or all of the character escaping that passing my data through XML did? I started to do it myself, but after encountering cases like 'Representative JoaquÃÂn Castro' I thought 'there must be a better solution...'
Just for clarity, using the XML package to parse this HTML
library(XML)
apos_str <- c("<b>Tim O'Reilly</b>")
apos_str.parsed <- htmlTreeParse(apos_str, error=function(...){})
apos_str.parsed$children$html[[1]][[1]]
would produce
<b>Tim O&apos;Reilly</b>
And I'd ideally like a function or package that would search for that
&apos;
and turn it back into
'<b>Tim O'Reilly</b>'
Edit: To clarify, from the comments below, I get how to do this for the particular case of apostrophes, or any other character I see in my data. What I'm looking for is a package where someone has worked this out more generally.
Research I've done so far:
- Read everything I could find in the XML documentation on escaping.
- Looked for a promising package on the CRAN NLP page.
- Did a search for 'unescape [R]' and 'reverse escape [R]' here on SO.
Wasn't able to make any headway so thought I would bring the question here.
I'm not sure I understand the difficulty. String processing for replacements is done with the base regex functions: sub, gsub, regexpr, and gregexpr.
?sub # the same help page will also discuss 'gsub'
txt <- "<b>Tim O&apos;Reilly</b>"
sub("&apos;", "'", txt)
[1] "<b>Tim O'Reilly</b>"
If you had a list of values that occur between "&" and ";" you could split on those and then recombine. I suppose it is possible that you were hoping someone had already done that. You should clarify what level of abstraction you were hoping to achieve.
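For what it's worth, a rough sketch of that lookup-table idea (the entity list here is only a small sample; a full table would come from the w3.org definitions mentioned below):
# Minimal named-entity table; extend it from the full w3.org entity list.
unescape_entities <- function(x) {
  entities <- c("&amp;"  = "&",
                "&lt;"   = "<",
                "&gt;"   = ">",
                "&quot;" = "\"",
                "&apos;" = "'")
  for (ent in names(entities)) {
    x <- gsub(ent, entities[[ent]], x, fixed = TRUE)
  }
  x
}
unescape_entities("<b>Tim O&apos;Reilly</b>")
[1] "<b>Tim O'Reilly</b>"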
EDIT:
A blogger discusses the specific case of "&apos;": http://fishbowl.pastiche.org/2003/07/01/the_curse_of_apos/
I've done some further research of my own. Those are not properly called "escapes" but rather "named entities". I cannot find any references to them in the R-help archives. I have downloaded the XML listing from the w3.org website that defines these "entities" and am trying to convert it to a tabular form that would support search and replace. But your comment about 'Representative JoaquÃÂn Castro' has me puzzled. The odd characters are not in the form "&#xxx;", so... what exactly are you asking for? Please post a suitable test case with the expected output.
EDIT 2: There was a basically identical question from Michael Friendly that just got answered by David Carlson on R-help. Here's the link to the posting in the R-help archives:
https://stat.ethz.ch/pipermail/r-help/2012-August/321478.html
He's already done a better job than I had on creating a translation table and has included code to march through HTML text (and as a bonus, he included &apos;). A next-day follow-up from Michael Friendly wrapped the process up in a function; you can follow the link on the Archives page.