I'm working on scripting some dataset downloads in R from the Center for Survey and Survey/Register Data, a Nesstar-based data archive: http://cssr.surveybank.aau.dk/webview
Poking around, I've found there are bookmarkable links for each dataset in each format, e.g., http://cssr.surveybank.aau.dk/webview/velocity?format=STATA&includeDocumentation=on&execute=&ddiformat=pdf&study=http%3A%2F%2F172.18.36.233%3A80%2Fobj%2FfStudy%2FElectionStudy-1973&analysismode=table&v=2&mode=download
There's no username or password required to use the site, so that's one bullet dodged. But the next step is to click the "Download" button, and that's where I'm stumped. The question "Using R to 'click' a download file button on a webpage" sounds like it should be right on point, but this webpage actually isn't similar. Unlike that one, this button is not part of a form, so my efforts using html_form() and submit_form() predictably got nowhere. (And it's not a link, so of course follow_link() won't work either.) The following gets me to the right node, but doesn't actually click the button.
library(magrittr)
library(rvest)
url <- "http://cssr.surveybank.aau.dk/webview/velocity?format=STATA&includeDocumentation=on&execute=&ddiformat=pdf&study=http%3A%2F%2F172.18.36.233%3A80%2Fobj%2FfStudy%2FElectionStudy-1973&analysismode=table&v=2&mode=download"
s <- html_session(url)
download_button <- s %>% html_node(".button")
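For what it's worth, here's the direct httr request I'd expect to work if the Download button simply re-fetches this same URL (just a sketch; the local filename is a placeholder I made up):
library(httr)
# If the query URL itself streams the Stata file, writing the response
# body straight to disk may be all that's needed.
GET(url, write_disk("ElectionStudy-1973.dta", overwrite = TRUE))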
Now that RSelenium is back on CRAN (yay!), I suppose I could go in that direction instead, but I'd really prefer an rvest or httr-based solution. If anyone could help, I'd really appreciate it.
How do I get VSCode to refresh the plot when making a new plot without needing to close the previous plot tab?
Ideally, it should open up a new tab with the new plot next to the old one.
Moreover, when using httpgd, even though the plot refreshes without me having to close the previous plot, it opens up in a separate window outside VS Code. How do I get it to open up as a new tab, like in the gif above?
I had the exact same problem and what fixed it for me was simply, in vscode, going into:
File > Preferences > Settings > section Extensions > scroll down and find R:
find the setting "Plot: Use Httpgd", check it, and restart VS Code.
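Equivalently, you can set it directly in your settings.json (as far as I know, this is the setting ID the vscode-R extension uses):
{
    "r.plot.useHttpgd": true
}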
I can't confirm this, but I believe the problem may have started after installing the httpgd package, because I had the behavior you want before installation and the exact behavior you describe after it. Of course, if my solution works for you (I hope it does), the plots will be rendered through httpgd rather than the default VS Code R plot viewer shown in your gif.
@coip pointed out the steps given in this Stack Overflow post to me. Installing the httpgd package from CRAN fixed the problem for me.
I'm pretty much brand new to web scraping with rvest, and really new to most everything except Qlik coding.
I am attempting to scrape data from BoardGameGeek; see the link below. Using the browser's Inspect tool, it certainly seems possible, and yet rvest is not finding the tags. I first thought I had to go through the whole JavaScript process using V8 (JavaScript is loaded at the top of the HTML), but when I just use html_text() on the whole document, all the information I need is in there.
UPDATE: It appears to be JSON. I used a combination of Notepad++ and a web tool to clean it and load it into R. Any recommendations on tutorials/demos for how to do this systematically? I have all the links I need to loop through, but I'm not sure how to go from the html_text() output to clean JSON input via code.
I provided examples below, but I need to scrape the majority of the data elements available, so I'm not looking for code to copy and paste but rather for the best method to pursue.
Link: https://boardgamegeek.com/boardgame/63888/innovation
HTML example I am trying to pull from. The span returns nothing with html_nodes(), so I couldn't even start there.
<span ng-if="min > 0" class="ng-binding ng-scope">45</span>
OR
<a title="Civilization" ng-href="/boardgamecategory/1015/civilization" class="ng-binding" href="/boardgamecategory/1015/civilization">Civilization</a>
JavaScript sections at the top of the page, like this (there are about 8 of them):
<script type="text/javascript" src="https://cf.geekdo-static.com/static/geekcollection_master2_5e84926ab7e90.js"></script>
When I just use html_text() on the whole object, I can see all the elements I am looking for, e.g.:
\"minplaytime\":\"30\" OR {\"name\":\"Deck, Bag, and Pool Building\"
I'm assuming this is JSON? Is there a way to parse the html_text() output, or is there another method? Is it easier just to run the JavaScript at the top of the page using V8? Is there an easy guide for this?
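For illustration, here's the direction I've been fumbling toward (only a sketch; it assumes the data sits in an inline <script> as one big object assignment, which I'd have to confirm in the page source):
library(rvest)
library(stringr)
library(jsonlite)

page <- read_html("https://boardgamegeek.com/boardgame/63888/innovation")
# Grab the inline <script> blocks and keep the one holding the item data.
scripts <- page %>% html_nodes("script") %>% html_text()
raw <- scripts[str_detect(scripts, "minplaytime")][1]
# Cut the object literal out between the first "{" and the last "}",
# then hand it to jsonlite.
json_txt <- substr(raw, regexpr("\\{", raw), max(gregexpr("\\}", raw)[[1]]))
game <- fromJSON(json_txt)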
Are you aware that BGG has an API? Documentation can be found here: URL
The data is provided as an XML file. For your example, you can get the ID of your game from the URL; in your case it is 63888. So the XML file can be found at: https://www.boardgamegeek.com/xmlapi2/thing?id=63888
You can read the info with this code:
library(dplyr)
library(rvest)
game_data <- read_xml("https://www.boardgamegeek.com/xmlapi2/thing?id=63888")
game_data %>%
html_nodes("name[type=primary]") %>%
html_attr("value") %>%
as.character()
#> [1] "Innovation"
By inspecting the XML file you can choose which node you want to extract.
Created on 2020-04-06 by the reprex package (v0.3.0)
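If I'm reading the XML right, the minplaytime value from your update is exposed the same way, as a value attribute:
game_data %>%
  html_nodes("minplaytime") %>%
  html_attr("value")
#> [1] "30"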
For months I have been able to read this page, but starting Wednesday, it freezes.
myURL <- "http://www.nasdaq.com/symbol/fb"
webpage <- readLines(myURL)
I've tried:
read_html (rvest)
html_session (rvest), also with the user agent reset; no change
readLines (this used to be all I needed; now it freezes like every other approach)
GET (httr)
getURL (RCurl)
I tried all of these both through RStudio on a Windows box and directly in R on an Ubuntu server. It freezes everywhere.
I poked around with the Chrome Developer Tools Network tab to try to understand why this loads easily in a browser and not at all in R. I didn't see any smoking gun, but I'm not an expert.
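For reference, my httr attempt looked roughly like this (a sketch; the user-agent string is arbitrary, and timeout() is there so the call fails fast instead of hanging):
library(httr)
res <- GET("https://www.nasdaq.com/symbol/fb",
           user_agent("Mozilla/5.0"),
           timeout(10))  # give up after 10 seconds rather than freeze
status_code(res)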
If anyone can figure out how to get the page without it freezing, that is all the help I need to get unstuck. Thanks!
I'm not sure which parts of the webpage you want to collect, but I had success getting some of the vital info with this code:
library(rvest)
library(dplyr)
url <- "https://www.nasdaq.com/symbol/fb"
foo <- read_html(url)
html_nodes(foo, css = "b") %>% html_text()
Are you able to run the code above? Does it give you what you need? Depending on which pieces of data you want from the website, you might need a tool like SelectorGadget to find the CSS selectors you need.
I hope that this helps. If it doesn't, please elaborate.
Specifically I am trying to parse Amazon reviews of a product with the rvest library in R.
library(rvest)

reviews_url <- "https://www.amazon.com/Magic-Bullet-Blender-Small-Silver/product-reviews/B012T634SM/ref=cm_cr_getr_d_paging_btm_1?ie=UTF8&reviewerType=all_reviews&pageNumber=1"
amazon_review <- read_html(reviews_url)
reviewRaw <- amazon_review %>%
html_nodes(".review-text") %>%
html_text()
The problem I am facing is that if I rerun the code, I sometimes get different output, as if it had somehow parsed a different site. Sometimes it is the right output.
How can I fix this?
I already tried using the RSelenium package, loading the page with the WebDriver and giving it time to load, but that does not help.
Interestingly, the output alternates between two variants: either the reviews are parsed correctly or they are not. The wrong variant always looks the same, however.
There is definitely some pattern there, but I just can't get my head around what the problem could be. It might have something to do with the way the reviews are loaded at Amazon?
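As a stopgap I've been considering a retry wrapper along these lines (only a sketch; it assumes the wrong variant simply yields zero .review-text nodes, which may not hold):
library(httr)
library(rvest)

get_reviews <- function(url, max_tries = 5) {
  for (i in seq_len(max_tries)) {
    page <- read_html(GET(url, user_agent("Mozilla/5.0")))
    reviews <- page %>% html_nodes(".review-text") %>% html_text()
    if (length(reviews) > 0) return(reviews)
    Sys.sleep(2)  # back off briefly before retrying
  }
  character(0)
}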
Anyways, I am thankful for any idea to solve this.
Best regards.
I'm trying to use LDAvis for the first time, but have run into the following issue:
After running serVis on my JSON object,
serVis(json, out.dir = 'LDAvis', open.browser = FALSE)
the 5 expected files are created (i.e., d3.v3.js, index.html, lda.css, lda.json, and ldavis.js). As I understand LDAvis, opening the HTML file should launch the interactive viewer. However, doing this opens only a blank webpage.
I've compared the HTML source code with that from LDAvis projects found online, and they are the same. This was built using Christopher Gandrud's script found here, where the LDA results come from the topicmodels package and use the Gibbs method. The underlying data covers ~45K documents with ~15K unique terms. For what it's worth, the lda.json file seems a bit small at ~6MB.
Unfortunately, this issue seems too involved to provide sample data or reproducible code. (If I could isolate the issue further, then perhaps I could add sample code.) Instead, I was hoping readers might have ideas about the cause of this issue, or know whether it has come up before.
Thanks ahead for any feedback!
I've resolved the issue after realizing that most web browsers restrict access to local files. For Chrome, the executable needs to be called with the option "--allow-file-access-from-files". Otherwise, no error is displayed when opening the LDAvis output unless you inspect the HTML elements manually.
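In case it helps others: an alternative that avoids the browser flag entirely is to serve the output directory over localhost, for example with the servr package (assuming you have it installed):
# Serve the LDAvis output directory over a local HTTP server instead of file://
servr::httd("LDAvis")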