R: rvest package read_html() gives different outputs for the same URL

Specifically, I am trying to parse Amazon product reviews with the rvest library in R.
library(rvest)  # also provides the %>% pipe

reviews_url <- "https://www.amazon.com/Magic-Bullet-Blender-Small-Silver/product-reviews/B012T634SM/ref=cm_cr_getr_d_paging_btm_1?ie=UTF8&reviewerType=all_reviews&pageNumber=1"
amazon_review <- read_html(reviews_url)
reviewRaw <- amazon_review %>%
  html_nodes(".review-text") %>%
  html_text()
The problem I am facing is that if I rerun this code I sometimes get different outputs, as if it had parsed a different site. Sometimes the output is correct.
How can I fix this?
I have already tried using the RSelenium package, driving the page with WebDriver and giving it time to load, but that does not help.
Interestingly, the output alternates between two variants: either the reviews are parsed correctly or they are not. The incorrect variant always looks the same, however.
There is definitely some pattern there, but I just can't work out what the problem could be. It might have something to do with the way Amazon loads the reviews?
Anyway, I am thankful for any idea on how to solve this.
Best regards.

Related

Why can't read_csv use my directory/path?

I am having a problem with read_csv. I have used this function before with no problem, but the path/directory I am using is a little different than usual and I can't figure it out by myself.
This is the code I have been using:
library(readr)

X2022_03_08_habit_and_OCD_clinical <- read_csv("Box/OCD: Habit or Learning?/experiment/data/raw/survey-data/2022-03-08_habit-and-OCD_clinical.csv")
I have tried tweaking this by dropping the first two elements of the path, with no luck. Has anyone used Box in a path before? (Box appears in my Finder, just as Desktop would.) I also tried updating R for that first error code, but maybe it didn't take; I am not sure how to update again.
Here is the error message I have been receiving: [screenshot of error message]
I would appreciate any help and I apologize if there is a simple answer I have been missing!

Web Scraping BoardGameGeek with RVest

I'm pretty much brand new to web scraping with rvest, and really new to most everything except Qlik coding.
I am attempting to scrape data found at BoardGameGeek; see the link below. Using Inspect, it certainly seems possible, and yet rvest is not finding the tags. I first thought I had to render the JavaScript using V8 (JavaScript is loaded at the top of the HTML), but when I just use html_text on the whole document, all the information I need is in there.
UPDATE: It appears to be JSON. I used a combination of Notepad++ and a web tool to clean it and load it into R. Any recommendations on tutorials/demos for how to do this systematically? I have all the links I need to loop through, but I am not sure how to go from the html_text output to clean JSON input via code.
I have provided examples below, but I need to scrape the majority of the available data elements, so I am not looking for code to copy and paste but rather for the best method to pursue. See below.
Link: https://boardgamegeek.com/boardgame/63888/innovation
HTML example I am trying to pull from. The span returns nothing with html_nodes, so I couldn't even start there.
<span ng-if="min > 0" class="ng-binding ng-scope">45</span>
OR
<a title="Civilization" ng-href="/boardgamecategory/1015/civilization" class="ng-binding" href="/boardgamecategory/1015/civilization">Civilization</a>
JavaScript sections at the top of the page look like this; there are about 8 of them:
<script type="text/javascript" src="https://cf.geekdo-static.com/static/geekcollection_master2_5e84926ab7e90.js"></script>
When I just use html_text on the whole object, I can see all the elements I am looking for, e.g.:
\"minplaytime\":\"30\" OR {\"name\":\"Deck, Bag, and Pool Building\"
I'm assuming this is JSON? Is there a way to parse the html_text output, or another method? Is it easier to just run the JavaScript at the top of the page using V8? Is there an easy guide for this?
Are you aware that BGG has an API? Documentation can be found here: URL
The data is provided as an XML file. For your example you can get the ID of your game from the URL; in this case it is 63888. The XML file can then be found at: https://www.boardgamegeek.com/xmlapi2/thing?id=63888
You can read the info with this code:
library(dplyr)
library(rvest)
library(xml2)  # provides read_xml()

game_data <- read_xml("https://www.boardgamegeek.com/xmlapi2/thing?id=63888")
game_data %>%
  html_nodes("name[type=primary]") %>%  # the game's primary-name node
  html_attr("value") %>%                # its "value" attribute holds the title
  as.character()
#> [1] "Innovation"
By inspecting the XML file you can choose which nodes you want to extract.
Created on 2020-04-06 by the reprex package (v0.3.0)

rvest/httr: automating downloads from a nesstar webpage

I'm working on scripting some dataset downloads in R from the Center for Survey and Survey/Registrar data, this nesstar-based data archive: http://cssr.surveybank.aau.dk/webview
Poking around, I've found there are bookmarkable links for each dataset in each format, e.g., http://cssr.surveybank.aau.dk/webview/velocity?format=STATA&includeDocumentation=on&execute=&ddiformat=pdf&study=http%3A%2F%2F172.18.36.233%3A80%2Fobj%2FfStudy%2FElectionStudy-1973&analysismode=table&v=2&mode=download
There's no username or password required to use the site, so that's one bullet dodged. But the next step is to click the "Download" button, and that's where I'm stumped. The question Using R to "click" a download file button on a webpage sounds like it should be right on point, but that webpage actually isn't similar. Unlike that one, this button is not part of a form, so my efforts using html_form() and submit_form() predictably got nowhere. (And it's not a link, so of course follow_link() won't work either.) The following gets me to the right node but doesn't actually click the button.
library(magrittr)
library(rvest)
url <- "http://cssr.surveybank.aau.dk/webview/velocity?format=STATA&includeDocumentation=on&execute=&ddiformat=pdf&study=http%3A%2F%2F172.18.36.233%3A80%2Fobj%2FfStudy%2FElectionStudy-1973&analysismode=table&v=2&mode=download"
s <- html_session(url)
download_button <- s %>% html_node(".button")
Now that RSelenium is back on CRAN (yay!), I suppose I could go in that direction instead, but I'd really prefer an rvest- or httr-based solution. If anyone could help, I'd really appreciate it.

html_session, read_html, readLines, GET, getURL all freeze

For months I have been able to read this page, but starting Wednesday, it freezes.
myURL <- "http://www.nasdaq.com/symbol/fb"
webpage <- readLines(myURL)
I've tried:
- read_html (rvest)
- html_session (rvest), also resetting the user agent - no change
- readLines - this used to be all I needed; now it freezes like every other approach
- GET (httr)
- getURL (RCurl)
I tried all of these both through RStudio on a Windows box and directly in R on an Ubuntu server. It freezes everywhere.
I poked around with the Chrome developer tools' Network tab to try to understand why the page loads easily in a browser and not at all in R. I didn't see any smoking gun, but I'm not an expert.
If anyone can figure out how to get the page without it freezing, that is all the help I need to get unstuck. Thanks!
I'm not sure which parts of the webpage you want to collect, but I had success getting some of the vital info with this code:
library(rvest)
library(dplyr)
url <- "https://www.nasdaq.com/symbol/fb"
foo <- read_html(url)
html_nodes(foo, css = "b") %>% html_text()
Are you able to run the code above? Does it give you what you need? Depending on which pieces of data you want from the website, you might need a tool like SelectorGadget to find the right CSS selectors.
I hope that this helps. If it doesn't, please elaborate.

R function example requires nonstandard dataset, doesn't jibe with devtools

I've been struggling to get the example code for a function to pass devtools::check(), because the data required for the example is not in .RData format. Unfortunately, given the way the function is written, an .RData file cannot simply be loaded and work properly: the function takes a list of filenames and performs an action on them collectively.
Therefore, the example code must be written in a way that lets check() access a folder and list the files therein. Using the function on my own computer, I input
setwd("/Users/mydirectory")
myfilelist <- list.files(pattern = "mypattern")
output <- myfunction(myfilelist, ...)
and everything is groovy. But this doesn't work with devtools because @examples doesn't know how to access subdirectories on my computer. check() throws the following error:
base::assign(".ptime", proc.time(), pos = "CheckExEnv")
This is almost undoubtedly because check() doesn't know where to look for the data. I'd like it to look to GitHub to access the online data repository.
I found this brief conversation regarding a similar roxygen-related problem, but overall I haven't seen much advice on how to work through it. I think this issue perhaps comes a little closer to my situation, but there the user had failed to export a function rather than bind data to an example.
I don't think I'm looking for a pull function (though the end goal is to pull data...). Does anyone have advice on how to move forward? I have the data stored in the inst/extdata folder on GitHub, so while I don't really have something reproducible for you all, I'm hoping you might have some thoughts.
Edit: I worked around the problem using @alistaire's advice below, pointing roxygen to the package directory (updated on GitHub) and also using \dontrun{}. However, I am leaving the question unanswered for now because I think accessing data stored on GitHub should still be possible somehow, and we haven't yet addressed that.
