Google Sheets =IMPORTHTML runs on load then "could not fetch url"

I am trying to make a market calculator for Old School RuneScape and I'm running into problems with the IMPORTHTML function of Google Sheets.
I use the following to pull data from an online table.
=IMPORTHTML("http://services.runescape.com/m=itemdb_oldschool/results?query=bronze&minPrice=0&maxPrice=-1&members=no&page=1#main-search", "table", 1)
This works; however, after a few minutes it stops updating and throws
"Could not fetch url: http://services.runescape.com/m=itemdb_oldschool/results?query=iron&minPrice=0&maxPrice=-1&members=no&page=2#main-search"
I've found that by deleting the "=" from the start of the cell and then adding it back in, the data will repopulate.
Is there a way to force update this function? Or perhaps a different function that would work better for this application?

Try it like this:
=IFERROR(
  IMPORTHTML("http://services.runescape.com/m=itemdb_oldschool/results?query=bronze&minPrice=0&maxPrice=-1&members=no&page=1#main-search", "table", 1),
  IMPORTHTML("http://services.runescape.com/m=itemdb_oldschool/results?query=bronze&minPrice=0&maxPrice=-1&members=no&page=1#main-search", "table", 1))
or even:
=IFERROR(IFERROR(
  IMPORTHTML("http://services.runescape.com/m=itemdb_oldschool/results?query=bronze&minPrice=0&maxPrice=-1&members=no&page=1#main-search", "table", 1),
  IMPORTHTML("http://services.runescape.com/m=itemdb_oldschool/results?query=bronze&minPrice=0&maxPrice=-1&members=no&page=1#main-search", "table", 1)),
  IMPORTHTML("http://services.runescape.com/m=itemdb_oldschool/results?query=bronze&minPrice=0&maxPrice=-1&members=no&page=1#main-search", "table", 1))

Related

R - Extract CSV file from javascript link via RCurl

I have a url:
url <- "http://www.railroadpm.org/home/RPM/Performance%20Reports/BNSF.aspx"
that contains a link to a csv file that I would like to download: the "Export to CSV" link on the above page. The problem is that the csv file is not exposed as a plain url; the link is javascript. What I would like to do is access the link and create a dataframe out of the csv file. The javascript is:
javascript:__doPostBack('ctl11$btnCSV','')
and from that I can tell that the id is
"ctl11_btnCSV"
but I am unsure of how this fits into RCurl, which from SO seems to be the best way to access this data. Any help would be appreciated.
Thanks.
There was zero effort put into this question (especially since the OP concluded that RCurl is the current best practice for web wrangling in R), but any time an SO web-scraping question involving a SharePoint site can actually be answered (Microsoft SharePoint is one of the worst things ever invented, next to Windows), it's worth posting an answer.
library(rvest)
library(httr)
# make an initial connection to get cookies
httr::GET(
  "http://www.railroadpm.org/home/RPM/Performance%20Reports/BNSF.aspx"
) -> res
# retrieve some hidden bits we need to pass b/c SharePoint is a wretched thing.
pg <- content(res, as = "parsed")
for_post <- html_nodes(pg, "input[type='hidden']")
# post the hidden form (echoing the ASP.NET state fields) & save out the CSV
httr::POST(
  "http://www.railroadpm.org/home/RPM/Performance%20Reports/BNSF.aspx",
  body = as.list(
    c(
      setNames(
        html_attr(for_post, "value"),
        html_attr(for_post, "id")
      ),
      `__EVENTTARGET` = "ctl11$btnCSV"
    )
  ),
  write_disk("measures.csv"),
  progress()
) -> res
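From there, the data frame the OP asked for is just a read.csv() away; a small follow-up (hedged: the exact layout of the exported file may need header or skip tweaks):
# load the downloaded export into a data frame; adjust header/skip
# if the file carries title rows above the actual data
bnsf <- read.csv("measures.csv", stringsAsFactors = FALSE)
head(bnsf)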

scraping an interactive table in R with rvest

I'm trying to scrape the scrolling table from the following link: http://proximityone.com/cd114_2013_2014.htm
I'm using rvest but am having trouble finding the correct xpath for the table. My current code is as follows:
url <- "http://proximityone.com/cd114_2013_2014.htm"
table <- gis_data_html %>%
html_node(xpath = '//span') %>%
html_table()
Currently I get the error "no applicable method for 'html_table' applied to an object of class "xml_missing""
Anyone know what I would need to change to scrape the interactive table in the link?
So the problem you're facing is that rvest will read the source of a page, but it won't execute the javascript on the page. When I inspect the interactive table, I see
<textarea id="aw52-box-focus" class="aw-control-focus " tabindex="0"
onbeforedeactivate="AW(this,event)" onselectstart="AW(this,event)"
onbeforecopy="AW(this,event)" oncut="AW(this,event)" oncopy="AW(this,event)"
onpaste="AW(this,event)" style="z-index: 1; width: 100%; height: 100%;">
</textarea>
but when I look at the page source, "aw52-box-focus" doesn't exist. This is because it's created as the page loads via javascript.
You have a couple of options to deal with this. The 'easy' one is to use RSelenium to drive an actual browser, load the page, and grab the element after it has rendered. The other option is to read through the javascript, see where it's getting the data from, and tap into that source directly rather than scraping the table.
UPDATE
Turns out it's really easy to read the javascript - it's just loading a CSV file. The address is in plain text, http://proximityone.com/countytrends/cd114_acs2014utf8_hl.csv
The .csv doesn't have column headers, but those are in the <script> as well
var columns = [
"FirstNnme",
"LastName",
"Party",
"Feature",
"St",
"CD",
"State<br>CD",
"State<br>CD",
"Population<br>2013",
"Population<br>2014",
"PopCh<br>2013-14",
"%PopCh<br>2013-14",
"MHI<br>2013",
"MHI<br>2014",
"MHI<br>Change<br>2013-14",
"%MHI<br>Change<br>2013-14",
"MFI<br>2013",
"MFI<br>2014",
"MFI<br>Change<br>2013-14",
"%MFI<br>Change<br>2013-14",
"MHV<br>2013",
"MHV<br>2014",
"MHV<br>Change<br>2013-14",
"%MHV<br>Change<br>2013-14",
]
Programmatic Solution
Instead of digging through the javascript by hand (useful if there are several such pages on this site you want), you can attempt this programmatically too. We read the page, get the <script> nodes, get the "text" (the script itself), and look for references to a csv file. Then we expand out the relative URL and read it in. This doesn't pick up the column names, but they shouldn't be too hard to extract either; a sketch of that follows the code below.
library(rvest)
page <- read_html("http://proximityone.com/cd114_2013_2014.htm")
# pull the text of every <script> node and keep the one(s) that reference a .csv
scripts <- page %>%
  html_nodes("script") %>%
  html_text() %>%
  grep("\\.csv", ., value = TRUE)
# expand the relative "../" path into a full URL and read the file (no headers)
relCSV <- stringr::str_extract(scripts, "\\.\\./.*?csv")
fullCSV <- gsub("\\.\\.", "http://proximityone.com", relCSV)
data <- read.csv(fullCSV, header = FALSE)
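And the promised sketch for the column names (this assumes the var columns = [...] block sits in the same script text captured in scripts above, which is worth checking; only as many names as both sides share get applied):
# grab the "var columns = [...]" block, pull out every quoted string,
# and strip the <br> markup before using them as column names
cols_block <- stringr::str_extract(scripts[1], "var columns\\s*=\\s*\\[[^\\]]*\\]")
col_names <- stringr::str_extract_all(cols_block, '"[^"]*"')[[1]]
col_names <- gsub('"', "", gsub("<br>", " ", col_names))
k <- min(length(col_names), ncol(data))
names(data)[seq_len(k)] <- col_names[seq_len(k)]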

Trouble fetching data from DP04 table using acs.R

I am using the acs.R package and I am having trouble collecting data from the DP tables and S tables. The tables beginning with B are fine though. Here is an example of my code and the error I receive:
national = geo.make(us="*")
Race_US <- acs.fetch(endyear = 2015, span = 1, geography = national,
                     table.number = "DP04", col.names = "pretty")
Warning message:
In (function (endyear, span = 5, dataset = "acs", keyword, table.name, :
Sorry, no tables/keyword meets your search.
Suggestions:
try with 'case.sensitive=F',
remove search terms,
change 'keyword' to 'table.name' in search (or vice-versa)
For some reason it is unable to find the table. I have tried acs.lookup with various keywords that should work and still nothing.
Thanks for using the acs.R package.
The problem here is with the "DP" tables: although they are available through the census api, they are not fetched via the acs.R package, since they are in a different format -- not really "raw data" as much as pre-formatted tables made from data found in other places. That said, you should be able to find the underlying data in other tables that are available with acs.fetch.
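For example, a hedged sketch of the same request pointed at a B-series housing table (B25024, "Units in Structure", is only an illustrative pick; swap in whichever underlying table carries the estimates you need, and note this assumes api.key.install() has already been run):
library(acs)
national <- geo.make(us = "*")
# same call shape as the DP04 attempt, just aimed at a raw-data table
units_us <- acs.fetch(endyear = 2015, span = 1, geography = national,
                      table.number = "B25024", col.names = "pretty")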

Why is my Rfacebook loop script not working when there is a post with zero comments?

I've edited my question to be more relevant
It's been less than a month since I started learning R, and I'm trying to use it to get rid of the tedious work related to Facebook (extracting comments) that we do for our reports.
Using the Rfacebook package, I made this script which extracts (1) the posts of the page for a given period and (2) the comments on those posts. It worked well for the page I'm doing the report for, but when I tried it on other pages with posts that had zero comments, it reported an error.
Here's the script:
Loading libraries
library(Rfacebook)
library(lubridate)
library(tibble)
Setting time period. Change time as you please.
current_date <- Sys.Date()
past30days <- current_date - 30
Assigning a page. Edit this to the page you are monitoring.
brand <- 'bpi'
Authenticating Facebook. Use your own credentials.
app_id <- "xxxxxxxxxxxxxxxx"
app_secret <- "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
token <- fbOAuth(app_id, app_secret, extended_permissions = FALSE, legacy_permissions = FALSE)
Extract all posts from a page
listofposts <- getPage(brand, token, n = 5000, since = past30days, until = current_date, feed = FALSE, reactions = FALSE, verbose=TRUE)
write.csv(listofposts,file = paste0('AsOf',current_date,brand,'Posts','.csv'))
Convert to a data frame
df <- as_tibble(listofposts)
Convert to a vector
postidvector <- df[["id"]]
Get the number of posts in the period
n <- length(postidvector)
Produce all comments via loop
reactions <- vector("list", n)
for (i in 1:n) {
  reactions[[i]] <- assign(paste(brand, 'Comments', i, sep = ""),
                           getPost(postidvector[i], token, comments = TRUE, likes = FALSE,
                                   n.likes = 5000, n.comments = 10000))
}
Extract all comments per post to CSV
for (j in 1:n) {
  write.csv(reactions[[j]], file = paste0('AsOf', current_date, brand, 'Comments', j, '.csv'))
}
Here's the error when exporting the comments to CSV on pages whose posts had ZERO comments:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 1, 0
I tried it on a heavy traffic page, and it worked fine too. One post had 10,000 comments and it extracted just fine. :(
Thanks in advance! :D
Pages can be restricted by age or location. You can't use an App Access Token for those, because it does not include a user session, so Facebook does not know whether you are allowed to see the Page content. You will have to use a User Token or Page Token for those.
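If the failure really is the zero-comment posts rather than permissions, a minimal guard around the export loop might look like this (a sketch, assuming getPost() returns a list whose $comments element is empty when a post has no comments):
# only write a CSV when the post actually has comments, so write.csv()
# never has to coerce a list with mismatched row counts (1 post row, 0 comment rows)
for (j in 1:n) {
  cmts <- reactions[[j]]$comments
  if (!is.null(cmts) && nrow(cmts) > 0) {
    write.csv(cmts, file = paste0('AsOf', current_date, brand, 'Comments', j, '.csv'))
  }
}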

automating the login to the uk data service website in R with RCurl or httr

I am in the process of writing a collection of freely-downloadable R scripts for http://asdfree.com/ to help people analyze the complex sample survey data hosted by the UK data service. In addition to providing lots of statistics tutorials for these data sets, I also want to automate the download and importation of this survey data. In order to do that, I need to figure out how to programmatically log into this UK data service website.
I have tried lots of different configurations of RCurl and httr to log in, but I'm making a mistake somewhere and I'm stuck. I have tried inspecting the elements as outlined in this post, but the websites jump around too fast in the browser for me to understand what's going on.
This website does require a login and password, but I believe I'm making a mistake before I even get to the login page.
Here's how the website works:
The starting page should be: https://www.esds.ac.uk/secure/UKDSRegister_start.asp
This page will automatically re-direct your web browser to a long URL that starts with: https://wayf.ukfederation.org.uk/DS002/uk.ds?[blahblahblah]
(1) For some reason, the SSL certificate does not work on this website. Here's the SO question I posted regarding this. The workaround I've used is simply ignoring the SSL:
library(httr)
set_config( config( ssl.verifypeer = 0L ) )
and then my first command on the starting website is:
z <- GET( "https://www.esds.ac.uk/secure/UKDSRegister_start.asp" )
this gives me back a z$url that looks a lot like the https://wayf.ukfederation.org.uk/DS002/uk.ds?[blahblahblah] page that my browser also re-directs to.
In the browser, then, you're supposed to type in "uk data archive" and click the continue button. When I do that, it re-directs me to the web page https://shib.data-archive.ac.uk/idp/Authn/UserPassword
I think this is where I'm stuck because I cannot figure out how to have cURL followlocation and land on this website. Note: no username/password has been entered yet.
When I use the httr GET command from the wayf.ukfederation.org.uk page like this:
y <- GET( z$url , query = list( combobox = "https://shib.data-archive.ac.uk/shibboleth-idp" ) )
the y$url string looks a lot like z$url (except it's got a combobox= on the end). Is there any way to get through to this UK Data Archive authentication page with RCurl or httr?
I can't tell if I'm just overlooking something or if I absolutely must use the SSL certificate described in my previous SO post or what?
(2) At the point I do make it through to that page, I believe the remainder of the code would just be:
values <- list(j_username = "your.username",
               j_password = "your.password")
POST("https://shib.data-archive.ac.uk/idp/Authn/UserPassword", body = values)
But I guess that page will have to wait...
The relevant form variables are action and origin, not combobox. Give action the value "selection" and origin the value from the relevant entry in the combobox:
y <- GET(z$url, query = list(action = "selection", origin = "https://shib.data-archive.ac.uk/shibboleth-idp"))
> y$url
[1] "https://shib.data-archive.ac.uk:443/idp/Authn/UserPassword"
Edit
It looks as though the handle pool isn't keeping your session alive correctly, so you need to pass the handle explicitly rather than letting httr manage it. Also, for the POST command you need to set multipart=FALSE, since a plain url-encoded body is what HTML forms send by default; httr's POST defaults differently because it is mainly designed for uploading files. So:
y <- GET(handle = z$handle, query = list(action = "selection", origin = "https://shib.data-archive.ac.uk/shibboleth-idp"))
POST(body = values, multipart = FALSE, handle = y$handle)
Response [https://www.esds.ac.uk/]
Status: 200
Content-type: text/html
...snipped...
<title>
Introduction to ESDS
</title>
<meta name="description" content="Introduction to the ESDS, home page" />
I think one way to address the "enter your organization" page goes like this:
library(tidyverse)
library(rvest)
library(stringr)
org <- "your_organization"
user <- "your_username"
password <- "your_password"
signin <- "http://esds.ac.uk/newRegistration/newLogin.asp"
handle_reset(signin)
# get to org page and enter org
p0 <- html_session(signin) %>%
  follow_link("Login")
org_link <- html_nodes(p0, "option") %>%
  str_subset(org) %>%
  str_match('(?<=\\")[^"]*') %>%
  as.character()
f0 <- html_form(p0) %>%
  first() %>%
  set_values(origin = org_link)
fake_submit_button <- list(name = "submit-btn",
                           type = "submit",
                           value = "Continue",
                           checked = NULL,
                           disabled = NULL,
                           readonly = NULL,
                           required = FALSE)
attr(fake_submit_button, "class") <- "btn-enabled"
f0[["fields"]][["submit"]] <- fake_submit_button
c0 <- cookies(p0)$value
names(c0) <- cookies(p0)$name
p1 <- submit_form(session = p0, form = f0, config = set_cookies(.cookies = c0))
Unfortunately, that doesn't solve the whole problem—(2) is harder than it looks. I've got more of what I think is a solution posted here: R: use rvest (or httr) to log in to a site requiring cookies. Hopefully someone will help us get the rest of the way.
