I would like to use RCurl as a polite webcrawler to download data from a website.
Obviously I need the data for scientific research. Although I have the rights to access the content of the website via my university, the terms of use of the website forbid the use of webcrawlers.
I tried to ask the administrator of the site directly for the data but they only replied in a very vague fashion. Well anyway it seems like they won’t simply send the underlying databases to me.
What I want to do now is ask them officially to get the one-time permission to download specific text-only content from their site using an R code based on RCurl that includes a delay of three seconds after each request has been executed.
The address of the sites that I want to download data from work like this:
http://plants.jstor.org/specimen/ID of the site
I tried to program it with RCurl but I cannot get it done.
A few things complicate things:
One can only access the website if cookies are allowed (I got that working in RCurl with the cookiefile-argument).
The Next-button only appears in the source code when one actually accesses the site by clicking through the different links in a normal browser.
In the source code the Next-button is encoded with an expression including
Next > >
When one tries to access the site directly (without having clicked through to it in the same browser before), it won't work, the line with the link is simply not in the source code.
The IDs of the sites are combinations of letters and digits (like “goe0003746” or “cord00002203”), so I can't simply write a for-loop in R that tries every number from 1 to 1,000,000.
So my program is supposed to mimic a person that clicks through all the sites via the Next-button, each time saving the textual content.
Each time after saving the content of a site, it should wait three seconds before clicking on the Next-button (it must be a polite crawler). I got that working in R as well using the Sys.sleep function.
I also thought of using an automated program, but there seem to be a lot of such programs and I don’t know which one to use.
I’m also not exactly the program-writing person (apart from a little bit of R), so I would really appreciate a solution that doesn’t include programming in Python, C++, PHP or the like.
Any thoughts would be much appreciated! Thank you very much in advance for comments and proposals !!
Try a different strategy.
##########################
####
#### Scrape http://plants.jstor.org/specimen/
#### Idea:: Gather links from http://plants.jstor.org/search?t=2076
#### Then follow links:
####
#########################
library(RCurl)
library(XML)
### get search page::
cookie = 'cookiefile.txt'
curl = getCurlHandle ( cookiefile = cookie ,
useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en - US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6",
header = F,
verbose = TRUE,
netrc = TRUE,
maxredirs = as.integer(20),
followlocation = TRUE)
querry.jstor <- getURL('http://plants.jstor.org/search?t=2076', curl = curl)
## remove white spaces:
querry.jstor2 <- gsub('\r','', gsub('\t','', gsub('\n','', querry.jstor)))
### get links from search page
getLinks = function() {
links = character()
list(a = function(node, ...) {
links <<- c(links, xmlGetAttr(node, "href"))
node
},
links = function()links)
}
## retrieve links
querry.jstor.xml.parsed <- htmlTreeParse(querry.jstor2, useInt=T, handlers = h1)
## cleanup links to keep only the one we want.
querry.jstor.links = NULL
querry.jstor.links <- c(querry.jstor.links, querry.jstor.xml.parsed$links()[-grep('http', querry.jstor.xml.parsed$links())]) ## remove all links starting with http
querry.jstor.links <- querry.jstor.links[-grep('search', querry.jstor.links)] ## remove all search links
querry.jstor.links <- querry.jstor.links[-grep('#', querry.jstor.links)] ## remove all # links
querry.jstor.links <- querry.jstor.links[-grep('javascript', querry.jstor.links)] ## remove all javascript links
querry.jstor.links <- querry.jstor.links[-grep('action', querry.jstor.links)] ## remove all action links
querry.jstor.links <- querry.jstor.links[-grep('page', querry.jstor.links)] ## remove all page links
## number of results
jstor.article <- getNodeSet(htmlTreeParse(querry.jstor2, useInt=T), "//article")
NumOfRes <- strsplit(gsub(',', '', gsub(' ', '' ,xmlValue(jstor.article[[1]][[1]]))), split='')[[1]]
NumOfRes <- as.numeric(paste(NumOfRes[1:min(grep('R', NumOfRes))-1], collapse = ''))
for(i in 2:ceiling(NumOfRes/20)){
querry.jstor <- getURL('http://plants.jstor.org/search?t=2076&p=',i, curl = curl)
## remove white spaces:
querry.jstor2 <- gsub('\r','', gsub('\t','', gsub('\n','', querry.jstor)))
querry.jstor.xml.parsed <- htmlTreeParse(querry.jstor2, useInt=T, handlers = h1)
querry.jstor.links <- c(querry.jstor.links, querry.jstor.xml.parsed$links()[-grep('http', querry.jstor.xml.parsed$links())]) ## remove all links starting with http
querry.jstor.links <- querry.jstor.links[-grep('search', querry.jstor.links)] ## remove all search links
querry.jstor.links <- querry.jstor.links[-grep('#', querry.jstor.links)] ## remove all # links
querry.jstor.links <- querry.jstor.links[-grep('javascript', querry.jstor.links)] ## remove all javascript links
querry.jstor.links <- querry.jstor.links[-grep('action', querry.jstor.links)] ## remove all action links
querry.jstor.links <- querry.jstor.links[-grep('page', querry.jstor.links)] ## remove all page links
Sys.sleep(abs(rnorm(1, mean=3.0, sd=0.5)))
}
## make directory for saving data:
dir.create('./jstorQuery/')
## Now we have all the links, so we can retrieve all the info
for(j in 1:length(querry.jstor.links)){
if(nchar(querry.jstor.links[j]) != 1){
querry.jstor <- getURL('http://plants.jstor.org',querry.jstor.links[j], curl = curl)
## remove white spaces:
querry.jstor2 <- gsub('\r','', gsub('\t','', gsub('\n','', querry.jstor)))
## contruct name:
filename = querry.jstor.links[j][grep( '/', querry.jstor.links[j])+1 : nchar( querry.jstor.links[j])]
## save in directory:
write(querry.jstor2, file = paste('./jstorQuery/', filename, '.html', sep = '' ))
Sys.sleep(abs(rnorm(1, mean=3.0, sd=0.5)))
}
}
I may be missing exactly the bit you are hung up on, but it sounds like you are almost there.
It seems you can request page 1 with cookies on. Then parse the content searching for the next site ID, then request that page by building the URL with the next site ID. Then scrape whatever data you want.
It sounds like you have code that does almost all of this. Is the problem parsing page 1 to get the ID for the next step? If so, you should formulate a reproducible example and I suspect you'll get a very fast answer to your syntax problems.
If you're having trouble seeing what the site is doing, I recommend using the Tamper Data plug in for Firefox. It will let you see what request is being made at each mouse click. I find it really useful for this type of thing.
Related
There are 2 parts of my questions as I explored 2 methods in this exercise, however I succeed in none. Greatly appreciated if someone can help me out.
[PART 1:]
I am attempting to scrape data from a webpage on Singapore Stock Exchange https://www2.sgx.com/derivatives/negotiated-large-trade containing data stored in a table. I have some basic knowledge of scraping data using (rvest). However, using Inspector on chrome, the html hierarchy is much complex then I expected. I'm able to see that the data I want is hidden under < div class= "table-container" >,and here's what I've tied:
library(rvest)
library(httr)
library(XML)
SGXurl <- "https://www2.sgx.com/derivatives/negotiated-large-trade"
SGXdata <- read_html(SGXurl, stringsASfactors = FALSE)
html_nodes(SGXdata,".table-container")
However, nothing has been picked up by the code and I'm doubt if I'm using these code correctly.
[PART 2:]
As I realize that there's a small "download" button on the page which can download exactly the data file i want in .csv format. So i was thinking to write some code to mimic the download button and I found this question Using R to "click" a download file button on a webpage, but i'm unable to get it to work with some modifications to that code.
There's a few filtera on the webpage, mostly I will be interested downloading data for a particular business day while leave other filters blank, so i've try writing the following function:
library(httr)
library(rvest)
library(purrr)
library(dplyr)
crawlSGXdata = function(date){
POST("https://www2.sgx.com/derivatives/negotiated-large-trade",
body = NULL
encode = "form",
write_disk("SGXdata.csv")) -> resfile
res = read.csv(resfile)
return(res)
}
I was intended to put the function input "date" into the “body” argument, however i was unable to figure out how to do that, so I started off with "body = NULL" by assuming it doesn't do any filtering. However, the result is still unsatisfactory. The file download is basically empty with the following error:
Request Rejected
The requested URL was rejected. Please consult with your administrator.
Your support ID is: 16783946804070790400
The content is loaded dynamically from an API call returning json. You can find this in the network tab via dev tools.
The following returns that content. I find the total number of pages of results and loop combining the dataframe returned from each call into one final dataframe containing all results.
library(jsonlite)
url <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=0&pageSize=250'
r <- jsonlite::fromJSON(url)
num_pages <- r$meta$totalPages
df <- r$data
url2 <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=placeholder&pageSize=250'
if(num_pages > 1){
for(i in seq(1, num_pages)){
newUrl <- gsub("placeholder", i , url2)
newdf <- jsonlite::fromJSON(newUrl)$data
df <- rbind(df, newdf)
}
}
If you go to the website https://www.myfxbook.com/members/iseasa_public1/rush/2531687 then click that dropdown box Export, then choose CSV, you will be taken to https://www.myfxbook.com/statements/2531687/statement.csv and the download (from the browser) will proceed automatically. The thing is, you need to be logged in to https://www.myfxbook.com in order to receive the information; otherwise, the file downloaded will contain the text "Please login to Myfxbook.com to use this feature".
I tried using read.csv to get the csv file in R, but only got that "Please login" message. I believe R has to simulate an html session (whatever that is, I am not sure about this) so that access will be granted. Then I tried some scraping tools to login first, but to no avail.
library(rvest)
login <- "https://www.myfxbook.com"
pgsession <- html_session(login)
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform, loginEmail = "*****", loginPassword = "*****") # loginEmail and loginPassword are the names of the html elements
submit_form(pgsession, filled_form)
url <- "https://www.myfxbook.com/statements/2531687/statement.csv"
page <- jump_to(pgsession, url) # page will contain 48 bytes of data (in the 'content' element), which is the size of that warning message, though I could not access this content.
From the try above, I got that page has an element called cookies which in turns contains JSESSIONID. From my research, it seems this JSESSIONID is what "proves" I am logged in to that website. Nonetheless, downloading the CSV does not work.
Then I tried:
library(RCurl)
h <- getCurlHandle(cookiefile = "")
ans <- getForm("https://www.myfxbook.com", loginEmail = "*****", loginPassword = "*****", curl = h)
data <- getURL("https://www.myfxbook.com/statements/2531687/statement.csv", curl = h)
data <- getURLContent("https://www.myfxbook.com/statements/2531687/statement.csv", curl = h)
It seems these libraries were built to scrape html pages and do not deal with files in other formats.
I would pretty much appreciate any help as I've been trying to make this work for quite some time now.
Thanks.
I would like to use a website from R. The website is http://soundoftext.com/ where I can download WAV. files with audios from a given text and a language (voice).
There are two steps to download the voice in WAV:
1) Insert text and Select language. And Submit
2) On the new window, click Save and select folder.
Until now, I could get the xml tree, convert it to list and modify the values of text and language. However, I don't know how to convert the list to XML (with the new values) and execute it. Then, I would need to do the second step too.
Here is my code so far:
require(RCurl)
require(XML)
webpage <- getURL("http://soundoftext.com/")
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
x<-xmlToList(pagetree)
# Inserting word
x$body$div$div$div$form$div$label$.attrs[[1]]<-"Raúl"
x$body$div$div$div$form$div$label$.attrs[[1]]
# Select language
x$body$div$div$div$form$div$select$option$.attrs<-"es"
x$body$div$div$div$form$div$select$option$.attrs
I have follow this approach but there is an error with "tag".
UPDATED: I just tried to use rvest to download the audio file, however, it does not respond or trigger anything. What am I doing wrong (missing)?
url <- "http://soundoftext.com/"
s <- html_session(url)
f0 <- html_form(s)
f1 <- set_values(f0[[1]], text="Raúl", lang="es")
attr(f1, "type") <- "Submit"
s[["fields"]][["submit"]] <- f1
attr(f1, "Class") <- "save"
test <- submit_form(s, f1)
I see nothing wrong with your approach and it was worth a try.. that's what I'd write too.
The page is somewhat annoying in that uses jquery to append new divs at each request. I still think that should be possible to do with rvest, but I found a fun workaround using the httr package:
library(httr)
url <- "http://soundoftext.com/sounds"
fd <- list(
submit = "save",
text = "Banana",
lang="es"
)
resp<-POST(url, body=fd, encode="form")
id <- content(resp)$id
download.file(URLencode(paste0("http://soundoftext.com/sounds/", id)), destfile = 'test.mp3')
Essentially when it send the POST request to the server, an ID come back, if we simply GET that id when can download the file.
Creator of Sound of Text here. Sorry it took so long for me to find this post.
I just redesigned Sound of Text, so your html parsing probably won't work anymore.
However, there is now an API that you can use which should make things considerably easier for you.
You can find the documentation here: https://soundoftext.com/docs
I apologize if it's not very good. Please let me know if you have any questions.
I'm stuck on this one after much searching....
I started with scraping the contents of a table from:
http://www.skatepress.com/skates-top-10000/artworks/
Which is easy:
data <- data.frame()
for (i in 1:100){
print(paste("page", i, "of 100"))
url <- paste("http://www.skatepress.com/skates-top-10000/artworks/", i, "/", sep = "")
temp <- readHTMLTable(stringsAsFactors = FALSE, url, which = 1, encoding = "UTF-8")
data <- rbind(data, temp)
} # end of scraping loop
However, I need to additionally scrape the detail that is contained in a pop-up box when you click on each name (and on the artwork title) in the list on the site.
I can't for the life of me figure out how to pass the breadcrumb (or artist-id or painting-id) through in order to make this happen. Since straight up using rvest to access the contents of the nodes doesn't work, I've tried the following:
I tried passing the painting id through in the url like this:
url <- ("http://www.skatepress.com/skates-top-10000/artworks/?painting_id=576")
site <- html(url)
But it still gives an empty result when scraping:
node1 <- "bread-crumb > ul > li.activebc"
site %>% html_nodes(node1) %>% html_text(trim = TRUE)
character(0)
I'm (clearly) not a scraping expert so any and all assistance would be greatly appreciated! I need a way to capture this additional information for each of the 10,000 items on the list...hence why I'm not interested in doing this manually!
Hoping this is an easy one and I'm just overlooking something simple.
This will be a more efficient base scraper and you can get progress bars for free with the pbapply package:
library(xml2)
library(httr)
library(rvest)
library(dplyr)
library(pbapply)
library(jsonlite)
base_url <- "http://www.skatepress.com/skates-top-10000/artworks/%d/"
n <- 100
bind_rows(pblapply(1:n, function(i) {
mutate(html_table(html_nodes(read_html(sprintf(base_url, i)), "table"))[[1]],
`Sale Date`=as.Date(`Sale Date`, format="%m.%d.%Y"),
`Premium Price USD`=as.numeric(gsub(",", "", `Premium Price USD`)))
})) -> skatepress
I added trivial date & numeric conversions.
I belive your main issue is that the site requires a login to get the additional data. You should give that (i.e. logging in) a shot using httr and grab the wordpress_logged_inXXXXXXX… cookie from that endeavour. I just grabbed it from inspecting the session with Developer Tools in Chrome and that will also work for you (but it's worth the time to learn how to do it via httr).
You'll need to scrape two additional <a … tags from each table row. The one for "artist" looks like:
Pablo Picasso
You can scrape the contents with:
POST("http://www.skatepress.com/wp-content/themes/skatepress/scripts/query_artist.php",
set_cookies(wordpress_logged_in_XXX="userid%XXXXXreallylongvalueXXXXX…"),
encode="form",
body=list(id="pab_pica_1881"),
verbose()) -> artist_response
fromJSON(content(artist_response, as="text"))
(The return value is too large to post here)
The one for "artwork" looks like:
Les femmes d′Alger (Version ′O′)
and you can get that in similar fashion:
POST("http://www.skatepress.com/wp-content/themes/skatepress/scripts/query_artwork.php",
set_cookies(wordpress_logged_in_XXX="userid%XXXXXreallylongvalueXXXXX…"),
encode="form",
body=list(id=576),
verbose()) -> artwork_response
fromJSON(content(artwork_response, as="text"))
That's not huge but I won't clutter the response with it.
NOTE that you can also use rvest's html_session to do the login (which will get you cookies for free) and then continue to use that session in the scraping (vs read_html) which will mean you don't have to do the httr GET/PUT.
You'll have to figure out how you want to incorporate that data into the data frame or associate it with it via various id's in the data frame (or some other strategy).
You can see it call those two php scripts via Developer Tools and it also shows the data it passes in. I'm also really shocked that site doesn't have any anti-scraping clauses in their ToS but they don't.
I am in the process of writing a collection of freely-downloadable R scripts for http://asdfree.com/ to help people analyze the complex sample survey data hosted by the UK data service. In addition to providing lots of statistics tutorials for these data sets, I also want to automate the download and importation of this survey data. In order to do that, I need to figure out how to programmatically log into this UK data service website.
I have tried lots of different configurations of RCurl and httr to log in, but I'm making a mistake somewhere and I'm stuck. I have tried inspecting the elements as outlined in this post, but the websites jump around too fast in the browser for me to understand what's going on.
This website does require a login and password, but I believe I'm making a mistake before I even get to the login page.
Here's how the website works:
The starting page should be: https://www.esds.ac.uk/secure/UKDSRegister_start.asp
This page will automatically re-direct your web browser to a long URL that starts with: https://wayf.ukfederation.org.uk/DS002/uk.ds?[blahblahblah]
(1) For some reason, the SSL certificate does not work on this website. Here's the SO question I posted regarding this. The workaround I've used is simply ignoring the SSL:
library(httr)
set_config( config( ssl.verifypeer = 0L ) )
and then my first command on the starting website is:
z <- GET( "https://www.esds.ac.uk/secure/UKDSRegister_start.asp" )
this gives me back a z$url that looks a lot like the https://wayf.ukfederation.org.uk/DS002/uk.ds?[blahblahblah] page that my browser also re-directs to.
In the browser, then, you're supposed to type in "uk data archive" and click the continue button. When I do that, it re-directs me to the web page https://shib.data-archive.ac.uk/idp/Authn/UserPassword
I think this is where I'm stuck because I cannot figure out how to have cURL followlocation and land on this website. Note: no username/password has been entered yet.
When I use the httr GET command from the wayf.ukfederation.org.uk page like this:
y <- GET( z$url , query = list( combobox = "https://shib.data-archive.ac.uk/shibboleth-idp" ) )
the y$url string looks a lot like z$url (except it's got a combobox= on the end). Is there any way to get through to this uk data archive authentication page with RCurl or httr?
I can't tell if I'm just overlooking something or if I absolutely must use the SSL certificate described in my previous SO post or what?
(2) At the point I do make it through to that page, I believe the remainder of the code would just be:
values <- list( j_username = "your.username" ,
j_password = "your.password" )
POST( "https://shib.data-archive.ac.uk/idp/Authn/UserPassword" , body = values)
But I guess that page will have to wait...
The relevant data variables returned by the form are action and origin, not combobox. Give action the value selection and origin the value from the relevant entry in combobox
y <- GET( z$url, query = list( action="selection", origin = "https://shib.data-archive.ac.uk/shibboleth-idp") )
> y$url
[1] "https://shib.data-archive.ac.uk:443/idp/Authn/UserPassword"
Edit
It looks as though the handle pool isn't keeping your session alive correctly. You therefore need to pass the handles directly rather than automatically. Also for the POST command you need to set multipart=FALSE as this is the default for HTML forms. The R command has a different default as it is mainly designed for uploading files. So:
y <- GET( handle=z$handle, query = list( action="selection", origin = "https://shib.data-archive.ac.uk/shibboleth-idp") )
POST(body=values,multipart=FALSE,handle=y$handle)
Response [https://www.esds.ac.uk/]
Status: 200
Content-type: text/html
...snipped...
<title>
Introduction to ESDS
</title>
<meta name="description" content="Introduction to the ESDS, home page" />
I think one way to address "enter your organization" page goes like this:
library(tidyverse)
library(rvest)
library(stringr)
org <- "your_organization"
user <- "your_username"
password <- "your_password"
signin <- "http://esds.ac.uk/newRegistration/newLogin.asp"
handle_reset(signin)
# get to org page and enter org
p0 <- html_session(signin) %>%
follow_link("Login")
org_link <- html_nodes(p0, "option") %>%
str_subset(org) %>%
str_match('(?<=\\")[^"]*') %>%
as.character()
f0 <- html_form(p0) %>%
first() %>%
set_values(origin = org_link)
fake_submit_button <- list(name = "submit-btn",
type = "submit",
value = "Continue",
checked = NULL,
disabled = NULL,
readonly = NULL,
required = FALSE)
attr(fake_submit_button, "class") <- "btn-enabled"
f0[["fields"]][["submit"]] <- fake_submit_button
c0 <- cookies(p0)$value
names(c0) <- cookies(p0)$name
p1 <- submit_form(session = p0, form = f0, config = set_cookies(.cookies = c0))
Unfortunately, that doesn't solve the whole problem—(2) is harder than it looks. I've got more of what I think is a solution posted here: R: use rvest (or httr) to log in to a site requiring cookies. Hopefully someone will help us get the rest of the way.