I want to download all the data in either pdf or excel for
each State X Crop Year X Standard Reports combination from this website.
I followed this tutorial to do what I want.
Download data from URL
However, I hit an error on the second line.
driver <- rsDriver()
Error in subprocess::spawn_process(tfile, ...) :
group termination: could not assign process to a job: Access is denied
Are there any alternative methods that I could use to download these data?
First, check whether the website has a robots.txt, and read its terms and conditions if there are any. It is also important to throttle your requests, as the loop below does with Sys.sleep().
After checking all the terms and conditions, the code below should get you started:
library(httr)
library(xml2)
link <- "https://aps.dac.gov.in/LUS/Public/Reports.aspx"
r <- GET(link)
doc <- read_html(content(r, "text"))
#write_html(doc, "temp.html")
# option values of the three drop-downs, named by their visible labels
states <- sapply(xml_find_all(doc, ".//select[@name='DdlState']/option"), function(x)
  setNames(xml_attr(x, "value"), xml_text(x)))
states <- states[!grepl("^Select", names(states))]
years <- sapply(xml_find_all(doc, ".//select[@name='DdlYear']/option"), function(x)
  setNames(xml_attr(x, "value"), xml_text(x)))
years <- years[!grepl("^Select", names(years))]
rptfmt <- sapply(xml_find_all(doc, ".//select[@name='DdlFormat']/option"), function(x)
  setNames(xml_attr(x, "value"), xml_text(x)))
# ids of the report links in the tree view, named by report title
stdrpts <- unlist(lapply(xml_find_all(doc, ".//td/a"), function(x) {
  id <- xml_attr(x, "id")
  if (grepl("^TreeView1t", id)) return(setNames(id, xml_text(x)))
}))
# hidden form fields (__VIEWSTATE etc.) that must be sent back with every POST
get_vs <- function(doc) sapply(xml_find_all(doc, ".//input[@type='hidden']"), function(x)
  setNames(xml_attr(x, "value"), xml_attr(x, "name")))
fmt <- rptfmt[2] #Excel format
for (sn in names(states)) {
  for (yn in names(years)) {
    for (srn in seq_along(stdrpts)) {
      s <- states[sn]
      y <- years[yn]
      sr <- stdrpts[srn]
      r <- POST(link,
        body=as.list(c("__EVENTTARGET"="DdlState",
          "__EVENTARGUMENT"="",
          "__LASTFOCUS"="",
          "TreeView1_ExpandState"="ennnn",
          "TreeView1_SelectedNode"="",
          "TreeView1_PopulateLog"="",
          get_vs(doc),
          DdlState=unname(s),
          DdlYear=0,
          DdlFormat=1)),
        encode="form")
      doc <- read_html(content(r, "text"))
      treeview <- c("__EVENTTARGET"="TreeView1",
        "__EVENTARGUMENT"=paste0("sStandard Reports\\", srn),
        "__LASTFOCUS"="",
        "TreeView1_ExpandState"="ennnn",
        "TreeView1_SelectedNode"=unname(stdrpts[srn]),
        "TreeView1_PopulateLog"="")
      vs <- get_vs(doc)
      ddl <- c(DdlState=unname(s), DdlYear=unname(y), DdlFormat=unname(fmt))
      r <- POST(link, body=as.list(c(treeview, vs, ddl)), encode="form")
      if (r$headers$`content-type`=="application/vnd.ms-excel")
        writeBin(content(r, "raw"), paste0(sn, "_", yn, "_", names(stdrpts)[srn], ".xls"))
      Sys.sleep(5)
    }
  }
}
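One practical wrinkle with the loop above: the file name is pasted together from the state, year and report labels shown on the page, and such labels may contain characters (slashes, colons, spaces) that are awkward or invalid in file names. Below is a minimal sketch of a sanitizing helper. `safe_filename` is a hypothetical name, not part of httr or xml2, and the labels in the usage line are made up.

```r
# Hypothetical helper: build a file name from page labels, replacing
# anything that is not a letter, digit, dot, underscore or hyphen.
safe_filename <- function(..., ext = ".xls") {
  parts <- vapply(list(...), as.character, character(1))
  stem <- paste(parts, collapse = "_")
  stem <- gsub("[^A-Za-z0-9._-]+", "_", stem)  # collapse runs of unsafe characters
  stem <- gsub("^_+|_+$", "", stem)            # trim leading/trailing underscores
  paste0(stem, ext)
}

# Made-up labels, for illustration only
safe_filename("Andhra Pradesh", "2001-02", "Land Use/Classification")
```

Inside the loop this would replace the `paste0(sn, "_", yn, ...)` call, e.g. `safe_filename(sn, yn, names(stdrpts)[srn])`.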
Here is my best attempt:
If you look in the network activities you will see a post request is sent:
Request body data:
If you scroll down you will see the form data that is used.
body <- structure(list(`__EVENTTARGET` = "TreeView1", `__EVENTARGUMENT` = "sStandard Reports\\4",
`__LASTFOCUS` = "", TreeView1_ExpandState = "ennnn", TreeView1_SelectedNode = "TreeView1t4",
TreeView1_PopulateLog = "", `__VIEWSTATE` = "", `__VIEWSTATEGENERATOR` = "",
`__VIEWSTATEENCRYPTED` = "", `__EVENTVALIDATION` = "", DdlState = "35",
DdlYear = "2001", DdlFormat = "1"), .Names = c("__EVENTTARGET",
"__EVENTARGUMENT", "__LASTFOCUS", "TreeView1_ExpandState", "TreeView1_SelectedNode",
"TreeView1_PopulateLog", "__VIEWSTATE", "__VIEWSTATEGENERATOR",
"__VIEWSTATEENCRYPTED", "__EVENTVALIDATION", "DdlState", "DdlYear",
"DdlFormat"))
There are certain session related values:
attr_names <- c("__EVENTVALIDATION", "__VIEWSTATEGENERATOR", "__VIEWSTATE", "__VIEWSTATEENCRYPTED")
You could add them like this:
# look up a hidden field by id in `doc` and store its value in the global `body`
setAttrNames <- function(attr_name){
  name <- doc %>%
    html_nodes(xpath = glue("//*[@id = '{attr_name}']")) %>%
    html_attr(name = "value")
  body[[attr_name]] <<- name
}
Then you can add these session-specific values:
library(httr)
library(rvest)
library(glue)
url <- "https://aps.dac.gov.in/LUS/Public/Reports.aspx"
doc <- url %>% GET %>% content("text") %>% read_html
sapply(attr_names, setAttrNames)
Then you can send the request:
response <- POST(
  url = url,
  encode = "form",
  body = body
)
response$status_code # still indicates that we have an error in the request.
Follow up ideas:
I checked for cookies. There is a session cookie, but it does not seem to be necessary for the request.
Adding headers.
Trying to set the request headers
header <- structure(c("aps.dac.gov.in", "keep-alive", "3437", "max-age=0",
"https://aps.dac.gov.in", "1", "application/x-www-form-urlencoded",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36",
"?1", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
"same-origin", "navigate", "https://aps.dac.gov.in/LUS/Public/Reports.aspx",
"gzip, deflate, br", "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7"), .Names = c("Host",
"Connection", "Content-Length", "Cache-Control", "Origin", "Upgrade-Insecure-Requests",
"Content-Type", "User-Agent", "Sec-Fetch-User", "Accept", "Sec-Fetch-Site",
"Sec-Fetch-Mode", "Referer", "Accept-Encoding", "Accept-Language"
))
hdrs <- header %>% add_headers
response <- POST(
url = url,
encode = "form",
body = body,
hdrs
)
But I get a timeout for this request.
Note: The site does not seem to have a robots.txt. But check the Terms and Conditions of the site.
I tried running these two lines myself at work and got a somewhat more explicit error message than you did.
Could not open chrome browser.
Client error message:
Summary: UnknownError
Detail: An unknown server-side error occurred while processing the command.
Further Details: run errorDetails method
Check server log for further details.
It might be because if you are at work without admin privileges, R can't create a child process.
As a matter of fact I used to run into absolutely awful problems myself trying to build a bot using RSelenium. rsDriver() was not consistent at all and kept crashing. I had to include it in a loop with error catching in order to keep it running, but then I had to find out and delete gigabytes of temp files manually.
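The "loop with error catching" mentioned above can be sketched in base R roughly as follows. This is a generic illustration, not the code the author actually used; `retry` is a hypothetical helper name.

```r
# Hypothetical helper: call `f` until it succeeds or `times` attempts are used up.
retry <- function(f, times = 3, pause = 1) {
  last_error <- NULL
  for (attempt in seq_len(times)) {
    # wrap the value in a list so a successful result is never mistaken for an error
    result <- tryCatch(list(value = f()), error = function(e) e)
    if (!inherits(result, "error")) return(result$value)
    last_error <- result
    message("attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < times) Sys.sleep(pause)
  }
  stop(last_error)  # re-signal the last error once all attempts have failed
}

# Usage (not run here, needs RSelenium and a browser):
# driver <- retry(function() RSelenium::rsDriver(browser = "firefox"))
```

Note that this only restarts the call; it does not clean up the temporary files mentioned above.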
I tried to install Docker and spent a lot of time doing the setup but finally it wasn't supported on my Windows non-professional edition.
Solution: Selenium from Python is very well documented, never crashes, works like a charm. Coding in the interactive Spyder editor from Anaconda feels almost like R.
And of course you can use something like system("python myscript.py") from R in order to get the process started and the resulting files back into R if you wish so.
EDIT: No admin privileges are required at all for Anaconda or Selenium. I run it myself without any problem from work. If you have trouble with pip install commands being SSL-blocked like me you can bypass it using the --trusted-host argument.
Selenium is useful when you must run the javascript on a webpage. For websites that don't require the javascript to be run (i.e. if the information you're after is contained within the webpage HTML), rvest or httr are your best bets.
In your case though, to download a file, simply use download.file(), which is a function in base R.
The website in your question is currently down (so I can't see it), but here's an example using a random file from another website
download.file("https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf", "mygreatfile.pdf")
To check that it worked
dir()
# [1] "mygreatfile.pdf"
Depending on how the website is structured, you may be able to obtain a list of the file urls, then loop through them in R downloading one after another.
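A minimal sketch of such a loop, assuming you already have a character vector of file URLs. `download_all` is a hypothetical helper, and the pause length is an arbitrary choice.

```r
# Hypothetical helper: download each URL into `dest_dir`, pausing between requests.
# `downloader` defaults to base R's download.file(); it is a parameter only so the
# loop can be exercised without network access.
download_all <- function(urls, dest_dir = ".", pause = 5, downloader = download.file) {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  ok <- logical(length(urls))
  for (i in seq_along(urls)) {
    dest <- file.path(dest_dir, basename(urls[i]))
    status <- tryCatch(downloader(urls[i], dest, mode = "wb"),
                       error = function(e) 1L)  # treat an error as a failed download
    ok[i] <- identical(as.integer(status), 0L)  # download.file() returns 0 on success
    if (i < length(urls)) Sys.sleep(pause)
  }
  setNames(ok, urls)
}

# Usage (not run here, needs network access):
# download_all("https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf", "downloads")
```

The returned logical vector shows which downloads succeeded, so failed ones can be retried later.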
Lastly, an extra tip. Depending on the file type, and what you're doing with them, you may be able to read them directly into R (instead of saving them first). For example read.csv() works with a url to directly read the csv from the web. Other read functions may be able to do the same.
Update
I currently see an internal 500 error when I visit the site, but I can see the site via the wayback machine, so I can see there is indeed javascript on the webpage. When the site is back up and running, I will attempt to download the files
Related
I am trying to get a table from a website into R. The code that I am currently running is:
library(htmltab)
url1 <- 'https://covid19-dashboard.ages.at/dashboard_Hosp.html'
TAB<-htmltab(url1, which = "//table[@id = 'tblIcuTimeline']")
This is selecting the correct table because the variables are the ones I want but the table is empty. It might be a problem with my XPath. The error that I am getting is:
No encoding supplied: defaulting to UTF-8.
Error in Node[[1]] : subscript out of bounds
This website has a convenient JSON file available, which you can extract like so:
library(jsonlite)
url <- "https://covid19-dashboard.ages.at/data/JsonData.json"
ll <- jsonlite::fromJSON(txt = url)
From there you can subset and extract what you want. My guess is that you are after ll$CovidFallzahlen. My German is not so good, so I couldn't isolate the exact values you are after.
The problem is (probably) that the table is empty in the page source and is only filled by JavaScript once the page has loaded. Your code reads the page before that happens, so the table is still empty.
Below is a RSelenium approach that results in a list all.table with all filled tables. Pick the one you need.
requirement: firefox is installed
library(RSelenium)
library(rvest)
library(xml2)
#setup driver, client and server
driver <- rsDriver( browser = "firefox", port = 4545L, verbose = FALSE )
server <- driver$server
browser <- driver$client
#goto url in browser
browser$navigate("https://covid19-dashboard.ages.at/dashboard_Hosp.html")
#get all tables
doc <- xml2::read_html(browser$getPageSource()[[1]])
all.table <- rvest::html_table(doc)
#close everything down properly
browser$close()
server$stop()
# needed, else the port 4545 stays occupied by the java process
system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE)
If you go to the website https://www.myfxbook.com/members/iseasa_public1/rush/2531687 then click that dropdown box Export, then choose CSV, you will be taken to https://www.myfxbook.com/statements/2531687/statement.csv and the download (from the browser) will proceed automatically. The thing is, you need to be logged in to https://www.myfxbook.com in order to receive the information; otherwise, the file downloaded will contain the text "Please login to Myfxbook.com to use this feature".
I tried using read.csv to get the csv file in R, but only got that "Please login" message. I believe R has to simulate an html session (whatever that is, I am not sure about this) so that access will be granted. Then I tried some scraping tools to login first, but to no avail.
library(rvest)
login <- "https://www.myfxbook.com"
pgsession <- html_session(login)
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform, loginEmail = "*****", loginPassword = "*****") # loginEmail and loginPassword are the names of the html elements
submit_form(pgsession, filled_form)
url <- "https://www.myfxbook.com/statements/2531687/statement.csv"
page <- jump_to(pgsession, url) # page will contain 48 bytes of data (in the 'content' element), which is the size of that warning message, though I could not access this content.
From the try above, I got that page has an element called cookies which in turns contains JSESSIONID. From my research, it seems this JSESSIONID is what "proves" I am logged in to that website. Nonetheless, downloading the CSV does not work.
Then I tried:
library(RCurl)
h <- getCurlHandle(cookiefile = "")
ans <- getForm("https://www.myfxbook.com", loginEmail = "*****", loginPassword = "*****", curl = h)
data <- getURL("https://www.myfxbook.com/statements/2531687/statement.csv", curl = h)
data <- getURLContent("https://www.myfxbook.com/statements/2531687/statement.csv", curl = h)
It seems these libraries were built to scrape html pages and do not deal with files in other formats.
I would very much appreciate any help, as I've been trying to make this work for quite some time now.
Thanks.
I am trying to download weather data, similar to the question asked here: How to parse XML to R data frame
but when I run the first line in the example, I get "Error: 1: failed to load HTTP resource". I've checked that the URL is valid. Here is the line I'm referring to:
data <- xmlParse("http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML")
I've managed to find a work around with the following, but would like to understand why the first line didn't work.
testfile <- "G:/Self Improvement/R Working Directory/test.xml"
url <- "http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML"
download.file(url, testfile, mode="wb") # get data into test
data <- xmlParse(testfile)
Appreciate any insights.
You can download the file by setting a UserAgent as follows:
require(httr)
UA <- "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36"
my_url <- "http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML"
doc <- GET(my_url, user_agent(UA))
Now have a look at content(doc, "text") to see that it is the file you see in the browser
Then you can parse it via XML or xml2. I find xml2 easier but that is just my taste. Both work.
data <- XML::xmlParse(content(doc, "text"))
data2 <- xml2::read_xml(content(doc, "text"))
Why do I have to use a user agent?
From the RCurl FAQ: http://www.omegahat.org/RCurl/FAQ.html
Why doesn't RCurl provide a default value for the useragent that some sites require?
This is a matter of philosophy. Firstly, libcurl doesn't specify a default value and it is a framework for others to build applications. Similarly, RCurl is a general framework for R programmers to create applications to make "Web" requests. Accordingly, we don't set the user agent either. We expect the R programmer to do this. R programmers using RCurl in an R package to make requests to a site should use the package name (and also the version of R) as the user agent and specify this in all requests.
Basically, we expect others to specify a meaningful value for useragent so that they identify themselves correctly.
Note that users (not recommended for programmers) can set the R option named RCurlOptions via R's option() function. The value should be a list of named curl options. This is used in each RCurl request merging these values with those specified in the call. This allows one to provide default values.
I suspect that http://forecast.weather.gov/ rejects all requests without a user agent.
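Following the FAQ's advice to identify yourself with a package name and the R version, a user agent string could be built like this. It is a sketch only: `make_user_agent` and the name "myscraper" are made up, and whether the site accepts such a string is untested.

```r
# Hypothetical helper: "<package>/<version> (R <R version>)"
make_user_agent <- function(pkg, version) {
  sprintf("%s/%s (R %s)", pkg, version, as.character(getRversion()))
}

ua <- make_user_agent("myscraper", "0.1")  # "myscraper" is a made-up name

# Not run here (needs network access):
# doc <- httr::GET(my_url, httr::user_agent(ua))
# For base R downloads the string can be set globally instead:
# options(HTTPUserAgent = ua)
```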
I downloaded this url to a text file. After that, I get the content of the file and parse it to XML data. Here is my code:
rm(list=ls())
require(XML)
require(xml2)
require(httr)
url <- "http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML"
download.file(url=url, destfile="url.txt")
data <- xmlParse("url.txt")
xml_data <- xmlToList(data)
location <- as.list(xml_data[["data"]][["location"]][["point"]])
start_time <- unlist(xml_data[["data"]][["time-layout"]][
names(xml_data[["data"]][["time-layout"]]) == "start-valid-time"])
I am in the process of writing a collection of freely-downloadable R scripts for http://asdfree.com/ to help people analyze the complex sample survey data hosted by the UK data service. In addition to providing lots of statistics tutorials for these data sets, I also want to automate the download and importation of this survey data. In order to do that, I need to figure out how to programmatically log into this UK data service website.
I have tried lots of different configurations of RCurl and httr to log in, but I'm making a mistake somewhere and I'm stuck. I have tried inspecting the elements as outlined in this post, but the websites jump around too fast in the browser for me to understand what's going on.
This website does require a login and password, but I believe I'm making a mistake before I even get to the login page.
Here's how the website works:
The starting page should be: https://www.esds.ac.uk/secure/UKDSRegister_start.asp
This page will automatically re-direct your web browser to a long URL that starts with: https://wayf.ukfederation.org.uk/DS002/uk.ds?[blahblahblah]
(1) For some reason, the SSL certificate does not work on this website. Here's the SO question I posted regarding this. The workaround I've used is simply ignoring the SSL:
library(httr)
set_config( config( ssl.verifypeer = 0L ) )
and then my first command on the starting website is:
z <- GET( "https://www.esds.ac.uk/secure/UKDSRegister_start.asp" )
this gives me back a z$url that looks a lot like the https://wayf.ukfederation.org.uk/DS002/uk.ds?[blahblahblah] page that my browser also re-directs to.
In the browser, then, you're supposed to type in "uk data archive" and click the continue button. When I do that, it re-directs me to the web page https://shib.data-archive.ac.uk/idp/Authn/UserPassword
I think this is where I'm stuck because I cannot figure out how to have cURL followlocation and land on this website. Note: no username/password has been entered yet.
When I use the httr GET command from the wayf.ukfederation.org.uk page like this:
y <- GET( z$url , query = list( combobox = "https://shib.data-archive.ac.uk/shibboleth-idp" ) )
the y$url string looks a lot like z$url (except it's got a combobox= on the end). Is there any way to get through to this uk data archive authentication page with RCurl or httr?
I can't tell if I'm just overlooking something or if I absolutely must use the SSL certificate described in my previous SO post or what?
(2) At the point I do make it through to that page, I believe the remainder of the code would just be:
values <- list( j_username = "your.username" ,
j_password = "your.password" )
POST( "https://shib.data-archive.ac.uk/idp/Authn/UserPassword" , body = values)
But I guess that page will have to wait...
The relevant data variables returned by the form are action and origin, not combobox. Give action the value selection and origin the value from the relevant entry in combobox
y <- GET( z$url, query = list( action="selection", origin = "https://shib.data-archive.ac.uk/shibboleth-idp") )
> y$url
[1] "https://shib.data-archive.ac.uk:443/idp/Authn/UserPassword"
Edit
It looks as though the handle pool isn't keeping your session alive correctly, so you need to pass the handles explicitly rather than relying on the automatic ones. Also, for the POST command you need to set multipart=FALSE, because HTML forms are URL-encoded by default. The R function has a different default (multipart) because it is mainly designed for uploading files. So:
y <- GET( handle=z$handle, query = list( action="selection", origin = "https://shib.data-archive.ac.uk/shibboleth-idp") )
POST(body=values,multipart=FALSE,handle=y$handle)
Response [https://www.esds.ac.uk/]
Status: 200
Content-type: text/html
...snipped...
<title>
Introduction to ESDS
</title>
<meta name="description" content="Introduction to the ESDS, home page" />
I think one way to address "enter your organization" page goes like this:
library(tidyverse)
library(rvest)
library(stringr)
org <- "your_organization"
user <- "your_username"
password <- "your_password"
signin <- "http://esds.ac.uk/newRegistration/newLogin.asp"
handle_reset(signin)
# get to org page and enter org
p0 <- html_session(signin) %>%
follow_link("Login")
org_link <- html_nodes(p0, "option") %>%
str_subset(org) %>%
str_match('(?<=\\")[^"]*') %>%
as.character()
f0 <- html_form(p0) %>%
first() %>%
set_values(origin = org_link)
fake_submit_button <- list(name = "submit-btn",
type = "submit",
value = "Continue",
checked = NULL,
disabled = NULL,
readonly = NULL,
required = FALSE)
attr(fake_submit_button, "class") <- "btn-enabled"
f0[["fields"]][["submit"]] <- fake_submit_button
c0 <- cookies(p0)$value
names(c0) <- cookies(p0)$name
p1 <- submit_form(session = p0, form = f0, config = set_cookies(.cookies = c0))
Unfortunately, that doesn't solve the whole problem—(2) is harder than it looks. I've got more of what I think is a solution posted here: R: use rvest (or httr) to log in to a site requiring cookies. Hopefully someone will help us get the rest of the way.
I would like to use RCurl as a polite webcrawler to download data from a website.
Obviously I need the data for scientific research. Although I have the rights to access the content of the website via my university, the terms of use of the website forbid the use of webcrawlers.
I tried to ask the administrator of the site directly for the data but they only replied in a very vague fashion. Well anyway it seems like they won’t simply send the underlying databases to me.
What I want to do now is ask them officially to get the one-time permission to download specific text-only content from their site using an R code based on RCurl that includes a delay of three seconds after each request has been executed.
The address of the sites that I want to download data from work like this:
http://plants.jstor.org/specimen/ID of the site
I tried to program it with RCurl but I cannot get it done.
A few things complicate this:
One can only access the website if cookies are allowed (I got that working in RCurl with the cookiefile-argument).
The Next-button only appears in the source code when one actually accesses the site by clicking through the different links in a normal browser.
In the source code the Next-button is encoded with an expression including
Next > >
When one tries to access the site directly (without having clicked through to it in the same browser before), it won't work, the line with the link is simply not in the source code.
The IDs of the sites are combinations of letters and digits (like “goe0003746” or “cord00002203”), so I can't simply write a for-loop in R that tries every number from 1 to 1,000,000.
So my program is supposed to mimic a person that clicks through all the sites via the Next-button, each time saving the textual content.
Each time after saving the content of a site, it should wait three seconds before clicking on the Next-button (it must be a polite crawler). I got that working in R as well using the Sys.sleep function.
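The "click Next, save, wait three seconds" behaviour described above could be organized roughly like this. It is only a skeleton: `crawl`, `fetch` and `find_next` are hypothetical names, and it assumes a way to find the Next link in the page source, which is exactly the part that is difficult on this site.

```r
# Hypothetical crawler skeleton. `fetch(url)` must return the page source as one
# string; `find_next(html)` must return the next URL, or NA when there is none.
crawl <- function(start_url, fetch, find_next, pause = 3, max_pages = 100) {
  pages <- list()
  url <- start_url
  # stop when there is no next link, a page repeats, or the page limit is reached
  while (!is.na(url) && !(url %in% names(pages)) && length(pages) < max_pages) {
    html <- fetch(url)
    pages[[url]] <- html               # save the content under its URL
    url <- find_next(html)
    if (!is.na(url)) Sys.sleep(pause)  # be polite before the next request
  }
  pages
}

# Usage (not run here, needs RCurl, a curl handle with cookies, and permission):
# pages <- crawl(start, fetch = function(u) RCurl::getURL(u, curl = curl),
#                find_next = my_find_next)
```

Passing `fetch` and `find_next` in as functions keeps the waiting and saving logic separate from the site-specific parsing.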
I also thought of using an automated program, but there seem to be a lot of such programs and I don’t know which one to use.
I’m also not exactly the program-writing person (apart from a little bit of R), so I would really appreciate a solution that doesn’t include programming in Python, C++, PHP or the like.
Any thoughts would be much appreciated! Thank you very much in advance for comments and proposals !!
Try a different strategy.
##########################
####
#### Scrape http://plants.jstor.org/specimen/
#### Idea:: Gather links from http://plants.jstor.org/search?t=2076
#### Then follow links:
####
#########################
library(RCurl)
library(XML)
### get search page::
cookie = 'cookiefile.txt'
curl = getCurlHandle ( cookiefile = cookie ,
useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en - US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6",
header = F,
verbose = TRUE,
netrc = TRUE,
maxredirs = as.integer(20),
followlocation = TRUE)
querry.jstor <- getURL('http://plants.jstor.org/search?t=2076', curl = curl)
## remove white spaces:
querry.jstor2 <- gsub('\r','', gsub('\t','', gsub('\n','', querry.jstor)))
### get links from search page
getLinks = function() {
links = character()
list(a = function(node, ...) {
links <<- c(links, xmlGetAttr(node, "href"))
node
},
links = function()links)
}
## retrieve links
h1 <- getLinks() ## handler object that collects the href of every <a> node
querry.jstor.xml.parsed <- htmlTreeParse(querry.jstor2, useInt=T, handlers = h1)
## cleanup links to keep only the one we want.
querry.jstor.links = NULL
querry.jstor.links <- c(querry.jstor.links, querry.jstor.xml.parsed$links()[-grep('http', querry.jstor.xml.parsed$links())]) ## remove all links starting with http
querry.jstor.links <- querry.jstor.links[-grep('search', querry.jstor.links)] ## remove all search links
querry.jstor.links <- querry.jstor.links[-grep('#', querry.jstor.links)] ## remove all # links
querry.jstor.links <- querry.jstor.links[-grep('javascript', querry.jstor.links)] ## remove all javascript links
querry.jstor.links <- querry.jstor.links[-grep('action', querry.jstor.links)] ## remove all action links
querry.jstor.links <- querry.jstor.links[-grep('page', querry.jstor.links)] ## remove all page links
## number of results
jstor.article <- getNodeSet(htmlTreeParse(querry.jstor2, useInt=T), "//article")
NumOfRes <- strsplit(gsub(',', '', gsub(' ', '' ,xmlValue(jstor.article[[1]][[1]]))), split='')[[1]]
NumOfRes <- as.numeric(paste(NumOfRes[1:min(grep('R', NumOfRes))-1], collapse = ''))
for(i in 2:ceiling(NumOfRes/20)){
querry.jstor <- getURL(paste0('http://plants.jstor.org/search?t=2076&p=', i), curl = curl)
## remove white spaces:
querry.jstor2 <- gsub('\r','', gsub('\t','', gsub('\n','', querry.jstor)))
h1 <- getLinks() ## fresh collector, so links from earlier pages are not appended twice
querry.jstor.xml.parsed <- htmlTreeParse(querry.jstor2, useInt=T, handlers = h1)
querry.jstor.links <- c(querry.jstor.links, querry.jstor.xml.parsed$links()[-grep('http', querry.jstor.xml.parsed$links())]) ## remove all links starting with http
querry.jstor.links <- querry.jstor.links[-grep('search', querry.jstor.links)] ## remove all search links
querry.jstor.links <- querry.jstor.links[-grep('#', querry.jstor.links)] ## remove all # links
querry.jstor.links <- querry.jstor.links[-grep('javascript', querry.jstor.links)] ## remove all javascript links
querry.jstor.links <- querry.jstor.links[-grep('action', querry.jstor.links)] ## remove all action links
querry.jstor.links <- querry.jstor.links[-grep('page', querry.jstor.links)] ## remove all page links
Sys.sleep(abs(rnorm(1, mean=3.0, sd=0.5)))
}
## make directory for saving data:
dir.create('./jstorQuery/')
## Now we have all the links, so we can retrieve all the info
for(j in 1:length(querry.jstor.links)){
if(nchar(querry.jstor.links[j]) != 1){
querry.jstor <- getURL(paste0('http://plants.jstor.org', querry.jstor.links[j]), curl = curl)
## remove white spaces:
querry.jstor2 <- gsub('\r','', gsub('\t','', gsub('\n','', querry.jstor)))
## construct name from the last path component of the link:
filename = basename(querry.jstor.links[j])
## save in directory:
write(querry.jstor2, file = paste('./jstorQuery/', filename, '.html', sep = '' ))
Sys.sleep(abs(rnorm(1, mean=3.0, sd=0.5)))
}
}
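One caveat about the link cleanup above: `x[-grep(pattern, x)]` returns an empty vector whenever the pattern matches nothing, because negative indexing with `integer(0)` selects no elements. A logical mask avoids this. The sketch below uses made-up links, and `drop_matching` is a hypothetical helper.

```r
# Made-up links, for illustration only
links <- c("/specimen/goe0003746", "/search?t=2076", "/specimen/cord00002203")

# Pitfall: no element matches 'javascript', so grep() returns integer(0)
# and negative indexing with it drops everything.
length(links[-grep("javascript", links)])

# Safer: a logical mask keeps all elements when nothing matches.
drop_matching <- function(x, patterns) {
  for (p in patterns) x <- x[!grepl(p, x, fixed = TRUE)]
  x
}
drop_matching(links, c("http", "search", "#", "javascript", "action", "page"))
```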
I may be missing exactly the bit you are hung up on, but it sounds like you are almost there.
It seems you can request page 1 with cookies on. Then parse the content searching for the next site ID, then request that page by building the URL with the next site ID. Then scrape whatever data you want.
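The "parse for the next ID, then build the URL" step might look like the sketch below. The HTML fragment and the href layout are assumptions made for illustration, since the real markup of the Next link is not shown in the question. `next_specimen_id` is a hypothetical helper.

```r
# The HTML fragment and the href layout are made up for illustration.
html <- '<a class="next" href="/specimen/cord00002203">Next</a>'

# Pull the ID out of the first /specimen/<id> href (returns NA if none is found).
next_specimen_id <- function(html) {
  m <- regmatches(html, regexpr('(?<=href="/specimen/)[A-Za-z0-9]+', html, perl = TRUE))
  if (length(m) == 0) NA_character_ else m
}

id <- next_specimen_id(html)
next_url <- paste0("http://plants.jstor.org/specimen/", id)
```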
It sounds like you have code that does almost all of this. Is the problem parsing page 1 to get the ID for the next step? If so, you should formulate a reproducible example and I suspect you'll get a very fast answer to your syntax problems.
If you're having trouble seeing what the site is doing, I recommend using the Tamper Data plug in for Firefox. It will let you see what request is being made at each mouse click. I find it really useful for this type of thing.