In a file I have a table of 23,772 URLs that I need to download. In the code below, that table is represented by dwsites. Due to server restrictions, I am only able to download a block of 300 sites at a time. I have accomplished the task with the code below (it is an excerpt of the actual code), but I would like to know a better way.
Can you offer any suggestions?
Thank you.
# Table of all 23,772 URLs to fetch.
dwsites <- data.frame(sites = c(1:23772), url = rep("url", 23772))

# The server only permits 300 downloads per batch, so walk the table in
# chunks of (at most) 300 rows instead of one hand-numbered, copy-pasted
# block per batch. seq() yields each chunk's starting row.
chunk.size <- 300
for (start in seq(1, nrow(dwsites), by = chunk.size)) {
  end <- min(start + chunk.size - 1, nrow(dwsites))
  dwsitessub <- dwsites[start:end, ]

  # NOTE(review): in the full script, `strpatnew`/`strpatternew` are
  # presumably derived from `dwsitessub` in code omitted from this
  # excerpt -- that derivation must happen here, inside the loop, so
  # each batch downloads its own URLs. TODO confirm against the full code.

  # Fresh handle per batch, matching the original's behavior.
  curl <- getCurlHandle()
  pagesnew <- list()
  for (u in strpatnew) {
    pagesnew[[u]] <- getURLContent(u, curl = curl)
  }
  # Write each fetched page to disk under its pattern-derived name.
  lapply(seq_along(strpatternew), function(u) {
    cat(pagesnew[[u]], file = file.path("filepath", strpatternew[[u]]))
  })
}
Related
I'm trying to take a set of DOIs and have the doi.org website return the information in .bib format. The code below is supposed to do that and, crucially, append each new result to a .bib file. mode = "a" is what I understand will do the appending, but it doesn't. The last line of code prints out the contents of outFile, and it contains only the last .bib result.
What needs to be changed to make this work?
library(curl)

outFile <- tempfile(fileext = ".bib")
url1 <- "https://doi.org/10.1016/j.tvjl.2017.12.021"
url2 <- "https://doi.org/10.1016/j.yqres.2013.10.005"

# Ask doi.org to content-negotiate BibTeX instead of the default HTML.
h <- new_handle()
handle_setheaders(h, "accept" = "application/x-bibtex")

# curl_download() downloads to a temporary file and renames it over the
# destination when done, so mode = "a" never actually appends -- each
# call clobbers outFile. Read each response through a connection and
# append it ourselves instead.
for (u in c(url1, url2)) {
  bib <- readLines(curl(u, handle = h), warn = FALSE)
  write(bib, file = outFile, append = TRUE)
}

# Show the accumulated BibTeX entries.
readLines(outFile)
It's not working for me either with curl_download(). Alternatively, you could download with curl() and use write() with append = TRUE.
Here is a solution for that, which easily can be used for as many urls as you are looking to download the bibtex from. You can execute this after your line 7.
library(dplyr)
library(purrr)

# The two DOI URLs to fetch BibTeX for; extend this list as needed.
urls <- list(url1, url2)

# For each URL: open a connection with the BibTeX-accept handle, read
# the response lines, and append them to the output file.
walk(urls, ~ {
  lines <- readLines(curl(., handle = h), warn = FALSE)
  write(lines, file = outFile, append = TRUE)
})

library(readr)

# Inspect the accumulated results.
read_delim(outFile, delim = "\n")
I'm trying to download a pdf from the National Information Center via RCurl but I've been having some trouble. For this example URL, I want the pdf corresponding to the default settings, except for "Report Format" which should be "PDF". When I run the following script, it saves the file associated with selecting the other buttons ("Parent(s) of..."/HMDA -- not the default). I tried adding these input elements to params, but it didn't change anything. Could somebody help me identify the problem? thanks.
library(RCurl)

# Handle with a cookie jar so the session survives the GET + POST pair.
curl <- getCurlHandle()
curlSetOpt(cookiejar = 'cookies.txt', curl = curl)

# ASP.NET radio buttons post as <group name> = <selected value>. The
# original posted rbRptFormatPDF = 'rbRptFormatPDF', which the server
# ignores, so it returned the format tied to the other (default) radio
# selections. The group for "Report Format" is grpRptFormat.
params <- list(grpRptFormat = 'rbRptFormatPDF')

url <- 'https://www.ffiec.gov/nicpubweb/nicweb/OrgHierarchySearchForm.aspx?parID_RSSD=2162966&parDT_END=99991231'
html <- getURL(url, curl = curl)

# Scrape the WebForms hidden state fields every postback must echo back.
viewstate <- sub('.*id="__VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html)
event <- sub('.*id="__EVENTVALIDATION" value="([0-9a-zA-Z+/=]*).*', '\\1', html)
params[['__VIEWSTATE']] <- viewstate
params[['__EVENTVALIDATION']] <- event
params[['btnSubmit']] <- 'Submit'

# Submit the form and write the binary response (the PDF) to disk.
result <- postForm(url, .params = params, curl = curl, style = 'POST')
writeBin(as.vector(result), 'test.pdf')
Does this provide the correct PDF?
library(httr)
library(rvest)
library(purrr)

# Fetch the form page once to harvest the ASP.NET hidden state fields
# (__VIEWSTATE, __EVENTVALIDATION, ...) required for any postback.
res <- GET(url = "https://www.ffiec.gov/nicpubweb/nicweb/OrgHierarchySearchForm.aspx",
           query = list(parID_RSSD = 2162966, parDT_END = 99991231))

# Extract them. The XPath attribute test must be @type -- the pasted
# "#type" is invalid XPath (the usual "@" -> "#" mangling) and would
# select nothing.
pg <- content(res, as = "parsed")
hidden <- html_nodes(pg, xpath = ".//form/input[@type='hidden']")
params <- setNames(as.list(xml_attr(hidden, "value")), xml_attr(hidden, "name"))

# Pile on the visible form fields. Radio groups post as <group> = <value>,
# so "PDF" report format is grpRptFormat = "rbRptFormatPDF".
params <- c(
  params,
  grpInstitution = "rbCurInst",
  lbTopHolders = "2961897",
  grpHMDA = "rbNonHMDA",
  lbTypeOfInstitution = "-99",
  txtAsOfDate = "12/28/2016",
  txtAsOfDateErrMsg = "",
  lbHMDAYear = "2015",
  grpRptFormat = "rbRptFormatPDF",
  btnSubmit = "Submit"
)

# Submit the request and stream the PDF response straight to disk.
res2 <- POST(url = "https://www.ffiec.gov/nicpubweb/nicweb/OrgHierarchySearchForm.aspx",
             query = list(parID_RSSD = 2162966, parDT_END = 99991231),
             add_headers(Origin = "https://www.ffiec.gov"),
             body = params,
             encode = "form",
             write_disk("/tmp/output.pdf"))
I'm trying to scrape some table data from a password-protected website (I have a valid username/password) using R and have yet to succeed.
For an example, here's the website to log in to my dentist: http://www.deltadentalins.com/uc/index.html
I have tried the following:
library(httr)

# Session-scoped target page and the public login page.
download <- "https://www.deltadentalins.com/indService/faces/Home.jspx?_afrLoop=73359272573000&_afrWindowMode=0&_adf.ctrl-state=12pikd0f19_4"
terms <- "http://www.deltadentalins.com/uc/index.html"

# Credentials plus the (empty) SiteMinder fields the login form expects.
values <- list(
  username = "username",
  password = "password",
  TARGET = "",
  SMAUTHREASON = "",
  POSTPRESERVATIONDATA = "",
  bundle = "all",
  dups = "yes"
)

# Attempt the login, then request the protected page with the same fields.
POST(terms, body = values)
GET(download, query = values)
I have also tried:
your.username <- 'username'
your.password <- 'password'

# library() errors immediately if a package is missing; require() only
# returns FALSE, which would surface much later and less clearly.
library(SAScii)
library(RCurl)
library(XML)

agent <- "Firefox/23.0"

# CA bundle so HTTPS requests made through RCurl verify properly.
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))

# One handle carries cookies and settings across login and later requests.
curl <- getCurlHandle()
curlSetOpt(
  cookiejar = 'cookies.txt' ,
  useragent = agent,
  followlocation = TRUE ,
  autoreferer = TRUE ,
  curl = curl
)

# Form fields pulled from the login page source. NOTE(review): SiteMinder
# logins typically require the hidden fields (lt, _eventID, ...) to carry
# the values served on the login page, not blanks -- scrape the login
# page first and echo them back. TODO confirm against the live form.
params <-
  list(
    'lt' = "",
    '_eventID' = "",
    'TARGET' = "",
    'SMAUTHREASON' = "",
    'POSTPRESERVATIONDATA' = "",
    'SMAGENTNAME' = agent,
    'username' = your.username,
    'password' = your.password
  )

# Submit the login form; session cookies land in the shared handle.
html <- postForm('https://www.deltadentalins.com/siteminderagent/forms/login.fcc', .params = params, curl = curl)
html
I can't get either to work. Are there any experts out there that can help?
Updated 3/5/16 to work with package RSelenium
#### FRONT MATTER ####
#### FRONT MATTER ####
library(devtools)
library(RSelenium)
library(XML)
library(plyr)
######################

## This block will open the Firefox browser, which is linked to R
# Ensure a local Selenium server binary is present, create the remote
# driver object, start the server, attach a browser session, and load
# the (placeholder) login URL.
RSelenium::checkForServer()
remDr <- remoteDriver()
startServer()
remDr$open()
url="yoururl"  # replace with the real login URL
remDr$navigate(url)
This first section loads the required packages, sets the login URL, and then opens it in a Firefox instance. I type in my username & password, and then I'm in and can start scraping.
# The updated front matter drives the browser through `remDr` (RSelenium);
# `firefox` belonged to the old relenium code and is never defined here.
# remDr$getPageSource() returns a one-element list holding the HTML.
infoTable <- readHTMLTable(remDr$getPageSource()[[1]], header = TRUE)
infoTable
Table1 <- infoTable[[1]]
Apps <- Table1[,1] # Application Numbers
For this example, the first page contained two tables. The first is the one I'm interested and has a table of application numbers and names. I pull out the first column (application numbers).
# Build one detail-page link per scraped application number. The vector
# defined above is `Apps` (there is no `Apps2` anywhere in this code).
Links2 <- paste0("https://yourURL?ApplicantID=", Apps)
The data I want are stored in individual applications, so this bit creates the links that I want to loop through.
### Grabs contact info table from each page
# Visit each application page and pull out its contact-info table.
LL <- lapply(seq_along(Links2),
  function(i) {
    # Drive the RSelenium session (`remDr`); `firefox` belonged to the
    # old relenium code path. navigate() loads the page and
    # getPageSource() returns a one-element list with the HTML.
    remDr$navigate(Links2[i])
    infoTable <- readHTMLTable(remDr$getPageSource()[[1]], header = TRUE)
    # Contact info lands in table 2 or table 3 depending on page layout,
    # so check which one actually has the "First Name" column.
    if ("First Name" %in% colnames(infoTable[[2]])) {
      infoTable2 <- cbind(infoTable[[1]][1, ], infoTable[[2]][1, ])
    } else {
      infoTable2 <- cbind(infoTable[[1]][1, ], infoTable[[3]][1, ])
    }
    print(infoTable2)
  }
)

# Stack the per-page rows (rbind.fill pads mismatched columns with NA).
results <- do.call(rbind.fill, LL)
results
write.csv(results, "C:/pathway/results2.csv")
This final section loops through the link for each application, then grabs the table with their contact information (which is either table 2 OR table 3, so R has to check first). Thanks again to Chinmay Patil for the tip on RSelenium!
i originally asked this question about performing this task with the httr package, but i don't think it's possible using httr. so i've re-written my code to use RCurl instead -- but i'm still tripping up on something probably related to the writefunction.. but i really don't understand why.
you should be able to reproduce my work by using the 32-bit version of R, so you hit memory limits if you read anything into RAM. i need a solution that downloads directly to the hard disk.
to start, this code to works -- the zipped file is appropriately saved to the disk.
library(RCurl)

# Stream the download straight to disk: CFILE wraps a C-level FILE*,
# and curlPerform writes into it without buffering the body in RAM.
filename <- tempfile()
f <- CFILE(filename, "wb")
url <- "http://www2.census.gov/acs2011_5yr/pums/csv_pus.zip"
# CFILE objects are S4, so the slot accessor is f@ref. The pasted
# "f#ref" is a mangled "@" -- "#" starts a comment in R and would
# truncate the call.
curlPerform(url = url, writedata = f@ref)
close(f)
# 2.1 GB file successfully written to disk
now here's some RCurl code that does not work. as stated in the previous question, reproducing this exactly will require creating an extract on ipums.
your.email <- "email@address.com"  # "#" in the original was a mangled "@"
your.password <- "password"
extract.path <- "https://usa.ipums.org/usa-action/downloads/extract_files/some_file.csv.gz"

library(RCurl)

# Shared handle: login cookies stored here are reused by later requests.
curl <- getCurlHandle()
curlSetOpt(
  cookiejar = 'cookies.txt',
  followlocation = TRUE,
  autoreferer = TRUE,
  ssl.verifypeer = FALSE,
  curl = curl
)

# Credentials for the IPUMS login form. (The original defined this exact
# list twice, as `values` and again as `params`; one copy suffices.)
params <-
  list(
    "login[email]" = your.email ,
    "login[password]" = your.password ,
    "login[is_for_login]" = 1
  )

# Log in (cookies land in `curl`), then hit the download listing page.
html <- postForm("https://usa.ipums.org/usa-action/users/validate_login", .params = params, curl = curl)
dl <- getURL( "https://usa.ipums.org/usa-action/extract_requests/download" , curl = curl)
and now that i'm logged in, try the same commands as above, but with the curl object to keep the cookies.
# Open a C-level file handle for streaming the extract to disk.
filename <- tempfile()
f <- CFILE(filename, mode = "wb")
this line breaks--
# S4 slot access uses "@": the pasted "f#ref" hid everything after "#".
# NOTE(review): per the accepted answer below, writedata alone may not
# be enough here -- RCurl can still try to build an R string from the
# binary body (the "embedded nul" error) unless a C-level writefunction
# is also supplied. Confirm with the curl_writer shim if this still fails.
curlPerform(url = extract.path, writedata = f@ref, curl = curl)
close(f)
# the error is:
Error in curlPerform(url = extract.path, writedata = f@ref, curl = curl) :
embedded nul in string: [[binary jibberish here]]
the answer to my previous post referred me to this c-level writefunction answer, but i'm clueless about how to re-create that curl_writer C program (on windows?)..
# Load the compiled shim and hand its C function pointer to libcurl so
# the response body is fwritten at C level, never materialized as an
# R string (which is what triggers the "embedded nul" error).
dyn.load("curl_writer.so")
writer <- getNativeSymbolInfo("writer", PACKAGE="curl_writer")$address
curlPerform(URL=url, writefunction=writer)
..or why it's even necessary, given that the five lines of code at the top of this question work without anything crazy like getNativeSymbolInfo. i just don't understand why passing in that extra curl object that stores the authentication/cookies and tells it not to verify SSL would cause code that otherwise works.. to break?
From this link create a file named curl_writer.c and save it to C:\<folder where you save your R files>
#include <stdio.h>

/**
 * libcurl write callback: append the received chunk to the FILE*
 * supplied via the writedata option.
 *
 * libcurl treats any return value other than the number of bytes
 * handled as a write error, so report what fwrite actually wrote
 * instead of unconditionally claiming size * nmemb (the original
 * ignored short writes, e.g. on a full disk).
 */
size_t writer(void *buffer, size_t size, size_t nmemb, void *stream) {
  size_t written = fwrite(buffer, size, nmemb, (FILE *)stream);
  return written * size;
}
Open a command window, go to the folder where you saved curl_writer.c and run the R compiler
c:> cd "C:\<folder where you save your R files>"
c:> R CMD SHLIB -o curl_writer.dll curl_writer.c
Open R and run your script
C:> R
your.email <- "email@address.com"  # "#" in the original was a mangled "@"
your.password <- "password"
extract.path <- "https://usa.ipums.org/usa-action/downloads/extract_files/some_file.csv.gz"

library(RCurl)

# Shared handle: the cookie jar keeps the login session alive for the
# authenticated download request that follows.
curl <- getCurlHandle()
curlSetOpt(
  cookiejar = 'cookies.txt',
  followlocation = TRUE,
  autoreferer = TRUE,
  ssl.verifypeer = FALSE,
  curl = curl
)

# IPUMS login form fields. (The original defined this exact list twice,
# as `values` and again as `params`; one copy suffices.)
params <-
  list(
    "login[email]" = your.email ,
    "login[password]" = your.password ,
    "login[is_for_login]" = 1
  )

# Log in, then load the extract-download listing with the same handle.
html <- postForm("https://usa.ipums.org/usa-action/users/validate_login", .params = params, curl = curl)
dl <- getURL( "https://usa.ipums.org/usa-action/extract_requests/download" , curl = curl)
# Load the DLL built above; "writer" is the C symbol, "curl_writer" the DLL.
dyn.load("curl_writer.dll")
writer <- getNativeSymbolInfo("writer", PACKAGE="curl_writer")$address

# Stream the body to disk: writedata supplies the FILE* (the S4 slot is
# f@ref -- the pasted "f#ref" hid it behind a comment), and writefunction
# routes the bytes through our C-level fwrite shim instead of letting
# RCurl build an R string from binary data.
# NOTE(review): `url` here is still the census zip from the earlier,
# working example; to fetch the authenticated IPUMS extract this should
# presumably be extract.path -- confirm before running.
f <- CFILE(filename <- tempfile(), "wb")
curlPerform(URL=url, writedata=f@ref, writefunction=writer, curl=curl)
close(f)
this is now possible with the httr package. thanks hadley!
https://github.com/hadley/httr/issues/44
I am running into what appears to be character size limit in a JSON string when trying retrieve data from either curlPerform() or getURL(). Here is non-reproducible code [1], but it should shed some light on the problem.
# Note that .base.url is the basic url for the API, q is a query, user
# is specified, etc.
session <- getCurlHandle()

# Basic-auth credentials plus a JSON content-type header for the API.
curl.opts <- list(
  userpwd = paste0(user, ":", key),
  httpheader = "Content-Type: application/json"
)

request <- paste0(.base.url, q)

# Perform the request, gathering the response text in memory.
txt <- getURL(
  url = request,
  curl = session,
  .opts = curl.opts,
  write = basicTextGatherer()
)
or
# Alternative: dynCurlReader collects the body via its update() callback;
# retrieve the accumulated text afterwards with r$value().
r <- dynCurlReader()
curlPerform(
  url = request,
  writefunction = r$update,
  curl = session,
  .opts = curl.opts
)
My guess is that the update or value functions in the basicTextGather or dynCurlReader text handler objects are having trouble with the large strings. In this example, r$value() will return a truncated string that is approximately 2 MB. The code given above will work fine for queries < 2 MB.
Note that I can easily do the following from the command line (or using system() in R), but writing to disc seems like a waste if I am doing the subsequent analysis in R.
curl -v --header "Content-Type: application/json" --user username:register:passwd https://base.url.for.api/getdata/select+*+from+sometable > stream.json
where stream.json is a roughly 14MB json string. I can read the string into R using either
# Read the dumped JSON as a character vector (one element per line).
# Close the connection afterwards -- the original leaked it, and R
# warns about (and eventually garbage-collects) unclosed connections.
con <- file(paste0(.project.path, "data/stream.json"), "r")
string <- readLines(con)
close(con)
or directly to list as
# Parse the whole JSON dump straight into an R list (rjson-style
# `file =` argument), skipping the intermediate string entirely.
tmp <- fromJSON(file = paste(.project.path, "data/stream.json", sep = ""))
Any thoughts are very much appreciated.
Ryan
[1] - Sorry for not providing reproducible code, but I'm dealing with a govt firewall.