I would like to programmatically export the records available at this website. To do this manually, I would navigate to the page, click export, and choose the csv.
I tried copying the link from the export button, which (I believe) only works as long as I have a session cookie, so a plain wget or httr request returns the HTML page instead of the file.
I've found some help in an issue on the rvest GitHub repo, but, like the author of that issue, I ultimately can't figure out how to store the cookie in an object and use it in a request.
Here is where I'm at:
library(httr)
library(rvest)
apoc <- html_session("https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx")
headers <- headers(apoc)
GET(url = "https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx?exportAll=False&exportFormat=CSV&isExport=True",
add_headers(headers)) # how can I take the output from headers in httr and use it as an argument in GET from httr?
I have checked the robots.txt and this is permissible.
You can get the __VIEWSTATE and __VIEWSTATEGENERATOR values from the hidden form fields returned when you GET https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx, and then reuse them in your subsequent POST request and in the GET that downloads the CSV.
options(stringsAsFactors=FALSE)
library(httr)
library(curl)
library(xml2)
url <- 'https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx'
# get the page and parse out the hidden view-state fields
req <- GET(url)
req_html <- read_html(rawToChar(req$content))
fields <- c("__VIEWSTATE","__VIEWSTATEGENERATOR")
viewheaders <- lapply(fields, function(x) {
  xml_attr(xml_find_first(req_html, paste0(".//input[@id='", x, "']")), "value")
})
names(viewheaders) <- fields
# POST request; you can get the full list of form fields using tools like Fiddler or your browser's dev tools
params <- c(viewheaders,
list(
"M$ctl19"="M$UpdatePanel|M$C$csfFilter$btnExport",
"M$C$csfFilter$ddlNameType"="Any",
"M$C$csfFilter$ddlField"="Elections",
"M$C$csfFilter$ddlReportYear"="2017",
"M$C$csfFilter$ddlStatus"="Default",
"M$C$csfFilter$ddlValue"=-1,
"M$C$csfFilter$btnExport"="Export"))
resp <- POST(url, body=params, encode="form")
print(resp$status_code)
resptext <- rawToChar(resp$content)
#writeLines(resptext, "apoc.html")
# get the response, i.e. download the CSV
url <- "https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx?exportAll=True&exportFormat=CSV&isExport=True"
req <- GET(url, body=params)
read.csv(text=rawToChar(req$content))
You might need to play around with the inputs/code to get what you want precisely.
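If the export is large, it can be more robust to stream the response straight to a file instead of reading it all into memory first. Here's a minimal sketch using httr::write_disk (it reuses the URL and params from above; the file name is just an example):
library(httr)
csv_url <- "https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx?exportAll=True&exportFormat=CSV&isExport=True"
resp <- GET(csv_url, body=params, write_disk("apoc_export.csv", overwrite=TRUE))
stop_for_status(resp) # fail loudly if the server rejected the request
apoc_df <- read.csv("apoc_export.csv")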
Here is another similar solution using RCurl:
how-to-login-and-then-download-a-file-from-aspx-web-pages-with-r
This seems like a simple problem, but I've been struggling with it for a few days. This is a minimal working example rather than the actual problem:
This question seemed similar, but I was unable to use the answer to solve my problem.
In a browser, I go to this url, and click on [Search] (no need to make any choices from the lists), and then on [Download Results] (choosing, for example, the Xlsx option). The file then downloads.
To automate this in R I have tried:
library(rvest)
url1 <- "https:/secure.gamblingcommission.gov.uk/PublicRegister/Search"
sesh1 <- html_session(url1)
form1 <-html_form(sesh1)[[1]]
subform <- submit_form(sesh1, form1)
Using Chrome Developer tools I find the url being used to initiate the download, so I try:
url2 <- "https:/secure.gamblingcommission.gov.uk/PublicRegister/Search/Download"
res <- GET(url = url2, query = list(format = "xlsx"))
However this does not download the file:
> res$content
raw(0)
I also tried
download.file(url = paste0(url2, "?format=xlsx") , destfile = "down.xlsx", mode = "wb")
But this downloads nothing:
> Content type '' length 0 bytes
> downloaded 0 bytes
Note that, in the browser, pasting url2 and adding the format query does initiate the download (after doing the search from url1).
I thought that I should somehow be using the session info from the initial code block to do the download, but so far I can't see how.
Thanks in advance for any help !
You are almost there and your intuition is correct about using the session info.
You just need to use rvest::jump_to to navigate to the second URL and then write the response to disk:
library(rvest)
url1 <- "https:/secure.gamblingcommission.gov.uk/PublicRegister/Search"
sesh1 <- html_session(url1)
form1 <-html_form(sesh1)[[1]]
subform <- submit_form(sesh1, form1)
url2 <- "https://secure.gamblingcommission.gov.uk/PublicRegister/Search/Download"
#### The above is your original code - below is the additional code you need:
download <- jump_to(subform, paste0(url2, "?format=xlsx"))
writeBin(download$response$content, "down.xlsx")
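If you then want the data in R rather than just the file on disk, something along these lines should work (assuming the readxl package is installed):
library(readxl)
results <- read_excel("down.xlsx")
head(results)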
There are two parts to my question, as I explored two methods in this exercise, but I succeeded with neither. It would be greatly appreciated if someone could help me out.
[PART 1:]
I am attempting to scrape data from a webpage on the Singapore Stock Exchange, https://www2.sgx.com/derivatives/negotiated-large-trade, which contains data stored in a table. I have some basic knowledge of scraping data using rvest. However, using the Inspector in Chrome, the HTML hierarchy is much more complex than I expected. I can see that the data I want is hidden under <div class="table-container">, and here's what I've tried:
library(rvest)
library(httr)
library(XML)
SGXurl <- "https://www2.sgx.com/derivatives/negotiated-large-trade"
SGXdata <- read_html(SGXurl, stringsAsFactors = FALSE)
html_nodes(SGXdata,".table-container")
However, nothing is picked up by the code, and I doubt I'm using it correctly.
[PART 2:]
I realized that there's a small "download" button on the page which downloads exactly the data file I want in .csv format, so I thought about writing some code to mimic that button. I found the question Using R to "click" a download file button on a webpage, but I was unable to get it to work even with some modifications to that code.
There are a few filters on the webpage; mostly I'm interested in downloading data for a particular business day while leaving the other filters blank, so I tried writing the following function:
library(httr)
library(rvest)
library(purrr)
library(dplyr)
crawlSGXdata = function(date){
  POST("https://www2.sgx.com/derivatives/negotiated-large-trade",
       body = NULL, # the date filter should go here, but I can't work out how
       encode = "form",
       write_disk("SGXdata.csv", overwrite = TRUE)) -> resfile
  res = read.csv("SGXdata.csv")
  return(res)
}
I intended to put the function input "date" into the "body" argument, but I was unable to figure out how to do that, so I started with "body = NULL", assuming it simply wouldn't filter. However, the result is still unsatisfactory: the downloaded file is basically empty apart from the following error:
Request Rejected
The requested URL was rejected. Please consult with your administrator.
Your support ID is: 16783946804070790400
The content is loaded dynamically from an API call that returns JSON; you can find it in the network tab of your browser's dev tools.
The following returns that content. I find the total number of pages of results, then loop over the pages, combining the data frame returned by each call into one final data frame containing all the results.
library(jsonlite)
url <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=0&pageSize=250'
r <- jsonlite::fromJSON(url)
num_pages <- r$meta$totalPages
df <- r$data
url2 <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=placeholder&pageSize=250'
if(num_pages > 1){
  # pagestart is 0-indexed and page 0 was already fetched above
  for(i in seq_len(num_pages - 1)){
    newUrl <- gsub("placeholder", i, url2)
    newdf <- jsonlite::fromJSON(newUrl)$data
    df <- rbind(df, newdf)
  }
}
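To get back to the original goal of a crawlSGXdata(date) function, the same API URL can be parameterised on the business date. A sketch along those lines (the YYYYMMDD date format, the 0-indexed pagestart and the pageSize of 250 come from the URL above; everything else is an assumption):
library(jsonlite)
crawlSGXdata <- function(date) {
  # date as a "YYYYMMDD" string, e.g. "20190708"
  base <- paste0("https://api.sgx.com/negotiatedlargetrades/v1.0",
                 "?order=asc&orderby=contractcode&category=futures",
                 "&businessdatestart=", date, "&businessdateend=", date,
                 "&pagestart=%d&pageSize=250")
  first <- fromJSON(sprintf(base, 0))
  df <- first$data
  num_pages <- first$meta$totalPages
  if (num_pages > 1) {
    for (i in seq_len(num_pages - 1)) {
      df <- rbind(df, fromJSON(sprintf(base, i))$data)
    }
  }
  df
}
df_20190708 <- crawlSGXdata("20190708")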
If you go to the website https://www.myfxbook.com/members/iseasa_public1/rush/2531687, click the Export dropdown and choose CSV, you will be taken to https://www.myfxbook.com/statements/2531687/statement.csv and the download will proceed automatically in the browser. The thing is, you need to be logged in to https://www.myfxbook.com in order to receive the information; otherwise the downloaded file will just contain the text "Please login to Myfxbook.com to use this feature".
I tried using read.csv to get the CSV file in R, but only got that "Please login" message. I believe R has to simulate an HTML session (whatever that is, I am not sure about this) so that access will be granted. I then tried some scraping tools to log in first, but to no avail.
library(rvest)
login <- "https://www.myfxbook.com"
pgsession <- html_session(login)
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform, loginEmail = "*****", loginPassword = "*****") # loginEmail and loginPassword are the names of the html elements
submit_form(pgsession, filled_form)
url <- "https://www.myfxbook.com/statements/2531687/statement.csv"
page <- jump_to(pgsession, url) # page will contain 48 bytes of data (in the 'content' element), which is the size of that warning message, though I could not access this content.
From the attempt above, I found that page has an element called cookies, which in turn contains a JSESSIONID. From my research, it seems this JSESSIONID is what "proves" I am logged in to the website. Nonetheless, downloading the CSV does not work.
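For what it's worth, one way to make that cookie reuse explicit with httr would be something like the following. This is only an untested sketch of the idea; it assumes the JSESSIONID captured by the rvest session really is all the site checks, which may well not be the case:
library(httr)
ck <- cookies(page$response) # cookies captured by jump_to()
sid <- ck$value[ck$name == "JSESSIONID"]
resp <- GET("https://www.myfxbook.com/statements/2531687/statement.csv",
            set_cookies(JSESSIONID = sid))
statement <- read.csv(text = content(resp, as = "text"))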
Then I tried:
library(RCurl)
h <- getCurlHandle(cookiefile = "")
ans <- getForm("https://www.myfxbook.com", loginEmail = "*****", loginPassword = "*****", curl = h)
data <- getURL("https://www.myfxbook.com/statements/2531687/statement.csv", curl = h)
data <- getURLContent("https://www.myfxbook.com/statements/2531687/statement.csv", curl = h)
It seems these libraries were built to scrape html pages and do not deal with files in other formats.
I would very much appreciate any help, as I've been trying to make this work for quite some time now.
Thanks.
I would like to use a website from R. The website is http://soundoftext.com/, where I can download WAV files with audio generated from a given text and language (voice).
There are two steps to download the voice in WAV:
1) Insert the text and select the language, then submit.
2) In the new window, click Save and select a folder.
So far I can get the XML tree, convert it to a list, and modify the values for text and language. However, I don't know how to convert the list (with the new values) back to XML and submit it. Then I would also need to do the second step.
Here is my code so far:
require(RCurl)
require(XML)
webpage <- getURL("http://soundoftext.com/")
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
x<-xmlToList(pagetree)
# Inserting word
x$body$div$div$div$form$div$label$.attrs[[1]]<-"Raúl"
x$body$div$div$div$form$div$label$.attrs[[1]]
# Select language
x$body$div$div$div$form$div$select$option$.attrs<-"es"
x$body$div$div$div$form$div$select$option$.attrs
I have followed this approach, but there is an error with "tag".
UPDATE: I just tried using rvest to download the audio file; however, it does not respond or trigger anything. What am I doing wrong (or missing)?
url <- "http://soundoftext.com/"
s <- html_session(url)
f0 <- html_form(s)
f1 <- set_values(f0[[1]], text="Raúl", lang="es")
attr(f1, "type") <- "Submit"
s[["fields"]][["submit"]] <- f1
attr(f1, "Class") <- "save"
test <- submit_form(s, f1)
I see nothing wrong with your approach and it was worth a try; that's what I'd write too.
The page is somewhat annoying in that it uses jQuery to append new divs with each request. I still think it should be possible to do this with rvest, but I found a fun workaround using the httr package:
library(httr)
url <- "http://soundoftext.com/sounds"
fd <- list(
submit = "save",
text = "Banana",
lang="es"
)
resp<-POST(url, body=fd, encode="form")
id <- content(resp)$id
download.file(URLencode(paste0("http://soundoftext.com/sounds/", id)), destfile = 'test.mp3')
Essentially, when we send the POST request to the server, an ID comes back; if we simply GET that ID, we can download the file.
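One small caveat: on Windows you may need to ask download.file for a binary transfer, otherwise the audio file can get mangled, e.g.:
download.file(URLencode(paste0("http://soundoftext.com/sounds/", id)),
              destfile = "test.mp3", mode = "wb")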
Creator of Sound of Text here. Sorry it took so long for me to find this post.
I just redesigned Sound of Text, so your html parsing probably won't work anymore.
However, there is now an API that you can use which should make things considerably easier for you.
You can find the documentation here: https://soundoftext.com/docs
I apologize if it's not very good. Please let me know if you have any questions.
I'm trying to download historical stock trading data from my country with R. I tried the download.file() function; a file is downloaded, but it is an empty spreadsheet. Of course, if I use this URL in my browser, the downloaded file is in fact the one I want.
I would love to do it with quantmod, but that package only covers the larger markets.
url<-"https://www.ccbolsa.cl/apps/script/detalleaccion/Transaccion.asp?Nemo=AFPCAPITAL&Menu=H"
destfile <- "/home/hector/TxHistoricas.xls"
download.file(url, destfile)
Thanks in advance.
You can jury-rig something like this if you don't want to use selenium:
library(rvest)
library(httr)
library(stringr)
URL <- "https://www.ccbolsa.cl/apps/script/detalleaccion/Transaccion.asp?Nemo=AFPCAPITAL&Menu=H"
Get initial URL:
res <- html_session(URL, timeout(30))
The page embeds a form that it submits via JavaScript, so first grab the form inputs:
inputs <- html_nodes(res, "input")
It uses the last JavaScript block to do a redirect on page load, so we need the location it points to:
scripts <- html_nodes(res, "script")
action <- html_text(scripts[[length(scripts)]])
This is the new URL to submit to:
base_url <- "https://www.ccbolsa.cl/apps/script/detalleaccion"
loc <- str_match(action, '\\.action *= *"(.*)"')[,2]
doc_url <- sprintf("%s/%s", base_url, loc)
Gather up all the query params:
query <- lapply(inputs, xml_attr, "value")
names(query) <- sapply(inputs, xml_attr, "name")
Now we make a new POST request with the query encoded as "form", providing the original URL as the Referer header (the timeout was necessary for me). This writes the "xls" content to a file:
ret <- POST(doc_url,
body=query,
encode="form",
add_headers(Referer=URL),
write_disk("fil.xls", overwrite=TRUE),
timeout(30))
It says it's an XLS file:
ret$headers$`content-type`
## [1] "application/vnd.ms-excel"
but it's really an HTML table, so you can just do:
ret <- POST(doc_url,
body=query,
encode="form",
add_headers(Referer=URL),
timeout(30))
doc <- read_html(content(ret, as="text"))
dat <- html_table(html_nodes(doc, "table"), fill=TRUE)
to get what you're looking for (there are two ugly tables in the dat list and you may want to use header=TRUE as an additional parameter to html_table).
I am not sure how "dynamic" this solution is, but it's testable/verifiable.
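For example, following the header=TRUE hint above (which of the two tables holds the trade history is something you would need to check yourself):
dat <- html_table(html_nodes(doc, "table"), fill=TRUE, header=TRUE)
str(dat[[1]]) # inspect both list elements to find the one with the trading data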