Novice - first time attempting to extract data via an API and using R.
I obtained an API key and secret and converted them to base64.
Now I'm perplexed as to the next step. The instructions I have state that I should "Enter the generated base64 value in the header and request body and call the token URI as shown below":
[Code]
Authorization: Basic {base64value}
Content-Type: application/x-www-form-urlencoded
POST https://api.destination.com/oauth/token
grant_type=client_credentials
[/Code]
Can R be used to obtain the OAuth token?
If so, which packages do I need to install?
What are the specific steps?
I'm currently reading several books on R, but thought someone here might be able to provide some insight.
Thanks in advance.
The httr package is a great place to start learning APIs in R. If you haven't been led there already, I highly recommend taking the time to check it out.
library(httr)

base64_value <- your_generated_base64string

response <- POST(url = "https://api.precisely.com/oauth/token",
                 add_headers(Authorization = paste("Basic", base64_value)),
                 body = list(grant_type = "client_credentials"),
                 encode = "form")

# we're hoping this is 200
response$status_code
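If the status is 200, the token itself is in the JSON body of the response. A minimal sketch of pulling it out and using it, assuming the server returns the usual access_token field for a client-credentials grant (the follow-up endpoint below is purely illustrative):

# parse the JSON body of the token response
token_info <- content(response, as = "parsed")

# field name assumes the standard OAuth "access_token" key
access_token <- token_info$access_token

# later calls then send it as a Bearer header (endpoint path is illustrative)
GET("https://api.precisely.com/some/endpoint",
    add_headers(Authorization = paste("Bearer", access_token)))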
The following code does not work; a status code of 401 is the result.
After attempting this numerous times, maybe I need to regenerate the key and secret, obtain another base64 value, and try again. Other options include trying RCurl or even Python.
I assume that the server eventually locks out a base64 value after so many unsuccessful attempts.
Appreciate the time; any additional insight is appreciated.
library(httr)

base64_value <- "123456789="

response14 <- httr::POST(url = "https://api.precisely.com/oauth/token",
                         httr::add_headers(Authorization = paste("Basic", base64_value)),
                         body = list(grant_type = "client_credentials"),
                         encode = "form")
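One thing I still want to rule out is how the base64 value is built in the first place, since a 401 can simply mean a malformed header. A rough sketch of what I understand the standard Basic scheme to be (base64 of "key:secret"), plus httr's authenticate() helper that builds the header for you; the key and secret below are placeholders:

library(httr)

api_key    <- "my_key"     # placeholder
api_secret <- "my_secret"  # placeholder

# build the Basic value as base64 of "key:secret"
base64_value <- base64enc::base64encode(charToRaw(paste0(api_key, ":", api_secret)))

response <- POST("https://api.precisely.com/oauth/token",
                 add_headers(Authorization = paste("Basic", base64_value)),
                 body = list(grant_type = "client_credentials"),
                 encode = "form")

# or let httr construct the Basic header itself
response <- POST("https://api.precisely.com/oauth/token",
                 authenticate(api_key, api_secret),
                 body = list(grant_type = "client_credentials"),
                 encode = "form")

status_code(response)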
Latest iteration.
The error received this time concerns a timeout.
library(httr)

base64_value <- "123456789="

response14 <- httr::POST(url = "https://api.precisely.com/oauth/token",
                         httr::add_headers(Authorization = paste("Basic", "123456789=")),
                         body = list(grant_type = "client_credentials"),
                         encode = "form")
Error in curl::curl_fetch_memory(url, handle = handle) :
Timeout was reached: [api.precisely.com] Resolving timed out after 10000 milliseconds
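Since "Resolving timed out" means the machine could not even look up api.precisely.com, this looks like a network or proxy issue rather than a problem with the request itself. A rough sketch of what I plan to check next (the proxy address and port are placeholders for whatever the network actually uses):

library(httr)

# can this R session resolve the host at all?
curl::nslookup("api.precisely.com")

# if behind a corporate proxy, point httr at it (placeholder address/port)
set_config(use_proxy(url = "http://proxy.example.com", port = 8080))

# and allow the request more time before giving up
response <- POST("https://api.precisely.com/oauth/token",
                 add_headers(Authorization = paste("Basic", base64_value)),
                 body = list(grant_type = "client_credentials"),
                 encode = "form",
                 timeout(60))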
I'm receiving the data I'm requesting but don't understand how to properly extract it. Here is the POST request:
library(httr)
url <- "http://tools-cluster-interface.iedb.org/tools_api/mhci/"
body <- list(method="recommended", sequence_text="SLYNTVATLYCVHQRIDV", allele="HLA-A*01:01,HLA-A*02:01", length="8,9")
data <- httr::POST(url, body = body, encode = "form", verbose())
If I print the data with:
data
...it shows the request details followed by a nicely formatted table. However, if I try to extract it with:
httr::content(data, "text")
This returns a single string with all the values of the original table. The output looks delimited by "\" escape sequences, but I couldn't str_replace or otherwise tease it apart properly.
I'm new to requests using R (and httr) and assume it's an option I'm missing in httr. Any advice?
API details here: http://tools.iedb.org/main/tools-api/
The best way to do this is to specify the MIME type:
content(data, type = 'text/tab-separated-values')
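If that parser isn't available on your setup, an alternative sketch is to pull the body as plain text and parse the tabs yourself (this assumes the service really does return tab-separated values, which its output suggests):

txt <- httr::content(data, as = "text", encoding = "UTF-8")
df  <- read.delim(text = txt, stringsAsFactors = FALSE)  # tab-delimited by default
head(df)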
I want to download all the data, in either PDF or Excel format, for each State x Crop Year x Standard Report combination from this website.
I followed this tutorial to do what I want.
Download data from URL
However, I hit an error on the second line.
driver <- rsDriver()
Error in subprocess::spawn_process(tfile, ...) :
group termination: could not assign process to a job: Access is denied
Are there any alternative methods that I could use to download these data?
First, check the website's robots.txt, if there is one. Then read the terms and conditions, if there are any. It is also important to throttle your requests, which the code below does with Sys.sleep().
After checking all the terms and conditions, the code below should get you started:
library(httr)
library(xml2)

link <- "https://aps.dac.gov.in/LUS/Public/Reports.aspx"

r <- GET(link)
doc <- read_html(content(r, "text"))
#write_html(doc, "temp.html")

# collect the dropdown options for state, year and report format
states <- sapply(xml_find_all(doc, ".//select[@name='DdlState']/option"), function(x)
  setNames(xml_attr(x, "value"), xml_text(x)))
states <- states[!grepl("^Select", names(states))]

years <- sapply(xml_find_all(doc, ".//select[@name='DdlYear']/option"), function(x)
  setNames(xml_attr(x, "value"), xml_text(x)))
years <- years[!grepl("^Select", names(years))]

rptfmt <- sapply(xml_find_all(doc, ".//select[@name='DdlFormat']/option"), function(x)
  setNames(xml_attr(x, "value"), xml_text(x)))

# the standard reports are the TreeView links
stdrpts <- unlist(lapply(xml_find_all(doc, ".//td/a"), function(x) {
  id <- xml_attr(x, "id")
  if (grepl("^TreeView1t", id)) return(setNames(id, xml_text(x)))
}))

# ASP.NET hidden fields (__VIEWSTATE etc.) that must be echoed back on every POST
get_vs <- function(doc) sapply(xml_find_all(doc, ".//input[@type='hidden']"), function(x)
  setNames(xml_attr(x, "value"), xml_attr(x, "name")))

fmt <- rptfmt[2] # Excel format

for (sn in names(states)) {
  for (yn in names(years)) {
    for (srn in seq_along(stdrpts)) {
      s <- states[sn]
      y <- years[yn]
      sr <- stdrpts[srn]
      # first POST selects the state so the server refreshes the page state
      r <- POST(link,
                body = as.list(c("__EVENTTARGET" = "DdlState",
                                 "__EVENTARGUMENT" = "",
                                 "__LASTFOCUS" = "",
                                 "TreeView1_ExpandState" = "ennnn",
                                 "TreeView1_SelectedNode" = "",
                                 "TreeView1_PopulateLog" = "",
                                 get_vs(doc),
                                 DdlState = unname(s),
                                 DdlYear = 0,
                                 DdlFormat = 1)),
                encode = "form")
      doc <- read_html(content(r, "text"))
      # second POST requests the selected standard report for the chosen year/format
      treeview <- c("__EVENTTARGET" = "TreeView1",
                    "__EVENTARGUMENT" = paste0("sStandard Reports\\", srn),
                    "__LASTFOCUS" = "",
                    "TreeView1_ExpandState" = "ennnn",
                    "TreeView1_SelectedNode" = unname(stdrpts[srn]),
                    "TreeView1_PopulateLog" = "")
      vs <- get_vs(doc)
      ddl <- c(DdlState = unname(s), DdlYear = unname(y), DdlFormat = unname(fmt))
      r <- POST(link, body = as.list(c(treeview, vs, ddl)), encode = "form")
      if (r$headers$`content-type` == "application/vnd.ms-excel")
        writeBin(content(r, "raw"), paste0(sn, "_", yn, "_", names(stdrpts)[srn], ".xls"))
      Sys.sleep(5)
    }
  }
}
Here is my best attempt:
If you look at the network activity in the browser's developer tools, you will see that a POST request is sent when a report is requested. Scrolling down through the request body data shows the form fields that are used:
body <- structure(list(`__EVENTTARGET` = "TreeView1", `__EVENTARGUMENT` = "sStandard+Reports%5C4",
`__LASTFOCUS` = "", TreeView1_ExpandState = "ennnn", TreeView1_SelectedNode = "TreeView1t4",
TreeView1_PopulateLog = "", `__VIEWSTATE` = "", `__VIEWSTATEGENERATOR` = "",
`__VIEWSTATEENCRYPTED` = "", `__EVENTVALIDATION` = "", DdlState = "35",
DdlYear = "2001", DdlFormat = "1"), .Names = c("__EVENTTARGET",
"__EVENTARGUMENT", "__LASTFOCUS", "TreeView1_ExpandState", "TreeView1_SelectedNode",
"TreeView1_PopulateLog", "__VIEWSTATE", "__VIEWSTATEGENERATOR",
"__VIEWSTATEENCRYPTED", "__EVENTVALIDATION", "DdlState", "DdlYear",
"DdlFormat"))
There are certain session-related values:
attr_names <- c("__EVENTVALIDATION", "__VIEWSTATEGENERATOR", "__VIEWSTATE", "__VIEWSTATEENCRYPTED")
You could add them like this:
setAttrNames <- function(attr_name){
  name <- doc %>%
    html_nodes(xpath = glue("//*[@id = '{attr_name}']")) %>%
    html_attr(name = "value")
  body[[attr_name]] <<- name
}
Then you can add these session-specific values:
library(httr)
library(rvest)
library(glue)

url <- "https://aps.dac.gov.in/LUS/Public/Reports.aspx"
doc <- url %>% GET %>% content("text") %>% read_html
sapply(attr_names, setAttrNames)
Sending the request:
Then you can send the request:
response <- POST(
  url = url,
  encode = "form",
  body = body
)
response$status_code # still indicates that we have an error in the request.
Follow-up ideas:
I checked for cookies. There is a session cookie, but it does not seem to be necessary for the request.
Adding headers: I also tried setting the request headers copied from the browser:
header <- structure(c("aps.dac.gov.in", "keep-alive", "3437", "max-age=0",
"https://aps.dac.gov.in", "1", "application/x-www-form-urlencoded",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36",
"?1", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
"same-origin", "navigate", "https://aps.dac.gov.in/LUS/Public/Reports.aspx",
"gzip, deflate, br", "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7"), .Names = c("Host",
"Connection", "Content-Length", "Cache-Control", "Origin", "Upgrade-Insecure-Requests",
"Content-Type", "User-Agent", "Sec-Fetch-User", "Accept", "Sec-Fetch-Site",
"Sec-Fetch-Mode", "Referer", "Accept-Encoding", "Accept-Language"
))
hdrs <- header %>% add_headers
response <- POST(
  url = url,
  encode = "form",
  body = body,
  hdrs
)
But I get a timeout for this request.
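One possible reason, just a guess: the copied browser headers include Host, Content-Length and Accept-Encoding, which curl normally sets itself, and a hard-coded Content-Length that does not match the actual body can leave the server waiting for data that never arrives. A hedged variant that keeps only the headers curl cannot infer on its own:

hdrs_min <- add_headers(
  `User-Agent` = header[["User-Agent"]],
  Referer      = header[["Referer"]],
  Accept       = header[["Accept"]]
)

response <- POST(
  url = url,
  encode = "form",
  body = body,
  hdrs_min,
  timeout(60)   # also give the server a bit more time
)
response$status_code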
Note: The site does not seem to have a robots.txt. But check the Terms and Conditions of the site.
I tried running these two lines myself at work and got a somewhat more explicit error message than you did:
Could not open chrome browser.
Client error message:
Summary: UnknownError
Detail: An unknown server-side error occurred while processing the command.
Further Details: run errorDetails method
Check server log for further details.
It might be that, if you are at work without admin privileges, R can't create a child process.
As a matter of fact, I used to run into absolutely awful problems myself trying to build a bot using RSelenium. rsDriver() was not consistent at all and kept crashing; I had to wrap it in a loop with error catching to keep it running, and then I had to track down and delete gigabytes of temp files manually.
I tried to install Docker and spent a lot of time on the setup, but in the end it wasn't supported on my non-professional edition of Windows.
Solution: Selenium from Python is very well documented, never crashes, works like a charm. Coding in the interactive Spyder editor from Anaconda feels almost like R.
And of course you can use something like system("python myscript.py") from R in order to get the process started and the resulting files back into R if you wish so.
EDIT: No admin privileges are required at all for Anaconda or Selenium. I run it myself without any problem from work. If you have trouble with pip install commands being SSL-blocked like me you can bypass it using the --trusted-host argument.
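If you go that route, a minimal sketch of the round trip from R (the script name and output file below are hypothetical):

# run a hypothetical Python Selenium script that writes scraped_data.csv...
system2("python", args = "myscript.py")

# ...then pull the result back into R
scraped <- read.csv("scraped_data.csv", stringsAsFactors = FALSE)
head(scraped)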
Selenium is useful when you must run the javascript on a webpage. For websites that don't require the javascript to be run (i.e. if the information you're after is contained within the webpage HTML), rvest or httr are your best bets.
In your case though, to download a file, simply use download.file(), which is a function in base R.
The website in your question is currently down (so I can't see it), but here's an example using a random file from another website:
download.file("https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf", "mygreatfile.pdf")
To check that it worked
dir()
# [1] "mygreatfile.pdf"
Depending on how the website is structured, you may be able to obtain a list of the file URLs and then loop through them in R, downloading one after another; a sketch of such a loop follows below.
Lastly, an extra tip. Depending on the file type, and what you're doing with the files, you may be able to read them directly into R instead of saving them first. For example, read.csv() accepts a URL and reads the CSV directly from the web. Other read functions may be able to do the same.
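Here is a hedged sketch of that download loop, assuming you have already collected the file URLs into a character vector (the URLs below are placeholders):

# hypothetical vector of report URLs collected from the page
file_urls <- c("https://example.com/report_2001.xls",
               "https://example.com/report_2002.xls")

for (u in file_urls) {
  # mode = "wb" matters on Windows for binary files such as xls or pdf
  download.file(u, destfile = basename(u), mode = "wb")
  Sys.sleep(2)  # be polite between downloads
}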
Update
I currently see an internal server error (500) when I visit the site, but I can view it via the Wayback Machine, so I can see there is indeed JavaScript on the page. When the site is back up and running, I will attempt to download the files.
I've successfully received an access token from an OAuth 2.0 request, so I can start obtaining data from the server. However, I keep getting a 403 error on each attempt. APIs are very new to me and I'm only entry level in R, so I can't figure out what's wrong with my request. I'm currently using the crul package, but I've also tried making the request with httr and can't get anything through without hitting the 403. In the end I'd like to refresh a Shiny app with data imported from this other application, which actually stores the data, but first I want to pull data to my console locally so I understand the basic process. I will post some of my current attempts.
library(crul)

(x <- HttpClient$new(
  url = 'https://us.castoredc.com',
  opts = list(exceptions = FALSE),
  headers = list())
)

res.token <- x$post('oauth/token',
                    body = list(client_id = "{id}",
                                client_secret = "{secret}",
                                grant_type = 'client_credentials'))

importantStuff <- jsonlite::fromJSON(res.token$parse("UTF-8"))
token <- paste("Bearer", importantStuff$access_token)
I obtain my token, but the following doesn't seem to work. I'm attempting to get the list of study codes so that I can call on them in further requests to actually get data from a study.
res.studies <- x$get('/api/study',
                     headers = list(Authorization = token,
                                    client_id = "{id}",
                                    client_secret = "{secret}",
                                    grant_type = 'client_credentials'),
                     body = list(content_type = 'application/json'))
Their support team gave me the above endpoint to access the content, but I get a 403, so I think I'm not using my token correctly?
status: 403
access-control-allow-headers: Authorization
access-control-allow-methods: Get,Post,Options,Patch
I'm the CEO at Castor EDC, and although it's pretty cool to see a Castor EDC question on here, I apologize for the time you lost trying to figure this out. Was our support team not able to provide more assistance?
Regardless, I have actually used our API quite a bit in R and we also have an amazing R Engineer in house if you need more help.
Reflecting on your answer: yes, you always need a study ID to be able to do anything interesting with the API. One thing that could make your life A LOT easier is our R API wrapper; you can find it here: https://github.com/castoredc/castoRedc
With that you would:
remotes::install_github("castoredc/castoRedc")
library(castoRedc)
castor_api <- CastorData$new(key = Sys.getenv("CASTOR_KEY"),
secret = Sys.getenv("CASTOR_SECRET"),
base_url = "https://data.castoredc.com")
example_study_id <- studies[["study_id"]][1]
fields <- castor_api$getFields(example_study_id)
etc.
Hope that makes your life a lot easier in the future.
So, after some investigation, it turns out that you first have to make a request to obtain another ID for each Castor study under your username. I will post some example code that finally worked.
req.studyinfo <- httr::GET(url = "https://us.castoredc.com/api/study",
                           httr::add_headers(Authorization = token))
json <- httr::content(req.studyinfo, as = "text")
studies <- fromJSON(json)
This will give you a list of your studies in Castor, from which you can obtain the ID that you care about for your endpoints. It is a list containing a data frame with this information.
You use the same format with whatever endpoint from their documentation you like in order to retrieve data. Thank you for your observations! I will leave this here in case anyone is employed to develop anything from data stored in Castor EDC. Their documentation was vague to me, so maybe this will help someone in the future.
Example for next step:
req.studydata <- httr::GET(url = "https://us.castoredc.com/api/study/{study id obtained from previous step}/data-point-collection/study",
                           httr::add_headers(Authorization = token))
json.data <- httr::content(req.studydata, as = "text")
data <- fromJSON(json.data)
This worked for me; I removed the Sys.getenv() part:
library(castoRedc)

castor_api <- CastorData$new(key = "CASTOR_KEY",
                             secret = "CASTOR_SECRET",
                             base_url = "https://data.castoredc.com")

example_study_id <- studies[["study_id"]][1]
fields <- castor_api$getFields(example_study_id)
I would like to programmatically export the records available at this website. To do this manually, I would navigate to the page, click export, and choose the csv.
I tried copying the link from the export button, which will work as long as I have a cookie (I believe), so a wget or httr request returns the HTML site instead of the file.
I found some help in an issue on the rvest GitHub repo, but ultimately, like the issue's author, I can't quite figure out how to save the cookie in an object and use it in a request.
Here is where I'm at:
library(httr)
library(rvest)
apoc <- html_session("https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx")
headers <- headers(apoc)
GET(url = "https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx?exportAll=False&exportFormat=CSV&isExport=True",
add_headers(headers)) # how can I take the output from headers in httr and use it as an argument in GET from httr?
I have checked the robots.txt and this is permissible.
You can get the __VIEWSTATE and __VIEWSTATEGENERATOR from the hidden form fields in the page returned when you GET https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx, and then reuse them in your subsequent POST query and the GET for the csv.
options(stringsAsFactors=FALSE)
library(httr)
library(curl)
library(xml2)
url <- 'https://aws.state.ak.us/ApocReports/Registration/CandidateRegistration/CRForms.aspx'
#get session headers
req <- GET(url)
req_html <- read_html(rawToChar(req$content))
fields <- c("__VIEWSTATE","__VIEWSTATEGENERATOR")
viewheaders <- lapply(fields, function(x) {
  xml_attr(xml_find_first(req_html, paste0(".//input[@id='", x, "']")), "value")
})
names(viewheaders) <- fields
#post request. you can get the list of form fields using tools like Fiddler
params <- c(viewheaders,
list(
"M$ctl19"="M$UpdatePanel|M$C$csfFilter$btnExport",
"M$C$csfFilter$ddlNameType"="Any",
"M$C$csfFilter$ddlField"="Elections",
"M$C$csfFilter$ddlReportYear"="2017",
"M$C$csfFilter$ddlStatus"="Default",
"M$C$csfFilter$ddlValue"=-1,
"M$C$csfFilter$btnExport"="Export"))
resp <- POST(url, body=params, encode="form")
print(resp$status_code)
resptext <- rawToChar(resp$content)
#writeLines(resptext, "apoc.html")
#get response i.e. download csv
url <- "https://aws.state.ak.us//ApocReports/Registration/CandidateRegistration/CRForms.aspx?exportAll=True&exportFormat=CSV&isExport=True"
req <- GET(url, body=params)
read.csv(text=rawToChar(req$content))
You might need to play around with the inputs/code to get what you want precisely.
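One extra check that can save some head-scratching is confirming the response really is CSV before parsing it; a sketch using httr's http_type() helper:

# only parse if the server actually returned CSV; otherwise save the page to inspect it
if (grepl("csv", http_type(req), ignore.case = TRUE)) {
  dat <- read.csv(text = rawToChar(req$content))
} else {
  writeLines(rawToChar(req$content), "debug_response.html")
}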
Here is another similar solution using RCurl:
how-to-login-and-then-download-a-file-from-aspx-web-pages-with-r