I am trying to retrieve information from an API, which gives the name of the product from the barcode, through the API.
I am using the httr::GET().
The URL needed for the API contains the barcode itself, but I do not know how to automate the system so it can read the barcode contained in every entry, and plugging it into the url without me copying and pasting the code manually in the script.
one_code <- GET("api.upcdatabase.org/json/aebfed7a26f24a05efd7f77749dc2fcc/…")
result <- content(one_code)
result$description
A couple extra things to consider.
First, the site provides https for the API so it should be used since you're exposing your API key on any network you make requests from otherwise.
Test the core HTTP status code and halt on major HTTP errors (not API errors).
You should also put your API key in something like an environment variable so it never ends up in scripts or GitHub repo commits. Use ~/.Renviron (make a single line entry for UPCDATABASE_API_KEY=your_key and then restart R).
You should handle error and success conditions and consider returning a data frame so you can have all the fields in a tidy, accessible fashion.
Finally, do some basic type conversion prior to returning the values to make return field values easier to use.
library(httr)
library(jsonlite)
library(purrr)
get_upc_code_info <- function(code, api_key=Sys.getenv("UPCDATABASE_API_KEY")) {
URL <- sprintf("https://api.upcdatabase.org/json/%s/%s", api_key, code)
res <- GET(URL)
stop_for_status(res)
res <- content(res, as="text", encoding="UTF-8")
res <- fromJSON(res, flatten=TRUE)
if (res$valid == "true") {
res <- flatten_df(res)
res$valid <- TRUE
res$avg_price <- as.numeric(res$avg_price)
res$rate_up <- as.numeric(res$rate_up)
res$rate_down <- as.numeric(res$rate_down)
return(res)
} else {
message(res$reason)
return(data.frame(number = code, valid = FALSE, stringsAsFactors=FALSE))
}
}
xdf <- get_upc_code_info("0111222333446")
dplyr::glimpse(xdf)
## Observations: 1
## Variables: 8
## $ valid <lgl> TRUE
## $ number <chr> "0111222333446"
## $ itemname <chr> "UPC Database Testing Code"
## $ alias <chr> "Testing Code"
## $ description <chr> "http://upcdatabase.org/code/0111222333446"
## $ avg_price <dbl> 123.45
## $ rate_up <dbl> 14
## $ rate_down <dbl> 3
Similar to what Aurèle suggested, you can use the function to make it easier to get multiple codes. Since this function returns a data frame, you can easily get a larger, complete data frame from individual lookups with purrr::map_df():
codes <- c("0057000006976", "3228881010711", "0817346023170", "44xx4444444")
xdf <- map_df(codes, get_upc_code_info)
dplyr::glimpse(xdf)
## Observations: 4
## Variables: 8
## $ valid <lgl> TRUE, TRUE, TRUE, FALSE
## $ number <chr> "0057000006976", "3228881010711", "0817346023170",...
## $ itemname <chr> "Heinz Original Beans (Pork & Molasses)", "Lip...
## $ alias <chr> "", "", "", NA
## $ description <chr> "", "Boîte de 20 sachets", "", NA
## $ avg_price <dbl> NA, NA, 39.99, NA
## $ rate_up <dbl> 0, 0, 1, NA
## $ rate_down <dbl> 0, 0, 0, NA
Consider putting finishing touches on this, adding a function to POST to the API, possibly make a Shiny app so folks can submit new entries through R and turning it into a package. You might even get extra free credits on the site if you do so.
Store your barcodes in one data structure (list or vector):
barcodes <- c(
"aebfed7a26f24a05efd7f77749dc2fcc",
"xyz1234567f24a05efd7f77749dc2fcc",
"pqr9876543f24a05efd7f77749dc2fcc"
)
Write a function:
scrape <- function(barcode) {
sample=GET(paste0("api.upcdatabase.org/json/", barcode, "/rest/of/the/url"))
result=content(sample)
result$description
}
And apply:
res <- lapply(barcodes, scrape)
The results are stored in a single list, so that they're easier to manipulate.
Related
I am trying to use R to retrieve data from the YouTube API v3 and there are few/no tutorials out there that show the basic process. I have figured out this much so far:
# Youtube API query
base_url <- "https://youtube.googleapis.com/youtube/v3/"
my_yt_search <- function(search_term, max_results = 20) {
my_api_url <- str_c(base_url, "search?part=snippet&", "maxResults=", max_results, "&", "q=", search_term, "&key=",
my_api_key, sep = "")
result <- GET(my_api_url)
return(result)
}
my_yt_search(search_term = "salmon")
But I am just getting some general meta-data and not the search results. Help?
PS. I know there is a package 'tuber' out there but I found it very unstable and I just need to perform simple searches so I prefer to code the requests myself.
Sadly there is no way to directly get the durations, you'll need to call the videos endpoint (with the part set to part=contentDetails) after doing the search if you want to get those infos, however you can pass as much as 50 ids in a single call thus we can save some time by pasting all the ids together.
library(httr)
library(jsonlite)
library(tidyverse)
my_yt_duration <- function(...){
my_api_url <- paste0(base_url, "videos?part=contentDetails", paste0("&id=", ..., collapse=""), "&key=",
my_api_key )
GET(my_api_url) -> resp
fromJSON(content(resp, "text"))$items %>% as_tibble %>% select(id, contentDetails) -> tb
tb$contentDetails$duration %>% tibble(id=tb$id, duration=.)
}
### getting the video IDs
my_yt_search(search_term = "salmon")->res
## Converting from JSON then selecting all the video ids
# fromJSON(content(res,as="text") )$items$id$videoId
my_yt_duration(fromJSON(content(res,as="text") )$items$id$videoId) -> tib.id.duration
# A tibble: 20 x 2
id duration
<chr> <chr>
1 -x2E7T3-r7k PT4M14S
2 b0ahREpQqsM PT3M35S
3 ROz8898B3dU PT14M17S
4 jD9VJ92xyzA PT5M42S
5 ACfeJuZuyxY PT3M1S
6 bSOd8r4wjec PT6M29S
7 522BBAsijU0 PT10M51S
8 1P55j9ub4es PT14M59S
9 da8JtU1YAyc PT3M4S
10 4MpYuaJsvRw PT8M27S
11 _NbbtnXkL-k PT2M53S
12 3q1JN_3s3gw PT6M17S
13 7A-4-S_k_rk PT9M37S
14 txKUTx5fNbg PT10M2S
15 TSSPDwAQLXs PT3M11S
16 NOHEZSVzpT8 PT7M51S
17 4rTMdQzsm6U PT17M24S
18 V9eeg8d9XEg PT10M35S
19 K4TWAvZPURg PT3M3S
20 rR9wq5uN_q8 PT4M53S
After having used the following script for a while, it suddenly stopped working. I constructed a simple function that finds a table - based on its xpath - within a web page.
library(rvest)
url <- c('http://finanzalocale.interno.it/apps/floc.php/certificati/index/codice_ente/1010020010/cod/4/anno/1999/md/0/cod_modello/CCOU/tipo_modello/U/cod_quadro/08'
find_table <- function(x){read_html(x) %>%
html_nodes(xpath = '//*[#id="center"]/table[2]') %>%
html_table() %>%
as.data.frame()}
table <- find_table(url)
I also tried to use httr::GET before read_html, passing the following argument:
query = list(r_date = "2017-12-22")
but nothing changed. Any ideas?
Well, that code doesn't work since you missed a ) in the url <- line.
We'll add in httr:
library(httr)
library(rvest)
url is the name of a base function. using base function names as variables can make problems in code hard to debug. Unless you write perfect code, it's a good idea to not use the names that way.
URL <- c('http://finanzalocale.interno.it/apps/floc.php/certificati/index/codice_ente/1010020010/cod/4/anno/1999/md/0/cod_modello/CCOU/tipo_modello/U/cod_quadro/08')
I don't know if you know the "rules" about web scraping but if you're making repeated requests to this site, then a "crawl delay" should be used. They don't have one set in their robots.txt so 5 seconds is the accepted alternative. I point this out as you may be getting rate limited.
find_table <- function(x, crawl_delay=5) {
Sys.sleep(crawl_delay) # you can put this in a loop vs here if you aren't often doing repeat gets
# switch to httr::GET so you can get web server interaction info.
# since you're scraping, it's expected that you use a custom user agent
# that also supplies contact info.
res <- GET(x, user_agent("My scraper"))
# check to see if there's a non HTTP 200 response which there may be
# if you're getting rate-limited
stop_for_status(res)
# now, try to do the parsing. It looks like you're trying to target a
# single table, so i switched it from `html_nodes()` to `html_node()` since
# the latter returns a `list` and the pipe will error out if there's more
# than on list element.
content(res, "parsed") %>%
html_node(xpath = '//*[#id="center"]/table[2]') %>%
html_table() %>%
as.data.frame()
}
table is a base function name, too (see above)
result <- find_table(URL)
Worked fine for me:
str(result)
## 'data.frame': 11 obs. of 5 variables:
## $ ENTI EROGATORI : chr "Cassa DD.PP." "Istituti di previdenza amministrati dal Tesoro" "Istituto per il credito sportivo" "Aziende di credito" ...
## $ : logi NA NA NA NA NA NA ...
## $ ACCENSIONE ACCERTAMENTI : chr "4.638.500,83" "0,00" "0,00" "953.898,47" ...
## $ ACCENSIONE RISCOSSIONI C|COMP. + RESIDUI: chr "2.177.330,12" "0,00" "129.114,22" "848.935,84" ...
## $ RIMBORSO IMPEGNI : chr "438.696,57" "975,07" "45.584,55" "182.897,01" ...
I would like to extract the following table using rvest from http://finra-markets.morningstar.com/BondCenter/TRACEMarketAggregateStats.jsp (for any date):
I tried the following but failed to produce any result:
library(rvest)
url <- "http://finra-markets.morningstar.com/BondCenter/TRACEMarketAggregateStats.jsp"
htmlSession <-html_session(url) ## create session
goForm <- html_form(htmlSession)[[2]] ## pull form from session
#filledGoForm <- set_values(goForm, value="04/26/2017") # This does not work
filledGoForm <- goForm
filledGoForm$fields[[1]]$value <- "04/26/2017"
htmlSession <- submit_form(htmlSession, filledGoForm)
> htmlSession <- submit_form(htmlSession, filledGoForm)
Submitting with ''
Warning message:
In request_POST(session, url = url, body = request$values, encode = request$encode, :
Not Found (HTTP 404).
Any hints on how to do this highly appreciated.
That site uses many XHR requests to populate the tables. And, it establishes a server session with a hidden POST request which won't be replicated with html_session().
We'll need to add in httr for some help:
library(httr)
library(rvest)
The first thing we need to do is to just hit the site to get an initial qs_wid cookie into the implicit cookie jar curl/httr/rvest share:
init <- GET("http://finra-markets.morningstar.com/MarketData/Default.jsp")
Next, we need to mimic the hidden "login" that the web page does:
nxt <- POST(url = "http://finra-markets.morningstar.com/finralogin.jsp",
body = list(redirectPage = "/BondCenter/TRACEMarketAggregateStats.jsp"),
encode = "form")
That creates a session on the server back-end and places a few other cookies in our cookie jar.
Finally:
GET(
url = "http://finra-markets.morningstar.com/transferPage.jsp",
query = list(
`path`="http://muni-internal.morningstar.com/public/MarketBreadth/C",
`date`="04/24/2017",
`_`=as.numeric(Sys.time())
)
) -> res
makes the request. You can make a function out of all three steps (together) and parameterize that last GET.
Unfortunately, that returns a very broken HTML <table> that html_table() can't translate into a data frame automagically for you, but that shouldn't stop you:
content(res) %>%
html_nodes("td") %>%
html_text() %>%
matrix(ncol=4, byrow=TRUE) %>%
as_data_frame() %>%
mutate_all(as.numeric) %>%
rename(all_issues=V1, investment_grade=V2, high_yield=V3, convertible=V4) %>%
mutate(category = c("total_issues_traded", "advances", "declines", "unchanged", "high_52", "low_52", "dollar_volume"))
## # A tibble: 7 × 5
## all_issues investment_grade high_yield convertible category
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 7983 5602 2194 187 total_issues_traded
## 2 3025 1798 1100 127 advances
## 3 4448 3575 824 49 declines
## 4 124 42 75 7 unchanged
## 5 257 66 175 16 high_52
## 6 139 105 33 1 low_52
## 7 22601 16143 5742 715 dollar_volume
To get the other data tables, go to the Developer Tools option in your browser (switch to one that has it if yours doesn't … you're likely on Windows given that you're doing finance things and IE/Edge aren't very good browsers for introspection) and refresh the page to see the other requests that get made.
I work with regularly refreshed XML reports and I would like to automate the munging process using R & xml2.
Here's a link to an entire example file.
Here's a sample of the XML:
<?xml version="1.0" ?>
<riDetailEnrolleeReport xmlns="http://vo.edge.fm.cms.hhs.gov">
<includedFileHeader>
<outboundFileIdentifier>f2e55625-e70e-4f9d-8278-fc5de7c04d47</outboundFileIdentifier>
<cmsBatchIdentifier>RIP-2015-00096</cmsBatchIdentifier>
<cmsJobIdentifier>16220</cmsJobIdentifier>
<snapShotFileName>25032.BACKUP.D03152016T032051.dat</snapShotFileName>
<snapShotFileHash>20d887c9a71fa920dbb91edc3d171eb64a784dd6</snapShotFileHash>
<outboundFileGenerationDateTime>2016-03-15T15:20:54</outboundFileGenerationDateTime>
<interfaceControlReleaseNumber>04.03.01</interfaceControlReleaseNumber>
<edgeServerVersion>EDGEServer_14.09_01_b0186</edgeServerVersion>
<edgeServerProcessIdentifier>8</edgeServerProcessIdentifier>
<outboundFileTypeCode>RIDE</outboundFileTypeCode>
<edgeServerIdentifier>2800273</edgeServerIdentifier>
<issuerIdentifier>25032</issuerIdentifier>
</includedFileHeader>
<calendarYear>2015</calendarYear>
<executionType>P</executionType>
<includedInsuredMemberIdentifier>
<insuredMemberIdentifier>ARS001</insuredMemberIdentifier>
<memberMonths>12.13</memberMonths>
<totalAllowedClaims>1000.00</totalAllowedClaims>
<totalPaidClaims>100.00</totalPaidClaims>
<moopAdjustedPaidClaims>100.00</moopAdjustedPaidClaims>
<cSRMOOPAdjustment>0.00</cSRMOOPAdjustment>
<estimatedRIPayment>0.00</estimatedRIPayment>
<coinsurancePercentPayments>0.00</coinsurancePercentPayments>
<includedPlanIdentifier>
<planIdentifier>25032VA013000101</planIdentifier>
<includedClaimIdentifier>
<claimIdentifier>CADULT4SM00101</claimIdentifier>
<claimPaidAmount>100.00</claimPaidAmount>
<crossYearClaimIndicator>N</crossYearClaimIndicator>
</includedClaimIdentifier>
</includedPlanIdentifier>
</includedInsuredMemberIdentifier>
<includedInsuredMemberIdentifier>
<insuredMemberIdentifier>ARS002</insuredMemberIdentifier>
<memberMonths>9.17</memberMonths>
<totalAllowedClaims>0.00</totalAllowedClaims>
<totalPaidClaims>0.00</totalPaidClaims>
<moopAdjustedPaidClaims>0.00</moopAdjustedPaidClaims>
<cSRMOOPAdjustment>0.00</cSRMOOPAdjustment>
<estimatedRIPayment>0.00</estimatedRIPayment>
<coinsurancePercentPayments>0.00</coinsurancePercentPayments>
<includedPlanIdentifier>
<planIdentifier>25032VA013000101</planIdentifier>
<includedClaimIdentifier>
<claimIdentifier></claimIdentifier>
<claimPaidAmount>0</claimPaidAmount>
<crossYearClaimIndicator>N</crossYearClaimIndicator>
</includedClaimIdentifier>
</includedPlanIdentifier>
</includedInsuredMemberIdentifier>
</riDetailEnrolleeReport>
I would like to:
Read in the XML into R
Locate a specific insuredMemberIdentifier
Extract the planIdentifier and all claimIdentifier data associated with the member ID in (2)
Store all text and values for insuredMemberIdentifier, planIdentifier, claimIdentifier, and claimPaidAmount in a data.frame with a row for each unique claim ID (member ID to claim ID is a 1 to many)
So far, I have accomplished 1 and I'm in the ballpark on 2:
## Step 1 ##
ride <- read_xml("/Users/temp/Desktop/RIDetailEnrolleeReport.xml")
## Step 2 -- assume the insuredMemberIdentifier of interest is 'ARS001' ##
memID <- xml_find_all(ride, "//d1:insuredMemberIdentifier[text()='ARS001']", xml_ns(ride))
[I know that I can then use xml_text() to extract the text of the element.]
After the code in Step 2 above, I've tried using xml_parent() to locate the parent node of the insuredMemberIdentifier, saving that as a variable, and then repeating Step 2 for claim info on that saved variable node.
node <- xml_parent(memID)
xml_find_all(node, "//d1:claimIdentifier", xml_ns(ride))
But this just results in pulling all claimIdentifiers in the global file.
Any help/information on how to get to step 4, above, would be greatly appreciated. Thank you in advance.
Apologies for the late response, but for posterity, import data as above using xml2, then parse the xml file by ID, as hinted by har07.
# output object to collect all claims
res <- data.frame(
insuredMemberIdentifier = rep(NA, 1),
planIdentifier = NA,
claimIdentifier = NA,
claimPaidAmount = NA)
# vector of ids of interest
ids <- c('ARS001')
# indexing counter
starti <- 1
# loop through all ids
for (ii in seq_along(ids)) {
# find ii-th id
## Step 2 -- assume the insuredMemberIdentifier of interest is 'ARS001' ##
memID <- xml_find_all(x = ride,
xpath = paste0("//d1:insuredMemberIdentifier[text()='", ids[ii], "']"))
# find node for
node <- xml_parent(memID)
# as har07's comment find claim id within this node
cid <- xml_find_all(node, ".//d1:claimIdentifier", xml_ns(ride))
pid <- xml_find_all(node, ".//d1:planIdentifier", xml_ns(ride))
cpa <- xml_find_all(node, ".//d1:claimPaidAmount", xml_ns(ride))
# add invalid data handling if necessary
if (length(cid) != length(cpa)) {
warning(paste("cid and cpa do not match for", ids[ii]))
next
}
# collect outputs
res[seq_along(cid) + starti - 1, ] <- list(
ids[ii],
xml_text(pid),
xml_text(cid),
xml_text(cpa))
# adjust counter to add next id into correct row
starti <- starti + length(cid)
}
res
# insuredMemberIdentifier planIdentifier claimIdentifier claimPaidAmount
# 1 ARS001 25032VA013000101 CADULT4SM00101 100.00
I just discovered the force cli tool. How do I query for more than 2000 records?
I have it working for queries but it only returns 2000 and I need to get them all.
force query select Id, Name from Custom_Object__c --format:csv > custom.csv
this file has about 10,000 records and I am only able to get the first 2000. The documentation of the cli tool does not mention many details, however it is much faster than the tool I am using, which R using the RForcecom library
All you need to do is use the bulkQuery function in the package RForcecom. Example:
all_leads <- "SELECT Id FROM Lead WHERE CreatedDate < TODAY"
rforcecom.bulkQuery(rforce_session, all_leads,object="Lead") -> all_leads_df
Selects Id for everything in Lead prior to today. Takes about a minute for me to do ~700k rows.
The salesforcer package can query millions of records quickly via the Bulk 1.0 API. Below is an example which returns all records and fields on the Contact object:
library(tidyverse)
library(salesforcer)
sf_auth(username, password, security_token)
# get all the fields on the Contact object (remove compound fields which
# are not queryable via the Bulk API.
# https://developer.salesforce.com/docs/atlas.en-us.api.meta/api/compound_fields.htm
contact_fields <- sf_describe_object_fields('Contact') %>%
filter(type != 'address', !grepl('Latitude|Longitude', name))
# build the SOQL
all_soql <- sprintf("SELECT %s FROM Contact",
paste(contact_fields$name, collapse=","))
# query the records and fields
all_records <- sf_query(all_soql, "Contact", api_type="Bulk 1.0")
all_records
#> # A tibble: 2,039,230 x 58
#> Id IsDeleted MasterRecordId AccountId ...
#> <chr> <lgl> <lgl> <lgl>
#> 1 0033s00000wycyeAAA FALSE NA NA
#> 2 0033s00000wycyjAAA FALSE NA NA
#> 3 0033s00000wycyoAAA FALSE NA NA
#> # ...