How to get all of the records in COVID-19 Data Lake linelistrecord in R

I'd like to use the https://api.c3.ai/covid/api/1/linelistrecord/fetch API but only get 2000 records back. I know that there are more than 2000 records -- how do I get them?
Here's my code in R:
library(tidyverse)
library(httr)
library(jsonlite)
resp <- POST(
  "https://api.c3.ai/covid/api/1/linelistrecord/fetch",
  body = list(
    spec = {}
  ) %>% toJSON(auto_unbox = TRUE),
  accept("application/json")
)
length(content(resp)$objs)
I get 2000 records.

The spec you are passing in has the following optional fields, among others:
limit // maximum number of objects to return
offset // offset to use for paged reads
The default value of limit is 2000.
The fetch result that is returned includes, along with the array of objects, a boolean field called hasMore, which indicates whether there are more records in the underlying data store.
You can write a loop that ends once hasMore is false. Start with an offset of 0 and a limit of n (say, n = 2000), then iteratively increase offset by n.
library(tidyverse)
library(httr)
library(jsonlite)

limit <- 2000
offset <- 0
hasMore <- TRUE
all_objs <- c()

while (hasMore) {
  resp <- POST(
    "https://api.c3.ai/covid/api/1/linelistrecord/fetch",
    body = list(
      spec = list(
        limit = limit,
        offset = offset,
        filter = "contains(location, 'California')" # just as an example, to cut down on the dataset
      )
    ) %>% toJSON(auto_unbox = TRUE),
    accept("application/json")
  )
  parsed <- content(resp)               # parse the response once
  hasMore <- parsed$hasMore             # FALSE once the last page has been read
  offset <- offset + limit              # advance to the next page
  all_objs <- c(all_objs, parsed$objs)
}
length(all_objs)
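If you would rather end up with a data frame, as the Python version below does, you can row-bind the parsed records. This is a minimal sketch, assuming each element of all_objs is a flat named list of scalar fields; nested fields would need to be flattened or dropped first:
# sketch: convert the list of records into one data frame
# (dplyr fills fields missing from some records with NA)
library(dplyr)
records <- bind_rows(all_objs)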

You can do the same thing in Python. Here is a code snippet:
import requests
import pandas as pd

headers = {'Accept': 'application/json'}

def read_data(url, payload, headers=headers):
    df_list = []
    has_more = True
    payload['spec']['offset'] = 0
    while has_more:
        response = requests.post(url, json=payload, headers=headers)
        page = response.json()
        df = pd.DataFrame.from_dict(page['objs'])
        has_more = page['hasMore']
        payload['spec']['offset'] += df.shape[0]  # advance the offset by the page size
        df_list.append(df)
    return pd.concat(df_list)

url = 'https://api.c3.ai/covid/api/1/linelistrecord/fetch'
payload = {
    "spec": {
        "filter": "exists(hospitalAdmissionDate)",
        "include": "caseConfirmationDate, outcomeDate, hospitalAdmissionDate, age"
    }
}
df = read_data(url, payload)

Related

Can't fetch all records in Qualtrics API using httr package

I am trying to fetch all of my mailing list contacts using the following custom function, but the contact lists don't download all of the records inside them. What am I doing wrong?
get_all_contacts <- function(mailingListID){
  directoryId <- "POOL_XXXXXXXXXX"
  apiToken <- "XXXXXXXXXX"
  fetch_url <- VERB(verb = "GET",
                    url = paste("https://iad1.qualtrics.com/API/v3/directories/", directoryId,
                                "/mailinglists/", mailingListID, "/contacts", sep = ""),
                    add_headers(`X-API-TOKEN` = apiToken), encode = "json")
  fetch_url <- content(fetch_url, "parse", encoding = "UTF-8")
  fetch_url <- fetch_url$result$nextPage
  elements <- list()
  while(!is.null(fetch_url)){
    res <- VERB(verb = "GET", url = fetch_url,
                add_headers(`X-API-TOKEN` = apiToken),
                encode = "json")
    res <- content(res, "parse", encoding = "UTF-8")
    elements <- append(elements, res$result$elements)
    fetch_url <- res$result$nextPage
  }
  return(elements)
}
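One thing jumps out from the code as posted (a guess from reading the function, not verified against the Qualtrics API): the first page is fetched only to read result$nextPage, and its result$elements are never appended, so every list comes back one page short. A minimal sketch of a fix that treats the first page like every other page:
get_all_contacts <- function(mailingListID){
  directoryId <- "POOL_XXXXXXXXXX"
  apiToken <- "XXXXXXXXXX"
  # start from the list URL itself, then follow nextPage links
  fetch_url <- paste0("https://iad1.qualtrics.com/API/v3/directories/", directoryId,
                      "/mailinglists/", mailingListID, "/contacts")
  elements <- list()
  while(!is.null(fetch_url)){
    res <- VERB(verb = "GET", url = fetch_url,
                add_headers(`X-API-TOKEN` = apiToken), encode = "json")
    res <- content(res, "parsed", encoding = "UTF-8")
    elements <- append(elements, res$result$elements)  # append every page, including the first
    fetch_url <- res$result$nextPage                   # NULL once the last page is reached
  }
  return(elements)
}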

Send JSON body with an array in it

I need to send a POST request with a JSON body.
The body contains a parameter cart, which is actually an array:
"cart": [{"nid": "123","groupId": "123","price": "200.00","priceWithDiscount": "200.00","amount": "1.0"}]
To assemble cart I have this code (the double list produces the square brackets):
args <- list(list(price="200.00",priceWithDiscount="200.00",amount="1.0",nid="123",groupId="123"))
x <- jsonlite::toJSON(args, pretty = TRUE,auto_unbox = T)
And the full request:
answer <- content(POST("https://url.com",
                       body = list(
                         sum = '200',
                         sumDiscount = '200',
                         guid = 'test_26112021',
                         number = 'test_26112021',
                         date = paste0(Sys.time()),
                         bonusAdd = '0',
                         bonusWriteOff = '0',
                         depositAdd = '0',
                         depositWriteOff = '0',
                         cart = x
                       ), encode = "json"))
This request responds with the error "The cart field must be a array.", so I guess I was mistaken in this part of the code.
Thanks to @deschen's comment, the solution is to build the entire body as nested R lists and serialize it to JSON once:
args1 <- list(list(price = "200.00", priceWithDiscount = "200.00", amount = "1.0",
                   nid = "123", groupId = "123"))
args2 <- list(guid = 'test_26112021', sum = '200', sumDiscount = '200',
              number = 'test_26112021', date = paste0(Sys.time()),
              bonusAdd = '0', bonusWriteOff = '0',
              depositAdd = '0', depositWriteOff = '0', cart = args1)
body <- jsonlite::toJSON(args2, pretty = TRUE, auto_unbox = T)
answer <- content(POST("https://url.com", content_type_json(), body = body, encode = "json"))
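A quick sanity check before sending (not part of the original answer): print the serialized body and confirm that cart renders as a JSON array of objects rather than a quoted string:
cat(body)
# ...
# "cart": [
#   {
#     "price": "200.00",
#     ...
#   }
# ]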

Correct way to get response body of XHR requests generated by a page with RStudio Chromote

I'd like to use Chromote to gather the response body of the XHR calls made by a website, but I find the API a bit complex to master, especially the async pipeline.
I guess I need to first enable the Network functionality and then load the page (this can do), but then I need to:
list all XHR calls
filter them by recognizing patterns in the request URL
access the response body of the selected requests
Can someone provide any guidance or tutorial material on this regard?
UPDATE:
OK, I switched to the crrri package and made a general function for the purpose. The only missing part is some logic to decide when to close the connection and return the results:
get_website_resources <- function(url, url_filter = '*', type_filter = '*') {
  library(crrri)
  library(dplyr)
  library(stringr)
  library(jsonlite)
  library(magrittr)

  chrome <- Chrome$new()
  out <- new.env()
  out$l <- list()

  client <- chrome$connect(callback = ~ NULL)
  Fetch <- client$Fetch
  Page <- client$Page

  Fetch$enable(patterns = list(list(urlPattern = "*", requestStage = "Response"))) %...>% {
    Fetch$requestPaused(callback = function(params) {
      if (str_detect(params$request$url, url_filter) & str_detect(params$resourceType, type_filter)) {
        Fetch$getResponseBody(requestId = params$requestId) %...>% {
          resp <- .
          if (resp$body != '') {
            if (resp$base64Encoded) resp$body = base64_dec(resp$body) %>% rawToChar()
            body <- list(list(
              url = params$request$url,
              response = resp
            )) %>% set_names(params$requestId)
            str(body)
            out$l <- append(out$l, body)
          }
        }
      }
      Fetch$continueRequest(requestId = params$requestId)
    })
  } %...>% {
    Page$navigate(url)
  }

  out$l
}
Cracked it. Here's the final function. It uses crrri::perform_with_chrome, which forces synchronous behaviour, and runs the rest of the process in a promise object with a resolve callback defined outside the promise itself; the callback is called either once a given number of resources has been collected or after a certain amount of time has passed:
get_website_resources <- function(url, url_filter = '*', type_filter = '*',
                                  wait_for = 20, n_of_resources = NULL, interactive = FALSE) {
  library(crrri)
  library(promises)
  library(stringr)   # for str_detect()
  library(magrittr)  # for %>% and set_names()

  crrri::perform_with_chrome(function(client) {
    Fetch <- client$Fetch
    Page <- client$Page

    if (interactive) client$inspect()

    out <- new.env()
    out$results <- list()
    out$resolve_function <- NULL

    out$pr <- promises::promise(function(resolve, reject) {
      out$resolve_function <- resolve

      Fetch$enable(patterns = list(list(urlPattern = "*", requestStage = "Response"))) %...>% {
        Fetch$requestPaused(callback = function(params) {
          if (str_detect(params$request$url, url_filter) && str_detect(params$resourceType, type_filter)) {
            Fetch$getResponseBody(requestId = params$requestId) %...>% {
              resp <- .
              if (resp$body != '') {
                if (resp$base64Encoded) resp$body <- jsonlite::base64_dec(resp$body) %>% rawToChar()
                body <- list(list(
                  url = params$request$url,
                  response = resp
                )) %>% set_names(params$requestId)
                out$results <- append(out$results, body)
                # resolve early once enough resources have been collected
                # (&& short-circuits when n_of_resources is NULL; & would error here)
                if (!is.null(n_of_resources) && length(out$results) >= n_of_resources) out$resolve_function(out$results)
              }
            }
          }
          Fetch$continueRequest(requestId = params$requestId)
        })
      } %...>% {
        Page$navigate(url)
      } %>% crrri::wait(wait_for) %>%
        then(~ out$resolve_function(out$results))
    })

    out$pr$then(function(x) x)
  }, timeouts = max(wait_for + 3, 30), cleaning_timeout = max(wait_for + 3, 30))
}
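A hypothetical invocation (the URL and filter patterns here are placeholders, not from the original post): collect up to five XHR responses whose request URL contains "api", waiting at most 15 seconds:
resources <- get_website_resources(
  "https://example.com",
  url_filter = "api",       # keep requests whose URL matches this pattern
  type_filter = "XHR",      # keep only XHR resources
  wait_for = 15,
  n_of_resources = 5
)
length(resources)           # a list named by requestId; each element has url and response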

How to include / exclude filter statement in R httr query for Localytics

I can successfully query data from Localytics using R, such as the following example:
r <- POST(url = "https://api.localytics.com/v1/query",
          body = list(app_id = <APP_ID>,
                      metrics = c("occurrences", "users"),
                      dimensions = c('a:URI'),
                      conditions = list(day = c("between", "2020-02-11", "2020-03-12"),
                                        event_name = "Content Viewed",
                                        "a:Item URI" = "testing")
          ),
          encode = "json",
          authenticate(Key, Secret),
          accept("application/json"),
          content_type("application/json"))
stop_for_status(r)
But what I would like to do is create a function so I can do this quickly and not have to copy/paste data.
The issue I am running into is the line "a:Item URI" = "testing", which filters all searches to those where the Item URI equals "testing". Sometimes I don't want to include the filter statement at all, so I just remove that line entirely.
When I wrote my function, I tried something like the following:
get_localytics <- function(appID, metrics, dimensions, from = Sys.Date() - 30,
                           to = Sys.Date(), eventName = "Content Viewed",
                           Key, Secret, filterDim = NULL, filterCriteria = NULL){
  r <- httr::POST(url = "https://api.localytics.com/v1/query",
                  body = list(app_id = appID,
                              metrics = metrics,
                              dimensions = dimensions,
                              conditions = list(day = c("between", as.character(from), as.character(to)),
                                                event_name = eventName,
                                                filterDim = filterCriteria)
                  ),
                  encode = "json",
                  authenticate(Key, Secret),
                  accept("application/json"),
                  content_type("application/json"))
  stop_for_status(r)
  result <- paste(rawToChar(r$content), collapse = "")
  document <- fromJSON(result)
  df <- document$results
  return(df)
}
But my attempt at adding filterDim and filterCriteria only produces the error Unprocessable Entity. (Keep in mind, there are lots of variables I can filter by, not just "a:Item URI", so I need to be able to manipulate that as well.)
How can I include a statement, where if I need to filter, I can incorporate that line, but if I don't need to filter, that line isn't included?
conditions is just a list, so you can conditionally add elements to it. Here we use an if statement to test whether the values were passed and, if so, add them in.
get_localytics <- function(appID, metrics, dimensions, from = Sys.Date() - 30,
                           to = Sys.Date(), eventName = "Content Viewed",
                           Key, Secret, filterDim = NULL, filterCriteria = NULL){
  conditions <- list(day = c("between", as.character(from), as.character(to)),
                     event_name = eventName)
  # add the filter only when both the dimension and the criteria were supplied;
  # [[<- with a variable uses its value as the element's name
  if (!is.null(filterDim) && !is.null(filterCriteria)) {
    conditions[[filterDim]] <- filterCriteria
  }
  r <- httr::POST(url = "https://api.localytics.com/v1/query",
                  body = list(app_id = appID,
                              metrics = metrics,
                              dimensions = dimensions,
                              conditions = conditions),
                  encode = "json",
                  authenticate(Key, Secret),
                  accept("application/json"),
                  content_type("application/json"))
  stop_for_status(r)
  result <- paste(rawToChar(r$content), collapse = "")
  document <- fromJSON(result)
  df <- document$results
  return(df)
}
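A hypothetical pair of calls (app ID, key, and secret are placeholders): one with the optional filter and one without, which is the case that previously required editing the function body:
df_filtered <- get_localytics(appID = "<APP_ID>", metrics = c("occurrences", "users"),
                              dimensions = c("a:URI"), Key = "<KEY>", Secret = "<SECRET>",
                              filterDim = "a:Item URI", filterCriteria = "testing")
df_all <- get_localytics(appID = "<APP_ID>", metrics = c("occurrences", "users"),
                         dimensions = c("a:URI"), Key = "<KEY>", Secret = "<SECRET>")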

Skip a value in a loop if URL doesn't exist

I am trying to write code that grabs all NBA box scores for the month of October. I want the code to try every URL possible for the combination of dates (27-31) and the 30 teams. However, as not all of the teams play every day, some combinations won't exist, so I am trying to use the try function to skip the non-existent URLs, but I can't seem to figure it out. Here is what I have written so far:
install.packages("XML")
library(XML)
teams = c('ATL','BKN','BOS','CHA','CHI',
'CLE','DAL','DEN','DET','GS',
'HOU','IND','LAC','LAL','MEM',
'MIA','MIL','MIN','NOP','NYK',
'OKC','ORL','PHI','PHX','POR',
'SAC','SA','TOR','UTA','WSH')
october = c()
for (i in teams){
for (j in (c(27:31))){
url = paste("http://www.basketball-reference.com/boxscores/201510",
j,"0",i,".html",sep = "")
data <- try(readHTMLTable(url, stringsAsFactors = FALSE))
if(inherits(data, "error")) next
away_1 = as.data.frame(data[1])
colnames(away_1) = c("Players","MP","FG","FGA","FG%","3P","3PA","3P%","FT","FTA",
"FT%", "ORB","DRB","TRB","AST","STL","BLK","TO","PF","PTS","+/-")
away_1 = away_1[away_1$Players != "Reserves",]
away_1 = away_1[away_1$MP != "Did Not Play",]
away_1$team = rep(toupper(substr(names(as.data.frame(data[1]))[1],
5, 7)),length(away_1$Players))
away_1$loc = rep(i,length(away_1$Players))
home_1 = as.data.frame(data[3])
colnames(home_1) = c("Players","MP","FG","FGA","FG%","3P","3PA","3P%","FT","FTA",
"FT%", "ORB","DRB","TRB","AST","STL","BLK","TO","PF","PTS","+/-")
home_1 = home_1[home_1$Players != "Reserves",]
home_1 = home_1[home_1$MP != "Did Not Play",]
home_1$team = rep(toupper(substr(names(as.data.frame(data[2]))[1],
5, 7)),length(home_1$Players))
home_1$loc = rep(i,length(home_1$Players))
game = rbind(away_1,home_1)
october = rbind(october, game)
}
}
Everything above and below the following lines appears to work:
data <- try(readHTMLTable(url, stringsAsFactors = FALSE))
if(inherits(data, "error")) next
I just need to properly format these two.
For anyone interested, I figured it out using url.exists in RCurl. Just implement the following after the url definition line:
if(url.exists(url) == TRUE){...}
How about using tryCatch for error handling?
result = tryCatch({
  expr
}, warning = function(w) {
  warning-handler-code
}, error = function(e) {
  error-handler-code
}, finally = {
  cleanup-code
})
where readHTMLTable is used as the main part ('expr'). You can simply return a missing value if an error/warning occurs and then omit the missing values from the final result.
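Applied to the loop above, a minimal sketch: have the error handler return NULL and skip the iteration when nothing came back. Note also that try() returns an object of class "try-error" on failure, so the original check inherits(data, "error") never fires; inherits(data, "try-error") would be the matching test.
data <- tryCatch(
  readHTMLTable(url, stringsAsFactors = FALSE),
  error = function(e) NULL   # a missing URL becomes NULL instead of stopping the loop
)
if (is.null(data)) next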
