C3.ai COVID-19 Data Lake Quickstart in R

I am working on a research assignment on COVID-19 and am using the Data Lake API to fetch the different kinds of datasets available to us.
I am wondering if it's possible to fetch all outbreak countries.
ids = list("Australia") works for an individual country, but it doesn't seem to accept a wildcard or "all".
Can anyone give me any insights on this, please?
# Total number of confirmed cases and deaths in Australia
today <- Sys.Date()
casecounts <- evalmetrics(
  "outbreaklocation",
  list(
    spec = list(
      ids = list("Australia"),
      expressions = list("JHU_ConfirmedCasesInterpolated", "JHU_ConfirmedDeathsInterpolated"),
      start = "2019-12-20",
      end = today - 1,
      interval = "DAY"
    )
  )
)
casecounts

The easiest way to access a list of countries is in the Excel file linked at https://c3.ai/covid-19-api-documentation/#tag/OutbreakLocation. It has a list of countries in the first sheet, and shows which of those have data from JHU.
You could also fetch an approximate list of country-level locations with:
locations <- fetch(
  "outbreaklocation",
  list(
    spec = list(
      filter = "not(contains(id, '_'))"
    )
  )
)
That should contain all of the countries, but could have some non-countries like World Bank regions.
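The filter's effect is easy to mimic locally to see which ids it would keep. A minimal sketch with made-up ids, assuming sub-national ids in the Data Lake take the form "Region_Country":

```r
# Made-up ids: country-level ids have no underscore,
# sub-national ids look like "Region_Country"
ids <- c("Australia", "Germany", "California_UnitedStates", "Hubei_China")

# Local equivalent of the server-side filter not(contains(id, '_'))
country_ids <- ids[!grepl("_", ids, fixed = TRUE)]
country_ids
#> [1] "Australia" "Germany"
```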
Then, you'd use this code to get the time series data for all of those locations:
library(dplyr) # for select(), sample_n(), pull(), and %>%
library(tidyr) # for unnest_wider()

location_ids <-
  locations %>%
  dplyr::select(-location) %>%
  unnest_wider(fips, names_sep = ".") %>%
  sample_n(15) %>% # include this to test on a smaller set
  pull(id)

today <- Sys.Date()
casecounts <- evalmetrics(
  "outbreaklocation",
  list(
    spec = list(
      ids = location_ids,
      expressions = list("JHU_ConfirmedCasesInterpolated", "JHU_ConfirmedDeathsInterpolated"),
      start = "2019-12-20",
      end = today - 1,
      interval = "DAY"
    )
  ),
  get_all = TRUE
)
casecounts

Querying IMF API with imfr - error no result/does not accept filter

I am currently trying to download a particular series from the Direction Of Trade Statistics at the IMF for a calculation of trade volumes between countries. There is an R package, imfr, that does a fantastic job at this. However, when going for a particular set, I run into problems.
This code works just fine and gets me the full data series I am interested in for the given countries:
library(imfr)
# get the list of IMF datasets
imf_ids()
# I am interested in direction of trade ("DOT"), so check the list of codes in the data structure
imf_codelist(database_id = "DOT")
# I want the export and import data between countries FOB, so "TXG_FOB_USD" and "TMG_FOB_USD"
imf_codes("CL_INDICATOR_DOT")
# works nicely for exports:
data_list_exports <- imf_data(database_id = "DOT", indicator = c("TXG_FOB_USD"),
                              country = c("US", "JP", "KR"),
                              start = "1995",
                              return_raw = TRUE,
                              freq = "A")
# however, the same code does not work for imports
data_list_imports <- imf_data(database_id = "DOT", indicator = c("TMG_FOB_USD"),
                              country = c("US", "JP", "KR"),
                              start = "1995",
                              return_raw = TRUE,
                              freq = "A")
This returns an empty series, and I did not understand why. So I thought maybe the US is not in the dataset (although that seems unlikely):
library(httr)
library(jsonlite)
# look at the API endpoint that provides the data structure behind a dataset
result <- httr::GET("http://dataservices.imf.org/REST/SDMX_JSON.svc/DataStructure/DOT") %>% httr::content(as = "parsed")
structure_url <- "http://dataservices.imf.org/REST/SDMX_JSON.svc/DataStructure/DOT"
raw_data <- jsonlite::fromJSON(structure_url)
test <- raw_data$Structure$CodeLists
However, the result indicates that the US is indeed in the data. So what if I just don't specify a country? That finally does download data, but only the first 60 countries because of rate limits. When doing the same with httr::GET, I hit the rate limit directly and get an error back.
data_list_imports <- imf_data(database_id = "DOT", indicator = c("TMG_FOB_USD"),
                              start = "1995",
                              return_raw = TRUE,
                              freq = "A")
Does anybody have an idea what I am doing wrong? I am really at a loss and just hope it is a typo somewhere...
Thanks and all the best!
This kind of answers the question:
cjyetman over at GitHub gave me the following hint:
You can use the print_url = TRUE argument to see the actual API call.
With...
imf_data(database_id = "DOT", indicator = c("TMG_FOB_USD"),
         country = c("US", "JP", "KR"),
         start = "1995",
         return_raw = TRUE,
         freq = "A",
         print_url = TRUE)
you get...
http://dataservices.imf.org/REST/SDMX_JSON.svc/CompactData/DOT/.US+JP+KR.TMG_FOB_USD?startPeriod=1995&endPeriod=2021
which does not return any data.
But if you add "AU" as a country to that list, you do get data with...
http://dataservices.imf.org/REST/SDMX_JSON.svc/CompactData/DOT/.AU+US+JP+KR.TMG_FOB_USD?startPeriod=1995&endPeriod=2021
So I guess either there is something wrong currently with their API,
or they actually do not have data for specifically that indicator for
those countries with that frequency, etc.
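That URL structure is easy to rebuild by hand, which helps when testing country combinations outside of imfr. A sketch (pure string pasting, no request is made; the key format is taken from the URLs above):

```r
base <- "http://dataservices.imf.org/REST/SDMX_JSON.svc/CompactData/DOT/"
countries <- c("AU", "US", "JP", "KR")
indicator <- "TMG_FOB_USD"

# key format: .<countries joined by '+'>.<indicator> (frequency slot left empty)
url <- paste0(base, ".", paste(countries, collapse = "+"), ".", indicator,
              "?startPeriod=1995&endPeriod=2021")
url
#> [1] "http://dataservices.imf.org/REST/SDMX_JSON.svc/CompactData/DOT/.AU+US+JP+KR.TMG_FOB_USD?startPeriod=1995&endPeriod=2021"
```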
This does indeed work, and it makes apparent that either there is truly "missing data" in the API, or I am simply looking for data where there is none. Since the original goal was to look at trade volumes, I have since found out that for imports the CIF value is usually used, not FOB.
Hence the correct indicator for the API call would have been the following:
library(imfr)
data_list_imports <- imf_data(database_id = "DOT", indicator = c("TMG_CIF_USD"),
                              country = c("US", "JP", "KR"),
                              start = "1995",
                              return_raw = TRUE,
                              freq = "A")
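As for the rate limit hit when no country is specified: one workaround is to request countries in small batches with a pause between calls. A sketch of the batching step (the commented imf_data call and the sleep interval are illustrative, not tested against the API):

```r
countries <- c("US", "JP", "KR", "DE", "FR", "GB", "CN")

# Split the country vector into batches of at most `batch_size`
batch_size <- 3
batches <- split(countries, ceiling(seq_along(countries) / batch_size))
length(batches)
#> [1] 3

# results <- lapply(batches, function(cc) {
#   Sys.sleep(2)  # pause to stay under the rate limit
#   imf_data(database_id = "DOT", indicator = "TMG_CIF_USD",
#            country = cc, start = "1995", freq = "A", return_raw = TRUE)
# })
```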

Census Data Using an API on R studio

So, I am new to using R, so sorry if the questions seem a little basic!
My work is asking me to look through census data using an API, identify some variables in each tract, and then create a csv file they can look at. The code is fully written for me, I believe, but I need to change the variables to:
S2602_C01_023E - Black / Hispanic
S2602_C01_081E - unemployment rate
S2602_C01_070E - not a US citizen (divide by total population)
S0101_C01_030E - number over 65 (divide by total population)
S1603_C01_009E - number below poverty (divide by total population)
S1251_C01_010E - number of children under 18 (divide by number of households)
S2503_C01_013E - median income
S0101_C01_001E - total population
S2602_C01_078E - in labor force
And I need to divide some of the variables, as noted above, and export all of this into a csv file. I just don't really know what to do with the code; I am lost because I have never used R. When I try changing the variables to the ones I need, an error comes up. Any help would be greatly appreciated!
library(tidycensus)
library(tidyverse)
library(stringr)
library(haven)
library(profvis)
# list of variables possible
v18 <- load_variables(year = 2018,
                      dataset = "acs5",
                      cache = TRUE)
# function to get variables for all states; year and variables can be easily edited
get_census_data <- function(st) {
  Sys.sleep(5)
  df <- get_acs(year = 2018,
                variables = c(totpop = "B01003_001",
                              male = "B01001_002",
                              female = "B01001_026",
                              white_alone = "B02001_002",
                              black_alone = "B02001_003",
                              americanindian_alone = "B02001_004",
                              asian_alone = "B02001_005",
                              nativehaw_alone = "B02001_006",
                              other_alone = "B02001_007",
                              twoormore = "B02001_008",
                              nh = "B03003_002",
                              his = "B03003_003",
                              noncit = "B05001_006",
                              povstatus = "B17001_002",
                              num_households = "B19058_001",
                              SNAP_households = "B19058_002",
                              medhhi = "B19013_001",
                              hsdiploma_25plus = "B15003_017",
                              bachelors_25plus = "B15003_022",
                              greater25 = "B15003_001",
                              inlaborforce = "B23025_002",
                              notinlaborforce = "B23025_007",
                              greater16 = "B23025_001",
                              civnoninstitutional = "B27010_001",
                              withmedicare_male_0to19 = "C27006_004",
                              withmedicare_male_19to64 = "C27006_007",
                              withmedicare_male_65plus = "C27006_010",
                              withmedicare_female_0to19 = "C27006_014",
                              withmedicare_female_19to64 = "C27006_017",
                              withmedicare_female_65plus = "C27006_020",
                              withmedicaid_male_0to19 = "C27007_004",
                              withmedicaid_male_19to64 = "C27007_007",
                              withmedicaid_male_65plus = "C27007_010",
                              withmedicaid_female_0to19 = "C27007_014",
                              withmedicaid_female_19to64 = "C27007_017",
                              withmedicaid_female_65plus = "C27007_020"),
                geography = "tract",
                state = st)
  return(df)
}
# loops over all states; `states` is assumed to be a vector of state abbreviations
df_list <- setNames(lapply(states, get_census_data), states)
## if you want to keep the margin of error, remove everything after %>% in the next two lines
final_df <- bind_rows(df_list) %>%
  select(-moe)
colnames(final_df)[3] <- "varname"
# cleaning up final data, making it wide instead of long
final_df_wide <- final_df %>%
  gather(variable, value, -(GEOID:varname)) %>%
  unite(temp, varname, variable) %>%
  spread(temp, value)
# exporting to csv file; adjust your path (use forward slashes, or doubled backslashes, on Windows)
write.csv(final_df, "C:/Users/NAME/Documents/acs_2018_tractlevel_data.csv")
Since you can't really give a reproducible example without revealing your API key, I'll try my best to figure out what could work here.
Let's first edit the function that pulls data from the API:
get_census_data <- function(st) {
  Sys.sleep(5)
  df <- get_acs(year = 2018,
                variables = c(blackHis = "S2602_C01_023E",
                              unEmployRate = "S2602_C01_081E",
                              notUSCit = "S2602_C01_070E"),
                geography = "tract",
                state = st)
  return(df)
}
I've just put in three of the variables, but you should get the point.
Try whether this works for you and returns the data that is stored in the respective variables.
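The divisions the question asks for can be done after the download, once the data are in wide format (one column per variable). A sketch on a toy data frame, with made-up values and illustrative column names:

```r
# Toy wide-format tract data (made-up values; column names are illustrative)
acs <- data.frame(
  GEOID    = c("01001020100", "01001020200"),
  totpop   = c(2000, 1500),
  noncit   = c(100, 30),
  over65   = c(400, 450),
  belowpov = c(300, 150)
)

# Derived shares: divide each count by total population
acs$noncit_share  <- acs$noncit   / acs$totpop
acs$over65_share  <- acs$over65   / acs$totpop
acs$poverty_share <- acs$belowpov / acs$totpop

acs$noncit_share
#> [1] 0.05 0.02

# Export, using forward slashes in the path
# write.csv(acs, "C:/Users/NAME/Documents/acs_rates.csv", row.names = FALSE)
```

With dplyr loaded, the three assignments collapse into a single mutate() call.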

USDA PS&D API query not filtering correctly

I'm new to API programming so this question might have an obvious answer, but thanks for hanging in there with me.
I'm trying to use the USDA FAS' PS&D API (documentation here) to get grain balance sheet attributes (production, consumption, exports, etc.) for selected countries (Argentina, Brazil, etc.).
While I can get the API to send me data for all countries for a specific crop/year, I cannot find a way to:
get the response to send back ONLY data for Argentina and Brazil, or
specify multiple years of data to be returned.
I've tried specifying various country names/codes as query parameters but the resulting data still has all the countries.
Here is my existing code, which returns corn data for ALL countries for the 2018 year.
library(httr)
library(jsonlite)
## This is a fake API Key
msKEY = "ABDC-123456-HGFRE-58AB"
baseURL <- "https://apps.fas.usda.gov/PSDOnlineDataServices/api/CommodityData/GetCommodityDataByYear?"
x <- GET(baseURL, query = list(
  CommodityCode = "0440000",
  marketYear = 2018,
  country = "BZ",
  ),
  add_headers(API_KEY = msKEY)
)
status_code(x)
x2 <- fromJSON(
  content(
    x, as = "text"
  )
)
str(x2)
I expect this code to return corn data for 2018 for Brazil only, but it returns data for all countries. There are no error codes thrown (that I'm aware of) and I'm thoroughly stumped.
Any thoughts/suggestions are much appreciated!
Perhaps a little too late, but the code refers to BRAZIL as "BZ" when the code is "BR". Also, there is an extra comma in the query list after the country code that should be eliminated.
x <- GET(baseURL, query = list(
  CommodityCode = "0440000",
  marketYear = 2018,
  country = "BR"),
  add_headers(API_KEY = msKEY)
)
status_code(x)
x2 <- fromJSON(
  content(
    x, as = "text"
  )
)
str(x2)
Sorry for the delay. You're right about the issue with the country identifier. The filter in the query does not work. I also realized they have two different endpoints for the same data, which affects the query and results. I wrote the following code to get the most out of it and then filter the country in the resulting data frame:
products <- c("Meal, Soybean",
              "Meal, Soybean (Local)")
# product_codes: a data frame of commodity codes fetched earlier (not shown)
selected_products <- product_codes %>%
  filter(product_codes$CommodityName %in% products)
market_years <- c(1980:2022)
# for loop - complete
df <- list()
df_y <- list()
for (i in 1:length(selected_products$CommodityCode)) {
  for (j in 1:length(market_years)) {
    baseURL <- "https://apps.fas.usda.gov/PSDOnlineDataServices/api/CommodityData/GetCommodityDataByYear?"
    m <- GET(baseURL,
             query = list(CommodityCode = selected_products$CommodityCode[i],
                          marketYear = market_years[j]),
             add_headers(API_KEY = msKEY))
    m2 <- fromJSON(content(m, as = "text"))
    df_y[[j]] <- m2
  }
  df[[i]] <- df_y
}
mega_df <- df %>%
  bind_rows()
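The final country filter can be sketched on toy data. Depending on your dplyr version, the nested list may need its inner per-year lists flattened before bind_rows(); the data frames and country names here are stand-ins, not real API output:

```r
library(dplyr)

# Toy stand-ins for the per-year results collected in the nested list above
df_y1 <- data.frame(CountryName = c("Argentina", "Brazil", "France"), Value = 1:3)
df_y2 <- data.frame(CountryName = c("Argentina", "Brazil", "France"), Value = 4:6)
df <- list(list(df_y1, df_y2))  # same shape: one inner list per commodity

mega_df <- df %>%
  lapply(bind_rows) %>%   # flatten each commodity's per-year list
  bind_rows() %>%
  filter(CountryName %in% c("Argentina", "Brazil"))

nrow(mega_df)
#> [1] 4
```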

Resolve 500 server error with Google Core Reporting API

I am at a complete loss at this point on how to resolve my query issue. We have been using the R package RGA for about a year now without any problems. I have a script that fetches data from 7 views, matches sessions based on specific pages on our website, and totals them to our product offerings.
This had been working without problems for months. Out of nowhere I am getting 503 and 500 internal server errors, and I'm not sure why.
I've tried changing the fetch.by option to "month", "quarter", "year", "day", etc., but I think the initial query is just too big.
I've also tried changing the max.results option and fetching just one profile ID at a time. We have 7 to process.
date1 <- as.character(cut(Sys.Date(), "month"))
date1_name <- format(as.Date(date1), format = '%Y%m')
date2 <- as.character(Sys.Date())
date2_name <- format(as.Date(date2), format = '%Y%m')
dimensions <- c("ga:yearMonth")
metrics <- c("ga:sessions")
filters2 <- "ga:sessions>0"
# fetch trip level data for all users and for the micro-goal segment
# country_short_table
short_unq <- unique(country_short_table$destination)
brand_trip_unique <- unique(trip_country_brand$brand_trip)
# brand_trip <- 1; brand_trip <- 73  # leftover debug assignments, overwritten by the loop below
all_sessions <- data.frame()
for (brand_trip in 1:length(brand_trip_unique)) {
  mkt <- gsub('_.*', '', brand_trip_unique[brand_trip])
  trip <- gsub('.*_', '', brand_trip_unique[brand_trip])
  id <- as.character(ids[ids$market == mkt, 'id'])
  segment <- paste('ga:pagePath=~(reisen|circuit)/.*/', trip, sep = '')
  segment_def <- paste('users::condition::', segment, sep = '')
  table <- get_ga(profileId = id,
                  start.date = date1,
                  end.date = date2,
                  metrics = metrics,
                  dimensions = dimensions,
                  filters = filters2,
                  segment = segment_def,
                  samplingLevel = "HIGHER_PRECISION",
                  fetch.by = "quarter",
                  max.results = NULL)
  if (is.list(table)) {
    table$trip <- trip
    table$market <- mkt
    all_sessions <- bind_rows(all_sessions, table)
  } else {
    next
  }
}
GOAL: Can you recommend any way to avoid this issue, maybe by separating the date queries and cycling them by weeks or days of the month? I need monthly data aggregated every day, but I'm not sure how to edit this script I inherited.
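The splitting asked about above can be sketched by generating month-by-month date windows and running one fetch per window. The window generation is pure base R; the commented get_ga loop is illustrative, and the dates are made up:

```r
# Month-by-month (start, end) windows covering a fixed range
starts <- seq(as.Date("2019-01-01"), as.Date("2019-06-01"), by = "month")
ends   <- seq(as.Date("2019-02-01"), as.Date("2019-07-01"), by = "month") - 1

windows <- data.frame(start = starts, end = ends)
nrow(windows)
#> [1] 6
format(windows$end[1])
#> [1] "2019-01-31"

# for (k in seq_len(nrow(windows))) {
#   chunk <- get_ga(..., start.date = windows$start[k], end.date = windows$end[k])
#   all_sessions <- bind_rows(all_sessions, chunk)
# }
```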

How to get the grouping right in R with Plotly

I have some problems grouping my data in Plotly under R. To start with, I was using local data from a csv file, reading it with:
geogrid_data <- read.delim('geogrid.csv', row.names = NULL, stringsAsFactors = TRUE)
and the plotting went well, using the following:
library(plotly)
library(RColorBrewer)
x <- list(
  title = 'Date'
)
p <- plotly::plot_ly(geogrid_data,
                     type = 'scatter',
                     x = ~ts_now,
                     y = ~absolute_v_sum,
                     text = paste('Table: ', geogrid_data$table_name,
                                  '<br>Absolute_v_Sum: ', geogrid_data$absolute_v_sum),
                     hoverinfo = 'text',
                     mode = 'lines',
                     color = list(
                       color = colorRampPalette(RColorBrewer::brewer.pal(11, 'Spectral'))(
                         length(unique(geogrid_data$table_name))
                       )
                     ),
                     transforms = list(
                       list(
                         type = 'groupby',
                         groups = ~table_name
                       )
                     )
) %>% layout(showlegend = TRUE, xaxis = x)
Here is the output (plot screenshot omitted):
Then I switched the data source to an Oracle database table, reading the data as follows, using the ROracle package:
# retrieve data into resultSet object
rs <- dbSendQuery(con, "SELECT * FROM GEOGRID_STATS")
# fetch records from the resultSet into a data.frame
geogrid_data <- fetch(rs)
# free resources occupied by resultSet
dbClearResult(rs)
dbUnloadDriver(drv)
# remove duplicates from dataframe (based on TABLE_NAME, TS_BEFORE, TS_NOW, NOW_SUM)
geogrid_data <- geogrid_data %>% distinct(TABLE_NAME, TS_BEFORE, TS_NOW, NOW_SUM, .keep_all = TRUE)
# alter date columns in place
geogrid_data$TS_BEFORE <- as.Date(geogrid_data$TS_BEFORE, format='%d-%m-%Y')
geogrid_data$TS_NOW <- as.Date(geogrid_data$TS_NOW, format='%d-%m-%Y')
and adjusting the plotting to:
p <- plotly::plot_ly(
  type = 'scatter',
  x = geogrid_data$TS_NOW,
  y = geogrid_data$ABSOLUTE_V_SUM,
  text = paste('Table: ', geogrid_data$TABLE_NAME,
               '<br>Absolute_v_Sum: ', geogrid_data$ABSOLUTE_V_SUM,
               '<br>Date: ', geogrid_data$TS_NOW),
  hoverinfo = 'text',
  mode = 'lines',
  color = list(
    color = colorRampPalette(RColorBrewer::brewer.pal(11, 'Spectral'))(
      length(unique(geogrid_data$TABLE_NAME))
    )
  ),
  transforms = list(
    list(
      type = 'groupby',
      groups = geogrid_data$TABLE_NAME
    )
  )
) %>% layout(showlegend = TRUE, xaxis = x)
Unfortunately, this leads to some problem with the grouping:
As you can see from the label text when hovering over the data point, the point represents data from NY_SKOV_PLANTEB_MW_POLY while the legend shows NY_BYGN_MW_POLY. Looking at other data points, I found a wild mix: some represent data from NY_BYGN_MW_POLY, but most do not.
The plotting along the timeline also no longer works; e.g., data are plotted starting Dec. 11 - Dec. 10 - Dec. 10 - Dec. 12 - Dec. 20 - Dec. 17 - Dec. 16 - Dec. 15.
Where do I go wrong in handling the data, and what do I have to do to get it right?
Of course, one should look at the data... thanks Marco, after your question I did look at my data.
There are some points where I simply assumed things.
The reason why everything plotted fine from the csv file is simple: the information manually compiled in the csv file came from emails that were ordered by date. Hence, I compiled the data in the csv file ordered by date, and Plotly had no problem grouping the data by table_name.
After looking at my data I tidied up, keeping only the data I need to show in the plot, and used dplyr to sort the data by time:
geogrid_data <- dplyr::arrange(geogrid_data, TS_NOW)
It is sorted only by time, and not by time and table name, because the sorting by table name is done anyway by Plotly via the groupby statement.
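A base-R equivalent of that arrange call, on toy data, shows why the sort matters for line traces (the column names mirror the ones above; the values are made up):

```r
# Toy unordered rows, as they might come back from the database
geogrid_toy <- data.frame(
  TS_NOW = as.Date(c("2020-12-12", "2020-12-10", "2020-12-11")),
  ABSOLUTE_V_SUM = c(3, 1, 2)
)

# Sort by date so line traces are drawn left to right
geogrid_sorted <- geogrid_toy[order(geogrid_toy$TS_NOW), ]
geogrid_sorted$ABSOLUTE_V_SUM
#> [1] 1 2 3
```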
