Update Purrr loop to input data row by row in R

This question kinda builds on questions I asked here and here, but it's finally coming together and I think I know what the problem is; I just need help kicking it over the goal line. TL;DR at the bottom.
The overall goal as simply put as possible:
I have a dataframe that is from an API pull of a redcap database. It
has a few columns of information about various studies.
I'd like to go through that dataframe line by line, and push it into a different website called Oncore, through an API.
In the first question linked above (here again), I took a much simpler dataframe, took one column from it (the number), and used it to do an API pull from Oncore: it would download a record from Oncore, copy one variable it downloaded over to a different spot, and push it back in. It would do this over and over, once per row, then return a simple dataframe of the row number and the API status code returned.
Now I want to get a bit more complicated: instead of just pulling a number from one column, I want to swap over a bunch of variables from my original dataframe and upload them.
The idea is for sample studies entered into Redcap to be pushed into Oncore.
What I've tried:
I have this dataframe from the Redcap API pull:
testprotocols<-structure(list(protocol_no = c("LS-P-Joe's API", "JoeTest3"),
nct_number = c(654321, 543210), library = structure(c(2L,
2L), levels = c("General Research", "Oncology"), class = "factor"),
organizational_unit = structure(c(1L, 1L), levels = c("Lifespan Cancer Institute",
"General Research"), class = "factor"), title = c("Testing to see if basic stuff came through",
"Testing Oncology Projects for API"), department = structure(c(2L,
2L), levels = c("Diagnostic Imaging", "Lifespan Cancer Institute"
), class = "factor"), protocol_type = structure(2:1, levels = c("Basic Science",
"Other"), class = "factor"), protocolid = 1:2), row.names = c(NA,
-2L), class = c("tbl_df", "tbl", "data.frame"))
I have used this code to try and push the data into Oncore:
##This chunk gets a random one we're going to change later
base <- "https://website.forteresearchapps.com"
endpoint <- "/website/rest/protocols/"
protocol <- "2501"
## 'results' will get changed later to plug back in
## store
protocolid <- protocolnb <- library_names <- get_codes <- put_codes <- list()
UpdateAccountNumbers <- function(protocol){
call2<-paste(base,endpoint, protocol, sep="")
httpResponse <- GET(call2, add_headers(authorization = token))
results = fromJSON(content(httpResponse, "text"))
results$protocolId<- "8887" ## doesn't seem to matter
results$protocolNo<- testprotocols$protocol_no
results$library<- as.character(testprotocols$library)
results$title<- testprotocols$title
results$nctNo<-testprotocols$nct_number
results$objectives<-"To see if the API works, specifically if you can write over a previous number"
results$shortTitle<- "Short joseph Title"
results$nctNo<-testprotocols$nct_number
results$department <- as.character(testprotocols$department)
results$organizationalUnit<- as.charater(testprotocols$organizational_unit)
results$protocolType<- as.character(testprotocols$protocol_type)
call2 <- paste(base,endpoint, protocol, sep="")
httpResponse_put <- PUT(
call2,
add_headers(authorization = token),
body=results, encode = "json",
verbose()
)
# save stats
protocolid <- append(protocolid, protocol)
protocolnb <- append(protocolnb, testprotocols$PROTOCOL_NO[match(protocol, testprotocols$PROTOCOL_ID)])
library_names <- append(library_names, testprotocols$LIBRARY[match(protocol, testprotocols$PROTOCOL_ID)])
get_codes <- append(get_codes, status_code(httpResponse_get))
put_codes <- append(put_codes, status_code(httpResponse_put))
}
## Oncology will have to change to whatever the df name is, above and below this
purrr::walk(testprotocols$protocol_no, UpdateAccountNumbers)
allresults <- tibble('protocolNo'=unlist(protocol_no),'protocolnb'=unlist(protocolnb),'library_names'=unlist(library_names), 'get_codes'=unlist(get_codes), 'put_codes'=unlist(put_codes) )
When I get to the line:
purrr::walk(testprotocols$protocol_no, UpdateAccountNumbers)
I get an error (posted as a screenshot, as was the traceback() output).
When I step through the loop line by line, I realize that in this chunk of code:
call2<-paste(base,endpoint, protocol, sep="")
httpResponse <- GET(call2, add_headers(authorization = token))
results = fromJSON(content(httpResponse, "text"))
results$protocolId<- "8887" ## doesn't seem to matter
results$protocolNo<- testprotocols$protocol_no
results$library<- as.character(testprotocols$library)
results$title<- testprotocols$title
results$nctNo<-testprotocols$nct_number
results$objectives<-"To see if the API works, specifically if you can write over a previous number"
results$shortTitle<- "Short joseph Title"
results$nctNo<-testprotocols$nct_number
results$department <- as.character(testprotocols$department)
results$organizationalUnit<- as.charater(testprotocols$organizational_unit)
results$protocolType<- as.character(testprotocols$protocol_type)
Where I had envisioned it downloading ONE sample study and replacing aspects of it with variables from ONE row of my beginning dataframe, it's actually trying to paste the entire column in there, i.e. results$nctNo is "654321 543210" instead of just "654321" from the first row.
TL;DR version:
I need my purrr loop to take one row at a time instead of my entire column, and I think if I do that, it'll all magically work.

Within UpdateAccountNumbers(), you are referring to entire columns of the testprotocols frame when you do things like results$nctNo <- testprotocols$nct_number.
Instead, at the top of the UpdateAccountNumbers() function, you can do something like tp <- testprotocols[testprotocols$protocol_no == protocol, ], and then refer to tp instead of testprotocols when assigning values to results.
Note that your purrr::walk() command is passing just one value of protocol at a time to the UpdateAccountNumbers() function.
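Putting that suggestion together, here is a minimal sketch of the rewritten function. This is my own untested sketch, not the answerer's code: it assumes the base, endpoint, token, and testprotocols objects from the question, that httr/jsonlite/purrr/tibble are loaded, and it also corrects the as.charater typo to as.character.
UpdateAccountNumbers <- function(protocol){
  ## subset to the ONE row whose protocol number matches this call
  tp <- testprotocols[testprotocols$protocol_no == protocol, ]
  ## as in the question, this assumes the endpoint accepts the protocol number in the URL
  call2 <- paste0(base, endpoint, protocol)
  httpResponse_get <- GET(call2, add_headers(authorization = token))
  results <- fromJSON(content(httpResponse_get, "text"))
  ## overwrite fields with values from that single row
  results$protocolNo <- tp$protocol_no
  results$library <- as.character(tp$library)
  results$title <- tp$title
  results$nctNo <- tp$nct_number
  results$department <- as.character(tp$department)
  results$organizationalUnit <- as.character(tp$organizational_unit)
  results$protocolType <- as.character(tp$protocol_type)
  httpResponse_put <- PUT(call2, add_headers(authorization = token),
                          body = results, encode = "json")
  ## return one row of status info instead of appending to globals
  tibble(protocol_no = protocol,
         get_code = status_code(httpResponse_get),
         put_code = status_code(httpResponse_put))
}
## map_dfr() row-binds the per-protocol results into one dataframe
allresults <- purrr::map_dfr(testprotocols$protocol_no, UpdateAccountNumbers)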

Related

How do I remove the limit or keep getting data on the Radlibrary package?

I am currently using the Radlibrary package. I used the following code:
query <- adlib_build_query(ad_active_status = "ALL",
ad_delivery_date_max = "2022-11-08",
ad_delivery_date_min = "2022-06-24",
ad_reached_countries = "US",
ad_type = "POLITICAL_AND_ISSUE_ADS",
search_terms = "democrat",
publisher_platforms = "FACEBOOK",
fields = c('id',
'ad_creation_time',
'ad_creative_bodies',
'ad_creative_link_captions',
'ad_creative_link_descriptions',
'ad_creative_link_titles',
'ad_delivery_start_time',
'ad_delivery_stop_time',
'ad_snapshot_url',
'bylines',
'currency',
'delivery_by_region',
'estimated_audience_size',
'languages',
'page_id',
'page_name',
'spend',
'publisher_platforms',
'demographic_distribution',
'impressions'))
response <- adlib_get(query)
data <- as_tibble(response)
I've noticed that I only get 1000 observations at a time from that time frame. Is there an efficient way to collect all the observations within the time frame? I've thought about changing the "stop time" based on the last date in the dataset, but that might take a long time if there are a lot of ads in the span of a few days. Any suggestions?
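One way to act on that date-window idea, offered purely as an illustrative sketch (not from the original post): the loop below assumes the page size really is 1000, that ad_delivery_start_time comes back as a column of the tibble (it is among the requested fields), and that rebuilding the query with a tighter ad_delivery_date_max is acceptable.
library(Radlibrary)
library(dplyr)
fetch_window <- function(date_min, date_max) {
  pages <- list()
  repeat {
    query <- adlib_build_query(ad_active_status = "ALL",
                               ad_delivery_date_max = date_max,
                               ad_delivery_date_min = date_min,
                               ad_reached_countries = "US",
                               ad_type = "POLITICAL_AND_ISSUE_ADS",
                               search_terms = "democrat",
                               publisher_platforms = "FACEBOOK",
                               fields = c("id", "ad_delivery_start_time", "spend"))
    page <- as_tibble(adlib_get(query))
    pages[[length(pages) + 1]] <- page
    if (nrow(page) < 1000) break        # a short page means we have everything
    new_max <- as.character(min(as.Date(page$ad_delivery_start_time)))
    if (identical(new_max, date_max)) break  # avoid stalling when >1000 ads share a date
    date_max <- new_max                 # shrink the window and query again
  }
  distinct(bind_rows(pages), id, .keep_all = TRUE)  # drop duplicates at the boundaries
}
all_ads <- fetch_window("2022-06-24", "2022-11-08")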

"Error: lexical error: invalid char in json text. " when using Purrr loop in R

Kinda long winded but here goes:
I have a dataframe like this:
testprotocols<-structure(list(protocol_no = c("LS-P-Joe's API", "JoeTest3"),
nct_number = c(654321, 543210), library = structure(c(2L,
2L), levels = c("General Research", "Oncology"), class = "factor"),
organizational_unit = structure(c(1L, 1L), levels = c("Lifespan Cancer Institute",
"General Research"), class = "factor"), title = c("Testing to see if basic stuff came through",
"Testing Oncology Projects for API"), department = structure(c(2L,
2L), levels = c("Diagnostic Imaging", "Lifespan Cancer Institute"
), class = "factor"), protocol_type = structure(2:1, levels = c("Basic Science",
"Other"), class = "factor"), protocolid = 1:2), row.names = c(NA,
-2L), class = c("tbl_df", "tbl", "data.frame"))
I want to push it into a website using an API with this code, which will go line by line and return a dataframe giving me the status of whether or not each row worked:
##This chunk gets a random one we're going to change later
base <- "https://website.forteresearchapps.com"
endpoint <- "/website/rest/protocols/"
protocol <- "2501"
## 'results' will get changed later to plug back in
## store
protocolid <- protocolnb <- library_names <- get_codes <- put_codes <- list()
UpdateAccountNumbers <- function(protocol){
call2<-paste(base,endpoint, protocol, sep="")
httpResponse <- GET(call2, add_headers(authorization = token))
results = fromJSON(content(httpResponse, "text"))
results$protocolId<- "8887" ## doesn't seem to matter
results$protocolNo<- testprotocols$protocol_no
results$library<- as.character(testprotocols$library)
results$title<- testprotocols$title
results$nctNo<-testprotocols$nct_number
results$objectives<-"To see if the API works, specifically if you can write over a previous number"
results$shortTitle<- "Short joseph Title"
results$nctNo<-testprotocols$nct_number
results$department <- as.character(testprotocols$department)
results$organizationalUnit<- as.charater(testprotocols$organizational_unit)
results$protocolType<- as.character(testprotocols$protocol_type)
call2 <- paste(base,endpoint, protocol, sep="")
httpResponse_put <- PUT(
call2,
add_headers(authorization = token),
body=results, encode = "json",
verbose()
)
# save stats
protocolid <- append(protocolid, protocol)
protocolnb <- append(protocolnb, testprotocols$PROTOCOL_NO[match(protocol, testprotocols$PROTOCOL_ID)])
library_names <- append(library_names, testprotocols$LIBRARY[match(protocol, testprotocols$PROTOCOL_ID)])
get_codes <- append(get_codes, status_code(httpResponse_get))
put_codes <- append(put_codes, status_code(httpResponse_put))
}
## Oncology will have to change to whatever the df name is, above and below this
purrr::walk(testprotocols$protocol_no, UpdateAccountNumbers)
allresults <- tibble('protocolNo'=unlist(protocol_no),'protocolnb'=unlist(protocolnb),'library_names'=unlist(library_names), 'get_codes'=unlist(get_codes), 'put_codes'=unlist(put_codes) )
The basic gist of this purrr loop is from my question here: Question
The only difference is that in that question, I was only doing one small change within the loop, this line:
results$hospitalAccountNo <- results$internalAccountNo
Where it would take what it downloaded from the API, copy it over to 'hospitalAccountNo' and put it back up.
This time around, I'm trying to make a few more changes: all of the lines below, which I envision taking values from the 'testprotocols' dataframe, writing over the 'results' it downloaded, and then uploading one row at a time using the loop.
results$protocolId<- "8887" ## doesn't seem to matter
results$protocolNo<- testprotocols$protocol_no
results$library<- as.character(testprotocols$library)
results$title<- testprotocols$title
results$nctNo<-testprotocols$nct_number
results$objectives<-"To see if the API works, specifically if you can write over a previous number"
results$shortTitle<- "Short joseph Title"
results$nctNo<-testprotocols$nct_number
results$department <- as.character(testprotocols$department)
results$organizationalUnit<- as.charater(testprotocols$organizational_unit)
results$protocolType<- as.character(testprotocols$protocol_type)
For whatever reason, I get the error from the title when I try to run the line:
purrr::walk(testprotocols$protocol_no, UpdateAccountNumbers)
If I run traceback() I get more detail (the output was posted as a screenshot).
I'd love it if someone could just fix my entire loop for me haha, but realistically my question is:
Where should I look to figure out what is causing this error?
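One place to start, offered as a hedged suggestion rather than a confirmed diagnosis: that lexical error usually means fromJSON() was handed text that is not JSON at all, for example an HTML error page returned by the server. So inspect the raw GET response before parsing it; the snippet below only uses httr calls already present in the question.
library(httr)
call2 <- paste0(base, endpoint, protocol)
httpResponse <- GET(call2, add_headers(authorization = token))
status_code(httpResponse)                  # anything other than 200 is suspect
headers(httpResponse)[["content-type"]]    # should be application/json
raw_text <- content(httpResponse, "text")
substr(raw_text, 1, 200)                   # eyeball what actually came back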

Querying IMF API with imfr - error no result/does not accept filter

I am currently trying to download a particular series from the Direction Of Trade Statistics at the IMF for a calculation of trade volumes between countries. There is an R package, imfr, that does a fantastic job at this. However, when going for a particular set, I run into problems.
This code works just fine and gets me the full data series I am interested in for the given countries:
library(imfr)
# get the list of imf datasets
imf_ids()
# I am interested in direction of trade "DOT", so check the list of codes that are in the datastructure
imf_codelist(database_id = "DOT")
# I want the export and import data between countries FOB so "TXG_FOB_USD" and "TMG_FOB_USD"
imf_codes("CL_INDICATOR_DOT")
# works nicely for exports:
data_list_exports <- imf_data(database_id = "DOT", indicator = c("TXG_FOB_USD"),
country = c("US","JP","KR"),
start = "1995",
return_raw = TRUE,
freq = "A")
# however the same code does not work for imports
data_list_imports <- imf_data(database_id = "DOT", indicator = c("TMG_FOB_USD"),
country = c("US","JP","KR"),
start = "1995",
return_raw = TRUE,
freq = "A")
This returns an empty series, and I did not understand why. So I thought: maybe the US is not in the dataset (although unlikely).
library(httr)
library(jsonlite)
# look at the API endpoint that provides the data structure behind a dataset
result <- httr::GET("http://dataservices.imf.org/REST/SDMX_JSON.svc/DataStructure/DOT") %>% httr::content(as = "parsed")
structure_url <- "http://dataservices.imf.org/REST/SDMX_JSON.svc/DataStructure/DOT"
raw_data <- jsonlite::fromJSON(structure_url)
test <- raw_data$Structure$CodeLists
However, the result indicates that the US is indeed in the data. So what if I just don't specify a country? That finally does download the data, but only the first 60 countries because of rate limits. When doing the same with an httr::GET I directly hit the rate limit and get an error back.
data_list_imports <- imf_data(database_id = "DOT", indicator = c("TMG_FOB_USD"),
start = "1995",
return_raw = TRUE,
freq = "A")
Does anybody have an idea what I am doing wrong? I am really at a loss and just hope it is a typo somewhere...
Thanks and all the best!
This kind of answers the question:
cjyetman over at GitHub gave me the following hint:
You can use the print_url = TRUE argument to see the actual API call.
With...
imf_data(database_id = "DOT", indicator = c("TMG_FOB_USD"),
country = c("US","JP","KR"),
start = "1995",
return_raw = TRUE,
freq = "A",
print_url = TRUE)
you get...
http://dataservices.imf.org/REST/SDMX_JSON.svc/CompactData/DOT/.US+JP+KR.TMG_FOB_USD?startPeriod=1995&endPeriod=2021
which does not return any data.
But if you add "AU" as a country to that list, you do get data with...
http://dataservices.imf.org/REST/SDMX_JSON.svc/CompactData/DOT/.AU+US+JP+KR.TMG_FOB_USD?startPeriod=1995&endPeriod=2021
So I guess either there is something wrong currently with their API,
or they actually do not have data for specifically that indicator for
those countries with that frequency, etc.
This does indeed work, and makes it apparent that either there is truly "missing data" in the API, or I am simply looking for data where there is none. Since the original quest was to look at trade volumes, I have since found out that the import value usually used is the CIF value, not FOB.
Hence the correct indicator for the API call would have been the following:
library(imfr)
data_list_imports <- imf_data(database_id = "DOT", indicator = c("TMG_CIF_USD"),
country = c("US","JP","KR"),
start = "1995",
return_raw = TRUE,
freq = "A")

Fetching data from OECD into R via SDMX(XML)

I want to extract data from the OECD website, particularly the dataset "REGION_ECONOM" with the dimensions "GDP" (GDP of the respective regions) and "POP_AVG" (the average population of the respective region).
This is the first time I am doing this:
I picked all the required dimensions on the OECD website and copied the SDMX (XML) link.
I tried to load them into R and convert them to a data frame with the following code:
(in the link I replaced the list of all regions with "ALL" as otherwise the link would have been six pages long)
if (!require(rsdmx)) install.packages('rsdmx'); library(rsdmx)
url2 <- "https://stats.oecd.org/restsdmx/sdmx.ashx/GetData/REGION_ECONOM/1+2.ALL.SNA_2008.GDP+POP_AVG.REAL_PPP.ALL.1990+1991+1992+1993+1994+1995+1996+1997+1998+1999+2000+2001+2002+2003+2004+2005+2006+2007+2008+2009+2010+2011+2012+2013+2014+2015+2016+2017+2018/all?"
sdmx2 <- readSDMX(url2)
stats2 <- as.data.frame(sdmx2)
head(stats2)
Unfortunately, this returns a "400 Bad request" error.
When just selecting a couple of regions the error does not appear:
if (!require(rsdmx)) install.packages('rsdmx'); library(rsdmx)
url1 <- "https://stats.oecd.org/restsdmx/sdmx.ashx/GetData/REGION_ECONOM/1+2.AUS+AU1+AU101+AU103+AU104+AU105.SNA_2008.GDP+POP_AVG.REAL_PPP.ALL.1990+1991+1992+1993+1994+1995+1996+1997+1998+1999+2000+2001+2002+2003+2004+2005+2006+2007+2008+2009+2010+2011+2012+2013+2014+2015+2016+2017+2018/all?"
sdmx1 <- readSDMX(url1)
stats1 <- as.data.frame(sdmx1)
head(stats1)
I also tried to use the "OECD" package to get the data. There I had the same problem ("400 Bad Request"):
if (!require(OECD)) install.packages('OECD'); library(OECD)
df1<-get_dataset("REGION_ECONOM", filter = "GDP+POP_AVG",
start_time = 2008, end_time = 2009, pre_formatted = TRUE)
However, when I use the package for other data sets it does work:
df <- get_dataset("FTPTC_D", filter = "FRA+USA", pre_formatted = TRUE)
Does anyone know where my mistake could lie?
The SDMX-ML API does not seem to work as explained (using the "ALL" parameter), whereas the JSON API works just fine. The following query returns the values for all countries as JSON; I simply replaced "ALL" with an empty field:
query <- "https://stats.oecd.org/sdmx-json/data/REGION_ECONOM/1+2..SNA_2008.GDP+POP_AVG.REAL_PPP.ALL.1990+1991+1992+1993+1994+1995+1996+1997+1998+1999+2000+2001+2002+2003+2004+2005+2006+2007+2008+2009+2010+2011+2012+2013+2014+2015+2016+2017+2018/all?"
Transforming it to a readable format is not so trivial. I played around a bit to find the following work-around:
# send a GET request using httr, then parse the JSON payload with jsonlite
library(httr)
library(jsonlite)
query <- "https://stats.oecd.org/sdmx-json/data/REGION_ECONOM/1+2..SNA_2008.GDP+POP_AVG.REAL_PPP.ALL.1990+1991+1992+1993+1994+1995+1996+1997+1998+1999+2000+2001+2002+2003+2004+2005+2006+2007+2008+2009+2010+2011+2012+2013+2014+2015+2016+2017+2018/all?"
dat_raw <- GET(query)
dat_parsed <- parse_json(content(dat_raw, "text")) # parse the content
Next, access the observations from the nested list and transform them to a matrix. Also extract the features from the keys:
dat_obs <- dat_parsed[["dataSets"]][[1]][["observations"]]
dat0 <- do.call(rbind, dat_obs) # get a matrix
new_features <- matrix(as.numeric(do.call(rbind, strsplit(rownames(dat0), ":"))), nrow = nrow(dat0))
dat1 <- cbind(new_features, dat0) # add feature columns
dat1_df <- as.data.frame(dat1) # optionally transform to data frame
Finally, you want to find out about the keys. Those are hidden in the "structure" element, which you also need to parse correctly, so I wrote a function to make it easier to extract the values and IDs:
## Get keys of features
keys <- dat_parsed[["structure"]][["dimensions"]][["observation"]]
for (i in 1:length(keys)) print(paste("id position:", i, "is feature", keys[[i]]$id))
# apply keys
get_features <- function(data_input, keys_input, feature_index, value = FALSE) {
keys_temp <- keys_input[[feature_index]]$values
keys_temp_matrix <- do.call(rbind, keys_temp)
keys_temp_out <- keys_temp_matrix[, value + 1][unlist(data_input[, feature_index])+1] # column 1 is id, 2 is value
return(unlist(keys_temp_out))
}
head(get_features(dat1_df, keys, 7))
head(get_features(dat1_df, keys, 2, value = FALSE))
head(get_features(dat1_df, keys, 2, value = TRUE))
I hope that helps you in your project.
Best, Tobias
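As a follow-up sketch (my addition, not part of the original answer): assuming the objects built above, you can decode every key dimension at once and attach the observed value, which sits in the first column after the key columns.
## decode each dimension into its human-readable label
cols <- lapply(seq_along(keys), function(i) get_features(dat1_df, keys, i, value = TRUE))
decoded <- as.data.frame(do.call(cbind, cols), stringsAsFactors = FALSE)
names(decoded) <- vapply(keys, function(k) k$id, character(1))
## the first column after the key columns holds the observation value
decoded$value <- unlist(dat1_df[[length(keys) + 1]])
head(decoded)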

Map over data frame columns, apply function to data if column meets condition

I'm pulling data from the Google Analytics API, processing it locally, then knitting an .Rmd file into text, tables, and visualisations. As part of the knitting/tabling process, I'm doing some basic formatting (e.g. rounding off percentages and adding % signs).
For this question, I have toPercent(), which works fine if used like this:
toPercent <- function(percentData){
  percentData <- round(percentData, 2)
  percentData <- mapply(toString, percentData)
  percentData <- paste(percentData, "%", sep="")
}
devices$avgSessionDuration <- toPercent(devices$avgSessionDuration)
However, manually setting up the function for every table is time-intensive. I created percentCheck() to look for columns that match my criteria:
percentCheck <- function(data){
data[,grep("rate|percent", names(data), ignore.case=TRUE)] <- toPercent(data[,grep("rate|percent", names(data), ignore.case=TRUE)])
}
devices <- percentCheck(devices)
But I know this doesn't work on a dataset with multiple matches (e.g. a column for exitRate and a column for bounceRate).
Q1: Have I written toPercent() in a way that won't return multiple values to one entry?
Q2: How can I structure percentCheck() to map over the dataset and only apply toPercent() if the column name includes a given string?
Version/Packages:
R version 3.1.1 (2014-07-10) -- "Sock it to Me"
library(rga)
library(knitr)
library(stargazer)
Data:
> dput(devices)
structure(list(deviceCategory = c("desktop", "mobile", "tablet"
), sessions = c(817, 38, 1540), avgSessionDuration = c(153.424888853179,
101.942758538617, 110.270988142292), bounceRate = c(39.0192297391397,
50.2915625371891, 50.1343873517787), exitRate = c(25.3257456030279,
32.0236280487805, 29.0991902834008)), .Names = c("deviceCategory",
"sessions", "avgSessionDuration", "bounceRate", "exitRate"), row.names = c(NA,
-3L), class = "data.frame")
How about this modification:
percentCheck <- function(data){
idx <- grepl("rate|percent", names(data), ignore.case=TRUE)
data[idx] <- lapply(data[idx], function(x) paste0(sprintf("%.2f", round(x,2)), "%"))
return(data)
}
Here, I first used grepl to create an index of the columns which meet the specified criteria. Then this index is used in lapply to apply it to all of those columns; the function that is applied is similar to your toPercent function, only a bit more compact.
Now you can apply it to your whole data set in one go:
percentCheck(devices)
#   deviceCategory sessions avgSessionDuration bounceRate exitRate
# 1        desktop      817           153.4249     39.02%   25.33%
# 2         mobile       38           101.9428     50.29%   32.02%
# 3         tablet     1540           110.2710     50.13%   29.10%
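For reference (my addition, not part of the original answer): on current dplyr versions the same transformation can be written with across() and the matches() selector.
library(dplyr)
devices %>%
  mutate(across(matches("rate|percent", ignore.case = TRUE),
                ~ paste0(sprintf("%.2f", .x), "%")))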
