How do I cache vectorized calls that take user input in R?

I am trying to calculate a field for all rows of a large dataset. The function that calculates it is from the package taxize and uses an HTTP request to query an external site for the right ID number. It searches by scientific name, and often there are multiple results, in which case the function asks for user input. I would like the function to cache my selection and return that ID number every time the same call is made from then on. I have tried my own caching function and memoizedCall() from the package R.cache, but every time it hits the second entry of the same scientific name it still prompts me for user input. I feel like I am misunderstanding something basic about how vectorization works. Sorry for my ignorance, but any advice is appreciated.
Here is the code I used as a custom caching function.
check_tsn <- function(data, tsn_list){
  print(data)
  print(tsn_list)
  if (is.null(tsn_list$data)){
    tsn_list$data = taxize::get_tsn(data)
    print('added to tsn_list')
  }
  return(tsn_list$data)
}
tsn_list <- vector(mode = "list", nrow(wanglang))
Genus.Species <- c('Tamiops swinhoei','Bos taurus','Tamiops swinhoei')
IUCN.ID <- c('21382','','21382')
species <- data.frame(Genus.Species,IUCN.ID)
species$TSN.ID = check_tsn(species$Genus.Species,tsn_list)
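For reference, one way to get per-name caching is to memoise by individual value rather than by the whole vector: look each scientific name up at most once, store the result in an environment keyed by name, and reuse it for repeats. Environments are modified in place, so the cache survives across calls, unlike a list passed into a function. A minimal sketch against the example data above, assuming taxize::get_tsn() is the call being cached:
library(taxize)

# session cache keyed by scientific name
tsn_cache <- new.env(parent = emptyenv())

get_tsn_cached <- function(name) {
  if (!exists(name, envir = tsn_cache, inherits = FALSE)) {
    # only the first occurrence of a name triggers the HTTP request
    # (and the interactive prompt when there are multiple matches)
    tsn_cache[[name]] <- taxize::get_tsn(name)
  }
  tsn_cache[[name]]
}

# visit rows one at a time so repeated names hit the cache
species$TSN.ID <- vapply(as.character(species$Genus.Species),
                         function(nm) as.character(get_tsn_cached(nm)),
                         character(1))
The point is that the lookup has to happen one value at a time; passing the whole column into a single cached call makes the cache key the entire vector, so a repeated name inside that vector is never recognised.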

Related

How do I create a loop function to apply acoustic indices from "soundecology" to specific sections of .wav files using R

I have a large quantity of .wav files that I need to analyze using the acoustic indices from the "soundecology" package in R. However, the recordings do not have uniform start times, and I need to analyze specific periods of time within the files. I want to create a function and loop to automate the process.
I have created a spreadsheet for each folder of recordings (each folder is a different location) that lays out the recordings and the times within each recording that I need to analyze. Basically, a row contains: the sound file name, the time when the sample should start (e.g. 09:00:00), the number of seconds from the start of the file at which that time occurs, and the number of seconds from the start of the file at which the sample should end.
That data looks like this:
(screenshot of the spreadsheet)
I am using the packages "tuneR" and "warbleR" to select the specific portion of a sound file that I want to analyze. Here is the code and the output that I would like to loop across all the sound files:
wavrow1 <-read_wave(mvb$sound.files[1], from = mvb$start[1], to = mvb$end[1])
wavrow1.aci <- acoustic_complexity(wavrow1, j=10)
which yields:
max_freq not set, using value of: 22050
min_freq not set, using value of: 0
This is a mono file.
Calculating index. Please wait...
Acoustic Complexity Index (total): 934.568
However, when I put this into a function in order to then put it into a loop I get a different output.
acianalyzeFUN <- function(mvb, i){
  r <- read_wave(mvb$sound.files[i], mvb$start[i], mvb$end[i])
  soundfile.aci <- acoustic_complexity(r, j=10)
}
row1.test <- acianalyzeFUN(mvb, 1)
This gives the output:
max_freq not set, using value of: 22050
min_freq not set, using value of: 0
This is a mono file.
Calculating index. Please wait...
Acoustic Complexity Index (total): 19183.03
Acoustic Complexity Index (by minute): 931.98
Which is different.
So I need to fix this function and put it into a loop so that I can apply it across all the files and save the results into a data frame or ultimately another spread sheet.
I was thinking a loop like the following might work, but I am also getting errors with it:
output <- vector("logical", length(97))
for (i in seq_along(mvb$sound.files)) {
  output[[i]] <- acianalyzeFUN(mvb, i)
}
Which returns this error:
max_freq not set, using value of: 22050
min_freq not set, using value of: 0
This is a mono file.
Calculating index. Please wait...
Acoustic Complexity Index (total): 19183.03
Acoustic Complexity Index (by minute): 931.98
Error in output[[i]] <- acianalyzeFUN(mvb, i) :
more elements supplied than there are to replace
Thanks for any help and advice on this. Please let me know if there are any other pieces of information that would be helpful.
The read_wave function takes the following arguments:
read_wave(X, index, from = X$start[index], to = X$end[index], channel = NULL,
          header = FALSE, path = NULL)
In the manual test, you specify from = mvb$start[1], to = mvb$end[1]
In the function you created, you don't name the arguments:
r <- read_wave(mvb$sound.files[i], mvb$start[i], mvb$end[i])
so mvb$start[i] gets assigned to index and mvb$end[i] to from.
You should write:
acianalyzeFUN <- function(mvb, i){
  r <- read_wave(mvb$sound.files[i], from = mvb$start[i], to = mvb$end[i])
  soundfile.aci <- acoustic_complexity(r, j=10)
}
This should explain the difference you observe.
Regarding the error: you preallocate output as an atomic logical vector (note that vector("logical", length(97)) has length 1, because length(97) is 1), while acianalyzeFUN ends with an assignment and so invisibly returns the multi-element result of acoustic_complexity(). A multi-element object cannot be stored in a single element of an atomic vector, hence "more elements supplied than there are to replace". Preallocate a list instead, e.g. output <- vector("list", length(mvb$sound.files)), and have the function return its result explicitly.
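Putting both fixes together, a hedged sketch of the full loop (the element name AciTotAll_left is taken from soundecology's documented ACI output; check names() on one result to confirm it matches your version):
library(tuneR)
library(warbleR)       # read_wave()
library(soundecology)  # acoustic_complexity()

acianalyzeFUN <- function(mvb, i){
  r <- read_wave(mvb$sound.files[i], from = mvb$start[i], to = mvb$end[i])
  acoustic_complexity(r, j=10)  # last expression, so this is the return value
}

# a list can hold the full result object for each file
output <- vector("list", length(mvb$sound.files))
for (i in seq_along(mvb$sound.files)) {
  output[[i]] <- acianalyzeFUN(mvb, i)
}

# collect the totals into a data frame for export to a spreadsheet
aci_results <- data.frame(
  sound.files = mvb$sound.files,
  aci_total   = sapply(output, function(x) x$AciTotAll_left)
)
write.csv(aci_results, "aci_results.csv", row.names = FALSE)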

Accessing Spotify API with Rspotify to obtain genre information for multiple artists

I am using RStudio 3.4.4 on a Windows 10 machine.
I have a vector of artist names and I am trying to get genre information for them all from Spotify. I have successfully set up the API, and the RSpotify package is working as expected.
I am trying to build up to a function, but I am failing pretty early on.
So far I have the following, but it is returning unexpected results:
len <- nrow(Artist_Nam)
artist_info <- character(artist)
for(i in 1:len){
  ifelse(nrow(searchArtist(Artist_Nam$ArtistName[i], token = keys)) >= 1,
         artist_info[i] <- searchArtist(Artist_Nam$ArtistName[i], token = keys)$genres[1],
         artist_info[i] <- "")
}
artist_info
I was expecting this to return a list of genres, with an empty entry "" for artists that have no match on Spotify.
Instead, what is returned is a list whose entries are populated with genres; on inspection these genres are correct, and there are "" entries where there is no match. However, something odd happens from [73] onwards (I have over 3,000 artists): the list only returns "" from that point on.
This is despite the fact that when I look these artists up manually with searchArtist() there are matches.
I wonder if anyone has any suggestions or has experienced anything like this before?
There may be a rate limit on the number of requests you can make per minute, and you may just be hitting that limit. Add a small delay with Sys.sleep() within your loop so you don't hit their API hard enough to be throttled.
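A hedged sketch of what that could look like with the loop from the question (the 0.5-second pause is a guess at a safe spacing, not a documented Spotify limit; calling searchArtist() once per artist instead of twice also halves the number of requests):
library(Rspotify)

len <- nrow(Artist_Nam)
artist_info <- character(len)

for (i in 1:len) {
  res <- searchArtist(Artist_Nam$ArtistName[i], token = keys)
  if (nrow(res) >= 1) {
    artist_info[i] <- res$genres[1]
  } else {
    artist_info[i] <- ""
  }
  Sys.sleep(0.5)  # brief pause between requests to stay under the rate limit
}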

Using Sys.sleep() to delay API call

I'm using R to make an API call to a weather data provider to download some weather forecasts. I'm using a free key that allows me to make no more than 10 calls per minute. I've tried using Sys.sleep() to ensure I don't go over the threshold but the API resource monitor tells me that I've exceeded the number of calls.
For example, if I'm making 6 calls, a time interval of 10 seconds between the calls ought to be sufficient (not taking into account the time R would need).
dat <- list()
for(i in 1:6){
  dat[[i]] <- getWeatherData(web_url, api_key, history_date, data_format)
  Sys.sleep(10)
  web_url <- gsub(i-1, i, url)
}
The getWeatherData function does the following:
makes the API call (only one API call is made each time the function is invoked. Uses httr::GET() to get the data),
parses the XML output to get desired variables (regular expressions),
performs some clean-up (for missing/garbage values),
converts strings to R date-time objects (POSIXct), and
rounds values to the nearest hour (lubridate::round_date()).
Function inputs:
web_url is a custom url,
api_key is my personal key,
history_date is a string (formatted as "%d/%m/%Y %H:%M:%S"), and
data_format specifies if I want an .XML or .json file as output.
I cannot share the URL/key for obvious reasons. As soon as I run this, I get a notification from the data provider that I've exceeded the allowable calls per minute (10). I don't get a notification every time - I'm not sure why that is either.
Any help is appreciated!
This solution should be helpful for you if Sys.sleep doesn't do the trick.
Basically, this replaces the use of Sys.sleep with while logic.
dat <- list()
delay_seconds <- 10
for(i in 1:6){
  dat[[i]] <- getWeatherData(web_url, api_key, history_date, data_format)
  date_time <- Sys.time()
  while((as.numeric(Sys.time()) - as.numeric(date_time)) < delay_seconds){}
  web_url <- gsub(i-1, i, url)
}
Here, we are:
defining a number of seconds to wait (delay_seconds <- 10)
defining a start time for comparison (date_time <- Sys.time())
using a while loop that compares the present time to that start time and checks whether the difference is less than our chosen delay interval ((as.numeric(Sys.time()) - as.numeric(date_time)) < delay_seconds)
doing nothing until the wait time is over ({})
I don't know whether you need this, but if you're hoping to get your data out of the list and into one longer combined form, I recommend the dplyr function bind_rows().
dat2<-bind_rows(dat)
Thanks to an answer by rbtj to this question: How to make execution pause, sleep, wait for X seconds in R?
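A further option, not from the answer above but a common pattern, is to sleep only for whatever portion of the interval remains since the previous request started, so that slow responses do not stretch the total wait. A hedged sketch reusing getWeatherData() and the variables from the question:
dat <- list()
min_interval <- 10  # seconds between the starts of consecutive calls
last_call <- Sys.time() - min_interval  # pretend the previous call is long past

for (i in 1:6) {
  elapsed <- as.numeric(difftime(Sys.time(), last_call, units = "secs"))
  if (elapsed < min_interval) Sys.sleep(min_interval - elapsed)  # only sleep what is left
  last_call <- Sys.time()
  dat[[i]] <- getWeatherData(web_url, api_key, history_date, data_format)
  web_url <- gsub(i - 1, i, web_url)  # the question passes `url` here, probably a typo
}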

Loop to wait for result or timeout in r

I've written a very quick BLAST script in R to enable interfacing with the NCBI BLAST API. Sometimes, however, the result URL takes a while to load and my script throws an error until the URL is ready. Is there an elegant way (e.g. a tryCatch option) to handle the error until the result is returned, or to time out after a specified time?
library(rvest)
## Definitive set of blast API instructions can be found here: https://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new/BLAST_URLAPI.html
## Generate query URL
query_url <- function(QUERY,
                      PROGRAM = "blastp",
                      DATABASE = "nr",
                      ...) {
  put_url_stem <- 'https://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Put'
  arguments = list(...)
  paste0(put_url_stem,
         "&QUERY=", QUERY,
         "&PROGRAM=", PROGRAM,
         "&DATABASE=", DATABASE,
         arguments)
}
blast_url <- query_url(QUERY = "NP_001117.2") ## test query
blast_session <- html_session(blast_url) ## create session
blast_form <- html_form(blast_session)[[1]] ## pull form from session
RID <- blast_form$fields$RID$value ## extract RID identifier
get_url <- function(RID, ...) {
  get_url_stem <- "https://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Get"
  arguments = list(...)
  paste0(get_url_stem, "&RID=", RID, "&FORMAT_TYPE=XML", arguments)
}
hits_xml <- read_xml(get_url(RID)) ## this is the sticky part
Sometimes it takes several minutes for the get_url result to go live, so what I would like to do is keep trying, say every 20-30 seconds, until it either produces the result or times out after a pre-specified time.
I think you may find this answer about the use of tryCatch useful
Regarding the 'keep trying until timeout' part, I imagine you can build on top of this other answer about a tryCatch loop on error.
Hope it helps.
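For the 'keep trying until timeout' part, a hedged sketch of how those two pieces could be combined for this case: wrap read_xml() in tryCatch(), retry every 30 seconds, and give up after a chosen deadline (the function name fetch_hits and both timing parameters are made up for illustration; it assumes the get_url()/RID objects defined above).
library(xml2)

fetch_hits <- function(RID, retry_every = 30, timeout = 15 * 60) {
  deadline <- Sys.time() + timeout
  repeat {
    hits <- tryCatch(read_xml(get_url(RID)), error = function(e) NULL)
    if (!is.null(hits)) return(hits)        # result is ready
    if (Sys.time() > deadline) stop("Timed out waiting for BLAST results")
    Sys.sleep(retry_every)                  # wait before polling again
  }
}

hits_xml <- fetch_hits(RID)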

Manual API rate limiting

I am trying to write a manual rate-limiting function for the rgithub package. So far this is what I have:
library(rgithub)
pull <- function(i){
  commits <- get.pull.request.commits(owner = owner, repo = repo, id = i, ctx = get.github.context(), per_page=100)
  links <- digest_header_links(commits)
  number_of_pages <- links[2,]$page
  if (number_of_pages != 0)
    try_default(for (n in 1:number_of_pages){
      if (as.integer(commits$headers$`x-ratelimit-remaining`) < 5)
        Sys.sleep(as.integer(commits$headers$`x-ratelimit-reset`)-as.POSIXct(Sys.time()) %>% as.integer())
      else
        get.pull.request.commits(owner = owner, repo = repo, id = i, ctx = get.github.context(), per_page=100, page = n)
    }, default = NULL)
  else
    return(commits)
}
list <- c(500, 501, 502)
pull_lists <- lapply(list, pull)
The intention is that if the x-ratelimit-remaining variable goes below a certain threshold, the script should wait until the time specified in x-ratelimit-reset has passed and then continue. However, I'm not sure if this is the actual behavior of the if/else setup that I have here.
The function runs fine, but I have some doubts about whether it actually does the rate limiting or whether it somehow skips that step. Hence I ask: a) how can I find out whether it actually does rate limiting, and b) if not, how can I rewrite it so that it does? Would a while condition/loop perhaps be better?
You can test whether it does the rate limiting by changing 5 to a large enough number and adding a display of the timing of Sys.sleep using:
print(system.time(Sys.sleep(...)))
That said, the function seems OK to me; unfortunately, I cannot test it easily as rgithub is not available for my version of R (3.1.3).
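Concretely, that test could look like the sketch below inside pull(), with the threshold raised high enough to force the sleep branch to run (GitHub's authenticated limit is 5,000 requests per hour, so a threshold of 4,000 should trigger it; the header names are as reported in the question):
if (as.integer(commits$headers$`x-ratelimit-remaining`) < 4000) {  # large threshold to force the branch
  wait <- as.integer(commits$headers$`x-ratelimit-reset`) - as.integer(Sys.time())
  print(system.time(Sys.sleep(max(wait, 0))))  # prints how long the script actually slept
}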
Not a canonical answer, but a working example.
You should add some logging to your script, even just something like write.csv(append = TRUE).
I've implemented an automatic anti-DDoS process which prevents your IP from being banned by the exchange market. You can find it in jangorecki/Rbitcoin/R/utils.R.
Rbitcoin.last_api_call is an environment object stored in the package namespace, a kind of per-session package cache.
This can help you set up something similar in your package.
You should also consider an optional parallel-supported version, linking to a database with concurrent reads. My function can easily be modified to queue calls and recheck the timing every X seconds.
Edit
I forgot to add that the mentioned function supports multiple source systems. That allows you, for example, to extend your rgithub code to Bitbucket, etc., and still effectively manage API rate limiting.
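A package-agnostic sketch of the idea described above (the names last_api_call and wait_for_slot are made up here for illustration; the real implementation is in the utils.R file linked above):
# per-session cache of the last request time for each source system
last_api_call <- new.env(parent = emptyenv())

wait_for_slot <- function(source, min_interval = 10) {
  prev <- get0(source, envir = last_api_call, inherits = FALSE)  # NULL on first use
  if (!is.null(prev)) {
    elapsed <- as.numeric(difftime(Sys.time(), prev, units = "secs"))
    if (elapsed < min_interval) Sys.sleep(min_interval - elapsed)
  }
  last_api_call[[source]] <- Sys.time()
  invisible(NULL)
}

# usage: call before each request, keyed by source system
wait_for_slot("github")
# commits <- get.pull.request.commits(...)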
