Automate input to user query in R

I apologize if this question has already been asked under terminology I don't recognize, but it doesn't appear to have been.
I am using the function comm2sci from the taxize package to search for the scientific names of a database of over 120,000 rows of common names. Here is a subset of 10:
commnames <- c("WESTERN CAPERCAILLIE", "AARDVARK", "AARDWOLF", "ABACO ISLAND BOA",
"ABBOTT'S DAY GECKO", "ABDIM'S STORK", "ABRONIA GRAMINEA", "ABYSSINIAN BLUE
WINGED GOOSE",
"ABYSSINIAN CAT", "ABYSSINIAN GROUND HORNBILL")
When searching the NCBI database, this function asks for user input if the common name is generic/general rather than species specific; for example, the following call will ask for clarification on "AARDVARK", expecting '1', '2', or 'return' for 'NA'.
install.packages("taxize")
library(taxize)
ncbioutput <- comm2sci(commnames, db = "ncbi")  # querying the NCBI database
Because of this, I cannot rely on this function to find the names of the 120,000 species without sitting there and pressing 'return' every few minutes. I know this question sounds taxize specific, but I've run into this situation with other functions as well. My question is: is there a general way to place the comm2sci call in a conditional statement that will return a specific value when user input is prompted? Or otherwise write a function that will supply some input when prompted?
All searches related to this tell me how to ask for user input but not how to override user queries. These are two of the question threads I've found, but I can't seem to apply them to my situation: Make R wait for console input?, Switch R script from non-interactive to interactive
I hope this was clear. Thank you very much for your time!

The get_* functions used internally all ask for user input by default when there is more than one option. But each of those functions has a sister function with a trailing underscore, e.g. get_uid_, that does not prompt for input and returns all the data. You can use that to get everything, then process it however you like.
Made some changes to comm2sci, so update first: devtools::install_github("ropensci/taxize")
Here's an example.
library(taxize)
commnames <- c("WESTERN CAPERCAILLIE", "AARDVARK", "AARDWOLF", "ABACO ISLAND BOA",
"ABBOTT'S DAY GECKO", "ABDIM'S STORK", "ABRONIA GRAMINEA",
"ABYSSINIAN BLUE WINGED GOOSE",
"ABYSSINIAN CAT", "ABYSSINIAN GROUND HORNBILL")
Then use get_uid_ to get all data
ids <- get_uid_(commnames)
Process the results in ids as you like. Here, for brevity, we'll just grab the first row of each
ids <- lapply(ids, function(z) z[1,])
Then pull the UIDs out
ids <- as.uid(unname(vapply(ids, "[[", "", "uid")), check = FALSE)
And pass to comm2sci
comm2sci(ids)
$`100830`
[1] "Tetrao urogallus"
$`9818`
[1] "Orycteropus afer"
$`9680`
[1] "Proteles cristatus"
$`51745`
[1] "Chilabothrus exsul"
$`8565`
[1] "Gekko"
$`39789`
[1] "Ciconia abdimii"
$`278977`
[1] "Abronia graminea"
$`8865`
[1] "Cyanochen cyanopterus"
$`9685`
[1] "Felis catus"
$`153643`
[1] "Bucorvus abyssinicus"
Note that NCBI returns common names from get_uid/get_uid_, so you can just pluck those out if you want.
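For example, something along these lines (a sketch, not from the original answer; the exact column name in the returned data.frames may vary between taxize versions, so check names() on one element first):
raw <- get_uid_(commnames)  # keep the full data.frames this time
common <- vapply(raw, function(z) {
  # assumes a "commonname" column; inspect names(raw[[1]]) to confirm
  if (!is.null(z) && nrow(z) > 0) as.character(z$commonname[1]) else NA_character_
}, character(1))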

Related

How do I cache vectorized calls that take user input in R?

I am trying to calculate a field for all rows of a large dataset. The function to calculate it is from the package taxize, and uses an HTTP request to query an external site for the right ID number. It is searching by scientific name, and often there are multiple results, in which case this function asks for user input. I would like the function to cache my selection and return that ID number every time the same call is made from then on. I have tried with my own caching function and with memoizedCall() from the package R.cache but every time it hits the second entry of the same scientific name it still prompts me for user input. I feel like I am misunderstanding something basic about how vectorization works. Sorry for my ignorance but any advice is appreciated.
Here is the code I used as a custom caching function.
check_tsn <- function(data, tsn_list){
  print(data)
  print(tsn_list)
  if (is.null(tsn_list$data)){
    tsn_list$data = taxize::get_tsn(data)
    print('added to tsn_list')
  }
  return(tsn_list$data)
}
tsn_list <- vector(mode = "list", nrow(wanglang))
Genus.Species <- c('Tamiops swinhoei','Bos taurus','Tamiops swinhoei')
IUCN.ID <- c('21382','','21382')
species <- data.frame(Genus.Species,IUCN.ID)
species$TSN.ID = check_tsn(species$Genus.Species,tsn_list)
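One way to get the caching behaviour described above is to memoise a single-name lookup and then iterate, so the cache is checked per element rather than once for the whole vector. A minimal sketch, assuming the memoise package (not used in the original post); rows = 1 suppresses the interactive prompt, as noted elsewhere on this page:
library(taxize)
library(memoise)

# Memoised single-name lookup: repeated names return the stored result
# instead of querying (and prompting) again.
get_tsn_cached <- memoise(function(name) {
  get_tsn(name, rows = 1)
})

Genus.Species <- c('Tamiops swinhoei', 'Bos taurus', 'Tamiops swinhoei')
species <- data.frame(Genus.Species, stringsAsFactors = FALSE)

# Apply element by element; the third call hits the cache.
species$TSN.ID <- vapply(species$Genus.Species,
                         function(x) as.character(get_tsn_cached(x)),
                         character(1))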

Accessing Spotify API with Rspotify to obtain genre information for multiple artists

I am using RStudio with R 3.4.4 on a Windows 10 machine.
I have got a vector of artist names and I am trying to get genre information for them all on spotify. I have successfully set up the API and the RSpotify package is working as expected.
I am trying to build up to create a function but I am failing pretty early on.
So far I have the following, but it is returning unexpected results
len <- nrow(Artist_Nam)
artist_info <- character(len)
for (i in 1:len) {
  ifelse(nrow(searchArtist(Artist_Nam$ArtistName[i], token = keys)) >= 1,
         artist_info[i] <- searchArtist(Artist_Nam$ArtistName[i], token = keys)$genres[1],
         artist_info[i] <- "")
}
artist_info
I was expecting this to return a list of genres, with an empty entry "" for artists where there is no match on Spotify.
Instead, what is returned is a list whose entries are populated with genres, and on inspection these genres are correct, with "" where there is no match. However, something odd happens from [73] onwards (I have over 3,000 artists): the list now only returns "".
This is despite the fact that when I look these artists up manually with searchArtist() there are matches.
I wonder if anyone has any suggestions or has experienced anything like this before?
There may be a rate limit on the number of requests you can make per minute, and you may just be hitting that limit. Adding a small delay with Sys.sleep() inside your loop should keep you from hitting their API hard enough to be throttled.
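For example, reusing the loop from the question with a pause added and with searchArtist() called only once per artist (the 0.2-second delay is a guess; tune it to whatever limit Spotify documents):
len <- nrow(Artist_Nam)
artist_info <- character(len)
for (i in 1:len) {
  res <- searchArtist(Artist_Nam$ArtistName[i], token = keys)
  artist_info[i] <- if (nrow(res) >= 1) res$genres[1] else ""
  Sys.sleep(0.2)  # brief pause between requests so they are not throttled
}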

Automate Response at Prompt in R interactive

Please see my reference below to a previous question asked along these lines.
I am running the library taxize in R. Taxize includes a function for getting a stable number associated with a scientific name, get_tsn().
I can run this in interactive mode or non-interactive mode so that I am either prompted or not, respectively, to choose among multiple hits.
Interactive:
> tax.num <- get_tsn("Acer rubrum", ask=TRUE)
Retrieving data for taxon 'Acer rubrum'
     tsn                      target      commonNames    nameUsage
1  28728                 Acer rubrum        red maple     accepted
2  28730 Acer rubrum ssp. drummondii               NA not accepted
3 526853 Acer rubrum var. drummondii Drummond's maple     accepted
...
More than one TSN found for taxon 'Acer rubrum'!
Enter rownumber of taxon (other inputs will return 'NA'):
Non-interactive:
> tax.num <- get_tsn("Acer rubrum", ask=TRUE)
Retrieving data for taxon 'Acer rubrum'
Warning message:
> 1 result; no direct match found
I need to run this library in interactive mode so that I do not get an empty result when there is more than one match. However, babysitting this script is totally unrealistic for the size of my data, which are in the millions of scientific names. Thus, I want to automate a response to the prompt so that the answer is always 1. This will be the right answer for probably 99% of cases and will ultimately still lead to the right answer downstream in 100% of cases for reasons that are probably beyond the scope of this question.
Thus, how can I automate the response to always be 1?
I looked at this question and tried modifying my code accordingly.
options(httr_oauth_cache=T)
tax.num <- get_tsn("Acer rubrum",ask=T)
However, this gave the same result shown for interactive mode above.
Your help is appreciated.
UPDATE: Ignore below. Obviously Nathan Werth posted the best answer in a comment above.
tax.num <- get_tsn_(searchterm = "Acer rubrum", rows = 1)
works wonderfully!
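The same idea extends to a whole vector of names (a sketch; the tsn column name is an assumption, so inspect one element of the result to confirm):
sci_names <- c("Acer rubrum", "Quercus rubra")  # stand-ins for the real name list
res <- get_tsn_(searchterm = sci_names, rows = 1)
tsns <- vapply(res, function(z) {
  if (!is.null(z) && nrow(z) > 0) as.character(z$tsn[1]) else NA_character_
}, character(1))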
...
I decided to modify the source code to handle this. I suspect that there is a more desirable solution, but this one meets my needs.
Thus, in the file get_tsn.R from the source, I replaced the following block of code
# prompt
message("\n\n")
print(tsn_df)
message("\nMore than one TSN found for taxon '", x, "'!\n
Enter rownumber of taxon (other inputs will return 'NA'):\n")
# prompt
take <- scan(n = 1, quiet = TRUE, what = 'raw')
with
take <- 1
I could also have deleted the other bits that echo to the screen, which are unnecessary and now no longer accurate.
The revised function, which I tested using trace("get_tsn", edit = TRUE), returns as follows:
> print(tax.num)
[1] "28728"
attr(,"match")
[1] "found"
attr(,"multiple_matches")
[1] TRUE
attr(,"pattern_match")
[1] FALSE
attr(,"uri")
[1] "http://www.itis.gov/servlet/SingleRpt/SingleRpt?
search_topic=TSN&search_value=28728"
attr(,"class")
[1] "tsn"
I will recompile and install it on Linux now with the edit for use with this particular project.
I still welcome other, better answers.

Writing help information for user defined functions in R

I frequently use user defined functions in my code.
RStudio supports the automatic completion of code using the Tab key. I find this amazing because I always can read quickly what is supposed to go in the (...) of functions/calls.
However, my user-defined functions just show the parameters: no additional info and, obviously, no help page.
This isn't so much of a pain for me, but when I share code I think it would be useful to have some information at hand besides the # comments on every line.
Nowadays, when I share code, my functions usually look like this:
myfun <- function(x1, x2, x3, ...){
  # This is a function for this and that
  # x1 is a factor, x2 is an integer ...
  # This line of code is useful for transformation of x2 by x1
  some code here
  # Now we do this other thing
  more code
  # This is where the magic happens
  return(magic)
}
I think this line by line comment is great but I'd like to improve it and make some things handy just like every other function.
Not really an answer, but if you are interested in exploring this further, you should start at the rcompgen-help page (although that's not a function name) and also examine the code of:
rc.settings
Also, executing this allows you to see what the .CompletionEnv has in it for currently loaded packages:
names(rc.status())
#-----
[1] "attached_packages" "comps" "linebuffer" "start"
[5] "options" "help_topics" "isFirstArg" "fileName"
[9] "end" "token" "fguess" "settings"
And if you just look at:
rc.status()$help_topics
... you see the character items that the tab-completion mechanism uses for matching. On my machine at the moment there are 8881 items in that vector.
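If the underlying goal is proper argument descriptions and a real help page, the usual route (not covered in the answer above) is to put the function in a package and document it with roxygen2-style comments. A minimal sketch for the myfun example from the question:
#' Transform x2 by x1
#'
#' This is a function for this and that.
#'
#' @param x1 a factor
#' @param x2 an integer
#' @param x3 another input
#' @param ... further arguments
#' @return the transformed result ("magic")
#' @export
myfun <- function(x1, x2, x3, ...) {
  # body as before
}
After running devtools::document() and loading the package, ?myfun works and RStudio can show the parameter descriptions during completion.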

SnowballC in R stems "many" and "only"

I am using SnowballC to process a text document, but I have realized that it stems words such as "many" and "only" even though they are not supposed to be stemmed.
> library(SnowballC)
>
> str <- c("many", "only", "things")
> str.stemmed <- stemDocument(str)
> str.stemmed
[1] "mani" "onli" "thing"
>
> dic <- c("many", "only", "online", "things")
> str.complete <- stemCompletion(str.stemmed, dic)
> str.complete
    mani     onli    thing
      "" "online" "things"
You can see that after stemming, "many" and "only" became "mani" and "onli", which cannot be completed back with stemCompletion later on, since the letters of "mani" are not all contained in "many". Notice how "onli" gets completed to "online" instead of the original "only".
Why is that? Is there a way to fix this?
Stemming is often executed as a set of rules for stripping all affixes, both derivational and inflectional, from a word, leaving its root. Lemmatization typically only removes inflectional affixes. Stemming is a much more aggressive version of lemmatization. Given what you want, it seems like you'd prefer lemmatization.
To compare the two, most lemmatizers are limited to a few rules for dealing with affixes to nouns and verbs in English: -ed, -s, and -ing, for example. There are a few irregular cases they have to handle, but with some training data, many are probably covered.
Stemmers are expected to dig deeper. As a result, the space of possible transformations they can make is bigger, so you're a lot more likely to end up with errors.
To see what's happening in your data, let's look at the specifics.
online -> onli: why on earth would this happen? Not totally sure on this one; there's probably some rule that tries to cater to words like medic-ine and medic-al, sub-mari-ne and mari-ne, imagi-ne and imagi-na-tion.
only -> onli, many -> mani: These seem particularly strange, but are probably more reasonable than the previous rule--especially in the context of dealing with verbs that end in -ed. If you're stemming the words denied, studied, modified, specified, you'll want them to be equivalent to their uninflected forms deny, study, modify, specify.
You could have a rule to transform each verb into the uninflected form, but the authors here chose to make the roots the forms ending in -i. To ensure that these match, -y endings had to be transformed to -i as well.
With a lemmatizer, you might get more predictable results. Since they only remove inflectional affixes, you'd get only, many, online, and thing, as you wanted. Both a good stemmer and lemmatizer can work well, but the stemmer does more stuff and therefore has more room for error.
That is how stemmers work. You've got a (smallish) set of rules that reduce most words to something resembling a canonical form (a stem), but not quite. There are many other corner cases you will find, so many in fact that I hesitate to call them corner cases, e.g.
many -> mani
other -> other
corner -> corner
cases -> case
in -> in
sentences -> sentenc
What you want is a lemmatiser. Have a look at this question for a more detailed explanation:
Stemmers vs Lemmatizers
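If you do go the lemmatization route in R, one option (my suggestion, not mentioned in the answers above) is the textstem package:
library(textstem)  # assumes textstem is installed
lemmatize_words(c("many", "only", "online", "things"))
# unlike the stemmer, this should leave readable word forms (e.g. "things" -> "thing")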
