Rentrez is pulling the wrong data from NCBI in R?

I am trying to download sequence data from E. coli samples within the state of Washington - it's about 1283 sequences, which I know is a lot. The problem that I am running into is that entrez_search and/or entrez_fetch seem to be pulling the wrong data. For example, the following R code does pull 1283 IDs, but when I use entrez_fetch on those IDs, the sequence data I get is from chickens and corn and things that are not E. coli:
search <- entrez_search(db = "biosample",
term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]",
retmax = 9999, use_history = T)
Similarly, I tried pulling the sequence from one sample manually as a test. When I search for the accession number SAMN30954130 on the NCBI website, I see metadata for an E. coli sample. When I use this code, I see metadata for a chicken:
search <- entrez_search(db = "biosample",
term = "SAMN30954130[ACCN]",
retmax = 9999, use_history = T)
fetch_test <- entrez_fetch(db = "nucleotide",
id = search$ids,
rettype = "xml")
fetch_list <- xmlToList(fetch_test)

The issue here is that you are using a Biosample UID to query the Nucleotide database. However, the UID is then interpreted as a Nucleotide UID, so you get a sequence record unrelated to your original Biosample query.
What you need to use in this situation is entrez_link, which uses a UID to link records between two databases.
For example, your Biosample accession SAMN30954130 has the Biosample UID 30954130. You link that to Nucleotide like this:
nuc_links <- entrez_link(dbfrom='biosample', id=30954130, db='nuccore')
And you can get the corresponding Nucleotide UID(s) like this:
nuc_links$links$biosample_nuccore
[1] "2307876014"
And then:
fetch_test <- entrez_fetch(db = "nucleotide",
id = 2307876014,
rettype = "xml")
This is covered in the section "Finding cross-references" of the rentrez tutorial.
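For the full Washington query, the same linking step applies to every ID from the original search. A minimal sketch (by_id = TRUE returns one link set per input UID; batching is an untested suggestion for ~1283 IDs, not something verified against the full record set):
library(rentrez)
search <- entrez_search(db = "biosample",
                        term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]",
                        retmax = 9999, use_history = TRUE)
# one elink result per Biosample UID; consider batching a few hundred IDs at a time
links <- entrez_link(dbfrom = "biosample", db = "nuccore",
                     id = search$ids, by_id = TRUE)
nuc_ids <- unlist(lapply(links, function(x) x$links$biosample_nuccore))
# fetch the linked sequences (FASTA here; use rettype = "xml" for full records)
seqs <- entrez_fetch(db = "nucleotide", id = nuc_ids, rettype = "fasta")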

Related

Update Purrr loop to input data row by row in R

This question kinda builds on questions I asked here and here, but it's finally coming together and I think I know what the problem is; I just need help kicking it over the goal line. TL;DR at the bottom.
The overall goal as simply put as possible:
I have a dataframe that is from an API pull of a REDCap database. It
has a few columns of information about various studies.
I'd like to go through that dataframe line by line, and push it into a different website called Oncore, through an API.
In the first question linked above (here again), I took a much simpler dataframe, took one column from it (the number), and used it to do an API pull from Oncore: it would download a record, copy one variable it downloaded over to a different spot, and push it back in. It would do this over and over, once per row. Then it would return a simple dataframe of the row number and the API status code returned.
Now I want to get a bit more complicated: instead of just pulling a number from one column, I want to swap over a bunch of variables from my original dataframe and upload them.
The idea is for sample studies entered into REDCap to be pushed into Oncore.
What I've tried:
I have this dataframe from the REDCap API pull:
testprotocols <- structure(list(
  protocol_no = c("LS-P-Joe's API", "JoeTest3"),
  nct_number = c(654321, 543210),
  library = structure(c(2L, 2L), levels = c("General Research", "Oncology"), class = "factor"),
  organizational_unit = structure(c(1L, 1L), levels = c("Lifespan Cancer Institute", "General Research"), class = "factor"),
  title = c("Testing to see if basic stuff came through", "Testing Oncology Projects for API"),
  department = structure(c(2L, 2L), levels = c("Diagnostic Imaging", "Lifespan Cancer Institute"), class = "factor"),
  protocol_type = structure(2:1, levels = c("Basic Science", "Other"), class = "factor"),
  protocolid = 1:2
), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"))
I have used this code to try and push the data into Oncore:
##This chunk gets a random one we're going to change later
base <- "https://website.forteresearchapps.com"
endpoint <- "/website/rest/protocols/"
protocol <- "2501"
## 'results' will get changed later to plug back in
## store
protocolid <- protocolnb <- library_names <- get_codes <- put_codes <- list()
UpdateAccountNumbers <- function(protocol){
call2<-paste(base,endpoint, protocol, sep="")
httpResponse <- GET(call2, add_headers(authorization = token))
results = fromJSON(content(httpResponse, "text"))
results$protocolId<- "8887" ## doesn't seem to matter
results$protocolNo<- testprotocols$protocol_no
results$library<- as.character(testprotocols$library)
results$title<- testprotocols$title
results$nctNo<-testprotocols$nct_number
results$objectives<-"To see if the API works, specifically if you can write over a previous number"
results$shortTitle<- "Short joseph Title"
results$nctNo<-testprotocols$nct_number
results$department <- as.character(testprotocols$department)
results$organizationalUnit<- as.charater(testprotocols$organizational_unit)
results$protocolType<- as.character(testprotocols$protocol_type)
call2 <- paste(base,endpoint, protocol, sep="")
httpResponse_put <- PUT(
call2,
add_headers(authorization = token),
body=results, encode = "json",
verbose()
)
# save stats
protocolid <- append(protocolid, protocol)
protocolnb <- append(protocolnb, testprotocols$PROTOCOL_NO[match(protocol, testprotocols$PROTOCOL_ID)])
library_names <- append(library_names, testprotocols$LIBRARY[match(protocol, testprotocols$PROTOCOL_ID)])
get_codes <- append(get_codes, status_code(httpResponse_get))
put_codes <- append(put_codes, status_code(httpResponse_put))
}
## Oncology will have to change to whatever the df name is, above and below this
purrr::walk(testprotocols$protocol_no, UpdateAccountNumbers)
allresults <- tibble('protocolNo'=unlist(protocol_no),'protocolnb'=unlist(protocolnb),'library_names'=unlist(library_names), 'get_codes'=unlist(get_codes), 'put_codes'=unlist(put_codes) )
When I get to the line:
purrr::walk(testprotocols$protocol_no, UpdateAccountNumbers)
I get an error (the error message and the traceback() output were posted as screenshots and are not reproduced here).
When I step through the loop line by line I realized that in this chunk of code:
call2<-paste(base,endpoint, protocol, sep="")
httpResponse <- GET(call2, add_headers(authorization = token))
results = fromJSON(content(httpResponse, "text"))
results$protocolId<- "8887" ## doesn't seem to matter
results$protocolNo<- testprotocols$protocol_no
results$library<- as.character(testprotocols$library)
results$title<- testprotocols$title
results$nctNo<-testprotocols$nct_number
results$objectives<-"To see if the API works, specifically if you can write over a previous number"
results$shortTitle<- "Short joseph Title"
results$nctNo<-testprotocols$nct_number
results$department <- as.character(testprotocols$department)
results$organizationalUnit<- as.charater(testprotocols$organizational_unit)
results$protocolType<- as.character(testprotocols$protocol_type)
Where I had envisioned it downloading ONE sample study and replacing aspects of it with variables from ONE row of my beginning dataframe, it's actually trying to paste the entire column in there. I.e., results$nctNo is "654321 543210" instead of just "654321" from the first row.
TL;DR version:
I need my purrr loop to take one row at a time instead of my entire column, and I think if I do that, it'll all magically work.
Within UpdateAccountNumbers(), you are referring to entire columns of the testprotocols frame when you do things like results$nctNo <- testprotocols$nct_number.
Instead, perhaps at the top of the UpdateAccountNumbers() function, you can do something like tp <- testprotocols[testprotocols$protocol_no == protocol, ], and then refer to tp instead of testprotocols when assigning values to results.
Note that your purrr::walk() command is already passing just one value of protocol at a time to the UpdateAccountNumbers() function.
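A condensed sketch of that change, reusing the question's base, endpoint, and token objects (and assuming library(httr) and library(jsonlite) are loaded, as in the question); only a few of the field assignments are shown:
UpdateAccountNumbers <- function(protocol){
  # pull out exactly one row of the source frame for this protocol
  tp <- testprotocols[testprotocols$protocol_no == protocol, ]
  call2 <- paste0(base, endpoint, protocol)
  httpResponse <- GET(call2, add_headers(authorization = token))
  results <- fromJSON(content(httpResponse, "text"))
  results$protocolNo <- tp$protocol_no   # single value, not the whole column
  results$nctNo      <- tp$nct_number
  results$library    <- as.character(tp$library)
  results$title      <- tp$title
  PUT(call2, add_headers(authorization = token),
      body = results, encode = "json")
}
purrr::walk(testprotocols$protocol_no, UpdateAccountNumbers)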

Vectorizing for-loop in R

Oh, man. I am so terrible at removing for-loops from my code because I find them so intuitive and I first learned C++. Below, I am fetching IDs for a search (copd in this case) and using each ID to retrieve its full XML file, and from that saving its location (the author affiliation) into a vector. I do not know how to speed this up, and it took about 5 minutes to run on 700 IDs, whereas most searches have 70,000+ IDs. Thank you for any and all guidance.
library(rentrez)
library(XML)
# number of articles for term copd
count <- entrez_search(db = "pubmed", term = "copd")$count
# set max to count
id <- entrez_search(db = "pubmed", term = "copd", retmax = count)$ids
# empty vector that will soon contain locations
location <- character()
# get all location data
for (i in 1:count)
{
# get ID of each search
test <- entrez_fetch(db = "pubmed", id = id[i], rettype = "XML")
# convert to XML
test_list <- XML::xmlToList(test)
# retrieve location
location <- c(location, test_list$PubmedArticle$MedlineCitation$Article$AuthorList$Author$AffiliationInfo$Affiliation)
}
This may give you a start - it seems to be possible to pull down multiple at once.
library(rentrez)
library(xml2)
# number of articles for term copd
count <- entrez_search(db = "pubmed", term = "copd")$count
# set max to count
id_search <- entrez_search(db = "pubmed", term = "copd", retmax = count, use_history = T)
# get all
document <- entrez_fetch(db = "pubmed", rettype = "XML", web_history = id_search$web_history)
document_list <- as_list(read_xml(document))
The problem is that this is still time-consuming because there is a large number of documents. It's also curious that it returns exactly 10,000 articles when I've tried this - there may be a limit to what you can return at once.
You can then use something like the purrr package to start extracting the information you want.
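A hedged sketch of both ideas together - paging through the history server in fixed-size batches and pulling the affiliations with xml2. retstart and retmax are standard EFetch parameters that rentrez forwards; the batch size, the Sys.sleep pause, and the exact XPath are assumptions:
library(rentrez)
library(xml2)
library(purrr)
id_search <- entrez_search(db = "pubmed", term = "copd",
                           retmax = 0, use_history = TRUE)
batch <- 500
starts <- seq(0, id_search$count - 1, by = batch)
location <- unlist(map(starts, function(s) {
  doc <- entrez_fetch(db = "pubmed", web_history = id_search$web_history,
                      rettype = "XML", retstart = s, retmax = batch)
  Sys.sleep(0.4)  # stay under NCBI's request-rate limit
  xml_text(xml_find_all(read_xml(doc), ".//AffiliationInfo/Affiliation"))
}))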

Collect search occurrences with rscopus in r?

I have to make a lot of queries on Scopus. For this reason I need to automate the process.
I have loaded the rscopus package and wrote this code:
test <- generic_elsevier_api(query = "industry",
type = c("abstract"),
search_type = c("scopus"),
api_key = myLabel,
headers = NULL,
content_type = c("content"),
root_http = "http://api.elsevier.com",
http_end = NULL,
verbose = TRUE,
api_key_error = TRUE)
My goal is to obtain the number of occurrences of a particular query.
In this example, if I search for "industry", I want to obtain the number of search results for the query:
query occurrence
industry 1789
How could I do this?
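A hedged sketch of one way to get that number: the Scopus Search API reports the total hit count in the opensearch:totalResults field of its JSON response, so, assuming generic_elsevier_api exposes the parsed body under $content and that type = "search" is the right type for a search query, something like this may work:
library(rscopus)
test <- generic_elsevier_api(query = "industry",
                             type = "search",
                             search_type = "scopus",
                             api_key = myLabel)
# total hit count reported by Scopus (returned as a character string)
occurrences <- test$content$`search-results`$`opensearch:totalResults`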

How to convert text fields into numeric/vector space for a SVM in R Studio? [duplicate]

I am attempting to train a Support Vector Machine to aid in the detection of similarity between strings. My training data consists of two text fields and a third field that contains 0 or 1 to indicate similarity. This last field was calculated with the help of an edit distance operation. I know that I need to convert the two text fields to numeric values before continuing. I am hoping to find out the best method to achieve this.
The training data looks like:
ID MAKTX_Keyword PH_Level_04_Keyword Result
266325638 AMLODIPINE AMLODIPINE 0
724712821 IRBESARTANHCTZ IRBESARTANHCTZ 0
567428641 RABEPRAZOLE RABEPRAZOLE 0
137472217 MIRTAZAPINE MIRTAZAPINE 0
175827784 FONDAPARINUX ARIXTRA 1
456372747 VANCOMYCIN VANCOMYCIN 0
653832438 BRUFEN IBUPROFEN 1
917575539 POTASSIUM POTASSIUM 0
222949123 DIOSMINHESPERIDIN DIOSMINHESPERIDIN 0
892725684 IBUPROFEN IBUPROFEN 0
I have been experimenting with the text2vec library, using this useful vignette as a guide. In doing so, I can presumably represent one of the fields in vector space.
But how can I use this library to manage both text fields at the same time?
Should I concatenate the two string fields into a single field?
Is text2vec the best approach to take?
The code that will be used to manage one of the fields:
library(text2vec)
library(data.table)
preproc_func = tolower
token_func = word_tokenizer
it_train = itoken(Train_PRDHA_String.df$MAKTX_Keyword,
preprocessor = preproc_func,
tokenizer = token_func,
ids = Train_PRDHA_String.df$ID,
progressbar = TRUE)
vocab = create_vocabulary(it_train)
vectorizer = vocab_vectorizer(vocab)
t1 = Sys.time()
dtm_train = create_dtm(it_train, vectorizer)
print(difftime(Sys.time(), t1, units = 'sec'))
dim(dtm_train)
identical(rownames(dtm_train), Train_PRDHA_String.df$id)
One way to embed docs into the same space is to learn vocabulary from both columns:
preproc_func = tolower
token_func = word_tokenizer
union_txt = c(Train_PRDHA_String.df$MAKTX_Keyword, Train_PRDHA_String.df$PH_Level_04_Keyword)
it_train = itoken(union_txt,
preprocessor = preproc_func,
tokenizer = token_func,
ids = Train_PRDHA_String.df$ID,
progressbar = TRUE)
vocab = create_vocabulary(it_train)
vectorizer = vocab_vectorizer(vocab)
it1 = itoken(Train_PRDHA_String.df$MAKTX_Keyword, preproc_func,
token_func, ids = Train_PRDHA_String.df$ID)
dtm_train_1 = create_dtm(it1, vectorizer)
it2 = itoken(Train_PRDHA_String.df$PH_Level_04_Keyword, preproc_func,
token_func, ids = Train_PRDHA_String.df$ID)
dtm_train_2 = create_dtm(it2, vectorizer)
And after that you can combine them into a single matrix:
dtm_train = cbind(dtm_train_1, dtm_train_2)
However, if you want to solve the problem of duplicate detection, I suggest using char_tokenizer with ngram > 1 (say ngram = c(3, 3)). Also check out the great stringdist package. I suppose the Result column was produced with some manual human work, because if it is just edit distance, the algorithm will at most learn how edit distance works.
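A minimal sketch of that character n-gram variant, reusing union_txt from above (char_tokenizer and the ngram argument to create_vocabulary are part of text2vec; the 3-gram setting is just the suggestion above):
it_chars = itoken(union_txt,
                  preprocessor = tolower,
                  tokenizer = char_tokenizer,
                  progressbar = TRUE)
vocab_chars = create_vocabulary(it_chars, ngram = c(3L, 3L))
vectorizer_chars = vocab_vectorizer(vocab_chars)
# then build dtm_train_1 and dtm_train_2 with create_dtm() exactly as above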

Storing data from a for loop in a data frame

I am trying to create a function that interacts with the PubMed API to retrieve the XML files associated with 100 publications. I then want to parse the XML files individually to retrieve the title and the abstract of each publication. I am using the rentrez package to interact with the API, and have successfully retrieved the necessary XML files. I am using the XML package to parse them, and have verified that the XPath expressions retrieve the data that I want. In truth, I am looking to take data from other fields as well (journal title, MeSH terms, etc.), but I am stuck at this step here.
However, I have not been able to create a proper for loop to move this data into a data frame. I receive the following error from running my code:
Error in `$<-.data.frame`(`*tmp*`, "Abstract", value = list("text of abstract")) :
  replacement has 1 row, data has 0
When I test the function to receive title information (by removing the expression to retrieve abstract information), I receive an empty data frame with no information about the titles that I want. But there is no error message then.
If I execute pubmed_parsed("Kandel+Eric", n=2), my goal is to receive a data frame with the character vectors from the two titles in the column "ATitle" (titles: "Roles for small noncoding RNAs in silencing of retrotransposons in the mammalian brain" and "ApCPEB4, a non-prion domain containing homolog of ApCPEB, is involved in the initiation of long-term facilitation"), and the character vectors from the two abstracts in the column "Abstract" (portions of abstracts: "Piwi-interacting RNAs (piRNAs), long thought to be restricted to germline...", "Two pharmacologically distinct types of local protein synthesis are required for synapse-specific...").
library(XML)
library(rentrez)
pubmed_parsed <- function(term, n=100){
df <- data.frame(ATitle = character(), JTitle = character(), MeshTerms = character(), Abstract = character(), FAuthor = character(), LAuthor = character(), stringsAsFactors = FALSE)
IdList <- entrez_search(db = "pubmed", term = term, retmode = "xml", retmax = n)
for (i in 1:n){
XmlFile <- entrez_fetch(db = "pubmed", id=IdList$ids[i], rettype = "xml", retmode = "xml", parsed=TRUE)
Parsed <- xmlRoot(XmlFile)
df$ATitle[i] <- xpathSApply(Parsed, "/PubmedArticleSet/PubmedArticle/MedlineCitation/Article/Title", xmlValue, simplify = FALSE)
df$Abstract[i] <- xpathSApply(Parsed, "/PubmedArticleSet/PubmedArticle/MedlineCitation/Article/Abstract/AbstractText", xmlValue, simplify = FALSE)
}
df
}
Here's one way to get a table and a few suggestions. First, I would use the Web history option and download all results together instead of looping through downloads.
ids <- entrez_search(db = "pubmed", term = "Kandel ER", use_history = TRUE)
ids
Entrez search result with 502 hits (object contains 20 IDs and a web_history object)
Search term (as translated): Kandel ER[Author]
doc <- entrez_fetch(db="pubmed", web_history=ids$web_history, rettype="xml", retmax = 3, parsed=TRUE)
Next, get the articles into a node set and query that to handle all your missing and multiple tags.
articles <- getNodeSet( doc, "//PubmedArticle")
length(articles)
[1] 3
articles[[1]]
<PubmedArticle>
<MedlineCitation Status="Publisher" Owner="NLM">
<PMID Version="1">27791114</PMID>
<DateCreated>
...
I usually create a function to add NAs if tags are missing and join multiple tags using a comma.
xpath2 <-function(x, path, fun = xmlValue, ...){
y <- xpathSApply(x, path, fun, ...)
ifelse(length(y) == 0, NA,
ifelse(length(y) > 1, paste(unlist(y), collapse=", "), y))
}
Then just apply that function to the nodes (with the leading dot in xpath so it's relative to that node). This will combine multiple keywords into a comma-separated list and include NA for article 3 with missing keywords.
sapply(articles, xpath2, ".//Keyword")
[1] "DNA methylation, behavior, endogenous siRNA, piwi-interacting RNA, transposon"
[2] "Aplysia, CPEB, CPEB4, Long-term facilitation"
[3] NA
Most xpath expressions should work:
sapply(articles, xpath2, ".//PubDate/Year")
[1] "2016" "2016" "2016"
sapply(articles, xpath2, ".//ArticleId[@IdType='pmc']")
[1] "PMC5111663" "PMC5075418" NA
You can also use xmlGetAttr if needed
sapply(articles, xpath2, ".//Article", xmlGetAttr, "PubModel")
[1] "Print-Electronic" "Electronic" "Electronic"
Finally, create a data.frame
data.frame(
ATitle = sapply(articles, xpath2, ".//ArticleTitle"),
JTitle = sapply(articles, xpath2, ".//Journal/Title"),
Keywords = sapply(articles, xpath2, ".//Keyword"),
Authors = sapply(articles, xpath2, ".//Author/LastName"),
Abstract = sapply(articles, xpath2, ".//AbstractText"))
I'm not sure what happened to MeSH terms, but I only see Keywords in the few examples I downloaded. Also, there are probably a few ways to get first and last authors. You could get both last name and initials (assuming both are always present) and replace the comma before the initials to get an Author string. Then split that to get first and last author or even print the first three below.
au <- sapply(articles, xpath2, ".//Author/LastName|.//Author/Initials")
au <- gsub(",( [A-Z]+,?)", "\\1", au)
authors_etal <- function(x, authors=3, split=", *"){
y <- strsplit(x, split)
sapply(y, function(x){
if(length(x) > (authors + 1)) x <- c(x[1:authors], "et al.")
paste(x, collapse=", ")
})
}
authors_etal(au)
[1] "Nandi S, Chandramohan D, Fioriti L, et al."
[2] "Lee SH, Shim J, Cheong YH, et al."
[3] "Si K, Kandel ER"
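On the MeSH terms: records that have them carry a MeshHeadingList block, so the same xpath2 helper should work where they are present (the path below follows the standard PubMed DTD; the three example articles may simply lack that block):
sapply(articles, xpath2, ".//MeshHeadingList/MeshHeading/DescriptorName")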
