I'm using the "RISmed" library to query for my gene or protein of interest, and the output is essentially a list of PubMed IDs. Most of the time it also contains non-specific hits that are not of interest to me. Since I can only see the PubMed IDs, I have to manually search each returned ID in NCBI to check whether the paper is relevant.
Question: Is there a way, in R, to return the abstract (or some kind of summary) of each paper along with its PubMed ID?
If anyone can help, it would be really great.
Using the example from the manual, what you need is the EUtilsGet function.
library(RISmed)

search_topic <- 'copd'
search_query <- EUtilsSummary(search_topic, retmax = 10,
                              mindate = 2012, maxdate = 2012)
summary(search_query)

# see the ids of our returned query
QueryId(search_query)

# get actual data from PubMed
records <- EUtilsGet(search_query)
class(records)

# store it
pubmed_data <- data.frame('Title' = ArticleTitle(records),
                          'Abstract' = AbstractText(records))
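Since the goal is to see each abstract next to its PubMed ID, you can add the IDs as a column as well. A minimal sketch, assuming RISmed's PMID() accessor (which returns one PubMed ID per record):

# keep the PubMed ID alongside the title and abstract
pubmed_data <- data.frame('PMID' = PMID(records),
                          'Title' = ArticleTitle(records),
                          'Abstract' = AbstractText(records))
head(pubmed_data)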
I'm trying to download DNA sequence data from NCBI using entrez_fetch. With the following code, I perform a search for the IDs of the sequences I need with entrez_search, and then I attempt to download the sequence data in FASTA format:
library(rentrez)

# Search for sequence ids
search <- entrez_search(db = "biosample",
                        term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]",
                        retmax = 9999, use_history = T)
search$ids
length(search$ids)
search$web_history

# Download sequence data
ecoli_fasta <- entrez_fetch(db = "nuccore",
                            web_history = search$web_history,
                            rettype = "fasta")
When I do this, I get the following error:
Error: HTTP failure: 400
Cannot+retrieve+query+from+history
I don't understand what this means and Googling hasn't led me to an answer.
I tried using a different package (ape) and its read.GenBank function to download the sequences as an alternative, but that method only managed to download about 1,000 of the 12,000 sequences I needed. I would like to use entrez_fetch if possible - does anyone have any insight for me?
This may be a starting point. Note that your web history was built against db = "biosample" but you fetch from db = "nuccore"; a web history is tied to the database it was created in, so the mismatch is a likely cause of the "Cannot retrieve query from history" error. The example below searches nuccore directly.
Also be aware that queries to genome databases can return massive amounts of data, so be sure to limit your queries.
Build search web history
library(rentrez)

search <- entrez_search(db = "nuccore",
                        term = "Escherichia coli[Organism]",
                        use_history = T)
Use web history to fetch data
cat(entrez_fetch(db = "nuccore",
                 web_history = search$web_history,
                 rettype = "fasta", retstart = 24, retmax = 100))
>pdb|7QQ3|I Chain I, 23S ribosomal RNA
NGTTAAGCGACTAAGCGTACACGGTGGATGCCCTGGCAGTCAGAGGCGATGAAGGACGTGCTAATCTGCG
ATAAGCGTCGGTAAGGTGATATGAACCGTTATAACCGGCGATTTCCGAATGGGGAAACCCAGTGTGTTTC
GACACACTATCATTAACTGAATCCATAGGTTAATGAGGCGAACCGGGGGAACTGAAACATCTAAGTACCC
CGAGGAAAAGAAATCAACCGAGATTCCCCCAGTAGCGGCGAGCGAACGGGGAGCAGCCCAGAGCCTGAAT
CAGTGTGTGTGTTAGTGGAAGCGTCTGGAAAGGCGCGCGATACAGGGTGACAGCCCCGTACACAAAAATG
CACATGCTGTGAGCTCGATGAGTAGGGCGGGACACGTGGTATCCTGTCTGAATATGGGGGGACCATCCTC
CAAGGCTAAATACTCCTGACTGACCGATAGTGAACCAGTACCGTGAGGGAAAGGCGAAAAGAACCCCGGC
...
Use a loop to cycle through the sequences, e.g.:
for (i in seq(1, 300, 100)) {
  cat(entrez_fetch(db = "nuccore",
                   web_history = search$web_history,
                   rettype = "fasta", retstart = i, retmax = 100))
}
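If you want to save the sequences rather than print them, the same loop can append each batch to a file. A minimal sketch using base R's write(); the file name is just an illustration:

# append each batch of sequences to a FASTA file instead of printing it
for (i in seq(1, 300, 100)) {
  chunk <- entrez_fetch(db = "nuccore",
                        web_history = search$web_history,
                        rettype = "fasta", retstart = i, retmax = 100)
  write(chunk, file = "ecoli.fasta", append = TRUE)
}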
I am trying to download sequence data from 1283 records in GenBank using rentrez. I'm using the following code, first to search for records fitting my criteria, then linking across databases, and finally fetching the sequence data:
# Search for sequence ids in the biosample database
search <- entrez_search(db = "biosample",
                        term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]",
                        retmax = 9999, use_history = T)
search$ids
length(search$ids)
search$web_history

# Link IDs across databases: biosample to nuccore (nucleotide sequences)
nuc_links <- entrez_link(dbfrom = "biosample",
                         id = search$web_history,
                         db = "nuccore",
                         by_id = T)
nuc_links$links

# Fetch nucleotide sequences
fetch_ids1 <- entrez_fetch(db = "nucleotide",
                           id = nuc_links$links$biosample_nuccore,
                           rettype = "xml")
When I do this for a single record, I am able to get the data I need. When I try to scale it up and pull data for all the sequences using the web history of my search, it does not work: nuc_links$links is NULL, which tells me entrez_link is not doing what I hoped. Can anyone show me where I'm going wrong?
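One thing worth checking (a hedged guess, not a verified fix): in rentrez, entrez_link accepts a web history through its web_history argument, while id expects a vector of record IDs, so passing search$web_history to id may be the problem. A sketch of both variants:

# Variant 1: link via the web history itself
nuc_links <- entrez_link(dbfrom = "biosample",
                         web_history = search$web_history,
                         db = "nuccore")

# Variant 2: pass the ID vector; with by_id = TRUE the result is a
# list of link objects, one per input ID, rather than a single object
nuc_links <- entrez_link(dbfrom = "biosample",
                         id = search$ids,
                         db = "nuccore",
                         by_id = TRUE)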
Using R, I want to obtain the list of articles that cite a given scientific journal paper.
The only information I have is the title of the article, e.g. "Protein measurement with the folin phenol reagent".
Is anyone able to help me by producing a replicable example that I can use?
Here is what I tried so far.
The R package fulltext seems useful, because it can retrieve a list of IDs linked to an article. For instance, I can get the article's DOI:
library(fulltext)
res1 <- ft_search(query = "Protein measurement with the folin phenol reagent", from = "crossref")
res1 <- ft_links(res1)
res1$crossref$ids
In the same way, I can get the Scopus ID by setting from = "scopus" in fulltext::ft_search (and by including a Scopus API key).
Using the DOI, I can obtain the article's citation count with the R library rcrossref:
rcrossref::cr_citation_count(res1$crossref$ids[1])
Similarly, I can use the R package rscopus if I want to use the scopus id, rather than the DOI.
Unfortunately, this information is not sufficient for me, as I need the list of articles citing the paper, not just their number.
I have seen many people on the internet using the scholar package, but if I understand correctly, for this to work the article's authors need a Google Scholar ID, and I would first have to find a way to retrieve that ID. So it doesn't look like a viable solution.
Does anyone have any idea how to solve this problem?
Once you have the DOI, you can use the OpenCitations API to fetch data about publications that cite the article. Access the API (for example with the rjson package) via https://opencitations.net/index/coci/api/v1/citations/{DOI}. The citing field contains the DOIs of all publications that cite the paper. You can then use CrossRef's API to fetch further metadata about the citing papers, such as titles, journal, publication date and authors (via https://api.crossref.org/works/{DOI}).
Here is an example of OpenCitations' API with three citations (as of January 2021).
Here is possible code (using the same example as above):
opcit <- "https://opencitations.net/index/coci/api/v1/citations/10.1177/1369148118786043"
result <- rjson::fromJSON(file = opcit)
citing <- lapply(result, function(x){
x[['citing']]
})
# a vector with three DOIs, each of which cite the paper
citing <- unlist(citing)
Now we have the vector citing with three DOIs. You can then use rcrossref to find out basic information about the citing papers, such as:
paper <- rcrossref::cr_works(citing[1])
# find out the title of that paper
paper[["data"]][["title"]]
# output: "Exchange diplomacy: theory, policy and practice in the Fulbright program"
Since you have a vector of DOIs in citing, you could also use this approach:
citingdata <- rcrossref::cr_cn(citing)
The output, citingdata, should contain the metadata of the three citing papers, structured like these two examples:
[[1]]
[1] "#article{Wong_2020,\n\tdoi = {10.1017/s1752971920000196},\n\turl = {https://doi.org/10.1017%2Fs1752971920000196},\n\tyear = 2020,\n\tmonth = {jun},\n\tpublisher = {Cambridge University Press ({CUP})},\n\tpages = {1--31},\n\tauthor = {Seanon S. Wong},\n\ttitle = {One-upmanship and putdowns: the aggressive use of interaction rituals in face-to-face diplomacy},\n\tjournal = {International Theory}\n}"
[[2]]
[1] "#article{Aalberts_2020,\n\tdoi = {10.1080/21624887.2020.1792734},\n\turl = {https://doi.org/10.1080%2F21624887.2020.1792734},\n\tyear = 2020,\n\tmonth = {aug},\n\tpublisher = {Informa {UK} Limited},\n\tvolume = {8},\n\tnumber = {3},\n\tpages = {240--264},\n\tauthor = {Tanja Aalberts and Xymena Kurowska and Anna Leander and Maria Mälksoo and Charlotte Heath-Kelly and Luisa Lobato and Ted Svensson},\n\ttitle = {Rituals of world politics: on (visual) practices disordering things},\n\tjournal = {Critical Studies on Security}\n}"
I have the following code in R.
library(biomaRt)

snp_mart = useMart("ENSEMBL_MART_SNP", dataset = "hsapiens_snp")
snp_attributes = c("refsnp_id", "chr_name", "chrom_start",
                   "associated_gene", "ensembl_gene_stable_id", "minor_allele_freq")

getENSG <- function(rs, mart = snp_mart) {
  results <- getBM(attributes = snp_attributes,
                   filters = "snp_filter", values = rs, mart = mart)
  return(results)
}

getENSG("rs144864312")
refsnp_id chr_name chrom_start associated_gene ensembl_gene_stable_id
1 rs144864312 8 20254959 NA ENSG00000061337
minor_allele_freq
1 0.000399361
I have no background in biology so please forgive me if this is an obvious question. I was told that rs144864312 should match to the gene name "LZTS1".
I largely got the code above from the internet. My question is: where do I extract that gene name from? I understand that listAttributes(snp_mart) lists all possible outputs, but I don't see one that gives me the gene name above. How do I extract the gene name with biomaRt, given the rs number? Thank you in advance.
PS: I need to do this for something like 500 entries (not just one), which is why I wrote the simple function above to extract the gene name.
First, I think your question would draw more expert attention on https://www.biostars.org/
That said, now that you have the Ensembl ID (ENSG00000061337), you are just one step away from the gene name. If you google "how to convert Ensembl ID to gene name" you will find many approaches. Here are a few options, with a biomaRt sketch after the list:
use https://david.ncifcrf.gov/conversion.jsp
use BioMart on the Ensembl site: http://www.ensembl.org/biomart/martview/1cb4c119ae91cb34b2cd5280be0a1aac
download a table with both gene names and Ensembl IDs, and customize your query; you could download it from the UCSC Genome Browser, and here are some instructions: https://www.biostars.org/p/92939/
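Since you are already using biomaRt, you can also do the conversion there by querying the gene mart with the Ensembl IDs returned by the SNP mart. A minimal sketch; the mart and attribute names are the standard Ensembl ones, and you can confirm them with listAttributes():

library(biomaRt)

# use the Ensembl gene mart rather than the SNP mart
gene_mart <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")

# map Ensembl gene IDs to HGNC symbols; values can be a vector of 500 IDs
getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
      filters = "ensembl_gene_id",
      values = "ENSG00000061337",
      mart = gene_mart)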
Good luck
GEOquery is a great R package for retrieving and analyzing gene expression data stored in the NCBI Gene Expression Omnibus (GEO) database. I have used the following code, provided by GEO's GEO2R service (which automatically generates an initial R script for analyzing the data you want), to extract a GEO series of experiments:
gset <- getGEO("GSE10246", GSEMatrix =TRUE)
if (length(gset) > 1) idx <- grep("GPL1261", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]
gset # displays a summary of the data stored in this variable
The problem is that I cannot retrieve the sample titles from it. I found the function Columns(), which returns the sample names for GDS datasets, but it does not work on GSE series.
Please note I am not interested in the sample accession IDs (i.e. GSM258609, GSM258610, etc.); what I want is the human-readable sample titles.
Any ideas? Thanks
After
gset <- getGEO("GSE10246", GSEMatrix =TRUE)
gset is a simple list; its first element is an ExpressionSet, and the sample information is in the phenoData (accessible with pData), so maybe you're looking for
pData(gset[[1]])
See ?ExpressionSet for more.
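To get just the human-readable titles, pull the title column out of that data frame. A minimal sketch, assuming the series matrix phenoData carries a title column (GEO series matrices typically do):

library(GEOquery)
library(Biobase)  # provides pData()

gset <- getGEO("GSE10246", GSEMatrix = TRUE)

# one human-readable title per sample, in column order
sample_titles <- pData(gset[[1]])$title
head(sample_titles)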