Downloading DNA sequence data in R using entrez_fetch: cannot retrieve query

I'm trying to download DNA sequence data from NCBI using entrez_fetch. With the following code, I perform a search for the IDs of the sequences I need with entrez_search, and then I attempt to download the sequence data in FASTA format:
library(rentrez)
#Search for sequence ids
search <- entrez_search(db = "biosample",
term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]",
retmax = 9999, use_history = T)
search$ids
length(search$ids)
search$web_history
#Download sequence data
ecoli_fasta <- entrez_fetch(db = "nuccore",
web_history = search$web_history,
rettype = "fasta")
When I do this, I get the following error:
Error: HTTP failure: 400
Cannot+retrieve+query+from+history
I don't understand what this means and Googling hasn't led me to an answer.
I tried using a different package (ape) and the function read.GenBank to download the sequences as an alternative, but this method only managed to download about 1000 of the 12000 sequences I needed. I would like to use entrez_fetch if possible - does anyone have any insight for me?

This may be a starting point. The HTTP 400 "Cannot retrieve query from history" error most likely occurs because your web history was created by a search of the biosample database, while entrez_fetch is pointed at nuccore; a web history can only be replayed against the database it was created in. The example below searches nuccore directly and then pages through the results.
Also be aware that queries to genome databases can return massive amounts of data, so be sure to limit your queries.
Build search web history
library(rentrez)
search <- entrez_search(db="nuccore",
term="Escherichia coli[Organism]",
use_history = T)
Use web history to fetch data
cat(entrez_fetch(db="nuccore",
web_history=search$web_history, rettype="fasta", retstart=24, retmax=100))
>pdb|7QQ3|I Chain I, 23S ribosomal RNA
NGTTAAGCGACTAAGCGTACACGGTGGATGCCCTGGCAGTCAGAGGCGATGAAGGACGTGCTAATCTGCG
ATAAGCGTCGGTAAGGTGATATGAACCGTTATAACCGGCGATTTCCGAATGGGGAAACCCAGTGTGTTTC
GACACACTATCATTAACTGAATCCATAGGTTAATGAGGCGAACCGGGGGAACTGAAACATCTAAGTACCC
CGAGGAAAAGAAATCAACCGAGATTCCCCCAGTAGCGGCGAGCGAACGGGGAGCAGCCCAGAGCCTGAAT
CAGTGTGTGTGTTAGTGGAAGCGTCTGGAAAGGCGCGCGATACAGGGTGACAGCCCCGTACACAAAAATG
CACATGCTGTGAGCTCGATGAGTAGGGCGGGACACGTGGTATCCTGTCTGAATATGGGGGGACCATCCTC
CAAGGCTAAATACTCCTGACTGACCGATAGTGAACCAGTACCGTGAGGGAAAGGCGAAAAGAACCCCGGC
...
Use a loop to cycle through the sequences in batches, e.g.:
for(i in seq(1, 300, 100)){
cat(entrez_fetch(db="nuccore",
web_history=search$web_history, rettype="fasta", retstart=i, retmax=100))
}
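A hedged variation on the loop above (not part of the original answer): collect the fetched chunks in a character vector and write them out to a FASTA file instead of printing them to the console. The file name is illustrative.
#Accumulate batches of sequences rather than printing them
chunks <- character()
for(i in seq(1, 300, 100)){
chunks <- c(chunks, entrez_fetch(db="nuccore",
web_history=search$web_history, rettype="fasta", retstart=i, retmax=100))
}
#Write all fetched sequences to a single file
write(chunks, file="ecoli_sequences.fasta")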

Related

Downloading multiple sequences from GenBank in R and writing to separate FASTA files

I'm trying to download sequence data from GenBank using rentrez in R. I have performed my search using the following code:
search <- entrez_search(db = "biosample",
term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]",
retmax = 9999, use_history = T)
ids1 <- search$ids[1:300]
ecoli_fasta1 <- entrez_fetch(db = "nuccore",
id = ids1,
rettype = "fasta")
This gives me a large character object in R, which I can write to a FASTA file with (I think) all 300 sequences in it. What I want is to split the sequences so each is in its own FASTA file, but I am finding this tricky. Can anyone point me in the right direction?
Someone asked a similar question here in Python, but I can't figure out how to translate it to R and to what I'm working on.
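As a minimal sketch (no answer is reproduced here, and this assumes ecoli_fasta1 holds the fetched character string as above), one way is to split the result on the FASTA header markers and write each record to its own file:
#Split the single fetched string into individual FASTA records
seqs <- strsplit(ecoli_fasta1, "\n>")[[1]]
#Restore the ">" that strsplit removed from all but the first record
seqs <- ifelse(startsWith(seqs, ">"), seqs, paste0(">", seqs))
for(i in seq_along(seqs)){
#Use the first word of the header as a file name, sanitised for the file system
header <- strsplit(seqs[i], "\n")[[1]][1]
acc <- gsub("[^A-Za-z0-9._-]", "_", sub("^>(\\S+).*", "\\1", header))
writeLines(seqs[i], paste0(acc, ".fasta"))
}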

How do I download a large number of GenBank sequences using entrez_fetch in R?

I am trying to download sequence data from 1283 records in GenBank using rentrez. I'm using the following code, first to search for records fitting my criteria, then linking across databases, and finally fetching the sequence data:
# Search for sequence ids in biosample database
search <- entrez_search(db = "biosample",
term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]",
retmax = 9999, use_history = T)
search$ids
length(search$ids)
search$web_history
#Link IDs across databases: biosample to nuccore (nucleotide sequences)
nuc_links <- entrez_link(dbfrom ="biosample",
id = search$web_history,
db ="nuccore",
by_id = T)
nuc_links$links
#Fetch nucleotide sequences
fetch_ids1 <- entrez_fetch(db = "nucleotide",
id = nuc_links$links$biosample_nuccore,
rettype = "xml")
When I do this for one single record, I am able to get the data I need. When I try to scale it up to pull data for all the sequences I need using the web history of my search, it's not working. The nuc_links$links list is NULL, which tells me that entrez_link is not working the way I hoped. Can anyone show me where I'm going wrong?
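No answer is reproduced here, but one possible fix (a hedged sketch, not from the original thread) is to pass the web history through entrez_link's web_history argument rather than id, and to drop by_id, which expects a vector of individual IDs:
#Link every record in the web history from biosample to nuccore in one call
nuc_links <- entrez_link(dbfrom = "biosample",
web_history = search$web_history,
db = "nuccore")
#The element name follows the dbfrom_dbto convention
names(nuc_links$links)
head(nuc_links$links$biosample_nuccore)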

How do I summarise data from multiple files into batches of multiple files?

The file names my lab uses contain the monkey ID, system number, date, and the task the data are for, and each file contains a header row. We would like to check their progress daily, so we’re interested in seeing if we can automate the daily summaries as much as possible using R. The way we are doing it with Excel is clunky and takes too much time. For a daily summary, we’d need to know what task(s) they worked on, how many trials they did (on each task), and their percentage correct (on each task).
I have figured out the coding script to determine the daily summaries but I’m having issues as right now the script only runs with one monkey and one data file. Ideally, I would like to have it so that I could effectively point the script at a folder with a bunch of data files from different monkeys and generate daily summaries. Here are some examples of the summaries I have generated:
7-255 Summary table:
n_trials correct incorrect task_name
7 42.85714 57.14286 SHAPE2
H033 Summary table:
n_trials correct incorrect task_name
177 44.0678 55.9322 MTSseq
I have attached my coding script below:
library(tidyverse)
#library(readr)
file <- "/Users/siddharthsatishchandran/R project/R project status 1/Data/H033.csv"
data_trials <- read_csv(file)
head(data_trials)
summary(data_trials)
n_trials <- length(data_trials$trial)
correct <- mean(data_trials$correct)*100
incorrect <- 100 - correct
df_trials_correct <- data.frame(n_trials = n_trials,
correct = correct,
incorrect = incorrect,
task_name = unique(data_trials$task_name))
df_trials_correct
This might be what you are looking for:
library(readr) #for read_csv()
path <- "/Users/siddharthsatishchandran/R project/R project status 1/Data"
file_list <- list.files(
  path,
  pattern = "csv$"
)
summary_tables <- list()
for (file in file_list) {
  data_trials <- read_csv(file.path(path, file))
  #Compute the percentage correct once so it can be reused below
  correct <- mean(data_trials$correct)*100
  summary_tables[[file]] <- data.frame(
    n_trials = length(data_trials$trial),
    correct = correct,
    incorrect = 100 - correct,
    task_name = unique(data_trials$task_name)
  )
}
Now you get a list of data.frames, each containing your desired information.
This could be "flattened" into a single data.frame using bind_rows:
bind_rows(summary_tables, .id = "monkey_id")
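As an optional follow-up (not part of the original answer), the combined summary can be saved as a single daily CSV:
#The .id column holds the source file name for each row
daily_summary <- bind_rows(summary_tables, .id = "monkey_id")
write_csv(daily_summary, file.path(path, "daily_summary.csv"))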

Random sample of tweets of a time period using TwitteR

I need as many tweets as possible for a given hashtag over a two-day time period. The problem is there are too many of them (I guess ~1 million) to extract using just a time period specification:
It would definitely take a lot of time if I specify something like retryOnRateLimit = 120
I'll get blocked soon if I don't, and I only get tweets for about half a day
The obvious answer for me is to extract a random sample by given parameters but I can't figure out how to do it.
My code is here:
a = searchTwitteR('hashtag', since="2017-01-13", n = 1000000, resultType = "mixed", retryOnRateLimit = 10)
The last attempt stopped at 17.5 thousand tweets, which covers only the past 12 hours.
P.S. It may be useful not to extract retweets, but I don't know how to specify that within searchTwitteR().
The twitteR package is deprecated in favor of the rtweet package. If I were you, I would use rtweet to get every last one of those tweets.
Technically, you could request 1 million straight away using search_tweets() from the rtweet package. I recommend breaking it up into pieces, though, since collecting 200,000 tweets will take several hours.
library(rtweet)
maxid <- NULL
rt <- vector("list", 5)
for (i in seq_len(5)) {
rt[[i]] <- search_tweets("hashtag", n = 200000,
retryonratelimit = TRUE,
max_id = maxid)
maxid <- rt[[i]]$status_id[nrow(rt[[i]])]
}
## extract users data and combine into data frame
users <- do.call("rbind", lapply(rt, users_data))
## collapse tweets data into data frame
rt <- do.call("rbind", rt)
## add users data as attribute
attr(rt, "users") <- users
## preview data
head(rt)
## preview users data (rtweet exports magrittr's `%>%` pipe operator)
users_data(rt) %>% head()
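Regarding the P.S. about retweets: this is not from the original answer, but search_tweets() also takes an include_rts argument, so retweets can be dropped at search time (argument names may differ between rtweet versions):
## exclude retweets from the search results
rt_original_only <- search_tweets("hashtag", n = 1000, include_rts = FALSE)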

Entrez and RISmed library for pubmed data mining

I'm using the RISmed library to query my gene or protein of interest, and the output basically comes with PubMed IDs, but most of the time it also contains non-specific hits that are not of interest. As I can only see the PubMed IDs, I have to manually search the returned IDs in NCBI to check whether each paper is of interest.
Question: Is there a way to return the abstract or a summary of the paper along with its PubMed ID, implemented in R?
If anyone can help it would be really great.
Using the example from the manual, we need the EUtilsGet function.
library(RISmed)
search_topic <- 'copd'
search_query <- EUtilsSummary(search_topic, retmax = 10,
mindate = 2012, maxdate = 2012)
summary(search_query)
# see the ids of our returned query
QueryId(search_query)
# get actual data from PubMed
records <- EUtilsGet(search_query)
class(records)
# store it
pubmed_data <- data.frame('Title' = ArticleTitle(records),
'Abstract' = AbstractText(records))
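As a small follow-up (not from the original answer), the abstracts can be screened directly in R, for example by truncating them for a quick first pass:
# shorten abstracts for quick manual screening
pubmed_data$ShortAbstract <- substr(pubmed_data$Abstract, 1, 300)
head(pubmed_data[, c("Title", "ShortAbstract")])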
