Problems extracting metadata from NCBI in R

I am trying to extract metadata from GenBank using the R package "rentrez" and the example I found here: https://ajrominger.github.io/2018/05/21/gettingDNA.html. Specifically, for a particular group of organisms, I search for all records that have geographical coordinates and then want to extract the accession number, taxon, sequenced locus, country, lat_long, and collection date. As output, I want a csv file with the data for each record in a separate row. The code below seems to do the job, but at some point the rows get muddled, with data from different records spilling into neighbouring rows. For example, of the 157 records that rentrez retrieves from NCBI, 109 look the way I want in the file, but the rest are a total mess. I would greatly appreciate any advice on how to fix this, because I am a total newbie with R and figuring out each step takes a lot of time.
setwd ("C:/R-Works")
library('XML')
library('rentrez')
argasid <- entrez_search(db="nuccore", term = "Argasidae[Organism] AND [lat]", use_history=TRUE, retmax=15000)
x <- entrez_fetch (db="nuccore", id=argasid$ids, rettype= "native", retmode="xml", parse=TRUE)
x <-xmlToList(x)
cleanEntrez <- function(x) {
basePath <- 'Seq-entry_seq.Bioseq'
c(
genbank = as.character(x[paste(basePath,
'Bioseq_id', 'Seq-id', 'Seq-id_genbank',
'Textseq-id', 'Textseq-id_accession',
sep = '.')]),
taxon = as.character(x[paste(basePath,
'Bioseq_descr', 'Seq-descr', 'Seqdesc',
'Seqdesc_source', 'BioSource', 'BioSource_org',
'Org-ref', 'Org-ref_taxname',
sep = '.')]),
bseqdesc_title = as.character(x[paste(basePath,
'Bioseq_descr', 'Seq-descr', 'Seqdesc',
'Seqdesc_title',
sep = '.')]),
lat_lon = as.character(x[grep('lat-lon', x) + 1]),
geo_description = as.character(x[grep('country', x) + 1]),
coll_date = as.character(x[grep('collection-date', x) + 1])
)
}
getGenbankMeta <- function(ids) {
allRec <- entrez_fetch(db = 'nuccore', id = ids,
rettype = 'native', retmode = 'xml',
parsed = TRUE)
allRec <- xmlToList(allRec)[[1]]
o <- lapply(allRec, function(x) {
cleanEntrez(unlist(x))
})
temp <- array(unlist(o), dim = c(length(o[[1]]), length(ids)))
seqVec <- temp[nrow(temp), ]
seqDF <- as.data.frame(t(temp[-nrow(temp), ]))
names(seqDF) <- names(o[[1]])[-nrow(temp)]
return(list(seq = seqVec, data = seqDF))
}
write.csv(getGenbankMeta(argasid$ids), 'argasid_georef.csv')
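One likely cause of the muddled rows is that the grep()-based fields can return zero or several matches per record, so the per-record vectors differ in length and array(unlist(o), ...) no longer lines up. A minimal sketch of one way to guard against that, assuming each record is a flattened character vector in which a subtype name such as 'country' is immediately followed by its value. The records, accession strings, and the helper value_after() below are made up for illustration and are not real GenBank data or part of rentrez:

```r
# Return the value that follows the first match of `key`, or NA if absent
value_after <- function(x, key) {
  hits <- grep(key, x, fixed = TRUE)
  if (length(hits) == 0) return(NA_character_)
  as.character(x[hits[1] + 1])
}

# Hypothetical flattened records (stand-ins for unlist(x) on each entry)
rec1 <- c("ACC0001", "lat-lon", "10.5 N 20.1 E", "country", "Kenya",
          "collection-date", "2010")
rec2 <- c("ACC0002", "lat-lon", "5.0 S 30.2 E", "country", "Tanzania")
rec3 <- c("ACC0003", "country", "Uganda", "lat-lon", "1.2 N 35.0 E",
          "country", "Uganda: Kampala", "collection-date", "2012")

# Every record yields exactly one value per field, so rows cannot shift
rows <- lapply(list(rec1, rec2, rec3), function(x) {
  c(genbank = x[1],
    lat_lon = value_after(x, "lat-lon"),
    geo_description = value_after(x, "country"),
    coll_date = value_after(x, "collection-date"))
})

meta <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
```

Here the record without a collection date gets NA instead of shortening the vector, and the record with two 'country' matches keeps only the first. A data frame built this way can be passed to write.csv() directly.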

Related

Why is my loop in R skipping the first element from the results?

My code takes the two destination airports (JFK and then Las Vegas) and passes each one through a URL in a for loop to return flight information, which I'm trying to add to a data frame. However, the result only includes the data from the last element, Las Vegas. Should I use something other than a list for this?
library (httr)
library (jsonlite)
des <- c("JFK", "LAS")
flights = NULL
flights = list()
for (x in 1 : length(des))
{
url <- paste0("https://travelpayouts-travelpayouts-flight-data-v1.p.rapidapi.com/v1/prices/direct/?destination=", des[x], "&origin=BOS")
r<-GET(url, add_headers("X-RapidAPI-Host" = "travelpayouts-travelpayouts-flight-data-v1.p.rapidapi.com",
"X-RapidAPI-Key" = " MY KEY HERE ",
"X-Access-Token" = " MY TOKEN HERE"))
jsonResponseParsed<-content(r,as="text")
f <- fromJSON(jsonResponseParsed, flatten = TRUE)
flights[[x]] <- data.frame(f$data)
}
data = do.call(rbind, flights)
#price will be in rubles will need to convert to USD
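One way to check which destination each row came from is to tag every per-destination data frame before binding them. A minimal sketch, in which the hypothetical fetch_prices() stands in for the GET/fromJSON calls. The shape of the real API response is not shown in the question, so the columns here are assumptions:

```r
# Hypothetical stand-in for the API call; returns made-up prices
fetch_prices <- function(dest) {
  data.frame(price = c(100, 120), airline = c("XX", "YY"),
             stringsAsFactors = FALSE)
}

des <- c("JFK", "LAS")
flights <- vector("list", length(des))

for (i in seq_along(des)) {
  one <- fetch_prices(des[i])
  one$destination <- des[i]   # record which query produced these rows
  flights[[i]] <- one
}

all_flights <- do.call(rbind, flights)
```

If a destination is missing from the combined result, inspecting flights[[i]] for that index shows whether the request itself returned no rows.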

Problems with binding columns from two data frames using a for loop in R

I have 7 pairs of asc files loaded into R, asc[i] and wasc[i], where i runs from 1 to 7. I need to combine each wasc[i] with asc[i][[1]] (just the first column of asc[i] with the whole wasc[i] file).
This should be repeated for every pair of asc and wasc files.
The code keeps giving me blank data frames, so I don't know why this doesn't work. The naming is correct, yet the code is not recognizing that the asc[i] and wasc[i] correlate with previously loaded files.
Any help will be greatly appreciated.
# These data frames will reproduce my issue
asc1 <- data.frame(x= c(rep("A.tif", 20)), y = 1:20)
wasc1 <- data.frame(x= c(rep("B.tif", 20)), y = c(rep("Imager",20)))
asc2 <- data.frame(x= c(rep("A.tif", 20)), y = 1:20)
wasc2 <- data.frame(x= c(rep("B.tif", 20)), y = c(rep("Imager",20)))
asc3 <- data.frame(x= c(rep("A.tif", 20)), y = 1:20)
wasc3 <- data.frame(x= c(rep("B.tif", 20)), y = c(rep("Imager",20)))
for (i in 1:3) {
d <- paste("asc", i, sep ="")
f <- paste("wasc", i, sep ="")
full_wing <- as.character(paste("full_wing", i, sep = ""))
assign(full_wing,cbind(d[[1]], f))
}
# Output of full_wing1 data frame
dput(full_wing1)
structure(c("asc1", "wasc1"), .Dim = 1:2, .Dimnames = list(NULL,
c("", "f")))
Additional Information:
asc files are 19 columns long
wasc files are 13 columns long
I only want to combine column 1 from the asc file with the entire wasc file, thus cutting out the remaining 18 columns of the asc file.
# put data in a list
asc = mget(ls(pattern = "^asc"))
wasc = mget(ls(pattern = "^wasc"))
full_wing = Map(f = function(w, a) cbind(w, a[[1]]), w = wasc, a = asc)
Map is a nice shortcut for iterating in parallel over multiple arguments, and it returns a list. You can access the individual elements with, e.g., full_wing[[1]], full_wing[[3]], etc. Because Map is just a shortcut, the above code is basically equivalent to the for loop below:
results = list()
for (i in seq_along(asc)) {
results[[i]] = cbind(wasc[[i]], asc[[i]][[1]])
}
I use mget to put the data in a list because in your example you already have objects like asc1, asc2, etc. A much better way to go is to never create those variables in the first place, instead read the files directly into a list, something like this:
asc_paths = list.files(pattern = "^asc")
asc = lapply(asc_paths, read.table)
You can see a lot more explanation of this at How to make a list of data frames?
If you only ever need one column of the asc files, another way to simplify this would be to only read in the needed column, see Only read limited number of columns for some recommendations there.
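As a small self-contained sketch of the Map approach, using cut-down versions of the example data frames. Building the object names with paste0() instead of ls(pattern = ...) is a swapped-in choice here, so that a list named asc created earlier is not picked up by the pattern on a second run. The column name asc_col1 is made up for illustration:

```r
# Cut-down versions of the example data frames
asc1  <- data.frame(x = rep("A.tif", 3), y = 1:3)
wasc1 <- data.frame(x = rep("B.tif", 3), y = rep("Imager", 3))
asc2  <- data.frame(x = rep("A.tif", 3), y = 1:3)
wasc2 <- data.frame(x = rep("B.tif", 3), y = rep("Imager", 3))

# Collect the objects by explicit name
asc  <- mget(paste0("asc", 1:2))
wasc <- mget(paste0("wasc", 1:2))

# Bind the first asc column onto each matching wasc data frame
full_wing <- Map(function(w, a) cbind(w, asc_col1 = a[[1]]), wasc, asc)
```

The result is a list with one combined data frame per pair, named after the wasc objects.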

Using a loop to apply gmapsdistance to a list in R

I am trying to use the gmapsdistance package in R to calculate the journey time by public transport between a list of postcodes (origin) and a single destination postcode.
The output for a single query is:
$Time
[1] 5352
$Distance
[1] 34289
$Status
[1] "OK"
I actually have 2.5k postcodes to use but whilst I troubleshoot it I have set the iterations to 10. london1 is a dataframe containing a single column with 2500 postcodes in 2500 rows.
This is my attempt so far;
results <- for(i in 1:10) {
gmapsdistance::set.api.key("xxxxxx")
gmapsdistance::gmapsdistance(origin = "london1[i]"
destination = "WC1E 6BT"
mode = "transit"
dep_date = "2017-04-18"
dep_time = "09:00:00")}
When I run this loop I get
results <- for(i in 1:10) {
+ gmapsdistance::set.api.key("xxxxxx")
+ gmapsdistance::gmapsdistance(origin = "london1[i]"
+ destination = "WC1E 6BT"
Error: unexpected symbol in:
" gmapsdistance::gmapsdistance(origin = "london1[i]"
destination"
mode = "transit"
dep_date = "2017-04-18"
dep_time = "09:00:00")}
Error: unexpected ')' in " dep_time = "09:00:00")"
My questions are:
1)How can I fix this?
2) How do I need to format this, so the output is a dataframe or matrix containing the origin postcode and journey time
Thanks
There are a few things going on here:
"london1[i]" needs to be london1[i, 1] (no quotes, and indexing the first column)
you need to separate your arguments with commas
I get an error when using, e.g., "WC1E 6BT"; I found it necessary to replace the space with a dash, like "WC1E-6BT"
the loop needs to explicitly assign values to elements of results
So your code would look something like:
library(gmapsdistance)
## some example data
london1 <- data.frame(postCode = c('WC1E-7HJ', 'WC1E-6HX', 'WC1E-7HY'))
## make an empty list to be filled in
results <- vector('list', 3)
for(i in 1:3) {
set.api.key("xxxxxx")
## fill in your results list
results[[i]] <- gmapsdistance(origin = london1[i, 1],
destination = "WC1E-6BT",
mode = "transit",
dep_date = "2017-04-18",
dep_time = "09:00:00")
}
It turns out you don't need a loop (and probably shouldn't use one) when using gmapsdistance (see the help doc). The output from multiple inputs also makes it quick to format the result as a data.frame:
set.api.key("xxxxxx")
temp1 <- gmapsdistance(origin = london1[, 1],
destination = "WC1E-6BT",
mode = "transit",
dep_date = "2017-04-18",
dep_time = "09:00:00",
combinations = "all")
The above returns a list of data.frame objects, one each for Time, Distance and Status. You can then easily make those into a data.frame containing everything you might want:
res <- data.frame(origin = london1[, 1],
destination = 'WC1E-6BT',
do.call(data.frame, lapply(temp1, function(x) x[, 2])))
lapply(temp1, function(x) x[, 2]) extracts the needed column from each data.frame in the list, and do.call puts them back together as columns in a new data.frame object.
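To see that reshaping step without calling the API, here is a minimal sketch in which temp1 is a hand-made stand-in for the gmapsdistance output: a list of three data frames whose second column holds the value of interest. The numbers and column names are made up for illustration:

```r
london1 <- data.frame(postCode = c("WC1E-7HJ", "WC1E-6HX"),
                      stringsAsFactors = FALSE)

# Hypothetical stand-in for the list returned by gmapsdistance()
temp1 <- list(
  Time     = data.frame(or = london1$postCode, value = c(5352, 4100)),
  Distance = data.frame(or = london1$postCode, value = c(34289, 29000)),
  Status   = data.frame(or = london1$postCode, value = c("OK", "OK"),
                        stringsAsFactors = FALSE)
)

# Pull column 2 from each element; the list names become column names
res <- data.frame(origin = london1[, 1],
                  destination = "WC1E-6BT",
                  do.call(data.frame, lapply(temp1, function(x) x[, 2])))
```

Each row of res then holds one origin postcode with its Time, Distance and Status.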

Appending to a text file in a loop

I have a data frame called MetricsInput which looks like this:
ID ExtractName Dimensions Metrics First_Ind
124 extract1.txt ga:date gs:sessions 1
128 extract1.txt ga:date gs:sessions 0
134 extract1.txt ga:date gs:sessions 0
124 extract2.txt ga:browser ga:users 1
128 extract2.txt ga:browser ga:users 0
134 extract2.txt ga:browser ga:users 0
I'm trying to use the above data frame in a loop to run a series of queries, which ultimately will create 2 text files, extract1.txt and extract2.txt. The reason I have the first_ind field is I only want to append the column headings on the first run through each unique file.
Here's my loop -- the issue I'm having is that the data for each ID is not appending -- I seem to be overwriting my results, not appending. Where did I go wrong?
for(i in seq(from=1, to=nrow(MetricsInput), by=1)){
id <- MetricsInput[i,1]
myresults <- ga$getData(id,batch = TRUE, start.date="2013-12-01", end.date="2014-01-01", metrics = MetricsInput[i,4], dimensions = MetricsInput[i,3])
appendcolheads <- ifelse(MetricsInput[i,5]==1, TRUE, FALSE)
write.table(myresults, file=MetricsInput$ExtractName[i], append=TRUE, row.names = FALSE, col.names = appendcolheads, sep="\t")
}
Although you can get this code to work, it doesn't look like the right approach at all. As @MrFlick said in the comments, it's very hard to help without being able to reproduce your problem, but I would do something along the following lines:
GetData <- function(id, metric, dim) {
d <- ga$getData(id, batch = TRUE, start.date="2013-12-01",
end.date="2014-01-01", metrics = metric, dimensions = dim)
d$id <- id
d
}
myresults <- Map(GetData,
id = MetricsInput$ID,
metric = MetricsInput$Metrics,
dim = MetricsInput$Dimensions)
This will give you a list whose ith component is the output of the ith iteration in your for loop. So now you have to split it in two to write it in the files you wanted
myresultslist <- split(myresults, MetricsInput$ExtractName)
myresultslist <- lapply(myresultslist, do.call, what = rbind)
Map(write.table, x = myresultslist, file = names(myresultslist),
row.names = FALSE, sep = "\t")
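The split-then-write step can be tried without the ga object. A minimal sketch, assuming made-up per-ID results in place of the real query output and writing to a temporary directory rather than the working directory:

```r
MetricsInput <- data.frame(
  ID = c(124, 128, 124, 128),
  ExtractName = c("extract1.txt", "extract1.txt",
                  "extract2.txt", "extract2.txt"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the list of query results (one per row)
myresults <- lapply(seq_len(nrow(MetricsInput)), function(i) {
  data.frame(id = MetricsInput$ID[i], value = i)
})

# Group the results by target file, then stack each group into one table
by_file <- split(myresults, MetricsInput$ExtractName)
by_file <- lapply(by_file, function(parts) do.call(rbind, parts))

# Write each table once, so headers appear once and nothing is overwritten
out_paths <- file.path(tempdir(), names(by_file))
invisible(Map(function(df, path) {
  write.table(df, file = path, row.names = FALSE, sep = "\t")
}, by_file, out_paths))
```

Because each file is written in a single call, there is no need for append = TRUE or a first-row indicator.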
Why don't you create a data frame in the loop and then write it to the text file?
myresults <- data.frame()
for (i in yourloop) {
#your code here
id <- MetricsInput[i,1]
temp <- ga$getData(id,batch = TRUE, start.date="2013-12-01", end.date="2014-01-01", metrics = MetricsInput[i,4], dimensions = MetricsInput[i,3])
myresults <- rbind(myresults, temp)
}
write.csv(myresults, ...)

Need help copying the input from a function as the input for another function in R

I need help determining how I can use the input for the function below as an input for another r file.
Hotel <- function(hotel) {
require(data.table)
dat <- read.csv("demo.csv", header = TRUE)
dat$Date <- as.Date(paste0(format(strptime(as.character(dat$Date),
"%m/%d/%y"),
"%Y/%m"),"/1"))
library(data.table)
table <- setDT(dat)[, list(Revenue = sum(Revenues),
Hours = sum(Hours),
Index = mean(Index)),
by = list(Hotel, Date)]
answer <- na.omit(table[table$Hotel == hotel, ])
if (nrow(answer) == 0) {
stop("invalid hotel")
}
return(answer)
}
I would input Hotel("Hotel Name")
Here's the other R file using the Hotel name I inputted above.
#Reads the dataframe from the Hotel Function
star <- (Hotel("Hotel Name"))
#Calculates the Revpolu and Index
Revpolu <- star$Revenue / star$Hours
Index <- star$Index
png(filename = "~/Desktop/result.png", width = 480, height= 480)
plot(Index, Revpolu, main = "Hotel Name", col = "green", pch = 20)
testing <- cor.test(Index, Revpolu)
write.table(testing[["p.value"]], file = "output.csv", sep = ";", row.names = FALSE, col.names = FALSE)
dev.off()
I would like for this part to become automated instead of having to copy and paste from the first file an input and then storing it as a variable. Or if it's easier, then make all of this just one function.
Also, instead of having to input one hotel name for the function, is it possible to make the first file read all the hotel names, if they are identified as row names in the .csv file, and have that input read in the second file?
Since your example is not reproducible and your code has some bugs (using the column "Rooms" which is not produced by your function), I can't give you a tested answer, but here's how you can structure your code to produce the statistics you want for all hotels without having to copy and paste hotel names:
library(data.table)
# Use fread instead of read.csv, it's faster
dat <- fread("demo.csv", header = TRUE)
dat[, Date := as.Date(paste0(format(strptime(as.character(Date), "%m/%d/%y"), "%Y/%m"),"/1"))]
table <- dat[, list(
Revenue = sum(Revenues),
Hours = sum(Hours),
Index = mean(Index)
), by = list(Hotel, Date)]
# cor.test() drops incomplete pairs on its own, so na.omit may not be
# needed, but I kept it here to keep the result similar.
table <- na.omit(table)
# Calculate Revpolu inside the data.table
table[, Revpolu := Revenue / Hours]
# You can compute a p-value for all hotels using a group by
testing <- table[, list(p.value = cor.test(Index, Revpolu)[["p.value"]]), by=Hotel]
write.table(testing, file = "output.csv", sep = ";", row.names = FALSE, col.names = FALSE)
# You can get individual plots for each hotel with a for loop
hotels <- unique(table$Hotel)
for (h in hotels) {
# Use the hotel name in the file name so each plot gets its own file
png(filename = paste0("~/Desktop/result_", h, ".png"), width = 480, height= 480)
plot(table[Hotel == h, Index], table[Hotel == h, Revpolu], main = h, col = "green", pch = 20)
dev.off()
}
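For readers without data.table, the per-hotel p-value step can also be sketched in base R with split() and sapply(). This is a swapped-in technique, not the approach used in the answer above, and the monthly figures below are made up for illustration:

```r
# Hypothetical monthly summary, one row per hotel and month
monthly <- data.frame(
  Hotel   = rep(c("Hotel A", "Hotel B"), each = 5),
  Revenue = c(100, 220, 310, 390, 520, 500, 410, 290, 210, 90),
  Hours   = rep(10, 10),
  Index   = rep(1:5, 2),
  stringsAsFactors = FALSE
)
monthly$Revpolu <- monthly$Revenue / monthly$Hours

# One correlation test per hotel; keep only the p-value
p_values <- sapply(split(monthly, monthly$Hotel), function(d) {
  cor.test(d$Index, d$Revpolu)$p.value
})

testing <- data.frame(Hotel = names(p_values), p.value = unname(p_values),
                      stringsAsFactors = FALSE)
```

The testing data frame has one row per hotel and can be written out with write.table() as before.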
