I am trying to extract some metadata from GenBank using the R package "rentrez", following the example at https://ajrominger.github.io/2018/05/21/gettingDNA.html. Specifically, for a particular group of organisms, I search for all records that have geographical coordinates and then want to extract the accession number, taxon, sequenced locus, country, lat_long, and collection date. As output, I want a csv file with the data for each record in a separate row. The code below seems to do the job, but at some point the rows get muddled, with data from different records spilling into neighbouring rows. For example, of the 157 records that rentrez retrieves from NCBI, 109 look the way I want in the file, but the rest are a total mess. I would greatly appreciate any advice on how to fix this, because I am a total newbie with R and figuring out each step takes a lot of time.
setwd ("C:/R-Works")
library('XML')
library('rentrez')
argasid <- entrez_search(db="nuccore", term = "Argasidae[Organism] AND [lat]", use_history=TRUE, retmax=15000)
x <- entrez_fetch(db="nuccore", id=argasid$ids, rettype="native", retmode="xml", parsed=TRUE)
x <-xmlToList(x)
cleanEntrez <- function(x) {
basePath <- 'Seq-entry_seq.Bioseq'
c(
genbank = as.character(x[paste(basePath,
'Bioseq_id', 'Seq-id', 'Seq-id_genbank',
'Textseq-id', 'Textseq-id_accession',
sep = '.')]),
taxon = as.character(x[paste(basePath,
'Bioseq_descr', 'Seq-descr', 'Seqdesc',
'Seqdesc_source', 'BioSource', 'BioSource_org',
'Org-ref', 'Org-ref_taxname',
sep = '.')]),
bseqdesc_title = as.character(x[paste(basePath,
'Bioseq_descr', 'Seq-descr', 'Seqdesc',
'Seqdesc_title',
sep = '.')]),
lat_lon = as.character(x[grep('lat-lon', x) + 1]),
geo_description = as.character(x[grep('country', x) + 1]),
coll_date = as.character(x[grep('collection-date', x) + 1])
)
}
getGenbankMeta <- function(ids) {
allRec <- entrez_fetch(db = 'nuccore', id = ids,
rettype = 'native', retmode = 'xml',
parsed = TRUE)
allRec <- xmlToList(allRec)[[1]]
o <- lapply(allRec, function(x) {
cleanEntrez(unlist(x))
})
temp <- array(unlist(o), dim = c(length(o[[1]]), length(ids)))
seqVec <- temp[nrow(temp), ]
seqDF <- as.data.frame(t(temp[-nrow(temp), ]))
names(seqDF) <- names(o[[1]])[-nrow(temp)]
return(list(seq = seqVec, data = seqDF))
}
write.csv(getGenbankMeta(argasid$ids), 'argasid_georef.csv')
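One possible cause of the muddled rows, offered as a guess rather than a diagnosis: `array(unlist(o), dim = ...)` assumes that every record yields exactly the same number of fields. In `cleanEntrez`, a `grep()` with no match contributes nothing and a `grep()` with several matches contributes several values, so some records produce shorter or longer vectors and every later value shifts. Below is a minimal base R sketch with made-up records; the helper `pad_fields` and the sample values are hypothetical, not part of rentrez.

```r
# Hypothetical per-record results: the second record has no coll_date.
recs <- list(
  c(genbank = "AB000001", taxon = "Argas sp.", coll_date = "2010"),
  c(genbank = "AB000002", taxon = "Ornithodoros sp.")
)

# Flattening ragged vectors into a fixed-size array shifts values:
# 5 values cannot fill 3 x 2 cells, so the data is recycled.
muddled <- suppressWarnings(array(unlist(recs), dim = c(3, 2)))
print(muddled)

fields <- c("genbank", "taxon", "coll_date")

# Pad each record to the same named fields, using NA where one is missing.
pad_fields <- function(rec, fields) {
  out <- rec[fields]      # names that are absent come back as NA
  names(out) <- fields
  out
}

# One row per record, one column per field.
meta <- as.data.frame(
  do.call(rbind, lapply(recs, pad_fields, fields = fields)),
  stringsAsFactors = FALSE
)
print(meta)
```

With real records you would also need to decide what to do when a field matches more than once, for example keeping only the first match, before padding.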
I am new to web scraping. I am trying to scrape a table with the following code. But I am unable to get it. The source of data is
https://www.investing.com/stock-screener/?sp=country::6|sector::a|industry::a|equityType::a|exchange::a%3Ceq_market_cap;1
url <- "https://www.investing.com/stock-screener/?sp=country::6|sector::a|industry::a|equityType::a|exchange::a%3Ceq_market_cap;1"
urlYAnalysis <- paste(url, sep = "")
webpage <- readLines(urlYAnalysis)
html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
tableNodes <- getNodeSet(html, "//table")
Tab <- readHTMLTable(tableNodes[[1]])
I copied this approach from the linked question (Web scraping of key stats in Yahoo! Finance with R), where it is applied to Yahoo Finance data.
In my opinion, it should be table 12, i.e. readHTMLTable(tableNodes[[12]]). But when I use tableNodes[[12]], it always gives me an error:
Error in do.call(data.frame, c(x, alis)) :
variable names are limited to 10000 bytes
Please suggest a way to extract the table and combine it with the data from the other tabs as well (Fundamental, Technical and Performance).
This data is returned dynamically as JSON. In R (which behaves differently from Python's requests here) you get HTML back, from which you can extract a given page's results as JSON. Each page includes the info for all the tabs and 50 records. The first page gives you the total record count, so you can calculate the total number of pages to loop over to get all results. You could combine them into a final dataframe in a loop over the total number of pages, altering the pn param of the XHR POST body to the appropriate page number in each new POST request. There are two required headers.
It is probably a good idea to write a function that accepts a page number and returns that page's JSON as a dataframe, then apply it with a tidyverse package to handle the loop and combine the results into a final dataframe.
library(httr)
library(jsonlite)
library(magrittr)
library(rvest)
library(stringr)
headers = c(
'User-Agent' = 'Mozilla/5.0',
'X-Requested-With' = 'XMLHttpRequest'
)
data = list(
'country[]' = '6',
'sector' = '7,5,12,3,8,9,1,6,2,4,10,11',
'industry' = '81,56,59,41,68,67,88,51,72,47,12,8,50,2,71,9,69,45,46,13,94,102,95,58,100,101,87,31,6,38,79,30,77,28,5,60,18,26,44,35,53,48,49,55,78,7,86,10,1,34,3,11,62,16,24,20,54,33,83,29,76,37,90,85,82,22,14,17,19,43,89,96,57,84,93,27,74,97,4,73,36,42,98,65,70,40,99,39,92,75,66,63,21,25,64,61,32,91,52,23,15,80',
'equityType' = 'ORD,DRC,Preferred,Unit,ClosedEnd,REIT,ELKS,OpenEnd,Right,ParticipationShare,CapitalSecurity,PerpetualCapitalSecurity,GuaranteeCertificate,IGC,Warrant,SeniorNote,Debenture,ETF,ADR,ETC,ETN',
'exchange[]' = '109',
'exchange[]' = '127',
'exchange[]' = '51',
'exchange[]' = '108',
'pn' = '1', # this is page number and should be altered in a loop over all pages. 50 results per page i.e. rows
'order[col]' = 'eq_market_cap',
'order[dir]' = 'd'
)
r <- httr::POST(url = 'https://www.investing.com/stock-screener/Service/SearchStocks', httr::add_headers(.headers=headers), body = data)
s <- r %>% read_html() %>% html_node('p') %>% html_text()
page1_data <- jsonlite::fromJSON(str_match(s, '(\\[.*\\])')[1,2])
total_rows <- str_match(s, '"totalCount":(\\d+),')[1,2] %>% as.integer()
num_pages <- ceiling(total_rows/50)
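The paging arithmetic can be checked offline. Here is a minimal base R sketch, assuming a made-up response string; the `hits` field name and the count of 137 are invented for illustration, and only `totalCount` and the 50-per-page size come from the code above.

```r
# Made-up stand-in for the text returned by the service (structure assumed).
s <- '{"hits":[{"name":"A"},{"name":"B"}],"totalCount":137,"pageNumber":1}'

# Pull the total record count out of the text with a base R regex.
m <- regmatches(s, regexec('"totalCount":([0-9]+)', s))[[1]]
total_rows <- as.integer(m[2])

# 50 results per page, so round up to cover a partial last page.
per_page <- 50L
num_pages <- ceiling(total_rows / per_page)

# Pages still to request after the first one; empty if there is only one page.
remaining_pages <- seq_len(num_pages)[-1]
print(num_pages)
print(remaining_pages)
```

Using `seq_len(num_pages)[-1]` rather than `2:num_pages` avoids the backwards sequence `2:1` when everything fits on a single page.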
My current attempt at combining, which I would welcome feedback on, is below. It keeps all the returned columns for all pages, so I have to handle missing columns and differing column order, as well as one column being a data.frame. As the number of returned columns is far greater than the number visible on the page, you could simply subset the returned columns with a mask for just the columns present in the tabs.
library(httr)
library(jsonlite)
library(magrittr)
library(rvest)
library(stringr)
library(tidyverse)
library(data.table)
headers = c(
'User-Agent' = 'Mozilla/5.0',
'X-Requested-With' = 'XMLHttpRequest'
)
data = list(
'country[]' = '6',
'sector' = '7,5,12,3,8,9,1,6,2,4,10,11',
'industry' = '81,56,59,41,68,67,88,51,72,47,12,8,50,2,71,9,69,45,46,13,94,102,95,58,100,101,87,31,6,38,79,30,77,28,5,60,18,26,44,35,53,48,49,55,78,7,86,10,1,34,3,11,62,16,24,20,54,33,83,29,76,37,90,85,82,22,14,17,19,43,89,96,57,84,93,27,74,97,4,73,36,42,98,65,70,40,99,39,92,75,66,63,21,25,64,61,32,91,52,23,15,80',
'equityType' = 'ORD,DRC,Preferred,Unit,ClosedEnd,REIT,ELKS,OpenEnd,Right,ParticipationShare,CapitalSecurity,PerpetualCapitalSecurity,GuaranteeCertificate,IGC,Warrant,SeniorNote,Debenture,ETF,ADR,ETC,ETN',
'exchange[]' = '109',
'exchange[]' = '127',
'exchange[]' = '51',
'exchange[]' = '108',
'pn' = '1', # this is page number and should be altered in a loop over all pages. 50 results per page i.e. rows
'order[col]' = 'eq_market_cap',
'order[dir]' = 'd'
)
get_data <- function(page_number){
data['pn'] = page_number
r <- httr::POST(url = 'https://www.investing.com/stock-screener/Service/SearchStocks', httr::add_headers(.headers=headers), body = data)
s <- r %>% read_html() %>% html_node('p') %>% html_text()
if(page_number==1){ return(s) }
else{return(data.frame(jsonlite::fromJSON(str_match(s, '(\\[.*\\])' )[1,2])))}
}
clean_df <- function(df){
interim <- df['viewData']
df_minus <- subset(df, select = -c(viewData))
df_clean <- cbind.data.frame(c(interim, df_minus))
return(df_clean)
}
initial_data <- get_data(1)
df <- clean_df(data.frame(jsonlite::fromJSON(str_match(initial_data, '(\\[.*\\])' )[1,2])))
total_rows <- str_match(initial_data, '"totalCount\":(\\d+),' )[1,2] %>% as.integer()
num_pages <- ceiling(total_rows/50)
dfs <- map(.x = seq_len(num_pages)[-1], # pages 2..num_pages; empty if there is only one page
.f = ~clean_df(get_data(.)))
r <- rbindlist(c(list(df),dfs),use.names=TRUE, fill=TRUE)
write_csv(r, 'data.csv')
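For anyone wondering what `use.names=TRUE, fill=TRUE` is doing there, this is a rough base R sketch of the same idea: align columns by name and fill the gaps with NA. The helper `bind_fill` and the toy pages are hypothetical, and this is not how data.table implements it.

```r
# Two made-up pages whose columns differ in presence and order.
page1 <- data.frame(name = c("A", "B"), last = c(1.5, 2.5),
                    stringsAsFactors = FALSE)
page2 <- data.frame(pe = 10, name = "C", stringsAsFactors = FALSE)

# Hypothetical helper: add any missing columns as NA, then align column order.
bind_fill <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  aligned <- lapply(dfs, function(d) {
    for (col in setdiff(all_cols, names(d))) d[[col]] <- NA
    d[, all_cols, drop = FALSE]
  })
  do.call(rbind, aligned)
}

combined <- bind_fill(list(page1, page2))
print(combined)
```

For several thousand rows, `rbindlist` is the better choice; the sketch is only meant to show why rows stay aligned even when a page lacks a column.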
I am working on predicting citation counts for articles. The problem is that I need information about journals from ISI Web of Knowledge. They gather this information (journal impact factor, eigenfactor, ...) year by year, but there is no way to download all the journal information for one year at once. There is only a "mark all" option, which always marks just the first 500 journals in the list (this list can then be downloaded). I am programming this project in R. So my question is: how can I retrieve this information all at once, or at least in an efficient and tidy way? Thank you for any ideas.
I used RSelenium to scrape WOS to get citation data and make a plot similar to this one by Kieran Healy (but mine was for archaeology journals, so my code is tailored to that):
Here's my code (from a slightly bigger project on github):
# set up browser and selenium
library(devtools)
install_github("ropensci/rselenium")
library(RSelenium)
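# note: checkForServer() and startServer() are defunct in newer RSelenium
# releases, where rsDriver() is the replacement
checkForServer()
startServer()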
remDr <- remoteDriver()
remDr$open()
# go to http://apps.webofknowledge.com/
# refine search by journal... perhaps arch?eolog* in 'topic'
# then: 'Research Areas' -> archaeology -> refine
# then: 'Document types' -> article -> refine
# then: 'Source title' -> choose your favourite journals -> refine
# must have <10k results to enable citation data
# click 'create citation report' tab at the top
# do the first page manually to set the 'save file' and 'do this automatically',
# then let loop do the work after that
# before running the loop, get URL of first page that we already saved,
# and paste in next line, the URL will be different for each run
remDr$navigate("http://apps.webofknowledge.com/CitationReport.do?product=UA&search_mode=CitationReport&SID=4CvyYFKm3SC44hNsA2w&page=1&cr_pqid=7&viewType=summary")
Here's the loop to automate collecting data from the next several hundred pages of WOS results...
# Loop to get citation data for each page of results, each iteration will save a txt file, I used selectorgadget to check the css ids, they might be different for you.
for(i in 1:1000){
# click on 'save to text file'
result <- try(
webElem <- remDr$findElement(using = 'id', value = "select2-chosen-1")
); if(inherits(result, "try-error")) next;
webElem$clickElement()
# click on 'send' on pop-up window
result <- try(
webElem <- remDr$findElement(using = "css", "span.quickoutput-action")
); if(inherits(result, "try-error")) next;
webElem$clickElement()
# refresh the page to get rid of the pop-up
remDr$refresh()
# advance to the next page of results
result <- try(
webElem <- remDr$findElement(using = 'xpath', value = "(//form[@id='summary_navigation']/table/tbody/tr/td[3]/a/i)[2]")
); if(inherits(result, "try-error")) next;
webElem$clickElement()
print(i)
}
# there are many duplicates, but the code below will remove them
# copy the folder to your hard drive, and edit the setwd line below
# to match the location of your folder containing the hundreds of text files.
Read all text files into R...
# move them manually into a folder of their own
setwd("/home/two/Downloads/WoS")
# get text file names
my_files <- list.files(pattern = "\\.txt$")
# make list object to store all text files in R
my_list <- vector(mode = "list", length = length(my_files))
# loop over file names and read each file into the list
my_list <- lapply(seq(my_files), function(i) read.csv(my_files[i],
skip = 4,
header = TRUE,
comment.char = " "))
# check to see it worked
my_list[1:5]
Combine list of dataframes from the scrape into one big dataframe
# use data.table for speed
install_github("rdatatable/data.table")
library(data.table)
my_df <- rbindlist(my_list)
setkey(my_df)
# filter only a few columns to simplify
my_cols <- c('Title', 'Publication.Year', 'Total.Citations', 'Source.Title')
my_df <- my_df[,my_cols, with=FALSE]
# remove duplicates
my_df <- unique(my_df)
# what journals do we have?
unique(my_df$Source.Title)
Make abbreviations for journal names, make article titles all upper case ready for plotting...
# get names
long_titles <- as.character(unique(my_df$Source.Title))
# get abbreviations automatically, perhaps not the obvious ones, but it's fast
short_titles <- unname(sapply(long_titles, function(i){
# take the first letter of the title and the letter after each space
theletters = strsplit(i,'')[[1]]
wh = c(1,which(theletters == ' ') + 1)
paste(theletters[wh],collapse='')
}))
# manually disambiguate the journals that now only have 'A' as the short name
short_titles[short_titles == "A"] <- c("AMTRY", "ANTQ", "ARCH")
# remove 'NA' so it's not confused with an actual journal
short_titles[short_titles == "NA"] <- ""
# add abbreviations to big table
journals <- data.table(Source.Title = long_titles,
short_title = short_titles)
setkey(journals) # need a key to merge
my_df <- merge(my_df, journals, by = 'Source.Title')
# make article titles all upper case, easier to read
my_df$Title <- toupper(my_df$Title)
## create new column that is 'decade'
# first make a lookup table to get a decade for each individual year
year1 <- 1900:2050
my_seq <- seq(year1[1], year1[length(year1)], by = 10)
indx <- findInterval(year1, my_seq)
ind <- seq(1, length(my_seq), by = 1)
labl1 <- paste(my_seq[ind], my_seq[ind + 1], sep = "-")[-42]
dat1 <- data.table(data.frame(Publication.Year = year1,
decade = labl1[indx],
stringsAsFactors = FALSE))
setkey(dat1, 'Publication.Year')
# merge the decade column onto my_df
my_df <- merge(my_df, dat1, by = 'Publication.Year')
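As an aside, if you don't need the lookup table, a decade label can also be computed directly with integer division. This is a small sketch on made-up years; note that it labels decades as "1970-1979" rather than the "1970-1980" style produced by the lookup table above.

```r
# Made-up publication years for illustration.
years <- c(1969, 1970, 1979, 1980, 2014)

# Integer division drops the last digit, giving the first year of the decade.
decade_start <- (years %/% 10) * 10
decade <- paste(decade_start, decade_start + 9, sep = "-")
print(decade)
```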
Find the ten most-cited papers for each decade of publication...
df_top <- my_df[ave(-my_df$Total.Citations, my_df$decade, FUN = rank) <= 10, ]
# inspecting this df_top table is quite interesting.
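To see how the `ave()` and `rank()` combination picks the top papers within each group, here is a toy version that keeps the top two per decade. The data frame `toy` is made up for illustration.

```r
# Made-up citation counts in two decades.
toy <- data.frame(
  decade = c("1970-1980", "1970-1980", "1970-1980", "1980-1990", "1980-1990"),
  Total.Citations = c(10, 50, 30, 5, 20)
)

# Rank the negated counts within each decade, so rank 1 is the most cited.
rk <- ave(-toy$Total.Citations, toy$decade, FUN = rank)

# Keep the two most-cited rows per decade.
top2 <- toy[rk <= 2, ]
print(top2)
```

One caveat: `rank()` averages tied values by default, so with ties at the cutoff a decade can end up with slightly more or fewer rows than the threshold suggests.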
Draw the plot in a similar style to Kieran's. This code comes from Jonathan Goodwin, who also reproduced the plot for his field (1, 2).
######## plotting code from Jonathan Goodwin ##########
######## http://jgoodwin.net/ ########
# format of data: Title, Total.Citations, decade, Source.Title
# THE WRITERS AUDIENCE IS ALWAYS A FICTION,205,1974-1979,PMLA
library(ggplot2)
ws <- df_top
ws <- ws[order(ws$decade,-ws$Total.Citations),]
ws$Title <- factor(ws$Title, levels = unique(ws$Title)) #to preserve order in plot, maybe there's another way to do this
g <- ggplot(ws, aes(x = Total.Citations,
y = Title,
label = short_title,
group = decade,
colour = short_title))
g <- g + geom_text(size = 4) +
facet_grid (decade ~.,
drop=TRUE,
scales="free_y") +
theme_bw(base_family="Helvetica") +
theme(axis.text.y=element_text(size=8)) +
xlab("Number of Web of Science Citations") + ylab("") +
labs(title="Archaeology's Ten Most-Cited Articles Per Decade (1970-)", size=7) +
scale_colour_discrete(name="Journals")
g #adjust sizing, etc.
Another version of the plot, but with no code: http://charlesbreton.ca/?page_id=179