Storing data from a for loop in a data frame - r

I am trying to create a function that interacts with the PubMed API to retrieve the XML files associated with 100 publications, and then parse each XML file to retrieve the title and the abstract of each publication. I am using the rentrez package to interact with the API and have successfully retrieved the necessary XML files. I am using the XML package to parse the files, and have verified that the XPath expressions retrieve the data that I want. (Eventually I want to pull data from other fields as well, such as journal title and MeSH terms, but I am stuck at this step.)
However, I have not been able to create a proper for loop to move this data into a data frame. I receive the following error from running my code:
Error in `$<-.data.frame`(`*tmp*`, "Abstract", value = list("text of abstract")) :
  replacement has 1 row, data has 0
When I test the function retrieving only title information (by removing the expression that retrieves the abstract), I get an empty data frame with none of the titles I want, but no error message.
If I execute pubmed_parsed("Kandel+Eric", n=2), my goal is to receive a data frame whose "ATitle" column contains the character vectors from two titles ("Roles for small noncoding RNAs in silencing of retrotransposons in the mammalian brain" and "ApCPEB4, a non-prion domain containing homolog of ApCPEB, is involved in the initiation of long-term facilitation"), with the corresponding abstracts in the "Abstract" column (portions of the abstracts: "Piwi-interacting RNAs (piRNAs), long thought to be restricted to germline...", "Two pharmacologically distinct types of local protein synthesis are required for synapse-specific...").
library(XML)
library(rentrez)
pubmed_parsed <- function(term, n = 100){
  df <- data.frame(ATitle = character(), JTitle = character(), MeshTerms = character(),
                   Abstract = character(), FAuthor = character(), LAuthor = character(),
                   stringsAsFactors = FALSE)
  IdList <- entrez_search(db = "pubmed", term = term, retmode = "xml", retmax = n)
  for (i in 1:n){
    XmlFile <- entrez_fetch(db = "pubmed", id = IdList$ids[i], rettype = "xml", retmode = "xml", parsed = TRUE)
    Parsed <- xmlRoot(XmlFile)
    df$ATitle[i] <- xpathSApply(Parsed, "/PubmedArticleSet/PubmedArticle/MedlineCitation/Article/Title", xmlValue, simplify = FALSE)
    df$Abstract[i] <- xpathSApply(Parsed, "/PubmedArticleSet/PubmedArticle/MedlineCitation/Article/Abstract/AbstractText", xmlValue, simplify = FALSE)
  }
  df
}
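For what it's worth, the immediate cause of the error is assigning into row i of a data frame that still has zero rows. A minimal sketch of a workaround that keeps the loop (using //ArticleTitle and //Abstract/AbstractText, the usual PubMed element names, rather than the exact paths above) is to collect the values in plain character vectors and build the data frame once at the end:
# collect results in pre-allocated vectors, then assemble the data frame
titles    <- character(n)
abstracts <- character(n)
for (i in 1:n){
  XmlFile <- entrez_fetch(db = "pubmed", id = IdList$ids[i], rettype = "xml", retmode = "xml", parsed = TRUE)
  Parsed  <- xmlRoot(XmlFile)
  # paste() collapses structured abstracts that have several AbstractText nodes
  titles[i]    <- paste(xpathSApply(Parsed, "//ArticleTitle", xmlValue), collapse = " ")
  abstracts[i] <- paste(xpathSApply(Parsed, "//Abstract/AbstractText", xmlValue), collapse = " ")
}
df <- data.frame(ATitle = titles, Abstract = abstracts, stringsAsFactors = FALSE)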

Here's one way to get a table and a few suggestions. First, I would use the Web history option and download all results together instead of looping through downloads.
ids <- entrez_search(db = "pubmed", term = "Kandel ER", use_history = TRUE)
ids
Entrez search result with 502 hits (object contains 20 IDs and a web_history object)
Search term (as translated): Kandel ER[Author]
doc <- entrez_fetch(db="pubmed", web_history=ids$web_history, rettype="xml", retmax = 3, parsed=TRUE)
Next, get the articles into a node set and query that to handle all your missing and multiple tags.
articles <- getNodeSet( doc, "//PubmedArticle")
length(articles)
[1] 3
articles[[1]]
<PubmedArticle>
<MedlineCitation Status="Publisher" Owner="NLM">
<PMID Version="1">27791114</PMID>
<DateCreated>
...
I usually create a function to add NAs if tags are missing and join multiple tags using a comma.
xpath2 <- function(x, path, fun = xmlValue, ...){
  y <- xpathSApply(x, path, fun, ...)
  ifelse(length(y) == 0, NA,
         ifelse(length(y) > 1, paste(unlist(y), collapse = ", "), y))
}
Then just apply that function to the nodes (with the leading dot in the XPath so it's relative to each node). This will combine multiple keywords into a comma-separated list and insert NA for article 3, which has no keywords.
sapply(articles, xpath2, ".//Keyword")
[1] "DNA methylation, behavior, endogenous siRNA, piwi-interacting RNA, transposon"
[2] "Aplysia, CPEB, CPEB4, Long-term facilitation"
[3] NA
Most XPath expressions should work:
sapply(articles, xpath2, ".//PubDate/Year")
[1] "2016" "2016" "2016"
sapply(articles, xpath2, ".//ArticleId[#IdType='pmc']")
[1] "PMC5111663" "PMC5075418" NA
You can also use xmlGetAttr if needed
sapply(articles, xpath2, ".//Article", xmlGetAttr, "PubModel")
[1] "Print-Electronic" "Electronic" "Electronic"
Finally, create a data.frame
data.frame(
  ATitle   = sapply(articles, xpath2, ".//ArticleTitle"),
  JTitle   = sapply(articles, xpath2, ".//Journal/Title"),
  Keywords = sapply(articles, xpath2, ".//Keyword"),
  Authors  = sapply(articles, xpath2, ".//Author/LastName"),
  Abstract = sapply(articles, xpath2, ".//AbstractText"))
I'm not sure what happened to MeSH terms; I only see Keywords in the few examples I downloaded. Also, there are probably a few ways to get first and last authors. You could get both last name and initials (assuming both are always present) and drop the comma before the initials to get an author string, then split that to get the first and last author, or even print the first three as below.
au <- sapply(articles, xpath2, ".//Author/LastName|.//Author/Initials")
au <- gsub(",( [A-Z]+,?)", "\\1", au)
authors_etal <- function(x, authors = 3, split = ", *"){
  y <- strsplit(x, split)
  sapply(y, function(x){
    if(length(x) > (authors + 1)) x <- c(x[1:authors], "et al.")
    paste(x, collapse = ", ")
  })
}
authors_etal(au)
[1] "Nandi S, Chandramohan D, Fioriti L, et al."
[2] "Lee SH, Shim J, Cheong YH, et al."
[3] "Si K, Kandel ER"

Related

Create data frame from lists that are uneven

My code below scrapes data from IMDB across multiple pages. However, when I try to combine the data into one data frame, I get an error because gross and metascore have fewer rows than the other columns. How would I go about inserting NA values in the empty places so the vectors are equal in length? (Note: I had to remove some links because I need a certain reputation to post more links.)
urls <- c("https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=51&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=101&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=151&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=201&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=251&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=301&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=351&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=401&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=451&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=501&ref_=adv_nxt",
"https://www.imdb.com/search/title?title_type=feature&release_date=2010-01-01,2017-12-31&start=551&ref_=adv_nxt",
"https://www.imdb.com/search/title?
)
library(rvest)

results_list <- list()
for(.page in seq_along(urls)){
  webpage <- read_html(urls[[.page]])
  titlehtml <- html_nodes(webpage, '.lister-item-header a')
  title <- html_text(titlehtml)
  runtimehtml <- html_nodes(webpage, '.text-muted .runtime')
  runtime <- html_text(runtimehtml)
  runtime <- gsub(" min", "", runtime)
  ratinghtml <- html_nodes(webpage, '.ratings-imdb-rating strong')
  rating <- html_text(ratinghtml)
  voteshtml <- html_nodes(webpage, '.sort-num_votes-visible span:nth-child(2)')
  votes <- html_text(voteshtml)
  votes <- gsub(",", "", votes)          # removing commas
  metascorehtml <- html_nodes(webpage, '.metascore')
  metascore <- html_text(metascorehtml)
  metascore <- gsub(" ", "", metascore)  # removing extra space in metascore
  grosshtml <- html_nodes(webpage, '.ghost~ .text-muted+ span')
  gross <- html_text(grosshtml)
  gross <- gsub("M", "", gross)          # removing the 'M' sign
  gross <- substring(gross, 2, 6)        # dropping the leading '$'
  results_list[[.page]] <- data.frame(Title = title,
                                      Runtime = as.numeric(runtime),
                                      Rating = as.numeric(rating),
                                      Metascore = as.numeric(metascore),
                                      Votes = as.numeric(votes),
                                      Gross_Earning_in_Mil = as.numeric(unlist(gross)))
}
final_results <- plyr::ldply(results_list)
Error in data.frame(Title = title, Runtime = as.numeric(runtime), Rating = as.numeric(rating), :
arguments imply differing number of rows: 50, 49, 48
You need to know where your data is missing, so you need to know which items belong together. Right now you just have separate vectors of values, so you don't know which belong together.
Looking at the page, it looks like the results are neatly organized into "lister-item-content" nodes, so the clean thing to do is to first extract those nodes and only then pull more info out of each one separately. Something like this works for me:
items <- html_nodes(webpage,'.lister-item-content')
gross <- sapply(items, function(i) {html_text(html_node(i, '.ghost~ .text-muted+ span'))})
It inserts NA at every place where an item does not contain the node you're looking for.
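As a rough sketch of the full idea for one page (reusing the selectors from the question): html_node, singular, returns one match per item and NA where the selector finds nothing, so every column ends up the same length.
items <- html_nodes(webpage, '.lister-item-content')
page_df <- data.frame(
  Title     = html_text(html_node(items, '.lister-item-header a')),
  Runtime   = as.numeric(gsub(" min", "", html_text(html_node(items, '.text-muted .runtime')))),
  Rating    = as.numeric(html_text(html_node(items, '.ratings-imdb-rating strong'))),
  Metascore = as.numeric(trimws(html_text(html_node(items, '.metascore')))),
  Votes     = as.numeric(gsub(",", "", html_text(html_node(items, '.sort-num_votes-visible span:nth-child(2)')))),
  Gross     = html_text(html_node(items, '.ghost~ .text-muted+ span')),
  stringsAsFactors = FALSE
)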

Vectorizing for-loop in R

Oh, man. I am so terrible at removing for-loops from my code because I find them so intuitive, and I first learned C++. Below, I am fetching IDs for a search ("copd" in this case), using each ID to retrieve its full XML file, and saving the affiliation (location) from that file into a vector. I do not know how to speed this up; it took about 5 minutes to run on 700 IDs, whereas most searches have 70,000+ IDs. Thank you for any and all guidance.
library(rentrez)
library(XML)
# number of articles for term copd
count <- entrez_search(db = "pubmed", term = "copd")$count
# set max to count
id <- entrez_search(db = "pubmed", term = "copd", retmax = count)$ids
# empty vector that will soon contain locations
location <- character()
# get all location data
for (i in 1:count) {
  # fetch the record for each ID
  test <- entrez_fetch(db = "pubmed", id = id[i], rettype = "XML")
  # convert the XML to a list
  test_list <- XML::xmlToList(test)
  # retrieve location
  location <- c(location,
                test_list$PubmedArticle$MedlineCitation$Article$AuthorList$Author$AffiliationInfo$Affiliation)
}
This may give you a start - it seems to be possible to pull down multiple records at once.
library(rentrez)
library(xml2)
# number of articles for term copd
count <- entrez_search(db = "pubmed", term = "copd")$count
# set max to count
id_search <- entrez_search(db = "pubmed", term = "copd", retmax = count, use_history = T)
# get all
document <- entrez_fetch(db = "pubmed", rettype = "XML", web_history = id_search$web_history)
document_list <- as_list(read_xml(document))
The problem is that this is still time consuming because there are a large number of documents. It's also curious that it returned exactly 10,000 articles when I tried this - there may be a limit to what you can return at once.
You can then use something like the purrr package to start extracting the information you want.
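For example, here is a minimal sketch of pulling the affiliation out of the nested list (the path mirrors the xmlToList code in the question; pluck() returns NULL where a node is missing, which is turned into NA):
library(purrr)

articles <- document_list$PubmedArticleSet
location <- map_chr(articles, function(a) {
  # drill down to the first author's affiliation, if it exists
  aff <- pluck(a, "MedlineCitation", "Article", "AuthorList",
               "Author", "AffiliationInfo", "Affiliation", 1)
  if (is.null(aff)) NA_character_ else aff
})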

Convert R data into a .GEXF format

My first attempt at building a bipartite graph of co-authors' PubMed publications (226 records). The following is a sample of the input file (just one CSV line):
11810598;Chêne G, Angelini E, Cotte L, Lang JM, Morlat P, Rancinan C, May T, Journot V, Raffi F, Jarrousse B, Grappin M, Lepeu G, Molina JM;2002;Mar;Role of long-term nucleoside-analogue therapy in lipodystrophy and metabolic disorders in human immunodeficiency virus-infected patients.
> InputFile = 'JMMolina_PubMed.csv'
# Read the CSV input file into the initial JMMpubs data frame
> setwd('~/Dropbox/R')
> JMMpubs <- read.csv(file = InputFile, header = FALSE, sep = ";", strip.white = TRUE)
> names(JMMpubs) <- c("ID", "AuthList", "Year", "Month", "Title")
# build a new data frame IdAuth with one Id line for each coauthor
# therefore the first article, which has 13 co-authors, will generate 13 lines with the same Id
> Authors <- strsplit(as.character(JMMpubs$AuthList), split = ", ")
> IdAuth <- data.frame(Id = rep(JMMpubs$ID, sapply(Authors, length)), Author = unlist(Authors))
# Now I would like to export this data to Gephi
# The nodes of the graph should be the UNIQUE names in Authors
> UniqueAuthors <- unique(unlist(Authors))
The edges of the graph should be each row of IdAuth. I would like to associate the year of the publication JMMpubs$Year with each edge (to paint recent edges red and older ones in paler hues).
I have a similar issue. My solution has been the following.
For this to work you will, as far as I can tell, need to reshuffle your data a bit.
If I understand correctly, you need the authors who are connected to an ID.
The original answer is in this post https://stackoverflow.com/a/16177624/8080865 by user1317221_G.
I would set up the data frame as:
df3 <- data.frame(Author = c("fawf", "ewew", "wewe", "wrewe", "zare"),
                  ID = c("11", "11", "11", ...))  # etc.
### tnet solution works
# create an identifier data frame for each author
dfnames <- data.frame(i = as.numeric(df3$ID),
                      value = df3$Author)
library(tnet)
tdf <- as.tnet(cbind(df3[, 1], df3[, 2]), type = "binary two-mode tnet")
relations <- projecting_tm(tdf, method = "sum")
# match original names
relations[["i"]] <- dfnames[match(relations[["i"]], dfnames[["i"]]), "value"]
relations[["j"]] <- dfnames[match(relations[["j"]], dfnames[["i"]]), "value"]
# clean up names
names(relations) <- c("source", "target", "weight")
I hope this helps you figure out your answer?
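To get from there to an actual .gexf file, one option (a sketch, assuming the rgexf package and the IdAuth / UniqueAuthors objects from the question) is to build a node table of unique authors plus publication IDs and an edge table of numeric node indices, then write it out with write.gexf():
library(rgexf)

# nodes: one row per unique author plus one per publication ID
node_labels <- c(UniqueAuthors, as.character(unique(JMMpubs$ID)))
nodes <- data.frame(id = seq_along(node_labels),
                    label = node_labels,
                    stringsAsFactors = FALSE)

# edges: look up the numeric node id of each author and each publication
edges <- data.frame(source = match(IdAuth$Author, nodes$label),
                    target = match(as.character(IdAuth$Id), nodes$label))

write.gexf(nodes = nodes, edges = edges, output = "JMMolina_PubMed.gexf")
The publication year could then be attached as an edge attribute to drive the red-to-pale colouring in Gephi.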

Calculating number of xmlchildren under each parent node for a list in R

I am querying PubMed with a long list of PMIDs using R. Because entrez_fetch can only handle a certain number at a time, I have broken my ~2000 PMIDs down into one list with several vectors (each about 500 in length). When I query PubMed, I extract information from the XML file for each publication. What I would like to have in the end is something like this:
Original.PMID  Publication.type
26956987       Journal.article
26956987       Meta.analysis
26956987       Multicenter.study
26402000       Journal.article
25404043       Journal.article
25404043       Meta.analysis
Each publication has a unique PMID, but there may be several publication types associated with each PMID (as seen above). I can get the PMID from the XML file, and I can get the publication types for each PMID. What I have problems with is repeating the PMID the right number of times so that the PMID is associated with each of the publication types it has. I am able to do this if my data are not in a list with multiple sublists (e.g., if I have 14 batches, each as its own data frame) by getting the number of children nodes from the parent PublicationType node, but I can't figure out how to do this within a list.
My code so far is this:
library(rvest)
library(tidyverse)
library(stringr)
library(regexr)
library(rentrez)
library(XML)
pubmed <- my.data.frame
into.batches <- function(x, n) split(x, cut(seq_along(x), n, labels = FALSE))
batches <- into.batches(pubmed$PMID, 14)
headings <- lapply(1:14, function(x) paste0("Batch", x))
names(batches) <- headings
fwd <- sapply(batches, function(x) entrez_fetch(db = "pubmed", id = x, rettype = "xml", parsed = TRUE))
trial1 <- lapply(fwd, function(x)
  list(pub.type = xpathSApply(x, "//PublicationTypeList/PublicationType", xmlValue),
       or.pmid = xpathSApply(x, "//ArticleId[@IdType='pubmed']", xmlValue)))
trial1 is what I am having problems with. This gives me a list where, within each batch, I have a vector for pub.type and a vector for or.pmid, but they are different lengths.
I am trying to figure out how many child publication types there are for each publication, so I can repeat the PMID that many times. I am currently using the following code, which does not do what I want:
trial1 <- lapply(fwd, function(x)
  list(childnodes = xpathSApply(xmlRoot(x), "count(.//PublicationTypeList/PublicationType)", xmlChildren)))
Unfortunately, this just tells me the total number of children nodes for each batch, not for each publication (or pmid).
I would split the XML results into separate Article nodes and apply xpath functions to get pmids and pubtypes.
pmids <- c(11677608, 22328765, 11337471)
res <- entrez_fetch(db = "pubmed", rettype = "xml", id = pmids)
doc <- xmlParse(res)
x <- getNodeSet(doc, "//PubmedArticle")
x1 <- sapply(x, xpathSApply, ".//ArticleId[@IdType='pubmed']", xmlValue)
x2 <- sapply(x, xpathSApply, ".//PublicationType", xmlValue)
data.frame(pmid = rep(x1, sapply(x2, length)), pubtype = unlist(x2))
pmid pubtype
1 11677608 Journal Article
2 11677608 Research Support, Non-U.S. Gov't
3 22328765 Journal Article
4 22328765 Research Support, Non-U.S. Gov't
5 11337471 Journal Article
Also, NCBI says to use the HTTP POST method if using more than 200 UIDs. rentrez does not support POSTing, but you can run that with a few lines of code.
First, you need a vector with 1000s of Pubmed IDs (6171 from the microbial genome table)
library(readr)
x <- read_tsv( "ftp://ftp.ncbi.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt",
na = "-", quote = "")
ids <- unique( x$`Pubmed ID` )
ids <- ids[ids < 1e9 & !is.na(ids)]
Post the ids to NCBI using httr POST.
uri = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?"
response <- httr::POST(uri, body= list(id = paste(ids, collapse=","), db = "pubmed"))
Parse the results following the code in entrez_post to get the web history.
doc <- xmlParse( httr::content(response, as="text", encoding="UTF-8") )
result <- xpathApply(doc, "/ePostResult/*", xmlValue)
names(result) <- c("QueryKey", "WebEnv")
class(result) <- c("web_history", "list")
Finally, fetch up to 10K records (or loop through using the retstart option if you have more than 10K)
res <- entrez_fetch(db="pubmed", rettype="xml", web_history=result)
doc <- xmlParse(res)
These only take a second to run on my laptop.
articles <- getNodeSet(doc, "//PubmedArticle")
x1 <- sapply(articles, xpathSApply, ".//ArticleId[@IdType='pubmed']", xmlValue)
x2 <- sapply(articles, xpathSApply, ".//PublicationType", xmlValue)
data_frame( pmid= rep(x1, sapply(x2, length) ), pubtype = unlist(x2) )
# A tibble: 9,885 × 2
pmid pubtype
<chr> <chr>
1 11677608 Journal Article
2 11677608 Research Support, Non-U.S. Gov't
3 12950922 Journal Article
4 12950922 Research Support, Non-U.S. Gov't
5 22328765 Journal Article
...
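If you do have more than 10,000 records, here is a rough sketch of the retstart loop mentioned above (the chunk size and the polite pause are arbitrary choices; retstart and retmax are standard efetch parameters that entrez_fetch passes through to the API):
chunk  <- 10000
starts <- seq(0, length(ids) - 1, by = chunk)
docs <- lapply(starts, function(start) {
  res <- entrez_fetch(db = "pubmed", rettype = "xml", web_history = result,
                      retstart = start, retmax = chunk)
  Sys.sleep(0.5)  # stay friendly to the NCBI servers
  xmlParse(res)
})
# one list of PubmedArticle nodes across all chunks
articles <- unlist(lapply(docs, getNodeSet, "//PubmedArticle"), recursive = FALSE)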
And one last comment. If you want one row per article, I usually create a function that combines multiple tags into a delimited list and adds NAs for missing nodes.
xpath2 <- function(x, ...){
  y <- xpathSApply(x, ...)
  ifelse(length(y) == 0, NA, paste(y, collapse = "; "))
}
data_frame(pmid = sapply(articles, xpath2, ".//ArticleId[@IdType='pubmed']", xmlValue),
           journal = sapply(articles, xpath2, ".//Journal/Title", xmlValue),
           pubtypes = sapply(articles, xpath2, ".//PublicationType", xmlValue))
# A tibble: 6,172 × 3
pmid journal pubtypes
<chr> <chr> <chr>
1 11677608 Nature Journal Article; Research Support, Non-U.S. Gov't
2 12950922 Molecular microbiology Journal Article; Research Support, Non-U.S. Gov't
3 22328765 Journal of bacteriology Journal Article; Research Support, Non-U.S. Gov't
4 11337471 Genome research Journal Article
...
Since ArticleId is likely unique for each article while PublicationType may occur more than once per article, consider iteratively creating data frames instead of separate vectors.
Specifically, use node indexing, [#], across each PubmedArticle node of the XML doc, since this is the shared ancestor of id and type, then XPath down to the needed descendants. The code below creates a list of data frames of the same length as fwd:
trial1 <- lapply(fwd, function(doc) {
  # RETRIEVE NUMBER OF ARTICLES PER EACH XML
  num_of_articles <- length(xpathSApply(doc, "//PubmedArticle"))
  # LOOP THROUGH EACH ARTICLE AND BIND XML VALUES TO DATAFRAME
  dfList <- lapply(seq(num_of_articles), function(i)
    data.frame(
      Original.PMID = xpathSApply(doc, paste0("//PubmedArticle[", i, "]/descendant::ArticleId[@IdType='pubmed']"), xmlValue),
      Publication.type = xpathSApply(doc, paste0("//PubmedArticle[", i, "]/descendant::PublicationTypeList/PublicationType"), xmlValue)
    ))
  # ROW BIND ALL DFS INTO ONE
  df <- do.call(rbind, dfList)
})
For a final master data frame across all batches, run do.call(rbind, ...) again outside the loop:
finaldf <- do.call(rbind, trial1)

Efficient way to remove all proper names from corpus

Working in R, I'm trying to find an efficient way to search through a file of texts and remove or replace all instances of proper names (e.g., Thomas). I assume there is something available to do this but have been unable to locate it.
So, in this example the words "Susan" and "Bob" would be removed. This is a simplified example; in reality I would want this to apply to hundreds of documents and therefore to a fairly large list of names.
texts <- as.data.frame(rbind(
  'This text stuff if quite interesting',
  'Where are all the names said Susan',
  'Bob wondered what happened to all the proper nouns'
))
names(texts)[1] <- "text"
Here's one approach based upon a data set of first names:
install.packages("gender")
library(gender)
install_genderdata_package()
sets <- data(package = "genderdata")$results[,"Item"]
data(list = sets, package = "genderdata")
stopwords <- unique(kantrowitz$name)
texts <- as.data.frame(rbind(
  'This text stuff if quite interesting',
  'Where are all the names said Susan',
  'Bob wondered what happened to all the proper nouns'
))
removeWords <- function(txt, words, n = 30000L) {
  # cumulative length of the words as if they were joined by "|"
  l <- cumsum(nchar(words) + c(0, rep(1, length(words) - 1)))
  # split the words into groups so that no single regex exceeds roughly n characters
  groups <- cut(l, breaks = seq(1, ceiling(tail(l, 1) / n) * n + 1, by = n))
  # build one word-boundary alternation regex per group
  regexes <- sapply(split(words, groups), function(words)
    sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), collapse = "|")))
  for (regex in regexes) txt <- gsub(regex, "", txt, perl = TRUE, ignore.case = TRUE)
  return(txt)
}
removeWords(texts[,1], stopwords)
# [1] "This text stuff if quite interesting"
# [2] "Where are all the names said "
# [3] " wondered what happened to all the proper nouns"
It may need some tuning for your specific data set.
Another approach could be based upon part-of-speech tagging.
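Here is a sketch of that second idea using the udpipe package (an assumption on my part; any tagger that labels tokens as proper nouns would work), feeding the tokens tagged "PROPN" into the removeWords() function above:
library(udpipe)

# download and load an English model (done once)
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)

# annotate the texts and keep every token the tagger marks as a proper noun
anno <- as.data.frame(udpipe_annotate(ud_model, x = texts[, 1]))
proper_nouns <- unique(anno$token[anno$upos == "PROPN"])

removeWords(texts[, 1], proper_nouns)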
