Creating a dataset from an XML file in R

I am trying to download an XML file of journal article records and create a dataset for further interrogation in R. I'm completely new to XML and quite a novice at R. I cobbled together some code using bits from two sources: GoogleScholarXScraper and Extracting records from PubMed.
library(RCurl)
library(XML)
library(stringr)

# Search terms
SearchString <- "cancer+small+cell+non+lung+survival+plastic"
mySearch <- str_c("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=", SearchString, "&usehistory=y", sep = "", collapse = NULL)

# Search
pub.esearch <- getURL(mySearch)

# Extract QueryKey and WebEnv
pub.esearch <- xmlTreeParse(pub.esearch, asText = TRUE)
key <- as.numeric(xmlValue(pub.esearch[["doc"]][["eSearchResult"]][["QueryKey"]]))
env <- xmlValue(pub.esearch[["doc"]][["eSearchResult"]][["WebEnv"]])

# Fetch records
myFetch <- str_c("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&WebEnv=", env, "&retmode=xml&query_key=", key)
pub.efetch <- getURL(myFetch)
myxml <- xmlTreeParse(pub.efetch, asText = TRUE, useInternalNodes = TRUE)

# Create dataset of article characteristics -- this doesn't work
pub.data <- NULL
pub.data <- data.frame(
  journal     = xpathSApply(myxml, "//PubmedArticle/MedlineCitation/MedlineJournalInfo/MedlineTA", xmlValue),
  abstract    = xpathSApply(myxml, "//PubmedArticle/MedlineCitation/Article/Abstract/AbstractText", xmlValue),
  affiliation = xpathSApply(myxml, "//PubmedArticle/MedlineCitation/Article/Affiliation", xmlValue),
  year        = xpathSApply(myxml, "//PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Year", xmlValue),
  stringsAsFactors = FALSE)
The main problem I seem to have is that my returned XML file is not uniformly structured. For example, some references have a node structure like this:

<Abstract>
  <AbstractText>The Wilms' tumor gene... </AbstractText>
</Abstract>

whilst some have labels and look like this:

<Abstract>
  <AbstractText Label="BACKGROUND &amp; AIMS" NlmCategory="OBJECTIVE">Some background text.</AbstractText>
  <AbstractText Label="METHODS" NlmCategory="METHODS">Some text on methods.</AbstractText>
</Abstract>

When I extract the AbstractText I am hoping to get 24 rows of data back (there are 24 records when I run this made-up search today), but xpathSApply returns every labelled AbstractText as an individual element of my data frame. Is there a way to collapse the XML structure in this instance, or to ignore the labels? Is there a way to make xpathSApply return NA when nothing is found at the end of a path? I am aware of xmlToDataFrame, which sounds like it should fit the bill, but whenever I try to use it, it doesn't seem to give me anything sensible.
Thanks for your help

I am unsure which one you want, however:

xpathSApply(myxml, "//*/AbstractText[@Label]")

will get the nodes with labels (keeping all attributes etc.), and

xpathSApply(myxml, "//*/AbstractText[not(@Label)]", xmlValue)

will get the nodes without labels.
EDIT:

test <- xpathApply(myxml, "//*/Abstract", xmlValue)
> length(test)
[1] 24

may give you what you want: xmlValue on an Abstract node concatenates the text of all its AbstractText children, so you get one element per record.
EDIT:
To get affiliation, year etc. padded with NAs:

# Helper: evaluate an XPath relative to a node, returning NA when nothing matches
dumfun <- function(x, xstr){
  res <- xpathSApply(x, xstr, xmlValue)
  if(length(res) == 0){
    out <- NA
  } else {
    out <- res
  }
  out
}

xpathSApply(myxml, "//*/Article", dumfun, xstr = './Affiliation')
xpathSApply(myxml, "//*/Article", dumfun, xstr = './Journal/JournalIssue/PubDate/Year')

Related

Basic XML R package question - how to return other attributes for matching entries?

I've downloaded an XML database (Cellosaurus - https://web.expasy.org/cellosaurus/) and I'm trying to use the XML package in R to find all misspellings of a cell line name and return the misspelling and accession.
I've never used XML or XPath expressions before and I'm having real difficulties, so I also hope I've used the correct terminology in my question...
I've loaded the database like so:
doc <- XML::xmlInternalTreeParse(file)
and I can see an example entry which looks like this:
<cell-line category="Cancer cell line">
  <accession-list>
    <accession type="primary">CVCL_6774</accession>
  </accession-list>
  <name-list>
    <name type="identifier">DOV13</name>
  </name-list>
  <comment-list>
    <comment category="Misspelling"> DOR 13; In ArrayExpress E-MTAB-2706, PubMed=25485619 and PubMed=25877200 </comment>
  </comment-list>
</cell-line>
I think I've managed to pull out all of the misspellings (which is slightly useful already):
mispelt <- XML::getNodeSet(doc, "//comment[@category=\"Misspelling\"]")
but now I have no idea how to get the accession associated with each misspelling. Perhaps there's a different function I should be using?
Can anyone help me out or point me towards a simple XML R package tutorial please?
It's difficult to help with an incomplete example, but the basic idea is to navigate up the tree structure to get to the data you want. I've used the more current xml2 package, but the same idea holds for XML. For example:

library(xml2)

xx <- read_xml("cell.xml")
# Find each misspelling comment, then step up to its enclosing cell-line
# and back down to the accession
nodes <- xml_find_all(xx, "//comment[@category=\"Misspelling\"]")
xml_find_first(nodes, "../../accession-list/accession") |> xml_text()
# [1] "CVCL_6774"

It's not clear whether you have multiple comments or how the rest of your data is structured; you may need to lapply or purrr::map the second node selector over the results of the first if you have multiple nodes.
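For a whole file with many cell-line entries, the same navigation vectorises into a data frame directly, because xml_find_first returns one (possibly missing) result per input node. A minimal sketch under the same assumptions; the [@type="primary"] filter is my own addition, modelled on the sample entry:

library(xml2)

xx <- read_xml("cell.xml")
comments <- xml_find_all(xx, "//comment[@category=\"Misspelling\"]")

# One row per misspelling comment: its text plus the primary accession of
# the cell-line that contains it (NA where no accession is found)
res <- data.frame(
  misspelling = trimws(xml_text(comments)),
  accession   = xml_text(xml_find_first(comments,
                  "../../accession-list/accession[@type=\"primary\"]")),
  stringsAsFactors = FALSE)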

How do you download data from an API and export it into a nice CSV file to query?

I am trying to figure out how to download data into a nice CSV file that I can analyse.
I am currently looking at WHO data, and following the documentation I get output like so:
test_data <- jsonlite::parse_json(url("http://apps.who.int/gho/athena/api/GHO/WHS6_102.json?profile=simple"))
head(test_data)
This gives me a rather messy list of lists of lists, which is not very easy to analyse. How could I clean this up so that I keep only the information from dim fields such as REGION, YEAR and COUNTRY, together with the values from the Value column? I would like to make this into a nice data frame/CSV file so I can more easily understand what is happening.
Can anyone give any advice?
jsonlite::fromJSON gives you the data in a better format, and the third element of the returned list is where the main data is:

url <- 'https://apps.who.int/gho/athena/api/GHO/WHS6_102.json?profile=simple'
tmp <- jsonlite::fromJSON(url)
data <- tmp[[3]]  # the third list element holds the main data frame
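From there it is a short step to a CSV. A minimal sketch, assuming the parsed element carries a nested dim data frame with REGION, YEAR and COUNTRY fields plus a Value column; check str(data) first, since I'm guessing at the exact layout:

# Flatten any nested data-frame columns, keep the columns of interest, export
flat <- jsonlite::flatten(data)
keep <- flat[, intersect(c("dim.REGION", "dim.YEAR", "dim.COUNTRY", "Value"),
                         names(flat))]
write.csv(keep, "who_whs6_102.csv", row.names = FALSE)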

Extract specific structured data from multiple pages pdf file with the same format using R

Hi, first of all thanks for the help. I would like to know if there's a way to extract specific data that sits in the same place on every page of an editable PDF file.
The file (modified to comply with privacy concerns) contains a series of payroll receipts; all pages share the same format and data layout. I would like to extract only the SSN (No. IMSS) of each employee and put them in a data frame. I have searched for how to do this, but have only found cases where the data is not properly structured; since in this file all pages are exactly alike, I would like to know if there's a less troublesome way.
Using pdftools and the steps below I was able to isolate the data I wanted (located on line 9), but only from an individual page. I would like to know if it's possible to enter a command that works for all pages. Thank you.
> library(pdftools)
> test <- pdf_text("pruebas.pdf")
> orden <- strsplit(test, "\r\n")
> required <- c(unlist(strsplit(orden[[1]], "\r\n")))
> nss <- required[9]
> result <- as.data.frame(nss)
This is a text parsing task and there are several ways to do it. Perhaps the quickest way is to split the output at every No. IMSS:, select the second fragments, split the result at the line break, then take the first fragment. The code isn't pretty, but it works:
sapply(strsplit(sapply(strsplit(pdftools::pdf_text("pruebas.pdf"),
"No\\. IMSS: +"), `[`, 2), "\r"), `[`, 1)
#> [1] "12-34-56-7895-5" "12-34-56-7895-9" "12-34-56-7895-7" "12-34-56-7895-1"

R generate multiple Excel files from a dataset, based on conditions from another

I've got a dataset with feedback comments on multiple criteria from a customer survey conducted on many sites, where each row represents a single response.
For simplicity's sake, I have reduced the original dataset to a reproducible data frame with comments for only three sites. The criteria are listed in columns 4 to 10.
comments = data.frame(
  RESPONDENT_ID = c(1, 2, 3, 4, 5, 6, 7, 8),
  REGION = c("ASIA", "ASIA", "ASIA", "ASIA", "ASIA", "EUROPE", "EUROPE", "EUROPE"),
  SITE = c("Tokyo Center", "Tokyo Center", "Tokyo Center", "PB Tower", "PB Tower", "Rome Heights", "Rome Heights", "Rome Heights"),
  Lighting = c("Dim needs to be better", "", "Good", "I don't like it", "Could be better", "", "", ""),
  Cleanliness = c("", "very clean I'm happy", "great work", "", "disappointed", "I like the work", "", "nice"),
  Hygiene = c("", "happy", "needs improvement", "great", "poor not happy", "nice!!", "clean as usual i'm never disappointed", ""),
  Service = c("great service", "impressed", "could do better", "", "", "need to see more", "cant say", "meh"),
  Punctuality = c("always on time", "", "loving it", "proper and respectful", "", "", "punctual as always", "delays all the time!"),
  Efficiency = c("generally efficient", "never", "cannot comment", "", "", "", "", "happy with this"),
  Motivation = c("always very motivated", "driven", "exceeds expectations", "", "poor service", "ok can do better", "hmm", "motivated"))
I've got a second dataset, which contains the bottom 3 scoring criteria for each of the three sites.
bottom = data.frame(
  REGION = c("ASIA", "ASIA", "EUROPE"),
  SITE = c("Tokyo Center", "PB Tower", "Rome Heights"),
  BOTTOM_1 = c("Lighting", "Cleanliness", "Motivation"),
  BOTTOM_2 = c("Hygiene", "Service", "Lighting"),
  BOTTOM_3 = c("Motivation", "Punctuality", "Cleanliness"))
My Objective:
1) From the comments dataframe, for each SITE, I'd like to filter the bottom dataframe, and extract the comments for the bottom 3 criteria per site only.
2) Based on this extraction, for each unique SITE, I'd like to create an Excel file with three sheets, each sheet named after the bottom 3 criteria for that given site.
3) Each Sheet would contain a list of comments extracted for that particular site.
4) I'd like all Excel files saved in the format:
REGION_SITE_Comments2017.xlsx
Desired Final Output:
3 Excel files (or as many files as there are unique sites), each Excel file having three tabs named after their bottom 3 criteria, and each sheet with a list of comments corresponding to the given criterion for that site.
So as an example, one of the three files generated would look like this:
The file name would be ASIA_TokyoCenter_Comments2017.xlsx
The file would contain 3 sheets, "Lighting","Hygiene" & "Motivation" (based on the three bottom criteria for this site)
Each of these sheets would contain their respective site-level comments.
My Methodology:
I tried using a for loop on the comments dataframe, and filtering the bottom dataframe for each site listed.
Then I used the write.xlsx function from the xlsx package to generate the Excel files, with the sheetName argument set to each of the bottom three criteria per site.
However I cannot seem to get the desired results. I have searched on Stackoverflow for similar solutions, but haven't found anything yet.
Any help with this would be highly appreciated!
This can probably be formatted better, but: for each REGION and SITE combination we look up that site's bottom three criteria in bottom, extract the matching comment columns, and write each one to its own sheet of that site's file.

library(xlsx)

bottom <- sapply(bottom, as.character)  # get out of factors
sp <- split(comments, comments$REGION)  # split the data by region for ease

for(i in unique(bottom[,1])){           # each region
  for(j in unique(bottom[,2])){         # each site
    # comments for this region/site combination (empty when the site
    # belongs to another region)
    x <- sp[[i]][sp[[i]][,3] == j, ]
    # keep only the columns named in this site's bottom-three criteria
    y <- x[, colnames(x) %in% bottom[bottom[,1] == i & bottom[,2] == j, 3:5]]
    for(q in colnames(y)){
      if(nrow(x) > 0){
        # one sheet per criterion; gsub() drops the space in the site name
        # so the file matches the REGION_SITE_Comments2017.xlsx format
        write.xlsx(x = y[, q],
                   file = paste(i, gsub(" ", "", j), "Comments2017.xlsx", sep = "_"),
                   sheetName = q, append = TRUE)
      }
    }
  }
}
Is this what you were looking for?
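If the Java-based xlsx package proves troublesome, here is an alternative sketch using openxlsx, whose write.xlsx accepts a named list of data frames and writes one sheet per element. This is a substitute approach, not part of the answer above, and it assumes bottom has already been converted to character as in that code:

library(openxlsx)

for(k in seq_len(nrow(bottom))){
  region <- bottom[k, "REGION"]
  site   <- bottom[k, "SITE"]
  crits  <- bottom[k, c("BOTTOM_1", "BOTTOM_2", "BOTTOM_3")]
  site_rows <- comments[comments$SITE == site, ]
  # named list of single-column data frames -> one sheet per criterion
  sheets <- setNames(lapply(crits,
              function(cr) data.frame(Comment = site_rows[[cr]])), crits)
  write.xlsx(sheets, file = paste0(region, "_", gsub(" ", "", site),
                                   "_Comments2017.xlsx"))
}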

What's wrong with my R code?

I am struggling to parse content from HTML using htmlTreeParse and XPath.
Below is the link to the page from which I need to extract the information on "most valuable brands" and create a data frame out of it:
http://www.forbes.com/powerful-brands/list/#tab:rank
As a first step towards building the table, I am trying to extract the list of brands (Apple, Google, Microsoft, etc.) with the code below:
library(RCurl)
library(XML)

htmlContent <- getURL("http://www.forbes.com/powerful-brands/list/#tab:rank", ssl.verifypeer = FALSE)
htmlParsed <- htmlTreeParse(htmlContent, useInternal = TRUE)
output <- xpathSApply(htmlParsed, "/html/body/div/div/div/table[@id='the_list']/tbody/tr/td[@class='name']", xmlValue)
But it's returning NULL and I am not able to find my mistake. "/html/body/div/div/div/table[@id='the_list']/thead/tr/th" works correctly, returning ("", "Rank", "brand", etc.), so the path up to the table is correct. I cannot understand what goes wrong after that.
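No answer was recorded for this question, but one hedged diagnostic is worth sketching: browsers insert a tbody element when rendering a table even when none exists in the source, and lists like this one are often filled in by JavaScript after the page loads, so the rows may simply be absent from the downloaded HTML. A quick check using the variables from the question:

# Does the static HTML actually contain a tbody under the table?
length(getNodeSet(htmlParsed, "//table[@id='the_list']/tbody"))

# A tbody-free, relative version of the same query
xpathSApply(htmlParsed, "//table[@id='the_list']//td[@class='name']", xmlValue)

# If both come back empty, the row data is most likely injected by
# JavaScript and never appears in htmlContent; writing htmlContent to a
# file and inspecting it will confirm.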
