I've downloaded an XML database (Cellosaurus - https://web.expasy.org/cellosaurus/) and I'm trying to use the XML package in R to find all misspellings of a cell line name and return the misspelling and accession.
I've never used XML or XPath expressions before and I'm having real difficulties, so I also hope I've used the correct terminology in my question...
I've loaded the database like so:
doc <- XML::xmlInternalTreeParse(file)
and I can see an example entry which looks like this:
<cell-line category="Cancer cell line">
<accession-list>
<accession type="primary">CVCL_6774</accession>
</accession-list>
<name-list>
<name type="identifier">DOV13</name>
</name-list>
<comment-list>
<comment category="Misspelling"> DOR 13; In ArrayExpress E-MTAB-2706, PubMed=25485619 and PubMed=25877200 </comment>
</comment-list>
</cell-line>
I think I've managed to pull out all of the misspellings (which is slightly useful already):
mispelt <- XML::getNodeSet(doc, "//comment[@category=\"Misspelling\"]")
but now I have no idea how to get the accession associated with each misspelling. Perhaps there's a different function I should be using?
Can anyone help me out or point me towards a simple XML R package tutorial please?
It's difficult to help with an incomplete example. But the basic idea is to navigate up the tree structure to get to the data you want. I've used the more current xml2 package but the same idea should hold for XML. For example
library(xml2)
xx <- read_xml("cell.xml")
nodes <- xml_find_all(xx, "//comment[@category=\"Misspelling\"]")
xml_find_first(nodes, ".//../../accession-list/accession") |> xml_text()
# [1] "CVCL_6774"
It's not clear if you have multiple comments or how your data is structured. You may need to lapply or purrr::map the second node selector after the first if you have multiple nodes.
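For instance, if the full file contains many cell-line entries, a sketch along these lines (same file and xml2 calls as above) would pair each misspelling with the accession of its enclosing cell-line in a data frame:
library(xml2)
xx <- read_xml("cell.xml")
nodes <- xml_find_all(xx, "//comment[@category=\"Misspelling\"]")
# For each misspelling comment, walk up to the enclosing cell-line and take its accession
result <- data.frame(
  misspelling = trimws(xml_text(nodes)),
  accession = vapply(nodes, function(n) {
    xml_text(xml_find_first(n, "../../accession-list/accession"))
  }, character(1)),
  stringsAsFactors = FALSE
)
vapply() keeps the result aligned with the nodes, and xml_find_first() gives NA if an entry happens to have no accession.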
I am trying to figure out how to 'download' data into a nice CSV file to be able to analyse.
I am currently looking at WHO data from the GHO Athena API.
I am doing so by following the documentation and getting output like so:
test_data <- jsonlite::parse_json(url("http://apps.who.int/gho/athena/api/GHO/WHS6_102.json?profile=simple"))
head(test_data)
This gives me a rather messy list of list of lists.
It is not very easy to analyse and is rather messy. How could I clean this up so that I keep only a few of the columns returned by parse_json, say the dim fields REGION, YEAR and COUNTRY, together with the values from the Value column? I would like to turn this into a nice dataframe/CSV file so I can more easily understand what is happening.
Can anyone give any advice?
jsonlite::fromJSON gives you data in a better format and the 3rd element in the list is where the main data is.
url <- 'https://apps.who.int/gho/athena/api/GHO/WHS6_102.json?profile=simple'
tmp <- jsonlite::fromJSON(url)
data <- tmp[[3]]
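From there, a possible next step (just a sketch; it assumes the third element is the "fact" table and that its nested dim data frame carries REGION, YEAR and COUNTRY columns, which is what profile=simple usually returns) is to keep those columns plus Value and write them out as a CSV:
url <- 'https://apps.who.int/gho/athena/api/GHO/WHS6_102.json?profile=simple'
tmp <- jsonlite::fromJSON(url)
fact <- tmp[[3]]
# Keep only the dimensions of interest plus the measured value
clean <- data.frame(
  REGION = fact$dim$REGION,
  YEAR = fact$dim$YEAR,
  COUNTRY = fact$dim$COUNTRY,
  Value = fact$Value,
  stringsAsFactors = FALSE
)
write.csv(clean, "who_whs6_102.csv", row.names = FALSE)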
Hi, first of all thanks for the help. I would like to know if there's a way to extract specific data that sits in the same place on every page of an editable PDF file.
The file (modified to comply with privacy concerns) contains a series of payroll receipts; all pages have the same format and data layout. I would like to extract only the SSN (No. IMSS) of each employee and put them in a data frame. I have searched for how to do this, but I have only found cases where the data is not properly structured, and since in this file all pages are exactly alike, I would like to know if there's a less troublesome way.
Using pdftools and the steps below I was able to isolate the data I wanted (located on line 9), but only from an individual page. I would like to know if it's possible to enter a command that works for all pages. Thank you.
> library(pdftools)
> test <- pdf_text("pruebas.pdf")
> orden <- strsplit(test,"\r\n")
> required <- orden[[1]]
> nss <- required[9]
> result <- as.data.frame(nss)
This is a text parsing task and there are several ways to do it. Perhaps the quickest is to split the output at every "No. IMSS:", take the second fragment of each split, split that at the line break, and take the first fragment. The code isn't pretty, but it works:
sapply(strsplit(sapply(strsplit(pdftools::pdf_text("pruebas.pdf"),
"No\\. IMSS: +"), `[`, 2), "\r"), `[`, 1)
#> [1] "12-34-56-7895-5" "12-34-56-7895-9" "12-34-56-7895-7" "12-34-56-7895-1"
I've got a dataset with feedback comments on multiple criteria from a customer survey conducted on many sites, where each row represents a single response.
For simplicity's sake, I have simplified the original dataset and produced a reproducible dataframe with comments for only three sites.
The criteria are listed from columns 4 - 10.
comments = data.frame(RESPONDENT_ID=c(1,2,3,4,5,6,7,8),
REGION=c("ASIA","ASIA","ASIA","ASIA","ASIA","EUROPE","EUROPE","EUROPE"),
SITE=c("Tokyo Center","Tokyo Center","Tokyo Center","PB Tower","PB Tower","Rome Heights","Rome Heights","Rome Heights"),
Lighting=c("Dim needs to be better","","Good","I don't like it","Could be better","","",""),
Cleanliness=c("","very clean I'm happy","great work","","disappointed","I like the work","","nice"),
Hygiene=c("","happy","needs improvement","great","poor not happy","nice!!","clean as usual i'm never disappointed",""),
Service=c("great service","impressed","could do better","","","need to see more","cant say","meh"),
Punctuality=c("always on time","","loving it","proper and respectful","","","punctual as always","delays all the time!"),
Efficiency=c("generally efficient","never","cannot comment","","","","","happy with this"),
Motivation=c("always very motivated","driven","exceeds expectations","","poor service","ok can do better","hmm","motivated"))
I've got a second dataset, which contains the bottom 3 scoring criteria for each of the three sites.
bottom = data.frame(REGION=c("ASIA","ASIA","EUROPE"),
SITE=c("Tokyo Center","PB Tower","Rome Heights"),
BOTTOM_1=c("Lighting","Cleanliness","Motivation"),
BOTTOM_2=c("Hygiene","Service","Lighting"),
BOTTOM_3=c("Motivation","Punctuality","Cleanliness"))
My Objective:
1) For each SITE in the comments dataframe, I'd like to look up the bottom dataframe and extract the comments for that site's bottom 3 criteria only.
2) Based on this extraction, for each unique SITE, I'd like to create an Excel file with three sheets, each sheet named after the bottom 3 criteria for that given site.
3) Each Sheet would contain a list of comments extracted for that particular site.
4) I'd like all Excel files saved in the format:
REGION_SITE_Comments2017.xlsx
Desired Final Output:
3 Excel files (or as many files as there are unique sites), each Excel file having three tabs named after their bottom 3 criteria, and each sheet with a list of comments corresponding to the given criterion for that site.
So as an example, one of the three files generated would look like this:
The file name would be ASIA_TokyoCenter_Comments2017.xlsx
The file would contain 3 sheets, "Lighting","Hygiene" & "Motivation" (based on the three bottom criteria for this site)
Each of these sheets would contain their respective site-level comments.
My Methodology:
I tried using a for loop on the comments dataframe, filtering the bottom dataframe for each site listed, and then using the write.xlsx function from the xlsx package to generate the Excel files, with the sheetName argument set to each of the bottom three criteria per site.
However, I cannot seem to get the desired results. I have searched Stack Overflow for similar solutions, but haven't found anything yet.
Any help with this would be highly appreciated!
This can probably be formatted better...
But for each level of REGION and SITE, we look up the matching 'bottom' criteria, extract each combination independently and write it to file.
library(xlsx)

bottom <- sapply(bottom, as.character)   # get out of factors
sp <- split(comments, comments$REGION)   # split the data into a list by region for ease

for (i in unique(bottom[, 1])) {         # each region
  for (j in unique(bottom[, 2])) {       # each site
    x <- sp[[i]][sp[[i]][, 3] == j, ]    # comments for this region/site
    y <- x[, colnames(x) %in% bottom[bottom[, 1] == i & bottom[, 2] == j, 3:5]]
    for (q in colnames(y)) {             # one sheet per bottom-3 criterion
      if (nrow(x) > 0) {
        write.xlsx(x = y[, q, drop = FALSE],   # keep it a data frame for write.xlsx
                   file = paste(i, j, "Comments2017.xlsx", sep = "_"),
                   sheetName = q, append = TRUE)
      }
    }
  }
}
Is this what you were looking for?
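If you'd rather avoid the Java dependency of the xlsx package, here is a sketch of the same idea with openxlsx (an assumption on my part, it isn't mentioned in the question), starting again from the bottom data frame defined in the question; write.xlsx from openxlsx accepts a named list of data frames and writes one sheet per list element:
library(openxlsx)

bottom[] <- lapply(bottom, as.character)   # get out of factors, keep the data frame

for (k in seq_len(nrow(bottom))) {
  region <- bottom$REGION[k]
  site <- bottom$SITE[k]
  crits <- unlist(bottom[k, c("BOTTOM_1", "BOTTOM_2", "BOTTOM_3")])
  site_comments <- comments[comments$SITE == site, ]
  # One data frame of non-empty comments per bottom-3 criterion
  sheets <- lapply(crits, function(cr) {
    data.frame(COMMENT = site_comments[[cr]][site_comments[[cr]] != ""],
               stringsAsFactors = FALSE)
  })
  names(sheets) <- crits
  write.xlsx(sheets,
             file = paste0(region, "_", gsub(" ", "", site), "_Comments2017.xlsx"))
}
The gsub() call strips the space in the site name so the files come out as ASIA_TokyoCenter_Comments2017.xlsx, matching the format asked for.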
I am struggling to parse contents from HTML using htmlTreeParse and XPath.
Below is the web link from which I need to extract the "most valuable brands" information and create a data frame out of it.
http://www.forbes.com/powerful-brands/list/#tab:rank
As a first step towards building the table, I am trying to extract the list of brands (Apple, Google, Microsoft, etc.). I am trying with the code below:
library(XML)
library(RCurl)
htmlContent <- getURL("http://www.forbes.com/powerful-brands/list/#tab:rank", ssl.verifypeer = FALSE)
htmlParsed <- htmlTreeParse(htmlContent, useInternal = TRUE)
output <- xpathSApply(htmlParsed, "/html/body/div/div/div/table[@id='the_list']/tbody/tr/td[@class='name']", xmlValue)
But it's returning NULL, and I am not able to find my mistake. "/html/body/div/div/div/table[@id='the_list']/thead/tr/th" works correctly, returning ("", "Rank", "brand", etc.).
This means the path up to the table is correct, but I am not able to understand what's wrong after that.
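For anyone hitting the same NULL result: a quick check worth running (a sketch building on the code above) is to count how many rows the downloaded HTML actually has under the table body. If the count is zero, the table body is most likely filled in by JavaScript after the page loads, so getURL never sees the brand cells and the XPath correctly returns nothing.
library(RCurl)
library(XML)

htmlContent <- getURL("http://www.forbes.com/powerful-brands/list/#tab:rank",
                      ssl.verifypeer = FALSE)
htmlParsed <- htmlTreeParse(htmlContent, useInternal = TRUE)

# Header cells, which are known to be present:
xpathSApply(htmlParsed, "//table[@id='the_list']/thead/tr/th", xmlValue)

# How many body rows does the raw HTML actually contain?
length(getNodeSet(htmlParsed, "//table[@id='the_list']/tbody/tr"))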