My objective is to create an XML object made up of multiple elements, each containing different information. A simple example of what the XML object looks like is:
library(xml2)
x1 <- read_xml("<Diag><line level='3' description='a log message'/><line level='3' description='a second log message'/></Diag>")
message(x1)
Which outputs:
<Diag>
<line level="3" description="a log message"/>
<line level="3" description="a second log message"/>
</Diag>
At the moment, I take the information from a data frame called diag. I add the children using a for loop:
library(xml2)
diag <- data.frame(level=c(3,3),description=c('a log message','a second log message'),stringsAsFactors = F)
x2 <- xml_new_root("Diag")
for (i in 1:dim(diag)[1]) {
  xml_add_child(.x = x2, .value = "line", level = diag$level[i], description = diag$description[i])
}
message(x2)
The XML layout of which is identical to that of x1.
However, this loop is less elegant than I would like, and for large data frames it can be slow.
My question is: is there any way that I can create multiple children at once using the data in my data frame, by using something akin to apply?
I have tried various options but none were successful and I am not sure I was close enough to post any of these options here. Currently, I am using the xml2 package but if a solution can be found using another package then I'd be open to that also.
Much obliged for any help!
The following seems to do what you want, using sapply as requested.
x2 <- xml_new_root("Diag")
sapply(1:dim(diag)[1], function(i) {
xml_add_child(.x=x2,.value="line",level=diag$level[i],description=diag$description[i])
}
)
message(x2)
<?xml version="1.0" encoding="UTF-8"?>
<Diag>
<line level="3" description="a log message"/>
<line level="3" description="a second log message"/>
</Diag>
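If the per-row xml_add_child() calls are still too slow on very large data frames, another option is to build the document as a single string and parse it once. This is only a sketch; it assumes the attribute values never contain characters that need XML escaping (otherwise escape them first):

library(xml2)
# sprintf() is vectorised over the data frame columns, so no explicit loop is needed.
lines <- sprintf('<line level="%s" description="%s"/>', diag$level, diag$description)
x3 <- read_xml(paste0("<Diag>", paste(lines, collapse = ""), "</Diag>"))
message(x3)

The resulting x3 has the same layout as x1 and x2 above.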
I'm fairly new to working with XML files in R, but I've at least had more success with other files than with this specific one.
Quick background: I receive data in the attached format, but I cannot convert it into a data frame (which I have succeeded in doing with other files). Somehow my normal procedure doesn't work here. My goal is to turn the data into a data frame. Normally I would just use xmlToDataFrame(), but that gives me the following error:
unable to find an inherited method for function ‘xmlToDataFrame’ for
signature ‘"xml_document", "missing", "missing", "missing", "missing"’
Then I tried the following sequence:
data = read_xml("file.xml")
xmlimport = xmlTreeParse("file.xml")
topxml = xmlRoot(xmlimport)
topxml = xmlSApply(topxml,function(x) xmlSApply(x,xmlValue))
That provided me with the attached picture as output. All the data is contained within the cells, and I cannot seem to access it. I feel like there is a really simple solution, but after working with the file for longer than I like to admit, I hope you can point out something (hopefully) obvious to me.
If you have the time to assist me in it, I've uploaded the file here
Hope that will do.
Thanks for taking the time to assist me.
Note: The data is a bank fee statement, and the data is completely fictional
[Screenshot: output result]
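A hedged guess at the cause of the error above: xmlToDataFrame() belongs to the XML package and is an S4 generic with no method for xml2's xml_document class, which is exactly the "unable to find an inherited method" signature shown. Parsing with the XML package itself avoids the mismatch; a minimal sketch, assuming the file.xml from the question:

library(XML)
# Parse with the XML package rather than xml2::read_xml(), so that
# xmlToDataFrame() can dispatch on the resulting XMLInternalDocument.
doc <- xmlParse("file.xml")
df  <- xmlToDataFrame(doc)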
I've downloaded an XML database (Cellosaurus - https://web.expasy.org/cellosaurus/) and I'm trying to use the XML package in R to find all misspellings of a cell line name and return the misspelling and accession.
I've never used XML or XPath expressions before and I'm having real difficulties, so I also hope I've used the correct terminology in my question...
I've loaded the database like so:
doc <- XML::xmlInternalTreeParse(file)
and I can see an example entry which looks like this:
<cell-line category="Cancer cell line">
<accession-list>
<accession type="primary">CVCL_6774</accession>
</accession-list>
<name-list>
<name type="identifier">DOV13</name>
</name-list>
<comment-list>
<comment category="Misspelling"> DOR 13; In ArrayExpress E-MTAB-2706, PubMed=25485619 and PubMed=25877200 </comment>
</comment-list>
</cell-line>
I think I've managed to pull out all of the misspellings (which is slightly useful already):
mispelt <- XML::getNodeSet(doc, "//comment[@category=\"Misspelling\"]")
but now I have no idea how to get the accession associated with each misspelling. Perhaps there's a different function I should be using?
Can anyone help me out or point me towards a simple XML R package tutorial please?
It's difficult to help with an incomplete example, but the basic idea is to navigate up the tree structure to get to the data you want. I've used the more current xml2 package, but the same idea should hold for XML. For example:
library(xml2)
xx <- read_xml("cell.xml")
nodes <- xml_find_all(xx, "//comment[@category=\"Misspelling\"]")
xml_find_first(nodes, ".//../../accession-list/accession") |> xml_text()
# [1] "CVCL_6774"
It's not clear whether you have multiple comments or how your data is structured. You may need to lapply or purrr::map the second node selector after the first if you have multiple nodes; see the sketch below.
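As a sketch of the multiple-comment case (assuming every Misspelling comment sits inside a cell-line element, as in the entry above): xml_find_first() is vectorised over a nodeset, so it returns one accession per comment without an explicit loop.

library(xml2)
xx <- read_xml("cell.xml")
nodes <- xml_find_all(xx, "//comment[@category='Misspelling']")
# One row per misspelling; xml_find_first() yields the first matching
# accession for each node in the set.
data.frame(
  misspelling = xml_text(nodes),
  accession   = xml_text(xml_find_first(nodes, "../../accession-list/accession"))
)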
I am trying to extract data from the following XML file, namely all the values inside Object-id_str ("HMDB0003993" in this case). The goal is to extract ALL of these values, as there will be multiple per XML file. I am then hoping to aggregate all these IDs into a data frame (one column, presumably).
<?xml version="1.0"?>
<PC-Substances
xmlns="http://www.ncbi.nlm.nih.gov"
xmlns:xs="http://www.w3.org/2001/XMLSchema-instance"
xs:schemaLocation="http://www.ncbi.nlm.nih.gov
ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem.xsd"
>
<PC-Substance>
<PC-Substance_sid>
<PC-ID>
<PC-ID_id>348276207</PC-ID_id>
<PC-ID_version>1</PC-ID_version>
</PC-ID>
</PC-Substance_sid>
<PC-Substance_source>
<PC-Source>
<PC-Source_db>
<PC-DBTracking>
<PC-DBTracking_name>811</PC-DBTracking_name>
<PC-DBTracking_source-id>
<Object-id>
<Object-id_str>HMDB0003993</Object-id_str>
<PC-XRefData_patent>US7807614</PC-XRefData_patent>
I have tried using readLines and str_match, and read_xml and nodes, but I am having trouble working with the results and turning them into a data frame.
I have tried the method below, using this code:
my_xml <- read_xml("my_xml.xml")
usernodes <- xml_find_all(my_xml, ".//PC-Substance")
ui <- sapply(usernodes, function(n){xml_text(xml_find_first(n, ".//Object-id_str"))})
However, things go awry when trying to find the nodes: I keep getting a list of length 0.
Any help appreciated.
Thank you!
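One likely culprit, offered as a guess from the snippet above: the document declares a default namespace (xmlns="http://www.ncbi.nlm.nih.gov"), and unprefixed XPath names like .//PC-Substance match nothing in a default namespace. A minimal sketch using xml2::xml_ns_strip() to drop the namespaces before searching:

library(xml2)
my_xml <- read_xml("my_xml.xml")
xml_ns_strip(my_xml)  # remove namespaces so bare element names match
# One ID per substance; collapse into a one-column data frame.
ids <- xml_text(xml_find_all(my_xml, "//PC-Substance//Object-id_str"))
df <- data.frame(Object_id = ids)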
This is my first time working with XML data, and I'd appreciate any help/advice that you can offer!
I'm working on pulling some data that is stored on AWS in a collection of XML files. I have an index file that contains a list of the ~200,000 URLs where the XML files are hosted. I'm currently using the XML package in R to loop through each URL and pull the data from the node that I'm interested in. This works fine, but with so many URLs, the loop takes around 12 hours to finish.
Here's a simplified version of my code. The index file contains the list of URLs. The parsed XML files aren't very large (stored as dat in this example...R tells me they're 432 bytes). I've put NodeOfInterest in as a placeholder for the spot where I'd normally list the XML tag that I'd like to pull data from.
for (i in 1:200000) {
  url <- paste('http://s3.amazonaws.com/', index[i, 9], '_public.xml', sep = "")  ## create URL based on index file
  dat <- xmlTreeParse(url, useInternal = TRUE)  ## load entire XML file
  nodes <- getNodeSet(dat, "//x:NodeOfInterest", "x")  ## find nodes for the tag I'm interested in
  if (length(nodes) > 0 & exists("dat")) {
    dat2 <- xmlToDataFrame(nodes)  ## create data table from nodes
    compiled_data <- rbind(compiled_data, dat2)  ## append to master file
    rm(dat2)
  }
  print(i)
}
It seems like there must be a more efficient way to pull this data. I think the longest step (by far) is loading the XML into memory, but I haven't found anything that suggests another option. Any advice?
Thanks in advance!
If parsing the XML into a tree (in xmlTreeParse) is your bottleneck, maybe use a streaming SAX interface, which lets you process only those elements that are useful for your application. I haven't used it myself, but the XML package exposes a SAX-style interface via xmlEventParse (both it and the newer xml2 package are built on top of libxml2, which provides the SAX machinery).
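For instance, a minimal event-driven sketch with XML::xmlEventParse, assuming the element of interest is literally named NodeOfInterest (the placeholder from the question), that url is built as in the loop above, and that the element's text content is what you want to collect:

library(XML)
values <- character(0)
inNode <- FALSE
handlers <- list(
  startElement = function(name, attrs) {
    if (name == "NodeOfInterest") inNode <<- TRUE
  },
  text = function(content) {
    if (inNode) values <<- c(values, content)
  },
  endElement = function(name) {
    if (name == "NodeOfInterest") inNode <<- FALSE
  }
)
# Streams the document: only matching elements are ever materialised.
xmlEventParse(url, handlers = handlers)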
When applying an R Transform (field operation) node in SPSS Modeler, the system automatically adds the following code at the top of every script to interface with the R add-on:
while (ibmspsscfdata.HasMoreData()) {
  modelerDataModel <- ibmspsscfdatamodel.GetDataModel()
  modelerData <- ibmspsscfdata.GetData(rowCount=1000, missing=NA, rDate="None", logicalFields=FALSE)
Please note "rowCount=1000". When I process a table with >1000 rows (which is very normal), errors occur.
Looking for a way to change the default setting or any way to help to process table >1000 rows!
I've tried adding this at the beginning of my code and it works just fine:
while (ibmspsscfdata.HasMoreData()) {
  modelerData <- rbind(modelerData, ibmspsscfdata.GetData(rowCount=1000, missing=NA, rDate="None", logicalFields=FALSE))
}
Note that you will consume a lot of memory with "big data", and the parameters of the .GetData() function should be set according to the "Read Data Options" in the node settings.
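If memory becomes the constraint, one standard mitigation (a sketch only, assuming the same ibmspsscfdata API as above) is to accumulate the chunks in a list and bind them once, instead of growing modelerData with rbind() on every iteration:

chunks <- list(modelerData)
while (ibmspsscfdata.HasMoreData()) {
  # Each GetData() call returns the next chunk of up to 1000 rows.
  chunks[[length(chunks) + 1L]] <-
    ibmspsscfdata.GetData(rowCount=1000, missing=NA, rDate="None", logicalFields=FALSE)
}
modelerData <- do.call(rbind, chunks)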