getNodeSet {XML} not working when XML root node contains "xmlns" attribute - r

I have an xml document like this
<MasterDataSet xmlns="http://tempuri.org/MasterDataSet.xsd">
<t_attribute>
<class_id>2</class_id>
<description>Latitude</description>
</t_attribute>
<t_object>
<name>Ship</name>
</t_object>
...
</MasterDataSet>
There are many "t_attribute" and "t_object" nodes. I want a node set of all the "t_object" nodes, so I use getNodeSet with this XPath expression:
library("XML")
emtree0 <- xmlParse("EM0.xml", useInternalNodes = TRUE)
onlyobjects <- getNodeSet(emtree0,"/MasterDataSet//t_object")
But this returns an empty list.
However, if I remove the xmlns attribute so that the XML file looks like this, it works perfectly:
<MasterDataSet>
<t_attribute>
...
Any suggestions for making the code work without removing the xmlns attribute?
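A commonly used approach is sketched below, assuming the document declares only the default namespace shown above. XPath 1.0 matches unprefixed names against "no namespace", so the elements have to be addressed through a prefix bound to the namespace URI. The prefix d is arbitrary, the inline string stands in for EM0.xml, and the second t_object ("Port") is made up for illustration.

```r
library(XML)

# Inline stand-in for EM0.xml (assumption: same structure as the file in the question).
# The second <t_object> ("Port") is invented for illustration only.
xml_text <- '<MasterDataSet xmlns="http://tempuri.org/MasterDataSet.xsd">
  <t_attribute><class_id>2</class_id><description>Latitude</description></t_attribute>
  <t_object><name>Ship</name></t_object>
  <t_object><name>Port</name></t_object>
</MasterDataSet>'
emtree0 <- xmlParse(xml_text, asText = TRUE)

# Bind an arbitrary prefix ("d") to the default namespace URI
# and use that prefix on every element step of the XPath expression.
ns <- c(d = "http://tempuri.org/MasterDataSet.xsd")
onlyobjects <- getNodeSet(emtree0, "/d:MasterDataSet//d:t_object", namespaces = ns)

# Extract the <name> text of each matched node.
object_names <- xpathSApply(emtree0, "//d:t_object/d:name", xmlValue, namespaces = ns)
print(length(onlyobjects))
print(object_names)
```

The same namespaces argument is accepted by xpathApply and xpathSApply, so the prefix mapping can be defined once and reused.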

Related

Not Able to fetch element using Get Element in Robot Framework

I have the XML snippet below and I am unable to fetch the element using Get Element.
<configuration commit-localtime="2020-06-27 12:48:13 IST" commit-seconds="1593242293" commit-user="root">
<groups>
<name>group1</name>
<interfaces>
<interface>
<name>&lt;*&gt;</name>
<unit>
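Is xpath=configuration/groups/name incorrect?
I have also tried xpath=name, but that does not work either.
The error is: No element matching 'configuration/groups/name' found
Your XPath is incorrect. Get Element evaluates the expression relative to the element you pass in, so a path that starts with the name of that element itself matches nothing. I have created a small test case as an example; it contains an XPath expression that returns what you want.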
Fetch element in XML document
    ${root} =    Parse XML    ${XML}
    Log    ${root}
    ${first} =    Get Element    ${root}    xpath=.//groups/name
    Should Be Equal    ${first.text}    group1

R SAX Parse attribute in empty element XML

I am new to R and cannot find an example of extracting a specific attribute from an empty element; most examples I found extract the text value of child nodes.
In a nutshell, how do I extract an attribute using xmlEventParse() from XML like this:
<elements>
<element attribute1="value" attribute2="value"/>
<element attribute1="value" attribute2="value"/>
</elements>
Say I want attribute1 of the 2nd element.
Thanks in advance.
Update: Found the solution. It is xmlAttrs(root[[2]])[['attribute1']]
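The expression in the update indexes a parsed tree rather than handling SAX events. For an event-based version, a minimal sketch with xmlEventParse() could look like the following. It assumes the startElement handler receives the tag name and a named character vector of attributes; make_collector is a hypothetical helper name, and the attribute values are changed from "value" to "first"/"second" so the result is distinguishable.

```r
library(XML)

# Same shape as the XML in the question; attribute values are changed for illustration.
xml_text <- '<elements>
  <element attribute1="first" attribute2="x"/>
  <element attribute1="second" attribute2="y"/>
</elements>'

# Hypothetical helper: builds a closure that counts <element> start tags
# and remembers attribute1 of the target-th one.
make_collector <- function(target = 2L) {
  seen <- 0L
  result <- NA_character_
  handlers <- list(
    startElement = function(name, attrs, ...) {
      if (name == "element") {
        seen <<- seen + 1L
        if (seen == target) result <<- unname(attrs["attribute1"])
      }
    }
  )
  list(handlers = handlers, value = function() result)
}

collector <- make_collector(2L)
invisible(xmlEventParse(xml_text, handlers = collector$handlers, asText = TRUE))
second_attr <- collector$value()
print(second_attr)
```

Because the handler only keeps a counter and one string, this avoids building the whole tree in memory, which is the usual reason for choosing SAX over a DOM-style parse.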

Smallest possible piece of code to automatically replace ampersands in XML Doc

For an XML document containing characters that should have been escaped (such as a bare &), I have seen several workarounds. What is the fastest/smallest possible method to either ignore the invalid characters or replace them with the correctly escaped form?
The data is going into a database, and the column that may receive the funny characters (location address) is the least important one.
I'm getting the entity_name parsing error at the dataset.ReadXml command.
Here is my code:
FN = Path.GetFileName(file1).ToString()
xmlFile = XmlReader.Create(Path.Combine(My.Settings.Local_Meter_Path, FN), New XmlReaderSettings())
ds.ReadXml(xmlFile)

reading configuration from text file

I have a txt file with these entries:
indexUrl=http://192.168.2.105:9200
jarFilePath = /home/soumy/lib
How can I read this file from R and get the value of jarFilePath?
I need this to set .jaddClassPath(). I have a problem adding the jar to the classpath because of the difference in path separators between Windows and Linux.
On Linux I want to use
.jaddClassPath(dir("target/mavenLib", full.names=TRUE ))
but on Windows
.jaddClassPath(dir("target\\mavenLib", full.names=TRUE ))
So I am thinking of reading the location of the jar from a property file.
If there is any other alternative, please let me know.
As of Sept 2016, CRAN has the package properties.
It handles = in property values correctly (but does not handle spaces after the first = sign).
Example:
Contents of properties file /tmp/my.properties:
host=123.22.22.1
port=798
user=someone
pass=a=b
R code:
install.packages("properties")
library(properties)
myProps <- read.properties("/tmp/my.properties")
Then you can access the properties like myProps$host, etc. In particular, myProps$pass is a=b as expected.
I do not know whether a package offers a specific interface.
If not, I would first load the data in a data frame using read.table:
myProp <- read.table("path/to/file/filename.txt", header=FALSE, sep="=", row.names=1, strip.white=TRUE, na.strings="NA", stringsAsFactors=FALSE)
sep="=" is obviously the separator, this will nicely separate your property names and values.
row.names=1 says the first column contains your row names, so you can index the data frame by property name to retrieve each property you want. That column is moved into the row names, which leaves the values as the only remaining column.
For instance: myProp["jarFilePath", 1] will return "/home/soumy/lib".
strip.white=TRUE will strip leading and trailing spaces you probably don't care about.
One could conveniently convert the loaded data frame into a named vector for a cleaner way to retrieve the property values: myPropVec <- setNames(myProp[[1]], rownames(myProp)).
Then to retrieve a property value from its name: myPropVec[["jarFilePath"]] will return "/home/soumy/lib" as well.
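If values may themselves contain = (like pass=a=b above) or there are spaces around the first =, a base-R alternative is to split each line at the first = only. This is a minimal sketch, not a full properties parser: read_props is a hypothetical helper name, lines starting with # are assumed to be comments, and a temporary file stands in for the real one.

```r
# Hypothetical helper: read key=value lines, splitting at the FIRST "=" only.
read_props <- function(path) {
  lines <- trimws(readLines(path, warn = FALSE))
  # Drop blank lines and lines assumed to be comments.
  lines <- lines[nzchar(lines) & !startsWith(lines, "#")]
  # Keep only lines that actually contain a separator.
  lines <- lines[grepl("=", lines, fixed = TRUE)]
  eq <- as.integer(regexpr("=", lines, fixed = TRUE))
  keys <- trimws(substr(lines, 1, eq - 1))
  values <- trimws(substr(lines, eq + 1, nchar(lines)))
  setNames(values, keys)
}

# Temporary file standing in for the real properties file.
prop_file <- tempfile(fileext = ".txt")
writeLines(c("indexUrl=http://192.168.2.105:9200",
             "jarFilePath = /home/soumy/lib",
             "pass=a=b"), prop_file)

props <- read_props(prop_file)
print(props[["jarFilePath"]])
print(props[["pass"]])
```

The result is a named character vector, so props[["jarFilePath"]] can be passed to dir(..., full.names=TRUE) in the same way as the hard-coded paths in the question.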

xpath node determination

I'm new to scraping and I'm trying to understand XPath using R. My objective is to create a vector of people from a website (linked at the end of the question). I'm able to do it using:
r<-htmlTreeParse(e) ## e is after getURL
g.k<-(r[[3]][[1]][[2]][[3]][[2]][[2]][[2]][[1]][[4]])
l<-g.k[names(g.k)=="text"]
u<-ldply(l,function(x) {
w<-xmlValue(x)
return(w)
})
However, this is cumbersome and I'd prefer to use XPath. How do I go about referencing the path detailed above? Is there a function for this, or can I somehow submit my path as indexed above?
I've come as far as
xpathApply( htmlTreeParse(e, useInt=T), "//body//text//div//div//p//text()", function(k) xmlValue(k))->kk
But this leaves me with a lot of cleaning up to do, and I assume it can be done better.
Regards,
//M
EDIT: Sorry for the lack of clarity, but I'm new to this and rather confused. Unfortunately the XML document is too large to paste. I guess my question is whether there is some easy way to find the names of these nodes and the structure of the document, besides using view source. I've come a little closer to what I'd like:
getNodeSet(htmlTreeParse(e, useInt=T), "//p")[[5]]->e2
gives me the list of what I want. However, it is still XML with br tags. I thought running
xpathApply(e2, "//text()", function(k) xmlValue(k))->kk
would provide a list that could later be unlisted. However, it provides a list with more garbage than e2 displays.
Is there a way to do this directly, for example:
xpathApply(htmlTreeParse(e, useInt=T), "//p[5]//text()", function(k) xmlValue(k))->kk
Link to the web page (I'm trying to get the names, and only the names, from the page):
getURL("http://legeforeningen.no/id/1712")
I ended up with
xml = htmlTreeParse("http://legeforeningen.no/id/1712", useInternalNodes=TRUE)
(no need for RCurl) and then
sub(",.*$", "", unlist(xpathApply(xml, "//p[4]/text()", xmlValue)))
(subset in xpath) which leaves a final line that is not a name. One could do the text processing in XML, too, but then one would iterate at the R level.
n <- xpathApply(xml, "count(//p[4]/text())") - 1L
sapply(seq_len(n), function(i) {
xpathApply(xml, sprintf('substring-before(//p[4]/text()[%d], ",")', i))
})
Unfortunately, this does not pick up names that do not contain a comma.
Use a mixture of xpath and string manipulation.
#Retrieve and parse the page.
library(XML)
library(RCurl)
page <- getURL("http://legeforeningen.no/id/1712")
parsed <- htmlTreeParse(page, useInternalNodes = TRUE)
Inspecting the parsed variable which contains the page's source tells us that instead of sensibly using a list tag (like <ul>), the author just put a paragraph (<p>) of text split with line breaks (<br />). We use xpath to retrieve the <p> elements.
#Inspection tells us we want the fifth paragraph.
name_nodes <- xpathApply(parsed, "//p")[[5]]
Now we convert to character, split on the <br> tags and remove empty lines.
all_names <- as(name_nodes, "character")
all_names <- gsub("</?p>", "", all_names)
all_names <- strsplit(all_names, "<br />")[[1]]
all_names <- all_names[nzchar(all_names)]
all_names
Optionally, separate the names of people and their locations.
strsplit(all_names, ", ")
Or more prettily with stringr.
library(stringr)
str_split_fixed(all_names, ", ", 2)
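On the question of doing this directly in XPath: //p[5] selects every <p> that is the fifth <p> child of its own parent, whereas (//p)[5] selects the fifth <p> in the whole document, which is what indexing the node set with [[5]] does. The sketch below illustrates the difference on a small inline page; the HTML and the names in it are invented for illustration and do not come from the real site.

```r
library(XML)

# Invented stand-in page: names separated by <br/> inside the second <p> of the document.
html_text <- '<html><body>
  <div><p>Intro</p></div>
  <div><p>Jane Doe, Oslo<br/>Richard Roe, Bergen</p></div>
</body></html>'
doc <- htmlParse(html_text, asText = TRUE)

# //p[2] looks for a <p> that is the second <p> among its siblings: none here.
sibling_match <- getNodeSet(doc, "//p[2]")

# (//p)[2] takes the second <p> in document order; text() returns the pieces between <br/>.
pieces <- xpathSApply(doc, "(//p)[2]/text()", xmlValue)
pieces <- trimws(pieces)
pieces <- pieces[nzchar(pieces)]

# Keep only the part before the first comma, as in the sub() call above.
people <- sub(",.*$", "", pieces)
print(people)
```

Selecting the text nodes this way avoids converting the node to a string and splitting on the <br /> markup by hand.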
