Retrieve Image Source with XPath in R using XML-Package - r

I'd like to retrieve the image source within the below HTML code block, but can't find the right syntax.
library(XML)
library(RCurl)
script <- getURL("http://www.floraweb.de/pflanzenarten/druck.xsql?suchnr=4346")
(doc <- htmlParse(script))
<div class="divider"><hr></div>
<div id="contentblock"><div id="content">
<h1>Alle Angaben</h1>
<p>Zu der von Ihnen gewählten Pflanzenart liegen folgende Informationen vor:</p>
<p>Wissenschaftlicher Name: Poa badensis agg. </p>
<p>Deutscher Name: Artengruppe Badener Rispengras</p>
<p>Familienzugehörigkeit: Poaceae, Süßgräser</p>
<p>Status: keine Angaben </p>
<p class="centeredcontent"><img border="0" src="../bilder/Arten/dummy.tmb.jpg"></p>
Desired result:
"../bilder/Arten/dummy.tmb.jpg"
Any pointers are greatly appreciated!

Try the following:
library(XML)
library(RCurl)
script <- getURL("http://www.floraweb.de/pflanzenarten/druck.xsql?suchnr=4346")
doc <- htmlTreeParse(script, useInternalNodes = TRUE)
# xmlAttrs returns all attributes of the img node (border and src)
img <- xpathSApply(doc, '//*/p[@class="centeredcontent"]/img', xmlAttrs)
> img[2]
[1] "../bilder/Arten/dummy.tmb.jpg"
Using the internal node representation may be necessary.
EDIT:
I just looked up htmlParse, and it's equivalent to htmlTreeParse(useInternalNodes = TRUE).
@Martin Morgan Thanks, I have added your suggestion below:
doc <- htmlParse("http://www.floraweb.de/pflanzenarten/druck.xsql?suchnr=4346")
xpathSApply(doc, '//*/p[@class="centeredcontent"]/img/@src')

Use:
//div[@id='contentblock']
/div/p[@class='centeredcontent']
/img/@src
This selects the src attribute of any img element that is a child of a p element whose class attribute has the value "centeredcontent", where that p element is a child of a div that is itself a child of a div whose id attribute has the value "contentblock".
If you want to get the value of this attribute directly, use:
string(//div[@id='contentblock']
/div/p[@class='centeredcontent']
/img/@src)
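To try the same location path outside R, here is a minimal sketch in Python using only the standard library. It assumes a well-formed copy of the fragment from the question (the live page is HTML and would need an HTML parser), and the names FRAGMENT, img and src are only illustrative. ElementTree supports just a subset of XPath, so the sketch selects the img element and reads the attribute with get() instead of using /@src or string().

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the fragment in the question (assumption:
# the real page is HTML, not XML, and would need an HTML parser).
FRAGMENT = """
<div id="contentblock"><div id="content">
<h1>Alle Angaben</h1>
<p>Status: keine Angaben</p>
<p class="centeredcontent"><img border="0" src="../bilder/Arten/dummy.tmb.jpg"/></p>
</div></div>
""".strip()

# The parsed root is the div with id="contentblock".
root = ET.fromstring(FRAGMENT)

# Step down div/p/img from the root, filtering p by its class attribute.
img = root.find("./div/p[@class='centeredcontent']/img")

# Read the attribute from the element; None if nothing matched.
src = img.get("src") if img is not None else None
print(src)
```

Selecting the element first and reading the attribute afterwards is also a reasonable pattern in R, where xmlGetAttr can be applied to the matched img nodes.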

Related

XPath and CSS in Oxygen XML Author: How to create a dynamic parameter in oxy_xpath

I am creating an Oxygen framework to display XML data in Oxygen XML Author's author mode. This is part of the XML I have with two nodes <ab> in it:
<TEI>
<text>
<body>
<div n="A">
<ab xml:id="n_d2e23" type="person">
<seg type="name">
<persName>
<surname>Aarberg</surname>,
<forename>Peter von</forename>
</persName>
<roleName>König</roleName>
(<date from="ca. 1300" to="vor 1372">ca. 1300–vor 1372</date>)
</seg>
<seg type="affiliations">
<list>
<item>
<affiliation role="CEO" hkg:orgKey="#n_123_456">Best CEO they ever had</affiliation>
</item>
</list>
</seg>
</ab>
<ab xml:id="n_123_456" source="SW_EB" type="organization">
<seg type="name">
<orgName>Altenburger Hofdruckerei</orgName>
</seg>
</ab>
</div>
</body>
</text>
</TEI>
The first <ab> node has an attribute hkg:orgKey="#n_123_456" which is referring to the second <ab> node's attribute xml:id="n_123_456". I use the following CSS to display the value of <orgName> of the second <ab> node in the visual representation of the first <ab> node in Oxygen XML Author:
affiliation:after {
    content: " role: " oxy_textfield(
            edit, "@role")
        " Organization ID: " oxy_textfield(
            edit, "@hkg:orgKey")
        "Organization name: " oxy_xpath(
            "/TEI/text/body/div/ab[@xml:id='n_123_456']/seg[@type='name']/orgName/text()"
        )
        " " oxy_url("gfx/link_register.png");
    link: attr("hkg:orgKey")
}
This works fine: the value of <orgName> of the second <ab> node, "Altenburger Hofdruckerei", is displayed within the first <ab> node, as long as I hard-code the id that hkg:orgKey="#n_123_456" points to. But now I need to create the line
/TEI/text/body/div/ab[@xml:id='n_123_456']/seg[@type='name']/orgName/text()
dynamically. That is, instead of the specific value xml:id='n_123_456', it should use whatever the value of the attribute @hkg:orgKey inside the first <ab> element is. I tried this:
/TEI/text/body/div/ab[@xml:id=@hkg:orgKey]/seg[@type='name']/orgName/text()
but it doesn't work. I also tried other variations such as ab[@xml:id='@hkg:orgKey'] or ab[@xml:id=attr('hkg:orgKey')] and many more, but none of them gave me the expected result.
Maybe it is a syntax problem. I really hope there will be a solution to this and I would be very thankful for assistance. Any help is appreciated.
I tried to send as much code as needed but of course shortened some parts not relevant in this context. If something is missing (or too much) please let me know.
Thanks in advance.
I will assume you have already declared in the CSS a mapping for the "hkg" prefix like:
@namespace hkg "someNamespace";
I would replace:
link: attr("hkg:orgKey")
with:
link: attr(hkg|orgKey);
because in CSS you refer to namespaced elements with "prefix|elementName" instead of "prefix:elementName".
As for the main question, I would replace this line:
oxy_xpath("/TEI/text/body/div/ab[@xml:id='n_123_456']/seg[@type='name']/orgName/text()")
with:
oxy_xpath(oxy_concat("/TEI/text/body/div/ab[@xml:id='", oxy_substring(attr(hkg|orgKey), 1), "']/seg[@type='name']/orgName/text()"))
I'm using oxy_concat to step outside of the string literal, evaluate the attribute value and use its value in the larger XPath expression. I used "oxy_substring" to remove the "#" from the attribute reference.
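The same "strip the #, then splice the value into the path" idea can be sketched outside Oxygen. This is only an illustration in Python with the standard library, not Oxygen CSS: the namespace URI "someNamespace" is the placeholder used above, the document is a shortened copy of the one in the question, and the names org_id, path and org_name are made up for the example.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"  # namespace of the xml: prefix
HKG_NS = "someNamespace"  # placeholder URI, same assumption as in the CSS

# Shortened copy of the document from the question.
DOC = f"""
<TEI xmlns:hkg="{HKG_NS}">
  <text><body><div n="A">
    <ab xml:id="n_d2e23" type="person">
      <seg type="affiliations">
        <affiliation role="CEO" hkg:orgKey="#n_123_456">Best CEO they ever had</affiliation>
      </seg>
    </ab>
    <ab xml:id="n_123_456" type="organization">
      <seg type="name"><orgName>Altenburger Hofdruckerei</orgName></seg>
    </ab>
  </div></body></text>
</TEI>
""".strip()

root = ET.fromstring(DOC)
affiliation = root.find(".//affiliation")

# Counterpart of oxy_substring(attr(hkg|orgKey), 1): drop the leading "#".
org_id = affiliation.get(f"{{{HKG_NS}}}orgKey")[1:]

# Counterpart of oxy_concat(...): splice the value into the path expression.
# ElementTree addresses namespaced attributes as {uri}name.
path = f"./text/body/div/ab[@{{{XML_NS}}}id='{org_id}']/seg[@type='name']/orgName"
org_name = root.find(path).text
print(org_name)
```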

Dynamically generated id in Tomahawk dataList (JSF 1.2)

I cannot figure out how to dynamically generate ids for elements in a <t:dataList> that is inside a <t:dataTable>. The code looks more or less like this:
<t:dataTable value="#{SomeBean.foo}" var="item">
<h:column rendered="true">
<div id="divpvmu">
<t:dataList value="#{item.templates}" var="template" rowCountVar="templateIndex">
<div id="saveBtn">
</div>
</t:dataList>
</div>
</h:column>
</t:dataTable>
Obviously, this code generates a number of divs that all have the same id="saveBtn". I would like to have a distinct id for every generated div. I tried to achieve this with the following change:
<div id="saveBtn_#{templateIndex}">
but then I get an exception:
javax.servlet.jsp.JspException: java.io.IOException: Example.jsp(50,31) JBWEB004178: #{..} is not allowed in template text
Is there any way to generate distinct ids for such a construction? I'm using JSF 1.2.
There is the <t:div> tag, which allows you to use an EL expression in the id attribute.
So something like this should do the job:
<t:div id="saveBtn_#{templateIndex}" forceId="true">
JSF should also generate distinct ids for you automatically:
<t:dataList id="myList" value="#{item.templates}" var="template">
<t:div id="saveBtn">
</t:div>
</t:dataList>
Inside t:dataTable, t:dataList and so on, this tag will get ids like:
myList:0:saveBtn for the first element,
myList:1:saveBtn for the second element, and so on.

Nested elements naming style (Jade, HAML, Slim)

I'm looking for a way to use the SMACSS naming convention with the Jade, Haml or Slim template engine.
Given the following Jade code:
.module
  .child
  .child
I get the following output:
<div class="module">
<div class="child"></div>
<div class="child"></div>
</div>
but I'd like to get the following result:
<div class="module">
<div class="module-child"></div>
<div class="module-child"></div>
</div>
Is there any way to handle this like I can in SASS, for example? That is, I want to avoid adding the 'module-' string to each 'child' manually.
UPDATE
Solutions with Haml and Slim are also acceptable.
This is the closest I got with jade (live playground here):
mixin e(elt)
  - var a = attributes;
  - var cl = attributes.class;delete attributes.class
  - var elt = elt ? elt : 'div' // If no parameter given
  if cl
    - var cl = parent + '-' + cl
  else
    - var cl = parent
  #{elt}&attributes({'class': cl}, attributes)
    block
- var parent = 'box'
+e('aside')#so-special
  +e('h2').title Related
  +e('ul').list
    +e('li').item: +e('a')(href='#').link Item 1
    +e('li').item: +e('span').link.current Item 2 and current
    +e('li').item#third(data-dash='fine', aria-live='polite') Item 3 not even a link
      | multi-line
      | block
    // - var parent = 'other' problem of scope I guess
    +e('li').item lorem ipsum dolor sit amet
- var parent = 'footer'
+e('footer')(role='contentInfo')
  +e.inner © Company - 2014
A mixin named e outputs the element passed as its parameter (div by default) with its attributes and content unchanged, except for the first class, which is prefixed with the value of the variable parent (or is set to the value of parent itself if the element has no class).
I prefer using the default Jade syntax for attributes, including class and id, to passing many parameters to a mixin. This one needs none if the element is a div: .sth text would output <div class="sth">text</div>, and +e.sth text will output <div class="parent-sth">text</div>.
The mixin would be shorter if it didn't have to deal with other attributes (href, id, data-*, role, etc.).
Remaining problem: changing the value of parent has no effect when the assignment is indented. It did work in simpler earlier attempts, so I guess it's related to variable scope. In theory you don't want to change the prefix for child elements, but in practice... Maybe it could be a second, optional parameter?
Things I had problems with while playing with Jade:
attributes doesn't work as expected. It is now &attributes(attributes). Thanks to the jade-book issue on GitHub.
But that outputs the untouched class plus the prefixed one, so I had to remove it (delete) at a point where Jade would execute it.
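To make the prefixing logic easier to follow without the Jade syntax, here is a rough model of it in Python. It is only a sketch: prefixed_attributes and render are hypothetical helper names, the attributes are a plain dict, and only the single-class case is handled.

```python
def prefixed_attributes(parent, attributes):
    """Return a copy of attributes whose class is prefixed with parent."""
    attrs = dict(attributes)        # leave the caller's dict untouched
    cl = attrs.pop("class", None)   # read attributes.class, then remove it
    # Prefix the class, or fall back to the bare parent name if there is none.
    attrs["class"] = f"{parent}-{cl}" if cl else parent
    return attrs


def render(tag, parent, attributes, text=""):
    """Render one element as an HTML string (no escaping, sketch only)."""
    attrs = prefixed_attributes(parent, attributes)
    attr_text = " ".join(f'{key}="{value}"' for key, value in attrs.items())
    return f"<{tag} {attr_text}>{text}</{tag}>"


print(render("h2", "box", {"class": "title"}, "Related"))
print(render("footer", "footer", {"role": "contentInfo"}))
```

Removing the original class before adding the prefixed one mirrors the delete in the mixin; without it, both the untouched and the prefixed class would end up in the output.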
Some thoughts from me: what's wrong with a variable?
- var myModule = 'module'
div(class="#{myModule}")
  div(class="#{myModule}-child")
  div(class="#{myModule}-child")
or combine it with an each:
- var myModule2 = 'foobar'
div(class="#{myModule2}")
  each idx in [0, 1, 2, 3]
    div(class="#{myModule2}-child") I'm child #{idx}
Sure, there is more code to write, but if a change is necessary, you only have to make it in one place.
Ciao
Ralf
You should be able to achieve this with SASS. As long as you have a recent SASS version (3.3 or later), you can use the following syntax:
.module {
  &-child {
  }
}
Have a look at this article for more information on newer features of SASS: http://davidwalsh.name/future-sass

XPATH and CSS for Selenium Automation - Help Required

I want to find an XPath or CSS locator that extracts the text from the following structure.
Kindly help.
<div class="page-header song-wrap">
<div class="art solo-art">
<div class="meta-info">
<h1 class="page-title">
Zehnaseeb
I want a locator/XPath that returns the text "Zehnaseeb" (in this case).
This did not yield any result:
driver.findElement(By.xpath(".//*[@id='main']/div/section/div[1]/div[2]/h1")).getText();
Have you tried waiting for the element?
String text = new WebDriverWait(driver,30).until(ExpectedConditions.visibilityOfElementLocated(By.cssSelector("div.page-header h1.page-title"))).getText();
If you are using C#, I recommend "ScrapySharp"; it's very nice for parsing HTML.
https://bitbucket.org/rflechner/scrapysharp/wiki/Home
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(driver.PageSource);
var zehnaseebstring = htmlDoc.DocumentNode.CssSelect("h1.page-title").SingleOrDefault().InnerText;
This should work.
I would check all the elements in between to make sure the hierarchy is correct, but you could also simplify the expression by skipping the intermediate elements with the descendant axis (//):
//*[@id='main']//h1[@class='page-title']
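To see why // makes the locator less brittle, here is a small sketch in Python with the standard library rather than Selenium. The wrapper elements around the page-header div are invented for the example, since the question does not show them; only the classes and the text come from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical page skeleton: everything above "page-header" is assumed.
PAGE = """
<body>
  <div id="main">
    <div class="wrapper">
      <section>
        <div class="page-header song-wrap">
          <div class="art solo-art"/>
          <div class="meta-info">
            <h1 class="page-title">
              Zehnaseeb
            </h1>
          </div>
        </div>
      </section>
    </div>
  </div>
</body>
""".strip()

root = ET.fromstring(PAGE)

# "//" matches at any depth, so the intermediate div/section levels
# do not have to be spelled out or counted.
h1 = root.find(".//*[@id='main']//h1[@class='page-title']")

# The heading text is surrounded by whitespace in the markup.
title = h1.text.strip()
print(title)
```

If the number of wrapper elements changes, the positional path with div[1]/div[2] stops matching, while the descendant version still finds the heading.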

Mixed mode content - How do I select text from h1 but don't include its child element's text?

I have trouble printing simple text from a <h1> element:
require 'nokogiri'
doc = Nokogiri::HTML("<h1><em>Name</em>A Johnson </h1>")
puts doc.at_xpath("//h1").content
It outputs:
NameA Johnson
I want just A Johnson in the output. Is it possible to select just this text using XPath or CSS selectors?
How about using text() XPath function? Like this (untested though):
require 'nokogiri'
doc = Nokogiri::HTML("<h1><em>Name</em>A Johnson </h1>")
puts doc.at_xpath("//h1/text()").content
These solutions may only give part of the story. Consider:
doc = Nokogiri::HTML("<h1><em>Name</em>A <br>Johnson </h1>")
puts doc.at_xpath("//h1/text()").content
=> A
puts doc.at('h1').children.last.text
=> Johnson
or my suggestion:
puts doc.search("h1/text()").text
=> A Johnson
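The same distinction between an element's own text nodes and the text of its children can be sketched in Python with the standard library. This is not Nokogiri, just an illustration of the idea; the input is the well-formed equivalent of the second example, with br written as <br/>.

```python
import xml.etree.ElementTree as ET

h1 = ET.fromstring("<h1><em>Name</em>A <br/>Johnson </h1>")

# Text directly inside h1 lives in h1.text (before the first child) and in
# each child's .tail (after that child). Text inside <em> is in em.text,
# so it is not picked up here.
own_text = [h1.text] + [child.tail for child in h1]
result = "".join(part for part in own_text if part)

# For comparison: all text, including the text of child elements.
all_text = "".join(h1.itertext())

print(repr(result))
print(repr(all_text))
```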
