how to use contains and not contains on different classes in xpath - r

I'm struggling with this simple code.
<div id="post_message_975824" class="alt3">
<div class="quote">
some unwanted text
</div>
the text to get <abr>ABR</abr> text to get
</div>
and I want to get this to work:
xpath = "//*[contains(@id, 'post_message_') and not(contains(@class,'quote'))]"
but this fails. I tried another query as well, but I'm not sure what I'm doing wrong.
EDIT
I found this code works:
xpath = "//*[contains(@id,'post_message_')]//div[not(contains(@class,'quote'))]"
but it doesn't select the desired text when there's no quote div in the HTML.
The idea is to get all the text from all subnodes as well, except from the excluded ones.

Try this XPath:
//div[contains(@id,'post_message_')]/text() | //div[contains(@id,'post_message_')]/*[not(contains(@class,'quote'))]/text()
The first part, //div[contains(@id,'post_message_')]/text(), selects the text nodes directly under the parent div, i.e. <div id="post_message_975824" class="alt3">.
The second part, //div[contains(@id,'post_message_')]/*[not(contains(@class,'quote'))]/text(), selects the text nodes of each child element whose class attribute does not contain quote (children with no class attribute at all are included too).
The result on your example is :
the text to get
ABR
text to get
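For comparison, here is a minimal sketch of the same selection logic in Python using only the standard library. It is a hand-written traversal, not an XPath evaluation, because xml.etree.ElementTree supports only a small XPath subset without contains(). The helper name wanted_text is made up for illustration, and the sketch assumes the fragment is well-formed enough to parse as XML.

```python
import xml.etree.ElementTree as ET

HTML = """<div id="post_message_975824" class="alt3">
<div class="quote">
some unwanted text
</div>
the text to get <abr>ABR</abr> text to get
</div>"""


def wanted_text(post):
    """Collect text under `post`, skipping children whose class contains 'quote'."""
    parts = [post.text or ""]
    for child in post:
        # Children with no class attribute are kept, as in the XPath above.
        if "quote" not in child.get("class", ""):
            parts.append("".join(child.itertext()))
        # Text that follows a child belongs to the parent, so always keep it.
        parts.append(child.tail or "")
    # Collapse runs of whitespace and newlines into single spaces.
    return " ".join("".join(parts).split())


root = ET.fromstring(HTML)
# Mirror contains(@id, 'post_message_') with a plain substring test.
posts = [el for el in root.iter("div") if "post_message_" in el.get("id", "")]
for post in posts:
    print(wanted_text(post))  # prints: the text to get ABR text to get
```

Unlike the XPath union, this returns one normalized string per post rather than separate text nodes.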

Why not just remove all the nodes you don't want?
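library(xml2)
library(magrittr)  # provides the %>% pipe
doc <- read_xml('<div id="post_message_975824" class="alt3">
<div class="quote">
some unwanted text
</div>
the text to get <abr>ABR</abr> text to get
</div>')
# Find every quote div and remove it from the document in place
xml_find_all(doc, ".//div[@class='quote']") %>% xml_remove()
xml_text(doc)  # text of what is left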
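The same remove-then-read idea can be sketched in Python with the standard library. One detail to watch: ElementTree stores the text that follows an element on that element's .tail, so a plain remove() would also drop "the text to get". The helper names below (remove_keep_tail, strip_quotes) are hypothetical, and the exact match on class="quote" is an assumption taken from the example.

```python
import xml.etree.ElementTree as ET

HTML = """<div id="post_message_975824" class="alt3">
<div class="quote">
some unwanted text
</div>
the text to get <abr>ABR</abr> text to get
</div>"""


def remove_keep_tail(parent, child):
    """Remove `child` from `parent` but keep the text that followed it."""
    siblings = list(parent)
    idx = siblings.index(child)
    tail = child.tail or ""
    if idx == 0:
        # No previous sibling: the tail becomes part of the parent's text.
        parent.text = (parent.text or "") + tail
    else:
        prev = siblings[idx - 1]
        prev.tail = (prev.tail or "") + tail
    parent.remove(child)


def strip_quotes(root):
    """Remove every element with class="quote" anywhere below `root`."""
    # Snapshot the tree first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.get("class") == "quote":
                remove_keep_tail(parent, child)
    return root


doc = strip_quotes(ET.fromstring(HTML))
remaining = " ".join("".join(doc.itertext()).split())
print(remaining)  # prints: the text to get ABR text to get
```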

Related

get text inside elements from class name using scrapy

How can I get the first text, I mean "Quotes to Scrape", from the following element by class name, using Scrapy in Python?
<div class="col-md-8">
<h1>
Quotes to Scrape
</h1>
</div>
Thanks for your time.
Here is a reasonable list of selectors both for css and xpath.
The element has no class, but you can get the text like this:
response.css('h1 a::text').get()
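If you want to try the "select by the parent's class" idea outside Scrapy, here is a small sketch with Python's standard library. It is not Scrapy code: it uses ElementTree's limited XPath subset, and the surrounding body element is added here only so that the div can be found as a descendant.

```python
import xml.etree.ElementTree as ET

PAGE = """<body>
<div class="col-md-8">
<h1>
Quotes to Scrape
</h1>
</div>
</body>"""

page = ET.fromstring(PAGE)
# The h1 has no class of its own, so go through the parent div's class.
h1 = page.find(".//div[@class='col-md-8']/h1")
# itertext() also picks up text wrapped in a nested element such as a link.
title = "".join(h1.itertext()).strip()
print(title)  # prints: Quotes to Scrape
```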

Unable to locate element by div class

I'm trying to check whether an element has focus by matching the header's class and text, but I get an error saying the element cannot be located. I know the header title, which is 'My Details' in this example. How do I locate the element using this title?
<div class="attribute-group-header card__header">
<h3 class="attribute-group-title card__header-title">My Details</h3>
</div>
Element should be focused //div[contains(.,'My Details')
To locate the h3 in your example code, use this xpath //h3[contains(text(),'My Details')]
To locate the div which has card__header in class, use this xpath //div[contains(@class,'card__header')]
It worked fine with this keyword and the XPath reference. Thank you all for guiding me to the solution.
Element should be enabled //h3[contains(text(),'${MyLinkText}')]

Unable to get value from xpath

I have the HTML below, from which I want to extract the text "Extracted Text" inside the last tag using an XPath or CSS selector. The text "value" inside the second tag changes all the time, and we have that value stored in a variable. So I want to write code that parses the HTML below and extracts the text.
<div>
<div>value</div>
<div class="a">
<div>
<div>Extracted Text</div>
</div>
</div>
</div>
I have tried the code below:
response.xpath('//div[div="variable"]//div/div/text()')
but it didn't work. Please help.
This xpath does what you want
'//div[text()="value"]/following-sibling::div/div/div/text()'
Tested on command line
xmllint --html --xpath '//div[text()="value"]/following-sibling::div/div/div/text()' test.html
Extracted Text
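Since the label text lives in a variable, here is a rough sketch of how the lookup could be parameterized in Python with the standard library. ElementTree has no following-sibling axis, so the sibling step is done by hand. The function name text_after_label and the wrapping body element are assumptions for illustration. Note that this takes only the next sibling, while following-sibling::div in the XPath above matches every later div sibling.

```python
import xml.etree.ElementTree as ET

PAGE = """<body><div>
<div>value</div>
<div class="a">
<div>
<div>Extracted Text</div>
</div>
</div>
</div></body>"""


def text_after_label(root, label):
    """Return the text nested in the sibling right after the div whose text is `label`."""
    for parent in root.iter():
        children = list(parent)
        # The last child has no following sibling, so it is skipped.
        for i, child in enumerate(children[:-1]):
            if child.tag == "div" and (child.text or "").strip() == label:
                inner = children[i + 1].find("./div/div")
                if inner is not None:
                    return (inner.text or "").strip()
    return None


root = ET.fromstring(PAGE)
variable = "value"  # the changing text, held in a variable
print(text_after_label(root, variable))  # prints: Extracted Text
```

Passing the label as an argument avoids the problem in the original attempt, where the literal string "variable" was compared instead of the variable's contents.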

xquery- how to get content of a node which is immediately after a node with known text

I am trying to extract content from an XHTML document. In this document, within a div, there are a number of 'b' elements, each followed by a link.
For example:
<div id="main">
<b> Bold text 1</b>
some link 1
<b> Bold text 2</b>
some link 2
<b> ABRACADABRA</b>
abracadbralink
</div>
Now, I want to extract the link 'abracadbralink'. The problem is that I don't know how many b and a elements there are before this specific link: different documents have a different number of such elements, and sometimes there are many links immediately after a single b element. All I do know is that the text of the b element that occurs just before the link I want is always fixed.
So the only fixed information is that I want the link immediately after the element with known text. How do I get this link using XQuery?
If I understand correctly, you are interested in the value of the @href attribute? This can be done with standard XPath syntax:
doc('yourdoc.xml')//*[. = ' abracadbralink']/@href/string()
For more information on XPath, I’d advise you to check out some online tutorials, such as http://www.w3schools.com/xpath/default.asp
I guess the following should work for you:
$yournode/b[. = ' ABRACADABRA']/following-sibling::a[1]/@href/string()
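To illustrate the "first link after the b element with known text" logic outside XQuery, here is a small Python sketch using the standard library. The a elements and their href values are hypothetical, since the links in the question's sample were not preserved, and href_after is a made-up helper name. Unlike the XPath comparison above, it strips whitespace before comparing the label.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction: the href values are invented for this sketch.
DOC = """<div id="main">
<b> Bold text 1</b>
<a href="/link1">some link 1</a>
<b> Bold text 2</b>
<a href="/link2">some link 2</a>
<b> ABRACADABRA</b>
<a href="/target">abracadbralink</a>
<a href="/extra">another link</a>
</div>"""


def href_after(parent, label):
    """Return the href of the first <a> that follows the <b> whose text is `label`."""
    seen_label = False
    for child in parent:
        if seen_label and child.tag == "a":
            return child.get("href")
        if child.tag == "b" and (child.text or "").strip() == label:
            seen_label = True
    return None


main = ET.fromstring(DOC)
print(href_after(main, "ABRACADABRA"))  # prints: /target
```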

Extracting text fragment from a HTML body (in .NET)

I have HTML content that users enter via a rich-text editor, so it can be almost anything (except elements that don't belong inside the body tag; no need to worry about "head", doctype, etc.).
An example of this content:
<h1>Header 1</h1>
<p>Some text here</p><p>Some more text here</p>
<div align=right>A link here</div><hr />
<h1>Header 2</h1>
<p>Some text here</p><p>Some more text here</p>
<div align=right>A link here</div><hr />
The trick is, I need to extract only the first 100 characters of the text (HTML tags stripped). I also need to retain the line breaks and not break any word.
So the output for the above will be something like:
Header 1
Some text here
Some more text here
A link here
Header 2
Some text here
Some
It has 98 characters and the line breaks are retained. What I have achieved so far is stripping all the HTML tags with a regex:
Regex.Replace(htmlStr, "<[^>]*>", "")
and then trimming the length, also with a regex:
Regex.Match(textStr, @"^.{1,100}\b").Value
My problem is how to retain the line breaks. I get output like:
Header 1
Some text hereSome more text here
A link here
Header 2
Some text hereSome more text
Notice the sentences running together? Perhaps someone can show me some other way of solving this problem. Thanks!
Additional info: my purpose is to generate a plain-text synopsis from a bunch of HTML content. I guess this will help clarify the problem.
I think how I would solve this is to look at it as though it were a simple browser. Create a base Tag class, make it abstract with maybe an InnerHTML property and a virtual method PrintElement.
Next, create classes for each HTML tag that you care about and inherit from your base class. Judging from your example, the tags you care most about are h1, p, a, and hr. Implement the PrintElement method such that it returns a string that prints out the element properly based on the InnerHTML (such as the p class' PrintElement would return "\n[InnerHTML]\n").
Next, build a parser that will parse through your HTML and determine which object to create and then add those objects to a queue (a tree would be better, but doesn't look like it's necessary for your purposes).
Finally, go through your queue calling the PrintElement method for each element.
It may be more work than you had planned, but it's a far more robust solution than simply using regex, and should you change your mind in the future and want to show simple styling, it's just a matter of going back and modifying your PrintElement methods.
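A simplified variant of this parser idea, sketched in Python rather than .NET: instead of one class per tag, a single event handler emits a line break around block-level tags, and a second helper trims the result at a word boundary. The names TextExtractor, html_to_text and truncate_words, and the BLOCK_TAGS set, are assumptions made for illustration; the character count may differ slightly from the 98 in the question, depending on how line breaks are counted.

```python
from html.parser import HTMLParser

# Assumed set of tags that should start or end a line.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "hr", "br"}

SAMPLE = """<h1>Header 1</h1>
<p>Some text here</p><p>Some more text here</p>
<div align=right>A link here</div><hr />
<h1>Header 2</h1>
<p>Some text here</p><p>Some more text here</p>
<div align=right>A link here</div><hr />"""


class TextExtractor(HTMLParser):
    """Collect text, inserting a newline around block-level tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Drop empty lines so each block ends up on exactly one line.
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


def truncate_words(text, limit=100):
    """Cut `text` to at most `limit` characters without splitting a word."""
    if len(text) <= limit:
        return text
    cut = text[:limit + 1]
    # Back up to the last whitespace so a partial word is dropped.
    idx = max(cut.rfind(" "), cut.rfind("\n"))
    return cut[:idx].rstrip() if idx > 0 else text[:limit]


synopsis = truncate_words(html_to_text(SAMPLE), 100)
print(synopsis)
```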
For info, stripping html with a regex is... full of subtle problems. The HTML Agility Pack may be more robust, but still suffers from the words bleeding together:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.InnerText;
One way could be to strip html in three steps:
htmlStr = Regex.Replace(htmlStr, "<[^/>]*>", ""); // strip tags that contain no slash (most opening tags); leaves </...> and <hr />
htmlStr = Regex.Replace(htmlStr, "</p>", "\r\n"); // replace every paragraph end with a new line
htmlStr = Regex.Replace(htmlStr, "<[^>]*>", ""); // strip the remaining tags
Well, I need to close this, though without an ideal solution. Since the HTML tags used in my app are very common ones (no tables, lists, etc.) with little or no nesting, what I did was preformat the HTML fragments before saving them after user input:
Remove all line breaks
Add a line break prefix to all block tags (e.g. div, p, hr, h1/2/3/4 etc)
Then, before I extract them to be displayed as plain text, I use a regex to remove the HTML tags and retain the line breaks. Hardly rocket science, but it works for me.
