Here is an extraction case for which I would like to know whether Jsoup, or possibly any other HTML parser, offers a native method. Suppose I have the following page, from which I want to extract "StackOverFlow" and any nearby text that would form a proper sentence.
<html> <head><title>A test page </title></head>
<body>
<p> Not relevant 1. </p>
<p> Not relevant 2.
<em> word1 word2 word3 <b> StackOverFlow </b> word4 word5 word6 </em>
</p>
</body>
</html>
The text that should be extracted is: word1 word2 word3 StackOverFlow word4 word5 word6
and not this: Not relevant 2. word1 word2 word3 StackOverFlow word4 word5 word6
In other words, is there a way of identifying sentence boundaries in Jsoup? One could think of some regular expressions, but I wonder if there is a better solution.
Try this:
doc.select("em").text();
The best way is to use CSS (jQuery-like) selectors.
Please also read about "combinators", so that you can control which element your target element must be a child of.
http://jsoup.org/apidocs/org/jsoup/select/Selector.html
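The idea behind selecting the em element and taking its text can be sketched outside Jsoup as well. The following is a minimal Python analogue using only the standard library's html.parser; the class name EmTextCollector is hypothetical, and this is not Jsoup's API, only an illustration of "collect every text node nested inside em and normalise whitespace".

```python
from html.parser import HTMLParser


class EmTextCollector(HTMLParser):
    """Collects text nested inside <em> elements, including text in child tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <em> elements currently open
        self.chunks = []  # raw text pieces found inside <em>

    def handle_starttag(self, tag, attrs):
        if tag == "em":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "em" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self.chunks).split())


page = """<html> <head><title>A test page </title></head>
<body>
<p> Not relevant 1. </p>
<p> Not relevant 2.
<em> word1 word2 word3 <b> StackOverFlow </b> word4 word5 word6 </em>
</p>
</body>
</html>"""

collector = EmTextCollector()
collector.feed(page)
collector.close()
em_text = collector.text()
print(em_text)
```

Because only text inside em is collected, "Not relevant 2." from the enclosing paragraph is left out.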
Related
I have a <h1> class that pulls a title string from an API. The example I have is a string of 3 words like "word1 word2 word3", and it looks like
Word1 Word2
Word3
However, the designs specify that it should be "bottom heavy" like so:
Word1
Word2 Word3
How do I force that with CSS? I can't exactly add a <br/> to split up the title, since I'm pulling the entire title from an API and shouldn't be modifying it.
I'm working on a corpus of email messages, and trying to replace all HTML tags in the corpus with the string '<html>'. How can I replace all HTML tags using the fact that they begin with < and end with > ?
Example:
<html>
<body>
This is some random text.
<p>This is some text in a paragraph.</p>
</body>
</html>
Should be translated to:
<html>
<html>
This is some random text.
<html>This is some text in a paragraph.<html>
<html>
<html>
Thanks
You should use the power of regex with gsub. If you simply want to replace any <markup_name> with <html>, then gsub("<[^>]+>", "<html>", email_text) will do it.
The trick is [^>]+, which extends (+) the match up to the first >, since [^>] matches any character that is not >.
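For readers working outside R, the same negated-character-class pattern can be sketched in Python with re.sub. The variable names below are illustrative; the sample text is the one from the question, without indentation.

```python
import re

email_text = """<html>
<body>
This is some random text.
<p>This is some text in a paragraph.</p>
</body>
</html>"""

# "<", then one or more characters that are not ">", then the closing ">".
tag_pattern = re.compile(r"<[^>]+>")
replaced = tag_pattern.sub("<html>", email_text)
print(replaced)
```

Since [^>] can never consume a >, each match stops at the first closing bracket, so two adjacent tags are never merged into a single match.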
Here's another method, offered only for completeness since it is less general than @Math's solution, which I consider superior. The thinking is that one might also use the range-quantifier pattern operators {n,m}. It probably has many deficiencies. It also brings to mind a famous SO answer: RegEx match open tags except XHTML self-contained tags
dat <- "<html>
<body>
This is some random text.
<p>This is some text in a paragraph.</p>
</body>
</html>"
gsub("<.{1,5}>", "<html>", dat)
#[1] "<html>\n <html>\n This is some random text.\n <html>This is some text in a paragraph.<html>\n<html>\n<html>"
> cat( gsub("<.{1,5}>", "<html>", dat) )
<html>
<html>
This is some random text.
<html>This is some text in a paragraph.<html>
<html>
<html>
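One of the deficiencies of the bounded quantifier can be shown with a small sketch, here in Python rather than R. The sample string is hypothetical: any tag whose content between < and > is longer than five characters is simply not matched by <.{1,5}>, whereas the negated character class has no such limit.

```python
import re

# Hypothetical input with one long tag name (<strong>) and one short one (<p>).
sample = "<strong>bold</strong> and <p>short</p>"

# Range-quantifier pattern: at most 5 characters between "<" and ">".
bounded = re.sub(r"<.{1,5}>", "<html>", sample)

# Negated character class: any number of non-">" characters.
negated = re.sub(r"<[^>]+>", "<html>", sample)

print(bounded)
print(negated)
```

The first result leaves <strong> and </strong> untouched, while the second replaces all four tags.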
Take a look at the sample XML below--
<div id="sample">
<b>Some text</b>
: Demetra R. Smith
<b> Some more text </b>
</div>
Now, is the text "Demetra R. Smith" a "child node" of the 'b' node (the one that contains the text 'Some text')? Or is it a 'next-sibling'?
How do you determine if some content is a sibling or a child-- esp. in this case the text "Demetra R. Smith" is not enclosed in any tag (else I would not be asking this question)?
"Some text" is a child (text) node of the first <b/>-node. ": Demetra R. Smith" is a sibling, in this case a following-sibling of the <b/>-node (XPath has no next-sibling axis).
You could access it using
/div/b[1]/following-sibling::text()[1]
which selects the first <b/>-node inside the (each) <div/>-element, looks for all text nodes on the following-sibling-axis and limits to the first of them. It will return
: Demetra R. Smith
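The child versus sibling distinction can also be checked programmatically. The sketch below uses Python's standard xml.etree.ElementTree, which does not implement the following-sibling axis; instead it exposes the text that follows an element as that element's tail attribute, which plays the same role here.

```python
import xml.etree.ElementTree as ET

xml_source = """<div id="sample">
<b>Some text</b>
: Demetra R. Smith
<b> Some more text </b>
</div>"""

div = ET.fromstring(xml_source)
first_b = div.find("b")  # first <b> child of <div>

# Text inside the element: a child text node of <b>.
child_text = first_b.text

# Text after the element's end tag: a sibling of <b>, stored by ElementTree as .tail.
sibling_text = (first_b.tail or "").strip()

print(repr(child_text))
print(repr(sibling_text))
```

The name belongs to the div's content, not to the b element, which is why it shows up in tail rather than text.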
I'm using python docutils and the rst2html.py script to convert restructured text to html.
I want to convert a line like this:
Test1 `(link1) <C:/path with spaces/file.html>`_
Into something like this:
<p>Test1 <a class="reference external" href="C:/path with spaces/file.html">(link1)</a>
But instead I get this (spaces in path are dropped):
<p>Test1 <a class="reference external" href="C:/pathwithspaces/file.html">(link1)</a>
How do I preserve the whitespace in links?
I don't know how you are reading the line from the file (or stdin), but you should convert the link-related string to HTML entities. You can find more information under "Escaping HTML" on the Python Wiki.
Hope this helps.
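A possible alternative, offered here as an assumption rather than something confirmed in this thread, is to percent-encode the spaces (a different technique from HTML entities) so that the link target contains no whitespace to drop. The helper rst_link below is hypothetical; quote comes from Python's standard urllib.parse.

```python
from urllib.parse import quote


def rst_link(label, target):
    """Build an reStructuredText embedded link with a percent-encoded target.

    Hypothetical helper: "/" and ":" are kept as-is, spaces become %20.
    """
    encoded = quote(target, safe="/:")
    return f"`{label} <{encoded}>`_"


line = "Test1 " + rst_link("(link1)", "C:/path with spaces/file.html")
print(line)
```

Whether the resulting href is acceptable depends on how the link is consumed, so this is worth checking against the rendered output before relying on it.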
I am working on extracting text out of HTML documents and storing it in a database. I am using the webharvest tool for extracting the content. However, I am stuck at one point. Inside webharvest I use an XQuery expression in order to extract the data. The HTML document that I am parsing is as follows:
<td><a name="hw">HELLOWORLD</a>Hello world</td>
I need to extract "Hello world" text from the above html script.
I have tried extracting the text in this fashion:
$hw := data($item//a[@name='hw']/text())
However what I always get is "HELLOWORLD" instead of "Hello world".
Is there a way to extract "Hello world"? Please help.
What if I want to do it this way:
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
I would like to extract the text "Hello world2" that is in between hw2 and hw3. I would not want to use text()[3], but is there some way I could extract the text between /a[@name='hw2'] and /a[@name='hw3']?
First of all, you are looking for the a nodes whose name attributes start with 'hw'. This can be achieved with the following path:
$item//a[starts-with(@name,'hw')]
Once you have found your a nodes you want to retrieve the first text node that follows the a node. This can be done as so:
$item//a[starts-with(@name,'hw')]/following-sibling::text()[1]
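The same two steps (find the a nodes whose name starts with 'hw', then take the first text node after each) can be sketched in Python with the standard xml.etree.ElementTree. This is an analogue, not webharvest or XQuery: ElementTree has no starts-with() or following-sibling axis, so the prefix test is done in Python and the following text is read from each element's tail.

```python
import xml.etree.ElementTree as ET

fragment = """<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>"""

td = ET.fromstring(fragment)

# Map each matching anchor name to the text that directly follows the anchor.
following = {
    a.get("name"): (a.tail or "").strip()
    for a in td.iter("a")
    if a.get("name", "").startswith("hw")
}

print(following["hw2"])
```

Looking the text up by anchor name avoids positional indexing such as text()[3], which is what the question wanted to avoid.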