Find an HTML element using BeautifulSoup by partial text - web-scraping

I have to find all paragraphs that contain a specific partial text.
I tried to find them the way shown below.
The whole text in the paragraph is
"Open Until: Tuesday November 20, 2018, // 4:00 pm MST"
The date is different each time, so I have to match on partial text, like
element = soup.findAll("p",text="Open Until")

You haven't shared the relevant HTML for that portion, so it is hard to give a definite solution. However, text="Open Until" doesn't work that way: a plain string is compared against the element's full text, not a part of it. Try something like the following instead.
for item in soup.find_all("p"):
    # Skip paragraphs that do not contain the partial text
    if "Open Until" not in item.text:
        continue
    print(item.text)
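
As a hedged alternative to the loop above, find_all can do the partial match itself when it is given a compiled regular expression or a function. The sample HTML below is invented for illustration, and the string= keyword assumes a reasonably recent BeautifulSoup 4 release (older versions spell it text=).

```python
import re
from bs4 import BeautifulSoup

# Invented sample markup; only the first paragraph comes from the question
html = """
<p>Open Until: Tuesday November 20, 2018, // 4:00 pm MST</p>
<p>Closed for the season</p>
<p>Now <b>Open Until</b> further notice</p>
"""
soup = BeautifulSoup(html, "html.parser")

# A compiled pattern is searched for inside the text, so partial matches work.
# string= only considers tags whose .string is set (a single text child).
by_regex = soup.find_all("p", string=re.compile("Open Until"))

# A function filter on get_text() also covers paragraphs with nested tags.
by_function = soup.find_all(
    lambda tag: tag.name == "p" and "Open Until" in tag.get_text()
)

for item in by_function:
    print(item.get_text())
```

The regex form is shorter, but it skips paragraphs that contain child tags, so the function form is the safer choice when the markup is not known in advance.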

Related

How do I select only a certain character of text and turn it into a break?

I will have users input text in a textbox to set as their identifier, however, they can only enter 1 line of text. I have no way of changing that.
I would like to add CSS that takes the string of text and edits a | character and changes it to a <br>
The string of text they will type will be something like this: 1234-5678-1234 | Jim
I want it to show up like this:
1234-5678-1234
Jim
I'm guessing the code might look like this:
p:contains('|') {code for an enter and float right}
I would be posting this as comment but I need 50 rep :)
Just this: what you are trying to do needs JS. You should give RegExp a try. There is no way to do that using pure CSS.
It is not possible to select an element on the basis of its textual content, except for the special case of empty content. There was once (in 2001) a draft suggesting a :contains(...) selector, but this feature was removed as the draft progressed (to eventually become the Selectors Level 3 recommendation).
Still less is there a way to select something inside an element based on its content.
Besides, adding <br> would not be possible. You cannot add tags or elements with CSS, only textual content via pseudo-elements.
Moreover, if the input is read in an input element, you cannot make its content display on two lines. If the user input is actually echoed in a different element, such as a p element, then it is copied there programmatically, so the question is why the change is not made at that point. You can modify the content with JavaScript, and it would be rather simple to replace any | by <br> in the content of a p element.
You should give <div contenteditable="true"></div> a try. You would be able to solve your issue using JS, by wrapping and adding tags as necessary. Also, a textbox won't support multiple lines or formatting.
A good read for contenteditable attribute on MDN: https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/Content_Editable
I did some testing and put together the following example, which splits the string with indexOf and slice rather than a RegExp:
<input type="text" id="demo" value="123-456-7890 | John Doe" size="35">
<button onclick="myFunction()">Try it</button>
<p style="text-align:left;" id="number"></p>
<p style="text-align:right;" id="name"></p>
<script>
function myFunction() {
  var str = document.getElementById("demo").value;
  var strnum = str.indexOf('|'); // -1 when there is no "|"
  // Without a separator, treat the whole input as the number
  var number = strnum === -1 ? str : str.slice(0, strnum);
  var name = strnum === -1 ? "" : str.slice(strnum + 1);
  // textContent avoids interpreting user input as HTML
  document.getElementById("number").textContent = number.trim();
  document.getElementById("name").textContent = name.trim();
}
</script>
Is this what you want?

HTML5 empty <time> tag for date of publication

I'm confused about HTML5 <time> semantic tag. It seems to me it's for tagging strings such as "today", "last Christmas", "10.10.2010", "2am" etc. to help machines recognize and understand them. All valid examples I could find on HTML5 Doctor and W3C page were like:
<p>I have a date on <time datetime="2008-02-14">Valentines day</time>.</p>
<p>We open at <time>10:00</time> every morning.</p>
But what if I want to set article publication date (e.g. to help Google robots), but without any string displaying it? Would it be valid to have:
<article><time datetime="..."></time> content </article>
or:
<article><time datetime="..." /> content </article>
or should I do that some other way?
(It seems hilarious to me to have e.g. <time datetime="...">...</time> and CSS display:none for time selector...)
The time element contains the datetime value in machine-readable (the datetime attribute) and human-readable (the text content) formats.
The datetime value of a time element is the value of the element's datetime content attribute, if it has one, or the element's textContent, if it does not.
WHATWG
So if the datetime attribute is absent, the datetime value will be parsed from the text content.
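If the datetime attribute is present, you are free to write anything into the text content, or nothing, as in your case. Use the explicit <time datetime="..."></time> form: time is not a void element, so in the HTML syntax the trailing slash in <time datetime="..." /> is ignored and the tag is treated as an opening tag only.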
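
To make the rule quoted above concrete, here is a minimal Python sketch that reads the datetime value of each time element: the datetime attribute when it is present, otherwise the text content. The class and function names are made up for this illustration, the publication date in the usage example is invented, and the sketch does not validate that the value is a well-formed date or time.

```python
from html.parser import HTMLParser


class TimeValueParser(HTMLParser):
    """Collects the datetime value of every <time> element."""

    def __init__(self):
        super().__init__()
        self.values = []
        self._attr = None  # datetime attribute of the open <time>, if any
        self._text = None  # text pieces collected inside the open <time>

    def handle_starttag(self, tag, attrs):
        if tag == "time":
            self._attr = dict(attrs).get("datetime")
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "time" and self._text is not None:
            text = "".join(self._text)
            # The attribute wins; the text content is only the fallback
            self.values.append(self._attr if self._attr is not None else text)
            self._attr, self._text = None, None


def time_values(html):
    """Return the datetime values of all <time> elements in document order."""
    parser = TimeValueParser()
    parser.feed(html)
    parser.close()
    return parser.values


sample = (
    '<p>I have a date on <time datetime="2008-02-14">Valentines day</time>.</p>'
    "<p>We open at <time>10:00</time> every morning.</p>"
    '<article><time datetime="2018-11-20"></time> content </article>'
)
print(time_values(sample))
```

The third element in the sample has no text at all, and its value still comes out of the attribute, which is the case the question asks about.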

display text in certain format

I am using MVC 5 for application and I have a table in which one record contains a string with formatted text like this
<p><span style="font-size: large;">Good Morning.. Its April 15 today! HAve a nice day..</span></p> <h3></h3> <p><span style="font-size: large;"><br /></span></p>
Now, when displaying the text in the view, I need to show it in the format stored in the record. For the record above, for example, I need
Good Morning.. Its April 15 today! HAve a nice day..
[note: font size is large]

xquery- how to get content of a node which is immediately after a node with known text

I am trying to extract content from an XHTML document. In this document, within a div, there are a number of 'b' elements, each followed by a link.
For example:
<div id="main">
<b> Bold text 1</b>
<a href="...">some link 1</a>
<b> Bold text 2</b>
<a href="...">some link 2</a>
<b> ABRACADABRA</b>
<a href="...">abracadbralink</a>
</div>
Now, I want to extract the link 'abracadbralink'. The problem is that I don't know how many b and a elements come before this specific link: different documents contain different numbers of them, and sometimes several links follow a single b element. All I do know is that the text of the b element just before the link I want is always fixed.
So the only fixed information is that I want the link immediately after the element with the known text. How do I get this link using XQuery?
If I get it right, you are interested in the value of the @href attribute? This can be done with standard XPath syntax:
doc('yourdoc.xml')//*[. = ' abracadbralink']/@href/string()
For more information on XPath, I’d advise you to check out some online tutorials, such as http://www.w3schools.com/xpath/default.asp
I guess the following should work for you:
$yournode/b[. = ' ABRACADABRA']/following-sibling::a[1]/@href/string()

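For readers working outside an XQuery processor, the same "first a after the b with known text" lookup can be sketched in Python. This is only an illustration: the standard library's ElementTree does not support the following-sibling axis, so the sketch walks the children in order instead. The href values are invented, the helper name is hypothetical, and the sample assumes the document has no XML namespace.

```python
import xml.etree.ElementTree as ET

# Sample based on the question; the href values are made up for illustration
xhtml = """<div id="main">
<b> Bold text 1</b>
<a href="http://example.com/1">some link 1</a>
<b> Bold text 2</b>
<a href="http://example.com/2">some link 2</a>
<b> ABRACADABRA</b>
<a href="http://example.com/abra">abracadbralink</a>
</div>"""


def href_after(parent, bold_text):
    """Return the href of the first <a> sibling that follows the matching <b>."""
    seen = False
    for child in parent:
        if child.tag == "b" and (child.text or "").strip() == bold_text:
            seen = True
        elif seen and child.tag == "a":
            return child.get("href")
    return None  # no matching <b>, or no <a> after it


root = ET.fromstring(xhtml)
print(href_after(root, "ABRACADABRA"))
```

Stripping the text before comparing sidesteps the leading space inside the b elements, which the XPath versions above have to match literally.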
Extracting text fragment from a HTML body (in .NET)

I have HTML content that is entered by the user via a rich-text editor, so it can be almost anything (except elements that belong outside the body tag, so no worries about "head", doctype, etc.).
An example of this content:
<h1>Header 1</h1>
<p>Some text here</p><p>Some more text here</p>
<div align=right>A link here</div><hr />
<h1>Header 2</h1>
<p>Some text here</p><p>Some more text here</p>
<div align=right>A link here</div><hr />
The trick is, I need to extract first 100 characters of the text only (HTML tags stripped). I also need to retain the line breaks and not break any word.
So the output for the above will be something like:
Header 1
Some text here
Some more text here
A link here
Header 2
Some text here
Some
It has 98 characters and the line breaks are retained. What I can achieve so far is to strip all the HTML tags using Regex:
Regex.Replace(htmlStr, "<[^>]*>", "")
Then trim the length, also using Regex, with:
Regex.Match(textStr, @"^.{1,100}\b").Value
My problem is how to retain the line breaks. I get an output like:
Header 1
Some text hereSome more text here
A link here
Header 2
Some text hereSome more text
Notice the joined sentences? Perhaps someone can show me some other way of solving this problem. Thanks!
Additional info: my purpose is to generate a plain-text synopsis from a bunch of HTML content. I guess this will help clarify the problem.
I think how I would solve this is to look at it as though it were a simple browser. Create a base Tag class, make it abstract with maybe an InnerHTML property and a virtual method PrintElement.
Next, create classes for each HTML tag that you care about and inherit from your base class. Judging from your example, the tags you care most about are h1, p, a, and hr. Implement the PrintElement method such that it returns a string that prints out the element properly based on the InnerHTML (such as the p class' PrintElement would return "\n[InnerHTML]\n").
Next, build a parser that will parse through your HTML and determine which object to create and then add those objects to a queue (a tree would be better, but doesn't look like it's necessary for your purposes).
Finally, go through your queue calling the PrintElement method for each element.
It may be more work than you had planned, but it's a far more robust solution than simply using regex, and should you decide to change your mind in the future and want to show simple styling, it's just a matter of going back and modifying your PrintElement methods.
For info, stripping html with a regex is... full of subtle problems. The HTML Agility Pack may be more robust, but still suffers from the words bleeding together:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.InnerText;
One way could be to strip the HTML in three steps, feeding the result of each step into the next:
string text = Regex.Replace(htmlStr, "<[^/>]*>", ""); // strip tags that contain no slash (opening tags), keeping </p> etc.
text = Regex.Replace(text, "</p>", "\r\n"); // replace every paragraph end with a new line
text = Regex.Replace(text, "<[^>]*>", ""); // strip the remaining tags
Well, I need to close this even though I don't have the ideal solution. Since the HTML tags used in my app are very common ones (no tables, lists, etc.) with little or no nesting, what I did was preformat the HTML fragments before saving them after user input:
Remove all line breaks
Add a line break prefix to all block tags (e.g. div, p, hr, h1/2/3/4 etc)
Before I extract them to be displayed as plain text, I use regex to remove the HTML tags and retain the line breaks. Hardly rocket science, but it works for me.
