Extracting text from class using beautifulsoup - web-scraping

<a class="mdl-navigation__link" href="#StringFormatInvalid">
<i class="material-icons error-icon">error</i>Invalid format string (1)</a>
All of the items of interest are within this tag, and I want to extract the text from it. How should I go about this?

We don't know if you want the text inside the <i> tag or all the text inside the <a> tag.
Either way, here is a snippet that finds both:
from bs4 import BeautifulSoup

html = """<a class="mdl-navigation__link" href="#StringFormatInvalid">
<i class="material-icons error-icon">error</i>Invalid format string (1)</a>"""
soup = BeautifulSoup(html, 'html.parser')
# Match each tag by its class attribute
a = soup.find('a', {'class': 'mdl-navigation__link'})
i = soup.find('i', {'class': 'material-icons error-icon'})
# get_text() returns the text of the tag and of all its descendants
print('a text = ', a.get_text())
print('i text = ', i.get_text())
OUTPUT:
a text =
errorInvalid format string (1)
i text = error
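If the goal is only the link's own text, without the icon name from the nested <i> tag, one option is to keep just the text nodes that are direct children of the <a> tag. This is a minimal sketch; the helper name own_text is made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

html = """<a class="mdl-navigation__link" href="#StringFormatInvalid">
<i class="material-icons error-icon">error</i>Invalid format string (1)</a>"""


def own_text(tag):
    """Join the text nodes that are direct children of tag (hypothetical helper)."""
    parts = [child for child in tag.children if isinstance(child, NavigableString)]
    return "".join(parts).strip()


soup = BeautifulSoup(html, "html.parser")
a = soup.find("a", class_="mdl-navigation__link")
print(own_text(a))
```

Text inside nested tags such as <i> is skipped because only the immediate children of the tag are inspected.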

Related

Select a group of elements and text using css selectors

I have an HTML page like:-
<div>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
</div>
I need to select a group like this:-
<a href='link'>
<u class>name</u>
</a>
text
<br>
I need to select 3 values from a group:- link, name, and text.
Is there any way to select a group like this and extract these particular values from each group in Scrapy, using CSS selectors, XPath, or anything else?
Scrapy can yield several values from a page together as an item: a Python object (or a plain dict) that holds key-value pairs.
You can extract each value individually and then yield them together as key-value pairs.
To extract the value of an attribute of an element, use the ::attr(name) pseudo-element.
To extract the text of an element, use the ::text pseudo-element.
For example, you can define your parse function in Scrapy like this:
def parse(self, response):
    for_link = response.css('.row.no-gutters div:nth-child(3) div:nth-child(8) a::attr(href)').getall()
    for_name = response.css('.row.no-gutters div:nth-child(3) div:nth-child(8) a u::text').getall()
    for_text = response.css('.row.no-gutters div:nth-child(3) div:nth-child(8)::text').getall()
    # Yield all elements
    yield {"link": for_link, "name": for_name, "text": for_text}
Open the items.py file.
# Define here the models for your scraped items
# Import the required library
import scrapy

# Define the fields for the Scrapy item in this class
class <yourspider>Item(scrapy.Item):
    # Item key for the link (href of a)
    for_link = scrapy.Field()
    # Item key for the name (text of u)
    for_name = scrapy.Field()
    # Item key for the text that follows a
    for_text = scrapy.Field()
for more details, read this tutorial
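The parse function above returns three parallel lists. If each link, name, and text should stay together as one group, another option is to loop over the <a> elements and read the text node that follows each one with the XPath following-sibling axis. The sketch below uses lxml on a shortened copy of the page so it can run outside a spider; the function name extract_groups and the sample values are made up for illustration, and the same XPath expressions should be usable with response.xpath() in Scrapy.

```python
from lxml import html

PAGE = """<div>
<a href='link1'>
<u class>name1</u>
</a>
text1
<br>
<a href='link2'>
<u class>name2</u>
</a>
text2
<br>
</div>"""


def extract_groups(page):
    """Return one dict per <a>, with its href, its <u> text and the text after it."""
    tree = html.fromstring(page)
    groups = []
    for a in tree.xpath("//a"):
        # First text node that comes after the closing </a>
        following = a.xpath("following-sibling::text()[1]")
        groups.append({
            "link": a.get("href"),
            "name": a.xpath("string(u)").strip(),
            "text": following[0].strip() if following else "",
        })
    return groups


for group in extract_groups(PAGE):
    print(group)
```

In a spider, each dict could be yielded from parse instead of being collected in a list.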
If it's okay to wrap the text in a span like so:
<a href='link'>
<u class>name</u>
</a>
<span>text</span>
<br>
Then you can select everything in CSS like so:
a, a + span {}
Or you can style these two separately:
a {}
a + span {}
The + is the adjacent sibling combinator: a + span matches a span that comes immediately after an a.

Webscraping on html function parameter and export to csv

<div class="readmore">
<a href="" onclick="updateDetailModal({"name":"Company Name 1","website":"https:\/\/hello.com.sg\/","phone":"65 8123 4567","email":"hello#gmail.com.sg"})" class="btn btn-primary" data-toggle="modal" data-target="#exampleModal">More
</a>
</div>
Hi, I'm looking to web scrape the following so that I can get it into a .csv file in this format:
Company Name | Website Url | Phone | Email -> 1st Row
Company Name 1 | https://hello.com.sg/ | 81234567 | hello#gmail.com -> 2nd Row
Company Name 2 | https://hello2.com.sg/ | 87654321 | hello2#gmail.com -> Subsequent rows for all links
Is there a way to use regex to get the individual fields and export them to a CSV file? I've been trying python and beautiful soup but I only know how to export using class or id. Not sure how to do it for function parameters.
Appreciate your help!
To extract the information you are looking for you need not just beautifulsoup (or lxml), but also json and a bit of string manipulation.
Assuming your html looks like this:
modal = r"""<div class="readmore">
<a href="" onclick="updateDetailModal({&quot;name&quot;:&quot;Company Name 1&quot;,&quot;website&quot;:&quot;https:\/\/hello.com.sg\/&quot;,&quot;phone&quot;:&quot;65 8123 4567&quot;,&quot;email&quot;:&quot;hello#gmail.com.sg&quot;})" class="btn btn-primary" data-toggle="modal" data-target="#exampleModal">More
</a>
<a href="" onclick="updateDetailModal({&quot;name&quot;:&quot;Company Name 2&quot;,&quot;website&quot;:&quot;https:\/\/hello2.com.sg\/&quot;,&quot;phone&quot;:&quot;87654321&quot;,&quot;email&quot;:&quot;hello2#gmail.com.sg&quot;})" class="btn btn-primary" data-toggle="modal" data-target="#exampleModal">More
</a>
</div>"""
Then:
from bs4 import BeautifulSoup as bs
import json

soup = bs(modal, "lxml")
infos = soup.select('a')
companies = []
for info in infos:
    # Take the JSON argument between the parentheses of updateDetailModal(...)
    target = info.attrs['onclick'].split('(')[1].split(')')[0]
    data = json.loads(target)
    # Keep the values in key order: name, website, phone, email
    companies.append(list(data.values()))
Your data is now in the companies list:
for co in companies:
    print(co)
Output:
['Company Name 1', 'https://hello.com.sg/', '65 8123 4567', 'hello#gmail.com.sg']
['Company Name 2', 'https://hello2.com.sg/', '87654321', 'hello2#gmail.com.sg']
From here you write it to csv using standard methods.
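As one possible way to do that last step, the standard csv module can write the header row from the question followed by one row per company. This is a minimal sketch: the function name write_companies is made up for illustration, and the sample rows are copied from the output above.

```python
import csv
import io

HEADER = ["Company Name", "Website Url", "Phone", "Email"]

# Sample rows copied from the output shown above
companies = [
    ["Company Name 1", "https://hello.com.sg/", "65 8123 4567", "hello#gmail.com.sg"],
    ["Company Name 2", "https://hello2.com.sg/", "87654321", "hello2#gmail.com.sg"],
]


def write_companies(rows, fileobj):
    """Write the header row and then one CSV row per company to fileobj."""
    writer = csv.writer(fileobj)
    writer.writerow(HEADER)
    writer.writerows(rows)


# Write to an in-memory buffer here so the result can be inspected directly
buffer = io.StringIO()
write_companies(companies, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open("companies.csv", "w", newline="", encoding="utf-8"); the newline="" argument is what the csv module documentation recommends for files handed to a writer.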

DomCrawler filterXpath for emails

In my project I am trying to use filterXPath for emails. So I get an E-Mail via IMAP and put the mail body into my DomCrawler.
$crawler = new Crawler();
$crawler->addHtmlContent($mail->textHtml); //mail html content utf8
Now to my issue. I only want the plain text of the mail body, but I still want to keep all line breaks, spaces, etc., so that the text looks exactly like the mail, just without the HTML (still with \n\r etc.).
For that reason I tried using $crawler->filterXPath('//body/descendant-or-self::*/text()') to get every text node inside the mail.
However, my test mail contains HTML like:
<p>
<u>
<span>
<a href="mailto:mail#example.com">
<span style="color:#0563C1">mail#example.com</span>
</a>
</span>
</u>
<span>
</span>
<span>·</span>
<span>
<b>
<a href="http://www.example.com">
<span style="color:#0563C1">www.example.com</span>
</a>
</b>
<p/>
</span>
</p>
In my mail this looks like mail#example.com · www.example.com (in one single line).
With my filterXPath I get multiple nodes which result in following (multiple lines):
mail#example.com
· www.example.com
I know that the carriage return (a \r) is probably the problem, but since I can't change the HTML in the mail, I need another solution. As mentioned before, in the mail it is only a single line.
Please keep in mind that my solution has to work for every mail. I do not know what the mail HTML looks like, and it can change every time, so I need a generic solution.
I already tried using strip_tags too - this does not change the result at all.
My current approach:
$crawler = new Crawler();
$crawler->addHtmlContent($mail->textHtml);
$text = "";
foreach ($crawler->filterXPath('//body/descendant-or-self::*/text()') as $element) {
    $part = trim($element->textContent);
    if ($part) {
        $text .= "|".$part."|\n"; //to see whitespaces etc
    }
}
echo $text;
//OUTPUT
|mail#example.com|
|·|
| |
|www.example.com|
| |
I believe something like this should work:
// Select only the innermost spans, i.e. those without child elements
// (DOMXPath expects a DOMDocument, so the query runs on the crawler itself)
$result = $crawler->filterXPath('//span[not(descendant::*)]');
$text = "";
foreach ($result as $element) {
    $part = trim($element->textContent);
    if ($part) {
        $text .= "|".$part."|"; //to see whitespaces etc
    }
}
echo $text;
Output:
|mail#example.com||·||www.example.com|
Note that you are dealing with two different ways of treating whitespace-only text nodes. HTML has its own rules for whether they are rendered (the rules differ mainly between block and inline elements and include whitespace normalization), while XPath works on a document tree provided by a parser (or DOM API), which has its own configuration for preserving or discarding those nodes. Taking this into account, one solution could be to use the string() function to get the string value of the element containing the email.
For this input:
<root>
<p>
<u>
<span>
<a href="mailto:mail#example.com">
<span style="color:#0563C1">mail#example.com</span>
</a>
</span>
</u>
<span>
</span>
<span>·</span>
<span>
<b>
<a href="http://www.example.com">
<span style="color:#0563C1">www.example.com</span>
</a>
</b>
<p/>
</span>
</p>
</root>
This XPath expression:
string(/root)
Outputs:
mail#example.com
·
www.example.com
Check in here
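The string value above still contains the line breaks from the source markup. If a fragment is known to render on a single line, the runs of whitespace can be collapsed after extraction, which is roughly what XPath's normalize-space() function does. The following is a minimal sketch using Python's standard library on a condensed copy of the input above; the function name single_line_text is made up for illustration. It also removes line breaks that were intended, so it is not a general answer for whole mails.

```python
import xml.etree.ElementTree as ET

XML = """<root>
<p>
<u><span><a href="mailto:mail#example.com"><span style="color:#0563C1">mail#example.com</span></a></span></u>
<span>
</span>
<span>·</span>
<span><b><a href="http://www.example.com"><span style="color:#0563C1">www.example.com</span></a></b><p/></span>
</p>
</root>"""


def single_line_text(xml_source):
    """Concatenate all text nodes, then collapse each run of whitespace to one space."""
    root = ET.fromstring(xml_source)
    raw = "".join(root.itertext())
    return " ".join(raw.split())


print(single_line_text(XML))
```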

How to display text in an MVC view with htmlAttributes

I have the following code :
@Html.ActionLink("Hello " + User.Identity.GetUserName() + "!",
"Manage", "Account",
routeValues: null,
htmlAttributes: new { title = "Manage" })
I just want to display the text (with the correct htmlattribute) (i.e. no link)
Could you help me with the correct syntax please?
I think you can use the Url.Action method.
<a href="@Url.Action("ActionName")">
<span>Hello @User.Identity.GetUserName()!</span>
</a>
If I understand correctly, you want to show the text inside your link without an anchor tag, but with your HTML attributes (the title attribute).
Try this:
<span title="Manage">Hello @User.Identity.GetUserName() !</span>
If you want the text with no link, i.e. no anchor element, then just use plain HTML:
<span title="Manage">Hello @User.Identity.GetUserName()!</span>
Or, if you don't want to enclose it within a <span>:
<text>Hello @User.Identity.GetUserName()!</text>
But with this you won't get the title attribute, since the text is not enclosed in an HTML tag to apply it to.
If you actually want an anchor, then you could also use @Url.Action() in conjunction with plain HTML:
<a title="Manage" href="@Url.Action("Manage", "Account")">
Hello @User.Identity.GetUserName()!
</a>

Ambiguous error in xpath [last()] method

In a custom step in cucumber, I wrote this:
find(:xpath, "//ul//input[@placeholder = 'Enter Something'][last()]").set(value)
And I'm getting an ambiguous match error: the expression matches both elements.
How can I get this element using XPath (or maybe even CSS) in Cucumber?
I'm using cucumber-1.2.1 and capybara-2.0.3
(Please note: every attribute of the two input fields is the same.)
HTML:
<ul class = "someclass">
<li>
<div>
<a></a>
<input></input>
<input placeholder = "Enter Something"></input>
</div>
</li>
<li>
<div>
<a></a>
<input></input>
<input placeholder = "Enter Something"> // This is the element I want
</input>
</div>
</li>
</ul>
You'll need an extra set of parentheses in your XPath:
"(//ul//input[@placeholder = 'Enter Something'])[last()]"

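The reason the parentheses matter is that a predicate such as [last()] on a step is evaluated separately for each parent element, so every <div> contributes its own last matching input. Wrapping the whole path in parentheses collects all matches first and then picks the last one overall. A small sketch with lxml, assuming a simplified copy of the list from the question (the id attributes are added here only to tell the two inputs apart):

```python
from lxml import etree

# Simplified copy of the list from the question; the ids are illustrative only
DOC = """<ul class="someclass">
  <li><div><a/><input/><input id="first" placeholder="Enter Something"/></div></li>
  <li><div><a/><input/><input id="second" placeholder="Enter Something"/></div></li>
</ul>"""

root = etree.fromstring(DOC)

# last() is evaluated per parent <div>, so each div contributes its last match
per_parent = root.xpath("//ul//input[@placeholder = 'Enter Something'][last()]")

# Parentheses collect every match first, then last() picks one node overall
overall = root.xpath("(//ul//input[@placeholder = 'Enter Something'])[last()]")

print([el.get("id") for el in per_parent])
print([el.get("id") for el in overall])
```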