lxml - scraping text within cluttered php - web-scraping

I'm in the process of rewriting a poorly written website that was originally coded in PHP.
I'm trying to isolate the text within a p tag and was wondering how I can extract just the text portions. Any ideas?
<p>
<span lang="EN-IE" xml:lang="EN-IE">
<br>
TEXT SAMPLE 1
<br>
<br>
TEXT SAMPLE 2
<span lang="EN-IE" xml:lang="EN-IE">TEXT SAMPLE 3
</span>,
<span lang="EN-IE" xml:lang="EN-IE"> TEXT SAMPLE 4
</span> TEXT SAMPLE 5
<span lang="EN-IE" xml:lang="EN-IE">. </span>
</span><span lang="EN-IE" xml:lang="EN-IE">
<br>
<br>
TEXT SAMPLE 6
</span>
<span lang="EN-IE" xml:lang="EN-IE"> </span>
TEXT SAMPLE 7

BeautifulSoup is a good place to start, especially its get_text method.
This will output all the text in the snippet above:
from bs4 import BeautifulSoup
CONTENT = """
<p>
<span lang="EN-IE" xml:lang="EN-IE">
<br>
TEXT SAMPLE 1
<br>
<br>
TEXT SAMPLE 2
<span lang="EN-IE" xml:lang="EN-IE">TEXT SAMPLE 3
</span>,
<span lang="EN-IE" xml:lang="EN-IE"> TEXT SAMPLE 4
</span> TEXT SAMPLE 5
<span lang="EN-IE" xml:lang="EN-IE">. </span>
</span><span lang="EN-IE" xml:lang="EN-IE">
<br>
<br>
TEXT SAMPLE 6
</span>
<span lang="EN-IE" xml:lang="EN-IE"> </span>
TEXT SAMPLE 7
"""
if __name__ == '__main__':
    # Name the parser explicitly; html.parser ships with Python
    soup = BeautifulSoup(CONTENT, 'html.parser')
    print(soup.get_text())
The output may need some string manipulation, since it contains many newlines, but this strips away the HTML.
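The string manipulation mentioned above could look roughly like this. It is a minimal sketch using only the standard library; tidy_text is a hypothetical helper name, and the raw string is a stand-in for whatever get_text() returns on the real page.

```python
def tidy_text(raw):
    """Collapse runs of whitespace on each line and drop empty lines."""
    # " ".join(line.split()) squeezes internal whitespace and trims the ends
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


# Stand-in for the output of soup.get_text() (assumed shape, not real output)
raw = "\n\n\nTEXT SAMPLE 1\n\n\nTEXT   SAMPLE 2\nTEXT SAMPLE 3\n,\n"
print(tidy_text(raw))
```

Each remaining line is one run of text from the page, so stray punctuation such as the lone comma still has to be handled by hand if it matters.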

Related

Select a group of elements and text using css selectors

I have an HTML page like this:
<div>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
</div>
I need to select a group like this:
<a href='link'>
<u class>name</u>
</a>
text
<br>
I need to select 3 values from each group: link, name, and text.
Is there any way to select a group like this and extract these particular values from each group in Scrapy, using CSS selectors, XPath, or anything else?
Scrapy can yield multiple values from an HTML page as items: Python objects that define key-value pairs.
You can extract each value individually and then yield them together as key-value pairs.
To extract the value of an element's attribute, append ::attr(name) to the CSS selector.
To extract an element's text nodes, append ::text.
For example, you can define your parse function in Scrapy like this:
def parse(self, response):
    for_link = response.css(' .row.no-gutters div:nth-child(3) div:nth-child(8) a::attr(href)').getall()
    for_name = response.css(' .row.no-gutters div:nth-child(3) div:nth-child(8) a u::text').getall()
    for_text = response.css(' .row.no-gutters div:nth-child(3) div:nth-child(8)::text').getall()
    # Yield all elements
    yield {"link": for_link, "name": for_name, "text": for_text}
Open the items.py file.
# Define the models for your scraped items here
# Import the required library
import scrapy
# Define the fields for the Scrapy item in this class
# (YourSpiderItem is a placeholder; rename it to match your spider)
class YourSpiderItem(scrapy.Item):
    # Item key for the href of the a element
    for_link = scrapy.Field()
    # Item key for the text of the u element
    for_name = scrapy.Field()
    # Item key for the loose text that follows each link
    for_text = scrapy.Field()
for more details, read this tutorial
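The parse function above yields three parallel lists rather than one record per group. One way to pair them up afterwards is sketched below in plain Python. group_fields is a hypothetical helper, and the sample lists are assumptions about what getall() might return for the question's markup. It also assumes every link is followed by non-empty text; if a group has no text, the lists go out of step.

```python
def group_fields(links, names, texts):
    """Zip parallel lists into one dict per (link, name, text) group."""
    # ::text on the container also returns whitespace-only nodes between tags
    cleaned = [t.strip() for t in texts if t.strip()]
    return [
        {"link": link, "name": name, "text": text}
        for link, name, text in zip(links, names, cleaned)
    ]


# Assumed shape of the three getall() results for two groups
links = ["link-1", "link-2"]
names = ["name-1", "name-2"]
texts = ["\n", "\ntext-1\n", "\n", "\ntext-2\n", "\n"]

for group in group_fields(links, names, texts):
    print(group)
```

Inside a spider, each of these dicts could be yielded on its own instead of yielding the three lists together.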
If it's okay to wrap the text in a span, like so:
<a href='link'>
<u class>name</u>
</a>
<span>text</span>
<br>
Then you can select everything in CSS like so:
a, a + span {}
Or you can style these two separately:
a {}
a + span {}
The + is the adjacent sibling combinator: a + span matches a span that comes immediately after an a within the same parent.

R, XPath, text scraping : get the text inside a node, while filtering on the attribute value of one of its descendant

Here is a quick mock-up. What I want to get is a character vector with the text content of each p node whose descendant a has the attribute href = "value1".
<doc>
<div class="intervention">
<p>
<a></a>
<b>
xxx
</b>
text1
</p>
<p>
<a></a>
<b>
xxx
</b>
text2
</p>
<p>
<a></a>
<b>
xxx
</b>
text3
</p>
</div>
<div class="intervention">
<p>
<a></a>
<b>
xxx
</b>
text4
</p>
<p>
<a></a>
<b>
xxx
</b>
text5
</p>
<p>
<a></a>
<b>
xxx
</b>
text6
</p>
</div>
</doc>
In other words, I want to get this vector:
c("xxxtext1","xxxtext3","xxxtext5","xxxtext6")
Could you please help me find the right XPath? So far I have found this one, which gives me all the text content in the p nodes, but I cannot get it to filter on a's href value.
"//div[@class='intervention']//*[not(self::script)]"
Many thanks in advance for your help!
Your XPath should be //a[@href='value1']/ancestor::p
So for example:
library(xml2)
result <- xml_text(xml_find_all(doc, xpath = "//a[@href='value1']/ancestor::p"))
gsub("\\s", "", result) # Remove line breaks and spaces
#> [1] "xxxtext1" "xxxtext3" "xxxtext5" "xxxtext6"
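For comparison, the same idea can be sketched in Python with the standard library's xml.etree.ElementTree. This is an analogue, not the R solution above: ElementTree has no ancestor axis, so the sketch uses /.. to step up to the parent, which only works when a is a direct child of p. The href values in the sample are assumptions, since the mock-up in the question does not show them.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed mock document; the href values are assumed for illustration
XML = """
<doc>
  <div class="intervention">
    <p><a href="value1"></a><b>xxx</b>text1</p>
    <p><a href="value2"></a><b>xxx</b>text2</p>
    <p><a href="value1"></a><b>xxx</b>text3</p>
  </div>
</doc>
"""

root = ET.fromstring(XML)

# Select each matching a, then step up to its parent p
paragraphs = root.findall(".//a[@href='value1']/..")

# Join all text inside each p and strip whitespace, like gsub("\\s", "", ...)
texts = [re.sub(r"\s+", "", "".join(p.itertext())) for p in paragraphs]
print(texts)  # ['xxxtext1', 'xxxtext3']
```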

Python-BS4 text inside <span> that has no class

I have these 2 spans with text inside them. They have no class or id, and I want to scrape that text with bs4, but I don't know how. Selecting by the small tag doesn't help me because the HTML is full of those.
Can someone help me with an example?
<td valign="bottom" class="bottom-cell">
<div class="space rel">
<p class="lheight16">
<small class="breadcrumb x-normal">
<span><i data-icon="location-filled"></i>Iasi</span>
</small>
<small class="breadcrumb x-normal">
<span><i data-icon="clock"></i>Ieri 16:13</span>
</small>
</p>
Try this. The :nth-of-type(1) selector matches every span that is the first span child of its parent.
for i in data.select('.lheight16 small span:nth-of-type(1)'):
    print(i.text)
There are several ways to do this, but most of them work from the parents of the spans. Your question gives no expected output (consider adding one), so check these two options.
Option a:
for span in soup.select('td.bottom-cell span'):
    print(span.get_text())
Option b:
print(soup.select_one('td.bottom-cell').get_text(' - ', strip=True))
Example
from bs4 import BeautifulSoup
html='''
<td valign="bottom" class="bottom-cell">
<div class="space rel">
<p class="lheight16">
<small class="breadcrumb x-normal">
<span><i data-icon="location-filled"></i>Iasi</span>
</small>
<small class="breadcrumb x-normal">
<span><i data-icon="clock"></i>Ieri 16:13</span>
</small>
</p>
</div>
</td>
'''
soup = BeautifulSoup(html, 'lxml')
# option a:
for span in soup.select('td.bottom-cell span'):
    print(span.get_text())
# option b:
print(soup.select_one('td.bottom-cell').get_text(' - ', strip=True))
Output
a:
Iasi
Ieri 16:13
b:
Iasi - Ieri 16:13

Image attached to text

I have an image that I want placed over one particular piece of text. The text is not always the same length, so the image ends up in different places depending on the text length. What do I need to change so that the image sits only over that particular text and does not move when the surrounding text changes length?
<div>
<br>
<span class="bigSignature" v-if="report_info.accepted && report_info.accepted_by !== null">{{ memberById(report_info.accepted_by).callsign }}
<img style="position:absolute;top:47%;left:82%;transform:translate(-50%,-50%);
width:15%;height:auto;" src="https://imgur.com/blabla.png"/>
</span>
<br>
<br>
</div>
<button v-if="isAdmin" class="float-right" v-on:click="acceptRejectReport">{{ acceptButtonText }}</button>
</div>

beautifulsoup4 - grab Sibling element if Sibling present

The most common repetitive structure of the HTML is:
<p class="Standard">
<span class="T3">
it is possible for you
</span>
</p>
In such situations I grab the text: it is possible for you
Occasionally (i.e., not always), the <p> of class="Standard" has a sibling <p> of class="P3", like so:
<p class="P3">
(to ask a question in Spanish, you just use inflection)
</p>
When this <p> of class="P3" is present, I want to additionally grab the text inside it, e.g. here I would additionally grab: (to ask a question in Spanish, you just use inflection)
My question is, given this kind of structure:
<div>
...
<p class="Standard">
<span class="T3">
it is possible for you
</span>
</p>
<p class="Standard">
<span class="T3">
it is acceptable for me
</span>
</p>
<p class="P3">
(to ask a question in Spanish, you just use inflection)
</p>
...
</div>
How can I produce output like this:
it is possible for you
it is acceptable for me
(to ask a question in Spanish, you just use inflection)
Currently, I've managed to do this:
p_standards = soup.find_all("p", class_="Standard")
for p_standard in p_standards:
    p_english = p_standard.find("span", class_="T3")
    print(p_english.contents[0])
And the output I get is:
it is possible for you
it is acceptable for me
Use this Python code:
from bs4 import BeautifulSoup
import re
text = '''
<div>
<p class="Standard">
<span class="T3">
it is possible for you
</span>
</p>
<p class="Standard">
<span class="T3">
it is acceptable for me
</span>
</p>
<p class="P3">
(to ask a question in Spanish, you just use inflection)
</p>
</div>
'''
soup = BeautifulSoup(text, features='html.parser')
p_standards = soup.find_all("p", class_="Standard")
for p_standard in p_standards:
    p_english = p_standard.find('span', attrs={'class': 'T3'})
    nextSibling = p_standard.find_next_sibling()
    print(p_english.text.strip())
    # find_next_sibling() returns None when no tag follows,
    # and a tag without a class attribute has no 'class' key
    if nextSibling is not None and nextSibling.name == 'p' and 'P3' in nextSibling.get('class', []):
        print(nextSibling.text.strip())
Explanation:
To read the class of the element returned by find_next_sibling, I had to
look at the instance's own variables, because I could not find this
documented on the official website, so I printed
nextSibling.__dict__.keys(). The attributes live in attrs, and class is a
multi-valued attribute, so its value is a list rather than a string.
I think it is more efficient to use the CSS "or" syntax (a comma-separated selector list) together with an adjacent sibling combinator:
from bs4 import BeautifulSoup as bs
html = '''
<div>
...
<p class="Standard">
<span class="T3">
it is possible for you
</span>
</p>
<p class="Standard">
<span class="T3">
it is acceptable for me
</span>
</p>
<p class="P3">
(to ask a question in Spanish, you just use inflection)
</p>
...
</div>
'''
soup = bs(html, 'lxml')
items = [i.text.strip() for i in soup.select('.Standard, .Standard + .P3')]
print(items)
