I have an HTML page like this:
<div>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
</div>
I need to select a group like this:
<a href='link'>
<u class>name</u>
</a>
text
<br>
I need to select 3 values from each group: link, name, and text.
Is there any way to select a group like this and extract these values from each group in Scrapy, using CSS selectors, XPath, or anything else?
Scrapy lets you yield several values from a page together as an item: a Python object (or a plain dict) that holds key-value pairs.
You can extract each value individually and then yield them together as key-value pairs.
To extract the value of an attribute of an element, use ::attr(name) in the CSS selector.
To extract the text of an element, use ::text.
For example, you can define your parse function in Scrapy like this:
def parse(self, response):
    for_link = response.css('.row.no-gutters div:nth-child(3) div:nth-child(8) a::attr(href)').getall()
    for_name = response.css('.row.no-gutters div:nth-child(3) div:nth-child(8) a u::text').getall()
    for_text = response.css('.row.no-gutters div:nth-child(3) div:nth-child(8)::text').getall()
    # Yield all extracted lists together
    yield {"link": for_link, "name": for_name, "text": for_text}
Open the items.py file.
# Define here the models for your scraped items
# Import the required library
import scrapy

# Define the fields for the Scrapy item in this class
# (rename YourSpiderItem to match your project)
class YourSpiderItem(scrapy.Item):
    # Item key for the href of <a>
    for_link = scrapy.Field()
    # Item key for the text of <u>
    for_name = scrapy.Field()
    # Item key for the text that follows </a>
    for_text = scrapy.Field()
For more details, read this tutorial.
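The parse function above returns three separate lists, so the link, name and text of one group are no longer tied together. A minimal sketch of per-group extraction follows. It uses lxml directly (the parser Scrapy's selectors are built on) so it can run outside a spider; the sample values link1, name1, text1 and so on are made up for illustration. The same XPath expressions should also work with response.xpath(...) in a spider.

```python
from lxml import html as lxml_html

# Sample page in the shape shown in the question (values are illustrative).
page = """
<div>
<a href='link1'>
<u class>name1</u>
</a>
text1
<br>
<a href='link2'>
<u class>name2</u>
</a>
text2
<br>
</div>
"""

tree = lxml_html.fromstring(page)

groups = []
for a in tree.xpath("//div/a"):
    groups.append({
        # href attribute of the <a> element
        "link": a.get("href"),
        # text inside the nested <u>
        "name": a.xpath("string(u)").strip(),
        # first text node after </a>, i.e. the loose text before <br>
        "text": "".join(a.xpath("following-sibling::text()[1]")).strip(),
    })

for group in groups:
    print(group)
```

Looping over each a element and using paths relative to it keeps the three values of one group together, which a page-wide getall() cannot guarantee.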
If it's okay to wrap the text in a span like so:
<a href='link'>
<u class>name</u>
</a>
<span>text</span>
<br>
Then you can select everything in CSS like so:
a, a + span {}
Or you can style these two separately:
a {}
a + span {}
The + combinator means "comes immediately after": a + span matches a span that directly follows an a element with the same parent.
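If the goal is scraping rather than styling, the same a + span selector can be passed to BeautifulSoup's select(). The sketch below assumes the text has been wrapped in a span as described above; the sample values are made up.

```python
from bs4 import BeautifulSoup

html = """
<div>
<a href='link1'><u>name1</u></a>
<span>text1</span>
<br>
<a href='link2'><u>name2</u></a>
<span>text2</span>
<br>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "a + span" matches each <span> that directly follows an <a> sibling.
pairs = [
    (span.find_previous_sibling("a")["href"], span.get_text(strip=True))
    for span in soup.select("a + span")
]
print(pairs)
```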
I have these 2 spans with text inside them. They have no class or id, and I want to scrape that text with bs4, but I don't know how. Using the small tag doesn't help me because the HTML is full of those.
Can someone help me with an example?
<td valign="bottom" class="bottom-cell">
<div class="space rel">
<p class="lheight16">
<small class="breadcrumb x-normal">
<span><i data-icon="location-filled"></i>Iasi</span>
</small>
<small class="breadcrumb x-normal">
<span><i data-icon="clock"></i>Ieri 16:13</span>
</small>
</p>
Try this. The :nth-of-type(1) selector matches every span element that is the first child of its type within its parent.
for i in data.select('.lheight16 small span:nth-of-type(1)'):
    print(i.text)
There are multiple ways to do this, but most rely on the parents of the spans. Since your question shows no expected output (I recommend adding one), check these two options.
Option a:
for span in soup.select('td.bottom-cell span'):
    print(span.get_text())
Option b:
print(soup.select_one('td.bottom-cell').get_text(' - ', strip=True))
Example
from bs4 import BeautifulSoup
html='''
<td valign="bottom" class="bottom-cell">
<div class="space rel">
<p class="lheight16">
<small class="breadcrumb x-normal">
<span><i data-icon="location-filled"></i>Iasi</span>
</small>
<small class="breadcrumb x-normal">
<span><i data-icon="clock"></i>Ieri 16:13</span>
</small>
</p>
</div>
</td>
'''
soup = BeautifulSoup(html, 'lxml')
# option a:
for span in soup.select('td.bottom-cell span'):
    print(span.get_text())
# option b:
print(soup.select_one('td.bottom-cell').get_text(' - ', strip=True))
Output
a:
Iasi
Ieri 16:13
b:
Iasi - Ieri 16:13
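Since the spans themselves have no class or id, another option is to tell them apart by the data-icon attribute of the i element inside each one. This is only a sketch based on the HTML in the question; the variable names location and posted are made up.

```python
from bs4 import BeautifulSoup

html = '''
<td valign="bottom" class="bottom-cell">
<div class="space rel">
<p class="lheight16">
<small class="breadcrumb x-normal">
<span><i data-icon="location-filled"></i>Iasi</span>
</small>
<small class="breadcrumb x-normal">
<span><i data-icon="clock"></i>Ieri 16:13</span>
</small>
</p>
</div>
</td>
'''

soup = BeautifulSoup(html, "html.parser")
cell = soup.select_one("td.bottom-cell")

# Find each icon by its data-icon value, then read the text of its parent <span>.
location = cell.select_one('i[data-icon="location-filled"]').parent.get_text(strip=True)
posted = cell.select_one('i[data-icon="clock"]').parent.get_text(strip=True)

print(location)
print(posted)
```

This keeps the two values separate even if the order of the small elements changes.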
I want to scrape some items, which are on the same page, using Scrapy.
HTML looks like this:
<div class="container" id="1">
<span class="title">
product-title1
</span>
<div class="description">
product-desc
</div>
<div class="price">
1.0
</div>
</div>
I need to extract name, description and price.
Unfortunately, sometimes a product doesn't have a description and the HTML looks like this:
<div class="container" id="2">
<span class="title">
product-title2
</span>
<div class="price">
2.0
</div>
</div>
Currently I am using CSS selectors, which return a list of all matching elements on the page:
title = response.css('span[class="title"]').extract()
['product-title1', 'product-title2', 'product-title3']
description = response.css('div[class="description"]').extract()
['desc1','desc3']
price = response.css('div[class="price"]').extract()
['1.0','2.0','3.0']
Is it possible to get for example an empty string in place of missing 'desc2' when description object isn't there, using CSS selector?
I recommend you rewrite your code:
for section in response.xpath('//div[@class="container"]'):
    title = section.xpath('./span[@class="title"]/text()').get(default='not-found')  # you can use any default value here, or just an empty string
    description = section.xpath('./div[@class="description"]/text()').get(default='')  # empty string when the description is missing
    price = section.xpath('./div[@class="price"]/text()').get()
Check this out:
for section in response.xpath('//div[@class="container"]'):
    title = section.xpath('./span[@class="title"]/text()').get()
    # Use a relative path (./) so only this container is searched
    description_tag = section.xpath("./div[contains(@class,'description')]")
    if description_tag:
        description = section.xpath('./div[@class="description"]/text()').get()
    else:
        description = ""  # placeholder used when the description is missing
    price = section.xpath('./div[@class="price"]/text()').get()
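The per-container approach can be tried outside Scrapy with a small sketch like the one below. It uses lxml directly, so the helper first_text (a made-up name) stands in for Scrapy's .get(default=''); the wrapping div with id="products" is also an assumption added so the sample has a single root.

```python
from lxml import html as lxml_html

page = """
<div id="products">
  <div class="container" id="1">
    <span class="title">product-title1</span>
    <div class="description">product-desc</div>
    <div class="price">1.0</div>
  </div>
  <div class="container" id="2">
    <span class="title">product-title2</span>
    <div class="price">2.0</div>
  </div>
</div>
"""

tree = lxml_html.fromstring(page)


def first_text(node, path, default=""):
    """Return the first text match for a relative XPath, or the default."""
    found = node.xpath(path)
    return found[0].strip() if found else default


# One dict per container, so a missing description cannot shift the other fields.
items = [
    {
        "title": first_text(section, './span[@class="title"]/text()'),
        "description": first_text(section, './div[@class="description"]/text()'),
        "price": first_text(section, './div[@class="price"]/text()'),
    }
    for section in tree.xpath('//div[@class="container"]')
]

for item in items:
    print(item)
```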
The most common repetitive structure of the HTML is:
<p class="Standard">
<span class="T3">
it is possible for you
</span>
</p>
In such situations I grab the text: it is possible for you
Occasionally (i.e., not always), the <p> of class="Standard" has a sibling <p> of class="P3", like so:
<p class="P3">
(to ask a question in Spanish, you just use inflection)
</p>
When this <p> of class="P3" is present, I want to additionally grab the text inside it, e.g. here I would additionally grab: (to ask a question in Spanish, you just use inflection)
My question is, given this kind of structure:
<div>
...
<p class="Standard">
<span class="T3">
it is possible for you
</span>
</p>
<p class="Standard">
<span class="T3">
it is acceptable for me
</span>
</p>
<p class="P3">
(to ask a question in Spanish, you just use inflection)
</p>
...
</div>
How can I produce output like this:
it is possible for you
it is acceptable for me
(to ask a question in Spanish, you just use inflection)
Currently, I've managed to do this:
p_standards = soup.find_all("p", class_="Standard")
for p_standard in p_standards:
    p_english = p_standard.find("span", class_="T3")
    print(p_english.contents[0])
And the output I get is:
it is possible for you
it is acceptable for me
Use this Python code:
from bs4 import BeautifulSoup
import re
text = '''
<div>
<p class="Standard">
<span class="T3">
it is possible for you
</span>
</p>
<p class="Standard">
<span class="T3">
it is acceptable for me
</span>
</p>
<p class="P3">
(to ask a question in Spanish, you just use inflection)
</p>
</div>
'''
soup = BeautifulSoup(text,features='html.parser')
p_standards = soup.find_all("p", class_="Standard")
for p_standard in p_standards:
    p_english = p_standard.find('span', attrs={'class': 'T3'})
    nextSibling = p_standard.find_next_sibling()
    print(p_english.text.strip())
    # The last element has no sibling, and a sibling may have no class attribute
    if (nextSibling is not None and nextSibling.name == 'p'
            and nextSibling.attrs.get('class', [None])[0] == 'P3'):
        print(nextSibling.text.strip())
Explanation:
find_next_sibling() returns a Tag, and its attributes are stored in the attrs dictionary.
I could not find this mentioned in the official docs, so I found it by printing nextSibling.__dict__.keys().
The index 0 is needed because BeautifulSoup returns the class attribute as a list.
I think it is more efficient to use the CSS "or" syntax (a comma-separated selector list) together with an adjacent sibling combinator to do this:
from bs4 import BeautifulSoup as bs
html = '''
<div>
...
<p class="Standard">
<span class="T3">
it is possible for you
</span>
</p>
<p class="Standard">
<span class="T3">
it is acceptable for me
</span>
</p>
<p class="P3">
(to ask a question in Spanish, you just use inflection)
</p>
...
</div>
'''
soup = bs(html, 'lxml')
items = [i.text.strip() for i in soup.select('.Standard, .Standard + .P3')]
print(items)
How can we extract a value using XPath or a CSS selector if the attribute changes dynamically? For example:
<p data-reactid=".2e46q6vkxnc.1.$0">
<b data-reactid=".2e46q6vkxnc.1.$0.0">Mark Obtain</b>
<i class="avu-full-width" data-reactid=".2e46q6vkxnc.1.$0.1">
<span data-reactid=".2e46q6vkxnc.1.$0.1.0"> </span>
<span data-reactid=".2e46q6vkxnc.1.$0.1.1">450 A+.</span>
</i>
</p>
<p data-reactid=".2e46q6vkxnc.1.$1">
<b data-reactid=".2e46q6vkxnc.1.$1.0">Student Name</b>
<i class="avu-full-width" data-reactid=".2e46q6vkxnc.1.$1.1">
<span data-reactid=".2e46q6vkxnc.1.$0.1.0"> </span>
<span data-reactid=".2e46q6vkxnc.1.$0.1.1">First Name</span>
</i>
</p>
In this case the attributes of the elements change dynamically, but "Mark Obtain" and "Student Name" will always be the same. Is there any way, such as an if condition or a regex combined with the XPath expression, to get the "450 A+" and "First Name" values?
Please help.
To get required values you can use below XPath expressions:
//p[b="Mark Obtain"]//span[2]/text()
to get "450 A+."
and
//p[b="Student Name"]//span[2]/text()
to get "First Name"
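As a quick check of these expressions, here is a sketch using lxml, which evaluates the same XPath 1.0 syntax. The helper value_for and the wrapping div are made up for illustration; the label is passed as an XPath variable ($label) instead of being pasted into the expression.

```python
from lxml import html as lxml_html

page = """
<div>
<p data-reactid=".2e46q6vkxnc.1.$0">
<b data-reactid=".2e46q6vkxnc.1.$0.0">Mark Obtain</b>
<i class="avu-full-width" data-reactid=".2e46q6vkxnc.1.$0.1">
<span data-reactid=".2e46q6vkxnc.1.$0.1.0"> </span>
<span data-reactid=".2e46q6vkxnc.1.$0.1.1">450 A+.</span>
</i>
</p>
<p data-reactid=".2e46q6vkxnc.1.$1">
<b data-reactid=".2e46q6vkxnc.1.$1.0">Student Name</b>
<i class="avu-full-width" data-reactid=".2e46q6vkxnc.1.$1.1">
<span data-reactid=".2e46q6vkxnc.1.$1.1.0"> </span>
<span data-reactid=".2e46q6vkxnc.1.$1.1.1">First Name</span>
</i>
</p>
</div>
"""

tree = lxml_html.fromstring(page)


def value_for(label):
    """Return the text of the second <span> in the <p> whose <b> equals label."""
    hits = tree.xpath("//p[b=$label]//span[2]/text()", label=label)
    return hits[0].strip() if hits else None


print(value_for("Mark Obtain"))
print(value_for("Student Name"))
```

Because the match is anchored on the visible label text, the changing data-reactid values never appear in the expression.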
In a custom step in cucumber, I wrote this:
find(:xpath, "//ul//input[@placeholder = 'Enter Something'][last()]").set(value)
And I'm getting a Regexp ambiguous match error: it is matching both elements.
How can I get this element using XPath (or maybe even CSS) in cucumber?
I'm using cucumber-1.2.1 and capybara-2.0.3.
(Please note: every attribute of the two input fields above is the same.)
HTML:
<ul class = "someclass">
<li>
<div>
<a></a>
<input></input>
<input placeholder = "Enter Something"></input>
</div>
</li>
<li>
<div>
<a></a>
<input></input>
<input placeholder = "Enter Something"> // This is the element I want
</input>
</div>
</li>
</ul>
You'll need an extra set of parentheses in your XPath:
"(//ul//input[@placeholder = 'Enter Something'])[last()]"