How to pass a web element into BeautifulSoup - web-scraping

I am getting web elements like this:
elements = browser.find_elements_by_xpath("//*[contains(text(), 'Open Until')]")
Now I have to pass this element to soup to find its next and previous siblings. I am trying this:
soup = BeautifulSoup(elements, 'html.parser')
What should I write instead? Something like this?
soup = BeautifulSoup(elements.source, 'html.parser')
Please suggest.

You don't need to mix the two, and you can't pass a WebElement straight to BeautifulSoup. Selenium has its own way to get the previous and next siblings. Example:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://140f670e-5774-43b5-a1a5-c993f66fa51d.htmlpasta.com/')
element = driver.find_element_by_xpath("//*[contains(text(), 'Open Until')]")
prevSibling = element.find_element_by_xpath('./preceding-sibling::*[1]')  # nearest preceding sibling
nextSibling = element.find_element_by_xpath('./following-sibling::*[1]')  # nearest following sibling
print(prevSibling.tag_name + ': ' + prevSibling.text)
print(element.tag_name + ': ' + element.text)
print(nextSibling.tag_name + ': ' + nextSibling.text)
driver.close()
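Note that newer Selenium releases (4.x) removed the find_element_by_* helpers, so on a current install the equivalent lookup goes through the By locator:
from selenium.webdriver.common.by import By
element = driver.find_element(By.XPATH, "//*[contains(text(), 'Open Until')]")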

The elements returned by Selenium are WebElement objects, not HTML. A WebElement has to be converted to HTML before BeautifulSoup can parse it.
# List of WebElements
elements = browser.find_elements_by_xpath("//*[contains(text(), 'Open Until')]")
# iterate over all the elements found
for element in elements:
    element_html = element.get_attribute('outerHTML')  # the exact HTML content of the element
    element_soup = BeautifulSoup(element_html, 'html.parser')
    print(element_soup)
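One caveat: a soup built from a single element's outerHTML contains only that element, so BeautifulSoup's sibling navigation will find nothing in it. A hypothetical sketch that parses the parent's HTML instead, so the siblings survive the round trip:
# climb to the parent so the element's siblings end up in the soup
parent_html = elements[0].find_element_by_xpath('..').get_attribute('outerHTML')
parent_soup = BeautifulSoup(parent_html, 'html.parser')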

This should be a comment, but I am not able to add one.
elements is a list, and a WebElement has no .source attribute. So it should be:
soup = BeautifulSoup(elements[0].parent.page_source, 'html.parser')
(a WebElement's .parent is the WebDriver that produced it), or create the soup directly from the browser:
soup = BeautifulSoup(browser.page_source, 'html.parser')
and then search for your elements in the soup, as in the sketch below.
There is no information about this in https://selenium-python.readthedocs.io/locating-elements.html or https://saucelabs.com/resources/articles/selenium-tips-css-selectors
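A minimal sketch of that approach, assuming the same 'Open Until' text as in the question:
from bs4 import BeautifulSoup

soup = BeautifulSoup(browser.page_source, 'html.parser')
# find the text node containing 'Open Until', then step up to its enclosing tag
tag = soup.find(string=lambda s: s and 'Open Until' in s).parent
print(tag.find_previous_sibling())
print(tag.find_next_sibling())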

Related

Beautiful Soup returning only the last URL of a txt file

I'm trying to parse a set of URLs from a txt file, but Beautiful Soup is returning only the content of the last URL. The file holds URLs of movie reviews from the website Letterboxd. For example, if the file has 10 URLs, I'm getting "None" for the first 9; only the 10th returns properly. Can someone help me?
from bs4 import BeautifulSoup
import requests
with open('list_of_urls.txt', 'r') as f:
    x = f.readlines()

for url in x:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    text = soup.find(class_='review body-text -prose -hero -loose')
    print(text)
To find an element by several classes, a list of class names should be used:
text = soup.find(class_= ['review', 'body-text', '-prose', '-hero', '-loose'])
It also seems that letterboxd.com may use different combinations of classes on the review element, e.g. review body-text -prose -hero prettify, so I would recommend searching by fewer classes, e.g.
text = soup.find(class_= ['review', 'body-text'])
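One nuance worth knowing: a list passed to class_ matches tags carrying any one of the listed classes. If you need tags that have both classes at once, a CSS selector does that:
text = soup.select_one('.review.body-text')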
Thank you so much! But I found out that the URLs had a \n at the end, so I used rstrip('\n') to delete it.
Btw, Alexandra's tip helped a lot for future extraction! Thank you!
This is my new code:
for url in x:
    url = url.rstrip('\n')
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    text = soup.find(class_='review body-text -prose -hero -loose')
    print(text)
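A slightly tidier variant (assuming the same one-URL-per-line file) strips each line as it is read, which also covers trailing spaces and \r\n line endings:
from bs4 import BeautifulSoup
import requests

with open('list_of_urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    print(soup.find(class_=['review', 'body-text']))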

For Loop Not Repeating

I'm trying to find the hrefs for all the states where this company has stores; however, it only finds the href for the first state.
Can anyone figure out why the for loop doesn't repeat for the rest of the states? Thank you very much for your help!
import requests
from bs4 import BeautifulSoup
import csv
# website
sitemap = 'website_url'
# content of website
sitemap_content = requests.get(sitemap).content
# parsing website
soup = BeautifulSoup(sitemap_content, 'html.parser')
#print(soup)
list_of_divs = soup.findAll('div', attrs={'class':'listings-inner'})
#print(list_of_divs)
header = ['Links']
with open('/Users/ABC/Desktop/v1.csv', 'wt') as csvfile:
    writer = csv.writer(csvfile, delimiter="\t")
    writer.writerow(header)
    for state in list_of_divs:
        # get the urls by state
        print(state.find('div', attrs={'class':'itemlist'}).a.get('href'))
        rows = [state.find('div', attrs={'class':'itemlist'}).a.get('href')]
        writer.writerow(rows)
list_of_divs actually contains only one element: the single div on the page with class listings-inner. So when you iterate over it and call the find method, find returns only the first result.
You want to use the find_all method on that div:
import requests
from bs4 import BeautifulSoup
sitemap = 'https://stores.dollargeneral.com/'
sitemap_content = requests.get(sitemap).content
soup = BeautifulSoup(sitemap_content, 'html.parser')
listings_div = soup.find('div', attrs={'class':'listings-inner'})
for state in listings_div.find_all('div', attrs={'class':'itemlist'}):
    print(state.a.get('href'))

CSS selector or XPath that gets information between two i tags?

I'm trying to scrape price information, and the HTML of the website looks like this
<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>
I want to get 999 (I don't want the dollar sign or the .00). I currently have
product_price_sn = product.css('.def-price i').extract()
I know it's wrong, but I'm not sure how to fix it. Any idea how to scrape that price information? Thanks!
You can use this XPath: //span[@class="def-price"]/text()
Make sure you use /text() and not //text(); otherwise it will return all text nodes inside the span tag.
Or use this CSS selector: .def-price::text. When using the CSS selector, don't write .def-price ::text (with a space); that returns all text nodes, just like //text() in XPath.
Using scrapy response.xpath object
from scrapy.http import Request, HtmlResponse as Response
content = '''<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>'''.encode('utf-8')
url = 'https://stackoverflow.com/questions/62849500'
''' mocking scrapy request object '''
request = Request(url=url)
''' mocking scrapy response object '''
response = Response(url=url, request=request, body=content)
''' using xpath '''
print(response.xpath('//span[@class="def-price"]/text()').extract())
# outputs ['\n ', '\n "999"\n ']
print(''.join(response.xpath('//span[@class="def-price"]/text()').extract()).strip())
# outputs "999"
''' using css selector '''
print(response.css('.def-price::text').extract())
# outputs ['\n ', '\n "999"\n ']
print(''.join(response.css('.def-price::text').extract()).strip())
# outputs "999"
Using lxml html parser
from lxml import html
parser = html.fromstring("""
<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>
"""
)
print(parser.xpath('//span[@class="def-price"]/text()'))
# outputs ['\n ', '\n "999"\n ']
print(''.join(parser.xpath('//span[@class="def-price"]/text()')).strip())
# outputs "999"
With BeautifulSoup, you can use the CSS selector .def-price and then .find_all(text=True, recursive=False) to get all immediate text children.
For example:
from bs4 import BeautifulSoup
txt = '''<span class="def-price" datasku='....'>
<i>$</i>
"999"
<i>.00<i>
</span>'''
soup = BeautifulSoup(txt, 'html.parser')
print( ''.join(soup.select_one('.def-price').find_all(text=True, recursive=False)).strip() )
Prints:
"999"
Scrapy implements an extension for this, as selecting text nodes isn't standard for CSS selectors. So this should work for you (note that ::text goes on the span itself; the i tags hold only the $ and the .00):
product_price_sn = product.css('.def-price::text').extract()
Here is what the docs say:
Per W3C standards, CSS selectors do not support selecting text nodes or attribute values. But selecting these is so essential in a web scraping context that Scrapy (parsel) implements a couple of non-standard pseudo-elements:
- to select text nodes, use ::text
- to select attribute values, use ::attr(name), where name is the name of the attribute that you want the value of
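For instance, a minimal parsel sketch (parsel is the standalone library behind Scrapy's selectors) showing both pseudo-elements against the markup from the question:
from parsel import Selector

sel = Selector(text='<span class="def-price" datasku="...."><i>$</i>"999"<i>.00</i></span>')
print(sel.css('.def-price::text').getall())        # text nodes directly under the span
print(sel.css('.def-price::attr(datasku)').get())  # value of the datasku attribute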

BeautifulSoup not finding all tags when using .find method?

I am trying to scrape the number of trending repositories from https://github.com/trending using BeautifulSoup in Python. The code is supposed to find all tags with class_ = "Box-row" and then print the number found. On the site the actual number of trending repositories is 25, but the code only returns 9.
I have tried changing the parser from 'html.parser' to 'lxml' but both returned the same results.
import requests
from bs4 import BeautifulSoup

page = requests.get('https://github.com/trending')
soup = BeautifulSoup(page.text, 'html.parser')
repo = soup.find(class_ = "Box-row")
print(len(repo))
In the HTML there are 25 tags with the "Box-row" class attribute, so I expected print(len(repo)) to give 25, but instead it's 9.
Try this:
repo = soup.find_all("article", {"class": "Box-row"})
find returns only the first matching tag, and calling len() on that single Tag counts its children, which is where the 9 comes from; find_all returns all 25 matches.
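A complete sketch of the fixed version (the printed count will track whatever the trending page currently shows):
import requests
from bs4 import BeautifulSoup

page = requests.get('https://github.com/trending')
soup = BeautifulSoup(page.text, 'html.parser')
repos = soup.find_all('article', {'class': 'Box-row'})
print(len(repos))  # one Box-row article per trending repository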

AttributeError: 'NoneType' object has no attribute 'text' in web-scraping

I'm trying to recreate the web scraping from this tutorial:
https://medium.freecodecamp.org/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe
I'm working in Jupyter as a first project, and I've run into this error:
AttributeError: 'NoneType' object has no attribute 'text'
I've tried changing the link, but it makes no difference. I don't really know enough to do anything about the problem. Here is all the code so far...
#import the libraries
import urllib.request
from bs4 import BeautifulSoup
# specify the url
quote_page = "https://www.bloomberg.com/quote/SP1:IND"
page = urllib.request.urlopen(quote_page)
# parse the html using BeautifulSoup and store in variable `soup`
soup = BeautifulSoup(page, "html.parser")
# Take out the <div> of name and get its value
name_box = soup.find("h1", attrs={"class": "name"})
name = name_box.text.strip()
# strip() is used to remove starting and trailing
print (name)
# get the index price
price_box = soup.find("div", attrs={"class":"price"})
price = price_box.text.strip()
print (price)
Any help would be appreciated a lot
I use Selenium to web-scrape, but I am sure I can help you out (maybe).
This section is where your code gives you the error, I assume:
price_box = soup.find("div", attrs={"class":"price"})
price = price_box.text.strip()
print (price)
The error means soup.find returned None, i.e. no div with class "price" was found in the page the script received, so check for that before touching .text:
price_box = soup.find("div", attrs={"class": "price"})
if price_box is not None:
    print(price_box.text.strip())
else:
    print("price div not found in the fetched page")
