Beautiful Soup 4, findAll - web-scraping

My code is this:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.chembid.com/results/?q=124-07-2&sort=price'

# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# html parser
page_soup = soup(page_html, "html.parser")

for Container in Containers:
    name = Container.div.div.span
    title_container = Container.findAll("a", {"class": "supplier"})
    supplier = title_container[0].text
What I'm now trying to do is use bs4's findAll:
cas_no = Container.findAll("span", {"class": "regular-small-regular-small-font block"})
on a result whose visible text looks like this:
Factory supply high quality 99% min Octanoic acid/caprylic acid CAS 124-07-2 used in the manufacture of dyes, drugs, spices
Verifizierter Anbieter (verified supplier)
Shandong Baovi Energy Technology Co., Ltd.
China
CAS-No.: 124-07-2
Quality/Grade: Agriculture Grade, Electron Grade, Food Grade, Industrial Grade, Medicine Grade, Reagent Grade
www.alibaba.com
$0.25 - 3.68
per Kilogram, FOB
Show offer
What I'm trying to extract is the name, supplier, CAS number, quality, and price.
Thanks

The first thing I see is that you're trying to iterate over your Containers object, but you never stored anything under that name. You'll need to assign it before iterating over it.
Hopefully someone will post a more robust solution, but in terms of what you're asking for as output, this will pull it from that specific page. A few parts are not present on every result, so I had to account for those and fall back to a null when they're missing. Nonetheless, this should get you going:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import pandas as pd

results = pd.DataFrame()

my_url = 'https://www.chembid.com/results/?q=124-07-2&sort=price'

# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# html parser
page_soup = soup(page_html, "html.parser")

# each search result lives in its own wrapper div
containers = page_soup.find_all('div', {'class': "result-horizontal-wrapper"})

for container in containers:
    name = container.div.div.span.text

    # the supplier link is missing on some results, so fall back to 'n/a'
    if container.find('a', {'class': 'supplier'}):
        supplier = container.find('a', {'class': 'supplier'}).text
    else:
        supplier = 'n/a'

    # CAS number and quality share the same span class, so tell them apart by text
    span_cas_quality = container.find_all('span', {'class': 'regular-small-font block'})
    cas_no = [x.text for x in span_cas_quality if 'CAS' in x.text]
    quality = [x.text for x in span_cas_quality if 'Quality/Grade' in x.text]
    cas_no = cas_no[0] if cas_no else None
    quality = quality[0] if quality else None

    span_price = container.select('span.black-bold-font-big')[0].text
    span_rate = container.select('span.block.regular-small-font.price')[0].text

    temp_df = pd.DataFrame([[name, supplier, cas_no, quality, span_price, span_rate]],
                           columns=['name', 'supplier', 'cas_no', 'quality', 'price', 'rate'])
    # DataFrame.append is gone from recent pandas; concat does the same job
    results = pd.concat([results, temp_df]).reset_index(drop=True)
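As a quick sanity check you can print the frame or write it to disk; the filename here is just an example:
print(results)
results.to_csv('chembid_results.csv', index=False)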

Related

Extract Data from Div class used multiple times

I am trying to get the company location from this website: https://slashdot.org/software/p/monday.com/
I am able to get close with the following code, but I am unable to navigate to the value itself.
Code:
import requests
from bs4 import BeautifulSoup as bs

url = 'https://slashdot.org/software/p/monday.com/'
profile = requests.get(url)
soup = bs(profile.content, 'lxml')
location = soup.select_one('div:nth-of-type(4).field-row').text
I feel like this is getting me in the area, but I've been unable to navigate over to "United States." Can someone show me what I am doing wrong?
Desired Out:
United States
Thanks!
To get the desired data you can use the :-soup-contains() pseudo-class and put the results into a dict to get both the key and the value:
import requests
from bs4 import BeautifulSoup

url = 'https://slashdot.org/software/p/monday.com/'
req = requests.get(url)
soup = BeautifulSoup(req.text, 'lxml')

# the label div and its value div are adjacent siblings inside the same .field-row
label = soup.select_one('.field-row div:-soup-contains("Headquarters")')
value = soup.select_one('.field-row div:-soup-contains("Headquarters") + div')
d = {label.text.replace(':', ''): value.text}
print(d)
Output:
{'Headquarters': 'United States'}
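If you want every label/value pair on the page rather than just Headquarters, here is a minimal sketch of the same idea; it assumes each .field-row holds exactly one label div followed by one value div, which matches this page at the time of writing:
details = {}
for row in soup.select('.field-row'):
    divs = row.select('div')
    if len(divs) == 2:  # label div + value div
        key = divs[0].text.strip().rstrip(':')
        details[key] = divs[1].text.strip()
print(details)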

Web scraping for hidden content

I am trying to scrape the price data from this website: https://fuelkaki.sg/home
However, the data does not appear in the HTML the server returns. On inspecting the page, the data seems to be nested in a tag (for instance under Caltex for the retailer name), and similarly under multiple nested tags for the price data, which I am unable to scrape with the following code (it finds no results).
Any help would be much appreciated.
import requests
from bs4 import BeautifulSoup
URL = 'https://fuelkaki.sg/home'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find('div', class_='fuel-name')
The table is behind JS (JavaScript) so BeautifulSoup won't see it.
Here's how I'd do it:
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = False
driver = webdriver.Chrome(options=options)

url = "https://fuelkaki.sg/home"
driver.get(url)
time.sleep(3)  # give the JavaScript time to render the table

# note the @ in the XPath; //*[#class="table"] is not valid XPath
element = driver.find_element_by_xpath('//*[@class="table"]')
print(element.text)
driver.close()
Output:
Diesel
92
95
98
Others
(V-Power, etc)
Caltex
30 September 2020, 02:05pm
S$ 1.73
S$ 2.02
S$ 2.06
N.A.
S$ 2.61
and so on...
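Note: Selenium 4 removed the find_element_by_* helpers, so on a current install you would locate the table like this instead:
from selenium.webdriver.common.by import By

# same XPath, new-style locator API
element = driver.find_element(By.XPATH, '//*[@class="table"]')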
EDIT:
If you want the table in a Dataframe try this:
import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = False
driver = webdriver.Chrome(options=options)

url = "https://fuelkaki.sg/home"
driver.get(url)
time.sleep(3)  # wait for the JavaScript-rendered table

html = driver.page_source
soup = BeautifulSoup(html, "html.parser").select_one(".table")

# read_html returns a list of DataFrames, hence the concat
df = pd.read_html(str(soup))
df = pd.concat(df).rename(columns={"Unnamed: 0": ""})
df.to_csv("fuel_data.csv", index=False)
driver.close()
This outputs a .csv file with the table's data.
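To eyeball the result without opening the file, you can read it straight back with pandas:
print(pd.read_csv("fuel_data.csv").head())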

For Loop Not Repeating

I'm trying to find the hrefs for all the states where this company has stores, however it only finds the href for the first state.
Can anyone figure out why the for loop doesn't repeat for the rest of the states? Thank you very much for your help!
import requests
from bs4 import BeautifulSoup
import csv

# website
sitemap = 'website_url'

# content of website
sitemap_content = requests.get(sitemap).content

# parsing website
soup = BeautifulSoup(sitemap_content, 'html.parser')
#print(soup)

list_of_divs = soup.findAll('div', attrs={'class': 'listings-inner'})
#print(list_of_divs)

header = ['Links']
with open('/Users/ABC/Desktop/v1.csv', 'wt') as csvfile:
    writer = csv.writer(csvfile, delimiter="\t")
    writer.writerow(header)
    for state in list_of_divs:
        # get the url's by state
        print(state.find('div', attrs={'class': 'itemlist'}).a.get('href'))
        rows = [state.find('div', attrs={'class': 'itemlist'}).a.get('href')]
        writer.writerow(rows)
list_of_divs actually contains only one element: the only div on the page with class listings-inner. So when you iterate through its elements and use the find method, it only ever returns the first result.
You want to use the find_all method on that div:
import requests
from bs4 import BeautifulSoup

sitemap = 'https://stores.dollargeneral.com/'
sitemap_content = requests.get(sitemap).content
soup = BeautifulSoup(sitemap_content, 'html.parser')

listings_div = soup.find('div', attrs={'class': 'listings-inner'})
for state in listings_div.find_all('div', attrs={'class': 'itemlist'}):
    print(state.a.get('href'))
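And if you still want the CSV from your original script, the same loop plugs straight into your writer (reusing your example path):
import csv

with open('/Users/ABC/Desktop/v1.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter='\t')
    writer.writerow(['Links'])
    for state in listings_div.find_all('div', attrs={'class': 'itemlist'}):
        # one row per state link
        writer.writerow([state.a.get('href')])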

Beautiful Soup pagination: find_all not finding text within the next_page class; I also need to extract data from URLs

I've been working on this for a week and am determined to get it working!
My ultimate goal is to write a web scraper where you can insert the county name and the scraper will produce a csv file of information from mugshots - Name, Location, Eye Color, Weight, Hair Color and Height (it's a genetics project I am working on).
The site organization is: primary site page --> state page --> county page (120 mugshots with name and url) --> url with the data I am ultimately after, plus next links to another set of 120.
I thought the best way to do this would be to write a scraper that grabs the URLs and names from the table of 120 mugshots and then uses pagination to grab all the URLs and names from the rest of the county (in some cases there are tens of thousands). I can get the first 120, but my pagination doesn't work, so I'm ending up with a csv of 120 names and urls.
I closely followed this article, which was very helpful.
from bs4 import BeautifulSoup
import requests
import lxml
import pandas as pd

county_name = input('Please, enter a county name: /Arizona/Maricopa-County-AZ \n')
print(f'Searching {county_name}. Wait, please...')

base_url = 'https://www.mugshots.com'
search_url = f'https://mugshots.com/US-Counties/{county_name}/'

data = {'Name': [], 'URL': []}

def export_table_and_print(data):
    table = pd.DataFrame(data, columns=['Name', 'URL'])
    table.index = table.index + 1
    table.to_csv('mugshots.csv', index=False)
    print('Scraping done. Here are the results:')
    print(table)

def get_mugshot_attributes(mugshot):
    name = mugshot.find('div', attrs={'class', 'label'})
    url = mugshot.find('a', attrs={'class', 'image-preview'})
    name = name.text
    url = mugshot.get('href')
    url = base_url + url
    data['Name'].append(name)
    data['URL'].append(url)

def parse_page(next_url):
    page = requests.get(next_url)
    if page.status_code == requests.codes.ok:
        bs = BeautifulSoup(page.text, 'lxml')
        list_all_mugshot = bs.find_all('a', attrs={'class', 'image-preview'})
        for mugshot in list_all_mugshot:
            get_mugshot_attributes(mugshot)
        next_page_text = mugshot.find('a class', attrs={'next page'})
        if next_page_text == 'Next':
            next_page_text = mugshot.get_text()
            next_page_url = mugshot.get('href')
            next_page_url = base_url + next_page_url
            print(next_page_url)
            parse_page(next_page_url)
        else:
            export_table_and_print(data)

parse_page(search_url)
Any ideas on how to get the pagination to work and also how to eventually get the data from the list of URLs I scrape?
I appreciate your help! I've been working in python for a few months now, but the BS4 and Scrapy stuff is so confusing for some reason.
Thank you so much community!
Anna
It seems you want to know the logic for getting the content from the URLs collected while traversing the next pages. This is how you can parse all the links from each page, including the next page, and then use those links to get the content from their inner pages:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://mugshots.com/"
base = "https://mugshots.com"

def get_next_pages(link):
    print("**" * 20, "current page:", link)
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    # follow every listing link on this page into its inner page
    for item in soup.select("[itemprop='name'] > a[href^='/Current-Events/']"):
        yield from get_main_content(urljoin(base, item.get("href")))
    # if there is a "Next" link, recurse into the next page of results
    next_page = soup.select_one(".pagination > a:contains('Next')")
    if next_page:
        next_page = urljoin(url, next_page.get("href"))
        yield from get_next_pages(next_page)

def get_main_content(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    item = soup.select_one("h1#item-title > span[itemprop='name']").text
    yield item

if __name__ == '__main__':
    for elem in get_next_pages(url):
        print(elem)
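One design note: the recursion goes one level deeper per results page, so a county with thousands of "Next" pages could exceed Python's default recursion limit. Here is an iterative sketch of the same traversal, reusing the imports and get_main_content from above:
def get_all_pages(start):
    # walk the "Next" chain with a loop instead of recursion
    link = start
    while link:
        res = requests.get(link)
        soup = BeautifulSoup(res.text, "lxml")
        for item in soup.select("[itemprop='name'] > a[href^='/Current-Events/']"):
            yield from get_main_content(urljoin(base, item.get("href")))
        next_page = soup.select_one(".pagination > a:contains('Next')")
        link = urljoin(url, next_page.get("href")) if next_page else None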

'None' returned when web scraping with Beautiful Soup using find()

I am trying to pull the FTSE price from the BBC website using BeautifulSoup & Requests, but I get the output 'None' when I run it.
import sys
import requests
from bs4 import BeautifulSoup
URL = 'https://www.bbc.co.uk/news/topics./c9qdqqkgz27t/ftse-100'
page = requests.get(URL,timeout=5)
#fetch content from URL
soup = BeautifulSoup(page.content,'html.parser')
#parse html content
price = soup.find(class_='gel-paragon nw-c-md-market-summary_value')
#price = soup.find("div", class_="gel-paragon nw-c-md-market-summary_value")
#find class with name 'gel...'
print(price)
I've tried different variants of the find function, but they all return the same thing. I plan to use this logic to gather data from multiple pages eventually, but I want to get it right before I try to iterate.
Your URL was wrong; I made a few edits and it works!
import requests
from bs4 import BeautifulSoup
URL = 'https://www.bbc.co.uk/news/topics/c9qdqqkgz27t/ftse-100'
page = requests.get(URL)
soup = BeautifulSoup(page.content,'html.parser')
price = soup.find('div', attrs={'class': 'gel-paragon nw-c-md-market-summary__value'})
print(price.text)
Output:
7442.28
This works perfectly:
import requests
from bs4 import BeautifulSoup
url = 'https://www.bbc.com/news/topics/c9qdqqkgz27t/ftse-100'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
price = soup.select_one('div.gel-paragon')
print(price.text)
Output:
7418.34
Note: Try 'html.parser' if you don't have 'lxml'
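One general tip for the original error: find() returns None when nothing matches, so guard before touching .text to get a readable failure instead of an AttributeError:
price = soup.find('div', attrs={'class': 'gel-paragon nw-c-md-market-summary__value'})
if price is not None:
    print(price.text)
else:
    print('price element not found')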
