BeautifulSoup 4, Finding text without identifiers - web-scraping

I am helping a non-profit to scrape their eBay store listings.
So far I have this code working properly:
testlink = 'https://www.ebay.com/itm/Pal-Zileri-Mens-Brown-Solid-Loro-Piana-Blazer-44R-2-975/224099569981?hash=item342d60113d:g:DWAAAOSwNZFfEHjF'
r = requests.get(testlink, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
name = soup.find('h1', class_='it-ttl').text.strip("Details, about")
price = soup.find('span', class_='notranslate').text.strip("US, $")
ebayID = soup.find('div', class_='u-flL iti-act-num itm-num-txt').text
color = soup.find('h2', itemprop='color').text
brand = soup.find('h2', itemprop='brand').text
However, I am not being able to extract the following info from the image bellow:
44
tree button
wool blend
ventless
solid
enter image description here
also it would be awesome to scrape the information from the image bellow:
enter image description here
Thank you

To extract the item number and attributes, you can use this example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ebay.com/itm/Pal-Zileri-Mens-Brown-Solid-Loro-Piana-Blazer-44R-2-975/224099569981?hash=item342d60113d:g:DWAAAOSwNZFfEHjF'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# extract the attributes:
for label, value in zip(soup.select('td.attrLabels'), soup.select('td.attrLabels + td')):
label = label.get_text(strip=True)
value = value.get_text(strip=True)
print('{:<30} {}'.format(label, value))
# extract the item number:
soup = BeautifulSoup(requests.get(soup.iframe['src']).content, 'html.parser')
number = soup.find(text=lambda t: t.strip().startswith('Item no.')).find_next('div').get_text(strip=True)
print('NUMBER:', number)
Prints:
Condition: Pre-owned:An item that has been used or worn previously. See the seller’s listing for full details anddescription of any imperfections.See all condition definitions- opens in a new window or tab...Read moreabout the condition
Size: 44
Type: Blazer
Jacket/Coat Length: Regular
Color: Brown
Department: Men
Brand: Pal Zileri
Chest Size: 44
Jacket Front Button Style: Three-Button
Material: Wool Blend
Jacket Vent Style: Ventless
Pattern: Solid
Size Type: Regular
Fit: Athletic
NUMBER: LXW349-JULW3
EDIT:
To save the labes/values in dictionary and to csv, you can use this example:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.ebay.com/itm/Pal-Zileri-Mens-Brown-Solid-Loro-Piana-Blazer-44R-2-975/224099569981?hash=item342d60113d:g:DWAAAOSwNZFfEHjF'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
data = {}
# extract the attributes:
for label, value in zip(soup.select('td.attrLabels'), soup.select('td.attrLabels + td')):
label = label.get_text(strip=True)
label = label.rstrip(':').lower()
value = value.get_text(strip=True)
print('{:<30} {}'.format(label, value))
data[label] = value
# extract the item number:
soup = BeautifulSoup(requests.get(soup.iframe['src']).content, 'html.parser')
number = soup.find(text=lambda t: t.strip().startswith('Item no.')).find_next('div').get_text(strip=True)
print('NUMBER:', number)
data['item number'] = number
df = pd.DataFrame([data])
df.to_csv('data.csv', index=False)
Creates data.csv (screenshot from LibreOffice):
EDIT 2: To include an image link:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.ebay.com/itm/Pal-Zileri-Mens-Brown-Solid-Loro-Piana-Blazer-44R-2-975/224099569981?hash=item342d60113d:g:DWAAAOSwNZFfEHjF'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# https://i.ebayimg.com/images/g/DWAAAOSwNZFfEHjF/s-l1600.jpg
data = {}
# extract the attributes:
for label, value in zip(soup.select('td.attrLabels'), soup.select('td.attrLabels + td')):
label = label.get_text(strip=True)
label = label.rstrip(':').lower()
value = value.get_text(strip=True)
print('{:<30} {}'.format(label, value))
data[label] = value
# extract the image
image = soup.select_one('[itemprop="image"]')['src'].replace('l300', 'l1600')
data['image'] = image
# extract the item number:
soup = BeautifulSoup(requests.get(soup.iframe['src']).content, 'html.parser')
number = soup.find(text=lambda t: t.strip().startswith('Item no.')).find_next('div').get_text(strip=True)
print('NUMBER:', number)
data['item number'] = number
df = pd.DataFrame([data])
df.to_csv('data.csv', index=False)

Related

How can I save this scrapped data into a data frame?

Hey this is my code that i used to scrap some data from website for practise. Can you help me set it into a data frame and save it?
url = "https://aedownload.com/download-magazine-promo-for-element-3d-free-videohive/"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
title = soup.find(class_="blog-title").text.strip()
project_details = soup.find( class_="project-details").text
link_wp = soup.find (class_="wp-video-shortcode").text
link_infopage = soup.find(class_="infopage112").text
project_description = soup.find(class_= "Project-discription").text
print(title)
print(project_details)
print(link_wp)
print(link_infopage)
print(project_description)
Create an empty dictionary and append items to dict1 and use pandas to create dataframe
dict1={}
dict1['title'] = soup.find(class_="blog-title").text.strip()
dict1['project_details'] = soup.find( class_="project-details").text
dict1['link_wp'] = soup.find (class_="wp-video-shortcode").text
dict1['link_infopage'] = soup.find(class_="infopage112").text
dict1['project_description'] = soup.find(class_= "Project-discription").text
import pandas as pd
df = pd.DataFrame()
df = df.append(dict1, ignore_index=True)
Output:
title project_details link_wp link_infopage project_description
0 Download Magazine Promo for Element 3D – FREE ... \nMagazine Promo for Element 3D 23030644 Video... https://previews.customer.envatousercontent.co... Buy it \nFree Download\n\n\n\n\n\n\nRelated Templates...
To create new DataFrame from the data you can try:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://aedownload.com/download-magazine-promo-for-element-3d-free-videohive/"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
title = soup.find(class_="blog-title").text.strip()
project_details = soup.find(class_="project-details").text
link_wp = soup.find(class_="wp-video-shortcode").text
link_infopage = soup.find(class_="infopage112").text
project_description = soup.find(class_="Project-discription").text
df = pd.DataFrame(
{
"title": [title],
"project_details": [project_details],
"link_wp": [link_wp],
"link_infopage": [link_infopage],
"project_description": [project_description],
}
)
df.to_csv("data.csv", index=False)
Saves data.csv (screenshot from LibreOffice):

my result only get 1 data, can you help me for get more data and save another file?

import requests
from bs4 import BeautifulSoup
import pandas as pd
URL = 'http://h1.nobbd.de/index.php?start='
for page in range(1,10):
req = requests.get(URL + str(page) + '=')
soup = BeautifulSoup(req.text, 'html.parser')
h1 = soup.find_all('div',attrs={'class','report-wrapper'})
for hack in h1:
h2 = hack.find_all("div",attrs={"class","report"})
for i in h2:
layanan = i.find_all('b')[0].text.strip()
report = i.find_all('a')[2].text.strip()
bug_hunter = i.find_all('a')[1].text.strip()
mirror = i.find("a", {"class": "title"})['href']
date = i.find_all("div", {"class": "date"})
for d in date:
waktu = d.text
data = {"Company": [layanan], "Title:": [report], "Submit:": [bug_hunter], "Link:": [mirror], "Date:": [waktu]}
df = pd.DataFrame(data)
my result only get 1 data, can you help me for get more data and save another file?
df.head()
index
Company
Title:
Submit:
Link:
Date:
0
Reddit
Application level DOS at Login Page ( Accepts Long Password )
e100_speaks
https://hackerone.com/reports/1168804
03 Feb 2022
What happens?
Based on your questions code, you will overwrite your dataframe with every iteration, thats why you only get one result.
How to fix?
Create an empty list before your loops
Append all the extracted dicts to this list
Create your dataframe based on that list of dicts
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd
data = []
url = 'http://h1.nobbd.de/index.php?start='
for page in range(1,3):
req = requests.get(url + str(page))
soup = BeautifulSoup(req.text, 'html.parser')
h1 = soup.find_all('div',attrs={'class','report-wrapper'})
for hack in h1:
h2 = hack.find_all("div",attrs={"class","report"})
for i in h2:
layanan = i.find_all('b')[0].text.strip()
report = i.find_all('a')[2].text.strip()
bug_hunter = i.find_all('a')[1].text.strip()
mirror = i.find("a", {"class": "title"})['href']
date = i.find_all("div", {"class": "date"})
for d in date:
waktu = d.text
data.append({'Company':[layanan], 'Title':[report], 'Submit':[hunter], 'link':[mirror], 'Date':[waktu]})
df = pd.DataFrame(data)

Having problem with web scraping with bs4 in python

My program returns different numbers each time. If I run each page individually it gives out the right results. I wanted to get all the links which have 3 or more votes.
from bs4 import BeautifulSoup as bs
import requests
import pandas
pg = 1
url ="https://stackoverflow.com/search?page="+str(pg)+"&tab=Relevance&q=scrappy%20python"
src = requests.get(url).text
soup = bs(src,'html.parser')
pages = soup.findAll('a',{'class' : 's-pagination--item js-pagination-item'})
number_of_pages = len(pages)
print(number_of_pages)
qualified=[]
while pg<=number_of_pages:
print("In Page :"+str(pg))
url = "https://stackoverflow.com/search?page=" + str(pg) + "&tab=Relevance&q=scrappy%20python"
src = requests.get(url).text
soup = bs(src, 'html.parser')
a_links = soup.findAll('a',{'class':'question-hyperlink'})
span_links = soup.findAll('span',{'class':'vote-count-post'})
hrefs = []
for a_link in a_links:
hrefs.append(a_link.get('href'))
for link in range(len(span_links)):
vote = span_links[link].strong.text
n = int(vote)
if n>2:
the_link = 'https://stackoverflow.com' + hrefs[link]
qualified.append(the_link)
print(len(qualified))
pg +=1
print(len(qualified)) will show the length of the full list that is your error. You can get how many links in each by adding i = 0 after while pg<=number_of_pages: and i += 1 after if n>2: then add print(i) before or after pg +=1.
Then code will be like this:
from bs4 import BeautifulSoup as bs
import requests
import pandas
pg = 1
url ="https://stackoverflow.com/search?page="+str(pg)+"&tab=Relevance&q=scrappy%20python"
src = requests.get(url).text
soup = bs(src,'html.parser')
pages = soup.findAll('a',{'class' : 's-pagination--item js-pagination-item'})
number_of_pages = len(pages)
print(number_of_pages)
qualified=[]
while pg<=number_of_pages:
i = 0
print("In Page :"+str(pg))
url = "https://stackoverflow.com/search?page=" + str(pg) + "&tab=Relevance&q=scrappy%20python"
src = requests.get(url).text
soup = bs(src, 'html.parser')
a_links = soup.findAll('a',{'class':'question-hyperlink'})
span_links = soup.findAll('span',{'class':'vote-count-post'})
hrefs = []
for a_link in a_links:
hrefs.append(a_link.get('href'))
for link in range(len(span_links)):
vote = span_links[link].strong.text
n = int(vote)
if n>2:
i += 1
the_link = 'https://stackoverflow.com' + hrefs[link]
qualified.append(the_link)
print(i)
pg +=1
#print(qualified)
Output:
6
In Page :1
1
In Page :2
4
In Page :3
2
In Page :4
3
In Page :5
2
In Page :6
2

How to extract title of each product using python web scraping

Here is the link:https://www.118100.se/sok/foretag/?q=brf&loc=&ob=rel&p=0
def get_index_data(soup):
try:
links = soup.find_all('div','a',id=False).get('href')
except:
links = []
print(links)
Find all div, which has class name Name (class="Name"). which gives you all title names. If you want href then iterate through all titles and find a tag which has title is text of title.text.
import requests
import bs4 as bs
url = 'https://www.118100.se/sok/foretag/?q=brf&loc=&ob=rel&p=0'
response = requests.get(url)
# print('Response:', response.status_code)
soup = bs.BeautifulSoup(response.text, 'lxml')
titles = soup.find_all('div', {'class': 'Name'})
# a = soup.find_all('a')
# print(a)
for title in titles:
link = soup.find('a', {'title': title.text}).get('href')
print('https://www.118100.se' + link)

Unable to get data into correct format after web scraping in python

i have written a code that scrapes the websites: https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=
{}&PageSize=36&order=BESTMATCH".format(page)
but when i run this code, data is not formtted, like product name is coming in ever cell and so on price and image.
from urllib.request import urlopen
from bs4 import BeautifulSoup
f = open("Scrapedetails.csv", "w")
Headers = "Item_Name, Price, Image\n"
f.write(Headers)
for page in range(1,15):
page_url = "https://www.newegg.com/Product/ProductList.aspx?
Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=
{}&PageSize=36&order=BESTMATCH".format(page)
html = urlopen(page_url)
bs0bj = BeautifulSoup(html, "html.parser")
page_details = bs0bj.find_all("div", {"class":"item-container"})
for i in page_details:
Item_Name = i.find("a", {"class":"item-title"})
Price = i.find("li", {"class":"price-current"})
Image = i.find("img")
Name_item = Item_Name.get_text()
Prin = Price.get_text()
imgf = Image["src"]# to get the key src
f.write("{}".format(Name_item).strip()+ ",{}".format(Prin).strip()+
",{}".format(imgf)+ "\n")
f.close()
can someone help me to ammend codes so that i can get name in name column, price in price column and image in image column.
what are the new ways to save data in csv,can someone help me in it with codes too?
Alright i got it solved.
from urllib.request import urlopen
from bs4 import BeautifulSoup
f = open("Scrapedetails.csv", "w")
Headers = "Item_Name, Price, Image\n"
f.write(Headers)
for page in range(1,15):
page_url = "https://www.newegg.com/Product/ProductList.aspx?
Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=
{}&PageSize=36&order=BESTMATCH".format(page)
html = urlopen(page_url)
bs0bj = BeautifulSoup(html, "html.parser")
page_details = bs0bj.find_all("div", {"class":"item-container"})
for i in page_details:
Item_Name = i.find("a", {"class":"item-title"})
Price = i.find("li", {"class":"price-current"}).find('strong')
Image = i.find("img")
Name_item = Item_Name.get_text().strip()
prin = Price.get_text()
imgf = Image["src"]# to get the key src
print(Name_item)
print(prin)
print('https:{}'.format(imgf))
f.write("{}".format(Name_item).replace(",", "|")+ ",{}".format(prin)+ ",https:{}".format(imgf)+ "\n")
f.close()
These are the codes for anyone who wishes to start with webscraping a simplest way

Resources