Not Able To Scrape Website Title - Python Bs4 - web-scraping

I am trying to get the titles of the games, but along with each title I am also getting the span text.
Here is my code:
import time
import requests, pandas
from bs4 import BeautifulSoup

headers = {'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'}
r = requests.get("https://www.pocketgamer.com/android/best-horror-games/?page=1", headers=headers)
c = r.content
bs4 = BeautifulSoup(c, "html.parser")
all = bs4.find_all("h3", {"class": "indent"})
print(all)
Output
[<h3 class="indent">
<div><span>1</span></div>
Fran Bow </h3>, <h3 class="indent">
<div><span>2</span></div>
Bendy and the Ink Machine </h3>, <h3 class="indent">
<div><span>3</span></div>
Five Nights at Freddy's </h3>, <h3 class="indent">
<div><span>4</span></div>
Sanitarium </h3>, <h3 class="indent">
<div><span>5</span></div>
OXENFREE </h3>, <h3 class="indent">
<div><span>6</span></div>
Thimbleweed Park </h3>, <h3 class="indent">
<div><span>7</span></div>
Samsara Room </h3>, <h3 class="indent">
I also tried this code, but it is not working:
#all = all.find_all("h3")[0].text

How can I fix this?
Because the text you want is always the last element inside each <h3>, you can extract it from the contents of the <h3>:
element.contents[-1]
To get the text, iterate over the result set:
for x in bs4.find_all("h3", {"class": "indent"}):
    print(x.contents[-1].get_text(strip=True))
Example
import requests, pandas
from bs4 import BeautifulSoup

headers = {'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'}
r = requests.get("https://www.pocketgamer.com/android/best-horror-games/?page=1", headers=headers)
c = r.content
bs4 = BeautifulSoup(c, "html.parser")
all = [x.contents[-1].get_text(strip=True) for x in bs4.find_all("h3", {"class": "indent"})]
print(all)
Output
['Fran Bow', 'Bendy and the Ink Machine', "Five Nights at Freddy's", 'Sanitarium', 'OXENFREE', 'Thimbleweed Park', 'Samsara Room', 'Into the Dead 2', 'Slayaway Camp', 'Eyes - the horror game', 'Slendrina:The Cellar', 'Hello Neighbor', 'Alien: Blackout', 'Rest in Pieces', 'Friday the 13th: Killer Puzzle', 'I Am Innocent', 'Detention', 'Limbo', 'Knock-Knock', 'Sara Is Missing', 'Death Park: Scary Horror Clown', 'Horror Hospital 2', 'Horrorfield - Multiplayer Survival Horror Game', 'Erich Sann: Horror in the scary Academy', 'The Innsmouth Case']
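Since pandas is already imported in the question, the titles can also be written straight to a CSV; a minimal sketch continuing from the soup above (the file name horror_games.csv is just an example):
titles = [x.contents[-1].get_text(strip=True) for x in bs4.find_all("h3", {"class": "indent"})]
# build a one-column DataFrame and export it without the index column
pandas.DataFrame({"Title": titles}).to_csv("horror_games.csv", index=False)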

Related

Scraping an href

Could someone help me scrape an href attribute and clean it up? I am trying to scrape the URL from the big "Visit Website" button on this page: https://www.goodfirms.co/software/inflow-inventory, and then tidy it up a bit.
Code:
import time
import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.goodfirms.co/software/inflow-inventory'
page = requests.get(url)
time.sleep(2)
soup = bs(page.content, 'lxml')

try:
    url = soup.find("div", class_="entity-detail-header-visit-website")
except AttributeError:
    url = "Couldn't Find"

print(url)
Output Print:
<div class="entity-detail-header-visit-website">
<a class="visit-website-btn" href="https://www.inflowinventory.com/?utm_source=goodfirms&utm_medium=profile" rel="nofollow" target="_blank">Visit website</a>
</div>
Desired Output:
https://www.inflowinventory.com
This will get you what you need:
import requests
from bs4 import BeautifulSoup
headers= {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
r = requests.get('https://www.goodfirms.co/software/inflow-inventory', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
link = soup.select_one('a.visit-website-btn')
print(link['href'].split('/?utm')[0])
Result:
https://www.inflowinventory.com
Documentation for BeautifulSoup can be found at:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Try this code to get the href value:
url = soup.find("a", class_="visit-website-btn").get('href')
Once you have the complete URL, you can get the base with urlsplit:
from urllib.parse import urlsplit
print(urlsplit(url).netloc)
# www.inflowinventory.com
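Put together, a minimal runnable sketch of that approach (reusing the User-Agent header from the answer above):
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlsplit

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
page = requests.get('https://www.goodfirms.co/software/inflow-inventory', headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

# grab the href from the "Visit Website" button, then keep only the host part
url = soup.find("a", class_="visit-website-btn").get('href')
print(urlsplit(url).netloc)  # www.inflowinventory.com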
"div", class_="entity-detail-header-visit-website" detects the same url two times with html content. So .a.get('href') with find() method will pull the righ url
import requests
from bs4 import BeautifulSoup
url = 'https://www.goodfirms.co/software/inflow-inventory'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
link = soup.find("div", class_="entity-detail-header-visit-website").a.get('href')
print(link)
Output:
https://www.inflowinventory.com/?utm_source=goodfirms&utm_medium=profile
If you are looking for a solution that follows your original code, it looks like this:
import requests
from bs4 import BeautifulSoup
url = 'https://www.goodfirms.co/software/inflow-inventory'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
try:
    url = soup.find("div", class_="entity-detail-header-visit-website")
    print(url.a.get('href'))
except AttributeError:
    url = "Couldn't Find"
    print(url)
Result :
https://www.inflowinventory.com/?utm_source=goodfirms&utm_medium=profile

Beautiful soup articles scraping

Why does my code only find 5 articles instead of all 30 on the page?
Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
url = 'https://www.15min.lt/tema/svietimas-24297'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
antrastes = soup.find_all('h3', {'class': 'vl-title'})
print(antrastes)
The page uses JavaScript to add the items, but requests/BeautifulSoup can't run JavaScript.
One option is to use Selenium to control a real web browser, which can run JavaScript; it may also need some JavaScript code to scroll the page, as in the sketch below.
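A minimal Selenium sketch of that route (assuming selenium and a matching Chrome driver are installed; whether scrolling alone triggers the lazy loading is an assumption to verify on the page):
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.15min.lt/tema/svietimas-24297')

# scroll to the bottom a few times so the JavaScript can append more articles
for _ in range(3):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the page time to load the new items

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

for item in soup.find_all('h3', {'class': 'vl-title'}):
    print(item.text.strip())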
Alternatively, you can check in DevTools in Firefox/Chrome whether JavaScript loads the data from some URL, and try to use that URL with requests. It may need a Session to get cookies and headers from the first GET.
The code below uses a URL which I found in DevTools (tab: Network, filter: XHR).
It needs a different offset (a datetime) in the URL to get different rows - url.format(offset). If you use the current datetime, you don't even need to read the main page.
It needs the header 'X-Requested-With': 'XMLHttpRequest' to work.
The server sends JSON data with the keys rows (the HTML) and offset (the datetime for the next rows), and I use this offset to get the next rows. I run this in a loop to get more rows.
import urllib.parse
import requests
from bs4 import BeautifulSoup
import datetime
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

url = 'https://www.15min.lt/tags/ajax/list/svietimas-24297?tag=24297&type=&offset={}&last_row=2&iq=L&force_wide=true&cachable=1&layout%5Bw%5D%5B%5D=half_wide&layout%5Bw%5D%5B%5D=third_wide&layout%5Bf%5D%5B%5D=half_wide&layout%5Bf%5D%5B%5D=third_wide&cosite=default'

offset = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')

for _ in range(5):
    print('=====', offset, '=====')

    offset = urllib.parse.quote_plus(offset)
    response = requests.get(url.format(offset), headers=headers)
    data = response.json()

    soup = BeautifulSoup(data['rows'], 'html.parser')
    antrastes = soup.find_all('h3', {'class': 'vl-title'})

    for item in antrastes:
        print(item.text.strip())
        print('---')

    offset = data['offset']  # offset for the next batch of rows
Result:
===== 2022-03-09 21:20:36 =====
Konkursas „Praeities stiprybė – dabarčiai“. Susipažinkite su finalininkų darbais ir išrinkite nugalėtojus
---
ŠMSM į ukrainiečių vaikų ugdymą žada įtraukti ir atvykstančius mokytojus
---
Didėjant būrelių Vilniuje finansavimui, tikimasi įtraukti ir ukrainiečių vaikus
---
Mylėti priešus – ne glostyti palei plauką
---
Atvira pamoka su prof. Alfredu Bumblausku: „Ką reikėtų žinoti apie Ukrainos istoriją?“
---
===== 2022-03-04 13:20:21 =====
Vilniečiams vaikams – didesnis neformaliojo švietimo krepšelis
---
Premjerė: sudėtingiausiose situacijoje mokslo ir mokslininkų svarba tik didėja
---
Prasideda priėmimas į sostinės mokyklas: ką svarbu žinoti?
---
Dešimtokai lietuvių kalbos ir matematikos pasiekimus gegužę tikrinsis nuotoliniu būdu
---
Vilniuje prasideda priėmimas į mokyklas
---
===== 2022-03-01 07:09:05 =====
Nuotolinė istorijos pamoka apie Ukrainą sulaukė 30 tūkst. peržiūrų
---
J.Šiugždinienė: po Ukrainos pergalės bendradarbiavimas su šia herojiška valstybe tik didės
---
Vilniaus savivaldybė svarsto įkurdinti moksleivius buvusiame „Ignitis“ pastate
---
Socialdemokratai ragina stabdyti švietimo įstaigų tinklo pertvarką
---
Pokyčiai mokyklinėje literatūros programoje: mažiau privalomų autorių, brandos egzaminas – iš kelių dalių
---
===== 2022-02-26 11:04:29 =====
Mokytojo Gyčio „pagalbos“ – žygis, puodas ir uodas
---
Nuo kovo 2-osios pradinukams klasėse nebereikės dėvėti kaukių
---
Dr. Austėja Landsbergienė: Matematikos nerimas – kas tai ir ar įmanoma išvengti?
---
Ukrainos palaikymui – visuotinė istorijos pamoka Lietuvos mokykloms
---
Mokinius kviečia didžiausias chemijos dalyko konkursas Lietuvoje
---
===== 2022-02-23 10:11:14 =====
Mokyklų tinklo stiprinimas savivaldybėse: klausimai ir atsakymai
---
Vaiko ir paauglio kelias į sėkmę, arba Kaip gauti Nobelio premiją
---
Geriausias ugdymas – žygis, laužas, puodas ir uodas
---
Vilija Targamadzė: Bendrojo ugdymo mokyklų reformatoriai, ar ir toliau sėsite kakofoniją?
---
Švietimo ministrė: tai, kad turime sujungtas 5–8 klases, yra kažkas baisaus
---

Web scraping news page with a "load more"

I'm trying to scrape this news website, "https://inshorts.com/en/read/national", but I only get results for the articles that are currently displayed. I need all the articles on the site that contain a given word (e.g. "COVID-19"), without having to use the "Load More" button.
Here's my code, which only gets the currently displayed articles:
import requests
from bs4 import BeautifulSoup
import pandas as pd

dummy_url = "https://inshorts.com/en/read/badminton"
data_dummy = requests.get(dummy_url)
soup = BeautifulSoup(data_dummy.content, 'html.parser')

urls = ["https://inshorts.com/en/read/national"]
news_data_content, news_data_title, news_data_category, news_data_time = [], [], [], []

for url in urls:
    category = url.split('/')[-1]
    data = requests.get(url)
    soup = BeautifulSoup(data.content, 'html.parser')
    news_title = []
    news_content = []
    news_category = []
    news_time = []
    for headline, article, time in zip(soup.find_all('div', class_=["news-card-title news-right-box"]),
                                       soup.find_all('div', class_=["news-card-content news-right-box"]),
                                       soup.find_all('div', class_=["news-card-author-time news-card-author-time-in-title"])):
        news_title.append(headline.find('span', attrs={'itemprop': "headline"}).string)
        news_content.append(article.find('div', attrs={'itemprop': "articleBody"}).string)
        news_time.append(time.find('span', class_=["date"]))
        news_category.append(category)
    news_data_title.extend(news_title)
    news_data_content.extend(news_content)
    news_data_category.extend(news_category)
    news_data_time.extend(news_time)

df1 = pd.DataFrame(news_data_title, columns=["Title"])
df2 = pd.DataFrame(news_data_content, columns=["Content"])
df3 = pd.DataFrame(news_data_category, columns=["Category"])
df4 = pd.DataFrame(news_data_time, columns=["time"])
df = pd.concat([df1, df2, df3, df4], axis=1)

def name():
    a = input("File Name: ")
    return a

b = name()
df.to_csv(b + ".csv")
You can use this example of how to simulate clicking on the Load More button:
import re
import requests
from bs4 import BeautifulSoup
url = "https://inshorts.com/en/read/national"
api_url = "https://inshorts.com/en/ajax/more_news"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0"
}

# load first page:
html_doc = requests.get(url, headers=headers).text
min_news_id = re.search(r'min_news_id = "([^"]+)"', html_doc).group(1)

pages = 10  # <-- here I limit number of pages to 10
while pages:
    soup = BeautifulSoup(html_doc, "html.parser")

    # search the soup for your articles here
    # ...

    # here I just print the headlines:
    for headline in soup.select('[itemprop="headline"]'):
        print(headline.text)

    # load next batch of articles:
    data = requests.post(api_url, data={"news_offset": min_news_id}).json()
    html_doc = data["html"]
    min_news_id = data["min_news_id"]

    pages -= 1
Prints the news headlines of the first 10 pages:
...
Moeen has done some wonderful things in Test cricket: Root
There should be an evolution in player-media relationship: Federer
Swiggy in talks to raise over $500 mn at $10 bn valuation: Reports
Tesla investors urged to reject Murdoch, Kimbal Musk's re-election
Doctor dies on Pune-Mumbai Expressway when rolls of paper fall on his car
2 mothers name newborn girls after Cyclone Gulab in Odisha
100 US citizens, permanent residents waiting to leave Afghanistan
Iran's nuclear programme has crossed all red lines: Israeli PM
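To keep only the articles that mention a given word (e.g. "COVID-19"), you could filter inside that same loop; a minimal sketch of the check:
keyword = "COVID-19"
for headline in soup.select('[itemprop="headline"]'):
    # keep only headlines containing the keyword (case-insensitive)
    if keyword.lower() in headline.text.lower():
        print(headline.text)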

Is there any way of getting an output of all the header links? I got no output and no error

I tried using Beautiful Soup to scrape the header links out of Bing, but I get no errors and no output.
from bs4 import BeautifulSoup
import requests
search = input("Search for:")
params = {"q": search}
r = requests.get("http://www.bing.com/search", params=params)
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find("ol", {"id": "b_results"})
links = soup.findAll("li", {"class": "b_algo"})
for item in links:
    item_text = item.find("a").text
    item_href = item.find("a").attrs["href"]
    if item_text and item_href:
        print(item_text)
        print(item_href)
Try specifying the User-Agent HTTP header to obtain the results:
import requests
from bs4 import BeautifulSoup
url = 'https://www.bing.com/search'
params = {'q': 'tree'}
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
soup = BeautifulSoup(requests.get(url, headers=headers, params=params).content, 'html.parser')
for a in soup.select('.b_algo a'):
    print(a.text, a['href'])
Prints:
tree|好きな物語と出逢えるサイト https://tree-novel.com/
sustainably stylish home furniture Hong Kong | TREE https://tree.com.hk/
Chairs & Benches https://tree.com.hk/furniture/chairs-benches
Desks https://tree.com.hk/furniture/desks
Living Room https://tree.com.hk/rooms/living-room
Bedroom https://tree.com.hk/rooms/bedroom
Finishing Touches https://tree.com.hk/furniture/finishing-touches
Entryway https://tree.com.hk/rooms/entryway
Tree | Definition of Tree by Merriam-Webster https://www.merriam-webster.com/dictionary/tree
Tree | Definition of Tree at Dictionary.com https://www.dictionary.com/browse/tree
tree | Structure, Uses, Importance, & Facts | Britannica https://www.britannica.com/plant/tree
Tree Images · Nature Photography · Free Photos from Pexels ... https://www.pexels.com/search/tree/
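The output above also includes sitelinks (Chairs & Benches, Desks, and so on); if you only want the main title link of each result, narrowing the selector may help. A sketch assuming each organic result's title sits inside an <h2> (verify this against Bing's current markup):
# only the main title link of each organic result
for a in soup.select('.b_algo h2 a'):
    print(a.text, a['href'])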

Accessing websites in a dropdown list

I'm trying to build a web scraper that visits school district websites and retrieves the names and websites of the schools. I'm using https://www.dallasisd.org/ to test the code below.
I'm currently stuck on how to 1) only access the dropdown list of 'Schools' and 2) retrieve the links in the <li> tags in the same dropdown.
Any help would be much appreciated! Thank you.
from bs4 import BeautifulSoup
from selenium import webdriver
import urllib.request
import requests
import re
import xlwt
import pandas as pd
import xlrd
from xlutils.copy import copy
import os.path
hdr = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)' }
browser = webdriver.Chrome()
url = 'https://www.dallasisd.org/'
browser.get(url)
html_source = browser.page_source
browser.quit()
soup = BeautifulSoup(html_source, "lxml")
for name_list in soup.find_all(class_='sw-dropdown-list'):
    print(name_list.text)
The dropdown list of elementary schools is contained in the <div id="cs-elementary-schools-panel" [...]>, which you can access first and then find all the links inside it:
from bs4 import BeautifulSoup
import requests
headers = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}

url = 'https://www.dallasisd.org/'
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')

dropdown = soup.find('div', attrs={'id': "cs-elementary-schools-panel"})
for link in dropdown.find_all('li', attrs={'class': "cs-panel-item"}):
    print("Url: https://www.dallasisd.org" + link.find('a')['href'])
You can easily extend this code to the middle and high schools, for example by looping over the other panel ids as sketched below.
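A minimal sketch of that extension (the ids cs-middle-schools-panel and cs-high-schools-panel are assumptions based on the elementary panel's naming pattern; check the page source for the real ids):
# loop over several dropdown panels; the non-elementary ids are guesses to verify
panel_ids = ["cs-elementary-schools-panel", "cs-middle-schools-panel", "cs-high-schools-panel"]
for panel_id in panel_ids:
    dropdown = soup.find('div', attrs={'id': panel_id})
    if dropdown is None:  # skip panels whose id does not exist on the page
        continue
    for link in dropdown.find_all('li', attrs={'class': "cs-panel-item"}):
        print("Url: https://www.dallasisd.org" + link.find('a')['href'])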
