Why does my code only find 5 articles instead of all 30 on the page?
Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
url = 'https://www.15min.lt/tema/svietimas-24297'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
antrastes = soup.find_all('h3', {'class': 'vl-title'})
print(antrastes)
The page uses JavaScript to add items, but requests/BeautifulSoup can't run JavaScript.
You may need Selenium to control a real web browser, which can run JavaScript.
You may also need some JavaScript code to scroll the page.
Alternatively, you can check in DevTools in Firefox/Chrome whether JavaScript loads the data from some URL, and then try to use this URL with requests. It may need a Session to get cookies and headers from the first GET.
This code uses a URL which I found in DevTools (tab: Network, filter: XHR).
You need to set a different offset (a date and time) in the URL to get different rows: url.format(offset).
If you use the current datetime then you don't even need to read the main page.
The request needs the header 'X-Requested-With': 'XMLHttpRequest' to work.
The server sends JSON data with the keys rows (with HTML) and offset (with the datetime for the next rows).
I use this offset to get the next rows, and I run this in a loop to get more rows.
import urllib.parse
import requests
from bs4 import BeautifulSoup
import datetime
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}
url = 'https://www.15min.lt/tags/ajax/list/svietimas-24297?tag=24297&type=&offset={}&last_row=2&iq=L&force_wide=true&cachable=1&layout%5Bw%5D%5B%5D=half_wide&layout%5Bw%5D%5B%5D=third_wide&layout%5Bf%5D%5B%5D=half_wide&layout%5Bf%5D%5B%5D=third_wide&cosite=default'
offset = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
for _ in range(5):
    print('=====', offset, '=====')
    offset = urllib.parse.quote_plus(offset)
    response = requests.get(url.format(offset), headers=headers)
    data = response.json()
    soup = BeautifulSoup(data['rows'], 'html.parser')
    antrastes = soup.find_all('h3', {'class': 'vl-title'})
    for item in antrastes:
        print(item.text.strip())
        print('---')
    offset = data['offset']  # offset for next data
Result:
===== 2022-03-09 21:20:36 =====
Konkursas „Praeities stiprybė – dabarčiai“. Susipažinkite su finalininkų darbais ir išrinkite nugalėtojus
---
ŠMSM į ukrainiečių vaikų ugdymą žada įtraukti ir atvykstančius mokytojus
---
Didėjant būrelių Vilniuje finansavimui, tikimasi įtraukti ir ukrainiečių vaikus
---
Mylėti priešus – ne glostyti palei plauką
---
Atvira pamoka su prof. Alfredu Bumblausku: „Ką reikėtų žinoti apie Ukrainos istoriją?“
---
===== 2022-03-04 13:20:21 =====
Vilniečiams vaikams – didesnis neformaliojo švietimo krepšelis
---
Premjerė: sudėtingiausiose situacijoje mokslo ir mokslininkų svarba tik didėja
---
Prasideda priėmimas į sostinės mokyklas: ką svarbu žinoti?
---
Dešimtokai lietuvių kalbos ir matematikos pasiekimus gegužę tikrinsis nuotoliniu būdu
---
Vilniuje prasideda priėmimas į mokyklas
---
===== 2022-03-01 07:09:05 =====
Nuotolinė istorijos pamoka apie Ukrainą sulaukė 30 tūkst. peržiūrų
---
J.Šiugždinienė: po Ukrainos pergalės bendradarbiavimas su šia herojiška valstybe tik didės
---
Vilniaus savivaldybė svarsto įkurdinti moksleivius buvusiame „Ignitis“ pastate
---
Socialdemokratai ragina stabdyti švietimo įstaigų tinklo pertvarką
---
Pokyčiai mokyklinėje literatūros programoje: mažiau privalomų autorių, brandos egzaminas – iš kelių dalių
---
===== 2022-02-26 11:04:29 =====
Mokytojo Gyčio „pagalbos“ – žygis, puodas ir uodas
---
Nuo kovo 2-osios pradinukams klasėse nebereikės dėvėti kaukių
---
Dr. Austėja Landsbergienė: Matematikos nerimas – kas tai ir ar įmanoma išvengti?
---
Ukrainos palaikymui – visuotinė istorijos pamoka Lietuvos mokykloms
---
Mokinius kviečia didžiausias chemijos dalyko konkursas Lietuvoje
---
===== 2022-02-23 10:11:14 =====
Mokyklų tinklo stiprinimas savivaldybėse: klausimai ir atsakymai
---
Vaiko ir paauglio kelias į sėkmę, arba Kaip gauti Nobelio premiją
---
Geriausias ugdymas – žygis, laužas, puodas ir uodas
---
Vilija Targamadzė: Bendrojo ugdymo mokyklų reformatoriai, ar ir toliau sėsite kakofoniją?
---
Švietimo ministrė: tai, kad turime sujungtas 5–8 klases, yra kažkas baisaus
---
I'm trying to retrieve the links to a Google Scholar user's works from their profile, but I am having trouble accessing the HTML that is hidden behind the "show more" button. I would like to capture all the links for a user, but currently I can only get the first 20. For reference, I'm using the following script to scrape.
from bs4 import BeautifulSoup
import requests
author_url = 'https://scholar.google.com/citations?hl=en&user=mG4imMEAAAAJ'
html_content = requests.get(author_url)
soup = BeautifulSoup(html_content.text, 'lxml')
tables = soup.find_all('table')
table = tables[1]
rows = table.find_all('tr')
links = []
for row in rows:
    t = row.find('a')
    if t is not None:
        links.append(t.get('href'))
You need to use the cstart URL parameter, which is the index of the first article to return: 0 is the first page and, with pagesize set to 100, 100 is the second. This parameter removes the need to click the "show more" button and does the same thing.
Use this parameter in a while loop to paginate through all articles.
To exit the loop, one way is to check for a certain CSS selector such as .gsc_a_e, which is assigned to the text shown when no results are present.
The great thing about this approach is that it paginates dynamically, unlike for i in range(), which is hard-coded and breaks if one author has 20 articles and another has 2550.
I found the selector with the SelectorGadget Chrome extension, which lets you pick CSS selectors by clicking on elements in the browser. It works great if the website is not heavily JS-driven.
Keep in mind that at some point you may also need a CAPTCHA solver or proxies. This only applies when you need to extract a lot of articles from multiple authors.
Code with the option to save to CSV using pandas and a full example in the online IDE:
import pandas as pd
from bs4 import BeautifulSoup
import requests, lxml, json
def bs4_scrape_articles():
    params = {
        "user": "mG4imMEAAAAJ",  # user-id
        "hl": "en",              # language
        "gl": "us",              # country to search from
        "cstart": 0,             # articles page. 0 is the first page
        "pagesize": "100"        # articles per page
    }
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
    }
    all_articles = []
    articles_is_present = True
    while articles_is_present:
        html = requests.post("https://scholar.google.com/citations", params=params, headers=headers, timeout=30)
        soup = BeautifulSoup(html.text, "lxml")
        for article in soup.select("#gsc_a_b .gsc_a_t"):
            article_title = article.select_one(".gsc_a_at").text
            article_link = f'https://scholar.google.com{article.select_one(".gsc_a_at")["href"]}'
            article_authors = article.select_one(".gsc_a_at+ .gs_gray").text
            article_publication = article.select_one(".gs_gray+ .gs_gray").text
            all_articles.append({
                "title": article_title,
                "link": article_link,
                "authors": article_authors,
                "publication": article_publication
            })
        # this selector is checking for the .class that contains: "There are no articles in this profile."
        # example link: https://scholar.google.com/citations?hl=en&user=mG4imMEAAAAJ&cstart=600
        if soup.select_one(".gsc_a_e"):
            articles_is_present = False
        else:
            params["cstart"] += 100  # paginate to the next page
    print(json.dumps(all_articles, indent=2, ensure_ascii=False))
    # pd.DataFrame(data=all_articles).to_csv(f"google_scholar_{params['user']}_articles.csv", encoding="utf-8", index=False)
bs4_scrape_articles()
Outputs (shows only last results as output is 400+ articles):
[
{
"title": "Exponential family sparse coding with application to self-taught learning with text documents",
"link": "https://scholar.google.com/citations?view_op=view_citation&hl=en&user=mG4imMEAAAAJ&cstart=400&pagesize=100&citation_for_view=mG4imMEAAAAJ:LkGwnXOMwfcC",
"authors": "H Lee, R Raina, A Teichman, AY Ng",
"publication": ""
},
{
"title": "Visual and Range Data",
"link": "https://scholar.google.com/citations?view_op=view_citation&hl=en&user=mG4imMEAAAAJ&cstart=400&pagesize=100&citation_for_view=mG4imMEAAAAJ:eQOLeE2rZwMC",
"authors": "S Gould, P Baumstarck, M Quigley, AY Ng, D Koller",
"publication": ""
}
]
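The stop condition described above (keep advancing cstart until the "no articles" marker appears) can be tried without touching Google. This is only a sketch: fake_get_page and its canned HTML are hypothetical, and a plain substring check stands in for soup.select_one(".gsc_a_e").

```python
PAGE_SIZE = 100

# Hypothetical canned pages keyed by cstart; the last one carries the empty marker.
FAKE_PAGES = {
    0: '<td class="gsc_a_t">Paper A</td>',
    100: '<td class="gsc_a_t">Paper B</td>',
    200: '<td class="gsc_a_e">There are no articles in this profile.</td>',
}

def fake_get_page(cstart):
    return FAKE_PAGES[cstart]

def collect_pages(get_page):
    """Request pages until one contains the 'no articles' marker."""
    cstart = 0
    pages = []
    while True:
        html = get_page(cstart)
        if 'gsc_a_e' in html:  # stand-in for soup.select_one(".gsc_a_e")
            break
        pages.append((cstart, html))
        cstart += PAGE_SIZE  # move to the next page
    return pages

result = collect_pages(fake_get_page)
print([cstart for cstart, _ in result])
```

Because the loop stops on the marker rather than on a fixed count, the same code works for an author with one page or with dozens.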
If you don't want to deal with bypassing blocks from Google or maintaining your script, have a look at the Google Scholar Author Articles API.
There's also the scholarly package, which can extract author articles as well.
Code that shows how to extract all author articles with Google Scholar Author Articles API:
from serpapi import GoogleScholarSearch
from urllib.parse import urlsplit, parse_qsl
import pandas as pd
import json  # needed for json.dumps() below
import os
def serpapi_scrape_articles():
    params = {
        # https://docs.python.org/3/library/os.html
        "api_key": os.getenv("API_KEY"),
        "engine": "google_scholar_author",
        "hl": "en",
        "author_id": "mG4imMEAAAAJ",
        "start": "0",
        "num": "100"
    }
    search = GoogleScholarSearch(params)
    all_articles = []
    articles_is_present = True
    while articles_is_present:
        results = search.get_dict()
        for index, article in enumerate(results["articles"], start=1):
            title = article["title"]
            link = article["link"]
            authors = article["authors"]
            publication = article.get("publication")
            citation_id = article["citation_id"]
            all_articles.append({
                "title": title,
                "link": link,
                "authors": authors,
                "publication": publication,
                "citation_id": citation_id
            })
        if "next" in results.get("serpapi_pagination", {}):
            # split URL in parts as a dict() and update "search" variable to a new page
            search.params_dict.update(dict(parse_qsl(urlsplit(results["serpapi_pagination"]["next"]).query)))
        else:
            articles_is_present = False
    print(json.dumps(all_articles, indent=2, ensure_ascii=False))
    # pd.DataFrame(data=all_articles).to_csv(f"serpapi_google_scholar_{params['author_id']}_articles.csv", encoding="utf-8", index=False)
serpapi_scrape_articles()
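The pagination line above turns the query string of the "next" URL into a dict of parameters. Here is a small sketch of just that step, using only the standard library. The URL below is an illustrative example with an assumed shape, not a real response from the API.

```python
from urllib.parse import urlsplit, parse_qsl

# Illustrative "next" URL (assumed shape, for demonstration only).
next_url = ("https://serpapi.com/search.json?author_id=mG4imMEAAAAJ"
            "&engine=google_scholar_author&hl=en&num=100&start=100")

# urlsplit isolates the query string; parse_qsl splits it into (key, value) pairs.
next_params = dict(parse_qsl(urlsplit(next_url).query))
print(next_params["start"], next_params["num"])
```

All values come back as strings, which matches how "start" and "num" are written in the params dict above.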
Here is one way of obtaining that data:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
from tqdm import tqdm ## if Jupyter notebook: from tqdm.notebook import tqdm
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
big_df = pd.DataFrame()
headers = {
'accept-language': 'en-US,en;q=0.9',
'x-requested-with': 'XHR',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
s = requests.Session()
s.headers.update(headers)
payload = {'json': '1'}
for x in tqdm(range(0, 500, 100)):
    url = f'https://scholar.google.com/citations?hl=en&user=mG4imMEAAAAJ&cstart={x}&pagesize=100'
    r = s.post(url, data=payload)
    soup = bs(r.json()['B'], 'html.parser')
    works = [(x.get_text(), 'https://scholar.google.com' + x.get('href')) for x in soup.select('a') if 'javascript:void(0)' not in x.get('href') and len(x.get_text()) > 7]
    df = pd.DataFrame(works, columns = ['Paper', 'Link'])
    big_df = pd.concat([big_df, df], axis=0, ignore_index=True)
print(big_df)
Result in terminal:
100%
5/5 [00:03<00:00, 1.76it/s]
Paper Link
0 Latent dirichlet allocation https://scholar.google.com/citations?view_op=view_citation&hl=en&user=mG4imMEAAAAJ&pagesize=100&citation_for_view=mG4imMEAAAAJ:IUKN3-7HHlwC
1 On spectral clustering: Analysis and an algorithm https://scholar.google.com/citations?view_op=view_citation&hl=en&user=mG4imMEAAAAJ&pagesize=100&citation_for_view=mG4imMEAAAAJ:2KloaMYe4IUC
2 ROS: an open-source Robot Operating System https://scholar.google.com/citations?view_op=view_citation&hl=en&user=mG4imMEAAAAJ&pagesize=100&citation_for_view=mG4imMEAAAAJ:u-x6o8ySG0sC
3 Rectifier nonlinearities improve neural network acoustic models https://scholar.google.com/citations?view_op=view_citation&hl=en&user=mG4imMEAAAAJ&pagesize=100&citation_for_view=mG4imMEAAAAJ:gsN89kCJA0AC
4 Recursive deep models for semantic compositionality over a sentiment treebank https://scholar.google.com/citations?view_op=view_citation&hl=en&user=mG4imMEAAAAJ&pagesize=100&citation_for_view=mG4imMEAAAAJ:_axFR9aDTf0C
... ... ...
473 A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes https://scholar.google.com/citations?view_op=view_citation&hl=en&user=mG4imMEAAAAJ&cstart=400&pagesize=100&citation_for_view=mG4imMEAAAAJ:hMod-77fHWUC
474 On Discrim inative vs. Generative https://scholar.google.com/citations?view_op=view_citation&hl=en&user=mG4imMEAAAAJ&cstart=400&pagesize=100&citation_for_view=mG4imMEAAAAJ:qxL8FJ1GzNcC
475 Game Theory with Restricted Strategies https://scholar.google.com/citations?view_op=view_citation&hl=en&user=mG4imMEAAAAJ&cstart=400&pagesize=100&citation_for_view=mG4imMEAAAAJ:8k81kl-MbHgC
476 Exponential family sparse coding with application to self-taught learning with text documents https://scholar.google.com/citations?view_op=view_citation&hl=en&user=mG4imMEAAAAJ&cstart=400&pagesize=100&citation_for_view=mG4imMEAAAAJ:LkGwnXOMwfcC
477 Visual and Range Data https://scholar.google.com/citations?view_op=view_citation&hl=en&user=mG4imMEAAAAJ&cstart=400&pagesize=100&citation_for_view=mG4imMEAAAAJ:eQOLeE2rZwMC
478 rows × 2 columns
See pandas documentation at https://pandas.pydata.org/docs/
Also Requests docs: https://requests.readthedocs.io/en/latest/
For BeautifulSoup, go to https://beautiful-soup-4.readthedocs.io/en/latest/
And for TQDM visit https://pypi.org/project/tqdm/
I'm learning web scraping and was able to scrape data from a website into an Excel file. However, in the Excel file, you can see that the values are wrapped in b' ' instead of being just the strings (names of YouTube channels, uploads, views). Any idea where this came from?
from bs4 import BeautifulSoup
import csv
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'} # Need to use this otherwise it returns error 403.
url = requests.get('https://socialblade.com/youtube/top/50/mostviewed', headers=headers)
#print(url)
soup = BeautifulSoup(url.text, 'lxml')
rows = soup.find('div', attrs = {'style': 'float: right; width: 900px;'}).find_all('div', recursive = False)[4:] # If the page inspector shows a class instead of a style, match on that instead (e.g. class_ = '...'). We don't need the first 4 rows, so [4:]
file = open('/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/My_Projects/Web_scraping/topyoutubers.csv', 'w')
writer = csv.writer(file)
# write header rows
writer.writerow(['Username', 'Uploads', 'Views'])
for row in rows:
    username = row.find('a').text.strip()
    numbers = row.find_all('span', attrs = {'style': 'color:#555;'})
    uploads = numbers[0].text.strip()
    views = numbers[1].text.strip()
    print(username + ' ' + uploads + ' ' + views)
    writer.writerow([username.encode('utf-8'), uploads.encode('utf-8'), views.encode('utf-8')])
file.close()
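This is caused by calling .encode('utf-8') on each value: that turns the strings into bytes objects, and csv.writer writes them out in their b'...' form. It is better to define the encoding once, when opening the file: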
file = open('topyoutubers.csv', 'w', encoding='utf-8')
New code
from bs4 import BeautifulSoup
import csv
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'} # Need to use this otherwise it returns error 403.
url = requests.get('https://socialblade.com/youtube/top/50/mostviewed', headers=headers)
#print(url)
soup = BeautifulSoup(url.text, 'lxml')
rows = soup.find('div', attrs = {'style': 'float: right; width: 900px;'}).find_all('div', recursive = False)[4:] # If the page inspector shows a class instead of a style, match on that instead (e.g. class_ = '...'). We don't need the first 4 rows, so [4:]
file = open('/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/My_Projects/Web_scraping/topyoutubers.csv', 'w', encoding='utf-8')
writer = csv.writer(file)
# write header rows
writer.writerow(['Username', 'Uploads', 'Views'])
for row in rows:
    username = row.find('a').text.strip()
    numbers = row.find_all('span', attrs = {'style': 'color:#555;'})
    uploads = numbers[0].text.strip()
    views = numbers[1].text.strip()
    print(username + ' ' + uploads + ' ' + views)
    writer.writerow([username, uploads, views])
file.close()
Output
Username Uploads Views
1 T-Series 15,029 143,032,749,708
2 Cocomelon - Nursery Rhymes 605 93,057,513,422
3 SET India 48,505 78,282,384,002
4 Zee TV 97,302 59,037,594,757
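To see where the b'...' came from, here is a small reproduction that writes to an in-memory buffer instead of a file. csv.writer converts non-string values with str(), so a bytes object ends up as its b'...' representation. The sample row is made up for illustration.

```python
import csv
import io

row = ['T-Series', '605']  # made-up sample values

# Encoding each value first: the writer receives bytes and writes their repr.
encoded_buf = io.StringIO()
csv.writer(encoded_buf).writerow([value.encode('utf-8') for value in row])

# Writing the strings directly: the file's own encoding handles the bytes on disk.
plain_buf = io.StringIO()
csv.writer(plain_buf).writerow(row)

print(repr(encoded_buf.getvalue()))
print(repr(plain_buf.getvalue()))
```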
I'm trying to build a web scraper that visits school district websites and retrieves the names and websites of the schools. I'm using https://www.dallasisd.org/ to test the code below.
I'm currently stuck on how to 1) only access the dropdown list of 'Schools' and 2) retrieve the links in the <li> tags in the same dropdown.
Any help would be much appreciated! Thank you.
from bs4 import BeautifulSoup
from selenium import webdriver
import urllib.request
import requests
import re
import xlwt
import pandas as pd
import xlrd
from xlutils.copy import copy
import os.path
hdr = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)' }
browser = webdriver.Chrome()
url = 'https://www.dallasisd.org/'
browser.get(url)
html_source = browser.page_source
browser.quit()
soup = BeautifulSoup(html_source, "lxml")
for name_list in soup.find_all(class_ ='sw-dropdown-list'):
    print(name_list.text)
The dropdown list of elementary schools is contained in the <div id="cs-elementary-schools-panel" [...]> element. You can select that element first, then find all the list items inside it and read their links:
from bs4 import BeautifulSoup
import requests
headers = {
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
url = 'https://www.dallasisd.org/'
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')
dropdown = soup.find('div', attrs={'id': "cs-elementary-schools-panel"})
for link in dropdown.find_all('li', attrs={'class': "cs-panel-item"}):
    print("Url: https://www.dallasisd.org" + link.find('a')['href'])
You can easily extend this code to the middle and high schools.
I'm trying to get data from a table on transfermarkt.com. I was able to get the first 25 entries with the following code. However, I need to get the rest of the entries, which are on the following pages. When I click on the second page, the URL does not change.
I tried to increase the range in the for loop, but it gives an error. Any suggestions would be appreciated.
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop'
heads = {'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}
r = requests.get(url, headers = heads)
source = r.text
soup = BeautifulSoup(source, "html.parser")
players = soup.find_all("a",{"class":"spielprofil_tooltip"})
values = soup.find_all("td",{"class":"rechts hauptlink"})
playerslist = []
valueslist = []
for i in range(0,25):
    playerslist.append(players[i].text)
    valueslist.append(values[i].text)
df = pd.DataFrame({"Players":playerslist, "Values":valueslist})
Add the page number to the URL inside the loop (the ajax=yw1&page={page} query parameters) and also change your selectors:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
players = []
values = []
headers = {'User-Agent':'Mozilla/5.0'}
with requests.Session() as s:
    for page in range(1,21):
        r = s.get(f'https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop?ajax=yw1&page={page}', headers=headers)
        soup = bs(r.content,'lxml')
        players += [i.text for i in soup.select('.items .spielprofil_tooltip')]
        values += [i.text for i in soup.select('.items .rechts.hauptlink')]
df = pd.DataFrame({"Players":players, "Values":values})
I am trying to get the underlying data from the interactive map on this website: https://www.sabrahealth.com/properties
I tried using the Inspect feature on Google Chrome to find the XHR file that would hold the locations of all the points on the map but nothing appeared. Is there another way to extract the location data from this map?
Well, the location data is available to download on their site here. But let's assume you are wanting the actual latitude, longitude values to do some analysis.
The first thing I would do is exactly what you did (look for the XHR). If I can't find anything there, the second thing I always do is search the HTML for the <script> tags. Sometimes the data is "hiding" in there. It takes a little more detective work and doesn't always yield results, but it does in this case.
If you look within the <script> tags, you'll find the relevant data in JSON format. It's just a matter of finding it, manipulating the string until it is valid JSON, and then feeding that into json.loads().
import requests
import bs4
import json
url = 'https://www.sabrahealth.com/properties'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = bs4.BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script')
for script in scripts:
    if 'jQuery.extend(Drupal.settings,' in script.text:
        jsonStr = script.text.split('jQuery.extend(Drupal.settings,')[1]
        jsonStr = jsonStr.rsplit(');',1)[0]
        jsonObj = json.loads(jsonStr)
        for each in jsonObj['gmap']['auto1map']['markers']:
            name = each['markername']
            lat = each['latitude']
            lon = each['longitude']
            soup = bs4.BeautifulSoup(each['text'], 'html.parser')
            prop_type = soup.find('i', {'class':'property-type'}).text.strip()
            sub_cat = soup.find('span', {'class':'subcat'}).text.strip()
            location = soup.find('span', {'class':'subcat'}).find_next('p').text.split('\n')[0]
            print ('Type: %s\nSubCat: %s\nLat: %s\nLon: %s\nLocation: %s\n' %(prop_type, sub_cat, lat, lon, location))
Output:
Type: Senior Housing - Leased
SubCat: Assisted Living
Lat: 38.3309
Lon: -85.862521
Location: Floyds Knobs, Indiana
Type: Skilled Nursing/Transitional Care
SubCat: SNF
Lat: 29.719507
Lon: -99.06649
Location: Bandera, Texas
Type: Skilled Nursing/Transitional Care
SubCat: SNF
Lat: 37.189079
Lon: -77.376015
Location: Petersburg, Virginia
Type: Skilled Nursing/Transitional Care
SubCat: SNF
Lat: 37.759998
Lon: -122.254616
Location: Alameda, California
...
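The string handling above can be checked offline on a made-up script body. This sketch isolates the split / rsplit / json.loads steps in a helper. extract_settings and the sample text are hypothetical, shaped like the Drupal.settings call the code searches for.

```python
import json

MARKER = 'jQuery.extend(Drupal.settings,'

def extract_settings(script_text):
    """Return the JSON object passed to jQuery.extend(Drupal.settings, ...), or None."""
    if MARKER not in script_text:
        return None
    json_str = script_text.split(MARKER, 1)[1]  # drop everything up to the marker
    json_str = json_str.rsplit(');', 1)[0]      # drop the closing ');'
    return json.loads(json_str)

# Made-up sample in the same shape as the page's script tag.
sample = ('jQuery.extend(Drupal.settings, {"gmap": {"auto1map": {"markers": '
          '[{"markername": "demo", "latitude": "38.3309", "longitude": "-85.862521"}]}}});')

settings = extract_settings(sample)
for marker in settings['gmap']['auto1map']['markers']:
    print(marker['markername'], marker['latitude'], marker['longitude'])
```

Returning None when the marker is missing makes it easy to skip unrelated script tags while looping over them.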