How do I find the complete list of url-paths within a website for scraping? - web-scraping

Is there a way I can use Python to see the complete list of URL paths for a website I am scraping?
The structure of the URL doesn't change, just the paths:
https://www.broadsheet.com.au/{city}/guides/best-cafes-{area}
Right now I have a function that lets me define {city} and {area} using an f-string literal, but I have to do this manually, for example city = melbourne and area = fitzroy.
I'd like to try and make the function iterate through all available paths for me but I need to work out how to get the complete list of paths.
Is there a way a scraper can do it?

You can parse the sitemap for the required URLs, for example:
import requests
from bs4 import BeautifulSoup

url = 'https://www.broadsheet.com.au/sitemap'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for loc in soup.select('loc'):
    if not loc.text.strip().endswith('/guide'):
        continue
    soup2 = BeautifulSoup(requests.get(loc.text).content, 'html.parser')
    for loc2 in soup2.select('loc'):
        if '/best-cafes-' in loc2.text:
            print(loc2.text)
Prints:
https://www.broadsheet.com.au/melbourne/guides/best-cafes-st-kilda
https://www.broadsheet.com.au/melbourne/guides/best-cafes-fitzroy
https://www.broadsheet.com.au/melbourne/guides/best-cafes-balaclava
https://www.broadsheet.com.au/melbourne/guides/best-cafes-preston
https://www.broadsheet.com.au/melbourne/guides/best-cafes-seddon
https://www.broadsheet.com.au/melbourne/guides/best-cafes-northcote
https://www.broadsheet.com.au/melbourne/guides/best-cafes-fairfield
https://www.broadsheet.com.au/melbourne/guides/best-cafes-ascot-vale
https://www.broadsheet.com.au/melbourne/guides/best-cafes-west-melbourne
https://www.broadsheet.com.au/melbourne/guides/best-cafes-flemington
https://www.broadsheet.com.au/melbourne/guides/best-cafes-windsor
https://www.broadsheet.com.au/melbourne/guides/best-cafes-kensington
https://www.broadsheet.com.au/melbourne/guides/best-cafes-prahran
https://www.broadsheet.com.au/melbourne/guides/best-cafes-essendon
https://www.broadsheet.com.au/melbourne/guides/best-cafes-pascoe-vale
https://www.broadsheet.com.au/melbourne/guides/best-cafes-albert-park
https://www.broadsheet.com.au/melbourne/guides/best-cafes-port-melbourne
https://www.broadsheet.com.au/melbourne/guides/best-cafes-armadale
https://www.broadsheet.com.au/melbourne/guides/best-cafes-brighton
https://www.broadsheet.com.au/melbourne/guides/best-cafes-malvern
https://www.broadsheet.com.au/melbourne/guides/best-cafes-malvern-east
https://www.broadsheet.com.au/melbourne/guides/best-cafes-glen-iris
https://www.broadsheet.com.au/melbourne/guides/best-cafes-camberwell
https://www.broadsheet.com.au/melbourne/guides/best-cafes-hawthorn-east
https://www.broadsheet.com.au/melbourne/guides/best-cafes-brunswick-east
https://www.broadsheet.com.au/melbourne/guides/best-cafes-bentleigh
https://www.broadsheet.com.au/melbourne/guides/best-cafes-coburg
https://www.broadsheet.com.au/melbourne/guides/best-cafes-richmond
https://www.broadsheet.com.au/melbourne/guides/best-cafes-bentleigh-east
https://www.broadsheet.com.au/melbourne/guides/best-cafes-collingwood
https://www.broadsheet.com.au/melbourne/guides/best-cafes-elwood
https://www.broadsheet.com.au/melbourne/guides/best-cafes-abbotsford
https://www.broadsheet.com.au/melbourne/guides/best-cafes-south-yarra
https://www.broadsheet.com.au/melbourne/guides/best-cafes-yarraville
https://www.broadsheet.com.au/melbourne/guides/best-cafes-thornbury
https://www.broadsheet.com.au/melbourne/guides/best-cafes-west-footscray
https://www.broadsheet.com.au/melbourne/guides/best-cafes-footscray
https://www.broadsheet.com.au/melbourne/guides/best-cafes-south-melbourne
https://www.broadsheet.com.au/melbourne/guides/best-cafes-hawthorn
https://www.broadsheet.com.au/melbourne/guides/best-cafes-carlton-north
https://www.broadsheet.com.au/melbourne/guides/best-cafes-brunswick
https://www.broadsheet.com.au/melbourne/guides/best-cafes-carlton
https://www.broadsheet.com.au/melbourne/guides/best-cafes-elsternwick
https://www.broadsheet.com.au/sydney/guides/best-cafes-bronte
https://www.broadsheet.com.au/sydney/guides/best-cafes-coogee
https://www.broadsheet.com.au/sydney/guides/best-cafes-rosebery
https://www.broadsheet.com.au/sydney/guides/best-cafes-ultimo
https://www.broadsheet.com.au/sydney/guides/best-cafes-enmore
https://www.broadsheet.com.au/sydney/guides/best-cafes-dulwich-hill
https://www.broadsheet.com.au/sydney/guides/best-cafes-leichhardt
https://www.broadsheet.com.au/sydney/guides/best-cafes-glebe
https://www.broadsheet.com.au/sydney/guides/best-cafes-annandale
https://www.broadsheet.com.au/sydney/guides/best-cafes-rozelle
https://www.broadsheet.com.au/sydney/guides/best-cafes-paddington
https://www.broadsheet.com.au/sydney/guides/best-cafes-balmain
https://www.broadsheet.com.au/sydney/guides/best-cafes-erskineville
https://www.broadsheet.com.au/sydney/guides/best-cafes-willoughby
https://www.broadsheet.com.au/sydney/guides/best-cafes-bondi-junction
https://www.broadsheet.com.au/sydney/guides/best-cafes-north-sydney
https://www.broadsheet.com.au/sydney/guides/best-cafes-bondi
https://www.broadsheet.com.au/sydney/guides/best-cafes-potts-point
https://www.broadsheet.com.au/sydney/guides/best-cafes-mosman
https://www.broadsheet.com.au/sydney/guides/best-cafes-alexandria
https://www.broadsheet.com.au/sydney/guides/best-cafes-crows-nest
https://www.broadsheet.com.au/sydney/guides/best-cafes-manly
https://www.broadsheet.com.au/sydney/guides/best-cafes-woolloomooloo
https://www.broadsheet.com.au/sydney/guides/best-cafes-newtown
https://www.broadsheet.com.au/sydney/guides/best-cafes-vaucluse
https://www.broadsheet.com.au/sydney/guides/best-cafes-chippendale
https://www.broadsheet.com.au/sydney/guides/best-cafes-marrickville
https://www.broadsheet.com.au/sydney/guides/best-cafes-redfern
https://www.broadsheet.com.au/sydney/guides/best-cafes-camperdown
https://www.broadsheet.com.au/sydney/guides/best-cafes-darlinghurst
https://www.broadsheet.com.au/adelaide/guides/best-cafes-goodwood
https://www.broadsheet.com.au/perth/guides/best-cafes-northbridge
https://www.broadsheet.com.au/perth/guides/best-cafes-leederville
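If you still need the {city} and {area} values for your existing f-string function rather than the full URLs, they can be pulled out of each path with urllib.parse; a minimal sketch, assuming the URL structure shown in the question:
from urllib.parse import urlparse

url = 'https://www.broadsheet.com.au/melbourne/guides/best-cafes-fitzroy'
parts = urlparse(url).path.strip('/').split('/')  # ['melbourne', 'guides', 'best-cafes-fitzroy']
city = parts[0]
area = parts[2].replace('best-cafes-', '')
print(city, area)  # melbourne fitzroy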

You are essentially trying to create a spider, just like search engines do. So why not use one that already exists? Google Custom Search is free for up to 100 queries per day. You will have to set up a Custom Search Engine and define a search query.
Get your API key here: https://developers.google.com/custom-search/v1/introduction/?apix=true
Define a new search engine at https://cse.google.com/cse/all, using the URL https://www.broadsheet.com.au/
Click the public URL and copy the cx= part (it looks like cx=123456:abcdef).
Place your API key and the cx value into the Google URL below.
Adjust the query below to get the results for different cities. I set it up to find results for Melbourne, but you can easily put a placeholder there and format the string.
import requests

# Replace {your_custom_search_key} and {your_custom_search_id} with your actual API key and cx value.
google = 'https://www.googleapis.com/customsearch/v1?key={your_custom_search_key}&cx={your_custom_search_id}&q=site:https://www.broadsheet.com.au/melbourne/guides/best+%22best+cafes+in%22+%22melbourne%22&start={}'

results = []
with requests.Session() as session:
    start = 1
    while True:
        result = session.get(google.format(start)).json()
        results += result['items']  # collect this page's items before moving on
        if 'nextPage' in result['queries'].keys():
            start = result['queries']['nextPage'][0]['startIndex']
            print(start)
        else:
            break
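Each item returned by the Custom Search JSON API carries a link field with the matched URL, so the guide URLs can be pulled out of the collected results; a small sketch continuing from the results list above:
# Each Custom Search item has a 'link' field holding the matched URL.
urls = [item['link'] for item in results if '/best-cafes-' in item['link']]
for u in sorted(set(urls)):
    print(u)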

Related

WebScraping for downloading certain .csv files

I have this question: I need to download certain .csv files from a website, as the title says, and I'm having trouble doing it. I'm very new to programming, and especially to this topic (web scraping).
from bs4 import BeautifulSoup as BS
import requests

DOMAIN = 'https://datos.gob.ar'
URL = 'https://datos.gob.ar/dataset/cultura-mapa-cultural-espacios-culturales/'
FILETYPE = ".csv"

def get_soup(url):
    return BS(requests.get(url).text, 'html.parser')

for link in get_soup(URL).find_all('a'):
    file_link = link.get('href')
    if FILETYPE in file_link:
        print(file_link)
This code shows all available .csv files, but I just need to download the ones that end with "biblioteca popular.csv", "cine.csv" and "museos.csv".
Maybe it's a very simple task, but I can't figure it out.
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/456d1087-87f9-4e27-9c9c-1d9734c7e51d/download/biblioteca_especializada.csv
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/01c6c048-dbeb-44e0-8efa-6944f73715d7/download/biblioteca_popular.csv
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/8d0b7f33-d570-4189-9961-9e907193aebc/download/casas_bicentenario.csv
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/4207def0-2ff7-41d5-9095-d42ae8207a5d/download/museos.csv
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/392ce1a8-ef11-4776-b280-6f1c7fae16ae/download/cine.csv
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/87ebac9c-774c-4ef2-afa7-044c41ee4190/download/teatro.csv
You can extract the JavaScript object housing that info, which would otherwise be loaded into the page by JavaScript running in the browser. You then need to do some Unicode code point cleaning and string cleaning and parse it as JSON. You can use a keyword list to select the desired URLs.
Unicode cleaning method by Mark Tolonen.
import json
import requests
import re

URL = 'https://datos.gob.ar/dataset/cultura-mapa-cultural-espacios-culturales/'
r = requests.get(URL)
search = ["Bibliotecas Populares", "Salas de Cine", "Museos"]

s = re.sub(r'\n\s{2,}', '', re.search(r'"@graph": (\[[\s\S]+{0}[\s\S]+)}}'.format(search[0]), r.text).group(1))
data = json.loads(re.sub(r'\\"', '', re.sub(r'\\u([0-9a-fA-F]{4})', lambda m: chr(int(m.group(1), 16)), s)))

for i in data:
    if 'schema:name' in i:
        name = i['schema:name']
        if name in search:
            print(name)
            print(i['schema:url'])
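Once the desired URLs are selected, downloading each file is straightforward; a minimal sketch (saving under the filename at the end of the URL is my own choice, not part of the original answer):
import requests

def download_csv(url):
    # Save the file under the name that appears at the end of the URL.
    filename = url.rsplit('/', 1)[-1]
    with open(filename, 'wb') as f:
        f.write(requests.get(url).content)

download_csv('https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/4207def0-2ff7-41d5-9095-d42ae8207a5d/download/museos.csv')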

How to obtain URL for item in absolute links-for loop (requests_html)

I'm trying to include the URL of the currently scraped item in the dataframe.
For example, I scrape the title with r.html.find(...).
Is there a way to do this with requests_html?
To get the URL, obtain the element and then read <element>.attrs['href'].
Example
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://python.org/')

about = r.html.find('.button', first=True)
print(about.attrs['href'])
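To tie this back to the dataframe from the question, a minimal sketch (the .button selector and the pandas usage are illustrative assumptions, not part of the original answer):
import pandas as pd
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://python.org/')

rows = []
for el in r.html.find('.button'):
    # Each element carries its own text and its href attribute.
    rows.append({'title': el.text, 'url': el.attrs.get('href')})

df = pd.DataFrame(rows)
print(df)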

Scraping specific checkbox values using Python

I am trying to analyze the data on this website (see the URL in the code below).
I want to scrape a couple of country borders, such as BZN|PT - BZN|ES and BZN|RO - BZN|BG.
For forecastedTransferCapacitiesMonthAhead I tried the following:
from bs4 import BeautifulSoup
import requests

page = requests.get('https://transparency.entsoe.eu/transmission-domain/r2/forecastedTransferCapacitiesMonthAhead/show')
soup = BeautifulSoup(page.text, 'html.parser')
tran_month = soup.find('table', id='dv-datatable').findAll('tr')
for price in tran_month:
    print(''.join(price.get_text("|", strip=True).split()))
But I only get the preselected country. How can I pass my arguments so that I can select the countries that I want? Much obliged.
The code is missing a crucial part: the parameters that inform the request, such as import/export, the from/to countries, and the types.
To solve the issue, below is code built on yours that uses requests' GET with parameters. To run the complete code, you will have to find the full list of parameters per country.
from bs4 import BeautifulSoup
import requests

payload = {  # this is the dictionary whose values can be changed for the request
    'name': '',
    'defaultValue': 'false',
    'viewType': 'TABLE',
    'areaType': 'BORDER_BZN',
    'atch': 'false',
    'dateTime.dateTime': '01.05.2020 00:00|UTC|MONTH',
    'border.values': 'CTY|10YPL-AREA-----S!BZN_BZN|10YPL-AREA-----S_BZN_BZN|10YDOM-CZ-DE-SKK',
    'direction.values': ['Export', 'Import']
}

page = requests.get('https://transparency.entsoe.eu/transmission-domain/r2/forecastedTransferCapacitiesMonthAhead/show',
                    params=payload)  # GET request + parameters
soup = BeautifulSoup(page.text, 'html.parser')
tran_month = soup.find('table', id='dv-datatable').findAll('tr')
for price in tran_month:  # print all values, row by row (date, export and import)
    print(price.text.strip())
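To cover several borders, such as BZN|PT - BZN|ES and BZN|RO - BZN|BG, the same request can be repeated with a different border.values string each time; a minimal sketch, where the border codes are placeholders that have to be looked up (for example from the browser's network tab when selecting the border on the site):
# Placeholder border codes; replace with the real 'border.values' strings
# observed in the browser's network tab for each country pair.
borders = {
    'BZN|PT - BZN|ES': 'CTY|<code-for-PT-ES>',
    'BZN|RO - BZN|BG': 'CTY|<code-for-RO-BG>',
}

for label, border_value in borders.items():
    payload['border.values'] = border_value
    page = requests.get('https://transparency.entsoe.eu/transmission-domain/r2/forecastedTransferCapacitiesMonthAhead/show',
                        params=payload)
    soup = BeautifulSoup(page.text, 'html.parser')
    print('---', label)
    for row in soup.find('table', id='dv-datatable').findAll('tr'):
        print(row.text.strip())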

BeautifulSoup not finding all tags when using .find method?

I am trying to scrape the number of trending repositories from https://github.com/trending using BeautifulSoup in Python. The code is supposed to find all tags with class_ = "Box-row" and then print the number found. On the site the actual number of trending repositories is 25, but the code only returns 9.
I have tried changing the parser from 'html.parser' to 'lxml' but both returned the same results.
import requests
from bs4 import BeautifulSoup

page = requests.get('https://github.com/trending')
soup = BeautifulSoup(page.text, 'html.parser')
soup = BeautifulSoup(page.text)
repo = soup.find(class_="Box-row")
print(len(repo))
In the HTML there are 25 tags with the "Box-row" class attribute, so I expected len(repo) to be 25, but instead it's 9.
Try this: find returns only the first matching tag (and len() on a single Tag counts its children, which is why you see 9), while find_all returns every match:
repo = soup.find_all("article", {"class": "Box-row"})
print(len(repo))
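For completeness, a minimal self-contained version of the fix (it simply counts the article elements with class Box-row on the trending page):
import requests
from bs4 import BeautifulSoup

page = requests.get('https://github.com/trending')
soup = BeautifulSoup(page.text, 'html.parser')

# find_all returns a list of every <article class="Box-row">, one per repository.
repos = soup.find_all("article", {"class": "Box-row"})
print(len(repos))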

Beautiful Soup Pagination, find_all not finding text within next_page class. Need also to extract data from URLS

I've been working on this for a week and am determined to get this working!
My ultimate goal is to write a webscraper where you can insert the county name and the scraper will produce a csv file of information from mugshots - Name, Location, Eye Color, Weight, Hair Color and Height (it's a genetics project I am working on).
The site organization is: primary site page --> state page --> county page (120 mugshots with name and URL) --> URL with the data I am ultimately after, plus next links to another set of 120.
I thought the best way to do this would be to write a scraper that grabs the URLs and names from the table of 120 mugshots and then uses pagination to grab all the URLs and names for the rest of the county (in some cases there are tens of thousands). I can get the first 120, but my pagination doesn't work, so I'm ending up with a CSV of 120 names and URLs.
I closely followed this article which was very helpful
from bs4 import BeautifulSoup
import requests
import lxml
import pandas as pd

county_name = input('Please, enter a county name: /Arizona/Maricopa-County-AZ \n')
print(f'Searching {county_name}. Wait, please...')

base_url = 'https://www.mugshots.com'
search_url = f'https://mugshots.com/US-Counties/{county_name}/'

data = {'Name': [], 'URL': []}

def export_table_and_print(data):
    table = pd.DataFrame(data, columns=['Name', 'URL'])
    table.index = table.index + 1
    table.to_csv('mugshots.csv', index=False)
    print('Scraping done. Here are the results:')
    print(table)

def get_mugshot_attributes(mugshot):
    name = mugshot.find('div', attrs={'class', 'label'})
    url = mugshot.find('a', attrs={'class', 'image-preview'})
    name = name.text
    url = mugshot.get('href')
    url = base_url + url
    data['Name'].append(name)
    data['URL'].append(url)

def parse_page(next_url):
    page = requests.get(next_url)
    if page.status_code == requests.codes.ok:
        bs = BeautifulSoup(page.text, 'lxml')
        list_all_mugshot = bs.find_all('a', attrs={'class', 'image-preview'})
        for mugshot in list_all_mugshot:
            get_mugshot_attributes(mugshot)
        next_page_text = mugshot.find('a class', attrs={'next page'})
        if next_page_text == 'Next':
            next_page_text = mugshot.get_text()
            next_page_url = mugshot.get('href')
            next_page_url = base_url + next_page_url
            print(next_page_url)
            parse_page(next_page_url)
        else:
            export_table_and_print(data)

parse_page(search_url)
Any ideas on how to get the pagination to work, and also how to eventually get the data from the list of URLs I scrape?
I appreciate your help! I've been working in Python for a few months now, but the BS4 and Scrapy stuff is so confusing for some reason.
Thank you so much community!
Anna
It seems you want to know the logic of how to get the content from inner pages using the URLs collected from each listing page while traversing the next pages. This is how you can parse all the links from each page, including the next pages, and then use those links to get the content from their inner pages.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://mugshots.com/"
base = "https://mugshots.com"

def get_next_pages(link):
    print("**" * 20, "current page:", link)
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("[itemprop='name'] > a[href^='/Current-Events/']"):
        yield from get_main_content(urljoin(base, item.get("href")))
    next_page = soup.select_one(".pagination > a:contains('Next')")
    if next_page:
        next_page = urljoin(url, next_page.get("href"))
        yield from get_next_pages(next_page)

def get_main_content(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    item = soup.select_one("h1#item-title > span[itemprop='name']").text
    yield item

if __name__ == '__main__':
    for elem in get_next_pages(url):
        print(elem)
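Since the end goal is a CSV, the yielded items can be written out as they arrive; a minimal sketch using the standard csv module (only a Name column, mirroring what get_main_content yields above; extra fields such as eye colour or height would need their own selectors):
import csv

# Stream the scraped names into a CSV file as they are yielded.
with open('mugshots.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Name'])
    for elem in get_next_pages(url):
        writer.writerow([elem])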
