Web Scraping Specific Data from OpenWeather (One Call) API into CSV file - web-scraping

So, as the title says, I have a problem scraping data from the OpenWeather API. I have done some practice scraping with bs4, but the format returned by the OpenWeather API is a totally different thing.
FYI, I've only been learning this for about a week, since it's a requirement (project) from school. (The prof hasn't taught us anything and only gave reading materials, so yeah...)
So far this is the code:
#import libs
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
#Takes the url
url = "https://api.openweathermap.org/data/2.5/onecall?lat=14.590456&lon=120.9774225&units=metric&appid=<API Key>"
#Website Actions
url = urlopen(url)
urlRead = url.read()
url.close()
#BS Actions
# parses the response into a soup data structure to traverse it as if it were a JSON data type.
html = bs(urlRead, "html.parser")
html
This was when I tried to see what I could get from it, which resulted in the following (originally without the spacing; I used jsbeautifier for that):
{
"lat":14.59,
"lon":120.98,
"timezone":"Asia/Manila",
"current": {
"dt":1587533672,
"sunrise":1587505076,
"sunset":1587550252,
"temp":34.62,
"feels_like":36.42,
"pressure":1010,
"humidity":56,
"dew_point":24.55,
"uvi":13.13,
"clouds":20,
"visibility":10000,
...
],
"clouds":11,
"uvi":13.74
}
]
}
So the problem is: how do I extract only specific data (into a CSV) from this, since the whole thing is a single piece of text?
Something like only the data from current: {}, or certain hours from hourly: {}, etc.

Do some reading on JSON structures and on converting dictionaries to a DataFrame. It's as simple as iterating through your list and calling the key you want in your JSON response.
Secondly, I'd use requests here and just read in the JSON response, which is stored as nested lists/dictionaries.
import requests
import pandas as pd
#Takes the url
url = "https://api.openweathermap.org/data/2.5/onecall?lat=14.590456&lon=120.9774225&units=metric&appid=<API Key>"
#Website Actions
jsonData = requests.get(url).json()
Call the key 'current':
jsonData = {
    "lat": 14.59,
    "lon": 120.98,
    "timezone": "Asia/Manila",
    "current": {
        "dt": 1587533672,
        "sunrise": 1587505076,
        "sunset": 1587550252,
        "temp": 34.62,
        "feels_like": 36.42,
        "pressure": 1010,
        "humidity": 56,
        "dew_point": 24.55,
        "uvi": 13.13,
        "clouds": 20,
        "visibility": 10000,
        "clouds": 11,
        "uvi": 13.74
    }
}
df = pd.DataFrame(jsonData['current'],index=[0])
Output:
print(df.to_string())
dt sunrise sunset temp feels_like pressure humidity dew_point uvi clouds visibility
0 1587533672 1587505076 1587550252 34.62 36.42 1010 56 24.55 13.74 11 10000
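The same approach covers the rest of the question, e.g. writing the result to a CSV and keeping only certain hours from hourly. A minimal sketch, assuming the full One Call response is stored in jsonData as above (hourly is part of the real response but is not shown in the truncated example, and the slice below is just an illustration):
# 'current' as a one-row CSV file
pd.DataFrame(jsonData['current'], index=[0]).to_csv('current.csv', index=False)
# 'hourly' is a list of dicts, so it converts straight to a DataFrame;
# slice it to keep only the hours you want (here the first 12) before saving
hourly = pd.DataFrame(jsonData['hourly'])
hourly.head(12).to_csv('hourly.csv', index=False)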

Related

web scraping of concurrent review pages

How can I scrape consecutive web pages of customer reviews in Python, both when they follow A) a regular order and B) an irregular order? Let me explain:
In this link, pageNumber=2 means the second page of reviews:
https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2
And when I click the Next button the link becomes '...pageNumber=3...' and so on. Sometimes I find the last page, sometimes not.
But in any case, I want to write a few lines of code that cover all the pages, instead of generating all the page URLs and pasting them into a Jupyter notebook.
My code was like this (the number of URLs is reduced here):
import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urlencode
import csv

# Define a list of URLs that will be scraped.
list_of_urls = ['https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=1',
                'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2',
                'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=3',
                'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=4',
                'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=5',
                'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=6',
                'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=7',
                'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=8',
                'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=9'
                ]

# Retrieve each URL's HTML data and convert it into a BeautifulSoup object.
# Find, extract and store reviewer names and review text into lists.
names = []
reviews = []
data_string = ""

for url in list_of_urls:
    params = {'api_key': "f00ffd18cb3cb9e64c315b9aa54e29f3", 'url': url}
    response = requests.get('http://api.scraperapi.com/', params=urlencode(params))
    soup = BeautifulSoup(response.text, 'html.parser')
    for item in soup.find_all("span", class_="a-profile-name"):
        data_string = data_string + item.get_text()
        names.append(data_string)
        data_string = ""
    for item in soup.find_all("span", {"data-hook": "review-body"}):
        data_string = data_string + item.get_text()
        reviews.append(data_string)
        data_string = ""

# Create the dictionary.
reviews_dict = {'Reviewer Name': names, 'Reviews': reviews}

# Print the lengths of each list.
print(len(names), len(reviews))

# Create a new dataframe.
df = pd.DataFrame.from_dict(reviews_dict, orient='index')
df.head()

# Delete all the columns that have missing values.
df.dropna(axis=1, inplace=True)
df.head()

# Transpose the dataframe.
prod_reviews = df.T
print(prod_reviews.head(10))

# Remove special characters from review text.
prod_reviews['Reviews'] = prod_reviews['Reviews'].astype(str)
prod_reviews.head(5)

# Convert dataframe to CSV file.
prod_reviews.to_csv('Review.csv', index=False, header=True)
So, the list of URLs that will be scraped runs to hundreds of entries.
I want to shorten it; I don't want to paste all the URLs. How can I do it?
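One way to avoid pasting every URL, assuming the pages really do just increment pageNumber as described above, is to generate the list programmatically. A minimal sketch (the base URL is copied from the question; adjust the page range to however many pages the product actually has):
base = ('https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/'
        'ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber={}')
# Build the URLs for pages 1..N instead of pasting them by hand.
list_of_urls = [base.format(page) for page in range(1, 10)]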

WebScraping for downloading certain .csv files

I need to download certain .csv files from a website, as the title says, and I'm having trouble doing it. I'm very new to programming, and especially to this topic (web scraping).
from bs4 import BeautifulSoup as BS
import requests

DOMAIN = 'https://datos.gob.ar'
URL = 'https://datos.gob.ar/dataset/cultura-mapa-cultural-espacios-culturales/'
FILETYPE = ".csv"

def get_soup(url):
    return BS(requests.get(url).text, 'html.parser')

for link in get_soup(URL).find_all('a'):
    file_link = link.get('href')
    if FILETYPE in file_link:
        print(file_link)
This code shows all available .csv files, but I just need to download the ones whose names end with "biblioteca_popular.csv", "cine.csv" and "museos.csv".
Maybe it's a very simple task, but I can't figure it out. The output is:
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/456d1087-87f9-4e27-9c9c-1d9734c7e51d/download/biblioteca_especializada.csv
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/01c6c048-dbeb-44e0-8efa-6944f73715d7/download/biblioteca_popular.csv
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/8d0b7f33-d570-4189-9961-9e907193aebc/download/casas_bicentenario.csv
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/4207def0-2ff7-41d5-9095-d42ae8207a5d/download/museos.csv
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/392ce1a8-ef11-4776-b280-6f1c7fae16ae/download/cine.csv
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/87ebac9c-774c-4ef2-afa7-044c41ee4190/download/teatro.csv
You can extract the JavaScript object housing that info, which otherwise would be loaded into the place where you see it by JavaScript running in the browser. You then need to do some Unicode code-point cleaning and string cleaning, and parse the result as JSON. You can use a keyword list to select the desired URLs.
Unicode cleaning method by @Mark Tolonen.
import json
import requests
import re

URL = 'https://datos.gob.ar/dataset/cultura-mapa-cultural-espacios-culturales/'
r = requests.get(URL)

search = ["Bibliotecas Populares", "Salas de Cine", "Museos"]

s = re.sub(r'\n\s{2,}', '', re.search(r'"#graph": (\[[\s\S]+{0}[\s\S]+)}}'.format(search[0]), r.text).group(1))
data = json.loads(re.sub(r'\\"', '', re.sub(r'\\u([0-9a-fA-F]{4})', lambda m: chr(int(m.group(1), 16)), s)))

for i in data:
    if 'schema:name' in i:
        name = i['schema:name']
        if name in search:
            print(name)
            print(i['schema:url'])
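To actually save the matched files rather than just print their URLs, a minimal sketch building on the loop above (it assumes data and search are the objects defined in the code above) is to request each schema:url and write the body to disk:
for i in data:
    if 'schema:name' in i and i['schema:name'] in search:
        file_url = i['schema:url']
        filename = file_url.rsplit('/', 1)[-1]           # e.g. museos.csv
        with open(filename, 'wb') as f:
            f.write(requests.get(file_url).content)      # download and save the CSV
        print('saved', filename)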

How do I find the complete list of url-paths within a website for scraping?

Is there a way I can use Python to see the complete list of URL paths for a website I am scraping?
The structure of the URL doesn't change, just the paths:
https://www.broadsheet.com.au/{city}/guides/best-cafes-{area}
Right now I have a function that allows me to define {city} and {area} using an f-string literal but I have to do this manually. For example: city = melbourne and area = fitzroy.
I'd like to try and make the function iterate through all available paths for me but I need to work out how to get the complete list of paths.
Is there a way a scraper can do it?
You can parse the sitemap for the required URLs, for example:
import requests
from bs4 import BeautifulSoup

url = 'https://www.broadsheet.com.au/sitemap'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for loc in soup.select('loc'):
    if not loc.text.strip().endswith('/guide'):
        continue
    soup2 = BeautifulSoup(requests.get(loc.text).content, 'html.parser')
    for loc2 in soup2.select('loc'):
        if '/best-cafes-' in loc2.text:
            print(loc2.text)
Prints:
https://www.broadsheet.com.au/melbourne/guides/best-cafes-st-kilda
https://www.broadsheet.com.au/melbourne/guides/best-cafes-fitzroy
https://www.broadsheet.com.au/melbourne/guides/best-cafes-balaclava
https://www.broadsheet.com.au/melbourne/guides/best-cafes-preston
https://www.broadsheet.com.au/melbourne/guides/best-cafes-seddon
https://www.broadsheet.com.au/melbourne/guides/best-cafes-northcote
https://www.broadsheet.com.au/melbourne/guides/best-cafes-fairfield
https://www.broadsheet.com.au/melbourne/guides/best-cafes-ascot-vale
https://www.broadsheet.com.au/melbourne/guides/best-cafes-west-melbourne
https://www.broadsheet.com.au/melbourne/guides/best-cafes-flemington
https://www.broadsheet.com.au/melbourne/guides/best-cafes-windsor
https://www.broadsheet.com.au/melbourne/guides/best-cafes-kensington
https://www.broadsheet.com.au/melbourne/guides/best-cafes-prahran
https://www.broadsheet.com.au/melbourne/guides/best-cafes-essendon
https://www.broadsheet.com.au/melbourne/guides/best-cafes-pascoe-vale
https://www.broadsheet.com.au/melbourne/guides/best-cafes-albert-park
https://www.broadsheet.com.au/melbourne/guides/best-cafes-port-melbourne
https://www.broadsheet.com.au/melbourne/guides/best-cafes-armadale
https://www.broadsheet.com.au/melbourne/guides/best-cafes-brighton
https://www.broadsheet.com.au/melbourne/guides/best-cafes-malvern
https://www.broadsheet.com.au/melbourne/guides/best-cafes-malvern-east
https://www.broadsheet.com.au/melbourne/guides/best-cafes-glen-iris
https://www.broadsheet.com.au/melbourne/guides/best-cafes-camberwell
https://www.broadsheet.com.au/melbourne/guides/best-cafes-hawthorn-east
https://www.broadsheet.com.au/melbourne/guides/best-cafes-brunswick-east
https://www.broadsheet.com.au/melbourne/guides/best-cafes-bentleigh
https://www.broadsheet.com.au/melbourne/guides/best-cafes-coburg
https://www.broadsheet.com.au/melbourne/guides/best-cafes-richmond
https://www.broadsheet.com.au/melbourne/guides/best-cafes-bentleigh-east
https://www.broadsheet.com.au/melbourne/guides/best-cafes-collingwood
https://www.broadsheet.com.au/melbourne/guides/best-cafes-elwood
https://www.broadsheet.com.au/melbourne/guides/best-cafes-abbotsford
https://www.broadsheet.com.au/melbourne/guides/best-cafes-south-yarra
https://www.broadsheet.com.au/melbourne/guides/best-cafes-yarraville
https://www.broadsheet.com.au/melbourne/guides/best-cafes-thornbury
https://www.broadsheet.com.au/melbourne/guides/best-cafes-west-footscray
https://www.broadsheet.com.au/melbourne/guides/best-cafes-footscray
https://www.broadsheet.com.au/melbourne/guides/best-cafes-south-melbourne
https://www.broadsheet.com.au/melbourne/guides/best-cafes-hawthorn
https://www.broadsheet.com.au/melbourne/guides/best-cafes-carlton-north
https://www.broadsheet.com.au/melbourne/guides/best-cafes-brunswick
https://www.broadsheet.com.au/melbourne/guides/best-cafes-carlton
https://www.broadsheet.com.au/melbourne/guides/best-cafes-elsternwick
https://www.broadsheet.com.au/sydney/guides/best-cafes-bronte
https://www.broadsheet.com.au/sydney/guides/best-cafes-coogee
https://www.broadsheet.com.au/sydney/guides/best-cafes-rosebery
https://www.broadsheet.com.au/sydney/guides/best-cafes-ultimo
https://www.broadsheet.com.au/sydney/guides/best-cafes-enmore
https://www.broadsheet.com.au/sydney/guides/best-cafes-dulwich-hill
https://www.broadsheet.com.au/sydney/guides/best-cafes-leichhardt
https://www.broadsheet.com.au/sydney/guides/best-cafes-glebe
https://www.broadsheet.com.au/sydney/guides/best-cafes-annandale
https://www.broadsheet.com.au/sydney/guides/best-cafes-rozelle
https://www.broadsheet.com.au/sydney/guides/best-cafes-paddington
https://www.broadsheet.com.au/sydney/guides/best-cafes-balmain
https://www.broadsheet.com.au/sydney/guides/best-cafes-erskineville
https://www.broadsheet.com.au/sydney/guides/best-cafes-willoughby
https://www.broadsheet.com.au/sydney/guides/best-cafes-bondi-junction
https://www.broadsheet.com.au/sydney/guides/best-cafes-north-sydney
https://www.broadsheet.com.au/sydney/guides/best-cafes-bondi
https://www.broadsheet.com.au/sydney/guides/best-cafes-potts-point
https://www.broadsheet.com.au/sydney/guides/best-cafes-mosman
https://www.broadsheet.com.au/sydney/guides/best-cafes-alexandria
https://www.broadsheet.com.au/sydney/guides/best-cafes-crows-nest
https://www.broadsheet.com.au/sydney/guides/best-cafes-manly
https://www.broadsheet.com.au/sydney/guides/best-cafes-woolloomooloo
https://www.broadsheet.com.au/sydney/guides/best-cafes-newtown
https://www.broadsheet.com.au/sydney/guides/best-cafes-vaucluse
https://www.broadsheet.com.au/sydney/guides/best-cafes-chippendale
https://www.broadsheet.com.au/sydney/guides/best-cafes-marrickville
https://www.broadsheet.com.au/sydney/guides/best-cafes-redfern
https://www.broadsheet.com.au/sydney/guides/best-cafes-camperdown
https://www.broadsheet.com.au/sydney/guides/best-cafes-darlinghurst
https://www.broadsheet.com.au/adelaide/guides/best-cafes-goodwood
https://www.broadsheet.com.au/perth/guides/best-cafes-northbridge
https://www.broadsheet.com.au/perth/guides/best-cafes-leederville
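Since your function takes {city} and {area} separately, you can also collect these URLs into a list instead of printing them and split the two values back out of each path. A minimal sketch (the helper below is illustrative, not part of the answer's code):
from urllib.parse import urlparse

def city_and_area(guide_url):
    # Path looks like /{city}/guides/best-cafes-{area}
    parts = urlparse(guide_url).path.strip('/').split('/')
    return parts[0], parts[-1].replace('best-cafes-', '')

# Example:
# city_and_area('https://www.broadsheet.com.au/melbourne/guides/best-cafes-fitzroy')
# returns ('melbourne', 'fitzroy')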
You are essentially trying to create a spider, just like search engines do. So why not use one that already exists? It's free for up to 100 daily queries. You will have to set up a Google Custom Search and define a search query.
Get your API key from here: https://developers.google.com/custom-search/v1/introduction/?apix=true
Define a new search engine at https://cse.google.com/cse/all using the URL https://www.broadsheet.com.au/
Click the public URL and copy the part starting from cx=123456:abcdef
Place your API key and the cx part in the google URL below.
Adjust the query to get the results for different cities. I set it up to find results for Melbourne, but you can easily use a placeholder there and format the string.
import requests

# Replace {your_custom_search_key} and {your_custom_search_id} with your own values
# before running; only the trailing {} is filled in by .format(start) below.
google = 'https://www.googleapis.com/customsearch/v1?key={your_custom_search_key}&cx={your_custom_search_id}&q=site:https://www.broadsheet.com.au/melbourne/guides/best+%22best+cafes+in%22+%22melbourne%22&start={}'

results = []
with requests.Session() as session:
    start = 1
    while True:
        result = session.get(google.format(start)).json()
        results += result['items']  # collect this page's items before checking for a next page
        if 'nextPage' in result['queries'].keys():
            start = result['queries']['nextPage'][0]['startIndex']
            print(start)
        else:
            break
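Each item returned by the Custom Search API carries the matched URL in its 'link' field, so once the loop above has finished you can pull the guide URLs out of results:
# Extract just the URLs from the collected search results.
guide_urls = [item['link'] for item in results]
print(len(guide_urls), 'guide pages found')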

Scraping specific checkbox values using Python

I am trying to analyze the data on this website: website
I want to scrape a couple of countries such as BZN|PT - BZN|ES and BZN|RO - BZN|BG
For forecastedTransferCapacitiesMonthAhead I tried the following:
from bs4 import BeautifulSoup
import requests

page = requests.get('https://transparency.entsoe.eu/transmission-domain/r2/forecastedTransferCapacitiesMonthAhead/show')
soup = BeautifulSoup(page.text, 'html.parser')
tran_month = soup.find('table', id='dv-datatable').findAll('tr')
for price in tran_month:
    print(''.join(price.get_text("|", strip=True).split()))
But I only get the preselected country. How can I pass my arguments so that I can select the countries that I want? Much obliged.
The code is missing a crucial part, i.e. the parameters that inform the request, such as import/export, the from/to countries, and the types.
To solve the issue, below you will find code built on yours which uses requests' GET-with-parameters feature. To run the complete code, you should find out the full list of parameter values per country.
from bs4 import BeautifulSoup
import requests

payload = {  # this is the dictionary whose values can be changed for the request
    'name': '',
    'defaultValue': 'false',
    'viewType': 'TABLE',
    'areaType': 'BORDER_BZN',
    'atch': 'false',
    'dateTime.dateTime': '01.05.2020 00:00|UTC|MONTH',
    'border.values': 'CTY|10YPL-AREA-----S!BZN_BZN|10YPL-AREA-----S_BZN_BZN|10YDOM-CZ-DE-SKK',
    'direction.values': ['Export', 'Import']
}

page = requests.get('https://transparency.entsoe.eu/transmission-domain/r2/forecastedTransferCapacitiesMonthAhead/show',
                    params=payload)  # GET request + parameters

soup = BeautifulSoup(page.text, 'html.parser')
tran_month = soup.find('table', id='dv-datatable').findAll('tr')
for price in tran_month:  # print all values, row by row (date, export and import)
    print(price.text.strip())
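If you want several borders (for example the BZN|PT - BZN|ES and BZN|RO - BZN|BG pairs from the question), one approach is to loop over a dictionary of border.values strings and send one request per border. A sketch under the assumption that you have looked up the real codes on the platform; the placeholder strings below are not real values:
# Hypothetical mapping: label -> the platform's 'border.values' string for that border.
borders = {
    'BZN|PT > BZN|ES': '<border.values string for PT-ES>',
    'BZN|RO > BZN|BG': '<border.values string for RO-BG>',
}

for label, border_value in borders.items():
    payload['border.values'] = border_value  # reuse the payload defined above
    page = requests.get('https://transparency.entsoe.eu/transmission-domain/r2/forecastedTransferCapacitiesMonthAhead/show',
                        params=payload)
    soup = BeautifulSoup(page.text, 'html.parser')
    print(label)
    for row in soup.find('table', id='dv-datatable').findAll('tr'):
        print(row.text.strip())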

Beautiful Soup Pagination, find_all not finding text within next_page class. Need also to extract data from URLS

I've been working on this for a week and am determined to get this working!
My ultimate goal is to write a web scraper where you can insert the county name and the scraper will produce a CSV file of information from mugshots: Name, Location, Eye Color, Weight, Hair Color and Height (it's a genetics project I am working on).
The site is organized as: primary site page --> state page --> county page (120 mugshots with name and URL) --> URL with the data I am ultimately after, plus Next links to another set of 120.
I thought the best way to do this would be to write a scraper that grabs the URLs and names from the table of 120 mugshots and then uses pagination to grab all the URLs and names from the rest of the county (in some cases there are tens of thousands). I can get the first 120, but my pagination doesn't work, so I'm ending up with a CSV of 120 names and URLs.
I closely followed this article, which was very helpful.
from bs4 import BeautifulSoup
import requests
import lxml
import pandas as pd

county_name = input('Please, enter a county name: /Arizona/Maricopa-County-AZ \n')
print(f'Searching {county_name}. Wait, please...')

base_url = 'https://www.mugshots.com'
search_url = f'https://mugshots.com/US-Counties/{county_name}/'

data = {'Name': [], 'URL': []}

def export_table_and_print(data):
    table = pd.DataFrame(data, columns=['Name', 'URL'])
    table.index = table.index + 1
    table.to_csv('mugshots.csv', index=False)
    print('Scraping done. Here are the results:')
    print(table)

def get_mugshot_attributes(mugshot):
    name = mugshot.find('div', attrs={'class', 'label'})
    url = mugshot.find('a', attrs={'class', 'image-preview'})
    name = name.text
    url = mugshot.get('href')
    url = base_url + url
    data['Name'].append(name)
    data['URL'].append(url)

def parse_page(next_url):
    page = requests.get(next_url)
    if page.status_code == requests.codes.ok:
        bs = BeautifulSoup(page.text, 'lxml')
        list_all_mugshot = bs.find_all('a', attrs={'class', 'image-preview'})
        for mugshot in list_all_mugshot:
            get_mugshot_attributes(mugshot)
        next_page_text = mugshot.find('a class', attrs={'next page'})
        if next_page_text == 'Next':
            next_page_text = mugshot.get_text()
            next_page_url = mugshot.get('href')
            next_page_url = base_url + next_page_url
            print(next_page_url)
            parse_page(next_page_url)
        else:
            export_table_and_print(data)

parse_page(search_url)
Any ideas on how to get the pagination to work and also how to eventually get the data from the list of URLs I scrape?
I appreciate your help! I've been working in python for a few months now, but the BS4 and Scrapy stuff is so confusing for some reason.
Thank you so much community!
Anna
It seems you want to know the logic of how to get the content using the URLs collected from each page while traversing the Next pages. This is how you can parse all the links from each page, including the next page, and then use those links to get the content from their inner pages.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://mugshots.com/"
base = "https://mugshots.com"

def get_next_pages(link):
    print("**" * 20, "current page:", link)
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("[itemprop='name'] > a[href^='/Current-Events/']"):
        yield from get_main_content(urljoin(base, item.get("href")))
    next_page = soup.select_one(".pagination > a:contains('Next')")
    if next_page:
        next_page = urljoin(url, next_page.get("href"))
        yield from get_next_pages(next_page)

def get_main_content(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    item = soup.select_one("h1#item-title > span[itemprop='name']").text
    yield item

if __name__ == '__main__':
    for elem in get_next_pages(url):
        print(elem)
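The generator above only yields the name from each inner page. To reach the end goal described in the question (one CSV row per mugshot with several attributes), one option, sketched here under the assumption that get_main_content() is modified to yield a dict of fields scraped from the detail page (the selectors themselves are not shown), is to feed the yielded rows into csv.DictWriter:
import csv

fieldnames = ['Name', 'Location', 'Eye Color', 'Weight', 'Hair Color', 'Height']

with open('mugshots.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for row in get_next_pages(url):  # each row is assumed to be a dict of the fields above
        writer.writerow(row)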
