WebScraping for downloading certain .csv files - web-scraping

I have this question: I need to download certain .csv files from a website, as the title says, and I'm having trouble doing it. I'm very new to programming, and especially to this topic (web scraping). My code so far:
from bs4 import BeautifulSoup as BS
import requests

DOMAIN = 'https://datos.gob.ar'
URL = 'https://datos.gob.ar/dataset/cultura-mapa-cultural-espacios-culturales/'
FILETYPE = ".csv"

def get_soup(url):
    return BS(requests.get(url).text, 'html.parser')

for link in get_soup(URL).find_all('a'):
    file_link = link.get('href')
    if FILETYPE in file_link:
        print(file_link)
This code shows all available .csv files, but I only need to download the ones that end with "biblioteca_popular.csv", "cine.csv" and "museos.csv". Maybe it's a very simple task, but I can't figure it out. These are the links my code prints:
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/456d1087-87f9-4e27-9c9c-1d9734c7e51d/download/biblioteca_especializada.csv
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/01c6c048-dbeb-44e0-8efa-6944f73715d7/download/biblioteca_popular.csv
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/8d0b7f33-d570-4189-9961-9e907193aebc/download/casas_bicentenario.csv
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/4207def0-2ff7-41d5-9095-d42ae8207a5d/download/museos.csv
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/392ce1a8-ef11-4776-b280-6f1c7fae16ae/download/cine.csv
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/87ebac9c-774c-4ef2-afa7-044c41ee4190/download/teatro.csv
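One note before the answer below: since the code is already printing the full file URLs, you can filter them by filename and save the matches directly. A minimal sketch added for illustration (the wanted tuple and the use of str.endswith are not from the original post):

import os
import requests
from bs4 import BeautifulSoup as BS

URL = 'https://datos.gob.ar/dataset/cultura-mapa-cultural-espacios-culturales/'
wanted = ('biblioteca_popular.csv', 'cine.csv', 'museos.csv')  # the three files asked for

soup = BS(requests.get(URL).text, 'html.parser')
for link in soup.find_all('a'):
    href = link.get('href')
    if href and href.endswith(wanted):            # str.endswith accepts a tuple of suffixes
        filename = os.path.basename(href)
        with open(filename, 'wb') as f:
            f.write(requests.get(href).content)   # save the CSV next to the script
        print('saved', filename)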

You can extract the JavaScript object housing that info, which would otherwise be loaded into the page by JavaScript running in the browser. You then need to do some Unicode code point cleaning and string cleaning, and parse the result as JSON. You can use a keyword list to select the desired URLs.
Unicode cleaning method by @Mark Tolonen.
import json
import requests
import re

URL = 'https://datos.gob.ar/dataset/cultura-mapa-cultural-espacios-culturales/'
r = requests.get(URL)
search = ["Bibliotecas Populares", "Salas de Cine", "Museos"]

s = re.sub(r'\n\s{2,}', '', re.search(r'"#graph": (\[[\s\S]+{0}[\s\S]+)}}'.format(search[0]), r.text).group(1))
data = json.loads(re.sub(r'\\"', '', re.sub(r'\\u([0-9a-fA-F]{4})', lambda m: chr(int(m.group(1), 16)), s)))

for i in data:
    if 'schema:name' in i:
        name = i['schema:name']
        if name in search:
            print(name)
            print(i['schema:url'])
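If you want to download the matched files rather than only print their URLs, a small extension of the loop above should do it (building on the data, search and requests names already defined; the local filename handling is illustrative):

import os

for i in data:
    if 'schema:name' in i and i['schema:name'] in search:
        file_url = i['schema:url']
        # save each matched CSV under its own filename
        with open(os.path.basename(file_url), 'wb') as f:
            f.write(requests.get(file_url).content)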

Related

Web scraping of consecutive review pages

How can I scrape consecutive web pages of customer reviews in Python when the pages follow (A) a regular order or (B) an irregular order? Let me explain:
In this link, pageNumber=2 means the second page of reviews:
https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2
When I click the next button, the link becomes '...pageNumber=3...' and so on. Sometimes I can find the last page, sometimes not.
But in any case, I want to write a few lines of code that cover all the pages, instead of generating every page URL and pasting them into a Jupyter notebook.
My code is below (the number of URLs is reduced here):
import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urlencode
import csv

# Define a list of URLs that will be scraped.
list_of_urls = ['https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=1',
                'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2',
                'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=3',
                'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=4',
                'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=5',
                'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=6',
                'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=7',
                'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=8',
                'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=9'
                ]

# Retrieve each URL's HTML data and convert the data into a beautiful soup object.
# Find, extract and store reviewer names and review text into a list.
names = []
reviews = []
data_string = ""

for url in list_of_urls:
    params = {'api_key': "f00ffd18cb3cb9e64c315b9aa54e29f3", 'url': url}
    response = requests.get('http://api.scraperapi.com/', params=urlencode(params))
    soup = BeautifulSoup(response.text, 'html.parser')

    for item in soup.find_all("span", class_="a-profile-name"):
        data_string = data_string + item.get_text()
        names.append(data_string)
        data_string = ""

    for item in soup.find_all("span", {"data-hook": "review-body"}):
        data_string = data_string + item.get_text()
        reviews.append(data_string)
        data_string = ""

# Create the dictionary.
reviews_dict = {'Reviewer Name': names, 'Reviews': reviews}

# Print the lengths of each list.
print(len(names), len(reviews))

# Create a new dataframe.
df = pd.DataFrame.from_dict(reviews_dict, orient='index')
df.head()

# Delete all the columns that have missing values.
df.dropna(axis=1, inplace=True)
df.head()

# Transpose the dataframe.
prod_reviews = df.T
print(prod_reviews.head(10))

# Remove special characters from review text.
prod_reviews['Reviews'] = prod_reviews['Reviews'].astype(str)
prod_reviews.head(5)

# Convert dataframe to CSV file.
prod_reviews.to_csv('Review.csv', index=False, header=True)
So the list of URLs to be scraped can run to hundreds of entries. I want to shorten it; I don't want to paste every URL by hand. How can I do that?
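Since the only part of the URL that changes is the pageNumber parameter, one way to avoid pasting hundreds of URLs is to generate them in a loop and stop when a page comes back with no reviews. A minimal sketch, not from the original thread; it mirrors the scraperapi call in the question, and the stop condition assumes an empty review page marks the end:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlencode

BASE = ('https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/'
        'B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2'
        '?ie=UTF8&reviewerType=all_reviews&pageNumber={}')

names, reviews = [], []
page = 1
while True:
    params = {'api_key': '<your scraperapi key>', 'url': BASE.format(page)}
    response = requests.get('http://api.scraperapi.com/', params=urlencode(params))
    soup = BeautifulSoup(response.text, 'html.parser')
    page_reviews = soup.find_all("span", {"data-hook": "review-body"})
    if not page_reviews:        # nothing found: we are past the last page
        break
    names += [s.get_text() for s in soup.find_all("span", class_="a-profile-name")]
    reviews += [s.get_text() for s in page_reviews]
    page += 1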

How to convert div tags to a table?

I want to extract the table from this website: https://www.rankingthebrands.com/The-Brand-Rankings.aspx?rankingID=37&year=214
Checking the source of that website, I noticed that the table tag is somehow missing. I assume the table is assembled from multiple div classes. Is there an easy approach to convert this table to Excel/CSV? I barely have any coding skills/experience...
I'd appreciate any help.
There are a few ways to do that. One of them (in Python) is the following (pretty self-explanatory, I believe):
import lxml.html as lh
import csv
import requests

url = 'https://www.rankingthebrands.com/The-Brand-Rankings.aspx?rankingID=37&year=214'
req = requests.get(url)
doc = lh.fromstring(req.text)

headers = ['Position', 'Name', 'Brand Value', 'Last']
with open('brands.csv', 'a', newline='') as fp:
    # note the 'a' in there - for 'append'
    file = csv.writer(fp)
    file.writerow(headers)
    # with the headers out of the way, the heavier xpath lifting begins:
    for row in doc.xpath('//div[@class="top100row"]'):
        pos = row.xpath('./div[@class="pos"]//text()')[0]
        name = row.xpath('.//div[@class="name"]//text()')[0]
        brand_val = row.xpath('.//div[@class="weighted"]//text()')[0]
        last = row.xpath('.//div[@class="lastyear"]//text()')[0]
        file.writerow([pos, name, brand_val, last])
The resulting file should be at least close to what you're looking for.
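If you want the data in Excel rather than CSV, you can load the generated file back with pandas and export it. A short usage sketch; writing .xlsx needs openpyxl installed, and the output filename here is illustrative:

import pandas as pd

df = pd.read_csv('brands.csv')
print(df.head())
df.to_excel('brands.xlsx', index=False)   # requires openpyxl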

How do I find the complete list of url-paths within a website for scraping?

Is there a way I can use Python to see the complete list of URL paths for a website I am scraping?
The structure of the URL doesn't change, just the paths:
https://www.broadsheet.com.au/{city}/guides/best-cafes-{area}
Right now I have a function that lets me define {city} and {area} using an f-string literal, but I have to do this manually, for example city = melbourne and area = fitzroy.
I'd like to make the function iterate through all available paths for me, but I need to work out how to get the complete list of paths.
Is there a way a scraper can do this?
You can parse the sitemap for the required URLs, for example:
import requests
from bs4 import BeautifulSoup

url = 'https://www.broadsheet.com.au/sitemap'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for loc in soup.select('loc'):
    if not loc.text.strip().endswith('/guide'):
        continue
    soup2 = BeautifulSoup(requests.get(loc.text).content, 'html.parser')
    for loc2 in soup2.select('loc'):
        if '/best-cafes-' in loc2.text:
            print(loc2.text)
Prints:
https://www.broadsheet.com.au/melbourne/guides/best-cafes-st-kilda
https://www.broadsheet.com.au/melbourne/guides/best-cafes-fitzroy
https://www.broadsheet.com.au/melbourne/guides/best-cafes-balaclava
https://www.broadsheet.com.au/melbourne/guides/best-cafes-preston
https://www.broadsheet.com.au/melbourne/guides/best-cafes-seddon
https://www.broadsheet.com.au/melbourne/guides/best-cafes-northcote
https://www.broadsheet.com.au/melbourne/guides/best-cafes-fairfield
https://www.broadsheet.com.au/melbourne/guides/best-cafes-ascot-vale
https://www.broadsheet.com.au/melbourne/guides/best-cafes-west-melbourne
https://www.broadsheet.com.au/melbourne/guides/best-cafes-flemington
https://www.broadsheet.com.au/melbourne/guides/best-cafes-windsor
https://www.broadsheet.com.au/melbourne/guides/best-cafes-kensington
https://www.broadsheet.com.au/melbourne/guides/best-cafes-prahran
https://www.broadsheet.com.au/melbourne/guides/best-cafes-essendon
https://www.broadsheet.com.au/melbourne/guides/best-cafes-pascoe-vale
https://www.broadsheet.com.au/melbourne/guides/best-cafes-albert-park
https://www.broadsheet.com.au/melbourne/guides/best-cafes-port-melbourne
https://www.broadsheet.com.au/melbourne/guides/best-cafes-armadale
https://www.broadsheet.com.au/melbourne/guides/best-cafes-brighton
https://www.broadsheet.com.au/melbourne/guides/best-cafes-malvern
https://www.broadsheet.com.au/melbourne/guides/best-cafes-malvern-east
https://www.broadsheet.com.au/melbourne/guides/best-cafes-glen-iris
https://www.broadsheet.com.au/melbourne/guides/best-cafes-camberwell
https://www.broadsheet.com.au/melbourne/guides/best-cafes-hawthorn-east
https://www.broadsheet.com.au/melbourne/guides/best-cafes-brunswick-east
https://www.broadsheet.com.au/melbourne/guides/best-cafes-bentleigh
https://www.broadsheet.com.au/melbourne/guides/best-cafes-coburg
https://www.broadsheet.com.au/melbourne/guides/best-cafes-richmond
https://www.broadsheet.com.au/melbourne/guides/best-cafes-bentleigh-east
https://www.broadsheet.com.au/melbourne/guides/best-cafes-collingwood
https://www.broadsheet.com.au/melbourne/guides/best-cafes-elwood
https://www.broadsheet.com.au/melbourne/guides/best-cafes-abbotsford
https://www.broadsheet.com.au/melbourne/guides/best-cafes-south-yarra
https://www.broadsheet.com.au/melbourne/guides/best-cafes-yarraville
https://www.broadsheet.com.au/melbourne/guides/best-cafes-thornbury
https://www.broadsheet.com.au/melbourne/guides/best-cafes-west-footscray
https://www.broadsheet.com.au/melbourne/guides/best-cafes-footscray
https://www.broadsheet.com.au/melbourne/guides/best-cafes-south-melbourne
https://www.broadsheet.com.au/melbourne/guides/best-cafes-hawthorn
https://www.broadsheet.com.au/melbourne/guides/best-cafes-carlton-north
https://www.broadsheet.com.au/melbourne/guides/best-cafes-brunswick
https://www.broadsheet.com.au/melbourne/guides/best-cafes-carlton
https://www.broadsheet.com.au/melbourne/guides/best-cafes-elsternwick
https://www.broadsheet.com.au/sydney/guides/best-cafes-bronte
https://www.broadsheet.com.au/sydney/guides/best-cafes-coogee
https://www.broadsheet.com.au/sydney/guides/best-cafes-rosebery
https://www.broadsheet.com.au/sydney/guides/best-cafes-ultimo
https://www.broadsheet.com.au/sydney/guides/best-cafes-enmore
https://www.broadsheet.com.au/sydney/guides/best-cafes-dulwich-hill
https://www.broadsheet.com.au/sydney/guides/best-cafes-leichhardt
https://www.broadsheet.com.au/sydney/guides/best-cafes-glebe
https://www.broadsheet.com.au/sydney/guides/best-cafes-annandale
https://www.broadsheet.com.au/sydney/guides/best-cafes-rozelle
https://www.broadsheet.com.au/sydney/guides/best-cafes-paddington
https://www.broadsheet.com.au/sydney/guides/best-cafes-balmain
https://www.broadsheet.com.au/sydney/guides/best-cafes-erskineville
https://www.broadsheet.com.au/sydney/guides/best-cafes-willoughby
https://www.broadsheet.com.au/sydney/guides/best-cafes-bondi-junction
https://www.broadsheet.com.au/sydney/guides/best-cafes-north-sydney
https://www.broadsheet.com.au/sydney/guides/best-cafes-bondi
https://www.broadsheet.com.au/sydney/guides/best-cafes-potts-point
https://www.broadsheet.com.au/sydney/guides/best-cafes-mosman
https://www.broadsheet.com.au/sydney/guides/best-cafes-alexandria
https://www.broadsheet.com.au/sydney/guides/best-cafes-crows-nest
https://www.broadsheet.com.au/sydney/guides/best-cafes-manly
https://www.broadsheet.com.au/sydney/guides/best-cafes-woolloomooloo
https://www.broadsheet.com.au/sydney/guides/best-cafes-newtown
https://www.broadsheet.com.au/sydney/guides/best-cafes-vaucluse
https://www.broadsheet.com.au/sydney/guides/best-cafes-chippendale
https://www.broadsheet.com.au/sydney/guides/best-cafes-marrickville
https://www.broadsheet.com.au/sydney/guides/best-cafes-redfern
https://www.broadsheet.com.au/sydney/guides/best-cafes-camperdown
https://www.broadsheet.com.au/sydney/guides/best-cafes-darlinghurst
https://www.broadsheet.com.au/adelaide/guides/best-cafes-goodwood
https://www.broadsheet.com.au/perth/guides/best-cafes-northbridge
https://www.broadsheet.com.au/perth/guides/best-cafes-leederville
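Once you have that list, you can feed each URL back into your existing f-string-based function by splitting the path into its {city} and {area} parts. A small sketch of that step, assuming the URL layout shown above:

from urllib.parse import urlparse

def city_and_area(guide_url):
    # path looks like /melbourne/guides/best-cafes-fitzroy
    parts = urlparse(guide_url).path.strip('/').split('/')
    return parts[0], parts[-1].replace('best-cafes-', '')

print(city_and_area('https://www.broadsheet.com.au/melbourne/guides/best-cafes-fitzroy'))
# ('melbourne', 'fitzroy')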
You are essentially trying to create a spider, just like search engines do. So why not use one that already exists? It's free for up to 100 daily queries. You will have to set up a Google Custom Search engine and define a search query:
1. Get your API key here: https://developers.google.com/custom-search/v1/introduction/?apix=true
2. Define a new search engine at https://cse.google.com/cse/all, using the URL https://www.broadsheet.com.au/
3. Click the public URL and copy the part that looks like cx=123456:abcdef
4. Place your API key and the cx part in the google URL below.
5. Adjust the query below to get results for different cities. I set it up to find results for Melbourne, but you can easily use a placeholder there and format the string.
import requests

# replace the {your_custom_search_key} and {your_custom_search_id} placeholders
# with your API key and cx id before running
google = 'https://www.googleapis.com/customsearch/v1?key={your_custom_search_key}&cx={your_custom_search_id}&q=site:https://www.broadsheet.com.au/melbourne/guides/best+%22best+cafes+in%22+%22melbourne%22&start={}'

results = []
with requests.Session() as session:
    start = 1
    while True:
        result = session.get(google.format(start)).json()
        if 'nextPage' in result['queries'].keys():
            start = result['queries']['nextPage'][0]['startIndex']
            print(start)
        else:
            break
        results += result['items']
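Each item in the Custom Search response carries the matched page URL under the 'link' key, so once the loop finishes you can pull the guide URLs out of results in one line (a brief usage note, not part of the original answer):

guide_urls = [item['link'] for item in results]
print(guide_urls)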

Web Scraping Specific Data from OpenWeather (One Call) API into CSV file

So, as the title says, I have a problem scraping data from the OpenWeather API. I have done some practice scraping with bs4, but the format of the OpenWeather API response is a totally different thing, hahah.
FYI, I've only been learning this for about a week, since it's a requirement (a project) from school. (The prof hasn't taught us anything and only gave us reading materials, so yeah...)
So far this is the code:
# import libs
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen

# Takes the url
url = "https://api.openweathermap.org/data/2.5/onecall?lat=14.590456&lon=120.9774225&units=metric&appid=<API Key>"

# Website Actions
url = urlopen(url)
urlRead = url.read()
url.close()

# BS Actions
# parses html into a soup data structure to traverse html as if it were a json data type.
html = bs(urlRead, "html.parser")
html
This was when I tried to see what I could get from it, which resulted in the following (without the spacing; I just used jsbeautifier for that):
{
"lat":14.59,
"lon":120.98,
"timezone":"Asia/Manila",
"current": {
"dt":1587533672,
"sunrise":1587505076,
"sunset":1587550252,
"temp":34.62,
"feels_like":36.42,
"pressure":1010,
"humidity":56,
"dew_point":24.55,
"uvi":13.13,
"clouds":20,
"visibility":10000,
...
],
"clouds":11,
"uvi":13.74
}
]
}
So the problem is: how do I extract only specific data (into a CSV) from this, since the whole thing is a single text? Something like only the data from current: {}, or certain hours from hourly: {}, etc.
Do some reading on JSON structures and on converting dictionaries to a DataFrame. It's as simple as iterating through your list and calling the key in your JSON response.
Secondly, I'd use requests here and just read in the JSON response, which comes back as lists/dictionaries.
import requests
import pandas as pd

# Takes the url
url = "https://api.openweathermap.org/data/2.5/onecall?lat=14.590456&lon=120.9774225&units=metric&appid=<API Key>"

# Website Actions
jsonData = requests.get(url).json()
Call the key 'current':
jsonData = {
    "lat": 14.59,
    "lon": 120.98,
    "timezone": "Asia/Manila",
    "current": {
        "dt": 1587533672,
        "sunrise": 1587505076,
        "sunset": 1587550252,
        "temp": 34.62,
        "feels_like": 36.42,
        "pressure": 1010,
        "humidity": 56,
        "dew_point": 24.55,
        "uvi": 13.13,
        "clouds": 20,
        "visibility": 10000,
        "clouds": 11,
        "uvi": 13.74
    }
}

df = pd.DataFrame(jsonData['current'], index=[0])
Output:
print(df.to_string())
dt sunrise sunset temp feels_like pressure humidity dew_point uvi clouds visibility
0 1587533672 1587505076 1587550252 34.62 36.42 1010 56 24.55 13.74 11 10000
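From there, writing a CSV or picking out particular hours is more of the same dictionary handling. A short sketch, assuming the full API response (which also contains an 'hourly' list of dicts) rather than the trimmed example above; the selected columns are illustrative:

# save the 'current' block
df.to_csv('current_weather.csv', index=False)

# 'hourly' is a list of dicts, so it converts to a DataFrame directly
hourly = pd.DataFrame(jsonData['hourly'])
print(hourly[['dt', 'temp', 'humidity']].head(6))   # e.g. the first six hours
hourly.to_csv('hourly_weather.csv', index=False)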

Extracting key-value data from JavaScript JSON-type data with bs4

I am trying to extract some information from the HTML of a web page, but neither the regex method nor the list comprehension method works.
At http://bitly.kr/RWz5x, there is a key called encparam enclosed in getjason in a JavaScript tag, which is the 49th of all the script elements on the page.
Thank you for your help in advance.
import re
import requests
from bs4 import BeautifulSoup

sam = requests.get('http://bitly.kr/RWz5x')
#html = sam.text
html = sam.content
soup = BeautifulSoup(html, 'html.parser')
scripts = soup.find_all('script')
#your_script = [script for script in scripts if 'encparam' in str(script)][0]
#print(your_script)
#print(scripts)
pattern = re.compile(r"(\w+): '(.*?)'")
fields = dict(re.findall(pattern, scripts.text))  # this line fails: scripts is a ResultSet (a list of tags), not a single tag
Send your request to the following url which you can find in the sources tab:
import requests
from bs4 import BeautifulSoup as bs
import re
res = requests.get("https://navercomp.wisereport.co.kr/v2/company/c1010001.aspx?cmp_cd=005930")
soup = bs(res.content, 'lxml')
r = re.compile(r"encparam: '(.*)'")
data = soup.find('script', text=r).text
encparam = r.findall(data)[0]
print(encparam)
It is likely you can avoid bs4 altogether:
import requests
import re
r = requests.get("https://navercomp.wisereport.co.kr/v2/company/c1010001.aspx?cmp_cd=005930")
p = re.compile(r"encparam: '(.*)'")
encparam = p.findall(r.text)[0]
print(encparam)
If you actually want the encparam part in the string:
import requests
import re
r = requests.get("https://navercomp.wisereport.co.kr/v2/company/c1010001.aspx?cmp_cd=005930")
p = re.compile(r"(encparam: '\w+')")
encparam = p.findall(r.text)[0]
print(encparam)
