web scraping of concurrent review pages

How can I scrape consecutive pages of customer reviews in Python when the pages follow A) a regular order or B) an irregular order? Let me explain.
In this link, pageNumber=2 means the second page of reviews:
https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2
When I click the Next button, the link becomes '.....pageNumber=3..' and so on. Sometimes I can find the last page, sometimes not.
In any case, I want to write a line of code that covers all the pages, instead of generating every page URL and pasting them into a Jupyter notebook.
My code looks like this (the number of URLs is reduced):
`import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urlencode
import csv
# Define a list of URL's that will be scraped.
list_of_urls = ['https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=1',
'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2',
'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=3',
'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=4',
'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=5',
'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=6',
'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=7',
'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=8',
'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=9'
]
# Retrieve each of the url's HTML data and convert the data into a beautiful soup object.
# Find, extract and store reviewer names and review text into a list.
names = []
reviews = []
data_string = ""
for url in list_of_urls:
    params = {'api_key': "f00ffd18cb3cb9e64c315b9aa54e29f3", 'url': url}
    response = requests.get('http://api.scraperapi.com/', params=urlencode(params))
    soup = BeautifulSoup(response.text, 'html.parser')
    for item in soup.find_all("span", class_="a-profile-name"):
        data_string = data_string + item.get_text()
        names.append(data_string)
        data_string = ""
    for item in soup.find_all("span", {"data-hook": "review-body"}):
        data_string = data_string + item.get_text()
        reviews.append(data_string)
        data_string = ""
# Create the dictionary.
reviews_dict = {'Reviewer Name': names, 'Reviews': reviews}
# Print the lengths of each list.
print(len(names), len(reviews))
# Create a new dataframe.
df = pd.DataFrame.from_dict(reviews_dict, orient='index')
df.head()
# Delete all the columns that have missing values.
df.dropna(axis=1, inplace=True)
df.head()
# Transpose the dataframe.
prod_reviews = df.T
print(prod_reviews.head(10))
# Remove special characters from review text.
prod_reviews['Reviews'] = prod_reviews['Reviews'].astype(str)
prod_reviews.head(5)
# Convert dataframe to CSV file.
prod_reviews.to_csv('Review.csv', index=False, header=True)`
So the list of URLs to scrape runs into the hundreds.
I want to shorten it and avoid pasting every URL. How can I do that?
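One possible way to avoid the pasted list is to build each URL from its page number and stop when a page comes back empty. This is only a minimal sketch, assuming the pages are numbered consecutively from 1 and that a page past the last one returns no reviews. `review_url`, `scrape_all`, `fetch_reviews` and `max_pages` are hypothetical names; `fetch_reviews` stands for a function that wraps the requests/BeautifulSoup code above and returns a list of (name, review) pairs for one URL.

```python
from urllib.parse import urlencode

# Path copied from the URLs in the question; only the query string changes.
BASE = ("https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/"
        "product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2")


def review_url(page):
    """Build the review URL for one page number."""
    query = urlencode({"ie": "UTF8",
                       "reviewerType": "all_reviews",
                       "pageNumber": page})
    return f"{BASE}?{query}"


def scrape_all(fetch_reviews, max_pages=500):
    """Fetch page 1, 2, 3, ... until a page yields no reviews.

    fetch_reviews is a hypothetical callable: URL -> list of rows.
    max_pages is a safety cap in case an empty page never appears.
    """
    collected = []
    for page in range(1, max_pages + 1):
        rows = fetch_reviews(review_url(page))
        if not rows:  # assumption: an empty page means we are past the end
            break
        collected.extend(rows)
    return collected
```

Passing the fetching function in as an argument keeps the paging logic separate from the network code, so the loop can be tried out with a fake function before it is pointed at the real site.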

Related

WebScraping for downloading certain .csv files

I need to download certain .csv files from a website, as the title says, and I'm having trouble doing it. I'm very new to programming, and especially to this topic (web scraping).
from bs4 import BeautifulSoup as BS
import requests
DOMAIN = 'https://datos.gob.ar'
URL = 'https://datos.gob.ar/dataset/cultura-mapa-cultural-espacios-culturales/'
FILETYPE = ".csv"
def get_soup(url):
    return BS(requests.get(url).text, 'html.parser')
for link in get_soup(URL).find_all('a'):
    file_link = link.get('href')
    if FILETYPE in file_link:
        print(file_link)
This code shows all available .csv files, but I only need to download those that end with "biblioteca_popular.csv", "cine.csv" and "museos.csv".
Maybe it's a very simple task, but I can't figure it out.
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/456d1087-87f9-4e27-9c9c-1d9734c7e51d/download/biblioteca_especializada.csv
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/01c6c048-dbeb-44e0-8efa-6944f73715d7/download/biblioteca_popular.csv
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/8d0b7f33-d570-4189-9961-9e907193aebc/download/casas_bicentenario.csv
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/4207def0-2ff7-41d5-9095-d42ae8207a5d/download/museos.csv
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/392ce1a8-ef11-4776-b280-6f1c7fae16ae/download/cine.csv
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/87ebac9c-774c-4ef2-afa7-044c41ee4190/download/teatro.csv

You can extract the JavaScript object that holds this information; in the browser, JavaScript would otherwise load it into the place where you see it. You then need to do some Unicode code point cleaning and string cleaning, and parse the result as JSON. You can use a keyword list to select the desired URLs.
Unicode cleaning method by @Mark Tolonen
import json
import requests
import re
URL = 'https://datos.gob.ar/dataset/cultura-mapa-cultural-espacios-culturales/'
r = requests.get(URL)
search = ["Bibliotecas Populares", "Salas de Cine", "Museos"]
s = re.sub( r'\n\s{2,}', '', re.search(r'"@graph": (\[[\s\S]+{0}[\s\S]+)}}'.format(search[0]), r.text).group(1))
data = json.loads(re.sub(r'\\"', '', re.sub(r'\\u([0-9a-fA-F]{4})',lambda m: chr(int(m.group(1),16)),s)))
for i in data:
    if 'schema:name' in i:
        name = i['schema:name']
        if name in search:
            print(name)
            print(i['schema:url'])
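If the .csv links are already available as plain URLs, as in the list printed by the question's code, a simpler filter may be enough. The sketch below is a hedged illustration, not part of the answer above: it compares the last path segment of each URL with the three filenames named in the question. `WANTED`, `filename_of`, `select_wanted` and `download` are hypothetical names.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse
from urllib.request import urlretrieve

# Filenames taken from the question (with underscores, as in the URLs).
WANTED = {"biblioteca_popular.csv", "cine.csv", "museos.csv"}


def filename_of(url):
    """Return the last path segment of a URL, e.g. 'cine.csv'."""
    return PurePosixPath(urlparse(url).path).name


def select_wanted(urls, wanted=WANTED):
    """Keep only the URLs whose filename is exactly one of the wanted names."""
    return [u for u in urls if filename_of(u) in wanted]


def download(urls):
    """Save each selected file into the current directory under its own name."""
    for u in select_wanted(urls):
        urlretrieve(u, filename_of(u))
```

Matching on the exact filename rather than on a substring avoids picking up files such as biblioteca_especializada.csv by accident.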

How to obtain URL for item in absolute links-for loop (requests_html)

I am trying to include the URL of the currently scraped item in the dataframe,
in the same way that I scrape the title with r.html.find(...).
Is there a way to do this with requests_html?

To get the URL, obtain the element and then read <element>.attrs['href'].
Example:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://python.org/')
about = r.html.find('.button', first = True)
print(about.attrs['href'])

Scraping specific checkbox values using Python

I am trying to analyze the data on this website: website
I want to scrape a couple of countries such as BZN|PT - BZN|ES and BZN|RO - BZN|BG
I tried for forecastedTransferCapacitiesMonthAhead the following:
from bs4 import BeautifulSoup
import requests
page = requests.get('https://transparency.entsoe.eu/transmission-domain/r2/forecastedTransferCapacitiesMonthAhead/show')
soup = BeautifulSoup(page.text, 'html.parser')
tran_month = soup.find('table', id='dv-datatable').findAll('tr')
for price in tran_month:
    print(''.join(price.get_text("|", strip=True).split()))
But I only get the preselected country. How can I pass my arguments so that I can select the countries that I want? Much obliged.

The code is missing a crucial part: the parameters that inform the request, such as import/export, the from/to countries and the types.
The code below builds on yours and sends those parameters with the GET request through the params argument of requests.get. To run it for other borders, you need to find the complete list of parameter values per country.
from bs4 import BeautifulSoup
import requests
payload = {  # this is the dictionary whose values can be changed for the request
    'name' : '',
    'defaultValue' : 'false',
    'viewType' : 'TABLE',
    'areaType' : 'BORDER_BZN',
    'atch' : 'false',
    'dateTime.dateTime' : '01.05.2020 00:00|UTC|MONTH',
    'border.values' : 'CTY|10YPL-AREA-----S!BZN_BZN|10YPL-AREA-----S_BZN_BZN|10YDOM-CZ-DE-SKK',
    'direction.values' : ['Export', 'Import']
}
page = requests.get('https://transparency.entsoe.eu/transmission-domain/r2/forecastedTransferCapacitiesMonthAhead/show',
                    params = payload)  # GET request + parameters
soup = BeautifulSoup(page.text, 'html.parser')
tran_month = soup.find('table', id='dv-datatable').findAll('tr')
for price in tran_month:  # print all values, row by row (date, export and import)
    print(price.text.strip())
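To request several borders, the same payload could be rebuilt for each one with only border.values changed. The sketch below is an assumption-laden illustration: `build_payload` and `request_url` are hypothetical helpers, and the border strings for pairs such as BZN|PT - BZN|ES are not known here and would have to be looked up on the site. It only shows how the query string is assembled, using the standard library so that it runs without network access.

```python
from urllib.parse import urlencode

URL = ("https://transparency.entsoe.eu/transmission-domain/r2/"
       "forecastedTransferCapacitiesMonthAhead/show")


def build_payload(border_value, date_time="01.05.2020 00:00|UTC|MONTH"):
    """Return the request parameters for one border (values as in the answer)."""
    return {
        "name": "",
        "defaultValue": "false",
        "viewType": "TABLE",
        "areaType": "BORDER_BZN",
        "atch": "false",
        "dateTime.dateTime": date_time,
        "border.values": border_value,  # must be looked up per border
        "direction.values": ["Export", "Import"],
    }


def request_url(border_value):
    """Build the full GET URL; doseq=True repeats the key for list values."""
    return URL + "?" + urlencode(build_payload(border_value), doseq=True)


# The only border string known from the answer above; others are not filled in.
KNOWN_BORDER = ("CTY|10YPL-AREA-----S!BZN_BZN|10YPL-AREA-----S"
                "_BZN_BZN|10YDOM-CZ-DE-SKK")
print(request_url(KNOWN_BORDER))
```

With requests, the dictionary from `build_payload` can be passed directly as params, and a loop over a list of border strings would then fetch one table per border.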

Beautiful Soup Pagination, find_all not finding text within next_page class. Need also to extract data from URLS

I've been working on this for a week and am determined to get this working!
My ultimate goal is to write a webscraper where you can insert the county name and the scraper will produce a csv file of information from mugshots - Name, Location, Eye Color, Weight, Hair Color and Height (it's a genetics project I am working on).
The site is organized as: primary site page --> state page --> county page with 120 mugshots (name and URL) --> URL with the data I am ultimately after, plus a next link to another set of 120.
I thought the best way to do this would be to write a scraper that grabs the URLs and names from the table of 120 mugshots and then uses pagination to grab all the URLs and names from the rest of the county (in some cases there are tens of thousands). I can get the first 120, but my pagination doesn't work, so I'm ending up with a csv of 120 names and URLs.
I closely followed this article which was very helpful
from bs4 import BeautifulSoup
import requests
import lxml
import pandas as pd
county_name = input('Please, enter a county name: /Arizona/Maricopa-County-AZ \n')
print(f'Searching {county_name}. Wait, please...')
base_url = 'https://www.mugshots.com'
search_url = f'https://mugshots.com/US-Counties/{county_name}/'
data = {'Name': [],'URL': []}
def export_table_and_print(data):
    table = pd.DataFrame(data, columns=['Name', 'URL'])
    table.index = table.index + 1
    table.to_csv('mugshots.csv', index=False)
    print('Scraping done. Here are the results:')
    print(table)
def get_mugshot_attributes(mugshot):
    name = mugshot.find('div', attrs={'class', 'label'})
    url = mugshot.find('a', attrs={'class', 'image-preview'})
    name=name.text
    url=mugshot.get('href')
    url = base_url + url
    data['Name'].append(name)
    data['URL'].append(url)
def parse_page(next_url):
    page = requests.get(next_url)
    if page.status_code == requests.codes.ok:
        bs = BeautifulSoup(page.text, 'lxml')
        list_all_mugshot = bs.find_all('a', attrs={'class', 'image-preview'})
        for mugshot in list_all_mugshot:
            get_mugshot_attributes(mugshot)
        next_page_text = mugshot.find('a class' , attrs={'next page'})
        if next_page_text == 'Next':
            next_page_text=mugshot.get_text()
            next_page_url=mugshot.get('href')
            next_page_url=base_url+next_page_url
            print(next_page_url)
            parse_page(next_page_url)
        else:
            export_table_and_print(data)
parse_page(search_url)
Any ideas on how to get the pagination to work and also how to eventually get the data from the list of URLs I scrape?
I appreciate your help! I've been working in python for a few months now, but the BS4 and Scrapy stuff is so confusing for some reason.
Thank you so much community!
Anna

It seems you want to know how to collect the URLs from each listing page while traversing the next pages, and then fetch the content behind those URLs. This is how you can parse all the links from each page, follow the next-page link, and use the collected links to get the content from their inner pages.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
url = "https://mugshots.com/"
base = "https://mugshots.com"
def get_next_pages(link):
    print("**"*20,"current page:",link)
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select("[itemprop='name'] > a[href^='/Current-Events/']"):
        yield from get_main_content(urljoin(base,item.get("href")))
    # newer soupsieve releases prefer :-soup-contains() over :contains()
    next_page = soup.select_one(".pagination > a:contains('Next')")
    if next_page:
        next_page = urljoin(url,next_page.get("href"))
        yield from get_next_pages(next_page)
def get_main_content(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    item = soup.select_one("h1#item-title > span[itemprop='name']").text
    yield item
if __name__ == '__main__':
    for elem in get_next_pages(url):
        print(elem)
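The recursive generator nests one call per page, so a county with a very large number of pages could in principle run into Python's recursion limit. A loop-based variant is sketched below as one possible alternative, not as the answer's method. `crawl` and `get_page` are hypothetical names: `get_page(url)` is assumed to return a pair of (items found on that page, URL of the next page or None), which in practice would wrap the requests and BeautifulSoup calls shown above.

```python
def crawl(start_url, get_page, max_pages=10_000):
    """Yield every item from start_url onward, following next links in a loop.

    get_page is a hypothetical callable: url -> (items, next_url_or_None).
    Visited URLs are remembered so that a page linking back to an earlier
    page cannot cause an endless loop; max_pages is an extra safety cap.
    """
    url = start_url
    seen = set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        items, url = get_page(url)
        yield from items


if __name__ == "__main__":
    # Fake three-page site used only to demonstrate the control flow.
    fake_site = {
        "page1": (["Alice", "Bob"], "page2"),
        "page2": (["Carol"], "page3"),
        "page3": ([], None),
    }
    for name in crawl("page1", fake_site.__getitem__):
        print(name)
```

Because the function is a generator, the caller can write rows to a csv as they arrive instead of holding tens of thousands of records in memory.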

How to scrape options from dropdown list and store them in table?

I am trying to make an interactive dashboard with analysis based on a car site. I would like the user to be able to pick a car brand, for example BMW or Audi, and based on this choice be able to pick only that brand's models. My problem is that after selecting a brand, I am not able to scrape the models that belong to it. The pages I am scraping from:
main page --> https://www.otomoto.pl/osobowe/
sub car brand page example --> https://www.otomoto.pl/osobowe/audi/
I have tried to scrape every option, so later on I can maybe somehow clean the data to store only models
code:
otomoto_models <- paste0("https://www.otomoto.pl/osobowe/", "audi/")
models <- read_html(otomoto_models) %>%
  html_nodes("option") %>%
  html_text()
But this only scrapes the brands together with the other options available on the page (engine type etc.), although after inspecting the element I can clearly see the model types.
otomoto <- "https://www.otomoto.pl/osobowe/"
brands <- read_html(otomoto) %>%
html_nodes("option") %>%
html_text()
brands <- data.frame(brands)
for (i in 1:nrow(brands)){
  no_marka_pojazdu <- i
  if(brands[i,1] == "Marka pojazdu"){
    break
  }
}
no_marka_pojazdu <- no_marka_pojazdu + 1
for (i in 1:nrow(brands)){
  zuk <- i
  if(substr(brands[i,1],1,3) == "Żuk"){
    break
  }
}
Modele_pojazdow <- as.character(brands[no_marka_pojazdu:zuk,1])
Modele_pojazdow <- removeNumbers(Modele_pojazdow)
Modele_pojazdow <- substr(Modele_pojazdow,1,nchar(Modele_pojazdow)-2)
Modele_pojazdow <- data.frame(Modele_pojazdow)
The code above only picks the car brands supported on the webpage and stores them in a data frame. With that I am able to build the HTML link and direct everything to one selected brand.
I would like to have an object similar to "Modele_pojazdow", but with the models limited to the previously selected car brand.
The dropdown list with models appears as a white box with the text "Model pojazdu", to the right of the "Audi" box.

Some may frown on the solution language being Python, but the aim of this was to give some pointers (a high-level process). I haven't written R in a long time, so Python was quicker.
EDIT: R script now added
General outline:
The first dropdown options can be grabbed from the value attribute of each node returned by using a css selector of #param571 option. This uses an id selector (#) to target the parent dropdown select element, and then option type selector in descendant combination, to specify the option tag elements within. The html to apply this selector combination to can be retrieved by an xhr request to the url you initially provided. You want a nodeList returned to iterate over; akin to applying selector with js document.querySelectorAll.
The page uses ajax POST requests to update the second dropdown based on your first dropdown choice. Your first dropdown choice determines the value of a parameter search[filter_enum_make], which is used in the POST request to the server. The subsequent response contains a list of the available options (it includes some case alternatives which can be trimmed out).
I captured the POST request by using fiddler. This showed me the request headers and params in the request body. Screenshot sample shown at end.
The simplest way to extract the options from the response text, IMO, is to regex the appropriate string out (I wouldn't normally recommend regex for working with html but in this case it serves us nicely). If you don't want to use regex, you can grab the relevant info from the data-facets attribute of the element with id body-container. For the non-regex version you need to handle unquoted nulls, and retrieve the inner dictionary whose key is filter_enum_model. I show a function re-write, at the end, to handle this.
The retrieved string is a string representation of a dictionary. This needs converting to an actual dictionary object which you can then extract the option values from. Edit: As R doesn't have a dictionary object a similar structure needs to be found. I will look at this when converting.
I create a user-defined function, getOptions(), to return the options for each make. Each car make value comes from the list of possible items in the first dropdown. I loop over those possible values, use the function to return a list of options for that make, and add those lists as values to a dictionary, results, whose keys are the makes of car. Again, for R an object with similar functionality to a Python dictionary needs to be found.
That dictionary of lists needs converting to a dataframe which includes a transpose operation to make a tidy output of headers, which are the car makes, and columns underneath each header, which contain the associated models.
The whole thing can be written to csv at the end.
So, hopefully that gives you an idea of one way to achieve what you want. Perhaps someone else can use this to help write you a solution.
Python demonstration of this below:
import requests
from bs4 import BeautifulSoup as bs
import re
import ast
import pandas as pd
headers = {
    'User-Agent' : 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
}
def getOptions(make): #function to return options based on make
    data = {
        'search[filter_enum_make]': make,
        'search[dist]' : '5',
        'search[category_id]' : '29'
    }
    r = requests.post('https://www.otomoto.pl/ajax/search/list/', data = data, headers = headers)
    try:
        # verify the regex here: https://regex101.com/r/emvqXs/1
        data = re.search(r'"filter_enum_model":(.*),"new_used"', r.text ,flags=re.DOTALL).group(1) #regex to extract the string containing the models associated with the car make filter
        aDict = ast.literal_eval(data) #convert string representation of dictionary to python dictionary
        d = len({k.lower(): v for k, v in aDict.items()}.keys()) #find length of unique keys when accounting for case
        dirtyList = list(aDict)[:d] #trim to unique values
        cleanedList = [item for item in dirtyList if item != 'other' ] #remove 'other' as doesn't appear in dropdown
    except Exception: # no regex match or unparsable string
        cleanedList = [] # sometimes there are no associated values in 2nd dropdown
    return cleanedList
r = requests.get('https://www.otomoto.pl/osobowe/')
soup = bs(r.content, 'lxml')
values = [item['value'] for item in soup.select('#param571 option') if item['value'] != '']
results = {}
# build a dictionary of lists to hold options for each make
for value in values:
    results[value] = getOptions(value) #function call to return options based on make
# turn into a dataframe and transpose so each column header is the make and the options are listed below
df = pd.DataFrame.from_dict(results,orient='index').transpose()
#write to csv
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig',index = False )
Sample of csv output:
Example as sample json for alfa-romeo:
Example of regex match for alfa-romeo:
{"145":1,"146":1,"147":218,"155":1,"156":118,"159":559,"164":2,"166":39,"33":1,"Alfasud":2,"Brera":34,"Crosswagon":2,"GT":89,"GTV":7,"Giulia":251,"Giulietta":378,"Mito":224,"Spider":24,"Sportwagon":2,"Stelvio":242,"alfasud":2,"brera":34,"crosswagon":2,"giulia":251,"giulietta":378,"gt":89,"gtv":7,"mito":224,"spider":24,"sportwagon":2,"stelvio":242}
Example of the filter option list returned from function call with make parameter value alfa-romeo:
['145', '146', '147', '155', '156', '159', '164', '166', '33', 'Alfasud', 'Brera', 'Crosswagon', 'GT', 'GTV', 'Giulia', 'Giulietta', 'Mito', 'Spider', 'Sportwagon', 'Stelvio']
Sample of fiddler request:
Sample of ajax response html containing options:
<section id="body-container" class="om-offers-list"
data-facets='{"offer_seek":{"offer":2198},"private_business":{"business":1326,"private":872,"all":2198},"categories":{"29":2198,"161":953,"163":953},"categoriesParent":[],"filter_enum_model":{"145":1,"146":1,"147":219,"155":1,"156":116,"159":561,"164":2,"166":37,"33":1,"Alfasud":2,"Brera":34,"Crosswagon":2,"GT":88,"GTV":7,"Giulia":251,"Giulietta":380,"Mito":226,"Spider":25,"Sportwagon":2,"Stelvio":242,"alfasud":2,"brera":34,"crosswagon":2,"giulia":251,"giulietta":380,"gt":88,"gtv":7,"mito":226,"spider":25,"sportwagon":2,"stelvio":242},"new_used":{"new":371,"used":1827,"all":2198},"sellout":null}'
data-showfacets=""
data-pagetitle="Alfa Romeo samochody osobowe - otomoto.pl"
data-ajaxurl="https://www.otomoto.pl/osobowe/alfa-romeo/?search%5Bbrand_program_id%5D%5B0%5D=&search%5Bcountry%5D="
data-searchid=""
data-keys=''
data-vars=""
Alternative version of function without regex:
import ast
import requests
from bs4 import BeautifulSoup as bs
# headers is the User-Agent dictionary defined in the first version
def getOptions(make): #function to return options based on make
    data = {
        'search[filter_enum_make]': make,
        'search[dist]' : '5',
        'search[category_id]' : '29'
    }
    r = requests.post('https://www.otomoto.pl/ajax/search/list/', data = data, headers = headers)
    soup = bs(r.content, 'lxml')
    data = soup.select_one('#body-container')['data-facets'].replace('null','"null"')
    aDict = ast.literal_eval(data)['filter_enum_model'] #convert string representation of dictionary to python dictionary
    d = len({k.lower(): v for k, v in aDict.items()}.keys()) #find length of unique keys when accounting for case
    dirtyList = list(aDict)[:d] #trim to unique values
    cleanedList = [item for item in dirtyList if item != 'other' ] #remove 'other' as doesn't appear in dropdown
    return cleanedList
print(getOptions('alfa-romeo'))
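The data-facets value in the sample response above has the shape of JSON, so json.loads may be able to parse it directly, turning null into None without any string replacement. The sketch below illustrates that idea on a shortened, made-up sample string. It is a swapped-in technique rather than the method used above: instead of trimming the key list by position, it drops case-insensitive duplicates by keeping the first spelling seen. `SAMPLE_FACETS` and `models_from_facets` are hypothetical names.

```python
import json

# Shortened, hypothetical sample of a data-facets value (real ones are longer).
SAMPLE_FACETS = (
    '{"filter_enum_model":{"147":218,"Giulia":251,"Mito":224,'
    '"giulia":251,"mito":224,"other":3},"sellout":null}'
)


def models_from_facets(facets_json):
    """Return model names, skipping 'other' and case-only duplicates."""
    facets = json.loads(facets_json)  # JSON null becomes Python None
    models = facets.get("filter_enum_model") or {}
    seen = set()
    result = []
    for name in models:  # dict keys keep their original order
        key = name.lower()
        if key == "other" or key in seen:
            continue
        seen.add(key)
        result.append(name)
    return result


print(models_from_facets(SAMPLE_FACETS))  # ['147', 'Giulia', 'Mito']
```

In getOptions, the string passed in would be the data-facets attribute read from the #body-container element.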
R conversion and improved python:
Whilst converting to R I found a better way of extracting the parameters from a js file on the server. If you open dev tools you can see the file listed in the sources tab.
R (To be improved):
library(httr)
library(jsonlite)
url <- 'https://www.otomoto.pl/ajax/jsdata/params/'
r <- GET(url)
contents <- content(r, "text")
data <- strsplit(contents, "var searchConditions = ")[[1]][2]
data <- strsplit(as.character(data), ";var searchCondition")[[1]][1]
source <- fromJSON(data)$values$'573'$'571'
makes <- names(source)
for(make in makes){
  print(make)
  print(source[make][[1]]$value)
  #break
}
Python:
import requests
import json
import pandas as pd
r = requests.get('https://www.otomoto.pl/ajax/jsdata/params/')
data = r.text.split('var searchConditions = ')[1]
data = data.split(';var searchCondition')[0]
items = json.loads(data)
source = items['values']['573']['571']
makes = [item for item in source]
results = {}
for make in makes:
    df = pd.DataFrame(source[make]) ## build a dictionary of lists to hold options for each make
    results[make] = list(df['value'])
dfFinal = pd.DataFrame.from_dict(results,orient='index').transpose() # turn into a dataframe and transpose so each column header is the make and the options are listed below
# newer pandas releases name this method DataFrame.map instead of applymap
mask = dfFinal.applymap(lambda x: x is None) #tidy up None values to empty strings https://stackoverflow.com/a/31295814/6241235
cols = dfFinal.columns[(mask).any()]
for col in dfFinal[cols]:
    dfFinal.loc[mask[col], col] = ''
print(dfFinal)
