How to convert div tags to a table? - web-scraping

I want to extract the table from this website https://www.rankingthebrands.com/The-Brand-Rankings.aspx?rankingID=37&year=214
Checking the source of that website, I noticed that somehow the table tag is missing. I assume that this table is assembled from multiple div classes. Is there any easy approach to convert this table to Excel/CSV? I barely have any coding skills/experience...
Appreciate any help

There are a few ways to do that. One of them (in Python) is the following (pretty self-explanatory, I believe):
import lxml.html as lh
import csv
import requests

url = 'https://www.rankingthebrands.com/The-Brand-Rankings.aspx?rankingID=37&year=214'
req = requests.get(url)
doc = lh.fromstring(req.text)

headers = ['Position', 'Name', 'Brand Value', 'Last']
with open('brands.csv', 'a', newline='') as fp:
    # note the 'a' in there - for 'append'
    file = csv.writer(fp)
    file.writerow(headers)
    # with the headers out of the way, the heavier xpath lifting begins:
    for row in doc.xpath('//div[@class="top100row"]'):
        pos = row.xpath('./div[@class="pos"]//text()')[0]
        name = row.xpath('.//div[@class="name"]//text()')[0]
        brand_val = row.xpath('.//div[@class="weighted"]//text()')[0]
        last = row.xpath('.//div[@class="lastyear"]//text()')[0]
        file.writerow([pos, name, brand_val, last])
The resulting file should be at least close to what you're looking for.
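If you prefer pandas, a minimal sketch under the same assumptions (the same div class names on that page) collects the rows into a DataFrame first and exports from there:

import lxml.html as lh
import requests
import pandas as pd

url = 'https://www.rankingthebrands.com/The-Brand-Rankings.aspx?rankingID=37&year=214'
doc = lh.fromstring(requests.get(url).text)

rows = []
for row in doc.xpath('//div[@class="top100row"]'):
    # same selectors as above; take the first text node of each sub-div
    rows.append({
        'Position': row.xpath('./div[@class="pos"]//text()')[0],
        'Name': row.xpath('.//div[@class="name"]//text()')[0],
        'Brand Value': row.xpath('.//div[@class="weighted"]//text()')[0],
        'Last': row.xpath('.//div[@class="lastyear"]//text()')[0],
    })

df = pd.DataFrame(rows)
df.to_csv('brands.csv', index=False)        # CSV that Excel opens directly
# df.to_excel('brands.xlsx', index=False)   # or straight to .xlsx (requires openpyxl)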

Related

Web scraping of concurrent review pages

How can I scrape consecutive web pages of customer reviews in Python when the pages follow A) a regular order or B) an irregular order? Let me explain:
In this link, pageNumber=2 means the second page of reviews:
https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2
When I click the next button, the link becomes '...pageNumber=3' and so on. Sometimes I can find the last page, sometimes not.
In any case, I want to write a line of code that covers all the pages instead of generating all the pages and pasting them into a Jupyter notebook.
My code was like this (the list of URLs is shortened here):
import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urlencode
import csv

# Define a list of URLs that will be scraped.
list_of_urls = ['https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=1',
'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2',
'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=3',
'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=4',
'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=5',
'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=6',
'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=7',
'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=8',
'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=9'
]
# Retrieve each URL's HTML data and convert the data into a BeautifulSoup object.
# Find, extract and store reviewer names and review text into lists.
names = []
reviews = []
data_string = ""

for url in list_of_urls:
    params = {'api_key': "f00ffd18cb3cb9e64c315b9aa54e29f3", 'url': url}
    response = requests.get('http://api.scraperapi.com/', params=urlencode(params))
    soup = BeautifulSoup(response.text, 'html.parser')
    for item in soup.find_all("span", class_="a-profile-name"):
        data_string = data_string + item.get_text()
        names.append(data_string)
        data_string = ""
    for item in soup.find_all("span", {"data-hook": "review-body"}):
        data_string = data_string + item.get_text()
        reviews.append(data_string)
        data_string = ""

# Create the dictionary.
reviews_dict = {'Reviewer Name': names, 'Reviews': reviews}

# Print the lengths of each list.
print(len(names), len(reviews))

# Create a new dataframe.
df = pd.DataFrame.from_dict(reviews_dict, orient='index')
df.head()

# Delete all the columns that have missing values.
df.dropna(axis=1, inplace=True)
df.head()

# Transpose the dataframe.
prod_reviews = df.T
print(prod_reviews.head(10))

# Remove special characters from review text.
prod_reviews['Reviews'] = prod_reviews['Reviews'].astype(str)
prod_reviews.head(5)

# Convert dataframe to CSV file.
prod_reviews.to_csv('Review.csv', index=False, header=True)
So, the full list of URLs to be scraped runs to hundreds of entries. I want to shorten it; I don't want to paste all the URLs. How can I do it?
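Since only the pageNumber query parameter changes between pages, one way to shorten that list is to generate the URLs in a loop. A minimal sketch, using the base URL from the question; the total page count is a placeholder you would set yourself (or you could stop once a page returns no reviews):

# base review URL from the question, with pageNumber left as a placeholder
base = ('https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/'
        'B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2'
        '?ie=UTF8&reviewerType=all_reviews&pageNumber={}')

last_page = 9  # placeholder: set this yourself, or loop until a page has no reviews
list_of_urls = [base.format(page) for page in range(1, last_page + 1)]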

WebScraping for downloading certain .csv files

I have this question. I need to download certain .csv files from a website, as the title says, and I'm having trouble doing it. I'm very new to programming, and especially to this topic (web scraping).
from bs4 import BeautifulSoup as BS
import requests

DOMAIN = 'https://datos.gob.ar'
URL = 'https://datos.gob.ar/dataset/cultura-mapa-cultural-espacios-culturales/'
FILETYPE = ".csv"

def get_soup(url):
    return BS(requests.get(url).text, 'html.parser')

for link in get_soup(URL).find_all('a'):
    file_link = link.get('href')
    if FILETYPE in file_link:
        print(file_link)
This code shows all the available .csv files, but I just need to download the ones that end with "biblioteca_popular.csv", "cine.csv" and "museos.csv".
Maybe it's a very simple task, but I cannot figure it out.
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/456d1087-87f9-4e27-9c9c-1d9734c7e51d/download/biblioteca_especializada.csv
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/01c6c048-dbeb-44e0-8efa-6944f73715d7/download/biblioteca_popular.csv
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/8d0b7f33-d570-4189-9961-9e907193aebc/download/casas_bicentenario.csv
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/4207def0-2ff7-41d5-9095-d42ae8207a5d/download/museos.csv
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/392ce1a8-ef11-4776-b280-6f1c7fae16ae/download/cine.csv
https://datos.cultura.gob.ar/dataset/37305de4-3cce-4d4b-9d9a-fec3ca61d09f/resource/87ebac9c-774c-4ef2-afa7-044c41ee4190/download/teatro.csv
You can extract the JavaScript object housing that info, which would otherwise be loaded into the place you see it by JavaScript running in the browser. You then need to do some Unicode code-point cleaning and string cleaning, and parse it as JSON. You can use a keyword list to select the desired URLs.
Unicode cleaning method by @Mark Tolonen.
import json
import requests
import re

URL = 'https://datos.gob.ar/dataset/cultura-mapa-cultural-espacios-culturales/'
r = requests.get(URL)
search = ["Bibliotecas Populares", "Salas de Cine", "Museos"]

# pull out the JSON-LD "@graph" array embedded in the page and strip layout whitespace
s = re.sub(r'\n\s{2,}', '', re.search(r'"@graph": (\[[\s\S]+{0}[\s\S]+)}}'.format(search[0]), r.text).group(1))
# decode \uXXXX escapes, drop escaped quotes, then parse as JSON
data = json.loads(re.sub(r'\\"', '', re.sub(r'\\u([0-9a-fA-F]{4})', lambda m: chr(int(m.group(1), 16)), s)))

for i in data:
    if 'schema:name' in i:
        name = i['schema:name']
        if name in search:
            print(name)
            print(i['schema:url'])
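To actually save the matched files to disk (the question asks to download them, not just list the URLs), here is a minimal sketch building on the loop above; the filename is simply taken from the end of each URL:

import os

for i in data:
    if 'schema:name' in i and i['schema:name'] in search:
        file_url = i['schema:url']
        filename = os.path.basename(file_url)   # e.g. museos.csv
        resp = requests.get(file_url)
        with open(filename, 'wb') as f:
            f.write(resp.content)
        print('saved', filename)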

Scraping specific checkbox values using Python

I am trying to analyze the data on this website: https://transparency.entsoe.eu/transmission-domain/r2/forecastedTransferCapacitiesMonthAhead/show
I want to scrape a couple of country borders, such as BZN|PT - BZN|ES and BZN|RO - BZN|BG.
For forecastedTransferCapacitiesMonthAhead I tried the following:
from bs4 import BeautifulSoup
import requests

page = requests.get('https://transparency.entsoe.eu/transmission-domain/r2/forecastedTransferCapacitiesMonthAhead/show')
soup = BeautifulSoup(page.text, 'html.parser')
tran_month = soup.find('table', id='dv-datatable').findAll('tr')
for price in tran_month:
    print(''.join(price.get_text("|", strip=True).split()))
But I only get the preselected country. How can I pass my arguments so that I can select the countries that I want? Much obliged.
The code is missing a crucial part, i.e. the parameters that inform the request, like import/export, from/to countries, and types.
To solve the issue, below is code built on yours which uses requests' GET with parameters. To run the complete code, you should find out the full list of parameter values per country.
from bs4 import BeautifulSoup
import requests

payload = {  # this is the dictionary whose values can be changed for the request
    'name': '',
    'defaultValue': 'false',
    'viewType': 'TABLE',
    'areaType': 'BORDER_BZN',
    'atch': 'false',
    'dateTime.dateTime': '01.05.2020 00:00|UTC|MONTH',
    'border.values': 'CTY|10YPL-AREA-----S!BZN_BZN|10YPL-AREA-----S_BZN_BZN|10YDOM-CZ-DE-SKK',
    'direction.values': ['Export', 'Import']
}

page = requests.get('https://transparency.entsoe.eu/transmission-domain/r2/forecastedTransferCapacitiesMonthAhead/show',
                    params=payload)  # GET request + parameters
soup = BeautifulSoup(page.text, 'html.parser')
tran_month = soup.find('table', id='dv-datatable').findAll('tr')
for price in tran_month:  # print all values, row by row (date, export and import)
    print(price.text.strip())
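To cover the specific borders you mention (BZN|PT - BZN|ES and BZN|RO - BZN|BG), one approach is to copy the border.values string the site sends for each border (visible in the browser's dev tools, network tab, when you tick that border) and loop over them. A rough sketch; the border strings below are placeholders, not real values:

# placeholder border strings: copy the real ones from the request the site sends
# when you select the desired borders (browser dev tools -> network tab)
borders = [
    'CTY|...PT-ES border value...',
    'CTY|...RO-BG border value...',
]

for border in borders:
    payload['border.values'] = border
    page = requests.get('https://transparency.entsoe.eu/transmission-domain/r2/forecastedTransferCapacitiesMonthAhead/show',
                        params=payload)
    soup = BeautifulSoup(page.text, 'html.parser')
    for price in soup.find('table', id='dv-datatable').findAll('tr'):
        print(price.text.strip())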

Unable to find css selector using BS4

I am trying to scrape some data off of
https://www.bose.com/en_us/locations/?page=1&storesPerPage=10
but am unable to using BS4's CSS selectors.
Because the tag I am trying to grab has many classes, I am using the soup.select() function. I can easily do this using other functions, but I am curious why this one specifically does not work.
from bs4 import BeautifulSoup
import requests
url = 'https://www.bose.com/en_us/locations/?page=1&storesPerPage=10'
soup = BeautifulSoup(requests.get(url).content)
soup.select('div.bw__StoreLocation')
# returns []
soup.select('.bw__StoreLocation')
# returns []
However, when I print(soup) I can see that .bw__StoreLocation is in it.
The store data is added dynamically, so it isn't in the HTML that requests receives. The underlying request URL, as mentioned in the comments, can be found in the browser's network tab.
import requests

params = (
    ('page', '0'),
    ('getRankingInfo', 'true'),
    ('facets/[/]', '*'),
    ('aroundRadius', 'all'),
    ('filters', 'domain:bose.brickworksoftware.com AND publishedAt<=1566084972196'),
    ('esSearch', '''{
        "page":0
        ,"storesPerPage":15
        ,"domain":"bose.brickworksoftware.com"
        ,"locale":"en_US"
        ,"must":[{"type":"range","field":"published_at","value":{"lte":1566084972196}}]
        ,"filters":[]
        ,"aroundLatLngViaIP":"True"
    }'''),
    ('aroundLatLngViaIP', 'true'),
)

r = requests.get('https://bose.brickworksoftware.com/locations_search', params=params).json()
data = r['hits'][0]['attributes']
address = ', '.join([data['address1'], data['city'], data['countryCode'], data['postalCode']])
print(address)
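The snippet above only prints the first hit. To list every store returned, a small extension that loops over all of r['hits'], using the same attribute keys shown above:

for hit in r['hits']:
    data = hit['attributes']
    address = ', '.join([data['address1'], data['city'], data['countryCode'], data['postalCode']])
    print(address)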

How to scrape options from dropdown list and store them in table?

I am trying to make an interactive dashboard with analysis, based on car data. I would like the user to be able to pick a car brand, for example BMW, Audi etc., and based on that choice only be able to pick BMW/Audi etc. models. My problem is that after selecting a brand, I am not able to scrape the models that belong to that brand. The pages I am scraping from:
main page --> https://www.otomoto.pl/osobowe/
sub car brand page example --> https://www.otomoto.pl/osobowe/audi/
I have tried to scrape every option, so that later on I can maybe somehow clean the data to keep only the models.
Code:
otomoto_models <- paste0("https://www.otomoto.pl/osobowe/", "audi/")

models <- read_html(otomoto_models) %>%
  html_nodes("option") %>%
  html_text()
But it just scrapes the brands together with the other options available on the page (engine type etc.), while after inspecting the element I can clearly see the model types.
otomoto <- "https://www.otomoto.pl/osobowe/"

brands <- read_html(otomoto) %>%
  html_nodes("option") %>%
  html_text()

brands <- data.frame(brands)

for (i in 1:nrow(brands)){
  no_marka_pojazdu <- i
  if(brands[i,1] == "Marka pojazdu"){
    break
  }
}
no_marka_pojazdu <- no_marka_pojazdu + 1

for (i in 1:nrow(brands)){
  zuk <- i
  if(substr(brands[i,1],1,3) == "Żuk"){
    break
  }
}

Modele_pojazdow <- as.character(brands[no_marka_pojazdu:zuk,1])
Modele_pojazdow <- removeNumbers(Modele_pojazdow)
Modele_pojazdow <- substr(Modele_pojazdow,1,nchar(Modele_pojazdow)-2)
Modele_pojazdow <- data.frame(Modele_pojazdow)
The above code only picks the supported car brands on the webpage and stores them in a data frame. With that I am able to create the HTML link and direct everything to one selected brand.
I would like to have an object similar to "Modele_pojazdow", but with the models limited to the previously selected car brand.
The dropdown list with models appears as a white box with the text "Model pojazdu", next to the "Audi" box on the right side.
Some may frown on the solution language being Python, but the aim here was to give some pointers (a high-level process). I haven't written R in a long time, so Python was quicker.
EDIT: R script now added
General outline:
The first dropdown options can be grabbed from the value attribute of each node returned by the CSS selector #param571 option. This uses an id selector (#) to target the parent dropdown select element, then an option type selector in descendant combination to specify the option tag elements within. The HTML to apply this selector combination to can be retrieved with an XHR request to the URL you initially provided. You want a nodeList returned to iterate over, akin to applying the selector with JS document.querySelectorAll.
The page uses ajax POST requests to update the second dropdown based on your first dropdown choice. Your first dropdown choice determines the value of a parameter search[filter_enum_make], which is used in the POST request to the server. The subsequent response contains a list of the available options (it includes some case alternatives which can be trimmed out).
I captured the POST request by using fiddler. This showed me the request headers and params in the request body. Screenshot sample shown at end.
The simplest way to extract the options from the response text, IMO, is to regex the appropriate string out (I wouldn't normally recommend regex for working with html but in this case it serves us nicely). If you don't want to use regex, you can grab the relevant info from the data-facets attribute of the element with id body-container. For the non-regex version you need to handle unquoted nulls, and retrieve the inner dictionary whose key is filter_enum_model. I show a function re-write, at the end, to handle this.
The retrieved string is a string representation of a dictionary. This needs converting to an actual dictionary object which you can then extract the option values from. Edit: As R doesn't have a dictionary object a similar structure needs to be found. I will look at this when converting.
I create a user-defined function, getOptions(), to return the options for each make. Each car make value comes from the list of possible items in the first dropdown. I loop over those possible values, use the function to return a list of options for that make, and add those lists as values to a dictionary, results, whose keys are the makes of car. Again, for R an object with similar functionality to a Python dictionary needs to be found.
That dictionary of lists needs converting to a dataframe which includes a transpose operation to make a tidy output of headers, which are the car makes, and columns underneath each header, which contain the associated models.
The whole thing can be written to csv at the end.
So, hopefully that gives you an idea of one way to achieve what you want. Perhaps someone else can use this to help write you a solution.
Python demonstration of this below:
import requests
from bs4 import BeautifulSoup as bs
import re
import ast
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
}

def getOptions(make):  # function to return options based on make
    data = {
        'search[filter_enum_make]': make,
        'search[dist]': '5',
        'search[category_id]': '29'
    }
    r = requests.post('https://www.otomoto.pl/ajax/search/list/', data=data, headers=headers)
    try:
        # verify the regex here: https://regex101.com/r/emvqXs/1
        # regex to extract the string containing the models associated with the car make filter
        data = re.search(r'"filter_enum_model":(.*),"new_used"', r.text, flags=re.DOTALL).group(1)
        aDict = ast.literal_eval(data)  # convert string representation of dictionary to python dictionary
        d = len({k.lower(): v for k, v in aDict.items()}.keys())  # find length of unique keys when accounting for case
        dirtyList = list(aDict)[:d]  # trim to unique values
        cleanedList = [item for item in dirtyList if item != 'other']  # remove 'other' as it doesn't appear in dropdown
    except:
        cleanedList = []  # sometimes there are no associated values in 2nd dropdown
    return cleanedList

r = requests.get('https://www.otomoto.pl/osobowe/')
soup = bs(r.content, 'lxml')
values = [item['value'] for item in soup.select('#param571 option') if item['value'] != '']

results = {}
# build a dictionary of lists to hold options for each make
for value in values:
    results[value] = getOptions(value)  # function call to return options based on make

# turn into a dataframe and transpose so each column header is the make and the options are listed below
df = pd.DataFrame.from_dict(results, orient='index').transpose()

# write to csv
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig', index=False)
Sample of csv output (screenshot not reproduced here). Sample json for alfa-romeo (screenshot not reproduced here).
Example of regex match for alfa-romeo:
{"145":1,"146":1,"147":218,"155":1,"156":118,"159":559,"164":2,"166":39,"33":1,"Alfasud":2,"Brera":34,"Crosswagon":2,"GT":89,"GTV":7,"Giulia":251,"Giulietta":378,"Mito":224,"Spider":24,"Sportwagon":2,"Stelvio":242,"alfasud":2,"brera":34,"crosswagon":2,"giulia":251,"giulietta":378,"gt":89,"gtv":7,"mito":224,"spider":24,"sportwagon":2,"stelvio":242}
Example of the filter option list returned from function call with make parameter value alfa-romeo:
['145', '146', '147', '155', '156', '159', '164', '166', '33', 'Alfasud', 'Brera', 'Crosswagon', 'GT', 'GTV', 'Giulia', 'Giulietta', 'Mito', 'Spider', 'Sportwagon', 'Stelvio']
Sample of fiddler request (screenshot not reproduced here).
Sample of ajax response html containing options:
<section id="body-container" class="om-offers-list"
data-facets='{"offer_seek":{"offer":2198},"private_business":{"business":1326,"private":872,"all":2198},"categories":{"29":2198,"161":953,"163":953},"categoriesParent":[],"filter_enum_model":{"145":1,"146":1,"147":219,"155":1,"156":116,"159":561,"164":2,"166":37,"33":1,"Alfasud":2,"Brera":34,"Crosswagon":2,"GT":88,"GTV":7,"Giulia":251,"Giulietta":380,"Mito":226,"Spider":25,"Sportwagon":2,"Stelvio":242,"alfasud":2,"brera":34,"crosswagon":2,"giulia":251,"giulietta":380,"gt":88,"gtv":7,"mito":226,"spider":25,"sportwagon":2,"stelvio":242},"new_used":{"new":371,"used":1827,"all":2198},"sellout":null}'
data-showfacets=""
data-pagetitle="Alfa Romeo samochody osobowe - otomoto.pl"
data-ajaxurl="https://www.otomoto.pl/osobowe/alfa-romeo/?search%5Bbrand_program_id%5D%5B0%5D=&search%5Bcountry%5D="
data-searchid=""
data-keys=''
data-vars=""
Alternative version of function without regex:
from bs4 import BeautifulSoup as bs

def getOptions(make):  # function to return options based on make
    data = {
        'search[filter_enum_make]': make,
        'search[dist]': '5',
        'search[category_id]': '29'
    }
    r = requests.post('https://www.otomoto.pl/ajax/search/list/', data=data, headers=headers)
    soup = bs(r.content, 'lxml')
    data = soup.select_one('#body-container')['data-facets'].replace('null', '"null"')
    aDict = ast.literal_eval(data)['filter_enum_model']  # convert string representation of dictionary to python dictionary
    d = len({k.lower(): v for k, v in aDict.items()}.keys())  # find length of unique keys when accounting for case
    dirtyList = list(aDict)[:d]  # trim to unique values
    cleanedList = [item for item in dirtyList if item != 'other']  # remove 'other' as it doesn't appear in dropdown
    return cleanedList

print(getOptions('alfa-romeo'))
R conversion and improved Python:
Whilst converting to R I found a better way of extracting the parameters from a js file on the server. If you open dev tools you can see the file listed in the sources tab.
R (To be improved):
library(httr)
library(jsonlite)

url <- 'https://www.otomoto.pl/ajax/jsdata/params/'
r <- GET(url)
contents <- content(r, "text")

data <- strsplit(contents, "var searchConditions = ")[[1]][2]
data <- strsplit(as.character(data), ";var searchCondition")[[1]][1]

source <- fromJSON(data)$values$'573'$'571'
makes <- names(source)

for(make in makes){
  print(make)
  print(source[make][[1]]$value)
  #break
}
Python:
import requests
import json
import pandas as pd

r = requests.get('https://www.otomoto.pl/ajax/jsdata/params/')
data = r.text.split('var searchConditions = ')[1]
data = data.split(';var searchCondition')[0]
items = json.loads(data)

source = items['values']['573']['571']
makes = [item for item in source]

results = {}
for make in makes:
    df = pd.DataFrame(source[make])  # build a dictionary of lists to hold options for each make
    results[make] = list(df['value'])

# turn into a dataframe and transpose so each column header is the make and the options are listed below
dfFinal = pd.DataFrame.from_dict(results, orient='index').transpose()

# tidy up None values to empty strings https://stackoverflow.com/a/31295814/6241235
mask = dfFinal.applymap(lambda x: x is None)
cols = dfFinal.columns[(mask).any()]
for col in dfFinal[cols]:
    dfFinal.loc[mask[col], col] = ''

print(dfFinal)
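As in the earlier version, the final dataframe can be written straight to CSV if that is the output you want, e.g. with the same encoding settings:

dfFinal.to_csv('Data.csv', sep=',', encoding='utf-8-sig', index=False)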
