Unable to find CSS selector using BS4

I am trying to scrape some data from
https://www.bose.com/en_us/locations/?page=1&storesPerPage=10
but I am unable to with BS4's CSS selectors.
Because the tag I am trying to grab has many classes, I am using the soup.select() function. I can easily do this with other functions, but I am curious why this one specifically does not work.
from bs4 import BeautifulSoup
import requests

url = 'https://www.bose.com/en_us/locations/?page=1&storesPerPage=10'
# specifying a parser avoids bs4's GuessedAtParserWarning
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

soup.select('div.bw__StoreLocation')
# returns []
soup.select('.bw__StoreLocation')
# returns []
However, when I print(soup), I can see that bw__StoreLocation appears in the output.

The data is added dynamically with JavaScript, so it is not present as parsed elements in the HTML that requests retrieves; the class name only appears inside script text. The underlying request URL, as noted in the comments, can be found in the browser's network tab.
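As a quick sanity check, you can confirm the class name occurs only as a string in the raw markup and not on any parsed element (a sketch; the expected outputs are assumptions based on the answer above):

import requests
from bs4 import BeautifulSoup

url = 'https://www.bose.com/en_us/locations/?page=1&storesPerPage=10'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')

print('bw__StoreLocation' in html)                  # expected True: the name occurs in script text
print(soup.select('[class*="bw__StoreLocation"]'))  # expected []: no parsed element carries the class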
import requests

params = (
    ('page', '0'),
    ('getRankingInfo', 'true'),
    ('facets[]', '*'),
    ('aroundRadius', 'all'),
    ('filters', 'domain:bose.brickworksoftware.com AND publishedAt<=1566084972196'),
    ('esSearch', '''{
        "page": 0,
        "storesPerPage": 15,
        "domain": "bose.brickworksoftware.com",
        "locale": "en_US",
        "must": [{"type": "range", "field": "published_at", "value": {"lte": 1566084972196}}],
        "filters": [],
        "aroundLatLngViaIP": "True"
    }'''),
    ('aroundLatLngViaIP', 'true'),
)

r = requests.get('https://bose.brickworksoftware.com/locations_search', params=params).json()
data = r['hits'][0]['attributes']
address = ', '.join([data['address1'], data['city'], data['countryCode'], data['postalCode']])
print(address)
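If you want every store rather than just the first hit, you can iterate over the same response. Continuing from the r above (a sketch; that every hit exposes the same attributes keys is an assumption, hence the .get() guards):

for hit in r['hits']:
    data = hit['attributes']
    # .get() guards against stores that omit a field (an assumption, not verified)
    print(', '.join(data.get(k, '') for k in ('address1', 'city', 'countryCode', 'postalCode')))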

Related

For Loop Not Repeating

I'm trying to find the hrefs for all the states where this company has stores; however, it only finds the href for the first state.
Can anyone figure out why the for loop doesn't repeat for the rest of the states? Thank you very much for your help!
import requests
from bs4 import BeautifulSoup
import csv

# website
sitemap = 'website_url'

# content of website
sitemap_content = requests.get(sitemap).content

# parsing website
soup = BeautifulSoup(sitemap_content, 'html.parser')
#print(soup)

list_of_divs = soup.findAll('div', attrs={'class': 'listings-inner'})
#print(list_of_divs)

header = ['Links']
with open('/Users/ABC/Desktop/v1.csv', 'wt') as csvfile:
    writer = csv.writer(csvfile, delimiter="\t")
    writer.writerow(header)
    for state in list_of_divs:
        # get the URLs by state
        print(state.find('div', attrs={'class': 'itemlist'}).a.get('href'))
        rows = [state.find('div', attrs={'class': 'itemlist'}).a.get('href')]
        writer.writerow(rows)
list_of_divs actually only contains one element, which is the only div on the page with class listings-inner. So when you iterate through all of its elements and use the find method, it will only return the first result.
You want to use the find_all method on that div:
import requests
from bs4 import BeautifulSoup

sitemap = 'https://stores.dollargeneral.com/'
sitemap_content = requests.get(sitemap).content
soup = BeautifulSoup(sitemap_content, 'html.parser')

listings_div = soup.find('div', attrs={'class': 'listings-inner'})
for state in listings_div.find_all('div', attrs={'class': 'itemlist'}):
    print(state.a.get('href'))
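To keep the original CSV output, the same fix slots back into the asker's script (a sketch, assuming the same file path and that every itemlist div wraps an <a> tag):

import csv
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://stores.dollargeneral.com/').content, 'html.parser')
listings_div = soup.find('div', attrs={'class': 'listings-inner'})

with open('/Users/ABC/Desktop/v1.csv', 'wt') as csvfile:
    writer = csv.writer(csvfile, delimiter='\t')
    writer.writerow(['Links'])
    for state in listings_div.find_all('div', attrs={'class': 'itemlist'}):
        # assumes each itemlist div contains at least one <a>; .a picks the first
        writer.writerow([state.a.get('href')])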

Scraping specific checkbox values using Python

I am trying to analyze the data on the ENTSO-E transparency platform (the URL is in the code below).
I want to scrape a couple of borders, such as BZN|PT - BZN|ES and BZN|RO - BZN|BG.
For forecastedTransferCapacitiesMonthAhead I tried the following:
from bs4 import BeautifulSoup
import requests

page = requests.get('https://transparency.entsoe.eu/transmission-domain/r2/forecastedTransferCapacitiesMonthAhead/show')
soup = BeautifulSoup(page.text, 'html.parser')
tran_month = soup.find('table', id='dv-datatable').findAll('tr')
for price in tran_month:
    print(''.join(price.get_text("|", strip=True).split()))
But I only get the preselected country. How can I pass my arguments so that I can select the countries that I want? Much obliged.
The code is missing a crucial part: the query parameters that drive the request, such as the import/export direction and the from/to countries.
The code below builds on yours and passes those parameters to the GET request through requests' params argument. To run it for every border you want, you will need to look up the full list of parameter values per country pair.
from bs4 import BeautifulSoup
import requests

payload = {  # this is the dictionary whose values can be changed for the request
    'name': '',
    'defaultValue': 'false',
    'viewType': 'TABLE',
    'areaType': 'BORDER_BZN',
    'atch': 'false',
    'dateTime.dateTime': '01.05.2020 00:00|UTC|MONTH',
    'border.values': 'CTY|10YPL-AREA-----S!BZN_BZN|10YPL-AREA-----S_BZN_BZN|10YDOM-CZ-DE-SKK',
    'direction.values': ['Export', 'Import'],
}

page = requests.get('https://transparency.entsoe.eu/transmission-domain/r2/forecastedTransferCapacitiesMonthAhead/show',
                    params=payload)  # GET request + parameters
soup = BeautifulSoup(page.text, 'html.parser')
tran_month = soup.find('table', id='dv-datatable').findAll('tr')
for price in tran_month:  # print all values, row by row (date, export and import)
    print(price.text.strip())
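To cover several borders, one approach is to loop over a mapping of border labels to their border.values strings and re-issue the request for each. A sketch follows; the PLACEHOLDER strings are hypothetical and must be replaced with the real values copied from the browser's network tab for each country pair:

import requests
from bs4 import BeautifulSoup

URL = 'https://transparency.entsoe.eu/transmission-domain/r2/forecastedTransferCapacitiesMonthAhead/show'

base_payload = {
    'name': '',
    'defaultValue': 'false',
    'viewType': 'TABLE',
    'areaType': 'BORDER_BZN',
    'atch': 'false',
    'dateTime.dateTime': '01.05.2020 00:00|UTC|MONTH',
    'direction.values': ['Export', 'Import'],
}

# Hypothetical placeholders: copy the real 'border.values' string for each
# country pair from the network tab before running.
borders = {
    'BZN|PT - BZN|ES': 'PLACEHOLDER_PT_ES',
    'BZN|RO - BZN|BG': 'PLACEHOLDER_RO_BG',
}

for label, border_value in borders.items():
    payload = dict(base_payload, **{'border.values': border_value})
    soup = BeautifulSoup(requests.get(URL, params=payload).text, 'html.parser')
    print(label)
    for row in soup.find('table', id='dv-datatable').findAll('tr'):
        print(row.text.strip())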

Extracting key-value data from JavaScript JSON-type data with BS4

I am trying to extract some information from the HTML of a web page,
but neither the regex method nor the list-comprehension method works.
At http://bitly.kr/RWz5x there is a key called encparam, enclosed in getjason inside a JavaScript tag, which is the 49th of all the script elements on the page.
Thank you for your help in advance.
import requests
import re
from bs4 import BeautifulSoup

sam = requests.get('http://bitly.kr/RWz5x')
#html = sam.text
html = sam.content
soup = BeautifulSoup(html, 'html.parser')
scripts = soup.find_all('script')
#your_script = [script for script in scripts if 'encparam' in str(script)][0]
#print(your_script)
#print(scripts)
pattern = re.compile(r"(\w+): '(.*?)'")
fields = dict(re.findall(pattern, scripts.text))  # fails: scripts is a ResultSet, which has no .text
Send your request to the following URL, which you can find in the sources tab:
import requests
from bs4 import BeautifulSoup as bs
import re
res = requests.get("https://navercomp.wisereport.co.kr/v2/company/c1010001.aspx?cmp_cd=005930")
soup = bs(res.content, 'lxml')
r = re.compile(r"encparam: '(.*)'")
data = soup.find('script', text=r).text
encparam = r.findall(data)[0]
print(encparam)
It is likely you can avoid bs4 altogether:
import requests
import re
r = requests.get("https://navercomp.wisereport.co.kr/v2/company/c1010001.aspx?cmp_cd=005930")
p = re.compile(r"encparam: '(.*)'")
encparam = p.findall(r.text)[0]
print(encparam)
If you actually want the encparam part in the string:
import requests
import re
r = requests.get("https://navercomp.wisereport.co.kr/v2/company/c1010001.aspx?cmp_cd=005930")
p = re.compile(r"(encparam: '\w+')")
encparam = p.findall(r.text)[0]
print(encparam)
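If the page layout ever changes and the pattern stops matching, p.findall(r.text)[0] raises an IndexError. A slightly more defensive variant of the same idea (a sketch, on the same assumptions about the page) uses re.search:

import requests
import re

r = requests.get("https://navercomp.wisereport.co.kr/v2/company/c1010001.aspx?cmp_cd=005930")
m = re.search(r"encparam: '(\w+)'", r.text)
if m:
    print(m.group(1))  # just the value
else:
    print("encparam not found - the page layout may have changed")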

AttributeError: 'NoneType' object has no attribute 'text' in web-scraping

I'm trying to recreate the web scraping from this tutorial:
https://medium.freecodecamp.org/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe
I'm working in Jupyter as a first project, and I've come up with this error:
AttributeError: 'NoneType' object has no attribute 'text'
I've tried changing the link, but it makes no difference, and I don't really know enough to do anything about the problem. Here is all the code so far...
#import the libraries
import urllib.request
from bs4 import BeautifulSoup
# specify the url
quote_page = "https://www.bloomberg.com/quote/SP1:IND"
page = urllib.request.urlopen(quote_page)
# parse the html using BeautifulSoup and store in variable `soup`
soup = BeautifulSoup(page, "html.parser")
# Take out the <div> of name and get its value
name_box = soup.find("h1", attrs={"class": "name"})
name = name_box.text.strip()
# strip() is used to remove starting and trailing
print (name)
# get the index price
price_box = soup.find("div", attrs={"class":"price"})
price = price_box.text.strip()
print (price)
Any help would be appreciated a lot
I use selenium to webscrape, but I am sure I can help you out (maybe).
This section is where your code gives you the error, I assume:
price_box = soup.find("div", attrs={"class":"price"})
price = price_box.text.strip()
print (price)
What I would do is check that find() actually returned a tag before touching .text, since the AttributeError means it returned None:
price_box = soup.find("div", attrs={"class": "price"})
if price_box is not None:
    print(price_box.text.strip())
else:
    # find() returned None: the page is probably rendered with JavaScript
    # or is blocking the request
    print("no <div class='price'> found")
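Since the answer mentions selenium, here is a minimal sketch of that route. It assumes Chrome and a matching chromedriver are installed, and that the h1.name / div.price selectors from the question still exist on the rendered page, which is not verified:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.bloomberg.com/quote/SP1:IND")
# these selectors come from the question and are assumptions about the live page
name = driver.find_element(By.CSS_SELECTOR, "h1.name").text.strip()
price = driver.find_element(By.CSS_SELECTOR, "div.price").text.strip()
print(name, price)
driver.quit()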

Scraping data from wikipedia using Scrapy - why/when do errors occur due to processing URLs?

I am just starting to use Scrapy and am learning as I go along. Can someone please explain why there is an error in my code, and what the error is? Is it related to an invalid URL I have provided, and/or to invalid XPaths?
Here is my code:
from scrapy.spider import Spider
from scrapy.selector import Selector

class CatswikiSpider(Spider):
    name = "catswiki"
    allowed_domains = ["http://en.wikipedia.org/wiki/Cat"]
    start_urls = [
        "http://en.wikipedia.org/wiki/Cat"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//body/div')
        for site in sites:
            title = ('//h1/span/text()').extract()
            subtitle = ('//h2/span/text()').extract()
            boldtext = ('//p/b').extract()
            links = ('//a/@href').extract()
            imagelinks = ('//img/@src').re(r'.*cat.*').extract()
            print title, subtitle, boldtext, links, imagelinks
        #filename = response.url.split("/")[-2]
        #open(filename, 'wb').write(response.body)
And here are some attachments showing the errors in the command prompt (screenshots not reproduced here):
Each of those lines needs a selector call before .extract(); right now you're calling .extract() on a plain string. I'm not familiar with scrapy, but it's probably something like:
title = site.xpath('//h1/span/text()').extract()
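Applied to the whole loop, the parse method might look like this (a sketch: the XPaths are the question's own, with a leading ./ added so each site is actually searched rather than the whole document, and .extract() dropped after .re(), which already returns strings):

def parse(self, response):
    sel = Selector(response)
    for site in sel.xpath('//body/div'):
        title = site.xpath('.//h1/span/text()').extract()
        subtitle = site.xpath('.//h2/span/text()').extract()
        boldtext = site.xpath('.//p/b').extract()
        links = site.xpath('.//a/@href').extract()
        imagelinks = site.xpath('.//img/@src').re(r'.*cat.*')  # .re() returns strings, no .extract() needed
        print title, subtitle, boldtext, links, imagelinks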
