Different webpage results when using Scrapy - web-scraping

I was trying to scrape a supermarket website using Scrapy:
https://www.pnp.co.za/pnpstorefront/pnp/en/All-Products/Fresh-Food/Milk-%26-Cream/c/milk-and-cream703655157
I noticed that when using Chrome, I get a page showing 106 results over 5 pages. However, when using a spider with Scrapy (and other scraping software), the number of results is reduced to 30 products over 2 pages. It seems like the site limits the results shown when it is accessed by Scrapy. How would one get around this and have a Scrapy spider be seen as my laptop running Chrome?
I use the following command to run the spider:
scrapy crawl tstPnPCategories -o out.csv
And here is the spider script:
import scrapy

class testSpydi(scrapy.Spider):
    name = 'tstPnPCategories'
    start_urls = [
        'https://www.pnp.co.za/pnpstorefront/pnp/en/All-Products/Fresh-Food/Milk-%26-Cream/c/milk-and-cream703655157'
    ]

    def parse(self, response):
        names = response.css(".item-name::text").extract()
        print("*** *******")
        print("")
        print("NAMES")
        print("")
        print("************")
        for name in names:
            print("")
            print(name)
            print("")
            yield {
                'item': name
            }
        next_page = response.css("li.pagination-next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

You have to select a different region (store) to be able to scrape data about more items.
The script therefore has to issue a click on one of the dropdown menu items before scraping.
The first item in the dropdown can be clicked by running the following in the browser console:
document.getElementsByClassName('js-base-store')[0].click()
The element was identified using the Developer Tools in the Chrome browser.
DevTools is opened by pressing F12 or Ctrl + Shift + I, or by choosing More tools > Developer tools from the browser menu (three vertical dots).
Look for the dropdown item carrying the js-base-store class in the Elements panel.
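For reference, a minimal sketch of how that click could be automated before scraping. It assumes Selenium with ChromeDriver; apart from the js-base-store class and the category URL above, every step here is an assumption rather than a verified recipe.

# Hypothetical sketch: pick a store region with Selenium, then either keep
# parsing with Selenium or hand the session cookies over to Scrapy requests.
from selenium import webdriver
from selenium.webdriver.common.by import By

URL = ("https://www.pnp.co.za/pnpstorefront/pnp/en/All-Products/"
       "Fresh-Food/Milk-%26-Cream/c/milk-and-cream703655157")

driver = webdriver.Chrome()
driver.get(URL)

# Click the first entry in the store dropdown, as suggested above.
driver.execute_script(
    "document.getElementsByClassName('js-base-store')[0].click()"
)

# The chosen region is normally remembered via cookies; collect them so they
# can be passed to scrapy.Request(..., cookies=cookies) if needed.
cookies = {c["name"]: c["value"] for c in driver.get_cookies()}
print(len(driver.find_elements(By.CSS_SELECTOR, ".item-name")))
driver.quit()

Middlewares such as scrapy-selenium or scrapy-playwright are an alternative if the whole crawl needs to run through a real browser.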

Related

Selenium stopping mid-loop without any error or exception

I was working on a project to scrape data for multiple cities from a website. There are about 1000 cities to scrape, and during the for loop the webdriver window simply stops going to the requested site. However, the program is still running and no error or exception is thrown; the webdriver window just stays on the same webpage.
There is a common feature to the pages it stops at: they are always the ones requiring the button clicks. The button-click parts of the code should be correct, since they work as desired for hundreds of cities before the suspension.
However, on every run the program stops at different cities, and I can find no correlation between the cities, which leaves me very confused and unable to identify the root of the problem.
data_lookup = 'https://www.numbeo.com/{}/in/{}'
tabs = ['cost-of-living', 'property-investment', 'quality-of-life']
cities_stat = []
browser = webdriver.Chrome(executable_path = driver_path, chrome_options=ChromeOptions)
for x in range(len(cities_split)):
    city_stat = []
    for tab in tabs:
        browser.get(data_lookup.format(tab, str(cities_split[x][0])+'-'+str(cities_split[x][1])))
        try:
            city_stat.append(read_data(tab))
        except NoSuchElementException:
            try:
                city_button = browser.find_element(By.LINK_TEXT, cities[x])
                city_button.click()
                city_stat.append(read_data(tab))
            except NoSuchElementException:
                try:
                    if len(cities_split[x]) == 2:
                        browser.get(data_lookup.format(tab, str(cities_split[x][0])))
                        try:
                            city_stat.append(read_data(tab))
                        except NoSuchElementException:
                            city_button = browser.find_element(By.LINK_TEXT, cities[x])
                            city_button.click()
                            city_stat.append(read_data(tab))
                    elif len(cities_split[x]) == 3:
                        city_stat.append(initialize_data(tab))
                except:
                    city_stat.append(initialize_data(tab))
    cities_stat.append(city_stat)
I have tried using WebDriverWait to no avail. All it does is turn the NoSuchElementException into a TimeoutException.
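For reference, a minimal sketch of the explicit-wait pattern mentioned above; the helper name, locator, and timeout are illustrative and not taken from the original code.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def click_city_link(browser, city_name, timeout=15):
    # Wait until the link whose visible text is the city name is clickable,
    # then click it. If it never appears, WebDriverWait raises
    # TimeoutException instead of NoSuchElementException, which matches the
    # behaviour described above.
    wait = WebDriverWait(browser, timeout)
    link = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, city_name)))
    link.click()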

Scraping: No attribute find_all for <p>

Good morning :)
I am trying to scrape the content of this website: https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=438653&CID=102313
The text I am trying to get seems to be located inside some <p> tags and separated by <br>.
For some reason, whenever I try to access a <p>, I get the following error: "ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?", and this happens even if I use find() instead of find_all().
My code is below (it is a very simple thing with no loop yet; I just would like to identify where the mistake comes from):
from selenium import webdriver
import time
from bs4 import BeautifulSoup
options = webdriver.ChromeOptions()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(executable_path='MYPATH/chromedriver',options=options)
url= "https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=446096&CID=102313"
driver.maximize_window()
driver.implicitly_wait(5) # wait up to 5 seconds before calls to find elements time out
driver.get(url)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
column = soup.find_all("div", class_="col-sm-12")
people_in_column = column.find_all("p").find_all("br")
Is there anything obvious I am not understanding here?
Thanks a lot in advance for your help!
You are trying to call find_all on a list of items (a ResultSet), which is incorrect: you use the find_all method twice but never iterate over the first result. The correct way is as follows; it should work.
columns = soup.find_all("div", class_="col-sm-12")
for column in columns:
    people_in_column = column.find("p").get_text(strip=True)
    print(people_in_column)
Full working code as an example:
from selenium import webdriver
import time
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])

webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)

url= "https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=446096&CID=102313"
driver.maximize_window()
driver.implicitly_wait(5) # wait up to 5 seconds before calls to find elements time out
driver.get(url)

content = driver.page_source#.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
columns = soup.find_all("div", class_="col-sm-12")
for column in columns:
    people_in_column = column.find("p").get_text(strip=True)
    print(people_in_column)
Output:
Notice of NIH Policy to All Applicants:Meeting rosters are provided for information purposes only. Applicant investigators and institutional officials must not communicate directly with study section members about an application before or after the review. Failure to observe this policy will create a serious breach of integrity in the peer review process, and may lead to actions outlined inNOT-OD-22-044, including removal of the application from immediate
review.
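Since the question mentions that the entries are separated by <br> tags, here is a small, hedged variant of the loop above that prints each text fragment on its own line instead of one concatenated string (it assumes the same col-sm-12 / <p> structure as the code above):

for column in soup.find_all("div", class_="col-sm-12"):
    for p in column.find_all("p"):
        # stripped_strings yields the text nodes sitting between the <br>
        # separators, with surrounding whitespace removed
        for fragment in p.stripped_strings:
            print(fragment)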

What Caused the Python NoneType Error During My Splinter 'click()' Call?

When trying to scrape the county data from multiple Politico state web pages, such as this one, I concluded the best method was to first click the button that expands the county list before grabbing the table body's data (when present). However, my attempt at clicking the button failed:
from bs4 import BeautifulSoup as bs
import requests
from splinter import Browser

state_page_url = "https://www.politico.com/2020-election/results/washington/"
executable_path = {'executable_path': 'chrome-driver/chromedriver.exe'}
browser = Browser('chrome', **executable_path, headless=False)
browser.visit(state_page_url)

state_soup = bs(browser.html, 'html.parser')
reveal_button = state_soup.find('button', class_='jsx-3713440361')

if (reveal_button == None):
    # Steps to take when the button isn't present
    # ...
else:
    reveal_button.click()
The error returned when following the else-condition is for my click() call: "TypeError: 'NoneType' object is not callable". This doesn't make sense to me, since I thought the if-statement implied that reveal_button was not a NoneType. Am I misinterpreting the error message, how reveal_button was set, or what I'm working with after making state_soup?
Based on the comment thread for the question, and this solution to a similar question, I came across the following fix:
from bs4 import BeautifulSoup as bs
import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

# Navigate the page to click the desired button
state_page_url = "https://www.politico.com/2020-election/results/alabama/"
driver = webdriver.Chrome(executable_path='chrome-driver/chromedriver.exe')
driver.get(state_page_url)

button_list = driver.find_elements(By.CLASS_NAME, 'jsx-3713440361')
if button_list == []:
    # Actions to take when no button is found
    # ...
else:
    button_list[-1].click()  # The index was determined through trial/error specific to the web page

# Now to grab the table and its data
state_soup = bs(driver.page_source)
state_county_results_table = state_soup.find('tbody', class_='jsx-3713440361')
Note that this required Selenium for navigation and interaction, while BeautifulSoup4 was used to parse the page for the information I needed.
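For completeness, a minimal illustration of what caused the original error (separate from the fix above): a BeautifulSoup Tag is a static parse-tree node, and attribute access for an unknown name such as .click falls back to a child-tag lookup that returns None, so calling it raises the NoneType error.

from bs4 import BeautifulSoup

tag = BeautifulSoup('<button class="jsx-3713440361">Counties</button>',
                    'html.parser').button
print(tag.click)  # None: Tag attribute access looks for a child <click> tag
tag.click()       # TypeError: 'NoneType' object is not callable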

How can I use Beautiful Soup to get the following data from Kickstarter?

I am trying to get some data from Kickstarter. How can I use the Beautiful Soup library to do this?
Kickstarter link:
https://www.kickstarter.com/discover/advanced?woe_id=2347575&sort=magic&seed=2600008&page=7
These are the following information I need
Crowdfunding goal
Total crowdfunding
Total backers
Length of the campaign (# of days)
This is my current code
import requests
r = requests.get('https://www.kickstarter.com/discover/advanced?woe_id=2347575&sort=magic&seed=2600008&page=1')
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('div', attrs={'js-react-proj-card grid-col-12 grid-col-6-sm grid-col-4-lg'})
len(results)
I'll give you some hints based on what I know, and I hope you can do the rest yourself.
1. Crawling can cause legal problems when you abuse a site's Terms of Service.
2. find_all should be used with a for statement; it works like "find all" on a web page (Ctrl + F).
e.g.
for a in soup.find_all('div', attrs={'js-react-proj-card grid-col-12 grid-col-6-sm grid-col-4-lg'}):
    print(a)
3. The listing links should also be opened in a for statement: https://www.kickstarte...seed=2600008&page=1. The page number at the end is what gets repeated in the for statement, so you can crawl all the data in order.
4. You should follow links twice: the listing page above contains the list of projects, and you have to get each project's link from it.
So the code's algorithm looks like this:
for i in range(0,10000):
    url = www.kick.....page=i
    for pj_link in find_all(each pj's link):
        r2 = requests.get(pj_link)
        soup2 = BeautifulSoup(r2.text, 'html.parser')
        ......
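A hedged, concrete sketch of that algorithm: the card class comes from the question, while the assumption that each card's first anchor points to the project page, and the small page range, are mine. Note that the discover pages are largely rendered by JavaScript, so a plain requests fetch may return fewer cards than the browser shows.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = ("https://www.kickstarter.com/discover/advanced"
        "?woe_id=2347575&sort=magic&seed=2600008&page={}")

for page in range(1, 3):  # widen the range to walk more listing pages
    listing = BeautifulSoup(requests.get(BASE.format(page)).text, "html.parser")
    # Class name taken from the question; it may no longer match the live site.
    for card in listing.select("div.js-react-proj-card"):
        link = card.find("a", href=True)  # assumption: first anchor is the project link
        if link is None:
            continue
        project_url = urljoin("https://www.kickstarter.com/", link["href"]).split("?")[0]
        project = BeautifulSoup(requests.get(project_url).text, "html.parser")
        # The goal, pledged amount, backers and campaign length live in markup
        # that changes often, so extract them from `project` with selectors
        # checked in the browser's DevTools.
        print(project_url)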

Beautiful Soup having trouble parsing output from mechanize

def return_with_soup(url):
    #uses mechanize to tell the browser we aren't a bot
    #and to retrieve webpage
    #returns a soupified webpage
    browser = mechanize.Browser() #I am made of human
    browser.set_handle_robots(False) #no bots here, no sir
    browser.open(url)
    #print browser.response().read()
    soup = BeautifulSoup(browser.response().read()) #this is where it breaks
    return soup
It throws this error in reference to the second-to-last line: "TypeError: 'module' object is not callable".
What's going on exactly?
BeautifulSoup was imported as a module, so the name BeautifulSoup refers to the module rather than the class. I had to change that line to:
BeautifulSoup.BeautifulSoup(...)
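A minimal illustration of the two import styles, assuming the legacy BeautifulSoup 3 package (the one installed as a top-level BeautifulSoup module); with bs4 the equivalent class import would be from bs4 import BeautifulSoup.

# Option 1 (what the answer does): import the module and call the class through it.
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup("<p>hello</p>")

# Option 2: import the class itself, so BeautifulSoup(...) is directly callable.
# from BeautifulSoup import BeautifulSoup
# soup = BeautifulSoup("<p>hello</p>")

print(soup.p.string)  # -> hello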
