beautifulsoup href returns empty string - web-scraping

I'm sure this is an easy one, but somehow I've been stuck trying to get the href link under the a tag that leads to each of the product detail pages. I don't see any JavaScript wrapped around it either. What am I missing?
import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd
urls = [
'https://undefeated.com/search?type=product&q=nike'
]
final = []
with requests.Session() as s:
    for url in urls:
        driver = webdriver.Chrome('/Users/Documents/python/Selenium/bin/chromedriver')
        driver.get(url)
        products = [element for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='product-grid-item ']")))]
        soup = bs(driver.page_source, 'lxml')
        time.sleep(1)
        href = soup.find_all['href']
        print(href)
output:
[]
I then tried soup.find_all('a') and it did spit out a whole bunch of elements, including the href I am looking for, but I still cannot extract only the href...

You just have to find_all the a tags and then print the href attribute of each one.
Your requests.Session code should look like this:
with requests.Session() as s:
    for url in urls:
        driver = webdriver.Firefox()
        driver.get(url)
        products = [element for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='product-grid-item ']")))]
        soup = bs(driver.page_source, 'lxml')
        time.sleep(1)
        a_links = soup.find_all('a')
        for a in a_links:
            print(a.get('href'))
Then all the links will be printed.
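If you only want the product detail links rather than every anchor on the page, one option is to restrict the search to the product grid containers the question already waits for. A minimal sketch, assuming each product link is an a tag nested inside a div with the product-grid-item class (the class name is taken from the question's XPath and may differ on the live site):
soup = bs(driver.page_source, 'lxml')
product_links = []
for item in soup.find_all('div', class_='product-grid-item'):
    a = item.find('a', href=True)  # first anchor with an href inside the product card
    if a:
        product_links.append(a['href'])  # often a relative path such as /products/...
print(product_links)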

Related

Beautifulsoup requests.get() redirecting from mentioned url

I use the following code to scrape a specific page:
from bs4 import BeautifulSoup
import requests
url = "https://www.mychoize.com/self-drive-car-rentals-pune/cars"
page = requests.get(url)
print(page.history)
for resp in page.history:
    print(resp.status_code, resp.url)
soup = BeautifulSoup(page.content, 'html.parser')
lists = soup.find_all('div', class_ = "product-box")
for list in lists:
    title = list.find('h3', class_ = "margin-o ng-binding")
    #print(title)
But it keeps scraping the homepage ('https://www.mychoize.com').
In order to stop it from redirecting to the homepage, I tried the following code to explore the response history:
from bs4 import BeautifulSoup
import requests
url = "https://www.mychoize.com/self-drive-car-rentals-pune/cars"
page = requests.get(url ,allow_redirects=True)
print(page.history)
for resp in page.history:
    print(resp.status_code, resp.url)
soup = BeautifulSoup(page.content, 'html.parser')
lists = soup.find_all('div', class_ = "product-box")
for list in lists:
    title = list.find('h3', class_ = "margin-o ng-binding")
    #print(title)
I obtained the following output
[<Response [302]>, <Response [301]>]
302 https://www.mychoize.com/self-drive-car-rentals-pune/cars
301 http://www.mychoize.com/
How do I prevent it from redirecting?
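The question is left open above, but two things are worth checking. allow_redirects=True is already the default in requests, so it changes nothing; passing allow_redirects=False keeps the 302 response so you can inspect where it points. The redirect itself is usually triggered server-side (missing browser-like headers or cookies), so sending a User-Agent is worth trying as well. A minimal sketch; the header value is an illustrative assumption, not something confirmed to unblock this particular site:
from bs4 import BeautifulSoup
import requests

url = "https://www.mychoize.com/self-drive-car-rentals-pune/cars"
headers = {"User-Agent": "Mozilla/5.0"}  # assumption: a browser-like UA may avoid the redirect

# allow_redirects=False returns the 302 itself instead of following it
page = requests.get(url, headers=headers, allow_redirects=False)
print(page.status_code, page.headers.get("Location"))

if page.status_code == 200:  # only parse if we got the page we asked for
    soup = BeautifulSoup(page.content, "html.parser")
    for box in soup.find_all("div", class_="product-box"):
        print(box.find("h3"))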

How to scrape multiple URLs and print them in an individual text file?

I've been trying to learn bs4 for the past few days. I successfully scraped a page and wrote it to a text file, so I tried to scrape multiple pages. The results print successfully in the terminal, but when I try to write them to text files, only the last file gets saved and the rest are not written. Since I'm new to coding I can't figure out the actual reason.
import bs4
import requests
from fake_useragent import UserAgent
import io
urls = ['https://en.m.wikipedia.org/wiki/Grove_(nature)','https://en.wikipedia.org/wiki/Azadirachta_indica','https://en.wikipedia.org/wiki/Olive']
user_agent = UserAgent()
for url in urls:
    page = requests.get(url, headers={"user-agent": user_agent.chrome})
    tree = bs4.BeautifulSoup(page.text, 'html.parser')
    title = tree.find('title').get_text()
    text = tree.find_all('p')[1].get_text()
    name = title + '.txt'
with io.open(name, "w", encoding="utf-8") as text_file:
    text_file.write(text)
print('files are ready')
You create the file outside the loop. Put the with statement in the for-loop like this:
import bs4
import requests
from fake_useragent import UserAgent
import io
urls = ['https://en.m.wikipedia.org/wiki/Grove_(nature)','https://en.wikipedia.org/wiki/Azadirachta_indica','https://en.wikipedia.org/wiki/Olive']
user_agent = UserAgent()
for url in urls:
    page = requests.get(url, headers={"user-agent": user_agent.chrome})
    tree = bs4.BeautifulSoup(page.text, 'html.parser')
    title = tree.find('title').get_text()
    text = tree.find_all('p')[1].get_text()
    name = title + '.txt'
    with io.open(name, "w", encoding="utf-8") as text_file:
        text_file.write(text)
print('files are ready')
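One extra point worth noting: the page title is used directly as the file name, and titles can contain characters that are invalid in file names on some systems (slashes, colons, quotes). A small sketch of a sanitizing helper; the helper name and the character set are my own choices, not part of the original answer:
import re

def safe_filename(title):
    # replace characters that commonly break file names with underscores
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()

name = safe_filename(title) + '.txt'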

BeautifulSoup.text returns blank string in VSCode, but works fine in Google Colab

I am trying to scrape this website https://understat.com/league/EPL.
Once I have parsed the page:
import json
from bs4 import BeautifulSoup
from urllib.request import urlopen
scrape_urlEPL="https://understat.com/league/EPL"
page_connect=urlopen(scrape_urlEPL)
page_html=BeautifulSoup(page_connect, "html.parser")
Then I search for "script" in the html.
page_html.findAll(name="script")
This gives me a list of all occurrences of "script". Say I want to extract the text from the element at index 3. Just printing the HTML for this element shows valid output.
page_html.findAll(name="script")[3]
The output:
<script>
var playersData = JSON.parse('\x5B\x7B\x22id\x22\x3A\x221389\x22,\x22player_name\x22\x3A\x22Jorginho\x22,\x22games\x22\x3A\x2228\x22,\x22time\x22\x3A\x222022\x22,\x22goals\x22\x3A\x227\x22,\x22xG\x22\x3A\x226.972690678201616\x22,\x22assists\x22\x3A\x221\x22,\x22xA\x22\x3A\x221.954869382083416\x22,\x22shots\x22\x3A\x2214\x22,\x22key_passes\x22\x3A\x2224\x22,\x22yellow_cards\x22\x3A\x222\x22,\x22red_cards\x22\x3A\x220\x22,\x22position\x22\x3A\x22M\x20S\x22,\x2....
Now if I want to extract the text from this,
page_html.findAll(name="script")[3].text
This gives an empty string ''.
However the same code works fine in Google Colab and returns:
'\n\tvar playersData\t= JSON.parse('\\x5B\\x7B\\x22id\\x22\\x3A\\x22647\\x22,\\x22player_name\\x22\\x3A\\x22Harry\\x20Kane\\x22,\\x22games\\x22\\x3A\\x2235\\x22,\\x22time\\x22\\x3A\\x223097\\x22,\\x22goals\\x22\\x3A\\x2223\\x22,\\x22xG\\x22\\x3A\\x2222.174858909100294\\x22,\\x22assists\\x22\\x3A\\x2214\\x22,\\x22xA\\x22\\x3A\\x227.577093588188291\\x22,\\x22shots\\x22\\x3A\\x22138\\x22,\\x22key_passes\\x22\\x3A\\x2249...'
which is as expected. I don't understand why this error comes up in VSCode.
Be aware that the script tag holds a string, which is not the same as text in BeautifulSoup's sense.
JSON.parse is a JavaScript function that parses a string.
You have to use .string instead of .text:
import httpx
import trio
from bs4 import BeautifulSoup

async def main():
    async with httpx.AsyncClient(timeout=None) as client:
        r = await client.get('https://understat.com/league/EPL')
        soup = BeautifulSoup(r.text, 'lxml')
        goal = soup.select('script')[3].string
        print(goal)

if __name__ == "__main__":
    trio.run(main)
Ref : Bs4 difference between string and text
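If the goal is to actually use the playersData payload, the script string still contains hex-escaped JSON inside a JSON.parse('...') call. A minimal sketch of one way to decode it; the index [3], the regex, and the unicode_escape decoding are assumptions based on the snippet shown above, not a documented API of the site:
import json
import re
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://understat.com/league/EPL').text, 'lxml')
script_text = soup.select('script')[3].string
match = re.search(r"JSON\.parse\('(.*?)'\)", script_text)
if match:
    # turn the \xNN escapes back into real characters, then load the JSON
    raw = match.group(1).encode('utf-8').decode('unicode_escape')
    players = json.loads(raw)
    print(players[0]['player_name'])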

BeautifulSoup isn't returning a url when we query for the src of the img tag

from bs4 import BeautifulSoup
from urllib import request
url = "https://amazon-asin.com/asincheck/?product_id=B000JMLBHU"
req = request.urlopen(url)
soap = BeautifulSoup(req,'html.parser')
soap.find('img',{'class':'resp-img'})['ng-src']
I'm using ng-src because, with only 'src', it returns nothing. But, with ng-src, it returns this:
'{{data.product_details.image_url}}'
Why doesn't it return the URL? How can I scrape the URL of this image?
Try this:
from selenium import webdriver
driver = webdriver.Firefox(executable_path='c:program/geckodriver')
url = "https://amazon-asin.com/asincheck/?product_id=B000JMLBHU"
driver.get(url)
driver.implicitly_wait(10)
print(driver.find_element_by_css_selector('img.resp-img').get_attribute('ng-src'))
driver.close()
Prints:
https://m.media-amazon.com/images/I/51sPuWd2JbL.jpg
Note you need Selenium and geckodriver, and in this code geckodriver is expected to be at c:/program/geckodriver.exe.
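The reason requests/urlopen only sees '{{data.product_details.image_url}}' is that the page is an Angular app: the src is filled in client-side after the data loads, so you need a rendered DOM. Also, find_element_by_css_selector was removed in Selenium 4; a sketch of the current idiom, assuming Selenium 4 (which can locate geckodriver automatically) and an explicit wait:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()  # Selenium 4.6+ can find geckodriver on PATH via Selenium Manager
driver.get("https://amazon-asin.com/asincheck/?product_id=B000JMLBHU")
img = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "img.resp-img"))
)
# once Angular has rendered, src should hold the real URL; fall back to ng-src otherwise
print(img.get_attribute("src") or img.get_attribute("ng-src"))
driver.quit()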

Why is this CSS selector returning no results?

I am following along with a web scraping example in Automate the Boring Stuff with Python, but my CSS selector is returning no results.
import bs4
import requests
import sys
import webbrowser
print("Googling ...")
res = requests.get('https://www.google.com/search?q=' + ' '.join(sys.argv[1:]))
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
linkelems = soup.find_all(".r a")
numopen = min(5, len(linkelems))
for i in range(numopen):
    webbrowser.open('https://google.com' + linkelems[i].get('href'))
Has Google since modified how they store search links?
From inspecting the search page elements, I see no reason this selector would not work.
There are two problems:
1.) Instead of soup.find_all(".r a"), use soup.select(".r a"); only the .select() method accepts CSS selectors.
2.) Google needs you to specify a User-Agent header to return the correct page.
import bs4
import sys
import requests
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
print("Googling ...")
res = requests.get('https://www.google.com/search?q=' + ' '.join(sys.argv[1:]), headers=headers)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
linkelems = soup.select(".r a")
for a in linkelems:
    print(a.text)
Prints (for example):
Googling ...
Tree - Wikipediaen.wikipedia.org › wiki › Tree
... and so on.
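If the goal (as in the original script) is to open the top results rather than print their titles, the href can be pulled from the same elements. A short sketch that slots back into the question's script, assuming .select(".r a") still matches anything (Google changes its result markup frequently, so the .r class itself may need updating):
import webbrowser

numopen = min(5, len(linkelems))
for i in range(numopen):
    href = linkelems[i].get('href')
    if href:  # some anchors inside a result block carry no href
        webbrowser.open('https://google.com' + href)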
A complementary answer to Andrej Kesely's answer.
If you don't want to deal with figuring out which selectors to use or how to bypass blocks from Google, you can try the Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that bypassing blocks, data extraction, and more are already handled for the end user. All that needs to be done is to iterate over the structured JSON and get the data you want.
Example code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google", # search engine
    "q": "fus ro dah", # query
    "api_key": os.getenv("API_KEY"), # environment variable with your API-KEY
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['organic_results']:
    link = result['link']
    print(link)
------------
'''
https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)
https://knowyourmeme.com/memes/fus-ro-dah
https://en.uesp.net/wiki/Skyrim:Unrelenting_Force
https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah
https://www.nexusmods.com/skyrimspecialedition/mods/4889/
https://www.nexusmods.com/skyrimspecialedition/mods/14094/
https://tenor.com/search/fus-ro-dah-gifs
'''
Disclaimer, I work for SerpApi.
