I am parsing a certain webpage with Beautiful Soup, trying to retrieve all links that are inside an h3 tag:
page = requests.get('https://www....')
soup = BeautifulSoup(page.text, "html.parser")
links = []
for item in soup.find_all('h3'):
    links.append(item.a['href'])
However, the links found are different from the links present in the page. For example, when the link http://www.estense.com/?p=116872 is present in the page, Beautiful Soup returns http://www.estense.com/%3Fp%3D116872, replacing '?' with '%3F' and '=' with '%3D'. Why is that?
Thanks.
You can unquote the URL using urllib.parse:
from urllib import parse
parse.unquote(item.a['href'])
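For example, a minimal sketch using the URL from the question:

```python
from urllib import parse

# A percent-encoded href like the one Beautiful Soup returned
encoded = "http://www.estense.com/%3Fp%3D116872"

# unquote() decodes %3F back to '?' and %3D back to '='
decoded = parse.unquote(encoded)
print(decoded)  # http://www.estense.com/?p=116872
```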
Related
I have a piece of code that I made in Google Colab that essentially just scrapes a piece of data from a website, shown below:
#imports
#<div class="sc-aef7b723-0 dDQUel priceTitle">
#<div class="priceValue ">
from bs4 import BeautifulSoup
import requests
import time
url = 'https://coinmarketcap.com/currencies/index-cooperative/'
HTML = requests.get(url)
soup = BeautifulSoup(HTML.text, 'html.parser')
text = soup.find('div', attrs={'class':'sc-aef7b723-0 dDQUel priceTitle'}).find('div', attrs={'class':'priceValue '}).text
print(text)
I need this to run as a py file on my computer, but when it runs as a py file, I get the error:
text = soup.find('div', attrs={'class':'sc-aef7b723-0 dDQUel priceTitle'}).find('div', attrs={'class':'priceValue '}).text
AttributeError: 'NoneType' object has no attribute 'text'
I was wondering why this happened as it is the exact same code. All of my packages are at the most recent version as well.
You just need to remove the trailing space in the class name: attrs={'class': 'priceValue'}. When you run the specified web page through html.parser, it corrects the HTML in some ways.
In this case it removes the trailing space that is present on the web page, because it doesn't really make sense to have a trailing space in a class name. Spaces are only needed when an element has more than one class.
So the parsed web page that you store in your soup variable has that div looking like this: <div class="priceValue"><span>$1.74</span></div>. And the soup.find function does care about trailing spaces, so it couldn't match the class 'priceValue ' against 'priceValue'.
To match the class regardless of trailing or leading whitespace, you could have used the soup.select function, which uses CSS selectors to match elements and doesn't care about spaces. You could have found an element of that class like this (with any amount of trailing and/or leading whitespace):
css_selected_value = soup.select("[class= priceValue ]")[0].text
print(css_selected_value)
That being said, I'm not sure why your code works properly on Google Colab; I've never tried it. Maybe I'll dig into it later.
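The normalization described above can be checked with a small, self-contained snippet; the markup below is a hypothetical stand-in for the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's markup, trailing space included
html = '<div class="priceValue "><span>$1.74</span></div>'
soup = BeautifulSoup(html, 'html.parser')

# The parser tokenizes the class attribute, so the trailing space is gone
print(soup.div['class'])  # ['priceValue']

# A match without the space succeeds; with the space it finds nothing
print(soup.find('div', attrs={'class': 'priceValue'}).text)  # $1.74
print(soup.find('div', attrs={'class': 'priceValue '}))      # None
```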
So I am trying to scrape the country names from the table on the website https://www.theguardian.com/world/2020/oct/25/covid-world-map-countries-most-coronavirus-cases-deaths as a list. But when I print it out, it just gives me an empty list instead of a list containing the country names. Could anybody explain why I am getting this? The code is below:
import requests
from bs4 import BeautifulSoup
webpage = requests.get("https://www.theguardian.com/world/2020/oct/25/covid-world-map-countries-most-coronavirus-cases-deaths")
soup = BeautifulSoup(webpage.content, "html.parser")
countries = soup.find_all("div", attrs={"class": 'gv-cell gv-country-name'})
print(countries)
list_of_countries = []
for country in countries:
list_of_countries.append(country.get_text())
print(list_of_countries)
This is the output I am getting:
[]
[]
Also, not only here: I was getting the same result (an empty list) when I was trying to scrape a product's information from Amazon's website.
The list is dynamically retrieved from another endpoint, which you can find in the network tab of your browser's dev tools and which returns JSON. Something like the following should work:
import requests
r = requests.get('https://interactive.guim.co.uk/2020/coronavirus-central-data/latest.json').json() #may need to add headers
countries = [i['attributes']['Country_Region'] for i in r['features']]
I'm having trouble selecting a table (or actually anything) from the specified URL. I used the find_all method to no avail. There are multiple tables on the webpage separated by <h3> tags (the image below shows a table with the <h3> tag 'Basic').
import requests
from bs4 import BeautifulSoup
res = requests.get('Example_URL')
soup = BeautifulSoup(res.text,'html.parser')
table = soup.find_all('table')[0]
print(table)
I expect to be able to select the table and then print its contents to the console, but instead soup.find_all('table')[0] raises IndexError: list index out of range.
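The IndexError means find_all('table') returned an empty list, which usually indicates the tables are not present in the raw HTML that requests downloads (for example, because they are built by JavaScript). A minimal sketch of the diagnosis, using a hypothetical static string as a stand-in for res.text:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for res.text; if the real page builds its tables
# with JavaScript, the HTML that requests downloads looks like this
html = '<h3>Basic</h3><p>table rendered later by JavaScript</p>'
soup = BeautifulSoup(html, 'html.parser')

tables = soup.find_all('table')
print(tables)  # [] -- so tables[0] raises IndexError

# Guard against the empty result before indexing
if tables:
    print(tables[0])
else:
    print('no <table> in the raw HTML')
```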
I am trying to extract the "Four Factors" table from the following URL, https://www.basketball-reference.com/boxscores/201810160GSW.html. When I use the findAll method from the BeautifulSoup library to search for tables, I do not see that table, nor do I see the "Line Score" table. I am only concerned with the "Four Factors" table, but I figured the note about the "Line Score" table could be useful information.
import requests
import bs4

URL2 = 'https://www.basketball-reference.com/boxscores/201810160GSW.html'
page2 = requests.get(URL2)
page2 = page2.text
soup2 = bs4.BeautifulSoup(page2, 'html.parser')
content = soup2.findAll('table')
If you look at content, you can find the other 4 tables on the page, but the "Four Factors" and "Line Score" do not show up there. In addition to helping me extract the "Four Factors" table, can you explain why it doesn't show up in content?
The table comes out in one of the HTML comments, which is why you weren't finding it, I think.
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd

r = requests.get('https://www.basketball-reference.com/boxscores/201810160GSW.html')
soup = BeautifulSoup(r.text, 'lxml')

# Collect every comment node, then re-parse the one holding the table
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    if 'id="four_factors"' in comment:
        soup = BeautifulSoup(comment, 'lxml')
        break

table = soup.select_one('#four_factors')
df = pd.read_html(str(table))[0].fillna('')
print(df)
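The same comment-extraction trick can be seen on a tiny, self-contained example; the markup below is a hypothetical stand-in for the page, which ships some of its tables inside HTML comments:

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in: a table wrapped in an HTML comment
html = '<div><!-- <table id="four_factors"><tr><td>0.25</td></tr></table> --></div>'
soup = BeautifulSoup(html, 'html.parser')

# The table is invisible to a normal tag search ...
print(soup.find_all('table'))  # []

# ... but the comment's text can be re-parsed as HTML
comment = soup.find(string=lambda text: isinstance(text, Comment))
inner = BeautifulSoup(comment, 'html.parser')
print(inner.select_one('#four_factors').td.text)  # 0.25
```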
I am trying to get all the meanings under the "noun" heading for the word the user enters.
This is my code for now:
import requests
from bs4 import BeautifulSoup

word = raw_input("Enter word: ").lower()
url = 'http://www.dictionary.com/browse/' + word
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
try:
    meaning = soup.find("div", attrs={"class": "def-content"}).get_text()
    print "Meaning of", word, "is: "
    print meaning
except AttributeError:
    print "Sorry, we were not able to find the word."
finally:
    print "Thank you for using our dictionary."
Now suppose the user enters the word "today" and my output will be:
this present day: Today is beautiful.
I don't understand why it leaves so many spaces and why the part
"Today is beautiful"
doesn't come down onto its own line.
Anyway, when you look up that word on this site, you can see there are 2 meanings, yet my program only shows one.
I want the output to be:
1.this present day:
Today is beautiful.
2.
this present time or age:
the world of today.
Can anyone explain what's wrong and how I can fix it?
I have no idea what's wrong, so please don't think I didn't try.
You are only getting the first noun meaning with the above code.
I have rewritten the code as below:
from bs4 import BeautifulSoup
import requests

word = raw_input("Enter word: ").lower()
url = 'http://www.dictionary.com/browse/' + word
r = requests.get(url)
bsObj = BeautifulSoup(r.content, "lxml")

# The first section with this class contains the complete noun data
nouns = bsObj.find("section", {"class": "def-pbk ce-spot"})
data = nouns.findAll('div', {'class': 'def-content'})

count = 1
for item in data:
    # Collapse internal runs of whitespace into single spaces
    temp = ' '.join(item.get_text().strip().split())
    print str(count) + '. ' + temp
    count += 1
Explanation:
Yes. Assuming the website shows the noun meanings first, I retrieve the first section, which contains the complete noun data. Then I find all the meanings under that section, store them in the data variable, iterate over them in a loop, and fetch the text of each meaning. Finally, to remove all the extra spaces, I split the fetched text and join it back together with single spaces, adding a number at the beginning.
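The whitespace normalization step can be illustrated on its own; the sample string is hypothetical:

```python
# The raw get_text() output keeps the page's internal whitespace
text = '  this present day:\n      Today is beautiful.  '

# strip() trims the ends, split() breaks on any run of whitespace,
# and join() reassembles the words with single spaces
normalized = ' '.join(text.strip().split())
print(normalized)  # this present day: Today is beautiful.
```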
try:
    meaning = soup.find(attrs={"class": "def-pbk ce-spot"}).get_text(separator="\n", strip=True)
You can strip the whitespace from the text by passing strip=True to get_text().
The reason why you didn't get all the text is that your selector was wrong; you should make its scope bigger.
I added separator='\n' to get_text() to format the output.
If you have any questions, you can read the BeautifulSoup documentation.
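Both options can be seen together in a small sketch; the markup below is a hypothetical stand-in for the dictionary page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like a dictionary entry with two meanings
html = '''<section class="def-pbk ce-spot">
  <div class="def-content"> this present day: <em>Today is beautiful.</em> </div>
  <div class="def-content"> this present time or age: <em>the world of today.</em> </div>
</section>'''
soup = BeautifulSoup(html, 'html.parser')

# strip=True trims each text fragment and drops whitespace-only ones;
# separator controls what joins the fragments back together
print(soup.get_text(separator='\n', strip=True))
```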